Custom Kubernetes Installation Journey

If you are reading this post, I assume that you either actively use Kubernetes, are exploring it, or just want to learn more about it. Whichever it is, I am going to share my experience with Kubernetes: what I have learned along the way, and the journey I went through to install Kubernetes for a production use case in one of my previous projects at work.

When I started to look into Kubernetes, I learned about it by reading its documentation, digging into its internal components, building some POCs, and playing with it on EKS, the managed Kubernetes service on AWS. After a couple of months, it was finally time to think about production: we needed a way to automate the entire deployment, so that Kubernetes could be provisioned and bootstrapped in a one-click fashion. One of our requirements was to be able to deploy the Kubernetes control plane either in single-node mode for smaller customers or in HA mode for larger customers. The other requirement was to be able to deploy Kubernetes on any public or private cloud provider. Because of the second requirement, we couldn't use any of the managed Kubernetes services such as EKS, and we had to deploy both the control plane and the data plane ourselves.
We tried different methods, such as Ansible and Terraform, kubeadm, and a couple of others, each with its own pros and cons, until we finally found the approach that worked best for us based on our use case and requirements.


Using Hive and Presto on Amazon EMR

Introduction

In this post, I am going to go over a simple project running on Amazon EMR. I am using the dataset “Baby Names from Social Security Card Applications In The US”, which holds data for 109 years (1910-2018). I transformed the data to make it compatible with this project and made it available on GitHub. I also converted the CSV files to Parquet format and used both formats to compare performance.


Hadoop HDFS

Introduction

The Hadoop Distributed File System (HDFS) is designed as a highly fault-tolerant file system that can be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large datasets.

HDFS is designed for batch processing rather than interactive use. It provides a write-once-read-many access model: once a file is created, written, and closed, it cannot be changed except for appending and removing. HDFS does not support hard links or soft links.
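
To make the write-once-read-many model described above more concrete, here is a minimal sketch in Python. It is only a toy in-memory model and not an HDFS API: the class `WriteOnceFile` and its methods are hypothetical names made up for illustration. It assumes that writes are allowed until the file is closed, and that after closing only appends and reads are allowed.

```python
class WriteOnceFile:
    """Toy in-memory model of a write-once-read-many file.

    Hypothetical illustration only; this is not how HDFS is implemented
    and none of these names come from an HDFS client library.
    """

    def __init__(self, name: str) -> None:
        self.name = name
        self._data = bytearray()
        self._closed = False

    def write(self, chunk: bytes) -> None:
        # Regular writes are only allowed while the file is still open.
        if self._closed:
            raise PermissionError(f"{self.name} is closed; only appends are allowed")
        self._data += chunk

    def close(self) -> None:
        # After closing, the existing content can no longer be rewritten.
        self._closed = True

    def append(self, chunk: bytes) -> None:
        # Appending adds bytes at the end and never touches existing content.
        if not self._closed:
            raise RuntimeError(f"{self.name} is still open; use write() instead")
        self._data += chunk

    def read(self) -> bytes:
        # Reads are allowed any number of times.
        return bytes(self._data)


f = WriteOnceFile("names.csv")
f.write(b"name,count\n")
f.close()
f.append(b"Emma,10\n")   # allowed: append after close
content = f.read()       # allowed: read many times

try:
    f.write(b"oops")     # rejected: the file was already written and closed
    rewrite_rejected = False
except PermissionError:
    rewrite_rejected = True
```

In this sketch the restriction is enforced by a simple flag on the object, which is enough to show the access pattern: one writer produces the file, and everything afterwards is either a read or an append.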
