big-data – Deep Dive Tech Blog

February 18, 2020 / February 20, 2020

Using Hive and Presto on Amazon EMR

Introduction

In this post, I walk through a simple project running on Amazon EMR. It uses the dataset “Baby Names from Social Security Card Applications In The US”, which covers 109 years (1910-2018). I transformed the data to make it compatible with this project and made it available on GitHub. I also converted the CSV files to Parquet format and ran the same queries against both formats to compare performance.

February 7, 2020 / April 2, 2021

Hadoop Data Formats: a deep dive in Avro, ORC, and Parquet

If you have ever worked with an application in the Hadoop ecosystem, you have probably used, or at least heard of, one of these file formats.

April 14, 2019 / February 7, 2020

Hadoop HDFS

Introduction

The Hadoop Distributed File System (HDFS) is designed as a highly fault-tolerant file system that can be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large datasets.

HDFS is designed for batch processing rather than interactive use. It provides a write-once-read-many access model: once a file is created, written, and closed, it cannot be changed except by appending to it or removing it. HDFS does not support hard links or soft links.
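To make the write-once-read-many model concrete, here is a minimal Python sketch. It is only a toy in-memory model, not the HDFS client API: the class name `WriteOnceFile` and its methods are hypothetical names chosen for illustration.

```python
class WriteOnceFile:
    """Toy in-memory model of a write-once-read-many file.

    Hypothetical illustration only; this is not the HDFS API.
    """

    def __init__(self) -> None:
        self._data = bytearray()
        self._closed = False

    def write(self, chunk: bytes) -> None:
        # The initial write is only allowed until the file is closed.
        if self._closed:
            raise PermissionError("file is closed; only append is allowed")
        self._data.extend(chunk)

    def close(self) -> None:
        # After closing, the existing bytes can no longer be rewritten.
        self._closed = True

    def append(self, chunk: bytes) -> None:
        # Appending adds bytes at the end of an already closed file.
        if not self._closed:
            raise PermissionError("file is still open for its initial write")
        self._data.extend(chunk)

    def overwrite(self, offset: int, chunk: bytes) -> None:
        # In-place modification is never permitted in this model.
        raise PermissionError("in-place updates are not supported")

    def read(self) -> bytes:
        # Reads are allowed any number of times.
        return bytes(self._data)


# Write once, close, then append; the existing bytes stay untouched.
f = WriteOnceFile()
f.write(b"first record\n")
f.close()
f.append(b"second record\n")
```

In this sketch, any attempt to call `overwrite` raises `PermissionError`, while `read` can be called repeatedly, which mirrors the batch-oriented access pattern described above.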
