If you have worked with any application in the Hadoop ecosystem, you have probably used, or at least heard of, Avro, ORC, or Parquet.
But what are these file formats? How do they differ, what do they have in common, and when should you pick one over another? In this post, I go over the answers to these questions.
In general, choosing an appropriate file format can have benefits for:
- Write latency
- Read latency
- Schema evolution
- Storage cost
Before we dive into these file formats, note that all three exist to make storage and processing more efficient. All three support some form of compression and are optimized for storage on Hadoop. They are also machine-readable binary formats: unlike a JSON or YAML file, you cannot simply open one in a text editor and read it without special tools. They are suitable for parallel processing as well, since the files can be split and stored across multiple disks, which also makes them highly scalable.
Avro, ORC, and Parquet are all self-describing, which means each file carries its own schema. This lets us move a file from one machine to another and load it there without shipping the schema separately.
Apache Avro is a highly splittable, row-based file format. Because each record is written as one unit, Avro is well suited to write-heavy, transactional workloads.
Avro is often used as a language-agnostic serialization platform. The Avro schema is stored as JSON, which is easy to parse in any language and quick to modify. The actual data, on the other hand, is stored in a binary format for efficiency.
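To make the idea of a JSON schema travelling together with binary data concrete, here is a minimal sketch of a toy self-describing file. The layout (a length prefix, a JSON header, then fixed-size rows), the type names, and the helper functions are all assumptions made up for illustration. This is not the real Avro container format.

```python
import io
import json
import struct

# Hypothetical toy layout (NOT the real Avro container format):
#   4-byte little-endian header length | JSON schema | fixed-size binary rows
SCHEMA = {
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "score", "type": "double"},
    ],
}
# Map the toy type names to struct format codes.
STRUCT_CODES = {"int": "i", "double": "d"}


def row_format(schema):
    """Build the struct format string for one row from the schema."""
    return "<" + "".join(STRUCT_CODES[f["type"]] for f in schema["fields"])


def write_file(stream, schema, rows):
    """Write the JSON schema header followed by the binary rows."""
    header = json.dumps(schema).encode("utf-8")
    stream.write(struct.pack("<I", len(header)))
    stream.write(header)
    fmt = row_format(schema)
    for row in rows:
        values = [row[f["name"]] for f in schema["fields"]]
        stream.write(struct.pack(fmt, *values))


def read_file(stream):
    """Recover the schema from the file itself, then decode the rows with it."""
    (header_len,) = struct.unpack("<I", stream.read(4))
    schema = json.loads(stream.read(header_len).decode("utf-8"))
    fmt = row_format(schema)
    names = [f["name"] for f in schema["fields"]]
    size = struct.calcsize(fmt)
    rows = []
    while chunk := stream.read(size):
        rows.append(dict(zip(names, struct.unpack(fmt, chunk))))
    return schema, rows


buffer = io.BytesIO()
write_file(buffer, SCHEMA, [{"id": 1, "score": 9.5}, {"id": 2, "score": 7.25}])
buffer.seek(0)
schema, rows = read_file(buffer)
print(schema["name"], rows)
```

The reader never receives the schema as an argument. It finds it in the file, which is what makes a format self-describing.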
One of the best features of Avro is its support for schema evolution. Avro can handle changes such as missing, added, and modified fields. It also supports a rich set of data structures.
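Schema evolution can be pictured as reading old records through a newer schema. The sketch below is a simplified illustration in plain Python, not the Avro library: the `resolve` function and the two example schemas are hypothetical. It follows the general rule that a field missing from the written data is filled from the reader's default, and a field the reader no longer knows is dropped.

```python
def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field exists in both schemas: keep the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added later: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields that only the writer knew about are silently dropped.
    return resolved


# Hypothetical schemas: version 2 drops "nickname" and adds "email".
writer_v1 = {"fields": [
    {"name": "id", "type": "int"},
    {"name": "nickname", "type": "string"},
]}
reader_v2 = {"fields": [
    {"name": "id", "type": "int"},
    {"name": "email", "type": "string", "default": ""},
]}

old_record = {"id": 7, "nickname": "kay"}
new_view = resolve(old_record, writer_v1, reader_v2)
print(new_view)  # {'id': 7, 'email': ''}
```

An added field without a default cannot be resolved against old data, which is why new fields are usually given one.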
Apache Parquet is a columnar file format, which makes it well suited to OLAP and analytical workloads. Columnar formats also compress better than row-based ones. This is because all the data in one column is of one type (homogeneous), which makes compression algorithms far more effective. The space savings become very noticeable at the scale of a Hadoop cluster. Columnar storage is also very efficient for queries that only need the data of one or a few columns.
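Both effects can be observed on a small scale. The sketch below is an illustration only: it uses synthetic data, plain comma-separated text, and general-purpose zlib compression rather than Parquet's actual encodings, and every name in it is made up for the example. It stores the same records once row by row and once column by column, then compares the compressed sizes and the number of bytes a single-column query has to touch.

```python
import random
import zlib

rng = random.Random(42)  # fixed seed so the run is repeatable
COUNTRIES = ["US", "DE", "FR", "IN", "BR"]
rows = [
    (i, rng.choice(COUNTRIES), rng.randint(100_000, 999_999))
    for i in range(10_000)
]

# Row layout: the fields of each record are stored next to each other.
row_bytes = "".join(
    f"{i},{country},{amount}\n" for i, country, amount in rows
).encode()

# Columnar layout: all values of one column are stored next to each other.
columns = {
    "id": ",".join(str(r[0]) for r in rows).encode(),
    "country": ",".join(r[1] for r in rows).encode(),
    "amount": ",".join(str(r[2]) for r in rows).encode(),
}
column_bytes = b"\n".join(columns.values())

row_compressed = len(zlib.compress(row_bytes))
column_compressed = len(zlib.compress(column_bytes))
print("raw bytes:       ", len(row_bytes), len(column_bytes))
print("compressed bytes:", row_compressed, column_compressed)

# A query such as SUM(amount) only needs one column's bytes in the columnar
# layout, but has to scan every record in the row layout.
total = sum(int(v) for v in columns["amount"].decode().split(","))
print("bytes read for SUM(amount):", len(columns["amount"]), "vs", len(row_bytes))
```

The raw sizes are nearly identical, so any difference after compression comes from the ordering of the values alone.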
One of the main features of Parquet is that it stores nested data structures in a columnar fashion too. This means that even nested fields can be accessed individually, without reading all the other fields in the nested structure.
Parquet’s biggest advantage is its efficiency in storing and processing nested data types.
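A rough way to picture nested columnar storage is to give every leaf field its own column, named by its path. The sketch below is a deliberately simplified illustration with a hypothetical `shred` helper. It assumes every record has the same shape and no lists, whereas Parquet itself also records repetition and definition levels so that optional and repeated fields can be reconstructed.

```python
from collections import defaultdict


def shred(records):
    """Flatten nested dicts into one value list per dotted column path."""
    columns = defaultdict(list)

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            # Leaf value: append it to the column named by its full path.
            columns[path].append(value)

    for record in records:
        walk(record, "")
    return dict(columns)


records = [
    {"id": 1, "address": {"city": "Oslo", "geo": {"lat": 59.9, "lon": 10.8}}},
    {"id": 2, "address": {"city": "Lima", "geo": {"lat": -12.0, "lon": -77.0}}},
]
columns = shred(records)
print(sorted(columns))
print(columns["address.geo.lat"])  # [59.9, -12.0]
```

A query that only needs the latitude can read the `address.geo.lat` column and leave the rest of the `address` structure untouched.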
Parquet is commonly used with Apache Impala and Apache Drill, which are massively parallel processing (MPP) SQL-on-Hadoop query engines.
Optimized Row Columnar (ORC)
ORC is a self-describing, type-aware columnar file format designed for the Hadoop ecosystem. It is highly optimized for large streaming reads, with integrated support for finding required rows quickly.
ORC stores its metadata using Protocol Buffers, which allows fields to be added and removed. It stores collections of rows in one file, and within each collection the row data is stored in a columnar format.
An ORC file contains groups of row data called stripes, along with metadata in a file footer. The footer contains a list of the stripes in the file, the number of rows per stripe, and each column’s data type. It also contains column-level aggregates such as count, min, max, and sum.
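The min and max statistics are what allow a reader to find the required rows quickly: a stripe whose value range cannot satisfy a filter does not need to be read at all. The sketch below illustrates that idea in plain Python with hypothetical names (`StripeStats`, `build_stripes`, `find_greater_than`) and a single integer column. It does not reproduce ORC's actual file layout or reader.

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    """Per-stripe aggregates, similar in spirit to ORC's column statistics."""
    row_count: int
    min_value: int
    max_value: int
    total: int


def build_stripes(values, rows_per_stripe):
    """Split a column into stripes and record statistics for each one."""
    stripes = []
    for start in range(0, len(values), rows_per_stripe):
        chunk = values[start:start + rows_per_stripe]
        stats = StripeStats(len(chunk), min(chunk), max(chunk), sum(chunk))
        stripes.append((stats, chunk))
    return stripes


def find_greater_than(stripes, threshold):
    """Return matching values, skipping stripes whose max rules them out."""
    matches, stripes_read = [], 0
    for stats, chunk in stripes:
        if stats.max_value <= threshold:
            continue  # no row in this stripe can match, so skip it
        stripes_read += 1
        matches.extend(v for v in chunk if v > threshold)
    return matches, stripes_read


values = list(range(1, 1001))
stripes = build_stripes(values, rows_per_stripe=100)
matches, stripes_read = find_greater_than(stripes, threshold=950)
print(len(matches), "matching rows,", stripes_read, "of", len(stripes), "stripes read")
# 50 matching rows, 1 of 10 stripes read
```

The same statistics can answer some aggregate queries outright. Here, summing `stats.total` over all stripes gives the column sum without touching any row data.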
The default stripe size is 250 MB. Larger stripes enable large and efficient reads from HDFS. ORC is commonly used with Apache Hive and Presto.
As you can see, choosing between these three file formats depends heavily on the use case. Some applications are read-heavy and others are write-heavy, so the best option depends on your application and the data it handles.
If your application is OLAP-type, you can eliminate Avro, as you most probably want a columnar file format. On the other hand, if your application is OLTP-type, you might want to choose Avro.
Choosing the right file format is an important decision for your big data solution, so it is worth investigating and understanding your application and use case before you settle on one.