Friday, November 21, 2014

JSON, Avro, Trevni, and Parquet: how they are related

JSON – consider it an alternative to XML. It’s usually more compact, faster to parse, and easier to read. Just use it.


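Avro – a data serialization system that stores data together with the schema for that data. The schema is written in JSON; the data itself is normally stored in a compact binary encoding. Think of an Avro file as a container that holds loads of records, with the schema stored alongside them in the file header. In addition, as the Avro documentation puts it: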

“When Avro is used in RPC, the client and server exchange schemas in the connection handshake”.
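To make the "schema travels with the data" idea concrete, here is a minimal sketch using only the Python standard library. It is not the Avro library and does not read or write real Avro files; the `User` record, its fields, and the `conforms` helper are all hypothetical, and the check covers only a few primitive types.

```python
import json

# An Avro record schema is itself a JSON document.
# The record name "User" and its fields are made up for this sketch.
USER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")

# Toy mapping from a few Avro primitive types to Python types.
PY_TYPES = {"long": int, "string": str, "double": float, "null": type(None)}


def conforms(record, schema):
    """Return True if each schema field (or its default) has an allowed type."""
    for field in schema["fields"]:
        allowed = field["type"]
        if not isinstance(allowed, list):
            allowed = [allowed]  # treat a plain type as a one-branch union
        value = record.get(field["name"], field.get("default"))
        if type(value) not in {PY_TYPES[t] for t in allowed}:
            return False
    return True


print(conforms({"id": 1, "name": "Ann"}, USER_SCHEMA))
print(conforms({"id": "1", "name": "Ann"}, USER_SCHEMA))
```

A reader that receives the schema with the data can reject the second record, because its `id` is a string rather than a long.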


Trevni – a columnar storage format. Instead of writing out all the columns for a row and then moving on to the next row, Trevni writes out all the rows for a given column and then moves on to the next column. This means all the values of a column are stored sequentially, which allows for much faster BI-like reads that touch only a few columns.
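The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the idea with in-memory lists, not Trevni's actual file format; the table, its column names, and the two helper functions are hypothetical.

```python
# Three hypothetical rows with columns (id, city, amount).
rows = [
    (1, "Austin", 10.0),
    (2, "Boston", 12.5),
    (3, "Austin", 7.25),
]

# Row-oriented layout: all columns of row 1, then all columns of row 2, ...
row_layout = [value for row in rows for value in row]

# Column-oriented layout: all values of column 1, then all of column 2, ...
columns = list(zip(*rows))
column_layout = [value for column in columns for value in column]


def sum_amount_row_layout(flat, width=3, index=2):
    # Must stride across every row, skipping the unrelated columns.
    return sum(flat[i] for i in range(index, len(flat), width))


def sum_amount_column_layout(flat, n_rows=3, index=2):
    # Reads one contiguous slice holding only the wanted column.
    start = index * n_rows
    return sum(flat[start:start + n_rows])


print(row_layout)
print(column_layout)
print(sum_amount_row_layout(row_layout), sum_amount_column_layout(column_layout))
```

Both functions return the same total, but the columnar version reads a single contiguous run of values, which is why aggregate queries over a few columns benefit from this layout on disk.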


Parquet – Cloudera and Twitter took the ideas behind Trevni and improved on them. So, at least in the Cloudera distribution, you’ll see Parquet instead of Trevni. I suspect most BI-type systems on Hadoop will be using Parquet from now on.


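Really, JSON and Avro are not directly related to Trevni and Parquet: the first two describe and serialize records, while the last two are columnar layouts for storing them. However, Hive ships with Serializers/Deserializers (SerDes) for several of these formats, so it’s good to know all four.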


If you use Avro, you get strong typing while moving around lots of NoSQL data. Sqoop, for instance, can now write the tables it imports directly to Avro files: it generates the JSON schema for you based on the columns in the source tables.
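As a rough sketch of what "generating the schema from the columns" means, the Python below turns a list of column names and SQL types into an Avro record schema. The type mapping, the `table_to_avro_schema` helper, and the `orders` table are assumptions made for illustration; this is not Sqoop's code or its exact mapping.

```python
import json

# Illustrative mapping from SQL column types to Avro types.
# An assumption for this sketch, not Sqoop's actual mapping table.
SQL_TO_AVRO = {
    "INTEGER": "int",
    "BIGINT": "long",
    "VARCHAR": "string",
    "DOUBLE": "double",
    "BOOLEAN": "boolean",
}


def table_to_avro_schema(table_name, columns):
    """Build an Avro record schema (as a dict) from (name, sql_type) pairs."""
    fields = [
        # Each column becomes a nullable field: a union of null and its type.
        {"name": name, "type": ["null", SQL_TO_AVRO[sql_type]], "default": None}
        for name, sql_type in columns
    ]
    return {"type": "record", "name": table_name, "fields": fields}


# Hypothetical source table.
schema = table_to_avro_schema(
    "orders",
    [("id", "BIGINT"), ("customer", "VARCHAR"), ("total", "DOUBLE")],
)
print(json.dumps(schema, indent=2))
```

The printed JSON is the kind of document that would be stored with the data, so every downstream reader knows that `id` is a long and `total` is a double without consulting the source database.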


Avro:


http://ift.tt/1rGZ2oV


Trevni:


http://ift.tt/11Ee0nD


Parquet:


http://ift.tt/1lgvjVh




