Thursday, December 4, 2014

Setting up LZO compression on your Cloudera Hadoop cluster

LZO makes Impala faster because files compressed with LZO can be split up and assigned to different nodes (LZO files generally need to be indexed before they can be split). If you use GZIP compression and write a single large file to HDFS, it still gets spread across many HDFS blocks, but a GZIP stream can't be read starting from the middle. Since Hadoop and Impala are all about parallel processing, this makes things slower, because only a single node can process any given compressed file… unless it’s been compressed with LZO.
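To make the "can't be split" point concrete, here is a rough Python illustration (not anything Hadoop or Impala actually runs). It compresses some made-up sample rows into one gzip stream and then checks whether a decoder will accept the stream from the start versus from the middle. The helper names `gzip_bytes` and `can_decode_from` are invented for this sketch.

```python
import zlib

GZIP_WBITS = 31  # zlib convention: 16 + 15 selects the gzip container


def gzip_bytes(payload: bytes) -> bytes:
    """Compress payload into a single gzip stream."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, GZIP_WBITS)
    return compressor.compress(payload) + compressor.flush()


def can_decode_from(stream: bytes, offset: int) -> bool:
    """Return True if a gzip decoder accepts the stream starting at offset."""
    decoder = zlib.decompressobj(GZIP_WBITS)
    try:
        decoder.decompress(stream[offset:])
        return True
    except zlib.error:
        # No gzip header at this position, so the decoder gives up.
        return False


# Made-up sample data standing in for one large text file.
sample = b"".join(b"row %d,some,value\n" % i for i in range(50000))
compressed = gzip_bytes(sample)

print(can_decode_from(compressed, 0))                     # start of the file
print(can_decode_from(compressed, len(compressed) // 2))  # middle of the file
```

Reading from offset 0 should work, while reading from the middle should be rejected. That is the situation a node is in when it is handed an HDFS block from the middle of a gzipped file: it has nothing it can decode on its own.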


Anyways, the docs were *kinda* clear on what to do, but they were still confusing about whether or not you have to muck around on the command line to get the parcel installed. Well, you don’t. You can do that part entirely via Cloudera Manager and the Parcels system (assuming you chose to install using Parcels when you initially set up the cluster).


In short, the LZO compression features are contained in the GPL Extras parcel. Here’s the page on how to install it:


http://ift.tt/1zXhxsR


It’s not too clear, especially since they don’t list anything for the 5.2 release, which is what I have. I just used http://ift.tt/1vT806E without appending any version number to the end. So, later on when there’s 5.3 or something, you may need to put in http://ift.tt/1zXhxJ6 or something like that.


Specifically, I did the following to set it up and get the parcel installed:


1. Cloudera Manager > at the top-middle-rightish, click on the parcel icon (looks like a gift box) > Edit Settings at the top right > In “Remote Parcel Repository URLs” add a new entry and paste in “http://ift.tt/1vT806K”

2. Save changes

3. Restart Cluster

4. I had to redeploy client configurations after the cluster restarted (there were icons mentioning as much on the “Home” page of Cloudera Manager).


Now, you can go to each of your nodes and run “yum install lzop -y”. Once that’s done, LZO should be magically available for things like filecrush.
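If you have more than a couple of nodes, running that command by hand gets tedious. Below is a minimal Python sketch that loops over the nodes with ssh. It assumes passwordless SSH as root to each node, which the steps above do not cover, and the hostnames and the helper names `build_install_command` and `install_lzop` are made up for illustration. It defaults to a dry run that only prints what it would execute.

```python
import subprocess

# Hypothetical hostnames; replace with your own nodes.
NODES = ["node1.example.com", "node2.example.com"]


def build_install_command(host, user="root"):
    """Build the ssh argv that runs the yum command on one host."""
    return ["ssh", f"{user}@{host}", "yum install lzop -y"]


def install_lzop(hosts, dry_run=True):
    """Run the install on each host in turn, or just print it when dry_run is True.

    Returns a dict mapping host to exit code (None for a dry run).
    """
    results = {}
    for host in hosts:
        argv = build_install_command(host)
        if dry_run:
            print(" ".join(argv))
            results[host] = None
            continue
        results[host] = subprocess.run(argv).returncode
    return results


if __name__ == "__main__":
    install_lzop(NODES, dry_run=True)
```

Once the printed commands look right, call `install_lzop(NODES, dry_run=False)` and check that every exit code comes back as 0.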




