Solving the Small Files problem in Hadoop. The Filecrush project on GitHub by Edward Capriolo seems to be a viable solution. Amazon has released S3DistCp, which would be another solution. In this post, I’m covering filecrush.
You need the filecrush JAR, which is not included in the GitHub project. There are links to www.jointhegrid.com, but that site has been down for me. Not sure why. Other searches for a filecrush JAR only yield sketchy results. Sooooo… let’s build it.
Hint: when you see a file named “pom.xml” in a GitHub or Bitbucket project, it means you can build the thing with Maven… pretty seamlessly.
For Maven you need to install a Java SDK (if OpenJDK, make sure to get the one with “-devel” at the end of the package name, e.g. yum install java-1.7.*-devel).
For Maven itself, you need to download the tarball, untar it, move it to the expected location, and export some environment variables.
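As a rough sketch of those Maven steps: the version (3.2.5), the Apache archive URL, and the /usr/local install location below are my assumptions, not something this post pins down, so swap in whatever fits your box. The download is wrapped in a function (install_maven, a name I made up) so nothing hits the network until you call it.

```shell
# Assumed values: pick the Maven version and install location you actually want.
MAVEN_VERSION="3.2.5"
MAVEN_TARBALL="apache-maven-${MAVEN_VERSION}-bin.tar.gz"
MAVEN_URL="https://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/${MAVEN_TARBALL}"
INSTALL_DIR="/usr/local"

# Download the tarball, untar it, and move it into place.
# Hypothetical helper; nothing runs until you call it.
install_maven() {
    wget "$MAVEN_URL" &&
    tar -xzf "$MAVEN_TARBALL" &&
    sudo mv "apache-maven-${MAVEN_VERSION}" "${INSTALL_DIR}/"
}

# Environment variables so the mvn command can be found.
export M2_HOME="${INSTALL_DIR}/apache-maven-${MAVEN_VERSION}"
export PATH="${M2_HOME}/bin:${PATH}"
```

Run install_maven once, then put the two export lines in your shell profile if you want them to survive a logout. mvn -version should then report both the Maven and Java versions.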
Download the GitHub project (the whole zip file). Install unzip and unzip it.
From inside the unzipped filecrush folder (same level as pom.xml) you need to run Maven.
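Something like the following should cover the download and unzip steps. The repository URL, the master branch name, and the resulting filecrush-master folder name are assumptions on my part (GitHub usually names the unzipped folder repo-branch), and both function names are made up for illustration.

```shell
# Assumed URL of the zip download for the project's master branch.
REPO_ZIP_URL="https://github.com/edwardcapriolo/filecrush/archive/master.zip"

# Install unzip, fetch the zip, and unpack it in the current directory.
# Hypothetical helper; nothing runs until you call it.
fetch_filecrush() {
    sudo yum install -y unzip &&
    wget -O filecrush.zip "$REPO_ZIP_URL" &&
    unzip filecrush.zip
}

# Succeeds only if the given directory (default: current) contains a pom.xml,
# i.e. you are at the level where Maven should be run.
is_maven_project_root() {
    [ -f "${1:-.}/pom.xml" ]
}
```

After fetch_filecrush, cd into the unzipped folder (probably filecrush-master) and check with is_maven_project_root before running Maven.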
In the case of filecrush, currently (version 2.2.2) you need to tell Maven to skip the tests, since they seem to break for some nontrivial reason related to a recent Hadoop release. It appears it’s not an issue though: http://ift.tt/1uM35ml (from the author of filecrush)
mvn -Dmaven.test.skip=true package
It will download dependencies and eventually spit out a JAR file in the target directory. Copy that up to your Hadoop cluster – I put mine in /user/hive/aux_libs/. Then, “Refresh Cluster” from the Home of Cloudera Manager on the relevant cluster.
Run filecrush like this, from an SSH session on one of the nodes:
hadoop jar filecrush-2.2.2-SNAPSHOT.jar com.m6d.filecrush.crush.Crush --input-format text --output-format text /user/root/ingfolderwithloadsoffiles/ /user/root/outputfolder/ 20101121121212
(I specified --input-format and --output-format because the files were gzipped text files)
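The last argument is a 14-digit timestamp (yyyyMMddHHmmss) that filecrush uses to name its output files. Rather than typing one by hand, you can probably generate it with date. A minimal sketch, reusing the paths from my run above as placeholders; run_crush is a name I made up.

```shell
# Placeholder values taken from the run above; change them for your cluster.
JAR="filecrush-2.2.2-SNAPSHOT.jar"
INPUT_DIR="/user/root/ingfolderwithloadsoffiles/"
OUTPUT_DIR="/user/root/outputfolder/"

# 14-digit job timestamp in yyyyMMddHHmmss form.
TIMESTAMP="$(date +%Y%m%d%H%M%S)"

# Hypothetical wrapper around the same command shown above.
run_crush() {
    hadoop jar "$JAR" com.m6d.filecrush.crush.Crush \
        --input-format text --output-format text \
        "$INPUT_DIR" "$OUTPUT_DIR" "$TIMESTAMP"
}
```

Call run_crush from the directory that holds the JAR on the node.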
See the docs for more usage options: http://ift.tt/1gzGOis