Wednesday, January 7, 2015

Impala 2.0 dies if you query on gzip files that are too big to fit in memory

Say you run Sqoop2 to import a very large table, 6 billion rows or so, and have it write its output as gzip.


Well, it’ll happily do all that, but you will end up with files of 3.5 GB or something. Yay! How space-efficient!
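For reference, here's a rough sketch of that kind of import, written with the classic "sqoop import" command-line syntax (the Sqoop2 shell phrases the same thing as a job definition, so the exact incantation differs, and the connection string, table name, and paths here are all made up):

    sqoop import \
        --connect jdbc:mysql://dbhost/bigdb \
        --table giant_table \
        --target-dir /user/etl/giant_table \
        --num-mappers 8 \
        --compress \
        --compression-codec org.apache.hadoop.io.compress.GzipCodec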


Then you make an external table that points at those files and query it with Hive. An hour or so later you'll get results. Yeesh. Well, maybe Impala is faster... hmm... instead you get an error message like this:


“Bad status for request 708: TGetOperationStatusResp(status=TStatus(errorCode=None, errorMessage=None, sqlState=None, infoMessages=None, statusCode=0), operationState=5, errorMessage=None, sqlState=None, errorCode=None)”
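For context, the setup that gets you here is nothing exotic. The external table is roughly something like the following (the columns, delimiter, and location are placeholders for whatever your import actually produced), and Hive and Impala both read the .gz text files transparently:

    hive -e "
    CREATE EXTERNAL TABLE giant_table (
        id      BIGINT,
        payload STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/etl/giant_table';
    "

Run a count or any full-table query against that in impala-shell and you hit the error above.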


What’s happened is you’ve killed Impala because the machines don’t have enough memory (and dying is exactly what Impala is configured to do when it’s the thing eating all the memory on a node). You’ll see Impala is dead in the Cloudera Manager main dashboard, so restart it. Since the data is all in gzip files, Impala has to decompress each file before it can do anything with it, and each of those 3.5 GB files blows up to, well, a whole lot bigger than 3.5 GB.
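You can see the mismatch from the command line; these are stock HDFS commands, and the path is the same placeholder as above:

    # How big are the gzip files the import wrote?
    hdfs dfs -du -h /user/etl/giant_table

    # What block size is this cluster actually using?
    hdfs getconf -confKey dfs.blocksize

Gzip isn't splittable, so each whole 3.5 GB file has to be read and decompressed by a single node, which is what pushes Impala past its memory limit.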


You have to use something like Filecrush to break those files up into smaller pieces, preferably nothing larger than your HDFS block size (which Filecrush will happily do by default).
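From memory, the Filecrush invocation looks roughly like this; the jar name, main class, and trailing timestamp argument are how I remember the project's README, so double-check against the setup posts and project link below before trusting any of it:

    # Crush the contents of one directory into another.
    # The last argument is a timestamp (yyyyMMddHHmmss) Filecrush uses when renaming originals.
    hadoop jar filecrush-2.2.2-SNAPSHOT.jar com.m6d.filecrush.crush.Crush \
        /user/etl/giant_table /user/etl/giant_table_crushed 20150107120000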


I’ve got some posts about setting up and using Filecrush:


http://ift.tt/14pOyE3


Filecrush project:


http://ift.tt/1gzGOis




