Sunday, November 23, 2014

Impala query of HBase data via a Hive table schema – seems broken

I want to store all log and event data in HBase. I want to generate a Hive schema for each event type. I want to then query with Impala, ideally materializing the data into Parquet-backed tables. My Cloudera Hadoop install is CDH 5.2.
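
For context, here's roughly the shape of what I was trying to do. This is just a sketch: the table names, column family, and columns below are made up, since in reality the schemas are generated per event type.

-- Hive: external table mapped onto an existing HBase table
-- (the "events" table and "d" column family are hypothetical).
CREATE EXTERNAL TABLE events_hbase (
  rowkey     STRING,
  event_type STRING,
  payload    STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:event_type,d:payload')
TBLPROPERTIES ('hbase.table.name' = 'events');

-- Impala: pick up the new table definition, then materialize into Parquet.
INVALIDATE METADATA events_hbase;
CREATE TABLE events_parquet STORED AS PARQUET AS
SELECT * FROM events_hbase;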


No dice. There must be a bug; that or we’re all missing some crucial bit of info.


0% Complete (0 out of 2) Backend 1:DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout? CAUSED BY: OutOfOrderScannerNextException:


Here are a few mentions from people wanting to do the same thing and coming up with nada:


http://ift.tt/1C3yKGk


http://ift.tt/1xqnUlT


http://ift.tt/1C3yKGm


This Apache bug *seems* to be the same. The suggested fix is to increase the RPC timeout in Impala:


http://ift.tt/1xqnV9l


There is a config in the Impala service for the RPC timeout to HBase (though it didn’t seem to do anything for me; the query just kind of sits there forever at 0% complete):

In Cloudera Manager > “Home” > Impala in Cluster > Configuration > type “rpc” in search > HBase RPC Timeout > change to something larger
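
If you'd rather set it outside the wizard, the same knob can go into the hbase-site.xml that Impala reads (e.g. via a Cloudera Manager safety valve). As far as I can tell, the "HBase RPC Timeout" setting corresponds to hbase.rpc.timeout; the scanner timeout is another property that gets mentioned alongside OutOfOrderScannerNextException. The values below are just my guess at "something larger":

<property>
  <name>hbase.rpc.timeout</name>
  <value>60000</value>
</property>
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>60000</value>
</property>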


Besides this Impala-HBase issue, I saw some seriously lackluster performance with Hive doing the same query. It would get up into the mid-to-high 90% range in the Map stage, but then just sort of stall out and never really finish (not by the time I called it quits after some “reasonable amount of time”, anyway).


This is discouraging. It looks like I won’t be able to use HBase as the core datastore for events and log data across all products, from which I’d dynamically maintain Hive table schemas and materialize things into Parquet-backed tables that I then query with Impala. It’s just not ready. Time to step back and have Sqoop lay down Avro files, and hopefully figure out how to get it to create external Hive table schemas, so nobody accidentally drops one of the tables and the underlying data along with it.
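
The fallback, roughly sketched (the path and schema URL are hypothetical): have Sqoop write the Avro files with --as-avrodatafile, then put an EXTERNAL table over them so a DROP TABLE only removes the metadata, not the data underneath.

-- Hive: external table over the Avro files Sqoop lays down.
CREATE EXTERNAL TABLE events_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/events_avro'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/events.avsc');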




