Sunday, November 30, 2014

Run Splunk as non-root user

You really should run Splunk as a non-root user. Here’s how to do it:


1. create a new user (if it doesn’t already exist)

– useradd splunk

– passwd splunk


2. Stop splunk

– /opt/splunk/bin/splunk stop


3. give ownership of all splunk files to the “splunk” user

– chown -R splunk:splunk /opt/splunk/


4. set splunk to start up under the “splunk” user at system boot

– /opt/splunk/bin/splunk enable boot-start -user splunk


5. reboot and make sure splunk starts up as expected

– top


If it doesn't start up, the most likely cause is that the "splunk" user lacks permissions on some file somewhere.
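If that happens, here's a quick sketch (assuming the default /opt/splunk install path) for hunting down files the "splunk" user doesn't own, and for confirming the processes actually run as that user:

# list anything under the install that isn't owned by the splunk user
find /opt/splunk ! -user splunk -ls

# confirm the running Splunk processes are owned by splunk
ps -ef | grep splunkd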





Saturday, November 29, 2014

Impala memory constraints (and the errors that accompany them)

I saw this sort of error come up when using Impala:


When I execute some aggregation query, a red bar comes up that indicates some failures but doesn't really tell me much at all. If I hit "execute" again on the same query, the sort of error below usually comes up in the Query Log and the query fails.


“impala Backend 3:Memory Limit Exceeded Process: memory limit exceeded. Limit=”


In my case the query's memory usage was just above 8GB and the limit was right around 8GB. This means the configured memory limit for Impala queries was reached on at least one node, and if even one node blows past the limit, the entire query fails. In my case there was just one Impala role group, and its memory limit was set to around 8GB.


Here’s how to check and change:

Cloudera Manager > Home > Impala in desired Cluster > Configuration > "Impala Daemon Memory Limit" > make those as big as you can > restart cluster.
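Before bumping it up, it helps to know how much RAM each Impala node actually has to give. A quick sketch (the hostnames are placeholders):

# check total and free memory on each Impala node before raising the limit
for host in node01 node02 node03; do
  echo "== $host =="
  ssh "$host" free -g | grep Mem
done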





Friday, November 28, 2014

Thursday, November 27, 2014

Sqoop 1.4.6 will support importing directly to parquet files

http://ift.tt/1FxyTAj


I think this is really cool. Prior to this upcoming release, if you wanted to use Parquet files you had to do a separate CREATE and INSERT statement and then drop the "incoming" table.
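For reference, a rough sketch of that current workaround, assuming Sqoop already landed the data in a text-backed table I'm calling incoming_mytable (both table names are made up):

# create the parquet-backed table from the "incoming" table, then drop it
hive -e "CREATE TABLE mytable STORED AS PARQUET AS SELECT * FROM incoming_mytable"
hive -e "DROP TABLE incoming_mytable"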


Hive 0.14 will support Avro as a first-class storage format. You'll be able to do "create table mytable () stored as avro". It's a different type of format than Parquet, yeah, but these are examples of cool stuff coming soon :)





Wednesday, November 26, 2014

Why do you need to upload a hive-site.xml file for each Oozie workflow Sqoop action

At the bottom of every action config window there is a field that says “Job XML”. These sorts of things have always scared me. Well, here’s what it means in Hadoop-land.


If you don't do anything with this field and you set up a Sqoop task, that task can run along just fine, happy as can be… until it needs to do something involving Hive. At that point it has no idea what to do or where to do it in Hive, because it doesn't know anything about where Hive is. And that's why the hive-site.xml file has to be specified there. You click the dot-dot-dot and upload a file – the hive-site.xml file you get from here:


Cloudera Manager > Cluster > Hive > Actions drop-down on the top right > “Download Client Configuration” > in that zip file will be hive-site.xml. That’s the file you upload. That’s the file that defines where anything Hive-related will be.
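If you want to script that part, here's a hedged sketch (the zip name and workflow path are just examples) of pulling hive-site.xml out of the client configuration zip and pushing it up where the workflow can grab it:

# unzip the client config and put hive-site.xml into the workflow's HDFS folder
unzip hive-clientconfig.zip -d hive-client-config
hdfs dfs -put hive-client-config/*/hive-site.xml /user/myuser/workflows/my_sqoop_workflow/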


You could get fancy and store your hive-site.xml (and all the other *-site.xml files) in some HDFS folder you just point to, but that's fancy and I'm not ready for that ;)





Tuesday, November 25, 2014

Installing Cloudera Hadoop with MySQL for back-end database

At the Database Setup step during the cluster setup, you see this error when you test the connections to the DBs:

JDBC driver cannot be found. Unable to find the JDBC database jar on host :


You need to make sure you get the latest mysql connector jar file and put it here (named this way too):

/usr/share/java/mysql-connector-java.jar (the log line below shows it's explicitly looking for a file by that name)
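A hedged sketch of getting it in place (the Connector/J version here is just whatever I happened to download; the symlink gives it the exact name Cloudera Manager looks for):

# untar the Connector/J download, copy the jar, and give it the expected name
tar xzf mysql-connector-java-5.1.34.tar.gz
mkdir -p /usr/share/java
cp mysql-connector-java-5.1.34/mysql-connector-java-5.1.34-bin.jar /usr/share/java/
ln -sf /usr/share/java/mysql-connector-java-5.1.34-bin.jar /usr/share/java/mysql-connector-java.jar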


In /var/log/cloudera-scm-server/cloudera-scm-server.log I saw this:


+ exec /usr/java/jdk1.7.0_67-cloudera/bin/java -Djava.net.preferIPv4Stack=true -Djava.security.egd=file:///dev/urandom -cp '/var/run/cloudera-scm-agent/process/6-HIVE-test-db-connection:/usr/share/java/mysql-connector-java.jar:/usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar:/usr/share/java/oracle-connector-java.jar:/usr/share/cmf/lib/*' com.cloudera.enterprise.dbutil.DbCommandExecutor db.properties





Monday, November 24, 2014

Sunday, November 23, 2014

Impala query of HBase data via a Hive table schema – seems broken

I want to store all log and event data in HBase. I want to generate a Hive schema for each event type. I want to then query with Impala, ideally inserting data into parquet-backed tables. CDH 5.2 is my Cloudera Hadoop install.


No dice. There must be a bug; that or we’re all missing some crucial bit of info.


0% Complete (0 out of 2) Backend 1:DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout? CAUSED BY: OutOfOrderScannerNextException:


Here are a few mentions from people wanting to do the same thing and coming up with nada:


http://ift.tt/1C3yKGk


http://ift.tt/1xqnUlT


http://ift.tt/1C3yKGm


This Apache bug *seems* to be the same. The solution seems to be to increase the RPC timeout in Impala:


http://ift.tt/1xqnV9l


There is a config in the Impala service for RPC timeout to HBase (but it didn’t seem to do anything for me; it just kind of sits there forever at 0% complete):

In Cloudera Manager > “Home” > Impala in Cluster > Configuration > type “rpc” in search > HBase RPC Timeout > change to something larger


Besides this Impala-HBase issue, I saw some seriously lackluster performance with Hive doing the same query. It would get up into the mid-to-high 90% range in the Map stage, but then just sort of stall out and never really finish (not by the time I called it, after some "reasonable amount of time").


This is discouraging. It looks like I won't be able to use HBase as the core datastore for events and log data for all products, from which I then dynamically maintain Hive table schemas and materialize things into parquet-backed tables which I then query with Impala. It's just not ready. Time to step back and just have Sqoop lay down Avro files, and hopefully figure out how to get it to create external Hive table schemas so nobody accidentally drops one of the tables.
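For the record, here's a hedged sketch of what the external-table-over-Avro-files part would look like in Hive (the paths and schema URL are made up); EXTERNAL is what keeps a DROP TABLE from deleting the underlying data files:

hive -e "
CREATE EXTERNAL TABLE events_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/sqoop/events_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/sqoop/schemas/events.avsc')"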





Saturday, November 22, 2014

Adding more space to your Cloudera Hadoop nodes

I noticed that my Cloudera cluster install seemed to put all the DFS directories on the root volume of each node at /dfs/dn. There were 2 volumes on each machine and the larger volume was mounted at /home. I can’t recall why or if I had something to do with that. However, needless to say, most of the space on this cluster was not even being used by the cluster in any way. Here’s how I got my Hadoop install to use the extra space:


Note you will have to restart your cluster; or if you have Enterprise, do a rolling restart.


On each node, create a directory at /home/dfs/dn. It’s in /home *only* because I didn’t want to redo all the mount points and change up partitions. I plan on nuking each data node one at a time and installing more disk space anyways, so this will do for now. Anyways…


1. create the directory you want HDFS to use

– mkdir /home/dfs

– mkdir /home/dfs/dn

– chown -R hdfs:hdfs /home/dfs

2. Go to your Cloudera Manager web interface and click on “Home” at the top > HDFS service in the relevant Cluster > Instances > “DataNode” (you’ll do this for each node) > Configuration > now click the “+” sign in the DataNode Data Directory config section and type in “/home/dfs/dn”.

3. Go back to “Home” and you’ll see an icon next to the HDFS service that indicates restarts are necessary. Do that.


Once the cluster comes back up, new data writes should start going to the new directories. You should also see the bar in the HDFS Summary area indicate the additional available space.
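A quick way to confirm HDFS actually sees the new space (run as the hdfs user; the grep just picks out the interesting lines of the report):

sudo -u hdfs hdfs dfsadmin -report | grep -E 'Configured Capacity|DFS Remaining'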





Friday, November 21, 2014

JSON, Avro, Trevni, and Parquet: how they are related

JSON – consider it an alternative to XML. It's smaller, faster, and easier to read. Just use it.


Avro – a data serialization and storage format whose schema is defined in JSON. Think of it as a file that contains loads of records stored in a compact binary encoding, with the JSON schema stored along with it. In addition,

“When Avro is used in RPC, the client and server exchange schemas in the connection handshake”.


Trevni – a columnar storage format. Instead of writing out all the columns for a row and then moving on to the next row, Trevni writes out all the values for a given column and then moves on to the next column. This means all the column values are stored sequentially, which allows for much faster BI-like reads.


Parquet – Cloudera and Twitter took Trevni and improved it. So, at least in the Cloudera distribution, you’ll see Parquet instead of Trevni. I suspect most BI-type systems will be using Parquet from now on.


Really, JSON and Avro are not directly related to Trevni and Parquet. However, Serializers/Deserializers (SerDes) for them come by default with Hive, so it's good to know.


If you use Avro, it means you can do strong typing while moving around lots of NOSQL data. Sqoop, for instance, can now import directly to Avro files – it generates the JSON schema and everything for you based on the columns in the source tables.
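A rough example of what that looks like (the connection string, table, and target directory are placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/mydb --username myuser -P \
  --table events \
  --as-avrodatafile \
  --target-dir /user/sqoop/events_avro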


Avro:


http://ift.tt/1rGZ2oV


Trevni:


http://ift.tt/11Ee0nD


Parquet:


http://ift.tt/1lgvjVh





Installing and using Filecrusher with Cloudera Hadoop 5.2

Solving the Small Files problem in Hadoop. The Filecrush project on Github by Edward Capriolo seems to be a viable solution. Amazon has released S3DistCp, which would be another solution. For this, I'm covering filecrush.


You need the filecrush JAR, which is not something included in the Github project. There are links to www.jointhegrid.com, but that site has been down for me. Not sure why. Other searches for a JAR of filecrush only yield sketchy results. Sooooo… let's build it.


Hint: when you see a file named "pom.xml" in a Github or Bitbucket project, it means you can build the thing with Maven… pretty seamlessly.


For Maven you need to install a Java SDK (if OpenJDK, make sure to get the one with "-devel" at the end of the package name, e.g. yum install java-1.7.*-devel).


For Maven, you need to download the tarball, untar it, move it to the expected location, and export some environment variables.
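Roughly like this (the Maven version, download URL, and install path are just what I'd pick; adjust to taste):

# download, untar, move into place, and wire up the environment
curl -O http://archive.apache.org/dist/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz
tar xzf apache-maven-3.2.3-bin.tar.gz
mv apache-maven-3.2.3 /usr/local/maven
export M2_HOME=/usr/local/maven
export PATH=$M2_HOME/bin:$PATH
mvn -version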


Download the Github project (the whole zip file). Install unzip and unzip it.


From inside the unzipped filecrush folder (same level as pom.xml) you need to run Maven.


In the case of filecrush, currently (2.2.2) you need to tell Maven to skip the tests, since they seem to be breaking for some nontrivial reason caused by a recent Hadoop release. It appears it's not an issue though: http://ift.tt/1uM35ml (from the author of filecrush)


mvn -Dmaven.test.skip=true package


It will download dependencies and then eventually spit out a jar file in the target directory. Copy that up to your Hadoop cluster – I put mine in /user/hive/aux_libs/. Then, "Refresh Cluster" from the Home of Cloudera Manager on the relevant cluster.
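The copy-up looks something like this (the node hostname is a placeholder, and the jar name matches what Maven spit out for me; run the HDFS bit as a user with write access to /user/hive):

scp target/filecrush-2.2.2-SNAPSHOT.jar root@node01:/tmp/
ssh root@node01 "sudo -u hdfs hdfs dfs -mkdir -p /user/hive/aux_libs && sudo -u hdfs hdfs dfs -put /tmp/filecrush-2.2.2-SNAPSHOT.jar /user/hive/aux_libs/"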


Run filecrush like this:


From a SSH session on one of the nodes:


hadoop jar filecrush-2.2.2-SNAPSHOT.jar com.m6d.filecrush.crush.Crush --input-format text --output-format text /user/root/ingfolderwithloadsoffiles/ /user/root/outputfolder/ 20101121121212


(I did the --input-format and --output-format because the files were gzipped text files)


See the docs for more usage options: http://ift.tt/1gzGOis





Thursday, November 20, 2014

going through the Kafka quickstart

Which JAR files do you need in the sharelib to run a sqoop job via Oozie that imports to HBase?

Wednesday, November 19, 2014

Tuesday, November 18, 2014

Monday, November 17, 2014

Dropping a Hive table that was created with a custom serde can be a problem

If you add a SerDe via the "Add Jar" command in a Hive query and then create a table that uses that SerDe, note that you will not be able to later drop that table from a different Hive session without first adding the same SerDe via the "Add Jar" command.


Moral of the story – make sure you hang on to every custom SerDe any of your users ever uses to create a Hive table.


I’m sure there are other ways to drop a table without the SerDe present, but still… just something to be aware of.
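In practice that looks something like this (the jar name and table are made-up examples of a custom SerDe setup):

hive -e "
ADD JAR /var/lib/hive/aux_jars/json-serde-1.3-jar-with-dependencies.jar;
DROP TABLE my_json_table;
"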





Easiest way to install a Splunk 6.2 cluster

Sunday, November 16, 2014

Saturday, November 15, 2014

Working with XML or JSON in Hive (and Impala)

You need 2 SerDe jar files, and you need to configure the Hive Auxiliary Jars path.


1. Pick the directory where you will always put all your globally-accessible additional SerDe jars:

– these will be usable by everyone who uses Hive, so consider that I guess

– I’m going with /var/lib/hive/aux_jars

– mkdir /var/lib/hive/aux_jars

– do this on each node that is running HiveServer2 or HiveServer.


2. From the following 2 projects get the SerDe jars and somehow copy them into the /var/lib/hive/aux_jars folder on all your nodes running HiveServer2 and/or HiveServer:

http://ift.tt/1xm1hEC

http://ift.tt/1eVPKm1

– make sure to do a chown -R hive:hive /var/lib/hive/aux_jars


3. In Cloudera Manager, click on the Hive service and go to the configuration tab:

– type "aux" to filter the configs to show the "Hive Auxiliary JARs Directory" config. Enter /var/lib/hive/aux_jars

– this is my own path, not something official or some magic number

– it’s just telling Hadoop-land which directory on the HiveServer2 nodes to look for additional SerDe jars.

– redeploy and restart – just do whatever Cloudera Manager tells you do to in order to deploy the config changes


4. Now you can use those SerDes as they are documented in the two projects linked in step 2 above. If it doesn't work, double- and triple-check your path spelling. I've not had it *not* work for me for any other reason.
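Once that's working, a table that uses one of the SerDes looks roughly like this (the table and columns are made up, and the SerDe class shown is the commonly used openx JSON SerDe; check the projects' own docs for the exact class names they ship):

hive -e "
CREATE TABLE tweets_json (id BIGINT, msg STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
"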





Querying Hive and Impala from Tableau (on Windows)

You need to install the 32 and 64 bit ODBC drivers/Connectors for one of or both Hive and Impala from here:


http://ift.tt/1rB1pwX


You will need to run the ODBC Administrator program that’s built into Windows to configure the “System DSN” that gets created for each one when you install the drivers/Connectors. Just hit the Windows key and type “odbc admin”. There’s a 32-bit one and a 64-bit one. The 32-bit one can only edit the 32-bit DSNs. The 64-bit one can only edit the 64-bit DSNs. Find them on the “System DSN” tab. And, yes, you need to edit both to point at your Impala and/or Hive server. If you’re not sure which node(s) that is, you can find them in Cloudera Manager:


1. Cluster > Hive > Instances > HiveServer2

– Whichever machine is running that role instance is where you need to point all your DSNs

2. The ports to point at are found here:

– Cluster > Impala > Configuration > Ports > “Impala Daemon HiveServer2 Port” (default is 21050)

– Cluster > Hive > Configuration > Ports > “HiveServer2 Port” (default is 10000)





Friday, November 14, 2014

Failed to start Cloudera Manager Agent

Some errors when you click on Details:

MainThread agent ERROR Could not determine hostname or ip address; proceeding.

agent.py: error: argument --hostname is required


Make sure your hostnames match:


/etc/sysconfig/network

/etc/hosts


Notably, check the hosts file on the machine on which you're running Cloudera Manager; the agent will fail if, for example, that file has a different hostname for the machine at 192.168.1.222 than what that machine has in its own /etc/hosts or /etc/sysconfig/network files.
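A quick way to eyeball the places that need to agree on any given node (the IP is just the example from above):

hostname
grep -i hostname /etc/sysconfig/network
grep 192.168.1.222 /etc/hosts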





Sqoop has inconsistent behavior with creating a Hive table if it doesn’t already exist

If you want to import data into HDFS via Sqoop and have it auto-create the Hive table schema, currently (1.4.5) Sqoop jobs will fail if the Hive table already exists. Running the same "sqoop import" command directly will successfully complete.


The reason is that when a Sqoop job is created, the "PROPNAME" of hive.fail.table.exists is set to true. If you update that via SqlTool in the Sqoop metastore DB (HSQLDB) so it's set to false, the jobs will run fine.


I can't find anything in the docs that indicates how you can specify this behavior. I've had to manually run "update" statements directly on the HSQLDB instance… that, or just before creating your sqoop job, run a "sqoop import" command with --create-hive-table and specify "where 1=0" in the query somehow so the table exists… then when you create your sqoop job, remove the --create-hive-table bit from the import config.
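For what it's worth, here's a hedged sketch of that "where 1=0" trick (the connection string, table, and paths are placeholders). Run it once so the Hive table exists, then create the real sqoop job without --create-hive-table:

sqoop import \
  --connect jdbc:mysql://dbhost/mydb --username myuser -P \
  --query 'SELECT * FROM events WHERE 1=0 AND $CONDITIONS' \
  --hive-import --create-hive-table --hive-table events \
  --target-dir /tmp/events_bootstrap -m 1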


I think we just need to add another configurable parameter that lets you specify if you want the import to fail if the hive table already exists.


Here’s the SQL statement that will make it so it won’t fail if the table already exists:

update sqoop_sessions set propval = false where propname = 'hive.fail.table.exists';


I submitted my very first Apache Jira bug for this :) http://ift.tt/1174byG





Thursday, November 13, 2014

Wednesday, November 12, 2014

Sqoop import into HBase with large numbers of rows fails with “Error: Java heap space” in stderr

Fluoridation Efficacy Question

I hate citing just one URL and asking questions, as doing so seems too myopic. However, I think it will serve to address a question I've had about fluoride that I've not really found a great answer to.


Question: Would it be cheaper and less risky if we skipped fluoridating water and instead just did twice-annual fluoride treatments?


This review seems to indicate there would be no difference in outcomes: http://ift.tt/1xP65yH


“Initial studies of community water fluoridation demonstrated that reductions in childhood dental caries attributable to fluoridation were approximately 50%–60% (94–97). More recent estimates are lower — 18%–40% (98,99). This decrease in attributable benefit is likely caused by the increasing use of fluoride from other sources, with the widespread use of fluoride toothpaste probably the most important.”


“Clinical trials conducted during 1940–1970 demonstrated that professionally applied fluorides effectively reduce caries experience in children (233). In more recent studies, semiannual treatments reportedly caused an average decrease of 26% in caries experience in the permanent teeth of children residing in nonfluoridated areas”


80% of caries in children happen in 25% of the population. It seems widespread fluoridation is kind of like carpet bombing everything just to make sure we get it all. In a world of web content personalization, it seems we ought to be able to target with at least some level of precision.





Incremental Import a limited chunk of data from a database into Hadoop via Sqoop job run by Oozie… and doing it with the least amount of effort

Tuesday, November 11, 2014

Sqoop metastore “loses” the fact that I have configured it to save job passwords

Sqoop seems to forget I told it to store the password in the database. It'll run fine without prompting for the password one or two times. Then out of the blue it seems to start prompting again. This means Oozie-scheduled tasks fail and I have to recreate the job (after copy-pasting the --last-value I get by doing a sqoop job --show job-name).


This is only a problem when you specify a --password. If you can pack the username/password into the connection string to the DB, it can't "lose" that. For SQL Server, I can do that. For MySQL, it hasn't worked yet…. mrfgh…
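For the SQL Server case, here's a hedged example of what I mean by packing the credentials into the connection string (server, database, credentials, and columns are placeholders):

sqoop job --create my_import_job -- import \
  --connect 'jdbc:sqlserver://dbhost;databaseName=mydb;user=myuser;password=mypassword' \
  --table events \
  --incremental append --check-column id --last-value 0 \
  --target-dir /user/sqoop/events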





Run a sqoop job from an Oozie workflow action (could not load db driver class)

When running a Sqoop job via an Oozie task, I got the following errors:


WARN org.apache.sqoop.tool.SqoopTool – $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.


ERROR org.apache.sqoop.Sqoop – Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver


These were because the mysql connector JDBC driver was not in the Oozie sharelib folder in HDFS: hdfs://user/share/lib/lib_20141030223533. Note you need to restart the Oozie service in the cluster in order for the driver to get picked up. The Cloudera docs seem to say you don't have to restart Oozie, but I've tested it out and I have to restart Oozie from Cloudera Manager.
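The fix boiled down to something like this (the sharelib path is the one from my cluster and will differ on yours; putting the jar under the sqoop subfolder is my assumption about the right spot):

hdfs dfs -put mysql-connector-java.jar /user/share/lib/lib_20141030223533/sqoop/
# then restart the Oozie service from Cloudera Manager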


If you delete one of the jars from the sharelib and run an Oozie job that somehow depends on that jar, you get insta-failed ;)





Oozie job fails for some random “begin > end in range” reason

You submit an oozie job and it fails. If you see the following error followed by a stack trace in the Logs:


Launcher exception: begin > end in range (begin, end): (1415382327807, 1415382308114)


-AND/OR this-


java.lang.IllegalArgumentException: begin > end in range (begin, end): (1415382327807, 1415382308114)


Check the system time on all your hadoop nodes. Those times are not coming from your query – they’re coming from your hadoop nodes comparing their system times.


$> date


If they’re skewed even a little bit, that error will come up. So, make sure you’ve got ntp running on all the servers and make sure they’re all pointed at the same ntp server.
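A quick way to spot the skew across nodes (hostnames are placeholders), plus the usual NTP sanity checks on each box:

# print the epoch time each node reports, then check the NTP daemon and peers
for host in node01 node02 node03; do
  echo -n "$host: "; ssh "$host" date +%s
done
service ntpd status
ntpq -p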


Here’s an existing Jira task in the Apache project for Oozie that I finally came across while searching (the submitter mentions just an 8 second time skew causing this):


http://ift.tt/1B4G9oc





Monday, November 10, 2014

Sunday, November 9, 2014

Linux hosts file

/etc/hosts


Say it looks like this:

127.0.0.1 localhost.localdomain localhost.localdomain localhost4 localhost4.localdomain4 localhost mycpu001

::1 localhost.localdomain localhost.localdomain localhost6 localhost6.localdomain6 localhost mycpu001

192.168.1.51 mycpu001

192.168.1.52 mycpu002

192.168.1.53 mycpu003

192.168.1.54 mycpu004


Sometimes if you do nslookup mycpu001, it will resolve to "localhost" or some variant from that line. I'm not sure that's entirely expected.


I'm going to remove the "mycpu001" entry from the first 2 lines (and do the equivalent on all other machines).
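After editing, a quick check of what the resolver actually hands back for the name:

# getent consults /etc/hosts (per nsswitch.conf), unlike nslookup
getent hosts mycpu001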





Saturday, November 8, 2014

mysql failed to start up right after install

in /var/log/mysqld.log you see this line toward the end:

"Fatal error: Can't open and lock privilege tables: Table 'mysql.host' doesn't exist"


Do yourself a favor and just uninstall mysql, disable selinux, reboot, then reinstall mysql:


1. yum remove mysql-server

– rm -rf /var/lib/mysql

2. nano /etc/selinux/config

– “enforcing” -> “disabled”

3. reboot

4. yum install mysql-server

5. continue where you left off

– your my.cnf file will remain


I’m sure someone really really disagrees with the “disable selinux” bit ;)
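Side note: if you don't want to wait for the reboot just to get SELinux out of the way for the current session, this is the usual pair of commands (the config file edit still covers future boots):

# switch SELinux to permissive right now, then confirm
setenforce 0
getenforce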





Friday, November 7, 2014

Thursday, November 6, 2014

Sqoop on Cloudera 5.2 out of nowhere starts failing

14/11/06 17:25:04 INFO mapreduce.Job: Running job: job_1415319646271_0006

14/11/06 17:25:14 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=FAILED. Redirecting to job history server

14/11/06 17:25:14 INFO mapreduce.Job: Job job_1415319646271_0006 running in uber mode : false

14/11/06 17:25:14 INFO mapreduce.Job: map 0% reduce NaN%

14/11/06 17:25:14 INFO mapreduce.Job: Job job_1415319646271_0006 failed with state FAILED due to:

14/11/06 17:25:14 INFO mapreduce.ImportJobBase: The MapReduce job has already been retired. Performance

14/11/06 17:25:14 INFO mapreduce.ImportJobBase: counters are unavailable. To get this information,

14/11/06 17:25:14 INFO mapreduce.ImportJobBase: you will need to enable the completed job store on

14/11/06 17:25:14 INFO mapreduce.ImportJobBase: the jobtracker with:

14/11/06 17:25:14 INFO mapreduce.ImportJobBase: mapreduce.jobtracker.persist.jobstatus.active = true

14/11/06 17:25:14 INFO mapreduce.ImportJobBase: mapreduce.jobtracker.persist.jobstatus.hours = 1

14/11/06 17:25:14 INFO mapreduce.ImportJobBase: A jobtracker restart is required for these settings

14/11/06 17:25:14 INFO mapreduce.ImportJobBase: to take effect.

14/11/06 17:25:14 ERROR tool.ImportTool: Error during import: Import job failed!


I have no idea why this is happening. It may have something to do with copy-pasting the command from Notepad. If I retype the last parts of the command in the shell, it doesn't throw that error.


The lame thing is that Google searches for "mapreduce.ImportJobBase: mapreduce.jobtracker.persist.jobstatus.active = true" get nearly nothing besides source code dumps. Whatever the case, if you get that sort of result after running a sqoop command, just retype the whole mess, especially if you copied and pasted it into the console window.
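One hedged way to sanity-check a pasted command before running it: dump it into a file and look for anything outside printable ASCII (smart quotes and en-dashes are the usual culprits). The file name is just an example:

# flag any non-ASCII bytes, with line numbers
LC_ALL=C grep -n '[^ -~]' sqoop_cmd.txt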





Tuesday, November 4, 2014

Browsing HBase via Hue in Cloudera

This was the single remaining config error I was seeing in Cloudera Manager, and I was not finding too much about it in my searches:


Thrift Server role must be configured in HBase service to use the Hue HBase Browser application.


I'd click on the details for that config error, and the only option was "None". In Hue, the error was that it couldn't connect to localhost:9090. Well, the problem was that there were no HBase Thrift Server instances installed in my cluster. Heh heh. After installing one, it showed up as an option other than "None" in that config.
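Once the Thrift Server role is running, a quick check from the Hue host that something is actually listening (swap in whichever host you put the role on; 9090 is the default port Hue was complaining about):

nc -zv localhost 9090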





Sunday, November 2, 2014

Hortonworks vs MapR vs Cloudera

My thoughts after trying them all (on local VMs)


1. You need more than 18GB of RAM on your machine in order to effectively test. Just do it.


2. Cloudera is the easiest to install. AND it sets up Hue for you. Hortonworks and MapR require a LOT of manual edits to arcane config files (they seem arcane when you’re new). Hortonworks is the next-easiest. MapR was the hardest.


3. MapR, to me, has the most promise, given it's closer to the metal. The promise of random writes directly to the distributed filesystem just seems really really good.


4. MapR is the hardest to install. It just takes more command-line work.


5. Adding nodes to a cluster is straightforward with Hortonworks and Cloudera. With MapR, you have to do more command-line prep than a noob will prefer.


6. MapR seemed to have some of the best documentation on how Hadoop works. Hortonworks was up there too. Cloudera seemed a little less than screamingly clear… but that could have been because theirs were the first docs I started reading.


7. Hortonworks installs a MySQL instance for the Hive metastore. Cloudera and MapR use an embedded PostgreSQL DB, which they repeatedly say not to use for much beyond a proof-of-concept cluster.


8. Cloudera has some proactive notifications on config best practices. However, I’m not sure why something like Java heap size configs would differ – I suppose the installer may set things to some percentage of available RAM.