Tuesday, October 28, 2014

Getting data into Hadoop from Windows via SSIS (SSDT) and a DSN

Figured I’d consolidate what I did to get data into a Hadoop cluster with SSIS/SSDT. I used SSDT 2013 (the BI version of the install). In this case, it was a Hortonworks Hadoop cluster.


1. Install both the 32-bit and 64-bit versions of the Hortonworks Hive ODBC driver: http://ift.tt/1wsXYrp

– edit the DSNs the installer should have already created so they point at whichever node is running the HiveServer2 service (you can find it in Ambari)

– you have to edit the 32-bit DSN via the 32-bit ODBC Data Source Administrator and the 64-bit DSN via the 64-bit one (on 64-bit Windows, the 32-bit administrator is %windir%\SysWOW64\odbcad32.exe and the 64-bit one is %windir%\System32\odbcad32.exe)

– In “Authentication”, I could only get it to work with “User Name” or “User Name and Password”. “No Authentication” just left the connection test hanging for a long time. I used the “hive” account credentials as defined in Ambari
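
For reference, the same settings expressed as a DSN-less connection string look roughly like this. This is a sketch based on the driver’s documented keys (the Hortonworks driver is a Simba build; AuthMech=2 is “User Name” and AuthMech=3 is “User Name and Password”), and the host and password are placeholders:

    Driver={Hortonworks Hive ODBC Driver};Host=your-hiveserver2-node;Port=10000;HiveServerType=2;AuthMech=3;UID=hive;PWD=your-password;

Something like that can go straight into an SSIS ODBC connection if you’d rather not maintain DSNs on every machine.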


2. At this point you must already have a Hive table created. Map the columns from the data flow to the columns in the Hive table and you should be good to go.

– Now, to be honest, I keep hitting a permissions error when writing data to the table. I haven’t solved it yet, but see below for where I’d start looking; I’ll update this post with whatever works.
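
My working theory (unconfirmed, so treat this as a guess rather than the fix) is HDFS permissions on the table’s directory under the Hive warehouse. From a node on the cluster, something like the following would let you inspect and, if needed, loosen them. Note that my_table is a placeholder, and /apps/hive/warehouse is the default HDP warehouse path:

    # see who owns the table's directory and what the permission bits are
    hdfs dfs -ls /apps/hive/warehouse

    # as a user with HDFS superuser rights (e.g. hdfs), fix ownership and permissions
    hdfs dfs -chown -R hive:hadoop /apps/hive/warehouse/my_table
    hdfs dfs -chmod -R 775 /apps/hive/warehouse/my_table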


How do you create a table in Hive? You can use Hue, HCat, or whatever web-based tool may be installed on your Hadoop cluster, but here is how to do it from the command line:


1. SSH into one of the nodes on your cluster that’s running Hive

2. change to a user that has permissions in Hive:

– su hive

3. start the Hive shell

– hive

4. Once you’re in Hive, it’s a lot like being logged in to a mysql client (generally the same commands are available):

– show databases;

– use default;

– create table my_table (year int, month int); (the table needs a name and every column needs a type; my_table is just an example)
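
For anything beyond a toy table you’ll usually also spell out all the columns and the storage format. Here’s a minimal sketch of a delimited text table (the easiest format to eyeball in HDFS); the table and column names are made up for illustration:

    create table weather_staging (
      year int,
      month int,
      avg_temp double
    )
    row format delimited
    fields terminated by ','
    stored as textfile;

Afterwards, describe weather_staging; shows the columns, which is handy when you’re mapping the SSIS data flow to the table.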




