Friday, December 12, 2014

Connecting to a fairly old version of mysql from Tableau

I had to connect to a MySQL instance to do some very quick reporting. But after installing the latest MySQL drivers, Tableau threw this error:


connection using old (pre-4.1.1) authentication protocol refused


The error mentioned 3.51.0 as a version number. Lo and behold, that's a driver version on the MySQL driver download page. I downloaded and installed the 3.51.30 driver and the connection was happy.
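
If you want to confirm on the server side that the account really is stuck on the old (pre-4.1) password hash, a quick check like this works on MySQL 5.6 and earlier (the Password column went away in 5.7); a 16-character hash means the old authentication protocol, 41 characters means the new one:

select User, Host, length(Password) as hash_length
from mysql.user;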


Drivers are here (the Tableau Drivers page links to the same place):


http://ift.tt/KV3jEL





Thursday, December 11, 2014

MySQL Slow Query Log – change it without having to restart mysql

How do you enable slow query logging in MySQL without having to restart?


set @@global.slow_query_log = 1;


I prefer to log these things to the mysql.slow_log table rather than to a file. That way you can view the slow queries with a simple "select * from mysql.slow_log":


set @@global.log_output = "TABLE";
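
Once queries start landing in the table, something like this is a bit more useful than a plain select * (assuming the standard slow_log columns in MySQL 5.5/5.6):

select start_time, user_host, query_time, rows_examined, db, sql_text
from mysql.slow_log
order by query_time desc
limit 10;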


Set the threshold above which a query is considered a “slow query” with this (in seconds):


set @@global.long_query_time = .33;


You can view your current settings for all these configs with these queries:



select @@global.slow_query_log;

select @@global.log_output;

select @@global.long_query_time;
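
The same settings also show up via show variables, if you prefer that:

show global variables like 'slow_query_log';
show global variables like 'log_output';
show global variables like 'long_query_time';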


You can make these settings permanent, so they're applied whenever mysql starts, by adding these to the [mysqld] section of my.cnf:



slow_query_log = 1
log_output = TABLE
long_query_time = 0.3


http://ift.tt/kChfet


Clear out the slow query data logged to that mysql.slow_log table:

truncate table mysql.slow_log; (or any SQL delete statement, I think)


A note on select @@long_query_time vs. @@global.long_query_time:

@@long_query_time shows your current session's setting, while @@global.long_query_time shows the global setting that new sessions inherit.
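
For example, changing the session value leaves the global one alone:

set @@long_query_time = 1; -- session only
select @@long_query_time, @@global.long_query_time; -- these can now differ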


http://ift.tt/1utvdrX


Just use @@global.long_query_time, @@global.slow_query_log, @@global.log_output.


set @@global.log_slow_queries is what older versions of mysql used (the variable was removed in MySQL 5.6). Don't use it now, if only for "staying with the times".





Saturday, December 6, 2014

GitHub Projects that are written in Java and do not have any JAR files to download

In this case I’ll do it with the Hadoop filecrush project here: http://ift.tt/1gzGOis


There are no JAR files to download.


What do you do?


You build it :)


How?


Look for a pom.xml file. That means the whole thing can be built with Maven. Here's how to do that:


1. install the Java Development Kit (OpenJDK in this case)

– yum install java-1.7.0-openjdk-devel (the “-devel” includes the actual JDK… the version without “-devel” doesn’t have the JDK… despite the name)

– I found that by first running “yum list *jdk*” and looking through what it displays

– set JAVA_HOME environment variable

– export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre


2. install wget

– yum install wget


3. install maven

– Maven is not in the yum repository, so we’re going to just follow the directions from the Maven site: http://ift.tt/1j5lN1M

– wget http://ift.tt/12xh1qc

– tar xzf apache-maven-3.2.3-bin.tar.gz -C /usr/local/

– cd /usr/local

– export M2_HOME=/usr/local/apache-maven-3.2.3/

– export PATH=${M2_HOME}/bin:${PATH}

– “mvn -version” to make sure it’s installed

– you should see no error messages


4. Download the filecrush project:

– wget http://ift.tt/1wKXqR5


5. Install unzip

– yum install unzip


6. decompress the project

– unzip master.zip


7. Run Maven to build the project

– cd filecrush-master

– mvn package

– it’ll download a bunch of stuff now

– “mvn package” automatically looks for a “pom.xml” file, since that file contains all the build instructions.

– “mvn package” produces a JAR file, whereas “mvn install” would install the files somewhere on the system.

– hmm… it throws exceptions during the tests, ending with an error like “Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project filecrush: There are test failures.”


8. Found something: http://ift.tt/1uM35ml

– this is someone reporting what appears to be this very error

– The filecrush author says it looks to be a non-deterministic problem that we can safely ignore with the following build command:

– mvn -Dmaven.test.skip=true package

– there we go. Seems to build fine


9. Get the built JAR

– the built JAR is at target/filecrush-2.2.2-SNAPSHOT.jar





Friday, December 5, 2014

Shrinking tempdb in SQL Server 2008r2

Say you’ve done some fairly complex query on a 6 billion row table. It took 5 hours but it completed. Yay! However, now you have a 200GB tempdb and your drives are all full. Since this was a one-time deal, you want to shrink that tempdb back down to something much smaller.


The typical database shrinking approach is: in SSMS > System Databases > right-click tempdb > Shrink > Files > find the big file and try to shrink it. If that doesn't actually finish, check the specified initial size in the database file properties: right-click tempdb > Properties > Files > look at the Initial Size column. One file may have a really big initial size specified. Just make that number smaller.
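
Roughly the same thing in T-SQL, if you'd rather script it (this assumes the default data file name tempdev; check sys.master_files if yours is named differently, and note that the smaller size only fully takes effect once SQL Server restarts and recreates tempdb):

use tempdb;
-- try to release space right now (target size is in MB)
dbcc shrinkfile (N'tempdev', 10240);
-- set the size tempdb will be recreated with at the next restart
alter database tempdb modify file (name = N'tempdev', size = 10240MB);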





Thursday, December 4, 2014

Setting up LZO compression on your Cloudera Hadoop cluster

LZO makes Impala faster because files compressed with LZO can be split up and processed by different nodes. If you use GZIP compression and write a single large file to HDFS, it still gets split across many HDFS blocks, but a GZIP file can't be decompressed in pieces, so only a single node can process any given compressed file. Since Hadoop and Impala are all about parallel processing, that slows things down… unless the file has been compressed with LZO.


Anyways, the docs were *kinda* clear on what to do, but they were still confusing about whether or not you have to muck around on the command line. Well, you don't. You can just do it all via Cloudera Manager and the Parcels system (assuming you chose to install using Parcels when you initially set up the cluster).


In short, the LZO compression features are contained in the GPL Extras parcel. Here’s the page on how to install it:


http://ift.tt/1zXhxsR


It’s not too clear, especially since they don’t list anything for the 5.2 release, which is what I have. I just used http://ift.tt/1vT806E without appending any version number to the end. So, later on when there’s 5.3 or something, you may need to put in http://ift.tt/1zXhxJ6 or something like that.


Specifically, I did the following to set it up and get the parcel installed:


1. Cloudera Manager > at the top-middle-rightish, click on the parcel icon (looks like a gift box) > Edit Settings at the top right > in “Remote Parcel Repository URLs” add a new entry and paste in “http://ift.tt/1vT806K”

2. Save changes

3. Restart Cluster

4. I had to redeploy client configurations after the cluster restarted (there were icons mentioning as much in the “Home” page of Cloudera manager).


Now, you can go to each of your nodes and run “yum install lzop -y”. Once that’s done, LZO should be magically available for things like filecrush.





Wednesday, December 3, 2014

Working with Partitioned Impala Tables

Figured I’d mention a few things I discovered that for some reason didn’t stand out to me in the docs.


If you have a Hive table that you’ve been dumping new data into, and you want to have your analysis tools pointed at an Impala/Parquet version of the data in that table, you can’t simply do a “create table mytable_parquet_partitioned like mytable partitioned by (year int) stored as parquet”.


You have to do it this way:


if mytable is structured like this:



mytable
(
  year int,
  day int,
  transactiontime timestamp,
  product string,
  user int,
  ip string
)


and you want to partition on year for your mytable_parquet_partitioned, then here is what your “create table” statement and your insert need to look like:




create table mytable_parquet_partitioned
(
  day int,
  transactiontime timestamp,
  product string,
  user int,
  ip string
)
partitioned by (year int)
stored as parquet;


Notice how you don't include the "year" column in the column list of the "create table" statement. Rather, you have to put it *only* in the "partitioned by" clause. Where does that "year" column end up in the table structure?




insert into mytable_parquet_partitioned
partition (year)
select
  day,
  transactiontime,
  product,
  user,
  ip,
  year
from mytable;


Notice how the year column is named in the partition clause and selected last: with a dynamic partition insert like this, Impala fills the partition key from the trailing column(s) of the select list.


The resulting table will be structured like this:



mytable_parquet_partitioned
(
  day int,
  transactiontime timestamp,
  product string,
  user int,
  ip string,
  year int
)


Notice how the partition column is at the end of the list of columns?
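
To sanity-check the result, you can describe the new table and list its partitions (standard Impala statements, using the table name from the example above):

describe mytable_parquet_partitioned;
show partitions mytable_parquet_partitioned;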





Tuesday, December 2, 2014

Multiple dfs directories on the same volume

You can do it, e.g. configure a DataNode to store data in multiple directories on the same drive (by listing more than one path on that drive in dfs.datanode.data.dir).


The catch is that your total cluster storage estimate then double-counts the available space on that volume: each DFS directory you have configured assumes it's the only one on the volume, so the free space gets summed once per directory.