Figured I’d mention a few things I discovered that for some reason didn’t stand out to me in the docs.
If you have a Hive table that you’ve been dumping new data into, and you want to have your analysis tools pointed at an Impala/Parquet version of the data in that table, you can’t simply do a “create table mytable_parquet_partitioned like mytable partitioned by (year int) stored as parquet”.
You have to do it this way:
if mytable is structured like this:
mytable
(
year int,
day int,
transactiontime timestamp,
product string,
user int,
ip string
)
and you want to partition on year for your mytable_parquet_partitioned, then here is what your “create table” statement and your insert needs to look like:
create table mytable_parquet_partitioned
(
day int,
transactiontime timestamp,
product string,
user int,
ip string
)
partitioned by (year int)
stored by parquet;
Notice how you don't include the "year" column in the column list of the "create table" statement. Rather, you have to put it *only* in the "partitioned by" clause. Where does that "year" column end up being in the table structure?
insert into mytable_parquet_partitioned
partition (year)
select
day int,
transactiontime timestamp,
product string,
user int,
ip string
from mytable
Notice how you do not select the year column from the source table. Rather, you specify it in the partition clause.
The resulting table will be structured like this:
mytable_parquet_partitioned
(
day int,
transactiontime timestamp,
product string,
user int,
ip string,
year int
)
Notice how the partition column is at the end of the list of columns?