Sunday, December 04, 2011

Accessing recursive Apache Hive partitions in CDH3

In this post, I describe the minor Hadoop (0.20.2-cdh3u2) patches required to access data deep inside a multi-level directory structure using Hive 0.7. Consider the following directory structure:
Logs/
    2011_01
        01
        02
        ..
        31
    2011_02
        01
        02
        ..
        28
We want to issue Hive queries against individual days as well as whole months. To access individual days, we define one Hive partition per day. For example, we define a partition 2011_01_02 with LOCATION Logs/2011_01/02. To access the whole month of 2011_01, we define a partition 2011_01 with LOCATION Logs/2011_01. However, if you query the 2011_01 partition, you will get no results. This is because Hadoop 0.20.2 does not list input directories recursively, so the files inside the day subdirectories are never picked up.
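As a rough sketch, the two kinds of partition described above could be declared like this. The table name `logs`, the column `line`, and the partition column `dt` are hypothetical and only serve the illustration. The LOCATION values are copied from the text as written; in practice you will probably need the full HDFS path.

```sql
-- Hypothetical table: the names `logs`, `line` and `dt` are illustrative only.
CREATE EXTERNAL TABLE logs (line STRING)
PARTITIONED BY (dt STRING);

-- One partition per day, pointing at that day's directory.
ALTER TABLE logs ADD PARTITION (dt = '2011_01_02') LOCATION 'Logs/2011_01/02';

-- One partition for the whole month, pointing at the parent directory.
ALTER TABLE logs ADD PARTITION (dt = '2011_01') LOCATION 'Logs/2011_01';
```

Note that the month partition overlaps the day partitions, so with a layout like this every query should filter on the partition column. Otherwise, once recursive listing works, the same files would be read through both partitions.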
To get this monthly query working, you must first apply the following patch (based on MAPREDUCE-1501, which did not make it into Hadoop 0.20.2) to Hadoop 0.20.2-cdh3u2. After applying the patch, compile Hadoop and point HADOOP_HOME on the machine running the Hive client to the patched Hadoop jars. You do NOT have to replace the Hadoop jars on the Hadoop cluster; the recursive directory listing feature is only needed by the Hive client.


In addition to the patched jars, you should also add the following lines to your hive-site.xml:
<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
After this, querying the 2011_01 partition works as expected.
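For illustration, assuming a hypothetical table `logs` partitioned by a string column `dt` (neither name comes from the original setup), the monthly and daily queries might look like this:

```sql
-- Whole month: served by the 2011_01 partition, which needs recursive listing.
SELECT COUNT(*) FROM logs WHERE dt = '2011_01';

-- Single day: served by the 2011_01_02 partition, which works without the patch.
SELECT COUNT(*) FROM logs WHERE dt = '2011_01_02';
```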
