
Sunday, December 04, 2011

Accessing recursive Apache Hive partitions in CDH3

In this post, I describe the minor Hadoop (0.20.2-cdh3u2) patch required to access data nested deep inside a multi-level directory structure using Hive 0.7. Consider the following directory structure:
Logs/
    2011_01
        01
        02
        ..
        31
    2011_02
        01
        02
        ..
        28
We want to issue Hive queries involving individual days as well as whole months. For accessing individual days, we define one Hive partition per day. For example, we define a partition 2011_01_02 with LOCATION Logs/2011_01/02. To access the whole month of 2011_01, we define a partition 2011_01 with LOCATION Logs/2011_01. However, if you query the 2011_01 partition, you will get no results, because Hadoop 0.20.2 does not support recursive directory listing.
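For illustration, the two kinds of partitions described above could be declared roughly like this (a sketch: the table name logs and the partition column dt are hypothetical placeholders, not from the original post):

```sql
-- Daily partition: LOCATION points at a single day's directory.
ALTER TABLE logs ADD PARTITION (dt='2011_01_02')
LOCATION '/Logs/2011_01/02';

-- Monthly partition: LOCATION points at the month directory, whose
-- contents are one subdirectory per day. Reading this partition is
-- what requires recursive input-directory listing.
ALTER TABLE logs ADD PARTITION (dt='2011_01')
LOCATION '/Logs/2011_01';
```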
In order to get this monthly query working, you must first apply the following patch (based on MAPREDUCE-1501, which did not make it into Hadoop 0.20.2) to Hadoop 0.20.2-cdh3u2. After applying the patch, compile Hadoop and point the HADOOP_HOME on the machine running the Hive client to the patched Hadoop jars. You do NOT have to replace the Hadoop jars on the Hadoop cluster; the recursive directory listing feature is only needed by the Hive client.


In addition to the patched jars, you should also add the following lines to your hive-site.xml:
<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
After this, querying the 2011_01 partition will work fine.
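If you would rather not edit hive-site.xml, the same property can usually be set per session from the Hive CLI instead; this is a sketch (the table and partition names are the hypothetical ones from above):

```sql
-- Enable recursive input-directory listing for this session only.
SET mapred.input.dir.recursive=true;

-- The monthly partition now picks up files in the day subdirectories.
SELECT COUNT(*) FROM logs WHERE dt='2011_01';
```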

Tuesday, January 11, 2011

How to avoid excessive logging when using Hive JDBC

By default, the Hive JDBC driver outputs a HUGE amount of log messages at the INFO log level, which makes it very hard to spot the log statements you added yourself. I spent quite some time trying to change the log level to ERROR by tinkering with the hive-log4j.properties and hive-exec-log4j.properties files, without success: the Hive JDBC driver does not seem to read these log4j config files. Hence, in order to change the log level, you must add the following lines near the beginning of your program:

# Jython: raise the log4j root logger's threshold so that the
# Hive JDBC driver's INFO messages are suppressed.
from org.apache.log4j import Logger
from org.apache.log4j import Level

rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)

Note: The above code is in Jython.  Something very similar will work in Java.
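For reference, here is one way the Java version might look (a sketch; it assumes log4j 1.x is on the classpath, which it is whenever the Hive JDBC driver is, and the class name is a placeholder):

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class QuietHiveJdbc {
    public static void main(String[] args) {
        // Raise the root logger's threshold before touching JDBC, so
        // the Hive driver's INFO chatter never reaches the console.
        Logger.getRootLogger().setLevel(Level.ERROR);

        // ... open the Hive JDBC connection and run queries here ...
    }
}
```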