Sunday, December 04, 2011

Accessing recursive Apache Hive partitions in CDH3

In this post, I describe the minor Hadoop (0.20.2-cdh3u2) patches required to access data deep inside a multi-level directory structure using Hive 0.7. Consider the following directory structure:
Logs/
    2011_01
        01
        02
        ..
        31
    2011_02
        01
        02
        ..
        28
We want to issue Hive queries involving individual days as well as whole months. For individual days, we define one Hive partition per day. For example, we define a partition 2011_01_02 with LOCATION Logs/2011_01/02. For the whole month of 2011_01, we define a partition 2011_01 with LOCATION Logs/2011_01. However, if you query the 2011_01 partition, you will get no results. The data files sit one level down, in the day subdirectories, and Hadoop 0.20.2 does not support recursive directory listing.
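The partition layout described above could be generated with a small script along these lines. This is only a sketch in Python 3: the table name `logs` and the partition column `period` are hypothetical (the post does not name them), and the LOCATION values simply reuse the relative paths from the example, which you may need to replace with full HDFS paths.

```python
# Sketch: build Hive DDL for one partition per day plus one per month.
# The table name "logs" and partition column "period" are hypothetical.

def add_partition_stmt(table, column, value, location):
    """Return an ALTER TABLE ... ADD PARTITION statement as a string."""
    return "ALTER TABLE %s ADD PARTITION (%s='%s') LOCATION '%s';" % (
        table, column, value, location)


def partition_statements(table, column, root, months):
    """months maps a month directory name (e.g. '2011_01') to its day count."""
    stmts = []
    for month in sorted(months):
        # Month-level partition pointing at the parent directory.
        stmts.append(add_partition_stmt(
            table, column, month, "%s/%s" % (root, month)))
        for day in range(1, months[month] + 1):
            # Day-level partition, e.g. 2011_01_02 -> Logs/2011_01/02.
            stmts.append(add_partition_stmt(
                table, column, "%s_%02d" % (month, day),
                "%s/%s/%02d" % (root, month, day)))
    return stmts


if __name__ == "__main__":
    months = {"2011_01": 31, "2011_02": 28}
    for stmt in partition_statements("logs", "period", "Logs", months):
        print(stmt)
```

The output can be pasted into the Hive CLI or saved to a file and run with `hive -f`.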
To get this monthly query working, you must first apply the following patch (based on MAPREDUCE-1501, which did not make it into Hadoop 0.20.2) to Hadoop 0.20.2-cdh3u2. After applying the patch, compile Hadoop and point HADOOP_HOME on the machine running the Hive client to the patched Hadoop jars. You do NOT have to replace the Hadoop jars on the cluster; the recursive directory listing feature is only needed by the Hive client.


In addition to the patched jars, you should also add the following lines to your hive-site.xml:
<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
After this, querying the 2011_01 partition will work fine.
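To see why the month-level partition comes back empty without recursive listing, here is a local-filesystem analogy in Python 3. It is not Hadoop's code, only an illustration of the difference between listing a directory and walking it; the file name `part-0` is a made-up placeholder.

```python
import os
import tempfile


def flat_files(directory):
    """Files directly inside directory (no descent into subdirectories)."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name)))


def recursive_files(directory):
    """Files anywhere below directory, as paths relative to it."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append(os.path.relpath(full_path, directory))
    return sorted(found)


def make_sample_month(root):
    """Create Logs/2011_01/{01,02}/part-0 under root (hypothetical names)."""
    month = os.path.join(root, "Logs", "2011_01")
    for day in ("01", "02"):
        os.makedirs(os.path.join(month, day))
        with open(os.path.join(month, day, "part-0"), "w") as handle:
            handle.write("sample\n")
    return month


if __name__ == "__main__":
    month_dir = make_sample_month(tempfile.mkdtemp())
    # The month directory itself holds only subdirectories, no files.
    print(flat_files(month_dir))
    # The day files show up only when the listing descends.
    print(recursive_files(month_dir))
```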

Thursday, November 24, 2011

Quickly find and open files

I frequently need to find a file that is located deep within the current directory and operate on it, for example by opening it in vim or running svn diff on it. I can never remember the exact path to the file, and sometimes I can't even remember its full name. All I know is that the file is somewhere within the current directory or its sub-directories. So I end up running the UNIX find command and then copying and pasting the returned file path into the command of interest. This wastes time, so I wrote a small Python script to make it easier.

Copy the script f.py (located at the end of the post) into some directory that is on your PATH. Suppose you are looking for a file whose name starts with Foo. Quote the pattern so that the shell passes it to the script unexpanded, and run:

$ f.py "Foo*"
1) ./subdir1/subdir2/Foo1.java
2) ./subdir1/Foo.java
Enter file number:

Enter the number of the file you are interested in.  That will bring up the following menu of operations you can perform on the selected file.

Process ./subdir1/subdir2/Foo1.java
1. vim
2. emacs
3. svn add
4. svn diff
5. open (OSX only)
Enter choice (Default is 1):

If the pattern you specify matches only a single file, the script jumps directly to the operation selection menu. I hope this saves you some keystrokes.

#!/usr/bin/python
# This program is used to easily locate a file matching
# the user specified pattern within the current directory
# and to quickly perform some common operations (like
# opening it in vim) against it.
# Written for Python 2 (print statements, raw_input).
import subprocess
import sys
import os

def processFile(fileName):
    """
    Show the user the possible actions with the specified file,
    and prompt the user to select a particular action.

    """

    fileName = fileName.strip()
    print "Process %s" % fileName
    print "1. vim"
    print "2. emacs"
    print "3. svn add"
    print "4. svn diff"
    print "5. open (OSX only)"

    choice = raw_input("Enter choice (Default is 1):").strip()
    
    if choice == "1" or choice == "":
        cmd = "vim %s" % fileName
    elif choice == "2":
        cmd = "emacs %s" % fileName
    elif choice == "3":
        cmd = "svn add %s" % fileName
    elif choice == "4":
        cmd = "svn diff %s" % fileName
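    elif choice == "5":
        cmd = "open %s" % fileName
    else:
        # Unrecognized choice: stop here rather than use an undefined cmd
        print "Invalid choice: %s" % choice
        return
    print cmd
    os.system(cmd)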


def listFiles(fileNames):
    """ 
    Show the list of files and prompt user to select one 
    """

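    for fileIndex, fileName in enumerate(fileNames, 1):
        print "%d) %s" % (fileIndex, fileName.strip())
    choice = raw_input("Enter file number:").strip()
    # Reject non-numeric or out-of-range input instead of crashing
    if not choice.isdigit() or not 1 <= int(choice) <= len(fileNames):
        print "Invalid file number: %s" % choice
        return
    chosenFileName = fileNames[int(choice)-1].strip()
    processFile(chosenFileName)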


if __name__ == "__main__":

    if len(sys.argv) < 2:
        print "Usage: f.py FILE_PATTERN_OF_INTEREST"
        sys.exit(1)

    pattern = sys.argv[1]
    # Exclude only paths inside .svn directories, not every path containing "svn"
    proc = subprocess.Popen("find . -name \"%s\" | grep -v \"/\\.svn/\"" % pattern,
        shell=True, stdout=subprocess.PIPE)
    lines = proc.stdout.readlines()
    if len(lines) == 0:
        print "No matching files found. Note you can use wild cards like *"
    elif len(lines) == 1:
        processFile(lines[0])
    else:
        listFiles(lines)
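For readers who would rather not shell out to find and grep, the search step could be sketched in Python 3 with only the standard library. This is an alternative to the approach in f.py, not a drop-in replacement: the function name `find_files` is made up for illustration, and unlike `find -name` it reports regular files only, not matching directories.

```python
import fnmatch
import os
import sys


def find_files(pattern, root="."):
    """Return paths under root whose base name matches the shell-style pattern."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .svn directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d != ".svn"]
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        for index, path in enumerate(find_files(sys.argv[1]), 1):
            print("%d) %s" % (index, path))
```

Because the pattern never reaches a shell, it does not need escaping inside the script, although it still has to be quoted on the command line.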


Wednesday, November 09, 2011

Automatically create Eclipse projects for your Scala projects using sbt

In my previous post, I described the bare minimum you need to know to get started using sbt. In this post, I will describe how sbt can automatically create the .project and .classpath files that Eclipse needs in order to load your project. Using sbt to create your Eclipse project files ensures that they are always in sync with your build definition. And of course, it saves many of the clicks and keystrokes needed to specify classpaths manually in Eclipse.

Step 1
Install the Eclipse plugin for Scala, if you don't already have it installed.
Step 2
Add the following lines to myproject/project/plugins/build.sbt. This tells sbt that you want to use the sbteclipse plugin. Note that this build.sbt is different from the build.sbt at the top level of your project, i.e. in the myproject/ directory.
resolvers += Classpaths.typesafeResolver

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse" % "1.4.0")
Step 3
From the myproject/ directory, type
sbt "eclipse create-src"
This step will create the .project and .classpath files required by Eclipse inside the myproject/ directory.
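If you want to check what ended up in the generated .classpath before importing it, a small Python 3 sketch like the one below can list its entries. It assumes the usual Eclipse layout of `<classpathentry kind="..." path="..."/>` elements; the sample document is hypothetical, and the entries sbteclipse writes for a real project will differ.

```python
import xml.etree.ElementTree as ET


def classpath_entries(classpath_xml):
    """Return (kind, path) pairs for each <classpathentry> in the document."""
    root = ET.fromstring(classpath_xml)
    return [(entry.get("kind"), entry.get("path"))
            for entry in root.iter("classpathentry")]


# Hypothetical sample, not actual sbteclipse output.
SAMPLE = """<classpath>
  <classpathentry kind="src" path="src/main/scala"/>
  <classpathentry kind="lib" path="lib/example.jar"/>
  <classpathentry kind="output" path="target/classes"/>
</classpath>
"""

if __name__ == "__main__":
    for kind, path in classpath_entries(SAMPLE):
        print("%-6s %s" % (kind, path))
```

To inspect a real file, read myproject/.classpath with `ET.parse(...)` and iterate over `getroot()` in the same way.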
Step 4
Import the project into Eclipse. Follow the menu File > Import > General > Existing Projects into Workspace and browse to myproject.