Sunday, December 04, 2011

Accessing recursive Apache Hive partitions in CDH3

In this post, I describe the minor Hadoop (0.20.2-cdh3u2) patches required to access data deep inside a multi-level directory structure using Hive 0.7. Consider the following directory structure:
Logs/
    2011_01
        01
        02
        ..
        31
    2011_02
        01
        02
        ..
        28
We want to issue Hive queries involving individual days as well as whole months. For accessing individual days, we define one Hive partition per day. For example, we define a partition 2011_01_02 with LOCATION Logs/2011_01/02. To access the whole month of 2011_01, we define a partition 2011_01 with LOCATION Logs/2011_01. However, if you query the 2011_01 partition, you will get no results. This is because Hadoop 0.20.2 does not support recursive directory listing: the month directory contains only day directories, and the files inside them are never picked up.
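The partition layout described above can be sketched as a small generator of Hive ALTER TABLE ... ADD PARTITION statements. This is only an illustration: the table name `logs` and the partition column `dt` are hypothetical, since the post does not show the actual table definition.

```python
import calendar


def partition_statements(year, month, table="logs", column="dt", root="Logs"):
    """Return ADD PARTITION statements for one month and each of its days.

    The table name, partition column and root directory are hypothetical
    placeholders; adjust them to match your own table definition.
    """
    month_name = "%04d_%02d" % (year, month)
    template = "ALTER TABLE %s ADD PARTITION (%s='%s') LOCATION '%s';"

    # One partition covering the whole month directory.
    statements = [template % (table, column, month_name,
                              "%s/%s" % (root, month_name))]

    # One partition per day directory inside the month.
    days_in_month = calendar.monthrange(year, month)[1]
    for day in range(1, days_in_month + 1):
        day_name = "%s_%02d" % (month_name, day)
        location = "%s/%s/%02d" % (root, month_name, day)
        statements.append(template % (table, column, day_name, location))
    return statements


if __name__ == "__main__":
    # Print the month partition and the first two day partitions.
    for statement in partition_statements(2011, 1)[:3]:
        print(statement)
```

The generated statements can be pasted into the Hive command line. The month-level partition is the one that needs the recursive listing support discussed in this post.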
In order to get this monthly query working, you must first apply the following patch (based on MAPREDUCE-1501, which did not make it into Hadoop 0.20.2) to Hadoop 0.20.2-cdh3u2. After applying the patch, compile Hadoop and point HADOOP_HOME on the machine running the Hive client to the patched Hadoop jars. You do NOT have to replace the Hadoop jars on the Hadoop cluster; the recursive directory listing feature is only needed by the Hive client.


In addition to the patched jars, you should also add the following lines to your hive-site.xml:
<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
After this, querying the 2011_01 partition returns the data from all of its day directories.
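To see why the monthly partition is empty without recursive listing, here is a rough local-filesystem analogy in Python. It is not Hadoop code, and the file name part-00000 is just a hypothetical sample; it only contrasts listing the direct children of a directory with walking the whole tree.

```python
import os
import tempfile


def list_files_flat(directory):
    """Return only the files directly inside directory (no recursion)."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def list_files_recursive(directory):
    """Return every file below directory, as paths relative to it."""
    found = []
    for current, _subdirs, files in os.walk(directory):
        for name in files:
            full_path = os.path.join(current, name)
            found.append(os.path.relpath(full_path, directory))
    return sorted(found)


def build_sample_month(root):
    """Create root/2011_01/{01,02}/part-00000 and return the month directory."""
    month_dir = os.path.join(root, "2011_01")
    for day in ("01", "02"):
        day_dir = os.path.join(month_dir, day)
        os.makedirs(day_dir)
        with open(os.path.join(day_dir, "part-00000"), "w") as handle:
            handle.write("sample log line\n")
    return month_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        month_dir = build_sample_month(root)
        # Flat listing sees only the day directories, so it finds no files.
        print(list_files_flat(month_dir))
        # Recursive listing descends into each day directory.
        print(list_files_recursive(month_dir))
```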

Thursday, November 24, 2011

Quickly find and open files

I frequently need to find a file that is located deep within the current directory and operate on it -- like opening it in vim or running svn diff on it.  I can never remember the exact path to the file, and sometimes can't even remember the full name.  All I know is that the file is somewhere within the current directory and its sub-directories.  So, I end up running the UNIX find command, and then copy-pasting the returned file path into the command of interest.  This wastes time.  So I wrote a small Python script to make it easier.

Copy the script f.py (located at the end of the post) into some directory that is on your PATH.  Suppose you are looking for a file whose name starts with Foo.  Just run the following, quoting the pattern so that the shell passes it to the script unexpanded:

$ f.py "Foo*"
1) ./subdir1/subdir2/Foo1.java
2) ./subdir1/Foo.java
Enter file number:

Enter the number of the file you are interested in.  That will bring up the following menu of operations you can perform on the selected file.

Process ./subdir1/subdir2/Foo1.java
1. vim
2. emacs
3. svn add
4. svn diff
5. open (OSX only)
Enter choice (Default is 1):

If the pattern you specify matches only a single file, the script directly jumps to the operation selection menu.  Hope this will save you some key-strokes.

#!/usr/bin/python
# This program is used to easily locate a file matching 
# the user specified pattern within the current directory
# and to quickly perform some common operations (like
# opening it in vim) against it.
import subprocess
import sys
import os

def processFile(fileName):
    """
    Show the user the possible actions with the specified file,
    and prompt the user to select a particular action.

    """

    fileName = fileName.strip()
    print "Process %s" % fileName
    print "1. vim"
    print "2. emacs"
    print "3. svn add"
    print "4. svn diff"
    print "5. open (OSX only)"

    choice = raw_input("Enter choice (Default is 1):").strip()

    # Map each menu choice to its command; an empty answer means vim.
    commands = {
        "": ["vim"],
        "1": ["vim"],
        "2": ["emacs"],
        "3": ["svn", "add"],
        "4": ["svn", "diff"],
        "5": ["open"],
    }
    if choice not in commands:
        print "Invalid choice: %s" % choice
        return
    cmd = commands[choice] + [fileName]
    print " ".join(cmd)
    # Pass an argument list (no shell) so that file names containing
    # spaces or shell metacharacters are handled correctly.
    subprocess.call(cmd)


def listFiles(fileNames):
    """ 
    Show the list of files and prompt user to select one 
    """

    fileIndex = 1
    for fileName in fileNames:
        print "%d) %s" % (fileIndex, fileName.strip())
        fileIndex += 1
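    choice = raw_input("Enter file number:").strip()
    # Reject anything that is not a number in the listed range.
    if not choice.isdigit() or not 1 <= int(choice) <= len(fileNames):
        print "Invalid file number: %s" % choice
        return
    chosenFileName = fileNames[int(choice)-1].strip()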
    processFile(chosenFileName)


if __name__ == "__main__":

    if len(sys.argv) < 2:
        print "Usage: f.py FILE_PATTERN_OF_INTEREST"
        sys.exit(-1)

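    pattern = sys.argv[1]
    # Run find without a shell so the pattern is passed through verbatim,
    # and skip only files that live inside .svn directories.
    proc = subprocess.Popen(
        ["find", ".", "-name", pattern, "!", "-path", "*/.svn/*"],
        stdout=subprocess.PIPE)
    lines = proc.stdout.readlines()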
    if len(lines) == 0:
        print "No matching files found. Note you can use wild cards like *"
    elif len(lines) == 1:
        processFile(lines[0])
    else:
        listFiles(lines)


Wednesday, November 09, 2011

Automatically create Eclipse projects for your Scala projects using sbt

In my previous post, I described the bare minimum you need to know to get started using sbt.  In this post, I will describe how sbt can be used to automatically create the .project and .classpath files you need for loading your project into the Eclipse IDE. Using sbt to create your Eclipse project files ensures that they are always in sync with your build definition. And of course, it saves a lot of the clicks or keystrokes needed to manually specify classpaths in Eclipse.

Step 1
Install the Eclipse plugin for Scala, if you don't already have it installed.
Step 2
Add the following lines to myproject/project/plugins/build.sbt. This tells sbt that you want to use the sbteclipse plugin. Note that this build.sbt is different from the build.sbt at the top level of your project, i.e. in myproject/ directory.
resolvers += Classpaths.typesafeResolver

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse" % "1.4.0")
Step 3
From the myproject/ directory, type
sbt "eclipse create-src"
This step will create the .project and .classpath files required by Eclipse inside the myproject/ directory.
Step 4
Import the project into Eclipse. Follow the menu File > Import > General > Existing Projects into Workspace and browse to myproject.

Tuesday, November 08, 2011

Quick Start to using Scala Simple Build Tool (sbt)

After you finish a basic Hello World program in Scala and want to start your first real Scala project, you will need to choose a build tool.  While you can use tools like ant or maven, Simple Build Tool (sbt for short) is a very popular option among Scala programmers.  In this post, I describe the bare minimum you need to know to quickly get started with sbt.

Step 1: Install sbt
On a Mac it is as simple as 
sudo port install sbt
or, if you are using homebrew,
brew install sbt
For other operating systems, please see the official Getting Started Setup page.

Step 2: Create your project's directory structure
myproject/
    src/
        main/
            scala/
            java/
            resources/
        test/
            scala/
            java/
            resources/
myproject is your project's top-level directory. The resources directories contain non-code files that are packaged up together with your project, like image or data files.

Step 3: Start writing your project's code
Let's just use a Hello World program.  Create myproject/src/main/scala/HelloWorld.scala that contains the following code:
import org.slf4j._

object HelloWorld {
    def main(args: Array[String]) = {
        val logger:Logger = LoggerFactory.getLogger("MyLogger");
        logger.info("Hello World");
    }
}
We use the slf4j logging library instead of a simple println in order to demonstrate how external dependencies are specified in sbt.

Step 4: Create a build definition file
Put the following lines in myproject/build.sbt. Note that the blank lines below ARE absolutely necessary.
name := "Hello World"

version := "1.0"

scalaVersion := "2.9.1"

libraryDependencies ++= Seq(
  "org.slf4j" % "slf4j-api" % "1.6.4",
  "org.slf4j" % "slf4j-simple" % "1.6.4"
)
The libraryDependencies setting specifies the managed dependencies, i.e., the dependencies which are automatically downloaded for you from the Maven repositories. A dependency is specified as groupId % artifactId % revision. Conventions for groupId, artifactId and revision are discussed at http://maven.apache.org/guides/mini/guide-naming-conventions.html. The automatically downloaded dependencies are usually stored in ~/.ivy2/cache.
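To make the groupId % artifactId % revision triple more concrete, here is a small Python sketch of how such coordinates map onto the standard Maven repository layout. It only illustrates the naming convention; it is not how sbt itself resolves dependencies (sbt goes through Ivy), and the use of Maven Central as the repository URL is an assumption.

```python
def maven_jar_url(group_id, artifact_id, revision,
                  repository="https://repo1.maven.org/maven2"):
    """Build the URL of a jar under the standard Maven repository layout.

    The default repository (Maven Central) is an assumption; any
    repository that follows the same layout works the same way.
    """
    # Dots in the groupId become directory separators.
    group_path = group_id.replace(".", "/")
    jar_name = "%s-%s.jar" % (artifact_id, revision)
    return "/".join([repository, group_path, artifact_id, revision, jar_name])


if __name__ == "__main__":
    print(maven_jar_url("org.slf4j", "slf4j-api", "1.6.4"))
```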

You don't have to use managed dependencies from Maven repositories if you don't want to. Instead you can use unmanaged dependencies. Just put the appropriate jars in myproject/lib, and don't specify them in the libraryDependencies setting.

Step 5: Compile, run and package your program
cd myproject
sbt run
The first time you run sbt, it will download all the dependencies (sometimes including the scala version specified in the scalaVersion setting). This means that it will take some time. Subsequent runs will be much faster.

If you just want to compile,  type:
sbt compile
You can set up sbt to automatically compile your program as soon as any source file changes.  To do so, type:
sbt ~compile
Continuous compilation is a great time-saver.

To package your program for distribution as a jar, type:
sbt package
A jar containing all your compiled classes and resources from myproject/src/main/resources will be found in myproject/target/scala-YOUR_SCALA_VERSION_HERE. Note that dependencies are NOT included in the jar. To include all dependencies into the output jar, you will need to use the assembly plugin. See the next step for pointers to info about plugins.
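If you want to confirm what ended up in the packaged jar, a jar is just a zip archive, so a few lines of Python can inspect it. This is a minimal sketch; the script name check_jar.py used below is hypothetical, and you pass the path of your own jar on the command line.

```python
import sys
import zipfile


def jar_contains(jar_path, path_prefix):
    """Return True if any entry in the jar starts with path_prefix."""
    with zipfile.ZipFile(jar_path) as jar:
        return any(name.startswith(path_prefix) for name in jar.namelist())


if __name__ == "__main__" and len(sys.argv) == 2:
    # Expect True for your own classes and False for dependency classes.
    print("HelloWorld classes present: %s" % jar_contains(sys.argv[1], "HelloWorld"))
    print("slf4j classes present: %s" % jar_contains(sys.argv[1], "org/slf4j/"))
```

Run it as python check_jar.py followed by the path to the jar under myproject/target. For a jar built with sbt package, the slf4j check should report False, which matches the note above that dependencies are not included.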

Step 6: Read the official Getting Started Guide
This post only aims to get you quickly started on sbt.  There are tons of features it does not cover -- multiple projects and plugins, just to name two very important ones.  To learn how to use these features and to understand the fundamental principles behind sbt (which is in fact just a Scala Domain Specific Language), please read the Official Getting Started Guide.  It is long and sometimes too deep, but very useful indeed.

Sunday, July 10, 2011

JQueryMobile app with Facebook integration and a Ruby on Rails backend

I am currently playing around with mobile app development.  I want to develop a mobile app that:

  1. Works across multiple devices (iPhone, Android, etc)
  2. Integrates with Facebook
  3. Has a Ruby on Rails backend to store app-specific data
I have implemented a toy jQueryMobile app to demonstrate the implementation of the above requirements. The toy app simply asks you to log in with your Facebook credentials and lists your friends.  You can try it out by visiting http://jqmfbror.heroku.com using your mobile or desktop web browser.  The full source code for the app is available at https://github.com/dilip/jqmfbror.

Facebook authentication happens at the Ruby on Rails backend using the Omniauth gem.  This means that the backend can uniquely identify repeat visitors and associate app-specific data with them.  The backend can use the Facebook access token generated on login to retrieve the user's Facebook data, and pass it on for display on the frontend.  Alternatively, the backend can simply pass the Facebook access token to the frontend, which then uses JSONP to retrieve the user's Facebook data.  The latter approach is more efficient as it decreases the load on the backend.  The toy app uses this approach.  
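As a rough illustration of the two approaches, the sketch below builds the Graph API URL that either side would request. It is written in Python rather than taken from the toy app's source, and it assumes the Graph API as it worked around the time of this post: a /me/friends endpoint, an access_token query parameter, and an optional callback parameter that wraps the response for JSONP. Treat all three as assumptions to check against the current Facebook documentation.

```python
from urllib.parse import urlencode

# Assumed Graph API root, as used around the time this post was written.
GRAPH_API_ROOT = "https://graph.facebook.com"


def friends_url(access_token, jsonp_callback=None):
    """Build the URL for the logged-in user's friend list.

    Without jsonp_callback this is what a backend would fetch directly.
    With jsonp_callback set, the frontend can load the URL in a script
    tag and receive the data through the named JavaScript function.
    """
    params = [("access_token", access_token)]
    if jsonp_callback is not None:
        params.append(("callback", jsonp_callback))
    return "%s/me/friends?%s" % (GRAPH_API_ROOT, urlencode(params))


if __name__ == "__main__":
    # TOKEN and showFriends are placeholders, not real values.
    print(friends_url("TOKEN"))
    print(friends_url("TOKEN", jsonp_callback="showFriends"))
```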

The app also uses MongoDB (instead of MySQL) and Heroku (for super-easy hosting of Ruby on Rails applications).

Thursday, June 23, 2011

Infographic Resume Creator

A few months ago, I came across some really cool infographic resumes - http://blog.chumbonus.com/infographic-resumes/.  Since I didn't have the Photoshop/Illustrator skills to make my own, I decided to make a web app to automatically create a basic infographic resume from my LinkedIn profile.  Thus http://inforesume.heroku.com was born.  All you need to do is log in with LinkedIn.  Do check it out.

If you want a fancier infographic resume, http://vizualize.me/ will probably be useful.  It's a startup announced on Hacker News just yesterday.

Monday, May 23, 2011

Craigslist Car Finder

I bought my first car in 2006 through Craigslist.  Searching for a used car on Craigslist was a pain -- there was too much junk.  I found myself wasting 2-3 hours every day sifting through Craigslist postings looking for good deals. So I wrote a script to automatically parse Craigslist car ads and show me only the ones that might interest me.  This script helped me and at least four of my friends find great deals on Craigslist.  I hope it is useful to you too.

The script has the following features:
  • Allows you to look for only specific car models.
  • Allows you to ignore coupes.
  • Allows you to specify price and year range.
  • Allows you to ignore manual transmission cars.
  • Compares the price of the vehicle with the Edmunds True Market Value price and highlights good deals.
  • Highlights cars with low mileage.
Here is the script: craigslist.tar.gz.
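The filtering and highlighting rules listed above can be sketched in a few lines of Python. This is not the original Perl script: the CarAd fields, the 60000-mile threshold and the sample values are all hypothetical, and both the ad parsing and the Edmunds lookup are left out (the market value is simply passed in).

```python
from collections import namedtuple

# Hypothetical record for one parsed ad; the real script's fields may differ.
CarAd = namedtuple("CarAd", "model year price miles body transmission")


def is_interesting(ad, models, min_year, max_year, min_price, max_price):
    """Apply the model, coupe, transmission, year and price filters."""
    return (
        ad.model in models
        and ad.body != "coupe"
        and ad.transmission != "manual"
        and min_year <= ad.year <= max_year
        and min_price <= ad.price <= max_price
    )


def highlights(ad, market_value, low_miles=60000):
    """Return the reasons, if any, why this ad deserves a closer look."""
    notes = []
    if ad.price < market_value:
        notes.append("below market value")
    if ad.miles < low_miles:
        notes.append("low miles")
    return notes


if __name__ == "__main__":
    # Made-up sample ad and market value, for illustration only.
    ad = CarAd("Civic", 2003, 7500, 52000, "sedan", "automatic")
    if is_interesting(ad, {"Civic", "Corolla"}, 2000, 2005, 5000, 9000):
        print(ad, highlights(ad, market_value=8200))
```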

In 2006, Perl was the only scripting language I knew. It was an absolute horror to do object-oriented programming in Perl.  I wish I had known Python and Ruby back then.

If you find the script useful or find bugs, please do send me an email.

Saturday, April 02, 2011

dealwall.me : Procrastinating about taxes

Last weekend, I was supposed to do my taxes.  But, I found a good way to procrastinate -- build a web application.  The result was dealwall.me, a site where you can boast about the Groupon deals you bagged. This weekend, I am procrastinating by blogging about it.


My wife and I are huge fans of Groupon.  Through Groupon, we have tried numerous great restaurants, taken flying lessons, taken swimming lessons, biked the bay, etc.  We thought it would be cool to blog about our Groupon adventures.  Why make a static page, when you can make a web app!  And thus dealwall.me was born.  Please check out our deal wall at http://dealwall.me/4d91637b4570a36150000001.


"Build" is not really the right word to describe my activities over the weekend.  It's more like cobble together.  There are so many fantastic technologies and libraries available today, that putting together a simple web application is quick, simple and frustration-free.  In the rest of this article, I will describe the technologies/libraries that helped me put together dealwall.me over a weekend.
    Where should I store my code? 
    Before writing any code, I needed a version control system.  I went with Mercurial hosted on Bitbucket.  The main reason I went with Mercurial/Bitbucket instead of git/github is that Bitbucket offers free private repositories.  I am extremely happy with Mercurial/Bitbucket, and do not miss git at all.


    What web framework should I use?  
    I went with Ruby on Rails.  I had played with Ruby on Rails a couple of years ago when I was a grad student.  These days, I do a lot of programming in Python/Django at my day job. I like Ruby on Rails a lot more than Django.  Ruby on Rails just looks much nicer and has better libraries, which mitigate a lot of programming frustration.


    Where do I store data?  
    A year ago, the answer was obvious - a relational database like MySQL.  Today, there are many NoSQL choices, which are very apt for web apps like the one I was cobbling together.  I went with MongoDB, mainly because of the very positive experience I had with it at work.  It was super simple to install and use.
    I did consider building a CouchApp using CouchDB.  It  probably would have been a good fit for my simple app.  However, I wanted to finish the app over the weekend, and the learning curve associated with CouchDB would not have made that possible (based on my experience twiddling with CouchApps a few months ago).

    What object relational mapper should I use?  
    If I were using SQL with Rails, the most obvious choice would have been Rails' own ActiveRecord.  But I was using MongoDB.  I chose the lightweight Mongomatic instead of the more heavyweight Mongoid and MongoMapper ORMs. Mongomatic is a very thin layer on top of the MongoDB Ruby driver.  Since this was my first MongoDB + Ruby on Rails app, I wanted to have full control over how and when data is stored and accessed.  Mongomatic is very simple to use, and works great for my app's very simple model -- a wall which has a list of deals within it.

    How do I make my web app look pretty?
    I am not good at making websites look professional.  So I needed all the help I could get. That's where Blueprint, Compass and Fancy-Buttons provide a lot of support.  Blueprint is a CSS framework that provides a grid on which you can easily align various page elements.  Along with nice-looking default typography and sane style defaults, Blueprint gives you a jumpstart towards building a professional-looking website.

    Compass is a stylesheet authoring framework that makes it easy to organize your CSS.  No more CSS files containing hundreds of lines.  Instead, Compass allows you to break up your CSS into smaller re-usable parts called plugins, and to compose them into bigger files just like you use #include in C++.  Fancy-buttons is a compass plugin that provides neat looking CSS buttons.  See http://codefastdieyoung.com/2011/03/want-to-move-fast-just-do-this-part-1-design for an excellent article on how to quickly design a professional looking web app.

    How do I make my app social?
    Super simple.  Copy some Javascript from the Facebook dev site to embed Facebook Like buttons and the new Facebook commenting system.  I also added a ShareThis widget to enable easy sharing through email, twitter, and other social-networking sites.


    Where do I host my app?

    The code has been written.  The design looks reasonable.  Everything has been checked into Bitbucket.  The next task is to host the app somewhere so that it is accessible to the world.  I first considered using the free Amazon EC2 instance.  Since I did not have time to set up the webserver and other software infrastructure needed to host a Ruby on Rails app, I chose a shortcut - Heroku.  Heroku is probably the easiest way to deploy a Ruby on Rails app.  It is a real gem -- extremely simple to sign up and get my first app running.  The whole process took less than an hour.  Deploying an app is as simple as one command --  hg push git+ssh://git@heroku.com:.git.  Since Heroku relies on git, I had to install hg-git using the instructions at http://smith-stubbs.com/notes/2010/04/30/deploying-to-heroku-with-mercurial.

    By default, Heroku offers PostgreSQL as the data store.  Since I was using MongoDB, I installed the MongoHQ add-on.  It took just one click.  I chose the free 16MB plan. If my app takes off, that won't be sufficient.  But I will worry about upgrading to the $5/month plan if and when that happens.

    Where do I get my domain?
    I got mine from GoDaddy, $8.99 for dealwall.me.  This was the only part where I had to shell out real money.  I usually buy my domains through my hosting account at bluehost.  However, they don't sell .me domains.




    How do I make money?
    Yes, I have a plan to make money.  Of course, this assumes that the app takes off :-).  I have signed up as a Groupon affiliate.  All links associated with the deals users post on their walls are linked back to Groupon through my affiliate link.  I also display the Groupon daily deal widget on the side of each wall.  Anytime someone buys a Groupon deal through dealwall.me, I make money.

    How do I get users for my app?
    So, if I want to make money, I need users (obviously!).  Right now, I have just asked a few friends to try the app out.  I am planning to publicize the app on Facebook, by "Liking" it and also by sharing my deal wall.  However, I am going to wait for a week or two.   Currently, the Facebook activity streams of everyone I know are inundated with messages about India's victory in the Cricket World Cup.  Notifications from my little app have absolutely no chance of getting noticed.

    This is a lesson I learnt when publishing my first Android app -- frync.  You need to publish at the right time.   When an app is published, it stays on the "Just In" list for at least a few hours.  I should have published the app just after Christmas, when everyone would have been playing with their new smart phones received as gifts.  The app would have been noticed much more and would have received more installs, just by virtue of being visible at the right time.  So this is a lesson learnt.

    What next?


    Doing my taxes, of course.  As soon as I finish writing this blog entry.... and if I don't get distracted into more web app building.
    The app is running fine at dealwall.me.  It is still hard to use.  Since Groupon (obviously) does not offer an API to retrieve a particular user's deals, users have to log in to Groupon and then paste the HTML source of their All Groupons page into my app.  I am still thinking about how to make this process easier.  Don't understand what I am talking about?  Please try creating your wall at dealwall.me.  If you have any ideas about how to make it better, please leave a comment.

    Tuesday, January 11, 2011

    How to avoid excessive logging when using Hive JDBC

    By default, the Hive JDBC driver outputs a HUGE amount of log messages at the INFO log level.  That makes it very hard to track the log statements added by you.  I spent quite some time trying to change the log level to ERROR by tinkering with the hive-log4j.properties and hive-exec-log4j.properties files.   This approach was not successful. The Hive JDBC driver does not seem to read these log4j config files.  Hence, in order to change the log level, you must add the following lines near the beginning of your program:

    from org.apache.log4j import Level, Logger
    rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
    

    Note: The above code is in Jython.  Something very similar will work in Java.

    Sunday, January 02, 2011

    Frync: My first android app

    During my 5 years at grad school, I used to call a different friend almost every day during my 45-minute walks to and from my lab.  I used to be in touch with a lot of people, and it felt wonderful.  For the past 1.5 years, I have been working at a startup. I now find myself woefully out of sync with most of my friends.  The reason is simple: there are so many things going on all the time that I forget to call my friends (and relatives).


    That was the motivation for writing my first Android app - Frync.  Frync stands for Friend Sync.  In Frync, you associate the contacts in your phone's address book with the frequency at which you wish to call them -- for example, Every Day, Every Week, Every Month, etc.  Frync automatically tracks your phone call activity, and reminds you to call the friends whom you have not talked to at the desired frequency.  You can install Frync from the Android market: market://details?id=danjo.frm.  Screenshots of Frync are at www.frync.com.
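The reminder rule described above is simple enough to sketch in a few lines. The following Python is only an illustration of the idea, not Frync's actual Android code: the mapping from frequency labels to days (for example, treating Every Month as 30 days) and the contact names are assumptions.

```python
from datetime import date, timedelta

# Assumed number of days allowed between calls for each frequency label.
FREQUENCY_DAYS = {"Every Day": 1, "Every Week": 7, "Every Month": 30}


def overdue_contacts(frequencies, last_calls, today):
    """Return the contacts who have not been called often enough.

    frequencies maps a contact name to a frequency label, and last_calls
    maps a contact name to the date of the most recent call. A contact
    with no recorded call is always considered overdue.
    """
    overdue = []
    for name, frequency in sorted(frequencies.items()):
        allowed_gap = timedelta(days=FREQUENCY_DAYS[frequency])
        last_call = last_calls.get(name)
        if last_call is None or today - last_call > allowed_gap:
            overdue.append(name)
    return overdue


if __name__ == "__main__":
    # Hypothetical contacts and call history.
    frequencies = {"Asha": "Every Week", "Ben": "Every Month", "Cara": "Every Day"}
    last_calls = {"Asha": date(2010, 12, 20), "Ben": date(2010, 12, 25)}
    print(overdue_contacts(frequencies, last_calls, today=date(2011, 1, 2)))
```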

    According to TechCrunch, The Phone Call is Dead (although The Economist disagrees).  Text-based modes of communication like SMS, Tweets and Facebook messages are going to dominate the future.  In the future, I hope to expand Frync into a more powerful Friend Relationship Management tool that will help you stay in sync with friends across multiple modes of communication.