<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-20761651</id><updated>2012-01-16T07:02:35.284-08:00</updated><category term='simple build tool'/><category term='hive'/><category term='eclipse'/><category term='sbt'/><category term='scala'/><category term='apache hive'/><category term='hadoop'/><title type='text'>The Ex CS Grad Student</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>14</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-20761651.post-3839523196790361272</id><published>2011-12-04T20:09:00.001-08:00</published><updated>2011-12-04T21:11:05.569-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='apache hive'/><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><category 
scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Accessing recursive Apache Hive partitions in CDH3</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div&gt;In this post, I describe the minor Hadoop (0.20.2-cdh3u2) patches required to access data deep inside a multi-level directory structure using hive 0.7.  Consider the following directory structure:&lt;br /&gt;&lt;pre class="prettyprint"&gt;Logs/&lt;br /&gt;    2011_01&lt;br /&gt;        01&lt;br /&gt;        02&lt;br /&gt;        ..&lt;br /&gt;        31&lt;br /&gt;    2011_02&lt;br /&gt;        01&lt;br /&gt;        02&lt;br /&gt;        ..&lt;br /&gt;        28&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;We want to issue hive queries involving individual days as well as whole months. For accessing individual days, we define one hive partition per day. For example, we define a partition &lt;tt&gt;2011_01_02&lt;/tt&gt; with &lt;tt&gt;LOCATION&lt;/tt&gt; &lt;tt&gt;Logs/2011_01/02&lt;/tt&gt;. To access the whole month of &lt;tt&gt;2011_01&lt;/tt&gt;, we define a partition &lt;tt&gt;2011_01&lt;/tt&gt; with &lt;tt&gt;LOCATION&lt;/tt&gt; &lt;tt&gt;Logs/2011_01&lt;/tt&gt;. However, if you query the &lt;tt&gt;2011_01&lt;/tt&gt; partition, you will get no results. This is because hadoop 0.20.2 does not support recursive directory listing.&lt;/div&gt;&lt;div&gt;In order to get this monthly query working, you must first apply the following patch (based on &lt;a href="https://issues.apache.org/jira/browse/MAPREDUCE-1501"&gt;MAPREDUCE-1501&lt;/a&gt;, which did not make it into hadoop 0.20.2) to hadoop 0.20.2.cdh3u2. After applying the patch, compile hadoop and point the &lt;tt&gt;HADOOP_HOME&lt;/tt&gt; on the machine running the hive client to the patched hadoop jars. 
You do NOT have to replace the hadoop jars on the hadoop cluster; the recursive directory listing feature is only needed by the hive client.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;script src="https://gist.github.com/1432301.js"&gt; &lt;/script&gt;&lt;br /&gt;&lt;div&gt;In addition to the patched jars, you should also add the following lines to your &lt;tt&gt;hive-site.xml&lt;/tt&gt;:&lt;/div&gt;&lt;pre class="prettyprint"&gt;&amp;lt;property&amp;gt;&lt;br /&gt;  &amp;lt;name&amp;gt;mapred.input.dir.recursive&amp;lt;/name&amp;gt;&lt;br /&gt;  &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;After this change, querying the &lt;tt&gt;2011_01&lt;/tt&gt; partition will work.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-3839523196790361272?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/3839523196790361272/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=3839523196790361272' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/3839523196790361272'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/3839523196790361272'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2011/12/accessing-recursive-apache-hive.html' title='Accessing recursive Apache Hive partitions in CDH3'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-2176999559127941721</id><published>2011-11-24T13:50:00.001-08:00</published><updated>2011-11-24T16:48:41.118-08:00</updated><title type='text'>Quickly find and open files</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;I frequently need to find a file that is located deep within the current directory and operate on it -- for example, open it in vim or run svn diff on it. &amp;nbsp;I can never remember the exact path to the file, and sometimes can't even remember the full name. &amp;nbsp;All I know is that the file is somewhere within the current directory and its sub-directories. &amp;nbsp;So, I end up running the UNIX &lt;tt&gt;find&lt;/tt&gt; command, and then copying and pasting the returned file path into the command of interest. &amp;nbsp;This wastes time. &amp;nbsp;So I wrote a small Python script to make it easier.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;Copy the script &lt;tt&gt;f.py&lt;/tt&gt; (located at the end of the post) into some directory that is on your &lt;tt&gt;PATH&lt;/tt&gt;. &amp;nbsp;To look for files whose names start with Foo, run the following (the quotes keep the shell from expanding the wildcard before the script sees it):&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre class="prettyprint"&gt;$ f.py "Foo*"&lt;br /&gt;1) ./subdir1/subdir2/Foo1.java&lt;br /&gt;2) ./subdir1/Foo.java&lt;br /&gt;Enter file number:&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Enter the number of the file you are interested in. &amp;nbsp;That will bring up the following menu of operations you can perform on the selected file.&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;Process ./subdir1/subdir2/Foo1.java&lt;br /&gt;1. vim&lt;br /&gt;2. emacs&lt;br /&gt;3. svn add&lt;br /&gt;4. svn diff&lt;br /&gt;5. 
open (OSX only)&lt;br /&gt;Enter choice (Default is 1):&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If the pattern you specify matches only a single file, the script directly jumps to the operation selection menu. &amp;nbsp;Hope this will save you some key-strokes.&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;#!/usr/bin/python&lt;br /&gt;# This program is used to easily locate a file matching &lt;br /&gt;# the user specified pattern within the current directory&lt;br /&gt;# and to quickly perform some common operations (like&lt;br /&gt;# opening it in vim) against it.&lt;br /&gt;import subprocess&lt;br /&gt;import sys&lt;br /&gt;import os&lt;br /&gt;&lt;br /&gt;def processFile(fileName):&lt;br /&gt;    """&lt;br /&gt;    Show the user the possible actions with the specified file,&lt;br /&gt;    and prompt the user to select a particular action.&lt;br /&gt;&lt;br /&gt;    """&lt;br /&gt;&lt;br /&gt;    fileName = fileName.strip()&lt;br /&gt;    print "Process %s" % fileName&lt;br /&gt;    print "1. vim"&lt;br /&gt;    print "2. emacs"&lt;br /&gt;    print "3. svn add"&lt;br /&gt;    print "4. svn diff"&lt;br /&gt;    print "5. 
open (OSX only)"&lt;br /&gt;&lt;br /&gt;    choice = raw_input("Enter choice (Default is 1):").strip()&lt;br /&gt;    &lt;br /&gt;    if choice == "1" or choice == "":&lt;br /&gt;        cmd = "vim %s" % fileName&lt;br /&gt;    elif choice == "2":&lt;br /&gt;        cmd = "emacs %s" % fileName&lt;br /&gt;    elif choice == "3":&lt;br /&gt;        cmd = "svn add %s" % fileName&lt;br /&gt;    elif choice == "4":&lt;br /&gt;        cmd = "svn diff %s" % fileName&lt;br /&gt;    elif choice == "5":&lt;br /&gt;        cmd = "open %s" % fileName&lt;br /&gt;    else:&lt;br /&gt;        # Any other input would leave cmd undefined, so stop here.&lt;br /&gt;        print "Invalid choice: %s" % choice&lt;br /&gt;        return&lt;br /&gt;    print cmd&lt;br /&gt;    os.system(cmd)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;def listFiles(fileNames):&lt;br /&gt;    """ &lt;br /&gt;    Show the list of files and prompt user to select one &lt;br /&gt;    """&lt;br /&gt;&lt;br /&gt;    fileIndex = 1&lt;br /&gt;    for fileName in fileNames:&lt;br /&gt;        print "%d) %s" % (fileIndex, fileName.strip())&lt;br /&gt;        fileIndex += 1&lt;br /&gt;    choice = raw_input("Enter file number:")&lt;br /&gt;    chosenFileName = fileNames[int(choice)-1].strip()&lt;br /&gt;    processFile(chosenFileName)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;if __name__ == "__main__":&lt;br /&gt;&lt;br /&gt;    if len(sys.argv) &amp;lt; 2:&lt;br /&gt;        print "Usage: f.py FILE_PATTERN_OF_INTEREST"&lt;br /&gt;        sys.exit(-1)&lt;br /&gt;&lt;br /&gt;    pattern = sys.argv[1]&lt;br /&gt;    proc = subprocess.Popen("find . -name \"%s\" | grep -v svn" % pattern, &lt;br /&gt;        shell=True, stdout=subprocess.PIPE)&lt;br /&gt;    lines = proc.stdout.readlines()&lt;br /&gt;    if len(lines) == 0:&lt;br /&gt;        print "No matching files found. 
Note you can use wild cards like *"&lt;br /&gt;    elif len(lines) == 1:&lt;br /&gt;        processFile(lines[0])&lt;br /&gt;    else:&lt;br /&gt;        listFiles(lines)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-2176999559127941721?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/2176999559127941721/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=2176999559127941721' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/2176999559127941721'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/2176999559127941721'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2011/11/quickly-find-and-open-files.html' title='Quickly find and open files'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-6336674010154931929</id><published>2011-11-09T21:54:00.000-08:00</published><updated>2011-11-09T21:54:44.819-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='eclipse'/><category scheme='http://www.blogger.com/atom/ns#' term='sbt'/><category scheme='http://www.blogger.com/atom/ns#' term='scala'/><title type='text'>Automatically create Eclipse projects for your Scala projects using sbt</title><content type='html'>&lt;div dir="ltr" 
style="text-align: left;" trbidi="on"&gt;In my &lt;a href="http://csgrad.blogspot.com/2011/11/quick-start-to-using-scala-simple-build.html"&gt;previous post&lt;/a&gt;, I described the bare minimum you need to know to get started using sbt. &amp;nbsp;In this post, I will describe how sbt can be used to automatically create the &lt;tt&gt;.project&lt;/tt&gt; and &lt;tt&gt;.classpath&lt;/tt&gt; files needed to load your project into the Eclipse IDE.  Using sbt to create your Eclipse project files ensures that they are always in sync with your build definition.  And of course, it saves many of the clicks and keystrokes needed to manually specify classpaths in Eclipse.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 1&lt;/b&gt;&lt;br /&gt;Install the &lt;a href="http://www.scala-ide.org/"&gt;Eclipse plugin for Scala&lt;/a&gt;, if you don't already have it installed.&lt;br /&gt;&lt;b&gt;Step 2&lt;/b&gt;&lt;br /&gt;Add the following lines to &lt;tt&gt;myproject/project/plugins/build.sbt&lt;/tt&gt;.  This tells sbt that you want to use the &lt;a href="https://github.com/typesafehub/sbteclipse"&gt;sbteclipse&lt;/a&gt; plugin.  Note that this &lt;tt&gt;build.sbt&lt;/tt&gt; is different from the &lt;tt&gt;build.sbt&lt;/tt&gt; at the top level of your project, i.e., the one in the &lt;tt&gt;myproject/&lt;/tt&gt; directory.&lt;br /&gt;&lt;pre class="prettyprint"&gt;resolvers += Classpaths.typesafeResolver&lt;br /&gt;&lt;br /&gt;addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse" % "1.4.0")&lt;br /&gt;&lt;/pre&gt;&lt;b&gt;Step 3&lt;/b&gt;&lt;br /&gt;From the &lt;tt&gt;myproject/&lt;/tt&gt; directory, type:&lt;br /&gt;&lt;pre class="prettyprint"&gt;sbt "eclipse create-src"&lt;br /&gt;&lt;/pre&gt;This step will create the &lt;tt&gt;.project&lt;/tt&gt; and &lt;tt&gt;.classpath&lt;/tt&gt; files required by Eclipse inside the&amp;nbsp;&lt;tt&gt;myproject/&lt;/tt&gt;&amp;nbsp;directory.&lt;br /&gt;&lt;b&gt;Step 4&lt;/b&gt;&lt;br /&gt;Import the project into Eclipse. 
Follow the menu &lt;tt&gt;File &amp;gt; Import &amp;gt; General &amp;gt; Existing Projects into Workspace&lt;/tt&gt; and browse to &lt;tt&gt;myproject&lt;/tt&gt;.&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-6336674010154931929?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/6336674010154931929/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=6336674010154931929' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/6336674010154931929'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/6336674010154931929'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2011/11/automatically-create-eclipse-projects.html' title='Automatically create Eclipse projects for your Scala projects using sbt'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-5848785906885687139</id><published>2011-11-08T22:37:00.000-08:00</published><updated>2011-11-08T22:37:19.599-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='simple build tool'/><category scheme='http://www.blogger.com/atom/ns#' term='sbt'/><category scheme='http://www.blogger.com/atom/ns#' term='scala'/><title type='text'>Quick Start to 
using Scala Simple Build Tool (sbt)</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;After you finish a basic Hello World program in Scala and want to start your first real Scala project, you will need to choose a build tool. &amp;nbsp;While you can use tools like ant or maven, Simple Build Tool (&lt;a href="https://github.com/harrah/xsbt/wiki"&gt;sbt&lt;/a&gt; for short) is a very popular option among Scala programmers. &amp;nbsp;In this post, I describe the bare minimum you need to know to quickly get started with sbt. &lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Step 1: Install sbt&lt;/b&gt;&lt;/div&gt;&lt;div&gt;On a Mac it is as simple as&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;pre class="prettyprint"&gt;sudo port install sbt&lt;/pre&gt;or, if you are using homebrew,&lt;br /&gt;&lt;pre class="prettyprint"&gt;brew install sbt&lt;/pre&gt;&lt;/div&gt;&lt;div&gt;For other operating systems, please see the official&amp;nbsp;&lt;a href="https://github.com/harrah/xsbt/wiki/Getting-Started-Setup"&gt;Getting Started Setup&lt;/a&gt; page.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;b&gt;Step 2: Create your project's directory structure&lt;/b&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;myproject/&lt;br /&gt;    src/&lt;br /&gt;        main/&lt;br /&gt;            scala/&lt;br /&gt;            java/&lt;br /&gt;            resources/&lt;br /&gt;        test/&lt;br /&gt;            scala/&lt;br /&gt;            java/&lt;br /&gt;            resources/&lt;br /&gt;&lt;/pre&gt;&lt;tt&gt;myproject&lt;/tt&gt; is your project's top-level directory. The &lt;tt&gt;resources&lt;/tt&gt; directories contain non-code files, such as image or data files, that are packaged together with your project.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 3: Start writing your project's code&lt;/b&gt;&lt;br /&gt;Let's just use a Hello World program. 
&amp;nbsp;Create &lt;tt&gt;myproject/src/main/scala/HelloWorld.scala&lt;/tt&gt; that contains the following code:&lt;br /&gt;&lt;pre class="prettyprint"&gt;import org.slf4j._&lt;br /&gt;&lt;br /&gt;object HelloWorld {&lt;br /&gt;    def main(args: Array[String]) = {&lt;br /&gt;        val logger:Logger = LoggerFactory.getLogger("MyLogger");&lt;br /&gt;        logger.info("Hello World");&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;We use the &lt;a href="http://www.slf4j.org/"&gt;slf4j&lt;/a&gt; logging library instead of a simple &lt;tt&gt;println&lt;/tt&gt; in order to demonstrate how external dependencies are specified in sbt.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 4: Create a build definition file&lt;/b&gt;&lt;br /&gt;Put the following lines in &lt;tt&gt;myproject/build.sbt&lt;/tt&gt;. Note that the blank lines below ARE absolutely necessary.&lt;br /&gt;&lt;pre class="prettyprint"&gt;name := "Hello World"&lt;br /&gt;&lt;br /&gt;version := "1.0"&lt;br /&gt;&lt;br /&gt;scalaVersion := "2.9.1"&lt;br /&gt;&lt;br /&gt;libraryDependencies ++= Seq(&lt;br /&gt;  "org.slf4j" % "slf4j-api" % "1.6.4",&lt;br /&gt;  "org.slf4j" % "slf4j-simple" % "1.6.4"&lt;br /&gt;)&lt;br /&gt;&lt;/pre&gt;The &lt;tt&gt;libraryDependencies&lt;/tt&gt; setting specifies the &lt;i&gt;managed&lt;/i&gt; dependencies, i.e., the dependencies which are automatically downloaded for you from the Maven repositories.  A dependency is specified as &lt;tt&gt;groupId % artifactId % revision&lt;/tt&gt;.  Conventions for &lt;tt&gt;groupId&lt;/tt&gt;, &lt;tt&gt;artifactId&lt;/tt&gt; and &lt;tt&gt;revision&lt;/tt&gt; are discussed at &lt;a href="http://maven.apache.org/guides/mini/guide-naming-conventions.html"&gt;http://maven.apache.org/guides/mini/guide-naming-conventions.html&lt;/a&gt;.  The automatically downloaded dependencies are usually stored in &lt;tt&gt;~/.ivy2/cache&lt;/tt&gt;.&lt;br /&gt;&lt;br /&gt;You don't have to use maven if you don't want to.  
Instead you can use &lt;i&gt;unmanaged&lt;/i&gt; dependencies.  Just put the appropriate jars in &lt;tt&gt;myproject/lib&lt;/tt&gt;, and don't specify them in the &lt;tt&gt;libraryDependencies&lt;/tt&gt; setting.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 5: Compile, run and package your program&lt;/b&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;cd myproject&lt;br /&gt;sbt run&lt;br /&gt;&lt;/pre&gt;The first time you run sbt, it will download all the dependencies (sometimes including the scala version specified in the &lt;tt&gt;scalaVersion&lt;/tt&gt; setting).  This means that it will take some time.  Subsequent runs will be much faster.&lt;br /&gt;&lt;br /&gt;If you just want to compile, &amp;nbsp;type:&lt;br /&gt;&lt;pre class="prettyprint"&gt;sbt compile&lt;/pre&gt;You can set up sbt to automatically compile your program as soon as any source file changes. &amp;nbsp;To do so, type:&lt;br /&gt;&lt;pre class="prettyprint"&gt;sbt ~compile&lt;br /&gt;&lt;/pre&gt;Continuous compilation is a great time-saver.&lt;br /&gt;&lt;br /&gt;To package your program for distribution as a jar, type:&lt;br /&gt;&lt;pre class="prettyprint"&gt;sbt package&lt;br /&gt;&lt;/pre&gt;A jar containing all your compiled classes and resources from &lt;tt&gt;myproject/src/main/resources&lt;/tt&gt; will be found in &lt;tt&gt;myproject/target/scala-YOUR_SCALA_VERSION_HERE&lt;/tt&gt;.  Note that dependencies are NOT included in the jar.  To include all dependencies into the output jar, you will need to use the &lt;tt&gt;assembly&lt;/tt&gt; plugin.  See the next step for pointers to info about plugins.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 6: Read the official Getting Started Guide&lt;/b&gt;&lt;br /&gt;This post only aims to get you quickly started on sbt. &amp;nbsp;There are tons of features it does not cover -- multiple projects and plugins, just to name two very important ones. 
&amp;nbsp;To learn how to use these features and to understand the fundamental principles behind sbt (which is in fact just a Scala Domain Specific Language), please read the &lt;a href="https://github.com/harrah/xsbt/wiki/Getting-Started-Welcome"&gt;Official Getting Started Guide&lt;/a&gt;. &amp;nbsp;It is long and sometimes too deep, but very useful indeed.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-5848785906885687139?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/5848785906885687139/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=5848785906885687139' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/5848785906885687139'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/5848785906885687139'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2011/11/quick-start-to-using-scala-simple-build.html' title='Quick Start to using Scala Simple Build Tool (sbt)'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-3660241311394042786</id><published>2011-07-10T15:45:00.000-07:00</published><updated>2011-07-10T15:45:01.316-07:00</updated><title type='text'>JQueryMobile app with Facebook integration and a Ruby on Rails backend</title><content type='html'>I am 
currently playing around with mobile app development. &amp;nbsp;I want to develop a mobile app that:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Works across multiple devices (iPhone, Android, etc.)&lt;/li&gt;&lt;li&gt;Integrates with Facebook&lt;/li&gt;&lt;li&gt;Has a Ruby on Rails backend to store app-specific data&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;I have implemented a toy &lt;a href="http://jquerymobile.com/"&gt;jQueryMobile&lt;/a&gt; app that demonstrates how to meet the above requirements. The toy app simply asks you to log in with your Facebook credentials and lists your friends. &amp;nbsp;You can try it out by visiting&amp;nbsp;&lt;a href="http://jqmfbror.heroku.com/"&gt;http://jqmfbror.heroku.com&lt;/a&gt; using your mobile or desktop web browser. &amp;nbsp;The full source code for the app is available at&amp;nbsp;&lt;a href="https://github.com/dilip/jqmfbror"&gt;https://github.com/dilip/jqmfbror&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Facebook authentication happens at the Ruby on Rails backend using the &lt;a href="https://github.com/intridea/omniauth/wiki"&gt;OmniAuth&lt;/a&gt; gem. &amp;nbsp;This means that the backend can uniquely identify repeat visitors and associate app-specific data with them. &amp;nbsp;The backend can use the Facebook access token generated on login to retrieve the user's Facebook data, and pass it on for display on the frontend. &amp;nbsp;Alternatively, the backend can simply pass the Facebook access token to the frontend, which then uses JSONP to retrieve the user's Facebook data. &amp;nbsp;The latter approach reduces the load on the backend, because the requests for Facebook data are made directly from the browser. &amp;nbsp;The toy app uses this approach. 
&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The app also uses MongoDB (instead of MySQL) and Heroku (for super-easy hosting of Ruby on Rails applications).&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-3660241311394042786?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/3660241311394042786/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=3660241311394042786' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/3660241311394042786'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/3660241311394042786'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2011/07/jquerymobile-app-with-facebook.html' title='JQueryMobile app with Facebook integration and a Ruby on Rails backend'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-1945711270423937545</id><published>2011-06-23T22:07:00.000-07:00</published><updated>2011-06-23T22:07:44.859-07:00</updated><title type='text'>Infographic Resume Creator</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;A few months ago, I came across some really cool infographic resumes at&amp;nbsp;&lt;a href="http://blog.chumbonus.com/infographic-resumes/"&gt;http://blog.chumbonus.com/infographic-resumes/&lt;/a&gt;. 
&amp;nbsp;Since I didn't have the Photoshop/Illustrator skills to make my own, I decided to make a web app to automatically create a basic infographic resume from my LinkedIn profile. &amp;nbsp;Thus &lt;a href="http://inforesume.heroku.com/"&gt;http://inforesume.heroku.com&lt;/a&gt; was born. &amp;nbsp;All you need to do is log in with LinkedIn. &amp;nbsp;Do check it out.&lt;br /&gt;&lt;br /&gt;If you want a more fancy infographic resume,&amp;nbsp;&lt;a href="http://vizualize.me/"&gt;http://vizualize.me/&lt;/a&gt; will probably be useful. &amp;nbsp;It's a startup announced on Hacker News just yesterday.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-1945711270423937545?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/1945711270423937545/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=1945711270423937545' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/1945711270423937545'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/1945711270423937545'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2011/06/infographic-resume-creator.html' title='Infographic Resume Creator'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-8630312761606063116</id><published>2011-05-23T22:02:00.000-07:00</published><updated>2011-05-23T22:03:49.916-07:00</updated><title type='text'>Craigslist Car Finder</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div align="justify"&gt;I bought my first car in 2006 through Craigslist.&amp;nbsp; Searching for a used car on Craigslist was a pain -- there was too much junk.&amp;nbsp; I found myself wasting 2-3 hours every day sifting through Craigslist postings looking for good deals.&amp;nbsp; So I wrote a script to automatically parse Craigslist car ads and show me only the ones that might be of interest to me.&amp;nbsp; This script helped me and at least four of my friends find great deals on Craigslist.&amp;nbsp; I hope it is useful to you too.&lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;&lt;/div&gt;The script has the following features: &lt;br /&gt;&lt;ul&gt;&lt;li&gt;Allows you to look for only specific car models.&lt;/li&gt;&lt;li&gt;Allows you to ignore coupes.&lt;/li&gt;&lt;li&gt;Allows you to specify price and year ranges.&lt;/li&gt;&lt;li&gt;Allows you to ignore manual transmission cars.&lt;/li&gt;&lt;li&gt;Compares the price of the vehicle with the &lt;a href="http://www.edmunds.com/"&gt;Edmunds&lt;/a&gt; True Market Value price and highlights good deals.&lt;/li&gt;&lt;li&gt;Highlights cars with low mileage.&lt;/li&gt;&lt;/ul&gt;&lt;div align="justify"&gt;Here is the script: &lt;a href="http://www.accountdiary.com/craigslist.tar.gz"&gt;craigslist.tar.gz&lt;/a&gt;. &lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div align="justify"&gt;In 2006, PERL was the only scripting language I knew. 
It was an absolute horror to do object oriented programming in PERL.&amp;nbsp; I wish I knew Python and Ruby back then. &lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div align="justify"&gt;If you find the script useful or find bugs, please do send me an email.   &lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-8630312761606063116?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/8630312761606063116/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=8630312761606063116' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/8630312761606063116'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/8630312761606063116'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2011/05/craigslist-car-finder.html' title='Craigslist Car Finder'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-4369962560500318521</id><published>2011-04-02T15:01:00.000-07:00</published><updated>2011-04-02T15:39:23.611-07:00</updated><title type='text'>dealwall.me : Procrastinating about taxes</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Last weekend, I was supposed to do my taxes. 
&amp;nbsp;But I found a good way to procrastinate -- build a web application. &amp;nbsp;The result was &lt;/span&gt;&lt;a href="http://dealwall.me/"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;dealwall.me&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;, a site where you can boast about the Groupon deals you bagged. This weekend, I am procrastinating by blogging about it.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;My wife and I are huge fans of &lt;a href="http://www.groupon.com/r/uu3372643"&gt;Groupon&lt;/a&gt;. &amp;nbsp;Through Groupon, we have tried numerous great restaurants, taken flying lessons, taken swimming lessons, biked the bay, etc. &amp;nbsp;We thought it would be cool to blog about our Groupon adventures. &amp;nbsp;Why make a static page when you can make a web app! &amp;nbsp;And thus &lt;/span&gt;&lt;a href="http://dealwall.me/"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;dealwall.me&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt; was born. &amp;nbsp;Please check out our deal wall at&amp;nbsp;&lt;/span&gt;&lt;a href="http://dealwall.me/4d91637b4570a36150000001"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;http://dealwall.me/4d91637b4570a36150000001&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;"Build" is not really the right word to describe my activities over the weekend. &amp;nbsp;It's more like &lt;i&gt;cobble together&lt;/i&gt;. 
&amp;nbsp;There are so many fantastic technologies and libraries available today that putting together a simple web application is quick, simple and frustration-free. &amp;nbsp;In the rest of this article, I will describe the technologies/libraries that helped me put together dealwall.me over a weekend.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;b&gt;Where should I store my code?&lt;/b&gt;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Before writing any code, I needed a version control system. &amp;nbsp;I went with Mercurial hosted on &lt;a href="http://bitbucket.org/"&gt;Bitbucket&lt;/a&gt;. &amp;nbsp;The main reason I went with Mercurial/Bitbucket instead of git/github is that Bitbucket offers free private repositories. 
&amp;nbsp;I am extremely happy with Mercurial/Bitbucket, and do not miss git at all.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;b&gt;What web framework should I use?&lt;/b&gt; &amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;I went with &lt;a href="http://rubyonrails.org/"&gt;Ruby on Rails&lt;/a&gt;. &amp;nbsp;I had played with Ruby on Rails a couple of years ago when I was a grad student. &amp;nbsp;These days, I do a lot of programming in Python/Django at my day job. I like Ruby on Rails a lot more than Django. &amp;nbsp;Ruby on Rails just looks much nicer and has better libraries, which mitigate a lot of programming frustration.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Where do I store data?&lt;/b&gt;&amp;nbsp;&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;A year ago, the answer was obvious - a relational database like MySQL. 
&amp;nbsp;Today, there are many &lt;i&gt;nosql&lt;/i&gt; choices, which are very apt for web apps like the one I was cobbling together. &amp;nbsp;I went with &lt;a href="http://www.mongodb.org/"&gt;mongoDB&lt;/a&gt;, mainly because of the very positive experience I had with it at work. &amp;nbsp;It was super simple to install and use.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;I did consider building a CouchApp using CouchDB. &amp;nbsp;It probably would have been a good fit for my simple app. &amp;nbsp;However, I wanted to finish the app over the weekend, and the learning curve associated with CouchDB would not have made that possible (based on my experience twiddling with CouchApps a few months ago).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;b&gt;What object relational mapper should I use?&lt;/b&gt; &amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;If I were using SQL with Rails, the most obvious choice would have been Rails' own ActiveRecord. &amp;nbsp;But I was using MongoDB. &amp;nbsp;I chose the lightweight &lt;a href="http://mongomatic.com/"&gt;Mongomatic&lt;/a&gt; instead of the more heavyweight Mongoid and Mongomapper ORMs. Mongomatic is a very thin layer on top of the MongoDB Ruby driver. &amp;nbsp;Since this was my first mongoDB + Ruby on Rails app, I wanted to have full control over how and when data is stored and accessed. 
&amp;nbsp;Mongomatic is very simple to use, and works great for my app's very simple model -- a &lt;i&gt;wall&lt;/i&gt; which has a list of &lt;i&gt;deals&lt;/i&gt; within it.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;b&gt;How do I make my web app look pretty?&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;I am not good at making websites look professional. &amp;nbsp;So I needed all the help I could get. That's where &lt;a href="http://www.blueprintcss.org/"&gt;Blueprint&lt;/a&gt;, &lt;a href="http://compass-style.org/"&gt;Compass&lt;/a&gt; and &lt;a href="https://github.com/imathis/fancy-buttons"&gt;Fancy-Buttons&lt;/a&gt; provide a lot of support. &amp;nbsp;Blueprint is a CSS framework that provides a grid on which you can easily align various page elements. &amp;nbsp;Along with nice-looking default typography and sane style defaults, Blueprint gives a jumpstart towards building a professional-looking website. &lt;br /&gt;&lt;br /&gt;Compass is a stylesheet authoring framework that makes it easy to organize your CSS. &amp;nbsp;No more CSS files containing hundreds of lines. &amp;nbsp;Instead, Compass allows you to break up your CSS into smaller re-usable parts called plugins, and to compose them into bigger files just like you use &lt;i&gt;#include&lt;/i&gt; in C++. &amp;nbsp;Fancy-buttons is a Compass plugin that provides neat-looking CSS buttons. 
&amp;nbsp;See&amp;nbsp;&lt;a href="http://codefastdieyoung.com/2011/03/want-to-move-fast-just-do-this-part-1-design"&gt;http://codefastdieyoung.com/2011/03/want-to-move-fast-just-do-this-part-1-design&lt;/a&gt; for an excellent article on how to quickly design a professional looking web app. &lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;b&gt;How do I make my app social?&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;a href="http://4.bp.blogspot.com/-KZYC5QXKAT8/TZeUBDHlmfI/AAAAAAAAC6A/WSgHCjUpUpw/s1600/like.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-KZYC5QXKAT8/TZeUBDHlmfI/AAAAAAAAC6A/WSgHCjUpUpw/s1600/like.png" /&gt;&lt;/a&gt;&lt;a href="http://2.bp.blogspot.com/-a34RPpRNs5g/TZeUfriFwSI/AAAAAAAAC6E/zzd0Fx7kpDQ/s1600/sharethis.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-a34RPpRNs5g/TZeUfriFwSI/AAAAAAAAC6E/zzd0Fx7kpDQ/s1600/sharethis.png" /&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span class="Apple-style-span" style="font-family: Helvetica; font-size: 12px;"&gt;&lt;/span&gt;Super simple. &amp;nbsp;Copy some Javascript from the Facebook dev site to embed Facebook &lt;i&gt;Like&lt;/i&gt; buttons and the new Facebook commenting system. 
&amp;nbsp;I also added a &lt;a href="http://sharethis.com/"&gt;ShareThis&lt;/a&gt; widget to enable easy sharing through email, Twitter, and other social-networking sites.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;b&gt;Where do I host my app?&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;The code has been written. &amp;nbsp;The design looks reasonable. &amp;nbsp;Everything has been checked into Bitbucket. &amp;nbsp;The next task is to host the app somewhere so that it is accessible to the world. &amp;nbsp;I first considered using the free&amp;nbsp;Amazon EC2 instance. &amp;nbsp;Since I did not have time to set up the webserver and other software infrastructure needed to host a Ruby on Rails app, I chose a shortcut - &lt;a href="http://heroku.com/"&gt;Heroku&lt;/a&gt;. &amp;nbsp;Heroku is probably the easiest way to deploy a Ruby on Rails app. &amp;nbsp;It is a real gem -- it was extremely simple to sign up and get my first app running. &amp;nbsp;The whole process took less than an hour. 
&amp;nbsp;Deploying an app is as simple as one command -- &amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="color: #444444; font-family: monospace; font-size: 16px; line-height: 16px; white-space: pre;"&gt;hg push git+ssh://git@heroku.com:&amp;lt;your-heroku-app-name&amp;gt;.git&lt;/span&gt;. &amp;nbsp;Since Heroku relies on git, I had to install &lt;a href="http://hg-git.github.com/"&gt;hg-git&lt;/a&gt; using the instructions at &lt;a href="http://smith-stubbs.com/notes/2010/04/30/deploying-to-heroku-with-mercurial"&gt;http://smith-stubbs.com/notes/2010/04/30/deploying-to-heroku-with-mercurial&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;By default, Heroku offers postgres as the data store. &amp;nbsp;Since I was using mongoDB, I installed the &lt;a href="http://mongohq.com/"&gt;mongoHQ&lt;/a&gt; add-on. &amp;nbsp;It took just one click. &amp;nbsp;I chose the free 16MB plan. If my app takes off, that won't be sufficient. &amp;nbsp;But I will worry about upgrading to the $5/month plan if and when my app takes off. 
&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;b&gt;Where do I get my domain?&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span class="Apple-style-span" style="font-family: Helvetica; font-size: 12px;"&gt;&lt;/span&gt;I got mine from GoDaddy: $8.99 for &lt;a href="http://dealwall.me/"&gt;dealwall.me&lt;/a&gt;. &amp;nbsp;This was the only part where I had to shell out real money. &amp;nbsp;I usually buy my domains through my hosting account at Bluehost. &amp;nbsp;However, they don't sell .me domains.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;How do I make money?&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Yes, I have a plan to make money. &amp;nbsp;Of course, this assumes that the app takes off :-). &amp;nbsp;I have signed up as a &lt;a href="http://www.groupon.com/r/uu3372643"&gt;Groupon&lt;/a&gt; affiliate. &amp;nbsp;All links associated with the deals users post on their walls point back to Groupon through my affiliate link. &amp;nbsp;I also display the Groupon daily deal widget on the side of each wall. &amp;nbsp;Anytime someone buys a Groupon deal through dealwall.me, I make money.&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;How do I get users for my app?&lt;/b&gt;&lt;br /&gt;So, if I want to make money, I need users (obviously!). 
&amp;nbsp;Right now, I have just asked a few friends to try the app out. &amp;nbsp;I am planning to publicize the app on Facebook, by "Liking" it and also by sharing my deal wall. &amp;nbsp;However, I am going to wait for a week or two. &amp;nbsp; Currently, the Facebook activity streams of everyone I know are inundated with messages about India's victory in the Cricket World Cup. &amp;nbsp;Notifications from my little app have absolutely no chance of getting noticed. &lt;br /&gt;&lt;br /&gt;This is a lesson I learnt when publishing my first Android app -- &lt;a href="http://www.frync.com/"&gt;frync&lt;/a&gt;. &amp;nbsp;You need to publish at the right time. &amp;nbsp; When an app is published, it stays on the "Just In" list for at least a few hours. &amp;nbsp;I should have published the app just after Christmas, when everyone would have been playing with their new smart phones received as gifts. &amp;nbsp;The app would have been noticed much more and would have received more installs, just by virtue of being visible at the right time. 
&amp;nbsp;So this is a lesson learnt.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: 12px;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;What next?&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 12px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 12px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 12px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;Doing my taxes, of course. &amp;nbsp;As soon as I finish writing this blog entry... and if I don't get distracted into more web app building.&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;The app is running fine at &lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;a href="http://dealwall.me/"&gt;dealwall.me&lt;/a&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;. &amp;nbsp;It is still hard to use. 
&amp;nbsp;Since Groupon (obviously) does not offer an API to retrieve a particular user's deals, users have to login to Groupon and then paste the HTML source of their &lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;i&gt;All Groupons&lt;/i&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; page into my app. &amp;nbsp;I am still thinking about how to make this process easier. &amp;nbsp;Don't understand what I am talking about? &amp;nbsp;Please try creating your wall at &lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;a href="http://dealwall.me/"&gt;dealwall.me&lt;/a&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;. &amp;nbsp;If you have any ideas about how to make it better, please leave a comment.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-4369962560500318521?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/4369962560500318521/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=4369962560500318521' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/4369962560500318521'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/4369962560500318521'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2011/04/deallwallme-procrastinating-about-taxes.html' title='dealwall.me : Procrastinating about taxes'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-KZYC5QXKAT8/TZeUBDHlmfI/AAAAAAAAC6A/WSgHCjUpUpw/s72-c/like.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-1794937542437473523</id><published>2011-01-11T01:40:00.000-08:00</published><updated>2011-12-04T21:11:23.656-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='apache hive'/><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>How to avoid excessive logging when using Hive JDBC</title><content type='html'>By default, the Hive JDBC driver outputs a HUGE amount of log messages at the &lt;tt&gt;INFO&lt;/tt&gt; log level. &amp;nbsp;That makes it very hard to track the log statements added by you. &amp;nbsp;I spent quite some time trying to change the log level to &lt;tt&gt;ERROR&lt;/tt&gt; by tinkering with the &lt;tt&gt;hive-log4j.properties&lt;/tt&gt; and &lt;tt&gt;hive-exec-log4j.properties&lt;/tt&gt; files. &amp;nbsp; This approach was not successful. The Hive JDBC driver does not seem to read these log4j config files. &amp;nbsp;Hence, in order to change the log level, you must add the following lines near the beginning of your program:&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;from org.apache.log4j import Logger as Logger&lt;br /&gt;from org.apache.log4j import Level as Level&lt;br /&gt;rootLogger = Logger.getRootLogger()&lt;br /&gt;rootLogger.setLevel(Level.ERROR)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;div&gt;Note: The above code is in Jython. 
&amp;nbsp;Something very similar will work in Java.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-1794937542437473523?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/1794937542437473523/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=1794937542437473523' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/1794937542437473523'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/1794937542437473523'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2011/01/how-to-avoid-excessive-logging-when.html' title='How to avoid excessive logging when using Hive JDBC'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-227742976870731178</id><published>2011-01-02T19:17:00.000-08:00</published><updated>2011-01-11T01:22:52.173-08:00</updated><title type='text'>Frync: My first android app</title><content type='html'>&lt;span class="Apple-style-span" style="clear: left; float: left; font-family: Helvetica; font-size: 12px; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;/span&gt;During my 5 years at grad school, I used to call a different friend almost every day during my 45-minute walks to and from my lab. &amp;nbsp;I used to be in touch with a lot of people, and it felt wonderful. 
&amp;nbsp;For the past 1.5 years, I have been working at a startup. I now find myself woefully out of sync with most of my friends. &amp;nbsp;The reason is simple: There are so many things going on all the time that I forget to call my friends (and relatives). &lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_eD0ZHwCnHPo/TSwhULzE3oI/AAAAAAAAC5c/8UF1KyKUv8o/s1600/frm5.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="200" src="http://2.bp.blogspot.com/_eD0ZHwCnHPo/TSwhULzE3oI/AAAAAAAAC5c/8UF1KyKUv8o/s200/frm5.png" width="194" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;That was the motivation for writing my first Android app - &lt;i&gt;&lt;b&gt;&lt;a href="http://www.frync.com/"&gt;Frync&lt;/a&gt;&lt;/b&gt;&lt;/i&gt;. &amp;nbsp;Frync stands for &lt;b&gt;Fr&lt;/b&gt;iend S&lt;b&gt;ync&lt;/b&gt;. &amp;nbsp;In Frync, you associate the contacts in your phone's address book with the frequency at which you wish to call them -- for example, &lt;i&gt;Every Day&lt;/i&gt;, &lt;i&gt;Every Week&lt;/i&gt;, &lt;i&gt;Every Month&lt;/i&gt;, etc. &amp;nbsp;Frync automatically tracks your phone call activity, and reminds you to call the friends whom you have not talked to at the desired frequency. &amp;nbsp;You can install Frync from the Android market:&amp;nbsp;&lt;a href="market://details?id=danjo.frm"&gt;market://details?id=danjo.frm&lt;/a&gt;. 
&amp;nbsp;Screenshots of Frync are at &lt;a href="http://www.frync.com/"&gt;www.frync.com&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;According to Techcrunch, &lt;a href="http://techcrunch.com/2010/11/13/alexia-phone-home/"&gt;The Phone Call is Dead&lt;/a&gt;&amp;nbsp;(although&amp;nbsp;&lt;a href="http://www.economist.com/node/17797782?story_id=17797782&amp;amp;fsrc=rss&amp;amp;utm_source=feedburner&amp;amp;utm_medium=feed&amp;amp;utm_campaign=Feed%3A+economist%2Ffull_print_edition+%28The+Economist%3A+Full+print+edition%29"&gt;Economist&lt;/a&gt;&amp;nbsp;disagrees). &amp;nbsp;Text-based modes of communication like SMS, Tweets and Facebook messages are going to dominate the future. &amp;nbsp; In the future, I hope to expand Frync into a more powerful &lt;i&gt;Friend Relationship Management&lt;/i&gt; tool that will help you to be in sync with friends across multiple modes of communication.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-227742976870731178?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/227742976870731178/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=227742976870731178' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/227742976870731178'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/227742976870731178'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2011/01/frync-my-first-android-app.html' title='Frync: My first android app'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_eD0ZHwCnHPo/TSwhULzE3oI/AAAAAAAAC5c/8UF1KyKUv8o/s72-c/frm5.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-846199980245771689</id><published>2010-07-14T22:57:00.000-07:00</published><updated>2010-07-15T08:29:05.111-07:00</updated><title type='text'>A Guided Tour of the Hadoop Zoo: Querying the Data</title><content type='html'>In a &lt;a href="http://csgrad.blogspot.com/2010/07/guided-tour-of-hadoop-zoo-getting-data.html"&gt;previous post&lt;/a&gt;, we saw how we can get our data into hadoop.&amp;nbsp; The next step is to extract useful information from it.&amp;nbsp; Originally, this step required programmers to write multiple MapReduce programs in Java, and to carefully and meticulously orchestrate them such that a job runs only after the input required by it has been produced by a prior job. 
&amp;nbsp; If you care about execution speed and processing/memory efficiency, this is still the way to go.&amp;nbsp; However, if you are more concerned about speed of writing queries and want to avoid mundane boilerplate Java code, the projects described below can help.&lt;br /&gt;&lt;br /&gt;Let's start with &lt;a href="http://www.cascading.org/"&gt;Cascading&lt;/a&gt;.&amp;nbsp; Cascading is a Java library/API that enables us to easily assemble the data processing actions required to extract information from our data.&amp;nbsp; The following example, taken from the very detailed &lt;a href="http://www.cascading.org/1.1/userguide/html/ch02.html"&gt;Cascading User Guide&lt;/a&gt;, demonstrates how we can use Cascading to read each line of text from a file, parse it into words, and then count the number of times each word is encountered.&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;// define source and sink Taps.&lt;br /&gt;Scheme sourceScheme = new TextLine( new Fields( "line" ) );&lt;br /&gt;Tap source = new Hfs( sourceScheme, inputPath );&lt;br /&gt;&lt;br /&gt;Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );&lt;br /&gt;Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );&lt;br /&gt;&lt;br /&gt;// the 'head' of the pipe assembly&lt;br /&gt;Pipe assembly = new Pipe( "wordcount" );&lt;br /&gt;&lt;br /&gt;// For each input Tuple&lt;br /&gt;// parse out each word into a new Tuple with the field name "word"&lt;br /&gt;// regular expressions are optional in Cascading&lt;br /&gt;String regex = "(?&amp;lt;!\\pL)(?=\\pL)[^ ]*(?&amp;lt;=\\pL)(?!\\pL)";&lt;br /&gt;Function function = new RegexGenerator( new Fields( "word" ), regex );&lt;br /&gt;assembly = new Each( assembly, new Fields( "line" ), function );&lt;br /&gt;&lt;br /&gt;// group the Tuple stream by the "word" value&lt;br /&gt;assembly = new GroupBy( assembly, new Fields( "word" ) );&lt;br /&gt;&lt;br /&gt;// For every Tuple group&lt;br /&gt;// count the number of 
occurrences of "word" and store result in&lt;br /&gt;// a field named "count"&lt;br /&gt;Aggregator count = new Count( new Fields( "count" ) );&lt;br /&gt;assembly = new Every( assembly, count );&lt;br /&gt;&lt;br /&gt;// initialize app properties, tell Hadoop which jar file to use&lt;br /&gt;Properties properties = new Properties();&lt;br /&gt;FlowConnector.setApplicationJarClass( properties, Main.class );&lt;br /&gt;&lt;br /&gt;// plan a new Flow from the assembly using the source and sink Taps&lt;br /&gt;// with the above properties&lt;br /&gt;FlowConnector flowConnector = new FlowConnector( properties );&lt;br /&gt;Flow flow = flowConnector.connect( "word-count", source, sink, assembly );&lt;br /&gt;&lt;br /&gt;// execute the flow, block until complete&lt;br /&gt;flow.complete();&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The data is processed as it flows through the &lt;i&gt;pipes&lt;/i&gt; defined by the Cascading API.&amp;nbsp; Cascading converts the data flow pipe assembly into a collection of MapReduce jobs.&amp;nbsp; It takes care of orchestrating the jobs such that they are launched only after their dependencies are satisfied -- i.e., all jobs producing output of interest to the job are complete.&amp;nbsp; If any error occurs in the processing pipeline, Cascading triggers a notification callback function and can continue processing after copying the offending data to a special trap file.&amp;nbsp; If we are not satisfied with the data processing primitives offered by Cascading, we can write our own Java MapReduce jobs and incorporate them into the Cascading data flow.&lt;br /&gt;&lt;br /&gt;In Cascading, we don't have to write any Java code to implement MapReduce jobs via the Hadoop API. 
&amp;nbsp; There is hardly any boilerplate code.&amp;nbsp; However, we do have to write Java code to assemble the Cascading pipeline.&amp;nbsp; Can we do this without writing Java code?&amp;nbsp; &lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_eD0ZHwCnHPo/TDfupQzDnEI/AAAAAAAAC3o/RlE_jiBNEuE/s1600/pig-logo.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/_eD0ZHwCnHPo/TDfupQzDnEI/AAAAAAAAC3o/RlE_jiBNEuE/s320/pig-logo.gif" /&gt;&lt;/a&gt;&lt;/div&gt;Yes, we can... with Pig.&amp;nbsp; In Pig, the data processing pipeline is written in &lt;i&gt;Pig Latin&lt;/i&gt;.&amp;nbsp; Oundssay uchmay implersay anthay Avajay, ightray?&amp;nbsp; Maybe we should have just stuck with Java!?&amp;nbsp; No, we are not talking about &lt;a href="http://en.wikipedia.org/wiki/Pig_Latin"&gt;http://en.wikipedia.org/wiki/Pig_Latin&lt;/a&gt;; instead we are talking about &lt;a href="http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html"&gt;Hadoop Pig Latin&lt;/a&gt;.&amp;nbsp; Let us see what it looks like.&amp;nbsp; The following example, very closely based on the examples in the &lt;a href="http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html"&gt;Pig Latin Manual&lt;/a&gt;, computes the number of pets owned by each adult in a pet store.&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;all_data = LOAD 'petstore.dat' AS (owner:chararray, pet_type:chararray, pet_num:int, owner_age:int);&lt;br /&gt;DUMP all_data;&lt;br /&gt;(Alice,turtle,1,30)&lt;br /&gt;(Alice,goldfish,5,30)&lt;br /&gt;(Alice,cat,2,30)&lt;br /&gt;(Bob,dog,2,19)&lt;br /&gt;(Bob,cat,2,19) &lt;br /&gt;(Chris,dog,1,22) &lt;br /&gt;&lt;br /&gt;adult_data = FILTER all_data BY owner_age &amp;gt; 21;&lt;br /&gt;DUMP adult_data; &lt;br /&gt;(Alice,turtle,1,30)&lt;br /&gt;(Alice,goldfish,5,30)&lt;br /&gt;(Alice,cat,2,30)&lt;br /&gt;(Chris,dog,1,22) &lt;br 
/&gt;&lt;br /&gt;pets_by_owner = GROUP adult_data BY owner;&lt;br /&gt;&lt;br /&gt;DUMP pets_by_owner;&lt;br /&gt;(Alice,{(Alice,turtle,1,30),(Alice,goldfish,5,30),(Alice,cat,2,30)})&lt;br /&gt;(Chris,{(Chris,dog,1,22)})&lt;br /&gt;&lt;br /&gt;owner_pet_count = FOREACH pets_by_owner GENERATE group, SUM(adult_data.pet_num);&lt;br /&gt;DUMP owner_pet_count;&lt;br /&gt;(Alice,8L)&lt;br /&gt;(Chris,1L)&lt;/pre&gt;&lt;br /&gt;Above, we first load the data from our log file (most likely stored in HDFS), filter by age, group the pets by owner and finally output the total number of pets per owner.  We can supply User Defined Functions (UDFs) written in Java to support complex processing at any of the data processing stages listed above. For example, if the data is stored in some esoteric format, we can provide our own Java parser.  If the filter condition needs to be complex, we can supply our own filter condition written in Java.&lt;br /&gt;&lt;br /&gt;With Pig and Cascading, we can create our data processing pipeline much faster than manually writing a sequence of MapReduce jobs.&amp;nbsp; The main disadvantage of Pig and Cascading is that you must learn a new language or API, however simple it is.&amp;nbsp; Wouldn't it be great if we could just use a query language we are already familiar with?&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_eD0ZHwCnHPo/TD6NgobSr-I/AAAAAAAAC3w/Sm4h-wc4DBU/s1600/hive.jpeg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/_eD0ZHwCnHPo/TD6NgobSr-I/AAAAAAAAC3w/Sm4h-wc4DBU/s320/hive.jpeg" /&gt;&lt;/a&gt;&lt;/div&gt;That's what the folks at Facebook thought too.. 
and we got &lt;a href="http://hadoop.apache.org/hive/"&gt;Hive&lt;/a&gt;.&amp;nbsp; Hive enables us to query our data that is stored in HDFS using a SQL-like language, HQL (Hive Query Language).&amp;nbsp; In Hive, the pet store query will be:&lt;br /&gt;&lt;pre class="prettyprint"&gt;CREATE EXTERNAL TABLE pets (&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; owner STRING,&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; pet_type STRING,&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; pet_num INT,&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; owner_age INT&lt;br /&gt;)&lt;br /&gt;ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'&lt;br /&gt;LOCATION '/path/to/petstore.dat';&lt;br /&gt;&lt;br /&gt;SELECT owner, SUM(pet_num)&lt;br /&gt;FROM pets&lt;br /&gt;WHERE owner_age &amp;gt; 21&lt;br /&gt;GROUP BY owner;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;We first overlay a structure on top of our petstore.dat log file using the CREATE EXTERNAL TABLE statement.&amp;nbsp; Note that we do not modify or move the log file.&amp;nbsp; We can then run almost any SQL query against the table: group by, order by, nested queries and joins are all supported.&amp;nbsp; Hive compiles these queries into a sequence of MapReduce jobs.&amp;nbsp; The default Hive shell prints the query results to stdout.&amp;nbsp; We can also run HQL queries from within scripts and capture their output using the Hive JDBC and ODBC drivers.&lt;br /&gt;&lt;br /&gt;We can extend HQL with User Defined Functions written in Java.&amp;nbsp; For example, we can use a UDF that &lt;a href="http://www.jointhegrid.com/hive-udf-geo-ip-jtg/index.jsp"&gt;converts IP addresses to city names&lt;/a&gt; in our HQL queries.&amp;nbsp; We can also embed custom map/reduce tasks written in ANY language directly into HQL queries.&amp;nbsp; With a custom SerDe written in Java, we can run queries against log files that are in custom formats.&lt;br /&gt;&lt;br /&gt;Not everyone understands or likes SQL.&amp;nbsp; Are there any languages for those 
folks?&amp;nbsp; I know of at least one more -- &lt;a href="http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html"&gt;Cascalog&lt;/a&gt;, announced at the 2010 Hadoop Summit.&amp;nbsp; You may want to use Cascalog for querying data stored in Hadoop if you like Lisp, Datalog or functional programming.&amp;nbsp; Otherwise, Cascalog may look a bit alien.&amp;nbsp; Here's a query to count the number of occurrences of each word:&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;(?&amp;lt;- (stdout) [?word ?count] (sentence ?s) (split ?s :&amp;gt; ?word) (c/count ?count))&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Cascalog runs on top of &lt;a href="http://clojure.org/"&gt;Clojure&lt;/a&gt; (a functional programming language on the JVM) and uses Cascading to produce the MapReduce jobs.&lt;br /&gt;&lt;br /&gt;Let me conclude this post by talking about my language of preference.&amp;nbsp; I prefer Hive because I and a LOT of developers/analysts are very familiar with SQL.&amp;nbsp; Also, the database-like interface provided by Hive allows us to connect existing business intelligence tools like &lt;a href="http://www.microstrategy.com/news/pr_system/press_release.asp?ctry=167&amp;amp;id=2075"&gt;Microstrategy with Hive&lt;/a&gt; (another tool example: &lt;a href="http://www.intellicus.com/about/news_room/news_room.htm"&gt;Intellicus&lt;/a&gt;) and perform complex analysis with ease.&amp;nbsp; If I wanted tight programmatic control of my MapReduce jobs, I would most likely use Cascading.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-846199980245771689?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/846199980245771689/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=846199980245771689' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/846199980245771689'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/846199980245771689'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2010/07/guided-tour-of-hadoop-zoo-querying-data.html' title='A Guided Tour of the Hadoop Zoo: Querying the Data'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_eD0ZHwCnHPo/TDfupQzDnEI/AAAAAAAAC3o/RlE_jiBNEuE/s72-c/pig-logo.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-2381191867781610882</id><published>2010-07-05T23:32:00.000-07:00</published><updated>2010-07-05T23:36:32.011-07:00</updated><title type='text'>A Guided Tour of the Hadoop Zoo: Getting Data In</title><content type='html'>You probably have tens or even thousands of servers producing loads of data, logging each and every interaction of users with your web application.&amp;nbsp; To load these logs into HDFS, you have two options:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Write a script to periodically copy each file into HDFS using the HDFS command-line put command.&lt;/li&gt;&lt;li&gt;Change your logging system so that servers directly write to HDFS.&lt;/li&gt;&lt;/ol&gt;To do (1) scalably and reliably, you probably need to enlist a script ninja.&amp;nbsp; But for (2), you can simply rely upon the tortoises - &lt;b&gt;Chukwa&lt;/b&gt; and 
&lt;b&gt;Honu&lt;/b&gt;, or the water channel they probably never  swim in -- &lt;b&gt;Flume&lt;/b&gt; (I am trying hard to keep the zoo setting here.&amp;nbsp; Don't  think I will make it much further.), or &lt;b&gt;Sqoop&lt;/b&gt; (Ok.&amp;nbsp; I am out.&amp;nbsp; Did not expect to lose the animal analogy so early!), or &lt;b&gt;Scribe&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_eD0ZHwCnHPo/TDKVHUKQT7I/AAAAAAAAC3Q/MmWtptqT0i4/s1600/chukwa_logo_small.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="200" src="http://2.bp.blogspot.com/_eD0ZHwCnHPo/TDKVHUKQT7I/AAAAAAAAC3Q/MmWtptqT0i4/s200/chukwa_logo_small.jpg" width="171" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;a href="http://hadoop.apache.org/chukwa/"&gt;Chukwa&lt;/a&gt; is a distributed data  collection and processing platform.&amp;nbsp; A Chukwa agent runs on each  application server.&amp;nbsp; The application sends logs to the Chukwa agent via  files or UDP packets.&amp;nbsp; The agent forwards the logs to a handful of Collectors.&amp;nbsp; These collectors aggregate logs from hundreds of agents  and write them into HDFS as big files (HDFS is better at serving a small  number of large files, rather than a large number of small files).&amp;nbsp; A  MapReduce job archives and demuxes these log  files, every few minutes. 
Archiving involves rewriting the log files so that logs of the same type (say, logs from application X on cluster Z) are written together on disk.&amp;nbsp; Demuxing involves parsing the log files to extract structured data (say, key-value pairs), which can subsequently be loaded into a database and queried through the Hadoop Infrastructure Care Center web portal.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://wiki.github.com/jboulon/Honu/"&gt;Honu&lt;/a&gt;, recently open-sourced by Netflix, is very similar to Chukwa.&amp;nbsp; Chukwa's original focus was to aggregate and query log files generated by a Hadoop cluster.&amp;nbsp; In contrast, Honu's focus is to directly stream (non-Hadoop) application logs to stable storage and provide a simple interface to query them.&amp;nbsp; Additionally, Honu focuses on achieving this in the cloud -- for example using Amazon Elastic Compute Cloud (EC2), Simple Storage Service (S3) and Elastic MapReduce (EMR).&amp;nbsp; To use Honu, applications write their log messages through the Honu client-side SDK.&amp;nbsp; The SDK forwards the logs to collectors (no intermediary agents like in Chukwa), which continuously save the log files to HDFS.&amp;nbsp; Periodic MapReduce jobs process the log files and save them in a format that can be queried through the SQL-like interface provided by Hive (to be discussed in a subsequent post).&amp;nbsp; Unlike Chukwa, Honu can collect logs from non-Java applications.&lt;br /&gt;&lt;br /&gt;Flume is an even more recent distributed log collection and processing system, announced by Cloudera at the 2010 Hadoop Summit.&amp;nbsp; Flume shares the same basic architecture as Chukwa and Honu -- agents gather logs from applications and forward them to collectors that aggregate and store them.&amp;nbsp; What sets Flume apart is its comprehensive built-in support for manageability, extensibility, and multiple degrees of reliability, and the 
extensive &lt;a href="http://archive.cloudera.com/cdh/3/flume/UserGuide.html"&gt;documentation&lt;/a&gt;.  The entire log collection data flow is defined and managed from a  centralized Flume Master (Web UI/console).&amp;nbsp; The administrator can  specify the input sources and their format (syslogd, apache logs, text  file, scribe, twitter, RPC,custom, etc), the collectors that the agents  talk to and their failover paths, the data collection sinks (HDFS, RPC,  scribe, text file, etc) and their output format (avro, json, log4j,  syslog), how log events should be bucketed into different directories  based on their meta data dictionary, the reliability level to be used  (end-to-end, store-on-failure, best effort), and much more, all through  the Flume Master without having to restart the agents and collectors.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_eD0ZHwCnHPo/TDLAc1Xw14I/AAAAAAAAC3Y/AKBakt4mVy8/s1600/sqoop.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/_eD0ZHwCnHPo/TDLAc1Xw14I/AAAAAAAAC3Y/AKBakt4mVy8/s320/sqoop.png" /&gt;&lt;/a&gt;&lt;/div&gt;Chukwa, Honu and Flume help you load  large volumes of logs into HDFS for further analysis.&amp;nbsp; What if your  analysis involves data (for example, user profile information) locked up  in relational databases?&amp;nbsp; Instead of directly hitting and potentially  slowing down production relational databases, it is better to  periodically dump the tables of interest into Hadoop before analysis. 
&lt;a href="http://www.cloudera.com/blog/2009/06/introducing-sqoop/"&gt;Sqoop&lt;/a&gt;, also from Cloudera, makes loading large amounts of data from a relational database into Hadoop (or even Hive) just one command line away.&amp;nbsp; Sqoop takes care of automatically retrieving the table structure using JDBC, creating Java classes to aid MapReduce, and efficiently bulk loading the data into HDFS using the specific database's bulk export mechanisms.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_eD0ZHwCnHPo/TDLKmRW3OrI/AAAAAAAAC3g/frJOf_017wc/s1600/scribe.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/_eD0ZHwCnHPo/TDLKmRW3OrI/AAAAAAAAC3g/frJOf_017wc/s320/scribe.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;a href="http://wiki.github.com/facebook/scribe/"&gt;Scribe&lt;/a&gt;, from Facebook, is a streaming log collection system that has been available for over a year now.&amp;nbsp; An application uses the Scribe client-side library to send log messages to the Scribe server running on the same machine.&amp;nbsp; This local Scribe server aggregates the log messages and forwards them to a central group of one or more Scribe servers.&amp;nbsp; These central servers write the aggregated logs to a distributed file system (like HDFS) or forward them further to another layer of Scribe servers.&amp;nbsp; Log messages are routed to different destination files based on their user-defined category, controlled by dynamic configuration files.&amp;nbsp; A local Scribe server stores the log messages on local disk and retries later if the central server to which it wishes to send logs is not reachable.&amp;nbsp; Unlike Chukwa, Honu and Flume, which are implemented in Java, Scribe is implemented in C++.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/20761651-2381191867781610882?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/2381191867781610882/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=2381191867781610882' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/2381191867781610882'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/2381191867781610882'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2010/07/guided-tour-of-hadoop-zoo-getting-data.html' title='A Guided Tour of the Hadoop Zoo: Getting Data In'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_eD0ZHwCnHPo/TDKVHUKQT7I/AAAAAAAAC3Q/MmWtptqT0i4/s72-c/chukwa_logo_small.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-36201886589682429</id><published>2010-07-05T22:50:00.000-07:00</published><updated>2010-07-05T22:50:23.329-07:00</updated><title type='text'>A Guided Tour of the Hadoop Zoo: Welcome</title><content type='html'>It's a fine sunny Sunday, a perfect day to do something outdoors, like visit the zoo.&amp;nbsp; However, if you are lazy like I am, how about visiting the Hadoop Zoo.&amp;nbsp; From elephants to elephant bees to elephant birds, the Hadoop Zoo has got enough 'animals' to rival a zoo.&lt;br /&gt;&lt;br /&gt;Let's start with the 
basic question.&amp;nbsp; Why would you want to visit the Hadoop Zoo?&amp;nbsp; The most common answer is that you have a lot of data (by a lot, I mean giga/tera/petabytes) that you want to store and do something useful with.&amp;nbsp; The various animals, and actually, lots of non-animals, help you do exactly this.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_eD0ZHwCnHPo/TDE5Z81AZOI/AAAAAAAAC3I/pq4hkLnrOUg/s1600/hadoop-logo.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/_eD0ZHwCnHPo/TDE5Z81AZOI/AAAAAAAAC3I/pq4hkLnrOUg/s320/hadoop-logo.jpeg" /&gt;&lt;/a&gt;&lt;/div&gt;Let's start with the cute elephant, Hadoop, the central attraction of the zoo.&amp;nbsp; The Hadoop Distributed File System (HDFS) and the Hadoop MapReduce framework are the core components of Hadoop.&amp;nbsp; HDFS stores gigantic amounts of data in a distributed, scalable and reliable fashion.&amp;nbsp; The Hadoop MapReduce framework helps you write programs in Java (and some other languages) to efficiently process your data into valuable information.&lt;br /&gt;&lt;br /&gt;That's it - HDFS and the MapReduce framework are all you need to store and process a huge amount of data.&amp;nbsp; And that's all you had a few years ago.&amp;nbsp; But now, there are more animals in the zoo that make your visit more fun and your life easier.&lt;br /&gt;&lt;br /&gt;This series of blog posts is an attempt to record and expand my understanding of the various components of the Hadoop ecosystem.&amp;nbsp; I am by no means a Hadoop expert.&amp;nbsp; So, if you find something wrong or missing, please do send me an email or add a comment.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-36201886589682429?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link 
rel='replies' type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/36201886589682429/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=36201886589682429' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/36201886589682429'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/36201886589682429'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2010/07/guided-tour-of-hadoop-zoo-welcome.html' title='A Guided Tour of the Hadoop Zoo: Welcome'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_eD0ZHwCnHPo/TDE5Z81AZOI/AAAAAAAAC3I/pq4hkLnrOUg/s72-c/hadoop-logo.jpeg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-20761651.post-3712979999844640516</id><published>2010-04-11T14:18:00.000-07:00</published><updated>2010-04-11T14:21:28.637-07:00</updated><title type='text'>Running HQL from Python without using the Hive Standalone Server</title><content type='html'>To &lt;a href="http://wiki.apache.org/hadoop/Hive/HiveClient"&gt;use a language other than Java (say python) with Hive&lt;/a&gt;, you must use the &lt;a href="http://wiki.apache.org/hadoop/Hive/HiveServer"&gt;Hive Standalone Server&lt;/a&gt;. 
The main disadvantage of using the Hive Standalone Server is that it is currently single-threaded [&lt;a href="https://issues.apache.org/jira/browse/HIVE-80"&gt;HIVE-80&lt;/a&gt;].&amp;nbsp; Additionally, there is the inconvenience of running an additional server.&lt;br /&gt;&amp;nbsp; &lt;br /&gt;We can solve this problem by using &lt;a href="http://www.jython.org/"&gt;Jython&lt;/a&gt; (and possibly &lt;a href="http://jruby.org/"&gt;JRuby&lt;/a&gt;).&amp;nbsp; Jython enables us to use Hive's Java client library to execute the HQL query and retrieve the results.&amp;nbsp; We can then process the results in pure Python. &lt;br /&gt;&lt;br /&gt;Let us try it out:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;STEP 1:&lt;/b&gt;&lt;br /&gt;&lt;a href="http://www.jython.org/downloads.html"&gt;Download&lt;/a&gt; and install Jython.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;STEP 2:&lt;/b&gt;&lt;br /&gt;Make sure you have the following jars and directories in your CLASSPATH:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;hive-service-0.6.0.jar&lt;/li&gt;&lt;li&gt;libfb303.jar&lt;/li&gt;&lt;li&gt;log4j-1.2.15.jar&lt;/li&gt;&lt;li&gt;antlr-runtime-3.0.1.jar&lt;/li&gt;&lt;li&gt;derby.jar&lt;/li&gt;&lt;li&gt;jdo2-api-2.3-SNAPSHOT.jar&lt;/li&gt;&lt;li&gt;commons-logging-1.0.4.jar&lt;/li&gt;&lt;li&gt;datanucleus-core-1.1.2.jar&lt;/li&gt;&lt;li&gt;datanucleus-enhancer-1.1.2.jar&lt;/li&gt;&lt;li&gt;datanucleus-rdbms-1.1.2.jar&lt;/li&gt;&lt;li&gt;hive-exec-0.6.0.jar&lt;/li&gt;&lt;li&gt;hive-jdbc-0.6.0.jar&lt;/li&gt;&lt;li&gt;hive-metastore-0.6.0.jar&lt;/li&gt;&lt;li&gt;commons-lang-2.4.jar&lt;/li&gt;&lt;li&gt;hadoopcore/hadoop-0.20.0/hadoop-0.20.0-core.jar&lt;/li&gt;&lt;li&gt;/usr/lib/hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar&lt;/li&gt;&lt;li&gt;conf (this is your hive installation's build/dist/conf directory)&lt;/li&gt;&lt;/ul&gt;Jar locations and versions may be different in your hive installation.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;STEP 3:&lt;/b&gt;&lt;br 
/&gt;Create a test data file /tmp/test.dat with the following lines:&lt;br /&gt;&lt;pre class="prettyprint"&gt;1:one&lt;br /&gt;2:two&lt;br /&gt;3:three&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;b&gt;STEP 4:&lt;/b&gt;&lt;br /&gt;Run the following Jython script:&lt;br /&gt;&lt;pre class="prettyprint"&gt;from java.lang import *&lt;br /&gt;from java.sql import *&lt;br /&gt;&lt;br /&gt;driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";&lt;br /&gt;&lt;br /&gt;try:&lt;br /&gt;  Class.forName(driverName);&lt;br /&gt;except Exception, e:&lt;br /&gt;  print "Unable to load %s" % driverName&lt;br /&gt;  System.exit(1);&lt;br /&gt;&lt;br /&gt;conn = DriverManager.getConnection("jdbc:hive://");&lt;br /&gt;stmt = conn.createStatement();&lt;br /&gt;&lt;br /&gt;# Drop table&lt;br /&gt;#stmt.executeQuery("DROP TABLE testjython")&lt;br /&gt;&lt;br /&gt;# Create a table&lt;br /&gt;res = stmt.executeQuery("CREATE TABLE testjython (key int, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'")&lt;br /&gt;&lt;br /&gt;# Show tables&lt;br /&gt;res = stmt.executeQuery("SHOW TABLES")&lt;br /&gt;print "List of tables:"&lt;br /&gt;while res.next():&lt;br /&gt;    print res.getString(1)&lt;br /&gt;&lt;br /&gt;# Load some data&lt;br /&gt;res = stmt.executeQuery("LOAD DATA LOCAL INPATH '/tmp/test.dat' INTO TABLE testjython")&lt;br /&gt;&lt;br /&gt;# SELECT the data&lt;br /&gt;res = stmt.executeQuery("SELECT * FROM testjython")&lt;br /&gt;print "Listing contents of table:"&lt;br /&gt;while res.next():&lt;br /&gt;    print res.getInt(1), res.getString(2)&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;You should see the following output, amidst a whole lot of debug statements:&lt;br /&gt;1 one&lt;br /&gt;2 two&lt;br /&gt;3 three&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/20761651-3712979999844640516?l=csgrad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' 
type='application/atom+xml' href='http://csgrad.blogspot.com/feeds/3712979999844640516/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=20761651&amp;postID=3712979999844640516' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/3712979999844640516'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/20761651/posts/default/3712979999844640516'/><link rel='alternate' type='text/html' href='http://csgrad.blogspot.com/2010/04/to-use-language-other-than-java-say.html' title='Running HQL from Python without using the Hive Standalone Server'/><author><name>Dilip Joseph</name><uri>https://profiles.google.com/101964878145903134320</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-l2G0IfSXbg0/AAAAAAAAAAI/AAAAAAAAC9I/uqEkVKk7XCk/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry></feed>
