We can solve this problem by using Jython (and possibly JRuby). Jython lets us use Hive's Java client library (the JDBC driver) to execute an HQL query and retrieve the results, which we can then process in pure Python.
Let us try it out:
STEP 1:
Download and install Jython.
STEP 2:
Make sure you have the following jars and directories in your CLASSPATH.
- hive-service-0.6.0.jar
- libfb303.jar
- log4j-1.2.15.jar
- antlr-runtime-3.0.1.jar
- commons-logging-1.0.4.jar
- datanucleus-core-1.1.2.jar
- datanucleus-enhancer-1.1.2.jar
- datanucleus-rdbms-1.1.2.jar
- hive-exec-0.6.0.jar
- hive-jdbc-0.6.0.jar
- hive-metastore-0.6.0.jar
- derby.jar
- jdo2-api-2.3-SNAPSHOT.jar
- commons-lang-2.4.jar
- hadoopcore/hadoop-0.20.0/hadoop-0.20.0-core.jar
- /usr/lib/hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar
- conf (this is your hive installation's build/dist/conf directory)
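For reference, the CLASSPATH above can be assembled with a small shell script along these lines. This is only a sketch: the HIVE_HOME and HADOOP_HOME locations are assumptions and must be adjusted to match your installation and jar versions.

```shell
# Sketch: build the CLASSPATH from the Hive lib directory and the Hadoop core jar.
# HIVE_HOME and HADOOP_HOME are assumed locations -- adjust for your setup.
HIVE_HOME=${HIVE_HOME:-/usr/lib/hive}
HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop-0.20}

# Start with the conf directory, then append every jar under $HIVE_HOME/lib
# plus the Hadoop core jar.
CLASSPATH=$HIVE_HOME/conf
for jar in "$HIVE_HOME"/lib/*.jar "$HADOOP_HOME"/hadoop-0.20.0-core.jar; do
    CLASSPATH=$CLASSPATH:$jar
done
export CLASSPATH
echo "$CLASSPATH"
```

Sourcing this before launching Jython saves listing each jar by hand.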
STEP 3:
Create a test data file /tmp/test.dat with the following lines:
1:one
2:two
3:three
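The data file can also be generated from Python. This is a sketch; the ':' delimiter is chosen to match the ROW FORMAT used in the table definition in the next step.

```python
# Sketch: write /tmp/test.dat with one key:value record per line.
# The ':' delimiter must agree with FIELDS TERMINATED BY ':' in the DDL.
rows = [(1, "one"), (2, "two"), (3, "three")]
with open("/tmp/test.dat", "w") as f:
    for key, value in rows:
        f.write("%d:%s\n" % (key, value))
```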
STEP 4:
Run the following Jython script:

    from java.lang import Class, System
    from java.sql import DriverManager

    driverName = "org.apache.hadoop.hive.jdbc.HiveDriver"
    try:
        Class.forName(driverName)
    except Exception, e:
        print "Unable to load %s" % driverName
        System.exit(1)

    conn = DriverManager.getConnection("jdbc:hive://")
    stmt = conn.createStatement()

    # Drop table
    #stmt.executeQuery("DROP TABLE testjython")

    # Create a table
    res = stmt.executeQuery("CREATE TABLE testjython (key int, value string) "
                            "ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'")

    # Show tables
    res = stmt.executeQuery("SHOW TABLES")
    print "List of tables:"
    while res.next():
        print res.getString(1)

    # Load some data
    res = stmt.executeQuery("LOAD DATA LOCAL INPATH '/tmp/test.dat' INTO TABLE testjython")

    # SELECT the data
    res = stmt.executeQuery("SELECT * FROM testjython")
    print "Listing contents of table:"
    while res.next():
        print res.getInt(1), res.getString(2)
You should see the following output, amidst a whole lot of debug statements:
1 one
2 two
3 three
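Since the JDBC ResultSet hands back ordinary values, the rows are easy to work with in pure Python afterwards. As a sketch, a hypothetical helper (not part of the script above) that drains a ResultSet of (int, string) rows into a plain list of tuples might look like this:

```python
# Sketch: `fetch_rows` is a hypothetical helper that assumes a two-column
# ResultSet (int key, string value) like the one from SELECT * FROM testjython.
def fetch_rows(res):
    rows = []
    # ResultSet.next() advances the cursor and returns False when exhausted.
    while res.next():
        rows.append((res.getInt(1), res.getString(2)))
    return rows
```

Once collected this way, the rows can be sorted, filtered, or joined with other data using normal Python code.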
1 comment:
Hi, thanks for the example. One problem, though: SELECT * FROM table runs fine, but a query with a WHERE clause, which spins off a MapReduce job, fails. Have you gotten around it?