The Ex CS Grad Student: Accessing S3 data in Spark

Before running a Spark job on EC2, the input data is typically copied from S3 to a local HDFS cluster. The Spark jobs read the data from HDFS instead of directly from S3. When I tried making the Spark job read directly from S3 by specifying a path of the form s3n://AWS_KEY_ID:AWS_SECRET_KEY@BUCKETNAME/mydata, I kept getting the following exception:

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/mydata' - ResponseCode=403, ResponseMessage=Forbidden

My AWS_SECRET_KEY contained a "/", which I correctly escaped as %2F, but I still kept getting the exception. I was able to work around the problem by applying the following patch to Spark.

When I run my Spark program, I pass in the access key ID and secret key on the command line as follows:

scala -Djava.library.path=/root/mesos/build/src/.libs -Dmesos.master=master@MASTER_HOST:5050 -DawsAccessKeyId=MyAccessKeyId -DawsSecretAccessKey=MySecretKey ...

The path to the data is simply specified as s3n://BUCKETNAME/mydata. There is no need to specify the AWS key id and secret key in the URI anymore.
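To illustrate the general idea, here is a minimal sketch of how the two -D values from the command line could be mapped onto the Hadoop property names that the s3n filesystem reads its credentials from (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey). This is not the patch itself: the object name S3Credentials and the method fromSystemProperties are hypothetical, and the sketch only builds the property map without touching Spark or Hadoop.

```scala
// Hypothetical helper for illustration; not part of Spark or Hadoop.
object S3Credentials {
  // Property names the Hadoop s3n filesystem looks up for credentials.
  val AccessKeyProp = "fs.s3n.awsAccessKeyId"
  val SecretKeyProp = "fs.s3n.awsSecretAccessKey"

  // Read the values passed as -DawsAccessKeyId / -DawsSecretAccessKey
  // and key them by the corresponding Hadoop property name.
  // A property that was not set on the command line is simply left out.
  def fromSystemProperties(): Map[String, String] = {
    val pairs = Seq(
      AccessKeyProp -> Option(System.getProperty("awsAccessKeyId")),
      SecretKeyProp -> Option(System.getProperty("awsSecretAccessKey"))
    )
    pairs.collect { case (name, Some(value)) => name -> value }.toMap
  }
}
```

Each entry in the returned map could then be copied into a Hadoop Configuration with conf.set(name, value) before the s3n:// path is opened. Since the secret key never appears in the URI, a "/" in it does not need to be escaped.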

There is probably a better way to do this. If you know of one, please leave a comment.


Tuesday, July 10, 2012

