Tuesday, July 10, 2012

Accessing S3 data in Spark

Before running a Spark job on EC2, one typically copies the input data from S3 into a local HDFS cluster, and the Spark jobs then read the data from HDFS instead of directly from S3. When I tried making a Spark job read directly from S3 by specifying a path of the form s3n://AWS_KEY_ID:AWS_SECRET_KEY@BUCKETNAME/mydata, I kept getting the following exception:

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/mydata' - ResponseCode=403, ResponseMessage=Forbidden

My AWS_SECRET_KEY contained a "/", which I did correctly escape as %2F, but I still kept getting the exception. I was able to work around the problem by patching Spark to read the credentials from Java system properties instead of from the URI.
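For reference, the percent-encoding itself is straightforward; here is a sketch using java.net.URLEncoder (as noted above, escaping alone did not make the 403 go away for me):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EscapeSecretKey {
    // Percent-encode an AWS secret key before embedding it in an s3n:// URI,
    // so that "/" becomes "%2F". URLEncoder follows the
    // application/x-www-form-urlencoded rules, which coincide with plain
    // percent-encoding for the characters ("/", "+", "=") that can appear
    // in AWS secret keys.
    public static String escape(String secretKey) {
        try {
            return URLEncoder.encode(secretKey, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }
}
```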
When I run my Spark program, I pass in the access key id and secret key on the command line as follows:

scala  -Djava.library.path=/root/mesos/build/src/.libs -Dmesos.master=master@MASTER_HOST:5050 -DawsAccessKeyId=MyAccessKeyId -DawsSecretAccessKey=MySecretKey ...

The path to the data is simply specified as s3n://BUCKETNAME/mydata.  There is no need to specify the AWS key id and secret key in the URI anymore.
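Presumably, the patched code just pulls the credentials out of the Java system properties set by the -D flags above. A minimal sketch of that idea (the class and method names here are hypothetical, not the actual Spark patch):

```java
// Hypothetical sketch: read AWS credentials from the -DawsAccessKeyId and
// -DawsSecretAccessKey system properties instead of parsing them out of
// the s3n:// URI.
public class S3CredentialProperties {
    public static String accessKeyId() {
        return require("awsAccessKeyId");
    }

    public static String secretAccessKey() {
        return require("awsSecretAccessKey");
    }

    private static String require(String name) {
        String value = System.getProperty(name);
        if (value == null) {
            throw new IllegalStateException("System property -D" + name + " is not set");
        }
        return value;
    }
}
```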

There probably is some better way to do this.  If someone knows how, please do leave a comment.

Thursday, June 21, 2012

tcpdump tutorial

Today, I needed to use tcpdump after a very long time. Hence, I did the most natural thing: Google "tcpdump tutorial". I was pleasantly surprised to find that the 3rd link in the search results was a tcpdump tutorial I had written myself 6 years ago while I was a Teaching Assistant for the undergrad networking class at UC Berkeley: http://inst.eecs.berkeley.edu/~ee122/fa06/projects/tcpdump-6up.pdf

Looks like this tiny tutorial I wrote has had more impact than my PhD thesis :-) 

Dog-pile Effect: Squid versus Apache Traffic Server

When using a forward web proxy cache, it is possible to encounter the dog-pile effect. When a new page suddenly becomes very popular or when a popular page expires from the cache, the proxy will receive a large number of requests for the page at the same time.  There are two ways to handle this:
  1. Since the page is not already in the cache, the proxy forwards each request to the origin server, OR
  2. The proxy forwards the first request to the origin server, and queues the others till the response to the first request fills the cache.
Option 1 leads to the dog-pile effect.  The origin server is rapidly bombarded with a large number of requests.  This is usually problematic.  The server slows down and the requests keep piling up.

Option 2 is called connection collapsing or collapsed forwarding.  Squid supports this feature - http://www.squid-cache.org/Doc/config/collapsed_forwarding/.  However, Apache Traffic Server currently does not support it (thanks to the super-responsive folks on the traffic-server IRC channel for confirming this).  It used to be supported, but was removed since the implementation was buggy: http://mail-archives.apache.org/mod_mbox/trafficserver-commits/201102.mbox/%3C20110209171030.3247A23889BF@eris.apache.org%3E
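The core of option 2 fits in a few lines: keep a map of in-flight fetches keyed by URL, and have every cache miss for the same URL join the first request's future rather than contact the origin itself. A toy sketch (names like CollapsingCache are made up for illustration):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Toy sketch of collapsed forwarding: concurrent misses for the same URL
// share one origin fetch instead of each hitting the origin server.
public class CollapsingCache {
    private final ConcurrentHashMap<String, CompletableFuture<String>> inFlight =
            new ConcurrentHashMap<>();
    private final Function<String, String> fetchFromOrigin;

    public CollapsingCache(Function<String, String> fetchFromOrigin) {
        this.fetchFromOrigin = fetchFromOrigin;
    }

    public String get(String url) {
        // computeIfAbsent is atomic: only the first miss for a URL starts an
        // origin fetch; every later request for that URL joins the same future.
        CompletableFuture<String> pending = inFlight.computeIfAbsent(
                url, u -> CompletableFuture.supplyAsync(() -> fetchFromOrigin.apply(u)));
        return pending.join();
    }
}
```

In a real proxy the completed response would then be stored in the cache and the entry removed from the in-flight map; the sketch only shows the collapsing step itself.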