Sunday, August 26, 2012

Java Concurrency in Practice - Summary - Part 1

This is part 1 of my notes from reading Java Concurrency in Practice.



NOTE: These summaries are NOT meant to replace the book.  I highly recommend buying your own copy of the book if you haven't already read it.

Chapter 1 - Introduction
  1. Writing correct concurrent programs is very hard.
  2. Threads are the easiest way to effectively use multi-processor systems, which are now ubiquitous.
  3. When writing multi-threaded programs, we must pay attention to the following:
    1. Safety - Nothing bad ever happens, i.e., the program behaves correctly regardless of how the execution of its threads is interleaved.
    2. Liveness - Something good eventually happens.  For example, no deadlock.
    3. Performance - Something good happens fast enough.  For example, no excessive context switches.
  4. Many Java frameworks (GUI toolkits, RMI, Timers, etc.) use threads internally and may call your code from those threads.  So your code must be thread-safe even if you never create a thread explicitly.

Chapter 2 - Thread Safety

  1. Writing concurrent programs is all about correctly managing access to shared, mutable state.  Threads and locks are just the mechanisms for doing so.
    1. An object's state = any data that can affect its externally visible behavior.  
    2. An object's mutable state needs to be protected from uncontrolled concurrent access from multiple threads.
  2. A class is thread-safe if it continues to behave correctly when accessed from multiple threads, with no additional synchronization or coordination required of the calling code.  In the absence of a formal specification (i.e., invariants constraining an object's fields, postconditions defining the effect of operations on the object, etc.), we assume that the single-threaded behavior of a class is its correct behavior (after verification, of course!).
    1. It is much easier to design a class to be thread-safe than to retrofit thread-safety into it later.
    2. It is easier to make a class thread-safe if its state is private.  In other words, follow good OO practices.
    3. Thread-safe classes encapsulate any needed synchronization so that calling code need not provide its own.
  3. Stateless objects are always thread-safe.
  4. The most common race condition is associated with check-then-act sequences.  Lazy initialization of expensive objects is a common place where check-then-act is used.
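  5. Race condition != data race.  A race condition occurs when the correctness of a computation depends on the relative timing of threads.  A data race occurs when one thread writes a variable and another thread reads or writes it without synchronization ordering the two accesses - the reading thread may see stale or inconsistent data.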
  6. If all you need is a thread-safe counter, just use java.util.concurrent.atomic.AtomicLong.  If multiple pieces of state are involved, this is not sufficient - further synchronization is necessary.
  7. The synchronized block is Java's built-in locking mechanism for enforcing atomicity.
    1. A synchronized block has two parts: an object that serves as the lock, and a block of code guarded by that lock.
    2. Every Java object can act as a lock for a synchronized block.  These built-in locks are called intrinsic or monitor locks.
    3. Intrinsic locks are mutexes; i.e., at most one thread can own a given lock at a time.
    4. Intrinsic locks are reentrant - a thread can immediately reacquire a lock that it is already holding.
  8. Each mutable variable that is read/written from multiple threads must be guarded by synchronization with the SAME lock object EVERY TIME it is read/written.
    1. Use @GuardedBy("lockobject") annotation on each mutable shared variable to document the locking strategy.
  9. For every invariant that involves more than one variable, all the variables involved in the invariant must be guarded by the same lock.
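
To make points 4, 6, 8 and 9 concrete, here is a minimal sketch of my own (it is not taken from the book).  The class names HitCounter, LazyHolder and Range, and the helper runConcurrently, are made up for illustration.  The @GuardedBy annotation lives in the separate net.jcip.annotations jar, so I document the locking strategy with plain comments to keep the example runnable on a bare JDK.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ThreadSafetyNotes {

    /** Point 6: a lone counter needs nothing more than AtomicLong. */
    static final class HitCounter {
        private final AtomicLong hits = new AtomicLong();

        long record() { return hits.incrementAndGet(); }
        long get() { return hits.get(); }
    }

    /** Points 4 and 8: lazy initialization is check-then-act, so the check
     *  and the act must happen while holding the same lock. */
    static final class LazyHolder {
        private final Object lock = new Object();
        private StringBuilder expensive; // guarded by lock
        private int creations;           // guarded by lock

        StringBuilder get() {
            synchronized (lock) {
                if (expensive == null) {                       // check
                    expensive = new StringBuilder("expensive"); // act
                    creations++;
                }
                return expensive;
            }
        }

        int creations() {
            synchronized (lock) { return creations; }
        }
    }

    /** Point 9: the invariant lower <= upper spans two variables, so both
     *  are guarded by the same lock (the intrinsic lock of this object). */
    static final class Range {
        private int lower = 0;  // guarded by this
        private int upper = 10; // guarded by this

        synchronized boolean setLower(int value) {
            if (value > upper) return false; // would break the invariant
            lower = value;
            return true;
        }

        synchronized boolean setUpper(int value) {
            if (value < lower) return false; // would break the invariant
            upper = value;
            return true;
        }

        synchronized boolean isValid() { return lower <= upper; }
    }

    /** Hypothetical helper: runs the task on several threads and waits for all. */
    static void runConcurrently(int threads, Runnable task) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(task);
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        HitCounter counter = new HitCounter();
        LazyHolder holder = new LazyHolder();
        runConcurrently(8, () -> {
            for (int i = 0; i < 1000; i++) {
                counter.record();
                holder.get();
            }
        });
        System.out.println(counter.get());      // prints 8000
        System.out.println(holder.creations()); // prints 1
    }
}
```

If the synchronized block in LazyHolder.get() were removed, two threads could both see null and both create the object, which is exactly the check-then-act race from point 4.  Note also that Range could not be fixed by turning lower and upper into two separate atomic variables, because the invariant relates both of them.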

Tuesday, July 10, 2012

Accessing S3 data in Spark

Before running a Spark job on EC2, the input data is typically copied from S3 to a local HDFS cluster. The Spark jobs read the data from HDFS instead of directly from S3.  When I tried making the Spark job read directly from S3 by specifying a path of the form s3n://AWS_KEY_ID:AWS_SECRET_KEY@BUCKETNAME/mydata, I kept getting the following exception:

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/mydata' - ResponseCode=403, ResponseMessage=Forbidden

My AWS_SECRET_KEY contained a "/", which I had correctly escaped as %2F, but I still kept getting the exception.  I was able to work around the problem by applying the following patch to Spark.
When I run my Spark program, I pass in the access key id and secret key from the command line as follows:

scala  -Djava.library.path=/root/mesos/build/src/.libs -Dmesos.master=master@MASTER_HOST:5050 -DawsAccessKeyId=MyAccessKeyId -DawsSecretAccessKey=MySecretKey ...

The path to the data is simply specified as s3n://BUCKETNAME/mydata.  There is no need to specify the AWS key id and secret key in the URI anymore.

There is probably a better way to do this.  If you know of one, please leave a comment.

Thursday, June 21, 2012

tcpdump tutorial

Today, I needed to use tcpdump for the first time in a very long time.  So I did the most natural thing: I googled "tcpdump tutorial".  I was pleasantly surprised to find that the third link in the search results was a tcpdump tutorial I had written myself 6 years ago, while I was a Teaching Assistant for the undergrad networking class at UC Berkeley: http://inst.eecs.berkeley.edu/~ee122/fa06/projects/tcpdump-6up.pdf

Looks like this tiny tutorial I wrote has had more impact than my PhD thesis :-)