This is part 8 of my notes from reading Java Concurrency in Practice.
NOTE: These summaries are NOT meant to replace the book. I highly recommend buying your own copy of the book if you haven't already read it.
Chapter 11 - Performance and Scalability
- Avoid premature optimization - first make it right, then make it fast, if not fast enough already (as indicated by actual performance measurements)
- Tuning for scalability is often different from tuning for performance, and the two goals are often at odds.
- Amdahl's Law : Speedup <= 1/( F + (1-F)/N) where F is the fraction of computation that must be executed serially, and N is the number of processors.
- A shared work queue adds some (often overlooked) serial processing. Result handling is another form of serialization hidden inside otherwise seemingly 100% concurrent programs.
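Amdahl's Law above can be evaluated directly. A minimal sketch (my own, not from the book; the class and method names are made up) showing how even a small serial fraction caps the achievable speedup:

```java
// Sketch: evaluating Amdahl's Law, Speedup <= 1 / (F + (1 - F) / N),
// where F is the serial fraction and N the number of processors.
public class AmdahlDemo {
    static double maxSpeedup(double serialFraction, int processors) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
    }

    public static void main(String[] args) {
        // With just 5% serial work, speedup falls well short of N.
        for (int n : new int[] {2, 4, 16, 64}) {
            System.out.printf("N=%d -> max speedup %.2f%n", n, maxSpeedup(0.05, n));
        }
    }
}
```

For F = 0.05, 16 processors yield a maximum speedup of only about 9.1, and 64 processors only about 15.4.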
- Costs of using threads
- context switches - managing shared data structures in OS and JVM take memory and CPU. Can also cause flurry of processor cache misses on a thread context switch.
- When a thread blocks on a lock, the JVM switches it out before it reaches the end of its scheduled CPU quantum, incurring extra context-switch overhead.
- Context switching costs 5000-10000 clock cycles (few microseconds). Use vmstat to find % of time program spent in the kernel. High % can indicate high context switching.
- synchronized and volatile result in the use of special CPU instructions called memory barriers, which flush/invalidate CPU caches, stall execution pipelines, and flush hardware write buffers; they also inhibit compiler optimizations, since operations cannot be reordered across them.
- Performance of contended and uncontended synchronization are very different. synchronized is optimized for the uncontended scenario (20 to 250 clock cycles). volatile is always uncontended.
- Modern JVMs can optimize away locking code that can be proven to never contend.
- Modern JVMs perform escape analysis to identify thread-confined objects and avoid locking them.
- Modern JVMs can do lock coarsening to merge multiple adjacent locks into a larger lock to avoid multiple lock/unlocks.
- Synchronization by one thread affects performance of other threads due to traffic on the shared memory bus.
- Uncontended synchronization can be handled entirely in JVM. Contended synchronization involves OS activity - OS needs to suspend the thread that loses the contention.
- Blocking can be implemented by spin-waiting or by suspending the thread via the OS. Spin-waiting is preferred for short waits; the JVM decides which to use based on profiling data about past performance.
- Reducing lock contention
- reduce duration for which locks are held.
- reduce the frequency at which locks are requested by making locking finer grained: lock splitting (for moderately contended locks) and lock striping (for heavily contended locks).
- replace exclusive locks with coordination mechanisms that permit greater concurrency.
- Lock striping - ConcurrentHashMap (as described in the book) uses an array of 16 locks; bucket N is guarded by lock N mod 16. Locking the entire collection for exclusive access is hard when lock striping is used.
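A minimal sketch of lock striping in the spirit of the book's StripedMap listing (simplified: buckets are indexed directly and hash-based chaining is omitted; names are illustrative):

```java
// Sketch: bucket n is guarded by lock n mod N_LOCKS, so threads touching
// different stripes can proceed concurrently.
public class StripedMap {
    private static final int N_LOCKS = 16;
    private final Object[] buckets;
    private final Object[] locks;

    public StripedMap(int numBuckets) {
        buckets = new Object[numBuckets];
        locks = new Object[N_LOCKS];
        for (int i = 0; i < N_LOCKS; i++) locks[i] = new Object();
    }

    public Object get(int n) {
        synchronized (locks[n % N_LOCKS]) {
            return buckets[n];
        }
    }

    public void put(int n, Object value) {
        synchronized (locks[n % N_LOCKS]) {
            buckets[n] = value;
        }
    }

    // Whole-map operations must take every stripe in turn, which is why
    // exclusive access to the entire collection is awkward under striping.
    public void clear() {
        for (int i = 0; i < buckets.length; i++) {
            synchronized (locks[i % N_LOCKS]) {
                buckets[i] = null;
            }
        }
    }
}
```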
- Avoid hot fields such as cached values - e.g. caching a Map's size converts an O(n) operation into an O(1) one, but the cached field becomes a point of contention. Use striped counters or atomic variables instead.
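As a concrete illustration of a striped counter (note: LongAdder arrived in Java 8, after the book was written - it is essentially a ready-made striped counter; the demo class and method names are my own):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// LongAdder spreads contended increments over per-thread cells instead of
// one hot field, and sum() combines the cells on demand.
public class StripedCounterDemo {
    static long countInParallel(int threads, int perThread) throws InterruptedException {
        LongAdder counter = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) counter.increment();
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return counter.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countInParallel(4, 100_000)); // prints 400000
    }
}
```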
- Alternatives to exclusive locks - concurrent collections, read-write locks, immutable objects, atomic variables.
- Do not use object pools. Object allocation and GC were slow in early versions of Java; now object allocation is faster than a C malloc - on the order of 10 machine instructions. Object pools also introduce synchronization overhead.
Chapter 12 - Testing Concurrent Programs
- Every test must wait until all the threads it created terminate, and should then report any failures in tearDown().
- Tests of blocking operations need some way to unblock a thread that has blocked as expected. This is usually done by performing the blocking operation in a new thread and interrupting it after waiting for some time; an InterruptedException indicates that the operation blocked as expected.
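The interrupt-based pattern can be sketched like this (a simplified standalone version, assuming a LinkedBlockingQueue; in the book the pattern is shown as a JUnit test against a bounded buffer):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: verify that take() blocks on an empty queue by running it in a
// helper thread, giving it time to block, then interrupting it. Success means
// take() threw InterruptedException (it was blocked) and the thread exited.
public class BlockingTestDemo {
    static boolean takeBlocksWhenEmpty() throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        final boolean[] blockedAsExpected = {false};
        Thread taker = new Thread(() -> {
            try {
                queue.take();                  // should block: queue is empty
                // if we get here, take() returned - the test should fail
            } catch (InterruptedException success) {
                blockedAsExpected[0] = true;
            }
        });
        taker.start();
        Thread.sleep(200);                     // crude, but lets take() block
        taker.interrupt();
        taker.join(1000);                      // fail if it never terminates
        return blockedAsExpected[0] && !taker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(takeBlocksWhenEmpty()); // prints "true"
    }
}
```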
- Thread.getState() should not be used for concurrency control or testing. Useful only for debugging.
- One approach to test producer-consumer programs is to check that everything that is put into a queue or buffer eventually comes out of it, and nothing else does.
- For single producer-single consumer designs, use an order-sensitive checksum of elements as they are added, and verify it as elements are removed. Do not use a synchronized shadow list to track the elements, as that introduces artificial serialization.
- For multiple producer-consumer designs, use an order insensitive checksum that can be combined at the end of the test to verify that all enqueued elements have been dequeued.
- Make sure that the checksums cannot be guessed by the compiler (e.g. consecutive integers), so that they are not precomputed. Use a simple random number generator such as: static int xorShift(int y) { y ^= (y << 6); y ^= (y >>> 21); y ^= (y << 7); return y; }
- Test on multi-processor machines with fewer processors than active threads.
- Generate more thread interleaving by using Thread.yield() to encourage more context switches during operations that access shared state.
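A small helper in the spirit of this advice (the names are my own; the book suggests yielding or sleeping occasionally rather than on every call, so that interleavings vary between runs):

```java
import java.util.Random;

// Sketch: call maybeYield() between accesses to shared state in test builds
// to encourage context switches at awkward moments.
public class Interleaver {
    private static final Random rnd = new Random();

    static void maybeYield() {
        // Yield only occasionally; yielding every time would itself impose
        // a fixed, unrepresentative schedule.
        if (rnd.nextInt(1000) == 0) {
            Thread.yield();
        }
    }
}
```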
- Always include some basic functionality testing when doing performance testing to make sure that you are not measuring performance of broken code.
- Non-fair semaphores provide better throughput, while fair semaphores provide lower variance in responsiveness.
- Avoiding performance testing pitfalls
- Ensure that garbage collection does not run at all during your test (check this using the -verbose:gc flag) OR ensure that garbage collection runs a number of times during the test (need to run test for a long time).
- Your tests should run only after all code has been compiled; there is no point in measuring the performance of interpreted bytecode. Dynamic compilation consumes CPU resources, and compiled code executes much faster.
- Code may be decompiled/recompiled multiple times during execution - for eg: if some previous assumption made by JVM is invalidated, or to compile with better optimization flags based on recently gathered performance statistics.
- Run program long enough (several minutes) so that compilation and interpreted execution represent a small fraction of the results and do not bias it.
- Or have an unmeasured warm-up run before starting to collect performance statistics.
- Run JVM with -XX:+PrintCompilation so that we know when dynamic compilation happens.
- When running multiple unrelated computationally intensive tests in a single JVM, place explicit pauses between tests in order to give the JVM a chance to catch up with its background tasks. Don't do this when measuring multiple related activities, since omitting CPU required by background tasks gives unrealistic results.
- In order to obtain realistic results, concurrent performance tests should approximate the thread-local computation done by a typical application. Otherwise, there will be unrealistic contention.
- Make sure that compilers do not optimize away benchmarking code.
- Trick to make sure that the benchmark's computed result (fox here) is not optimized away: if (fox.hashCode() == System.nanoTime()) System.out.print(" "); - the comparison almost never succeeds, but the compiler cannot prove that.
- Complementary Testing Approaches
- Code Review
- Static analysis tools: FindBugs has detectors for:
- Inconsistent synchronization.
- Invoking Thread.run() instead of Thread.start() (start() is almost always what is intended)
- Unreleased lock
- Empty synchronized block
- Double-checked locking
- Starting a thread from a constructor
- Notification errors
- Condition wait errors: Object.wait() or Condition.await() should be called in a loop with the appropriate lock held after testing some state predicate.
- Misuse of Lock and Condition
- Sleeping or waiting while holding a lock.
- Spin loops
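The correct condition-wait idiom that these detectors check for can be sketched as follows (a minimal example of my own, not from the book):

```java
// Sketch: wait in a loop, with the lock held, re-testing the state predicate
// after every wakeup - this guards against spurious and stale wakeups.
public class BoundedCounter {
    private final Object lock = new Object();
    private int count = 0;

    void awaitNonZero() throws InterruptedException {
        synchronized (lock) {
            while (count == 0) {   // loop, never a bare if
                lock.wait();
            }
        }
    }

    void increment() {
        synchronized (lock) {
            count++;
            lock.notifyAll();      // notify after changing the awaited state
        }
    }
}
```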