This is part 8 of my notes from reading Java Concurrency in Practice.
NOTE: These summaries are NOT meant to replace the book. I highly recommend buying your own copy of the book if you haven't already read it.
Chapter 11 - Performance and Scalability
- Avoid premature optimization - first make it right, then make it fast, if not fast enough already (as indicated by actual performance measurements)
- Tuning for scalability is often different from tuning for performance, and the two goals are often at odds.
- Amdahl's Law : Speedup <= 1/( F + (1-F)/N) where F is the fraction of computation that must be executed serially, and N is the number of processors.
- A shared work queue adds some (often overlooked) serial processing. Result handling is another form of serialization hidden inside otherwise seemingly 100% concurrent programs.
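Amdahl's Law above can be evaluated directly. A minimal sketch (my own, not from the book; the class and method names are made up) showing how even a small serial fraction caps the achievable speedup:

```java
// Sketch: evaluating Amdahl's Law, Speedup <= 1 / (F + (1 - F) / N),
// where F is the serial fraction and N the number of processors.
public class AmdahlDemo {
    static double maxSpeedup(double serialFraction, int processors) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
    }

    public static void main(String[] args) {
        // With just 5% serial work, speedup falls well short of N.
        for (int n : new int[] {2, 4, 16, 64}) {
            System.out.printf("N=%d -> max speedup %.2f%n", n, maxSpeedup(0.05, n));
        }
    }
}
```

For F = 0.05, 16 processors yield a maximum speedup of only about 9.1, and 64 processors only about 15.4.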
- Costs of using threads
- context switches - managing shared data structures in OS and JVM take memory and CPU. Can also cause flurry of processor cache misses on a thread context switch.
- When a thread blocks on a lock, the JVM switches it out before it reaches the end of its scheduled CPU quantum, incurring extra context-switch overhead.
- Context switching costs 5000-10000 clock cycles (few microseconds). Use vmstat to find % of time program spent in the kernel. High % can indicate high context switching.
- synchronized and volatile result in the use of special CPU instructions called memory barriers, which flush/invalidate CPU caches, stall execution pipelines, and flush hardware write buffers; they also inhibit compiler optimizations, since operations cannot be reordered across them.
- Performance of contended and uncontended synchronization are very different. synchronized is optimized for the uncontended scenario (20 to 250 clock cycles). volatile is always uncontended.
- Modern JVMs can optimize away locking code that can be proven to never contend.
- Modern JVMs perform escape analysis to identify thread-confined objects and avoid locking them.
- Modern JVMs can do lock coarsening to merge multiple adjacent locks into a larger lock to avoid multiple lock/unlocks.
- Synchronization by one thread affects performance of other threads due to traffic on the shared memory bus.
- Uncontended synchronization can be handled entirely in JVM. Contended synchronization involves OS activity - OS needs to suspend the thread that loses the contention.
- Blocking can be implemented by spin-waiting or by suspending the thread via the OS. Spin-waiting is preferred for short waits; the JVM decides which to use based on profiling data about past performance.
- Reducing lock contention
- reduce duration for which locks are held.
- reduce the frequency at which locks are requested by making locking finer grained: lock splitting (for moderately contended locks) and lock striping (for heavily contended locks).
- replace exclusive locks with coordination mechanisms that permit greater concurrency.
- Lock striping - ConcurrentHashMap (as described in the book) uses an array of 16 locks; bucket N is guarded by lock N mod 16. Locking the entire collection for exclusive access is hard when lock striping is used.
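A minimal sketch of lock striping in the spirit of the book's StripedMap listing (simplified: buckets are indexed directly and hash-based chaining is omitted; names are illustrative):

```java
// Sketch: bucket n is guarded by lock n mod N_LOCKS, so threads touching
// different stripes can proceed concurrently.
public class StripedMap {
    private static final int N_LOCKS = 16;
    private final Object[] buckets;
    private final Object[] locks;

    public StripedMap(int numBuckets) {
        buckets = new Object[numBuckets];
        locks = new Object[N_LOCKS];
        for (int i = 0; i < N_LOCKS; i++) locks[i] = new Object();
    }

    public Object get(int n) {
        synchronized (locks[n % N_LOCKS]) {
            return buckets[n];
        }
    }

    public void put(int n, Object value) {
        synchronized (locks[n % N_LOCKS]) {
            buckets[n] = value;
        }
    }

    // Whole-map operations must take every stripe in turn, which is why
    // exclusive access to the entire collection is awkward under striping.
    public void clear() {
        for (int i = 0; i < buckets.length; i++) {
            synchronized (locks[i % N_LOCKS]) {
                buckets[i] = null;
            }
        }
    }
}
```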
- Avoid hot fields such as cached values - e.g. caching a Map's size converts an O(n) operation into an O(1) one, but the cached field becomes a point of contention. Use striped counters or atomic variables instead.
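As a concrete illustration of a striped counter (note: LongAdder arrived in Java 8, after the book was written - it is essentially a ready-made striped counter; the demo class and method names are my own):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// LongAdder spreads contended increments over per-thread cells instead of
// one hot field, and sum() combines the cells on demand.
public class StripedCounterDemo {
    static long countInParallel(int threads, int perThread) throws InterruptedException {
        LongAdder counter = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) counter.increment();
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return counter.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countInParallel(4, 100_000)); // prints 400000
    }
}
```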
- Alternatives to exclusive locks - concurrent collections, read-write locks, immutable objects, atomic variables.
- Do not use object pools. Object allocation and GC were slow in early versions of Java; now object allocation is faster than a C malloc - on the order of 10 machine instructions. Object pools also introduce synchronization overhead.
Chapter 12 - Testing Concurrent Programs
- Every test must wait until all the threads it created terminate, and should then report any failures in tearDown().
- Tests of blocking operations need some way to unblock a thread that has blocked as expected. This is usually done by performing the blocking operation in a new thread and interrupting it after waiting for some time; an InterruptedException indicates that the operation blocked as expected.
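The interrupt-based pattern can be sketched like this (a simplified standalone version, assuming a LinkedBlockingQueue; in the book the pattern is shown as a JUnit test against a bounded buffer):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: verify that take() blocks on an empty queue by running it in a
// helper thread, giving it time to block, then interrupting it. Success means
// take() threw InterruptedException (it was blocked) and the thread exited.
public class BlockingTestDemo {
    static boolean takeBlocksWhenEmpty() throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        final boolean[] blockedAsExpected = {false};
        Thread taker = new Thread(() -> {
            try {
                queue.take();                  // should block: queue is empty
                // if we get here, take() returned - the test should fail
            } catch (InterruptedException success) {
                blockedAsExpected[0] = true;
            }
        });
        taker.start();
        Thread.sleep(200);                     // crude, but lets take() block
        taker.interrupt();
        taker.join(1000);                      // fail if it never terminates
        return blockedAsExpected[0] && !taker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(takeBlocksWhenEmpty()); // prints "true"
    }
}
```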
- Thread.getState() should not be used for concurrency control or testing. Useful only for debugging.
- One approach to test producer-consumer programs is to check that everything that is put into a queue or buffer eventually comes out of it, and nothing else does.
- For single producer-single consumer designs, use an order-sensitive checksum of elements as they are added, and verify it as elements are removed. Do not use a synchronized shadow list to track the elements, as that introduces artificial serialization.
- For multiple producer-consumer designs, use an order insensitive checksum that can be combined at the end of the test to verify that all enqueued elements have been dequeued.
- Make sure that the checksums cannot be guessed by the compiler (e.g. consecutive integers), so that they are not precomputed. Use a simple random number generator such as: static int xorShift(int y) { y ^= (y << 6); y ^= (y >>> 21); y ^= (y << 7); return y; }
- Test on multi-processor machines with fewer processors than active threads.
- Generate more thread interleaving by using Thread.yield() to encourage more context switches during operations that access shared state.
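A small helper in the spirit of this advice (the names are my own; the book suggests yielding or sleeping occasionally rather than on every call, so that interleavings vary between runs):

```java
import java.util.Random;

// Sketch: call maybeYield() between accesses to shared state in test builds
// to encourage context switches at awkward moments.
public class Interleaver {
    private static final Random rnd = new Random();

    static void maybeYield() {
        // Yield only occasionally; yielding every time would itself impose
        // a fixed, unrepresentative schedule.
        if (rnd.nextInt(1000) == 0) {
            Thread.yield();
        }
    }
}
```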
- Always include some basic functionality testing when doing performance testing to make sure that you are not measuring performance of broken code.
- Non-fair semaphores provide better throughput, while fair semaphores provide lower variance in responsiveness.
- Avoiding performance testing pitfalls
- Ensure that garbage collection does not run at all during your test (check this using the -verbose:gc flag) OR ensure that garbage collection runs a number of times during the test (need to run test for a long time).
- Your tests should run only after all code has been compiled; there is no point in measuring the performance of interpreted bytecode. Dynamic compilation consumes CPU resources, and compiled code executes much faster.
- Code may be decompiled/recompiled multiple times during execution - for eg: if some previous assumption made by JVM is invalidated, or to compile with better optimization flags based on recently gathered performance statistics.
- Run program long enough (several minutes) so that compilation and interpreted execution represent a small fraction of the results and do not bias it.
- Or have an unmeasured warm-up run before starting to collect performance statistics.
- Run JVM with -XX:+PrintCompilation so that we know when dynamic compilation happens.
- When running multiple unrelated computationally intensive tests in a single JVM, place explicit pauses between tests in order to give the JVM a chance to catch up with its background tasks. Don't do this when measuring multiple related activities, since omitting CPU required by background tasks gives unrealistic results.
- In order to obtain realistic results, concurrent performance tests should approximate the thread-local computation done by a typical application. Otherwise, there will be unrealistic contention.
- Make sure that compilers do not optimize away benchmarking code.
- Trick to make sure that the benchmark's computed result (fox here) is not optimized away: if (fox.hashCode() == System.nanoTime()) System.out.print(" "); - the comparison almost never succeeds, but the compiler cannot prove that.
- Complementary Testing Approaches
- Code Review
- Static analysis tools: FindBugs has detectors for:
- Inconsistent synchronization.
- Invoking Thread.run() instead of Thread.start() (start() is almost always what is intended)
- Unreleased lock
- Empty synchronized block
- Double-checked locking
- Starting a thread from a constructor
- Notification errors
- Condition wait errors: Object.wait() or Condition.await() should be called in a loop with the appropriate lock held after testing some state predicate.
- Misuse of Lock and Condition
- Sleeping or waiting while holding a lock.
- Spin loops
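The correct condition-wait idiom that these detectors check for can be sketched as follows (a minimal example of my own, not from the book):

```java
// Sketch: wait in a loop, with the lock held, re-testing the state predicate
// after every wakeup - this guards against spurious and stale wakeups.
public class BoundedCounter {
    private final Object lock = new Object();
    private int count = 0;

    void awaitNonZero() throws InterruptedException {
        synchronized (lock) {
            while (count == 0) {   // loop, never a bare if
                lock.wait();
            }
        }
    }

    void increment() {
        synchronized (lock) {
            count++;
            lock.notifyAll();      // notify after changing the awaited state
        }
    }
}
```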