The Ex CS Grad Student

Sunday, September 30, 2012

Java Concurrency in Practice - Summary - Part 9

This is part 9 of my notes from reading Java Concurrency in Practice.

NOTE: These summaries are NOT meant to replace the book. I highly recommend buying your own copy of the book if you haven't already read it.

Chapter 13 - Explicit Locks

Unlike intrinsic locking, the Lock interface offers unconditional, polled, timed, and interruptible lock acquisition.
Lock implementations provide the same memory visibility guarantees as intrinsic locking. They can vary in locking semantics, scheduling algorithms, ordering guarantees and performance.
ReentrantLock provides the same mutual exclusion and memory visibility guarantees as a synchronized block.
Why use explicit locks over intrinsic locks?

Unlike intrinsic locking, a thread waiting to acquire a ReentrantLock can be interrupted.
ReentrantLock also supports timed lock acquisition.
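With intrinsic locks, a deadlock is fatal - the only way to recover is to restart the application.
Intrinsic locks must be released in the same code block they are acquired in. This makes non-block-structured locking (for example, hand-over-hand locking) impossible.
Under contention, ReentrantLock is much faster than intrinsic locking in Java 5.0. In Java 6 the two perform comparably.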

A Lock must be released in a finally block, to make sure that it is released even if an exception is thrown in the guarded code.
lockInterruptibly() helps us build cancelable tasks.
The untimed tryLock() returns false immediately if the lock cannot be acquired. The timed tryLock() waits up to the given timeout and is also responsive to interruption.
ReentrantLock offers two fairness options:

Fair - threads acquire locks in order of requesting.
Non-fair (default) - thread can acquire lock if it is available at the time of the lock request, even if earlier threads are waiting. Non-fair locking is useful because it avoids the overhead of suspending/resuming a thread if the lock is available at time of the lock request.
Fairness is usually not needed, and has a very high performance penalty under heavy contention (nearly two orders of magnitude in the book's benchmark).
Fair locks work best when they are held for a relatively long time or when the mean time between lock requests is large.
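To make the lock()/finally idiom and the timed and interruptible acquisition modes concrete, here is a minimal sketch. TimedCounter and its method names are hypothetical and only for illustration; the Lock calls themselves (tryLock, lockInterruptibly, unlock) are the standard java.util.concurrent.locks API.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical class, for illustration only.
class TimedCounter {
    // new ReentrantLock(true) would create a fair lock; the default is non-fair.
    private final Lock lock = new ReentrantLock();
    private int count;

    // Tries to increment within the time budget.
    // Returns false if the lock could not be acquired in time.
    boolean tryIncrement(long timeout, TimeUnit unit) throws InterruptedException {
        if (!lock.tryLock(timeout, unit)) {
            return false;              // give up instead of blocking forever
        }
        try {
            count++;
            return true;
        } finally {
            lock.unlock();             // always release, even if the body throws
        }
    }

    // Blocks until the lock is available, but can be cancelled by an interrupt.
    int getInterruptibly() throws InterruptedException {
        lock.lockInterruptibly();
        try {
            return count;
        } finally {
            lock.unlock();
        }
    }
}
```

Note that unlock() is only reached on the path where the lock was actually acquired; calling unlock() after a failed tryLock() would throw IllegalMonitorStateException.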

When to use intrinsic locks?

synchronized blocks have a more concise syntax. You can never forget to unlock a synchronized block.
Use ReentrantLock only when advanced features like timed, polled, interruptible lock acquisition, fairness or non-block structured locking are needed.
Deadlock problems are harder to debug when using ReentrantLock because lock acquisition is not tied to a particular stack frame, and thus the stack dump is not very helpful.
synchronized is likely to have more performance improvements in the future (e.g. lock coarsening) as it is built into the language and the JVM.

Read-Write Lock - the protected resource can be accessed by multiple readers at the same time, or by a single writer, but never by both at once.

offers readLock() and writeLock() methods which return a Lock object that must be acquired before doing the respective operations.
More complex implementation. Hence has lower performance except in read-heavy workloads.

A lock can only be released by the thread that acquired it.
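A small sketch of the read-write lock idea above: a map wrapper where lookups share the read lock and updates take the write lock. The class name ReadWriteMap and its two methods are my own choices for illustration; ReentrantReadWriteLock, readLock() and writeLock() are the real API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical wrapper, for illustration only.
class ReadWriteMap<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final ReadWriteLock rw = new ReentrantReadWriteLock();
    private final Lock readLock = rw.readLock();
    private final Lock writeLock = rw.writeLock();

    // Writers are exclusive: no readers or other writers while this runs.
    V put(K key, V value) {
        writeLock.lock();
        try {
            return map.put(key, value);
        } finally {
            writeLock.unlock();
        }
    }

    // Many readers may hold the read lock at the same time.
    V get(K key) {
        readLock.lock();
        try {
            return map.get(key);
        } finally {
            readLock.unlock();
        }
    }
}
```

This is only likely to pay off when reads dominate; for a write-heavy map a plain exclusive lock may well be faster.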

Chapter 14 - Building Custom Synchronizers

State-dependent classes - blocking operations can proceed only if state-precondition becomes true (for example, you cannot retrieve result of FutureTask if computation has not yet finished).
Try to use existing state-dependent classes whenever possible.
Condition queue - allows a group of threads (called wait set) to wait for a specific condition to become true.
Intrinsic condition queues - Any java object can act as a condition queue via the Object.wait(), notify() and notifyAll() functions.

Must hold intrinsic lock on an object before you can call wait(), notify() or notifyAll().
Calling Object.wait() atomically releases lock and suspends the current thread. It reacquires the lock upon waking up, just before returning from the wait() function call. wait() blocks till thread is awakened by a notification, a specified timeout expires or the thread is interrupted.
In order to use condition queues, we must first identify and document the pre-condition that makes an operation state-dependent. The state variables involved in the condition must be protected by the same lock object as the one we wait() on.
A single intrinsic condition queue can be used with more than one condition predicate. This means that when a thread is awakened by a notifyAll(), the condition it was waiting on need not be true. wait() can even return spuriously without any notify(). The condition can also become false again by the time wait() reacquires the lock after waking up. Hence, when waking up from wait(), the condition predicate must be tested again and we must go back to waiting if it is false. In other words, always call wait() in a loop, on the same object that is locked: synchronized (lockObj) { while (!conditionPredicate()) { lockObj.wait(); } /* object is in desired state now */ }

Notifications are not sticky - i.e. a thread won't know about notifications that occurred before it called wait().
In order to call notify() or notifyAll() on an object, you must hold the intrinsic lock on that object. Unlike wait(), the lock is not automatically released. The notifying thread should release the lock soon afterwards, since none of the woken up threads can make progress without acquiring it.
Prefer notifyAll() over notify(). If multiple threads are waiting on the same condition queue for different condition predicates, calling notify() instead of notifyAll() can lead to missed signals, because the single thread that gets woken up may be waiting for a different predicate.

However using notifyAll() can be very inefficient, as multiple threads are woken up and contend for the lock where only one of them can usually make progress.
notify() can be used instead of notifyAll() only if both of the following hold:

The same condition predicate is associated with the condition queue and each thread executes the same logic on returning from wait().
A notification on the condition queue enables at most one thread to proceed.

A bounded buffer implementation needs to notify only when moving away from the empty state or the full state. Such conditional notifications are efficient, but make the code hard to get right. Hence, avoid them unless they are necessary as an optimization.
A state dependent class should either fully document its waiting/notification protocols to sub-classes or prevent sub-classes from participating in them at all.
Encapsulate condition queue objects in order to avoid external code from incorrectly calling wait() or notify() on them. This often implies the usage of a private lock object instead of using the main object itself.
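The rules above (hold the lock, wait in a loop, notifyAll after changing state, keep the lock object private) can be put together in a small bounded buffer. This is a sketch along the lines of the book's bounded buffer examples; the class name and field names here are my own.

```java
// Hypothetical class, for illustration only.
class IntrinsicBoundedBuffer<T> {
    // Private lock object: outside code cannot call wait()/notify() on it.
    private final Object lock = new Object();
    private final Object[] items;
    private int head, tail, count;

    IntrinsicBoundedBuffer(int capacity) {
        items = new Object[capacity];
    }

    // Blocks until the buffer is not full. Condition predicate: count < capacity.
    void put(T item) throws InterruptedException {
        synchronized (lock) {
            while (count == items.length) {
                lock.wait();           // releases the lock while waiting
            }
            items[tail] = item;
            tail = (tail + 1) % items.length;
            count++;
            lock.notifyAll();          // wake takers (and possibly putters)
        }
    }

    // Blocks until the buffer is not empty. Condition predicate: count > 0.
    @SuppressWarnings("unchecked")
    T take() throws InterruptedException {
        synchronized (lock) {
            while (count == 0) {
                lock.wait();
            }
            T item = (T) items[head];
            items[head] = null;        // let the element be garbage collected
            head = (head + 1) % items.length;
            count--;
            lock.notifyAll();
            return item;
        }
    }
}
```

Both predicates share one condition queue here, which is exactly why notifyAll() is used rather than notify().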
Explicit Condition objects - Condition

Each intrinsic lock can have only one associated condition queue. Hence multiple threads may wait on same condition queue for different condition predicates.
A Condition is associated with a single Lock object. A Condition is created by calling Lock.newCondition(). You can create multiple Condition objects per Lock.
Equivalents of wait(), notify() and notifyAll() for Condition are await(), signal() and signalAll(). Since Condition is an Object, wait() and notify() are also available. Do not confuse them.
Explicit Condition objects make it easier to use signal() instead of signalAll().
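For comparison, here is the same kind of bounded buffer sketched with one Lock and two Condition objects, one per predicate. Since each Condition has uniform waiters, signal() is enough. Class and field names are hypothetical.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical class, for illustration only.
class ConditionBoundedBuffer<T> {
    private final Lock lock = new ReentrantLock();
    private final Condition notFull = lock.newCondition();   // predicate: count < capacity
    private final Condition notEmpty = lock.newCondition();  // predicate: count > 0
    private final Object[] items;
    private int head, tail, count;

    ConditionBoundedBuffer(int capacity) {
        items = new Object[capacity];
    }

    void put(T item) throws InterruptedException {
        lock.lock();
        try {
            while (count == items.length) {
                notFull.await();       // await(), not wait()
            }
            items[tail] = item;
            tail = (tail + 1) % items.length;
            count++;
            notEmpty.signal();         // only takers wait on notEmpty
        } finally {
            lock.unlock();
        }
    }

    @SuppressWarnings("unchecked")
    T take() throws InterruptedException {
        lock.lock();
        try {
            while (count == 0) {
                notEmpty.await();
            }
            T item = (T) items[head];
            items[head] = null;
            head = (head + 1) % items.length;
            count--;
            notFull.signal();          // only putters wait on notFull
            return item;
        } finally {
            lock.unlock();
        }
    }
}
```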

Synchronizers

Both Semaphore and ReentrantLock are built on the AbstractQueuedSynchronizer (AQS) class. They do not extend it directly, but delegate to a private inner subclass of AQS.
AQS is a framework for building locks and synchronizers.
When using AQS, there is only one point of contention.
Acquisition - state dependent operation that can block.
Release - allows some threads blocked in acquire to proceed. Non-blocking.

AQS manages a single integer of state for the synchronizer class. It can be accessed with getState(), setState() and compareAndSetState() methods. The integer can represent arbitrary semantics. For example, FutureTask uses it to represent the state (running, completed, canceled) of the task. Semaphore uses it to track the number of permits remaining.

Synchronizers track additional state variables themselves.
Synchronizers override whichever of tryAcquire, tryRelease, isHeldExclusively, tryAcquireShared and tryReleaseShared apply to them. The acquire, release, etc. methods of AQS call the appropriate try methods.
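As a sketch of how little code an AQS-based synchronizer needs, here is a binary latch in the spirit of the book's OneShotLatch. The state integer is 0 while the latch is closed and 1 once it is open. The isOpen() helper is my own addition for illustration.

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

// A one-shot latch: await() blocks until signal() has been called once.
class OneShotLatch {
    private final Sync sync = new Sync();

    void signal() {
        sync.releaseShared(0);                 // calls tryReleaseShared
    }

    void await() throws InterruptedException {
        sync.acquireSharedInterruptibly(0);    // calls tryAcquireShared
    }

    boolean isOpen() {                         // hypothetical helper for inspection
        return sync.isOpen();
    }

    // Private inner subclass, so AQS methods are not exposed to callers.
    private static final class Sync extends AbstractQueuedSynchronizer {
        boolean isOpen() {
            return getState() == 1;
        }

        @Override
        protected int tryAcquireShared(int ignored) {
            // Non-negative result means success; negative means the caller must block.
            return (getState() == 1) ? 1 : -1;
        }

        @Override
        protected boolean tryReleaseShared(int ignored) {
            setState(1);                       // latch is now open
            return true;                       // other waiting threads may now acquire
        }
    }
}
```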

Thursday, September 27, 2012

Java Concurrency in Practice - Summary - Part 8

This is part 8 of my notes from reading Java Concurrency in Practice.

NOTE: These summaries are NOT meant to replace the book. I highly recommend buying your own copy of the book if you haven't already read it.

Chapter 11 - Performance and Scalability

Avoid premature optimization - first make it right, then make it fast, if not fast enough already (as indicated by actual performance measurements)
Tuning for scalability is often different from tuning for performance, and the two goals are often contradictory.
Amdahl's Law : Speedup <= 1/( F + (1-F)/N) where F is the fraction of computation that must be executed serially, and N is the number of processors.
A shared work queue adds some (often overlooked) serial processing. Result handling is another form of serialization hidden inside otherwise seemingly 100% concurrent programs.
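A quick way to get a feel for Amdahl's Law is to plug in numbers. The helper below is just the formula above written as code; the class and method names are made up for this note.

```java
// Hypothetical helper, for illustration only.
class Amdahl {
    // Upper bound on speedup for a given serial fraction F and processor count N.
    static double maxSpeedup(double serialFraction, int processors) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
    }
}
```

For example, with F = 0.1 and N = 10 the bound is 1 / (0.1 + 0.09), roughly 5.26, and no matter how many processors are added it can never exceed 1 / F = 10.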
Costs of using threads

context switches - managing shared data structures in the OS and JVM takes memory and CPU. A thread context switch can also cause a flurry of processor cache misses.
When a thread blocks on a lock, it is switched out by JVM before reaching its full scheduled CPU quantum, leading to more overhead.

Context switching costs 5000-10000 clock cycles (few microseconds). Use vmstat to find % of time program spent in the kernel. High % can indicate high context switching.
synchronized and volatile result in the use of special CPU instructions called memory barriers, which can flush or invalidate CPU caches, stall execution pipelines and flush hardware write buffers. They also inhibit compiler optimizations, since most operations cannot be reordered across a memory barrier.
Performance of contended and uncontended synchronization are very different. synchronized is optimized for the uncontended scenario (20 to 250 clock cycles). volatile is always uncontended.
Modern JVMs can optimize away locking code that can be proven to never contend.
Modern JVMs perform escape analysis to identify thread-confined objects and avoid locking them.
Modern JVMs can do lock coarsening to merge multiple adjacent locks into a larger lock to avoid multiple lock/unlocks.
Synchronization by one thread affects performance of other threads due to traffic on the shared memory bus.
Uncontended synchronization can be handled entirely in JVM. Contended synchronization involves OS activity - OS needs to suspend the thread that loses the contention.
Blocking can be implemented by spin-waiting or by suspending the thread via the OS. Spin-waiting is preferred for short waits. The JVM decides which to use based on profiling past performance.
Reducing lock contention

reduce duration for which locks are held.
reduce frequency at which locks are requested. Reduce lock granularity by lock splitting (for moderately contended locks) and lock striping (for heavily contended locks).
replace exclusive locks with coordination mechanisms that permit greater concurrency.

Lock striping - ConcurrentHashMap uses 16 locks - bucket N is guarded by lock N % 16. Locking for exclusive access to entire collection is hard when lock striping is used.
Avoid hot fields such as cached values that every operation must update - e.g. a Map caches its size in order to convert an O(n) operation to an O(1) operation, but that counter then becomes a point of contention. Use striped counters or atomic variables instead.
Alternatives to exclusive locks - concurrent collections, read-write locks, immutable objects, atomic variables.
Do not use object pools. Object allocation and GC were slow in earlier versions of Java. Now object allocation is faster than a C malloc - only about 10 machine instructions on the common path. Object pools also introduce synchronization overheads.
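Lock striping, as described above, can be sketched with a simple hash map whose buckets are guarded by 16 locks, with bucket N guarded by lock N % 16. This is a simplified illustration in the spirit of the book's striped map example, not how any JDK class is actually written; all names are my own.

```java
// Hypothetical class, for illustration only.
class StripedMap<K, V> {
    private static final int N_LOCKS = 16;
    private final Node<K, V>[] buckets;
    private final Object[] locks = new Object[N_LOCKS];

    private static final class Node<K, V> {
        final K key;
        V value;
        final Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    StripedMap(int numBuckets) {
        buckets = (Node<K, V>[]) new Node[numBuckets];
        for (int i = 0; i < N_LOCKS; i++) {
            locks[i] = new Object();
        }
    }

    private int bucketFor(Object key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    V get(K key) {
        int b = bucketFor(key);
        synchronized (locks[b % N_LOCKS]) {    // only 1/16th of the map is locked
            for (Node<K, V> n = buckets[b]; n != null; n = n.next) {
                if (n.key.equals(key)) return n.value;
            }
            return null;
        }
    }

    V put(K key, V value) {
        int b = bucketFor(key);
        synchronized (locks[b % N_LOCKS]) {
            for (Node<K, V> n = buckets[b]; n != null; n = n.next) {
                if (n.key.equals(key)) {
                    V old = n.value;
                    n.value = value;
                    return old;
                }
            }
            buckets[b] = new Node<>(key, value, buckets[b]);
            return null;
        }
    }
}
```

An operation such as clear() or size() would have to acquire all 16 locks, which is the "exclusive access to the entire collection is hard" cost mentioned above.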

Chapter 12 - Testing Concurrent Programs

Every test must wait till all the threads created by it terminate. It should then report any failures in tearDown().
Testing blocking operations needs some way to unblock a thread that has blocked as expected. This is usually done by doing the blocking operation in a new thread and interrupting it after waiting for some time. An InterruptedException is thrown if the operation blocked as expected.
Thread.getState() should not be used for concurrency control or testing. Useful only for debugging.
One approach to test producer-consumer programs is to check that everything that is put into a queue or buffer eventually comes out of it, and nothing else does.

For single producer-single consumer designs, use order sensitive checksum of elements that are added, and verify them when the element is removed. Do not use a synchronized shadow list to track the elements as that will introduce artificial serialization.
For multiple producer-consumer designs, use an order insensitive checksum that can be combined at the end of the test to verify that all enqueued elements have been dequeued.

Make sure that the checksums are not guessable by the compiler (e.g. consecutive integers), so that they are not precomputed. Use a simple random number generator like static int xorShift(int y) { y ^= (y << 6); y ^= (y >>> 21); y ^= (y << 7); return y; }
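Here is a compact sketch of the multiple producer-consumer check described above, using an order-insensitive checksum (a plain sum) and the xorShift generator. It tests an ArrayBlockingQueue as a stand-in for the class under test; the class name PutTakeCheck, the queue capacity of 10 and the seeding scheme are my assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical test helper, for illustration only.
class PutTakeCheck {
    static int xorShift(int y) {
        y ^= (y << 6);
        y ^= (y >>> 21);
        y ^= (y << 7);
        return y;
    }

    // Returns true if the sum of everything put equals the sum of everything taken.
    static boolean run(int pairs, int trialsPerThread) throws Exception {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(10);
        AtomicInteger putSum = new AtomicInteger();
        AtomicInteger takeSum = new AtomicInteger();
        // All workers plus the main thread meet at the barrier at start and at end.
        CyclicBarrier barrier = new CyclicBarrier(2 * pairs + 1);
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int i = 0; i < pairs; i++) {
                pool.execute(() -> {                       // producer
                    try {
                        int seed = (int) System.nanoTime() ^ Thread.currentThread().hashCode();
                        int sum = 0;                       // thread-local, combined once at the end
                        barrier.await();
                        for (int n = 0; n < trialsPerThread; n++) {
                            queue.put(seed);
                            sum += seed;
                            seed = xorShift(seed);
                        }
                        putSum.getAndAdd(sum);
                        barrier.await();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
                pool.execute(() -> {                       // consumer
                    try {
                        int sum = 0;
                        barrier.await();
                        for (int n = 0; n < trialsPerThread; n++) {
                            sum += queue.take();
                        }
                        takeSum.getAndAdd(sum);
                        barrier.await();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            }
            barrier.await();   // release all workers at once
            barrier.await();   // wait for all workers to finish
            return putSum.get() == takeSum.get();
        } finally {
            pool.shutdown();
        }
    }
}
```

Each thread keeps its own running sum and touches the shared AtomicInteger only once, so the checksum bookkeeping does not add much artificial serialization to the test.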

Test on multi-processor machines with fewer processors than active threads.
Generate more thread interleavings by using Thread.yield() to encourage more context switches during operations that access shared state.
Always include some basic functionality testing when doing performance testing to make sure that you are not measuring performance of broken code.
Non-fair semaphores provide better throughput, while fair semaphores provide lower variance in responsiveness.
Avoiding performance testing pitfalls

Ensure that garbage collection does not run at all during your test (check this using the -verbose:gc flag) OR ensure that garbage collection runs a number of times during the test (need to run test for a long time).
Your tests should run only after all code has been compiled; no point measuring performance of interpreted byte code. Dynamic compilation takes CPU resources. Compiled code executes much faster.

Code may be decompiled/recompiled multiple times during execution - for eg: if some previous assumption made by JVM is invalidated, or to compile with better optimization flags based on recently gathered performance statistics.
Run program long enough (several minutes) so that compilation and interpreted execution represent a small fraction of the results and do not bias it.
Or have an unmeasured warm-up run before starting to collect performance statistics.
Run JVM with -XX:+PrintCompilation so that we know when dynamic compilation happens.

When running multiple unrelated computationally intensive tests in a single JVM, place explicit pauses between tests in order to give the JVM a chance to catch up with its background tasks. Don't do this when measuring multiple related activities, since omitting CPU required by background tasks gives unrealistic results.
In order to obtain realistic results, concurrent performance tests should approximate the thread-local computation done by a typical application. Otherwise, there will be unrealistic contention.
Make sure that compilers do not optimize away benchmarking code.

Trick to make sure that benchmarking calculation is not optimized away: if (fox.hashCode() == System.nanoTime()) System.out.print(" ");
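Putting the warm-up advice and the anti-optimization trick together, a very rough timing helper might look like the sketch below. This is only an illustration of the ideas in these notes, not a substitute for a real benchmarking harness; the class name, method name and parameters are all hypothetical.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, for illustration only.
class TinyBenchmark {
    // Returns the average time per call in nanoseconds, after an unmeasured warm-up.
    static long averageNanos(IntSupplier work, int warmupRuns, int measuredRuns) {
        int sink = 0;
        // Unmeasured warm-up, to give the JIT a chance to compile the code first.
        for (int i = 0; i < warmupRuns; i++) {
            sink += work.getAsInt();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            sink += work.getAsInt();
        }
        long elapsed = System.nanoTime() - start;
        // Use the result in a way the compiler cannot predict, so the work
        // cannot be discarded as dead code. This almost never prints anything.
        if (sink == System.nanoTime()) {
            System.out.print(" ");
        }
        return elapsed / measuredRuns;
    }
}
```

A call might look like TinyBenchmark.averageNanos(() -> Integer.bitCount((int) System.nanoTime()), 100000, 100000). Running with -XX:+PrintCompilation and -verbose:gc alongside it helps to check that compilation and GC are not distorting the numbers.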

Complementary Testing Approaches

Code Review
Static analysis tools: FindBugs has detectors for:

Inconsistent synchronization.
Invoking Thread.run (Thread.start() is what is usually invoked, not Thread.run())
Unreleased lock
Empty synchronized block
Double-checked locking
Starting a thread from a constructor
Notification errors
Condition wait errors: Object.wait() or Condition.await() should be called in a loop with the appropriate lock held after testing some state predicate.
Misuse of Lock and Condition
Sleeping or waiting while holding a lock.
Spin loops

Java Concurrency in Practice - Summary - Part 7

This is part 7 of my notes from reading Java Concurrency in Practice.

NOTE: These summaries are NOT meant to replace the book. I highly recommend buying your own copy of the book if you haven't already read it.

Chapter 9 - GUI Applications

Almost all GUI toolkits, including Swing, are implemented as a single-threaded subsystem. All GUI activity is confined to a single dedicated event dispatch thread. Attempts at multi-threaded GUIs suffered from deadlocks and race conditions. User actions manifest as events that bubble up from the GUI component to the application. Application initiated actions bubble down from the application code to the GUI components. Hence, GUI components are often accessed in opposite order, creating ripe conditions for deadlocks.
Tasks that execute in the event thread must complete quickly. Otherwise the UI will hang.
In Swing, GUI objects are kept consistent not by synchronization, but by thread confinement. They must NOT be accessed from any other thread.
A few Swing methods are thread-safe:

SwingUtilities.isEventDispatchThread
SwingUtilities.invokeLater - schedules a Runnable to be executed on the event thread.
SwingUtilities.invokeAndWait - callable only from a non-GUI thread. Schedules a Runnable to be executed on the GUI thread and waits for it to complete.
methods to enqueue a repaint or revalidate request on the event queue.
methods for adding/removing event listeners.

Short-running tasks can be run directly on the GUI thread. For long running tasks, use Executors.newCachedThreadPool().
Use Future, so that tasks can be easily cancelled. The task must be coded so that it is responsive to interruption.
SwingWorker class provides support for cancellation, progress indication, completion notification. So, we don't have to implement our own using FutureTask and Executor.
Data models must be thread-safe if they are to be accessed from both the GUI thread and other application threads.
A program that has both a presentation-domain and an application domain data model is said to have a split-model design.

presentation data model is confined to event thread. Application domain data model is thread-safe and is shared between the application and GUI threads.
presentation model registers listeners with the application model so that it can be notified of updates. Presentation model can be updated from the application model by sending a snapshot of the current state or via incremental updates.

Chapter 10 - Avoiding Liveness Hazards

Unlike database systems, the JVM does not do deadlock detection or recovery.
A program will be free of lock-ordering deadlocks if all threads acquire the needed locks in a fixed global order.

The order of locks acquired by a thread may depend on external input. Hence static analysis alone is not sufficient to avoid lock-ordering deadlocks.
An alternative is to induce an ordering on locks by using System.identityHashCode. Order lock acquisition by the hash code of the lock object.

In the extremely unlikely scenario where the hash codes of two lock objects are equal, acquire a third "tie" lock before trying to acquire the original two locks. The tie lock can be a global lock. Since hash collisions are infrequent, the tie lock won't introduce a concurrency bottleneck.

If the lock objects (say bank Accounts) have a unique key, lock acquisition can be ordered by the key, and there is no need for the tie-lock.
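The identityHashCode ordering with a tie lock can be sketched as follows. It follows the shape of the book's money transfer example, but the Account and Transfers classes here are simplified stand-ins written for this note.

```java
// Hypothetical classes, for illustration only.
class Account {
    private long balance;

    Account(long balance) {
        this.balance = balance;
    }

    synchronized long getBalance() { return balance; }
    synchronized void debit(long amount) { balance -= amount; }
    synchronized void credit(long amount) { balance += amount; }
}

class Transfers {
    // Global lock used only when the two hash codes collide.
    private static final Object tieLock = new Object();

    static void transfer(Account from, Account to, long amount) {
        int fromHash = System.identityHashCode(from);
        int toHash = System.identityHashCode(to);

        // Always lock the account with the smaller hash first, so two threads
        // transferring in opposite directions take the locks in the same order.
        if (fromHash < toHash) {
            synchronized (from) {
                synchronized (to) {
                    move(from, to, amount);
                }
            }
        } else if (fromHash > toHash) {
            synchronized (to) {
                synchronized (from) {
                    move(from, to, amount);
                }
            }
        } else {
            synchronized (tieLock) {       // rare: serialize colliding transfers
                synchronized (from) {
                    synchronized (to) {
                        move(from, to, amount);
                    }
                }
            }
        }
    }

    // Must be called with both account locks held.
    private static void move(Account from, Account to, long amount) {
        if (from.getBalance() < amount) {
            throw new IllegalArgumentException("insufficient funds");
        }
        from.debit(amount);
        to.credit(amount);
    }
}
```

If accounts had a unique account number, comparing those numbers instead of the hash codes would remove the need for the tie lock.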
Multiple locks may not always be acquired in the same method. Hence, it is not easy to detect lock-ordering deadlocks. Watch out for invocation of alien methods while holding a lock.

Calling a method with no locks held is called an open call. Liveness of a program can be more easily analyzed if all calls are open.
Use synchronized blocks within methods to guard shared state, instead of making the entire method synchronized.

In cases where loss of atomicity of the synchronized method is unacceptable, we need to construct application level protocols. For example, when shutting down a service, lock for just long enough to mark the service as shutting down, and wait for existing tasks to complete without holding the lock. Since the service is marked as shutting down, no new tasks will start.

In addition to deadlocking waiting for locks, threads can also deadlock waiting for resources like database connections.
If you must acquire multiple locks, lock ordering must be part of your design. Minimize number of locks needed. Document ordering policy.
Timed locks offered by the Lock interface are another option for detecting and recovering from deadlocks. The timed tryLock() method returns false if the timeout expires. It can fail even when no deadlock occurred, simply because another thread held the lock for a long time.
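One way to use tryLock() for this is to poll for both locks, back off for a short random time on failure, and give up once an overall time budget is spent. The sketch below is a generic version of that pattern; PollingLocks, withBothLocks and the 1-4 ms backoff range are my own choices, not something prescribed by the book.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

// Hypothetical helper, for illustration only.
class PollingLocks {
    // Runs the action while holding both locks. Returns false if both locks
    // could not be obtained within the time budget.
    static boolean withBothLocks(Lock first, Lock second, Runnable action,
                                 long timeout, TimeUnit unit) throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (true) {
            if (first.tryLock()) {
                try {
                    if (second.tryLock()) {
                        try {
                            action.run();
                            return true;
                        } finally {
                            second.unlock();
                        }
                    }
                } finally {
                    first.unlock();    // release so other threads can make progress
                }
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;          // report failure instead of deadlocking
            }
            // Random backoff makes it less likely that two threads keep
            // colliding in lock-step (a livelock).
            Thread.sleep(ThreadLocalRandom.current().nextInt(1, 5));
        }
    }
}
```

Because a thread never blocks while holding one lock and waiting for the other, the order in which callers pass the two locks no longer matters.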
JVM prints out deadlock information in thread dumps. To trigger a thread dump, send SIGQUIT (kill -3) to the JVM. Explicit Lock objects are not clearly shown in a thread dump.
Starvation - a thread is perpetually denied access to needed resources.

CPU cycle starvation can be caused by inappropriate use of thread priorities, or by executing infinite loops with locks held.
Avoid setting thread priorities as they are platform-dependent and can cause liveness issues. Set lower priorities only for truly background tasks, that can improve the responsiveness of foreground tasks.

Livelock - a thread is not blocked, but cannot make progress because it keeps retrying an operation that will always fail. For example, a code bug is triggered when processing a particular input, and that input is re-queued for processing by over-eager error handling code. An unrecoverable error is mistakenly being treated as a recoverable one. The solution for some forms of livelock is to introduce randomness into the retry.