
Nonblocking Algorithms and Scalable Multicore Programming



Real-world systems with complicated quality-of-service guarantees may require a delicate balance between throughput and latency in order to meet operating requirements in a cost-efficient manner. The increasing availability and decreasing cost of commodity multicore and many-core systems make concurrency and parallelism increasingly necessary for meeting demanding performance requirements. Unfortunately, the design and implementation of correct, efficient, and scalable concurrent software is often a daunting task.

Nonblocking synchronization may be used in building predictable and resilient systems while avoiding the problems of lock-based synchronization. This class of synchronization is not a silver bullet, especially outside of the abstract models in which its performance properties were originally defined. Several of the performance bottlenecks and error conditions associated with lock-based synchronization remain (albeit in more obfuscated forms); therefore, ensuring correctness requires more complex verification methods, and in some cases nonblocking algorithms require the adoption of complex support systems. Stemming from these complexities, nonblocking synchronization is frequently the victim of hype, fear, uncertainty, and doubt. This article aims to equip the reader with the knowledge needed to identify situations that benefit from nonblocking synchronization.


Common Principles

Before elaborating on typical motivations for adopting nonblocking data structures, this section highlights some important principles for understanding scalability on multiprocessor systems in the traditional threading model. Both lock-based and nonblocking synchronization have performance characteristics that are a function of these principles. Practitioners who are already intimately familiar with cache coherent multiprocessors, out-of-order execution, and the design and implementation of lock-based synchronization objects may wish to skip this section. While this section is by no means exhaustive, additional references have been provided. Paul E. McKenney outlines a methodology for choosing appropriate lock-based synchronization mechanisms,15 and other common principles have been described in previous articles.5,13,17

Contention inhibits scalability. Contention on shared objects in parallel applications is a primary impediment to scalability and predictability. Regardless of the higher-level synchronization facilities being used, contention over a shared object involves some form of serialization by the underlying runtime environment, whether it is a language runtime or a shared-memory multiprocessor.

Shared mutable state and contention. Understanding the mechanisms that provide coherency guarantees on multiprocessors is a prerequisite to understanding contention on such systems. Cache coherency protocols are the prominent mechanisms for guaranteeing the eventual consistency of shared state across multiple cache coherent processors. These protocols implement coherency guarantees at the cache-line level: the cache line is the unit in which caches write to and read from main memory (at least from the point of view of the coherency mechanism).

The three prominent cache coherency protocols on commodity processors—MESI, MESIF, and MOESI—are named after the states they define for cache lines: Modified, Owned, Exclusive, Shared, Invalid, and Forward. The cache coherency protocol manages these states to guarantee the eventual consistency of shared memory with the help of a memory-coherence mechanism. Figure 1 illustrates the state machine associated with the MOESI protocol.1 You may interpret the probe transitions as those triggered by external memory accesses originating from other cores.
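As a quick reference, the following summary sketches what each of these states conveys; it is not a listing from the original article, and the one-line descriptions are paraphrases of the protocol definitions.

enum cacheline_state {
    MODIFIED,   /* dirty; this cache holds the only valid copy */
    OWNED,      /* MOESI: dirty but shared; this cache must service probes
                   for the line or write it back */
    EXCLUSIVE,  /* clean; no other cache holds a copy */
    SHARED,     /* other caches may hold a copy; memory (or an owner elsewhere)
                   is authoritative */
    INVALID,    /* this cache holds no valid data for the line */
    FORWARD     /* MESIF: the shared copy designated to respond to requests */
};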

Figure 2 illustrates the life cycle of shared reads and writes in a cache coherent multiprocessor. The state machine associated with this life cycle has been simplified for brevity. This program spawns a thread that reads a modified variable.
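The listing in Figure 2 is not reproduced in this version of the article; the following is a hedged stand-in that is consistent with the walkthrough below (a main thread modifies x, then spawns a thread that prints it).

#include <pthread.h>
#include <stdio.h>

static int x;

static void *
thread(void *unused)
{
    (void)unused;

    /* A load of x, executed on another core (core 1 in the walkthrough). */
    fprintf(stderr, "%d\n", x);
    return NULL;
}

int
main(void)
{
    pthread_t tid;

    /* Load, increment, and store of x, executed on core 0. */
    x = x + 100;
    pthread_create(&tid, NULL, thread, NULL);
    pthread_join(tid, NULL);
    return 0;
}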

The example assumes that the coherency mechanism being used is bus snooping, which allows any processor to monitor any memory accesses to shared locations. The initial state of the system is presented in Figure 3. There are two sockets, each with one core and a 256-byte L2 direct-mapped write-back cache with 64-byte cache lines.

The process initially executes on core 0. Assume the address of variable x is 0x20c4. Adding the decimal value 100 to x (x = x + 100;) may decompose into three operations:

  1. Load x from memory into a CPU register.
  2. Increment the register by 100.
  3. Store the value of the register into the memory location of x.

The address of x is 0x20c4 and is hashed to cache line 3. Since this cache line is in a modified state and contains data from a different address (0x00c0), it must be written back to main memory. No other socket contains the cache line for 0x20c4, so a 64-byte block (starting from 0x20c0) is read into cache line 3 from main memory and set to the exclusive state (see Figure 4).
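This mapping can be checked with a few lines of arithmetic (the snippet is purely illustrative and not part of the original article): with 64-byte lines and a 256-byte direct-mapped cache there are four lines, and 0x20c4 maps to line 3 with a block base of 0x20c0.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  64u   /* bytes per cache line */
#define CACHE_SIZE 256u  /* direct-mapped L2 size from the example */
#define N_LINES    (CACHE_SIZE / LINE_SIZE)

int
main(void)
{
    uintptr_t x = 0x20c4;

    /* Prints "line 3, block base 0x20c0". */
    printf("line %lu, block base 0x%lx\n",
        (unsigned long)((x / LINE_SIZE) % N_LINES),
        (unsigned long)(x & ~(uintptr_t)(LINE_SIZE - 1)));
    return 0;
}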

The store into x transitions the cache line from the exclusive state to the modified state (see Figure 5).

The new thread spawns and eventually begins executing the thread function. Assume this execution occurs on core 1. The thread function executes a load from the address of x as an input to the fprintf function call. This load issues a read probe, requiring a transition from the modified state to the owned state (see Figure 6). MOESI allows for cache-to-cache transfer of data, an advantage if probe latency is lower than the latency to main memory. In this specific configuration, this operation involves a probe on what is typically a higher-latency memory interconnect (compared with intrasocket cache access). The latency associated with the probe is also a function of the topology. On larger-scale machines with asymmetric interconnect topologies, substantial performance mismatch may exist on memory accesses; some sockets may require one hop for an invalidation cycle, while others may require two or more.

A cache-friendly program (one able to retain shared state in cache with minimal coherence traffic) scales well on modern cache coherent multiprocessors. Mutations to actively shared state, by contrast, lead directly to contention as cache controllers mediate ownership of cache lines.

Coherency granularity. As mentioned earlier, the cache line is the granularity at which coherency is maintained on a cache coherent multiprocessor system. This is also the unit of contention. If logically disparate mutable objects share the same cache line, then operations on these objects exhibit contention and may become a scalability bottleneck. This is called false sharing.

Consider the C snippet in Figure 7. Each thread is provided a unique index (UNIQUE_THREAD_ID) into an array of counters that it reads from and writes to. Since these counters are laid out sequentially in memory and cache, cache-line ownership ping-pongs among all of the processors incrementing their counters. This vicious cycle of invalid/shared/modified/owned cache-line transitions is a direct result of false sharing.
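Figure 7 is not reproduced in this version of the article; the sketch below captures the pattern it describes, with UNIQUE_THREAD_ID modeled as a function parameter and the counter type, array size, and function name chosen for illustration.

#define N_THREADS 8  /* illustrative thread count */

/* Densely packed per-thread counters: with 64-byte cache lines, eight
 * 8-byte counters share a single line. */
static unsigned long long counters[N_THREADS];

static void
count_event(unsigned int unique_thread_id)
{
    /* Every increment forces ownership of the shared cache line to
     * migrate to the incrementing core. */
    counters[unique_thread_id]++;
}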

One way of avoiding this problem is to guarantee that at most one counter resides in a given cache line, by padding each counter to the cache-line size. Assuming the host processor of the application has 64-byte cache lines, the following modification avoids false sharing:

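The inline listing that originally appeared here is not reproduced; the following sketch shows one way of expressing the padding just described (the structure name and macros are illustrative, and the 64-byte line size is the assumption stated above).

#define CACHE_LINE_SIZE 64  /* assumption from the text: 64-byte lines */
#define N_THREADS       8   /* illustrative thread count */

/* Pad each counter out to a full cache line so no two counters share one. */
struct padded_counter {
    unsigned long long value;
    char pad[CACHE_LINE_SIZE - sizeof(unsigned long long)];
};

static struct padded_counter counters[N_THREADS];

static void
count_event(unsigned int unique_thread_id)
{
    /* Each thread now writes to a cache line no other thread touches. */
    counters[unique_thread_id].value++;
}

For the padding to map each counter onto exactly one line, the array itself should also start on a cache-line boundary (for example, via a compiler-specific alignment attribute).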

In the same vein, padding should be done only when necessary. It is still important to keep access patterns in mind. For example, if two mutexes are acquired in succession on the hot path (that is, the most frequently executed segment of code), then cache-line padding does little but drive up lock-operation latency. Excessive padding may impact the overall memory footprint of the application, causing increased pressure on memory resources. Additional approaches exist to detect and minimize false sharing.14,15

Unavoidable costs of ordering and visibility guarantees. Certain correctness requirements call for the use of synchronization instructions that are heavier-weight than atomic loads and stores. To improve performance, many modern processors batch memory operations and/or provide some form of instruction-level parallelism. The presence of these optimizations may require the use of memory barrier (or serializing) instructions to enforce ordering of memory operations and their intended side effects. (The term memory barrier can be used interchangeably with memory fence.) The need for barrier instructions is dependent on the processor memory model, which also defines the reordering possibilities of memory operations.

Consider the source-code snippet in Figure 8. In this pseudocode, a producer thread executing on a dedicated processor creates a message and then signals a dedicated consumer thread on another processor by updating ready. With no memory barriers, it is possible for the processor executing consume to receive or process the remote store to ready before the remote store to message->value completes and/or is made visible to other processors. To guarantee that produce commits the store to message->value before the ready = 1 operation, a memory barrier may be necessary. Then, to guarantee that once the consumer has observed ready in a nonzero state it also observes any memory updates made before the accompanying ready = 1 operation in produce, another memory barrier may be necessary in consume (see Figure 9).
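Figures 8 and 9 are not reproduced in this version of the article. The following is a minimal sketch of the corrected pattern, using C11 acquire/release atomics in place of the explicit barriers just described; the substitution of C11 atomics and the exact structure and names are assumptions that follow the prose, not the original listings.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct message {
    int value;
};

static struct message message;
static atomic_int ready;

static void *
produce(void *unused)
{
    (void)unused;

    message.value = 42;
    /* Release: the store to message.value is made visible before any
     * observer can see ready == 1. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *
consume(void *unused)
{
    (void)unused;

    /* Acquire: once ready reads nonzero, the store to message.value made
     * before the release store is guaranteed to be visible. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;

    printf("%d\n", message.value);
    return NULL;
}

int
main(void)
{
    pthread_t p, c;

    pthread_create(&c, NULL, consume, NULL);
    pthread_create(&p, NULL, produce, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}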

Some processors have specialized barrier instructions that guarantee the partial ordering of only some memory operations (examples include serializing only stores with respect to each other or only loads with respect to each other). The requirement of memory barriers is dependent on the underlying memory model. Again, Paul McKenney offers additional details on these memory models.15

To make matters worse (in terms of complexity), some languages allow their respective compilers and runtime environments to reorder operations. Until recently, even C and C++ lacked a memory model for concurrency (so the semantics of concurrent accesses were very much compiler and environment specific). Many programming languages emit memory barriers only through acquire/release semantics. Popular lock-based synchronization mechanisms hide the details of memory barriers from program designers by including implicit heavyweight memory barriers. This simplifies concurrent programs, as developers are left to reason in terms of serializability,3 but it sometimes comes at the cost of reduced performance.

Memory barriers are also necessary for many other concurrent algorithms, especially classes of algorithms that do not rely on locks and their ordering guarantees. Ultimately, the ability to avoid heavyweight synchronization mechanisms depends on the correctness and visibility requirements of the data structure. Recently, the requirement of expensive synchronization instructions for certain correctness constraints in the presence of common access patterns has been formalized.4

Atomic read-modify-write operations are expensive. The cost of atomic read-modify-write operations depends on the underlying architecture and the actual programming-language implementation. TSO (total store ordering) is becoming an increasingly common memory model on commodity processors. Among other things, this memory model defines atomic operations as having a total ordering. This guarantee comes at a significant cost, even in the absence of contention. Table 1 illustrates the uncontended throughput of atomic CAS (compare-and-swap) operations (lock cmpxchg) versus regular CAS operations (cmpxchg). The atomic CAS can provide total ordering and atomicity guarantees across multiple processors, while the regular CAS operation cannot. These measurements were made on an Intel Core i7-3615QM at 2.30 GHz and include the overhead of register spillage.

On TSO architectures, these atomic operations imply heavyweight memory-barrier semantics. Even on architectures with weaker memory models, atomic operations may be implemented in terms of complex instructions that incur the cost of long dependency chains in the pipeline. In other cases, programming languages may themselves define a total ordering to these instructions, which may lead to unnecessarily heavyweight memory fences being emitted. Atomic operations should be used only if necessary. On most architectures, atomic read-modify-write instructions are expensive (relative to other instructions)—even in the absence of contention.

Before deciding on the use of atomic operations, take into account the progress guarantees of the actual atomic implementation. For example, some architectures relying on LL/SC (load-linked/store-conditional primitives) such as ARM or Power architectures may still leave an application sensitive to external disturbances (jitter) such as preemption, even in the absence of contention.

Take advantage of fetch semantics provided by the atomic operation arsenal. If a platform provides an atomic fetch-and-increment operation or a CAS operation that returns the previous value, take full advantage of these semantics (see Figure 10).
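Figure 10 is not reproduced in this version of the article; the snippet below illustrates the point using the GCC/Clang __atomic builtins (an assumption about the toolchain; the function name is hypothetical). The value returned by the read-modify-write operation is used directly instead of issuing a second load.

static unsigned long counter;

unsigned long
next_ticket(void)
{
    /* The atomic increment already returns the previous value; use it
     * rather than re-reading counter afterward. */
    return __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
}

The same applies to CAS: most interfaces can return the value witnessed on failure, which removes the need for an extra load at the top of a retry loop.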

Be wary of topology. The latency and throughput properties of memory accesses vary according to the type of memory access. For example, accessing memory in a remote cache may be cheaper than accessing local uncached memory. This performance asymmetry is an example of a non-negligible NUMA (non-uniform memory access) factor. The higher the NUMA factor, the higher this asymmetry. Optimizing the placement of objects that are shared across cores minimizes NUMA effects. For example, if shared state is accessed by a cluster of threads on a single socket, then there is no reason for that object to be contained in remote memory.

Synchronization objects that are not NUMA-aware may also be susceptible to NUMA effects. These effects manifest not only as a fast-path performance mismatch between cores but also as starvation or even livelock under load.

Fair locks guarantee starvation-freedom at the cost of increased preemption-sensitivity. Variants of these locks also allow for scalable point-to-point wake-up mechanisms. Recently, lock cohorting was developed as a generalized methodology for allowing NUMA awareness in locks at the cost of a higher fast-path latency.8

Predictability matters. Predictability is a desirable property in high-performance concurrent applications requiring stringent latency or throughput bounds. The progress guarantees of synchronization mechanisms define behavior in the presence of contention and unforeseen execution delays (which can include jitter). If a program is designed solely for the fast path, then minor disturbances in execution and/or workload may compound to result in unpredictably low performance.

Systems relying on blocking synchronization can be especially sensitive to execution delays. A significant delay in one thread holding a synchronization object leads to significant delays for all other threads waiting on the same synchronization object. Examples of such execution delays include process preemption and timer interrupts. The significance of these delays depends on the latency constraints of the application. If the application is required to be in the microsecond latency range, then even a network interrupt may be significant. On general-purpose processors and operating systems, isolating all external sources of delays is difficult. If there are strict latency requirements for a program relying on blocking synchronization in the fast path, consider minimizing external disturbances (real-time scheduling policies and interrupt rerouting are common). For the time scales where every disturbance counts, specialized real-time operating systems and hardware exist to address those challenges.


Nonblocking Synchronization

Before looking at reasons for adopting nonblocking data structures, this section introduces some terminology and briefly examines the complexities associated with the practical design, implementation, and application of nonblocking data structures.

In the literature, nonblocking synchronization and nonblocking operations fall into three primary classes of algorithms, each with a unique set of progress guarantees:

  • OF (Obstruction-Freedom). The algorithm provides single-thread progress guarantees in the absence of conflicting operations.
  • LF (Lock-Freedom). The algorithm provides systemwide progress guarantees. At least one active invocation of an operation is guaranteed to complete in a finite number of steps. There is no guarantee of starvation-freedom.
  • WF (Wait-Freedom). The algorithm provides per-operation progress guarantees. Every active invocation of an operation completes in a finite number of steps. In the absence of overload, there is a guarantee of starvation-freedom.

There is a total ordering to these classes of algorithms such that any wait-free algorithm is also lock-free and obstruction-free; any lock-free algorithm is also obstruction-free. A nonblocking data structure implements operations that satisfy the progress guarantees in the nonblocking hierarchy (summarized earlier) for some (or an infinite) level of concurrency.11 It is possible to implement a data structure that provides a nonblocking progress guarantee for a limited number of readers or writers. A traditional example of this is Michael and Scott's two-lock queue, which provides wait-free progress guarantees in the presence of up to one concurrent enqueue and one concurrent dequeue operation.19

It is also possible for a data structure to implement different progress guarantees for different sets of operations. Examples include a wait-free enqueue operation with a multiconsumer-blocking dequeue operation, as seen in the URCU (userspace read-copy-update) library (http://lttng.org/urcu), and a bounded-FIFO (first-in first-out) with a single-producer wait-free enqueue and multiconsumer lock-free dequeue, as seen in the Concurrency Kit library (http://concurrencykit.org/).

State-space explosion. Nonblocking data structures rely on atomic loads and stores or more complex atomic read-modify-write operations. The mechanism used depends on the level of concurrency the data structures intend to support and their correctness requirements.4,11 In lock-based synchronization, it is usually sufficient to reason in terms of locking dependencies and critical sections (and read-side or write-side critical sections for asymmetric lock-based synchronization such as read-write locks).3 When reasoning about the correctness of concurrent algorithms in the absence of critical sections, such as in nonblocking algorithms, the number of execution histories involving the interaction of shared variables can be enormous. This state space can quickly become infeasible for a mere human being to exhaust strictly through state-based reasoning. Consider the following program:

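The listing that originally appeared here is not reproduced; the following hedged stand-in is consistent with the analysis below: a function, executed once by each thread, that performs three stores to shared variables (the variable names and values are illustrative).

/* Shared variables; each thread executes thread_body once. */
static int x, y, z;

static void
thread_body(void)
{
    x = 1;  /* store 1 */
    y = 2;  /* store 2 */
    z = 3;  /* store 3 */
}

With two threads each performing these three stores, the interleavings number (2 × 3)! / (3!)^2 = 720 / 36 = 20, matching the count given below.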

Assuming this program is executed by two threads under a memory model that implements TSO, there are 20 possible execution histories, if you take only the three store operations into consideration and consider them completely uniform and deterministic. With four threads, 369,600 distinct execution histories exist. A derivation yields that for N deterministic processes with M distinct actions, there are (NM)! / (M!)^N execution histories.21 Coupled with programming-language reordering possibilities, processor-memory reordering possibilities, and out-of-order execution, the state space grows even more quickly in complexity.

Verifying the correctness of nonblocking algorithms, especially those of the wait-free and lock-free classes, requires a good understanding of a program's underlying memory model. The importance of understanding the memory model cannot be overstated, especially if the model itself is not precisely defined or allows for machine- and/or compiler-dependent behavior (as is the case for languages such as C or C++). What may seem like memory-model minutiae may in fact undermine the correctness of your program.

Consider the C snippet in Figure 11. Assume that send_to_consumer consists solely of loads and stores and no serializing instructions. On x86 processors that implement TSO, stores to memory locations are committed in order, with a few exceptions. Assuming that the memset function consists solely of regular store instructions, the assertion in consumer_thread should not fail. The reality is that certain instructions, operations, and memory types are not guaranteed to comply with the x86 TSO model.20 In this example, the memset function can be implemented with high-performance non-temporal and/or vectorized store instructions that violate the usual memory-ordering semantics. It would be necessary to determine whether the compiler or standard-library implementation and target platform guarantee memory-store serialization with respect to memset. Fortunately, most (but not all) popular implementations of memset that use these instructions emit a memory barrier of some kind (by silent contract), but this is not a requirement. If you are designing lock-less concurrent programs in a low-level language, be prepared to explore similar dark recesses of implementation-defined behavior.
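Figure 11 itself is not reproduced in this version of the article; the following hedged stand-in captures the shape of the snippet under discussion. The helpers send_to_consumer and receive_from_producer are hypothetical and are assumed, as in the prose, to consist solely of plain loads and stores.

#include <assert.h>
#include <string.h>

#define BUFFER_SIZE 4096

/* Hypothetical helpers consisting solely of plain loads and stores. */
void send_to_consumer(char *);
char *receive_from_producer(void);

static char buffer[BUFFER_SIZE];

void
producer_thread(void)
{
    /* memset may be implemented with non-temporal or vectorized stores
     * that are not ordered by the usual x86-TSO rules. */
    memset(buffer, 0, sizeof buffer);
    send_to_consumer(buffer);
}

void
consumer_thread(void)
{
    char *b = receive_from_producer();

    /* May fail if the memset stores were not ordered before publication. */
    assert(b[0] == 0);
}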

Correctness. The most common correctness guarantee for nonblocking data structures and operations is linearizability. This requires that operations appear to complete atomically at some point between their invocation and completion. The linearization point of an operation is the instant in which the operation appears to have been completed atomically. It helps to reason in terms of linearization points with respect to the sequential specification of the data structure. For reasons described earlier, proving the linearizability of an algorithm is often difficult.

Specializing a nonblocking data structure for a fixed level of concurrency can greatly reduce the complexity of the state space. Considering the potential size of the state space of nonblocking concurrent algorithms, however, more advanced methods of testing and validation are necessary for verification.

The woes of unmanaged languages. Languages that lack built-in support for automatic memory management, such as C or C++, require specialized techniques for managing dynamically allocated lock-less objects. (This section may not be relevant to those who plan on working exclusively with languages that have automatic memory management, such as Java or C#.)

A popular scheme for reducing contention in lock-based synchronization involves decoupling the liveness of an object from its reachability. For example, a concurrent cache implementation using lock-based synchronization may have a mutex protecting the cache but a separate mutex protecting concurrent accesses to objects contained within the cache. This scheme is commonly implemented using in-band reference counters, as illustrated in Figure 12.

This scheme allows for concurrent accesses to objects fetched from the cache while still allowing for concurrent fetches to proceed. The scheme works in the current example because the reachability of the object is managed atomically with respect to the reference counter. In other words, there is never a state in which the reference counter is 0 and the object is still in the cache. The object is never destroyed if there are still references to it. The reference counter in this case provides a mechanism for allowing the program to safely reclaim memory associated with concurrently accessed objects whose reachability is determined by cache state.
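Figure 12 is not reproduced in this version of the article; the sketch below shows the shape of the scheme just described, with a single cache lock making reachability and the reference count change together (cache_find is a hypothetical lookup helper, and the structure layout is illustrative).

#include <pthread.h>
#include <stdlib.h>

struct object {
    unsigned int ref;  /* in-band reference count */
    /* ...object payload... */
};

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

struct object *cache_find(int key);  /* hypothetical lookup helper */

/* The cache itself holds one reference, taken at insertion time, so the
 * count can never reach zero while the object is still reachable from the
 * cache; removal from the cache (not shown) releases that reference. */

struct object *
cache_lookup(int key)
{
    struct object *o;

    pthread_mutex_lock(&cache_lock);
    o = cache_find(key);
    if (o != NULL)
        o->ref++;  /* reachability and count change together */
    pthread_mutex_unlock(&cache_lock);
    return o;
}

void
object_release(struct object *o)
{
    unsigned int remaining;

    pthread_mutex_lock(&cache_lock);
    remaining = --o->ref;
    pthread_mutex_unlock(&cache_lock);

    if (remaining == 0)
        free(o);  /* last reference dropped; no longer in the cache */
}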

On modern processors, nonblocking data structures are implemented in terms of atomic operations that can modify only one target memory location at a time. Assume that the previous cache example was actually transformed to become lock-free. The reachability of an object in the cache is likely determined by a linearization point consisting of an atomic operation to a single memory location. For this scheme to work, the reachability of the object must be managed atomically with respect to concurrent cache_lookup operations. In the absence of atomic operations that operate on disparate memory locations, this common in-band reference-counting scheme is unable to provide sufficient safety guarantees.7

For this reason, dynamic nonblocking and lock-less data structures (structures that access dynamically allocated memory that may also be freed at runtime) must typically rely on alternative safe memory reclamation mechanisms. Besides full-fledged garbage collection, safe memory reclamation techniques include:

  • EBR (Epoch-Based Reclamation)9 is a passive scheme that allows for safe memory reclamation by carefully monitoring observers of a global epoch counter. It is unsuitable for protecting objects that are created and destroyed at a high frequency. Threads attempting synchronous (immediate) object destruction using EBR may block until the mechanism detects a safe destruction point. It is possible to implement variants of this scheme with minimal cost on the fast path.
  • HP (Hazard Pointers)18 is a scheme that works through the live detection of references to objects that require safe memory reclamation. To take advantage of it, nonblocking algorithms may require modification that is specific to this scheme. However, it is more suitable for protecting objects that are created and destroyed at a high frequency, and threads wanting to destroy protected objects are not required to block for an unbounded amount of time. This scheme can come at a high cost (full memory barrier) on the fast path and may not be suitable for traversal-heavy workloads.
  • QSBR (Quiescent-State-Based Reclamation)6,16 is a passive scheme that allows for safe memory reclamation through the detection of quiescent states in a program: points in a thread's execution at which it can hold no references to objects awaiting reclamation. It is not suitable for protecting objects that are created and destroyed at a high frequency. Threads attempting synchronous (immediate) object destruction using QSBR may block until the mechanism detects a safe destruction point. It is possible to implement variants of this scheme with zero cost on the fast path.
  • PC (Proxy Collection) is an out-of-band amortized reference-counting scheme.

Additional mechanisms exist to achieve safe memory reclamation as well.12 Thomas Hart et al. compared the performance properties of the first three schemes.10 Note that these mechanisms may have attractive performance properties compared with reference counting for many workloads and are not required to be used exclusively with nonblocking data structures.
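To make the usage pattern concrete, here is a hedged sketch using the liburcu library mentioned earlier (default flavor; per-thread registration and writer-side mutual exclusion are assumed but not shown): readers traverse without locks, and a writer defers freeing an unlinked node until all pre-existing readers have finished.

#include <urcu.h>
#include <stdlib.h>

struct node {
    int key;
    struct node *next;
};

static struct node *head;  /* list head; updated by a single writer */

int
lookup(int key)
{
    struct node *n;
    int found = 0;

    rcu_read_lock();
    for (n = rcu_dereference(head); n != NULL; n = rcu_dereference(n->next)) {
        if (n->key == key) {
            found = 1;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}

void
remove_head(void)
{
    struct node *old = head;

    if (old == NULL)
        return;
    rcu_assign_pointer(head, old->next);
    synchronize_rcu();  /* wait for pre-existing readers to drain */
    free(old);          /* no reader can still hold a reference */
}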

Resilience. Systems that require any of the following properties may benefit from the use of lock-free and wait-free algorithms:

  • Deadlock-Freedom. The absence of locking means that wait-free and lock-free data structures are immune to deadlock conditions, which can be difficult to avoid in large and complex locking hierarchies.
  • Async-Signal Safety. Providing deadlock-freedom and coherency in the context of asynchronous interruptions of a critical section is a nontrivial problem. Wait-free and lock-free synchronization is immune to these problems.
  • Termination Safety. Wait-free and lock-free operations that satisfy linearizability may be aborted at any time. This means that processors or threads that are in the middle of wait-free and lock-free operations may terminate without sacrificing the overall availability of a system.
  • Preemption Tolerance. With wait-freedom, the completion of any operation is guaranteed to occur in a finite number of steps in the absence of resource overload, even in the presence of preemption and other external delays. Lock-freedom guarantees the overall progress of a system, as a delay in one thread can only incur a bounded delay in another thread (linearizability may require other threads to assist in the completion of the delayed operation, which may require additional instructions on the fast path).
  • Priority Inversion Avoidance. In the absence of overload to resources such as memory, wait-freedom can provide strong bound guarantees for priority inversion, but this may at times come at a significant cost to the fast path. This additional overhead can be avoided with algorithms specialized for a fixed level of concurrency. Lock-freedom can avoid priority inversion on uniprocessor systems at a lower overhead than lock-based synchronization with the right scheduling policies.2 On multiprocessor systems with high levels of contention, bounded priority inversion is difficult to avoid (the extent to which it can be avoided is a function of the fairness and prioritization guarantees provided by the underlying coherence mechanism, which usually provides little to no guarantee on either). Lock-freedom remains vulnerable to unbounded priority inversion even in the absence of overload to memory resources, as a lower-priority thread may still cause any number of delays of a higher-priority thread with interfering operations. Contention avoidance may be used to prevent this situation.

Bridging the gap from abstract models. It is important to frame the progress guarantees of nonblocking synchronization in the context of real hardware. The wait-freedom progress guarantee can break down in real-world systems under severely high levels of contention over memory and coherency resources. At these levels of contention, it is possible to starve processors from acceptable rates of operation completion. The progress guarantees of your program can be only as good as those of the concurrency primitives it is built on and of the underlying coherency mechanism. Fortunately, as interconnect bandwidth continues to increase and interconnect latency continues to decrease, the severity of this problem dwindles as well. Lock-based synchronization is, of course, also susceptible to this problem. In the absence of overloading of memory resources and in the presence of jitter, wait-freedom can provide stronger latency-bound guarantees than lock-based synchronization (because of the increased tolerance to external delays). This latency bound is a function of the underlying microarchitecture, memory resources, and contention.

The fast-path latency of write operations on a nonblocking data structure is usually higher than the fast-path latency of a lock-based variant. This trade-off is especially common for nonblocking objects designed to work for any level of concurrency because they usually require one or more heavier-weight atomic read-modify-write operations on the fast path. This trade-off in fast-path latency is a function of the cost of an underlying platform's lock implementations and the complexity of the nonblocking algorithm.

This concept can be illustrated by comparing the performance of a spinlock-protected stack with that of a lock-free stack (the implementation used was ck_stack.h from concurrencykit.org). The lock-free stack uses a single compare_and_swap operation for each of the push and pop operations. On the x86 architecture, the fast-path latency of the spinlock-backed stack implementation is significantly lower than that of the lock-free stack implementation. On the Power 7 architecture, this is not the case. Table 2 displays the uncontested latency (in ticks) of these various operations.

On the x86 it is possible to implement a spinlock using significantly cheaper atomic operations (in this case, the xchg instruction) than the ones used in the lock-free stack. The TSO of the architecture does not require the use of explicit memory barriers for this algorithm. On the other hand, the lock-free stack implementation requires the more complex cmpxchg (and cmpxchg16b) instruction, which exhibits higher baseline latency.

On the Power architecture, both the spinlock and lock-free implementations rely on the same underlying LL/SC primitives. The primary difference is that the spinlock implementation requires a heavyweight memory barrier on invocation of every stack operation, while the nonblocking stack requires only lighter-weight load/store memory barriers.
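For reference, the following is a minimal Treiber-style push expressed with C11 atomics; it is a sketch rather than the ck_stack implementation, which also provides pop and the ABA protection (hence the cmpxchg16b mentioned above).

#include <stdatomic.h>

struct node {
    struct node *next;
};

struct stack {
    _Atomic(struct node *) head;
};

void
stack_push(struct stack *s, struct node *n)
{
    struct node *head = atomic_load_explicit(&s->head, memory_order_relaxed);

    do {
        n->next = head;
        /* Publish n only if head has not changed; on failure, head is
         * reloaded and the loop retries. */
    } while (!atomic_compare_exchange_weak_explicit(&s->head, &head, n,
        memory_order_release, memory_order_relaxed));
}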

Regardless of the architecture-imposed trade-off in latency, the desirable properties of nonblocking data structures are revealed under contention. Figure 13 depicts the latency distribution of a push-only workload on a single stack across four threads on an x86 server (Intel Xeon L5640). Active measures were taken to avoid jitter, including use of the SCHED_FIFO scheduling class, core affinity, and IRQ (interrupt request) affinity (Linux 2.6.32-100.0.19.el5).

The latency distribution illustrated in Figure 13 is not a result of stronger latency bound on individual stack operations, but a side effect of stronger systemwide progress guarantees provided by lock-free algorithms. Specifically, lock-free algorithms are able to guarantee systemwide progress in the presence of preemption. In the blocking data structure, a preempted or blocked thread inside the stack-operation critical section prevents systemwide progress. On the other hand, the lock-free stack only forces a stack operation retry if another thread has made progress by updating the stack. If all sources of jitter were removed from the test system, then the latency profile of the spinlock-based stack would prevail.

Asymmetric workloads involving a combination of write- and read-only operations can also benefit from nonblocking synchronization. Often, in combination with the safe memory reclamation techniques described earlier, it is possible to provide strong guarantees of progress at minimal to no cost of heavyweight synchronization instructions on the fast path. This is a desirable property: it avoids many of the write-side/read-side fairness problems exhibited in popular single-preference read-write locks, while also avoiding fast-path degradation associated with heavier-weight fair read-write locks.


Conclusion

Having examined some of the common principles of parallel programming on multiprocessor systems and the real-world implications of nonblocking synchronization, this article has aimed to equip readers with the background needed to explore alternatives to lock-based synchronization. Even though the design, implementation, and verification of nonblocking algorithms are all difficult, these algorithms are becoming more prevalent among standard libraries and open-source software. The information presented here helps identify situations that call for the resiliency and performance characteristics of nonblocking synchronization.


Acknowledgments

Thanks to Mathieu Desnoyers and Paul McKenney for their contributions in lock-less synchronization to the open source community. Thanks to Devon H. O'Dell and Gabriel Parmer for feedback on this article. Thanks to Theo Schlossnagle for his support in the early stages of this process. Thanks to Silvio Cesare, Wez Furlong, Maxime Henrion, David Joseph, Paul Khuong, Abel Mathew, Shreyas Prasad, Brendon Scheinman, Andrew Sweeney, and John Wittrock for their helpful feedback. Thanks to Jim Maurer, the ACM Queue Editorial Board, Matt Slaybaugh, and Susie Holly.

Related articles
on queue.acm.org

Real-World Concurrency
Bryan Cantrill and Jeff Bonwick
http://queue.acm.org/detail.cfm?id=1454462

Unlocking Concurrency
Ali-Reza Adl-Tabatabai, Christos Kozyrakis and Bratin Saha
http://queue.acm.org/detail.cfm?id=1189288

Software and the Concurrency Revolution
Herb Sutter and James Larus
http://queue.acm.org/detail.cfm?id=1095421


References

1. AMD. AMD64 Architecture Programmer's Manual, 2: System Programming. Publication 24593, 2007.

2. Anderson, J.H., Ramamurthy, S. and Jeffay, K. Real-time computing with lock-free shared objects. ACM Transactions on Computer Systems 15, 2 (1997), 134–165; http://doi.acm.org/10.1145/253145.253159.

3. Attiya, H., Ramalingam, G. and Rinetzky, N. Sequential verification of serializability. SIGPLAN Notices 45, 1 (2010), 31–42; http://doi.acm.org/10.1145/1707801.1706305.

4. Attiya, H., Guerraoui, R., Hendler, D., Kuznetsov, P., Michael, M.M. and Vechev, M. Laws of order: Expensive synchronization in concurrent algorithms cannot be eliminated. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (2011), 487–498; http://doi.acm.org/10.1145/1926385.1926442.

5. Cantrill, B. and Bonwick, J. Real-world concurrency. Queue 6, 5 (2008), 16–25; http://doi.acm.org/10.1145/1454456.1454462.

6. Desnoyers, M., McKenney, P.E., Stern, A.S., Dagenais, M.R. and Walpole, J. User-level implementations of read-copy update. IEEE Transactions on Parallel and Distributed Systems 23, 2 (2012); 375–382.

7. Detlefs, D.L., Martin, P.A., Moir, M., Steele Jr., G.L. Lock-free reference counting. In Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing (2001), 190–199; http://doi.acm.org/10.1145/383962.384016.

8. Dice, D., Marathe, V.J. and Shavit, N. Lock cohorting: A general technique for designing NUMA locks. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: (2012), 247–256; http://doi.acm.org/10.1145/2145816.2145848.

9. Fraser, K. Practical lock freedom. Ph.D. thesis. Computer Laboratory, University of Cambridge (2003). Also available as Technical Report UCAM-CL-TR-639, Cambridge University.

10. Hart, T.E., McKenney, P.E., Demke Brown, A. and Walpole, J. Performance of memory reclamation for lockless synchronization. Journal of Parallel and Distributed Computing 67, 12 (2007), 1270–1285; http://dx.doi.org/10.1016/j.jpdc.2007.04.010.

11. Herlihy, M. Wait-free synchronization. ACM Transactions on Programming Languages and Systems 13, 1 (1991), 124–149; http://doi.acm.org/10.1145/114005.102808.

12. Herlihy, M., Luchangco, V. and Moir, M. The repeat offender problem: a mechanism for supporting dynamic-sized lock-free data structures. Technical Report. Sun Microsystems Inc, 2002.

13. Herlihy, M. and Shavit, N. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, 2008.

14. Kandemir, M., Choudhary, A., Banerjee, P. and Ramanujam, J. On reducing false sharing while improving locality on shared memory multiprocessors. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, 203–211. IEEE Computer Society; http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=807529&contentType=Conference+Publications.

15. McKenney, P.E. Selecting locking primitives for parallel programming. Commun. ACM 39, 10 (1996), 75–82; http://doi.acm.org/10.1145/236156.236174.

16. McKenney, P.E. Exploiting deferred destruction: An analysis of read-copy-update techniques in operating system kernels. Ph.D. dissertation. Oregon Health and Science University. AAI3139819, 2004.

17. McKenney, P.E. Is parallel programming hard, and, if so, what can you do about it? kernel.org (2011); https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html.

18. Michael, M.M. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems 15, 6 (2004), 491–504; http://dx.doi.org/10.1109/TPDs.2004.8.

19. Michael, M.M. and Scott, M.L. Simple, fast, and practical nonblocking and blocking concurrent queue algorithms. Technical Report. University of Rochester, Rochester, NY, 1995.

20. Owens, S., Sarkar, S., Sewell, P. A better x86 memory model: x86-TSO. Theorem Proving in Higher Order Logics. S. Berghofer, T. Nipkow, C. Urban, and M. Wenzel, eds, 2009, 391–407. Springer, Berlin Heidelberg; http://dx.doi.org/10.1007/978-3-642-03359-9_27.

21. Schneider, F.B. On Concurrent Programming. Springer-Verlag, New York, 1997.


Author

Samy Al Bahra is an engineering team lead at AppNexus, playing a key role in the development of a leading real-time online advertising platform. Previously, he was involved with Message Systems in the development of a high-performance messaging server and was the lead developer of the George Washington University High Performance Computing Laboratory UPC I/O library reference implementation. He is also the maintainer of Concurrency Kit (http://concurrencykit.org).


Figures

Figure 1. The MOESI state machine.1

Figure 2. Reading modified variable on remote processor.

Figure 3. Initial state of system.

Figure 4. After initial load of variable x.

Figure 5. After store to x.

Figure 6. After remote read of x.

Figure 7. A program that exhibits false sharing.

Figure 8. A program lacking necessary memory barriers.

Figure 9. Utilizing memory barriers for visibility and ordering.

Figure 10. Avoiding unnecessary read-write cycles.

Figure 11. Implementation-defined behavior of memset.

Figure 12. Traditional in-band reference counting scheme.

Figure 13. Latency distribution of a push-only workload.


Tables

Table 1. The cost of atomic read-modify-write operations.

Table 2. Uncontested latency of various operations.



©2013 ACM  0001-0782/13/07

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2013 ACM, Inc.


 
