Toward generic atomic operations

August 1, 2012

This article was contributed by Jon Masters

Modern Linux distributions support a number of different computer architectures. Each of these architectures has its own quirks and implementation differences that are largely abstracted by a clever collaboration between the kernel and system libraries, such as the GNU C Library (glibc). However, there are still some ways in which core architecture differences are exposed to higher-level software. One example of this is in the implementation of atomic memory operations.

Atomic operations are necessary to ensure programming correctness in those situations where there are multiple threads of simultaneous execution. (Atomic operations are even necessary on uniprocessor systems, where interrupts and asynchronous scheduling of other threads provide the illusion of multithreading.) Atomicity means that a given operation (such as incrementing a counter variable) takes place in an indivisible fashion; its result is either visible to all CPUs in the system instantaneously, or does not take place at all (it is similar in concept but less fashionable than transactional memory).

Atomic operations are typically fairly small, hand-optimized assembly functions that provide for atomic increment and decrement of counters, acquisition and release of locks, and so on. Since these operations differ from one architecture to another, typically few developers on any given project understand the different implementations in their entirety, and even fewer care to vouch for the code being correct across all supported architectures. Although there are generic implementations available in libraries such as pthreads, not all projects can make use of them, for a variety of reasons, including a desire to be portable to non-Linux platforms; thus a number of projects within the average Linux distribution still contain their own custom implementations of atomic operations.

Atomic operations are particularly useful on modern systems with many CPUs running multi-threaded applications, but even a system with a single CPU (core) has a need for them. After all, the Linux kernel may interrupt a running task thread "A" to service an interrupt routine, and may then schedule a different task thread "B" before returning to the one that was originally interrupted. Without a means to ensure certain operations have taken place atomically, there would be no way to cope with potential interference between task thread B and task thread A (e.g., if both threads race for the same lock or operate on the same variable).

How do atomic operations work?

Atomic operations fundamentally require underlying hardware support. There are, broadly speaking, two popular mechanisms used by CPUs in implementing support for atomic operations in modern computer architectures. The older, more traditional approach involves directly manipulating memory locations, for example, a compare-and-swap (or compare-and-exchange) instruction such as CMPXCHG on Intel's x86 and Itanium architectures. These instructions compare the value of a given memory location with a value supplied as part of the instruction. If the two values are identical, then yet another supplied value is written back to the memory location, while the overall result is signaled in the form of a returned value (almost universally in a register). This whole sequence takes place as a single processor instruction, for example by locking the processor local bus, and disabling external interrupts for its duration.

A more modern alternative to directly acting upon memory locations is to implement a reservation engine within the processor. A reservation engine (as used by modern RISC architectures such as ARM, POWER, etc.) is typically implemented under the control of two special processor instructions. The first instruction, often called load-with-reservation, load-exclusive, or load-link, atomically loads the value of a given memory location into a register and marks that memory location as reserved.

The loaded value can then be manipulated arbitrarily before a second instruction, often called store-exclusive, or store-conditional, atomically stores an updated value from a register back to a given memory location, provided that no modification has been made to that memory location in the interim. The store-exclusive operation returns a value indicating whether the store operation completed successfully or not, which is important because there is an opportunity for external interference between the load, modification, and subsequent store. This means that higher-level atomic operations built using these instructions typically involve a loop, retrying the entire operation until the store is successful.

Reservation engines are slightly more complex to work with in software (requiring two instructions and a comparison in a loop block), but they come with multiple benefits. Although compare-and-exchange appears simpler because it is implemented in a single instruction, it in fact causes poor performance in the CPU pipeline because multiple additional sub-stages are required for the implied memory operations. By contrast, the reservation engine approach explicitly separates memory reads and stores into multiple operations. A reservation engine can be implemented separately from the bulk of the core CPU logic, and can be as complex as desired (including necessary logic to synchronize with other reservation engines).

Some reservation engine implementations handle only a single memory location at a time on a given processor, while others are more complex. In every case, outstanding reservations are invalidated upon a context switch between running tasks (often as a result of a specific invalidation in the context-switch code). The reservation approach can also handle the "ABA problem"—that is, it can detect any changes to the target memory location after the atomic load, even if the original value is written back prior to the store, because the reservation engine is aware of all memory modifications.

The story doesn't quite end there. Some architectures lack full support for certain atomic operations that are required by higher-level software, such as atomic 64-bit (multiple word) load and store operations. In this case, there are workarounds (e.g. the "kuser" helper, a VDSO-like helper on older ARM processors), but that is a topic best saved for another article.

How are atomic operations used?

Atomic-operations libraries typically provide a set of functions that include incrementing and decrementing a memory location, compare-and-swap of a memory location, and higher-level operations built using these functions, such as lock acquisition and release. All of these various operations are built using the fundamental architecture-specific processor instructions of the kind described above. As an example, the OpenMPI message-passing library includes the following inline assembly code to implement an atomic 32-bit addition operation on version 7 of the ARM Architecture:

    START_FUNC(opal_atomic_add_32)
           LSYM(13)
           ldrex   r2, [r0]        @ exlusively load address at r0 into r2
           add     r2, r2, r1      @ increment the value of r2 with value in r1
           strex   r3, r2, [r0]    @ attempt to store the value of r2 at the address in r0
           cmp     r3, #0          @ r3 contains result from store exclusive, test if successful 
           bne     REFLSYM(13)     @ repeat entire operation if it was interrupted
           mov     r0, r2          @ return value that was written
           bx      lr
    END_FUNC(opal_atomic_add_32)

This atomic increment function works by using the special ldrex and strex instructions, which control the CPU's reservation engine, to gain exclusive access to a desired memory location. The example code first loads the contents of a given memory location into a general-purpose register, adds a value to the register, and then tests the result of attempting to exclusively store this change back to memory. If it is successful, the function returns. If it is not successful, the operation is repeated until it completes without interference.

OpenMPI includes a custom atomic-operations library that implements support for 13 different base architectures. Some of those architectures have multiple ways to achieve the same thing, depending on which version is in use. For example, ARM processors have moved away from the deprecated SWP (compare-and-swap) instruction in favor of a reservation-engine-based approach. Both approaches need to be supported if code is to run on newer and older ARM processors. It is unfortunate that projects such as OpenMPI have needed to implement their own atomic-operations libraries, which must be periodically updated for new processors and are hard to maintain because they require special knowledge of multiple underlying architectures.

The C11 memory model

The main culprit for this state of affairs is the venerable C programming language. Traditionally, C had no explicit internal notion of multi-threaded applications, and only a very weakly ordered memory model. That is, it was hard to guarantee that the compiler would not reorder memory operations on shared variables because the language lacked the built-in constructs that are necessary to inform the compiler of such hidden data dependencies. Over the years, independent platform-specific libraries have provided support for general threading abstractions, including atomic operations performed on their own defined types. This is all well and good, but not all projects can rely upon such platform libraries for atomic operations, especially those that want to remain highly portable to non-Linux systems.

This is where C11 comes in. C11 introduces a new memory model explicitly designed with support for threading and atomic operations. It introduces the new standard header stdatomic.h, atomic integer types such as atomic_int (constructed using the _Atomic type qualifier), and a new memory_order enumerated type that defines various levels of memory ordering from the weakest memory_order_relaxed (no specific ordering requirement) through to memory_order_seq (sequentially consistent), the strongest ordering. Using the C11 memory model, the previous inline assembly can be reduced to defining an _Atomic typed variable and using one of the atomic fetch-and-modify generic functions, such as atomic_fetch_add().

Here is an example of using the C11 defined atomics:

    #include <stdatomic.h>

    _Atomic int magic_number = ATOMIC_VAR_INIT(42); // can also use _Atomic(int)
    atomic_fetch_add(&magic_number, 5);             // make Star Trek fans happy

This defines and initializes a new variable called magic_number to the value 42 (using ATOMIC_VAR_INIT()) before correcting that value for the true answer to the ultimate question of life, the universe, and everything, which, as everyone knows, Star Trek correctly defined to be 47. Using the new C11 extensions, projects such as OpenMPI do not need to implement their own atomic-operations library, because the underlying language now provides the necessary support, already optimized for each target architecture.

There is, however, at least one little problem with rushing to embrace C11. As of this writing, GCC and glibc do not yet have full support for the new atomic types. This is slated to be added in the GCC 4.8 time frame. (The glibc maintainers are aware of the topic, and plan to incorporate support once it is available in GCC.) Meanwhile, GCC 4.7 gained support for a new set of built-in functions to provide memory-model-aware atomic operations that were designed specifically to meet the requirements of the C11 memory model. The idea is that the higher-level C11 atomic primitives can be easily built using these built-ins in time for GCC 4.8. In the meantime, there are several alternative options. One of these is to use a third-party macro-based implementation of the C11 types (which already use the GCC built-ins), of which several exist.

Another option prior to broader C11 support being available is to use the new GCC built-ins directly. For common operations, such as atomic increment, the GCC 4.7 built-in atomic functions look very similar to those that form part of the broader C11 standard:

    __atomic_store_n(&v, 42, __ATOMIC_SEQ_CST);
    __atomic_add_fetch(&v, 5, __ATOMIC_SEQ_CST);

The OpenMPI atomic-add example code could thus be replaced with a single call to __atomic_add_fetch(), which will atomically fetch a value from a memory location, add a supplied value, and return the result, doing the right thing for every supported architecture. Compiling the example and disassembling it will (unsurprisingly) produce a sequence of operations that appears very similar to the inline assembly it replaces. Of course, one does need to be careful in using the GCC built-ins directly because they do not require the use of variables with an _Atomic type qualifier. This means that it is possible to mix the use of atomic functions with manipulations of regular variables without triggering any compiler warnings. Still, this is no different than existing code failing to use a call to a special inline assembly function, which is also incorrect.

In time, it is the hope of this author that most projects implementing custom inline assembly for atomic operations can move to a standard C11 based implementation using stdatomic.h. That would be both portable to many different platforms, and easier to maintain by distributions and upstream projects themselves, because a specialist knowledge of the architecture specifics can be abstracted by the compiler. (Note, however, that projects may need to continue to support legacy approaches to atomic operations, if they want to continue supporting old compilers.) You can read more about the new C11 atomic operations (including the details of the memory model and its orderings not covered in depth here) in section 7.17.7 of the final draft version of the C11 specification [PDF].

Index entries for this article
GuestArticles	Masters, Jon

Toward generic atomic operations

Posted Aug 2, 2012 3:06 UTC (Thu) by jamesd (guest, #39451) [Link]

In case people aren't grokking "make Star Trek fans happy", this https://plus.google.com/106265217227408958782/posts/3FrbN...

Toward generic atomic operations

Posted Aug 2, 2012 3:15 UTC (Thu) by jcm (subscriber, #18262) [Link]

In case anyone wonders why 47 is the one true magic number...

http://www.youtube.com/watch?v=SPoiH0JlQ9A

Jon.

Toward generic atomic operations

Posted Aug 2, 2012 5:23 UTC (Thu) by kev009 (guest, #43906) [Link] (2 responses)

Some of the tips in this article are pretty cavalier. Notably, until C11 is common enough to be required by a project, one _SHOULD_ use a library like libatomic_ops (from Boehm GC) for these operations and NOT proprietary compiler extensions.

Toward generic atomic operations

Posted Aug 2, 2012 7:42 UTC (Thu) by jcm (subscriber, #18262) [Link]

Good points. The intention here is not to advocate for "proprietary extensions", but on some level I'm saying that if a project relies upon GCC, it's worth just considering using their existing atomics rather than re-implementing the wheel with inline assembly. An preferred alternative is to use an existing library, but that is not always possible. I just reviewed some of the Fedora ARM codebase for inline assembly and this kind of stuff showed up way more than I expected. It's something that needs fixing. I hope C11 is quickly adopted as means to do so.

Toward generic atomic operations

Posted Aug 2, 2012 14:37 UTC (Thu) by tvld (guest, #59052) [Link]

Also note that the atomics are just a part of the memory model. Using libatomic-ops, for example, does not guarantee that your compiler actually adheres to the memory model (e.g., does not introduce speculative stores that could be data races in a multi-threaded execution). There are different opinions about whether the C++11/C11 memory model should be the default even for non C++11/C11 compilations. IMO, this would be A Good Thing; other people are worried about potential performance disadvantages for single-threaded programs.

Toward generic atomic operations

Posted Aug 2, 2012 14:40 UTC (Thu) by tvld (guest, #59052) [Link] (4 responses)

GCC 4.7 already has experimental support for C++11 atomics, including the front-end bits. So if you have a C++ program, you can already use the new atomics.

Toward generic atomic operations

Posted Aug 2, 2012 23:28 UTC (Thu) by nix (subscriber, #2304) [Link] (3 responses)

... as long as you don't mind using C++11. Which is ABI-incompatible with C++98. This is often going to be a bit of a problem.

Toward generic atomic operations

Posted Aug 2, 2012 23:30 UTC (Thu) by nix (subscriber, #2304) [Link] (2 responses)

Of course, this incompatibility is only apparent in the STL. So I suppose one could write a non-STL-using atomic-ops translation unit that just did atomic stuff and didn't go near the STL, then *just* compile *that* as C++11. It should link to C++98 stuff without any trouble, and the resulting binary should work even if the C++98 stuff uses the STL, as long as you don't specify C++11ness at link time.

Toward generic atomic operations

Posted Aug 3, 2012 13:25 UTC (Fri) by jwakely (subscriber, #60262) [Link] (1 responses)

The std::list incompatibility has been reverted, so only GCC versions 4.7.0 and 4.7.1 are incompatible with other versions. The same releases had an ABI bug with passing std::pair, which has also been fixed now. There are some other differences still that result in (mostly harmless) ODR violations. The new plan is to implement custom attributes that affect name mangling to allow C++03/C++11 interoperability, see http://gcc.gnu.org/ml/gcc/2012-07/msg00100.html

Toward generic atomic operations

Posted Aug 4, 2012 11:58 UTC (Sat) by nix (subscriber, #2304) [Link]

Excellent news! The mangling change proposed in the mail you link to seems ever so much better than a silent ABI change.

(And, gosh, that's embarrassing: the revert, like the implemntation, was done by my own co-worker just last month and I didn't notice until I made a fool of myself in this thread. I must pay more attention!)

What about C11 atomics in clang/LLVM?

Posted Aug 2, 2012 16:53 UTC (Thu) by jnareb (subscriber, #46500) [Link]

Does clang compiler (the C compiler of LLVM project) support stdatomic.h from C11 standard?

Toward generic atomic operations

Posted Aug 2, 2012 17:02 UTC (Thu) by jonabbey (guest, #2736) [Link]

Great essay, thanks.

Toward generic atomic operations

Posted Aug 3, 2012 1:59 UTC (Fri) by wahern (subscriber, #37304) [Link] (1 responses)

I realize that this is Linux Weekly News, but since when was Linux the only platform with pthreads support? Heck, even Windows has pthreads-win32. The only platform I know of with broken pthreads is OpenBSD, which is going through a messy transition from user threads to 1:1 pre-emptive threading. (I say broken because some signal handling is presently fubar.) The reason people roll their own threading library isn't because they want to be portable to non-Linux, it's because they want to be "portable" to Windows. That's a tremendous distinction. (Also, it's because rolling your own threading API is one of those low-level tasks that's enticing to people who still think of C as a high-level assembler, aching to be abused.)

And, unless I'm mistaken, both GCC and clang have the capability to support <stdatomic.h>, they just don't ship with that header. But all the way back to GCC 4.1'ish (an eternity ago), GCC has had atomic primitives, based on a proposed Intel specification and which bares more than a passing resemblance to the C11 and C++11 API. If you want <stdatomic.h> you can find an implementation based on GCC and clang intrinsics here,

http://svnweb.freebsd.org/base/head/include/stdatomic.h?v...

I'm not sure if it's 100% complete, but for all the basic stuff (increment, store, load, compare-and-swap), it should suffice.

Toward generic atomic operations

Posted Aug 3, 2012 3:34 UTC (Fri) by jcm (subscriber, #18262) [Link]

Like I said, there are implementations already that you can install (there's also been stuff in P99, and Google turns up others, like your example), and I noted that GCC has built-ins for atomics. I didn't go into the older implementation that wasn't universal for every architecture and wasn't used by end applications because that wasn't really its purpose. C11 by contrast isn't a hack, it's a standard that is supposed to be used universally.

Anyway, thanks for the feedback!

Toward generic atomic operations

Posted Aug 7, 2012 14:18 UTC (Tue) by sdalley (subscriber, #18550) [Link] (1 responses)

The first instruction, often called load-with-reservation, load-exclusive, or load-link, atomically loads the value of a given memory location into a register and marks that memory location as reserved.

The loaded value can then be manipulated arbitrarily before a second instruction, often called store-exclusive, or store-conditional, atomically stores an updated value from a register back to a given memory location, provided that no modification has been made to that memory location in the interim.

...In every case, outstanding reservations are invalidated upon a context switch between running tasks (often as a result of a specific invalidation in the context-switch code)....

Why do we need the last bit? Surely you don't want to invalidate reservations willy-nilly - the pre-empting task may have nothing to do with the one being pre-empted and therefore be no danger to it. Maybe a task kicked off by unrelated network activity, etc, etc. If it is actually contending, it will also be attempting a reservation and will fail/loop at that point anyway?

Toward generic atomic operations

Posted Aug 8, 2012 9:18 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

ISTR running into a problem where store-conditional appeared to be behaving as "store if there is a currently valid reservation" rather than the desired "store if the destination address is the target of the currently valid reservation".

Toward generic atomic operations

Posted Aug 9, 2012 5:51 UTC (Thu) by scientes (guest, #83068) [Link]

Since Linux 3.1 (or Linux 2.6.30 with armv6k or later) and gcc 4.7, ARM supports the same 8 byte atomic operations as x86 does,(using the __kernel_cmpxchg64 syscall) so hardware doesn't really get in your way. (armv6k+ cpus do this much faster however)