
Lockless patterns: an introduction to compare-and-swap

March 12, 2021

This article was contributed by Paolo Bonzini


Lockless patterns

In the first part of this series, I showed you the theory behind concurrent memory models and how that theory can be applied to simple loads and stores. However, loads and stores alone are not a practical tool for the building of higher-level synchronization primitives such as spinlocks, mutexes, and condition variables. Even though it is possible to synchronize two threads using the full memory-barrier pattern that was introduced last week (Dekker's algorithm), modern processors provide a way that is easier, more generic, and faster—yes, all three of them—the compare-and-swap operation.

From the point of view of a Linux kernel programmer, compare-and-swap has the following prototype:

    T cmpxchg(T *ptr, T old, T new);

where T can be either an integer type that is at most as wide as a pointer, or a pointer type. In order to support such polymorphism, cmpxchg() is defined as a macro rather than a function, but the macro is written carefully to avoid evaluating its arguments multiple times. Linux also has a cmpxchg64() macro that takes 64-bit integers as the arguments, but it may not be available on all 32-bit platforms.

cmpxchg() loads the value pointed to by ptr and, if it is equal to old, it stores new in its place. Otherwise, no store happens. The value that was loaded is then returned, regardless of whether it matched old or not. The compare and the store are atomic: if the store is performed, you are guaranteed that no thread could sneak in and write a value other than old to *ptr. Because a single operation provides the old version of the value and stores a new one, compare-and-swap is said to be an atomic read-modify-write operation.
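
In pseudo-C, those semantics could be sketched as follows; this is illustrative only, since the real cmpxchg() performs the load, the compare, and the store as one indivisible step:

    /* Illustrative sketch only: a real cmpxchg() executes all of this
     * atomically, typically as a single instruction. */
    T cmpxchg(T *ptr, T old, T new)
    {
        T val = *ptr;

        if (val == old)
            *ptr = new;
        return val;
    }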

In Linux, the cmpxchg() macro puts strong ordering requirements on the surrounding code. A compare-and-swap operation comprises a load and a store; for the sake of this article, you can consider them to be, respectively, load-acquire and store-release operations. This means that cmpxchg() can synchronize with both load-acquire and store-release operations performed on the same location by other threads.

Lock-free stacks and queues

"Lockless algorithms for mere mortals" already mentioned the use of compare-and-swap for lock-free lists. Here, we'll look at how a lockless, singly linked list could be implemented in C, and what it could be useful for. First of all, however, let's recap how a single-threaded C program would add an item in front of a singly-linked list:

    struct thing {
        struct thing *next;
        ...
    };
    struct thing *first;

    node->next = first;
    first = node;

Armed with the knowledge from the first part of the series, we know that we should turn the assignment to first into a store-release, so that node->next is visible to other threads doing a load-acquire. This would be an instance of the pattern presented there.
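
In code, using the primitives from the first article, the single-producer version would look like this:

    /* Producer: the store-release makes node->next visible before
     * node itself becomes reachable. */
    node->next = first;
    smp_store_release(&first, node);

    /* Consumer: the load-acquire pairs with the store-release above. */
    struct thing *p = smp_load_acquire(&first);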

However, that pattern only worked for a single producer and a single consumer; in the presence of multiple producers, the two instructions would have to be placed under a lock. This is because the value of first can change between the two instructions, for example if another element is added at the same time by another thread. If that happens, the outgoing pointer (node->next) in the new element will point to whatever first held before the assignment happened. This teaches us an important, if obvious, lesson: acquire and release semantics are just one part of designing and proving the correctness of lockless algorithms. Logic mistakes and race conditions can and will still happen.

Instead of using a lock, cmpxchg() lets us catch the other thread in the act of modifying first. Something like this would work for any number of producers:

    if (cmpxchg(&first, node->next, node) == node->next)
        /* yay! */
    else
        /* now what? */

There are still a few things to sort out, as you can see. First and foremost, what to do if cmpxchg() notices that first has changed. The answer in that case is simply to copy the new value of first (which cmpxchg() returned) into node->next and try again. This is possible because node is still invisible to other threads. Nobody will notice our stroke of bad luck.

A second and more subtle question is: how do we load first? The load need not have either acquire or release semantics, because the code is not doing other memory accesses that depend on the value of first. On the other hand, perhaps the big bad optimizing compiler might think that first cannot change across iterations of the loop? Even though Linux's cmpxchg() does prevent this kind of compiler optimization, it is a good practice to mark relaxed loads and stores of shared memory using READ_ONCE() and WRITE_ONCE().

Putting everything together, we get:

    struct thing *old, *expected;
    old = READ_ONCE(first);
    do {
        node->next = expected = old;
        old = cmpxchg(&first, expected, node);
    } while (old != expected);
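
As an aside, the kernel also provides try_cmpxchg(), which returns a boolean and updates the expected value on failure; with it, the same loop can be written more compactly:

    old = READ_ONCE(first);
    do {
        node->next = old;
    } while (!try_cmpxchg(&first, &old, node));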

This is all nice, but it's only half of the story. We still have not seen how the list can be read on the consumer side. The answer is that it depends on the relationship between producers and consumers, the number of consumers, and whether the consumers are interested in accessing elements in LIFO (last-in-first-out) or FIFO (first-in-first-out) order.

First of all, it could be that all reads happen after the producers have finished running. In this case, the synchronization between producers and consumers happens outside the code that manipulates the list, and the consumers can access the list through normal, non-atomic loads. The synchronization mechanism could be a thread-exit/thread-join pair such as the one we saw in the first article, for example.

If reads are rare or can be batched, a trickier implementation could allow producers to proceed locklessly while serializing the reads. Such an implementation could use a reader-writer lock (rwlock); however, the producers would take the lock for shared access (with a read_lock() call) and the consumer(s) would take the lock for exclusive access (with write_lock())! This also prevents reads from executing concurrently with writes and, therefore, the consumer is able to employ non-atomic loads. Hopefully, this example will show that there's no such thing as too many comments or too much documentation, even if you're sticking to the most common lockless programming patterns.
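
A minimal sketch of the idea; the produce() and consume_all() functions and the process() helper are made up for the example:

    /* The rwlock is taken "backward": shared by producers, exclusive
     * by the consumer. */
    DEFINE_RWLOCK(list_lock);

    void produce(struct thing *node)
    {
        struct thing *old, *expected;

        read_lock(&list_lock);      /* producers do not block each other */
        old = READ_ONCE(first);
        do {
            node->next = expected = old;
            old = cmpxchg(&first, expected, node);
        } while (old != expected);
        read_unlock(&list_lock);
    }

    void consume_all(void)
    {
        struct thing *p;

        write_lock(&list_lock);     /* excludes all producers... */
        for (p = first; p; p = p->next)
            process(p);             /* ...so non-atomic loads are fine */
        first = NULL;
        write_unlock(&list_lock);
    }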

If many consumers run concurrently with the producers, but they can consume the elements in any order, the consumers can obtain a whole batch of elements (removing them from the list) with a single instruction:

    my_things = xchg_acquire(&first, NULL);

xchg(), like cmpxchg(), performs an atomic combination of a read and a write to a memory location. In this case it returns the previous head of the list and writes NULL in its place, thus emptying the list. Here I am using the xchg_acquire() variant, which has acquire semantics for its load of first, but (just like WRITE_ONCE()) does not apply release semantics when it stores NULL. Acquire semantics suffice here, since this is still basically the same store-release/load-acquire pattern from part 1. More precisely, it is a multi-producer, multi-consumer extension of that pattern.

Should we do the same on the writer side and replace cmpxchg() with cmpxchg_release()? Indeed we could: in principle, all that the writer needs is to publish the store of node->next to the outside world. However, cmpxchg()'s acquire semantics when loading the list head have a useful side effect: they synchronize each writer with the thread that wrote the previous element. In the following picture, the load-acquire and store-release operations are all part of a successful series of cmpxchg() calls:

    thread 1: load-acquire first (returns NULL)
              store-release node1 into first
                  \
      thread 2: load-acquire first (returns node1)
                store-release node2 into first
                    \
         thread 3: load-acquire first (returns node2)
                   store-release node3 into first
                       \
            thread 4: xchg-acquire first (returns node3)

Thread 3's cmpxchg() is the only one to synchronize with thread 4's xchg_acquire(). However, because of transitivity, all cmpxchg()s happen before the xchg_acquire(). Therefore, if cmpxchg() is used in the writers, the readers can go through the list with regular loads.
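
Concretely, a consumer could therefore drain the list and walk it with plain loads; process() is again a placeholder:

    /* xchg_acquire() synchronizes with the last successful cmpxchg();
     * transitivity then makes every earlier push visible as well. */
    struct thing *p, *next;

    for (p = xchg_acquire(&first, NULL); p; p = next) {
        next = p->next;             /* plain load, no acquire needed */
        process(p);
    }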

If, instead, the writers used cmpxchg_release(), the happens-before relation would look like this:

    thread 1: load-acquire first (returns NULL)
              store-release node1 into first

      thread 2: load first (returns node1)
                store-release node2 into first

         thread 3: load first (returns node2)
                   store-release node3 into first
                       \
            thread 4: xchg-acquire first (returns node3)

Thread 4 would always read node2 from node3->next, because it read the value that thread 3 wrote to first. However, there would be no happens-before edge from threads 1 and 2 to thread 4; therefore, thread 4 would need a smp_load_acquire() in order to see node1 in node2->next.

The above data structure is already implemented in Linux's linux/llist.h header; you're highly encouraged not to reinvent the wheel and to use that version instead. That API, in fact, includes two more interesting functions: llist_del_first() and llist_reverse_order().

llist_del_first() returns the first element of the llist and advances the head pointer to the second element. Its documentation warns that it should only be used if there is a single reader. If, instead, there were two consumers, an intricate sequence of adds and deletes could lead to the so-called ABA problem. Since this article rests firmly on the principle of "if it hurts, don't do it", a detailed explanation is beyond its scope. However, it's worth pointing out the similarity with the earlier rwlock example. Just as in that case, multiple consumers have to use locking to serialize concurrent access to the data structure: llist_del_first() lets writers call llist_add() without taking a lock at all, while readers serialize among themselves with a spinlock or a mutex.
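
A sketch of that arrangement; my_list, my_lock, and the llnode member are names made up for the example:

    static LLIST_HEAD(my_list);
    static DEFINE_SPINLOCK(my_lock);

    /* Producers: lockless, any number of them. */
    llist_add(&item->llnode, &my_list);

    /* Consumers: serialized among themselves by the spinlock. */
    spin_lock(&my_lock);
    node = llist_del_first(&my_list);
    spin_unlock(&my_lock);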

llist_del_first() provides LIFO semantics for the llist. If your application requires FIFO order, however, there is a useful trick that you can apply, and that's where llist_reverse_order() comes into play. Removing a batch of items with xchg() (as is done with llist_del_all()) does return the batches in FIFO order; it's just that the items within each batch are ordered back to front. The following algorithm then comes to mind:

    struct thing *first, *current_batch;

    if (current_batch == NULL) {
        current_batch = xchg_acquire(&first, NULL);
        ... reverse the order of the nodes in current_batch ...
    }
    node = current_batch;
    current_batch = current_batch->next;

Every execution of the previous pseudocode will return an element of the linked list in FIFO order. This is also a single-consumer data structure, as it assumes that only a single thread accesses current_batch at any given time. It is left as an exercise for the reader to convert the pseudocode to the llist API.
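
For reference, one possible solution could look like this; it is a sketch that assumes struct thing embeds a struct llist_node member named llnode, and it turns first into a struct llist_head:

    /* Single consumer only: current_batch must not be accessed
     * concurrently. */
    static LLIST_HEAD(first);
    static struct llist_node *current_batch;

    struct thing *consume_one(void)
    {
        struct llist_node *node;

        if (!current_batch)
            current_batch = llist_reverse_order(llist_del_all(&first));
        node = current_batch;
        if (!node)
            return NULL;            /* the list was empty */
        current_batch = node->next;
        return llist_entry(node, struct thing, llnode);
    }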

That is all for this installment. The next article in this series will continue exploring read-modify-write operations, how to build them from compare-and-swap, and how they can be put into use to speed up reference-counting operations.

Index entries for this article
Kernel: cmpxchg()
Kernel: Lockless algorithms
GuestArticles: Bonzini, Paolo



Lockless patterns: an introduction to compare-and-swap

Posted Mar 12, 2021 16:22 UTC (Fri) by jcm (subscriber, #18262) [Link] (4 responses)

Worth noting that the compare-and-swap operation might provide both acquire and release semantics, but this doesn't make it equivalent to a full memory barrier on every architecture. The examples in this article carefully accommodate that but don't call it out.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 12, 2021 18:01 UTC (Fri) by jreiser (subscriber, #11027) [Link] (1 responses)

> [not] equivalent to a full memory barrier on every architecture

Please name the two most-prevalent architectures on which compare-and-swap is not equivalent to a full memory barrier.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 0:42 UTC (Sat) by pbonzini (subscriber, #60935) [Link]

If you mean compare-and-swap and not Linux cmpxchg, then ARM, PPC, and RISC-V all have opcodes (or can use LL/SC sequences) that provide relaxed compare and swap.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 0:40 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (1 responses)

On Linux actually cmpxchg does imply a memory barrier. The article is only written in terms of acquire and release because I wanted to introduce the "weaving" of the happens-before relation through all the add-to-list operations; this will come in handy in part 5. C11's atomic_compare_exchange operations do *not* have the stricter reordering semantics of Linux's cmpxchg though.

There are a few other simplifications throughout the articles; they will all be "corrected" in due time.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 18, 2021 21:22 UTC (Thu) by Alan.Stern (subscriber, #12437) [Link]

There's a technical point that's relevant here. In the section where you mention using cmpxchg_release instead of cmpxchg:

"Thread 4 would always read node2 from node3->next, because it read the value that thread 3 wrote to first. However, there would be no happens before edge from thread 1 and thread 2 to thread 4; therefore, thread 4 would need a smp_load_acquire() in order to see node1 in node2->next."

In fact this isn't so. In the Linux kernel memory model, an edge from a store-release to a relaxed load _does_ establish a happens-before relation. What it doesn't do is guarantee that this store or prior ones will be visible to other loads following the first, because there will be nothing to force those loads to execute in order (i.e., there generally is no happens-before relation from a relaxed load to later loads in the same thread). But in threads 2 and 3 there are no loads following the one inside the cmpxchg_release, so this lack doesn't matter. Only thread 4 needs to use a load-acquire -- which it already does: xchg_acquire.

One other thing you might consider mentioning in the last article. Full memory barriers, including those associated with value-returning atomic operations like xchg or cmpxchg, are even stronger than discussed so far. They do more than order all earlier and later memory accesses, both loads and stores, against each other; they also act as global synchronizing operations. That is, whenever two threads both carry out a full memory barrier, there is always a happens-before relation from the barrier that executes first to the one that executes second. To put it another way, if each thread has a full memory barrier between each pair of memory accesses, the whole program will end up obeying the Sequential Consistency model.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 12, 2021 23:15 UTC (Fri) by dgc (subscriber, #6611) [Link] (20 responses)

A couple of important things about cmpxchg algorithms that the article implies but doesn't explicitly mention. It says that cmpxchg is an "atomic read-modify-write operation" but then doesn't explain the scalability limitations this imposes on "lockless" cmpxchg algorithms.

1. cmpxchg is a hardware locked operation rather than a software locked operation. It still bounces cachelines around the machine via the hardware cache coherency protocol, so frequent cmpxchg operations will always reach an update saturation point where CPU usage goes non-linear and the update rate goes backwards.

2. cmpxchg algorithms can be extremely unfair and effectively livelock when subject to high concurrent modification rates because they directly expose the unfairness in hardware cache coherency protocols (i.e. update patterns usually reflect core-to-core latency differences).

Hence a number of important cmpxchg algorithms in the kernel drop out of the cmpxchg() loop after a set number of failures and fall back to spinlocks to avoid the worst costs of these adverse behaviours; e.g., the lockref infrastructure (lockref_get/lockref_put) used by the dentry cache breaks out of cmpxchg loops after 100 consecutive failures.
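
As an illustration, the pattern boils down to something like the following sketch (made up for this comment, not the actual lockref code, which additionally packs the lock and the count into a single word):

    struct counter {
        spinlock_t lock;
        long count;
    };

    static void counter_inc(struct counter *c)
    {
        long old = READ_ONCE(c->count), seen;
        int retries;

        for (retries = 0; retries < 100; retries++) {
            seen = cmpxchg(&c->count, old, old + 1);
            if (seen == old)
                return;             /* lockless fast path */
            old = seen;
        }

        /* Heavy contention: queue behind the lock instead of hammering
         * the cacheline.  Updates still go through cmpxchg(), so no
         * increment can be lost. */
        spin_lock(&c->lock);
        while ((seen = cmpxchg(&c->count, old, old + 1)) != old)
            old = seen;
        spin_unlock(&c->lock);
    }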

As an example of how much work can be done before cacheline contention (cache coherency protocol saturation) starts to be a problem, I'm seeing non-linearity in CPU usage in a hot cmpxchg loop being measurable at about 2.5-3 million updates/s to a single 64 bit variable on its own isolated cacheline on a CPU bound 32-way workload on a 2yr old 2-socket, 32p machine. It's not quite completely saturated yet, but profiles have a single cmpxchg loop starting to consume significant single digit percentages of the entire CPU usage of a workload creating 650,000 files/s.

In contrast, spinlocks that are seeing the same amount of traffic are consuming ~30% of the entire CPU usage of this workload. e.g. the per-sb inode list lock is being trafficked 2.6m times/s, resulting in it burning ~10% CPU (3 whole CPUs) spinning. That's the same atomic cacheline update rate as the hot cmpxchg loop where I'm also seeing CPU usage go non-linear....

IOWs, cmpxchg() algorithms can scale better than locked algorithms, but they do not offer unbounded scalability because they are not truly lockless operations. A couple of decades of writing concurrent algorithms has taught me that scalability is really defined by the frequency with which the concurrent algorithm accesses/updates shared data, not by whether the software algorithm is considered "lockless" or not. A lock is simply a high-frequency atomic access to shared data, hence it's easy to see why locks become a scalability limitation very quickly if you consider them from a CPU cacheline access perspective. It should also be obvious now that cmpxchg() or atomic variables don't offer a huge improvement in fundamental scalability over locks - all they do is allow much finer-grained and hence less frequent access to shared data. It is the reduction in access frequency to shared data that improves scalability, not the use of a "lockless" pattern.

This is the art of concurrent programming - it's not enough just to know what a lockless algorithm is, you need to understand the data access patterns those algorithms result in and when those access patterns are going to become a limitation to the software. Of course, knowing when not to use a lockless algorithm because there are better ways to reduce shared data access is also part of the art... :)

-Dave.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 0:18 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (7 responses)

Hi Dave, thanks very much for this write up. Some of the things you wrote are the main reason why I wasn't sure whether to include linked lists at all. On one hand they are the poster child of lock-free data structures, on the other hand they are substantially less useful than any of the other things I am talking about, for all the reasons that you went into. They do have their place, but they're easy to misuse. For example the few times that I used llist, I did so not because of the lock-free producer side (which seems so great but has fairness/livelock problems if taken to the extreme), but because I knew writes would be rare and I wanted the (single) read side to be as fast as possible. This is not obvious at all.

In the end I included them because they provided me with good running examples, but more than any other patterns in this series they require answering the question of *when* to use lock free algorithms.

I was going to put that into the concluding article (though now I might have to rethink that :), but I will explain my ideas quickly here. What I wanted to "teach" with this series is that, on one hand, you should not fear lockless algorithms because they can be properly distilled into common patterns and abstracted; on the other hand, they are nothing but a tool. What matters the most in almost all cases is maintainability, and then:

* knowing a few well-known and/or well-abstracted patterns can actually make your code *more* maintainable and scalable despite using more advanced techniques. This is where lock-free linked lists fail the worst, because they're far from being the silver bullet that an inexperienced programmer could imagine them to be.

* you should split your code as much as possible into a lock-free part, which will usually be one of these patterns or another abstraction provided by Linux such as lockref, and a lock-based slow path. This makes it easier to figure out the specific technique to use, as the lockless part remains small;

* {single,multiple} {producer,consumer} is a huge part of choosing your lockless technique. Knowing that you have a single producer or a single consumer lets you pick the right pattern and lets you keep the complication of the techniques at bay.

Understanding relative frequency of reads vs writes (or the relative frequency of slow vs fast path) is to some extent a prerequisite since otherwise you won't even know where the scalability problems are. It is a tradeoff that is present everywhere Linux uses lockless patterns (the above personal anecdote with llist is one example but the most common and also extreme case is probably RCU), and it also touches the previous bullets because, for low-enough frequencies, you can just treat the less frequent case as a slow path, use a lock to go from multiple to single producer/consumer, and see what this simplification can do for you in terms of new patterns to use.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 1:45 UTC (Sat) by willy (subscriber, #9762) [Link] (4 responses)

Understanding the behaviour of one's code is essential to choosing the locking pattern to use (lockfree being, of course, one locking pattern).

I've seen several examples of code in the kernel that uses an rwlock_t for protection because one call path reads the data structure. The author didn't stop to think "Is there actual parallelism to be extracted here?" or "What's the cost of an rwlock vs a spinlock?"

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 10:00 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (3 responses)

Another related topic is that of sharding. Splitting the data structure into multiple parts can reduce the number of writes and reads to a single shard, which is beneficial in itself, and it can again turn a multiple-producer or multiple-consumer scenario into one with a single producer and/or a single consumer—without locks this time. That in turn completely changes the tools you can use.
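
For example, Linux's percpu_counter is a ready-made embodiment of the idea; a minimal usage sketch:

    #include <linux/percpu_counter.h>

    static struct percpu_counter events;

    /* After setup, writers touch only their CPU-local shard most of
     * the time. */
    percpu_counter_init(&events, 0, GFP_KERNEL);
    percpu_counter_inc(&events);

    /* Readers pay the cost instead, by summing all of the shards. */
    s64 total = percpu_counter_sum(&events);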

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 15:49 UTC (Sat) by jcm (subscriber, #18262) [Link] (1 responses)

...which of course is literally one of the reasons willy is working on Maple Trees, to remove the locking required in nodes for red-black tree rotation :)

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 16:18 UTC (Sat) by jcm (subscriber, #18262) [Link]

On a total tangent, this makes me wonder how long you could force vmap_area_lock to be held by performing an extreme number of small VMA allocations and then freeing them.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 22:02 UTC (Sat) by dgc (subscriber, #6611) [Link]

Sharding is another of those terms I dislike. It's a catch-all phrase that covers a huge number of different techniques. e.g. a simple per-cpu counter is a "sharded counter". A per-node list (e.g. list_lru.h) is a "sharded list". Splitting an index space up into multiple independent index trees is a "sharded tree". It just doesn't tell you anything about how the algorithm works to maintain global state or any of the downsides to doing global scope operations, just that the data structure has been split up in some way.

Even stuff like optimistic lock coupled trees could come under the "sharded" title, because each node in the tree has its own individual locks or mechanism of lockless atomic updates. (Yes, I really did just say lockless lock coupled trees :) Again, calling such a structure "sharded" does not provide any understanding of how data accesses have been isolated and reduced. Mention OLC and most people with knowledge of scalable tree structures immediately understand how it works, the constraints it works under, how it scales, and when it is desirable to use such a construct.

Oh, there I go again, talking about "scalable" instead of "lockless" or, in this case, "sharding"... :)

-Dave.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 21:43 UTC (Sat) by dgc (subscriber, #6611) [Link] (1 responses)

Hi Paolo, You make some good points about the inherent complexity of lockless algorithms that programmers and engineers need to understand about lockless patterns and when and how to use them. I've been reading your series closely because I might learn something I've missed from it, but I am probably looking at them from a different angle than most readers because I do "scalability" and not "lockless"....

It was with this article that I put my finger on what was missing from the other articles: an explanation of when you _wouldn't_ want to use a particular pattern. Your articles explain the patterns and give examples of how to use them correctly and in a beneficial way, but the downsides of those patterns are not really enumerated. The context you've given here, I think, probably should have been in a leader article explaining what the series of articles is going to teach the reader. Paul is good at doing this. :)

Other than this, I'm finding the articles easy to read and fairly good at explaining complex behaviour without being overwhelming. Keep 'em coming! :)

-Dave.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 14, 2021 9:56 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

Heh, I doubt you'd learn anything new! I do hope that more experienced people get some insights about how to present abstract concepts and their concrete embodiments at the same time, for example happens-before both as a math relation and as a property of APIs.

There's a chicken-and-egg problem between teaching the "what" and the "when/why". My choice here was to first teach the "what" so that people would be able to read others' code, and leave the "when/why" for the end, but I'm sure it was possible to do it otherwise. I read somewhere that you can't teach something well until the fifth time you teach it.

> I do "scalability" and not "lockless"....

Well, everybody should be doing scalability and not lockless. I focused the series on "lockless patterns" because, again, my aim was to teach people how to deal with code written by others. (Fun fact: I had never seen before the TCP code I explained in part 3. I literally grepped for smp_mb and picked a random example. It's too easy to omit details when explaining code you're familiar with, and I wanted to be in the same shoes as the readers).

But lockless (and sharding :)) are just tools that you use to achieve scalability. If you don't need scalability just throw a big lock around everything and call it a day!

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 15:50 UTC (Sat) by PaulMcKenney (subscriber, #9624) [Link]

The other oft-cited benefit of lock-free algorithms is behavior at high CPU load in workloads suffering from frequent preemption. Of course, as you say, livelock is no better than blocking, but carefully designed algorithms can in some cases avoid both. Another alternative is to limit CPU utilization, thus greatly reducing the duration of the preemptions, in turn allowing simpler algorithms to perform well.

But no matter what you do, there can be no question that having your software work reliably under extreme conditions does require extreme care in both development and validation.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 15:51 UTC (Sat) by jcm (subscriber, #18262) [Link] (4 responses)

This is why I'm a fan of load-link/store-conditional in classical RISC. On the other hand, it suffers from complexity too in that you need to build an exclusive monitor that tracks updates in the coherency protocol to any affected lines (and is possibly permitted to only track one line, for example). RISC-V seems to have taken a purist position here.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 13, 2021 17:14 UTC (Sat) by foom (subscriber, #14868) [Link]

While it is nice in theory, LL/SC can often have _worse_ scalability. There's a reason ARMv8.1 added the "LSE" instructions, consisting of cmpxchg and read-modify-write (add, sub, etc.) atomic memory operation instructions. RISC-V has the latter set as well.

With these single instructions, the memory system is able to e.g. send an atomic-add-1-and-store operation over the memory bus, and have that operation run in the memory controller, rather than having to fetch the data to the cpu core, run the add on the cpu, then write it back out. This can greatly increase the ability to scale to higher levels of parallelism.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 14, 2021 11:48 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

Isn't that backwards? LL/SC must always be wrapped in a loop, so it cannot produce wait-free algorithms. Instead, processor-level RMW instructions can be wait-free.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 14, 2021 22:10 UTC (Sun) by roc (subscriber, #30627) [Link] (1 responses)

Hardly anyone cares, but LL/SC has the unfortunate property that in practice you sometimes have to retry even in the absence of contention (e.g. due to a hardware interrupt). That happens to break rr (https://rr-project.org) because it introduces nondeterminism that we can't track.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 15, 2021 19:24 UTC (Mon) by dancol (guest, #142293) [Link]

How does RR address the problem?

Lockless patterns: an introduction to compare-and-swap

Posted Mar 14, 2021 9:30 UTC (Sun) by epa (subscriber, #39769) [Link] (5 responses)

When you talk about cache coherency slowing things down, is it possible that a hardware compare-and-swap operation that never uses the cache, but always goes directly to main memory, could end up being more scalable? (For the sake of argument, suppose that the counter accessed for compare-and-swap is in a special area of memory that can't be accessed by any other instruction.)

Lockless patterns: an introduction to compare-and-swap

Posted Mar 14, 2021 11:05 UTC (Sun) by excors (subscriber, #95769) [Link]

I think that's what GPUs usually do. There are multiple L1 caches and a single shared L2 cache, and no cache coherency. Atomic operations on global memory have to be sent to the L2 controller, which can typically handle one atomic operation per cycle to a single address (and perhaps multiple per cycle if spread over multiple addresses in a single cache line). Operations include cmpxchg, add/sub, min/max, and/or/xor, and sometimes even floating-point addition, performed by ALUs inside the L2 controller.

The downside is that in the uncontended case where only a single thread is accessing the address, every atomic operation that returns a value (like cmpxchg) will have hundreds of cycles of latency. And if you have multiple logically-independent threads each accessing a different address in a different cache line, they can't execute independently - they all have to go to the L2 controller and create contention there. But it scales well when you have hundreds of threads contending for the same address.

(GPUs also have local memory (closer to the L1 level) and a separate implementation of atomics for that. If you're doing a big reduction computation (e.g. finding the sum of a big array), you should do it hierarchically - each thread would reduce some input elements by simply looping and using registers, then a group of threads would merge their results using local memory atomics, then a group of groups would merge their results using global memory atomics. That way you can have a huge number of threads and still avoid L2 contention.)

Lockless patterns: an introduction to compare-and-swap

Posted Mar 17, 2021 5:15 UTC (Wed) by dgc (subscriber, #6611) [Link] (3 responses)

> When you talk about cache coherency slowing things down, is it possible that a
> hardware compare-and-swap operation never using the cache, but always going
> directly to main memory, could end up being more scalable?

Even if you have no caches, you still need a hardware mechanism that guarantees atomic, exclusive access to that memory when multiple CPUs all try to read/modify it at the same time. You can't have two CPUs perform concurrent atomic cmpxchg operations on the same memory and have both succeed - one has to fail and retry when the other succeeds. So even if you have no caches, you still need hardware-level memory access coordination protocols....

-Dave.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 19, 2021 8:13 UTC (Fri) by epa (subscriber, #39769) [Link] (2 responses)

I suppose the logical next step is a special kind of RAM that implements compare-and-swap as an atomic operation in hardware.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 19, 2021 11:38 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

I don't think there's any need to implement CAS in RAM directly, because there's already a memory controller in front of the RAM that has to coordinate and serialise memory accesses from multiple CPUs, and you can put the atomic logic there instead. And that's already supported by some CPUs: see e.g. AXI (https://developer.arm.com/documentation/ihi0022/latest) and CHI (https://developer.arm.com/documentation/ihi0050/latest), which are ARM's bus protocols for connecting the various components on a SoC (CPU clusters, GPU, memory controllers, etc). They define "atomic transactions" including AtomicCompare (CAS of up to 16 bytes). A CPU can send an AtomicCompare packet to the memory controller, which will execute it atomically on DRAM and send the result back, avoiding all the costs of cache coherency. If two CPUs do a simultaneous AtomicCompare, the memory controller will just buffer one transaction for a few cycles until the other has finished.

Lockless patterns: an introduction to compare-and-swap

Posted Mar 19, 2021 12:36 UTC (Fri) by foom (subscriber, #14868) [Link]

While this certainly helps scalability significantly, the second cmpxchg operation to execute will still necessarily fail and require a retry, because its "compare" value is out of date.

(Where the performance really shines is the operations that can be fully implemented in the memory controller, such as atomic fetch-and-add/sub/or/and/etc, because these cannot fail.)


Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds