Lockless patterns: an introduction to compare-and-swap
In the first part of this series, I showed you the theory behind concurrent memory models and how that theory can be applied to simple loads and stores. However, loads and stores alone are not a practical tool for building higher-level synchronization primitives such as spinlocks, mutexes, and condition variables. Even though it is possible to synchronize two threads using the full memory-barrier pattern that was introduced last week (Dekker's algorithm), modern processors provide a way that is easier, more generic, and faster—yes, all three of them—the compare-and-swap operation.
From the point of view of a Linux kernel programmer, compare-and-swap has the following prototype:
    T cmpxchg(T *ptr, T old, T new);
where T can be either an integer type that is at most as wide as a pointer, or a pointer type. In order to support such polymorphism, cmpxchg() is defined as a macro rather than a function, but the macro is written carefully to avoid evaluating its arguments multiple times. Linux also has a cmpxchg64() macro that takes 64-bit integers as the arguments, but it may not be available on all 32-bit platforms.
cmpxchg() loads the value pointed to by ptr and, if it is equal to old, it stores new in its place. Otherwise, no store happens. The value that was loaded is then returned, regardless of whether it matched old or not. The compare and the store are atomic: if the store is performed, you are guaranteed that no thread could sneak in and write a value other than old to *ptr. Because a single operation provides the old version of the value and stores a new one, compare-and-swap is said to be an atomic read-modify-write operation.
In Linux, the cmpxchg() macro puts strong ordering requirements on the surrounding code. A compare-and-swap operation comprises a load and a store; for the sake of this article, you can consider them to be, respectively, load-acquire and store-release operations. This means that cmpxchg() can synchronize with both load-acquire and store-release operations performed on the same location by other threads.
Lock-free stacks and queues
"Lockless algorithms for mere mortals" already mentioned the use of compare-and-swap for lock-free lists. Here, we'll look at how a lockless, singly linked list could be implemented in C, and what it could be useful for. First of all, however, let's recap how a single-threaded C program would add an item in front of a singly-linked list:
    struct thing {
        struct thing *next;
        ...
    };
    struct thing *first;

    node->next = first;
    first = node;
Armed with the knowledge from the first part of the series, we know that we should turn the assignment to first into a store-release, so that node->next is visible to other threads doing a load-acquire. This would be an instance of the pattern presented there.
However, that pattern only worked for a single producer and a single consumer; in the presence of multiple producers, the two instructions would have to be placed under a lock. This is because the value of first can change between the two instructions, for example if another element is added at the same time by another thread. If that happens, the outgoing pointer (node->next) in the new element will point to whatever first held before the assignment happened. This teaches us an important, if obvious, lesson: acquire and release semantics are just one part of designing and proving the correctness of lockless algorithms. Logic mistakes and race conditions can and will still happen.
Instead of using a lock, cmpxchg() lets us catch the other thread in the act of modifying first. Something like this would work for any number of producers:
    if (cmpxchg(&first, node->next, node) == node->next)
        /* yay! */
    else
        /* now what? */
There are still a few things to sort out, as you can see. First and foremost, what to do if cmpxchg() notices that first has changed. The answer in that case is simply to read the new value of first into node->next and try again. This is possible because node is still invisible to other threads; nobody will notice our stroke of bad luck.
A second and more subtle question is: how do we load first? The load need not have either acquire or release semantics, because the code is not doing other memory accesses that depend on the value of first. On the other hand, perhaps the big bad optimizing compiler might think that first cannot change across iterations of the loop? Even though Linux's cmpxchg() does prevent this kind of compiler optimization, it is a good practice to mark relaxed loads and stores of shared memory using READ_ONCE() and WRITE_ONCE().
Putting everything together, we get:
    struct thing *old, *expected;

    old = READ_ONCE(first);
    do {
        node->next = expected = old;
        old = cmpxchg(&first, expected, node);
    } while (old != expected);
This is all nice, but it's only half of the story. We still have not seen how the list can be read on the consumer side. The answer is that it depends on the relationship between producers and consumers, the number of consumers, and whether the consumers are interested in accessing elements in LIFO (last-in-first-out) or FIFO (first-in-first-out) order.
First of all, it could be that all reads happen after the producers have finished running. In this case, the synchronization between producers and consumers happens outside the code that manipulates the list, and the consumers can access the list through normal, non-atomic loads. The synchronization mechanism could be a thread-exit/thread-join pair such as the one we saw in the first article, for example.
If reads are rare or can be batched, a trickier implementation could allow producers to proceed locklessly, while reads would be serialized. Such an implementation could use a reader-writer lock (rwlock); however, the producers would take the lock for shared access (with a read_lock() call) and the consumer(s) would take the lock for exclusive access (with write_lock())! This would also prevent reads from executing concurrently with writes and, therefore, the consumer would be able to employ non-atomic loads. Hopefully, this example shows that there's no such thing as too many comments or too much documentation, even if you're sticking to the most common lockless programming patterns.
If many consumers run concurrently with the producers, but they can consume the elements in any order, the consumers can obtain a whole batch of elements (removing them from the list) with a single instruction:
my_things = xchg_acquire(&first, NULL);
xchg(), like cmpxchg(), performs an atomic combination of a read and a write to a memory location. In this case it returns the previous head of the list and writes NULL in its place, thus emptying the list. Here I am using the xchg_acquire() variant, which has acquire semantics for its load of first, but (just like WRITE_ONCE()) does not apply release semantics when it stores NULL. Acquire semantics suffice here, since this is still basically the same store-release/load-acquire pattern from part 1. More precisely, it is a multi-producer, multi-consumer extension of that pattern.
Should we do the same on the writer side and replace cmpxchg() with cmpxchg_release()? Indeed we could: in principle, all that the writer needs is to publish the store of node->next to the outside world. However, cmpxchg()'s acquire semantics when loading the list head have a useful side effect: they synchronize each writer with the thread that wrote the previous element. In the following picture, the load-acquire and store-release operations are all part of a successful series of cmpxchg() calls:
    thread 1: load-acquire first (returns NULL)
              store-release node1 into first
                   \
    thread 2:       load-acquire first (returns node1)
                    store-release node2 into first
                         \
    thread 3:             load-acquire first (returns node2)
                          store-release node3 into first
                               \
    thread 4:                   xchg-acquire first (returns node3)
Thread 3's cmpxchg() is the only one to synchronize with thread 4's xchg_acquire(). However, because of transitivity, all cmpxchg()s happen before the xchg_acquire(). Therefore, if cmpxchg() is used in the writers, the readers can go through the list with regular loads.
If, instead, the writers used cmpxchg_release(), the happens-before relation would look like this:
    thread 1: load-acquire first (returns NULL)
              store-release node1 into first
    thread 2: load first (returns node1)
              store-release node2 into first
    thread 3: load first (returns node2)
              store-release node3 into first
                   \
    thread 4:       xchg-acquire first (returns node3)
Thread 4 would always read node2 from node3->next, because it read the value that thread 3 wrote to first. However, there would be no happens-before edge from thread 1 and thread 2 to thread 4; therefore, thread 4 would need a smp_load_acquire() in order to see node1 in node2->next.
The above data structure is already implemented in Linux's linux/llist.h header. You're highly encouraged not to reinvent the wheel and to use that version instead, of course. That API, in fact, includes two more interesting functions: llist_del_first() and llist_reverse_order().
llist_del_first() returns the first element of the llist and advances the head pointer to the second element. Its documentation warns that it should only be used if there is a single reader. If, instead, there were two consumers, an intricate sequence of adds and deletes could lead to the so-called ABA problem. Since this article rests firmly on the principle of "if it hurts, don't do it", a detailed explanation is beyond its scope. However, it's worth pointing out the similarity with the earlier rwlock example. Just as in that case, multiple consumers will have to use locking to serialize concurrent access to the data structure. llist_del_first() still lets writers call llist_add() without taking a lock at all, while readers can use a spinlock or a mutex.
llist_del_first() provides LIFO semantics for the llist. If your application requires FIFO order, however, there is a useful trick that you can apply, and that's where llist_reverse_order() comes into play. Removing a batch of items with xchg() (as is done with llist_del_all()) does provide the batches in FIFO order; it's only the items within each batch that are ordered back to front. The following algorithm then comes to mind:
    struct thing *first, *current_batch;

    if (current_batch == NULL) {
        current_batch = xchg_acquire(&first, NULL);
        ... reverse the order of the nodes in current_batch ...
    }
    node = current_batch;
    current_batch = current_batch->next;
Every execution of the previous pseudocode will return an element of the linked list in FIFO order. This is also a single-consumer data structure, as it assumes that only a single thread accesses current_batch at any given time. It is left as an exercise for the reader to convert the pseudocode to the llist API.
That is all for this installment. The next article in this series will continue exploring read-modify-write operations, how to build them from compare-and-swap, and how they can be put to use to speed up reference-counting operations.
Index entries for this article
Kernel: cmpxchg()
Kernel: Lockless algorithms
GuestArticles: Bonzini, Paolo
Lockless patterns: an introduction to compare-and-swap
Posted Mar 12, 2021 16:22 UTC (Fri) by jcm (subscriber, #18262)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 12, 2021 18:01 UTC (Fri) by jreiser (subscriber, #11027)

> [not] equivalent to a full memory barrier on every architecture
Please name the two most-prevalent architectures on which compare-and-swap is not equivalent to a full memory barrier.
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 0:42 UTC (Sat) by pbonzini (subscriber, #60935)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 0:40 UTC (Sat) by pbonzini (subscriber, #60935)
There are a few other simplifications throughout the articles, they will all be "corrected" in due time.
Lockless patterns: an introduction to compare-and-swap
Posted Mar 18, 2021 21:22 UTC (Thu) by Alan.Stern (subscriber, #12437)
"Thread 4 would always read node2 from node3->next, because it read the value that thread 3 wrote to first. However, there would be no happens before edge from thread 1 and thread 2 to thread 4; therefore, thread 4 would need a smp_load_acquire() in order to see node1 in node2->next."
In fact this isn't so. In the Linux kernel memory model, an edge from a store-release to a relaxed load _does_ establish a happens-before relation. What it doesn't do is guarantee that this store or prior ones will be visible to other loads following the first, because there will be nothing to force those loads to execute in order (i.e., there generally is no happens-before relation from a relaxed load to later loads in the same thread). But in threads 2 and 3 there are no loads following the one inside the cmpxchg_release, so this lack doesn't matter. Only thread 4 needs to use a load-acquire -- which it already does: xchg_acquire.
One other thing you might consider mentioning in the last article. Full memory barriers, including those associated with value-returning atomic operations like xchg or cmpxchg, are even stronger than discussed so far. They do more than order all earlier and later memory accesses, both loads and stores, against each other; they also act as global synchronizing operations. That is, whenever two threads both carry out a full memory barrier, there is always a happens-before relation from the barrier that executes first to the one that executes second. To put it another way, if each thread has a full memory barrier between each pair of memory accesses, the whole program will end up obeying the Sequential Consistency model.
Lockless patterns: an introduction to compare-and-swap
Posted Mar 12, 2021 23:15 UTC (Fri) by dgc (subscriber, #6611)
1. cmpxchg is a hardware locked operation rather than a software locked operation. It still bounces cachelines around the machine via the hardware cache coherency protocol, so frequent cmpxchg operations will always reach an update saturation point where CPU usage goes non-linear and the update rate goes backwards.
2. cmpxchg algorithms can be extremely unfair and effectively livelock when subject to high concurrent modification rates because they directly expose the unfairness in hardware cache coherency protocols (i.e. update patterns usually reflect core-to-core latency differences).
Hence a number of important cmpxchg algorithms in the kernel drop out of the cmpxchg() loop after a set number of failures and fall back to spin locks to avoid the worst costs of these adverse behaviours. e.g. the lockref infrastructure (lockref_get/lockref_put) used by the dentry cache breaks out of cmpxchg loops after 100 consecutive failures.
As an example of how much work can be done before cacheline contention (cache coherency protocol saturation) starts to be a problem, I'm seeing non-linearity in CPU usage in a hot cmpxchg loop being measurable at about 2.5-3 million updates/s to a single 64-bit variable on its own isolated cacheline on a CPU-bound 32-way workload on a 2yr old 2-socket, 32p machine. It's not quite completely saturated yet, but profiles have a single cmpxchg loop starting to consume significant single-digit percentages of the entire CPU usage of a workload creating 650,000 files/s.
In contrast, spinlocks that are seeing the same amount of traffic are consuming ~30% of the entire CPU usage of this workload. e.g. the per-sb inode list lock is being trafficked 2.6m times/s, resulting in it burning ~10% CPU (3 whole CPUs) spinning. That's the same atomic cacheline update rate as the hot cmpxchg loop where I'm also seeing CPU usage go non-linear....
IOWs, cmpxchg() algorithms can scale better than locked algorithms, but they do not offer unbounded scalability because they are not truly lockless operations. A couple of decades of writing concurrent algorithms has taught me that scalability is really defined by the frequency with which the concurrent algorithm accesses/updates shared data, not by whether the software algorithm is considered "lockless" or not. A lock is simply a high-frequency atomic access to shared data, hence it's easy to see why locks become a scalability limitation very quickly if you consider them from a CPU cacheline access perspective. It should also be obvious now that cmpxchg() or atomic variables don't offer a huge improvement in fundamental scalability over locks - all they do is allow much finer-grained and hence less frequent access to shared data. It is the reduction in access frequency to shared data that improves scalability, not the use of a "lockless" pattern.
This is the art of concurrent programming - it's not enough just to know what a lockless algorithm is, you need to understand the data access patterns those algorithms result in and when those access patterns are going to become a limitation to the software. Of course, knowing when not to use a lockless algorithm because there are better ways to reduce shared data access is also part of the art... :)
-Dave.
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 0:18 UTC (Sat) by pbonzini (subscriber, #60935)
In the end I included them because they provided me with good running examples, but more than any other patterns in this series they require answering the question of *when* to use lock free algorithms.
I was going to put that into the concluding article (though now I might have to rethink that :), but I will quickly explain my ideas here. What I wanted to "teach" with this series is that, on one hand, you should not fear lockless algorithms because they can be properly distilled into common patterns and abstracted; on the other hand, they are nothing but a tool. What matters the most in almost all cases is maintainability, and then:
* knowing a few well-known and/or well-abstracted patterns can actually make your code *more* maintainable and scalable despite using more advanced techniques. This is where lock-free linked lists fail the worst, because they're far from being the silver bullet that an inexperienced programmer could imagine them to be.
* you should split your code as much as possible into a lock-free part, which will usually be one of these patterns or another abstraction provided by Linux such as lockref, and a lock-based slow path. This makes it easier to figure out the specific technique to use, as the lockless part remains small;
* {single,multiple} {producer,consumer} is a huge part of choosing your lockless technique. Knowing that you have a single producer or a single consumer lets you pick the right pattern and lets you keep the complication of the techniques at bay.
Understanding relative frequency of reads vs writes (or the relative frequency of slow vs fast path) is to some extent a prerequisite since otherwise you won't even know where the scalability problems are. It is a tradeoff that is present everywhere Linux uses lockless patterns (the above personal anecdote with llist is one example but the most common and also extreme case is probably RCU), and it also touches the previous bullets because, for low-enough frequencies, you can just treat the less frequent case as a slow path, use a lock to go from multiple to single producer/consumer, and see what this simplification can do for you in terms of new patterns to use.
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 1:45 UTC (Sat) by willy (subscriber, #9762)
I've seen several examples of code in the kernel that uses an rwlock_t for protection because one call path reads the data structure. The author didn't stop to think "Is there actual parallelism to be extracted here?" or "What's the cost of an rwlock vs a spinlock?"
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 10:00 UTC (Sat) by pbonzini (subscriber, #60935)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 15:49 UTC (Sat) by jcm (subscriber, #18262)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 16:18 UTC (Sat) by jcm (subscriber, #18262)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 22:02 UTC (Sat) by dgc (subscriber, #6611)
Even stuff like optimistic lock coupled trees could come under the "sharded" title, because each node in the tree has its own individual locks or mechanism of lockless atomic updates. (Yes, I really did just say lockless lock coupled trees :) Again, calling such a structure "sharded" does not provide any understanding of how data accesses have been isolated and reduced. Mention OLC and most people with knowledge of scalable tree structures immediately understand how it works, the constraints it works under, how it scales, and when it is desirable to use such a construct.
Oh, there I go again, talking about "scalable" instead of "lockless" or, in this case, "sharding"... :)
-Dave.
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 21:43 UTC (Sat) by dgc (subscriber, #6611)
It was with this article that I put my finger on what was missing from the other articles: an explanation of when you _wouldn't_ want to use a particular pattern. Your articles explain the patterns and give examples of how to use them correctly and in a beneficial way, but the downsides of those patterns are not really enumerated. The context you've given here, I think, probably should have been in a leader article explaining what the series of articles is going to teach the reader. Paul is good at doing this. :)
Other than this, I'm finding the articles easy to read and fairly good at explaining complex behaviour without being overwhelming. Keep 'em coming! :)
-Dave.
Lockless patterns: an introduction to compare-and-swap
Posted Mar 14, 2021 9:56 UTC (Sun) by pbonzini (subscriber, #60935)
There's a chicken-and-egg problem between teaching the "what" and the "when/why". My choice here was to first teach the "what" so that people would be able to read others' code, and leave the "when/why" for the end, but I'm sure it was possible to do it otherwise. I read somewhere that you can't teach something well until the fifth time you teach it.
> I do "scalability" and not "lockless"....
Well, everybody should be doing scalability and not lockless. I focused the series on "lockless patterns" because, again, my aim was to teach people how to deal with code written by others. (Fun fact: I had never seen before the TCP code I explained in part 3. I literally grepped for smp_mb and picked a random example. It's too easy to omit details when explaining code you're familiar with, and I wanted to be in the same shoes as the readers).
But lockless (and sharding :)) are just tools that you use to achieve scalability. If you don't need scalability just throw a big lock around everything and call it a day!
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 15:50 UTC (Sat) by PaulMcKenney (subscriber, #9624)
But no matter what you do, there can be no question that having your software work reliably under extreme conditions does require extreme care in both development and validation.
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 15:51 UTC (Sat) by jcm (subscriber, #18262)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 13, 2021 17:14 UTC (Sat) by foom (subscriber, #14868)
With these single instructions, the memory system is able to e.g. send an atomic-add-1-and-store operation over the memory bus, and have that operation run in the memory controller, rather than having to fetch the data to the cpu core, run the add on the cpu, then write it back out. This can greatly increase the ability to scale to higher levels of parallelism.
Lockless patterns: an introduction to compare-and-swap
Posted Mar 14, 2021 11:48 UTC (Sun) by pbonzini (subscriber, #60935)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 14, 2021 22:10 UTC (Sun) by roc (subscriber, #30627)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 15, 2021 19:24 UTC (Mon) by dancol (guest, #142293)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 14, 2021 9:30 UTC (Sun) by epa (subscriber, #39769)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 14, 2021 11:05 UTC (Sun) by excors (subscriber, #95769)
The downside is that in the uncontended case where only a single thread is accessing the address, every atomic operation that returns a value (like cmpxchg) will have hundreds of cycles of latency. And if you have multiple logically-independent threads each accessing a different address in a different cache line, they can't execute independently - they all have to go to the L2 controller and create contention there. But it scales well when you have hundreds of threads contending for the same address.
(GPUs also have local memory (closer to the L1 level) and a separate implementation of atomics for that. If you're doing a big reduction computation (e.g. finding the sum of a big array), you should do it hierarchically - each thread would reduce some input elements by simply looping and using registers, then a group of threads would merge their results using local memory atomics, then a group of groups would merge their results using global memory atomics. That way you can have a huge number of threads and still avoid L2 contention.)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 17, 2021 5:15 UTC (Wed) by dgc (subscriber, #6611)
> hardware compare-and-swap operation never using the cache, but always going
> directly to main memory, could end up being more scalable?
Even if you have no caches, you still need a hardware mechanism that guarantees atomic, exclusive access to that memory when multiple CPUs all try to read/modify it at the same time. You can't have two CPUs perform concurrent atomic cmpxchg operations on the same memory and have both succeed - one has to fail and retry when the other succeeds. So even if you have no caches, you still need hardware-level memory access coordination protocols....
-Dave.
Lockless patterns: an introduction to compare-and-swap
Posted Mar 19, 2021 8:13 UTC (Fri) by epa (subscriber, #39769)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 19, 2021 11:38 UTC (Fri) by excors (subscriber, #95769)
Lockless patterns: an introduction to compare-and-swap
Posted Mar 19, 2021 12:36 UTC (Fri) by foom (subscriber, #14868)
(Where the performance really shines is the operations that can be fully implemented in the memory controller, such as atomic fetch-and-add/sub/or/and/etc, because these cannot fail.)