
Lockless patterns: relaxed access and partial memory barriers

February 26, 2021

This article was contributed by Paolo Bonzini


The first article in this series provided an introduction to lockless algorithms and the happens before relationship that allows us to reason about them. The next step is to look at the concept of a "data race" and the primitives that exist to prevent data races. We continue in that direction with a look at relaxed accesses, memory barriers, and how they can be used to implement the kernel's seqcount mechanism.

Memory barriers are an old acquaintance for some Linux kernel programmers. The first document vaguely resembling a specification of what one could expect from concurrent accesses to data in the kernel is, in fact, called memory-barriers.txt. That document describes many kinds of memory barriers, along with the expectations that Linux has concerning the properties of data and control dependencies. It also describes "memory-barrier pairing"; this could be seen as a cousin of release-acquire pairing, in that it also helps create cross-thread happens before edges.

This article will not go into the same excruciating detail as memory-barriers.txt. Instead, we'll look at how barriers compare with the acquire and release model and how they simplify or enable the implementation of the seqcount primitive. Nevertheless, one article will not be enough to cover even the most common memory-barrier patterns, so full memory barriers will have to wait for the next installment.

Data races, relaxed accesses, and memory barriers

The concept of a data race, as presented here, was first introduced in C++11 and, since then, has been applied to various other languages, notably C11 and Rust. These language standards are quite strict in how to approach lockless access to data structures, and they introduce specific atomic load and atomic store primitives to do so.

A data race occurs when two accesses are concurrent (i.e., not ordered by the happens before relation), at least one of them is a store, and at least one of them is not using atomic load or store primitives. Whenever a data race occurs, the result (according to C11/C++11) is undefined behavior, meaning that anything is allowed to happen. Avoiding data races does not mean that your algorithm is free of "race conditions": data races are violations of the language standards, while race conditions are bugs that stem from incorrect locking, incorrect acquire/release semantics, or both.

Data races and the consequent undefined behavior are easy to avoid, however. Assuming that one does want to store to a shared data location, there are two ways to do so safely. The first is to ensure that all accesses to that location are ordered by the happens before relation, using any pair of acquire and release operations of your choice; the second is to annotate the loads and stores as atomic.
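As a minimal illustration (a sketch, not taken from the article), consider a flag that one thread sets while another polls it:

    thread 1                          thread 2
    -----------------------------     ------------------------
    flag = 1;                         while (flag == 0)   /* data race! */
                                          ;

    WRITE_ONCE(flag, 1);              while (READ_ONCE(flag) == 0)
                                          ;

The first version is a data race: a plain store is concurrent with plain loads, and the compiler may, among other things, hoist the load out of the loop. The second version annotates both accesses, so there is no data race; note, however, that it still implies no ordering for any other memory accesses.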

C11, C++11, and Rust all provide various memory orderings that the programmer can use for their loads and stores; the three that we are interested in are acquire (to be used with loads), release (for stores), and relaxed (for both). Acquire and release should be self-explanatory by now, and Linux provides the same concept in its smp_load_acquire() and smp_store_release() operations. Relaxed operations, instead, do not provide any cross-thread ordering; a relaxed operation does not create a happens before relationship. Instead, these operations have essentially no purpose other than to avoid data races and the undefined behavior associated with them.
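For reference, here is how those three orderings look in portable C11 atomics; this sketch is mine rather than the article's, but the correspondence with the Linux primitives is the standard one:

    #include <stdatomic.h>

    atomic_int flag;

    /* like smp_store_release(&flag, 1) */
    atomic_store_explicit(&flag, 1, memory_order_release);

    /* like v = smp_load_acquire(&flag) */
    int v = atomic_load_explicit(&flag, memory_order_acquire);

    /* like WRITE_ONCE(flag, 1) and READ_ONCE(flag) */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    int w = atomic_load_explicit(&flag, memory_order_relaxed);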

In practice, Linux expects both the compiler and the CPU to allow a bit more leeway than the language standards do. In particular, the kernel expects that regular loads and stores will not trigger undefined behavior just because there is a concurrent store. However, the value that is loaded or stored in such situations is still not well defined and may well be garbage. For example, the result could include parts of an old value and parts of a new value; this means that, at the very least, dereferencing pointer values loaded from a data race is generally a bad idea.

In addition, regular loads and stores are subject to compiler optimizations, which can produce surprises of their own. Therefore, the idea of a relaxed-ordering — but guaranteed atomic — memory operation is useful in Linux too; this is what the READ_ONCE() and WRITE_ONCE() macros provide. The remainder of this series will always use READ_ONCE() and WRITE_ONCE() explicitly, which nowadays is considered good practice by Linux developers.
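Stripped of the kernel's type-checking details, the two macros boil down to volatile accesses; this simplified sketch conveys the idea:

    #define READ_ONCE(x)        (*(const volatile typeof(x) *)&(x))
    #define WRITE_ONCE(x, val)  (*(volatile typeof(x) *)&(x) = (val))

The volatile cast forces the compiler to emit exactly one load or one store, in program order, without caching the value in a register or splitting the access (for aligned, machine-word-sized objects).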

These macros already appeared in an example from part 1:

    thread 1                          thread 2
    -----------------------------     ------------------------
    a.x = 1;
    smp_wmb();
    WRITE_ONCE(message, &a);          datum = READ_ONCE(message);
                                      smp_rmb();
                                      if (datum != NULL)
                                        printk("%x\n", datum->x);

They are used in a similar way to smp_load_acquire() and smp_store_release(), but their first argument is the target of the assignment (an lvalue) rather than a pointer. Unless other mechanisms ensure that the result of a data race is thrown away, it is highly recommended to use READ_ONCE() and WRITE_ONCE() to load and store shared data outside a lock. Typically, relaxed atomics are used together with some other primitive or synchronization mechanism that has release and acquire semantics; that "something else" will order the relaxed writes against reads of the same memory location.

Suppose, for example, that you have multiple work items that fill certain elements of an array with ones; whoever spawned the work items will only read the array after calling flush_work(). Similar to pthread_join(), flush_work() has acquire semantics and synchronizes with the end of the work item; flush_work() guarantees that reading the array will happen after the completion of the work items, and the array can be read with regular loads. However, if multiple, concurrent work items can store into the same array element, they must use WRITE_ONCE(a[x], 1) rather than just a[x] = 1.
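In code, the scenario might look like this; the fill_item structure and the sixteen-element array are made up for the example:

    struct fill_item {
        struct work_struct work;
        int index;
    };
    static int a[16];

    static void fill_fn(struct work_struct *work)
    {
        struct fill_item *f = container_of(work, struct fill_item, work);

        /* other work items may store to the same element concurrently */
        WRITE_ONCE(a[f->index], 1);
    }

    /* the spawner, after queuing all the work items: */
    int total = 0;
    flush_work(&f->work);        /* one call per item: acquire semantics */
    for (int i = 0; i < 16; i++)
        total += a[i];           /* plain loads are fine here */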

A more complicated case occurs when the release and acquire semantics are provided by a memory barrier. In order to explain this case we will use the practical example of seqcounts.

Seqcounts

Seqcounts are a specialized primitive that allows a consumer to detect that a data structure changed in the middle of the consumer's access. While they are only usable in special cases (small amount of data being protected, no side effects within read-side critical sections, and writes being quick and relatively rare), they have various interesting properties: in particular, readers do not starve writers and the writer can keep ownership of the cache line that holds the seqcount. Both properties make seqcounts particularly interesting when scalability is important.

Seqcounts are a single-producer, multiple-consumer primitive. In order to avoid multiple concurrent writers, they are usually combined with a spinlock or mutex, forming the familiar Linux seqlock_t type. Outside the kernel, however, you'll sometimes see seqcounts themselves referred to as seqlocks.
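With the combined type, a writer and a reader look like this (a sketch of the usual kernel API):

    seqlock_t lock;
    struct { int x, y; } data, copy;

    /* writer: the embedded spinlock serializes concurrent writers */
    write_seqlock(&lock);
    data.x = 123;
    data.y = 456;
    write_sequnlock(&lock);

    /* reader: retry if a write was in progress or intervened */
    unsigned int seq;
    do {
        seq = read_seqbegin(&lock);
        copy = data;
    } while (read_seqretry(&lock, seq));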

Seqcounts are effectively a generation count, where the generation number is odd if and only if a write is in progress. Whenever the generation number was odd at the beginning of a read-side critical section, or it changed during the read-side critical section, the reader has accessed potentially inconsistent state and must retry the read. For a seqcount to work correctly, the reader must correctly detect the beginning and the end of the write. This requires two load-acquire and two store-release operations; here is how one might write a seqcount reader and writer without any wrapper APIs:

    thread 1 (buggy writer)             thread 2 (buggy reader)
    --------------------------------    ------------------------
    WRITE_ONCE(sc, sc + 1);             do {
    smp_store_release(&data.x, 123);        old = smp_load_acquire(&sc) & ~1;
    WRITE_ONCE(data.y, 456);                copy.y = READ_ONCE(data.y);
    smp_store_release(&sc, sc + 1);         copy.x = smp_load_acquire(&data.x);
                                        } while(READ_ONCE(sc) != old);

This code is similar to the "message passing" pattern shown in the first part of the series. There are two pairs of load-acquire and store-release operations, one set for sc and one for data.x. It is not even that hard to show why both load-acquire/store-release pairs are necessary:

  • For thread 2 to exit the loop, the first read of sc must see the even value that was written by the second store to sc in thread 1. If this happens, smp_store_release() and smp_load_acquire() ensure that the stores to the data fields will be visible.
  • If the store to data.x is visible when thread 2 reads it, smp_store_release() and smp_load_acquire() ensure that thread 2 will see (at least) the first generation-count update. Thus, thread 2 will either loop or, if it also sees the second update, retrieve a consistent copy of the new data as described above.

However, the code has a bug! Because the writer has no acquire operation at the top of the sequence, the write to data.y might execute before writing the odd value to sc. [Note: the article was updated on March 2nd to point out this problem]. Using load-acquire/store-release for all fields would sidestep the issue, but one wonders if this would be putting lipstick on a pig. And in fact it is possible to do much better.

The first article showed that older Linux code may use smp_wmb() followed by WRITE_ONCE() rather than smp_store_release(); likewise, instead of smp_load_acquire(), sometimes READ_ONCE() is followed by smp_rmb(). These partial barriers create specific types of happens before relations. Specifically (but informally), smp_wmb() turns all the following relaxed stores into release operations and smp_rmb() turns all the preceding relaxed loads into acquire operations.

Let's try to apply this replacement to the data.x accesses:

    thread 1 (writer)                   thread 2 (reader)
    ------------------------------      ------------------------
    // write_seqcount_begin(&sc)        do {
    WRITE_ONCE(sc, sc + 1);                 // read_seqcount_begin(&sc)
    smp_wmb();                              old = smp_load_acquire(&sc) & ~1;
    WRITE_ONCE(data.x, 123);                copy.y = READ_ONCE(data.y);
    WRITE_ONCE(data.y, 456);                copy.x = READ_ONCE(data.x);
    // write_seqcount_end(&sc)              // read_seqcount_retry(&sc, old)
    smp_store_release(&sc, sc + 1);         smp_rmb();
                                        } while(READ_ONCE(sc) != old);

Leaving aside the question of how the barriers work, this version already stands a much better chance of being wrapped in an easy-to-use API. Data is accessed entirely with relaxed atomic loads and stores (though in the Linux kernel memory model non-atomic accesses would be acceptable too), and the barriers could be hidden within the seqcount primitives read_seqcount_retry() and write_seqcount_begin().
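For example, the wrappers could be written along the following lines; this is a sketch matching the code above, not the mainline implementation, which differs in some details:

    static inline void write_seqcount_begin(unsigned int *sc)
    {
        WRITE_ONCE(*sc, *sc + 1);        /* count becomes odd */
        smp_wmb();                       /* order against the data stores */
    }

    static inline void write_seqcount_end(unsigned int *sc)
    {
        smp_store_release(sc, *sc + 1);  /* count becomes even again */
    }

    static inline unsigned int read_seqcount_begin(unsigned int *sc)
    {
        return smp_load_acquire(sc) & ~1;
    }

    static inline bool read_seqcount_retry(unsigned int *sc, unsigned int old)
    {
        smp_rmb();                       /* order against the data loads */
        return READ_ONCE(*sc) != old;
    }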

The barriers inserted above split the reads and writes into two separate groups; this ensures the safety of seqcount accesses. However, there are two complications:

  • First, the barriers do not impose an order among the relaxed accesses themselves. It is possible that thread 2 sees the update to data.y and not the update to data.x. This is not a problem for seqcounts, because the check on sc forces thread 2 to retry in case it saw only some of the stores.
  • Second, the barriers are weaker than load-acquire and store-release operations. A read with smp_load_acquire() happens before any loads and stores that follow it, and likewise smp_store_release() happens after not just preceding stores, but also after preceding loads. Instead, for smp_rmb(), the ordering is only guaranteed among loads and, for smp_wmb(), only among stores. Load-store ordering, however, rarely matters, which is why Linux developers used only smp_rmb() and smp_wmb() for a long time.

In the case of seqcounts, load-store ordering is not a problem, because the reader does not perform any writes to shared memory in its critical section, thus there cannot be concurrent changes to shared memory between the writer's updates of the generation count. There's a little bit of handwaving in this reasoning, but it is actually sound as long as the code is kept simple and faithful to the pattern. If the reader needed to write to shared memory, it would suffice to protect those writes with a different mechanism than the seqcount.
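As a hypothetical example, a reader that counts how many snapshots it has taken would keep that update outside the seqcount's protection, for instance by making it atomic:

    static atomic_t snapshots_taken;     /* made-up counter for illustration */
    unsigned int seq;

    do {
        seq = read_seqcount_begin(&sc);
        copy = data;                     /* only loads under the seqcount */
    } while (read_seqcount_retry(&sc, seq));

    atomic_inc(&snapshots_taken);        /* protected by its own mechanism */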

While informal, the explanation in the previous paragraph highlights the importance of knowing the common lockless programming patterns. In short, patterns enable thinking about code at a more coarse level without losing precision. Instead of looking at each memory access individually, you can make statements like "data.x and data.y are protected by the seqcount sc" or, referring to the earlier message passing example, "a is published to other threads via message". To some extent, proficiency in lockless programming patterns means being able to make such statements and take advantage of them to understand the code.

This concludes our initial look at memory barriers. There is a lot more to this topic than has been covered so far, naturally; the next installment in this series will delve into full memory barriers, how they work, and how they are used in the kernel.

Index entries for this article
Kernel: Lockless algorithms
Kernel: Memory barriers
GuestArticles: Bonzini, Paolo



Lockless patterns: relaxed access and partial memory barriers

Posted Feb 27, 2021 11:57 UTC (Sat) by PengZheng (subscriber, #108006) [Link] (10 responses)

I'm confused by the first Seqcounts example. Consider the following:

thread 1 (writer)
--------------------------------
WRITE_ONCE(sc, sc + 1); // sc == 1
smp_store_release(&data.x, 123);
WRITE_ONCE(data.y, 456);
smp_store_release(&sc, sc + 1); // sc == 2

WRITE_ONCE(sc, sc + 1); // sc == 3
smp_store_release(&data.x, 567);
WRITE_ONCE(data.y, 890);
smp_store_release(&sc, sc + 1); // sc == 4

thread 2
------------------------
do {
old = smp_load_acquire(&sc) & ~1; // old ==2
copy.y = READ_ONCE(data.y);
copy.x = smp_load_acquire(&data.x); // x == 123
} while(READ_ONCE(sc) != old);

What forced data.y to be 456 instead of 890? Neither "smp_store_release(&data.x, 567)" nor "smp_store_release(&sc, sc + 1)" prevents WRITE_ONCE(data.y, 890) from moving before them.

Please enlighten me.

Lockless patterns: relaxed access and partial memory barriers

Posted Feb 27, 2021 12:01 UTC (Sat) by PengZheng (subscriber, #108006) [Link] (9 responses)

thread 2
------------------------
do {
old = smp_load_acquire(&sc) & ~1; // old ==2
copy.y = READ_ONCE(data.y);
copy.x = smp_load_acquire(&data.x); // x == 123
} while(READ_ONCE(sc) != old); // READ_ONCE(sc) == 2

Lockless patterns: relaxed access and partial memory barriers

Posted Feb 27, 2021 12:30 UTC (Sat) by PengZheng (subscriber, #108006) [Link] (8 responses)

Try to answer my own question after reading the second Seqcounts example:
1. smp_load_acquire(&data.x) turns READ_ONCE(data.y) into an acquire op.
2. smp_store_release(&data.x, 123) turns WRITE_ONCE(data.y, 456) into a release op.

Lockless patterns: relaxed access and partial memory barriers

Posted Feb 27, 2021 23:11 UTC (Sat) by randomguy3 (subscriber, #71063) [Link] (7 responses)

My understanding is that if READ_ONCE(data.y) observes the write of 890, then that means the smp_load_acquire(&sc) before it synchronises with (happens after) smp_store_release(&data.x, 567). That, in turn, means READ_ONCE(sc) will see the write of 3 to sc (or a later write to sc), and we'll go round the loop again.

Lockless patterns: relaxed access and partial memory barriers

Posted Feb 27, 2021 23:18 UTC (Sat) by randomguy3 (subscriber, #71063) [Link] (6 responses)

Oh, no, wait, the wording from the first article suggests the store and load have to refer to the same variable/memory location in order to synchronise, which makes sense.

Lockless patterns: relaxed access and partial memory barriers

Posted Feb 28, 2021 4:08 UTC (Sun) by PengZheng (subscriber, #108006) [Link] (5 responses)

If READ_ONCE(data.y) observes the write of 890, then READ_ONCE(sc) >= 3. Why?
Because smp_store_release(&data.x, 567) is a stronger form of

smp_wmb();
WRITE_ONCE(data.x, 567);

copy.x = smp_load_acquire(&data.x) is a stronger form of

copy.x = READ_ONCE(data.x)
smp_rmb();

smp_wmb() makes both WRITE_ONCE(data.x, 567) and WRITE_ONCE(data.y, 890) into release operations, while smp_rmb() makes both READ_ONCE(data.y) and READ_ONCE(data.x) into acquire operations.

Lockless patterns: relaxed access and partial memory barriers

Posted Feb 28, 2021 4:19 UTC (Sun) by PengZheng (subscriber, #108006) [Link]

WRITE_ONCE(data.y, 890) after smp_wmb() synchronizes with READ_ONCE(data.y) before smp_rmb().

In thread 1 we have
WRITE_ONCE(sc, sc + 1); // sc == 3
happens before
smp_wmb();
WRITE_ONCE(data.y, 890);

In thread 2 we have
copy.y = READ_ONCE(data.y);
smp_rmb();
happens before READ_ONCE(sc)

Thus WRITE_ONCE(sc, sc + 1); // sc == 3
happens before READ_ONCE(sc) in thread 2.

Uh-oh

Posted Feb 28, 2021 6:33 UTC (Sun) by pbonzini (subscriber, #60935) [Link] (3 responses)

> "smp_store_release(&data.x, 567)" is a stronger form of "smp_wmb(); WRITE_ONCE(data.x, 567);"

This is not entirely correct, because barriers are bidirectional while store-release is unidirectional.

I tried putting this in CPPMEM:

int main() {
  atomic_int sc = 2;
  atomic_int x = 123;
  atomic_int y = 456;
  {{{
   {
     sc.store(3, memory_order_relaxed);
     x.store(567, memory_order_release);
     y.store(890, memory_order_relaxed);
     // sc.store(4, memory_order_release);
   }
  |||
   {
     sc.load(memory_order_acquire).readsvalue(2);
     r1 = y.load(memory_order_relaxed);
     r2 = x.load(memory_order_acquire).readsvalue(123);
   }
  }}};
  return 0;
}

and you are probably correct. The simplest fix is to use load-acquire/store-release for y as well. I'll write up a fix and ask our editor to update the article.

wmb, smp_wmb, and store-release

Posted Mar 1, 2021 3:00 UTC (Mon) by PengZheng (subscriber, #108006) [Link]

> This is not entirely correct, because barriers are bidirectional while store-release is unidirectional.

Thanks for correcting my misunderstanding. I read the following for a brief explanation of memory barriers:
http://bruceblinn.com/linuxinfo/MemoryBarriers.html

Uh-oh

Posted Mar 1, 2021 8:41 UTC (Mon) by peda (subscriber, #5285) [Link] (1 responses)

Are you sure you're not just putting lipstick on a pig?

Uh-oh

Posted Mar 1, 2021 11:24 UTC (Mon) by pbonzini (subscriber, #60935) [Link]

Indeed. :)

But memory barriers do work, they are bidirectional.

Lockless patterns: relaxed access and partial memory barriers

Posted Feb 28, 2021 18:02 UTC (Sun) by marcH (subscriber, #57642) [Link] (11 responses)

> A data race occurs when two accesses are concurrent (i.e., not ordered by the happens before relation), at least one of them is a store, and at least one of them is not using atomic load or store primitives.

This definition seems very old, wikipedia quotes a paper from 1991:

https://en.wikipedia.org/wiki/Race_condition#Example_defi...

> The concept of a data race, as presented here, was first introduced in C++11 and, since then, has been applied to various other languages, notably C11 and Rust.

The Java memory model was formalized in 2004. How did it not include the concept of a data race?

Lockless patterns: relaxed access and partial memory barriers

Posted Feb 28, 2021 18:12 UTC (Sun) by mathstuf (subscriber, #69389) [Link] (10 responses)

> The Java memory model was formalized in 2004. How did it not include the concept of a data race?

The key words are "as presented here". Java/JVM would, presumably, differ in some details here.

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 1, 2021 1:23 UTC (Mon) by marcH (subscriber, #57642) [Link] (9 responses)

Sure, and "undefined behavior" is one of the implementation details that seem to differ.
However, the lack of references to earlier work makes it very easy to misinterpret this paragraph as saying that the concept of a data race was pioneered by C/C++11. I'm no expert, but I suspect the reality is more that C/C++ defined their memory models very late, decades after both memory models _and_ the languages were a thing, and that they merely adjusted very old concepts. So hinting at the existence of prior art wouldn't hurt. I realize LWN articles are not peer-reviewed IEEE articles, but their quality often comes close, which naturally raises expectations :-)

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 1, 2021 8:33 UTC (Mon) by pbonzini (subscriber, #60935) [Link] (8 responses)

The C++11 memory model indeed came out after Java's, but despite being based on the same theoretical framework it's quite different; I left it out because most of the material in here is not applicable.

At a high level, the Java memory model defines synchronize-with relations for higher-level constructs, such as "synchronized" blocks, constructors, finalizers, etc. C++ (and C/Rust) only define fences and atomic memory operations. Java also doesn't have undefined behavior; for example, data races cannot cause out-of-bounds array accesses.

The biggest difference however is that it does not have memory orderings or separate fences. Everything is either declared volatile, or writes must not be concurrent with reads. For example, seqcounts cannot be implemented in Java without massive performance issues.

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 20, 2021 4:55 UTC (Sat) by alison (subscriber, #63752) [Link] (7 responses)

Reading the article made me wonder how "volatile" itself worked. Here is "When is a Volatile Object Accessed?" from GCC:

https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html#Volatiles

Short summary is, one should insert an ASM memory barrier to prevent writes and reads from being reordered. Presumably they compile to the same bits as an rmb().

Paging device manufacturers! GCC opines that "it is unwise to use volatile bit-fields to access hardware."

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 20, 2021 14:25 UTC (Sat) by Wol (subscriber, #4433) [Link] (1 responses)

> Paging device manufacturers! GCC opines that "it is unwise to use volatile bit-fields to access hardware."

I thought that was the justification for introducing volatile! That other actors *do* have access to that memory location, and *will* be reading/writing to it.

Cheers,
Wol

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 20, 2021 15:42 UTC (Sat) by zlynx (guest, #2285) [Link]

Hardware can have all kinds of odd access requirements and there is no way to define these with C bitfields, volatile or not.

Do you read 16 bits, modify 3 of them and write it back? Or is it 32 bits or 8 bits?

And I believe some hardware does not allow reading. Instead you need to keep a shadow copy and write the whole byte, word, double word or whatever.

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 20, 2021 15:27 UTC (Sat) by excors (subscriber, #95769) [Link] (3 responses)

> Short summary is, one should insert an ASM memory barrier to prevent writes and reads from being reordered. Presumably they compile to the same bits as an rmb().

They don't - 'asm volatile ("":::"memory")' compiles to zero bits. It's only preventing the compiler from reordering memory accesses across the barrier, not the CPU. Specifically it's telling the compiler "this (empty) string of assembly code might read or write memory, and I'm not going to say exactly which parts of memory", which prevents the compiler making assumptions about the contents of any memory after that point. E.g. it can't assume that reading the same variable before and after the barrier will return the same value, regardless of whether it was declared volatile or not, so it can't reorder the read across the barrier and it can't cache the value in a register.

The CPU knows nothing about that compiler barrier and may still decide to reorder memory accesses. rmb() etc emit special assembly instructions that will constrain the CPU's reordering too.
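For the curious, the kernel's compiler-only barrier is defined essentially as described above (in include/linux/compiler.h):

    #define barrier() __asm__ __volatile__("" : : : "memory")

whereas the mandatory rmb() on 64-bit x86 combines a real fence instruction with the same clobber, roughly:

    #define rmb() asm volatile("lfence" ::: "memory")

The SMP variants smp_rmb()/smp_wmb() reduce to barrier() alone on x86, since that architecture never reorders loads against loads or stores against stores.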

> Paging device manufacturers! GCC opines that "it is unwise to use volatile bit-fields to access hardware."

I misread that at first, so I think it's worth noting that it's specifically about volatile bit-fields, not volatile in general. (Typically the bit-field accesses will be translated into read/write sequences over bytes or words. It's common for hardware register reads to have side effects and be non-idempotent, so the compiler inserting an implicit read to implement the bit-field semantics may be very surprising to the programmer who thought they were just doing a write. So when accessing hardware registers (instead of normal RAM), you should stick with reading/writing whole bytes/words (depending on architecture) and do the bit manipulation explicitly to avoid surprises. And then use volatile, plus compiler barriers and memory barriers where the hardware is sensitive about the order of operations.)
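Concretely, the explicit style looks something like this (CTRL_REG and CTRL_ENABLE are made-up names):

    volatile uint32_t *ctrl = (volatile uint32_t *)CTRL_REG;  /* hypothetical register */
    uint32_t val;

    val = *ctrl;           /* one explicit 32-bit read */
    val |= CTRL_ENABLE;    /* bit manipulation in a local variable */
    *ctrl = val;           /* one explicit 32-bit write */

rather than a 'volatile unsigned enable : 1;' bit-field, where the compiler chooses the access width and inserts the implicit read for you.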

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 21, 2021 4:39 UTC (Sun) by alison (subscriber, #63752) [Link] (2 responses)

> They don't - 'asm volatile ("":::"memory")' compiles to zero bits. . . . rmb() etc emit special assembly instructions that will constrain the CPU's reordering too.

Is rmb() likely to emit the same ASM as 'asm volatile ("":::"memory")' though? Let me guess that the answer is, it depends on the architecture.

In a header file I chanced to read today one finds "__asm__ __volatile__". Presumably those are compiler directives that result in the other ASM directive?

> The CPU knows nothing about that compiler barrier and may still decide to reorder memory accesses.

Does turning off any of the various forms of speculative execution affect memory reordering? Perhaps reordering has more to do with some accesses coming from the cache or corresponding to TLB hits and others not?

>when accessing hardware registers (instead of normal RAM), you should stick with reading/writing whole bytes/words (depending on architecture)

As far as I know, there is no hardware way to read an individual bit, unless one includes GPIO output levels.

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 21, 2021 11:42 UTC (Sun) by excors (subscriber, #95769) [Link] (1 responses)

> Is rmb() likely to emit the same ASM as 'asm volatile ("":::"memory")' though? Let me guess that the answer is, it depends on the architecture.

On e.g. x86 (https://elixir.bootlin.com/linux/latest/source/arch/x86/i...) it combines the mfence/lfence/sfence instruction (to prevent CPU reordering) with the "memory" clobber argument (to prevent compiler reordering). (The important part is the "memory", not the "asm volatile"). I'd guess all architectures will do a similar memory clobber because they all need to prevent compiler reordering across the barrier.

> Does turning off any of the various forms of speculative execution affect memory reordering? Perhaps reordering has more to do with some accesses coming from the cache or corresponding to TLB hits and others not?

Speculation might have some effect, but even without speculation the CPU may still execute non-dependent instructions out-of-order, so it can reorder two memory instructions if there's no implicit dependency (by the architecture specification) or explicit dependency (by a fence/barrier instruction) between them. It might be more likely to do that if the first instruction was a cache miss and it can complete the second instruction while waiting, but there's nothing stopping it reordering for any arbitrary reason.

And even without out-of-order execution of instructions, it might still reorder the RAM operations (and that reordering would be visible to other CPUs accessing the same RAM). E.g. a store instruction will typically write into a per-CPU store buffer, which is not flushed to L1/L2/RAM until some time later. If the same CPU loads a different address from RAM after the store instruction but before the flush, then RAM will see the load before the store. (I think L1/L2 caches can be ignored here since cache coherency means they'll behave equivalently to a global shared RAM; the problem is the store buffer which is a kind of cache but doesn't participate in the coherency protocol). x86 mostly guarantees sequential consistency, and the store buffer is the main exception to that.
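That exception is the classic "store buffering" litmus test; in the kernel's notation (a sketch):

    CPU 0                       CPU 1
    ------------------          ------------------
    WRITE_ONCE(x, 1);           WRITE_ONCE(y, 1);
    r1 = READ_ONCE(y);          r2 = READ_ONCE(x);

Each CPU's store can still be sitting in its store buffer when the other CPU's load executes, so the outcome r1 == 0 && r2 == 0 is allowed even on x86; only a full barrier such as smp_mb() between the store and the load forbids it.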

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 24, 2021 4:34 UTC (Wed) by alison (subscriber, #63752) [Link]

> (I think L1/L2 caches can be ignored here since cache coherency means they'll behave equivalently to a global shared RAM; the problem is the store buffer which is a kind of cache but doesn't participate in the coherency protocol). x86 mostly guarantees sequential consistency, and the store buffer is the main exception to that.

J.F. Bastien and Chris Leary just discussed that very topic on their excellent TLB Hits podcast under "Multiprocessing Teaser":

https://tlbh.it/000_mov_fp_sp.html

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 23, 2021 9:51 UTC (Tue) by daenzer (subscriber, #7050) [Link]

C bit-fields are mostly useless for modelling hardware registers for various other reasons as well, in particular if the code needs to work on multiple host architectures (since the memory layout of bit-fields is left implementation-defined in the C spec).

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 11, 2021 14:11 UTC (Thu) by jogness (subscriber, #49775) [Link]

What is interesting is that mainline does not implement seqcount as described here. Both write_seqcount_end() and read_seqcount_begin() are using the heavier full write and read memory barriers, respectively. Should this be fixed in mainline or is there perhaps an additional rule implied in the seqcount semantics? For example, perhaps stores after write_seqcount_end() should not be visible to seqcount readers before stores in the writer critical section. Such a rule would require the heavier memory barriers.

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 12, 2021 14:01 UTC (Fri) by nunojsa (guest, #107530) [Link] (2 responses)

"However, the code has a bug! Because the writer has no acquire operation at the top of the sequence, the write to data.y might execute before writing the odd value to sc"

I'm very puzzled by this sentence :D... This refers to the first seqcount example. Maybe I'm just misunderstanding the whole thing, but how can WRITE_ONCE(data.y, 456) execute before writing the odd value to sc?? As far as I understand (at least in the Linux kernel generic implementation), smp_store_release(&data.x, 123) evaluates to something like:

__smp_mb(); // which is for loads and stores...
WRITE_ONCE(data.x, 123);

Hence, there's no ordering guarantee between data.x and data.y, but shouldn't __smp_mb() guarantee that WRITE_ONCE(data.y, 456) does not get reordered to before this barrier, and so before the odd value written to sc?

Maybe smp_store_release() here does not imply a barrier? Is that what I'm missing?

Lockless patterns: relaxed access and partial memory barriers

Posted Apr 16, 2021 7:03 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (1 responses)

The generic implementation of smp_mb() is stronger than what smp_store_release() requires. See here for a CPPMEM test that shows the problem.

Lockless patterns: relaxed access and partial memory barriers

Posted Apr 26, 2021 9:04 UTC (Mon) by nunojsa (guest, #107530) [Link]

"The generic implementation of smp_mb() is stronger than what smp_store_release() requires"

Yes, I get that :).

Using CPPMEM I could see the issue in the C++ memory model. I guess my problem is that I was only thinking about how it could happen in the kernel (which I think it can't)...

Thanks!

Lockless patterns: relaxed access and partial memory barriers

Posted Apr 5, 2021 10:06 UTC (Mon) by naveenmv7 (guest, #141627) [Link] (1 responses)

old = smp_load_acquire(&sc) & ~1;

Here I assume we are checking if sc is odd or even by looking at the last bit.

In that case, shouldn't this be

old = smp_load_acquire(&sc) & 1;

Lockless patterns: relaxed access and partial memory barriers

Posted Apr 5, 2021 10:09 UTC (Mon) by naveenmv7 (guest, #141627) [Link]

Never mind. We are checking the current sc value against the previous. Not the parity.

Lockless patterns: relaxed access and partial memory barriers

Posted Jan 21, 2022 12:30 UTC (Fri) by rharjani (subscriber, #87278) [Link]

"Data is accessed entirely with relaxed atomic loads and stores (though in the Linux kernel memory model non-atomic accesses would be acceptable too)"

Could you shed some light on the comment in parentheses? i.e., in the Linux kernel memory model, non-atomic accesses would be acceptable too?


Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds