
Lockless patterns: relaxed access and partial memory barriers

Posted Feb 28, 2021 18:02 UTC (Sun) by marcH (subscriber, #57642)
Parent article: Lockless patterns: relaxed access and partial memory barriers

> A data race occurs when two accesses are concurrent (i.e., not ordered by the happens before relation), at least one of them is a store, and at least one of them is not using atomic load or store primitives.

This definition seems very old, wikipedia quotes a paper from 1991:

https://en.wikipedia.org/wiki/Race_condition#Example_defi...

> The concept of a data race, as presented here, was first introduced in C++11 and, since then, has been applied to various other languages, notably C11 and Rust.

The Java memory model was formalized in 2004. How did it not include the concept of a data race?



Lockless patterns: relaxed access and partial memory barriers

Posted Feb 28, 2021 18:12 UTC (Sun) by mathstuf (subscriber, #69389) [Link] (10 responses)

> The Java memory model was formalized in 2004. How did it not include the concept of a data race?

The key words are "as presented here". Java/JVM would, presumably, differ in some details here.

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 1, 2021 1:23 UTC (Mon) by marcH (subscriber, #57642) [Link] (9 responses)

Sure, and "undefined behavior" is one of the implementation details that seems to differ.
However, the lack of references to earlier work makes it very easy to misinterpret this paragraph as saying the concept of a data race was pioneered by C/C++11. I'm no expert, but I suspect the reality is more like: C/C++ defined their memory models very late, decades after both memory models _and_ the languages were a thing, and merely adjusted very old concepts. So hinting at the existence of prior art wouldn't hurt. I realize LWN articles are not peer-reviewed IEEE articles, but their quality often comes close, which naturally raises expectations :-)

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 1, 2021 8:33 UTC (Mon) by pbonzini (subscriber, #60935) [Link] (8 responses)

The C++11 memory model indeed came out after Java's, but despite being based on the same theoretical framework it's quite different; I left it out because most of the material in here is not applicable.

At a high level, the Java memory model defines synchronize-with relations for higher-level constructs, such as "synchronized" blocks, constructors, finalizers, etc.; C++ (and C/Rust) only define fences and atomic memory operations. Java also doesn't have undefined behavior: for example, data races cannot cause out-of-bounds array accesses.

The biggest difference, however, is that Java does not have memory orderings or separate fences. Everything is either declared volatile, or writes must not be concurrent with reads. For example, seqcounts cannot be implemented in Java without massive performance issues.
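For readers following along, here is a minimal sketch of the seqcount pattern in C11 atomics, illustrating exactly the per-access memory orderings and standalone fences that Java lacks (it follows the usual Boehm-style seqlock mapping; the struct and function names are invented for illustration):

```c
#include <stdatomic.h>

/* Minimal seqcount sketch. The writer bumps the sequence to an odd
 * value, updates the payload, then bumps it back to even; readers
 * retry if they saw an odd or changed sequence. */
struct seqcount_data {
	atomic_uint seq;	/* even: stable, odd: writer active */
	atomic_int a, b;	/* payload, accessed relaxed */
};

static void seq_write(struct seqcount_data *s, int a, int b)
{
	unsigned seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* like smp_wmb() */
	atomic_store_explicit(&s->a, a, memory_order_relaxed);
	atomic_store_explicit(&s->b, b, memory_order_relaxed);
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

static void seq_read(struct seqcount_data *s, int *a, int *b)
{
	unsigned s1, s2;

	do {
		s1 = atomic_load_explicit(&s->seq, memory_order_acquire);
		*a = atomic_load_explicit(&s->a, memory_order_relaxed);
		*b = atomic_load_explicit(&s->b, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire); /* like smp_rmb() */
		s2 = atomic_load_explicit(&s->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);
}
```

In Java, every one of those relaxed payload accesses would have to be a volatile (i.e. roughly sequentially consistent) access, which is where the performance cost comes from.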

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 20, 2021 4:55 UTC (Sat) by alison (subscriber, #63752) [Link] (7 responses)

Reading the article made me wonder how "volatile" itself worked. Here is "When is a Volatile Object Accessed?" from GCC:

https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html#Volatiles

Short summary is, one should insert an ASM memory barrier to prevent writes and reads from being reordered. Presumably they compile to the same bits as an rmb().

Paging device manufacturers! GCC opines that "it is unwise to use volatile bit-fields to access hardware."

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 20, 2021 14:25 UTC (Sat) by Wol (subscriber, #4433) [Link] (1 responses)

> Paging device manufacturers! GCC opines that "it is unwise to use volatile bit-fields to access hardware."

I thought that was the justification for introducing volatile! That other actors *do* have access to that memory location, and *will* be reading/writing to it.

Cheers,
Wol

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 20, 2021 15:42 UTC (Sat) by zlynx (guest, #2285) [Link]

Hardware can have all kinds of odd access requirements and there is no way to define these with C bitfields, volatile or not.

Do you read 16 bits, modify 3 of them and write it back? Or is it 32 bits or 8 bits?

And I believe some hardware does not allow reading. Instead you need to keep a shadow copy and write the whole byte, word, double word or whatever.
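The shadow-copy idea can be sketched in a few lines of C; all of the names here are invented, and a real driver would map `reg` to a device address rather than plain RAM:

```c
#include <stdint.h>

/* Sketch of the "shadow copy" pattern for a write-only hardware
 * register: reads of the real register are not allowed (or return
 * garbage), so the driver keeps the last value written in RAM and
 * does read-modify-write against that shadow instead. */
struct wo_reg {
	volatile uint32_t *reg;	/* memory-mapped, write-only */
	uint32_t shadow;	/* last value written */
};

static void wo_reg_update(struct wo_reg *r, uint32_t set, uint32_t clear)
{
	r->shadow = (r->shadow & ~clear) | set;
	*r->reg = r->shadow;	/* always write the whole word */
}
```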

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 20, 2021 15:27 UTC (Sat) by excors (subscriber, #95769) [Link] (3 responses)

> Short summary is, one should insert an ASM memory barrier to prevent writes and reads from being reordered. Presumably they compile to the same bits as an rmb().

They don't - 'asm volatile ("":::"memory")' compiles to zero bits. It's only preventing the compiler from reordering memory accesses across the barrier, not the CPU. Specifically it's telling the compiler "this (empty) string of assembly code might read or write memory, and I'm not going to say exactly which parts of memory", which prevents the compiler making assumptions about the contents of any memory after that point. E.g. it can't assume that reading the same variable before and after the barrier will return the same value, regardless of whether it was declared volatile or not, so it can't reorder the read across the barrier and it can't cache the value in a register.

The CPU knows nothing about that compiler barrier and may still decide to reorder memory accesses. rmb() etc emit special assembly instructions that will constrain the CPU's reordering too.
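A small GCC/Clang sketch of the "zero bits" point (the function and variable names are made up for illustration):

```c
/* barrier() emits no instructions at all; it only tells GCC/Clang
 * that the (empty) asm may read or write memory, so the compiler
 * must reload `flag` from memory on every loop iteration instead of
 * caching it in a register. It does nothing about CPU reordering. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static int flag;	/* might be set by an interrupt handler, say */

void set_flag(int v) { flag = v; }

int wait_for_flag_bounded(int max_spins)
{
	/* Without barrier(), the compiler would be free to hoist the
	 * load of `flag` out of the loop and spin on a stale copy. */
	while (max_spins-- > 0) {
		if (flag)
			return 1;
		barrier();	/* compiler barrier only */
	}
	return 0;
}
```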

> Paging device manufacturers! GCC opines that "it is unwise to use volatile bit-fields to access hardware."

I misread that at first, so I think it's worth noting that it's specifically about volatile bit-fields, not volatile in general. (Typically the bit-field accesses will be translated into read/write sequences over bytes or words. It's common for hardware register reads to have side effects and be non-idempotent, so the compiler inserting an implicit read to implement the bit-field semantics may be very surprising to the programmer who thought they were just doing a write. So when accessing hardware registers (instead of normal RAM), you should stick with reading/writing whole bytes/words (depending on architecture) and do the bit manipulation explicitly to avoid surprises. And then use volatile, plus compiler barriers and memory barriers where the hardware is sensitive about the order of operations.)
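The explicit-manipulation alternative looks something like this (register layout and macro names invented for illustration):

```c
#include <stdint.h>

/* Instead of a volatile bit-field, access the whole 32-bit register
 * and do the bit manipulation by hand, so exactly one read and one
 * write of known width reach the hardware. */
#define CTRL_ENABLE	(1u << 0)
#define CTRL_IRQ_MASK	(0xfu << 4)

static inline void ctrl_set_enable(volatile uint32_t *ctrl)
{
	uint32_t v = *ctrl;	/* one explicit 32-bit read */

	v |= CTRL_ENABLE;
	*ctrl = v;		/* one explicit 32-bit write */
}

static inline void ctrl_set_irq(volatile uint32_t *ctrl, uint32_t irq)
{
	uint32_t v = *ctrl;

	v = (v & ~CTRL_IRQ_MASK) | ((irq << 4) & CTRL_IRQ_MASK);
	*ctrl = v;
}
```

With a bit-field, the width and number of those accesses would be up to the compiler.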

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 21, 2021 4:39 UTC (Sun) by alison (subscriber, #63752) [Link] (2 responses)

> They don't - 'asm volatile ("":::"memory")' compiles to zero bits. . . . rmb() etc emit special assembly instructions that will constrain the CPU's reordering too.

Is rmb() likely to emit the same ASM as 'asm volatile ("":::"memory")' though? Let me guess that the answer is, it depends on the architecture.

In a header file I chanced to read today one finds "__asm__ __volatile__". Presumably those are compiler directives that result in the other ASM directive?

> The CPU knows nothing about that compiler barrier and may still decide to reorder memory accesses.

Does turning off any of the various forms of speculative execution affect memory reordering? Perhaps reordering has more to do with some accesses coming from the cache or corresponding to TLB hits and others not?

>when accessing hardware registers (instead of normal RAM), you should stick with reading/writing whole bytes/words (depending on architecture)

As far as I know, there is no hardware way to read an individual bit, unless one includes GPIO output levels.


Lockless patterns: relaxed access and partial memory barriers

Posted Mar 21, 2021 11:42 UTC (Sun) by excors (subscriber, #95769) [Link] (1 responses)

> Is rmb() likely to emit the same ASM as 'asm volatile ("":::"memory")' though? Let me guess that the answer is, it depends on the architecture.

On e.g. x86 (https://elixir.bootlin.com/linux/latest/source/arch/x86/i...) it combines the mfence/lfence/sfence instruction (to prevent CPU reordering) with the "memory" clobber argument (to prevent compiler reordering). (The important part is the "memory", not the "asm volatile"). I'd guess all architectures will do a similar memory clobber because they all need to prevent compiler reordering across the barrier.

> Does turning off any of the various forms of speculative execution affect memory reordering? Perhaps reordering has more to do with some accesses coming from the cache or corresponding to TLB hits and others not?

Speculation might have some effect, but even without speculation the CPU may still execute non-dependent instructions out-of-order, so it can reorder two memory instructions if there's no implicit dependency (by the architecture specification) or explicit dependency (by a fence/barrier instruction) between them. It might be more likely to do that if the first instruction was a cache miss and it can complete the second instruction while waiting, but there's nothing stopping it reordering for any arbitrary reason.

And even without out-of-order execution of instructions, it might still reorder the RAM operations (and that reordering would be visible to other CPUs accessing the same RAM). E.g. a store instruction will typically write into a per-CPU store buffer, which is not flushed to L1/L2/RAM until some time later. If the same CPU loads a different address from RAM after the store instruction but before the flush, then RAM will see the load before the store. (I think L1/L2 caches can be ignored here since cache coherency means they'll behave equivalently to a global shared RAM; the problem is the store buffer which is a kind of cache but doesn't participate in the coherency protocol). x86 mostly guarantees sequential consistency, and the store buffer is the main exception to that.
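The store-buffer behavior described above is the classic "store buffer" litmus test; here is a pthreads sketch of it (compile with -pthread; the driver name is invented). With the seq_cst fences in place — mfence on x86 — the "both loads see 0" outcome is forbidden; deleting the fences allows it on real x86 hardware:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Store-buffer litmus test: each thread stores to one variable and
 * loads the other. Without full barriers, the store buffer lets both
 * threads read 0; with seq_cst fences that outcome cannot happen. */
static atomic_int x, y;
static int r1, r2;

static void *t1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* full barrier */
	r1 = atomic_load_explicit(&y, memory_order_relaxed);
	return NULL;
}

static void *t2(void *arg)
{
	(void)arg;
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	r2 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

/* Returns how many iterations ended with the forbidden r1==r2==0. */
int count_forbidden(int iterations)
{
	int bad = 0;

	for (int i = 0; i < iterations; i++) {
		pthread_t a, b;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		pthread_create(&a, NULL, t1, NULL);
		pthread_create(&b, NULL, t2, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		if (r1 == 0 && r2 == 0)
			bad++;
	}
	return bad;
}
```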

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 24, 2021 4:34 UTC (Wed) by alison (subscriber, #63752) [Link]

> (I think L1/L2 caches can be ignored here since cache coherency means they'll behave equivalently to a global shared RAM; the problem is the store buffer which is a kind of cache but doesn't participate in the coherency protocol). x86 mostly guarantees sequential consistency, and the store buffer is the main exception to that.

J.F. Bastien and Chris Leary just discussed that very topic at their excellent TLB Hits podcast under "Multiprocessing Teaser":

https://tlbh.it/000_mov_fp_sp.html

Lockless patterns: relaxed access and partial memory barriers

Posted Mar 23, 2021 9:51 UTC (Tue) by daenzer (subscriber, #7050) [Link]

C bit-fields are mostly useless for modelling hardware registers for various other reasons as well, in particular if the code needs to work on multiple host architectures (since the memory layout of bit-fields is left as implementation defined in the C spec).


Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds