Lockless patterns: relaxed access and partial memory barriers

Posted Mar 21, 2021 11:42 UTC (Sun) by excors (subscriber, #95769)
In reply to: Lockless patterns: relaxed access and partial memory barriers by alison
Parent article: Lockless patterns: relaxed access and partial memory barriers

> Is rmb() likely to emit the same ASM as 'asm volatile ("":::"memory")' though? Let me guess that the answer is, it depends on the architecture.

On e.g. x86 (https://elixir.bootlin.com/linux/latest/source/arch/x86/i...) the barrier macros combine an mfence/lfence/sfence instruction (lfence in the case of rmb(), to prevent CPU reordering) with the "memory" clobber argument (to prevent compiler reordering). As far as the compiler is concerned, the important part is the "memory" clobber, not the "asm volatile". I'd guess all architectures will do a similar memory clobber because they all need to prevent compiler reordering across the barrier.
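
To make the distinction concrete, here is a minimal sketch (assuming GCC/Clang inline-asm syntax) of a compiler-only barrier next to an lfence-based read barrier modeled on the x86-64 case. The names compiler_barrier(), read_barrier(), read_twice() and read_flag_then_data() are made up for illustration and are not the kernel's macros, and the non-x86 branch is only a placeholder.

```c
/* Compiler-only barrier: emits no instruction, but the "memory" clobber
 * forbids the compiler from caching or moving memory accesses across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Read barrier sketch: the same clobber plus a real fence instruction. */
#if defined(__x86_64__)
#define read_barrier() __asm__ __volatile__("lfence" ::: "memory")
#else
/* Placeholder for other architectures (an assumption, not kernel code). */
#define read_barrier() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#endif

/* Without the barrier the compiler could load *p once and reuse the value. */
int read_twice(const int *p)
{
    int a = *p;
    compiler_barrier();
    int b = *p;             /* reloaded from memory because of the clobber */
    return a + b;
}

/* Read the flag first, then the data, with a read barrier in between. */
int read_flag_then_data(const int *flag, const int *data)
{
    int f = *flag;
    read_barrier();
    return f ? *data : -1;  /* -1 means "flag was not set" */
}
```

Compiling this with optimization and looking at the generated assembly should show that compiler_barrier() contributes no instruction of its own, while read_barrier() on x86-64 contributes the lfence.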

> Does turning off any of the various forms of speculative execution affect memory reordering? Perhaps reordering has more to do with some accesses coming from the cache or corresponding to TLB hits and others not?

Speculation might have some effect, but even without speculation the CPU may still execute non-dependent instructions out of order, so it can reorder two memory instructions if there's no implicit dependency (from the architecture specification) or explicit dependency (from a fence/barrier instruction) between them. It might be more likely to do that if the first instruction was a cache miss and it can complete the second instruction while waiting, but there's nothing stopping it from reordering for any arbitrary reason.
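
As a rough userspace illustration of an explicit dependency created by fences, here is a message-passing sketch using C11 atomics and pthreads. It is not kernel code: payload, ready, producer(), consumer() and run_message_passing() are names invented for this example, and the release/acquire fences are only loosely analogous to the kernel's write and read barriers.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int payload;  /* the data being handed over */
static atomic_int ready;    /* the flag that announces it */

static void *producer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&payload, 42, memory_order_relaxed);
    /* Keep the payload store from being reordered after the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg)
{
    int *seen = arg;
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;  /* spin until the flag is set */
    /* Keep the payload load from being reordered before the flag load. */
    atomic_thread_fence(memory_order_acquire);
    *seen = atomic_load_explicit(&payload, memory_order_relaxed);
    return NULL;
}

/* Runs one producer/consumer pair and returns what the consumer read. */
int run_message_passing(void)
{
    pthread_t p, c;
    int seen = -1;

    atomic_store(&payload, 0);
    atomic_store(&ready, 0);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

With both fences in place the consumer is guaranteed to read 42 once it has seen the flag. If the fences were removed, the two relaxed accesses in each thread would have no ordering between them, and a weakly ordered CPU (or the compiler) would be free to let the consumer see the flag but a stale payload.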

And even without out-of-order execution of instructions, the CPU might still reorder its memory operations as RAM sees them (and that reordering would be visible to other CPUs accessing the same RAM). E.g. a store instruction will typically write into a per-CPU store buffer, which is not flushed to L1/L2/RAM until some time later. If the same CPU loads a different address from RAM after the store instruction but before the flush, then RAM will see the load before the store. (I think L1/L2 caches can be ignored here since cache coherency means they'll behave equivalently to a global shared RAM; the problem is the store buffer which is a kind of cache but doesn't participate in the coherency protocol). x86 mostly guarantees sequential consistency, and the store buffer is the main exception to that.
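
The store-buffer case is usually shown as a two-thread litmus test, and a rough userspace sketch with C11 atomics and pthreads follows. Everything in it (x, y, r0, r1, use_fence, count_both_zero()) is a made-up name for illustration. Each thread stores to one location and then loads the other; the outcome where both loads return 0 is the one that needs a store to be delayed past a later load.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x, y;   /* the two shared locations */
static int r0, r1;        /* what each thread loaded */
static int use_fence;     /* written only while no worker thread is running */

static void *thread0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *thread1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Runs the litmus test `iterations` times and counts how often both
 * threads loaded 0, i.e. both loads appeared to pass the earlier stores. */
int count_both_zero(int iterations, int fence)
{
    int hits = 0;

    use_fence = fence;
    for (int i = 0; i < iterations; i++) {
        pthread_t t0, t1;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_create(&t0, NULL, thread0, NULL);
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r0 == 0 && r1 == 0)
            hits++;
    }
    return hits;
}
```

With fence set to 1 the count has to be 0, because two sequentially consistent fences rule out the both-zero outcome. With fence set to 0 the outcome is allowed, even on x86, but whether it actually shows up depends on the hardware and on how much the two threads happen to overlap, so a short run may well report 0 anyway.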


Posted Mar 24, 2021 4:34 UTC (Wed) by alison (subscriber, #63752)

> (I think L1/L2 caches can be ignored here since cache coherency means they'll behave equivalently to a global shared RAM; the problem is the store buffer which is a kind of cache but doesn't participate in the coherency protocol). x86 mostly guarantees sequential consistency, and the store buffer is the main exception to that.

J.F. Bastien and Chris Leary just discussed that very topic on their excellent TLB Hits podcast, under "Multiprocessing Teaser":

https://tlbh.it/000_mov_fp_sp.html