Question 1
No.
memory_order_relaxed imposes no memory order at all:
Relaxed operation: there are no synchronization or ordering constraints, only atomicity is required of this operation.
while memory_order_consume imposes memory ordering on data-dependent reads (on the current thread):
A load operation with this memory order performs a consume operation on the affected memory location: no reads in the current thread dependent on the value currently loaded can be reordered before this load.
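Here is a minimal C++ sketch of a case where only atomicity is needed, so memory_order_relaxed is enough. Several threads bump a shared counter, and nothing else is published through it. The helper name relaxed_count is hypothetical and does not come from the question.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` workers each increment a shared counter
// `per_thread` times. Relaxed is sufficient because each fetch_add is still
// atomic (no lost updates); no other data is published through the counter.
long relaxed_count(int threads, int per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();  // join() makes all increments visible here
    return counter.load(std::memory_order_relaxed);
}
```

For example, relaxed_count(4, 10000) always yields 40000. The individual increments may interleave in any order, but none of them is lost.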
Edit
In general, memory_order_seq_cst is stronger than memory_order_acq_rel, which is stronger than memory_order_relaxed.
This is like having an Elevator A that can lift 800 kg and an Elevator C that can lift 100 kg.
Now, if you had the power to magically change Elevator A into Elevator C, what would happen if the former were filled with 10 people of average weight?
That would be bad.
To see exactly what could go wrong with the code, consider the example in your question:
    Thread A                                  Thread B
    Payload = 42;                             g = Guard.load(memory_order_consume);
    Guard.store(1, memory_order_release);     if (g != 0)
                                                  p = Payload;
This snippet is intended to be looped; there is no synchronization, only ordering, between the two threads.
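The example can be turned into a runnable sketch. This is a hypothetical harness; the name run_release_acquire_once is made up. Thread B uses memory_order_acquire instead of memory_order_consume for two reasons:

- As far as I know, mainstream compilers (GCC, Clang, MSVC) currently implement consume as acquire, so the observable behavior is the same.
- Since C++17, the standard temporarily discourages the use of memory_order_consume.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness for the Thread A / Thread B example, run once.
// Returns true if B either saw g == 0, or saw g != 0 together with p == 42.
bool run_release_acquire_once() {
    int payload = 0;             // plain, non-atomic data ("Payload")
    std::atomic<int> guard{0};   // "Guard"
    int g = 0, p = 0;

    std::thread a([&] {
        payload = 42;
        guard.store(1, std::memory_order_release);  // publishes payload
    });
    std::thread b([&] {
        g = guard.load(std::memory_order_acquire);
        if (g != 0)
            p = payload;  // ordered after the store of 42, so no data race
    });
    a.join();
    b.join();
    return g == 0 || p == 42;
}
```

Over many runs, Thread B may see g == 0 (it simply ran first). With release/acquire, however, it should never see g != 0 together with a stale p.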
With memory_order_relaxed, and assuming that a natural word load/store is atomic, the code would be equivalent to:

    Thread A                  Thread B
    Payload = 42;             g = Guard;
    Guard = 1;                if (g != 0)
                                  p = Payload;
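The relaxed variant can be sketched as an experiment. The function name is hypothetical, and Payload is made a relaxed atomic only so that the program has no data race (which would be undefined behavior). The standard allows the returned count to be nonzero. Whether you actually observe a nonzero count depends on the hardware and on timing; on x86 it will typically stay at zero, so a zero result proves nothing.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical experiment: the same two threads, but every access is relaxed.
// Returns how many of `trials` runs saw g != 0 but still read a stale payload.
int count_stale_reads_relaxed(int trials) {
    int stale = 0;
    for (int i = 0; i < trials; ++i) {
        std::atomic<int> payload{0}, guard{0};
        int g = 0, p = 0;
        std::thread a([&] {
            payload.store(42, std::memory_order_relaxed);
            guard.store(1, std::memory_order_relaxed);  // may become visible first
        });
        std::thread b([&] {
            g = guard.load(std::memory_order_relaxed);
            if (g != 0)
                p = payload.load(std::memory_order_relaxed);  // may still read 0
        });
        a.join();
        b.join();
        if (g != 0 && p != 42) ++stale;
    }
    return stale;
}
```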
From the point of view of the CPU running Thread A, there are two stores to two separate addresses. So if Guard is "closer" to the CPU (meaning the store will complete faster), then, seen from another processor, Thread A appears to be performing:

    Thread A
    Guard = 1;
    Payload = 42;
And this order of execution is possible:

    Thread A   Guard = 1;
    Thread B   g = Guard;
    Thread B   if (g != 0) p = Payload;
    Thread A   Payload = 42;
And that's bad, since Thread B read a stale (not yet updated) value of Payload.
It could seem, however, that in Thread B the synchronization would be useless, since the CPU won't do a reordering like:

    Thread B
    if (g != 0) p = Payload;
    g = Guard;
But it actually will.
From its perspective, there are two unrelated loads. It is true that one of them sits on a dependent path, but the CPU can still perform that load speculatively:

    Thread B
    hidden_tmp = Payload;
    g = Guard;
    if (g != 0) p = hidden_tmp;
That may generate the sequence:

    Thread B   hidden_tmp = Payload;
    Thread A   Payload = 42;
    Thread A   Guard = 1;
    Thread B   g = Guard;
    Thread B   if (g != 0) p = hidden_tmp;
Whoops.
Question 2
In general that can never be done.
You can replace memory_order_acquire with memory_order_consume when you are going to create an address dependency between the loaded value and the value(s) whose accesses need to be ordered.
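The address-dependency case is the classic pointer-publication pattern. Below is a minimal sketch under assumed names; Message, g_slot, produce and consume_value are all hypothetical. The consumer dereferences the pointer it loaded, so the read of m->value uses an address produced by the consume load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical message type published from one thread to another.
struct Message {
    int value;
};

std::atomic<Message*> g_slot{nullptr};

// Producer: initialize the object, then publish its address with a release store.
void produce(Message* m) {
    m->value = 42;
    g_slot.store(m, std::memory_order_release);
}

// Consumer: spins until the pointer is published. The loaded pointer forms the
// address of the next read, so there is an address dependency and consume is
// (in principle) sufficient; compilers currently treat it as acquire anyway.
int consume_value() {
    Message* m;
    while ((m = g_slot.load(std::memory_order_consume)) == nullptr) {
        // spin
    }
    return m->value;  // dependent read: ordered after the load of m
}
```

Run produce(&msg) on one thread and consume_value() on another. The consumer should always return 42, never the value msg had before publication.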
To understand memory_order_relaxed, we can take the ARM architecture as a reference.
The ARM architecture mandates only a weak memory ordering, meaning that, in general, the loads and stores of a program can be executed in any order.
str r0, [r2]
str r0, [r3]
In the snippet above, the store to [r3] can be observed, externally, before the store to [r2] (see footnote 1).
However, the CPU doesn't go as far as the Alpha CPU and respects two kinds of dependencies:
- address dependency: a value loaded from memory is used to compute the address of another load/store;
- control dependency: a value loaded from memory is used to compute the control flags of another load/store.
In the presence of such a dependency, the ordering of the two memory operations is guaranteed to be visible in program order:
If there is an address dependency then the two memory accesses are observed in program order.
So, while memory_order_acquire would generate a memory barrier, with memory_order_consume you are telling the compiler that the way you'll use the loaded value will create an address dependency. The compiler can therefore, if this is relevant to the architecture, exploit that fact and omit the memory barrier.
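In C++ source, an address dependency is just ordinary code that uses the loaded value to form an address. std::kill_dependency (from <atomic>) is the standard way to end such a chain when the ordering is not needed. The sketch below is single-threaded and only shows the two shapes; g_table and g_index are hypothetical globals.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared data, used only to show the shapes of the code.
int g_table[4] = {10, 20, 30, 40};
std::atomic<int> g_index{2};

// The loaded index is used to compute the address g_table + idx, so the read
// of g_table[idx] carries a dependency from the consume load.
int read_via_address_dependency() {
    int idx = g_index.load(std::memory_order_consume);
    return g_table[idx];
}

// std::kill_dependency returns a copy of its argument that no longer carries
// the dependency, so this read need not be ordered after the load.
int read_without_dependency() {
    int idx = g_index.load(std::memory_order_consume);
    return g_table[std::kill_dependency(idx)];
}
```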
1 If r2 is the address of a synchronization object, that's bad.