mo_relaxed is fine for both load and store of a stop flag
There’s also no meaningful latency benefit to stronger memory orders, even if the latency of seeing a change to a keep_running or exit_now flag were important.
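As a minimal sketch of the pattern being discussed (the function name run_worker_until_stopped and the iteration counter are made up for illustration, not taken from Herb’s slides), a worker can poll the flag with a relaxed load while the controlling thread sets it with a relaxed store:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: runs a worker until the calling thread asks it to stop,
// then returns how many loop iterations the worker completed.
std::uint64_t run_worker_until_stopped(std::chrono::milliseconds run_time) {
    assert(run_time.count() >= 0);  // precondition of this sketch

    std::atomic<bool> stop{false};
    std::uint64_t iterations = 0;   // written only by the worker, read only after join()

    std::thread worker([&stop, &iterations] {
        // Do one unit of "work", then poll the flag.
        // Relaxed is enough: we only need to see the store eventually,
        // not to order any other memory accesses around it.
        do {
            ++iterations;
        } while (!stop.load(std::memory_order_relaxed));
    });

    std::this_thread::sleep_for(run_time);
    stop.store(true, std::memory_order_relaxed);  // relaxed on the store side too

    // Thread completion synchronizes-with join() returning,
    // so reading the plain (non-atomic) counter afterwards is safe.
    worker.join();
    return iterations;
}
```

The only ordering this sketch relies on comes from join(), not from the flag itself, which is why neither side of the flag needs anything stronger than relaxed.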
IDK why Herb thinks stop.store shouldn’t be relaxed; in his talk, his slides have a comment that says // not relaxed on the assignment, but he doesn’t say anything about the store side before moving on to “is it worth it”.
Of course, the load runs inside the worker loop, but the store runs only once, and Herb really likes to recommend sticking with SC unless you have a performance reason that truly justifies using something else. I hope that wasn’t his only reason; I find that unhelpful when trying to understand what memory order would actually be necessary and why. But anyway, I think it’s either that or a mistake on his part.
The ISO C++ standard says very little about how soon stores become visible or what might influence that; all it has are the two recommendations quoted below. These apply to all atomic operations, including relaxed. They’re normative text, not just notes, but only “should”, not “must”.
ISO C++ section 6.9.2.3 Forward progress
18. An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.
And 33.5.4 Order and consistency [atomics.order] covering only atomics, not mutexes etc.:
11. Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
Inter-thread latency is primarily a quality-of-implementation thing, with the standard leaving things wide open. Normal C++ implementations that work by compiling to asm for some architecture effectively just expose the hardware’s cache-coherence properties, so typically tens of nanoseconds best case, sub-microsecond near-worst case if both threads are currently running on different cores. (Otherwise scheduler timeslice…)
Another thread can loop arbitrarily many times before its load actually sees this store value, even if they’re both seq_cst, assuming there’s no other synchronization of any kind between them. Low inter-thread latency is a performance issue, not a matter of correctness or a formal guarantee.
And non-infinite inter-thread latency is apparently only a “should” QOI (quality of implementation) issue. 😛 Nothing in the standard suggests that seq_cst would help on a hypothetical implementation where store visibility could be delayed indefinitely, although one might guess that could be the case, e.g. on a hypothetical implementation with explicit cache flushes instead of cache coherency. (Although such an implementation is probably not practically usable in terms of performance with CPUs anything like what we have now; every release and/or acquire operation would have to flush the whole cache.)
On real hardware (which uses some form of MESI cache coherency), different memory orders for store or load don’t make stores visible sooner in real time, they just control whether later operations can become globally visible while still waiting for the store to commit from the store buffer to L1d cache. (After invalidating any other copies of the line.)
Stronger orders, and barriers, don’t make things happen sooner in an absolute sense, they just delay other things until they’re allowed to happen relative to the store or load. (This is the case on all real-world CPUs AFAIK; they always try to make stores visible to other cores ASAP anyway, so the store buffer doesn’t fill up.)
See also (my similar answers on):
- Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?
- If I don’t use fences, how long could it take a core to see another core’s writes?
- memory_order_relaxed and visibility
- Thread synchronization: How to guarantee visibility of writes (it’s a non-issue on current real hardware)
The second Q&A is about x86, where commit from the store buffer to L1d cache is in program order. That limits how far execution can get past a cache-miss store, and also limits any possible benefit of putting a release or seq_cst fence after the store to prevent later stores (and loads) from maybe competing for resources. (x86 microarchitectures will do RFO (read for ownership) before stores reach the head of the store buffer, and plain loads normally compete for resources to track RFOs we’re waiting for a response to.) But these effects are extremely minor for something like telling another thread to exit; they only involve very small-scale reordering.
“because who cares if the thread stops with a slightly bigger delay.”
More like, who cares if the thread gets more work done by not making loads/stores after the load wait for the check to complete. (Of course, this work will get discarded if it’s in the shadow of a mis-speculated branch on the load result when we eventually load true.) The cost of rolling back to a consistent state after a branch mispredict is more or less independent of how much already-executed work had happened beyond the mispredicted branch. And it’s a stop flag which presumably doesn’t get set very often, so the total amount of wasted work costing cache/memory bandwidth for other CPUs is pretty minimal.
That phrasing makes it sound like an acquire load or release store would actually get the store seen sooner in absolute real time, rather than just relative to other code in this thread. (Which is not the case.)
The benefit is more instruction-level and memory-level parallelism across loop iterations when the load produces false. And simply avoiding running extra instructions on ISAs where an acquire or especially an SC load needs extra instructions, especially expensive 2-way barrier instructions (like PowerPC isync/sync or especially ARMv7 dmb ish full barrier even for acquire), not like ARMv8 ldapr or x86 mov acquire-load instructions. (Godbolt)
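To compare the generated code yourself, something like the following sketch can be pasted into a compiler explorer and built for different targets. The names stop_flag, load_relaxed, load_acquire, load_seq_cst and check_orders_agree are placeholders I made up; the point is only that the three loads differ in the ordering they request, not in the value they return.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical global flag, so each function below compiles to a standalone load.
std::atomic<bool> stop_flag{false};

// Same load, three different memory orders. On some ISAs the stronger ones
// need extra barrier instructions; on others they compile to the same plain load.
bool load_relaxed() { return stop_flag.load(std::memory_order_relaxed); }
bool load_acquire() { return stop_flag.load(std::memory_order_acquire); }
bool load_seq_cst() { return stop_flag.load(std::memory_order_seq_cst); }

// Single-threaded sanity check: the memory order changes what can be
// reordered around the load, not which value the load returns.
void check_orders_agree() {
    const bool r = load_relaxed();
    assert(load_acquire() == r);
    assert(load_seq_cst() == r);
}
```

Which instructions each function turns into depends on the target and compiler options, so treat this as a starting point for looking at the asm rather than as a statement about any particular compiler’s output.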
BTW, Herb is right that the dirty flag can also be relaxed, but only because of the thread.join sync between the reader and any possible writer. Otherwise yeah, release / acquire.
But in this case, dirty only needs to be atomic<> at all because of possible simultaneous writers all storing the same value, which ISO C++ still deems data-race UB, e.g. because of the theoretical possibility of hardware race-detection that traps on conflicting non-atomic accesses. (Or a software implementation like clang -fsanitize=thread.)
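A rough sketch of that dirty-flag situation, assuming several workers that may each store the same value and a reader that only looks after joining all of them (any_worker_marked_dirty and the even/odd rule for which workers write are invented for the example):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: starts num_workers threads, some of which mark the
// shared flag dirty, and reports the flag's value after all have finished.
bool any_worker_marked_dirty(int num_workers) {
    assert(num_workers >= 0);  // precondition of this sketch

    // atomic<> because several writers may store concurrently;
    // with a plain bool that would be a data race even though
    // every writer stores the same value.
    std::atomic<bool> dirty{false};

    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i) {
        workers.emplace_back([&dirty, i] {
            if (i % 2 == 0) {  // made-up condition: even-numbered workers "found changes"
                dirty.store(true, std::memory_order_relaxed);
            }
        });
    }

    // join() provides the synchronization between each writer and the reader,
    // which is the only reason the load below can be relaxed.
    for (auto& t : workers) {
        t.join();
    }
    return dirty.load(std::memory_order_relaxed);
}
```

If the reader checked dirty while writers were still running and then went on to read data they produced, the store and load would need release / acquire instead.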
Fun fact: C++20 introduced std::stop_token for use as a stop or keep_running flag.