When to use volatile with multi threading?

In C++11, don’t use volatile for threading, only for MMIO.

But TL;DR: it does “work”, sort of like atomic with mo_relaxed, on hardware with coherent caches (i.e. everything); it is sufficient to stop compilers from keeping variables in registers. atomic doesn’t need memory barriers to create atomicity or inter-thread visibility, only to make the current thread wait before/after an operation to create ordering between this thread’s accesses to different variables. mo_relaxed never needs any barriers, just a load, store, or RMW.

For roll-your-own atomics with volatile (and inline-asm for barriers) in the bad old days before C++11 std::atomic, volatile was the only good way to get some things to work. But it depended on a lot of assumptions about how implementations worked and was never guaranteed by any standard.

For example the Linux kernel still uses its own hand-rolled atomics with volatile, but only supports a few specific C implementations (GNU C, clang, and maybe ICC). Partly that’s because of GNU C extensions and inline asm syntax and semantics, but also because it depends on some assumptions about how compilers work.

volatile is almost always the wrong choice for new projects; you can use std::atomic (with std::memory_order_relaxed) to get a compiler to emit the same efficient machine code you could with volatile. std::atomic with mo_relaxed obsoletes volatile for threading purposes (except maybe to work around missed-optimization bugs with atomic<double> on some compilers).

The internal implementation of std::atomic on mainstream compilers (like gcc and clang) does not just use volatile internally; compilers directly expose atomic load, store and RMW builtin functions. (e.g. GNU C __atomic builtins which operate on “plain” objects.)
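As a rough illustration of those builtins, here is a minimal sketch that assumes GCC or clang (the __atomic builtins are compiler extensions, not ISO C++). The names plain_counter and bump_counter are made up for this example.

```cpp
// GCC/clang only: __atomic_* are GNU C extensions, not part of ISO C++.
// plain_counter and bump_counter are hypothetical names for illustration.
int plain_counter = 0;   // an ordinary int, not std::atomic<int>

int bump_counter() {
    // Atomic store to a "plain" object with relaxed ordering.
    __atomic_store_n(&plain_counter, 41, __ATOMIC_RELAXED);
    // Atomic read-modify-write; the return value (the old value) is ignored here.
    __atomic_fetch_add(&plain_counter, 1, __ATOMIC_RELAXED);
    // Atomic load of the same plain object.
    return __atomic_load_n(&plain_counter, __ATOMIC_RELAXED);
}
```

In portable code you would still write std::atomic<int> and let the library pick whatever the compiler provides.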


Volatile is usable in practice (but don’t do it)

That said, volatile is usable in practice for things like an exit_now flag on all(?) existing C++ implementations on real CPUs, because of how CPUs work (coherent caches) and shared assumptions about how volatile should work. But not much else, and is not recommended. The purpose of this answer is to explain how existing CPUs and C++ implementations actually work. If you don’t care about that, all you need to know is that std::atomic with mo_relaxed obsoletes volatile for threading.

(The ISO C++ standard is pretty vague on it, just saying that volatile accesses should be evaluated strictly according to the rules of the C++ abstract machine, not optimized away. Given that real implementations use the machine’s memory address-space to model C++ address space, this means volatile reads and assignments have to compile to load/store instructions to access the object-representation in memory.)


As another answer points out, an exit_now flag is a simple case of inter-thread communication that doesn’t need any synchronization: it’s not publishing that array contents are ready or anything like that. Just a store that’s noticed promptly by a not-optimized-away load in another thread.

    // global
    bool exit_now = false;

    // in one thread
    while (!exit_now) { do_stuff; }

    // in another thread, or signal handler in this thread
    exit_now = true;

Without volatile or atomic, the as-if rule and assumption of no data-race UB allows a compiler to optimize it into asm that only checks the flag once, before entering (or not) an infinite loop. This is exactly what happens in real life for real compilers. (And usually optimize away much of do_stuff because the loop never exits, so any later code that might have used the result is not reachable if we enter the loop).

    // Optimizing compilers transform the loop into asm like this
    if (!exit_now) {        // check once before entering loop
        while(1) do_stuff;  // infinite loop
    }

“Multithreading program stuck in optimized mode but runs normally in -O0” is an example (with a description of GCC’s asm output) of exactly how this happens with GCC on x86-64. “MCU programming – C++ O2 optimization breaks while loop” on electronics.SE shows another example.

We normally want aggressive optimizations that CSE and hoist loads out of loops, including for global variables.

Before C++11, volatile bool exit_now was one way to make this work as intended (on normal C++ implementations). But in C++11, data-race UB still applies to volatile so it’s not actually guaranteed by the ISO standard to work everywhere, even assuming HW coherent caches.

Note that for wider types, volatile gives no guarantee of lack of tearing. I ignored that distinction here for bool because it’s a non-issue on normal implementations. But that’s also part of why volatile is still subject to data-race UB instead of being equivalent to relaxed atomic.

Note that “as intended” doesn’t mean the thread doing exit_now = true waits for the other thread to actually exit, or even that it waits for the volatile exit_now = true store to be globally visible before continuing to later operations in this thread. (atomic<bool> with the default mo_seq_cst would make it wait before any later seq_cst loads at least. On many ISAs you’d just get a full barrier after the store.)

C++11 provides a non-UB way that compiles the same

A “keep running” or “exit now” flag should use std::atomic<bool> flag with mo_relaxed.

Using

  • flag.store(true, std::memory_order_relaxed)
  • while( !flag.load(std::memory_order_relaxed) ) { ... }

will give you the exact same asm (with no expensive barrier instructions) that you’d get from volatile flag.
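Put together, the pattern might look like the sketch below. The names run_worker_until_told_to_stop, started and iterations are invented for this example, and ++iterations is only a stand-in for do_stuff.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names throughout; this is a sketch, not a canonical pattern.
std::atomic<bool> exit_now{false};

// Starts a worker that loops until exit_now is set, then returns how many
// iterations the worker completed.
long run_worker_until_told_to_stop() {
    exit_now.store(false, std::memory_order_relaxed);
    std::atomic<bool> started{false};
    long iterations = 0;   // written only by the worker, read after join()

    std::thread worker([&] {
        // A real load on every iteration, with no barrier instructions.
        while (!exit_now.load(std::memory_order_relaxed)) {
            ++iterations;  // stand-in for do_stuff
            started.store(true, std::memory_order_relaxed);
        }
    });

    // Let the worker get through at least one iteration before stopping it.
    while (!started.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    exit_now.store(true, std::memory_order_relaxed);

    worker.join();         // join() synchronizes, so iterations is safe to read
    return iterations;
}
```

The relaxed store gives no ordering with respect to other variables; the only thing the example relies on is that the worker’s load eventually sees it.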

As well as no-tearing, atomic also gives you the ability to store in one thread and load in another without UB, so the compiler can’t hoist the load out of a loop. (The assumption of no data-race UB is what allows the aggressive optimizations we want for non-atomic non-volatile objects.) This feature of atomic<T> is pretty much the same as what volatile does for pure loads and pure stores.

atomic<T> also makes += and so on into atomic RMW operations, which are significantly more expensive than an atomic load into a temporary, an operation on it, then a separate atomic store. If you don’t want an atomic RMW, write your code with a local temporary.
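A small sketch of the two alternatives, with a hypothetical counter named hits. The second version is only correct under the assumption that the calling thread is the sole writer, because another thread’s update between the load and the store would be lost.

```cpp
#include <atomic>

std::atomic<int> hits{0};  // hypothetical counter for illustration

// Atomic RMW: safe with any number of writers, but the more expensive option.
void add_rmw(int n) {
    hits.fetch_add(n, std::memory_order_relaxed);
}

// Separate atomic load and atomic store via a local temporary.
// Assumption: this thread is the only writer; readers can be anywhere.
void add_single_writer(int n) {
    int tmp = hits.load(std::memory_order_relaxed);
    tmp += n;
    hits.store(tmp, std::memory_order_relaxed);
}
```

Writing hits += n would behave like add_rmw, but with the default seq_cst ordering.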

With the default seq_cst ordering you’d get from while(!flag), it also adds ordering guarantees wrt. non-atomic accesses, and to other atomic accesses.

(In theory, the ISO C++ standard doesn’t rule out compile-time optimization of atomics. But in practice compilers don’t because there’s no way to control when that wouldn’t be ok. There are a few cases where even volatile atomic<T> might not be enough control over optimization of atomics if compilers did optimize, so for now compilers don’t. See “Why don’t compilers merge redundant std::atomic writes?”. Note that wg21/p0062 recommends against using volatile atomic in current code to guard against optimization of atomics.)


volatile does actually work for this on real CPUs (but still don’t use it)

This holds even with weakly-ordered memory models (non-x86). But don’t actually use it; use atomic<T> with mo_relaxed instead! The point of this section is to address misconceptions about how real CPUs work, not to justify volatile. If you’re writing lockless code, you probably care about performance. Understanding caches and the costs of inter-thread communication is usually important for good performance.

Real CPUs have coherent caches / shared memory: after a store from one core becomes globally visible, no other core can load a stale value. (See also Myths Programmers Believe about CPU Caches which talks some about Java volatiles, equivalent to C++ atomic<T> with seq_cst memory order.)

When I say load, I mean an asm instruction that accesses memory. That’s what a volatile access ensures, and is not the same thing as lvalue-to-rvalue conversion of a non-atomic / non-volatile C++ variable. (e.g. local_tmp = flag or while(!flag)).

The only thing you need to defeat is compile-time optimizations that don’t reload at all after the first check. Any load+check on each iteration is sufficient, without any ordering. Without synchronization between this thread and the main thread, it’s not meaningful to talk about when exactly the store happened, or ordering of the load wrt. other operations in the loop. All that matters is when the store becomes visible to this thread. When you see the exit_now flag set, you exit. Inter-core latency on a typical x86 Xeon can be something like 40 ns between separate physical cores.


In theory: C++ threads on hardware without coherent caches

I don’t see any way this could be remotely efficient, with just pure ISO C++ without requiring the programmer to do explicit flushes in the source code.

In theory you could have a C++ implementation on a machine that wasn’t like this, requiring compiler-generated explicit flushes to make things visible to other threads on other cores. (Or for reads to not use a maybe-stale copy). The C++ standard doesn’t make this impossible, but C++’s memory model is designed around being efficient on coherent shared-memory machines. E.g. the C++ standard even talks about “read-read coherence”, “write-read coherence”, etc. One note in the standard even points out the connection to hardware:

http://eel.is/c++draft/intro.races#19

[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note ]

There’s no mechanism for a release store to only flush itself and a few select address-ranges: it would have to sync everything because it wouldn’t know what other threads might want to read if their acquire-load saw this release-store (creating a synchronizes-with relationship that establishes happens-before across threads, guaranteeing that earlier non-atomic operations done by the writing thread are now safe to read, unless it did further writes to them after the release store…). Or compilers would have to be really smart to prove that only a few cache lines needed flushing.

Related: my answer on Is mov + mfence safe on NUMA? goes into detail about the non-existence of x86 systems without coherent shared memory. Also related: Loads and stores reordering on ARM for more about loads/stores to the same location.

There are I think clusters with non-coherent shared memory, but they’re not single-system-image machines. Each coherency domain runs a separate kernel, so you can’t run threads of a single C++ program across it. Instead you run separate instances of the program (each with their own address space: pointers in one instance aren’t valid in the other).

To get them to communicate with each other via explicit flushes, you’d typically use MPI or other message-passing API to make the program specify which address ranges need flushing.


Real hardware doesn’t run std::thread across cache coherency boundaries:

Some asymmetric ARM chips exist, with a shared physical address space but not inner-shareable cache domains, so they’re not coherent. (e.g. an A8 core and a Cortex-M3, as in the TI Sitara AM335x, discussed in a comment thread.)

But different kernels would run on those cores, not a single system image that could run threads across both cores. I’m not aware of any C++ implementations that run std::thread threads across CPU cores without coherent caches.

For ARM specifically, GCC and clang generate code assuming all threads run in the same inner-shareable domain. In fact, the ARMv7 ISA manual says

This architecture (ARMv7) is written with an expectation that all processors using the same operating system or hypervisor are in the same Inner Shareable shareability domain

So non-coherent shared memory between separate domains is only a thing for explicit system-specific use of shared memory regions for communication between different processes under different kernels.

See also this CoreCLR discussion about code-gen using dmb ish (Inner Shareable barrier) vs. dmb sy (System) memory barriers in that compiler.

I make the assertion that no C++ implementation for any other ISA runs std::thread across cores with non-coherent caches. I don’t have proof that no such implementation exists, but it seems highly unlikely. Unless you’re targeting a specific exotic piece of HW that works that way, your thinking about performance should assume MESI-like cache coherency between all threads. (Preferably use atomic<T> in ways that guarantee correctness, though!)


Coherent caches make it simple

But on a multi-core system with coherent caches, implementing a release-store just means ordering commit into cache for this thread’s stores, not doing any explicit flushing. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/). (And an acquire-load means ordering access to cache in the other core).
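For reference, the release-store / acquire-load pairing described above looks roughly like this in source. The names payload, ready and publish_and_consume are hypothetical. Nothing in the code flushes anything explicitly; the release store only has to be ordered after the earlier plain store.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain, non-atomic data (hypothetical name)
std::atomic<bool> ready{false};  // flag that publishes the payload

// Publishes a value from the calling thread and returns what a consumer
// thread read after seeing the flag.
int publish_and_consume() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread consumer([&] {
        // Acquire load: once it reads true, the earlier store to payload
        // in the publishing thread is guaranteed to be visible here.
        while (!ready.load(std::memory_order_acquire)) {
        }
        seen = payload;
    });

    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // ordered after the store above

    consumer.join();
    return seen;
}
```

With mo_relaxed on both sides the flag itself would still become visible promptly, but reading payload would be a data race.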

A memory barrier instruction just blocks the current thread’s loads and/or stores until the store buffer drains; that always happens as fast as possible on its own. (Or for LoadLoad / LoadStore barriers, block until previous loads have completed.) (Does a memory barrier ensure that the cache coherence has been completed? addresses this misconception). So if you don’t need ordering, just prompt visibility in other threads, mo_relaxed is fine. (And so is volatile, but don’t do that.)

See also C/C++11 mappings to processors

Fun fact: on x86, every asm store is a release-store because the x86 memory model is basically seq-cst plus a store buffer (with store forwarding).


Semi-related re: store buffer, global visibility, and coherency: C++11 guarantees very little. Most real ISAs (except PowerPC) do guarantee that all threads can agree on the order of appearance of two stores by two other threads. (In formal computer-architecture memory model terminology, they’re “multi-copy atomic”.)

  • Will two atomic writes to different locations in different threads always be seen in the same order by other threads?
  • Concurrent stores seen in a consistent order

Another misconception is that memory fence asm instructions are needed to flush the store buffer for other cores to see our stores at all. Actually the store buffer is always trying to drain itself (commit to L1d cache) as fast as possible, otherwise it would fill up and stall execution. What a full barrier / fence does is stall the current thread until the store buffer is drained, so our later loads appear in the global order after our earlier stores.

  • Are loads and stores the only instructions that gets reordered?
  • x86 mfence and C++ memory barrier
  • Globally Invisible load instructions
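The classic store-buffering litmus test shows what a full barrier is actually for. This is a sketch with hypothetical names (x, y, count_both_zero): each thread stores to one variable and then loads the other. With seq_cst on every operation, the C++ memory model forbids both loads reading 0. With weaker orderings that outcome would be allowed, because each store can still be sitting in its core’s store buffer when the other core’s load executes.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};  // hypothetical shared variables

// Runs the store-buffering litmus test `rounds` times and returns how often
// both threads read 0, which seq_cst ordering rules out.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        x.store(0);
        y.store(0);
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);  // ordered after the store
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);  // ordered after the store
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

Neither store needs a barrier to become visible to the other thread; the barrier only makes each thread’s later load wait for its own earlier store.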

(x86’s strongly ordered asm memory model means that volatile on x86 may end up giving you closer to mo_acq_rel, except that compile-time reordering with non-atomic variables can still happen. But most non-x86 have weakly-ordered memory models so volatile and relaxed are about as weak as mo_relaxed allows.)
