Which types on a 64-bit computer are naturally atomic in GNU C and GNU C++? — meaning they have atomic reads and atomic writes

The answer from the point of view of the language standard is very simple: none of them are “definitively automatically” atomic.

First of all, it’s important to distinguish between two senses of “atomic”.

  • One is atomic with respect to signals. This ensures, for instance, that when you do x = 5 on a sig_atomic_t, a signal handler invoked in the current thread will see either the old or the new value. This is usually accomplished simply by doing the access in one instruction, since an asynchronous signal can only be delivered between instructions. For instance, x86 add dword ptr [var], 12345, even without a lock prefix, is atomic in this sense.

  • The other is atomic with respect to threads, so that another thread accessing the object concurrently will see a correct value. This is more difficult to get right. In particular, ordinary variables of type sig_atomic_t are not atomic with respect to threads. You need _Atomic or std::atomic to get that, as the sketch after this list shows.
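
To make the distinction concrete, here is a minimal sketch (the names done, counter, handler, and worker are purely illustrative): volatile sig_atomic_t is the right tool for a flag shared with a signal handler running in the same thread, while _Atomic from C11’s <stdatomic.h> is what you need when the object is shared between threads.

#include <signal.h>
#include <stdatomic.h>

volatile sig_atomic_t done = 0;   /* atomic w.r.t. signals only */
_Atomic unsigned counter = 0;     /* atomic w.r.t. threads as well */

void handler(int sig) {
    (void)sig;
    done = 1;   /* the interrupted code sees either 0 or 1, never a torn value */
}

void worker(void) {
    /* safe against concurrent access from other threads */
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}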

Note well that the internal names your implementation chooses for its types are not evidence of anything. From typedef int _Atomic_word; I would certainly not infer that “int is clearly atomic”; I don’t know in what sense the implementers were using the word “atomic”, or whether it’s even accurate (it could be a name kept around for legacy code, for instance). If they wanted to make such a promise, it would be in the documentation, not in an unexplained typedef in a bits header that was never meant to be seen by the application programmer.


The fact that your hardware may make certain types of access “automatically atomic” does not tell you anything at the level of C/C++. For instance, it is true on x86 that ordinary full-size loads and stores to naturally aligned variables are atomic. But in the absence of std::atomic, the compiler is under no obligation to emit ordinary full-size loads and stores; it is entitled to be clever and access those variables in other ways. It “knows” this will be no problem, because concurrent access would be a data race, and of course the programmer would never write code with a data race, would they?

As a concrete example, consider the following code:

unsigned x;

unsigned foo(void) {
    return (x >> 8) & 0xffff;
}

A load of a nice 32-bit integer variable, followed by some arithmetic. What could be more innocent? Yet check out the assembly emitted by GCC 11.2 -O2 (try it on godbolt):

foo:
        movzx   eax, WORD PTR x[rip+1]
        ret

Oh dear. A partial load, and unaligned to boot.

Fortunately, x86 does guarantee that a 16-bit load or store contained within an aligned dword is atomic, even if misaligned, on P5 Pentium or later. In fact, any 1-, 2-, or 4-byte load or store that fits within an aligned 8-byte region is atomic on x86-64, so this would have been a valid optimization even if x had been std::atomic<int>; in that case, however, GCC misses the optimization.

Both Intel and AMD document this guarantee separately, Intel for P5 Pentium and later (which includes all of their x86-64 CPUs). There is no single “x86” document that lists the common subset of atomicity guarantees; a Stack Overflow answer combines the guarantees from those two vendors. Presumably the same accesses are also atomic on other vendors’ CPUs, such as Via / Zhaoxin.

Hopefully the same is guaranteed by any emulators or binary translators that turn such x86 instructions into, say, AArch64 machine code, but that is definitely something to worry about if the host machine does not offer a matching atomicity guarantee.


Here is another fun example, this time on ARM64. Aligned 64-bit stores are atomic, per B2.2.1 of the ARMv8-A Architecture Reference Manual. So this looks fine:

unsigned long x;

void bar(void) {
    x = 0xdeadbeefdeadbeef;
}

But GCC 11.2 -O2 gives (godbolt):

bar:
        adrp    x1, .LANCHOR0
        add     x2, x1, :lo12:.LANCHOR0
        mov     w0, 48879
        movk    w0, 0xdead, lsl 16
        str     w0, [x1, #:lo12:.LANCHOR0]
        str     w0, [x2, 4]
        ret

That’s two 32-bit strs, not atomic in any way. A reader may very well read 0x00000000deadbeef.

Why do it this way? Materializing a 64-bit constant in a register takes several instructions on ARM64, with its fixed instruction size. But both halves of the value are equal, so why not materialize the 32-bit value and store it to each half?

(If you do unsigned long *p; *p = 0xdeadbeefdeadbeef; then you get stp w1, w1, [x0] (godbolt), which looks more promising since it is a single instruction, but is in fact still two separate writes for purposes of atomicity between threads.)
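
For completeness, here is a sketch of the fix: declare x as _Atomic (std::atomic in C++), and the compiler is no longer allowed to split the store. Even a relaxed store rules out tearing; GCC then emits a single 64-bit str for it (or an stlr for the default seq_cst ordering).

#include <stdatomic.h>

_Atomic unsigned long x;

void bar(void) {
    /* one single-copy-atomic 64-bit store; no thread can observe half of it */
    atomic_store_explicit(&x, 0xdeadbeefdeadbeef, memory_order_relaxed);
}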


User supercat’s answer to “Are concurrent unordered writes with fencing to shared memory undefined behavior?” has another nice example for ARM32 Thumb, where the C source asks for an unsigned short to be loaded once, but the generated code loads it twice. In the presence of concurrent writes, you could get an “impossible” result.

One can provoke the same on x86-64 (godbolt):

_Bool x, y, z;

void foo(void) {
    _Bool tmp = x;
    y = tmp;
    // imagine elaborate computation here that needs lots of registers
    z = tmp;
}

GCC will reload x instead of spilling tmp. On x86 you can load a global with just one instruction, but spilling to the stack would need at least two. So if x is being concurrently modified, either by threads or by signals/interrupts, then assert(y == z) afterwards could fail.
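
Again, the cure is to say what you mean in the types. A sketch, assuming the concurrent writers also use atomics: with an atomic load, the compiler must read x exactly once here, so tmp cannot change behind your back and assert(y == z) can no longer fail.

#include <stdatomic.h>

_Atomic _Bool x;
_Bool y, z;

void foo(void) {
    /* a single atomic load: the compiler may not re-read x later */
    _Bool tmp = atomic_load_explicit(&x, memory_order_relaxed);
    y = tmp;
    /* imagine elaborate computation here that needs lots of registers */
    z = tmp;
}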


It really isn’t safe to assume anything beyond what the language actually guarantees, which is nothing unless you use _Atomic or std::atomic. Modern compilers know the exact limits of the language rules very well, and they optimize aggressively. They can and will break code that assumes they will do what would be “natural” whenever that code steps outside the bounds of what the language promises, and they will very often do it in ways one would never expect.
