Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

Basically no: it has no significant effect on inter-core latency, and it is definitely never worth using “blindly” without careful profiling if you suspect there might be contention from later loads missing in cache. It’s a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact, barriers just make this core wait … Read more

Acquire/release semantics with 4 threads

You are thinking in terms of sequential consistency, the strongest (and default) memory order. If this memory order is used, all accesses to atomic variables constitute a total order, and the assertion indeed cannot be triggered. However, in this program, a weaker memory order is used (release stores and acquire loads). This means, by definition … Read more
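A minimal sketch of the 4-thread IRIW (independent reads of independent writes) pattern this kind of question describes; the exact program from the question isn’t reproduced here, so the names are illustrative. With release stores and acquire loads only, the two readers are allowed to disagree about the order of the two independent stores, so the commented assertion may fail:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

// Two independent writers.
void w1() { x.store(1, std::memory_order_release); }
void w2() { y.store(1, std::memory_order_release); }

// Two readers observing the two stores in opposite orders.
void rd1() { r1 = x.load(std::memory_order_acquire);
             r2 = y.load(std::memory_order_acquire); }
void rd2() { r3 = y.load(std::memory_order_acquire);
             r4 = x.load(std::memory_order_acquire); }

int main() {
    std::thread a(w1), b(w2), c(rd1), d(rd2);
    a.join(); b.join(); c.join(); d.join();
    // With acquire/release there is no single total order over the two stores,
    // so r1==1, r2==0, r3==1, r4==0 is a permitted outcome.
    // With memory_order_seq_cst everywhere, that outcome is forbidden.
    // assert(!(r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0));  // may fire under acq/rel
}
```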

Is there any compiler barrier which is equal to asm(“” ::: “memory”) in C++11?

Re: your edit: “But I do not want to use atomic variables.” Why not? If it’s for performance reasons, use them with memory_order_relaxed and atomic_signal_fence(mo_whatever) to block compiler reordering; the only runtime cost is that the compiler barrier may block some compile-time optimizations, depending on the surrounding code. If it’s for some other reason, then … Read more
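A hedged sketch of that suggestion (variable names are illustrative, not from the question): relaxed atomic accesses plus atomic_signal_fence act as a pure compile-time ordering barrier, which is essentially what asm("" ::: "memory") provides, without emitting any fence instruction:

```cpp
#include <atomic>

std::atomic<int> data{0};
std::atomic<bool> flag{false};

void producer() {
    data.store(42, std::memory_order_relaxed);
    // Compiler-only barrier: stops the compiler from reordering the surrounding
    // accesses across this point, but emits no fence instruction at runtime.
    std::atomic_signal_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}
```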

How many memory barrier instructions does an x86 CPU have?

sfence (SSE1) and mfence / lfence (SSE2) are the only instructions that are named for their memory fence/barrier functionality. Unless you’re using NT stores and/or WC memory (and NT loads from WC), only mfence is needed for memory ordering. (Note that lfence on Intel CPUs is also a barrier for out-of-order execution, so it can … Read more
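As a sketch of the NT-store case mentioned above (not taken from the linked answer): sfence is the fence you need after weakly-ordered non-temporal stores such as _mm_stream_si128, before publishing a flag with a normal store:

```cpp
#include <emmintrin.h>   // SSE2: _mm_stream_si128; also pulls in _mm_sfence (SSE1)
#include <atomic>

alignas(16) int buffer[4];
std::atomic<bool> ready{false};

void publish() {
    __m128i v = _mm_set1_epi32(1);
    // Non-temporal store: bypasses the cache and is weakly ordered.
    _mm_stream_si128(reinterpret_cast<__m128i*>(buffer), v);
    // sfence makes the NT store globally visible before any later store,
    // restoring the ordering a normal release store would have had.
    _mm_sfence();
    ready.store(true, std::memory_order_relaxed);
}
```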

Atomicity of loads and stores on x86

“It sounds like the atomic operations on memory will be executed directly on memory (RAM).” Nope: as long as every possible observer in the system sees the operation as atomic, the operation can involve only the cache. Satisfying this requirement is much more difficult for atomic read-modify-write operations (like lock add [mem], eax, especially with an … Read more
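A small sketch of that read-modify-write case: a std::atomic fetch_add typically compiles to a lock-prefixed instruction on x86 (lock add / lock xadd), and the lock prefix keeps the cache line exclusively owned by this core for the duration of the operation rather than going out to DRAM:

```cpp
#include <atomic>

std::atomic<int> counter{0};

int bump() {
    // On x86 this typically becomes `lock xadd` (or `lock add` if the old value
    // is unused): the whole read-modify-write happens atomically in this core's
    // cache line; no access to RAM is implied.
    return counter.fetch_add(1, std::memory_order_seq_cst);
}
```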

Is LFENCE serializing on AMD processors?

AMD has always described their implementation of LFENCE in their manuals as a load-serializing instruction: “Acts as a barrier to force strong memory ordering (serialization) between load instructions preceding the LFENCE and load instructions that follow the LFENCE.” The original use case for LFENCE was ordering WC-memory-type loads. However, after the speculative … Read more
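A hedged sketch of the post-Spectre usage this is leading towards: once LFENCE is guaranteed to be dispatch-serializing (on some AMD CPUs this depends on an MSR bit the kernel sets), it can serve as a cheap execution barrier after a bounds check. The table and function names here are purely illustrative:

```cpp
#include <emmintrin.h>   // _mm_lfence
#include <cstddef>

int table[256];

int load_checked(std::size_t idx, std::size_t limit) {
    if (idx < limit) {
        // On CPUs where LFENCE is (dispatch-)serializing, later instructions
        // cannot start executing, even speculatively, until the fence retires,
        // blocking the bounds-check-bypass (Spectre v1) gadget.
        _mm_lfence();
        return table[idx];
    }
    return 0;
}
```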

Does it make any sense to use the LFENCE instruction on x86/x86_64 processors?

Bottom line (TL;DR): LFENCE alone indeed seems useless for memory ordering; however, that does not make SFENCE a substitute for MFENCE. The “arithmetic” logic in the question is not applicable. Here is an excerpt from Intel’s Software Developer’s Manual, volume 3, section 8.2.2 (edition 325384-052US, September 2014), the same that I used in … Read more
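A sketch of why SFENCE + LFENCE does not add up to MFENCE: neither of them orders an earlier store against a later load, which is exactly what a Dekker/Peterson-style handshake needs, so only a full barrier (mfence, or a locked instruction) works here. The flag names are illustrative:

```cpp
#include <atomic>

std::atomic<int> want0{0}, want1{0};

bool thread0_try_enter() {
    want0.store(1, std::memory_order_relaxed);
    // StoreLoad barrier: neither sfence nor lfence (nor both together) provides
    // this; on x86 it requires mfence or a lock-prefixed instruction.
    // A seq_cst fence typically compiles to mfence here.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return want1.load(std::memory_order_relaxed) == 0;
}
```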

How do I Understand Read Memory Barriers and Volatile

There are read barriers and write barriers; acquire barriers and release barriers. And more (I/O vs. memory, etc.). The barriers are not there to control which value is “latest” or how “fresh” values are; they are there to control the relative ordering of memory accesses. Write barriers control the order of writes. Because writes to memory are … Read more
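A minimal sketch of that pairing, assuming a simple producer/consumer flag (names are illustrative): the release fence orders the data write before the flag write, and the matching acquire fence on the reader side orders the flag read before the data read. Neither fence makes any value arrive “sooner”; they only constrain ordering:

```cpp
#include <atomic>

int payload;                        // ordinary, non-atomic data
std::atomic<bool> published{false};

void writer() {
    payload = 123;
    // Write (release) barrier: keeps the payload store before the flag store.
    std::atomic_thread_fence(std::memory_order_release);
    published.store(true, std::memory_order_relaxed);
}

int reader() {
    while (!published.load(std::memory_order_relaxed)) { /* spin */ }
    // Read (acquire) barrier: keeps the flag load before the payload load.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;   // guaranteed to see 123 once the flag was observed true
}
```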

How does a mutex lock and unlock functions prevents CPU reordering?

The short answer is that the bodies of the pthread_mutex_lock and pthread_mutex_unlock calls include the necessary platform-specific memory barriers, which prevent the CPU from moving memory accesses inside the critical section out of it. The instruction flow moves from the calling code into the lock and unlock functions via a call instruction, … Read more
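A small sketch of the shape being described, using the standard pthread API: the barriers live inside pthread_mutex_lock/unlock, so the access to the shared variable cannot be moved out of the critical section by either the compiler or the CPU. The variable names are illustrative:

```cpp
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
long shared_counter = 0;

void* worker(void*) {
    pthread_mutex_lock(&m);     // contains an acquire barrier
    ++shared_counter;           // cannot be hoisted above the lock or sunk below the unlock
    pthread_mutex_unlock(&m);   // contains a release barrier
    return nullptr;
}
```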
