CPU cache inhibition

x86 has no way to do a store that bypasses or writes through L1D/L2 but not L3. There are NT stores, which bypass the entire cache hierarchy. Anything that forces a write-back to L3 also forces a write-back all the way to memory (e.g. a clwb instruction). Those are designed for non-volatile RAM use cases, or for non-coherent … Read more
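
A minimal sketch of the two mechanisms mentioned above, written with compiler intrinsics; the file name, buffer size, and build flags are assumptions, and clwb needs a CPU and compiler that support the CLWB extension:

    /* Sketch only: a normal store plus clwb, and a non-temporal store.
     * Possible build: gcc -O2 -msse2 -mclwb nt_demo.c
     */
    #include <emmintrin.h>   /* _mm_stream_si128, _mm_sfence, _mm_set1_epi8 */
    #include <immintrin.h>   /* _mm_clwb (CLWB extension) */
    #include <stdint.h>
    #include <stdlib.h>

    int main(void) {
        /* 64-byte aligned buffer so cache lines and NT stores line up cleanly. */
        __m128i *buf = aligned_alloc(64, 4096);
        if (!buf) return 1;

        /* Normal store: goes through the cache and marks the line dirty. */
        ((volatile uint8_t *)buf)[0] = 0x2a;

        /* clwb: write the dirty line back to memory (there is no "to L3 only"
         * option), possibly keeping a clean copy in the cache. */
        _mm_clwb(buf);
        _mm_sfence();

        /* NT store: write-combining store that bypasses L1D/L2/L3 entirely,
         * so the streamed data does not pollute the caches. */
        _mm_stream_si128(&buf[4], _mm_set1_epi8(0x2a));
        _mm_sfence();    /* order the NT store before later loads/stores */

        free(buf);
        return 0;
    }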

What specifically marks an x86 cache line as dirty – any write, or is an explicit change required?

Currently no implementation of x86 (or of any other ISA, as far as I know) supports optimizing away silent stores. There has been academic research on this, and there is even a patent on “eliminating silent store invalidation propagation in shared memory cache coherency protocols”. (Google ‘”silent store” cache’ if you are interested in more.) For x86, … Read more
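
To make the term concrete, here is a hedged sketch of what a “silent store” looks like at the source level; the buffer name and size are illustrative:

    /* A "silent store" writes a value identical to the one already in memory.
     * On current x86 CPUs the line is still marked dirty (Modified in MESI),
     * so it is still written back later and still invalidates other cores'
     * copies if the line was shared.
     */
    #include <stddef.h>

    #define N (64 * 1024 * 1024)
    static volatile char buf[N];   /* volatile: keep the stores in the binary */

    int main(void) {
        /* Static storage is zero-initialized, so every one of these stores
         * rewrites the value already there, yet each one still dirties its
         * cache line. */
        for (size_t i = 0; i < N; i++)
            buf[i] = 0;
        return 0;
    }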

simplest tool to measure C program cache hit/miss and cpu time in linux?

Use perf:

    perf stat ./yourapp

See the kernel wiki perf tutorial for details. This uses the hardware performance counters of your CPU, so the overhead is very small.

Example from the wiki:

    perf stat -B dd if=/dev/zero of=/dev/null count=1000000

    Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

         5,099 cache-misses          #      0.005 M/sec  (scaled from 66.58%)

… Read more
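
If you want a self-contained C program to practice on, something like the following sketch makes the counters easy to interpret: the column-major walk touches a new cache line on almost every access, so perf stat reports far more misses than a row-major walk over the same data would. The program name, event list, and sizes are illustrative.

    /* Possible run: perf stat -e cache-references,cache-misses,task-clock ./walk */
    #include <stdio.h>
    #include <stdlib.h>

    #define ROWS 8192
    #define COLS 8192

    int main(void) {
        /* One large row-major matrix on the heap (zeroed so reads are defined). */
        int *a = calloc((size_t)ROWS * COLS, sizeof *a);
        if (!a) return 1;

        long long sum = 0;
        /* Column-major traversal: consecutive accesses are COLS * sizeof(int)
         * bytes apart, i.e. on a different cache line nearly every time. */
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += a[(size_t)i * COLS + j];

        printf("%lld\n", sum);
        free(a);
        return 0;
    }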

Do current x86 architectures support non-temporal loads (from “normal” memory)?

To answer specifically the headline question: Yes, recent[1] mainstream Intel CPUs support non-temporal loads on normal[2] memory – but only “indirectly” via non-temporal prefetch instructions, rather than directly using non-temporal load instructions like movntdqa. This is in contrast to non-temporal stores where you can just use the corresponding non-temporal store instructions[3] directly. The basic … Read more
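
A sketch of that indirect approach, assuming SSE is available: ordinary loads preceded by prefetchnta hints. The prefetch distance, sizes, and names are illustrative and untuned.

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
    #include <stdio.h>
    #include <stdlib.h>

    #define PF_DIST 512      /* prefetch ~8 cache lines ahead (assumption) */

    /* Sum a large array while hinting that the data is non-temporal. */
    static long long sum_stream(const int *src, size_t n) {
        long long sum = 0;
        for (size_t i = 0; i < n; i++) {
            /* prefetchnta: fetch a future line with non-temporal semantics;
             * prefetches past the end of the buffer do not fault. */
            _mm_prefetch((const char *)src + i * sizeof *src + PF_DIST,
                         _MM_HINT_NTA);
            sum += src[i];   /* the actual load is an ordinary load */
        }
        return sum;
    }

    int main(void) {
        size_t n = (size_t)1 << 24;           /* 16M ints, illustrative */
        int *data = calloc(n, sizeof *data);
        if (!data) return 1;
        printf("%lld\n", sum_stream(data, n));
        free(data);
        return 0;
    }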

How does one write code that best utilizes the CPU cache to improve performance?

The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transferred (preserving memory bandwidth). Techniques for avoiding suffering from memory fetch latency … Read more
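
One concrete latency- and bandwidth-saving technique in that family is loop blocking (tiling), sketched below for a matrix transpose so each tile stays cache-resident while it is processed; the block size and matrix size are assumptions that would normally be tuned to the target's L1D.

    #include <stddef.h>
    #include <stdlib.h>

    #define B 64   /* tile edge in elements (assumption, tune per cache size) */

    /* Blocked (tiled) transpose: work on one B x B tile at a time so the
     * touched parts of src and dst stay resident in cache. */
    static void transpose_blocked(double *dst, const double *src, size_t n) {
        for (size_t ii = 0; ii < n; ii += B)
            for (size_t jj = 0; jj < n; jj += B)
                for (size_t i = ii; i < ii + B && i < n; i++)
                    for (size_t j = jj; j < jj + B && j < n; j++)
                        dst[j * n + i] = src[i * n + j];
    }

    int main(void) {
        size_t n = 2048;                           /* illustrative size */
        double *src = calloc(n * n, sizeof *src);
        double *dst = calloc(n * n, sizeof *dst);
        if (!src || !dst) return 1;
        transpose_blocked(dst, src, n);
        free(src);
        free(dst);
        return 0;
    }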