Do current x86 architectures support non-temporal loads (from “normal” memory)?


To answer specifically the headline question:

Yes, recent¹ mainstream Intel CPUs support non-temporal loads on normal² memory – but only “indirectly” via non-temporal prefetch instructions, rather than directly using non-temporal load instructions like movntdqa. This is in contrast to non-temporal stores where you can just use the corresponding non-temporal store instructions³ directly.

The basic idea is that you issue a prefetchnta for the cache line before any normal loads from it, and then issue the loads as normal. If the line wasn’t already in the cache, it will be loaded in a non-temporal fashion. The exact meaning of “non-temporal fashion” depends on the architecture, but the general pattern is that the line is loaded into at least the L1 and perhaps some higher cache levels. Indeed, for a prefetch to be of any use it needs to cause the line to load into at least some cache level for consumption by a later load. The line may also be treated specially in the cache, for example by flagging it as high priority for eviction or by restricting the ways in which it can be placed.
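The pattern described above can be sketched in C using the `_mm_prefetch` intrinsic with the `_MM_HINT_NTA` hint, which compilers emit as prefetchnta. This is only an illustrative sketch: the function name `sum_bytes_nta`, the 64-byte line size and the `PREFETCH_DISTANCE` value are assumptions made for the example, and the right distance has to be found by measurement on the target machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_NTA */

#define CACHE_LINE 64        /* assumed cache line size in bytes */
#define PREFETCH_DISTANCE 8  /* hypothetical: how many lines to run ahead; needs tuning */

/* Sums a buffer that is read once, hinting each upcoming line with prefetchnta. */
uint64_t sum_bytes_nta(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint64_t sum = 0;

    for (size_t i = 0; i < len; i += CACHE_LINE) {
        /* Issue the NT prefetch for a line we will read a few iterations from now.
           The first PREFETCH_DISTANCE lines are simply loaded without a hint. */
        size_t ahead = i + (size_t)PREFETCH_DISTANCE * CACHE_LINE;
        if (ahead < len)
            _mm_prefetch((const char *)(data + ahead), _MM_HINT_NTA);

        /* Ordinary loads from the current line; no special load instruction is used. */
        size_t end = (len - i < CACHE_LINE) ? len : i + CACHE_LINE;
        for (size_t j = i; j < end; j++)
            sum += data[j];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is identical with or without it; any difference shows up in how much of the cache the streamed data occupies afterwards.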

The upshot of all this is that while non-temporal loads are supported in a sense, they are really only partly non-temporal, unlike stores, where you really leave no trace of the line in any of the cache levels. Non-temporal loads will cause some cache pollution, but generally less than regular loads. The exact details are architecture specific, and I’ve included some details below for modern Intel. You can find a slightly longer writeup in this answer to the question “Non-temporal loads and the hardware prefetcher, do they work together?”.

Skylake Client

Based on the tests in this answer, it seems that the behavior of prefetchnta on Skylake is to fetch normally into the L1 cache, to skip the L2 entirely, and to fetch in a limited way into the L3 cache (probably into only 1 or 2 ways, so the total amount of L3 available to nta prefetches is limited).

This was tested on Skylake client, but I believe this basic behavior probably extends backwards to Sandy Bridge and earlier (based on wording in the Intel optimization guide), and also forwards to Kaby Lake and later architectures based on Skylake client. So unless you are using a Skylake-SP or Skylake-X part, or an extremely old CPU, this is probably the behavior you can expect from prefetchnta.

Skylake Server

The only recent Intel chip known to have different behavior is Skylake server (used in Skylake-X, Skylake-SP and a few other lines). This has a considerably changed L2 and L3 architecture, and the L3 is no longer inclusive of the much larger L2. For this chip, it seems that prefetchnta skips both the L2 and L3 caches, so on this architecture cache pollution is limited to the L1.

This behavior was reported by user Mysticial in a comment. The downside, as pointed out in those comments, is that this makes prefetchnta much more brittle: if you get the prefetch distance or timing wrong (especially easy when hyperthreading is involved and the sibling hyperthread is active), and the data gets evicted from the L1 before you use it, you go all the way back to main memory rather than to the L3 as on earlier architectures.

¹ Recent here probably means anything in the last decade or so, but I don’t mean to imply that earlier hardware didn’t support non-temporal prefetch: it’s possible that support goes right back to the introduction of prefetchnta but I don’t have the hardware to check that and can’t find an existing reliable source of information on it.

² Normal here just means WB (writeback) memory, which is the memory you are dealing with at the application level the overwhelming majority of the time.

³ Specifically, the NT store instructions are movnti for general purpose registers and the movntd* and movntp* families for SIMD registers.
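As an illustration of footnote ³, here is a minimal C sketch of using NT stores directly through SSE2 intrinsics: `_mm_stream_si32` typically compiles to movnti and `_mm_stream_si128` to movntdq. The function name `fill_nt` is hypothetical, and the code assumes `dst` is naturally (4-byte) aligned, since the 128-bit streaming store requires a 16-byte aligned address.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_stream_si32, _mm_stream_si128, _mm_set1_epi32 */
#include <stddef.h>
#include <stdint.h>

/* Fills dst[0..count) with value using non-temporal stores. */
void fill_nt(int32_t *dst, size_t count, int32_t value)
{
    assert(dst != NULL || count == 0);
    size_t i = 0;

    /* Scalar NT stores (movnti) until the address is 16-byte aligned. */
    while (i < count && ((uintptr_t)(dst + i) & 15) != 0) {
        _mm_stream_si32((int *)(dst + i), value);
        i++;
    }

    /* 128-bit NT stores (movntdq) for the aligned bulk of the buffer. */
    __m128i v = _mm_set1_epi32(value);
    for (; i + 4 <= count; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);

    /* Scalar NT stores for whatever is left at the tail. */
    for (; i < count; i++)
        _mm_stream_si32((int *)(dst + i), value);

    /* NT stores are weakly ordered; fence before other threads rely on the data. */
    _mm_sfence();
}
```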
