See also First use of AVX 256-bit vectors slows down 128-bit vector and AVX scalar ops re: implicit widening of 128-bit AVX operations to 256-bit if any uppers are dirty (including for the purposes of “light” vs. “heavy” turbo limits). This could be a reason to use vzeroupper, especially if you have some regions of your program that use 256-bit vectors (especially “light” instructions, like integer stuff other than multiply), and others that make heavy use of 128-bit FMA. Without vzeroupper, the 128-bit FP math instructions could lower your max turbo as if you’d been using heavy 256-bit instructions. (If you’re doing that anyway, maybe not a big deal.)
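As a rough illustration of the "256-bit region followed by 128-bit code" case, here is a minimal C sketch that ends a 256-bit integer region with an explicit vzeroupper via the `_mm256_zeroupper()` intrinsic. The function names (`add_one_avx2`, `add_one_scalar`, `add_one`) are made up for this example, and it assumes GCC or Clang on x86-64. Compilers typically insert vzeroupper themselves at the end of functions that use YMM registers, so the explicit call may well be redundant; it is here to show where the instruction belongs.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical kernel: "light" 256-bit integer work (add 1 to every element). */
__attribute__((target("avx2")))
static void add_one_avx2(int *data, size_t n)
{
    const __m256i one = _mm256_set1_epi32(1);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(data + i));
        _mm256_storeu_si256((__m256i *)(data + i), _mm256_add_epi32(v, one));
    }
    for (; i < n; i++)          /* scalar tail for the last n % 8 elements */
        data[i] += 1;

    /* Leave the 256-bit region with clean YMM uppers, so later 128-bit
       code is not implicitly widened. */
    _mm256_zeroupper();
}

/* Plain fallback for CPUs without AVX2. */
static void add_one_scalar(int *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] += 1;
}

/* Runtime dispatch, so the sketch also runs on CPUs without AVX2. */
static void add_one(int *data, size_t n)
{
    if (__builtin_cpu_supports("avx2"))
        add_one_avx2(data, n);
    else
        add_one_scalar(data, n);
}
```

Calling `add_one(buf, len)` before a long stretch of 128-bit FMA code is the situation described above. Whether it changes turbo behaviour on a given CPU would need measuring.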
You’re correct that if your whole program doesn’t use any non-VEX instructions that write xmm registers, you don’t need vzeroupper to avoid state-transition penalties.
Beware that non-VEX instructions can lurk in CRT startup code and/or the dynamic linker, or other highly non-obvious places.
That said, a non-VEX instruction can only cause a one-time penalty when it runs. The reverse isn’t true: one VEX-256 instruction can make non-VEX instructions in general (or just with that register) slow for the rest of the program.
There’s no penalty when mixing VEX and EVEX, so no need to use vzeroupper there.
On Skylake-AVX512: vzeroupper or vzeroall are the only way to restore max-turbo after dirtying a ZMM register, assuming your program still uses any SSE*, AVX1, or AVX2 instructions on xmm/ymm0..15.
See also Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask? – merely reading a zmm doesn’t cause this.
Posted by @BeeOnRope in chat:
There is a new, pretty bad effect with AVX-512 instructions on surrounding code: once a 512-bit instruction is executed (except perhaps for instructions that don’t write to a zmm register) the core enters an “upper 256 dirty state”. In this state, any later scalar FP/SSE/AVX instruction (anything using xmm or ymm regs) will internally be extended to 512 bits. This means the processor will be locked to no higher than the AVX turbo (the so-called “L1 license”) until vzeroupper or vzeroall are issued.
Unlike the earlier “dirty upper 128” issue with AVX and legacy non-VEX SSE (which still exists on Skylake Xeon), this will slow down all code due to the lower frequency, but there are no “merging uops” or false dependencies or anything like that: it’s just that the smaller operations are effectively treated as 512-bit wide in order to implement the zero-extending behavior.
about “writing the low halves …” – no, it is a global state, and only vzero gets you out of it*. It occurs even if you dirty a zmm register but use different ones for ymm and xmm. It occurs even if the only dirtying instruction is a zeroing idiom like vpxord zmm0, zmm0, zmm0. It doesn’t occur for writes to zmm16-31 though.
His description of actually extending all vector ops to 512 bits isn’t quite right, because he later confirmed that it doesn’t reduce throughput for 128 and 256-bit instructions. But we know that when 512-bit uops are in flight, the vector ALUs on port 1 are shut down. (So the 256-bit FMA units normally accessible via ports 0 and 1 can combine into a 512-bit unit for all FP math, integer multiply, and possibly some other stuff. Some SKX Xeons have a 2nd 512-bit FMA unit on port 5, some don’t.)
For max-turbo after using only AVX1 / AVX2 (including on earlier CPUs like Haswell): Opportunistically powering down the upper halves of execution units if they haven’t been used for a while (and sometimes allowing higher Turbo clock speeds) depends on whether YMM instructions have been used recently, not on whether the upper halves are dirty or not. So AFAIK, vzeroupper does not help the CPU un-throttle the clock speed sooner after using AVX1 / AVX2, for CPUs where max turbo is lower for 256-bit.
This is different from Intel’s Skylake-AVX512 (SKX / Skylake-SP), where AVX512 is somewhat “bolted on”.
VZEROUPPER might make context switches slightly cheaper because the CPU still knows whether the ymm-upper state is clean or dirty.
If it’s clean, I think xsaveopt or xsavec can write out the FPU state more compactly, without storing the all-zero upper halves at all (just setting a bit that says they’re clean). Notice in the state-transition diagram for SSE/AVX that xsave / xrstor is part of the picture.
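The "bit that says they’re clean" can be observed from user space on CPUs that support XGETBV with ECX=1, which returns XCR0 ANDed with the XINUSE bitmap; bit 2 is the AVX state component (the upper halves of YMM0..15). The following is a minimal sketch for GCC or Clang on x86, not a definitive tool: the helper names are invented for illustration, and whether vzeroupper actually returns that component to its init state on a given CPU is an assumption to check by running it.

```c
#include <cpuid.h>
#include <stdint.h>

/* Bit 2 of XCR0 / XINUSE is the AVX state component: upper 128 bits of YMM0..15. */
#define XSTATE_AVX_BIT 2

/* Pure helper (hypothetical name): does this XINUSE bitmap report the YMM
   upper halves as in use, i.e. not in their all-zero init state? */
static int ymm_uppers_in_use(uint64_t xinuse)
{
    return (int)((xinuse >> XSTATE_AVX_BIT) & 1);
}

/* Returns 1 if the OS has enabled XSAVE and the CPU supports XGETBV with ECX=1. */
static int xgetbv1_supported(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!(ecx & (1u << 27)))            /* CPUID.1:ECX.OSXSAVE */
        return 0;
    if (!__get_cpuid_count(0xD, 1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((eax >> 2) & 1);       /* CPUID.(EAX=0DH,ECX=1):EAX bit 2 */
}

/* Reads XCR0 & XINUSE. Only call this when xgetbv1_supported() returns 1. */
static uint64_t read_xinuse(void)
{
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(1));
    return ((uint64_t)hi << 32) | lo;
}
```

One way to use it: call `ymm_uppers_in_use(read_xinuse())` right after a 256-bit region, then again after vzeroupper, and compare. The result depends on the CPU, and the compiler may have inserted its own vzeroupper in between.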
An extra vzeroupper just for this is only worth considering if your code won’t use any 256b instructions for a long time after this, because ideally you won’t have any context switches / CPU migrations before the next use of 256-bit vectors.
This may not apply as much on AVX512 CPUs: vzeroupper / vzeroall don’t touch ZMM16..31, only ZMM0..15. So you can still have lots of dirty state after vzeroall.
(Plausible in theory): Dirty upper halves may be taking up physical registers (although IDK of any evidence for this being true on any real CPUs). If so, it would limit out-of-order window size for the CPU to find instruction-level parallelism. (ROB size is the other major limiting factor, but PRF size can be the bottleneck.)
This may be true on AMD CPUs before Zen2, where 256b ops are split into two 128b ops. YMM registers are handled internally as two 128-bit registers, and e.g. vmovaps ymm0, ymm1 renames the low 128 with zero latency, but needs a uop for the upper half. (See Agner Fog’s microarch pdf). It’s unknown whether vzeroupper can actually drop the renaming for the upper halves, though. Zeroing idioms on AMD Zen (unlike SnB-family) still need a back-end uop to write the register value, even for the 128b low half; only mov-elimination avoids a back-end uop. So there may not be a physical zero register that uppers can be renamed onto.
Experiments in that ROB size / PRF size blog post show that FP physical register file entries are 256-bit in Sandybridge, though. vzeroupper shouldn’t free up more registers on mainstream Intel CPUs with AVX/AVX2. Haswell-style transition penalties are slow enough that it probably drains the ROB to save or restore uppers to separate storage that isn’t renamed, not using up valuable PRF entries.
Silvermont doesn’t support AVX. And it uses a separate retirement register file for the architectural state, so the out-of-order PRF only holds speculative execution results. So even if it did support AVX with 128-bit halves, a stale YMM register with a dirty upper half probably wouldn’t be using up extra space in the rename register file.
KNL (Knight’s Landing / Xeon Phi) is specifically designed to run AVX512, so presumably its FP register file has 512-bit entries. It’s based on Silvermont, but the SIMD parts of the core are different (e.g. it can reorder FP/vector instructions, while Silvermont can only execute them speculatively but not reorder them within the FP/vector pipeline, according to Agner Fog). Still, KNL may also use a separate retirement register file, so dirty ZMM uppers wouldn’t consume extra space even if it was able to split a 512-bit entry to store two 256-bit vectors. Which is unlikely, because a larger out-of-order window for only AVX1/AVX2 on KNL wouldn’t be worth spending transistors on.
vzeroupper is much slower on KNL than on mainstream Intel CPUs (one per 36 cycles in 64-bit mode), so you probably wouldn’t want to use it, especially just for the tiny context-switch advantage.
On Skylake-AVX512, the evidence supports the conclusion that the vector physical register file is 512-bits wide.
Some future CPU might pair up entries in a physical register file to store wide vectors, even if they don’t normally decode to separate uops the way AMD does for 256-bit vectors.
@Mysticial reports unexpected slowdowns in code with long FP dependency chains with YMM vs. ZMM but otherwise identical code, but later experiments disagree with the conclusion that SKX uses 2x 256-bit register file entries for ZMM registers when the upper 256 bits are dirty.
AVX-512 and physical registers on Ice Lake / Sapphire Rapids
https://chipsandcheese.com/2023/01/15/golden-coves-vector-register-file-checking-with-official-spr-data/
[…]
While testing on server Ice Lake suggests Intel’s mechanism isn’t nearly that sophisticated. Instead, the core simply remembers whether the upper set of ZMM registers are in use. If you use any of the extra registers introduced with AVX-512 – that is, ZMM16 through 31, Ice Lake reserves another 16 registers to hold known-good state. It doesn’t matter if you touch one of them or all of them. Golden Cove is Ice Lake’s successor, and could use a similar mechanism.…
Therefore, Zen 4 does not employ the same register-saving optimization as Ice Lake.
But unfortunately, I don’t think vzeroupper / vzeroall can help with this; it doesn’t affect ZMM16..31, so it can’t restore them to “clean” status and free up those extra 16 physical registers for out-of-order exec.
If I understand correctly, manually xor-zeroing them will stop them from using physical registers (vpxord xmm16, xmm16, xmm16 through xmm31); either there’s an extra bit to indicate all-zero, or there’s a physical zero register that the renamer can point them at. But there might still be 16 extra PRF entries reserved for retirement state, even if the actual RAT entries aren’t pointing at them.
With them zeroed, xsave / xrstor on context-switch might get back to the zmm16-31-unused state. The CPU presumably has to be able to get back to that state somehow other than a cold boot or entering a deep sleep state.
That article has some other interesting findings, like that only 220 of the 320 vector PRF entries are capable of holding 512-bit results. So using 256-bit instructions whenever that’s sufficient (e.g. horizontal reductions start by narrowing to 256) can help out-of-order exec see farther ahead.
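To make "horizontal reductions start by narrowing to 256" concrete, here is a minimal C sketch of a sum over one 512-bit vector of doubles: the first step extracts the high 256 bits and adds them to the low 256, so every later step is 256-bit or narrower. The function names (`sum8_avx512`, `sum8_scalar`, `sum8`) are hypothetical, and the sketch assumes GCC or Clang on x86-64. It illustrates the reduction order only and makes no claim about measured PRF behaviour.

```c
#include <immintrin.h>

/* Sum 8 doubles held in one ZMM register, narrowing at every step. */
__attribute__((target("avx512f")))
static double sum8_avx512(const double *p)
{
    __m512d v  = _mm512_loadu_pd(p);                /* the only 512-bit op   */
    __m256d lo = _mm512_castpd512_pd256(v);         /* low 256: free cast    */
    __m256d hi = _mm512_extractf64x4_pd(v, 1);      /* high 256              */
    __m256d s4 = _mm256_add_pd(lo, hi);             /* 256-bit from here on  */
    __m128d s2 = _mm_add_pd(_mm256_castpd256_pd128(s4),
                            _mm256_extractf128_pd(s4, 1));
    __m128d s1 = _mm_add_sd(s2, _mm_unpackhi_pd(s2, s2));
    return _mm_cvtsd_f64(s1);
}

/* Plain fallback for CPUs without AVX-512F. */
static double sum8_scalar(const double *p)
{
    double s = 0.0;
    for (int i = 0; i < 8; i++)
        s += p[i];
    return s;
}

/* Runtime dispatch, so the sketch also runs on CPUs without AVX-512F. */
static double sum8(const double *p)
{
    return __builtin_cpu_supports("avx512f") ? sum8_avx512(p) : sum8_scalar(p);
}
```

In a real loop the 512-bit accumulation would happen inside the loop body, and only the final reduction would narrow like this.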