Convention for displaying vector registers

Being consistent is the most important thing; if I’m working on existing code that already has LSE-first comments or variable names, I match that. Given the choice, I prefer MSE-first notation in comments, especially when designing something with shuffles, or when packing/unpacking to different element sizes. Intel uses MSE-first not only in their diagrams in …
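To make the two conventions concrete, here is a minimal Python sketch (not taken from the original answer) that prints the same 128-bit register image MSE-first, i.e. most-significant element on the left. The helper names `elements` and `format_mse_first` and the sample byte pattern are assumptions chosen for illustration. It shows one reason MSE-first is convenient when changing element sizes: the hex digits stay in the same order and only the grouping changes.

```python
import struct

def elements(reg_bytes, elem_size):
    """Split a little-endian register image into unsigned elements, lowest address first (LSE-first)."""
    fmt = {1: "B", 2: "H", 4: "I", 8: "Q"}[elem_size]
    count = len(reg_bytes) // elem_size
    return list(struct.unpack("<" + fmt * count, reg_bytes))

def format_mse_first(reg_bytes, elem_size):
    """Hex dump with the most-significant element on the left."""
    width = elem_size * 2  # hex digits per element
    return " ".join(f"{e:0{width}x}" for e in reversed(elements(reg_bytes, elem_size)))

# Hypothetical 128-bit register whose bytes in memory order are 0x00 .. 0x0f.
reg = bytes(range(16))

print(format_mse_first(reg, 4))  # 0f0e0d0c 0b0a0908 07060504 03020100
print(format_mse_first(reg, 8))  # 0f0e0d0c0b0a0908 0706050403020100
```

Printed LSE-first, the same register would read `03020100 07060504 ...` as 32-bit elements but `0706050403020100 ...` as 64-bit elements, so the digit order changes with the element size.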

What specifically marks an x86 cache line as dirty – any write, or is an explicit change required?

Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores. There has been academic research on this, and there is even a patent on “eliminating silent store invalidation propagation in shared memory cache coherency protocols”. (Google ‘”silent store” cache’ if you are interested in more.) For x86, …

Half-precision floating-point arithmetic on Intel chips

Related: https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture – has some info about BFloat16 in Cooper Lake and Sapphire Rapids, and some non-Intel info. Sapphire Rapids will have both BF16 and FP16, with FP16 using the same IEEE754 binary16 format as F16C conversion instructions, not brain-float. And AVX512-FP16 has support for most math operations, unlike BF16, which just has conversion to/from …
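As a rough illustration of how the two 16-bit formats differ, the Python sketch below compares bit patterns. It runs in software and does not use F16C or AVX-512 instructions. The helper names are hypothetical, and the bfloat16 conversion simply truncates the upper 16 bits of a binary32 value, which is an assumption made for brevity (real conversions may round instead).

```python
import struct

def to_binary16_bits(x):
    """Bit pattern of x as IEEE 754 binary16: 1 sign, 5 exponent, 10 fraction bits."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def to_bfloat16_bits(x):
    """Bit pattern of x as bfloat16: 1 sign, 8 exponent, 7 fraction bits.
    Assumption: plain truncation of binary32, no rounding."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def fits_binary16(x):
    """True if x is within binary16's finite range (largest finite value is 65504)."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False

print(hex(to_binary16_bits(1.0)))  # 0x3c00
print(hex(to_bfloat16_bits(1.0)))  # 0x3f80
print(fits_binary16(1e5))          # False: too large for binary16
print(hex(to_bfloat16_bits(1e5)))  # 0x47c3: representable, but with only 7 fraction bits
```

The trade-off is visible in the field widths: binary16 has more precision but a narrow range, while bfloat16 keeps the exponent range of binary32 at much lower precision.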

Branch target prediction in conjunction with branch prediction?

Do read along with the Intel optimization manual; the current download location is here. If that link has gone stale (they move stuff around all the time), search the Intel site for “Architectures optimization manual”. Keep in mind the info there is fairly generic; they disclose only as much as needed to allow writing efficient code. Branch prediction implementation …

The most correct way to refer to 32-bit and 64-bit versions of programs for x86-related CPUs?

x86 can be a broad term that covers all CPUs that are backwards-compatible with 8086, and all extensions to the architecture including x86-64. Note that IA-64 is not x86 at all, it’s Itanium (a 64-bit VLIW architecture with explicit speculation / parallelism). It was also designed by Intel, but is totally unrelated to x86 in …

How has CPU architecture evolution affected virtual function call performance?

“AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function.” Huh, so large. There is an “indirect branch prediction” method, which helps to predict a virtual function jump if the same indirect jump was executed some time ago. There is still a penalty for first and mispredicted virt. function …
