Convention for displaying vector registers

Being consistent is the most important thing; if I’m working on existing code that already has LSE-first comments or variable names, I match that. Given the choice, I prefer MSE-first notation in comments, especially when designing something with shuffles, or when packing/unpacking to different element sizes. Intel uses MSE-first not only in their diagrams in … Read more

Half-precision floating-point arithmetic on Intel chips

related: https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture – has some info about BFloat16 in Cooper Lake and Sapphire Rapids, and some non-Intel info. Sapphire Rapids will have both BF16 and FP16, with FP16 using the same IEEE754 binary16 format as F16C conversion instructions, not brain-float. And AVX512-FP16 has support for most math operations, unlike BF16 which just has conversion to/from … Read more
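The difference between the two 16-bit formats is easy to see in plain C. Here is a minimal decoding sketch (not from the linked answer; the function names are made up for illustration, and binary16 special-case handling is kept minimal):

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* bfloat16 is just the top 16 bits of an IEEE 754 binary32:
   same exponent range, only 8 bits of mantissa. */
static float bf16_to_float(uint16_t b) {
    uint32_t bits = (uint32_t)b << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* IEEE 754 binary16 (the F16C / AVX512-FP16 format): 1 sign bit,
   5 exponent bits (bias 15), 10 mantissa bits. */
static float half_to_float(uint16_t h) {
    int sign = (h >> 15) & 1;
    int exp  = (h >> 10) & 0x1F;
    int mant = h & 0x3FF;
    float val;
    if (exp == 0)            /* zero or subnormal: mant * 2^-24 */
        val = ldexpf((float)mant, -24);
    else if (exp == 0x1F)    /* Inf or NaN */
        val = mant ? NAN : INFINITY;
    else                     /* normal: 2^(exp-15) * 1.mant */
        val = ldexpf(1.0f + mant / 1024.0f, exp - 15);
    return sign ? -val : val;
}
```

For instance, the bit pattern 0x3C00 is 1.0 as binary16, but read as bfloat16 it is 2^-7 = 0.0078125, which is why conversion instructions for one format are useless for the other.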

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define: __SSE__ __SSE2__ __SSE3__ __AVX__ __AVX2__ etc, according to whatever command line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this: $ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort #define __SSE__ 1 #define __SSE2__ 1 #define … Read more

Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?

See also First use of AVX 256-bit vectors slows down 128-bit vector and AVX scalar ops re: implicit widening of 128-bit AVX operations to 256-bit if any uppers are dirty. (Including for the purposes of “light” vs. “heavy” turbo limits). This could be a reason to use vzeroupper, especially if you have some regions of … Read more

Transpose an 8×8 float using AVX/AVX2

I already answered this question in Fast memory transpose with SSE, AVX, and OpenMP. Let me repeat the solution for transposing an 8×8 float matrix with AVX. Let me know if this is any faster than using 4×4 blocks and _MM_TRANSPOSE4_PS. I used it for a kernel in a larger matrix transpose which was memory bound … Read more
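For reference, the 4×4 building block it's being compared against is the classic SSE macro from <xmmintrin.h>; a self-contained sketch:

```c
#include <xmmintrin.h>

/* Transpose a 4x4 row-major float matrix in place using
   _MM_TRANSPOSE4_PS, which expands to unpack/shuffle instructions. */
static void transpose4x4(float m[16]) {
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(m + 0,  r0);
    _mm_storeu_ps(m + 4,  r1);
    _mm_storeu_ps(m + 8,  r2);
    _mm_storeu_ps(m + 12, r3);
}
```

An 8×8 transpose built this way processes four 4×4 blocks, swapping the two off-diagonal blocks; the AVX version instead works on whole 8-float rows with unpacks and lane permutes.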

Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision

There are lots of examples of the algorithm in practice. For example: Newton Raphson with SSE2 – can someone explain me these 3 lines has an answer explaining the iteration used by one of Intel’s examples. For perf analysis on let’s say Haswell (which can FP mul on two execution ports, unlike previous designs), I’ll … Read more
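The iteration itself is short. A sketch of rsqrtps plus one Newton-Raphson step, using the standard refinement y' = y·(1.5 − 0.5·x·y²), which roughly doubles the ~12-bit precision of the hardware estimate:

```c
#include <xmmintrin.h>

/* Approximate 1/sqrt(x) for 4 floats: hardware estimate refined by
   one Newton-Raphson iteration, good to roughly 22 bits. */
static __m128 rsqrt_nr(__m128 x) {
    __m128 y     = _mm_rsqrt_ps(x);               /* ~12-bit estimate */
    __m128 half  = _mm_set1_ps(0.5f);
    __m128 three = _mm_set1_ps(3.0f);
    /* y * 0.5 * (3 - x*y*y)  ==  y * (1.5 - 0.5*x*y*y) */
    __m128 xyy   = _mm_mul_ps(_mm_mul_ps(x, y), y);
    return _mm_mul_ps(_mm_mul_ps(half, y), _mm_sub_ps(three, xyy));
}
```

On an FMA-capable design like Haswell the two multiplies and the subtract can be contracted, which is where the port-pressure analysis comes in.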

Where is Clang’s ‘_mm256_pow_ps’ intrinsic?

That’s not an intrinsic; it’s an Intel SVML library function name that confusingly uses the same naming scheme as actual intrinsics. There’s no vpowps instruction. (AVX512ER on Xeon Phi does have the semi-related vexp2ps instruction…) IDK if this naming scheme is to trick people into depending on Intel tools when writing SIMD code with their … Read more
