Convention for displaying vector registers

Being consistent is the most important thing; if I’m working on existing code that already has LSE-first comments or variable names, I match that. Given the choice, I prefer MSE-first notation in comments, especially when designing something with shuffles, or when packing/unpacking to different element sizes. Intel uses MSE-first not only in their diagrams in … Read more

Half-precision floating-point arithmetic on Intel chips

related: https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture – has some info about BFloat16 in Cooper Lake and Sapphire Rapids, and some non-Intel info. Sapphire Rapids will have both BF16 and FP16, with FP16 using the same IEEE754 binary16 format as F16C conversion instructions, not brain-float. And AVX512-FP16 has support for most math operations, unlike BF16 which just has conversion to/from … Read more
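The difference between the two 16-bit formats is easy to see in plain C. Here is a minimal decoding sketch (not from the linked answer; the function names are made up for illustration, and binary16 special-case handling is kept minimal):

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* bfloat16 is just the top 16 bits of an IEEE 754 binary32:
   same exponent range, only 8 bits of mantissa. */
static float bf16_to_float(uint16_t b) {
    uint32_t bits = (uint32_t)b << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* IEEE 754 binary16 (the F16C / AVX512-FP16 format): 1 sign bit,
   5 exponent bits (bias 15), 10 mantissa bits. */
static float half_to_float(uint16_t h) {
    int sign = (h >> 15) & 1;
    int exp  = (h >> 10) & 0x1F;
    int mant = h & 0x3FF;
    float val;
    if (exp == 0)            /* zero or subnormal: mant * 2^-24 */
        val = ldexpf((float)mant, -24);
    else if (exp == 0x1F)    /* Inf or NaN */
        val = mant ? NAN : INFINITY;
    else                     /* normal: 2^(exp-15) * 1.mant */
        val = ldexpf(1.0f + mant / 1024.0f, exp - 15);
    return sign ? -val : val;
}
```

For instance, the bit pattern 0x3C00 is 1.0 as binary16, but read as bfloat16 it is 2^-7 = 0.0078125, which is why conversion instructions for one format are useless for the other.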

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define: __SSE__ __SSE2__ __SSE3__ __AVX__ __AVX2__ etc, according to whatever command line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this: $ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort #define __SSE__ 1 #define __SSE2__ 1 #define … Read more

Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?

See also First use of AVX 256-bit vectors slows down 128-bit vector and AVX scalar ops re: implicit widening of 128-bit AVX operations to 256-bit if any uppers are dirty. (Including for the purposes of “light” vs. “heavy” turbo limits). This could be a reason to use vzeroupper, especially if you have some regions of … Read more

Transpose an 8×8 float using AVX/AVX2

I already answered this question in Fast memory transpose with SSE, AVX, and OpenMP. Let me repeat the solution for transposing an 8×8 float matrix with AVX. Let me know if this is any faster than using 4×4 blocks and _MM_TRANSPOSE4_PS. I used it for a kernel in a larger matrix transpose which was memory bound … Read more
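For reference, the 4×4 building block it's being compared against is the classic SSE macro from <xmmintrin.h>; a self-contained sketch:

```c
#include <xmmintrin.h>

/* Transpose a 4x4 row-major float matrix in place using
   _MM_TRANSPOSE4_PS, which expands to unpack/shuffle instructions. */
static void transpose4x4(float m[16]) {
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(m + 0,  r0);
    _mm_storeu_ps(m + 4,  r1);
    _mm_storeu_ps(m + 8,  r2);
    _mm_storeu_ps(m + 12, r3);
}
```

An 8×8 transpose built this way processes four 4×4 blocks, swapping the two off-diagonal blocks; the AVX version instead works on whole 8-float rows with unpacks and lane permutes.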

Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision

There are lots of examples of the algorithm in practice. For example: Newton Raphson with SSE2 – can someone explain me these 3 lines has an answer explaining the iteration used by one of Intel’s examples. For perf analysis on let’s say Haswell (which can FP mul on two execution ports, unlike previous designs), I’ll … Read more
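The iteration itself is short. A sketch of rsqrtps plus one Newton-Raphson step, using the standard refinement y' = y·(1.5 − 0.5·x·y²), which roughly doubles the ~12-bit precision of the hardware estimate:

```c
#include <xmmintrin.h>

/* Approximate 1/sqrt(x) for 4 floats: hardware estimate refined by
   one Newton-Raphson iteration, good to roughly 22 bits. */
static __m128 rsqrt_nr(__m128 x) {
    __m128 y     = _mm_rsqrt_ps(x);               /* ~12-bit estimate */
    __m128 half  = _mm_set1_ps(0.5f);
    __m128 three = _mm_set1_ps(3.0f);
    /* y * 0.5 * (3 - x*y*y)  ==  y * (1.5 - 0.5*x*y*y) */
    __m128 xyy   = _mm_mul_ps(_mm_mul_ps(x, y), y);
    return _mm_mul_ps(_mm_mul_ps(half, y), _mm_sub_ps(three, xyy));
}
```

On an FMA-capable design like Haswell the two multiplies and the subtract can be contracted, which is where the port-pressure analysis comes in.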

Where is Clang’s ‘_mm256_pow_ps’ intrinsic?

That’s not an intrinsic; it’s an Intel SVML library function name that confusingly uses the same naming scheme as actual intrinsics. There’s no vpowps instruction. (AVX512ER on Xeon Phi does have the semi-related vexp2ps instruction…) IDK if this naming scheme is to trick people into depending on Intel tools when writing SIMD code with their … Read more
