Convention for displaying vector registers

Being consistent is the most important thing; if I’m working on existing code that already has LSE-first comments or variable names, I match that. Given the choice, I prefer MSE-first notation in comments, especially when designing something with shuffles, or with packing/unpacking to different element sizes. Intel uses MSE-first not only in their diagrams in … Read more
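As a rough illustration of the two notations, the sketch below builds the same vector with `_mm_set_epi32` (arguments listed most-significant element first) and `_mm_setr_epi32` (arguments in memory order), then shows why a byte shift "left" reads naturally in MSE-first comments. The helper names `set_and_setr_agree` and `shift_up_one_lane` are made up for this example, and it assumes an x86 target with SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Returns 1 if the MSE-first and LSE-first spellings build the same vector. */
static int set_and_setr_agree(void) {
    /* MSE-first: [ 3 2 1 0 ], element 0 is the rightmost argument. */
    __m128i mse_first = _mm_set_epi32(3, 2, 1, 0);
    /* LSE-first (memory order): element 0 is the leftmost argument. */
    __m128i lse_first = _mm_setr_epi32(0, 1, 2, 3);

    int32_t a[4], b[4];
    _mm_storeu_si128((__m128i *)a, mse_first);
    _mm_storeu_si128((__m128i *)b, lse_first);
    return memcmp(a, b, sizeof a) == 0 && a[0] == 0 && a[3] == 3;
}

/* Byte-shifts the whole register left by one 32-bit lane.
 * MSE-first view:  [ 3 2 1 0 ] -> [ 2 1 0 0 ]  (looks like a left shift)
 * Memory view:     {0,1,2,3}   -> {0,0,1,2}    (looks like a right shift) */
static void shift_up_one_lane(const int32_t in[4], int32_t out[4]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_slli_si128(v, 4);  /* shift by 4 bytes toward the most-significant end */
    _mm_storeu_si128((__m128i *)out, v);
}
```

The second function is the kind of case where the choice of notation matters: the instruction is called a left shift, and only the MSE-first picture makes it look like one.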

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define: __SSE__ __SSE2__ __SSE3__ __AVX__ __AVX2__ etc, according to whatever command line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this: $ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort #define __SSE__ 1 #define __SSE2__ 1 #define … Read more
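In source code, those predefined macros are typically tested with the preprocessor. The following is a minimal sketch, assuming a GCC-compatible compiler; the function name `widest_simd_enabled` is hypothetical and the list of macros is deliberately not exhaustive.

```c
/* Reports the widest SIMD level the compiler was told to target.
 * This reflects compile-time flags only, not what the CPU running
 * the binary actually supports. */
static const char *widest_simd_enabled(void) {
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}
```

Building the same file with different switches (for example `-msse3` versus `-mavx2`) should change the string that is returned.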

Using SSE in C#: is it possible?

The upcoming Mono 2.2 release will have SIMD support. Miguel de Icaza blogged about the upcoming feature here, and the API is here. Although there will be a library that will support development under Microsoft’s .NET Windows runtime, it will not have the performance benefits that you are looking for unless you run the code … Read more

In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?

From the file gcc/config/i386/i386.c of the GCC sources: b — print the QImode name of the register for the indicated operand. %b0 would print %al if operands[0] is reg 0. w — likewise, print the HImode name of the register. k — likewise, print the SImode name of the register. q — likewise, print the … Read more
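To show how one of these modifiers is used in practice, here is a small sketch that applies the `k` modifier from the list above to a 64-bit operand. It assumes GCC or clang on x86-64 with the default AT&T syntax; the function name `write_via_32bit_name` is invented for the example, and other targets fall back to plain C.

```c
#include <stdint.h>

/* Writes 42 through the 32-bit (SImode) name of whatever register
 * the compiler picks for a 64-bit variable. */
static uint64_t write_via_32bit_name(void) {
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t x = ~0ULL;  /* start with all bits set */
    /* %k0 prints the 32-bit name of operand 0 (e.g. %eax instead of %rax).
     * On x86-64 a 32-bit register write zero-extends into the full
     * 64-bit register, so the upper half is cleared as well. */
    __asm__("movl $42, %k0" : "+r"(x));
    return x;
#else
    return 42;  /* fallback for non-x86-64 builds, no inline asm */
#endif
}
```

Without the modifier, `%0` would expand to the 64-bit register name and `movl` would fail to assemble.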

Fast counting the number of set bits in __m128i register

Here is some code I used in an old project (there is a research paper about it). The function popcnt8 below computes the number of bits set in each byte. SSE2-only version (based on Algorithm 3 in Hacker’s Delight book): static const __m128i popcount_mask1 = _mm_set1_epi8(0x77); static const __m128i popcount_mask2 = _mm_set1_epi8(0x0F); static inline __m128i … Read more
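Since the excerpt is cut off, here is a separate sketch of a per-byte popcount using only SSE2. It is not the code from the answer: it uses the common 0x55/0x33/0x0F bit-slicing reduction rather than the 0x77-mask variant mentioned above, and the names `popcnt8_sketch` and `popcnt8_bytes` are made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Counts the set bits in each of the 16 bytes of v.
 * SSE2 has no 8-bit shifts, so 16-bit shifts are used and the masks
 * discard the bits that leak in from the neighbouring byte. */
static inline __m128i popcnt8_sketch(__m128i v) {
    const __m128i m1 = _mm_set1_epi8(0x55);
    const __m128i m2 = _mm_set1_epi8(0x33);
    const __m128i m4 = _mm_set1_epi8(0x0F);

    /* 2-bit fields: v - ((v >> 1) & 0x55) */
    v = _mm_sub_epi8(v, _mm_and_si128(_mm_srli_epi16(v, 1), m1));
    /* 4-bit fields: (v & 0x33) + ((v >> 2) & 0x33) */
    v = _mm_add_epi8(_mm_and_si128(v, m2),
                     _mm_and_si128(_mm_srli_epi16(v, 2), m2));
    /* 8-bit fields: (v + (v >> 4)) & 0x0F */
    v = _mm_and_si128(_mm_add_epi8(v, _mm_srli_epi16(v, 4)), m4);
    return v;
}

/* Convenience wrapper for plain byte arrays. */
static void popcnt8_bytes(const uint8_t in[16], uint8_t out[16]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, popcnt8_sketch(v));
}
```

To get the count for the whole register, the 16 byte counts could then be summed, for instance with `_mm_sad_epu8` against a zero vector.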

Header files for x86 SIMD intrinsics

These days you should normally just include <immintrin.h>. It includes everything. GCC and clang will stop you from using intrinsics for instructions you haven’t enabled at compile time (e.g. with -march=native or -mavx2 -mbmi2 -mpopcnt -mfma -mcx16 -mtune=znver1 or whatever.) MSVC and ICC will let you use intrinsics without enabling anything at compile time, but … Read more
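A minimal sketch of the single-header approach: only `<immintrin.h>` is included, and the intrinsics used here are plain SSE ones, which are enabled by default on x86-64. The function name `add4` is hypothetical.

```c
#include <immintrin.h>  /* pulls in the SSE/AVX/... intrinsic headers */

/* Adds four floats from a and b element-wise into out.
 * Unaligned loads and stores are used, so no alignment is required. */
static void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Swapping in an AVX2 or FMA intrinsic here would need the matching `-m` option (or `-march=`) with GCC and clang, as described above.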

Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision

There are lots of examples of the algorithm in practice. For example: "Newton Raphson with SSE2 – can someone explain me these 3 lines" has an answer explaining the iteration used by one of Intel’s examples. For perf analysis on let’s say Haswell (which can FP mul on two execution ports, unlike previous designs), I’ll … Read more
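For reference, one Newton-Raphson step for the reciprocal square root is y' = y * (1.5 - 0.5 * x * y * y). The sketch below applies a single step to the `_mm_rsqrt_ps` estimate; it is an illustration under that assumption, not the Intel example referred to above, and the names `rsqrt_nr_ps` and `rsqrt4` are invented here.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Approximate 1/sqrt(x) for four floats: hardware estimate (about 12 bits)
 * refined by one Newton-Raphson iteration. Inputs are assumed to be
 * positive, finite and normal; 0 and infinity are not handled specially. */
static inline __m128 rsqrt_nr_ps(__m128 x) {
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);

    __m128 y = _mm_rsqrt_ps(x);                        /* initial estimate */
    __m128 half_x_yy = _mm_mul_ps(_mm_mul_ps(half, x),
                                  _mm_mul_ps(y, y));   /* 0.5 * x * y * y */
    return _mm_mul_ps(y, _mm_sub_ps(three_halves, half_x_yy));
}

/* Convenience wrapper for plain float arrays. */
static void rsqrt4(const float in[4], float out[4]) {
    _mm_storeu_ps(out, rsqrt_nr_ps(_mm_loadu_ps(in)));
}
```

One iteration roughly doubles the number of correct bits, which gets close to, but not exactly, full single precision; whether that is enough depends on the precision the caller needs.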