simd – Make Me Engineer

Why is vectorization faster, in general, than loops?

June 8, 2023 by Tarik

Vectorization (as the term is normally used) refers to SIMD (single instruction, multiple data) operation. That means, in essence, that one instruction carries out the same operation on a number of operands in parallel. For example, to multiply a vector of size N by a scalar, let’s call M the number of operands that size … Read more
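As a rough illustration of "one instruction, several operands", here is a minimal sketch of the vector-times-scalar example using SSE intrinsics. The function name `scale_sse` and the choice of 128-bit registers (4 floats per instruction) are assumptions made for this sketch, not something taken from the original answer.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* Hypothetical helper: multiply every element of x[0..n) by the scalar s. */
void scale_sse(float *x, size_t n, float s)
{
    const __m128 vs = _mm_set1_ps(s);   /* broadcast s into all 4 lanes */
    size_t i = 0;

    /* Main loop: each _mm_mul_ps performs 4 multiplications at once. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i); /* unaligned load of 4 floats */
        _mm_storeu_ps(x + i, _mm_mul_ps(v, vs));
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        x[i] *= s;
}
```

Calling `scale_sse(buf, 6, 2.0f)` would process the first four elements with a single vector multiply and the remaining two in the scalar tail.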

Convention for displaying vector registers

June 4, 2023 by Tarik

Being consistent is the most important thing; if I’m working on existing code that already has LSE-first comments or variable names, I match that. Given the choice, I prefer MSE-first notation in comments, especially when designing something with shuffles, or packing/unpacking to different element sizes. Intel uses MSE-first not only in their diagrams in … Read more

Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

June 1, 2023 by Tarik

On my system, a 4-year-old MacBook (2.7 GHz Intel Core i5) with clang-900.0.39.2 -O3, your code runs in 500ms. Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms. Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings … Read more
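The branch-free inner statement quoted above can be dropped into a complete loop like the sketch below. Only the statement `target[i] += (pLong[j] >> i) & 1;` comes from the excerpt; the function name, the parameter `n`, the counter type, and the loop order are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper: for each of the 64 bit positions, count how many
 * of the n masks in pLong have that bit set. The caller is expected to
 * zero target[] first. */
void count_bit_positions(const uint64_t *pLong, size_t n, uint32_t target[64])
{
    for (size_t j = 0; j < n; j++) {
        for (int i = 0; i < 64; i++) {
            /* No branch: shift bit i down to position 0 and add it (0 or 1). */
            target[i] += (uint32_t)((pLong[j] >> i) & 1);
        }
    }
}
```

Because the loop body has no data-dependent branch, a compiler has a better chance of auto-vectorizing it, although whether it does depends on the compiler and flags.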

Why is this SIMD multiplication not faster than non-SIMD multiplication?

June 1, 2023 by Tarik

There was a major bug in the timing function I used for previous benchmarks. This grossly underestimated the bandwidth without vectorization as well as other measurements. Additionally, a second problem was overestimating the bandwidth because of copy-on-write (COW) on the array that was read but not written to. Finally, the maximum bandwidth I used … Read more

Micro Optimization of a 4-bucket histogram of a large array or list

May 19, 2023 by Tarik

This should be possible at about 8 elements (1 AVX2 vector) per 2.5 clock cycles or so (per core) on a modern x86-64 like Skylake or Zen 2, using AVX2. Or per 2 clocks with unrolling. Or on your Piledriver CPU, maybe 1x 16-byte vector of indexes per 3 clocks with AVX1 _mm_cmpeq_epi32. The general … Read more

No speedup when summing uint16 vs uint64 arrays with NumPy?

May 18, 2023 by Tarik

TL;DR: I ran an experimental analysis on NumPy 1.21.1. The results show that np.sum does NOT (really) make use of SIMD instructions: no SIMD instructions are used for integers, and scalar SIMD instructions are used for floating-point numbers! Moreover, NumPy converts smaller integer types to 64-bit values by default so as to avoid … Read more

How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

May 17, 2023 by Tarik

For good throughput with multiple source vectors, it’s a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn’t necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that … Read more

Fast counting the number of set bits in __m128i register

May 4, 2023 by Tarik

Here is some code I used in an old project (there is a research paper about it). The function popcnt8 below computes the number of bits set in each byte. SSE2-only version (based on Algorithm 3 in the book Hacker’s Delight):

    static const __m128i popcount_mask1 = _mm_set1_epi8(0x77);
    static const __m128i popcount_mask2 = _mm_set1_epi8(0x0F);
    static inline __m128i … Read more
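The excerpt is cut off before the function body, so the following is not the author's popcnt8. It is only a sketch of one common way to count bits per byte with SSE2 using the same two constants (0x77 and 0x0F): subtract three successively shifted copies to get a per-nibble count, then add the two nibbles of each byte. The function name `popcount_bytes_sketch` is made up for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */
#include <stdint.h>

/* Hypothetical sketch: returns a vector whose byte k holds the number of
 * set bits in byte k of x. */
static inline __m128i popcount_bytes_sketch(__m128i x)
{
    const __m128i mask1 = _mm_set1_epi8(0x77);
    const __m128i mask2 = _mm_set1_epi8(0x0F);
    __m128i n;

    /* Per nibble v: v - (v>>1) - (v>>2) - (v>>3) equals popcount(v).
     * The 0x77 mask drops bits shifted in from the neighbouring nibble. */
    n = _mm_and_si128(mask1, _mm_srli_epi64(x, 1));
    x = _mm_sub_epi8(x, n);
    n = _mm_and_si128(mask1, _mm_srli_epi64(n, 1));
    x = _mm_sub_epi8(x, n);
    n = _mm_and_si128(mask1, _mm_srli_epi64(n, 1));
    x = _mm_sub_epi8(x, n);

    /* Add the high-nibble count to the low-nibble count in each byte. */
    x = _mm_add_epi8(x, _mm_srli_epi16(x, 4));
    return _mm_and_si128(mask2, x);
}
```

Each output byte is in the range 0 to 8, so the per-byte results can later be summed horizontally without overflowing a byte for a single vector.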

Sum reduction of unsigned bytes without overflow, using SSE2 on Intel

May 4, 2023 by Tarik

You can abuse PSADBW to calculate horizontal sums of bytes without overflow. For example:

    pxor    xmm0, xmm0
    psadbw  xmm0, [a + 0]     ; sum in 2x 64-bit chunks
    pxor    xmm1, xmm1
    psadbw  xmm1, [a + 16]
    paddw   xmm0, xmm1        ; accumulate vertically
    pshufd  xmm1, xmm0, 2     ; bring down the high half
    paddw   xmm0, xmm1
    … Read more
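The same idea can be written with intrinsics, where `_mm_sad_epu8` maps to PSADBW. This is a sketch that mirrors the visible part of the assembly for a fixed 32-byte input; the function name `sum_bytes_32`, the fixed length, and the use of unaligned loads are assumptions of the sketch rather than part of the original answer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */
#include <stdint.h>

/* Hypothetical helper: sum the 32 unsigned bytes at a[0..31]. */
uint32_t sum_bytes_32(const uint8_t *a)
{
    const __m128i zero = _mm_setzero_si128();

    /* PSADBW against zero: each 64-bit half receives the sum of its 8 bytes. */
    __m128i s0 = _mm_sad_epu8(_mm_loadu_si128((const __m128i *)(a + 0)), zero);
    __m128i s1 = _mm_sad_epu8(_mm_loadu_si128((const __m128i *)(a + 16)), zero);

    /* Accumulate vertically; 32 * 255 = 8160 still fits in 16 bits. */
    __m128i s = _mm_add_epi16(s0, s1);

    /* Bring the high half's partial sum down to the low element and add. */
    __m128i hi = _mm_shuffle_epi32(s, 2);
    s = _mm_add_epi16(s, hi);

    return (uint32_t)_mm_cvtsi128_si32(s);
}
```

For longer arrays the 16-bit accumulation would eventually overflow, so a wider add (for example `_mm_add_epi64`) would be the safer choice there.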

Transpose an 8×8 float using AVX/AVX2

May 3, 2023 by Tarik

I already answered this in the question “Fast memory transpose with SSE, AVX, and OpenMP”. Let me repeat the solution for transposing an 8×8 float matrix with AVX. Let me know if this is any faster than using 4×4 blocks and _MM_TRANSPOSE4_PS. I used it for a kernel in a larger matrix transpose which was memory bound … Read more