Counting 1 bits (population count) on large data using AVX-512 or AVX-2

AVX2: @HadiBreis’ comment links to an article on fast population count with SSSE3, by Wojciech Muła; the article links to a GitHub repository, and the repository has the following AVX2 implementation. It is based on a vectorized lookup instruction, using a 16-entry lookup table that holds the bit counts of the nibbles.

#include <immintrin.h>
#include <x86intrin.h>

… Read more
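As a rough illustration of the idea (not the repository's code), here is a scalar model of the nibble-lookup technique: each byte is split into its low and high nibble, and each nibble indexes a 16-entry table of bit counts. The vectorized version performs 32 of these lookups at once per shuffle. The function name `popcount_bytes` is made up for this sketch.

```c
#include <stdint.h>
#include <stddef.h>

/* 16-entry table: popcount of every 4-bit value 0..15 */
static const uint8_t nibble_popcount[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar model of the vectorized nibble lookup: split each byte into
   two nibbles and sum the table entries for both. */
uint64_t popcount_bytes(const uint8_t *data, size_t len)
{
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += nibble_popcount[data[i] & 0x0F];
        total += nibble_popcount[data[i] >> 4];
    }
    return total;
}
```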

How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

For good throughput with multiple source vectors, it helps that _mm256_packs_epi16 takes 2 input vectors instead of producing a narrower output. (The AVX-512 _mm256_cvtepi32_epi8 isn’t necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, and the regular version gives you multiple small outputs that … Read more
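For orientation, the per-element clamping that the two-step pack (packs_epi32 then packs_epi16) applies can be modeled in scalar C. This is only a sketch of the saturation semantics, with a made-up function name; the real vector code also has to undo the in-lane ordering of the pack instructions.

```c
#include <stdint.h>

/* Scalar model of saturating int32 -> int8 narrowing, as done by a
   pack to int16 (clamp to [-32768, 32767]) followed by a pack to
   int8 (clamp to [-128, 127]). */
int8_t saturate_i32_to_i8(int32_t x)
{
    /* First stage: clamp to the signed 16-bit range. */
    if (x > INT16_MAX) x = INT16_MAX;
    if (x < INT16_MIN) x = INT16_MIN;
    /* Second stage: clamp to the signed 8-bit range. */
    if (x > INT8_MAX)  x = INT8_MAX;
    if (x < INT8_MIN)  x = INT8_MIN;
    return (int8_t)x;
}
```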

Transpose an 8×8 float using AVX/AVX2

I already answered this question in Fast memory transpose with SSE, AVX, and OpenMP. Let me repeat the solution for transposing an 8×8 float matrix with AVX. Let me know if this is any faster than using 4×4 blocks and _MM_TRANSPOSE4_PS. I used it for a kernel in a larger matrix transpose which was memory bound … Read more
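For reference, the operation being vectorized is the plain 8×8 transpose below; the AVX version replaces this double loop with a fixed sequence of unpack/shuffle/permute steps. The function name `transpose8x8_ref` is made up for this sketch.

```c
/* Reference 8x8 float transpose: out[j][i] = in[i][j].
   The AVX version computes the same result with in-register
   shuffles instead of scalar loads and stores. */
void transpose8x8_ref(float out[8][8], float in[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            out[j][i] = in[i][j];
}
```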

Emulating shifts on 32 bytes with AVX

From different inputs, I gathered these solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8. To emulate _mm256_slli_si256(A, N) as a full 256-bit byte shift:

0 < N < 16:  _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 - N)
N = 16:      _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0))
16 < N < 32: _mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, … Read more
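To make the target semantics concrete, here is a scalar reference for what the emulated shift should produce: the whole 32-byte register moves toward higher byte positions by N, with zeros shifted in. This whole-register behavior is exactly what _mm256_slli_si256 alone lacks, since it only shifts within each 128-bit lane. The function name is made up for this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Reference semantics of a full 256-bit byte shift left by n:
   out[i] = in[i - n] for i >= n, zero otherwise (n in 0..32). */
void shift_left_bytes_ref(uint8_t out[32], const uint8_t in[32], int n)
{
    memset(out, 0, 32);
    if (n < 32)
        memcpy(out + n, in, (size_t)(32 - n));
}
```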

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

Related: if you’re looking for the non-existent _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics; vpsadbw as an hsum within qwords is much more efficient than shuffling. Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel’s reduce_add helper function. reduce_add doesn’t necessarily compile optimally anyway with AVX512. There is an int _mm512_reduce_add_epi32(__m512i) inline … Read more
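The shape of a horizontal sum like hsum_8x32 can be sketched in scalar form: fold the 8 lanes in log2(8) = 3 halving steps, which mirrors the extract-high-half-and-add, then shuffle-and-add pattern of the vector version. This is a model of the reduction tree, not the intrinsic code itself.

```c
#include <stdint.h>

/* Scalar model of an 8-lane horizontal sum done as a reduction tree:
   add the top half onto the bottom half, 8 -> 4 -> 2 -> 1. */
int32_t hsum_8x32_ref(const int32_t v[8])
{
    int32_t t[8];
    for (int i = 0; i < 8; i++) t[i] = v[i];
    for (int width = 4; width >= 1; width /= 2)
        for (int i = 0; i < width; i++)
            t[i] += t[i + width];
    return t[0];
}
```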

Loading 8 chars from memory into an __m256 variable as packed single precision floats

If you’re using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256b register. From there, conversion to float can happen in-place.

; rsi = new_image
VPMOVZXBD ymm0, [rsi]   ; or SX to sign-extend (Byte to DWord)
VCVTDQ2PS ymm0, ymm0    ; convert to packed float

This is a good strategy … Read more
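The two-instruction sequence above computes, per element, the scalar operation below: zero-extend an unsigned byte to a 32-bit integer, then convert it to float. This reference version (with a made-up name) processes the same 8 elements the ymm register holds.

```c
#include <stdint.h>

/* Scalar model of VPMOVZXBD + VCVTDQ2PS: zero-extend each byte to
   32 bits, then convert to single-precision float. */
void bytes_to_floats_ref(float out[8], const uint8_t in[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (float)(uint32_t)in[i];
}
```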