How is a vector’s data aligned?

The C++ standard requires allocation functions (malloc() and operator new()) to allocate memory suitably aligned for any standard type. Because these functions don’t receive the alignment requirement as an argument, in practice the alignment is the same for all allocations: that of the standard type with the largest alignment requirement, which … Read more
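To see what that guarantee amounts to on a given platform, a small check along these lines may help. It is only a sketch: the helper names is_aligned and default_new_alignment are made up for illustration, and the fallback to alignof(std::max_align_t) assumes a pre-C++17 compiler that does not define __STDCPP_DEFAULT_NEW_ALIGNMENT__.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: the alignment that plain operator new is expected
// to provide for allocations that are large enough.
inline std::size_t default_new_alignment() {
#if defined(__STDCPP_DEFAULT_NEW_ALIGNMENT__)
    return __STDCPP_DEFAULT_NEW_ALIGNMENT__;   // C++17 and later
#else
    return alignof(std::max_align_t);          // assumption for older compilers
#endif
}
```

For a std::vector<float> v(1024) using the default allocator, is_aligned(v.data(), default_new_alignment()) should hold, but is_aligned(v.data(), 32) is not promised, which is why 32-byte aligned AVX loads on vector data usually need a custom allocator or unaligned load instructions.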

Using AVX CPU instructions: Poor performance without “/arch:AVX”

2021 update: Modern versions of MSVC don’t need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did. The behavior that you are seeing is the result of expensive state-switching. See page 102 of Agner Fog’s manual: http://www.agner.org/optimize/microarchitecture.pdf Every time you improperly switch back and forth between SSE and AVX instructions, you … Read more
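For reference, this is roughly where a manual _mm256_zeroupper() call would sit in intrinsics code. It is a sketch, not the code from the question: sum_floats is a hypothetical function, and the AVX path is guarded by __AVX__ (so it is only taken when the file is built with AVX enabled), with a scalar fallback otherwise.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical example: sums n floats, 8 at a time with AVX when available.
inline float sum_floats(const float* data, std::size_t n) {
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));  // unaligned load
    }
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);
    for (float x : lanes) total += x;
    // Clear the upper halves of the YMM registers before returning to code
    // that may execute legacy (non-VEX) SSE instructions.
    _mm256_zeroupper();
#endif
    for (; i < n; ++i) total += data[i];  // scalar tail (or the whole array)
    return total;
}
```

Compilers that generate VEX-encoded code for the whole translation unit typically insert the zeroing instruction themselves, so the explicit call here is redundant but harmless in that case.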

Loading 8 chars from memory into an __m256 variable as packed single precision floats

If you’re using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256-bit register. From there, conversion to float can happen in-place.

    ; rsi = new_image
    VPMOVZXBD   ymm0, [rsi]    ; or SX to sign-extend (Byte to DWord)
    VCVTDQ2PS   ymm0, ymm0     ; convert to packed float

This is a good strategy … Read more
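The same two instructions can be reached from C++ through intrinsics. The following is a minimal sketch under a few assumptions: bytes_to_floats8 is a hypothetical name, the input is treated as unsigned (zero-extension), and a scalar fallback is used when the file is not compiled with AVX2.

```cpp
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: converts 8 unsigned bytes at src into 8 floats at dst.
inline void bytes_to_floats8(const std::uint8_t* src, float* dst) {
#if defined(__AVX2__)
    // Load exactly 8 bytes into the low half of an XMM register.
    __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    __m256i ints = _mm256_cvtepu8_epi32(bytes);   // VPMOVZXBD: byte -> dword
    __m256 floats = _mm256_cvtepi32_ps(ints);     // VCVTDQ2PS: dword -> float
    _mm256_storeu_ps(dst, floats);                // unaligned store of 8 floats
#else
    for (int i = 0; i < 8; ++i) dst[i] = static_cast<float>(src[i]);
#endif
}
```

For signed chars, _mm256_cvtepi8_epi32 would be the sign-extending counterpart. Whether the compiler folds the 8-byte load into a memory operand of VPMOVZXBD, as in the assembly version, depends on the compiler and its version.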

SIMD prefix sum on Intel CPU

The fastest parallel prefix sum algorithm I know of is to run over the sum in two passes in parallel and use SSE as well in the second pass. In the first pass you calculate partial sums in parallel and store the total sum for each partial sum. In the second pass you add the … Read more
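The excerpt is cut off before the second pass is fully described, so the sketch below assumes the common two-pass formulation: each chunk is scanned on its own, and then every chunk gets the combined total of all earlier chunks added to it. The name prefix_sum_two_pass and the chunks parameter are made up for illustration, and the chunk loops run serially here; in a real version each chunk would go to its own thread and the second pass would add the offset with SSE.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: in-place inclusive prefix sum done in two passes.
inline void prefix_sum_two_pass(std::vector<float>& a, std::size_t chunks) {
    const std::size_t n = a.size();
    if (n == 0 || chunks == 0) return;
    const std::size_t len = (n + chunks - 1) / chunks;  // chunk length, rounded up
    std::vector<float> totals(chunks, 0.0f);

    // Pass 1: scan each chunk independently and record its total.
    // Chunks do not depend on each other, so this loop can be parallelized.
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t begin = std::min(n, c * len);
        const std::size_t end = std::min(n, begin + len);
        float running = 0.0f;
        for (std::size_t i = begin; i < end; ++i) {
            running += a[i];
            a[i] = running;
        }
        totals[c] = running;
    }

    // Pass 2: add the sum of all earlier chunk totals to every element of a chunk.
    // The inner loop adds one constant to a contiguous range, which vectorizes well.
    float offset = 0.0f;
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t begin = std::min(n, c * len);
        const std::size_t end = std::min(n, begin + len);
        for (std::size_t i = begin; i < end; ++i) a[i] += offset;
        offset += totals[c];
    }
}
```

Calling it on {1, 2, 3, 4, 5, 6, 7} with 3 chunks yields {1, 3, 6, 10, 15, 21, 28}.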