sse – Page 3 – Make Me Engineer

How to sum __m256 horizontally?

November 1, 2022 by Tarik

This version should be optimal for both Intel Sandy/Ivy Bridge and AMD Bulldozer, and later CPUs.

    // x = ( x7, x6, x5, x4, x3, x2, x1, x0 )
    float sum8(__m256 x) {
        // hiQuad = ( x7, x6, x5, x4 )
        const __m128 hiQuad = _mm256_extractf128_ps(x, 1);
        // loQuad = ( x3, x2, x1, … Read more

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

October 9, 2022 by Tarik

The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate). An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two. The IEEE and C standards allow this when … Read more

How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD [duplicate]

October 5, 2022 by Tarik

Per-element atomicity of vector load/store and gather/scatter?

October 5, 2022 by Tarik

What’s the difference between logical SSE intrinsics?

October 5, 2022 by Tarik

Is there any difference between using one or another intrinsic (with appropriate type casting)? Won’t there be any hidden costs like longer execution in some specific situation? Yes, there can be performance reasons to choose one vs. the other. 1: Sometimes there is an extra cycle or two of latency (forwarding delay) if the output … Read more

How is a vector’s data aligned?

September 4, 2022 by Tarik

The C++ standard requires allocation functions (malloc() and operator new()) to allocate memory suitably aligned for any standard type. As these functions don’t receive the alignment requirement as an argument, in practice it means that the alignment for all allocations is the same, and is that of a standard type with the largest alignment requirement, which … Read more
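
To make the excerpt above concrete, here is a minimal C11 sketch that checks a pointer against the strictest fundamental alignment, exposed as `max_align_t`. The helper name `is_max_aligned` is hypothetical and not from the original post. The sketch assumes a hosted implementation whose `malloc()` honours that alignment for ordinary block sizes.

```c
#include <stddef.h>   /* max_align_t */
#include <stdint.h>   /* uintptr_t */
#include <stdlib.h>   /* malloc, free */

/* Hypothetical helper: returns 1 if p is aligned to the strictest
 * fundamental alignment, i.e. the alignment of max_align_t. */
static int is_max_aligned(const void *p)
{
    return ((uintptr_t)p % _Alignof(max_align_t)) == 0;
}
```

On typical x86-64 systems this alignment is 16 bytes, which is less than the 32 bytes that aligned `__m256` loads and stores expect, so default-allocated storage is not guaranteed to be suitable for them.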

Using AVX CPU instructions: Poor performance without “/arch:AVX”

September 4, 2022 by Tarik

2021 update: Modern versions of MSVC don’t need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did. The behavior that you are seeing is the result of expensive state-switching. See page 102 of Agner Fog’s manual: http://www.agner.org/optimize/microarchitecture.pdf Every time you improperly switch back and forth between SSE and AVX instructions, you … Read more

Loading 8 chars from memory into an __m256 variable as packed single precision floats

September 3, 2022 by Tarik

If you’re using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256-bit register. From there, conversion to float can happen in-place.

    ; rsi = new_image
    VPMOVZXBD ymm0, [rsi]   ; or SX to sign-extend (Byte to DWord)
    VCVTDQ2PS ymm0, ymm0    ; convert to packed float

This is a good strategy … Read more

Is it possible to use SSE and SSE2 to make a 128-bit wide integer?

September 1, 2022 by Tarik

SIMD is meant to work on multiple small values at the same time, hence there won’t be any carry over to the higher unit and you must do that manually. In SSE2 there’s no carry flag but you can easily calculate the carry as carry = sum < a or carry = sum < b … Read more
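
The carry trick in the excerpt above can be illustrated with plain scalar C. This is only a sketch of the `carry = sum < a` comparison on two 64-bit halves, not SSE2 intrinsics code. The type `u128_parts` and the function `add_u128` are hypothetical names introduced for the example.

```c
#include <stdint.h>

/* Hypothetical 128-bit unsigned integer built from two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Add two 128-bit values, propagating the carry by hand. */
static u128_parts add_u128(u128_parts a, u128_parts b)
{
    u128_parts sum;
    sum.lo = a.lo + b.lo;              /* wraps modulo 2^64 on overflow */
    uint64_t carry = (sum.lo < a.lo);  /* 1 if the low half wrapped, else 0 */
    sum.hi = a.hi + b.hi + carry;      /* carry out of the high half is dropped */
    return sum;
}
```

Unsigned addition wraps, so the low sum ends up smaller than either operand exactly when an overflow occurred. That is why a single comparison is enough to recover the carry.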

SIMD prefix sum on Intel cpu

August 7, 2022 by Tarik

The fastest parallel prefix sum algorithm I know of is to run over the sum in two passes in parallel and use SSE as well in the second pass. In the first pass you calculate partial sums in parallel and store the total sum for each partial sum. In the second pass you add the … Read more
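
The two-pass structure described above can be sketched in sequential C. The excerpt is truncated, so this is an assumption about the usual shape of such a scan: the first pass scans each chunk independently and records its total, and the second pass adds the combined total of all preceding chunks. The threading and SSE parts are left out, and `NUM_CHUNKS` and `prefix_sum_two_pass` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical chunk count; a parallel version would give each chunk to one thread. */
#define NUM_CHUNKS 4

static void prefix_sum_two_pass(const float *in, float *out, size_t n)
{
    float chunk_total[NUM_CHUNKS];
    size_t chunk_len = (n + NUM_CHUNKS - 1) / NUM_CHUNKS;

    /* Pass 1: inclusive scan inside each chunk, remembering each chunk's total.
     * The chunks do not depend on each other, so this loop could run in parallel. */
    for (size_t c = 0; c < NUM_CHUNKS; ++c) {
        size_t begin = c * chunk_len;
        size_t end = (begin + chunk_len < n) ? begin + chunk_len : n;
        float running = 0.0f;
        for (size_t i = begin; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
        chunk_total[c] = running;
    }

    /* Pass 2: add the sum of all preceding chunk totals to every element.
     * The inner loop is a uniform add, which is the part that suits SIMD. */
    float offset = 0.0f;
    for (size_t c = 0; c < NUM_CHUNKS; ++c) {
        size_t begin = c * chunk_len;
        size_t end = (begin + chunk_len < n) ? begin + chunk_len : n;
        for (size_t i = begin; i < end; ++i) {
            out[i] += offset;
        }
        offset += chunk_total[c];
    }
}
```

For ten inputs 1 through 10 the chunks cover indices 0-2, 3-5, 6-8 and 9, and the second pass shifts them by 0, 6, 21 and 45 respectively.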