sse – Page 6 – Make Me Engineer

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

May 10, 2022 by Tarik

AVX2 what is the most efficient way to pack left based on a mask?

May 4, 2022 by Tarik

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64-bit builds.) We can use AVX2 vpermps (_mm256_permutevar8x32_ps), or the integer equivalent vpermd, to do a lane-crossing variable shuffle. We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we …
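The excerpt is cut off, so the following is only a minimal sketch of the pdep/pext idea it describes, not necessarily the code from the full post. The helper name `left_pack8` is hypothetical. The sketch assumes an x86-64 CPU with AVX2 and BMI2, and GCC or Clang for the `target` attribute and `__builtin_popcount` (otherwise compile with `-mavx2 -mbmi2` and drop the attribute).

```c
#include <assert.h>
#include <stdint.h>
#include <immintrin.h>

/* Hypothetical helper: move the floats of src[0..7] whose mask bit is set
   to the front of dst[0..7], preserving order. Returns how many were kept.
   The remaining dst slots are filled with copies of src[0]. */
__attribute__((target("avx2,bmi2")))
static int left_pack8(const float *src, unsigned mask, float *dst)
{
    assert(mask < 256u);  /* one mask bit per 32-bit element */

    /* Spread mask bit i to the low bit of byte i, then widen it to 0xFF. */
    uint64_t byte_mask = _pdep_u64(mask, 0x0101010101010101ULL);
    byte_mask *= 0xFF;

    /* Identity shuffle, one index per byte; pext keeps only the selected
       indices and packs them toward the low end. */
    const uint64_t identity = 0x0706050403020100ULL;
    uint64_t packed = _pext_u64(identity, byte_mask);

    /* Widen the 8 byte indices to 8 dword indices for vpermps. */
    __m128i idx8  = _mm_cvtsi64_si128((long long)packed);
    __m256i idx32 = _mm256_cvtepu8_epi32(idx8);

    __m256 v = _mm256_loadu_ps(src);
    _mm256_storeu_ps(dst, _mm256_permutevar8x32_ps(v, idx32));
    return __builtin_popcount(mask);
}
```

In real code the mask would typically come from a compare followed by `_mm256_movemask_ps`, and the return value tells the caller how far to advance the output pointer.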

Why does mulss take only 3 cycles on Haswell, different from Agner’s instruction tables? (Unrolling FP loops with multiple accumulators)

April 28, 2022 by Tarik

Fastest way to do horizontal SSE vector sum (or other reduction)

April 23, 2022 by Tarik

In general, for any kind of vector horizontal reduction: extract / shuffle the high half to line up with the low half, then do a vertical add (or min/max/or/and/xor/multiply/whatever); repeat until there’s just a single element (with garbage in the upper elements of the vector). If you start with vectors wider than 128-bit, narrow in half until you get …
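As an illustration of the halve-and-add pattern for a single 128-bit vector of four floats, here is a minimal sketch using only SSE1 intrinsics. The function names `hsum_ps` and `hsum4` are hypothetical, and the full post may use a different shuffle sequence.

```c
#include <xmmintrin.h>  /* SSE1 */

/* Hypothetical helper: sum the four floats of v. */
static inline float hsum_ps(__m128 v)
{
    /* Step 1: bring the high half (v2, v3) down next to the low half. */
    __m128 hi   = _mm_movehl_ps(v, v);       /* [v2, v3, v2, v3] */
    __m128 sum2 = _mm_add_ps(v, hi);         /* [v0+v2, v1+v3, ...] */

    /* Step 2: bring element 1 down to element 0 and add once more.
       Only element 0 of the result is meaningful. */
    __m128 hi1  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 sum1 = _mm_add_ss(sum2, hi1);     /* elem0 = (v0+v2)+(v1+v3) */

    return _mm_cvtss_f32(sum1);
}

/* Convenience wrapper: sum four floats from (possibly unaligned) memory. */
static inline float hsum4(const float *p)
{
    return hsum_ps(_mm_loadu_ps(p));
}
```

Swapping `_mm_add_ps` / `_mm_add_ss` for `_mm_max_ps` / `_mm_max_ss` gives a horizontal max with the same shuffles. Note that the additions are grouped as (v0+v2)+(v1+v3), so the rounding can differ slightly from a left-to-right scalar sum.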