AVX2: what is the most efficient way to pack left based on a mask?

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64-bit builds.) We can use AVX2 vpermps (_mm256_permutevar8x32_ps), or the integer equivalent vpermd, to do a lane-crossing variable shuffle. We can generate the shuffle masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we …
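The excerpt is cut off before any code, so here is a minimal sketch of one way the idea could look in C, not necessarily the answer's exact code. It assumes the 8-bit mask comes from something like movmskps, expands it to one byte per element with pdep, then uses pext to left-pack the byte-sized identity indices 0..7 into a shuffle control for vpermps. The names left_pack_indices, left_pack8, PDEP64 and PEXT64 are hypothetical, and the scalar fallback is only there so the sketch also builds without -mavx2 -mbmi2.

```c
#include <stdint.h>

#if defined(__AVX2__) && defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#define USE_AVX2_BMI2 1
#define PDEP64(src, mask) _pdep_u64((src), (mask))
#define PEXT64(src, mask) _pext_u64((src), (mask))
#else
/* Portable stand-in for pdep: scatter the low bits of src to the
   positions of the set bits in mask, lowest first. */
static uint64_t pdep64_soft(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask; bit <<= 1, mask &= mask - 1) {
        if (src & bit) out |= mask & (~mask + 1);  /* lowest set bit of mask */
    }
    return out;
}

/* Portable stand-in for pext: gather the bits of src selected by mask
   into the low bits of the result. */
static uint64_t pext64_soft(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    unsigned k = 0;
    for (; mask; mask &= mask - 1, k++) {
        if (src & mask & (~mask + 1)) out |= (uint64_t)1 << k;
    }
    return out;
}
#define PDEP64(src, mask) pdep64_soft((src), (mask))
#define PEXT64(src, mask) pext64_soft((src), (mask))
#endif

/* Build eight byte-sized shuffle indices from an 8-bit element mask.
   Byte i of the result is the source index for output element i. */
uint64_t left_pack_indices(unsigned mask) {
    /* Put mask bit i into bit 0 of byte i. */
    uint64_t expanded = PDEP64((uint64_t)(mask & 0xFFu), 0x0101010101010101ULL);
    /* Turn each 0x01 byte into 0xFF; there are no carries between bytes. */
    expanded *= 0xFF;
    /* Identity shuffle, one index per byte; keep only the selected bytes. */
    const uint64_t identity = 0x0706050403020100ULL;
    return PEXT64(identity, expanded);
}

/* Left-pack the elements of src[0..7] whose mask bit is set into dst[0..7]. */
void left_pack8(const float *src, unsigned mask, float *dst) {
    uint64_t idx = left_pack_indices(mask);
#ifdef USE_AVX2_BMI2
    /* Widen the eight index bytes to eight 32-bit lanes for vpermps. */
    __m256i shuf = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128((long long)idx));
    __m256 v = _mm256_loadu_ps(src);
    _mm256_storeu_ps(dst, _mm256_permutevar8x32_ps(v, shuf));
#else
    /* Scalar model of vpermps: only the low 3 bits of each index are used. */
    for (int i = 0; i < 8; i++) {
        dst[i] = src[(idx >> (8 * i)) & 7];
    }
#endif
}
```

In this sketch the output slots past the number of set mask bits all get index 0, so they hold copies of src[0]; a caller would take the popcount of the mask to know how many leading elements are valid.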