How to sum __m256 horizontally?
This version should be optimal for both Intel Sandy/Ivy Bridge and AMD Bulldozer, and later CPUs.

// x = ( x7, x6, x5, x4, x3, x2, x1, x0 )
float sum8(__m256 x) {
    // hiQuad = ( x7, x6, x5, x4 )
    const __m128 hiQuad = _mm256_extractf128_ps(x, 1);
    // loQuad = ( x3, x2, x1, …
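Since the answer above is cut off, here is a minimal, self-contained sketch of one common way to reduce a __m256 to a single float: fold the upper 128-bit half onto the lower half, then fold four lanes to two, then two to one. It is not necessarily how the truncated answer continues. The names hsum256_ps_sketch, sum8_from_array_sketch and SKETCH_AVX are hypothetical and only for illustration, and the target attribute assumes GCC or Clang (on other compilers, enable AVX with the compiler's own flag).

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. The target attribute lets these functions use
   AVX without passing -mavx for the whole file. */
#if defined(__GNUC__)
#define SKETCH_AVX __attribute__((target("avx")))
#else
#define SKETCH_AVX
#endif

/* Hypothetical helper: returns x0 + x1 + ... + x7 for
   v = ( x7, x6, x5, x4, x3, x2, x1, x0 ). */
SKETCH_AVX static inline float hsum256_ps_sketch(__m256 v)
{
    /* lo = ( x3, x2, x1, x0 ), hi = ( x7, x6, x5, x4 ) */
    const __m128 lo = _mm256_castps256_ps128(v);
    const __m128 hi = _mm256_extractf128_ps(v, 1);
    /* quad = ( x3+x7, x2+x6, x1+x5, x0+x4 ) */
    const __m128 quad = _mm_add_ps(lo, hi);
    /* Move the upper two lanes of quad down to lanes 1 and 0. */
    const __m128 upper = _mm_movehl_ps(quad, quad);
    /* Lanes 1 and 0 of dual now each hold the sum of four inputs. */
    const __m128 dual = _mm_add_ps(quad, upper);
    /* Bring lane 1 down to lane 0 and add the last pair. */
    const __m128 lane1 = _mm_shuffle_ps(dual, dual, 0x1);
    const __m128 sum = _mm_add_ss(dual, lane1);
    return _mm_cvtss_f32(sum);
}

/* Hypothetical wrapper: loads eight (possibly unaligned) floats and sums them. */
SKETCH_AVX float sum8_from_array_sketch(const float *p)
{
    return hsum256_ps_sketch(_mm256_loadu_ps(p));
}
```

Note that the additions happen in a tree order, ((x0+x4)+(x2+x6)) + ((x1+x5)+(x3+x7)), so the result can differ in the last bits from a plain left-to-right loop over the same eight floats.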