Sum reduction of unsigned bytes without overflow, using SSE2 on Intel

You can abuse PSADBW to calculate horizontal sums of bytes without overflow: the sum of absolute differences against an all-zero register is just the sum of the bytes themselves, produced as a 16-bit result in the bottom of each 64-bit half. For example:

pxor    xmm0, xmm0
psadbw  xmm0, [a + 0]     ; sum in 2x 64-bit chunks
pxor    xmm1, xmm1
psadbw  xmm1, [a + 16]
paddw   xmm0, xmm1        ; accumulate vertically
pshufd  xmm1, xmm0, 2     ; bring down the high half
paddw   xmm0, xmm1   ; low word in xmm0 is the total sum
; movd  eax, xmm0    ; higher bytes are zero so efficient dword extract is fine

Intrinsics version:

#include <immintrin.h>
#include <stdint.h>

// use loadu instead of load if 16-byte alignment of a[] isn't guaranteed
unsigned sum_32x8(const uint8_t a[32])
{
    __m128i zero = _mm_setzero_si128();
    __m128i sum0 = _mm_sad_epu8( zero,
                        _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8( zero,
                        _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi32(sum0, sum1);
    __m128i totalsum = _mm_add_epi32(sum2, _mm_shuffle_epi32(sum2, 2));
    return _mm_cvtsi128_si32(totalsum);
}

This portably compiles back to essentially the same asm (with paddd instead of paddw, which is equally fast), as you can see on Godbolt.

The reinterpret_cast<const __m128i*> is necessary because the integer vector load/store intrinsics before AVX-512 take __m128i* pointer args instead of a more convenient const void*. Some people prefer more compact C-style casts, like _mm_loadu_si128( (const __m128i*) &a[16] ), as a style choice.
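For example, an unaligned-load version of the same function written with C-style casts (the function name here is just illustrative; same includes as above) could look like:

// Same sum as sum_32x8, but with movdqu loads so a[] doesn't have to be
// 16-byte aligned, and C-style casts instead of reinterpret_cast.
unsigned sum_32x8_unaligned(const uint8_t a[32])
{
    __m128i zero = _mm_setzero_si128();
    __m128i sum0 = _mm_sad_epu8(zero, _mm_loadu_si128((const __m128i*) a));
    __m128i sum1 = _mm_sad_epu8(zero, _mm_loadu_si128((const __m128i*) &a[16]));
    __m128i sum2 = _mm_add_epi32(sum0, sum1);
    __m128i totalsum = _mm_add_epi32(sum2, _mm_shuffle_epi32(sum2, 2));
    return _mm_cvtsi128_si32(totalsum);
}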

The SIMD element size for the adds (16 vs. 32 vs. 64-bit) doesn't matter much: paddw and paddd are equally efficient on all machines, and 32-bit accumulation avoids overflow even if you use this pattern to sum much larger arrays, as sketched below. (paddq is slower on some old CPUs like Core 2; see https://agner.org/optimize/ and https://uops.info/.) Extracting the result as 32-bit with movd is definitely more efficient than _mm_extract_epi16 (pextrw).
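For arrays bigger than 32 bytes, the same trick works in a loop: one psadbw per 16-byte vector, with the partial sums accumulated using paddd. A minimal sketch (the function name and loop structure are my own; it assumes n is a multiple of 16 and skips any scalar tail handling):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Sum n bytes, n a multiple of 16. The result is exact as long as the total
// fits in 32 bits (worst case roughly 16 MiB of 0xFF bytes); for bigger sums,
// accumulate with _mm_add_epi64 and return uint64_t instead.
unsigned sum_bytes(const uint8_t *a, size_t n)
{
    __m128i zero = _mm_setzero_si128();
    __m128i acc  = zero;
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&a[i]));
        acc = _mm_add_epi32(acc, _mm_sad_epu8(v, zero));  // two partial sums per vector
    }
    // fold the high 64-bit half onto the low half, as in the 32-byte version
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, 2));
    return _mm_cvtsi128_si32(acc);
}

On big inputs, unrolling with more than one accumulator vector can hide psadbw latency; the single-accumulator loop above just keeps the sketch short.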
