TL;DR: I made an experimental analysis on Numpy 1.21.1. Experimental results show that np.sum
does NOT (really) make use of SIMD instructions: no SIMD instruction are used for integers, and scalar SIMD instructions are used for floating-point numbers! Moreover, Numpy converts the integers to 64-bits values for smaller integer types by default so to avoid overflows!
Note that this may not reflect all Numpy versions since there is an ongoing work to provide SIMD support for commonly used functions (the version Numpy 1.22.0rc1 not yet released continue this long-standing work). Moreover, the compiler or the processor used may significantly impact the results. The following experiments have been done using a Numpy retrieved from pip on a Debian Linux with a i5-9600KF processor.
Under the hood of np.sum
For floating-point numbers, Numpy uses a pairwise algorithm which is known to be quite numerically stable while being relatively fast. This can be seen in the code, but also simply using a profiler: TYPE_pairwise_sum
is the C function called to compute the sum at runtime (where TYPE
is DOUBLE
or FLOAT
).
For integers, Numpy use a classical naive reduction. The C function called is ULONG_add_avx2
on AVX2-compatible machines. It also surprisingly convert items to 64-bit ones if the type is not np.int64
.
Here is the hot part of the assembly code executed by the DOUBLE_pairwise_sum
function
3,65 │ a0:┌─→add $0x8,%rcx ; Loop iterator
3,60 │ │ prefetcht0 (%r8,%rax,1) ; Prefetch data
9,46 │ │ vaddsd (%rax,%rbp,1),%xmm1,%xmm1 ; Accumulate an item in xmm1[0]
4,65 │ │ vaddsd (%rax),%xmm0,%xmm0 ; Same for xmm0[0]
6,91 │ │ vaddsd (%rax,%rdx,1),%xmm4,%xmm4 ; Etc.
7,77 │ │ vaddsd (%rax,%rdx,2),%xmm7,%xmm7
7,41 │ │ vaddsd (%rax,%r10,1),%xmm2,%xmm2
7,27 │ │ vaddsd (%rax,%rdx,4),%xmm6,%xmm6
6,80 │ │ vaddsd (%rax,%r11,1),%xmm3,%xmm3
7,46 │ │ vaddsd (%rax,%rbx,1),%xmm5,%xmm5
3,46 │ │ add %r12,%rax ; Switch to the next items (x8)
0,13 │ ├──cmp %rcx,%r9 ; Should the loop continue?
3,27 │ └──jg a0 ; Jump to the beginning if so
The Numpy compiled code clearly uses the scalar SIMD instruction vaddsd
(computing only a single double-precision item) although it successfully unroll the loop 8 times. The same code is generated for FLOAT_pairwise_sum
: vaddss
is called 8 times.
For the np.uint32
, here is the hot part of the generated assembly code:
2,37 │160:┌─→add $0x1,%rax ; Loop iterator
95,95 │ │ add (%rdi),%rdx ; Accumulate the values in %rdx
0,06 │ │ add %r10,%rdi ; Switch to the next item
│ ├──cmp %rsi,%rax ; Should the loop continue?
1,08 │ └──jne 160 ; Jump to the beginning if so
Numpy obviously does not use SIMD instructions for np.uint32
type. It does not even unroll the loop. The add (%rdi),%rdx
instruction performing the accumulation is the bottleneck here due to the data dependency on the accumulator. The same loops can be seen for np.uint64(despite the name of the function is
ULONG_add_avx2`).
However, the np.uint32
version call the C function _aligned_contig_cast_uint_to_ulong
in order to convert the integer items to a wider type. Numpy does that to avoid integer overflows. The same thing can be seen for the types np.uint8
and np.uint16
(although the name of the function differs). Hopefully, this function makes use of SIMD instructions (SSE) but still takes a significant portion of the execution time (~30% of the np.sum
time).
EDIT: as pointed out by @user2357112supportsMonica, the dtype
parameter of np.sum
can be explicitly specified. When it match with the dtype
of the input array, then the conversion is not performed. This results in a smaller execution time at the expense of a possible overflow.
Benchmark results
Here is the result on my machine:
uint16: 7.17 µs ± 80 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
uint32: 7.11 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
uint64: 5.05 µs ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
float32: 2.88 µs ± 9.27 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
float64: 3.06 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
First of all, not that the results are very similar to the ones provided in the question meaning that the behavior seen on my machine can be successfully reproduced on other one. Thus, the explanation should also be consistent.
As you can see, the 64-bit version is faster than the other integer-based versions. This is due to the overhead of the conversion. The two first are equally+fast because of the scalar loop and the add
instruction being equally-fast for 8-bit, 16-bit and 32-bit integers (this should be true for most 64-bit mainstream platforms). The integer implementations are slower than the floating-point ones because of the lack of (proper) loop unrolling.
The floating-point implementations are equally-fast due to the scalar SIMD instructions. Indeed, the instructions vaddss
(for np.float32
) and vaddsd
(for np.float64
) have the same latency and throughput (at least on all modern Intel processors). Thus the throughput of the two implementation is the same since the loop of the two implementation is similar (same unrolling).
Additional notes
This is sad np.sum
does not fully make use of SIMD instructions as this would speed up a lot the computations using it (especially small integers).
[UPDATE] Looking at the Numpy code, I discovered that the code is not vectorized because the array stride is a runtime value and the compiler do not generate a specialized version where the stride is 1. In fact, this can be partially seen in the previous assembly code: the compiler used the instruction add %r10, %rdi
because %r10
(the stride of the target array) is not known at compile-time. There is currently no optimization for this specific case of reduction in the Numpy code yet (the functions are relatively generic). This may change in a near future.
In addition to the stride problem, one big point makes it hard for compilers to automatically vectorize a code: the floating-point addition is not commutative, nor associative (unless flags like -ffast-math
are used).