parallel prefix (cumulative) sum with SSE
This is the first time I’m answering my own question, but it seems appropriate. Based on hirschhornsalz's answer for a prefix sum on 16 bytes (simd-prefix-sum-on-intel-cpu), I have come up with a solution that uses SIMD on the first pass for 4, 8, and 16 32-bit words. The general theory goes as follows. For a sequential …
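To make the 4-word case concrete, here is a minimal sketch of an in-register inclusive scan on four 32-bit integers, in the shift-and-add style of the answer referenced above. It is not the code from the full answer: the function names `scan4_epi32` and `prefix_sum_sse2`, the choice of `int32_t` rather than `float`, the use of SSE2 integer intrinsics, and the carry handling between blocks are all my own assumptions for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Inclusive prefix sum of the four 32-bit lanes of x (hypothetical helper).
   Input lanes [a, b, c, d] become [a, a+b, a+b+c, a+b+c+d]. */
static inline __m128i scan4_epi32(__m128i x) {
    /* Shift left by one lane (4 bytes) and add: [a, a+b, b+c, c+d] */
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));
    /* Shift left by two lanes (8 bytes) and add: [a, a+b, a+b+c, a+b+c+d] */
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));
    return x;
}

/* Inclusive prefix sum of in[0..n) written to out[0..n) (hypothetical name).
   Each block of four is scanned in-register, then offset by the running
   total carried over from the previous block. */
void prefix_sum_sse2(const int32_t *in, int32_t *out, size_t n) {
    assert(n == 0 || (in != NULL && out != NULL));

    __m128i carry = _mm_setzero_si128();  /* running total in every lane */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)(in + i));
        x = _mm_add_epi32(scan4_epi32(x), carry);
        _mm_storeu_si128((__m128i *)(out + i), x);
        /* Broadcast the last lane (the block's total) to all lanes. */
        carry = _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 3, 3));
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    int32_t running = _mm_cvtsi128_si32(carry);
    for (; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}
```

Note that the carry makes each block depend on the one before it, so this sketch is still serial across blocks; it only shows the in-register part. Unaligned loads and stores are used so the caller does not need 16-byte-aligned buffers.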