Parallel prefix (cumulative) sum with SSE

This is the first time I’m answering my own question, but it seems appropriate. Based on hirschhornsalz’s answer for a prefix sum on 16 bytes (simd-prefix-sum-on-intel-cpu), I have come up with a solution that uses SIMD on the first pass for 4, 8, and 16 32-bit words. The general theory goes as follows. For a sequential … Read more
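The excerpt is cut off before the answer's own code, so the following is only a minimal sketch of the general idea for 4 lanes, not the author's solution. It assumes an x86 target with SSE2, 32-bit floats, and an array length that is a multiple of 4; the names `scan4_ps` and `prefix_sum_sse` are hypothetical. The in-register scan uses two shift-and-add steps, and the running total is carried from one group of four to the next.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Inclusive prefix sum of the four lanes of x:
   (a, b, c, d) -> (a, a+b, a+b+c, a+b+c+d).
   Hypothetical helper name, used only for this sketch. */
static __m128 scan4_ps(__m128 x)
{
    /* Shift up by one lane (4 bytes, zero-filled) and add:
       (a, b, c, d) + (0, a, b, c) = (a, a+b, b+c, c+d). */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    /* Shift up by two lanes (8 bytes) and add:
       result is (a, a+b, a+b+c, a+b+c+d). */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

/* In-place inclusive prefix sum over a[0..n).
   Assumption of this sketch: n is a multiple of 4. */
static void prefix_sum_sse(float *a, size_t n)
{
    assert(n % 4 == 0);
    __m128 offset = _mm_setzero_ps();  /* running total of all previous groups */
    for (size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(&a[i]);
        x = _mm_add_ps(scan4_ps(x), offset);
        _mm_storeu_ps(&a[i], x);
        /* Broadcast the last lane so it becomes the carry for the next group. */
        offset = _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 3, 3, 3));
    }
}
```

Calling `prefix_sum_sse(a, 8)` on an array holding 1 through 8 should leave the running sums in place. A multi-threaded version would presumably apply a pass like this to each thread's chunk and then add per-chunk offsets in a second pass, but that part is not shown in the excerpt.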

SSE integer division?

Math says that it is indeed possible to go faster. Agner Fog’s method (http://www.agner.org/optimize/#vectorclass) works great if the division is done with a single divisor. It has further benefits if the divisor is known at compile time, or if it doesn’t change often at runtime. However, when performing SIMD division on __m128i elements … Read more
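To illustrate why a single, rarely changing divisor helps, here is a small scalar sketch of the underlying multiply-and-shift idea. It is not Agner Fog's vector class code; the type `fastdiv16` and the functions `fastdiv16_init` and `fastdiv16_div` are hypothetical names, and the sketch is restricted to unsigned 16-bit operands. The expensive step (one real division to compute a scaled reciprocal) happens once per divisor, after which every quotient costs a multiplication and a shift.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed state for dividing unsigned 16-bit values by a fixed divisor.
   Hypothetical type, introduced only for this sketch. */
typedef struct {
    uint64_t magic;  /* ceil(2^32 / d) */
} fastdiv16;

/* One real division per divisor: compute ceil(2^32 / d). */
static fastdiv16 fastdiv16_init(uint16_t d)
{
    assert(d != 0);
    /* floor((2^32 - 1) / d) + 1 equals ceil(2^32 / d) for every d >= 1. */
    fastdiv16 f = { UINT64_C(0xFFFFFFFF) / d + 1 };
    return f;
}

/* n / d computed as a multiply and a shift.
   With 32 fractional bits the result is exact for all 16-bit n and d. */
static uint16_t fastdiv16_div(uint16_t n, fastdiv16 f)
{
    return (uint16_t)((f.magic * n) >> 32);
}
```

The same structure carries over to SIMD in principle: the reciprocal is broadcast to all lanes once, and each vector of dividends then needs only a high multiply and a shift. The exact instruction sequence depends on element width and signedness.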

Do any JVM’s JIT compilers generate code that uses vectorized floating point instructions?

So, basically, you want your code to run faster. JNI is the answer. I know you said it didn’t work for you, but let me show you that you are wrong. Here’s Dot.java: import java.nio.FloatBuffer; import org.bytedeco.javacpp.*; import org.bytedeco.javacpp.annotation.*; @Platform(include = "Dot.h", compiler = "fastfpu") public class Dot { static { Loader.load(); } static float[] … Read more