sse – Page 4 – Make Me Engineer

parallel prefix (cumulative) sum with SSE

August 1, 2022 by Tarik

This is the first time I’m answering my own question, but it seems appropriate. Based on hirschhornsalz’s answer for a prefix sum on 16 bytes (simd-prefix-sum-on-intel-cpu), I have come up with a solution that uses SIMD on the first pass for 4, 8, and 16 32-bit words. The general theory goes as follows. For a sequential …
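The excerpt above stops before the method itself, so what follows is only a minimal sketch of the general idea and not the post's code. It assumes 32-bit integer elements, SSE2, and a length that is a multiple of 4; the names `scan4_epi32` and `prefix_sum_sse` are made up for illustration. Each group of four lanes is scanned with two shift-and-add steps, and the running total is carried into the next group.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Inclusive prefix sum of the four 32-bit lanes of x (hypothetical helper).
 * [a0,a1,a2,a3] -> [a0, a0+a1, a0+a1+a2, a0+a1+a2+a3] */
static __m128i scan4_epi32(__m128i x)
{
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));  /* add lanes shifted up by one */
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));  /* add lanes shifted up by two */
    return x;
}

/* In-place inclusive prefix sum of a[0..n).
 * Assumption for this sketch: n is a multiple of 4. */
void prefix_sum_sse(int32_t *a, size_t n)
{
    __m128i offset = _mm_setzero_si128();  /* total of all previous groups */
    for (size_t i = 0; i < n; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)(a + i));
        x = _mm_add_epi32(scan4_epi32(x), offset);
        _mm_storeu_si128((__m128i *)(a + i), x);
        /* broadcast the last lane so it becomes the next group's offset */
        offset = _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 3, 3));
    }
}
```

Calling `prefix_sum_sse` on `{1, 2, 3, 4, 5, 6, 7, 8}` leaves `{1, 3, 6, 10, 15, 21, 28, 36}` in the array. The dependency through `offset` keeps this loop sequential across groups; only the work inside each group is done in parallel.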

How do I enable SSE for my freestanding bootable code?

July 17, 2022 by Tarik

SSE integer division?

July 17, 2022 by Tarik

Math says that it is indeed possible to go faster. Agner Fog’s method (http://www.agner.org/optimize/#vectorclass) works great if division is done with a single divisor. It has further benefits if the divisor is known at compile time, or if it doesn’t change often at runtime. However, when performing SIMD division on __m128i elements …
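To illustrate the single-divisor case mentioned above, here is a minimal sketch of division by a compile-time constant using a multiply and a shift. It is not taken from Agner Fog's library; the function name `div10_epu16`, the choice of divisor 10, and the 16-bit unsigned element type are assumptions made for the example. The constant 0xCCCD is ceil(2^19 / 10), so `(x * 0xCCCD) >> 19` equals `x / 10` for every 16-bit `x`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Divide eight unsigned 16-bit values by the constant 10 (illustrative only). */
void div10_epu16(const uint16_t in[8], uint16_t out[8])
{
    const __m128i magic = _mm_set1_epi16((short)0xCCCD);  /* ceil(2^19 / 10) */
    __m128i x = _mm_loadu_si128((const __m128i *)in);
    /* mulhi keeps the upper 16 bits of each 32-bit product (a shift by 16);
     * the extra shift by 3 completes the total shift by 19. */
    __m128i q = _mm_srli_epi16(_mm_mulhi_epu16(x, magic), 3);
    _mm_storeu_si128((__m128i *)out, q);
}
```

A divisor that varies per element cannot reuse one precomputed constant like this, which is the harder case the excerpt goes on to raise.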

Do any JVM’s JIT compilers generate code that uses vectorized floating point instructions?

July 13, 2022 by Tarik

So, basically, you want your code to run faster. JNI is the answer. I know you said it didn’t work for you, but let me show you that you are wrong. Here’s Dot.java: import java.nio.FloatBuffer; import org.bytedeco.javacpp.*; import org.bytedeco.javacpp.annotation.*; @Platform(include = "Dot.h", compiler = "fastfpu") public class Dot { static { Loader.load(); } static float[] …

Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?

July 12, 2022 by Tarik

practical BigNum AVX/SSE possible?

July 12, 2022 by Tarik

I think it may be possible to implement BigNum with SIMD efficiently, but not in the way you suggest. Instead of implementing a single BigNum using a SIMD register (or an array of SIMD registers), you should process multiple BigNums at once. Let’s consider 128-bit addition. Let 128-bit integers be defined by a pair …
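The excerpt is cut off before its definition, so the following is a hedged sketch of the "multiple BigNums at once" idea rather than the post's code. It assumes each 128-bit integer is split into a low and a high 64-bit half, and that two such integers are processed together with the low halves in one SSE2 register and the high halves in another. The type `u128x2` and the function names are hypothetical. SSE2 has no unsigned 64-bit compare, so the carry is taken from the top bit of `(a & b) | ((a | b) & ~sum)`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Two independent 128-bit integers: lane k of lo/hi belongs to number k. */
typedef struct {
    __m128i lo;
    __m128i hi;
} u128x2;

/* Add two pairs of 128-bit integers lane-wise (wraps modulo 2^128). */
u128x2 add_u128x2(u128x2 a, u128x2 b)
{
    u128x2 r;
    r.lo = _mm_add_epi64(a.lo, b.lo);
    /* carry out of bit 63 is the top bit of (a & b) | ((a | b) & ~sum) */
    __m128i carry = _mm_or_si128(
        _mm_and_si128(a.lo, b.lo),
        _mm_andnot_si128(r.lo, _mm_or_si128(a.lo, b.lo)));
    carry = _mm_srli_epi64(carry, 63);  /* 0 or 1 in each 64-bit lane */
    r.hi = _mm_add_epi64(_mm_add_epi64(a.hi, b.hi), carry);
    return r;
}

/* Array-based wrapper so the result can be inspected from scalar code. */
void add_two_u128(const uint64_t a_lo[2], const uint64_t a_hi[2],
                  const uint64_t b_lo[2], const uint64_t b_hi[2],
                  uint64_t r_lo[2], uint64_t r_hi[2])
{
    u128x2 a = { _mm_loadu_si128((const __m128i *)a_lo),
                 _mm_loadu_si128((const __m128i *)a_hi) };
    u128x2 b = { _mm_loadu_si128((const __m128i *)b_lo),
                 _mm_loadu_si128((const __m128i *)b_hi) };
    u128x2 r = add_u128x2(a, b);
    _mm_storeu_si128((__m128i *)r_lo, r.lo);
    _mm_storeu_si128((__m128i *)r_hi, r.hi);
}
```

Because the carry never crosses between lanes, every lane does the same work on a different number, which is what makes this layout a better fit for SIMD than spreading one BigNum across a register.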

Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?

July 9, 2022 by Tarik

How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel’s intrinsics?

July 4, 2022 by Tarik

SIMD signed with unsigned multiplication for 64-bit * 64-bit to 128-bit

July 1, 2022 by Tarik

Non-temporal loads and the hardware prefetcher, do they work together?

June 28, 2022 by Tarik