sse – Page 2 – Make Me Engineer

SSE, intrinsics, and alignment

December 2, 2022 by Tarik

First of all, you have to take care of two types of memory allocation. Static allocation: for automatic variables to be properly aligned, your type needs a proper alignment specification (e.g. __declspec(align(16)), __attribute__((aligned(16))), or your _MM_ALIGN16). Fortunately, you only need this if the alignment requirements given by the type’s members (if any) are not sufficient. … Read more
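
As a rough illustration of the alignment specification mentioned in this excerpt, here is a minimal sketch that uses the standard C11 `alignas` spelling instead of the compiler-specific attributes. The type name `vec4_aligned` and the helper `stack_vec4_is_aligned` are hypothetical and exist only for this example.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical type: four floats forced onto a 16-byte boundary.
   Without alignas, a struct of floats would only need 4-byte alignment. */
typedef struct {
    alignas(16) float v[4];
} vec4_aligned;

/* Returns 1 if an automatic (stack) variable of this type
   ends up on a 16-byte boundary, 0 otherwise. */
int stack_vec4_is_aligned(void)
{
    vec4_aligned local = { { 1.0f, 2.0f, 3.0f, 4.0f } };
    return ((uintptr_t)(const void *)&local) % 16 == 0;
}
```

Because the alignment is part of the type, every automatic or static object of that type inherits it; dynamically allocated memory is a separate matter and is not covered by this sketch.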

How to determine if memory is aligned?

November 30, 2022 by Tarik

#define is_aligned(POINTER, BYTE_COUNT) \
    (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)

The cast to void * (or, equivalently, char *) is necessary because the standard only guarantees an invertible conversion to uintptr_t for void *. If you want type safety, consider using an inline function:

static inline _Bool is_aligned(const void *restrict pointer, size_t byte_count) { … Read more
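
The excerpt is cut off before the function body, so the following is only a sketch of what a function-based check could look like, not the original author's code. The name `ptr_is_aligned` is hypothetical, and `byte_count` is assumed to be nonzero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if `pointer` is a multiple of `byte_count`.
   Taking const void * lets any object pointer convert implicitly,
   while non-pointer arguments are rejected by the compiler. */
bool ptr_is_aligned(const void *pointer, size_t byte_count)
{
    return (uintptr_t)pointer % byte_count == 0;
}
```

Unlike the macro, a function evaluates its arguments exactly once and gives a compile-time diagnostic if something other than a pointer is passed.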

Where is Clang’s ‘_mm256_pow_ps’ intrinsic?

November 23, 2022 by Tarik

That’s not an intrinsic; it’s an Intel SVML library function name that confusingly uses the same naming scheme as actual intrinsics. There’s no vpowps instruction. (AVX512ER on Xeon Phi does have the semi-related vexp2ps instruction…) IDK if this naming scheme is to trick people into depending on Intel tools when writing SIMD code with their … Read more

What is the point of SSE2 instructions such as orpd?

November 23, 2022 by Tarik

Remember that SSE1 orps came first. (Well actually MMX por mm, mm/mem came even before SSE1.) Having the same opcode with a new prefix be the SSE2 orpd instruction makes sense for hardware decoder logic, I guess, just like movapd vs. movaps. Several instructions like this are redundant between ps and pd versions, but … Read more

C++ error: ‘_mm_sin_ps’ was not declared in this scope

November 23, 2022 by Tarik

_mm_sin_ps is part of the SVML library, which ships with Intel compilers only. GCC developers focused on wrapping machine instructions and simple tasks, so there’s no SVML in immintrin.h so far. You have to use a library or write it yourself. Options for implementing sine: Taylor series, CORDIC, or a quadratic curve.
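
To make the Taylor series option concrete, here is a minimal scalar sketch. It is an assumption about how one might start, not a replacement for SVML: the function name `sin_taylor7` is hypothetical, there is no range reduction, and accuracy degrades quickly outside roughly |x| <= pi/2.

```c
/* Hypothetical helper: degree-7 Taylor polynomial for sin(x) around 0,
   i.e. x - x^3/3! + x^5/5! - x^7/7!, evaluated in Horner form.
   Only meaningful for small |x|; no range reduction is performed. */
float sin_taylor7(float x)
{
    const float x2 = x * x;
    return x * (1.0f
         + x2 * (-1.0f / 6.0f
         + x2 * ( 1.0f / 120.0f
         + x2 * (-1.0f / 5040.0f))));
}
```

The same polynomial maps directly onto SSE multiplies and adds applied to four lanes at once, which is why this form is a common starting point for a hand-written vector sine.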

Fastest Implementation of the Natural Exponential Function Using SSE

November 22, 2022 by Tarik

The C code below is a translation into SSE intrinsics of an algorithm I used in a previous answer to a similar question. The basic idea is to transform the computation of the standard exponential function into computation of a power of 2: expf (x) = exp2f (x / logf (2.0f)) = exp2f (x * … Read more
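
The post's own SSE code is not shown in this excerpt. As a scalar sketch of the stated idea only (turning exp into a power of 2), the example below splits x / ln(2) into an integer part and a fractional part. The function name, the degree-5 polynomial, and the bit trick for the scaling are assumptions made for illustration, and the input is assumed to stay within roughly [-87, 88] so that the result is a normal float.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: exp(x) = 2^t with t = x / ln(2).
   Write t = i + f with integer i and 0 <= f < 1, then
   exp(x) = 2^i * 2^f. Assumes -126 <= i <= 127. */
float exp_via_exp2_sketch(float x)
{
    const float ln2 = 0.69314718f;
    float t = x / ln2;

    int i = (int)t;              /* truncates toward zero */
    if ((float)i > t) {
        --i;                     /* turn truncation into floor */
    }
    float f = t - (float)i;      /* fractional part in [0, 1) */

    /* 2^f = e^(f * ln2), approximated by a degree-5 Taylor polynomial
       (an illustrative choice, not a tuned minimax fit). */
    float u = f * ln2;
    float p = 1.0f + u * (1.0f + u * (0.5f + u * (1.0f / 6.0f
            + u * (1.0f / 24.0f + u * (1.0f / 120.0f)))));

    /* Build 2^i directly from the IEEE-754 exponent field. */
    uint32_t bits = (uint32_t)(i + 127) << 23;
    float scale;
    memcpy(&scale, &bits, sizeof scale);

    return p * scale;
}
```

In a vectorized version each of these steps has a direct SSE counterpart (convert, subtract, multiply-add chain, integer shift), which is what makes the power-of-2 formulation attractive.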

Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

November 22, 2022 by Tarik

For Sandy/Ivy Bridge you need to unroll by 3. Only FP Add has a dependency on the previous iteration of the loop; FP Add can issue every cycle; FP Add takes three cycles to complete. Thus unrolling by 3/1 = 3 completely hides the latency. FP Mul and FP Load do not have a dependency on … Read more
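
The reasoning above (add latency of 3 cycles divided by an issue rate of 1 per cycle) can be sketched with three independent accumulators. This is a scalar illustration under the assumption that the loop in question is a multiply-accumulate reduction; the function name `dot_unrolled3` is hypothetical, and a real version would use SIMD registers for each accumulator.

```c
#include <stddef.h>

/* Hypothetical sketch: dot product unrolled by 3.
   Each accumulator forms its own dependency chain, so a new add can
   start every cycle while the previous two are still in flight. */
float dot_unrolled3(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f;
    size_t i = 0;

    for (; i + 3 <= n; i += 3) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
    }
    /* Handle the 0 to 2 leftover elements. */
    for (; i < n; ++i) {
        s0 += a[i] * b[i];
    }
    return (s0 + s1) + s2;
}
```

Note that splitting the sum changes the order of floating-point additions, so results can differ slightly from a single-accumulator loop; compilers will not make this transformation on their own without relaxed floating-point settings.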

latency vs throughput in intel intrinsics

November 5, 2022 by Tarik

The Effect of Architecture When Using SSE / AVX Intrinsics

November 2, 2022 by Tarik

Efficient 4×4 matrix vector multiplication with SSE: horizontal add and dot product – what’s the point?

November 1, 2022 by Tarik

Horizontal add and dot product instructions are complex: they are decomposed into multiple simpler microoperations, which the processor executes just like simple instructions. The exact decomposition of horizontal add and dot product instructions into microoperations is processor-specific, but for recent Intel processors horizontal add is decomposed into 2 SHUFFLE + 1 ADD microoperations, and … Read more
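
Since horizontal operations break down into shuffles plus an add, one common way to sidestep them is to store the matrix column-major and accumulate scaled columns. The sketch below illustrates that general approach and is not necessarily what the full post recommends. The function name and the column-major layout are assumptions, and a scalar fallback is included for builds without SSE.

```c
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define MAT4_SKETCH_USE_SSE 1
#endif

/* Hypothetical sketch: y = M * x for a 4x4 matrix stored column-major,
   so cols[4*j .. 4*j+3] holds column j. The result is the sum of the
   columns scaled by x[j]; only vertical multiplies and adds are used. */
void mat4_mul_vec4(const float *cols, const float *x, float *y)
{
#ifdef MAT4_SKETCH_USE_SSE
    /* Broadcast x[0] to all lanes and scale column 0. */
    __m128 acc = _mm_mul_ps(_mm_loadu_ps(cols), _mm_set1_ps(x[0]));
    for (int j = 1; j < 4; ++j) {
        __m128 col = _mm_loadu_ps(cols + 4 * j);
        acc = _mm_add_ps(acc, _mm_mul_ps(col, _mm_set1_ps(x[j])));
    }
    _mm_storeu_ps(y, acc);
#else
    /* Scalar fallback with the same column-major indexing. */
    for (int i = 0; i < 4; ++i) {
        float s = 0.0f;
        for (int j = 0; j < 4; ++j) {
            s += cols[4 * j + i] * x[j];
        }
        y[i] = s;
    }
#endif
}
```

Unaligned loads are used here so the sketch works with any float array; with 16-byte aligned data, `_mm_load_ps` and `_mm_store_ps` could be substituted.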