SSE, intrinsics, and alignment

First of all, you have to take care of two types of memory allocation. Static allocation: for automatic variables to be properly aligned, your type needs a proper alignment specification (e.g. __declspec(align(16)), __attribute__((aligned(16))), or your _MM_ALIGN16). Fortunately, you only need this if the alignment requirements given by the type’s members (if any) are not sufficient. …
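As a rough illustration of the alignment specification mentioned above, here is a minimal sketch. It assumes a C11 compiler and uses the standard alignas from <stdalign.h> in place of the compiler-specific spellings named in the excerpt. The type vec4_aligned and the function automatic_vec4_is_aligned are hypothetical names made up for this example.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: the float members alone would only require the
   alignment of float (typically 4 bytes), so the 16-byte requirement
   has to be stated explicitly. */
typedef struct {
    alignas(16) float v[4];
} vec4_aligned;

/* Returns 1 if an automatic (stack) variable of the type starts on a
   16-byte boundary, 0 otherwise. */
int automatic_vec4_is_aligned(void)
{
    vec4_aligned local = { { 0.0f, 0.0f, 0.0f, 0.0f } };
    return ((uintptr_t)(const void *)&local % 16) == 0;
}
```

Because the requirement is attached to the type, every automatic or static object of that type inherits it, with no annotation needed on each variable.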

How to determine if memory is aligned?

    #define is_aligned(POINTER, BYTE_COUNT) \
        (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)

The cast to void * (or, equivalently, char *) is necessary because the standard only guarantees an invertible conversion to uintptr_t for void *. If you want type safety, consider using an inline function:

    static inline _Bool is_aligned(const void *restrict pointer, size_t byte_count) { …
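A small usage sketch for the macro form may help. It assumes a C11 compiler (for _Alignas). The helper check_buffer_alignment and its buffer are hypothetical and exist only to exercise the macro on one address that is known to be aligned and one that is known not to be.

```c
#include <stdint.h>

/* Macro as given in the excerpt above. */
#define is_aligned(POINTER, BYTE_COUNT) \
    (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)

/* Hypothetical helper: the buffer is forced onto a 16-byte boundary, so
   its first byte must pass the check and the byte after it must fail.
   Returns 1 when both expectations hold. */
int check_buffer_alignment(void)
{
    static _Alignas(16) unsigned char buffer[32];
    int base_ok = is_aligned(buffer, 16);
    int offset_ok = is_aligned(buffer + 1, 16);
    return base_ok && !offset_ok;
}
```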

Where is Clang’s ‘_mm256_pow_ps’ intrinsic?

That’s not an intrinsic; it’s an Intel SVML library function name that confusingly uses the same naming scheme as actual intrinsics. There’s no vpowps instruction. (AVX512ER on Xeon Phi does have the semi-related vexp2ps instruction…) IDK if this naming scheme is to trick people into depending on Intel tools when writing SIMD code with their …

What is the point of SSE2 instructions such as orpd?

Remember that SSE1 orps came first. (Well, actually MMX por mm, mm/mem came even before SSE1.) Having the same opcode with a new prefix be the SSE2 orpd instruction makes sense for hardware decoder logic, I guess, just like movapd vs. movaps. Several instructions like this are redundant between ps and pd versions, but …
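To make the redundancy concrete, here is a minimal sketch, assuming an x86 target with SSE2 and the <emmintrin.h> header. It ORs the same 128 bits once through the ps intrinsic and once through the pd intrinsic, then compares the results byte for byte. The function name orps_and_orpd_agree and the input values are made up for this example, and a compiler is free to emit a different instruction than the intrinsic name suggests.

```c
#include <emmintrin.h>
#include <string.h>

/* Hypothetical check: returns 1 if _mm_or_ps and _mm_or_pd produce
   bit-identical results for the same 128-bit inputs. */
int orps_and_orpd_agree(void)
{
    __m128 a = _mm_set_ps(1.0f, -2.0f, 3.5f, 0.0f);
    __m128 b = _mm_set_ps(-0.0f, 8.0f, 0.25f, 4.0f);

    /* Bitwise OR on the vectors viewed as four floats. */
    __m128 via_ps = _mm_or_ps(a, b);

    /* Same bits reinterpreted as two doubles; the casts generate no code. */
    __m128 via_pd = _mm_castpd_ps(
        _mm_or_pd(_mm_castps_pd(a), _mm_castps_pd(b)));

    float out_ps[4], out_pd[4];
    _mm_storeu_ps(out_ps, via_ps);
    _mm_storeu_ps(out_pd, via_pd);
    return memcmp(out_ps, out_pd, sizeof out_ps) == 0;
}
```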

Efficient 4×4 matrix vector multiplication with SSE: horizontal add and dot product – what’s the point?

Horizontal add and dot product instructions are complex: they are decomposed into multiple simpler microoperations, which the processor executes just like simple instructions. The exact decomposition of horizontal add and dot product instructions into microoperations is processor-specific, but for recent Intel processors horizontal add is decomposed into 2 SHUFFLE + 1 ADD microoperations, and …
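The "2 SHUFFLE + 1 ADD" shape can be sketched with plain SSE1 intrinsics. This is only an illustration of the same lane arithmetic that haddps performs, assuming an x86 target and <xmmintrin.h>. It is not a claim about how any particular processor actually splits the instruction. The function name hadd_via_shuffles is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: computes
   { a0+a1, a2+a3, b0+b1, b2+b3 }
   (the lane layout of haddps) using two shuffles and one add. */
void hadd_via_shuffles(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    /* Even-indexed elements: { a0, a2, b0, b2 } */
    __m128 even = _mm_shuffle_ps(va, vb, _MM_SHUFFLE(2, 0, 2, 0));
    /* Odd-indexed elements:  { a1, a3, b1, b3 } */
    __m128 odd = _mm_shuffle_ps(va, vb, _MM_SHUFFLE(3, 1, 3, 1));

    /* One vertical add finishes the horizontal sums. */
    _mm_storeu_ps(out, _mm_add_ps(even, odd));
}
```

Written out this way, it is easier to see that a horizontal add costs about as much as shuffling by hand, so it saves instructions rather than execution work.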