Can long integer routines benefit from SSE?

In the past, the answer to this question was a solid, “no”. But as of 2017, the situation is changing. But before I continue, time for some background terminology: Full Word Arithmetic Partial Word Arithmetic Full-Word Arithmetic: This is the standard representation where the number is stored in base 232 or 264 using an array … Read more

How to check if a CPU supports the SSE3 instruction set?

I’ve created a GitHub repro that will detect CPU and OS support for all the major x86 ISA extensions: https://github.com/Mysticial/FeatureDetector Here’s a shorter version: First you need to access the CPUID instruction: #ifdef _WIN32 // Windows #define cpuid(info, x) __cpuidex(info, x, 0) #else // GCC Intrinsics #include <cpuid.h> void cpuid(int info[4], int InfoType){ __cpuid_count(InfoType, 0, … Read more

How to efficiently perform double/int64 conversions with SSE/AVX?

There’s no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned. (Also support for conversion to/from 32-bit unsigned). See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64. If you only have AVX2 or less, you’ll need tricks like below for packed-conversion. (For scalar, x86-64 has scalar int64_t <-> double … Read more

print a __m128i variable

Use this function to print them: #include <stdint.h> #include <string.h> void print128_num(__m128i var) { uint16_t val[8]; memcpy(val, &var, sizeof(val)); printf(“Numerical: %i %i %i %i %i %i %i %i \n”, val[0], val[1], val[2], val[3], val[4], val[5], val[6], val[7]); } You split 128bits into 16-bits(or 32-bits) before printing them. This is a way of 64-bit splitting and … Read more

How to solve the 32-byte-alignment issue for AVX load/store operations?

Yes, you can use _mm256_loadu_ps / storeu for unaligned loads/stores (AVX: data alignment: store crash, storeu, load, loadu doesn’t). If the compiler doesn’t do a bad job (cough GCC default tuning), AVX _mm256_loadu/storeu on data that happens to be aligned is just as fast as alignment-required load/store, so aligning data when convenient still gives you … Read more