sse – Page 5 – Make Me Engineer

What is the meaning of “non temporal” memory accesses in x86

June 24, 2022 by Tarik

Can long integer routines benefit from SSE?

June 23, 2022 by Tarik

In the past, the answer to this question was a solid “no”. But as of 2017, the situation is changing. Before I continue, some background terminology: full-word arithmetic and partial-word arithmetic. Full-Word Arithmetic: This is the standard representation where the number is stored in base 2^32 or 2^64 using an array …
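To make the full-word representation concrete, here is a minimal sketch of adding two numbers stored as arrays of base-2^32 digits ("limbs"), least-significant limb first. The function name `bigint_add` and the limb order are assumptions made for illustration; they do not come from the post.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds two n-limb numbers stored in base 2^32,
   least-significant limb first. Writes n limbs to sum and returns the
   carry out of the top limb (0 or 1). */
uint32_t bigint_add(uint32_t *sum, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Two 32-bit limbs plus a 1-bit carry always fit in 64 bits. */
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        sum[i] = (uint32_t)t;   /* low 32 bits become the result limb */
        carry = t >> 32;        /* bit 32 is the carry into the next limb */
    }
    return (uint32_t)carry;
}
```

Each limb depends on the carry produced by the limb before it, so the loop as written is strictly sequential.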

How to check if a CPU supports the SSE3 instruction set?

June 23, 2022 by Tarik

I’ve created a GitHub repo that will detect CPU and OS support for all the major x86 ISA extensions: https://github.com/Mysticial/FeatureDetector

Here’s a shorter version. First you need to access the CPUID instruction:

    #ifdef _WIN32
    // Windows
    #define cpuid(info, x) __cpuidex(info, x, 0)
    #else
    // GCC Intrinsics
    #include <cpuid.h>
    void cpuid(int info[4], int InfoType){
        __cpuid_count(InfoType, 0, …
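For the SSE3 case specifically, a minimal sketch could look like the following. It assumes GCC or Clang on x86 and uses `__get_cpuid` from `<cpuid.h>` rather than the post's own `cpuid` wrapper; the function name `cpu_has_sse3` is hypothetical. SSE3 is reported in bit 0 of ECX for CPUID leaf 1.

```c
#include <cpuid.h>  /* GCC/Clang on x86 only (assumption); MSVC has __cpuid in <intrin.h> */

/* Hypothetical helper: returns 1 if the CPU reports SSE3, 0 otherwise. */
int cpu_has_sse3(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    return (ecx & 1u) != 0;  /* CPUID leaf 1, ECX bit 0 = SSE3 */
}
```

This only covers the CPU side. Extensions with wider register state, such as AVX, also need an OS-support check, which this sketch does not attempt.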

How to implement atoi using SIMD?

June 23, 2022 by Tarik

How to efficiently perform double/int64 conversions with SSE/AVX?

June 17, 2022 by Tarik

There’s no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned (these 64-bit conversions are part of AVX512DQ). It also added support for conversion to/from 32-bit unsigned. See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64. If you only have AVX2 or earlier, you’ll need tricks like below for packed conversion. (For scalar, x86-64 has scalar int64_t <-> double …
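The excerpt stops before the tricks it mentions. One well-known range-limited approach is sketched here as an illustration, not as the post's own code. It assumes the inputs lie in [-2^51, 2^51] and needs only SSE2. Adding the constant 2^52 + 2^51 pushes every input into [2^52, 2^53), where doubles have a spacing of exactly 1, so the integer value ends up in the low mantissa bits. Subtracting the constant's bit pattern as an integer then leaves the signed result. The names `double_to_int64_limited` and `convert2` are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: converts two packed doubles to two int64_t.
   Only valid for inputs in [-2^51, 2^51] (assumption); non-integer
   inputs are rounded according to the current rounding mode. */
static __m128i double_to_int64_limited(__m128d x)
{
    const __m128d magic = _mm_set1_pd(6755399441055744.0);  /* 2^52 + 2^51 */
    x = _mm_add_pd(x, magic);  /* integer now sits in the low mantissa bits */
    return _mm_sub_epi64(_mm_castpd_si128(x), _mm_castpd_si128(magic));
}

/* Array wrapper so the conversion can be called without vector types. */
void convert2(const double in[2], int64_t out[2])
{
    __m128i r = double_to_int64_limited(_mm_loadu_pd(in));
    _mm_storeu_si128((__m128i *)out, r);
}
```

Calling `convert2` on `{3.0, -7.0}` stores `3` and `-7`. Inputs outside the stated range give wrong results without any warning, so this only applies when the range is known in advance.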

Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all

June 12, 2022 by Tarik

SSE instructions: which CPUs can do atomic 16B memory operations?

May 27, 2022 by Tarik

print a __m128i variable

May 27, 2022 by Tarik

Use this function to print them:

    #include <emmintrin.h>  // __m128i
    #include <stdint.h>
    #include <stdio.h>      // printf
    #include <string.h>     // memcpy

    void print128_num(__m128i var)
    {
        uint16_t val[8];
        memcpy(val, &var, sizeof(val));  // copy out the eight 16-bit lanes
        printf("Numerical: %i %i %i %i %i %i %i %i \n", val[0], val[1], val[2], val[3], val[4], val[5], val[6], val[7]);
    }

You split the 128 bits into 16-bit (or 32-bit) elements before printing them. This is a way of 64-bit splitting and …

How to solve the 32-byte-alignment issue for AVX load/store operations?

May 20, 2022 by Tarik

Yes, you can use _mm256_loadu_ps / storeu for unaligned loads/stores (see “AVX: data alignment: store crash, storeu, load, loadu doesn’t”). If the compiler doesn’t do a bad job (cough, GCC default tuning), AVX _mm256_loadu/storeu on data that happens to be aligned is just as fast as alignment-required load/store, so aligning data when convenient still gives you …
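As a minimal sketch of the unaligned-load approach, the function below sums floats from a pointer that need not be 32-byte aligned. It assumes GCC or Clang: the `target("avx")` attribute and `__builtin_cpu_supports` are compiler-specific and are used here so the file builds without `-mavx` and falls back to scalar code on CPUs without AVX. The names `sum_avx` and `sum_floats` are hypothetical.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: sums n floats starting at p using AVX.
   p does not have to be 32-byte aligned. */
static __attribute__((target("avx"))) float sum_avx(const float *p, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));  /* unaligned load */

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);  /* unaligned store of the 8 partial sums */

    float total = 0.0f;
    for (int k = 0; k < 8; k++)
        total += lanes[k];
    for (; i < n; i++)             /* scalar tail for the leftover elements */
        total += p[i];
    return total;
}

/* Dispatches at run time so the AVX path only runs where it is supported. */
float sum_floats(const float *p, size_t n)
{
    if (__builtin_cpu_supports("avx"))
        return sum_avx(p, n);

    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}
```

Passing a deliberately offset pointer such as `buf + 1` works here, whereas `_mm256_load_ps` on a pointer that is not 32-byte aligned would fault.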

What are the best instruction sequences to generate vector constants on the fly?

May 15, 2022 by Tarik