DateTime.DayOfWeek micro optimization

Let’s do some tuning. The prime factorization of TimeSpan.TicksPerDay (864000000000) is 2^14 × 3^3 × 5^9, and 3^3 × 5^9 = 52734375, so DayOfWeek can now be expressed as: public DayOfWeek DayOfWeek { get { return (DayOfWeek)(((Ticks>>14) / 52734375 + 1L) % 7L); } } Since we are working modulo 7, note that 52734375 % 7 is 1. So, the code above is equivalent to: public static DayOfWeek dayOfWeekTurbo(this … Read more
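The shift-then-divide step in the excerpt above can be sanity-checked with a short C++ sketch. It assumes the baseline computation is (Ticks / TicksPerDay + 1) % 7 and that tick counts are non-negative. The names kTicksPerDay, kOddFactor, day_of_week_reference, day_of_week_shifted and forms_agree are hypothetical, introduced only for illustration. It covers only the two arithmetic facts stated in the excerpt, not the truncated dayOfWeekTurbo variant.

```cpp
#include <cassert>
#include <cstdint>

// Ticks are 100 ns units; one day is 864,000,000,000 ticks = 2^14 * 52,734,375.
constexpr std::int64_t kTicksPerDay = 864000000000LL;
constexpr std::int64_t kOddFactor = 52734375LL;  // 3^3 * 5^9

// The two arithmetic facts the excerpt relies on, checked at compile time.
static_assert((kOddFactor << 14) == kTicksPerDay,
              "2^14 * 52734375 must equal TicksPerDay");
static_assert(kOddFactor % 7 == 1, "52734375 is congruent to 1 modulo 7");

// Assumed baseline: whole days since tick 0, offset by 1, reduced modulo 7.
constexpr int day_of_week_reference(std::int64_t ticks) {
    return static_cast<int>((ticks / kTicksPerDay + 1) % 7);
}

// Form from the excerpt: shift out the factor 2^14, then divide by the odd part.
constexpr int day_of_week_shifted(std::int64_t ticks) {
    return static_cast<int>(((ticks >> 14) / kOddFactor + 1) % 7);
}

// Hypothetical helper: compares both forms over a range of non-negative ticks.
inline bool forms_agree(std::int64_t first, std::int64_t last, std::int64_t step) {
    for (std::int64_t t = first; t <= last; t += step) {
        if (day_of_week_reference(t) != day_of_week_shifted(t)) return false;
    }
    return true;
}
```

The two forms agree for non-negative ticks because dividing by 2^14 and then by 52734375, each rounded down, gives the same result as a single rounded-down division by their product.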

How to force NASM to encode [1 + rax*2] as disp32 + index*2 instead of disp8 + base + index?

NOSPLIT: Similarly, NASM will split [eax*2] into [eax+eax] because that allows the offset field to be absent and space to be saved; in fact, it will also split [eax*2+offset] into [eax+eax+offset]. You can combat this behaviour by the use of the NOSPLIT keyword: [nosplit eax*2] will force [eax*2+0] to be generated literally. [nosplit eax*1] also … Read more

Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?

See also First use of AVX 256-bit vectors slows down 128-bit vector and AVX scalar ops re: implicit widening of 128-bit AVX operations to 256-bit if any uppers are dirty. (Including for the purposes of “light” vs. “heavy” turbo limits). This could be a reason to use vzeroupper, especially if you have some regions of … Read more

Modern x86 cost model

The best reference is the Intel Optimization Manual, which provides fairly detailed information on architectural hazards and instruction latencies for all recent Intel cores, as well as a good number of optimization examples. Another excellent reference is Agner Fog’s optimization resources, which have the virtue of also covering AMD cores. Note that specific cost models … Read more

What are the costs of failed store-to-load forwarding on x86?

This is not really a full answer, but it is still evidence that the penalty is visible. MSVC 2022 benchmark, compiled with /std:c++latest. #include <chrono> #include <iostream> struct alignas(16) S { char* a; int* b; }; extern "C" void init_fused_copy_unfused(int n, S& s2, S& s1); extern "C" void init_fused_copy_fused(int n, S& s2, S& … Read more