micro-optimization – Make Me Engineer

Performance / Space implications when ordering SQL Server columns?

June 4, 2023 by Tarik

SQL Server stores row data on disk in a set, fixed layout. The column order in sys.columns and the order of the key columns have no bearing on this on-disk order. See “Anatomy of a record” (Paul Randal) and my answer here: How do you get to limits of 8060 bytes per row and 8000 per (varchar, nvarchar) value?

DateTime.DayOfWeek micro optimization

May 29, 2023 by Tarik

Let’s do some tuning. The prime factorization of TimeSpan.TicksPerDay (864000000000) is 2^14 × 3^3 × 5^9, that is, 2^14 × 52734375. DayOfWeek can now be expressed as: public DayOfWeek DayOfWeek { get { return (DayOfWeek)(((Ticks>>14) / 52734375 + 1L) % 7L); } } Since we are working modulo 7 and 52734375 % 7 is 1, the code above is equal to: public static DayOfWeek dayOfWeekTurbo(this … Read more
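The factoring step in the excerpt can be checked with a small sketch. This is a hypothetical Java port, not the post's C# code: the class and method names are made up for illustration, and it assumes a non-negative tick count, as in .NET. It compares the plain division by TicksPerDay with the shift-then-divide form quoted above, and does not attempt the truncated "turbo" variant.

```java
final class DayOfWeekSketch {
    // 864000000000 = 2^14 * 52734375 (ticks per day, 100 ns ticks)
    static final long TICKS_PER_DAY = 864_000_000_000L;

    // Baseline: whole days since tick 0, plus one, modulo 7.
    static int dayOfWeek(long ticks) {
        return (int) ((ticks / TICKS_PER_DAY + 1L) % 7L);
    }

    // Form from the excerpt: shift out the factor 2^14 first,
    // then divide by the remaining odd factor 52734375.
    // Assumes ticks >= 0, so the shift is an exact floor division.
    static int dayOfWeekShifted(long ticks) {
        return (int) (((ticks >> 14) / 52_734_375L + 1L) % 7L);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 863_999_999_999L, 864_000_000_000L, 638_000_000_000_000_000L};
        for (long t : samples) {
            System.out.println(t + " -> " + dayOfWeek(t) + " / " + dayOfWeekShifted(t));
        }
    }
}
```

Both methods should agree for every non-negative input, because flooring by 2^14 and then by 52734375 gives the same result as flooring by their product in one step.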

Fastest way to strip all non-printable characters from a Java String

May 22, 2023 by Tarik

Using a single char array could work a bit better: int length = s.length(); char[] oldChars = new char[length]; s.getChars(0, length, oldChars, 0); int newLen = 0; for (int j = 0; j < length; j++) { char ch = oldChars[j]; if (ch >= ' ') { oldChars[newLen] = ch; newLen++; } } s = new … Read more
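For reference, the single-array idea in the excerpt can be wrapped in a self-contained method. This is a minimal sketch: the class and method names are hypothetical, and the final step (building the result from the first newLen chars) is an assumption, since the excerpt is cut off at that point. As in the excerpt, "printable" here only means a char value of at least ' ' (0x20).

```java
final class StripControlChars {
    // Copies the string into one char array and compacts it in place,
    // keeping only chars >= ' ' (drops the C0 control range 0x00-0x1F).
    static String strip(String s) {
        int length = s.length();
        char[] chars = new char[length];
        s.getChars(0, length, chars, 0);
        int newLen = 0;
        for (int j = 0; j < length; j++) {
            char ch = chars[j];
            if (ch >= ' ') {
                chars[newLen] = ch;
                newLen++;
            }
        }
        // Assumed ending: reuse the input when nothing was removed,
        // otherwise build a new String from the compacted prefix.
        return newLen == length ? s : new String(chars, 0, newLen);
    }

    public static void main(String[] args) {
        System.out.println(strip("a\tb\nc"));
    }
}
```

Writing back into the same array works because the write index never runs ahead of the read index.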

Micro Optimization of a 4-bucket histogram of a large array or list

May 19, 2023 by Tarik

This should be possible at about 8 elements (1 AVX2 vector) per 2.5 clock cycles or so (per core) on a modern x86-64 like Skylake or Zen 2, using AVX2. Or per 2 clocks with unrolling. Or on your Piledriver CPU, maybe 1x 16-byte vector of indexes per 3 clocks with AVX1 _mm_cmpeq_epi32. The general … Read more

How to force NASM to encode [1 + rax*2] as disp32 + index*2 instead of disp8 + base + index?

May 17, 2023 by Tarik

NOSPLIT: Similarly, NASM will split [eax*2] into [eax+eax] because that allows the offset field to be absent and space to be saved; in fact, it will also split [eax*2+offset] into [eax+eax+offset]. You can combat this behaviour by the use of the NOSPLIT keyword: [nosplit eax*2] will force [eax*2+0] to be generated literally. [nosplit eax*1] also … Read more

Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?

May 17, 2023 by Tarik

See also First use of AVX 256-bit vectors slows down 128-bit vector and AVX scalar ops re: implicit widening of 128-bit AVX operations to 256-bit if any uppers are dirty. (Including for the purposes of “light” vs. “heavy” turbo limits). This could be a reason to use vzeroupper, especially if you have some regions of … Read more

Modern x86 cost model

May 13, 2023 by Tarik

The best reference is the Intel Optimization Manual, which provides fairly detailed information on architectural hazards and instruction latencies for all recent Intel cores, as well as a good number of optimization examples. Another excellent reference is Agner Fog’s optimization resources, which have the virtue of also covering AMD cores. Note that specific cost models … Read more

Which is better option to use for dividing an integer number by 2?

April 26, 2023 by Tarik

Use the operation that best describes what you are trying to do. If you are treating the number as a sequence of bits, use bitshift. If you are treating it as a numerical value, use division. Note that they are not exactly equivalent. They can give different results for negative integers. For example: -5 / … Read more
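The difference for negative integers is easy to see in a short sketch. This one uses Java, where both operators are fully specified: integer division truncates toward zero, while the arithmetic right shift rounds toward negative infinity. The class and method names are hypothetical, and other languages may define these operations differently.

```java
final class HalveSketch {
    // Numeric intent: integer division, truncating toward zero.
    static int halveByDivision(int n) {
        return n / 2;
    }

    // Bit-pattern intent: arithmetic right shift, rounding toward negative infinity.
    static int halveByShift(int n) {
        return n >> 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {5, 4, -4, -5}) {
            System.out.println(n + ": " + halveByDivision(n) + " vs " + halveByShift(n));
        }
    }
}
```

The two agree for non-negative values and for even negative values. They differ for odd negative values, for example -5 / 2 is -2 while -5 >> 1 is -3.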

What are the costs of failed store-to-load forwarding on x86?

April 17, 2023 by Tarik

It is not really a full answer, but still evidence that the penalty is visible. MSVC 2022 benchmark, compiled with /std:c++latest. #include <chrono> #include <iostream> struct alignas(16) S { char* a; int* b; }; extern "C" void init_fused_copy_unfused(int n, S & s2, S & s1); extern "C" void init_fused_copy_fused(int n, S & s2, S & … Read more

latency vs throughput in intel intrinsics

November 5, 2022 by Tarik