CPU cache inhibition

x86 has no way to do a store that bypasses or writes through L1D/L2 but not L3. There are NT stores, which bypass all levels of cache. Anything that forces a write-back to L3 also forces a write-back all the way to memory (e.g. a clwb instruction). Those are designed for non-volatile RAM use cases, or for non-coherent …

Should pointer comparisons be signed or unsigned in 64-bit x86?

TL;DR: intptr_t might be best in some cases because the signed-overflow boundary is in the middle of the “non-canonical hole”. Treating a value as negative instead of huge may be better if wrapping from zero to 0xFF…FF or vice versa is possible, but pointer+size for any valid size can’t wrap a value from INT64_MAX to …
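To make the two boundaries concrete, here is a minimal Python sketch that models 64-bit pointer values as integers. It assumes 48-bit virtual addresses (so bits 63..47 must all match for an address to be canonical), and the helper names and the example range are invented for illustration only.

```python
MASK64 = (1 << 64) - 1

def as_signed64(x):
    """Reinterpret a 64-bit pattern as a two's-complement signed value (like intptr_t)."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def is_canonical48(addr):
    """True if bits 63..47 are all equal (assumes 48-bit virtual addresses)."""
    top = (addr & MASK64) >> 47
    return top == 0 or top == 0x1FFFF

# Hypothetical range that straddles the unsigned wrap point between 0xFF...FF and 0.
start = 0xFFFFFFFFFFFFFFF0
end = (start + 0x20) & MASK64      # wraps around to 0x10

unsigned_in_order = start < end                          # huge value vs. small value
signed_in_order = as_signed64(start) < as_signed64(end)  # -16 vs. 16

# The signed-overflow boundary (INT64_MAX / INT64_MIN) lies in the non-canonical hole.
int64_max = 0x7FFFFFFFFFFFFFFF
int64_min_bits = 0x8000000000000000
```

In this model the unsigned comparison sees the range as out of order, while the signed comparison keeps `start` below `end`; both values on either side of the signed boundary fail the canonical check, so no valid pointer sits next to it.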

Convention for displaying vector registers

Being consistent is the most important thing; if I’m working on existing code that already has LSE-first (least-significant element first) comments or variable names, I match that. Given the choice, I prefer MSE-first (most-significant element first) notation in comments, especially when designing something with shuffles, and particularly when packing/unpacking to different element sizes. Intel uses MSE-first not only in their diagrams in …

NASM Error Parsing, Instruction Expected

That assembly language is MASM, not NASM. For starters, NASM segments are defined differently. Instead of Code segment word public ‘CODE’ we write section .text. And that “ASSUME” declaration… You must have an ancient book. That is old, old MASM code. Brings back memories from the early 1980s for me! There are many differences between …

signed and unsigned arithmetic implementation on x86

If you look at the various multiplication instructions of x86, looking only at 32-bit variants and ignoring BMI2, you will find these:

    imul r/m32 (32×32->64 signed multiply)
    imul r32, r/m32 (32×32->32 multiply) *
    imul r32, r/m32, imm (32×32->32 multiply) *
    mul r/m32 (32×32->64 unsigned multiply)

Notice that only the “widening” multiply has an unsigned counterpart. …
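One way to see why only the widening form needs separate signed and unsigned versions is to model the arithmetic directly. The Python sketch below uses made-up helper names and models 32-bit registers with masking; it is an illustration of the arithmetic, not of how the hardware is built.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mul_low32(a, b):
    """Non-widening 32x32->32 multiply: keep only the low 32 bits of the product."""
    return (a * b) & MASK32

def mul_wide_unsigned(a, b):
    """Model of a 32x32->64 unsigned multiply, returned as (high, low) halves."""
    p = (a & MASK32) * (b & MASK32)
    return (p >> 32) & MASK32, p & MASK32

def mul_wide_signed(a, b):
    """Model of a 32x32->64 signed multiply, returned as (high, low) halves."""
    p = (to_signed32(a) * to_signed32(b)) & MASK64
    return (p >> 32) & MASK32, p & MASK32

a, b = 0xFFFFFFFF, 2   # a is -1 when read as signed, 4294967295 when read as unsigned

low_from_unsigned = mul_low32(a, b)
low_from_signed = mul_low32(to_signed32(a), to_signed32(b))
wide_unsigned = mul_wide_unsigned(a, b)
wide_signed = mul_wide_signed(a, b)
```

With these inputs the low 32 bits agree under both interpretations, while the high halves of the widening products differ, which is where the signed/unsigned distinction matters.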

Can a processor do memory and arithmetic operations at the same time?

You’re right, a modern x86 will decode add dword [mem], 1 to 3 uops: a load, an ALU add, and a store. (This is actually a simplification of various things, including Intel’s micro-fusion and how AMD always keeps a load+ALU together in some parts of the pipeline…) Those 3 dependent operations can’t happen at the …

Is it possible to decode x86-64 instructions in reverse?

An x86 instruction stream is not self-synchronizing, and can only be unambiguously decoded forward. You need to know a valid start-point to decode. The last byte of an immediate can be a 0x90 which decodes as a nop, or in general a 4-byte immediate or displacement can have byte-sequences that are valid instructions, or whatever …
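The ambiguity can be illustrated with a deliberately tiny toy decoder in Python. It only understands three real one-byte opcodes (0xB8 = mov eax, imm32; 0x90 = nop; 0xC3 = ret) and is nowhere near a real x86 decoder; the function name and byte sequence are chosen purely for this example.

```python
import struct

def toy_decode(code, offset=0):
    """Decode forward from `offset`, knowing only mov eax,imm32 / nop / ret."""
    out = []
    i = offset
    while i < len(code):
        op = code[i]
        if op == 0xB8:                      # mov eax, imm32: opcode + 4 immediate bytes
            if i + 5 > len(code):
                out.append("(truncated mov)")
                break
            (imm,) = struct.unpack_from("<I", code, i + 1)
            out.append(f"mov eax, {imm:#x}")
            i += 5
        elif op == 0x90:                    # nop
            out.append("nop")
            i += 1
        elif op == 0xC3:                    # ret
            out.append("ret")
            i += 1
        else:                               # anything else is outside this toy's table
            out.append(f"(unknown byte {op:#04x})")
            i += 1
    return out

# mov eax, 0x90909090 ; ret
code = bytes([0xB8, 0x90, 0x90, 0x90, 0x90, 0xC3])

from_start = toy_decode(code, 0)    # the intended instruction stream
from_middle = toy_decode(code, 1)   # starting one byte late, inside the immediate
```

Both start offsets yield a sequence that looks like valid code and both end on the same ret, so looking backward from the ret alone cannot tell you whether the preceding bytes were four nops or part of one mov.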