Convention for displaying vector registers

Being consistent is the most important thing; If I’m working on existing code that already has LSE-first comments or variable names, I match that.

Given the choice, I prefer MSE-first notation in comments, especially when designing something with shuffles or especially packing/unpacking to different element sizes.

Intel uses MSE-first not only in their diagrams in manuals, but in the naming of intrinsics/instructions like pslldq (byte shift) and psrlw (bit-shift): a left bit/byte shift goes towards the MSB. LSE-first thinking doesn’t save you from mentally reversing things, it means you have to do it when thinking about shifts instead of loads/stores. Since x86 is little-endian, you sometimes have to be thinking about this anyway.


In MSE-first thinking about vectors, just remember that memory order is right to left. When you need to think about overlapping unaligned loads from a block of memory, you can draw the memory contents in right-to-left order, so you can look at vector-length windows of it.

In a text editor, it’s no problem to add new text at the left hand side of something and have the existing text displaced to the right, so adding more elements to a comment isn’t a problem.

Two major downsides to MSE-first notation are:

  • harder to type the alphabet backwards (like h g f e | d c b a for an AVX vector of 32-bit elements), so I sometimes just start from the right and type a, left-arrow, b, space, ctrl-left arrow, c, space, … or something like that.

  • Opposite from C array-initializer order. Normally not a problem, because _mm_set_epi* uses MSE-first order. (Use _mm_setr_epi* to match LSE-first comments).


An example where MSE-first is nice is when trying to design a lane-crossing version of 256b vpalignr: See my answer on that question
How to concatenate two vector efficiently using AVX2?. That includes design-notes in MSE-first notation.

As another example, consider implementing a variable-count byte-shift across a whole vector. You could make a table of pshufb control vectors, but that would be a massive waste of cache footprint. Much better to load a sliding window from memory:

/*  Example of using MSE notation for memory as well as vectors

// 4-element vectors to keep the design notes compact
// I started by just writing down a couple rows of this, then noticing which way they lined up
<< 3:                       00 FF FF FF
<< 1:                 02 01 00 FF
   0:              03 02 01 00
>> 2:        FF FF 03 02
>> 3:     FF FF FF 03
>> 4:  FF FF FF FF

       FF FF FF FF 03 02 01 00 FF FF FF FF
  highest address                       lowest address
*/

#include <immintrin.h>
#include <stdint.h>
// positive counts are right shifts, negative counts are left
// a left-only or right-only implementation would only have one side of the table,
// and only need 32B alignment for the constant in memory to prevent cache-line splits.
__m128i vshift(__m128i v, intptr_t bytes_right)
{   // intptr_t means the caller has to sign-extend it to the width of a pointer, saving a movsx in the non-inline version

   // C11 uses _Alignas, C++11 uses alignas
    _Alignas(64) static const int32_t shuffles[] = { 
        -1, -1, -1, -1,
        0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c,
        -1, -1, -1, -1
    };  // compact but messy with a mix of ordering :/
    const char *identity_shuffle = 16 + (const char*)shuffles;  // points to the middle 16B

    //  count &= 0xf;  tricky to efficiently limit the count while still allowing >>16 to zero the vector, and to allow negative.
    __m128i control = _mm_load_si128((const __m128i*) (identity_shuffle + bytes_right));
    return _mm_shuffle_epi8(v, control);
}

This is kind of the worst-case for MSE-first, because right-shifts take a window from farther left. In LSE-first notation, it might look more natural. Still, unless I got something backwards :P, I think it shows that you can successfully use MSE-first notation even for something you’d expect to be tricky. It didn’t feel mind-bending or over-complicated. I just started writing down shuffle control vectors and then lined them up. I could have made it slightly simpler when translating to a C array if I’d used uint8_t shuffles[] = { 0xff, 0xff, ..., 0, 1, 2, ..., 0xff };.
I haven’t tested this, only that it compiles to one instruction:

    vpshufb xmm0, xmm0, xmmword ptr [rdi + vshift.shuffles+16]
    ret

MSE lets you notice more easily when you can use a bit-shift instead of a shuffle instruction, to reduce pressure on port 5. e.g. psllq xmm, 16/_mm_slli_epi64(v,16) to shift word elements left by one (with zeroing at qword boundaries). Or when you need to shift byte elements, but the only available shifts are 16-bit or wider. The narrowest variable-per-element shifts are 32-bit elements (vpsllvd).

MSE makes it easy to get the shuffle constant right when using larger or smaller granularity shuffles or blends, e.g. pshufd when you can keep pairs of word elements together, or pshufb to shuffle words across the whole vector (because pshuflw/hw is limited).

_MM_SHUFFLE(d,c,b,a) goes in MSE order, too. So does any other way of writing it as a single integer, like C++14 0b11'10'01'00 or 0xE4 (the identity shuffle). Using LSE-first notation will make your shuffle constants look “backwards” relative to your comments. (except for pshufb constants, which you can write with _mm_setr)

Leave a Comment