Most functions don’t have more than 6 integer parameters, so this is really a corner case. Passing some excess integer params in xmm registers would make the rules for where to find floating-point args more complicated, for little to no benefit, and it probably wouldn’t make code any faster anyway.
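For reference, here's a rough sketch of how a caller would set up a hypothetical call `foo(1, 2, 3, 4, 5, 6, 7, 1.5)` under the SysV AMD64 ABI (names and constants made up, stack-alignment bookkeeping omitted): the first six integer args go in registers, the seventh goes to the stack, and the lone double goes in xmm0 regardless of how many integer args there are.

```nasm
    mov     edi, 1              ; integer arg 1
    mov     esi, 2              ; integer arg 2
    mov     edx, 3              ; integer arg 3
    mov     ecx, 4              ; integer arg 4
    mov     r8d, 5              ; integer arg 5
    mov     r9d, 6              ; integer arg 6
    movsd   xmm0, [rel k1p5]    ; the double 1.5: first FP arg register (k1p5 is a made-up data label)
    push    7                   ; integer arg 7: memory, not an xmm register
    call    foo
    add     rsp, 8              ; caller removes its own stack arg
```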
A further reason for passing excess parameters in memory is that the function probably won’t use them all right away. If it wants to call another function, it has to save those parameters from xmm registers to memory, because the function it calls will destroy any parameter-passing registers. (And all the xmm regs are caller-saved anyway.) So you could potentially end up with code that stuffs parameters into vector registers where they can’t be used directly, stores them from there to memory before calling another function, and only then loads them back into integer registers. Or even if the function doesn’t call other functions, maybe it needs the vector registers for its own use, and would have to store the params to memory to free them up for running vector code! It would have been easier just to `push` the params onto the stack, because `push` is very heavily optimized, for obvious reasons, to do the store and the modification of RSP all in a single uop, making it about as cheap as a plain `mov`.
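To make that concrete, here's a hypothetical comparison (function names made up). If the 7th integer arg arrived in, say, xmm2, a function that wants to make another call has to spill it first; under the real ABI it's already sitting in memory at a known stack slot, put there by a single cheap `push` in the caller.

```nasm
; Hypothetical alternative ABI: the 7th integer arg arrives in xmm2.
hypothetical_callee:
    sub     rsp, 8
    movq    [rsp], xmm2         ; spill the "register" arg to memory...
    call    some_other_function ; ...because any call clobbers every xmm reg
    mov     rax, [rsp]          ; reload into an integer reg to actually use it
    add     rsp, 8
    ret

; Real SysV ABI: the caller already did  push <arg7>  (one uop), so the
; callee can just load it whenever it wants, even after making other calls.
real_callee:
    mov     rax, [rsp + 8]      ; 7th arg sits just above the return address
    ret
```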
There is one integer register (r11) that is neither used for parameter passing nor call-preserved in the SysV Linux/Mac x86-64 ABI. It’s useful to have a scratch register that lazy dynamic-linker code can use without saving anything (since such shim functions need to pass on all their args to the dynamically-loaded function), and the same goes for similar wrapper functions.
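As a sketch of the kind of wrapper that benefits (the names `forward_to_impl` and `impl_ptr` are made up): a trampoline that forwards all of its arguments to whatever function pointer it finds can clobber r11 freely, because r11 carries no argument and no caller expects it to survive the call.

```nasm
forward_to_impl:
    mov     r11, [rel impl_ptr]     ; scratch reg: no arg lives here, and nobody
                                    ; expects it to be preserved across the call
    jmp     r11                     ; tail-call; all arg registers pass through untouched
```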
So AMD64 could have used more integer registers for function parameters, but only by shrinking the set of call-preserved registers that called functions have to save before using. (Or by dual-purposing r10 for languages that don’t use a “static chain” pointer, or something.)
Anyway, passing more parameters in registers isn’t always better.
xmm registers can’t be used as pointer or index registers, and moving data from the xmm registers back to integer registers could slow down the surrounding code more than loading data that was just stored. (If any execution resource is going to be a bottleneck, rather than cache misses or branch mispredicts, it’s more likely going to be ALU execution units, not load/store units. Moving data from xmm to gp registers takes an ALU uop, in Intel and AMD’s current designs.)
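For example (a minimal sketch): there is no addressing mode that takes an xmm register, so a pointer that arrived in xmm0 would have to be moved to a general-purpose register before it could be dereferenced.

```nasm
    movq    rax, xmm0           ; xmm -> gp costs an ALU uop on current Intel/AMD designs
    mov     edx, [rax]          ; only now can the value be used as an address
    ; something like  mov edx, [xmm0]  simply doesn't exist
```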
L1 cache is really fast, and store->load forwarding makes the total latency for a round trip to memory something like 5 cycles on e.g. Intel Haswell. (The latency of an instruction like `inc dword [mem]` is 6 cycles, including the one ALU cycle.)
If moving data from xmm to gp registers was all you were going to do (with nothing else to keep the ALU execution units busy), then yes, on Intel CPUs the round-trip latency for `movd xmm0, eax` / `movd eax, xmm0` (2 cycles on Intel Haswell) is less than the latency of `mov [mem], eax` / `mov eax, [mem]` (5 cycles on Intel Haswell), but integer code usually isn’t totally bottlenecked by latency the way FP code often is.
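Spelled out, those are the two round trips being compared (using the red zone below RSP for the memory version, just for illustration):

```nasm
; register round trip: ~2 cycles total latency on Haswell
    movd    xmm0, eax           ; gp -> xmm
    movd    eax, xmm0           ; xmm -> gp

; memory round trip: ~5 cycles on Haswell, thanks to store->load forwarding
    mov     [rsp - 8], eax      ; store (red zone, no RSP adjustment needed)
    mov     eax, [rsp - 8]      ; load forwarded straight from the store buffer
```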
On AMD Bulldozer-family CPUs, where two integer cores share a vector/FP unit, moving data directly between GP regs and vector regs is actually quite slow (8 or 10 cycles one way, or half that on Steamroller). A memory round trip is only 8 cycles.
32bit code manages to run reasonably well, even though all parameters are passed on the stack and have to be loaded back. CPUs are very highly optimized for storing parameters onto the stack and then loading them again, because the crufty old 32bit ABI is still used for a lot of code, esp. on Windows. (Linux systems mostly run 64bit code, while Windows desktop systems run a lot of 32bit code because so many Windows programs are only available as pre-compiled 32bit binaries.)
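For comparison, a hypothetical `add3(1, 2, 3)` under the i386 cdecl convention looks roughly like this: every argument is stored to the stack by the caller and loaded back from memory by the callee.

```nasm
    push    dword 3             ; args pushed right to left
    push    dword 2
    push    dword 1
    call    add3
    add     esp, 12             ; caller pops its own args

add3:                           ; callee loads every parameter from memory
    mov     eax, [esp + 4]      ; a
    add     eax, [esp + 8]      ; b
    add     eax, [esp + 12]     ; c
    ret
```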
See http://agner.org/optimize/ for CPU microarchitecture guides to learn how to figure out how many cycles something will actually take. There are other good links in the x86 wiki, including the x86-64 ABI doc linked above.