What is a microcoded instruction?

A CPU reads machine code and decodes it into internal control signals that send the right data to the right execution units.

Most instructions map to one internal operation, and can be decoded directly. (e.g. on x86, add eax, edx just sends eax and edx to the integer ALU for an ADD operation, and puts the result in eax.)

Some other single instructions do much more work. e.g. x86’s rep movs implements memcpy(edi, esi, ecx), and requires the CPU to loop.
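To make the "requires the CPU to loop" point concrete, here is a minimal Python sketch of the architectural effect of the byte-sized form, `rep movsb`, assuming the direction flag is clear. The function name `rep_movsb` and the flat `bytearray` memory are illustrative assumptions, and a real CPU's microcode is free to copy in much larger chunks than one byte per step.

```python
def rep_movsb(mem: bytearray, edi: int, esi: int, ecx: int):
    """Toy model of x86 `rep movsb` with DF=0 (hypothetical helper).

    Copies ecx bytes from mem[esi...] to mem[edi...] one byte at a time
    and returns the final (edi, esi, ecx) register values.
    """
    while ecx != 0:
        mem[edi] = mem[esi]  # one byte-sized movs step
        edi += 1             # both pointers advance because DF=0
        esi += 1
        ecx -= 1             # rep counts ecx down to zero
    return edi, esi, ecx


mem = bytearray(b"hello" + b"." * 11)
print(rep_movsb(mem, edi=8, esi=0, ecx=5))  # (13, 5, 0)
print(bytes(mem))                           # b'hello...hello...'
```

The point is only that one instruction implies a data-dependent number of internal steps, which fixed decoder logic can't emit all at once.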

When the instruction decoders see an instruction like that, instead of just producing internal control signals directly they read micro-code out of the microcode ROM.

A micro-coded instruction is one that decodes to many internal operations.


Modern x86 CPUs always decode x86 instructions to internal micro-operations. In this terminology, it still doesn’t count as “micro-coded” even when add [mem], eax decodes to a load from [mem], an ALU ADD operation, and a store back into [mem]. Another example is xchg eax, edx, which decodes to 3 uops on Intel Haswell. Interestingly, those aren’t exactly the same kind of uops you’d get from using 3 MOV instructions to do the exchange with a scratch register, because the xchg uops aren’t zero-latency.
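As a rough illustration of the load / ADD / store breakdown, here is a toy Python interpreter for a made-up uop format. The tuple encoding, the `tmp` internal register, and the address `0x1000` are all assumptions for the sketch, not how any real CPU represents uops.

```python
def run_uops(uops, regs, mem):
    """Execute a list of toy micro-ops against register and memory dicts."""
    for op, dst, *srcs in uops:
        if op == "load":
            regs[dst] = mem[srcs[0]]  # dst register <- memory
        elif op == "add":
            regs[dst] = (regs[srcs[0]] + regs[srcs[1]]) & 0xFFFFFFFF  # 32-bit wrap
        elif op == "store":
            mem[dst] = regs[srcs[0]]  # memory <- source register
        else:
            raise ValueError(f"unknown uop {op!r}")


ADDR = 0x1000  # hypothetical address standing in for [mem]

# `add [mem], eax` as the three internal steps described above;
# "tmp" is a hypothetical internal temporary, not an architectural register.
ADD_MEM_EAX = [
    ("load", "tmp", ADDR),
    ("add", "tmp", "tmp", "eax"),
    ("store", ADDR, "tmp"),
]

regs = {"eax": 2}
mem = {ADDR: 40}
run_uops(ADD_MEM_EAX, regs, mem)
print(mem[ADDR])  # 42
```

The list has a fixed, short length known at decode time, which is why the decoders can produce it directly without involving the microcode ROM.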

On Intel / AMD CPUs, “micro-coded” means the decoders turn on the micro-code sequencer to feed uops from the ROM into the pipeline, instead of producing multiple uops directly.

(You could call any multi-uop x86 instruction “microcoded” if you were thinking in pure RISC terms, but it’s useful to use the term “microcoded” to make a different distinction, IMO. This meaning is I think widespread in x86 optimization circles, like Intel’s optimization manual. Other people may use different meanings for terminology, especially if talking about other architectures or about computer architecture in general when comparing x86 to a RISC.)

In current Intel CPUs, the limit on what the decoders can produce directly, without going to micro-code ROM, is 4 uops (fused-domain). AMD similarly has FastPath (aka DirectPath) single or double instructions (1 or 2 “macro-ops”, AMD’s equivalent of uops), and beyond that it’s VectorPath aka Microcode, as explained in David Kanter’s in-depth look at AMD Bulldozer, specifically talking about its decoders.
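The thresholds in that paragraph can be written down as a small Python sketch. The function names and return strings are made up for illustration, and the uop counts you pass in are inputs you would look up elsewhere (e.g. in Agner Fog's tables), not something this code knows.

```python
INTEL_DIRECT_DECODE_LIMIT = 4  # fused-domain uops, per the paragraph above


def intel_decode_path(fused_uops: int) -> str:
    """Which part of the front-end supplies the uops (toy model)."""
    if fused_uops < 1:
        raise ValueError("this toy model expects at least one uop")
    if fused_uops <= INTEL_DIRECT_DECODE_LIMIT:
        return "decoders"
    return "microcode sequencer"


def amd_decode_path(macro_ops: int) -> str:
    """AMD's classification by macro-op count (toy model)."""
    if macro_ops < 1:
        raise ValueError("this toy model expects at least one macro-op")
    if macro_ops == 1:
        return "FastPath single"
    if macro_ops == 2:
        return "FastPath double"
    return "VectorPath (microcode)"


print(intel_decode_path(1))  # add eax, edx -> decoders
print(intel_decode_path(3))  # xchg eax, edx on Haswell -> decoders
print(intel_decode_path(9))  # hypothetical count above the limit -> microcode sequencer
print(amd_decode_path(2))    # FastPath double
```

So "micro-coded" in this sense is a property of how a particular CPU delivers the uops, not of the instruction itself: the same instruction can land on different sides of the threshold on different microarchitectures.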

Another example is x86’s integer DIV instruction, which is micro-coded even on modern Intel CPUs like Haswell. But not AMD; AMD just has one or 2 uops activate everything inside the integer divider unit. It’s not fundamental to DIV, just an implementation choice. See my answer on “C++ code for testing the Collatz conjecture faster than hand-written assembly – why?” for the numbers.

FP division is also slow, but is decoded to a single uop so it doesn’t bottleneck the front-end. If FP division is rare and not part of a latency bottleneck, it can be as cheap as multiplication. (But if execution does have to wait for its result, or bottlenecks on its throughput, it’s much slower.) More in this answer.

Integer division and other micro-coded instructions can give the CPU a hard time, and create effects that make code alignment matter where it wouldn’t otherwise.


To learn more about x86 CPU internals, see the x86 tag wiki, and especially Agner Fog’s microarch guide.

Also David Kanter’s deep dives into x86 microarchitectures are useful to understand the pipeline that uops go through: Core 2 and Sandy Bridge being major ones, also AMD K8 and Bulldozer articles are interesting for comparison.

RISC vs. CISC Still Matters (Feb 2000) by Paul DeMone looks at how PPro breaks down instructions into uops, vs. RISCs where most instructions are already simple to just go through the pipeline in one step, with only rare ones like ARM push/pop multiple registers needing to send multiple things down the pipeline (aka microcoded in RISC terms).

And for good measure, Modern Microprocessors: A 90-Minute Guide! is always worth recommending for the basics of pipelining and OoO exec.


Other uses of the term in very different contexts than modern x86

In some older / simpler CPUs, every instruction was effectively micro-coded. For example, the 6502 executed 6502 instructions by running a sequence of internal instructions from a PLA decode ROM. This works well for a non-pipelined CPU, where the order of using the different parts of the CPU can vary from instruction to instruction.


Historically, there was a different technical meaning for “microcode”, meaning something like the internal control signals decoded from the instruction word. Especially in a CPU like MIPS where the instruction word mapped directly to those control signals, without complicated decoding. (I may have this partly wrong; I read something like this (other than in the deleted answer on this question) but couldn’t find it again later.)

This meaning may still actually get used in some circles, like when designing a simple pipelined CPU, like a hobby MIPS.
