Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

For Sandy/Ivy Bridge you need to unroll by 3:

  • Only FP Add has a dependency on the previous iteration of the loop
  • FP Add can issue every cycle
  • FP Add takes three cycles to complete
  • Thus unrolling by 3/1 = 3 (latency divided by the issue interval, both in cycles) completely hides the latency
  • FP Mul and FP Load do not have a dependency on the previous iteration, and you can rely on the OoO core to issue them in near-optimal order. These instructions could affect the unroll factor only if they lowered the throughput of FP Add (not the case here: FP Load, FP Add and FP Mul can each issue every cycle).
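
As a rough illustration of the reasoning above, here is a minimal scalar C sketch that unrolls a reduction by 3 with three independent accumulators. The kernel (a sum of squares, which needs one load per multiply and add) and the name sum_squares_unroll3 are assumptions made for this example, not taken from the original code. In a vectorized version each accumulator would be a SIMD register, but the count would stay the same.

```c
#include <stddef.h>

/* Hypothetical example kernel: sum of a[i]*a[i], unrolled by 3.
 * Each accumulator carries its own dependency chain, so three FP adds
 * can be in flight at once, matching a 3-cycle add latency with one
 * add issued per cycle. */
float sum_squares_unroll3(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f;
    size_t i = 0;

    /* Main loop: three independent multiply-and-add chains. */
    for (; i + 3 <= n; i += 3) {
        s0 += a[i]     * a[i];
        s1 += a[i + 1] * a[i + 1];
        s2 += a[i + 2] * a[i + 2];
    }

    /* Remainder loop for the last n % 3 elements. */
    for (; i < n; ++i)
        s0 += a[i] * a[i];

    /* Combine the partial sums once, outside the hot loop. */
    return (s0 + s1) + s2;
}
```

Note that splitting the sum across accumulators changes the order of the floating-point additions, so the result can differ in the last bits from a strictly sequential loop. This is also why a compiler will normally not do this transformation on its own without relaxed floating-point options.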

For Haswell you need to unroll by 10:

  • Only FMA has a dependency on the previous iteration of the loop
  • FMA can double-issue every cycle (i.e. on average independent instructions take 0.5 cycles)
  • FMA has a latency of 5 cycles
  • Thus unrolling by 5/0.5 = 10 completely hides FMA latency
  • The two FP Load micro-operations do not have a dependency on the previous iteration, and can co-issue with 2x FMA, so they don’t affect the unroll factor.
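
The same idea with ten independent chains can be sketched as follows. Again this is only an illustrative scalar C version under assumptions: the sum-of-squares kernel (one load per FMA), the UNROLL macro and the name sum_squares_unroll10 are all made up for the example. Whether each multiply and add actually becomes a single FMA instruction depends on the compiler and its flags (for example, targeting Haswell with floating-point contraction enabled), and the short inner loop is expected, but not guaranteed, to be fully unrolled by an optimizing compiler.

```c
#include <stddef.h>

#define UNROLL 10  /* assumed: FMA latency 5 cycles / 0.5 cycles per FMA */

/* Hypothetical example kernel: sum of a[i]*a[i] with UNROLL independent
 * accumulators, so up to ten multiply-add chains can overlap. */
float sum_squares_unroll10(const float *a, size_t n)
{
    float acc[UNROLL] = {0.0f};
    size_t i = 0;

    /* Main loop: acc[k] depends only on its own previous value. */
    for (; i + UNROLL <= n; i += UNROLL) {
        for (size_t k = 0; k < UNROLL; ++k)
            acc[k] += a[i + k] * a[i + k];
    }

    /* Remainder loop for the last n % UNROLL elements. */
    for (; i < n; ++i)
        acc[0] += a[i] * a[i];

    /* Reduce the partial sums once at the end. */
    float sum = 0.0f;
    for (size_t k = 0; k < UNROLL; ++k)
        sum += acc[k];
    return sum;
}
```

Ten accumulators (plus any temporaries) still fit in the 16 architectural vector registers available in 64-bit mode, which is one reason an unroll factor of 10 is practical here.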
