For Sandy/Ivy Bridge you need to unroll by 3:
- Only FP Add has a dependency on the previous iteration of the loop
- FP Add can issue every cycle
- FP Add takes three cycles to complete
- Thus unrolling by 3/1 = 3 (latency divided by cycles per issue) completely hides the latency
- FP Mul and FP Load do not have a dependency on the previous iteration, so you can rely on the OoO core to issue them in near-optimal order. These instructions could affect the unroll factor only if they lowered the throughput of FP Add (not the case here: FP Load + FP Add + FP Mul can all issue every cycle).
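As a rough illustration of what "unroll by 3" means in source code, here is a minimal scalar sketch of a dot product with three independent accumulators. The function name `dot_unroll3` is hypothetical, and a real peak-FLOPS kernel would use vector registers rather than scalars; the point is only that each accumulator forms its own FP Add dependency chain.

```c
#include <stddef.h>

/* Hypothetical helper: dot product with three independent accumulators.
   Each of s0, s1, s2 is a separate FP Add dependency chain, so an add
   with 3-cycle latency can still start every cycle. */
double dot_unroll3(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;
    size_t i = 0;

    /* Main loop: three independent multiply-add chains per iteration. */
    for (; i + 3 <= n; i += 3) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
    }

    /* Remainder: up to two leftover elements. */
    for (; i < n; ++i)
        s0 += a[i] * b[i];

    /* Combine the partial sums once, outside the hot loop. */
    return (s0 + s1) + s2;
}
```

Note that splitting the sum across accumulators changes the order of the floating-point additions, so the result can differ in the last bits from a single-accumulator loop.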
For Haswell you need to unroll by 10:
- Only FMA has a dependency on the previous iteration of the loop
- FMA can double-issue every cycle (i.e. on average, each independent FMA takes 0.5 cycles)
- FMA has a latency of 5 cycles
- Thus unrolling by 5/0.5 = 10 completely hides the FMA latency
- The two FP Load microoperations do not have a dependency on the previous iteration, and can co-issue with 2x FMA, so they don’t affect the unroll factor.
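The arithmetic in both cases is the same: the number of independent dependency chains needed equals the latency divided by the cycles per issue, which is the same as latency times instructions issued per cycle. A small sketch of that calculation, with a hypothetical helper name and the latency and issue-rate figures taken from the lists above:

```c
/* Hypothetical helper: number of independent dependency chains needed to
   hide the latency of the loop-carried instruction.
   latency_cycles   - cycles until the result is available
   issues_per_cycle - how many such instructions can start each cycle */
unsigned unroll_factor(unsigned latency_cycles, unsigned issues_per_cycle)
{
    /* latency / (1 / issues_per_cycle) == latency * issues_per_cycle */
    return latency_cycles * issues_per_cycle;
}
```

With the figures quoted above, `unroll_factor(3, 1)` gives 3 for FP Add on Sandy/Ivy Bridge and `unroll_factor(5, 2)` gives 10 for FMA on Haswell.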