Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

Question

For Sandy/Ivy Bridge you need to unroll by 3:

Only FP Add has a dependency on the previous iteration of the loop
FP Add can issue every cycle
FP Add has a latency of three cycles
Thus unrolling by 3/1 = 3 (latency divided by cycles per issue) completely hides the latency
FP Mul and FP Load do not depend on the previous iteration, so you can rely on the OoO core to issue them in near-optimal order. They would affect the unroll factor only if they lowered the throughput of FP Add, which is not the case here: an FP Load, an FP Add, and an FP Mul can all issue in the same cycle.
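
As a rough illustration of the points above, here is a minimal scalar sketch. It assumes the loop is a sum-of-products reduction (the actual loop is not shown here), and the name `dot_unroll3` is hypothetical. Each of the three accumulators forms its own dependency chain, so three FP adds can be in flight at once. In SIMD code each accumulator would be a vector register rather than a scalar.

```c
#include <stddef.h>

/* Hypothetical example: sum of products with three independent
   accumulators, matching an FP add latency of 3 cycles at one
   issue per cycle. */
float dot_unroll3(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f;
    size_t i = 0;

    /* Main loop: each accumulator carries its own dependency chain. */
    for (; i + 3 <= n; i += 3) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
    }

    /* Remainder: handle the last n % 3 elements. */
    for (; i < n; ++i)
        s0 += a[i] * b[i];

    /* Combine the partial sums once, outside the hot loop. */
    return (s0 + s1) + s2;
}
```

Note that splitting the sum across accumulators changes the order of the floating-point additions, so the result can differ slightly from a single-accumulator loop. That is also why a compiler will generally not do this transformation on its own unless it is allowed to reassociate FP math.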

For Haswell you need to unroll by 10:

Only FMA has a dependency on the previous iteration of the loop
Two FMAs can issue every cycle (i.e. independent FMAs have an average reciprocal throughput of 0.5 cycles)
FMA has a latency of 5 cycles
Thus unrolling by 5/0.5 = 10 completely hides FMA latency
The two FP Load micro-operations do not depend on the previous iteration and can co-issue with the two FMAs, so they don’t affect the unroll factor.
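
The arithmetic in both cases is the same rule: the number of independent dependency chains needed equals the latency multiplied by the number of instructions that can issue per cycle (dividing by 0.5 cycles per instruction is the same as multiplying by 2 per cycle). A small sketch of that rule, with the hypothetical helper name `unroll_factor`:

```c
/* Hypothetical helper: number of independent dependency chains needed
   to keep a pipelined unit busy. Dividing latency by the reciprocal
   throughput is the same as multiplying by issues per cycle. */
unsigned unroll_factor(unsigned latency_cycles, unsigned issues_per_cycle)
{
    return latency_cycles * issues_per_cycle;
}
```

With the numbers quoted above, `unroll_factor(3, 1)` gives 3 for FP Add on Sandy/Ivy Bridge and `unroll_factor(5, 2)` gives 10 for FMA on Haswell.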