Optimize for fast multiplication but slow addition: FMA and doubledouble

To answer my third question, I found a faster solution for double-double addition: an alternative definition in the paper "Implementation of float-float operators on graphics hardware".

Theorem 5 (Add22 theorem). Let ah + al and bh + bl be the float-float arguments of the following algorithm:

Add22(ah, al, bh, bl)
r = ah ⊕ bh
…
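The excerpt is cut off after the first step of Add22, so the rest of the algorithm is not shown here. As a rough illustration of what a float-float addition of this kind can look like, the sketch below completes it in one commonly used way: a branch on the magnitudes of the high parts, followed by a fast two-sum renormalization. Everything after the first step is an assumption rather than text from the excerpt, and the names `fast_two_sum` and `add22` are chosen only for this example. Python is used because its `float` is an IEEE 754 double and the interpreter does not reorder or fuse the operations.

```python
def fast_two_sum(a, b):
    """Return (s, e) where s is the rounded sum a + b and e is its rounding error.

    Assumes |a| >= |b|, which is the precondition of the fast two-sum.
    """
    s = a + b
    e = b - (s - a)
    return s, e


def add22(ah, al, bh, bl):
    """Add the double-double values (ah + al) and (bh + bl); return (hi, lo)."""
    # Step 1, as given in the excerpt: r = ah (+) bh
    r = ah + bh
    # Assumed remaining steps: recover the rounding error of r, starting
    # from the operand with the larger high part, and fold in the low parts.
    if abs(ah) >= abs(bh):
        s = (((ah - r) + bh) + bl) + al
    else:
        s = (((bh - r) + ah) + al) + bl
    # Renormalize so that hi carries the leading bits and lo the remainder.
    return fast_two_sum(r, s)


if __name__ == "__main__":
    hi, lo = add22(1.0, 1e-20, 1.0, 1e-20)
    print(hi, lo)
```

The branch makes the error recovery cheaper than a branch-free two-sum, which is why this form is attractive when addition is the slow operation. Whether the branch pays off on a given CPU depends on how predictable the operand magnitudes are.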

Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%

IACA Analysis

Using IACA (the Intel Architecture Code Analyzer) reveals that macro-op fusion is indeed occurring and that it is not the problem. Mysticial is correct: the problem is that the store isn’t using Port 7 at all. IACA reports the following:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - …