L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes

I think the gap between a and b does not really matter. After leaving only one gap between b and c I’ve got the following results on Haswell:

k   %
-----
1  48
2  48
3  48
4  48
5  46
6  53
7  59
8  67
9  73
10 81
11 85
12 87
13 87
...
0  86

Since Haswell is known to be free of bank conflicts, the only remaining explanation is false dependence between memory addresses (and you’ve found proper place in Agner Fog’s microarchitecture manual explaining exactly this problem). The difference between bank conflict and false sharing is that bank conflict prevents accessing the same bank twice during the same clock cycle while false sharing prevents reading from some offset in 4K piece of memory just after you’ve written something to same offset (and not only during the same clock cycle but also for several clock cycles after the write).

Since your code (for k=0) writes to any offset just after doing two reads from the same offset and would not read from it for a very long time, this case should be considered as “best”, so I placed k=0 at the end of the table. For k=1 you always read from offset that is very recently overwritten, which means false sharing and therefore performance degradation. With larger k time between write and read increases and CPU core has more chances to pass written data through all memory hierarchy (which means two address translations for read and write, updating cache data and tags and getting data from cache, data synchronization between cores, and probably many more stuff). k=12 or 24 clocks (on my CPU) is enough for every written piece of data to be ready for subsequent read operations, so starting with this value performance gets back to usual. Looks not very different from 20+ clocks on AMD (as said by @Mysticial).

Leave a Comment