lizaczek lizaczek - 25 days ago 6
C++ Question

C++ How to force prefetch data to cache? (array loop)

I have loop like this

start = __rdtsc();
unsigned long long count = 0;
for(int i = 0; i < N; i++)
for(int j = 0; j < M; j++)
count += tab[i][j];
stop = __rdtsc();
time = (stop - start) * 1/3;


Need to check how prefetch data influences on efficiency. How to force prefetch some values from memory into cache before they will be counted?

Answer

First, I suppose that tab is a large 2D array such as a static array (e.g., int tab[1024*1024][1024*1024]) or a dynamically-allocated array (e.g., int** tab and following mallocs). Here, you want to prefetch some data from tab to the cache to reduce the execution time.

Simply, I don't think that you need to manually insert any prefetching to your code, where a simple reduction for a 2D array is performed. Modern CPUs will do automatic prefetching if necessary and profitable.


Two facts you should know for this problem:

(1) You are already exploit the spatial locality of tab inside of the innermost loop. Once tab[i][0] is read (after a cache miss, or a page fault), the data from tab[i][0] to tab[i][15] will be in your CPU caches, assuming that the cache line size is 64 bytes.

(2) However, when the code traverses in the row, i.e., tab[i][M-1] to tab[i+1][0], it is highly likely to happen a cold cache miss, especially when tab is a dynamically-allocated array where each row could be allocated in a fragmented way. However, if the array is statically allocated, each row will be located contiguously in the memory.


So, prefetching makes a sense only when you read (1) the first item of the next row and (2) j + CACHE_LINE_SIZE/sizeof(tab[0][0]) ahead of time.

You may do so by inserting a prefetch operation (e.g., __builtin_prefetch) in the upper loop. However, modern compilers may not always emit such prefetch instructions. If you really want to do that, you should check the generated binary code.

However, as I said, I do not recommend you do that because modern CPUs will mostly do prefetching automatically, and that automatic prefetching will mostly outperform your manual code. For instance, an Intel CPU like Ivy Bridge processors, there are multiple data prefetchers such as prefetching to L1, L2, or L3 cache. (I don't think mobile processors have a fancy data prefetcher though). Some prefetchers will load adjacent cache lines.


If you do more expensive computations on large 2D arrays, there are many alternative algorithms that are more friendly to caches. A notable example would be blocked(titled) matrix multiply. A naive matrix multiplication suffers a lot of cache misses, but a blocked algorithm significantly reduces cache misses by calculating on small subsets that are fit to caches. See some references like this.