I have loop like this
start = __rdtsc();
unsigned long long count = 0;
for(int i = 0; i < N; i++)
for(int j = 0; j < M; j++)
count += tab[i][j];
stop = __rdtsc();
time = (stop - start) * 1/3;
First, I suppose that
tab is a large 2D array such as a static array (e.g.,
int tab[1024*1024][1024*1024]) or a dynamically-allocated array (e.g.,
int** tab and following
mallocs). Here, you want to prefetch some data from
tab to the cache to reduce the execution time.
Simply, I don't think that you need to manually insert any prefetching to your code, where a simple reduction for a 2D array is performed. Modern CPUs will do automatic prefetching if necessary and profitable.
Two facts you should know for this problem:
(1) You are already exploit the spatial locality of
tab inside of the innermost loop. Once
tab[i] is read (after a cache miss, or a page fault), the data from
tab[i] will be in your CPU caches, assuming that the cache line size is 64 bytes.
(2) However, when the code traverses in the row, i.e.,
tab[i+1], it is highly likely to happen a cold cache miss, especially when
tab is a dynamically-allocated array where each row could be allocated in a fragmented way. However, if the array is statically allocated, each row will be located contiguously in the memory.
So, prefetching makes a sense only when you read (1) the first item of the next row and (2)
j + CACHE_LINE_SIZE/sizeof(tab) ahead of time.
You may do so by inserting a prefetch operation (e.g.,
__builtin_prefetch) in the upper loop. However, modern compilers may not always emit such prefetch instructions. If you really want to do that, you should check the generated binary code.
However, as I said, I do not recommend you do that because modern CPUs will mostly do prefetching automatically, and that automatic prefetching will mostly outperform your manual code. For instance, an Intel CPU like Ivy Bridge processors, there are multiple data prefetchers such as prefetching to L1, L2, or L3 cache. (I don't think mobile processors have a fancy data prefetcher though). Some prefetchers will load adjacent cache lines.
If you do more expensive computations on large 2D arrays, there are many alternative algorithms that are more friendly to caches. A notable example would be blocked(titled) matrix multiply. A naive matrix multiplication suffers a lot of cache misses, but a blocked algorithm significantly reduces cache misses by calculating on small subsets that are fit to caches. See some references like this.