Narek Atayan - 23 days ago
C++ Question

Scenarios where manual software prefetching instructions are reasonable

I have read that on x86 and x86-64 Intel provides special prefetching instructions:

#include <xmmintrin.h>
enum _mm_hint
{
  _MM_HINT_T0 = 3,
  _MM_HINT_T1 = 2,
  _MM_HINT_T2 = 1,
  _MM_HINT_NTA = 0
};
void _mm_prefetch(void *p, enum _mm_hint h);

Programs can use the _mm_prefetch intrinsic on any pointer in the program. The different hints to be used with the _mm_prefetch intrinsic are implementation-defined. Generally speaking, each of the hints has its own meaning.

_MM_HINT_T0 fetches data to all levels of the cache for inclusive caches and to the lowest level cache for exclusive caches.

The _MM_HINT_T1 hint pulls the data into L2 and not into L1d. If there is an L3 cache, the _MM_HINT_T2 hint can do something similar for it.

_MM_HINT_NTA allows telling the processor to treat the prefetched cache line specially.

So can someone describe examples of when this instruction is used?

And how do I properly choose the hint?


The idea of prefetching is based upon these facts:

  • Accessing memory is very expensive the first time.
    The first time a memory address [1] is accessed it must be fetched from memory; it is then stored in the cache hierarchy [2].
  • Accessing memory is inherently asynchronous.
    The CPU doesn't use any core resource to perform a load/store [3], so the access can easily be done in parallel with other tasks.

Thanks to the above, it makes sense to issue a load before it is actually needed, so that when the code actually needs the data it won't have to wait.
It is worth noting that the CPU can go pretty far ahead when looking for something to do, but not arbitrarily deep; so sometimes it needs the help of the programmer to perform optimally.
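To make the "load it before you need it" idea concrete, here is a minimal sketch, assuming a loop that reads a table through effectively random indices (a pattern the hardware prefetcher usually cannot predict). The function name sum_indirect and the distance kPrefetchDistance are made up for illustration; the right distance has to be found by measuring on the target machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0 (x86/x86-64 only)

// Hypothetical tuning knob: how many iterations ahead to prefetch.
// A good value depends on memory latency and on the work done per iteration.
constexpr std::size_t kPrefetchDistance = 8;

// Sums table[index[i]] for every i.
// Precondition: every value in `index` is a valid position in `table`.
long long sum_indirect(const std::vector<int>& table,
                       const std::vector<std::size_t>& index) {
    long long sum = 0;
    const std::size_t n = index.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kPrefetchDistance < n) {
            // Ask for the element needed kPrefetchDistance iterations from now.
            // This is only a hint; it does not change the result.
            _mm_prefetch(reinterpret_cast<const char*>(
                             &table[index[i + kPrefetchDistance]]),
                         _MM_HINT_T0);
        }
        assert(index[i] < table.size());
        sum += table[index[i]];
    }
    return sum;
}
```

Whether this is faster than the plain loop depends entirely on the CPU and the data, so it should be benchmarked both ways.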

The cache hierarchy is, by its very nature, an aspect of the micro-architecture, not of the architecture (read: ISA). Intel or AMD cannot give strong guarantees on what these instructions do.
Furthermore, using them correctly is not easy, as the programmer must have a clear idea of how many cycles each instruction can take. Finally, the latest CPUs are getting better and better at hiding memory latency and lowering it.
So in general prefetching is a job for the skilled assembly programmer.

That said, the only plausible scenario is one where the timing of a piece of code must be consistent at every invocation.
For example, if you know that an interrupt handler always updates some state and must perform as fast as possible, it is worth prefetching the state variable when setting up the hardware that uses that interrupt.
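A rough sketch of that interrupt scenario, under several assumptions: HandlerState, arm_event_source and on_event are hypothetical names, the cache line is assumed to be 64 bytes, and the actual device programming is left out. A real handler would also need volatile or atomic accesses, which are omitted here to keep the idea visible.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0 (x86/x86-64 only)

// Hypothetical state touched by the interrupt handler. alignas(64) assumes a
// 64-byte cache line, so the whole struct sits in a single line.
struct alignas(64) HandlerState {
    std::uint64_t events = 0;
    std::uint64_t last_timestamp = 0;
};

HandlerState g_state;

// Hypothetical setup routine, called just before the hardware is armed.
void arm_event_source() {
    // Pull the handler's state towards the core now, so the handler's first
    // access is less likely to pay the full memory latency.
    _mm_prefetch(reinterpret_cast<const char*>(&g_state), _MM_HINT_T0);
    // ... program the device and enable the interrupt here (omitted) ...
}

// Hypothetical handler body: records one event and its timestamp.
void on_event(std::uint64_t timestamp) {
    assert(timestamp >= g_state.last_timestamp);
    g_state.last_timestamp = timestamp;
    ++g_state.events;
}
```

The prefetch only helps if the interrupt fires soon enough that the line has not been evicted in the meantime.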

Regarding the different levels of prefetching, my understanding is that different levels (L1 - L4) correspond to different amounts of sharing and polluting.

For example, prefetcht0 is good if the thread/core that executes the instruction is the same one that will read the variable.
However, this will take a line in all the caches, eventually evicting other, possibly useful, lines. You can use this, for example, when you know for sure that you'll need the data shortly.

prefetcht1 is good for making the data quickly available to all cores or to a core group (depending on how L2 is shared) without polluting L1.
You can use this if you know that you may need the data, or that you'll need it after you are done with another task (one that takes priority in using the cache).
This is not as fast as having the data in L1, but it is much better than having it in memory.

prefetcht2 can be used to take out most of the memory access latency, since it moves the data into the L3 cache.
It doesn't pollute L1 or L2 and L3 is shared among cores, so it's good for data used by rare (but possible) code paths or for preparing data for other cores.

prefetchnta is the easiest to understand: it is a non-temporal move. It avoids creating an entry in every cache level for data that is accessed only once.
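As an illustration of the non-temporal hint, here is a small sketch, assuming a buffer that is read exactly once. The function sum_once, the 64-byte line size and the look-ahead of 8 lines are assumptions made for the example. Sequential reads like this are also what hardware prefetchers handle well, so the expected benefit is less cache pollution rather than lower latency, and even that is up to the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA (x86/x86-64 only)

// Adds up every byte of a buffer that will not be read again afterwards.
std::uint64_t sum_once(const std::vector<std::uint8_t>& buf) {
    constexpr std::size_t kLineBytes = 64;               // assumed line size
    constexpr std::size_t kAheadBytes = 8 * kLineBytes;  // hypothetical distance
    const std::uint8_t* data = buf.data();
    const std::size_t size = buf.size();
    assert(data != nullptr || size == 0);

    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < size; ++i) {
        // Issue one hint every kLineBytes bytes, for data kAheadBytes ahead.
        if (i % kLineBytes == 0 && i + kAheadBytes < size) {
            _mm_prefetch(reinterpret_cast<const char*>(data + i + kAheadBytes),
                         _MM_HINT_NTA);
        }
        sum += data[i];
    }
    return sum;
}
```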

prefetchw/prefetchwt1 are like the others, but they make the line Exclusive and invalidate other cores' lines that alias this one.
Basically this makes writing faster, as the line is already in the optimal state of the MESI protocol (for cache coherence).
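For the write-intent case, a sketch using the GCC/Clang builtin __builtin_prefetch(addr, rw, locality) with rw = 1 (not the _mm_prefetch intrinsic from the question). Whether the compiler actually emits prefetchw depends on the target options; otherwise it may fall back to an ordinary prefetch. The function count_keys and the distance kAhead are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts how many keys fall into each bucket (bucket = key % counts.size()).
// Requires GCC or Clang for __builtin_prefetch.
void count_keys(const std::vector<std::uint32_t>& keys,
                std::vector<std::uint32_t>& counts) {
    assert(!counts.empty());
    constexpr std::size_t kAhead = 8;  // hypothetical prefetch distance
    const std::size_t n = keys.size();
    const std::size_t buckets = counts.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n) {
            // rw = 1: we intend to write this line.
            // locality = 3: keep it in all cache levels if possible.
            __builtin_prefetch(&counts[keys[i + kAhead] % buckets], 1, 3);
        }
        ++counts[keys[i] % buckets];
    }
}
```

With a small counts array everything fits in L1 anyway; the hint can only matter when the buckets are spread over far more memory than the caches hold.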

Finally, a prefetch can be done incrementally: first by moving the data into L3 and then by moving it into L1 (just for the threads that need it).

In short, each instruction lets you decide the compromise between pollution, sharing and speed of access.
Since all of these require keeping track of the use of the cache very carefully (you need to know that it's not worth creating an entry in L1 but that it is in L2), their use is limited to very specific environments.
In a modern OS it's not possible to keep track of the cache: you can do a prefetch just to find your quantum expired and your program replaced by another one that evicts the line you just loaded.

As for a concrete example, I'm a bit out of ideas.
In the past I had to measure the timing of some external event as consistently as possible.
I used an interrupt to periodically monitor the event; in that case I prefetched the variables needed by the interrupt handler, thereby eliminating the latency of the first access.

Another, unorthodox, use of prefetching is simply to move the data into the cache.
This is useful if you want to test the cache system, or to unmap a device from memory while relying on the cache to keep the data a bit longer.
In this case moving the data to L3 is enough; not all CPUs have an L3, though, so we may need to move it to L2 instead.

I understand these examples are not very good, though.

[1] Actually the granularity is "cache lines", not "addresses".
[2] Which I assume you are familiar with. Briefly put: at present, it goes from L1 to L3/L4. L3/L4 is shared among cores. L1 is always private per core and shared by the core's threads; L2 is usually like L1, but some models may have L2 shared across pairs of cores.
[3] For the data transfer only; computing the address takes up resources.