The intrinsics guide says only this much about
void _mm_prefetch (char const* p, int i)
Fetch the line of data from memory that contains address p to a
location in the cache hierarchy specified by the locality hint i.
Sometimes intrinsics are better understood in terms of the instructions they represent than through the abstract semantics given in their descriptions.
The full set of locality constants, as of today, is
#define _MM_HINT_T0 1
#define _MM_HINT_T1 2
#define _MM_HINT_T2 3
#define _MM_HINT_NTA 0
#define _MM_HINT_ENTA 4
#define _MM_HINT_ET0 5
#define _MM_HINT_ET1 6
#define _MM_HINT_ET2 7
For IA32/AMD processors, the set is reduced to
#define _MM_HINT_T0 1
#define _MM_HINT_T1 2
#define _MM_HINT_T2 3
#define _MM_HINT_NTA 0
#define _MM_HINT_ET1 6
_mm_prefetch is compiled into different instructions depending on the architecture and the locality hint:
Hint            IA32/AMD      iMC
_MM_HINT_T0     prefetcht0    vprefetch0
_MM_HINT_T1     prefetcht1    vprefetch1
_MM_HINT_T2     prefetcht2    vprefetch2
_MM_HINT_NTA    prefetchnta   vprefetchnta
_MM_HINT_ENTA   -             vprefetchenta
_MM_HINT_ET0    -             vprefetche0
_MM_HINT_ET1    prefetchwt1   vprefetche1
_MM_HINT_ET2    -             vprefetche2
What the (v)prefetch instructions do, if all the requirements are satisfied, is bring a cache line's worth of data into the cache level specified by the locality hint.
The instruction is just a hint; it may be ignored.
When a line is prefetched into level X, the manuals (both Intel and AMD) say that it is also fetched into all the higher levels (except for the case X=3).
I'm not sure this is actually true. I believe the line is prefetched with respect to cache level X and, depending on the caching strategies of the higher levels (inclusive vs non-inclusive), it may or may not be present there too.
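To make the locality hint concrete, here is a minimal sketch of a loop that asks for data a few cache lines ahead with _MM_HINT_T0. The function name, the prefetch distance, and the 64-byte line size (16 four-byte ints) are assumptions chosen for illustration, not tuned values, and the macro falls back to a no-op on non-x86 targets. Since the instruction is only a hint, the result is the same whether or not the prefetch is honoured.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
/* The hint must be a compile-time constant. */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)(p)) /* no-op on other targets */
#endif

/* Assumptions: 64-byte cache lines, 4-byte int, distance of 64 elements. */
enum { INTS_PER_LINE = 16, PREFETCH_AHEAD = 64 };

/* Hypothetical helper: sums the array, hinting one line per 16 elements. */
long sum_with_prefetch(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        /* One hint per cache line; stay inside the array bounds. */
        if (i % INTS_PER_LINE == 0 && i + PREFETCH_AHEAD < n)
            PREFETCH_T0(&a[i + PREFETCH_AHEAD]);
        total += a[i];
    }
    return total;
}
```

A plain linear scan like this is usually handled well by the hardware prefetchers already, so treat it as a demonstration of the call, not as a guaranteed speed-up.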
Another attribute of the (v)prefetch instructions is the non-temporal attribute.
Non-temporal data is data that is unlikely to be reused soon.
In my understanding, NT data is stored in the "streaming load buffers" on the IA32 architecture [1], while on the iMC architecture it is stored in the normal cache (using the hardware thread ID to select the way) but with a Most Recently Used replacement policy, so that it will be the next line evicted if needed.
For AMD, the manual says that the actual location is implementation dependent, ranging from a software-invisible buffer to a dedicated non-temporal cache.
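As an illustration of the non-temporal hint, the following sketch reads a buffer exactly once and prefetches ahead with _MM_HINT_NTA, which is the typical case where polluting the caches is undesirable. The function name, the 64-byte line size, and the distance of 8 lines are assumptions made for this example; where the data actually lands is up to the implementation, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p)) /* no-op on other targets */
#endif

/* Assumptions: 64-byte cache lines, prefetch 8 lines ahead. */
enum { LINE_BYTES = 64, LINES_AHEAD = 8 };

/* Hypothetical helper: XORs all bytes of a buffer that is read only once. */
uint8_t xor_checksum_streaming(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint8_t acc = 0;
    for (size_t i = 0; i < len; ++i) {
        /* One non-temporal hint per line, kept inside the buffer. */
        if (i % LINE_BYTES == 0 && i + LINES_AHEAD * LINE_BYTES < len)
            PREFETCH_NTA(buf + i + LINES_AHEAD * LINE_BYTES);
        acc ^= buf[i];
    }
    return acc;
}
```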
The last attribute of the (v)prefetch instructions is the "intent" attribute or the "eviction" attribute.
Due to MESI and its variants, a Request For Ownership (RFO) must be made to bring a line into an exclusive state (in order to modify it).
An RFO is just a special read, so prefetching a line with an RFO brings it into the Exclusive state directly, which pays off when we know we will write to it later. Otherwise, the first store to the line cancels the benefits of prefetching because of the "delayed" RFO it needs.
The IA32 and AMD architectures don't support an exclusive non-temporal hint (yet), since the way the non-temporal cache level works is implementation-defined.
The iMC architecture allows for it with the locality code
[1] Which I understand to be the WC (write-combining) buffers.
For reference, here is the description of the instructions involved:
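The write-intent idea can be sketched with a read-modify-write loop. Because the exclusive hint constants are not available with every compiler and target, this example swaps in the GCC/Clang builtin __builtin_prefetch with its second argument set to 1 (prefetch for write) instead of _mm_prefetch; whether the compiler emits a write-intent prefetch instruction or an ordinary one depends on the target options. The function name, line size, and distance are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
/* rw = 1 requests the line for writing, locality = 3 keeps it in all levels. */
#define PREFETCH_FOR_WRITE(p) __builtin_prefetch((p), 1, 3)
#else
#define PREFETCH_FOR_WRITE(p) ((void)(p)) /* no-op on other compilers */
#endif

/* Assumptions: 64-byte cache lines, 4-byte int, distance of 64 elements. */
enum { ELEMS_PER_LINE = 16, WRITE_AHEAD = 64 };

/* Hypothetical helper: adds delta to every element in place. */
void add_in_place(int *a, size_t n, int delta)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        /* Ask for ownership of a line we are about to modify. */
        if (i % ELEMS_PER_LINE == 0 && i + WRITE_AHEAD < n)
            PREFETCH_FOR_WRITE(&a[i + WRITE_AHEAD]);
        a[i] += delta;
    }
}
```

The intent is that the store to a[i] later finds the line already owned, instead of triggering the "delayed" RFO mentioned above.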
Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by a locality hint:
• T0 (temporal data)—prefetch data into all levels of the cache hierarchy.
• T1 (temporal data with respect to first level cache misses)—prefetch data into level 2 cache and higher.
• T2 (temporal data with respect to second level cache misses)—prefetch data into level 3 cache and higher, or an implementation-specific choice.
• NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution.
Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by an intent to write hint (so that data is brought into ‘Exclusive’ state via a request for ownership) and a locality hint:
• T1 (temporal data with respect to first level cache)—prefetch data into the second level cache.
Instruction     Cache level   Non-temporal   Exclusive state
VPREFETCH0      L1            NO             NO
VPREFETCHNTA    L1            YES            NO
VPREFETCH1      L2            NO             NO
VPREFETCH2      L2            YES            NO
VPREFETCHE0     L1            NO             YES
VPREFETCHENTA   L1            YES            YES
VPREFETCHE1     L2            NO             YES
VPREFETCHE2     L2            YES            YES