If I may start with an example.
Say we have a system with 4 sockets, where each socket has 4 cores and 2GB of local RAM, organized as ccNUMA (cache-coherent non-uniform memory access) memory.
Let's say 4 processes are running, one on each socket (P1 to P4), and all of them share a memory region allocated in P2's RAM, denoted SHM. This means any load/store to that region will incur a lookup into P2's directory, correct? If so, when that lookup happens, is it equivalent to accessing RAM in terms of latency? Where does this directory reside physically? (See below)
With a more concrete example: Say P2 does a LOAD on SHM and that data is brought into P2's L3 cache with the tag '(O)wner'. Furthermore, say P4 does a LOAD on the same SHM. This will cause P4 to do a lookup into P2's directory, and since the data is tagged as Owned by P2 my question is:
Does P4 get SHM from P2's RAM, or does it ALWAYS get the data from P2's L3 cache?
If it always gets the data from the L3 cache, wouldn't it be faster to get the data directly from P2's RAM, since it already has to do a lookup in P2's directory? My understanding is that the directory is literally sitting on top of the RAM.
Sorry if I'm grossly misunderstanding what is going on here, but I hope someone can help clarify this.
Also, is there any data on how fast such a directory lookup is? Is there documentation on the average latencies of such lookups in terms of data retrieval? How many cycles do an L3 read hit, an L3 read miss, and a directory lookup take?
It depends on whether the Opteron processor implements the HT Assist mechanism.
If it does not, then there is no directory. In your example, when P4 issues a load, a memory request will arrive at P2's memory controller. P2 will answer back with the cache line and will also send a probe message to the other two processors. Finally, those two processors will each answer back to P4 with an ACK saying they do not have a copy of the cache line.
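To make the message flow concrete, here is a toy sketch that just lists the messages described above for one load when there is no directory. The function name `broadcast_load` and the message labels are made up for illustration; this is not a model of the real HyperTransport protocol, only of the sequence described in this answer.

```python
def broadcast_load(requester, home, nodes):
    """List the (source, destination, kind) messages for one load
    when there is no directory (hypothetical toy model)."""
    # The requester asks the home node, which owns the memory.
    msgs = [(requester, home, "request")]
    # The home node answers with the cache line.
    msgs.append((home, requester, "data"))
    # The home node probes every other node, and each one
    # acknowledges directly to the requester.
    for node in nodes:
        if node not in (requester, home):
            msgs.append((home, node, "probe"))
            msgs.append((node, requester, "ack"))
    return msgs


nodes = ["P1", "P2", "P3", "P4"]
msgs = broadcast_load("P4", "P2", nodes)
for m in msgs:
    print(m)
```

For the 4-socket example this gives one request, one data reply, and a probe plus an ACK for each of P1 and P3.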
If HT Assist is enabled (typically on sockets with 6 or more cores), then each L3 cache contains a snoop filter (directory) that records which processors are holding a line. Thus, in your example, no probe messages will be sent to the other two processors, because the HT Assist directory lookup shows that no one else has a copy of the line (this is a simplification, as the state of the line would be Exclusive instead of Owned and no directory lookup would be needed).
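As a rough illustration of what the snoop filter saves, here is a toy sketch in which probes go only to nodes that the directory lists as holding the line. The names `filtered_load` and `directory` are hypothetical, the directory is just a dict from line name to a set of holders, and cache line states (Owned, Exclusive, and so on) are ignored entirely.

```python
def filtered_load(requester, home, nodes, directory, line):
    """List the (source, destination, kind) messages for one load
    when the home node consults a directory (hypothetical toy model)."""
    # Nodes the directory records as holding a copy of the line.
    holders = directory.get(line, set())
    msgs = [(requester, home, "request"), (home, requester, "data")]
    # Only recorded holders are probed; everyone else is left alone.
    for node in nodes:
        if node in holders and node not in (requester, home):
            msgs.append((home, node, "probe"))
            msgs.append((node, requester, "ack"))
    return msgs


nodes = ["P1", "P2", "P3", "P4"]

# Nobody else holds SHM: no probes at all.
no_holders = filtered_load("P4", "P2", nodes, {}, "SHM")

# Assume P3 holds a copy: only P3 is probed.
one_holder = filtered_load("P4", "P2", nodes, {"SHM": {"P3"}}, "SHM")

print(len(no_holders), len(one_holder))
```

In this toy model the load in your example needs only the request and the data reply, since no other node is recorded as holding SHM.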