Are modern CPU caches optimized to deal with constant strides? Across threads?

Say I have a big array, and multiple threads reading from the array. Each thread iterates through the array by jumping a constant amount, but starts at a different offset (as sketched in the code below). So thread 1 may start at element 0, then read elements 32, 64, 96, etc., but thread 2 starts at element 1 and reads elements 33, 65, 97, etc. (keeping in mind that an 'element' may constitute more than a single byte or word). I know that spatial locality is usually desirable for getting the best cache performance, but I've also read that modern CPUs have hardware prefetchers that look for patterns in accesses, and a stride seems to me like an obvious pattern.

  • So is this cache friendly on a modern box or isn't it?
  • What if I increase the stride to something larger than a cache line?
  • Is the answer affected by the use of multiple threads (so despite accessing the same memory they may be running on different cores with different caches)?
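For concreteness, here is a minimal sketch of the access pattern being described, written in C++ with std::thread; the array size, thread count, and stride are illustrative values picked for the sketch, not something specified in the question:

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Illustrative parameters only; the question does not fix exact values.
constexpr std::size_t kNumThreads = 32;        // also used as the stride
constexpr std::size_t kArraySize  = 1 << 24;   // "a big array"

// Each thread starts at its own offset and jumps by a constant stride, so
// thread 0 reads elements 0, 32, 64, ... and thread 1 reads 1, 33, 65, ...
void reader(const std::vector<int>& data, std::size_t offset, std::size_t stride,
            long long& sink) {
    long long sum = 0;
    for (std::size_t i = offset; i < data.size(); i += stride)
        sum += data[i];
    sink = sum;   // keep the loads from being optimized away
}

int main() {
    std::vector<int> data(kArraySize, 1);
    std::vector<long long> sinks(kNumThreads, 0);
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < kNumThreads; ++t)
        threads.emplace_back(reader, std::cref(data), t, kNumThreads,
                             std::ref(sinks[t]));
    for (auto& th : threads) th.join();
}
```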


Cache performance is pretty complex, and the really reliable answers are going to come from hardware designers or operating system developers who work specifically with scheduling and dispatching. I used to work on performance analysis tools for large IBM systems, so I can give a partial, slightly out-of-date answer:

First, cache memory is associative by address. If a piece of memory is addressed, the "cache line" for that address is loaded into cache. Depending on processor design, this could be 4, 8, 16, or 32 bytes in length. (Maybe more.) This will most likely be based on "alignment" of hardware addresses; in other words, a 32-byte line will be on a boundary that aligns with a divisible-by-32 address. Your memory reference may be in the beginning, middle, or end of that cache line.
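To make the alignment point concrete, here is a small sketch (assuming a 64-byte line, which is typical of current hardware and larger than the sizes listed above) of how an address maps onto the cache line that contains it:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    // Assumed line size for illustration; the real value depends on the CPU.
    constexpr std::uintptr_t kLineSize = 64;

    int array[256] = {};
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(&array[10]);

    // Round down to the line boundary: every address that shares this base
    // lives in, and is fetched as part of, the same cache line.
    std::uintptr_t line_base   = addr & ~(kLineSize - 1);
    std::uintptr_t line_offset = addr &  (kLineSize - 1);

    std::printf("element 10 sits %zu bytes into the line starting at %p\n",
                static_cast<std::size_t>(line_offset),
                reinterpret_cast<void*>(line_base));
}
```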

Once it's in the cache, the address is used as a "lookup" to find the cached data.

Locality of reference will help you if the cache line is large enough that an "adjacent" item you reference has already been cached as part of the same line. Jumping through your array will defeat this.
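To put rough numbers on that, assume 4-byte elements, a 64-byte line, and the stride of 32 elements from the question (none of these values come from the answer itself): each jump advances 128 bytes, so every access lands in a different line and the other 15 elements fetched along with it are never touched by that thread.

```cpp
#include <cstdio>

int main() {
    // Assumed sizes for illustration only; real values depend on the hardware.
    const int line_bytes    = 64;   // a common current cache-line size
    const int element_bytes = 4;    // e.g. a 32-bit int
    const int stride_elems  = 32;   // the stride from the question

    const int stride_bytes   = stride_elems * element_bytes;   // 128 bytes
    const int lines_advanced = stride_bytes / line_bytes;      // 2 lines per jump
    const int elems_per_line = line_bytes / element_bytes;     // 16 elements per line

    std::printf("each jump advances %d line(s); only 1 of the %d elements "
                "in a fetched line is used by this thread\n",
                lines_advanced, elems_per_line);
}
```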

Cache designs vary widely based on vendor, product line, processor price, and a whole lot more. Perfect cache optimization is going to be highly elusive unless (1) you know a whole lot about the machine you're going to run on, and (2) you're really not interested in running on any other machine.

One other factor to consider is that 32-bit addresses are half the size of 64-bit addresses, and this has a significant effect on how much data can be cached. Giving more bits to addresses means fewer bits for data, more-or-less.

Prefetching is more witchcraft than science. Fetching data from memory into cache is expensive, even when it's asynchronous from processor execution (although it can't ever be too far separated from execution). Locality of reference is a good rule, although it's going to be based on hardware architecture in ways that don't necessarily match code execution on the micro scale. LRU (least recently used) is a common method of deciding what to evict from cache, but evicting something to make room for something that never ends up being used is not much of an optimization. So prefetching will be judicious, to say the least.
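Code can also issue its own prefetches rather than leaving everything to the hardware. Here is a minimal sketch using the GCC/Clang __builtin_prefetch builtin; the look-ahead distance of 8 strides is an arbitrary illustration, not a tuned value:

```cpp
#include <cstddef>
#include <vector>

// Strided sum with an explicit software prefetch a few iterations ahead.
long long sum_strided(const std::vector<int>& data, std::size_t stride) {
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); i += stride) {
        std::size_t ahead = i + 8 * stride;   // illustrative look-ahead distance
        if (ahead < data.size())
            __builtin_prefetch(&data[ahead], 0 /*read*/, 1 /*low temporal locality*/);
        sum += data[i];
    }
    return sum;
}

int main() {
    std::vector<int> data(1 << 20, 1);
    return static_cast<int>(sum_strided(data, 32) & 0xff);
}
```

Whether this helps at all depends on whether the hardware prefetcher already recognizes the stride; it is the kind of thing that has to be measured rather than assumed.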

EDIT: virtual memory issues, task switching, etc.

Virtual memory certainly makes things far more interesting, especially in operating systems that support multiple address spaces. Cache is most likely to be based on real addresses, not virtual addresses, so things like page swaps can have interesting side-effects on caching. Typically, a page that is due to be swapped out or released will first be invalidated and moved to a "flush list" (where it can be written to the swap file) or a "free list". Depending on implementation, these pages can still be reclaimed by an application, but they're no longer addressable - meaning a page fault would occur in the process of reclaiming them. So once a page has been moved out of an app's working set, it's very likely that any cache lines associated with it will be invalidated. If the page isn't being heavily used, it's not likely to have much in cache either, but in a heavy swapping situation, cache performance can take a hit along with swapping performance.

Also, some cache designs have "shared" cache, and most or all have processor- and core-specific cache. Where a cache is dedicated to a specific processor or core, and that core changes task, the entire cache is likely to be flushed to avoid corruption by a new process. This would not include thread switching, since threads run in the same process and the same address space. The real issue here is that high activity in other applications on a system can impact your cache performance. Shared cache alleviates this problem to some extent, but has to be more carefully managed to avoid corruption.
