When there is a page fault or a cache miss we can use either the Least Recently Used (LRU), First in Fist Out (FIFO) or Random replacement algorithms. I was wondering, which one provides the best performance aka the least possible future cache miss'/page faults?
开发者_如何学编程Architecture: Coldfire processor
The expression "There are no stupid questions" fits this so well. This was such a good question I had to create an account and post on it and share my views as somebody who has modelled caches on a couple of CPUs.
You specify the architecture a 68000 which is a CPU rather than a GPU or USB controller, or other piece of hardware which may access a cache however...
Therefore the code you run on the 68000 will make a huge difference to the part of the question "least possible future cache miss'/page faults".
In this you differentiate between cache miss's and page faults, I'm not sure exactly which coldfire architecure you are refering to but I'm assuming this doesn't have a hardware TLB replacement, it uses a software mechansim (so the cache will be shared with the data of applications).
In the replacement policy the most important factor is the number of associations (or ways).
A direct map cache (1 way), directly correlates (all most always) with the low order bits of the address (the number of bits specify the size of the cache) so a 32k cache would be the lower 15bits. In this case the replacement algorthims LRU, FIFO or Random would be useless as there is only one possible choice.
However Writeback or Writethrough selection of the cache would have more effect. For writes to memory only Writethrough means the cache line isn't allocated as apposed to a writeback cache where the line currently in the cache which shares the same lower 15 bits is ejected out of the cache and read back in then modified, for use IF the code running on the CPU uses this data).
For operations that write and don't perform multiple operations on the data then writethrough is usually much better, also on modern processors (and I don' know if this architecture supports it), but Writethrough or Writeback can be selected on a TLB/Page basis. This can have a much greator effect on the cache than the policy, you can programme the system to match the type of data in each page, especially in a direct map cache ;-)
So a direct map cache is pretty easy to understand, it's also easy to understand the basis of cache's worst case, best case and average case.
Imagin a memcpy routine which copies data which is aligned to the cache size. For example a 32k direct mapped cache, with two 32k buffers aligned on a 32k boundry....
0x0000 -> read
0x8000 -> write
0x8004 -> read
0x8004 -> write
...
0x8ffc -> read
0x8ffc -> write
Here you see the read and write's as they copy each word of data, notice the lower 15 bits are the same for each read and write operation.
A direct mapped cache using writeback (remember writeback allocates lines does the following)
0x0000 -> read
cache performs: (miss)
0x0000:0x001f -> READ from main memory (ie. read 32 bytes of the source)
0x8000 -> write
cache performs: (miss)
invalidate 0x0000:0x001f (line 0)
0x8000:0x801f -> READ from main memory (ie. read 32 bytes of the destination)
0x8000 (modify this location in the cache with the read source data)
<loop>
0x0004 -> read
cache performs: (miss)
writeback 0x8000:0x801f -> WRITE to main memory (ie. write 32 bytes to the desitnation)
0x0000:0x001f -> READ from main memory (ie. read 32 bytes of source (the same as we did just before)
0x8004 -> write
cache performs: (miss)
invalidate 0x0000:0x001f (line 0)
0x8000:0x801f -> READ from main memory (ie. read 32 bytes of the destination)
0x8004 (modify this location in the cache with the read source data)
</loop> <--- (side note XML is not a language but we use it as such)
As you see a lot of memory operations go on, this in fact is called "thrashing" and is the best example of a worst case scenairo.
Now imagine that we use a writethrough cache, these are the operations:
<loop>
0x0000 -> read
cache performs: (miss)
0x0000:0x001f -> READ from main memory (ie. read 32 bytes of the source)
0x8000 -> write
cache performs: (not a miss)
(not a lot, the write is "posted" to main memory) (posted is like a letter you just place it in the mailbox and you don't care if it takes a week to get there).
<loop>
0x0004 -> read
cache performs: (hit)
(not a lot, it just pulls the data it fetched last time which it has in it's memory so it goes very quickly to the CPU)
0x8004 -> write
cache performs: (not a miss)
(not a lot, the write is "posted" to main memory)
</loop until next 32 bytes>
</loop until end of buffer>
As you can see a huge difference we now don't thrash, in fact we are best case in this example.
Ok so that's the simple case of write through vs. write back.
Direct map caches however are now not very common most people use, 2,4 or 8 way cache's, that is there are 2, 4 or 8 different possible allocations in a line. So we could store 0x0000, 0x8000, 0x1000, 0x1800 all in the cache at the same time in a 4 way or 8 way cache (obviously an 8 way could also store 0x2000, 0x2800, 0x3000, 0x3800 as well).
This would avoid this thrashing issue.
Just to clarify the line number in a 32k direct mapped cache is the bottom 15 bits of the address. In a 32k 2 way it's the bottom 14 bits. In a 32k 4 way it's the bottom 13 bits. In a 32k 8 way it's the bottom 12 bits.
And in a fully associatve cache it's the lines size (or the bottom 5 bits with a 32 byte line). You can't have less than a line. 32 bytes is typically the most optimum operation in a DDR memory system (there are other reasons as well, someimes 16 or sometimes 64 bytes can be better, and 1 byte would be optimal in the algorthmic case, lets uses 32 as that's very common)
To help understand LRU, FIFO and Random consider the cache is full associative, in a 32k 32 byte line cache this is 1024 lines.
A random replacement policy would on random cause a worst case hit every 1024 replacements (ie. 99.9% hit), in either LRU or FIFO I could always write a programme which would "thrash" ie. always cause a worst case behavouir (ie. 0% hit).
Clearly if you had a fully associative cache you would only choose LRU or FIFO if the programme was known and it was known the exact behavouir of the programme.
For ANYTHING which wasn't 99.9% predictable you would choose RANDOM, it's simply the best at not being worst, and one of the best for being average, but how about best case (where I get the best performance)...
Well it depends basically on the number of ways...
2 ways and I can optimise things like memcpy and other algorthims to do a good job. Random would get it wrong half the time. 4 ways and when I switch between other tasks I might not damage the cache so much that their data is still local. Random would get it wrong quater of the time. 8 ways now statistics can take effect a 7/8% hit rate on a memcpy isn't as good as a 1023/1024% (fully associative or optimised code) but for non optimised code it makes a difference.
So why don't people make fully associative cache's with random replacement policies!
Well it's not because they can't generate good random numbers, in fact a pseudo random number generator is just as good and yes I can write a programme to get 100% miss rate, but that isn't the point, I couldn't write a useful programme that would have 100% miss, which I could with an LRU or FIFO algo.
A 32k 32 byte line Fully associatve cache's require you to compare 1024 values, in hardware this is done through a CAM, but this is an expensive piece of hardware, and also it's just not possible to compare this many values in a "FAST" processing time, I wonder if a quantum computer could....
Anyway to answer your question which one is better:
- Consider if writethrough might be better than writeback.
- Large way's RANDOM is better
- Unknown code RANDOM is better for 4 way or above.
- If it's single function or you want the most speed from something your willing to optimise, and or you don't care about worst case, then LRU is probably what you want.
- If you have very few way's LRU is likely what you want unless you have a very specific scenario then FIFO might be OK.
References:
- http://www.freescale.com/files/training/doc/APF_ENT_T0801_5225_COLDFIRE_MCU.pdf
- http://en.wikipedia.org/wiki/Cache_algorithms
No perfect caching policy exists because it would require knowledge of the future (how a program will access memory).
But, some are measurably better than others in the common memory access pattern cases. This is the case with LRU. LRU has historically given very good performance in overall use.
But, for what you are trying to do, another policy may be better. There is always some memory access pattern that will cause a caching policy to perform poorly.
You may find this thread helpful (and more elaborate!) Why is LRU better than FIFO?
Many of the architectures I have studied use LRU, as it generally provides not only efficiency in implementation, but also is pretty good on average at preventing misses. However, in the latest x86 architectures, I think there are some more complicated things going on. LRU is sort of a basic model.
It really depends on what kind of operations you are performing on your device. Depending on the types of operations, different evacuation policies will work better. For example, FIFO works well with traversing memory sequentially.
Hope this helps, I'm not really an architecture guy.
Between the three, I'd recommend LRU. First, it's a good approximation to optimal scheduling when locality is assumed (this turns out to be a good assumption). Random scheduling cannot benefit from locality. Second, it doesn't suffer from Belady's anomaly (like FIFO); that is, bigger caches mean better performance, which isn't necessarily true with FIFO.
Only if your specific problem domain strongly suggests using something else, LRU is going to be hard to beat in the general case.
Of the three, LRU is generally the best while FIFO is the worst and random comes in somewhere between. You can construct access patterns where any of the three is superior to either of the others, but it is somewhat tricky. Interestingly enough, this order is also roughly how expensive they are to implement -- LRU is the most expensive and FIFO is the cheapest. Just goes to show, there's no free lunch
If you want the best of both worlds, consider an adaptive approach that alters it strategy based on actual usage patterns. For example, look at the algorith IBM's Adaptive Replacement Cache: http://code.activestate.com/recipes/576532-adaptive-replacement-cache-in-python/
精彩评论