I've seen this blog:
http://igoro.com/archive/gallery-of-processor-cache-effects/
The "weirdness" in part 7 is what caught my interest.
My first thought was "Thats just C# being weird".
Its not I wrote the following C++ code.
volatile int* p = (volatile int*)_aligned_malloc( sizeof( int ) * 8, 64 );
memset( (void*)p, 0, sizeof( int ) * 8 );
double dStart = t.GetTime();
for (int i = 0; i < 200000000; i++)
{
//p[0]++;p[1]++;p[2]++;p[3]++; // Option 1
//p[0]++;p[2]++;p[4]++;p[6]++; // Option 2
p[0]++;p[2]++; // Option 3
}
double dTime = t.GetTime() - dStart;
The timing I get on my 2.4 Ghz Core 2 Quad go as follows:
Option 1 = ~8 cycles per loop.
Option 2 = ~4 cycles per loop.
Option 3 = ~6 cycles per loop.
Now This is confusing. My reasoning behind the difference comes down to the cache write latency (3 cycles) on my chip and an assumption that the cache has a 128-bit write port (This is pure guess work on my part).
On that basis in Option 1: It will increment p[0] (1 cycle) then increment p[2] (1 cycle) then it has to wait 1 cycle (for cache) then p[1] (1 cycle) then wait 1 cycle (for cache) then p[3] (1 cycle). Finally 2 cycles for increment and jump (Though its usually implemented as decrement and jump). This gives a total of 8 cycles.
In Option 2: It can increment p[0] and p[4] in one cycle then increment p[2] and p[6] in another cycle. Then 2 cycles for subtract and jump. No waits needed on cache. Total 4 cycles.
In option开发者_如何转开发 3: It can increment p[0] then has to wait 2 cycles then increment p[2] then subtract and jump. The problem is if you set case 3 to increment p[0] and p[4] it STILL takes 6 cycles (which kinda blows my 128-bit read/write port out of the water).
So ... can anyone tell me what the hell is going on here? Why DOES case 3 take longer? Also I'd love to know what I've got wrong in my thinking above, as i obviously have something wrong! Any ideas would be much appreciated! :)
It'd also be interesting to see how GCC or any other compiler copes with it as well!
Edit: Jerry Coffin's idea gave me some thoughts.
I've done some more tests (on a different machine so forgive the change in timings) with and without nops and with different counts of nops
case 2 - 0.46 00401ABD jne (401AB0h)
0 nops - 0.68 00401AB7 jne (401AB0h)
1 nop - 0.61 00401AB8 jne (401AB0h)
2 nops - 0.636 00401AB9 jne (401AB0h)
3 nops - 0.632 00401ABA jne (401AB0h)
4 nops - 0.66 00401ABB jne (401AB0h)
5 nops - 0.52 00401ABC jne (401AB0h)
6 nops - 0.46 00401ABD jne (401AB0h)
7 nops - 0.46 00401ABE jne (401AB0h)
8 nops - 0.46 00401ABF jne (401AB0h)
9 nops - 0.55 00401AC0 jne (401AB0h)
I've included the jump statetements so you can see that the source and destination are in one cache line. You can also see that we start to get a difference when we are 13 bytes or more apart. Until we hit 16 ... then it all goes wrong.
So Jerry isn't right (though his suggestion DOES help a bit), however something IS going on. I'm more and more intrigued to try and figure out what it is now. It does appear to be more some sort of memory alignment oddity rather than some sort of instruction throughput oddity.
Anyone want to explain this for an inquisitive mind? :D
Edit 3: Interjay has a point on the unrolling that blows the previous edit out of the water. With an unrolled loop the performance does not improve. You need to add a nop in to make the gap between jump source and destination the same as for my good nop count above. Performance still sucks. Its interesting that I need 6 nops to improve performance though. I wonder how many nops the processor can issue per cycle? If its 3 then that account for the cache write latency ... But, if thats it, why is the latency occurring?
Curiouser and curiouser ...
I strongly suspect what you're seeing is an oddity of branch prediction rather than anything to do with caching. In particular, on quite a few CPUs, branch prediction doesn't work (as well | at all) when both the source and the target of the branch are in the same cache line. Putting enough code inside the loop (even NOPs) to get the source and target into different cache lines will give a substantial improvement in speed.
Well I had a brief chat with an intel engineer about exactly this problem and got this response:
It's clearly something to do with which instructions end up in which execution units, how quickly the machine spots a store-hit-load problem, and how quickly and elegantly it deals with unrolling the speculative execution to cope with it (or if it takes multiple cycles because of some internal conflict). But that said - you'd need a very detailed pipetrace and simulator to figure this out. Predicting out-of-order instruction processing in these pipelines is way too hard to do on paper, even for the people who designed the machines. For laymen - no hope in hell. Sorry!
Ao I thought I'd add the answer here and close this question once and for all :)
This doesn't seem to be compiler related. At first I thought it could be due to compiler tricks such as loop unrolling, but looking at the generated assembly, MSVC 9.0 just generates a straightforward translation from the C++ code.
Option 1:
$LL3@main:
add DWORD PTR [esi], ecx
add DWORD PTR [esi+4], ecx
add DWORD PTR [esi+8], ecx
add DWORD PTR [esi+12], ecx
sub eax, ecx
jne SHORT $LL3@main
Option 2:
$LL3@main:
add DWORD PTR [esi], ecx
add DWORD PTR [esi+8], ecx
add DWORD PTR [esi+16], ecx
add DWORD PTR [esi+24], ecx
sub eax, ecx
jne SHORT $LL3@main
Option 3:
$LL3@main:
add DWORD PTR [esi], ecx
add DWORD PTR [esi+8], ecx
sub eax, ecx
jne SHORT $LL3@main
The x86 instruction set is in no way representative anymore for what is really being done by the CPU. The instructions are translated to an internal machine language, the term "micro-op" was coined back in the 486 days. Throw in stuff like register renaming, speculative execution, multiple execution units and their interaction with the cache and there's just no way to predict how long something should take anymore. Chip manufacturers have stopped posting cycle time predictions a long time ago. Their designs are a trade secret.
精彩评论