I have noticed that sometimes MSVC 2010 doesn't reorder SSE instructions at all. I thought I didn't have to care about instruction order inside my loop since the compiler handles that best, which doesn't seem to be the case.
How should I think about this? What determines the best instruction order? I know some instruction have higher latency than others and that some instructions can run in parallel/async on cpu level. What metrics are relevant in the context? Where can I find them?
I know that I could avoid this question by profiling, however such profilers are expensive (VTune XE) and I would like to know the theory behind it, not just emperical results.
Also should I care about software prefetching (_mm_prefetch
) or can I assume that the cpu will do a better job than me?
Lets say I have the following function. Should I interleave some of the instructions? Should I do the stores before the streams, do all the loads in order and then do calculations, etc...? Do I need to consider USWC vs non-USWC, and temporal vs non-temporal?
auto cur128 = reinterpret_cast<__m128i*>(cur);
auto prev128 = reinterpret_cast<const __m128i*>(prev);
auto dest128 = reinterpret_cast<__m128i*>(dest;
auto end = cur128 + count/16;
while(cur128 != end)
{
auto xmm0 = _mm_add_epi8(_mm_load_si128(cur128+0), _mm_load_si128(prev128+0));
auto xmm1 = _mm_ad开发者_JAVA技巧d_epi8(_mm_load_si128(cur128+1), _mm_load_si128(prev128+1));
auto xmm2 = _mm_add_epi8(_mm_load_si128(cur128+2), _mm_load_si128(prev128+2));
auto xmm3 = _mm_add_epi8(_mm_load_si128(cur128+3), _mm_load_si128(prev128+3));
// dest128 is USWC memory
_mm_stream_si128(dest128+0, xmm0);
_mm_stream_si128(dest128+1, xmm1);
_mm_stream_si128(dest128+2, xmm2);;
_mm_stream_si128(dest128+3, xmm3);
// cur128 is temporal, and will be used next time, which is why I choose store over stream
_mm_store_si128 (cur128+0, xmm0);
_mm_store_si128 (cur128+1, xmm1);
_mm_store_si128 (cur128+2, xmm2);
_mm_store_si128 (cur128+3, xmm3);
cur128 += 4;
dest128 += 4;
prev128 += 4;
}
std::swap(cur, prev);
I agree with everyone that testing and tweaking is the best approach. But there are some tricks to help it.
First of all, MSVC does re-order SSE instruction. Your example is probably too simple or already optimal.
Generally speaking, if you have enough registers to do so, full interleaving tends to give the best results. To take it a step further, unroll your loops enough to use all the registers, but not too much to spill. In your example, the loop is completely bound by memory accesses, so there isn't much room to do any better.
In most cases, it isn't necessary to get the order of the instructions perfect to achieve optimal performance. As long as it's "close enough", either the compiler, or the hardware's out-of-order execution will fix it for you.
The method I use to determine if my code is optimal is critical-path and bottleneck analysis. After I write the loop, I look up what instructions use which resources. Using that information, I can calculate upper-bound on performance, which I then compare with the actual results to see how close/far I am from optimal.
For example, suppose I have a loop with 100 adds and 50 multiplies. On both Intel and AMD (pre-Bulldozer), each core can sustain one SSE/AVX add and one SSE/AVX multiply per cycle. Since my loop has 100 adds, I know I cannot do any better than 100 cycles. Yes, the multiplier will be idle half the time, but the adder is the bottleneck.
Now I go and time my loop and I get 105 cycles per iteration. That means I'm pretty close to optimal and there's not much more to gain. But if I get 250 cycles, then that means something's wrong with the loop and it's worth tinkering with it more.
Critical-path analysis follows the same idea. Look up the latencies for all the instructions and find the cycle time of the critical path of the loop. If your actual performance is very close to it, you're already optimal.
Agner Fog has a great reference for the internal details of the current processors: http://www.agner.org/optimize/microarchitecture.pdf
I've just build this using VS2010 32bit compiler and I get the following:
void F (void *cur, const void *prev, void *dest, int count)
{
00901000 push ebp
00901001 mov ebp,esp
00901003 and esp,0FFFFFFF8h
__m128i *cur128 = reinterpret_cast<__m128i*>(cur);
00901006 mov eax,220h
0090100B jmp F+10h (901010h)
0090100D lea ecx,[ecx]
const __m128i *prev128 = reinterpret_cast<const __m128i*>(prev);
__m128i *dest128 = reinterpret_cast<__m128i*>(dest);
__m128i *end = cur128 + count/16;
while(cur128 != end)
{
auto xmm0 = _mm_add_epi8(_mm_load_si128(cur128+0), _mm_load_si128(prev128+0));
00901010 movdqa xmm0,xmmword ptr [eax-220h]
auto xmm1 = _mm_add_epi8(_mm_load_si128(cur128+1), _mm_load_si128(prev128+1));
00901018 movdqa xmm1,xmmword ptr [eax-210h]
auto xmm2 = _mm_add_epi8(_mm_load_si128(cur128+2), _mm_load_si128(prev128+2));
00901020 movdqa xmm2,xmmword ptr [eax-200h]
auto xmm3 = _mm_add_epi8(_mm_load_si128(cur128+3), _mm_load_si128(prev128+3));
00901028 movdqa xmm3,xmmword ptr [eax-1F0h]
00901030 paddb xmm0,xmmword ptr [eax-120h]
00901038 paddb xmm1,xmmword ptr [eax-110h]
00901040 paddb xmm2,xmmword ptr [eax-100h]
00901048 paddb xmm3,xmmword ptr [eax-0F0h]
// dest128 is USWC memory
_mm_stream_si128(dest128+0, xmm0);
00901050 movntdq xmmword ptr [eax-20h],xmm0
_mm_stream_si128(dest128+1, xmm1);
00901055 movntdq xmmword ptr [eax-10h],xmm1
_mm_stream_si128(dest128+2, xmm2);;
0090105A movntdq xmmword ptr [eax],xmm2
_mm_stream_si128(dest128+3, xmm3);
0090105E movntdq xmmword ptr [eax+10h],xmm3
// cur128 is temporal, and will be used next time, which is why I choose store over stream
_mm_store_si128 (cur128+0, xmm0);
00901063 movdqa xmmword ptr [eax-220h],xmm0
_mm_store_si128 (cur128+1, xmm1);
0090106B movdqa xmmword ptr [eax-210h],xmm1
_mm_store_si128 (cur128+2, xmm2);
00901073 movdqa xmmword ptr [eax-200h],xmm2
_mm_store_si128 (cur128+3, xmm3);
0090107B movdqa xmmword ptr [eax-1F0h],xmm3
cur128 += 4;
00901083 add eax,40h
00901086 lea ecx,[eax-220h]
0090108C cmp ecx,10h
0090108F jne F+10h (901010h)
dest128 += 4;
prev128 += 4;
}
}
which shows that the compiler is reordering the instructions, following the general rule of "don't use a register immediately after writing to the register". It's also turned two loads and an add into a single load and an add from memory. There's no reason you couldn't write the code like this yourself and use all the SIMD registers rather than the four you're currently using. You may want to match the total number of bytes loaded to the size of a cache line. This will give the hardware prefetching a chance to fill the next cache line before you need it.
Also, the prefetch, especially in code the reads memory sequentially, is often not necessary. The MMU can prefetch up to four streams at a time.
you might find chapters 5 to 7 of the Intel Architectures Optimization Reference Manual very interesting, it details how Intel thinks you should go about writing optimal SSE code, and it details a lot of the questions you are asking about.
I also want to recommend the Intel® Architecture Code Analyzer:
https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
It is a static code analyzer which helps figuring out/optimizing critical paths, latencies and throughput. It works for Windows, Linux and MacOs (I only tried it on Linux). The documentation has a medium-simple example on how to use it (i.e., how to avoid latencies by re-ordering instructions).
精彩评论