I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it exploits the fact that the eight 64-bit MMX registers can hold 16 int32s in total (two per register) to compute 16 independent cumulative sums in parallel.
I would now like to convert this piece of code to MMX intrinsics, but I am afraid I will suffer a performance penalty because one cannot explicitly instruct the compiler to use the 8 MMX registers to accumulate 16 independent sums.
Can anybody comment on this, and perhaps propose a way to convert the piece of code below to intrinsics?
== inline assembler (only the part within the loop) ==
paddd mm0, [esi+edx+8*0] ; add first & second pair of int32 elements
paddd mm1, [esi+edx+8*1] ; add third & fourth pair of int32 elements ...
paddd mm2, [esi+edx+8*2]
paddd mm3, [esi+edx+8*3]
paddd mm4, [esi+edx+8*4]
paddd mm5, [esi+edx+8*5]
paddd mm6, [esi+edx+8*6]
paddd mm7, [esi+edx+8*7] ; add 15th & 16th pair of int32 elements
- esi points to the beginning of the data array
- edx provides the offset in the data array for the current loop iteration
- the data array is arranged such that the elements for the 16 independent sums are interleaved.
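For concreteness, here is a plain scalar version of what the loop computes, under one plausible reading of the layout above (an assumption on my part): element j of independent sum k is stored at interleaved[16*j + k], so each 64-bit MMX load picks up one element each for two adjacent sums.

// Scalar reference for the 16 interleaved running sums.
// Assumption: interleaved[16*j + k] holds the j-th element of sum k.
void addScalarReference(const int *interleaved, int count, int sums[16])
{
    for (int k = 0; k < 16; ++k)
        sums[k] = 0;
    for (int j = 0; j < count; ++j)
        for (int k = 0; k < 16; ++k)
            sums[k] += interleaved[16 * j + k];
}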
VS2010 does a decent optimization job on the equivalent code using intrinsics. In most cases, it compiles the intrinsic:
sum = _mm_add_pi32(sum, *(__m64 *) &intArray[i + offset]);
into something like:
movq mm0, mmword ptr [eax+8*offset]
paddd mm1, mm0
This isn't as concise as your paddd mm1, [esi+edx+8*offset], but it arguably comes pretty close. The execution time is likely dominated by the memory fetch anyway.
The catch is that VS seems willing to add MMX registers only to other MMX registers, never directly from memory. The above scheme therefore works only for the first seven sums; the eighth requires that some register be spilled temporarily to memory.
Here's a complete program and its corresponding compiled assembly (release build):
#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>

void addWithInterleavedIntrinsics(int *interleaved, int count)
{
    // sum up the numbers
    __m64 sum0 = _mm_setzero_si64(), sum1 = _mm_setzero_si64(),
          sum2 = _mm_setzero_si64(), sum3 = _mm_setzero_si64(),
          sum4 = _mm_setzero_si64(), sum5 = _mm_setzero_si64(),
          sum6 = _mm_setzero_si64(), sum7 = _mm_setzero_si64();
    for (int i = 0; i < 16 * count; i += 16) {
        sum0 = _mm_add_pi32(sum0, *(__m64 *) &interleaved[i]);
        sum1 = _mm_add_pi32(sum1, *(__m64 *) &interleaved[i + 2]);
        sum2 = _mm_add_pi32(sum2, *(__m64 *) &interleaved[i + 4]);
        sum3 = _mm_add_pi32(sum3, *(__m64 *) &interleaved[i + 6]);
        sum4 = _mm_add_pi32(sum4, *(__m64 *) &interleaved[i + 8]);
        sum5 = _mm_add_pi32(sum5, *(__m64 *) &interleaved[i + 10]);
        sum6 = _mm_add_pi32(sum6, *(__m64 *) &interleaved[i + 12]);
        sum7 = _mm_add_pi32(sum7, *(__m64 *) &interleaved[i + 14]);
    }
    // reset the MMX/floating-point state
    _mm_empty();
    // write out the sums; we have to do something with the sums so that
    // the optimizer doesn't just decide to avoid computing them.
    printf("%.8x %.8x\n", ((int *) &sum0)[0], ((int *) &sum0)[1]);
    printf("%.8x %.8x\n", ((int *) &sum1)[0], ((int *) &sum1)[1]);
    printf("%.8x %.8x\n", ((int *) &sum2)[0], ((int *) &sum2)[1]);
    printf("%.8x %.8x\n", ((int *) &sum3)[0], ((int *) &sum3)[1]);
    printf("%.8x %.8x\n", ((int *) &sum4)[0], ((int *) &sum4)[1]);
    printf("%.8x %.8x\n", ((int *) &sum5)[0], ((int *) &sum5)[1]);
    printf("%.8x %.8x\n", ((int *) &sum6)[0], ((int *) &sum6)[1]);
    printf("%.8x %.8x\n", ((int *) &sum7)[0], ((int *) &sum7)[1]);
}
int main()
{
    int count = 10000;
    int *interleaved = new int[16 * count];
    // create some random numbers to add up
    // (note that on VS2010, RAND_MAX is just 32767)
    for (int i = 0; i < 16 * count; ++i) {
        interleaved[i] = rand();
    }
    addWithInterleavedIntrinsics(interleaved, count);
    delete [] interleaved;
    return 0;
}
Here's the generated assembly code for the inner portion of the sum loop (without its prolog and epilog). Note how six of the sums are kept efficiently in mm1-mm6. Contrast that with mm0, which shuttles in the value to be added to each sum, and with mm7, which is shared between the last two sums and spilled to the stack on every iteration. The 7-sum version of this program doesn't exhibit the mm7 problem.
012D1070 movq mm7,mmword ptr [esp+18h]
012D1075 movq mm0,mmword ptr [eax-10h]
012D1079 paddd mm1,mm0
012D107C movq mm0,mmword ptr [eax-8]
012D1080 paddd mm2,mm0
012D1083 movq mm0,mmword ptr [eax]
012D1086 paddd mm3,mm0
012D1089 movq mm0,mmword ptr [eax+8]
012D108D paddd mm4,mm0
012D1090 movq mm0,mmword ptr [eax+10h]
012D1094 paddd mm5,mm0
012D1097 movq mm0,mmword ptr [eax+18h]
012D109B paddd mm6,mm0
012D109E movq mm0,mmword ptr [eax+20h]
012D10A2 paddd mm7,mm0
012D10A5 movq mmword ptr [esp+18h],mm7
012D10AA movq mm0,mmword ptr [esp+10h]
012D10AF movq mm7,mmword ptr [eax+28h]
012D10B3 add eax,40h
012D10B6 dec ecx
012D10B7 paddd mm0,mm7
012D10BA movq mmword ptr [esp+10h],mm0
012D10BF jne main+70h (12D1070h)
So what can you do?
Profile the 7-sum and 8-sum intrinsic-based programs. Choose the one that executes quicker.
Profile a version that adds just one MMX register at a time. It can still take advantage of the fact that modern processors fetch 64 to 128 bytes into the cache at a time, so it is not obvious that the 8-sum version would be faster than the 1-sum one: the 1-sum version fetches exactly the same amount of memory and performs exactly the same number of MMX additions. You will need to rearrange the input layout accordingly, though; see the sketch below.
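For example, a one-accumulator variant could look like this. It is hypothetical and assumes the input has been rearranged into a blocked layout, where the elements belonging to sums 2k and 2k+1 form one contiguous run of count 64-bit chunks:

void addOnePairOfSumsAtATime(const int *blocked, int count)
{
    for (int k = 0; k < 8; ++k) {
        // block k holds the `count` 64-bit chunks feeding sums 2k and 2k+1
        const __m64 *chunk = (const __m64 *) &blocked[2 * k * count];
        __m64 sum = _mm_setzero_si64();
        for (int i = 0; i < count; ++i) {
            sum = _mm_add_pi32(sum, chunk[i]);
        }
        _mm_empty();
        printf("%.8x %.8x\n", ((int *) &sum)[0], ((int *) &sum)[1]);
    }
}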
If your target hardware allows it, consider using the 128-bit SSE registers, which can add four 32-bit values at a time. Note that packed 32-bit integer adds on XMM registers require SSE2, available in Intel CPUs since the Pentium 4 (2001); SSE itself dates back to the Pentium III in 1999.
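Here's a sketch of what that might look like, assuming SSE2 and the same interleaved layout as above. _mm_loadu_si128 is used for the loads because plain new doesn't guarantee 16-byte alignment; with aligned storage you could use _mm_load_si128 instead.

#include <emmintrin.h> // SSE2 intrinsics

// Four 128-bit accumulators hold all 16 running sums (four int32s each),
// so there is no register pressure and no need for _mm_empty().
void addWithSSE2(const int *interleaved, int count, int sums[16])
{
    __m128i sum0 = _mm_setzero_si128(), sum1 = _mm_setzero_si128(),
            sum2 = _mm_setzero_si128(), sum3 = _mm_setzero_si128();
    for (int i = 0; i < 16 * count; i += 16) {
        sum0 = _mm_add_epi32(sum0, _mm_loadu_si128((const __m128i *) &interleaved[i]));
        sum1 = _mm_add_epi32(sum1, _mm_loadu_si128((const __m128i *) &interleaved[i + 4]));
        sum2 = _mm_add_epi32(sum2, _mm_loadu_si128((const __m128i *) &interleaved[i + 8]));
        sum3 = _mm_add_epi32(sum3, _mm_loadu_si128((const __m128i *) &interleaved[i + 12]));
    }
    _mm_storeu_si128((__m128i *) &sums[0],  sum0);
    _mm_storeu_si128((__m128i *) &sums[4],  sum1);
    _mm_storeu_si128((__m128i *) &sums[8],  sum2);
    _mm_storeu_si128((__m128i *) &sums[12], sum3);
}

With only four accumulators, the register pressure that forced the mm7 spill disappears entirely, and each addition processes twice as much data.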