I'm going to have to code a very basic checksum function, something like:
char sum(const char * data, const int len)
{
char sum(0);
for (const char * end=data+len ; data<end ; ++data)
sum += *data;
return sum;
}
That's trivial. Now, how should I optimize this? First, I should probably use some std::for_each with a lambda or something like that:
char sum2(const char * data, const int len)
{
char sum(0);
std::for_each(data, data+len, [&sum](char b){sum+=b;});
return sum;
}
Next, I could use multiple threads/cores to sum up chunks, then add the results. I won't write it down: I'm afraid the cost of creating threads (or getting them from a pool), cutting up the array, and dispatching everything would outweigh any gain, considering that I would mostly calculate checksums for small arrays, mostly 10-100 bytes, rarely up to 1000.
But what I really want is something lower level, some SIMD stuff that would sum up bytes in 128-bit registers, or maybe add bytes lane by lane between two registers without propagating carries from one byte to the next, or both.
Is there any such thing out there?
Note: This IS actual premature optimization, but it's fun, so what the hell?
Edit: I still need a way to sum up all the bytes in an SSE register, something better than
char ptr[16];
_mm_storeu_si128((__m128i*)ptr, sum);
checksum += ptr[0] + ptr[1] + ptr[2] + ptr[3] + ptr[4] + ptr[5] + ptr[6] + ptr[7]
+ ptr[8] + ptr[9] + ptr[10] + ptr[11] + ptr[12] + ptr[13] + ptr[14] + ptr[15];
Yes, there are such instructions in the MMX instruction set, called "Packed ADD":
_mm_add_pi8 in Visual C++
__builtin_ia32_paddb in gcc
And in the SSE2 instruction set:
_mm_add_epi8 in Visual C++
__builtin_ia32_paddb128 in gcc
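To make the packed add concrete, here is a minimal sketch of a checksum loop built on _mm_add_epi8. The function name sum_sse2 and the unsigned char return type are my own assumptions (the question uses plain char), and the final fold simply spills the register to memory, as in the question's edit. Treat it as an illustration rather than a tuned implementation.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Hypothetical helper (name not from the thread): sums `len` bytes modulo 256.
inline unsigned char sum_sse2(const char* data, std::size_t len)
{
    __m128i acc = _mm_setzero_si128();  // 16 independent byte-wide partial sums
    std::size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        // Unaligned load, so `data` needs no particular alignment.
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        acc = _mm_add_epi8(acc, chunk);  // lane-wise add, each lane wraps modulo 256
    }

    // Spill the 16 lanes to memory and fold them with scalar code.
    unsigned char lanes[16];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    unsigned char total = 0;
    for (unsigned char lane : lanes)
        total = static_cast<unsigned char>(total + lane);

    // Handle the remaining 0..15 bytes one at a time.
    for (; i < len; ++i)
        total = static_cast<unsigned char>(total + static_cast<unsigned char>(data[i]));
    return total;
}
```

Because every lane wraps modulo 256 and the final sum is also taken modulo 256, the result should match the scalar loop from the question for any length.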
EDIT: A faster way to add the partial sums:
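__m128i sums; // assumed to already hold the 16 byte-wide partial sums
sums = _mm_add_epi8(sums, _mm_srli_si128(sums, 1)); // each byte += its neighbour
sums = _mm_add_epi8(sums, _mm_srli_si128(sums, 2)); // byte 0 now covers bytes 0..3
sums = _mm_add_epi8(sums, _mm_srli_si128(sums, 4)); // byte 0 now covers bytes 0..7
sums = _mm_add_epi8(sums, _mm_srli_si128(sums, 8)); // byte 0 now covers bytes 0..15
checksum += _mm_cvtsi128_si32(sums) & 0xFF; // only the low byte holds the total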
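The shift-and-add reduction above can be wrapped in a small function, sketched here. The name hsum_epi8 is hypothetical, and returning unsigned char is my assumption; the key detail is that _mm_cvtsi128_si32 returns the low 32 bits, so only the lowest byte carries the total.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: folds the 16 byte lanes of `v` into one sum modulo 256.
inline unsigned char hsum_epi8(__m128i v)
{
    // _mm_srli_si128 shifts the whole register right by N bytes, filling with zeros.
    v = _mm_add_epi8(v, _mm_srli_si128(v, 1));  // lane i = s[i] + s[i+1]
    v = _mm_add_epi8(v, _mm_srli_si128(v, 2));  // lane i = s[i] + ... + s[i+3]
    v = _mm_add_epi8(v, _mm_srli_si128(v, 4));  // lane i = s[i] + ... + s[i+7]
    v = _mm_add_epi8(v, _mm_srli_si128(v, 8));  // lane 0 = s[0] + ... + s[15]
    // Keep only lane 0; the other bytes of the low dword are partial sums.
    return static_cast<unsigned char>(_mm_cvtsi128_si32(v) & 0xFF);
}
```

If checksum is itself a char, the masking is redundant because the conversion truncates anyway, but it keeps the result correct when the accumulator is a wider integer.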
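Look at _mm_add_ps for the general idea: a simultaneous add over a 128-bit contiguous block. Note that _mm_add_ps itself adds four packed single-precision floats, so for byte sums the integer counterpart, _mm_add_epi8, is the one to use. You'll need to zero-pad your array or process the last few bytes non-SIMD style.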
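For the zero-padding option, one possible sketch copies the leftover bytes into a zeroed 16-byte buffer and adds that as a final block, so the input array itself never has to be padded. The names load_tail_zero_padded and sum_padded are hypothetical, and the unsigned char return type is my assumption; zero bytes do not change a byte sum, which is what makes the padding safe here.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstring>

// Hypothetical helper: loads `n` (< 16) bytes into a register, zero-filling the rest.
inline __m128i load_tail_zero_padded(const char* p, std::size_t n)
{
    alignas(16) char buf[16] = {0};
    std::memcpy(buf, p, n);
    return _mm_load_si128(reinterpret_cast<const __m128i*>(buf));
}

// Hypothetical helper: byte checksum modulo 256 with a zero-padded last block.
inline unsigned char sum_padded(const char* data, std::size_t len)
{
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 16 <= len; i += 16)
        acc = _mm_add_epi8(acc, _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i)));
    if (i < len)  // 1..15 bytes left over
        acc = _mm_add_epi8(acc, load_tail_zero_padded(data + i, len - i));

    // Fold the 16 lanes into lane 0 with byte shifts.
    acc = _mm_add_epi8(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi8(acc, _mm_srli_si128(acc, 4));
    acc = _mm_add_epi8(acc, _mm_srli_si128(acc, 2));
    acc = _mm_add_epi8(acc, _mm_srli_si128(acc, 1));
    return static_cast<unsigned char>(_mm_cvtsi128_si32(acc) & 0xFF);
}
```

For the 10-100 byte inputs mentioned in the question, whether the extra memcpy beats a plain scalar tail loop is something to measure rather than assume.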