What's the best way (SSE2) to reduce an __m128i (four 32-bit words a, b, c, d) to one word? I want the low byte of each of the four components:
int result = ( _m128.a & 0x000000ff ) << 24
           | ( _m128.b & 0x000000ff ) << 16
           | ( _m128.c & 0x000000ff ) << 8
           | ( _m128.d & 0x000000ff ) << 0;
Is there an intrinsic for that? Thanks!
FYI, the SSSE3 (not SSE3) intrinsic _mm_shuffle_epi8 does the job in a single instruction, with 0x0004080c in the low 32 bits of the shuffle mask in this case.
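A minimal sketch of that shuffle, assuming GCC or Clang (for the target attribute) and a CPU that supports SSSE3. The function name pack_low_bytes_ssse3 and the SSSE3_TARGET macro are made up for illustration:

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Assumption: GCC/Clang. The attribute enables SSSE3 for this one function,
   so the file also builds without -mssse3. Other compilers get an empty macro. */
#if defined(__GNUC__)
#define SSSE3_TARGET __attribute__((target("ssse3")))
#else
#define SSSE3_TARGET
#endif

SSSE3_TARGET
static uint32_t pack_low_bytes_ssse3(__m128i x)
{
    /* Control bytes, lowest first: 0x0c, 0x08, 0x04, 0x00.
       Result byte 0 comes from source byte 12, byte 1 from byte 8,
       byte 2 from byte 4 and byte 3 from byte 0. The remaining control
       bytes are zero; their output is discarded by the 32-bit extract. */
    const __m128i ctrl = _mm_cvtsi32_si128(0x0004080c);
    return (uint32_t)_mm_cvtsi128_si32(_mm_shuffle_epi8(x, ctrl));
}
```

With this mask, the low byte of the lowest 32-bit element lands in bits 24..31 of the result, so x = _mm_setr_epi32(a, b, c, d) gives the layout asked for in the question.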
The SSE2 answer takes more than one instruction:
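#include <emmintrin.h>  /* SSE2 intrinsics */

unsigned benoit(__m128i x)
{
    __m128i zero = _mm_setzero_si128(), mask = _mm_set1_epi32(255);
    /* Keep the low byte of each 32-bit lane, then narrow twice (32 -> 16 -> 8 bits). */
    return _mm_cvtsi128_si32(
        _mm_packus_epi16(
            _mm_packus_epi16(
                _mm_and_si128(x, mask), zero), zero));
}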
The above amounts to 5 machine ops, given the input in %xmm1 and output in %rax:
pxor %xmm0, %xmm0
pand MASK, %xmm1
packuswb %xmm0, %xmm1
packuswb %xmm0, %xmm1
movd %xmm1, %rax
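To make the byte order of the SSE2 version explicit, here is a small sketch that spells out each step next to a plain scalar reference. The names pack_low_bytes_sse2 and pack_low_bytes_scalar are hypothetical, and lane[0] is assumed to be the lowest 32-bit element of the vector:

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Same instruction sequence as benoit() above, one step per line. */
static uint32_t pack_low_bytes_sse2(__m128i x)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i mask = _mm_set1_epi32(255);
    __m128i t = _mm_and_si128(x, mask);  /* each 32-bit lane is now 0x000000XX */
    t = _mm_packus_epi16(t, zero);       /* low 64 bits: four 16-bit words 0x00XX */
    t = _mm_packus_epi16(t, zero);       /* low 32 bits: the four XX bytes */
    return (uint32_t)_mm_cvtsi128_si32(t);
}

/* Scalar reference: the low byte of lane[i] ends up in bits 8*i .. 8*i+7. */
static uint32_t pack_low_bytes_scalar(const uint32_t lane[4])
{
    return  (lane[0] & 0xffu)
         | ((lane[1] & 0xffu) << 8)
         | ((lane[2] & 0xffu) << 16)
         | ((lane[3] & 0xffu) << 24);
}
```

Since the lowest element ends up in the lowest byte, x = _mm_set_epi32(a, b, c, d) (which puts d in the lowest element) produces the layout from the question, while _mm_setr_epi32(a, b, c, d) produces the bytes in the opposite order.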
If you want to see some unusual uses of SSE2, including high-speed bit-matrix transpose, string search and bitonic (GPGPU-style) sort, you might want to check my blog, Coding on the edges.
Anyway, hope that helps.