SSE2: How to reduce a __m128i to a word

Developer https://www.devze.com 2022-12-11 03:32 Source: the web


What's the best way (SSE2) to reduce an __m128i (four 32-bit words a, b, c, d) to one 32-bit word? I want the low byte of each component:

int result = ( _m128.a & 0x000000ff ) << 24
        | ( _m128.b & 0x000000ff ) << 16
        | ( _m128.c & 0x000000ff ) << 8
        | ( _m128.d & 0x000000ff ) << 0

Is there an intrinsic for that? Thanks!

FYI, the SSSE3 intrinsic _mm_shuffle_epi8 does the job (with 0x0004080c as the low 32 bits of the shuffle mask in this case).
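For reference, here is a minimal sketch of that shuffle-based variant. The function name reduce_ssse3 is made up for illustration, and the sketch assumes that "a" is the lowest 32-bit lane (the first word in memory) and "d" the highest. The target attribute is a GCC/Clang extension that lets the function build without -mssse3; with other compilers, drop it and enable SSSE3 through the compiler's own options.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128i, _mm_set_epi32, _mm_cvtsi128_si32 */
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Hypothetical helper, not part of any library.
   Assumes lane 0 (lowest) holds a and lane 3 (highest) holds d. */
__attribute__((target("ssse3")))
uint32_t reduce_ssse3(__m128i x)
{
    /* Low dword of the control is 0x0004080c:
         result byte 0 <- source byte 12 (low byte of d)
         result byte 1 <- source byte 8  (low byte of c)
         result byte 2 <- source byte 4  (low byte of b)
         result byte 3 <- source byte 0  (low byte of a)
       The other control bytes are 0xff; a set high bit zeroes that output byte. */
    const __m128i ctrl = _mm_set_epi32(-1, -1, -1, 0x0004080c);
    return (uint32_t)_mm_cvtsi128_si32(_mm_shuffle_epi8(x, ctrl));
}
```

Calling reduce_ssse3(_mm_setr_epi32(a, b, c, d)) then yields the value computed by the formula above. Only the low 32 bits of the shuffled register are read, so the upper control bytes could hold anything; setting them to 0xff simply keeps the rest of the register zero.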

The SSE2 answer takes more than one instruction:

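#include <emmintrin.h>  /* SSE2 intrinsics */

unsigned benoit(__m128i x)
{
    __m128i zero = _mm_setzero_si128(), mask = _mm_set1_epi32(255);
    /* Keep the low byte of each 32-bit lane, then narrow 32 -> 16 -> 8 bits
       with two unsigned-saturating packs; the 4 bytes end up in the low dword. */
    return (unsigned)_mm_cvtsi128_si32(
                _mm_packus_epi16(
                        _mm_packus_epi16(
                                _mm_and_si128(x, mask), zero), zero));
}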

The above amounts to 5 machine instructions, given the input in %xmm1 and the output in %eax:

 pxor     %xmm0, %xmm0
 pand     MASK, %xmm1
 packuswb %xmm0, %xmm1
 packuswb %xmm0, %xmm1
 movd     %xmm1, %eax
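A quick way to sanity-check the routine is to compare it with a scalar version of the formula from the question. The sketch below does that; reduce_scalar and reduce_agrees are hypothetical names introduced here, and benoit is repeated only so that the block compiles on its own. It assumes the vector was built with _mm_set_epi32(a, b, c, d), which places a in the highest lane and d in the lowest. The pack sequence puts the lowest lane's byte into the lowest byte of the result, so under that assumption the two versions should agree.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* The SSE2 routine from the answer, repeated so this block is self-contained. */
unsigned benoit(__m128i x)
{
    __m128i zero = _mm_setzero_si128(), mask = _mm_set1_epi32(255);
    return (unsigned)_mm_cvtsi128_si32(
                _mm_packus_epi16(
                        _mm_packus_epi16(
                                _mm_and_si128(x, mask), zero), zero));
}

/* Hypothetical scalar reference: the formula from the question. */
uint32_t reduce_scalar(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (a & 0xffu) << 24 | (b & 0xffu) << 16 | (c & 0xffu) << 8 | (d & 0xffu);
}

/* Hypothetical check: returns 1 when the SSE2 and scalar versions agree.
   Assumes a is the highest lane; _mm_set_epi32 takes its arguments high to low. */
int reduce_agrees(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    __m128i x = _mm_set_epi32((int)a, (int)b, (int)c, (int)d);
    return benoit(x) == reduce_scalar(a, b, c, d);
}
```

If the words are loaded in memory order instead (a in the lowest lane, as with _mm_setr_epi32 or _mm_loadu_si128), the bytes come out in the opposite order, so the lane convention is worth confirming before relying on either version.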

If you want to see some unusual uses of SSE2, including high-speed bit-matrix transpose, string search and bitonic (GPGPU-style) sort, you might want to check my blog, Coding on the edges.

Anyway, hope that helps.