开发者

Tweaking MIT's bitcount algorithm to count words in parallel?

开发者 https://www.devze.com 2023-03-14 22:57 出处:网络
I want to use a version of the well known MIT bitcount algorithm to count neighbors in Conway\'s game of life using SSE2 instructions.

I want to use a version of the well known MIT bitcount algorithm to count neighbors in Conway's game of life using SSE2 instructions.

Here's the MIT bitcount in c, extended to count bitcounts > 63 bits.

int bitCount(unsigned long long n)
{
unsigned long long uCount;

uCount = n – ((n >> 1) & 0×7777777777777777)
           - ((n >> 2) & 0×3333333333333333)
           - ((n >> 3) & 0×1111111111111111);
return ((uCount + (uCount >> 4))
& 0x0F0F0F0F0F0F0F0F) % 255;
}

Here's a version in Pascal

function bitcount(n: uint64): cardinal;
var ucount: uint64;
begin
  ucount:= n - ((n shr 1) and $7777777777777777)
             - ((n shr 2) and $3333333333333333) 
             - ((n shr 3) and $1111111111111111);
  Result:= ((ucount + (count shr 4)) 
           and $0F0F0F0F0F0F0F0F) mod 255;
end;

I'm looking to count the bits in this structure in parallel.

  32-bit word where the pixels are laid out as follows.
  lo-byte         lo-byte neighbor
  0 4 8 C  048C   0 4 8 C 
   +---------------+
  1|5 9 D  159D   1|5 9 D 
   |               |
  2|6 A E  26AE   2|6 A E  
   +---------------+
  3 7 B F  37BF   3 7 B F 
 |-------------|            << slice A
   |---------------|        << slice B
     |---------------|      << slice C

Notice how this structure has 16 bits in the middle that need to be looked up. I want to calculate neighbor counts for each of the 16 bits in the middle using SSE2. In order 开发者_高级运维to do this I put slice A in XMM0 low-dword, slice B in XXM0-dword1 etc.

I copy XMM0 to XMM1 and I mask off bits 012-456-89A for bit 5 in the low word of XMM0, do the same for word1 of XMM0, etc. using different slices and masks to make sure each word in XMM0 and XMM1 holds the neighbors for a different pixel.

Question

How do I tweak the MIT-bitcount to end up with a bitcount per word/pixel in each XMM word?

Remarks

I don't want to use a lookup table, because I already have that approach and I want to test to see if SSE2 will speed up the process by not requiring memory accesses to the lookup table.

An answer using SSE assembly would be optimal, because I'm programming this in Delphi and I'm thus using x86+SSE2 assembly code.


The MIT algorithm would be tough to implement in SSE2, since there is no integer modulus instruction which could be used for the final ... % 255 expression. Of the various popcnt methods out there, the one that lends itself to SSE most easily and efficiently is probably the first one in Chapter 5 of "Hackers Delight" by Henry S. Warren, which I have implemented here in C using SSE intrinsics:

#include <stdio.h>
#include <emmintrin.h>

__m128i _mm_popcnt_epi16(__m128i v)
{
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x5555)), _mm_and_si128(_mm_srli_epi16(v, 1), _mm_set1_epi16(0x5555)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x3333)), _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi16(0x3333)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x0f0f)), _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi16(0x0f0f)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x00ff)), _mm_and_si128(_mm_srli_epi16(v, 8), _mm_set1_epi16(0x00ff)));
    return v;
}

int main(void)
{
    __m128i v0 = _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0);
    __m128i v1;

    v1 = _mm_popcnt_epi16(v0);

    printf("v0 = %vhd\n", v0);
    printf("v1 = %vhd\n", v1);

    return 0;
}

Compile and test as follows:

$ gcc -Wall -msse2 _mm_popcnt_epi16.c -o _mm_popcnt_epi16
$ ./_mm_popcnt_epi16 
v0 = 0 1 2 3 4 5 6 7
v1 = 0 1 1 2 1 2 2 3
$ 

It looks like around 16 arithmetic/logical instructions so it should run at around 16 / 8 = 2 clocks per point.

You can easily convert this to raw assembler if you need to - each intrinsic maps to a single instruction.

0

精彩评论

暂无评论...
验证码 换一张
取 消