开发者

C Population Count of unsigned 64-bit integer with a maximum value of 15

开发者 https://www.devze.com 2023-01-01 19:12 出处:网络
I use a population count (hamming weight) function intensively in a windows c application and have to optimize it as much as possible in order to boost performance. More than half the cases where I us

I use a population count (hamming weight) function intensively in a windows c application and have to optimize it as much as possible in order to boost performance. More than half the cases where I use the function I only need to know the value to a maximum of 15. The software will run on a wide range of processors, both old and new. I already make use of the POPCNT instruction when Intel's SSE4.2 or AMD's SSE4a is present, but would like to optimize the software implementation (used as a fall back if no SSE4 is present) as much as possible.

Currently I have the following software implementation of the function for 64bit (platform) mode:

int population_count64(unsigned __int64 w) {
    w -= (w >> 1) & 0x5555555555555555ULL;
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return int((w * 0x0101010101010101ULL) >> 56);
}

So to summarize:

(1) I would like to know if it is possible to optimize this for the case when I only want to know the value to a 开发者_C百科maximum of 15.

(2) Is there a faster software implementation (for both Intel and AMD CPU's) than the function above (for unsigned 64bit integers)?


It is indeed possible to optimise your function for the "maximum 15" case. The following shaves off a few operations:


inline int population_count64_max15(unsigned __int64 w)
{
  w -= (w >> 1) & 0x5555555555555555ULL;
  w  = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);

  return int((w * 0x1111111111111111ULL) >> 60);
}


Inlining the function (using the inline keyword as above) should also increase performance.


If you're on a 32-bit machine, split w into two 32-bit words, calculate the popcount separately for each half, then add up. This will get rid of some unneeded operations that are required to synthesize 64-bit operations from 32-bit ones (shifts, mults...). This also allows for increased parallelism if you interleave the calculations.

If you're compiling 64-bit code, you may try this:

int popcnt64(uint64_t w)
{
   uint64_t w1 = (w & 0x2222222222222222) + ((w+w) & 0x2222222222222222);
   uint64_t w2 = (w >> 1 & 0x2222222222222222) + (w >> 2 & 0x2222222222222222);
   w1 = w1 + (w1 >> 4) & 0x0f0f0f0f0f0f0f0f;
   w2 = w2 + (w2 >> 4) & 0x0f0f0f0f0f0f0f0f;
   return (w1 + w2) * 0x0101010101010101 >> 57;
}

This contains more operations, but gives more opportunities of parallel execution to the CPU. On newer CPUs it should be slightly faster, on others it will be slightly slower.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号