Fast dot product for a very special case

Given a vector X of size L, where every scalar element of X is from a binary set {0,1}, it is to find a dot product z=dot(X,Y) if vector Y of size L consists of the integer-valued elements. I suggest, there must exist a very fast way to do it.

Let's say we have L=4; X[L]={1, 0, 0, 1}; Y[L]={-4, 2, 1, 0} and we have to find z=X[0]*Y[0] + X[1]*Y[1] + X[2]*Y[2] + X[3]*Y[3] (which in this case will give us -4).

It is obvious that X can be represented using binary digits, e.g. an integer type int32 for L=32. Then, all we have to do is find the dot product of this integer with an array of 32 integers. Do you have any ideas or suggestions for how to do this very fast?


This really would require profiling but an alternative you might want to consider:

int result=0;
int mask=1;
for ( int i = 0; i < L; i++ ){
    if ( X & mask ){
        result+=Y[i];
    }
    mask <<= 1;
}

Typically bit shifting and bitwise operations are faster than multiplication; however, the if statement might be slower than a multiplication, although with branch prediction and large L my guess is it might be faster. You would really have to profile it, though, to determine whether it results in any speedup.

As has been pointed out in the comments below, unrolling the loop either manually or via a compiler flag (such as "-funroll-loops" on GCC) could also speed this up (eliding the loop condition).
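For concreteness, here is a minimal sketch of a manual four-way unroll of the loop above (assuming L is a multiple of 4 and L <= 32 so the shifts stay in range):

int result = 0;
for (int i = 0; i < L; i += 4) {
    /* four independent tests per iteration; the loop condition runs only L/4 times */
    if (X & (1u << (i + 0))) result += Y[i + 0];
    if (X & (1u << (i + 1))) result += Y[i + 1];
    if (X & (1u << (i + 2))) result += Y[i + 2];
    if (X & (1u << (i + 3))) result += Y[i + 3];
}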

Edit
In the comments below, the following good tweak has been proposed:

int result=0;
for ( int i = 0; i < L; i++ ){
    if ( X & 1 ){
        result+=Y[i];
    }
    X >>= 1;
}


Is a suggestion to look into SSE2 helpful? It has dot-product-type operations already, and its 128-bit registers let you do four 32-bit iterations of your naive loop in parallel. SSE2 also has bitwise logic operations, so it may be able to do masked additions rather than multiplications without using any conditional operations... again, you'd have to look at what ops are available.
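As a rough illustration, here is a sketch of the masked-addition idea using SSE2 intrinsics (an assumption-laden sketch, not a tuned implementation: it assumes x86 with SSE2, 32-bit int elements, and L a multiple of 4):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

int dot_sse2(uint32_t X, const int32_t Y[], int L)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < L; i += 4) {
        /* per-lane mask: all-ones where the corresponding bit of X is set */
        __m128i mask = _mm_set_epi32(-(int32_t)((X >> (i + 3)) & 1),
                                     -(int32_t)((X >> (i + 2)) & 1),
                                     -(int32_t)((X >> (i + 1)) & 1),
                                     -(int32_t)((X >> (i + 0)) & 1));
        __m128i y = _mm_loadu_si128((const __m128i *)(Y + i));
        acc = _mm_add_epi32(acc, _mm_and_si128(y, mask));
    }
    /* horizontal sum of the four 32-bit lanes */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}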


Try this:

int result=0;
for ( int i = 0; i < L; i++ ){
    result+=Y[i] & (~(((X>>i)&1)-1));
}

This avoids a conditional statement and uses bitwise operators to mask each scalar value with either all zero bits or all one bits.


Since size explicitly doesn’t matter, I think the following is probably the most efficient general-purpose code:

int result = 0;
for (size_t i = 0; i < 32; ++i)
    result += Y[i] & -X[i];

Bit-encoding X just doesn’t bring anything to the table (even if the loop may potentially terminate earlier as @Mathieu correctly noted). But omitting the if inside the loop does.

Of course, loop unrolling can speed this up drastically, as others have noted.


This solution is identical to, but slightly faster (by my test) than, Michael Aaron's:

long Lev=1;
long Result=0;
for (int i=0;i<L;i++) {
  if (X & Lev)
     Result+=Y[i];
  Lev*=2;
}

I thought there was a numerical way to rapidly establish the next set bit in a word, which should improve performance if your X data is very sparse, but I cannot currently find said numerical formulation.
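For what it's worth, a sketch of that sparse idea, assuming a GCC-style compiler that provides __builtin_ctz (count trailing zeros); x &= x - 1 clears the lowest set bit on each pass, so the loop runs once per set bit rather than once per position:

int result = 0;
for (unsigned x = X; x != 0; x &= x - 1) {
    result += Y[__builtin_ctz(x)];  /* index of the lowest set bit */
}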


I've seen a number of responses with bit trickery (to avoid branching) but none got the loop right imho :/

Optimizing @Goz's answer:

int result = 0;
int i = 0;
/* x is unsigned so the right shift is well-defined and the loop
   terminates even when the top bit of X is set */
for (unsigned x = X; x != 0; ++i, x >>= 1)
{
   result += Y[i] & -(int)(x & 1);
}

Advantages:

  • no need to do i bit-shifting operations each time (X>>i)
  • the loop stops sooner if X contains 0 in higher bits

Now, I do wonder if it runs faster, especially since the premature stop of the for loop might not be as easy for loop unrolling (compared to a compile-time constant).


How about combining a shifting loop with a small lookup table?

    int result=0;

    for ( unsigned x=X; x!=0; x>>=4 ){  /* unsigned: a signed shift of a negative x would never reach 0 */
        switch (x&15) {
            case 0: break;
            case 1: result+=Y[0]; break;
            case 2: result+=Y[1]; break;
            case 3: result+=Y[0]+Y[1]; break;
            case 4: result+=Y[2]; break;
            case 5: result+=Y[0]+Y[2]; break;
            case 6: result+=Y[1]+Y[2]; break;
            case 7: result+=Y[0]+Y[1]+Y[2]; break;
            case 8: result+=Y[3]; break;
            case 9: result+=Y[0]+Y[3]; break;
            case 10: result+=Y[1]+Y[3]; break;
            case 11: result+=Y[0]+Y[1]+Y[3]; break;
            case 12: result+=Y[2]+Y[3]; break;
            case 13: result+=Y[0]+Y[2]+Y[3]; break;
            case 14: result+=Y[1]+Y[2]+Y[3]; break;
            case 15: result+=Y[0]+Y[1]+Y[2]+Y[3]; break;
        }
        Y+=4;
    }

The performance of this will depend on how good the compiler is at optimising the switch statement, but in my experience compilers are pretty good at that nowadays.


There is probably no general answer to this question. You need to profile your code on each of your targets. Performance will depend on compiler optimizations such as loop unrolling and on the SIMD instruction sets available on most modern CPUs (x86, PPC, and ARM each have their own).


For small L, you can use a switch statement instead of a loop. For example, if L = 8, you could have:

#include <assert.h>

int dot8(unsigned int X, const int Y[])
{
    switch (X)
    {
       case 0: return 0;
       case 1: return Y[0];
       case 2: return Y[1];
       case 3: return Y[0]+Y[1];
       // ...
       case 255: return Y[0]+Y[1]+Y[2]+Y[3]+Y[4]+Y[5]+Y[6]+Y[7];
    }
    assert(0 && "X too big");
    return 0;  // not reached when X <= 255
}

And if L = 32, you can write a dot32() function which calls dot8() four times, inlined if possible. (If your compiler refuses to inline dot8(), you could rewrite dot8() as a macro to force inlining.) Added:

int dot32(unsigned int X, const int Y[])
{
    return dot8(X >> 0  & 255, Y + 0)  +
           dot8(X >> 8  & 255, Y + 8)  +
           dot8(X >> 16 & 255, Y + 16) +
           dot8(X >> 24 & 255, Y + 24);
}

This solution, as mikera points out, may have an instruction cache cost; if so, using a dot4() function might help.

Further update: This can be combined with mikera's solution:

static int dot4(unsigned int X, const int Y[])
{
    switch (X)
    {
        case 0: return 0;
        case 1: return Y[0];
        case 2: return Y[1];
        case 3: return Y[0]+Y[1];
        //...
        case 15: return Y[0]+Y[1]+Y[2]+Y[3];
    }
    return 0;  // unreachable when X < 16
}

Looking at the resulting assembler code with the -S -O3 options with gcc 4.3.4 on CYGWIN, I'm slightly surprised to see that this is automatically inlined within dot32(), with eight 16-entry jump-tables.

But adding __attribute__((__noinline__)) seems to produce nicer-looking assembler.

Another variation is to use fall-throughs in the switch statement, but gcc adds jmp instructions, and it doesn't look any faster.

Edit--Completely new answer: After thinking about the 100 cycle penalty mentioned by Ants Aasma, and the other answers, the above is likely not optimal. Instead, you could manually unroll the loop as in:

int dot(unsigned int X, const int Y[])
{
    return (Y[0] & -!!(X & 1u<<0)) +
           (Y[1] & -!!(X & 1u<<1)) +
           (Y[2] & -!!(X & 1u<<2)) +
           (Y[3] & -!!(X & 1u<<3)) +
           //...
           (Y[31] & -!!(X & 1u<<31));  // 1u so that shifting into bit 31 is well-defined
}

This, on my machine, generates 32 x 5 = 160 fast instructions. A smart compiler could conceivably unroll the other suggested answers to give the same result.

But I'm still double-checking.


int result = 0;
for (int i = 0; i < L; i++)
    if (X[i] != 0)
        result += Y[i];


It's quite likely that the time spent loading X and Y from main memory will dominate. If that is the case for your CPU architecture, the algorithm is faster when it loads less data. This means that storing X as a bitmask and expanding it into L1 cache will speed up the algorithm as a whole.
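A minimal sketch of that expand-once idea (assuming L <= 32 and that X arrives bit-packed in an unsigned int; the small array easily stays resident in L1 cache and feeds a loop such as the template one below):

int Xexp[32];                   /* tiny: stays resident in L1 cache */
for (int i = 0; i < L; ++i)
    Xexp[i] = (X >> i) & 1;     /* expand the packed bits once, reuse many times */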

Another relevant question is whether your compiler will generate optimal loads for Y. This is highly CPU- and compiler-dependent. But in general, it helps if the compiler can see precisely which values are needed when. You could manually unroll the loop, but if L is a constant, leave it to the compiler:

// Assumes L is a compile-time constant, e.g. const int L = 32;
template<int I> inline void calcZ(int (&X)[L], int (&Y)[L], int &Z) {
  Z += X[I] * Y[I]; // Essentially free, as it operates in parallel with loads.
  calcZ<I-1>(X,Y,Z);
}
template<> inline void calcZ<0>(int (&X)[L], int (&Y)[L], int &Z) {
  Z += X[0] * Y[0];
}
inline int calcZ(int (&X)[L], int (&Y)[L]) {
    int Z = 0;
    calcZ<L-1>(X,Y,Z);
    return Z;
}

(Konrad Rudolph questioned this in a comment, wondering about memory use. That's not the real bottleneck in modern computer architectures; bandwidth between memory and CPU is. This answer is almost irrelevant if Y is somehow already in cache.)


You can store your bit vector as a sequence of ints where each int packs a couple of coefficients as bits. Then the component-wise multiplication is equivalent to bit-and, and you simply need to count the number of set bits, which can be done like this:

#include <stdint.h>

inline int count(uint32_t v) {
    // parallel bit count, from the bithacks page linked below
    v = v - ((v >> 1) & 0x55555555);
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
    return (((v + (v >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

int dot(uint32_t a, uint32_t b) {
    return count(a & b);
}

For a bit hack to count the set bits see http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

Edit: Sorry I just realized only one of the vectors contains elements of {0,1} and the other one doesn't. This answer only applies to the case where both vectors are limited to coefficients from the set of {0,1}.


Represent X as a linked list of the positions i where X[i] = 1. Computing the required sum then takes O(N) operations, where N is the length of the list.
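A minimal sketch of that idea (the node type here is an assumption for illustration, not part of the original answer):

struct Node { int index; struct Node *next; };

int dot_list(const struct Node *head, const int Y[])
{
    int result = 0;
    for (const struct Node *p = head; p != NULL; p = p->next)
        result += Y[p->index];   /* only positions where X[i] == 1 are stored */
    return result;
}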


Well, you want all bits to pass through if the element is a 1 and none if it is a 0. So you want to somehow turn 1 into -1 (i.e. 0xffffffff) while 0 stays the same. That's just -X... so you do...

Y & (-X)

for each element ... job done?

Edit2: To give a code example you can do something like this and avoid the branch:

int result=0;
for ( int i = 0; i < L; i++ )
{
   result+=Y[i] & -(int)((X >> i) & 1);
}

Of course you'd be best off keeping the 1s and 0s in an array of ints and therefore avoiding the shifts.

Edit: It's also worth noting that if the values in Y are 16 bits in size, then you can do two of these AND operations per machine operation (four if you have 64-bit registers). It does mean negating the X values one by one into a larger integer, though.

i.e. YVals = -4, 3 in 16-bit = 0xFFFC, 0x0003 ... packed into one 32-bit word you get 0xFFFC0003. If you have 1, 0 as the X vals, then you form a bit mask of 0xFFFF0000; AND the two together and you've got two results in one bitwise-AND op.

Another edit:

If you want the code for the second method, something like this should work (though it takes advantage of unspecified behaviour, so it may not work with every compiler... it has worked with every compiler I've come across, though).

#include <stdint.h>

union int1632
{
     int32_t i32;
     int16_t i16[2];
};

int result=0;
int i;  /* declared outside the loop so the tail test below can see it */
for ( i = 0; i < (L & ~0x1); i += 2 )
{
    union int1632 y1632;
    y1632.i16[0] = Y[i + 0];
    y1632.i16[1] = Y[i + 1];

    union int1632 x1632;
    x1632.i16[0] = -(int16_t)((X >> (i + 0)) & 1);
    x1632.i16[1] = -(int16_t)((X >> (i + 1)) & 1);

    union int1632 res1632;
    res1632.i32  = y1632.i32 & x1632.i32;   /* one AND covers two 16-bit lanes */

    result += res1632.i16[0] + res1632.i16[1];
}

if ( i < L )   /* L odd: handle the final leftover element */
    result+=Y[i] & -(int)((X >> i) & 1);
Hopefully the compiler will optimise out the assignments (off the top of my head I'm not sure, but the idea could be reworked so that they definitely are) and give you a small speed-up, in that you now only need to do one bitwise-AND instead of two. The speed-up would be minor, though...
