How to insert zeros between bits in a bitmap?_问答_开发者

I have some performance-heavy code that performs bit manipulations. It can be reduced to the following well-defined problem:

Given a 13-bit bitmap, construct a 26-bit b开发者_开发问答itmap that contains the original bits spaced at even positions.

To illustrate:

0000000000000000000abcdefghijklm (input, 32 bits)
0000000a0b0c0d0e0f0g0h0i0j0k0l0m (output, 32 bits)

I currently have it implemented in the following way in C:

if (input & (1 << 12))
    output |= 1 << 24;
if (input & (1 << 11))
    output |= 1 << 22;
if (input & (1 << 10))
    output |= 1 << 20;
...

My compiler (MS Visual Studio) turned this into the following:

test        eax,1000h
jne         0064F5EC
or          edx,1000000h
... (repeated 13 times with minor differences in constants)

I wonder whether I can make it any faster. I would like to have my code written in C, but switching to assembly language is possible.

Can I use some MMX/SSE instructions to process all bits at once?
Maybe I can use multiplication? (multiply by 0x11111111 or some other magical constant)
Would it be better to use condition-set instruction (SETcc) instead of conditional-jump instruction? If yes, how can I make the compiler produce such code for me?
Any other idea how to make it faster?
Any idea how to do the inverse bitmap transformation (I have to implement it too, bit it's less critical)?

There is a clever way to do this which may be helpful here. It actually solves a slightly more general bit-shuffling problem. Your problem has an input of:

+---------------+---------------+---------------+---------------+
|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 a b c d e|f g h i j k l m|
+---------------+---------------+---------------+---------------+

....but let's consider all of the bits:

+---------------+---------------+---------------+---------------+
|A B C D E F G H|I J K L M N O P|Q R S a b c d e|f g h i j k l m|
+---------------+---------------+---------------+---------------+

and attempt to interleave them all like so:

+---------------+---------------+---------------+---------------+
|A Q B R C S D a|E b F c G d H e|I f J g K h L i|M j N k O l P m|
+---------------+---------------+---------------+---------------+

For the first step, consider the middle half of the input:

bit 31        24              16               8               0
 v             v               v               v               v
+---------------+---------------+---------------+---------------+
|               |I J K L M N O P|Q R S a b c d e|               |
+---------------+---------------+---------------+---------------+

Construct the 8-bit value: { I^Q, J^R, K^S, L^a, M^b, N^c, O^d, P^e }.

If we exclusive-OR this 8-bit value with bits [15:8], and also exclusive-OR the same 8-bit value with bits [23:16], we will swap the middle two bytes: for example, bit 23 (originally I) will become I ^ (I^Q) = Q and bit 15 (originally Q) will become Q ^ (I^Q) = I.

To do that: tmp = (input ^ (input >> 8)) & 0x0000ff00;:

+---------------+---------------+---------------+---------------+
|A B C D E F G H|I J K L M N O P|Q R S a b c d e|f g h i j k l m| input
+---------------+---------------+---------------+---------------+
                            exclusive-OR with:
+---------------+---------------+---------------+---------------+
|0 0 0 0 0 0 0 0|A B C D E F G H|I J K L M N O P|Q R S a b c d e| input >> 8
+---------------+---------------+---------------+---------------+

                             -->|want these bits|<--

 mask (bitwise AND) with 0x0000ff00:
+---------------+---------------+---------------+---------------+
|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|1 1 1 1 1 1 1 1|0 0 0 0 0 0 0 0| 0x0000ff00
+---------------+---------------+---------------+---------------+

Now the 8-bit value that we need is in bits [15:8], with all other bits 0. Now we can do the swap with

input ^= (tmp ^ (tmp << 8));

resulting in:

+---------------+---------------+---------------+---------------+
|A B C D E F G H|Q R S a b c d e|I J K L M N O P|f g h i j k l m| input
+---------------+---------------+---------------+---------------+

For the next step, divide and conquer... perform a similar swap of the middle bits of both the left hand half:

+---------------+---------------+---------------+---------------+
|A B C D E F G H|Q R S a b c d e|               |               |
+---------------+---------------+---------------+---------------+
             becomes
+---------------+---------------+---------------+---------------+
|A B C D Q R S a|E F G H b c d e|               |               |
+---------------+---------------+---------------+---------------+

...and the right-hand half:

+---------------+---------------+---------------+---------------+
|               |               |I J K L M N O P|f g h i j k l m|
+---------------+---------------+---------------+---------------+
                                             becomes
+---------------+---------------+---------------+---------------+
|               |               |I J K L f g h i|M N O P j k l m|
+---------------+---------------+---------------+---------------+

We can use exactly the same trick as in the first step, and because we want to perform exactly the same operation on both 16-bit halves of the 32-bit word, we can do them in parallel:

tmp = (input ^ (input >> 4)) & 0x00f000f0;

constructs the two pairs of 4 bits that we will use for the swap, and then

input ^= (tmp ^ (tmp << 4));

actually does the swap.

We can continue applying the same principle until the swap is complete. The bits that participate in the exchange at each point are marked with #:

+---------------+---------------+---------------+---------------+
|A B C D E F G H|I J K L M N O P|Q R S a b c d e|f g h i j k l m|
+---------------+---------------+---------------+---------------+
                 ###############/###############
+---------------+---------------+---------------+---------------+
|A B C D E F G H|Q R S a b c d e|I J K L M N O P|f g h i j k l m|
+---------------+---------------+---------------+---------------+
         #######/#######                 #######/#######
+---------------+---------------+---------------+---------------+
|A B C D Q R S a|E F G H b c d e|I J K L f g h i|M N O P j k l m|
+---------------+---------------+---------------+---------------+
     ###/###         ###/###         ###/###         ###/###
+---------------+---------------+---------------+---------------+
|A B Q R C D S a|E F b c G H d e|I J f g K L h i|M N j k O P l m|
+---------------+---------------+---------------+---------------+
   #/#     #/#     #/#     #/#       #/#   #/#     #/#     #/#
+---------------+---------------+---------------+---------------+
|A Q B R C S D a|E b F c G d G e|I f J g K h L i|M j N k O l P m|
+---------------+---------------+---------------+---------------+

Code:

tmp = (input ^ (input >> 8)) & 0x0000ff00;
input ^= (tmp ^ (tmp << 8));
tmp = (input ^ (input >> 4)) & 0x00f000f0;
input ^= (tmp ^ (tmp << 4));
tmp = (input ^ (input >> 2)) & 0x0c0c0c0c;
input ^= (tmp ^ (tmp << 2));
tmp = (input ^ (input >> 1)) & 0x22222222;
input ^= (tmp ^ (tmp << 1));                    /* = output */

The reverse operation can be performed by running the 4 steps backwards:

tmp = (input ^ (input >> 1)) & 0x22222222;
input ^= (tmp ^ (tmp << 1));                    /* = output */
tmp = (input ^ (input >> 2)) & 0x0c0c0c0c;
input ^= (tmp ^ (tmp << 2));
tmp = (input ^ (input >> 4)) & 0x00f000f0;
input ^= (tmp ^ (tmp << 4));
tmp = (input ^ (input >> 8)) & 0x0000ff00;
input ^= (tmp ^ (tmp << 8));

although you may be able to improve on this for your particular application, if every other bit is known to be zero: see my answer to another question here.

As a final note, don't believe anything anyone says about relative performance of any of the methods suggested here without benchmarking them in your application. (In particular, large lookup tables can appear to be much better in simple microbenchmarks than they actually are in practice in a given real application, due to evicting large quantities of other data from the cache, which can have a negative effect on the outer loop(s).)

Do it with a lookup table. 2^13 sound like a lot of entries but they will easily fit into the CPU cache.

Oh, and if there's garbage in the other 19 bits, you need to mask them out first.

You could do:

; eax = input bits
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,2
shr eax,1
shrd edx,eax,8
and edx,0x01555555
; edx = output

Don't use branching:

output =
   (input & 1)
   | ((input & 2) << 1)
   | ((input & 4) << 2)
   | ((input & 8) << 3)
   | ((input & 16) << 4)
   /* etc. */

Here's a possibly easier to read/understand version of the same thing:

output =
     ((input & (1 <<  0)) <<  0)
   | ((input & (1 <<  1)) <<  1)
   | ((input & (1 <<  2)) <<  2)
   | ((input & (1 <<  3)) <<  3)
   | ((input & (1 <<  4)) <<  4)
   | ((input & (1 <<  5)) <<  5)
   | ((input & (1 <<  6)) <<  6)
   | ((input & (1 <<  7)) <<  7)
   | ((input & (1 <<  8)) <<  8)
   | ((input & (1 <<  9)) <<  9)
   | ((input & (1 << 10)) << 10)
   | ((input & (1 << 11)) << 11)
   | ((input & (1 << 12)) << 12);

I'll give an algorithm that works without conditionals (only addition and bitwise operations), and I believe this will be faster than your current solution.

Here's the C code for 13 bits. Below there's an illustration of how the method works for 3 bits, and the generalization will be clear I hope.

(Note: The code is loop-unrolled. A good compiler will do that for you, so you can just condense it to a loop.)

unsigned mask, output;
unsigned x = input;

mask = ((1<<13)-1) << 13;
x = (x + mask) & ~mask;

mask = ((1<<12)-1) << 12;
x = (x + mask) & ~mask;

...

mask = ((1<<3)-1) << 3;
x = (x + mask) & ~mask;

mask = ((1<<2)-1) << 2;
x = (x + mask) & ~mask;

mask = ((1<<1)-1) << 1;
x = (x + mask) & ~mask;

output = x;

Now, here's the explanation of the method for 3 bits. The initial state is '00abc'. Start by moving 'a' two places to the left by adding 01100 and then ANDing with 10011 (which happens to be the bitwise NOT of the previous number). This is how it works for a=0,1 (first arrow is the addition, second arrow is the AND):

a=0: 00abc = 000bc -> 011bc -> 000bc = a00bc
a=1: 00abc = 001bc -> 100bc -> 100bc = a00bc

Next, move 'b' one place to the left by adding 00010 and then ANDing with 10101:

b=0: a00bc = a000c -> a001c -> a000c = a0b0c
b=1: a00bc = a001c -> a010c -> a010c = a0b0c

That's it.

On Intel x86 processors starting from Haswell, you can use single pdep instruction from BMI2 instruction set to do it:

uint32_t interleave_zero_bits(uint32_t x) {
    return _pdep_u32(x, 0x55555555U);
}

First, for your "26-bit" values the highest bit should always be clear, so it's actually a 25-bit value.

1) MMX (and/or SSE) won't help, as the main problem is that there's no simple series of arithmetic or boolean operations that makes gives the results you want, and everything supports the same arithmetic and boolean operations.

2) I couldn't think of or find a magic constant for multiplication.

3) I can't see a method of using any condition-set instruction (e.g. SETcc) that has any advantages over shift/add instructions.

4) jdv and paul (above) are right. If you need to do this conversion often enough that performance matters, then a lookup table would be the best/fastest option on modern CPUs. The lookup table for "13-bit to 26-bit" would 2**13 dwords, or 32 KiB. On old CPUs (with small L1 caches) the relative difference between CPU speed and RAM speed isn't as bad as it is now.

If you can't spare 32 KiB for the "13-bit to 25-bit" lookup table, you can split the 13-bit value into a pair of values (one 6-bit value and one 7-bit value) and then use the lookup table on each of these values before combining the results, like this:

mov ebx,eax                    ;ebx = 13-bit value
shr eax,6                      ;eax = highest 7 bits of value
and ebx,0x003F                 ;ebx = lowest 6 bits of value
mov eax,[lookup_table + eax*2] ;eax = highest 14-bits of result
mov ebx,[lookup_table + ebx*2] ;eax = lowest 12-bits of result
shl eax,12
or eax,ebx                     ;eax = 25-bit result

In this case, the lookup table has 128 entries (with 2 bytes per entry), so it's only 256 bytes.

5) For the reverse operation, a simple lookup table would cost you 64 MiB (2**25*2) so that isn't a good idea. However, you could split the 25-bit value into a 13-bit value and a 11-bit value (a 12-bit value where the highest bit is always clear), and use an 8192 entry table with one byte per entry (total cost is 8 KiB). There's no reason you couldn't split the 25-bit values into more/smaller pieces though (and use a much smaller table).

You can always use a for-loop:

for (int i = 0; i < 13; i++)
{
    output |= (input & (1 << i)) << i;
}

This is shorter, but I don't think that it's significantly faster.

I think this might be relevant, but I'm not completely certain. I know of MMX instructions for interleaving bytes of 32/64 bit values, but not individual bits.

You haven't specified the platform this is to run on, and I'd like to try a different approach from the ones already posted (I like the lookup table one, which works fine until the number of bits is increased).

Most platforms have separate shift and rotate instructions. Almost always there is an instruction that includes the carry / overflow flags, so you can "shift in" a bit you want. Let's say we have these instructions: * SHIFTLEFT: does a leftshift and fills the lower bit with zero. * ROTATELEFT: does a leftshift, sets the lowest bit from the former value in carry flag, and sets the carry from the bit that got shifted "out" on left.

Pseudocode:

LOAD value into register A;
LOAD 0 into register B;
SHIFT register A (registerwidth-13) times; 
ROTATELEFT A
ROTATELEFT B
SHIFTLEFT  B

... repeat 13 times. Unrolling as you please.

The first shift should get the uppermost bit into place right before the carry. ROTATELEFT A will push the MSB into the carry, ROTATELEFT B will push the bit into the LSB of B, and SHIFTLEFT B will put the 0 in. Do that for all the bits.

Edit/Added:

You can do the opposite (inverse bitmap transformation) with the same instructions, like this:

LOAD value into register A; LOAD 0 into register B;

ROTATELEFT A; ROTATELEFT A; ROTATELEFT B; ... repeat 13 times and then SHIFTLEFT B; for (registerwidth-13) times.

LSB to carry; forget about it, next LSB into carry, put that into the target register, repeat for all bits, then align the result.

Check if your CPU supports byte and word swapping (For endian conversion) - if so - just roll a swap over it - that would be some 6(5) instructions shorter.