fastest way to write a bitstream on modern x86 hardware_问答_开发者

What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32bit)

by writing a bitstream I refer to the process of concatenating variable bit-length symbols into a contiguous memory buffer.

currently I've got a standard container with a 32bit intermediate buffer to write to

void write_bits(SomeContainer<unsigned int>& dst,unsigned int& buffer, unsigned int& bits_left_in_buffer,int codeword, short bits_to_write){
    if(bits_to_write < bits_left_in_buffer){
        buffer|= codeword << (32-bits_left_in_buffer);
        bits_left_in_buffer -= bits_to_write;

    }else{
        unsigned int full_bits = bits_to_write - bits_left_in_buffer;
        unsigned int towrite = buffer|(codeword<<(32-bits_left_in_buffer));
        buffer= full_bits ? (codeword >> bits_left_in_buffer) : 0;
        dst.push_back(towrite);
    开发者_如何学Go    bits_left_in_buffer = 32-full_bits;
    }
}

Does anyone know of any nice optimizations, fast instructions or other info that may be of use?

Cheers,

I wrote once a quite fast implementation, but it has several limitations: It works on 32 bit x86 when you write and read the bitstream. I don't check for buffer limits here, I was allocating larger buffer and checked it from time to time from the calling code.

unsigned char* membuff; 
unsigned bit_pos; // current BIT position in the buffer, so it's max size is 512Mb

// input bit buffer: we'll decode the byte address so that it's even, and the DWORD from that address will surely have at least 17 free bits
inline unsigned int get_bits(unsigned int bit_cnt){ // bit_cnt MUST be in range 0..17
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1;  // rounding down by 2.
    unsigned int bits = *(unsigned int*)(membuff + byte_offset);
    bits >>= bit_pos & 0xF;
    bit_pos += bit_cnt;
    return bits & BIT_MASKS[bit_cnt];
};

// output buffer, the whole destination should be memset'ed to 0
inline unsigned int put_bits(unsigned int val, unsigned int bit_cnt){
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1;
    *(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xf);
    bit_pos += bit_cnt;
};

It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:

Using a 32-bit or 64-bit buffer and conditionally reading (writing) from the underlying array it when you need more bits. That's the approach your write_bits method takes.
Unconditionally reading (writing) from the underlying array on every bitstream read (write) and then shifting and masking the resultant values.

The primary advantages of (1) include:

Only reads from the underlying buffer the minimally required number of times in an aligned fashion.
The fast path (no array read) is somewhat faster since it doesn't have to do the read and associated addressing math.
The method is likely to inline better since it doesn't have reads - if you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.

The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.

Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.

In particular, you can analyze the likely branching behavior of your algorithm based on two factors:

How often will the bitsteam need to read from the underlying buffer?
How predictable is the number of calls before a read is needed?

For example if you are reading 1 bit 50% of the time and 2 bits 50% of time, you will do 64 / 1.5 = ~42 reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1) since reads of the underlying are infrequent, even if mis-predicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying¹ array. So the branch will be well-predicated and (1) will stay fast.

Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no simplification is available when the codeword is known at compile-time.

Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an ensure_available(unsigned size) call which ensures that at least size bits (usually limited the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mis-predictions by forcing the buffer fills to a predictable schedule and lets you write simpler unchecked methods.

¹ This depends on exactly how your "read from underlying" routine is written, as there are a few options here: Some always fill to 64-bits, some fill to between 57 and 64-bits (i.e., read an integral number of bytes), and some may fill between 32 or 33 and 64-bits (like your example which reads 32-bit chunks).

You'll probably have to wait until 2013 to get hold of real HW, but the "Haswell" new instructions will bring proper vectorised shifts (ie the ability to shift each vector element by different amounts specified in another vector) to x86/AVX. Not sure of details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.

I don't have the time to write it for you (not too sure your sample is actually complete enough to do so) but if you must, I can think of

using translation tables for the various input/output bit shift offsets; This optimization would make sense for fixed units of n bits (with n sufficiently large (8 bits?) to expect performance gains) In essence, you'd be able to do
```
destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);
```

disclaimer: this is very sloppy pseudo code, I just hope it conveys my idea of a lookup table o prevent bitshift arithmetics

writing it in assembly (I know i386 has XLAT, but then again, a good compiler might already use something like that) ; Also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile

Update

Warning: be sure to use a profiler and test your optimization for correctness and speed. Using a lookup table can result in poorer performance in the light of locality of reference. So, you might need to change the bit-streaming thread on a single core (set thread affinity) to get the benefits, and you might have to adapt the lookup table size to the processor's L2 cache.

Als, have a look at SIMD, SSE4 or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.