I have a bit array implementation where the 0th index is the MSB of the first byte in an array, the 8th index is the MSB of the second byte, etc...
What's a fast way to find the first set bit in this bit array? All the related solutions I have looked up find the least significant set bit, but I need the most significant one. So, given 0x00A1, I want 8 (since it's the 9th bit from the left).
GCC has __builtin_clz, which translates to BSR on x86/x64, CLZ on ARM, etc., and emulates the operation if the hardware does not implement it. Visual C++ 2005 and up has _BitScanReverse.
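For reference, here is a minimal sketch of how the two might be wrapped behind one helper; the guard macros and the helper name are my own, and both intrinsics leave the result undefined when the input is zero:
/* Minimal sketch: one wrapper over both intrinsics.
   The caller must ensure x != 0; behaviour for zero is undefined,
   matching the underlying intrinsics. */
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>
static inline unsigned msb_index32(uint32_t x)   /* x must be non-zero */
{
    unsigned long idx;
    _BitScanReverse(&idx, x);                    /* idx = index of highest set bit */
    return (unsigned)idx;
}
#elif defined(__GNUC__)
static inline unsigned msb_index32(uint32_t x)   /* x must be non-zero */
{
    return 31u - (unsigned)__builtin_clz(x);     /* clz counts the zeros above the MSB */
}
#endif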
tl;dr: For 32 bits, use de Bruijn multiplication.
It's the "fastest" portable algorithm. It is substantially faster and more correct than all the other portable 32-bit MSB algorithms in this thread.
The de Bruijn algorithm also returns a well-defined result (0) when the input is zero, whereas the __builtin_clz and _BitScanReverse intrinsics are undefined for a zero input.
On Windows x86-64, de Bruijn multiplication runs at a speed comparable to the equivalent (flawed) Windows function, with a performance difference of only around 3%.
Here's the code.
u32 msbDeBruijn32( u32 v )
{
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
v |= v >> 1; // first round down to one less than a power of 2
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
return MultiplyDeBruijnBitPosition[( u32 )( v * 0x07C4ACDDU ) >> 27];
}
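If you need the question's left-to-right numbering (bit 0 = MSB), subtract the result from the top bit index of the chunk you scanned. A minimal sketch for the 16-bit 0x00A1 example (the function name is mine):
/* Sketch: convert the LSB-based index above to the question's MSB-first index,
   assuming v was read as a 16-bit chunk. msbDeBruijn32(0x00A1) is 7, so this
   returns 15 - 7 = 8, matching the question. */
u32 msbFromLeft16( u32 v )
{
    return 15u - msbDeBruijn32( v );
}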
All the other answers in this thread either perform much worse than their authors suggest, or don't calculate the result correctly, or both. Let's benchmark them all, and let's verify that they do what they claim to do.
Here's a simple C++11 harness to test all these implementations. It compiles cleanly on Visual Studio but should work on all modern compilers. It lets you run the benchmark in performance mode (bVerifyResults = false) and in checking mode (bVerifyResults = true).
Here are the results in verification mode:
Verification failed for msbNative64: input was 0; output was 818af060; expected 0
Verification failed for msbFfs: input was 22df; output was 0; expected d
Verification failed for msbPerformanceJunkie32: input was 0; output was ffffffff; expected 0
Verification failed for msbNative32: input was 0; output was 9ab07060; expected 0
The "performance junkie" and the Microsoft native implementations do different things when the input is zero. msbPerformanceJunkie32 produces -1, and Microsoft's _BitScanReverse produces a random number, consistent with the underlying hardware instruction. Also the msbPerformanceJunkie32 implementation produces a result that is off by one from all the other answers.
Here are the results in performance mode, running on my i7-4600 laptop, compiled in release mode:
msbLoop64 took 2.56751 seconds
msbNative64 took 0.222197 seconds
msbLoop32 took 1.43456 seconds
msbFfs took 0.525097 seconds
msbPerformanceJunkie32 took 1.07939 seconds
msbDeBruijn32 took 0.224947 seconds
msbNative32 took 0.218275 seconds
The de Bruijn version beats the other implementations soundly because it is branchless, so it runs well on inputs that produce an evenly distributed set of outputs. All the other versions are slower on arbitrary inputs because of branch-misprediction penalties on modern CPUs. The msbFfs function produces incorrect results, so it can be ignored.
Some of the implementations work on 32 bit inputs, and some work on 64 bit inputs. A template will help us compare apples to apples, regardless of the input size.
Here's the code. Download and run the benchmarks yourself if you like.
#include <iostream>
#include <chrono>
#include <random>
#include <cassert>
#include <string>
#include <limits>
#include <vector>
#ifdef _MSC_VER
#define MICROSOFT_COMPILER 1
#include <intrin.h>
#endif // _MSC_VER
const int iterations = 100000000;
bool bVerifyResults = false;
std::random_device rd;
std::default_random_engine re(rd());
typedef unsigned int u32;
typedef unsigned long long u64;
class Timer
{
public:
Timer() : beg_(clock_::now()) {}
void reset() {
beg_ = clock_::now();
}
double elapsed() const {
return std::chrono::duration_cast<second_>
(clock_::now() - beg_).count();
}
private:
typedef std::chrono::high_resolution_clock clock_;
typedef std::chrono::duration<double, std::ratio<1> > second_;
std::chrono::time_point<clock_> beg_;
};
unsigned int msbPerformanceJunkie32(u32 x)
{
static const unsigned int bval[] =
{ 0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4 };
unsigned int r = 0;
if (x & 0xFFFF0000) {
r += 16 / 1;
x >>= 16 / 1;
}
if (x & 0x0000FF00) {
r += 16 / 2;
x >>= 16 / 2;
}
if (x & 0x000000F0) {
r += 16 / 4;
x >>= 16 / 4;
}
return r + bval[x];
}
#define FFS(t) \
{ \
register int n = 0; \
if (!(0xffff & t)) \
n += 16; \
if (!((0xff << n) & t)) \
n += 8; \
if (!((0xf << n) & t)) \
n += 4; \
if (!((0x3 << n) & t)) \
n += 2; \
if (!((0x1 << n) & t)) \
n += 1; \
return n; \
}
unsigned int msbFfs32(u32 x)
{
FFS(x);
}
unsigned int msbLoop32(u32 x)
{
int r = 0;
if (x < 1) return 0;
while (x >>= 1) r++;
return r;
}
unsigned int msbLoop64(u64 x)
{
int r = 0;
if (x < 1) return 0;
while (x >>= 1) r++;
return r;
}
u32 msbDeBruijn32(u32 v)
{
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
v |= v >> 1; // first round down to one less than a power of 2
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
return MultiplyDeBruijnBitPosition[(u32)(v * 0x07C4ACDDU) >> 27];
}
#ifdef MICROSOFT_COMPILER
u32 msbNative32(u32 val)
{
unsigned long result;
_BitScanReverse(&result, val);
return result;
}
u32 msbNative64(u64 val)
{
unsigned long result;
_BitScanReverse64(&result, val);
return result;
}
#endif // MICROSOFT_COMPILER
template <typename InputType>
void test(unsigned int msbFunc(InputType),
const std::string &name,
const std::vector< InputType > &inputs,
std::vector< unsigned int > &results,
bool bIsReference = false
)
{
if (bIsReference)
{
for (int i = 0; i < iterations; i++)
results[i] = msbFunc(inputs[i]);
}
InputType result;
if (bVerifyResults)
{
bool bNotified = false;
for (int i = 0; i < iterations; i++)
{
result = msbFunc(inputs[i]);
if ((result != results[i]) && !bNotified)
{
std::cout << "Verification failed for " << name << ": "
<< "input was " << std::hex << inputs[i]
<< "; output was " << result
<< "; expected " << results[i]
<< std::endl;
bNotified = true;
}
}
}
else
{
Timer t;
for (int i = 0; i < iterations; i++)
{
result = msbFunc(inputs[i]);
}
double elapsed = t.elapsed();
if ( !bIsReference )
std::cout << name << " took " << elapsed << " seconds" << std::endl;
if (result == -1.0f)
std::cout << "this comparison only exists to keep the compiler from " <<
"optimizing out the benchmark; this branch will never be called";
}
}
int main()
{
std::uniform_int_distribution <u64> dist64(0,
std::numeric_limits< u64 >::max());
std::uniform_int_distribution <u32> shift64(0, 63);
std::vector< u64 > inputs64;
for (int i = 0; i < iterations; i++)
{
inputs64.push_back(dist64(re) >> shift64(re));
}
std::vector< u32 > results64;
results64.resize(iterations);
test< u64 >(msbLoop64, "msbLoop64", inputs64, results64, true);
test< u64 >(msbLoop64, "msbLoop64", inputs64, results64, false);
#ifdef MICROSOFT_COMPILER
test< u64 >(msbNative64, "msbNative64", inputs64, results64, false);
#endif // MICROSOFT_COMPILER
std::cout << std::endl;
std::uniform_int_distribution <u32> dist32(0,
std::numeric_limits< u32 >::max());
std::uniform_int_distribution <u32> shift32(0, 31);
std::vector< u32 > inputs32;
for (int i = 0; i < iterations; i++)
inputs32.push_back(dist32(re) >> shift32(re));
std::vector< u32 > results32;
results32.resize(iterations);
test< u32 >(msbLoop32, "msbLoop32", inputs32, results32, true);
test< u32 >(msbLoop32, "msbLoop32", inputs32, results32, false);
test< u32 >(msbFfs32, "msbFfs", inputs32, results32, false);
test< u32 >(msbPerformanceJunkie32, "msbPerformanceJunkie32",
inputs32, results32, false);
test< u32 >(msbDeBruijn32, "msbDeBruijn32", inputs32, results32, false);
#ifdef MICROSOFT_COMPILER
test< u32 >(msbNative32, "msbNative32", inputs32, results32, false);
#endif // MICROSOFT_COMPILER
}
As a performance junkie, I have tried a ton of variations for finding the most significant set bit; the following is the fastest I have come across:
unsigned int msb32(unsigned int x)
{
static const unsigned int bval[] =
{0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4};
unsigned int r = 0;
if (x & 0xFFFF0000) { r += 16/1; x >>= 16/1; }
if (x & 0x0000FF00) { r += 16/2; x >>= 16/2; }
if (x & 0x000000F0) { r += 16/4; x >>= 16/4; }
return r + bval[x];
}
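Note that this returns a 1-based position (and 0 when the input is 0), which is the off-by-one difference called out in the benchmark answer above. A quick sanity check, just as a sketch:
/* Sketch: msb32 counts positions from 1, so 0x00A1 (highest set bit = bit 7)
   yields 8, and a zero input yields 0. */
#include <stdio.h>

int main(void)
{
    printf("%u\n", msb32(0x00A1u));   /* prints 8 */
    printf("%u\n", msb32(0u));        /* prints 0 */
    return 0;
}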
There are multiple ways to do this, and the relative performance of the different implementations is somewhat machine-dependent (I happen to have benchmarked this to some extent for a similar purpose). On some machines there's even a built-in instruction for this (use one if available and portability can be dealt with).
Check out some implementations here (under “integer log base 2”). If you are using GCC, check out the functions __builtin_clz and __builtin_clzl (which do this for non-zero unsigned ints and unsigned longs, respectively). The “clz” stands for “count leading zeros”, which is yet another way to describe the same problem.
Of course, if your bit array does not fit into a suitable machine word, you need to iterate over words in the array to find the first non-zero word and then perform this calculation only on that word.
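As a minimal sketch of that outer loop (the storage layout and names here are my own assumptions, not something from the question): scan for the first non-zero word, then let __builtin_clz finish the job.
/* Sketch: bit 0 of the overall array is assumed to live in the MSB of words[0],
   matching the question's numbering. Returns -1 if no bit is set. */
#include <stddef.h>
#include <stdint.h>

int first_set_bit(const uint32_t *words, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        if (words[i] != 0)
            return (int)(i * 32 + (unsigned)__builtin_clz(words[i]));
    }
    return -1;
}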
Look up the BSR (Bit scan reverse) x86 asm instruction for the fastest way to do this. From Intel's doc:
Searches the source operand (second operand) for the most significant set bit (1 bit).
If a most significant 1 bit is found, its bit index is stored in the destination operand
(first operand).
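A hedged sketch of using the instruction directly from GCC-style inline assembly (the operand constraints are my own; the intrinsic wrappers shown in other answers are usually the more maintainable route):
/* Sketch: BSR via inline asm. The result is undefined when x == 0,
   matching the instruction's own behaviour. */
static inline unsigned bsr32(unsigned x)
{
    unsigned idx;
    __asm__ ("bsr %1, %0" : "=r" (idx) : "rm" (x));
    return idx;                 /* zero-based index of the highest set bit */
}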
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
I have worked with a number of functions to get the most significant bit, but problems generally arise moving between 32- and 64-bit numbers or moving between x86_64 and x86 boxes. The functions __builtin_clz, __builtin_clzl and __builtin_clzll work well for 32/64-bit numbers and across x86_64 and x86 machines. However, three functions are required. I have found a simple MSB routine that relies on right-shifting and handles all cases for positive numbers. At least for the use I make of it, it has succeeded where others have failed:
int
getmsb (unsigned long long x)
{
int r = 0;
if (x < 1) return 0;
while (x >>= 1) r++;
return r;
}
By designating the input as unsigned long long, it can handle all number classes from unsigned char to unsigned long long, and given the standard definition it is compatible across x86_64 and x86 builds. The case for 0 is defined to return 0, but that can be changed as required. A simple test and output are:
#include <stdio.h>

int
main (int argc, char *argv[]) {
unsigned char c0 = 0;
unsigned char c = 216;
unsigned short s = 1021;
unsigned int ui = 32768;
unsigned long ul = 3297381253;
unsigned long long ull = 323543844043;
int i = 32767;
printf (" %16u MSB : %d\n", c0, getmsb (c0));
printf (" %16u MSB : %d\n", c, getmsb (c));
printf (" %16u MSB : %d\n", s, getmsb (s));
printf (" %16u MSB : %d\n", i, getmsb (i));
printf (" %16u MSB : %d\n", ui, getmsb (ui));
printf (" %16lu MSB : %d\n", ul, getmsb (ul));
printf (" %16llu MSB : %d\n", ull, getmsb (ull));
return 0;
}
Output:
0 MSB : 0
216 MSB : 7
1021 MSB : 9
32767 MSB : 14
32768 MSB : 15
3297381253 MSB : 31
323543844043 MSB : 38
NOTE: for speed considerations, using a single function to accomplish the same thing centered around __builtin_clzll is still faster by a factor of about 6.
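That single-function variant might look something like this (a sketch that keeps the same zero-returns-zero convention as getmsb above):
/* Sketch: same interface and zero handling as getmsb, built on the builtin. */
int
getmsb_clz (unsigned long long x)
{
    return x ? 63 - __builtin_clzll (x) : 0;
}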
If you're using x86, you can beat practically any byte-by-byte or word-by-word solution using the SSE2 operations, combined with the find-first-bit instructions, which (in the gcc world) are pronounced "ffs" for the lowest bit and "fls" for the highest bit. Pardon me for having trouble (!@#$%^) formatting "C" code in an answer; check out: http://mischasan.wordpress.com/2011/11/03/sse2-bit-trick-ffsfls-for-xmm-registers/
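For the "fls" direction, here is a rough sketch of the idea on a single 16-byte block; it is not the code from the linked post, just the compare/movemask trick, and it assumes the block is treated as a 128-bit little-endian integer:
/* Sketch: highest set bit in 16 bytes using SSE2. Returns -1 if all zero.
   Assumes the 16 bytes at p form a 128-bit little-endian value. */
#include <emmintrin.h>

int fls128(const unsigned char *p)
{
    __m128i v    = _mm_loadu_si128((const __m128i *)p);
    __m128i zero = _mm_setzero_si128();
    /* bit i of zero_mask is set iff byte i of v is zero */
    int zero_mask    = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
    int nonzero_mask = ~zero_mask & 0xFFFF;
    if (nonzero_mask == 0)
        return -1;
    int hi_byte = 31 - __builtin_clz((unsigned)nonzero_mask);   /* highest non-zero byte */
    int hi_bit  = 31 - __builtin_clz((unsigned)p[hi_byte]);     /* MSB within that byte */
    return hi_byte * 8 + hi_bit;
}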
x86 has a BSR instruction that returns the bit-index directly (rather than the count of leading zeros above it). But unfortunately there's no portable intrinsic that exposes it efficiently for all compilers. GNU C provides __builtin_clz, but unsigned bitidx = 31 - __builtin_clz(x); doesn't optimize back to just BSR with current GCC and ICC. (It does with clang, which proves the expression is equivalent, so the others could too.)
The following defines BSR32() and BSR64() macros or functions that compile efficiently to just a bsr instruction on x86. (They produce a garbage result if the input is zero; there's no way with intrinsics to take advantage of the asm instruction's behaviour of leaving the destination unmodified for input=0.)
Portability to non-x86 would take some additional #ifdef work, e.g. to fall back to 31 - __builtin_clz. Most non-x86 ISAs, if they have a leading-zero bitscan at all, count leading zeros instead of giving you the bit-index. That's why GNU C defines __builtin_clz as the portable builtin. (If there's no hardware support on the target, the builtin compiles to software emulation, usually calling a libgcc helper function.)
#include <stdint.h>
// define BSR32() and BSR64()
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
#ifdef __INTEL_COMPILER
typedef unsigned int bsr_idx_t;
#else
#include <intrin.h> // MSVC
typedef unsigned long bsr_idx_t;
#endif
static inline
unsigned BSR32(unsigned long x){
bsr_idx_t idx;
_BitScanReverse(&idx, x); // ignore bool retval
return idx;
}
static inline
unsigned BSR64(uint64_t x) {
bsr_idx_t idx;
_BitScanReverse64(&idx, x); // ignore bool retval
return idx;
}
#elif defined(__GNUC__)
#ifdef __clang__
static inline unsigned BSR64(uint64_t x) {
return 63-__builtin_clzll(x);
// gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
}
#else
#define BSR64 __builtin_ia32_bsrdi
#endif
#include <x86intrin.h>
#define BSR32(x) _bit_scan_reverse(x)
#endif
bsf probably doesn't need as much help from compilers, because the builtin (__builtin_ctz) matches the asm instruction's behaviour of returning the bit-index of the LSB, i.e. the count of trailing zeros.
A test caller unsigned test32(unsigned x) { return BSR32(x); }
inlines it to 1 instruction on all the major x86 compilers, on the Godbolt compiler explorer. BSR64 inlines the same way, to a 64-bit operand-size version. See also Is there an x86/x86_64 instruction which zeros all bits below the Most Significant Bit? for example use-cases.
;; x64 MSVC 19.16 -O2
unsigned int test32(unsigned int) PROC ; test32, COMDAT
bsr eax, ecx
ret 0
unsigned int test32(unsigned int) ENDP ; test32
# clang -O3 -march=haswell is too "smart?" for its own good:
test32(unsigned int):
lzcnt eax, edi
xor eax, 31
ret
# gcc8.2 -O3 -march=haswell
test32(unsigned int):
bsr eax, edi
ret
# ICC19 -O3 -march=haswell
test32(unsigned int):
bsr eax, edi #15.9
ret #41.12
The point of this is to avoid slow code from the portable (to non-MSVC) version:
#ifdef __GNUC__
unsigned badgcc(uint64_t x) {
return 63 - __builtin_clzll(x);
}
#endif
Without -march=haswell we get just BSR from clang, but:
# gcc8.2 -O3
badgcc(unsigned long):
bsr rdi, rdi
mov eax, 63
xor rdi, 63
sub eax, edi
ret
# ICC19.0.1 -O3
badgcc(unsigned long):
mov rax, -1 #46.17
bsr rdx, rdi #46.17
cmove rdx, rax #46.17
neg rdx #46.17
add rdx, 63 #46.17
neg edx #46.17
add edx, 63 #46.17
mov eax, edx #46.17
ret #46.17
That's just nasty. (Interesting to see that ICC is doing a CMOV to produce -1 if the input is zero. BSR sets ZF according to its input, unlike most instructions which set flags according to the result.)
With -march=haswell (or otherwise enabling use of BMI1 instructions), it's not as bad, but still not as good as just BSR: modulo output dependencies, which compilers mostly work to avoid for lzcnt but strangely not for BSR (where the output dependency is a true dependency, because of the input=0 behaviour). See Why does breaking the "output dependency" of LZCNT matter?
The two best ways I know to do this in pure C:
First, linear-search the byte/word array to find the first byte/word that's nonzero, then do an unrolled binary search within the byte/word you find:
if (b>=0x10)
  if (b>=0x40)
    if (b>=0x80) return 0;
    else return 1;
  else
    if (b>=0x20) return 2;
    else return 3;
else
  if (b>=0x4)
    if (b>=0x8) return 4;
    else return 5;
  else
    if (b>=0x2) return 6;
    else return 7;
Three conditional jumps (that's log2(8), by the way) to get the answer. On modern x86 machines, the last one will be optimized to a conditional move.
Alternatively, use a lookup table to map the byte to the index of the first bit that's set.
A related topic you might want to look up is integer log2 functions. If I recall, ffmpeg has a nice implementation.
Edit: You can actually make the above binary search into a branchless binary search, but I'm not sure if it would be more efficient in this case...
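For the lookup-table alternative mentioned above, here is a sketch that keeps the same left-to-right numbering as the branchy version (building the table at startup rather than writing out 256 entries by hand):
/* Sketch: table[b] = index of the highest set bit of byte b, counted from
   the left as above (0x80..0xFF -> 0, 0x01 -> 7). Entry 0 is unused; the
   caller is expected to skip zero bytes. */
static unsigned char msb_table[256];

static void init_msb_table(void)
{
    for (int b = 1; b < 256; b++) {
        int i = 0;
        while (!(b & (0x80 >> i)))   /* find the first set bit from the left */
            i++;
        msb_table[b] = (unsigned char)i;
    }
}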
Not the fastest, but it works...
//// C program
#include <math.h>
#define POS_OF_HIGHESTBIT(a) /* 0th position is the Least-Signif-Bit */ \
((unsigned) log2(a)) /* thus: do not use if a <= 0 */
#define NUM_OF_HIGHESTBIT(a) ((!(a)) \
? 0 /* no msb set*/ \
: (1 << POS_OF_HIGHESTBIT(a) ))
// could be changed and optimized, if it is known that the following NEVER holds: a <= 0
int main()
{
unsigned a = 5; // 0b101
unsigned b = NUM_OF_HIGHESTBIT(a); // 4 since 4 = 0b100
return 0;
}
Here's a code snippet explaining __builtin_clz()
////// go.c ////////
#include <stdio.h>
unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */
#define NUM_OF_HIGHESTBITclz(a) ((a) \
? (1U << POS_OF_HIGHESTBITclz(a)) \
: 0)
int main()
{
unsigned ui;
for (ui = 0U; ui < 18U; ++ui)
printf("%i \t %i\n", ui, NUM_OF_HIGHESTBITclz(ui));
return 0;
}
I'll add one!
#include <assert.h>

typedef unsigned long long u64;
typedef unsigned int u32;
typedef unsigned char u8;
u8 findMostSignificantBit (u64 u64Val)
{
u8 u8Shift;
u8 u8Bit = 0;
assert (u64Val != 0ULL);
for (u8Shift = 32 ; u8Shift != 0 ; u8Shift >>= 1)
{
u64 u64Temp = u64Val >> u8Shift;
if (u64Temp)
{
u8Bit |= u8Shift; // notice not using +=
u64Val = u64Temp;
}
}
return u8Bit;
}
Of course, this works on a 64-bit number (unsigned long long), not an array. Also, plenty of people have pointed to built-in g++ functions I was not aware of. How interesting.
Anyhow, this finds the most significant bit in 6 iterations and asserts if you pass 0 to the function. It's not the best function to use if you have access to the chipset's instruction.
I am also using |= instead of += because these are always powers of two, and OR is (classically) faster than addition. Since I'm only combining unique powers of 2, there is never any carry.
This is a binary search, which means it always finds the result in 6 iterations.
Again, this is better:
u8 findMostSignificantBit2 (u64 u64Val)
{
assert (u64Val != 0ULL);
return (u8) (63 - __builtin_clzll(u64Val));
}
Here's a simple, brute force algorithm for an arbitrary-sized array of bytes:
int msb( unsigned char x); // prototype for function that returns
// most significant bit set
unsigned char* p;
for (p = arr + num_elements; p != arr;) {
--p;
if (*p != 0) break;
}
// p is now pointing to the last byte that has a bit set, or
// it's pointing to the first byte in the array
if (*p) {
return ((p - arr) * 8) + msb( *p);
}
// what do you want to return if no bits are set?
return -1;
I'll leave it as an exercise for the reader to come up with an appropriate msb() function, as well as the optimization to work on int- or long long-sized chunks of data.
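One possible msb() for a single byte, sketched with the question's left-to-right numbering (0x80 -> 0, 0x01 -> 7); whether that numbering matches the backwards scan above depends on how the array is laid out, so treat the convention as an assumption:
/* Sketch: index of the highest set bit in one byte, counted from the MSB side.
   x is assumed to be non-zero, as guaranteed by the caller above. */
int msb(unsigned char x)
{
    int i = 0;
    while (!(x & (0x80 >> i)))
        i++;
    return i;
}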
Um, your tag indicates 32-bit, but it looks like the values you're using are 16-bit. If you did mean 32-bit, then I think the answer for 0x00a1 ought to be 24, not 8.
Assuming that you are looking for the MSB bit index from the left hand side and you know that you will only be dealing with uint32_t's, here's the obvious, simple-minded algorithm:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
int main()
{
uint32_t test_value = 0x00a1;
int i;
for (i=0; i<32; ++i)
{
if (test_value & (0x80000000 >> i))
{
printf("i = %d\n", i);
exit(0);
}
}
return 0;
}
For Java I use this:
static public final int msb(int n) {
n |= n >>> 1;
n |= n >>> 2;
n |= n >>> 4;
n |= n >>> 8;
n |= n >>> 16;
n >>>= 1;
n += 1;
return n;
}
And:
static public final int msb_index(int n) {
final int[] multiply_de_bruijn_bit_position = {
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
return multiply_de_bruijn_bit_position[(msb(n) * 0x077CB531) >>> 27];
}
#define FFS(t) \
({ \
register int n = 0; \
\
if (!(0xffff & t)) \
n += 16; \
\
if (!((0xff << n) & t)) \
n += 8; \
\
if (!((0xf << n) & t)) \
n += 4; \
\
if (!((0x3 << n) & t)) \
n += 2; \
\
if (!((0x1 << n) & t)) \
n += 1; \
\
n; \
})