开发者

How do I Quickly Compare Variable-Length Bit Strings in C++?

开发者 https://www.devze.com 2023-02-06 15:38 出处:网络
I\'m performing comparisons of objects based on the binary presence or absence of a set of features. These features can be represented by a bit string, such as this:

I'm performing comparisons of objects based on the binary presence or absence of a set of features. These features can be represented by a bit string, such as this:

开发者_如何学JAVA10011

This bitstring has the first, fourth and fifth feature.

I'm trying to calculate the similarity of a pair of bit strings as the number of features that both share in common. For a given set of bit strings I know that they'll all have the same length, but I don't know at compile time what that length will be.

For example, these two strings have two features in common, so I'd like the similarity function to return 2:

s(10011,10010) = 2

How do I efficiently represent and compare bit-strings in C++?


You can use the std::bitset STL class.

They can be built from bit strings, ANDed, and count the number of 1:

#include <string>
#include <bitset>

int main()
{
  std::bitset<5> option1(std::string("10011")), option2(std::string("10010"));
  std::bitset<5> and_bit = option1 & option2; //bitset will have 1s only on common options
  size_t s = and_bit.count ();                //return the number of 1 in the bitfield
  return 0;
}

EDIT

If number of bits is unknown at compile time, you can use boost::dynamic_bitset<>:

boost::dynamic_bitset<> option(bit_string);

Other parts of example don't change, since boost::dynamic_bitset<> share a common interface with std::bitset.


Faster algorithm:

int similarity(unsigned int a, unsigned int b)
{
   unsigned int r = a & b;
   r = ( r & 0x55555555 ) + ((r >> 1) & 0x55555555 );
   r = ( r & 0x33333333 ) + ((r >> 2) & 0x33333333 );
   r = ( r & 0x0f0f0f0f ) + ((r >> 4) & 0x0f0f0f0f );
   r = ( r & 0x00ff00ff ) + ((r >> 8) & 0x00ff00ff );
   r = ( r & 0x0000ffff ) + ((r >>16) & 0x0000ffff );
   return r;
}

int main() {
        unsigned int a = 19 ;//10011
        unsigned int b = 18 ;//10010
        cout << similarity(a,b) << endl; 
        return 0;
}

Output:

2

Demonstration at ideone : http://www.ideone.com/bE4qb


As you don't know the bit length at compile time, you can use boost::dynamic_bitset instead of std::bitset.

You can then use operator& (or &=) to find the common bits, and count them using boost::dynamic_bitset::count().

The performance depends. For max speed, depending on your compiler, you may have to implement the loop yourself, e.g. using @Nawaz's method, or something from Bit Twiddling Hacks, or by writing the loop using assembler/compiler intrinsics for sse/popcount/etc.

Notice that at least llvm, gcc and icc detect many patterns of this sort and optimize the thing for you, so profile/check the generated code before doing manual work.


Use a std::bitset, if your set of features is less than the number of bits in a long (I think it's a long), you can get an unsigned long representation of the bits, then and the two values, and use bit twidling tricks from here to count.


If you want to continue to use strings to represent your bit pattern, you could do something like the following, using the zip_iterator from boost.

#include <iostream>
#include <string>
#include <algorithm>

#include <boost/tuple/tuple.hpp>
#include <boost/iterator/zip_iterator.hpp>

struct check_is_set :
  public std::unary_function<const boost::tuple<char const&, char const&>&, bool>
{
  bool operator()(const boost::tuple<char const&, char const&>& t) const
  {
    const char& cv1 = boost::get<0>(t);
    const char& cv2 = boost::get<1>(t);
    return cv1 == char('1') && cv1 == cv2;
  }
};

size_t count_same(std::string const& opt1, std::string const& opt2)
{
  std::string::const_iterator beg1 = opt1.begin();
  std::string::const_iterator beg2 = opt2.begin();

  // need the same number of items for end (this really is daft, you get a runtime
  // error if the sizes are different otherwise!! I think it's a bug in the
  // zip_iterator implementation...)
  size_t end_s = std::min(opt1.size(), opt2.size());
  std::string::const_iterator end1 = opt1.begin() + end_s;
  std::string::const_iterator end2 = opt2.begin() + end_s;

  return std::count_if(
  boost::make_zip_iterator(
    boost::make_tuple(beg1, beg2)
    ),
  boost::make_zip_iterator(
    boost::make_tuple(end1, end2)
    ),
    check_is_set()
  );
}

int main(void)
{
  std::string opt1("1010111");
  std::string opt2("001101");

  std::cout << "same: " << count_same(opt1, opt2) << std::endl;

  return 0;
}
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号