Given a range of indexes (identifiers), where I want to map each index to a boolean value, that is:
// interface pseudocode
interface bitmap {
    bool identifier_is_set(unsigned int id_idx) const;
    void set_identifier(unsigned int id_idx, bool val); // mutator, so not const
};
so that I can set and query for each ID (index) if it is set or not, what would you prefer to use to implement this?
I think this is called a bit array or bitmap or bitset, correct me if I'm wrong.
Assume that the maximum identifier is predetermined and not greater than 1e6 (1m), possibly much smaller (10k - 100k). (Which means the size used by sizeof(int)*maximum_id_idx easily fits into memory.)
Possible solutions I see so far:
std::set<size_t>
std::set<size_t> - Add or erase the identifier to this set as necessary. This would allow for arbitrarily large identifiers as long as we have a sparse bitmap.
std::vector<bool> - Sized to the appropriate maximum value, storing true or false for each id_idx.
std::vector<char> - Same thing, but not suffering from the weird std::vector<bool> problems. Uses less memory than vector<int>.
std::vector<int> - Using an int as the boolean flag to have a container using the natural word size of the machine. (No clue if that could make a difference.)
Please answer which container type you would prefer and why, given the maximum id restriction cited above and especially considering performance aspects of querying the bitmap (inserting performance does not matter).
Note: The interface usage of vector vs. set does not matter, as it will be hidden behind its wrapping class anyway.
EDIT: To add to the discussion about std::bitset: std::bitset stores the whole array inside the object itself, so sizeof(std::bitset<1000000>) is roughly 1/8 of a megabyte (one million bits = 125 KB). That makes for a huge single object that you can no longer put on the stack (which may or may not be relevant).
Without knowing the platform you are running this code on and your access patterns, it's hard to say whether vector<bool> will be faster than vector<char> (or vector<int>), or even set<int> or unordered_set<int>.
For example, if you have an extremely sparse array, a linear search of a vector<int> that just contains the indices that are set might be the best answer. (See Mike Abrash's article on optimizing Pixomatic for x86.)
On the other hand, you might have a somewhat sparse array. By somewhat sparse, I mean that the set elements occupy much more memory than fits in the L1 or L2 cache. In that case, more low-level details start to come into play, as well as your actual access patterns.
For example, on some platforms, variable bit shifting is incredibly expensive. So, if you are querying a random set of identifiers, the more frequently you do this, the more a vector<char> or vector<int> becomes a better idea than bitset<...> or vector<bool>. (The latter two use bit shifts to look up bits.) On the other hand, if you are iterating through the sparse bit vector in order and just want the bits that are set, you can optimize that iteration to get rid of the overhead of variable shifts.
At this point, you might also want to know how your sparse identifiers are actually distributed. If they are clumped, you need to know the tradeoff between the optimal memory read size and reading a char at a time. That will dictate whether hitting the cache more often will offset reading in non-native sized data.
If the identifiers are scattered, you may get a significant win by using a hash set (unordered_set<int>) instead of a bit vector. That depends on the load, however.
Have you checked out boost::dynamic_bitset?
http://www.boost.org/doc/libs/1_36_0/libs/dynamic_bitset/dynamic_bitset.html
Assume that the maximum identifier is predetermined and not greater than 1e6 (1m)
Use a std::bitset if you have a hard limit:
std::bitset<1000000> bits;
bits.set(1000);
If by performance you mean the one that is the quickest to look up, then std::bitset is probably fast enough, as its lookup is constant-time. There is an initial overhead to set all the bits to zero. vector<int> would probably be unnoticeably faster to look up and would have a bigger overhead to set the bits, as there are 32 times as many of them on a 32-bit system.
vector<bool> is like bitset in its implementation, and has the advantage of being resizeable if you need that, although in general I would avoid vector<bool> and use Boost's dynamic_bitset if I need to resize.
std::set would be O(log N) in lookup and insertion/deletion although it is the most scalable in memory use, occupying less if the set is not particularly full. std::set is not restricted in range.
Some form of hash set is also an option if your data is more sparse: generally O(1) setting and lookup, although there may be some overhead from collision handling.
The fastest seems to be using a bitmask. You should construct a std::vector<int> and make its size adequate (N divided by sizeof(int)*8, rounded up). This seems to be faster than std::vector<bool> (or similar) for large sets of data, because you actually use much less memory, hence cache utilization is better.
You could always have a std::vector<std::bitset<sizeof(size_t) * CHAR_BIT> > (note the bitset size must be in bits, not bytes); then your lookup is a simple calculation (though the modulo operation is relatively slow), but you have the advantage that this is able to grow... I would hazard that, space-wise, the above is probably the most optimal as well...