开发者

An integer hashing problem

开发者 https://www.devze.com 2023-02-24 11:47 出处:网络
I have a (C++) std::map<int, MyObject*> that contains a couple of millions of objects of type MyObject*. The maximum number of objects that I can have, is around 100 millions. The key is the obj

I have a (C++) std::map<int, MyObject*> that contains a couple of millions of objects of type MyObject*. The maximum number of objects that I can have, is around 100 millions. The key is the object's id. During a certain process, these objects must be somehow marked( with a 0 or 1) as fast as possible. The marking cannot happen on the objects themselves (so I cannot introduce a member variable and use that for the marking process). Since I know the minimum and maximum id (1 to 100_000_000), the first thought that occured to me, was to use a std::bit_set<100000000> and perform my marking there. This solves my problem and also makes it easier when marking processes run in parallel, since these use their own bit_set to mark things, but I was wondering what the solution could be, if I had to use something else instead of a 0-1 marking, e.g what could I use if I had to mark all objects with an integer number ?

Is there some form of a data structure that can deal with this kind of problem in a compact (memory-wise) manner, and also be fast ? The main queries of interest are whether an object is marked, and with what was m开发者_运维百科arked with.

Thank you.

Note: std::map<int, MyObject*> cannot be changed. Whatever data structure I use, must not deal with the map itself.


How about making the value_type of your map a std::pair<bool, MyObject*> instead of MyObject*?


If you're not concerned with memory, then a std::vector<int> (or whatever suits your need in place of an int) should work.

If you don't like that, and you can't modify your map, then why not create a parallel map for the markers?

std::map<id,T> my_object_map;
std::map<id,int> my_marker_map;

If you cannot modify the objects directly, have you considered wrapping the objects before you place them in the map? e.g.:

struct
{
    int marker;
    T *p_x;
} T_wrapper;


std::map<int,T_wrapper> my_map;

If you're going to need to do lookups anyway, then this will be no slower.

EDIT: As @tenfour suggests in his/her answer, a std::pair may be a cleaner solution here, as it saves the struct definition. Personally, I'm not a big fan of std::pairs, because you have to refer to everything as first and second, rather than by meaningful names. But that's just me...


The most important question to ask yourself is "How many of these 100,000,000 objects might be marked (or remain unmarked)?" If the answer is smaller than roughly 100,000,000/(2*sizeof(int)), then just use another std::set or std::tr1::unordered_set (hash_set previous to tr1) to track which ones are so marked (or remained unmarked).

Where does 2*sizeof(int) come from? It's an estimate of the amount of memory overhead to maintain a heap structure in a deque of the list of items that will be marked.

If it is larger, then use std::bitset as you were about to use. It's overhead is effectively 0% for the scale of quantity you need. You'll need about 13 megabytes of contiguous ram to hold the bitset.

If you need to store a marking as well as presence, then use std::tr1::unordered_map using the key of Object* and value of marker_type. And again, if the percentage of marked nodes is higher than the aforementioned comparison, then you'll want to use some sort of bitset to hold the number of bits needed, with suitable adjustments in size, at 12.5 megabytes per bit.

A purpose-built object holding the bitset might be your best choice, given the clarification of the requirements.


Edit: this assumes that you've done proper time-complexity computations for what are acceptable solutions to you, since changing the base std::map structure is no longer permitted.


If you don't mind using hacks, take a look at the memory optimization used in Boost.MultiIndex. It can store one bit in the LSB of a stored pointer.

0

精彩评论

暂无评论...
验证码 换一张
取 消