mmap-loadable data structure library for C++ (or C)

I have a large data structure (N > 10,000) that usually only needs to be created once (at runtime) and can be reused many times afterwards, but it needs to be loaded very quickly. (It is used for user input processing on iPhoneOS.) mmap-ing a file seems to be the best choice.

Are there any data structure libraries for C++ (or C)? Something along the lines of:

ReadOnlyHashTable<char, int> table ("filename.hash");
// mmap(...) inside the c'tor
...
int freq = table.get('a');
...
// munmap(...); inside the d'tor.

Thank you!


Details:

I've written a similar hash table class myself, but I find it pretty hard to maintain, so I would like to see whether existing solutions are available. The library should:

  • Provide a creation routine that serializes the data structure into a file. This part doesn't need to be fast.
  • Provide a loading routine that mmaps a file into a read-only (or read-write) data structure that is usable within O(1) steps of processing.
  • Use O(N) disk/memory space with a small constant factor. (The device has serious memory constraints.)
  • Add only a small time overhead to accessors (i.e. the asymptotic complexity is unchanged).

Assumptions:

  • Bit representation of data (e.g. endianness, encoding of float, etc.) does not matter, since the data is only used locally.
  • So far the possible types of data I need are integers, strings, and structs of them. Pointers do not appear.

P.S. Can Boost.Intrusive help?


You could create a memory-mapped file and then build the STL map structure with a custom allocator. Your custom allocator simply takes the beginning of the memory-mapped file's region and increments its pointer according to each requested size. In the end, all the allocated memory lies within the memory-mapped file and can be reloaded later.

You will have to check whether memory is freed by the STL map. If it is, your custom allocator will lose some memory of the memory-mapped file, but if this is limited you can probably live with it.
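A minimal sketch of that idea, assuming POSIX mmap; the names (MmapAllocator, region_base, table.bin) are my own, not from any library, and error handling is omitted. One caveat the sketch glosses over: std::map nodes store absolute pointers, so a saved file only reloads correctly if it is mapped back at the same base address, and for a true reload the map object itself would also have to live inside the region.

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <new>
#include <map>

// One shared bump-allocated region backed by a memory-mapped file.
static char*  region_base = nullptr;  // start of the mapped file
static size_t region_used = 0;        // next free byte
static size_t region_size = 0;        // total mapped bytes

template <typename T>
struct MmapAllocator {
    using value_type = T;
    MmapAllocator() = default;
    template <typename U> MmapAllocator(const MmapAllocator<U>&) {}

    T* allocate(std::size_t n) {
        // Align the bump pointer for T, then advance it.
        region_used = (region_used + alignof(T) - 1) & ~(alignof(T) - 1);
        if (region_used + n * sizeof(T) > region_size) throw std::bad_alloc();
        T* p = reinterpret_cast<T*>(region_base + region_used);
        region_used += n * sizeof(T);
        return p;
    }
    // Freed nodes are simply leaked inside the region, as noted above.
    void deallocate(T*, std::size_t) {}
};

template <typename T, typename U>
bool operator==(const MmapAllocator<T>&, const MmapAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const MmapAllocator<T>&, const MmapAllocator<U>&) { return false; }

int main() {
    region_size = 1 << 20;  // size the region up front
    int fd = open("table.bin", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, region_size);
    region_base = static_cast<char*>(mmap(nullptr, region_size,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    using Table = std::map<char, int, std::less<char>,
                           MmapAllocator<std::pair<const char, int>>>;
    Table table;          // header on the stack; all nodes land in the file
    table['a'] = 42;

    msync(region_base, region_size, MS_SYNC);  // flush the nodes to disk
    munmap(region_base, region_size);
    close(fd);
    return 0;
}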


Sounds like maybe you could use one of the "perfect hash" utilities out there. These spend some time optimising the hash function for the particular data, so that there are no hash collisions and (for minimal perfect hash functions) no, or at least few, empty gaps in the hash table. Obviously, this is intended to be generated rarely but used frequently.

CMPH claims to cope with large numbers of keys. However, I have never used it.

There's a good chance it only generates the hash function, leaving you to use that to build the data structure. That shouldn't be especially hard, but it possibly still leaves you where you are now: maintaining at least some of the code yourself.
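For illustration, here is roughly what the CMPH workflow looks like, adapted from the example in CMPH's own documentation (treat the exact calls as an assumption and check them against the current headers):

#include <cmph.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char* keys[] = { (char*)"apple", (char*)"banana", (char*)"cherry" };
    unsigned int nkeys = 3;

    // Build a minimal perfect hash function over the fixed key set.
    cmph_io_adapter_t* source = cmph_io_vector_adapter(keys, nkeys);
    cmph_config_t* config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ);
    cmph_t* hash = cmph_new(config);
    cmph_config_destroy(config);

    // Serialize the hash function; keys and values are stored separately.
    FILE* fp = fopen("keys.mph", "wb");
    cmph_dump(hash, fp);
    fclose(fp);
    cmph_destroy(hash);

    // Later (or in another process): reload and query. cmph_search maps
    // each key to a unique index in [0, nkeys), i.e. a slot in some
    // value array that you serialize yourself.
    fp = fopen("keys.mph", "rb");
    cmph_t* loaded = cmph_load(fp);
    fclose(fp);
    unsigned int slot = cmph_search(loaded, "banana", (cmph_uint32)strlen("banana"));
    printf("banana -> slot %u\n", slot);

    cmph_destroy(loaded);
    cmph_io_vector_adapter_destroy(source);
    return 0;
}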


Just thought of another option: Datadraw. Again, I haven't used it, so no guarantees, but it does claim to be a fast persistent database code generator.


WRT Boost.Intrusive, I've just been having a look. It's interesting. And annoying, as it makes one of my own libraries look a bit pointless.

I thought this section looked particularly relevant.

If you can use "smart pointers" for links, presumably the smart pointer type can be implemented using a simple offset-from-base-address integer (and I think that's the point of the example). An array subscript might be equally valid.
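As a sketch of that offset idea (entirely my own illustration; Boost.Interprocess ships a production-quality version of the same trick as offset_ptr):

#include <cstddef>

// A self-relative "smart pointer": instead of an absolute address it
// stores the distance from its own location to the pointee, so links
// stay valid no matter where the file happens to be mapped.
template <typename T>
class offset_ptr {
    std::ptrdiff_t off_;  // byte distance from this object; 0 means null

    char* self() const { return (char*)this; }
public:
    offset_ptr() : off_(0) {}
    offset_ptr(T* p) : off_(0) { *this = p; }
    // Copies must recompute the offset relative to their new location.
    offset_ptr(const offset_ptr& o) : off_(0) { *this = o.get(); }
    offset_ptr& operator=(const offset_ptr& o) { return *this = o.get(); }

    offset_ptr& operator=(T* p) {
        off_ = p ? (char*)p - self() : 0;
        return *this;
    }
    T* get() const { return off_ ? (T*)(self() + off_) : nullptr; }
    T& operator*() const { return *get(); }
    T* operator->() const { return get(); }
};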

There's certainly unordered set/multiset support (C++ code for hash tables).


Using cmph would work. It does have the serialization machinery for the hash function itself, but you still need to serialize the keys and the data, besides adding a layer of collision resolution on top of it if your query universe is not known beforehand. If you do know all keys beforehand, then it is the way to go, since you don't need to store the keys and will save a lot of space. If not, for such a small set, I would say it is overkill.

Probably the best option is to use Google's sparse_hash_map. It has very low overhead and also has the serialization hooks that you need.

http://google-sparsehash.googlecode.com/svn/trunk/doc/sparse_hash_map.html#io
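A rough usage sketch based on that documentation page (the member functions and the old <google/...> include path are assumptions to verify against your sparsehash version). Note that these hooks serialize with stdio reads/writes rather than mmap, so loading is O(N) in the table size, and both key and value must be plain-old-data types with no pointers:

#include <cstdio>
#include <google/sparse_hash_map>

int main() {
    google::sparse_hash_map<char, int> table;
    table['a'] = 42;

    // Save: table layout first, then the pointer-free key/value pairs.
    FILE* out = std::fopen("filename.hash", "wb");
    table.write_metadata(out);
    table.write_nopointer_data(out);
    std::fclose(out);

    // Load: must restore in the same order.
    google::sparse_hash_map<char, int> loaded;
    FILE* in = std::fopen("filename.hash", "rb");
    loaded.read_metadata(in);
    loaded.read_nopointer_data(in);
    std::fclose(in);

    std::printf("freq = %d\n", loaded['a']);
    return 0;
}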


GVDB (GVariant Database), the core of dconf, is exactly this.

See git.gnome.org/browse/gvdb, dconf and bv, and developer.gnome.org/glib/2.30/glib-GVariant.html
