I've got a special need and the most important concerns are:
- in-memory
- very low memory footprint
- speed
Here's my "problem": I need to store, in-memory, a huge number of very sparse bit arrays. Those bitsets are "append only" and are to be used mostly for intersections. By huge, I mean as high as 200 000 bit arrays.
The range shall be between [0...16 000 000] for each bitset.
I ran some pre-tests with "only" 10 687 bit arrays containing some actual data I've got and got the following results:
1% of the bit arrays ( 106 bit arrays) Hamming weight: at most 1 bit set
5% of the bit arrays ( 534 bit arrays) Hamming weight: at most 4 bits set
10% of the bit arrays ( 1068 bit arrays) Hamming weight: at most 8 bits set
15% of the bit arrays ( 1603 bit arrays) Hamming weight: at most 12 bits set
20% of the bit arrays ( 2137 bit arrays) Hamming weight: at most 17 bits set
25% of the bit arrays ( 2671 bit arrays) Hamming weight: at most 22 bits set
30% of the bit arrays ( 3206 bit arrays) Hamming weight: at most 28 bits set
35% of the bit arrays ( 3740 bit arrays) Hamming weight: at most 35 bits set
40% of the bit arrays ( 4274 bit arrays) Hamming weight: at most 44 bits set
45% of the bit arrays ( 4809 bit arrays) Hamming weight: at most 55 bits set
50% of the bit arrays ( 5343 bit arrays) Hamming weight: at most 67 bits set
55% of the bit arrays ( 5877 bit arrays) Hamming weight: at most 83 bits set
60% of the bit arrays ( 6412 bit arrays) Hamming weight: at most 103 bits set
65% of the bit arrays ( 6946 bit arrays) Hamming weight: at most 128 bits set
70% of the bit arrays ( 7480 bit arrays) Hamming weight: at most 161 bits set
75% of the bit arrays ( 8015 bit arrays) Hamming weight: at most 206 bits set
80% of the bit arrays ( 8549 bit arrays) Hamming weight: at most 275 bits set
85% of the bit arrays ( 9083 bit arrays) Hamming weight: at most 395 bits set
90% of the bit arrays ( 9618 bit arrays) Hamming weight: at most 640 bits set
95% of the bit arrays (10152 bit arrays) Hamming weight: at most 1453 bits set
96% of the bit arrays (10259 bit arrays) Hamming weight: at most 1843 bits set
97% of the bit arrays (10366 bit arrays) Hamming weight: at most 2601 bits set
98% of the bit arrays (10473 bit arrays) Hamming weight: at most 3544 bits set
99% of the bit arrays (10580 bit arrays) Hamming weight: at most 4992 bits set
100% of the bit arrays (10687 bit arrays) Hamming weight: at most 53153 bits set
Given the numbers involved, I obviously need to use compressed bit arrays, and that is not an issue: it should stay easy to deal with, given that the bit arrays are "append only".
The bits that are on are somewhat grouped, but not totally: you'll tend to have several bits on in the same area, but usually not one right after another, which makes RLE not that great for the set bits.
My question is what kind of compression to use?
Now I don't know if I should put my first approach here or in an answer to my own question.
Basically I imagined a "worst case" scenario using a very dumb encoding:
1 bit: if on, the following 5 bits give how many bits are needed to encode the 'skip'; if off (the optimization), the following 5 bits give how many bits are to be taken literally (that is, 'on' or 'off', without skipping) [this representation is only switched to when it is determined to be more efficient than the other one, so when it kicks in, it is always a size optimization]
5 bits: how many bits are needed to encode the skip (the number of bits to skip before the next set bit)
x bits: the skip itself
Here's an example: a bit array has 3 bits set, the first bit being at 3 098 137, the second at 3 098 141 and the third at 3 098 143. The encoded stream looks like this:

1 10110 1011110100011000011001 0 00011 000101 etc.

Reading it field by field:
- 1: the first bit is on, telling we're going to skip bits
- 10110: the next 5 bits (always 5) tell how many bits we need to encode the skip; here 22
- 1011110100011000011001: 22 bits telling to skip to 3 098 137, the first set bit
- 0: one bit off, telling we're now not skipping bits
- 00011: the next 5 bits (always 5) tell how many bits we'll read "as is"; 3 here because we need 3 bits to store "6" (the literal run covers 3 098 138 to 3 098 143)
- 000101: 6 bits read "as is": off, off, off, on, off, on, meaning 3 098 141 and 3 098 143 are on
- etc.
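For what it's worth, here is a minimal sketch of how the skip path of that encoding could be appended (the literal-run optimization is left out, and SkipEncoder/appendSetBit are names I made up, not an existing library):

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch of the skip path of the encoding described above.
    // The literal-run optimization (flag bit off) is intentionally left out.
    struct SkipEncoder {
        std::vector<bool> stream;   // the compressed stream, one entry per bit
        uint32_t next = 0;          // first position not yet covered

        void pushBits(uint32_t value, int width) {
            for (int i = width - 1; i >= 0; --i)
                stream.push_back((value >> i) & 1u);
        }

        // Append a set bit at 'position'; positions must be strictly increasing.
        // The skipped bits are all zero and the bit right after the skip is on.
        void appendSetBit(uint32_t position) {
            uint32_t skip = position - next;        // number of zero bits to skip
            int width = 1;
            while ((skip >> width) != 0) ++width;   // bits needed to store 'skip'
            stream.push_back(true);                 // flag: skip mode
            pushBits(width, 5);                     // 5 bits: width of the skip value
            pushBits(skip, width);                  // 'width' bits: the skip itself
            next = position + 1;
        }
    };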
Given the amazing sparsity of these bit arrays, this seems quite size-efficient.
So using that encoding, I took my sample data and computed a "worst case" scenario (I haven't written the algo yet, I'd rather get a few inputs from here first): basically I considered that the "size optimization" would never kick in and, also, that the 5-bit field would always be set to its maximum useful value (24 bits), which of course cannot happen.
I did it just to have a very crude approximation of what the "worst of the worst" case could be.
I was very pleasantly surprised:
Worst case scenario:
108 913 290 bits needed for the 10 687 very sparse bit arrays
12.9 MB (13 295 KB)
The data being actual data and all the data being similar, I know that, if worst comes to worst, I could store my 200 000 bit arrays in about 240 MB, which is fine.
I'm pretty sure the actual encoding will come to way less than that, but as I haven't actually written it yet, I can only (very easily) compute the "worst case", which is why I only show that one.
Any hints / ideas as to how to make this more size-efficient (remembering these are super-sparse bit arrays, that there shall be hundreds of thousands of them, that they must be in memory, and that they shall be "append only")?
About my 'append-only' case
Basically I've got one growing "expanse" (the range, but "expanse" is the actual term as I understand it) and a lot of bit arrays that have a few bits set. When the range goes from, say, 0 to 1 000 000, all the bit arrays go from 0 to 1 000 000 too. When the range grows to 1 000 001, then all the bit arrays grow too, all by one bit. But most of these bit arrays will have a '0' appended at their end, while about 4 to 8 of the bit arrays will have a '1' appended at their end. However I cannot predict in advance which of the bit arrays will have a 0 or a 1 appended.
So I've got a lot of bit arrays that all have the same size, that are all very sparse (< 0.5% of their bits set) and that are all "growing" as the range grows (so they're all always growing at the same rate).
Judy arrays are great. But I read about them a few years ago and that stuff was "above my head". Judy arrays are a C-only 20KLOC lib and I'm definitely not re-implementing that. But they're amazing.
So I guess I need to add that I'd like all this to stay relatively simple, which is not that far-fetched given the special "append only" property of my very sparse bit arrays.
You didn't say what programming language you want to use. It sounds like you don't want Judy because it's "C-only"... if you are using C# then you could use my Compact Patricia Trie instead. It is almost 4500 LOC (commented) and uses similar ideas to Judy, but the size and speed of each trie are not ideal due to limitations of .NET. It is not optimized for computing intersections either, but such an algorithm could be added. The article about CP Tries does not emphasize this point, but it can store sets (sparse bit arrays) much more compactly than dictionaries (the graphs in the article show the size and speed of dictionaries, not sets).
The best case is a dense cluster of bits. With 50% occupancy (every other bit set), it requires less than 8 bits per key (less than 4 bits per integer). (correction: less than 8 bits, not more.)
If you only need an approximate representation of the data, use a Bloom filter.
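For reference, a Bloom filter only answers "definitely absent" or "possibly present", and ANDing two filters built with the same parameters gives an approximate intersection. A minimal sketch (the sizing and the splitmix64-style mixing below are illustrative choices, nothing more):

    #include <cstdint>
    #include <vector>

    // Minimal Bloom filter sketch: k hash probes into an m-bit array.
    // It can report false positives but never false negatives.
    struct BloomFilter {
        std::vector<uint64_t> words;
        uint64_t m;   // number of bits
        int k;        // number of probes per key

        BloomFilter(uint64_t bits, int probes)
            : words((bits + 63) / 64), m(bits), k(probes) {}

        static uint64_t mix(uint64_t x, uint64_t seed) {   // splitmix64-style mixing
            x += seed + 0x9e3779b97f4a7c15ULL;
            x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
            x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
            return x ^ (x >> 31);
        }

        void add(uint64_t key) {
            for (int i = 0; i < k; ++i) {
                uint64_t bit = mix(key, i) % m;
                words[bit / 64] |= 1ULL << (bit % 64);
            }
        }

        bool possiblyContains(uint64_t key) const {
            for (int i = 0; i < k; ++i) {
                uint64_t bit = mix(key, i) % m;
                if (!(words[bit / 64] & (1ULL << (bit % 64)))) return false;
            }
            return true;
        }
    };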
By the way, what do you mean by "append only"? Does it mean that you only add keys, or that each key you add is greater than the keys you added before?
Update: Since you are only adding larger keys, you should probably design a special algorithm just for your case. IMO, when designing a custom algorithm, you should make it as simple as possible. So here's my idea, which assumes the keys of different bitsets are uncorrelated (therefore there is no benefit of attempting to compress data between different bitsets):
A bitset is represented by a sorted array of 32-bit slots. Because it's sorted, you can use binary search to find keys. Each slot consists of a 24-bit "prefix" and 8 bits of "flags". Each slot represents a region of 8 keys. The "flags" tell you which of the 8 keys in the region are present in the bitset, and the "prefix" tells you which region we're talking about, by specifying bits 3 to 26 of the key. For example, if the following bits are "1" in the bitset:
1, 3, 4, 1094, 8001, 8002, 8007, 8009
...then the bitset is represented by an array of 4 slots (16 bytes):
Prefix: 0, 136, 1000, 1001
Flags: 0x1A, 0x40, 0x86, 0x02
The first slot represents 1, 3, 4 (notice that bits 1, 3 and 4 are set in the number 0x1A); the second slot represents 1094 (136 * 8 + 6); the third slot represents 8001, 8002, and 8007; the fourth slot represents 8009. Does this make sense?
I don't know if this is as compact as your idea. But I think you'll get faster queries and faster modifications, and it will be fairly easy to implement.
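To make the idea concrete, here is a rough sketch of that slot layout together with a merge-based intersection; the names and the append-only shortcut in add() are my additions:

    #include <cstdint>
    #include <vector>

    // Sketch of the slot layout above: each uint32_t slot packs the 24-bit
    // "prefix" (key >> 3) in its high bits and the 8 "flags" in its low bits.
    // Because keys are only ever appended in increasing order here, adding a
    // key is O(1); lookups use binary search as described.
    struct SlotBitset {
        std::vector<uint32_t> slots;   // sorted by construction

        void add(uint32_t key) {       // key must be >= every key added so far
            uint32_t prefix = key >> 3;
            uint32_t flag = 1u << (key & 7);
            if (!slots.empty() && (slots.back() >> 8) == prefix)
                slots.back() |= flag;                    // same 8-key region
            else
                slots.push_back((prefix << 8) | flag);   // new region
        }

        bool contains(uint32_t key) const {
            uint32_t prefix = key >> 3;
            size_t lo = 0, hi = slots.size();
            while (lo < hi) {                            // binary search on the prefix
                size_t mid = (lo + hi) / 2;
                if ((slots[mid] >> 8) < prefix) lo = mid + 1; else hi = mid;
            }
            return lo < slots.size() && (slots[lo] >> 8) == prefix
                   && (slots[lo] & (1u << (key & 7)));
        }
    };

    // Intersection is a linear merge of the two sorted slot arrays.
    SlotBitset intersect(const SlotBitset& a, const SlotBitset& b) {
        SlotBitset out;
        size_t i = 0, j = 0;
        while (i < a.slots.size() && j < b.slots.size()) {
            uint32_t pa = a.slots[i] >> 8, pb = b.slots[j] >> 8;
            if (pa < pb) ++i;
            else if (pb < pa) ++j;
            else {
                uint32_t flags = a.slots[i] & b.slots[j] & 0xFFu;
                if (flags) out.slots.push_back((pa << 8) | flags);
                ++i; ++j;
            }
        }
        return out;
    }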
You may use a binary tree for the bit array. Say you have an array with range [M..N]. Store it in the following manner:
Choose some number encoding for [0...ram size], like a Fibonacci, Golomb or Rice code (you may choose the most suitable representation after profiling your program with actual data).
- If array is empty (have no bits set), store it as number 0.
- If array is full (have all bits set), store it as number 1.
- Else split it into two parts: P0 in [M..(M+N)/2-1] and P1 in [(M+N)/2..N]
- Generate the representations of P0 and P1 using this algorithm recursively.
- Get the length of P0 (in bits, or whatever other unit the length is a whole number of) and store it as a number (you may need to add 1 if the length may be 1, e.g. if you store 0 as the single bit 0).
- Store P0 then P1.
In this case, if the limits are common, the intersection and union operations are trivial recursions:
Intersection:
- If array A is empty, store 0.
- If array A is full, store a copy of B.
- Else split both arrays, intersect both halves, store the length of the first half, then both halves.
This algorithm can deal with bits (if you need the representation to be most compact) or with bytes/words (if bit operations are too slow).
Also, you may add special encodings for arrays with a single bit set, or for all arrays smaller than some limit (8 elements for example), to decrease the recursion depth.
The drawback is that, without some hacks, adding/removing an element to/from the array is a complex operation (as complex as the intersection/union operations).
For example, an array with the single bit 0xAB set, over the range 0..0xFF, would be stored as (pseudocode):

index: 0  1      2   3  4      5  6  7      8  9  10     11 12     13    14     15     16
value: 1, EMPTY, 13, 1, EMPTY, 9, 1, EMPTY, 5, 1, EMPTY, 1, EMPTY, FULL, EMPTY, EMPTY, EMPTY

which corresponds to the following nested splits, from the outermost range to the innermost:
- [0..FF] splits into [0..7F] (EMPTY) and [80..FF]
- [80..FF] splits into [80..BF] and [C0..FF] (EMPTY)
- [80..BF] splits into [80..9F] (EMPTY) and [A0..BF]
- [A0..BF] splits into [A0..AF] and [B0..BF] (EMPTY)
- [A0..AF] splits into [A0..A7] (EMPTY) and [A8..AF]
- [A8..AF] splits into [A8..AB] and [AC..AF] (EMPTY)
- [A8..AB] splits into [A8..A9] (EMPTY) and [AA..AB]
- [AA..AB] splits into [AA] (EMPTY) and [AB] (FULL)
EMPTY and FULL are codes for empty and full arrays, and the numbers are lengths in elements (they should be replaced with actual lengths in bytes, bits or so).
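Here is a simplified in-memory sketch of the recursion, with explicit nodes instead of the length-prefixed bit codes, so it shows the logic of the intersection rather than the compactness:

    #include <memory>

    // Simplified in-memory sketch: explicit EMPTY/FULL/SPLIT nodes instead of
    // the length-prefixed bit codes described above.
    struct Node {
        enum Kind { EMPTY, FULL, SPLIT } kind;
        std::unique_ptr<Node> lo, hi;       // used only when kind == SPLIT
        explicit Node(Kind k) : kind(k) {}
    };

    std::unique_ptr<Node> leaf(Node::Kind k) { return std::make_unique<Node>(k); }

    std::unique_ptr<Node> clone(const Node& n) {
        auto out = leaf(n.kind);
        if (n.kind == Node::SPLIT) { out->lo = clone(*n.lo); out->hi = clone(*n.hi); }
        return out;
    }

    // empty AND B = empty, full AND B = B, otherwise recurse into both halves.
    std::unique_ptr<Node> intersect(const Node& a, const Node& b) {
        if (a.kind == Node::EMPTY || b.kind == Node::EMPTY) return leaf(Node::EMPTY);
        if (a.kind == Node::FULL) return clone(b);
        if (b.kind == Node::FULL) return clone(a);
        auto out = leaf(Node::SPLIT);
        out->lo = intersect(*a.lo, *b.lo);
        out->hi = intersect(*a.hi, *b.hi);
        // Collapse when both halves agree, to keep the result small.
        if (out->lo->kind != Node::SPLIT && out->lo->kind == out->hi->kind)
            return leaf(out->lo->kind);
        return out;
    }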
If you do not need fast single-bit checks, you may use the simplest approach: just store the distances between set bits using codes: Fibonacci, Rice, Golomb, Levenshtein, Elias, etc., or invent another one. Note that in order to get minimal code length, you should use a code whose code lengths are as close as possible to -log2(p), where p is the probability of that code. You may use a Huffman code for that.
For example, using the Elias gamma code, an array like this:
0 1 0000 1 1 000 1 0 1 000000000000000000 1 000000000000000000
2 5 1 4 2 19 18 (distance)
Should be encoded as:
010 00101 1 00100 010 000010011 000010010
2 5 1 4 2 19 18 (distance code explained)
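A short sketch of the gamma-coded gap idea (encoding only; the trailing run of zeros that records the array length is omitted, and the function names are mine):

    #include <cstdint>
    #include <vector>

    // Elias gamma code, matching the example above: a value n >= 1 is written
    // as floor(log2 n) zeros followed by the binary digits of n (which always
    // start with a 1).
    void gammaEncode(uint32_t n, std::vector<bool>& out) {
        int width = 0;
        while ((n >> (width + 1)) != 0) ++width;          // width = floor(log2 n)
        for (int i = 0; i < width; ++i) out.push_back(false);
        for (int i = width; i >= 0; --i) out.push_back((n >> i) & 1u);
    }

    // Encode a strictly increasing list of set-bit positions as gamma-coded gaps.
    // Gap convention as in the example: the gap counts up to and including the
    // set bit, so the first gap is position + 1.
    std::vector<bool> encodeGaps(const std::vector<uint32_t>& positions) {
        std::vector<bool> out;
        uint32_t previous = 0;
        bool first = true;
        for (uint32_t p : positions) {
            gammaEncode(first ? p + 1 : p - previous, out);
            previous = p;
            first = false;
        }
        return out;
    }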
And the most compact option for an array with a uniform bit distribution would be arithmetic encoding, but it is very CPU-time-consuming, because you'll have to read and write such arrays bit by bit with no fast skipping available.
You may look into compressed bitmaps. A common strategy is to use word-aligned run-length encoding.
C++ implementation:
https://github.com/lemire/EWAHBoolArray
Java implementation:
https://github.com/lemire/javaewah
Reference:
Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010. http://arxiv.org/abs/0901.3751
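To give a feel for what word-aligned run-length encoding does, here is the general idea only, not EWAHBoolArray's actual API or storage format:

    #include <cstdint>
    #include <vector>

    // Rough sketch of the word-aligned idea: the uncompressed bitmap is cut
    // into 64-bit words; runs of all-zero words collapse into a counter, other
    // words are stored verbatim. Real EWAH/WAH also compresses runs of all-one
    // words and packs the counters into the words themselves.
    struct WordAlignedBitmap {
        struct Entry { bool zeroRun; uint64_t value; };   // run length or literal word
        std::vector<Entry> entries;

        void appendWord(uint64_t word) {
            if (word == 0) {
                if (!entries.empty() && entries.back().zeroRun)
                    ++entries.back().value;           // extend the current zero run
                else
                    entries.push_back({true, 1});     // start a new zero run
            } else {
                entries.push_back({false, word});     // literal word, kept as-is
            }
        }
    };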
Even if they aren't exactly what you're looking for, it's worth checking out Judy trees. Judy is a heavily optimized library for ordered maps, and one configuration is specifically designed as a bitset rather than a map. I don't think intersection is one of the operations natively optimized for, though...
The general idea is to use a tree with a fixed number of address bits per level, and take advantage of the sparseness at each level. This results in quite good compression even in the worst case, and fast query performance as well. I believe an intersection operation would be relatively straightforward and potentially very fast.
At any rate, it's always a good idea to steal from the best!
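As a rough illustration of the fixed-address-bits-per-level idea (a deliberate simplification, nothing like Judy's actual layout): 8 key bits per level, sparse child lists, and a recursive merge for the intersection:

    #include <cstdint>
    #include <memory>
    #include <utility>
    #include <vector>

    // Simplified trie: 8 key bits per level, children kept sparsely as sorted
    // (byte, child) pairs, 256-bit bitmaps at the leaves.
    struct TrieNode {
        std::vector<std::pair<uint8_t, std::unique_ptr<TrieNode>>> children; // sorted
        uint64_t leafBits[4] = {0, 0, 0, 0};   // used only at the deepest level
    };

    // Recursive intersection: AND the leaf bitmaps, merge the sparse child lists.
    std::unique_ptr<TrieNode> intersect(const TrieNode& a, const TrieNode& b, int level) {
        auto out = std::make_unique<TrieNode>();
        if (level == 0) {
            bool any = false;
            for (int i = 0; i < 4; ++i) {
                out->leafBits[i] = a.leafBits[i] & b.leafBits[i];
                any = any || out->leafBits[i] != 0;
            }
            if (!any) return nullptr;
            return out;
        }
        size_t i = 0, j = 0;
        while (i < a.children.size() && j < b.children.size()) {
            uint8_t ba = a.children[i].first, bb = b.children[j].first;
            if (ba < bb) ++i;
            else if (bb < ba) ++j;
            else {
                auto child = intersect(*a.children[i].second, *b.children[j].second, level - 1);
                if (child) out->children.emplace_back(ba, std::move(child));
                ++i; ++j;
            }
        }
        if (out->children.empty()) return nullptr;
        return out;
    }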
Considering you are going to do a bunch of intersection tests anyway, maybe you should try storing all of the bitvectors in parallel. One sparse, 16M entry list. Each entry in that list contains a list of which of the 200k input bitvectors has a '1' at that location. It looks like you expect to have only about 5 bits set per input vector, or 1M total entries? Taking a straw-man linked list implementation for the toplevel and the buckets, and a worst case of no intersections at all (thus 1M buckets with 1 element each) you could store it all in 32MB.
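A sketch of that parallel layout, with std::map standing in for the straw-man linked lists:

    #include <cstdint>
    #include <map>
    #include <vector>

    // One sparse map from bit position to the list of bitvector ids that have
    // a '1' there (std::map used for brevity; a sorted vector or hash map
    // would work too).
    struct ParallelBitsets {
        std::map<uint32_t, std::vector<uint32_t>> byPosition;

        void set(uint32_t vectorId, uint32_t position) {
            byPosition[position].push_back(vectorId);
        }

        // Positions where both bitvector a and bitvector b have a '1'.
        // This scans every populated position, so it is better suited to
        // answering many intersection tests in one pass.
        std::vector<uint32_t> intersect(uint32_t a, uint32_t b) const {
            std::vector<uint32_t> result;
            for (const auto& [position, ids] : byPosition) {
                bool hasA = false, hasB = false;
                for (uint32_t id : ids) { hasA = hasA || id == a; hasB = hasB || id == b; }
                if (hasA && hasB) result.push_back(position);
            }
            return result;
        }
    };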
You might be interested in Binary Decision Diagrams (BDD), and more precisely Zero-suppressed Binary Decision Diagram (ZBDD).
They are used to represent sets in a compressed way. Unlike other compressed forms, operations (such as set intersections, or insertions of elements - your "append only" thing?) work directly on the compressed form.