What's the most efficient bit vector compression method for my use case?_问答_开发者

What's the most efficient bit vector compression method for my use case?

开发者 https://www.devze.com 2023-02-06 07:58 出处：网络

I\'m working on a project in computational biology and I need to store an index of locuses that differ between ma开发者_运维百科ny sequences. For now, I\'m using a B+Tree for that purpose, but I gues

I'm working on a project in computational biology and I need to store an index of locuses that differ between ma开发者_运维百科ny sequences. For now, I'm using a B+Tree for that purpose, but I guess using a bitmap index would be way faster for such a use case : only a small number of locus differ between two sequences, 1% on average, and they are nearly equally distributed along the sequence; so it seems like there is a lot of room for bitmap index compression. My problem is that I cannot manage to find a compression method that can efficiently:

allow fast individual bit setting/unsetting
permit efficient range queries over the bitmap
possibly allow fast XOR-ing/AND-ing of two indexes

Thx in advance for your suggestions.

Check out FastBit:

https://sdm.lbl.gov/fastbit/

You could use a simple tree data structure like this:

struct node {
    node * leftChild;
    node * rightChild;
    long mask;
};
struct tree {
    int exponent; // the size of the tree is 2^exponent
    node rootNode;
};

Every node represents a sub-array of the large bit array that is (2^n)*sizeof(long) bits, n>=0. Leaf nodes store a raw bit mask in 'mask' if they're at the bottom of the tree, otherwise they store 0 in 'mask'. This way, leaf node with a 'mask' value of 0 can represent a (2^n)*sizeof(long) -sized empty area in the bit array, so sparse bit arrays can be stored efficiently.

leftChild and rightChild are of course null in all leaf nodes. Every other node has both a leftChild and rightChild pointer, and every node that isn't a leaf node has at least one descendant node with mask that has bits set in it.

To find a bit at a given index:

bool find_bit_at_index(tree t, long ind) {
    long divider = 1 << (t.exponent - 1);
    node *n = &t.rootNode;
    node *lastNode;
    while (n)
    {
       lastNode = n;
       if (ind >= divider) {
          n = n->rightChild;
          ind -= divider;
       }
       else {
          n = n->leftChild;
       }
       divider >>= 1;
    }
    return lastNode->mask & (1 << ind);
}

Constructing the tree and developing the rest of the algorithms should be easy enough once you understand the idea. I haven't actually tested the code, since this isn't a complete solution, some typos or such might remain. And I'm not a bitmap index expert, there might be (probably is) a ready-made package that does this better, but this solution is simple and should be relatively efficient. 1% might not yet be sparse enough to make this better compared to just a plain bit array (assuming longs storing 64 bits each, it doesn't take more than 2 longs to have more than one bit set on average), but if the sparsity increases beyond that the space and time savings will show.