
Indexing count of buckets

https://www.devze.com 2023-01-04 14:59 (source: web)

So, here is my little problem.

Let's say I have a list of buckets a0 ... an which respectively contain L <= c0 ... cn < H items. I can decide of the L and H limits. I could even update them dynamically, though I don't think it would help much.

The order of the buckets matters. I can't go and swap them around.

Now, I'd like to index these buckets so that:

  • I know the total count of items
  • I can look up the ith element
  • I can add/remove items from any bucket and update the index efficiently

Seems easy, right? Seeing these criteria, I immediately thought of a Fenwick Tree. That's exactly what they are meant for.

However, when you think about it, a few other requirements creep in:

  • if a bucket count drops below L, the bucket must disappear (don't worry about the items yet)
  • if a bucket count reaches H, then a new bucket must be created because this one is full

I haven't figured out how to edit a Fenwick Tree efficiently: remove / add a node without rebuilding the whole tree...

Of course we could set L = 0, so that removing would become unnecessary; however, adding items cannot really be avoided.

So here is the question:

Do you know either a better structure for this index, or how to update a Fenwick Tree?

The primary concern is efficiency, and because I do plan to implement it, cache/memory considerations are worth worrying about.

Background:

I am trying to come up with a structure somewhat similar to B-Trees and Ranked Skip Lists, but with a localized index. The problem with those two structures is that the index is kept alongside the data, which is inefficient in terms of cache (i.e. you need to fetch multiple pages from memory). Database implementations suggest that keeping the index isolated from the actual data is more cache-friendly, and thus more efficient.


I have understood your problem as:

Each bucket has an internal order and buckets themselves have an order, so all the elements have some ordering and you need the ith element in that ordering.

To solve that:

What you can do is maintain a 'cumulative value' tree where the leaf nodes (x1, x2, ..., xn) are the bucket sizes. The value of a node is the sum of the values of its immediate children. Keeping n a power of 2 will make it simple (you can always pad with zero-size buckets at the end) and the tree will be a complete tree.

For each bucket, you maintain a pointer to its corresponding leaf node.

E.g., say the bucket sizes are 2, 1, 4, 8.

The tree will look like

     15
    /  \
   3    12
  / \  / \
 2  1  4  8

If you want the total count, read the value of the root node.
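As a concrete illustration, here is a minimal sketch of that cumulative-value tree stored as a flat array; the heap-style 1-based layout and the function name are my own choices, not from the answer:

```python
def build_tree(sizes):
    """Build a complete sum tree over the bucket sizes, padded to a power of 2."""
    n = 1
    while n < len(sizes):
        n *= 2
    tree = [0] * (2 * n)                  # 1-based; leaves live at tree[n : 2n]
    tree[n:n + len(sizes)] = sizes        # missing buckets stay zero-sized
    for i in range(n - 1, 0, -1):         # each internal node sums its children
        tree[i] = tree[2 * i] + tree[2 * i + 1]
    return tree

tree = build_tree([2, 1, 4, 8])
print(tree[1])                            # total item count: 15
```

Index 0 is unused so that node i's children sit at 2i and 2i+1, which keeps the parent/child arithmetic trivial.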

If you want to modify some xk (i.e. change the corresponding bucket size), you walk up the tree following parent pointers, updating the values.

For instance, if you add 4 items to the second bucket, the tree becomes (the nodes marked with * are the ones that changed):

     19*
    /   \
   7*    12
  / \   / \
 2  5*  4  8
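In the flat-array layout sketched above (leaves at indices n..2n-1, parent of node i at i // 2 — an assumed representation, not spelled out in the answer), the upward walk is just a few lines:

```python
def update(tree, leaf, delta):
    """Add delta to a bucket's leaf and fix every cumulative value above it."""
    i = leaf
    while i >= 1:
        tree[i] += delta
        i //= 2                           # move to the parent

# The 2, 1, 4, 8 example: leaves live at indices 4..7
tree = [0, 15, 3, 12, 2, 1, 4, 8]
update(tree, 5, 4)                        # add 4 items to the second bucket
print(tree[1:])                           # [19, 7, 12, 2, 5, 4, 8]
```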

If you want to find the ith element, you walk down the tree, effectively doing a binary search: at each node you already know the left and right child counts. If i is greater than the left child's value, you subtract the left child's value from i and recurse into the right subtree; if i <= the left child's value, you go left and recurse.

Say you wanted to find the 9th element in the above tree:

The left child of the root is 7 < 9, so you subtract 7 from 9 (to get 2) and go right.

Since 2 <= 4 (the left child of 12), you go left.

You are at the leaf node corresponding to the third bucket. You now need to pick the second element in that bucket.
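The downward walk above can be sketched like this (same assumed flat-array layout as before; the function name is illustrative):

```python
def find_ith(tree, i):
    """Locate the ith item (1-based): returns (leaf index, offset in bucket)."""
    node = 1
    n = len(tree) // 2                    # leaves occupy indices n .. 2n-1
    while node < n:                       # descend until we reach a leaf
        left = tree[2 * node]
        if i <= left:
            node = 2 * node               # answer lies in the left subtree
        else:
            i -= left                     # skip everything in the left subtree
            node = 2 * node + 1
    return node, i

tree = [0, 19, 7, 12, 2, 5, 4, 8]         # the tree after the +4 update
print(find_ith(tree, 9))                  # (6, 2): 2nd item of the 3rd bucket
```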

If you have to add a new bucket, you double the size of your tree (if needed) by adding a new root: the existing tree becomes the left child, and the new right subtree is all zero-size buckets except the one you just added (which will be its leftmost leaf). This is amortized O(1) time per added bucket. The caveat is that you can only add a bucket at the end, not anywhere in the middle.
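The doubling step might look like this in the flat-array layout (the msb-shift remapping of old nodes is my own detail for the implicit representation, not part of the original answer):

```python
def add_bucket(tree, num_buckets, size):
    """Append bucket number num_buckets (0-based); grow the tree if full."""
    n = len(tree) // 2                    # current leaf capacity
    if num_buckets == n:                  # leaves full: add a new root on top
        new = [0] * (4 * n)
        for i in range(1, 2 * n):
            # old node i becomes the same node one level deeper, i.e. shifted
            # by its own most significant bit (old root 1 -> node 2, etc.)
            new[i + (1 << (i.bit_length() - 1))] = tree[i]
        new[1] = tree[1]                  # new root starts at the old total
        tree, n = new, 2 * n
    i = n + num_buckets                   # leaf slot of the new bucket
    while i >= 1:                         # standard upward update from zero
        tree[i] += size
        i //= 2
    return tree

tree = add_bucket([0, 15, 3, 12, 2, 1, 4, 8], 4, 5)   # 5th bucket, size 5
print(tree[1])                            # total is now 20
```

Copying every node on a doubling is O(n), but doublings are exponentially rare, which is where the amortized O(1) bound comes from.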

Getting the total count is O(1). Updating a single bucket and looking up an item are both O(log n).

Adding new bucket is amortized O(1).

Space usage is O(n).

Instead of a binary tree, you can probably do the same with a B-Tree.


I still hope for answers; however, here is what I could come up with so far, following @Moron's suggestion.

Apparently my little Fenwick Tree idea cannot be easily adapted: it's easy to append new buckets at the end of a Fenwick tree, but not in the middle, so it's kind of a lost cause.

We're left with 2 data structures: binary indexed trees (ironically, the very name Fenwick used to describe his structure) and Ranked Skip Lists.

Typically, these do not separate the data from the index; however, we can get this behavior by:

  1. Use indirection: the element held by the node is a pointer to a bucket, not the bucket itself
  2. Use pool allocation so that the index elements, even though allocated independently from one another, are still close in memory, which should help the cache

I tend to prefer Skip Lists over Binary Trees because they are self-organizing, so I'm spared the trouble of constantly re-balancing my tree.

These structures would allow getting to the ith element in O(log N); I don't know whether a better asymptotic bound is possible.

Another interesting implementation detail: I may hold a pointer to an element, but others might have been inserted or removed since, so how do I know the rank of my element now?

It's possible if the bucket points back to the node that owns it. But this means that either the node should not move or it should update the bucket's pointer when moved around.
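For the binary-tree variant at least, that back-pointer is enough: the rank can be recomputed by walking up from the leaf and summing the left siblings passed along the way (a sketch, assuming the flat-array sum tree from the accepted answer):

```python
def rank_of_leaf(tree, leaf):
    """Count the items stored in all buckets strictly before this leaf."""
    i, before = leaf, 0
    while i > 1:
        if i % 2 == 1:                    # we are a right child: the whole
            before += tree[i - 1]         # left-sibling subtree precedes us
        i //= 2
    return before

tree = [0, 19, 7, 12, 2, 5, 4, 8]         # bucket sizes 2, 5, 4, 8
print(rank_of_leaf(tree, 6))              # 7 items precede the third bucket
```

The global rank of an item is then this count plus its offset inside its own bucket, and the walk is O(log n), the same cost as a lookup.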
