B-Trees / B+Trees and duplicate keys_问答_开发者

I'm investigating the possibility of putting together a custom storage scheme for my application. It's worth the effort of potentially reinventing the wheel, I think, because both performance and storage efficiency are a main objective and the data and operations on it are far simpler than everything provided by an RDBMS (no updates, no deletes, predefined set of queries).

I'm using just a small handful of web resources I've found about B-Trees and B+-Trees - Wikipedia, http://www.bluerwhite.org/btree/, http://slady.net/java/bt/view.php, http://www.brpreiss.com/books/opus6/html/page342.html (the last one is the most valuable).

Duplicate keys

The first problem I'm trying to solve is how to deal with duplicate keys - this tree will be acting as a DB index and for example there won't just be one 'thing' with 'color=red', so looking up 'red' in this tree should yield many results.

There are two solutions I have come up with so far. The first is simply having multiple entries in the tree for each of these. But when there are 100,000 or 1,000,000 'red' things in the tree.. is that very efficient for a tree structure? The second was to have just one entry for each key, but the 'payload' associated with each key points to a different block of data, which is a linked list pointing to all instances of items that are 'red'.

Is there a common / better option?

B+Tree nodes changing types

I wanted to check an assumption I'm making. Say you have a B+-Tree, height 2 - the external (leaf) nodes at level 2 hold 'actual data'. Then an insertion necessitates a split of a leaf node - the leaf node no longer holds 'actual data'. Am I right in thinking that in implementation terms because the data might be of a substantial size that you would instead store a kind of 'pointer' as the 'actual data' - so if a leaf node becomes a branch node, that pointer (of the same size) is instead updated to po开发者_Python百科int to the new subtree?

By that I mean, internal and external nodes, they should be the same size really since external nodes might become internal ones, and shuffling data around isn't a good idea?

(Added the C# tag since I'm implementing this from scratch in C#.)

Kieren, I'm sure you figured out by now that B+ trees grow by splitting upwards, so that a leaf node is always a leaf node, and internal nodes are always internal nodes. Eventually, you must split the root node, which turns that into two internals, and you define a new root. So to answer the second part of your question, you don't change node types.

Regarding the first part of your question, when you delete a data record from the DB, you will need to find all the keys that point to that particular record, and remove them. If you have to look through long linear lists to do that, deleting will be slow. I am assuming you are using a binary search within a node in order to quickly find the correct node element (key + pointer), so if you make that "node searching" mechanism include the ability to ask for a particular key + pointer combination, you can quickly find the correct key element to remove. In other words, make the data record pointer part of the search (only when searching for a particular data record's key). This does mean that the duplicate keys will be stored in the nodes in "data pointer" order, so as long as ordering of the duplicate keys is not important, this mechanism will work.

Attempting to answer my own question.. I would welcome other answers too.

Duplicate Keys

The tree will store a reference to a list (memory) or linked-list (disk) of items with the given key, if duplicate entries for the same key is a possibility.

B+Tree nodes, changing types

In-memory, my nodes have an object reference, which can point to another node (in itself another valid B+Tree) in the case of an internal/branch node, or indeed data directly in the case of an external/leaf node. On disk, this would work in a very similar way: a 64-bit value for each 'link slot', as I have chosen to name them - either an offset in the file if pointing at a sub-node, or a block number if pointing to data directly (or the head of a linked-list in the case mentioned in the first part of the question).

The main feature of B+ trees is minimizing disk seeks. If you just "store pointers" then you defeat that benefit, and your code will be chasing file pointers, and you will be seeking the disk head all over the place. You can't read a hundred bytes from disk, all reads and writes are in aligned blocks.

Leaf parents: data are always in a leaf, and just one key per leaf are in the nodes (that are immediate parents of the leaves). The node lets you "peek" at the leaf's contents by looking at a copy of the first key in that leaf, right there in the node.

Node parents: The key before the first key in the child node is in the node's parent.

The duplication of data is not bad. For example, if there are 207 records per leaf, and there are 268 records per node, then you will store one extra key for every 207 records. If you have more than 207*269 leaves, then you need one more key per 207*269 records.

You seem to be confusing B-trees and B+ trees. B+ trees always have the data in the leaves, and there is never any data in nodes. Only a sample of the lowest key per child is present in a node. Data never "moves up" a B+ tree, only copies of one lowest key per child are propagated upward.

The overhead grows logarithmically. The minimal duplication saves a lot of seeks.

(Really) Duplicate keys

To handle duplicate keys in a B+ tree, as in multiple rows that have the same value, implementations typically force it to be unique by appending an additional hidden column to the table, and assigning it an auto-incrementing value when a record is created. The hidden column is added to the end of the indexed key, which guarantees that it will always be unique. The index starts with the indexed column(s) so the ordering will be as specified, but the appended hidden value guarantees uniqueness.

If the table already has a primary key, then it could just use that to enforce uniqueness instead of adding a hidden column.

When you deal with duplicate keys you always hit a leaf containing the given key you searched for.

Since a leaf groups all keys together you just need to walk the leaf to the left (position -1) to find the first entry with the given key. If you find the first key of the leaf, you need to inspect the previous leafs.

Since there is no possible assumption about the leaf you will hit for a duplicated key, you need to walk all previous leafs until you find a leaf which first key is not the key you search. If the last key of that leaf is not the key you search for (<) than its the next leaf otherwise this leaf.

The search over the leafs is linear inside the leaf you have log n to find the first key entry.

If you can sort the key entries in the leaf by the data they hold you can easily find the leaf to find a certain entry (which is great for contains and remove operations).

for a high chance of duplicates it is best to look for an other storage model by storing key -> datas. Especially if the data does not change often.

[Update]

There is a chance of forgetting a key:

Node N [L1 |3| L2] Leaf L1 [1,2,3] -> L2 [ 3, 4, 5]

You remove the 3 in L2 you result in.

Node N [L1 |3| L2] Leaf L1 [1,2,3] -> L2 [ 4, 5]

When you now search you find that 3 is not in L2. You might now look in the previous leaf to find the 3.

Another way is to update the key to the actual first key of the leaf, resulting in (resulting in potential update propagation):

Node N [L1 |4| L2] Leaf L1 [1,2,3] -> L2 [ 4, 5]

Or you borrow from the left leaf the element.

Node N [L1 |3| L2] Leaf L1 [1,2] -> L2 [3, 4, 5]

I tend to use the first solution since it works also for leafs in the middle of multi leaf duplicates.