Data structure for efficiently returning the top-K entries of a hash table (map, dictionary)_问答_开发者

Here's a description:

It operates like a regular map with get, put, and remove methods, but has a getTopKEntries(int k) method to get the top-K elements, sorted by the key:

For my specific use case, I'm adding, removing, and adjusting a lot of values in the structure, but at any one time there's approximately 500-1000 elements; I want to return the entries for the top 10 keys efficiently.

I call the put and remove methods many times.
I call the getTopKEntries method.
I call the put and remove methods some more times.
I call the getTopKEntries method.
...

I'm hoping for O(1) get, put, and remove operations, and for getTopKEntries to be dependent only on K, not on the size of the map.

So what's a data structure for efficiently returning the top-K elements of a map?

My other question is similar, but is for the case of returning all elements of a ma开发者_StackOverflow社区p, sorted by the key.

If it helps, both the keys and values are 4-byte integers.

A binary search tree (i.e. std::map in C++) sounds like the perfect structure: it’s already lexicographically ordered, i.e. a simple in-order traversal will yield the elements in ascending order. Hence, iterating over the first k elements will yield the top k elements directly.

Additionally, since you foresee a lot of “remove” operations, a hash table won’t be well-suited anyway: remove operations destroy the load factor characteristics of hash tables which leads to a rapid deterioration of the runtime.

I'm not sure I accept fully Konrad's view that lots of remove operations would destroy the structure of a hash table.

Without remove operations, you could keep all the objects in a hash table, and keep the top K in a priority heap that'd be incrementally updated. This would make insert O(1 + log K), i.e. constant-time in N assuming K is constant and not dependent on N (N = number of objects in the table). However this doesn't work when you have the remove operation available. The proposed Fibonacci heap has O(log N) amortized delete operation so it doesn't give a good solution either, as all the objects would be need to kept in the heap, and if you eventually remove every object that you insert, you get O(log N) behavior in general per a pair of insert+delete.

I would maybe try the following approach:

Store the objects in a hash table, assuming you need the whole table for some other purposes than returning the top objects. Maintain a priority heap (standard heap goes) that contains K * C objects for C whose value you need to search for experimentally. Whenever you add a new object, try to insert it in the heap; if it fits in the KC space (heap is not at capacity yet or it pushes another object away), insert it and set a bit in the hash table to denote that the object is in the heap; when you push an object out of the heap, clear the bit. When you remove an object, check the bit; if the bit=1 i.e. the object was in the heap remove it from there (you need to search for it unless you have a pointer to it from the hash table; it's best to maintain the pointer). What happens now is that the heap shrinks. They key thing is that as long as the heap has still at least K objects it is guaranteed to contain all the top K objects. This is where the factor C comes in as it provides the "leeway" for the heap. When the heap size drops BELOW K, you run a linear scan over the whole hash table and fill the heap back to KC capacity.

Setting C is empirical because it depends on how your objects come and go; but tuning it should be easy as you can tune it just based on runtime profiling.

Complexity: Insert is O(1 + log (KC)). Remove is O(1 + p log (KC) + q N) where p is the probability that a removed object was in the heap, and q is the probability that the heap needs to be rebuilt. p is dependent on the characteristics of how objects come and go. For a simple analysis we can set p=(KC/N), i.e. assume uniform probability. q is even more sensitive to the "flow" of the objects. For example, if new objects in general increase in their value over time and you always delete older objects, q tends towards zero.

Note that funnily enough p is inversely proportional to N so actually this part speeds up when N grows :)

An alternative would be just to sort the items.

In your usage scenario there are only 1000 items – sorting them is just incredibly fast (keep in mind that log₂ 1000 ≈ 10 = almost 1), and it seems not to occur too often.

You can even adapt the selection algorithm to return the K smallest items. Unfortunately, this will still depend on n, not only on k as you’d hoped for: O(n + k log k).

(I’ve added this as a new answer because it’s actually completely unrelated to my first entry.)

I would recommend a fibonacci heap.

You might want a heap (although deletion may be a problem).

Unless I'm being severely un-creative today, you just can't do it all at O(1).

If you are maintaining a sort order, then adds and deletes will probably be at O(log n). If you are not, then your search will have to be O(n).

Hash tables just don't do sorting. I suggest you live with the O(log n) for inserts and deletes and use one of the suggested data structures (Heap is probably the best). If you need O(1) lookups you could combine a hash, but then you are maintaining two data structures in parallel and might as well use a TreeMap.

If the sort key is a simple integer or decimal number, a trie will be quite fast. It will use up memory, and technically finding an element in a trie is O(log n). But in practice it'll be something like log₂₅₆ n, so the constant factor is very small (log₂₅₆ of 2 billion = 4).

I feel, heap is the best data structure for this problem. Because, put,remove and return K top elements can be returned in O(klog(N)) time. Use a max-heap if you want max elements.

Here, I am assuming that k top elements means, you need the k elements having max value.