Asymptotically Fast Associative Array with Low Memory Requirements

Ok, tries have been around for a while. A typical implementation should give you O(m) lookup, insert and delete operations independently of the size n of the data set, where m is the message length. However, this same implementation takes up 256 words per input byte, in the worst case.
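
For concreteness, here is a minimal sketch (Python, byte-string keys, class names made up for this example) of the kind of trie described above; the fixed 256-slot child array in every node is where the 256-words-per-input-byte worst case comes from.

    # Sketch of a plain trie with a fixed 256-slot child array per node.
    class TrieNode:
        __slots__ = ("children", "terminal")

        def __init__(self):
            self.children = [None] * 256   # one slot per possible byte value
            self.terminal = False          # marks the end of a stored key

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, key: bytes) -> None:     # O(m) in the key length m
            node = self.root
            for b in key:
                if node.children[b] is None:
                    node.children[b] = TrieNode()
                node = node.children[b]
            node.terminal = True

        def lookup(self, key: bytes) -> bool:     # O(m), independent of n
            node = self.root
            for b in key:
                node = node.children[b]
                if node is None:
                    return False
            return node.terminal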

Other data structures, notably hashing, give you expected O(m) lookup, insertion and deletion, with some implementations even providing constant time lookup. Nevertheless, in the worst case the routines either do not halt or take O(nm) time.

The question is, is there a data structure that provides O(m) lookup, insertion and deletion time while keeping a memory footprint comparable to hashing or search trees?

It might be appropriate to say that I am only interested in worst-case behaviour, both time- and space-wise.


Did you try Patricia tries (also known as crit-bit or radix tries)? I think they solve the worst-case space issue.
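
For illustration, a minimal radix-trie (Patricia-style) sketch, assuming byte-string keys and with made-up names: single-child chains are collapsed into one edge label, so nodes exist only at branch points and the total space is proportional to the stored data rather than 256 slots per input byte.

    class RadixNode:
        def __init__(self):
            self.edges = {}        # first byte of label -> (label, child)
            self.terminal = False

    def radix_insert(root: RadixNode, key: bytes) -> None:
        node, i = root, 0
        while i < len(key):
            edge = node.edges.get(key[i])
            if edge is None:
                child = RadixNode()
                child.terminal = True
                node.edges[key[i]] = (key[i:], child)   # hang the whole remainder on one edge
                return
            label, child = edge
            # length of the common prefix between the edge label and the rest of the key
            j = 0
            while j < len(label) and i + j < len(key) and label[j] == key[i + j]:
                j += 1
            if j < len(label):
                # split the edge at the mismatch point
                mid = RadixNode()
                mid.edges[label[j]] = (label[j:], child)
                node.edges[key[i]] = (label[:j], mid)
                child = mid
            node, i = child, i + j
        node.terminal = True

    def radix_lookup(root: RadixNode, key: bytes) -> bool:
        node, i = root, 0
        while i < len(key):
            edge = node.edges.get(key[i])
            if edge is None:
                return False
            label, child = edge
            if key[i:i + len(label)] != label:
                return False
            node, i = child, i + len(label)
        return node.terminal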


There is a structure known as a suffix array. I can't remember the research in this area, but I think they've gotten pretty darn close to O(m) lookup time with this structure, and it is much more compact than your typical tree-based indexing methods.

Dan Gusfield's book is the Bible of string algorithms.
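
As a rough illustration only (not any particular published algorithm): the naive construction below sorts all suffixes directly, which is far from optimal but short; linear-time builders such as SA-IS exist. Plain binary search then gives O(m log n) pattern lookup, and with an auxiliary LCP array this can be pushed towards O(m + log n).

    def build_suffix_array(text: bytes) -> list[int]:
        # Naive O(n^2 log n) construction, for illustration only.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def contains(text: bytes, sa: list[int], pattern: bytes) -> bool:
        # Binary search for the first suffix whose prefix is >= pattern,
        # then check whether that suffix actually starts with the pattern.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
                lo = mid + 1
            else:
                hi = mid
        return lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern

    text = b"mississippi"
    sa = build_suffix_array(text)
    print(contains(text, sa, b"issi"))   # True
    print(contains(text, sa, b"ssix"))   # False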


I don't think there is a reason to be worried about the worst case, for two reasons:

  1. You'll never have more total active branches in the sum of all trie nodes than the total size of the stored data.
  2. The only time the node size becomes an issue is if there is huge fan-out in the data you're sorting/storing. Mnemonics would be an example of that. If you're relying on the trie as a compression mechanism, then a hash table would do no better for you.

If you need to compress and you have few/no common subsequences, then you need to design a compression algorithm based on the specific shape of the data rather than based on generic assumptions about strings. For example, in the case of a fully/highly populated mnemonic data set, a data structure that tracked the "holes" in the data rather than the populated data might be more efficient.

That said, it might pay for you to avoid a fixed trie node size if you have moderate fan-out. You could make each node of the trie a hash table: start with a small size and increase it as elements are inserted. Worst-case insertion would be c * m, where c is the number of possible characters / unique atomic elements, in the case where every per-node hash table has to be reorganized because of growth.
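
A hedged sketch of that idea (Python, byte-string keys, names invented for the example): each trie node keeps its children in a hash table rather than a fixed 256-slot array. Python dicts happen to start small and grow on insertion, which stands in for the "start small and reorganize" behaviour; in a lower-level language you would resize and rehash each per-node table yourself.

    class HashTrieNode:
        __slots__ = ("children", "terminal")

        def __init__(self):
            self.children = {}     # byte value -> child node; grows with fan-out
            self.terminal = False

    def insert(root: HashTrieNode, key: bytes) -> None:
        node = root
        for b in key:
            node = node.children.setdefault(b, HashTrieNode())
        node.terminal = True

    def lookup(root: HashTrieNode, key: bytes) -> bool:
        node = root
        for b in key:
            node = node.children.get(b)
            if node is None:
                return False
        return node.terminal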


In my experience there are three implementations that I think could meet your requirements:

  • HAT-Trie (a combination of a trie and a hash table)
  • Judy Array (a compressed n-ary tree)
  • Double-Array Trie

You can see the benchmark here. They are as fast as a hash table, but with a lower memory requirement and better worst-case behaviour.
