What are the fastest-performing options for a read-only, unordered collection of unique strings?_问答_开发者

Disclaimer: I realize the totally obvious answer to this question is HashSet<string>. It is absurdly fast, it is unordered, and its values are unique.

But I'm just wondering, because HashSet<T> is a mutable class, so it has Add, Remove, etc.; and so I am not sure if the underlying data structure that makes these operations possible makes certain performance sacrifices when it comes to read operations -- in particular, I'm concerned with Contains.

Basically, I'm wondering what are the absolute faste开发者_高级运维st-performing data structures in existence that can supply a Contains method for objects of type string. Within or outside of the .NET framework itself.

I'm interested in all kinds of answers, regardless of their limitations. For example I can imagine that some structure might be restricted to strings of a certain length, or may be optimized depending on the problem domain (e.g., range of possible input values), etc. If it exists, I want to hear about it.

One last thing: I'm not restricting this to read-only data structures. Obviously any read-write data structure could be embedded inside a read-only wrapper. The only reason I even mentioned the word "read-only" is that I don't have any requirement for a data structure to allow adding, removing, etc. If it has those functions, though, I won't complain.

UPDATE:

Moron's answer is an excellent example of the sort of thing I'm looking for. A Trie* definitely seems like a great possibility for the following reason: HashSet<T>.Contains depends on the GetHashCode function of some IEqualityComparer<string>, which, as far as I can tell, is O(n)** by default in .NET. In other words, every character in a string must be examined for HashSet<string>.Contains to return either true or false. For a Trie, only a return value of true would take O(n) to determine; a return value of false could potentially return much more quickly.

This is of course hypothetical. So far I have not written or come across a Trie implementation in .NET that can beat a HashSet<string> at Contains (though an implementation I wrote myself came quite close for the alphabet 'a' to 'z'). I'm just saying, it seems possible.

_{*That link, by the way, also led me to another intriguing/similar possibility: DAWG.}

_{**Here "n" is referring to the length of the string.}

Tries are good for doing a Contains, especially for strings from a finite alphabet. Given a string s, the time complexity for Contains on a trie is O(|s|) (|s| = length of s), which is optimal.

Apart from your wondering Hashset is the fastest collection.

There's no faster method because the underlying Hashtable allows O(1) read-write-access

A hashing container approaches O(1) for insert and retrieval, so from an order-of-magnitude perspective you can't get much better than that.

Within a hash container, your performance over time is going to be related to two things: how good of a distribution your hash function provides, and how fast it can compute it. These are not equivalent - a poorly distributed function (where you end up with a lot of collisions) is going to be much more performance-impacting than a slower but better distributed hash function.

Thus, if you could come up with a perfect hash function that was also extremely fast to compute, that would be an improvement. It's possible that constraining the data in specific ways may make that easier. But, odds are you, whatever you come up with will not be as good as what already exists.

Hashing tables are amortized O(1) for lookup. Can't do better than that, O(1/n) algorithms are perpetual motion devices. There are only two things that make them behave poorly:

A poor hashing function that causes many collisions. The worst one will degenerate lookup to O(n). You'll have no trouble with strings, they hash really well. String.GetHashCode() does a terrific job.
A collection that is heavily mutated with many removed items that were added early. This can causes many empty hash buckets that need to be skipped by iterators. Degradation to O(n) is technically possible although quite rare. A simple workaround is to rebuild the collection by reassigning the reference (like table = new HashSet(table); )

These kind of problems are rare. You don't design for them up front (other than the hash function), you start considering them only when you detect perf problems with the program.