Mapping from String to integer - performance of various approaches_问答_开发者

Let's say that I need to make a mapping from String to an integer. The integers are unique and form a continuous range starting from 0. That is:

Hello -> 0
World -> 1
Foo   -> 2
Bar   -> 3
Spam  -> 4
Eggs  -> 5
etc.

There are at least two straightforward ways to do it. With a hashmap:

HashMap<String, Integer> map = ...
int integer = map.get(string); // Plus maybe null check to avoid NPE in unboxing.

Or with a list:

List<String> list = ...
int integer = list.indexOf(string); // Plus maybe check for -1.

Which approach should I use, and why? Arguably the relative performance depends on the size of the list/map, since List#indexOf() is a linear search using String#equals() -> O(n) efficienc开发者_如何学编程y, while HashMap#get() uses hash to narrow down the search -> certainly more efficient when the map is big, but maybe inferior when there are just few elements (there must be some overhead in calculating the hash, right?).

Since benchmarking Java code properly is notoriously hard, I would like to get some educated guesses. Is my reasoning above correct (list is better for small, map is better for large)? What is the threshold size approximately? What difference do various List and HashMap implementations make?

A third option and possibly my favorite would be to use a trie:

I bet it beats the HashMap in performance (no collisions + the fact that computing the hash-code is O(length of string) anyway), and possibly also the List approach in some cases (such as if your strings have long common prefixes, as the indexOf would waste lot of time in the equals methods).

When choosing between List and Map I would go for a Map (such as HashMap). Here is my reasoning:

Readability

The Map interface simply provides a more intuitive interface for this use case.
Optimization in the right place

I'd say if you're using a List you would be optimizing for the small cases anyway. That's probably not where the bottle neck is.

A fourth option would be to use a LinkedHashMap, iterate through it if the size is small, and get the associated number if the size is large.

A fifth option is to encapsulate the decision in a separate class all together. In this case you could even implement it to change strategy in runtime as the list grows.

You're right: a List would be O(n), a HashMap would be O(1), so a HashMap would be faster for n large enough so that the time to calculate the hash didn't swamp the List linear search.

I don't know the threshold size; that's a matter for experimentation or better analytics than I can muster right now.

Your question is totally correct on all points:

HashMaps are better (they use a hash)
Benchmarking Java code is hard

But at the end of the day, you're just going to have to benchmark your particular application. I don't see why HashMaps would be slower for small cases but the benchmarking will give you the answer if it is or not.

One more option, a TreeMap is another map data structure which uses a tree as opposed to a hash to access the entries. If you are doing benchmarking, you might as well benchmark that as well.

Regarding benchmarking, one of the main problems is the garbage collector. However if you do a test which doesn't allocate any objects, that shouldn't be a problem. Fill up your map/list, then just write a loop to get N random elements, and then time it, that should be reasonably reproducable and therefore informative.

Unfortunately, you are going to have to benchmark this yourself, because the relative performance will depend critically on the actual String values, and also on the relative probability that you will test a string that is not in your mapping. And of course, it depends on how String.equals() and String.hashCode() are implemented, as well as the details of the HashMap and List classes used.

In the case of a HashMap, a lookup will typically involve calculating the hash of the key String, and then comparing the key String with one or more entry key Strings. The hashcode calculation looks at all characters of the String, and is therefore dependent on the key String. The equals operations typically will typically examine all of the characters when equals returns true and considerably less when it returns false. The actual number of times that equals is called for a given key string depends on how the hashed key strings are distributed. Normally, you'd expect an average of 1 or 2 calls to equal for a "hit" and maybe up to 3 for a "miss".

In the case of a List, a lookup will call equals for an average of half the entry key Strings in the case of a "hit" and all of them in the case of a "miss". If you know the relative distribution of the keys that you are looking up, you can improve the performance in the "hit" case by ordering the list. But the "miss" case cannot be optimized.

In addition to the trie alternative suggested by @aioobe, you could also implement a specialized String to integer hashmap using a so-called perfect hash function. This maps each of the actual key strings to a unique hash within a small range. The hash can then be used to index an array of key/value pairs. This reduces a lookup to exactly one call to hash function and one call to String.equals. (And if you can assume that supplied key will always be one of the mapped strings, you can dispense with the call to equals.)

The difficulty of the perfect hash approach is in finding a function that works for the set of keys in the mapping and is not too expensive to compute. AFAIK, this has to be done by trial and error.

But the reality is that simply using a HashMap is a safe option, because it gives O(1) performance with a relatively small constant of proportionality (unless the entry keys are pathological).

(FWIW, my guess is that the break-even point where HashMap.get() becomes better than List.contains() is less than 10 entries, assuming that the strings have an average length of 5 to 10.)

From what I can remember, the list method will be O(n),but would be quick to add items, as no computation occurs. You could get this lower O(log n) if you implemented a b-search or other searching algorithms. The hash is O(1), but its slower to insert, since the hash needs to be computed every time you add an element.

I know in .net, theres a special collection called a HybridDictionary, that does exactly this. Uses a list to a point, then a hash. I think the crossover is around 10, so this may be a good line in the sand.

I would say you're correct in your above statement, though I'm not 100% sure if a list would be faster for small sets, and where the crossover point is.

I think a HashMap will always be better. If you have n strings each of length at most l, then String#hashCode and String#equals are both O(l) (in Java's default implementation, anyway).

When you do List#indexOf it iterates through the list (O(n)) and performs a comparison on each element (O(l)), to give O(nl) performance.

Java's HashMap has (let's say) r buckets, and each bucket contains a linked list. Each of these lists is of length O(n/r) (assuming the String's hashCode method distributes the Strings uniformly between the buckets). To look up a String, you need to calculate the hashCode (O(l)), look up the bucket (O(1) - one, not l), and iterate through that bucket's linked list (O(n/r) elements) doing an O(l) comparison on each one. This gives a total lookup time of O(l + (nl)/r).

As the List implementation is O(nl) and the HashMap implementation is O(nl/r) (I'm dropping the first l as it's relatively insignificant), lookup performance should be equivalent when r=1 and the HashMap will be faster for all greater values of r.

Note that you can set r when you construct the HashMap using this constructor (set the initialCapacity to r and the loadFactor argument to n/r for your given n and chosen r).