开发者

How to implement a dictionary (Trie vs HashTable and important issues)?

开发者 https://www.devze.com 2023-02-04 21:56 出处:网络
I\'ve ran across several questions and articles saying that dictionary implementation in java is done best using tries. But most of them didn\'t address important issues, as far as I saw it. So, next

I've ran across several questions and articles saying that dictionary implementation in java is done best using tries. But most of them didn't address important issues, as far as I saw it. So, next is a real world task:

Let's assume that I need to implement a dictionary (let's say something like Lingvo, but simpler) using java. For my particular task it is needed to store word definitions and to perform fast dictionary lookup.

Please, address next questions:

  • What data structure should I use then (Trie or HashTable)?
  • How should it(search, datastruct) be organised if I need the dictionary to be case insensitive?
  • What if I want it(search, dictionary) to be case sensitive?

P.S.: Code examples are highly appreciated. :)

Thanks for answers in advanc开发者_运维问答e.

UPDATE:If we're talking about standard DS implementations in java, is it true that HashTable will be the best one for this particular task? Why not HashMap, TreeMap or LinkedHashMap?


I want to address just one point in your question:

A trie is not a general-purpose dictionary data structure. The reason is that a trie is a specialized search tree for (sub)string search. Generally, you will be more interested in general search trees, e.g. binary search trees or B-trees.

All these implementations rely on an ordering of the dictionary elements, and all of them have a logarithmic average-case and worst-case runtime for common operations.

A hash table, by contrast, does not require a relative ordering of the elements. Instead, it requires that elements are hashable and equality comparable. The worst-case characteristic of common hash table characteristics is much worse than for trees, namely linear in the number of elements.

However, with a bit of care the average case for hash tables operations can be made constant (i.e. independent of the container size). What’s more, it can be proven that slower operations are exceedingly rare.

In practice, this means that except for very specialized use-cases, hash tables beat tree-based dictionaries hands down.

The downside to this is that hash tables impose an arbitrary-seeming order on its elements. If you are interested in getting the items from your dictionary in sorted order, hash tables are not for you.

(There are other interesting implementations of dictionaries, e.g. skip lists which rival search trees and probabilistic implementations like the Bloom filter.)

A trie-based implementation can only be used if you are dealing with a dictionary of string values, in which case it is actually often a good choice, in particular if many strings in the dictionary share common prefixes and are rather short.


EDIT stop upvoting this: I misread the question. The OP is not after a dictionary to verify word spellings/suggestions/type-ahead-lookup/auto-completion/whatever (which I thought was what he was after). The OP is after a key/value mapping where for each word there's a definition.

Having worked on dictionaries, I can tell you that you're taking the wrong approach.

It's not as simple as a choice between an hashtable or a trie.

You mention Lingvo: it's much more than just a table.

Do you want close match to be offered suggestions? You may then need to things like generating permutations on what the user entered and for each permutation see if it exists in the dico: if it does, you'd then need to compute its' Levenhstein Edit Distance and suggest first the words that have the shortest LED.

Do you want most likely matches to be auto-completed/suggested (like what Google does)? You'd then need a very advanced data structure like a BK-tree (basically a tree of LED if I understand it correctly).

How many words will you have in your dictionary? You won't be able to use a dictionary made of 400 000 words using Strings and other heavyweight Java objects / data structure without a serious performance hit (once again: a dictionary is more than just one hashtable, a dictionary typically involve several datastructures). This won't easily fit in your users' computer memory. There are known, searchable, ways to store words where every single word can be packed on fewer than 15 bits per word (fewer than 15 bits per word, you read correctly).

In addition to that, you may want to do suggestion based on phonetics: like by using a double-metaphone mapping.

A dictionary, as in a "word dictionary" is so much more than just a key/value table. It is really a complicated beast due to which features the user shall except and due to the amount of data involved. Just plain english + a few specialized domains terminologies, medical, comp-sci, whatever. will give you hundreds of thousands of data: try to put that in a Java HashMap and... Kaboom!


Dictionary implementation in Java, definitely hash collections are best bet.

Regarding HashMap or HashTable : Mainly if your class is used in multithreaded manner than you have to use HashTable, otherwise HashMap is the best option.

HashMap vs TreeMap: If you need insertion order into collection then we have to use TreeMap.

HashMap vs LinkedHashMap: LinkedHashMap implementation differs from HashMap in that it maintains a doubly-linked list running through all of its entries. This linked list defines the iteration ordering, which is normally the order in which keys were inserted into the map (insertion-order). Note that insertion order is not affected if a key is re-inserted into the map. (A key k is reinserted into a map m if m.put(k, v) is invoked when m.containsKey(k) would return true immediately prior to the invocation.)

0

精彩评论

暂无评论...
验证码 换一张
取 消