Background
I am designing an application which will convert text from one language to other. For example, input text hello
will be converted to specified language text. This is done by having a lookup table for each language. Conversion algorithm has the following steps.
- Looks for exact match. The whole word
hello
will be the pattern. If any matches found, stop processing and return the match. - Else start from left-right and lookup for each character until all the combinations are done. Ex: Iteration1 : pattern =
h
, 2nd -he
, 3rd -hel
and so on. Matched token will be kept in a temporary buffer and returned when there are no more matches. This works like maximal munch rule. - If match found, th开发者_如何学Ce matched text will be removed from original text and continue processing with the remaining text.
This algorithm will have to iterate over the input text multiple times and leads into quadratic complexity. I am trying to optimize this by arranging the lookup table in a tree structure (very similar to suffix tree). If there are lookup text for h
, hel
, lo
, it will be organized like
root
--h
----hel
--lo
This will help me to avoid unnecessary lookups. After matching h
, I need to goto next character only if h
node has got children. This reduces the number of iterations drastically.
Questions
- What is the name of this kind of data structure? Is there a ready-made implementation available for C or C++?
- Do you see any possible improvements on the above algorithm or data structure?
Any thoughts...?
A ternary search tree might be what you're talking about. It is similar to a trie, but more space efficient from what I understand.
Look at 'Trie' data structure.
Why don't use Lucene that indexes text in very fast way, and even more - indexing includes algorithms for
- steming
- fuse matching
and so on.
精彩评论