What data structure should I use to find similar strings? For example, when you query Google for string "hapyp brithdya", Google asks you do you mean "happy birthday", a string which is very similar to the previously misspelled 开发者_开发百科string "hapyp brithdya".
What data structure will be most efficient to do this kind of operation in both space and time ?
Please help. Your time is greatly appreciated.
Since you ask for a data structure, I'm going to recommend Levenshtein automata.
These can be extended to a probabilistic variant that returns the most likely (according to corpus statistics) correction of a string. See the essay "How to Write a Spelling Corrector" by Google's Peter Norvig for the basic idea; combining this with Levenshtein automata requires some knowledge of finite-state transducers. See Hassan, Noeman and Hassan for more details.
A learning mechanism that Google uses is the search history. For example, I searched "hapyp brithdya" and then realized that the spellings were incorrect and therefore did not select any link. My next search will be for "happy birthday" the correct spellings. And from this sequence of searches google can figure out that the "hapyp brithdya" actually meant "happy birthday".
Another scoring mechanism based on the same lines that helps google to give more acceptable spelling corrections is that a search for "hapyp brithdya" leading to a click by the user on a link(suggested by google search) containing "happy birthday". This increases the proximity of "happy birthday" to "hapyp brithdya", as compared to (say) "nappy birthday" which was present in a link that the user did not visit
精彩评论