I'm building a spelling corrector for search engine queries by implementing the method described in "Spelling correction as an iterative process that exploits the collective knowledge of web users".
The high-level approach is as follows: for a given query, generate possible correction candidates (words in the query log within a certain edit distance) for each unigram and bigram, then perform a modified Viterbi search to find the most likely sequence of candidates given bigram frequencies. Repeat this process on the output until the sequence no longer changes, i.e. it is already the most probable one.
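To make the loop above concrete, here is a minimal sketch of how I understand it. It is not the paper's implementation: it assumes plain Levenshtein distance, add-one smoothed bigram probabilities, no error model, and unigram candidates only. The names `candidates`, `viterbi` and `correct`, and the toy counts, are made up for illustration.

```python
import math
from collections import Counter

def edit_distance(a, b):
    """Plain Levenshtein distance (an assumption; other distances would also work)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def candidates(word, vocab, max_dist=1):
    """Query-log words within max_dist of word, plus the word itself."""
    found = [w for w in vocab if edit_distance(word, w) <= max_dist]
    if word not in found:
        found.append(word)
    return found

def viterbi(query, unigrams, bigrams, max_dist=1):
    """Most likely candidate sequence under add-one smoothed bigram counts."""
    total = sum(unigrams.values())
    vocab_size = len(unigrams)

    def log_p(prev, word):
        if prev is None:  # first word: fall back to the unigram estimate
            return math.log((unigrams[word] + 1) / (total + vocab_size))
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

    best = {None: (0.0, [])}  # last word -> (log score, best path ending there)
    for token in query.split():
        nxt = {}
        for cand in candidates(token, unigrams, max_dist):
            score, path = max(
                ((s + log_p(p, cand), seq) for p, (s, seq) in best.items()),
                key=lambda t: t[0],
            )
            nxt[cand] = (score, path + [cand])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]

def correct(query, unigrams, bigrams, max_rounds=5):
    """Re-run the search on its own output until nothing changes."""
    for _ in range(max_rounds):
        fixed = " ".join(viterbi(query, unigrams, bigrams))
        if fixed == query:
            break
        query = fixed
    return query

# Toy query-log counts, invented for this example.
unigrams = Counter({"britney": 50, "spears": 40, "brittany": 5, "spear": 3})
bigrams = Counter({("britney", "spears"): 30})
print(correct("britny spears", unigrams, bigrams))  # -> britney spears
```

With the same toy counts, the correctly spelled single-word query `spear` comes back as the more frequent `spears`, which is the kind of over-correction the trusted lexicon is supposed to prevent.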
The modification to the Viterbi search is such that if two adjacent words are both found in a trusted lexicon, at most one can be corrected. This is especially important for avoiding correction of properly-spelled single-word queries to words of higher frequency.
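As I read it, the restriction amounts to a filter on Viterbi transitions, roughly like the sketch below. The function name `transition_allowed` and the tiny lexicon are hypothetical, and the sketch covers only the adjacent-pair case; how the rule extends to single-word queries depends on details (such as boundary handling) that I have not reproduced here.

```python
def transition_allowed(orig_pair, cand_pair, lexicon):
    """Reject a transition that rewrites two adjacent trusted words at once.

    orig_pair: the two adjacent words as typed in the query
    cand_pair: the candidate replacements proposed for them
    lexicon:   set of trusted, correctly spelled words
    """
    (o1, o2), (c1, c2) = orig_pair, cand_pair
    both_trusted = o1 in lexicon and o2 in lexicon
    both_changed = c1 != o1 and c2 != o2
    return not (both_trusted and both_changed)

# Hypothetical trusted lexicon for illustration.
lexicon = {"new", "york"}

print(transition_allowed(("new", "york"), ("now", "fork"), lexicon))  # False
print(transition_allowed(("new", "york"), ("new", "fork"), lexicon))  # True
print(transition_allowed(("nwe", "york"), ("new", "fork"), lexicon))  # True
```

Inside the search, a transition that fails this check would simply be skipped when scoring paths, so the untrusted-word case is unaffected.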
My question is where to find such a lexicon. It should be in English and contain proper nouns (first/last names, places, brand names, etc.) likely to show up in search queries, as well as common and uncommon English words. Even a push in the right direction would be useful.
Also, if anyone is reading this and has any suggestions for improvement on the methodology supplied in the paper, I am open to those as well given that this is my first foray into NLP.
The best lexicon for this purpose is probably the Google Web 1T 5-gram data set.
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
Unfortunately, it is not free unless your university is a member of LDC.
You could also try the corpora that ship with packages like NLTK for Python, but the Google one seems to be the best for your purpose since it is built from web text, which is already close to what people type into search engines.