For a given word I'd like to find the n closest misspellings. I was wondering if an open source spell checker like aspell would be useful in that context unless you ha开发者_JS百科ve other suggestions.
For example: 'health'
would give me: ealth, halth, heallth, healf, ...
Spelling correction tools take misspelled words and offer possible correctly spelled alternatives. You seem to want to go in the other direction.
Going from a correctly spelled word to a set of possible misspellings could probably be performed by applying a set of mutation heuristics to common words. These heuristics might do things like:
- randomly adding or removing single characters
- randomly apply transpositions of pairs of characters
- changing characters to other characters based on keyboard layouts
- application of common "point" misspellings; e.g. transposing "ie" to "ei", doubling or undoubling "l"s.
Going from a correctly spelled word to a set of common misspellings is really hard. Probably the only reliable way to do this would be to instrument a spelling checker package used by a large community of users, record the actual spelling corrections made using the spelling checker, and aggregate the results. That is probably (!) beyond the scope of your project.
On revisiting my answer, I think I've missed something.
My heuristics above are mostly for typing error rather than misspellings. A typing error is where the user knows the correct spelling but mistyped the word. A misspelling is where the person doesn't know the correct spelling of a word, and uses either incorrect knowledge or intuition (i.e. a guess). Typical guesses are based on listening to what the word sounds like, and then pick a spelling that (if correct) would most likely be pronounced that way.
So an good heuristic for predicting misspellings would need to be based what the word actually sounds like when spoken. That requires a phonetic dictionary (to go from the actual word to its pronunciation) and a set of rules for generating plausible spellings for the phonetic word. That's more complicated than simple heuristics for typing errors.
精彩评论