efficient algorithm to perform spell check on HTML document

I have an HTML document, a list of common spelling mistakes, and the correct spelling for each case. The HTML documents will be up to ~50 pages, and there are ~30K spelling-correction entries.

What is an efficient way to correct all spelling mistakes in this HTML document?

(Note: my implementation will be in Python, in case you know of any relevant libraries.)

I have thought of two possible approaches:

  • build a hashtable of the spelling data
  • parse the text out of the HTML
  • split the text by whitespace into tokens
  • if a token is in the spelling hashtable, replace it with the correction
  • build a new HTML document with the updated text (sketched below)
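
A minimal sketch of that pipeline, assuming BeautifulSoup for the HTML parsing and a plain dict for the corrections (the tab-separated corrections file is a hypothetical format):

```python
import re
from bs4 import BeautifulSoup

def load_corrections(path):
    """Build a hashtable mapping misspelling -> correction from a TSV file."""
    corrections = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            wrong, right = line.rstrip("\n").split("\t")
            corrections[wrong] = right
    return corrections

def correct_html(html, corrections):
    """Replace misspelled tokens in every text node, leaving markup intact."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=True):
        if node.parent.name in ("script", "style"):
            continue
        # Split on whitespace but keep the whitespace runs, so the text
        # can be reassembled exactly as it was.
        parts = re.split(r"(\s+)", str(node))
        fixed = "".join(corrections.get(p, p) for p in parts)
        if fixed != node:
            node.replace_with(fixed)
    return str(soup)
```

Operating on text nodes rather than the raw HTML string keeps tag names and attribute values safe from accidental "correction".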

This approach will fail for multi-word spelling corrections, which will exist. The following is a simpler, though seemingly less efficient, approach that will work for multi-word entries:

  • iterate over the spelling data
  • search for each misspelling in the HTML document
  • if it occurs, replace it with the correction (sketched below)
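
A sketch of this second approach: one regex pass over the text per correction entry, which handles multi-word entries for free since each misspelling is matched as a literal phrase (word boundaries keep e.g. "teh" from matching inside a longer word):

```python
import re

def correct_text_slow(text, corrections):
    """One pass over the text per entry: simple, but O(entries * doc size)."""
    for wrong, right in corrections.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text)
    return text
```

In practice you would still run this over extracted text nodes rather than the raw HTML, so markup is never touched.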


You are correct that the first approach will be MUCH faster than the second. (Additionally, I would recommend looking into tries instead of a straight hash; the space savings will be quite dramatic for 30k words.)

To still handle the multi-word cases, you could keep track of the previous token and also check your hash for a combined string such as "prev cur".
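
One way to sketch that (a lookahead variant of the same idea, assuming for simplicity that multi-word entries are exactly two words long):

```python
def correct_tokens(tokens, corrections):
    """Single pass that tries the 'prev cur' pair before the single token."""
    out = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and pair in corrections:
            out.append(corrections[pair])
            i += 2  # both tokens consumed by the two-word correction
        else:
            out.append(corrections.get(tokens[i], tokens[i]))
            i += 1
    return out
```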

Alternatively, you could leave the multi-word corrections out of the hash and combine your two approaches: first use the hash for single words, then do a scan for the multi-word combos (or vice versa). This can still be quite fast if the number of multi-word corrections is small.

Be careful though: pulling out word tokens is trickier than just splitting on whitespace. You don't want to fail to correct an error simply because 'instence,' with a trailing comma isn't in your hash.
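
For example, peeling punctuation off the ends of each token before the hash lookup (a sketch; hyphens and apostrophes will need case-by-case decisions):

```python
import string

def correct_token(token, corrections):
    """Look up the token with leading/trailing punctuation stripped."""
    core = token.strip(string.punctuation)
    if core and core in corrections:
        start = token.find(core)  # index where the stripped core begins
        return token[:start] + corrections[core] + token[start + len(core):]
    return token

# correct_token("instence,", {"instence": "instance"}) -> "instance,"
```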


I agree with Rob's suggestion of a character-based trie, because I programmed a spelling-correction algorithm ages ago based on a dictionary of valid words stored as a trie. Using branch-and-bound, I was able to suggest possibly correct spellings of misspelled words (ranked by Levenshtein distance). In addition, since a trie is just a big finite-state machine, it is fairly easy to add common prefixes and suffixes, so it can handle "words" like "postnationalizationalism's".
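
Not the poster's original code, but a sketch of the standard form of that branch-and-bound search over a trie: maintain one row of the Levenshtein DP table per trie edge, and prune any branch whose entire row already exceeds the distance budget:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set to the full word at nodes that end a word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def suggest(root, target, max_dist):
    """Return (word, distance) pairs within max_dist edits of target."""
    results = []
    first_row = list(range(len(target) + 1))  # distances from the empty prefix
    for ch, child in root.children.items():
        _search(child, ch, target, first_row, max_dist, results)
    return results

def _search(node, ch, target, prev_row, max_dist, results):
    # Extend the Levenshtein DP table by one row for this trie character.
    row = [prev_row[0] + 1]
    for i in range(1, len(target) + 1):
        cost = 0 if target[i - 1] == ch else 1
        row.append(min(row[i - 1] + 1,           # insertion
                       prev_row[i] + 1,          # deletion
                       prev_row[i - 1] + cost))  # substitution
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    # Branch-and-bound: if every cell exceeds the budget, no suffix can help.
    if min(row) <= max_dist:
        for next_ch, child in node.children.items():
            _search(child, next_ch, target, row, max_dist, results)
```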

