I'm writing a script that goes through tens of thousands of rows of account information, cleans mistyped addresses, and prints a report on how each address was cleaned. The biggest source of unclean addresses is mistyped street names (it's amazing how many ways you can spell a street name). At the moment my script takes the input street name, applies a series of edits specific to Norwegian (v. becomes vegen, gt. becomes gata, etc.), and searches for the street name in a ~2 million row database of addresses. If it doesn't find a match, it splits off the latter half of the street name, replaces it with a wildcard, and tries several variations of the wildcard search.
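For illustration, the normalize-then-wildcard step described above might look roughly like this in Python. This is only a sketch: the function names, the `min_prefix` parameter, and the exact way the wildcard variations are generated are assumptions, and the abbreviation table contains only the two examples mentioned.

```python
# Only the two abbreviations mentioned above; a real table would be longer.
ABBREVIATIONS = {"v.": "vegen", "gt.": "gata"}


def expand_abbreviations(street):
    """Replace a trailing abbreviation with its full form."""
    for short, full in ABBREVIATIONS.items():
        if street.lower().endswith(short):
            return street[: -len(short)] + full
    return street


def wildcard_patterns(street, min_prefix=3):
    """Build LIKE patterns that keep a shrinking prefix of the name.

    Starts by keeping the first half and drops one character at a time
    until only min_prefix characters are left (hypothetical strategy).
    """
    start = max(len(street) // 2, min_prefix)
    return [street[:n] + "%" for n in range(start, min_prefix - 1, -1)]


if __name__ == "__main__":
    name = expand_abbreviations("Storgt.")
    print(name, wildcard_patterns(name))
```

Each pattern would then be tried in turn in a `LIKE` query until one returns a match.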
Anyway, my question is:
Does MySQL include anything that could make this easier for me? I recall hearing mention of a "search" function in MySQL that finds the cells in a column with the most matching characters or something. In the cases where my wild-card search fails it would be a great tool to have.
Anything else that would help with finding matches to mistyped addresses would be great.
One option is to try SOUNDEX. It matches on pronunciation, so it may get you closer when people misspell a street name phonetically. Note that MySQL's SOUNDEX() is designed for English, so results on Norwegian names may be less reliable.
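To get a feel for what a phonetic match does, here is a minimal Python sketch of the classic four-character Soundex code. It is not a reimplementation of MySQL's SOUNDEX(), which can return longer strings and may differ in edge cases; it only illustrates the idea. Letters without a code, including æ, ø and å, are treated like vowels here, which is an assumption.

```python
# Build the letter-to-digit table used by classic Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code of word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return (result + "000")[:4]


if __name__ == "__main__":
    for name in ("Kirkevegen", "Kirkeveien", "Storgata", "Stoorgata"):
        print(name, soundex(name))
```

Two spellings that produce the same code are candidates for being the same street.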
You might also try the Levenshtein distance algorithm, which is probably closer to what you are looking for. It measures how close one word is to another by counting the single-character edits (insertions, deletions, substitutions) needed to turn one into the other. It is used for spell checking and similar tasks, and it could be useful for spotting bad data in address fields. Here is a link to it:
http://www.merriampark.com/ld.htm
If you want a MySQL function that implements the Levenshtein distance algorithm, you can look at an example here:
http://www.artfulsoftware.com/infotree/queries.php#552
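If the matching is done in the cleaning script rather than in the database, the distance itself is short to write. The following Python sketch is the standard dynamic-programming version; the `closest` helper and its `max_distance` threshold are hypothetical and would need tuning against real data.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest(name, candidates, max_distance=2):
    """Return the candidate nearest to name, or None if none is close enough."""
    best = min(candidates, key=lambda c: levenshtein(name, c), default=None)
    if best is not None and levenshtein(name, best) <= max_distance:
        return best
    return None


if __name__ == "__main__":
    print(closest("Storgtaa", ["Storgata", "Kirkevegen"]))
```

Comparing one input against all ~2 million rows this way would be slow, so it probably makes sense to narrow the candidates first, for example with the wildcard search.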
You might want to play around with FULLTEXT indexes and fuzzy MATCH ... AGAINST queries. Keep in mind that with the default MyISAM setting (ft_min_word_len = 4), words shorter than 4 characters are excluded from the index.
This takes a little more work, but:
Create a table words with the fields
word
num_appeared
And a pivot table between words and addresses
address_id
word_id
Traverse your addresses table, split each address into words, insert each word into the words table (incrementing num_appeared for words that are already there), and create the matching record in the pivot table. When you are done, sort the words table by num_appeared ASC: the words at the top are the ones most likely to be mistyped. You can then write a script that searches Google for those words; the suggestion Google makes might be the correct form of the word.
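The counting step can be tried out in memory before creating the tables. This Python sketch uses a `Counter` in place of the words table and a dict of sets in place of the pivot table; the function names are made up for the example, and skipping tokens that are not purely alphabetic (house numbers and the like) is an assumption.

```python
from collections import Counter, defaultdict


def build_word_index(addresses):
    """addresses maps address_id -> address text.

    Returns (counts, pivot): counts plays the role of words.num_appeared,
    pivot maps each word to the set of address ids that contain it.
    """
    counts = Counter()
    pivot = defaultdict(set)
    for address_id, text in addresses.items():
        for word in text.lower().split():
            if not word.isalpha():
                continue  # skip house numbers and similar tokens
            counts[word] += 1
            pivot[word].add(address_id)
    return counts, pivot


def rarest_words(counts, limit=10):
    """Words sorted by ascending count: the likeliest typos come first."""
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))[:limit]


if __name__ == "__main__":
    sample = {1: "Storgata 1", 2: "Storgata 5", 3: "Stargata 7"}
    counts, pivot = build_word_index(sample)
    print(rarest_words(counts, 1), pivot["stargata"])
```

The pivot lookup tells you which rows to fix once a rare word has been confirmed as a misspelling.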