I am trying to work out which entries in my data store are near-duplicates using approximate string matching.
Is there an existing implementation of the following approach in Python, or do I need to roll my own?
Thanks :)
From Wikipedia:
...
A brute-force approach would be to compute the edit distance to P for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(n³ m).
A better solution[3][4], utilizing dynamic programming, uses an alternative formulation of the problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the first i characters of the pattern, P_i, and any substring T_{j'..j} of T that ends at position j.
What is the most efficient way to apply this to many strings?
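To make the quoted formulation concrete, here is a minimal sketch of that dynamic-programming recurrence in plain Python (the function name and example strings are my own):

def fuzzy_substring_distance(pattern, text):
    """Minimum edit distance between pattern and any substring of text."""
    m, n = len(pattern), len(text)
    # D[0][j] = 0 for every j: a match may start anywhere in the text.
    previous = [0] * (n + 1)
    for i in range(1, m + 1):
        current = [i] + [0] * n  # D[i][0] = i: pattern vs. empty substring
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            current[j] = min(previous[j] + 1,         # pattern char unmatched
                             current[j - 1] + 1,      # text char unmatched
                             previous[j - 1] + cost)  # substitute or match
        previous = current
    # The best match may end at any position j, so take the row minimum.
    return min(previous)

print(fuzzy_substring_distance("acbaa", "acaacbaaca"))  # 0 (exact substring)
print(fuzzy_substring_distance("acxaa", "acaacbaaca"))  # 1 (one edit away)

Each call is O(len(pattern) * len(text)), so for a large number of entries the library-based answers below are usually a better starting point.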
Yes.
google("python levenshtein")
difflib.get_close_matches should do the job.
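For instance, a sketch with a made-up list of entries:

from difflib import get_close_matches

entries = ["Acme Widgets Ltd", "ACME Widgets Limited", "Bolt Supplies", "Acme Widgetz Ltd"]

# For each entry, list the closest other entries above the cutoff similarity.
for entry in entries:
    others = [e for e in entries if e is not entry]
    print(entry, "->", get_close_matches(entry, others, n=2, cutoff=0.6))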
difflib might be the answer, e.g.:
from difflib import context_diff

a = 'acaacbaaca'
b = 'accabcaacc'
# context_diff treats each character of a and b as a "line" here;
# join the resulting diff lines so they can be printed (Python 3 syntax).
print(''.join(context_diff(a, b)))
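If what you actually need is a similarity score rather than a printed diff, difflib's SequenceMatcher gives one directly (same a and b as above):

from difflib import SequenceMatcher

a = 'acaacbaaca'
b = 'accabcaacc'
print(SequenceMatcher(None, a, b).ratio())  # similarity in [0, 1]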
Levenshtein distance performs very similarly to fuzzywuzzy's standard ratio() function; fuzzywuzzy uses difflib: http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
Example from the fuzzywuzzy documentation (https://github.com/seatgeek/fuzzywuzzy), with the import added:

from fuzzywuzzy import fuzz

fuzz.ratio("this is a test", "this is a test!")  # 96
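To apply this across many strings, fuzzywuzzy's process module can rank candidates, and fuzz.partial_ratio scores the best matching substring, which is closer to the approximate-substring problem in the question. A sketch with a made-up entries list:

from fuzzywuzzy import fuzz, process

entries = ["this is a test", "this is a test!", "something else entirely"]

# Score the query against every entry and return the top matches as (entry, score) pairs.
print(process.extract("this is a test", entries, limit=2))

# partial_ratio looks for the best matching substring of the longer string.
print(fuzz.partial_ratio("this is a test", "xx this is a test xx"))  # 100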