开发者

Is there any implementation of this string matching method in python?

开发者 https://www.devze.com 2023-02-14 23:18 出处:网络
I am trying to work out which entries in my data store are near-duplicates using approximate string matching.

I am trying to work out which entries in my data store are near-duplicates using approximate string matching.

Is there any implementation of the following approach in python, or do i need to try and roll my own?

Thanks :)

from wikipedia:

...

A brute-force approach would be to compute the edit distance to P for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(n3 m)

A better solution[3][4], utilizing dynamic programming, uses an alternative formulation of th开发者_JAVA技巧e problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the i first characters of the pattern, Pi, and any substring Tj',j of T that ends at position j.

What is the most efficient way to apply this to many strings?


Yes.

google("python levenshtein")


difflib.get_close_matches should do the work.


difflib might be the answer, eg,

from difflib import context_diff

a = 'acaacbaaca'
b = 'accabcaacc'

print ''.join(context_diff(a,b))


Levenshtein distance performs very similarly to the fuzzywuzzy standard ratio() function. fuzzywuzzy uses difflib http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python

example from the fuzzywuzzy documentation: https://github.com/seatgeek/fuzzywuzzy

fuzz.ratio("this is a test", "this is a test!")
    96
0

精彩评论

暂无评论...
验证码 换一张
取 消