I would like to know if there is a library that will tell me approximately how similar two strings are
I am not looking for anything specific, but in this 开发者_开发知识库case:
a = 'alex is a buff dude'
b = 'a;exx is a buff dud'
we could say that b
and a
are approximately 90% similar.
Is there a library which can do this?
import difflib
>>> a = 'alex is a buff dude'
>>> b = 'a;exx is a buff dud'
>>> difflib.SequenceMatcher(None, a, b).ratio()
0.89473684210526316
http://en.wikipedia.org/wiki/Levenshtein_distance
There are a few libraries on pypi, but be aware that this is expensive, especially for longer strings.
You may also want to check out python's difflib: http://docs.python.org/library/difflib.html
Look for Levenshtein algorithm for comparing strings. Here's a random implementation found via google: http://hetland.org/coding/python/levenshtein.py
Other way is to use longest common substring. Here a implementation in Daniweb with my lcs implementation (this is also defined in difflib)
Here is simple length only version with list as data structure:
def longest_common_sequence(a,b):
n1=len(a)
n2=len(b)
previous=[]
for i in range(n2):
previous.append(0)
over = 0
for ch1 in a:
left = corner = 0
for ch2 in b:
over = previous.pop(0)
if ch1 == ch2:
this = corner + 1
else:
this = over if over >= left else left
previous.append(this)
left, corner = this, over
return 200.0*previous.pop()/(n1+n2)
Here is my second version which actualy gives the common string with deque data structure (also with the example data use case):
from collections import deque
a = 'alex is a buff dude'
b = 'a;exx is a buff dud'
def lcs_tuple(a,b):
n1=len(a)
n2=len(b)
previous=deque()
for i in range(n2):
previous.append((0,''))
over = (0,'')
for i in range(n1):
left = corner = (0,'')
for j in range(n2):
over = previous.popleft()
if a[i] == b[j]:
this = corner[0] + 1, corner[1]+a[i]
else:
this = max(over,left)
previous.append(this)
left, corner = this, over
return 200.0*this[0]/(n1+n2),this[1]
print lcs_tuple(a,b)
""" Output:
(89.47368421052632, 'aex is a buff dud')
"""
精彩评论