I'm OCRing some text from two different sources. They can each make mistakes in different places, where they won't recognize a letter/group of letters. If they don't recognize something, it's replaced with a ?. For example, if the word is Roflcopter
, one source might return Ro?copter
, while another, Roflcop?er
. I'd like a function that returns whether two matches might be equivalent, allowing for multiple ?
s. Example:
match("Ro?copter", "Roflcop?er") --> True
match("Ro?copter", "Roflcopter") --> True
match("Roflcopter", "Roflcop?er") --> True
match("Ro?co?er", "Roflcop?er") --> True
So far I can match one OCR with a perfect one by using regular expressions:
>>> def match(tn1, tn2):
tn1re = tn1.replace("?", ".{0,4}")
tn2re = tn2.rep开发者_StackOverflowlace("?", ".{0,4}")
return bool(re.match(tn1re, tn2) or re.match(tn2re, tn1))
>>> match("Roflcopter", "Roflcop?er")
True
>>> match("R??lcopter", "Roflcopter")
True
But this doesn't work when they both have ?s in different places:
>>> match("R??lcopter", "Roflcop?er")
False
Well, as long as one ? corresponds to one character, then I can suggest a performant and a compact enough method.
def match(str1, str2):
if len(str1) != len(str2): return False
for index, ch1 in enumerate(str1):
ch2 = str2[index]
if ch1 == '?' or ch2 == '?': continue
if ch1 != ch2: return False
return True
>>> ================================ RESTART ================================
>>>
>>> match("Roflcopter", "Roflcop?er")
True
>>> match("R??lcopter", "Roflcopter")
True
>>>
>>> match("R??lcopter", "Roflcop?er")
True
>>>
Edit: Part B), brain-fart free now.
def sets_match(set1, set2):
return any(match(str1, str2) for str1 in set1 for str2 in set2)
>>> ================================ RESTART ================================
>>>
>>> s1 = set(['a?', 'fg'])
>>> s2 = set(['?x'])
>>> sets_match(s1, s2) # a? = x?
True
>>>
Thanks to Hamish Grubijan for this idea. Every ? in my ocr'd names can be anywhere from 0 to 3 letters. What I do is expand each string to a list of possible expansions:
>>> list(expQuestions("?flcopt?"))
['flcopt', 'flcopt@', 'flcopt@@', 'flcopt@@@', '@flcopt', '@flcopt@', '@flcopt@@', '@flcopt@@@', '@@flcopt', '@@flcopt@', '@@flcopt@@', '@@flcopt@@@', '@@@flcopt', '@@@flcopt@', '@@@flcopt@@', '@@@flcopt@@@']
then I expand both and use his matching function, which I called matchats
:
def matchOCR(l, r):
for expl in expQuestions(l):
for expr in expQuestions(r):
if matchats(expl, expr):
return True
return False
Works as desired:
>>> matchOCR("Ro?co?er", "?flcopt?")
True
>>> matchOCR("Ro?co?er", "?flcopt?z")
False
>>> matchOCR("Ro?co?er", "?flc?pt?")
True
>>> matchOCR("Ro?co?e?", "?flc?pt?")
True
The matching function:
def matchats(l, r):
"""Match two strings with @ representing exactly 1 char"""
if len(l) != len(r): return False
for i, c1 in enumerate(l):
c2 = r[i]
if c1 == "@" or c2 == "@": continue
if c1 != c2: return False
return True
and the expanding function, where cartesian_product
does just that:
def expQuestions(s):
"""For OCR w/ a questionmark in them, expand questions with
@s for all possibilities"""
numqs = s.count("?")
blah = list(s)
for expqs in cartesian_product([(0,1,2,3)]*numqs):
newblah = blah[:]
qi = 0
for i,c in enumerate(newblah):
if newblah[i] == '?':
newblah[i] = '@'*expqs[qi]
qi += 1
yield "".join(newblah)
Using the Levenshtein distance may be useful. It will give a value of how similar the strings are to each other. This will work if they are different lengths, too. The linked page has some psuedocode to get you started.
You'll end up with something like this:
>>> match("Roflcopter", "Roflcop?er")
1
>>> match("R??lcopter", "Roflcopter")
2
>>> match("R?lcopter", "Roflcop?er")
3
So you could have a maximum threshold below which you say they may match.
This might not be the most Pythonic of options, but if a ?
is allowed to match any number of characters, then the following backtracking search does the trick:
def match(a,b):
def matcher(i,j):
if i == len(a) and j == len(b):
return True
elif i < len(a) and a[i] == '?' \
or j < len(b) and b[j] == '?':
return i < len(a) and matcher(i+1,j) \
or j < len(b) and matcher(i,j+1)
elif i == len(a) or j == len(b):
return False
else:
return a[i] == b[j] and matcher(i+1,j+1)
return matcher(0,0)
This may be adapted to be more stringent in what to match. Also, to save stack space, the final case (i+1,j+1
) may be transformed into a non-recursive solution.
Edit: some more clarification in response to the reactions below. This is an adaptation of a naive matching algorithm for simplified regexes/NFAs (see Kernighan's contrib to Beautiful Code, O'Reilly 2007 or Jurafsky & Martin, Speech and Language Processing, Prentice Hall 2009).
How it works: the matcher
function recursively walks through both strings/patterns, starting at (0,0)
. It succeeds when it reaches the end of both strings (len(a),len(b))
; it fails when it encounters two unequal characters or the end of one string while there are still characters to match in the other string.
When matcher
encounters a variable (?
) in either string (say a
), it can do two things: either skip over the variable (matching zero characters), or skip over the next character in b
but keep pointing to the variable in a
, allowing it to match more characters.
精彩评论