High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]

Developer https://www.devze.com 2023-03-20 10:13 Source: web

Closed. This question is opinion-based. It is not currently accepting answers.

Closed 7 years ago.

I am doing clinical message normalization (spell checking), in which I check each given word against a 900,000-word medical dictionary. My main concern is time complexity and performance.

I want to do fuzzy string comparison, but I'm not sure which library to use.

Option 1:

import Levenshtein
Levenshtein.ratio('hello world', 'hello')

Result: 0.625

Option 2:

import difflib
difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()

Result: 0.625

In this example both give the same answer. Do you think both perform alike in this case?

In case you're interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:

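import difflib

import distance      # third-party package
import Levenshtein   # third-party package

# Read the tab-separated file; the trailing empty line is dropped.
with open("titles.tsv", encoding="utf-8") as f:
    title_list = f.read().split("\n")[:-1]

for row in title_list:
    # Fields at index 3 and 4 hold the two titles being compared.
    sr = row.lower().split("\t")

    diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
    lev = Levenshtein.ratio(sr[3], sr[4])
    sor = 1 - distance.sorensen(sr[3], sr[4])   # distance -> similarity
    jac = 1 - distance.jaccard(sr[3], sr[4])    # distance -> similarity

    # One space-separated line per title pair.
    print(diffl, lev, sor, jac)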

I then plotted the results with R:

[Image: plot of the Difflib and Levenshtein similarity values]

Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:

library(ggplot2)
require(GGally)

difflib <- read.table("similarity_measures.txt", sep = " ")
colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")

ggpairs(difflib)

Result:

[Image: ggpairs plot matrix of the difflib, levenshtein, sorensen, and jaccard columns]

The Difflib / Levenshtein similarity really is quite interesting.
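For readers who have not met the last two measures, here is a minimal sketch of Jaccard and Sørensen similarity. It treats each string as a set of characters, which is an assumption about how the distance package tokenizes its input, and the function names are made up for this illustration.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """|A & B| / |A | B| over the sets of characters in a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here for two empty strings
    return len(set_a & set_b) / len(set_a | set_b)


def sorensen_similarity(a: str, b: str) -> float:
    """2 * |A & B| / (|A| + |B|) over the sets of characters in a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # same convention as above
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# "night" and "nacht" share {n, h, t}; together they use 7 distinct characters.
print(jaccard_similarity("night", "nacht"))   # 3 / 7
print(sorensen_similarity("night", "nacht"))  # 2 * 3 / (5 + 5) = 0.6
```

Because both measures ignore character order and repetition, they can rate anagrams as identical, which is one reason they may diverge from the Difflib and Levenshtein values.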

2018 edit: If you're working on identifying similar strings, you could also check out minhashing--there's a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext
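To give a rough idea of what minhashing does, here is a small pure-Python sketch. It is not the intertext implementation. The shingle size, the number of hash functions, the use of salted BLAKE2b as the hash family, and all function names are assumptions made for this example.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(text: str, num_hashes: int = 64, k: int = 3) -> list:
    """For each seeded hash function, keep the smallest hash over all shingles."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of positions where two signatures agree estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


sig_1 = minhash_signature("the adventures of tom sawyer")
sig_2 = minhash_signature("adventures of tom sawyer")
sig_3 = minhash_signature("a brief history of time")
print(estimated_jaccard(sig_1, sig_2))  # similar titles: high estimate
print(estimated_jaccard(sig_1, sig_3))  # unrelated titles: low estimate
```

Each signature is built in time linear in the length of its text, and two signatures are compared in time proportional to the number of hash functions, independent of text length. Avoiding an all-pairs comparison over a large collection usually needs an extra indexing step such as locality-sensitive hashing, which this sketch leaves out.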

difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm: it computes twice the number of matching characters divided by the total number of characters in the two strings.
Levenshtein uses the Levenshtein algorithm: it computes the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into the other.
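The SequenceMatcher formula can be checked by hand with the standard library. The sketch below recomputes the ratio as 2 * M / T from the matching blocks; matching_ratio is a name invented for this example.

```python
from difflib import SequenceMatcher


def matching_ratio(a: str, b: str) -> float:
    """Recompute SequenceMatcher.ratio() as 2 * M / T from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the final sentinel block has size 0).
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of characters in both strings.
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0


a, b = "hello world", "hello"
print(matching_ratio(a, b))                 # 2 * 5 / 16 = 0.625
print(SequenceMatcher(None, a, b).ratio())  # 0.625
```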

Complexity

SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common. (from the Python difflib documentation)

Levenshtein is O(m*n), where m and n are the lengths of the two input strings.
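The O(m*n) bound comes from the standard dynamic-programming table with one cell per pair of prefixes. Here is a plain-Python sketch of that textbook approach (Wagner-Fischer, keeping only two rows). It illustrates the complexity and is not the C implementation in the Levenshtein module.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the row to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))     # 3
print(levenshtein_distance("hello world", "hello"))  # 6
```

The two nested loops visit every (i, j) pair once, which gives m*n cell updates.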

Performance

According to the source code of the Levenshtein module: Levenshtein has some overlap with difflib (SequenceMatcher). It supports only strings, not arbitrary sequence types, but on the other hand it's much faster.
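Rather than taking that on faith, you can time both on your own data. The sketch below is one possible harness using timeit. The sample word pairs and the helper names are hypothetical, and the Levenshtein timing is only collected if that third-party module is installed.

```python
import difflib
import timeit


def difflib_ratio(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()


def seconds_per_call(scorer, pairs, repeat: int = 3) -> float:
    """Best-of-`repeat` average time of one scorer(a, b) call over all pairs."""
    def run():
        for a, b in pairs:
            scorer(a, b)
    return min(timeit.repeat(run, number=1, repeat=repeat)) / len(pairs)


# Hypothetical sample data; substitute real (input word, dictionary word) pairs.
pairs = [("hypertension", "hypotension"), ("asprin", "aspirin")] * 500

scorers = {"difflib": difflib_ratio}
try:
    import Levenshtein  # third-party; skipped if not installed
    scorers["Levenshtein"] = Levenshtein.ratio
except ImportError:
    pass

timings = {name: seconds_per_call(fn, pairs) for name, fn in scorers.items()}
for name, secs in timings.items():
    print(f"{name}: {secs * 1e6:.1f} microseconds per comparison")
```

Multiplying the per-comparison time by the dictionary size gives a rough estimate of the cost of checking one input word against every entry.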