Edit Distance Algorithm_问答_开发者_运维开发者技术经验分享

开发者 https://www.devze.com 2022-12-08 05:44 出处：网络

I have a dictionary of \'n\' words given and there are \'m\' Queries to respond to. I want to output the number of words in dictionary which are edit distance 1 or 2. I want to optimize the result set

I have a dictionary of 'n' words given and there are 'm' Queries to respond to. I want to output the number of words in dictionary which are edit distance 1 or 2. I want to optimize the result set given that n and m are roughly 3000.

Edit added from answer below:

I will try to word it differently.

Initially there are 'n' words given as a set of Dictionary words. Next 'm' words are given which are query words and for each query word, I need to find if the word already exists in Dictionary (Edit Distance '0') or the total count of words in dictionary which are at edit distance 1 or 2 from the dictionary 开发者_StackOverflow中文版words.

I hope the Question is now Clear.

Well, it times out if the Time Complexity is (m*n)n.The naive use of DP Edit Distance Algorithm times out. Even Calculating the Diagonal Elements of 2k+1 times out where k is the threshold here k=3 in above case.

You want to use the Levenshtein distance between two words, but I assume you know that since that's what the question's tags say.

You would have to iterate through your List (assumption) and compare every word in the list with the current query you're executing. You could build a BK-tree to limit your search space, but that sounds like an overkill if you only have ~3000 words.

var upperLimit = 2;
var allWords = GetAllWords();
var matchingWords = allWords
        .Where(word => Levenshtein(query, word) <= upperLimit)
        .ToList();

Added after edit of original question

Finding cases where distance=0 would be easy Contains-queries if you have a case insensitive dictionary. Those cases where distance <= 2 would require a complete scan of the search space, 3000 comparisons per query word. Assuming an equal amount of query words would result in 9 million comparisons.

You mention that it times out, so I presume you have a timeout configured? Could your speed be due to a poor, or slow, implementation of the Levenshtein calculation?

Edit Distance Algorithm

_{(source: itu.edu.tr)}
Above graph is stolen from CLiki: bk-tree

As seen, using bk-tree with an edit distance <= 2 would only visit about 1% of the search space, but that's assuming that you have a very large input data, in their case up to a half million words. I would assume similar numbers in your case, but such a low amount of inputs wouldnt cause much trouble even if stored in a List/Dictionary.

I will try to word it differently.

I hope the Question is now Clear.

Well, it times out if the Time Complexity is (m*n)*n.The naive use of DP Edit Distance Algorithm times out. Even Calculating the Diagonal Elements of 2*k+1 times out where k is the threshold here k=3 in above case.

PS: BK Tree should suffice the purpose.Any Links about Implementation in C++.

public class Solution   {
    public int minDistance(String word1, String word2) {
        int[][] table = new int[word1.length()+1][word2.length()+1];
        for(int i = 0; i < table.length; ++i) {
            for(int j = 0; j < table[i].length; ++j) {
                if(i == 0)
                    table[i][j] = j;
                else if(j == 0)
                    table[i][j] = i;
                else {
                    if(word1.charAt(i-1) == word2.charAt(j-1))
                        table[i][j] = table[i-1][j-1];
                    else
                        table[i][j] = 1 + Math.min(Math.min(table[i-1][j-1], 
                            table[i-1][j]), table[i][j-1]);
                }
            }
        }
        return table[word1.length()][word2.length()];
    }
}

Checkout this simplified solution, solved using Dynamic Programming,

class Solution:
    
    def minDistance(self, word1: str, word2: str) -> int:
        
        return self.edit_distance(word1, word2)
    
    
    @cache
    def edit_distance(self, s, t):
        
        # Edge conditions
        if len(s) == 0:
            return len(t)
        
        if len(t) == 0:
            return len(s)
        
        
        # If 1st char matches
        if s[0] == t[0]:
            return self.edit_distance(s[1:], t[1:])
        
        
        else:
            
            return min(
                1 + self.edit_distance(s[1:], t), # delete
                1 + self.edit_distance(s, t[1:]), # insert
                1 + self.edit_distance(s[1:], t[1:]) # replace
            )