Which algorithm to use to get longest string matching in two big arrays?_问答_开发者

Which algorithm to use to get longest string matching in two big arrays?

开发者 https://www.devze.com 2023-03-21 01:13 出处：网络

The problem is easy to explain: we have two b开发者_运维百科ig arrays (32 bit integer values) and we have to find all common sequences above a given number of consecutive position (n).

相关专题：algorithm

The problem is easy to explain: we have two b开发者_运维百科ig arrays (32 bit integer values) and we have to find all common sequences above a given number of consecutive position (n).

For instance, if n=3 and arrays to compare are:

a = [1, 3, 5, 7, 3, 2, 7, 4, 6, 7, 2, 1, 0, 4, 6]
b = [2, 5, 7, 3, 2, 3, 4, 5, 6, 3, 2, 7, 4, 6, 0]

The algoritmh should return, two arrays:

r0 = [5, 7, 3, 2]
r1 = [3, 2, 7, 4, 6]

(or better, its relative positions to first array and the number of consecutive bytes matched).

I believe a good point to start is the Longest Common Substring Algorithm, but perhaps anybody knows an algorithm that fits better or exactly with my problem.

I think the algorithm for finding LCS using suffix tree is a perfect fit. You build the suffix tree the same way, but in the final phase, you're not looking for the deepest node that has descendants for both strings. You're looking for all nodes with the depth of more than n that have descendants for both strings.

I think the algorithms on the wikipedia page you reference do almost exactly what you want. You just need to modify them to keep all answers over a certain size rather than only keeping the longest answer. For instance, the dynamic programming solution on that page could be modified as follows:

function LCSubstr(S[1..m], T[1..n], min_size)
    L := array(1..m, 1..n)
    ret := {}
    for i := 1..m
        for j := 1..n
            if S[i] = T[j]
                if i = 1 or j = 1
                    L[i,j] := 1
                else
                    L[i,j] := L[i-1,j-1] + 1
                if L[i,j] >= min_size
                    ret := ret ∪ {S[i-z+1..z]}
    return

As written this will include prefixes as well as longest matches. Those could be filtered out by tracking what string was found into L and removing a prefix from the return set when we discover that it has an extension.

If I understand you correctly and n is the minimal size of a sequence, then I'de use a variation on Boyer-Moor's search algorithm (http://en.wikipedia.org/wiki/Boyer_moore)