开发者

efficient way to find matches against two strings

开发者 https://www.devze.com 2023-01-15 23:20 出处:网络
I need to find all equal substrings against two strings. I\'ve tried to use suffix tree to lookup substrings and it works fast, but too memory consuming (inap开发者_StackOverflow社区propriate for my t

I need to find all equal substrings against two strings. I've tried to use suffix tree to lookup substrings and it works fast, but too memory consuming (inap开发者_StackOverflow社区propriate for my task).

Any other ideas?


Aho-corasick is a great implementation for matching any number of strings with minimal performance issues. Did you try that?


You could do sliding window, though that's less memory, but more time consuming.

The smallest substring is one character (actually, the empty word is one, but let's leave that aside).

Take character 1 of string 1 and save the positions of that character in string 2 in some sort of data structure, like a map or an array.

Then you take the next one, (character 2 of string 1) and do the same thing.

Once you've reached the end of string 1, you start over but this time you take every two characters of string 1 and alway advance by one character checking for all positions in string 2.

You do this as long as the substring you're cheking is equal in length to string 1, meaning you compare string 1 and 2 as a whole.

Keep in mind: when string 2 is longer than string 1, you need to advance the whole string 1 once every character on string 2, since string 1 might be a substring of string 2.

If string 1 is larger than string 2, you can stop cheking, once your substring is longer that string 2, all other substrings will have been checked by then. Ideally, you'd end up having a map, (which in its simplest form is a two dimensional array), that holds the positions of each substring of string 1 in string 2.


Why do you say that suffix tree is too memory consuming? If implemented properly, it consumes only O(n) memory.

0

精彩评论

暂无评论...
验证码 换一张
取 消