I need to find a substring SIMILAR to a given pattern in a huge string. The source huge string could be up to 100 Mb in length. The pattern is rather short (10-100 chars). The problem is that I need to find not only exact substrings but also similar substrings that differ from the pattern in several chars (the maximum allowed error count is provided as a parameter).
I开发者_开发知识库s there any idea how to speed up the algorithm?
1)There are many Algorithms related to the string searching. One of them is the famous Knuth–Morris–Pratt Algorithm.
2) You may also want to check Regular Expressions ("Regex") in whatever the language you're using. They will sure help you finding substrings 'similar' to the original ones.
i.e. [Java]
String pat = "Home";
String source = "IgotanewHwme";
for(int i = 0; i < pat.length(); i++){
//split around i .. not including char i itself .. instead, replace it with [a-zA-Z] and match using this new pattern.
String new_pat = "("+pat.substring(0, i)+")"+ "[a-zA-Z]" + "("+pat.substring(i+1, pat.length())+")";
System.out.println(new_pat);
System.out.println(source.matches("[a-zA-Z]*"+new_pat+"[a-zA-Z]*"));
}
and I think it's easy to make it accept any number of error counts.
Sounds like you want Fuzzy/Approximate String Matching. Have a look at the Wikipedia page and see if you can't find an algorithm that suits your needs.
You can have a look at the Levenshtein distance, the Needleman–Wunsch algorithm and the Damerau–Levenshtein distance
They give you metrics evaluating the amount of differences between two strings (ie the numbers of addition, deletion, substitution, etc.). They are often used to measure the variation between DNA.
You will easily find implementations in various languages.
精彩评论