开发者

Count Occurence of Needle String in Haystack String, most optimally?

开发者 https://www.devze.com 2022-12-24 20:08 出处:网络
The Problem is simple Find \"ABC\" in \"ABCDSGDABCSAGAABCCCCAAABAABC\" without using String.split(\"ABC\")

The Problem is simple Find "ABC" in "ABCDSGDABCSAGAABCCCCAAABAABC" without using String.split("ABC")

Here is the solution I propose, I'm looking for any solutions that might be better than this one.

public static void main(String[] args) {
 String haystack = "ABCDSGDABCSAGAABCCCCAAABAABC";
 String needle = "ABC";
 char [] needl = needle.toCharArray();
 int needleLen = needle.length();
 int found=0;
 char hay[] = haystack.toCharArray();
 int index =0;
 int chMatched =0;

 for (int i=0; i<hay.length; i++){

  if (index >= needleLen || chMatched==0)
   index=0;
  System.out.print("\nchar-->"+hay[i] + ", with->"+needl[index]);

  if(hay[i] == needl[index]){
   chMatched++;
   System.out.println(", matched");
  }else {
   chMatched=0;
   index=0;
   if(hay[i] == needl[index]){
    chMatched++;
    System.out.print("\ncha开发者_运维问答r->"+hay[i] + ", with->"+needl[index]);
    System.out.print(", matched");
   }else
   continue;
  }

  if(chMatched == needleLen){
   found++;
   System.out.println("found. Total ->"+found);
  }
  index++;
 } 
 System.out.println("Result Found-->"+found);
 }

It took me a while creating this one. Can someone suggest a better solution (if any) P.S. Drop the sysouts if they look messy to you.


How about:

boolean found = haystack.indexOf("ABC") >= 0;

**Edit - The question asks for number of occurences, so here's a modified version of the above:

public static void main(String[] args)
{
    String needle = "ABC";
    String haystack = "ABCDSGDABCSAGAABCCCCAAABAABC";

    int numberOfOccurences = 0;
    int index = haystack.indexOf(needle);
    while (index != -1)
    {
        numberOfOccurences++;
        haystack = haystack.substring(index+needle.length());
        index = haystack.indexOf(needle);
    }

    System.out.println("" + numberOfOccurences);
}


If you're looking for an algorithm, google for "Boyer-Moore". You can do this in sub-linear time.

edit to clarify and hopefully make all the purists happy: the time bound on Boyer-Moore is, formally speaking, linear. However the effective performance is often such that you do many fewer comparisons than you would with a simpler approach, and in particular you can often skip through the "haystack" string without having to check each character.


You say your challenge is to find ABC within a string. If all you need is to know if ABC exists within the string, a simple indexOf() test will suffice.

If you need to know the number of occurrences, as your posted code tries to find, a simple approach would be to use a regex:

public static int countOccurrences(string haystack, string regexToFind) {
   Pattern p = Pattern.compile(regexToFind);
   Matcher m = p.matcher(haystack); // get a matcher object
   int count = 0;
   while(m.find()) {
       count++;
   }
   return count;
}


Have a look at http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm


public class NeedleCount
{
  public static void main(String[] args)
  {
    String s="AVBVDABCHJHDFABCJKHKHF",ned="ABC";
    int nedIndex=-1,count=0,totalNed=0;
    for(int i=0;i<s.length();i++)
    {
      if(i>ned.length()-1)
        nedIndex++;
      else
        nedIndex=i;
      if(s.charAt(i)==ned.charAt(nedIndex))
        count++;
      else
      {
        nedIndex=0;
        count=0;
         if(s.charAt(i)==ned.charAt(nedIndex))
          count++;
        else
          nedIndex=-1;
      }
      if(count==ned.length())
      {
        nedIndex=-1;
        count=0;
        totalNed++;
        System.out.println(totalNed+" needle found at index="+(i-(ned.length()-1)));
      }
    }
    System.out.print("Total Ned="+totalNed);
  }
}


Asked by others, better in what sense? A regexp based solution will be the most concise and readable (:-) ). Boyer-Moore (http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm) will be the most efficient in terms of time (O(N)).


If you don't mind implementing a new datastructure as replacement for strings, have a look at Tries: http://c2.com/cgi/wiki?StringTrie or http://en.wikipedia.org/wiki/Trie

If you don't look for a regular expression but an exact match they should provide the fastest solution (proportional to length of search string).


    public class FindNeedleInHaystack {

        String hayStack="ASDVKDBGKBCDGFLBJADLBCNFVKVBCDXKBXCVJXBCVKFALDKBJAFFXBCD";
        String needle="BCD";
        boolean flag=false;

        public void findNeedle() {
//Below for loop iterates the string by each character till reaches max length
            for(int i=0;i<hayStack.length();i++) {  
//When i=n (0,1,2... ) then we are at nth character of hayStack. Let's start comparing nth char of hayStach with first char of needle
                if(hayStack.charAt(i)==needle.charAt(0)) {
//if condition return true, we reach forloop which iterates needle by lenghth.
//Now needle(BCD) first char is 'B' and nth char of hayStack is 'B'. Then let's compare remaining characters of needle with haystack using below loop. 
                    for(int j=0;j<needle.length();j++) {
//for example at i=9 is 'B', i+j is i+0,i+1,i+2...
//if condition return true, loop continues or else it will break and goes to i+1
                        if(hayStack.charAt(i+j)==needle.charAt(j)) {
                            flag=true;
                        } else {
                            flag=false;
                            break;
                        }
                    }
                    if(flag) {
                        System.out.print(i+" ");
                    }
                }                               
            }
        }
    }


Below code will perform exactly O(n) complexity because we are looping n chars of haystack. If you want to capture start and end index's of needle uncomment below commented code. Solution is around playing with characters and no Java String functions (Pattern matching, IndexOf, substring etc.,) are used as they may bring extra space/time complexity

        char[] needleArray = needle.toCharArray();
        char[] hayStackArray = hayStack.toCharArray();
        //java.util.LinkedList<Pair<Integer,Integer>> indexList = new LinkedList<>();

        int head;
        int tail = 0;
        int needleCount = 0;

        while(tail<hayStackArray.length){

            head = tail;
            boolean proceed = false;
            for(int j=0;j<needleArray.length;j++){
                if(head+j<hayStackArray.length && hayStackArray[head+j]==needleArray[j]){
                    tail = head+j;
                    proceed = true;
                }else{
                    proceed = false;
                    break;
                }
            }

            if(proceed){
               // indexList.add(new Pair<>(head,tail));
                needleCount++;
            }
            ++tail;
        }

        System.out.println(needleCount);
        //System.out.println(indexList);
0

精彩评论

暂无评论...
验证码 换一张
取 消