String Occurrence Counting Algorithm

Developer https://www.devze.com 2022-12-28 22:38 Source: Internet

I am curious what the most efficient (or most commonly used) algorithm is for counting the number of occurrences of a string in a chunk of text.

From what I have read, the Boyer–Moore string search algorithm is the standard for string searches, but I am not sure whether counting occurrences efficiently is the same problem as searching for a string.

In Python this is what I want:

text_chunk = "one two three four one five six one"
occurrence_count(text_chunk, "one") # gives 3.

EDIT: It seems like Python's str.count serves as such a method; however, I am not able to find what algorithm it uses.

For starters, yes, you can accomplish this with Boyer-Moore very efficiently. However, depending on some other parameters of your problem, there might be a better solution.

The Aho-Corasick string matching algorithm will find all occurrences of a set of pattern strings in a target string and does so in time O(m + n + z), where m is the length of the string to search, n is the combined length of all the patterns to match, and z is the total number of matches produced. This is linear in the size of the source and target strings if you just have one string to match. It also will find overlapping occurrences of the same string. Moreover, if you want to check how many times a set of strings appears in some source string, you only need to make one call to the algorithm. On top of this, if the set of strings that you want to search for never changes, you can do the O(n) work as preprocessing time and then find all matches in O(m + z).
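
To make the Aho-Corasick idea concrete, here is a minimal sketch in Python. The function names (build_automaton, count_patterns) are my own, not from any library, and it assumes every pattern is non-empty. It builds a trie with failure links once, then scans the text a single time and counts every match, overlapping ones included.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and output lists for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> indexes of the patterns that end at this state
    for index, pattern in enumerate(patterns):
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        out[state].append(index)

    # Breadth-first order guarantees that the failure target of a state
    # is fully processed before the state itself.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for char, child in goto[state].items():
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[child] = goto[fallback].get(char, 0)
            # Inherit the patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def count_patterns(text, patterns):
    """Count occurrences of every pattern in one pass over the text."""
    goto, fail, out = build_automaton(patterns)
    counts = [0] * len(patterns)
    state = 0
    for char in text:
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for index in out[state]:
            counts[index] += 1
    return dict(zip(patterns, counts))
```

Calling count_patterns("one two three four one five six one", ["one", "two"]) gives {"one": 3, "two": 1}. If the pattern set is fixed, you would call build_automaton once and reuse its result, which is the O(n) preprocessing mentioned above.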

If, on the other hand, you have one source string and a rapidly-changing set of substrings to search for, you may want to use a suffix tree. With O(m) preprocessing time on the string that you will be searching in, you can, in O(n) time per substring, check how many times a particular substring of length n appears in the string.
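
A linear-time suffix tree is too long to show here, so the sketch below swaps in a suffix array, which answers the same counting query with two binary searches. Note the differences: this naive construction is far slower than O(m), and each query costs O(n log m) rather than O(n). The function names are hypothetical and the pattern is assumed to be non-empty.

```python
def build_suffix_array(text):
    """Return the suffix start positions in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda start: text[start:])


def count_with_suffix_array(text, suffix_array, pattern):
    """Count occurrences of a non-empty pattern using two binary searches."""
    width = len(pattern)

    def first_index(strict):
        # Smallest index whose suffix, cut to the pattern length, is
        # >= pattern (or > pattern when strict is True).
        low, high = 0, len(suffix_array)
        while low < high:
            mid = (low + high) // 2
            start = suffix_array[mid]
            prefix = text[start:start + width]
            if prefix < pattern or (strict and prefix == pattern):
                low = mid + 1
            else:
                high = mid
        return low

    # All suffixes that start with the pattern sit in one contiguous range.
    return first_index(strict=True) - first_index(strict=False)
```

You build the array once per source string and then reuse it for every substring you want to count, which is the same usage pattern as the suffix tree.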

Finally, if you're looking for something you can code up easily and with minimal hassle, you might want to consider looking into the Rabin-Karp algorithm, which uses a rolling hash function to find strings. This can be coded up in roughly ten to fifteen lines of code, has no preprocessing time, and for normal text strings (lots of text with few matches) can find all matches very quickly.
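
For reference, a Rabin-Karp counter might look roughly like this. It is only a sketch: the function name and the base and modulus values are arbitrary choices of mine, and every hash hit is verified with a direct comparison so that collisions cannot produce false matches. Overlapping matches are counted.

```python
def rabin_karp_count(text, pattern, base=256, modulus=1_000_000_007):
    """Count (possibly overlapping) occurrences of pattern using a rolling hash."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return 0

    # Weight of the leftmost character in a window of length n.
    high = pow(base, n - 1, modulus)

    pattern_hash = window_hash = 0
    for i in range(n):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % modulus
        window_hash = (window_hash * base + ord(text[i])) % modulus

    count = 0
    for start in range(m - n + 1):
        # Compare the strings only when the hashes agree.
        if window_hash == pattern_hash and text[start:start + n] == pattern:
            count += 1
        if start + n < m:
            # Roll the window: drop text[start], append text[start + n].
            window_hash = (window_hash - ord(text[start]) * high) % modulus
            window_hash = (window_hash * base + ord(text[start + n])) % modulus
    return count
```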

Hope this helps!

Boyer-Moore would be a good choice for counting occurrences, since its preprocessing overhead is paid only once, however many matches you count. It does better the longer the pattern string is, so for a short pattern such as "one" it would not be a good choice.
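
To show where the one-time overhead goes, here is a sketch of the Boyer-Moore-Horspool simplification, not full Boyer-Moore: it keeps only a bad-character shift table and drops the good-suffix rule. The function name is hypothetical. The table is built once from the pattern, and the maximum skip equals the pattern length, which is why long patterns benefit most.

```python
def horspool_count(text, pattern):
    """Count (possibly overlapping) occurrences with Boyer-Moore-Horspool."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return 0

    # One-time preprocessing: for each character of pattern[:-1], the distance
    # from its rightmost occurrence to the end of the pattern.
    shift = {char: n - 1 - i for i, char in enumerate(pattern[:-1])}

    count = 0
    pos = 0
    while pos <= m - n:
        if text[pos:pos + n] == pattern:
            count += 1
        # Skip ahead based on the text character under the pattern's last
        # position; characters missing from the table allow a full-length jump.
        pos += shift.get(text[pos + n - 1], n)
    return count
```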

If you want to count overlaps, start the next search one character after the previous match. If you want to ignore overlaps, start the next search the full pattern string length after the previous match.

If your language has an indexOf or strpos method for finding one string in another, you can use that. If it proves too slow, then choose a better algorithm.
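
In Python the equivalent of indexOf is str.find, so the two restart strategies described above can be sketched like this (count_with_find and its overlapping flag are names I made up for the example):

```python
def count_with_find(text, pattern, overlapping=False):
    """Count occurrences of pattern by calling str.find repeatedly."""
    if not pattern:
        return 0
    # Restart one character after the match to allow overlaps,
    # or a full pattern length after it to ignore them.
    step = 1 if overlapping else len(pattern)
    count = 0
    position = text.find(pattern)
    while position != -1:
        count += 1
        position = text.find(pattern, position + step)
    return count
```

With overlapping=False this agrees with the built-in str.count, which also counts non-overlapping occurrences only.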

Hellnar, you can use a simple dictionary to count occurrences in a string. The algorithm is a counting algorithm; here is an example:

"""
The counting algorithm is used to count the occurrences of each character
in a string. This allows you to compare anagrams and strings themselves.
ex. animal, lamina a=2,n=1,i=1,m=1,l=1
"""

def count_occurrences(text):
    # Map each character to the number of times it appears in text.
    occurrences = {}
    for char in text:
        if char in occurrences:
            occurrences[char] = occurrences[char] + 1
        else:
            occurrences[char] = 1
    return occurrences

def is_matched(s1, s2):
    # True if every character of s2 can be paired with an unused character of s1.
    matched = True
    s1_count_table = count_occurrences(s1)

    for char in s2:
        if char in s1_count_table and s1_count_table[char] > 0:
            s1_count_table[char] -= 1
        else:
            matched = False
            break
    return matched

# counting.is_matched("animal", "laminar")  # False: "r" does not occur in "animal"

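This example just returns True or False depending on whether the strings match. Keep in mind that this algorithm counts the number of times each character shows up in a string, not the occurrences of a substring, which makes it a good fit for anagrams. Note that is_matched only checks that s2 can be built from the characters of s1, so compare the lengths as well if you need a strict anagram test.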
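
If a strict anagram test is what you are after, the same counting idea can be written with collections.Counter from the standard library. This is a sketch with a function name of my own choosing, and it behaves differently from is_matched: both strings must contain exactly the same characters the same number of times.

```python
from collections import Counter


def is_anagram(first, second):
    """Return True if both strings have identical character counts."""
    return Counter(first) == Counter(second)
```

Here is_anagram("animal", "lamina") is True, while is_anagram("animal", "laminar") is False.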