fastest way to do keyword search in large text in C/C++

Say I have 100 keywords (which can include spaces) and I need to find out how many times they occur in a big piece of text. What would be the fastest way to accomplish this?

My current idea is as follows:

  • turn the keywords into a suffix tree
  • walk through the text following the nodes, and whenever a character has no matching child (i.e. children[c] == NULL), skip to the next word and search again (a rough sketch of this walk appears at the end of the question)

The suffix tree struct would look something like this:

struct node {
   int count; /* number of occurrences (only used at leaf nodes) */
   /* for each lower-case char, a pointer to either NULL or the next node */
   struct node *children[26];
};

I am sure there is a faster way to do this, but what is it? Space efficiency is not really a big deal for this case (hence the children array for faster lookup), but time efficiency really is. Any suggestions?
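
For concreteness, the walk described in the bullets might look roughly like the sketch below. It repeats the struct with a hypothetical is_end flag added (so the walk knows where a keyword terminates), assumes lower-case letters only, and ignores keywords containing spaces, just as the 26-slot children array does.

#include <stddef.h>  /* NULL */

struct node {
    int count;                  /* number of occurrences counted so far */
    int is_end;                 /* hypothetical flag: a keyword ends at this node */
    struct node *children[26];  /* one slot per lower-case letter */
};

/* Start a descent at every word boundary; abandon it as soon as no child
 * matches, then skip ahead to the next word. */
void count_matches(struct node *root, const char *text)
{
    const char *p = text;
    while (*p != '\0') {
        struct node *cur = root;
        /* descend from the root for the word starting at p */
        for (const char *q = p; *q >= 'a' && *q <= 'z'; q++) {
            cur = cur->children[*q - 'a'];
            if (cur == NULL)
                break;          /* no keyword continues with this character */
            if (cur->is_end)
                cur->count++;   /* one keyword ends exactly here */
        }
        /* skip the rest of this word, then the separators, to the next word */
        while (*p >= 'a' && *p <= 'z') p++;
        while (*p != '\0' && !(*p >= 'a' && *p <= 'z')) p++;
    }
}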


The problem with the suffix tree approach is that you have to start a fresh descent at each position of the text being searched. I think the better way would be to run a separate search for each keyword over the text, using a fast search method with precomputed tables, such as Boyer-Moore.
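
For example, with C++17 a precomputed Boyer-Moore search per keyword could look like the sketch below, using std::boyer_moore_searcher from <functional>; the function name count_keyword is just an illustration, not code from this answer.

#include <algorithm>
#include <cstddef>
#include <functional>   // std::boyer_moore_searcher (C++17)
#include <string>

// Count (possibly overlapping) occurrences of one keyword in the text with a
// precomputed Boyer-Moore searcher; repeat this per keyword.
std::size_t count_keyword(const std::string& text, const std::string& keyword)
{
    if (keyword.empty())
        return 0;
    std::boyer_moore_searcher searcher(keyword.begin(), keyword.end());
    std::size_t count = 0;
    auto it = text.begin();
    for (;;) {
        it = std::search(it, text.end(), searcher);
        if (it == text.end())
            break;
        ++count;
        ++it;               // step forward so overlapping matches are counted too
    }
    return count;
}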

EDIT:

OK, you may well be right that the trie is faster. Boyer-Moore is very fast in the average case. Suppose, for example, that the keywords have a mean length of m. BM can be as fast as O(n/m) for "normal" strings, which would make 100*O(n/m) for 100 keywords. The trie walk would be O(n*m) on average (though in practice it can be much faster), so if 100 >> m then the trie would win.

Now for random ideas on optimization. In some compression algorithms that have to do backward searches, I've seen partial hash tables indexed by two characters of the string. That is, if the string to check is the sequence of characters c1, c2, and c3, you can check whether:

if (hash_table[c1 * 256 + c2]) check_strings_beginning_with(c1, c2);

then for c2 and c3, and so on. It is surprising how many cases you avoid with this simple check, since with 100 keywords the hash entry will be true for only about 100/65536 of the pairs (roughly 0.15%).
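
A sketch of that prefilter in C++ might look as follows; PairFilter and its member names are illustrative, not from the answer. It marks which two-byte prefixes start some keyword and only does the expensive per-keyword comparison at positions whose pair is marked.

#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Record which (first, second) byte pairs start some keyword, then reject
// most text positions with a single table lookup before comparing keywords.
struct PairFilter {
    std::array<bool, 256 * 256> seen{};   // value-initialized to all false

    void add(const std::string& keyword) {
        if (keyword.size() >= 2) {
            unsigned char c1 = keyword[0], c2 = keyword[1];
            seen[c1 * 256 + c2] = true;
        }
    }

    std::size_t count(const std::string& text,
                      const std::vector<std::string>& keywords) const {
        std::size_t total = 0;
        for (std::size_t i = 0; i + 1 < text.size(); ++i) {
            unsigned char c1 = text[i], c2 = text[i + 1];
            if (!seen[c1 * 256 + c2])
                continue;                  // the vast majority of positions stop here
            for (const std::string& k : keywords)
                if (text.compare(i, k.size(), k) == 0)
                    ++total;               // keyword k occurs at position i
        }
        return total;
    }
};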


This is what I would do.

  1. Put all your keywords in a hash table of key-value pairs, with the number of occurrences of the keyword as the value and the keyword as the (you guessed it) key.
  2. Check each word in the text blob against the hash table. If the word is in the hash table, increment the occurrence count associated with it.

This is a good approach because hash table lookup is (or should be) O(1) on average, so the whole algorithm has linear complexity :) (sketched below).
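
A minimal sketch of steps 1–2 for single-word keywords, assuming whitespace-separated tokens (punctuation stuck to words is not stripped here); phrases with spaces need the variant described in the edit below.

#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Pre-seed the table with every keyword, then bump the count of each text
// word that turns out to be a keyword. Lookup is O(1) on average.
std::unordered_map<std::string, long> count_keywords(
        const std::string& text, const std::vector<std::string>& keywords)
{
    std::unordered_map<std::string, long> counts;
    for (const std::string& k : keywords)
        counts[k] = 0;

    std::istringstream in(text);
    std::string word;
    while (in >> word) {                  // whitespace-separated tokens
        auto it = counts.find(word);
        if (it != counts.end())
            ++it->second;                 // this word is one of the keywords
    }
    return counts;
}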

EDIT: If your keywords can contain spaces, you would need to build a sort of DFA. Scan the file until you find a word that one of your key "phrases" starts with. If the next word (or however many follow) also matches the rest of the "key phrase", increment the occurrence count.
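
Not an actual DFA, but a word-by-word illustration of that idea (the names split_words and count_phrase are mine, not from the answer):

#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a string into whitespace-separated words.
std::vector<std::string> split_words(const std::string& s)
{
    std::istringstream in(s);
    std::vector<std::string> words;
    std::string w;
    while (in >> w)
        words.push_back(w);
    return words;
}

// Count how often a multi-word key phrase occurs in the tokenised text:
// whenever a text word starts the phrase, check that the following words
// match the rest of it.
long count_phrase(const std::vector<std::string>& text_words,
                  const std::string& phrase)
{
    std::vector<std::string> pw = split_words(phrase);
    if (pw.empty())
        return 0;
    long count = 0;
    for (std::size_t i = 0; i + pw.size() <= text_words.size(); ++i) {
        std::size_t j = 0;
        while (j < pw.size() && text_words[i + j] == pw[j])
            ++j;
        if (j == pw.size())
            ++count;          // the whole phrase matched starting at word i
    }
    return count;
}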


You seem to be groping your way towards http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

To quote:

The complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).
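
For illustration, a compact Aho-Corasick sketch over lower-case letters follows; this is my own code, not from the linked article, and a real implementation would widen the alphabet (e.g. to include spaces, since the keywords may contain them) and handle mixed case.

#include <cstring>
#include <queue>
#include <string>
#include <vector>

// Build a trie of the keywords, add failure links with a BFS, then scan the
// text once, summing the number of keywords that end at each position.
struct AhoCorasick {
    struct Node {
        int next[26];
        int fail = 0;
        int ends = 0;                       // keywords ending at this node
        Node() { std::memset(next, -1, sizeof(next)); }
    };
    std::vector<Node> nodes;

    AhoCorasick() { nodes.emplace_back(); } // node 0 is the root

    void add(const std::string& word) {     // word must be lower-case a-z here
        int cur = 0;
        for (char ch : word) {
            int c = ch - 'a';
            if (nodes[cur].next[c] == -1) {
                nodes[cur].next[c] = static_cast<int>(nodes.size());
                nodes.emplace_back();
            }
            cur = nodes[cur].next[c];
        }
        nodes[cur].ends++;
    }

    void build() {                          // BFS to fill in failure links
        std::queue<int> q;
        for (int c = 0; c < 26; ++c) {
            int v = nodes[0].next[c];
            if (v == -1) nodes[0].next[c] = 0;
            else         q.push(v);         // depth-1 nodes fail to the root
        }
        while (!q.empty()) {
            int u = q.front(); q.pop();
            nodes[u].ends += nodes[nodes[u].fail].ends;  // inherit matches on the fail path
            for (int c = 0; c < 26; ++c) {
                int v = nodes[u].next[c];
                if (v == -1) {
                    nodes[u].next[c] = nodes[nodes[u].fail].next[c];
                } else {
                    nodes[v].fail = nodes[nodes[u].fail].next[c];
                    q.push(v);
                }
            }
        }
    }

    long long count(const std::string& text) const {     // total matches in the text
        long long total = 0;
        int cur = 0;
        for (char ch : text) {
            if (ch < 'a' || ch > 'z') { cur = 0; continue; } // crude separator handling
            cur = nodes[cur].next[ch - 'a'];
            total += nodes[cur].ends;
        }
        return total;
    }
};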


If it is an industrial application, use Boost.Regex.

It is tested, fast, and chances are it will save you a lot of pain.
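
A hedged sketch of that suggestion (the function name is mine; keywords would need escaping if they contain regex metacharacters, which is not shown, and only non-overlapping matches are counted):

#include <boost/regex.hpp>
#include <cstddef>
#include <string>
#include <vector>

// Join the keywords into a single alternation and count every match in one
// pass over the text with boost::sregex_iterator.
long count_with_boost_regex(const std::string& text,
                            const std::vector<std::string>& keywords)
{
    std::string pattern;
    for (std::size_t i = 0; i < keywords.size(); ++i) {
        if (i != 0)
            pattern += '|';
        pattern += keywords[i];
    }
    boost::regex re(pattern);

    long count = 0;
    for (boost::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
        ++count;                            // one match of some keyword
    return count;
}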
