开发者

string processing

开发者 https://www.devze.com 2023-04-06 14:00 出处:网络
a string \'hello\', how to list all words and the count of each word. normal suffix tree algorithm return the suffix开发者_JS百科 only, mean that middle word \'ll\' will not appear. can anyone help me

a string 'hello', how to list all words and the count of each word. normal suffix tree algorithm return the suffix开发者_JS百科 only, mean that middle word 'll' will not appear. can anyone help me to solve in step by step?


Initialize a hash table.
Use a double loop (for within for). One loop index will represent the beginning of the substring, the other the end. Make sure the end index is strictly bigger than the beginning index, and that both are in the string boundaries.
For each substring encountered, check if it is in the hash. If it isn't - add it as a key, with the value 1. If it is - increment the current saved value.

Edit: as per your comment:

for(b = 0; b < len; ++b) {
  for(e = b + 1; e <= len; ++e) {
    //process substring from index b (incl.) to index e (excl.)
  }
}

This will traverse the string "abcd" in this order:
a ab abc abcd b bc bcd c cd d


Use a prefix tree instead of a suffix tree. Then run through this tree and at any node output the string encountered so far plus the number of sub trees that are still available.

EDIT:

Actually it's too early and I messed up some nomenclature:

A Prefix tree is a tree that stores common prefixes only once. A Suffix tree stores all suffixes in a prefix tree. So I meant a suffix tree here (which also happens to be a special kind of prefix tree).

So you do the following:

  1. Build up your suffix tree

Do a search on the prefix tree, starting at root

function search( node ) {
   c = node.symbol;
   if not children.empty then
     for each child in node.children do
        sub_search = search( child )
        other_results.append( sub_search.results );
        sub_trees += sub_search.num_trees
     done
     for each result in other_results do
        append c to result
     done
     return c :: other_results
   else
     return {results = c; num_trees = 1 }
   fi
}

If I did not do any mistake this should do the trick. The suffix part of the suffix tree takes care of eliminating all suffixes and the prefix part takes care of eliminating all prefixes. Because both are stored reduced you get strings in between (which might already have been stored together). Note that this is not including any compression on the trie, which is usually not needed unless your strings get very long.

0

精彩评论

暂无评论...
验证码 换一张
取 消