a string 'hello', how to list all words and the count of each word. normal suffix tree algorithm return the suffix开发者_JS百科 only, mean that middle word 'll' will not appear. can anyone help me to solve in step by step?
Initialize a hash table.
Use a double loop (for within for). One loop index will represent the beginning of the substring, the other the end. Make sure the end index is strictly bigger than the beginning index, and that both are in the string boundaries.
For each substring encountered, check if it is in the hash. If it isn't - add it as a key, with the value 1. If it is - increment the current saved value.
Edit: as per your comment:
for(b = 0; b < len; ++b) {
for(e = b + 1; e <= len; ++e) {
//process substring from index b (incl.) to index e (excl.)
}
}
This will traverse the string "abcd" in this order:
a ab abc abcd b bc bcd c cd d
Use a prefix tree instead of a suffix tree. Then run through this tree and at any node output the string encountered so far plus the number of sub trees that are still available.
EDIT:
Actually it's too early and I messed up some nomenclature:
A Prefix tree is a tree that stores common prefixes only once. A Suffix tree stores all suffixes in a prefix tree. So I meant a suffix tree here (which also happens to be a special kind of prefix tree).
So you do the following:
- Build up your suffix tree
Do a search on the prefix tree, starting at root
function search( node ) {
c = node.symbol;
if not children.empty then
for each child in node.children do
sub_search = search( child )
other_results.append( sub_search.results );
sub_trees += sub_search.num_trees
done
for each result in other_results do
append c to result
done
return c :: other_results
else
return {results = c; num_trees = 1 }
fi
}
If I did not do any mistake this should do the trick. The suffix part of the suffix tree takes care of eliminating all suffixes and the prefix part takes care of eliminating all prefixes. Because both are stored reduced you get strings in between (which might already have been stored together). Note that this is not including any compression on the trie, which is usually not needed unless your strings get very long.
精彩评论