Finding popular keywords in huge list_问答_开发者

开发者 https://www.devze.com 2023-01-24 15:47 出处：网络

I have a huge list with about 100 000 lines like this: ipadnews abcipad cddeeffipad hellworld iworldthis .. and so on

I have a huge list with about 100 000 lines like this:

ipadnews
abcipad
cddeeffipad
hellworld
iworldthis .. and so on

And would like to find popular substrings, in this case "ipad" would be the most popular and "world" would be on second place. Minimum length should be three or four 开发者_Go百科chars.

I can't predict the substrings so using a dictionary is a no no.

This is a relatively complicated problem ... but it's tractable using prefix/suffix trees. It's essentially a variation of the longest common subsequence and longest common substring problems. - which is where I would start.

There's actually quite a bit of research on problems on this form - you should be able to use the terms above to narrow your search.

You can solve this using a generalized suffix tree which can be built in O(n) time. This is effectively a play on the LCS problem.

I would go about this problem using the following flow of logic:

Extract the set of suffixes for each word. So from 'ipadnews' we get: 'ipadnews', 'padnews', 'adnews', and so on. This way, 'news' will be one of the suffixes, but not 'ipad'.
To make up for the missing substrings in the above step, extract the prefixes as well. We get 'ipadnew', 'ipadne', and so on, including 'ipad'.
For each of the substrings above, hash them towards a count, e.g. $hash{$substr}++.

At the end we will have a long hashtable with frequency of words as values. Instead of an expensive sorting, suppose you only want 10 most popular words. Keep a set from the beginning whose criteria is that any word in it must have a score more than the current min score. You can keep track of the word with min score and when you add the 11th item with score more than the min score, bump out the word with the min score and update the min score pointer.

The max number of keys in the hashtable will be 2*k*n where k is the average length of the words and n is total number of words.