I am working on an application that requires me to extract keywords (and finally generate a tag cloud of these words) from a stream of conversations. I am considering the following steps:
- Tokenize each raw conversation (output stored as List of List of strings)
- Remove stop words
- Use stemmer (Porter stemming algorithm)
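Roughly, I imagine the preprocessing looking something like this (untested sketch; it assumes nltk's punkt tokenizer and stopwords corpus have already been downloaded):

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

#Assumes nltk.download('punkt') and nltk.download('stopwords') have been run
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(conversations):
    #Tokenize each raw conversation, drop stop words, stem the rest;
    #output is a list of lists of strings, one inner list per conversation
    stemmed_sentences = []
    for conversation in conversations:
        tokens = word_tokenize(conversation.lower())
        stems = [stemmer.stem(t) for t in tokens
                 if t.isalpha() and t not in stop_words]
        stemmed_sentences.append(stems)
    return stemmed_sentences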
Up to this point, nltk provides all the tools I need. After this, however, I need to somehow "rank" these words and come up with the most important ones. Can anyone suggest which tools from nltk might be used for this?
Thanks Nihit
I guess it depends on your definition of "important". If you are talking about frequency, then you can just build a dictionary using the words (or stems) as keys and their counts as values. Afterwards, you can sort the (stem, count) pairs by count.
Something like (not tested):
from collections import defaultdict

#Collect word statistics
counts = defaultdict(int)
for sent in stemmed_sentences:
    for stem in sent:
        counts[stem] += 1

#Drop all words with count < 3; they are rarely relevant
#and sorting the remainder will be way faster
pairs = [(x, y) for x, y in counts.items() if y >= 3]

#Sort (stem, count) pairs by count, most frequent first
sorted_stems = sorted(pairs, key=lambda x: x[1], reverse=True)
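If you only need the top N stems for the tag cloud, collections.Counter can do the counting and ranking in one step (also not tested; N here is just a placeholder for whatever tag cloud size you want):

from collections import Counter

#Count every stem across all sentences
counter = Counter(stem for sent in stemmed_sentences for stem in sent)
N = 50  #placeholder: pick your tag cloud size
#List of (stem, count) pairs, most frequent first
top_stems = counter.most_common(N)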