Using nltk library to extract keywords

开发者 (https://www.devze.com), 2023-03-11 00:57, source: web

I am working on an application that requires me to extract keywords (and finally generate a tag cloud of these words) from a stream of conversations. I am considering the following steps:

Tokenize each raw conversation (output stored as List of List of strings)
Remove stop words
Use stemmer (Porter stemming algorithm)
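
A rough sketch of these three steps, assuming plain English text. The stop-word set here is a tiny hypothetical stand-in (in practice nltk's stopwords corpus would replace it, after downloading it), and wordpunct_tokenize is used only because it needs no extra data files; the preprocess helper and the sample conversations are made up for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical, deliberately tiny stop-word set; not nltk's real list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "we", "from", "in"}


def preprocess(conversations):
    """Tokenize, drop stop words, and stem each raw conversation."""
    stemmer = PorterStemmer()
    stemmed_sentences = []
    for text in conversations:
        # Step 1: tokenize, keeping alphabetic tokens only, lowercased
        tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]
        # Step 2: remove stop words
        kept = [t for t in tokens if t not in STOP_WORDS]
        # Step 3: Porter stemming
        stemmed_sentences.append([stemmer.stem(t) for t in kept])
    return stemmed_sentences


# Made-up sample input
conversations = [
    "We extract keywords from the conversations.",
    "Keywords are ranked and the tag cloud is generated.",
]
stemmed_sentences = preprocess(conversations)
```

The result is a list of lists of stems, one inner list per conversation.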

Up to this point, nltk provides all the tools I need. After this, however, I need to somehow "rank" these words and come up with the most important ones. Can anyone suggest which tools from nltk might be used for this?

Thanks, Nihit

I guess it depends on your definition of "important". If you are talking about frequency, then you can just build a dictionary using words (or stems) as keys, and then counts as values. Afterwards, you can sort the keys in the dictionary based on their count.

Something like (not tested):

from collections import defaultdict

# Collect word statistics.
# stemmed_sentences is the output of the stemming step: a list of lists of stems.
counts = defaultdict(int)
for sent in stemmed_sentences:
    for stem in sent:
        counts[stem] += 1

# Drop all stems that occur fewer than 3 times.
# They are unlikely to be relevant, and sorting fewer pairs is faster.
pairs = [(stem, count) for stem, count in counts.items() if count >= 3]

# Sort (stem, count) pairs by count, most frequent first
sorted_stems = sorted(pairs, key=lambda pair: pair[1], reverse=True)
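
The same frequency ranking can probably be written more compactly with collections.Counter from the standard library. This is only a sketch: the sample stems below are made up, and the cutoff of 3 top entries is an arbitrary choice. nltk.FreqDist is a Counter subclass in NLTK 3, so it should offer the same most_common method if you prefer to stay inside nltk.

```python
from collections import Counter

# Hypothetical sample input: one list of stems per conversation.
stemmed_sentences = [
    ["keyword", "extract", "convers"],
    ["keyword", "rank", "tag", "cloud"],
    ["keyword", "tag", "cloud", "extract"],
]

# Count every stem across all conversations in one pass
counts = Counter(stem for sent in stemmed_sentences for stem in sent)

# The n most frequent (stem, count) pairs, highest count first
top_stems = counts.most_common(3)
print(top_stems)
```

The counts can then be used directly as weights for the tag cloud.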