I would like to calculate the frequency of function words using Python/NLTK. I see two ways to go about it:
- Use a Part-Of-Speech tagger and sum the counts for POS tags that correspond to function words
- Create a list of function words and perform a simple lookup
The catch in the first case is that my data is noisy and I don't know (for sure) which POS tags count as function words. The catch in the second case is that I don't have a list, and since my data is noisy the lookup won't be accurate.
I would prefer the first over the second, or any other approach that would give me more accurate results.
For now I have simply used the LIWC 2007 English dictionary (which I paid for) and performed a lookup against it. Any other answers are most welcome.
I must say I am a little surprised by the impulsiveness of a couple of answers here. Since someone asked for code, here's what I did:
import nltk

def get_func_word_freq(words, funct_words):
    ''' Returns frequency of function words '''
    # Count how often each function word occurs in the tokenised text
    fdist = nltk.FreqDist([word for word in words if word in funct_words])
    funct_freq = {}
    for key, value in fdist.items():
        funct_freq[key] = value
    return funct_freq

def load_liwc_funct():
    ''' Read the LIWC 2007 English dictionary and extract function words '''
    funct_words = set()
    data_file = open(liwc_dict_file, 'r')
    lines = data_file.readlines()
    data_file.close()
    for line in lines:
        row = line.rstrip().split("\t")
        if '1' in row:                       # entry is tagged with the function-word category ('1')
            if row[0][-1:] == '*':           # wildcard entry, e.g. "hi*" -> strip the '*'
                funct_words.add(row[0][:-1])
            else:
                funct_words.add(row[0])
    return list(funct_words)
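For reference, a minimal sketch of how these two functions might be called; the dictionary path and the sample sentence are placeholders, and the text is tokenised with NLTK (which needs the 'punkt' resource downloaded):

# Hypothetical usage sketch: the dictionary path and sample text are placeholders.
import nltk

liwc_dict_file = '/path/to/LIWC2007_English.dic'    # placeholder path to the LIWC .dic file

text = "She would not have been able to do it without him"
words = nltk.word_tokenize(text.lower())             # requires nltk.download('punkt')

funct_words = load_liwc_funct()
print(get_func_word_freq(words, funct_words))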
Anyone who has written some code in Python will tell you that performing a lookup or extracting words with specific POS tags isn't rocket science. To add, the tags on the question, NLP (Natural Language Processing) and NLTK (Natural Language ToolKit), should be enough indication to the astute-minded.
Anyway, I understand and respect the sentiments of the people who reply here, since most of it is done for free, but I think the least we can do is show a bit of respect to question posters. As it's rightly pointed out, help is received when you help others; similarly, respect is received when one respects others.
You don't know which approach will work until you try. I recommend the first approach though; I've used it with success on very noisy data, where the "sentences" were email subject headers (short texts, not proper sentences) and even the language was unknown (some 85% English; the Cavnar & Trenkle algorithm broke down quickly). Success was defined as increased retrieval performance in a search engine; if you just want to count frequencies, the problem may be easier.
Make sure you use a POS tagger that takes context into account (most do). Inspect the list of words and frequencies you get and maybe eliminate some words that you don't consider function words, or even filter out words that are too long; that will eliminate the false positives.
(Disclaimer: I was using the Stanford POS tagger, not NLTK, so YMMV. I used one of the default models for English, trained, I think, on the Penn Treebank.)
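If you want to try the same idea directly in NLTK, a rough sketch could look like the following; the set of "function word" tags is only my guess at the closed-class Penn Treebank tags, so inspect the output and prune it as described above:

# Rough sketch of the POS-tag approach in NLTK. The tag set below is an
# assumption (closed-class Penn Treebank tags), not a definitive list of
# function-word tags -- inspect the output and adjust it for your data.
import nltk

# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
FUNCTION_TAGS = {'DT', 'IN', 'CC', 'PRP', 'PRP$', 'WDT', 'WP', 'WP$',
                 'TO', 'MD', 'EX', 'PDT', 'RP'}

def pos_based_func_word_freq(text):
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)            # context-aware tagger
    return nltk.FreqDist(word.lower() for word, tag in tagged
                         if tag in FUNCTION_TAGS)

print(pos_based_func_word_freq("The cat sat on the mat because it was tired"))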