What is the correct way to count English words in a document using a regular expression?
I tried with:
import re
words = re.findall(r'\w+', open('text.txt').read().lower())
len(words)
but it seems I am missing a few words (compared to the word count in gedit). Am I doing it right?
Thanks a lot!
Using \w+ won't correctly count words containing apostrophes or hyphens; e.g., "can't" will be counted as 2 words ("can" and "t"). It will also count numbers (strings of digits): "12,345" and "6.7" will each count as 2 words ("12" and "345", "6" and "7").
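One possible way around this is a pattern that keeps internal apostrophes and hyphens inside a word and treats a number with "," or "." separators as a single token. This is only a sketch: the names TOKEN_RE and count_words are made up for illustration, the pattern assumes plain ASCII English letters, and it makes no claim to match gedit's own counting rules.

```python
import re

# Hypothetical pattern (an assumption, not gedit's rule):
#   - a number, optionally with ',' or '.' between digit groups, or
#   - ASCII letters, optionally joined by an apostrophe or hyphen.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|[A-Za-z]+(?:['’-][A-Za-z]+)*")

def count_words(text, include_numbers=True):
    """Count tokens in text; optionally skip purely numeric tokens."""
    tokens = TOKEN_RE.findall(text)
    if not include_numbers:
        # Numeric tokens are the only ones that start with a digit.
        tokens = [t for t in tokens if not t[0].isdigit()]
    return len(tokens)

sample = "I can't count well-known numbers like 12,345 or 6.7."
print(len(re.findall(r"\w+", sample)))               # 13 with plain \w+
print(count_words(sample))                           # 9 with the pattern above
print(count_words(sample, include_numbers=False))    # 7 when numbers are skipped
```

Whether numbers should count as words at all depends on what you are comparing against, which is why the sketch leaves that as a flag.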
This seems to work as expected.
>>> import re
>>> words=re.findall(r'\w+', open('/usr/share/dict/words').read().lower())
>>> len(words)
234936
>>>
bash-3.2$ wc /usr/share/dict/words
234936 234936 2486813 /usr/share/dict/words
Why are you lowercasing your words? What does that have to do with the count?
I'd submit that the following would be more efficient:
words=re.findall(r'\w+', open('/usr/share/dict/words').read())
Once you have a list of words, whether from _words_list = text.split() (where text is the file contents as a string) or from regex or other processing, you can easily count how often each word occurs with the following method:
import numpy as NP
import pandas as PD
_counted_words = PD.Series(NP.array(_words_list)).value_counts()
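If you would rather not pull in numpy and pandas, the standard library's collections.Counter gives the same kind of per-word tally. A minimal sketch, using a made-up sample string in place of the file contents:

```python
import re
from collections import Counter

# Sample text standing in for the file contents (illustrative only).
text = "The cat and the hat and the bat"

# Lowercasing matters here: it makes "The" and "the" count as the same word.
words_list = re.findall(r"\w+", text.lower())

counts = Counter(words_list)

total_words = len(words_list)    # every occurrence: 8
distinct_words = len(counts)     # unique words: 5
print(total_words, distinct_words)
print(counts.most_common(2))     # [('the', 3), ('and', 2)]
```

Note that both this and value_counts() give a count per distinct word; the total number of words is still just the length of the list.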