How do I count the number of occurrences of a list of items in another .txt file?

Developer https://www.devze.com 2023-01-04 15:04 Source: Internet
I have a list of words and I want to find how many times they occur in a .txt file. The word list looks something like this:

wordlist = ['cup', 'bike', 'run']

I want to pick up not only these words but also variants like CUP, biker, running, Cups, etc., so I think I need a regular expression. Here is what I was thinking, but it doesn't work:

len(re.findall(wordlist, filename, re.I))

Thanks in advance!


You're close, but re.findall takes a pattern and a string, not a word list and a filename.

If you read your file into a string and turn your wordlist into a pattern, it will work.

The pattern you need will look like this: r"cup|bike|run". You could do "|".join(wordlist) to get this.

That's a very loose way of counting all these instances. Note that if your file has the words "My truncheon has been scuppered" in it, then re.findall will find "run" and "cup" inside the bigger words. So you may want to tweak your pattern to catch the beginnings and ends of words.
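A minimal sketch of that loose behaviour, using the sample sentence above. The names `loose_pattern` and `loose_hits` are only illustrative, and `re.escape` is an added precaution in case a word ever contains regex metacharacters:

```python
import re

wordlist = ['cup', 'bike', 'run']

# Join the words into a single alternation: 'cup|bike|run'.
# re.escape leaves plain words unchanged but protects against metacharacters.
loose_pattern = "|".join(re.escape(w) for w in wordlist)

text = "My truncheon has been scuppered"

# No word boundaries, so matches inside longer words are counted too:
# 'run' is found inside 'truncheon' and 'cup' inside 'scuppered'.
loose_hits = re.findall(loose_pattern, text, re.I)
print(len(loose_hits))
```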

To get whole words only, use this pattern: r"\b(cup|bike|run)\b". Of course, you'll need to fill in all the word varieties that you are looking for.
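Putting the whole-word pattern together with reading the file might look like the sketch below. The function names `count_whole_words` and `count_in_file` are hypothetical, and the non-capturing group `(?:...)` is a small variation so that `re.findall` returns the full matches:

```python
import re

def count_whole_words(text, words):
    """Count case-insensitive whole-word matches of any word in `words`."""
    # \b on both sides restricts matches to whole words;
    # (?:...) groups the alternation without capturing.
    pattern = r"\b(?:%s)\b" % "|".join(re.escape(w) for w in words)
    return len(re.findall(pattern, text, re.I))

def count_in_file(path, words):
    """Read the whole file into a string, then count matches in it."""
    with open(path) as f:
        return count_whole_words(f.read(), words)

sample = "My truncheon has been scuppered, but the CUP on my bike is fine. Cups? No."
# Only 'CUP' and 'bike' are whole-word matches; 'Cups' is not in the word list.
total = count_whole_words(sample, ['cup', 'bike', 'run'])
```

With this pattern, variants such as "Cups" or "running" are only counted if they are added to the word list explicitly.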


The regex needs work, but this should get you started:

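from __future__ import with_statement # only needed on Python < 2.6
from collections import defaultdict
import re

# filename is the path to the .txt file, as in the question
matches = defaultdict(int)  # stem -> number of occurrences
with open(filename) as f:
    # \b only on the left, so 'run' also counts 'running' and 'cup' counts 'Cups'
    for mtch in re.findall(r'\b(cup|bike|run)', f.read(), re.I):
        matches[mtch.lower()] += 1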
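The same idea can be written against the wordlist instead of a hard-coded pattern. This is only a sketch: `count_by_stem` is a hypothetical name, and `collections.Counter` is swapped in for the defaultdict:

```python
import re
from collections import Counter

def count_by_stem(text, words):
    """Count words that start with any stem in `words`, keyed by the stem."""
    # \b only on the left: 'run' also matches the start of 'running'.
    pattern = re.compile(r"\b(%s)" % "|".join(re.escape(w) for w in words), re.I)
    # With one capturing group, findall returns just the matched stem.
    return Counter(stem.lower() for stem in pattern.findall(text))

counts = count_by_stem("CUP, biker, running, Cups and a teacup",
                       ['cup', 'bike', 'run'])
# 'teacup' is not counted because there is no word boundary before 'cup'.
```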


You will first have to guess all the forms of each word, which is a pain. Here is a simplified function I wrote after reading http://www.theenglishspace.com/spelling/ :

import re

def getWordForms(word):
    ''' Given an English word, return list of possible forms
    '''
    l = [word]
    if len(word)>1:
        l.extend([word + 's', word + 'ing', word + 'ed'])
        wor, d = word[:-1], word[-1:]
        if d == 'e':
            l.append(word + 'd')
            l.append(wor + 'ing')
            if wor[-1:] == 'f':
                l.append(wor[:-1] + 'ves')
        elif d == 'y':
            l.append(wor + 'ied')
            l.append(wor + 'ies')
        elif d == 'z':
            l.append(word + 'zes') # double Z
        elif d == 'f':
            l.append(wor + 'ves')
        elif d in 'shox':
            l.append(word + 'es')
        # anchor at the end: the word must END in consonant-vowel-consonant
        if re.search('[^aeiou][aeiou][^aeiou]$', word):
            l.append(word + d + 'ing') # double consonant
            l.append(word + d + 'ed')
    return l

It is overly generous in the variants of words it guesses - but that is ok because this is not a spell checker and you will be using \b for word boundaries on both sides.
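One way to turn such a list of forms into a single whole-word pattern is sketched below. `build_pattern` is a hypothetical helper, and `simple_forms` is a deliberately tiny stand-in for the forms function so the block runs on its own:

```python
import re

def build_pattern(words, get_forms):
    """Compile one case-insensitive whole-word pattern from all guessed forms."""
    forms = set()
    for w in words:
        forms.update(get_forms(w))
    alternation = "|".join(re.escape(f) for f in sorted(forms))
    # \b on both sides, so only the listed forms match as whole words.
    return re.compile(r"\b(?:%s)\b" % alternation, re.I)

def simple_forms(word):
    # Stand-in only: a real forms function would guess many more variants.
    return [word, word + 's', word + 'ing', word + 'ed']

pattern = build_pattern(['cup', 'bike', 'run'], simple_forms)
hits = pattern.findall("Cups and a cup; biker running runs")
# 'biker' and 'running' are missed here because the stand-in does not
# generate those forms; a more generous forms function would catch them.
```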
