
How can I remove all instances of a single set item in set A from set B?


As you can see below, when I open test.txt and put its words into a set, I take the difference of that set with the common_words set. However, this only removes a single instance of each word in common_words rather than all occurrences. How can I remove ALL instances of the items in common_words from title_words?

from string import punctuation
from operator import itemgetter

N = 10
words = {}

linestring = open('test.txt', 'r').read()

# set A, want to remove these from set B
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))

title = linestring

# set B, want to remove ALL words in set A from this set and store in keywords
title_words = set(title.lower().split())

keywords = title_words.difference(common_words)

words_gen = (word.strip(punctuation).lower() for line in keywords
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


I agree with senderle. Try this code:

for common_word in common_words:
    try:
        title_words.remove(common_word)
    except KeyError:
        print "The common word %s was not in title_words" % common_word

That should do it

Hope this helps
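
As an aside, set.discard() removes an element when it is present and does nothing otherwise, so the try/except isn't strictly needed:

# discard() never raises, unlike remove()
for common_word in common_words:
    title_words.discard(common_word)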


Strip the punctuation out before you make it a set. At the moment you do:

keywords = title_words.strip(punctuation).difference(common_words)

This tries to call the strip method of title_words, which is a set (only str has that method). You could do something like this instead:

for ch in punctuation:
    title = title.replace(ch, '')

title_words = set(title.lower().split())

keywords = title_words.difference(common_words)
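
If title is a plain (non-unicode) str, the same punctuation removal can be done in a single pass with translate; this is my shorthand, not part of the original answer:

# deletes every punctuation character from the string (Python 2 str only;
# a unicode string would need a translation table instead)
title = title.translate(None, punctuation)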


You just want the difference() method for this, but it looks like your example is buggy.

title_words is a set, and doesn't have the strip() method.

Try this instead:

title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
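
The same difference can also be written with the - operator:

keywords = title_words - common_words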


If title_words is a set, then there is only one occurrence of any one word. So you only need to remove one occurrence. Have I misunderstood your question?


I'm still a bit confused by this question, but I notice that one problem might be that when you pass your initial data through set, the punctuation hasn't been stripped yet. So there may be multiple punctuated versions of a word slipping through the .difference() operation. Try this:

title_words = set(word.strip(punctuation) for word in title.lower().split())

Also, your words_gen generator is written in a slightly confusing way. Why line in keywords -- what's the line? And why are you calling split() again? keywords ought to be a set of straight words, right?
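
Putting those pieces together, a minimal corrected version of the set-building step might look like this (my sketch, keeping the original's Python 2 style):

from string import punctuation

common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))
title = open('test.txt', 'r').read()

# strip punctuation from each word *before* building the set, so that
# "fox." and "fox" collapse into the same element
title_words = set(word.strip(punctuation) for word in title.lower().split())
keywords = title_words.difference(common_words)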


You've succeeded in finding the top N most uniquely punctuated words in your input file.

Run this input file through your original code:

the quick brown fox.
The quick brown fox?
The quick brown fox! 
the quick, brown fox

And you'll get the following output:

fox: 4
quick: 2
brown: 1

Notice that fox appears in 4 variations: fox., fox?, fox!, and fox. The word brown appears only one way, and quick appears both with and without a comma (2 variations).

What happens when we add fox to the common_words set? Only the variation that has no trailing punctuation is removed, and we're left with the three punctuation-adorned variants, giving this output:

fox: 3
quick: 2
brown: 1
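
You can see the surviving variants directly in an interactive session (my reproduction, not from the original post):

>>> text = "the quick brown fox. The quick brown fox? The quick brown fox! the quick, brown fox"
>>> sorted(set(text.lower().split()))
['brown', 'fox', 'fox!', 'fox.', 'fox?', 'quick', 'quick,', 'the']
>>> sorted(set(text.lower().split()).difference(['fox']))
['brown', 'fox!', 'fox.', 'fox?', 'quick', 'quick,', 'the']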

For a more realistic example, run MLK's I Have a Dream speech through your method:

justice: 4
children: 3
today: 3
rights: 3
satisfied: 3
nation: 3
day: 3
ring: 3
hope: 3
injustice: 3

Dr. King says "I Have a Dream" eight times in that speech, yet dream doesn't show up at all on the list. Do a search for justice and you'll find four (4) punctuated flavors:

  • until "justice rolls
  • palace of justice: In the
  • make justice a reality
  • path of racial justice.

So what went wrong? It looks like this method has been through a lot of rework, considering that the variable names don't seem to match their purposes. So let's go through it (moving some code around a bit, my apologies):

Open the file and slurp the whole thing into linestring, good so far except for the variable name:

linestring = open(filename, 'r').read()

Is this a line or a title? Both? In any event, we now lowercase the whole file and split it on whitespace. Using my test file, this means title_words now contains fox., fox?, fox!, and fox.

title = linestring
title_words = set(title.lower().split())

Now the attempt to remove the common words. Let's assume our common_words contains fox. This next line removes the bare fox but leaves fox., fox?, and fox!

keywords = title_words.difference(common_words)

The next line really looks like legacy code to me, as if it was meant to be something like for line in linestring.split('\n') for word in line.split(); a sketch of that hypothetical version follows the snippet below. In its current form, keywords is just a set of words, so line is a single word with no spaces, and for word in line.split() has no effect. We simply iterate over every word, strip punctuation, and lowercase it. words_gen now yields 3 copies of fox: fox, fox, fox. We've already removed the one un-punctuated version.

words_gen = (word.strip(punctuation).lower() for line in keywords
                                             for word in line.split())
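
If line-by-line iteration really was the intent, the hypothetical earlier version would presumably have looked something like this (my reconstruction):

# iterate over the actual lines of the file, then over the words in each line
words_gen = (word.strip(punctuation).lower()
             for line in linestring.split('\n')
             for word in line.split())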

The frequency analysis is pretty spot-on. This creates a histogram based on the words in the words_gen generator. Which ultimately gives us the N most uniquely punctuated words! In this example, fox=3:

words = {}
for word in words_gen:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]

So there's the what-went-wrong. Others have posted clear solutions for word frequency analysis, but I'm in a bit of a performance frame of mind, and came up with my own variant. First, split the text into words using a regular expression:

import re

# 1. assumes proper word spacing after punctuation; if not, something like
#    "I ate.I slept" will return "I", "ATEI", "SLEPT"
# 2. handles contractions properly. E.g., "don't" becomes "DONT"
# 3. removes any unexpected characters such as the Unicode non-breaking space
#    and non-printable ASCII characters (MS Word inserts ASCII 0x05 for
#    in-line review comments)
clean = re.sub(r"[^\w\s]+", "", text.upper())
words = clean.split()
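
For example (my quick check, not from the original answer):

>>> import re
>>> re.sub(r"[^\w\s]+", "", "Don't ask. Just run!".upper()).split()
['DONT', 'ASK', 'JUST', 'RUN']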

Now based on Python Performance Tips for Initializing Dictionary Entries (and my own measured performance), find the top N most frequent words:

from collections import defaultdict

# first create a dictionary that will count the number of words.
# using defaultdict(int) is the 2nd fastest method I measured but
# the most readable. It was very close in speed to the "if w not in freq" technique
freq = defaultdict(int)
for w in words:
    freq[w] += 1

# remove any of the common words by deleting common keys from the dictionary
for k in common_words:
    if k in freq:
        del freq[k]

# Ryan's original top-N selection was the fastest of several
# methods I tried including using dictview and lambda functions
# - sort the items by directly accessing item[1] (i.e., the value/frequency count)
top = sorted( freq.iteritems(), key=itemgetter(1), reverse=True)[:N]
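
For comparison, collections.Counter (Python 2.7+) expresses the count-and-select in one line; this is my aside, not one of the variants benchmarked above:

from collections import Counter

# filters the common words during counting instead of deleting them afterwards
top = Counter(w for w in words if w not in common_words).most_common(N)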

And to close with Dr. King's speech with all articles and pronouns removed:

('OF', 99)
('TO', 59)
('AND', 53)
('BE', 33)
('WE', 30)
('WILL', 27)
('THAT', 24)
('IS', 23)
('IN', 22)
('THIS', 20)

And, for kicks, my performance measurements:

Original                      ; 0:00:00.645000 ************
SortAllWords                  ; 0:00:00.571000 ***********
MyFind                        ; 0:00:00.870000 *****************
MyImprovedFind                ; 0:00:00.551000 ***********
DontInsertCommon              ; 0:00:00.649000 ************
JohnGainsJr                   ; 0:00:00.857000 *****************
ReturnImmediate               ; 0:00:00
SortWordsAndReverse           ; 0:00:00.572000 ***********
JustCreateDic_GetZero         ; 0:00:00.439000 ********
JustCreateDic_TryExcept       ; 0:00:00.732000 **************
JustCreateDic_IfNotIn         ; 0:00:00.309000 ******
JustCreateDic_defaultdict     ; 0:00:00.328000 ******
CreateDicAndRemoveCommon      ; 0:00:00.437000 ********

Cheers, E


I wrote some code recently that does something similar, although the style is very different from yours. Maybe it will help you out.

import string
import sys

def main():
    # load stop words, one per line, into a dict used as a set
    stopf = open('stop_words.txt', "r")
    stopwords = {}
    for s in stopf:
        stopwords[s.strip()] = 1

    # read the whole input file and split it on whitespace
    infile = open(sys.argv[1], "r")
    filedata = infile.read()
    words = filedata.split()

    # build a histogram, skipping stop words; print '*' every 1000 words
    # as a crude progress indicator
    histogram = {}
    count = 0
    for word in words:
        word = word.strip(string.punctuation).lower()
        if word in stopwords:
            continue
        histogram[word] = histogram.get(word, 0) + 1
        count = (count + 1) % 1000
        if count == 0:
            print '*',

    # sort (count, word) pairs descending and print the top 100
    flist = []
    for word, count in histogram.items():
        flist.append([count, word])
    flist.sort()
    flist.reverse()
    for pair in flist[0:100]:
        print "%30s: %4d" % (pair[1], pair[0])

main()


Not ideal, but works as a word frequency counter (which is what this appears to be aiming at):

from string import punctuation
from operator import itemgetter
import itertools

N = 10

linestring = open('test.txt', 'r').read()

common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))

words = [w.strip(punctuation) for w in linestring.lower().split()]

keywords = itertools.ifilterfalse(lambda w: w in common_words, words)

words = {}
for word in keywords:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
