开发者

Add new keys to a dictionary while incrementing existing values

开发者 https://www.devze.com 2023-01-22 20:09 出处:网络
I am processing a CSV file and count开发者_如何学JAVAing the unique values of column 4. So far I have coded this three ways. One uses \"if key in dictionary\", the second traps the KeyError and the th

I am processing a CSV file and count开发者_如何学JAVAing the unique values of column 4. So far I have coded this three ways. One uses "if key in dictionary", the second traps the KeyError and the third uses "DefaultDictionary". For example (where x[3] is the value from the file and "a" is a dictionary):

First way:

if x[3] in a:
    a[x[3]] += 1
else:
    a[x[3]] = 1

Second way:

try:
    b[x[3]] += 1
except KeyError:
    b[x[3]] = 1

Third way:

from collections import defaultdict
c = defaultdict(int)
c[x[3]] += 1

My question is: which way is more efficient... cleaner... better... etc. Or is there a better way. Both ways work and give the same answer, but I thought I would tap the hive mind as a learning case.

Thanks -


Use collections.Counter. Counter is syntactic sugar for defaultdict(int), but what's cool about it is that it accepts an iterable in the constructor, thus saving an extra step (I assume all of your examples above are wrapped in a for-loop.)

from collections import Counter
count = Counter(x[3] for x in my_csv_reader)

Prior to the introduction of collections.Counter, collections.defaultdict was the most idiomatic for this task, so for users < 2.7, use defaultdict.

from collections import defaultdict
count = defaultdict(int)
for x in my_csv_reader:
    count[x[3]] += 1


You asked which was more efficient. Assuming that you are talking about execution speed: If your data is small, it doesn't matter. If it is large and typical, the "already exists" case will happen much more often than the "not in dict" case. This observation explains some of the results.

Below is some code which can be used with the timeit module to explore speed without file-reading overhead. I have taken the liberty of adding a 5th method, which is not uncompetetive and will run on any Python from at least 1.5.2 [tested] onwards.

from collections import defaultdict, Counter

def tally0(iterable):
    # DOESN'T WORK -- common base case for timing
    d = {}
    for item in iterable:
        d[item] = 1
    return d

def tally1(iterable):
    d = {}
    for item in iterable:
        if item in d:
            d[item] += 1
        else:
            d[item] = 1
    return d

def tally2(iterable):
    d = {}
    for item in iterable:
        try:
            d[item] += 1
        except KeyError:
            d[item] = 1
    return d

def tally3(iterable):
    d = defaultdict(int)
    for item in iterable:
        d[item] += 1

def tally4(iterable):
    d = Counter()
    for item in iterable:
        d[item] += 1

def tally5(iterable):
    d = {}
    dg = d.get
    for item in iterable:
        d[item] = dg(item, 0) + 1
    return d

Typical run (in Windows XP "Command Prompt" window):

prompt>\python27\python -mtimeit -s"t=1000*'now is the winter of our discontent made glorious summer by this son of york';import tally_bench as tb" "tb.tally1(t)"
10 loops, best of 3: 29.5 msec per loop

Here are the results (msec per loop):

0 base case   13.6
1 if k in d   29.5
2 try/except  26.1
3 defaultdict 23.4
4 Counter     79.4
5 d.get(k, 0) 29.2

Another timing trial:

prompt>\python27\python -mtimeit -s"from collections import defaultdict;d=defaultdict(int)" "d[1]+=1"
1000000 loops, best of 3: 0.309 usec per loop

prompt>\python27\python -mtimeit -s"from collections import Counter;d=Counter()" "d[1]+=1"
1000000 loops, best of 3: 1.02 usec per loop

The speed of Counter is possibly due to it being implemented partly in Python code whereas defaultdict is entirely in C (in 2.7, at least).

Note that Counter() is NOT just "syntactic sugar" for defaultdict(int) -- it implements a full bag aka multiset object -- see the docs for details; they may save you from reinventing the wheel if you need some fancy post-processing. If all you want to do is count things, use defaultdict.

Update in response to a question from @Steven Rumbalski: """ I'm curious, what happens if you move the iterable into the Counter constructor: d = Counter(iterable)? (I have python 2.6 and cannot test it.) """

tally6: just does d = Count(iterable); return d, takes 60.0 msecs

You could look at the source (collections.py in the SVN repository) ... here's what my Python27\Lib\collections.py does when iterable is not a Mapping instance:

            self_get = self.get
            for elem in iterable:
                self[elem] = self_get(elem, 0) + 1

Seen that code anywhere before? There's a whole lot of carry-on just to call code that's runnable in Python 1.5.2 :-O


from collections import Counter
Counter(a)


Since you don't have access to Counter, your best bet is your third approach. It's much cleaner and easier to read. In addition, it doesn't have the perpetual testing (and branching) that the first two approaches have, which makes it more efficient.


Use setdefault.

a[x[3]] = a.setdefault(x[3], 0) + 1

setdefault gets the value of the specified key (x[3] in this case), or if it does not exist, the specified value (0 in this case).

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号