I am processing a CSV file and count开发者_如何学JAVAing the unique values of column 4. So far I have coded this three ways. One uses "if key in dictionary", the second traps the KeyError and the third uses "DefaultDictionary". For example (where x[3] is the value from the file and "a" is a dictionary):
First way:
if x[3] in a:
a[x[3]] += 1
else:
a[x[3]] = 1
Second way:
try:
b[x[3]] += 1
except KeyError:
b[x[3]] = 1
Third way:
from collections import defaultdict
c = defaultdict(int)
c[x[3]] += 1
My question is: which way is more efficient... cleaner... better... etc. Or is there a better way. Both ways work and give the same answer, but I thought I would tap the hive mind as a learning case.
Thanks -
Use collections.Counter
. Counter
is syntactic sugar for defaultdict(int)
, but what's cool about it is that it accepts an iterable in the constructor, thus saving an extra step (I assume all of your examples above are wrapped in a for-loop.)
from collections import Counter
count = Counter(x[3] for x in my_csv_reader)
Prior to the introduction of collections.Counter
, collections.defaultdict
was the most idiomatic for this task, so for users < 2.7, use defaultdict
.
from collections import defaultdict
count = defaultdict(int)
for x in my_csv_reader:
count[x[3]] += 1
You asked which was more efficient. Assuming that you are talking about execution speed: If your data is small, it doesn't matter. If it is large and typical, the "already exists" case will happen much more often than the "not in dict" case. This observation explains some of the results.
Below is some code which can be used with the timeit
module to explore speed without file-reading overhead. I have taken the liberty of adding a 5th method, which is not uncompetetive and will run on any Python from at least 1.5.2 [tested] onwards.
from collections import defaultdict, Counter
def tally0(iterable):
# DOESN'T WORK -- common base case for timing
d = {}
for item in iterable:
d[item] = 1
return d
def tally1(iterable):
d = {}
for item in iterable:
if item in d:
d[item] += 1
else:
d[item] = 1
return d
def tally2(iterable):
d = {}
for item in iterable:
try:
d[item] += 1
except KeyError:
d[item] = 1
return d
def tally3(iterable):
d = defaultdict(int)
for item in iterable:
d[item] += 1
def tally4(iterable):
d = Counter()
for item in iterable:
d[item] += 1
def tally5(iterable):
d = {}
dg = d.get
for item in iterable:
d[item] = dg(item, 0) + 1
return d
Typical run (in Windows XP "Command Prompt" window):
prompt>\python27\python -mtimeit -s"t=1000*'now is the winter of our discontent made glorious summer by this son of york';import tally_bench as tb" "tb.tally1(t)"
10 loops, best of 3: 29.5 msec per loop
Here are the results (msec per loop):
0 base case 13.6
1 if k in d 29.5
2 try/except 26.1
3 defaultdict 23.4
4 Counter 79.4
5 d.get(k, 0) 29.2
Another timing trial:
prompt>\python27\python -mtimeit -s"from collections import defaultdict;d=defaultdict(int)" "d[1]+=1"
1000000 loops, best of 3: 0.309 usec per loop
prompt>\python27\python -mtimeit -s"from collections import Counter;d=Counter()" "d[1]+=1"
1000000 loops, best of 3: 1.02 usec per loop
The speed of Counter
is possibly due to it being implemented partly in Python code whereas defaultdict
is entirely in C (in 2.7, at least).
Note that Counter()
is NOT just "syntactic sugar" for defaultdict(int)
-- it implements a full bag
aka multiset
object -- see the docs for details; they may save you from reinventing the wheel if you need some fancy post-processing. If all you want to do is count things, use defaultdict
.
Update in response to a question from @Steven Rumbalski: """ I'm curious, what happens if you move the iterable into the Counter constructor: d = Counter(iterable)? (I have python 2.6 and cannot test it.) """
tally6: just does d = Count(iterable); return d
, takes 60.0 msecs
You could look at the source (collections.py in the SVN repository) ... here's what my Python27\Lib\collections.py
does when iterable
is not a Mapping instance:
self_get = self.get
for elem in iterable:
self[elem] = self_get(elem, 0) + 1
Seen that code anywhere before? There's a whole lot of carry-on just to call code that's runnable in Python 1.5.2 :-O
from collections import Counter
Counter(a)
Since you don't have access to Counter, your best bet is your third approach. It's much cleaner and easier to read. In addition, it doesn't have the perpetual testing (and branching) that the first two approaches have, which makes it more efficient.
Use setdefault
.
a[x[3]] = a.setdefault(x[3], 0) + 1
setdefault
gets the value of the specified key (x[3]
in this case), or if it does not exist, the specified value (0
in this case).
精彩评论