Python string formatting too slow_问答_开发者_运维开发者技术经验分享

I use the following code to log a map, it is fast when it only contains zeroes, but as soon as there is actual data in the map it becomes unbearably slow... Is there any way to do this faster?

log_file = open('testfile', 'w')
for i, x in ((i, start + i * interval) for i in range(length)):
    log_file.write('%-5d %8.3f %13g %13g %13g %13g %13g %13g\n' % (i, x,
        map[0][i], map[1][i], map[2][i], map[3][开发者_运维问答i], map[4][i], map[5][i]))

I suggest you run your code using the cProfile module and postprocess the results as described on http://docs.python.org/library/profile.html . This will let you know exactly how much time is spent in the call to str.__mod__ for the string formatting and how much is spent doing other things, like writing the file and doing the __getitem__ lookups for map[0][i] and such.

First I checked % against backquoting. % is faster. THen I checked % (tuple) against 'string'.format(). An initial bug made me think it was faster. But no. % is faster.

So, you are already doing your massive pile of float-to-string conversions the fastest way you can do it in Python.

The Demo code below is ugly demo code. Please don't lecture me on xrange versus range or other pedantry. KThxBye.

My ad-hoc and highly unscientific testing indicates that (a) % (1.234,) operations on Python 2.5 on linux is faster than % (1.234,...) operation Python 2.6 on linux, for the test code below, with the proviso that the attempt to use 'string'.format() won't work on python versions before 2.6. And so on.

# this code should never be used in production.
# should work on linux and windows now.

import random
import timeit
import os
import tempfile


start = 0
interval = 0.1

amap = [] # list of lists
tmap = [] # list of tuples

def r():
    return random.random()*500

for i in xrange(0,10000):
        amap.append ( [r(),r(),r(),r(),r(),r()] )

for i in xrange(0,10000):
        tmap.append ( (r(),r(),r(),r(),r(),r()) )




def testme_percent():
    log_file = tempfile.TemporaryFile()
    try:
        for qmap in amap:
            s = '%g %g %g %g %g %g \n' % (qmap[0], qmap[1], qmap[2], qmap[3], qmap[4], qmap[5]) 
            log_file.write( s)
    finally:
        log_file.close();

def testme_tuple_percent():
    log_file = tempfile.TemporaryFile()
    try:    
        for qtup in tmap:
            s = '%g %g %g %g %g %g \n' % qtup
            log_file.write( s );
    finally:
        log_file.close();

def testme_backquotes_rule_yeah_baby():
    log_file = tempfile.TemporaryFile()
    try:
        for qmap in amap:
            s = `qmap`+'\n'
            log_file.write( s );
    finally:
        log_file.close();        

def testme_the_new_way_to_format():
    log_file = tempfile.TemporaryFile()
    try:
        for qmap in amap:
            s = '{0} {1} {2} {3} {4} {5} \n'.format(qmap[0], qmap[1], qmap[2], qmap[3], qmap[4], qmap[5]) 
            log_file.write( s );
    finally:
        log_file.close();

# python 2.5 helper
default_number = 50 
def _xtimeit(stmt="pass",  timer=timeit.default_timer,
           number=default_number):
    """quick and dirty"""
    if stmt<>"pass":
        stmtcall = stmt+"()"
        ssetup = "from __main__ import "+stmt
    else:
        stmtcall = stmt
        ssetup = "pass"
    t = timeit.Timer(stmtcall,setup=ssetup)
    try:
      return t.timeit(number)
    except:
      t.print_exc()


# no formatting operation in testme2

print "now timing variations on a theme"

#times = []
#for i in range(0,10):

n0 = _xtimeit( "pass",number=50)
print "pass = ",n0

n1 = _xtimeit( "testme_percent",number=50);
print "old style % formatting=",n1

n2 = _xtimeit( "testme_tuple_percent",number=50);
print "old style % formatting with tuples=",n2

n3 = _xtimeit( "testme_backquotes_rule_yeah_baby",number=50);
print "backquotes=",n3

n4 = _xtimeit( "testme_the_new_way_to_format",number=50);
print "new str.format conversion=",n4


#        times.append( n);




print "done"

I think you could optimize your code by building your TUPLES of floats somewhere else, wherever you built that map, in the first place, build your tuple list, and then applying the fmt_string % tuple this way:

for tup in mytups:
    log_file.write( fmt_str % tup )

I was able to shave the 8.7 seconds down to 8.5 seconds by dropping the making-a-tuple part out of the for loop. Which ain't much. The big boy there is floating point formatting, which I believe is always going to be expensive.

Alternative:

Have you considered NOT writing such huge logs as text, and instead, saving them using the fastest "persistence" method available, and then writing a short utility to dump them to text, when needed? Some people use NumPy with very large numeric data sets, and it does not seem they would use a line-by-line dump to store their stuff. See:

http://thsant.blogspot.com/2007/11/saving-numpy-arrays-which-is-fastest.html

Without wishing to wade into the optimize-this-code morass, I would have written the code more like this:

log_file = open('testfile', 'w')
x = start
map_iter = zip(range(length), map[0], map[1], map[2], map[3], map[4], map[5])
fmt = '%-5d %8.3f %13g %13g %13g %13g %13g %13g\n'
for i, m0, m1, m2, m3, m4, m5 in mapiter:
    s = fmt % (i, x, m0, m1, m2, m3, m4, m5)
    log_file.write(s)
    x += interval

But I will weigh in with the recommendation that you not name variables after python builtins, like map.