I want to do this in Python but I'm stumped. I won't be able to load the whole file into RAM without things becoming unstable, so I want to read it line by line... Any advice would be appreciated.
If you do absolutely need to split the file, why not just use the *nix split utility?
http://ss64.com/bash/split.html
split -l 100000 inputfile
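A minimal sketch of what that looks like end to end (the file name inputfile and the 100-line chunk size are just for illustration; split's default output names are xaa, xab, ...):

```shell
# Create a small sample file (name is an arbitrary choice for this demo)
seq 1 250 > inputfile

# Split into chunks of 100 lines each; outputs are named xaa, xab, xac, ...
split -l 100 inputfile

# The chunks together contain every line of the original
wc -l xa*
```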
One idea could be the following:
import itertools

with open('the1gfile.txt') as inf:
    for i in itertools.count():
        with open('outfile%d.txt' % i, 'w') as ouf:
            for linenum, line in enumerate(inf):
                ouf.write(line)
                if linenum == 99999: break
            else:
                break
The with statement requires Python 2.6 or better, or 2.5 with a from __future__ import with_statement at the top of the module (that's the reason I'm using old-fashioned string formatting to make the output file names -- the new style wouldn't work in 2.5, and you don't tell us what Python version you want to use -- substitute the new-style formatting if your Python version supports it, of course;-).
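For reference, the two formatting styles mentioned produce identical file names; new-style str.format is available from Python 2.6 onward:

```python
i = 3
old_style = 'outfile%d.txt' % i          # works on Python 2.5 and later
new_style = 'outfile{0}.txt'.format(i)   # requires Python 2.6+
```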
itertools.count() yields 0, 1, 2, ... and so on, with no limit (that loop is terminated only when the conditional break at the very end finally executes).
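You can see count's behavior in isolation (islice is used here only to take a finite prefix of the infinite sequence; this toy snippet is not part of the answer's code):

```python
import itertools

# count() is an infinite iterator; islice takes just the first five values
first_five = list(itertools.islice(itertools.count(), 5))
```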
for linenum, line in enumerate(inf): reads one line at a time (with some buffering for speed) and sets linenum to 0, 1, 2, ... and so on -- and we break off that loop after 100,000 lines (next time, the for loop will continue reading exactly where this one left off).
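That resume-where-you-left-off behavior can be demonstrated on a small scale (a sketch using io.StringIO as a stand-in for the real file, with a chunk size of 2 instead of 100,000):

```python
from io import StringIO

# A small in-memory "file" standing in for the real input file
inf = StringIO('a\nb\nc\nd\ne\n')

first_chunk = []
for linenum, line in enumerate(inf):
    first_chunk.append(line)
    if linenum == 1:   # stop after 2 lines (like linenum == 99999 above)
        break

# A second enumerate(inf) resumes at the third line; linenum restarts at 0
second_chunk = [line for linenum, line in enumerate(inf)]
```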
The for loop's else: clause executes if and only if the break within that loop didn't -- therefore, when we've read fewer than 100,000 lines, i.e., when the input file is finished. Note that there will be one empty output file if the number of lines in the input file is an exact multiple of 100,000.
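The for/else semantics can be seen in isolation (a toy example, not part of the answer's code):

```python
results = []
for data in (['x', 'y'], ['x', 'y', 'z', 'w']):
    for n, item in enumerate(data):
        if n == 2:      # break upon reaching a third element
            results.append('broke')
            break
    else:               # runs only if the inner loop was never broken
        results.append('finished')
```

The two-element list exhausts the inner loop, so its else: fires; the four-element list triggers the break, so the else: is skipped.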
I hope this makes every part of the mechanism sufficiently clear for you...?