I have a file > 1g, I want to split it into files with 100k lines each

Developer https://www.devze.com 2023-01-12 15:45 Source: Internet
I want to do this in Python but I'm stumped. I won't be able to load the whole file into RAM without things becoming unstable, so I want to read it line by line... Any advice would be appreciated.


If you do absolutely need to split the file, why not just use the *nix split utility?

http://ss64.com/bash/split.html

split -l 100000 inputfile


One idea could be the following:

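import itertools

with open('the1gfile.txt') as inf:
  for i in itertools.count():  # i = 0, 1, 2, ... numbers the output files
    with open('outfile%d.txt' % i, 'w') as ouf:
      for linenum, line in enumerate(inf):
        ouf.write(line)
        if linenum == 99999: break  # 100,000 lines written; start the next file
      else:
        break  # input ran out before 100,000 lines: stop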

The with statement requires Python 2.6 or later, or 2.5 with from __future__ import with_statement at the top of the module. That's why I'm using old-fashioned % string formatting to make the output file names: the new-style formatting wouldn't work in 2.5, and you don't tell us which Python version you want to use. Substitute the new-style formatting if your Python version supports it, of course ;-).
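For what it's worth, here is a minimal sketch of the new-style alternative, assuming Python 2.6 or later. The helper name outfile_name is hypothetical and only there for illustration; it should produce the same names as 'outfile%d.txt' % i.

```python
def outfile_name(i):
    # Same result as 'outfile%d.txt' % i, built with str.format (Python 2.6+).
    return 'outfile{0}.txt'.format(i)

# Names for the first three output files.
names = [outfile_name(i) for i in range(3)]
```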

itertools.count() yields 0, 1, 2, ... and so on, with no limit (that loop is terminated only when the conditional break at the very end finally executes).
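To see the unbounded counting in isolation, a small sketch can take just the first few values with itertools.islice (the name first_five is only for illustration):

```python
import itertools

# count() never stops on its own, so slice off the first five values.
first_five = list(itertools.islice(itertools.count(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```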

for linenum, line in enumerate(inf): reads one line at a time (with some buffering for speed) and sets linenum to 0, 1, 2, ... and so on - and we break off that loop after 100,000 lines (next time, the for loop will continue reading exactly where this one left off).
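The "continues where it left off" behaviour is easy to check on a small scale. The sketch below uses io.StringIO as a stand-in for the real input file (an assumption made so the example needs no file on disk); a file opened with open() behaves the same way when iterated.

```python
import io

# In-memory stand-in for the input file, five lines long.
inf = io.StringIO(u"a\nb\nc\nd\ne\n")

first = []
for linenum, line in enumerate(inf):
    first.append(line)
    if linenum == 1:  # stop after two lines
        break

# A second loop over the same object picks up at the third line.
rest = [line for line in inf]
```

Note that enumerate starts again from 0 each time it is called, which is why the comparison against 99999 works for every output file.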

The for loop's else: clause executes if and only if the break within that loop didn't, that is, if we've read fewer than 100,000 lines because the input file is finished. Note that there will be one empty output file if the number of lines in the input file is an exact multiple of 100,000.
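If the empty trailing file is a problem, one possible variant (a sketch, not the code above) reads each chunk with itertools.islice before opening the output file, so a file is only created when there is something to write. The function name split_file and its parameters are hypothetical. The trade-off is that up to lines_per_file lines are held in memory at once, which should be fine for 100,000 ordinary lines but is still an assumption about your data.

```python
import itertools

def split_file(path, lines_per_file=100000, pattern='outfile%d.txt'):
    """Split path into numbered chunks; return the number of files written."""
    count = 0
    with open(path) as inf:
        while True:
            # Pull at most lines_per_file lines from the current position.
            chunk = list(itertools.islice(inf, lines_per_file))
            if not chunk:
                break  # input exhausted: do not create an empty file
            with open(pattern % count, 'w') as ouf:
                ouf.writelines(chunk)
            count += 1
    return count
```

Calling split_file('the1gfile.txt') would then write outfile0.txt, outfile1.txt, and so on, and return how many files it created.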

I hope this makes every part of the mechanism sufficiently clear for you...?
