What is the optimal way to process a very large (over 30GB) text file and also show progress?

[newbie question]

Hi,

I'm working on a huge text file which is well over 30GB.

I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop using "for", my computer crashes and displays a blue screen after about 10% of the data has been processed.

I'm currently using this:

f = open(file_path,'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()

Also, how can I show the overall progress of how much data has been crunched so far?

Thank you all very much.


File handles are iterable, and you should probably use a context manager. Try this:

with open(file_path, 'r') as fh:
    for line in fh:
        process(line)

That might be enough.
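Since the question also mentions writing each processed line to a db as JSON, a rough sketch of that loop might look like the following; do_some_processing and file_path come from the question, and db_insert is a hypothetical stand-in for whatever database write you actually use:

import json

with open(file_path, 'r') as fh:
    for line in fh:
        record = do_some_processing(line)  # asker's per-line processing
        db_insert(json.dumps(record))      # hypothetical db write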


I use a function like this for a similar problem. You can wrap any iterable with it.

Change this:

for one_line in f.readlines():

to this:

# Don't use readlines(): it builds a big list of all the data in memory
# rather than iterating one line at a time.
for one_line in progress_meter(f, 10000):

You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.

import time

def progress_meter(iterable, chunksize):
    """Prints progress through iterable at chunksize intervals."""
    scan_start = time.time()
    since_last = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            print(idx)
            print('avg rate', idx / (time.time() - scan_start))
            print('inst rate', chunksize / (time.time() - since_last))
            since_last = time.time()
            print()
        yield val

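For reference, wiring this into the original loop might look like the following; file_path and do_some_processing come from the question:

with open(file_path, 'r') as f:
    for one_line in progress_meter(f, 10000):
        do_some_processing(one_line)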

Using readline() requires finding the end of each line in your file. If some lines are very long, this can crash your interpreter, because there may not be enough memory to buffer the full line.

In order to show progress you can check the file size for example using:

import os

f = open(file_path, 'r')
fsize = os.fstat(f.fileno()).st_size  # os.fstat takes a file descriptor, not the file object

The progress of your task can then be the number of bytes processed divided by the file size, times 100, to get a percentage.
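A minimal sketch of that byte-counting idea, assuming the question's file_path and do_some_processing (reading in binary mode keeps the byte count exact):

import os

fsize = os.path.getsize(file_path)

with open(file_path, 'rb') as f:
    bytes_read = 0
    for lineno, raw_line in enumerate(f, start=1):
        bytes_read += len(raw_line)
        do_some_processing(raw_line.decode('utf-8'))
        if lineno % 100000 == 0:  # report every 100,000 lines
            print('%.1f%% done' % (100.0 * bytes_read / fsize))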
