What is the optimal way to process a very large (over 30GB) text file and also show progress?

[newbie question]

Hi,

I'm working on a huge text file which is well over 30GB.

I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop using "for", my computer crashes and displays a blue screen after about 10% of the data has been processed.

I'm currently using this:

f = open(file_path,'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()

Also, how can I show the overall progress of how much data has been crunched so far?

Thank you all very much.


File handles are iterable, and you should probably use a context manager. Try this:

with open(file_path, 'r') as fh:
    for line in fh:
        process(line)

That might be enough.
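Since the question also mentions writing each processed line to a db as JSON, a rough sketch of that loop might look like the following; do_some_processing and file_path come from the question, and db_insert is a hypothetical stand-in for whatever database write you actually use:

import json

with open(file_path, 'r') as fh:
    for line in fh:
        record = do_some_processing(line)  # asker's per-line processing
        db_insert(json.dumps(record))      # hypothetical db write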


I use a function like this for a similar problem. You can wrap any iterable with it.

Change this:

for one_line in f.readlines():

to this:

# Don't use readlines(): it builds a big list of all the data in memory
# rather than iterating one line at a time.
for one_line in progress_meter(f, 10000):

You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.

import time

def progress_meter(iterable, chunksize):
    """Prints progress through iterable at chunksize intervals."""
    scan_start = time.time()
    since_last = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            print(idx)
            print('avg rate', idx / (time.time() - scan_start))
            print('inst rate', chunksize / (time.time() - since_last))
            since_last = time.time()
            print()
        yield val

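For reference, wiring this into the original loop might look like the following; file_path and do_some_processing come from the question:

with open(file_path, 'r') as f:
    for one_line in progress_meter(f, 10000):
        do_some_processing(one_line)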

Using readline() requires finding the end of each line in your file. If some lines are very long, this can crash your interpreter, because there may not be enough memory to buffer the full line.

In order to show progress you can check the file size for example using:

import os

f = open(file_path, 'r')
fsize = os.fstat(f.fileno()).st_size  # os.fstat takes a file descriptor, not the file object

The progress of your task can then be the number of bytes processed divided by the file size, times 100, to get a percentage.
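A minimal sketch of that byte-counting idea, assuming the question's file_path and do_some_processing (reading in binary mode keeps the byte count exact):

import os

fsize = os.path.getsize(file_path)

with open(file_path, 'rb') as f:
    bytes_read = 0
    for lineno, raw_line in enumerate(f, start=1):
        bytes_read += len(raw_line)
        do_some_processing(raw_line.decode('utf-8'))
        if lineno % 100000 == 0:  # report every 100,000 lines
            print('%.1f%% done' % (100.0 * bytes_read / fsize))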
