I need to read in and process a bunch of ~40MB gzipped text files, and I need it done fast and with minimal I/O overhead (as the volumes are used by others as well). The fastest way I've found thus far for this task looks like this:
    from subprocess import Popen, PIPE

    def gziplines(fname):
        f = Popen(['zcat', fname], stdout=PIPE)
        for line in f.stdout:
            yield line
and then:
    for line in gziplines(filename):
        dostuff(line)
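For comparison, the same generator can also be written with the standard-library gzip module, wrapped in io.BufferedReader to speed up line iteration (whether this beats the zcat pipe depends on the Python version, so treat it as a sketch to benchmark against, not a known winner):

```python
import gzip
import io

def gziplines_stdlib(fname):
    # gzip.open yields decompressed bytes; BufferedReader makes
    # line-by-line iteration over it noticeably cheaper on some
    # Python versions.
    with gzip.open(fname, 'rb') as raw:
        for line in io.BufferedReader(raw):
            yield line
```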
but what I would like to do (if this is faster?) is something like this:
    import mmap
    from subprocess import Popen, PIPE

    def gzipmmap(fname):
        f = Popen(['zcat', fname], stdout=PIPE)
        m = mmap.mmap(f.stdout.fileno(), 0, access=mmap.ACCESS_READ)
        return m
sadly, when I try this, I get this error:
    >>> m = mmap.mmap(f.stdout.fileno(), 0, access=mmap.ACCESS_READ)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    mmap.error: [Errno 19] No such device
even though, when I try:
    >>> f.stdout.fileno()
    4
So, I think I have a basic misunderstanding of what is going on here. :(
The two questions are:
1) Would this mmap be a faster way of putting the whole file into memory for processing?
2) How can I achieve this?
Thank you very much... everyone here has been incredibly helpful already! ~Nik
From the mmap(2) man page:

    ENODEV The underlying file system of the specified file does not support memory mapping.
You cannot mmap streams, only real files or anonymous swap space. You will need to read from the stream into memory yourself.
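A minimal sketch of doing that read yourself, assuming the decompressed file fits comfortably in RAM: pull the entire output of the zcat pipe into a single bytes object (the function name here is illustrative, not from the question):

```python
from subprocess import Popen, PIPE

def gzip_to_bytes(fname):
    # Read the whole decompressed stream from the zcat pipe
    # into memory; read() blocks until the pipe is drained.
    p = Popen(['zcat', fname], stdout=PIPE)
    data = p.stdout.read()
    p.wait()  # reap the child process
    return data
```

The resulting bytes object supports slicing and .splitlines(), which covers most of what a read-only mmap would give you.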
Pipes aren't mmapable.
    case MAP_PRIVATE:
        ...
        if (!file->f_op || !file->f_op->mmap)
            return -ENODEV;
and a pipe's file_operations structure does not contain an mmap hook.
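Since only regular files (or anonymous memory) can be mapped, one workaround is to decompress into a temporary file first and mmap that. A sketch, assuming enough scratch space on disk (the helper name is made up for illustration):

```python
import gzip
import mmap
import shutil
import tempfile

def gzipmmap_via_tempfile(fname):
    # Decompress into an unlinked temporary file, which lives on a
    # real file system and therefore supports mmap.
    tmp = tempfile.TemporaryFile()
    with gzip.open(fname, 'rb') as src:
        shutil.copyfileobj(src, tmp)
    tmp.flush()
    # mmap duplicates the descriptor, so the mapping stays valid
    # even after tmp is garbage-collected.
    return mmap.mmap(tmp.fileno(), 0, access=mmap.ACCESS_READ)
```

Note this costs one full write plus one full read of the decompressed data, so it is only worth it if you genuinely need random access; for sequential line processing the pipe approach above avoids that extra I/O.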