According to a section in this presumably accurate book,
A common use of pipes is to read a compressed file incrementally; that is, without uncompressing the whole thing at once. The following function takes the name of a compressed file as a parameter and returns a pipe that uses gunzip to decompress the contents:
import os

def open_gunzip(filename):
    # gunzip -c writes the decompressed data to stdout instead of to a file.
    cmd = 'gunzip -c ' + filename
    fp = os.popen(cmd)
    return fp
If you read lines from fp one at a time, you never have to store the uncompressed file in memory or on disk.
Maybe I'm just interpreting this wrong, but I don't see how this is possible. Python couldn't have any means of pausing gunzip halfway through spitting out the results, right? I assume gunzip isn't going to block until a line of output is read before continuing to output more lines, so some buffer has to be capturing all of this (whether inside the Python interpreter or in the OS, whether in memory or on disk), meaning the uncompressed file is being stored somewhere in full...right?
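Your assumption is faulty. gunzip does not have to see the entire file to decompress it. Read the gzip file format: it is a stream, a short header followed by DEFLATE-compressed blocks that can be decoded in order as they arrive. (The directory with offsets to individual components belongs to the ZIP archive format, not gzip.)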
It's possible to unzip a file in pieces.
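To illustrate decompressing in pieces, here is a minimal sketch that uses Python's own zlib module rather than the gunzip program. The helper name iter_decompressed and the 64-byte chunk size are made up for the example; the point is only that each compressed chunk can be decoded as it arrives.

```python
import gzip
import zlib


def iter_decompressed(chunks):
    """Yield decompressed pieces from an iterable of gzip-compressed chunks."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        piece = decomp.decompress(chunk)
        if piece:
            yield piece
    tail = decomp.flush()
    if tail:
        yield tail


original = b"one line of text\n" * 10000
compressed = gzip.compress(original)

# Feed the compressed bytes 64 at a time, as if they arrived from a slow source.
small_chunks = (compressed[i:i + 64] for i in range(0, len(compressed), 64))

# Joined here only so the result can be compared with the input; a real
# consumer would process each piece and then discard it.
restored = b"".join(iter_decompressed(small_chunks))
```

The decompressor never needs the whole compressed input at once, only the chunk it is currently given plus its own internal state.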
"uncompressed file is being stored somewhere in full...right?"
Not necessarily. Not sure why you're assuming it or where you read it.
All low-level I/O calls can block. The write in gunzip -- when writing to a pipe -- can block when the pipe buffer is full. That's the way I/O to a pipe is defined. Pipe I/O blocks.
Check the man pages for pipe for details.
If a process attempts to read from an empty pipe, then read(2) will
block until data is available. If a process attempts to write to a
full pipe (see below), then write(2) blocks until sufficient data has
been read from the pipe to allow the write to complete. Non-blocking
I/O is possible by using the fcntl(2) F_SETFL operation to enable the
O_NONBLOCK open file status flag.
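The finite pipe buffer described in the man page can be observed directly. This is a rough sketch for a Unix-like system, assuming Python 3.5 or later for os.set_blocking; the helper fill_pipe is invented for the example. It puts the write end in non-blocking mode so that a full pipe raises an error instead of suspending the process.

```python
import os


def fill_pipe(write_fd, chunk=b"x" * 1024):
    """Write to a non-blocking pipe until the kernel refuses more data.

    Returns the number of bytes accepted before the pipe was full.
    """
    total = 0
    while True:
        try:
            total += os.write(write_fd, chunk)
        except BlockingIOError:
            # A blocking writer (such as gunzip) would be suspended here.
            return total


read_fd, write_fd = os.pipe()
os.set_blocking(write_fd, False)  # sets O_NONBLOCK on the write end

capacity = fill_pipe(write_fd)

# Draining the pipe frees space, so the writer can make progress again.
drained = 0
while drained < capacity:
    drained += len(os.read(read_fd, 65536))
accepted_after_drain = os.write(write_fd, b"more")

os.close(read_fd)
os.close(write_fd)
```

The value of capacity depends on the operating system (often 64 KiB on Linux), which is why the uncompressed file is never held in the pipe in full.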
This really comes from the gunzip implementation, not from Python. It is written in C and probably uses fwrite() from C's stdio.h to write its output. The libc6 implementation I use automatically creates an output buffer; when that buffer fills, fwrite() flushes it with write(), which blocks until the pipe can accept more data.
It's not Python that is suspending gunzip; it's that the kernel will stop executing gunzip when it tries writing (using the write() syscall) to a full buffer. This is called blocking on I/O. The kernel maintains an internal buffer connecting the two ends of the pipeline, independent of any buffering happening in the processes that are writing to or reading from the pipe.
Python will similarly block when reading from a pipe that has an empty buffer, i.e. one that doesn't currently have any data from gunzip written to it.
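Both halves of this can be seen in a small experiment. The sketch below uses a second Python process as a hypothetical stand-in for gunzip (so it does not depend on gunzip being installed) and assumes the pipe holds well under 4 MiB, which is true of default settings on common systems.

```python
import subprocess
import sys
import time

# Stand-in for gunzip: a child that tries to write 4 MiB to stdout at once.
TOTAL = 4 * 1024 * 1024
child_code = f"import sys; sys.stdout.buffer.write(b'x' * {TOTAL})"

proc = subprocess.Popen([sys.executable, "-c", child_code], stdout=subprocess.PIPE)

# Read nothing for a moment. The child cannot finish, because the pipe
# holds far less than 4 MiB and its write() is blocked in the kernel.
time.sleep(1.0)
still_running = proc.poll() is None

# Now consume the output in pieces; each read frees room in the pipe
# and lets the child continue writing.
received = 0
while True:
    piece = proc.stdout.read(65536)
    if not piece:
        break
    received += len(piece)

proc.stdout.close()
exit_code = proc.wait()
```

The parent never holds more than 64 KiB of the child's output at a time, and the child simply waits whenever the parent falls behind.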
Pipes can be seen as a solution to the producer-consumer problem.