I'm interested in reading fixed width text files in Python in as efficient a manner as I can. Specifically, most of the time I'm interested in one or more columns in the flat file but not entire records.
It strikes me as inefficient to read the file a line at a time and extract the desired columns after reading the entire line into memory. I think I'd rather have the option of reading only the desired columns, top to bottom, left to right (instead of reading left to right, top to bottom).
Is such a thing desirable, and if so, is it possible?
Files are laid out as a (one-dimensional) sequence of bytes. 'Lines' are just a convenience we added to make things easy to read for humans. So, in general, what you're asking is not possible on plain files. To pull this off, you would need some way of finding where a record starts. The two most common ways are:
- Search for newline symbols (in other words, read the entire file).
- Use a specially spaced layout, so that each record is laid out using a fixed width. That way, you can use low-level file operations, like seek, to go directly to where you need to go. This avoids reading the entire file, but is painful to do manually.
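A minimal sketch of the seek approach, assuming every record occupies exactly the same number of bytes including its line terminator. The function name read_column and its parameters are made up for illustration. Note that Python's buffered reader and the OS still fetch whole blocks from disk, so for short records this may not read any less data than a plain line loop.

```python
def read_column(path, record_len, col_start, col_len):
    """Yield one fixed-width column from every record in a file.

    Assumption: each record is exactly record_len bytes long, counting
    the line terminator, and the column starts col_start bytes into it.
    """
    with open(path, "rb") as f:
        offset = col_start
        while True:
            f.seek(offset)            # jump straight to the field
            field = f.read(col_len)   # read only the field's bytes
            if len(field) < col_len:  # ran off the end of the file
                break
            yield field
            offset += record_len      # same column, next record
```

For a file whose records are 9 bytes long, list(read_column(path, 9, 3, 2)) would collect bytes 3 and 4 of every record.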
I wouldn't worry too much about file reading performance unless it becomes a problem. Yes, you could memory map the file, but your OS probably already caches for you. Yes, you could use a database format (e.g., the sqlite3 file format through sqlalchemy), but it probably isn't worth the hassle.
Side note on "fixed width": what precisely do you mean by this? If you really mean "every column always starts at the same offset relative to the start of the record", then you can definitely use Python's seek to skip past data that you are not interested in.
How big are the lines? Unless each record is huge, reading only the fields you're interested in, rather than the whole line, is unlikely to make much difference.
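For comparison, the straightforward version reads each line and slices out the column. This is only a sketch: the function name column_by_slicing is made up, and the ASCII encoding is an assumption so that character positions match byte offsets.

```python
def column_by_slicing(path, col_start, col_end):
    """Yield characters col_start..col_end (end exclusive) of each line.

    Assumption: the file uses a single-byte encoding (ASCII here), so
    slicing by character position is the same as slicing by byte offset.
    """
    with open(path, "r", encoding="ascii") as f:
        for line in f:  # the file object buffers reads internally
            yield line[col_start:col_end]
```

Timing this against a seek-based version on the real data is the simplest way to find out whether the extra complexity pays off.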
For big files with fixed formatting, you might get something out of mmapping the file. I've only done this with C rather than Python, but it seems like mmapping the file then accessing the appropriate fields directly is likely to be reasonably efficient.
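In Python the same idea might look like the sketch below, using the standard mmap module. The function name and parameters are hypothetical, and it again assumes that every record is exactly record_len bytes long including the line terminator.

```python
import mmap

def read_column_mmap(path, record_len, col_start, col_len):
    """Return one fixed-width column from every record as a list of bytes.

    Assumption: records are exactly record_len bytes each. An empty file
    cannot be mapped; mmap raises ValueError in that case.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            last_start = len(mm) - col_len  # last offset a field can start at
            return [
                mm[pos:pos + col_len]  # slicing a mmap returns bytes
                for pos in range(col_start, last_start + 1, record_len)
            ]
```

The list is built before the mapping is closed, which is why this version returns the values rather than yielding them lazily.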
Flat files are not well suited to what you're trying to do. My suggestion is to convert the files to an SQL database (using sqlite3) and then read just the columns you want. SQLite3 is blazing fast.
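A rough sketch of that conversion with the standard sqlite3 module. The helper load_fixed_width, the table name records, and the (name, start, end) layout spec are all invented for illustration; the real column positions would come from your file's format. Every field is stored as TEXT with surrounding whitespace stripped.

```python
import sqlite3

def load_fixed_width(db, path, spec, table="records"):
    """Load a fixed-width file into an SQLite table.

    spec is a hypothetical layout: a list of (column_name, start, end)
    tuples with end exclusive. Table and column names are interpolated
    into the SQL, so they must come from trusted code, not user input.
    """
    cols = ", ".join(f'"{name}" TEXT' for name, _, _ in spec)
    db.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in spec)
    with open(path, "r", encoding="ascii") as f:
        rows = (
            tuple(line[start:end].strip() for _, start, end in spec)
            for line in f
        )
        # executemany consumes the generator while the file is still open.
        db.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    db.commit()
```

After loading once, a single column can be fetched with an ordinary query such as db.execute('SELECT code FROM records'), assuming the spec defined a column named code.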
If it's truly fixed width, then you should be able to just call read(N) to skip past the fixed number of bytes from the end of your column on one line to the start of it on the next.