Run-time-critical reading of CSV files in C

Is there a way to code a swift, efficient way of reading CSV files? (The point to note here: I am talking about a CSV file with a million+ lines.)

Run time is the critical metric here.

One resource on the internet concentrated on using binary file operations to read in bulk, but I am not sure whether that helps when reading CSV files.

There are other methods as well, like Robert Gamble's CSV parsing code on SourceForge. Is there a way to write one using native functions?

Edit: Let me split the question into two clearer parts:

  1. Is there an efficient (run-time-critical) way to read files in C? (In this case, a .csv file with a million+ rows.)

  2. Is there a swift, efficient way to parse a CSV file?


There is no single way of reading and parsing any type of file that is fastest all the time. However, you might want to build a Ragel grammar for CSVs; those tend to be pretty fast. You can adapt it to your specific type of CSV (comma-separated, ;-separated, numbers only, etc.) and perhaps skip over any data that you're not going to use. I've had good experience with dataset-specific SQL parsers that could skip over much of their input (database dumps).
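
For illustration, here is a minimal hand-rolled sketch of the kind of tight scanning loop such a generated parser boils down to, for plain comma-separated input with no quoting or escaping; the callbacks here only count fields and records and stand in for whatever the application would actually do with each field.

    #include <stdio.h>

    /* Minimal scanner for plain comma-separated input: no quoting, no
     * escaping. The callbacks just count fields and records. */
    static long fields, records;

    static void on_field(const char *start, size_t len) { fields++; (void)start; (void)len; }
    static void on_record(void)                         { records++; }

    static void scan_chunk(const char *buf, size_t n)
    {
        const char *field = buf;
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == ',' || buf[i] == '\n') {
                on_field(field, (size_t)(&buf[i] - field));
                field = &buf[i] + 1;
                if (buf[i] == '\n')
                    on_record();
            }
        }
        /* A real parser must also carry a partial trailing field over to the
         * next chunk of input. */
    }

    int main(void)
    {
        const char sample[] = "a,b,c\n1,2,3\n";
        scan_chunk(sample, sizeof sample - 1);
        printf("%ld fields, %ld records\n", fields, records);
        return 0;
    }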

Reading in bulk might be a good idea, but you should measure on actual data whether it is really faster than stdio buffering. Using binary I/O might speed things up a bit on Windows, but then you need to handle newline translation yourself.
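
A rough sketch of the bulk-reading approach with plain fread(), assuming the path is passed on the command line; the 4 MiB chunk size is an arbitrary starting point and should be tuned by measurement against ordinary stdio buffering.

    #include <stdio.h>
    #include <stdlib.h>

    /* Read the file in large chunks with fread() instead of line by line. */
    #define CHUNK (4u * 1024u * 1024u)   /* arbitrary starting size */

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s file.csv\n", argv[0]); return 1; }

        FILE *fp = fopen(argv[1], "rb");   /* "rb": no newline translation */
        if (!fp) { perror("fopen"); return 1; }

        char *buf = malloc(CHUNK);
        if (!buf) { fclose(fp); return 1; }

        size_t n, total = 0;
        while ((n = fread(buf, 1, CHUNK, fp)) > 0) {
            total += n;
            /* parse buf[0..n) here; note a record may straddle two chunks */
        }
        printf("read %zu bytes\n", total);

        free(buf);
        fclose(fp);
        return 0;
    }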


In my experience, parsing CSV files, even in a higher-level interpreted language, is not usually the bottleneck. Huge amounts of data simply take a lot of space: CSV files are big, and most of the loading time is I/O, that is, the hard drive reading tons of digits into memory.

So my strong advice is to consider compressing the CSVs. gzip does its job very efficiently: it can squash and restore CSV streams on the fly, speeding up saving and loading by greatly decreasing file size and thus I/O time.

If you are developing under Unix, you can try this at the cost of no additional code at all, by piping CSV input and output through gzip -c and gunzip -c. Just try it; for me it sped things up tens of times.
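
For example, a C program on a Unix-like system can read the decompressed stream through a pipe without linking any compression library; the file name below is a placeholder.

    #include <stdio.h>

    /* Read a gzipped CSV through gunzip on a Unix-like system (uses popen,
     * which is POSIX rather than standard C). "data.csv.gz" is a placeholder. */
    int main(void)
    {
        FILE *fp = popen("gunzip -c data.csv.gz", "r");
        if (!fp) { perror("popen"); return 1; }

        char line[4096];
        long rows = 0;
        while (fgets(line, sizeof line, fp) != NULL)
            rows++;                /* parse the line here instead of counting */

        printf("rows: %ld\n", rows);
        pclose(fp);
        return 0;
    }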


Set the input buffer to a much larger size than the default using setvbuf. This is the only thing you can do in C itself to increase the read speed. Also run some timing tests, because there will be a point of diminishing returns beyond which increasing the buffer size no longer helps.
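
A minimal sketch of this, with a placeholder file name; the 1 MiB buffer is only a starting size for those timing tests.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("data.csv", "r");   /* placeholder file name */
        if (!fp) { perror("fopen"); return 1; }

        /* setvbuf must be called before the first read. 1 MiB is an arbitrary
         * starting size; time several sizes to find the diminishing-returns point. */
        size_t bufsize = 1u << 20;
        char *buf = malloc(bufsize);
        if (buf && setvbuf(fp, buf, _IOFBF, bufsize) != 0)
            perror("setvbuf");

        char line[4096];
        while (fgets(line, sizeof line, fp) != NULL) {
            /* parse the line */
        }

        fclose(fp);
        free(buf);   /* the buffer must outlive the stream, so free after fclose */
        return 0;
    }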

Outside of C, you can start by putting that .csv file on an SSD, or storing it on a compressed filesystem.


The best you can hope for is to haul large blocks of text into memory (or "memory map" a file), and process the text in memory.

The thorn in the side of efficiency is that text lines are variable-length records. Generally, text is read until an end-of-line terminator is found, which in principle means reading a character and checking for EOL. Many platforms and libraries try to make this more efficient by reading blocks of data and searching the block for the EOL.

Your CSV format further complicates the issue: within a line, the fields are variable-length as well, so parsing again means searching for a terminal character such as a comma, tab, or vertical bar.
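
For instance, once a block of text is in memory, the newlines can be located with memchr() rather than a character-at-a-time loop; this is only a sketch of that idea, with a trivial handler standing in for real field parsing.

    #include <stdio.h>
    #include <string.h>

    /* Split an in-memory buffer into lines by searching for '\n' with memchr().
     * handle_line() just prints the length; a real parser would split fields. */
    static void handle_line(const char *line, size_t len)
    {
        (void)line;
        printf("%zu-byte line\n", len);
    }

    static void split_lines(const char *buf, size_t n)
    {
        const char *p = buf, *end = buf + n;
        while (p < end) {
            const char *nl = memchr(p, '\n', (size_t)(end - p));
            size_t len = nl ? (size_t)(nl - p) : (size_t)(end - p);
            handle_line(p, len);
            if (!nl)
                break;          /* last line had no trailing newline */
            p = nl + 1;
        }
    }

    int main(void)
    {
        const char sample[] = "a,b,c\n1,2,3\n";
        split_lines(sample, sizeof sample - 1);
        return 0;
    }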

If you want better performance, you will have to change the data layout to fixed field lengths and fixed record lengths, padding fields where necessary; the application can strip the extra padding. Fixed-length records are very efficient as far as reading is concerned: just read N bytes, no scanning, straight into a buffer.

Fixed-length fields also allow random access into the record (or text line): the offset of each field is constant and can be calculated easily, so no searching is required.
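
A sketch of how reading and field access look with fixed lengths, using illustrative sizes and a placeholder file name: each record is read with a single fread(), and a field is reached by computing its constant offset.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative fixed layout: 4 fields of 16 bytes each per record. */
    #define FIELD_LEN   16
    #define NUM_FIELDS   4
    #define RECORD_LEN  (FIELD_LEN * NUM_FIELDS)

    int main(void)
    {
        FILE *fp = fopen("data.fixed", "rb");   /* placeholder file name */
        if (!fp) { perror("fopen"); return 1; }

        char rec[RECORD_LEN];
        while (fread(rec, 1, RECORD_LEN, fp) == RECORD_LEN) {
            /* Random access: field i starts at offset i * FIELD_LEN. */
            char field2[FIELD_LEN + 1];
            memcpy(field2, rec + 2 * FIELD_LEN, FIELD_LEN);
            field2[FIELD_LEN] = '\0';   /* writer pads fields to full length */
            printf("field 2: '%s'\n", field2);
        }

        fclose(fp);
        return 0;
    }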

In summary, variable-length records and fields are, by their nature, not the most efficient data structures, because time is wasted searching for terminal characters. Fixed-length records and fields are more efficient since they require no searching.

If your application is data intensive, perhaps restructuring the data will make the program more efficient.
