I am presently working on a C++ project that involves reading in thousands of small (~20kb) text files which are all in ASCII format.
Will I be able to get a significant performance improvement by converting all of the files into binary before analyzing them?
Converting a string to a number, while not cheap in CPU cycles, is a non-issue. The overhead of I/O is typically orders of magnitude larger than the conversion. The size of the file is not much of an issue either: a disk supplies 8KB about as fast as 20KB, since it all comes out of the same cluster on the same track. Having thousands of files is a big issue: opening a file involves moving the disk's read head, and that takes forever.
So focus on whittling down the number of files for a real gain.
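One way to cut the file count is to concatenate the small files into a single bundle once, then read that one file at analysis time. Here is a minimal sketch, assuming a simple length-prefixed layout. The names `read_whole_file`, `pack_files` and `unpack_bundle` are made up for illustration, and the bundle is only meant to be read back on the same machine that wrote it.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: read one whole file into a string.
std::string read_whole_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Write every input file into one bundle as: [uint64 length][raw bytes] ...
// Assumption: the length is stored in native byte order (not portable).
void pack_files(const std::vector<std::string>& paths,
                const std::string& bundle_path) {
    std::ofstream out(bundle_path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot create " + bundle_path);
    for (const auto& path : paths) {
        const std::string data = read_whole_file(path);
        const std::uint64_t size = data.size();
        out.write(reinterpret_cast<const char*>(&size), sizeof size);
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
    }
    if (!out) throw std::runtime_error("write failed for " + bundle_path);
}

// Read the bundle back: one open() call instead of thousands.
std::vector<std::string> unpack_bundle(const std::string& bundle_path) {
    std::ifstream in(bundle_path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + bundle_path);
    std::vector<std::string> records;
    std::uint64_t size = 0;
    while (in.read(reinterpret_cast<char*>(&size), sizeof size)) {
        std::string data(static_cast<std::size_t>(size), '\0');
        if (!in.read(&data[0], static_cast<std::streamsize>(size)))
            throw std::runtime_error("truncated bundle " + bundle_path);
        assert(in.gcount() == static_cast<std::streamsize>(size));
        records.push_back(std::move(data));
    }
    return records;
}
```

The text inside each record is untouched, so the existing parsing code can stay as it is. Only the number of files the OS has to open changes.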
There is no real difference between "ASCII" and "Binary" if you're handling text. ASCII is an interpretation of Binary data as text. So, if I understand your question correctly, the answer is no, there is no conversion that is possible and there is no performance improvement.
Storing data in binary format has two advantages:
- it occupies less storage (less disk IO)
- it is faster to read (no time-consuming string parsing)
So there will be performance improvements if you convert your textual representation to a tightly-packed binary format, but whether they are significant depends on your particular situation.
If data streaming is already a performance bottleneck, switching to a binary format (possibly even a compressed one, since reading from disks is inherently slow) can help a lot.
You can get a performance gain on load when the binary format is such that you consequently minimise any requirement for parsing. For example, where the content can be dumped in large chunks that map directly into a 'struct dump'. Every further step beyond this in turn may cost you performance. Whether this ends up being much ahead of the ASCII will in part depend on how complex/inefficient the ASCII is to start with.
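To make the 'struct dump' idea concrete, here is a rough sketch. The `Sample` record and the `dump_samples` / `load_samples` names are invented for the example, since the question does not say what the files contain. It assumes the file is written and read by the same program on the same platform, so padding and byte order match.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical record layout; the real fields depend on your data.
struct Sample {
    std::int32_t id;
    float x;
    float y;
};
static_assert(std::is_trivially_copyable<Sample>::value,
              "raw dumps only work for trivially copyable types");

// Write a count followed by the raw bytes of the whole array.
void dump_samples(const std::vector<Sample>& samples, const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot create " + path);
    const std::uint64_t count = samples.size();
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    if (count > 0) {
        out.write(reinterpret_cast<const char*>(samples.data()),
                  static_cast<std::streamsize>(count * sizeof(Sample)));
    }
    if (!out) throw std::runtime_error("write failed for " + path);
}

// Read the count, then pull the whole array in with a single read and no parsing.
std::vector<Sample> load_samples(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::uint64_t count = 0;
    if (!in.read(reinterpret_cast<char*>(&count), sizeof count))
        throw std::runtime_error("missing header in " + path);
    std::vector<Sample> samples(static_cast<std::size_t>(count));
    assert(samples.size() == count);
    if (count > 0 &&
        !in.read(reinterpret_cast<char*>(samples.data()),
                 static_cast<std::streamsize>(count * sizeof(Sample))))
        throw std::runtime_error("truncated data in " + path);
    return samples;
}
```

Anything beyond this (byte-order handling, variable-length fields, versioning) adds work on load, which is the trade-off described above.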
Steps which cost you even in binary include:
- Compression
- Platform independence
- Variable content
- Changes to the content requiring an update of the binary from the ASCII
If you are sure a large part of the execution time is load and parse, but you only do this once for a fixed data set, another option might be to use threads. Set up a bunch of parallel workers that load the data and then place it on a queue for analysis.
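A bare-bones version of that worker/queue setup could look like the sketch below. `LoadedFile` and `load_and_analyze` are names I made up, and the `analyze` callback stands in for whatever your analysis step is. It assumes a C++11 compiler with thread support (you may need `-pthread` on some toolchains) and skips error handling for unreadable files, which simply arrive as empty text.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <functional>
#include <iterator>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct LoadedFile {
    std::string path;
    std::string text;
};

// Workers load files in parallel; the calling thread analyzes them as they arrive.
void load_and_analyze(const std::vector<std::string>& paths,
                      unsigned worker_count,
                      const std::function<void(const LoadedFile&)>& analyze) {
    if (worker_count == 0) worker_count = 1;  // avoid waiting forever with no workers

    std::queue<LoadedFile> ready;
    std::mutex mutex;
    std::condition_variable cv;
    std::atomic<std::size_t> next{0};

    auto worker = [&] {
        for (;;) {
            const std::size_t i = next.fetch_add(1);  // claim the next file
            if (i >= paths.size()) return;
            std::ifstream in(paths[i], std::ios::binary);
            std::string text((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
            {
                std::lock_guard<std::mutex> lock(mutex);
                ready.push(LoadedFile{paths[i], std::move(text)});
            }
            cv.notify_one();
        }
    };

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) workers.emplace_back(worker);

    // Exactly paths.size() items will be pushed, so pop that many.
    for (std::size_t done = 0; done < paths.size(); ++done) {
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [&] { return !ready.empty(); });
        LoadedFile file = std::move(ready.front());
        ready.pop();
        lock.unlock();
        analyze(file);  // runs on the calling thread only
    }

    for (auto& t : workers) t.join();
    assert(ready.empty());
}
```

Files reach `analyze` in completion order, not input order, so keep the path with the data if the order matters. If `analyze` can throw, the threads need to be joined before the exception escapes; the sketch leaves that out.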
Probably, yes. But then it'll be impossible to verify the input files by inspection, and you'll have to spend time writing code to transcode them, and new code to read them. I'd only do it if you find that I/O time is a significant problem.
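To find out whether I/O time really is the problem, you could time the raw reads on their own and compare that with the total run time. A quick sketch, where `ReadTiming` and `time_raw_reads` are names invented for the example:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

struct ReadTiming {
    std::size_t bytes;  // total bytes read
    double seconds;     // wall-clock time spent opening and reading
};

// Open and read every file without parsing anything, and time it.
ReadTiming time_raw_reads(const std::vector<std::string>& paths) {
    const auto start = std::chrono::steady_clock::now();
    std::size_t bytes = 0;
    for (const auto& path : paths) {
        std::ifstream in(path, std::ios::binary);
        if (!in) throw std::runtime_error("cannot open " + path);
        const std::string text((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());
        bytes += text.size();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    assert(elapsed.count() >= 0.0);  // steady_clock never goes backwards
    return ReadTiming{bytes, elapsed.count()};
}
```

Keep in mind that the OS caches files after the first read, so a second run over the same files will usually look much faster than a cold one.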