开发者

Fastest way to do many small, blind writes on a huge file (in C++)?

开发者 https://www.devze.com 2023-01-06 11:44 出处:网络
I have some very large (>4 GB) files containing (millions of) fixed-length binary records. I want to (efficiently) join them to records in other files by writing pointers (i.e. 64-bit record numbers)

I have some very large (>4 GB) files containing (millions of) fixed-length binary records. I want to (efficiently) join them to records in other files by writing pointers (i.e. 64-bit record numbers) into those records at specific offsets.

To elaborate, I have a pair of lists of (key, record number) tuples sorted by key for each join I want to perform on a given pair of files, say, A and B. Iterating through a list pair and matching up the keys yields a list of (key, record number A, record number B) tuples representing the joined records (assuming a 1:1 mapping for simplicity). To complete the join, I conceptually need to seek to each A record in the list and write the corresponding B record number at the appropriate offset, and vice versa. My question is what is the fastest way to actually do this?

Since the list of joined records is sorted by key, the as开发者_如何学Pythonsociated record numbers are essentially random. Assuming the file is much larger than the OS disk cache, doing a bunch of random seeks and writes seems extremely inefficient. I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array, and flushing the densest clusters of entries to disk whenever I run out of memory. This has the benefit of greatly increasing the chances that the appropriate records will be cached for a cluster after updating its first pointer. However, even at this point, is it generally better to do a bunch of seeks and blind writes, or to read chunks of the file manually, update the appropriate pointers, and write the chunks back out? While the former method is much simpler and could be optimized by the OS to do the bare minimum of sector reads (since it knows the sector size) and copies (it can avoid copies by reading directly into properly aligned buffers), it seems that it will incur extremely high syscall overhead.

While I'd love a portable solution (even if it involves a dependency on a widely used library, such as Boost), modern Windows and Linux are the only must-haves, so I can make use of OS-specific APIs (e.g. CreateFile hints or scatter/gather I/O). However, this can involve a lot of work to even try out, so I'm wondering if anyone can tell me if it's likely worth the effort.


It looks like you can solve this by use of data structures. You've got three constraints:

  • Access time must be reasonably quick
  • Data must be kept sorted
  • You are on a spinning disk

B+ Trees were created specifically to address the kind of workload you're dealing with here. There are several links to implementations in the linked Wikipedia article.

Essentially, a B+ tree is a binary search tree, except groups of nodes are held together in groups. That way, instead of having to seek around for each node, the B+ tree loads only a chunk at a time. And it keeps a bit of information around to know which chunk it's going to need in a search.

EDIT: If you need to sort by more than one item, you can do something like:


+--------+-------------+-------------+---------+
| Header | B+Tree by A | B+Tree by B | Records |
+--------+-------------+-------------+---------+
      ||      ^     |     ^    |          ^
      |\------/     |     |    |          |
      \-------------------/    |          |
                    |          |          |
                    \----------+----------/

I.e. you have seperate B+Trees for each key, and a seperate list of records, pointers to which are stored in the B+ trees.


I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array, and flushing the densest clusters of entries to disk whenever I run out of memory. it seems that it will incur extremely high syscall overhead.

You can use memory mapped access to the file to avoid syscall overhead. mmap() on *NIX, and CreateFileMapping() on Windows.

Split file logically into blocks, e.g. 32MB. If somethings needs to be changed in the block, mmap() it , modify data, optionally msync() if desired, munmap() and then move to the next block.

That would have been something I have tried first. OS would automatically read whatever needs to be read (on first access to the data), and it will queue IO anyway it likes.

Important things to keep in mind is that the real IO isn't that fast. Performance-wise limiting factors for random access are (1) the number of IOs per second (IOPS) storage can handle and (2) the number of disk seeks. (Usual IOPS is in hundreds range. Usual seek latency is 3-5ms.) Storage for example can read/write 50MB/s: one continuous block of 50MB in one second. But if you would try to patch byte-wise 50MB file, then seek times would simply kill the performance. Up to some limit, it is OK to read more and write more, even if to update only few bytes.

Another limit to observe is the OS' max size of IO operation: it depends on the storage but most OSs would split IO tasks larger than 128K. The limit can be changed and best if it is synchronized with the similar limit in the storage.

Also keep in mind the storage. Many people forget that storage is often only one. I'm trying here to say that starting crapload of threads doesn't help IO, unless you have multiple storages. Even single CPU/core is capable of easily saturating RAID10 with its 800 read IOPS and 400 write IOPS limits. (But a dedicated thread per storage at least theoretically makes sense.)

Hope that helps. Other people here often mention Boost.Asio which I have no experience with - but it is worth checking.

P.S. Frankly, I would love to hear other (more informative) responses to your question. I was in the boat several times already, yet had no chance to really get down to it. Books/links/etc related to IO optimizations (regardless of platform) are welcome ;)


Random disk access tends to be orders of magnitude slower than sequential disk access. So much so that it can be useful to choose algorithms that might sound badly inefficient at first blush. For example, you might try this:

Create your join index, but instead of using it, just write out the list of pairs (A index, B index) to a disk file.

Sort this new file of pairs by the A index. Use a sort algorithm designed for external sorting (though I've not tried it myself, the STXXL library from stxxl.sourceforge.net looked promising when I was researching a similar problem)

Sequentially walk through the A record file and the sorted pair list. Read a huge chunk in, make all the relevant changes in memory, write the chunk out. Never touch that portion of the A record file again (since the changes you planned to make come in sequential order)

Go back, sort the pair file by the B index (again, using an external sort). Use this to update the B record file in the same manner.


Instead of building a list of (key, record number A, record number B) I would leave out the key to save space and just build (record number A, record number B). I'd sort that table or file by the A's, sequentially seek to each A record, write the B number, then sort the list by the B's, sequentially seek to each B record, write the A number.

I'm doing very similar large file manipulations, and these newer machines are so damn fast it doesn't take long at all:

On a cheapo 2.4gHz HP Pavilion with 3gb ram and 32-bit Vista, writing 3 million sequential 1,008-byte records to a new file takes 56 seconds, using Delphi library routines (as opposed to the Win API).

Sequentially seeking to each record in the file and writing 8 bytes using Win API FileSeek/FileWrite on a booted machine takes 136 seconds. That's 3 million updates. Immediately rerunning the same code takes 108 seconds, since the O/S has some things cached.

Sorting record offsets first, then sequentially updating the files, is the way to go.

0

精彩评论

暂无评论...
验证码 换一张
取 消