Grand Central Strategy for Opening Multiple Files

Developer https://www.devze.com 2023-02-01 09:11 Source: Internet
I have a working implementation using Grand Central Dispatch queues that (1) opens a file and computes an OpenSSL DSA hash on "queue1", and (2) writes the hash out to a new "side car" file for later verification on "queue2".

I would like to open multiple files at the same time, but based on some logic that doesn't "choke" the OS by having 100s of files open and exceeding the hard drive's sustainable output. Photo browsing applications such as iPhoto or Aperture seem to open multiple files and display them, so I'm assuming this can be done.

I'm assuming the biggest limitation will be disk I/O, as the application can (in theory) read and write multiple files simultaneously.

Any suggestions?

TIA


You are correct in that you'll be I/O bound, most assuredly. And it will be compounded by the random access nature of having multiple files open and being actively read at the same time.

Thus, you need to strike a bit of a balance. More likely than not, processing one file at a time is not the most efficient approach, as you've observed.

Personally?

I'd use a dispatch semaphore.

Something like:

@property(nonatomic, assign) dispatch_queue_t dataQueue;
@property(nonatomic, assign) dispatch_semaphore_t execSemaphore;

And:

- (void) process:(NSData *)d {
    dispatch_async(self.dataQueue, ^{
        if (!dispatch_semaphore_wait(self.execSemaphore, DISPATCH_TIME_FOREVER)) {
            dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
                // ... do calculation work here on d ...
                dispatch_async(dispatch_get_main_queue(), ^{
                    // ... update main thread with new data here ...
                });
                dispatch_semaphore_signal(self.execSemaphore);
            });
        }
    });
}

Where it is kicked off with:

self.dataQueue = dispatch_queue_create("com.yourcompany.dataqueue", NULL);
self.execSemaphore = dispatch_semaphore_create(3);
[self process: ...];
[self process: ...];
[self process: ...];
[self process: ...];
[self process: ...];
.... etc ....

You'll need to determine how best you want to handle the queueing. If there are many items and there is a notion of cancellation, enqueueing everything is likely wasteful. Similarly, you'll probably want to enqueue URLs to the files to process, and not NSData objects like the above.

In any case, the above will process three things simultaneously, regardless of how many have been enqueued.
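For anyone who wants to see the mechanism outside of GCD, here is a minimal sketch in plain C with pthreads of what the semaphore is doing: a counting gate that lets at most N workers run at once, no matter how many are started. Every name in it (gate_t, pool_t, run_bounded) is made up for illustration, and the 64-job cap and one-thread-per-job layout are simplifications. It is not the code from this answer, and dispatch_semaphore itself is implemented differently.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Counting semaphore built from a mutex and a condition variable. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  available;
    int             permits;
} gate_t;

static void gate_init(gate_t *g, int permits) {
    assert(permits > 0);
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->available, NULL);
    g->permits = permits;
}

/* Blocks until a permit is free, then takes it. */
static void gate_wait(gate_t *g) {
    pthread_mutex_lock(&g->lock);
    while (g->permits == 0)
        pthread_cond_wait(&g->available, &g->lock);
    g->permits--;
    pthread_mutex_unlock(&g->lock);
}

/* Returns a permit and wakes one waiter. */
static void gate_signal(gate_t *g) {
    pthread_mutex_lock(&g->lock);
    g->permits++;
    pthread_cond_signal(&g->available);
    pthread_mutex_unlock(&g->lock);
}

typedef struct {
    gate_t          gate;
    pthread_mutex_t stats_lock;
    int             running; /* workers currently past the gate */
    int             peak;    /* highest value `running` ever reached */
} pool_t;

static void *worker(void *arg) {
    pool_t *p = arg;
    gate_wait(&p->gate);

    pthread_mutex_lock(&p->stats_lock);
    if (++p->running > p->peak)
        p->peak = p->running;
    pthread_mutex_unlock(&p->stats_lock);

    /* ... the per-file work (read, hash, write) would go here ... */

    pthread_mutex_lock(&p->stats_lock);
    p->running--;
    pthread_mutex_unlock(&p->stats_lock);

    gate_signal(&p->gate);
    return NULL;
}

/* Starts `jobs` workers with at most `limit` past the gate at once.
   Returns the peak concurrency that was actually observed. */
int run_bounded(int jobs, int limit) {
    pool_t p;
    pthread_t threads[64];
    int started = 0;

    assert(jobs > 0 && jobs <= 64);
    gate_init(&p.gate, limit);
    pthread_mutex_init(&p.stats_lock, NULL);
    p.running = 0;
    p.peak = 0;

    for (int i = 0; i < jobs; i++) {
        if (pthread_create(&threads[started], NULL, worker, &p) != 0)
            break; /* stop launching if the system refuses more threads */
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    pthread_mutex_destroy(&p.stats_lock);
    pthread_cond_destroy(&p.gate.available);
    pthread_mutex_destroy(&p.gate.lock);
    return p.peak;
}
```

Calling run_bounded(16, 3) starts 16 workers but should never report a peak above 3, which mirrors the effect of dispatch_semaphore_create(3) in the answer.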


I'd use NSOperation for this because of the ease of handling both dependencies and cancellation.

I'd create one operation each for reading the data file, computing the data file's hash, and writing the sidecar file. I'd make each write operation dependent on its associated compute operation, and each compute operation dependent on its associated read operation.

Then I'd add the read and write operations to one NSOperationQueue, the "I/O queue," with a restricted width. The compute operations I'd add to a separate NSOperationQueue, the "compute queue," with a non-restricted width.

The reason for the restricted width on the I/O queue is that your work will likely be I/O bound; you may want it to have a width greater than 1, but it's very likely to be directly related to the number of physical disks on which your input files reside. (Probably something like 2x; you'll want to determine this experimentally.)

The code would wind up looking something like this:

@implementation FileProcessor

static NSOperationQueue *FileProcessorIOQueue = nil;
static NSOperationQueue *FileProcessorComputeQueue = nil;

+ (void)initialize
{
    if (self == [FileProcessor class]) {
        FileProcessorIOQueue = [[NSOperationQueue alloc] init];
        [FileProcessorIOQueue setName:@"FileProcessorIOQueue"];
        [FileProcessorIOQueue setMaxConcurrentOperationCount:2]; // limit width

        FileProcessorComputeQueue = [[NSOperationQueue alloc] init];
        [FileProcessorComputeQueue setName:@"FileProcessorComputeQueue"];
    }
}

- (void)processFilesAtURLs:(NSArray *)URLs
{
    for (NSURL *URL in URLs) {
        __block NSData *fileData = nil; // set by readOperation
        __block NSData *fileHashData = nil; // set by computeOperation

        // Create operations to do the work for this URL

        NSBlockOperation *readOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                fileData = CreateDataFromFileAtURL(URL);
            }];

        NSBlockOperation *computeOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                fileHashData = CreateHashFromData(fileData);
                [fileData release]; // created in readOperation
            }];

        NSBlockOperation *writeOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                WriteHashSidecarForFileAtURL(fileHashData, URL);
                [fileHashData release]; // created in computeOperation
            }];

        // Set up dependencies between operations

        [computeOperation addDependency:readOperation];
        [writeOperation addDependency:computeOperation];

        // Add operations to appropriate queues

        [FileProcessorIOQueue addOperation:readOperation];
        [FileProcessorComputeQueue addOperation:computeOperation];
        [FileProcessorIOQueue addOperation:writeOperation];
    }
}

@end

It's pretty straightforward; rather than deal with multiply-nested layers of sync/async as you would with the dispatch_* APIs, NSOperation allows you to define your units of work and your dependencies between them independently. For some situations this can be easier to understand and debug.


You have received excellent answers already, but I wanted to add a couple points. I have worked on projects that enumerate all the files in a file system and calculate MD5 and SHA1 hashes of each file (in addition to other processing). If you are doing something similar, where you are searching a large number of files and the files may have arbitrary content, then some points to consider:

  • As noted, you will be I/O bound. If you read more than 1 file simultaneously, you will have a negative impact on the performance of each calculation. Obviously, the goal of scheduling calculations in parallel is to keep the disk busy between files, but you may want to consider structuring your work differently. For example, set up one thread that enumerates and opens the files and a second thread that gets open file handles from the first thread one at a time and processes them. The file system will cache catalog information, so the enumeration won't have a severe impact on reading the data, which will actually have to hit the disk.

  • If the files can be arbitrarily large, Chris' approach may not be practical since the entire content is read into memory.

  • If you have no other use for the data than calculating the hash, then I suggest disabling file system caching before reading the data.

If using NSFileHandles, a simple category method will do this per-file:

@interface NSFileHandle (NSFileHandleCaching)
- (BOOL)disableFileSystemCache;
@end

#include <fcntl.h>

@implementation NSFileHandle (NSFileHandleCaching)
- (BOOL)disableFileSystemCache {
     return (fcntl([self fileDescriptor], F_NOCACHE, 1) != -1);
}
@end

  • If the sidecar files are small, you may want to collect them in memory and write them out in batches to minimize disruption of the processing.

  • The file system (HFS, at least) stores file records for files in a directory sequentially, so traverse the file system breadth-first (i.e., process each file in a directory before entering subdirectories).

The above are just suggestions, of course. You will want to experiment and measure performance to confirm the actual impact.
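On the point about arbitrarily large files: one way around reading the whole file into memory is to feed the hash in fixed-size chunks. Below is a rough sketch in plain C, under some loud assumptions. FNV-1a is used purely as a stand-in so the example has no dependencies; it is not a cryptographic hash and not what the original poster uses. The names fnv1a_update and hash_file_streaming and the 32 KiB chunk size are invented for illustration. With a real digest you would call its incremental update function inside the same loop.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE (32 * 1024)          /* arbitrary; tune by measurement */
#define FNV1A_INIT 0xcbf29ce484222325ULL /* FNV-1a 64-bit offset basis */

/* Folds `len` bytes into a running FNV-1a state (stand-in for a real digest). */
uint64_t fnv1a_update(uint64_t state, const unsigned char *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        state ^= buf[i];
        state *= 0x100000001b3ULL;      /* FNV-1a 64-bit prime */
    }
    return state;
}

/* Hashes the file at `path` one chunk at a time, so memory use stays at
   CHUNK_SIZE regardless of file size. Returns 0 on success, -1 on error. */
int hash_file_streaming(const char *path, uint64_t *out) {
    unsigned char buf[CHUNK_SIZE];
    uint64_t state = FNV1A_INIT;
    size_t n;
    int failed;

    assert(path != NULL && out != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        state = fnv1a_update(state, buf, n);

    failed = ferror(fp);                /* distinguish read error from EOF */
    fclose(fp);
    if (failed)
        return -1;

    *out = state;
    return 0;
}
```

Because the state is carried from chunk to chunk, the result is the same as hashing the whole buffer in one call; only the peak memory use changes.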


libdispatch actually provides APIs explicitly for this! Check out dispatch_io; it will handle parallelizing IO when appropriate, and otherwise serializing it to avoid thrashing the disk.


The following link is to a BitBucket project I set up that uses NSOperation and Grand Central Dispatch in a primitive file integrity application.

https://bitbucket.org/torresj/hashar-cocoa

I hope it is of help/use.
