I have an app that works great for processing files that land in a directory on my server. The process is:
1) check for files in a directory
2) queue a user work item to handle each file in the background
3) wait until all workers have completed
4) goto 1
This works nicely and I never worry about the same file being processed twice or multiple threads being spawned for the same file. However, if there's one file that takes too long to process, step #3 hangs on that one file and holds up all other processing.
So my question is, what's the correct paradigm to spawn exactly one thread for each file I need to process, while not blocking if one file takes too long? I considered FileSystemWatcher, but the files may not be immediately readable, which is why I continually look at all files and spawn a process for each (which will immediately exit if the file is locked).
Should I remove step #3 and maintain a list of files I've already processed? That seems messy and the list would grow very large over time so I suspect there's a more elegant solution.
I would suggest that you maintain a list of files which you are currently processing. Have the thread remove itself from this list when the thread finishes. When looking for new files, exclude those in the currently-running list.
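A minimal sketch of that idea (the directory path, the `ProcessFile` body, and the scan cadence are placeholders), using a lock-protected `HashSet` so the scan loop and the worker threads agree on which files are in flight:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

static class Scanner
{
    static readonly HashSet<string> inProgress = new HashSet<string>();
    static readonly object sync = new object();

    // Called from the periodic scan; skips files a worker already owns.
    public static void ScanOnce(string directory)
    {
        foreach (var path in Directory.GetFiles(directory))
        {
            lock (sync)
            {
                if (!inProgress.Add(path))
                    continue; // a worker is already handling this file
            }
            ThreadPool.QueueUserWorkItem(_ =>
            {
                try { ProcessFile(path); }
                finally
                {
                    lock (sync) { inProgress.Remove(path); } // release ownership
                }
            });
        }
    }

    static void ProcessFile(string path) { /* placeholder */ }
}
```

With this shape there is no step #3: a slow file keeps only its own entry in the set, and every other file keeps flowing.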
Move the files to a processing directory before you start threads. Then you can fire-and-forget the threads and any admins can see at a glance what's going on.
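A sketch of that claim-by-move step (the processing directory is an assumption): `File.Move` throws if the source file is already gone, so two overlapping scans cannot both claim the same file, and catching that exception is the de-duplication:

```csharp
using System.IO;

static class Claimer
{
    // Claim a file by moving it into the processing directory first.
    // Returns false if another scan already claimed it (or it's still locked).
    public static bool TryClaim(string path, string processingDir, out string claimed)
    {
        claimed = Path.Combine(processingDir, Path.GetFileName(path));
        try
        {
            File.Move(path, claimed);
            return true;
        }
        catch (IOException) // source already moved, or still being written
        {
            return false;
        }
    }
}
```

After a successful claim you can queue the work item against the new path and forget about it; the inbox stays clean for the next scan.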
Spawning one thread per item to process is almost never a good approach. In your case, once the number of files grows above several hundred, one-thread-per-file will make application performance pretty bad, and a 32-bit process will start running out of address space.
The list solution by Dark Falcon is simple enough and matches your algorithm. I would actually use a queue (like ConcurrentQueue - http://msdn.microsoft.com/en-us/library/dd267265.aspx): put items to process in on one side (i.e. based on periodic scans or a file watcher) and pick items up for processing with one or several threads on the other side. You generally want a small number of threads (i.e. 1-2x the number of CPUs for CPU-intensive load).
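A rough sketch of that producer-consumer shape (names, worker count, and the idle delay are illustrative), with the scan loop enqueueing paths and a fixed pool of consumers draining the queue:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

static class FileQueue
{
    static readonly ConcurrentQueue<string> queue = new ConcurrentQueue<string>();

    // Producer side: called by the periodic directory scan.
    public static void Enqueue(string path) => queue.Enqueue(path);

    // Consumer side: start a small, fixed number of workers once.
    public static void StartWorkers(int count, Action<string> processFile)
    {
        for (int i = 0; i < count; i++)
        {
            new Thread(() =>
            {
                while (true)
                {
                    if (queue.TryDequeue(out var path))
                        processFile(path); // a slow file stalls only this worker
                    else
                        Thread.Sleep(250); // nothing queued; idle briefly
                }
            }) { IsBackground = true }.Start();
        }
    }
}
```

In practice, wrapping the queue in a BlockingCollection<string> lets the workers block on GetConsumingEnumerable() instead of sleep-polling.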
Also consider using the Task Parallel Library (like Parallel.ForEach - http://msdn.microsoft.com/en-us/library/dd989744.aspx) to deal with multiple threads.
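For a batch of files already known to be ready, the Parallel.ForEach form looks roughly like this (the inbox path and ProcessFile are placeholders; MaxDegreeOfParallelism caps the thread count):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
Parallel.ForEach(Directory.GetFiles(@"C:\inbox"), options, path =>
{
    ProcessFile(path); // each file runs on a ThreadPool thread
});

static void ProcessFile(string path) { /* placeholder */ }
```

One caveat: Parallel.ForEach blocks until the whole batch finishes, so it parallelizes steps #2-3 rather than removing the wait; run the batch inside its own Task if the scan loop must not stall on a slow file.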
To minimize the number of files to handle, I would keep a persistent list (i.e. a disk file) of items that have already been processed: file path plus last-modified date (unless you can obtain this information from another source).
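One way to persist that list (the tab-separated file format is an assumption; any durable store works): append a line per processed file and reload the map at startup:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

class ProcessedLog
{
    readonly string logPath;
    readonly Dictionary<string, DateTime> seen = new Dictionary<string, DateTime>();

    public ProcessedLog(string logPath)
    {
        this.logPath = logPath;
        if (File.Exists(logPath))
            foreach (var line in File.ReadAllLines(logPath))
            {
                var parts = line.Split('\t');
                // "o" round-trip format preserves ticks and DateTimeKind
                seen[parts[0]] = DateTime.ParseExact(
                    parts[1], "o", CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);
            }
    }

    // True if this path was already handled at this last-modified time.
    public bool AlreadyProcessed(string path, DateTime lastWrite) =>
        seen.TryGetValue(path, out var t) && t == lastWrite;

    public void MarkProcessed(string path, DateTime lastWrite)
    {
        seen[path] = lastWrite;
        File.AppendAllText(logPath,
            path + "\t" + lastWrite.ToString("o") + Environment.NewLine);
    }
}
```

Keying on path plus last-modified date means a re-dropped (updated) file is treated as new work rather than skipped.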
My two main questions would be:
- What is the size of the files?
- How often will files appear?
Depending on your answer there, I might go with the following producer-consumer algorithm:
- Use a file system watcher to see that there is activity in the directory you are monitoring
- When activity occurs, start polling "lightly"; that is, test each available file to see whether it is locked (i.e., try to open it with write privileges using a simple IsLocked extension method that tests via a try..catch); if one or more files are not free, set a timer to go off in some amount of time (longer if you expect fewer, larger files; shorter if they are smaller and/or more frequent) and then test the files again
- As soon as you see that a file is free, process it (i.e., move it to another folder, put an item in a concurrent queue, have your consumer threads process the queue, archive the file/results).
- Have some kind of persistence mechanism like Alexei mentions (i.e., disk/database) to be able to recover your processing where you left off in case of system failure.
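The IsLocked test mentioned above can be sketched as a small extension method (the name comes from the answer; the try..catch open is one common way to implement it):

```csharp
using System.IO;

static class FileInfoExtensions
{
    // True if another process still holds the file open, i.e. it is
    // not yet safe to process.
    public static bool IsLocked(this FileInfo file)
    {
        try
        {
            using (file.Open(FileMode.Open, FileAccess.ReadWrite, FileShare.None))
                return false; // exclusive access granted; the writer is done
        }
        catch (IOException)
        {
            return true; // still being written (or otherwise inaccessible)
        }
    }
}
```

Usage is then a one-liner in the polling loop: `if (!new FileInfo(path).IsLocked()) Enqueue(path);`. Note there is an unavoidable race between the check and the actual open for processing, so the processing code still needs to tolerate an IOException.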
I feel that this is a good combination of non-blocking, low-CPU-usage behavior. But measure your before and after results. I would recommend using the ThreadPool and trying to keep threads from blocking (i.e., encourage thread reuse by avoiding blocking calls such as Thread.Sleep).
Notes:
- Base the number of threads processing files on the number of CPUs and cores available on the machine; also consider server load
- FileSystemWatcher can be finicky; be sure that it's running from the same machine that you are monitoring (i.e., not watching a remote server), otherwise you'll need to reinitialize connectivity from time to time.
- I definitely would not spawn a different process per file; multiple threads should be plenty; reusing threads is best. Spawning a process is a very expensive operation, and spawning a thread is an expensive operation. Alexei has some good information wrt the Task Parallel Library; it uses the ThreadPool.