C# Multithreading File IO (Reading)_问答_开发者

We have a situation where our application needs to process a series of files and rather than perform this function synchronously, we would like to employ multi-threading to have the workload split amongst different threads.

Each item of work is:

1. Open a file for read only

2. Process the data in the file

3. Write the processed data to a Dictionary

We would like to perform each file's work on a new thread? Is this possible and should be we better to use the ThreadPool or spawn new threads keeping in mind that each item of "work" only takes 30ms however its possible that hundreds of files will need to be processed.

Any ideas to make this more efficient is appreciated.

EDIT: At the moment we are making use of the ThreadPool to handle开发者_Python百科 this. If we have 500 files to process we cycle through the files and allocate each "unit of processing work" to the threadpool using QueueUserWorkItem.

Is it suitable to make use of the threadpool for this?

I would suggest you to use ThreadPool.QueueUserWorkItem(...), in this, threads are managed by the system and the .net framework. The chances of you meshing up with your own threadpool is much higher. So I would recommend you to use Threadpool provided by .net . It's very easy to use,

ThreadPool.QueueUserWorkItem(new WaitCallback(YourMethod), ParameterToBeUsedByMethod);

YourMethod(object o){ Your Code here... }

For more reading please follow the link http://msdn.microsoft.com/en-us/library/3dasc8as%28VS.80%29.aspx

Hope, this helps

I suggest you have a finite number of threads (say 4) and then have 4 pools of work. I.e. If you have 400 files to process have 100 files per thread split evenly. You then spawn the threads, and pass to each their work and let them run until they have finished their specific work.

You only have a certain amount of I/O bandwidth so having too many threads will not provide any benefits, also remember that creating a thread also takes a small amount of time.

Instead of having to deal with threads or manage thread pools directly I would suggest using a higher-level library like Parallel Extensions (PEX):

var filesContent = from file in enumerableOfFilesToProcess
                   select new 
                   {
                       File=file, 
                       Content=File.ReadAllText(file)
                   };

var processedContent = from content in filesContent
                       select new 
                       {
                           content.File, 
                           ProcessedContent = ProcessContent(content.Content)
                       };

var dictionary = processedContent
           .AsParallel()
           .ToDictionary(c => c.File);

PEX will handle thread management according to available cores and load while you get to concentrate about the business logic at hand (wow, that sounded like a commercial!)

PEX is part of the .Net Framework 4.0 but a back-port to 3.5 is also available as part of the Reactive Framework.

I suggest using the CCR (Concurrency and Coordination Runtime) it will handle the low-level threading details for you. As for your strategy, one thread per work item may not be the best approach depending on how you attempt to write to the dictionary, because you may create heavy contention since dictionaries aren't thread safe.

Here's some sample code using the CCR, an Interleave would work nicely here:

Arbiter.Activate(dispatcherQueue, Arbiter.Interleave(
    new TeardownReceiverGroup(Arbiter.Receive<bool>(
        false, mainPort, new Handler<bool>(Teardown))),
    new ExclusiveReceiverGroup(Arbiter.Receive<object>(
        true, mainPort, new Handler<object>(WriteData))),
    new ConcurrentReceiverGroup(Arbiter.Receive<string>(
        true, mainPort, new Handler<string>(ReadAndProcessData)))));

public void WriteData(object data)
{
    // write data to the dictionary
    // this code is never executed in parallel so no synchronization code needed
}

public void ReadAndProcessData(string s)
{
    // this code gets scheduled to be executed in parallel
    // CCR take care of the task scheduling for you
}

public void Teardown(bool b)
{
    // clean up when all tasks are done
}

In the long run, I think you'll be happier if you manage your own threads. This will let you control how many are running and make it easy to report status.

Build a worker class that does the processing and give it a callback routine to return results and status.
For each file, create a worker instance and a thread to run it. Put the thread in a Queue.
Peel threads off of the queue up to the maximum you want to run simultaneously. As each thread completes go get another one. Adjust the maximum and measure throughput. I prefer to use a Dictionary to hold running threads, keyed by their ManagedThreadId.
To stop early, just clear the queue.
Use locking around your thread collections to preserve your sanity.

Use ThreadPool.QueueUserWorkItem to execute each independent task. Definitely don't create hundreds of threads. That is likely to cause major headaches.

The general rule for using the ThreadPool is if you don't want to worry about when the threads finish (or use Mutexes to track them), or worry about stopping the threads.

So do you need to worry about when the work is done? If not, the ThreadPool is the best option. If you want to track the overall progress, stop threads then your own collection of threads is best.

ThreadPool is generally more efficient if you are re-using threads. This question will give you a more detailed discussion.

Hth

Using the ThreadPool for each individual task is definitely a bad idea. From my experience this tends to hurt performance more than helping it. The first reason is that a considerable amount of overhead is required just to allocate a task for the ThreadPool to execute. By default, each application is assigned it's own ThreadPool that is initialized with ~100 thread capacity. When you are executing 400 operations in a parallel, it does not take long to fill the queue with requests and now you have ~100 threads all competing for CPU cycles. Yes the .NET framework does a great job with throttling and prioritizing the queue, however, I have found that the ThreadPool is best left for long-running operations that probably won't occur very often (loading a configuration file, or random web requests). Using the ThreadPool to fire off a few operations at random is much more efficient than using it to execute hundreds of requests at once. Given the current information, the best course of action would be something similar to this:

Create a System.Threading.Thread (or use a SINGLE ThreadPool thread) with a queue that the application can post requests to
Use the FileStream's BeginRead and BeginWrite methods to perform the IO operations. This will cause the .NET framework to use native API's to thread and execute the IO (IOCP).

This will give you 2 leverages, one is that your requests will still get processed in parallel while allowing the operating system to manage file system access and threading. The second is that because the bottleneck of the vast majority of systems will be the HDD, you can implement a custom priority sort and throttling to your request thread to give greater control over resource usage.

Currently I have been writing a similar application and using this method is both efficient and fast... Without any threading or throttling my application was only using 10-15% CPU, which can be acceptable for some operations depending on the processing involved, however, it made my PC as slow as if an application was using 80%+ of the CPU. This was the file system access. The ThreadPool and IOCP functions do not care if they are bogging the PC down, so don't get confused, they are optimized for performance, even if that performance means your HDD is squeeling like a pig.

The only problem I have had is memory usage ran a little high (50+ mb) during the testing phaze with approximately 35 streams open at once. I am currently working on a solution similar to the MSDN recommendation for SocketAsyncEventArgs, using a pool to allow x number of requests to be operating simultaneously, which ultimately led me to this forum post.

Hope this helps somebody with their decision making in the future :)