Creating Thousands of Threads Quickly and Executing Them Near Simultaneously_问答_开发者

I have a C#.NET application that needs to inform anywhere from 4000 to 40,000 connected devices to perform a task all at once (or as close to simultaneous as possible).

The application works well; however, I am not satisfied with the performance. In a perfect world, as soon as I send the command I would like to see all of the devices respond simultaneously. Yet, there seems to be a delay as al开发者_Go百科l the threads I have created spin up and perform the task.

I have used the .NET 4.0 ThreadPool, created my own solution using custom threads and I have even tweaked the existing ThreadPool to allow for more threads to be executed at once.

I still want better performance and that is why I am here. Any ideas? Comments? Suggestion? Thank you.

-Shaun

Let me add that the application notifies these 'connected devices' that they need to go listen for audio on a multicast address.

A dual-core hyperthreaded processor MAY be able to execute 4 threads simultaneously - depending on what the thread is doing (no contention on IO or memory access, etc). A quad-core hyperthread perhaps 8. But 40K just can't physically happen.

If you want near simultaneous, you're better off spinning up just as many threads as the computer has free cores and having each thread fire off notifications then end. You'll get rid of a bunch of context switching this way.

Or, look elsewhere. As SB recommended in the comments, use a UDP multicast to notify listening machines that they should do something.

You cannot execute 4000 threads simultaneously, let alone 40k. At best on a desktop box with hyperthreading, you might get up to 8 simultaneous processes going (this assumes quad core). Threads are pseudo-parallel, and that's not even digging into the issues of bus contention.

If you absolutely need simultaneity for 40k devices, you want some form of hardware synchronization.

It sounds like you have some control over what software runs on each device. In which case, you could look to HPC usage and architect your devices (nodes) hierarchically and/or use MPI to execute your remote processes.

For the hierarchy example: Designate say, 8 nodes as primary masters, again with 8 slave nodes, each slave can act as a master too with 8 slaves (you might need to look at an automated subscription algorithm to do this). You will have a hierarchy 6 deep to cover 40,000 nodes. Each master has a small portion of code running continually waiting for instructions to pass to slaves.

All you then do is pass the instruction to the 8 primary masters and your instruction will be propagated to the ‘cluster’ on the wire asynchronously by the masters. The instruction only has to be passed on a maximum of 5 times, and thus will be propagated v-quickly.

Alternatively (or in conjunction) you could look at MPI, which is an out-of-the-can solution. There are some established C# implementations.

The overhead of creating thousands of threads is (very) significant; I would seek an alternative solution. This sounds like a job for asynchronous IO: your computer presumably only has one network connection, so no more than one message can be sent at a time - threads cannot improve on this!

Am I correct in guessing that you're using a synchronous API call on your device, which is why it must be executed in a thread? Does the API have an asynchronous version of the call? If the device API can really support 40k+ devices, then it should. It should also have internal handling of whatever wait handles (or equivalent) are required to synchronize the return data for callback. This isn't something you can handle at the client application side; you don't have enough visibility of the underlying implementation of the device API to know how to parallelize the tasks. As you've discovered, creating 40k threads with blocking calls doesn't cut it.

You should do async IO to the devices. This is very efficient and uses a different (larger ) set of threads to handle some of the work. Certainly the devices will receive the commands much faster. The IO thread pool will handle the replies (if any)

Always fun with these old ones.

1mb per thread means you need 4-40gb just in RAM minimum, and 4k-40k cores. and the fact that you have a network to send it on.

Means that it will be syncronized somewhere along the way, on the nearest switch/router (most of it probably even on you network card, if you even could get all the packages there at the same time, and it managed to send it without caching it or dying on you). Meaning simply all that work multi threading was for nothing as it will not reach the endpoints simultaneously.

Think of it as taking one 40'000 lane road and placing 40'000 cars on it, sure everyone get to the same point on the road at the same time, but then they leave the road and go home. Everyone gets home at different times, even if they started driving on the 40k road at the same point and time.

You just, can not, beat the physical realm (yet...).