Where's the balance between thread amount and thread block times?

Elongated question:

When having more blocking threads than CPU cores, where's the balance between thread amount and thread block times to maximize CPU efficiency by reducing context switch overhead?

I have a wide variety of IO devices that I need to control on Windows 7, with an x64 multi-core processor: PCI devices, network devices, stuff being saved to hard drives, big chunks of data being copied, and so on. The most common policy is: "Put a thread on it!". Several dozen threads later, this is starting to feel like a bad idea.

None of my cores are being used 100%, and there are several cores that are still idling, but there are delays showing up in the range of 10 to 100ms that cannot be explained by IO blockage or CPU intensive usage. Other processes don't seem to require resources either. I'm suspecting context switch overhead.

There's a bunch of possible solutions I have:

Reduce threads by bundling the same IO devices: This mainly goes for the hard drive, but maybe for the network as well. If I'm saving 20MB to the hard drive in one thread, and 10MB in the other, wouldn't it be better to post it all to the same thread? How would this work in case of multiple hard drives?
Reduce threads by bundling similar IO devices, and increase their priority: Dozens of threads with increased priority are probably gonna make my user interface thread stutter. But I can bundle all that functionality together in one or a couple of threads and increase the priority of just those.

Any case studies tackling similar problems are much appreciated.

First, it sounds like these tasks should be performed using asynchronous I/O (IO Completion Ports, preferably), rather than with separate threads. Blocking threads are generally the wrong way to do I/O.
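
As a rough illustration of the idea (in Python rather than the Win32 API, purely for brevity), a single thread can keep several I/O waits in flight at once. The `fake_device_read` coroutine, the device names, and the delays are hypothetical stand-ins for real device I/O and are not taken from the question. On Windows, CPython's default asyncio event loop is itself built on I/O completion ports.

```python
import asyncio
import time


async def fake_device_read(name, delay_s):
    # Hypothetical stand-in for an overlapped read: asyncio.sleep hands the
    # thread back to the event loop instead of blocking it.
    await asyncio.sleep(delay_s)
    return f"{name}: done"


async def poll_all(devices):
    # Start every operation, then wait for all of them on one thread.
    reads = [fake_device_read(name, delay) for name, delay in devices.items()]
    return await asyncio.gather(*reads)


def run_demo():
    devices = {"pci": 0.1, "network": 0.1, "disk": 0.1}  # made-up delays
    start = time.perf_counter()
    results = asyncio.run(poll_all(devices))
    return results, time.perf_counter() - start
```

Calling `run_demo()` should take roughly as long as the slowest single wait rather than the sum of all three, since the waits overlap.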

Second, blocked threads shouldn't affect context switching. The scheduler has to juggle all the active threads, and so, having a lot of threads running (not blocked) might slow down context switching a bit. But as long as most of your threads are blocked, they shouldn't affect the ones that aren't.
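
One way to check this on your own machine is a small sketch like the following (Python, with an arbitrary thread count chosen for illustration). It parks a number of threads on an event and measures how much CPU time the whole process consumes while they sit blocked. It observes process CPU time only, not scheduler internals.

```python
import threading
import time


def cpu_used_while_blocked(thread_count=50, observe_s=0.2):
    release = threading.Event()
    # Every thread blocks immediately on release.wait() and stays there.
    threads = [threading.Thread(target=release.wait) for _ in range(thread_count)]
    for t in threads:
        t.start()

    before = time.process_time()  # CPU seconds used by the whole process
    time.sleep(observe_s)         # observation window; nothing is runnable
    used = time.process_time() - before

    release.set()                 # wake the threads so they can exit
    for t in threads:
        t.join()
    return used
```

If blocked threads cost nothing to keep around, `cpu_used_while_blocked()` should come back as a small fraction of the observation window.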

10-100ms with some cores idle: it's not context-switching overhead in itself since a switch is orders of magnitude faster than these delays, even with a core swap and cache flush.
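
To get a feel for the scale, here is a minimal sketch (in Python, so it includes interpreter overhead and is an upper bound rather than a measurement of the raw OS switch). It bounces a token between two threads, so each round trip needs at least two thread wake-ups. The function name and round count are arbitrary.

```python
import queue
import threading
import time


def round_trip_seconds(rounds=2000):
    to_worker = queue.SimpleQueue()
    to_main = queue.SimpleQueue()

    def worker():
        # Echo every token straight back to the main thread.
        for _ in range(rounds):
            to_main.put(to_worker.get())

    t = threading.Thread(target=worker)
    t.start()

    start = time.perf_counter()
    for i in range(rounds):
        to_worker.put(i)   # wakes the worker
        to_main.get()      # blocks until the worker wakes us again
    elapsed = time.perf_counter() - start

    t.join()
    return elapsed / rounds  # average seconds per round trip
```

Compare the returned value with the 10 to 100 ms (0.01 to 0.1 s) delays from the question.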

Async I/O would not help much here. The kernel thread pools that implement ASIO also have to be scheduled/swapped, albeit this is faster than user-space threads since there are fewer Wagnerian ring-cycles. I would certainly head for ASIO if the CPU loading was becoming an issue, but it's not.

You are not short of CPU, so what is it? Is there much thrashing - RAM shortage? Excessive paging can surely result in large delays. Where is your page file? I've shoved mine off Drive C onto another fast SATA drive.

PCI bandwidth? You got a couple of TV cards in there?

Disk controller flushing activity - have you got an SSD that's approaching capacity? That's always a good one for unexplained pauses. I get the odd pause even though my 128G SSD is only 2/3 full.

I've never had a problem specifically related to context-swap time and I've been writing multithreaded apps for decades. Windows OS schedules & despatches the ready threads onto cores reasonably quickly. 'Several dozen threads' in itself, (i.e. not all running!), is not remotely a problem - looking now at my Task Manager/performance, I have 1213 threads loaded and no performance issues at all with ~6% CPU usage, (app on test running in background, BitTorrent etc). Firefox has 30 threads, VLC media player 27, my test app 23. No problem at all writing this post.

Given your issue of 10-100ms delays, I would be amazed if fiddling with thread priorities and/or changing the way your work is loaded onto threads provides any improvement - something else is stuffing your system, (you haven't got any drivers that I coded, have you? :).

Does perfmon give any clues?

Rgds, Martin

I don't think that there is a conclusive answer, and it probably depends on your OS as well; some handle threads better than others. Still, delays in the 10 to 100 ms range are not due to context switching itself (although they could be due to characteristics of the scheduling algorithm). My experience under Windows is that I/O is very inefficient, and if you're doing I/O, of any type, you will block. And that I/O by one process or thread will end up blocking other processes or threads. (Under Windows, for example, there's probably no point in having more than one thread handle the hard drive. You can't read or write several sectors at the same time, and my impression is that Windows doesn't optimize accesses like some other systems do.)

With regard to your exact questions:

"If I'm saving 20MB to the hard drive in one thread, and 10MB in the other, wouldn't it be better to post it all to the same?": It depends on the OS. Normally, there should be no reduction in time or latency using separate threads, and depending on other activity and the OS, there could be an improvement. (If there are several disk requests outstanding, most OSes will optimize the accesses, reordering the requests to reduce head movement.) The simplest solution would be to try both, and see which works better on your system.

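A minimal sketch of such a "try both" comparison might look like this (Python; the file names, chunk size, and helper names are made up for illustration). It writes the same set of files once from a single thread and once with one thread per file. A single run is noisy and heavily influenced by the OS cache, so treat the numbers as a hint, not a verdict.

```python
import os
import tempfile
import threading
import time


def write_file(path, size_bytes, chunk=1 << 20):
    # Write size_bytes of zeros in chunks, then push them to the device.
    block = bytes(chunk)
    with open(path, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(block[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


def time_writes(directory, sizes, parallel):
    paths = [os.path.join(directory, f"out_{i}.bin") for i in range(len(sizes))]
    start = time.perf_counter()
    if parallel:
        # One thread per file, all started before any is joined.
        threads = [threading.Thread(target=write_file, args=(p, s))
                   for p, s in zip(paths, sizes)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    else:
        # Same files, written back to back from the calling thread.
        for p, s in zip(paths, sizes):
            write_file(p, s)
    return time.perf_counter() - start, paths


def compare(sizes=(20 * 2**20, 10 * 2**20)):
    with tempfile.TemporaryDirectory() as d:
        one_thread, _ = time_writes(d, sizes, parallel=False)
        two_threads, _ = time_writes(d, sizes, parallel=True)
    return {"one_thread_s": one_thread, "two_threads_s": two_threads}
```

`compare()` uses the 20MB and 10MB sizes from the question by default; point `tempfile` at the drive you actually care about before drawing conclusions.
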
"How would this work in case of multiple hard drives?": The OS should be able to do the I/O in parallel, if the requests are to different drives.

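If you do want to bundle writes, one possible arrangement is a single writer thread per drive, each fed by its own queue, so requests to one drive are serialized while different drives proceed in parallel. The sketch below is an assumption about how that could be structured, not something from the question; `DriveWriter` is a hypothetical name and two temporary directories stand in for two physical drives.

```python
import os
import queue
import tempfile
import threading


class DriveWriter:
    """One thread that serializes all writes destined for one drive."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, path, data):
        self._jobs.put((path, data))

    def close(self):
        self._jobs.put(None)  # sentinel: no more jobs
        self._thread.join()

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:
                break
            path, data = job
            with open(path, "ab") as f:  # append, in submission order
                f.write(data)


def demo():
    # Two temporary directories stand in for two physical drives.
    with tempfile.TemporaryDirectory() as drive_a, \
            tempfile.TemporaryDirectory() as drive_b:
        file_a = os.path.join(drive_a, "log.bin")
        file_b = os.path.join(drive_b, "log.bin")
        writers = {drive_a: DriveWriter(), drive_b: DriveWriter()}
        writers[drive_a].submit(file_a, b"first")
        writers[drive_a].submit(file_a, b"second")
        writers[drive_b].submit(file_b, b"other")
        for w in writers.values():
            w.close()
        with open(file_a, "rb") as f:
            a = f.read()
        with open(file_b, "rb") as f:
            b = f.read()
    return a, b
```

Because each drive has exactly one consumer, writes to the same drive land in the order they were submitted.
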
With regard to increasing the priority of one or more threads, it's very OS dependent, but probably worth trying. Unless there's significant CPU time used in the threads with the higher priority, it shouldn't impact the user interface; these threads are mostly blocked for I/O, remember.

Well, my Windows 7 is currently running 950 threads. I don't think that adding another few dozen would make a significant difference. However, you should definitely be looking at a thread pool or other work-stealing device for this - you shouldn't make new threads just to let them block. If Windows provides asynchronous I/O by default, then use it.
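
As a sketch of the thread pool suggestion (in Python, using the standard `ThreadPoolExecutor`, which is a plain shared-queue pool and not a work-stealing one), a small fixed set of workers can service many blocking jobs. `blocking_io`, the job count, and the pool size are placeholders for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time


def blocking_io(job_id, delay_s=0.02):
    time.sleep(delay_s)  # stand-in for a blocking device call
    return job_id, threading.current_thread().name


def run_jobs(job_count=20, pool_size=4):
    # pool_size workers are reused for all job_count jobs.
    with ThreadPoolExecutor(max_workers=pool_size,
                            thread_name_prefix="io") as pool:
        results = list(pool.map(blocking_io, range(job_count)))
    job_ids = [job_id for job_id, _ in results]
    threads_used = {name for _, name in results}
    return job_ids, threads_used
```

`run_jobs()` returns the job ids in submission order along with the set of worker thread names, which never grows beyond `pool_size`.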