We have recently purchased a dual Intel X5650 workstation to run a floating-point intensive simulation, under Ubuntu 10.04.
Each X5650 has 6 cores, so there are 12 cores in total. The code is trivially parallel, so I have been running it mostly with 12 threads, and observing approximately "1200%" processor utilization through "top".
HyperThreading is enabled in the BIOS, so the operating system nominally sees 24 cores available. If I increase the number of threads to 24, top reports approximately 2000% processor utilization - however, the actual code performance does not appear to increase by a factor of 20/12.
My question is - how does HyperThreading actually work on the latest generation of Xeons? Would a floating-point intensive code benefit from scheduling more than one thread per core? Does the answer change if the working set is on the order of the cache size, as compared to several times larger, or if there are substantial I/O operations (e.g. writing simulation outputs to disk)?
Additionally - how should I interpret processor utilization percentages from "top" when hyperthreading is enabled?
With HT, the OS will schedule 2 threads to each core at the same time. The utilization reported by top is essentially just the average number of threads in the "running" state over its sampling interval (typically 1 second). Running threads are available for the CPU to execute, but may not be getting much work done, e.g. if they're mostly stalled on cache misses.
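A quick way to see this is to run a toy program that spawns N threads that do nothing but spin: top will report roughly N x 100% for the process even though no useful work is being done. A minimal sketch (the file name and details are just for illustration):

/* spin.c - spawn N busy threads; watch the process in top.
 * Build: gcc -O2 -std=gnu99 -pthread spin.c -o spin
 * Run:   ./spin 24
 */
#include <pthread.h>
#include <stdlib.h>

static void *spin(void *arg)
{
    volatile unsigned long x = 0;   /* volatile so the loop isn't optimized away */
    for (;;)
        x++;
    return NULL;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1;
    pthread_t *tids = malloc(n * sizeof(pthread_t));

    for (int i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, spin, NULL);
    for (int i = 0; i < n; i++)     /* never returns; stop with Ctrl-C */
        pthread_join(tids[i], NULL);
    return 0;
}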
When a thread is blocked on real I/O -- network, disk, etc. -- the OS will deschedule it from the core and schedule some other ready thread, so HT won't help.
HT tries to get more utilization out of the math execution units without actually doubling very much hardware in the core. If one thread has enough instruction-level parallelism and doesn't miss cache much, then it'll mostly fill up the core's resources and HT won't help. For heavy FP apps with data that doesn't fit in cache, HT still probably won't help much, since both threads are using the same execution units (SSE math) and both need more than the full cache -- in fact it's likely to hurt since they'll be competing for cache and thrashing more. Of course it depends on exactly what you're doing and what your data access patterns look like.
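Whether HT helps your particular mix of FP work and memory traffic is something you can measure directly. Here is a rough sketch of a toy benchmark (my own, not anything from the question) that you could run with OMP_NUM_THREADS=12 and then 24, once with a working set that fits in cache and once with one several times larger:

/* ht_bench.c - crude FP throughput test; compare OMP_NUM_THREADS=12 vs 24.
 * Build: gcc -O2 -std=gnu99 -fopenmp ht_bench.c -o ht_bench -lm
 * Run:   OMP_NUM_THREADS=12 ./ht_bench 1000000 200    (~8 MB working set, roughly cache-resident)
 *        OMP_NUM_THREADS=24 ./ht_bench 100000000 2    (~800 MB working set)
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv)
{
    long n    = (argc > 1) ? atol(argv[1]) : 1000000L;
    long reps = (argc > 2) ? atol(argv[2]) : 1L;
    double *a = malloc(n * sizeof(double));
    for (long i = 0; i < n; i++)
        a[i] = 1.0 + i * 1e-9;

    double sum = 0.0;
    double t0 = omp_get_wtime();
    for (long r = 0; r < reps; r++) {
        /* FP-heavy loop over the array; with a large n the threads also
           compete for cache and memory bandwidth, as described above. */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += sqrt(a[i]) * a[i] + a[i] / (a[i] + 1.0);
    }
    double t1 = omp_get_wtime();

    printf("threads=%d  n=%ld  reps=%ld  time=%.3f s  (checksum %g)\n",
           omp_get_max_threads(), n, reps, t1 - t0, sum);
    free(a);
    return 0;
}

If the 24-thread runs come out no faster than the 12-thread runs, especially for the large working set, that matches the cache-contention argument above.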
HT mostly helps on branchy code with irregular and unpredictable access patterns. For FP-intensive code you can often do better with 1 thread per core and careful design of your access patterns (e.g. good data blocking).
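As a concrete illustration of what "good data blocking" can look like (a generic sketch, not code from this answer), here is a tiled matrix multiply; the tile size BS is a tuning parameter you would adjust to your cache sizes:

/* Blocked (tiled) matrix multiply: C = A * B, all n x n, row-major.
 * Working on one BS x BS tile at a time keeps the tile hot in cache
 * instead of streaming whole rows and columns through it.
 * Compile as C99, e.g. gcc -O2 -std=gnu99 -c blocked_matmul.c
 */
#define BS 64   /* tune so a few BS x BS tiles of doubles fit comfortably in cache */

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i * n + j] = 0.0;

    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B */
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}

If you do settle on one thread per physical core, it is also worth pinning the threads (e.g. with taskset or GOMP_CPU_AFFINITY) so the scheduler doesn't land two of them on HT siblings of the same core.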
I have developed a very high-performing, embarrassingly parallel code which will run on as many cores as are available. Initially it ran on a 2-core AMD laptop, but when I moved to a 2-core + HT Intel laptop the improvement in execution was marginal: the later-generation CPU, the two extra (HT) logical cores, and the 670 MHz higher clock speed were barely noticeable. When I restricted the code to two non-HT threads, the expected speed-up for the 2-core case was suddenly there and I could breathe easier.
When I changed the compiler optimization level from 3 to 2 and finally to 1, hyperthreading started showing its promise. The best results came at optimization level 1, and they were approximately 50% better than the two-core non-HT case.
What I think happens is that code that is too well written and too highly optimized utilizes a core to the utmost, to the extent that there are basically no spare resources left for a second thread to execute on. Of course the second thread will run, but the two threads will collide whenever they need the same resource, and they do so much more often at the higher optimization levels.
With less optimized, less dense code, the threads had more opportunity to "interleave" their accesses to the core's resources. The result was two threads each running at around 75% of the rate at which the most highly optimized code ran on one core. Summed up, the less optimized code on two threads yielded about 1.5 times the throughput of the most optimized code on one.
I have entertained the idea of writing code to see how much core-resource "interleaving" can actually be achieved. My hypothesis is that each thread would spend half of each inner-loop pass in one of the core's execution pipes and half in the other, so the best interleaving would be achieved when one thread runs half an inner-loop pass behind the other; see the sketch below.
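A minimal sketch of one way such a measurement could be set up (hypothetical code; the CPU numbers 0 and 12 are only a guess at which logical CPUs are HT siblings on a dual X5650 box, so check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list for the real pairing): pin an FP-heavy worker to one logical CPU, then to both siblings of the same physical core, and compare the combined throughput.

/* ht_pair.c - time an FP-heavy loop on one logical CPU, then on two logical
 * CPUs assumed to be HT siblings of the same physical core, and compare.
 * Build: gcc -O2 -std=gnu99 -pthread ht_pair.c -o ht_pair -lm -lrt
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

#define ITERS 200000000L

static volatile double sink;   /* keeps the result "used" so the loop survives optimization */

static void *worker(void *arg)
{
    int cpu = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    double x = 1.0;
    for (long i = 0; i < ITERS; i++)   /* dependent sqrt/divide chain */
        x = sqrt(x + 1.0) / 1.0000001;
    sink = x;
    return NULL;
}

static double run(int ncpu, int *cpus)
{
    struct timespec t0, t1;
    pthread_t tid[2];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ncpu; i++)
        pthread_create(&tid[i], NULL, worker, &cpus[i]);
    for (int i = 0; i < ncpu; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    int solo[1] = { 0 };       /* one logical CPU */
    int pair[2] = { 0, 12 };   /* assumed HT siblings; verify via sysfs */

    double t_one = run(1, solo);
    double t_two = run(2, pair);
    /* Two siblings do twice the work, so a result above 1.0 means HT added throughput. */
    printf("1 thread: %.2fs, 2 HT siblings: %.2fs, HT gain: %.2fx\n",
           t_one, t_two, 2.0 * t_one / t_two);
    return 0;
}

If the "HT gain" comes out well above 1.0x for a latency-bound loop like this but stays near 1.0x when the inner loop is replaced by your real, highly optimized kernel, that would support the resource-collision explanation above.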