I am developing a Lattice Boltzmann (fluid dynamics) code in F#. I am now testing the code on a server with 24 cores and 128 GB of memory. The code basically consists of one main recursive function for the time evolution and, inside it, a System.Threading.Tasks.Parallel.For loop that iterates over the 3D space. The 3D space is 500x500x500 and one time cycle takes forever :).
let rec timeIterate time =
    // Time consuming for loop
    System.Threading.Tasks.Parallel.For(...)
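A slightly more complete sketch of what this looks like (the names, the flat-array layout and the dummy update are placeholders, not my actual LBM kernel; the Parallel.For runs over the outer z index so each task sweeps a contiguous slab):

open System.Threading.Tasks

let nx, ny, nz = 500, 500, 500

let step (src: float[]) (dst: float[]) =
    Parallel.For(0, nz, fun z ->
        for y in 0 .. ny - 1 do
            for x in 0 .. nx - 1 do
                let i = (z * ny + y) * nx + x
                // placeholder for the real collision/streaming update
                dst.[i] <- src.[i] * 0.99) |> ignore

let rec timeIterate time (src: float[]) (dst: float[]) =
    if time = 0 then src
    else
        step src dst
        timeIterate (time - 1) dst src   // swap buffers and recurse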
I would expect the server to use all 24 cores, i.e. to be at 100% usage. What I observe is something between 1% and 30% usage.
And my questions are:
- Is F# an appropriate tool for HPC computations on such servers?
- Is it realistic to use up to 100% of CPU for a real world problem?
- What should I do to obtain a high speedup? Everything is in one big parallel for loop, so I would expect that to be all I need to do...
- If F# is NOT an appropriate language, what language is?
Thank you for any suggestions.
EDIT: I am willing to share the code if anyone is interested to take a look at it.
EDIT2: Here is the stripped version of the code: http://dl.dropbox.com/u/4571/LBM.zip It does not do anything reasonable and I hope I have not introduced any bugs by stripping the code :)
The startup file is ShearFlow.fs, and at the bottom of the file is
let rec mainLoop (fA: FArrayO) (mR: MacroResult) time =
    let a = LBM.Lbm.lbm lt pA getViscosity force g (fA, mR)
1. Is F# an appropriate tool for HPC computations on such servers?
F#, as a language, can encourage code that works well in parallel -- at least part of this comes from reduced state mutability and from higher-order functions -- but this is a can, not a will. However, in HPC there are many specialty programming languages/compilers and/or ways of distributing load (e.g. shared unified memory or distributed micro-kernels). F# is merely a general-purpose programming language: it may or may not have access (e.g. bindings may or may not exist) to the various techniques. (This applies even to non-distributed parallel computing.)
2. Is it realistic to use up to 100% of CPU for a real world problem?
It depends on what the limiting factor is. Talking to my friend who does 5k+ to 100k+ core HPC research and development, the exchange of data and idle times are normally the limiting factors (of course, that is a much higher n :-), so even small improvements in IO reduction (through efficiency or a different algorithm) can lead to significant gains. Don't forget the cost of simply moving data between CPUs/caches on the same machine! And, of course, the ever-slow disk IO...
3. What should I do to obtain a high speedup? Everything is in one big parallel for loop, so I would expect that to be all I need to do...
Find out where the slow part(s) is(are) and fix it(them) :-) E.g. run a profile analysis. Keep in mind it may require using an entirely different algorithm or approach.
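If a full profiler is not at hand, even crude timing of the phases of a single time step narrows things down. A minimal sketch (the phase names in the usage comment are hypothetical, not from the posted code):

open System.Diagnostics

// Run f, print how long it took, and return its result.
let timed label f =
    let sw = Stopwatch.StartNew()
    let result = f ()
    sw.Stop()
    printfn "%s: %d ms" label sw.ElapsedMilliseconds
    result

// Hypothetical usage inside one time step:
// let fA'  = timed "collision" (fun () -> collide fA)
// let fA'' = timed "streaming" (fun () -> stream fA')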
4. If F# is NOT an appropriate language, what language is?
While I am not arguing for it, my PhD friend uses/works on Charm++: it is a very focused language for distributed parallel computing (not the environment in question, but I'm trying to make a point :-) -- F# tries to be a decent general-purpose language.
F# should be as good as any language. It is more how you write your code than the language itself that determines performance.
You should be able to come close to 100%, at least in the high 90% range if your computation is CPU bound.
There could be several reasons you don't get 100% CPU here.
- Your computation could be I/O bound (do you do file or network operations in the for loop?)
- You may have synchronization issues, such as too much locking (do you have shared state between the threads, including where you "commit" the result?) -- see the sketch below.
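To make the locking point concrete, here is a hedged sketch with a made-up reduction over the lattice (not the asker's code): a shared accumulator behind a lock serializes the loop, while giving each chunk its own output slot removes the contention.

open System.Threading.Tasks

// Contended version: every iteration fights for the same lock, so the loop
// runs more or less serially, plus locking overhead.
let totalWithLock (density: float[]) =
    let gate = obj ()
    let total = ref 0.0
    Parallel.For(0, density.Length, fun i ->
        lock gate (fun () -> total.Value <- total.Value + density.[i])) |> ignore
    total.Value

// Contention-free version: each chunk writes only to its own slot,
// and the partial results are combined afterwards.
let totalWithoutLock (density: float[]) (chunks: int) =
    let partial = Array.zeroCreate<float> chunks
    let chunkSize = (density.Length + chunks - 1) / chunks
    Parallel.For(0, chunks, fun c ->
        let lo = c * chunkSize
        let hi = min density.Length (lo + chunkSize)
        let mutable s = 0.0
        for i in lo .. hi - 1 do s <- s + density.[i]
        partial.[c] <- s) |> ignore
    Array.sum partial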
Is F# an appropriate tool for HPC computations on such servers?
I don’t know F# very well, but I would rather suspect that it’s quite well suited. It has all the right tools and it’s a functional language, which lends itself to highly parallel execution.
Is it realistic to use up to 100% of CPU for a real world problem?
Yes, or very nearly. But in fact, your application should use 2400% of the CPU power if you’ve got 24 cores! At least, that’s how it’s usually displayed. If you observe 30% usage, chances are it’s running on a single core and not even using that one fully.
What should I do to obtain a high speedup? Everything is in one big parallel for loop, so I would expect that to be all I need to do...
Well, you didn’t show your code. I can only assume that something in your code prevents it from being executed in parallel.
Alternatively (the 1% to 30% CPU usage point to that) your problem isn’t actually compute bound, and the computation is all the time waiting for other resources such as secondary memory. This doesn’t necessarily depend on the problem – after all, fluid dynamics is a compute-bound problem! – but rather on your particular implementation. So far, a lot points to resource contention.
I don't think that F# has yet made it into the mainstream of HPC, where Fortran, C and C++ dominate, but I don't see any particular reasons why you should avoid it.
No, it's not, not for any extended time period. Sooner or later all (a questionable assertion, admittedly) HPC codes become memory-bandwidth limited -- CPUs can crunch numbers a lot faster than RAM can load and store them. In a long-running computation you are doing well to use 10% of the theoretical maximum number of FLOPs that your CPUs can execute.
I don't really know F# well enough to provide specific advice for your configuration (I'm one of those HPC Fortran programmers). But in general you need to ensure good load balancing (i.e. all cores doing the same amount of work) and efficient, effective use of the memory hierarchy (which gets harder as languages get 'higher-level', since they tend to make it difficult to manage things at a low level), and the best possible thing you can do is choose the best algorithm. The best parallel algorithm is not necessarily the best serial algorithm made parallel, and I suspect that the best functional (implementation of an) algorithm may not be the best imperative (implementation of an) algorithm. A concrete sketch of the granularity and locality point follows at the end of this answer.
Fortran.
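Coming back to the granularity and memory-hierarchy point: in .NET one concrete (hedged) option is a range partitioner, so that each task gets a contiguous block of cells instead of being scheduled cell by cell; this helps cache locality and cuts per-iteration scheduling overhead. The update below is a placeholder, not the real LBM kernel, and this is not necessarily the best scheme for the code in question.

open System.Collections.Concurrent
open System.Threading.Tasks

let updateChunked (src: float[]) (dst: float[]) =
    // Partitioner.Create splits the index space into contiguous ranges.
    let ranges = Partitioner.Create(0, src.Length)
    Parallel.ForEach(ranges, fun (range: int * int) ->
        let lo, hi = range
        for i in lo .. hi - 1 do
            // placeholder for the real per-cell update
            dst.[i] <- src.[i] * 0.99) |> ignore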
The thread pool has a maximum number of threads depending on various circumstances.
From MSDN:
Maximum Number of Thread Pool Threads
The number of operations that can be queued to the thread pool is limited only by available memory; however, the thread pool limits the number of threads that can be active in the process simultaneously. Beginning with the .NET Framework version 4, the default size of the thread pool for a process depends on several factors, such as the size of the virtual address space. A process can call the GetMaxThreads method to determine the number of threads.
You can control the maximum number of threads by using the GetMaxThreads and SetMaxThreads methods.
Also try upping MinThreads, if necessary. The number of cores on your system might be throwing the thread pool's optimization algorithms off? Worth a try.
Again, from MSDN:
The thread pool provides new worker threads or I/O completion threads on demand until it reaches a specified minimum for each category. You can use the GetMinThreads method to obtain these minimum values.
When a minimum is reached, the thread pool can create additional threads or wait until some tasks complete. Beginning with the .NET Framework 4, the thread pool creates and destroys worker threads in order to optimize throughput, which is defined as the number of tasks that complete per unit of time. Too few threads might not make optimal use of available resources, whereas too many threads could increase resource contention.
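For what it's worth, here is a hedged sketch of poking at those knobs from F# (whether it helps depends entirely on why the pool is holding worker threads back):

open System
open System.Threading
open System.Threading.Tasks

// Inspect the current thread pool minimums.
let mutable minWorkers = 0
let mutable minIo = 0
ThreadPool.GetMinThreads(&minWorkers, &minIo)
printfn "min worker threads: %d, min IO threads: %d" minWorkers minIo

// Optionally raise the worker minimum to the core count so the pool ramps up immediately.
ThreadPool.SetMinThreads(Environment.ProcessorCount, minIo) |> ignore

// Parallel.For can also be told explicitly how many workers to use at most.
let opts = ParallelOptions(MaxDegreeOfParallelism = Environment.ProcessorCount)
Parallel.For(0, 500, opts, fun _ ->
    // per-slab work would go here
    ()) |> ignore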
Functional programming focuses on high-level abstraction, i.e., you abstract the common programming patterns out and make them generally reusable. High performance computing is about getting things to run in parallel, thinking about the data that moves between different threads, and thinking about data locality to keep the cache hit rate high. These are two different directions.
Nowadays, people tend to think of FP as a silver bullet for everything parallel, including high performance computing. No. Otherwise you would see a lot of FP papers published at high performance computing conferences; actually there are quite few.
What you are using now is the Task Parallel Library, which is a .NET library for C#/F#/VB; it is not F# specific (and it is itself written in C#, I believe).
With this in mind, let's go back to your question. Why can't you use 100% CPU? The skills that help you find the bottleneck have little to do with F#. Profile your program and see whether some threads are waiting for others to finish (all the computation inside the Parallel.For has to finish before the next time step can continue).
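A cheap first check along those lines (a sketch, not from the posted code): record which managed threads actually execute iterations of the parallel loop; if only one or two thread ids show up, the work is not being spread across the cores at all.

open System.Collections.Concurrent
open System.Threading
open System.Threading.Tasks

let countWorkerThreads () =
    let threadIds = ConcurrentDictionary<int, byte>()
    Parallel.For(0, 500, fun _ ->
        // ... the per-slab LBM work would go here ...
        threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, 0uy) |> ignore)
    |> ignore
    printfn "distinct worker threads used: %d" threadIds.Count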
Have you tried the threading analysis tools included in Visual Studio, i.e. the concurrency profiling option in the Performance Wizard?