I am assuming a dual-core (2 cores per processors) machine with 2 processors for the questions that follow; so a total of 4 "cores". So some natural questions arose:
Suppose I wrote a simple serial program and built it in, say, Visual Studio.. and ran the same program twice, say, with distinct input data in each run. Would they be running on the same processor? Or distinct processors? How much RAM memory would be assigned to each? Would it be the RAM memory on 1 processor (2 cores) or the total RAM? I believe the two programs would run on distinct processors and should each have RAM memory of 1 processor (2 cores); but I am not 100% certain. Would the behavior be any different on Linux?
Now suppose my program was written using a distributed memory parallel interface such as MPI and that I ran it once with 2 processors in the np argument (say). Would the program use both processors (and in effect all 4 cores)? Is this the 开发者_如何转开发optimal value for the argument -np? In other words, if I did the same with -np 3 or -np 4; is it correct to assume there would be no added advantage? Again, I think so, but I am not 100% certain. I assume also that I could go higher than 4 (-np 5, -np 6, etc). In such cases, how do the processes compete for memory at values of np > 4? Would the performance get worse for np > 4. I think yes, and perhaps this partly depends on problem size, but again not 100% sure.
Next, suppose I ran two instances of my MPI-built parallel program, both with -np 2, each with, say, different input data. First off, is this possible? I assume it is and that they each run on both processors? How are the two programs synchronized and how do they individually compete for memory sequentially? This should atleast in part, be based on the order of launching the programs, presumably?
Lastly, suppose my program was written using a shared memory parallel interface such as OpenMP and that I ran it once. How many "threads" can I run it on to make full use of shared memory parallelism - is it 2 or 4? (since I have 2 processors with 2 cores each). My guess is it is 4; since all 4 cores are part of the a single shared memory unit? Is that correct? If the answer is 4; does it make sense to run on greater than 4 threads? I am not sure this even works (unlike MPI, where I believe we can do -np 5, -np 6 and so on).
Finally, suppose I run 2 instances of the shared memory parallel program, each with, say, different input data. I assume this is possible and that the individual processes would somehow compete for memory, presumably in the order the programs were launched?
Which processor they run on is entirely up to the OS and depends on many factors, including whatever else is happening on the same machine. The common case, though, is that they will tend to sit on one core each, occasionally swapping to different cores ("occasionally" may mean several times a second or even more frequently).
Çores don't have their own RAM on normal PC hardware, and the processes will be given however much RAM they ask for.
For MPI processes, yes, your parallelism should match the core count (assuming a CPU-heavy workload). If two MPI processes run with -np 2, they will simply consume all four cores. Increase anything and they'll start to contend. As explained above, RAM has nothing to do with any of this, though cache will suffer in the presence of contention.
This "question" is way too long, so I'm going to stop now.
@Marcelo is absolutely right and I'd like to just expand on his answer a little bit.
The OS will determine where and when the threads the comprise the application execution depending on what else is going on in the system and the available resources. Each application will run in it's own process and that process can have hundereds or thousands of threads. The OS (Windows, Linux, Mac whatever) will switch the execution context of the processing cores to ensure that all applications and services get a slice of the pie.
As for I/O access to such things as RAM that is physically controlled by the NorthBridge Controller that sits on your motherboard. Each process (not processor!) will have an allocated amount of RAM that it can deal with that can expand or contract over the lifetime of the application... this of course is limited to the amount of resources available on the system, and also worth noting the OS will take care of swapping RAM requests beyond it's physically availability to disk (i.e. Virtual RAM). On the other hand though you will need to coordinate access to memory within your application through the use of critical sections and other thread synchronising mechanisms.
OpenMP is a library that helps you write multithreaded parellel applications and makes the syntax of keeping threads in sync easier.... I would comment more, but it's been quite a while since I've used it and I'm sure someone could give a better explaination.
I see you are using windows, so I will summarize by saying that you can set process affinities (which core or cores a process can run on) in the task manager. There's also a winapi call but the name escapes me
a) for a single threaded program, they will not launch on the same cpu (assuming its cpu bound). You can guarantee it by changing the affinity. in linux there's a call sched_setaffinity
and a userspace program taskset
b) depends on the MPI library; the machinery is library-specific.
c) it depends on the specific application and data pattern. For small data accesses but lots of messaging passing, you may actually find limiting to 1 CPU to be the most efficient pattern.
精彩评论