I have the following problem.
First, my environment: I have two 24-CPU servers to work with and one big job (resampling a large dataset) to share between them. I've set up multicore and a (socket) snow cluster on each. As a high-level interface I'm using foreach.
What is the optimal way to share the job? Should I set up a snow cluster using CPUs from both machines and split the job that way (i.e. use doSNOW for the foreach loop)? Or should I use the two servers separately and use multicore on each (i.e. split the job into two chunks, run one on each server, and then stitch the results back together)?
Basically, what is an easy way to: 1. Keep communication between the servers down (since this is probably the slowest part). 2. Ensure that the random numbers generated on the servers are not highly correlated.
snow and multicore differ in one significant way: multicore forks a new process, so it uses the same memory as the main process. This means that with snow you need to distribute the data you want to process (physically send it to the children and store it in their memory space), whereas with multicore the children can simply access the main process's copy of the data, which saves transfer time and memory.
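To make the difference concrete, here is a minimal sketch. It uses the base parallel package, which ships with R and absorbed much of both multicore and snow; that substitution is my assumption, and the multicore and snow calls are closely analogous. The dataset big and the function resample_mean are made-up stand-ins, and both workers run on the local machine. For a real two-server setup you would pass host names to the cluster constructor instead of a worker count.

```r
library(parallel)

set.seed(1)
big <- rnorm(1e5)  # hypothetical stand-in for the large dataset

# One bootstrap replicate: resample with replacement and take the mean.
resample_mean <- function(i) mean(sample(big, length(big), replace = TRUE))

# Fork-based (multicore style): children see the parent's copy of `big`,
# so nothing is exported. Forking is unavailable on Windows, where
# mc.cores must be 1.
n_fork <- if (.Platform$OS.type == "windows") 1L else 2L
fork_res <- mclapply(1:8, resample_mean, mc.cores = n_fork)

# Socket-based (snow style): each worker is a fresh R process, so `big`
# has to be sent to every worker before it can be used there.
cl <- makePSOCKcluster(2)
clusterExport(cl, "big")
sock_res <- parLapply(cl, 1:8, resample_mean)
stopCluster(cl)
```

Forking only works within one machine, so any scheme that spans both servers has to pay the export cost at least once per remote worker.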
I don't have enough experience to answer (1). But the way to address (2) is to use a random number generator designed for parallel programs: look at the rlecuyer package and the clusterSetupRNG function in snow.
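As a rough illustration of per-worker random number streams, here is a sketch using clusterSetRNGStream from the base parallel package rather than snow's clusterSetupRNG. That is a swapped-in equivalent, not the call named above; it uses the same family of generator (L'Ecuyer's combined multiple-recursive generator) that rlecuyer provides. The helper names three_uniforms and draw_once are made up, and the two workers run locally.

```r
library(parallel)

# Defined at top level so that only the function itself is shipped to workers.
three_uniforms <- function() runif(3)

# Start two local workers, give each its own RNG stream derived from
# `seed`, and draw three uniforms on each worker.
draw_once <- function(seed) {
  cl <- makePSOCKcluster(2)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = seed)
  clusterCall(cl, three_uniforms)
}

a <- draw_once(42)
b <- draw_once(42)

identical(a, b)            # same seed reproduces the same streams
identical(a[[1]], a[[2]])  # the two workers draw from different streams
```

With snow itself, the corresponding call should be clusterSetupRNG(cl, type = "RNGstream"), which needs the rlecuyer package installed.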