I have installed mpich1 and UPC on a machine at a directory /scratch/sharatds (which is mounted on NFS).
When I first ran it, it worked well on a single machine (lagrid02).
When I tried including the other machines (lagrid02-09) as well, it threw the following errors:
rm_3521: p4_error: rm_start: net_conn_to_listener failed: 36394
p0_30647: p4_error: Child process exited while making connection to remote process on lagrid03: 0
p0_30647: (38.617188) net_send: could not write to fd=4, errno = 32
If you have an idea of what could be going wrong, can you suggest anything I could do to make it work?
This is a sysadmin question, not a programming question.
Firstly - mpich1? Really? Mpich1 hasn't been updated since 2005; I would strongly suggest using mpich2 instead. You won't find many people willing to offer help or support with mpich1 problems.
As for the particular error messages across nodes, there are several reasons why MPI might have trouble communicating between nodes: do you have passwordless ssh set up so you can ssh from lagrid02 to lagrid03? Are there firewalls on the various machines?
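To check the passwordless-ssh question on every node at once, something like the following rough sketch might help. It assumes that your MPI launcher uses ssh as its remote shell (mpich1 can also be built to use rsh, in which case this test tells you nothing) and that the hosts really are named lagrid02 through lagrid09, as in the question. The helper names `host_names`, `ssh_probe_command` and `probe` are made up for this illustration. `BatchMode=yes` makes ssh fail instead of prompting for a password, so a failure here points to a key or connectivity problem rather than to MPI itself.

```python
import subprocess


def host_names(prefix="lagrid", first=2, last=9, width=2):
    """Build names such as lagrid02 ... lagrid09 (pattern assumed from the question)."""
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


def ssh_probe_command(host, timeout=5):
    """ssh invocation that fails rather than prompting if a password would be needed."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never ask for a password interactively
        "-o", f"ConnectTimeout={timeout}",   # give up quickly on unreachable hosts
        host,
        "true",                              # trivial remote command
    ]


def probe(host, run=subprocess.run):
    """Return True if a non-interactive ssh login to `host` succeeds."""
    try:
        result = run(ssh_probe_command(host), capture_output=True, timeout=15)
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the connection hung past the timeout
        return False
    return result.returncode == 0
```

Run it from lagrid02 by calling `probe(h)` for each `h` in `host_names()` and printing the result. Any host that returns `False` is worth checking by hand with a plain `ssh` before looking at firewalls or at MPI.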