I'm trying to get a UPC-NAS benchmark (compiled for 256 threads) running on a cluster of 32 nodes. When I run it, the rsh connections are established for 247 threads, and then it terminates with the following error:
p0_11350: p4_error: Child process exited while making connection to remote process on dell16: 0
506 rm_l_237_24446: (26.785156) net_send: corm_11947: (215.339844) net_srm_l_1rm_24412: (26.785156) net_send: could not write to fd=4, errnrrrm_l_127_5013: (121.984375) net_send: could not write to fd=5, errno = 32
Can anybody point out where the problem lies?
It runs fine with fewer threads, such as 64 or 128.
Errno 32 is EPIPE (#define EPIPE 32 /* Broken pipe */), i.e. the process tried to write to a connection whose other end had already been closed.
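To make the "broken pipe" meaning concrete, here is a minimal sketch in Python. It is not taken from p4 or MPICH, and the function name write_to_closed_pipe is made up for illustration. It reproduces the same errno by writing to a pipe whose reading side has already gone away, which is roughly what net_send runs into when the remote process has exited.

```python
import errno
import os


def write_to_closed_pipe():
    """Write to a pipe whose read end is closed; return the errno that results.

    Hypothetical helper for illustration only.
    """
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # simulate the peer going away
    try:
        # Python ignores SIGPIPE by default, so the failed write surfaces
        # as an OSError (BrokenPipeError) instead of killing the process.
        os.write(write_fd, b"x")
    except OSError as exc:
        return exc.errno
    finally:
        os.close(write_fd)
    return None


if __name__ == "__main__":
    code = write_to_closed_pipe()
    print(code, errno.errorcode.get(code), os.strerror(code))
```

So the net_send messages are probably a symptom rather than the cause: something on the other end died first, and the writers only noticed afterwards.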
I suspect that some file descriptor limit is being hit (check ulimit -a). It could also be a network limit or a network failure.
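If you want to check the descriptor limit from inside a process rather than from an interactive shell, something like the following sketch could help. It assumes a Linux node (the /proc/self/fd path is Linux-specific), and the function names fd_limits and open_fd_count are hypothetical. Keep in mind that a process started through rsh may not see the same limits as your login shell, so it is worth running the check the same way the job is launched.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Return how many descriptors this process has open, or None if unknown.

    Assumes Linux: relies on /proc/self/fd being available.
    """
    try:
        return len(os.listdir("/proc/self/fd"))
    except OSError:
        return None


if __name__ == "__main__":
    soft, hard = fd_limits()
    # A value equal to resource.RLIM_INFINITY means "unlimited".
    print("soft limit:", soft)
    print("hard limit:", hard)
    print("currently open:", open_fd_count())
```

If the soft limit turns out to be small compared with the number of connections the first process has to hold for 256 threads, that would support the descriptor-limit theory; if not, the network is the next thing to look at.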
I should also mention that p4 is ancient, so this could be some internal limit. Development of p4 stopped more than 15 years ago. It is very "stable" code, in the sense of being included in Debian Stable.
So, why do you use MPICH1? Can you move to the less ancient MPICH2?