I'm w开发者_Python百科riting an MPI program to be run over a local area network. These machines can be ssh'd to by any student at any time.
Although I always test my program at night, the performance has been very inconsistent. My guess is that some nodes were busy when I ran the program.
So my question is: can I write a script to detect non-busy machines and update the machine file? What's an easy way to write it?
Thanks a lot.
SSH into each machine, then read the /proc/loadavg file or determine the "business" in some other way.
I think the easiest way would be installing the check_load[1] script from Nagios to every node you want to check and call it via ssh with some sensible parameters:
# /usr/lib64/nagios/plugins/check_load -w 1,2,3 -c 3,4,5
OK - load average: 0.20, 0.43, 0.50|load1=0.200;1.000;3.000;0; load5=0.430;2.000;4.000;0; load15=0.500;3.000;5.000;0;
# /usr/lib64/nagios/plugins/check_load -w 0.1,2,3 -c 3,4,5
WARNING - load average: 0.18, 0.43, 0.50|load1=0.180;0.100;3.000;0; load5=0.430;2.000;4.000;0; load15=0.500;3.000;5.000;0;
# /usr/lib64/nagios/plugins/check_load -w 0.01,2,3 -c
0.1,4,5
CRITICAL - load average: 0.41, 0.46, 0.51|load1=0.410;0.010;0.100;0; load5=0.460;2.000;4.000;0; load15=0.510;3.000;5.000;0;
CRITICAL would mean "really busy", WARNING could be "is kinda busy" and OK would mean "the machine is idle".
You have to pay attention for the tresholds you have to give as 1/5/15 minute for warning and critical; for instance, a machine with 16 cores having a load of 3 is perfectly ok, while a load of 3 on a single-core machine would mean it's really really busy.
Good luck! Alex.
[1] http://nagiosplugins.org/man/check_load
精彩评论