LoadBalancing

Currently, the best way to run jobs across multiple machines on the cluster is to set up passwordless logins (see ConfiguringSSH) and then use ssh to launch jobs remotely. Because this does not balance the load across the cluster, check the machine loads first with the "cuptime.pl" or "cup.pl" commands, as described in CurrentUsage. (Note that with the 32- and 64-core machines available, an interactive session on one of those machines may be sufficient, with no need for remote job launching.)

For example, a user logged into kimclust11 can run 'ssh kimclust38 date' to launch the command 'date' on kimclust38. The output from the command is displayed in the current terminal window.


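The following Perl code can also be used to run commands across the cluster:

for (my $i=40; $i<=50; $i++) {  # step through kimclust40 .. kimclust50
    my $tmp = `ssh kimclust$i date`;  # run 'date' remotely and capture its output
    chomp($tmp);                      # drop the trailing newline from the output
    print "kimclust$i: $tmp\n";
}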


As a first step toward load balancing, I created the Perl script 'load', which uses passwordless ssh logins to launch a single command on the fastest, least loaded cluster machine. When run, 'load' checks all of the machines listed in your "~/.my_hosts" file (one machine name per line, as in /usr/local/lib/all_hosts) and computes their machine stats. It then checks how many niced and non-niced jobs are running on each machine and picks the fastest machine that has at least 1 unloaded CPU. If all CPUs are in use, then depending on the flags you use, it will either not run your job or recheck the machines while ignoring either niced or non-niced jobs, on the assumption that every CPU can have 1 niced and 1 non-niced job running. Thus 'load' will:

Again, the logic is that every CPU is allowed 1 niced job and 1 non-niced job. Thus, when you include the "-i" ("ignore") flag, 'load' will fully load the CPUs accordingly. If you do not include the "-i" flag, new jobs will not be run on CPUs that have any pre-existing jobs.
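The selection rule described above can be sketched in Perl as follows. This is only an illustration of the stated logic, not the actual 'load' source: the field names (speed, cpus, niced, normal), the host names, the numbers, and the option names 'ignore' and 'niced' are all hypothetical. It also assumes that in ignore mode the jobs being ignored are those of the opposite kind to the new job.

```perl
use strict;
use warnings;

# Hypothetical stats for a fully loaded cluster. The real 'load' script
# gathers its own stats over ssh; these fields and values are made up.
my @hosts = (
    { name => 'hostA', speed => 3.0, cpus => 4, niced => 4, normal => 0 },
    { name => 'hostB', speed => 2.0, cpus => 2, niced => 0, normal => 2 },
);

# Return the fastest host whose free-slot count is at least 1, or undef.
sub fastest_with_room {
    my ($hosts, $free) = @_;
    my @ok = grep { $free->($_) >= 1 } @$hosts;
    my ($best) = sort { $b->{speed} <=> $a->{speed} } @ok;
    return $best;
}

# Options (hypothetical): ignore => 1 mimics the "-i" flag,
# niced => 1 means the new job will itself be niced.
sub pick_host {
    my ($hosts, %opt) = @_;

    # Pass 1: every running job counts, so a CPU with any job is "used".
    my $strict = sub { $_[0]{cpus} - $_[0]{niced} - $_[0]{normal} };
    my $best = fastest_with_room($hosts, $strict);
    return $best if $best || !$opt{ignore};

    # Pass 2 (ignore mode): each CPU may hold 1 niced + 1 non-niced job,
    # so only jobs of the same kind as the new job count against a host.
    my $kind = $opt{niced} ? 'niced' : 'normal';
    my $relaxed = sub { $_[0]{cpus} - $_[0]{$kind} };
    return fastest_with_room($hosts, $relaxed);
}

for my $opts ([], [ignore => 1, niced => 1], [ignore => 1]) {
    my $h = pick_host(\@hosts, @$opts);
    print $h ? $h->{name} : 'none', "\n";
}
```

With these made-up numbers every CPU already has a job, so the strict pass finds no machine. In ignore mode a niced job goes to hostB (which has no niced jobs) and a non-niced job goes to hostA (which has no non-niced jobs).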

Machine Limits:

Examples:

Lastly, you should be able to include ANY machines in your "~/.my_hosts" file that allow passwordless ssh access, not just machines in our cluster. If you wish to use 'cup.pl' or 'load' on non-cluster machines, you will need to copy /usr/local/bin/compStats.pl and /usr/local/bin/loadStats.pl to those machines. If you do not have access to /usr/local/bin on the non-cluster machines, you will also need to modify your own copy of /usr/local/bin/cup.pl or /usr/local/bin/load to look for compStats.pl and/or loadStats.pl in another directory.
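Before adding outside machines, it may help to confirm that each entry in the hosts file really accepts passwordless logins. The following is a minimal sketch, not part of 'load' or 'cup.pl'. The script name check_hosts.pl and the function names are hypothetical, and skipping '#' comments is an assumption (the file format is only stated to be one machine name per line). It relies on the standard OpenSSH options BatchMode and ConnectTimeout so that ssh fails instead of prompting for a password.

```perl
use strict;
use warnings;

# Read machine names, one per line, skipping blank lines.
# Stripping '#' comments is an assumption, not a documented feature.
sub read_hosts {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open $path: $!";
    my @hosts;
    while (my $line = <$fh>) {
        $line =~ s/#.*//;          # drop comments
        $line =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
        push @hosts, $line if length $line;
    }
    close $fh;
    return @hosts;
}

# True if ssh can run a trivial command without asking for a password.
sub passwordless_ok {
    my ($host) = @_;
    my $rc = system('ssh', '-o', 'BatchMode=yes',
                    '-o', 'ConnectTimeout=5', $host, 'true');
    return $rc == 0;
}

# Only contact machines when a hosts file is given on the command line.
if (@ARGV) {
    for my $host (read_hosts($ARGV[0])) {
        printf "%-20s %s\n", $host, passwordless_ok($host) ? 'ok' : 'FAILED';
    }
}
```

Saved as check_hosts.pl, it could be run as 'perl check_hosts.pl ~/.my_hosts'. Any machine reported as FAILED would need its ssh keys set up (see ConfiguringSSH) before 'load' could use it.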