Parallel Performance Analysis

Introduction:

The jobpar script queries SGE for data and then sorts out the relevant information for a given job. This is designed to show the parallel performance of the job and help give clues about how to optimize the parallelism within the program. You can use qstat or qb to find the job-ID within SGE and then run jobpar on that job-ID:

jbp@head1 [ 89 ] % jobpar 25392
job 25392 is running on 8 hosts (16 cpus)
cbcb-n25 (2) nsoe-n2 (2) nsoe-n8 (2) cbcb-n17 (2m)
cbcb-n5 (2) cbcb-n14 (2) cbcb-n23 (2) cbcb-n1 (2)
using load type 'hl:load_avg':
min load: 1.120000
max load: 1.420000
avg load: 1.2725
number of machines per load-threshold:
load 0.00 to 0.25 : 0 :
load 0.25 to 0.50 : 0 :
load 0.50 to 0.75 : 0 :
load 0.75 to 1.00 : 0 :
load 1.00 to 1.25 : 4 : ####
load 1.25 to 1.50 : 4 : ####

The first block of information lists the machines that are being used by the job, along with the number of CPUs in use on each machine. The letter 'm' marks the machine that hosts the 'master' process. The letter 'b' indicates that data from that machine may be bad -- it is being shared with another user, or is not providing the requested information (perhaps due to a transient SGE error condition).

The next block contains the min, max, and average load found across all machines in the parallel job. Since every machine was using 2 CPUs, the load shown here is not particularly impressive; we would like to see a min load of 1.8 or more. However, if the job has just started, the load information may not be very accurate: this is a load average, so at job start it may be averaging over some amount of idle time.

The last block of information is a histogram of the number of machines within each load range. Here we see that half the machines have loads in the 1.00 to 1.25 range and the other half are between 1.25 and 1.50. Again, this is not stellar parallel performance, but the load seems to be well balanced.
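
The summary numbers can be checked by hand. The sketch below is not jobpar's own code: `summarize_loads` is a hypothetical helper, and the per-host loads are made-up values chosen only so that they reproduce the min, max, and average printed in the example above.

```python
def summarize_loads(loads, cpus_per_host):
    """Return min, max, and mean load, plus mean utilization (load / CPUs)."""
    avg = sum(loads) / len(loads)
    return {
        "min": min(loads),
        "max": max(loads),
        "avg": avg,
        # 1.0 would mean every allocated CPU is busy all of the time.
        "utilization": avg / cpus_per_host,
    }

# Illustrative per-host loads (assumed, not real jobpar data).
example_loads = [1.12, 1.18, 1.20, 1.24, 1.30, 1.32, 1.40, 1.42]
summary = summarize_loads(example_loads, cpus_per_host=2)
print(summary)
```

A utilization of roughly 0.64 means each CPU is busy a bit under two-thirds of the time. The target of a min load of 1.8 on 2 CPUs corresponds to a utilization of 0.9.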


Command-line Options:

The jobpar script takes several options which modify its behavior:

-a (default) use 'load_avg' information; this is roughly a 5-minute average over the whole remote machine
-s use 'load_short' information; this is roughly a 1-minute average over the whole remote machine; use this during job start-up so that less idle time is averaged in
-m use 'load_medium' information; this is roughly a 5-minute average over the whole remote machine; this may or may not be the same as 'load_avg'
-l use 'load_long' information; this is roughly a 15-minute average over the whole remote machine; use this for a more thorough analysis of the overall performance of the code
-A use 'np_load_avg' information; this is roughly a 5-minute average over the whole remote machine but averaged per CPU; use this if you are using different numbers of CPUs per machine
-S use 'np_load_short' information; this is roughly a 1-minute average over the whole remote machine but averaged per CPU; use during job-start; use if different number of CPUs per machine
-M use 'np_load_medium' information; this is roughly a 5-minute average over the whole remote machine but averaged per CPU; this may or may not be the same as 'np_load_avg'; use if different number of CPUs per machine
-L use 'np_load_long' information; this is roughly a 15-minute average over the whole remote machine but averaged per CPU; use for more thorough analysis; use if different number of CPUs per machine
-C use the CPU load reported by the machines; this is the per-machine CPU load (averaged over the number of CPUs) and should be OK to use during job start-up
-r # (default is '-r 4') set rounding parameter for histogram output ('4' means to round loads to nearest 1/4th); set this higher for a more detailed view
-x eXtended output, includes memory analysis
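
To make the '-r' rounding parameter concrete, here is a minimal sketch of how a load histogram with bins of width 1/r could be built. It assumes simple binning from 0 upward and is not taken from jobpar itself; `load_histogram`, `format_histogram`, and the example loads are all hypothetical.

```python
import math

def load_histogram(loads, rounding=4):
    """Count hosts per load bin; each bin is 1/rounding wide, starting at 0."""
    nbins = max(1, math.ceil(max(loads) * rounding))
    counts = [0] * nbins
    for load in loads:
        # A load sitting exactly on the top edge goes into the last bin.
        index = min(int(load * rounding), nbins - 1)
        counts[index] += 1
    return counts

def format_histogram(counts, rounding=4):
    """Render the counts in a layout similar to the jobpar output."""
    width = 1.0 / rounding
    return [
        f"load {i * width:.2f} to {(i + 1) * width:.2f} : {c} : {'#' * c}"
        for i, c in enumerate(counts)
    ]

# Illustrative per-host loads (assumed, not real jobpar data).
example_loads = [1.12, 1.18, 1.20, 1.24, 1.30, 1.32, 1.40, 1.42]
coarse = load_histogram(example_loads, rounding=4)  # bins of 1/4
fine = load_histogram(example_loads, rounding=8)    # bins of 1/8
print("\n".join(format_histogram(coarse, rounding=4)))
```

With a rounding of 4, these loads fall into two bins of four machines each. With a rounding of 8, the same loads spread over four narrower bins, which is the "more detailed view" that a higher setting gives.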

Memory Analysis:

As with the basic analysis above, differences in memory usage among the parallel processes may indicate a load imbalance or an inefficient use of memory. For example, this could mean that when you run on 10 processors, you can only increase your problem size by 9x. Using the "-x" option, you will see a second histogram block for memory usage per node:

number of machines per memory-threshold:
mem 0MB to 100MB : 1 : #
mem 100MB to 200MB : 1 : #
mem 200MB to 300MB : 2 : ##
mem 300MB to 400MB : 1 : #
mem 400MB to 500MB : 0 :
mem 500MB to 600MB : 1 : #
mem 600MB to 700MB : 4 : ####
mem 700MB to 800MB : 2 : ##
mem 800MB to 900MB : 3 : ###

In this case, the parallel program is consuming a wide range of memory on the different machines -- some machines are using nearly 9-times more memory than others. However, as with the CPU data, if the machines are shared with other users (as was the case with this example), then the differences in memory usage could simply be caused by those other users.
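
The "nearly 9-times" figure is just the ratio of the largest to the smallest per-machine memory footprint. The sketch below shows that arithmetic together with the 100MB binning. The helper names and the per-host values are assumptions, picked only to fall into the same bins as the histogram above.

```python
from collections import Counter

def memory_bins(mem_mb, bin_mb=100):
    """Count hosts per memory bin; each key is the lower edge of a bin in MB."""
    return Counter((int(m) // bin_mb) * bin_mb for m in mem_mb)

def memory_spread(mem_mb):
    """Ratio of largest to smallest footprint (assumes no host reports 0)."""
    return max(mem_mb) / min(mem_mb)

# Hypothetical per-host memory usage in MB (not taken from a real job).
usage = [99, 150, 240, 260, 310, 520, 610, 640, 650, 690,
         720, 780, 810, 850, 880]
bins = memory_bins(usage)
spread = memory_spread(usage)
print(sorted(bins.items()), round(spread, 2))
```

A spread close to 1 suggests the data is evenly distributed. A large spread is worth investigating, keeping in mind the caveat about machines shared with other users.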


Possible Cautions:

When assessing the performance of any parallel job, you want to make sure that the machines in use are as "equal" as possible. In particular, you want to make sure that all the CPUs are the same speed, or else you will not be able to tell whether a load imbalance is due to your program itself or to the CPU-speed differences. To request that SGE only assign you CPUs of the same speed, you can use '-l mhz=2800' in your job submission script. This will request that only 2800MHz machines (such as node1 to 64 and cbcb-n1 to 64) are used. Use 'qhost -l mhz=2800' to see a list of machines with 'mhz' set to 2800.

In some cases, the output from jobpar may contain a warning that the job just started recently and thus the information may not be accurate. This is due to the averaging of information that is reported by SGE. If you requested a 5-minute average (any of the '*_medium' averages above), and the job started 3 minutes ago, then the averages will represent some amount of idle time before your job started. Thus, for longer averaging periods, you'll need to wait longer before you run jobpar.
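
To get a feel for how long to wait, the sketch below assumes the reported averages behave like the exponentially damped load averages of a typical Unix kernel, starting from an idle machine. That is an assumption about the underlying data, not a description of how jobpar decides to print its warning, and both function names are hypothetical.

```python
import math

def expected_fraction(job_age_min, window_min):
    """Fraction of the steady-state load that the average shows after
    job_age_min minutes, for an averaging period of window_min minutes."""
    return 1.0 - math.exp(-job_age_min / window_min)

def minutes_until(fraction, window_min):
    """Minutes to wait until the average reaches the given fraction."""
    return -window_min * math.log(1.0 - fraction)

# A job that started 3 minutes ago, viewed through a 5-minute average.
seen = expected_fraction(3, 5)
# Waiting time before each averaging period shows 90% of the real load.
wait_5 = minutes_until(0.9, 5)
wait_15 = minutes_until(0.9, 15)
print(round(seen, 2), round(wait_5, 1), round(wait_15, 1))
```

Under this model, a 5-minute average shows less than half of the real load three minutes into the job, and a 15-minute average needs over half an hour to reach 90% of it.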

In other cases, the output from jobpar will contain a warning that you are sharing machines with other users and thus the load information may not be accurate. Unfortunately, there is no real way to separate 'your' CPU usage from 'their' CPU usage (at least not through SGE), so you should take the jobpar data with a grain of salt. If possible, submit a job with an even number of CPUs so that SGE is more likely to schedule you onto "whole" machines (i.e. 2 CPUs per machine). For large memory (single-CPU) jobs, use '-l mem_free=1800M' in your job submission script to request that 1800MB of memory be allocated. This should ensure that only your job is run on that machine.

Note that both 'mhz' and 'mem_free' can be requested together. The syntax would be '-l mhz=2800,mem_free=1800M'.
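
If submission scripts are generated by a program, the combined request can be assembled from name/value pairs. This is only a sketch of the comma-separated syntax described above; `resource_request` is a hypothetical helper and is not part of SGE or jobpar.

```python
def resource_request(**resources):
    """Build a single '-l' argument from resource=value pairs."""
    if not resources:
        raise ValueError("at least one resource is required")
    pairs = ",".join(f"{name}={value}" for name, value in resources.items())
    return "-l " + pairs

request = resource_request(mhz=2800, mem_free="1800M")
print(request)  # -l mhz=2800,mem_free=1800M
```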

The jobpar script assumes that your code is single-threaded, and it may or may not be able to detect if your code is actually multi-threaded. If you know your code has multiple threads per machine, then keep in mind that the load averages that jobpar uses may be averages over the whole machine or per CPU. For example, if you have 10 machines running 1 MPI task each, but two threads per task, jobpar will see 1 CPU per machine, but a CPU load of 2, which will confuse it. This could be because your job has 2 threads per machine, or it could be that another user is using the same machine -- jobpar can't really tell the difference.
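
The ambiguity can be illustrated with a simple check that flags machines whose load exceeds the number of CPUs allocated to the job. This is a sketch under assumptions: the function, the tolerance of 0.25, and the host names are all made up, and the check cannot say whether the extra load comes from your own threads or from another user.

```python
def flag_oversubscribed(hosts, tolerance=0.25):
    """hosts maps a host name to (allocated_cpus, load).
    Return the names whose load exceeds the allocation by more than tolerance."""
    flagged = []
    for name, (allocated_cpus, load) in sorted(hosts.items()):
        if load > allocated_cpus + tolerance:
            flagged.append(name)
    return flagged

# Hypothetical hosts: one MPI task per machine, two threads on node-a only.
hosts = {
    "node-a": (1, 2.02),  # load of 2 on 1 allocated CPU
    "node-b": (1, 0.97),
    "node-c": (2, 1.90),
}
suspect = flag_oversubscribed(hosts)
print(suspect)
```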


How to Identify Communication Problems:

A common problem with parallel programs is excessive communication among the processors. This usually shows up as poor scalability, e.g. running on 16 CPUs is only 8-times faster than running on 1 CPU, since the increased number of CPUs leads to increased communication costs. While jobpar does not directly look at communication throughput, it can help identify such problems. In our example above, if we allow the program to run for 15 minutes and then pull up a jobpar display:

jbp@head1 [ 91 ] % jobpar 25392
job 25392 is running on 8 hosts (16 cpus)
cbcb-n25 (2) nsoe-n2 (2) nsoe-n8 (2) cbcb-n17 (2m)
cbcb-n5 (2) cbcb-n14 (2) cbcb-n23 (2) cbcb-n1 (2)
using load type 'hl:load_avg':
min load: 1.150000
max load: 1.430000
avg load: 1.25375
number of machines per load-threshold:
load 0.00 to 0.25 : 0 :
load 0.25 to 0.50 : 0 :
load 0.50 to 0.75 : 0 :
load 0.75 to 1.00 : 0 :
load 1.00 to 1.25 : 4 : ####
load 1.25 to 1.50 : 4 : ####

we might attribute this kind of profile to poor communication performance. If we assume that this is at "steady-state", then each machine is getting only about 60% of its possible performance (an average load of about 1.25 on 2 CPUs). If we were to run the same program on a single CPU and saw full CPU utilization, then the communication added in going to 16 CPUs may be to blame for the poor performance. If we were to run the same code on 32 CPUs and saw even lower performance, then we would definitely want to look at the amount of communication the program is doing and look for any latency-hiding opportunities (i.e. places where we can send a message, then do other useful work while waiting for the reply message).
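
Scalability is usually quantified as speedup and parallel efficiency, which need wall-clock timings that jobpar does not provide. The sketch below uses hypothetical timings chosen to match the "16 CPUs, only 8-times faster" case mentioned above.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the 1-CPU run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, ncpus):
    """Speedup per CPU; 1.0 means perfect scaling."""
    return speedup(t_serial, t_parallel) / ncpus

# Hypothetical wall-clock times in seconds (not measured data).
t1 = 1600.0   # 1 CPU
t16 = 200.0   # 16 CPUs
s = speedup(t1, t16)
e = efficiency(t1, t16, ncpus=16)
print(s, e)
```

Repeating the measurement at 32 CPUs and watching whether the efficiency keeps dropping is a direct way to confirm that communication costs grow with the CPU count.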

How to Identify Load Balance Problems:

Another common problem in parallel programming is balancing the load across all of the processors. If the problem is highly dynamic and the amount of work done by each CPU is not constant (in the short term), then some CPUs will sit idle while others are crunching away on useful work. This will show up in jobpar as something like this:

jbp@head1 [ 91 ] % jobpar 1234
job 1234 is running on 8 hosts (16 cpus)
cbcb-n25 (2) nsoe-n2 (2) nsoe-n8 (2) cbcb-n17 (2m)
cbcb-n5 (2) cbcb-n14 (2) cbcb-n23 (2) cbcb-n1 (2)
using load type 'hl:load_avg':
min load: 0.340000
max load: 1.920000
avg load: 1.242351
number of machines per load-threshold:
load 0.00 to 0.25 : 0 :
load 0.25 to 0.50 : 1 : #
load 0.50 to 0.75 : 1 : #
load 0.75 to 1.00 : 0 :
load 1.00 to 1.25 : 0 :
load 1.25 to 1.50 : 0 :
load 1.50 to 1.75 : 2 : ##
load 1.75 to 2.00 : 4 : ####

Notice that the average load reported is roughly the same as before, but now the histogram output is quite different. We see that 2 machines have very low utilization, even down in the 0.25 to 0.50 range. However, we also have 4 machines which are almost at maximum capacity (2.00 load for 2 CPUs). In such a case, since at least some of the machines are near max load, it would seem that the computer code as a whole must be quite well written -- any single-CPU optimizations that were made are doing their job and keeping the CPU fully loaded. If the code were poorly written, then all of the machines should be in the lower ranges. Thus it is possible that the 2 low-performing machines just have nothing to do! Perhaps those machines got stuck with the "ends" of the main data set. There is often a trade-off between sending lots of small messages and fewer larger messages and this often impacts the decomposition scheme used to partition the data.

If the histogram output shows some CPUs at near-max performance, and some at very low levels of performance, then it is possible that the issue is not communication, but simply an imbalance in the workload.
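
The distinction between the two profiles can be written down as a rough rule of thumb. The sketch below is only a heuristic: the thresholds (0.9 and 0.5 of capacity) and the per-host loads are assumptions chosen to resemble the examples in this document, and jobpar itself makes no such classification.

```python
def classify_profile(loads, cpus_per_host, high=0.9, low=0.5):
    """Guess the likely problem from per-host loads (heuristic only)."""
    util = [load / cpus_per_host for load in loads]
    if min(util) >= high:
        return "good"
    if max(util) >= high and min(util) < low:
        # Some hosts are saturated while others are mostly idle.
        return "possible load imbalance"
    if max(util) < high:
        # No host gets close to capacity.
        return "possible communication bottleneck"
    return "mixed"

# Illustrative loads on 2-CPU hosts (assumed values).
uniform_low = [1.15, 1.20, 1.22, 1.24, 1.26, 1.30, 1.40, 1.43]
uneven = [0.34, 0.60, 1.55, 1.70, 1.80, 1.85, 1.90, 1.92]
balanced = [1.90, 1.95, 1.97, 1.98, 1.99, 2.00, 2.00, 2.00]
for loads in (uniform_low, uneven, balanced):
    print(classify_profile(loads, cpus_per_host=2))
```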

Good Performance:

An example of good parallel performance would look like this:

jbp@head1 [ 91 ] % jobpar 1234
job 1234 is running on 8 hosts (16 cpus)
cbcb-n25 (2) nsoe-n2 (2) nsoe-n8 (2) cbcb-n17 (2m)
cbcb-n5 (2) cbcb-n14 (2) cbcb-n23 (2) cbcb-n1 (2)
using load type 'hl:load_avg':
min load: 1.900000
max load: 2.000000
avg load: 1.99
number of machines per load-threshold:
load 0.00 to 0.25 : 0 :
load 0.25 to 0.50 : 0 :
load 0.50 to 0.75 : 0 :
load 0.75 to 1.00 : 0 :
load 1.00 to 1.25 : 0 :
load 1.25 to 1.50 : 0 :
load 1.50 to 1.75 : 0 :
load 1.75 to 2.00 : 8 : ########

Note that the min, max, and average loads are very high, and all the nodes seem to be balanced in terms of workload.


Conclusions:

Not to rain on anyone's parade, but if jobpar shows a job to have reasonably good performance, you may still want to tune the code with other serial and parallel performance tools, such as function profiling or tracing the communication patterns in the code. jobpar provides a snapshot of the performance; it does not actually analyze the code itself. What works extremely well on 16 CPUs may or may not work well on 32 CPUs, and may not work at all on 64 CPUs. You should periodically run jobpar on actual production runs of your code to see how your program is behaving on your datasets.

If jobpar shows your job to have somewhat poor parallel performance, then you should seriously consider tinkering with the code itself to reduce communication bottlenecks or improve load balancing. If jobpar says that it's bad, then it really is bad.

Note also that we are only measuring CPU load, and it is possible to trade higher CPU usage for less communication by duplicating computations on multiple nodes. In other words, rather than having node1 compute a data item and send it to nodes 2, 3, and 4, we can just re-program the code so that all 4 nodes compute the data item themselves. In general, this leads to higher CPU usage and may even help (or hide) load imbalances. However, this is extra computation that would not be done on a single-CPU program and thus while we may see 100% CPU utilization, we may not see good parallel scalability. If too much redundant computation is done, then the 16-CPU job may only run 10-times faster than the single-CPU code.
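
A simple model makes the effect of redundant computation concrete. Assume the serial work is divided evenly over the CPUs, each CPU also repeats a fixed amount of redundant work (expressed as a fraction of the serial runtime), and no CPU is ever idle. The model and both function names are illustrative assumptions, not measurements of any real code.

```python
def speedup_with_redundancy(ncpus, redundant_fraction):
    """Speedup when each CPU does 1/ncpus of the serial work plus
    redundant_fraction of the serial work, with no idle time."""
    return 1.0 / (1.0 / ncpus + redundant_fraction)

def redundancy_for_speedup(ncpus, observed_speedup):
    """Redundant work per CPU (as a fraction of the serial runtime)
    that would explain the observed speedup in this model."""
    return 1.0 / observed_speedup - 1.0 / ncpus

# The 16-CPU job that runs only 10 times faster than the single-CPU code.
r = redundancy_for_speedup(16, 10)
check = speedup_with_redundancy(16, r)
print(round(r, 4), round(check, 2))
```

In this model, redundant work equal to only 3.75% of the serial runtime per CPU is enough to reduce a 16-CPU speedup from 16 to 10, even though every CPU stays fully loaded.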