Using Allinea Performance Reports on NREL HPC systems
Allinea Performance Reports is an easy-to-use, low-overhead tool for analyzing application performance. It reports the time spent in computation, I/O, and MPI communication, the amount of memory used, and why your program spends so much time in a particular type of processing. The output is a simple one-page text or HTML file.
To use this tool on Peregrine:
- load the allinea module
- add perf-report to the command that executes your application:
$ perf-report ./hello
$ perf-report mpirun -np 4 ./hello_MPI
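Putting the two steps together, a minimal interactive session might look like the sketch below. The source file name, compiler wrapper, and process count are placeholders; use whatever builds and launches your own application.
$ module load allinea
$ mpicc -o hello_MPI hello_MPI.c        # build the example MPI program (placeholder source file)
$ perf-report mpirun -np 4 ./hello_MPI  # run it under Performance Reports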
To view the results, look at the .txt file in any text editor or open the .html file in your favorite browser. The file name contains the name of the program you executed, the number of processes it ran on, the date, and a number.
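For example, the four-process run shown below might produce files named something like slow_f_4p_2014-08-05_10-51.txt and slow_f_4p_2014-08-05_10-51.html; the exact naming pattern can vary between tool versions.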
An example text output file is provided below:
Executable: slow_f
Resources: 4 processes, 1 node
Machine: n0296
Started on: Tue Aug 5 10:51:50 2014
Total time: 11 seconds (0 minutes)
Full path: /home/icarpent/perf_reports
Notes:
Summary: slow_f is CPU-bound in this configuration
CPU: 55.9% |=====|
MPI: 44.1% |===|
I/O: 0.0% |
This application run was CPU-bound. A breakdown of this time and advice for investigating
further is found in the CPU section below.
CPU:
A breakdown of how the 55.9% total CPU time was spent:
Scalar numeric ops: 37.2% |===|
Vector numeric ops: 37.0% |===|
Memory accesses: 25.8% |==|
Other: 0.0% |
The per-core performance is arithmetic-bound. Try to increase the amount of time spent
in vectorized instructions by analyzing the compiler's vectorization reports.
MPI:
A breakdown of how the 44.1% total MPI time was spent:
Time in collective calls: 56.5% |=====|
Time in point-to-point calls: 43.5% |===|
Effective collective rate: 2.55e+05 bytes/s
Effective point-to-point rate: 6.14e+07 bytes/s
Most of the time is spent in collective calls with a very low transfer rate. This suggests
load imbalance is causing synchronization overhead; use an MPI profiler to investigate further.
The point-to-point transfer rate is low. This can be caused by inefficient message sizes,
such as many small messages, or by imbalanced workloads causing processes to wait.
I/O:
A breakdown of how the 0.0% total I/O time was spent:
Time in reads: 0.0% |
Time in writes: 0.0% |
Effective read rate: 0.00e+00 bytes/s
Effective write rate: 0.00e+00 bytes/s
No time is spent in I/O operations. There's nothing to optimize here!
Memory:
Per-process memory usage may also affect scaling:
Mean process memory usage: 5.74e+07 bytes
Peak process memory usage: 1.11e+08 bytes
Peak node memory usage: 5.3% ||
There is significant variation between peak and mean memory usage. This may be a sign of
workload imbalance or a memory leak.
The peak node memory usage is very low. You may be able to reduce the amount of allocation
time used by running with fewer MPI processes and more data on each process.
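When a report flags heavy scalar numeric work, as the CPU section above does, one way to follow up is to generate a compiler vectorization report and see which loops failed to vectorize. The commands below are only a sketch for the Intel and GNU compilers; the exact options, and the source file name (slow_demo.f90 is a placeholder), depend on your compiler version and code.
$ # Intel Fortran: report which loops were vectorized (placeholder source file)
$ ifort -O3 -vec-report2 -c slow_demo.f90
$ # GNU Fortran 4.9 or later: report successfully vectorized loops
$ gfortran -O3 -fopt-info-vec-optimized -c slow_demo.f90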
CAUTION: This tool does not support OpenMPI 1.7.*. If your application uses OpenMPI, please use version 1.6.* or 1.8.* instead.
Allinea Performance Reports documentation (Internal NREL access only)