Chapter 8. MPI Performance Profiling

This chapter describes the perfcatch utility, which you can use to profile the performance of an MPI program, as well as other tools for profiling MPI applications.

Overview of perfcatch Utility

The perfcatch utility runs an MPI program with a wrapper profiling library that prints MPI call profiling information to a summary file upon MPI program completion. By default, this MPI profiling results file is called MPI_PROFILING_STATS (see "MPI_PROFILING_STATS Results File Example"). It is created in the current working directory of the MPI process with rank 0.

Using the perfcatch Utility

The syntax of the perfcatch utility is as follows:

perfcatch [-v | -vofed | -i] cmd args

The perfcatch utility accepts the following options:

No option 

Supports MPT

-v 

Supports Voltaire MPI

-vofed 

Supports Voltaire OFED MPI

-i 

Supports Intel MPI

To use perfcatch with an SGI Message Passing Toolkit MPI program, insert the perfcatch command in front of the executable name. Here are some examples:

mpirun -np 64 perfcatch a.out arg1

and

mpirun host1 32, host2 64 perfcatch a.out arg1

To use perfcatch with Intel MPI, add the -i option, as in the following example:

mpiexec -np 64 perfcatch -i a.out arg1

For more information, see the perfcatch(1) man page.

MPI_PROFILING_STATS Results File Example

The MPI profiling results file has a summary statistics section followed by a rank-by-rank profiling information section. The summary statistics section reports overall statistics, including the percentage of time each rank spent in MPI functions and the MPI processes that spent the least and the most time in MPI functions. Similar reports are made about system time usage.

The rank-by-rank profiling information section lists every profiled MPI function called by a particular MPI process. The number of calls and the total time consumed by these calls are reported. Some functions report additional information, such as average data counts and communication peer lists.

The following is an example MPI_PROFILING_STATS results file:

============================================================
PERFCATCHER version 22 
(C) Copyright SGI.  This library may only be used
on SGI hardware platforms. See LICENSE file for
details.
============================================================
MPI program profiling information
Job profile recorded Wed Jan 17 13:05:24 2007
Program command line:                                /home/estes01/michel/sastest/mpi_hello_linux 
Total MPI processes                                  2

Total MPI job time, avg per rank                    0.0054768 sec
Profiled job time, avg per rank                     0.0054768 sec
Percent job time profiled, avg per rank             100%

Total user time, avg per rank                       0.001 sec
Percent user time, avg per rank                     18.2588%
Total system time, avg per rank                     0.0045 sec
Percent system time, avg per rank                   82.1648%

Time in all profiled MPI routines, avg per rank     5.75004e-07 sec
Percent time in profiled MPI routines, avg per rank 0.0104989%

Rank-by-Rank Summary Statistics
-------------------------------

Rank-by-Rank: Percent in Profiled MPI routines
        Rank:Percent
        0:0.0112245%    1:0.00968502%
  Least:  Rank 1      0.00968502%
  Most:   Rank 0      0.0112245%
  Load Imbalance:  0.000771%

Rank-by-Rank: User Time
        Rank:Percent
        0:17.2683%      1:19.3699%
  Least:  Rank 0      17.2683%
  Most:   Rank 1      19.3699%

Rank-by-Rank: System Time
        Rank:Percent
        0:86.3416%      1:77.4796%
  Least:  Rank 1      77.4796%
  Most:   Rank 0      86.3416%

Notes
-----

Wtime resolution is                   5e-08 sec

Rank-by-Rank MPI Profiling Results
----------------------------------

Activity on process rank 0

           Single-copy checking was not enabled.
comm_rank            calls:      1  time: 6.50005e-07 s  6.50005e-07 s/call

Activity on process rank 1

           Single-copy checking was not enabled.
comm_rank            calls:      1  time: 5.00004e-07 s  5.00004e-07 s/call

------------------------------------------------

recv profile

             cnt/sec for all remote ranks
local   ANY_SOURCE        0            1    
 rank

------------------------------------------------

recv wait for data profile

             cnt/sec for all remote ranks
local        0            1    
 rank
------------------------------------------------

recv wait for data profile

             cnt/sec for all remote ranks
local        0            1    
 rank

------------------------------------------------

send profile

             cnt/sec for all destination ranks
  src        0            1    
 rank

------------------------------------------------

ssend profile

             cnt/sec for all destination ranks
  src        0            1    
 rank

------------------------------------------------

ibsend profile

             cnt/sec for all destination ranks
  src        0            1    
 rank

MPI Performance Profiling Environment Variables

The MPI performance profiling environment variables are as follows:

MPI_PROFILE_AT_INIT

Activates MPI profiling immediately, that is, at the start of MPI program execution.

MPI_PROFILING_STATS_FILE

Specifies the file where MPI profiling results are written. If not specified, the file MPI_PROFILING_STATS is written.
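
For example, under a csh-style shell, you might set these variables before launching a job under perfcatch (the value 1 and the file name below are illustrative):

setenv MPI_PROFILE_AT_INIT 1
setenv MPI_PROFILING_STATS_FILE my_profile.results
mpirun -np 64 perfcatch a.out arg1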

MPI Supported Profiled Functions

The supported profiled MPI functions are as follows:


Note: Some functions may not be implemented in all languages, as indicated below.


Function                    Languages
------------------------    ----------
mpi_allgather               C, Fortran
mpi_allgatherv              C, Fortran
mpi_allreduce               C, Fortran
mpi_alltoall                C, Fortran
mpi_alltoallv               C, Fortran
mpi_alltoallw               C, Fortran
mpi_barrier                 C, Fortran
mpi_bcast                   C, Fortran
mpi_comm_create             C, Fortran
mpi_comm_free               C, Fortran
mpi_comm_group              C, Fortran
mpi_comm_rank               C, Fortran
mpi_finalize                C, Fortran
mpi_gather                  C, Fortran
mpi_gatherv                 C, Fortran
mpi_get_count               C
mpi_group_difference        C, Fortran
mpi_group_excl              C, Fortran
mpi_group_free              C, Fortran
mpi_group_incl              C, Fortran
mpi_group_intersection      C, Fortran
mpi_group_range_excl        C, Fortran
mpi_group_range_incl        C, Fortran
mpi_group_union             C, Fortran
mpi_ibsend                  C
mpi_init                    C, Fortran
mpi_init_thread             C
mpi_irecv                   C, Fortran
mpi_isend                   C, Fortran
mpi_probe                   C
mpi_recv                    C, Fortran
mpi_reduce                  C, Fortran
mpi_scatter                 C, Fortran
mpi_scatterv                C, Fortran
mpi_send                    C, Fortran
mpi_sendrecv                C, Fortran
mpi_ssend                   C, Fortran
mpi_test                    C, Fortran
mpi_testany                 C, Fortran
mpi_wait                    C, Fortran
mpi_waitall                 C, Fortran

Profiling MPI Applications

This section describes the use of profiling tools to obtain performance information. Compared with the performance analysis of sequential applications, characterizing the performance of parallel applications can be challenging. Often it is most effective to first focus on improving the performance of MPI applications at the single-process level.

It may also be important to understand the message traffic generated by an application. A number of tools can be used to analyze this aspect of a message-passing application's performance, including Performance Co-Pilot and various third-party products. This section explains how to use these tools with MPI applications.

Profiling Interface

You can write your own profiling library by using the MPI-1 standard PMPI_* calls. In addition, either within your own profiling library or within the application itself, you can use the MPI_Wtime function to time specific calls or sections of your code.
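
For example, the following minimal sketch of a wrapper library (the file name, counter names, and output format are illustrative) intercepts MPI_Waitall and MPI_Finalize, forwards each call to the name-shifted PMPI_ entry points, and uses MPI_Wtime to accumulate a per-rank call count and total time. Compile it and link it into the application ahead of the MPI library so that the wrapper versions are called in place of the standard ones.

/* waitall_profile.c: sketch of a PMPI profiling wrapper */
#include <mpi.h>
#include <stdio.h>

static long   waitall_calls = 0;   /* MPI_Waitall calls seen by this rank     */
static double waitall_time  = 0.0; /* wall-clock seconds spent in MPI_Waitall */

int MPI_Waitall(int count, MPI_Request requests[], MPI_Status statuses[])
{
    double t0 = MPI_Wtime();                          /* start timestamp     */
    int rc = PMPI_Waitall(count, requests, statuses); /* do the real work    */
    waitall_time += MPI_Wtime() - t0;                 /* accumulate the time */
    waitall_calls++;
    return rc;
}

int MPI_Finalize(void)    /* report the per-rank totals at shutdown */
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: waitall calls %ld  time %e sec\n",
           rank, waitall_calls, waitall_time);
    return PMPI_Finalize();
}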

The following example is actual output for a single rank of a program that was run on 128 processors, using a user-created profiling library that performs call counts and timings of common MPI calls. Notice that for this rank, most of the MPI time is spent in MPI_Waitall and MPI_Allreduce.

Total job time 2.203333e+02 sec
Total MPI processes 128
Wtime resolution is 8.000000e-07 sec

activity on process rank 0
comm_rank calls 1      time 8.800002e-06
get_count calls 0      time 0.000000e+00
ibsend calls    0      time 0.000000e+00
probe calls     0      time 0.000000e+00
recv calls      0      time 0.00000e+00   avg datacnt 0   waits 0       wait time 0.00000e+00
irecv calls     22039  time 9.76185e-01   datacnt 23474032 avg datacnt 1065
send calls      0      time 0.000000e+00
ssend calls     0      time 0.000000e+00
isend calls     22039  time 2.950286e+00
wait calls      0      time 0.00000e+00   avg datacnt 0
waitall calls   11045  time 7.73805e+01   # of Reqs 44078  avg data  cnt 137944
barrier calls   680    time 5.133110e+00   
alltoall calls  0      time 0.0e+00       avg datacnt 0
alltoallv calls 0      time 0.000000e+00
reduce calls    0      time 0.000000e+00
allreduce calls 4658   time 2.072872e+01
bcast calls     680    time 6.915840e-02
gather calls    0      time 0.000000e+00
gatherv calls   0      time 0.000000e+00
scatter calls   0      time 0.000000e+00
scatterv calls  0      time 0.000000e+00  

activity on process rank 1 
...

MPI Internal Statistics

MPI keeps track of certain resource utilization statistics. These can be used to determine potential performance problems caused by lack of MPI message buffers and other MPI internal resources.

To turn on the display of MPI internal statistics, use the MPI_STATS environment variable or the -stats option on the mpirun command. MPI internal statistics are always gathered, so displaying them does not cause significant additional overhead. In addition, you can sample the MPI statistics counters from within an application, allowing for finer-grained measurements. If the MPI_STATS_FILE variable is set, the internal statistics are written to the file specified by this variable when the program completes. For information about these MPI extensions, see the mpi_stats man page.
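
For example, under a csh-style shell, you might enable the statistics display and direct the output to a file before launching the job (the value 1 and the file name below are illustrative):

setenv MPI_STATS 1
setenv MPI_STATS_FILE my_stats.out
mpirun -np 64 a.out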

These statistics can be very useful in optimizing codes in the following ways:

  • To determine whether there are enough internal buffers and whether processes are waiting (retries) to acquire them

  • To determine whether single-copy optimization is being used for point-to-point or collective calls

For additional information on how to use the MPI statistics counters to help tune the run-time environment for an MPI application, see Chapter 7, “Run-time Tuning”.

Third Party Products

Two third-party tools that you can use with the SGI MPI implementation are Vampir from Pallas (www.pallas.com) and Jumpshot, which is part of the MPICH distribution. Both of these tools are effective for smaller, short-duration MPI jobs. However, the trace files these tools generate can be enormous for longer-running or highly parallel jobs. Such large trace files slow the program down, and, even more problematic, the tools that analyze the data are often overwhelmed by the amount of data.