Intel® Cluster Toolkit Compiler Edition 4.0 for Microsoft* Windows* Compute Cluster Server OS
Getting Started Guide
Copyright © 2007–2010 Intel Corporation
All Rights Reserved
Document Number: 318537-006
Revision: 20100602
World Wide Web: http://www.intel.com
Disclaimer and Legal Information
2. Introduction
3. Intel Software Downloads, Installation, and Uninstalling on Microsoft Windows CCS* OS
4. Getting Started with Intel® MPI Library
4.2 How to Set Up MPD Daemons on Microsoft Windows CCS* OS
4.3 Compiling and Linking with Intel® MPI Library on Microsoft Windows CCS* OS
4.4 Selecting a Network Fabric
4.5 Running an MPI Program Using Intel® MPI Library on Microsoft Windows CCS* OS
4.6 Experimenting with Intel® MPI Library on Microsoft Windows CCS* OS
4.7 Controlling MPI Process Placement on Microsoft Windows CCS* OS
4.8 Using the Automatic Tuning Utility Called mpitune
4.8.2 MPI Application-Specific Tuning
6. Instrumenting MPI Applications with Intel® Trace Analyzer and Collector
6.1 Instrumenting the Intel® MPI Library Test Examples
6.2 Instrumenting the Intel® MPI Library Test Examples in a Fail-Safe Mode
6.3 Using itcpin to Instrument an Application
6.4 Working with the Intel® Trace Analyzer and Collector Examples
6.5 Experimenting with the Message Checking Component of Intel® Trace Collector
6.6 Saving a Working Environment through a Project File
6.7 Analysis of Application Imbalance
6.8 Analysis with the Ideal Interconnect Simulator
6.9 Building a Simulator with the Custom Plug-in Framework
7. Getting Started in Using the Intel® Math Kernel Library (Intel® MKL)
7.1 Experimenting with ScaLAPACK*
7.2 Experimenting with the Cluster DFT Software
7.3 Experimenting with the High Performance Linpack Benchmark*
8. Using the Intel® MPI Benchmarks
8.1 Building Microsoft Visual Studio* x64 Executables for the Intel® MPI Benchmarks
8.2 Building Microsoft Visual Studio* IA-32 Executables for the Intel® MPI Benchmarks
9. Using the Compiler Switch /Qtcollect
10. Using Cluster OpenMP*
Revision History
Document Number: 318537-006
Revision Number: 20100602
Description: Updated Intel® Cluster Toolkit Compiler Edition 4.0 for Microsoft* Windows* Compute Cluster Server OS Getting Started Guide to reflect changes and improvements to the software components.
Revision Date: 06/02/2010
Disclaimer and Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT
DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL
PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time,
without notice. Designers must not rely on the absence or characteristics of
any features or instructions marked "reserved" or
"undefined." Intel reserves these for future definition and shall
have no responsibility whatsoever for conflicts or incompatibilities arising
from future changes to them. The information here is subject to change without
notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors
known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest
specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this
document, or other Intel literature, may be obtained by calling 1-800-548-4725,
or by visiting Intel's Web Site.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.
MPEG is an international standard for video compression/decompression promoted by ISO. Implementations of MPEG CODECs, or MPEG enabled platforms may require licenses from various entities, including Intel Corporation.
The software described in this document may contain software defects which may cause the product to deviate from published specifications. Current characterized software defects are available on request.
This document as well as the software described in it is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document.
Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.
Developers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in developer’s software code when running on an Intel processor. Intel reserves these features or instructions for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized use.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile, i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.
* Other names and brands may be claimed as the property of others.
Copyright © 2007-2010, Intel Corporation. All rights reserved.
In terms of the Intel® Cluster Toolkit Compiler Edition software for Windows* OS, consider references within this document to Microsoft Windows Compute Cluster Server* (CCS*) OS and Microsoft* Windows* HPC Server 2008 OS as interchangeable. The Intel® Cluster Toolkit Compiler Edition 4.0 on Microsoft Windows Compute Cluster Server (Microsoft Windows CCS*) consists of:
The software architecture of the Intel Cluster Toolkit Compiler Edition for Microsoft Windows CCS is illustrated in Figure 2.1:
Figure 2.1 – The software architecture for the Intel Cluster Toolkit Compiler Edition for Microsoft Windows CCS
The following acronyms and definitions may be referenced within this document.
ABI: Application Binary Interface – describes the low-level interface between an application program and the operating system, between an application and its libraries, or between component parts of an application.

BLACS: Basic Linear Algebra Communication Subprograms – provides a linear algebra oriented message passing interface for distributed memory computing platforms.

BLAS: Basic Linear Algebra Subroutines

DAPL*: Direct Access Programming Library – an Application Program Interface (API) for Remote Direct Memory Access (RDMA).

DFT: Discrete Fourier Transform

Ethernet: The predominant local area networking technology. It transports data over a variety of electrical or optical media, carrying any of several upper-layer protocols via data packet transmissions.

GB: Gigabyte

ICT: Intel® Cluster Toolkit

ICTCE: Intel® Cluster Toolkit Compiler Edition

IMB: Intel® MPI Benchmarks

IP: Internet Protocol

ITA or ita: Intel® Trace Analyzer

ITAC or itac: Intel® Trace Analyzer and Collector

ITC or itc: Intel® Trace Collector

MPD: Multi-purpose daemon protocol – a daemon that runs on each node of a cluster. These MPDs configure the nodes of the cluster into a "virtual machine" that is capable of running MPI programs.

MPI: Message Passing Interface – an industry-standard message-passing protocol that typically uses a two-sided send-receive model to transfer messages between processes.

NFS: Network File System – a client/server application that lets a user view, and optionally store and update, files on a remote computer as though they were on the user's own computer. The user's system needs an NFS client and the remote computer needs an NFS server. Both also require TCP/IP, since the NFS client and server use TCP/IP to send the files and updates back and forth.

PVM*: Parallel Virtual Machine

RAM: Random Access Memory

RDMA: Remote Direct Memory Access – a capability that allows processes executing on one node of a cluster to "directly" access (execute reads or writes against) the memory of processes within the same user job executing on a different node of the cluster.

RDSSM: TCP + shared memory + DAPL* (for SMP clusters connected via RDMA-capable fabrics)

RPM*: Red Hat Package Manager* – a system that eases installation, verification, upgrading, and uninstalling of Linux* packages.

ScaLAPACK: Scalable Linear Algebra Package (Scalable LAPACK).

shm: Shared memory only (no sockets)

SMP: Symmetric Multi-processor

ssm: TCP + shared memory (for SMP clusters connected via Ethernet)

STF: Structured Trace Format – a trace file format used by the Intel Trace Collector for efficiently recording data; the Intel Trace Analyzer reads this format for performance analysis.

TCP: Transmission Control Protocol – a session-oriented streaming transport protocol that provides sequencing, error detection and correction, flow control, congestion control, and multiplexing.

VML: Vector Math Library

VSL: Vector Statistical Library
To begin installation on Microsoft Windows CCS* OS, follow the instructions provided in the Intel® Cluster Toolkit Compiler Edition for Microsoft* Windows* Compute Cluster Server OS Installation Guide.
This chapter provides basic information about getting started with the Intel MPI Library. For complete documentation, refer to the Intel MPI Library Getting Started Guide located in <directory-path-to-Intel-MPI-Library>\doc\Getting_Started.pdf and the Intel MPI Library Reference Manual located in <directory-path-to-Intel-MPI-Library>\doc\Reference_Manual.pdf on the system where the Intel MPI Library is installed.
The software architecture for the Intel MPI Library is illustrated in Figure 4.1. With the Intel MPI Library on Linux*-based systems, you can choose the best interconnection fabric for running an application on a cluster that is based on IA-32 or Intel® 64 architecture. This is done at runtime by setting the I_MPI_FABRICS environment variable (see Section 4.4). If interconnect selection fails, the sockets interface is automatically selected as a backup (Figure 4.1), so execution does not fail; this is especially valuable for batch computing.
Similarly, using the Intel MPI Library on Microsoft Windows CCS, you can choose the best interconnection fabric for running an application on a cluster that is based on Intel® 64 architecture.
The Intel MPI Library uses a Multi-Purpose Daemon (MPD) job startup mechanism. In order to run programs compiled with mpicl (or related) commands, you must first set up MPD daemons. It is strongly recommended that you start and maintain your own set of MPD daemons, as opposed to having the system administrator start up the MPD daemons once for use by all users on the system. This setup enhances system security and gives you greater flexibility in controlling your execution environment.
The command for launching multi-purpose daemons on Microsoft Windows OS is called “smpd”, which is an acronym for simple multi-purpose daemons. When Intel MPI Library is installed on a cluster, the smpd service is automatically started. On the master node of your Windows cluster, you can type the command:
clusrun smpd -status | more
as demonstrated in Figure 4.2.
Figure 4.2 – DOS command line syntax for issuing the smpd -status query
For a four-node cluster, you might see a response that looks something like:
---------------------clusternode01 returns 0-----------------------
smpd running on clusternode01
---------------------clusternode02 returns 0-----------------------
smpd running on clusternode02
---------------------clusternode03 returns 0-----------------------
smpd running on clusternode03
---------------------clusternode04 returns 0-----------------------
smpd running on clusternode04
For this example, the four nodes of the cluster are called clusternode01, clusternode02, clusternode03, and clusternode04.
From the master node you can stop all of the smpd daemons by typing the command:
clusrun smpd -uninstall
To restart the daemons from the master node, simply type:
clusrun smpd -install
or
clusrun smpd -regserver
To verify that the smpd daemons are running properly, simply repeat the command:
clusrun smpd -status | more
To shut down the smpd daemons on all of the nodes of the cluster, you can type:
clusrun smpd -remove
or
clusrun smpd -unregserver
or
clusrun smpd -uninstall
In general, to see the various options for the smpd command, simply type:
smpd -help
This section describes the basic steps required to compile and link an MPI program, when using only the Intel MPI Library Development Kit. To compile and link an MPI program with the Intel MPI Library:
…\bin\iclvars.bat
and
…\bin\ifortvars.bat
If you have added the architecture argument ia32 to the iclvars.bat and ifortvars.bat invocation as illustrated with the following examples:
…\bin\iclvars.bat ia32
and
…\bin\ifortvars.bat ia32
then you will also need to use the same type of argument for ictvars.bat as follows (See Figure 4.3):
…\ictvars.bat ia32
Figure 4.3 – Setting the Architecture to IA-32 on Intel® 64 Systems
To revert to the Intel® 64 architecture, use the following command:
…\ictvars.bat
in your DOS session.
mpiicc <directory-path-to-Intel-MPI-Library>\test\test.c
Other supported compilers have an equivalent command that uses the prefix mpi on the standard compiler command. For example, the Intel MPI Library command for the Intel® Fortran Compiler (ifort) is mpiifort.
Supplier of Core Compiler | MPI Compilation Command | Core Compiler Compilation Command | Compiler Programming Language | Supported Application Binary Interface (ABI)
Microsoft Visual C++* Compiler or Intel C++ Compiler 11.1 | mpicc | cl.exe | C/C++ | 32/64 bit
Microsoft Visual C++* Compiler or Intel C++ Compiler 11.1 | mpicl | cl.exe | C/C++ | 32/64 bit
Microsoft Visual C++* Compiler or Intel C++ Compiler 11.1 | mpiicc | icl.exe | C/C++ | 32/64 bit
Intel Fortran Compiler 11.1 | mpif77 | ifort.exe | Fortran 77 and Fortran 95 | 32/64 bit
Intel Fortran Compiler 11.1 | mpif90 | ifort.exe | Fortran 95 | 32/64 bit
Intel Fortran Compiler 11.1 | mpifc | ifort.exe | Fortran 95 | 32/64 bit
Intel Fortran Compiler 11.1 | mpiifort | ifort.exe | Fortran 77 and Fortran 95 | 32/64 bit
Remarks
The Compiling and Linking section of <directory-path-to-Intel-MPI-Library>\doc\Getting_Started.pdf or the Compiler Commands section of <directory-path-to-Intel-MPI-Library>\doc\Reference_Manual.pdf on the system where Intel MPI Library is installed include additional details on mpiicc and other compiler commands, including commands for other compilers and languages.
You can also use the Intel® C++ Compiler, the Microsoft Visual C++ Compiler, or the Intel Fortran Compiler directly. For example, on the master node of the Microsoft Windows CCS cluster, go to a shared directory where the Intel® MPI Library test-cases reside. For the test-case test.c, one can build an MPI executable using the following command-line involving the Intel C++ Compiler:
icl /Fetestc /I"%I_MPI_ROOT%\em64t\include" test.c "%I_MPI_ROOT%\em64t\lib\impi.lib"
The executable will be called testc.exe. This is a result of using the command-line option /Fe. The /I option references the path to the MPI include files. The library path reference is for the MPI library.
mpiexec -machinefile machines.WINDOWS -n 4 testc.exe
The -machinefile option takes a file name, here machines.WINDOWS, of a file containing a list of node names for the cluster. The results might look something like:
Hello world: rank 0 of 4 running on clusternode1
Hello world: rank 1 of 4 running on clusternode2
Hello world: rank 2 of 4 running on clusternode3
Hello world: rank 3 of 4 running on clusternode4
If you have a version of the Microsoft Visual C++ Compiler that was not packaged with Microsoft* Visual Studio* 2008, type the following command-line:
cl /Fetestc /I"%I_MPI_ROOT%\em64t\include" test.c "%I_MPI_ROOT%\em64t\lib\impi.lib" bufferoverflowU.lib
If you have a version of the Microsoft Visual C++ Compiler that was packaged with Microsoft* Visual Studio* 2008, type the following command-line:
cl /Fetestc /I"%I_MPI_ROOT%\em64t\include" test.c "%I_MPI_ROOT%\em64t\lib\impi.lib"
Intel MPI Library supports multiple, dynamically selectable network fabric device drivers to support different communication channels between MPI processes. The default communication method uses a built-in TCP (Ethernet, or sockets) device driver. Prior to Intel® MPI Library 4.0, alternative devices were selected via the I_MPI_DEVICE environment variable. With Intel® MPI Library 4.0 and its successors, use the I_MPI_FABRICS environment variable instead; I_MPI_DEVICE is considered deprecated syntax. The following network fabric types for I_MPI_FABRICS are supported by Intel MPI Library 4.0 and its successors:
Possible I_MPI_FABRICS Value | Interconnection Device Fabric Meaning
shm | Shared memory
dapl | DAPL-capable network fabrics, such as InfiniBand*, iWarp*, Dolphin*, and XPMEM* (through DAPL*)
tcp | TCP/IP-capable network fabrics, such as Ethernet and InfiniBand* (through IPoIB*)
The environment variable I_MPI_FABRICS has the following syntax:
I_MPI_FABRICS=<fabric> | <intra-node fabric>:<inter-node fabric>
where the <fabric> meta-symbol can have the values shm, dapl, or tcp; the <intra-node fabric> meta-symbol can have the values shm, dapl, or tcp; and the <inter-node fabric> meta-symbol can have the values dapl or tcp.
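The selection rules above can be sketched as a small validator. This is an illustrative Python sketch, not part of the Intel MPI Library; it only encodes the value combinations described in this section:

```python
# Validate an I_MPI_FABRICS value of the form <fabric> or
# <intra-node>:<inter-node>, per the rules stated above:
# single fabric and intra-node: shm, dapl, or tcp; inter-node: dapl or tcp.
INTRA = {"shm", "dapl", "tcp"}
INTER = {"dapl", "tcp"}

def valid_fabrics(value):
    parts = value.split(":")
    if len(parts) == 1:
        return parts[0] in INTRA
    if len(parts) == 2:
        return parts[0] in INTRA and parts[1] in INTER
    return False

print(valid_fabrics("shm:dapl"))  # True
print(valid_fabrics("shm:shm"))   # False: shm is not a valid inter-node fabric
```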
The next section will provide some examples for using the I_MPI_FABRICS environment variable within the mpiexec command-line.
Use the mpiexec command to launch programs linked with the Intel MPI Library. For example:
mpiexec -n <# of processes> .\myprog.exe
When launching the mpiexec command, you may be prompted for an account name and a password which might look something like the following:
User credentials needed to launch processes:
account (domain\user) [clusternode1\user001]:
password:
In the DOS panel, simply press the Enter key for the user name if you do not want to change it (in this example, user001), and then enter the password for the associated account.
The only required option for the mpiexec command is the -n option to set the number of processes. However, you will probably also want to use the -wdir (working directory) and -machinefile command-line options, which have the following syntax:
-wdir <working directory>
-machinefile <filename>
You may find these command-line options useful if, for example, the nodes of the cluster use a file share.
If your MPI application is using a network fabric other than the default fabric, use the -env option to specify a value to be assigned to the I_MPI_FABRICS variable. For example, to run an MPI program using shared memory for intra-node communication and sockets for inter-node communication, use the following command:
mpiexec -n <# of processes> -env I_MPI_FABRICS shm:tcp .\myprog.exe
As an example of running an MPI application on a cluster system with a combined shared-memory and DAPL-enabled network fabric, the following mpiexec command-line might be used:
mpiexec -n <# of processes> -env I_MPI_FABRICS shm:dapl .\myprog.exe
See the section titled Selecting a Network Fabric in <directory-path-to-Intel-MPI-Library>\doc\Getting_Started.pdf, or the section titled Fabrics Control in <directory-path-to-Intel-MPI-Library>\doc\Reference_Manual.pdf.
To generate a machines.Windows text file to be used with the -machinefile command-line option, the following methodology might be useful (Figure 4.4).
Figure 4.4 – Compute Cluster Administrator display panel within Microsoft Windows HPC Server 2008*
If you select the ComputeNodes link in the left sub-panel (Figure 4.4), hold down the Shift key, and click on each node listed in the center panel of Figure 4.4, you will get a highlighted list as shown in Figure 4.5. Press Ctrl-C to copy the node names, paste them into a text file with Ctrl-V, and remove the extraneous text so that exactly one node name appears per line.
Figure 4.5 – Highlighting the selected nodes and using Ctrl-C to copy the node names in anticipation of creating the machines.Windows file for the mpiexec command-line option -machinefile
This file can be saved into a shared file space area that is accessible by all of the nodes of the cluster. An example might be:
z:\cluster_file_share\machines.Windows
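The paste-and-clean step above can also be scripted. The sketch below is hypothetical: it assumes the pasted text has the node name in the first column followed by extra status columns (the "Online"/"Idle" columns shown here are assumptions, not the guaranteed panel layout), and keeps only the first field of each non-blank line:

```python
# Hypothetical cleanup for pasted node-list text: keep only the first
# whitespace-delimited field (the node name) of each non-blank line,
# producing content suitable for saving as machines.Windows.
def to_machinefile(pasted_text):
    names = []
    for line in pasted_text.splitlines():
        fields = line.split()
        if fields:                  # skip blank lines
            names.append(fields[0])  # keep only the node-name column
    return "\n".join(names) + "\n"

pasted = "clusternode01   Online   Idle\n\nclusternode02   Online   Idle\n"
print(to_machinefile(pasted), end="")
# clusternode01
# clusternode02
```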
If the -machinefile command-line option is used with the mpiexec command, the machines.Windows file might be referenced in the following manner:
mpiexec -n 12 -machinefile z:\cluster_file_share\machines.Windows …
For the experiments that follow, it is assumed that the computing cluster has at least 2 nodes and that each node is a 2-way symmetric multiprocessor (SMP).
Recall from Section 4.2 that the command for launching multi-purpose daemons on Microsoft Windows is called "smpd", an acronym for simple multi-purpose daemon, and that when the Intel MPI Library is installed on a cluster, the smpd service is started automatically. As mentioned in Section 4.2, you can type the command:
clusrun smpd -status | more
to verify that smpd daemons are running on the two nodes of the cluster. The response from issuing this command should be something like:
---------------------clusternode01 returns 0-----------------------
smpd running on clusternode01
---------------------clusternode02 returns 0-----------------------
smpd running on clusternode02
assuming that the two nodes of the cluster are called clusternode01 and clusternode02. The actual response will be a function of your cluster configuration.
In the <directory-path-to-Intel-MPI-Library>\test folder where Intel MPI Library resides, there are source files for four MPI test cases. In your local user area, you should create a test directory called:
test_intel_mpi
From the installation directory of Intel MPI Library, copy the test files from <directory-path-to-Intel-MPI-Library>\test to the directory above. The contents of test_intel_mpi should now be:
test.c test.cpp test.f test.f90
Compile the C and C++ test applications into executables using the following Intel C++ Compiler commands:
icl /Fetestc /I"%I_MPI_ROOT%\em64t\include" test.c "%I_MPI_ROOT%\em64t\lib\impi.lib"
icl /Fetestcpp /I"%I_MPI_ROOT%\em64t\include" test.cpp /link /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib impicxx.lib
If you have a version of the Microsoft Visual C++ Compiler that was not packaged with Microsoft* Visual Studio* 2008, type the following respective command-lines for the C and C++ test applications:
cl /Fetestc_vc /I"%I_MPI_ROOT%\em64t\include" test.c "%I_MPI_ROOT%\em64t\lib\impi.lib" bufferoverflowU.lib
and
cl /Fetestcpp_vc /I"%I_MPI_ROOT%\em64t\include" test.cpp /link bufferoverflowU.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib impicxx.lib
If you have a version of the Microsoft Visual C++ Compiler that was packaged with Microsoft* Visual Studio* 2008, type the following respective command-lines for the C and C++ test applications:
cl /Fetestc_vc /I"%I_MPI_ROOT%\em64t\include" test.c "%I_MPI_ROOT%\em64t\lib\impi.lib"
and
cl /Fetestcpp_vc /I"%I_MPI_ROOT%\em64t\include" test.cpp /link /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib impicxx.lib
The executables for test.c will be called testc.exe and testc_vc.exe, and the executables for test.cpp will be called testcpp.exe and testcpp_vc.exe. The executable names are a result of using the command-line option /Fe. The /I option references the path to the MPI include files. The library path reference for the Intel MPI library is given by:
"%I_MPI_ROOT%\em64t\lib\impi.lib"
Similarly for the two test cases called test.f and test.f90, enter the following two respective commands to build executables:
ifort /Fetestf /I"%I_MPI_ROOT%\em64t\include" test.f "%I_MPI_ROOT%\em64t\lib\impi.lib"
ifort /Fetestf90 /I"%I_MPI_ROOT%\em64t\include" test.f90 "%I_MPI_ROOT%\em64t\lib\impi.lib"
Issue mpiexec commands which might look something like the following:
mpiexec -n 12 -machinefile z:\cluster_file_share\machines.Windows -wdir z:\cluster_file_share\test testf.exe
mpiexec -n 12 -machinefile z:\cluster_file_share\machines.Windows -wdir z:\cluster_file_share\test testf90.exe
mpiexec -n 12 -machinefile z:\cluster_file_share\machines.Windows -wdir z:\cluster_file_share\test testc.exe
mpiexec -n 12 -machinefile z:\cluster_file_share\machines.Windows -wdir z:\cluster_file_share\test testcpp.exe
and for the Microsoft Visual C++ executables:
mpiexec -n 12 -machinefile z:\cluster_file_share\machines.Windows -wdir z:\cluster_file_share\test testc_vc.exe
mpiexec -n 12 -machinefile z:\cluster_file_share\machines.Windows -wdir z:\cluster_file_share\test testcpp_vc.exe
The output from testcpp.exe should look something like:
Hello world: rank 0 of 12 running on clusternode0
Hello world: rank 1 of 12 running on clusternode1
Hello world: rank 2 of 12 running on clusternode2
Hello world: rank 3 of 12 running on clusternode3
Hello world: rank 4 of 12 running on clusternode4
Hello world: rank 5 of 12 running on clusternode5
Hello world: rank 6 of 12 running on clusternode6
Hello world: rank 7 of 12 running on clusternode7
Hello world: rank 8 of 12 running on clusternode8
Hello world: rank 9 of 12 running on clusternode9
Hello world: rank 10 of 12 running on clusternode10
Hello world: rank 11 of 12 running on clusternode11
The above mpiexec commands assume that there is a file share called:
z:\cluster_file_share
If your system is using only symmetric multiprocessing on a shared memory system, then the mpiexec commands could omit the -machinefile and -wdir options.
If you have successfully run the above applications using the Intel MPI Library, you can now run (without re-linking) the four executables on clusters that use Direct Access Programming Library (DAPL) interfaces to alternative interconnection fabrics. If you encounter problems, see the section titled Troubleshooting within the document Intel MPI Library Getting Started Guide located in <directory-path-to-Intel-MPI-Library>\doc\Getting_Started.pdf for possible solutions.
Assuming that you have a dapl device fabric installed on the cluster, you can issue the following commands for the four executables so as to access that device fabric:
mpiexec -machinefile machines.Windows -env I_MPI_FABRICS dapl -n 2 testf.exe
mpiexec -machinefile machines.Windows -env I_MPI_FABRICS dapl -n 2 testf90.exe
mpiexec -machinefile machines.Windows -env I_MPI_FABRICS dapl -n 2 testc.exe
mpiexec -machinefile machines.Windows -env I_MPI_FABRICS dapl -n 2 testcpp.exe
mpiexec -machinefile machines.Windows -env I_MPI_FABRICS dapl -n 2 testc_vc.exe
mpiexec -machinefile machines.Windows -env I_MPI_FABRICS dapl -n 2 testcpp_vc.exe
The output from testf90 using the dapl device value for the I_MPI_FABRICS environment variable should look something like:
Hello world: rank 0 of 2 running on clusternode1
Hello world: rank 1 of 2 running on clusternode2
The mpiexec command controls how the ranks of the processes are allocated to the nodes in the cluster. By default, mpiexec uses round-robin assignment of ranks to the nodes. This placement algorithm may not be the best choice for your application, particularly for clusters with SMP (symmetric multi-processor) nodes.
Suppose that the geometry is <#ranks> = 4 and <#nodes> = 2, where adjacent pairs of ranks are assigned to each node (for example, for 2-way SMP nodes). Issue the command:
type machines.Windows
The results should be something like:
clusternode1
clusternode2
Since each node of the cluster is a 2-way SMP, and 4 processes are to be used for the application, the next experiment will distribute the 4 processes such that 2 of the processes will execute on clusternode1 and 2 will execute on clusternode2. For example, you might issue the following commands:
mpiexec -n 2 -host clusternode1 .\testf : -n 2 -host clusternode2 .\testf
mpiexec -n 2 -host clusternode1 .\testf90 : -n 2 -host clusternode2 .\testf90
mpiexec -n 2 -host clusternode1 .\testc : -n 2 -host clusternode2 .\testc
mpiexec -n 2 -host clusternode1 .\testcpp : -n 2 -host clusternode2 .\testcpp
The following output should be produced for the executable testc:
Hello world: rank 0 of 4 running on clusternode1
Hello world: rank 1 of 4 running on clusternode1
Hello world: rank 2 of 4 running on clusternode2
Hello world: rank 3 of 4 running on clusternode2
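The contrast between the default round-robin placement and the block placement just demonstrated can be sketched as follows. This is an illustrative Python model of the two mappings, not Intel MPI Library internals; the node names are the two-node example used in this section:

```python
# Model of two rank-to-node placement policies for 4 ranks on 2 nodes.

def round_robin(nodes, nranks):
    """Default mpiexec behavior: rank r goes to node r mod len(nodes)."""
    return [nodes[r % len(nodes)] for r in range(nranks)]

def block(nodes, per_node):
    """Block placement via '-n <per_node> -host <node>' groups."""
    return [n for n in nodes for _ in range(per_node)]

nodes = ["clusternode1", "clusternode2"]
print(round_robin(nodes, 4))
# ['clusternode1', 'clusternode2', 'clusternode1', 'clusternode2']
print(block(nodes, 2))
# ['clusternode1', 'clusternode1', 'clusternode2', 'clusternode2']
```

The second mapping matches the testc output shown above: adjacent rank pairs share a node.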
In general, if there are i nodes in the cluster and each node is a j-way SMP system, then the mpiexec command-line syntax for distributing the i × j processes amongst the i × j processors within the cluster is:
mpiexec -n j -host <nodename-1> .\mpi_example : ^
-n j -host <nodename-2> .\mpi_example : ^
-n j -host <nodename-3> .\mpi_example : ^
…
-n j -host <nodename-i> .\mpi_example
Note that you would have to fill in appropriate host names for <nodename-1> through <nodename-i> with respect to your cluster system. For a complete discussion on how to control process placement through the mpiexec command, see the Local Options section of the Intel MPI Library Reference Manual located in <directory-path-to-Intel-MPI-Library>\doc\Reference_Manual.pdf.
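For clusters with many nodes, the multi-host command line above can be generated rather than typed by hand. The following is a hedged Python sketch under the assumptions of this section (one executable, j ranks per node); the host names are placeholders for your cluster's actual node names:

```python
# Build the multi-host mpiexec command line described above:
# one "-n j -host <node> <exe>" group per node, joined by " : ".
def build_mpiexec(nodes, j, exe):
    parts = ["-n {} -host {} {}".format(j, node, exe) for node in nodes]
    return "mpiexec " + " : ".join(parts)

cmd = build_mpiexec(["clusternode1", "clusternode2"], 2, r".\testf")
print(cmd)
# mpiexec -n 2 -host clusternode1 .\testf : -n 2 -host clusternode2 .\testf
```

The generated string matches the two-node example commands shown earlier in this section.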
The mpitune utility was first introduced with Intel® MPI Library 3.2. It can be used to find optimal Intel® MPI Library settings for a particular cluster configuration, or for a user's application running on that cluster.
As an example, the executables testc.exe, testcpp.exe, testf.exe, and testf90.exe in the directory test_intel_mpi could be used. The command invocation for mpitune might look something like the following:
mpitune --host-file machines.WINDOWS --output-file testc.conf --application \"mpiexec -n 4 testc.exe\"
where the options above are just a subset of the following complete command-line switches:
Command-line Option | Semantic Meaning
-a "<app_cmd_line>" | --application "<app_cmd_line>"
    Switch on the application tuning mode. Quote the full command line as shown.
-cm | --cluster-mode {exclusive | full}
    Set the cluster usage mode. In exclusive mode, only one task is executed on the cluster at a time; in full mode, the maximum number of tasks is executed. full is the default mode.
-d | --debug
    Print debug information.
-dl [d1[,d2…[,dN]]] | --device-list [d1[,d2…[,dN]]]
    Select the device(s) you want to tune. By default, use all of the devices mentioned in the <installdir>/<arch>/etc/devices.xml file.
-er | --existing-ring
    Try to use an existing MPD ring. By default, create a new MPD ring.
-fl [f1[,f2…[,fN]]] | --fabric-list [f1[,f2…[,fN]]]
    Select the fabric(s) you want to tune. By default, use all of the fabrics mentioned in the <installdir>/<arch>/etc/fabrics.xml file.
-h | --help
    Display a help message.
-hf <hostsfile> | --host-file <hostsfile>
    Specify an alternative host file name. By default, use $PWD/mpd.hosts.
-hr | --host-range {min:max | min: | :max}
    Set the range of hosts used for testing. The default minimum value is 1. The default maximum value is the number of hosts defined by mpd.hosts or the existing MPD ring. The min: or :max format uses the default value as appropriate.
-i <count> | --iterations <count>
    Define how many times to run each tuning step. Higher iteration counts increase the tuning time but may also increase the accuracy of the results. The default value is 3.
-mh | --master-host
    Dedicate a single host to mpitune.
--message-range {min:max | min: | :max}
    Set the message size range. The default minimum value is 0. The default maximum value is 4194304 (4mb). By default, the values are given in bytes. They can also be given in the following format: 16kb, 8mb, or 2gb. The min: or :max format uses the default value as appropriate.
-of <file-name> | --output-file <file-name>
    Specify the application configuration file to be generated in the application-specific mode. By default, use $PWD/app.conf.
-od <outputdir> | --output-directory <outputdir>
    Specify the directory name for all output files. By default, use the current directory. The directory should be accessible from all hosts.
-pr {min:max | min: | :max} | --ppn-range {min:max | min: | :max} | --perhost-range {min:max | min: | :max}
    Set the maximum number of processes per host. The default minimum value is 1. The default maximum value is the number of cores of the processor. The min: or :max format uses the default value as appropriate.
-sf [file-path] | --session-file [file-path]
    Continue the tuning process starting from the state saved in the file-path session file.
-s | --silent
    Suppress all diagnostic output.
-td <dir-path> | --temp-directory <dir-path>
    Specify a directory name for temporary data. By default, use $PWD/mpitunertemp. This directory should be accessible from all hosts.
-t "<test_cmd_line>" | --test "<test_cmd_line>"
    Replace the default Intel® MPI Benchmarks with the indicated benchmarking program in the cluster-specific mode. Quote the full command line as shown.
-tl <minutes> | --time-limit <minutes>
    Set the mpitune execution time limit in minutes. The default value is 0, which means no limit.
-V | --version
    Print out the version information.
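As an illustration of the size notation accepted by --message-range (a plain byte count, or a value suffixed with kb, mb, or gb), the following Python sketch converts such a value to bytes. The helper is hypothetical and not part of the mpitune distribution:

```python
# Sketch: interpret the size notation used by mpitune's --message-range
# option (plain bytes, or values suffixed with kb/mb/gb, e.g. 16kb, 8mb, 2gb).
# Illustrative helper only; not part of mpitune itself.
_UNITS = {"kb": 1024, "mb": 1024**2, "gb": 1024**3}

def parse_size(text):
    """Return the number of bytes represented by a --message-range value."""
    text = text.strip().lower()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)  # bare value: already in bytes

print(parse_size("16kb"))     # -> 16384
print(parse_size("4194304"))  # -> 4194304 (the default maximum, i.e. 4mb)
```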
Details on optimizing the settings for Intel® MPI Library with regards to the cluster configuration or a user’s application for that cluster are described in the next two subsections.
Once you have installed the Intel® Cluster Tools on your system you may want to use the mpitune utility to generate a configuration file that is targeted at optimizing the Intel® MPI Library with regards to the cluster configuration. For example, the mpitune command:
mpitune -hf machines.WINDOWS -of testc.conf --test "testc.exe"
could be used, where machines.WINDOWS contains a list of the nodes in the cluster. Completion of this command may take some time. The mpitune utility will generate a configuration file with the name given by the -of option (testc.conf in this example). You can then proceed to run the mpiexec command on an application using the -tune option. For example, the mpiexec command-line syntax for the testc executable might look something like the following:
mpiexec -tune testc.conf -n 4 testc.exe
The mpitune invocation:
mpitune -hf machines.WINDOWS -of testf90.conf --application "mpiexec -n 4 testf90.exe"
will generate a file called testf90.conf that is based on the application testf90. Completion of this command may also take some time. This configuration file can be used in the following manner:
mpiexec -tune testf90.conf -n 4 testf90.exe
where the mpiexec command will load the configuration options recorded in testf90.conf.
You might want to use the mpitune utility on each of the test applications testc.exe, testcpp.exe, testf.exe, and testf90.exe. For a complete discussion of how to use the mpitune utility, see the Tuning Reference section of the Intel MPI Library for Windows* OS Reference Manual located in <directory-path-to-Intel-MPI-Library>\doc\Reference_Manual.pdf.
To make inquiries about Intel MPI Library, visit the URL: http://premier.intel.com.
As mentioned previously (e.g., Figure 2.1), debugging of an MPI application can be achieved with the I_MPI_DEBUG environment variable. The syntax of the I_MPI_DEBUG environment variable is as follows:
I_MPI_DEBUG=<level>
where <level> can have the values:
Value | Debug Level Semantics
Not set
    Print no debugging information.
1
    Print warnings if the specified I_MPI_DEVICE could not be used.
2
    Confirm which I_MPI_DEVICE was used.
> 2
    Add extra levels of debugging information.
To simplify process identification, add the operator “+” or “-” in front of the numerical value for the I_MPI_DEBUG level. This setting produces debug output lines that are prepended with the MPI process rank, a process ID, and a host name as defined at process launch time. For example, the command:
mpiexec -n <# of processes> -env I_MPI_DEBUG +2 my_prog.exe
produces output debug messages in the following format:
I_MPI: [rank#pid@hostname]Debug message
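Such prefixed debug lines can be post-processed mechanically, for example to group messages by rank or host. The following Python sketch splits the [rank#pid@hostname] prefix out of a line in the format shown above; the function name is illustrative:

```python
import re

# Sketch: split the rank/pid/hostname prefix out of an I_MPI_DEBUG output
# line of the form "I_MPI: [rank#pid@hostname]Debug message". The prefix
# layout follows the format shown above; treat this helper as illustrative.
LINE_RE = re.compile(r"\[(\d+)#(\d+)@([^\]]+)\](.*)")

def parse_debug_line(line):
    """Return a dict with rank, pid, host, and message, or None."""
    m = LINE_RE.search(line)
    if not m:
        return None
    rank, pid, host, message = m.groups()
    return {"rank": int(rank), "pid": int(pid),
            "host": host, "message": message.strip()}

info = parse_debug_line("I_MPI: [0#4321@clusternode1]Debug message")
print(info["rank"], info["host"])  # -> 0 clusternode1
```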
You can also compile the MPI application with the /Zi or /Z7 compiler options to include symbolic debug information.
MPI applications can be easily instrumented with the Intel Trace Collector library to gather performance data, and the collected data can then be visually assessed postmortem with the Intel Trace Analyzer. The Intel Trace Analyzer and Collector supports instrumentation of applications written in the C, C++, Fortran 77, and Fortran 95 programming languages.
Recall that in the test_intel_mpi folder for Intel MPI Library, there are four source files called:
test.c test.cpp test.f test.f90
In a scratch version of the folder called test, one can set the environment variable VT_LOGFILE_PREFIX to the following:
set VT_LOGFILE_PREFIX=test_inst
where test_inst is short for test instrumentation. After doing this you can create a test instrumentation folder by typing the command:
mkdir %VT_LOGFILE_PREFIX%
To compile and instrument the Fortran files called test.f and test.f90 using the Intel Fortran compiler, you can issue the following respective DOS commands:
ifort /Fetestf /I"%I_MPI_ROOT%"\em64t\include test.f /link /LIBPATH:"%VT_LIB_DIR%" VT.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
and
ifort /Fetestf90 /I"%I_MPI_ROOT%"\em64t\include test.f90 /link /LIBPATH:"%VT_LIB_DIR%" VT.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
To compile and instrument the respective C and C++ files test.c and test.cpp using the Intel C++ Compiler, you can issue the following respective DOS commands:
icl /Fetestc /I"%I_MPI_ROOT%"\em64t\include test.c /link /LIBPATH:"%VT_LIB_DIR%" VT.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
and
icl /Fetestcpp /I"%I_MPI_ROOT%"\em64t\include test.cpp /link /LIBPATH:"%VT_LIB_DIR%" VT.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib impicxx.lib /NODEFAULTLIB:LIBCMTD.lib
For C++ applications, the Intel MPI library impicxx.lib is needed in addition to impi.lib.
Alternatively, to compile and instrument the respective C and C++ files test.c and test.cpp using a Microsoft* Visual Studio* C++ Compiler that was not packaged with Microsoft* Visual Studio* 2008, you can issue the DOS commands:
cl /Fetestc_vc /I"%I_MPI_ROOT%"\em64t\include test.c /link /LIBPATH:"%VT_LIB_DIR%" VT.lib Ws2_32.lib bufferoverflowu.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
and
cl /Fetestcpp_vc /I"%I_MPI_ROOT%"\em64t\include test.cpp /link /LIBPATH:"%VT_LIB_DIR%" VT.lib Ws2_32.lib bufferoverflowu.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impicxx.lib impi.lib /NODEFAULTLIB:LIBCMTD.lib
Note that when compiling and linking with a Microsoft Visual Studio C++ Compiler that was not packaged with Microsoft* Visual Studio* 2008, the library bufferoverflowu.lib must be added, as demonstrated above for the C and C++ test cases.
If you have a version of the Microsoft Visual C++ Compiler that was packaged with Microsoft* Visual Studio* 2008, type the following respective command-lines for the C and C++ test applications:
cl /Fetestc_vc /I"%I_MPI_ROOT%"\em64t\include test.c /link /LIBPATH:"%VT_LIB_DIR%" VT.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
and
cl /Fetestcpp_vc /I"%I_MPI_ROOT%"\em64t\include test.cpp /link /LIBPATH:"%VT_LIB_DIR%" VT.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impicxx.lib impi.lib /NODEFAULTLIB:LIBCMTD.lib
After issuing these compilation and link commands, the following executables should exist in the present working directory:
testc.exe
testcpp.exe
testcpp_vc.exe
testc_vc.exe
testf.exe
testf90.exe
Recall that the environment variable VT_LOGFILE_PREFIX was set to test_inst which was used as part of a mkdir command to create a directory where instrumentation data is to be collected. One method of directing the mpiexec command to place the Intel Trace Collector data into the folder called test_inst is to use the following set of commands for the executables above:
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testc
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testcpp
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testcpp_vc
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testc_vc
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testf
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testf90
For the executables above, 4 MPI processes are created via the mpiexec command. These mpiexec commands will produce the STF files:
testc.stf testcpp.stf testcpp_vc.stf testc_vc.stf testf.stf testf90.stf
within the directory test_inst.
Issuing the traceanalyzer command on the STF file test_inst\testcpp_vc.stf as follows:
traceanalyzer test_inst\testcpp_vc.stf
will generate a profile panel which looks something like the following (Figure 6.1):
Figure 6.1 – The Profile display for testcpp_vc.stf
Figure 6.2 shows the Event Timeline display which results when following the menu path Charts->Event Timeline within Figure 6.1.
Figure 6.2 – The Profile and Timeline display for testcpp_vc.stf
An alternative to the above mpiexec commands is to create a trace collector configuration file such as vtconfig.txt which could have the contents, beginning in column 1, of:
logfile-prefix test_inst
The directive called logfile-prefix is analogous to the Intel Trace Collector environment variable VT_LOGFILE_PREFIX. In general, you can place multiple Intel Trace Collector directives into this vtconfig.txt file. For additional information about Intel Trace Collector directives, you should look at Chapter 9 of <directory-path-to-ITAC>\doc\ITC_Reference_Guide.pdf. The file vtconfig.txt can be referenced by the mpiexec commands through the Intel Trace Collector environment variable directive called VT_CONFIG as follows:
mpiexec -n 4 -env VT_CONFIG vtconfig.txt testc
mpiexec -n 4 -env VT_CONFIG vtconfig.txt testcpp
mpiexec -n 4 -env VT_CONFIG vtconfig.txt testcpp_vc
mpiexec -n 4 -env VT_CONFIG vtconfig.txt testc_vc
mpiexec -n 4 -env VT_CONFIG vtconfig.txt testf
mpiexec -n 4 -env VT_CONFIG vtconfig.txt testf90
There may be situations where an application will end prematurely, and thus trace data could be lost. The Intel Trace Collector has a trace library that works in fail-safe mode.
To compile and instrument the Fortran files called test.f and test.f90 using the Intel Fortran Compiler, you can issue the following respective DOS commands:
ifort /Fetestf_fs /I"%I_MPI_ROOT%"\em64t\include test.f /link /LIBPATH:"%VT_LIB_DIR%" VTfs.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
and
ifort /Fetestf90_fs /I"%I_MPI_ROOT%"\em64t\include test.f90 /link /LIBPATH:"%VT_LIB_DIR%" VTfs.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
where VTfs.lib is the special Intel Trace Collector library for fail-safe (fs) tracing.
To compile and instrument the respective C and C++ files test.c and test.cpp using the Intel C++ compiler, you can issue the following respective DOS commands:
icl /Fetestc_fs /I"%I_MPI_ROOT%"\em64t\include test.c /link /LIBPATH:"%VT_LIB_DIR%" VTfs.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
and
icl /Fetestcpp_fs /I"%I_MPI_ROOT%"\em64t\include test.cpp /link /LIBPATH:"%VT_LIB_DIR%" VTfs.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib impicxx.lib /NODEFAULTLIB:LIBCMTD.lib
For C++ applications, the Intel MPI library impicxx.lib is needed in addition to impi.lib.
Alternatively, to compile and instrument the respective C and C++ files test.c and test.cpp using a Microsoft* Visual Studio* C++ Compiler that was not packaged with Microsoft* Visual Studio* 2008, you can issue the DOS commands:
cl /Fetestc_fs_vc /I"%I_MPI_ROOT%"\em64t\include test.c /link /LIBPATH:"%VT_LIB_DIR%" VTfs.lib Ws2_32.lib bufferoverflowu.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
and
cl /Fetestcpp_fs_vc /I"%I_MPI_ROOT%"\em64t\include test.cpp /link /LIBPATH:"%VT_LIB_DIR%" VTfs.lib Ws2_32.lib bufferoverflowu.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impicxx.lib impi.lib /NODEFAULTLIB:LIBCMTD.lib
Again, note that when compiling and linking with a Microsoft Visual Studio C++ compiler that was not packaged with Microsoft* Visual Studio* 2008, the library bufferoverflowu.lib must be added, as demonstrated above for the C and C++ test cases.
If you have a version of the Microsoft Visual C++ Compiler that was packaged with Microsoft* Visual Studio* 2008, type the following respective command-lines for the C and C++ test applications:
cl /Fetestc_fs_vc /I"%I_MPI_ROOT%"\em64t\include test.c /link /LIBPATH:"%VT_LIB_DIR%" VTfs.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
and
cl /Fetestcpp_fs_vc /I"%I_MPI_ROOT%"\em64t\include test.cpp /link /LIBPATH:"%VT_LIB_DIR%" VTfs.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impicxx.lib impi.lib /NODEFAULTLIB:LIBCMTD.lib
After issuing these compilation and link commands, the following executables should exist in the present working directory:
testc_fs.exe
testcpp_fs.exe
testcpp_fs_vc.exe
testc_fs_vc.exe
testf_fs.exe
testf90_fs.exe
Recall that the environment variable VT_LOGFILE_PREFIX was set to test_inst which was used as part of a mkdir command to create a directory where instrumentation data is to be collected. One method of directing the mpiexec command to place the Intel Trace Collector data into the folder called test_inst is to use the following set of commands for the executables above:
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testc_fs
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testcpp_fs
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testcpp_fs_vc
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testc_fs_vc
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testf_fs
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testf90_fs
If the application fails during execution, the fail-safe library freezes all MPI processes and then writes out the trace file. Figure 6.3 shows an Intel Trace Analyzer display for test.c.
Figure 6.3 – Intel Trace Analyzer display of Fail-Safe Trace Collection by Intel Trace Collector
Complete user documentation regarding VTfs.lib for the Intel Trace Collector can be found within the file:
<directory-path-to-ITAC>\doc\ITC_Reference_Guide.pdf
on the system where the Intel Trace Collector is installed. You can use vtfs as a search phrase within the documentation.
The itcpin utility is a binary instrumentation tool that comes with Intel Trace Analyzer and Collector. The Intel® architecture for itcpin on Microsoft Windows must be Intel® 64.
The basic syntax for instrumenting a binary executable with the itcpin utility is as follows:
itcpin [<ITC options>] -- <application command-line>
where -- is a delimiter between Intel Trace Collector (ITC) options and the application command-line.
The <ITC options> that will be used here are:
--run (off)
    itcpin only runs the given executable if this option is used. Otherwise, it just analyzes the executable and prints configurable information about it.
--insert
    Select the instrumentation library. Intel Trace Collector has several libraries that can be used to do different kinds of tracing. An example library value is VT, the Intel Trace Collector library, which is the default instrumentation library.
--profile (off)
    Enable function profiling in the instrumented binary. Once enabled, all functions in the executable are traced. It is recommended to restrict the runtime overhead and the amount of trace data by disabling functions that do not need to be traced.
To obtain a list of all of the itcpin options simply type:
itcpin --help
To demonstrate the use of itcpin, you can compile a C programming language example for calculating the value of “pi” where the application uses the MPI parallel programming paradigm. You can download the C source from the URL:
http://www.nccs.gov/wp-content/training/mpi-examples/C/pical.c
For this example (with the downloaded source saved as pi.c), the following shell commands will allow you to instrument the binary called pi.exe with Intel Trace Collector instrumentation.
mpiicc pi.c /debug:all
set VT_LOGFILE_PREFIX=itcpin_inst
rmdir /S /Q %VT_LOGFILE_PREFIX%
mkdir %VT_LOGFILE_PREFIX%
mpiexec -mapall -n 4 -env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off" itcpin --run --profile -- pi.exe
where the environment variables that are being set for the mpiexec command are:
-env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off"
Notice above that the itcpin utility is included within the mpiexec command, and the itcpin options that are used are --run and --profile. The DOS shell commands before and after the invocation of itcpin should be thought of as prolog and epilog code to aid in the use of the itcpin utility. Also, the mpiicc batch script for compiling pi.c uses the /debug:all compiler option to create an executable that includes debug information which aids the instrumentation process.
An explanation for these instrumentation environment variables can be found in the Intel Trace Collector Users’ Guide under the search topic “ITC Configuration”.
The DOS shell commands above could be packaged into a DOS batch script. The output from the above sequence of DOS Shell commands looks something like the following:
Process 0 of 4 on cluster01
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.068866
[0] Intel(R) Trace Collector INFO: Writing tracefile pi.stf in Z:\test\itcpin_inst
Process 2 of 4 on cluster03
Process 3 of 4 on cluster04
Process 1 of 4 on cluster02
The exact output will be a function of your cluster configuration.
Figure 6.4 shows the timeline and function panel displays that were generated from the instrumentation data that was stored into the directory itcpin_inst as indicated by the environment variable VT_LOGFILE_PREFIX. The command that initiated the Intel Trace Analyzer with respect to the current directory was:
traceanalyzer itcpin_inst\pi.exe.stf
Figure 6.4 – Intel Trace Analyzer display of the “pi” integration application that has been binary instrumented with itcpin
Complete user documentation regarding itcpin for the Intel Trace Collector can be found within the file:
<directory-path-to-ITAC>\doc\ITC_Reference_Guide.pdf
on the system where the Intel Trace Collector is installed. You can use itcpin as a search phrase within the documentation. To make inquiries about the Intel Trace Analyzer, visit the URL: http://premier.intel.com.
In the folder path where Intel Trace Analyzer and Collector resides, there is a folder called examples. The folder path where the examples directory resides might be something like:
C:\Program Files (x86)\Intel\ictce\4.0.0.018\itac\examples
If you copy the examples folder into a work area which is accessible by all of the nodes of the cluster, you might try the following sequence of commands:
nmake distclean
nmake all MPIDIR="c:\Program Files (x86)\Intel\ictce\4.0.0.018\MPI\em64t"
The makefile variable MPIDIR is explicitly set to the folder path where the version of Intel MPI Library resides that supports 64-bit address extensions. This set of commands will respectively clean up the folder content and compile and execute the following C and Fortran executables:
mpiconstants.exe
vnallpair.exe
vnallpairc.exe
vnjacobic.exe
vnjacobif.exe
vtallpair.exe
vtallpairc.exe
vtcounterscopec.exe
vtjacobic.exe
vtjacobif.exe
vttimertest.exe
of which the following STF files are created:
timertest.stf
vtallpair.stf
vtallpairc.stf
vtcounterscopec.stf
vtjacobic.stf
vtjacobif.stf
If one invokes Intel Trace Analyzer with the command:
traceanalyzer vtjacobic.stf
the following display panel will appear (Figure 6.5):
Figure 6.5 - Intel Trace Analyzer Display for vtjacobic.stf
Figure 6.6 shows the Event Timeline display which results when following the menu path Charts->Event Timeline within Figure 6.5.
Figure 6.6 - Intel Trace Analyzer Display for vtjacobic.stf using Charts->Event Timeline
You can use the trace analyzer to view the contents of the other STF files in this working directory on your cluster system.
Intel Trace Collector environment variables which should be useful for message checking are:
VT_DEADLOCK_TIMEOUT <delay>, where <delay> is a time value such as 1m. The default value is 1 minute. This controls the same mechanism to detect deadlocks as in VTfs.lib, the fail-safe library. For interactive use it is recommended to set it to a small value like 10s to detect deadlocks quickly without having to wait long for the timeout.
VT_DEADLOCK_WARNING <delay>, where <delay> is a time value such as 5m. The default value is 5 minutes. If, on average, the MPI processes are stuck in their last MPI call for more than this threshold, a GLOBAL:DEADLOCK:NO PROGRESS warning is generated. This is a sign of a load imbalance, or of a deadlock which cannot be detected because at least one process polls for progress instead of blocking inside an MPI call.
VT_CHECK_TRACING <on | off>. By default, during correctness checking with VTmc.lib, no events are recorded and no trace file is written. This option enables recording of all events also supported by the normal VT.lib and the writing of a trace file. The trace file will also contain the errors found during the run.
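The <delay> values above use a compact time notation (e.g. 10s for seconds, 5m for minutes). A minimal Python sketch of how such values could be interpreted follows; the suffix set, including an h suffix for hours, is an assumption based only on the examples given above:

```python
# Sketch: interpret the <delay> time notation used by VT_DEADLOCK_TIMEOUT
# and VT_DEADLOCK_WARNING (e.g. "10s", "5m"). The supported suffix set,
# including "h" for hours, is an assumption for illustration.
_SECONDS = {"s": 1, "m": 60, "h": 3600}

def parse_delay(text):
    """Return the number of seconds represented by a <delay> value."""
    text = text.strip().lower()
    if text and text[-1] in _SECONDS:
        return int(text[:-1]) * _SECONDS[text[-1]]
    return int(text)  # bare value: assume seconds

print(parse_delay("10s"))  # -> 10
print(parse_delay("5m"))   # -> 300
```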
Complete user documentation regarding message checking for the Intel Trace Collector can be found within the file:
<directory-path-to-ITAC>\doc\ITC_Reference_Guide.pdf
The chapter title is called “Correctness Checking”.
At the URL:
http://www.shodor.org/refdesk/Resources/Tutorials/BasicMPI/deadlock.c
one can obtain the source to an MPI example using C bindings that demonstrates deadlock. This C programming language test case is called deadlock.c.
To compile and instrument deadlock.c using the Intel C++ Compiler, you can issue the following DOS command:
icl /D_CRT_SECURE_NO_DEPRECATE /Fedeadlock /I"%I_MPI_ROOT%"\em64t\include /Zi deadlock.c /link /stack:8000000 /LIBPATH:"%VT_LIB_DIR%" VTmc.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
Alternatively, to compile and instrument deadlock.c using the Microsoft Visual Studio C++ Compiler from say Microsoft Visual Studio 2005, you can issue the DOS command:
cl /D_CRT_SECURE_NO_DEPRECATE /Fedeadlock_vc /I"%I_MPI_ROOT%"\em64t\include /Zi deadlock.c /link /stack:8000000 /LIBPATH:"%VT_LIB_DIR%" VTmc.lib Ws2_32.lib bufferoverflowu.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
If the C++ compiler supplied with Microsoft Visual Studio 2008 is used, you can issue the DOS command:
cl /D_CRT_SECURE_NO_DEPRECATE /Fedeadlock /I"%I_MPI_ROOT%"\em64t\include /Zi deadlock.c /link /stack:8000000 /LIBPATH:"%VT_LIB_DIR%" VTmc.lib Ws2_32.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
where the library bufferoverflowu.lib is omitted. For all three compilation scenarios, the option /Zi was used to instruct the compiler to insert symbolic debug information into an object file.
Also recall that prior to issuing an mpiexec command, a scratch folder for trace information called test_inst can be created with the help of the VT_LOGFILE_PREFIX environment variable by using the following process:
set VT_LOGFILE_PREFIX=test_inst
After doing this you can create a test instrumentation folder by typing the command:
mkdir %VT_LOGFILE_PREFIX%
Two command-line examples are demonstrated for issuing the mpiexec command with settings for the VT_DEADLOCK_TIMEOUT, VT_DEADLOCK_WARNING, and VT_CHECK_TRACING environment variables. For execution of deadlock.exe on a local drive, the mpiexec command might look something like:
mpiexec -wdir "C:\MPI Share\MPI_Share_Area\test_correctness_checking" -genv VT_CHECK_TRACING on -n 2 .\deadlock.exe -genv VT_DEADLOCK_TIMEOUT 20s -genv VT_DEADLOCK_WARNING 25s 0 80000
Alternatively, for a mapped drive that is shared on all nodes of the cluster, the mpiexec command might look something like:
mpiexec -genv VT_CHECK_TRACING on -mapall -n 2 .\deadlock.exe 0 80000 -wdir Z:\MPI_Share_Area\test_correctness_checking -genv VT_DEADLOCK_TIMEOUT 20s -genv VT_DEADLOCK_WARNING 25s
The execution diagnostics might look something like the following:
…
[0] ERROR: no progress observed in any process for over 1:00 minutes, aborting application
[0] WARNING: starting emergency trace file writing
[0] ERROR: GLOBAL:DEADLOCK:HARD: fatal error
[0] ERROR: Application aborted because no progress was observed for over 1:00 minutes,
[0] ERROR: check for real deadlock (cycle of processes waiting for data) or
[0] ERROR: potential deadlock (processes sending data to each other and getting blocked
[0] ERROR: because the MPI might wait for the corresponding receive).
[0] ERROR: [0] no progress observed for over 1:00 minutes, process is currently in MPI call:
[0] ERROR: MPI_Recv(*buf=00000000004D2A80, count=800000, datatype=MPI_INT, source=1, tag=999, comm=MPI_COMM_WORLD, *status=00000000007DFE80)
[0] ERROR: main (C:\MPI Share\MPI_Share_Area\test_correctness_checking\deadlock.c:59)
[0] ERROR: __tmainCRTStartup (f:\dd\vctools\crt_bld\self_64_amd64\crt\src\crt0.c:266)
[0] ERROR: BaseThreadInitThunk (kernel32)
[0] ERROR: RtlUserThreadStart (ntdll)
[0] ERROR: ()
[0] ERROR: [1] no progress observed for over 1:00 minutes, process is currently in MPI call:
[0] ERROR: MPI_Recv(*buf=00000000004D2A80, count=800000, datatype=MPI_INT, source=0, tag=999, comm=MPI_COMM_WORLD, *status=00000000007DFE80)
[0] ERROR: main (C:\MPI Share\MPI_Share_Area\test_correctness_checking\deadlock.c:59)
[0] ERROR: __tmainCRTStartup (f:\dd\vctools\crt_bld\self_64_amd64\crt\src\crt0.c:266)
[0] ERROR: BaseThreadInitThunk (kernel32)
[0] ERROR: RtlUserThreadStart (ntdll)
[0] ERROR: ()
[0] INFO: Writing tracefile deadlock.stf in Z:\MPI_Share_Area\test_correctness_checking\test_inst
[0] INFO: GLOBAL:DEADLOCK:HARD: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.
1/2: receiving 80000
0/2: receiving 80000
job aborted:
rank: node: exit code[: error message]
0: clusternode1: 1: process 0 exited without calling finalize
1: clusternode2: 1: process 1 exited without calling finalize
Because the environment variable VT_CHECK_TRACING was set for the mpiexec command, trace information was placed into the directory referenced by VT_LOGFILE_PREFIX.
One can use the Intel® Trace Analyzer to view the deadlock problem that was reported in the output listing above. Here is what the trace information might look like (Figure 6.7):
Figure 6.7 – Event Timeline illustrating an error as signified by the black circle
In the event timeline chart, errors and warnings are represented by yellow-bordered circles (Figure 6.7). The fill color of each circle depends on the type of the particular diagnostic: an error is filled in with black, and a warning is filled in with gray.
For Figure 6.7, error messages and warnings can be suppressed through a context menu. The context menu appears when you right-click as shown in Figure 6.8; follow the path Show->Issues. If you uncheck the Issues item, the black and gray circles will be cleared.
Figure 6.8 – Context menu that can be used to suppress “Issues”. This is done by un-checking the “Issues” item
One can determine what source line is associated with an error message by using the context menu and selecting Details on Function. This will generate the following Details on Function panel (Figure 6.9):
Figure 6.9 – Illustration of the Detail on Function panel. The Show Source tab is the first item on the left
If you click on the Show Source tab in Figure 6.9, you will ultimately reach a source file panel such as what is demonstrated in Figure 6.10.
Figure 6.10 – The source panel display which shows the line in the user’s source where deadlock has taken place.
The diagnostic text messages and the illustration in Figure 6.10 reference line 49 of deadlock.c which looks something like the following:
…
49 MPI_Recv (buffer_in, MAX_ARRAY_LENGTH, MPI_INT, other, 999,
50 MPI_COMM_WORLD, &status);
51 MPI_Send (buffer_out, messagelength, MPI_INT, other, 999,
52 MPI_COMM_WORLD);
…
This is illustrated in Figure 6.11. To avoid deadlock situations, one might resort to the following solutions:
1. Use a different ordering of MPI communication calls between processes
2. Use non-blocking calls
3. Use MPI_Sendrecv or MPI_Sendrecv_replace
4. Use buffered mode
The If-structure for the original program looks something like the following:
…
41 if (sendfirst) {
42 printf ("\n%d/%d: sending %d\n", rank, size, messagelength);
43 MPI_Send (buffer_out, messagelength, MPI_INT, other, 999, MPI_COMM_WORLD);
44 MPI_Recv (buffer_in, MAX_ARRAY_LENGTH, MPI_INT, other, 999,
45 MPI_COMM_WORLD, &status);
46 printf ("\n%d/%d: received %d\n", rank, size, messagelength);
47 } else {
48 printf ("\n%d/%d: receiving %d\n", rank, size, messagelength);
49 MPI_Recv (buffer_in, MAX_ARRAY_LENGTH, MPI_INT, other, 999,
50 MPI_COMM_WORLD, &status);
51 MPI_Send (buffer_out, messagelength, MPI_INT, other, 999,
52 MPI_COMM_WORLD);
53 printf ("\n%d/%d: sent %d\n", rank, size, messagelength);
54 }
…
If you replace lines 43 through 45 and lines 49 through 52 with calls to MPI_Sendrecv so that they look something like:
MPI_Sendrecv (buffer_out, messagelength, MPI_INT, other, 999, buffer_in, MAX_ARRAY_LENGTH, MPI_INT, other, 999, MPI_COMM_WORLD, &status);
and save the modified source into a file called deadlock2.c, then compile the modified application with the Microsoft Visual C++ Compiler:
cl /D_CRT_SECURE_NO_DEPRECATE /Fedeadlock2_vc /I"%I_MPI_ROOT%"\em64t\include /Zi deadlock2.c /link /stack:8000000 /LIBPATH:"%VT_LIB_DIR%" VTmc.lib Ws2_32.lib bufferoverflowu.lib /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
then the result of invoking the mpiexec command for deadlock2_vc.exe:
mpiexec -genv VT_DEADLOCK_TIMEOUT 20s -genv VT_DEADLOCK_WARNING 25s -n 2 -genv VT_CHECK_TRACING on .\deadlock2_vc.exe 0 80000
is the following:
…
[0] INFO: Error checking completed without finding any problems.
1/2: receiving 80000
1/2: sent 80000
0/2: receiving 80000
0/2: sent 80000
This indicates the deadlock errors that were originally encountered have been eliminated for this example. Using the Intel® Trace Analyzer to view the instrumentation results, we see that the deadlock issues have been resolved (Figure 6.12).
Figure 6.12 – Illustration of deadlock removal by using MPI_Sendrecv in the original source file called deadlock.c
There may be situations where you are in the middle of an inspection with Intel® Trace Analyzer and you need to step away. For example, suppose you initially typed the command:
traceanalyzer test_inst\testcpp.stf
and you need to temporarily stop the analysis, and you are looking at the following panel:
Figure 6.13 – Event timeline for running 4 MPI processes for the executable generated from test.cpp
For the panel rendering above, if you select Project->Save Project or Project->Save Project As…, you will generate a subpanel that allows you to save the state of your session. This project file has the suffix “.itapr”, which stands for Intel® Trace Analyzer project. Figure 6.14 shows the process of saving the state of your session through a project file.
Figure 6.14 – Saving a Project File called testcpp.itapr
Suppose at a later time you wish to continue the analysis with Intel® Trace Analyzer. You can type the command:
traceanalyzer
You can then select Project->Load Project… and the following subpanel will appear (Figure 6.15):
Figure 6.15 – Loading a Project File called testcpp.itapr
With regards to Figure 6.15, simply click the Open button and you will immediately go back to the point where you last left off (Figure 6.13). For complete details on saving and loading a project file, please see Section 2.2 of the Intel® Trace Analyzer Reference Guide, which is titled “Project Menu”. The path to this file is:
<directory-path-to-ITAC>\doc\ITA_Reference_Guide.pdf
on the system where the Intel® Trace Analyzer and Collector is installed.
With respect to Figure 6.13, a developer may want to know a summary of process imbalance for the executable. One can do this by selecting the menu path Advanced->Application Imbalance Diagram. Figure 6.16 shows the result of making this selection.
Figure 6.16 – Selecting Application Imbalance for the menu selection Advanced->Application Imbalance Diagram
Pressing the OK button in the subpanel will generate the following (Figure 6.17). You can verify the meaning of the histogram subcomponents by pressing the Colors… button in Figure 6.17. This will generate the panel shown in Figure 6.18.
Figure 6.17 – Histogram subpanel as a result of pressing the OK button shown in Figure 6.16
Figure 6.18 – Legend for interpreting the histogram contributions for the Application Imbalance Diagram
For complete details on application imbalance, please see Section 5.4 of the Intel® Trace Analyzer Reference Guide, which is titled “Application Imbalance Diagram Dialog Box”. The path to this file is:
<directory-path-to-ITAC>\doc\ITA_Reference_Guide.pdf
on the system where the Intel® Trace Analyzer and Collector is installed.
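The imbalance summarized by the Application Imbalance Diagram can be approximated with simple arithmetic. The sketch below is illustrative only, and is not the exact formula used by the Intel® Trace Analyzer: it charges each process the time it spends waiting for the slowest process.

```c
#include <assert.h>

/* Illustrative load-imbalance estimate (not ITA's exact metric):
   total process-seconds lost while faster processes wait for the
   slowest one.  busy[i] is the useful compute time of process i. */
double imbalance_seconds(const double busy[], int nprocs)
{
    double max = 0.0, total = 0.0;
    for (int i = 0; i < nprocs; i++) {
        if (busy[i] > max)
            max = busy[i];
        total += busy[i];
    }
    /* every process runs for max seconds; total of them were useful */
    return nprocs * max - total;
}
```

A perfectly balanced run returns 0; the larger the value, the more a diagram like Figure 6.17 is dominated by wait time.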
In analyzing the performance of your executable, you can compare your instrumentation trace with an ideal trace for the executable. To do this, make the menu selection Advanced->Idealization. A dialog subpanel will then appear which will allow you to create an idealized trace of execution (Figure 6.19):
Figure 6.19 – Trace Idealizer dialog box generated as a result of the menu selection Advanced->Idealization
By pressing the Start button in the dialog panel for Figure 6.19, a trace file will be generated called “testcpp.ideal.stf”. After creating this file, you can then make the menu selection File->Open for the given Intel® Trace Analyzer panel and open the trace file “testcpp.ideal.stf” for comparative analysis. Figure 6.20 shows the side-by-side results of the actual execution trace and the ideal trace for the application “test.cpp”.
Figure 6.20 – Comparison of the actual execution trace versus the idealized trace for the application test.cpp
Notice in Figure 6.20 that the cost of doing message passing in the ideal case is negligible. You can use the data from the ideal case to help gauge the type of tuning performance that should be pursued.
For complete details on trace idealization, please see Section 5.3 of the Intel® Trace Analyzer Reference Guide, which is titled “Trace Idealizer Dialog Box”. The path to this file is:
<directory-path-to-ITAC>\doc\ITA_Reference_Guide.pdf
on the system where the Intel® Trace Analyzer and Collector is installed.
Intel® Trace Analyzer and Collector provides you with a custom plug-in API that allows you to write your own simulator. The simulator API can be found in the folder path:
<directory-path-to-ITAC>\examples\icpf\
on the system where the Intel® Trace Analyzer and Collector is installed. The API source file within the subfolder icpf is called h_devsim.cpp. For background on building a custom simulator for trace files, please see Chapter 9 of the Intel® Trace Analyzer Reference Guide, which is titled “Custom Plug-in Framework”. The path to this file is:
<directory-path-to-ITAC>\doc\ITA_Reference_Guide.pdf
If you encounter the following link error message:
LINK : fatal error LNK1181: cannot open input file 'bufferoverflowu.lib'
when creating executables for the experiments in this chapter and your nmake command is not part of Microsoft Visual Studio 2008, please source the .bat file:
vcvarsx86_amd64.bat
in your DOS command-line window where you are doing the Intel® Math Kernel Library experiments. This .bat file should be located in a bin subfolder within the Microsoft* Visual Studio* folder path and the DOS command for sourcing this file might look something like the following:
"C:\Program Files (x86)\Microsoft Visual Studio 8\VC\bin\x86_amd64\vcvarsx86_amd64.bat"
where the line above is contiguous.
If, for the ScaLAPACK experiments in this chapter, you are using a version of nmake from Microsoft Visual Studio 2008, you can use the ScaLAPACK makefile variable msvs=2008 to prevent the link error referenced above. Setting msvs=2008 instructs the ScaLAPACK makefile not to use the library bufferoverflowu.lib.
On Microsoft Windows CCS, the MKL installation might be in the folder path:
C:\Program Files (x86)\intel\ictce\4.0.0.xxx\mkl\
where xxx is the build number of the Intel® Cluster Toolkit Compiler Edition 4.0 package. The contents of the ...\mkl sub-folder should be:
To experiment with the ScaLAPACK (SCAlable LAPACK) test suite, recursively copy the contents of the directory path:
<directory-path-to-mkl>\tests\scalapack
to a scratch directory area which is sharable by all of the nodes of the cluster. In the scratch directory, issue the command:
cd scalapack
To build and run the ScaLAPACK executables, you can type the command:
nmake msvs=2008 arch=em64t mpi=intelmpi MPIdir="%I_MPI_ROOT%\em64t" libtype=static run
if you are using an nmake from Microsoft* Visual Studio* 2008. Otherwise use the command:
nmake arch=em64t mpi=intelmpi MPIdir="%I_MPI_ROOT%\em64t" libtype=static run
In the scalapack working directory where the nmake command was issued, the ScaLAPACK executables can be found in source\TESTING, and the results of the computation will also be placed into this same sub-directory. The results will be placed into “*.txt” files. You can invoke an editor to view the results in each of the “*.txt” files that have been created.
As an example result, the file “cdtlu_em64t_static_intelmpi_lp64.exe.txt” might contain something like the following for an execution run on a cluster using 4 MPI processes. The cluster that generated this sample output consisted of 4 nodes. The text file was generated by the corresponding executable xcdtlu_em64t_static_intelmpi_lp64.exe.
SCALAPACK banded linear systems.
'MPI machine'
Tests of the parallel complex single precision band matrix solve
The following scaled residual checks will be computed:
Solve residual = ||Ax - b|| / (||x|| * ||A|| * eps * N)
Factorization residual = ||A - LU|| / (||A|| * eps * N)
The matrix A is randomly generated for each test.
An explanation of the input/output parameters follows:
TIME : Indicates whether WALL or CPU time was used.
N : The number of rows and columns in the matrix A.
bwl, bwu : The number of diagonals in the matrix A.
NB : The size of the column panels the matrix A is split into. [-1 for default]
NRHS : The total number of RHS to solve for.
NBRHS : The number of RHS to be put on a column of processes before going
on to the next column of processes.
P : The number of process rows.
Q : The number of process columns.
THRESH : If a residual value is less than THRESH, CHECK is flagged as PASSED
Fact time: Time in seconds to factor the matrix
Sol Time: Time in seconds to solve the system.
MFLOPS : Rate of execution for factor and solve using sequential operation count.
MFLOP2 : Rough estimate of speed using actual op count (accurate big P,N).
The following parameter values will be used:
N : 3 5 17
bwl : 1
bwu : 1
NB : -1
NRHS : 4
NBRHS: 1
P : 1 1 1 1
Q : 1 2 3 4
Relative machine precision (eps) is taken to be 0.596046E-07
Routines pass computational tests if scaled residual is less than 3.0000
TIME TR N BWL BWU NB NRHS P Q L*U Time Slv Time MFLOPS MFLOP2 CHECK
---- -- ------ --- --- ---- ----- ---- ---- -------- -------- -------- -------- ------
WALL N 3 1 1 3 4 1 1 0.000 0.0003 0.45 0.43 PASSED
WALL N 5 1 1 5 4 1 1 0.000 0.0003 0.82 0.77 PASSED
WALL N 17 1 1 17 4 1 1 0.000 0.0003 2.77 2.63 PASSED
WALL N 3 1 1 2 4 1 2 0.000 0.0047 0.05 0.07 PASSED
WALL N 5 1 1 3 4 1 2 0.000 0.0004 0.56 0.84 PASSED
WALL N 17 1 1 9 4 1 2 0.000 0.0004 1.96 2.97 PASSED
WALL N 3 1 1 2 4 1 3 0.000 0.0005 0.24 0.35 PASSED
WALL N 5 1 1 2 4 1 3 0.000 0.0038 0.09 0.16 PASSED
WALL N 17 1 1 6 4 1 3 0.001 0.0009 0.82 1.25 PASSED
WALL N 3 1 1 2 4 1 4 0.001 0.0011 0.13 0.19 PASSED
WALL N 5 1 1 2 4 1 4 0.001 0.0011 0.19 0.33 PASSED
WALL N 17 1 1 5 4 1 4 0.001 0.0041 0.27 0.42 PASSED
Finished 12 tests, with the following results:
12 tests completed and passed residual checks.
0 tests completed and failed residual checks.
0 tests skipped because of illegal input values.
END OF TESTS.
The text in the table above reflects the organization of actual output that you will see.
Please recall from Intel MPI Library and Intel Trace Analyzer and Collector discussions that the above results are dependent on factors such as the processor type, the memory configuration, competing processes, and the type of interconnection network between the nodes of the cluster. Therefore, the results will vary from one cluster configuration to another.
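The PASSED column in the output above is driven by the scaled residual checks listed near the top of the report. The following is a minimal sketch of the solve-residual formula ||Ax - b|| / (||x|| * ||A|| * eps * N) using infinity norms on a made-up 2x2 system; the test suite itself generates A randomly and runs in parallel:

```c
#include <assert.h>
#include <float.h>
#include <math.h>

#define N 2  /* made-up toy order; the sample runs above use N = 3, 5, 17 */

static double vec_norm(double v[N])
{
    /* infinity norm of a vector: largest absolute entry */
    double m = 0.0;
    for (int i = 0; i < N; i++)
        if (fabs(v[i]) > m)
            m = fabs(v[i]);
    return m;
}

static double mat_norm(double a[N][N])
{
    /* infinity norm of a matrix: largest absolute row sum */
    double m = 0.0;
    for (int i = 0; i < N; i++) {
        double row = 0.0;
        for (int j = 0; j < N; j++)
            row += fabs(a[i][j]);
        if (row > m)
            m = row;
    }
    return m;
}

/* scaled solve residual:  ||Ax - b|| / (||x|| * ||A|| * eps * N) */
double solve_residual(double a[N][N], double x[N], double b[N])
{
    double r[N];
    for (int i = 0; i < N; i++) {
        r[i] = -b[i];
        for (int j = 0; j < N; j++)
            r[i] += a[i][j] * x[j];
    }
    return vec_norm(r) / (vec_norm(x) * mat_norm(a) * DBL_EPSILON * N);
}
```

A computed solution x is flagged as PASSED when this residual falls below THRESH (3.0 in the report above).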
If you proceed to load the cdtlu_em64t_static_intelmpi_lp64.exe.txt table above into a Microsoft Excel* Spreadsheet, and build a chart to compare the Time in Seconds to Solve the System (SLV) and the Megaflop values, you might see something like the following (Figure 7.1):
Figure 7.1 – Display of ScaLAPACK DATA from the executable xcdtlu_em64t_static_intelmpi_lp64.exe
You can also link the libraries dynamically. The following command will provide for this if nmake is part of Microsoft Visual Studio 2008:
nmake msvs=2008 arch=em64t mpi=intelmpi MPIdir="%I_MPI_ROOT%\em64t" libtype=dynamic run
Otherwise, use the command:
nmake arch=em64t mpi=intelmpi MPIdir="%I_MPI_ROOT%\em64t" libtype=dynamic run
Before issuing the command above, you should clean up from the static library build. This can be done by using the following nmake command:
nmake cleanall
On Microsoft Windows CCS, in the folder path:
<directory-path-to-mkl>\examples
you will find a set of sub-directories that look something like:
The two sub-folders that will be discussed here are cdftc and cdftf. These two directories respectively contain C and Fortran programming language examples that can be built and executed for the Cluster Discrete Fourier Transform (CDFT). Within each of these folders, there is a help target built within the makefile, and therefore you can type:
nmake help
To experiment with the contents of these two folders within a DOS window, you can issue the nmake commands:
nmake libem64t mpi=intelmpi MPIRUNOPTS="-n 4 -machinefile \"z:\global machine files folder\machines.Windows\"" workdir="z:\MPI_Share_Area\cdftc_test" SPEC_OPT="/debug:all"
and
nmake libem64t mpi=intelmpi MPIRUNOPTS="-n 4 -machinefile \"z:\global machine files folder\machines.Windows\"" workdir="z:\MPI_Share_Area\cdftf_test" SPEC_OPT="/debug:all"
where the MPIRUNOPTS macro is used. In general, for the cdftc and cdftf makefiles, the MPIRUNOPTS macro passes command-line arguments to the mpiexec command. For the nmake command-line examples above, the MPIRUNOPTS macro overrides the default number of MPI processes (2 is the default), and supplies a -machinefile argument to select which nodes of the cluster the MPI processes will run on. Note that there are spaces in the subfolder name “global machine files folder”, and therefore the folder path name z:\global machine files folder\machines.Windows is preceded and followed by the escape character sequence \", so the -machinefile folder argument is:
\"z:\global machine files folder\machines.Windows\"
The first nmake command listed above should be used for the folder cdftc, and the second nmake command should be used for the folder cdftf. The nmake commands are each contiguous lines that end with SPEC_OPT="/debug:all". These commands reference the makefile target libem64t and the makefile macros mpi, MPIRUNOPTS, workdir, and SPEC_OPT. You can obtain complete information about this makefile by looking at its contents within the folders ...\cdftc and ...\cdftf. Note that on your cluster system, the test directories z:\MPI_Share_Area\cdftc_test and z:\MPI_Share_Area\cdftf_test may be substituted with folder paths that you prefer to use.
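The reason the \" escapes matter can be illustrated with a quote-aware argument counter. The sketch below is hypothetical — it is not the actual Windows command-line parser — but it shows the principle: a double-quoted span stays part of a single token, so a folder name containing spaces survives as one -machinefile argument.

```c
#include <assert.h>

/* Hypothetical sketch of quote-aware tokenizing: counts how many
   arguments a command line splits into when spaces inside "..." do
   not break a token. */
int count_args(const char *s)
{
    int n = 0, in_quotes = 0, in_token = 0;
    for (; *s != '\0'; s++) {
        if (*s == '"') {
            in_quotes = !in_quotes;
            if (!in_token) { in_token = 1; n++; }
        } else if (*s == ' ' && !in_quotes) {
            in_token = 0;        /* unquoted space ends the token */
        } else if (!in_token) {
            in_token = 1;        /* first character of a new token */
            n++;
        }
    }
    return n;
}
```

Without the surrounding quotes, the same machine-file path would split into several separate arguments and mpiexec would not receive the intended folder name.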
After executing the nmake commands above within the respective folders ...\cdftc and ...\cdftf, the workdir folders z:\MPI_Share_Area\cdftc_test and z:\MPI_Share_Area\cdftf_test should each have subfolder directories that look something like:
_results\lp64_em64t_intelmpi_lib
The executable and result contents of each of the subfolder paths _results\lp64_em64t_intelmpi_lib might respectively look something like:
dm_complex_2d_double_ex1.exe
dm_complex_2d_double_ex2.exe
dm_complex_2d_single_ex1.exe
dm_complex_2d_single_ex2.exe
and
dm_complex_2d_double_ex1.res
dm_complex_2d_double_ex2.res
dm_complex_2d_single_ex1.res
dm_complex_2d_single_ex2.res
The files with the suffix .res are the output results. A partial listing of the results file called dm_complex_2d_double_ex1.res might look something like:
Program is running on 4 processes
DM_COMPLEX_2D_DOUBLE_EX1 Forward-Backward 2D complex transform for double precision data inplace
Configuration parameters:
DFTI_FORWARD_DOMAIN = DFTI_COMPLEX DFTI_PRECISION = DFTI_DOUBLE DFTI_DIMENSION = 2 DFTI_LENGTHS (MxN) = {19,12} DFTI_FORWARD_SCALE = 1.0 DFTI_BACKWARD_SCALE = 1.0/(m*n)
INPUT Global vector X, n columns
Row 0: ( 1.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 1: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 2: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 3: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 4: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 5: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 6: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 7: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 8: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 9: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 10: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 
0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 11: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 12: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 13: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 14: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 15: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 16: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 17: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) Row 18: ( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000)( 0.000, 0.000) …
Note that the output results that you obtain will be a function of your cluster configuration. Also notice at the top of the report that 4 MPI processes were used, which is consistent with the MPIRUNOPTS="-n 4 … " macro that was referenced through the nmake command.
Recall the itcpin discussion in Section 6.3 Using itcpin to Instrument an Application, where itcpin is the instrumentation tool used to insert Intel Trace Collector calls into the executables. Using itcpin technology, the following sequence of shell commands could be used to create instrumented executables and generate result information for the executables located in _results\lp64_em64t_intelmpi_lib.
For the C language version of the Cluster Discrete Fourier Transform, the DOS Shell commands might look something like:
Intel® Processor Architecture: Intel® 64 (formerly EM64T)
Command-line Sequence for Microsoft Windows:
set VT_LOGFILE_PREFIX=cdftc_inst
rmdir /S /Q %VT_LOGFILE_PREFIX%
mkdir %VT_LOGFILE_PREFIX%
mpiexec -mapall -n 4 -env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off" itcpin --run --profile -- "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex1.exe" "C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftc\data\dm_complex_2d_double_ex1.dat" > "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex1.res"
mpiexec -mapall -n 4 -env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off" itcpin --run --profile -- "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex2.exe" "C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftc\data\dm_complex_2d_double_ex2.dat" > "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex2.res"
mpiexec -mapall -n 4 -env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off" itcpin --run --profile -- "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_single_ex1.exe" "C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftc\data\dm_complex_2d_single_ex1.dat" > "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_single_ex1.res"
mpiexec -mapall -n 4 -env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off" itcpin --run --profile -- "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_single_ex2.exe" "C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftc\data\dm_complex_2d_single_ex2.dat" > "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_single_ex2.res"
Trace Results are Located In: %CD%\cdftc_inst
Execution Results are Located In: %CD%\_results\lp64_em64t_intelmpi_lib
The “xxx” in the folder path 4.0.0.xxx needs to be replaced with the appropriate build number that is associated with the Intel® Cluster Toolkit installation on your system.
In this table, the four executables are supplemented with instrumentation calls to the Intel Trace Collector. These DOS commands could be copied from the table above and pasted into a .bat file.
The DOS environment variable %CD% might be set to something like “z:\MPI_Share_Area\cdftc_test”. The setting of %CD% will be a function of where you conduct the instrumentation experiments above on your cluster system.
From this table, an mpiexec command in conjunction with itcpin might look something like:
mpiexec -mapall -n 4 -env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off" itcpin --run --profile -- "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex1.exe" "C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftc\data\dm_complex_2d_double_ex1.dat" > "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex1.res"
Note that the DOS command-line above is a single line of text. The -mapall option for the mpiexec command will map all of the current network drives. This mapping will be removed when the MPI processes exit. This mpiexec option is used to prevent “pin” errors that look something like the following:
pin error: System error 0x3 : "Z:\MPI_Share_Area\cdftc_test\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex1.exe" : The system cannot find the path specified.
E:pin is exiting due to fatal error
As a review from the earlier section on itcpin technology, recall that the executable that is being instrumented for this DOS command is dm_complex_2d_double_ex1.exe. The environment variables that are being set for the mpiexec command are:
-env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off"
As mentioned previously, an explanation for these instrumentation environment variables can be found in the Intel Trace Collector User’s Guide under the search topic “ITC Configuration”.
In continuing the itcpin review, the itcpin component as part of the overall mpiexec command-line for a C language version of the Cluster Discrete Fourier Transform test case is:
itcpin --run --profile -- "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex1.exe" "C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftc\data\dm_complex_2d_double_ex1.dat"
The data input file for executable dm_complex_2d_double_ex1.exe is:
"C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftc\data\dm_complex_2d_double_ex1.dat"
Remember that the actual reference to the Intel Math Kernel Library folder path for the data input file on your cluster system will be dependent on where you installed the Intel® Cluster Toolkit software package and the value of xxx in the subfolder name 4.0.0.xxx.
In general, recall that the itcpin command-line component has the syntax:
itcpin [<ITC options>] -- <application command-line>
where -- is a delimiter between Intel Trace Collector (ITC) options above and the application command-line. The Intel Trace Collector options for the actual itcpin example invocation are:
--run --profile
The switch called --run instructs itcpin to run the application executable. The --profile option enables the instrumentation; the default instrumentation library is VT.lib, which is for the Intel Trace Collector. Also, remember that you can find additional information about itcpin in the Intel Trace Collector User’s Guide under the search topic itcpin.
With regard to the test area referenced by the folder path z:\MPI_Share_Area\cdftf_test, the Fortran language version of the Cluster Discrete Fourier Transform could be instrumented with itcpin as follows:
Intel® Processor Architecture: Intel® 64 (formerly EM64T)
Command-line Sequence for Microsoft Windows:
set VT_LOGFILE_PREFIX=cdftf_inst
rmdir /S /Q %VT_LOGFILE_PREFIX%
mkdir %VT_LOGFILE_PREFIX%
mpiexec -mapall -n 4 -env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off" itcpin --run --profile -- "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex1.exe" < "C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftf\data\dm_complex_2d_double_ex1.dat" > "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex1.res"
mpiexec -mapall -n 4 -env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off" itcpin --run --profile -- "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex2.exe" < "C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftf\data\dm_complex_2d_double_ex2.dat" > "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_double_ex2.res"
mpiexec -mapall -n 4 -env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off" itcpin --run --profile -- "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_single_ex1.exe" < "C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftf\data\dm_complex_2d_single_ex1.dat" > "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_single_ex1.res"
mpiexec -mapall -n 4 -env VT_DLL_DIR "%VT_DLL_DIR%" -env VT_MPI_DLL "%VT_MPI_DLL%" -env VT_LOGFILE_FORMAT STF -env VT_PCTRACE 5 -env VT_LOGFILE_PREFIX "%VT_LOGFILE_PREFIX%" -env VT_PROCESS "0:N ON" -env VT_STATE "*.dll*:* off" itcpin --run --profile -- "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_single_ex2.exe" < "C:\Program Files (x86)\Intel\ICTCE\4.0.0.xxx\mkl\examples\cdftf\data\dm_complex_2d_single_ex2.dat" > "%CD%\_results\lp64_em64t_intelmpi_lib\dm_complex_2d_single_ex2.res"
Trace Results are Located In: %CD%\cdftf_inst
Execution Results are Located In: %CD%\_results\lp64_em64t_intelmpi_lib
Notice that the input data file for each of the Fortran executables is preceded by the “less than” symbol “<”. This is because the Fortran executables read the input data from standard input. As was mentioned previously for the C version of the Cluster Discrete Fourier Transform examples, the “xxx” in the folder path 4.0.0.xxx needs to be replaced with the appropriate build number that is associated with the Intel® Cluster Toolkit installation on your system.
The DOS commands above could be copied and pasted into a .bat file within a test area such as z:\MPI_Share_Area\cdftf_test. For the test areas z:\MPI_Share_Area\cdftc_test and z:\MPI_Share_Area\cdftf_test, the tracing data for the executables is deposited into the folders cdftc_inst and cdftf_inst, respectively. Note that in the two tables above, the setting of the environment variable VT_LOGFILE_PREFIX resulted in the deposit of trace information into the directories cdftc_inst and cdftf_inst, as demonstrated with a listing of the Structured Trace Format (STF) index files:
cdftc_inst\dm_complex_2d_double_ex1.stf
cdftc_inst\dm_complex_2d_double_ex2.stf
cdftc_inst\dm_complex_2d_single_ex1.stf
cdftc_inst\dm_complex_2d_single_ex2.stf
and
cdftf_inst\dm_complex_2d_double_ex1.stf
cdftf_inst\dm_complex_2d_double_ex2.stf
cdftf_inst\dm_complex_2d_single_ex1.stf
cdftf_inst\dm_complex_2d_single_ex2.stf
You can issue the following Intel Trace Analyzer shell command to initiate performance analysis on cdftc_inst\dm_complex_2d_double_ex1.stf:
traceanalyzer .\cdftc_inst\dm_complex_2d_double_ex1.stf
Figure 7.2 shows the result of simultaneously displaying the Function Profile Chart and the Event Timeline Chart.
Figure 7.2 – The Event Timeline Chart and the Function Profile Chart for a Cluster Discrete Fourier Transform Example
On Microsoft Windows CCS, in the directory path:
<directory-path-to-mkl>\benchmarks\mp_linpack
you will find a set of files and subdirectories that look something like the following:
If you make a scratch directory, say:
test_mp_linpack
on a file share for your cluster, and copy the contents of <directory-path-to-mkl>\benchmarks\mp_linpack into that scratch directory, you can then proceed to build a High Performance Linpack executable. To create an executable for the Intel® 64 architecture, you might issue the following nmake command:
nmake arch=em64t HOME="%CD%" LAdir="c:\Program Files (x86)\Intel\ICTCE\4.0.0.003\mkl" LAinc="c:\Program Files (x86)\Intel\ICTCE\4.0.0.003\mkl\include" MPIdir="%I_MPI_ROOT%" install
where the command sequence above is one continuous line. The macro variable HOME references the work folder where the nmake command was invoked. In this situation the working directory is:
…\test_mp_linpack
The macros LAdir and LAinc describe the folder path to the Intel® 64 Math Kernel library and the Intel® MKL include folder, respectively. The partial directory path c:\Program Files (x86)\Intel\ICTCE\4.0.0.003 for the macros LAdir and LAinc should be considered an example of where an Intel® Math Kernel Library might reside. Note that on your system, the path and a version number value such as 4.0.0.003 may vary depending on your software release.
The High Performance Linpack executable for the nmake command above will be placed into …\test_mp_linpack\bin\em64t and will be called xhpl. The table below summarizes the makefile and associated mpiexec commands that might be used to create and run the xhpl executable for the Intel® 64 architecture. The mpiexec command uses 4 MPI processes to do the domain decomposition.
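The domain decomposition is controlled by the P and Q lines of the HPL.dat input file, which define process grids of P rows by Q columns; a grid cannot use more than the nprocs processes that mpiexec launches (processes beyond the grid sit idle). The helper below is a hypothetical illustration of that constraint, enumerating the grids that use all nprocs processes:

```c
#include <assert.h>

/* Hypothetical helper: count the P x Q process grids with
   P * Q == nprocs, i.e. one grid per divisor of nprocs. */
int count_grids(int nprocs)
{
    int count = 0;
    for (int p = 1; p <= nprocs; p++)
        if (nprocs % p == 0)
            count++;             /* the grid is p x (nprocs / p) */
    return count;
}
```

For the 4-process run shown here, the grids that use all four processes are 1x4, 2x2, and 4x1; the sample output below also includes a 1x1 run that leaves three processes idle.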
Intel® Processor Architecture: Intel® 64 (formerly Intel EM64T)
Command-line Sequence for Microsoft Windows:
nmake arch=em64t HOME="%CD%" LAdir="c:\Program Files (x86)\Intel\ICTCE\4.0.0.003\mkl" LAinc="c:\Program Files (x86)\Intel\ICTCE\4.0.0.003\mkl\include" MPIdir="%I_MPI_ROOT%" install cd %CD%\bin\em64t mpiexec -mapall -n 4 .\xhpl.exe > results.em64t.out |
%CD%\bin\em64t |
%CD%\bin\em64t |
The output results might look something like the following for Intel® 64 architecture:
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 1000
NB     : 112      120
PMAP   : Row-major process mapping
P      : 1        2        1        4
Q      : 1        2        4        1
PFACT  : Left
NBMIN  : 4        2
NDIV   : 2
RFACT  : Crout
BCAST  : 1ring
DEPTH  : 0
SWAP   : Mix (threshold = 256)
L1     : no-transposed form
U      : no-transposed form
EQUIL  : no
ALIGN  : 8 double precision words
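The "Finished 16 tests" line in the summary further down follows from the cross product of the multi-valued parameters above: 2 NB values × 4 (P,Q) process grids × 2 NBMIN values, with the single-valued parameters contributing a factor of one each. A quick sketch, using only values shown in this output:

```python
from itertools import product

# Multi-valued parameters from the HPL output above; single-valued
# parameters (PFACT, NDIV, RFACT, BCAST, DEPTH, SWAP) each contribute
# a factor of 1 to the test count.
NB = [112, 120]
grids = [(1, 1), (2, 2), (1, 4), (4, 1)]  # (P, Q) pairs
NBMIN = [4, 2]

tests = list(product(NB, grids, NBMIN))
print(len(tests))  # 16, matching "Finished 16 tests" in the summary
```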
--------------------------------------------------------------------------------
…
================================================================================
T/V          N    NB     P     Q         Time          Gflops
--------------------------------------------------------------------------------
WR00C2L2  1000   120     4     1         0.07      9.505e+000
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0033543 ...... PASSED
============================================================================
Finished 16 tests with the following results:
   16 tests completed and passed residual checks,
    0 tests completed and failed residual checks,
    0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================
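The Gflops figure HPL reports is based on the standard LU factorization operation count, (2/3)N³ + 2N², divided by the wall time. A sketch checking the tabulated row; note the printed time (0.07) is rounded, so expect only rough agreement with the reported 9.505:

```python
# Values from the WR00C2L2 row of the results table above.
N = 1000
time_s = 0.07  # rounded in the printed output; HPL uses the unrounded time

flops = (2.0 / 3.0) * N**3 + 2.0 * N**2  # standard HPL operation count
gflops = flops / time_s / 1e9
print(round(gflops, 2))  # ~9.55, close to the reported 9.505e+000
```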
The file <directory-path-to-mkl>\doc\mkl_documentation.htm contains a landing page linking various documentation files associated with Intel MKL 10.2. To make inquiries about Intel Math Kernel Library 10.2, visit the URL: http://premier.intel.com.
The Intel® MPI Benchmarks folder contains the following subfolders:
- .\doc (ReadMe_IMB.txt; IMB_Users_Guide.pdf, the methodology description)
- .\src (program source code and Makefiles)
- .\license (Source license agreement, trademark and use license agreement)
- .\versions_news (version history and news)
- .\WINDOWS (Microsoft* Visual Studio* projects)
The WINDOWS folder noted above contains Microsoft* Visual Studio* 2005 and 2008 project folders that allow you to use a pre-existing ".vcproj" project file with Microsoft* Visual Studio* to build and run the associated Intel® MPI Benchmark application.
Within Microsoft Windows Explorer, starting at the WINDOWS folder, you can go to one of the subfolders IMB-EXT_VS_2005, IMB-EXT_VS_2008, IMB-IO_VS_2005, IMB-IO_VS_2008, IMB-MPI1_VS_2005, or IMB-MPI1_VS_2008 and double-click the corresponding ".vcproj" file to open it in Microsoft Visual Studio* (Figure 8.1).
The three executables that will be created from the respective Visual Studio 2005 or Visual Studio 2008 projects are:
IMB-EXT.exe
IMB-IO.exe
IMB-MPI1.exe
In Figure 8.1 Microsoft Visual Studio* 2008 is being used.
Figure 8.1 – Illustration for Starting Microsoft Visual Studio* 2008 on the project file IMB-EXT.vcproj
From the Visual Studio Project panel:
1) Change the "Solution Platforms" dialog box to "x64". See Figure 8.2.
2) Change the "Solution Configurations" dialog box to "Release". See Figure 8.2.
Figure 8.2 – The Solution Configuration is set to “Release” and Solution Platforms is set to “x64”. Also note that IMB-EXT is highlighted in the Solution Explorer panel on the left in preparation for the context sensitive operations outlined in step 3
3) Follow the menu path Project->Properties or Alt+F7 and check to make sure that the following are set by expanding "Configuration Properties":
a) General->Project Defaults - Change "Character Set" to "Use Multi-Byte Character Set"
b) Debugging
i) Set the "Debugger to launch" value to "Local Windows Debugger", for example. Note that "Local Windows Debugger" is one possible setting. See Figure 8.3.
ii) For the row "Command" add "$(I_MPI_ROOT)\em64t\bin\mpiexec.exe". Be sure to include the quotes.
iii) For the row "Command Arguments" add "-n 2 $(TargetPath)"
c) C/C++->General
i) For the row "Additional Include Directories", add "$(I_MPI_ROOT)\em64t\include".
ii) For the row "Warning Level", set the warning level to "Level 1 (/W1)"
d) C/C++->Preprocessor
i) For the row “Preprocessor definitions” within the Visual Studio projects IMB-EXT_VS_2005, and IMB-EXT_VS_2008, add the conditional compilation macro references to WIN_IMB, _CRT_SECURE_NO_DEPRECATE, EXT
ii) For the row “Preprocessor definitions” within the Visual Studio projects IMB-IO_VS_2005, and IMB-IO_VS_2008, add the conditional compilation macro references to WIN_IMB, _CRT_SECURE_NO_DEPRECATE, MPIIO
iii) For the row “Preprocessor definitions” within the Visual Studio projects, IMB-MPI1_VS_2005, or IMB-MPI1_VS_2008, add the conditional compilation macro references to WIN_IMB, _CRT_SECURE_NO_DEPRECATE, MPI1
e) Linker->Input
i) For the row "Additional Dependencies" add "$(I_MPI_ROOT)\em64t\lib\impi.lib". Be sure to include the quotes.
If items "a" through "e" are already set, then proceed to step 4.
Figure 8.3 – Setting the Command and Command Arguments for Debugging under Configuration Properties
4) Use F7 or Build->Build Solution to create an executable. See Figure 8.4.
Figure 8.4 – Microsoft* Visual Studio* 2008 Illustration for building a solution for IMB-EXT
5) Use Debug->Start Without Debugging or Ctrl+F5 to run the executable. See Figure 8.5.
Figure 8.5 – Generation of the command-line panel using the keys Ctrl+F5. The command-line panel shows the execution results for IMB-EXT.exe within Microsoft Visual Studio* 2008
The steps outlined above can be applied to the 2005 and/or 2008 Microsoft* Visual Studio* project folders in building executables for IMB-MPI1.exe and IMB-IO.exe.
Before opening a Microsoft* Visual Studio* project folder, you may want to check the environment variable settings for Include, Lib, and Path. To do this for Microsoft* Windows* HPC Server 2008 OS, start at the Start menu and select Start->Control Panel->System and Maintenance->System->Change settings and a System Properties panel will appear where you should click on the Advanced tab (Figure 8.6).
Figure 8.6 – The System Properties Panel
Regarding Figure 8.6, click on the Environment Variables… button and see if the System Environment Variables Include, Lib, and Path need editing for the display panel shown in Figure 8.7.
Figure 8.7 – Editing the System Environment Variables Include, Lib, and Path
For example, if the Include, Lib, and Path environment variables have respectively the settings:
C:\Program Files (x86)\Intel\ICTCE\4.0.0.013\mpi\em64t\include
C:\Program Files (x86)\Intel\ICTCE\4.0.0.013\mpi\em64t\lib
C:\Program Files (x86)\Intel\ICTCE\4.0.0.013\mpi\em64t\bin
at the beginning of the Variable value dialog panel as shown in Figure 8.7, then these paths should be changed from the subfolder reference em64t to ia32:
C:\Program Files (x86)\Intel\ICTCE\4.0.0.013\mpi\ia32\include
C:\Program Files (x86)\Intel\ICTCE\4.0.0.013\mpi\ia32\lib
C:\Program Files (x86)\Intel\ICTCE\4.0.0.013\mpi\ia32\bin
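If you prefer to script this edit, the change amounts to a substring swap on each path. A minimal sketch, using the example paths from above:

```python
# Example em64t paths from the Include, Lib, and Path variables above.
paths = [
    r"C:\Program Files (x86)\Intel\ICTCE\4.0.0.013\mpi\em64t\include",
    r"C:\Program Files (x86)\Intel\ICTCE\4.0.0.013\mpi\em64t\lib",
    r"C:\Program Files (x86)\Intel\ICTCE\4.0.0.013\mpi\em64t\bin",
]

# Swap the architecture subfolder from em64t to ia32.
ia32_paths = [p.replace("\\em64t\\", "\\ia32\\") for p in paths]
for p in ia32_paths:
    print(p)
```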
After making the appropriate changes, press the OK button on the Environment Variables panel shown in Figure 8.7.
After checking out the settings of these environment variables and saving any necessary changes, one can proceed to open the relevant Visual Studio 2005 or Visual Studio 2008 projects under the WINDOWS subfolder for the Intel® MPI Benchmarks.
From the Microsoft Visual Studio* Project panel for the Visual Studio* 2008 project IMB-MPI1:
1) Change the "Solution Platforms" dialog box to "ia32". See Figure 8.8.
2) Change the "Solution Configurations" dialog box to "Release". See Figure 8.8.
Figure 8.8 – The Solution Configuration is set to “Release” and Solution Platforms is set to “ia32”. Also note that IMB-MPI1 is highlighted in the Solution Explorer panel on the left in preparation for the context sensitive operations outlined in step 3
3) Follow the menu path Project->Properties or Alt+F7 and check to make sure that the following are set by expanding "Configuration Properties":
a) General->Project Defaults - Change "Character Set" to "Use Multi-Byte Character Set"
b) Debugging
i) Set the "Debugger to launch" value to "Local Windows Debugger", for example. Note that "Local Windows Debugger" is one possible setting. See Figure 8.9.
ii) For the row "Command" add "$(I_MPI_ROOT)\ia32\bin\mpiexec.exe". Be sure to include the quotes.
iii) For the row "Command Arguments" add "-n 2 $(TargetPath)"
c) C/C++->General
i) For the row "Additional Include Directories", add "$(I_MPI_ROOT)\ia32\include".
ii) For the row "Warning Level", set the warning level to "Level 1 (/W1)"
d) C/C++->Preprocessor
i) For the row “Preprocessor definitions” within the Visual Studio projects IMB-EXT_VS_2005, and IMB-EXT_VS_2008, add the conditional compilation macro references to WIN_IMB, _CRT_SECURE_NO_DEPRECATE, EXT
ii) For the row “Preprocessor definitions” within the Visual Studio projects IMB-IO_VS_2005, and IMB-IO_VS_2008, add the conditional compilation macro references to WIN_IMB, _CRT_SECURE_NO_DEPRECATE, MPIIO
iii) For the row “Preprocessor definitions” within the Visual Studio projects, IMB-MPI1_VS_2005, or IMB-MPI1_VS_2008, add the conditional compilation macro references to WIN_IMB, _CRT_SECURE_NO_DEPRECATE, MPI1
e) Linker->Input
i) For the row "Additional Dependencies" add "$(I_MPI_ROOT)\ia32\lib\impi.lib". Be sure to include the quotes.
If items "a" through "e" are already set, then proceed to step 4.
Figure 8.9 – Setting the Command and Command Arguments for Debugging under Configuration Properties
4) Use F7 or Build->Build Solution to create an executable. See Figure 8.10.
Figure 8.10 – Microsoft* Visual Studio* 2008 Illustration for building a solution for IMB-MPI1
5) Use Debug->Start Without Debugging or Ctrl+F5 to run the executable. See Figure 8.11.
Figure 8.11 – Generation of the command-line panel using the keys Ctrl+F5. The command-line panel shows the execution results for IMB-MPI1.exe within Microsoft Visual Studio* 2008
The steps outlined above can be applied to the 2005 and/or 2008 Microsoft* Visual Studio* project folders in building executables for IMB-EXT.exe and IMB-IO.exe.
The Intel® C++ and Intel® Fortran Compilers on Microsoft Windows have the command-line switch called /Qtcollect which allows functions and procedures to be instrumented during compilation with Intel® Trace Collector calls. This compiler command-line switch accepts an optional argument to specify the Intel® Trace Collector library to link with.
Library Selection    Meaning                                    How to Request
VT.lib               Default library                            /Qtcollect
VTcs.lib             Client-server trace collection library     /Qtcollect=VTcs
VTfs.lib             Fail-safe trace collection library         /Qtcollect=VTfs
Recall that in the test_intel_mpi folder for the Intel MPI Library, there are four source files called:
test.c test.cpp test.f test.f90
To build executables with the /Qtcollect compiler option for the Intel® Compilers, one might use the following compilation and link commands:
icl /Fetestc_Qtcollect /Qtcollect /I"%I_MPI_ROOT%"\em64t\include test.c /link /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
icl /Fetestcpp_Qtcollect /Qtcollect /I"%I_MPI_ROOT%"\em64t\include test.cpp /link /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib impicxx.lib /NODEFAULTLIB:LIBCMTD.lib
ifort /Fetestf_Qtcollect /Qtcollect /I"%I_MPI_ROOT%"\em64t\include test.f /link /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
ifort /Fetestf90_Qtcollect /Qtcollect /I"%I_MPI_ROOT%"\em64t\include test.f90 /link /LIBPATH:"%I_MPI_ROOT%\em64t\lib" impi.lib /NODEFAULTLIB:LIBCMTD.lib
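The four command lines above differ only in the compiler driver, the source file, and (for C++) one extra import library. A sketch that regenerates them, purely to highlight the pattern; the mapping table mirrors the commands shown above:

```python
# Each test source maps to its compiler driver and the MPI import
# libraries it links; only the C++ case adds impicxx.lib.
cases = [
    ("test.c",   "icl",   ["impi.lib"]),
    ("test.cpp", "icl",   ["impi.lib", "impicxx.lib"]),
    ("test.f",   "ifort", ["impi.lib"]),
    ("test.f90", "ifort", ["impi.lib"]),
]

for src, cc, libs in cases:
    # e.g. "test.cpp" -> executable name "testcpp_Qtcollect"
    exe = "test" + src.split(".")[1] + "_Qtcollect"
    cmd = (f'{cc} /Fe{exe} /Qtcollect /I"%I_MPI_ROOT%"\\em64t\\include {src} '
           f'/link /LIBPATH:"%I_MPI_ROOT%\\em64t\\lib" {" ".join(libs)} '
           f'/NODEFAULTLIB:LIBCMTD.lib')
    print(cmd)
```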
The names of the MPI executables for the above command-lines should be:
testc_Qtcollect.exe
testcpp_Qtcollect.exe
testf_Qtcollect.exe
testf90_Qtcollect.exe
To make a comparison with the Intel Trace Collector STF files:
testc.stf testcpp.stf testf.stf testf90.stf
within the directory test_inst, we will use the following mpiexec commands:
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testc_Qtcollect
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testcpp_Qtcollect
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testf_Qtcollect
mpiexec -n 4 -env VT_LOGFILE_PREFIX test_inst testf90_Qtcollect
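With VT_LOGFILE_PREFIX set to test_inst, each run writes its trace under that folder, named after the executable. A sketch of the filenames to expect, assuming the default naming of one STF file per executable:

```python
# Executables run with -env VT_LOGFILE_PREFIX test_inst above.
executables = ["testc_Qtcollect", "testcpp_Qtcollect",
               "testf_Qtcollect", "testf90_Qtcollect"]

# Expected trace files (Windows-style paths), one per executable.
for exe in executables:
    print("test_inst\\" + exe + ".stf")
```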
The corresponding STF data will be placed into the folder test_inst. To do a comparison between the STF data in testcpp.stf and testcpp_Qtcollect.stf the following traceanalyzer command can be launched from a DOS command-line panel within the folder test_intel_mpi:
traceanalyzer
Figure 9.1 shows the base panel for the Intel Trace Analyzer as a result of invoking the command above from a DOS window.
Figure 9.1 – Base panel for the Intel Trace Analyzer when invoking the DOS Command: traceanalyzer without any arguments
If you select the menu path File->Open and click on the test_inst folder, the following panel will appear (Figure 9.2):
Figure 9.2 – Open a Tracefile Rendering for the test_inst Folder where testcpp.stf has been Highlighted
Selecting testcpp.stf will generate a Flat Profile panel within the Intel Trace Analyzer session that might look something like the following (Figure 9.3).
Figure 9.3 – Flat Panel Display for test_inst\testcpp.stf
For the Flat Panel Display, if you select File->Compare the following sub-panel will appear.
Figure 9.4 – Sub-panel Display for Adding a Comparison STF File
Click the “Open another file” button, select testcpp_Qtcollect.stf, and then click the Open button.
Figure 9.5 – Sub-panel Activating the Second STF File for Comparison
Click the OK button in Figure 9.5 and the comparison display in Figure 9.6 will appear. In Figure 9.6, notice that the timeline display for testcpp_Qtcollect.stf (i.e., the second timeline) is longer than the top timeline display (testcpp.stf).
Figure 9.6 – Comparison of testcpp.stf and testcpp_Qtcollect.stf
At the bottom and towards the right of this panel there are two labels with the same name, namely, Major Function Groups. Click on the top label with this name, and a sub-panel will appear with the following information:
Figure 9.7 – “Function Group Editor for file A” Display (i.e., for file testcpp.stf)
Highlight the “All Functions” tree entry and press the Apply button in the lower right corner of this panel. Then press the OK button. Repeat this process for the second Major Function Groups label at the bottom of the main Trace Analyzer panel. You should now see a panel rendering that looks something like:
Figure 9.8 – Comparison of STF Files testcpp.stf and testcpp_Qtcollect.stf after making the All Functions Selection
At the top of the display panel, if you make the menu selection Charts->Function Profile you will be able to see a function profile comparison (lower middle and lower right) for the two executables:
Figure 9.9 – Function Profile Sub-panels in the Lower Middle and Lower Right Sections of the Display for testcpp.stf and testcpp_Qtcollect.stf
Notice that the lower right panel (testcpp_Qtcollect.stf) has much more function profiling information than the lower middle panel (testcpp.stf). This is the result of using the /Qtcollect switch during the compilation process. You can proceed to do similar analysis with:
1) testc.stf and testc_Qtcollect.stf
2) testf.stf and testf_Qtcollect.stf
3) testf90.stf and testf90_Qtcollect.stf
Cluster OpenMP is available only on Linux* platforms running the Intel® 64 architecture.