Title: Introduction to IBM's profiling tools "HPMlib" and "PEBenchmarker"
1 Introduction to IBM's profiling tools "HPMlib" and "PEBenchmarker"
Richard Gerber NERSC User Services
ragerber_at_nersc.gov
2 Outline
- HPMlib and hpmviz
- PE Benchmarker
  - Performance Collection Tool
  - Profile Visualization Tool
  - Unified Trace Environment utilities
3 The HPM Library
- The Hardware Performance Monitor (HPM) Library provides a set of functions to collect POWER3 hardware counter data
- Calls are inserted into source code
- API is simple
- Many different counters can be started and stopped at arbitrary positions in your code
4 Using HPMLIB
- HPM library can be used to instrument code sections
- Embed calls into source code
  - Fortran, C, C++
- Access through the hpmtoolkit module
  - module load hpmtoolkit
  - compile with $HPMTOOLKIT env variable
  - xlf -qarch=pwr3 -O2 source.F $HPMTOOLKIT
- Execute program normally
- Output written to files, separate ones for each task
5 HPMlib Functions
- Include files
  - Fortran: f_hpmlib.h
  - C: libhpm.h
- Initialize library
  - Fortran: f_hpminit(taskID, progName)
  - C: hpmInit(taskID, progName)
- Start Counter
  - Fortran: f_hpmstart(id, label)
  - C: hpmStart(id, label)
6 HPMlib Functions II
- Stop Counter
  - Fortran: f_hpmstop(id)
  - C: hpmStop(id)
- Finalize library when finished
  - Fortran: f_hpmterminate(taskID)
  - C: hpmTerminate(taskID)
- You can have multiple, overlapping counter stops/starts in your code
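The call sequence listed on the two slides above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the toolkit's own example. The macro USE_HPM, the function names multiply_instrumented and run_sample, the section labels, and the file name "ma.c" are all hypothetical. When USE_HPM is not defined, the four hpm* calls are replaced by stand-in stubs that only track which sections are open, so the file builds on a machine without the HPM Toolkit; the stub signatures are assumptions modeled on the argument lists shown on the slides.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_HPM
#include "libhpm.h"   /* real HPM Toolkit header named on the slide */
#else
/* Hypothetical stand-ins so the sketch builds without the toolkit.
   They only track open sections; they collect no hardware counters. */
#define MAX_SECTIONS 16
static int open_sections[MAX_SECTIONS];

static void hpmInit(int taskID, const char *progName)
{
    (void)taskID;
    (void)progName;
}

static void hpmStart(int id, const char *label)
{
    (void)label;
    assert(id >= 0 && id < MAX_SECTIONS);
    open_sections[id] = 1;
}

static void hpmStop(int id)
{
    /* Stopping a section that was never started is a usage error. */
    assert(id >= 0 && id < MAX_SECTIONS && open_sections[id]);
    open_sections[id] = 0;
}

static void hpmTerminate(int taskID)
{
    (void)taskID;
}
#endif

/* Computes z = x * y for n-by-n row-major matrices.
   Section 1 covers zeroing plus the multiply; section 2 covers only
   the loop nest, so the two instrumented sections overlap. */
void multiply_instrumented(size_t n, const double *x, const double *y, double *z)
{
    assert(n > 0 && x && y && z);

    hpmStart(1, "zero and multiply");
    for (size_t i = 0; i < n * n; i++)
        z[i] = 0.0;

    hpmStart(2, "matrix-matrix multiply");
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                z[i * n + j] += x[i * n + k] * y[k * n + j];
    hpmStop(2);
    hpmStop(1);
}

/* Initialize the library, run the instrumented routine once on a
   2-by-2 case, finalize, and return z[0] so the arithmetic can be checked. */
double run_sample(void)
{
    const double x[4] = {1.0, 2.0, 3.0, 4.0};
    const double y[4] = {5.0, 6.0, 7.0, 8.0};
    double z[4];

    hpmInit(0, "ma.c");
    multiply_instrumented(2, x, y, z);
    hpmTerminate(0);
    return z[0];
}
```

Built as shown, the stubs are used. On a system with the toolkit loaded, defining USE_HPM and adding whatever the $HPMTOOLKIT variable supplies to the compile line would presumably switch in the real library calls.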
7 HPMlib Sample Code
      ! Declarations...
      Z = 0.0
      CALL RANDOM_NUMBER(X)
      CALL RANDOM_NUMBER(Y)
!
! Initialize HPM Performance Library and Start Counter
!
      CALL f_hpminit(0, "ma.F")
      CALL f_hpmstart(1, "matrix-matrix multiply")
      DO J = 1, N
        DO K = 1, N
          DO I = 1, N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO
!
! Stop Counter and finalize library (calls from the HPMlib Functions II slide)
!
      CALL f_hpmstop(1)
      CALL f_hpmterminate(0)
8 HPMlib Example Output
- module load hpmtoolkit
- xlf90 -o xma_hpmlib -O2 -qarch=pwr3 ma.F $HPMTOOLKIT
- ./xma_hpmlib
- libHPM output in perfhpm0000.67880

libhpm (Version 2.4.2) summary - running on POWER3-II
Total execution time of instrumented code (wall time): 4.185484 seconds
. . .
Instrumented section: 1 - Label: matrix-matrix multiply - process: 0
Wall Clock Time: 4.18512 seconds
Total time in user mode: 4.16946747484786 seconds
. . .
PM_FPU0_CMPL (FPU 0 instructions)         : 505166645
PM_FPU1_CMPL (FPU 1 instructions)         : 6834038
PM_EXEC_FMA (FMAs executed)               : 512000683
. . .
MIPS                                      : 610.707
Instructions per cycle                    : 1.637
HW Float points instructions per Cycle    : 0.327
Floating point instructions + FMAs        : 1024.001 M
Float point instructions + FMA rate       : 243.856 Mflip/s
FMA percentage                            : 100.000
Computation intensity                     : 0.666
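The derived values at the bottom of this output can be cross-checked against the raw counters. The sketch below is an illustration under stated assumptions, not libhpm's own code: the formulas (the sum of the three counters for "Floating point instructions + FMAs", and twice the FMA count over that sum for "FMA percentage") are inferred from the numbers on this slide, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Raw counter values for one instrumented section. */
typedef struct {
    uint64_t fpu0_cmpl;  /* PM_FPU0_CMPL */
    uint64_t fpu1_cmpl;  /* PM_FPU1_CMPL */
    uint64_t exec_fma;   /* PM_EXEC_FMA  */
} fp_counters;

/* Assumed formula: every completed FPU instruction counts once,
   and every FMA counts one extra time. */
uint64_t total_flips(const fp_counters *c)
{
    return c->fpu0_cmpl + c->fpu1_cmpl + c->exec_fma;
}

/* Assumed formula: share of floating point operations that come
   from FMAs, each FMA contributing two operations. */
double fma_percentage(const fp_counters *c)
{
    uint64_t flips = total_flips(c);
    assert(flips > 0);
    return 100.0 * 2.0 * (double)c->exec_fma / (double)flips;
}

/* Rate in millions of floating point operations per second
   over a caller-supplied time interval. */
double mflips_per_second(const fp_counters *c, double seconds)
{
    assert(seconds > 0.0);
    return (double)total_flips(c) / seconds / 1.0e6;
}
```

With the counter values from the slide, total_flips gives 1024001366 (the reported 1024.001 M) and fma_percentage gives 100. Dividing by the 4.18512 s wall clock time gives roughly 244.7 Mflip/s rather than the reported 243.856, so libhpm presumably uses a different time base for the rate.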
9 The hpmviz tool
- The hpmviz tool has a GUI to help browse HPMlib output
- Part of the hpmtoolkit module
- After running a code with HPMlib calls, a .viz file is also produced for each task.
- Usage
  - hpmviz filename1.viz filename2.viz
10 hpmviz Screen Shot 1
11 hpmviz Screen Shot 2
Right clicking on the Label line in the previous slide brings up a detail window.
12 Parallel hpmviz
- For parallel codes, right-clicking shows each task
13 PE Benchmarker
- PE Benchmarker is a suite of IBM performance analysis applications and utilities
- Performance Collection Tool (pct)
  - Collect hardware counter and system info, or
  - Collect MPI trace info and user events
- Profile Visualization Tool (pvt)
  - Visualize hardware counter, system info data
- Unified Trace Environment utilities
  - MPI summary info
  - Convert to format for ANL's Jumpshot utility for visualizing MPI events
14 Performance Collection Tool
- A tool to collect either
  - Hardware counter and OS system info, or
  - MPI and user event data
- Built on Dynamic Probe Class Library (DPCL)
  - Allows insertion and deletion of instrumentation probes while code is running
  - No code modification needed
- GUI and command line interface
- Profiles at program, file, subroutine levels
15 Preparing to Use PCT
- Compile with thread-safe compiler, e.g. mpxlf90_r
- Set MPE_UTE environment variable
  - setenv MPE_UTE yes (csh)
  - export MPE_UTE=yes (ksh)
- Load java module
  - module load java
16 Starting PCT
- Example program mpi_heat2D
  - mpxlf_r -O2 mpi_heat2D.f draw_heat.o -o mpi_heat2D
- Start PCT
  - pct
17 pct Options
- Use full pathnames
- POE arguments must specify nodes and procs
- Use -retry nsecs and -retrycount ntimes to ensure job startup
18 Select Type of Data
- Select either
  - MPI and user events
  - Hardware and OS profiles
- Use full pathname for Data Collection directory
19 Hardware/OS Profiles
- Select processes
- Select routines
- Select probes
- Select HPM group
20 pct Profile Data
- Start program from Application menu
- After job completes, it writes netCDF output files basename.cdf.taskno
- Files can be viewed with pvt application.
21 MPI Event Statistics
- Select processes
- Select routines
- Select MPI events
- Add user markers
22 MPI Event Data
- Start program from Application menu
- When job finishes, output is written to AIX trace files named basename.xx, one per node.
- AIX trace files can be large
- Need to convert AIX trace files to UTE format using uteconvert utility
23 Profile Visualization Tool
- Examine hardware/OS data that was collected using pct
- To run
  - module load java
  - pvt basename.cdf.
24 Examining Data with pvt
- Pick data to view from drop-down menu
- Expand source and function listings with mouse
25 pvt Reports
- Can generate many different reports
- TLB miss report shown here
- Limited to data group selection made when using pct to collect data
26 UTE Utilities
- uteconvert
  - Converts AIX trace files to UTE interval trace files
- utemerge
  - Merges multiple UTE files into a single UTE file
- utestats
  - Generates statistics tables from UTE files
- slogmerge
  - Converts and merges UTE files to SLOG files needed by Jumpshot
- module load java mpe
- Read /usr/common/usg/mpe/1.2.2/share/jumpshot-3/doc/TourStepByStep.pdf
27 Summary and Recommendations
- HPMlib provides an API to profile your code.
- PE Benchmarker suite allows in-depth dynamic profiling with no code modification.
- My personal recommendations.
  - Start with hpmcount and poe to profile entire code.
  - If you want more granularity, use HPMlib to wrap portions of code to gather performance data.
  - PE Benchmarker may be good for intermediate to expert programmers with good knowledge of the hardware and performance metrics. It may have a steep learning curve for novice programmers.
28 More Information
- NERSC Website http://hpcf.nersc.gov
- HPM Toolkit
  - http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_2_4_2.html
- IBM PE Benchmarker manuals
  - http://hpcf.nersc.gov/vendor_docs/ibm/pe/am103mst24.html
- Compilers, general NERSC SP info
  - http://hpcf.nersc.gov/computers/SP/