IBM Hardware Performance Monitor (hpm) - PowerPoint PPT Presentation
Slides: 33
Provided by: DongJ9



IBM Hardware Performance Monitor (hpm)
  • NPACI Parallel Computing Institute August, 2002

What is Performance?
  • Where is time spent and how is it spent?
  • MIPS: Millions of Instructions Per Second
  • MFLOPS: Millions of Floating-Point Operations
    Per Second
  • Run time / CPU time

What is a Performance Monitor?
  • Provides detailed processor/system data
  • Processor Monitors
  • Typically a group of registers
  • Special purpose registers keep track of
    programmable events
  • Non-intrusive counts result in accurate
    measurement of processor events
  • Typical events counted are instructions, floating-point
    instructions, cache misses, etc.
  • System Level Monitors
  • Can be hardware or software
  • Intended to measure system activity
  • Examples
  • bus monitor measures memory traffic, can analyze
    cache coherency issues in multiprocessor system
  • Network monitor measures network traffic, can
    analyze web traffic internally and externally

Hardware Counter Motivations
  • To understand execution behavior of application
  • Why not use software?
  • Strength: simple, GUI interface
  • Weakness: large overhead, intrusive, higher
    level of abstraction and simplicity
  • How about using a simulator?
  • Strength: control, low-level, accurate
  • Weakness: limit on size of code, difficult to
    implement, time-consuming to run
  • When should we directly use hardware counters?
  • Software and simulators not available or not sufficient
  • Strength: non-intrusive, instruction-level
    analysis, moderate control, very accurate, low overhead
  • Weakness: not typically reusable, tied to the OS kernel

Ptools Project
  • PMAPI Project
  • Common standard API for industry
  • Supported by IBM, SUN, SGI, COMPAQ, etc.
  • PAPI Project
  • Standard application programming interface
  • Portable, available through a module
  • Can access hardware counter info
  • HPM Toolkit
  • Easy to use
  • Doesn't affect code performance
  • Uses hardware counters
  • Designed specifically for IBM SPs and Power3

Problem Set
  • Should we collect all events all the time?
  • Not necessary and wasteful
  • What counts should be used?
  • Gather only what you need
  • Cycles
  • Committed Instructions
  • Loads
  • Stores
  • L1/L2 misses
  • L1/L2 stores
  • Committed floating-point instructions
  • Branches
  • Branch misses
  • TLB misses
  • Cache misses

POWER3 Architecture

IBM HPM Toolkit
  • Hardware Performance Monitor
  • Developed for performance measurement of
    applications running on IBM Power3 systems. It
    consists of
  • A utility (hpmcount)
  • An instrumentation library (libhpm)
  • A graphical user interface (hpmviz).
  • Requires PMAPI kernel extensions to be loaded
  • Works on IBM 630 and 604e processors
  • Based on IBM's PMAPI low-level interface

HPM Count
  • Utility for performance measurement of applications
  • Extra logic inserted in the processor to count
    specific events
  • Updated at every cycle
  • Provides a summary output at the end of the run:
  • Wall clock time
  • Resource usage statistics
  • Hardware performance counters information
  • Derived hardware metrics
  • Serial/parallel: gives performance numbers
    for each task

HPM Usage: HW Event Categories
  • EVENT SET 1: Cycles, Inst. completed, TLB misses,
    Stores completed, Loads completed, FPU0 ops, FPU1 ops,
    FMAs executed
    -> floating point performance and usage of floating point units
  • EVENT SET 2: Cycles, Inst. completed, TLB misses,
    Stores dispatched, L1 store misses, Loads dispatched,
    L1 load misses, LSU idle
    -> data locality and usage of level 1 data cache
  • EVENT SET 3: Cycles, Inst. dispatched, Inst. completed,
    Cycles w/ 0 inst. completed, I-cache misses, FXU0 ops,
    FXU1 ops, FXU2 ops
    -> performance and usage of level 1 instruction cache
  • EVENT SET 4: Cycles, Loads dispatched, L1 load misses,
    L2 load misses, Stores dispatched, L2 store misses,
    Comp. unit waiting on load, LSU idle
    -> usage of level 2 data cache and branch prediction
HPM for Whole Program using HPMCOUNT
  • Installed in /usr/local/apps/hpm
  • Environment setting:
  • setenv LIBHPM_EVENT_SET 1 (or 2, 3, 4)
  • setenv MP_LABELIO YES -> to correlate each
    line of output with the corresponding task
  • setenv MP_STDOUTMODE <taskID> (e.g. 0) -> to
    discard output from other tasks
  • Usage:
  • poe hpmcount ./a.out -nodes 1 -tasks_per_node
    1 -rmpool 1 -s <set> -e ev,ev -h
  • -h displays a help message
  • -e ev0,ev1,... list of event numbers, separated by commas
  • ev<i> corresponds to the event selected for counter i
  • -s predefined set of events
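Taken together, the settings and the hpmcount invocation above amount to a short csh session. A sketch with illustrative values (the event set, node counts, and pool come from this slide's example, not requirements):

```shell
# Example csh session for a whole-program hpmcount run (illustrative values)
setenv LIBHPM_EVENT_SET 1     # event set 1: floating point performance
setenv MP_LABELIO YES         # prefix each output line with its task ID
setenv MP_STDOUTMODE 0        # keep stdout from task 0 only

# Run the binary under hpmcount via poe
poe hpmcount ./a.out -nodes 1 -tasks_per_node 1 -rmpool 1
```

This is a config fragment for an IBM SP system; it requires poe and hpmcount to be installed.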

Derived Hardware Metrics
  • Hardware counters provide only raw counts
  • 8 counters on Power3
  • Enough info for generation of derived metrics on
    each execution
  • Derived Metrics
  • Floating point rate
  • Computational Intensity
  • Instruction per load / store
  • Load/store per data cache misses
  • Cache hit rate
  • Loads per load miss
  • Stores per store miss
  • Loads per TLB miss
  • FMA
  • Branches Mispredicted

HPMCOUNT Output (Event1) Resource Usage
  • Total execution time of instrumented code (wall
    time) 6.218496 seconds
  • Total amount of time in user mode
    5.860000 seconds
  • Total amount of time in system mode
    3.120000 seconds
  • Maximum resident set size
    23408 Kbytes
  • Average shared memory use in text segment
    97372 Kbytes*sec
  • Average unshared memory use in data segment
    13396800 Kbytes*sec
  • Number of page faults without I/O activity
  • Number of page faults with I/O activity
  • Number of times process was swapped out 0
  • Number of times file system performed INPUT
  • Number of times file system performed OUTPUT 0
  • Number of IPC messages sent
  • Number of IPC messages received
  • Number of signals delivered
  • Number of voluntary context switches
  • Number of involuntary context switches

HPMCOUNT (Event1 continued) Resource statistics
  • Instrumented section 1 - Label ALL - process
  • file swim_omp.f, lines 89 <--> 189
  • Count 1
  • Wall Clock Time
    6.216718 seconds
  • Total time in user mode
    5.35645462067771 seconds
  • Exclusive duration 0.012166 seconds
  • PM_CYC (Cycles)
  • PM_INST_CMPL (Instructions completed)
  • PM_TLB_MISS (TLB misses)
  • PM_ST_CMPL (Stores completed)
  • PM_LD_CMPL (Loads completed)
  • PM_FPU0_CMPL (FPU 0 instructions)
  • PM_FPU1_CMPL (FPU 1 instructions)
  • PM_EXEC_FMA (FMAs executed)

  • Time usually reports three metrics:
  • User Time
  • The time used by your code on the CPU, also called CPU time
  • Total time in user mode = Cycles / Processor frequency
  • System Time
  • The time used by your code running kernel code
    (doing I/O, writing to disk, or printing to the
    screen, etc.)
  • It is worthwhile to minimize the system time by
    speeding up the disk I/O, doing I/O in parallel,
    or doing I/O in the background while your CPU
    computes in the foreground
  • Wall Clock Time
  • Total execution time: the combination of times
    1 and 2 plus the time spent idle (e.g. waiting for resources)
  • In parallel performance tuning, only wall clock
    time counts
  • Interprocessor communication consumes a
    significant amount of your execution time
    (user/system time usually doesn't account for it),
    so you need to rely on wall clock time for all the
    time consumed by the job

Floating Point Measures
  • PM_FPU0_CMPL (FPU 0 instructions)
  • The POWER3 processor has two Floating Point Units
    (FPU) which operate in parallel. Each FPU can
    start a new instruction at every cycle. This
    counter shows the number of floating point
    instructions that have been executed by the first FPU.
  • PM_FPU1_CMPL (FPU 1 instructions)
  • This counter shows the number of floating point
    instructions (add, multiply, subtract, divide,
    multiply add) that have been processed by the
    second FPU.
  • PM_EXEC_FMA (FMAs executed)
  • This is the number of Floating point Multiply-Add
    (FMA) instructions. This instruction does a
    computation of the following type: x = s + a * b. So
    two floating point operations are done within one
    instruction. The compiler generates this
    instruction as often as possible to speed up the
    program. But sometimes additional manual
    optimization is necessary to replace single
    multiply instructions and corresponding add
    instructions by one FMA.

HPMCOUNT (Event1 continued)
  • Utilization rate
  • TLB misses per cycle
  • Estimated latency from TLB misses
    4.432 sec
  • Avg number of loads per TLB miss
  • Load and store operations
    946.444 M
  • Instructions per load/store
  • MIPS
  • Instructions per cycle
  • HW floating point instructions per cycle
  • Floating point instructions + FMAs
    1044.089 M
  • Floating point instructions + FMA rate
    167.949 Mflip/s
  • FMA percentage
  • Computation intensity

Total Flop Rate
  • Floating point instructions + FMA rate
  • This is the most often mentioned performance
    index, the MFlops rate.
  • The peak performance of the POWER3-II processor
    is 1500 MFlops (375 MHz clock x 2 FPUs x 2
    Flops/FMA instruction).
  • Many applications do not reach more than 10
    percent of this peak performance.
  • Average number of loads per TLB miss
  • This value is the ratio PM_LD_CMPL / PM_TLB_MISS.
    Each time after a TLB miss has been processed,
    fast access to a new page of data is possible.
    Small values for this metric indicate that the
    program has poor data locality; a redesign of
    the data structures in the program may result in
    significant performance improvements.
  • Computation intensity
  • Computational intensity is the ratio of floating
    point operations to load and store operations.

HPM for Part of Program using LIBHPM
  • Instrumentation library for performance
    measurement of Fortran, C, and C++ applications
  • Collects information and performs summarization
    during run-time; generates a performance file for
    each task
  • Uses the same set of hardware counter events used
    by hpmcount
  • User can specify an event set with a file
  • For each instrumented point in a program, libhpm
    provides output:
  • Total count
  • Total duration (wall clock time)
  • Hardware performance counters information
  • Hardware derived metrics
  • Supports:
  • multiple instrumentation points, nested
  • OpenMP and threaded applications
  • Multiple calls to an instrumented point

LIBHPM Functions
  • C / C++
  • hpmInit(taskID)
  • hpmTerminate(taskID)
  • hpmStart(instID)
  • hpmStop(instID)
  • hpmTstart(instID)
  • hpmTstop(instID)
  • Fortran
  • f_hpminit(taskID)
  • f_hpmterminate(taskID)
  • f_hpmstart(instID)
  • f_hpmstop(instID)
  • f_hpmtstart(instID)
  • f_hpmtstop(instID)

Using LIBHPM - C
  • Declaration
  • #include "libhpm.h"
  • C usage
  • MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
  • hpmInit(taskID, "hpm_test");
  • hpmStart(1, "outer call");
  • code segment to be timed
  • hpmStop(1);
  • hpmTerminate(taskID);
  • Compilation
  • mpcc_r -I/usr/local/apps/HPM_V2.3/include -O3
    -lhpm_r -lpmapi -lm -qarch=pwr3 -qstrict
    -qsmp=omp -L/usr/local/apps/HPM_V2.3/lib
    hpm_test.c -o hpm_test.x

Using LIBHPM - Fortran
  • Declaration
  • include "f_hpm.h"
  • Fortran usage
  • call MPI_COMM_RANK(MPI_COMM_WORLD, taskID, ierr)
  • call f_hpminit(taskID)
  • call f_hpmstart(instID)
  • code segment to be timed
  • call f_hpmstop(instID)
  • call f_hpmterminate(taskID)
  • Compilation
  • mpxlf_r -I/usr/local/apps/HPM_V2.3/include
    -qsuffix=cpp=f -O3 -qarch=pwr3 -qstrict -qsmp=omp
    -L/usr/local/apps/HPM_V2.3/lib -lhpm_r -lpmapi
    -lm hpm_test.f -o hpm_test.x

Using LIBHPM - Threads
  • call f_hpminit(taskID)
  • !$OMP parallel do
  • call f_hpmtstart(10)
  • do_work
  • call f_hpmtstop(10)
  • !$OMP end parallel do
  • !$OMP parallel do
  • call f_hpmtstart(20+my_thread_ID)
  • do_work
  • call f_hpmtstop(20+my_thread_ID)
  • !$OMP end parallel do
  • call f_hpmterminate(taskID)

HPM Example Code in C
#include <mpi.h>
#include <stdio.h>
#include "libhpm.h"
#define n 10000

main(int argc, char **argv)
{
    int taskID, i, numprocs;
    double a[n+1], b[n+1], c[n+1];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
    hpmInit(taskID, "hpm_test");

    hpmStart(1, "section 1");
    for (i = 1; i < n+1; i++) {
        a[i] = i;
        b[i] = n-1;
    }
    hpmStop(1);

    hpmStart(2, "section 2");
    for (i = 2; i < n+1; i++)
        c[i] = a[i]*b[i] + a[i]/b[i];
    hpmStop(2);

    hpmTerminate(taskID);
    MPI_Finalize();
}

HPM Example Code in Fortran
  • program hpm_test
  • parameter (n=10000)
  • integer taskID,ierr,numtasks
  • dimension a(n),b(n),c(n)
  • include "mpif.h"
  • include "f_hpm.h"
  • call MPI_INIT(ierr)
  • call MPI_COMM_SIZE(MPI_COMM_WORLD,numtasks,ierr)
  • call MPI_COMM_RANK(MPI_COMM_WORLD,taskID,ierr)
  • call f_hpminit(taskID,"hpm_test")
  • call f_hpmstart(5,"section1")
  • do i=1,n
  • a(i)=real(i)
  • b(i)=real(n-i)
  • enddo
  • call f_hpmstop(5)
  • call f_hpmterminate(taskID)
  • call MPI_FINALIZE(ierr)
  • end
Compiling and Linking
  • FF = mpxlf_r
  • HPM_DIR = /usr/local/apps/HPM_V2.3
  • HPM_INC = -I$(HPM_DIR)/include
  • HPM_LIB = -L$(HPM_DIR)/lib -lhpm_r -lpmapi -lm
  • FFLAGS = -qsuffix=cpp=f -O3 -qarch=pwr3 -qstrict
  • Note: -qsuffix=cpp=f is only required for
    Fortran code with the .f suffix
  • hpm_test.x: hpm_test.f
  • $(FF) $(HPM_INC) $(FFLAGS) hpm_test.f $(HPM_LIB)
    -o hpm_test.x

HPMVIZ
  • Takes as input the performance files generated by libhpm
  • Usage
  • > hpmviz <performance files (.viz)>
  • Thresholds define a range of values considered satisfactory:
  • Red: below the predefined minimum recommended value
  • Green: above the threshold value
  • HPMVIZ left pane of the window
  • displays, for each instrumented point identified
    by its label, the inclusive duration, exclusive
    duration, and count.
  • HPMVIZ right pane of the window
  • shows the corresponding source code which can be
    edited and saved.
  • The metrics windows
  • display the task ID, Thread ID, count, exclusive
    duration, inclusive duration, and the derived
    hardware metrics.

IBM SP HPM Toolkit Summary
  • A complete problem set
  • Derived metrics
  • Analysis of error messages
  • Analysis of derived metrics
  • HPMCOUNT: very accurate with low overhead,
    non-intrusive, general view of the whole program
  • LIBHPM: same event sets as hpmcount, for part of
    a program
  • HPMVIZ: easier way to view the hardware counter
    information and derived metrics

HPM References
  • HPM README file in /usr/local/apps/HPM_V2.3
  • Online Documentation
  • http//

Lab Session for HPM Environment Setup
  • Setup for running X-windows applications on PCs
  • 1. Login to using CRT
    (located in Applications (Common)).
  • 2. Launch Exceed (located in either
    Applications (Common) or as a shortcut on your
    desktop called "Humming Bird").
  • 3. Set your environment; for csh:
  • setenv DISPLAY t-wolf.sdsc.edu:0.0
  • where "t-wolf", for example, is the name
    of the PC you are using
  • 4. Copy files from the /work/Training/HPM_Training
    directory into your own working space:
  • create a directory to work with HPM
  • mkdir HPM
  • change into the new directory
  • cd HPM
  • copy files into the new directory
  • cp /work/Training/HPM_Training/
  • 5. Go to /work/Training/HPM_Training/simple/

Lab Session for HPM Running HPM
  • 1. Compile either the Fortran or C example with the
    makefile:
  • make -f makefile_f (or makefile_c)
  • 2. Run the executable either interactively or by batch
  • interactive command:
  • poe hpm_test.x -nodes 1
    -tasks_per_node 2 -euilib ip \
  • -euidevice en0
  • 3. Explore the hpmcount summary output, looking at
    both usage and resource statistics