IBM Hardware Performance Monitor (hpm) - PowerPoint PPT Presentation

About This Presentation
Title:

IBM Hardware Performance Monitor (hpm)

Slides: 27
Provided by: DongJ9
Transcript and Presenter's Notes

Title: IBM Hardware Performance Monitor (hpm)


1
IBM Hardware Performance Monitor (hpm)
  • NPACI Parallel Computing Workshop February 5,
    2002 at SDSC

2
What is Performance?
  • Where is time spent, and how is it spent?
  • MIPS: Millions of Instructions Per Second
  • Not necessarily indicative of the amount of
    useful work done
  • MFLOPS: Millions of Floating-Point Operations
    Per Second
  • A better metric for numerically intensive codes,
    but different platforms count flops
    differently, and flops are not completely
    indicative of useful work done
  • Run time/CPU time
  • The only true measure of code performance: it
    accounts for algorithmic improvements to the
    code and can be converted to cycles.
  • Counting cycles means:
  • Estimate how many cycles your loop(s) should take
  • Compare to measured times (converted to cycles)
    and tune the code to narrow the difference
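
The conversion from run time to cycles mentioned above is a multiplication by the processor clock rate. A minimal sketch in C follows. The function names and the clock rate passed in are illustrative assumptions, not part of the HPM toolkit.

```c
#include <assert.h>

/* Convert a measured run time into processor cycles.
   clock_hz is the processor clock rate in Hz; the caller supplies it
   (assumption: a fixed clock rate for the whole run). */
double seconds_to_cycles(double seconds, double clock_hz)
{
    assert(seconds >= 0.0 && clock_hz > 0.0);
    return seconds * clock_hz;
}

/* Ratio of measured cycles to the cycles estimated by hand for a loop.
   A value near 1.0 means the loop runs close to the estimate; larger
   values show how much room is left for tuning. */
double cycle_ratio(double measured_seconds, double clock_hz,
                   double estimated_cycles)
{
    assert(estimated_cycles > 0.0);
    return seconds_to_cycles(measured_seconds, clock_hz) / estimated_cycles;
}
```

For example, a loop estimated at 100 million cycles that takes 2 seconds on a hypothetical 100 MHz clock gives a ratio of 2.0, so the measured code is twice as slow as the estimate.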

3
What is a Performance Monitor?
  • Provides detailed processor/system data
  • Processor Monitors
  • Typically a group of registers
  • Special-purpose registers keep track of
    programmable events
  • Non-intrusive counts result in accurate
    measurement of processor events
  • Typical events counted are instructions,
    floating-point instructions, cache misses, etc.
  • System Level Monitors
  • Can be hardware or software
  • Intended to measure system activity
  • Examples:
  • A bus monitor measures memory traffic and can
    analyze cache coherency issues in a
    multiprocessor system
  • A network monitor measures network traffic and
    can analyze web traffic internally and externally

4
Hardware Counter Motivations
  • To understand the execution behavior of
    application code
  • Why not use software?
  • Strength: simple, GUI
  • Weakness: large overhead, intrusive,
    higher-level abstraction and simplicity
  • How about using a simulator?
  • Strength: control, low-level, accurate
  • Weakness: limit on size of code, difficult to
    implement
  • When should we directly use hardware counters?
  • When software and simulators are not available
    or not enough
  • Strength: non-intrusive, instruction-level
    analysis, moderate control, very accurate, low
    overhead
  • Weakness: not typically reusable, requires OS
    kernel support

5
Problem Set
  • Should we collect all events all the time?
  • No: it is unnecessary and wasteful
  • What counts should be used?
  • Safe to say: gather only what you need
  • Cycles
  • Committed instructions
  • Loads
  • Stores
  • L1/L2 misses
  • L1/L2 stores
  • Committed floating-point instructions
  • Branches
  • Branch misses
  • TLB misses
  • Cache misses

6
POWER3 Architecture

7
IBM HPM Toolkit
  • Hardware Performance Monitor
  • Developed for performance measurement of
    applications running on IBM POWER3 systems. It
    consists of:
  • A utility (hpmcount)
  • An instrumentation library (libhpm)
  • A graphical user interface (hpmviz)
  • Requires PMAPI kernel extensions to be loaded
  • Works on IBM 630 and 604e processors

8
HPM Count
  • Utility for performance measurement of an
    application
  • Extra logic inserted in the processor counts
    specific events
  • Counters are updated at every cycle
  • Provides a summary output at the end of the
    execution:
  • Wall clock time
  • Resource usage statistics
  • Hardware performance counter information
  • Derived hardware metrics

9
HPM Usage HW Event Categories
  • EVENT SET 1 (floating point performance and
    usage of floating point units): Cycles, Inst.
    completed, TLB misses, Stores completed, Loads
    completed, FPU0 ops, FPU1 ops, FMAs executed
  • EVENT SET 2 (data locality and usage of level 1
    data cache): Cycles, Inst. completed, TLB
    misses, Stores dispatched, L1 store misses,
    Loads dispatched, L1 load misses, LSU idle
  • EVENT SET 3 (performance and usage of level 1
    instruction cache): Cycles, Inst. dispatched,
    Inst. completed, Cycles w/ 0 inst. completed,
    I cache misses, FXU0 ops, FXU1 ops, FXU2 ops
  • EVENT SET 4 (usage of level 2 data cache and
    branch prediction): Cycles, Loads dispatched,
    L1 load misses, L2 load misses, Stores
    dispatched, L2 store misses, Comp. unit waiting
    on load, LSU idle

10
HPM for Whole Program using HPMCOUNT
  • Installed in /usr/local/apps/hpm,
    /usr/local/apps/HPM_V2.3
  • Environment setting
  • setenv LIBHPM_EVENT_SET 1 (or 2, 3, 4)
  • setenv MP_LABELIO YES -> to correlate each
    line of output with the corresponding task
  • setenv MP_STDOUTMODE taskID (e.g. 0) -> to
    discard output from other tasks
  • Usage
  • poe hpmcount ./a.out -nodes 1 -tasks_per_node
    1 -rmpool 1 -s <set> -e ev,ev -h
  • -h displays a help message
  • -e ev0,ev1,... list of event numbers, separated
    by commas
  • ev<i> corresponds to the event selected for
    counter <i>
  • -s predefined set of events

11
Derived Hardware Metrics
  • Hardware counters provide only raw counts
  • 8 counters on POWER3
  • Enough information to generate derived metrics
    on each execution
  • Derived Metrics
  • Floating point rate
  • Computational intensity
  • Instructions per load/store
  • Loads/stores per data cache miss
  • Cache hit rate
  • Loads per load miss
  • Stores per store miss
  • Loads per TLB miss
  • FMA percentage
  • Branches mispredicted

12
HPMCOUNT Output (Event1)
  • ---usage statistics---
  • Total amount of time in user mode: 141.130000 seconds
  • Total amount of time in system mode: 36.300000 seconds
  • Maximum resident set size: 25516 Kbytes
  • Average shared memory use in text segment: 1978356 Kbytes*sec
  • Average unshared memory use in data segment: 357949904 Kbytes*sec
  • Number of page faults without I/O activity: 6750
  • Number of page faults with I/O activity: 81
  • Number of times process was swapped out: 0
  • Number of times file system performed INPUT: 0
  • Number of times file system performed OUTPUT: 0
  • Number of IPC messages sent: 0
  • Number of IPC messages received: 0
  • Number of signals delivered: 0
  • Number of voluntary context switches: 266907
  • Number of involuntary context switches: 2128527

13
HPMCOUNT (Event1 continued)
  • ---Resource statistics---
  • Wall Clock Time: 35.099596 seconds
  • Total time in user mode: 54.0518473203182 seconds
  • Average duration: 0.0146248
  • Standard deviation: 0.0112495
  • Exclusive duration: 0.191238 seconds
  • PM_CYC (Cycles): 20271809159
  • PM_INST_CMPL (Instructions completed): 14974657747
  • PM_TLB_MISS (TLB misses): 4474101
  • PM_ST_CMPL (Stores completed): 2687036544
  • PM_LD_CMPL (Loads completed): 5220888450
  • PM_FPU0_CMPL (FPU 0 instructions): 2581927160
  • PM_FPU1_CMPL (FPU 1 instructions): 519835526
  • PM_EXEC_FMA (FMAs executed): 792849657

14
HPMCOUNT (Event1 continued)
  • Utilization rate: 153.988 %
  • Avg number of loads per TLB miss: 1166.913
  • Load and store operations: 7907.925 M
  • Instructions per load/store: 1.894
  • MIPS: 426.633
  • Instructions per cycle: 0.739
  • HW floating point instructions per cycle: 0.153
  • Floating point instructions + FMAs: 3894.612 M
  • Floating point instructions + FMA rate: 110.959 Mflip/s
  • FMA percentage: 40.715 %
  • Computation intensity: 0.492
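
Most of the derived metrics above can be reproduced from the raw event set 1 counts and the wall clock time on the previous slide. The sketch below shows one way to do that in C. The formulas are inferred from the sample numbers rather than taken from the toolkit source, and the type and function names are hypothetical. The utilization rate is left out because the printed value does not follow exactly from the times shown.

```c
#include <assert.h>

/* Raw event set 1 counts and wall clock time, as printed by hpmcount. */
typedef struct {
    double wall_seconds;
    double cycles, inst_cmpl, tlb_miss;
    double st_cmpl, ld_cmpl;
    double fpu0, fpu1, fma;
} hpm_counts;

typedef struct {
    double loads_per_tlb_miss;
    double load_store_m;        /* millions of loads + stores */
    double inst_per_load_store;
    double mips;
    double inst_per_cycle;
    double hw_fp_per_cycle;
    double flips_m;             /* millions of FP instructions + FMAs */
    double mflips_per_sec;
    double fma_percent;
    double comp_intensity;
} hpm_derived;

/* Formulas inferred from the sample output (assumption, see lead-in). */
hpm_derived derive_metrics(const hpm_counts *c)
{
    hpm_derived d;
    double load_store = c->ld_cmpl + c->st_cmpl;
    double hw_fp = c->fpu0 + c->fpu1;
    double flips = hw_fp + c->fma;  /* each FMA adds one more operation */

    assert(c->wall_seconds > 0.0 && c->cycles > 0.0);
    assert(c->tlb_miss > 0.0 && load_store > 0.0 && flips > 0.0);

    d.loads_per_tlb_miss  = c->ld_cmpl / c->tlb_miss;
    d.load_store_m        = load_store / 1e6;
    d.inst_per_load_store = c->inst_cmpl / load_store;
    d.mips                = c->inst_cmpl / c->wall_seconds / 1e6;
    d.inst_per_cycle      = c->inst_cmpl / c->cycles;
    d.hw_fp_per_cycle     = hw_fp / c->cycles;
    d.flips_m             = flips / 1e6;
    d.mflips_per_sec      = flips / c->wall_seconds / 1e6;
    d.fma_percent         = 100.0 * 2.0 * c->fma / flips;
    d.comp_intensity      = flips / load_store;
    return d;
}

/* The raw values from the sample hpmcount output. */
hpm_counts sample_counts(void)
{
    hpm_counts c = {
        35.099596,
        20271809159.0, 14974657747.0, 4474101.0,
        2687036544.0, 5220888450.0,
        2581927160.0, 519835526.0, 792849657.0
    };
    return c;
}
```

Calling derive_metrics on sample_counts() gives values that round to the figures printed by hpmcount, for example about 0.739 instructions per cycle and a computation intensity of about 0.492.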

15
HPM for Part of Program using LIBHPM
  • Instrumentation library for performance
    measurement of Fortran, C, and C++
    applications
  • Collects information and performs summarization
    at run time, and generates a performance file
    for each task
  • Uses the same sets of hardware counter events as
    hpmcount
  • The user can specify an event set with the file
    libHPMevents
  • For each instrumented point in a program, libhpm
    provides as output:
  • Total count
  • Total duration (wall clock time)
  • Hardware performance counter information
  • Derived hardware metrics
  • Supports:
  • Multiple instrumentation points, nested
    instrumentation
  • OpenMP and threaded applications
  • Multiple calls to an instrumented point
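
To make "total count" and "total duration" concrete, here is a small stand-alone sketch in C of the bookkeeping a start/stop pair implies when one instrumented point is called several times. This is not libhpm code: sketch_start, sketch_stop, wall_now, and the section table are hypothetical names, and hardware counters are not read at all.

```c
#include <assert.h>
#include <time.h>

#define MAX_SECTIONS 16   /* hypothetical limit for this sketch */

typedef struct {
    long   count;       /* completed start/stop pairs */
    double total;       /* accumulated wall clock seconds */
    double started_at;  /* timestamp of the pending start */
    int    running;     /* nonzero between start and stop */
} section_stats;

section_stats sections[MAX_SECTIONS];

/* Current wall clock time in seconds (C11 timespec_get). */
double wall_now(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Mark the beginning of instrumented section id at time now. */
void sketch_start(int id, double now)
{
    assert(id >= 0 && id < MAX_SECTIONS && !sections[id].running);
    sections[id].started_at = now;
    sections[id].running = 1;
}

/* Mark the end of section id and fold the interval into its totals. */
void sketch_stop(int id, double now)
{
    assert(id >= 0 && id < MAX_SECTIONS && sections[id].running);
    sections[id].total += now - sections[id].started_at;
    sections[id].count += 1;
    sections[id].running = 0;
}
```

In real use the timestamps would come from wall_now(). They are passed explicitly here so that the accumulation is easy to follow: two calls to section 1 lasting 0.5 s and 0.25 s leave a count of 2 and a total duration of 0.75 s.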

16
LIBHPM Functions
  • C and C++
  • hpmInit(taskID)
  • hpmTerminate(taskID)
  • hpmStart(instID)
  • hpmStop(instID)
  • hpmTstart(instID)
  • hpmTstop(instID)
  • Fortran
  • f_hpminit(taskID)
  • f_hpmterminate(taskID)
  • f_hpmstart(instID)
  • f_hpmstop(instID)
  • f_hpmtstart(instID)
  • f_hpmtstop(instID)

17
Using LIBHPM - C
  • Declaration
  • #include "libhpm.h"
  • C usage
  • MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
  • hpmInit(taskID, "hpm_test");
  • hpmStart(1, "outer call");
  • /* code segment to be timed */
  • hpmStop(1);
  • hpmTerminate(taskID);
  • Compilation
  • mpcc_r -I/usr/local/apps/HPM_V2.3/include -O3
    -lhpm_r -lpmapi -lm -qarch=pwr3 -qstrict
    -qsmp=omp -L/usr/local/apps/HPM_V2.3/lib
    hpm_test.c -o hpm_test.x

18
Using LIBHPM - Fortran
  • Declaration
  • #include "f_hpm.h"
  • Fortran usage
  • call MPI_COMM_RANK(MPI_COMM_WORLD, taskID, ierr)
  • call f_hpminit(taskID)
  • call f_hpmstart(instID)
  • ! code segment to be timed
  • call f_hpmstop(instID)
  • call f_hpmterminate(taskID)
  • call MPI_FINALIZE(ierr)
  • Compilation
  • mpxlf_r -I/usr/local/apps/HPM_V2.3/include
    -qsuffix=cpp=f -O3 -qarch=pwr3 -qstrict -qsmp=omp
    -L/usr/local/apps/HPM_V2.3/lib -lhpm_r -lpmapi
    -lm hpm_test.f -o hpm_test.x

19
Using LIBHPM - Threads
  • call f_hpminit(taskID)
  • //do
  • call f_hpmtstart(10)
  • do_work
  • call f_hpmtstop(10)
  • end //do
  • //do
  • call f_hpmtstart(20 + my_thread_ID)
  • do_work
  • call f_hpmtstop(20 + my_thread_ID)
  • end //do
  • call f_hpmterminate(taskID)

20
HPM Code in C
    #include <mpi.h>
    #include <stdio.h>
    #include "libhpm.h"
    #define n 10000

    int main(int argc, char *argv[])
    {
        int taskID, i, numprocs;
        /* indices 1..n are used, so allocate n+1 elements */
        double a[n+1], b[n+1], c[n+1];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
        hpmInit(taskID, "hpm_test");

        hpmStart(1, "section 1");
        for (i = 1; i < n+1; i++) { a[i] = i; b[i] = n-1; }
        hpmStop(1);

        hpmStart(2, "section 2");
        for (i = 2; i < n+1; i++) c[i] = a[i]*b[i] + a[i]/b[i];
        hpmStop(2);

        hpmTerminate(taskID);
        MPI_Finalize();
        return 0;
    }

21
HPM Code in Fortran
  • program hpm_test
  • parameter (n=10000)
  • integer taskID,ierr,numtasks
  • dimension a(n),b(n),c(n)
  • include "mpif.h"
  • #include "f_hpm.h"
  • call MPI_INIT(ierr)
  • call MPI_COMM_RANK(MPI_COMM_WORLD,taskID,ierr)
  • call MPI_COMM_SIZE(MPI_COMM_WORLD,numtasks,ierr)
  • call f_hpminit(taskID,"hpm_test")
  • call f_hpmstart(5,"section1")
  • do i=1,n
  • a(i)=real(i)
  • b(i)=real(n-i)
  • enddo
  • call f_hpmstop(5)
  • call f_hpmterminate(taskID)
  • call MPI_FINALIZE(ierr)
  • end

22
Compiling and Linking
  • FF = mpxlf_r
  • HPM_DIR = /usr/local/apps/HPM_V2.3
  • HPM_INC = -I$(HPM_DIR)/include
  • HPM_LIB = -L$(HPM_DIR)/lib -lhpm_r -lpmapi -lm
  • FFLAGS = -qsuffix=cpp=f -O3 -qarch=pwr3 -qstrict
    -qsmp=omp
  • Note: -qsuffix=cpp=f is only required for
    Fortran code with the .f suffix
  • hpm_test.x: hpm_test.f
  • $(FF) $(HPM_INC) $(FFLAGS) hpm_test.f $(HPM_LIB)
    -o hpm_test.x

23
HPMVIZ
  • Takes as input the performance files generated by
    libhpm
  • Usage
  • > hpmviz <performance files (.viz)>
  • Define a range of values considered satisfactory
  • Red: below the predefined minimum recommended
    value
  • Green: above the threshold value
  • HPMVIZ left pane of the window
  • Displays, for each instrumented point (identified
    by its label), the inclusive duration, exclusive
    duration, and count.
  • HPMVIZ right pane of the window
  • Shows the corresponding source code, which can be
    edited and saved.
  • The metrics windows
  • Display the task ID, thread ID, count, exclusive
    duration, inclusive duration, and the derived
    hardware metrics.
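
The difference between inclusive and exclusive duration matters when instrumented sections are nested. The C sketch below assumes the usual profiler convention, which the slides do not spell out: the exclusive duration of a section is its inclusive duration minus the inclusive durations of the instrumented sections nested directly inside it. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Exclusive duration of a section, assuming it equals the inclusive
   duration minus the inclusive durations of its direct children
   (assumed convention, not confirmed by the slides). */
double exclusive_duration(double inclusive,
                          const double *child_inclusive,
                          size_t n_children)
{
    double exclusive = inclusive;
    size_t i;

    assert(inclusive >= 0.0);
    for (i = 0; i < n_children; i++) {
        assert(child_inclusive[i] >= 0.0);
        exclusive -= child_inclusive[i];
    }
    return exclusive;
}
```

For instance, an outer section with an inclusive duration of 1.0 s that contains two nested sections of 0.25 s and 0.5 s has an exclusive duration of 0.25 s under this convention.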

24
HPMVIZ

25
IBM SP HPM Toolkit Summary
  • A complete problem set
  • Derived metrics
  • Analysis of error messages
  • Analyze derived metrics
  • HPMCOUNT: very accurate with low overhead,
    non-intrusive, general view of the whole program
  • LIBHPM: same event sets as hpmcount, for part of
    a program
  • HPMVIZ: easier viewing of the hardware counter
    information and derived metrics

26
HPM References
  • HPM README file in /usr/local/apps/HPM_V2.3
  • Online Documentation
  • http://www.sdsc.edu/SciApps/IBM_tools/hpm.html