Title: Performance and Memory Evaluation using the TAU Performance System
1. Performance and Memory Evaluation using the TAU Performance System
Sameer Shende, Allen D. Malony, Alan Morris
University of Oregon
{sameer, malony, amorris}@cs.uoregon.edu
Holger Brunst, Wolfgang Nagel
T.U. Dresden
{holger.brunst, wolfgang.nagel}@.tu-dresden.de
MS14: Application Performance Analysis and Optimization on BlueGene/L
SIAM Parallel Processing Conference
Wed. Feb 22, 2006. Franciscan Room, 5-5:25 pm
2. Outline of Talk
- Overview of TAU
- Instrumentation
- Measurement
- Analysis: ParaProf and Vampir/VNG
- Future work and concluding remarks
3. TAU Performance System
- Tuning and Analysis Utilities (13-year project effort)
- Open source performance system for HPC systems
- Integrated, scalable, flexible, and parallel
- Targets a general complex system computation model
  - Entities: nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- http://www.cs.uoregon.edu/research/tau
4. Definitions: Profiling
- Profiling
  - Recording of summary information during execution
    - inclusive time, exclusive time, number of calls, hardware statistics, ...
  - Reflects performance behavior of program entities
    - functions, loops, basic blocks
    - user-defined semantic entities
  - Very good for low-cost performance assessment
  - Helps to expose performance bottlenecks and hotspots
  - Implemented through
    - sampling: periodic OS interrupts or hardware counter traps
    - instrumentation: direct insertion of measurement code
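The difference between inclusive and exclusive time can be sketched with a toy instrumentation-based profiler. This is an illustration in Python, not TAU code: the `Profile` class, its `enter`/`exit` methods, and the hand-fed timestamps are all hypothetical, and recursion is not handled.

```python
from collections import defaultdict


class Profile:
    """Toy profiler: keeps per-routine summaries, not individual events."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.inclusive = defaultdict(float)   # time in routine plus its callees
        self.exclusive = defaultdict(float)   # time in routine minus its callees
        self._stack = []                      # entries: [name, start, child_time]

    def enter(self, name, now):
        self._stack.append([name, now, 0.0])

    def exit(self, now):
        name, start, child_time = self._stack.pop()
        elapsed = now - start
        self.calls[name] += 1
        self.inclusive[name] += elapsed
        self.exclusive[name] += elapsed - child_time
        if self._stack:
            # Charge this call's elapsed time to the caller as child time.
            self._stack[-1][2] += elapsed


def demo():
    """main runs from t=0 to t=10 and calls foo twice (t=1..4 and t=5..7)."""
    p = Profile()
    p.enter("main", 0.0)
    p.enter("foo", 1.0)
    p.exit(4.0)
    p.enter("foo", 5.0)
    p.exit(7.0)
    p.exit(10.0)
    return p
```

In this run `main` has 10 time units of inclusive time but only 5 of exclusive time, because 5 units were spent inside the two calls to `foo`.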
5. Definitions: Tracing
- Tracing
  - Recording of information about significant points (events) during program execution
    - entering/exiting code region (function, loop, block, ...)
    - thread/process interactions (e.g., send/receive message)
  - Save information in event record
    - timestamp
    - CPU identifier, thread identifier
    - Event type and event-specific information
  - Event trace is a time-sequenced stream of event records
  - Can be used to reconstruct dynamic program behavior
  - Typically requires code instrumentation
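A minimal sketch of an event record and a time-sequenced trace, again in Python rather than any real trace format. The `Event` fields follow the list above; the field names, the `merge_streams` and `region_durations` helpers, and the sample data are assumptions made for illustration.

```python
import heapq
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: float
    node: int      # CPU / process identifier
    thread: int
    kind: str      # "enter", "exit", "send", "recv"
    info: str      # event-specific information, e.g. the region name


def merge_streams(*streams):
    """Merge per-thread streams (each already time-ordered) into one trace."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


def region_durations(trace):
    """Replay enter/exit events to recover how long each region instance ran."""
    stacks = {}
    durations = []
    for ev in trace:
        stack = stacks.setdefault((ev.node, ev.thread), [])
        if ev.kind == "enter":
            stack.append(ev)
        elif ev.kind == "exit":
            start = stack.pop()
            durations.append((ev.node, ev.thread, start.info,
                              ev.timestamp - start.timestamp))
    return durations


def demo():
    node0 = [Event(0.0, 0, 0, "enter", "main"),
             Event(1.0, 0, 0, "send", "to node 1"),
             Event(3.0, 0, 0, "exit", "main")]
    node1 = [Event(0.5, 1, 0, "enter", "main"),
             Event(2.0, 1, 0, "recv", "from node 0"),
             Event(2.5, 1, 0, "exit", "main")]
    trace = merge_streams(node0, node1)
    return trace, region_durations(trace)
```

Unlike a profile, the trace keeps every event, so the order of the send on node 0 and the receive on node 1 can still be recovered after the run.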
6. TAU Parallel Performance System Goals
- Multi-level performance instrumentation
  - Multi-language automatic source instrumentation
- Flexible and configurable performance measurement
- Widely-ported parallel performance profiling system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid
- Support for performance mapping
- Support for object-oriented and generic programming
- Integration in complex software, systems, applications
7. TAU Performance System Architecture
[Figure: architecture diagram; recoverable label: event selection]
8. TAU Performance System Architecture
9. Program Database Toolkit (PDT)
[Figure: PDT data-flow diagram. Recoverable labels:]
- Application / Library
- C / C++ parser, producing IL
- Fortran parser (F77/90/95), producing IL
- C / C++ IL analyzer
- Fortran IL analyzer
- Program Database Files
- DUCTAPE
  - PDBhtml: Program documentation
  - SILOON: Application component glue
  - CHASM: C++ / F90/95 interoperability
  - TAU_instr: Automatic source instrumentation
10. TAU Instrumentation Approach
- Support for standard program events
  - Routines
  - Classes and templates
  - Statement-level blocks
- Support for user-defined events
  - Begin/End events (user-defined timers)
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of semantic entities for mapping
- Support for event groups
- Instrumentation optimization (eliminate instrumentation in lightweight routines)
11. TAU Instrumentation
- Flexible instrumentation mechanisms at multiple levels
  - Source code
    - manual (TAU API, TAU Component API)
    - automatic
      - C, C++, F77/90/95 (Program Database Toolkit (PDT))
      - OpenMP (directive rewriting (Opari), POMP spec)
  - Object code
    - pre-instrumented libraries (e.g., MPI using PMPI)
    - statically-linked and dynamically-linked
  - Executable code
    - dynamic instrumentation (pre-execution) (DynInstAPI)
    - virtual machine instrumentation (e.g., Java using JVMPI)
  - Proxy Components
12. Using TAU: A Tutorial
- Configuration
- Instrumentation
  - Manual
  - MPI: Wrapper interposition library
  - PDT: Source rewriting for C, C++, F77/90/95
  - OpenMP: Directive rewriting
  - Component based instrumentation: Proxy components
  - Binary Instrumentation
    - DyninstAPI: Runtime Instrumentation/Rewriting binary
    - Java: Runtime instrumentation
    - Python: Runtime instrumentation
- Measurement
- Performance Analysis
13. Building Bridges to Other Tools: TAU
14. TAU Performance System Interfaces
- PDT (U. Oregon, LANL, FZJ) for instrumentation of C++, C99, F95 source code
- PAPI (UTK) and PCL (FZJ) for accessing hardware performance counters data
- DyninstAPI (U. Maryland, U. Wisconsin) for runtime instrumentation
- KOJAK (FZJ, UTK)
  - Epilog trace generation library
  - CUBE callgraph visualizer
  - Opari OpenMP directive rewriting tool
- Vampir/Intel Trace Analyzer (Pallas/Intel)
- VTF3 trace generation library for Vampir (TU Dresden) (available from TAU website)
- Paraver trace visualizer (CEPBA)
- Jumpshot-4 trace visualizer (MPICH, ANL)
- JVMPI from JDK for Java program instrumentation (Sun)
- ParaProf profile browser/PerfDMF database supports:
  - TAU format
  - Gprof (GNU)
  - HPM Toolkit (IBM)
  - MpiP (ORNL, LLNL)
  - Dynaprof (UTK)
  - PSRun (NCSA)
15. PAPI (UTK)
- Performance Application Programming Interface
  - The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
- Parallel Tools Consortium project
- University of Tennessee, Knoxville
- http://icl.cs.utk.edu/papi
16. TAU Measurement System Configuration
- configure [OPTIONS]
  - -c++=<CC>, -cc=<cc>: Specify C++ and C compilers
  - -pthread, -sproc: Use pthread or SGI sproc threads
  - -openmp: Use OpenMP threads
  - -jdk=<dir>: Specify Java instrumentation (JDK)
  - -opari=<dir>: Specify location of Opari OpenMP tool
  - -papi=<dir>: Specify location of PAPI
  - -pdt=<dir>: Specify location of PDT
  - -dyninst=<dir>: Specify location of DynInst Package
  - -mpi[inc/lib]=<dir>: Specify MPI library instrumentation
  - -shmem[inc/lib]=<dir>: Specify PSHMEM library instrumentation
  - -python[inc/lib]=<dir>: Specify Python instrumentation
  - -epilog=<dir>: Specify location of EPILOG
  - -slog2[=<dir>]: Specify location of SLOG2/Jumpshot
  - -otf=<dir>: Specify location of Open Trace Format
  - -vtf=<dir>: Specify location of VTF3 trace package
  - -arch=<architecture>: Specify architecture explicitly (bgl, ibm64, ibm64linux, ...)
17. TAU Measurement System Configuration
- configure [OPTIONS]
  - -TRACE: Generate binary TAU traces
  - -PROFILE (default): Generate profiles (summary)
  - -PROFILECALLPATH: Generate call path profiles
  - -PROFILEPHASE: Generate phase based profiles
  - -PROFILEMEMORY: Track heap memory for each routine
  - -PROFILEHEADROOM: Track memory headroom to grow
  - -MULTIPLECOUNTERS: Use hardware counters + time
  - -COMPENSATE: Compensate timer overhead
  - -CPUTIME: Use usertime + system time
  - -PAPIWALLCLOCK: Use PAPI's wallclock time
  - -PAPIVIRTUAL: Use PAPI's process virtual time
  - -SGITIMERS: Use fast IRIX timers
  - -LINUXTIMERS: Use fast x86 Linux timers
18. Using TAU on IBM BG/L
- Configure PDT
  - configure -XLC -exec-prefix=bgl; make clean install
  - Use XLC compiler
- Configure TAU for front-end
  - configure; make clean install
  - Add <taudir>/ppc64/bin/ to your path
- Configure TAU for back-end
  - configure -arch=bgl -mpi -pdt=<dir> -pdt_c++=xlC
  - Use IBM's Blue Gene/L blrts_xlC compilers for building the library and xlC for building tau_instrumentor (-pdt_c++=xlC). It executes on the front-end.
  - Libraries are built in <taudir>/bgl/lib/ directory
- Each configuration creates a unique <arch>/lib/Makefile.tau-<options> stub makefile that corresponds to the configuration options specified, e.g.,
  - /usr/local/tau/tau-2.15.2/bgl/lib/Makefile.tau-mpi-pdt
19. TAU_SETUP: A GUI for Installing TAU
tau-2.x> ./tau_setup
20. Configuration Parameters in Stub Makefiles
- Each TAU Stub Makefile resides in <tau>/<arch>/lib directory
- Variables:
  - TAU_CXX: Specify the C++ compiler used by TAU
  - TAU_CC, TAU_F90: Specify the C, F90 compilers
  - TAU_DEFS: Defines used by TAU. Add to CFLAGS
  - TAU_LDFLAGS: Linker options. Add to LDFLAGS
  - TAU_INCLUDE: Header files include path. Add to CFLAGS
  - TAU_LIBS: Statically linked TAU library. Add to LIBS
  - TAU_SHLIBS: Dynamically linked TAU library
  - TAU_MPI_LIBS: TAU's MPI wrapper library for C/C++
  - TAU_MPI_FLIBS: TAU's MPI wrapper library for F90
  - TAU_FORTRANLIBS: Must be linked in with C++ linker for F90
  - TAU_CXXLIBS: Must be linked in with F90 linker
  - TAU_INCLUDE_MEMORY: Use TAU's malloc/free wrapper lib
  - TAU_DISABLE: TAU's dummy F90 stub library
  - TAU_COMPILER: Instrument using tau_compiler.sh script
- Note: Not including TAU_DEFS in CFLAGS disables instrumentation in C/C++ programs (TAU_DISABLE for f90).
21. Using TAU
- Install TAU
  - configure; make clean install
- Typically modify application makefile
  - Change the name of compiler to tau_cxx.sh, tau_f90.sh
- Set environment variables
  - Name of the stub makefile: TAU_MAKEFILE
  - Options passed to tau_compiler.sh: TAU_OPTIONS
- Execute application
  - mpirun -np <procs> a.out
- Analyze performance data
  - paraprof, vampir, paraver, jumpshot
22Manual Instrumentation C Example
include ltTAU.hgt int main(int argc, char
argv) TAU_PROFILE(int main(int, char ),
, TAU_DEFAULT) TAU_PROFILE_INIT(argc,
argv) TAU_PROFILE_SET_NODE(0) / for
sequential programs / foo() return
0 int foo(void) TAU_PROFILE(int
foo(void), , TAU_DEFAULT) // measures entire
foo() TAU_PROFILE_TIMER(t, foo() for loop,
2345 file.cpp, TAU_USER)
TAU_PROFILE_START(t) for(int i 0 i lt N
i) work(i) TAU_PROFILE_STOP(t)
// other statements in foo
23. Manual Instrumentation: F90 Example
cc34567 Cubes program                                    comment line
      PROGRAM SUM_OF_CUBES
        integer profiler(2)
        save profiler
        INTEGER H, T, U
        call TAU_PROFILE_INIT()
        call TAU_PROFILE_TIMER(profiler, 'PROGRAM SUM_OF_CUBES')
        call TAU_PROFILE_START(profiler)
        call TAU_PROFILE_SET_NODE(0)
      ! This program prints all 3-digit numbers that
      ! equal the sum of the cubes of their digits.
        DO H = 1, 9
          DO T = 0, 9
            DO U = 0, 9
              IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
                PRINT "(3I1)", H, T, U
              ENDIF
            END DO
          END DO
        END DO
        call TAU_PROFILE_STOP(profiler)
      END PROGRAM SUM_OF_CUBES
24. TAU's MPI Wrapper Interposition Library
- Uses standard MPI Profiling Interface
  - Provides name shifted interface
    - MPI_Send = PMPI_Send
  - Weak bindings
- Interpose TAU's MPI wrapper library between MPI and TAU
  - -lmpi replaced by -lTauMpi -lpmpi -lmpi
- No change to the source code! Just re-link the application to generate performance data
  - setenv TAU_MAKEFILE <dir>/<arch>/lib/Makefile.tau-mpi-[options]
  - Use tau_cxx.sh, tau_f90.sh and tau_cc.sh as compilers
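The name-shifting idea can be mimicked in a few lines of Python. This is only an analogy: real interposition happens at link time in C through the MPI profiling interface, and the `ToyMPI` and `ProfiledMPI` classes below are hypothetical stand-ins, not TAU or MPI code.

```python
import time


class ToyMPI:
    """Stand-in for an MPI library: each entry point also exists under a P-prefixed name."""

    def __init__(self):
        self.sent = []

    def PMPI_Send(self, buf, dest):
        self.sent.append((dest, buf))
        return 0

    # By default MPI_Send is just another name for PMPI_Send (the weak binding).
    MPI_Send = PMPI_Send


class ProfiledMPI(ToyMPI):
    """Wrapper layer: redefines MPI_Send, measures, then forwards to PMPI_Send."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.seconds = 0.0
        self.bytes = 0

    def MPI_Send(self, buf, dest):
        start = time.perf_counter()
        rc = self.PMPI_Send(buf, dest)          # the real work, under its shifted name
        self.seconds += time.perf_counter() - start
        self.calls += 1
        self.bytes += len(buf)
        return rc


def application(mpi):
    """Application code is unchanged: it only ever calls MPI_Send."""
    for dest in (1, 2, 3):
        mpi.MPI_Send(b"hello", dest)
```

Passing a `ProfiledMPI` instead of a `ToyMPI` to `application` plays the role of re-linking: the application source stays the same, yet every send is now counted and timed.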
25. Using Program Database Toolkit (PDT)
- Parse the program to create foo.pdb:
  - cxxparse foo.cpp -I/usr/local/mydir -DMYFLAGS
  - or
  - cparse foo.c -I/usr/local/mydir -DMYFLAGS
  - or
  - f95parse foo.f90 -I/usr/local/mydir
  - f95parse *.f -omerged.pdb -I/usr/local/mydir -R free
- Instrument the program:
  - tau_instrumentor foo.pdb foo.f90 -o foo.inst.f90 -f select.tau
- Compile the instrumented program:
  - ifort foo.inst.f90 -c -I/usr/local/mpi/include -o foo.o
26. Using TAU
Step 1: Configure and install TAU
  configure -pdt=<dir> -pdt_c++=xlC -arch=bgl -mpi
  make clean; make install
  Builds <taudir>/<arch>/lib/Makefile.tau-<options>
  set path=($path <taudir>/ppc64/bin)
Step 2: Choose target stub Makefile
  setenv TAU_MAKEFILE /usr/local/tau-2.15.2/bgl/lib/Makefile.tau-mpi-pdt
  setenv TAU_OPTIONS '-optVerbose -optKeepFiles'
  (see tau_compiler.sh for all options)
Step 3: Use tau_f90.sh, tau_cxx.sh and tau_cc.sh as the F90, C++ or C compilers respectively.
  tau_f90.sh -c app.f90
  tau_f90.sh app.o -o app -lm -lblas
Or use these in the application Makefile.
27. tau_[cxx,cc,f90].sh Improves Integration in Makefiles
# set TAU_MAKEFILE and TAU_OPTIONS env vars
CXX = tau_cxx.sh
F90 = tau_f90.sh
CFLAGS =
LIBS = -lm
OBJS = f1.o f2.o f3.o ... fn.o
app: $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(CXX) $(CFLAGS) -c $<
28. Using Stub Makefile and TAU_COMPILER
include /usr/common/acts/TAU/tau-2.15.2/bgl/lib/Makefile.tau-mpi-pdt-trace
MYOPTIONS = -optVerbose -optKeepFiles
F90 = $(TAU_COMPILER) $(MYOPTIONS) mpxlf90
OBJS = f1.o f2.o f3.o
LIBS = -Lappdir -lapplib1 -lapplib2
app: $(OBJS)
	$(F90) $(OBJS) -o app $(LIBS)
.f90.o:
	$(F90) -c $<
29. TAU_COMPILER Options
- Optional parameters for $(TAU_COMPILER) (tau_compiler.sh -help)
  - -optVerbose: Turn on verbose debugging messages
  - -optPdtDir="": PDT architecture directory. Typically $(PDTDIR)/$(PDTARCHDIR)
  - -optPdtF95Opts="": Options for Fortran parser in PDT (f95parse)
  - -optPdtCOpts="": Options for C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  - -optPdtCxxOpts="": Options for C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  - -optPdtF90Parser="": Specify a different Fortran parser, e.g., f90parse instead of f95parse
  - -optPdtUser="": Optional arguments for parsing source code
  - -optPDBFile="": Specify merged PDB file. Skips parsing phase.
  - -optTauInstr="": Specify location of tau_instrumentor. Typically $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
  - -optTauSelectFile="": Specify selective instrumentation file for tau_instrumentor
  - -optTau="": Specify options for tau_instrumentor
  - -optCompile="": Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  - -optLinking="": Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
  - -optNoMpi: Removes -lmpi libraries during linking (default)
  - -optKeepFiles: Does not remove intermediate .pdb and .inst.* files
- e.g.,
  - setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optVerbose -optPdtCOpts="-I/home -DFOO"'
  - tau_cxx.sh matrix.cpp -o matrix -lm
30. Optimization of Program Instrumentation
- Need to eliminate instrumentation in frequently executing lightweight routines
- Throttling of events at runtime:
  - setenv TAU_THROTTLE 1
  - Disables instrumentation in routines that execute over 100000 times (TAU_THROTTLE_NUMCALLS) and take less than 10 microseconds of inclusive time per call (TAU_THROTTLE_PERCALL)
- Selective instrumentation file to filter events
  - tau_instrumentor [options] -f <file>
- Compensation of local instrumentation overhead
  - configure -COMPENSATE
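The throttling rule stated above can be written as a small predicate. This Python sketch only restates the rule from the slide; the function name `should_throttle` and its parameter names are hypothetical, and the defaults are the two thresholds quoted above (100000 calls, 10 microseconds per call).

```python
def should_throttle(num_calls, inclusive_usec,
                    numcalls_limit=100000, percall_limit=10.0):
    """Return True if a routine is called often and is cheap per call.

    num_calls      -- how many times the routine has executed so far
    inclusive_usec -- its total inclusive time in microseconds
    """
    if num_calls <= numcalls_limit:
        return False                      # not called often enough to matter
    return inclusive_usec / num_calls < percall_limit
```

For example, a routine called 200000 times with 400000 microseconds of total inclusive time averages 2 microseconds per call and would be throttled, while the same call count with 4000000 microseconds (20 per call) would not.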
31. TAU_REDUCE
- Reads profile files and rules
- Creates selective instrumentation file
  - Specifies which routines should be excluded from instrumentation
[Figure: profile and rules are inputs to tau_reduce, which outputs a selective instrumentation file]
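The profile-plus-rules-to-exclude-list flow can be sketched as follows. This is not tau_reduce itself: the profile is a plain Python dictionary, the rule is a Python function rather than tau_reduce's own rule syntax, and `build_exclude_list` and `lightweight` are hypothetical names. The `BEGIN_EXCLUDE_LIST` / `END_EXCLUDE_LIST` markers are assumed here to match the selective instrumentation file layout; check the TAU documentation before relying on them.

```python
def build_exclude_list(profile, rule):
    """Return selective-instrumentation text excluding routines matched by rule.

    profile -- dict mapping routine name to {"calls": int, "usec": float}
    rule    -- function taking one stats dict and returning True to exclude
    """
    excluded = sorted(name for name, stats in profile.items() if rule(stats))
    lines = ["BEGIN_EXCLUDE_LIST", *excluded, "END_EXCLUDE_LIST"]
    return "\n".join(lines) + "\n"


def lightweight(stats):
    """Example rule: many calls, each cheaper than 10 microseconds."""
    return stats["calls"] > 100000 and stats["usec"] / stats["calls"] < 10


def demo():
    profile = {
        "void solve()":    {"calls": 12,      "usec": 9.0e6},
        "int index(int)":  {"calls": 5000000, "usec": 4.0e6},
        "double norm()":   {"calls": 300000,  "usec": 1.5e6},
    }
    return build_exclude_list(profile, lightweight)
```

In the sample data only `int index(int)` and `double norm()` satisfy the rule, so they are the only routines written to the exclude list.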
32. Memory Profiling in TAU
- Configuration option -PROFILEMEMORY
  - Records global heap memory utilization for each function
  - Takes one sample at beginning of each function and associates the sample with function name
- Configuration option -PROFILEHEADROOM
  - Records headroom (amount of free memory to grow) for each function
  - Takes one sample at beginning of each function and associates it with the callstack (TAU_CALLPATH_DEPTH env variable)
  - Useful for debugging memory usage on IBM BG/L.
- Independent of instrumentation/measurement options selected
- No need to insert macros/calls in the source code
- User defined atomic events appear in profiles/traces
33. Memory Profiling in TAU
Flash2 code profile (-PROFILEMEMORY) on IBM BlueGene/L (MPI rank 0)
34. Memory Profiling in TAU
- Instrumentation based observation of global heap memory (not per function)
  - call TAU_TRACK_MEMORY()
  - call TAU_TRACK_MEMORY_HEADROOM()
    - Triggers one sample every 10 secs
  - call TAU_TRACK_MEMORY_HERE()
  - call TAU_TRACK_MEMORY_HEADROOM_HERE()
    - Triggers sample at a specific location in source code
  - call TAU_SET_INTERRUPT_INTERVAL(seconds)
    - To set inter-interrupt interval for sampling
  - call TAU_DISABLE_TRACKING_MEMORY()
  - call TAU_DISABLE_TRACKING_MEMORY_HEADROOM()
    - To turn off recording memory utilization
  - call TAU_ENABLE_TRACKING_MEMORY()
  - call TAU_ENABLE_TRACKING_MEMORY_HEADROOM()
    - To re-enable tracking memory utilization
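Sampling memory at a specific source location and recording it as an atomic event might look like the sketch below. It is a Python analogy, not the TAU API: `AtomicEvent` and `track_memory_here` are hypothetical names, and the standard-library `tracemalloc` module is used as the memory source, which only sees allocations made by the Python interpreter rather than the whole process heap.

```python
import tracemalloc


class AtomicEvent:
    """Keeps summary statistics for a value sampled at arbitrary points."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.total = 0.0
        self.min = None
        self.max = None

    def trigger(self, value):
        self.count += 1
        self.total += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


def track_memory_here(event):
    """Take one memory sample (in KB) at the point where this is called."""
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    current, _peak = tracemalloc.get_traced_memory()
    event.trigger(current / 1024.0)
```

Calling `track_memory_here(event)` before and after a large allocation gives two samples whose minimum, maximum and mean summarize memory use at those locations, which is the kind of statistic an atomic event reports in a profile.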
35. ParaProf Full Profile (Miranda)
8K processors!
36. ParaProf Flat Profile (Miranda)
37. ParaProf Callpath Profile (Flash)
38. Gprof Style Callpath View in ParaProf
39. TAU's ParaProf Profile Browser: Static Timers
40. ParaProf 3D Full Profile (Miranda)
16k processors
41. ParaProf Bar Plot (Zoom in/out: +/-)
42. ParaProf 3D Scatterplot (Miranda)
- Each point is a thread of execution
- A total of four metrics shown in relation
- ParaVis 3D profile visualization library
  - JOGL
43. Vampir, VNG, and OTF
- Commercial trace based tools developed at ZiH, T.U. Dresden
  - Wolfgang Nagel, Holger Brunst and others
- Vampir Trace Visualizer (aka Intel Trace Analyzer v4.0)
  - Sequential program
- Vampir Next Generation (VNG)
  - Client (vng) runs on a desktop, server (vngd) on a cluster
  - Parallel trace analysis
  - Orders of magnitude bigger traces (more memory)
  - State of the art in parallel trace visualization
- Open Trace Format (OTF)
  - Hierarchical trace format, efficient streams based parallel access with VNGD
  - Replacement for proprietary formats such as STF
  - Tracing library available on IBM BG/L platform
- Development of OTF supported by LLNL
- http://www.vampir-ng.de and http://www.paratools.com/otf
44. Vampir Next Generation (VNG) Architecture
45. VNG Parallel Analysis Server
46. TAU Tracing Enhancements
- Configure TAU with -TRACE -vtf=<dir> -otf=<dir> options
  - configure -TRACE -vtf=<dir>
  - configure -TRACE -otf=<dir>
  - Generates tau_merge, tau2vtf, tau2otf tools in <tau>/<arch>/bin directory
- Instrument and execute application
  - tau_f90.sh app.f90 -o app
  - mpirun -np 4 app
- Merge and convert trace files to VTF3/OTF format
  - tau_treemerge.pl
  - tau2vtf tau.trc tau.edf app.vpt.gz
  - vampir app.vpt.gz
  - OR
  - tau2otf tau.trc tau.edf app.otf -n <numstreams>
  - vampir app.otf
  - OR use VNG to analyze OTF/VTF trace files
47. Environment Variables
- Configure TAU with -TRACE -otf=<dir> option
  - configure -TRACE -otf=<dir> -arch=bgl -MULTIPLECOUNTERS -papi=<dir> -mpi -pdt=<dir> -pdt_c++=xlC
- Set environment variables
  - setenv TRACEDIR /p/gm1/<login>/traces
  - setenv COUNTER1 GET_TIME_OF_DAY (reqd)
  - setenv COUNTER2 PAPI_FP_INS
  - setenv COUNTER3 PAPI_TOT_CYC
- Execute application
  - srun -N8 -n16 -p pdebug ./a.out args
  - tau_treemerge.pl and tau2otf/tau2vtf
48. Using Vampir Next Generation (VNG v1.4)
49. VNG Timeline Display
50. VNG Calltree Display
51. VNG Timeline Zoomed In
52. VNG Grouping of Interprocess Communications
53. VNG Process Timeline with PAPI Counters
54. OTF/VNG Support for Counters
55. VNG Communication Matrix Display
56. VNG Message Profile
57. VNG Process Activity Chart
58. VNG Preferences
59. TAU Performance System Status
- Computing platforms (selected)
  - IBM SP/pSeries/BGL, SGI Altix/Origin, Cray T3E/SV-1/X1/XT3, HP (Compaq) SC (Tru64), Sun, Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Hitachi SR8000, NEC SX-5/6, Windows
- Programming languages
  - C, C++, Fortran 77/90/95, HPF, Java, Python
- Thread libraries (selected)
  - pthreads, OpenMP, SGI sproc, Java, Windows, Charm++
- Compilers (selected)
  - Intel, PGI, GNU, Fujitsu, Sun, PathScale, SGI, Cray, IBM, HP, NEC, Absoft, Lahey, Nagware
60. Concluding Discussion
- Performance tools must be used effectively
- More intelligent performance systems for productive use
  - Evolve to application-specific performance technology
  - Deal with scale by full range performance exploration
  - Autonomic and integrated tools
  - Knowledge-based and knowledge-driven process
- Performance observation methods do not necessarily need to change in a fundamental sense
  - More automatically controlled and efficiently used
- Develop next-generation tools and deliver to community
- Open source with support by ParaTools, Inc.
- http://www.cs.uoregon.edu/research/tau
61. Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASC Level 1 sub-contract
  - LLNL ASC/NNSA Level 3 contract
  - LLNL ParaTools/GWT contract
- Argonne National Laboratory
  - Pete Beckman
- T.U. Dresden, GWT
  - Dr. Wolfgang Nagel and Holger Brunst
- Research Centre Juelich
  - Dr. Bernd Mohr
- Los Alamos National Laboratory contracts