Title: Performance Technology for Complex Parallel Systems Sameer Shende, Allen D. Malony University of Oregon
1 Performance Technology for Complex Parallel Systems
Sameer Shende, Allen D. Malony
University of Oregon
2 Overview
- Introduction
- Definitions, general problem
- Tuning and Analysis Utilities (TAU)
- Instrumentation
- Measurement
- Analysis
- Work in progress
- Visualization: Vampir
- Performance Monitoring and Steering
- Performance Database Framework
- Case Study: Uintah
- Conclusions
3 General Problems
- How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
- How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?
4 Computation Model for Performance Technology
- How to address dual performance technology goals?
  - Robust capabilities and widely available methodologies
  - Contend with problems of system diversity
  - Flexible tool composition/configuration/integration
- Approaches
  - Restrict computation types / performance problems
    - limited performance technology coverage
  - Base technology on abstract computation model
    - general architecture and software execution features
    - map features/methods to existing complex system types
    - develop capabilities that can adapt and be optimized
5 General Complex System Computation Model
- Node: physically distinct shared-memory machine
  - Message-passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context
[Figure: physical view and model view. Labels: Interconnection Network, inter-node message communication, Node, node memory, SMP, VM space, Context, Threads]
6 Definitions: Profiling
- Profiling
  - Recording of summary information during execution
    - inclusive and exclusive time, number of calls, hardware statistics, ...
  - Reflects performance behavior of program entities
    - functions, loops, basic blocks
    - user-defined semantic entities
  - Very good for low-cost performance assessment
  - Helps to expose performance bottlenecks and hotspots
  - Implemented through
    - sampling: periodic OS interrupts or hardware counter traps
    - instrumentation: direct insertion of measurement code
7 Definitions: Tracing
- Tracing
  - Recording of information about significant points (events) during program execution
    - entering/exiting a code region (function, loop, block, ...)
    - thread/process interactions (e.g., send/receive message)
  - Save information in event record
    - timestamp
    - CPU identifier, thread identifier
    - event type and event-specific information
  - Event trace is a time-sequenced stream of event records
  - Can be used to reconstruct dynamic program behavior
  - Typically requires code instrumentation
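The event record and the time-sequenced stream described on this slide can be sketched as follows. This is only an illustration: the `EventRecord` layout and the `merge_traces` helper are hypothetical and do not reflect TAU's actual trace format or tools.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical event record; field names are illustrative only.
struct EventRecord {
    std::uint64_t timestamp;  // time at which the event occurred
    int cpu;                  // CPU identifier
    int thread;               // thread identifier
    std::string type;         // event type, e.g. "enter", "exit", "send"
    std::string info;         // event-specific information
};

// Merge per-thread traces into one time-sequenced stream.
// stable_sort keeps the relative order of events with equal timestamps.
std::vector<EventRecord> merge_traces(
        const std::vector<std::vector<EventRecord>>& traces) {
    std::vector<EventRecord> merged;
    for (const auto& trace : traces) {
        merged.insert(merged.end(), trace.begin(), trace.end());
    }
    std::stable_sort(merged.begin(), merged.end(),
                     [](const EventRecord& a, const EventRecord& b) {
                         return a.timestamp < b.timestamp;
                     });
    return merged;
}
```

Sorting by timestamp assumes the clocks of all CPUs are already synchronized; real traces may need a clock adjustment step first.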
8 Event Tracing: Instrumentation, Monitor, Trace
[Figure: labels: Event definition, timestamp, MONITOR, CPU A, CPU B]
9 Event Tracing: Timeline Visualization
[Figure: timeline display; labels: main, master, slave, B]
10 TAU Performance System Framework
- Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
  - nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  - Portable, configurable performance profiling/tracing facility
  - Open software approach
- University of Oregon, LANL, FZJ Germany
- http://www.cs.uoregon.edu/research/paracomp/tau
11 Strategies for Empirical Performance Evaluation
- Empirical performance evaluation as a series of performance experiments
  - Experiment trials describing instrumentation and measurement requirements
  - Where/When/How axes of empirical performance space
    - where are performance measurements made in the program
    - when is performance instrumentation done
    - how are performance measurement/instrumentation chosen
- Strategies for achieving flexibility and portability goals
  - Limited performance methods restrict evaluation scope
  - Non-portable methods force use of different techniques
  - Integration and combination of strategies
12 TAU Performance System Architecture
[Figure: TAU performance system architecture; surviving labels: Paraver, EPILOG]
13 TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT: source-to-source translation
  - MPI: wrapper interposition library
  - Opari: OpenMP directive rewriting
- Binary
  - JVMPI: Java virtual machine instrumentation
  - DyninstAPI: runtime code patching
14 TAU Instrumentation
- Targets common measurement interface (TAU API)
- Object-based design and implementation
  - Macro-based, using constructor/destructor techniques
  - Program units: functions, classes, templates, blocks
  - Uniquely identify functions and templates
    - name and type signature (name registration)
    - static object creates performance entry
    - dynamic object receives static object pointer
    - runtime type identification for template instantiations
  - C and Fortran instrumentation variants
- Instrumentation and measurement optimization
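The constructor/destructor technique named above can be illustrated with a small scoped timer. This is a minimal sketch under stated assumptions, not TAU's implementation: `ProfileEntry`, `registry`, `ScopedProfiler`, and `SKETCH_PROFILE` are hypothetical names. A static pointer per call site plays the role of the static object that creates the performance entry, and a stack object receives that pointer and measures one invocation.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical performance entry; not a TAU data structure.
struct ProfileEntry {
    long calls = 0;
    double inclusive_usec = 0.0;
};

// Registry keyed by "name type-signature", standing in for name registration.
inline std::map<std::string, ProfileEntry>& registry() {
    static std::map<std::string, ProfileEntry> entries;
    return entries;
}

// Dynamic object: starts timing in its constructor, stops in its destructor.
class ScopedProfiler {
public:
    explicit ScopedProfiler(ProfileEntry* entry)
        : entry_(entry), start_(std::chrono::steady_clock::now()) {}
    ~ScopedProfiler() {
        const auto end = std::chrono::steady_clock::now();
        entry_->calls += 1;
        entry_->inclusive_usec +=
            std::chrono::duration<double, std::micro>(end - start_).count();
    }
    ScopedProfiler(const ScopedProfiler&) = delete;
    ScopedProfiler& operator=(const ScopedProfiler&) = delete;

private:
    ProfileEntry* entry_;
    std::chrono::steady_clock::time_point start_;
};

// The static pointer is initialized once per call site; every call then
// creates a scoped profiler that receives it.
#define SKETCH_PROFILE(name, type)                        \
    static ProfileEntry* sketch_entry_ =                  \
        &registry()[std::string(name) + " " + (type)];    \
    ScopedProfiler sketch_profiler_(sketch_entry_)

// Example routine instrumented with the sketch macro.
inline int work(int n) {
    SKETCH_PROFILE("work", "int (int)");
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += i;
    return sum;
}
```

Because the destructor runs on every exit path, the routine is timed correctly even with early returns or exceptions, which is the main appeal of this design over explicit start/stop calls.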
15 Multi-Level Instrumentation
- Uses multiple instrumentation interfaces
- Shares information: cooperation between interfaces
- Taps information at multiple levels
- Provides selective instrumentation at each level
- Targets a common performance model
- Presents a unified view of execution
16 Manual Instrumentation Using TAU
- Install TAU
  - configure; make clean install
- Instrument application
  - TAU Profiling API
- Modify application makefile
  - include TAU's stub makefile, modify variables
- Execute application
  - mpirun -np <procs> a.out
- Analyze performance data
  - jracy, vampir, pprof, paraver
17 TAU Manual Instrumentation API
- Initialization and runtime configuration
  - TAU_PROFILE_INIT(argc, argv);
  - TAU_PROFILE_SET_NODE(myNode);
  - TAU_PROFILE_SET_CONTEXT(myContext);
  - TAU_PROFILE_EXIT(message);
  - TAU_REGISTER_THREAD();
- Function and class methods
  - TAU_PROFILE(name, type, group);
- Template
  - TAU_TYPE_STRING(variable, type);
  - TAU_PROFILE(name, type, group);
  - CT(variable);
- User-defined timing
  - TAU_PROFILE_TIMER(timer, name, type, group);
  - TAU_PROFILE_START(timer);
  - TAU_PROFILE_STOP(timer);
18 Manual Instrumentation: C++ Example
#include <TAU.h>

int foo(void);   /* declared before use in main */

int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  foo();
  return 0;
}

int foo(void)
{
  TAU_PROFILE("int foo(void)", " ", TAU_DEFAULT); // measures entire foo()
  TAU_PROFILE_TIMER(t, "foo(): for loop", "[23:45 file.cpp]", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
  // other statements in foo ...
}
19 Manual Instrumentation: C Example
#include <TAU.h>

int foo(void);   /* declared before use in main */

int main(int argc, char **argv)
{
  TAU_PROFILE_TIMER(tmain, "int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  TAU_PROFILE_START(tmain);
  foo();
  TAU_PROFILE_STOP(tmain);
  return 0;
}

int foo(void)
{
  TAU_PROFILE_TIMER(t, "foo()", " ", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
}
20 Manual Instrumentation: F90 Example
cc34567 Cubes program comment line
      PROGRAM SUM_OF_CUBES
        integer profiler(2)
        save profiler
        INTEGER :: H, T, U
        call TAU_PROFILE_INIT()
        call TAU_PROFILE_TIMER(profiler, 'PROGRAM SUM_OF_CUBES')
        call TAU_PROFILE_START(profiler)
        call TAU_PROFILE_SET_NODE(0)
      ! This program prints all 3-digit numbers that
      ! equal the sum of the cubes of their digits.
        DO H = 1, 9
          DO T = 0, 9
            DO U = 0, 9
              IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
                PRINT "(3I1)", H, T, U
              ENDIF
            END DO
          END DO
        END DO
        call TAU_PROFILE_STOP(profiler)
      END PROGRAM SUM_OF_CUBES
21 Instrumenting Multithreaded Applications
#include <TAU.h>
#include <pthread.h>

void *threaded_function(void *data)
{
  TAU_REGISTER_THREAD(); // Before any other TAU calls
  TAU_PROFILE("void *threaded_function", " ", TAU_DEFAULT);
  work();
  return NULL;
}

int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  pthread_attr_t attr;
  pthread_t tid;
  pthread_attr_init(&attr);
  pthread_create(&tid, NULL, threaded_function, NULL);
  pthread_join(tid, NULL); /* wait for the thread before exiting */
  return 0;
}
22 Compiling: TAU Makefiles
- Include TAU stub Makefile (<arch>/lib) in the user's Makefile.
- Variables
  - TAU_CXX: Specify the C++ compiler used by TAU
  - TAU_CC, TAU_F90: Specify the C, F90 compilers
  - TAU_DEFS: Defines used by TAU. Add to CFLAGS
  - TAU_LDFLAGS: Linker options. Add to LDFLAGS
  - TAU_INCLUDE: Header files include path. Add to CFLAGS
  - TAU_LIBS: Statically linked TAU library. Add to LIBS
  - TAU_SHLIBS: Dynamically linked TAU library
  - TAU_MPI_LIBS: TAU's MPI wrapper library for C/C++
  - TAU_MPI_FLIBS: TAU's MPI wrapper library for F90
  - TAU_FORTRANLIBS: Must be linked in with C++ linker for F90
  - TAU_DISABLE: TAU's dummy F90 stub library
- Note: Not including TAU_DEFS in CFLAGS disables instrumentation in C/C++ programs (TAU_DISABLE for F90).
23 Including TAU's Stub Makefile
include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(CC) $(CFLAGS) -c $< -o $@
24 TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT: source-to-source translation
  - MPI: wrapper interposition library
  - Opari: OpenMP directive rewriting
25 Program Database Toolkit (PDT)
- Program code analysis framework for developing source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
  - commercial grade front end parsers
  - portable IL analyzer, database format, and access API
  - open software approach for tool development
- Target and integrate multiple source languages
- Use in TAU to build automated performance instrumentation tools
26 Program Database Toolkit
27 PDT Components
- Language front end
  - Edison Design Group (EDG): C, C++
  - Mutek Solutions Ltd.: F77, F90
  - creates an intermediate-language (IL) tree
- IL Analyzer
  - processes the intermediate language (IL) tree
  - creates program database (PDB) formatted file
- DUCTAPE (Bernd Mohr, ZAM, Germany)
  - C++ program Database Utilities and Conversion Tools APplication Environment
  - processes and merges PDB files
  - C++ library to access the PDB for PDT applications
28 TAU Makefile for PDT: C++ Example
include /usr/tau/include/Makefile
CXX = $(TAU_CXX)
CC = $(TAU_CC)
PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse
TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(PDTPARSE) $<
	$(TAUINSTR) $*.pdb $< -o $*.inst.cpp
	$(CC) $(CFLAGS) -c $*.inst.cpp -o $@
29 Instrumentation Control
- Selection of which performance events to observe
  - Could depend on scope, type, level of interest
  - Could depend on instrumentation overhead
- How is selection supported in instrumentation system?
  - No choice
  - Include / exclude lists (TAU)
  - Environment variables
  - Static vs. dynamic
- Problem: controlling instrumentation of small routines
  - High relative measurement overhead
  - Significant intrusion and possible perturbation
30 Using PDT: tau_instrumentor
% tau_instrumentor
Usage: tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file>]
For selective instrumentation, use the -f option
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be the only
# routines that are instrumented.
# To specify an include list (a list of routines that will be instrumented)
# remove the leading # to uncomment the following lines
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST
31 Rule-Based Overhead Analysis (N. Trebon, UO)
- Analyze the performance data to determine events with high (relative) overhead performance measurements
- Create a select list for excluding those events
- Rule grammar (used in TAUreduce tool)
  - [GroupName:] Field Operator Number
    - GroupName indicates rule applies to events in group
    - Field is an event metric attribute (from profile statistics)
      - numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
    - Operator is one of >, <, or =
    - Number is any number
    - Compound rules possible using & between simple rules
32 Example Rules
- Exclude all events that are members of TAU_USER and use less than 1000 microseconds
  - TAU_USER:usec < 1000
- Exclude all events that have less than 1000 microseconds and are called only once
  - usec < 1000 & numcalls = 1
- Exclude all events that have less than 1000 usecs per call OR have a (total inclusive) percent less than 5
  - usecs/call < 1000
  - percent < 5
- Scientific notation can be used
  - usec>1000 & numcalls>400000 & usecs/call<30 & percent>25
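To make the rule grammar concrete, the following sketch evaluates simple rules of the form "group, field, operator, number" against per-event metrics and builds an exclude list. It is not the TAUreduce source: `Event`, `SimpleRule`, and the helper functions are hypothetical, and a compound rule is assumed to be a conjunction of simple rules, as the "and are called only once" example suggests.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical profile event with named metrics.
struct Event {
    std::string name;
    std::string group;                      // e.g. "TAU_USER"
    std::map<std::string, double> metrics;  // e.g. {"usec", 250}, {"numcalls", 1}
};

struct SimpleRule {
    std::string group;  // empty means the rule applies to every group
    std::string field;  // metric name such as "usec" or "numcalls"
    char op;            // '<', '>' or '='
    double number;
};

// True when the event satisfies one simple rule.
bool matches(const Event& e, const SimpleRule& r) {
    if (!r.group.empty() && r.group != e.group) return false;
    const auto it = e.metrics.find(r.field);
    if (it == e.metrics.end()) return false;
    switch (r.op) {
        case '<': return it->second < r.number;
        case '>': return it->second > r.number;
        case '=': return it->second == r.number;
        default:  return false;
    }
}

// A compound rule holds only if every simple rule in it holds.
bool matches_all(const Event& e, const std::vector<SimpleRule>& compound) {
    for (const auto& r : compound) {
        if (!matches(e, r)) return false;
    }
    return true;
}

// Names of the events selected for exclusion by the compound rule.
std::vector<std::string> exclude_list(const std::vector<Event>& events,
                                      const std::vector<SimpleRule>& compound) {
    std::vector<std::string> excluded;
    for (const auto& e : events) {
        if (matches_all(e, compound)) excluded.push_back(e.name);
    }
    return excluded;
}
```

Rules that should be combined with OR can be handled by calling `exclude_list` once per rule and taking the union of the results.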
33 TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT: source-to-source translation
  - MPI: wrapper interposition library
  - Opari: OpenMP directive rewriting
34 TAU's MPI Wrapper Interposition Library
- Uses standard MPI Profiling Interface
  - Provides name-shifted interface
    - MPI_Send = PMPI_Send
  - Weak bindings
- Interpose TAU's MPI wrapper library between MPI and TAU
  - -lmpi replaced by -lTauMpi -lpmpi -lmpi
35 MPI Library Instrumentation (MPI_Send)
int MPI_Send(...) /* TAU redefines MPI_Send */
{
  ...
  int returnVal, typesize;
  TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  if (dest != MPI_PROC_NULL) {
    PMPI_Type_size(datatype, &typesize);
    TAU_TRACE_SENDMSG(tag, dest, typesize*count);
  }
  /* Wrapper calls PMPI_Send */
  returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}
36 Including TAU's Stub Makefile
include /usr/tau/sgi64/lib/Makefile.tau-mpi
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_MPI_LIBS) $(TAU_LIBS)
LD_FLAGS = $(USER_OPT) $(TAU_LDFLAGS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(CC) $(CFLAGS) -c $< -o $@
37 TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT: source-to-source translation
  - MPI: wrapper interposition library
  - Opari: OpenMP directive rewriting (FZJ, Germany)
38 Instrumentation of OpenMP Constructs
- OpenMP Pragma And Region Instrumentor
- Source-to-source translator to insert POMP calls around OpenMP constructs and API functions
- Done: supports
  - Fortran77 and Fortran90, OpenMP 2.0
  - C and C++, OpenMP 1.0
  - POMP extensions
  - EPILOG and TAU POMP implementations
  - Preserves source code information (#line line file)
- Work in progress: investigating standardization through OpenMP Forum
39 OpenMP API Instrumentation
- Transform
  - omp_###_lock() -> pomp_###_lock()
  - omp_###_nest_lock() -> pomp_###_nest_lock()
  - [ ### = init | destroy | set | unset | test ]
- POMP version
  - Calls omp version internally
  - Can do extra stuff before and after call
40 Example: !$OMP PARALLEL DO Instrumentation
Original:
!$OMP PARALLEL DO clauses...
      do loop
!$OMP END PARALLEL DO

Instrumented:
      call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)
41 Opari Instrumentation Example
- OpenMP directive instrumentation
pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5) nowait
for( i=i1;i<=i2;i++) {
  for(j=j1;j<=j2;j++){
    new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
                + a4*psi[i][j-1] - a5*the_for[i][j];
    diff=diff+fabs(new_psi[i][j]-psi[i][j]);
  }
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"
42 OPARI: Basic Usage (f90)
- Reset OPARI state information
  - rm -f opari.rc
- Call OPARI for each input source file
  - opari file1.f90 ... opari fileN.f90
- Generate OPARI runtime table, compile it with ANSI C
  - opari -table opari.tab.c
  - cc -c opari.tab.c
- Compile modified files *.mod.f90 using OpenMP
- Link the resulting object files, the OPARI runtime table opari.tab.o and the TAU POMP RTL
43 OPARI: Makefile Template (C/C++)
OMPCC = ...    # insert C OpenMP compiler here
OMPCXX = ...   # insert C++ OpenMP compiler here
.c.o:
	opari $<
	$(OMPCC) $(CFLAGS) -c $*.mod.c
.cc.o:
	opari $<
	$(OMPCXX) $(CXXFLAGS) -c $*.mod.cc
opari.init:
	rm -rf opari.rc
opari.tab.o:
	opari -table opari.tab.c
	$(CC) -c opari.tab.c
myprog: opari.init myfile*.o ... opari.tab.o
	$(OMPCC) -o myprog myfile*.o opari.tab.o -lpomp
myfile1.o: myfile1.c myheader.h
myfile2.o: ...
44 OPARI: Makefile Template (Fortran)
OMPF77 = ...   # insert f77 OpenMP compiler here
OMPF90 = ...   # insert f90 OpenMP compiler here
.f.o:
	opari $<
	$(OMPF77) $(CFLAGS) -c $*.mod.F
.f90.o:
	opari $<
	$(OMPF90) $(CXXFLAGS) -c $*.mod.F90
opari.init:
	rm -rf opari.rc
opari.tab.o:
	opari -table opari.tab.c
	$(CC) -c opari.tab.c
myprog: opari.init myfile*.o ... opari.tab.o
	$(OMPF90) -o myprog myfile*.o opari.tab.o -lpomp
myfile1.o: myfile1.f90
myfile2.o: ...
45 TAU Measurement
- Performance information
  - High-resolution timer library (real-time / virtual clocks)
  - General software counter library (user-defined events)
  - Hardware performance counters
    - PAPI (Performance API) (UTK, Ptools Consortium)
    - consistent, portable API
- Organization
  - Node, context, thread levels
  - Profile groups for collective events (runtime selective)
  - Performance data mapping between software levels
46 TAU Measurement (continued)
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events
  - TAU parallel profile database
  - Callpath profiles
  - Hardware counter values
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Timestamp synchronization
- User-configurable measurement library (user controlled)
47 TAU Measurement System Configuration
- configure [OPTIONS]
  - -c++=<CC>, -cc=<cc>: Specify C++ and C compilers
  - -pthread, -sproc: Use pthread or SGI sproc threads
  - -openmp: Use OpenMP threads
  - -opari=<dir>: Specify location of Opari OpenMP tool
  - -papi=<dir>: Specify location of PAPI
  - -pdt=<dir>: Specify location of PDT
  - -mpiinc=<d>, -mpilib=<d>: Specify MPI library instrumentation
  - -TRACE: Generate TAU event traces
  - -PROFILE: Generate TAU profiles
  - -PROFILECALLPATH: Generate callpath profiles (1-level)
  - -MULTIPLECOUNTERS: Use more than one hardware counter
  - -CPUTIME: Use user time + system time
  - -PAPIWALLCLOCK: Use PAPI to access wallclock time
  - -PAPIVIRTUAL: Use PAPI for virtual (user) time
48 TAU Measurement Configuration Examples
- ./configure -c++=xlC -cc=xlc -pdt=/usr/packages/pdtoolkit-2.1 -pthread
  - Use TAU with IBM's xlC compiler, PDT and the pthread library
  - Enable TAU profiling (default)
- ./configure -TRACE -PROFILE
  - Enable both TAU profiling and tracing
- ./configure -c++=CC -cc=cc -MULTIPLECOUNTERS -papi=/usr/local/packages/papi -opari=/usr/local/opari-pomp-1.1 -mpiinc=/usr/packages/mpich/include -mpilib=/usr/packages/mpich/lib -SGITIMERS -PAPIVIRTUAL
  - Use OpenMP+MPI using SGI's compiler suite, Opari, and use PAPI for accessing hardware performance counters and virtual time for measurements
- Typically configure multiple measurement libraries
49 Setup: Running Applications
% setenv PROFILEDIR /home/data/experiments/profile/01
% setenv TRACEDIR /home/data/experiments/trace/01
(optional)
% set path=($path <taudir>/<arch>/bin)
% setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib
For PAPI (1 counter):
% setenv PAPI_EVENT PAPI_FP_INS
For PAPI (multiplecounters):
% setenv COUNTER1 PAPI_FP_INS     (PAPI's floating point instructions)
% setenv COUNTER2 PAPI_L1_DCM     (PAPI's L1 data cache misses)
% setenv COUNTER3 P_VIRTUAL_TIME  (PAPI's virtual time)
% setenv COUNTER4 SGI_TIMERS      (wallclock time)
% mpirun -np <n> <application>
% llsubmit job.sh
50 Performance Mapping
- Associate performance with significant entities (events)
- Source code points are important
  - Functions, regions, control flow events, user events
- Execution process and thread entities are important
- Some entities are more abstract, harder to measure
- Consider callgraph (callpath) profiling
  - Measure time (metric) along an edge (path) of callgraph
    - Incident edge gives parent / child view
    - Edge sequence (path) gives parent / descendant view
- Problem: callpath profiling when callgraph is unknown
  - Determine callgraph dynamically at runtime
  - Map performance measurement to dynamic call path state
51 1-Level Callpath Implementation in TAU
- TAU maintains a performance event (routine) callstack
- Profiled routine (child) looks in callstack for parent
  - Previous profiled performance event is the parent
  - A callpath profile structure is created the first time the parent calls the child
  - TAU records parent in a callgraph map for child
  - String representing 1-level callpath used as its key
    - "a( )=>b( )": name for time spent in b when called by a
- Map returns pointer to callpath profile structure
  - 1-level callpath is profiled using this profiling data
- Build upon TAU's performance mapping technology
- Measurement is independent of instrumentation
- Use -PROFILECALLPATH to configure TAU
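The 1-level callpath scheme can be sketched with a callstack and a map keyed by a "parent => child" string. This is an illustrative approximation only: `CallpathProfiler`, `CallpathProfile`, and the key format are hypothetical and do not correspond to TAU internals.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical per-callpath profile data.
struct CallpathProfile {
    long calls = 0;
    double inclusive = 0.0;
};

class CallpathProfiler {
public:
    // On routine entry: the routine on top of the callstack is the parent.
    // The map entry is created the first time this parent calls this child.
    void enter(const std::string& routine) {
        if (!stack_.empty()) {
            profiles_[stack_.back() + " => " + routine].calls += 1;
        }
        stack_.push_back(routine);
    }

    // On routine exit: charge the measured time to the parent/child pair.
    void leave(double elapsed) {
        const std::string child = stack_.back();
        stack_.pop_back();
        if (!stack_.empty()) {
            profiles_[stack_.back() + " => " + child].inclusive += elapsed;
        }
    }

    const std::map<std::string, CallpathProfile>& profiles() const {
        return profiles_;
    }

private:
    std::vector<std::string> stack_;                   // event callstack
    std::map<std::string, CallpathProfile> profiles_;  // callpath map
};
```

With this structure, time spent in `b()` is reported separately for each caller, so `a() => b()` and `c() => b()` appear as distinct entries.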
52 TAU Analysis
- Profile analysis
  - pprof
    - parallel profiler with text-based display
  - racy
    - graphical interface to pprof (Tcl/Tk)
  - jracy
    - Java implementation of Racy
- Trace analysis and visualization
  - Trace merging and clock adjustment (if necessary)
  - Trace format conversion (ALOG, SDDF, Vampir)
  - Vampir (Pallas) trace visualization
  - Paraver (CEPBA) trace visualization
53 Pprof Command
- pprof [-c|-b|-m|-t|-e|-i] [-r] [-s] [-n num] [-f file] [-l] [nodes]
  - -c: Sort according to number of calls
  - -b: Sort according to number of subroutines called
  - -m: Sort according to msecs (exclusive time total)
  - -t: Sort according to total msecs (inclusive time total)
  - -e: Sort according to exclusive time per call
  - -i: Sort according to inclusive time per call
  - -v: Sort according to standard deviation (exclusive usec)
  - -r: Reverse sorting order
  - -s: Print only summary profile information
  - -n num: Print only first number of functions
  - -f file: Specify full path and filename without node ids
  - -l: List all functions and exit
54 TAU Parallel Performance Profiles
55 Terminology: Example
- For routine "int main( )"
  - Exclusive time
    - 100 - 20 - 50 - 20 = 10 secs
  - Inclusive time
    - 100 secs
  - Calls
    - 1 call
  - Subrs (no. of child routines called)
    - 3
  - Inclusive time/call
    - 100 secs
int main( )
{ /* takes 100 secs */
  f1(); /* takes 20 secs */
  f2(); /* takes 50 secs */
  f1(); /* takes 20 secs */
  /* other work */
}
/* Time can be replaced by counts */
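The arithmetic in this example can be written as two small helper functions. The function names are hypothetical and chosen only for illustration; the relation itself is the one used on the slide: exclusive time is inclusive time minus the inclusive time of all child calls.

```cpp
#include <numeric>
#include <vector>

// Exclusive time = inclusive time minus the inclusive time of all child calls.
double exclusive_time(double inclusive,
                      const std::vector<double>& child_inclusive) {
    return inclusive - std::accumulate(child_inclusive.begin(),
                                       child_inclusive.end(), 0.0);
}

// Inclusive time per call; returns 0 when the routine was never called.
double inclusive_per_call(double inclusive, long calls) {
    return calls > 0 ? inclusive / static_cast<double>(calls) : 0.0;
}
```

For `main` above, `exclusive_time(100.0, {20.0, 50.0, 20.0})` gives 10 seconds and `inclusive_per_call(100.0, 1)` gives 100 seconds. The same functions apply unchanged when the metric is a hardware count rather than time.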
56 jracy (NAS Parallel Benchmark LU)
[Figure: jracy displays. Captions: Routine profile across all nodes; Global profiles; Individual profile. Legend: n = node, c = context, t = thread]
57 jracy (Callpath Profiles) (R. A. Bell, UO)
[Figure: callpath profile across all nodes]
58 Vampir Trace Visualization Tool
- Visualization and Analysis of MPI Programs
- Originally developed by Forschungszentrum Jülich
- Current development by Technical University Dresden
- Distributed by PALLAS, Germany
- http://www.pallas.de/pages/vampir.htm
59 Using TAU with Vampir
- Configure TAU with -TRACE option
  - configure -TRACE -SGITIMERS
- Execute application
  - mpirun -np 4 a.out
- This generates TAU traces and event descriptors
- Merge all traces using tau_merge
  - tau_merge *.trc app.trc
- Convert traces to Vampir trace format using tau_convert
  - tau_convert -pv app.trc tau.edf app.pv
  - Note: use -vampir instead of -pv for multi-threaded traces
- Load generated trace file in Vampir
  - vampir app.pv
60 Vampir Main Window
- Trace file loading can be
- Interrupted at any time
- Resumed
- Started at a specified time offset
- Provides main menu
- Access to global and process local displays
- Preferences
- Help
- Trace file can be rewritten (regrouped symbols)
61 Vampir Timeline Diagram
- Functions organized into groups
- Coloring by group
- Message lines can be colored by tag or size
- Information about states, messages, collective,
and I/O operations available by clicking on the
representation
62 Vampir Timeline Diagram (Message Info)
- Source code references are displayed if recorded in trace
63 Vampir Execution Statistics Displays
- Aggregated profiling information: execution time, calls, inclusive/exclusive
- Available for all/any group (activity)
- Available for all routines (symbols)
- Available for any trace part (select in timeline diagram)
64 Vampir Communication Statistics Displays
- Bytes sent/received for collective operations
- Message length statistics
- Available for any trace part
- Byte and message count, min/max/avg message length, and min/max/avg bandwidth for each process pair
65 Vampir Other Features
- Dynamic global call graph tree
- Parallelism display
- Powerful filtering and trace comparison features
- All diagrams highly customizable (through context menus)
66 Vampir Process Displays
- For all selected processes in the global displays
67 Vampir (NAS Parallel Benchmark LU)
[Figure: Vampir displays: callgraph display, timeline display, parallelism display, communications display]
68 TAU Performance System Status
- Computing platforms
  - IBM SP, SGI Origin, ASCI Red, Cray T3E, Compaq SC, HP, Sun, Apple, Windows, IA-32, IA-64 (Linux), Hitachi, NEC
- Programming languages
  - C, C++, Fortran 77/90, HPF, Java
- Communication libraries
  - MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava
- Thread libraries
  - pthread, Java, Windows, SGI sproc, Tulip, SMARTS, OpenMP
- Compilers
  - KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, HP, Sun, Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC, Intel
69 PDT Status
- Program Database Toolkit (Version 2.1, web download)
  - EDG C++ front end (Version 2.45.2)
  - Mutek Fortran 90 front end (Version 2.4.1)
  - C++ and Fortran 90 IL Analyzer
  - DUCTAPE library
  - Standard C++ system header files (KCC Version 4.0f)
- PDT-constructed tools
  - TAU instrumentor (C/C++/F90)
  - Program analysis support for SILOON and CHASM
- Platforms
  - SGI, IBM, Compaq, SUN, HP, Linux (IA32/IA64), Apple, Windows, Cray T3E, Hitachi
70 Work in Progress
- Visualization
  - TAU will generate event traces with PAPI performance data. Vampir (v3.0) will support visualization of this data
- Performance Monitoring and Steering
- Performance Database Framework
71 Vampir v3.x: HPM Counter
72 Performance Monitoring and Steering
- Desirable to monitor performance during execution
  - Long-running applications
  - Steering computations for improved performance
- Large-scale parallel applications complicate solutions
  - More parallel threads of execution producing data
  - Large amount of performance data (relative) to access
  - Analysis and visualization more difficult
- Problem: online performance data access and analysis
  - Incremental profile sampling (based on files)
  - Integration in computational steering system
  - Dynamic performance measurement and access
73 Online Performance Analysis (K. Li, UO)
[Figure: online analysis pipeline. Labels: Application, TAU Performance System, performance data output, file system, accumulated samples, sample sequencing, reader synchronization, Performance Data Reader, Performance Data Integrator, Performance Analyzer, performance data streams, Performance Visualizer, SCIRun (Univ. of Utah)]
74 2D Field Performance Visualization in SCIRun
[Figure: SCIRun program]
75 Uintah Computational Framework (UCF)
- University of Utah
- UCF analysis
  - Scheduling
  - MPI library
  - Components
- 500 processes
- Use for online and offline visualization
- Apply SCIRun steering
76 Empirical-Based Performance Optimization Process
77 TAU Performance Database Framework
- profile data only
- XML representation
- project / experiment / trial
78 PerfDBF Architecture (L. Li, R. Bell, UO)
[Figure: PerfDBF pipeline. Labels: App. profiled with TAU, Standard TAU Output Data, TAU to XML Converter, TAU XML Format, Database Loader, SQL Database, Analysis Tool]
79 Scalability Analysis Process
- Scalability study on LU
  - suite.def: # of procs -> 1, 2, 4, and 8
  - mpirun -np 1 lu.W1
  - mpirun -np 2 lu.W2
  - mpirun -np 4 lu.W4
  - mpirun -np 8 lu.W8
- populateDatabase.sh
  - run Java translator to translate profiles into XML
  - run Java XML reader to write XML profiles to database
- Read times for routines and program from experiments
- Calculate scalability metrics
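The last step, calculating scalability metrics, can be sketched as below. The slides do not give the formula, so this assumes the conventional definition of speedup relative to the 1-processor trial, S(p) = T(1) / T(p); the function names and the map layout (processor count to mean time for one routine) are hypothetical.

```cpp
#include <map>
#include <stdexcept>

// Speedup on p processors relative to the 1-processor run: S(p) = T(1) / T(p).
// Input maps processor count to the mean time of one routine (assumed layout).
std::map<int, double> speedups(const std::map<int, double>& mean_time) {
    const auto base = mean_time.find(1);
    if (base == mean_time.end()) {
        throw std::invalid_argument("a 1-processor trial is required as baseline");
    }
    std::map<int, double> result;
    for (const auto& entry : mean_time) {
        if (entry.second > 0.0) {
            result[entry.first] = base->second / entry.second;
        }
    }
    return result;
}

// Parallel efficiency: speedup divided by processor count.
double efficiency(double speedup, int procs) {
    return speedup / static_cast<double>(procs);
}
```

A routine whose speedup tracks the processor count has efficiency near 1; values well below 1 at higher processor counts point to routines that limit scalability.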
80 Contents of Performance Database
81 Scalability Analysis Results
- Scalability of LU performance experiments
- Four trial runs

Funname   processors   meanspeedup
...
applu     2            2.0896117809566
applu     4            4.812100975788783
applu     8            8.168409581149514
...
exact     2            1.95853126762839071803
exact     4            4.03622321124616535446
exact     8            7.193812137750623668346
82 Current Status and Future
- PerfDBF prototype
  - TAU profile to XML translator
  - XML to PerfDB populator
  - PostgreSQL database
  - Java-based PostgreSQL query module
  - Use as a layer to support performance analysis tools
  - Make accessing the Performance Database quicker
- Continue development
- XML parallel profile representation
  - Basic specification
83 Overview
- Introduction
- Definitions, general problem
- Tuning and Analysis Utilities (TAU)
- Instrumentation
- Measurement
- Analysis
- Work in progress
- Visualization: Vampir
- Performance Monitoring and Steering
- Performance Database Framework
- Case Study: Uintah
- Conclusions
84 Case Study: Utah ASCI/ASAP Level 1 Center
- C-SAFE was established to build a problem-solving environment (PSE) for the numerical simulation of accidental fires and explosions
  - Fundamental chemistry and engineering physics models
  - Coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification
  - Very large-scale simulations
- Computer science problems
  - Coupling of multiple simulation codes
  - Software engineering across diverse expert teams
  - Achieving high performance on large-scale systems
85 Example C-SAFE Simulation Problems
[Figure: heptane fire simulation; material stress simulation]
Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics
86 Uintah High-Level Component View
87 Uintah Computational Framework
- Execution model based on software (macro) dataflow
  - Exposes parallelism and hides data transport latency
- Computations expressed as directed acyclic graphs of tasks
  - each task consumes input and produces output (input to future task)
  - input/outputs specified for each patch in a structured grid
- Abstraction of global single-assignment memory
  - DataWarehouse
  - Directory mapping names to values (array structured)
  - Write value once, then communicate to awaiting tasks
- Task graph gets mapped to processing resources
- Communications schedule approximates global optimal
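The single-assignment memory and dataflow execution described here can be illustrated with a toy store and scheduler. This is a hypothetical sketch, not the Uintah DataWarehouse or scheduler API: `SingleAssignmentStore`, `Task`, and `run` are invented names, values are plain vectors, and tasks run sequentially in whatever order their inputs become available.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-assignment store: each name may be written exactly once.
class SingleAssignmentStore {
public:
    void put(const std::string& name, std::vector<double> value) {
        if (!values_.emplace(name, std::move(value)).second) {
            throw std::logic_error("value already assigned: " + name);
        }
    }
    bool has(const std::string& name) const { return values_.count(name) != 0; }
    const std::vector<double>& get(const std::string& name) const {
        return values_.at(name);
    }

private:
    std::map<std::string, std::vector<double>> values_;
};

// A task names the values it consumes and the single value it produces.
struct Task {
    std::string name;
    std::vector<std::string> inputs;
    std::string output;
};

// Dataflow execution: a task fires once all of its inputs exist in the store.
// Returns the order in which tasks ran; the produced values are placeholders.
std::vector<std::string> run(const std::vector<Task>& tasks,
                             SingleAssignmentStore& store) {
    std::vector<std::string> order;
    std::vector<bool> done(tasks.size(), false);
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (const auto& in : tasks[i].inputs) {
                ready = ready && store.has(in);
            }
            if (!ready) continue;
            store.put(tasks[i].output, {static_cast<double>(order.size())});
            order.push_back(tasks[i].name);
            done[i] = true;
            progress = true;
        }
    }
    return order;
}
```

Because every value is written once, a task never needs to ask which version of an input it is reading, which is what lets the task graph alone determine a valid execution order.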
88 Uintah Task Graph (Material Point Method)
- Diagram of named tasks (ovals) and data (edges)
- Imminent computation
  - Dataflow-constrained
- MPM
  - Newtonian material point motion time step
  - Solid: values defined at material point (particle)
  - Dashed: values defined at vertex (grid)
  - Prime ('): values updated during time step
89 Uintah PSE
- UCF automatically sets up
  - Domain decomposition
  - Inter-processor communication with aggregation/reduction
  - Parallel I/O
  - Checkpoint and restart
  - Performance measurement and analysis (stay tuned)
- Software engineering
  - Coding standards
  - CVS (Commits: Y3 - 26.6 files/day, Y4 - 29.9 files/day)
  - Correctness regression testing with Bugzilla bug tracking
  - Nightly build (parallel compiles)
  - 170,000 lines of code (Fortran and C++ tasks supported)
90 Performance Technology Integration
- Uintah presents challenges to performance integration
  - Software diversity and structure
    - UCF middleware, simulation code modules
    - component-based hierarchy
  - Portability objectives
    - cross-language and cross-platform
    - multi-parallelism: thread, message passing, mixed
  - Scalability objectives
  - High-level programming and execution abstractions
- Requires flexible and robust performance technology
- Requires support for performance mapping
91 Task Execution in Uintah Parallel Scheduler
- Profile methods and functions in scheduler and in MPI library
[Figure: profile displays. Captions: Task execution time dominates (what task?); Task execution time distribution; MPI communication overheads (where?)]
- Need to map performance data!
92 Semantics-Based Performance Mapping
- Associate performance measurements with high-level semantic abstractions
- Need mapping support in the performance measurement system to assign data correctly
93 Semantic Entities/Attributes/Associations (SEAA)
- New dynamic mapping scheme
  - Entities defined at any level of abstraction
  - Attribute entity with semantic information
  - Entity-to-entity associations
- Two association types (implemented in TAU API)
  - Embedded: extends data structure of associated object to store performance measurement entity
  - External: creates an external look-up table using address of object as the key to locate performance measurement entity
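The two association types can be contrasted in a few lines. This is a minimal sketch with hypothetical names (`Timer`, `ExternalAssociation`, `TaskWithEmbeddedTimer`); it does not use or reproduce the TAU mapping API.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical timer standing in for a performance measurement entity.
struct Timer {
    std::string name;
    double total;
    long calls;
};

// External association: a look-up table keyed by the object's address.
class ExternalAssociation {
public:
    // Return the timer linked to obj, creating it on the first lookup.
    Timer& link(const void* obj, const std::string& name) {
        auto it = table_.find(obj);
        if (it == table_.end()) {
            it = table_.emplace(obj, Timer{name, 0.0, 0}).first;
        }
        return it->second;
    }
    std::size_t size() const { return table_.size(); }

private:
    std::unordered_map<const void*, Timer> table_;
};

// Embedded association, for contrast: the object itself carries the timer.
struct TaskWithEmbeddedTimer {
    std::string type;
    Timer timer;
};
```

The external form leaves the associated object's definition untouched at the cost of a hash lookup per access, while the embedded form avoids the lookup but requires changing the object's data structure.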
94 Uintah Task Performance Mapping
- Uintah partitions individual particles across processing elements (processes or threads)
- Simulation tasks in task graph work on particles
  - Tasks have domain-specific character in the computation
    - "interpolate particles to grid" in Material Point Method
  - Task instances generated for each partitioned particle set
  - Execution scheduled with respect to task dependencies
- How to attribute execution time among different tasks?
  - Assign semantic name (task type) to a task instance
    - SerialMPM::interpolateParticleToGrid
  - Map TAU timer object to (abstract) task (semantic entity)
  - Look up timer object using task type (semantic attribute)
  - Further partition along different domain-specific axes
95 Using External Associations
- Two-level mappings
  - Level 1: <task name, timer>
  - Level 2: <task name, patch, timer>
- Embedded association vs. external association
[Figure: labels: Hash Table, Data (object), Performance Data]
96 Task Performance Mapping Instrumentation
void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw, DataWarehouseP & dw )
{
  ...
  TAU_MAPPING_CREATE(
    task->getName(), "[MPIScheduler::execute()]",
    (TauGroup_t)(void*)task->getName(), task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer,(TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler,0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
97 Task Performance Mapping (Profile)
[Figure: profile displays. Captions: Mapped task performance across processes; Performance mapping for different tasks]
98 Task Performance Mapping (Trace)
[Figure: trace display. Captions: Work packet computation events colored by task type; Distinct phases of computation can be identified based on task]
99 Task Performance Mapping (Trace - Zoom)
[Figure: zoomed trace display. Caption: Startup communication imbalance]
100 Task Performance Mapping (Trace - Parallelism)
[Figure: parallelism display. Caption: Communication / load imbalance]
101 Comparing Uintah Traces for Scalability Analysis
102 Scaling Performance Optimizations
[Figure: scaling results. Annotations: Last year: initial correct scheduler; Reduce communication by 10x; Reduce task graph overhead by 20x. Platform: ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory]
103 Scalability to 2000 Processors (Fall 2001)
[Figure: scalability results. Platform: ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory]
104 Concluding Remarks
- Complex software and parallel computing systems pose challenging performance analysis problems that require robust methodologies and tools
- To build more sophisticated performance tools, existing proven performance technology must be utilized
- Performance tools must be integrated with software and systems models and technology
  - Performance engineered software
  - Function consistently and coherently in software and system environments
- PAPI and TAU performance systems offer robust performance technology that can be broadly integrated
105 Information
- TAU (http://www.acl.lanl.gov/tau)
- PDT (http://www.acl.lanl.gov/pdtoolkit)
- PAPI (http://icl.cs.utk.edu/projects/papi/)
- OPARI (http://www.fz-juelich.de/zam/kojak/)
106 Support Acknowledgement
- TAU and PDT support
- Department of Energy (DOE)
- DOE 2000 ACTS contract
- DOE MICS contract
- DOE ASCI Level 3 (LANL, LLNL)
- U. of Utah DOE ASCI Level 1 subcontract
- DARPA
- NSF National Young Investigator (NYI) award