Performance Technology for Complex Parallel Systems Sameer Shende, Allen D. Malony University of Oregon

Slides: 107

Provided by: allend7

Learn more at: http://www.cs.uoregon.edu

Transcript and Presenter's Notes

1
Performance Technology for Complex Parallel Systems
Sameer Shende, Allen D. Malony
University of Oregon
2
Overview

Introduction
Definitions, general problem
Tuning and Analysis Utilities (TAU)
Instrumentation
Measurement
Analysis
Work in progress
Visualization: Vampir
Performance Monitoring and Steering
Performance Database Framework
Case Study: Uintah
Conclusions

3
General Problems

How do we create robust and ubiquitous
performance technology for the analysis and
tuning of parallel and distributed software and
systems in the presence of (evolving) complexity
challenges?
How do we apply performance technology
effectively for the variety and diversity of
performance problems that arise in the context of
complex parallel and distributed computer systems?

4
Computation Model for Performance Technology

How to address dual performance technology goals?
Robust capabilities and widely available methodologies
Contend with problems of system diversity
Flexible tool composition/configuration/integration
Approaches
Restrict computation types / performance problems
limited performance technology coverage
Base technology on abstract computation model
general architecture and software execution
features
map features/methods to existing complex system
types
develop capabilities that can adapt and be
optimized

5
General Complex System Computation Model

Node: physically distinct shared-memory machine
Message-passing node interconnection network
Context: distinct virtual memory space within node
Thread: execution threads (user/system) in context

[Figure: physical view (nodes, each an SMP with node memory, joined by an interconnection network for inter-node message communication) beside the model view (VM space, contexts, threads)]
6
Definitions: Profiling

Profiling
Recording of summary information during execution
inclusive, exclusive time, calls, hardware statistics, ...
Reflects performance behavior of program entities
functions, loops, basic blocks
user-defined semantic entities
Very good for low-cost performance assessment
Helps to expose performance bottlenecks and hotspots
Implemented through
sampling: periodic OS interrupts or hardware counter traps
instrumentation: direct insertion of measurement code

7
Definitions: Tracing

Tracing
Recording of information about significant points (events) during program execution
entering/exiting code region (function, loop, block, ...)
thread/process interactions (e.g., send/receive message)
Save information in event record
timestamp
CPU identifier, thread identifier
Event type and event-specific information
Event trace is a time-sequenced stream of event
records
Can be used to reconstruct dynamic program
behavior
Typically requires code instrumentation
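
The event-record idea above can be sketched in a few lines of Python. This is an illustration only, not TAU's trace format: the `EventRecord` fields mirror the slide (timestamp, CPU and thread identifiers, event type, event-specific information), while the `"enter"`/`"exit"` kinds and the `inclusive_times` helper are assumptions made for the example. Replaying the time-ordered stream reconstructs the nesting of code regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    timestamp: float  # time at which the event occurred
    cpu: int          # CPU identifier
    thread: int       # thread identifier
    kind: str         # event type: "enter" or "exit" in this sketch
    region: str       # event-specific information: the code region name


def inclusive_times(trace):
    """Replay a trace in time order and return inclusive time per region."""
    stacks = {}  # (cpu, thread) -> stack of (region, entry timestamp)
    totals = {}
    for ev in sorted(trace, key=lambda e: e.timestamp):
        stack = stacks.setdefault((ev.cpu, ev.thread), [])
        if ev.kind == "enter":
            stack.append((ev.region, ev.timestamp))
        elif ev.kind == "exit":
            region, start = stack.pop()
            assert region == ev.region, "unbalanced enter/exit events"
            totals[region] = totals.get(region, 0.0) + (ev.timestamp - start)
    return totals


# A hypothetical four-event trace from one thread.
trace = [
    EventRecord(0.0, 0, 0, "enter", "main"),
    EventRecord(1.0, 0, 0, "enter", "work"),
    EventRecord(4.0, 0, 0, "exit", "work"),
    EventRecord(5.0, 0, 0, "exit", "main"),
]
```

Calling `inclusive_times(trace)` turns the event stream back into per-region summary data, which is the sense in which a trace can be used to reconstruct dynamic program behavior.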

8
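Event Tracing: Instrumentation, Monitor, Trace
[Figure: event tracing diagram; surviving labels: event definition, CPU A, CPU B, timestamp, MONITOR]
9
Event Tracing: Timeline Visualization
[Figure: timeline visualization; surviving labels: main, master, slave, B]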
10
TAU Performance System Framework

Tuning and Analysis Utilities
Performance system framework for scalable
parallel and distributed high-performance
computing
Targets a general complex system computation
model
nodes / contexts / threads
Multi-level system / software / parallelism
Measurement and analysis abstraction
Integrated toolkit for performance
instrumentation, measurement, analysis, and
visualization
Portable, configurable performance
profiling/tracing facility
Open software approach
University of Oregon, LANL, FZJ Germany
http://www.cs.uoregon.edu/research/paracomp/tau

11
Strategies for Empirical Performance Evaluation

Empirical performance evaluation as a series of
performance experiments
Experiment trials describing instrumentation and measurement requirements
Where/When/How axes of empirical performance space
where are performance measurements made in the program?
when is performance instrumentation done?
how are performance measurement/instrumentation methods chosen?
Strategies for achieving flexibility and
portability goals
Limited performance methods restrict evaluation
scope
Non-portable methods force use of different
techniques
Integration and combination of strategies

12
TAU Performance System Architecture
[Figure: TAU performance system architecture diagram; surviving labels: Paraver, EPILOG]
13
TAU Instrumentation Options

Manual instrumentation
TAU Profiling API
Automatic instrumentation approaches
PDT: source-to-source translation
MPI: wrapper interposition library
Opari: OpenMP directive rewriting
Binary instrumentation
JVMPI: Java virtual machine instrumentation
DyninstAPI: runtime code patching

14
TAU Instrumentation

Targets common measurement interface (TAU API)
Object-based design and implementation
Macro-based, using constructor/destructor
techniques
Program units: functions, classes, templates, blocks
Uniquely identify functions and templates
name and type signature (name registration)
static object creates performance entry
dynamic object receives static object pointer
runtime type identification for template
instantiations
C and Fortran instrumentation variants
Instrumentation and measurement optimization
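
The constructor/destructor technique can be illustrated with a Python context manager as a stand-in for the C++ object lifetime: entering the block plays the role of the constructor (start timing) and leaving it plays the role of the destructor (stop and accumulate). The `profiled` helper, the `profile` table, and the injectable `clock` parameter are all hypothetical names for this sketch and are not part of the TAU API.

```python
import time
from contextlib import contextmanager

# Region name -> [number of calls, accumulated seconds]. The entry is created
# once per name, loosely mirroring the "static object creates performance
# entry" bullet; each timed block then updates that shared entry.
profile = {}


@contextmanager
def profiled(name, clock=time.perf_counter):
    start = clock()              # "constructor": timing starts on entry
    try:
        yield
    finally:                     # "destructor": runs on every exit path
        entry = profile.setdefault(name, [0, 0.0])
        entry[0] += 1
        entry[1] += clock() - start
```

Usage would look like `with profiled("foo()"): work()`. Because the stop action sits in `finally`, the measurement is closed even when the block exits early, which is the main attraction of the scope-based approach.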

15
Multi-Level Instrumentation

Uses multiple instrumentation interfaces
Shares information: cooperation between interfaces
Taps information at multiple levels
Provides selective instrumentation at each level
Targets a common performance model
Presents a unified view of execution

16
Manual Instrumentation Using TAU

Install TAU
configure; make clean install
Instrument application
TAU Profiling API
Modify application makefile
include TAU's stub makefile, modify variables
Execute application
mpirun -np <procs> a.out
Analyze performance data
jracy, vampir, pprof, paraver

17
TAU Manual Instrumentation API

Initialization and runtime configuration
TAU_PROFILE_INIT(argc, argv);
TAU_PROFILE_SET_NODE(myNode);
TAU_PROFILE_SET_CONTEXT(myContext);
TAU_PROFILE_EXIT(message);
TAU_REGISTER_THREAD();
Function and class methods
TAU_PROFILE(name, type, group);
Template
TAU_TYPE_STRING(variable, type);
TAU_PROFILE(name, type, group);
CT(variable);
User-defined timing
TAU_PROFILE_TIMER(timer, name, type, group);
TAU_PROFILE_START(timer);
TAU_PROFILE_STOP(timer);

18
Manual Instrumentation: C++ Example
#include <TAU.h>
int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  foo();
  return 0;
}
int foo(void)
{
  TAU_PROFILE("int foo(void)", " ", TAU_DEFAULT); // measures entire foo()
  TAU_PROFILE_TIMER(t, "foo(): for loop", "[23:45 file.cpp]", TAU_USER);
  TAU_PROFILE_START(t);
  for(int i = 0; i < N; i++){
    work(i);
  }
  TAU_PROFILE_STOP(t);
  // other statements in foo
}
19
Manual Instrumentation: C Example
#include <TAU.h>
int main(int argc, char **argv)
{
  TAU_PROFILE_TIMER(tmain, "int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  TAU_PROFILE_START(tmain);
  foo();
  TAU_PROFILE_STOP(tmain);
  return 0;
}
int foo(void)
{
  TAU_PROFILE_TIMER(t, "foo()", " ", TAU_USER);
  TAU_PROFILE_START(t);
  for(int i = 0; i < N; i++){
    work(i);
  }
  TAU_PROFILE_STOP(t);
}
20
Manual Instrumentation: F90 Example
cc34567 Cubes program comment line
      PROGRAM SUM_OF_CUBES
        integer profiler(2)
        save profiler
        INTEGER :: H, T, U
        call TAU_PROFILE_INIT()
        call TAU_PROFILE_TIMER(profiler, 'PROGRAM SUM_OF_CUBES')
        call TAU_PROFILE_START(profiler)
        call TAU_PROFILE_SET_NODE(0)
      ! This program prints all 3-digit numbers that
      ! equal the sum of the cubes of their digits.
        DO H = 1, 9
          DO T = 0, 9
            DO U = 0, 9
              IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
                PRINT "(3I1)", H, T, U
              ENDIF
            END DO
          END DO
        END DO
        call TAU_PROFILE_STOP(profiler)
      END PROGRAM SUM_OF_CUBES
21
Instrumenting Multithreaded Applications
#include <TAU.h>
#include <pthread.h>
void * threaded_function(void *data)
{
  TAU_REGISTER_THREAD(); // Before any other TAU calls
  TAU_PROFILE("void * threaded_function", " ", TAU_DEFAULT);
  work();
  return NULL;
}
int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  pthread_attr_t attr;
  pthread_t tid;
  pthread_attr_init(&attr);
  pthread_create(&tid, NULL, threaded_function, NULL);
  return 0;
}
22
Compiling: TAU Makefiles

Include TAU Stub Makefile (<arch>/lib) in the user's Makefile.
Variables
TAU_CXX: Specify the C++ compiler used by TAU
TAU_CC, TAU_F90: Specify the C, F90 compilers
TAU_DEFS: Defines used by TAU. Add to CFLAGS
TAU_LDFLAGS: Linker options. Add to LDFLAGS
TAU_INCLUDE: Header files include path. Add to CFLAGS
TAU_LIBS: Statically linked TAU library. Add to LIBS
TAU_SHLIBS: Dynamically linked TAU library
TAU_MPI_LIBS: TAU's MPI wrapper library for C/C++
TAU_MPI_FLIBS: TAU's MPI wrapper library for F90
TAU_FORTRANLIBS: Must be linked in with C++ linker for F90
TAU_DISABLE: TAU's dummy F90 stub library
Note: Not including TAU_DEFS in CFLAGS disables instrumentation in C/C++ programs (TAU_DISABLE for f90).

23
Including TAU's stub Makefile
include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(CC) $(CFLAGS) -c $< -o $@
24
TAU Instrumentation Options

Manual instrumentation
TAU Profiling API
Automatic instrumentation approaches
PDT: source-to-source translation
MPI: wrapper interposition library
Opari: OpenMP directive rewriting

25
Program Database Toolkit (PDT)

Program code analysis framework for developing
source-based tools
High-level interface to source code information
Integrated toolkit for source code parsing,
database creation, and database query
commercial grade front end parsers
portable IL analyzer, database format, and access
API
open software approach for tool development
Target and integrate multiple source languages
Use in TAU to build automated performance
instrumentation tools

26
Program Database Toolkit
27
PDT Components

Language front end
Edison Design Group (EDG): C, C++
Mutek Solutions Ltd.: F77, F90
creates an intermediate-language (IL) tree
IL Analyzer
processes the intermediate language (IL) tree
creates program database (PDB) formatted file
DUCTAPE (Bernd Mohr, ZAM, Germany)
C++ program Database Utilities and Conversion Tools APplication Environment
processes and merges PDB files
C++ library to access the PDB for PDT applications

28
TAU Makefile for PDT (C++ Example)
include /usr/tau/include/Makefile
CXX = $(TAU_CXX)
CC = $(TAU_CC)
PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse
TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(PDTPARSE) $<
	$(TAUINSTR) $*.pdb $< -o $*.inst.cpp
	$(CC) $(CFLAGS) -c $*.inst.cpp -o $@
29
Instrumentation Control

Selection of which performance events to observe
Could depend on scope, type, level of interest
Could depend on instrumentation overhead
How is selection supported in instrumentation
system?
No choice
Include / exclude lists (TAU)
Environment variables
Static vs. dynamic
Problem: controlling instrumentation of small routines
High relative measurement overhead
Significant intrusion and possible perturbation

30
Using PDT: tau_instrumentor
% tau_instrumentor
Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file>]
For selective instrumentation, use the -f option
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be the
# only routines that are instrumented.
# To specify an include list (a list of routines that will be
# instrumented) remove the leading # to uncomment the following lines
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST
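
How an instrumentor might apply such a file can be sketched in Python. This is a minimal illustration under stated assumptions, not the tau_instrumentor implementation: routine signatures are matched by exact string comparison, `#` starts a comment line, and the helper names `parse_selective` and `should_instrument` are invented for the example.

```python
def parse_selective(text):
    """Return (exclude, include) lists of routine signatures from file text."""
    lists = {"EXCLUDE": [], "INCLUDE": []}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("BEGIN_") and line.endswith("_LIST"):
            current = line[len("BEGIN_"):-len("_LIST")]
        elif line.startswith("END_") and line.endswith("_LIST"):
            current = None
        elif current in lists:
            lists[current].append(line)
    return lists["EXCLUDE"], lists["INCLUDE"]


def should_instrument(routine, exclude, include):
    """An include list, when present, names the only instrumented routines."""
    if include:
        return routine in include
    return routine not in exclude
```

With only the exclude list active, every routine except the listed ones is instrumented; once the include lines are uncommented, the include list alone decides.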
31
Rule-Based Overhead Analysis (N. Trebon, UO)

Analyze the performance data to determine events
with high (relative) overhead performance
measurements
Create a select list for excluding those events
Rule grammar (used in TAUreduce tool)
[GroupName:] Field Operator Number
GroupName indicates rule applies to events in group
Field is an event metric attribute (from profile statistics)
numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
Operator is one of >, <, or =
Number is any number
Compound rules possible using & between simple rules

32
Example Rules

Exclude all events that are members of TAU_USER and use less than 1000 microseconds
TAU_USER:usec < 1000
Exclude all events that have less than 100 microseconds and are called only once
usec < 1000 & numcalls = 1
Exclude all events that have less than 1000 usecs per call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5
Scientific notation can be used
usec>1000 & numcalls>400000 & usecs/call<30 & percent>25
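
A rule evaluator for this grammar fits in a short Python sketch. It is an illustration under assumptions rather than the TAUreduce code: each event is a plain dictionary holding its metric fields plus a `"groups"` set, simple rules joined by `&` must all hold, and separate rules act as alternatives (OR). The function names are hypothetical.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def parse_simple(rule):
    """Parse '[GroupName:] Field Operator Number' into its four parts."""
    group = None
    if ":" in rule:
        group, rule = rule.split(":", 1)
        group = group.strip()
    for symbol in OPS:
        if symbol in rule:
            field, number = rule.split(symbol, 1)
            # float() also accepts scientific notation such as 1e3.
            return group, field.strip(), symbol, float(number)
    raise ValueError(f"no operator in rule: {rule!r}")


def matches(rule, event):
    """True if every '&'-joined simple rule holds for the event."""
    for simple in rule.split("&"):
        group, field, symbol, number = parse_simple(simple)
        if group is not None and group not in event["groups"]:
            return False
        if not OPS[symbol](event[field], number):
            return False
    return True


def excluded(rules, events):
    """Names of events matched by at least one rule (rules are OR-ed)."""
    return [e["name"] for e in events if any(matches(r, e) for r in rules)]
```

The names returned by `excluded` are what would go into an exclude list for the next instrumentation pass.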

33
TAU Instrumentation Options

Manual instrumentation
TAU Profiling API
Automatic instrumentation approaches
PDT: source-to-source translation
MPI: wrapper interposition library
Opari: OpenMP directive rewriting

34
TAU's MPI Wrapper Interposition Library

Uses standard MPI Profiling Interface
Provides name shifted interface
MPI_Send = PMPI_Send
Weak bindings
Interpose TAU's MPI wrapper library between MPI and TAU
-lmpi replaced by -lTauMpi -lpmpi -lmpi

35
MPI Library Instrumentation (MPI_Send)
int MPI_Send(...) /* TAU redefines MPI_Send */
...
{
  int returnVal, typesize;
  TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  if (dest != MPI_PROC_NULL) {
    PMPI_Type_size(datatype, &typesize);
    TAU_TRACE_SENDMSG(tag, dest, typesize*count);
  }
  /* Wrapper calls PMPI_Send */
  returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}
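
The interposition pattern itself is language independent, so a toy Python analogue may help: the library's real entry point is reachable under a shifted name, and a wrapper carrying the public name measures the call and forwards to it. Everything here is a stand-in (`pmpi_send` does no communication, `None` plays the role of MPI_PROC_NULL, and the `profile` and `messages` tables are hypothetical); it only mirrors the control flow of the wrapper above.

```python
import time

profile = {}   # routine name -> [calls, accumulated seconds]
messages = []  # (tag, dest, size in bytes), standing in for trace records


def pmpi_send(buf, dest, tag):
    """Stand-in for the name-shifted library entry point."""
    return 0  # pretend the send succeeded


def mpi_send(buf, dest, tag, clock=time.perf_counter):
    """Wrapper with the public name: measure, then forward to pmpi_send."""
    start = clock()
    if dest is not None:  # skip message bookkeeping for the null destination
        messages.append((tag, dest, len(buf)))
    try:
        return pmpi_send(buf, dest, tag)
    finally:
        entry = profile.setdefault("MPI_Send()", [0, 0.0])
        entry[0] += 1
        entry[1] += clock() - start
```

The application keeps calling the public name and needs no source changes; only the link order decides whether the wrapper or the plain library routine is reached.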
36
Including TAU's stub Makefile
include /usr/tau/sgi64/lib/Makefile.tau-mpi
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_MPI_LIBS) $(TAU_LIBS)
LDFLAGS = $(USER_OPT) $(TAU_LDFLAGS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(CC) $(CFLAGS) -c $< -o $@
37
TAU Instrumentation Options

Manual instrumentation
TAU Profiling API
Automatic instrumentation approaches
PDT: source-to-source translation
MPI: wrapper interposition library
Opari: OpenMP directive rewriting (FZJ, Germany)

38
Instrumentation of OpenMP Constructs

OpenMP Pragma And Region Instrumentor
Source-to-source translator to insert POMP calls around OpenMP constructs and API functions
Done: supports
Fortran77 and Fortran90, OpenMP 2.0
C and C++, OpenMP 1.0
POMP extensions
EPILOG and TAU POMP implementations
Preserves source code information (#line line file)
Work in progress: investigating standardization through OpenMP Forum

39
OpenMP API Instrumentation

Transform
omp_*_lock() -> pomp_*_lock()
omp_*_nest_lock() -> pomp_*_nest_lock()
where * is one of init, destroy, set, unset, test
POMP version
Calls omp version internally
Can do extra stuff before and after call

40
Example: !$OMP PARALLEL DO Instrumentation
Original directive:
!$OMP PARALLEL DO clauses...
      do loop
!$OMP END PARALLEL DO
After OPARI transformation:
      call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL DO
      call pomp_parallel_join(d)

41
Opari Instrumentation Example

OpenMP directive instrumentation

pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5) nowait
for( i=i1;i<=i2;i++) {
  for(j=j1;j<=j2;j++){
    new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
                + a4*psi[i][j-1] - a5*the_for[i][j];
    diff=diff+fabs(new_psi[i][j]-psi[i][j]);
  }
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"
42
OPARI Basic Usage (f90)

Reset OPARI state information
rm -f opari.rc
Call OPARI for each input source file
opari file1.f90
...
opari fileN.f90
Generate OPARI runtime table, compile it with ANSI C
opari -table opari.tab.c
cc -c opari.tab.c
Compile modified files *.mod.f90 using OpenMP
Link the resulting object files, the OPARI runtime table opari.tab.o and the TAU POMP RTL

43
OPARI Makefile Template (C/C++)
OMPCC = ... # insert C OpenMP compiler here
OMPCXX = ... # insert C++ OpenMP compiler here

.c.o:
	opari $<
	$(OMPCC) $(CFLAGS) -c $*.mod.c

.cc.o:
	opari $<
	$(OMPCXX) $(CXXFLAGS) -c $*.mod.cc

opari.init:
	rm -rf opari.rc

opari.tab.o:
	opari -table opari.tab.c
	$(CC) -c opari.tab.c

myprog: opari.init myfile*.o ... opari.tab.o
	$(OMPCC) -o myprog myfile*.o opari.tab.o -lpomp

myfile1.o: myfile1.c myheader.h
myfile2.o: ...
44
OPARI Makefile Template (Fortran)
OMPF77 = ... # insert f77 OpenMP compiler here
OMPF90 = ... # insert f90 OpenMP compiler here

.f.o:
	opari $<
	$(OMPF77) $(CFLAGS) -c $*.mod.F

.f90.o:
	opari $<
	$(OMPF90) $(CXXFLAGS) -c $*.mod.F90

opari.init:
	rm -rf opari.rc

opari.tab.o:
	opari -table opari.tab.c
	$(CC) -c opari.tab.c

myprog: opari.init myfile*.o ... opari.tab.o
	$(OMPF90) -o myprog myfile*.o opari.tab.o -lpomp

myfile1.o: myfile1.f90
myfile2.o: ...
45
TAU Measurement

Performance information
High-resolution timer library (real-time /
virtual clocks)
General software counter library (user-defined
events)
Hardware performance counters
PAPI (Performance API) (UTK, Ptools Consortium)
consistent, portable API
Organization
Node, context, thread levels
Profile groups for collective events (runtime
selective)
Performance data mapping between software levels

46
TAU Measurement (continued)

Parallel profiling
Function-level, block-level, statement-level
Supports user-defined events
TAU parallel profile database
Callpath profiles
Hardware counter values
Tracing
All profile-level events
Inter-process communication events
Timestamp synchronization
User-configurable measurement library (user
controlled)

47
TAU Measurement System Configuration

configure [OPTIONS]
-c++=<CC>, -cc=<cc>: Specify C++ and C compilers
-pthread, -sproc: Use pthread or SGI sproc threads
-openmp: Use OpenMP threads
-opari=<dir>: Specify location of Opari OpenMP tool
-papi=<dir>: Specify location of PAPI
-pdt=<dir>: Specify location of PDT
-mpiinc=<d>, -mpilib=<d>: Specify MPI library instrumentation
-TRACE: Generate TAU event traces
-PROFILE: Generate TAU profiles
-PROFILECALLPATH: Generate callpath profiles (1-level)
-MULTIPLECOUNTERS: Use more than one hardware counter
-CPUTIME: Use user time + system time
-PAPIWALLCLOCK: Use PAPI to access wallclock time
-PAPIVIRTUAL: Use PAPI for virtual (user) time

48
TAU Measurement Configuration Examples

./configure -c++=xlC -cc=xlc -pdt=/usr/packages/pdtoolkit-2.1 -pthread
Use TAU with IBM's xlC compiler, PDT and the pthread library
Enable TAU profiling (default)
./configure -TRACE -PROFILE
Enable both TAU profiling and tracing
./configure -c++=CC -cc=cc -MULTIPLECOUNTERS -papi=/usr/local/packages/papi -opari=/usr/local/opari-pomp-1.1 -mpiinc=/usr/packages/mpich/include -mpilib=/usr/packages/mpich/lib -SGITIMERS -PAPIVIRTUAL
Use OpenMP+MPI using SGI's compiler suite, Opari and use PAPI for accessing hardware performance counters and virtual time for measurements
Typically configure multiple measurement libraries

49
Setup: Running Applications
% setenv PROFILEDIR /home/data/experiments/profile/01
% setenv TRACEDIR /home/data/experiments/trace/01 (optional)
% set path=($path <taudir>/<arch>/bin)
% setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib
For PAPI (1 counter):
% setenv PAPI_EVENT PAPI_FP_INS
For PAPI (multiplecounters):
% setenv COUNTER1 PAPI_FP_INS (PAPI's floating point ins)
% setenv COUNTER2 PAPI_L1_DCM (PAPI's L1 data cache misses)
% setenv COUNTER3 P_VIRTUAL_TIME (PAPI's virtual time)
% setenv COUNTER4 SGI_TIMERS (wallclock time)
% mpirun -np <n> <application>
% llsubmit job.sh
50
Performance Mapping

Associate performance with significant entities
(events)
Source code points are important
Functions, regions, control flow events, user
events
Execution process and thread entities are
important
Some entities are more abstract, harder to
measure
Consider callgraph (callpath) profiling
Measure time (metric) along an edge (path) of
callgraph
Incident edge gives parent / child view
Edge sequence (path) gives parent / descendant
view
Problem: callpath profiling when callgraph is unknown
Determine callgraph dynamically at runtime
Map performance measurement to dynamic call path
state

51
1-Level Callpath Implementation in TAU

TAU maintains a performance event (routine)
callstack
Profiled routine (child) looks in callstack for
parent
Previous profiled performance event is the parent
A callpath profile structure created first time
parent calls
TAU records parent in a callgraph map for child
String representing 1-level callpath used as its
key
"a( ) => b( )": name for time spent in "b" when called by "a"
Map returns pointer to callpath profile structure
1-level callpath is profiled using this profiling data
Build upon TAU's performance mapping technology
Measurement is independent of instrumentation
Use -PROFILECALLPATH to configure TAU
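
The bookkeeping described on this slide can be sketched in Python. This is an illustration of the idea, not TAU's implementation: the profiler keeps a callstack of active routines, takes the previous entry as the parent, and uses the string `"parent => child"` as the key into a map of callpath profile entries. The class and method names are assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
class CallpathProfiler:
    def __init__(self):
        self.callstack = []  # active routines: (name, parent name, start time)
        self.callpaths = {}  # "parent => child" -> [calls, inclusive time]

    def enter(self, routine, now):
        # The previously entered (still active) routine is the parent.
        parent = self.callstack[-1][0] if self.callstack else None
        self.callstack.append((routine, parent, now))

    def exit(self, now):
        routine, parent, start = self.callstack.pop()
        if parent is not None:
            key = f"{parent} => {routine}"
            # The entry is created the first time this parent calls the child.
            entry = self.callpaths.setdefault(key, [0, 0.0])
            entry[0] += 1
            entry[1] += now - start
```

Time spent in `b` is then reported separately for each caller, so `a => b` and `c => b` appear as distinct entries even though both refer to the same routine.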

52
TAU Analysis

Profile analysis
pprof
parallel profiler with text-based display
racy
graphical interface to pprof (Tcl/Tk)
jracy
Java implementation of Racy
Trace analysis and visualization
Trace merging and clock adjustment (if necessary)
Trace format conversion (ALOG, SDDF, Vampir)
Vampir (Pallas) trace visualization
Paraver (CEPBA) trace visualization

53
Pprof Command

pprof [-c|-b|-m|-t|-e|-i] [-r] [-s] [-n num] [-f file] [-l] [nodes]
-c Sort according to number of calls
-b Sort according to number of subroutines called
-m Sort according to msecs (exclusive time total)
-t Sort according to total msecs (inclusive time
total)
-e Sort according to exclusive time per call
-i Sort according to inclusive time per call
-v Sort according to standard deviation
(exclusive usec)
-r Reverse sorting order
-s Print only summary profile information
-n num Print only first number of functions
-f file Specify full path and filename without
node ids
-l List all functions and exit

54
TAU Parallel Performance Profiles
55
Terminology: Example

For routine "int main( )"
Exclusive time
100-20-50-20 = 10 secs
Inclusive time
100 secs
Calls
1 call
Subrs (no. of child routines called)
3
Inclusive time/call
100 secs

int main( )
{ /* takes 100 secs */
  f1(); /* takes 20 secs */
  f2(); /* takes 50 secs */
  f1(); /* takes 20 secs */
  /* other work */
}
/* Time can be replaced by counts */
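
The arithmetic on this slide can be written as a small Python helper. It is only a worked restatement of the definitions (exclusive time is inclusive time minus the inclusive time of the child calls); the function name `summarize` and the tuple layout are choices made for the example.

```python
def summarize(inclusive, children):
    """Summarize one routine.

    inclusive: inclusive time of the routine, in seconds.
    children:  list of (child routine name, inclusive seconds), one per call.
    """
    return {
        "inclusive": inclusive,
        "exclusive": inclusive - sum(t for _, t in children),
        "subrs": len(children),  # number of child routine calls
    }


# The values from the slide: main takes 100 s and calls f1, f2, f1.
main_profile = summarize(100, [("f1", 20), ("f2", 50), ("f1", 20)])
```

The same computation applies unchanged when the metric is a hardware count rather than time.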
56
jracy (NAS Parallel Benchmark LU)
Routine profile across all nodes
Global profiles
n: node, c: context, t: thread
Individual profile
57
jracy (Callpath Profiles) (R. A. Bell, UO)
Callpath profile across all nodes
58
Vampir Trace Visualization Tool

Visualization and Analysis of MPI Programs
Originally developed by Forschungszentrum Jülich
Current development by Technical University
Dresden
Distributed by PALLAS, Germany

http://www.pallas.de/pages/vampir.htm

59
Using TAU with Vampir

Configure TAU with -TRACE option
configure -TRACE -SGITIMERS
Execute application
mpirun -np 4 a.out
This generates TAU traces and event descriptors
Merge all traces using tau_merge
tau_merge *.trc app.trc
Convert traces to Vampir Trace format using tau_convert
tau_convert -pv app.trc tau.edf app.pv
Note: Use -vampir instead of -pv for multi-threaded traces
Load generated trace file in Vampir
vampir app.pv
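
What a merge step has to do can be sketched in Python: combine several per-process event streams, each already sorted by time, into one time-ordered stream, optionally correcting each process clock first. This is a conceptual sketch and not the tau_merge implementation; the tuple layout `(timestamp, process, description)` and the constant per-process `clock_offsets` are assumptions made for the example.

```python
import heapq


def merge_traces(per_process_traces, clock_offsets=None):
    """Merge per-process event lists (each sorted by time) into one stream.

    Events are (timestamp, process, description) tuples. clock_offsets maps a
    process id to a constant correction that is added to its timestamps.
    """
    clock_offsets = clock_offsets or {}

    def adjusted(events):
        for ts, proc, what in events:
            yield (ts + clock_offsets.get(proc, 0.0), proc, what)

    # heapq.merge performs a streaming k-way merge of sorted inputs.
    return list(heapq.merge(*(adjusted(t) for t in per_process_traces)))
```

A constant offset preserves the order within each stream, so the inputs stay sorted and the k-way merge remains valid after adjustment.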

60
Vampir Main Window

Trace file loading can be
Interrupted at any time
Resumed
Started at a specified time offset
Provides main menu
Access to global and process local displays
Preferences
Help
Trace file can be rewritten (regrouped symbols)

61
Vampir Timeline Diagram

Functions organized into groups
Coloring by group
Message lines can be colored by tag or size

Information about states, messages, collective,
and I/O operations available by clicking on the
representation

62
Vampir Timeline Diagram (Message Info)

Sourcecode references are displayed if recorded
in trace

63
Vampir Execution Statistics Displays

Aggregated profiling information: execution time, calls, inclusive/exclusive
Available for all/any group (activity)
Available for all routines (symbols)
Available for any trace part (select in timeline
diagram)

64
Vampir Communication Statistics Displays

Bytes sent/received for collective operations
Message length statistics
Available for any trace part

Byte and message count, min/max/avg message length and min/max/avg bandwidth for each process pair

65
Vampir Other Features

Dynamic global call graph tree

Parallelism display
Powerful filtering and trace comparison features
All diagrams highly customizable (through context
menus)

66
Vampir Process Displays

Activity chart

Call tree

Timeline

For all selected processes in the global displays

67
Vampir (NAS Parallel Benchmark LU)
Callgraph display
Timeline display
Parallelism display
Communications display
68
TAU Performance System Status

Computing platforms
IBM SP, SGI Origin, ASCI Red, Cray T3E, Compaq
SC, HP, Sun, Apple, Windows, IA-32, IA-64
(Linux), Hitachi, NEC
Programming languages
C, C++, Fortran 77/90, HPF, Java
Communication libraries
MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava
Thread libraries
pthread, Java, Windows, SGI sproc, Tulip, SMARTS, OpenMP
Compilers
KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, HP, Sun,
Microsoft, SGI, Cray, IBM, HP, Compaq, Hitachi,
NEC, Intel

69
PDT Status

Program Database Toolkit (Version 2.1, web
download)
EDG C++ front end (Version 2.45.2)
Mutek Fortran 90 front end (Version 2.4.1)
C++ and Fortran 90 IL Analyzer
DUCTAPE library
Standard C++ system header files (KCC Version 4.0f)
PDT-constructed tools
TAU instrumentor (C/C++/F90)
Program analysis support for SILOON and CHASM
Platforms
SGI, IBM, Compaq, SUN, HP, Linux (IA32/IA64),
Apple, Windows, Cray T3E, Hitachi

70
Work in Progress

Visualization
TAU will generate event-traces with PAPI
performance data. Vampir (v3.0) will support
visualization of this data
Performance Monitoring and Steering
Performance Database Framework

71
Vampir v3.x HPM Counter

Counter Timeline Display

Process Timeline Display

72
Performance Monitoring and Steering

Desirable to monitor performance during execution
Long-running applications
Steering computations for improved performance
Large-scale parallel applications complicate
solutions
More parallel threads of execution producing data
Large amount of performance data (relative) to
access
Analysis and visualization more difficult
Problem: online performance data access and analysis
Incremental profile sampling (based on files)
Integration in computational steering system
Dynamic performance measurement and access

73
Online Performance Analysis (K. Li, UO)
[Figure: online performance analysis architecture; surviving labels: SCIRun (Univ. of Utah), performance visualizer, application, // performance data streams, TAU performance system, performance analyzer, // performance data output, accumulated samples, performance data reader, performance data integrator, file system, sample sequencing, reader synchronization]
74
2D Field Performance Visualization in SCIRun
SCIRun program
75
Uintah Computational Framework (UCF)

University of Utah
UCF analysis
Scheduling
MPI library
Components
500 processes
Use for online and offline visualization
Apply SCIRun steering

76
Empirical-Based Performance Optimization
Process
77
TAU Performance Database Framework

profile data only
XML representation
project / experiment / trial

78
PerfDBF Architecture (L. Li, R. Bell, UO)
App. profiled with TAU
Standard TAU Output Data
TAU XML Format
TAU to XML Converter
Database Loader
SQL Database
Analysis Tool
79
Scalability Analysis Process

Scalability study on LU
suite.def: # of procs -> 1, 2, 4, and 8
mpirun -np 1 lu.W1
mpirun -np 2 lu.W2
mpirun -np 4 lu.W4
mpirun -np 8 lu.W8
populateDatabase.sh
run Java translator to translate profiles into
XML
run Java XML reader to write XML profiles to
database
Read times for routines and program from
experiments
Calculate scalability metrics

80
Contents of Performance Database
81
Scalability Analysis Results

Scalability of LU performance experiments
Four trial runs
Funname   processors   meanspeedup
...
applu     2            2.0896117809566
applu     4            4.812100975788783
applu     8            8.168409581149514
exact     2            1.95853126762839071803
exact     4            4.03622321124616535446
exact     8            7.193812137750623668346
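
The metric in this table is presumably the usual speedup, the single-processor time divided by the p-processor time; the deck does not spell out how the mean is taken, so the Python sketch below only shows the basic calculation. The function names and the timing values in the usage note are hypothetical and are not taken from the LU experiments.

```python
def speedups(times):
    """times: {processors: seconds}. Returns {processors: T(1) / T(p)}."""
    base = times[1]
    return {p: base / t for p, t in sorted(times.items()) if p != 1}


def efficiency(times):
    """Parallel efficiency: speedup divided by the processor count."""
    return {p: s / p for p, s in speedups(times).items()}
```

For example, with made-up timings `{1: 80.0, 2: 40.0, 4: 16.0, 8: 10.0}`, `speedups` gives 2.0, 5.0 and 8.0 for 2, 4 and 8 processors. An efficiency above 1, as for the 4-processor entries in the table, indicates superlinear speedup.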

82
Current Status and Future

PerfDBF prototype
TAU profile to XML translator
XML to PerfDB populator
PostgreSQL database
Java-based PostgreSQL query module
Use as a layer to support performance analysis
tools
Make accessing the Performance Database quicker
Continue development
XML parallel profile representation
Basic specification

83
Overview

Introduction
Definitions, general problem
Tuning and Analysis Utilities (TAU)
Instrumentation
Measurement
Analysis
Work in progress
Visualization: Vampir
Performance Monitoring and Steering
Performance Database Framework
Case Study: Uintah
Conclusions

84
Case Study: Utah ASCI/ASAP Level 1 Center

C-SAFE was established to build a problem-solving
environment (PSE) for the numerical simulation of
accidental fires and explosions
Fundamental chemistry and engineering physics
models
Coupled with non-linear solvers, optimization,
computational steering, visualization, and
experimental data verification
Very large-scale simulations
Computer science problems
Coupling of multiple simulation codes
Software engineering across diverse expert teams
Achieving high performance on large-scale systems

85
Example C-SAFE Simulation Problems
Heptane fire simulation
Typical C-SAFE simulation with a billion degrees
of freedom and non-linear time dynamics
Material stress simulation
86
Uintah High-Level Component View
87
Uintah Computational Framework

Execution model based on software (macro)
dataflow
Exposes parallelism and hides data transport
latency
Computations expressed as directed acyclic graphs of tasks
each task consumes input and produces output (input to future task)
input/outputs specified for each patch in a structured grid
Abstraction of global single-assignment memory: DataWarehouse
Directory mapping names to values (array
structured)
Write value once then communicate to awaiting
tasks
Task graph gets mapped to processing resources
Communications schedule approximates global
optimal

88
Uintah Task Graph (Material Point Method)

Diagram of named tasks (ovals) and data (edges)
Imminent computation
Dataflow-constrained
MPM
Newtonian material point motion time step
Solid values defined at material point
(particle)
Dashed values defined at vertex (grid)
Prime (') values updated during time step

89
Uintah PSE

UCF automatically sets up
Domain decomposition
Inter-processor communication with
aggregation/reduction
Parallel I/O
Checkpoint and restart
Performance measurement and analysis (stay tuned)
Software engineering
Coding standards
CVS (Commits Y3 - 26.6 files/day, Y4 - 29.9
files/day)
Correctness regression testing with bugzilla bug
tracking
Nightly build (parallel compiles)
170,000 lines of code (Fortran and C++ tasks supported)

90
Performance Technology Integration

Uintah presents challenges to performance integration
Software diversity and structure
UCF middleware, simulation code modules
component-based hierarchy
Portability objectives
cross-language and cross-platform
multi-parallelism: thread, message passing, mixed
Scalability objectives
High-level programming and execution abstractions
Requires flexible and robust performance
technology
Requires support for performance mapping

91
Task Execution in Uintah Parallel Scheduler

Profile methods and functions in scheduler and in
MPI library

Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)

Need to map performance data!

92
Semantics-Based Performance Mapping

Associate performance measurements with
high-level semantic abstractions
Need mapping support in the performance
measurement system to assign data correctly

93
Semantic Entities/Attributes/Associations (SEAA)

New dynamic mapping scheme
Entities defined at any level of abstraction
Attribute entity with semantic information
Entity-to-entity associations
Two association types (implemented in TAU API)
Embedded: extends the data structure of the
associated object to store the performance
measurement entity
External: creates an external look-up table
using the address of the object as the key to
locate the performance measurement entity
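
The two association types can be sketched as follows. This is an illustration under stated assumptions, not the TAU implementation: Timer, EmbeddedTask, PlainTask and ExternalAssociations are hypothetical names, and Timer is only a stand-in for a performance measurement entity.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-in for a performance measurement entity (not the TAU timer class).
struct Timer {
    std::string name;
    double inclusiveSeconds = 0.0;
    long calls = 0;
    void record(double seconds) { inclusiveSeconds += seconds; ++calls; }
};

// Embedded association: the application object itself is extended with a
// field that points at its measurement entity.
struct EmbeddedTask {
    std::string type;
    Timer* timer = nullptr;  // extra member added to the object's data structure
};

// An application object that cannot be modified; it needs the external table.
struct PlainTask {
    std::string type;
};

// External association: a look-up table keyed by the object's address.
class ExternalAssociations {
public:
    void link(const void* object, Timer* timer) { table_[object] = timer; }
    // Returns the linked timer, or nullptr if the object was never linked.
    Timer* find(const void* object) const {
        auto it = table_.find(object);
        return it == table_.end() ? nullptr : it->second;
    }
private:
    std::unordered_map<const void*, Timer*> table_;
};
```

The trade-off in this sketch: the embedded form reaches the timer with one pointer dereference but requires changing the object's type, while the external form leaves the object untouched at the cost of a hash look-up per access.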

94
Uintah Task Performance Mapping

Uintah partitions individual particles across
processing elements (processes or threads)
Simulation tasks in task graph work on particles
Tasks have domain-specific character in the
computation
interpolate particles to grid in Material Point
Method
Task instances generated for each partitioned
particle set
Execution scheduled with respect to task
dependencies
How to attribute execution time among different
tasks?
Assign semantic name (task type) to a task
instance
SerialMPM::interpolateParticleToGrid
Map TAU timer object to (abstract) task (semantic
entity)
Look up timer object using task type (semantic
attribute)
Further partition along different domain-specific
axes

95
Using External Associations

Two level mappings
Level 1: <task name, timer>
Level 2: <task name, patch, timer>
Embedded association vs External
association

[Figure labels: Hash Table, Data (object), Performance Data]
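
A minimal sketch of a two-level mapping held in hash tables, assuming patches are identified by an integer id. TaskTimers and its member names are hypothetical; the slide does not show how the tables are actually laid out.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-in for the performance data attached to one mapping entry.
struct Timer {
    double seconds = 0.0;
    long calls = 0;
    void record(double s) { seconds += s; ++calls; }
};

class TaskTimers {
public:
    // Level 1: <task name, timer>. The entry is created on first use.
    Timer& byTask(const std::string& task) { return level1_[task]; }

    // Level 2: <task name, patch, timer>. The entry is created on first use.
    Timer& byTaskAndPatch(const std::string& task, int patch) {
        return level2_[task][patch];
    }

    // Charge one task execution to both levels of the mapping.
    void record(const std::string& task, int patch, double seconds) {
        byTask(task).record(seconds);
        byTaskAndPatch(task, patch).record(seconds);
    }

private:
    std::unordered_map<std::string, Timer> level1_;
    std::unordered_map<std::string, std::unordered_map<int, Timer>> level2_;
};
```

Level 1 answers "how much time did this task type take overall", while level 2 splits the same time by patch, so the per-patch timers of a task sum to its level 1 timer.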
96
Task Performance Mapping Instrumentation

void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw,
                           DataWarehouseP & dw )
{
  ...
  TAU_MAPPING_CREATE(
    task->getName(), "MPIScheduler::execute()",
    (TauGroup_t)(void*)task->getName(),
    task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer,
                   (TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
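
The create, link, start and stop sequence above can be imitated without TAU. The following is a stand-in sketch under stated assumptions: MappedTimer, TaskInstance and MappedProfiler are hypothetical names, the table is keyed by the task name's string value rather than by an address, and std::chrono::steady_clock replaces TAU's measurement layer.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <unordered_map>

// Stand-in for a mapped timer; this is not the TAU implementation.
struct MappedTimer {
    double seconds = 0.0;
    long calls = 0;
};

// Hypothetical task instance: a task-type name plus the work to run.
struct TaskInstance {
    std::string name;
    std::function<void()> doit;
};

class MappedProfiler {
public:
    // Find or create the timer for the task's name, time one execution,
    // and charge the elapsed wall-clock time to that timer.
    void execute(const TaskInstance& task) {
        MappedTimer& timer = timers_[task.name];        // create and link
        auto start = std::chrono::steady_clock::now();  // start
        task.doit();
        auto stop = std::chrono::steady_clock::now();   // stop
        timer.seconds += std::chrono::duration<double>(stop - start).count();
        ++timer.calls;
    }

    // Returns the timer recorded for a task name, or nullptr if none exists.
    const MappedTimer* find(const std::string& name) const {
        auto it = timers_.find(name);
        return it == timers_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, MappedTimer> timers_;
};
```

All instances that share a task name accumulate into one timer, so time spent inside the scheduler's generic execute routine is reported per task type instead of as a single scheduler entry.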

97
Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
98
Task Performance Mapping (Trace)
Work packet computation events colored by task
type
Distinct phases of computation can be identified
based on task
99
Task Performance Mapping (Trace - Zoom)
Startup communication imbalance
100
Task Performance Mapping (Trace - Parallelism)
Communication / load imbalance
101
Comparing Uintah Traces for Scalability Analysis
102
Scaling Performance Optimizations
Last year: initial correct scheduler
Reduce communication by 10x
Reduce task graph overhead by 20x
ASCI Nirvana, SGI Origin 2000, Los Alamos National
Laboratory
103
Scalability to 2000 Processors (Fall 2001)
ASCI Nirvana, SGI Origin 2000, Los Alamos National
Laboratory
104
Concluding Remarks

Complex software and parallel computing systems
pose challenging performance analysis problems
that require robust methodologies and tools
To build more sophisticated performance tools,
existing proven performance technology must be
utilized
Performance tools must be integrated with
software and systems models and technology
Performance engineered software
Function consistently and coherently in software
and system environments
PAPI and TAU performance systems offer robust
performance technology that can be broadly
integrated

105
Information

TAU (http://www.acl.lanl.gov/tau)
PDT (http://www.acl.lanl.gov/pdtoolkit)
PAPI (http://icl.cs.utk.edu/projects/papi/)
OPARI (http://www.fz-juelich.de/zam/kojak/)

106
Support Acknowledgement

TAU and PDT support
Department of Energy (DOE)
DOE 2000 ACTS contract
DOE MICS contract
DOE ASCI Level 3 (LANL, LLNL)
U. of Utah DOE ASCI Level 1 subcontract
DARPA
NSF National Young Investigator (NYI) award
