Performance Technology for Complex Parallel Systems. Sameer Shende, Allen D. Malony, University of Oregon

1
Performance Technology for Complex Parallel Systems
Sameer Shende, Allen D. Malony
University of Oregon
2
Overview
  • Introduction
  • Definitions, general problem
  • Tuning and Analysis Utilities (TAU)
  • Instrumentation
  • Measurement
  • Analysis
  • Work in progress
  • Visualization: Vampir
  • Performance Monitoring and Steering
  • Performance Database Framework
  • Case Study: Uintah
  • Conclusions

3
General Problems
  • How do we create robust and ubiquitous
    performance technology for the analysis and
    tuning of parallel and distributed software and
    systems in the presence of (evolving) complexity
    challenges?
  • How do we apply performance technology
    effectively for the variety and diversity of
    performance problems that arise in the context of
    complex parallel and distributed computer systems?

4
Computation Model for Performance Technology
  • How to address dual performance technology goals?
  • Robust capabilities + widely available
    methodologies
  • Contend with problems of system diversity
  • Flexible tool composition/configuration/integration
  • Approaches
  • Restrict computation types / performance problems
  • limited performance technology coverage
  • Base technology on abstract computation model
  • general architecture and software execution
    features
  • map features/methods to existing complex system
    types
  • develop capabilities that can adapt and be
    optimized

5
General Complex System Computation Model
  • Node: physically distinct shared memory machine
  • Message passing node interconnection network
  • Context: distinct virtual memory space within
    node
  • Thread: execution threads (user/system) in context

[Diagram: physical view shows nodes (each an SMP with node memory) connected by an interconnection network carrying inter-node message communication; model view shows a node containing contexts (VM spaces), each with threads]
6
Definitions: Profiling
  • Profiling
  • Recording of summary information during execution
  • inclusive/exclusive time, calls, hardware
    statistics, ...
  • Reflects performance behavior of program entities
  • functions, loops, basic blocks
  • user-defined semantic entities
  • Very good for low-cost performance assessment
  • Helps to expose performance bottlenecks and
    hotspots
  • Implemented through
  • sampling: periodic OS interrupts or hardware
    counter traps
  • instrumentation: direct insertion of measurement
    code

7
Definitions: Tracing
  • Tracing
  • Recording of information about significant points
    (events) during program execution
  • entering/exiting code region (function, loop,
    block, ...)
  • thread/process interactions (e.g., send/receive
    message)
  • Save information in event record
  • timestamp
  • CPU identifier, thread identifier
  • Event type and event-specific information
  • Event trace is a time-sequenced stream of event
    records
  • Can be used to reconstruct dynamic program
    behavior
  • Typically requires code instrumentation

8
Event Tracing: Instrumentation, Monitor, Trace
[Diagram: instrumented code on CPU A and CPU B emits events to a monitor, which timestamps each record according to the event definitions and writes it to the event trace]
9
Event Tracing: Timeline Visualization
[Timeline diagram: main, master, and slave activity per process over time, with message lines between processes]
10
TAU Performance System Framework
  • Tuning and Analysis Utilities
  • Performance system framework for scalable
    parallel and distributed high-performance
    computing
  • Targets a general complex system computation
    model
  • nodes / contexts / threads
  • Multi-level system / software / parallelism
  • Measurement and analysis abstraction
  • Integrated toolkit for performance
    instrumentation, measurement, analysis, and
    visualization
  • Portable, configurable performance
    profiling/tracing facility
  • Open software approach
  • University of Oregon, LANL, FZJ Germany
  • http://www.cs.uoregon.edu/research/paracomp/tau

11
Strategies for Empirical Performance Evaluation
  • Empirical performance evaluation as a series of
    performance experiments
  • Experiment trials describing instrumentation and
    measurement requirements
  • Where/When/How axes of empirical performance
    space
  • where are performance measurements made in
    program
  • when is performance instrumentation done
  • how are performance measurement/instrumentation
    chosen
  • Strategies for achieving flexibility and
    portability goals
  • Limited performance methods restrict evaluation
    scope
  • Non-portable methods force use of different
    techniques
  • Integration and combination of strategies

12
TAU Performance System Architecture
[Architecture diagram: instrumentation, measurement, and analysis layers of TAU; trace output usable by Vampir, Paraver, and EPILOG]
13
TAU Instrumentation Options
  • Manual instrumentation
  • TAU Profiling API
  • Automatic instrumentation approaches
  • PDT: source-to-source translation
  • MPI: wrapper interposition library
  • Opari: OpenMP directive rewriting
  • Binary
  • JVMPI: Java virtual machine instrumentation
  • DyninstAPI: runtime code patching

14
TAU Instrumentation
  • Targets common measurement interface (TAU API)
  • Object-based design and implementation
  • Macro-based, using constructor/destructor
    techniques
  • Program units: functions, classes, templates,
    blocks
  • Uniquely identify functions and templates
  • name and type signature (name registration)
  • static object creates performance entry
  • dynamic object receives static object pointer
  • runtime type identification for template
    instantiations
  • C and Fortran instrumentation variants
  • Instrumentation and measurement optimization

15
Multi-Level Instrumentation
  • Uses multiple instrumentation interfaces
  • Shares information cooperation between
    interfaces
  • Taps information at multiple levels
  • Provides selective instrumentation at each level
  • Targets a common performance model
  • Presents a unified view of execution

16
Manual Instrumentation Using TAU
  • Install TAU
  • configure; make clean install
  • Instrument application
  • TAU Profiling API
  • Modify application makefile
  • include TAU's stub makefile, modify variables
  • Execute application
  • mpirun -np <procs> a.out
  • Analyze performance data
  • jracy, vampir, pprof, paraver

17
TAU Manual Instrumentation API
  • Initialization and runtime configuration
  • TAU_PROFILE_INIT(argc, argv);
    TAU_PROFILE_SET_NODE(myNode);
    TAU_PROFILE_SET_CONTEXT(myContext);
    TAU_PROFILE_EXIT(message);
    TAU_REGISTER_THREAD();
  • Function and class methods
  • TAU_PROFILE(name, type, group);
  • Template
  • TAU_TYPE_STRING(variable, type);
    TAU_PROFILE(name, type, group);
    CT(variable);
  • User-defined timing
  • TAU_PROFILE_TIMER(timer, name, type, group);
    TAU_PROFILE_START(timer);
    TAU_PROFILE_STOP(timer);

18
Manual Instrumentation: C++ Example
#include <TAU.h>
int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  foo();
  return 0;
}

int foo(void)
{
  TAU_PROFILE("int foo(void)", " ", TAU_DEFAULT); // measures entire foo()
  TAU_PROFILE_TIMER(t, "foo(): for loop", "[23:45 file.cpp]", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
  // other statements in foo ...
}
19
Manual Instrumentation: C Example
#include <TAU.h>
int main(int argc, char **argv)
{
  TAU_PROFILE_TIMER(tmain, "int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  TAU_PROFILE_START(tmain);
  foo();
  TAU_PROFILE_STOP(tmain);
  return 0;
}

int foo(void)
{
  TAU_PROFILE_TIMER(t, "foo()", " ", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
}
20
Manual Instrumentation: F90 Example
cc34567 Cubes program - comment line
      PROGRAM SUM_OF_CUBES
        integer profiler(2)
        save profiler
        INTEGER :: H, T, U
        call TAU_PROFILE_INIT()
        call TAU_PROFILE_TIMER(profiler, 'PROGRAM SUM_OF_CUBES')
        call TAU_PROFILE_START(profiler)
        call TAU_PROFILE_SET_NODE(0)
! This program prints all 3-digit numbers that
! equal the sum of the cubes of their digits.
        DO H = 1, 9
          DO T = 0, 9
            DO U = 0, 9
              IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
                PRINT "(3I1)", H, T, U
              ENDIF
            END DO
          END DO
        END DO
        call TAU_PROFILE_STOP(profiler)
      END PROGRAM SUM_OF_CUBES
21
Instrumenting Multithreaded Applications
#include <TAU.h>
void *threaded_function(void *data)
{
  TAU_REGISTER_THREAD(); // Before any other TAU calls
  TAU_PROFILE("void *threaded_function(void *)", " ", TAU_DEFAULT);
  work();
}

int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  pthread_attr_t attr;
  pthread_t tid;
  pthread_attr_init(&attr);
  pthread_create(&tid, NULL, threaded_function, NULL);
  return 0;
}
22
Compiling TAU Makefiles
  • Include TAU Stub Makefile (<arch>/lib) in the
    user's Makefile.
  • Variables
  • TAU_CXX: specifies the C++ compiler used by TAU
  • TAU_CC, TAU_F90: specify the C, F90 compilers
  • TAU_DEFS: defines used by TAU. Add to CFLAGS
  • TAU_LDFLAGS: linker options. Add to LDFLAGS
  • TAU_INCLUDE: header files include path. Add to
    CFLAGS
  • TAU_LIBS: statically linked TAU library. Add to
    LIBS
  • TAU_SHLIBS: dynamically linked TAU library
  • TAU_MPI_LIBS: TAU's MPI wrapper library for C/C++
  • TAU_MPI_FLIBS: TAU's MPI wrapper library for F90
  • TAU_FORTRANLIBS: must be linked in with C++ linker
    for F90.
  • TAU_DISABLE: TAU's dummy F90 stub library
  • Note: Not including TAU_DEFS in CFLAGS disables
    instrumentation in C/C++ programs (TAU_DISABLE
    for F90).

23
Including TAU's stub Makefile
include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc
CXX     = $(TAU_CXX)
CC      = $(TAU_CC)
CFLAGS  = $(TAU_DEFS)
LIBS    = $(TAU_LIBS)
OBJS    = ...
TARGET  = a.out
$(TARGET): $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
        $(CC) $(CFLAGS) -c $< -o $@
24
TAU Instrumentation Options
  • Manual instrumentation
  • TAU Profiling API
  • Automatic instrumentation approaches
  • PDT: source-to-source translation
  • MPI: wrapper interposition library
  • Opari: OpenMP directive rewriting

25
Program Database Toolkit (PDT)
  • Program code analysis framework for developing
    source-based tools
  • High-level interface to source code information
  • Integrated toolkit for source code parsing,
    database creation, and database query
  • commercial grade front end parsers
  • portable IL analyzer, database format, and access
    API
  • open software approach for tool development
  • Target and integrate multiple source languages
  • Use in TAU to build automated performance
    instrumentation tools

26
Program Database Toolkit
27
PDT Components
  • Language front end
  • Edison Design Group (EDG): C, C++
  • Mutek Solutions Ltd.: F77, F90
  • creates an intermediate-language (IL) tree
  • IL Analyzer
  • processes the intermediate language (IL) tree
  • creates program database (PDB) formatted file
  • DUCTAPE (Bernd Mohr, ZAM, Germany)
  • C++ program Database Utilities and Conversion
    Tools APplication Environment
  • processes and merges PDB files
  • C++ library to access the PDB for PDT applications

28
TAU Makefile for PDT (C++ Example)
include /usr/tau/include/Makefile
CXX      = $(TAU_CXX)
CC       = $(TAU_CC)
PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse
TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS   = $(TAU_DEFS)
LIBS     = $(TAU_LIBS)
OBJS     = ...
TARGET   = a.out
$(TARGET): $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
        $(PDTPARSE) $<
        $(TAUINSTR) $*.pdb $< -o $*.inst.cpp
        $(CC) $(CFLAGS) -c $*.inst.cpp -o $@
29
Instrumentation Control
  • Selection of which performance events to observe
  • Could depend on scope, type, level of interest
  • Could depend on instrumentation overhead
  • How is selection supported in instrumentation
    system?
  • No choice
  • Include / exclude lists (TAU)
  • Environment variables
  • Static vs. dynamic
  • Problem Controlling instrumentation of small
    routines
  • High relative measurement overhead
  • Significant intrusion and possible perturbation

30
Using PDT: tau_instrumentor
% tau_instrumentor
Usage: tau_instrumentor <pdbfile> <sourcefile> -o <outputfile>
       [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran]
       [-f <instr_req_file>]

For selective instrumentation, use the -f option:
% cat selective.dat
# Selective instrumentation: specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be the
# only routines that are instrumented.
# To specify an include list (a list of routines that will be
# instrumented) remove the leading # to uncomment the following lines:
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST
31
Rule-Based Overhead Analysis (N. Trebon, UO)
  • Analyze the performance data to determine events
    with high (relative) overhead performance
    measurements
  • Create a select list for excluding those events
  • Rule grammar (used in TAUreduce tool)
  • [GroupName:] Field Operator Number
  • GroupName indicates rule applies to events in
    group
  • Field is an event metric attribute (from profile
    statistics)
  • numcalls, numsubs, percent, usec, cumusec, count
    (PAPI), totalcount, stdev, usecs/call,
    counts/call
  • Operator is one of >, <, or =
  • Number is any number
  • Compound rules possible using & between simple
    rules

32
Example Rules
  • Exclude all events that are members of TAU_USER
    and use less than 1000 microseconds:
    TAU_USER:usec < 1000
  • Exclude all events that have less than 1000
    microseconds and are called only once:
    usec < 1000 & numcalls = 1
  • Exclude all events that have less than 1000
    usecs per call OR have a (total inclusive)
    percent less than 5:
    usecs/call < 1000
    percent < 5
  • Scientific notation can be used
  • usec > 1000 & numcalls > 400000 & usecs/call < 30
    & percent > 25

33
TAU Instrumentation Options
  • Manual instrumentation
  • TAU Profiling API
  • Automatic instrumentation approaches
  • PDT: source-to-source translation
  • MPI: wrapper interposition library
  • Opari: OpenMP directive rewriting

34
TAU's MPI Wrapper Interposition Library
  • Uses standard MPI Profiling Interface
  • Provides name-shifted interface
  • MPI_Send -> PMPI_Send
  • Weak bindings
  • Interpose TAU's MPI wrapper library between MPI
    and TAU
  • -lmpi replaced by -lTauMpi -lpmpi -lmpi

35
MPI Library Instrumentation (MPI_Send)
int MPI_Send(...) /* TAU redefines MPI_Send */
{
  ...
  int returnVal, typesize;
  TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  if (dest != MPI_PROC_NULL) {
    PMPI_Type_size(datatype, &typesize);
    TAU_TRACE_SENDMSG(tag, dest, typesize * count);
  }
  /* Wrapper calls PMPI_Send */
  returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}
36
Including TAU's stub Makefile
include /usr/tau/sgi64/lib/Makefile.tau-mpi
CXX     = $(TAU_CXX)
CC      = $(TAU_CC)
CFLAGS  = $(TAU_DEFS)
LIBS    = $(TAU_MPI_LIBS) $(TAU_LIBS)
LDFLAGS = $(USER_OPT) $(TAU_LDFLAGS)
OBJS    = ...
TARGET  = a.out
$(TARGET): $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
        $(CC) $(CFLAGS) -c $< -o $@
37
TAU Instrumentation Options
  • Manual instrumentation
  • TAU Profiling API
  • Automatic instrumentation approaches
  • PDT: source-to-source translation
  • MPI: wrapper interposition library
  • Opari: OpenMP directive rewriting (FZJ, Germany)

38
Instrumentation of OpenMP Constructs
  • OpenMP Pragma And Region Instrumentor
  • Source-to-Source translator to insert POMP
    calls around OpenMP constructs and API functions
  • Done: supports
  • Fortran77 and Fortran90, OpenMP 2.0
  • C and C++, OpenMP 1.0
  • POMP Extensions
  • EPILOG and TAU POMP implementations
  • Preserves source code information (#line line
    file)
  • Work in Progress: investigating standardization
    through the OpenMP Forum

39
OpenMP API Instrumentation
  • Transform
  • omp_*_lock() -> pomp_*_lock()
  • omp_*_nest_lock() -> pomp_*_nest_lock()
  • (* = init | destroy | set | unset | test)
  • POMP version
  • Calls omp version internally
  • Can do extra stuff before and after call

40
Example: !$OMP PARALLEL DO Instrumentation
Original code:
!$OMP PARALLEL DO clauses...
      do loop
!$OMP END PARALLEL DO

Instrumented code:
      call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_do_exit(d)
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)

41
Opari Instrumentation Example
  • OpenMP directive instrumentation

pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) \
        firstprivate(a1,a2,a3,a4,a5) nowait
for (i = i1; i < i2; i++) {
  for (j = j1; j < j2; j++) {
    new_psi[i][j] = a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
                  + a4*psi[i][j-1] - a5*the_for[i][j];
    diff = diff + fabs(new_psi[i][j] - psi[i][j]);
  }
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"
42
OPARI Basic Usage (f90)
  • Reset OPARI state information
  • rm -f opari.rc
  • Call OPARI for each input source file
  • opari file1.f90 ... opari fileN.f90
  • Generate the OPARI runtime table, compile it with
    ANSI C
  • opari -table opari.tab.c; cc -c opari.tab.c
  • Compile modified files *.mod.f90 using OpenMP
  • Link the resulting object files, the OPARI
    runtime table opari.tab.o, and the TAU POMP RTL

43
OPARI Makefile Template (C/C++)
OMPCC  = ...  # insert C OpenMP compiler here
OMPCXX = ...  # insert C++ OpenMP compiler here

.c.o:
        opari $<
        $(OMPCC) $(CFLAGS) -c $*.mod.c
.cc.o:
        opari $<
        $(OMPCXX) $(CXXFLAGS) -c $*.mod.cc

opari.init:
        rm -rf opari.rc
opari.tab.o:
        opari -table opari.tab.c
        $(CC) -c opari.tab.c

myprog: opari.init myfile.o ... opari.tab.o
        $(OMPCC) -o myprog myfile.o opari.tab.o -lpomp

myfile.o: myfile.c myheader.h
myfile2.o: ...
44
OPARI Makefile Template (Fortran)
OMPF77 = ...  # insert f77 OpenMP compiler here
OMPF90 = ...  # insert f90 OpenMP compiler here

.f.o:
        opari $<
        $(OMPF77) $(CFLAGS) -c $*.mod.F
.f90.o:
        opari $<
        $(OMPF90) $(CXXFLAGS) -c $*.mod.F90

opari.init:
        rm -rf opari.rc
opari.tab.o:
        opari -table opari.tab.c
        $(CC) -c opari.tab.c

myprog: opari.init myfile.o ... opari.tab.o
        $(OMPF90) -o myprog myfile.o opari.tab.o -lpomp

myfile1.o: myfile1.f90
myfile2.o: ...
45
TAU Measurement
  • Performance information
  • High-resolution timer library (real-time /
    virtual clocks)
  • General software counter library (user-defined
    events)
  • Hardware performance counters
  • PAPI (Performance API) (UTK, Ptools Consortium)
  • consistent, portable API
  • Organization
  • Node, context, thread levels
  • Profile groups for collective events (runtime
    selective)
  • Performance data mapping between software levels

46
TAU Measurement (continued)
  • Parallel profiling
  • Function-level, block-level, statement-level
  • Supports user-defined events
  • TAU parallel profile database
  • Callpath profiles
  • Hardware counts values
  • Tracing
  • All profile-level events
  • Inter-process communication events
  • Timestamp synchronization
  • User-configurable measurement library (user
    controlled)

47
TAU Measurement System Configuration
  • configure [OPTIONS]
  • -c++=<CC>, -cc=<cc>: specify C++ and C
    compilers
  • -pthread, -sproc: use pthread or SGI sproc
    threads
  • -openmp: use OpenMP threads
  • -opari=<dir>: specify location of the Opari
    OpenMP tool
  • -papi=<dir>: specify location of PAPI
  • -pdt=<dir>: specify location of PDT
  • -mpiinc=<d>, -mpilib=<d>: specify MPI library
    instrumentation
  • -TRACE: generate TAU event traces
  • -PROFILE: generate TAU profiles
  • -PROFILECALLPATH: generate callpath profiles
    (1-level)
  • -MULTIPLECOUNTERS: use more than one hardware
    counter
  • -CPUTIME: use user time + system time
  • -PAPIWALLCLOCK: use PAPI to access wallclock time
  • -PAPIVIRTUAL: use PAPI for virtual (user) time

48
TAU Measurement Configuration Examples
  • ./configure -c++=xlC -cc=xlc
    -pdt=/usr/packages/pdtoolkit-2.1 -pthread
  • Use TAU with IBM's xlC compiler, PDT and the
    pthread library
  • Enable TAU profiling (default)
  • ./configure -TRACE -PROFILE
  • Enable both TAU profiling and tracing
  • ./configure -c++=CC -cc=cc -MULTIPLECOUNTERS
    -papi=/usr/local/packages/papi
    -opari=/usr/local/opari-pomp-1.1
    -mpiinc=/usr/packages/mpich/include
    -mpilib=/usr/packages/mpich/lib -SGITIMERS
    -PAPIVIRTUAL
  • Use OpenMP+MPI with SGI's compiler suite and
    Opari, and use PAPI for accessing hardware
    performance counters and virtual time
  • Typically configure multiple measurement libraries

49
Setup: Running Applications
% setenv PROFILEDIR /home/data/experiments/profile/01
% setenv TRACEDIR   /home/data/experiments/trace/01    (optional)
% set path=($path <taudir>/<arch>/bin)
% setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib

For PAPI (1 counter):
% setenv PAPI_EVENT PAPI_FP_INS

For PAPI (multiple counters):
% setenv COUNTER1 PAPI_FP_INS    (PAPI's floating point instructions)
% setenv COUNTER2 PAPI_L1_DCM    (PAPI's L1 data cache misses)
% setenv COUNTER3 P_VIRTUAL_TIME (PAPI's virtual time)
% setenv COUNTER4 SGI_TIMERS     (wallclock time)

% mpirun -np <n> <application>
% llsubmit job.sh
50
Performance Mapping
  • Associate performance with significant entities
    (events)
  • Source code points are important
  • Functions, regions, control flow events, user
    events
  • Execution process and thread entities are
    important
  • Some entities are more abstract, harder to
    measure
  • Consider callgraph (callpath) profiling
  • Measure time (metric) along an edge (path) of
    callgraph
  • Incident edge gives parent / child view
  • Edge sequence (path) gives parent / descendant
    view
  • Problem: callpath profiling when the callgraph is
    unknown
  • Determine callgraph dynamically at runtime
  • Map performance measurement to dynamic call path
    state

51
1-Level Callpath Implementation in TAU
  • TAU maintains a performance event (routine)
    callstack
  • Profiled routine (child) looks in callstack for
    parent
  • Previous profiled performance event is the parent
  • A callpath profile structure created first time
    parent calls
  • TAU records parent in a callgraph map for child
  • String representing 1-level callpath used as its
    key
  • "a() => b()": name for time spent in b when
    called by a
  • Map returns pointer to callpath profile structure
  • 1-level callpath is profiled using this profiling
    data
  • Build upon TAU's performance mapping technology
  • Measurement is independent of instrumentation
  • Use -PROFILECALLPATH to configure TAU

52
TAU Analysis
  • Profile analysis
  • pprof
  • parallel profiler with text-based display
  • racy
  • graphical interface to pprof (Tcl/Tk)
  • jracy
  • Java implementation of Racy
  • Trace analysis and visualization
  • Trace merging and clock adjustment (if necessary)
  • Trace format conversion (ALOG, SDDF, Vampir)
  • Vampir (Pallas) trace visualization
  • Paraver (CEPBA) trace visualization

53
Pprof Command
  • pprof [-c|-b|-m|-t|-e|-i|-v] [-r] [-s] [-n num]
    [-f file] [-l] [nodes]
  • -c: sort according to number of calls
  • -b: sort according to number of subroutines called
  • -m: sort according to msecs (exclusive time total)
  • -t: sort according to total msecs (inclusive time
    total)
  • -e: sort according to exclusive time per call
  • -i: sort according to inclusive time per call
  • -v: sort according to standard deviation
    (exclusive usec)
  • -r: reverse sorting order
  • -s: print only summary profile information
  • -n num: print only first num functions
  • -f file: specify full path and filename without
    node ids
  • -l: list all functions and exit

54
TAU Parallel Performance Profiles
55
Terminology Example
  • For routine int main( ):
  • Exclusive time
  • 100-20-50-20 = 10 secs
  • Inclusive time
  • 100 secs
  • Calls
  • 1 call
  • Subrs (no. of child routines called)
  • 3
  • Inclusive time/call
  • 100 secs

int main( ) /* takes 100 secs */
{
  f1();      /* takes 20 secs */
  f2();      /* takes 50 secs */
  f1();      /* takes 20 secs */
  /* other work */
}
/* Time can be replaced by counts */
56
jracy (NAS Parallel Benchmark LU)
Routine profile across all nodes
Global profiles
n: node, c: context, t: thread
Individual profile
57
jracy (Callpath Profiles) (R. A. Bell, UO)
Callpath profile across all nodes
58
Vampir Trace Visualization Tool
  • Visualization and Analysis of MPI Programs
  • Originally developed by Forschungszentrum Jülich
  • Current development by Technical University
    Dresden
  • Distributed by PALLAS, Germany
  • http://www.pallas.de/pages/vampir.htm

59
Using TAU with Vampir
  • Configure TAU with -TRACE option
  • ./configure -TRACE -SGITIMERS
  • Execute application
  • mpirun -np 4 a.out
  • This generates TAU traces and event descriptors
  • Merge all traces using tau_merge
  • tau_merge *.trc app.trc
  • Convert traces to Vampir trace format using
    tau_convert
  • tau_convert -pv app.trc tau.edf app.pv
  • Note: use -vampir instead of -pv for
    multi-threaded traces
  • Load generated trace file in Vampir
  • vampir app.pv

60
Vampir Main Window
  • Trace file loading can be
  • Interrupted at any time
  • Resumed
  • Started at a specified time offset
  • Provides main menu
  • Access to global and process local displays
  • Preferences
  • Help
  • Trace file can be rewritten (regrouped symbols)

61
Vampir Timeline Diagram
  • Functions organized into groups
  • Coloring by group
  • Message lines can be colored by tag or size
  • Information about states, messages, collective,
    and I/O operations available by clicking on the
    representation

62
Vampir Timeline Diagram (Message Info)
  • Sourcecode references are displayed if recorded
    in trace

63
Vampir Execution Statistics Displays
  • Aggregated profiling information execution time,
    calls, inclusive/exclusive
  • Available for all/any group (activity)
  • Available for all routines (symbols)
  • Available for any trace part (select in timeline
    diagram)

64
Vampir Communication Statistics Displays
  • Bytes sent/received for collective operations
  • Message length statistics
  • Available for any trace part
  • Byte and message count, min/max/avg message
    length and min/max/avg bandwidth for each process
    pair

65
Vampir Other Features
  • Dynamic global call graph tree
  • Parallelism display
  • Powerful filtering and trace comparison features
  • All diagrams highly customizable (through context
    menus)

66
Vampir Process Displays
  • Activity chart
  • Call tree
  • Timeline
  • For all selected processes in the global displays

67
Vampir (NAS Parallel Benchmark LU)
Callgraph display
Timeline display
Parallelism display
Communications display
68
TAU Performance System Status
  • Computing platforms
  • IBM SP, SGI Origin, ASCI Red, Cray T3E, Compaq
    SC, HP, Sun, Apple, Windows, IA-32, IA-64
    (Linux), Hitachi, NEC
  • Programming languages
  • C, C++, Fortran 77/90, HPF, Java
  • Communication libraries
  • MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava
  • Thread libraries
  • pthread, Java, Windows, SGI sproc, Tulip, SMARTS,
    OpenMP
  • Compilers
  • KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, HP, Sun,
    Microsoft, SGI, Cray, IBM, HP, Compaq, Hitachi,
    NEC, Intel

69
PDT Status
  • Program Database Toolkit (Version 2.1, web
    download)
  • EDG C front end (Version 2.45.2)
  • Mutek Fortran 90 front end (Version 2.4.1)
  • C and Fortran 90 IL Analyzer
  • DUCTAPE library
  • Standard C++ system header files (KCC Version
    4.0f)
  • PDT-constructed tools
  • TAU instrumentor (C/C++/F90)
  • Program analysis support for SILOON and CHASM
  • Platforms
  • SGI, IBM, Compaq, SUN, HP, Linux (IA32/IA64),
    Apple, Windows, Cray T3E, Hitachi

70
Work in Progress
  • Visualization
  • TAU will generate event-traces with PAPI
    performance data. Vampir (v3.0) will support
    visualization of this data
  • Performance Monitoring and Steering
  • Performance Database Framework

71
Vampir v3.x HPM Counter
  • Counter Timeline Display
  • Process Timeline Display

72
Performance Monitoring and Steering
  • Desirable to monitor performance during execution
  • Long-running applications
  • Steering computations for improved performance
  • Large-scale parallel applications complicate
    solutions
  • More parallel threads of execution producing data
  • Large amount of performance data (relative) to
    access
  • Analysis and visualization more difficult
  • Problem: online performance data access and
    analysis
  • Incremental profile sampling (based on files)
  • Integration in computational steering system
  • Dynamic performance measurement and access

73
Online Performance Analysis (K. Li, UO)
[Diagram: the application, instrumented with the TAU performance system, writes performance data streams and accumulated samples to the file system; a performance data integrator and performance data reader (with sample sequencing and reader synchronization) feed a performance analyzer and a performance visualizer built in SCIRun (Univ. of Utah)]
74
2D Field Performance Visualization in SCIRun
SCIRun program
75
Uintah Computational Framework (UCF)
  • University of Utah
  • UCF analysis
  • Scheduling
  • MPI library
  • Components
  • 500 processes
  • Use for online and offline visualization
  • Apply SCIRun steering

76
Empirical-Based Performance Optimization Process
77
TAU Performance Database Framework
  • profile data only
  • XML representation
  • project / experiment / trial
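A profile in the XML representation mentioned above might look like the fragment below. The element and attribute names here are invented for illustration; the actual PerfDBF schema was still being specified:

```xml
<trial project="LU" experiment="scalability" nprocs="4">
  <event name="applu" node="0" context="0" thread="0">
    <metric name="inclusive_usec">1234567</metric>
    <metric name="numcalls">1</metric>
  </event>
</trial>
```

The project / experiment / trial hierarchy maps naturally onto such nested elements, with one event record per (node, context, thread) instance of a routine.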

78
PerfDBF Architecture (L. Li, R. Bell, UO)
[Architecture diagram: an application profiled with TAU produces standard TAU output data; a TAU-to-XML converter writes the TAU XML format, which a database loader stores in the SQL database for use by analysis tools]
79
Scalability Analysis Process
  • Scalability study on LU
  • suite.def: # of procs -> 1, 2, 4, and 8
  • mpirun -np 1 lu.W1
  • mpirun -np 2 lu.W2
  • mpirun -np 4 lu.W4
  • mpirun -np 8 lu.W8
  • populateDatabase.sh
  • run Java translator to translate profiles into
    XML
  • run Java XML reader to write XML profiles to
    database
  • Read times for routines and program from
    experiments
  • Calculate scalability metrics

80
Contents of Performance Database
81
Scalability Analysis Results
  • Scalability of LU performance experiments
  • Four trial runs
  • funname   processors   meanspeedup
  • ...
  • applu     2            2.0896117809566
  • applu     4            4.812100975788783
  • applu     8            8.168409581149514
  • exact     2            1.95853126762839071803
  • exact     4            4.03622321124616535446
  • exact     8            7.193812137750623668346

82
Current Status and Future
  • PerfDBF prototype
  • TAU profile to XML translator
  • XML to PerfDB populator
  • PostgreSQL database
  • Java-based PostgreSQL query module
  • Use as a layer to support performance analysis
    tools
  • Make accessing the Performance Database quicker
  • Continue development
  • XML parallel profile representation
  • Basic specification

83
Overview
  • Introduction
  • Definitions, general problem
  • Tuning and Analysis Utilities (TAU)
  • Instrumentation
  • Measurement
  • Analysis
  • Work in progress
  • Visualization: Vampir
  • Performance Monitoring and Steering
  • Performance Database Framework
  • Case Study: Uintah
  • Conclusions

84
Case Study: Utah ASCI/ASAP Level 1 Center
  • C-SAFE was established to build a problem-solving
    environment (PSE) for the numerical simulation of
    accidental fires and explosions
  • Fundamental chemistry and engineering physics
    models
  • Coupled with non-linear solvers, optimization,
    computational steering, visualization, and
    experimental data verification
  • Very large-scale simulations
  • Computer science problems
  • Coupling of multiple simulation codes
  • Software engineering across diverse expert teams
  • Achieving high performance on large-scale systems

85
Example C-SAFE Simulation Problems
Heptane fire simulation
Typical C-SAFE simulation with a billion degrees
of freedom and non-linear time dynamics
Material stress simulation
86
Uintah High-Level Component View
87
Uintah Computational Framework
  • Execution model based on software (macro)
    dataflow
  • Exposes parallelism and hides data transport
    latency
  • Computations expressed as directed acyclic
    graphs of tasks
  • consumes input and produces output (input to
    future task)
  • input/outputs specified for each patch in a
    structured grid
  • Abstraction of global single-assignment memory
  • DataWarehouse
  • Directory mapping names to values (array
    structured)
  • Write value once then communicate to awaiting
    tasks
  • Task graph gets mapped to processing resources
  • Communications schedule approximates global
    optimal

88
Uintah Task Graph (Material Point Method)
  • Diagram of named tasks (ovals) and data (edges)
  • Imminent computation
  • Dataflow-constrained
  • MPM
  • Newtonian material point motion time step
  • Solid values defined at material point
    (particle)
  • Dashed values defined at vertex (grid)
  • Prime (′) values updated during time step

89
Uintah PSE
  • UCF automatically sets up
  • Domain decomposition
  • Inter-processor communication with
    aggregation/reduction
  • Parallel I/O
  • Checkpoint and restart
  • Performance measurement and analysis (stay tuned)
  • Software engineering
  • Coding standards
  • CVS (Commits Y3 - 26.6 files/day, Y4 - 29.9
    files/day)
  • Correctness regression testing with bugzilla bug
    tracking
  • Nightly build (parallel compiles)
  • 170,000 lines of code (Fortran and C tasks
    supported)

90
Performance Technology Integration
  • Uintah presents challenges to performance
    integration
  • Software diversity and structure
  • UCF middleware, simulation code modules
  • component-based hierarchy
  • Portability objectives
  • cross-language and cross-platform
  • multi-parallelism thread, message passing, mixed
  • Scalability objectives
  • High-level programming and execution abstractions
  • Requires flexible and robust performance
    technology
  • Requires support for performance mapping

91
Task Execution in Uintah Parallel Scheduler
  • Profile methods and functions in scheduler and in
    MPI library

Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)
  • Need to map performance data!

92
Semantics-Based Performance Mapping
  • Associate performance measurements with
    high-level semantic abstractions
  • Need mapping support in the performance
    measurement system to assign data correctly

93
Semantic Entities/Attributes/Associations (SEAA)
  • New dynamic mapping scheme
  • Entities defined at any level of abstraction
  • Attribute entity with semantic information
  • Entity-to-entity associations
  • Two association types (implemented in TAU API)
  • Embedded: extends data structure of associated
    object to store performance measurement entity
  • External: creates an external look-up table
    using address of object as the key to locate
    performance measurement entity

94
Uintah Task Performance Mapping
  • Uintah partitions individual particles across
    processing elements (processes or threads)
  • Simulation tasks in task graph work on particles
  • Tasks have domain-specific character in the
    computation
  • interpolate particles to grid in Material Point
    Method
  • Task instances generated for each partitioned
    particle set
  • Execution scheduled with respect to task
    dependencies
  • How to attribute execution time among different
    tasks?
  • Assign semantic name (task type) to a task
    instance
  • SerialMPM::interpolateParticleToGrid
  • Map TAU timer object to (abstract) task (semantic
    entity)
  • Look up timer object using task type (semantic
    attribute)
  • Further partition along different domain-specific
    axes

95
Using External Associations
  • Two level mappings
  • Level 1: <task name, timer>
  • Level 2: <task name, patch, timer>
  • Embedded association vs External
    association

Hash Table
Data (object)
Performance Data
96
Task Performance Mapping Instrumentation
  void MPIScheduler::execute(const ProcessorGroup* pc,
                             DataWarehouseP& old_dw, DataWarehouseP& dw)
  {
    ...
    TAU_MAPPING_CREATE(task->getName(), "MPIScheduler::execute()",
                       (TauGroup_t)(void*)task->getName(),
                       task->getName(), 0);
    ...
    TAU_MAPPING_OBJECT(tautimer)
    TAU_MAPPING_LINK(tautimer,
                     (TauGroup_t)(void*)task->getName());
                     // EXTERNAL ASSOCIATION
    ...
    TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0);
    TAU_MAPPING_PROFILE_START(doitprofiler, 0);
    task->doit(pc);
    TAU_MAPPING_PROFILE_STOP(0);
    ...
  }
97
Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
98
Task Performance Mapping (Trace)
Work packet computation events colored by task
type
Distinct phases of computation can be identified
based on task
99
Task Performance Mapping (Trace - Zoom)
Startup communication imbalance
100
Task Performance Mapping (Trace - Parallelism)
Communication / load imbalance
101
Comparing Uintah Traces for Scalability Analysis
102
Scaling Performance Optimizations
Last year: initial correct scheduler
Reduce communication by 10x
Reduce task graph overhead by 20x
ASCI Nirvana SGI Origin 2000 Los Alamos National
Laboratory
103
Scalability to 2000 Processors (Fall 2001)
ASCI Nirvana SGI Origin 2000 Los Alamos National
Laboratory
104
Concluding Remarks
  • Complex software and parallel computing systems
    pose challenging performance analysis problems
    that require robust methodologies and tools
  • To build more sophisticated performance tools,
    existing proven performance technology must be
    utilized
  • Performance tools must be integrated with
    software and systems models and technology
  • Performance engineered software
  • Function consistently and coherently in software
    and system environments
  • PAPI and TAU performance systems offer robust
    performance technology that can be broadly
    integrated

105
Information
  • TAU (http://www.acl.lanl.gov/tau)
  • PDT (http://www.acl.lanl.gov/pdtoolkit)
  • PAPI (http://icl.cs.utk.edu/projects/papi/)
  • OPARI (http://www.fz-juelich.de/zam/kojak/)

106
Support Acknowledgement
  • TAU and PDT support
  • Department of Energy (DOE)
  • DOE 2000 ACTS contract
  • DOE MICS contract
  • DOE ASCI Level 3 (LANL, LLNL)
  • U. of Utah DOE ASCI Level 1 subcontract
  • DARPA
  • NSF National Young Investigator (NYI) award