Title: Performance Technology for Complex Parallel Systems Sameer Shende University of Oregon
1 Performance Technology for Complex Parallel Systems
Sameer Shende, University of Oregon
2 General Problems
- How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
- How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?
3 Computation Model for Performance Technology
- How to address dual performance technology goals?
- Robust capabilities and widely available methodologies
- Contend with problems of system diversity
- Flexible tool composition/configuration/integration
- Approaches
- Restrict computation types / performance problems
- limited performance technology coverage
- Base technology on abstract computation model
- general architecture and software execution features
- map features/methods to existing complex system types
- develop capabilities that can adapt and be optimized
4 General Complex System Computation Model
- Node: physically distinct shared memory machine
- Message passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context
[Figure: physical view (nodes with SMP memory, interconnection network, inter-node message communication) and model view (node memory, VM space, context, threads)]
5 Definitions: Profiling
- Profiling
- Recording of summary information during execution
- inclusive, exclusive time, number of calls, hardware statistics, ...
- Reflects performance behavior of program entities
- functions, loops, basic blocks
- user-defined semantic entities
- Very good for low-cost performance assessment
- Helps to expose performance bottlenecks and hotspots
- Implemented through
- sampling: periodic OS interrupts or hardware counter traps
- instrumentation: direct insertion of measurement code
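The difference between inclusive and exclusive time can be made concrete with a small sketch. This is not TAU's implementation; the `Profiler` and `ProfileEntry` names are hypothetical, and timestamps are passed in explicitly so the behavior is deterministic. A stack of active entities lets each exit charge its elapsed time to the caller as time spent in callees.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Summary data kept per program entity (hypothetical layout, not TAU's).
struct ProfileEntry {
    std::uint64_t calls = 0;      // number of completed calls
    std::uint64_t inclusive = 0;  // time in the entity including its callees
    std::uint64_t exclusive = 0;  // time in the entity excluding its callees
};

class Profiler {
public:
    // Record entry into `name` at time `now` (any monotonic unit).
    void enter(const std::string& name, std::uint64_t now) {
        stack_.push_back(Frame{name, now, 0});
    }

    // Record exit from the innermost active entity at time `now`.
    void exit(std::uint64_t now) {
        assert(!stack_.empty() && "exit() without matching enter()");
        Frame frame = stack_.back();
        stack_.pop_back();
        assert(now >= frame.start);
        const std::uint64_t elapsed = now - frame.start;
        ProfileEntry& entry = entries_[frame.name];
        entry.calls += 1;
        entry.inclusive += elapsed;
        entry.exclusive += elapsed - frame.childTime;
        // Charge the whole interval to the caller as time spent in callees.
        if (!stack_.empty()) stack_.back().childTime += elapsed;
    }

    const ProfileEntry& entry(const std::string& name) const {
        return entries_.at(name);
    }

private:
    struct Frame {
        std::string name;
        std::uint64_t start;
        std::uint64_t childTime;  // time already attributed to callees
    };
    std::vector<Frame> stack_;
    std::map<std::string, ProfileEntry> entries_;
};
```

Entering "main" at 0 and "solve" at 10, then exiting at 40 and 100, gives "solve" 30 units inclusive and exclusive, and "main" 100 inclusive but 70 exclusive. In this simple form, recursive calls would count nested time twice in the inclusive total.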
6 Definitions: Tracing
- Tracing
- Recording of information about significant points (events) during program execution
- entering/exiting code region (function, loop, block, ...)
- thread/process interactions (e.g., send/receive message)
- Save information in event record
- timestamp
- CPU identifier, thread identifier
- Event type and event-specific information
- Event trace is a time-sequenced stream of event records
- Can be used to reconstruct dynamic program behavior
- Typically requires code instrumentation
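As an illustration of the event record and the time-sequenced trace described above, here is a minimal sketch. The field layout, the `EventType` encoding, and the `mergeTraces` helper are assumptions made for this example and do not reflect TAU's trace format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Kinds of events named on the slide (hypothetical encoding).
enum class EventType { Enter, Exit, Send, Receive };

// One trace record: when, where, and what happened.
struct EventRecord {
    std::uint64_t timestamp;  // time of the event
    int cpu;                  // CPU identifier
    int thread;               // thread identifier
    EventType type;           // event type
    std::int64_t parameter;   // event-specific data, e.g. region id or message tag
};

// Merge per-thread buffers into a single time-sequenced event trace.
std::vector<EventRecord> mergeTraces(
    const std::vector<std::vector<EventRecord>>& buffers) {
    std::size_t total = 0;
    for (const auto& buffer : buffers) total += buffer.size();

    std::vector<EventRecord> trace;
    trace.reserve(total);
    for (const auto& buffer : buffers)
        trace.insert(trace.end(), buffer.begin(), buffer.end());
    assert(trace.size() == total);

    // stable_sort keeps the original order of events with equal timestamps.
    std::stable_sort(trace.begin(), trace.end(),
                     [](const EventRecord& a, const EventRecord& b) {
                         return a.timestamp < b.timestamp;
                     });
    return trace;
}
```

Sorting by timestamp alone assumes the clocks of all threads are already synchronized; otherwise a clock adjustment step would be needed before merging.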
7 Event Tracing: Instrumentation, Monitor, Trace
[Figure labels: event definition, CPU A, CPU B, timestamp, MONITOR]
8 Event Tracing: Timeline Visualization
[Figure labels: main, master, slave, B]
9 TAU Performance System Framework
- Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
- nodes / contexts / threads
- Multi-level: system / software / parallelism
- Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
- Portable performance profiling/tracing facility
- Open software approach
10 TAU Performance System Architecture
11 Levels of Code Transformation
- As program information flows through stages of compilation/linking/execution, different information is accessible at different stages
- Each level poses different constraints and opportunities for extracting information
- At what level should performance instrumentation be done?
12 TAU Instrumentation
- Flexible instrumentation mechanisms at multiple levels
- Source code
- manual
- automatic using Program Database Toolkit (PDT), OPARI
- Object code
- pre-instrumented libraries (e.g., MPI using PMPI)
- statically linked
- dynamically linked (e.g., virtual machine instrumentation)
- fast breakpoints (compiler generated)
- Executable code
- dynamic instrumentation (pre-execution) using DynInstAPI
13 TAU Instrumentation (continued)
- Targets common measurement interface (TAU API)
- Object-based design and implementation
- Macro-based, using constructor/destructor techniques
- Program units: function, classes, templates, blocks
- Uniquely identify functions and templates
- name and type signature (name registration)
- static object creates performance entry
- dynamic object receives static object pointer
- runtime type identification for template instantiations
- C and Fortran instrumentation variants
- Instrumentation and measurement optimization
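The constructor/destructor technique with a static entry and a dynamic object can be sketched as follows. This illustrates the general idea only and is not TAU's implementation: `FunctionInfo`, `ScopedProfiler`, `SKETCH_PROFILE`, and `virtualClock` are hypothetical names, and the clock is a counter advanced by hand so that the example is deterministic.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Stand-in clock advanced explicitly, so the sketch is deterministic.
// A real measurement layer would read a timer or hardware counter here.
inline std::uint64_t& virtualClock() {
    static std::uint64_t ticks = 0;
    return ticks;
}

// Static part: one performance entry per instrumented program unit.
struct FunctionInfo {
    std::string name;  // registered name
    std::string type;  // registered type signature
    std::uint64_t calls = 0;
    std::uint64_t inclusiveTicks = 0;
    FunctionInfo(std::string n, std::string t)
        : name(std::move(n)), type(std::move(t)) {}
};

// Dynamic part: created on every entry, holds a pointer to the static entry.
class ScopedProfiler {
public:
    explicit ScopedProfiler(FunctionInfo* info)
        : info_(info), start_(virtualClock()) {
        assert(info_ != nullptr);
    }
    ~ScopedProfiler() {  // runs on every exit path, including early returns
        info_->calls += 1;
        info_->inclusiveTicks += virtualClock() - start_;
    }
    ScopedProfiler(const ScopedProfiler&) = delete;
    ScopedProfiler& operator=(const ScopedProfiler&) = delete;

private:
    FunctionInfo* info_;
    std::uint64_t start_;
};

// Hypothetical macro in the spirit of the slide; not the TAU macro itself.
#define SKETCH_PROFILE(name, type)                  \
    static FunctionInfo sketchInfo((name), (type)); \
    ScopedProfiler sketchScope(&sketchInfo)

// Example instrumented routine; returns its static entry for inspection.
const FunctionInfo& compute(std::uint64_t work) {
    SKETCH_PROFILE("compute", "const FunctionInfo& (std::uint64_t)");
    virtualClock() += work;  // pretend the body took `work` ticks
    return sketchInfo;
}
```

The static object is constructed once, on the first call, which is where name registration would happen; every later call only constructs and destroys the lightweight dynamic object.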
14 Multi-Level Instrumentation
- Uses multiple instrumentation interfaces
- Shares information: cooperation between interfaces
- Taps information at multiple levels
- Provides selective instrumentation at each level
- Targets a common performance model
- Presents a unified view of execution
15 Program Database Toolkit (PDT)
- Program code analysis framework for developing source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
- commercial grade front end parsers
- portable IL analyzer, database format, and access API
- open software approach for tool development
- Target and integrate multiple source languages
- Use in TAU to build automated performance instrumentation tools
16 PDT Architecture and Tools
[Figure labels: C/C++, Fortran 77/90]
17 PDT Components
- Language front end
- Edison Design Group (EDG): C, C++, Java
- Mutek Solutions Ltd.: F77, F90
- creates an intermediate-language (IL) tree
- IL Analyzer
- processes the intermediate language (IL) tree
- creates program database (PDB) formatted file
- DUCTAPE (Bernd Mohr, ZAM, Germany)
- C++ program Database Utilities and Conversion Tools APplication Environment
- processes and merges PDB files
- C++ library to access the PDB for PDT applications
18 TAU Measurement
- Performance information
- High-resolution timer library (real-time / virtual clocks)
- General software counter library (user-defined events)
- Hardware performance counters
- PCL (Performance Counter Library) (ZAM, Germany)
- PAPI (Performance API) (UTK, Ptools Consortium)
- consistent, portable API
- Organization
- Node, context, thread levels
- Profile groups for collective events (runtime selective)
- Performance data mapping between software levels
19 TAU Measurement (continued)
- Parallel profiling
- Function-level, block-level, statement-level
- Supports user-defined events
- TAU parallel profile database
- Function callstack
- Hardware count values (in place of time)
- Tracing
- All profile-level events
- Inter-process communication events
- Timestamp synchronization
- User-configurable measurement library (user controlled)
20 TAU Measurement System Configuration
- configure [OPTIONS]
- -c++=<CC>, -cc=<cc> Specify C++ and C compilers
- -pthread, -sproc Use pthread or SGI sproc threads
- -openmp Use OpenMP threads
- -jdk=<dir> Specify location of Java Dev. Kit
- -opari=<dir> Specify location of Opari OpenMP tool
- -pcl=<dir>, -papi=<dir> Specify location of PCL or PAPI
- -pdt=<dir> Specify location of PDT
- -dyninst=<dir> Specify location of DynInst Package
- -mpiinc=<d>, -mpilib=<d> Specify MPI library instrumentation
- -TRACE Generate TAU event traces
- -PROFILE Generate TAU profiles
- -CPUTIME Use user time + system time
- -PAPIWALLCLOCK Use PAPI to access wallclock time
- -PAPIVIRTUAL Use PAPI for virtual (user) time
21 TAU Measurement Configuration Examples
- ./configure -c++=xlC -cc=xlc -pdt=/usr/packages/pdtoolkit-2.1 -pthread
- Use TAU with IBM's xlC compiler, PDT and the pthread library
- Enable TAU profiling (default)
- ./configure -TRACE -PROFILE
- Enable both TAU profiling and tracing
- ./configure -c++=guidec++ -cc=guidec -papi=/usr/local/packages/papi -openmp -mpiinc=/usr/packages/mpich/include -mpilib=/usr/packages/mpich/lib
- Use OpenMP+MPI using KAI's Guide compiler suite and use PAPI for accessing hardware performance counters for measurements
- Typically configure multiple measurement libraries
22 TAU Measurement API
- Initialization and runtime configuration
- TAU_PROFILE_INIT(argc, argv); TAU_PROFILE_SET_NODE(myNode); TAU_PROFILE_SET_CONTEXT(myContext); TAU_PROFILE_EXIT(message); TAU_REGISTER_THREAD();
- Function and class methods
- TAU_PROFILE(name, type, group);
- Template
- TAU_TYPE_STRING(variable, type); TAU_PROFILE(name, type, group); CT(variable);
- User-defined timing
- TAU_PROFILE_TIMER(timer, name, type, group); TAU_PROFILE_START(timer); TAU_PROFILE_STOP(timer);
23 Compiling: TAU Makefiles
- Include TAU Makefile in the user's Makefile.
- Variables
- TAU_CXX: Specify the C++ compiler
- TAU_CC: Specify the C compiler used by TAU
- TAU_DEFS: Defines used by TAU. Add to CFLAGS
- TAU_LDFLAGS: Linker options. Add to LDFLAGS
- TAU_INCLUDE: Header files include path. Add to CFLAGS
- TAU_LIBS: Statically linked TAU library. Add to LIBS
- TAU_SHLIBS: Dynamically linked TAU library
- TAU_MPI_LIBS: TAU's MPI wrapper library for C/C++
- TAU_MPI_FLIBS: TAU's MPI wrapper library for F90
- TAU_FORTRANLIBS: Must be linked in with C++ linker for F90.
- Note: Not including TAU_DEFS in CFLAGS disables instrumentation in C/C++ programs.
24 Including TAU Makefile - Example
include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(CC) $(CFLAGS) -c $< -o $@
25 TAU Makefile for PDT
include /usr/tau/include/Makefile
CXX = $(TAU_CXX)
CC = $(TAU_CC)
PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse
TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(PDTPARSE) $<
	$(TAUINSTR) $*.pdb $< -o $*.inst.cpp
	$(CC) $(CFLAGS) -c $*.inst.cpp -o $@
26 Setup: Running Applications
setenv PROFILEDIR /home/data/experiments/profile/01
setenv TRACEDIR /home/data/experiments/trace/01
set path=($path <taudir>/<arch>/bin)
setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib
For PAPI/PCL:
setenv PAPI_EVENT PAPI_FP_INS
setenv PCL_EVENT PCL_FP_INSTR
For Java (without instrumentation):
java application
With instrumentation:
java -XrunTAU application
java -XrunTAU:exclude=sun/io,java application
For DyninstAPI:
a.out
tau_run a.out
tau_run -XrunTAUsh-papi a.out
27 TAU Analysis
- Profile analysis
- pprof
- parallel profiler with text-based display
- racy
- graphical interface to pprof (Tcl/Tk)
- jracy
- Java implementation of Racy
- Trace analysis and visualization
- Trace merging and clock adjustment (if necessary)
- Trace format conversion (ALOG, SDDF, Vampir)
- Vampir (Pallas) trace visualization
28 Pprof Command
- pprof [-c|-b|-m|-t|-e|-i|-v] [-r] [-s] [-n num] [-f file] [-l] [nodes]
- -c Sort according to number of calls
- -b Sort according to number of subroutines called
- -m Sort according to msecs (exclusive time total)
- -t Sort according to total msecs (inclusive time total)
- -e Sort according to exclusive time per call
- -i Sort according to inclusive time per call
- -v Sort according to standard deviation (exclusive usec)
- -r Reverse sorting order
- -s Print only summary profile information
- -n num Print only first number of functions
- -f file Specify full path and filename without node ids
- -l List all functions and exit
29 Pprof Output (NAS Parallel Benchmark LU)
- Intel Quad PIII Xeon, RedHat, PGI F90
- F90 + MPICH
- Profile for: Node / Context / Thread
- Application events and MPI events
30 jRacy (NAS Parallel Benchmark LU)
[Figure: jRacy displays showing the routine profile across all nodes, global profiles, and an individual profile; n: node, c: context, t: thread]
31 Vampir Trace Visualization Tool
- Visualization and Analysis of MPI Programs
- Originally developed by Forschungszentrum Jülich
- Current development by Technical University Dresden
- Distributed by PALLAS, Germany
- http://www.pallas.de/pages/vampir.htm
32 Vampir (NAS Parallel Benchmark LU)
[Figure: Vampir callgraph, timeline, parallelism, and communications displays]
33 Case Study: Hybrid Computation (OpenMP + MPI)
- Portable hybrid parallel programming
- OpenMP for shared memory parallel programming
- Fork-join model
- Loop level parallelism
- MPI for cross-box message-based parallelism
- OpenMP performance measurement
- Interface to OpenMP runtime system (RTS events)
- Compiler support and integration
- 2D Stommel model of ocean circulation
- Jacobi iteration, 5-point stencil
- Timothy Kaiser (San Diego Supercomputing Center)
34 OpenMP Instrumentation
- OPARI (FZJ, Germany)
- OpenMP Pragma And Region Instrumentor (OPARI)
- Source-to-Source translator to insert POMP calls around OpenMP constructs and API functions
- POMP
- OpenMP Directive Instrumentation
- OpenMP Runtime Library Routine Instrumentation
- Performance Monitoring Library Control
- User Code Instrumentation
- Context Descriptors
- Conditional Compilation
- Conditional / Selective Transformations
35 Example: !$OMP PARALLEL DO Instrumentation
Original:
!$OMP PARALLEL DO clauses...
      do loop
!$OMP END PARALLEL DO
Instrumented:
      call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)
36 Tracing Hybrid Executions: TAU and Vampir
37 Profiling Hybrid Executions
38 Case Study: Utah ASCI/ASAP Level 1 Center
- C-SAFE was established to build a problem-solving environment (PSE) for the numerical simulation of accidental fires and explosions
- Fundamental chemistry and engineering physics models
- Coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification
- Very large-scale simulations
- Computer science problems
- Coupling of multiple simulation codes
- Software engineering across diverse expert teams
- Achieving high performance on large-scale systems
39 Example C-SAFE Simulation Problems
[Figures: heptane fire simulation; material stress simulation]
Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics
40 Uintah High-Level Component View
41 Uintah Parallel Component Architecture
42 Uintah Computational Framework
- Execution model based on software (macro) dataflow
- Exposes parallelism and hides data transport latency
- Computations expressed as directed acyclic graphs of tasks
- each task consumes input and produces output (input to future task)
- input/outputs specified for each patch in a structured grid
- Abstraction of global single-assignment memory
- DataWarehouse
- Directory mapping names to values (array structured)
- Write value once then communicate to awaiting tasks
- Task graph gets mapped to processing resources
- Communications schedule approximates global optimal
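The combination of a single-assignment store and dataflow-driven task execution can be sketched in a few lines. This is a serial illustration under simplifying assumptions, not Uintah's code: the `DataWarehouse` here holds only scalar `double` values, and `Task` and `execute` are hypothetical names with no patches, scheduling, or communication.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Single-assignment store: each name may be written exactly once.
class DataWarehouse {
public:
    void put(const std::string& name, double value) {
        if (!values_.emplace(name, value).second)
            throw std::logic_error("value already written: " + name);
    }
    bool has(const std::string& name) const { return values_.count(name) != 0; }
    double get(const std::string& name) const { return values_.at(name); }

private:
    std::map<std::string, double> values_;
};

// A task declares the names it consumes and produces.
struct Task {
    std::string name;
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;
    std::function<void(DataWarehouse&)> run;
};

// Run every task once all of its inputs exist; returns the execution order.
std::vector<std::string> execute(std::vector<Task> tasks, DataWarehouse& dw) {
    std::vector<std::string> order;
    std::vector<bool> done(tasks.size(), false);
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (const auto& in : tasks[i].inputs) ready = ready && dw.has(in);
            if (!ready) continue;
            tasks[i].run(dw);
            // A task must produce everything it declared as output.
            for (const auto& out : tasks[i].outputs) assert(dw.has(out));
            done[i] = true;
            order.push_back(tasks[i].name);
            progress = true;
        }
    }
    if (order.size() != tasks.size())
        throw std::logic_error("unsatisfied inputs or cyclic task graph");
    return order;
}
```

Because a value can be written only once, a task that reads a name always sees the same value regardless of when it runs, so the declared inputs and outputs are enough to determine a valid order.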
43 Uintah Task Graph (Material Point Method)
- Diagram of named tasks (ovals) and data (edges)
- Imminent computation
- Dataflow-constrained
- MPM
- Newtonian material point motion time step
- Solid: values defined at material point (particle)
- Dashed: values defined at vertex (grid)
- Prime ('): values updated during time step
44 Uintah PSE
- UCF automatically sets up
- Domain decomposition
- Inter-processor communication with aggregation/reduction
- Parallel I/O
- Checkpoint and restart
- Performance measurement and analysis (stay tuned)
- Software engineering
- Coding standards
- CVS (Commits: Y3 - 26.6 files/day, Y4 - 29.9 files/day)
- Correctness regression testing with bugzilla bug tracking
- Nightly build (parallel compiles)
- 170,000 lines of code (Fortran and C++ tasks supported)
45 Performance Technology Integration
- Uintah presents challenges to performance integration
- Software diversity and structure
- UCF middleware, simulation code modules
- component-based hierarchy
- Portability objectives
- cross-language and cross-platform
- multi-parallelism: thread, message passing, mixed
- Scalability objectives
- High-level programming and execution abstractions
- Requires flexible and robust performance technology
- Requires support for performance mapping
46 Performance Analysis Objectives for Uintah
- Micro tuning
- Optimization of simulation code (task) kernels for maximum serial performance
- Scalability tuning
- Identification of parallel execution bottlenecks
- overheads: scheduler, data warehouse, communication
- load imbalance
- Adjustment of task graph decomposition and scheduling
- Performance tracking
- Understand performance impacts of code modifications
- Throughout course of software development
- C-SAFE application and UCF software
47 Uintah Performance Engineering Approach
- Contemporary performance methodology focuses on control flow (function) level measurement and analysis
- C-SAFE application involves coupled models with task-based parallelism and dataflow control constraints
- Performance engineering on algorithmic (task) basis
- Observe performance based on algorithm (task) semantics
- Analyze task performance characteristics in relation to other simulation tasks and UCF components
- scientific component developers can concentrate on performance improvement at algorithmic level
- UCF developers can concentrate on bottlenecks not directly associated with simulation module code
48 Task Execution in Uintah Parallel Scheduler
- Profile methods and functions in scheduler and in MPI library
[Figure annotations: "Task execution time dominates (what task?)", "Task execution time distribution", "MPI communication overheads (where?)"]
- Need to map performance data!
49 Semantics-Based Performance Mapping
- Associate performance measurements with high-level semantic abstractions
- Need mapping support in the performance measurement system to assign data correctly
50 Hypothetical Mapping Example
- Particles distributed on surfaces of a cube
Particle* P[MAX]; /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < last + particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face);
      ...
    }
    last += particles_on_this_face;
  }
}
51 Hypothetical Mapping Example (continued)
int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles(); /* create a list of particles */
  for (int i = 0; i < N; i++)
    /* iterates over the list */
    ProcessParticle(P[i]);
}
- How much time is spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel?
52 Semantic Entities/Attributes/Associations (SEAA)
- New dynamic mapping scheme
- Entities defined at any level of abstraction
- Attribute entity with semantic information
- Entity-to-entity associations
- Two association types (implemented in TAU API)
- Embedded: extends data structure of associated object to store performance measurement entity
- External: creates an external look-up table using address of object as the key to locate performance measurement entity
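The two association types can be contrasted with a short sketch, applied to the particle example from the preceding slides. All names here (`PerfData`, `ParticleEmbedded`, `ExternalMap`, `charge`) are hypothetical and chosen for illustration; this is not the TAU mapping API.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Performance measurement entity (hypothetical; stands in for a timer).
struct PerfData {
    std::uint64_t calls = 0;
    std::uint64_t ticks = 0;
    void add(std::uint64_t t) { calls += 1; ticks += t; }
};

// Embedded association: the object's own data structure is extended.
struct ParticleEmbedded {
    double mass = 1.0;
    PerfData* perf = nullptr;  // added field linking to the measurement entity
};

// External association: the object is untouched; its address is the key.
struct Particle {
    double mass = 1.0;
};

class ExternalMap {
public:
    void link(const void* object, PerfData* perf) { table_[object] = perf; }
    PerfData* find(const void* object) const {
        auto it = table_.find(object);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<const void*, PerfData*> table_;
};

// Charge `ticks` of work on a particle to the entity it is linked to.
inline void charge(const ParticleEmbedded& p, std::uint64_t ticks) {
    assert(p.perf != nullptr);
    p.perf->add(ticks);
}

inline void charge(const ExternalMap& map, const Particle& p, std::uint64_t ticks) {
    PerfData* perf = map.find(&p);
    assert(perf != nullptr);
    perf->add(ticks);
}
```

If each particle is linked to the `PerfData` of the face it was generated for, the time spent in a particle-processing routine accumulates per face, which answers the "how much time per face" question. The embedded form avoids a lookup but requires changing the object's type; the external form leaves the type alone at the cost of a table lookup and of keeping the table valid when objects move or are destroyed.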
53 No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
- Does not provide support for mapping
- Performance tools with SEAA mapping can observe performance with respect to the scientist's programming and problem abstractions
[Figure: profile displays for TAU (no mapping) and TAU (w/ mapping)]
54 Uintah Task Performance Mapping
- Uintah partitions individual particles across processing elements (processes or threads)
- Simulation tasks in task graph work on particles
- Tasks have domain-specific character in the computation
- "interpolate particles to grid" in Material Point Method
- Task instances generated for each partitioned particle set
- Execution scheduled with respect to task dependencies
- How to attribute execution time among different tasks
- Assign semantic name (task type) to a task instance
- SerialMPM::interpolateParticleToGrid
- Map TAU timer object to (abstract) task (semantic entity)
- Look up timer object using task type (semantic attribute)
- Further partition along different domain-specific axes
55 Using External Associations
- Two level mappings
- Level 1: <task name, timer>
- Level 2: <task name, patch, timer>
- Embedded association vs External association
[Figure labels: hash table, data (object), performance data]
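The two mapping levels can be sketched as two lookup tables that are updated together. This is an illustration only: `TaskTimers` and `Timer` are hypothetical names, the task names in the usage are made up, patches are identified by non-negative integers as an assumption, and `std::map` is used instead of a hash table because `std::pair` has no standard hash.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

struct Timer {
    std::uint64_t calls = 0;
    std::uint64_t ticks = 0;
};

class TaskTimers {
public:
    // Attribute `ticks` of one task instance to both mapping levels.
    void record(const std::string& task, int patch, std::uint64_t ticks) {
        assert(patch >= 0 && "patch ids are assumed non-negative in this sketch");
        Timer& byTask = level1_[task];                          // <task name, timer>
        Timer& byPatch = level2_[std::make_pair(task, patch)];  // <task name, patch, timer>
        byTask.calls += 1;
        byTask.ticks += ticks;
        byPatch.calls += 1;
        byPatch.ticks += ticks;
    }

    // Total ticks for a task type across all patches (0 if never recorded).
    std::uint64_t taskTicks(const std::string& task) const {
        auto it = level1_.find(task);
        return it == level1_.end() ? 0 : it->second.ticks;
    }

    // Ticks for a task type on one patch (0 if never recorded).
    std::uint64_t patchTicks(const std::string& task, int patch) const {
        auto it = level2_.find(std::make_pair(task, patch));
        return it == level2_.end() ? 0 : it->second.ticks;
    }

private:
    std::map<std::string, Timer> level1_;
    std::map<std::pair<std::string, int>, Timer> level2_;
};
```

Level 1 answers "how much time did this kind of task take overall", while level 2 splits the same time by patch, which is one example of partitioning along a further domain-specific axis.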
56 Task Performance Mapping Instrumentation
void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw, DataWarehouseP & dw ) {
  ...
  TAU_MAPPING_CREATE(
    task->getName(), "MPIScheduler::execute()",
    (TauGroup_t)(void*)task->getName(), task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer,(TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler,0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
57 Task Performance Mapping (Profile)
[Figure: mapped task performance across processes; performance mapping for different tasks]
58 Task Performance Mapping (Trace)
[Figure: work packet computation events colored by task type; distinct phases of computation can be identified based on task]
59 Task Performance Mapping (Trace - Zoom)
[Figure: startup communication imbalance]
60 Task Performance Mapping (Trace - Parallelism)
[Figure: communication / load imbalance]
61 Comparing Uintah Traces for Scalability Analysis
62 Scaling Performance Optimizations
[Figure annotations: "Last year: initial correct scheduler", "Reduce communication by 10x", "Reduce task graph overhead by 20x"]
ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory
63 Scalability to 2000 Processors (Fall 2001)
ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory
64 TAU Performance System Status
- Computing platforms
- IBM SP, SGI Origin, Intel Teraflop, Cray T3E, Compaq SC, HP, Sun, Apple, Windows, IA-32, IA-64 (Linux), ...
- Programming languages
- C, C++, Fortran 77/90, HPF, Java
- Communication libraries
- MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava
- Thread libraries
- pthread, Java, Windows, SGI sproc, Tulip, SMARTS, OpenMP
- Compilers
- KAI, PGI, GNU, Fujitsu, HP, Sun, Microsoft, SGI, Cray, IBM, Compaq
65 PDT Status
- Program Database Toolkit (Version 2.1, web download)
- EDG C++ front end (Version 2.45.2)
- Mutek Fortran 90 front end (Version 2.4.1)
- C++ and Fortran 90 IL Analyzer
- DUCTAPE library
- Standard C++ system header files (KCC Version 4.0f)
- PDT-constructed tools
- TAU instrumentor (C/C++/F90)
- Program analysis support for SILOON and CHASM
- Platforms
- SGI, IBM, Compaq, SUN, HP, Linux (IA32/IA64), Apple, Windows, Cray T3E
66 Evolution of the TAU Performance System
- Customization of TAU for specific needs
- TAU's existing strength lies in its robust support for performance instrumentation and measurement
- TAU will evolve to support new performance capabilities
- Online performance data access via application-level API
- Dynamic performance measurement control
- Generalize performance mapping
- Runtime performance analysis and visualization
67 Information
- TAU (http://www.acl.lanl.gov/tau)
- PDT (http://www.acl.lanl.gov/pdtoolkit)
68 Support Acknowledgement
- TAU and PDT support
- Department of Energy (DOE)
- DOE 2000 ACTS contract
- DOE MICS contract
- DOE ASCI Level 3 (LANL, LLNL)
- U. of Utah DOE ASCI Level 1 subcontract
- DARPA
- NSF National Young Investigator (NYI) award