Performance Technology for Complex Parallel Systems Sameer Shende University of Oregon

Learn more at: http://www.cs.uoregon.edu

1
Performance Technology for Complex Parallel Systems
Sameer Shende, University of Oregon
2
General Problems

How do we create robust and ubiquitous
performance technology for the analysis and
tuning of parallel and distributed software and
systems in the presence of (evolving) complexity
challenges?
How do we apply performance technology
effectively to the variety and diversity of
performance problems that arise in the context of
complex parallel and distributed computer systems?

3
Computation Model for Performance Technology

How to address dual performance technology goals?
Robust capabilities and widely available
methodologies
Contend with problems of system diversity
Flexible tool composition/configuration/integration
Approaches
Restrict computation types / performance problems
limited performance technology coverage
Base technology on abstract computation model
general architecture and software execution
features
map features/methods to existing complex system
types
develop capabilities that can adapt and be
optimized

4
General Complex System Computation Model

Node: physically distinct shared memory machine
Message passing node interconnection network
Context: distinct virtual memory space within
node
Thread: execution threads (user/system) in context

[Figure: physical view (SMP nodes, each with node memory, joined by an interconnection network carrying inter-node message communication) and model view (node, context as VM space, threads)]
5
Definitions: Profiling

Profiling
Recording of summary information during execution
inclusive, exclusive time, calls, hardware
statistics, ...
Reflects performance behavior of program entities
functions, loops, basic blocks
user-defined semantic entities
Very good for low-cost performance assessment
Helps to expose performance bottlenecks and
hotspots
Implemented through
sampling: periodic OS interrupts or hardware
counter traps
instrumentation: direct insertion of measurement
code
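
The difference between inclusive and exclusive time is easiest to see in a small instrumentation-style profiler. The following is a minimal sketch, not TAU's implementation: the names Profiler, ProfileEntry, enter, and exit are hypothetical, and timestamps are passed in explicitly so the arithmetic stays deterministic.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Summary statistics kept per program entity (here: per function name).
struct ProfileEntry {
    long calls = 0;
    double inclusive = 0.0;  // time in the entity, callees included
    double exclusive = 0.0;  // time in the entity, callees excluded
};

// Hypothetical instrumentation-based profiler. The inserted measurement
// code would call enter() and exit() around each instrumented region.
class Profiler {
public:
    void enter(const std::string& name, double now) {
        stack_.push_back({name, now, 0.0});
    }

    void exit(double now) {
        assert(!stack_.empty() && "exit() without matching enter()");
        Frame f = stack_.back();
        stack_.pop_back();
        double inclusive = now - f.start;
        ProfileEntry& e = table_[f.name];
        e.calls += 1;
        e.inclusive += inclusive;
        e.exclusive += inclusive - f.child_time;
        // Charge this call's inclusive time to the caller as child time.
        if (!stack_.empty()) stack_.back().child_time += inclusive;
    }

    const ProfileEntry& entry(const std::string& name) const {
        return table_.at(name);
    }

private:
    struct Frame {
        std::string name;
        double start;
        double child_time;
    };
    std::vector<Frame> stack_;
    std::map<std::string, ProfileEntry> table_;
};
```

If main runs from time 0 to 10 and calls solve from time 1 to 4, main ends up with 10 units of inclusive time and 7 of exclusive time, while solve has 3 of each. Only the summary table survives the run, which is why profiling is cheap compared with tracing.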

6
Definitions: Tracing

Tracing
Recording of information about significant points
(events) during program execution
entering/exiting code region (function, loop,
block, ...)
thread/process interactions (e.g., send/receive
message)
Save information in event record
timestamp
CPU identifier, thread identifier
Event type and event-specific information
Event trace is a time-sequenced stream of event
records
Can be used to reconstruct dynamic program
behavior
Typically requires code instrumentation
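
As an illustration of the event record and the time-sequenced stream, here is a minimal sketch. The field layout, the EventType values, and the merge_traces helper are assumptions made for this example and do not reflect TAU's trace format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

enum class EventType { Enter, Exit, Send, Receive };

// One event record with the fields listed above.
struct EventRecord {
    std::uint64_t timestamp;  // e.g. microseconds since program start
    int cpu;                  // CPU identifier
    int thread;               // thread identifier
    EventType type;           // event type
    int data;                 // event-specific information (hypothetical: region or message id)
};

bool by_time(const EventRecord& a, const EventRecord& b) {
    return a.timestamp < b.timestamp;
}

// Combine per-thread traces, each already ordered by time, into a single
// time-sequenced stream by concatenating them and stable-sorting on the
// timestamp. Events with equal timestamps keep their input order.
std::vector<EventRecord> merge_traces(
        const std::vector<std::vector<EventRecord>>& traces) {
    std::vector<EventRecord> merged;
    for (const auto& t : traces) {
        assert(std::is_sorted(t.begin(), t.end(), by_time));
        merged.insert(merged.end(), t.begin(), t.end());
    }
    std::stable_sort(merged.begin(), merged.end(), by_time);
    return merged;
}
```

The sketch assumes all timestamps come from one clock. On a real parallel machine the per-node clocks would first have to be synchronized or adjusted before merging.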

7
Event Tracing: Instrumentation, Monitor, Trace
[Figure: event definition; timestamped events from CPU A and CPU B collected by a monitor]
8
Event Tracing: Timeline Visualization
[Figure: timeline display of the routines main, master, and slave]
9
TAU Performance System Framework

Tuning and Analysis Utilities
Performance system framework for scalable
parallel and distributed high-performance
computing
Targets a general complex system computation
model
nodes / contexts / threads
Multi-level system / software / parallelism
Measurement and analysis abstraction
Integrated toolkit for performance
instrumentation, measurement, analysis, and
visualization
Portable performance profiling/tracing facility
Open software approach

10
TAU Performance System Architecture
11
Levels of Code Transformation

As program information flows through stages of
compilation/linking/execution, different
information is accessible at different stages
Each level poses different constraints and
opportunities for extracting information
At what level should performance instrumentation
be done?

12
TAU Instrumentation

Flexible instrumentation mechanisms at multiple
levels
Source code
manual
automatic using Program Database Toolkit (PDT),
OPARI
Object code
pre-instrumented libraries (e.g., MPI using PMPI)
statically linked
dynamically linked (e.g., Virtual machine
instrumentation)
fast breakpoints (compiler generated)
Executable code
dynamic instrumentation (pre-execution) using
DynInstAPI

13
TAU Instrumentation (continued)

Targets common measurement interface (TAU API)
Object-based design and implementation
Macro-based, using constructor/destructor
techniques
Program units: functions, classes, templates,
blocks
Uniquely identify functions and templates
name and type signature (name registration)
static object creates performance entry
dynamic object receives static object pointer
runtime type identification for template
instantiations
C and Fortran instrumentation variants
Instrumentation and measurement optimization
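
The constructor/destructor technique and the static/dynamic object pair can be sketched as follows. This is an illustration only: FunctionInfo, ScopedProfiler, and the SKETCH_PROFILE macro are hypothetical names written for this example, and the macro is not TAU_PROFILE (it ignores profile groups, threads, and nesting).

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>

// Static part: one performance entry per instrumented code region,
// identified by name and type signature.
struct FunctionInfo {
    std::string name;
    std::string type;
    long calls = 0;
    double inclusive_usec = 0.0;
    FunctionInfo(std::string n, std::string t)
        : name(std::move(n)), type(std::move(t)) {}
};

// Dynamic part: created on every entry to the region. It receives a pointer
// to the static entry, starts timing in the constructor, and updates the
// entry in the destructor when the scope ends.
class ScopedProfiler {
public:
    explicit ScopedProfiler(FunctionInfo* info)
        : info_(info), start_(std::chrono::steady_clock::now()) {
        assert(info_ != nullptr);
    }
    ~ScopedProfiler() {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        info_->calls += 1;
        info_->inclusive_usec +=
            std::chrono::duration<double, std::micro>(elapsed).count();
    }
    ScopedProfiler(const ScopedProfiler&) = delete;
    ScopedProfiler& operator=(const ScopedProfiler&) = delete;

private:
    FunctionInfo* info_;
    std::chrono::steady_clock::time_point start_;
};

// Hypothetical macro: the static object is created once, on first entry;
// the dynamic object is created on every entry.
#define SKETCH_PROFILE(name, type)                \
    static FunctionInfo sketch_info_(name, type); \
    ScopedProfiler sketch_profiler_(&sketch_info_)

// Example instrumented function. It returns the address of its static entry
// only so that a caller can inspect the recorded counts.
const FunctionInfo* work(int n) {
    SKETCH_PROFILE("work", "const FunctionInfo* (int)");
    long sum = 0;
    for (int i = 0; i < n; ++i) sum += i;
    (void)sum;
    return &sketch_info_;
}
```

Because the destructor runs on every exit path, including early returns and exceptions, a single macro line at the top of a function is enough to bracket the whole body.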

14
Multi-Level Instrumentation

Uses multiple instrumentation interfaces
Shares information: cooperation between
interfaces
Taps information at multiple levels
Provides selective instrumentation at each level
Targets a common performance model
Presents a unified view of execution

15
Program Database Toolkit (PDT)

Program code analysis framework for developing
source-based tools
High-level interface to source code information
Integrated toolkit for source code parsing,
database creation, and database query
commercial grade front end parsers
portable IL analyzer, database format, and access
API
open software approach for tool development
Target and integrate multiple source languages
Use in TAU to build automated performance
instrumentation tools

16
PDT Architecture and Tools
[Figure: PDT architecture and tools for C/C++ and Fortran 77/90]
17
PDT Components

Language front end
Edison Design Group (EDG): C, C++, Java
Mutek Solutions Ltd.: F77, F90
creates an intermediate-language (IL) tree
IL Analyzer
processes the intermediate language (IL) tree
creates program database (PDB) formatted file
DUCTAPE (Bernd Mohr, ZAM, Germany)
C++ program Database Utilities and Conversion
Tools APplication Environment
processes and merges PDB files
C++ library to access the PDB for PDT applications

18
TAU Measurement

Performance information
High-resolution timer library (real-time /
virtual clocks)
General software counter library (user-defined
events)
Hardware performance counters
PCL (Performance Counter Library) (ZAM, Germany)
PAPI (Performance API) (UTK, Ptools Consortium)
consistent, portable API
Organization
Node, context, thread levels
Profile groups for collective events (runtime
selective)
Performance data mapping between software levels

19
TAU Measurement (continued)

Parallel profiling
Function-level, block-level, statement-level
Supports user-defined events
TAU parallel profile database
Function callstack
Hardware counter values (in place of time)
Tracing
All profile-level events
Inter-process communication events
Timestamp synchronization
User-configurable measurement library (user
controlled)

20
TAU Measurement System Configuration

configure [OPTIONS]
-c++=<CC>, -cc=<cc>          Specify C++ and C compilers
-pthread, -sproc             Use pthread or SGI sproc threads
-openmp                      Use OpenMP threads
-jdk=<dir>                   Specify location of Java Dev. Kit
-opari=<dir>                 Specify location of Opari OpenMP tool
-pcl=<dir>, -papi=<dir>      Specify location of PCL or PAPI
-pdt=<dir>                   Specify location of PDT
-dyninst=<dir>               Specify location of DynInst Package
-mpiinc=<d>, -mpilib=<d>     Specify MPI library instrumentation
-TRACE                       Generate TAU event traces
-PROFILE                     Generate TAU profiles
-CPUTIME                     Use user time + system time
-PAPIWALLCLOCK               Use PAPI to access wallclock time
-PAPIVIRTUAL                 Use PAPI for virtual (user) time

21
TAU Measurement Configuration Examples

./configure -c++=xlC -cc=xlc -pdt=/usr/packages/pdtoolkit-2.1 -pthread
Use TAU with IBM's xlC compiler, PDT, and the
pthread library
Enable TAU profiling (default)
./configure -TRACE -PROFILE
Enable both TAU profiling and tracing
./configure -c++=guidec++ -cc=guidec \
    -papi=/usr/local/packages/papi -openmp \
    -mpiinc=/usr/packages/mpich/include \
    -mpilib=/usr/packages/mpich/lib
Use OpenMP+MPI with KAI's Guide compiler suite,
and use PAPI to access hardware performance
counters for measurements
Typically configure multiple measurement libraries

22
TAU Measurement API

Initialization and runtime configuration
TAU_PROFILE_INIT(argc, argv)
TAU_PROFILE_SET_NODE(myNode)
TAU_PROFILE_SET_CONTEXT(myContext)
TAU_PROFILE_EXIT(message)
TAU_REGISTER_THREAD()
Function and class methods
TAU_PROFILE(name, type, group)
Template
TAU_TYPE_STRING(variable, type)
TAU_PROFILE(name, type, group)
CT(variable)
User-defined timing
TAU_PROFILE_TIMER(timer, name, type, group)
TAU_PROFILE_START(timer)
TAU_PROFILE_STOP(timer)

23
Compiling: TAU Makefiles

Include TAU Makefile in the user's Makefile.
Variables
TAU_CXX          Specify the C++ compiler used by TAU
TAU_CC           Specify the C compiler used by TAU
TAU_DEFS         Defines used by TAU. Add to CFLAGS
TAU_LDFLAGS      Linker options. Add to LDFLAGS
TAU_INCLUDE      Header files include path. Add to CFLAGS
TAU_LIBS         Statically linked TAU library. Add to LIBS
TAU_SHLIBS       Dynamically linked TAU library
TAU_MPI_LIBS     TAU's MPI wrapper library for C/C++
TAU_MPI_FLIBS    TAU's MPI wrapper library for F90
TAU_FORTRANLIBS  Must be linked in with C++ linker for F90
Note: Not including TAU_DEFS in CFLAGS disables
instrumentation in C/C++ programs.

24
Including TAU Makefile - Example
include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(CC) $(CFLAGS) -c $< -o $@
25
TAU Makefile for PDT
include /usr/tau/include/Makefile
CXX = $(TAU_CXX)
CC = $(TAU_CC)
PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse
TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(PDTPARSE) $<
	$(TAUINSTR) $*.pdb $< -o $*.inst.cpp
	$(CC) $(CFLAGS) -c $*.inst.cpp -o $@
26
Setup: Running Applications
setenv PROFILEDIR /home/data/experiments/profile/01
setenv TRACEDIR /home/data/experiments/trace/01
set path=($path <taudir>/<arch>/bin)
setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib
For PAPI/PCL:
setenv PAPI_EVENT PAPI_FP_INS
setenv PCL_EVENT PCL_FP_INSTR
For Java (without instrumentation):
java application
With instrumentation:
java -XrunTAU application
java -XrunTAU:exclude=sun/io,java application
For DyninstAPI:
a.out
tau_run a.out
tau_run -XrunTAUsh-papi a.out
27
TAU Analysis

Profile analysis
pprof
parallel profiler with text-based display
racy
graphical interface to pprof (Tcl/Tk)
jracy
Java implementation of Racy
Trace analysis and visualization
Trace merging and clock adjustment (if necessary)
Trace format conversion (ALOG, SDDF, Vampir)
Vampir (Pallas) trace visualization

28
Pprof Command

pprof [-c|-b|-m|-t|-e|-i|-v] [-r] [-s] [-n num] [-f file] [-l] [nodes]
-c        Sort according to number of calls
-b        Sort according to number of subroutines called
-m        Sort according to msecs (exclusive time total)
-t        Sort according to total msecs (inclusive time total)
-e        Sort according to exclusive time per call
-i        Sort according to inclusive time per call
-v        Sort according to standard deviation (exclusive usec)
-r        Reverse sorting order
-s        Print only summary profile information
-n num    Print only the first num functions
-f file   Specify full path and filename without node ids
-l        List all functions and exit

29
Pprof Output (NAS Parallel Benchmark LU)

Intel Quad PIII Xeon, RedHat, PGI F90
F90 + MPICH
Profile for Node Context Thread
Application events and MPI events

30
jRacy (NAS Parallel Benchmark LU)
Routine profile across all nodes
Global profiles
n: node, c: context, t: thread
Individual profile
31
Vampir Trace Visualization Tool

Visualization and Analysis of MPI Programs
Originally developed by Forschungszentrum Jülich
Current development by Technical University
Dresden
Distributed by PALLAS, Germany

http://www.pallas.de/pages/vampir.htm

32
Vampir (NAS Parallel Benchmark LU)
Callgraph display
Timeline display
Parallelism display
Communications display
33
Case Study: Hybrid Computation (OpenMP + MPI)

Portable hybrid parallel programming
OpenMP for shared memory parallel programming
Fork-join model
Loop level parallelism
MPI for cross-box message-based parallelism
OpenMP performance measurement
Interface to OpenMP runtime system (RTS events)
Compiler support and integration
2D Stommel model of ocean circulation
Jacobi iteration, 5-point stencil
Timothy Kaiser (San Diego Supercomputer Center)

34
OpenMP Instrumentation

OPARI (FZJ, Germany)
OpenMP Pragma And Region Instrumentor (OPARI)
Source-to-Source translator to insert POMP calls
around OpenMP constructs and API functions
POMP
OpenMP Directive Instrumentation
OpenMP Runtime Library Routine Instrumentation
Performance Monitoring Library Control
User Code Instrumentation
Context Descriptors
Conditional Compilation
Conditional / Selective Transformations

35
Example: !$OMP PARALLEL DO Instrumentation
Original code:
!$OMP PARALLEL DO clauses...
      do loop
!$OMP END PARALLEL DO
Instrumented code:
      call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)

36
Tracing Hybrid Executions: TAU and Vampir
37
Profiling Hybrid Executions
38
Case Study: Utah ASCI/ASAP Level 1 Center

C-SAFE was established to build a problem-solving
environment (PSE) for the numerical simulation of
accidental fires and explosions
Fundamental chemistry and engineering physics
models
Coupled with non-linear solvers, optimization,
computational steering, visualization, and
experimental data verification
Very large-scale simulations
Computer science problems
Coupling of multiple simulation codes
Software engineering across diverse expert teams
Achieving high performance on large-scale systems

39
Example C-SAFE Simulation Problems
Heptane fire simulation
Typical C-SAFE simulation with a billion degrees
of freedom and non-linear time dynamics
Material stress simulation
40
Uintah High-Level Component View
41
Uintah Parallel Component Architecture
42
Uintah Computational Framework

Execution model based on software (macro)
dataflow
Exposes parallelism and hides data transport
latency
Computations expressed as directed acyclic graphs
of tasks
consumes input and produces output (input to
future task)
input/outputs specified for each patch in a
structured grid
Abstraction of global single-assignment memory
DataWarehouse
Directory mapping names to values (array
structured)
Write value once then communicate to awaiting
tasks
Task graph gets mapped to processing resources
Communications schedule approximates global
optimal
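
The dataflow execution model and the single-assignment DataWarehouse can be illustrated with a small sketch. It is not Uintah code: the DataWarehouse, Task, and run_dataflow definitions below are simplified stand-ins written for this example (scalar values instead of arrays, no patches, no parallel scheduling).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Single-assignment store: a directory mapping names to values, where each
// name may be written exactly once.
class DataWarehouse {
public:
    void put(const std::string& name, double value) {
        if (!values_.emplace(name, value).second)
            throw std::logic_error("value written twice: " + name);
    }
    bool has(const std::string& name) const { return values_.count(name) != 0; }
    double get(const std::string& name) const { return values_.at(name); }

private:
    std::map<std::string, double> values_;
};

// A task consumes named inputs and produces one named output.
struct Task {
    std::string name;
    std::vector<std::string> inputs;
    std::string output;
    std::function<double(const DataWarehouse&)> compute;
};

// Run every task once, as soon as all of its inputs exist in the warehouse.
// Returns the execution order; throws if some task can never become ready.
std::vector<std::string> run_dataflow(const std::vector<Task>& tasks,
                                      DataWarehouse& dw) {
    std::vector<std::string> order;
    std::vector<bool> done(tasks.size(), false);
    std::size_t remaining = tasks.size();
    while (remaining > 0) {
        bool progressed = false;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (const auto& in : tasks[i].inputs) ready = ready && dw.has(in);
            if (!ready) continue;
            dw.put(tasks[i].output, tasks[i].compute(dw));
            done[i] = true;
            order.push_back(tasks[i].name);
            --remaining;
            progressed = true;
        }
        if (!progressed)
            throw std::logic_error("cyclic or unsatisfied dependency");
    }
    assert(order.size() == tasks.size());
    return order;
}
```

Execution order here follows data availability rather than the order in which tasks were declared, which is the property that lets a real scheduler expose parallelism and overlap communication with computation.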

43
Uintah Task Graph (Material Point Method)

Diagram of named tasks (ovals) and data (edges)
Imminent computation
Dataflow-constrained
MPM
Newtonian material point motion time step
Solid values defined at material point
(particle)
Dashed values defined at vertex (grid)
Prime (') values updated during time step

44
Uintah PSE

UCF automatically sets up
Domain decomposition
Inter-processor communication with
aggregation/reduction
Parallel I/O
Checkpoint and restart
Performance measurement and analysis (stay tuned)
Software engineering
Coding standards
CVS (Commits Y3 - 26.6 files/day, Y4 - 29.9
files/day)
Correctness regression testing with Bugzilla bug
tracking
Nightly build (parallel compiles)
170,000 lines of code (Fortran and C++ tasks
supported)

45
Performance Technology Integration

Uintah presents challenges to performance
integration
Software diversity and structure
UCF middleware, simulation code modules
component-based hierarchy
Portability objectives
cross-language and cross-platform
multi-parallelism: thread, message passing, mixed
Scalability objectives
High-level programming and execution abstractions
Requires flexible and robust performance
technology
Requires support for performance mapping

46
Performance Analysis Objectives for Uintah

Micro tuning
Optimization of simulation code (task) kernels
for maximum serial performance
Scalability tuning
Identification of parallel execution bottlenecks
overheads: scheduler, data warehouse,
communication
load imbalance
Adjustment of task graph decomposition and
scheduling
Performance tracking
Understand performance impacts of code
modifications
Throughout course of software development
C-SAFE application and UCF software

47
Uintah Performance Engineering Approach

Contemporary performance methodology focuses on
control flow (function) level measurement and
analysis
C-SAFE application involves coupled-models with
task-based parallelism and dataflow control
constraints
Performance engineering on algorithmic (task)
basis
Observe performance based on algorithm (task)
semantics
Analyze task performance characteristics in
relation to other simulation tasks and UCF
components
Scientific component developers can concentrate
on performance improvement at algorithmic level
UCF developers can concentrate on bottlenecks not
directly associated with simulation module code

48
Task Execution in Uintah Parallel Scheduler

Profile methods and functions in scheduler and in
MPI library

Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)

Need to map performance data!

49
Semantics-Based Performance Mapping

Associate performance measurements with
high-level semantic abstractions
Need mapping support in the performance
measurement system to assign data correctly

50
Hypothetical Mapping Example

Particles distributed on surfaces of a cube

Particle* P[MAX]; /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < last + particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face);
      ...
    }
    last += particles_on_this_face;
  }
}
51
Hypothetical Mapping Example (continued)
int ProcessParticle(Particle* p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles();
  /* create a list of particles */
  for (int i = 0; i < N; i++)
    /* iterates over the list */
    ProcessParticle(P[i]);
}

How much time is spent processing face i
particles?
What is the distribution of performance among
faces?
How is this determined if execution is parallel?

52
Semantic Entities/Attributes/Associations (SEAA)

New dynamic mapping scheme
Entities defined at any level of abstraction
Attribute entity with semantic information
Entity-to-entity associations
Two association types (implemented in TAU API)
Embedded: extends data structure of associated
object to store performance measurement entity
External: creates an external look-up table
using address of object as the key to locate
performance measurement entity
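
The two association types can be contrasted in a few lines. This is a sketch under stated assumptions, not the TAU API: Timer, Particle, PlainParticle, ExternalMap, and the record_* helpers are hypothetical names, and the timer simply accumulates the times it is handed.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Performance measurement entity (hypothetical stand-in for a timer object).
struct Timer {
    std::string name;
    double total = 0.0;
    long count = 0;
    void add(double t) { total += t; ++count; }
};

// Embedded association: the application object is extended with a pointer
// to its performance measurement entity.
struct Particle {
    int face;
    Timer* timer;  // embedded link
};

// External association: the object is left untouched, and a look-up table
// keyed by the object's address locates the measurement entity.
struct PlainParticle {
    int face;
};

class ExternalMap {
public:
    void link(const void* object, Timer* timer) { table_[object] = timer; }
    Timer* find(const void* object) const {
        auto it = table_.find(object);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<const void*, Timer*> table_;
};

// Charge processing time to the timer reached through the embedded link.
void record_embedded(const Particle& p, double t) {
    assert(p.timer != nullptr);
    p.timer->add(t);
}

// Charge processing time to the timer found in the external table.
// Returns false if the object was never linked.
bool record_external(const ExternalMap& m, const PlainParticle& p, double t) {
    Timer* timer = m.find(&p);
    if (timer == nullptr) return false;
    timer->add(t);
    return true;
}
```

The embedded form costs one pointer per object and needs access to the object's definition. The external form leaves the data structure alone at the price of a hash look-up per measurement, and it relies on object addresses staying stable while the link exists.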

53
No Performance Mapping versus Mapping

Typical performance tools report performance with
respect to routines
Does not provide support for mapping

Performance tools with SEAA mapping can observe
performance with respect to the scientist's
programming and problem abstractions

TAU (w/ mapping)
TAU (no mapping)
54
Uintah Task Performance Mapping

Uintah partitions individual particles across
processing elements (processes or threads)
Simulation tasks in task graph work on particles
Tasks have domain-specific character in the
computation
interpolate particles to grid in Material Point
Method
Task instances generated for each partitioned
particle set
Execution scheduled with respect to task
dependencies
How to attribute execution time among different
tasks?
Assign semantic name (task type) to a task
instance
SerialMPM::interpolateParticleToGrid
Map TAU timer object to (abstract) task (semantic
entity)
Look up timer object using task type (semantic
attribute)
Further partition along different domain-specific
axes

55
Using External Associations

Two level mappings
Level 1: <task name, timer>
Level 2: <task name, patch, timer>
Embedded association vs. external association

Hash Table
Data (object)
Performance Data
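
A two-level external mapping can be sketched with two look-up tables, one keyed by task name and one keyed by task name and patch. The class below is an illustration only: TaskTimer, TaskTimers, and their member names are hypothetical, ordered maps stand in for the hash table in the figure, and the patch is reduced to an integer id.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Accumulated time and call count for one mapping entry.
struct TaskTimer {
    double total = 0.0;
    long count = 0;
};

class TaskTimers {
public:
    // Record one execution of `task` on `patch`, updating both levels.
    void record(const std::string& task, int patch, double seconds) {
        assert(seconds >= 0.0);
        add(by_task_[task], seconds);
        add(by_task_patch_[std::make_pair(task, patch)], seconds);
    }

    // Level 1 look-up: <task name, timer>.
    const TaskTimer& for_task(const std::string& task) const {
        return by_task_.at(task);
    }

    // Level 2 look-up: <task name, patch, timer>.
    const TaskTimer& for_task_on_patch(const std::string& task, int patch) const {
        return by_task_patch_.at(std::make_pair(task, patch));
    }

private:
    static void add(TaskTimer& t, double seconds) {
        t.total += seconds;
        ++t.count;
    }
    std::map<std::string, TaskTimer> by_task_;
    std::map<std::pair<std::string, int>, TaskTimer> by_task_patch_;
};
```

Level 1 answers which task type dominates execution time; level 2 partitions the same measurements further by patch, which helps when the question is whether one patch is responsible for a load imbalance.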
56
Task Performance Mapping Instrumentation

void MPIScheduler::execute(const ProcessorGroup* pc,
                           DataWarehouseP& old_dw, DataWarehouseP& dw)
{
  ...
  TAU_MAPPING_CREATE(
    task->getName(), "MPIScheduler::execute()",
    (TauGroup_t)(void*)task->getName(),
    task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}

57
Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
58
Task Performance Mapping (Trace)
Work packet computation events colored by task
type
Distinct phases of computation can be identified
based on task
59
Task Performance Mapping (Trace - Zoom)
Startup communication imbalance
60
Task Performance Mapping (Trace - Parallelism)
Communication / load imbalance
61
Comparing Uintah Traces for Scalability Analysis
62
Scaling Performance Optimizations
Last year: initial correct scheduler
Reduce communication by 10x
Reduce task graph overhead by 20x
ASCI Nirvana, SGI Origin 2000, Los Alamos National
Laboratory
63
Scalability to 2000 Processors (Fall 2001)
ASCI Nirvana, SGI Origin 2000, Los Alamos National
Laboratory
64
TAU Performance System Status

Computing platforms
IBM SP, SGI Origin, Intel Teraflop, Cray T3E,
Compaq SC, HP, Sun, Apple, Windows, IA-32, IA-64
(Linux), ...
Programming languages
C, C++, Fortran 77/90, HPF, Java
Communication libraries
MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava
Thread libraries
pthread, Java, Windows, SGI sproc, Tulip, SMARTS,
OpenMP
Compilers
KAI, PGI, GNU, Fujitsu, HP, Sun, Microsoft, SGI,
Cray, IBM, Compaq

65
PDT Status

Program Database Toolkit (Version 2.1, web
download)
EDG C++ front end (Version 2.45.2)
Mutek Fortran 90 front end (Version 2.4.1)
C++ and Fortran 90 IL Analyzer
DUCTAPE library
Standard C++ system header files (KCC Version
4.0f)
PDT-constructed tools
TAU instrumentor (C/C++/F90)
Program analysis support for SILOON and CHASM
Platforms
SGI, IBM, Compaq, SUN, HP, Linux (IA32/IA64),
Apple, Windows, Cray T3E

66
Evolution of the TAU Performance System

Customization of TAU for specific needs
TAU's existing strength lies in its robust
support for performance instrumentation and
measurement
TAU will evolve to support new performance
capabilities
Online performance data access via
application-level API
Dynamic performance measurement control
Generalize performance mapping
Runtime performance analysis and visualization

67
Information