1
The TAU Performance System
  • Allen D. Malony, Sameer S. Shende, Robert Bell
  • {malony, sameer, bertie}@cs.uoregon.edu
  • Department of Computer and Information Science
  • Computational Science Institute
  • University of Oregon

2
Overview
  • Motivation and goals
  • TAU architecture and toolkit
  • Instrumentation
  • Measurement
  • Analysis
  • Performance mapping
  • Application case studies
  • TAU Integration
  • Work in progress
  • Conclusions

3
Motivation
  • Tools for performance problem solving
  • Empirical-based performance optimization process
  • Versatile performance technology
  • Portable performance analysis methods

[Figure: Performance Observation -> (characterization) -> Performance Experimentation -> (properties) -> Performance Diagnosis -> (hypotheses) -> Performance Tuning, all supported by Performance Technology]
4
Problems
  • Diverse performance observability requirements
  • Multiple levels of software and hardware
  • Different types and detail of performance data
  • Alternative performance problem solving methods
  • Multiple targets of software and system
    application
  • Demands more robust performance technology
  • Broad scope of performance observation
  • Flexible and configurable mechanisms
  • Technology integration and extension
  • Cross-platform portability
  • Open, layered, and modular framework architecture

5
Complexity Challenges for Performance Tools
  • Computing system environment complexity
  • Observation integration and optimization
  • Access, accuracy, and granularity constraints
  • Diverse/specialized observation
    capabilities/technology
  • Restricted modes limit performance problem
    solving
  • Sophisticated software development environments
  • Programming paradigms and performance models
  • Performance data mapping to software abstractions
  • Uniformity of performance abstraction across
    platforms
  • Rich observation capabilities and flexible
    configuration
  • Common performance problem solving methods

6
General Problems (Performance Technology)
  • How do we create robust and ubiquitous
    performance technology for the analysis and
    tuning of parallel and distributed software and
    systems in the presence of (evolving) complexity
    challenges?
  • How do we apply performance technology
    effectively for the variety and diversity of
    performance problems that arise in the context of
    complex parallel and distributed computer systems?

7
Computation Model for Performance Technology
  • How to address dual performance technology goals?
  • Robust capabilities + widely available methods
  • Contend with problems of system diversity
  • Flexible tool composition/configuration/integration
  • Approaches
  • Restrict computation types / performance problems
  • machines, languages, instrumentation technique, ...
  • limited performance technology coverage and
    application
  • Base technology on abstract computation model
  • general architecture and software execution
    features
  • map features/methods to existing complex system
    types
  • develop capabilities that can be adapted and
    optimized

8
General Complex System Computation Model
  • Node: physically distinct shared memory machine
  • Message passing node interconnection network
  • Context: distinct virtual memory space within
    node
  • Thread: execution threads (user/system) in context

[Figure: physical view (SMP nodes, each with node memory, joined by an interconnection network carrying inter-node message communication) next to the model view (node, VM space, context, threads)]
9
TAU Performance System
  • Tuning and Analysis Utilities
  • Performance system framework for scalable
    parallel and distributed high-performance
    computing
  • Targets a general complex system computation
    model
  • nodes / contexts / threads
  • Multi-level system / software / parallelism
  • Measurement and analysis abstraction
  • Integrated toolkit for performance
    instrumentation, measurement, analysis, and
    visualization
  • Portable performance profiling and tracing
    facility
  • Open software approach with technology
    integration
  • University of Oregon, Forschungszentrum Jülich,
    LANL

10
Definitions: Profiling
  • Profiling
  • Recording of summary information during execution
  • execution time, calls, hardware statistics, ...
  • Reflects performance behavior of program entities
  • functions, loops, basic blocks
  • user-defined semantic entities
  • Very good for low-cost performance assessment
  • Helps to expose performance bottlenecks and
    hotspots
  • Implemented through
  • sampling: periodic OS interrupts or hardware
    counter traps
  • instrumentation: direct insertion of measurement
    code

11
Definitions: Tracing
  • Tracing
  • Recording of information about significant points
    (events) during program execution
  • entering/exiting code region (function, loop,
    block, ...)
  • thread/process interactions (e.g., send/receive
    message)
  • Save information in event record
  • timestamp
  • CPU identifier, thread identifier
  • Event type and event-specific information
  • Event trace is a time-sequenced stream of event
    records
  • Can be used to reconstruct dynamic program
    behavior
  • Typically requires code instrumentation
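
The event record described on this slide can be sketched as a small C++ structure. This is only an illustration under stated assumptions, not TAU's trace format: the names `EventRecord`, `TraceBuffer`, and `inclusive_usec` are hypothetical, timestamps are supplied by the caller, and regions are assumed not to recurse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical event kinds; a real tracer would define many more.
enum class EventType { Enter, Exit, Send, Recv };

// One event record: timestamp, location identifiers, event type,
// and event-specific information (here a single integer parameter).
struct EventRecord {
    std::uint64_t timestamp_usec;
    int node;
    int thread;
    EventType type;
    std::string name;    // region name or message description
    std::int64_t param;  // e.g. message size in bytes; 0 if unused
};

// An event trace is a time-sequenced stream of event records.
class TraceBuffer {
public:
    void record(const EventRecord& ev) { events_.push_back(ev); }

    // Reconstruct one aspect of dynamic behavior: total time spent in a
    // named region on one thread, summed over matching Enter/Exit pairs.
    // Assumes a time-ordered trace and non-recursive regions.
    std::uint64_t inclusive_usec(const std::string& name, int thread) const {
        std::uint64_t total = 0;
        std::uint64_t entered = 0;
        bool open = false;
        for (const EventRecord& ev : events_) {
            if (ev.thread != thread || ev.name != name) continue;
            if (ev.type == EventType::Enter) {
                entered = ev.timestamp_usec;
                open = true;
            } else if (ev.type == EventType::Exit && open) {
                total += ev.timestamp_usec - entered;
                open = false;
            }
        }
        return total;
    }

    std::size_t size() const { return events_.size(); }

private:
    std::vector<EventRecord> events_;
};
```

A profile keeps only the summed value per region, whereas the trace keeps every record, which is why the same behavior can be replayed from it afterwards.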

12
TAU Performance System Architecture
13
TAU Performance System Goals
  • Multi-level performance instrumentation
  • Multi-language automatic source instrumentation
  • Flexible and configurable performance measurement
  • Widely-ported parallel performance profiling
    system
  • Computer system architectures and operating
    systems
  • Different programming languages and compilers
  • Support for multiple parallel programming
    paradigms
  • Multi-threading, message passing, mixed-mode,
    hybrid
  • Support for performance mapping
  • Support for object-oriented and generic
    programming
  • Integration in complex software systems and
    applications

14
How To Use TAU?
  • Instrumentation
  • Application code and libraries
  • Selective instrumentation
  • Install, compile, and link with TAU measurement
    library
  • configure; make clean install
  • Multiple configurations for different
    measurements options
  • Does not require change in instrumentation
  • Selective measurement control
  • Execute experiments to produce performance data
  • Performance data generated at end or during
    execution
  • Use analysis tools to look at performance results

15
TAU Instrumentation Approach
  • Support for standard program events
  • Routines
  • Classes and templates
  • Statement-level blocks
  • Support for user-defined events
  • Begin/End events (user-defined timers)
  • Atomic events
  • Selection of event statistics
  • Support definition of semantic entities for
    mapping
  • Support for event groups
  • Instrumentation optimization

16
TAU Instrumentation
  • Flexible instrumentation mechanisms at multiple
    levels
  • Source code
  • manual
  • automatic
  • C, C++, F77/90 (Program Database Toolkit (PDT))
  • OpenMP (directive rewriting (Opari))
  • Object code
  • pre-instrumented libraries (e.g., MPI using PMPI)
  • statically-linked and dynamically-linked
  • fast breakpoints (compiler generated)
  • Executable code
  • dynamic instrumentation (pre-execution)
    (DynInstAPI)
  • virtual machine instrumentation (e.g., Java using
    JVMPI)

17
Multi-Level Instrumentation
  • Targets common measurement interface
  • TAU API
  • Multiple instrumentation interfaces
  • Simultaneously active
  • Information sharing between interfaces
  • Utilizes instrumentation knowledge between levels
  • Selective instrumentation
  • Available at each level
  • Cross-level selection
  • Targets a common performance model
  • Presents a unified view of execution
  • Consistent performance events

18
Program Database Toolkit (PDT)
  • Program code analysis framework
  • develop source-based tools
  • High-level interface to source code information
  • Integrated toolkit for source code parsing,
    database creation, and database query
  • Commercial grade front-end parsers
  • Portable IL analyzer, database format, and access
    API
  • Open software approach for tool development
  • Multiple source languages
  • Implement automatic performance instrumentation
    tools
  • tau_instrumentor

19
PDT Architecture and Tools
[Figure: PDT architecture. Application / Library source is parsed by the C / C++ parser or the Fortran 77/90 parser into IL; the C / C++ IL analyzer and the Fortran 77/90 IL analyzer produce Program Database Files, which are accessed through DUCTAPE by PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90 interoperability), and TAU_instr (automatic source instrumentation).]
20
PDT Components
  • Language front end
  • Edison Design Group (EDG): C, C++, Java
  • Mutek Solutions Ltd.: F77, F90
  • IL Analyzer
  • Processes intermediate language (IL) tree from
    front-end
  • Creates program database (PDB) formatted file
  • DUCTAPE (Bernd Mohr, FZJ/ZAM, Germany)
  • C++ program Database Utilities and Conversion
    Tools APplication Environment
  • Processes and merges PDB files
  • C++ library to access the PDB for PDT applications

21
Instrumentation Control
  • Selection of which performance events to observe
  • Could depend on scope, type, level of interest
  • Could depend on instrumentation overhead
  • How is selection supported in instrumentation
    system?
  • No choice
  • Include / exclude lists (TAU)
  • Environment variables
  • Static vs. dynamic
  • Controlling the instrumentation of small routines
  • High relative measurement overhead
  • Significant intrusion and possible perturbation

22
Selective Instrumentation
% tau_instrumentor
Usage: tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>]
       [-noinline] [-g groupname] [-i headerfile]
       [-c|-c++|-fortran] [-f <instr_req_file>]
For selective instrumentation, use the -f option

% cat selective.dat
# Selective instrumentation: specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be
# the only routines that are instrumented.
# To specify an include list (a list of routines that will be
# instrumented), remove the leading # to uncomment the following lines
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST
23
Overhead Analysis for Automatic Selection
  • Analyze the performance data to determine events
    with high (relative) overhead performance
    measurements
  • Create a select list for excluding those events
  • Rule grammar (used in tau_reduce tool)
  • [GroupName:] Field Operator Number
  • GroupName indicates rule applies to events in
    group
  • Field is an event metric attribute (from profile
    statistics)
  • numcalls, numsubs, percent, usec, cumusec, count,
    totalcount, stdev, usecs/call, counts/call
  • Operator is one of >, <, or =
  • Number is any number
  • Compound rules possible using & between simple
    rules

24
Example Rules
  • Exclude all events that are members of TAU_USER
    and use less than 1000 microseconds
    TAU_USER:usec < 1000
  • Exclude all events that have less than 1000
    microseconds and are called only once
    usec < 1000 & numcalls = 1
  • Exclude all events that have less than 1000
    usecs per call OR have a (total inclusive)
    percent less than 5
    usecs/call < 1000
    percent < 5
  • Scientific notation can be used

25
TAU Measurement
  • Performance information
  • Performance events
  • High-resolution timer library (real-time /
    virtual clocks)
  • General software counter library (user-defined
    events)
  • Hardware performance counters
  • PCL (Performance Counter Library) (ZAM, Germany)
  • PAPI (Performance API) (UTK, Ptools Consortium)
  • consistent, portable API
  • Organization
  • Node, context, thread levels
  • Profile groups for collective events (runtime
    selective)
  • Performance data mapping between software levels

26
TAU Measurement Options
  • Parallel profiling
  • Function-level, block-level, statement-level
  • Supports user-defined events
  • TAU parallel profile data stored during execution
  • Hardware counts values
  • Support for multiple counters
  • Support for callpath profiling
  • Tracing
  • All profile-level events
  • Inter-process communication events
  • Timestamp synchronization
  • Trace merging and format conversion

27
TAU Measurement System Configuration
  • configure [OPTIONS]
  • -c++=<CC>, -cc=<cc> Specify C++ and C
    compilers
  • -pthread, -sproc, -smarts Use pthread, SGI
    sproc, smarts threads
  • -openmp Use OpenMP threads
  • -opari=<dir> Specify location of Opari OpenMP
    tool
  • -papi=<dir>, -pcl=<dir> Specify location of PAPI or
    PCL
  • -pdt=<dir> Specify location of PDT
  • -mpiinc=<d>, -mpilib=<d> Specify MPI library
    instrumentation
  • -TRACE Generate TAU event traces
  • -PROFILE Generate TAU profiles
  • -PROFILECALLPATH Generate Callpath profiles
    (1-level)
  • -MULTIPLECOUNTERS Use more than one hardware
    counter
  • -CPUTIME Use usertime+system time
  • -PAPIWALLCLOCK Use PAPI to access wallclock time
  • -PAPIVIRTUAL Use PAPI for virtual (user) time

28
TAU Measurement API
  • Initialization and runtime configuration
  • TAU_PROFILE_INIT(argc, argv);
    TAU_PROFILE_SET_NODE(myNode);
    TAU_PROFILE_SET_CONTEXT(myContext);
    TAU_PROFILE_EXIT(message);
    TAU_REGISTER_THREAD();
  • Function and class methods
  • TAU_PROFILE(name, type, group);
  • Template
  • TAU_TYPE_STRING(variable, type);
    TAU_PROFILE(name, type, group);
    CT(variable);
  • User-defined timing
  • TAU_PROFILE_TIMER(timer, name, type, group);
    TAU_PROFILE_START(timer);
    TAU_PROFILE_STOP(timer);
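
What a start/stop timer pair has to record can be sketched without the TAU library. The class below is a hypothetical stand-in (`Profiler`, `TimerStats`, `start`, and `stop` are assumed names, not TAU macros) showing how inclusive and exclusive time follow from nested timers. The clock value is passed in by the caller so the sketch stays deterministic, and recursive use of the same timer name is not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Per-timer summary statistics, as a profiler would keep them.
struct TimerStats {
    std::uint64_t calls = 0;
    std::uint64_t inclusive = 0;  // time between start and stop
    std::uint64_t exclusive = 0;  // inclusive minus time in nested timers
};

class Profiler {
public:
    // 'now' is supplied by the caller; a real measurement system
    // would read a high-resolution clock here instead.
    void start(const std::string& name, std::uint64_t now) {
        stack_.push_back({name, now, 0});
    }

    void stop(std::uint64_t now) {
        assert(!stack_.empty());
        Frame f = stack_.back();
        stack_.pop_back();
        std::uint64_t incl = now - f.start;
        TimerStats& s = stats_[f.name];
        s.calls += 1;
        s.inclusive += incl;
        s.exclusive += incl - f.child_time;
        // Charge this timer's inclusive time to the enclosing timer.
        if (!stack_.empty()) stack_.back().child_time += incl;
    }

    const TimerStats& stats(const std::string& name) { return stats_[name]; }

private:
    struct Frame {
        std::string name;
        std::uint64_t start;
        std::uint64_t child_time;
    };
    std::vector<Frame> stack_;
    std::map<std::string, TimerStats> stats_;
};
```

With this accounting, a timer's exclusive time is what remains after subtracting the time spent in timers started inside it.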

29
TAU Measurement API (continued)
  • User-defined events
  • TAU_REGISTER_EVENT(variable, event_name);
    TAU_EVENT(variable, value);
    TAU_PROFILE_STMT(statement);
  • Mapping
  • TAU_MAPPING(statement, key);
    TAU_MAPPING_OBJECT(funcIdVar);
    TAU_MAPPING_LINK(funcIdVar, key);
  • TAU_MAPPING_PROFILE(funcIdVar);
    TAU_MAPPING_PROFILE_TIMER(timer, funcIdVar);
    TAU_MAPPING_PROFILE_START(timer);
    TAU_MAPPING_PROFILE_STOP(timer);
  • Reporting
  • TAU_REPORT_STATISTICS();
    TAU_REPORT_THREAD_STATISTICS();

30
Grouping Performance Data in TAU
  • Profile Groups
  • A group of related routines forms a profile group
  • Statically defined
  • TAU_DEFAULT, TAU_USER1-5, TAU_MESSAGE, TAU_IO, ...
  • Dynamically defined
  • group name based on string, such as "adlib" or
    "particles"
  • runtime lookup in a map to get unique group
    identifier
  • uses tau_instrumentor to instrument
  • Ability to change group names at runtime
  • Group-based instrumentation and measurement
    control

31
TAU Group Instrumentation Control API
  • Enabling Profile Groups
  • TAU_ENABLE_INSTRUMENTATION()
  • TAU_ENABLE_GROUP(TAU_GROUP)
  • TAU_ENABLE_GROUP_NAME("group name")
  • TAU_ENABLE_ALL_GROUPS()
  • Disabling Profile Groups
  • TAU_DISABLE_INSTRUMENTATION()
  • TAU_DISABLE_GROUP(TAU_GROUP)
  • TAU_DISABLE_GROUP_NAME("group name")
  • TAU_DISABLE_ALL_GROUPS()
  • Obtaining Profile Group Identifier
  • Runtime Switching of Profile Groups
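
Group-based control of this kind is often built on a bit mask plus a name-to-identifier map. The following C++ sketch shows that idea only; `GroupControl` and its member functions are hypothetical names, the 64-group limit is an assumption of the sketch, and nothing here is taken from the TAU sources.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

class GroupControl {
public:
    using GroupId = std::uint64_t;

    // Runtime lookup: return the identifier for a named group,
    // allocating the next free bit the first time the name is seen.
    GroupId get_group(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        assert(next_bit_ < 64);  // this sketch supports at most 64 groups
        GroupId id = GroupId{1} << next_bit_++;
        ids_[name] = id;
        return id;
    }

    void enable(GroupId g)  { mask_ |= g; }
    void disable(GroupId g) { mask_ &= ~g; }
    void enable_all()  { mask_ = ~GroupId{0}; }
    void disable_all() { mask_ = 0; }

    // Measurement code would check this before recording an event.
    bool is_enabled(GroupId g) const { return (mask_ & g) != 0; }

private:
    std::map<std::string, GroupId> ids_;
    unsigned next_bit_ = 0;
    GroupId mask_ = ~GroupId{0};  // everything enabled by default
};
```

Disabling all groups and then enabling a chosen few gives the effect of selecting groups by name when a run starts.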

32
TAU Pre-execution Control
  • Dynamic groups defined at file scope
  • Group names and group associations runtime
    modifiable
  • Controlling groups at pre-execution time
  • --profile <group1+group2+...+groupN> option
  • % tau_instrumentor app.pdb app.cpp \
      -o app.i.cpp -g particles
  • % mpirun -np 4 application \
      --profile particles+field+mesh+io
  • Examples
  • POOMA (LANL) uses static groups
  • VTF (Caltech) uses dynamic group in Python-based
    execution instrumentation control

33
Configuring TAU Measurement Library
  • Profiling with wallclock time (on a quad PIII
    Linux machine)
  • configure -mpiinc=/usr/local/packages/mpich/include
    -mpilib=/usr/local/packages/mpich/lib
    -pdt=/usr/pkg/pdtoolkit/ -useropt=-O2
    -LINUXTIMERS
  • Tracing
  • configure -mpiinc=/usr/local/packages/mpich/include
    -mpilib=/usr/local/packages/mpich/lib
    -pdt=/usr/pkg/pdtoolkit -useropt=-O2
    -LINUXTIMERS
  • Profiling with PAPI
  • configure -mpiinc=/usr/local/packages/mpich/include
    -mpilib=/usr/local/packages/mpich/lib
    -pdt=/usr/pkg/pdtoolkit/ -useropt=-O2
    -papi=/usr/local/packages/papi
  • setenv PAPI_EVENT PAPI_FP_INS
  • setenv PAPI_EVENT PAPI_L1_DCM

34
Compiling with TAU Makefiles
  • Include TAU Stub Makefile (<arch>/lib) in the
    user's Makefile
  • Variables
  • TAU_CXX: Specify the C++ compiler used by TAU
  • TAU_CC, TAU_F90: Specify the C, F90 compilers
  • TAU_DEFS: Defines used by TAU. Add to CFLAGS
  • TAU_LDFLAGS: Linker options. Add to LDFLAGS
  • TAU_INCLUDE: Header files include path. Add to
    CFLAGS
  • TAU_LIBS: Statically linked TAU library. Add to
    LIBS
  • TAU_SHLIBS: Dynamically linked TAU library
  • TAU_MPI_LIBS: TAU's MPI wrapper library for C/C++
  • TAU_MPI_FLIBS: TAU's MPI wrapper library for F90
  • TAU_FORTRANLIBS: Must be linked in with C++ linker
    for F90
  • TAU_DISABLE: TAU's dummy F90 stub library

35
TAU Analysis
  • Parallel profile analysis
  • Pprof
  • parallel profiler with text-based display
  • Racy
  • graphical interface to pprof (Tcl/Tk)
  • paraprof
  • Java implementation of Racy
  • Trace analysis and visualization
  • Trace merging and clock adjustment (if necessary)
  • Trace format conversion (ALOG, SDDF, VTF,
    Paraver)
  • Trace visualization using Vampir (Pallas)

36
Pprof Command
  • pprof [-c|-b|-m|-t|-e|-i] [-r] [-s] [-n num]
    [-f file] [-l] [nodes]
  • -c Sort according to number of calls
  • -b Sort according to number of subroutines called
  • -m Sort according to msecs (exclusive time total)
  • -t Sort according to total msecs (inclusive time
    total)
  • -e Sort according to exclusive time per call
  • -i Sort according to inclusive time per call
  • -v Sort according to standard deviation
    (exclusive usec)
  • -r Reverse sorting order
  • -s Print only summary profile information
  • -n num Print only first number of functions
  • -f file Specify full path and filename without
    node ids
  • -l List all functions and exit
  • nodes Print only info about all contexts/threads
    of given node numbers

37
Pprof Output (NAS Parallel Benchmark LU)
  • Intel QuadPIII Xeon
  • F90 MPICH
  • Profile - Node - Context - Thread
  • Events - code - MPI

38
Paraprof (NAS Parallel Benchmark LU)
Routine profile across all nodes
n: node, c: context, t: thread
Global profiles
Event legend
Individual profile
39
Paraprof Profile Browser
40
Paraprof Profile Browser Main Window
41
Paraprof Profile Browser Node Window
42
Paraprof Profile Browser (Derived Metrics)
43
Paraprof Profile Browser Routine Window
44
TAU + PAPI (NAS Parallel Benchmark LU)
  • Floating point operations
  • Re-link to alternate library
  • Can use multiple counter support

45
TAU + Vampir (NAS Parallel Benchmark LU)
Callgraph display
Timeline display
Parallelism display
Communications display
46
tau_reduce Example
  • tau_reduce implements overhead reduction in TAU
  • Consider klargest example
  • Find kth largest element in N elements
  • Compare two methods: quicksort,
    select_kth_largest
  • Un-instrumented testcase: i = 2324, N = 1000000
  • quicksort (wall clock): 0.188511 secs
  • select_kth_largest (wall clock): 0.149594 secs
  • Total (PIII/1.2GHz time): 0.340u 0.020s 0:00.37
  • Execute with all routines instrumented
  • Execute with rule-based selective instrumentation
  • usec>1000 & numcalls>400000 & usecs/call<30 &
    percent>25

47
Simple sorting example on one processor
Before selective instrumentation reduction
NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time   Exclusive   Inclusive       #Call      #Subrs Inclusive Name
             msec  total msec                         usec/call
---------------------------------------------------------------------------------------
100.0          13       4,982           1           4    4982030 int main
 93.5       3,223       4,659 4.20241E+06 1.40268E+07          1 void quicksort
 62.9     0.00481       3,134           5           5     626839 int kth_largest_qs
 36.4         137       1,813          28      450057      64769 int select_kth_largest
 33.6         150       1,675      449978      449978          4 void sort_5elements
 28.8       1,435       1,435 1.02744E+07           0          0 void interchange
  0.4          20          20           1           0      20668 void setup
  0.0      0.0118      0.0118          49           0          0 int ceil

After selective instrumentation reduction
NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time   Exclusive   Inclusive       #Call      #Subrs Inclusive Name
             msec  total msec                         usec/call
---------------------------------------------------------------------------------------
100.0          14         383           1           4     383333 int main
 50.9         195         195           5           0      39017 int kth_largest_qs
 40.0         153         153          28          79       5478 int select_kth_largest
  5.4          20          20           1           0      20611 void setup
  0.0        0.02        0.02          49           0          0 int ceil
48
TAU Performance System Status
  • Computing platforms
  • IBM SP / Power4, SGI Origin 2K/3K, ASCI Red, Cray
    T3E / SV-1 (X-1 planned), HP (Compaq) SC (Tru64),
    HP Superdome (HP-UX), Sun, Hitachi SR8000, NEC
    SX-5 (SX-6 underway), Linux clusters (IA-32/64,
    Alpha, PPC, PA-RISC, Power), Apple (OS X),
    Windows
  • Programming languages
  • C, C++, Fortran 77, F90, HPF, Java, OpenMP,
    Python
  • Communication libraries
  • MPI, PVM, Nexus, shmem, Tulip, ACLMPL, MPIJava
  • Thread libraries
  • pthreads, SGI sproc, Java, Windows, OpenMP, SMARTS

49
TAU Performance System Status (continued)
  • Compilers
  • Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun,
    Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC,
    Intel
  • Application libraries (selected)
  • Blitz++, A++/P++, PETSc, SAMRAI, Overture, PAWS
  • Application frameworks (selected)
  • POOMA, MC++, Conejo, Uintah, VTF, UPS, GrACE
  • Performance projects using TAU
  • Aurora / SCALEA ACPC, University of Vienna
  • TAU full distribution (Version 2.12, web
    download)
  • TAU performance system toolkit and users guide
  • Automatic software installation and examples

50
PDT Status
  • Program Database Toolkit (Version 2.2, web
    download)
  • EDG C++ front end (Version 2.45.2)
  • Mutek Fortran 90 front end (Version 2.4.1)
  • C++ and Fortran 90 IL Analyzer
  • DUCTAPE library
  • Standard C++ system header files (KCC Version
    4.0f)
  • PDT-constructed tools
  • TAU instrumentor (C/C++/F90)
  • Program analysis support for SILOON and CHASM
  • Platforms
  • Same as for TAU with a few exceptions

51
Performance Mapping
  • High-level semantic abstractions
  • Associate performance measurements
  • Performance mapping
  • performance measurement system support to assign
    data correctly

52
Semantic Entities/Attributes/Associations
  • New dynamic mapping scheme (SEAA)
  • Contrast with ParaMap (Miller and Irvin)
  • Entities defined at any level of abstraction
  • Attribute entity with semantic information
  • Entity-to-entity associations
  • Two association types (implemented in TAU API)
  • Embedded: extends associated object to store
    performance measurement entity
  • External: creates an external look-up table
    using address of object as key to locate
    performance measurement entity
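
The two association types can be contrasted in a short C++ sketch. It is hypothetical throughout: `PerfEntity`, `ExternalMap`, and the `Particle` layout are illustrative names rather than TAU's API, and the external table assumes an object's address stays valid and is not reused while it is linked.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Performance measurement entity attached to a semantic entity.
struct PerfEntity {
    std::string label;
    std::uint64_t total_usec = 0;
    std::uint64_t count = 0;
};

// External association: a look-up table keyed by the object's address.
// The application object itself is not modified.
class ExternalMap {
public:
    void link(const void* object, const std::string& label) {
        table_[object].label = label;
    }

    // Returns nullptr when the object has no association.
    PerfEntity* find(const void* object) {
        auto it = table_.find(object);
        return it == table_.end() ? nullptr : &it->second;
    }

    // Charge a measured duration to the entity linked to 'object'.
    bool charge(const void* object, std::uint64_t usec) {
        PerfEntity* e = find(object);
        if (!e) return false;
        e->total_usec += usec;
        e->count += 1;
        return true;
    }

private:
    std::unordered_map<const void*, PerfEntity> table_;
};

// Embedded association, for contrast: the object is extended with a
// pointer to its measurement entity, so no look-up is needed.
struct Particle {
    double x = 0, y = 0, z = 0;
    PerfEntity* perf = nullptr;
};
```

The embedded form avoids the table look-up but requires changing the object's type; the external form leaves the type alone at the cost of a hash look-up per measurement.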

53
Hypothetical Mapping Example
  • Particles distributed on surfaces of a cube

Particle* P[MAX]; /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face=0, last=0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i=last; i < particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face);
      ...
    }
    last += particles_on_this_face;
  }
}
54
Hypothetical Mapping Example (continued)
int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles();
  /* create a list of particles */
  for (int i = 0; i < N; i++)
    /* iterates over the list */
    ProcessParticle(P[i]);
}

work packets

engine
  • How much time is spent processing face i
    particles?
  • What is the distribution of performance among
    faces?

55
No Performance Mapping versus Mapping
  • Typical performance tools report performance with
    respect to routines
  • Does not provide support for mapping
  • Performance tools with SEAA mapping can observe
    performance with respect to the scientist's
    programming and problem abstractions

TAU (w/ mapping)
TAU (no mapping)
56
Performance Mapping in Callpath Profiling
  • Consider callgraph (callpath) profiling
  • Measure time (metric) along an edge (path) of
    callgraph
  • Incident edge gives parent / child view
  • Edge sequence (path) gives parent / descendant
    view
  • Callpath profiling when callgraph is unknown
  • Must determine callgraph dynamically at runtime
  • Map performance measurement to dynamic call path
    state
  • Callpath levels
  • 0-level: current callgraph node
  • 1-level: immediate parent (descendant)
  • k-level: kth calling parent (call descendant)

57
1-Level Callpath Implementation in TAU
  • TAU maintains a performance event (routine)
    callstack
  • Profiled routine (child) looks in callstack for
    parent
  • Previous profiled performance event is the parent
  • A callpath profile structure created first time
    parent calls
  • TAU records parent in a callgraph map for child
  • String representing 1-level callpath used as its
    key
  • "a( ) => b( )": name for time spent in "b" when
    called by "a"
  • Map returns pointer to callpath profile structure
  • 1-level callpath is profiled using this profiling
    data
  • Build upon TAU's performance mapping technology
  • Measurement is independent of instrumentation
  • Use -PROFILECALLPATH to configure TAU
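
The steps on this slide (an event callstack, a parent look-up, and a map keyed by a string naming the parent and child) can be sketched as follows. This is a simplified illustration, not TAU's code: `CallpathProfiler`, `CallpathProfile`, `enter`, and `exit` are hypothetical names, the key is written as "parent => child", and timestamps are passed in by the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct CallpathProfile {
    std::uint64_t calls = 0;
    std::uint64_t inclusive = 0;
};

class CallpathProfiler {
public:
    void enter(const std::string& routine, std::uint64_t now) {
        // The previous profiled event on the callstack is the parent.
        std::string key = stack_.empty()
            ? routine
            : stack_.back().name + " => " + routine;
        // operator[] creates the profile structure the first time
        // this parent calls this child.
        CallpathProfile* prof = &paths_[key];
        stack_.push_back({routine, now, prof});
    }

    void exit(std::uint64_t now) {
        assert(!stack_.empty());
        Frame f = stack_.back();
        stack_.pop_back();
        f.profile->calls += 1;
        f.profile->inclusive += now - f.start;
    }

    // Returns nullptr if the path was never executed.
    const CallpathProfile* find(const std::string& key) const {
        auto it = paths_.find(key);
        return it == paths_.end() ? nullptr : &it->second;
    }

private:
    struct Frame {
        std::string name;
        std::uint64_t start;
        CallpathProfile* profile;
    };
    std::vector<Frame> stack_;
    // std::map keeps element addresses stable across insertions,
    // so the pointers stored in frames remain valid.
    std::map<std::string, CallpathProfile> paths_;
};
```

Time spent in the same routine is thereby split by caller, so "a() => b()" and "c() => b()" accumulate separately.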

58
Callpath Profiling Example (NAS LU v2.3)
  • configure -PROFILECALLPATH -SGITIMERS
    -arch=sgi64 -mpiinc=/usr/include
    -mpilib=/usr/lib64 -useropt=-O2

59
Callpath Parallel Profile Display
  • 0-level and 1-level callpath grouping

1-Level Callpath
0-Level Callpath
60
Strategies for Empirical Performance Evaluation
  • Empirical performance evaluation as a series of
    performance experiments
  • Experiment trials describing instrumentation and
    measurement requirements
  • Where/When/How axes of empirical performance
    space
  • where are performance measurements made in
    program
  • when is performance instrumentation done
  • how are performance measurement/instrumentation
    chosen
  • Strategies for achieving flexibility and
    portability goals
  • Limited performance methods restrict evaluation
    scope
  • Non-portable methods force use of different
    techniques
  • Integration and combination of strategies

61
Case Study: SIMPLE Performance Analysis
  • SIMPLE hydrodynamics benchmark
  • C code with MPI message communication
  • Multiple instrumentation methods
  • source-to-source translation (PDT)
  • MPI wrapper library level instrumentation (PMPI)
  • pre-execution binary instrumentation (DyninstAPI)
  • Alternative measurement strategies
  • statistical profiles of software actions
  • statistical profiles of hardware actions (PCL,
    PAPI)
  • program event tracing
  • choice of time source
  • gettimeofday, high-res physical, CPU, process
    virtual

62
SIMPLE Source Instrumentation (Preprocessed)
  • PDT automatically generates instrumentation code
  • names events with full function signatures
  • Similarly for all other routines in SIMPLE program

int compute_heat_conduction(double theta_hat[X][Y], double deltat,
    double new_r[X][Y], double new_z[X][Y], double new_alpha[X][Y],
    double new_rho[X][Y], double theta_l[X][Y], double Gamma_k[X][Y],
    double Gamma_l[X][Y])
{
  TAU_PROFILE("int compute_heat_conduction( double (*)[259], double, double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259])", " ", TAU_USER);
  ...
}
63
MPI Library Instrumentation (MPI_Send)
  • Uses MPI profiling interposition library (PMPI)

int MPI_Send(...)
...
{
  int returnVal, typesize;
  TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  if (dest != MPI_PROC_NULL) {
    PMPI_Type_size(datatype, &typesize);
    TAU_TRACE_SENDMSG(tag, dest, typesize*count);
  }
  returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}
64
MPI Library Instrumentation (MPI_Recv)
int MPI_Recv(...)
...
{
  int returnVal, size;
  TAU_PROFILE_TIMER(tautimer, "MPI_Recv()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  returnVal = PMPI_Recv(buf, count, datatype, src, tag, comm, status);
  if (src != MPI_PROC_NULL && returnVal == MPI_SUCCESS) {
    PMPI_Get_count(status, MPI_BYTE, &size);
    TAU_TRACE_RECVMSG(status->MPI_TAG, status->MPI_SOURCE, size);
  }
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}

65
Multi-Level Instrumentation (Profiling)
four processes
event legend
Profile per process
global profile
66
Multi-Level Instrumentation (Tracing)
  • Relink with TAU library configured for tracing
  • No modification of source instrumentation
    required!

TAU performance groups
67
Dynamic Instrumentation of SIMPLE
  • Uses DynInstAPI for runtime code patching
  • Mutator loads measurement library, instruments
    mutatee
  • One mutator (tau_run) per executable image
  • mpirun -np <n> tau.shell

68
Case Study: PETSc v2.1.3 (ANL)
  • Portable, Extensible Toolkit for Scientific
    Computation
  • Scalable (parallel) PDE framework
  • Suite of data structures and routines (374,458
    code lines)
  • Solution of scientific applications modeled by
    PDEs
  • Parallel implementation
  • MPI used for inter-process communication
  • TAU instrumentation
  • PDT for C/C++ source instrumentation (100%, no
    manual)
  • MPI wrapper interposition library instrumentation
  • Example
  • Linear system of equations (Ax=b) (SLES) (ex2
    test case)
  • Non-linear system of equations (SNES) (ex19 test
    case)

69
PETSc ex2 (Profile - wallclock time)
Sorted with respect to exclusive time
70
PETSc ex2 (Profile - overall and message counts)
  • Observe load balance
  • Track messages

Capture with user-defined events
71
PETSc ex2 (Profile - percentages and time)
  • View per-thread performance on individual routines

72
PETSc ex2 (Trace)
73
PETSc ex19
  • Non-linear solver (SNES)
  • 2-D driven cavity code
  • Uses velocity-vorticity formulation
  • Finite difference discretization on a structured
    grid
  • Problem size and measurements
  • 56x56 mesh size on quad Pentium III (550 MHz,
    Linux)
  • Executes for approximately one minute
  • MPI wrapper interposition library
  • PDT (tau_instrumentor)
  • Selective instrumentation (tau_reduce)
  • three routines identified with high
    instrumentation overhead

74
PETSc ex19 (Profile - wallclock time)
Sorted by inclusive time
Sorted by exclusive time
75
PETSc ex19 (Profile - overall and percentages)
76
PETSc ex19 (Tracing)
Commonly seen communication behavior
77
PETSc ex19 (Tracing - callgraph)
78
PETSc ex19 (PAPI_FP_INS, PAPI_L1_DCM)
  • Uses multiple counter profile measurement

PAPI_FP_INS
PAPI_L1_DCM
79
Case Study: Mixed-mode Parallel Programs
  • Portable mixed-mode parallel programming
  • Multi-threaded shared memory programming
  • Inter-node message passing
  • Performance measurement
  • Access to runtime system and communication events
  • Associate communication and application events
  • 2-Dimensional Stommel model of ocean circulation
  • OpenMP for shared memory parallel programming
  • MPI for cross-box message-based parallelism
  • Jacobi iteration, 5-point stencil
  • Timothy Kaiser (San Diego Supercomputing Center)

80
Stommel Instrumentation
  • OpenMP directive instrumentation (uses OPARI)

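/* OPARI rewrites the OpenMP loop: POMP calls bracket the construct,
   and the implicit barrier is made explicit so it can be measured. */
pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) \
        firstprivate(a1,a2,a3,a4,a5) nowait
for (i = i1; i <= i2; i++) {
  for (j = j1; j <= j2; j++) {
    new_psi[i][j] = a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
                  + a4*psi[i][j-1] - a5*the_for[i][j];
    diff = diff + fabs(new_psi[i][j] - psi[i][j]);
  }
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"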
81
OpenMP + MPI Ocean Modeling (Trace)
Thread-paired message passing
Integrated OpenMP + MPI events
82
OpenMP + MPI Ocean Modeling (HW Profile)
configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc
    -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/lib
Integrated OpenMP + MPI events
FP instructions
83
Case Study: C++ and Performance Mapping
  • Object-oriented programming
  • abstract data types, encapsulation, inheritance, ...
  • Domain-specific abstractions
  • Implemented by OO languages in form of class
    libraries
  • Generic programming mechanisms
  • efficient coding abstractions, compile-time
    transformations
  • Creates a semantic gap between the transformed
    code and what the user expects (as described in
    source code)
  • Need a mechanism to expose the nature of
    high-level abstract computation to the
    performance tools
  • Map low-level performance data to high-level
    semantics

84
C++ Template Instrumentation (Blitz++, PETE)
  • High-level objects
  • Array classes
  • Templates (Blitz++)
  • Optimizations
  • Array processing
  • Expressions (PETE)
  • Relate performance data to high-level statement
  • Complexity of template evaluation

Array expressions
85
Standard Template Instrumentation Difficulties
  • Instantiated templates result in mangled
    identifiers
  • Standard profiling techniques / tools are
    deficient
  • Integrated with proprietary compilers
  • Specific systems platforms and programming models

Uninterpretable routine names
Very long!
86
Blitz++ Library Instrumentation
  • Expression templates
  • embed the form of the expression in a template
    name
  • Blitz++ describes structure of the expression
    template
  • Present as pretty printed name to the profiling
    toolkit
  • Create performance event associated with
    expression type

Expression: B + C - 2.0 * D

BinOp<Add, B, <BinOp<Subtract, C,
<BinOp<Multiply, Scalar<2.0>, D>>>
(Figure: expression tree with operators +, -, * and operands B, C, 2.0, D)
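How a template type can carry the form of an expression, and how a readable event name can be derived from it, is sketched below. This is a simplified illustration and not Blitz++ code: the operand tags (ArrayB, ArrayC, ArrayD, Scalar2), the BinOp layout, and event_name are assumptions made for the example.

```cpp
#include <cassert>
#include <string>

// Operand tags: each names one array or scalar in the expression.
struct ArrayB  { static std::string name() { return "B"; } };
struct ArrayC  { static std::string name() { return "C"; } };
struct ArrayD  { static std::string name() { return "D"; } };
struct Scalar2 { static std::string name() { return "2.0"; } };

// Operator tags.
struct Add      { static std::string symbol() { return "+"; } };
struct Subtract { static std::string symbol() { return "-"; } };
struct Multiply { static std::string symbol() { return "*"; } };

// The expression's structure lives entirely in the type; name()
// walks that structure and pretty-prints it.
template <typename Op, typename Lhs, typename Rhs>
struct BinOp {
    static std::string name() {
        return "(" + Lhs::name() + " " + Op::symbol() + " " +
               Rhs::name() + ")";
    }
};

// Build the event name once per expression type and reuse it afterwards.
template <typename Expr>
const std::string& event_name() {
    static const std::string cached = Expr::name();
    assert(!cached.empty());
    return cached;
}

// Same nesting as the template shown on the slide.
using Example =
    BinOp<Add, ArrayB, BinOp<Subtract, ArrayC, BinOp<Multiply, Scalar2, ArrayD>>>;
```

Here event_name<Example>() yields "(B + (C - (2.0 * D)))", a label a profile can display in place of the mangled template identifier.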
87
Blitz++ Library Instrumentation (example)
#ifdef BZ_TAU_PROFILING
  static string exprDescription;
  if (!exprDescription.length()) {      // build the name only once
    exprDescription = "A";
    prettyPrintFormat format(_bz_true); // terse mode on
    format.nextArrayOperandSymbol();
    T_update::prettyPrint(exprDescription);
    expr.prettyPrint(exprDescription, format);
  }
  TAU_PROFILE(" ", exprDescription, TAU_BLITZ);
#endif

exprDescription is the event name
88
TAU Instrumentation and Profiling for C++
Profile of expression types
Performance data presented with respect to
high-level array expression types
89
Case Study: C-SAFE / Uintah
  • Center for Simulation of Accidental Fires and
    Explosions
  • ASCI ASAP Level 1 center, University of Utah
  • PSE for multi-model simulation of high-energy
    explosions
  • Coupled non-linear solvers, optimization,
    computational steering, visualization, and
    experimental data verification
  • Very large-scale simulations
  • Computer science problems
  • Coupling of multiple simulation codes
  • Software engineering across diverse expert teams
  • Achieving high performance on large-scale systems

90
Example C-SAFE Simulation Problems
Heptane fire simulation
Typical C-SAFE simulation with a billion degrees
of freedom and non-linear time dynamics
Material stress simulation
91
Uintah Computational Framework (UCF)
  • Execution model based on software (macro)
    dataflow
  • Exposes parallelism and hides data transport
    latency
  • Computations expressed as directed acyclic graphs
    of tasks
  • consumes input and produces output (input to
    future task)
  • input/outputs specified for each patch in a
    structured grid
  • Abstraction of global single-assignment memory
  • DataWarehouse
  • Directory mapping names to values (array
    structured)
  • Write value once then communicate to awaiting
    tasks
  • Task graph gets mapped to processing resources
  • Communications schedule approximates global
    optimal
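Running tasks from a directed acyclic graph means a task may start only after every task producing its input has finished. The sketch below shows one standard way to derive such an order (Kahn's topological sort). It is not the Uintah scheduler; the function name and the integer task ids are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// deps[t] lists the tasks whose output task t consumes.
// Returns an order in which every task follows all of its producers,
// or an empty vector if the graph contains a cycle.
std::vector<std::size_t> schedule_tasks(
        const std::vector<std::vector<std::size_t>>& deps) {
    const std::size_t n = deps.size();
    std::vector<std::size_t> pending(n, 0);            // unmet inputs per task
    std::vector<std::vector<std::size_t>> consumers(n);
    for (std::size_t t = 0; t < n; ++t) {
        for (std::size_t producer : deps[t]) {
            assert(producer < n);
            consumers[producer].push_back(t);
            ++pending[t];
        }
    }
    std::queue<std::size_t> ready;
    for (std::size_t t = 0; t < n; ++t) {
        if (pending[t] == 0) ready.push(t);
    }
    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t t = ready.front();
        ready.pop();
        order.push_back(t);
        // Completing t satisfies one input of each consumer.
        for (std::size_t c : consumers[t]) {
            if (--pending[c] == 0) ready.push(c);
        }
    }
    if (order.size() != n) order.clear();  // cycle: not a valid task graph
    return order;
}
```

In a parallel setting, all tasks sitting in the ready queue at the same time are independent and could run concurrently.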

92
Performance Technology Integration
  • Uintah presents challenges to performance
    integration
  • Software diversity and structure
  • UCF middleware, simulation code modules
  • component-based hierarchy
  • Portability objectives
  • cross-language and cross-platform
  • multi-parallelism: thread, message passing, mixed
  • Scalability objectives
  • High-level programming and execution abstractions
  • Requires flexible and robust performance
    technology
  • Requires support for performance mapping

93
Task Execution in Uintah Parallel Scheduler
  • Profile methods and functions in scheduler and in
    MPI library

Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)
  • Need to map performance data!

94
Uintah Task Performance Mapping
  • Uintah partitions individual particles across
    processing elements (processes or threads)
  • Simulation tasks in task graph work on particles
  • Tasks have domain-specific character in the
    computation
  • interpolate particles to grid in Material Point
    Method
  • Task instances generated for each partitioned
    particle set
  • Execution scheduled with respect to task
    dependencies
  • How to attribute execution time among different
    tasks?
  • Assign semantic name (task type) to a task
    instance
  • SerialMPM::interpolateParticleToGrid
  • Map TAU timer object to (abstract) task (semantic
    entity)
  • Look up timer object using task type (semantic
    attribute)
  • Further partition along different domain-specific
    axes
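The lookup step, one timer per task type shared by all instances of that type, can be sketched as a keyed registry. This is an illustration under assumptions: TaskTimer, TimerRegistry, and run_mapped are hypothetical names and do not reproduce the TAU mapping API.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

// Accumulated measurements for one task type (hypothetical layout).
struct TaskTimer {
    long long calls = 0;
    double seconds = 0.0;
};

// Maps a semantic name (task type) to its timer object.
class TimerRegistry {
public:
    // Find the timer for a task type, creating it on first use.
    TaskTimer& lookup(const std::string& task_type) {
        return timers_[task_type];
    }
    std::size_t size() const { return timers_.size(); }

private:
    std::map<std::string, TaskTimer> timers_;
};

// Execute one task instance and charge its elapsed time to the timer
// of its task type rather than to the generic scheduler routine.
template <typename Fn>
void run_mapped(TimerRegistry& registry, const std::string& task_type,
                Fn&& doit) {
    assert(!task_type.empty());
    TaskTimer& timer = registry.lookup(task_type);
    const auto start = std::chrono::steady_clock::now();
    doit();
    const auto stop = std::chrono::steady_clock::now();
    timer.calls += 1;
    timer.seconds += std::chrono::duration<double>(stop - start).count();
}
```

Every instance created for a partitioned particle set would pass the same task type string, so the profile reports one entry per task type instead of a single entry for the scheduler's execute routine.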

95
Mapping Instrumentation in UCF (example)
  • Use TAU performance mapping API

void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw,
                           DataWarehouseP & dw ) {
  ...
  TAU_MAPPING_CREATE( task->getName(), "MPIScheduler::execute()",
                      (TauGroup_t)(void*)task->getName(),
                      task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer,(TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler,0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
96
Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
97
Work Packet to Task Mapping (Trace)
Work packet computation events colored by task
type
Distinct phases of computation can be identified
based on task
98
Comparing Uintah Traces for Scalability Analysis
32 processes
99
Online Performance Analysis for C-SAFE Apps
SCIRun (Univ. of Utah)
Performance Visualizer
Application
// performance data streams
TAU Performance System
Performance Analyzer
// performance data output
accumulated samples
Performance Data Reader
Performance Data Integrator
file system
sample sequencing reader synchronization
100
2D Field Performance Visualization in SCIRun
SCIRun program
101
Uintah Computational Framework (UCF)
  • UCF analysis
  • Scheduling
  • MPI library
  • Components
  • 500 processes
  • Online and offline visualization
  • Performance steering
  • use SCIRun support

102
Case Study: SAMRAI (LLNL)
  • Structured Adaptive Mesh Refinement Application
    Infrastructure (SAMRAI)
  • Programming
  • C++ and MPI
  • SPMD
  • Instrumentation
  • PDT for automatic instrumentation of routines
  • MPI interposition wrappers
  • SAMRAI timers for interesting code segments
  • timers classified in groups (apps, mesh, ...)
  • timer groups are managed by TAU groups

103
SAMRAI (Profile)
  • Euler (2D)

routine name
return type
104
SAMRAI Euler (Profile)
105
SAMRAI Euler (Trace)
106
Case Study: EVH1
  • Enhanced Virginia Hydrodynamics 1 (EVH1)
  • "TeraScale Simulations of Neutrino-Driven
    Supernovae and Their Nucleosynthesis" SciDAC
    project
  • Configured to run a simulation of the
    Sedov-Taylor blast wave solution in 2D spherical
    geometry
  • Performance study found EVH1 communication bound
    for more than 64 processors
  • Predominant routine (>50% of execution time) at
    this scale is MPI_ALLTOALL
  • Used in matrix transpose-like operations

107
EVH1 Execution Profile
108
EVH1 Execution Trace
MPI_Alltoall is an execution bottleneck
109
TAU Integration (Selected)
  • SAMRAI (LLNL)
  • Overture (LLNL)
  • C-SAFE (ASCI ASAP)
  • VTF (ASCI ASAP)
  • SAGE (ASCI LANL)
  • POOMA, POOMA-II (LANL, Code Sourcery)
  • PETSc (ANL)
  • CCA (DOE SciDAC)
  • GrACE (Rutgers)
  • Aurora / SCALEA (University of Vienna)

110
Work in Progress
  • Trace visualization
  • Event traces with counters (Vampir 3.0 will
    visualize)
  • EPILOG trace conversion
  • Runtime performance monitoring and analysis
  • Online performance data access
  • Performance analysis and visualization in SCIRun
  • Performance Database Framework
  • XML parallel profile representation of TAU
    profiles
  • PostgreSQL performance database
  • Next-generation PDT
  • Performance analysis for component software (CCA)

111
Concluding Remarks
  • Complex software and parallel computing systems
    pose challenging performance analysis problems
    that require robust methodologies and tools
  • To build more sophisticated performance tools,
    existing proven performance technology must be
    utilized
  • Performance tools must be integrated with
    software and systems models and technology
  • Performance engineered software
  • Function consistently and coherently in software
    and system environments
  • TAU performance system offers robust performance
    technology that can be broadly integrated so
    USE IT!

112
Acknowledgements
  • Department of Energy (DOE)
  • MICS office
  • DOE 2000 ACTS contract
  • Performance Technology for Tera-class Parallel
    Computer Systems: Evolution of the TAU
    Performance System
  • PERC SciDAC project affiliate
  • University of Utah DOE ASCI Level 1 sub-contract
  • DOE ASCI Level 3 (LANL, LLNL)
  • NSF National Young Investigator (NYI) award
  • Research Centre Juelich
  • John von Neumann Institute for Computing
  • Dr. Bernd Mohr
  • Los Alamos National Laboratory

113
Information
  • TAU (http://www.acl.lanl.gov/tau)
  • PDT (http://www.acl.lanl.gov/pdtoolkit)
  • PAPI (http://icl.cs.utk.edu/projects/papi/)
  • OPARI (http://www.fz-juelich.de/zam/kojak/)