1
Interoperable Performance Tools
Nikhil Bhatia, Fengguang Song, Felix Wolf
University of Tennessee, Innovative Computing Laboratory
  • Bernd Mohr
  • Forschungszentrum Jülich
  • John von Neumann-Institut für Computing

2
Outline
  • KOJAK
  • Recent extensions
  • Intermediate conclusion
  • Two new interoperable components
  • CUBE
  • CONE
  • Future directions

3
Low-Level View of Performance Behavior
4
KOJAK
  • Automatic performance analysis
  • Take event traces of MPI/OpenMP applications
  • Search for execution patterns
  • Calculate mapping
  • (problem, call path, location) → time
  • Display in performance browser
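As a rough illustration, this mapping can be thought of as nested associative containers; a minimal C++ sketch with hypothetical type names (not KOJAK's actual data structures):

    #include <map>
    #include <string>

    // Hypothetical sketch: the analysis maps each (problem, call path,
    // location) tuple to the time lost to that problem (its "severity").
    using Problem  = std::string;   // e.g. "Late Sender"
    using CallPath = std::string;   // e.g. "main/solve/MPI_Recv"
    using Location = int;           // process or thread id

    std::map<Problem, std::map<CallPath, std::map<Location, double>>> severity;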

5
KOJAK Architecture
  • Instrumentation
  • Inserting extra code to generate trace
  • Abstraction
  • Abstract representation of event trace
  • Precomputed relationships
  • Simplified specification of performance
    properties
  • Easy to extend
  • Analysis
  • Automatic classification and quantification of
    performance behavior
  • Presentation
  • Navigating / browsing through performance space
  • Can be combined with time-line display

[Figure: KOJAK layers, bottom to top: Instrumentation → Abstraction → Analysis → Presentation]
6
KOJAK Architecture (2)
[Figure: KOJAK tool chain]
  • Semiautomatic instrumentation: source code → OPARI / TAU → instrumented source code → compiler / linker (with POMP/PMPI, EPILOG, and PAPI libraries) → executable → run (optionally through DPCL) → EPILOG trace file
  • Automatic analysis: EPILOG trace file → EARL → EXPERT analyzer → analysis report → EXPERT presenter
  • Manual analysis: EPILOG trace file → trace converter → VTF3 trace file → VAMPIR
7
Parallelism vs. CPU and Memory Performance
  • Interaction among different processes and threads?
  • How do my processes and threads perform
    individually?
  • CPU performance
  • Memory performance
  • Integration of these performance aspects?
  • Specification of parallelism-related properties
  • Temporal and spatial relationships between
    run-time events
  • Specification of CPU and memory-related
    properties
  • Hardware counters

8
CPU & Memory Performance in KOJAK
  • Event model & trace format
  • Predefined and user-defined system metrics
  • Including but not limited to hardware counters
  • Metric values as part of ENTER / EXIT records
  • Flexible interval semantics
  • Run-time system
  • Hardware-counter access with PAPI
  • Portable access to hardware counters on most
    platforms
  • Abstraction layer
  • Additional event attributes
  • Attribute name as defined in trace file
  • print event[L1_D_CACHE]
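To illustrate the interval semantics, here is a minimal C++ sketch (hypothetical record layout, not the actual EPILOG format) of deriving a per-region counter delta from a matching ENTER / EXIT pair:

    #include <cstdint>

    // Hypothetical sketch: hardware-counter values travel with the
    // ENTER / EXIT records; the difference between the EXIT and ENTER
    // value of one region instance is the number of events (e.g. L1
    // data cache misses) that occurred inside the region.
    struct EnterExitRecord {
        double   time;     // timestamp of the event
        uint64_t counter;  // counter value sampled at this event
    };

    uint64_t events_in_region(const EnterExitRecord& enter,
                              const EnterExitRecord& exit) {
        return exit.counter - enter.counter;
    }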

9
CPU & Memory Performance in KOJAK (2)
  • Analysis
  • Identifies tuples (call path, thread) whose
    occurrence rate of a certain event is above /
    below a certain threshold
  • Use entire execution time of tuple as severity
    (upper bound)
  • Two experimental performance properties
  • L1 data cache misses per time above average
  • Floating-point operations per time below average
    (25% of peak)
  • Main results
  • Beneficial integration of parallelism with
    individual CPU performance
  • Actual run-time penalty still unknown
  • Better bound for severity
  • Need to cover additional aspects of CPU
    performance
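A minimal C++ sketch of the threshold test described above (hypothetical names; the actual EXPERT analysis is more involved):

    // Hypothetical sketch: flag a (call path, thread) tuple whose event
    // rate is above a threshold and use its entire execution time as an
    // upper bound on the severity.
    struct Tuple {
        double time;    // execution time of this (call path, thread)
        double events;  // e.g. L1 data cache misses observed in it
    };

    double severity_bound(const Tuple& t, double threshold_rate) {
        const double rate = t.events / t.time;
        return rate > threshold_rate ? t.time : 0.0;  // upper bound
    }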

10
Intermediate Conclusion
  • Manpower
  • Most of the time only two people
  • Demand for robust and portable software
  • Tool components vs. monolithic tool
  • KOJAK developed from and as a set of independent
    components
  • No detailed initial design
  • Use of third-party components (e.g., PAPI, TAU)
  • Native KOJAK components more generally usable
  • Synergy through interoperability
  • Components with well-defined interfaces
  • Portability
  • Open source

11
Some Components and Interfaces
  • Hardware monitoring
  • HPM, PAPI, PCL
  • Instrumentation
  • DPCL, DPOMP, Dyninst, SCALEA, SDDF, SIR, TAU
  • Experiment management
  • ILab, Nimrod, ZENTURIO
  • Tool infrastructure
  • MRNet, TDP
  • Databases and source-code analysis
  • DUCTAPE, PDT, PerfDBF, PPerfDB
  • Presentation
  • Askalon, SvPablo, TAU
  • Modeling and prediction
  • MetaSim, PerformanceProphet/Teuta

12
Generic Presentation
  • Conclusions drawn from EXPERT presenter
  • Presentation independent of
  • Specific performance properties
  • Specific metric
  • Presentation only based on
  • Structure
  • Hierarchical decomposition
  • Relative weight (severity) of nodes
  • Coloring
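A minimal C++ sketch of severity-driven coloring (hypothetical linear blue-to-red scale, not necessarily the EXPERT presenter's actual scheme):

    // Hypothetical sketch: map a node's relative severity in [0, 1]
    // to a display color, blue (low) to red (high).
    struct Color { unsigned char r, g, b; };

    Color severity_color(double severity) {
        if (severity < 0.0) severity = 0.0;  // clamp to [0, 1]
        if (severity > 1.0) severity = 1.0;
        const auto red  = static_cast<unsigned char>(255 * severity);
        const auto blue = static_cast<unsigned char>(255 * (1.0 - severity));
        return Color{red, 0, blue};
    }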

[Figure: Analyzer → Presenter, driven by performance behavior]
Related work: Karavanic et al. (structural difference operator); Miller et al. (hierarchical decomposition in Paradyn)
13
CUBE Uniform Behavioral Encoding
  • High-level data model of performance behavior
  • Mapping (performance aspect, program entities) → metric
  • Hierarchical decomposition
  • Multidimensional aggregation
  • Portable data format (XML)
  • Generic presentation component
  • Performance-data algebra (not yet supported)

[Figure: KOJAK (via the Cube tool), CONE, and other performance tools write CUBE (XML) files that are read by the generic CUBE presenter]
14
CUBE Prototype
  • Implemented in C by Fengguang Song
  • Mapping (performance property, call tree, location) → metric
  • Hierarchical dimensions
  • Data format specified using XML Schema
  • C++ class interface for reading / writing
  • Tested with the CONE call-graph profiler
  • Already offers more features than the EXPERT Presenter
  • Absolute values
  • Source-code display

[Figure: three hierarchical dimensions: behavior (General Behavior → Specific Behavior); call tree (Main → Subroutine); location (Grid → Machine → SMP Node → Process → Thread)]
15
CUBE Interface
class Cube {
public:
    Cube();

    // property dimension
    int def_prop(std::string name, std::string uom,
                 std::string descr, int parent_id);

    // call-tree dimension
    int def_module(std::string name, std::string path);
    int def_region(std::string name, long begln, long endln,
                   std::string descr, int mod_id);
    int def_csite(int mod_id, int line, int callee_id);
    int def_cnode(int csite_id, int parent_id);

    // location dimension
    int def_grid(std::string name);
    int def_mach(std::string name, int grid_id);
    int def_node(std::string name, int mach_id);
    int def_proc(std::string name, int node_id);
    int def_thrd(std::string name, int proc_id);

    // severity mapping
    void set_sev(int prop_id, int cnode_id, int thrd_id, double value);
};
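A hypothetical usage sketch (made-up names, file, and line numbers; id conventions such as -1 for "no parent" are assumptions) showing how a tool could populate the three dimensions and record one severity value:

    Cube cube;

    // behavior dimension: a root property with one child
    int time_id  = cube.def_prop("TIME", "sec", "Wall clock time", -1);
    int utime_id = cube.def_prop("USER_TIME", "sec", "User CPU time", time_id);

    // call-tree dimension: main() calls foo() from line 10 of demo.c
    int mod   = cube.def_module("demo.c", "/home/user");
    int rmain = cube.def_region("main", 1, 20, "entry point", mod);
    int rfoo  = cube.def_region("foo", 22, 40, "worker routine", mod);
    int site  = cube.def_csite(mod, 10, rfoo);
    int cmain = cube.def_cnode(-1, -1);       // root call-tree node (main)
    int cfoo  = cube.def_cnode(site, cmain);  // foo called from main

    // location dimension
    int grid = cube.def_grid("cluster");
    int mach = cube.def_mach("node-a", grid);
    int node = cube.def_node("smp0", mach);
    int proc = cube.def_proc("rank 0", node);
    int thrd = cube.def_thrd("thread 0", proc);

    // 2.5 s of USER_TIME in foo() on thread 0
    cube.set_sev(utime_id, cfoo, thrd, 2.5);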
16
CUBE Data Format
<?xml version="1.0" encoding="UTF-8"?>
<cube version="0.1">
  <behavior>
    <property id="0">
      <name>TIME</name>
      <uom>sec</uom>
      <description>Wall clock time</description>
      <property id="1">
        <name>USER_TIME</name>
        <uom>sec</uom>
        <description>User CPU time</description>
      </property>
      <property id="2">
        <name>SYSTEM_TIME</name>
        <uom>sec</uom>
        <description>System CPU time</description>
      </property>
    </property>
  </behavior>
</cube>
17
Performance-data algebra
  • Comparative analysis
  • Different program versions
  • Different input data
  • Different configuration
  • Different random errors
  • Performance-data algebra
  • Perform arithmetic operations on CUBE instances
  • Difference, mean
  • Obtain CUBE instance as result
  • Display it like ordinary CUBE instance


[Figure: CUBE (XML) − CUBE (XML) = CUBE (XML)]

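A minimal C++ sketch of the difference operation (hypothetical flat representation of a CUBE instance; the algebra is not yet supported in the prototype):

    #include <map>
    #include <tuple>

    // Hypothetical sketch: a CUBE instance flattened into a map from
    // (property id, call-tree node id, thread id) to severity.
    using Key      = std::tuple<int, int, int>;
    using Severity = std::map<Key, double>;

    // Entry-wise difference; entries missing on one side count as zero.
    Severity diff(const Severity& a, const Severity& b) {
        Severity out = a;
        for (const auto& [key, value] : b)
            out[key] -= value;
        return out;
    }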
18
CONE: COntrol flow Notification Engine
  • Flexible call-graph profiler
  • Implemented in C/C++ by Nikhil Bhatia
  • Binary instrumentation (DPCL)
  • Full call path including line numbers
  • Large variety of performance data
  • PAPI used for hardware monitoring
  • Based on IBM's call-graph tracking algorithm
  • CATCH profiler
  • MPI and serial applications
  • Presentation of data with CUBE


19
Online call-graph tracking
  • Compute the static call graph in advance
  • For each control flow maintain a pointer into
    call graph
  • Start at root node
  • Move the pointer upon every function call and
    return
  • Call from call site n
  • Move to child node n
  • Recursive programs push onto stack
  • Return
  • Move to previous node (parent)
  • Recursive programs pop from stack

The number of nodes directly reachable from a
function only depends on that function - not on
the current call path
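A minimal C++ sketch of the tracking state (hypothetical types; call/return are renamed on_call/on_return since `return` is a C++ keyword), including the stack needed for recursion:

    #include <stack>
    #include <vector>

    // One node of the precomputed static call graph; the children array
    // has one entry per call site in the function.
    struct CallNode {
        CallNode*              parent;
        std::vector<CallNode*> children;
    };

    CallNode*             current;  // per-control-flow pointer, starts at root
    std::stack<CallNode*> saved;    // return targets for recursive programs

    void on_call(int site) {
        saved.push(current);                // push so recursion unwinds correctly
        current = current->children[site];  // move to child for this call site
    }

    void on_return() {
        current = saved.top();  // same as current->parent without recursion
        saved.pop();
    }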
20
Online call-graph tracking (2)
[Figure: two examples of online call-graph tracking, one over a call graph of main(), A(), B(), C(), D(), one over W(), X(), Y(), Z()]
21
Instrumenting the application
  • Every process holds a reference to current node
  • Requires only constant overhead
  • Note: recursive programs require maintaining a stack

// probe logic inserted by the instrumentation
call(int i)  { current = current->children[i]; }
return()     { current = current->parent; }

// instrumented function C, calling Y, Z, and Y again
C(...) {
    call(0); Y(...); return();
    call(1); Z(...); return();
    call(2); Y(...); return();
}
22
CONE Architecture
[Figure: CONE architecture. The CONE tool instruments the target application through DPCL and starts it; a probe module (call-graph manager, monitoring manager, PAPI) is loaded into the application; the inserted probes call into the monitoring manager, which writes a CUBE file that CUBE presents.]
23
Future Directions
  • KOJAK
  • Redesign of the analyzer component
  • Improved integration of hardware counters
  • CUBE will replace old presenter
  • CONE
  • Attaching to a running application
  • Selective tracing
  • More platforms (moving to Dyninst)
  • CUBE
  • Performance algebra
  • Automatic tree expansion
  • Rates (derived property)
  • Property 1 / Property 2
  • KOJAK-specific extensions
  • Integration with VAMPIR