Title: Design of a Test Suite for Automatic Performance Analysis Tools
1 Design of a Test Suite for (Automatic) Performance Analysis Tools
- Bernd Mohr
- Forschungszentrum Jülich
- NIC
- Germany
- b.mohr_at_fz-juelich.de
- Jesper Larsson Träff, NEC Europe Ltd., C&C Research Labs, Germany, traff_at_ccrl-nece.de
- Michael Gerndt, Tech. Universität München, LRR, Germany, gerndt_at_in.tum.de
2 APART Terminology
- Performance Property
  - Aspect of performance behavior of an application
  - E.g., communication dominated by waiting time
  - Specified as condition referring to performance data
  - Quantified and normalized in terms of behavior-independent metric (severity)
- Performance Problem
  - Performance property with negative implications
- Performance Bottleneck
  - Performance problem with highest severity
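The three terms can be illustrated with a minimal C sketch. Everything in it is an assumption made for illustration and not part of APART or ATS: the struct `region_profile_t`, the function names, and the 50% threshold used to decide that communication is "dominated" by waiting.

```c
#include <assert.h>

/* Hypothetical performance data for one program run (times in seconds). */
typedef struct {
    double comm_time;   /* time spent in communication calls */
    double wait_time;   /* part of comm_time spent waiting */
    double total_time;  /* total execution time */
} region_profile_t;

/* Property condition: communication is dominated by waiting time.
   The 0.5 threshold is an assumption chosen for this sketch. */
int comm_dominated_by_waiting(const region_profile_t *p)
{
    return p->wait_time > 0.5 * p->comm_time;
}

/* Severity: waiting time normalized by total execution time, so that
   severities of different properties can be compared. */
double waiting_severity(const region_profile_t *p)
{
    assert(p->total_time > 0.0);
    return p->wait_time / p->total_time;
}

/* Bottleneck: index of the property with the highest severity,
   or -1 if the list is empty. */
int find_bottleneck(const double *severity, int n)
{
    int i, best = -1;
    for (i = 0; i < n; ++i) {
        if (best < 0 || severity[i] > severity[best])
            best = i;
    }
    return best;
}
```

With 3 s of waiting inside 4 s of communication and 12 s of total run time, the condition holds and the severity is 0.25. A tool would rank all properties that hold by this value and report the top one as the bottleneck.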
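3 Example Performance Property: Message in Wrong Order
[Figure: time-line diagram with axes Location (processes A and B) and Time, showing two SEND events, a wait interval, and a RECV event.]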
4 The APART Test Suite (ATS)
- Users rely on correct working of tools
- Tools need to be especially well tested
- Systematic approach needed
- APART Test Suite
  - Common project inside APART group
  - Every member needs this → minimize resources
  - Ensures re-usability
  - Will also allow evaluation / comparison of the different member projects
- Main focus: automatic performance analysis tools
  - But also useful for regular performance tools
- http://www.fz-juelich.de/apart/ats/
5 Desired Functionality
- Tests to determine whether the semantics of the original program were not altered
- Tests to see whether the recorded performance data is correct
- Synthetic positive test cases for each known and defined performance property and combinations of them
- Negative test cases which have no known performance problem
- Real-world-size parallel applications and benchmarks

Notes on how to provide these:
- Can be partially based on existing validation suites → WWW
- Probably needs to be tool specific
- Collect available benchmarks and applications → WWW
- Design and implementation of an ATS framework
6 Validation Suites and Kernel Benchmarks (I)
- Validation
  - MPI test / validation suites from Intel, IBM, ANL
    - http://www-unix.mcs.anl.gov/mpi/mpi-test/tsuite.html
- MPI Benchmarks
  - PARKBENCH (PARallel Kernels and BENCHmarks)
    - http://www.netlib.org/parkbench/
  - PMB - Pallas MPI Benchmarks
    - http://www.pallas.com/e/products/pmb/
  - SKaMPI (Special Karlsruher MPI Benchmark)
    - http://liinwww.ira.uka.de/skampi/
7 Kernel Benchmarks (II)
- OpenMP Benchmarks
  - EPCC OpenMP Microbenchmarks
    - http://www.epcc.ed.ac.uk/research/openmpbench/openmp_index.html
- Hybrid Benchmarks
  - The Los Alamos MicroBenchmarks Suite (LAMB)
    - MPI and multi-threading (Pthreads and OpenMP) programming models, based on SKaMPI and EPCC
8 Real World Applications and Benchmarks
- The NAS Parallel Benchmarks (NPB)
  - http://www.nas.nasa.gov/Software/NPB/
- The ASCI Purple and Blue Benchmark Codes
  - http://www.llnl.gov/asci/purple/benchmarks/limited/code_list.html
  - http://www.llnl.gov/asci_benchmarks/asci/asci_code_list.html
- NCAR Benchmarks
  - http://www.scd.ucar.edu/css/software/bench/
9 Current Design of ATS Framework
10 The Distribution Module
- Distribution specified by
  - Distribution function
  - Distribution parameters
- All distribution functions have the same signature
  - double distr_func(int me, int size, double sf, distr_t dd)
  - me, size: member me of group of size size
  - sf: scaling factor
  - dd: distribution parameter descriptor
  - returns value for me, calculated based on me, size, and dd, scaled by sf
- ATS provides set of predefined distribution functions
  - Can easily be extended if needed
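A minimal sketch of functions that follow the signature listed above may help. The slides give only the signature and the names of the predefined functions, so the bodies here are assumptions inferred from those names. The descriptor type `val1_distr_t`, the definition of `distr_t` as `const void *`, and the `_sketch` suffixes are hypothetical and not taken from ATS.

```c
#include <assert.h>

/* Assumed: the descriptor is passed as an opaque pointer. */
typedef const void *distr_t;

/* Hypothetical descriptors: one value, or a low/high pair. */
typedef struct { double val; } val1_distr_t;
typedef struct { double low; double high; } val2_distr_t;

/* Common signature, as given on the slide. */
typedef double (*distr_func_t)(int me, int size, double sf, distr_t dd);

/* "same" (assumed meaning): every member gets the same value. */
double df_same_sketch(int me, int size, double sf, distr_t dd)
{
    const val1_distr_t *d = dd;
    assert(size > 0 && me >= 0 && me < size);
    return d->val * sf;
}

/* "cyclic2" (assumed meaning): members alternate between low and high. */
double df_cyclic2_sketch(int me, int size, double sf, distr_t dd)
{
    const val2_distr_t *d = dd;
    assert(size > 0 && me >= 0 && me < size);
    return ((me % 2 == 0) ? d->low : d->high) * sf;
}

/* "block2" (assumed meaning): first half gets low, second half gets high. */
double df_block2_sketch(int me, int size, double sf, distr_t dd)
{
    const val2_distr_t *d = dd;
    assert(size > 0 && me >= 0 && me < size);
    return ((me < size / 2) ? d->low : d->high) * sf;
}
```

Because all functions share one signature, a property function can take any of them as a `distr_func_t` argument together with a matching descriptor. Adding a new distribution then only requires writing one more function of this shape.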
11 Predefined Distribution Functions
[Figure: plots of the predefined distribution functions same, linear, peak, block2, cyclic2, block3, and cyclic3 (label: n).]
12 Current Design of ATS Framework
13 Example MPI Property Function: late_sender

    /* Each process does an amount of work determined by the distribution. */
    void par_do_mpi_work(distr_func_t df, distr_t dd, MPI_Comm c) {
      int me, sz;
      MPI_Comm_rank(c, &me);
      MPI_Comm_size(c, &sz);
      do_work(df(me, sz, 1.0, dd));
    }

    void late_sender(double bwork, double ework, int r, MPI_Comm c) {
      val2_distr_t dd;
      int i;
      mpi_buf_t* buf = alloc_mpi_buf(base_type, base_cnt);
      dd.low = bwork + ework;   /* base work plus extra work */
      dd.high = bwork;          /* base work only */
      for (i = 0; i < r; ++i) {
        par_do_mpi_work(df_cyclic2, &dd, c);
        mpi_commpattern_sendrecv(buf, DIR_UP, 0, 0, c);
      }
      free_mpi_buf(buf);
    }
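The effect of `late_sender` can be estimated with a small model of one sender/receiver pair. This is a sketch under stated assumptions, not ATS code: the sender is taken to be the process that does `bwork + ework`, message transfer time is zero, the send returns immediately, and the receive blocks until the message has been sent. The function name `simulate_late_sender_wait` is hypothetical.

```c
#include <assert.h>

/* Simulates r repetitions of "work, then send/receive" for one pair and
   returns the total time the receiver spends waiting for the sender. */
double simulate_late_sender_wait(double bwork, double ework, int r)
{
    double t_send = 0.0;   /* sender's clock */
    double t_recv = 0.0;   /* receiver's clock */
    double waited = 0.0;   /* accumulated receiver waiting time */
    int i;

    assert(bwork >= 0.0 && ework >= 0.0 && r >= 0);
    for (i = 0; i < r; ++i) {
        t_send += bwork + ework;  /* sender works longer, then sends */
        t_recv += bwork;          /* receiver works, then posts its receive */
        if (t_send > t_recv) {    /* receive blocks until the send happens */
            waited += t_send - t_recv;
            t_recv = t_send;
        }
    }
    return waited;
}
```

Under these assumptions the receiver waits `ework` in every repetition, so the total is `r * ework`. That gives an expected waiting time against which a tool's reported severity for this property can be compared.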
14 Currently Implemented Performance Property Functions
- MPI Point-to-Point Communication Performance Properties
  - late_sender(basework, extrawork, rf, MPI_Comm)
  - late_receiver(basework, extrawork, rf, MPI_Comm)
- MPI Collective Communication Performance Properties
  - imbalance_at_mpi_barrier(distr_func, distr_param, rf, MPI_Comm)
  - imbalance_at_mpi_alltoall(distr_func, distr_param, rf, MPI_Comm)
  - late_broadcast(basework, rootextrawork, root, rf, MPI_Comm)
  - late_scatter(basework, rootextrawork, root, rf, MPI_Comm)
  - late_scatterv(basework, rootextrawork, root, rf, MPI_Comm)
  - early_reduce(rootwork, baseextrawork, root, rf, MPI_Comm)
  - early_gather(rootwork, baseextrawork, root, rf, MPI_Comm)
  - early_gatherv(rootwork, baseextrawork, root, rf, MPI_Comm)
15 Currently Implemented Performance Property Functions
- OpenMP Imbalance Performance Properties
  - imbalance_in_parallel_region(distr_func, distr_param, rf)
  - imbalance_at_barrier(distr_func, distr_param, rf)
  - imbalance_in_parallel_loop(distr_func, distr_param, rf)
  - imbalance_in_parallel_loop_nowait(distr_func, distr_param, rf)
  - imbalance_in_parallel_section(distr_func, distr_param, rf)
  - imbalance_due_to_uneven_section_distribution(workload, rf)
  - imbalance_due_to_not_enough_section(workload, rf)
  - imbalance_in_ordered_loop(distr_func, distr_param, factor, rf)
- OpenMP Unparallelized Performance Properties
  - unparallelized_in_master_region(workload, rf)
  - unparallelized_in_single_region(workload, rf)
  - unparallelized_in_ordered_loop(workload, factor, rf)
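To show what an OpenMP imbalance property function might look like, here is a minimal sketch in the spirit of `imbalance_in_parallel_region(distr_func, distr_param, rf)`. The slides list only the name and parameters, so the body is an assumption. The `do_work` stand-in, the `df_ramp_sketch` distribution, and the returned total are all hypothetical additions for illustration. The code also compiles without OpenMP, in which case it runs with a single thread.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

typedef double (*distr_func_t)(int me, int size, double sf, const void *dd);

/* Hypothetical stand-in for a work routine: busy loop proportional to units. */
static void do_work(double units)
{
    volatile double sink = 0.0;
    long i, n = (long)(units * 1000.0);
    for (i = 0; i < n; ++i)
        sink += (double)i;
}

/* Hypothetical distribution: thread me gets (me + 1) * sf units of work. */
double df_ramp_sketch(int me, int size, double sf, const void *dd)
{
    (void)size;
    (void)dd;
    return (me + 1) * sf;
}

/* Each thread does r rounds of work sized by the distribution function.
   Returns the total number of work units, summed over all threads. */
double imbalance_in_parallel_region_sketch(distr_func_t df, const void *dd, int r)
{
    double total = 0.0;
    assert(df != 0 && r >= 0);
#pragma omp parallel reduction(+ : total)
    {
        int me = 0, sz = 1, i;
#ifdef _OPENMP
        me = omp_get_thread_num();
        sz = omp_get_num_threads();
#endif
        for (i = 0; i < r; ++i) {
            double units = df(me, sz, 1.0, dd);
            do_work(units);
            total += units;
        }
    } /* implicit barrier: threads with less work wait here for the others */
    return total;
}
```

The imbalance shows up at the implicit barrier that ends the parallel region. Threads with a small share of the work idle there until the most heavily loaded thread finishes, and that idle time is what an analysis tool should attribute to the region.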
16 Currently Implemented Performance Property Functions
- OpenMP Synchronization Performance Properties
  - critical_section_locking(rf)
  - critical_section_contention(factor, rf)
  - serialization_due_to_critical_section(factor, workload, rf)
  - frequent_atomic(factor, rf)
  - setting_lock(rf)
  - lock_testing(workload, rf)
  - lock_waiting(workload, rf)
  - all_threads_lock_contention(factor, rf)
  - pairwise_lock_contention(factor, rf)
17 Currently Implemented Performance Property Functions
- OpenMP Parallel Overhead Performance Properties
  - dynamic_scheduling_overhead(factor, rf)
  - scheduling_overhead_in_parallelized_inner_loop(bigfactor, smallfactor, rf)
  - insufficient_work_in_parallel_loop(factor, rf)
  - firstprivate_initialization(arraysize, rf)
  - lastprivate_overhead(factor, rf)
  - reduction_handling(factor, rf)
  - false_sharing_in_parallel_region(factor, rf)
18 Current Design of ATS Framework
19 Performance Property Test Programs
- Single performance property testing
  - Programs can be generated automatically from performance property function signature
  - Generator based on Program Database Toolkit (PDT)
    - http://www.cs.uoregon.edu/research/paracomp/pdtoolkit/
  - Property parameters become test program arguments
  - More extensive tests through scripting languages or experiment management system (e.g., Zenturio)
    - http://www.par.univie.ac.at/project/zenturio/
- Composite performance property testing
  - Program containing multiple performance property functions
  - Complexity only limited by imagination
  - Currently manually implemented
20 Example: Single Performance Property Test Program

    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi_pattern.h"

    int main(int argc, char *argv[]) {
      /* Defaults, used when fewer arguments are given. */
      distr_func_t df = atodf("b20.51.0");
      distr_t dd = atodd("b20.51.0");
      int r = 1;
      char *basecomm = "d20000";

      MPI_Init(&argc, &argv);
      switch (argc) {
        case 4: basecomm = argv[3];      /* fall through */
        case 3: r = atoi(argv[2]);       /* fall through */
        case 2: df = atodf(argv[1]);
                dd = atodd(argv[1]);     /* fall through */
        case 1: break;
        default: fprintf(stderr, "usage: %s <distf> <rfac> <bcomm>\n", argv[0]);
                 break;
      }
      set_base_comm_a(basecomm);
      imbalance_at_mpi_barrier(df, dd, r, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
    }
21 Example: Single Performance Property Test Program
- imbalance_at_mpi_barrier <distribution-spec> <repetition-factor>
- b20.51.0 2
- b20.12.0 5
- Problem: additional property "MPI Setup/Termination Overhead" also holds!
22 Example: Collection of MPI Performance Properties
23 Example: Detail of MPI Properties
24 Example: MPI Properties in 2 Communicators
25 EXPERT Analysis of MPI 2 Communicator Example
26 Example: OpenMP Performance Property
27 EWOMP03 Study
- Evaluating OpenMP Analysis Tools with the APART Test Suite
  - Hitachi SR8000 Profiling
  - KOJAK/Expert
  - Vampir (based on EPILOG traces)
  - Intel Vtune
- Result summary
  - All tools find imbalance properties
  - Expert and Vtune also allow inspection of sequential code
  - Vtune is the only tool
    - Distinguishing lock competition and synchronized time
    - Identifying imbalance in nowait loop
  - None of the tools is able to give more detailed information, e.g., distinguishing imbalance in section and imbalanced number of sections
28 ATS Status and Future Work
- Initial prototype available from APART website
  - List of MPI, OpenMP, and hybrid validation and benchmark suites
  - 1st version of ATS framework including
    - C and Fortran version of code
    - Single property test program generator
- Future Work
  - More complete collection of validation and benchmark suites
  - Real real-world applications
  - More studies
  - ATS Framework
    - Even more complete list of property functions for MPI, OpenMP, hybrid, and sequential performance properties
    - Documentation