1
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
BENCHMARKING
  • Prof. Thomas Sterling
  • Department of Computer Science
  • Louisiana State University
  • January 27, 2011

2
Topics
  • Definitions, properties and applications
  • Early benchmarks
  • Linpack
  • Other parallel benchmarks
  • Organized benchmarking
  • Presentation and interpretation of results
  • Summary

3
Topics
  • Definitions, properties and applications
  • Early benchmarks
  • Linpack
  • Other parallel benchmarks
  • Organized benchmarking
  • Presentation and interpretation of results
  • Summary

4
Basic Performance Metrics
  • Time related
    • Execution time [seconds]
      • wall clock time
      • system and user time
    • Latency
    • Response time
  • Rate related
    • Rate of computation
      • floating point operations per second (flops)
      • integer operations per second (ops)
    • Data transfer (I/O) rate [bytes/second]
  • Effectiveness
    • Efficiency: sustained perf / peak perf (see the timing sketch below)
    • Memory consumption [bytes]
    • Productivity: utility/(second)
  • Performance measures
    • Sustained performance
    • Peak performance
    • Benchmark sustained performance
      • HPL Rmax
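The efficiency entry above is simply the ratio of sustained to peak performance. A minimal C sketch of how a wall-clock measurement of a kernel with a known operation count turns into a sustained rate and an efficiency figure; the kernel, its operation count and the peak value are illustrative assumptions, not taken from the slides:

/* Minimal sketch: convert a wall-clock measurement into a sustained
 * floating point rate and an efficiency figure. The kernel and the
 * assumed peak performance are purely illustrative. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    const long   n    = 100000000L;  /* illustrative kernel: n multiply-adds */
    const double peak = 12.0e9;      /* assumed theoretical peak, flop/s */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += 1e-9 * i;               /* 2 flops per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double wall      = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double sustained = 2.0 * n / wall;    /* sustained rate, flop/s */
    double eff       = sustained / peak;  /* efficiency = sustained perf / peak perf */

    printf("s=%g  time %.3f s  %.2f Gflops  efficiency %.1f%%\n",
           s, wall, sustained / 1e9, 100.0 * eff);
    return 0;
}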

5
What Is a Benchmark?
Benchmark: "a standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance)"
Merriam-Webster
  • The term benchmark also commonly applies to specially designed programs used in benchmarking
  • A benchmark should
    • be domain specific (the more general the benchmark, the less useful it is for anything in particular)
    • be a distillation of the essential attributes of a workload
    • avoid using a single metric to express the overall performance
  • Computational benchmark kinds
    • synthetic: specially created programs that impose a load on a specific component of the system
    • application: derived from a real-world application program

6
Purpose of Benchmarking
  • Provide a tool enabling quantitative comparisons
    • comparison of variations within the same system
    • comparison of distinct systems
  • Driving progress
    • enable better engineering by defining measurable and repeatable objectives
  • Establishing a performance agenda
    • measure release-to-release or version-to-version progress
    • set goals to meet
    • be understandable and useful also to people without expertise in the field (managers, etc.)

7
Properties of a Good Benchmark
  • Relevance: meaningful within the target domain
  • Understandability
  • Good metric(s): linear, orthogonal, monotonic
  • Scalability: applicable to a broad spectrum of hardware/architecture
  • Coverage: does not over-constrain the typical environment (does not require any special conditions)
  • Acceptance: embraced by users and vendors
  • Has to enable comparative evaluation

Adapted from "Standard Benchmarks for Database Systems" by Charles Levine, SIGMOD '97
8
Topics
  • Definitions, properties and applications
  • Early benchmarks
  • Linpack
  • Other parallel benchmarks
  • Organized benchmarking
  • Presentation and interpretation of results
  • Summary

9
Early Benchmarks
  • Whetstone
  • Floating point intensive
  • Dhrystone
  • Integer and character string oriented
  • Livermore Fortran Kernels
  • Livermore Loops
  • Collection of short kernels
  • NAS kernel
  • 7 Fortran test kernels for aerospace computation
  • The sources of the benchmarks listed above are
    available from http://www.netlib.org/benchmark

10
Whetstone
  • Originally written in Algol 60 in 1972 at the
    National Physics Laboratory (UK)
  • Named after Whetstone Algol translator-interpreter
    on the KDF9 computer
  • Measures primarily floating point performance in
    WIPS (Whetstone Instructions Per Second)
  • Also raised the issue of the efficiency of different
    programming languages
  • The original Algol code was translated to C and
    Fortran (single and double precision support),
    PL/I, APL, Pascal, Basic, Simula and others

11
Dhrystone
  • Synthetic benchmark developed in 1984 by Reinhold
    Weicker
  • The name is a pun on Whetstone
  • Measures integer and string operations
    performance, expressed in number of iterations,
    or Dhrystones, per second
  • Alternative unit: D-MIPS, normalized to VAX
    11/780 performance
  • Latest released version is 2.1; includes
    implementations in C, Ada and Pascal
  • Superseded by SPECint suite

[Photo: Gordon Bell and the VAX 11/780]
12
Livermore Fortran Kernels (LFK)
  • Developed at Lawrence Livermore National
    Laboratory in 1970
  • also known as Livermore Loops
  • Consists of 24 separate kernels
  • hydrodynamic codes, Cholesky conjugate gradient,
    linear algebra, equation of state, integration,
    predictors, first sum and difference, particle in
    cell, Monte Carlo, linear recurrence, discrete
    ordinate transport, Planckian distribution and
    others
  • include careful and careless coding practices
  • Produces 72 timing results using 3 different
    DO-loop lengths for each kernel
  • Produces Megaflops values for each kernel and
    range statistics of the results
  • Can be used as a performance test, a compiler accuracy
    test (checksums are stored in the code), or a hardware
    endurance test (a single-kernel timing sketch follows below)
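As a concrete, if toy, illustration of what a Livermore-style measurement looks like, the sketch below times repeated passes over one short loop, in the spirit of the hydro-fragment kernel, and reports Mflops. The array sizes, constants and pass count are illustrative; this is not the official LFK driver:

/* Sketch of a Livermore-Loops-style measurement: time repeated passes
 * over one short kernel and report Mflops. The loop body resembles the
 * hydro-fragment kernel; sizes and constants are illustrative. */
#include <stdio.h>
#include <time.h>

#define N      1001
#define PASSES 10000

int main(void)
{
    static double x[N + 12], y[N + 12], z[N + 12];
    double q = 0.5, r = 0.1, t = 0.2;
    for (int k = 0; k < N + 12; k++) { y[k] = 1.0; z[k] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < PASSES; p++) {
        for (int k = 0; k < N; k++)              /* 5 flops per iteration */
            x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
        q += x[0] * 1e-12;                       /* keep passes from being optimized away */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mflops = 5.0 * N * (double)PASSES / secs / 1e6;
    printf("checksum %g, %.1f Mflops\n", x[N - 1], mflops);
    return 0;
}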

13
NAS Kernel
  • Developed at the Numerical Aerodynamic Simulation
    Projects Office at NASA Ames
  • Focuses on vector floating point performance
  • Consists of 7 test kernels in Fortran (approx.
    1000 lines of code)
  • matrix multiply
  • complex 2-D FFT
  • Cholesky decomposition
  • block tri-diagonal matrix solver
  • vortex method setup with Gaussian elimination
  • vortex creation with boundary conditions
  • parallel inverse of three matrix pentadiagonals
  • Reports performance in Mflops (64-bit precision)

14
Topics
  • Definitions, properties and applications
  • Early benchmarks
  • Linpack
  • Other parallel benchmarks
  • Organized benchmarking
  • Presentation and interpretation of results
  • Summary

15
Linpack Overview
  • Introduced by Jack Dongarra in 1979
  • Based on LINPACK linear algebra package developed
    by J. Dongarra, J. Bunch, C. Moler and P. Stewart
    (now superseded by the LAPACK library)
  • Solves a dense, regular system of linear
    equations, using matrices initialized with
    pseudo-random numbers
  • Provides an estimate of the system's effective
    floating-point performance
  • Does not reflect the overall performance of the
    machine!

16
Linpack Benchmark Variants
  • Linpack Fortran (single processor)
    • N=100
    • N=1000, TPP, best effort
  • Linpack's Highly Parallel Computing benchmark
    (HPL)
  • Java Linpack

17
Fortran Linpack (I)
  • N=100 case
  • Provides results listed in Table 1 of the Linpack
    Benchmark Report
  • Absolutely no changes to the code can be made
    (not even in comments!)
  • Matrix generated by the program must be used to
    run this case
  • An external timing function (SECOND) has to be
    supplied
  • Only compiler-induced optimizations allowed
  • Measures performance of two routines:
    • DGEFA: LU decomposition with partial pivoting
    • DGESL: solves the system of linear equations using
      the result from DGEFA
  • Complexity: O(n²) for DGESL, O(n³) for DGEFA
    (see the timing sketch below)
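A sketch of what the N=100 driver measures: the factorization and solve are timed through the user-supplied SECOND routine, and the total time is converted to Mflops with the classic 2n³/3 + 2n² operation count. This is only an outline in C; second(), dgefa() and dgesl() are assumed to be provided elsewhere (as in the netlib sources), not defined here.

/* Sketch of the Linpack N=100 measurement: time dgefa and dgesl separately,
 * then convert the total to Mflops. second(), dgefa() and dgesl() are
 * assumed to be linked in from the benchmark sources. */
#include <stdio.h>

extern double second(void);                       /* user-supplied timer      */
extern void dgefa(double a[], int lda, int n, int ipvt[], int *info);
extern void dgesl(double a[], int lda, int n, int ipvt[], double b[], int job);

double time_linpack(double a[], double b[], int ipvt[], int n, int lda)
{
    int info;
    double t0 = second();
    dgefa(a, lda, n, ipvt, &info);                /* O(n^3) LU factorization  */
    double t1 = second();
    dgesl(a, lda, n, ipvt, b, 0);                 /* O(n^2) triangular solves */
    double t2 = second();

    double total = t2 - t0;
    double ops = (2.0 * n * n * n) / 3.0 + 2.0 * (double)n * n;
    printf("dgefa %.4f s  dgesl %.4f s  total %.4f s  %.2f Mflops\n",
           t1 - t0, t2 - t1, total, ops / total / 1.0e6);
    return total;
}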

18
Fortran Linpack (II)
  • N=1000 case: Toward Peak Performance (TPP), best
    effort
  • Provides results listed in Table 1 of Linpack
    Benchmark Report
  • The user can choose any linear equation to be
    solved
  • Allows a complete replacement of the
    factorization/solver code by the user
  • No restriction on the implementation language for
    the solver
  • The solution must conform to prescribed accuracy
    and the matrix used must be the same as the
    matrix used by the netlib driver

19
Linpack Fortran Performance on Different Platforms
Computer (configuration)                               | N=100 [Mflops] | N=1000 (TPP) [Mflops] | Theoretical peak [Mflops]
Intel Pentium Woodcrest (1 core, 3 GHz)                | 3018           | 6542                  | 12000
NEC SX-8/8 (8 proc., 2 GHz)                            | -              | 75140                 | 128000
NEC SX-8/8 (1 proc., 2 GHz)                            | 2177           | 14960                 | 16000
HP ProLiant BL20p G3 (4 cores, 3.8 GHz Intel Xeon)     | -              | 8185                  | 14800
HP ProLiant BL20p G3 (1 core, 3.8 GHz Intel Xeon)      | 1852           | 4851                  | 7400
IBM eServer p5-575 (8 POWER5 proc., 1.9 GHz)           | -              | 34570                 | 60800
IBM eServer p5-575 (1 POWER5 proc., 1.9 GHz)           | 1776           | 5872                  | 7600
SGI Altix 3700 Bx2 (1 Itanium2 proc., 1.6 GHz)         | 1765           | 5953                  | 6400
HP ProLiant BL45p (4 cores AMD Opteron 854, 2.8 GHz)   | -              | 12860                 | 22400
HP ProLiant BL45p (1 core AMD Opteron 854, 2.8 GHz)    | 1717           | 4191                  | 5600
Fujitsu VPP5000/1 (1 proc., 3.33 ns)                   | 1156           | 8784                  | 9600
Cray T932 (32 proc., 2.2 ns)                           | 1129 (1 proc.) | 29360                 | 57600
HP AlphaServer GS1280 7/1300 (8 Alpha proc., 1.3 GHz)  | -              | 14260                 | 20800
HP AlphaServer GS1280 7/1300 (1 Alpha proc., 1.3 GHz)  | 1122           | 2132                  | 2600
HP 9000 rp8420-32 (8 PA-8800 proc., 1000 MHz)          | -              | 14150                 | 32000
HP 9000 rp8420-32 (1 PA-8800 proc., 1000 MHz)          | 843            | 2905                  | 4000

Data excerpted from the 11-30-2006 LINPACK Benchmark Report at
http://www.netlib.org/benchmark/performance.ps
20
Fortran Linpack Demo
> ./linpack

 Please send the results of this run to:

 Jack J. Dongarra
 Computer Science Department
 University of Tennessee
 Knoxville, Tennessee 37996-1300
 Fax: 865-974-8296
 Internet: dongarra@cs.utk.edu

 This is version 29.5.04.

     norm. resid      resid           machep          x(1)            x(n)
  1.25501937E+00   1.39332990E-14   2.22044605E-16   1.00000000E+00  1.00000000E+00

    times are reported for matrices of order   100
      dgefa      dgesl      total     mflops       unit      ratio        b(1)

Output column legend:
  • dgefa: time spent in the matrix factorization routine (dgefa)
  • dgesl: time spent in the solver (dgesl)
  • total: total time (dgefa + dgesl)
  • mflops: sustained floating point rate
  • unit: timing unit (obsolete)
  • ratio: fraction of Cray-1S execution time (obsolete)
  • b(1): first element of the right-hand-side vector
Two different array dimensions are used to test the effect of array placement in memory.

Reference: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
21
Linpack's Highly Parallel Computing Benchmark (HPL)
  • Measures the performance of distributed memory
    machines
  • Used in the Linpack Benchmark Report (Table 3)
    and to determine the order of machines on the
    Top500 list
  • The portable version is written in C
  • External dependencies for the installation:
    • MPI-1.1 functionality for inter-node
      communication
    • BLAS or VSIPL library for simple vector
      operations such as scaled vector addition (DAXPY:
      y ← a·x + y) and inner dot product (DDOT: a = Σ xᵢyᵢ);
      plain-C sketches of both follow this list
  • Ground rules:
    • allows a complete user replacement of the LU
      factorization and solver steps (the accuracy must
      satisfy a given bound)
    • same matrix as in the driver program
    • no restrictions on problem size
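For reference, plain-C sketches of the two vector operations mentioned above; in an actual HPL build these come from an optimized BLAS (or VSIPL) library rather than from loops like these:

/* DAXPY: y <- a*x + y (scaled vector addition) */
static void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* DDOT: returns the inner product sum_i x[i]*y[i] */
static double ddot(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}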

22
HPL Algorithm
  • Data distribution: 2-D block-cyclic
  • Algorithm elements:
    • right-looking variant of LU factorization with
      row partial pivoting featuring multiple
      look-ahead depths
    • recursive panel factorization with pivot search
      and column broadcast combined
    • various virtual panel broadcast topologies
    • bandwidth-reducing swap-broadcast algorithm
    • backward substitution with look-ahead depth of
      one
  • Floating point operation count: 2/3·n³ + n²

23
HPL Algorithm Elements
Execution flow for a single parameter set:
Matrix Generation → Panel Factorization → Panel Broadcast → Look-ahead → Update
→ (repeat until all columns of A are processed) → Backward Substitution → Solution Check
Six broadcast algorithms are available for the panel broadcast step.
Reference: http://www.netlib.org/benchmark/hpl/algorithm.html

24
HPL Linpack Metrics
  • The HPL implementation of the benchmark is run
    for different problem sizes N on the entire
    machine
  • For certain problem size Nmax, the cumulative
    performance in Mflops (reflecting 64-bit addition
    and multiplication operations) reaches its
    maximum value denoted as Rmax
  • Another metric possible to obtain from the
    benchmark is N1/2, the problem size for which the
    half of the maximum performance (Rmax/2) is
    achieved
  • The Rmax value is used to rank supercomputers in
    Top500 list listed along with this number are
    the theoretical peak double precision floating
    point performance Rpeak of the machine and N1/2
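A small sketch of how a measured solve time becomes the reported Gflops number (and hence Rmax, the largest such value over the problem sizes tried), using the operation count quoted on the algorithm slide; the N and time below are illustrative values matching the sample HPL run shown later in these slides:

/* Convert problem size and solve time into the benchmark's Gflops figure,
 * using the 2/3·n³ + n² operation count quoted on the HPL algorithm slide. */
#include <stdio.h>

static double hpl_gflops(double n, double seconds)
{
    double ops = (2.0 / 3.0) * n * n * n + n * n;
    return ops / seconds / 1.0e9;
}

int main(void)
{
    /* Values taken from the sample run later in these slides */
    printf("N = 5000, 7.14 s  ->  %.2f Gflops\n", hpl_gflops(5000.0, 7.14));
    return 0;
}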

25
Machine Parameters Influencing Linpack Performance
Parameter                  | Linpack Fortran, N=100 | Linpack Fortran, N=1000 (TPP) | HPL
Processor speed            | Yes                    | Yes                           | Yes
Memory capacity            | No                     | No (on a modern system)       | Yes (for Rmax)
Network latency/bandwidth  | No                     | No                            | Yes
Compiler flags             | Yes                    | Yes                           | Yes
26
Ten Fastest Supercomputers On Current Top500 List
Source: http://www.top500.org/sublist
27
Java Linpack
  • Intended mostly to measure the efficiency of the
    Java implementation rather than hardware floating
    point performance
  • Solves a dense 500x500 system of linear equations
    with one right-hand side, Ax = b
  • Matrix A is generated randomly
  • Vector b is constructed so that all components of
    the solution x are one
  • Uses Gaussian elimination with partial pivoting
  • Reports Mflops, time to solution, Norm Res
    (solution accuracy), relative machine precision
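A compact C sketch of the approach described above: generate a random A, build b so that the exact solution is a vector of ones, and solve by Gaussian elimination with partial pivoting. This mirrors the structure of the Java Linpack driver but is not its code; the size and random generator are illustrative.

/* Sketch: solve Ax = b by Gaussian elimination with partial pivoting,
 * with b built so the exact solution is all ones. N and the RNG are
 * illustrative, not taken from the Java Linpack sources. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 500

static double a[N][N], b[N];

int main(void)
{
    /* Random matrix; b = A * ones, so the true solution is x[i] = 1 */
    for (int i = 0; i < N; i++) {
        b[i] = 0.0;
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)rand() / RAND_MAX - 0.5;
            b[i] += a[i][j];
        }
    }

    /* Forward elimination with partial pivoting */
    for (int k = 0; k < N - 1; k++) {
        int p = k;
        for (int i = k + 1; i < N; i++)
            if (fabs(a[i][k]) > fabs(a[p][k])) p = i;
        for (int j = k; j < N; j++) {            /* swap pivot row into place */
            double tmp = a[k][j]; a[k][j] = a[p][j]; a[p][j] = tmp;
        }
        double tb = b[k]; b[k] = b[p]; b[p] = tb;
        for (int i = k + 1; i < N; i++) {
            double m = a[i][k] / a[k][k];
            for (int j = k; j < N; j++) a[i][j] -= m * a[k][j];
            b[i] -= m * b[k];
        }
    }

    /* Back substitution; b now holds the computed solution */
    for (int i = N - 1; i >= 0; i--) {
        for (int j = i + 1; j < N; j++) b[i] -= a[i][j] * b[j];
        b[i] /= a[i][i];
    }

    double err = 0.0;
    for (int i = 0; i < N; i++) err = fmax(err, fabs(b[i] - 1.0));
    printf("max |x[i] - 1| = %.3e\n", err);
    return 0;
}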

28
HPL Output Example
> mpirun -np 4 xhpl

HPLinpack 1.0a  --  High-Performance Linpack benchmark  --  January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    5000
NB     :      32
PMAP   : Row-major process mapping
P      :       2        1        4
Q      :       2        4        1
PFACT  :    Left
NBMIN  :       2
NDIV   :       2
RFACT  :    Left
BCAST  :   1ringM
DEPTH  :       0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     2     2               7.14          1.168e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0400275 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0264242 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0051580 ...... PASSED

T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     1     4               7.00          1.192e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0335428 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0221433 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0043224 ...... PASSED

T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     4     1               7.00          1.191e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0426255 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0281393 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0054928 ...... PASSED

Finished      3 tests with the following results:
              3 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
----------------------------------------------------------------------------
End of Tests.

For configuration issues, consult http://www.netlib.org/benchmark/hpl/faqs.html
29
Topics
  • Definitions, properties and applications
  • Early benchmarks
  • Linpack
  • Other parallel benchmarks
  • Organized benchmarking
  • Presentation and interpretation of results
  • Summary

30
Other Parallel Benchmarks
  • High Performance Computing Challenge (HPCC)
    benchmarks
  • Devised and sponsored to enrich the benchmarking
    parameter set
  • NAS Parallel Benchmarks (NPB)
  • Powerful set of metrics
  • Reflects computational fluid dynamics
  • NPBIO-MPI
  • Stresses external I/O system

31
HPC Challenge Benchmark
  • Consists of 7 individual tests:
    • HPL (Linpack TPP): floating point rate of
      execution of a solver of a linear system of
      equations
    • DGEMM: floating point rate of execution of double
      precision matrix-matrix multiplication
    • STREAM: sustainable memory bandwidth (GB/s) and
      the corresponding computation rate for simple
      vector kernels (a triad-style sketch follows this list)
    • PTRANS (parallel matrix transpose): total
      capacity of the network using pairwise
      communicating processes
    • RandomAccess: the rate of integer random updates
      of memory (in GUPS, Giga Updates Per Second)
    • FFT: floating point rate of execution of double
      precision complex 1-D Discrete Fourier Transform
    • b_eff (effective bandwidth benchmark): latency
      and bandwidth for a number of simultaneous
      communication patterns
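As an illustration of the kind of simple vector kernel STREAM times, a minimal "triad" (a[i] = b[i] + q·c[i]) measurement in C; the array size and the single-pass timing are simplifications of the real benchmark's rules:

/* Sketch of a STREAM-style triad measurement reporting GB/s.
 * Array size and repeat count are illustrative; the real benchmark has
 * stricter rules about sizing arrays well beyond cache and repeating runs. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L   /* doubles per array; should be much larger than cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double q = 3.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];              /* triad: touches three arrays */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * N * sizeof(double); /* read b, read c, write a */
    printf("triad: %.2f GB/s (checksum %g)\n", bytes / secs / 1e9, a[N / 2]);
    free(a); free(b); free(c);
    return 0;
}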

32
Comparison of HPCC Results on Selected
Supercomputers
  • Notes:
    • all metrics shown are higher-is-better, except
      for the Random Ring Latency
    • machine labels include machine name (optional),
      manufacturer and system name, affiliation and, in
      parentheses, processor/network fabric type

33
NAS Parallel Benchmarks
  • Derived from computational fluid dynamics (CFD)
    applications
  • Consist of five kernels and three
    pseudo-applications
  • Exist in several flavors:
    • NPB 1: original "paper-and-pencil" specification
      • generally proprietary implementations by hardware
        vendors
    • NPB 2: MPI-based sources distributed by NAS
      • supplements NPB 1
      • can be run with little or no tuning
    • NPB 3: implementations in OpenMP, HPF and Java
      • derived from the NPB-serial version with improved
        serial code
      • a set of multi-zone benchmarks was added to test
        the implementation efficiency of multi-level and
        hybrid parallelization methods and tools (e.g.
        OpenMP with MPI)
    • GridNPB 3: new suite of benchmarks designed to
      rate the performance of computational grids
      • includes only four benchmarks, derived from the
        original NPB
      • written in Fortran and Java
      • Globus as grid middleware

34
NPB 2 Overview
  • Multiple problem classes (S, W, A, B, C, D)
  • Tests written mainly in Fortran (IS in C)
  • BT (block tri-diagonal solver with 5x5 block
    size)
  • CG (conjugate gradient approximation to compute
    the smallest eigenvalue of a sparse, symmetric
    positive definite matrix)
  • EP (embarrassingly parallel: evaluates an
    integral by means of pseudorandom trials; a toy
    example in this spirit follows this list)
  • FT (3-D PDE solver using Fast Fourier Transforms)
  • IS (large integer sort: tests both integer
    computation speed and network performance)
  • LU (a regular-sparse, 5x5 block lower and upper
    triangular system solver)
  • MG (simplified multigrid kernel: tests both short-
    and long-distance data communication)
  • SP (solves multiple independent systems of
    non-diagonally dominant, scalar, pentadiagonal
    equations)
  • Sources and reports available from
    http://www.nas.nasa.gov/Resources/Software/npb.html
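A toy example in the spirit of EP (not the actual kernel, which uses a specific pseudorandom generator and counts Gaussian pairs): every trial is independent, which is why this pattern parallelizes trivially.

/* EP-style toy: estimate pi from independent pseudorandom trials.
 * Only meant to show the embarrassingly parallel structure; it is not
 * the NPB EP kernel. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long trials = 10000000;
    long hits = 0;
    srand(12345);                           /* fixed seed for repeatability */
    for (long i = 0; i < trials; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;   /* each trial is independent */
    }
    printf("pi estimate: %.6f\n", 4.0 * hits / (double)trials);
    return 0;
}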

35
NPBIO-MPI
  • Attempts to address the lack of I/O tests in NPB,
    focusing primarily on file output
  • Based on the BTIO (Block Tridiagonal Input Output)
    effort, which extended the BT (block tri-diagonal)
    benchmark with routines writing five double
    precision numbers for every mesh point to storage
    • runs for 200 iterations, writing every five
      iterations
    • after all time steps are finished, all data
      belonging to a single time step must be stored in
      the same file, sorted by vector components
    • timing must include all required data
      rearrangements to achieve the specified data
      layout
  • Supported access scenarios (see the MPI-IO sketch
    after this list):
    • simple: MPI-IO without collective buffering
    • full: MPI-IO with collective buffering
    • fortran: Fortran 77 file operations
    • epio: each process writes its part of the
      computational domain continuously to a separate
      file
  • Number of processes must be a square
  • Problem sizes: class A (64³), class B (102³),
    class C (162³)
  • Several possible results, depending on the
    benchmarking goal: effective flops, effective
    output bandwidth or output overhead
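A minimal MPI-IO sketch contrasting the "simple" and "full" access scenarios named above: each rank writes its block of doubles at its own offset in a shared file, either with an independent write or with the collective (collectively buffered) variant. The file name and block size are illustrative, and this is not the BTIO code itself.

/* Sketch: independent vs. collective MPI-IO writes to one shared file. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1000   /* doubles written per process (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    double buf[NLOCAL];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < NLOCAL; i++) buf[i] = rank + i * 1e-6;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "btio_demo.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * NLOCAL * sizeof(double);

    /* "simple" scenario: independent write, no collective buffering */
    MPI_File_write_at(fh, off, buf, NLOCAL, MPI_DOUBLE, MPI_STATUS_IGNORE);

    /* "full" scenario would use the collective form instead:
     * MPI_File_write_at_all(fh, off, buf, NLOCAL, MPI_DOUBLE, MPI_STATUS_IGNORE);
     */
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}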

36
Sample NPB 2 Results
Reference: "The NAS Parallel Benchmarks 2.1 Results" by W. Saphir, A. Woo, and M. Yarrow,
http://www.nas.nasa.gov/News/Techreports/1996/PDF/nas-96-010.pdf
37
Topics
  • Definitions, properties and applications
  • Early benchmarks
  • Linpack
  • Other parallel benchmarks
  • Organized benchmarking
  • Presentation and interpretation of results
  • Summary

38
Benchmarking Organizations
  • SPEC (Standard Performance Evaluation
    Corporation)
  • Created to satisfy the need for realistic, fair
    and standardized performance tests
  • Motto: "An ounce of honest data is worth more
    than a pound of marketing hype"
  • TPC (Transaction Processing Performance Council)
  • Formed primarily due to lack of reliable database
    benchmarks

39
SPEC Benchmark Suite Overview
  • The Standard Performance Evaluation Corporation
    (SPEC) is a non-profit organization (financed by
    its members: over 60 leading computer and
    software manufacturers) founded in 1988
  • SPEC benchmarks are written in a platform-neutral
    language (typically C or Fortran)
  • The code may be compiled using arbitrary
    compilers, but the sources may not be modified
    • many manufacturers are known to optimize their
      compilers and/or systems to improve the SPEC
      results
  • Benchmarks may be obtained by purchasing a
    license from SPEC; the results are published on
    the SPEC website
  • Website: http://www.spec.org

40
SPEC Suite Components
  • SPEC CPU2006: combined performance of CPU, memory
    and compiler
    • CINT2006 (aka SPECint): integer arithmetic test
      using compilers, interpreters, word processors,
      chess programs, etc.
    • CFP2006 (aka SPECfp): floating point test using
      physical simulations, 3D graphics, image
      processing, computational chemistry, etc.
  • SPECweb2005: PHP/JSP performance
  • SPECviewperf: OpenGL 3D graphics system
    performance
  • SPECapc: several popular 3D-intensive
    applications
  • SPEC HPC2002: high-end parallel computing tests
    using a quantum chemistry application, weather
    modeling, and an industrial oil deposit locator
  • SPEC OMP2001: OpenMP application performance
  • SPECjvm98: performance of a Java client on a Java
    VM
  • SPECjAppServer2004: multi-tier benchmark
    measuring the performance of J2EE application
    servers
  • SPECjbb2005: server-side Java performance
  • SPEC MAIL2001: mail server performance (SMTP and
    POP)
  • SPEC SFS97_R1: NFS server throughput and response
    time
  • Planned: SPEC MPI2006, SPECimap, SPECpower,
    virtualization

41
Sample Results SPEC CPU2006
System | CINT2006 Speed (base, peak) | CFP2006 Speed (base, peak) | CINT2006 Rate (base, peak) | CFP2006 Rate (base, peak)
Dell Precision 380 (Pentium EE965 3.73GHz, 2cores) 11.6 12.4 23.1 21.7
HP ProLiant DL380 G4 (Xeon 3.8GHz, 2 cores) 11.4 11.7 20.9 18.8
HP ProLiant DL585 (Opteron 854 2.8GHz, 2 cores) 11.2 12.7 12.1 13.0 22.3 25.2 24.1 25.9
Sun Blade 2500 (1 UltraSPARC IIIi, 1280MHz) 4.04 4.04
Sun Fire E25K (UltraSPARC IV 1500MHz, 144 cores) 759 904
HP Integrity rx6600 (Itanium2 1.6GHz/24MB, 2 cores) 14.5 15.7 17.3 18.1
HP Integrity rx6600 (Itanium2 1.6GHz/24MB, 8 cores) 94.7 102 69.1 71.4
HP Integrity Superdome (Itanium2 1.6GHz/24MB, 128 cores) 1534 1648 1422 1479
  • Notes:
    • the "base" metric requires that the same flags are
      used when compiling all instances of the
      benchmark ("peak" is less strict)
    • a "speed" metric measures how fast a computer
      executes a single task, while "rate" determines
      throughput with multiple tasks

42
TPC
  • Governed by the Transaction Processing
    Performance Council (http://www.tpc.org),
    founded in 1985
    • members include leading system and microprocessor
      manufacturers, and commercial database developers
    • the council appoints professional affiliates and
      auditors outside the member group to help fulfill
      the TPC's mission and validate benchmark results
  • Current benchmark flavors:
    • TPC-C: for transaction processing (the de-facto
      standard for On-Line Transaction Processing)
    • TPC-H: for decision support systems
    • TPC-App: for web services
  • Obsolete benchmarks:
  • TPC-A (performance of update-intensive databases)
  • TPC-B (throughput of a system in transactions per
    second)
  • TPC-D (decision support applications with long
    running queries against complex data structures)
  • TPC-R (business reporting, decision support)
  • TPC-W (transactional web e-Commerce benchmark)

43
Top Ten TPC-C Results
44
Topics
  • Definitions, properties and applications
  • Early benchmarks
  • Linpack
  • Other parallel benchmarks
  • Organized benchmarking
  • Presentation and interpretation of results
  • Summary

45
Presentation of the Results
  • Tables
  • Graphs
  • Bar graphs (a)
  • Scatter plots (b)
  • Line plots (c)
  • Pie charts (d)
  • Gantt charts (e)
  • Kiviat graphs (f)
  • Enhancements
  • Error bars, boxes or confidence intervals
  • Broken or offset scales (be careful!)
  • Multiple curves per graph (but avoid overloading)
  • Data labels, colors, etc.

46
Kiviat Graph Example
Source: http://www.cse.clrc.ac.uk/disco/DLAB_BENCH_WEB/hpcc/hpcc_kiviat.shtml
47
Mixed Graph Example
[Figure: characterization of NSF/CCT parallel applications (WRF, OOCORE, MILC,
PARATEC, HOMME, BSSN_PUGH, Whisky_Carpet, ADCIRC, PETSc_FUN3D) on the POWER5
architecture, using data collected by IPM; each application is broken down by
computation vs. communication fraction and by floating point, load/store, and
other operations.]
48
Graph Dos and Don'ts
  • Good graphs
  • Require minimum effort from the reader
  • Maximize information
  • Maximize information-to-ink ratio
  • Use commonly accepted practices
  • Avoid ambiguity
  • Poor graphs
  • Have too many alternatives on a single chart
  • Display too many y-variables on a single chart
  • Use vague symbols in place of text
  • Show extraneous information
  • Select scale ranges improperly
  • Use a line chart instead of a bar graph

Reference: Raj Jain, "The Art of Computer Systems Performance Analysis", Chapter 10
49
Common Mistakes in Benchmarking
From Chapter 9 of "The Art of Computer Systems Performance Analysis" by Raj Jain
  • Only average behavior represented in test
    workload
  • Skewness of device demands ignored
  • Loading level controlled inappropriately
  • Caching effects ignored
  • Buffering sizes not appropriate
  • Inaccuracies due to sampling ignored
  • Ignoring monitoring overhead
  • Not validating measurements
  • Not ensuring same initial conditions
  • Not measuring transient performance
  • Using device utilizations for performance
    comparisons
  • Collecting too much data but doing very little
    analysis

50
Misrepresentation of Performance Results on
Parallel Computers
  1. Quote only 32-bit performance results, not 64-bit
    results
  2. Present performance for an inner kernel,
    representing it as the performance of the entire
    application
  3. Quietly employ assembly code and other low-level
    constructs
  4. Scale problem size with the number of processors,
    but omit any mention of this fact
  5. Quote performance results projected to the full
    system
  6. Compare your results with scalar, unoptimized
    code run on another platform
  7. When direct run time comparisons are required,
    compare with an old code on an obsolete system
  8. If MFLOPS rates must be quoted, base the
    operation count on the parallel implementation,
    not on the best sequential implementation
  9. Quote performance in terms of processor
    utilization, parallel speedups or MFLOPS per
    dollar
  10. Mutilate the algorithm used in the parallel
    implementation to match the architecture
  11. Measure parallel run times on a dedicated system,
    but measure conventional run times in a busy
    environment
  12. If all else fails, show pretty pictures and
    animated videos, and don't talk about performance

Reference: David Bailey, "Twelve Ways to Fool the Masses When Giving Performance
Results on Parallel Computers", Supercomputing Review, Aug. 1991, pp. 54-55,
http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
51
Topics
  • Definitions, properties and applications
  • Early benchmarks
  • Linpack
  • Other parallel benchmarks
  • Organized benchmarking
  • Presentation and interpretation of results
  • Summary

52
Material For Test
  • Basic performance metrics (slide 4)
  • Definition of a benchmark in your own words; purpose
    of benchmarking; properties of a good benchmark
    (slides 5, 6, 7)
  • Linpack: what it is, what it measures,
    concepts and complexities (slides 15, 17, 18)
  • HPL (slides 21 and 24)
  • Linpack variants: compare and contrast (slide 25)
  • General knowledge about the HPCC, SPEC and NPB suites
    (slides 30, 31, 34, 39)
  • Kiviat graphs (slide 46)
  • Benchmark result interpretation (slides 49, 50)
