The Next Four Orders of Magnitude in Parallel PDE Simulation Performance
http://www.math.odu.edu/~keyes
  • David E. Keyes
  • Department of Mathematics & Statistics, Old Dominion University
  • Institute for Scientific Computing Research, Lawrence Livermore National Laboratory
  • Institute for Computer Applications in Science and Engineering, NASA Langley Research Center

Background of this Presentation
  • Originally prepared for the Petaflops II Conference
  • History of the Petaflops Initiative in the USA:
  • Enabling Technologies for Petaflops Computing workshop, Feb 1994
  • book by Sterling, Messina, and Smith, MIT Press
  • Applications Workshop, Aug 1994
  • Architectures Workshop, Apr 1995
  • Systems Software Workshop, Jun 1996
  • Algorithms Workshop, Apr 1997
  • Systems Operations Workshop Review, Jun 1998
  • Enabling Technologies for Petaflops II, Feb 1999
  • Topics in Ultra-scale Computing, book by Sterling et al., MIT Press, 2001 (to appear)

Weighing in at the Bottom Line
  • Characterization of a 1 Teraflop/s computer: about 1,000 processors of 1 Gflop/s (peak) each
  • due to inefficiencies within the processors, more practically characterized as about 4,000 processors of 250 Mflop/s each
  • How do we want to get to 1 Petaflop/s by 2007 (original goal)?
  • 1,000,000 processors of 1 Gflop/s each, or
  • 10,000 processors of 100 Gflop/s each
  • From the point of view of PDE simulations on quasi-static Eulerian grids: either!
  • Caveat: dynamic-grid simulations are not covered in this talk
  • but see work at Bonn, Erlangen, Heidelberg, LLNL, and ODU presented elsewhere

  • Many Grand Challenges in computational science
    are formulated as PDEs (possibly among
    alternative formulations)
  • However, PDE simulations historically have not
    performed as well as other scientific simulations
  • PDE simulations require a balance among
    architectural components that is not necessarily
    met in a machine designed to max out on the
    standard LINPACK benchmark
  • The justification for building petaflop/s
    architectures undoubtedly will (and should)
    include PDE applications
  • However, cost-effective use of petaflop/s on PDEs
    requires further attention to architectural and
    algorithmic matters
  • Memory-centric view of computation needs further attention

Application Performance History: 3 orders of magnitude in 10 years, better than Moore's Law
Bell Prize Performance History
Plan of Presentation
  • General characterization of PDE requirements
  • Four sources of performance improvement to get from current 100's of Gflop/s (for PDEs) to 1 Pflop/s
  • Each illustrated with examples from computational fluid dynamics
  • offered as typical of real workloads (nonlinear, unstructured, multicomponent, multiscale, etc.)
  • Performance presented
  • on up to thousands of processors of T3E and ASCI Red (for parallel aspects)
  • on numerous uniprocessors (for memory hierarchy aspects)

Purpose of Presentation
  • Not to argue for specific algorithms/programming models/codes in any detail
  • but see talks on Newton-Krylov-Schwarz (NKS) methods under homepage
  • Provide a requirements target for designers of today's systems
  • typical of several contemporary successfully parallelized PDE applications
  • not comprehensive of all important large-scale applications
  • Speculate on a requirements target for designers of tomorrow's systems
  • Promote attention to current architectural weaknesses, relative to requirements of PDEs

Four Sources of Performance Improvement
  • Expanded number of processors
  • arbitrarily large factor, through extremely careful attention to load balancing and implementation
  • More efficient use of processor cycles, and faster processor/memory elements
  • one to two orders of magnitude, through memory-assist language features, processors-in-memory, and multithreading
  • Algorithmic variants that are more architecture-friendly
  • approximately an order of magnitude, through improved locality and relaxed synchronization
  • Algorithms that deliver more science per flop
  • possibly large problem-dependent factor, through adaptivity
  • This last does not contribute to raw flop/s!

PDE Varieties and Complexities
  • Varieties of PDEs
  • evolution (time hyperbolic, time parabolic)
  • equilibrium (elliptic, spatially hyperbolic or parabolic)
  • mixed, varying by region
  • mixed, of multiple type (e.g., parabolic with elliptic constraint)
  • Complexity parameterized by
  • spatial grid points, Nx
  • temporal grid points, Nt
  • components per point, Nc
  • auxiliary storage per point, Na
  • grid points in stencil, Ns
  • Memory: M ∝ Nx (Nc + Na + Nc^2 Ns)
  • Work: W ∝ Nx Nt (Nc + Na + Nc^2 Ns)
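The memory and work estimates above can be transcribed directly; this is a minimal sketch using the formulas as reconstructed here (the sample parameter values are arbitrary):

```python
def pde_memory(Nx, Nc, Na, Ns):
    # M ~ Nx * (Nc + Na + Nc^2 * Ns): per grid point, store the state
    # (Nc), auxiliaries (Na), and an Nc-by-Nc Jacobian block per
    # stencil point (Nc^2 * Ns)
    return Nx * (Nc + Na + Nc**2 * Ns)

def pde_work(Nx, Nt, Nc, Na, Ns):
    # W ~ Nx * Nt * (Nc + Na + Nc^2 * Ns): one such sweep per
    # temporal grid point
    return Nt * pde_memory(Nx, Nc, Na, Ns)
```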

Resource Scaling for PDEs
  • For 3D problems, (Memory) ∝ (Work)^(3/4)
  • for equilibrium problems, work scales with problem size × no. of iteration steps; for reasonable implicit methods, the latter is proportional to resolution in a single spatial dimension
  • for evolutionary problems, work scales with problem size × no. of time steps; CFL-type arguments place the latter on the order of resolution in a single spatial dimension
  • Proportionality constant can be adjusted over a very wide range, both by discretization and by algorithmic tuning
  • If frequent time frames are to be captured, disk capacity and I/O rates must both scale linearly with work
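The (Memory) ∝ (Work)^(3/4) relation follows from the counting argument above; a sketch, assuming an n-point-per-direction 3D grid and an O(n) step count:

```python
def resource_scaling(n):
    # 3D grid: memory ~ n^3 grid points; the iteration/time-step count
    # is ~ n (the implicit-method / CFL argument above), so
    # work ~ n^3 * n = n^4, and hence memory = work**(3/4)
    memory = n**3
    work = n**3 * n
    return memory, work
```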

Typical PDE Tasks
  • Vertex-based loops
  • state vector and auxiliary vector updates
  • Edge-based stencil op loops
  • residual evaluation, approximate Jacobian
  • Jacobian-vector product (often replaced with
    matrix-free form, involving residual evaluation)
  • intergrid transfer (coarse/fine)
  • Sparse, narrow-band recurrences
  • approximate factorization and back substitution
  • smoothing
  • Vector inner products and norms
  • orthogonalization/conjugation
  • convergence progress and stability checks

Edge-based Loop
  • Vertex-centered tetrahedral grid
  • Traverse by edges
  • load vertex values
  • compute intensively
  • store contributions to flux at vertices
  • Each vertex appears in approximately 15 flux computations
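The edge-based traversal can be sketched as follows (a hypothetical miniature; `flux` stands in for the intensive per-edge physics):

```python
import numpy as np

def edge_loop(edges, state, residual, flux):
    # Traverse by edges: load the two vertex values, compute the edge
    # flux, and scatter equal-and-opposite contributions to the two
    # vertex residuals.
    for i, j in edges:
        f = flux(state[i], state[j])
        residual[i] += f
        residual[j] -= f
    return residual
```

In the real code each vertex receives contributions from roughly 15 incident edges, which is what makes this gather/scatter pattern cache-sensitive.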

Explicit PDE Solvers
  • Concurrency is pointwise, O(N)
  • Comm.-to-comp. ratio is surface-to-volume, O((N/P)^(-1/3))
  • Communication range is nearest-neighbor, except for time-step computation
  • Synchronization frequency is once per time step
  • Storage per point is low
  • Load balance is straightforward for static quasi-uniform grids
  • Grid adaptivity (together with temporal stability limitation) makes load balance nontrivial

Domain-decomposed Implicit PDE Solvers
  • Concurrency is pointwise, O(N), or subdomainwise, O(P)
  • Comm.-to-comp. ratio still mainly surface-to-volume, O((N/P)^(-1/3))
  • Communication still mainly nearest-neighbor, but nonlocal communication arises from conjugation, norms, coarse-grid problems
  • Synchronization frequency often more than once per grid sweep, up to the Krylov dimension K
  • Storage per point is higher, by a factor of O(K)
  • Load balance issues the same as for explicit solvers

Source 1: Expanded Number of Processors
  • Amdahl's law can be defeated if serial sections make up a nonincreasing fraction of total work as problem size and processor count scale up together; true for most explicit or iterative implicit PDE solvers
  • popularized in 1986 Karp Prize paper by Gustafson, et al.
  • Simple back-of-envelope parallel complexity analyses show that processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced
  • Caveat: the processor network must also be scalable (applies to protocols as well as to hardware)
  • Remaining four orders of magnitude could be met by hardware expansion (but this does not mean that fixed-size applications of today would run 10^4 times faster)
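Gustafson's scaled-speedup argument referenced above is a one-liner; a sketch (the 1% serial fraction is an arbitrary example value):

```python
def scaled_speedup(P, serial_fraction):
    # Gustafson's law: if the serial fraction f is measured on the
    # *scaled* problem, speedup on P processors is f + (1 - f) * P,
    # nearly linear in P, rather than bounded as in Amdahl's law.
    return serial_fraction + (1 - serial_fraction) * P
```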

Back-of-Envelope Scalability Demonstration for
Bulk-synchronized PDE Computations
  • Given complexity estimates of the leading terms
  • the concurrent computation (per iteration phase)
  • the concurrent communication
  • the synchronization frequency
  • And a model of the architecture including
  • internode communication (network topology and
    protocol reflecting horizontal memory structure)
  • on-node computation (effective performance
    parameters including vertical memory structure)
  • One can estimate optimal concurrency and execution time
  • on a per-iteration basis, or overall (by taking into account any granularity-dependent convergence rate)
  • simply differentiate the time estimate in terms of (N, P) with respect to P, equate to zero, and solve for P in terms of N

3D Stencil Costs (per Iteration)
  • grid points in each direction: n; total work O(n^3)
  • processors in each direction: p; total processors O(p^3)
  • memory per node requirements: O(N/P)
  • execution time per iteration: A n^3/p^3
  • grid points on a side of each processor subdomain: n/p
  • neighbor communication per iteration: B n^2/p^2
  • cost of global reductions in each iteration: C log p or C p^(1/d)
  • C includes synchronization frequency
  • same dimensionless units for measuring A, B, C
  • e.g., cost of a scalar floating-point multiply-add

3D Stencil Computation Illustration: Rich local network, tree-based global reductions
  • total wall-clock time per iteration: T(n, p) = A n^3/p^3 + B n^2/p^2 + C log p
  • for optimal p, set dT/dp = 0: -3A n^3/p^4 - 2B n^2/p^3 + C/p = 0, or C p^3 = 3A n^3 + 2B n^2 p
  • without speeddown, p can grow with n
  • in the limit as n grows, the optimal p scales linearly with n

3D Stencil Computation Illustration: Rich local network, tree-based global reductions
  • optimal running time grows only logarithmically in n
  • in the limit of infinite neighbor bandwidth and zero neighbor latency (B = 0), p_opt = (3A/C)^(1/3) n
  • (This analysis is on a per-iteration basis; a fuller analysis would multiply this cost by an iteration-count estimate that generally depends on n and p.)
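The optimization just described can be checked numerically; a sketch with arbitrary unit machine constants A = B = C = 1, scanning integer p in place of solving dT/dp = 0 in closed form:

```python
import math

def iter_time(p, n, A=1.0, B=1.0, C=1.0):
    # Per-iteration model from the slides: local computation +
    # neighbor exchange + tree-based global reduction.
    return A * n**3 / p**3 + B * n**2 / p**2 + C * math.log(p)

def optimal_p(n, p_max=2000):
    # Brute-force stand-in for differentiating T and solving for p.
    return min(range(1, p_max + 1), key=lambda p: iter_time(p, n))
```

The scan confirms that the optimal p grows roughly linearly with n, as the closed-form solution predicts.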

Summary for Various Networks
  • With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware
  • optimal number of processors scales linearly with problem size
  • With 3D torus-based global reductions and scalable nearest-neighbor hardware
  • optimal number of processors scales as the three-fourths power of problem size (almost scalable)
  • With a common network bus (heavy contention)
  • optimal number of processors scales as the one-fourth power of problem size (not scalable)
  • bad news for conventional Beowulf clusters, but see 2000 Bell Prize price-performance awards

1999 Bell Prize Parallel Scaling Results on ASCI Red: ONERA M6 Wing Test Case, tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red nodes (333 MHz Pentium Pro)
Surface Visualization of Test Domain for Computing Flow over an ONERA M6 Wing
Transonic Lambda Shock Solution
Fixed-size Parallel Scaling Results (Flop/s)
Fixed-size Parallel Scaling Results (Time in seconds)
Algorithm: Newton-Krylov-Schwarz
  • Newton: nonlinear solver, asymptotically quadratic
  • Krylov: accelerator, spectrally adaptive
  • Schwarz: preconditioner, parallelizable
Fixed-size Scaling Results for W-cycle (Time in seconds, courtesy of D. Mavriplis)
ASCI runs for grid of 3.1M vertices; T3E runs for grid of 24.7M vertices
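The three layers named above compose naturally; this is a toy sketch (hypothetical 3-unknown test problem, and a fixed-point inner iteration standing in for a true Krylov accelerator), not the production solver:

```python
import numpy as np

def schwarz_apply(J, r, blocks):
    # One-level additive Schwarz: sum of independent subdomain solves.
    z = np.zeros_like(r)
    for idx in blocks:
        z[idx] += np.linalg.solve(J[np.ix_(idx, idx)], r[idx])
    return z

def nks_solve(F, J_fn, u, blocks, newton_iters=25, inner_iters=30):
    for _ in range(newton_iters):            # Newton outer loop
        r = -F(u)
        if np.linalg.norm(r) < 1e-12:
            break
        J = J_fn(u)
        du = np.zeros_like(u)
        for _ in range(inner_iters):         # Schwarz-preconditioned inner solve
            du += schwarz_apply(J, r - J @ du, blocks)
        u = u + du
    return u
```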
Source 2: More Efficient Use of Faster Processors and Memory
  • Current low efficiencies of sparse codes can be improved if regularity of reference is exploited with memory-assist features
  • PDEs have a periodic workingset structure that permits effective use of prefetch/dispatch directives, and lots of slackness
  • Combined with processor-in-memory (PIM) technology for gather/scatter into densely used block transfers, and with multithreading, PDEs can approach full utilization of processor cycles
  • Caveat: high bandwidth is critical, since PDE algorithms do only O(N) work for O(N) gridpoints' worth of loads and stores
  • One to two orders of magnitude can be gained by catching up to the clock, and by following the clock into the few-GHz range

Following the Clock
  • 1999 predictions from the Semiconductor Industry Association roadmap
  • A factor of 2-3 can be expected by 2007 by following the clock alone

Example of Multithreading
  • Same ONERA M6 wing Euler code simulation on ASCI Red
  • ASCI Red contains two processors per node, sharing memory
  • Can use the second processor either in message-passing mode with its own subdomain, or in multithreaded shared-memory mode, which does not require doubling the number of subdomain partitions
  • The latter is much more effective in the flux evaluation phase, as shown by cumulative execution time (here, memory bandwidth is not an issue)

PDE Workingsets
  • Smallest: data for a single stencil
  • Ns (Nc^2 + Nc + Na) (sharp)
  • Largest: data for an entire subdomain
  • (Nx/P) Ns (Nc^2 + Nc + Na) (sharp)
  • Intermediate: data for a neighborhood collection of stencils, reused as possible
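The sharp workingset bounds can be written down directly (a transcription of the formulas above; the sample values are arbitrary):

```python
def smallest_workingset(Ns, Nc, Na):
    # Data for a single stencil: per stencil point, an Nc x Nc Jacobian
    # block plus Nc state and Na auxiliary values.
    return Ns * (Nc**2 + Nc + Na)

def largest_workingset(Nx, P, Ns, Nc, Na):
    # Data for an entire subdomain of Nx/P grid points.
    return (Nx // P) * smallest_workingset(Ns, Nc, Na)
```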

Cache Traffic for PDEs
  • As successive workingsets drop into a level of memory, capacity (and, with effort, conflict) misses disappear, leaving only compulsory misses and reducing demand on main memory bandwidth

Strategies Based on Workingset Structure
  • No performance value in memory levels larger than the entire subdomain workingset
  • Little performance value in memory levels smaller than the subdomain but larger than required to permit full reuse of most data within each subdomain
  • After providing L1 large enough for the smallest workingset (and multiple independent copies, up to the number needed to accommodate the desired level of multithreading), all additional resources should be invested in the largest memory level
  • Tables describing grid connectivity are built (after each grid rebalancing) and stored in PIM, used to pack/unpack densely used cache lines during subdomain traversal

Costs of Greater Per-processor Efficiency
  • Programming complexity of managing subdomain
  • Space to store gather/scatter tables in PIM
  • Time to (re)build gather/scatter tables in PIM
  • Memory bandwidth commensurate with peak rates of
    all processors

Source 3: More Architecture-Friendly Algorithm Variants
  • Algorithmic practice needs to catch up to architectural demands
  • several one-time gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness
  • "One-time" refers to improvements by small constant factors, nothing that scales in N or P; complexities are already near information-theoretic lower bounds, and we reject increases in flop rates that derive from less efficient algorithms
  • Caveat: remaining algorithmic performance improvements may cost extra space or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable
  • Perhaps an order of magnitude of performance remains here

Raw Performance Improvement from Algorithms
  • Spatial reorderings that improve locality
  • interlacing of all related grid-based data
  • ordering gridpoints and grid edges for L1/L2 reuse
  • Discretizations that improve locality
  • higher-order methods (lead to larger, denser blocks at each point than lower-order methods)
  • vertex-centering (for the same tetrahedral grid, leads to denser block rows than cell-centering)
  • Temporal reorderings that improve locality
  • block vector algorithms (reuse cached matrix blocks; vectors in a block are independent)
  • multi-step vector algorithms (reuse cached vector blocks; vectors have sequential dependence)

Raw Performance Improvement from Algorithms
  • Temporal reorderings that reduce synchronization
  • less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection)
  • less global methods that reduce synchronization range by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz)
  • Precision reductions that make bandwidth seem larger
  • lower-precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed in full precision)

Improvements Resulting from Locality Reordering
Improvements from Blocking Vectors
  • Same ONERA M6 Euler simulation, on SGI Origin
  • One vector represents standard GMRES acceleration
  • Four vectors is a blocked Krylov method, not yet in the production version
  • Savings arise from not reloading matrix elements of the Jacobian for each new vector (four-fold increase in matrix element use per load)
  • Flop/s rate is effectively tripled; however, can the extra vectors be used efficiently from a numerical viewpoint?
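The reuse effect behind the blocked variant can be seen in a miniature sparse matrix times multiple-vectors kernel (hypothetical CSR-style arrays): each stored matrix entry is loaded once and applied to all vectors in the block.

```python
import numpy as np

def spmv_block(indptr, indices, data, X):
    # CSR-style matvec against k vectors at once: each load of data[p]
    # serves k multiplies, instead of reloading the matrix per vector.
    n = len(indptr) - 1
    Y = np.zeros((n, X.shape[1]))
    for row in range(n):
        for p in range(indptr[row], indptr[row + 1]):
            Y[row] += data[p] * X[indices[p]]
    return Y
```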

Improvements from Reduced Precision
  • Same ONERA M6 Euler simulation, on SGI Origin
  • Standard (middle column) is double precision in all floating quantities
  • Optimization (right column) is to store the preconditioner for the Jacobian matrix in single precision only, promoting to double before use in the processor
  • Bandwidth and matrix cache capacities are effectively doubled, with no deterioration in numerical properties
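The single-precision-storage trick can be mimicked in a few lines (illustrative 2x2 matrix only): the preconditioner lives in float32, halving its storage and traffic, but is promoted to float64 at the point of use so the arithmetic stays double.

```python
import numpy as np

# Store the preconditioner matrix in single precision...
M32 = np.array([[4.0, 1.0], [1.0, 3.0]], dtype=np.float32)

def apply_preconditioner(r):
    # ...but promote to double before the arithmetic, as on the slide.
    return np.linalg.solve(M32.astype(np.float64), r)
```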

Source 4: Algorithms Packing More Science Per Flop
  • Some algorithmic improvements do not improve the flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity)
  • Caveat: such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio
  • Desirable that petaflop/s machines be general-purpose enough to run the best algorithms
  • Not daunting, conceptually, but puts an enormous premium on dynamic load balancing
  • An order of magnitude or more can be gained here for many problems

Example of Adaptive Opportunities
  • Spatial Discretization-based adaptivity
  • change discretization type and order to attain
    required approximation to the continuum
    everywhere without over-resolving in smooth,
    easily approximated regions
  • Fidelity-based adaptivity
  • change continuous formulation to accommodate
    required phenomena everywhere without enriching
    in regions where nothing happens
  • Stiffness-based adaptivity
  • change solution algorithm to provide more
    powerful, robust techniques in regions of
    space-time where discrete problem is linearly or
    nonlinearly stiff without extra work in nonstiff,
    locally well-conditioned regions

Experimental Example of Opportunity for Advanced Adaptivity
  • Driven cavity: Newton's method (left) versus new Additive Schwarz Preconditioned Inexact Newton (ASPIN) nonlinear preconditioning (right)

Status and Prospectsfor Advanced Adaptivity
  • Metrics and procedures well developed in only a few areas
  • method-of-lines ODEs for stiff IBVPs and DAEs, FEA for elliptic BVPs
  • Multi-model methods used in ad hoc ways, e.g., in the Boeing TRANAIR code
  • Poly-algorithmic solvers demonstrated in principle, but rarely in the hostile environment of high-performance computing
  • Requirements for progress
  • management of hierarchical levels of
  • user specification of hierarchical priorities of different threads

Summary of Suggestions for PDE Petaflops
  • Algorithms that deliver more science per flop
  • possibly large problem-dependent factor, through adaptivity (but we won't count this towards the rate improvement)
  • Algorithmic variants that are more architecture-friendly
  • expect half an order of magnitude, through improved locality and relaxed synchronization
  • More efficient use of processor cycles, and faster processor/memory elements
  • expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading
  • Expanded number of processors
  • expect two orders of magnitude, through dynamic balancing and extreme care in implementation

Reminder about the Source of PDEs
  • Computational engineering is not about individual large-scale analyses, done fast and thrown over the wall
  • Both results and their sensitivities are desired; often multiple operating points to be simulated are known a priori
  • Sensitivities may be fed back into optimization
  • Full PDE analyses may also be inner iterations in a multidisciplinary computation
  • In such contexts, petaflop/s may mean 1,000 analyses running somewhat asynchronously with respect to each other, each at 1 Tflop/s; clearly a less daunting challenge, and one with better synchronization properties for exploiting "The Grid", than 1 analysis running at 1 Pflop/s

Summary Recommendations for Architects
  • Support rich (mesh-like) interprocessor connectivity and fast global reductions
  • Allow disabling of expensive interprocessor cache coherence protocols for user-tagged data
  • Support fast message-passing protocols between processors that physically share memory, for legacy MP applications
  • Supply sufficient memory system bandwidth per processor (at least one word per clock per scalar unit)
  • Give user optional control of L2 cache traffic through directives
  • Develop at least gather/scatter processor-in-memory capability
  • Support a variety of precisions in blocked transfers, and fast precision conversions

Recommendations for New Benchmarks
  • Recently introduced sPPM benchmark fills a void
    for memory-system-realistic full-application PDE
    performance, but is explicit, structured, and
    relatively high-order
  • Similar full-application benchmark is needed for
    implicit, unstructured, low-order PDE solvers
  • Reflecting the hierarchical, distributed memory
    layout of high end computers, this benchmark
    would have two aspects
  • uniprocessor (vertical) memory system
    performance suite of problems of various grid
    sizes and multicomponent sizes with different
    interlacings and edge orderings
  • parallel (horizontal) network performance
    problems of various subdomain sizes and
    synchronization frequencies

  • "High Performance Parallel CFD", Gropp, Kaushik, Keyes & Smith, Parallel Computing (to appear, 2001)
  • "Toward Realistic Performance Bounds for Implicit CFD Codes", Gropp, Kaushik, Keyes & Smith, 1999, in Proceedings of Parallel CFD'99, Elsevier
  • "Prospects for CFD on Petaflops Systems", Keyes, Kaushik & Smith, 1999, in Parallel Solution of Partial Differential Equations, Springer
  • "Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids", Kaushik, Keyes & Smith, 1998, in Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods, Domain Decomposition Press, pp. 513-520
  • "How Scalable is Domain Decomposition in Practice?", Keyes, 1998, in Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods, Domain Decomposition Press, pp. 286-297
  • "On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code", Kaushik, Keyes & Smith, 1998, in Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods, AMS, pp. 311-319

Related URLs
  • Follow-up on this talk
  • ASCI platforms
  • International Conferences on Domain Decomposition Methods
  • SIAM Conferences on Parallel Processing (Norfolk, USA, 12-14 Mar 2001)
  • International Conferences on Parallel CFD

  • Collaborators: Dinesh Kaushik (ODU), Kyle Anderson (NASA), and the PETSc team at ANL: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
  • Sponsors: U.S. DOE, ICASE, NASA, NSF
  • Computer Resources: DOE, SGI-Cray
  • Inspiration: Shahid Bokhari (ICASE), Xiao-Chuan Cai (CU-Boulder), Rob Falgout (LLNL), Paul Fischer (ANL), Kyle Gallivan (FSU), Liz Jessup (CU-Boulder), Michael Kern (INRIA), Dimitri Mavriplis (ICASE), Alex Pothen (ODU), Uli Ruede (Univ. Erlangen), John Salmon (Caltech), Linda Stals (ODU), Bob Voigt (DOE), David Young (Boeing), Paul Woodward (UMinn)