Transcript and Presenter's Notes

Title: The Next Four Orders of Magnitude in Parallel PDE Simulation Performance (http://www.math.odu.edu/keyes/talks.html)


1
The Next Four Orders of Magnitude in Parallel
PDE Simulation Performance
http://www.math.odu.edu/keyes/talks.html
  • David E. Keyes
  • Department of Mathematics & Statistics,
  • Old Dominion University
  • Institute for Scientific Computing Research,
  • Lawrence Livermore National Laboratory
  • Institute for Computer Applications in Science
    and Engineering,
  • NASA Langley Research Center

2
Background of this Presentation
  • Originally prepared for Petaflops II Conference
  • History of the Petaflops Initiative in the USA
  • Enabling Technologies for Petaflops Computing,
    Feb 1994
  • book by Sterling, Messina, and Smith, MIT Press,
    1995
  • Applications Workshop, Aug 1994
  • Architectures Workshop, Apr 1995
  • Systems Software Workshop, Jun 1996
  • Algorithms Workshop, Apr 1997
  • Systems Operations Workshop Review, Jun 1998
  • Enabling Technologies for Petaflops II, Feb 1999
  • Topics in Ultra-scale Computing, book by Sterling
    et al., MIT Press, 2001 (to appear)

3
Weighing in at the Bottom Line
  • Characterization of a 1 Teraflop/s computer of
    today
  • about 1,000 processors of 1 Gflop/s (peak) each
  • due to inefficiencies within the processors, more
    practically characterized as about 4,000
    processors of 250 Mflop/s each
  • How do we want to get to 1 Petaflop/s by 2007
    (original goal)?
  • 1,000,000 processors of 1 Gflop/s each (only
    wider)?
  • 10,000 processors of 100 Gflop/s each (mainly
    deeper)?
  • From the point of view of PDE simulations on
    quasi-static Eulerian grids
  • Either!
  • Caveat: dynamic grid simulations are not covered
    in this talk
  • but see work at Bonn, Erlangen, Heidelberg, LLNL,
    and ODU presented elsewhere

4
Perspective
  • Many Grand Challenges in computational science
    are formulated as PDEs (possibly among
    alternative formulations)
  • However, PDE simulations historically have not
    performed as well as other scientific simulations
  • PDE simulations require a balance among
    architectural components that is not necessarily
    met in a machine designed to max out on the
    standard LINPACK benchmark
  • The justification for building petaflop/s
    architectures undoubtedly will (and should)
    include PDE applications
  • However, cost-effective use of petaflop/s on PDEs
    requires further attention to architectural and
    algorithmic matters
  • Memory-centric view of computation needs further
    promotion

5
Application Performance History: 3 orders of
magnitude in 10 years, better than Moore's Law
6
Bell Prize Performance History
7
Plan of Presentation
  • General characterization of PDE requirements
  • Four sources of performance improvement to get
    from current 100's of Gflop/s (for PDEs) to 1
    Pflop/s
  • Each illustrated with examples from computational
    aerodynamics
  • offered as typical of real workloads (nonlinear,
    unstructured, multicomponent, multiscale, etc.)
  • Performance presented
  • on up to thousands of processors of T3E and ASCI
    Red (for parallel aspects)
  • on numerous uniprocessors (for memory hierarchy
    aspects)

8
Purpose of Presentation
  • Not to argue for specific algorithms/programming
    models/codes in any detail
  • but see talks on Newton-Krylov-Schwarz (NKS)
    methods on the homepage above
  • Provide a requirements target for designers of
    today's systems
  • typical of several contemporary successfully
    parallelized PDE applications
  • not comprehensive of all important large-scale
    applications
  • Speculate on requirements target for designers of
    tomorrow's systems
  • Promote attention to current architectural
    weaknesses, relative to requirements of PDEs

9
Four Sources of Performance Improvement
  • Expanded number of processors
  • arbitrarily large factor, through extremely
    careful attention to load balancing and
    synchronization
  • More efficient use of processor cycles, and
    faster processor/memory elements
  • one to two orders of magnitude, through
    memory-assist language features,
    processors-in-memory, and multithreading
  • Algorithmic variants that are more
    architecture-friendly
  • approximately an order of magnitude, through
    improved locality and relaxed synchronization
  • Algorithms that deliver more science per flop
  • possibly large problem-dependent factor, through
    adaptivity
  • This last does not contribute to raw flop/s!

10
PDE Varieties and Complexities
  • Varieties of PDEs
  • evolution (time hyperbolic, time parabolic)
  • equilibrium (elliptic, spatially hyperbolic or
    parabolic)
  • mixed, varying by region
  • mixed, of multiple type (e.g., parabolic with
    elliptic constraint)
  • Complexity parameterized by
  • spatial grid points, Nx
  • temporal grid points, Nt
  • components per point, Nc
  • auxiliary storage per point, Na
  • grid points in stencil, Ns
  • Memory: M ∝ Nx · (Nc + Na + Nc² · Ns)
  • Work: W ∝ Nx · Nt · (Nc + Na + Nc² · Ns)
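A quick numeric instance of this accounting (a sketch
only; the Nc, Na, Ns, Nx, Nt values below are
hypothetical, typical of a 3D compressible-flow
discretization):

  # Memory and work estimates from the complexity model above
  # (proportionality constants taken as 1 for illustration).
  Nx, Nt = 2_800_000, 1000           # spatial grid points, time/iteration steps
  Nc, Na, Ns = 5, 2, 15              # components, auxiliaries, stencil points
  per_point = Nc + Na + Nc**2 * Ns   # words of state + aux + Jacobian blocks
  M = Nx * per_point                 # total words of memory
  W = Nx * Nt * per_point            # total work, in the same units
  print(f"memory ~ {M * 8 / 1e9:.1f} GB, work ~ {W:.2e} units")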

11
Resource Scaling for PDEs
  • For 3D problems, (Memory) ∝ (Work)^(3/4)
  • for equilibrium problems, work scales with
    problem size × no. of iteration steps; for
    reasonable implicit methods the iteration count is
    proportional to the resolution in a single
    spatial dimension
  • for evolutionary problems, work scales with
    problem size × no. of time steps; CFL-type
    arguments place the latter on the order of the
    resolution in a single spatial dimension
  • Proportionality constant can be adjusted over a
    very wide range by both discretization and by
    algorithmic tuning
  • If frequent time frames are to be captured, disk
    capacity and I/O rates must both scale linearly
    with work
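The exponent follows in one line from the scalings
above (a worked restatement, writing Nx for spatial
points and taking Nt ∝ Nx^(1/3) for the step count in
either case):

  \[
    W \;\propto\; N_x \, N_t \;\propto\; N_x \cdot N_x^{1/3} \;=\; N_x^{4/3},
    \qquad
    M \;\propto\; N_x
    \;\Longrightarrow\;
    M \;\propto\; W^{3/4}.
  \]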

12
Typical PDE Tasks
  • Vertex-based loops
  • state vector and auxiliary vector updates
  • Edge-based stencil op loops
  • residual evaluation, approximate Jacobian
    evaluation
  • Jacobian-vector product (often replaced with
    matrix-free form, involving residual evaluation)
  • intergrid transfer (coarse/fine)
  • Sparse, narrow-band recurrences
  • approximate factorization and back substitution
  • smoothing
  • Vector inner products and norms
  • orthogonalization/conjugation
  • convergence progress and stability checks

13
Edge-based Loop
  • Vertex-centered tetrahedral grid
  • Traverse by edges
  • load vertex values
  • compute intensively
  • store contributions to flux at vertices
  • Each vertex appears in approximately 15 flux
    computations
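A minimal sketch of this edge-based traversal (not the
talk's production code; the NumPy edge list and the
flux callback are hypothetical stand-ins):

  import numpy as np

  def edge_based_residual(edges, q, flux):
      # edges: (n_edges, 2) vertex index pairs of the grid
      # q:     (n_vertices, n_components) state vector
      # flux:  callable(qL, qR) -> numerical flux across the edge
      res = np.zeros_like(q)
      for i, j in edges:                 # traverse by edges
          f = flux(q[i], q[j])           # load vertex values, compute intensively
          res[i] += f                    # store contributions to flux at vertices
          res[j] -= f
      return res

  # toy usage: 3 edges of one triangle, 5 components per vertex
  edges = np.array([[0, 1], [1, 2], [2, 0]])
  q = np.random.rand(3, 5)
  r = edge_based_residual(edges, q, lambda ql, qr: 0.5 * (ql + qr))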

14
Explicit PDE Solvers
  • Concurrency is pointwise, O(N)
  • Comm.-to-Comp. ratio is surface-to-volume,
    O((N/P)^(-1/3))
  • Communication range is nearest-neighbor, except
    for time-step computation
  • Synchronization frequency is once per step,
    O((N/P)^(-1))
  • Storage per point is low
  • Load balance is straightforward for static
    quasi-uniform grids
  • Grid adaptivity (together with temporal stability
    limitation) makes load balance nontrivial

15
Domain-decomposed Implicit PDE Solvers
  • Concurrency is pointwise, O(N), or subdomainwise,
    O(P)
  • Comm.-to-Comp. ratio still mainly
    surface-to-volume, O((N/P)^(-1/3))
  • Communication still mainly nearest-neighbor, but
    nonlocal communication arises from conjugation,
    norms, coarse grid problems
  • Synchronization frequency often more than once
    per grid-sweep, up to Krylov dimension K,
    O(K (N/P)^(-1))
  • Storage per point is higher, by factor of O(K)
  • Load balance issues the same as for explicit

16
Source 1: Expanded Number of Processors
  • Amdahl's law can be defeated if serial sections
    make up a nonincreasing fraction of total work as
    problem size and processor count scale up
    together; this is true for most explicit or
    iterative implicit PDE solvers
  • popularized in 1986 Karp Prize paper by
    Gustafson, et al.
  • Simple, back-of-envelope parallel complexity
    analyses show that processors can be increased as
    fast, or almost as fast, as problem size,
    assuming load is perfectly balanced
  • Caveat: the processor network must also be
    scalable (applies to protocols as well as to
    hardware)
  • Remaining four orders of magnitude could be met
    by hardware expansion (but this does not mean
    that fixed-size applications of today would run
    10^4 times faster)
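A small illustration (not from the talk) of why the
scaled-speedup argument works, using the textbook
Amdahl and Gustafson formulas with a hypothetical
serial fraction s = 0.01, interpreted in each
formula's usual convention:

  def amdahl(s, p):
      # fixed-size speedup: the serial fraction s of the sequential run is constant
      return 1.0 / (s + (1.0 - s) / p)

  def gustafson(s, p):
      # scaled speedup: parallel work grows with p, so the serial share
      # of total work shrinks like 1/p
      return s + (1.0 - s) * p

  for p in (10, 100, 1000, 10_000):
      print(f"P={p:6d}  Amdahl={amdahl(0.01, p):8.1f}  scaled={gustafson(0.01, p):10.1f}")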

17
Back-of-Envelope Scalability Demonstration for
Bulk-synchronized PDE Computations
  • Given complexity estimates of the leading terms
    of
  • the concurrent computation (per iteration phase)
  • the concurrent communication
  • the synchronization frequency
  • And a model of the architecture including
  • internode communication (network topology and
    protocol reflecting horizontal memory structure)
  • on-node computation (effective performance
    parameters including vertical memory structure)
  • One can estimate optimal concurrency and
    execution time
  • on per-iteration basis, or overall (by taking
    into account any granularity-dependent
    convergence rate)
  • simply differentiate time estimate in terms of
    (N,P) with respect to P, equate to zero and solve
    for P in terms of N

18
3D Stencil Costs (per Iteration)
  • grid points in each direction: n; total work:
    N = O(n³)
  • processors in each direction: p; total
    processors: P = O(p³)
  • memory per node requirements: O(N/P)
  • execution time per iteration: A n³/p³
  • grid points on a side of each processor
    subdomain: n/p
  • neighbor communication per iteration: B n²/p²
  • cost of global reductions in each iteration:
    C log p or C p^(1/d)
  • C includes synchronization frequency
  • same dimensionless units for measuring A, B, C
  • e.g., cost of scalar floating point multiply-add

19
3D Stencil Computation Illustration: Rich local
network, tree-based global reductions
  • total wall-clock time per iteration:
    T(n, p) = A n³/p³ + B n²/p² + C log p
  • for optimal p, dT/dp = 0, or
    -3A n³/p⁴ - 2B n²/p³ + C/p = 0,
  • or (with r = n/p), 3A r³ + 2B r² = C, whose root
    r is independent of n
  • without speeddown, p can grow with n
  • in the limit, the optimal p = n/r grows linearly
    with n

20
3D Stencil Computation Illustration: Rich local
network, tree-based global reductions
  • optimal running time:
    T(n, p_opt) = A r³ + B r² + C log(n/r),
    where r = n/p_opt is the root of 3A r³ + 2B r² = C
  • limit of infinite neighbor bandwidth, zero
    neighbor latency (B → 0): r = (C/3A)^(1/3), so
    T_opt = C/3 + C log(n (3A/C)^(1/3)), which grows
    only logarithmically in n
  • (This analysis is on a per-iteration basis; a
    fuller analysis would multiply this cost by an
    iteration count estimate that generally depends
    on n and p.)
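A quick numerical check of this model (a sketch; the
A, B, C values below are arbitrary placeholders, not
measured machine constants): minimizing T over p for
several n shows p_opt growing essentially linearly
with n, as the analysis predicts.

  import math

  def T(n, p, A=1.0, B=10.0, C=50.0):
      # per-iteration cost: local compute + neighbor exchange + global reduction
      return A * n**3 / p**3 + B * n**2 / p**2 + C * math.log(p)

  def p_opt(n):
      # brute-force integer minimization, adequate for an illustration
      return min(range(1, 20 * n), key=lambda p: T(n, p))

  for n in (32, 64, 128, 256):
      p = p_opt(n)
      print(f"n={n:4d}  p_opt={p:5d}  p_opt/n={p / n:.2f}")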

21
Summary for Various Networks
  • With tree-based (logarithmic) global reductions
    and scalable nearest neighbor hardware
  • optimal number of processors scales linearly with
    problem size
  • With 3D torus-based global reductions and
    scalable nearest neighbor hardware
  • optimal number of processors scales as
    three-fourths power of problem size (almost
    scalable)
  • With common network bus (heavy contention)
  • optimal number of processors scales as one-fourth
    power of problem size (not scalable)
  • bad news for conventional Beowulf clusters, but
    see 2000 Bell Prize price-performance awards

22
1999 Bell Prize Parallel Scaling Results on ASCI
Red: ONERA M6 Wing Test Case, tetrahedral grid of
2.8 million vertices (about 11 million unknowns)
on up to 3072 ASCI Red nodes (333 MHz Pentium Pro
processors)
23
Surface Visualization of Test Domain for
Computing Flow over an ONERA M6 Wing
24
Transonic Lambda Shock Solution
25
Fixed-size Parallel Scaling Results (Flop/s)
26
Fixed-size Parallel Scaling Results(Time in
seconds)
27
Algorithm: Newton-Krylov-Schwarz
  • Newton: nonlinear solver, asymptotically quadratic
  • Krylov: accelerator, spectrally adaptive
  • Schwarz: preconditioner, parallelizable
28
Fixed-size Scaling Results for W-cycle (Time in
seconds, courtesy of D. Mavriplis)
ASCI runs for grid of 3.1M vertices; T3E runs
for grid of 24.7M vertices
29
Source 2: More Efficient Use of Faster
Processors
  • Current low efficiencies of sparse codes can be
    improved if regularity of reference is exploited
    with memory-assist features
  • PDEs have periodic workingset structure that
    permits effective use of prefetch/dispatch
    directives, and lots of slackness
  • Combined with processors-in-memory (PIM)
    technology for gather/scatter into densely used
    block transfers and multithreading, PDEs can
    approach full utilization of processor cycles
  • Caveat: high bandwidth is critical, since PDE
    algorithms do only O(N) work for O(N) gridpoints'
    worth of loads and stores
  • One to two orders of magnitude can be gained by
    catching up to the clock, and by following the
    clock into the few-GHz range
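One way to quantify the caveat (illustrative numbers
only, not from the talk): bound the sustainable flop
rate of a sparse matrix-vector kernel by memory
bandwidth alone, since each stored coefficient is
loaded once and used in a single multiply-add.

  # Bandwidth-limited bound for a sparse matvec: one 8-byte coefficient load
  # per multiply-add (2 flops), ignoring vector and index traffic, which
  # would only lower the bound further.
  def matvec_gflops_bound(bandwidth_gb_per_s):
      return 2.0 * bandwidth_gb_per_s / 8.0

  for bw in (0.5, 1.0, 4.0, 16.0):   # hypothetical per-processor GB/s
      print(f"{bw:5.1f} GB/s  ->  at most {matvec_gflops_bound(bw):5.2f} Gflop/s sustained")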

30
Following the Clock
  • 1999 Predictions from the Semiconductor Industry
    Association:
    http://public.itrs.net/files/1999_SIA_Roadmap/Home.htm
  • A factor of 2-3 can be expected by 2007 by
    following the clock alone

31
Example of Multithreading
  • Same ONERA M6 wing Euler code simulation on ASCI
    Red
  • ASCI Red contains two processors per node,
    sharing memory
  • Can use second processor in either
    message-passing mode with its own subdomain, or
    in multithreaded shared memory mode, which does
    not require the number of subdomain partitions to
    double
  • The latter is much more effective in the flux
    evaluation phase, as shown by cumulative execution
    time (here, memory bandwidth is not an issue)

32
PDE Workingsets
  • Smallest: data for a single stencil,
    Ns · (Nc² + Nc + Na) (sharp)
  • Largest: data for an entire subdomain,
    (Nx/P) · Ns · (Nc² + Nc + Na) (sharp)
  • Intermediate: data for a neighborhood collection
    of stencils, reused as much as possible
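Plugging in hypothetical values (the same illustrative
Nc, Na, Ns as in the earlier sketch, and the
2.8M-vertex, 3072-node case from the ASCI Red runs)
gives a feel for the two extremes:

  # Workingset sizes from the formulas above (8-byte words).
  Nc, Na, Ns = 5, 2, 15                 # hypothetical discretization parameters
  Nx, P, word = 2_800_000, 3072, 8

  per_stencil = Ns * (Nc**2 + Nc + Na)
  smallest = per_stencil                # one stencil
  largest = (Nx // P) * per_stencil     # one processor's subdomain

  print(f"smallest workingset: {smallest * word / 1e3:7.1f} KB")
  print(f"largest  workingset: {largest * word / 1e6:7.1f} MB")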

33
Cache Traffic for PDEs
  • As successive workingsets drop into a level of
    memory, capacity (and, with effort, conflict)
    misses disappear, leaving only compulsory misses
    and reducing demand on main memory bandwidth

34
Strategies Based on Workingset Structure
  • No performance value in memory levels larger than
    subdomain
  • Little performance value in memory levels smaller
    than subdomain but larger than required to permit
    full reuse of most data within each subdomain
    subtraversal
  • After providing L1 large enough for the smallest
    workingset (and multiple independent copies to
    accommodate the desired level of multithreading),
    all additional resources should be invested in a
    large L2
  • Tables describing grid connectivity are built
    (after each grid rebalancing) and stored in PIM,
    and used to pack/unpack densely used cache lines
    during subdomain traversal

35
Costs of Greater Per-processor Efficiency
  • Programming complexity of managing subdomain
    traversal
  • Space to store gather/scatter tables in PIM
  • Time to (re)build gather/scatter tables in PIM
  • Memory bandwidth commensurate with peak rates of
    all processors

36
Source 3: More Architecture-Friendly
Algorithms
  • Algorithmic practice needs to catch up to
    architectural demands
  • several one-time gains remain to be contributed
    that could improve data locality or reduce
    synchronization frequency, while maintaining
    required concurrency and slackness
  • "One-time" refers to improvements by small
    constant factors, nothing that scales in N or P;
    complexities are already near information-theoretic
    lower bounds, and we reject increases in flop
    rates that derive from less efficient algorithms
  • Caveat: remaining algorithmic performance
    improvements may cost extra space or may bank on
    stability shortcuts that occasionally backfire,
    making performance modeling less predictable
  • Perhaps an order of magnitude of performance
    remains here

37
Raw Performance Improvement from Algorithms
  • Spatial reorderings that improve locality
  • interlacing of all related grid-based data
    structures
  • ordering gridpoints and grid edges for L1/L2
    reuse
  • Discretizations that improve locality
  • higher-order methods (lead to larger, denser
    blocks at each point than lower-order methods)
  • vertex-centering (for the same tetrahedral grid,
    leads to denser block rows than cell-centering)
  • Temporal reorderings that improve locality
  • block vector algorithms (reuse cached matrix
    blocks; vectors in a block are independent)
  • multi-step vector algorithms (reuse cached vector
    blocks; vectors have sequential dependence)
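A small NumPy sketch of the first spatial reordering
above, interlacing related grid-based fields so that a
pointwise update becomes one unit-stride sweep (array
sizes are arbitrary):

  import numpy as np

  n, nc = 100_000, 5   # gridpoints and components per point (arbitrary sizes)

  # Field-by-field storage: each component in its own array, so a pointwise
  # update touches nc widely separated memory streams.
  fields = [np.random.rand(n) for _ in range(nc)]

  # Interlaced storage: the nc components of each point are contiguous,
  # so the same update streams through memory with unit stride.
  interlaced = np.stack(fields, axis=1)        # shape (n, nc)

  sum_separate = sum(fields)                   # roughly nc passes over memory
  sum_interlaced = interlaced.sum(axis=1)      # one pass over memory
  assert np.allclose(sum_separate, sum_interlaced)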

38
Raw Performance Improvement from Algorithms
  • Temporal reorderings that reduce synchronization
    penalty
  • less stable algorithmic choices that reduce
    synchronization frequency (deferred
    orthogonalization, speculative step selection)
  • less global methods that reduce synchronization
    range by replacing a tightly coupled global
    process (e.g., Newton) with loosely coupled sets
    of tightly coupled local processes (e.g.,
    Schwarz)
  • Precision reductions that make bandwidth seem
    larger
  • lower precision representation of preconditioner
    matrix coefficients or poorly known coefficients
    (arithmetic is still performed on full precision
    extensions)

39
Improvements Resulting from Locality Reordering
40
Improvements from Blocking Vectors
  • Same ONERA M6 Euler simulation, on SGI Origin
  • One vector represents standard GMRES acceleration
  • Four vectors is a blocked Krylov method, not yet
    in production version
  • Savings arise from not reloading matrix elements
    of the Jacobian for each new vector (four-fold
    increase in matrix-element reuse per load)
  • Flop/s rate is effectively tripled; however, can
    the extra vectors be used efficiently from a
    numerical viewpoint?
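The reuse effect is easy to see with a dense stand-in
for the Jacobian (a sketch only; the production kernel
is sparse and block-structured):

  import numpy as np

  n, k = 2000, 4
  A = np.random.rand(n, n)      # dense stand-in for the Jacobian
  V = np.random.rand(n, k)      # a block of k Krylov vectors

  # One vector at a time: A is streamed from memory k times.
  W_single = np.column_stack([A @ V[:, j] for j in range(k)])

  # Blocked: A is streamed once; each loaded element is reused k times.
  W_block = A @ V

  assert np.allclose(W_single, W_block)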

41
Improvements from Reduced Precision
  • Same ONERA M6 Euler simulation, on SGI Origin
  • Standard (middle column) is double precision in
    all floating quantities
  • Optimization (right column) is to store
    preconditioner for Jacobian matrix in single
    precision only, promoting to double before use in
    the processor
  • Bandwidth and matrix cache capacities are
    effectively doubled, with no deterioration in
    numerical properties
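A toy version of this trick (a sketch; here the whole
matrix is promoted at once for brevity, whereas a real
kernel would convert values in registers as they
stream from memory):

  import numpy as np

  n = 1000
  M = np.random.rand(n, n)          # stand-in for the preconditioner matrix
  x = np.random.rand(n)

  M32 = M.astype(np.float32)        # stored in single precision: half the
                                    # memory traffic and cache footprint
  y = M32.astype(np.float64) @ x    # promote before use; arithmetic in double

  # The preconditioner is approximate anyway, so the ~1e-7 relative
  # perturbation from storage rounding is typically harmless.
  print(np.linalg.norm(y - M @ x) / np.linalg.norm(M @ x))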

42
Source 4: Algorithms Packing More Science Per
Flop
  • Some algorithmic improvements do not improve flop
    rate, but lead to the same scientific end in the
    same time at lower hardware cost (less memory,
    lower operation complexity)
  • Caveat: such adaptive programs are more
    complicated and less thread-uniform than those
    they improve upon in quality/cost ratio
  • Desirable that petaflop/s machines be general
    purpose enough to run the best algorithms
  • Not daunting, conceptually, but puts an enormous
    premium on dynamic load balancing
  • An order of magnitude or more can be gained here
    for many problems

43
Example of Adaptive Opportunities
  • Spatial discretization-based adaptivity
  • change discretization type and order to attain
    required approximation to the continuum
    everywhere without over-resolving in smooth,
    easily approximated regions
  • Fidelity-based adaptivity
  • change continuous formulation to accommodate
    required phenomena everywhere without enriching
    in regions where nothing happens
  • Stiffness-based adaptivity
  • change solution algorithm to provide more
    powerful, robust techniques in regions of
    space-time where discrete problem is linearly or
    nonlinearly stiff without extra work in nonstiff,
    locally well-conditioned regions

44
Experimental Example of Opportunity for Advanced
Adaptivity
  • Driven cavity: Newton's method (left) versus new
    Additive Schwarz Preconditioned Inexact Newton
    (ASPIN) nonlinear preconditioning (right)

45
Status and Prospects for Advanced Adaptivity
  • Metrics and procedures well developed in only a
    few areas
  • method-of-lines ODEs for stiff IBVPs and DAEs,
    FEA for elliptic BVPs
  • Multi-model methods used in ad hoc ways in
    production
  • Boeing TRANAIR code
  • Poly-algorithmic solvers demonstrated in
    principle but rarely in the hostile environment
    of high-performance computing
  • Requirements for progress
  • management of hierarchical levels of
    synchronization
  • user specification of hierarchical priorities of
    different threads

46
Summary of Suggestions for PDE Petaflops
  • Algorithms that deliver more science per flop
  • possibly large problem-dependent factor, through
    adaptivity (but we won't count this towards rate
    improvement)
  • Algorithmic variants that are more
    architecture-friendly
  • expect half an order of magnitude, through
    improved locality and relaxed synchronization
  • More efficient use of processor cycles, and
    faster processor/memory
  • expect one-and-a-half orders of magnitude,
    through memory-assist language features, PIM, and
    multithreading
  • Expanded number of processors
  • expect two orders of magnitude, through dynamic
    balancing and extreme care in implementation

47
Reminder about the Source of PDEs
  • Computational engineering is not about individual
    large-scale analyses, done fast and thrown over
    the wall
  • Both results and their sensitivities are
    desired; often multiple operating points to be
    simulated are known a priori, rather than
    sequentially
  • Sensitivities may be fed back into optimization
    process
  • Full PDE analyses may also be inner iterations in
    a multidisciplinary computation
  • In such contexts, petaflop/s may mean 1,000
    analyses running somewhat asynchronously with
    respect to each other, each at 1 Tflop/s:
    clearly a less daunting challenge, and one that
    has better synchronization properties for
    exploiting The Grid, than 1 analysis running
    at 1 Pflop/s

48
Summary Recommendations for Architects
  • Support rich (mesh-like) interprocessor
    connectivity and fast global reductions
  • Allow disabling of expensive interprocessor cache
    coherence protocols for user-tagged data
  • Support fast message-passing protocols between
    processors that physically share memory, for
    legacy MP applications
  • Supply sufficient memory system bandwidth per
    processor (at least one word per clock per scalar
    unit)
  • Give user optional control of L2 cache traffic
    through directives
  • Develop at least gather/scatter
    processor-in-memory capability
  • Support variety of precisions in blocked
    transfers and fast precision conversions

49
Recommendations for New Benchmarks
  • Recently introduced sPPM benchmark fills a void
    for memory-system-realistic full-application PDE
    performance, but is explicit, structured, and
    relatively high-order
  • Similar full-application benchmark is needed for
    implicit, unstructured, low-order PDE solvers
  • Reflecting the hierarchical, distributed memory
    layout of high end computers, this benchmark
    would have two aspects
  • uniprocessor (vertical) memory system
    performance: suite of problems of various grid
    sizes and multicomponent sizes with different
    interlacings and edge orderings
  • parallel (horizontal) network performance:
    problems of various subdomain sizes and
    synchronization frequencies

50
Bibliography
  • High Performance Parallel CFD, Gropp, Kaushik,
    Keyes & Smith, Parallel Computing (to appear,
    2001)
  • Toward Realistic Performance Bounds for Implicit
    CFD Codes, Gropp, Kaushik, Keyes & Smith, 1999,
    in Proceedings of Parallel CFD'99, Elsevier
  • Prospects for CFD on Petaflops Systems, Keyes,
    Kaushik & Smith, 1999, in Parallel Solution of
    Partial Differential Equations, Springer, pp.
    247-278
  • Newton-Krylov-Schwarz Methods for Aerodynamics
    Problems: Compressible and Incompressible Flows
    on Unstructured Grids, Kaushik, Keyes & Smith,
    1998, in Proceedings of the 11th Intl. Conf. on
    Domain Decomposition Methods, Domain
    Decomposition Press, pp. 513-520
  • How Scalable is Domain Decomposition in
    Practice?, Keyes, 1998, in Proceedings of the
    11th Intl. Conf. on Domain Decomposition
    Methods, Domain Decomposition Press, pp. 286-297
  • On the Interaction of Architecture and Algorithm
    in the Domain-Based Parallelization of an
    Unstructured Grid Incompressible Flow Code,
    Kaushik, Keyes & Smith, 1998, in Proceedings of
    the 10th Intl. Conf. on Domain Decomposition
    Methods, AMS, pp. 311-319

51
Related URLs
  • Follow-up on this talk:
    http://www.mcs.anl.gov/petsc-fun3d
  • ASCI platforms:
    http://www.llnl.gov/asci/platforms
  • Int. Conferences on Domain Decomposition Methods:
    http://www.ddm.org
  • SIAM Conferences on Parallel Processing:
    http://www.siam.org (Norfolk, USA, 12-14 Mar 2001)
  • International Conferences on Parallel CFD:
    http://www.parcfd.org

52
Acknowledgments
  • Collaborators: Dinesh Kaushik (ODU), Kyle
    Anderson (NASA), and the PETSc team at ANL:
    Satish Balay, Bill Gropp, Lois McInnes, Barry
    Smith
  • Sponsors: U.S. DOE, ICASE, NASA, NSF
  • Computer Resources: DOE, SGI-Cray
  • Inspiration: Shahid Bokhari (ICASE), Xiao-Chuan
    Cai (CU-Boulder), Rob Falgout (LLNL), Paul
    Fischer (ANL), Kyle Gallivan (FSU), Liz Jessup
    (CU-Boulder), Michael Kern (INRIA), Dimitri
    Mavriplis (ICASE), Alex Pothen (ODU), Uli Ruede
    (Univ. Erlangen), John Salmon (Caltech), Linda
    Stals (ODU), Bob Voigt (DOE), David Young
    (Boeing), Paul Woodward (UMinn)