Title: The Next Four Orders of Magnitude in Parallel PDE Simulation Performance

1. The Next Four Orders of Magnitude in Parallel PDE Simulation Performance
http://www.math.odu.edu/keyes/talks.html
- David E. Keyes
- Department of Mathematics & Statistics, Old Dominion University
- Institute for Scientific Computing Research, Lawrence Livermore National Laboratory
- Institute for Computer Applications in Science and Engineering, NASA Langley Research Center
2. Background of this Presentation
- Originally prepared for the Petaflops II Conference
- History of the Petaflops Initiative in the USA:
- Enabling Technologies for Petaflops Computing, Feb 1994
- book by Sterling, Messina, and Smith, MIT Press, 1995
- Applications Workshop, Aug 1994
- Architectures Workshop, Apr 1995
- Systems Software Workshop, Jun 1996
- Algorithms Workshop, Apr 1997
- Systems Operations Workshop Review, Jun 1998
- Enabling Technologies for Petaflops II, Feb 1999
- Topics in Ultra-scale Computing, book by Sterling et al., MIT Press, 2001 (to appear)
3. Weighing in at the Bottom Line
- Characterization of a 1 Teraflop/s computer of today:
- about 1,000 processors of 1 Gflop/s (peak) each
- due to inefficiencies within the processors, more practically characterized as about 4,000 processors of 250 Mflop/s each
- How do we want to get to 1 Petaflop/s by 2007 (the original goal)?
- 1,000,000 processors of 1 Gflop/s each (only wider)?
- 10,000 processors of 100 Gflop/s each (mainly deeper)?
- From the point of view of PDE simulations on quasi-static Eulerian grids: either!
- Caveat: dynamic grid simulations are not covered in this talk
- but see work at Bonn, Erlangen, Heidelberg, LLNL, and ODU presented elsewhere
4. Perspective
- Many Grand Challenges in computational science are formulated as PDEs (possibly among alternative formulations)
- However, PDE simulations historically have not performed as well as other scientific simulations
- PDE simulations require a balance among architectural components that is not necessarily met in a machine designed to max out on the standard LINPACK benchmark
- The justification for building petaflop/s architectures undoubtedly will (and should) include PDE applications
- However, cost-effective use of petaflop/s on PDEs requires further attention to architectural and algorithmic matters
- The memory-centric view of computation needs further promotion
5. Application Performance History: 3 orders of magnitude in 10 years, better than Moore's Law

6. Bell Prize Performance History
7. Plan of Presentation
- General characterization of PDE requirements
- Four sources of performance improvement to get from the current 100's of Gflop/s (for PDEs) to 1 Pflop/s
- Each illustrated with examples from computational aerodynamics
- offered as typical of real workloads (nonlinear, unstructured, multicomponent, multiscale, etc.)
- Performance presented
- on up to thousands of processors of the T3E and ASCI Red (for parallel aspects)
- on numerous uniprocessors (for memory hierarchy aspects)
8. Purpose of Presentation
- Not to argue for specific algorithms/programming models/codes in any detail
- but see talks on Newton-Krylov-Schwarz (NKS) methods under the homepage
- Provide a requirements target for designers of today's systems
- typical of several contemporary successfully parallelized PDE applications
- not comprehensive of all important large-scale applications
- Speculate on a requirements target for designers of tomorrow's systems
- Promote attention to current architectural weaknesses, relative to the requirements of PDEs
9. Four Sources of Performance Improvement
- Expanded number of processors
- arbitrarily large factor, through extremely careful attention to load balancing and synchronization
- More efficient use of processor cycles, and faster processor/memory elements
- one to two orders of magnitude, through memory-assist language features, processors-in-memory, and multithreading
- Algorithmic variants that are more architecture-friendly
- approximately an order of magnitude, through improved locality and relaxed synchronization
- Algorithms that deliver more science per flop
- possibly a large problem-dependent factor, through adaptivity
- This last does not contribute to raw flop/s!
10. PDE Varieties and Complexities
- Varieties of PDEs:
- evolution (time hyperbolic, time parabolic)
- equilibrium (elliptic, spatially hyperbolic or parabolic)
- mixed, varying by region
- mixed, of multiple type (e.g., parabolic with elliptic constraint)
- Complexity parameterized by:
- spatial grid points, Nx
- temporal grid points, Nt
- components per point, Nc
- auxiliary storage per point, Na
- grid points in stencil, Ns
- Memory: M ∝ Nx (Nc + Na + Nc^2 Ns)
- Work: W ∝ Nx Nt (Nc + Na + Nc^2 Ns)
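As a worked instance of these estimates (with illustrative values not taken from the talk: Nc = 5 components for 3D compressible flow, Na = 2 auxiliary quantities per point, Ns = 15 stencil points):

```latex
M \;\propto\; N_x\,(N_c + N_a + N_c^2 N_s)
  \;=\; N_x\,(5 + 2 + 25 \cdot 15)
  \;\approx\; 382\,N_x
```

so per-point storage is dominated by the Jacobian blocks (the Nc^2 Ns term), and a 10^7-vertex grid already implies on the order of 4 x 10^9 stored values.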
11. Resource Scaling for PDEs
- For 3D problems, (Memory) ∝ (Work)^(3/4)
- for equilibrium problems, work scales with problem size x no. of iteration steps; for reasonable implicit methods the latter is proportional to resolution in a single spatial dimension
- for evolutionary problems, work scales with problem size x no. of time steps; CFL-type arguments place the latter on the order of resolution in a single spatial dimension
- The proportionality constant can be adjusted over a very wide range, both by discretization and by algorithmic tuning
- If frequent time frames are to be captured, disk capacity and I/O rates must both scale linearly with work
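The 3/4-power relation follows directly from the two bullets above; a one-line derivation (suppressing the per-point factors):

```latex
W \propto N_x N_t, \qquad N_t \propto N_x^{1/3}
\;\Rightarrow\; W \propto N_x^{4/3}
\;\Rightarrow\; M \propto N_x \propto W^{3/4}.
```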
12. Typical PDE Tasks
- Vertex-based loops
- state vector and auxiliary vector updates
- Edge-based stencil op loops
- residual evaluation, approximate Jacobian evaluation
- Jacobian-vector product (often replaced with matrix-free form, involving residual evaluation)
- intergrid transfer (coarse/fine)
- Sparse, narrow-band recurrences
- approximate factorization and back substitution
- smoothing
- Vector inner products and norms
- orthogonalization/conjugation
- convergence progress and stability checks
13. Edge-based Loop
- Vertex-centered tetrahedral grid
- Traverse by edges:
- load vertex values
- compute intensively
- store contributions to flux at vertices
- Each vertex appears in approximately 15 flux computations
- (a minimal sketch of such a loop follows below)
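As a concrete (if toy) rendering of this loop, a sketch in Python/NumPy, with hypothetical sizes and a placeholder flux in place of a real flux kernel:

```python
import numpy as np

nv, ne, nc = 1000, 7000, 5                 # vertices, edges, components (illustrative)
edges = np.random.randint(0, nv, (ne, 2))  # endpoint vertex indices per edge
u = np.random.rand(nv, nc)                 # interlaced state: u[vertex, component]
res = np.zeros((nv, nc))                   # flux accumulator at vertices

ul, ur = u[edges[:, 0]], u[edges[:, 1]]    # gather endpoint states for all edges
flux = 0.5 * (ul + ur)                     # placeholder for the real flux kernel
np.add.at(res, edges[:, 0], +flux)         # scatter-add: several edges may hit the
np.add.at(res, edges[:, 1], -flux)         # same vertex, so use unbuffered adds
```

The gather/compute/scatter shape is the point: vertex data are loaded indirectly, reused across the roughly 15 incident edges, and accumulated back.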
14. Explicit PDE Solvers
- Concurrency is pointwise, O(N)
- Comm.-to-comp. ratio is surface-to-volume, O((N/P)^(-1/3))
- Communication range is nearest-neighbor, except for time-step computation
- Synchronization frequency is once per step, O((N/P)^(-1))
- Storage per point is low
- Load balance is straightforward for static quasi-uniform grids
- Grid adaptivity (together with the temporal stability limitation) makes load balance nontrivial
15. Domain-decomposed Implicit PDE Solvers
- Concurrency is pointwise, O(N), or subdomainwise, O(P)
- Comm.-to-comp. ratio still mainly surface-to-volume, O((N/P)^(-1/3))
- Communication still mainly nearest-neighbor, but nonlocal communication arises from conjugation, norms, and coarse grid problems
- Synchronization frequency often more than once per grid-sweep, up to the Krylov dimension, O(K (N/P)^(-1))
- Storage per point is higher, by a factor of O(K)
- Load balance issues are the same as for explicit solvers
16. Source 1: Expanded Number of Processors
- Amdahl's law can be defeated if serial sections make up a nonincreasing fraction of total work as problem size and processor count scale up together; true for most explicit or iterative implicit PDE solvers
- popularized in the 1988 Karp Prize paper by Gustafson et al. (see the scaled-speedup form below)
- Simple, back-of-envelope parallel complexity analyses show that processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced
- Caveat: the processor network must also be scalable (applies to protocols as well as to hardware)
- The remaining four orders of magnitude could be met by hardware expansion (but this does not mean that fixed-size applications of today would run 10^4 times faster)
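For reference, the scaled-speedup argument in equation form: if s is the serial fraction measured on the parallel system, then

```latex
S(P) \;=\; s + (1 - s)\,P ,
```

so as PDE problem size grows with P and s shrinks, speedup approaches P rather than saturating at 1/s as in the fixed-size Amdahl reading.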
17. Back-of-Envelope Scalability Demonstration for Bulk-synchronized PDE Computations
- Given complexity estimates of the leading terms of:
- the concurrent computation (per iteration phase)
- the concurrent communication
- the synchronization frequency
- And a model of the architecture including:
- internode communication (network topology and protocol, reflecting horizontal memory structure)
- on-node computation (effective performance parameters, including vertical memory structure)
- One can estimate optimal concurrency and execution time
- on a per-iteration basis, or overall (by taking into account any granularity-dependent convergence rate)
- simply differentiate the time estimate in terms of (N,P) with respect to P, equate to zero, and solve for P in terms of N
18. 3D Stencil Costs (per Iteration)
- grid points in each direction n; total work N = O(n^3)
- processors in each direction p; total processors P = O(p^3)
- memory per node requirements: O(N/P)
- execution time per iteration: A n^3/p^3
- grid points on a side of each processor subdomain: n/p
- neighbor communication per iteration: B n^2/p^2
- cost of global reductions in each iteration: C log p or C p^(1/d)
- C includes synchronization frequency
- same dimensionless units for measuring A, B, and C
- e.g., cost of a scalar floating-point multiply-add
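A numerical version of this recipe, under the cost model of this slide (coefficient values are made up for illustration); note how the optimal n/p settles to a constant, so p can grow linearly with n:

```python
import numpy as np

def T(n, p, A=1.0, B=10.0, C=50.0):
    # per-iteration time: compute + neighbor exchange + global reduction
    return A * n**3 / p**3 + B * n**2 / p**2 + C * np.log(p)

for n in (64, 128, 256, 512):
    ps = np.arange(1, 4 * n).astype(float)   # brute-force search over p
    p_opt = int(ps[np.argmin(T(n, ps))])
    print(f"n={n:4d}  p_opt={p_opt:4d}  n/p_opt={n/p_opt:.2f}")
```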
19. 3D Stencil Computation Illustration: Rich local network, tree-based global reductions
- total wall-clock time per iteration: T(n,p) = A n^3/p^3 + B n^2/p^2 + C log p
- for optimal p, dT/dp = 0, or -3A n^3/p^4 - 2B n^2/p^3 + C/p = 0
- or (with ρ ≡ n/p): 3A ρ^3 + 2B ρ^2 = C
- without speeddown, p can grow with n
- in the limit as n → ∞, ρ tends to a constant, so the optimal p grows linearly with n

20. 3D Stencil Computation Illustration: Rich local network, tree-based global reductions
- optimal running time: T(n, p_opt) = A ρ^3 + B ρ^2 + C log(n/ρ)
- where ρ is the root of 3A ρ^3 + 2B ρ^2 = C
- in the limit of infinite neighbor bandwidth and zero neighbor latency (B = 0): ρ = (C/3A)^(1/3), and the optimal time grows only logarithmically in n
- (This analysis is on a per-iteration basis; a fuller analysis would multiply this cost by an iteration count estimate that generally depends on n and p.)
21. Summary for Various Networks
- With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware:
- optimal number of processors scales linearly with problem size
- With 3D torus-based global reductions and scalable nearest-neighbor hardware:
- optimal number of processors scales as the three-fourths power of problem size (almost scalable)
- With a common network bus (heavy contention):
- optimal number of processors scales as the one-fourth power of problem size (not scalable)
- bad news for conventional Beowulf clusters, but see the 2000 Bell Prize price-performance awards
22. 1999 Bell Prize Parallel Scaling Results on ASCI Red
ONERA M6 Wing test case, tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red nodes (333 MHz Pentium Pro processors)

23. Surface Visualization of Test Domain for Computing Flow over an ONERA M6 Wing

24. Transonic Lambda Shock Solution

25. Fixed-size Parallel Scaling Results (Flop/s)

26. Fixed-size Parallel Scaling Results (Time in seconds)
27. Algorithm: Newton-Krylov-Schwarz
- Newton: nonlinear solver, asymptotically quadratic
- Krylov: accelerator, spectrally adaptive
- Schwarz: preconditioner, parallelizable
- (a toy numerical sketch of this composition follows below)
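A self-contained toy of the NKS composition (the 1D Bratu test problem and all sizes here are illustrative, not from the talk): Newton on the outside, preconditioned conjugate gradients as the Krylov accelerator inside, and exact solves on non-overlapping subdomain blocks as a minimal additive Schwarz preconditioner.

```python
import numpy as np

n, lam, h = 64, 1.0, 1.0 / 65              # 1D Bratu: -u'' = lam e^u, u(0)=u(1)=0

def F(u):                                   # nonlinear residual
    up = np.concatenate(([0.0], u, [0.0]))
    return (2 * u - up[:-2] - up[2:]) - h * h * lam * np.exp(u)

def J(u):                                   # tridiagonal Jacobian (dense for clarity)
    A = np.diag(2.0 - h * h * lam * np.exp(u))
    A -= np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return A

def schwarz(A, r, nb=4):                    # additive Schwarz preconditioner:
    z = np.zeros_like(r)                    # exact solves on non-overlapping blocks
    for lo in range(0, n, n // nb):
        hi = lo + n // nb
        z[lo:hi] = np.linalg.solve(A[lo:hi, lo:hi], r[lo:hi])
    return z

def pcg(A, b, tol=1e-12):                   # Krylov accelerator (CG; this Jacobian
    x = np.zeros_like(b)                    # is symmetric positive definite)
    r = b.copy(); z = schwarz(A, r); p = z.copy(); rz = r @ z
    while np.linalg.norm(r) > tol * np.linalg.norm(b):
        q = A @ p; alpha = rz / (p @ q)
        x += alpha * p; r -= alpha * q
        z = schwarz(A, r); rz_new = r @ z
        p = z + (rz_new / rz) * p; rz = rz_new
    return x

u = np.zeros(n)
for it in range(20):                        # Newton outer loop
    f = F(u)
    if np.linalg.norm(f) < 1e-10:
        break
    u -= pcg(J(u), f)                       # inner solve of J du = F
print(f"{it} Newton steps, ||F|| = {np.linalg.norm(F(u)):.2e}")
```

At scale, the production NKS solvers of the talk use the same composition, with approximate (e.g., ILU) solves on overlapping subdomains in place of the exact block solves used here.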
28. Fixed-size Scaling Results for W-cycle (Time in seconds, courtesy of D. Mavriplis)
- ASCI runs for grid of 3.1M vertices; T3E runs for grid of 24.7M vertices
29. Source 2: More Efficient Use of Faster Processors
- Current low efficiencies of sparse codes can be improved if regularity of reference is exploited with memory-assist features
- PDEs have a periodic workingset structure that permits effective use of prefetch/dispatch directives, and lots of slackness
- Combined with processors-in-memory (PIM) technology for gather/scatter into densely used block transfers, and with multithreading, PDEs can approach full utilization of processor cycles
- Caveat: high bandwidth is critical, since PDE algorithms do only O(N) work for O(N) gridpoints' worth of loads and stores
- One to two orders of magnitude can be gained by catching up to the clock, and by following the clock into the few-GHz range
30. Following the Clock
- 1999 predictions from the Semiconductor Industry Association: http://public.itrs.net/files/1999_SIA_Roadmap/Home.htm
- A factor of 2-3 can be expected by 2007 by following the clock alone
31. Example of Multithreading
- Same ONERA M6 wing Euler code simulation on ASCI Red
- ASCI Red contains two processors per node, sharing memory
- The second processor can be used either in message-passing mode with its own subdomain, or in multithreaded shared-memory mode, which does not require the number of subdomain partitions to double
- The latter is much more effective in the flux evaluation phase, as shown by cumulative execution time (here, memory bandwidth is not an issue)
32. PDE Workingsets
- Smallest: data for a single stencil, Ns (Nc^2 + Nc + Na) (sharp)
- Largest: data for an entire subdomain, (Nx/P) Ns (Nc^2 + Nc + Na) (sharp)
- Intermediate: data for a neighborhood collection of stencils, reused as much as possible
33. Cache Traffic for PDEs
- As successive workingsets drop into a level of memory, capacity (and, with effort, conflict) misses disappear, leaving only compulsory misses, reducing demand on main memory bandwidth
34. Strategies Based on Workingset Structure
- No performance value in memory levels larger than the subdomain
- Little performance value in memory levels smaller than the subdomain but larger than required to permit full reuse of most data within each subdomain subtraversal
- After providing an L1 large enough for the smallest workingset (and multiple independent copies to accommodate the desired level of multithreading), all additional resources should be invested in a large L2
- Tables describing grid connectivity are built (after each grid rebalancing) and stored in PIM; they are used to pack/unpack densely used cache lines during subdomain traversal (a sketch follows below)
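A sketch of the table idea in NumPy terms (the layout and names are hypothetical): the indirection is computed once per rebalancing, after which the traversal streams a dense, unit-stride buffer.

```python
import numpy as np

nv, ne, nc = 1000, 7000, 5
edges = np.random.randint(0, nv, (ne, 2))   # grid connectivity

# Built once per rebalancing (conceptually, by the PIM): a flat gather table
# mapping each slot of the dense traversal buffer to a vertex.
gather = edges.reshape(-1)

u = np.random.rand(nv, nc)                  # vertex data in memory
packed = u[gather]                          # pack into a dense (2*ne, nc) buffer;
                                            # the flux kernel then reads it with
                                            # no indirect addressing
```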
35. Costs of Greater Per-processor Efficiency
- Programming complexity of managing subdomain traversal
- Space to store gather/scatter tables in PIM
- Time to (re)build gather/scatter tables in PIM
- Memory bandwidth commensurate with the peak rates of all processors
36. Source 3: More Architecture-Friendly Algorithms
- Algorithmic practice needs to catch up to architectural demands
- several one-time gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness
- "One-time" refers to improvements by small constant factors, nothing that scales in N or P; complexities are already near information-theoretic lower bounds, and we reject increases in flop rates that derive from less efficient algorithms
- Caveat: remaining algorithmic performance improvements may cost extra space or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable
- Perhaps an order of magnitude of performance remains here
37. Raw Performance Improvement from Algorithms
- Spatial reorderings that improve locality:
- interlacing of all related grid-based data structures (illustrated below)
- ordering gridpoints and grid edges for L1/L2 reuse
- Discretizations that improve locality:
- higher-order methods (lead to larger, denser blocks at each point than lower-order methods)
- vertex-centering (for the same tetrahedral grid, leads to denser blockrows than cell-centering)
- Temporal reorderings that improve locality:
- block vector algorithms (reuse cached matrix blocks; vectors in the block are independent)
- multi-step vector algorithms (reuse cached vector blocks; vectors have sequential dependence)
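The interlacing item in NumPy terms (field names are illustrative): one array indexed [point, component] replaces Nc separate arrays, so a stencil update touches one unit-stride stream instead of Nc distant ones.

```python
import numpy as np

nx, nc = 100000, 5
fields = [np.random.rand(nx) for _ in range(nc)]  # separate-array layout:
                                                  # nc strided streams per point

u = np.stack(fields, axis=1)                      # interlaced layout, shape (nx, nc)
assert np.all(u[7] == [f[7] for f in fields])     # all components of point 7 are
                                                  # now contiguous in memory
```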
38. Raw Performance Improvement from Algorithms
- Temporal reorderings that reduce synchronization penalty:
- less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection)
- less global methods that reduce synchronization range by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz)
- Precision reductions that make bandwidth seem larger:
- lower-precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed on full-precision extensions)
39. Improvements Resulting from Locality Reordering

40. Improvements from Blocking Vectors
- Same ONERA M6 Euler simulation, on SGI Origin
- One vector represents standard GMRES acceleration
- Four vectors is a blocked Krylov method, not yet in the production version
- Savings arise from not reloading the matrix elements of the Jacobian for each new vector (four-fold increase in matrix element use per load); a miniature of the mechanism follows below
- The flop/s rate is effectively tripled; however, can the extra vectors be used efficiently from a numerical viewpoint?
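The blocking mechanism in miniature (sizes illustrative): one pass over the sparse matrix entries serves all vectors in the block, so each loaded matrix element is used four times instead of once.

```python
import numpy as np

n, nnz, k = 100000, 1000000, 4
rows = np.random.randint(0, n, nnz)         # toy sparse Jacobian in COO form
cols = np.random.randint(0, n, nnz)
vals = np.random.rand(nnz)
V = np.random.rand(n, k)                    # block of k Krylov vectors

Y = np.zeros((n, k))                        # one traversal computes all k
np.add.at(Y, rows, vals[:, None] * V[cols]) # products: each matrix entry is
                                            # loaded once and applied k times
```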
41. Improvements from Reduced Precision
- Same ONERA M6 Euler simulation, on SGI Origin
- Standard (middle column) is double precision in all floating-point quantities
- Optimization (right column) is to store the preconditioner for the Jacobian matrix in single precision only, promoting to double before use in the processor (sketched below)
- Bandwidth and matrix cache capacities are effectively doubled, with no deterioration in numerical properties
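A dtype-level sketch of this optimization (the explicit inverse stands in for whatever approximate preconditioner the solver holds):

```python
import numpy as np

n = 500
J = np.random.rand(n, n) + n * np.eye(n)    # stand-in Jacobian
M = np.linalg.inv(J).astype(np.float32)     # preconditioner stored in single
                                            # precision: half the memory traffic,
                                            # twice the effective cache capacity
r = np.random.rand(n)                       # residual stays in double

z = M.astype(np.float64) @ r                # promote at point of use: arithmetic
                                            # remains full precision, and since M
                                            # is only approximate anyway, the
                                            # truncation costs nothing numerically
```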
42. Source 4: Algorithms Packing More Science Per Flop
- Some algorithmic improvements do not improve the flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity)
- Caveat: such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio
- Desirable that petaflop/s machines be general-purpose enough to run the best algorithms
- Not daunting, conceptually, but puts an enormous premium on dynamic load balancing
- An order of magnitude or more can be gained here for many problems
43. Examples of Adaptive Opportunities
- Spatial discretization-based adaptivity:
- change discretization type and order to attain the required approximation to the continuum everywhere, without over-resolving in smooth, easily approximated regions
- Fidelity-based adaptivity:
- change the continuous formulation to accommodate required phenomena everywhere, without enriching in regions where nothing happens
- Stiffness-based adaptivity:
- change the solution algorithm to provide more powerful, robust techniques in regions of space-time where the discrete problem is linearly or nonlinearly stiff, without extra work in nonstiff, locally well-conditioned regions
44. Experimental Example of Opportunity for Advanced Adaptivity
- Driven cavity: Newton's method (left) versus new Additive Schwarz Preconditioned Inexact Newton (ASPIN) nonlinear preconditioning (right)
45. Status and Prospects for Advanced Adaptivity
- Metrics and procedures are well developed in only a few areas
- method-of-lines ODEs for stiff IBVPs and DAEs, FEA for elliptic BVPs
- Multi-model methods used in ad hoc ways in production
- Boeing TRANAIR code
- Poly-algorithmic solvers demonstrated in principle, but rarely in the hostile environment of high-performance computing
- Requirements for progress:
- management of hierarchical levels of synchronization
- user specification of hierarchical priorities of different threads
46. Summary of Suggestions for PDE Petaflops
- Algorithms that deliver more science per flop
- possibly a large problem-dependent factor, through adaptivity (but we won't count this towards rate improvement)
- Algorithmic variants that are more architecture-friendly
- expect half an order of magnitude, through improved locality and relaxed synchronization
- More efficient use of processor cycles, and faster processor/memory
- expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading
- Expanded number of processors
- expect two orders of magnitude, through dynamic balancing and extreme care in implementation
47. Reminder about the Source of PDEs
- Computational engineering is not about individual large-scale analyses, done fast and thrown over the wall
- Both results and their sensitivities are desired; often multiple operation points to be simulated are known a priori, rather than sequentially
- Sensitivities may be fed back into an optimization process
- Full PDE analyses may also be inner iterations in a multidisciplinary computation
- In such contexts, petaflop/s may mean 1,000 analyses running somewhat asynchronously with respect to each other, each at 1 Tflop/s: clearly a less daunting challenge, and one that has better synchronization properties for exploiting "The Grid", than 1 analysis running at 1 Pflop/s
48. Summary Recommendations for Architects
- Support rich (mesh-like) interprocessor connectivity and fast global reductions
- Allow disabling of expensive interprocessor cache coherence protocols for user-tagged data
- Support fast message-passing protocols between processors that physically share memory, for legacy MP applications
- Supply sufficient memory system bandwidth per processor (at least one word per clock per scalar unit)
- Give the user optional control of L2 cache traffic through directives
- Develop at least gather/scatter processor-in-memory capability
- Support a variety of precisions in blocked transfers, and fast precision conversions
49. Recommendations for New Benchmarks
- The recently introduced sPPM benchmark fills a void for memory-system-realistic full-application PDE performance, but is explicit, structured, and relatively high-order
- A similar full-application benchmark is needed for implicit, unstructured, low-order PDE solvers
- Reflecting the hierarchical, distributed memory layout of high-end computers, this benchmark would have two aspects:
- uniprocessor ("vertical") memory system performance: a suite of problems of various grid sizes and multicomponent sizes, with different interlacings and edge orderings
- parallel ("horizontal") network performance: problems of various subdomain sizes and synchronization frequencies
50. Bibliography
- "High Performance Parallel CFD", Gropp, Kaushik, Keyes & Smith, Parallel Computing (to appear, 2001)
- "Toward Realistic Performance Bounds for Implicit CFD Codes", Gropp, Kaushik, Keyes & Smith, in Proceedings of Parallel CFD'99, Elsevier, 1999
- "Prospects for CFD on Petaflops Systems", Keyes, Kaushik & Smith, in Parallel Solution of Partial Differential Equations, Springer, 1999, pp. 247-278
- "Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids", Kaushik, Keyes & Smith, in Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods, Domain Decomposition Press, 1998, pp. 513-520
- "How Scalable is Domain Decomposition in Practice?", Keyes, in Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods, Domain Decomposition Press, 1998, pp. 286-297
- "On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code", Kaushik, Keyes & Smith, in Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods, AMS, 1998, pp. 311-319
51. Related URLs
- Follow-up on this talk: http://www.mcs.anl.gov/petsc-fun3d
- ASCI platforms: http://www.llnl.gov/asci/platforms
- International Conferences on Domain Decomposition Methods: http://www.ddm.org
- SIAM Conferences on Parallel Processing: http://www.siam.org (Norfolk, USA, 12-14 Mar 2001)
- International Conferences on Parallel CFD: http://www.parcfd.org
52. Acknowledgments
- Collaborators: Dinesh Kaushik (ODU), Kyle Anderson (NASA), and the PETSc team at ANL: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
- Sponsors: U.S. DOE, ICASE, NASA, NSF
- Computer resources: DOE, SGI-Cray
- Inspiration: Shahid Bokhari (ICASE), Xiao-Chuan Cai (CU-Boulder), Rob Falgout (LLNL), Paul Fischer (ANL), Kyle Gallivan (FSU), Liz Jessup (CU-Boulder), Michael Kern (INRIA), Dimitri Mavriplis (ICASE), Alex Pothen (ODU), Uli Ruede (Univ. Erlangen), John Salmon (Caltech), Linda Stals (ODU), Bob Voigt (DOE), David Young (Boeing), Paul Woodward (UMinn)