
The Next Four Orders of Magnitude in Parallel PDE Simulation Performance

http://www.math.odu.edu/keyes/talks.html

- David E. Keyes
- Department of Mathematics and Statistics, Old Dominion University
- Institute for Scientific Computing Research, Lawrence Livermore National Laboratory
- Institute for Computer Applications in Science and Engineering, NASA Langley Research Center

Background of this Presentation

- Originally prepared for the Petaflops II Conference
- History of the Petaflops Initiative in the USA
  - Enabling Technologies for Petaflops Computing workshop, Feb 1994
  - book by Sterling, Messina, and Smith, MIT Press, 1995
  - Applications Workshop, Aug 1994
  - Architectures Workshop, Apr 1995
  - Systems Software Workshop, Jun 1996
  - Algorithms Workshop, Apr 1997
  - Systems Operations Workshop Review, Jun 1998
  - Enabling Technologies for Petaflops II, Feb 1999
  - Topics in Ultra-scale Computing, book by Sterling et al., MIT Press, 2001 (to appear)

Weighing in at the Bottom Line

- Characterization of a 1 Teraflop/s computer of today
  - about 1,000 processors of 1 Gflop/s (peak) each
  - due to inefficiencies within the processors, more practically characterized as about 4,000 processors of 250 Mflop/s each
- How do we want to get to 1 Petaflop/s by 2007 (original goal)?
  - 1,000,000 processors of 1 Gflop/s each (only wider)?
  - 10,000 processors of 100 Gflop/s each (mainly deeper)?
- From the point of view of PDE simulations on quasi-static Eulerian grids
  - either!
- Caveat: dynamic grid simulations are not covered in this talk
  - but see work at Bonn, Erlangen, Heidelberg, LLNL, and ODU presented elsewhere

Perspective

- Many Grand Challenges in computational science are formulated as PDEs (possibly among alternative formulations)
- However, PDE simulations historically have not performed as well as other scientific simulations
- PDE simulations require a balance among architectural components that is not necessarily met in a machine designed to max out on the standard LINPACK benchmark
- The justification for building petaflop/s architectures undoubtedly will (and should) include PDE applications
- However, cost-effective use of petaflop/s on PDEs requires further attention to architectural and algorithmic matters
- A memory-centric view of computation needs further promotion

Application Performance History: 3 orders of magnitude in 10 years, better than Moore's Law

Bell Prize Performance History

Plan of Presentation

- General characterization of PDE requirements
- Four sources of performance improvement to get from the current 100's of Gflop/s (for PDEs) to 1 Pflop/s
- Each illustrated with examples from computational aerodynamics
  - offered as typical of real workloads (nonlinear, unstructured, multicomponent, multiscale, etc.)
- Performance presented
  - on up to thousands of processors of T3E and ASCI Red (for parallel aspects)
  - on numerous uniprocessors (for memory hierarchy aspects)

Purpose of Presentation

- Not to argue for specific algorithms/programming models/codes in any detail
  - but see talks on Newton-Krylov-Schwarz (NKS) methods under the homepage above
- Provide a requirements target for designers of today's systems
  - typical of several contemporary successfully parallelized PDE applications
  - not comprehensive of all important large-scale applications
- Speculate on a requirements target for designers of tomorrow's systems
- Promote attention to current architectural weaknesses, relative to the requirements of PDEs

Four Sources of Performance Improvement

- Expanded number of processors
  - arbitrarily large factor, through extremely careful attention to load balancing and synchronization
- More efficient use of processor cycles, and faster processor/memory elements
  - one to two orders of magnitude, through memory-assist language features, processors-in-memory, and multithreading
- Algorithmic variants that are more architecture-friendly
  - approximately an order of magnitude, through improved locality and relaxed synchronization
- Algorithms that deliver more science per flop
  - possibly a large problem-dependent factor, through adaptivity
  - this last does not contribute to raw flop/s!

PDE Varieties and Complexities

- Varieties of PDEs
  - evolution (time hyperbolic, time parabolic)
  - equilibrium (elliptic, spatially hyperbolic or parabolic)
  - mixed, varying by region
  - mixed, of multiple type (e.g., parabolic with elliptic constraint)
- Complexity parameterized by
  - spatial grid points, Nx
  - temporal grid points, Nt
  - components per point, Nc
  - auxiliary storage per point, Na
  - grid points in stencil, Ns
- Memory: M ∝ Nx · (Nc + Na + Nc² · Ns)
- Work: W ∝ Nx · Nt · (Nc + Na + Nc² · Ns) (a sketch of these estimates follows below)
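
A minimal sketch of these storage and work estimates in code (the sample values are illustrative assumptions, not figures from the talk):

```python
# Minimal sketch of the slide's complexity estimates:
#   M ~ Nx*(Nc + Na + Nc**2*Ns)  and  W ~ Nx*Nt*(Nc + Na + Nc**2*Ns).
# The sample values below are illustrative assumptions, not data from the talk.

def pde_memory_and_work(Nx, Nt, Nc, Na, Ns):
    per_point = Nc + Na + Nc**2 * Ns   # state + auxiliaries + Jacobian block row
    M = Nx * per_point                 # words of storage, up to a constant
    W = Nx * Nt * per_point            # flops, up to a constant
    return M, W

# e.g., a 100^3 grid, 5 components (3D Euler), 7-point stencil, 100 time steps
M, W = pde_memory_and_work(Nx=100**3, Nt=100, Nc=5, Na=10, Ns=7)
print(f"memory ~ {M:.2e} words, work ~ {W:.2e} flops (modulo constants)")
```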

Resource Scaling for PDEs

- For 3D problems, (Memory) ∝ (Work)^(3/4)
  - for equilibrium problems, work scales with problem size × number of iteration steps; the iteration count of reasonable implicit methods is proportional to the resolution in a single spatial dimension
  - for evolutionary problems, work scales with problem size × number of time steps; CFL-type arguments place the latter on the order of the resolution in a single spatial dimension
- The proportionality constant can be adjusted over a very wide range, both by discretization and by algorithmic tuning
- If frequent time frames are to be captured, disk capacity and I/O rates must both scale linearly with work
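
A quick numerical check of the 3/4-power relation, under the slide's assumption that the step count scales like the one-dimensional resolution n (so N = n³ points and W ∝ n⁴):

```python
# Check: with N = n**3 grid points and Nt ~ n steps (CFL-like),
# memory M ~ n**3 and work W ~ n**4, so M ~ W**(3/4).
for n in (64, 128, 256, 512):
    M = n**3            # words, modulo constants
    W = n**4            # flops, modulo constants
    print(f"n={n:4d}  M={M:.2e}  W^(3/4)={W**0.75:.2e}  ratio={M / W**0.75:.3f}")
# The ratio is exactly 1 here only because the constants were dropped.
```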

Typical PDE Tasks

- Vertex-based loops
  - state vector and auxiliary vector updates
- Edge-based stencil op loops
  - residual evaluation, approximate Jacobian evaluation
  - Jacobian-vector product (often replaced with a matrix-free form, involving residual evaluation)
  - intergrid transfer (coarse/fine)
- Sparse, narrow-band recurrences
  - approximate factorization and back substitution
  - smoothing
- Vector inner products and norms
  - orthogonalization/conjugation
  - convergence progress and stability checks

Edge-based Loop

- Vertex-centered tetrahedral grid
- Traverse by edges
  - load vertex values
  - compute intensively
  - store contributions to flux at vertices
- Each vertex appears in approximately 15 flux computations (a sketch of the loop follows below)
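
A minimal sketch of such an edge-based flux loop (the data layout and flux function are hypothetical stand-ins; the production code behind the results quoted later is PETSc-FUN3D):

```python
# Edge-based flux accumulation on an unstructured grid: load the two
# endpoint states of each edge, do a flux computation, and scatter the
# result back to both vertices. Data layout here is a hypothetical toy.
import numpy as np

nv, ne = 1000, 7500                          # vertices, edges (toy sizes)
rng = np.random.default_rng(0)
edges = rng.integers(0, nv, size=(ne, 2))    # edge -> (vertex i, vertex j)
u = rng.random((nv, 5))                      # 5-component state per vertex
residual = np.zeros_like(u)

def numerical_flux(ui, uj):
    # stand-in for an intensive Riemann/upwind flux evaluation
    return 0.5 * (ui + uj) - 0.5 * np.abs(ui - uj)

for i, j in edges:                           # traverse by edges
    f = numerical_flux(u[i], u[j])           # compute intensively
    residual[i] += f                         # store contributions at
    residual[j] -= f                         # ...both endpoint vertices
```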

Explicit PDE Solvers

- Concurrency is pointwise, O(N)
- Comm.-to-comp. ratio is surface-to-volume, O((N/P)^(-1/3))
- Communication range is nearest-neighbor, except for time-step computation
- Synchronization frequency is once per step, O((N/P)^(-1))
- Storage per point is low
- Load balance is straightforward for static quasi-uniform grids
- Grid adaptivity (together with the temporal stability limitation) makes load balance nontrivial

Domain-decomposed Implicit PDE Solvers

- Concurrency is pointwise, O(N), or subdomainwise, O(P)
- Comm.-to-comp. ratio is still mainly surface-to-volume, O((N/P)^(-1/3))
- Communication is still mainly nearest-neighbor, but nonlocal communication arises from conjugation, norms, and coarse grid problems
- Synchronization frequency is often more than once per grid-sweep, up to the Krylov dimension, O(K·(N/P)^(-1))
- Storage per point is higher, by a factor of O(K)
- Load balance issues are the same as for explicit solvers

Source 1: Expanded Number of Processors

- Amdahl's law can be defeated if serial sections make up a nonincreasing fraction of total work as problem size and processor count scale up together; this is true for most explicit or iterative implicit PDE solvers
  - popularized in the 1986 Karp Prize paper by Gustafson et al. (see the sketch after this list)
- Simple, back-of-envelope parallel complexity analyses show that the number of processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced
- Caveat: the processor network must also be scalable (this applies to protocols as well as to hardware)
- The remaining four orders of magnitude could be met by hardware expansion alone (but this does not mean that fixed-size applications of today would run 10^4 times faster)
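
A minimal illustration of the scaled-speedup argument popularized by Gustafson, contrasted with fixed-size (Amdahl) speedup; the serial fraction s is an assumed figure:

```python
# Fixed-size (Amdahl) vs. scaled (Gustafson) speedup for a code whose
# serial fraction s refers to the single-processor workload.
s = 0.01                                   # assumed serial fraction

for P in (10, 100, 1000, 10000):
    amdahl = 1.0 / (s + (1.0 - s) / P)     # problem size fixed as P grows
    gustafson = s + (1.0 - s) * P          # problem size grown with P
    print(f"P={P:6d}  fixed-size speedup={amdahl:8.1f}  scaled speedup={gustafson:10.1f}")
# Amdahl saturates near 1/s = 100; the scaled speedup keeps growing,
# which is the regime in which most explicit/iterative PDE solvers live.
```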

Back-of-Envelope Scalability Demonstration for Bulk-synchronized PDE Computations

- Given complexity estimates of the leading terms of
  - the concurrent computation (per iteration phase)
  - the concurrent communication
  - the synchronization frequency
- And a model of the architecture, including
  - internode communication (network topology and protocol, reflecting horizontal memory structure)
  - on-node computation (effective performance parameters, including vertical memory structure)
- One can estimate optimal concurrency and execution time
  - on a per-iteration basis, or overall (by taking into account any granularity-dependent convergence rate)
  - simply differentiate the time estimate, in terms of (N,P), with respect to P, equate to zero, and solve for P in terms of N

3D Stencil Costs (per Iteration)

- grid points in each direction: n; total work: N = O(n³)
- processors in each direction: p; total processors: P = O(p³)
- memory per node requirements: O(N/P)
- execution time per iteration: A·n³/p³
- grid points on a side of each processor subdomain: n/p
- neighbor communication per iteration: B·n²/p²
- cost of global reductions in each iteration: C·log p or C·p^(1/d)
  - C includes synchronization frequency
- same dimensionless units for measuring A, B, C
  - e.g., cost of a scalar floating point multiply-add
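
The differentiate-and-solve recipe from the previous slide can be carried out mechanically. A sketch with sympy, using illustrative (not measured) values for A, B, and C:

```python
# Optimal concurrency for the per-iteration cost model
#   T(n, p) = A*n**3/p**3 + B*n**2/p**2 + C*log(p)
# (rich local network, tree-based global reductions).
# A, B, C below are illustrative assumptions, not measured values.
import sympy as sp

n, p = sp.symbols('n p', positive=True)
A, B, C = 1.0, 5.0, 50.0

T = A * n**3 / p**3 + B * n**2 / p**2 + C * sp.log(p)
dTdp = sp.diff(T, p)                     # differentiate w.r.t. p ...

for nv in (100, 1000, 10000):
    # ... equate to zero and solve for p (bracketed bisection for robustness)
    p_opt = sp.nsolve(dTdp.subs(n, nv), p, (0.01 * nv, nv), solver='bisect')
    print(f"n={nv:6d}  p_opt={float(p_opt):9.1f}  p_opt/n={float(p_opt)/nv:.3f}")
# p_opt/n is the constant root x of C*x**3 = 3*A + 2*B*x (about 0.555 here):
# optimal p grows linearly with n, so optimal P = p**3 grows linearly with
# N = n**3. With B -> 0 the constant reduces to (3*A/C)**(1/3) ~ 0.39.
```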

3D Stencil Computation Illustration: Rich local network, tree-based global reductions

- total wall-clock time per iteration: T(n,p) = A·n³/p³ + B·n²/p² + C·log p
- for optimal p, ∂T/∂p = 0, or 3A·n³/p⁴ + 2B·n²/p³ = C/p
- or (with x = p/n), C·x³ = 3A + 2B·x, so the optimal x is independent of n
- without speeddown, p can grow with n
- in the limit as B → 0, p_opt → (3A/C)^(1/3)·n

3D Stencil Computation Illustration: Rich local network, tree-based global reductions (continued)

- optimal running time: T(n, p_opt) = A/x³ + B/x² + C·log(x·n), where x = p_opt/n is the constant root of C·x³ = 3A + 2B·x
- in the limit of infinite neighbor bandwidth and zero neighbor latency (B → 0): T_opt → C·(1/3 + (1/3)·log(3A/C) + log n), growing only logarithmically in n
- (This analysis is on a per-iteration basis; a fuller analysis would multiply this cost by an iteration count estimate that generally depends on n and p.)

Summary for Various Networks

- With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware
  - optimal number of processors scales linearly with problem size
- With 3D torus-based global reductions and scalable nearest-neighbor hardware
  - optimal number of processors scales as the three-fourths power of problem size (almost scalable)
- With a common network bus (heavy contention)
  - optimal number of processors scales as the one-fourth power of problem size (not scalable)
  - bad news for conventional Beowulf clusters, but see the 2000 Bell Prize price-performance awards
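
These exponents can be recovered numerically from per-iteration cost models posed in total problem size N and processor count P. The models below, especially the serialization of neighbor exchanges over the bus, are my assumptions chosen to be consistent with the stated conclusions, not formulas from the talk:

```python
# Fit the scaling exponent alpha in P_opt ~ N**alpha for three networks.
# Cost models (assumed): compute A*N/P; neighbor exchange B*(N/P)**(2/3),
# serialized across the bus (times P) in the bus case; reductions C*g(P).
import numpy as np
from scipy.optimize import minimize_scalar

A, B, C = 1.0, 5.0, 50.0   # illustrative machine parameters

models = {
    "tree":  lambda N, P: A*N/P + B*(N/P)**(2/3) + C*np.log(P),
    "torus": lambda N, P: A*N/P + B*(N/P)**(2/3) + C*P**(1/3),
    "bus":   lambda N, P: A*N/P + B*(N/P)**(2/3)*P + C*P,
}

for name, T in models.items():
    sizes = np.array([1e6, 1e9])
    lps = [minimize_scalar(lambda lp: T(N, np.exp(lp)),
                           bounds=(0.0, np.log(N)), method='bounded').x
           for N in sizes]                  # optimize time over log(P)
    alpha = (lps[1] - lps[0]) / np.log(sizes[1] / sizes[0])
    print(f"{name:5s}: P_opt ~ N**{alpha:.2f}")
# Asymptotically: tree ~ N**1.0, torus ~ N**0.75, bus ~ N**0.25;
# finite-size fits land near these values.
```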

1999 Bell Prize Parallel Scaling Results on ASCI Red: ONERA M6 Wing Test Case, tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red nodes (333 MHz Pentium Pro processors)

Surface Visualization of Test Domain for Computing Flow over an ONERA M6 Wing

Transonic Lambda Shock Solution

Fixed-size Parallel Scaling Results (Flop/s)

Fixed-size Parallel Scaling Results (Time in seconds)

Algorithm: Newton-Krylov-Schwarz

- Newton: nonlinear solver, asymptotically quadratic
- Krylov: accelerator, spectrally adaptive
- Schwarz: preconditioner, parallelizable
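
For concreteness, a minimal serial sketch of the NKS composition on a 1D model problem: a Newton outer loop, GMRES as the Krylov accelerator, and a one-level additive Schwarz (here non-overlapping block-Jacobi) preconditioner. This toy is not the talk's production PETSc-FUN3D implementation:

```python
# Newton-Krylov-Schwarz sketch on a 1D model problem: -u'' + u^3 = f.
# Toy, serial illustration only; the subdomain solves below play the role
# of the "local processes" of a one-level additive Schwarz method.
import numpy as np
import scipy.sparse as sparse
import scipy.sparse.linalg as spla

N, nsub = 256, 8                         # grid points, Schwarz subdomains
h = 1.0 / (N + 1)
f = np.ones(N)

def F(u):                                # nonlinear residual (Dirichlet BCs)
    Au = 2.0 * u
    Au[:-1] -= u[1:]
    Au[1:] -= u[:-1]
    return Au / h**2 + u**3 - f

def J(u):                                # Jacobian: tridiagonal + 3u^2 diagonal
    main = 2.0 / h**2 + 3.0 * u**2
    off = -np.ones(N - 1) / h**2
    return sparse.diags([off, main, off], [-1, 0, 1], format='csr')

u = np.zeros(N)
for it in range(10):                     # Newton: asymptotically quadratic
    r = F(u)
    print(f"Newton iteration {it}: ||F|| = {np.linalg.norm(r):.3e}")
    if np.linalg.norm(r) < 1e-10:
        break
    Jk = J(u)
    subs = np.array_split(np.arange(N), nsub)
    lus = [spla.splu(Jk[s][:, s].tocsc()) for s in subs]  # factor each block

    def schwarz(v, subs=subs, lus=lus):  # M^-1 v = sum_i R_i^T J_i^-1 R_i v
        w = np.zeros_like(v)
        for s, lu in zip(subs, lus):
            w[s] = lu.solve(v[s])
        return w

    M = spla.LinearOperator((N, N), matvec=schwarz)
    du, info = spla.gmres(Jk, -r, M=M)   # Krylov: preconditioned GMRES
    u += du
```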

Fixed-size Scaling Results for W-cycle (Time in seconds, courtesy of D. Mavriplis)

ASCI runs for a grid of 3.1M vertices; T3E runs for a grid of 24.7M vertices

Source 2: More Efficient Use of Faster Processors

- Current low efficiencies of sparse codes can be improved if regularity of reference is exploited with memory-assist features
- PDEs have a periodic workingset structure that permits effective use of prefetch/dispatch directives, and lots of slackness
- Combined with processors-in-memory (PIM) technology for gather/scatter into densely used block transfers, and with multithreading, PDEs can approach full utilization of processor cycles
- Caveat: high bandwidth is critical, since PDE algorithms do only O(N) work for O(N) gridpoints' worth of loads and stores
- One to two orders of magnitude can be gained by catching up to the clock, and by following the clock into the few-GHz range

Following the Clock

- 1999 predictions from the Semiconductor Industry Association: http://public.itrs.net/files/1999_SIA_Roadmap/Home.htm
- A factor of 2-3 can be expected by 2007 from following the clock alone

Example of Multithreading

- Same ONERA M6 wing Euler code simulation on ASCI Red
- ASCI Red contains two processors per node, sharing memory
- The second processor can be used either in message-passing mode with its own subdomain, or in multithreaded shared-memory mode, which does not require the number of subdomain partitions to double
- The latter is much more effective in the flux evaluation phase, as shown by cumulative execution time (here, memory bandwidth is not an issue)

PDE Workingsets

- Smallest: data for a single stencil
  - Ns · (Nc² + Nc + Na) (sharp)
- Largest: data for an entire subdomain
  - (Nx/P) · Ns · (Nc² + Nc + Na) (sharp)
- Intermediate: data for a neighborhood collection of stencils, reused as possible
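
A small sketch comparing these workingsets against cache capacities; the grid, partitioning, and component counts are illustrative assumptions:

```python
# Compare the slide's smallest/largest workingsets against cache sizes.
# Example figures (Euler flow: Nc=5, 7-point stencil) are assumptions.
Nx, P = 2_800_000, 128       # grid points, processors (illustrative)
Nc, Na, Ns = 5, 10, 7        # components, auxiliaries, stencil points
word = 8                     # bytes per double

smallest = Ns * (Nc**2 + Nc + Na) * word             # one stencil
largest = (Nx // P) * Ns * (Nc**2 + Nc + Na) * word  # whole subdomain

print(f"single-stencil workingset: {smallest/1024:8.1f} KB (should fit in L1)")
print(f"subdomain workingset:      {largest/2**20:8.1f} MB (target for a large L2)")
```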

Cache Traffic for PDEs

- As successive workingsets drop into a level of memory, capacity misses (and, with effort, conflict misses) disappear, leaving only compulsory misses, reducing demand on main memory bandwidth

Strategies Based on Workingset Structure

- No performance value in memory levels larger than the subdomain
- Little performance value in memory levels smaller than the subdomain but larger than required to permit full reuse of most data within each subdomain subtraversal
- After providing an L1 large enough for the smallest workingset (and multiple independent copies of it, to accommodate the desired level of multithreading), all additional resources should be invested in a large L2
- Tables describing grid connectivity are built (after each grid rebalancing) and stored in PIM, and used to pack/unpack dense-use cache lines during subdomain traversal (see the sketch below)
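
A minimal sketch of the gather/scatter-table idea in ordinary Python, standing in for the PIM-side packing the slide proposes (the batching scheme is a hypothetical illustration):

```python
# Gather/scatter tables for dense-use block transfers: precompute, once
# per rebalancing, the vertex indices touched by each batch of edges, then
# pack those vertices' data into a dense buffer before the compute phase.
# Pure-Python stand-in for the PIM-side packing the slide posits.
import numpy as np

rng = np.random.default_rng(1)
nv, ne, batch = 10_000, 60_000, 512
edges = rng.integers(0, nv, size=(ne, 2))
u = rng.random((nv, 5))                  # 5-component state per vertex

# Build gather tables once (redone only after each grid rebalancing)
tables = []
for b in range(0, ne, batch):
    eb = edges[b:b + batch]
    verts, inv = np.unique(eb.ravel(), return_inverse=True)
    tables.append((verts, inv.reshape(eb.shape)))

for verts, local in tables:              # subdomain traversal
    dense = u[verts]                     # gather: densely used block transfer
    ui, uj = dense[local[:, 0]], dense[local[:, 1]]
    flux = 0.5 * (ui + uj)               # stand-in for flux computation
    out = np.zeros_like(dense)           # scatter-add through the same table
    np.add.at(out, local[:, 0], flux)
    np.add.at(out, local[:, 1], -flux)
    # (results would be accumulated into the global residual here)
```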

Costs of Greater Per-processor Efficiency

- Programming complexity of managing subdomain traversal
- Space to store gather/scatter tables in PIM
- Time to (re)build gather/scatter tables in PIM
- Memory bandwidth commensurate with the peak rates of all processors

Source 3: More Architecture-friendly Algorithms

- Algorithmic practice needs to catch up to architectural demands
  - several one-time gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness
- "One-time" refers to improvements by small constant factors, nothing that scales in N or P; complexities are already near information-theoretic lower bounds, and we reject increases in flop rates that derive from less efficient algorithms
- Caveat: remaining algorithmic performance improvements may cost extra space, or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable
- Perhaps an order of magnitude of performance remains here

Raw Performance Improvement from Algorithms

- Spatial reorderings that improve locality
  - interlacing of all related grid-based data structures
  - ordering gridpoints and grid edges for L1/L2 reuse
- Discretizations that improve locality
  - higher-order methods (lead to larger, denser blocks at each point than lower-order methods)
  - vertex-centering (for the same tetrahedral grid, leads to denser block rows than cell-centering)
- Temporal reorderings that improve locality (see the sketch below)
  - block vector algorithms (reuse cached matrix blocks; vectors in a block are independent)
  - multi-step vector algorithms (reuse cached vector blocks; vectors have sequential dependence)
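
A minimal sketch of the block-vector reuse idea: multiplying several vectors by the same sparse matrix at once streams the matrix through memory once per block instead of once per vector (sizes and timing harness are illustrative):

```python
# Block vector products: A @ V with k columns loads each matrix entry once
# per block instead of once per vector, the reuse behind the four-vector
# GMRES results quoted below. Sizes here are illustrative.
import time
import numpy as np
import scipy.sparse as sparse

n, k = 100_000, 4
A = sparse.random(n, n, density=1e-4, format='csr', random_state=0)
V = np.random.default_rng(0).random((n, k))

t0 = time.perf_counter()
for j in range(k):                 # one vector at a time: reloads A each pass
    _ = A @ V[:, j]
t1 = time.perf_counter()
_ = A @ V                          # blocked: A is streamed through once
t2 = time.perf_counter()

print(f"{k} single matvecs: {t1 - t0:.3f}s, one block matvec: {t2 - t1:.3f}s")
```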

Raw Performance Improvement from Algorithms (continued)

- Temporal reorderings that reduce synchronization penalty
  - less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection)
  - less global methods that reduce synchronization range by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz)
- Precision reductions that make bandwidth seem larger (see the sketch below)
  - lower-precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed on full-precision extensions)
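
A sketch of the precision-reduction trick: keep the preconditioner in single precision, halving its memory traffic and cache footprint, and promote to double at the point of use (the block-diagonal setup is an assumed toy):

```python
# Store preconditioner blocks in float32 (halving memory traffic for them),
# but promote to float64 before applying, so arithmetic stays full precision.
# Toy diagonal-block preconditioner; the setup is assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
nblocks, bs = 1000, 5
# factored preconditioner blocks, stored in single precision
P32 = (rng.random((nblocks, bs, bs)) + 4 * np.eye(bs)).astype(np.float32)
v = rng.random(nblocks * bs)                     # double-precision vector

def apply_prec(v):
    w = np.empty_like(v)
    for i in range(nblocks):
        block = P32[i].astype(np.float64)        # promote before use
        w[i*bs:(i+1)*bs] = np.linalg.solve(block, v[i*bs:(i+1)*bs])
    return w

w = apply_prec(v)
print(f"preconditioner storage: {P32.nbytes/2**20:.2f} MB "
      f"(would be {2*P32.nbytes/2**20:.2f} MB in double)")
```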

Improvements Resulting from Locality Reordering

Improvements from Blocking Vectors

- Same ONERA M6 Euler simulation, on an SGI Origin
- One vector represents standard GMRES acceleration
- Four vectors is a blocked Krylov method, not yet in the production version
- The savings arise from not reloading matrix elements of the Jacobian for each new vector (a four-fold increase in matrix element use per load)
- The flop/s rate is effectively tripled; however, can the extra vectors be used efficiently from a numerical viewpoint?

Improvements from Reduced Precision

- Same ONERA M6 Euler simulation, on an SGI Origin
- Standard (middle column) is double precision in all floating-point quantities
- The optimization (right column) is to store the preconditioner for the Jacobian matrix in single precision only, promoting to double before use in the processor
- Bandwidth and matrix cache capacities are effectively doubled, with no deterioration in numerical properties

Source 4: Algorithms Packing More Science per Flop

- Some algorithmic improvements do not improve the flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity)
- Caveat: such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio
- It is desirable that petaflop/s machines be general-purpose enough to run the best algorithms
- Not daunting, conceptually, but this puts an enormous premium on dynamic load balancing
- An order of magnitude or more can be gained here for many problems

Examples of Adaptive Opportunities

- Spatial discretization-based adaptivity
  - change discretization type and order to attain the required approximation to the continuum everywhere, without over-resolving in smooth, easily approximated regions
- Fidelity-based adaptivity
  - change the continuous formulation to accommodate the required phenomena everywhere, without enriching in regions where nothing happens
- Stiffness-based adaptivity
  - change the solution algorithm to provide more powerful, robust techniques in regions of space-time where the discrete problem is linearly or nonlinearly stiff, without extra work in nonstiff, locally well-conditioned regions

Experimental Example of Opportunity for Advanced Adaptivity

- Driven cavity: Newton's method (left) versus the new Additive Schwarz Preconditioned Inexact Newton (ASPIN) nonlinear preconditioning (right)

Status and Prospects for Advanced Adaptivity

- Metrics and procedures are well developed in only a few areas
  - method-of-lines ODEs for stiff IBVPs and DAEs, FEA for elliptic BVPs
- Multi-model methods are used in ad hoc ways in production
  - Boeing TRANAIR code
- Poly-algorithmic solvers have been demonstrated in principle, but rarely in the hostile environment of high-performance computing
- Requirements for progress
  - management of hierarchical levels of synchronization
  - user specification of hierarchical priorities of different threads

Summary of Suggestions for PDE Petaflops

- Algorithms that deliver more science per flop
  - possibly a large problem-dependent factor, through adaptivity (but we won't count this towards rate improvement)
- Algorithmic variants that are more architecture-friendly
  - expect half an order of magnitude, through improved locality and relaxed synchronization
- More efficient use of processor cycles, and faster processors/memory
  - expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading
- Expanded number of processors
  - expect two orders of magnitude, through dynamic balancing and extreme care in implementation

Reminder about the Source of PDEs

- Computational engineering is not about individual large-scale analyses, done fast and thrown over the wall
- Both results and their sensitivities are desired; often multiple operating points to be simulated are known a priori, rather than sequentially
- Sensitivities may be fed back into an optimization process
- Full PDE analyses may also be inner iterations in a multidisciplinary computation
- In such contexts, "petaflop/s" may mean 1,000 analyses running somewhat asynchronously with respect to each other, each at 1 Tflop/s - clearly a less daunting challenge, and one with better synchronization properties for exploiting "The Grid", than 1 analysis running at 1 Pflop/s

Summary Recommendations for Architects

- Support rich (mesh-like) interprocessor connectivity and fast global reductions
- Allow disabling of expensive interprocessor cache coherence protocols for user-tagged data
- Support fast message-passing protocols between processors that physically share memory, for legacy MP applications
- Supply sufficient memory system bandwidth per processor (at least one word per clock per scalar unit)
- Give the user optional control of L2 cache traffic through directives
- Develop at least gather/scatter processor-in-memory capability
- Support a variety of precisions in blocked transfers, and fast precision conversions

Recommendations for New Benchmarks

- The recently introduced sPPM benchmark fills a void for memory-system-realistic full-application PDE performance, but it is explicit, structured, and relatively high-order
- A similar full-application benchmark is needed for implicit, unstructured, low-order PDE solvers
- Reflecting the hierarchical, distributed memory layout of high-end computers, this benchmark would have two aspects
  - uniprocessor ("vertical") memory system performance: a suite of problems of various grid sizes and multicomponent sizes, with different interlacings and edge orderings
  - parallel ("horizontal") network performance: problems of various subdomain sizes and synchronization frequencies

Bibliography

- High Performance Parallel CFD, Gropp, Kaushik, Keyes, and Smith, Parallel Computing (to appear, 2001)
- Toward Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes, and Smith, 1999, in Proceedings of Parallel CFD'99, Elsevier
- Prospects for CFD on Petaflops Systems, Keyes, Kaushik, and Smith, 1999, in Parallel Solution of Partial Differential Equations, Springer, pp. 247-278
- Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids, Kaushik, Keyes, and Smith, 1998, in Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods, Domain Decomposition Press, pp. 513-520
- How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods, Domain Decomposition Press, pp. 286-297
- On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code, Kaushik, Keyes, and Smith, 1998, in Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods, AMS, pp. 311-319

Related URLs

- Follow-up on this talk: http://www.mcs.anl.gov/petsc-fun3d
- ASCI platforms: http://www.llnl.gov/asci/platforms
- International Conferences on Domain Decomposition Methods: http://www.ddm.org
- SIAM Conferences on Parallel Processing: http://www.siam.org (Norfolk, USA, 12-14 Mar 2001)
- International Conferences on Parallel CFD: http://www.parcfd.org

Acknowledgments

- Collaborators: Dinesh Kaushik (ODU), Kyle Anderson (NASA), and the PETSc team at ANL: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
- Sponsors: U.S. DOE, ICASE, NASA, NSF
- Computer resources: DOE, SGI-Cray
- Inspiration: Shahid Bokhari (ICASE), Xiao-Chuan Cai (CU-Boulder), Rob Falgout (LLNL), Paul Fischer (ANL), Kyle Gallivan (FSU), Liz Jessup (CU-Boulder), Michael Kern (INRIA), Dimitri Mavriplis (ICASE), Alex Pothen (ODU), Uli Ruede (Univ. Erlangen), John Salmon (Caltech), Linda Stals (ODU), Bob Voigt (DOE), David Young (Boeing), Paul Woodward (UMinn)