Title: The Next Four Orders of Magnitude in Parallel PDE Simulation Performance

1. The Next Four Orders of Magnitude in Parallel PDE Simulation Performance
http://www.math.odu.edu/keyes/talks.html
- David E. Keyes
- Department of Mathematics & Statistics, Old Dominion University
- Institute for Scientific Computing Research, Lawrence Livermore National Laboratory
- Institute for Computer Applications in Science and Engineering, NASA Langley Research Center
2. Background of this Presentation
- Originally prepared for the Petaflops II Conference
- History of the Petaflops Initiative in the USA:
- Enabling Technologies for Petaflops Computing, Feb 1994
- book by Sterling, Messina, and Smith, MIT Press, 1995
- Applications Workshop, Aug 1994
- Architectures Workshop, Apr 1995
- Systems Software Workshop, Jun 1996
- Algorithms Workshop, Apr 1997
- Systems Operations Workshop Review, Jun 1998
- Enabling Technologies for Petaflops II, Feb 1999
- Topics in Ultra-scale Computing, book by Sterling et al., MIT Press, 2001 (to appear)
3. Weighing in at the Bottom Line
- Characterization of a 1 Teraflop/s computer of today:
- about 1,000 processors of 1 Gflop/s (peak) each
- due to inefficiencies within the processors, more practically characterized as about 4,000 processors of 250 Mflop/s each
- How do we want to get to 1 Petaflop/s by 2007 (the original goal)?
- 1,000,000 processors of 1 Gflop/s each (only wider)?
- 10,000 processors of 100 Gflop/s each (mainly deeper)?
- From the point of view of PDE simulations on quasi-static Eulerian grids: either!
- Caveat: dynamic grid simulations are not covered in this talk
- but see work at Bonn, Erlangen, Heidelberg, LLNL, and ODU presented elsewhere
4. Perspective
- Many Grand Challenges in computational science are formulated as PDEs (possibly among alternative formulations)
- However, PDE simulations historically have not performed as well as other scientific simulations
- PDE simulations require a balance among architectural components that is not necessarily met in a machine designed to max out on the standard LINPACK benchmark
- The justification for building petaflop/s architectures undoubtedly will (and should) include PDE applications
- However, cost-effective use of petaflop/s on PDEs requires further attention to architectural and algorithmic matters
- The memory-centric view of computation needs further promotion
5. Application Performance History: 3 orders of magnitude in 10 years, better than Moore's Law

6. Bell Prize Performance History
7. Plan of Presentation
- General characterization of PDE requirements
- Four sources of performance improvement to get from the current 100's of Gflop/s (for PDEs) to 1 Pflop/s
- Each illustrated with examples from computational aerodynamics
- offered as typical of real workloads (nonlinear, unstructured, multicomponent, multiscale, etc.)
- Performance presented
- on up to thousands of processors of the T3E and ASCI Red (for parallel aspects)
- on numerous uniprocessors (for memory hierarchy aspects)
8. Purpose of Presentation
- Not to argue for specific algorithms/programming models/codes in any detail
- but see talks on Newton-Krylov-Schwarz (NKS) methods under the homepage
- Provide a requirements target for designers of today's systems
- typical of several contemporary successfully parallelized PDE applications
- not comprehensive of all important large-scale applications
- Speculate on a requirements target for designers of tomorrow's systems
- Promote attention to current architectural weaknesses, relative to the requirements of PDEs
9. Four Sources of Performance Improvement
- Expanded number of processors
- arbitrarily large factor, through extremely careful attention to load balancing and synchronization
- More efficient use of processor cycles, and faster processor/memory elements
- one to two orders of magnitude, through memory-assist language features, processors-in-memory, and multithreading
- Algorithmic variants that are more architecture-friendly
- approximately an order of magnitude, through improved locality and relaxed synchronization
- Algorithms that deliver more science per flop
- possibly a large problem-dependent factor, through adaptivity
- This last does not contribute to raw flop/s!
10. PDE Varieties and Complexities
- Varieties of PDEs:
- evolution (time hyperbolic, time parabolic)
- equilibrium (elliptic, spatially hyperbolic or parabolic)
- mixed, varying by region
- mixed, of multiple type (e.g., parabolic with elliptic constraint)
- Complexity parameterized by:
- spatial grid points, Nx
- temporal grid points, Nt
- components per point, Nc
- auxiliary storage per point, Na
- grid points in stencil, Ns
- Memory: M ∝ Nx (Nc + Na + Nc^2 Ns)
- Work: W ∝ Nx Nt (Nc + Na + Nc^2 Ns)
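As a worked instance of these estimates (with illustrative values not taken from the talk: Nc = 5 components for 3D compressible flow, Na = 2 auxiliary quantities per point, Ns = 15 stencil points):

```latex
M \;\propto\; N_x\,(N_c + N_a + N_c^2 N_s)
  \;=\; N_x\,(5 + 2 + 25 \cdot 15)
  \;\approx\; 382\,N_x
```

so per-point storage is dominated by the Jacobian blocks (the Nc^2 Ns term), and a 10^7-vertex grid already implies on the order of 4 x 10^9 stored values.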
11. Resource Scaling for PDEs
- For 3D problems, (Memory) ∝ (Work)^(3/4)
- for equilibrium problems, work scales with problem size x no. of iteration steps; for reasonable implicit methods the latter is proportional to resolution in a single spatial dimension
- for evolutionary problems, work scales with problem size x no. of time steps; CFL-type arguments place the latter on the order of resolution in a single spatial dimension
- The proportionality constant can be adjusted over a very wide range, both by discretization and by algorithmic tuning
- If frequent time frames are to be captured, disk capacity and I/O rates must both scale linearly with work
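The 3/4-power relation follows directly from the two bullets above; a one-line derivation (suppressing the per-point factors):

```latex
W \propto N_x N_t, \qquad N_t \propto N_x^{1/3}
\;\Rightarrow\; W \propto N_x^{4/3}
\;\Rightarrow\; M \propto N_x \propto W^{3/4}.
```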
12. Typical PDE Tasks
- Vertex-based loops
- state vector and auxiliary vector updates
- Edge-based stencil op loops
- residual evaluation, approximate Jacobian evaluation
- Jacobian-vector product (often replaced with matrix-free form, involving residual evaluation)
- intergrid transfer (coarse/fine)
- Sparse, narrow-band recurrences
- approximate factorization and back substitution
- smoothing
- Vector inner products and norms
- orthogonalization/conjugation
- convergence progress and stability checks
13. Edge-based Loop
- Vertex-centered tetrahedral grid
- Traverse by edges:
- load vertex values
- compute intensively
- store contributions to flux at vertices
- Each vertex appears in approximately 15 flux computations
- (a minimal sketch of such a loop follows below)
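As a concrete (if toy) rendering of this loop, a sketch in Python/NumPy, with hypothetical sizes and a placeholder flux in place of a real flux kernel:

```python
import numpy as np

nv, ne, nc = 1000, 7000, 5                 # vertices, edges, components (illustrative)
edges = np.random.randint(0, nv, (ne, 2))  # endpoint vertex indices per edge
u = np.random.rand(nv, nc)                 # interlaced state: u[vertex, component]
res = np.zeros((nv, nc))                   # flux accumulator at vertices

ul, ur = u[edges[:, 0]], u[edges[:, 1]]    # gather endpoint states for all edges
flux = 0.5 * (ul + ur)                     # placeholder for the real flux kernel
np.add.at(res, edges[:, 0], +flux)         # scatter-add: several edges may hit the
np.add.at(res, edges[:, 1], -flux)         # same vertex, so use unbuffered adds
```

The gather/compute/scatter shape is the point: vertex data are loaded indirectly, reused across the roughly 15 incident edges, and accumulated back.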
14. Explicit PDE Solvers
- Concurrency is pointwise, O(N)
- Comm.-to-comp. ratio is surface-to-volume, O((N/P)^(-1/3))
- Communication range is nearest-neighbor, except for time-step computation
- Synchronization frequency is once per step, O((N/P)^(-1))
- Storage per point is low
- Load balance is straightforward for static quasi-uniform grids
- Grid adaptivity (together with the temporal stability limitation) makes load balance nontrivial
15. Domain-decomposed Implicit PDE Solvers
- Concurrency is pointwise, O(N), or subdomainwise, O(P)
- Comm.-to-comp. ratio still mainly surface-to-volume, O((N/P)^(-1/3))
- Communication still mainly nearest-neighbor, but nonlocal communication arises from conjugation, norms, and coarse grid problems
- Synchronization frequency often more than once per grid-sweep, up to the Krylov dimension, O(K (N/P)^(-1))
- Storage per point is higher, by a factor of O(K)
- Load balance issues are the same as for explicit solvers
16. Source 1: Expanded Number of Processors
- Amdahl's law can be defeated if serial sections make up a nonincreasing fraction of total work as problem size and processor count scale up together; true for most explicit or iterative implicit PDE solvers
- popularized in the 1988 Karp Prize paper by Gustafson et al. (see the scaled-speedup form below)
- Simple, back-of-envelope parallel complexity analyses show that processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced
- Caveat: the processor network must also be scalable (applies to protocols as well as to hardware)
- The remaining four orders of magnitude could be met by hardware expansion (but this does not mean that fixed-size applications of today would run 10^4 times faster)
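For reference, the scaled-speedup argument in equation form: if s is the serial fraction measured on the parallel system, then

```latex
S(P) \;=\; s + (1 - s)\,P ,
```

so as PDE problem size grows with P and s shrinks, speedup approaches P rather than saturating at 1/s as in the fixed-size Amdahl reading.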
17. Back-of-Envelope Scalability Demonstration for Bulk-synchronized PDE Computations
- Given complexity estimates of the leading terms of:
- the concurrent computation (per iteration phase)
- the concurrent communication
- the synchronization frequency
- And a model of the architecture including:
- internode communication (network topology and protocol, reflecting horizontal memory structure)
- on-node computation (effective performance parameters, including vertical memory structure)
- One can estimate optimal concurrency and execution time
- on a per-iteration basis, or overall (by taking into account any granularity-dependent convergence rate)
- simply differentiate the time estimate in terms of (N,P) with respect to P, equate to zero, and solve for P in terms of N
18. 3D Stencil Costs (per Iteration)
- grid points in each direction n; total work N = O(n^3)
- processors in each direction p; total processors P = O(p^3)
- memory per node requirements: O(N/P)
- execution time per iteration: A n^3/p^3
- grid points on a side of each processor subdomain: n/p
- neighbor communication per iteration: B n^2/p^2
- cost of global reductions in each iteration: C log p or C p^(1/d)
- C includes synchronization frequency
- same dimensionless units for measuring A, B, and C
- e.g., cost of a scalar floating-point multiply-add
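A numerical version of this recipe, under the cost model of this slide (coefficient values are made up for illustration); note how the optimal n/p settles to a constant, so p can grow linearly with n:

```python
import numpy as np

def T(n, p, A=1.0, B=10.0, C=50.0):
    # per-iteration time: compute + neighbor exchange + global reduction
    return A * n**3 / p**3 + B * n**2 / p**2 + C * np.log(p)

for n in (64, 128, 256, 512):
    ps = np.arange(1, 4 * n).astype(float)   # brute-force search over p
    p_opt = int(ps[np.argmin(T(n, ps))])
    print(f"n={n:4d}  p_opt={p_opt:4d}  n/p_opt={n/p_opt:.2f}")
```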
19. 3D Stencil Computation Illustration: Rich local network, tree-based global reductions
- total wall-clock time per iteration: T(n,p) = A n^3/p^3 + B n^2/p^2 + C log p
- for optimal p, dT/dp = 0, or -3A n^3/p^4 - 2B n^2/p^3 + C/p = 0
- or (with ρ ≡ n/p): 3A ρ^3 + 2B ρ^2 = C
- without speeddown, p can grow with n
- in the limit as n → ∞, ρ tends to a constant, so the optimal p grows linearly with n

20. 3D Stencil Computation Illustration: Rich local network, tree-based global reductions
- optimal running time: T(n, p_opt) = A ρ^3 + B ρ^2 + C log(n/ρ)
- where ρ is the root of 3A ρ^3 + 2B ρ^2 = C
- in the limit of infinite neighbor bandwidth and zero neighbor latency (B = 0): ρ = (C/3A)^(1/3), and the optimal time grows only logarithmically in n
- (This analysis is on a per-iteration basis; a fuller analysis would multiply this cost by an iteration count estimate that generally depends on n and p.)
21. Summary for Various Networks
- With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware:
- optimal number of processors scales linearly with problem size
- With 3D torus-based global reductions and scalable nearest-neighbor hardware:
- optimal number of processors scales as the three-fourths power of problem size (almost scalable)
- With a common network bus (heavy contention):
- optimal number of processors scales as the one-fourth power of problem size (not scalable)
- bad news for conventional Beowulf clusters, but see the 2000 Bell Prize price-performance awards
22. 1999 Bell Prize Parallel Scaling Results on ASCI Red
ONERA M6 Wing test case, tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red nodes (333 MHz Pentium Pro processors)

23. Surface Visualization of Test Domain for Computing Flow over an ONERA M6 Wing

24. Transonic Lambda Shock Solution

25. Fixed-size Parallel Scaling Results (Flop/s)

26. Fixed-size Parallel Scaling Results (Time in seconds)
27. Algorithm: Newton-Krylov-Schwarz
- Newton: nonlinear solver, asymptotically quadratic
- Krylov: accelerator, spectrally adaptive
- Schwarz: preconditioner, parallelizable
- (a toy numerical sketch of this composition follows below)
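A self-contained toy of the NKS composition (the 1D Bratu test problem and all sizes here are illustrative, not from the talk): Newton on the outside, preconditioned conjugate gradients as the Krylov accelerator inside, and exact solves on non-overlapping subdomain blocks as a minimal additive Schwarz preconditioner.

```python
import numpy as np

n, lam, h = 64, 1.0, 1.0 / 65              # 1D Bratu: -u'' = lam e^u, u(0)=u(1)=0

def F(u):                                   # nonlinear residual
    up = np.concatenate(([0.0], u, [0.0]))
    return (2 * u - up[:-2] - up[2:]) - h * h * lam * np.exp(u)

def J(u):                                   # tridiagonal Jacobian (dense for clarity)
    A = np.diag(2.0 - h * h * lam * np.exp(u))
    A -= np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return A

def schwarz(A, r, nb=4):                    # additive Schwarz preconditioner:
    z = np.zeros_like(r)                    # exact solves on non-overlapping blocks
    for lo in range(0, n, n // nb):
        hi = lo + n // nb
        z[lo:hi] = np.linalg.solve(A[lo:hi, lo:hi], r[lo:hi])
    return z

def pcg(A, b, tol=1e-12):                   # Krylov accelerator (CG; this Jacobian
    x = np.zeros_like(b)                    # is symmetric positive definite)
    r = b.copy(); z = schwarz(A, r); p = z.copy(); rz = r @ z
    while np.linalg.norm(r) > tol * np.linalg.norm(b):
        q = A @ p; alpha = rz / (p @ q)
        x += alpha * p; r -= alpha * q
        z = schwarz(A, r); rz_new = r @ z
        p = z + (rz_new / rz) * p; rz = rz_new
    return x

u = np.zeros(n)
for it in range(20):                        # Newton outer loop
    f = F(u)
    if np.linalg.norm(f) < 1e-10:
        break
    u -= pcg(J(u), f)                       # inner solve of J du = F
print(f"{it} Newton steps, ||F|| = {np.linalg.norm(F(u)):.2e}")
```

At scale, the production NKS solvers of the talk use the same composition, with approximate (e.g., ILU) solves on overlapping subdomains in place of the exact block solves used here.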
28. Fixed-size Scaling Results for W-cycle (Time in seconds, courtesy of D. Mavriplis)
- ASCI runs for grid of 3.1M vertices; T3E runs for grid of 24.7M vertices
29. Source 2: More Efficient Use of Faster Processors
- Current low efficiencies of sparse codes can be improved if regularity of reference is exploited with memory-assist features
- PDEs have a periodic workingset structure that permits effective use of prefetch/dispatch directives, and lots of slackness
- Combined with processors-in-memory (PIM) technology for gather/scatter into densely used block transfers, and with multithreading, PDEs can approach full utilization of processor cycles
- Caveat: high bandwidth is critical, since PDE algorithms do only O(N) work for O(N) gridpoints' worth of loads and stores
- One to two orders of magnitude can be gained by catching up to the clock, and by following the clock into the few-GHz range
30. Following the Clock
- 1999 predictions from the Semiconductor Industry Association: http://public.itrs.net/files/1999_SIA_Roadmap/Home.htm
- A factor of 2-3 can be expected by 2007 by following the clock alone
31. Example of Multithreading
- Same ONERA M6 wing Euler code simulation on ASCI Red
- ASCI Red contains two processors per node, sharing memory
- The second processor can be used either in message-passing mode with its own subdomain, or in multithreaded shared-memory mode, which does not require the number of subdomain partitions to double
- The latter is much more effective in the flux evaluation phase, as shown by cumulative execution time (here, memory bandwidth is not an issue)
32. PDE Workingsets
- Smallest: data for a single stencil, Ns (Nc^2 + Nc + Na) (sharp)
- Largest: data for an entire subdomain, (Nx/P) Ns (Nc^2 + Nc + Na) (sharp)
- Intermediate: data for a neighborhood collection of stencils, reused as much as possible
33. Cache Traffic for PDEs
- As successive workingsets drop into a level of memory, capacity (and, with effort, conflict) misses disappear, leaving only compulsory misses, reducing demand on main memory bandwidth
34. Strategies Based on Workingset Structure
- No performance value in memory levels larger than the subdomain
- Little performance value in memory levels smaller than the subdomain but larger than required to permit full reuse of most data within each subdomain subtraversal
- After providing an L1 large enough for the smallest workingset (and multiple independent copies to accommodate the desired level of multithreading), all additional resources should be invested in a large L2
- Tables describing grid connectivity are built (after each grid rebalancing) and stored in PIM; they are used to pack/unpack densely used cache lines during subdomain traversal (a sketch follows below)
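A sketch of the table idea in NumPy terms (the layout and names are hypothetical): the indirection is computed once per rebalancing, after which the traversal streams a dense, unit-stride buffer.

```python
import numpy as np

nv, ne, nc = 1000, 7000, 5
edges = np.random.randint(0, nv, (ne, 2))   # grid connectivity

# Built once per rebalancing (conceptually, by the PIM): a flat gather table
# mapping each slot of the dense traversal buffer to a vertex.
gather = edges.reshape(-1)

u = np.random.rand(nv, nc)                  # vertex data in memory
packed = u[gather]                          # pack into a dense (2*ne, nc) buffer;
                                            # the flux kernel then reads it with
                                            # no indirect addressing
```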
35. Costs of Greater Per-processor Efficiency
- Programming complexity of managing subdomain traversal
- Space to store gather/scatter tables in PIM
- Time to (re)build gather/scatter tables in PIM
- Memory bandwidth commensurate with the peak rates of all processors
36. Source 3: More Architecture-Friendly Algorithms
- Algorithmic practice needs to catch up to architectural demands
- several one-time gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness
- "One-time" refers to improvements by small constant factors, nothing that scales in N or P; complexities are already near information-theoretic lower bounds, and we reject increases in flop rates that derive from less efficient algorithms
- Caveat: remaining algorithmic performance improvements may cost extra space or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable
- Perhaps an order of magnitude of performance remains here
37. Raw Performance Improvement from Algorithms
- Spatial reorderings that improve locality:
- interlacing of all related grid-based data structures (illustrated below)
- ordering gridpoints and grid edges for L1/L2 reuse
- Discretizations that improve locality:
- higher-order methods (lead to larger, denser blocks at each point than lower-order methods)
- vertex-centering (for the same tetrahedral grid, leads to denser blockrows than cell-centering)
- Temporal reorderings that improve locality:
- block vector algorithms (reuse cached matrix blocks; vectors in the block are independent)
- multi-step vector algorithms (reuse cached vector blocks; vectors have sequential dependence)
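The interlacing item in NumPy terms (field names are illustrative): one array indexed [point, component] replaces Nc separate arrays, so a stencil update touches one unit-stride stream instead of Nc distant ones.

```python
import numpy as np

nx, nc = 100000, 5
fields = [np.random.rand(nx) for _ in range(nc)]  # separate-array layout:
                                                  # nc strided streams per point

u = np.stack(fields, axis=1)                      # interlaced layout, shape (nx, nc)
assert np.all(u[7] == [f[7] for f in fields])     # all components of point 7 are
                                                  # now contiguous in memory
```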
38. Raw Performance Improvement from Algorithms
- Temporal reorderings that reduce synchronization penalty:
- less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection)
- less global methods that reduce synchronization range by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz)
- Precision reductions that make bandwidth seem larger:
- lower-precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed on full-precision extensions)
39. Improvements Resulting from Locality Reordering

40. Improvements from Blocking Vectors
- Same ONERA M6 Euler simulation, on SGI Origin
- One vector represents standard GMRES acceleration
- Four vectors is a blocked Krylov method, not yet in the production version
- Savings arise from not reloading the matrix elements of the Jacobian for each new vector (four-fold increase in matrix element use per load); a miniature of the mechanism follows below
- The flop/s rate is effectively tripled; however, can the extra vectors be used efficiently from a numerical viewpoint?
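The blocking mechanism in miniature (sizes illustrative): one pass over the sparse matrix entries serves all vectors in the block, so each loaded matrix element is used four times instead of once.

```python
import numpy as np

n, nnz, k = 100000, 1000000, 4
rows = np.random.randint(0, n, nnz)         # toy sparse Jacobian in COO form
cols = np.random.randint(0, n, nnz)
vals = np.random.rand(nnz)
V = np.random.rand(n, k)                    # block of k Krylov vectors

Y = np.zeros((n, k))                        # one traversal computes all k
np.add.at(Y, rows, vals[:, None] * V[cols]) # products: each matrix entry is
                                            # loaded once and applied k times
```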
41. Improvements from Reduced Precision
- Same ONERA M6 Euler simulation, on SGI Origin
- Standard (middle column) is double precision in all floating-point quantities
- Optimization (right column) is to store the preconditioner for the Jacobian matrix in single precision only, promoting to double before use in the processor (sketched below)
- Bandwidth and matrix cache capacities are effectively doubled, with no deterioration in numerical properties
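A dtype-level sketch of this optimization (the explicit inverse stands in for whatever approximate preconditioner the solver holds):

```python
import numpy as np

n = 500
J = np.random.rand(n, n) + n * np.eye(n)    # stand-in Jacobian
M = np.linalg.inv(J).astype(np.float32)     # preconditioner stored in single
                                            # precision: half the memory traffic,
                                            # twice the effective cache capacity
r = np.random.rand(n)                       # residual stays in double

z = M.astype(np.float64) @ r                # promote at point of use: arithmetic
                                            # remains full precision, and since M
                                            # is only approximate anyway, the
                                            # truncation costs nothing numerically
```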
42. Source 4: Algorithms Packing More Science Per Flop
- Some algorithmic improvements do not improve the flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity)
- Caveat: such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio
- Desirable that petaflop/s machines be general-purpose enough to run the best algorithms
- Not daunting, conceptually, but puts an enormous premium on dynamic load balancing
- An order of magnitude or more can be gained here for many problems
43. Examples of Adaptive Opportunities
- Spatial discretization-based adaptivity:
- change discretization type and order to attain the required approximation to the continuum everywhere, without over-resolving in smooth, easily approximated regions
- Fidelity-based adaptivity:
- change the continuous formulation to accommodate required phenomena everywhere, without enriching in regions where nothing happens
- Stiffness-based adaptivity:
- change the solution algorithm to provide more powerful, robust techniques in regions of space-time where the discrete problem is linearly or nonlinearly stiff, without extra work in nonstiff, locally well-conditioned regions
44. Experimental Example of Opportunity for Advanced Adaptivity
- Driven cavity: Newton's method (left) versus new Additive Schwarz Preconditioned Inexact Newton (ASPIN) nonlinear preconditioning (right)
45. Status and Prospects for Advanced Adaptivity
- Metrics and procedures are well developed in only a few areas
- method-of-lines ODEs for stiff IBVPs and DAEs, FEA for elliptic BVPs
- Multi-model methods used in ad hoc ways in production
- Boeing TRANAIR code
- Poly-algorithmic solvers demonstrated in principle, but rarely in the hostile environment of high-performance computing
- Requirements for progress:
- management of hierarchical levels of synchronization
- user specification of hierarchical priorities of different threads
46. Summary of Suggestions for PDE Petaflops
- Algorithms that deliver more science per flop
- possibly a large problem-dependent factor, through adaptivity (but we won't count this towards rate improvement)
- Algorithmic variants that are more architecture-friendly
- expect half an order of magnitude, through improved locality and relaxed synchronization
- More efficient use of processor cycles, and faster processor/memory
- expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading
- Expanded number of processors
- expect two orders of magnitude, through dynamic balancing and extreme care in implementation
47. Reminder about the Source of PDEs
- Computational engineering is not about individual large-scale analyses, done fast and thrown over the wall
- Both results and their sensitivities are desired; often multiple operation points to be simulated are known a priori, rather than sequentially
- Sensitivities may be fed back into an optimization process
- Full PDE analyses may also be inner iterations in a multidisciplinary computation
- In such contexts, petaflop/s may mean 1,000 analyses running somewhat asynchronously with respect to each other, each at 1 Tflop/s: clearly a less daunting challenge, and one that has better synchronization properties for exploiting "The Grid", than 1 analysis running at 1 Pflop/s
48. Summary Recommendations for Architects
- Support rich (mesh-like) interprocessor connectivity and fast global reductions
- Allow disabling of expensive interprocessor cache coherence protocols for user-tagged data
- Support fast message-passing protocols between processors that physically share memory, for legacy MP applications
- Supply sufficient memory system bandwidth per processor (at least one word per clock per scalar unit)
- Give the user optional control of L2 cache traffic through directives
- Develop at least gather/scatter processor-in-memory capability
- Support a variety of precisions in blocked transfers, and fast precision conversions
49. Recommendations for New Benchmarks
- The recently introduced sPPM benchmark fills a void for memory-system-realistic full-application PDE performance, but is explicit, structured, and relatively high-order
- A similar full-application benchmark is needed for implicit, unstructured, low-order PDE solvers
- Reflecting the hierarchical, distributed memory layout of high-end computers, this benchmark would have two aspects:
- uniprocessor ("vertical") memory system performance: a suite of problems of various grid sizes and multicomponent sizes, with different interlacings and edge orderings
- parallel ("horizontal") network performance: problems of various subdomain sizes and synchronization frequencies
50. Bibliography
- "High Performance Parallel CFD", Gropp, Kaushik, Keyes & Smith, Parallel Computing (to appear, 2001)
- "Toward Realistic Performance Bounds for Implicit CFD Codes", Gropp, Kaushik, Keyes & Smith, in Proceedings of Parallel CFD'99, Elsevier, 1999
- "Prospects for CFD on Petaflops Systems", Keyes, Kaushik & Smith, in Parallel Solution of Partial Differential Equations, Springer, 1999, pp. 247-278
- "Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids", Kaushik, Keyes & Smith, in Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods, Domain Decomposition Press, 1998, pp. 513-520
- "How Scalable is Domain Decomposition in Practice?", Keyes, in Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods, Domain Decomposition Press, 1998, pp. 286-297
- "On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code", Kaushik, Keyes & Smith, in Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods, AMS, 1998, pp. 311-319
51. Related URLs
- Follow-up on this talk: http://www.mcs.anl.gov/petsc-fun3d
- ASCI platforms: http://www.llnl.gov/asci/platforms
- International Conferences on Domain Decomposition Methods: http://www.ddm.org
- SIAM Conferences on Parallel Processing: http://www.siam.org (Norfolk, USA, 12-14 Mar 2001)
- International Conferences on Parallel CFD: http://www.parcfd.org
52. Acknowledgments
- Collaborators: Dinesh Kaushik (ODU), Kyle Anderson (NASA), and the PETSc team at ANL: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
- Sponsors: U.S. DOE, ICASE, NASA, NSF
- Computer resources: DOE, SGI-Cray
- Inspiration: Shahid Bokhari (ICASE), Xiao-Chuan Cai (CU-Boulder), Rob Falgout (LLNL), Paul Fischer (ANL), Kyle Gallivan (FSU), Liz Jessup (CU-Boulder), Michael Kern (INRIA), Dimitri Mavriplis (ICASE), Alex Pothen (ODU), Uli Ruede (Univ. Erlangen), John Salmon (Caltech), Linda Stals (ODU), Bob Voigt (DOE), David Young (Boeing), Paul Woodward (UMinn)