
Algorithms and Software for Terascale Computation of PDEs and PDE-constrained Optimization

David E. Keyes
Center for Computational Science, Old Dominion University
Institute for Scientific Computing Research (ISCR), Lawrence Livermore National Laboratory

Plan of presentation

- Imperative of optimal algorithms for terascale computing
- Basic domain decomposition and multilevel algorithmic concepts
- Illustration of solver performance on ASCI platforms
- Terascale Optimal PDE Simulations (TOPS) software project of the U.S. DOE
- Conclusions and outlook

Related URLs

- Personal homepage: papers, talks, etc.
- http://www.math.odu.edu/keyes
- SciDAC initiative
- http://www.science.doe.gov/scidac
- TOPS software project
- http://www.math.odu.edu/keyes/scidac
- PETSc software project
- http://www.mcs.anl.gov/petsc
- Hypre software project
- http://www.llnl.gov/CASC/hypre

Bibliography

- Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, submitted to J. Comp. Phys.
- Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, SIAM J. Sci. Comp. 24:183-200
- High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362
- Four Horizons for Enhancing the Performance of Parallel Simulations Based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci. 1900:1-17, Springer
- Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136
- Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99
- Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in Parallel Solution of Partial Differential Equations, Springer, pp. 247-278
- How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods, Domain Decomposition Press, pp. 286-297

Terascale simulation has been sold

(Diagram: application areas ringing a central "Scientific Simulation" node)

- Applied Physics: radiation transport, supernovae
- Environment: global climate, contaminant transport
- Experiments may be prohibited or impossible, dangerous, difficult to instrument, controversial, or expensive

In these, and many other areas, simulation is an important complement to experiment. However, simulation is far from proven! To meet expectations, we need to handle problems of multiple physical scales.

Large platforms have been provided

(Chart: ASCI capability by calendar year, 1997-2006, at Sandia, Los Alamos, and Livermore: Red at 1 Tflop / 0.5 TB, Blue at 3 Tflop / 1.5 TB, White at 10 Tflop / 4 TB (7.2 Tflop/s LINPACK), then 30 Tflop / 10 TB, 50 Tflop / 25 TB, and 100 Tflop / 30 TB)

- ASCI program of the U.S. DOE has a roadmap to go to 100 Tflop/s by 2006: www.llnl.gov/asci/platforms

NSF's 13.6 TF TeraGrid coming on line

TeraGrid: NCSA, SDSC, Caltech, Argonne (www.teragrid.org)

(Diagram: the four site resources, HPSS and UniTree archival storage, and external networks joined by a 40 Gb/s backbone)

c/o I. Foster

Earth Simulator

(Bird's-eye view of the Earth Simulator System: processor node (PN) cabinets, interconnection network (IN) cabinets, disks, cartridge tape library system, air conditioning and power supply systems, in a 65 m x 50 m hall with a double floor for IN cables; 35.6 Tflop/s LINPACK)

Building platforms is the easy part

- Algorithms must be:
- highly concurrent and straightforward to load balance
- latency tolerant
- cache friendly (temporal and spatial locality of reference)
- highly scalable (in the sense of convergence)
- Goal for algorithmic scalability: fill up the memory of arbitrarily large machines while preserving constant (or logarithmically growing) running times with respect to a proportionally smaller problem on one processor
- Domain-decomposed multilevel methods natural for all of these
- Domain decomposition also natural for software engineering

Algorithmic requirements from architecture

- Must run on physically distributed memory units connected by a message-passing network, each serving one or more processors with multiple levels of cache (e.g., T3E)
- Horizontal aspects: message passing, shared-memory threads
- Vertical aspects: register blocking, cache blocking, prefetching

Keyword: Optimal

- Convergence rate nearly independent of discretization parameters
- Multilevel schemes for rapid linear convergence of linear problems
- Newton-like schemes for quadratic convergence of nonlinear problems
- Convergence rate as independent as possible of physical parameters
- Continuation schemes
- Physics-based preconditioning

Why Optimal Algorithms?

- The more powerful the computer, the greater the importance of optimality
- Example:
- Suppose Alg1 solves a problem in time CN^2, where N is the input size
- Suppose Alg2 solves the same problem in time CN
- Suppose the machine on which Alg1 and Alg2 have been parallelized has 10,000 processors
- In constant time (compared to serial), Alg1 can run a problem 100X larger, whereas Alg2 can run a problem 10,000X larger
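The slide's arithmetic can be spelled out in a few lines (the numbers are just the slide's own example):

```python
# Why optimality dominates at scale: parallelize both algorithms over P
# processors and hold wall-clock time constant.
P = 10_000

# Alg1 costs C*N**2, so constant time C*N**2/P lets N grow only by P**0.5.
growth_alg1 = P ** 0.5   # 100.0

# Alg2 costs C*N, so constant time C*N/P lets N grow by the full factor P.
growth_alg2 = P          # 10000

print(growth_alg1, growth_alg2)
```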

Why Optimal?, cont.

- Alternatively, filling the machines memory, Alg1

requires 100X time, whereas Alg2 runs in constant

time - Is 10,000 processors a reasonable expectation?
- Yes, we have it today (ASCI White)!
- Could computational scientists really use 10,000X

scaling? - Of course we are approximating the continuum
- A grid for weather prediction allows points every

1km versus every 100km on the earths surface - In 2D 10,000X disappears fast in 3D even faster
- However, these machines are expensive (Earth

Simulator is 0.5B, plus ongoing operating

costs), and optimal algorithms are the only

algorithms that we can afford to run on them

Decomposition strategies for Lu = f in Ω

- Operator decomposition
- Function space decomposition
- Domain decomposition

Operator decomposition

- Consider ADI
- Iteration matrix consists of four sequential (multiplicative) substeps per timestep:
- two sparse matrix-vector multiplies
- two sets of unidirectional bandsolves
- Parallelism within each substep
- But global data exchanges between bandsolve substeps
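The iteration itself was an image in the original slide; a standard Peaceman-Rachford reconstruction, assuming the operator splits as L = L_x + L_y, makes the four substeps visible:

```latex
% One ADI (Peaceman--Rachford) timestep for u_t + (L_x + L_y)u = f:
% each line pairs one explicit matrix-vector multiply (right side)
% with one unidirectional bandsolve (left side).
\begin{aligned}
\left(I + \tfrac{\Delta t}{2} L_x\right) u^{n+1/2}
  &= \left(I - \tfrac{\Delta t}{2} L_y\right) u^{n} + \tfrac{\Delta t}{2} f \\
\left(I + \tfrac{\Delta t}{2} L_y\right) u^{n+1}
  &= \left(I - \tfrac{\Delta t}{2} L_x\right) u^{n+1/2} + \tfrac{\Delta t}{2} f
\end{aligned}
```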

Function space decomposition

- Consider a spectral Galerkin method

Domain decomposition

- Consider restriction and extension operators R_i, R_i^T for subdomains, and R_0, R_0^T for a possible coarse grid
- Replace the discretized system Au = f with its Schwarz-preconditioned form B^-1 Au = B^-1 f
- Solve by a Krylov method, e.g., CG
- Matrix-vector multiplies with B^-1 A involve:
- parallelism on each subdomain
- nearest-neighbor exchanges, global reductions
- possible small global system (not needed for parabolic case)

Comparison

- Operator decomposition (ADI): natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose)
- Function space decomposition (Fourier): natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform)
- Domain decomposition (Schwarz): natural domain-based assignment requires local (nearest-neighbor) data exchanges, global reductions, and an optional small global problem

Theoretical scaling of domain decomposition (for three common network topologies)

- With logarithmic-time (hypercube- or tree-based) global reductions and scalable nearest-neighbor interconnects:
- optimal number of processors scales linearly with problem size (scalable; assumes one subdomain per processor)
- With power-law-time (3D torus-based) global reductions and scalable nearest-neighbor interconnects:
- optimal number of processors scales as the three-fourths power of problem size (almost scalable)
- With linear-time (common bus) network:
- optimal number of processors scales as the one-fourth power of problem size (not scalable)
- bad news for conventional Beowulf clusters, but see the 2000 and 2001 Bell Prize price-performance awards using multiple commodity NICs per Beowulf node!

Three Basic Concepts

- Iterative correction
- Schwarz preconditioning
- Schur preconditioning

Iterative correction

- The most basic idea in iterative methods
- Evaluate the residual accurately, but solve approximately, with B^-1 an approximate inverse to A
- A sequence of complementary solves can be used, e.g., with B_1 first and then B_2
- Optimal polynomials of B^-1 A lead to various preconditioned Krylov methods
- Scale recurrence, e.g., across a hierarchy of coarsened grids, leads to multilevel methods
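The correction formulas were images in the original; the standard forms, writing B^-1 for the approximate inverse, are:

```latex
% Iterative correction: accurate residual, approximate solve.
u \leftarrow u + B^{-1}(f - Au)
% Two complementary approximate inverses applied in sequence,
u \leftarrow u + B_1^{-1}(f - Au), \qquad
u \leftarrow u + B_2^{-1}(f - Au),
% damp the error by (I - B_2^{-1}A)(I - B_1^{-1}A).  Optimal polynomials
% in B^{-1}A give preconditioned Krylov methods; applying the recurrence
% across a grid hierarchy gives multilevel methods.
```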

Multilevel Preconditioning

(Diagram: hierarchy of coarsened grids above the finest grid)

Schwarz Preconditioning

- Given Ax = b, partition x into subvectors, corresponding to subdomains Ω_i of the domain Ω of the PDE, nonempty, possibly overlapping, whose union is all of the elements of x
- Let the Boolean rectangular matrix R_i extract the i-th subset of x
- Let A_i = R_i A R_i^T
- The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain support
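The assembly formula was an image in the original slide; in terms of the Boolean gather operators R_i just described, the standard additive Schwarz preconditioner reads:

```latex
% Additive Schwarz preconditioner from gather (R_i), scatter (R_i^T),
% and subdomain blocks A_i = R_i A R_i^T:
B^{-1} = \sum_i R_i^{T} A_i^{-1} R_i
% For two-level methods, an optional coarse-grid term R_0^T A_0^{-1} R_0
% is added to the sum.
```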

Iteration count estimates from the Schwarz theory

- Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square root of the condition number of the Schwarz-preconditioned system
- In terms of N and P, where for d-dimensional isotropic problems N ~ h^-d and P ~ H^-d, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows
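The table itself was an image in the original; the estimates usually quoted for this family of methods (a reconstruction from the standard Schwarz theory, under the scalings N ~ h^-d and P ~ H^-d above) are:

```latex
% Iteration count estimates for d = 2, 3:
\begin{tabular}{lcc}
Preconditioning            & 2D              & 3D \\ \hline
Point Jacobi               & $O(N^{1/2})$    & $O(N^{1/3})$ \\
Domain Jacobi (no overlap) & $O((NP)^{1/4})$ & $O((NP)^{1/6})$ \\
1-level Additive Schwarz   & $O(P^{1/2})$    & $O(P^{1/3})$ \\
2-level Additive Schwarz   & $O(1)$          & $O(1)$ \\
\end{tabular}
```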

Comments on the Schwarz theory

- Basic Schwarz estimates are for:
- self-adjoint operators with smooth coefficients
- positive definite operators
- exact subdomain solves
- two-way overlapping communication
- generous overlap, δ = O(H) (otherwise the 2-level result is O(1 + H/δ))
- Extensible to:
- non-self-adjointness (e.g., convection) and jumping coefficients
- indefiniteness (e.g., wave Helmholtz)
- inexact subdomain solves
- one-way overlap communication (restricted additive Schwarz)
- small overlap

Schur Preconditioning

- Given a partition into interior and interface unknowns
- Condense onto the interface to obtain the Schur complement S
- Let M be a good preconditioner for S
- Then the block factorization of A with M in place of S is a preconditioner for A
- Moreover, solves with the interior blocks may be done approximately if all degrees of freedom are retained
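The block structure was an image in the original slide; the standard partition and Schur complement are:

```latex
% Partition the unknowns into interior (I) and interface (\Gamma) sets:
A = \begin{pmatrix} A_{II} & A_{I\Gamma} \\
                    A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix},
\qquad
S \equiv A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}
% Condensation reduces Ax = b to an interface problem in S; a good
% preconditioner M for S, wrapped in the block factorization of A,
% preconditions all of A.
```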

Schwarz polynomials

- Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations
- We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively)

Schwarz-on-Schur

- Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem
- Neumann-Neumann algorithm
- Multigrid on the Schur complement

Schwarz-inside-Schur

- Consider Newton's method for solving the nonlinear rootfinding problem derived from the necessary conditions for constrained optimization
- Constraint: c(x, u) = 0
- Objective: φ(x, u)
- Lagrangian: φ(x, u) + λ^T c(x, u)
- Form the gradient of the Lagrangian with respect to each of x, u, and λ

Schwarz-inside-Schur

- Equality-constrained optimization leads to the KKT system for states x, designs u, and multipliers λ
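The system itself was an image in the original slide; a standard reconstruction, writing W for the Hessian of the Lagrangian, J_x and J_u for the constraint (state and design) Jacobians, and L_x, L_u for the Lagrangian gradient blocks:

```latex
% Newton's method on the vanishing gradients of
% L(x,u,\lambda) = \phi(x,u) + \lambda^T c(x,u) gives the KKT system:
\begin{pmatrix}
W_{xx} & W_{xu} & J_x^{T} \\
W_{ux} & W_{uu} & J_u^{T} \\
J_x    & J_u    & 0
\end{pmatrix}
\begin{pmatrix} \delta x \\ \delta u \\ \delta \lambda \end{pmatrix}
= -
\begin{pmatrix} L_x \\ L_u \\ c \end{pmatrix}
% J_x = c_x and J_u = c_u; W collects second derivatives of objective
% and constraints.
```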

Schwarz-inside-Schur, cont.

- Problems:
- the state Jacobian is the Jacobian of a PDE: huge!
- the W blocks involve Hessians of objective and constraints: second derivatives, and huge
- the full KKT matrix is unreasonable to form, store, or invert
- Solutions:
- Use Schur preconditioning on the full system
- Form forward action of Hessians by automatic differentiation (vector-to-vector map)
- Form approximate inverse action of the state Jacobian and its transpose by Schwarz

Example of PDE-constrained Optimization

Lagrange-Newton-Krylov-Schur, implemented in Veltisto/PETSc

- Optimal control of laminar viscous flow
- optimization variables are surface suction/injection
- objective is minimum drag
- 700,000 states; 4,000 controls
- 128 Cray T3E processors
- 5 hrs for optimal solution (1 hr for analysis)

c/o G. Biros and O. Ghattas
www.cs.nyu.edu/biros/veltisto/

Nonlinear Schwarz preconditioning

- Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free
- It replaces F(u) = 0 with a new nonlinear system possessing the same root
- Define a correction to the i-th partition (e.g., subdomain) of the solution vector by solving a local nonlinear system
- where the correction is nonzero only in the components of the i-th partition
- Then sum the corrections

Nonlinear Schwarz, cont.

- It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root, then F(u) = 0 and the preconditioned system have the same unique root
- To lead to a Jacobian-free Newton-Krylov algorithm, we need to be able to evaluate, for any u:
- The residual
- The Jacobian-vector product
- Remarkably (Cai-Keyes, 2000), it can be shown that both can be formed from subdomain solves and applications of F itself
- All required actions are available in terms of F(u)!

Example of nonlinear Schwarz

Unreasonable effectiveness of Schwarz

- When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right!
- Let the rows of the restriction operators form a complete set of orthonormal row eigenvectors for A; then the Schwarz sum of partial inverses reproduces the inverse of A exactly
- Good decompositions are a compromise between conditioning and parallel complexity, in practice
Newton-Krylov-Schwarz: a parallel PDE workhorse

Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994); in PETSc since Balay's MS project at ODU (1995)

- Newton: nonlinear solver, asymptotically quadratic
- Krylov: accelerator, spectrally adaptive
- Schwarz: preconditioner, parallelizable

Jacobian-Free Newton-Krylov Method

- In the Jacobian-Free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products
- These are approximated by Fréchet derivatives, Jv ≈ [F(u + εv) - F(u)]/ε, so that the actual Jacobian elements are never explicitly needed; ε is chosen with a fine balance between approximation and floating-point rounding error
- Schwarz preconditions, using approximate elements

Philosophy of Jacobian-free NK

- To evaluate the linear residual, we use the true F(u), giving a true Newton step and asymptotic quadratic Newton convergence
- To precondition the linear residual, we do anything convenient that uses understanding of the dominant physics/mathematics in the system and respects the limitations of the parallel computer architecture and the cost of various operations:
- combinations of operator-split Jacobians (for reasons of physics or reasons of numerics)
- Jacobian of related discretization (for fast solves)
- Jacobian of lower-order discretization (for more stability, less storage)
- Jacobian with lagged values for expensive terms (for less computation per degree of freedom)
- Jacobian stored in lower precision (for less memory traffic per preconditioning step)
- Jacobian blocks decomposed for parallelism

Philosophy of Jacobian-free NK, cont.

- These motivations are not new; most large-scale application codes also take shortcuts on the approximate Jacobian operator to be inverted, showing physical intuition
- The problem with many codes is that they do not anywhere have an accurate global Jacobian operator; they use only the weak Jacobian
- This leads to a weakly nonlinearly converging defect correction method
- Defect correction, in contrast to preconditioned Newton

Physics-based Preconditioning

- Recall the example of the shallow-water wave splitting, treated earlier as a solver, leaving a first-order-in-time splitting error
- In the Jacobian-free Newton-Krylov framework, this solver, which maps a residual into a correction, can be regarded as a preconditioner
- The true Jacobian is never formed, yet the time-implicit nonlinear residual at each time step can be made as small as needed for nonlinear consistency in long time integrations

Physics-based preconditioning

- In Newton iteration, one seeks to obtain a correction ("delta") to the solution by inverting the Jacobian matrix on (the negative of) the nonlinear residual
- A typical operator-split code also derives a "delta" to the solution, by some implicitly defined means, through a series of implicit and explicit substeps
- This implicitly defined mapping from residual to delta is a natural preconditioner
- Software must accommodate this!

1D Shallow water preconditioning

- Define the continuity residual for each timestep
- Define the momentum residual for each timestep
- Continuity delta-form (1)
- Momentum delta-form (2)

1D Shallow water preconditioning, cont.

- Solving (2) for the momentum delta and substituting into (1) yields a parabolic equation for δφ
- After this parabolic equation is solved for δφ, we have the momentum delta as well
- This completes the application of the preconditioner to one Newton-Krylov iteration at one timestep
- Of course, the parabolic solve need not be done exactly; one sweep of multigrid can be used
- See the paper by Mousseau et al. (2002) in Ref 1 for impressive results for long-time weather integration

Operator-split preconditioning

- Subcomponents of a PDE operator often have special structure that can be exploited if they are treated separately
- Algebraically, this is just a generalization of Schwarz, by term instead of by subdomain
- Suppose J = S + R and a preconditioner is to be constructed, where S and R are each easy to invert
- Form a preconditioned vector by applying the two easy inverses in sequence
- Equivalent to replacing the sum with a product
- First-order splitting error, yet often used as a solver!
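A sketch of the algebra, under the assumption that J arises from an implicit timestep, J = I + Δt(L_1 + L_2), with easily inverted factors S = I + Δt L_1 and R = I + Δt L_2:

```latex
% Apply the preconditioner by two sequential easy solves:
w = B^{-1} v = R^{-1} S^{-1} v
% This is equivalent to replacing J by the product
B = S R = J + \Delta t^{2} L_1 L_2 ,
% whose O(\Delta t^2)-per-step (first-order-in-time) splitting error is
% absorbed by the outer Krylov iteration rather than by the solution.
```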

Operator-split preconditioning, cont.

- Suppose S is convection-diffusion and R is reaction, among a collection of fields stored as gridfunctions
- On a small regular 2D grid with a five-point stencil:
- R is trivially invertible in block diagonal form
- S is invertible with one multilevel solve per field

(Figure: sparsity patterns of J, S, and R)

Operator-split preconditioning, cont.

- Preconditioners assembled from just the strong elements of the Jacobian, alternating the source-term and diffusion-term operators, are competitive in convergence rates with block-ILU on the Jacobian
- particularly since the decoupled scalar diffusion systems are amenable to simple multigrid treatment, not as trivial for the coupled system
- The decoupled preconditioners store many fewer elements, significantly reduce memory bandwidth requirements, and are expected to be much faster per iteration when carefully implemented
- See the alternative block factorization by Bank et al. in Ref 1, incorporated into the SciDAC TSI solver by D'Azevedo

Using Jacobian of related discretization

- To precondition a variable-coefficient operator, such as a diffusion with spatially varying coefficient, use the same operator based on a constant-coefficient average
- Brown & Saad (1980) showed that, because of the availability of fast solvers, it may even be acceptable to use such a simplified, fast-solvable operator to precondition something like the full operator

Using Jacobian of lower-order discretization

- Orszag popularized the use of linear finite element discretizations as preconditioners for high-order spectral element discretizations in the 1970s; both approach the same continuous operator
- It is common in CFD to employ first-order upwinded convective operators as approximate inversions for higher-order operators:
- better factorization stability
- smaller matrix bandwidth and complexity
- With Jacobian-free NK, we can have the best of both worlds: a stable factorization/cheap solve and a true Jacobian step

Using Jacobian with lagged terms

- Newton-chord methods (e.g., papers by Smooke et al.) freeze the Jacobian matrices
- saves Jacobian evaluation and factorization, which can be up to 90% of the running time of the code in some apps
- however, nonlinear convergence degrades to a linear rate
- In Jacobian-free NK, we can freeze some or all of the terms in the Jacobian preconditioner, while always accessing the action of the true Jacobian for the Krylov matrix-vector multiply
- still saves Jacobian work
- maintains asymptotically quadratic rate for nonlinear convergence
- See Ref 1 for an example with coupled edge plasma and Navier-Stokes, showing five-fold improvement over full Newton with a constantly refreshed Jacobian on the LHS, versus JFNK with the preconditioner refreshed once each ten timesteps

Using Jacobian with lower-precision elements

- Memory bandwidth is the critical architectural parameter for sparse linear algebra computations
- Storing the preconditioner elements in single precision effectively doubles memory bandwidth (and potentially halves runtime) for this critical phase
- We still form the Jacobian-vector product in full precision and zero-pad the preconditioner elements back to full length in the arithmetic unit, so the numerical quality of the Krylov subspace does not degrade
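The storage-versus-arithmetic split can be demonstrated in a few lines of NumPy (a hypothetical toy system, not the talk's unstructured-mesh code; explicit matrix inversion stands in for a real preconditioner):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical diagonally dominant "Jacobian", kept in full precision.
A = 4.0 * np.eye(n) + rng.standard_normal((n, n)) / n
b = rng.standard_normal(n)

# Store the preconditioner's elements in single precision: half the
# bytes, hence roughly half the memory traffic each time it is applied.
M_inv = np.linalg.inv(A).astype(np.float32)

# NumPy promotes float32 @ float64 to float64: the stored elements are
# widened in the arithmetic unit, so the Krylov vectors (and hence the
# Krylov subspace) remain full precision.
z = M_inv @ b

print(M_inv.nbytes * 2 == A.nbytes)  # True: preconditioner at half size
print(z.dtype)                       # float64
```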

Memory BW bottleneck revealed via precision reduction

Execution times for an unstructured NKS Euler simulation on the Origin 2000: double-precision matrices versus a single-precision preconditioner. Note that times are nearly halved, along with precision, for the BW-limited linear solve phase, indicating that the BW can be at least doubled before hitting the next bottleneck!

Computational Aerodynamics

(Figure: transonic "lambda" shock, Mach contours on surfaces; mesh c/o D. Mavriplis, ICASE)

Implemented in PETSc: www.mcs.anl.gov/petsc

Fixed-size Parallel Scaling Results

This scaling study, featuring our widest range of processor number, was done for the incompressible case.

c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith

- Lab-university collaborations to develop Integrated Software Infrastructure Centers (ISICs) and partner with application groups
- For FY2002, 51 new projects at $57M/year total:
- approximately one-third for ISICs
- a third for grid infrastructure and collaboratories
- a third for applications groups
- 5 Tflop/s IBM SP platforms, Seaborg at NERSC (#5 in the June 2002 Top 500 list) and Cheetah at ORNL (recently installed), available for SciDAC

Introducing the Terascale Optimal PDE Simulations (TOPS) ISIC

Nine institutions, $17M, five years, 24 co-PIs

Scope for TOPS

- Design and implementation of solvers:
- Time integrators, with sensitivity analysis
- Nonlinear solvers, with sensitivity analysis
- Optimizers
- Linear solvers
- Eigensolvers
- Software integration
- Performance optimization

(Diagram: dependences among the optimizer, sensitivity analyzer, time integrator, nonlinear solver, eigensolver, and linear solver)

TOPS Philosophy on PDEs

- Solution of a system of PDEs is rarely a goal in itself
- PDEs are solved to derive various functionals from specified inputs
- The actual goal is characterization of a response surface or a design or control strategy
- Together with analysis, sensitivities and stability are often desired
- Tools for PDE solution should also support such related desires

Conclusions

- Domain decomposition and multilevel iteration are the dominant paradigm in contemporary terascale PDE simulation
- Several freely available software toolkits exist and successfully scale to thousands of tightly coupled processors for problems on quasi-static meshes
- Concerted efforts are underway to make elements of these toolkits interoperate, and to allow expression of the best methods, which tend to be modular, hierarchical, recursive, and above all adaptive!
- Tunability of NKS algorithmics allows solver adaptation to application/architecture combinations
- Next-generation software should incorporate best practices in applications as preconditioners

Acknowledgments

- Collaborators or Contributors:
- Xiao-Chuan Cai (Univ. Colorado, Boulder)
- Omar Ghattas (Carnegie Mellon)
- Dinesh Kaushik (ODU)
- Dana Knoll (LANL)
- Dimitri Mavriplis (ICASE)
- PETSc team at Argonne National Laboratory: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
- Sponsors: DOE, NASA, NSF
- Computer Resources: LLNL, LANL, SNL, NERSC

Related URLs

- Personal homepage: papers, talks, etc.
- http://www.math.odu.edu/keyes
- SciDAC initiative
- http://www.science.doe.gov/scidac
- TOPS project
- http://www.math.odu.edu/keyes/scidac
- PETSc project
- http://www.mcs.anl.gov/petsc
- Hypre project
- http://www.llnl.gov/CASC/hypre
- ASCI platforms
- http://www.llnl.gov/asci/platforms

