Title: CS 267: Applications of Parallel Computers -- Solving Linear Systems Arising from PDEs - I
1 CS 267: Applications of Parallel Computers
Solving Linear Systems Arising from PDEs - I
- James Demmel
- www.cs.berkeley.edu/~demmel/cs267_Spr06
2 Outline
- Review Poisson equation
- Overview of Methods for Poisson Equation
- Jacobi's method
- Red-Black SOR method
- Conjugate Gradients
- FFT
- Multigrid (next lecture)
- These methods reduce to sparse-matrix-vector multiply; we need them to understand Multigrid
3 Recap of Sources of Parallelism Lecture
- Discrete event systems
- Examples: Game of Life, logic-level circuit simulation.
- Particle systems
- Examples: billiard balls, semiconductor device simulation, galaxies.
- Lumped variables depending on continuous parameters
- ODEs, e.g., circuit simulation (Spice), structural mechanics, chemical kinetics.
- Continuous variables depending on continuous parameters
- PDEs, e.g., heat, elasticity, electrostatics.
- A given phenomenon can be modeled at multiple levels.
- Many simulations combine more than one of these techniques.
4 Recap, continued: Solving PDEs
- Hyperbolic problems (waves)
- Sound wave(position, time)
- Use explicit time-stepping
- Solution at each point depends on neighbors at previous time
- Elliptic (steady state) problems
- Electrostatic Potential(position)
- Everything depends on everything else
- This means locality is harder to find than in hyperbolic problems
- Parabolic (time-dependent) problems
- Temperature(position, time)
- Involves an elliptic solve at each time-step
- Focus on elliptic problems
- Canonical example is the Poisson equation: ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = f(x,y,z)
5 Poisson's equation arises in many models
3D: ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = f(x,y,z)
2D: ∂²u/∂x² + ∂²u/∂y² = f(x,y)
1D: d²u/dx² = f(x)
f represents the sources; we also need boundary conditions.
- Electrostatic or Gravitational Potential: Potential(position)
- Heat flow: Temperature(position, time)
- Diffusion: Concentration(position, time)
- Fluid flow: Velocity, Pressure, Density(position, time)
- Elasticity: Stress, Strain(position, time)
- Variations of Poisson have variable coefficients
6 Relation of Poisson's equation to Gravity, Electrostatics
- Force on particle at (x,y,z) due to particle at 0
- 3D: -(x,y,z)/r³, where r = √(x² + y² + z²)
- 2D: -(x,y)/r², where r = √(x² + y²)
- Force is also the gradient of a potential V
- 3D: V = -1/r, ∇V = (∂V/∂x, ∂V/∂y, ∂V/∂z)
- 2D: V = log r, ∇V = (∂V/∂x, ∂V/∂y)
- V satisfies Poisson's equation (try it!)
7 Poisson's equation in 1D: d²u/dx² = f(x)
Discretize d²u/dx² = f(x) on a regular mesh, u_i = u(i*h), to get
(u_{i+1} - 2u_i + u_{i-1}) / h² = f(x_i)
Write as solving T·u = -h²·f for u, where

        [  2 -1          ]
        [ -1  2 -1       ]
    T = [    -1  2 -1    ]
        [       -1  2 -1 ]
        [          -1  2 ]

Graph and stencil: each node is connected to its two neighbors; the stencil is (-1, 2, -1).
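A quick numerical check of this discretization (not from the slides; a minimal NumPy sketch with an assumed n = 5 and f = 1, for which the centered difference is exact):

    import numpy as np

    n = 5
    h = 1.0 / (n + 1)                                      # mesh width for n interior points on (0,1)
    T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # the tridiagonal matrix above
    f = np.ones(n)                                         # right-hand side f(x) = 1
    u = np.linalg.solve(T, -h**2 * f)                      # solve T u = -h^2 f
    x = h * np.arange(1, n + 1)
    print(np.allclose(u, x * (x - 1) / 2))                 # exact solution of u'' = 1, u(0) = u(1) = 0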
8 2D Poisson's equation
- Similar to the 1D case, but the matrix T is now block tridiagonal (shown below for a 3x3 grid of unknowns)
- 3D is analogous

        [  4 -1    -1             ]
        [ -1  4 -1    -1          ]
        [    -1  4       -1       ]
        [ -1        4 -1    -1    ]
    T = [    -1    -1  4 -1    -1 ]
        [       -1    -1  4    -1 ]
        [          -1        4 -1 ]
        [             -1    -1  4 -1 ]
        [                -1    -1  4 ]

Graph and stencil: the 5-point stencil has 4 at the center and -1 at each of the four nearest neighbors.
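The 2D matrix need not be built entry by entry: it is the Kronecker sum of two 1D Poisson matrices. A small SciPy sketch (kron(I, T1) + kron(T1, I) is a standard construction, not shown on the slide):

    from scipy.sparse import diags, identity, kron

    def poisson_2d(n):
        """n^2-by-n^2 2D Poisson matrix as a Kronecker sum of 1D ones."""
        T1 = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))  # 1D Poisson matrix
        I = identity(n)
        return kron(I, T1) + kron(T1, I)

    print(poisson_2d(3).toarray())   # reproduces the 9-by-9 matrix above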
9 Algorithms for 2D (3D) Poisson Equation (N = n² (n³) vars)

    Algorithm        Serial             PRAM                      Memory              #Procs
    Dense LU         N³                 N                         N²                  N²
    Band LU          N² (N^(7/3))       N                         N^(3/2) (N^(5/3))   N (N^(4/3))
    Jacobi           N² (N^(5/3))       N (N^(2/3))               N                   N
    Explicit Inv.    N²                 log N                     N²                  N²
    Conj. Gradients  N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3)) log N   N                   N
    Red/Black SOR    N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3))         N                   N
    Sparse LU        N^(3/2) (N²)       N^(1/2)                   N log N (N^(4/3))   N
    FFT              N log N            log N                     N                   N
    Multigrid        N                  log² N                    N                   N
    Lower bound      N                  log N                     N

(Entries in parentheses are for the 3D problem.)
- PRAM is an idealized parallel model with zero-cost communication
- Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.
10 Overview of Algorithms on n x n grid (n x n x n grid) (1)
- Sorted in two orders (roughly):
- from slowest to fastest on sequential machines.
- from most general (works on any matrix) to most specialized (works on matrices like T).
- Dense LU: Gaussian elimination; works on any N-by-N matrix.
- Band LU: Exploits the fact that T is nonzero only on the N^(1/2) (N^(2/3)) diagonals nearest the main diagonal.
- Jacobi: Essentially does matrix-vector multiply by T in the inner loop of an iterative algorithm.
- Explicit Inverse: Assume we want to solve many systems with T, so we can precompute and store inv(T) "for free", and just multiply by it (but still expensive!).
- Conjugate Gradient: Uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not.
- Red-Black SOR (successive over-relaxation): Variation of Jacobi that exploits yet different mathematical properties of T. Used in multigrid schemes.
- Sparse LU: Gaussian elimination exploiting the particular zero structure of T.
- FFT (fast Fourier transform): Works only on matrices very like T.
- Multigrid: Also works on matrices like T, that come from elliptic PDEs.
- Lower Bound: Serial (time to print answer); parallel (time to combine N inputs).
- Details in class notes and www.cs.berkeley.edu/~demmel/ma221.
11 Overview of Algorithms on n x n grid (n x n x n grid) (2)
- Dimension N = n² in 2D (n³ in 3D)
- Bandwidth BW = n in 2D (n² in 3D)
- Condition number k = n² in 2D or 3D
- #its = number of iterations
- SpMV = Sparse-matrix-vector-multiply
- Dense LU
- cost = N³ (serial) or N (parallel, with N² procs)
- Band LU
- cost = N * BW² (serial) or N (parallel, with BW² procs)
- Jacobi
- cost = cost(SpMV) * #its, #its = k
- Conjugate Gradient
- cost = (cost(SpMV) + cost(dot prod)) * #its, #its = sqrt(k)
- Red-Black SOR
- cost = cost(SpMV) * #its, #its = sqrt(k)
- Iterative algorithms need different assumptions for convergence analysis
12 Comments on practical meshes
- Regular 1D, 2D, 3D meshes
- Important as building blocks for more complicated meshes
- We will discuss these first
- Practical meshes are often irregular
- Composite meshes, consisting of multiple bent regular meshes joined at edges
- Unstructured meshes, with arbitrary mesh points and connectivities
- Adaptive meshes, which change resolution during the solution process to put computational effort where needed
- In later lectures we will talk about methods on unstructured meshes
13 Composite Mesh from a Mechanical Structure
14 Unstructured mesh: NASA Airfoil in 2D
15 Adaptive Mesh Refinement (AMR)
- Adaptive mesh around an explosion
- Source: John Bell and Phil Colella at LBL
16 Jacobi's Method
- To derive Jacobi's method, write Poisson as
- u(i,j) = (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) + b(i,j))/4
- Let u(i,j,m) be the approximation for u(i,j) after m steps
- u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + u(i,j-1,m) + u(i,j+1,m) + b(i,j)) / 4
- I.e., u(i,j,m+1) is a weighted average of neighbors
- Motivation: u(i,j,m+1) is chosen to exactly satisfy the equation at (i,j)
- Steps to converge proportional to problem size, N = n²
- See Lecture 24 of www.cs.berkeley.edu/~demmel/cs267_Spr99
- Therefore, serial complexity is O(N²)
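A minimal NumPy sketch of the update above (assuming zero Dirichlet boundary values and the h² factor already folded into b; array names are mine, not the slides'):

    import numpy as np

    def jacobi(b, iters):
        """Jacobi iteration for the 2D Poisson equation on an n-by-n grid."""
        n = b.shape[0]
        u = np.zeros((n + 2, n + 2))              # pad with boundary zeros
        for _ in range(iters):
            u_new = u.copy()
            u_new[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1] +
                                 u[1:-1, :-2] + u[1:-1, 2:] + b) / 4.0
            u = u_new
        return u[1:-1, 1:-1]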
17 Convergence of Nearest Neighbor Methods
- Jacobi's method involves nearest-neighbor computation on an n x n grid (N = n²)
- So it takes O(n) = O(sqrt(N)) iterations for information to propagate
- E.g., consider a right-hand side b that is 0, except the center is 1
- The exact solution looks like: (figure omitted)
Even in the best case, any nearest neighbor computation will take n/2 steps to propagate across an n x n grid.
18 Convergence of Nearest Neighbor Methods (continued)
19 Parallelizing Jacobi's Method
- Reduces to sparse-matrix-vector multiply by (nearly) T
- U(m+1) = (I - T/4) * U(m) + B/4
- Each value of U(m+1) may be updated independently
- keep 2 copies, for iterates m and m+1
- Requires that boundary values be communicated
- if each processor owns n²/p elements to update
- the amount of data communicated, n/p^(1/2) per neighbor, is relatively small if n >> p
20 Locality Optimization in Jacobi
- Nearest neighbor update in Jacobi has
- Good spatial locality (for a regular mesh): traverse rows/columns
- Poor temporal locality: few flops per mesh point
- E.g., on a 2D mesh: 4 adds and 1 multiply for 5 loads and 1 store
- For both parallel and single processors, may trade off extra computation for reduced memory operations
- Idea: for each subgrid, do multiple Jacobi iterations
- Size of subgrid shrinks on each iteration, so start with a larger one
- Used for uniprocessors:
- Rescheduling for Locality, Michelle Mills Strout
- www-unix.mcs.anl.gov/~mstrout
- Cache-efficient multigrid algorithms, Sid Chatterjee
- And parallel machines, including tuning for grids
- Ian Foster et al, SC2001
21 Redundant Ghost Nodes in Jacobi
- Overview of Memory Hierarchy Optimization
- Can be used on unstructured meshes
- Size of ghost region (and redundant computation) depends on network/memory speed vs. computation
- (Figure: to compute the green region, copy the yellow ghost cells from the neighbor, then compute the blue cells.)
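A sketch of one Jacobi step with ghost-row exchange, assuming mpi4py and a 1D block-row decomposition (the figure's "copy yellow, compute blue"; names and the decomposition are my assumptions, not from the slides):

    import numpy as np
    from mpi4py import MPI

    def jacobi_step(u, b, comm):
        """u holds this processor's rows plus one ghost row above and below
        (and boundary columns); b holds h^2*f for the owned interior cells."""
        rank, size = comm.Get_rank(), comm.Get_size()
        up = rank - 1 if rank > 0 else MPI.PROC_NULL
        down = rank + 1 if rank < size - 1 else MPI.PROC_NULL
        # exchange ghost rows with neighbors ("copy yellow")
        comm.Sendrecv(u[1], dest=up, recvbuf=u[-1], source=down)
        comm.Sendrecv(u[-2], dest=down, recvbuf=u[0], source=up)
        # purely local nearest-neighbor update ("compute blue")
        u_new = u.copy()
        u_new[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1] +
                             u[1:-1, :-2] + u[1:-1, 2:] + b) / 4.0
        return u_new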
22 Successive Overrelaxation (SOR)
- Similar to Jacobi: u(i,j,m+1) is computed as a linear combination of neighbors
- Numeric coefficients and update order are different
- Based on 2 improvements over Jacobi:
- Use the most recent values of u that are available, since these are probably more accurate
- Update the value of u(m+1) more aggressively at each step
- First, note that while evaluating sequentially
- u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + ...
- some of the values for m+1 are already available:
- u(i,j,m+1) = (u(i-1,j,latest) + u(i+1,j,latest) + ...
- where latest is either m or m+1
23 Gauss-Seidel
- Updating left-to-right in row-wise order, we get the Gauss-Seidel algorithm
- for i = 1 to n
-   for j = 1 to n
-     u(i,j,m+1) = (u(i-1,j,m+1) + u(i+1,j,m) + u(i,j-1,m+1) + u(i,j+1,m) + b(i,j)) / 4
- Cannot be parallelized, because of dependencies, so instead we use a red-black order
- forall black points u(i,j)
-   u(i,j,m+1) = (u(i-1,j,m) + ...
- forall red points u(i,j)
-   u(i,j,m+1) = (u(i-1,j,m+1) + ...
- For a general graph, use graph coloring
- Can use repeated Maximal Independent Sets to color
- Graph(T) is bipartite => 2-colorable (red and black)
- Nodes of each color can be updated simultaneously
- Still sparse-matrix-vector multiply, using submatrices
24 Successive Overrelaxation (SOR)
- Red-black Gauss-Seidel converges twice as fast as Jacobi, but there are twice as many parallel steps, so it is about the same in practice
- To motivate the next improvement, write the basic step in the algorithm as
- u(i,j,m+1) = u(i,j,m) + correction(i,j,m)
- If correction is a good direction to move, then one should move even further in that direction by some factor w > 1
- u(i,j,m+1) = u(i,j,m) + w * correction(i,j,m)
- Called successive overrelaxation (SOR)
- Parallelizes like Jacobi (still sparse-matrix-vector multiply)
- Can prove w = 2/(1 + sin(π/(n+1))) for best convergence
- Number of steps to converge = parallel complexity = O(n), instead of O(n²) for Jacobi
- Serial complexity O(n³) = O(N^(3/2)), instead of O(n⁴) = O(N²) for Jacobi
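A NumPy sketch combining the red-black ordering of the previous slide with the SOR factor w (same model-problem assumptions as the Jacobi sketch above; w = 1 gives red-black Gauss-Seidel):

    import numpy as np

    def red_black_sor(b, w, iters):
        """Red-black SOR for the 2D Poisson model problem."""
        n = b.shape[0]
        u = np.zeros((n + 2, n + 2))
        i, j = np.indices((n, n))
        red, black = (i + j) % 2 == 0, (i + j) % 2 == 1
        for _ in range(iters):
            for color in (red, black):          # black sweep sees updated red values
                avg = (u[:-2, 1:-1] + u[2:, 1:-1] +
                       u[1:-1, :-2] + u[1:-1, 2:] + b) / 4.0
                interior = u[1:-1, 1:-1]        # a view into u, updated in place
                interior[color] += w * (avg[color] - interior[color])
        return u[1:-1, 1:-1]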
25 Conjugate Gradient (CG) for solving A*x = b
- This method can be used when the matrix A is
- symmetric, i.e., A = A^T
- positive definite, defined equivalently as:
- all eigenvalues are positive
- x^T A x > 0 for all nonzero vectors x
- a Cholesky factorization A = L*L^T exists
- Algorithm maintains 3 vectors:
- x = the approximate solution, improved after each iteration
- r = the residual, r = b - A*x
- p = search direction, also called the conjugate gradient
- One iteration costs:
- Sparse-matrix-vector multiply by A (major cost)
- 2 dot products, 3 saxpys (scale*vector + vector)
- Converges in O(n) = O(N^(1/2)) steps, like SOR
- Serial complexity = O(N^(3/2))
- Parallel complexity = O(N^(1/2) log N); the log N factor comes from the dot products
26 Conjugate Gradient Algorithm for Solving A*x = b
- Initial guess x
- r = b - A*x, j = 1
- Repeat
-   rho = r^T * r                  ... dot product
-   if j = 1, p = r, else beta = rho/old_rho, p = r + beta*p, endif   ... saxpy
-   q = A*p                        ... sparse matrix-vector multiply
-   alpha = rho / (p^T * q)        ... dot product
-   x = x + alpha*p                ... saxpy
-   r = r - alpha*q                ... saxpy
-   old_rho = rho; j = j + 1
- Until rho small enough
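The pseudocode above in runnable form (a NumPy sketch; A may be dense or a scipy.sparse matrix, anything supporting A @ p):

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-10, maxiter=1000):
        x = np.zeros_like(b)                  # initial guess x = 0
        r = b - A @ x                         # residual
        p = r.copy()                          # first search direction
        rho = r @ r
        for _ in range(maxiter):
            q = A @ p                         # sparse matrix-vector multiply (major cost)
            alpha = rho / (p @ q)             # dot product
            x += alpha * p                    # saxpy
            r -= alpha * q                    # saxpy
            old_rho, rho = rho, r @ r         # dot product
            if rho < tol**2:
                break
            p = r + (rho / old_rho) * p       # saxpy
        return x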
27 Templates for choosing an iterative algorithm for A*x = b
- Use only matrix-vector multiplication, dot products, saxpys, ...
- www.netlib.org/templates
(Figure: decision tree from the Templates book, selecting a method via yes/no questions about the matrix, e.g., symmetry and definiteness.)
28 Summary of Jacobi, SOR and CG
- Jacobi, SOR, and CG all perform sparse-matrix-vector multiply
- For Poisson, this means nearest-neighbor communication on an n-by-n grid
- It takes n = N^(1/2) steps for information to travel across an n-by-n grid
- Since the solution on one side of the grid depends on data on the other side, faster methods require faster ways to move information:
- FFT
- Multigrid
29 Solving the Poisson equation with the FFT
- Motivation: express the continuous solution as a Fourier series
- u(x,y) = Σ_i Σ_k u_ik sin(π ix) sin(π ky)
- u_ik are called the Fourier coefficients of u(x,y)
- Poisson's equation ∂²u/∂x² + ∂²u/∂y² = b becomes
- Σ_i Σ_k (-π²i² - π²k²) u_ik sin(π ix) sin(π ky) = Σ_i Σ_k b_ik sin(π ix) sin(π ky)
- where b_ik are the Fourier coefficients of b(x,y)
- By uniqueness of Fourier series, u_ik = b_ik / (-π²i² - π²k²)
- Continuous Algorithm (Discrete Algorithm):
- Compute the Fourier coefficients b_ik of the right-hand side
- (Apply 2D FFT to the values of b(i,k) on the grid)
- Compute the Fourier coefficients u_ik of the solution
- (Divide each transformed b(i,k) by function(i,k))
- Compute the solution u(x,y) from its Fourier coefficients
- (Apply 2D inverse FFT to the values of b(i,k))
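For the discrete model problem this becomes a fast sine transform in each dimension: the eigenvectors of the matrix T from slide 8 are exactly sine modes. A sketch using SciPy's DST (assuming scipy.fft is available; b is the n-by-n right-hand side of the linear system, with zero Dirichlet boundaries):

    import numpy as np
    from scipy.fft import dstn, idstn

    def poisson_fft_solve(b):
        """Solve the n^2-by-n^2 2D Poisson system of slide 8 in O(n^2 log n)."""
        n = b.shape[0]
        j = np.arange(1, n + 1)
        lam = 2.0 - 2.0 * np.cos(np.pi * j / (n + 1))   # eigenvalues of the 1D T
        bhat = dstn(b, type=1)                          # expand b in sine modes (2D FFT step)
        uhat = bhat / (lam[:, None] + lam[None, :])     # divide by each mode's eigenvalue
        return idstn(uhat, type=1)                      # inverse transform back to grid values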
30 Serial FFT
- Let i = sqrt(-1) and index matrices and vectors from 0.
- The Discrete Fourier Transform of an m-element vector v is F*v
- where F is the m-by-m matrix defined as
- F[j,k] = ω^(j*k)
- and ω is
- ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
- ω is a complex number whose m-th power is ω^m = 1, and it is therefore called an m-th root of unity
- E.g., for m = 4: ω = i, ω² = -1, ω³ = -i, ω⁴ = 1
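The m = 4 example, checked directly from the definition (a small NumPy sketch):

    import numpy as np

    m = 4
    omega = np.exp(2j * np.pi / m)              # omega = cos(2*pi/m) + i*sin(2*pi/m) = i
    print(np.round(omega ** np.arange(1, 5)))   # i, -1, -i, 1, as listed above
    j, k = np.indices((m, m))
    F = omega ** (j * k)                        # the m-by-m DFT matrix F[j,k] = omega^(j*k)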
31 Using the 1D FFT for filtering
- Signal = sin(7t) + .5 sin(5t) at 128 points
- Noise = random number bounded by .75
- Filter by zeroing out FFT components < .25
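A sketch of this experiment (the threshold normalization is my assumption; the slide shows only the plots):

    import numpy as np

    t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
    signal = np.sin(7 * t) + 0.5 * np.sin(5 * t)
    noisy = signal + 0.75 * (2 * np.random.rand(128) - 1)  # noise bounded by .75
    F = np.fft.fft(noisy)
    F[np.abs(F) < 0.25 * np.abs(F).max()] = 0              # zero out small components
    filtered = np.fft.ifft(F).real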
32 Using the 2D FFT for image compression
- Image = 200 x 320 matrix of values
- Compress by keeping the largest 2.5% of FFT components
- A similar idea is used by JPEG
33 Related Transforms
- Most applications require multiplication by both F and inverse(F).
- Multiplying by F and inverse(F) are essentially the same. (inverse(F) is the complex conjugate of F divided by m.)
- For solving the Poisson equation and various other applications, we use variations on the FFT:
- The sin transform -- imaginary part of F
- The cos transform -- real part of F
- Algorithms are similar, so we will focus on the forward FFT.
34 Serial Algorithm for the FFT
- Compute the FFT of an m-element vector v, F*v:
- (F*v)[j] = Σ_{k=0}^{m-1} F[j,k] * v[k]
-          = Σ_{k=0}^{m-1} ω^(j*k) * v[k]
-          = Σ_{k=0}^{m-1} (ω^j)^k * v[k]
-          = V(ω^j)
- where V is defined as the polynomial
- V(x) = Σ_{k=0}^{m-1} x^k * v[k]
35 Divide and Conquer FFT
- V can be evaluated using divide-and-conquer:
- V(x) = Σ_{k=0}^{m-1} x^k * v[k]
-      = v[0] + x²v[2] + x⁴v[4] + ...
-        + x(v[1] + x²v[3] + x⁴v[5] + ...)
-      = Veven(x²) + x * Vodd(x²)
- V has degree m-1, so Veven and Vodd are polynomials of degree m/2 - 1
- We evaluate these at the points (ω^j)² for 0 ≤ j ≤ m-1
- But these are really just m/2 different points, since
- (ω^(j + m/2))² = (ω^j * ω^(m/2))² = ω^(2j) * ω^m = (ω^j)²
- So the FFT on m points is reduced to 2 FFTs on m/2 points
- Divide and conquer!
36 Divide-and-Conquer FFT
- FFT(v, ω, m)
-   if m = 1, return v[0]
-   else
-     v_even = FFT(v[0:2:m-2], ω², m/2)
-     v_odd  = FFT(v[1:2:m-1], ω², m/2)
-     ω_vec = [ω⁰, ω¹, ..., ω^(m/2 - 1)]   (precomputed)
-     return [v_even + (ω_vec .* v_odd),
-             v_even - (ω_vec .* v_odd)]
- The ".*" above is component-wise multiplication.
- The "[ , ]" is construction of an m-element vector from two m/2-element vectors.
- This results in an O(m log m) algorithm.
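The same algorithm in runnable form (a NumPy sketch; note the slides' convention ω = e^(2πi/m) is the conjugate of NumPy's, so this fft(v) equals m * np.fft.ifft(v)):

    import numpy as np

    def fft(v):
        """Recursive radix-2 FFT; len(v) must be a power of 2."""
        m = len(v)
        if m == 1:
            return np.asarray(v, dtype=complex)
        v_even = fft(v[0::2])                              # FFT of v[0], v[2], ..., v[m-2]
        v_odd = fft(v[1::2])                               # FFT of v[1], v[3], ..., v[m-1]
        w = np.exp(2j * np.pi * np.arange(m // 2) / m)     # omega^0, ..., omega^(m/2-1)
        t = w * v_odd                                      # component-wise multiply
        return np.concatenate([v_even + t, v_even - t])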
37 An Iterative Algorithm
- The call tree of the divide-and-conquer FFT algorithm is a complete binary tree of log m levels
- Practical algorithms are iterative, going across each level in the tree, starting at the bottom
- The algorithm overwrites v[i] by (F*v)[bitreverse(i)]
(Figure: call tree for m = 16. FFT(0,1,2,...,15) = FFT(xxxx) splits into evens FFT(xxx0) = FFT(0,2,...,14) and odds FFT(xxx1) = FFT(1,3,...,15), then into FFT(xx00), FFT(xx10), FFT(xx01), FFT(xx11), and so on down to single elements; the leaves appear in bit-reversed order: FFT(0), FFT(8), FFT(4), FFT(12), FFT(2), FFT(10), FFT(6), FFT(14), FFT(1), FFT(9), FFT(5), FFT(13), FFT(3), FFT(11), FFT(7), FFT(15).)
38 Parallel 1D FFT
- Data dependencies in the 1D FFT form a butterfly pattern
- A PRAM algorithm takes O(log m) time
- each step to the right is parallel
- there are log m steps
- What about communication cost?
- See the LogP paper for details
39 Block Layout of 1D FFT
- Using a block layout (m/p contiguous elements per processor)
- No communication in the last log(m/p) steps
- Each of the first log(p) steps requires fine-grained communication
40 Cyclic Layout of 1D FFT
- Cyclic layout (elements dealt out one at a time, wrapped around the processors)
- No communication in the first log(m/p) steps
- Communication in the last log(p) steps
41 Parallel Complexity
- m = vector size, p = number of processors
- f = time per flop = 1
- α = startup for a message (in units of f)
- β = time per word in a message (in units of f)
- Time(blockFFT) = Time(cyclicFFT)
-   = 2*m*log(m)/p
-   + log(p) * α
-   + m*log(p)/p * β
42 FFT With Transpose
- If we start with a cyclic layout, the first log(m/p) steps require no communication
- Then transpose the vector for the last log(p) steps
- All communication is in the transpose
43 Why is the Communication Step Called a Transpose?
- Analogous to transposing an array
- View the vector as a 2D array of m/p by p
- Note: the same idea is useful for uniprocessor caches
44 Complexity of the FFT with Transpose
- If no communication is overlapped (an overestimate!):
- Time(transposeFFT)
-   = 2*m*log(m)/p          (same as before)
-   + (p-1) * α             (was log(p) * α)
-   + m*(p-1)/p² * β        (was m*log(p)/p * β)
- If communication is overlapped, so we do not pay for p-1 messages, the second term becomes simply α, rather than (p-1)*α.
- This is close to optimal. See the LogP paper for details.
- See also the following papers on the class resource page:
- A. Sahai, "Hiding Communication Costs in Bandwidth Limited FFT"
- R. Nishtala et al, "Optimizing bandwidth limited problems using one-sided communication"
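A toy evaluation of the two models above (machine parameters are made up for illustration):

    import math

    def t_block(m, p, alpha, beta):
        return 2 * m * math.log2(m) / p + math.log2(p) * alpha + m * math.log2(p) / p * beta

    def t_transpose(m, p, alpha, beta):
        return 2 * m * math.log2(m) / p + (p - 1) * alpha + m * (p - 1) / p**2 * beta

    for p in (16, 64, 256):   # the transpose version wins until (p-1)*alpha dominates
        print(p, t_block(2**20, p, 1000.0, 10.0), t_transpose(2**20, p, 1000.0, 10.0))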
45 Comment on the 1D Parallel FFT
- The above algorithm leaves data in bit-reversed order
- Some applications can use it this way, like Poisson
- Others require another transpose-like operation
- Other parallel algorithms also exist
- A very different 1D FFT is due to Edelman (see http://www-math.mit.edu/~edelman)
- Based on the Fast Multipole algorithm
- Less communication for the non-bit-reversed algorithm
46 Higher Dimension FFTs
- FFTs on 2 or 3 dimensions are defined as 1D FFTs on vectors in all dimensions.
- E.g., a 2D FFT does 1D FFTs on all rows and then on all columns
- There are 3 obvious possibilities for the 2D FFT:
- (1) 2D blocked layout for the matrix, using 1D algorithms for each row and column
- (2) Block row layout for the matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs
- (3) Block row layout for the matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns
- Option 2 is best, if we overlap communication and computation (see the sketch below)
- For a 3D FFT the options are similar
- 2 phases done with serial FFTs, followed by a transpose for the 3rd
- can overlap communication with the 2nd phase in practice
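A serial NumPy sketch of option (2): in parallel, the row FFTs are local and the transpose is the only communication (the function name is mine):

    import numpy as np

    def fft2_by_transpose(A):
        B = np.fft.fft(A, axis=1)     # serial 1D FFTs on each (block of) rows
        B = B.T                       # transpose: the communication step in parallel
        B = np.fft.fft(B, axis=1)     # 1D FFTs on what were the columns
        return B.T                    # transpose back to the original orientation

    A = np.random.rand(8, 8)
    assert np.allclose(fft2_by_transpose(A), np.fft.fft2(A))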
47 FFTW: Fastest Fourier Transform in the West
- www.fftw.org
- Produces an FFT implementation optimized for
- Your version of the FFT (complex, real, ...)
- Your value of n (arbitrary, possibly prime)
- Your architecture
- Close to optimal for serial, can be improved for parallel
- Similar in spirit to PHIPAC/ATLAS/Sparsity
- Won the 1999 Wilkinson Prize for Numerical Software
- Widely used
48 Extra Slides
49 Poisson's equation arises in many models
- Heat flow: Temperature(position, time)
- Diffusion: Concentration(position, time)
- Electrostatic or Gravitational Potential: Potential(position)
- Fluid flow: Velocity, Pressure, Density(position, time)
- Quantum mechanics: Wave-function(position, time)
- Elasticity: Stress, Strain(position, time)
50 Poisson's equation in 1D

        [  2 -1          ]
        [ -1  2 -1       ]
    T = [    -1  2 -1    ]
        [       -1  2 -1 ]
        [          -1  2 ]

Graph and stencil: each node is connected to its two neighbors; the stencil is (-1, 2, -1).
51 Algorithms for 2D Poisson Equation with N unknowns

    Algorithm        Serial       PRAM            Memory    #Procs
    Dense LU         N³           N               N²        N²
    Band LU          N²           N               N^(3/2)   N
    Jacobi           N²           N               N         N
    Explicit Inv.    N²           log N           N²        N²
    Conj. Grad.      N^(3/2)      N^(1/2) log N   N         N
    RB SOR           N^(3/2)      N^(1/2)         N         N
    Sparse LU        N^(3/2)      N^(1/2)         N log N   N
    FFT              N log N      log N           N         N
    Multigrid        N            log² N          N         N
    Lower bound      N            log N           N

- PRAM is an idealized parallel model with zero-cost communication
52 Short explanations of algorithms on previous slide
- Sorted in two orders (roughly):
- from slowest to fastest on sequential machines
- from most general (works on any matrix) to most specialized (works on matrices like Poisson)
- Dense LU: Gaussian elimination; works on any N-by-N matrix
- Band LU: exploits the fact that T is nonzero only on the sqrt(N) diagonals nearest the main diagonal, so it is faster
- Jacobi: essentially does matrix-vector multiply by T in the inner loop of an iterative algorithm
- Explicit Inverse: assume we want to solve many systems with T, so we can precompute and store inv(T) "for free", and just multiply by it
- It's still expensive!
- Conjugate Gradients: uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not
- Red-Black SOR (Successive Overrelaxation): variation of Jacobi that exploits yet different mathematical properties of T
- Used in Multigrid
- Sparse LU: Gaussian elimination exploiting the particular zero structure of T
- FFT (Fast Fourier Transform): works only on matrices very like T
- Multigrid: also works on matrices like T, that come from elliptic PDEs
- Lower Bound: serial (time to print answer); parallel (time to combine N inputs)
- Details in class notes and www.cs.berkeley.edu/~demmel/ma221
53 Irregular mesh: Tapered Tube (multigrid)
54 Composite mesh from a mechanical structure