Title: Minimizing Communication in Numerical Linear Algebra (www.cs.berkeley.edu/~demmel)
1. Minimizing Communication in Numerical Linear Algebra: Optimizing Krylov Subspace Methods
www.cs.berkeley.edu/~demmel
Jim Demmel, EECS and Math Departments, UC Berkeley
demmel_at_cs.berkeley.edu
2. Outline of Lecture 8: Optimizing Krylov Subspace Methods
- k steps of a typical iterative solver for Ax = b or Ax = λx
  - Does k SpMVs with a starting vector (e.g., with b, if solving Ax = b); a minimal sketch of this k-SpMV loop follows after this list
  - Finds the best solution among all linear combinations of these k+1 vectors
  - Many such Krylov Subspace Methods: Conjugate Gradients, GMRES, Lanczos, Arnoldi, ...
- Goal: minimize communication in Krylov Subspace Methods
  - Assume the matrix is well-partitioned, with modest surface-to-volume ratio
  - Parallel implementation
    - Conventional: O(k log p) messages, because of k calls to SpMV
    - New: O(log p) messages - optimal
  - Serial implementation
    - Conventional: O(k) moves of data from slow to fast memory
    - New: O(1) moves of data - optimal
  - Lots of speedup possible (modeled and measured)
    - Price: some redundant computation
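A minimal sketch of that k-SpMV loop, in Python with NumPy/SciPy (illustrative names and sizes, not code from the lecture):

    import numpy as np
    import scipy.sparse as sp

    def krylov_vectors(A, b, k):
        """Return [b, A b, A^2 b, ..., A^k b]: the k SpMVs at the heart of a
        typical Krylov solver, which then picks its best linear combination
        of these k+1 vectors."""
        V = [b]
        for _ in range(k):
            V.append(A @ V[-1])   # one SpMV per step: k sweeps over A
        return V

    # Toy usage: 1D Poisson (tridiagonal) matrix, k = 4.
    n, k = 8, 4
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    V = krylov_vectors(A, np.ones(n), k)

Done naively, each loop iteration is a separate SpMV, which is exactly what costs O(k log p) messages in parallel and k sweeps over A sequentially.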
3. Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors
[Figure: entries of x and Ax on Proc 1 and Proc 2 that can be computed without communication]
4. Locally Dependent Entries for [x, Ax, A^2 x], A tridiagonal, 2 processors
[Figure: entries on Proc 1 and Proc 2 that can be computed without communication]
5. Locally Dependent Entries for [x, Ax, ..., A^3 x], A tridiagonal, 2 processors
[Figure: entries on Proc 1 and Proc 2 that can be computed without communication]
6. Locally Dependent Entries for [x, Ax, ..., A^4 x], A tridiagonal, 2 processors
[Figure: entries on Proc 1 and Proc 2 that can be computed without communication]
7. Locally Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
[Figure: entries on Proc 1 and Proc 2 that can be computed without communication, giving k = 8-fold reuse of A]
8. Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
[Figure: the "ghost" entries each of Proc 1 and Proc 2 needs from the other]
One message suffices to get the data needed to compute the remotely dependent entries, not k = 8 messages. This minimizes the number of messages = latency cost. Price: redundant work ∝ surface/volume ratio.
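The idea above, as a runnable sketch for a tridiagonal A (Python/NumPy; the function and argument names are assumptions for illustration, not the actual implementation):

    import numpy as np

    def tridiag_matvec(sub, main, sup, v):
        # y = T v for tridiagonal T given by its sub-, main-, and
        # super-diagonals (sub and sup have length len(v) - 1).
        y = main * v
        y[1:] += sub * v[:-1]
        y[:-1] += sup * v[1:]
        return y

    def akx_local(sub, main, sup, v_ext, k):
        """Owned slices of [A v, A^2 v, ..., A^k v] from ONE ghost exchange.

        v_ext = [k left ghost values | owned values | k right ghost values],
        received in a single message per neighbor; sub/main/sup are the
        diagonals of the extended local block of A. Each matvec corrupts one
        more entry at each end (data outside the extension is missing), so
        after j <= k steps the middle entries are still exact, and after k
        steps exactly the owned block survives, with no further communication.
        """
        v, powers = v_ext.copy(), []
        for _ in range(k):
            v = tridiag_matvec(sub, main, sup, v)
            powers.append(v[k : len(v) - k].copy())   # the owned entries
        return powers

The entries recomputed inside the ghost overlap are the "price: redundant work" noted on the slide.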
9-10. Spyplot of A with sparsity pattern of a 5-point stencil, natural order
[Figure: spy plot]
11. Spyplot of A with sparsity pattern of a 5-point stencil, natural order, assigned to p = 9 processors
[Figure: spy plot with the processor partition overlaid]
For an n x n mesh assigned to p processors in this order, each processor needs 2n data from its neighbors.
12-13. Spyplot of A with sparsity pattern of a 5-point stencil, nested dissection order
[Figure: spy plot]
For an n x n mesh assigned to p processors, each processor needs 4n/p^(1/2) data from its neighbors.
14-15. Remotely Dependent Entries for [x, Ax, ..., A^3 x], A with sparsity pattern of a 5-point stencil
[Figure: mesh regions each processor must fetch, growing by one layer per power]
16. Remotely Dependent Entries for [x, Ax, A^2 x, A^3 x], A irregular, multiple processors
[Figure: partitioned irregular matrix graph]
17. (Recap) Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
[Figure: as on slide 8]
One message suffices to get the data needed to compute the remotely dependent entries, not k = 8 messages. This minimizes the number of messages = latency cost. Price: redundant work ∝ surface/volume ratio.
18. Reducing redundant work for a tridiagonal matrix
[Figure]
19. Summary of Parallel Optimizations so far, for the "Akx" kernel: (A, x, k) -> [Ax, A^2 x, ..., A^k x]
- Possible to reduce messages from O(k) to O(1)
  - Depends on the matrix being well-partitioned: each processor only needs data from a few neighbors
- Price: redundant computation of some entries of A^j x
  - Amount of redundant work depends on the surface-to-volume ratio of the partition: ideally each processor needs only a little data from its neighbors, compared to what it owns itself
- Best case: tridiagonal (1D mesh): need O(1) data from neighbors, versus O(n/p) data locally; flops grow by a factor 1 + O(k/(n/p)) = 1 + O(k/data_per_proc) ≈ 1
- Pretty good: 2D mesh (n x n): need O(n/p^(1/2)) data from neighbors, versus O(n^2/p) locally; flops grow by 1 + O(k/(data_per_proc)^(1/2))
- Less good: 3D mesh (n x n x n): need O(n^2/p^(2/3)) data from neighbors, versus O(n^3/p) locally; flops grow by 1 + O(k/(data_per_proc)^(1/3))
(These growth factors are evaluated numerically below.)
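A back-of-envelope evaluation of these growth factors (constants dropped; the sizes are assumed, purely for illustration):

    # Redundant-flops factor 1 + O(k / data_per_proc^(1/d)) on a d-dimensional mesh.
    def flop_growth(k, data_per_proc, d):
        return 1.0 + k / data_per_proc ** (1.0 / d)

    for d, mesh in [(1, "1D (tridiagonal)"), (2, "2D"), (3, "3D")]:
        print(f"{mesh}: {flop_growth(8, 10**6, d):.6f}")
    # 1D: 1.000008   2D: 1.008000   3D: 1.080000

With k = 8 and 10^6 entries per processor, the overhead is negligible in 1D and still small in 3D, matching the slide's ranking.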
20-22. Model Petascale machine, [Ax, A^2 x, ..., A^k x] on a 2D (n x n) mesh
[Figures, as functions of k and log2 n: predicted speedup; predicted fraction of time spent doing arithmetic; predicted ratio of extra arithmetic]
23-25. Model Petascale machine, [Ax, A^2 x, ..., A^k x] on a 3D (n x n x n) mesh
[Figures, as functions of k and log2 n: predicted speedup; predicted fraction of time spent doing arithmetic; predicted ratio of extra arithmetic]
26. Minimizing messages versus words_moved
- Parallel case
  - Can't reduce the words needed from other processors
  - They are required to get the right answer
- Sequential case
  - Can reduce the words needed from slow memory
  - Assume A and x are too large to fit in fast memory
  - Naive implementation of [Ax, A^2 x, ..., A^k x] by k calls to SpMV: need to move A and x between fast and slow memory k times
  - We will move them ≈ one time: clearly optimal (see the word-count model below)
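A rough word-count model of that claim (assumed sizes; matrix and vector entries counted as words):

    # Words moved between slow and fast memory for [Ax, A^2 x, ..., A^k x].
    nnz, n, k = 10**8, 10**6, 8       # assumed matrix/vector sizes
    naive    = k * (nnz + 2 * n)      # k SpMV sweeps: reread A, read/write a vector
    streamed = nnz + (k + 1) * n      # read A once; each vector touched O(1) times
    print(naive / streamed)           # ~7.5, i.e. close to k = 8 when A dominates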
27. Sequential [x, Ax, ..., A^4 x], with memory hierarchy
[Figure: sweeping once over the matrix blocks]
One read of the matrix from slow memory, not k = 4 reads. Minimizes words moved = bandwidth cost. No redundant work.
28. Remotely Dependent Entries for [x, Ax, ..., A^3 x], A of dimension 100 x 100 with bandwidth 2; only 25% of A and the vectors fit in fast memory
[Figure]
29. In what order should the sequential algorithm process a general sparse matrix?
- For a band matrix, we obviously want to process the matrix from left to right, since the data we need for the next partition is already in fast memory
- Not obvious what the best order is for a general matrix
- We can formulate the question of finding the best order as a Traveling Salesman Problem (TSP)
  - One vertex per matrix partition
  - Weight of edge (j, k) is the memory cost of processing partition k right after partition j
  - TSP: find the lowest-cost tour visiting all vertices (a toy brute-force version appears below)
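A toy version of this formulation (brute force over orderings; a real instance would use a TSP heuristic, since enumeration is factorial):

    import itertools

    def best_processing_order(cost):
        """cost[j][k] = memory cost of processing partition k right after j.
        Returns the cheapest order visiting all partitions (a path, not a
        closed tour). Fine for a handful of partitions, hopeless beyond ~10."""
        n = len(cost)
        def path_cost(order):
            return sum(cost[a][b] for a, b in zip(order, order[1:]))
        return min(itertools.permutations(range(n)), key=path_cost)

    # Toy usage: 4 partitions with an asymmetric cost matrix.
    cost = [[0, 3, 9, 9], [9, 0, 2, 9], [9, 9, 0, 1], [4, 9, 9, 0]]
    print(best_processing_order(cost))   # (0, 1, 2, 3) for this cost matrix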
30. What about multicore?
- Two kinds of communication to minimize
  - Between processors on the chip
  - Between on-chip cache and off-chip DRAM
- Use a hybrid of both techniques described so far
  - Use the parallel optimization so each core can work independently
  - Use the sequential optimization to minimize each core's off-chip DRAM traffic
31. Speedups on Intel Clovertown (8 core)
[Figure: measured speedups]
Test matrices include stencils and practical matrices. See the SC09 paper on bebop.cs.berkeley.edu for details.
32. Minimizing Communication of GMRES to solve Ax = b
- GMRES: find x in span{b, Ab, ..., A^k b} minimizing ||Ax - b||_2
- Cost of k steps of standard GMRES vs. new GMRES:

Standard GMRES:
    for i = 1 to k
        w = A · v(i-1)
        MGS(w, v(0), ..., v(i-1))   ... Modified Gram-Schmidt; updates v(i), H
    endfor
    solve LSQ problem with H
    Sequential words_moved: O(k · nnz) from SpMV + O(k^2 · n) from MGS
    Parallel messages: O(k) from SpMV + O(k^2 log p) from MGS

Communication-Avoiding GMRES (CA-GMRES):
    W = [v, Av, A^2 v, ..., A^k v]
    [Q, R] = TSQR(W)   ... "Tall Skinny QR" (a one-level sketch follows below)
    build H from R, solve LSQ problem
    Sequential words_moved: O(nnz) from SpMV + O(k · n) from TSQR
    Parallel messages: O(1) from computing W + O(log p) from TSQR

- Oops: W is built by the power method, precision lost!
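A one-level sketch of the TSQR reduction with NumPy (the real kernel uses a binary tree over processors to reach the O(log p) message count):

    import numpy as np

    def tsqr_r(W, nblocks=4):
        """Tall Skinny QR, reduction form: QR-factor each row block locally,
        then QR-factor the stack of small R factors. Only the small R's move
        between "processors". Returns R only (Q stays implicit), which is
        all CA-GMRES needs to build H."""
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(B, mode="r") for B in blocks]   # independent local QRs
        return np.linalg.qr(np.vstack(Rs), mode="r")       # combine the R factors

    # Sanity check: agrees with a direct QR up to the signs of R's rows.
    W = np.random.default_rng(1).standard_normal((1000, 5))
    assert np.allclose(np.abs(tsqr_r(W)), np.abs(np.linalg.qr(W, mode="r")))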
34. How to make CA-GMRES Stable?
- Use a different polynomial basis for the Krylov subspace
- Not the monomial basis W = [v, Av, A^2 v, ...]; instead [v, p1(A)v, p2(A)v, ...]
- Possible choices (a sketch of building the Newton basis follows below):
  - Newton basis: W_N = [v, (A - θ1 I)v, (A - θ2 I)(A - θ1 I)v, ...], where the shifts θi are chosen as approximate eigenvalues from Arnoldi (using the same Krylov subspace, so free)
  - Chebyshev basis: W_C = [v, T1(A)v, T2(A)v, ...], where the Tk(z) are chosen to be small in the region of the complex plane containing the large eigenvalues
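A minimal sketch of building the Newton basis (shift selection is left to the caller, e.g. Ritz values from a preliminary Arnoldi run, as the slide suggests):

    import numpy as np

    def newton_basis(A, v, shifts):
        """W_N = [v, (A - θ1 I)v, (A - θ2 I)(A - θ1 I)v, ...].
        Subtracting approximate eigenvalues keeps successive vectors from
        all collapsing toward the dominant eigenvector, unlike the
        monomial basis."""
        W = [v]
        for theta in shifts:
            W.append(A @ W[-1] - theta * W[-1])   # apply (A - theta I)
        return np.column_stack(W)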
35. Monomial basis [Ax, ..., A^k x] fails to converge; Newton polynomial basis does converge
[Figure: convergence histories of CA-GMRES with the two bases]
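A small numerical illustration of why the monomial basis fails (toy diagonal A and equispaced stand-in shifts, both assumptions; exact numbers vary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 200, 12
    A = np.diag(np.linspace(1.0, 100.0, n))   # assumed toy spectrum
    v = rng.standard_normal(n)

    def cond_of(cols):
        # Normalize columns so scaling does not mask the loss of independence.
        return np.linalg.cond(np.column_stack([c / np.linalg.norm(c) for c in cols]))

    mono = [v]
    for _ in range(k):
        mono.append(A @ mono[-1])                  # v, Av, A^2 v, ...

    newt = [v]
    for t in np.linspace(1.0, 100.0, k):           # crude stand-ins for Ritz values
        newt.append(A @ newt[-1] - t * newt[-1])   # (A - t I) steps

    print(f"cond(monomial) = {cond_of(mono):.2e}")   # enormous: numerically rank deficient
    print(f"cond(newton)   = {cond_of(newt):.2e}")   # many orders of magnitude smaller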
36. Speedups on 8-core Clovertown
[Figure: measured speedups]
37. Summary of what is known (1/3), open problems
- GMRES
  - Can independently choose k to optimize speed and the restart length r to optimize convergence
  - Need to co-tune the Akx kernel and TSQR
  - Know how to use more stable polynomial bases
  - Proven speedups
- Can similarly reorganize other Krylov methods
  - Arnoldi and Lanczos, for Ax = λx and for Ax = λMx
  - Conjugate Gradients, for Ax = b
  - Biconjugate Gradients, for Ax = b
  - BiCGStab?
  - Other Krylov methods?
38. Summary of what is known (2/3), open problems
- Preconditioning: MAx = Mb
  - Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, ...] (a sketch of this kernel follows below)
  - Can think of this as the union of two of the earlier kernels: [x, (MA)x, (MA)^2 x, ..., (MA)^k x] and [x, Ax, (AM)Ax, (AM)^2 Ax, ..., (AM)^k Ax]
- For which preconditioners M can we minimize communication?
  - Easy: diagonal M
  - How about block-diagonal M?
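A sketch of generating this interleaved kernel (hypothetical helper; A and M can be anything supporting '@'):

    def precond_powers(A, M, x, k):
        """[x, Ax, MAx, AMAx, MAMAx, ...]: alternately apply A and M,
        producing the union of the (MA)^j x and (AM)^j Ax sequences
        noted above."""
        seq = [x]
        for i in range(2 * k):
            seq.append((A if i % 2 == 0 else M) @ seq[-1])
        return seq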
39-43. Examine [x, Ax, MAx, AMAx, MAMAx, ...] for A tridiagonal, M block-diagonal
[Figure sequence: the dependency pattern after each successive application of A or M]
44. Summary of what is known (3/3), open problems
- Preconditioning: MAx = Mb
  - Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, ...]
- For block-diagonal M, the matrix powers rapidly become dense, but the ranks of the off-diagonal blocks grow slowly
- Can take advantage of the low rank to minimize communication (sketched below):
  y_i = ((AM)^k)_{ij} · x_j = (U · V) · x_j = U · (V · x_j)
- Works (in principle) for Hierarchically Semi-Separable M
- How does it work in practice?
- For details (and open problems) see M. Hoemmen's PhD thesis
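The communication saving in that low-rank identity, as code (U and V assumed given by the preconditioner's structure):

    import numpy as np

    def apply_offdiag(U, V, x_j):
        """y_i = ((AM)^k)_{ij} x_j with the off-diagonal block factored as
        U @ V (rank r). Parenthesizing as U @ (V @ x_j) means only the
        length-r vector V @ x_j needs to be communicated, not the dense
        block or all of x_j."""
        return U @ (V @ x_j)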
45. Of things not said (much about)
- Other Communication-Avoiding (CA) dense factorizations
  - QR with column pivoting ("tournament pivoting")
  - Cholesky with diagonal pivoting
  - GE with complete pivoting
  - LDL^T? Maybe with complete pivoting?
- CA-Sparse factorizations
  - Cholesky, assuming good separators
  - Lower bounds from Lipton/Rose/Tarjan
- Matrix multiplication; anything else?
  - Strassen, and all linear algebra, with words_moved = O(n^ω / M^(ω/2 - 1))
  - Parallel?
- Extending CA-Krylov methods to the bottom solver in multigrid with Adaptive Mesh Refinement (AMR)
46. Summary
Time to redesign all dense and sparse linear algebra.
47. Extra slides