Title: Minimizing Communication in Numerical Linear Algebra (www.cs.berkeley.edu/~demmel)
1. Minimizing Communication in Numerical Linear Algebra: Optimizing Krylov Subspace Methods
www.cs.berkeley.edu/~demmel
Jim Demmel, EECS and Math Departments, UC Berkeley
demmel_at_cs.berkeley.edu
2. Outline of Lecture 8: Optimizing Krylov Subspace Methods
- k steps of a typical iterative solver for Ax = b or Ax = λx
  - Does k SpMVs with a starting vector (e.g., with b, if solving Ax = b); a minimal sketch of this k-SpMV loop follows after this list
  - Finds the best solution among all linear combinations of these k+1 vectors
  - Many such Krylov Subspace Methods: Conjugate Gradients, GMRES, Lanczos, Arnoldi, ...
- Goal: minimize communication in Krylov Subspace Methods
  - Assume the matrix is well-partitioned, with modest surface-to-volume ratio
  - Parallel implementation
    - Conventional: O(k log p) messages, because of k calls to SpMV
    - New: O(log p) messages - optimal
  - Serial implementation
    - Conventional: O(k) moves of data from slow to fast memory
    - New: O(1) moves of data - optimal
  - Lots of speedup possible (modeled and measured)
    - Price: some redundant computation
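A minimal sketch of that k-SpMV loop, in Python with NumPy/SciPy (illustrative names and sizes, not code from the lecture):

    import numpy as np
    import scipy.sparse as sp

    def krylov_vectors(A, b, k):
        """Return [b, A b, A^2 b, ..., A^k b]: the k SpMVs at the heart of a
        typical Krylov solver, which then picks its best linear combination
        of these k+1 vectors."""
        V = [b]
        for _ in range(k):
            V.append(A @ V[-1])   # one SpMV per step: k sweeps over A
        return V

    # Toy usage: 1D Poisson (tridiagonal) matrix, k = 4.
    n, k = 8, 4
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    V = krylov_vectors(A, np.ones(n), k)

Done naively, each loop iteration is a separate SpMV, which is exactly what costs O(k log p) messages in parallel and k sweeps over A sequentially.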
3. Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors
[Figure: entries of x and Ax on Proc 1 and Proc 2 that can be computed without communication]
4. Locally Dependent Entries for [x, Ax, A^2 x], A tridiagonal, 2 processors
[Figure: entries on Proc 1 and Proc 2 that can be computed without communication]
5. Locally Dependent Entries for [x, Ax, ..., A^3 x], A tridiagonal, 2 processors
[Figure: entries on Proc 1 and Proc 2 that can be computed without communication]
6. Locally Dependent Entries for [x, Ax, ..., A^4 x], A tridiagonal, 2 processors
[Figure: entries on Proc 1 and Proc 2 that can be computed without communication]
7. Locally Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
[Figure: entries on Proc 1 and Proc 2 that can be computed without communication, giving k = 8-fold reuse of A]
8. Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
[Figure: the "ghost" entries each of Proc 1 and Proc 2 needs from the other]
One message suffices to get the data needed to compute the remotely dependent entries, not k = 8 messages. This minimizes the number of messages = latency cost. Price: redundant work ∝ surface/volume ratio.
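The idea above, as a runnable sketch for a tridiagonal A (Python/NumPy; the function and argument names are assumptions for illustration, not the actual implementation):

    import numpy as np

    def tridiag_matvec(sub, main, sup, v):
        # y = T v for tridiagonal T given by its sub-, main-, and
        # super-diagonals (sub and sup have length len(v) - 1).
        y = main * v
        y[1:] += sub * v[:-1]
        y[:-1] += sup * v[1:]
        return y

    def akx_local(sub, main, sup, v_ext, k):
        """Owned slices of [A v, A^2 v, ..., A^k v] from ONE ghost exchange.

        v_ext = [k left ghost values | owned values | k right ghost values],
        received in a single message per neighbor; sub/main/sup are the
        diagonals of the extended local block of A. Each matvec corrupts one
        more entry at each end (data outside the extension is missing), so
        after j <= k steps the middle entries are still exact, and after k
        steps exactly the owned block survives, with no further communication.
        """
        v, powers = v_ext.copy(), []
        for _ in range(k):
            v = tridiag_matvec(sub, main, sup, v)
            powers.append(v[k : len(v) - k].copy())   # the owned entries
        return powers

The entries recomputed inside the ghost overlap are the "price: redundant work" noted on the slide.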
9-10. Spyplot of A with sparsity pattern of a 5-point stencil, natural order
[Figure: spy plot]
11. Spyplot of A with sparsity pattern of a 5-point stencil, natural order, assigned to p = 9 processors
[Figure: spy plot with the processor partition overlaid]
For an n x n mesh assigned to p processors in this order, each processor needs 2n data from its neighbors.
12-13. Spyplot of A with sparsity pattern of a 5-point stencil, nested dissection order
[Figure: spy plot]
For an n x n mesh assigned to p processors, each processor needs 4n/p^(1/2) data from its neighbors.
14-15. Remotely Dependent Entries for [x, Ax, ..., A^3 x], A with sparsity pattern of a 5-point stencil
[Figure: mesh regions each processor must fetch, growing by one layer per power]
16. Remotely Dependent Entries for [x, Ax, A^2 x, A^3 x], A irregular, multiple processors
[Figure: partitioned irregular matrix graph]
17. (Recap) Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
[Figure: as on slide 8]
One message suffices to get the data needed to compute the remotely dependent entries, not k = 8 messages. This minimizes the number of messages = latency cost. Price: redundant work ∝ surface/volume ratio.
18. Reducing redundant work for a tridiagonal matrix
[Figure]
19. Summary of Parallel Optimizations so far, for the "Akx" kernel: (A, x, k) -> [Ax, A^2 x, ..., A^k x]
- Possible to reduce messages from O(k) to O(1)
  - Depends on the matrix being well-partitioned: each processor only needs data from a few neighbors
- Price: redundant computation of some entries of A^j x
  - Amount of redundant work depends on the surface-to-volume ratio of the partition: ideally each processor needs only a little data from its neighbors, compared to what it owns itself
- Best case: tridiagonal (1D mesh): need O(1) data from neighbors, versus O(n/p) data locally; flops grow by a factor 1 + O(k/(n/p)) = 1 + O(k/data_per_proc) ≈ 1
- Pretty good: 2D mesh (n x n): need O(n/p^(1/2)) data from neighbors, versus O(n^2/p) locally; flops grow by 1 + O(k/(data_per_proc)^(1/2))
- Less good: 3D mesh (n x n x n): need O(n^2/p^(2/3)) data from neighbors, versus O(n^3/p) locally; flops grow by 1 + O(k/(data_per_proc)^(1/3))
(These growth factors are evaluated numerically below.)
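A back-of-envelope evaluation of these growth factors (constants dropped; the sizes are assumed, purely for illustration):

    # Redundant-flops factor 1 + O(k / data_per_proc^(1/d)) on a d-dimensional mesh.
    def flop_growth(k, data_per_proc, d):
        return 1.0 + k / data_per_proc ** (1.0 / d)

    for d, mesh in [(1, "1D (tridiagonal)"), (2, "2D"), (3, "3D")]:
        print(f"{mesh}: {flop_growth(8, 10**6, d):.6f}")
    # 1D: 1.000008   2D: 1.008000   3D: 1.080000

With k = 8 and 10^6 entries per processor, the overhead is negligible in 1D and still small in 3D, matching the slide's ranking.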
20-22. Model Petascale machine, [Ax, A^2 x, ..., A^k x] on a 2D (n x n) mesh
[Figures, as functions of k and log2 n: predicted speedup; predicted fraction of time spent doing arithmetic; predicted ratio of extra arithmetic]
23-25. Model Petascale machine, [Ax, A^2 x, ..., A^k x] on a 3D (n x n x n) mesh
[Figures, as functions of k and log2 n: predicted speedup; predicted fraction of time spent doing arithmetic; predicted ratio of extra arithmetic]
26. Minimizing messages versus words_moved
- Parallel case
  - Can't reduce the words needed from other processors
  - They are required to get the right answer
- Sequential case
  - Can reduce the words needed from slow memory
  - Assume A and x are too large to fit in fast memory
  - Naive implementation of [Ax, A^2 x, ..., A^k x] by k calls to SpMV: need to move A and x between fast and slow memory k times
  - We will move them ≈ one time: clearly optimal (see the word-count model below)
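A rough word-count model of that claim (assumed sizes; matrix and vector entries counted as words):

    # Words moved between slow and fast memory for [Ax, A^2 x, ..., A^k x].
    nnz, n, k = 10**8, 10**6, 8       # assumed matrix/vector sizes
    naive    = k * (nnz + 2 * n)      # k SpMV sweeps: reread A, read/write a vector
    streamed = nnz + (k + 1) * n      # read A once; each vector touched O(1) times
    print(naive / streamed)           # ~7.5, i.e. close to k = 8 when A dominates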
27. Sequential [x, Ax, ..., A^4 x], with memory hierarchy
[Figure: sweeping once over the matrix blocks]
One read of the matrix from slow memory, not k = 4 reads. Minimizes words moved = bandwidth cost. No redundant work.
28. Remotely Dependent Entries for [x, Ax, ..., A^3 x], A of dimension 100 x 100 with bandwidth 2; only 25% of A and the vectors fit in fast memory
[Figure]
29. In what order should the sequential algorithm process a general sparse matrix?
- For a band matrix, we obviously want to process the matrix from left to right, since the data we need for the next partition is already in fast memory
- Not obvious what the best order is for a general matrix
- We can formulate the question of finding the best order as a Traveling Salesman Problem (TSP)
  - One vertex per matrix partition
  - Weight of edge (j, k) is the memory cost of processing partition k right after partition j
  - TSP: find the lowest-cost tour visiting all vertices (a toy brute-force version appears below)
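A toy version of this formulation (brute force over orderings; a real instance would use a TSP heuristic, since enumeration is factorial):

    import itertools

    def best_processing_order(cost):
        """cost[j][k] = memory cost of processing partition k right after j.
        Returns the cheapest order visiting all partitions (a path, not a
        closed tour). Fine for a handful of partitions, hopeless beyond ~10."""
        n = len(cost)
        def path_cost(order):
            return sum(cost[a][b] for a, b in zip(order, order[1:]))
        return min(itertools.permutations(range(n)), key=path_cost)

    # Toy usage: 4 partitions with an asymmetric cost matrix.
    cost = [[0, 3, 9, 9], [9, 0, 2, 9], [9, 9, 0, 1], [4, 9, 9, 0]]
    print(best_processing_order(cost))   # (0, 1, 2, 3) for this cost matrix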
30. What about multicore?
- Two kinds of communication to minimize
  - Between processors on the chip
  - Between on-chip cache and off-chip DRAM
- Use a hybrid of both techniques described so far
  - Use the parallel optimization so each core can work independently
  - Use the sequential optimization to minimize each core's off-chip DRAM traffic
31. Speedups on Intel Clovertown (8 core)
[Figure: measured speedups]
Test matrices include stencils and practical matrices. See the SC09 paper on bebop.cs.berkeley.edu for details.
32. Minimizing Communication of GMRES to solve Ax = b
- GMRES: find x in span{b, Ab, ..., A^k b} minimizing ||Ax - b||_2
- Cost of k steps of standard GMRES vs. new GMRES:

Standard GMRES:
    for i = 1 to k
        w = A · v(i-1)
        MGS(w, v(0), ..., v(i-1))   ... Modified Gram-Schmidt; updates v(i), H
    endfor
    solve LSQ problem with H
    Sequential words_moved: O(k · nnz) from SpMV + O(k^2 · n) from MGS
    Parallel messages: O(k) from SpMV + O(k^2 log p) from MGS

Communication-Avoiding GMRES (CA-GMRES):
    W = [v, Av, A^2 v, ..., A^k v]
    [Q, R] = TSQR(W)   ... "Tall Skinny QR" (a one-level sketch follows below)
    build H from R, solve LSQ problem
    Sequential words_moved: O(nnz) from SpMV + O(k · n) from TSQR
    Parallel messages: O(1) from computing W + O(log p) from TSQR

- Oops: W is built by the power method, precision lost!
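A one-level sketch of the TSQR reduction with NumPy (the real kernel uses a binary tree over processors to reach the O(log p) message count):

    import numpy as np

    def tsqr_r(W, nblocks=4):
        """Tall Skinny QR, reduction form: QR-factor each row block locally,
        then QR-factor the stack of small R factors. Only the small R's move
        between "processors". Returns R only (Q stays implicit), which is
        all CA-GMRES needs to build H."""
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(B, mode="r") for B in blocks]   # independent local QRs
        return np.linalg.qr(np.vstack(Rs), mode="r")       # combine the R factors

    # Sanity check: agrees with a direct QR up to the signs of R's rows.
    W = np.random.default_rng(1).standard_normal((1000, 5))
    assert np.allclose(np.abs(tsqr_r(W)), np.abs(np.linalg.qr(W, mode="r")))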
34. How to make CA-GMRES Stable?
- Use a different polynomial basis for the Krylov subspace
- Not the monomial basis W = [v, Av, A^2 v, ...]; instead [v, p1(A)v, p2(A)v, ...]
- Possible choices (a sketch of building the Newton basis follows below):
  - Newton basis: W_N = [v, (A - θ1 I)v, (A - θ2 I)(A - θ1 I)v, ...], where the shifts θi are chosen as approximate eigenvalues from Arnoldi (using the same Krylov subspace, so free)
  - Chebyshev basis: W_C = [v, T1(A)v, T2(A)v, ...], where the Tk(z) are chosen to be small in the region of the complex plane containing the large eigenvalues
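A minimal sketch of building the Newton basis (shift selection is left to the caller, e.g. Ritz values from a preliminary Arnoldi run, as the slide suggests):

    import numpy as np

    def newton_basis(A, v, shifts):
        """W_N = [v, (A - θ1 I)v, (A - θ2 I)(A - θ1 I)v, ...].
        Subtracting approximate eigenvalues keeps successive vectors from
        all collapsing toward the dominant eigenvector, unlike the
        monomial basis."""
        W = [v]
        for theta in shifts:
            W.append(A @ W[-1] - theta * W[-1])   # apply (A - theta I)
        return np.column_stack(W)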
35. Monomial basis [Ax, ..., A^k x] fails to converge; Newton polynomial basis does converge
[Figure: convergence histories of CA-GMRES with the two bases]
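A small numerical illustration of why the monomial basis fails (toy diagonal A and equispaced stand-in shifts, both assumptions; exact numbers vary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 200, 12
    A = np.diag(np.linspace(1.0, 100.0, n))   # assumed toy spectrum
    v = rng.standard_normal(n)

    def cond_of(cols):
        # Normalize columns so scaling does not mask the loss of independence.
        return np.linalg.cond(np.column_stack([c / np.linalg.norm(c) for c in cols]))

    mono = [v]
    for _ in range(k):
        mono.append(A @ mono[-1])                  # v, Av, A^2 v, ...

    newt = [v]
    for t in np.linspace(1.0, 100.0, k):           # crude stand-ins for Ritz values
        newt.append(A @ newt[-1] - t * newt[-1])   # (A - t I) steps

    print(f"cond(monomial) = {cond_of(mono):.2e}")   # enormous: numerically rank deficient
    print(f"cond(newton)   = {cond_of(newt):.2e}")   # many orders of magnitude smaller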
36. Speedups on 8-core Clovertown
[Figure: measured speedups]
37. Summary of what is known (1/3), open problems
- GMRES
  - Can independently choose k to optimize speed and the restart length r to optimize convergence
  - Need to co-tune the Akx kernel and TSQR
  - Know how to use more stable polynomial bases
  - Proven speedups
- Can similarly reorganize other Krylov methods
  - Arnoldi and Lanczos, for Ax = λx and for Ax = λMx
  - Conjugate Gradients, for Ax = b
  - Biconjugate Gradients, for Ax = b
  - BiCGStab?
  - Other Krylov methods?
38. Summary of what is known (2/3), open problems
- Preconditioning: MAx = Mb
  - Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, ...] (a sketch of this kernel follows below)
  - Can think of this as the union of two of the earlier kernels: [x, (MA)x, (MA)^2 x, ..., (MA)^k x] and [x, Ax, (AM)Ax, (AM)^2 Ax, ..., (AM)^k Ax]
- For which preconditioners M can we minimize communication?
  - Easy: diagonal M
  - How about block-diagonal M?
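A sketch of generating this interleaved kernel (hypothetical helper; A and M can be anything supporting '@'):

    def precond_powers(A, M, x, k):
        """[x, Ax, MAx, AMAx, MAMAx, ...]: alternately apply A and M,
        producing the union of the (MA)^j x and (AM)^j Ax sequences
        noted above."""
        seq = [x]
        for i in range(2 * k):
            seq.append((A if i % 2 == 0 else M) @ seq[-1])
        return seq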
39-43. Examine [x, Ax, MAx, AMAx, MAMAx, ...] for A tridiagonal, M block-diagonal
[Figure sequence: the dependency pattern after each successive application of A or M]
44. Summary of what is known (3/3), open problems
- Preconditioning: MAx = Mb
  - Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, ...]
- For block-diagonal M, the matrix powers rapidly become dense, but the ranks of the off-diagonal blocks grow slowly
- Can take advantage of the low rank to minimize communication (sketched below):
  y_i = ((AM)^k)_{ij} · x_j = (U · V) · x_j = U · (V · x_j)
- Works (in principle) for Hierarchically Semi-Separable M
- How does it work in practice?
- For details (and open problems) see M. Hoemmen's PhD thesis
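The communication saving in that low-rank identity, as code (U and V assumed given by the preconditioner's structure):

    import numpy as np

    def apply_offdiag(U, V, x_j):
        """y_i = ((AM)^k)_{ij} x_j with the off-diagonal block factored as
        U @ V (rank r). Parenthesizing as U @ (V @ x_j) means only the
        length-r vector V @ x_j needs to be communicated, not the dense
        block or all of x_j."""
        return U @ (V @ x_j)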
45. Of things not said (much about)
- Other Communication-Avoiding (CA) dense factorizations
  - QR with column pivoting ("tournament pivoting")
  - Cholesky with diagonal pivoting
  - GE with complete pivoting
  - LDL^T? Maybe with complete pivoting?
- CA-Sparse factorizations
  - Cholesky, assuming good separators
  - Lower bounds from Lipton/Rose/Tarjan
- Matrix multiplication; anything else?
  - Strassen, and all linear algebra, with words_moved = O(n^ω / M^(ω/2 - 1))
  - Parallel?
- Extending CA-Krylov methods to the bottom solver in multigrid with Adaptive Mesh Refinement (AMR)
46. Summary
Time to redesign all dense and sparse linear algebra.
47. Extra slides