1
Minimizing Communication in Numerical Linear Algebra
www.cs.berkeley.edu/~demmel
Optimizing Krylov Subspace Methods
Jim Demmel, EECS & Math Departments, UC Berkeley
demmel@cs.berkeley.edu
2
Outline of Lecture 8: Optimizing Krylov Subspace Methods
  • k steps of a typical iterative solver for Ax = b or Ax = λx
  • Does k SpMVs with a starting vector (e.g. with b, if solving Ax = b);
    a naive version of this kernel is sketched below
  • Finds the best solution among all linear combinations of these k+1 vectors
  • Many such Krylov Subspace Methods
  • Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
  • Goal: minimize communication in Krylov Subspace Methods
  • Assume the matrix is well-partitioned, with a modest surface-to-volume ratio
  • Parallel implementation
  • Conventional: O(k log p) messages, because of k calls to SpMV
  • New: O(log p) messages (optimal)
  • Serial implementation
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data (optimal)
  • Lots of speedup possible (modeled and measured)
  • Price: some redundant computation
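A minimal NumPy/SciPy sketch of the conventional kernel referred to above: the Krylov basis [x, Ax, …, A^k x] built by k separate SpMV calls, each of which touches A again. This is my own illustration, not code from the lecture; the function name krylov_basis_naive and the 1D Laplacian test matrix are assumptions made for the example.

import numpy as np
import scipy.sparse as sp

def krylov_basis_naive(A, x, k):
    """Return the n x (k+1) array [x, Ax, A^2 x, ..., A^k x] via k SpMVs."""
    V = np.empty((x.shape[0], k + 1))
    V[:, 0] = x
    for j in range(k):              # k separate sparse matrix-vector products
        V[:, j + 1] = A @ V[:, j]   # each pass re-reads all of A
    return V

# Example: 1D Laplacian (tridiagonal), the model problem used on the next slides
n, k = 1000, 8
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
V = krylov_basis_naive(A, np.random.rand(n), k)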

3
Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors
[Figure: entries owned by Proc 1 and Proc 2]
Can be computed without communication
4
Locally Dependent Entries for [x, Ax, A²x], A tridiagonal, 2 processors
[Figure: entries owned by Proc 1 and Proc 2]
Can be computed without communication
5
Locally Dependent Entries for [x, Ax, …, A³x], A tridiagonal, 2 processors
[Figure: entries owned by Proc 1 and Proc 2]
Can be computed without communication
6
Locally Dependent Entries for [x, Ax, …, A⁴x], A tridiagonal, 2 processors
[Figure: entries owned by Proc 1 and Proc 2]
Can be computed without communication
7
Locally Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors
[Figure: entries owned by Proc 1 and Proc 2]
Can be computed without communication; k=8-fold reuse of A
8
Remotely Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors
[Figure: entries owned by Proc 1 and Proc 2, with the remotely dependent entries marked]
One message to get the data needed to compute the remotely dependent entries, not k=8.
Minimizes the number of messages (latency cost). Price: redundant work ∝ surface/volume ratio.
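A sketch of the scheme described above for the tridiagonal case, written as serial NumPy/SciPy for clarity; this is my own illustration, not code from the lecture. The function name akx_tridiag_local is hypothetical, and slicing a globally available x stands in for the one ghost-value exchange per neighbor that a real MPI code would perform.

import numpy as np
import scipy.sparse as sp

def akx_tridiag_local(A, x, lo, hi, k):
    """Rows [lo, hi) of [Ax, A^2 x, ..., A^k x] for tridiagonal A, using one
    up-front fetch of k ghost values per side plus some redundant work."""
    n = A.shape[0]
    ext_lo, ext_hi = max(lo - k, 0), min(hi + k, n)   # one "message" per neighbor
    A_ext = A[ext_lo:ext_hi, ext_lo:ext_hi]           # owned part of A plus ghost rows
    w = x[ext_lo:ext_hi].astype(float)
    out = []
    for j in range(1, k + 1):
        w = A_ext @ w       # entries within j of a truncated window edge are wrong,
                            # but the owned range [lo, hi) stays correct for all j <= k
        out.append(w[lo - ext_lo: hi - ext_lo].copy())
    return out              # out[j-1] == (A^j x)[lo:hi]

# Check against the naive kernel on the tridiagonal model problem ("Proc 1" owns the left half)
n, k = 40, 8
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
x = np.random.rand(n)
local = akx_tridiag_local(A, x, lo=0, hi=n // 2, k=k)
assert np.allclose(local[-1], (np.linalg.matrix_power(A.toarray(), k) @ x)[: n // 2])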
9
Spy plot of A with the sparsity pattern of a 5-point stencil, natural order
10
Spy plot of A with the sparsity pattern of a 5-point stencil, natural order
11
Spy plot of A with the sparsity pattern of a 5-point stencil, natural order, assigned to p=9 processors
For an n x n mesh assigned to p processors, each processor needs 2n data from neighbors
12
Spy plot of A with the sparsity pattern of a 5-point stencil, nested dissection order
13
Spy plot of A with the sparsity pattern of a 5-point stencil, nested dissection order
For an n x n mesh assigned to p processors, each processor needs 4n/p^(1/2) data from neighbors
(each processor owns an (n/p^(1/2)) x (n/p^(1/2)) subsquare and needs one ghost layer on each of its 4 sides)
14
Remotely Dependent Entries for [x, Ax, …, A³x], A with the sparsity pattern of a 5-point stencil
15
Remotely Dependent Entries for [x, Ax, …, A³x], A with the sparsity pattern of a 5-point stencil
16
Remotely Dependent Entries for [x, Ax, A²x, A³x], A irregular, multiple processors
17
Remotely Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors
[Figure: entries owned by Proc 1 and Proc 2, with the remotely dependent entries marked]
One message to get the data needed to compute the remotely dependent entries, not k=8.
Minimizes the number of messages (latency cost). Price: redundant work ∝ surface/volume ratio.
18
Reducing redundant work for a tridiagonal matrix
19
Summary of Parallel Optimizations so far, for the Akx kernel: (A, x, k) → [Ax, A²x, …, A^k x]
  • Possible to reduce messages from O(k) to O(1)
  • Depends on the matrix being well-partitioned
  • Each processor only needs data from a few neighbors
  • Price: redundant computation of some entries of A^j x
  • Amount of redundant work depends on the surface-to-volume ratio of the partition:
    ideally each processor only needs a little data from its neighbors, compared to what it owns itself
  • Best case: tridiagonal (1D mesh): need O(1) data from neighbors, versus O(n/p) data locally;
    flops grow by a factor 1 + O(k/(n/p)) = 1 + O(k/data_per_proc) ≈ 1
  • Pretty good: 2D mesh (n x n): need O(n/p^(1/2)) data from neighbors, versus O(n²/p) locally;
    flops grow by 1 + O(k/(data_per_proc)^(1/2))
  • Less good: 3D mesh (n x n x n): need O(n²/p^(2/3)) data from neighbors, versus O(n³/p) locally;
    flops grow by 1 + O(k/(data_per_proc)^(1/3))
    (a rough model of these factors is sketched below)
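A rough back-of-the-envelope model of the factors listed above (my own illustration, not from the lecture): the owned block of a d-dimensional mesh holds n^d/p points, its boundary layer holds roughly 2d·(n/p^(1/d))^(d-1) points, and a k-deep ghost region gives the extra-flops factor.

def extra_flops_factor(n, p, k, d):
    """Rough model: 1 + k * surface/volume ≈ 1 + O(k / data_per_proc**(1/d))."""
    data_per_proc = n**d / p                          # mesh points owned by one processor
    surface = 2 * d * (n / p**(1.0 / d))**(d - 1)     # one boundary layer of the owned block
    return 1.0 + k * surface / data_per_proc

# Trend from the list above: essentially free in 1D, modest in 2D, worse in 3D
for d, n in ((1, 2**20), (2, 2**10), (3, 2**7)):      # chosen to keep data_per_proc comparable
    print(d, extra_flops_factor(n=n, p=2**6, k=8, d=d))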

20
Predicted speedup for model Petascale machine of [Ax, A²x, …, A^k x] for 2D (n x n) mesh
(plot axes: k and log₂ n)
21
Predicted fraction of time spent doing arithmetic for model Petascale machine of [Ax, A²x, …, A^k x] for 2D (n x n) mesh
22
Predicted ratio of extra arithmetic for model Petascale machine of [Ax, A²x, …, A^k x] for 2D (n x n) mesh
23
Predicted speedup for model Petascale machine of [Ax, A²x, …, A^k x] for 3D (n x n x n) mesh
(plot axes: k and log₂ n)
24
Predicted fraction of time spent doing arithmetic for model Petascale machine of [Ax, A²x, …, A^k x] for 3D (n x n x n) mesh
25
Predicted ratio of extra arithmetic for model Petascale machine of [Ax, A²x, …, A^k x] for 3D (n x n x n) mesh
26
Minimizing messages versus words_moved
  • Parallel case
  • Can't reduce the number of words needed from other processors
  • Required to get the right answer
  • Sequential case
  • Can reduce the number of words needed from slow memory
  • Assume A, x too large to fit in fast memory
  • Naïve implementation of [Ax, A²x, …, A^k x] by k calls to SpMV needs to move A, x
    between fast and slow memory k times
  • We will move it ≈ one time, which is clearly optimal (see the counting sketch below)
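A simple counting sketch of the claim above (my own model, not from the lecture): words moved between slow and fast memory by the naive kernel, which makes k passes over A, versus a communication-avoiding kernel that streams A once.

def words_moved_naive(nnz, n, k):
    # each of the k SpMVs reads A (nnz words) plus a vector, and writes a vector
    return k * (nnz + 2 * n)

def words_moved_ca(nnz, n, k):
    # A and x are read once; the k result vectors are written once
    return nnz + n + k * n

nnz, n, k = 5 * 10**6, 10**6, 8        # e.g. a 5-point-stencil matrix on a 1000 x 1000 mesh
print(words_moved_naive(nnz, n, k) / words_moved_ca(nnz, n, k))   # ratio approaches k when nnz >> k*n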

27
Sequential [x, Ax, …, A⁴x], with memory hierarchy
One read of the matrix from slow memory, not k=4.
Minimizes words moved (bandwidth cost). No redundant work.
28
Remotely Dependent Entries for [x, Ax, …, A³x], A 100x100 with bandwidth 2
Only 25% of A and the vectors fit in fast memory
29
In what order should the sequential algorithm
process a general sparse matrix?
  • For a band matrix, we obviously want to process the matrix from left to right,
    since the data we need for the next partition is already in fast memory
  • Not obvious what the best order is for a general matrix
  • We can formulate the question of finding the best order as a
    Traveling Salesman Problem (TSP), as sketched below
  • One vertex per matrix partition
  • Weight of edge (j, k) is the memory cost of processing partition k right after partition j
  • TSP: find the lowest-cost tour visiting all vertices
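A sketch of the TSP formulation above (my own illustration): given a cost matrix whose (j, k) entry models the words that must be brought into fast memory to process partition k right after partition j, pick an order with a greedy nearest-neighbor heuristic. Exact TSP is NP-hard, so a cheap heuristic tour is what one would use in practice; the cost model itself is an assumption here.

import numpy as np

def greedy_partition_order(cost, start=0):
    """Visit all partitions, always moving next to the cheapest unprocessed one."""
    m = cost.shape[0]
    unvisited = set(range(m)) - {start}
    order = [start]
    while unvisited:
        here = order[-1]
        nxt = min(unvisited, key=lambda j: cost[here, j])
        order.append(nxt)
        unvisited.remove(nxt)
    return order

# Toy example: random symmetric costs for 6 partitions
rng = np.random.default_rng(0)
C = rng.random((6, 6))
C = (C + C.T) / 2
print(greedy_partition_order(C))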

30
What about multicore?
  • Two kinds of communication to minimize
  • Between processors on the chip
  • Between on-chip cache and off-chip DRAM
  • Use hybrid of both techniques described so far
  • Use parallel optimization so each core can work
    independently
  • Use sequential optimization to minimize off-chip
    DRAM traffic of each core

31
Speedups on Intel Clovertown (8 core)
Test matrices include stencils and practical matrices.
See the SC09 paper on bebop.cs.berkeley.edu for details.
32
Minimizing Communication of GMRES to solve Ax = b
  • GMRES: find x in span{b, Ab, …, A^k b} minimizing ‖Ax − b‖₂
  • Cost of k steps of standard GMRES vs new GMRES

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)
    [v(i), H] = MGS(w, v(0), …, v(i-1))   (Modified Gram-Schmidt; update v(i), H)
  endfor
  solve the least-squares problem with H
  Sequential: words_moved = O(k · nnz) from SpMV + O(k² · n) from MGS
  Parallel: messages = O(k) from SpMV + O(k² · log p) from MGS

Communication-Avoiding GMRES (CA-GMRES):
  W = [v, Av, A²v, …, A^k v]
  [Q, R] = TSQR(W)   (Tall Skinny QR)
  build H from R, solve the least-squares problem
  Sequential: words_moved = O(nnz) from SpMV + O(k · n) from TSQR
  Parallel: messages = O(1) from computing W + O(log p) from TSQR
  • Oops: W is built like the power method, so precision is lost! (see the sketch below)
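A minimal NumPy sketch of the two CA-GMRES building blocks above (my own illustration, not the implementation from the lecture): form the basis W with the Akx idea, then orthogonalize it in one shot. np.linalg.qr stands in for TSQR, and assembling H from R is omitted.

import numpy as np

def ca_gmres_cycle_basis(A, v, k):
    """W = [v, Av, ..., A^k v] (monomial basis), then one block QR instead of k MGS steps."""
    n = len(v)
    W = np.empty((n, k + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for j in range(k):                 # the Akx kernel: this is where precision is lost
        W[:, j + 1] = A @ W[:, j]
    Q, R = np.linalg.qr(W)             # stand-in for TSQR (Tall Skinny QR)
    return Q, R                        # H is then assembled from R (not shown)

# The "oops": as in the power method, the columns of W all drift toward the dominant
# eigenvector, so W becomes numerically rank-deficient even for modest k.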

33
34
How to make CA-GMRES Stable?
  • Use a different polynomial basis for the Krylov subspace
  • Not the monomial basis W = [v, Av, A²v, …]; instead [v, p₁(A)v, p₂(A)v, …]
  • Possible choices (a Newton-basis sketch follows this list):
  • Newton basis: W_N = [v, (A − θ₁I)v, (A − θ₂I)(A − θ₁I)v, …], where the shifts θᵢ are
    chosen as approximate eigenvalues from Arnoldi (using the same Krylov subspace, so free)
  • Chebyshev basis: W_C = [v, T₁(A)v, T₂(A)v, …], where Tᵢ(z) is chosen to be small
    on the region of the complex plane containing the large eigenvalues
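A small sketch of the Newton basis above (my own illustration): replace W[:, j+1] = A·W[:, j] with shifted factors (A − θ I). The array thetas is assumed to hold Ritz values (approximate eigenvalues) from an earlier Arnoldi cycle; real shifts are assumed for simplicity.

import numpy as np

def newton_basis(A, v, thetas):
    """W = [v, (A - θ1 I)v, (A - θ2 I)(A - θ1 I)v, ...] for the given (real) shifts."""
    n, k = len(v), len(thetas)
    W = np.empty((n, k + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for j, theta in enumerate(thetas):
        W[:, j + 1] = A @ W[:, j] - theta * W[:, j]   # apply (A - θ_{j+1} I) to the last column
    return W

In practice the columns are usually also rescaled as they are generated, which is a diagonal change of basis and further helps conditioning.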

35
Monomial basis [Ax, …, A^k x] fails to converge
Newton polynomial basis does converge
36
Speedups on 8-core Clovertown
37
Summary of what is known (1/3), open
  • GMRES
  • Can independently choose k to optimize speed,
    restart length r to optimize convergence
  • Need to co-tune Akx kernel and TSQR
  • Know how to use more stable polynomial bases
  • Proven speedups
  • Can similarly reorganize other Krylov methods
  • Arnoldi and Lanczos, for Ax = λx and for Ax = λMx
  • Conjugate Gradients, for Ax = b
  • Biconjugate Gradients, for Ax = b
  • BiCGStab?
  • Other Krylov methods?

38
Summary of what is known (2/3), open
  • Preconditioning: MAx = Mb
  • Need a different communication-avoiding kernel:
    [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …] (a sketch follows this list)
  • Can think of this as the union of two of the earlier kernels:
  • [x, (MA)x, (MA)²x, …, (MA)^k x] and [x, Ax, (AM)Ax, (AM)²Ax, …, (AM)^k Ax]
  • For which preconditioners M can we minimize communication?
  • Easy: diagonal M
  • How about block diagonal M?
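A sketch of the interleaved kernel named above (my own illustration): build [x, Ax, MAx, AMAx, MAMAx, …] by alternately applying A and M; here A and M can be any objects supporting "@" (e.g. SciPy sparse matrices or LinearOperators).

def preconditioned_kernel(A, M, x, k):
    """Return the 2k+1 vectors [x, Ax, MAx, AMAx, MAMAx, ...]."""
    vecs = [x]
    for step in range(2 * k):
        op = A if step % 2 == 0 else M       # alternate A, M, A, M, ...
        vecs.append(op @ vecs[-1])
    return vecs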

39
Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal
40
Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal
41
Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal
42
Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal
43
Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal
44
Summary of what is known (3/3), open
  • Preconditioning: MAx = Mb
  • Need a different communication-avoiding kernel:
    [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …]
  • For block diagonal M, the matrix powers rapidly become dense, but the ranks of the
    off-diagonal blocks grow slowly
  • Can take advantage of the low rank to minimize communication (see the sketch below):
  • yᵢ = ((AM)^k)ᵢⱼ · xⱼ = U · Σ · V · xⱼ = U · (Σ · (V · xⱼ))
  • Works (in principle) for Hierarchically Semi-Separable M
  • How does it work in practice?
  • For details (and open problems) see M. Hoemmen's PhD thesis
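A sketch of the low-rank trick in the equation above (my own illustration): if an off-diagonal block B of (AM)^k has low rank, keep it in factored form B ≈ U·V and apply it as U·(V·xⱼ), so only O(m·r) words and flops are needed instead of O(m²).

import numpy as np

def apply_low_rank_block(U, V, xj):
    """Contribution to y_i from a block stored as B = U @ V (U: m x r, V: r x m)."""
    return U @ (V @ xj)               # the r-vector V @ xj is the only intermediate

# Toy check against forming the block explicitly
rng = np.random.default_rng(1)
m, r = 200, 3
U, V, xj = rng.random((m, r)), rng.random((r, m)), rng.random(m)
assert np.allclose(apply_low_rank_block(U, V, xj), (U @ V) @ xj)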

45
Of things not said (much about)
  • Other Communication-Avoiding (CA) dense
    factorizations
  • QR with column pivoting (tournament pivoting)
  • Cholesky with diagonal pivoting
  • GE with complete pivoting
  • LDLᵀ? Maybe with complete pivoting?
  • CA-Sparse factorizations
  • Cholesky, assuming good separators
  • Lower bounds from Lipton/Rose/Tarjan
  • Matrix multiplication, anything else?
  • Strassen, and all linear algebra with words_moved = O(n^ω / M^(ω/2 − 1)) (see the note below)
  • Parallel?
  • Extending CA-Krylov methods to bottom solver in
    multigrid with Adaptive Mesh Refinement (AMR)
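A quick specialization of the bound quoted above (my own note, not from the lecture): for classical matrix multiplication ω = 3, so words_moved = O(n³ / M^(1/2)), the familiar O(n³/√M); for Strassen, ω = log₂ 7 ≈ 2.81, giving words_moved = O(n^2.81 / M^0.40).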

46
Summary
Time to redesign all dense and sparse linear
algebra
  • Don't Communic…

47
Extra slides