Title: Reducing Data Movement in a Block Variant of GMRES
1Reducing Data Movement in a Block Variant of GMRES
- February 25, 2004
- John M. Dennis (dennisjm_at_cs.colorado.edu),
- Allison H. Baker (baker59_at_llnl.gov),
- Elizabeth Jessup (jessup_at_cs.colorado.edu)
2Motivation
- Sparse linear systems are time consuming to solve
- Accessing coefficient matrix A is expensive in iterative linear solvers
- Floating-point costs vs. memory costs
- Microprocessor performance improving 60% per year
- Memory access costs improving 10% per year
- Memory costs >> FLOP cost for large problems
3Outline
- Background
- Multivector optimization
- Characterizing memory efficiency
- Iterative Algorithm
- GMRES (Generalized Minimal Residual)
- LGMRES (Loose GMRES)
- B-LGMRES (Block version)
- Impact of Multivector
- Reducing data movement
4Background
- Indirect addressing results in poor reference locality
- Poor reference locality increases data movement through the memory hierarchy
- Optimization techniques improve locality of reference! -> Multivector
5Multivector
- Modifies storage format of multiple column vectors v1, v2, ...
- non-Multivector: v1(1), v1(2), v1(3), ...
- Multivector: v1(1), v2(1), v3(1), ...
- Sparse linear algebra: multiple matrix-vector multiplies
- $[u_1, u_2, \ldots] = A\,[v_1, v_2, \ldots]$
- Reduces movement of A through memory hierarchy
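The storage change and the multiple matrix-vector multiply above can be sketched as follows. This is a minimal illustration, not the PETSc implementation: the CSR arrays `indptr`, `indices`, `data` and both function names are assumptions made for the example. The multivector is a row-major `(n, s)` array, so the entries v1(i), v2(i) sit next to each other in memory, and each nonzero of A is read once and reused for all s vectors.

```python
import numpy as np

def csr_matvec(indptr, indices, data, v):
    """u = A v for a single vector; A is streamed once per call."""
    n = len(indptr) - 1
    u = np.zeros(n)
    for i in range(n):
        for p in range(indptr[i], indptr[i + 1]):
            u[i] += data[p] * v[indices[p]]
    return u

def csr_matmultivec(indptr, indices, data, V):
    """U = A V for a multivector V of shape (n, s).

    Each nonzero data[p] is loaded once and applied to all s columns,
    so A moves through the memory hierarchy once instead of s times.
    """
    n, s = len(indptr) - 1, V.shape[1]
    U = np.zeros((n, s))
    for i in range(n):
        for p in range(indptr[i], indptr[i + 1]):
            U[i, :] += data[p] * V[indices[p], :]
    return U

# A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form
indptr = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data = [2.0, 1.0, 3.0, 4.0, 5.0]

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])

# Multivector storage: v1(1), v2(1), v1(2), v2(2), ...
V = np.ascontiguousarray(np.column_stack([v1, v2]))
layout = V.ravel().tolist()

U = csr_matmultivec(indptr, indices, data, V)
u1 = csr_matvec(indptr, indices, data, v1)
u2 = csr_matvec(indptr, indices, data, v2)
```

Both routines perform the same floating-point work; only the number of passes over the matrix data differs.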
6Multivector (cont)
- Dense linear algebra
- dot product, daxpy -> matrix-matrix multiply
- Reduce movement of vectors through memory hierarchy
- Use BLAS library (_GEMM)
- or
- Hand-code with loop unrolling
- Freeware
- Flexible and experimental
- Competitive in time (vendor optimized)
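To illustrate how groups of dot products and daxpy updates collapse into matrix-matrix multiplies, here is a small NumPy sketch. The sizes `n` and `s` and all variable names are chosen for the example only. For 2-D float arrays NumPy typically hands the `@` operator to the underlying BLAS GEMM routine, though that depends on how NumPy was built.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 1000, 2                      # multivector of size n x 2
V = rng.standard_normal((n, s))
U = rng.standard_normal((n, s))

# non-MV: s*s separate dot products, each a full pass over two vectors
H_dots = np.array([[V[:, a] @ U[:, b] for b in range(s)] for a in range(s)])

# MV: one matrix-matrix multiply produces all s*s dot products
H_gemm = V.T @ U

# non-MV: s*s separate daxpy updates
U_axpy = U.copy()
for b in range(s):
    for a in range(s):
        U_axpy[:, b] -= H_dots[a, b] * V[:, a]

# MV: the same update as a single matrix-matrix multiply
U_gemm = U - V @ H_gemm
```

The results agree to rounding error; the multivector form touches each vector fewer times.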
7Characterizing memory efficiency
- Storage Requirement (SR): The storage requirement for all operands in bytes.
- Algorithm dependent
- Working Set size (WS): The size in bytes of data loaded through the memory hierarchy
- Implementation dependent
- The WS of an implementation is only important when SR > sizeof(cache)
8Example of working set size
- x, y, p, r -> vectors of length n
- $\alpha$, $\beta$ -> scalars
- sizeof(cache) = 2 n (8 bytes)
- Implementation 1
- $x = x + \alpha y$
- $y = y + \beta x$
- $p = p + \alpha r$
- SR = 4 n (8 bytes)
- WS = 4 n (8 bytes)
- Implementation 2
- $x = x + \alpha y$
- $p = p + \alpha r$
- $y = y + \beta x$
- SR = 4 n (8 bytes)
- WS = 6 n (8 bytes)
- Increase in data movement: 50% (WS2/WS1 = 6/4 = 1.5)
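The two working-set sizes can be reproduced with a small cache model. This is a sketch under simplifying assumptions that are not stated on the slide: the cache holds exactly two whole vectors, it evicts the least recently used vector, and scalars cost nothing. The function name `working_set` is made up for the example.

```python
from collections import OrderedDict

def working_set(ops, cache_vectors=2):
    """Count vector loads, in units of n * 8 bytes, under an LRU cache
    that holds `cache_vectors` whole vectors."""
    cache = OrderedDict()
    loads = 0
    for operands in ops:
        for name in operands:
            if name in cache:
                cache.move_to_end(name)        # hit: mark as recently used
            else:
                loads += 1                     # miss: load from memory
                cache[name] = True
                if len(cache) > cache_vectors:
                    cache.popitem(last=False)  # evict least recently used
    return loads

# Each tuple lists the vectors touched by one update.
impl1 = [("x", "y"), ("y", "x"), ("p", "r")]
impl2 = [("x", "y"), ("p", "r"), ("y", "x")]

ws1 = working_set(impl1)
ws2 = working_set(impl2)
print(ws1, ws2, ws2 / ws1)
```

In Implementation 2 the update of p and r pushes x and y out of the cache, so both must be loaded a second time.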
9Outline
- Background
- Multivector optimization
- Characterizing memory efficiency
- Iterative Algorithm
- GMRES (Generalized Minimal Residual)
- LGMRES (Loose GMRES)
- B-LGMRES (Block version)
- Impact of Multivector
- Reducing Data-movement
10GMRES
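- Solve $Ax = b$ ($A$ is $n \times n$)
- Initial guess: $x^{(0)}$
- Residual: $r^{(0)} = b - A x^{(0)}$
- Iterate ($j$)
- New guess: $x^{(j)} \in x^{(0)} + K_j(A, r^{(0)})$
- $K_j(A, r^{(0)}) \equiv \operatorname{span}\{r^{(0)}, A r^{(0)}, \ldots, A^{j-1} r^{(0)}\}$
- Minimize $\|b - A x^{(j)}\|_2$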
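A minimal dense sketch of the iteration above, using Arnoldi with modified Gram-Schmidt and a small least-squares solve. It assumes a nonzero initial residual and no breakdown, and the test matrix is an arbitrary well-conditioned example. It is meant only to show the structure, not to stand in for a production solver.

```python
import numpy as np

def gmres(A, b, x0, j):
    """j steps of unrestarted GMRES; returns x in x0 + K_j(A, r0)."""
    n = b.shape[0]
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    V = np.zeros((n, j + 1))        # orthonormal Krylov basis
    H = np.zeros((j + 1, j))        # upper Hessenberg matrix
    V[:, 0] = r0 / beta
    for col in range(j):
        u = A @ V[:, col]           # matrix-vector multiply
        for row in range(col + 1):  # modified Gram-Schmidt
            H[row, col] = V[:, row] @ u
            u = u - H[row, col] * V[:, row]
        H[col + 1, col] = np.linalg.norm(u)
        V[:, col + 1] = u / H[col + 1, col]   # assumes no breakdown
    rhs = np.zeros(j + 1)
    rhs[0] = beta
    # minimize ||beta e1 - H y||_2, which equals ||b - A x||_2
    y, *_ = np.linalg.lstsq(H, rhs, rcond=None)
    return x0 + V[:, :j] @ y

rng = np.random.default_rng(0)
n = 50
A = 4.0 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = gmres(A, b, np.zeros(n), 10)
rel_res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
print(rel_res)
```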
11GMRES(m): Restarted GMRES
- Algorithm
- do m iterations of GMRES (m << n)
- compute $x^{(m)}$
- restart with $x^{(0)} \leftarrow x^{(m)}$
- Definitions
- restart parameter: m
- restart cycle (i): m iterations
- $x_{i+1} \in x_i + K_m(A, r_i)$
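The restart loop can be sketched as below. For brevity each cycle builds an orthonormal basis of $K_m(A, r_i)$ and minimizes the residual with a dense least-squares solve. In exact arithmetic this gives the same iterate as GMRES(m), but it is not the Hessenberg-based formulation a real solver would use. The names `krylov_basis` and `gmres_restarted` are made up for the example.

```python
import numpy as np

def krylov_basis(A, r, m):
    """Orthonormal basis of K_m(A, r) via Arnoldi with modified Gram-Schmidt."""
    V = np.zeros((r.shape[0], m))
    V[:, 0] = r / np.linalg.norm(r)
    for j in range(1, m):
        u = A @ V[:, j - 1]
        for l in range(j):
            u = u - (V[:, l] @ u) * V[:, l]
        V[:, j] = u / np.linalg.norm(u)   # assumes no breakdown
    return V

def gmres_restarted(A, b, x0, m, cycles):
    """Run `cycles` restart cycles of m iterations each."""
    x = x0.copy()
    history = [np.linalg.norm(b - A @ x)]
    for _ in range(cycles):
        r = b - A @ x                     # residual at the restart
        V = krylov_basis(A, r, m)
        # x_{i+1} in x_i + K_m(A, r_i), minimizing ||b - A x_{i+1}||_2
        y, *_ = np.linalg.lstsq(A @ V, r, rcond=None)
        x = x + V @ y
        history.append(np.linalg.norm(b - A @ x))
    return x, history

rng = np.random.default_rng(0)
n = 50
A = 4.0 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
b = rng.standard_normal(n)
x, history = gmres_restarted(A, b, np.zeros(n), m=4, cycles=3)
print(history)
```

Only m basis vectors are stored at any time, which is the reason for restarting when m << n.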
12LGMRES(m, k)
- Solve $Ax = b$
- Initial guess $x_0$
- Compute $x_1$ (GMRES(m))
- do $i = 1$, until converged
- Set $z_i = x_i - x_{i-1}$ (error approximation)
- Compute $x_{i+1} \in x_i + K_m(A, r_i) + \operatorname{span}\{z_i\}$
- Observations
- Augment with previous k error approximations
- Krylov subspace of size m+k
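A hedged sketch of the augmentation idea: each restart cycle minimizes the residual over $K_m(A, r_i)$ plus the span of the last k error approximations. As a simplification the minimization is done with a dense least-squares solve over the combined basis rather than the augmented Arnoldi process of the actual method. The function names and the test problem are assumptions for illustration.

```python
import numpy as np

def krylov_basis(A, r, m):
    """Orthonormal basis of K_m(A, r) via Arnoldi with modified Gram-Schmidt."""
    V = np.zeros((r.shape[0], m))
    V[:, 0] = r / np.linalg.norm(r)
    for j in range(1, m):
        u = A @ V[:, j - 1]
        for l in range(j):
            u = u - (V[:, l] @ u) * V[:, l]
        V[:, j] = u / np.linalg.norm(u)   # assumes no breakdown
    return V

def lgmres(A, b, x0, m, k, cycles):
    """Restart cycles over K_m(A, r_i) + span{last k error approximations}."""
    x = x0.copy()
    zs = []                               # z_i = x_i - x_{i-1}
    history = [np.linalg.norm(b - A @ x)]
    for _ in range(cycles):
        r = b - A @ x
        # First cycle has no z yet, so it reduces to plain GMRES(m).
        W = np.column_stack([krylov_basis(A, r, m)] + zs[-k:])
        y, *_ = np.linalg.lstsq(A @ W, r, rcond=None)
        dx = W @ y
        x = x + dx
        zs.append(dx / np.linalg.norm(dx))  # scaling leaves the span unchanged
        history.append(np.linalg.norm(b - A @ x))
    return x, history

rng = np.random.default_rng(0)
n = 50
A = 4.0 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
b = rng.standard_normal(n)
x, history = lgmres(A, b, np.zeros(n), m=4, k=1, cycles=3)
print(history)
```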
13B-LGMRES(m, k)
- Solve $Ax = b$ via $AX = B$ ($k = 1$)
- Initial guess $x_0$
- Compute $x_1$ (GMRES(m))
- do $i = 1$, until converged
- $X = [x_i, 0]$ (initial guess)
- $B = [b, z_i]$, $z_i = x_i - x_{i-1}$ (right-hand side)
- Compute $x_{i+1} \in x_i + K_m(A, r_i) + K_m(A, z_i)$
- Observations
- Requires k+1 matrix-vector multiplies per iteration
- (k+1) times the floating-point cost of LGMRES
- Mitigate with multivector optimization
14B-LGMRES (1 restart cycle)
- Calculate initial residual $r_i$
- for $j = 1, m$
- $U_j = A V_j$ (MatMult section)
- for $l = 1, j$ (MGS section)
- $H_{l,j} = V_l^T U_j$
- $U_j = U_j - V_l H_{l,j}$
- end
- $V_{j+1} = U_j H_{j+1,j}^{-1}$
- end
- Solve least-squares problem for $y_m$
- Compute approximate solution $x_{i+1} = x_i + W_m y_m$
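The MatMult and MGS sections of one restart cycle can be sketched as a block Arnoldi loop. This covers only the basis construction; the least-squares solve and the solution update are left out. Using a QR factorization to normalize each new block is an assumption made here, and `block_arnoldi` is a name chosen for the example.

```python
import numpy as np

def block_arnoldi(A, V1, m):
    """Build blocks V_1..V_{m+1} and block Hessenberg H with A V_m = V_{m+1} H.

    V1 is an n x s block with orthonormal columns (s = k + 1 in B-LGMRES).
    """
    n, s = V1.shape
    V = [V1]
    H = np.zeros(((m + 1) * s, m * s))
    for j in range(m):
        U = A @ V[j]                       # MatMult section: A times a multivector
        for l in range(j + 1):             # MGS section
            Hlj = V[l].T @ U               # s x s block of dot products
            U = U - V[l] @ Hlj             # block update
            H[l * s:(l + 1) * s, j * s:(j + 1) * s] = Hlj
        Q, R = np.linalg.qr(U)             # U_j = V_{j+1} H_{j+1,j}
        H[(j + 1) * s:(j + 2) * s, j * s:(j + 1) * s] = R
        V.append(Q)
    return np.hstack(V), H

rng = np.random.default_rng(0)
n, s, m = 40, 2, 5
A = rng.standard_normal((n, n))
V1, _ = np.linalg.qr(rng.standard_normal((n, s)))
Vall, H = block_arnoldi(A, V1, m)

orth_err = np.linalg.norm(Vall.T @ Vall - np.eye((m + 1) * s))
arnoldi_err = np.linalg.norm(A @ Vall[:, :m * s] - Vall @ H)
print(orth_err, arnoldi_err)
```

Both sections operate on n x s blocks, which is where the multivector storage format pays off.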
15Outline
- Background
- Multivector optimization
- Characterizing memory efficiency
- Iterative Algorithm
- GMRES (Generalized Minimal Residual)
- LGMRES (Loose GMRES)
- B-LGMRES (Block version)
- Impact of Multivector
- Reducing Data-movement
16B-LGMRES implementation
- Implemented with Portable Extensible Toolkit for Scientific Computing (PETSc)
- Implemented both with multivector (MV) and without multivector (non-MV)
- Impact of multivectors on data movement?
17Experimental design
- 17 test problems of various sizes and application areas
- Multivector size n x 2
- All results for 10 restart cycles
- Single 400 MHz Sun Ultra II
- 32 Kbyte L1 cache
- 4 Mbyte L2 cache
- Sun hardware performance counters (Solaris 8)
18Ratio of execution time
19Impact of Multivectors?
- Recall definitions
- SR: storage requirement
- WS: working set size
- Choose those problems where SR > sizeof(cacheL2)
- Expected reduction in data movement
- WS(non-MV) / WS(MV)
- Plot expected versus measured reduction in data movement (MbytesL2)
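As a rough illustration of the expected reduction for the MatMult section, the sketch below uses a hypothetical working-set model. It is not the authors' formula. It assumes CSR storage with 8-byte values and 4-byte indices, assumes nothing stays in cache between separate matrix-vector multiplies (SR > cache size), and ignores repeated loads of the input vector caused by indirect addressing.

```python
def matmult_working_set(n, nnz, s, multivector):
    """Bytes loaded for s products with an n x n CSR matrix (assumed model)."""
    matrix_bytes = 8 * nnz + 4 * nnz + 4 * (n + 1)  # values, column indices, row pointers
    vector_bytes = 2 * 8 * n * s                    # s input and s output vectors
    passes_over_A = 1 if multivector else s         # MV streams A only once
    return passes_over_A * matrix_bytes + vector_bytes

# Hypothetical problem size, with s = 2 to match the n x 2 multivector
n, nnz, s = 100_000, 1_000_000, 2
ws_non_mv = matmult_working_set(n, nnz, s, multivector=False)
ws_mv = matmult_working_set(n, nnz, s, multivector=True)
expected_reduction = ws_non_mv / ws_mv
print(ws_non_mv, ws_mv, expected_reduction)
```

Under this model the ratio approaches s as the matrix grows relative to the vectors, and stays below s otherwise.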
20MbytesL2 for SR > sizeof(cacheL2) (MatMult)
21MbytesL1 for SR > sizeof(cacheL1) (MatMult)
22Correlation between execution time and data movement
23Conclusions
- Examined the memory efficiency of a block algorithm (B-LGMRES)
- Multivector optimization effectively
- Reduces data movement
- Reduces execution time
- Data movement accurately characterized by working set size (WS)
- Reduction in execution time correlated with reduction in MbytesL1
24Questions?
- Contact Information
- John M. Dennis (dennisjm_at_cs.colorado.edu),
25MbytesL2 for 10 restart cycles
26Execution time for MV and non-MV
27MbytesL1 for 10 restart cycles for SR > sizeof(cacheL1) (MGS)
28B-LGMRES(m,k) algorithm
- Solve $Ax = b$ via $AX = B$ (skinny block system)
- Augment right-hand side -> accelerate convergence
- Requires k+1 matrix-vector multiplies per iteration
- (k+1) times the floating-point costs
- Similar memory access costs
29LGMRES(m,k) algorithm
- Krylov subspace of size m+k
- m vectors (GMRES)
- k vectors (error approximations)
- Restart cycle i+1
- Augment with previous k error approximations
- $z_i = x_i - x_{i-1}$
- New approximation
- $x_{i+1} \in x_i + K_m(A, r_i) + \operatorname{span}\{z_j\}$
30B-LGMRES(m,k) algorithm (cont)
- Let $k = 1$: B-LGMRES(m,1)
- Restart cycle i+1: $Ax = b$ -> $AX = B$
- $X = [x_i, 0]$ (initial guess)
- $B = [b, z_i]$ (right-hand side)
- $z_i = x_i - x_{i-1}$
- New approximate solution
- $x_{i+1} \in x_i + K_m(A, r_i) + K_m(A, z_i)$
31Memory hierarchy