Reducing Data Movement in a Block Variant of GMRES

Slides: 32

Provided by: JohnD9

Learn more at: http://www.cisl.ucar.edu

Transcript and Presenter's Notes

Title: Reducing Data Movement in a Block Variant of GMRES

1
Reducing Data Movement in a Block Variant of GMRES

February 25, 2004
John M. Dennis (dennisjm_at_cs.colorado.edu),
Allison H. Baker (baker59_at_llnl.gov),
Elizabeth Jessup (jessup_at_cs.colorado.edu)

2
Motivation

Sparse linear systems are time consuming to solve
Accessing coefficient matrix A is expensive in
iterative linear solvers
Floating-point costs vs. Memory costs
Microprocessor performance improving 60% per year
Memory access costs improving 10% per year
Memory costs >> FLOP costs for large problems

3
Outline

Background
Multivector optimization
Characterizing memory efficiency
Iterative Algorithm
GMRES (Generalized Minimal Residual)
LGMRES (Loose GMRES)
B-LGMRES (Block version)
Impact of Multivector
Reducing data movement

4
Background

Indirect addressing results in poor reference
locality
Poor reference locality increases data movement
through the memory hierarchy
Optimization techniques improve locality of
reference -> Multivector

5
Multivector

Modifies storage format of multiple column
vectors $v_1, v_2, \ldots$
non-Multivector: $v_1(1), v_1(2), v_1(3), \ldots$
Multivector: $v_1(1), v_2(1), v_3(1), \ldots$
Sparse linear algebra: multiple matrix-vector
multiplies
$[u_1, u_2, \ldots] = A [v_1, v_2, \ldots]$
Reduces movement of A through memory hierarchy
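
The storage idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' PETSc code: the CSR arrays `indptr`, `indices`, `data` and the function names are assumptions made for the example. A NumPy array of shape n x s in its default row-major order has exactly the interleaved layout $v_1(1), v_2(1), \ldots$, so each entry of A that is loaded is reused for all s vectors.

```python
import numpy as np

def csr_matvec(indptr, indices, data, v):
    """y = A v for a single vector; every call walks all of A once."""
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for p in range(indptr[i], indptr[i + 1]):
            y[i] += data[p] * v[indices[p]]
    return y

def csr_multivec(indptr, indices, data, V):
    """U = A V for an n x s multivector V stored row-major (interleaved)."""
    n = len(indptr) - 1
    U = np.zeros((n, V.shape[1]))
    for i in range(n):
        for p in range(indptr[i], indptr[i + 1]):
            # One load of A's entry serves all s vectors; row indices[p]
            # of V is contiguous in memory.
            U[i, :] += data[p] * V[indices[p], :]
    return U

# Hypothetical 3 x 3 example: A = [[4, 1, 0], [0, 3, 0], [2, 0, 5]]
indptr = np.array([0, 2, 3, 5])
indices = np.array([0, 1, 1, 0, 2])
data = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
V = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, -1.0]])  # n x 2 multivector
U = csr_multivec(indptr, indices, data, V)
```

Calling `csr_matvec` once per column gives the same numbers but streams A through the memory hierarchy s times instead of once.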

6
Multivector (cont)

Dense linear algebra
dot product, daxpy -> matrix-matrix multiply
reduce movement of vectors through memory
hierarchy
Use BLAS library (_GEMM)
or
Hand-code with loop unrolling
Freeware
Flexible and experimental
Competitive in time (vendor optimized)
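
As a rough illustration of the dense case, the sketch below replaces s x s separate dot products and daxpy updates with two matrix-matrix products. The sizes and the random data are assumptions for the example; NumPy's `@` normally dispatches to a BLAS `_GEMM` routine, but that depends on how NumPy was built.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 1000, 2                      # hypothetical sizes: n x 2 multivectors
V = rng.standard_normal((n, s))
U = rng.standard_normal((n, s))

# Non-multivector form: s*s dot products, each streaming two vectors.
H_loop = np.empty((s, s))
for a in range(s):
    for b in range(s):
        H_loop[a, b] = np.dot(V[:, a], U[:, b])

# Multivector form: one matrix-matrix multiply, H = V^T U.
H_gemm = V.T @ U

# Non-multivector form: s*s daxpy updates.
U_loop = U.copy()
for b in range(s):
    for a in range(s):
        U_loop[:, b] -= H_loop[a, b] * V[:, a]

# Multivector form: one matrix-matrix multiply, U = U - V H.
U_gemm = U - V @ H_gemm
```

Both forms compute the same values; the multivector form reads V and U once per operation rather than once per vector pair.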

7
Characterizing memory efficiency

Storage Requirement (SR): the storage
requirement for all operands, in bytes.
Algorithm dependent
Working Set size (WS): the size in bytes of data
loaded through the memory hierarchy
Implementation dependent
The WS of an implementation is only important
when SR > sizeof(cache)

8
Example of working set size
$x, y, p, r$ -> vectors of length $n$
$\alpha, \beta$ -> scalars
sizeof(cache) = 2 n (8 bytes)

Implementation 1
$x = x + \alpha y$
$y = y + \beta x$
$p = p + \alpha r$
SR = 4 n (8 bytes)
WS = 4 n (8 bytes)

Implementation 2
$x = x + \alpha y$
$p = p + \alpha r$
$y = y + \beta x$
SR = 4 n (8 bytes)
WS = 6 n (8 bytes)

Increase in data movement: 50% (WS2/WS1 = 6/4 =
1.5)
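
The two working-set counts can be checked with a toy cache model. The sketch below is a simplification under stated assumptions: the cache holds exactly two whole vectors, replacement is least-recently-used, and loads are counted in units of one vector (n words of 8 bytes). The function name `working_set` is made up for this example.

```python
from collections import OrderedDict

def working_set(ops, cache_vectors=2):
    """Count vector loads for a sequence of operations.

    ops: list of tuples naming the vectors each operation touches.
    cache_vectors: how many whole vectors fit in cache (assumed LRU).
    """
    cache = OrderedDict()
    loads = 0
    for operands in ops:
        for name in operands:
            if name in cache:
                cache.move_to_end(name)        # hit: mark as recently used
            else:
                loads += 1                     # miss: load from memory
                cache[name] = True
                if len(cache) > cache_vectors:
                    cache.popitem(last=False)  # evict least recently used
    return loads

impl1 = [("x", "y"), ("y", "x"), ("p", "r")]   # Implementation 1 ordering
impl2 = [("x", "y"), ("p", "r"), ("y", "x")]   # Implementation 2 ordering
ws1 = working_set(impl1)
ws2 = working_set(impl2)
print(ws1, ws2, ws2 / ws1)
```

In this model the second ordering evicts x and y before they are reused, which is where the two extra vector loads come from.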
9
Outline

Background
Multivector optimization
Characterizing memory efficiency
Iterative Algorithm
GMRES (Generalized Minimal Residual)
LGMRES (Loose GMRES)
B-LGMRES (Block version)
Impact of Multivector
Reducing Data-movement

10
GMRES

Solve $Ax = b$ (A is n x n)
Initial guess: $x^{(0)}$
residual: $r^{(0)} = b - A x^{(0)}$
Iterate (j)
new guess: $x^{(j)} \in x^{(0)} + K_j(A, r^{(0)})$,
$K_j(A, r^{(0)}) \equiv \mathrm{span}\{r^{(0)}, A r^{(0)}, \ldots, A^{j-1} r^{(0)}\}$
minimize $\|b - A x^{(j)}\|_2$
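
To make the definition concrete, the sketch below builds a basis of $K_j(A, r^{(0)})$ directly and solves the residual minimization with a dense least-squares call. It is a naive illustration of what GMRES computes, not how GMRES is implemented in practice (real implementations use the Arnoldi process), and it assumes the Krylov vectors never become zero. The function name and the 4 x 4 test matrix are assumptions for the example.

```python
import numpy as np

def krylov_minres_step(A, b, x0, j):
    """Return x in x0 + K_j(A, r0) minimizing ||b - A x||_2 (naive version)."""
    r0 = b - A @ x0
    K = np.empty((len(b), j))
    v = r0.copy()
    for i in range(j):
        K[:, i] = v / np.linalg.norm(v)   # scaled A^i r0; same span as the power basis
        v = A @ K[:, i]
    # b - A (x0 + K y) = r0 - (A K) y, so minimize over y by least squares.
    y, *_ = np.linalg.lstsq(A @ K, r0, rcond=None)
    return x0 + K @ y

# Hypothetical small test problem.
A = np.diag([1.0, 2.0, 3.0, 4.0])
b = np.ones(4)
x0 = np.zeros(4)
residuals = [np.linalg.norm(b - A @ krylov_minres_step(A, b, x0, j))
             for j in range(1, 5)]
```

Because the subspaces are nested, the residual norms in `residuals` cannot increase as j grows.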

11
GMRES(m): Restarted GMRES

Algorithm
do m iterations of GMRES ($m \ll n$)
compute $x^{(m)}$
restart with $x^{(0)} \leftarrow x^{(m)}$
Definitions
restart parameter: m
restart cycle (i): m iterations
$x_{i+1} \in x_i + K_m(A, r_i)$
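
A minimal sketch of restarted GMRES follows, assuming a dense NumPy matrix and a zero initial guess. The Arnoldi process with modified Gram-Schmidt is standard; the function names, the tolerance, and the breakdown threshold are choices made for this example rather than anything taken from the slides.

```python
import numpy as np

def gmres_cycle(A, b, x, m):
    """One restart cycle: m Arnoldi (MGS) steps, then a small least-squares solve."""
    r = b - A @ x
    beta = np.linalg.norm(r)
    n = len(b)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = r / beta
    k = m
    for j in range(m):
        w = A @ V[:, j]
        for l in range(j + 1):              # modified Gram-Schmidt
            H[l, j] = V[:, l] @ w
            w -= H[l, j] * V[:, l]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-14:             # breakdown: Krylov space is exhausted
            k = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(k + 1)
    e1[0] = beta
    # minimize || beta e1 - H y ||_2, which equals || b - A (x + V y) ||_2
    y, *_ = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)
    return x + V[:, :k] @ y

def restarted_gmres(A, b, m, tol=1e-10, max_cycles=100):
    """GMRES(m): repeat restart cycles until the relative residual is below tol."""
    x = np.zeros_like(b, dtype=float)
    for cycle in range(max_cycles):
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            return x, cycle
        x = gmres_cycle(A, b, x, m)
    return x, max_cycles
```

Only m + 1 basis vectors are stored per cycle, which is the reason for restarting when $m \ll n$.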

12
LGMRES(m, k)

Solve $Ax = b$
Initial guess $x_0$
Compute $x_1$ (GMRES(m))
do $i = 1$, until converged
Set $z_i = x_i - x_{i-1}$ (error approximation)
Compute $x_{i+1} \in x_i + K_m(A, r_i) + \mathrm{span}\{z_i\}$
Observations
Augment with previous k error approximations
Krylov subspace of size $m + k$
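
The augmentation step can be sketched naively by appending the stored error approximations to a Krylov basis and minimizing the residual over the combined space with a dense least-squares solve. This is a simplified illustration under stated assumptions, not the LGMRES implementation: a real code folds the $z_i$ into the Arnoldi process, and all function names here are hypothetical.

```python
import numpy as np

def krylov_basis(A, r, m):
    """Orthonormal basis of K_m(A, r) via QR of a scaled power basis (small m only)."""
    K = np.empty((len(r), m))
    v = r / np.linalg.norm(r)
    for j in range(m):
        K[:, j] = v
        v = A @ v
        v /= np.linalg.norm(v)
    Q, _ = np.linalg.qr(K)
    return Q

def lgmres_cycle(A, b, x, m, Z):
    """x_{i+1} in x_i + K_m(A, r_i) + span(Z), minimizing the residual norm."""
    r = b - A @ x
    if np.linalg.norm(r) == 0.0:
        return x                               # already exact
    W = krylov_basis(A, r, m)
    if Z:
        W = np.column_stack([W] + Z)           # augment with error approximations
    y, *_ = np.linalg.lstsq(A @ W, r, rcond=None)
    return x + W @ y

def lgmres(A, b, m, k, cycles):
    """Run a fixed number of cycles, keeping the k most recent z_i (normalized)."""
    x = np.zeros_like(b, dtype=float)
    Z = []
    for _ in range(cycles):
        x_new = lgmres_cycle(A, b, x, m, Z)
        z = x_new - x                          # z_i = x_i - x_{i-1}
        nz = np.linalg.norm(z)
        if nz > 0.0:
            Z = ([z / nz] + Z)[:k]
        x = x_new
    return x
```

The first cycle has no stored vectors, so it reduces to a plain GMRES(m) cycle, matching the "Compute $x_1$ (GMRES(m))" step.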

13
B-LGMRES(m, k)

Solve $Ax = b$ via $AX = B$ ($k = 1$)
Initial guess $x_0$
Compute $x_1$ (GMRES(m))
do $i = 1$, until converged
$X = [x_i, 0]$ (initial guess)
$B = [b, z_i]$, $z_i = x_i - x_{i-1}$ (right-hand side)
Compute $x_{i+1} \in x_i + K_m(A, r_i) + K_m(A, z_i)$
Observations
Requires $k+1$ matrix-vector multiplies per
iteration
$(k+1)$ times the floating-point cost of LGMRES
Mitigate with multivector optimization

14
B-LGMRES (1 restart cycle)

Calculate initial residual $r_i$
for j = 1, m
  $U_j = A V_j$                      (MatMult section)
  for l = 1, j                       (MGS section)
    $H_{l,j} = V_l^T U_j$
    $U_j = U_j - V_l H_{l,j}$
  end
  $U_j = V_{j+1} H_{j+1,j}$          (QR factorization)
end
Solve least-squares problem for $y_m$
Compute approximate solution $x_{i+1} = x_i + W_m y_m$
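
The loop structure of one restart cycle can be sketched as a block Arnoldi process with block modified Gram-Schmidt. This is an illustrative NumPy version under stated assumptions, not the PETSc implementation described in the talk: it takes a dense A and a full-rank n x s starting block, uses a QR factorization for the normalization step, and the function name is hypothetical.

```python
import numpy as np

def block_arnoldi(A, R, m):
    """m steps of block Arnoldi with block modified Gram-Schmidt.

    R is the n x s starting block. Returns the list of m + 1 orthonormal
    n x s blocks V and the (m+1)s x ms block Hessenberg matrix H.
    """
    n, s = R.shape
    V0, _ = np.linalg.qr(R)
    V = [V0]
    H = np.zeros(((m + 1) * s, m * s))
    for j in range(m):
        U = A @ V[j]                        # MatMult section: A read once for s vectors
        for l in range(j + 1):              # MGS section
            Hlj = V[l].T @ U                # s x s block of dot products
            U = U - V[l] @ Hlj              # block update replacing daxpy calls
            H[l * s:(l + 1) * s, j * s:(j + 1) * s] = Hlj
        Q, Rj = np.linalg.qr(U)             # U_j = V_{j+1} H_{j+1,j}
        V.append(Q)
        H[(j + 1) * s:(j + 2) * s, j * s:(j + 1) * s] = Rj
    return V, H
```

With s = 2 (the n x 2 multivector case), both marked sections operate on pairs of vectors at a time, which is where the multivector storage format is expected to reduce data movement.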

15
Outline

Background
Multivector optimization
Characterizing memory efficiency
Iterative Algorithm
GMRES (Generalized Minimal Residual)
LGMRES (Loose GMRES)
B-LGMRES (Block version)
Impact of Multivector
Reducing Data-movement

16
B-LGMRES implementation

Implemented with Portable Extensible Toolkit for
Scientific Computing (PETSc)
Implemented both with multivectors (MV) and
without (non-MV)
Impact of multivectors on data movement?

17
Experimental design

17 test problems of various sizes and application
areas
Multivector size n x 2
All results for 10 restart cycles
Single 400 MHz Sun Ultra II
32 Kbyte L1 cache
4 Mbyte L2 cache
Sun hardware performance counters (Solaris 8)

18
Ratio of execution time
19
Impact of Multivectors?

Recall definitions
SR storage requirement
WS working set size
Choose those problems where SR > sizeof(cache_L2)
Expected reduction in data movement
WS_non-MV / WS_MV
Plot expected versus measured reduction in data
movement (Mbytes_L2)

20
Mbytes_L2 for SR > sizeof(cache_L2) (MatMult)
21
Mbytes_L1 for SR > sizeof(cache_L1) (MatMult)
22
Correlation between execution time and data
movement
23
Conclusions

Examined the memory efficiency of a block
algorithm (B-LGMRES)
Multivector optimization effectively
Reduces data movement
Reduces execution time
Data movement accurately characterized by working
set size (WS)
Reduction in execution time correlated with
reduction in Mbytes_L1

24
Questions?

Contact Information
John M. Dennis (dennisjm_at_cs.colorado.edu)

25
Mbytes_L2 for 10 restart cycles
26
Execution time for MV and non-MV
27
Mbytes_L1 for 10 restart cycles for SR >
sizeof(cache_L1) (MGS)
28
B-LGMRES(m,k) algorithm