Reducing Data Movement in a Block Variant of GMRES

Transcript and Presenter's Notes

1
Reducing Data Movement in a Block Variant of GMRES
  • February 25, 2004
  • John M. Dennis (dennisjm@cs.colorado.edu)
  • Allison H. Baker (baker59@llnl.gov)
  • Elizabeth Jessup (jessup@cs.colorado.edu)

2
Motivation
  • Sparse linear systems are time consuming to solve
  • Accessing the coefficient matrix A is expensive in
    iterative linear solvers
  • Floating-point costs vs. memory costs
    • microprocessor performance improving 60% per year
    • memory access costs improving 10% per year
  • Memory costs >> FLOP costs for large problems

3
Outline
  • Background
    • Multivector optimization
    • Characterizing memory efficiency
  • Iterative algorithms
    • GMRES (Generalized Minimal RESidual)
    • LGMRES ("loose" GMRES)
    • B-LGMRES (block variant)
  • Impact of the multivector optimization
  • Reducing data movement

4
Background
  • Indirect addressing results in poor reference
    locality
  • Poor reference locality increases data movement
    through the memory hierarchy
  • Optimization techniques improve locality of
    reference -> the multivector optimization

5
Multivector
  • Modifies the storage format of multiple column
    vectors v1, v2, ...
    • non-multivector: v1(1), v1(2), v1(3), ...
    • multivector: v1(1), v2(1), v3(1), ...
  • Sparse linear algebra: multiple matrix-vector
    multiplies
    • [u1, u2, ...] = A [v1, v2, ...]
  • Reduces movement of A through the memory
    hierarchy (see the sketch below)
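A minimal C sketch of this idea, assuming CSR storage and a block of two interleaved vectors; the function and variable names are illustrative, not taken from the authors' implementation:

    /* CSR matrix-vector product applied to two vectors at once.
     * Multivector layout: x[2*i] is x1(i), x[2*i+1] is x2(i); same for y.
     * Each nonzero of A is loaded once and used for both vectors, so A
     * moves through the memory hierarchy once instead of twice. */
    void csr_matvec_mv2(int n, const int *rowptr, const int *colind,
                        const double *val,   /* CSR storage of A       */
                        const double *x,     /* interleaved x1,x2 (2n) */
                        double *y)           /* interleaved y1,y2 (2n) */
    {
        for (int i = 0; i < n; i++) {
            double s0 = 0.0, s1 = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
                double a = val[k];          /* A(i,j), loaded once */
                int    j = colind[k];
                s0 += a * x[2 * j];         /* accumulates y1(i)   */
                s1 += a * x[2 * j + 1];     /* accumulates y2(i)   */
            }
            y[2 * i]     = s0;
            y[2 * i + 1] = s1;
        }
    }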

6
Multivector (cont)
  • Dense linear algebra
    • dot product, daxpy -> matrix-matrix multiply
    • reduces movement of the vectors through the
      memory hierarchy
  • Implementation options (compared in the sketch below):
    • use a BLAS library (_GEMM), or
    • hand-code with loop unrolling
      • freeware, flexible, and experimental
      • competitive in time with the vendor-optimized BLAS
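Both routes, sketched in C for a 2x2 block dot product H = V^T U over two interleaved length-n vectors per multivector; the CBLAS interface is assumed to be available, and the routine names are illustrative:

    #include <cblas.h>

    /* BLAS route: with the interleaved layout, V and U are n x 2 row-major
     * matrices with leading dimension 2, so H = V^T U is a single DGEMM. */
    void block_dot_gemm(int n, const double *V, const double *U, double *H)
    {
        cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                    2, 2, n, 1.0, V, 2, U, 2, 0.0, H, 2);
    }

    /* Hand-coded route: the same 2x2 block dot product with the loop
     * unrolled by hand; each row V(i,:) and U(i,:) is loaded once. */
    void block_dot_unrolled(int n, const double *V, const double *U, double *H)
    {
        double h00 = 0.0, h01 = 0.0, h10 = 0.0, h11 = 0.0;
        for (int i = 0; i < n; i++) {
            double v0 = V[2*i], v1 = V[2*i+1];
            double u0 = U[2*i], u1 = U[2*i+1];
            h00 += v0 * u0;  h01 += v0 * u1;
            h10 += v1 * u0;  h11 += v1 * u1;
        }
        H[0] = h00;  H[1] = h01;  H[2] = h10;  H[3] = h11;
    }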

7
Characterizing memory efficiency
  • Storage Requirement (SR): the storage requirement
    for all operands, in bytes
    • algorithm dependent
  • Working Set size (WS): the size in bytes of the data
    loaded through the memory hierarchy
    • implementation dependent
  • The WS of an implementation only matters
    when SR > sizeof(cache)

8
Example of working set size
x, y, p, r -> vectors of length n;  α, β -> scalars;
sizeof(cache) = 2n (8 bytes)
  • Implementation 1
    • x = x + α y
    • y = y + β x
    • p = p + α r
    • SR = 4n (8 bytes)
    • WS = 4n (8 bytes)
  • Implementation 2
    • x = x + α y
    • p = p + α r
    • y = y + β x
    • SR = 4n (8 bytes)
    • WS = 6n (8 bytes)

Increase in data movement: 50% (WS2/WS1 = 6/4 = 1.5)
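The difference comes from reuse: with a cache that holds only two vectors, Implementation 1 reuses x and y while they are still resident, while Implementation 2 evicts them with the p/r update and must reload them. A minimal C sketch of the two orderings (function names and the particular scalar assignments are illustrative):

    #include <stddef.h>

    /* Implementation 1: x and y are still cached when the second update
     * runs, so only x, y, p, r stream in once each -> WS = 4n doubles. */
    void update_v1(size_t n, double alpha, double beta,
                   double *x, double *y, double *p, const double *r)
    {
        for (size_t i = 0; i < n; i++) x[i] += alpha * y[i];
        for (size_t i = 0; i < n; i++) y[i] += beta  * x[i];
        for (size_t i = 0; i < n; i++) p[i] += alpha * r[i];
    }

    /* Implementation 2: the p/r update evicts x and y before they are
     * reused, so both are loaded twice -> WS = 6n doubles. */
    void update_v2(size_t n, double alpha, double beta,
                   double *x, double *y, double *p, const double *r)
    {
        for (size_t i = 0; i < n; i++) x[i] += alpha * y[i];
        for (size_t i = 0; i < n; i++) p[i] += alpha * r[i];
        for (size_t i = 0; i < n; i++) y[i] += beta  * x[i];
    }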
9
Outline
  • Background
    • Multivector optimization
    • Characterizing memory efficiency
  • Iterative algorithms
    • GMRES (Generalized Minimal RESidual)
    • LGMRES ("loose" GMRES)
    • B-LGMRES (block variant)
  • Impact of the multivector optimization
  • Reducing data movement

10
GMRES
  • Solve Ax = b (A is n x n)
  • Initial guess: x(0)
  • Residual: r(0) = b - A x(0)
  • Iterate (j):
    • new guess: x(j) ∈ x(0) + K_j(A, r(0)),
      where K_j(A, r(0)) = span{ r(0), A r(0), ..., A^(j-1) r(0) }
    • minimize || b - A x(j) ||_2  (least-squares formulation below)
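For reference, this minimization is carried out through the standard Arnoldi least-squares formulation (background detail, not spelled out on the slide); in LaTeX:

    % Arnoldi relation: A V_j = V_{j+1} \bar{H}_j, columns of V_{j+1} orthonormal
    \begin{aligned}
      x^{(j)} &= x^{(0)} + V_j\, y_j, \\
      y_j &= \arg\min_{y \in \mathbb{R}^{j}}
             \bigl\| \beta e_1 - \bar{H}_j\, y \bigr\|_2,
      \qquad \beta = \| r^{(0)} \|_2 .
    \end{aligned}

A block analogue of this least-squares problem appears at the end of the B-LGMRES restart cycle on slide 14.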

11
GMRES(m): Restarted GMRES
  • Algorithm
    • do m iterations of GMRES (m << n)
    • compute x(m)
    • restart with x(0) <- x(m)
  • Definitions
    • restart parameter: m
    • restart cycle (i): m iterations
    • x_{i+1} ∈ x_i + K_m(A, r_i)

12
LGMRES(m, k)
  • Solve Ax = b
  • Initial guess x_0
  • Compute x_1 with GMRES(m)
  • do i = 1, ... until converged
    • set z_i = x_i - x_{i-1} (error approximation)
    • compute x_{i+1} ∈ x_i + K_m(A, r_i) + span{z_i}
  • Observations
    • augment with the previous k error approximations
    • Krylov subspace of size m + k (written out below)
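Written out for general k (the standard LGMRES(m,k) approximation space, consistent with the slide's k = 1 notation), in LaTeX:

    x_{i+1} \in x_i + \mathcal{K}_m(A, r_i)
            + \operatorname{span}\{\, z_i, z_{i-1}, \ldots, z_{i-k+1} \,\},
    \qquad z_j = x_j - x_{j-1},

so each restart cycle searches over m Krylov vectors plus the k most recent error approximations.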

13
B-LGMRES(m, k)
  • Solve Ax = b via AX = B (block width k+1)
  • Initial guess x_0
  • Compute x_1 with GMRES(m)
  • do i = 1, ... until converged
    • X = [x_i, 0] (initial guess)
    • B = [b, z_i], z_i = x_i - x_{i-1} (right-hand sides)
    • compute x_{i+1} ∈ x_i + K_m(A, r_i) + K_m(A, z_i)
  • Observations
    • requires k+1 matrix-vector multiplies per
      iteration
    • (k+1) times the floating-point cost of LGMRES
    • mitigated by the multivector optimization

14
B-LGMRES (1 restart cycle)
  • Calculate initial residual r_i
  • for j = 1, m
    • U_j = A V_j                            (MatMult section)
    • for l = 1, j                           (MGS section; C sketch below)
      • H_{l,j} = V_l^T U_j
      • U_j = U_j - V_l H_{l,j}
    • end
    • V_{j+1} = U_j H_{j+1,j}^{-1}  (normalize U_j)
  • end
  • Solve the least-squares problem for y_m
  • Compute the approximate solution x_{i+1} = x_i + W_m y_m
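A minimal C sketch of one pass of the MGS section for block width 2 with the interleaved multivector layout (H_{l,j} = V_l^T U_j followed by U_j = U_j - V_l H_{l,j}); names are illustrative, not the authors' PETSc code:

    /* Orthogonalize the 2-column block U against the 2-column block V_l.
     * V and U each hold two length-n vectors interleaved:
     * V[2*i] = V(i,1), V[2*i+1] = V(i,2), and likewise for U. */
    void mgs_block_step(int n, const double *V, double *U, double H[2][2])
    {
        /* H = V^T U : 2x2 block dot product */
        double h00 = 0.0, h01 = 0.0, h10 = 0.0, h11 = 0.0;
        for (int i = 0; i < n; i++) {
            double v0 = V[2*i], v1 = V[2*i+1];
            double u0 = U[2*i], u1 = U[2*i+1];
            h00 += v0 * u0;  h01 += v0 * u1;
            h10 += v1 * u0;  h11 += v1 * u1;
        }
        H[0][0] = h00;  H[0][1] = h01;
        H[1][0] = h10;  H[1][1] = h11;

        /* U = U - V H : both blocks stream through the cache once,
         * which is what the multivector layout is buying. */
        for (int i = 0; i < n; i++) {
            double v0 = V[2*i], v1 = V[2*i+1];
            U[2*i]   -= v0 * h00 + v1 * h10;
            U[2*i+1] -= v0 * h01 + v1 * h11;
        }
    }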

15
Outline
  • Background
    • Multivector optimization
    • Characterizing memory efficiency
  • Iterative algorithms
    • GMRES (Generalized Minimal RESidual)
    • LGMRES ("loose" GMRES)
    • B-LGMRES (block variant)
  • Impact of the multivector optimization
  • Reducing data movement

16
B-LGMRES implementation
  • Implemented with the Portable, Extensible Toolkit for
    Scientific Computation (PETSc)
  • Implemented both with (MV) and without (non-MV) the
    multivector optimization
  • Impact of multivectors on data movement?

17
Experimental design
  • 17 test problems of various sizes and application
    areas
  • Multivector size: n x 2
  • All results for 10 restart cycles
  • Single 400 MHz Sun Ultra II
    • 32 Kbyte L1 cache
    • 4 Mbyte L2 cache
  • Sun hardware performance counters (Solaris 8)

18
Ratio of execution time
19
Impact of Multivectors?
  • Recall definitions
    • SR = storage requirement
    • WS = working set size
  • Choose the problems where SR > sizeof(L2 cache)
  • Expected reduction in data movement: WS_non-MV / WS_MV
  • Plot expected versus measured reduction in data
    movement (Mbytes_L2)

20
Mbytes_L2 for SR > sizeof(L2 cache) (MatMult)
21
Mbytes_L1 for SR > sizeof(L1 cache) (MatMult)
22
Correlation between execution time and data
movement
23
Conclusions
  • Examined the memory efficiency of a block
    algorithm (B-LGMRES)
  • The multivector optimization effectively
    • reduces data movement
    • reduces execution time
  • Data movement is accurately characterized by the
    working set size (WS)
  • Reduction in execution time is correlated with the
    reduction in Mbytes_L1

24
Questions?
  • Contact information
  • John M. Dennis (dennisjm@cs.colorado.edu)

25
Mbytes_L2 for 10 restart cycles
26
Execution time for MV and non-MV
27
Mbytes_L1 for 10 restart cycles for SR >
sizeof(L1 cache) (MGS)
28
B-LGMRES(m,k) algorithm
  • Solve Ax = b via AX = B (skinny block system)
  • Augment the right-hand side -> accelerate convergence
  • Requires k+1 matrix-vector multiplies per
    iteration
  • (k+1) times the floating-point cost
  • Similar memory access costs

29
LGMRES(m,k) algorithm
  • Krylov subspace of size m + k
    • m vectors (GMRES)
    • k vectors (error approximations)
  • Restart cycle i+1
    • augment with the previous k error approximations,
      z_j = x_j - x_{j-1}
  • New approximation
    • x_{i+1} ∈ x_i + K_m(A, r_i) + span{z_j}

30
B-LGMRES(m,k) algorithm (cont)
  • Let k = 1: B-LGMRES(m,1)
  • Restart cycle i+1: Ax = b -> AX = B
    • X = [x_i, 0] (initial guess)
    • B = [b, z_i] (right-hand sides), z_i = x_i - x_{i-1}
  • New approximate solution
    • x_{i+1} ∈ x_i + K_m(A, r_i) + K_m(A, z_i)

31
Memory hierarchy