Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid - PowerPoint PPT Presentation

1 / 19
About This Presentation
Title:

Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

Description:

Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid ... Purcell, et al. 2002. data structures: streams. algorithms: kernels. Concrete model ... – PowerPoint PPT presentation

Number of Views:228
Avg rating:3.0/5.0
Slides: 20
Provided by: jeffreybol
Category:

less

Transcript and Presenter's Notes

Title: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid


1
Sparse Matrix Solvers on the GPU Conjugate
Gradients and Multigrid
  • Jeffrey Bolz, Ian Farmer, Eitan Grinspun, Peter
    Schröder
  • Caltech ASCI Center

2
Why Use the GPU?
  • Semiconductor trends
  • cost
  • wires vs. compute
  • Stanford streaming supercomputer
  • Parallelism
  • many functional units
  • graphics is prime example
  • Harvesting this power
  • what application suitable?
  • what abstractions useful?
  • History
  • massively parallel SIMD machines
  • media processing

Chart courtesy Bill Dally
Imagine stream processor Bill Dally, Stanford
Connection Machine CM2 Thinking Machines
3
Contributions and Related Work
  • Contributions
  • numerical algorithms on GPU
  • unstructured grids conjugate gradients
  • regular grids multigrid
  • what abstractions are needed?
  • Numerical algorithms
  • Goodnight et al. 2003 (MG)
  • Hall et al. 2003 (cache)
  • Harris et al. 2002 (FD sim.)
  • Hillisland et al. 2003 (optimization)
  • Krueger Westermann 2003 (NLA)
  • Strzodka (PDEs)

4
Streaming Model
output record stream
  • Abstract model
  • Purcell, et al. 2002
  • data structures streams
  • algorithms kernels
  • Concrete model
  • render a rectangle
  • data structures textures
  • algorithms fragment programs

input record stream
globals
globals
5
Sparse Matrices Geometric Flow
  • Ubiquitous in numerical computing
  • discretization of PDEs animation
  • finite elements, difference, volumes
  • optimization, editing, etc., etc.
  • Example here
  • processing of surfaces
  • Canonical non-linear problem
  • mean curvature flow
  • implicit time discretization
  • solve sequence of SPD systems

6
Conjugate Gradients
  • High level code
  • inner loop
  • matrix-vectormultiply
  • sum-reduction
  • scalar-vector MAD
  • Inner product
  • fragment-wise multiply
  • followed by sum-reduction
  • odd dimensions can be handled

7
yAx
Aj off-diagonal matrix elements
R pointers to segments
8
Row-Vector Product
X vector elements
J pointers to xj
R pointers to segments
Fragment program
Aj off-diagonal matrix elements
Ai diagonal matrix elements
9
Apply to All Pixels
  • Two extremes
  • one row at a time setup overhead
  • all rows at once limited by worst row
  • Middle ground
  • organize batches of work
  • How to arrange batches?
  • order rows by non-zero entries
  • optimal packing NP hard
  • We choose fixed size rectangles
  • fragment pipe is quantized
  • simple experiments reveal best size
  • 26 x 18 91 efficient
  • wasted fragments on diagonal

10
Packing (Greedy)

9
9
8
8
8
8
8
7
7
15
13
13
7
7
7
7
7
7
7
7
6
5
5
4
12
12
11
10
9
9
non-zero entries per row
15
13
13
9
9
8
7
7
7
12
12
11
8
8
8
7
7
7
10
9
9
8
7
7
7
7
6
All this setup done once only at the beginning of
time. Depends only on mesh connectivity
11
Recomputing Matrix
  • Matrix entries depend on surface
  • must render into matrix
  • two additional indirection textures
  • previous and next

12
Results (NV30_at_500MHz)
  • 37k elements
  • matrix multiply
  • 33 instructions, 120 per second
  • only 13 flops
  • latency limited
  • reduction
  • 7 inst/frag/pass, 3400 per second
  • CG solve 20 per second

13
Regular Grids
  • Poisson solver as example
  • multigrid approach
  • this time variables on pixel grid
  • e.g. Navier-Stokes

after discretization solve Poisson eq. at each
time step
14
Poisson Equation
  • Appears all over the place
  • easy to discretize on regular grid
  • matrix multiply isstencil application
  • FD Laplace stencil
  • Use iterative matrix solver
  • just need application of stencil
  • easy just like filtering
  • incorporate geometry (Jacobian)
  • variable coefficients

15
Multigrid
  • Fine to coarse to fine cycle
  • high freq. error removed quickly
  • lower frequency error takes longer

Relax, Project, Interpolate
16
Computations and Storage Layout
  • Lots of stencil applications
  • matrix multiply 3x3 stencil
  • projection 3x3 stencil
  • interpolation 2x2(!)
  • floor op in indexing
  • Storage for matrices and DOFs
  • variables in one texture
  • matrices in 9(3x3) textures
  • all textures packed
  • exploit 4 channels
  • domain decomp.
  • padded boundary

17
Coarser Matrices
  • Operator at coarser level
  • needed for relaxation at all levels
  • Triple matrix product
  • work out terms and map to stencils
  • exploit local support of stencils
  • straightforward but t-e-d-i-o-u-s

18
Results (NV30_at_500MHz)
  • 257x257 grid
  • matrix multiply - 27 instructions
  • 1370 per second
  • interpolation 10 inst.
  • projection 19 inst.
  • Overall performance
  • 257x257 at 80 fps!

19
Conclusions
  • Enhancements
  • global registers for reductions
  • texture fetch with offset
  • rectangular texture border
  • scalar versus vector problems
  • Where are we now?
  • good streaming processor
  • twice as fast as CPU implementation
  • lots of room for improvement
  • Scientific computing compiler
  • better languages! Brook? C?
  • manage layout in a buffer
Write a Comment
User Comments (0)
About PowerShow.com