Title: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
1Sparse Matrix Solvers on the GPU Conjugate
Gradients and Multigrid
- Jeffrey Bolz, Ian Farmer, Eitan Grinspun, Peter
Schröder - Caltech ASCI Center
2Why Use the GPU?
- Semiconductor trends
- cost
- wires vs. compute
- Stanford streaming supercomputer
- Parallelism
- many functional units
- graphics is prime example
- Harvesting this power
- what application suitable?
- what abstractions useful?
- History
- massively parallel SIMD machines
- media processing
Chart courtesy Bill Dally
Imagine stream processor Bill Dally, Stanford
Connection Machine CM2 Thinking Machines
3Contributions and Related Work
- Contributions
- numerical algorithms on GPU
- unstructured grids conjugate gradients
- regular grids multigrid
- what abstractions are needed?
- Numerical algorithms
- Goodnight et al. 2003 (MG)
- Hall et al. 2003 (cache)
- Harris et al. 2002 (FD sim.)
- Hillisland et al. 2003 (optimization)
- Krueger Westermann 2003 (NLA)
- Strzodka (PDEs)
4Streaming Model
output record stream
- Abstract model
- Purcell, et al. 2002
- data structures streams
- algorithms kernels
- Concrete model
- render a rectangle
- data structures textures
- algorithms fragment programs
input record stream
globals
globals
5Sparse Matrices Geometric Flow
- Ubiquitous in numerical computing
- discretization of PDEs animation
- finite elements, difference, volumes
- optimization, editing, etc., etc.
- Example here
- processing of surfaces
- Canonical non-linear problem
- mean curvature flow
- implicit time discretization
- solve sequence of SPD systems
6Conjugate Gradients
- High level code
- inner loop
- matrix-vectormultiply
- sum-reduction
- scalar-vector MAD
- Inner product
- fragment-wise multiply
- followed by sum-reduction
- odd dimensions can be handled
7yAx
Aj off-diagonal matrix elements
R pointers to segments
8Row-Vector Product
X vector elements
J pointers to xj
R pointers to segments
Fragment program
Aj off-diagonal matrix elements
Ai diagonal matrix elements
9Apply to All Pixels
- Two extremes
- one row at a time setup overhead
- all rows at once limited by worst row
- Middle ground
- organize batches of work
- How to arrange batches?
- order rows by non-zero entries
- optimal packing NP hard
- We choose fixed size rectangles
- fragment pipe is quantized
- simple experiments reveal best size
- 26 x 18 91 efficient
- wasted fragments on diagonal
10Packing (Greedy)
9
9
8
8
8
8
8
7
7
15
13
13
7
7
7
7
7
7
7
7
6
5
5
4
12
12
11
10
9
9
non-zero entries per row
15
13
13
9
9
8
7
7
7
12
12
11
8
8
8
7
7
7
10
9
9
8
7
7
7
7
6
All this setup done once only at the beginning of
time. Depends only on mesh connectivity
11Recomputing Matrix
- Matrix entries depend on surface
- must render into matrix
- two additional indirection textures
- previous and next
12Results (NV30_at_500MHz)
- 37k elements
- matrix multiply
- 33 instructions, 120 per second
- only 13 flops
- latency limited
- reduction
- 7 inst/frag/pass, 3400 per second
- CG solve 20 per second
13Regular Grids
- Poisson solver as example
- multigrid approach
- this time variables on pixel grid
- e.g. Navier-Stokes
after discretization solve Poisson eq. at each
time step
14Poisson Equation
- Appears all over the place
- easy to discretize on regular grid
- matrix multiply isstencil application
- FD Laplace stencil
- Use iterative matrix solver
- just need application of stencil
- easy just like filtering
- incorporate geometry (Jacobian)
- variable coefficients
15Multigrid
- Fine to coarse to fine cycle
- high freq. error removed quickly
- lower frequency error takes longer
Relax, Project, Interpolate
16Computations and Storage Layout
- Lots of stencil applications
- matrix multiply 3x3 stencil
- projection 3x3 stencil
- interpolation 2x2(!)
- floor op in indexing
- Storage for matrices and DOFs
- variables in one texture
- matrices in 9(3x3) textures
- all textures packed
- exploit 4 channels
- domain decomp.
- padded boundary
17Coarser Matrices
- Operator at coarser level
- needed for relaxation at all levels
- Triple matrix product
- work out terms and map to stencils
- exploit local support of stencils
- straightforward but t-e-d-i-o-u-s
18Results (NV30_at_500MHz)
- 257x257 grid
- matrix multiply - 27 instructions
- 1370 per second
- interpolation 10 inst.
- projection 19 inst.
- Overall performance
- 257x257 at 80 fps!
19Conclusions
- Enhancements
- global registers for reductions
- texture fetch with offset
- rectangular texture border
- scalar versus vector problems
- Where are we now?
- good streaming processor
- twice as fast as CPU implementation
- lots of room for improvement
- Scientific computing compiler
- better languages! Brook? C?
- manage layout in a buffer