Title: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware
1A Multigrid Solver for Boundary Value Problems
Using Programmable Graphics Hardware
Nolan Goodnight Cliff Woolley Gregory
LewinDavid Luebke Greg Humphreys
University of Virginia
augmented by Klaus Mueller, Stony Brook University
2General-Purpose GPU Programming
- Why do we port algorithms to the GPU?
- How much faster can we expect it to be, really?
- What is the challenge in porting?
3Case Study
- Problem Implement a Boundary Value Problem (BVP)
solver using the GPU - Could benefit an entire class of scientific and
engineering applications, e.g. - Heat transfer
- Fluid flow
4Related Work
- Krüger and Westermann Linear Algebra Operators
for GPU Implementation of Numerical Algorithms - Bolz et al. Sparse Matrix Solvers on the GPU
Conjugate Gradients and Multigrid - Very similar to our system
- Developed concurrently
- Complementary approach
5Driving problem Fluid mechanics sim
- Problem domain is a warped disc
regular grid
regular grid
6BVPs Background
- Boundary value problems are sometimes governedby
PDEs of the form - L? f
- L is some operator
- ? is the problem domain
- f is a forcing function (source term)
- Given L and f, solve for ?.
7BVPs Example
- Heat Transfer
- Find a steady-state temperature distribution T in
a solid of thermal conductivity k with thermal
source S - This requires solving a Poisson equation of the
form - k?2T -S
- This is a BVP where L is the Laplacian operator
?2 - All our applications require a Poisson solver.
8BVPs Solving
- Most such problems cannot be solved analytically
- Instead, discretize onto a grid to form a set of
linear equations, then solve - Direct elimination
- Gauss-Seidel iteration
- Conjugate-gradient
- Strongly implicit procedures
- Multigrid method
9Multigrid method
- Iteratively corrects an approximation to the
solution - Operates at multiple grid resolutions
- Low-resolution grids are used to correct
higher-resolution grids recursively - Very fast, especially for large grids O(n)
10Multigrid method
- Use coarser grid levels to recursively correct an
approximation to the solution - may converge slowly on fine grid -gt restrict to
course grid - push out long wavelength errors quickly (single
grid solvers only smooth out high frequency
errors) - Algorithm
- smooth
- residual ?
- restrict
- recurse
- interpolate
? L?i - f
11Implementation - Overview
- For each step of the algorithm
- Bind as texture maps the buffers that contain the
necessary data (current solution, residual,
source terms, etc.) - Set the target buffer for rendering
- Activate a fragment program that performs the
necessary kernel computation (smoothing, residual
calculation, restriction, interpolation) - Render a grid-sized quad with multitexturing
source buffer texture
source buffer texture
render target buffer
render target buffer
fragment program
12Implementation - Overview
13Input buffers
- Solution buffer four-channel floating point
pixel buffer (p-buffer) - one channel each for solution, residual, source
term, and a debugging term - toggle front and back surfaces used to hold old
and new solution - Operator map contains the discretized operator
(e.g., Laplacian) - Red-black map accelerate odd-even tests (see
later)
14Smoothing
- Jacobi method
- one matrix row
- calculate new value for each solution vector
element - in our application, the aij are the Laplacian
(sparse matrix)
15Smoothing
- Also factor in the source term
- Use Red-black map to update only half of the grid
cells in each pass - converges faster in practice
- known as red-black iteration
- requires two passes per iteration
16Calculate residual
- Apply operator (Laplacian) and source term to the
current solution - residual ? k?2T S
- Store result in the target surface
- Use occlusion query to determine if all solution
fragments are below threshold (e lt threshold) - occlusion query true means all fragments are
below threshold - this is an L? norm, which may be too strict
- less strict norms L1, L2, will require reduction
or fragment accumulation register (not available
yet), could run in CPU instead
17Multigrid reduction and refinement
- Average (restrict) current residual into coarser
grid - Iterate/smooth on coarser grid, solving k?2 ?
-S - Interpolate correction back into finer grid
- or restrict once more -gt recursion
- use bilinear interpolation
- Update grid with this correction
- Iterate/smooth on fine grid
18Boundary conditions
- Dirichlet (prescribed)
- Neumann (prescribed derivative)
- Mixed (coupled value and derivative)
- Uk value at grid point k
- nk normal at grid point k
- Periodic boundaries result in toroidal mapping
- Apply boundary conditions in smoothing pass
19Boundary conditions
- Only need to compute at boundaries
- boundaries need significantly more computations
- restrict computations to boundaries
- GPUs do not allow branching
- or better, both branches are executed and the
invalid fragment is discarded - even more wasteful
- decompose domain into boundary and interior areas
- use general (boundary) and fastpath (interior)
shaders - run these in two separate passes, on respective
domains
20Optimizing the Solver
- Detect steady-state natively on GPU
- Minimize shader length
- Use special-case whenever possible
- Limit context switches
21Optimizing the Solver Steady-state
- How to detect convergence?
- L1 norm - average error
- L2 norm RMS error (common in visual sim)
- L? norm max error (common in sci/eng apps)
- Can use occlusion query!
secs to steady statevs. grid size
22Optimizing the Solver Shader length
- Minimize number of registers used
- Vectorize as much as possible
- Use the rasterizer to perform computations of
linearly-varying values - Pre-compute invariants on CPU
- Compute texture coodinate offsets in vertex shader
23Optimizing the Solver Special-case
- Fast-path vs. slow-path
- write several variants of each fragment program
to handle boundary cases - eliminates conditionals in the fragment program
- equivalent to avoiding CPU inner-loop branching
fast path, no boundaries
slow path with boundaries
24Optimizing the Solver Special-case
- Fast-path vs. slow-path
- write several variants of each fragment program
to handle boundary cases - eliminates conditionals in the fragment program
- equivalent to avoiding CPU inner-loop branching
secs per v-cyclevs. grid size
25Optimizing the Solver Context-switching
- Find best packing data of multiple grid
levelsinto the pbuffer surfaces - many p-buffers
26Optimizing the Solver Context-switching
- Find best packing data of multiple grid
levelsinto the pbuffer surfaces - two p-buffers
27Optimizing the Solver Context-switching
- Find best packing data of multiple grid
levelsinto the pbuffer surfaces - a single
p-buffer - Still one front- and one back surface for
iterative smoothing
28Optimizing the Solver Context-switching
- Remove context switching
- Can introduce operations with undefined results
reading/writing same surface - Why do we need to do this?
- there is a chance that we write and read from the
same surface at the same time - Can we get away with it?
- Yes, we can. Just need to be careful to avoid
these conflicts - What about RGBA parallelism?
- was not used in this implemtation, may give
another boost of factor 4
29Data Layout
secs to steady statevs. grid size
30Data Layout
- Possible additional vectorization
- Compute 4 values at a time
- Requires source, residual, solution values to be
in different buffers - Complicates boundary calculations
- Adds setup and teardown overhead
Stacked domain
31Results CPU vs. GPU
secs to steady statevs. grid size
32Applications Flow Simulation
33Applications High Dynamic Range
CPU
GPU
34Conclusions
- What we need going forward
- Superbuffers
- or Universal support for multiple-surface
pbuffers - or Cheap context switching
- Developer tools
- Debugging tools
- Documentation
- Global accumulator
- Ever increasing amounts of precision, memory
- Textures bigger than 2048 on a side
35Acknowledgements
- Hardware
- David Kirk
- Matt Papakipos
- Driver Support
- Nick Triantos
- Pat Brown
- Stephen Ehmann
- Fragment Programming
- James Percy
- Matt Pharr
- General-purpose GPU
- Mark Harris
- Aaron Lefohn
- Ian Buck
- Funding
- NSF Award 0092793