A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware - PowerPoint PPT Presentation

About This Presentation
Title:

A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Description:

A Multigrid Solver for Boundary Value Problems Using Programmable ... General-purpose GPU. Mark Harris. Aaron Lefohn. Ian Buck. Funding. NSF Award #0092793 ... – PowerPoint PPT presentation

Number of Views:62
Avg rating:3.0/5.0
Slides: 36
Provided by: csSu5
Category:

less

Transcript and Presenter's Notes

Title: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware


1
A Multigrid Solver for Boundary Value Problems
Using Programmable Graphics Hardware
Nolan Goodnight Cliff Woolley Gregory
LewinDavid Luebke Greg Humphreys
University of Virginia
augmented by Klaus Mueller, Stony Brook University
2
General-Purpose GPU Programming
  • Why do we port algorithms to the GPU?
  • How much faster can we expect it to be, really?
  • What is the challenge in porting?

3
Case Study
  • Problem Implement a Boundary Value Problem (BVP)
    solver using the GPU
  • Could benefit an entire class of scientific and
    engineering applications, e.g.
  • Heat transfer
  • Fluid flow

4
Related Work
  • Krüger and Westermann Linear Algebra Operators
    for GPU Implementation of Numerical Algorithms
  • Bolz et al. Sparse Matrix Solvers on the GPU
    Conjugate Gradients and Multigrid
  • Very similar to our system
  • Developed concurrently
  • Complementary approach

5
Driving problem Fluid mechanics sim
  • Problem domain is a warped disc

regular grid
regular grid
6
BVPs Background
  • Boundary value problems are sometimes governedby
    PDEs of the form
  • L? f
  • L is some operator
  • ? is the problem domain
  • f is a forcing function (source term)
  • Given L and f, solve for ?.

7
BVPs Example
  • Heat Transfer
  • Find a steady-state temperature distribution T in
    a solid of thermal conductivity k with thermal
    source S
  • This requires solving a Poisson equation of the
    form
  • k?2T -S
  • This is a BVP where L is the Laplacian operator
    ?2
  • All our applications require a Poisson solver.

8
BVPs Solving
  • Most such problems cannot be solved analytically
  • Instead, discretize onto a grid to form a set of
    linear equations, then solve
  • Direct elimination
  • Gauss-Seidel iteration
  • Conjugate-gradient
  • Strongly implicit procedures
  • Multigrid method

9
Multigrid method
  • Iteratively corrects an approximation to the
    solution
  • Operates at multiple grid resolutions
  • Low-resolution grids are used to correct
    higher-resolution grids recursively
  • Very fast, especially for large grids O(n)

10
Multigrid method
  • Use coarser grid levels to recursively correct an
    approximation to the solution
  • may converge slowly on fine grid -gt restrict to
    course grid
  • push out long wavelength errors quickly (single
    grid solvers only smooth out high frequency
    errors)
  • Algorithm
  • smooth
  • residual ?
  • restrict
  • recurse
  • interpolate

? L?i - f
11
Implementation - Overview
  • For each step of the algorithm
  • Bind as texture maps the buffers that contain the
    necessary data (current solution, residual,
    source terms, etc.)
  • Set the target buffer for rendering
  • Activate a fragment program that performs the
    necessary kernel computation (smoothing, residual
    calculation, restriction, interpolation)
  • Render a grid-sized quad with multitexturing

source buffer texture
source buffer texture
render target buffer
render target buffer
fragment program
12
Implementation - Overview
13
Input buffers
  • Solution buffer four-channel floating point
    pixel buffer (p-buffer)
  • one channel each for solution, residual, source
    term, and a debugging term
  • toggle front and back surfaces used to hold old
    and new solution
  • Operator map contains the discretized operator
    (e.g., Laplacian)
  • Red-black map accelerate odd-even tests (see
    later)

14
Smoothing
  • Jacobi method
  • one matrix row
  • calculate new value for each solution vector
    element
  • in our application, the aij are the Laplacian
    (sparse matrix)

15
Smoothing
  • Also factor in the source term
  • Use Red-black map to update only half of the grid
    cells in each pass
  • converges faster in practice
  • known as red-black iteration
  • requires two passes per iteration

16
Calculate residual
  • Apply operator (Laplacian) and source term to the
    current solution
  • residual ? k?2T S
  • Store result in the target surface
  • Use occlusion query to determine if all solution
    fragments are below threshold (e lt threshold)
  • occlusion query true means all fragments are
    below threshold
  • this is an L? norm, which may be too strict
  • less strict norms L1, L2, will require reduction
    or fragment accumulation register (not available
    yet), could run in CPU instead

17
Multigrid reduction and refinement
  • Average (restrict) current residual into coarser
    grid
  • Iterate/smooth on coarser grid, solving k?2 ?
    -S
  • Interpolate correction back into finer grid
  • or restrict once more -gt recursion
  • use bilinear interpolation
  • Update grid with this correction
  • Iterate/smooth on fine grid

18
Boundary conditions
  • Dirichlet (prescribed)
  • Neumann (prescribed derivative)
  • Mixed (coupled value and derivative)
  • Uk value at grid point k
  • nk normal at grid point k
  • Periodic boundaries result in toroidal mapping
  • Apply boundary conditions in smoothing pass

19
Boundary conditions
  • Only need to compute at boundaries
  • boundaries need significantly more computations
  • restrict computations to boundaries
  • GPUs do not allow branching
  • or better, both branches are executed and the
    invalid fragment is discarded
  • even more wasteful
  • decompose domain into boundary and interior areas
  • use general (boundary) and fastpath (interior)
    shaders
  • run these in two separate passes, on respective
    domains

20
Optimizing the Solver
  • Detect steady-state natively on GPU
  • Minimize shader length
  • Use special-case whenever possible
  • Limit context switches

21
Optimizing the Solver Steady-state
  • How to detect convergence?
  • L1 norm - average error
  • L2 norm RMS error (common in visual sim)
  • L? norm max error (common in sci/eng apps)
  • Can use occlusion query!

secs to steady statevs. grid size
22
Optimizing the Solver Shader length
  • Minimize number of registers used
  • Vectorize as much as possible
  • Use the rasterizer to perform computations of
    linearly-varying values
  • Pre-compute invariants on CPU
  • Compute texture coodinate offsets in vertex shader

23
Optimizing the Solver Special-case
  • Fast-path vs. slow-path
  • write several variants of each fragment program
    to handle boundary cases
  • eliminates conditionals in the fragment program
  • equivalent to avoiding CPU inner-loop branching

fast path, no boundaries
slow path with boundaries
24
Optimizing the Solver Special-case
  • Fast-path vs. slow-path
  • write several variants of each fragment program
    to handle boundary cases
  • eliminates conditionals in the fragment program
  • equivalent to avoiding CPU inner-loop branching

secs per v-cyclevs. grid size
25
Optimizing the Solver Context-switching
  • Find best packing data of multiple grid
    levelsinto the pbuffer surfaces - many p-buffers

26
Optimizing the Solver Context-switching
  • Find best packing data of multiple grid
    levelsinto the pbuffer surfaces - two p-buffers

27
Optimizing the Solver Context-switching
  • Find best packing data of multiple grid
    levelsinto the pbuffer surfaces - a single
    p-buffer
  • Still one front- and one back surface for
    iterative smoothing

28
Optimizing the Solver Context-switching
  • Remove context switching
  • Can introduce operations with undefined results
    reading/writing same surface
  • Why do we need to do this?
  • there is a chance that we write and read from the
    same surface at the same time
  • Can we get away with it?
  • Yes, we can. Just need to be careful to avoid
    these conflicts
  • What about RGBA parallelism?
  • was not used in this implemtation, may give
    another boost of factor 4

29
Data Layout
  • Performance

secs to steady statevs. grid size
30
Data Layout
  • Possible additional vectorization
  • Compute 4 values at a time
  • Requires source, residual, solution values to be
    in different buffers
  • Complicates boundary calculations
  • Adds setup and teardown overhead

Stacked domain
31
Results CPU vs. GPU
  • Performance

secs to steady statevs. grid size
32
Applications Flow Simulation
33
Applications High Dynamic Range
CPU
GPU
34
Conclusions
  • What we need going forward
  • Superbuffers
  • or Universal support for multiple-surface
    pbuffers
  • or Cheap context switching
  • Developer tools
  • Debugging tools
  • Documentation
  • Global accumulator
  • Ever increasing amounts of precision, memory
  • Textures bigger than 2048 on a side

35
Acknowledgements
  • Hardware
  • David Kirk
  • Matt Papakipos
  • Driver Support
  • Nick Triantos
  • Pat Brown
  • Stephen Ehmann
  • Fragment Programming
  • James Percy
  • Matt Pharr
  • General-purpose GPU
  • Mark Harris
  • Aaron Lefohn
  • Ian Buck
  • Funding
  • NSF Award 0092793
Write a Comment
User Comments (0)
About PowerShow.com