GPU Acceleration of Scientific Applications Using CUDA
1
GPU Acceleration of Scientific Applications Using CUDA
  • John E. Stone
  • Theoretical and Computational Biophysics Group
  • NIH Resource for Macromolecular Modeling and Bioinformatics
  • Beckman Institute for Advanced Science and Technology
  • University of Illinois at Urbana-Champaign

2
What Speedups Can GPUs Achieve?
  • Speedups of 8x to 30x are quite common
  • Best speedups (100x!) are attained on codes that
    are skewed towards floating point arithmetic,
    esp. CPU-unfriendly operations that prevent use
    of SSE vectorization
  • Amdahl's Law can prevent legacy codes from
    achieving peak speedups with only shallow GPU
    acceleration efforts (see the worked example below)
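
A quick worked example of the Amdahl's Law point (the numbers are illustrative, not from the talk): if a fraction p of the runtime is accelerated by a factor s, the overall speedup is

  overall speedup = 1 / ((1 - p) + p / s)

With p = 0.8, even an infinitely fast GPU kernel (s approaching infinity) caps the overall speedup at 1 / 0.2 = 5x; only deeper acceleration of the remaining 20% can push past that.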

3
Some GPU Speedup Examples (vs. SSE-vectorized CPU code)
  • Fluorescence microscopy simulation: 12x
  • Molecular dynamics
    • Non-bonded force calc (no pairlist): 8x
    • Non-bonded force calc (pairlist): 16x
  • Electrostatics, ion placement
    • Direct Coulomb summation: 40-120x
    • Multilevel Coulomb summation (short-range
      lattice cutoff): 7x

4
How Difficult is CUDA Programming?
  • Parallel algorithms are the hard part, not the
    programming API
  • CUDA is as easy to learn as any other parallel
    programming API I've used
  • Easy to mix with other parallel programming APIs
    (e.g. POSIX threads, MPI, etc.); see the
    multi-GPU sketch after this list
  • CUDA's fine-grained parallelism nicely
    complements the comparatively coarse-grained
    parallelism available in other APIs
  • GPU hardware constraints present their own
    challenges
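
A minimal sketch of the "easy to mix" point: one POSIX thread per GPU, each bound to its own device with cudaSetDevice(). The kernel, names, and sizes below are illustrative assumptions, not code from the talk.

  #include <cuda_runtime.h>
  #include <pthread.h>

  // Hypothetical kernel: each GPU scales its own array
  __global__ void scale(float *d, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
  }

  // One worker thread per GPU: coarse-grained parallelism (pthreads)
  // wrapping fine-grained parallelism (CUDA)
  void *worker(void *arg) {
    int dev = *(int *) arg;
    cudaSetDevice(dev);          // bind this host thread to one GPU
    int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    return NULL;
  }

  int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 16) ndev = 16;
    pthread_t t[16];
    int id[16];
    for (int i = 0; i < ndev; i++) {
      id[i] = i;
      pthread_create(&t[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < ndev; i++)
      pthread_join(t[i], NULL);
    return 0;
  }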

5
CUDA Class at Illinois
  • ECE498: Programming Massively Parallel Processors
    • Wen-mei Hwu (ECE Professor, UIUC)
    • David Kirk (Chief Scientist, NVIDIA)
    • Several guest lecturers
  • Attended by both students and interested
    researchers on campus
  • Class projects are supported by research groups
    on campus:
    • MRI processing
    • Biomolecular simulations
    • Scientific visualization
    • Many more
  • Class home page, lectures, MP3 audio:
    • http://courses.ece.uiuc.edu/ece498/al1/

6
An Approach to Writing CUDA Kernels
  • Use algorithms that can expose substantial
    parallelism; you'll need thousands of threads
  • Identify the ideal GPU memory system to use for
    kernel data for best performance
  • Minimize host/GPU DMA transfers, and use pinned
    memory buffers when appropriate (see the sketch
    after this list)
  • Optimal kernels involve many trade-offs; these are
    easier to explore through experimentation with
    microbenchmarks based on key components of the
    real science code, without the baggage
  • Analyze the real-world use cases and select the
    kernel(s) that best match, by size, parameters,
    etc.
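
A minimal sketch of the DMA/pinned-memory point (buffer names and sizes are illustrative assumptions): cudaMallocHost() returns page-locked host memory, which speeds up host/GPU DMA and lets cudaMemcpyAsync() overlap transfers with other work.

  #include <cuda_runtime.h>

  int main(void) {
    const int n = 1 << 20;
    float *h_buf, *d_buf;

    // Page-locked (pinned) host buffer: DMA-friendly, and required
    // for truly asynchronous host/GPU copies
    cudaMallocHost(&h_buf, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));
    for (int i = 0; i < n; i++) h_buf[i] = (float) i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous copy: can overlap with CPU work or with kernels
    // running in other streams
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    // ... launch kernels in 'stream' here ...
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
  }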

7
Be Open-Minded
  • Experienced programmers have a hard time getting
    used to the idea that GPUs can actually do
    arithmetic 100x faster than CPUs
  • CPU-centric programming idioms are often frugal
    with arithmetic ops but cavalier with memory
    references/locality/register spills; GPU hardware
    prefers almost the opposite approach
  • Pretend you've never written optimized code
    before and learn the GPU on its own terms;
    don't force it to run CPU-centric code

8
Potentially Beneficial Trade-offs
  • Additional arithmetic for reduced memory
    references, lower register count
  • Example: CPU codes often precalculate values to
    reduce arithmetic. On the GPU, arithmetic is
    cheaper than memory accesses or register use
  • Additional arithmetic/memory to avoid branching,
    and especially branch divergence
  • Example: pad input data to full block sizes
    rather than handling boundaries specially (see
    the sketch after this list)
  • Additional arithmetic for more parallelism
  • Example: decreasing the computational tile size by
    forgoing loop optimizations that reduce redundant
    arithmetic yields better performance on very
    small datasets
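
A minimal sketch of the padding example above (names are illustrative assumptions): rounding the element count up to a whole number of blocks lets every thread run identical, branch-free code, trading a little wasted arithmetic on the padding for the absence of boundary tests.

  #include <cuda_runtime.h>

  #define BLOCKSIZE 256

  // No boundary test: every thread, including those covering the
  // padding, executes exactly the same instructions
  __global__ void process(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * data[i] + 1.0f;   // hypothetical per-element work
  }

  int main(void) {
    int n = 1000;                         // real element count
    int npadded = ((n + BLOCKSIZE - 1) / BLOCKSIZE) * BLOCKSIZE;

    float *d_data;
    cudaMalloc(&d_data, npadded * sizeof(float));
    cudaMemset(d_data, 0, npadded * sizeof(float)); // padding: harmless zeros
    // ... copy the n real elements into d_data here ...

    process<<<npadded / BLOCKSIZE, BLOCKSIZE>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
  }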

9
Fluorescence Microscopy
  • 2-D reaction-diffusion simulation used to predict
    the results of fluorescence microphotolysis
    experiments
  • Simulate 1-10 second microscopy experiments with
    0.1 ms integration timesteps
  • Goal: < 1 min per simulation on commodity PC
    hardware
  • Project home page:
    • http://www.ks.uiuc.edu/Research/microscope/

10
Fluorescence Microscopy (2)
  • Challenges for CPU:
    • Efficient handling of boundary conditions
    • Large number of floating point operations per
      timestep
  • Challenges for GPU/CUDA:
    • Hiding global memory latency, improving memory
      access patterns, controlling register use
    • Few arithmetic operations per memory reference
      (for a GPU); see the stencil sketch below
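
Not the actual microscopy code, but a minimal sketch of the kind of 2-D diffusion stencil this slide describes; the names and the 5-point Laplacian are illustrative assumptions. Note the ratio: roughly six arithmetic operations against five global memory reads per point, which is why latency hiding and access patterns dominate the tuning effort.

  // One explicit diffusion timestep on a 2-D grid (illustrative only)
  __global__ void diffuse_step(const float *in, float *out,
                               int w, int h, float k) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > 0 && x < w - 1 && y > 0 && y < h - 1) {
      int i = y * w + x;
      // 5-point Laplacian: five global loads, few flops per reference
      float lap = in[i - 1] + in[i + 1] + in[i - w] + in[i + w]
                  - 4.0f * in[i];
      out[i] = in[i] + k * lap;   // k folds in D * dt / dx^2
    }
  }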

11
Fluorescence Microscopy (3)
  • Simulation runtime, software development time:
    • Original research code (CPU): 80 min
    • Optimized algorithm (CPU): 27 min
      (40 hours of work)
    • SSE-vectorized (CPU): 8 min
      (20 hours of work)
    • CUDA w/ 8800GTX: 38 sec, 12 times faster than
      SSE! (12 hours of work)
  • It should be possible to improve further, but it
    is already fast enough for real use
  • CUDA code was more similar to the original than
    to the SSE-vectorized version: arithmetic is
    almost free on the GPU

12
Biomolecular Simulation Process
  • Prepare model:
    • Assemble structure
    • Add ions
    • Add solvent
  • Perform molecular dynamics simulation
  • Analyze simulation trajectories
  • GPUs can accelerate many of the steps in this
    process
[Image: Satellite Tobacco Mosaic Virus]
13
Molecular Dynamics: Initial NAMD GPU Performance
  • Full NAMD, not a test harness; Amdahl's Law
    applies
  • Useful performance boost:
    • 8x speedup for nonbonded
    • 5x speedup overall w/o PME
    • 3.5x speedup overall w/ PME
  • Plans for better performance:
    • Overlap GPU and CPU work
    • Tune or port remaining work: PME, bonded,
      integration, etc.

[Chart: ApoA1 benchmark performance (higher is faster), 2.67 GHz Core 2 Quad Extreme vs. GeForce 8800 GTX]
14
Overview of Ion Placement Process
  • Calculate the initial electrostatic potential map
    around the simulated structure, considering the
    contributions of all atoms (the most costly step!)
  • Ions are then placed one at a time (sketched in
    code after this list):
    • Find the voxel containing the minimum potential
      value
    • Add a new ion atom at the location of minimum
      potential
    • Add the potential contribution of the newly
      placed ion to the entire map
    • Repeat until the required number of ions have
      been added
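
A CPU-side sketch of the placement loop just described (all names, units, and the simple occupied-voxel exclusion are my assumptions, not the actual code). The two inner loops, a minimum search over the map and a one-atom Coulomb update of the whole map, are the pieces that map naturally onto CUDA kernels.

  #include <math.h>

  // Linear scan for the voxel with minimum potential; on the GPU this
  // becomes a parallel min-reduction
  static long find_min_voxel(const float *map, long n) {
    long best = 0;
    for (long i = 1; i < n; i++)
      if (map[i] < map[best]) best = i;
    return best;
  }

  // Add one ion's Coulomb contribution to every voxel; on the GPU this
  // is a small direct Coulomb summation (one atom vs. the whole map)
  static void add_ion_potential(float *map, int nx, int ny, int nz,
                                long v, float charge, float spacing) {
    long ix = v % nx, iy = (v / nx) % ny, iz = v / ((long) nx * ny);
    for (long z = 0; z < nz; z++)
      for (long y = 0; y < ny; y++)
        for (long x = 0; x < nx; x++) {
          float dx = (x - ix) * spacing;
          float dy = (y - iy) * spacing;
          float dz = (z - iz) * spacing;
          float r = sqrtf(dx * dx + dy * dy + dz * dz);
          long i = (z * ny + y) * (long) nx + x;
          if (r > 0.0f)
            map[i] += charge / r;
          else
            map[i] += 1e30f;   // occupied voxel: exclude from future minima
        }
  }

  void place_ions(float *map, int nx, int ny, int nz,
                  float charge, float spacing, int nions) {
    long n = (long) nx * ny * nz;
    for (int i = 0; i < nions; i++) {
      long v = find_min_voxel(map, n);   // most favorable location
      add_ion_potential(map, nx, ny, nz, v, charge, spacing);
    }
  }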

15
GPU Accelerated Ion Placement: Electrostatic Potential Calculations
  • Direct Coulomb Summation (DCS)
    • Brute force arithmetic, no approximations, O(MN)
    • GPU 40-120x faster than CPU-SSE
    • Outperforms MCS for small to medium sized
      structures, and for ion placement map updates
    • Template for the inner loop of other
      grid-evaluated kernels (e.g. MCS)
  • Multilevel Coulomb Summation (MCS)
    • Efficient hierarchical approximation, O(M+N)
    • GPU short-range lattice cutoff part 7x faster
      than CPU-SSE
    • Supports periodic boundary conditions

16
Runtime of Coulomb Summation Algorithms on CPU and GPU
[Chart: runtime of Coulomb summation algorithms on CPU and GPU]
17
Direct Coulomb Summation Algorithm
  • At each lattice point, sum the potential
    contributions of all atoms in the simulated
    structure (a kernel sketch follows the diagram
    below):
  • potential += charge[i] / (distance to atom[i])

[Diagram: distance from the lattice point being evaluated to atom i]
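
A sketch of a non-unrolled DCS kernel (my reconstruction, not the talk's exact code): one thread per lattice point on the current z-slice. Following the data layout visible on slide 21, atominfo[i] holds the atom's x and y coordinates, its z distance pre-squared for the current slice, and its charge in w; the energy grid is assumed padded to whole blocks, so no bounds check is needed.

  #define MAXATOMS 4000
  __constant__ float4 atominfo[MAXATOMS];  // cached, broadcast to all threads

  __global__ void dcs_kernel(float *energygrid, int numatoms,
                             float gridspacing, int gridwidth,
                             int zoffset) {
    int xindex = blockIdx.x * blockDim.x + threadIdx.x;
    int yindex = blockIdx.y * blockDim.y + threadIdx.y;
    float coorx = gridspacing * xindex;
    float coory = gridspacing * yindex;

    float energyval = 0.0f;
    for (int i = 0; i < numatoms; i++) {
      float dx = coorx - atominfo[i].x;
      float dy = coory - atominfo[i].y;
      // atominfo[i].z already holds dz*dz for this slice;
      // atominfo[i].w holds the charge
      energyval += atominfo[i].w *
                   (1.0f / sqrtf(dx * dx + dy * dy + atominfo[i].z));
    }
    // grid assumed padded to a whole number of blocks (see slide 19)
    energygrid[zoffset + gridwidth * yindex + xindex] += energyval;
  }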
18
Comparison of Direct Coulomb Summation Kernels on
CPU and GPU
19
DCS CUDA Block/Grid Decomposition (non-unrolled)
[Diagram: grid of thread blocks covering the potential map; thread blocks of 64-256 threads; each thread computes 1 potential; padding waste at the grid edges. A launch sketch follows.]
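
A matching host-side launch sketch for the dcs_kernel above (block and tile sizes are illustrative): 2-D thread blocks of 256 threads, with the grid dimensions rounded up to whole blocks, which is exactly where the "padding waste" in the diagram comes from.

  // One kernel launch per z-slice of the potential map
  void launch_dcs_slice(float *d_energygrid, int numatoms,
                        float gridspacing, int mapwidth, int mapheight,
                        int slicezoffset) {
    dim3 Bsz(16, 16);                          // 256 threads per block
    dim3 Gsz((mapwidth  + Bsz.x - 1) / Bsz.x,  // round up: padding waste
             (mapheight + Bsz.y - 1) / Bsz.y);
    dcs_kernel<<<Gsz, Bsz>>>(d_energygrid, numatoms, gridspacing,
                             mapwidth, slicezoffset);
  }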
20
DCS CUDA Algorithm: Unrolling Loops
  • Add each atom's contribution to several lattice
    points at a time, where the distances differ only
    in one component:
  • potentialA += charge[i] / (distanceA to atom[i])
  • potentialB += charge[i] / (distanceB to atom[i])

[Diagram: distances from lattice points A and B to atom i]
21
DCS Loop Unrolling (CUDA-Unroll4x), Multiple
Lattice Points Per Iteration
  for (atomid = 0; atomid < numatoms; atomid++) {
    float dy = coory - atominfo[atomid].y;
    // atominfo[].z holds dz*dz, precomputed for the current z-slice
    float dysqpdzsq = (dy * dy) + atominfo[atomid].z;
    float dx1 = coorx1 - atominfo[atomid].x;
    float dx2 = coorx2 - atominfo[atomid].x;
    float dx3 = coorx3 - atominfo[atomid].x;
    float dx4 = coorx4 - atominfo[atomid].x;
    // atominfo[].w holds the atom's charge; the dy/dz term is reused
    // for all four lattice points, which differ only in x
    energyvalx1 += atominfo[atomid].w * (1.0f / sqrtf(dx1*dx1 + dysqpdzsq));
    energyvalx2 += atominfo[atomid].w * (1.0f / sqrtf(dx2*dx2 + dysqpdzsq));
    energyvalx3 += atominfo[atomid].w * (1.0f / sqrtf(dx3*dx3 + dysqpdzsq));
    energyvalx4 += atominfo[atomid].w * (1.0f / sqrtf(dx4*dx4 + dysqpdzsq));
  }

22
DCS CUDA Block/Grid Decomposition
(unrolled, coalesced)
Unrolling increases the computational tile size
[Diagram: grid of thread blocks (64-256 threads each); threads compute up to 8 potentials, skipping by half-warps; padding waste at the grid edges]
23
Questions?
[Photo: 8 min exposure, central Illinois]
24
References and Acknowledgements
  • Additional Information and References
    • http://www.ks.uiuc.edu/Research/gpu/
    • http://www.ks.uiuc.edu/Research/vmd/
  • Questions, source code requests
    • John Stone: johns@ks.uiuc.edu
  • Acknowledgements
    • J. Phillips, P. Freddolino, D. Hardy, L. Trabuco,
      K. Schulten (UIUC TCB Group)
    • Prof. Wen-mei Hwu (UIUC)
    • David Kirk and the CUDA team at NVIDIA
    • NIH support: P41-RR05969