1
Accelerating Cosmological Simulations through
General Purpose Graphics Processors
Edward Lee, Pritish Jetley, Lukasz Wesolowski
2
N-body Simulations
  • Simulate gravitational force interactions
  • Model millions of particles
  • Motivations
  • Used to study galaxies and large-scale structure
  • See whether results match theory

3
N-body Implementations
  • Pairwise computation between particles
  • O(N²), so infeasible for large systems
  • Barnes-Hut approximation
  • Approximate far-away particles as one
  • Hierarchical decomposition of space
  • Generate a tree structure with particles at the
    leaves
  • Traverse the tree to compute forces
  • Open a node only if it is close (see the
    opening-test sketch below)
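
A minimal sketch of the opening test, assuming the common opening-angle (theta) criterion; the node fields and function name are illustrative, not taken from the slides:

    #include <cuda_runtime.h>   // float3 when compiled as host code

    // Hypothetical node record; field names are illustrative.
    struct TreeNode {
        float3 center;   // center of mass of the node
        float  size;     // side length of the node's bounding cube
        float  mass;
    };

    // Open (descend into) a node iff it subtends too large an angle as
    // seen from point p, i.e. size / distance >= theta; otherwise the
    // whole node is approximated as a single far-away mass.
    __host__ __device__ bool mustOpen(const TreeNode& n, float3 p, float theta) {
        float dx = n.center.x - p.x;
        float dy = n.center.y - p.y;
        float dz = n.center.z - p.z;
        float dist = sqrtf(dx * dx + dy * dy + dz * dz);
        return n.size >= theta * dist;
    }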

4
Optimization Potential
  • Gravity computation
  • Group particles into buckets
  • Process buckets in parallel for computations
  • Force computation, subtree walk, space
    decomposition
  • Ewald summation
  • Allows particles at far ends of the box to
    interact with each other via replicated universes
  • Compute acceleration and potential of particles,
    then consider far-away effects

5
Periodic Boundary Condition
  • Cube-shaped universe model
  • A and B on opposite faces of the cube
  • The particles are in close proximity in a
    universe with periodic boundary conditions (see
    the wrap-around sketch below)
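
A sketch of the wrap-around displacement this implies, assuming a cubic box of side L centered at the origin; the helper name is hypothetical:

    #include <math.h>

    // Map a displacement component into [-L/2, L/2]: under periodic
    // boundary conditions, A and B on opposite faces are near neighbors.
    __host__ __device__ float minimumImage(float d, float L) {
        return d - L * rintf(d / L);
    }

    // Usage for the slide's A/B example (boxSide is the cube edge):
    //   float dx = minimumImage(b.x - a.x, boxSide);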

6
Ewald Summation
  • 3D array of replicas of the universe
  • Local summation in the space domain
  • Global summation in the Fourier domain (see the
    textbook form below)
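
In textbook form (not shown on the slides; alpha is the Ewald splitting parameter, V the box volume, and constant and self-energy terms are omitted), the periodic potential splits into the two sums:

    \phi(\mathbf{r}) =
        \sum_{\mathbf{n}} \frac{\operatorname{erfc}(\alpha\,|\mathbf{r}+\mathbf{n}L|)}{|\mathbf{r}+\mathbf{n}L|}
      + \frac{4\pi}{V} \sum_{\mathbf{k}\neq 0} \frac{e^{-k^{2}/4\alpha^{2}}}{k^{2}} \cos(\mathbf{k}\cdot\mathbf{r})

The erfc sum converges quickly in real space (the local kernel); the smooth remainder converges quickly in Fourier space (the global kernel).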

7
Implementation
  • Two kernels
  • Local summation
  • Fourier domain summation
  • One thread per particle

8
Local Summation Kernel
  • Triple-nested for loop (sketched below)
  • 343 iterations (a 7×7×7 block of replica cells)
    for the 5-million-particle set
  • Branch inside the loop
  • Special functions erfc, expf, sqrt
  • Some integer-calculation overhead
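
A hedged sketch of such a kernel: one thread per particle, a triple loop over a 7×7×7 block of replica cells (the 343 iterations), and erfcf/expf/sqrtf in the inner body. For brevity only a monopole per replica (total mass at the box's center of mass) is kept; the real kernel carries more terms, and all names here are assumptions:

    __global__ void ewaldLocalKernel(const float4* pos, float4* acc, int n,
                                     float L, float alpha,
                                     float3 com, float totalMass) {
        const float twoOverSqrtPi = 1.1283791671f;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 p = pos[i];
        float ax = 0.f, ay = 0.f, az = 0.f;
        for (int nx = -3; nx <= 3; ++nx)               // 7 x 7 x 7 = 343
          for (int ny = -3; ny <= 3; ++ny)
            for (int nz = -3; nz <= 3; ++nz) {
                float dx = p.x - (com.x + nx * L);
                float dy = p.y - (com.y + ny * L);
                float dz = p.z - (com.z + nz * L);
                float r2 = dx * dx + dy * dy + dz * dz;
                if (r2 == 0.f) continue;               // the branch: skip self
                float r = sqrtf(r2);
                // Short-range Ewald force weight: erfc tail + Gaussian term.
                float g = erfcf(alpha * r)
                        + twoOverSqrtPi * alpha * r * expf(-alpha * alpha * r2);
                float w = totalMass * g / (r2 * r);
                ax -= w * dx; ay -= w * dy; az -= w * dz;
            }
        acc[i].x += ax; acc[i].y += ay; acc[i].z += az;
    }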

9
Fourier Summation Kernel
  • Single for loop (sketched below)
  • 80 iterations for the 5-million-particle set
  • Special functions cos, sin
  • Overall speedup for both kernels: 9.6x
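
Again a hedged sketch: one thread per particle loops over a precomputed table of wave vectors (the 80 iterations), accumulating sin/cos terms. The table layout, the baked-in coefficient, and keeping the potential in acc.w are all assumptions:

    struct EwaldKTerm { float kx, ky, kz, coeff; };   // assumed table layout

    __global__ void ewaldFourierKernel(const float4* pos, float4* acc, int n,
                                       const EwaldKTerm* table, int nTerms) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 p = pos[i];
        float ax = 0.f, ay = 0.f, az = 0.f, pot = 0.f;
        for (int t = 0; t < nTerms; ++t) {            // ~80 iterations here
            EwaldKTerm k = table[t];
            float phase = k.kx * p.x + k.ky * p.y + k.kz * p.z;
            pot += k.coeff * cosf(phase);             // potential term
            float s = k.coeff * sinf(phase);          // gradient of cos -> force
            ax += s * k.kx; ay += s * k.ky; az += s * k.kz;
        }
        acc[i].x += ax; acc[i].y += ay; acc[i].z += az;
        acc[i].w += pot;                              // potential kept in .w
    }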

10
Gravity Computation
  • Control flow (see the driver sketch below)
  • Transfer particle and tree data to the device
  • Compute interaction lists for buckets on the host
  • Transfer the lists to the device
  • Commence force computation on the device
  • Synchronize, transfer results from the device
  • Update host data structures
  • Can't calculate all forces simultaneously
  • Memory constraints
  • List computation times prohibitive
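
A placeholder host driver mirroring those six steps (error checks and the chunking of the next slide omitted; forceKernel stands in for the real force kernel):

    #include <cuda_runtime.h>

    __global__ void forceKernel(const float4*, float4*, const int*, int); // elsewhere

    void gravityStep(const float4* h_pos, float4* h_acc, int n,
                     const int* h_lists, int listLen) {
        float4 *d_pos, *d_acc; int *d_lists;
        cudaMalloc(&d_pos, n * sizeof(float4));
        cudaMalloc(&d_acc, n * sizeof(float4));
        cudaMalloc(&d_lists, listLen * sizeof(int));
        cudaMemset(d_acc, 0, n * sizeof(float4));
        cudaMemcpy(d_pos, h_pos, n * sizeof(float4),            // 1. particle/tree
                   cudaMemcpyHostToDevice);                     //    data to device
        // 2. h_lists was built on the host, one list per bucket.
        cudaMemcpy(d_lists, h_lists, listLen * sizeof(int),     // 3. lists to device
                   cudaMemcpyHostToDevice);
        int threads = 128;                                      // 4. compute forces
        forceKernel<<<(n + threads - 1) / threads, threads>>>(d_pos, d_acc,
                                                              d_lists, n);
        cudaDeviceSynchronize();                                // 5. sync, copy back
        cudaMemcpy(h_acc, d_acc, n * sizeof(float4), cudaMemcpyDeviceToHost);
        // 6. update host data structures from h_acc here.
        cudaFree(d_pos); cudaFree(d_acc); cudaFree(d_lists);
    }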

11
CUDA Implementation
  • Split computation into "chunks"
  • Sets of contiguously numbered buckets
  • Buckets in a chunk are spatially proximal
  • Advantages
  • Smaller memory footprint
  • Can overlap list construction of one chunk with
    force computation of the previous one (see the
    stream sketch below)
  • Force computation becomes essentially free!
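
One hedged way to realize that overlap is a double-buffered pipeline of CUDA streams and asynchronous copies; buildLists, bucketsIn, and listBytes are placeholders, and a full version would guard buffer reuse with cudaEventSynchronize:

    // While the device uploads and processes chunk c, the host builds the
    // interaction lists for chunk c+1 into the other pinned buffer.
    cudaStream_t s;
    cudaStreamCreate(&s);
    buildLists(0, h_lists[0]);                        // lists for the first chunk
    for (int c = 0; c < numChunks; ++c) {
        cudaMemcpyAsync(d_lists, h_lists[c & 1], listBytes(c),
                        cudaMemcpyHostToDevice, s);
        forceKernel<<<bucketsIn(c), THREADS, 0, s>>>(d_pos, d_acc, d_lists);
        if (c + 1 < numChunks)
            buildLists(c + 1, h_lists[(c + 1) & 1]);  // overlaps with GPU work
    }
    cudaStreamSynchronize(s);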

12
CUDA Implementation, contd.
  • Further reduction
  • All buckets refer to nodes/particles from the
    same universal set, so lists contain many
    repetitions
  • Use indirection in the lists (sketched below)
  • Example
  • No chunks, no indirection: node lists alone take
    44 GB
  • Chunked computation (100 chunks per iteration):
    427 MB
  • With indirection: 65 MB (62 MB of it node data)
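
The chunk numbers line up with the arithmetic: 100 chunks keep roughly 1/100 of the lists resident at a time (44 GB / 100 ≈ 440 MB), and indirection removes the remaining duplication. Sketched with assumed types, node data is uploaded once per chunk and the lists shrink to 4-byte indices into it:

    #include <cuda_runtime.h>

    struct MultipoleNode {        // assumed record; real nodes also carry
        float3 com;               // higher-order multipole moments
        float  mass;
    };

    struct DeviceChunk {
        MultipoleNode* nodes;     // deduplicated node data (the 62 MB)
        int* listEntries;         // concatenated per-bucket index lists
        int* listOffsets;         // where each bucket's slice begins
    };

    // A thread resolves entry j of bucket b as:
    //   MultipoleNode n = chunk.nodes[chunk.listEntries[chunk.listOffsets[b] + j]];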

13
Code Features
  • Clean interface between CUDA and Charm++ (C++)
    code
  • Kernel force-computation organization
  • Given a set of particles S and an interaction
    list L, the iteration space is S × L
  • Different ways to split points in the iteration
    space between threads
  • Assign a single block to a bucket
  • Therefore, no sharing between threads across
    blocks (see the kernel sketch below)
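
A hedged sketch of that organization: one block per bucket, with threads cooperatively staging a tile of the interaction list L into shared memory and then each thread sweeping its own particles of S against the tile. Tile size, node layout, and the softening constant are assumptions:

    #define TILE 64

    // listOffsets has numBuckets+1 entries; nodes are xyz + mass.
    __global__ void bucketForceKernel(const float4* pos, float4* acc,
                                      const float4* nodes,
                                      const int* listEntries,
                                      const int* listOffsets,
                                      const int* bucketStart,
                                      const int* bucketEnd) {
        int b = blockIdx.x;                            // one block per bucket
        __shared__ float4 tile[TILE];
        for (int base = listOffsets[b]; base < listOffsets[b + 1]; base += TILE) {
            int count = min(TILE, listOffsets[b + 1] - base);
            for (int j = threadIdx.x; j < count; j += blockDim.x)
                tile[j] = nodes[listEntries[base + j]];   // stage a tile of L
            __syncthreads();
            for (int i = bucketStart[b] + threadIdx.x; i < bucketEnd[b];
                 i += blockDim.x) {                    // this thread's slice of S
                float4 p = pos[i];
                float ax = acc[i].x, ay = acc[i].y, az = acc[i].z;
                for (int j = 0; j < count; ++j) {
                    float dx = tile[j].x - p.x, dy = tile[j].y - p.y,
                          dz = tile[j].z - p.z;
                    float r2 = dx*dx + dy*dy + dz*dz + 1e-9f;  // softening
                    float inv = rsqrtf(r2);
                    float w = tile[j].w * inv * inv * inv;     // m / r^3
                    ax += w * dx; ay += w * dy; az += w * dz;
                }
                acc[i].x = ax; acc[i].y = ay; acc[i].z = az;
            }
            __syncthreads();
        }
    }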

14
Code Features, contd.
  • Organizing kernel computations

15
Code Features, contd.
  • The group-based scheme has a better global
    memory access pattern
  • But requires slightly more shared memory per
    block
  • Used in the current implementation

16
Performance Insights
  • Decreasing chunk size (computation grain)
    improves execution time
  • Exact cause unknown
  • Speculation
  • List construction is memory-intensive
  • Random memory access pattern; increased TLB and
    cache misses
  • GPU held up by the CPU
  • Knowing the correct chunk size is critical

17
Test Framework
  • Compares GPU results with expected values (see
    the sketch below)
  • Small 30,000-particle data set
  • Good accuracy results
  • GPU version
  • RMS 0.015855065808977459
  • Maximum 0.51600037721903569
  • CPU version (double precision)
  • RMS 0.015855063670699353
  • Maximum 0.51600645407026324
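
The check itself is simple; a hedged host-side sketch of how RMS and maximum errors could be computed over per-particle accelerations (the slides do not specify the exact quantity being compared):

    #include <cuda_runtime.h>   // float3 / double3
    #include <cmath>
    #include <algorithm>

    // Compare GPU results against a double-precision CPU reference.
    void accuracyStats(const float3* gpu, const double3* ref, int n,
                       double* rms, double* maxErr) {
        double sumSq = 0.0, worst = 0.0;
        for (int i = 0; i < n; ++i) {
            double dx = gpu[i].x - ref[i].x;
            double dy = gpu[i].y - ref[i].y;
            double dz = gpu[i].z - ref[i].z;
            double e2 = dx * dx + dy * dy + dz * dz;
            sumSq += e2;
            worst = std::max(worst, e2);
        }
        *rms = std::sqrt(sumSq / n);
        *maxErr = std::sqrt(worst);
    }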

18
Performance
  • Highly clustered, 5-million-particle data set
  • Application execution times

19
Performance, contd.
  • Time breakdown

20
Performance, contd.
  • Data transfer traffic

21
Performance, contd.
  • Speedups
  • nodeGravityComputation: 25.094x
  • particleGravityComputation: 60.049x
  • Application: 9.02x

22
Future Work
  • Automatic, measurement-based interpolation of
    performance parameters
  • Exploit other parallel parts of the algorithm
  • Multiple CPUs and GPUs
  • Quantitative study of trade-offs: how exactly
    various factors affect performance
  • Approximate model of computation?