1
Accelerating Cosmological Simulations through
General Purpose Graphics Processors
Edward Lee, Pritish Jetley, Lukasz Wesolowski
2
N-body Simulations
  • Simulate gravitational force interactions
  • Model millions of particles
  • Motivations
  • Used to study galaxies and large-scale structure
  • See whether results match theory

3
N-body Implementations
  • Pairwise computation between particles
  • O(N²), so infeasible for large systems
  • Barnes-Hut approximation
  • Approximate far-away particles as one
  • Hierarchical decomposition of space
  • Generate a tree structure with particles at the
    leaves
  • Traverse the tree to compute forces
  • Open a node only if it is close (see the
    opening-test sketch below)
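
A minimal sketch of the opening test, assuming the common opening-angle (theta) criterion; the node fields and function name are illustrative, not taken from the slides:

    #include <cuda_runtime.h>   // float3 when compiled as host code

    // Hypothetical node record; field names are illustrative.
    struct TreeNode {
        float3 center;   // center of mass of the node
        float  size;     // side length of the node's bounding cube
        float  mass;
    };

    // Open (descend into) a node iff it subtends too large an angle as
    // seen from point p, i.e. size / distance >= theta; otherwise the
    // whole node is approximated as a single far-away mass.
    __host__ __device__ bool mustOpen(const TreeNode& n, float3 p, float theta) {
        float dx = n.center.x - p.x;
        float dy = n.center.y - p.y;
        float dz = n.center.z - p.z;
        float dist = sqrtf(dx * dx + dy * dy + dz * dz);
        return n.size >= theta * dist;
    }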

4
Optimization Potential
  • Gravity computation
  • Group particles into buckets
  • Process buckets in parallel for computations
  • Force computation, subtree walk, space
    decomposition
  • Ewald summation
  • Allows particles at far ends of the box to
    interact with each other via replicated universes
  • Compute acceleration and potential of particles,
    then consider far-away effects

5
Periodic Boundary Condition
  • Cube-shaped universe model
  • A and B on opposite faces of the cube
  • The particles are in close proximity in a
    universe with periodic boundary conditions (see
    the wrap-around sketch below)
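
A sketch of the wrap-around displacement this implies, assuming a cubic box of side L centered at the origin; the helper name is hypothetical:

    #include <math.h>

    // Map a displacement component into [-L/2, L/2]: under periodic
    // boundary conditions, A and B on opposite faces are near neighbors.
    __host__ __device__ float minimumImage(float d, float L) {
        return d - L * rintf(d / L);
    }

    // Usage for the slide's A/B example (boxSide is the cube edge):
    //   float dx = minimumImage(b.x - a.x, boxSide);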

6
Ewald Summation
  • 3D array of replicas of the universe
  • Local summation in the space domain
  • Global summation in the Fourier domain (see the
    textbook form below)
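
In textbook form (not shown on the slides; alpha is the Ewald splitting parameter, V the box volume, and constant and self-energy terms are omitted), the periodic potential splits into the two sums:

    \phi(\mathbf{r}) =
        \sum_{\mathbf{n}} \frac{\operatorname{erfc}(\alpha\,|\mathbf{r}+\mathbf{n}L|)}{|\mathbf{r}+\mathbf{n}L|}
      + \frac{4\pi}{V} \sum_{\mathbf{k}\neq 0} \frac{e^{-k^{2}/4\alpha^{2}}}{k^{2}} \cos(\mathbf{k}\cdot\mathbf{r})

The erfc sum converges quickly in real space (the local kernel); the smooth remainder converges quickly in Fourier space (the global kernel).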

7
Implementation
  • Two kernels
  • Local summation
  • Fourier domain summation
  • One thread per particle

8
Local Summation Kernel
  • Triple-nested for loop (sketched below)
  • 343 iterations (a 7×7×7 block of replica cells)
    for the 5-million-particle set
  • Branch inside the loop
  • Special functions erfc, expf, sqrt
  • Some integer-calculation overhead
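
A hedged sketch of such a kernel: one thread per particle, a triple loop over a 7×7×7 block of replica cells (the 343 iterations), and erfcf/expf/sqrtf in the inner body. For brevity only a monopole per replica (total mass at the box's center of mass) is kept; the real kernel carries more terms, and all names here are assumptions:

    __global__ void ewaldLocalKernel(const float4* pos, float4* acc, int n,
                                     float L, float alpha,
                                     float3 com, float totalMass) {
        const float twoOverSqrtPi = 1.1283791671f;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 p = pos[i];
        float ax = 0.f, ay = 0.f, az = 0.f;
        for (int nx = -3; nx <= 3; ++nx)               // 7 x 7 x 7 = 343
          for (int ny = -3; ny <= 3; ++ny)
            for (int nz = -3; nz <= 3; ++nz) {
                float dx = p.x - (com.x + nx * L);
                float dy = p.y - (com.y + ny * L);
                float dz = p.z - (com.z + nz * L);
                float r2 = dx * dx + dy * dy + dz * dz;
                if (r2 == 0.f) continue;               // the branch: skip self
                float r = sqrtf(r2);
                // Short-range Ewald force weight: erfc tail + Gaussian term.
                float g = erfcf(alpha * r)
                        + twoOverSqrtPi * alpha * r * expf(-alpha * alpha * r2);
                float w = totalMass * g / (r2 * r);
                ax -= w * dx; ay -= w * dy; az -= w * dz;
            }
        acc[i].x += ax; acc[i].y += ay; acc[i].z += az;
    }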

9
Fourier Summation Kernel
  • Single for loop (sketched below)
  • 80 iterations for the 5-million-particle set
  • Special functions cos, sin
  • Overall speedup for both kernels: 9.6x
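
Again a hedged sketch: one thread per particle loops over a precomputed table of wave vectors (the 80 iterations), accumulating sin/cos terms. The table layout, the baked-in coefficient, and keeping the potential in acc.w are all assumptions:

    struct EwaldKTerm { float kx, ky, kz, coeff; };   // assumed table layout

    __global__ void ewaldFourierKernel(const float4* pos, float4* acc, int n,
                                       const EwaldKTerm* table, int nTerms) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 p = pos[i];
        float ax = 0.f, ay = 0.f, az = 0.f, pot = 0.f;
        for (int t = 0; t < nTerms; ++t) {            // ~80 iterations here
            EwaldKTerm k = table[t];
            float phase = k.kx * p.x + k.ky * p.y + k.kz * p.z;
            pot += k.coeff * cosf(phase);             // potential term
            float s = k.coeff * sinf(phase);          // gradient of cos -> force
            ax += s * k.kx; ay += s * k.ky; az += s * k.kz;
        }
        acc[i].x += ax; acc[i].y += ay; acc[i].z += az;
        acc[i].w += pot;                              // potential kept in .w
    }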

10
Gravity Computation
  • Control flow (see the driver sketch below)
  • Transfer particle and tree data to the device
  • Compute interaction lists for buckets on the host
  • Transfer the lists to the device
  • Commence force computation on the device
  • Synchronize, transfer results from the device
  • Update host data structures
  • Can't calculate all forces simultaneously
  • Memory constraints
  • List computation times prohibitive
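
A placeholder host driver mirroring those six steps (error checks and the chunking of the next slide omitted; forceKernel stands in for the real force kernel):

    #include <cuda_runtime.h>

    __global__ void forceKernel(const float4*, float4*, const int*, int); // elsewhere

    void gravityStep(const float4* h_pos, float4* h_acc, int n,
                     const int* h_lists, int listLen) {
        float4 *d_pos, *d_acc; int *d_lists;
        cudaMalloc(&d_pos, n * sizeof(float4));
        cudaMalloc(&d_acc, n * sizeof(float4));
        cudaMalloc(&d_lists, listLen * sizeof(int));
        cudaMemset(d_acc, 0, n * sizeof(float4));
        cudaMemcpy(d_pos, h_pos, n * sizeof(float4),            // 1. particle/tree
                   cudaMemcpyHostToDevice);                     //    data to device
        // 2. h_lists was built on the host, one list per bucket.
        cudaMemcpy(d_lists, h_lists, listLen * sizeof(int),     // 3. lists to device
                   cudaMemcpyHostToDevice);
        int threads = 128;                                      // 4. compute forces
        forceKernel<<<(n + threads - 1) / threads, threads>>>(d_pos, d_acc,
                                                              d_lists, n);
        cudaDeviceSynchronize();                                // 5. sync, copy back
        cudaMemcpy(h_acc, d_acc, n * sizeof(float4), cudaMemcpyDeviceToHost);
        // 6. update host data structures from h_acc here.
        cudaFree(d_pos); cudaFree(d_acc); cudaFree(d_lists);
    }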

11
CUDA Implementation
  • Split computation into "chunks"
  • Sets of contiguously numbered buckets
  • Buckets in a chunk are spatially proximal
  • Advantages
  • Smaller memory footprint
  • Can overlap list construction of one chunk with
    force computation of the previous one (see the
    stream sketch below)
  • Force computation becomes essentially free!
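
One hedged way to realize that overlap is a double-buffered pipeline of CUDA streams and asynchronous copies; buildLists, bucketsIn, and listBytes are placeholders, and a full version would guard buffer reuse with cudaEventSynchronize:

    // While the device uploads and processes chunk c, the host builds the
    // interaction lists for chunk c+1 into the other pinned buffer.
    cudaStream_t s;
    cudaStreamCreate(&s);
    buildLists(0, h_lists[0]);                        // lists for the first chunk
    for (int c = 0; c < numChunks; ++c) {
        cudaMemcpyAsync(d_lists, h_lists[c & 1], listBytes(c),
                        cudaMemcpyHostToDevice, s);
        forceKernel<<<bucketsIn(c), THREADS, 0, s>>>(d_pos, d_acc, d_lists);
        if (c + 1 < numChunks)
            buildLists(c + 1, h_lists[(c + 1) & 1]);  // overlaps with GPU work
    }
    cudaStreamSynchronize(s);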

12
CUDA Implementation, contd.
  • Further reduction
  • All buckets refer to nodes/particles from the
    same universal set, so lists contain many
    repetitions
  • Use indirection in the lists (sketched below)
  • Example
  • No chunks, no indirection: node lists alone take
    44 GB
  • Chunked computation (100 chunks per iteration):
    427 MB
  • With indirection: 65 MB (62 MB of it node data)
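
The chunk numbers line up with the arithmetic: 100 chunks keep roughly 1/100 of the lists resident at a time (44 GB / 100 ≈ 440 MB), and indirection removes the remaining duplication. Sketched with assumed types, node data is uploaded once per chunk and the lists shrink to 4-byte indices into it:

    #include <cuda_runtime.h>

    struct MultipoleNode {        // assumed record; real nodes also carry
        float3 com;               // higher-order multipole moments
        float  mass;
    };

    struct DeviceChunk {
        MultipoleNode* nodes;     // deduplicated node data (the 62 MB)
        int* listEntries;         // concatenated per-bucket index lists
        int* listOffsets;         // where each bucket's slice begins
    };

    // A thread resolves entry j of bucket b as:
    //   MultipoleNode n = chunk.nodes[chunk.listEntries[chunk.listOffsets[b] + j]];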

13
Code Features
  • Clean interface between CUDA and Charm++ (C++)
    code
  • Kernel force-computation organization
  • Given a set of particles S and an interaction
    list L, the iteration space is S × L
  • Different ways to split points in the iteration
    space between threads
  • Assign a single block to a bucket
  • Therefore, no sharing between threads across
    blocks (see the kernel sketch below)
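
A hedged sketch of that organization: one block per bucket, with threads cooperatively staging a tile of the interaction list L into shared memory and then each thread sweeping its own particles of S against the tile. Tile size, node layout, and the softening constant are assumptions:

    #define TILE 64

    // listOffsets has numBuckets+1 entries; nodes are xyz + mass.
    __global__ void bucketForceKernel(const float4* pos, float4* acc,
                                      const float4* nodes,
                                      const int* listEntries,
                                      const int* listOffsets,
                                      const int* bucketStart,
                                      const int* bucketEnd) {
        int b = blockIdx.x;                            // one block per bucket
        __shared__ float4 tile[TILE];
        for (int base = listOffsets[b]; base < listOffsets[b + 1]; base += TILE) {
            int count = min(TILE, listOffsets[b + 1] - base);
            for (int j = threadIdx.x; j < count; j += blockDim.x)
                tile[j] = nodes[listEntries[base + j]];   // stage a tile of L
            __syncthreads();
            for (int i = bucketStart[b] + threadIdx.x; i < bucketEnd[b];
                 i += blockDim.x) {                    // this thread's slice of S
                float4 p = pos[i];
                float ax = acc[i].x, ay = acc[i].y, az = acc[i].z;
                for (int j = 0; j < count; ++j) {
                    float dx = tile[j].x - p.x, dy = tile[j].y - p.y,
                          dz = tile[j].z - p.z;
                    float r2 = dx*dx + dy*dy + dz*dz + 1e-9f;  // softening
                    float inv = rsqrtf(r2);
                    float w = tile[j].w * inv * inv * inv;     // m / r^3
                    ax += w * dx; ay += w * dy; az += w * dz;
                }
                acc[i].x = ax; acc[i].y = ay; acc[i].z = az;
            }
            __syncthreads();
        }
    }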

14
Code Features, contd.
  • Organizing kernel computations

15
Code Features, contd.
  • The group-based scheme has a better global
    memory access pattern
  • But requires slightly more shared memory per
    block
  • Used in the current implementation

16
Performance Insights
  • Decreasing chunk size (computation grain)
    improves execution time
  • Exact cause unknown
  • Speculation
  • List construction is memory-intensive
  • Random memory access pattern; increased TLB and
    cache misses
  • GPU held up by the CPU
  • Knowing the correct chunk size is critical

17
Test Framework
  • Compares GPU results with expected values (see
    the sketch below)
  • Small 30,000-particle data set
  • Good accuracy results
  • GPU version
  • RMS 0.015855065808977459
  • Maximum 0.51600037721903569
  • CPU version (double precision)
  • RMS 0.015855063670699353
  • Maximum 0.51600645407026324
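
The check itself is simple; a hedged host-side sketch of how RMS and maximum errors could be computed over per-particle accelerations (the slides do not specify the exact quantity being compared):

    #include <cuda_runtime.h>   // float3 / double3
    #include <cmath>
    #include <algorithm>

    // Compare GPU results against a double-precision CPU reference.
    void accuracyStats(const float3* gpu, const double3* ref, int n,
                       double* rms, double* maxErr) {
        double sumSq = 0.0, worst = 0.0;
        for (int i = 0; i < n; ++i) {
            double dx = gpu[i].x - ref[i].x;
            double dy = gpu[i].y - ref[i].y;
            double dz = gpu[i].z - ref[i].z;
            double e2 = dx * dx + dy * dy + dz * dz;
            sumSq += e2;
            worst = std::max(worst, e2);
        }
        *rms = std::sqrt(sumSq / n);
        *maxErr = std::sqrt(worst);
    }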

18
Performance
  • Highly clustered, 5-million-particle data set
  • Application execution times

19
Performance, contd.
  • Time breakdown

20
Performance, contd.
  • Data transfer traffic

21
Performance, contd.
  • Speedups
  • nodeGravityComputation: 25.094x
  • particleGravityComputation: 60.049x
  • Application: 9.02x

22
Future Work
  • Automatic, measurement-based interpolation of
    performance parameters
  • Exploit other parallel parts of the algorithm
  • Multiple CPUs and GPUs
  • Quantitative study of trade-offs: how exactly
    various factors affect performance
  • Approximate model of computation?