Scientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma - PowerPoint PPT Presentation

Ramani Duraiswami and Nail Gumerov, Computer Science / UMIACS, University of Maryland, College Park. Joint work with Yuancheng Luo, Adam O'Donovan, Bill Dorland, and students of CMSC 828E.
1
Scientific Computing on Graphical Processors
FMM, Flagon, Signal Processing, Plasma
  • Ramani Duraiswami and Nail Gumerov
  • Computer Science UMIACS
  • University of Maryland, College Park

Joint work with Yuancheng Luo, Adam O'Donovan,
Bill Dorland and students of CMSC 828E
(Scientific Computing on Graphical Processors)
www.umiacs.umd.edu/ramani/cmsc828e_gpusci
2
FMM on the GPU
  • N.A. Gumerov and R. Duraiswami, Fast multipole
    methods on graphics processors. Journal of
    Computational Physics, 227, 8290-8313, 2008.
  • N-body problems - important in stellar dynamics,
    molecular modeling, etc.
  • Several papers implement the quadratic algorithm on
    the GPU (but are restricted to O(10^4) particles)
  • To go to O(10^6) particles and beyond we need the FMM
  • Reduces quadratic complexity to linear order
  • Complex algorithm which relies on a balance
    between local interactions (brute force) and
    tree-based far field

3
Direct summation on GPU
The FMM requires a balance between direct summation
and the rest of the algorithm.
Compare the GPU direct-summation cost

    Cost = A1 N + B1 N/s + C1 N s,

with the total FMM cost

    Cost = A N + B N/s + C N s.

The optimal cluster size for the direct-summation step of
the FMM is

    s_opt = (B1 / C1)^(1/2).

For the hybrid algorithm this leads to

    Cost = (A + A1) N + (B + B1) N/s + C1 N s,

and

    s_opt = ((B + B1) / C1)^(1/2).
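The cost balance above can be sketched numerically. This is a minimal illustration of the cost model, with made-up constants A, B, C, A1, B1, C1 (the slides do not give their values); it only shows that the analytic s_opt minimizes the modeled cost.

```python
import math

# Hypothetical cost constants (per-particle work) -- illustrative only,
# not the measured values from the paper.
A, B, C = 2.0, 40.0, 0.5      # far-field setup, translation, local-direct terms
A1, B1, C1 = 0.1, 2.0, 0.05   # the same terms when direct summation runs on the GPU

def fmm_cost(N, s, a, b, c):
    """Cost model from the slide: Cost = a*N + b*N/s + c*N*s."""
    return a * N + b * N / s + c * N * s

# Minimizing b*N/s + c*N*s over s gives s_opt = sqrt(b/c).
s_opt_cpu = math.sqrt(B / C)            # all-CPU FMM
s_opt_gpu = math.sqrt((B + B1) / C1)    # hybrid with GPU direct summation

N = 1_048_576
cost_opt = fmm_cost(N, s_opt_gpu, A + A1, B + B1, C1)
# The analytic optimum should beat nearby cluster sizes.
assert cost_opt <= fmm_cost(N, 2 * s_opt_gpu, A + A1, B + B1, C1)
assert cost_opt <= fmm_cost(N, s_opt_gpu / 2, A + A1, B + B1, C1)
```

Because C1 is much smaller than C on the GPU, the optimal cluster size grows, shifting more work into the brute-force direct-summation step.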
4
Performance on an 8800 GTX
[Plots: run times for N = 1,048,576 particles; left: potential only,
right: potential + forces (gradient).]
5
Performance
Computations of the potential and forces. Peak
performance of the GPU for direct summation is 290
Gigaflops, while for the FMM on the GPU effective
rates in the range 25-50 Teraflops are observed
(following the citation below).
[Plot: timings for direct summation and FMM on GPU and CPU.]
M.S. Warren, J.K. Salmon, D.J. Becker, M.P. Goda,
T. Sterling, G.S. Winckelmans. "Pentium Pro
inside: I. A treecode at 430 Gigaflops on ASCI
Red," Bell prize winning paper at SC'97, 1997.
6
Performance
[Plots: FMM timings for truncation numbers p = 4, 8, and 12.]
7
Which is more accurate for the solution of large
problems on the GPU: direct summation or the FMM?
Error computed over a grid of 729 sampling
points, relative to the exact solution, which is
direct summation in double precision.
A possible reason why the GPU error in direct
summation grows: systematic roundoff error in the
computation of the function 1/sqrt(x) (still an
open question).
8
Flagon Use GPUs via extensible libraries
  • GPUs are great, as we have all heard
  • But they require you to program in an extended
    version of C
  • Need the NVIDIA toolchain
  • What if you have an application that is
  • In Fortran 9x/2003, Matlab, C/C++
  • Too large to fit on the GPU and needs to use the
    CPU cores, MPI, etc. as part of a larger
    application, but should still take advantage of the GPU?
  • Offload computations which have good speedups on
    the GPU to it, using library calls in your
    programming environment
  • Enter the FLAGON
  • An extensible open-source library and middleware
    framework that allows use of the GPU
  • Implemented currently for Fortran-9X, and
    preliminarily for C and MATLAB

9
Programming on the GPU
  • GPU organized as 2-30 groups of multiprocessors
    (8 relatively slow processors each) with a small
    amount of their own memory, fast access to common
    shared memory, and slow access to global memory
  • Factor of 100s difference in speed as one goes up
    the memory hierarchy
  • To achieve gains, problems must fit the SPMD
    paradigm and manage memory
  • Research issues
  • Identifying important tasks and mapping them to
    the architecture
  • Making it convenient for programmers to call GPU
    code from host code

Local memory: ~50 kB
GPU global memory: ~1 GB
Host memory: 2-32 GB
10
Approach to using the GPU: Flagon middleware
  • Defines a Module/Class that provides CPU-side
    pointers to Device Variables on the GPU
  • Executes small, well-written CU functions to
    perform primitive operations on the device
  • Avoids data transfer overhead by
  • Initially using pinned memory copies and pointers
  • Subsequently transferring data to the CPU only when
    necessary
  • Provides wrappers to BLAS, FFT, and other software
    (random numbers, sort, screen dump, etc.)
  • Allows incorporation of existing mechanisms for
    distributed programming (OpenMP, MPI, etc.)
    to handle clusters
  • Allows relatively easy conversion of existing code

11
Sample scientific computing applications
  • Radial basis function fitting
  • Plasma turbulence computations
  • Fast Multipole Force calculation in particle
    systems
  • Machine Learning
  • Numerical Relativity
  • Space Turbulence
  • Signal Processing
  • Integral Equations

12
FLAGON Device Variables
  • User instantiates device variables in Fortran
  • Encapsulates parameters and attributes of the
    data structure transferred between host and
    device
  • Tracks (via pointers) allocated memory on the
    device
  • Stores data attributes (type and dimensions) on
    the host and device

13
FLAGON Work-Cycle
  • Compile and link the library with user Fortran code
  • Load the library into memory
  • Allocate device variables and copy host data to
    the device
  • The work-cycle allows subsequent computations to be
    performed solely on the device
  • Transfer data from device to host when done
  • Discard/free data on the device
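The work-cycle above can be sketched as a small class. This is a hypothetical stand-in, not FLAGON's actual API: a NumPy array plays the role of GPU memory, and the class name and methods are invented for illustration.

```python
import numpy as np

class DeviceVariable:
    """Hypothetical sketch of a FLAGON-style device variable: the host keeps
    a handle to 'device' storage plus the data's type and dimensions, and
    chained operations stay on the device between transfers."""

    def __init__(self, shape, dtype=np.float32):
        self.shape, self.dtype = shape, np.dtype(dtype)
        self._dev = np.empty(shape, dtype=dtype)   # "allocate on device"

    def to_device(self, host_array):
        self._dev[...] = host_array                # host -> device copy

    def to_host(self):
        return self._dev.copy()                    # device -> host copy

    def scale(self, alpha):
        self._dev *= alpha                         # compute stays on "device"
        return self

    def free(self):
        self._dev = None                           # discard device storage

# Work-cycle: allocate, copy in, compute on device, copy out, free.
x = DeviceVariable((4,))
x.to_device(np.arange(4, dtype=np.float32))
x.scale(2.0).scale(0.5)          # chained device-side ops, no host traffic
result = x.to_host()
x.free()
```

The point of the pattern is the middle step: any number of device-side operations can run between the two transfers, so the host-device copies amortize over the whole computation.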

14
FLAGON Functions
  • Initialization functions
  • open_devObjects, close_devObjects
  • Memory functions
  • Allocation/deallocation
  • allocate_dv(chartype, nx, ny, nz)
  • deallocate_dv(devVar)
  • Memory transfer
  • transfer_i, r, c4(hostVar, devVar, c2g)
  • transfer_i, r, c (hostVar, devVar, c2g)
  • Memory copy
  • copy(devVar1,devVar2)
  • function cloneDeepWData(devVarA)
  • function cloneDeepWOData(devVarA)
  • Misc.
  • swap(devVar1, devVar2)
  • part(deviceVariable,i1,i2,j1,j2,k1,k2)
  • get_i, s, c
  • set_i, s, c
  • Point-wise Functions
  • CUBLAS Functions
  • BLAS 1, BLAS 2, BLAS 3 (with shorter call
    strings)
  • CUFFT Functions
  • FFT Plans
  • devf_fftplan(devVariable, fft_type, batch)
  • devf_destroyfftplan(plan)
  • FFT Functions
  • devf_fft(input, plan, output)
  • devf_bfft(input, plan, output)
  • devf_ifft(input, plan, output)
  • devf_fftR2C(input, plan, output)
  • devf_fftC2R(input, plan, output)
  • CUDPP Functions
  • devf_ancCUDPPSortScan(devVarIn, devVarOut,
    operation, dataType, algorithm, option)
  • devf_ancCUDPPSortSimple(devVarIn, devVarOut)
  • Ancillary Functions
  • devf_ancMatrixTranspose(devVarIn, devVarOut)
  • devf_ancBitonicSort(devVar1)

Extensible
15
Example of code conversion
16
Plasma turbulence computations
  • spectral code, solved via a standard Runge-Kutta
    time advance, coupled with a pseudo-spectral
    evaluation of NL terms.
  • Derivatives are evaluated in k-space, while
    multiplications in Eq. (2) are carried out in
    real space.
  • standard 2/3 rule for dealiasing is applied, and
    small hyperviscous damping terms are added to
    provide stability at the grid scale.
  • results agree with analytic expectations and same
    on both CPU GPU.

32x speedup!
with Bill Dorland
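The two pseudo-spectral ingredients named above (k-space derivatives and the 2/3 dealiasing rule) can be sketched in a few lines. This is a generic 1-D NumPy illustration of the technique, not the actual plasma code; grid size and box length are arbitrary.

```python
import numpy as np

def spectral_derivative(u):
    """Differentiate a periodic signal in k-space: multiply the Fourier
    coefficients by i*k and transform back (derivatives in k-space,
    products in real space)."""
    n = u.size
    ik = 1j * np.fft.fftfreq(n, d=1.0 / n)   # integer wavenumbers on [0, 2*pi)
    return np.fft.ifft(ik * np.fft.fft(u)).real

def dealias_23(uhat):
    """Standard 2/3 rule: zero the top third of wavenumbers before forming
    nonlinear products, to prevent aliasing at the grid scale."""
    n = uhat.size
    k = np.abs(np.fft.fftfreq(n, d=1.0 / n))
    out = uhat.copy()
    out[k > n // 3] = 0.0
    return out

x = np.linspace(0, 2 * np.pi, 64, endpoint=False)
du = spectral_derivative(np.sin(x))      # spectrally exact: equals cos(x)
assert np.allclose(du, np.cos(x), atol=1e-10)
```

On the GPU this maps naturally onto batched CUFFT calls plus point-wise multiplies, which is exactly the operation mix the FLAGON wrappers expose.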
17
Audio Camera
Adam O'Donovan
  • Spherical array of microphones
  • Using beamforming algorithms we developed, can find
    sounds coming from particular directions
  • Run several beamformers, one per look direction,
    and assign each output to an audio pixel
  • Compose an audio image.
  • Transform the spherical array into a camera for
    audio images
  • Requires significant processing to form pixels
    from all directions in a frame before the next
    frame is ready

[Images: audio images plotted against azimuth and elevation.]
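The "one beamformer per look direction = one audio pixel" idea can be illustrated with the simplest beamformer there is. This is a plain delay-and-sum sketch with integer-sample delays and synthetic data; the actual array uses more sophisticated spherical-array beamforming, so treat everything here as an invented toy.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """One 'audio pixel': undo each microphone's steering delay for a given
    look direction and average the aligned signals."""
    aligned = [np.roll(s, -d) for s, d in zip(signals, delays)]
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(0)
n, true_delays = 256, [0, 3, 7]
src = rng.standard_normal(n)
# Simulate one source arriving at each mic with a known delay plus noise.
mics = [np.roll(src, d) + 0.1 * rng.standard_normal(n) for d in true_delays]

pixel = delay_and_sum(mics, true_delays)   # steered at the true direction
off = delay_and_sum(mics, [0, 0, 0])       # steered somewhere else
# The correctly steered beam recovers the source much better.
assert np.corrcoef(pixel, src)[0, 1] > np.corrcoef(off, src)[0, 1]
```

Forming an image means running this for every (azimuth, elevation) pixel on every frame, which is why the per-pixel independence makes the problem such a good GPU fit.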
18
O'Donovan et al. Several papers in IEEE CVPR,
IEEE ICASSP, WASPAA (2007-2008). Movies at
www.umiacs.umd.edu/odonovan/Audio_Camera
19
Plasma Computations via PIC
Image courtesy George Stantchev and Bill Dorland
20
Data structures for coalesced access
  • Particles modeling a density, or real particles
  • Right-hand side of the evolution equation controlled
    by a PDE for a field solved on a regular grid
  • Either spectrally or via finite differences
  • Before/after each time step, field quantities must be
    interpolated between grid nodes and particles
  • Particles in a box organized using octrees
    created via bit interleaving, resulting in a
    Morton curve layout
  • Update procedures at the end of each time step

George Stantchev, William Dorland, Nail Gumerov,
"Fast parallel particle-to-grid interpolation
for plasma PIC simulations on the GPU," J.
Parallel Distrib. Comput., 2008
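The bit-interleaving step behind the Morton curve layout is short enough to show directly. A minimal sketch, assuming unsigned integer grid coordinates of at most `nbits` bits each:

```python
def interleave_bits(x, y, z, nbits=10):
    """Morton (Z-order) key by bit interleaving: spread the bits of the three
    integer grid coordinates so that particles close in space tend to get
    close keys -- the layout used to organize particles for coalesced access."""
    key = 0
    for i in range(nbits):
        key |= ((x >> i) & 1) << (3 * i)        # bit i of x -> key bit 3i
        key |= ((y >> i) & 1) << (3 * i + 1)    # bit i of y -> key bit 3i+1
        key |= ((z >> i) & 1) << (3 * i + 2)    # bit i of z -> key bit 3i+2
    return key

# Keys order particles along a Morton curve through the octree cells.
assert interleave_bits(0, 0, 0) == 0
assert interleave_bits(1, 0, 0) == 0b001
assert interleave_bits(0, 1, 0) == 0b010
assert interleave_bits(1, 1, 1) == 0b111
assert interleave_bits(2, 0, 0) == 0b001000
```

Sorting particles by this key groups members of the same octree cell into contiguous memory, which is what makes the particle-to-grid gather/scatter coalesce on the GPU.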
21
Numerical relativity
  • Beginning collaboration with Prof. Tiglio's group
  • Student (John Dickerson) project in CMSC 828 E
  • Spectral element computations of Kerr tails in
    numerical relativity accelerated using FLAGON

22
Kernel Methods on GPUs
Balaji V. Srinivasan
  • Kernel methods are very popular in computational
    statistics and computational ML
  • kernel density estimation (KDE)
  • Renyi entropy based distances between
    distributions (KRD)
  • Gaussian process regression
  • Acceleration of 10x to 100x on a GT240

Optimized bandwidth-based KDE
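For reference, the sum that KDE evaluates is the same shape as the N-body sums earlier in the talk. A minimal Gaussian-kernel sketch of the O(N*M) formula (the GPU work accelerates exactly this kind of sum; the bandwidth value here is arbitrary):

```python
import math

def gaussian_kde(samples, x, bandwidth):
    """Kernel density estimate at point x: the average of Gaussian kernels
    centered on the samples, normalized so the estimate integrates to 1."""
    norm = 1.0 / (math.sqrt(2 * math.pi) * bandwidth * len(samples))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                      for s in samples)

# With a single sample at 0, the density peaks at 0 and decays away from it.
d0 = gaussian_kde([0.0], 0.0, 1.0)
d2 = gaussian_kde([0.0], 2.0, 1.0)
assert abs(d0 - 1.0 / math.sqrt(2 * math.pi)) < 1e-12
assert d2 < d0
```

Each evaluation point is independent of the others, so one GPU thread per evaluation point parallelizes the estimate trivially; the per-sample inner sum is the memory-bound part that benefits from shared-memory tiling.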
23
Map-Reduce framework for large-scale video analysis
Aparna Kotha
  • Video data is extremely large and ubiquitous
  • Particular motivation: 30,000 hours of biological
    video (courtship rituals of Australian bowerbirds)
  • Algorithm framework: reduce frames to a few
    features and compare frame-based features
  • Ripe for Map-Reduce type operations
  • Simple bird locator and activity detector
  • 3x speed-up
  • More complex video processing: larger speedups

24
LVIS Data Analysis
Shravya Konda
  • NASA's Laser Vegetation Imaging Sensor
  • LIDAR based
  • Analyze the returned pulse for peaks and mode
    characteristics
  • Achieved 25x speedup on an 8800 GTX
  • Work ongoing

Thanks to Michelle Hofton (Geography, UMD) and
J. Brian Blair (NASA Goddard) for data and
discussion
25
Adding QR, LU, random initialization to FLAGON
Liping Liu
  • Flagon allows Fortran-9X users to define
    GPU-based variables as pointers, copy data to them,
    and use the GPU
  • Allows custom functions for extensibility
  • Lightweight, no-overhead GPU use
  • Added dense matrix decompositions to FLAGON
  • LU for linear systems
  • QR for least squares
  • Random number initialization (uniform and normal)
  • Port of the work of Volkov/Demmel (Berkeley) and
    Giles (Oxford)
  • Achieved the speedups reported by these authors,
    but in the Flagon framework

26
Displaying Flagon objects during simulation
Adam O'Donovan
  • A much-discussed application of GPUs: monitoring
    computations as they proceed
  • Perhaps use it for computational steering
  • A mechanism to throw up line graphs (vector
    data), matrix data (colour maps), and slice data
    on screen
  • Issues: OpenGL thread model and interaction
    with CUDA computations

27
Other CMSC 828E projects
  • Implementing a caching scheme for GPU
    computing - Kapil Anand
  • Accelerating the Approximate Nearest Neighbor
    library - Daniel Hakim
  • Adding multi-GPU MPI capabilities to Flagon -
    Kate DeSpain
  • Support Vector Machines for Speaker ID on
    CUDA - Samuel Lamphier