Brook for GPUs
1
Brook for GPUs
  • Ian Buck, Tim Foley, Daniel Horn, Jeremy
    Sugerman, Kayvon Fatahalian, Mike Houston, Pat
    Hanrahan
  • Stanford University
  • DARPA Site Visit, UNC
  • May 6th, 2004

2
Motivation
  • GPUs are faster than CPUs
  • GPUs are getting faster, faster
  • Why?
  • Massive parallelism (1000s of ALUs)
  • Choreographed communication
  • Efficiently utilize VLSI resources: the DIS/PCA mantra
  • Programmable GPUs are stream processors
  • Many streaming applications beyond graphics
  • Buy a desktop supercomputer for $50!
  • Revolutionize computing?

3
Recent Performance Trends
4
(No Transcript)
5
CPU vs GPU
  • Intel 3 GHz Pentium 4
    • 12 GFLOPS peak performance (via SSE2)
    • 5.96 GB/sec peak memory bandwidth
    • 44 GB/sec peak bandwidth from 8 KB L1 data cache
  • NVIDIA GeForce 6800
    • 45 GFLOPS peak performance
    • 36 GB/sec peak memory bandwidth
    • Texture cache bandwidth and size (undisclosed)?
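One way to read these numbers (a worked aside added here, not on the original slide, assuming 4-byte floats and the peak rates above): how many math operations each chip must perform per float fetched from main memory to stay compute bound rather than bandwidth bound.

    \[ \text{P4: } \frac{12\ \text{GFLOPS}}{5.96\ \text{GB/s} / 4\ \text{B per float}} \approx 8 \text{ ops per float} \qquad \text{GeForce 6800: } \frac{45\ \text{GFLOPS}}{36\ \text{GB/s} / 4\ \text{B per float}} = 5 \text{ ops per float} \]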

6
Deliverables
  • Develop a version of PCA Brook for GPUs
  • Programmer need not know GL (see the kernel sketch below)
  • Versions:
    • New ATI (420) and NVIDIA (NV40) hardware
    • Linux and Windows
    • DX and OpenGL
  • Release as open source (v1.0, Dec 2003)
  • Support OneSAF line-of-sight (LOS), collision detection, and route-planning algorithms
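For illustration, a minimal Brook kernel plus host-side sketch in the style of the BrookGPU release (the arrays X, Y, Result and the constant 2.0f are assumed for the example, not taken from the slides); note that no GL code appears anywhere:

    // Kernel: runs once per stream element on the GPU.
    kernel void saxpy(float a, float4 x<>, float4 y<>, out float4 result<>) {
        result = a * x + y;
    }

    // Host side: declare streams, move data in, invoke the kernel, read back.
    float4 X[100], Y[100], Result[100];     // ordinary C arrays (assumed inputs)
    float4 x<100>, y<100>, result<100>;     // Brook streams
    streamRead(x, X);                       // copy memory -> stream
    streamRead(y, Y);
    saxpy(2.0f, x, y, result);              // executes as a GPU fragment program
    streamWrite(result, Result);            // copy stream -> memory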

7
Research Issues
  • Brook semantics
    • E.g., variable-length streams (vout)
  • Compilation techniques
    • Virtualization of the GPU
    • Splitting kernels (MRDS)
  • Explore streaming application space
    • Scientific computing: RT, MD, BLAS, FFT, ...
    • Machine learning: HMM, linear models, Bayes, ...

8
Brook Update
  • Ian Buck

9
(No Transcript)
10
Understanding the Efficiency of GPU Algorithms
for Matrix-Matrix Multiplication
  • Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan

11
Dense Matrix-Matrix Multiplication
  • ATLAS on the Intel P4 wins!

12
CPU vs GPU
  • Intel 3 GHz Pentium 4
    • 12 GFLOPS peak performance (via SSE2)
    • 5.96 GB/sec peak memory bandwidth
    • 44 GB/sec peak bandwidth from 8 KB L1 data cache
  • NVIDIA GeForce 6800
    • 43 GFLOPS peak performance
    • 36 GB/sec peak memory bandwidth
    • Texture cache bandwidth and size (undisclosed)?
  • Why is graphics hardware so slow?

13
(No Transcript)
14
Why is Graphics Hardware so Slow?
Microbenchmark (MAD)
              GFLOPS (MAD)   Cache BW (GB/sec)   Seq. Read BW (GB/sec)
NV35          39.99          11.08                4.40
NV40          43.00          18.9                 3.85
ATI 9800XT    26.14          12.20                7.33
ATI X800      33.4           30.7                18.4
  • NVIDIA: 8% compute efficiency, 82% of peak cache bandwidth
    • Arithmetic intensity: 12 math operations per float fetched from cache
  • ATI: 18% of peak performance, 99% of peak cache bandwidth
    • Arithmetic intensity: 8-to-1 math-to-cache-fetch ratio

15
Why is Graphics Hardware so Slow?
Matrix-Matrix Multiplication
              GFLOPS   Bandwidth (GB/sec)
NV35           3.04     9.07
NV40           7.24    14.88
ATI 9800XT     4.83    12.06
ATI X800      12       30
P4             7.78    27.68
  • Matrix-matrix multiplication is bandwidth limited on the GPU
  • Memory blocking to increase cache utilization does not help
  • Architectural problem, not a programming-model problem
  • PCA stream-processing architectures (Imagine) will do much better!
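As a cross-check (added here, not on the original slides), the efficiency figures on the previous slide are simply these matrix-multiplication results divided by the corresponding microbenchmark peaks:

    \[ \text{NV35: } \frac{3.04}{39.99} \approx 8\% \text{ of peak MAD rate}, \qquad \frac{9.07}{11.08} \approx 82\% \text{ of peak cache bandwidth} \]
    \[ \text{ATI 9800XT: } \frac{4.83}{26.14} \approx 18\% \text{ of peak}, \qquad \frac{12.06}{12.20} \approx 99\% \text{ of peak cache bandwidth} \]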

16
Variable Output Shaders
Daniel Horn, Ian Buck, Pat Hanrahan
17
Motivation: Enabling Algorithms
  • Not all algorithms map to the 1-in, 1-out semantics of GPUs
  • Other classes of algorithms require data filtering (1-in, 0-out) and amplification (1-in, n-out)
  • vout is conditional write on Imagine (a filtering sketch follows this slide)
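A hypothetical sketch of the filtering case in Brook-style syntax; the vout parameter and push construct below are written from the variable-output semantics described on this slide and are illustrative rather than a guaranteed match for the exact Brook grammar:

    // Illustrative only: each input element emits zero or one outputs, so the
    // length of the output stream is data dependent (1-in, 0-out filtering).
    kernel void keepPositive(float input<>, vout float output<>) {
        if (input > 0.0f) {
            output = input;   // value to emit
            push(output);     // append it to the variable-length output stream
        }
        // else: push nothing; this element is filtered out
    }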

18
Algorithms
  • Ray Tracing terrains
  • Marching Cubes
  • Adaptive Subdivision Surfaces
  • Collision Detection (OBB)
  • Graph traversal

19
Implementation on GPU
  • Push output (sentinel if no push)
  • Options to consolidate sentinels (a scan-based sketch follows this slide):
    • Sort: O(n (log n)²)
      • Sort sentinels to the end, truncate
    • Scan/Search: O(n log n)
      • Perform a running sum, then search for each element's gather location
    • Scan/Scatter: O(n log n)
      • Perform a running sum, scatter to destination
    • Constant-time hardware implementation
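A CPU reference sketch in C of the scan/scatter option (added here, not from the slides): a running sum over the non-sentinel entries assigns each surviving element its destination index, and a scatter packs the stream. On the GPU the running sum is computed as a separate O(n log n) parallel scan pass rather than the sequential loop used in this sketch.

    #include <stdio.h>

    #define SENTINEL -1.0f   /* marks "no push" slots in the output buffer */

    /* Pack all non-sentinel values of src into dst; return how many survive.
       The running count is the prefix sum: it is exactly the destination
       index of each valid element. */
    static int consolidate(const float *src, float *dst, int n) {
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (src[i] != SENTINEL) {
                dst[count++] = src[i];   /* scatter to its destination */
            }
        }
        return count;
    }

    int main(void) {
        float pushed[8] = { 3, SENTINEL, 7, SENTINEL, SENTINEL, 1, 4, SENTINEL };
        float packed[8];
        int m = consolidate(pushed, packed, 8);
        for (int i = 0; i < m; i++)
            printf("%g ", packed[i]);    /* prints: 3 7 1 4 */
        printf("\n");
        return 0;
    }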

20
Timing and Bandwidth Numbers
21
Future Work
  • Brook semantics, compiling, virtualization
  • Support new GPU features (branching, FB ops, ...)
  • Predication
  • Integration with graphics pipeline
  • Documented path to texture for rendering
  • Access to other GPU features, e.g. occlusion culling
  • Interactive simulation: new algorithms
  • Collision detection and line-of-sight calculations
  • Merge ray tracer with UNC/SAIC algorithm
  • Machine learning: HMM, GLM, K-means, ...
  • Protein folding (StreamMD) and docking
  • Virtual surgery

22
Distributed Brook
  • Stream- and thread-level parallelism
  • UPC distributed memory semantics
  • PCI-express system for fast readback

23
GPU Cluster (DOE)
  • 16-node cluster, each node 3U half depth
  • Cluster totals:
    • 32 2.4 GHz P4 Xeons
    • 16 GB DDR
    • 1.2 TB disk
    • InfiniBand 4X interconnect
  • Per node:
    • Dual 2.4 GHz P4 Xeons
    • Intel E7505 chipset
    • 1 GB DDR
    • ATI Radeon 9800 Pro 256 MB
    • GigE
    • 80 GB IDE

24
Questions?
Fly-fishing fly images from The English Fly
Fishing Shop