GPU Computation Strategies - PowerPoint PPT Presentation

1
GPU Computation Strategies & Tricks
  • Ian Buck, Stanford University

2
DirectX or OpenGL?
  • DirectX
    • Render to Texture
    • SetRenderTarget()
    • No float targets on NV3x
    • Write once run anywhere
    • DBMON
    • Short programs
    • Only 96 instructions required
    • ps_2_a compiler target allows long programs on NV3x
    • Readback is slow!
    • 50 MB/sec
  • OpenGL
    • 0 to N texture addressing
    • GL_TEXTURE_RECTANGLE_EXT
    • Readback is fast
    • Render to Texture not finalized
    • Pbuffer rendering can be slow
    • SuperBuffers
    • GL_EXT_render_target
    • Specialized float formats for ATI and NV
    • No ARB standard for creating float Pbuffer
    • ATI float2: Red and Alpha
    • NV float2: Red and Green

3
ATI Radeon 9800XT or NVIDIA GeForce 5900 Ultra?
Instruction Timings
4
Floating Point Precision
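  • NVIDIA FP32
  • s23e8 (integers exact up to 16,777,216; the first integer
    that cannot be represented is 16,777,217)
  • ATI 24-bit float
  • s16e7 (integers exact up to 131,072; the first integer
    that cannot be represented is 131,073)
  • NVIDIA FP16
  • s10e5 (integers exact up to 2,048; the first integer
    that cannot be represented is 2,049)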

Bit layout: s | exponent | mantissa
value = (-1)^s * 1.mantissa * 2^(exponent - bias)
5
Floating Point Precision
  • Common Mistake
  • Pack large 1D array in 2D texture
  • Compute 1D address in shader
  • Convert 1D address into 2D
  • FP precision will leave unaddressable texels!

First integer address that cannot be represented exactly:
  • NVIDIA FP32: 16,777,217
  • ATI 24-bit float: 131,073
  • NVIDIA FP16: 2,049
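
The addressing failure described on this slide can be reproduced on the CPU. The sketch below is an illustration only, not the presenter's code: `quantize`, `first_unrepresentable_integer`, and `unreachable_texels` are hypothetical helper names, and the model assumes round-to-nearest-even with an unlimited exponent range, which is enough to study integer addresses but is not a full model of any GPU.

```python
import math

def quantize(x, mantissa_bits):
    """Round x to the nearest value with `mantissa_bits` stored mantissa bits.

    Assumption: round-to-nearest-even and no exponent range limit.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                # x == m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)    # stored bits plus the implicit leading 1
    return math.ldexp(round(m * scale) / scale, e)

def first_unrepresentable_integer(mantissa_bits):
    """Smallest positive integer that does not survive quantization (brute force)."""
    n = 1
    while quantize(float(n), mantissa_bits) == n:
        n += 1
    return n

def unreachable_texels(array_length, mantissa_bits):
    """1D indices that a reduced-precision index computation can never produce."""
    reachable = {int(quantize(float(i), mantissa_bits)) for i in range(array_length)}
    return sorted(set(range(array_length)) - reachable)

if __name__ == "__main__":
    print(first_unrepresentable_integer(10))    # s10e5
    print(first_unrepresentable_integer(16))    # s16e7
    missing = unreachable_texels(4096, 10)      # 64x64 texture indexed in FP16
    print(len(missing), missing[:3])
```

Under these assumptions, a 4096-element array packed into a 64x64 texture and indexed in s10e5 loses every odd index from 2,049 upward, which is a quarter of the array.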
6
Multiple Outputs
  • Hardware supported multiple outputs
  • Not as fast as you think

ATI 9800XT
7
Multiple Outputs
  • Software solution
  • Let cgc or fxc do dead code elimination
  • can be a good idea if shader is separable

kernel void foo (float3 a<>, float3 b<>,
                 ..., out float3 x<>, out float3 y<>)
kernel void foo1(float3 a<>, float3 b<>,
                 ..., out float3 x<>)
kernel void foo2(float3 a<>, float3 b<>,
                 ..., out float3 y<>)
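
The software approach can be sketched on the CPU as follows. This is a hedged illustration: the kernel bodies (`a + b` and `a * b`) are invented placeholders, since the slide elides the real ones, and the split is written by hand here where the slide relies on cgc or fxc removing the code that feeds an unused output.

```python
def foo(a, b):
    """Two-output kernel: returns (x, y) for one stream element."""
    x = a + b    # placeholder work that only x depends on
    y = a * b    # placeholder work that only y depends on
    return x, y

def foo1(a, b):
    """foo with the y output dropped; only x's work remains."""
    return a + b

def foo2(a, b):
    """foo with the x output dropped; only y's work remains."""
    return a * b

def run_two_pass(a_stream, b_stream):
    """Emulate two single-output passes over the same input streams."""
    xs = [foo1(a, b) for a, b in zip(a_stream, b_stream)]   # pass 1 writes x
    ys = [foo2(a, b) for a, b in zip(a_stream, b_stream)]   # pass 2 writes y
    return xs, ys
```

When the two outputs share little work, as in this separable placeholder, the two passes together cost about the same arithmetic as the single kernel; shared subexpressions would be recomputed in each pass.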
8
Scatter Techniques
  • Problem: a[i] = p
  • Indirect write
  • Can't set the x,y of a fragment in the pixel shader
  • Also want to do a[i] += p

9
Scatter Techniques
  • Solution 1
  • Sort & Search
  • Shader outputs destination address and data
  • Bitonic sort based on address
  • Run binary search shader over destination buffer
  • Each fragment searches for source data
  • See Sorting and Searching course notes
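
The sort-and-search steps above can be emulated on the CPU as a rough sketch. The function name and the `fill` parameter are assumptions for illustration, Python's built-in sort stands in for the bitonic sort, and the per-destination loop stands in for the binary search shader run over each fragment.

```python
from bisect import bisect_left

def scatter_via_sort_search(addresses, values, dest_size, fill=0.0):
    """Emulate a[i] = p using only sorted gathers.

    Step 1: sort (address, value) pairs by address (stand-in for bitonic sort).
    Step 2: each destination element binary-searches the sorted addresses
            for its own index and copies the matching value, if any.
    """
    pairs = sorted(zip(addresses, values), key=lambda pair: pair[0])
    keys = [address for address, _ in pairs]
    dest = []
    for i in range(dest_size):          # one "fragment" per destination element
        k = bisect_left(keys, i)        # first pair whose address is >= i
        if k < len(keys) and keys[k] == i:
            dest.append(pairs[k][1])
        else:
            dest.append(fill)           # nothing was scattered to this address
    return dest

if __name__ == "__main__":
    print(scatter_via_sort_search([3, 0, 5], [30.0, 10.0, 50.0], 6))
```

If several sources target the same address, this sketch keeps the one that sorts first; a real implementation would have to choose its own tie-breaking rule.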

10
Scatter Techniques
  • Solution 2
  • Render points
  • Use vertex shader to set destination
  • or just read back the data and reissue

11
Scatter Techniques
  • Solution 3
  • Vertex Textures
  • Render data and address to texture
  • Issue points, set point x,y in vertex shader
    using address texture
  • Requires texld instruction in vertex program

12
Conditional Mask
  • How to efficiently implement: if (a) then c = b
  • Kill instruction or LRP c, a, b, c
  • Executes all conditional code
  • Using early Z-kill
  • Set Zbuffer equal to conditional
  • Z test can prevent shader execution
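
As a small illustration of the LRP approach, the sketch below applies the linear-interpolation formula a*b + (1 - a)*c per component, with `a` acting as a 0/1 mask. The helper name `lrp` and the list-based vectors are assumptions for illustration; the point is that both branches are always evaluated and the mask only selects the result.

```python
def lrp(a, b, c):
    """Per-component linear interpolation: a*b + (1 - a)*c.

    With a restricted to 0.0 or 1.0 this acts as "if (a) then c = b",
    but the value of b is computed for every element regardless of a.
    """
    return [ai * bi + (1.0 - ai) * ci for ai, bi, ci in zip(a, b, c)]

if __name__ == "__main__":
    mask = [1.0, 0.0, 1.0]
    new_values = [5.0, 5.0, 5.0]
    old_values = [9.0, 8.0, 7.0]
    print(lrp(mask, new_values, old_values))
```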

13
Conditional Mask
  • Using early Z-kill
  • Z-kill operates at 4x4 block resolution
  • Good only if locality in conditional
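
Why locality matters can be seen with a simplified cost model. The sketch below assumes, purely for illustration, that a 4x4 block is skipped only when every fragment in it fails the Z test and that otherwise the whole block runs the shader; real hardware behavior may differ, and `shaded_fragments` is a hypothetical helper.

```python
def shaded_fragments(mask, block=4):
    """Count fragments that run the shader under block-granular Z-kill.

    mask[y][x] is True where the conditional holds (shader needed).
    Assumption: a block is culled only if all of its fragments are False.
    """
    height, width = len(mask), len(mask[0])
    shaded = 0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            tile = [mask[y][x]
                    for y in range(by, min(by + block, height))
                    for x in range(bx, min(bx + block, width))]
            if any(tile):
                shaded += len(tile)     # the whole block executes
    return shaded

if __name__ == "__main__":
    coherent = [[x < 4 for x in range(8)] for y in range(8)]            # left half active
    checker = [[(x + y) % 2 == 0 for x in range(8)] for y in range(8)]  # no locality
    print(shaded_fragments(coherent), shaded_fragments(checker))
```

Both masks need the shader on 32 of 64 fragments, but under this model only the coherent mask lets the Z test skip any work.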

14
Optimizing Execution
  • Two methods for GPGPU shader execution

glBegin(GL_QUADS);
glVertex2f(left, bottom);
glVertex2f(right, bottom);
glVertex2f(right, top);
glVertex2f(left, top);
glEnd();

glViewport(0, 0, width, height);
glBegin(GL_TRIANGLES);
glVertex2f(0, 0);
glVertex2f(width*2, 0);
glVertex2f(0, height*2);
glEnd();
Single triangle: faster, higher observed bandwidth
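
A quick check of why the triangle's vertices sit at twice the width and height: the sketch below tests that every pixel center of the viewport lies inside the triangle (0,0), (2*width, 0), (0, 2*height). The helper names are assumptions for illustration, and the test is plain geometry, not a model of the rasterizer.

```python
def inside_big_triangle(x, y, width, height):
    """True if (x, y) lies in the triangle (0,0), (2*width,0), (0,2*height)."""
    return x >= 0 and y >= 0 and x / (2.0 * width) + y / (2.0 * height) <= 1.0

def covers_viewport(width, height):
    """Check every pixel center of a width x height viewport."""
    return all(inside_big_triangle(x + 0.5, y + 0.5, width, height)
               for y in range(height) for x in range(width))

if __name__ == "__main__":
    print(covers_viewport(4, 3))
    print(inside_big_triangle(4, 3, 4, 3))   # far viewport corner lies on the hypotenuse
```

The far corner of the viewport falls exactly on the hypotenuse, so a factor of two is the smallest scale that still covers the whole rectangle; everything outside the viewport is clipped away.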
15
Performance Issues
  • Peak GFLOPS

16
Performance Issues
  • NV3x Register Penalty
  • The more registers used in a shader, the slower a
    shader executes
  • 3-4 registers: 2x slower
  • 5-6 registers: 3x slower
  • 7-8 registers: 4x slower
  • 9-12 registers: 6x slower
  • 13-16 registers: 8x slower
  • 17-24 registers: 12x slower
  • 25-32 registers: 16x slower
  • Compiler / driver will try to minimize register
    usage.
  • General rule: the more state in your program, the
    slower the execution
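
The penalty table above can be encoded as a lookup for rough estimates. This is a sketch under stated assumptions: the factor of 1 for one or two registers is not on the slide and is assumed here as the baseline, and `nv3x_slowdown` is a hypothetical helper name.

```python
# (largest register count in the band, slowdown factor), transcribed from the slide.
NV3X_REGISTER_PENALTY = [(4, 2), (6, 3), (8, 4), (12, 6), (16, 8), (24, 12), (32, 16)]

def nv3x_slowdown(registers):
    """Slowdown factor for a shader using `registers` registers on NV3x."""
    if registers < 1 or registers > 32:
        raise ValueError("the slide lists penalties only up to 32 registers")
    if registers <= 2:
        return 1                        # assumption: baseline, not stated on the slide
    for max_registers, factor in NV3X_REGISTER_PENALTY:
        if registers <= max_registers:
            return factor

if __name__ == "__main__":
    for count in (2, 5, 12, 32):
        print(count, nv3x_slowdown(count))
```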

17
Performance Issues
  • Floating Point Texture Bandwidth
  • Observed Results
  • GeForce 5900 Ultra
    • Cache: 11.08 GB/sec
    • Sequential: 4.40 GB/sec
    • Random: 0.76 GB/sec
  • ATI 9800 XT (24-bit)
    • Cache: 9.15 GB/sec
    • Sequential: 5.55 GB/sec
    • Random: 1.80 GB/sec
  • Big Penalty for Random Access!
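
To put a number on the penalty, the sketch below divides the observed cache or sequential rate by the random-access rate for each card. The figures are the ones listed on this slide; the dictionary layout and the function name are assumptions for illustration.

```python
# Observed floating point texture bandwidth in GB/sec, as listed on the slide.
OBSERVED_GB_PER_SEC = {
    "GeForce 5900 Ultra": {"cache": 11.08, "sequential": 4.40, "random": 0.76},
    "ATI 9800 XT (24-bit)": {"cache": 9.15, "sequential": 5.55, "random": 1.80},
}

def random_access_penalty(card, baseline="cache"):
    """How many times slower random access is than the chosen baseline pattern."""
    rates = OBSERVED_GB_PER_SEC[card]
    return rates[baseline] / rates["random"]

if __name__ == "__main__":
    for card in OBSERVED_GB_PER_SEC:
        print(card,
              round(random_access_penalty(card, "cache"), 1),
              round(random_access_penalty(card, "sequential"), 1))
```

With these figures, random access is roughly 14.6 times slower than cached access on the GeForce 5900 Ultra and roughly 5.1 times slower on the ATI 9800 XT.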

18
Performance Issues
  • WinXP Float4 Download and Readback
  • NVIDIA
    • 1215 MB/sec texture download
    • 221 MB/sec glReadPixels rate
  • ATI
    • 926 MB/sec texture download
    • 180 MB/sec glReadPixels rate
  • Readback should be faster!
    • 680 MB/sec ATI Linux Readback
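
These rates translate directly into per-pass transfer costs. The sketch below estimates the time to move one float4 texture at a given rate; it assumes 16 bytes per float4 texel and 1 MB = 10^6 bytes (the slide does not say which megabyte it uses), and `transfer_ms` is a hypothetical helper.

```python
BYTES_PER_FLOAT4 = 16   # four 32-bit floats per texel

def transfer_ms(width, height, mb_per_sec):
    """Milliseconds to move a width x height float4 texture at mb_per_sec.

    Assumption: 1 MB = 10**6 bytes.
    """
    total_bytes = width * height * BYTES_PER_FLOAT4
    return 1000.0 * total_bytes / (mb_per_sec * 1e6)

if __name__ == "__main__":
    # 1024x1024 float4 texture at the NVIDIA WinXP rates from the slide.
    print(round(transfer_ms(1024, 1024, 1215), 1), "ms download")
    print(round(transfer_ms(1024, 1024, 221), 1), "ms readback")
```

At the listed NVIDIA rates, reading back a 1024x1024 float4 texture takes about five and a half times as long as downloading it, so keeping intermediate results on the GPU is usually worth the effort.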