CMSC 635 - PowerPoint PPT Presentation

Provided by: MarcOl160

Learn more at: https://redirect.cs.umbc.edu

1
CMSC 635

Graphics Hardware
(Josh Barczak)

2
A Graphics Pipeline
Vertex: Transform, Shade
Triangle: Clip, Project, Rasterize
Fragment: Interpolate, Texture, Z-buffer
3
Fragment vs. Pixel

OpenGL terminology
Pixel: on-screen RGBA + Z
Fragment: proto-pixel
RGBA, Z, texture coordinates
Multiple fragments per pixel
Depth complexity
Supersamples

4
Computation and Bandwidth
Based on:
100 Mtri/sec (1.6M triangles per frame at 60 Hz)
256 B vertex data
128 B interpolated
68 B fragment output
5x depth complexity
16 4-byte textures
223 ops/vert
1664 ops/frag
No caching
No compression
[Diagram: Vertex stage 67 GFLOPS, 75 GB/s in; Triangle stage 13 GB/s; Fragment stage 1.1 TFLOPS, 335 GB/s texture, 45 GB/s fragment output]
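The per-stage figures on this slide can be roughly reproduced from the stated per-element costs. The sketch below is a back-of-the-envelope check, not the author's derivation. It assumes three unshared vertices per triangle, decimal gigabytes, and that the 128 B of interpolated data is counted once per triangle; none of these are stated on the slide. All function and struct names are hypothetical.

```cpp
#include <cassert>

// Per-element costs copied from the slide.
struct Workload {
    double tris_per_sec   = 100e6;   // 100 Mtri/sec
    double vertex_bytes   = 256.0;   // vertex data
    double interp_bytes   = 128.0;   // interpolated data
    double frag_out_bytes = 68.0;    // fragment output
    double ops_per_vertex = 223.0;
    double ops_per_frag   = 1664.0;
};

// ASSUMPTION: three unshared vertices per triangle (not stated on the slide).
inline double vertices_per_sec(const Workload& w) { return 3.0 * w.tris_per_sec; }

// Vertex-stage compute in GFLOPS.
inline double vertex_gflops(const Workload& w) {
    return vertices_per_sec(w) * w.ops_per_vertex / 1e9;
}

// Vertex-stage input bandwidth in GB/s (decimal GB).
inline double vertex_gb_per_sec(const Workload& w) {
    return vertices_per_sec(w) * w.vertex_bytes / 1e9;
}

// ASSUMPTION: interpolated data is counted once per triangle.
inline double triangle_gb_per_sec(const Workload& w) {
    return w.tris_per_sec * w.interp_bytes / 1e9;
}

// Fragment rate implied by a fragment-shading budget given in TFLOPS.
inline double frags_per_sec(const Workload& w, double tflops) {
    assert(w.ops_per_frag > 0.0);
    return tflops * 1e12 / w.ops_per_frag;
}

// Fragment output bandwidth in GB/s for that fragment rate.
inline double frag_out_gb_per_sec(const Workload& w, double tflops) {
    return frags_per_sec(w, tflops) * w.frag_out_bytes / 1e9;
}
```

Under these assumptions the vertex stage comes out at about 66.9 GFLOPS and 76.8 GB/s, the triangle stage at 12.8 GB/s, and a 1.1 TFLOPS fragment budget implies roughly 45 GB/s of fragment output, which is close to the slide's rounded values.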
5
Data Parallel
[Diagram: Distribute -> four parallel Tasks -> Merge]
6
Sort First
[Diagram: Objects -> distribute objects by screen tile -> parallel Vertex / Triangle / Fragment pipelines, each handling some pixels and some objects -> Screen]
7
Sort Middle
[Diagram: Objects -> distribute objects or vertices -> parallel Vertex units, each handling some objects -> merge and redistribute by screen location -> parallel Triangle / Fragment units, each handling some pixels and some objects -> Screen]
8
Screen Subdivision
Tiled
Interleaved
Scan-Line Interleaved
Column Interleaved
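The four subdivision schemes differ only in how a pixel coordinate is mapped to the processor that owns it. The sketch below shows one plausible mapping for each. The function names, the row-major processor numbering, and the reading of "interleaved" as small tiles dealt out in a repeating 2D pattern are assumptions, not taken from the slides.

```cpp
#include <cassert>

// Tiled: the screen is cut into large contiguous tiles, one per processor.
// Processors are numbered row-major across the tile grid (assumption).
inline int tiled_owner(int x, int y, int tile_w, int tile_h, int tiles_per_row) {
    assert(tile_w > 0 && tile_h > 0 && tiles_per_row > 0);
    return (y / tile_h) * tiles_per_row + (x / tile_w);
}

// Interleaved: small tiles are dealt out in a repeating grid_w x grid_h
// pattern, so every processor owns tiles spread over the whole screen.
inline int interleaved_owner(int x, int y, int tile_w, int tile_h,
                             int grid_w, int grid_h) {
    assert(tile_w > 0 && tile_h > 0 && grid_w > 0 && grid_h > 0);
    return ((y / tile_h) % grid_h) * grid_w + ((x / tile_w) % grid_w);
}

// Scan-line interleaved: consecutive rows go to consecutive processors.
inline int scanline_owner(int y, int num_procs) {
    assert(num_procs > 0);
    return y % num_procs;
}

// Column interleaved: consecutive columns go to consecutive processors.
inline int column_owner(int x, int num_procs) {
    assert(num_procs > 0);
    return x % num_procs;
}
```

Interleaved schemes tend to balance load better when geometry clusters in one screen region, at the cost of each triangle touching more processors.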
9
Sort Last
[Diagram: Objects -> distribute by object -> parallel Vertex / Triangle / Fragment pipelines, each rendering some objects over the full screen -> Z-merge -> Screen]
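In a sort-last design each processor produces a full-screen image of its own objects, and the Z-merge step keeps the nearest sample at every pixel. A minimal sketch of that merge follows. The `Sample` layout, the `z_merge` name, and the convention that a smaller depth value is nearer are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One full-screen sample: depth plus packed color (hypothetical layout).
struct Sample {
    float z;
    std::uint32_t rgba;
};
using Image = std::vector<Sample>;

// Merge per-processor full-screen images, keeping the nearest sample
// at each pixel. ASSUMPTION: smaller z means closer to the viewer.
inline Image z_merge(const std::vector<Image>& partials) {
    assert(!partials.empty());
    Image out = partials[0];
    for (std::size_t p = 1; p < partials.size(); ++p) {
        assert(partials[p].size() == out.size());
        for (std::size_t i = 0; i < out.size(); ++i) {
            if (partials[p][i].z < out[i].z) {
                out[i] = partials[p][i];
            }
        }
    }
    return out;
}
```

The merge touches every pixel of every partial image, which is why sort-last compositing bandwidth grows with both screen size and processor count.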
10
Graphics Processing Unit (GPU)

Sort Middle(ish)
Fixed-function HW for clip/cull, raster, texturing, Z-test
Programmable stages
Commands in, pixels out

11
GPU computation
12
Architecture Latency

CPU: Make one thread go very fast
Avoid the stalls
Branch prediction
Out-of-order execution
Memory prefetch
Big caches

GPU: Make 1000 threads go very fast
Hide the stalls
HW thread scheduler
Swap threads to hide stalls

13
Architecture (MIMD vs SIMD)
[Diagram: a spectrum of designs from MIMD (CPU-like, one CTRL unit per ALU) to SIMD (GPU-like, one CTRL unit driving many ALUs), with axes labeled Flexibility, Horsepower, and Ease of Use]
14
SIMD Branching

if( x )   // mask threads
    // issue instructions
else      // invert mask
    // issue instructions
// unmask

Threads agree, issue if
Threads agree, issue else
Threads disagree, issue if AND else
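The three cases above can be illustrated with a small CPU-side emulation of masked execution. This is a sketch only: the 4-lane width, the arithmetic on each side of the branch, and all names are hypothetical, and real hardware tracks the mask in the instruction sequencer rather than in a loop.

```cpp
#include <array>
#include <cassert>

constexpr int kLanes = 4;  // hypothetical SIMD width

struct BranchResult {
    std::array<int, kLanes> out;
    int passes;  // how many sides of the branch were issued
};

// Emulates: if (cond) out = in * 2; else out = in + 100;
inline BranchResult simd_if_else(const std::array<bool, kLanes>& cond,
                                 const std::array<int, kLanes>& in) {
    BranchResult r;
    r.out = in;
    r.passes = 0;

    bool any_true = false;
    bool any_false = false;
    for (bool c : cond) {
        if (c) any_true = true; else any_false = true;
    }

    if (any_true) {   // mask threads, issue the "if" side
        ++r.passes;
        for (int i = 0; i < kLanes; ++i) {
            if (cond[i]) r.out[i] = in[i] * 2;
        }
    }
    if (any_false) {  // invert mask, issue the "else" side
        ++r.passes;
        for (int i = 0; i < kLanes; ++i) {
            if (!cond[i]) r.out[i] = in[i] + 100;
        }
    }
    return r;         // unmask
}
```

When every lane agrees only one side is issued; a single disagreeing lane forces the whole group to pay for both sides.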
15
SIMD Looping
[Diagram: per-thread loop iterations, marked Useful vs. Useless]

while(x)  // update mask
    // do stuff
They all run until the last one's done.
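The cost of a divergent loop can be estimated by comparing the lane-iterations the hardware issues with the ones that do real work. The sketch below assumes a simple model in which the group keeps issuing until the longest-running lane exits; the `LoopCost` struct and function name are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct LoopCost {
    int issued;  // lane-iterations the SIMD group pays for
    int useful;  // lane-iterations that did real work
};

// trip_counts[i] is how many iterations lane i actually needs.
inline LoopCost simd_loop_cost(const std::vector<int>& trip_counts) {
    assert(!trip_counts.empty());
    int longest = 0;
    int useful = 0;
    for (int t : trip_counts) {
        assert(t >= 0);
        longest = std::max(longest, t);
        useful += t;
    }
    // Every lane occupies a slot until the longest lane finishes.
    return {longest * static_cast<int>(trip_counts.size()), useful};
}
```

For example, four lanes needing 1, 1, 1, and 8 iterations issue 32 lane-iterations, of which only 11 are useful.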

16
NVIDIA GeForce 6
Kilgariff and Fernando, GPU Gems 2
17
AMD/ATI R600
Tom's Hardware
18
Dispatch
19
SIMD Units
2x2 Quads (4 per SIMD)
20 ALU/Quad (5 per thread)
Wavefront of 64 threads, executed over 8 clocks
2 waves interleaved
Interleaving + multi-cycling hides ALU latency.
Wavefront switching hides memory latency.
GPR usage determines wavefront count.
General Purpose Registers: 4x32-bit (THOUSANDS of them)
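The statement that GPR usage determines wavefront count can be made concrete with a simple occupancy estimate: the register file is divided among resident wavefronts, so a shader that uses more registers per thread leaves room for fewer wavefronts to switch between. The register file size and hardware cap used below are placeholder values, not R600 specifications, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Estimate how many wavefronts can be resident on one SIMD.
// regfile_gprs:    total GPRs in the SIMD's register file (placeholder value)
// gprs_per_thread: GPRs the compiled shader needs per thread
// wave_size:       threads per wavefront (64 on the slide)
// hw_max_waves:    hypothetical hardware cap on resident wavefronts
inline int resident_wavefronts(int regfile_gprs, int gprs_per_thread,
                               int wave_size = 64, int hw_max_waves = 32) {
    assert(regfile_gprs > 0 && gprs_per_thread > 0);
    assert(wave_size > 0 && hw_max_waves > 0);
    int by_registers = regfile_gprs / (gprs_per_thread * wave_size);
    return std::min(by_registers, hw_max_waves);
}
```

With a hypothetical 16384-entry register file, a shader using 8 GPRs per thread fits 32 wavefronts, while one using 32 GPRs fits only 8, leaving less work available to hide memory latency.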
20
Texture
21
DEMO!

R600 Instruction Set
Brought to you by GPU ShaderAnalyzer
http://developer.amd.com/gpu/shader/pages/default.aspx

22
NVIDIA G80
NVIDIA 8800 Architectural Overview, NVIDIA
TB-02787-001_v01, November 2006
23
Streaming Processors
24
CUDA
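// Naive double-buffered exclusive scan. Launch one block of n threads
// with 2 * n * sizeof(float) bytes of dynamic shared memory.
__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;         // unique thread ID
    int pout = 0, pin = 1;          // ping-pong input/output buffers

    // load input into shared memory, shifted right by one (exclusive scan)
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;            // swap double buffer indices
        pin  = 1 - pout;
        if (thid >= offset)
            temp[pout*n + thid] = temp[pin*n + thid] + temp[pin*n + thid - offset];
        else
            temp[pout*n + thid] = temp[pin*n + thid];
        __syncthreads();
    }

    g_odata[thid] = temp[pout*n + thid]; // write output
}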
Harris, Parallel Prefix Sum (Scan) with CUDA, NVIDIA White Paper, April 2007
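A host-side reference is useful for checking a scan kernel's output. The sketch below gives a sequential exclusive scan and a CPU emulation of the double-buffered step pattern (read from one buffer, write to the other, double the offset each pass). Both function names are hypothetical, and the emulation runs the per-thread work in an ordinary loop rather than in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential exclusive scan: out[i] = in[0] + ... + in[i-1], out[0] = 0.
inline std::vector<float> exclusive_scan_reference(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// CPU emulation of the double-buffered naive scan: each pass reads buffer
// `pin` and writes buffer `pout`, then the two swap roles.
inline std::vector<float> naive_scan_emulated(const std::vector<float>& in) {
    const std::size_t n = in.size();
    std::vector<float> buf[2] = {std::vector<float>(n), std::vector<float>(n)};
    int pout = 0;
    int pin = 1;
    // Load shifted right by one so the result is an exclusive scan.
    for (std::size_t t = 0; t < n; ++t) {
        buf[pout][t] = (t > 0) ? in[t - 1] : 0.0f;
    }
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;
        pin = 1 - pout;
        for (std::size_t t = 0; t < n; ++t) {  // one iteration per "thread"
            buf[pout][t] = (t >= offset) ? buf[pin][t] + buf[pin][t - offset]
                                         : buf[pin][t];
        }
    }
    return buf[pout];
}
```

The naive pattern performs on the order of n log n additions against n for the sequential loop, trading extra work for log n parallel steps.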
25
NVIDIA Fermi
Beyond3D, NVIDIA Fermi GPU and Architecture Analysis, 2010
26
NVIDIA Fermi SM
NVIDIA, NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009
27
GPU Performance Tips
28
Graphics System Architecture
[Diagram: Your Code -> API -> Driver -> GPU(s) -> Display. Your code produces commands and the GPU consumes them: the current frame is still buffering commands while previous frame(s) are submitted and pending execution.]
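The producer/consumer relationship between the application and the GPU can be explored with a toy timing model. The sketch below is an assumption-laden simplification: fixed CPU and GPU costs per frame, and a hypothetical `max_in_flight` limit meaning the CPU may not start frame i until the GPU has finished frame i - max_in_flight. Real drivers decide their own queue depth.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the time (ms) at which the GPU finishes the last frame.
inline double pipeline_finish_ms(int frames, double cpu_ms, double gpu_ms,
                                 int max_in_flight) {
    assert(frames > 0 && max_in_flight > 0);
    std::vector<double> cpu_done(frames);
    std::vector<double> gpu_done(frames);
    for (int i = 0; i < frames; ++i) {
        // The CPU starts once it has finished the previous frame...
        double cpu_start = (i > 0) ? cpu_done[i - 1] : 0.0;
        // ...and once the queue has room for another frame.
        if (i >= max_in_flight) {
            cpu_start = std::max(cpu_start, gpu_done[i - max_in_flight]);
        }
        cpu_done[i] = cpu_start + cpu_ms;

        // The GPU starts once commands are submitted and it is idle.
        double gpu_prev = (i > 0) ? gpu_done[i - 1] : 0.0;
        gpu_done[i] = std::max(cpu_done[i], gpu_prev) + gpu_ms;
    }
    return gpu_done[frames - 1];
}
```

With 2 ms of CPU work and 5 ms of GPU work per frame, three frames finish at 17 ms when two frames may be in flight, but at 21 ms when only one is allowed, because the CPU and GPU then take turns instead of overlapping.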
29
GPU Performance Tips