NVIDIA GeForce - PowerPoint PPT Presentation

Slides: 82

Provided by: csVirgin62

Learn more at: https://www.cs.virginia.edu

Transcript and Presenter's Notes

1
NVIDIA GeForce

Ryan Hendrixson
Ryan Schubert
Allison Walthall

2
What Does a GPU Actually Do?

Historically, the GPU evolved from
Acting simply as a frame buffer
To doing vertex transformations and pixel color
calculations
Now it is even programmable
In the simplest sense, a modern GPU implements a
3D rendering pipeline

3
3D Rendering Pipeline (direct illumination)
This is a pipelined sequence of operations to
draw a 3D primitive into a 2D image
3D Geometric Primitives
Modeling Transformation
Lighting
Viewing Transformation
Projection Transformation
Clipping
Scan Conversion
Image
4
3D Rendering Pipeline (direct illumination)
3D Geometric Primitives
Modeling Transformation
Transform into 3D world coordinate system
Lighting
Viewing Transformation
Projection Transformation
Clipping
Scan Conversion
Image
5
3D Rendering Pipeline (direct illumination)
3D Geometric Primitives
Modeling Transformation
Transform into 3D world coordinate system
Lighting
Illuminate according to lighting and reflectance
Viewing Transformation
Projection Transformation
Clipping
Scan Conversion
Image
6
3D Rendering Pipeline (direct illumination)
3D Geometric Primitives
Modeling Transformation
Transform into 3D world coordinate system
Lighting
Illuminate according to lighting and reflectance
Viewing Transformation
Transform into 3D camera coordinate system
Projection Transformation
Clipping
Scan Conversion
Image
7
3D Rendering Pipeline (direct illumination)
3D Geometric Primitives
Modeling Transformation
Transform into 3D world coordinate system
Lighting
Illuminate according to lighting and reflectance
Viewing Transformation
Transform into 3D camera coordinate system
Projection Transformation
Transform into 2D screen coordinate system
Clipping
Scan Conversion
Image
8
3D Rendering Pipeline (direct illumination)
3D Geometric Primitives
Modeling Transformation
Transform into 3D world coordinate system
Lighting
Illuminate according to lighting and reflectance
Viewing Transformation
Transform into 3D camera coordinate system
Projection Transformation
Transform into 2D screen coordinate system
Clipping
Clip primitives outside camera's view
Scan Conversion
Image
9
3D Rendering Pipeline (direct illumination)
3D Geometric Primitives
Modeling Transformation
Transform into 3D world coordinate system
Lighting
Illuminate according to lighting and reflectance
Viewing Transformation
Transform into 3D camera coordinate system
Projection Transformation
Transform into 2D screen coordinate system
Clipping
Clip primitives outside camera's view
Scan Conversion
Draw pixels
Image
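
The transformation stages above can be sketched in a few lines. This is a minimal illustration, not how any particular GPU implements them: the helper names (`mat_vec`, `translation`, `perspective`, `to_screen`) and the simple pinhole projection are assumptions made for the example, lighting is left out, and clipping is reduced to rejecting a single vertex rather than clipping whole primitives.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def translation(tx, ty, tz):
    """Homogeneous translation matrix (used here for model and view)."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


def perspective(d):
    """Pinhole projection: camera looks down -z, image plane at distance d."""
    return [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, -1.0 / d, 0]]


def to_screen(vertex, model, view, proj, width, height):
    """Run one vertex through modeling, viewing and projection transforms."""
    world = mat_vec(model, vertex)    # object space -> world space
    camera = mat_vec(view, world)     # world space  -> camera space
    clip = mat_vec(proj, camera)      # camera space -> clip space
    if clip[3] <= 0:                  # behind the camera: rejected
        return None
    x = clip[0] / clip[3]             # perspective divide
    y = clip[1] / clip[3]
    if abs(x) > 1 or abs(y) > 1:      # outside the camera's view: rejected
        return None
    # Map [-1, 1] to pixel coordinates (y grows downward on screen).
    return ((x + 1) * 0.5 * width, (1 - y) * 0.5 * height)


# Object-space origin, moved 1 unit right, seen by a camera 4 units back.
print(to_screen([0, 0, 0, 1], translation(1, 0, 0),
                translation(0, 0, -4), perspective(1.0), 640, 480))
```

Scan conversion would then fill in the pixels between the projected vertices of each primitive.
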
10
Modern OpenGL Pipeline
Graphics State
CPU: Application
GPU: Vertex Processor, Assembly & Rasterization,
Pixel Processor, Video Memory (Textures)
Data flow: Vertices (3D) -> Vertex Processor ->
Transformed, Lit Vertices (2D) -> Assembly &
Rasterization -> Fragments (pre-pixels) -> Pixel
Processor -> Final pixels (Color, Depth) -> Video
Memory (Textures)
Render-to-texture

Programmable Vertex Processor
Programmable Fragment (Pixel) Processor

11
OpenGL vs. DirectX

OpenGL
Just graphics
Standard C interfaces
State machine
Multiple platforms
Academic use

DirectX
Graphics, multimedia, etc.
C++ interfaces
Object oriented
Windows
PC games

12
Possible GPU Performance Bottlenecks

CPU/Bus Bound
Simply not able to send enough vertices to the
card to keep it busy
Vertex Bound
Vertex processing engine is fully loaded, while
the fragment engine is just waiting and grabbing
data as soon as it's ready
Pixel Bound
The fragment engine is fully loaded, causing the
vertex engine to have to wait before sending more
data
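
One way to reason about these three cases: because the stages run as a pipeline, the slowest stage sets the pace and the others wait. The sketch below assumes per-stage times have been measured somehow; the function name and the numbers are hypothetical.

```python
def find_bottleneck(stage_times_ms):
    """Return the name of the slowest stage.

    In a pipeline the stages overlap, so the time per frame is roughly
    the time of the slowest stage, and that stage is the bottleneck.
    """
    return max(stage_times_ms, key=stage_times_ms.get)


# Hypothetical per-frame timings in milliseconds.
frame = {"cpu_bus": 4.0, "vertex": 2.5, "pixel": 9.0}
print(find_bottleneck(frame))  # pixel
```
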

13
Early History

NVIDIA founded in 1993
1997 RIVA
1998 RIVA TNT
1999 GeForce 256 (NV10)

14
GeForce 256 (NV10)

Hardware transform and lighting (T&L)
DDR and SDR
HDTV compliant
Hardware alpha-blending
4 pixel pipelines at 120 MHz
Fill Rate 480 Megapixels/second
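
The fill rate follows directly from the two numbers before it. A small sketch of the arithmetic, assuming each pipeline writes one pixel per clock (the function name is made up for the example):

```python
def pixel_fill_rate_mpixels(pipelines, clock_mhz, pixels_per_pipe_per_clock=1):
    """Peak pixel fill rate in megapixels per second."""
    return pipelines * clock_mhz * pixels_per_pipe_per_clock


# GeForce 256: 4 pixel pipelines at 120 MHz.
print(pixel_fill_rate_mpixels(4, 120))  # 480
```
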

15
GeForce2

2000 GeForce 2 GTS
Doubled the pixel fill rate
Quadrupled the texel fill rate
Increased clock speed
Multi-texturing
S3TC, MPEG-2, FSAA

16
Anti-Aliasing

Without Anti-Aliasing

With Anti-Aliasing

17
GeForce2

2000 GeForce 2 MX
Cut the number of pixel pipelines in half, making
it cost effective
TwinView
Compatible with Macs

18
GeForce2

Jan 2001 Apple selected GeForce2 MX as default
high-end graphics solution for Power Mac G4
August 2000 GeForce2 Ultra
November 2000 GeForce2 Go
December 2000 NVIDIA buys 3dfx

19
GeForce3

2001 GeForce3 (NV20)
240 MHz Core/500 MHz Memory
57 million transistors
46-76 Gigaflops
Vertex shader technology
Pixel shader technology
LightSpeed Memory architecture

21
LightSpeed Memory Architecture
22
GeForce4

2002 GeForce4 Ti (NV25) and MX (NV17)
Ti
4200, 4400, 4600, and 4800 versions
63 million transistors
Chip clock 225-300 MHz
Memory Clock 500-650 MHz
75-100 million vertices/second

23
GeForce FX

November 2002 GeForce FX (NV30)
16 variations for different price ranges
125 million transistors
8 pixels/clock
1 TMU/pipe (16 textures/unit)
128 bit memory interface
128 MB/256 MB Memory size support

24
GeForce 6 series

GeForce 6 series (NV40)
6200; 6600 GT and Ultra; 6800 GT, Ultra, and
Ultra Extreme
Core clock speed 450 MHz
Memory clock speed 600 MHz
Vertex shader units: 6 four-wide FP32 vector
MADDs per clock cycle
Pixel shader units: 16 four-wide FP32 vector
MADDs per clock cycle
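
As a rough sanity check, the MADD counts and core clock above can be turned into a peak rate. The sketch assumes each multiply-add counts as two floating-point operations and that every unit issues one 4-wide MADD per clock; the function name is invented for the example. It counts only these vector MADDs, so it is not expected to reproduce the larger headline Gigaflops figure quoted for this series, whose counting method the slides do not give.

```python
def peak_madd_gflops(units, lanes, clock_mhz):
    """Peak GFLOPS from vector MADDs alone (1 MADD = multiply + add = 2 flops)."""
    return units * lanes * 2 * clock_mhz / 1000.0


vertex = peak_madd_gflops(units=6, lanes=4, clock_mhz=450)
pixel = peak_madd_gflops(units=16, lanes=4, clock_mhz=450)
print(vertex, pixel)
```
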

25
GeForce 6 series

Superscalar 16-pipe architecture
CineFX 3.0 engine
All operations done in FP32 precision per
component
200 Gigaflops (compare this to the Itanium's 6.4
Gigaflops)

26
General Diagram (6800/NV40)
27
TurboCache

Uses PCI-Express bandwidth to render directly to
system memory
Card needs less memory
Performance boost while lowering cost
TurboCache Manager dynamically allocates from
main memory
Local memory used to cache data and to deliver
peak performance when needed

28
TurboCache
29
NV40 Vertex Processor
An NV40 vertex processor can execute one vector
operation (up to four FP32 components), one
scalar FP32 operation, and one texture access
per clock cycle
30
NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer
checks; the resulting sets of 4 pixels (quads)
are passed on to the fragment units
31
Programmable 2D and Video Processor

Can be used for video decoding and coding (IDCT,
deinterlacing, color model transformations, etc.)

32
Why NV40 series was better

Massive parallelism
Scalability
Lower end products have fewer pixel pipes and
fewer vertex shader units
Computation Power
222 million transistors
First to comply with Microsoft's DirectX 9.0c spec
(Shader Model 3.0)
Dynamic Branching in pixel shaders

33
Dynamic Branching

Helps detect if pixel needs shading
Instruction flow handled in groups of pixels
Specify branch granularity (the number of
consecutive pixels that take the same branch)
Better distribution of blocks of pixels between
the different quad engines
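
The effect of branch granularity can be illustrated with a toy cost model. It assumes that a block whose pixels disagree on the branch pays for both paths on every pixel, which is a common way to model group-wise instruction flow but is not confirmed by the slides for NV40; the function name and the costs are hypothetical.

```python
def branch_cost(needs_shading, granularity, cheap=1, expensive=10):
    """Total cost of shading a run of pixels in blocks of `granularity`.

    A block runs the expensive path if any pixel needs it and the cheap
    path if any pixel skips it; a mixed block therefore pays for both.
    """
    total = 0
    for start in range(0, len(needs_shading), granularity):
        block = needs_shading[start:start + granularity]
        per_pixel = 0
        if any(block):
            per_pixel += expensive
        if not all(block):
            per_pixel += cheap
        total += per_pixel * len(block)
    return total


# 4 pixels need full shading, the next 12 do not.
mask = [True] * 4 + [False] * 12
print(branch_cost(mask, granularity=4))   # 52
print(branch_cost(mask, granularity=16))  # 176
```

With fine granularity only the first block pays the expensive path; with one coarse block every pixel pays for both paths, which is worse than not branching at all.
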

34
Dynamic Branching
35
GeForce 7 series

7800 GT
Price: $449
7 vertex units
20 pixel pipelines
Clock speed 400 MHz
Memory clock speed 500 MHz

7800 GTX
Price: $600
8 vertex units
24 pixel pipelines
Clock speed 430 MHz
Memory clock speed 600 MHz

36
GeForce 7800

302 million transistors
200 Gigaflops of multiply/add calculations per
second
128-bit floating point precision through the
entire rendering pipeline
Fill rate 10.3 Gigatexels/second (24 pixel
pipelines x 430 MHz)
860 million vertices/sec

37
GeForce 7800
40
ALU Units in Pixel Processor

Sub-unit 1
NV40: handles texture data and can issue a MUL
vector instruction or use its mini-ALU to issue a
non-vector instruction
G70: same, but can also issue a multiply/add
Sub-unit 2
NV40: can issue a multiply/add vector instruction
or use its own mini-ALU to issue a non-vector
instruction
G70: same

41
GeForce 6 vs. GeForce 7

ALU Units
G70: 24 ALU units
NV40: 16 ALU units
Register file is the same size
Texture samplers are the same, but when fetching
large textures in preparation for filtering,
G70's samplers have less latency pulling those
textures out of memory

42
GeForce 6 vs. GeForce 7 (speculative)

Increased L2 texture cache (to around 12KB)
Better cache re-use with larger textures,
decompressing those larger textures into L1
faster
Possibly offering more granularity in cache
access by the GPU, to reduce texture bandwidth,
speeding up rendering.

43
GeForce 6 vs. GeForce 7

33% more vertex units (8 vs. 6), each with higher
performance
Improved vertex fetch unit (unconfirmed by
Nvidia)
Triangle setup and rasteriser optimized via the
use of a new raster pattern (again unconfirmed by
Nvidia)

44
General Diagram (7800/G70)
45
32-bit IEEE floating-point throughout pipeline
(NV40)

Framebuffer
Textures
Fragment processor
Vertex processor
Interpolants
GeForce 7800 (G70) supports 128-bit precision
(four 32-bit components) through the entire
pipeline!

46
Hardware supports several other data types

Fragment processor also supports
16-bit half floating point
12-bit fixed point
These may be faster than 32-bit on some HW
Framebuffer/textures also support
Large variety of fixed-point formats
E.g., classical 8-bit per component
These formats use less memory bandwidth than FP32
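
The bandwidth point can be made concrete by comparing bytes per four-component pixel. The sketch uses Python's `struct` format codes as stand-ins for the GPU formats (`B` for 8-bit fixed point, `e` for 16-bit half float, `f` for 32-bit float); the 12-bit fixed-point format has no `struct` equivalent and is omitted. The dictionary and function names are made up for the example.

```python
import struct

# Four components (e.g. RGBA) per pixel, little-endian, no padding.
FORMATS = {
    "8-bit fixed": "<4B",
    "FP16": "<4e",
    "FP32": "<4f",
}


def bytes_per_pixel(name):
    """Size in bytes of one four-component pixel in the given format."""
    return struct.calcsize(FORMATS[name])


def framebuffer_bytes(width, height, name):
    """Memory touched when every pixel of the buffer is written once."""
    return width * height * bytes_per_pixel(name)


for name in FORMATS:
    print(name, bytes_per_pixel(name), framebuffer_bytes(1024, 768, name))
```
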

47
How are current GPUs different from CPU?

GPU is a stream processor
Multiple programmable processing units
Connected by data flows

Application
Vertex Processor
Assembly & Rasterization
Fragment Processor
Framebuffer Operations
Framebuffer
Textures
48
How are current GPUs different from CPU?

Optimized for 4-vector arithmetic
Useful for graphics colors, vectors, texcoords
Easy way to get high performance/cost
SIMD/MIMD
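
A 4-vector operation applies the same instruction to all four components at once, which is why it suits colors (RGBA) and homogeneous coordinates. A minimal sketch of a 4-wide multiply-add, with an invented function name and example values:

```python
def madd4(a, b, c):
    """Component-wise a * b + c on 4-component vectors."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


# Scale an RGBA color and add a constant term in one operation.
color = (0.5, 0.25, 1.0, 1.0)
scale = (2.0, 2.0, 2.0, 1.0)
bias = (0.0, 0.5, 0.0, 0.0)
print(madd4(color, scale, bias))  # (1.0, 1.0, 2.0, 1.0)
```
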

49
GPU Memory Model vs CPUs

Much more restricted memory access
Allocate/free memory only before computation
Limited memory access during computation (kernel)
Registers: read/write
Local memory: does not exist
Global memory:
Read-only during computation
Write-only at end of computation (pre-computed
address)
Disk access: does not exist
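
These restrictions amount to a "gather only" programming model: a kernel may read global memory anywhere, but it produces exactly one result, written to an address fixed before the kernel runs. The sketch below mimics that on the CPU; `run_kernel` and `blur` are hypothetical names, and the tuple stands in for read-only global memory.

```python
def run_kernel(kernel, inputs, output_size):
    """Apply `kernel` once per output element under GPU-like memory rules."""
    source = tuple(inputs)             # global memory: read-only during the run
    out = [None] * output_size         # allocated before computation starts
    for i in range(output_size):
        out[i] = kernel(i, source)     # one write, at the pre-computed address i
    return out


def blur(i, src):
    """3-tap box filter: reads neighbors freely, returns one value."""
    left = src[max(i - 1, 0)]
    right = src[min(i + 1, len(src) - 1)]
    return (left + src[i] + right) / 3.0


print(run_kernel(blur, [3.0, 6.0, 9.0], 3))  # [4.0, 6.0, 8.0]
```
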

50
GPU Memory Model

Where is GPU Data Stored?
Vertex buffer
Frame buffer
Texture

VS 3.0 GPUs
Texture
Vertex Processor
Fragment Processor
Frame Buffer(s)
Vertex Buffer
Rasterizer
51
GPGPU and Motivation

GPUs are fast
Itanium: 6.4 GFLOPS
GeForce 7800: 200 GFLOPS
GPUs are getting faster, faster
CPUs: annual growth ≈ 1.5× → decade growth ≈ 60×
GPUs: annual growth > 2.0× → decade growth > 1000×
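
The decade figures are just the annual factors compounded over ten years, as this short check shows (the function name is made up for the example):

```python
def decade_growth(annual_factor, years=10):
    """Compound an annual growth factor over a number of years."""
    return annual_factor ** years


print(round(decade_growth(1.5), 1))  # 57.7, i.e. roughly 60
print(decade_growth(2.0))            # 1024.0, i.e. more than 1000
```
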

52
Motivation: Computational Power
(Chart: GPU vs. CPU computational power. Courtesy
Naga Govindaraju)
53
GPGPU

Good for inherently parallel applications
Rapidly evolving ISA and HW architecture
Largely secret
Can't simply port code written for the CPU!

54
Programs are Shaders

Bound by the specific hardware profile
E.g. different cards have different supported
hardware, OpenGL has different restrictions than
DirectX, etc
Hardware profiles change relatively drastically
as new GPUs are developed
But typically new profiles only add features, so
there is generally still backwards compatibility
(but not always)

55
Vertex processor

256 instructions per program originally
(effectively higher with branching)
Now up to 65535 instructions
Executes on all vertices
Outputs new vertices or texture coordinates, etc

56
Fragment Processor Flow Chart
57
Fragment processor has flexible texture mapping

Memory is accessible through texture reads
Texture reads are just another instruction
Allows computed texture coordinates, nested to
arbitrary depth
Allows multiple uses of a single texture unit
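
A computed (dependent) texture read means the result of one texture read is used as the coordinate of the next, which is how a fragment program performs indexed memory access. A CPU-side sketch with 1D textures as plain lists; `tex_read`, `fragment`, and the data are hypothetical, and nearest-neighbor sampling with edge clamping is an assumption.

```python
def tex_read(texture, coord):
    """Nearest-neighbor read from a 1D texture, clamped to its edges."""
    i = min(max(int(coord), 0), len(texture) - 1)
    return texture[i]


def fragment(i, index_tex, data_tex):
    """Two-level lookup: the first read computes the second read's coordinate."""
    j = tex_read(index_tex, i)       # ordinary texture read
    return tex_read(data_tex, j)     # dependent texture read


index_tex = [2, 0, 1]
data_tex = [10.0, 20.0, 30.0]
print([fragment(i, index_tex, data_tex) for i in range(3)])  # [30.0, 10.0, 20.0]
```
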

58
Additional fragment processor capabilities

Read access to window-space position
Read/write access to fragment Z
Built-in derivative instructions
Partial derivatives w.r.t. screen-space x or y
Useful for anti-aliasing
Conditional fragment-kill instruction
Multiple FP formats supported

59
Fragment processor limitations

Originally: no branching
Now supports dynamic branching (but it's still
costly)
No indexed reads from registers
Use texture reads instead
No memory writes

60
Branching Instruction Costs(GeForce 6800)
61
Fragment shaders

Originally very limited in size (only 96
instructions), now expanded to 65535
instructions
New cards support dynamic branching (but it still
incurs some performance penalty)
Now have the ability to output to multiple render
targets

62
CineFX 4.0 Engine

A redesigned vertex shader unit reduces the time
to set up and perform geometry processing.
A new pixel shader unit design can carry out
twice as many floating-point operations and
greatly accelerates other mathematical operations
to increase throughput.
An advanced texture unit incorporates new
hardware algorithms and better caching to speed
filtering and blending operations.

63
Vertex Shaders

The 7800 has 8 vertex shaders
The Triangle Setup stage turns the vertex points
into a triangle
It also determines mathematically the
rasterization for each triangle
Accelerating triangle setup increases the total
throughput of the 3D pipeline

64
Theoretical Rasterization Pattern of a Triangle
65
New Pixel Shader MADD

Multiply and accumulate are commonly used math
operations in 3D graphics
MADD stands for multiply-add
The 7800 can do twice as many MADD operations
as previous GPUs could
This allows developers to create much more
complex visual effects

66
Transparency Adaptive Supersampling

Takes extra passes over thin-lined objects such as
chain-link fences or trees to enhance quality
Pixels inside of a polygon are usually not
touched by anti-aliasing methods
With this, a key set is devised, and those pixels
are anti-aliased, creating a smoother image.

67
Transparency Adaptive Supersampling
68
Transparency Adaptive Multisampling

Higher levels of performance, because it uses one
texel to determine other subpixel values
Not as high quality

70
Supporting the Future

The 7800 is already set up to support the new
Microsoft Longhorn OS with some of the following
advancements
Video post-processing
Real-time desktop compositing
Seamless multiple 3D applications
Accelerated antialiased text rendering
Special effects and animation

71
Accelerated Graphics Port (AGP)

AGP is superior to PCI because it provides a
dedicated pathway between the slot and the
processor
Uses sideband addressing
PCI must load a texture from the hard drive into
the system's RAM, then from the RAM into the GPU
framebuffer
AGP can read textures directly from system RAM by
tricking the CPU into believing the textures
are in the framebuffer, when they are really in
system memory

72
PCI Express

Based on the PCI system, allowing for backwards
compatibility
Uses 1-bit, bi-directional lanes (PCI used a bus)
Each lane can support 250 MB/s in each direction
(4 GB/s total for 16 lanes)
AGP is only 2 GB/s
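
The 4 GB/s figure is the per-lane rate multiplied by the lane count. The sketch assumes a 16-lane graphics slot and counts one direction only; the function name is invented, and the AGP value is the 2 GB/s quoted above.

```python
def pcie_bandwidth_mb_s(lanes, per_lane_mb_s=250):
    """Aggregate one-direction bandwidth of a PCI Express link in MB/s."""
    return lanes * per_lane_mb_s


x16 = pcie_bandwidth_mb_s(16)
agp = 2000  # MB/s, as quoted for AGP
print(x16, x16 / agp)  # 4000 2.0
```
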

73
Scalable Link Interface (SLI)

Takes advantage of the PCI express bus, which
will allow more than one discrete graphics device
on the same PCI host
Allows two of the same GeForce GPUs to run on one
machine, thus sharing load.
There are two modes for this
Split-frame Rendering (SFR)
Alternate-frame Rendering (AFR)

75
Split-frame Rendering

Has each GPU render a portion of the screen,
split horizontally
No extra latency
Not necessarily evenly split
SFR is load shared, so it splits up the frame by
the amount of work, not the size
A large amount of overhead is involved, causing a
max speed up of around 1.8 times
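
Splitting by work rather than by size can be sketched as choosing the scanline where the two halves cost about the same. This is only an illustration of the idea: the slides do not say how the driver balances the load, and the per-row cost estimates and the function name are hypothetical.

```python
def split_row(row_costs):
    """Pick row s so GPU 0 renders rows [0, s) and GPU 1 renders rows [s, n).

    s is chosen to minimise the larger of the two workloads.
    """
    total = sum(row_costs)
    best_s, best_worst = 0, total
    running = 0
    for s in range(1, len(row_costs)):
        running += row_costs[s - 1]
        worst = max(running, total - running)
        if worst < best_worst:
            best_s, best_worst = s, worst
    return best_s


# The lower rows of the frame are more expensive (e.g. detailed ground).
costs = [1, 1, 1, 1, 2, 2, 4, 4]
print(split_row(costs))  # 6: top GPU gets 6 rows, bottom GPU gets 2
```

An even split at row 4 would leave one GPU with 12 of the 16 units of work; splitting at row 6 gives each GPU 8.
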

76
Alternate-frame Rendering

Avoids all the overhead problems of SFR
Many buffer swaps
Reliant on the speed of the processor
Can cause latency issues
Recommended mode by NVIDIA

77
GeForce Go 7800 GTX

The mobile version of the 7800 GTX
Everything from the desktop release has been
carried over to this
Can switch between x1 and x16 lanes of PCI
Express
Uses PowerMizer 6.0, which allows this chip to
operate in the same envelope as its predecessor,
the 6800

80
GeForce Go 7800 Power Issues

Power consumption and package are the same as the
6800 Ultra chip, meaning notebook designers do
not have to change very much about their thermal
designs
Dynamic clock scaling can run as slow as 16 MHz
This is true for the engine, memory, and pixel
clocks
Heavier use of clock gating than the desktop
version
Runs at voltages lower than any other mobile
performance part
Regardless, you won't get much battery-based
runtime for a 3D game