NVIDIA GeForce

Slides: 82
Provided by: csVirgin62

1
NVIDIA GeForce
  • Ryan Hendrixson
  • Ryan Schubert
  • Allison Walthall

2
What Does a GPU Actually Do?
  • Historically, GPUs have progressed from acting
    simply as a frame buffer, to doing vertex
    transformations and pixel color calculations,
    to now even being programmable
  • In the simplest sense, a modern GPU implements a
    3D rendering pipeline

3
3D Rendering Pipeline (direct illumination)
This is a pipelined sequence of operations to
draw a 3D primitive into a 2D image
3D Geometric Primitives
Modeling Transformation
Lighting
Viewing Transformation
Projection Transformation
Clipping
Scan Conversion
Image
9
3D Rendering Pipeline (direct illumination)
3D Geometric Primitives
Modeling Transformation: transform into 3D world coordinate system
Lighting: illuminate according to lighting and reflectance
Viewing Transformation: transform into 3D camera coordinate system
Projection Transformation: transform into 2D screen coordinate system
Clipping: clip primitives outside the camera's view
Scan Conversion: draw pixels
Image
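
The stages above can be sketched for a single vertex. This is only a minimal Python illustration, not how any GeForce implements the pipeline: the function names, the identity model matrix, the camera offset, the simple perspective matrix, and the 640x480 image size are all assumptions made for the example, and the lighting stage is omitted.

```python
# Minimal sketch of the transform stages for one vertex (lighting omitted).
# All names and numeric values here are illustrative assumptions.

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def perspective(d):
    """Simple perspective projection; the camera looks down -z, image plane at z = -d."""
    return [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, -1.0 / d, 0]]

def to_pixel(vertex, model, view, proj, width, height):
    world = mat_vec(model, vertex)       # modeling transformation
    camera = mat_vec(view, world)        # viewing transformation
    clip = mat_vec(proj, camera)         # projection transformation
    if clip[3] <= 0:
        return None                      # behind the camera: clipped
    x, y = clip[0] / clip[3], clip[1] / clip[3]   # perspective divide
    if not (-1 <= x <= 1 and -1 <= y <= 1):
        return None                      # clipping: outside the camera's view
    # Scan conversion of a single point: map [-1, 1] to pixel coordinates.
    px = int((x + 1) * 0.5 * (width - 1))
    py = int((1 - y) * 0.5 * (height - 1))
    return px, py

model = translation(0, 0, 0)             # object already sits at the world origin
view = translation(0, 0, -5)             # camera placed 5 units back along +z
proj = perspective(1.0)

print(to_pixel([0, 0, 0, 1], model, view, proj, 640, 480))    # (319, 239)
print(to_pixel([100, 0, 0, 1], model, view, proj, 640, 480))  # None (clipped)
```

A real pipeline clips and scan-converts whole triangles, not single points; the point version just keeps the order of stages visible.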
10
Modern OpenGL Pipeline
[Diagram: on the CPU, the Application sends 3D vertices to the GPU. On the GPU, the Vertex Processor outputs transformed, lit 2D vertices; Assembly & Rasterization turns them into fragments (pre-pixels); the Pixel Processor outputs final pixels (color, depth). Video Memory (Textures) is connected via render-to-texture; Graphics State configures the GPU stages.]
  • Programmable Vertex Processor
  • Programmable Fragment (Pixel) Processor

11
OpenGL vs. DirectX
  • OpenGL: just graphics; standard C interfaces;
    state machine; multiple platforms; academic use
  • DirectX: graphics, multimedia, etc.; C interfaces;
    object oriented; Windows; PC games

12
Possible GPU Performance Bottlenecks
  • CPU/Bus bound: the CPU simply cannot send enough
    vertices to the card to keep it busy
  • Vertex bound: the vertex processing engine is
    fully loaded, while the fragment engine just waits
    and grabs data as soon as it's ready
  • Pixel bound: the fragment engine is fully loaded,
    forcing the vertex engine to wait before sending
    more data
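
One way to picture these three cases: the slowest stage caps the throughput of the whole pipeline. The sketch below is a toy model in Python; the stage names and frame rates are made-up numbers for illustration, not measurements of any card.

```python
# Hypothetical per-stage capacities; names and numbers are made up for illustration.
def bottleneck(rates):
    """Return (stage, rate) for the slowest stage, which caps pipeline throughput.

    rates maps a stage name to the frames per second that stage could
    sustain on its own for the current workload.
    """
    stage = min(rates, key=rates.get)
    return stage, rates[stage]

frame_rates = {
    "cpu/bus": 120.0,   # frames/s the CPU and bus can feed
    "vertex": 90.0,     # frames/s the vertex engine can transform
    "pixel": 45.0,      # frames/s the fragment engine can shade
}
stage, fps = bottleneck(frame_rates)
print(f"{stage} bound at {fps} frames/s")   # pixel bound at 45.0 frames/s
```

In this model, speeding up any stage other than the slowest one does not change the frame rate.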

13
Early History
  • NVIDIA founded in 1993
  • 1997 RIVA
  • 1998 RIVA TNT
  • 1999 GeForce 256 (NV10)

14
GeForce 256 (NV10)
  • Lighting and transformation
  • DDR and SDR
  • HDTV compliant
  • Hardware alpha-blending
  • 4 pixel pipelines at 120 MHz
  • Fill Rate 480 Megapixels/second
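
The fill rate follows from the two figures before it. A small check in Python, assuming each pipeline emits one pixel per clock (that per-clock assumption is mine, not stated on the slide):

```python
def fill_rate_mpixels(pipelines, clock_mhz, pixels_per_clock=1):
    """Peak fill rate in megapixels/second: pipelines x pixels per clock x clock (MHz)."""
    return pipelines * pixels_per_clock * clock_mhz

print(fill_rate_mpixels(4, 120))   # 480
```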

15
GeForce2
  • 2000 GeForce 2 GTS
  • Doubled the pixel fill rate
  • Quadrupled the texel fill rate
  • Increased clock speed
  • Multi-texturing
  • S3TC, MPEG-2, FSAA

16
Anti-Aliasing
  • Without Anti-Aliasing
  • With Anti-Aliasing

17
GeForce2
  • 2000 GeForce 2 MX
  • Halved the number of pixel pipelines, making it
    cost effective
  • TwinView
  • Compatible with Macs

18
GeForce2
  • Jan 2001 Apple selected GeForce2 MX as default
    high-end graphics solution for Power Mac G4
  • August 2000 GeForce2 Ultra
  • November 2000 GeForce2 Go
  • December 2000 NVIDIA buys 3dfx

19
GeForce3
  • 2001 GeForce3 (NV20)
  • 240 MHz Core/500 MHz Memory
  • 57 million transistors
  • 46-76 Gigaflops
  • Vertex shader technology
  • Pixel shader technology
  • LightSpeed Memory architecture

20
(No Transcript)
21
LightSpeed Memory Architecture
22
GeForce4
  • 2002 GeForce4 Ti (NV25) and MX (NV17)
  • Ti
  • 4200, 4400, 4600, and 4800 versions
  • 63 million transistors
  • Chip clock 225-300 MHz
  • Memory Clock 500-650 MHz
  • 75-100 million vertices/second

23
GeForce FX
  • November 2002 GeForce FX (NV30)
  • 16 variations for different price ranges
  • 125 million transistors
  • 8 pixels/clock
  • 1 TMU/pipe (16 textures/unit)
  • 128 bit memory interface
  • 128 MB/256 MB Memory size support

24
GeForce 6 series
  • GeForce 6 series (NV40)
  • 6200; 6600 GT and Ultra; 6800 GT, Ultra, and
    Ultra Extreme
  • Core clock speed 450 MHz
  • Memory clock speed 600 MHz
  • Vertex shader units: 6 four-wide FP32 vector
    MADDs per clock cycle
  • Pixel shader units: 16 four-wide FP32 vector
    MADDs per clock cycle

25
GeForce 6 series
  • Super scalar 16 pipe architecture
  • CineFX3.0 engine
  • All operations done in FP32 precision per
    component
  • 200 Gigaflops (compare this to the Itanium's 6.4
    Gigaflops)

26
General Diagram (6800/NV40)
27
TurboCache
  • Uses PCI-Express bandwidth to render directly to
    system memory
  • Card needs less memory
  • Performance boost while lowering cost
  • TurboCache Manager dynamically allocates from
    main memory
  • Local memory used to cache data and to deliver
    peak performance when needed

28
TurboCache
29
NV40 Vertex Processor
An NV40 vertex processor is able to execute one
vector operation (up to four FP32 components),
one scalar FP32 operation, and make one access to
the texture per clock cycle
30
NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer
checks; the resulting sets of 4 pixels (quads) are
passed on to the fragment units
31
Programmable 2D and Video Processor
  • Can be used for video decoding and coding (IDCT,
    deinterlacing, color model transformations, etc.)

32
Why NV40 series was better
  • Massive parallelism
  • Scalability
  • Lower end products have fewer pixel pipes and
    fewer vertex shader units
  • Computation Power
  • 222 million transistors
  • First to comply with Microsoft's DirectX 9 spec
  • Dynamic Branching in pixel shaders

33
Dynamic Branching
  • Helps detect if pixel needs shading
  • Instruction flow handled in groups of pixels
  • Specify branch granularity (the number of
    consecutive pixels that take the same branch)
  • Better distribution of blocks of pixels between
    the different quad engines
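
Why branch granularity matters can be shown with a toy cost model. The Python sketch below assumes that a group whose pixels disagree on the branch must execute both sides for every pixel in the group; this model, the function name, and the cost numbers are illustrative assumptions, not NV40's actual scheduling.

```python
def shading_cost(takes_branch, granularity, cost_taken, cost_not_taken):
    """Total cost (arbitrary units) of shading a run of pixels.

    takes_branch: list of booleans, one per pixel.
    Pixels are processed in consecutive groups of `granularity`. A group whose
    pixels all agree pays for one side of the branch; a group whose pixels
    disagree pays for both sides.
    """
    total = 0
    for start in range(0, len(takes_branch), granularity):
        group = takes_branch[start:start + granularity]
        if all(group):
            per_pixel = cost_taken
        elif not any(group):
            per_pixel = cost_not_taken
        else:
            per_pixel = cost_taken + cost_not_taken   # divergent group
        total += per_pixel * len(group)
    return total

# 16 pixels: only the first 4 need the expensive path (cost 10); skipping costs 1.
pixels = [True] * 4 + [False] * 12
print(shading_cost(pixels, 4, 10, 1))    # 52: every group is coherent
print(shading_cost(pixels, 16, 10, 1))   # 176: one large divergent group
```

Under this model, a finer granularity lets more groups skip the expensive path, which is the benefit the slide attributes to dynamic branching.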

34
Dynamic Branching
35
GeForce 7 series
  • 7800 GT
  • Price: $449
  • 7 vertex units
  • 20 pixel pipelines
  • Clock speed 400 MHz
  • Memory clock speed 500 MHz
  • 7800 GTX
  • Price: $600
  • 8 vertex units
  • 24 pixel pipelines
  • Clock speed 430 MHz
  • Memory clock speed 600 MHz

36
GeForce 7800
  • 302 million transistors
  • 200 Gigaflops of multiply/add calculations per
    second
  • 128-bit floating point precision through the
    entire rendering pipeline
  • Fill rate 10.3 Gigatexels/second
  • 860 million vertices/sec
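
These peak figures can be cross-checked against the 7800 GTX numbers on the GeForce 7 series slide (24 pixel pipelines, 8 vertex units, 430 MHz). The Python sketch below assumes one texel per pipeline per clock. The "clocks per vertex" value is only what the slide's numbers imply, not something the slides state.

```python
def peak_rate(units, clock_hz, cycles_per_item=1):
    """Peak items/second if each unit finishes one item every `cycles_per_item` clocks."""
    return units * clock_hz / cycles_per_item

CLOCK_HZ = 430_000_000                         # 7800 GTX core clock

texels = peak_rate(24, CLOCK_HZ)               # 24 pixel pipelines
print(texels / 1e9)                            # 10.32 (Gigatexels/second)

# 8 vertex units producing 860 million vertices/second implies this many
# clocks per vertex per unit:
implied_cycles = 8 * CLOCK_HZ / 860_000_000
print(implied_cycles)                          # 4.0
```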

37
GeForce 7800
38
(No Transcript)
39
(No Transcript)
40
ALU Units in Pixel Processor
  • Sub-unit 1
  • NV40: fetches texture data, and can issue a MUL
    vector instruction or use its mini-ALU to issue a
    non-vector instruction
  • G70: the same, but can also issue a multiply/add
  • Sub-unit 2
  • NV40: can issue a multiply/add vector instruction
    or use its own mini-ALU to issue a non-vector
    instruction
  • G70: the same

41
GeForce 6 vs. GeForce 7
  • ALU Units
  • G70: 24 ALU units
  • NV40: 16 ALU units
  • Register file same size
  • Texture samplers the same but when fetching large
    textures in preparation for filtering, G70's
    samplers have less latency pulling those textures
    out of memory

42
GeForce 6 vs. GeForce 7 (speculative)
  • Increased L2 texture cache (to around 12KB)
  • Better cache re-use with larger textures,
    decompressing those larger textures into L1
    faster
  • Possibly offering more granularity in cache
    access by the GPU, to reduce texture bandwidth,
    speeding up rendering.

43
GeForce 6 vs. GeForce 7
  • 33% more vertex units (8 vs. 6), each with more
    performance
  • Improved vertex fetch unit (unconfirmed by
    Nvidia)
  • Triangle setup and rasteriser optimized via the
    use of a new raster pattern (again unconfirmed by
    Nvidia)

44
General Diagram (7800/G70)
45
32-bit IEEE floating-point throughout pipeline (NV40)
  • Framebuffer
  • Textures
  • Fragment processor
  • Vertex processor
  • Interpolants
  • GeForce 7800 (G70) supports 128 bit through
    entire pipeline!

46
Hardware supports several other data types
  • Fragment processor also supports
  • 16-bit half floating point
  • 12-bit fixed point
  • These may be faster than 32-bit on some HW
  • Framebuffer/textures also support
  • Large variety of fixed-point formats
  • E.g., classical 8-bit per component
  • These formats use less memory bandwidth than FP32
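
The storage trade-off between these formats can be seen with Python's standard `struct` module, which supports IEEE half (`e`) and single (`f`) precision. The helper names below are assumptions for the example, and the 8-bit case is modeled as a normalized value in [0, 1] stored in one byte.

```python
import struct

def roundtrip(value, fmt):
    """Pack a float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def to_unorm8(value):
    """Quantize a value in [0, 1] to 8 bits and convert it back to a float."""
    return round(value * 255) / 255

for name, fmt in (("FP16", "<e"), ("FP32", "<f")):
    size = struct.calcsize(fmt)
    print(name, size, "bytes/component,", 4 * size, "bytes per RGBA texel,",
          "0.1 stored as", roundtrip(0.1, fmt))

print("8-bit fixed", 1, "byte/component,", 4, "bytes per RGBA texel,",
      "0.1 stored as", to_unorm8(0.1))
```

An FP16 texel takes half the bytes of an FP32 texel and an 8-bit texel a quarter, which is where the bandwidth saving comes from. The cost is a larger rounding error in the stored value.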

47
How are current GPUs different from CPU?
  • GPU is a stream processor
  • Multiple programmable processing units
  • Connected by data flows

[Diagram: data flows from the Application through the Vertex Processor, Assembly & Rasterization, the Fragment Processor, and Framebuffer Operations to the Framebuffer; Textures are a further data source.]
48
How are current GPUs different from CPU?
  • Optimized for 4-vector arithmetic
  • Useful for graphics: colors, vectors, texcoords
  • Easy way to get high performance/cost
  • SIMD/MIMD

49
GPU Memory Model vs CPUs
  • Much more restricted memory access
  • Allocate/free memory only before computation
  • Limited memory access during computation (kernel)
  • Registers: read/write
  • Local memory: does not exist
  • Global memory: read-only during computation;
    write-only at end of computation (pre-computed
    address)
  • Disk access: does not exist
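
The restrictions above amount to a "gather but no scatter" model: a kernel may read anywhere in global memory, but it writes one result to an address fixed before it runs. The Python sketch below imitates that on the CPU; `run_kernel` and `blur` are hypothetical names for illustration, not part of any GPU API.

```python
def run_kernel(kernel, inputs, size):
    """Apply `kernel` once per output index, stream-style.

    The kernel may read (gather) from any position of the read-only inputs,
    but its single return value is written only to the output slot whose
    index was fixed before the kernel ran (no scatter).
    """
    frozen = tuple(tuple(buf) for buf in inputs)   # read-only during computation
    return [kernel(i, *frozen) for i in range(size)]

def blur(i, src):
    """Average each element with its neighbors (clamped at the edges)."""
    left = src[max(i - 1, 0)]
    right = src[min(i + 1, len(src) - 1)]
    return (left + src[i] + right) / 3.0

data = [0.0, 3.0, 6.0, 9.0]
print(run_kernel(blur, [data], len(data)))   # [1.0, 3.0, 6.0, 8.0]
```

Because no invocation can write into another's output slot, all of them can run in parallel in any order.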

50
GPU Memory Model
  • Where is GPU Data Stored?
  • Vertex buffer
  • Frame buffer
  • Texture

[Diagram: Vertex Buffer, Vertex Processor, Rasterizer, Fragment Processor, Frame Buffer(s); Texture feeds the Fragment Processor and, on VS 3.0 GPUs, the Vertex Processor.]
51
GPGPU and Motivation
  • GPUs are fast
  • Itanium: 6.4 GFLOPS
  • GeForce 7800: 200 GFLOPS
  • GPUs are getting faster, faster
  • CPUs: annual growth ≈ 1.5× → decade growth ≈ 60×
  • GPUs: annual growth > 2.0× → decade growth > 1000×
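
The decade figures are just the annual factors compounded over ten years, which a couple of lines of Python can confirm (the function name is an assumption for the example):

```python
def decade_growth(annual_factor, years=10):
    """Compound an annual growth factor over a number of years."""
    return annual_factor ** years

print(round(decade_growth(1.5)))   # 58, i.e. roughly 60x for CPUs
print(decade_growth(2.0))          # 1024.0, i.e. more than 1000x for GPUs
```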

52
Motivation: Computational Power
[Chart: GPU vs. CPU computational power. Courtesy Naga Govindaraju]
53
GPGPU
  • Good for inherently parallel applications
  • Rapidly evolving ISA and HW architecture
  • Largely secret
  • Can't simply port code written for the CPU!

54
Programs are Shaders
  • Bound by the specific hardware profile
  • E.g. different cards have different supported
    hardware, OpenGL has different restrictions than
    DirectX, etc
  • Hardware profiles change relatively drastically
    as new GPUs are developed
  • But typically new profiles only add features, so
    there is generally still backwards compatibility
    (but not always)

55
Vertex processor
  • 256 instructions per program originally
    (effectively higher with branching)
  • Now up to 65535 instructions
  • Executes on all vertices
  • Outputs new vertices or texture coordinates, etc

56
Fragment Processor Flow Chart
57
Fragment processor has flexible texture mapping
  • Memory is accessible through texture reads
  • Texture reads are just another instruction
  • Allows computed texture coordinates, nested to
    arbitrary depth
  • Allows multiple uses of a single texture unit

58
Additional fragment processor capabilities
  • Read access to window-space position
  • Read/write access to fragment Z
  • Built-in derivative instructions
  • Partial derivatives w.r.t. screen-space x or y
  • Useful for anti-aliasing
  • Conditional fragment-kill instruction
  • Multiple FP formats supported
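
A common way to obtain such screen-space derivatives is to take finite differences across the four pixels of a quad. The Python sketch below shows that idea; treating it as what the hardware instruction does is an assumption, and the function name and sample values are illustrative.

```python
def quad_derivatives(quad):
    """Estimate (d/dx, d/dy) of a quantity from its values at a 2x2 pixel quad.

    quad = [[top_left, top_right], [bottom_left, bottom_right]]
    """
    ddx = quad[0][1] - quad[0][0]   # change per pixel step in screen-space x
    ddy = quad[1][0] - quad[0][0]   # change per pixel step in screen-space y
    return ddx, ddy

# A texture coordinate u that changes by 0.25 per pixel in x and not at all in y.
u = [[0.25, 0.5],
     [0.25, 0.5]]
print(quad_derivatives(u))   # (0.25, 0.0)
```

A large derivative means the texture coordinate changes quickly from pixel to pixel, so a shader can respond by filtering over a wider area, which is how the derivatives help with anti-aliasing.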

59
Fragment processor limitations
  • Originally: no branching
  • Now supports dynamic branching (but it's still
    costly)
  • No indexed reads from registers
  • Use texture reads instead
  • No memory writes

60
Branching Instruction Costs (GeForce 6800)
61
Fragment shaders
  • Originally very limited in size (only 96
    instructions), now expanded to 65535
    instructions
  • New cards support dynamic branching (but it still
    incurs some performance penalty)
  • Now have the ability to output to multiple render
    targets

62
CineFX 4.0 Engine
  • A redesigned vertex shader unit reduces the time
    to set up and perform geometry processing.
  • A new pixel shader unit design can carry out
    twice as many floating-point operations and
    greatly accelerates other mathematical operations
    to increase throughput.
  • An advanced texture unit incorporates new
    hardware algorithms and better caching to speed
    filtering and blending operations.

63
Vertex Shaders
  • The 7800 has 8 vertex shaders
  • The Triangle Setup stage turns the vertex points
    into a triangle
  • It also determines mathematically the
    rasterization for each triangle
  • Accelerating triangle setup increases the total
    throughput of the 3D pipeline

64
Theoretical Rasterization Pattern of a Triangle
65
New Pixel Shader MADD
  • Multiply and Accumulate are commonly used math
    functions in 3D graphics
  • MADD stands for Multiply-ADD operations
  • The 7800 can do twice as many MADD operations
    as previous GPUs could
  • This allows developers to create much more
    complex visual effects
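
For reference, a 4-wide MADD computes a*b + c on each of four components at once. The Python sketch below spells that out; the function name and the color, light, and ambient values are assumptions for illustration.

```python
def madd4(a, b, c):
    """4-wide multiply-add: returns a*b + c, component by component."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))

# Example use: scale a surface color by a light intensity and add an ambient term.
color = (0.5, 0.25, 1.0, 1.0)
light = (0.5, 0.5, 0.5, 1.0)
ambient = (0.125, 0.125, 0.125, 0.0)
print(madd4(color, light, ambient))   # (0.375, 0.25, 0.625, 1.0)
```

Each component costs one multiply and one add, so a single 4-wide MADD counts as 8 floating-point operations when peak FLOPS are quoted.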

66
Transparency Adaptive Supersampling
  • Takes extra passes over thin-lined objects such as
    chain-link fences or trees to enhance quality
  • Pixels inside of a polygon are usually not
    touched by anti-aliasing methods
  • With this, a key set is devised, and those pixels
    are anti-aliased, creating a smoother image.

67
Transparency Adaptive Supersampling
68
Transparency Adaptive Multisampling
  • Higher levels of performance, because it uses one
    texel to determine other subpixel values
  • Not as high quality

69
(No Transcript)
70
Supporting the Future
  • The 7800 is already set up to support the new
    Microsoft Longhorn OS with some of the following
    advancements
  • Video post-processing
  • Real-time desktop compositing
  • Seamless multiple 3D applications
  • Accelerated antialiased text rendering
  • Special effects and animation

71
Accelerated Graphics Port (AGP)
  • AGP is superior to PCI because it provides a
    dedicated pathway between the slot and the
    processor
  • Uses sideband addressing
  • With PCI, a texture must be loaded from the hard
    drive into the system's RAM, then from RAM into
    the GPU framebuffer
  • AGP can read textures directly from system RAM by
    tricking the CPU into believing the textures
    are in the framebuffer, when they are really in
    memory

72
PCI Express
  • Based on the PCI system, allowing for backwards
    compatibility
  • Uses 1-bit, bi-directional lanes (PCI used a bus)
  • Each lane supports 250 MB/s in each direction
    (4 GB/s total for 16 lanes)
  • AGP is only 2 GB/s
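
The 4 GB/s figure corresponds to a 16-lane link at 250 MB/s per lane in one direction. A quick Python check (the function name and default are assumptions for the example):

```python
def pcie_bandwidth_mb_s(lanes, per_lane_mb_s=250):
    """Bandwidth in MB/s in one direction for a first-generation PCI Express link."""
    return lanes * per_lane_mb_s

for lanes in (1, 4, 8, 16):
    print(f"x{lanes}: {pcie_bandwidth_mb_s(lanes)} MB/s per direction")
# The x16 line prints 4000 MB/s, i.e. 4 GB/s.
```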

73
Scalable Link Interface (SLI)
  • Takes advantage of the PCI express bus, which
    will allow more than one discrete graphics device
    on the same PCI host
  • Allows two of the same GeForce GPUs to run on one
    machine, thus sharing load.
  • There are two modes for this
  • Split-frame Rendering (SFR)
  • Alternate-frame Rendering (AFR)

74
(No Transcript)
75
Split-frame Rendering
  • Has each GPU render a portion of the screen,
    split horizontally
  • No extra latency
  • Not necessarily evenly split
  • SFR is load shared, so it splits up the frame by
    the amount of work, not the size
  • A large amount of overhead is involved, causing a
    max speed up of around 1.8 times
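
Splitting "by the amount of work, not the size" can be sketched as choosing the scanline where the work above and below is most nearly equal. This Python example is a hypothetical illustration of that idea, not NVIDIA's actual load-balancing algorithm; the per-row cost estimates are made up.

```python
def balanced_split(row_costs):
    """Return the row index that splits the frame into two nearly equal workloads.

    row_costs[i] is an estimate of the rendering work for scanline i.
    GPU 0 renders rows [0, split); GPU 1 renders rows [split, len(row_costs)).
    """
    total = sum(row_costs)
    best_split, best_diff = 0, total
    top = 0
    for i, cost in enumerate(row_costs):
        top += cost
        diff = abs(top - (total - top))   # imbalance if we split after row i
        if diff < best_diff:
            best_split, best_diff = i + 1, diff
    return best_split

# Cheap sky in the top half of the frame, expensive terrain in the bottom half.
costs = [1] * 4 + [3] * 4
print(balanced_split(costs))   # 5
```

Splitting at the midpoint (row 4) would give the GPUs 4 and 12 units of work. Splitting at row 5 gives 7 and 9, so the GPU with the cheaper region renders more rows.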

76
Alternate-frame Rendering
  • Avoids all the overhead problems of SFR
  • Many buffer swaps
  • Reliant on the speed of the processor
  • Can cause latency issues
  • Recommended mode by NVIDIA

77
GeForce Go 7800 GTX
  • The mobile version of the 7800 GTX
  • Everything from the desktop release has been
    carried over to this
  • Can switch between x1 and x16 lanes of PCI
    Express
  • Uses PowerMizer 6.0, which allows this chip to
    operate in the same envelope as its predecessor,
    the 6800

78
(No Transcript)
79
(No Transcript)
80
GeForce Go 7800 Power Issues
  • Power consumption and package are the same as the
    6800 Ultra chip, meaning notebook designers do
    not have to change very much about their thermal
    designs
  • Dynamic clock scaling can run as slow as 16 MHz
  • This is true for the engine, memory, and pixel
    clocks
  • Heavier use of clock gating than the desktop
    version
  • Runs at voltages lower than any other mobile
    performance part
  • Regardless, you won't get much battery-based
    runtime for a 3D game

81
Questions?