A Single Unified Shader GPU Microarchitecture for Embedded Systems

1
A Single (Unified) Shader GPU Microarchitecture
for Embedded Systems
  • Victor Moya, Carlos González, Jordi Roca, Agustín
    Fernández
  • Department of Computer Architecture, UPC

Roger Espasa, Intel DEG Barcelona
2
Introduction
  • Graphics, and specifically 3D graphics, have become
    an important element of current PDAs, mobile phones
    and other handheld systems
  • OpenGL ES: a simplified OpenGL specification for
    embedded systems
  • The classic GPU architecture for the PC is not
    suited for embedded systems
  • Low power
  • Low area budget
  • We propose a single unified shader GPU
    architecture for embedded systems

3
Outline
  • ATTILA PC
  • ATTILA Embedded
  • Triangle Setup in the Shader Unit
  • ATTILA Simulation Framework
  • Results

5
Attila Classic for PCs
  • Optimized for large resolutions
  • Above 1024x768
  • Optimized for high performance
  • High power requirements
  • No power optimizations
  • 100 watts on current high-end GPUs
  • Large area budget
  • 300 million transistors on current high-end GPUs
  • Large dedicated memory bandwidth
  • 40 GB/s on current high-end GPUs
  • Specialized Shader Units
  • 2 to 8 vertex shader units
  • 1 to 6 fragment shader units

6
Attila PC
[Figure: Attila PC pipeline. Blocks: Vertex Fetch, Vertex Shaders, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Fragment Shaders, ROPs, Memory Controllers. Labels: "Specialized Shaders"; "Four fragments processed in parallel".]
8
Embedded Requirements
  • Optimized for small resolutions
  • 320x240 to 640x480
  • Optimized for low power
  • Reduce frequency
  • Power optimizations
  • Improve efficiency
  • Small area budget
  • Remove non-crucial hardware
  • Low available bandwidth
  • Reduced shading power
  • Reduce design complexity

9
Attila Embedded
  • No Hierarchical Z
  • No Z compression
  • Single unified shader
  • 1 SIMD ALU
  • Multithreaded
  • 16 threads of four vertex/triangle/fragment
    elements
  • 16 128-bit registers for temporary storage
    available per thread
  • Texture unit outputs 1 bilinear sample per cycle
    (a whole fragment quad every 4 cycles)
  • 4 KB Texture Cache
  • ROP
  • One Z value and one color value updated per cycle
    in the framebuffer (a fragment quad every 4 cycles)
  • Single 64-bit DDR channel
  • Limited by current simulator implementation
  • Treated as equivalent to a small (1 MB) embedded DRAM
  • 32-bit high-latency bus to large system memory
    for textures
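
As a rough check on the 1 MB figure, the framebuffer footprint at the target resolutions can be estimated. The sketch below assumes 4 bytes of color and 4 bytes of Z/stencil per pixel and a double-buffered color target; the slides state none of these, and `framebuffer_bytes` is a hypothetical helper.

```python
# Back-of-the-envelope framebuffer footprint.
# Assumptions (not from the slides): 32-bit color, 32-bit Z/stencil,
# two color buffers (front and back) and one Z buffer.
EMBEDDED_DRAM_BYTES = 1 * 1024 * 1024  # 1 MB embedded DRAM, from the slides


def framebuffer_bytes(width, height, color_bytes=4, z_bytes=4, color_buffers=2):
    """Return the bytes needed for the color buffers plus one Z buffer."""
    pixels = width * height
    return pixels * (color_bytes * color_buffers + z_bytes)


for width, height in [(320, 240), (640, 480)]:
    needed = framebuffer_bytes(width, height)
    fits = needed <= EMBEDDED_DRAM_BYTES
    print(f"{width}x{height}: {needed} bytes, fits in 1 MB: {fits}")
```

Under these assumptions a 320x240 framebuffer needs 921,600 bytes and fits, while a 640x480 one needs 3,686,400 bytes and does not.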

10
Attila Embedded
[Figure: Attila Embedded pipeline. Blocks: Vertex Fetch, Primitive Assembly, Clipping, Rasterization, ROP, Memory Controller, and a Single Unified Shader (Scheduler, Distributor, Shader). Labels: "Vertices", "Triangles", "Fragments"; "Single fragment per cycle pipeline".]
12
Triangle Setup in the Shader
  • 2D Homogeneous Rasterization
  • Olano and Greer
  • Triangle setup algorithm
  • Calculate setup matrix from triangle vertex
    matrix
  • Calculate interpolation equation for fragment Z
  • Cull triangles based on their facing direction
    (area sign)
  • Algorithm suited for a SIMD implementation in the
    Unified Shader
  • Inputs
  • Four 3-component vectors as input for the
    triangle vertex positions
  • Outputs
  • Three 4-component vectors as output for the
    triangle edge and Z interpolation equation
    coefficients
  • One signed triangle area register as output for
    the face culling stage
  • 26-instruction triangle shader program
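
The setup steps listed above can be sketched in a few lines. This is a minimal illustration of 2D homogeneous setup, not the 26-instruction shader program from the slides: the function names, the use of the adjoint (cross products) for the setup matrix, and the input/output layout are assumptions chosen to mirror the "four 3-component inputs, three 4-component outputs plus signed area" description.

```python
def cross(u, v):
    """Cross product of two 3-component vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    """Dot product of two 3-component vectors."""
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def triangle_setup(xs, ys, zs, ws):
    """Inputs: four 3-component vectors (x, y, z, w of the three vertices).

    Returns (a, b, c, area), where a, b, c are 4-component vectors holding
    the coefficients of the three edge equations and the Z equation, and
    area is the signed determinant of the vertex matrix.
    Returns None for a degenerate (zero-area) triangle.
    """
    v0, v1, v2 = zip(xs, ys, ws)            # rows of the vertex matrix
    # Columns of the adjoint of the vertex matrix: unnormalized edge equations.
    e0, e1, e2 = cross(v1, v2), cross(v2, v0), cross(v0, v1)
    area = dot(v0, e0)                      # determinant; its sign gives the facing
    if area == 0:
        return None
    # Z equation: per-vertex z weighted by the normalized edge equations.
    zeq = tuple((zs[0] * e0[k] + zs[1] * e1[k] + zs[2] * e2[k]) / area
                for k in range(3))
    a = (e0[0], e1[0], e2[0], zeq[0])
    b = (e0[1], e1[1], e2[1], zeq[1])
    c = (e0[2], e1[2], e2[2], zeq[2])
    return a, b, c, area


def evaluate(a, b, c, x, y):
    """Evaluate the three edge equations and the Z equation at pixel (x, y)."""
    return tuple(a[i] * x + b[i] * y + c[i] for i in range(4))


# Example: a front-facing triangle with w = 1 at every vertex.
a, b, c, area = triangle_setup(xs=(0.0, 4.0, 0.0), ys=(0.0, 0.0, 4.0),
                               zs=(0.0, 0.5, 1.0), ws=(1.0, 1.0, 1.0))
e0, e1, e2, z = evaluate(a, b, c, 1.0, 1.0)
# A pixel is inside when all edge values share the sign of the area.
inside = min(e0, e1, e2) >= 0 if area > 0 else max(e0, e1, e2) <= 0
print(area, inside, z)
```

Each output vector packs one coefficient of all four equations, so evaluating them at a pixel maps onto 4-wide SIMD multiply-add operations; the sign of `area` is what a face culling stage would test.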

13
Triangle Setup in the Shader
  • Benefits
  • Reduce area
  • No specialized hardware required for Triangle
    setup
  • Reduce design complexity
  • Improve efficiency
  • Graphics workloads in embedded applications may
    not fully utilize specialized triangle setup
    hardware in most cases
  • Higher utilization of the shader
  • Costs
  • Shader workload increases
  • Rerouting of the rasterization pipeline required

15
[Figure: ATTILA simulation framework. Phases: Collect, Verify, Simulate, Analyze. Blocks: OpenGL Application, GLInterceptor, Trace, GLPlayer, Vendor OpenGL Driver, ATI R520/NVidia G70, ATTILA OpenGL Driver, ATTILA Simulator, Framebuffer, Statistics, Signal Traffic, Signal Visualizer. Framebuffer outputs are marked "CHECK!".]
16
  • GLInterceptor
  • Capture a trace of OpenGL API calls from a real
    game

17
  • GLPlayer
  • Reproduce the captured trace

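The capture-and-replay idea behind GLInterceptor and GLPlayer can be illustrated with a toy recording proxy. This is only a sketch of the concept: `Interceptor`, `replay` and `FakeGL` are hypothetical names, and the real tools intercept actual OpenGL API calls rather than methods on a Python object.

```python
class Interceptor:
    """Forwards every call to a backend and appends it to a trace."""

    def __init__(self, backend):
        self._backend = backend
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not found on the interceptor itself.
        target = getattr(self._backend, name)

        def recorded(*args):
            self.trace.append((name, args))
            return target(*args)

        return recorded


def replay(trace, backend):
    """Re-issue every recorded call, in order, against another backend."""
    for name, args in trace:
        getattr(backend, name)(*args)


class FakeGL:
    """Stand-in for a graphics API: it only logs the calls it receives."""

    def __init__(self):
        self.log = []

    def clear(self, r, g, b):
        self.log.append(("clear", (r, g, b)))

    def draw_triangles(self, count):
        self.log.append(("draw_triangles", (count,)))


# "Collect": the application talks to the interceptor instead of the API.
app_gl = Interceptor(FakeGL())
app_gl.clear(0.0, 0.0, 0.0)
app_gl.draw_triangles(12)

# "Replay": the same calls are issued against a different backend.
player_gl = FakeGL()
replay(app_gl.trace, player_gl)
print(player_gl.log)
```

Because the trace is independent of the backend, the same recording can be replayed against different targets and their outputs compared.
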
18
  • OpenGL Library
  • Transform fixed function API into shader code
  • 200 API calls supported
  • ARB vertex and fragment extensions
  • Alpha and fog emulated via shader code
  • Driver
  • Low-level interface to GPU hardware
  • Attila memory management

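To show what emulating alpha and fog in shader code amounts to, the sketch below applies an alpha test and linear fog to a single fragment. It assumes the `GL_GREATER` alpha function and `GL_LINEAR` fog mode with the standard OpenGL fog factor f = (end - c) / (end - start); `shade_fragment` is a hypothetical helper, not code generated by the ATTILA library.

```python
def shade_fragment(color, fog_coord, alpha_ref, fog_color, fog_start, fog_end):
    """Return the fogged RGBA color, or None if the alpha test kills the fragment."""
    r, g, b, a = color
    # Alpha test (GL_GREATER assumed): fragments that fail are discarded.
    if not a > alpha_ref:
        return None
    # Linear fog factor, clamped to [0, 1].
    f = (fog_end - fog_coord) / (fog_end - fog_start)
    f = max(0.0, min(1.0, f))
    # Blend the fragment RGB with the fog color; alpha is left unchanged.
    fr, fg, fb = fog_color
    return (f * r + (1.0 - f) * fr,
            f * g + (1.0 - f) * fg,
            f * b + (1.0 - f) * fb,
            a)


grey_fog = (0.5, 0.5, 0.5)
killed = shade_fragment((1.0, 0.0, 0.0, 0.2), 5.0, 0.5, grey_fog, 0.0, 10.0)
halfway = shade_fragment((1.0, 0.0, 0.0, 1.0), 5.0, 0.5, grey_fog, 0.0, 10.0)
beyond = shade_fragment((1.0, 0.0, 0.0, 1.0), 20.0, 0.5, grey_fog, 0.0, 10.0)
print(killed, halfway, beyond)
```

Since fog does not modify alpha, the order of the two steps does not change the result; a shader can discard the fragment first and skip the fog arithmetic.
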
19
  • ATTILA Simulator
  • Detailed cycle-by-cycle simulation of all pipeline
    stages
  • 20 boxes, modeling a 100-deep pipeline
  • Execute-at-execute functionality embedded at each
    pipeline stage

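A cycle-by-cycle simulator built from boxes can be sketched as units that are clocked once per cycle and exchange data over wires with a fixed latency. The sketch assumes that structure; `Signal`, `Producer`, `Consumer` and `simulate` are hypothetical names and do not reflect the ATTILA simulator's actual interfaces.

```python
from collections import deque


class Signal:
    """Fixed-latency wire: a value written at cycle t is readable at t + latency."""

    def __init__(self, latency):
        self._in_flight = deque([None] * latency)  # oldest value on the left
        self._pending = None

    def write(self, value):
        self._pending = value

    def read(self):
        return self._in_flight[0]

    def clock(self):
        # Advance the wire by one cycle.
        self._in_flight.popleft()
        self._in_flight.append(self._pending)
        self._pending = None


class Producer:
    """Box that emits one queued item per cycle."""

    def __init__(self, out, items):
        self.out = out
        self.items = deque(items)

    def clock(self, cycle):
        if self.items:
            self.out.write(self.items.popleft())


class Consumer:
    """Box that records what arrives and when."""

    def __init__(self, inp):
        self.inp = inp
        self.received = []

    def clock(self, cycle):
        value = self.inp.read()
        if value is not None:
            self.received.append((cycle, value))


def simulate(boxes, signals, cycles):
    for cycle in range(cycles):
        for box in boxes:        # box order is irrelevant: reads see past writes only
            box.clock(cycle)
        for signal in signals:
            signal.clock()


wire = Signal(latency=3)
producer = Producer(wire, ["v0", "v1"])
consumer = Consumer(wire)
simulate([producer, consumer], [wire], cycles=6)
print(consumer.received)
```

Routing all communication through latency-bearing wires means every transfer is visible at a known cycle, which is what makes per-signal traffic traces possible.
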
20
Spot the differences
[Figure: two renderings for comparison, labeled "Attila" and "NVidia GeForce FX 5900XT".]
22
Benchmark
  • Unreal Tournament 2004
  • NOT AN EMBEDDED BENCHMARK
  • Up to 300K vertices per frame!
  • Fixed function OpenGL API
  • Vertex and fragment shaders generated by our
    library
  • 320x240 resolution
  • 140 of 450 frames simulated
  • 100 frames take 1 day of simulation
  • On a Xeon P4 @ 2.0 GHz
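
The simulation cost quoted above can be turned into per-frame numbers. The frame counts come from the slide; the 20 frames per second used for the slowdown estimate is the rate reported later for the low-end single shader configurations, and applying it here is an assumption, since simulation time may differ between configurations.

```python
SECONDS_PER_DAY = 24 * 60 * 60

frames_per_day = 100      # from the slide: 100 frames per day of simulation
frames_simulated = 140    # from the slide: 140 of 450 frames
simulated_fps = 20        # assumption: rate of the low-end configurations

seconds_per_frame = SECONDS_PER_DAY / frames_per_day
total_days = frames_simulated / frames_per_day
# Host seconds spent per second of modeled GPU time.
slowdown = seconds_per_frame * simulated_fps

print(f"{seconds_per_frame:.0f} s per frame, {total_days:.1f} days in total, "
      f"about {slowdown:.0f}x slower than the modeled GPU")
```

Under these assumptions each frame takes 864 seconds to simulate and the 140 frames take about 1.4 days.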

23
Configurations
  • We have evaluated:
  • 3 mid-range to low-end PC GPU configurations
  • 2 chipset-integrated GPU and high-end PDA GPU
    configurations
  • 4 low-end embedded GPU configurations
  • We tried to keep a balance between memory
    bandwidth and shading computing power
  • From 4 to no vertex shader units
  • From 2 quad fragment shader units to a single
    unified shader unit
  • From four to one 64-bit DDR memory channels
  • Store framebuffer in small (1 MB) GPU memory and
    textures in system memory
  • Halved the frequency for embedded systems
  • Restricted design rules
  • Reduce power consumption
  • Removed all optional features at the low end
  • Hierarchical Z
  • Z compression
  • Specialized Triangle Setup hardware

24
Evaluated Configurations
25
Configuration Comparison
26
Performance
  • Average of 20 frames per second at 320x240 for
    the lower end single shader configurations

27
Efficiency
  • The limiting factor for PC and high-end embedded
    configurations is memory bandwidth
  • Shaders are underutilized for the evaluated
    benchmark
  • The limiting factor for low-end configurations is
    shading processing
  • Memory bandwidth could be further reduced
  • Caches seem oversized for the low-end
    embedded configurations

28
Shaded Triangle Setup Performance
  • No overhead on fragment limited benchmarks
  • 16% less performance in vertex and triangle
    limited traces

29
Conclusion
  • The Attila Embedded achieves 20 frames per second
    on a single unified shader architecture at a
    320x240 resolution when using a year-old PC
    benchmark
  • 1 MB of fast embedded DRAM provides more than
    enough bandwidth for framebuffer accesses
  • Texture data stored in system memory
  • 16% performance reduction when removing the
    specialized Triangle Setup unit in the worst
    tested case

30
  • Questions?

31
Attila PC
[Figure: Attila PC pipeline with a unified shader pool. Blocks: Vertex Fetch, Scheduler, Distributor, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Unified Shader Pool (four Shader units), four ROPs, four Memory Controllers.]
33
PowerVR SGX