Efficient Primitive Traversal Hardware for 3D Graphics Tile-Based Rasterizers

1
Efficient Primitive Traversal Hardware for 3D
Graphics Tile-Based Rasterizers
  • Dan Crisu, Sorin Cotofana, and Stamatis
    Vassiliadis
  • Computer Engineering Laboratory
  • Electrical Engineering Department
  • Delft University of Technology

2
Talk Outline
  • Terminology
  • Tile-Based Rasterization
  • Primitive Traversal Algorithm Hardware
  • Hardware Implementation Results
  • Conclusions

3
Typical 3D Graphics Pipeline
The rasterizer stage is always accelerated in
hardware!
4
Rasterization
  • Rasterizer stage input
  • consists of geometrical primitives (triangles)
    projected to the screen area
  • Triangle vertex attributes
  • Screen coordinates (x, y)
  • Homogeneous coordinate (w)
  • Depth component (z)
  • Color components (R, G, B, A)
  • Texture coordinates (s, t, q)

[Figure: triangle with vertices A, B, and C; a fragment raster position is marked inside the triangle stencil]

Rasterization: generation of fragment attributes
by interpolating triangle vertex attributes
  • Rasterization algorithm
  • While fragments remain, do:
  • 1. generate a fragment (in the order given by the
    triangle traversal algorithm)
  • 2. generate the fragment attributes from the
    fragment coordinates
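
The loop above can be sketched in software as follows. This is only an illustration under stated assumptions: fragments are visited in plain bounding-box scan order rather than the traversal order proposed in this talk, attributes are interpolated linearly with barycentric weights (no perspective correction with w), and the names `edge` and `rasterize` are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Twice the signed area of triangle (A, B, P)."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(v0, v1, v2):
    """Yield (x, y, attrs) for integer positions covered by the triangle.

    Each vertex is (x, y, attrs), where attrs is a tuple of floats
    (for example z, R, G, B). Assumption: simple bounding-box scan,
    positions on an edge count as covered (no fill rule).
    """
    (x0, y0, a0), (x1, y1, a1), (x2, y2, a2) = v0, v1, v2
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return  # degenerate triangle, no fragments
    xs, ys = (x0, x1, x2), (y0, y1, y2)
    for y in range(int(min(ys)), int(max(ys)) + 1):
        for x in range(int(min(xs)), int(max(xs)) + 1):
            # Step 1: decide whether this position is a fragment.
            # Dividing by the area makes the test winding-independent.
            w0 = edge(x1, y1, x2, y2, x, y) / area
            w1 = edge(x2, y2, x0, y0, x, y) / area
            w2 = edge(x0, y0, x1, y1, x, y) / area
            if w0 < 0 or w1 < 0 or w2 < 0:
                continue
            # Step 2: interpolate the vertex attributes at (x, y).
            attrs = tuple(w0 * p + w1 * q + w2 * r
                          for p, q, r in zip(a0, a1, a2))
            yield x, y, attrs
```

For example, `list(rasterize((0, 0, (0.0,)), (4, 0, (4.0,)), (0, 4, (0.0,))))` produces one fragment per covered position, each carrying a single interpolated attribute.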

5
Tile Based Rasterization (TBR)
  • Tiling Architecture
  • the screen is divided into a number of
    non-overlapping regions (tiles)
  • geometry is sorted by screen location and dumped
    into one or more bins, one bin per tile
  • the geometry is rendered from bin N to tile N
    before moving on to tile N+1
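
A minimal sketch of the binning step, assuming a conservative bounding-box test and the 32x16 tile size quoted later in the talk. The function name `bin_triangles` and the triangle representation (three `(x, y)` tuples) are hypothetical and not taken from the slides.

```python
TILE_W, TILE_H = 32, 16  # tile size quoted later in the talk


def bin_triangles(triangles, screen_w, screen_h):
    """Return one bin (list of triangle indices) per tile, in row-major order.

    Assumption: a triangle goes into every tile its bounding box overlaps.
    This is conservative, so a bin may receive a triangle that does not
    actually cover any pixel of that tile.
    """
    tiles_x = (screen_w + TILE_W - 1) // TILE_W
    tiles_y = (screen_h + TILE_H - 1) // TILE_H
    bins = [[] for _ in range(tiles_x * tiles_y)]
    for idx, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the covered tile range to the screen.
        tx0 = max(0, int(min(xs)) // TILE_W)
        tx1 = min(tiles_x - 1, int(max(xs)) // TILE_W)
        ty0 = max(0, int(min(ys)) // TILE_H)
        ty1 = min(tiles_y - 1, int(max(ys)) // TILE_H)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[ty * tiles_x + tx].append(idx)
    return bins
```

Rendering then walks the bins in order, finishing bin N before starting bin N+1. Note that bounding-box binning is one source of the ghost primitives discussed on a later slide.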

Full screen rasterization (FSR)
TBR
6
Tile Based Rasterization (TBR)
  • Advantages
  • Low cost: the information is maintained only per
    tile, not per screen
  • On-chip color, Z buffers with multiple
    samples/pixel for antialiasing
  • Low power: reduces accesses to external memory
  • Scalability with the screen resolution
  • Simplifies arithmetic if tile size is power of two

Disadvantage: overhead at the software driver
level
7
Rasterization with edge functions
The Edge Function
  • Most notable property
  • can be computed incrementally by simple addition
    for adjacent pixels
  • and generalizing

Triangle representation using edge functions.
Notational conventions for the edge function.
multi-operand addition if and are
very small
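
The formulas on this slide did not survive in the transcript. As an illustration only, the sketch below assumes the widely used edge function form E(x, y) = (x - X)·dY - (y - Y)·dX for an edge starting at (X, Y) with direction (dX, dY); the sign convention on the original slide may differ. Under that assumption, a step of one pixel in x adds dY, and a step of one pixel in y subtracts dX.

```python
def edge_function(X, Y, dX, dY, x, y):
    """Assumed edge function: zero on the edge, opposite signs on either side."""
    return (x - X) * dY - (y - Y) * dX


def edge_values_row(X, Y, dX, dY, x0, y, count):
    """Edge values for `count` adjacent pixels on one row.

    Only the first value is computed with multiplications; every
    following value is obtained by a single addition of dY.
    """
    e = edge_function(X, Y, dX, dY, x0, y)
    values = []
    for _ in range(count):
        values.append(e)
        e += dY  # E(x + 1, y) = E(x, y) + dY
    return values
```

Generalizing in the same way, a jump of (a, b) pixels gives E(x + a, y + b) = E(x, y) + a·dY - b·dX.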
8
Rasterization with edge functions
Traversing the tile entirely.
Efficient triangle traversal algorithm.
Triangle Traversal Algorithm
9
Rasterization with edge functions
  • Triangle rasterization algorithm (inner loop) for
    full-screen (FSR) approach
  • Save the rasterization context
  • Move to a new rasterization position
  • Test the edge functions
  • Hit? Then communicate position downstream and
    update context
  • Miss? Restore the rasterization context
  • Predict the next rasterization position

10
Rasterization with edge functions in TBR context
Ghost triangle for tiles (0, 2), (1, 0), (2, 0), and (2, 2)
  • Can the FSR algorithm be applied efficiently in
    TBR?
  • Yes, if a hit position can be detected very quickly
  • but this will not happen, for two reasons
  • not enough information to find a hit position
    quickly (overhead in searches 50-300)
  • ghost primitives

11
Rasterization with edge functions in TBR context
  • Additional issues (applying equally to FSR and
    TBR) that increase rasterization complexity
  • What triangle traversal order should be used to
  • Obtain a good texture cache hit-ratio for a
    pull texture architecture?
  • Minimize frame buffer bank conflicts caused by
    the read-modify-write dependency in Z-test and
    color blending?
  • There must be another way to perform TBR than the
    FSR algorithm!

12
Rasterization with edge functions in TBR context
INPUT / OUTPUT PORTS
SDRAM Memory
OpenGL State Registers
OpenGL State Registers
  • Triangle Setup
  • Gradient computation
  • Edge functions
  • Depth (z)
  • Color components
  • Texture coordinates
  • Triangle Rasterization
  • Interpolators
  • Depth (z)
  • Color components
  • Texture coordinates
  • Pixel coverage value
  • for Antialiasing

Per-Fragment Operation
Stage
Scissor
OpenGL State Registers
SOC bus
CPU
  • Triangle Traversal
  • Systolic subsystem
  • Logic-enhanced memory

Alpha
OpenGL State Registers
  • Texture Mapping
  • Mip-map level selection
  • Texel combiners

Stencil Depth
OpenGL State Registers
Texel Cache
Tile Stencil Depth SRAM
Blending Logical Op
Other peripherals
  • Texel BIU
  • Texel block fetching
  • Texel block unpacking

OpenGL State Registers
Tile Color SRAM
  • Tile Pixel BIU
  • Transfer tile to global FB

13
Proposed pixel rasterization order for TBR
  • Space-filling path
  • Good for texture cache
  • Path alternates banks
  • Rasterization order given by path can be enforced
    by hierarchical priority encoding, and resetting
    of the highest priority locations afterwards

Fragment locations in blocks (groups, quads)
encountered earlier on the path have higher
priority than any fragment locations encountered
in a block (group, quad) later on the path
14
Traversal algorithm hardware
  • Traversal algorithm hardware
  • Systolic primitive scan-conversion subsystem to
    compute the stencil of the triangle
  • Uses edge functions
  • Works on a sliding window of 8x8 pixels (a block)
  • Outputs the primitive shape for a different 4x4
    pixel region (a group) every clock cycle
  • An entire tile (32x16 pixels) is processed in
    32 cycles
  • Logic-enhanced memory
  • Coupled back-to-back with the systolic subsystem
  • Enforces space-filling path rasterization order
    by employing hierarchical priority encoding, read
    then reset.
  • On request delivers in one clock cycle at least 1
    and up to 4 hit locations following the
    space-filling path

15
Systolic computation of the primitive stencil
  • The edge function can be reformulated

Tile coordinates on the screen
Block coordinates in the tile
Pixel offsets in the block
Goal: compute the edge function values for a
block (8x8 pixel region) in parallel!
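
The reformulated expression itself is missing from the transcript. The following sketch shows one way such a decomposition can work, assuming the edge function form E(x, y) = (x - X)·dY - (y - Y)·dX (an assumption, not the slide's notation). Because E is linear, the value at a pixel splits into a tile term, a block term, and a pixel-offset term. The function and parameter names are hypothetical.

```python
def edge_function(X, Y, dX, dY, x, y):
    """Assumed edge function form (not taken from the slide)."""
    return (x - X) * dY - (y - Y) * dX


def block_edge_values(X, Y, dX, dY, tile_xy, block_xy, size=8):
    """Edge values for every pixel of one size x size block.

    tile_xy  : screen coordinates of the tile origin
    block_xy : coordinates of the block origin inside the tile
    The 64 results share the same base value and differ only by the
    small pixel-offset term, so they could be computed in parallel.
    """
    tx, ty = tile_xy
    bx, by = block_xy
    e_tile = edge_function(X, Y, dX, dY, tx, ty)   # per tile
    e_block = bx * dY - by * dX                    # per block
    base = e_tile + e_block
    return [[base + px * dY - py * dX              # per pixel offset
             for px in range(size)]
            for py in range(size)]
```

The pixel offsets are only 3 bits wide for an 8x8 block, which is what makes the per-pixel term cheap compared with a full-width multiplication.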
16
Systolic computation of the primitive stencil
The computations to be solved in parallel
for
can be implemented with the following tree
  • Disadvantage
  • Large area: 78 28-bit adders
  • Large latency: the critical path spans 4 28-bit adders

Solution: compute only 7 bits of the result at a
time (in a clock cycle), starting from the least
significant bits
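
A small software model of the idea, assuming a 28-bit result produced in four 7-bit slices with the carry passed from one cycle to the next. This models only the arithmetic of a single addition, not the systolic array on the following slides, and `sliced_add` is a hypothetical name.

```python
SLICE_BITS = 7
WORD_BITS = 28
SLICE_MASK = (1 << SLICE_BITS) - 1


def sliced_add(a, b):
    """Add two 28-bit words, 7 bits per 'clock cycle', LSB slice first.

    Operands are treated as unsigned 28-bit patterns (two's complement
    values work unchanged); the result is taken modulo 2**28.
    """
    carry = 0
    result = 0
    for cycle in range(WORD_BITS // SLICE_BITS):
        shift = cycle * SLICE_BITS
        # A narrow 7-bit addition plus the carry from the previous cycle.
        s = ((a >> shift) & SLICE_MASK) + ((b >> shift) & SLICE_MASK) + carry
        result |= (s & SLICE_MASK) << shift
        carry = s >> SLICE_BITS
    return result
```

Each cycle needs only a 7-bit adder, so the hardware is narrower and the per-cycle critical path is shorter, at the price of four cycles per full result.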
17
Systolic computation of the primitive stencil
Cell processing element circuit diagram
18
Systolic computation of the primitive stencil
(M)2t
(M0)t
Systolic computation of
19
Systolic computation of the primitive stencil
Systolic computation of
20
Systolic computation of the primitive stencil
Node processing element circuit diagram
21
Systolic computation of the edge function for an
8x8 pixel window
22
Systolic computation of the primitive stencil
  • Traversal algorithm hardware
  • Systolic primitive scan-conversion subsystem to
    compute the stencil of the triangle
  • Uses edge functions
  • Works on a sliding window of 8x8 pixels (a block)
  • Outputs the primitive shape for a different 4x4
    pixel region (a group) every clock cycle
  • An entire tile (32x16 pixels) is processed in
    32 cycles

23
Logic-Enhanced Memory
  • Logic-enhanced memory
  • Coupled back-to-back with the systolic subsystem
  • Enforces space-filling path rasterization order
    by employing hierarchical priority encoding, read
    then reset.
  • On request delivers in one clock cycle at least 1
    and up to 4 hit locations following the
    space-filling path
  • Interface
  • Write: a group in one clock cycle; the write protocol
    is similar to that of any general-purpose SRAM memory
  • Read: on request, delivers in one clock cycle a
    quad that contains at least 1 hit location, the
    quad encoding in (Block, Group, Quad) format, and
    one signal indicating whether all the hit locations
    have been transferred out.
  • Organization
  • 32 wordlines
  • Each wordline contains a group, each group
    contains four quads, each quad contains four
    location bits
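
The read-then-reset behaviour can be modelled in software roughly as below. Several points are assumptions rather than facts from the slides: lower indices are given higher priority (the real order follows the space-filling path, whose figure is not in the transcript), a wordline index is split as block = index // 4 and group = index % 4 (which matches 8 blocks of 4 groups in a 32x16 tile), and the class and method names are hypothetical.

```python
class LogicEnhancedMemory:
    """Behavioural model: 32 wordlines, each a group of 4 quads x 4 bits."""

    WORDLINES = 32

    def __init__(self):
        self.words = [0] * self.WORDLINES

    def write_group(self, wordline, bits16):
        """Store the 16 hit bits of one 4x4 group (one write per cycle)."""
        self.words[wordline] = bits16 & 0xFFFF

    def read(self):
        """Return ((block, group, quad), hit_mask, done) or None if empty.

        Hierarchical priority: first the highest-priority non-empty
        wordline, then the highest-priority non-empty quad inside it.
        The quad is cleared after being read (read then reset).
        """
        for wl, word in enumerate(self.words):
            if word == 0:
                continue
            for quad in range(4):
                mask = (word >> (4 * quad)) & 0xF
                if mask:
                    self.words[wl] = word & ~(0xF << (4 * quad))
                    done = not any(self.words)  # all hits delivered?
                    block, group = divmod(wl, 4)
                    return (block, group, quad), mask, done
        return None
```

Each call to `read` returns a 4-bit mask with between 1 and 4 bits set, which corresponds to the "at least 1 and up to 4 hit locations" per request.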

24
Logic-Enhanced Memory
Quad Cell
25
Logic-Enhanced Memory
Group Cell
26
Logic-Enhanced Memory
Group priority encoder logic table
27
Logic-Enhanced Memory
Group priority encoder: high-speed, low-power
n-type domino logic priority encoder with
one level of lookahead
28
Logic-Enhanced Memory
29
Logic-Enhanced Memory
Priority encoder obtained by chaining 4
high-speed, low-power 8-bit priority encoder macros
with three levels of lookahead
NB: the schematic was drawn to show how the PE
can be fitted to the memory cell pitch
30
Hardware Implementation Results
  • IC technology: UMC Logic18-1.8V/3.3V-1P6M
  • Tools: Synopsys Design Compiler, Alliance,
    H-SPICE
  • Systolic Subsystem
  • Critical path latency: 3.8 ns
  • Area: 244,980 µm²
  • Equivalent gate count: 17 kgates
  • Power: 18 mW (preliminary)
  • Logic-Enhanced Memory
  • Critical path latency: 2.387 ns, therefore
    fCLK200MHz
  • Area (total, including peripheral circuitry):
    118,985 µm²
  • Bit cell: 144 µm², quad cell: 636 µm², group cell:
    2894 µm²
  • Equivalent gate count: 8 kgates
  • Power: 6 mW (preliminary)
  • Total hardware area is approx. ¾ of the tile color
    buffer (SRAM, 512 x 32 bits = 16 kbits)

31
Conclusions
  • Efficient Tile-Based Traversal Algorithm Hardware
  • Good throughput: at least 1 and up to 4 pixels
    pushed per clock cycle
  • Rasterization order generates high texture cache
    hit-ratios
  • Breaks the read-modify-write dependency, which
    leads to efficient pipelining

32
Author Contact Information
Dan CRISU
Sorin COTOFANA
Stamatis VASSILIADIS
E-mail: dan, sorin, stamatis_at_ce.et.tudelft.nl
Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology
Mekelweg 4 (15th floor), 2628 CD Delft, The Netherlands
Phone: (31) 15 2783644, Fax: (31) 15 2784898