Efficient Primitive Traversal Hardware for 3D Graphics Tile-Based Rasterizers


1
Efficient Primitive Traversal Hardware for 3D
Graphics Tile-Based Rasterizers

Dan Crisu, Sorin Cotofana, and Stamatis
Vassiliadis
Computer Engineering Laboratory
Electrical Engineering Department
Delft University of Technology

2
Talk Outline

Terminology
Tile-Based Rasterization
Primitive Traversal Algorithm Hardware
Hardware Implementation Results
Conclusions

3
Typical 3D Graphics Pipeline
The rasterizer stage is always accelerated in
hardware!
4
Rasterization

Rasterizer stage input
consists of geometrical primitives (triangles)
projected to the screen area
Triangle vertex attributes
Screen coordinates (x, y)
Homogeneous coordinate (w)
Depth component (z)
Color components (R, G, B, A)
Texture coordinates (s, t, q)

[Figure: triangle with vertices A, B, and C]

Fragment: raster position inside the triangle
stencil

Rasterization: generation of fragment attributes
by interpolating triangle vertex attributes

Rasterization algorithm
While fragments remain, do:
1. Generate a fragment (in the order given by the
triangle traversal algorithm)
2. Generate the fragment attributes from the
fragment coordinates
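
The two steps above can be sketched in software. This is only a minimal model, not the hardware described in the talk: it visits the triangle's bounding box in row-major order (an assumed stand-in for the traversal algorithm), uses an edge-inclusive inside test with no fill rule, and interpolates a single attribute linearly without perspective correction. The names `edge` and `rasterize` are hypothetical.

```python
import math
from typing import Iterator, Tuple

Vertex = Tuple[float, float, float]  # (x, y, attribute), e.g. depth z


def edge(ax, ay, bx, by, px, py):
    """Twice the signed area of triangle (A, B, P)."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(a: Vertex, b: Vertex, c: Vertex) -> Iterator[Tuple[int, int, float]]:
    """Yield (x, y, attribute) for every pixel position inside the triangle."""
    area = edge(a[0], a[1], b[0], b[1], c[0], c[1])
    if area == 0:
        return  # degenerate triangle: no fragments
    x0 = math.floor(min(a[0], b[0], c[0]))
    x1 = math.ceil(max(a[0], b[0], c[0]))
    y0 = math.floor(min(a[1], b[1], c[1]))
    y1 = math.ceil(max(a[1], b[1], c[1]))
    for y in range(y0, y1 + 1):          # step 1: generate a fragment position
        for x in range(x0, x1 + 1):
            w_a = edge(b[0], b[1], c[0], c[1], x, y)
            w_b = edge(c[0], c[1], a[0], a[1], x, y)
            w_c = edge(a[0], a[1], b[0], b[1], x, y)
            # Inside when all three weights share the sign of the triangle area.
            if all(w * area >= 0 for w in (w_a, w_b, w_c)):
                # step 2: interpolate the vertex attribute at this position
                value = (w_a * a[2] + w_b * b[2] + w_c * c[2]) / area
                yield x, y, value
```

For example, `list(rasterize((0, 0, 0.0), (4, 0, 4.0), (0, 4, 8.0)))` produces one entry per covered pixel, each carrying the interpolated attribute.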

5
Tile Based Rasterization (TBR)

Tiling Architecture
the screen is divided into a number of
non-overlapping regions (tiles)
geometry is sorted by screen location and dumped
into one or more bins, one bin per tile
the geometry is rendered from bin N to tile N
before moving to tile N+1

[Figure: full-screen rasterization (FSR) compared with TBR]
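
The binning step can be illustrated with a short sketch. It assumes the driver bins each triangle by its screen-space bounding box, which the slides do not confirm, and it uses the 32x16 tile size quoted later in the talk. The function name `bin_triangles` and the data layout are hypothetical. Bounding-box binning is conservative, so a triangle may land in the bin of a tile it does not actually cover.

```python
from collections import defaultdict

TILE_W, TILE_H = 32, 16  # tile size in pixels, as quoted later in the talk


def bin_triangles(triangles, screen_w, screen_h):
    """Map (tile_x, tile_y) -> list of triangle indices whose bounding box overlaps the tile.

    `triangles` is a list of three (x, y) vertex tuples per triangle.
    """
    tiles_x = (screen_w + TILE_W - 1) // TILE_W
    tiles_y = (screen_h + TILE_H - 1) // TILE_H
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Clamp the bounding box to the tile grid.
        tx0 = max(0, int(min(xs)) // TILE_W)
        tx1 = min(tiles_x - 1, int(max(xs)) // TILE_W)
        ty0 = max(0, int(min(ys)) // TILE_H)
        ty1 = min(tiles_y - 1, int(max(ys)) // TILE_H)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins


def render(bins, draw_triangle_in_tile):
    """Render bin N into tile N before moving on to the next tile."""
    for tile in sorted(bins):
        for index in bins[tile]:
            draw_triangle_in_tile(tile, index)
```

With a 64x32 screen, the triangle (0, 0), (40, 0), (0, 20) is placed in all four bins, although it never reaches the pixels of tile (1, 1).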
6
Tile Based Rasterization (TBR)

Advantages
Low cost: the information is maintained only per
tile, not per screen
On-chip color and Z buffers with multiple
samples/pixel for antialiasing
Low power: reduces accesses to external memory
Scalability with the screen resolution
Simplifies arithmetic if the tile size is a power of two

Disadvantage: overhead at the software driver
level
7
Rasterization with edge functions
The Edge Function
[Formula not preserved in this transcript]

Most notable property: it can be computed
incrementally by simple addition for adjacent
pixels, and this generalizes to arbitrary pixel
offsets

[Figures: triangle representation using edge
functions; notational conventions for the edge
function]
The generalized update becomes a multi-operand
addition if the offsets are very small
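
The formulas for this slide are missing from the transcript. As a hedged illustration, the sketch below assumes the edge-function definition commonly used in the rasterization literature, E(x, y) = (x - X)*dY - (y - Y)*dX, for an edge that starts at (X, Y) and has direction (dX, dY). Under that assumption, stepping one pixel in x adds dY and stepping one pixel in y subtracts dX. The slides' own notation and sign convention may differ, and all function names here are hypothetical.

```python
def edge_function(X, Y, dX, dY, x, y):
    """Assumed definition: E(x, y) = (x - X) * dY - (y - Y) * dX."""
    return (x - X) * dY - (y - Y) * dX


def row_values(X, Y, dX, dY, x0, y, count):
    """Evaluate E along a row with one full evaluation and then additions only."""
    e = edge_function(X, Y, dX, dY, x0, y)
    values = []
    for _ in range(count):
        values.append(e)
        e += dY  # E(x + 1, y) = E(x, y) + dY
    return values


def offset_value(e, dX, dY, ox, oy):
    """Generalized step: E(x + ox, y + oy) = E(x, y) + ox * dY - oy * dX."""
    return e + ox * dY - oy * dX
```

For small offsets `ox` and `oy`, the two products can be formed by adding a few shifted copies of dY and dX, which is one way to read the remark about multi-operand addition.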
8
Rasterization with edge functions
Triangle Traversal Algorithm
[Figures: traversing the tile entirely; efficient triangle traversal algorithm]
9
Rasterization with edge functions

Triangle rasterization algorithm (inner loop) for
the full-screen (FSR) approach:
Save the rasterization context
Move to a new rasterization position
Test the edge functions
Hit? Then communicate the position downstream and
update the context
Miss? Restore the rasterization context
Predict the next rasterization position

10
Rasterization with edge functions in TBR context
[Figure: ghost triangle for tiles (0, 2), (1, 0),
(2, 0), and (2, 2)]

Can the FSR algorithm be applied efficiently in
TBR?
Yes, if a hit position can be detected very quickly,
but this will not happen, for two reasons:
not enough information to find a hit position
quickly (overhead in searches 50-300)
ghost primitives

11
Rasterization with edge functions in TBR context

Additional issues (applying equally to FSR and
TBR) that increase rasterization complexity
What should the triangle traversal order be to:
Obtain a good texture cache hit ratio for a
pull texture architecture?
Minimize frame buffer bank conflicts caused by
the read-modify-write dependency in the Z-test and
color blending?
There must be another way to perform TBR than the
FSR algorithm!

12
Rasterization with edge functions in TBR context
[Figure: block diagram of the tile-based rasterizer. Recoverable block labels:]
CPU, SOC bus, SDRAM memory, input/output ports, other peripherals
Triangle Setup: gradient computation (edge functions, depth (z), color components, texture coordinates)
Triangle Traversal: systolic subsystem, logic-enhanced memory
Triangle Rasterization: interpolators (depth (z), color components, texture coordinates, pixel coverage value for antialiasing)
Texture Mapping: mip-map level selection, texel combiners
Texel Cache
Texel BIU: texel block fetching, texel block unpacking
Per-Fragment Operation Stage: scissor, alpha, stencil and depth, blending and logical op
Tile Stencil/Depth SRAM, Tile Color SRAM
Tile Pixel BIU: transfer tile to global FB
OpenGL State Registers (label repeated throughout the diagram)

13
Proposed pixel rasterization order for TBR

Space-filling path
Good for texture cache
Path alternates banks
Rasterization order given by the path can be enforced
by hierarchical priority encoding, followed by
resetting the highest-priority locations

Fragment locations in blocks (groups, quads)
encountered earlier on the path have higher
priority than any fragment locations in a block
(group, quad) encountered later on the path
14
Traversal algorithm hardware

Traversal algorithm hardware
Systolic primitive scan-conversion subsystem to
compute the stencil of the triangle
Uses edge functions
Works on a sliding window of 8x8 pixels (a block)
Outputs the primitive shape for a different 4x4
pixel region (a group) every clock cycle
An entire tile (32x16 pixels) is processed in
32 cycles
Logic-enhanced memory
Coupled back-to-back with the systolic subsystem
Enforces space-filling path rasterization order
by employing hierarchical priority encoding, read
then reset.
On request delivers in one clock cycle at least 1
and up to 4 hit locations following the
space-filling path

15
Systolic computation of the primitive stencil

The edge function can be reformulated in terms of:
Tile coordinates on the screen
Block coordinates in the tile
Pixel offsets in the block

Goal: compute the edge function values for a
block (8x8 pixel region) in parallel!
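
The reformulated expression itself is missing from the transcript. The sketch below shows one plausible reading, assuming the edge function is the affine form E(x, y) = (x - X)*dY - (y - Y)*dX. Under that assumption, a value at the tile origin can be extended to any pixel by adding terms for the block position and the pixel offset, and the 64 values of a block are independent of each other. The names `block_values` and `edge_function` are hypothetical, and the list comprehension only models what the hardware computes in parallel.

```python
BLOCK = 8  # block size in pixels (8x8), as stated in the talk


def edge_function(X, Y, dX, dY, x, y):
    """Assumed definition: E(x, y) = (x - X) * dY - (y - Y) * dX."""
    return (x - X) * dY - (y - Y) * dX


def block_values(e_tile, dX, dY, xb, yb):
    """Edge-function values for the 8x8 block at tile-local origin (xb, yb).

    e_tile is E evaluated at the tile origin; each of the 64 results depends
    only on e_tile, the block position, and its own pixel offset (i, j).
    """
    e_block = e_tile + xb * dY - yb * dX
    return [[e_block + i * dY - j * dX for i in range(BLOCK)]
            for j in range(BLOCK)]
```

The sign of each value then says on which side of the edge the pixel lies, and combining the three edges of a triangle gives the stencil for the block.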
16
Systolic computation of the primitive stencil
The computations to be solved in parallel
can be implemented with an adder tree
[Figure: adder tree]

Disadvantages
Large area: 78 28-bit adders
Large latency: critical path spans 4 28-bit adders

Solution: compute only 7 bits of the result at a
time (per clock cycle), starting from the least
significant bits
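
The idea of producing 7 result bits per clock cycle can be illustrated with a digit-serial addition. This is a software sketch of the principle only, not the cell or node circuits from the slides: a 28-bit sum is formed in four steps of 7 bits, least significant digit first, with the carry passed from one step to the next. The function name `digit_serial_add` is hypothetical.

```python
DIGIT_BITS = 7    # bits produced per step, as stated in the talk
WORD_BITS = 28    # operand width, as stated in the talk
DIGIT_MASK = (1 << DIGIT_BITS) - 1


def digit_serial_add(a, b):
    """Add two 28-bit values 7 bits at a time, least significant digit first.

    Returns (sum modulo 2**28, number of steps). Each loop iteration stands
    for one clock cycle of a 7-bit adder with a registered carry.
    """
    carry = 0
    result = 0
    steps = WORD_BITS // DIGIT_BITS
    for k in range(steps):
        shift = k * DIGIT_BITS
        digit_a = (a >> shift) & DIGIT_MASK
        digit_b = (b >> shift) & DIGIT_MASK
        s = digit_a + digit_b + carry
        result |= (s & DIGIT_MASK) << shift
        carry = s >> DIGIT_BITS  # carry into the next, more significant digit
    return result, steps
```

Each step only needs a 7-bit adder, which shortens the critical path at the cost of latency. Because the most significant digit comes last, the sign of a two's complement result is known only after the final step.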
17
Systolic computation of the primitive stencil
Cell processing element circuit diagram
18
Systolic computation of the primitive stencil
[Figure: systolic computation (caption truncated in the transcript)]
19
Systolic computation of the primitive stencil
[Figure: systolic computation (caption truncated in the transcript)]
20
Systolic computation of the primitive stencil
Node processing element circuit diagram
21
Systolic computation of the edge function for an
8x8 pixel window
22
Systolic computation of the primitive stencil

Traversal algorithm hardware
Systolic primitive scan-conversion subsystem to
compute the stencil of the triangle
Uses edge functions
Works on a sliding window of 8x8 pixels (a block)
Outputs the primitive shape for a different 4x4
pixel region (a group) every clock cycle
An entire tile (32x16 pixels) is processed in
32 cycles

23
Logic-Enhanced Memory

Logic-enhanced memory
Coupled back-to-back with the systolic subsystem
Enforces space-filling path rasterization order
by employing hierarchical priority encoding, read
then reset.
On request delivers in one clock cycle at least 1
and up to 4 hit locations following the
space-filling path
Interface
Write: a group in one clock cycle; the write protocol
is similar to that of any general-purpose SRAM
Read: on request, delivers in one clock cycle a
quad that contains at least 1 hit location, the
quad encoding in (Block, Group, Quad) format, and 1
signal indicating whether all the hit locations have
been transferred out
Organization
32 wordlines
Each wordline contains a group, each group
contains four quads, each quad contains four
location bits
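
A behavioral model helps to see how read-then-reset enforces the order. The sketch below assumes 32 wordlines of 16 bits, four wordlines (groups) per block, lower indices having higher priority, and quad 0 stored in the least significant four bits. Only the 32-wordline, four-quad, four-bit organization comes from the slide; the rest is assumed for illustration, and the class and method names are hypothetical.

```python
class LogicEnhancedMemory:
    """Behavioral sketch: 32 wordlines, one 4x4 group (4 quads x 4 bits) each."""

    WORDLINES = 32

    def __init__(self):
        self.words = [0] * self.WORDLINES

    def write(self, wordline, group_bits):
        """Store the 16 location bits of one group, like an ordinary SRAM write."""
        self.words[wordline] = group_bits & 0xFFFF

    def read(self):
        """Return (block, group, quad, quad_bits, done) for the highest-priority
        non-empty quad and reset that quad; return None when nothing is stored.
        """
        for wordline, word in enumerate(self.words):   # wordline priority encoding
            if word == 0:
                continue
            for quad in range(4):                      # quad priority encoding
                quad_bits = (word >> (4 * quad)) & 0xF
                if quad_bits:
                    self.words[wordline] &= ~(0xF << (4 * quad))  # reset after read
                    done = not any(self.words)
                    return (wordline // 4, wordline % 4, quad, quad_bits, done)
        return None
```

Each successful read returns a quad with between 1 and 4 set bits, which matches the stated throughput of at least 1 and up to 4 hit locations per request. Repeated reads drain the memory in priority order until the `done` flag is set.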

24
Logic-Enhanced Memory
Quad Cell
25
Logic-Enhanced Memory
Group Cell
26
Logic-Enhanced Memory
Group priority encoder logic table
27
Logic-Enhanced Memory
Group priority encoder: high-speed, low-power
n-type domino logic priority encoder with
one level of lookahead
28
Logic-Enhanced Memory
29
Logic-Enhanced Memory
Priority encoder obtained by chaining four
high-speed, low-power 8-bit priority encoder macros
with three levels of lookahead
NB: the schematic was drawn to show how the priority
encoder (PE) can be fitted to the memory cell pitch
30
Hardware Implementation Results

IC technology: UMC Logic18-1.8V/3.3V-1P6M
Tools: Synopsys Design Compiler, Alliance,
H-SPICE
Systolic Subsystem
Critical path latency: 3.8 ns
Area: 244,980 µm²
Equivalent gate count: 17 kgates
Power: 18 mW (preliminary)
Logic-Enhanced Memory
Critical path latency: 2.387 ns, therefore a
200 MHz clock (fCLK) is supported
Area (total, including peripheral circuitry):
118,985 µm²
Bit cell: 144 µm², quad cell: 636 µm², group cell:
2,894 µm²
Equivalent gate count: 8 kgates
Power: 6 mW (preliminary)
Total hardware area is approx. ¾ of the tile color
buffer (SRAM, 512 x 32 bits = 16k bits)

31
Conclusions

Efficient tile-based traversal algorithm hardware
Good throughput: at least 1 and up to 4 pixels
pushed per clock cycle
Rasterization order generates high texture cache
hit ratios
Breaks the read-modify-write dependency, which
leads to efficient pipelining

32
Author Contact Information
Dan CRISU
Sorin COTOFANA
Stamatis VASSILIADIS
E-mail: dan, sorin, stamatis_at_ce.et.tudelft.nl
Computer Engineering Laboratory
Electrical Engineering Department
Delft University of Technology
Mekelweg 4 (15th floor), 2628 CD Delft, The Netherlands
Phone: (31) 15 2783644
Fax: (31) 15 2784898
