Title: Efficient Primitive Traversal Hardware for 3D Graphics Tile-Based Rasterizers
1 Efficient Primitive Traversal Hardware for 3D Graphics Tile-Based Rasterizers
- Dan Crisu, Sorin Cotofana, and Stamatis Vassiliadis
- Computer Engineering Laboratory
- Electrical Engineering Department
- Delft University of Technology
2 Talk Outline
- Terminology
- Tile-Based Rasterization
- Primitive Traversal Algorithm Hardware
- Hardware Implementation Results
- Conclusions
3 Typical 3D Graphics Pipeline
The rasterizer stage is always accelerated in hardware!
4 Rasterization
- Rasterizer stage input
- consists of geometrical primitives (triangles) projected to the screen area
- Triangle vertex attributes
- Screen coordinates (x, y)
- Homogeneous coordinate (w)
- Depth component (z)
- Color components (R, G, B, A)
- Texture coordinates (s, t, q)
[Figure: triangle with vertices A, B, C]
- Fragment raster position inside the triangle stencil
Rasterization: generation of fragment attributes by interpolating triangle vertex attributes
- Rasterization algorithm
- While fragments remain, do:
- 1. generate a fragment (in the order given by the triangle traversal algorithm)
- 2. generate the fragment attributes employing the fragment coordinate
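The two-step loop above can be sketched in Python. This is a minimal illustration, not the hardware described in the talk: it assumes integer pixel positions, a plain bounding-box traversal order, linear (not perspective-corrected) interpolation, and no tie-break rule for shared edges. The names `edge` and `rasterize` are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Twice the signed area of triangle (A, B, P)."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(verts, attrs):
    """Yield (x, y, attributes) for every pixel covered by the triangle.

    verts: three (x, y) integer screen positions.
    attrs: three equal-length tuples of per-vertex attributes (e.g. z, R, G, B).
    """
    (ax, ay), (bx, by), (cx, cy) = verts
    area = edge(ax, ay, bx, by, cx, cy)
    if area == 0:
        return  # degenerate triangle: no fragments
    xs, ys = (ax, bx, cx), (ay, by, cy)
    # Step 1: generate fragments (assumed order: bounding-box scan).
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            w = (edge(bx, by, cx, cy, x, y),   # weight of vertex A
                 edge(cx, cy, ax, ay, x, y),   # weight of vertex B
                 edge(ax, ay, bx, by, x, y))   # weight of vertex C
            if any(wi * area < 0 for wi in w):
                continue  # outside the triangle
            # Step 2: interpolate the vertex attributes at (x, y).
            yield x, y, tuple(
                sum(wi * a[k] for wi, a in zip(w, attrs)) / area
                for k in range(len(attrs[0])))
```

For example, with vertices (0, 0), (4, 0), (0, 4) and depth values 0, 4, 8, the fragment at (1, 1) receives depth 3.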
5 Tile-Based Rasterization (TBR)
- Tiling Architecture
- the screen is divided into a number of non-overlapping regions (tiles)
- geometry is sorted by screen location and dumped into one or more bins, one bin per tile
- the geometry is rendered from bin N to tile N before moving to tile N+1
[Figure: full-screen rasterization (FSR) compared with TBR]
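The binning and per-tile rendering order can be sketched as follows. This is a minimal sketch under stated assumptions: triangles are binned by their screen-space bounding box, coordinates are integers, and the 32x16 tile size is borrowed from a later slide. `bin_triangles`, `render`, and the `draw` callback are hypothetical names.

```python
from collections import defaultdict

TILE_W, TILE_H = 32, 16  # assumed tile size in pixels


def bin_triangles(triangles, screen_w, screen_h):
    """Map tile index -> list of triangle ids whose bounding box touches it."""
    tiles_x = (screen_w + TILE_W - 1) // TILE_W
    bins = defaultdict(list)
    for tri_id, verts in enumerate(triangles):
        xs = [x for x, _ in verts]
        ys = [y for _, y in verts]
        # Clamp the bounding box to the screen, then convert to tile indices.
        tx0 = max(min(xs), 0) // TILE_W
        tx1 = min(max(xs), screen_w - 1) // TILE_W
        ty0 = max(min(ys), 0) // TILE_H
        ty1 = min(max(ys), screen_h - 1) // TILE_H
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[ty * tiles_x + tx].append(tri_id)
    return dict(bins)


def render(bins, num_tiles, draw):
    """Render all of bin N into tile N before moving on to tile N+1."""
    for n in range(num_tiles):
        for tri_id in bins.get(n, []):
            draw(n, tri_id)
```

A triangle that spans several tiles appears in several bins, so it is set up once per tile it touches.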
6 Tile-Based Rasterization (TBR)
- Advantages
- Low cost: the information is maintained only per tile, not per screen
- On-chip color and Z buffers with multiple samples/pixel for antialiasing
- Low power: reduces accesses to external memory
- Scalability with the screen resolution
- Simplifies arithmetic if the tile size is a power of two
Disadvantage: overhead at the software driver level
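The power-of-two point can be illustrated with a short sketch: splitting a screen coordinate into a tile index and an in-tile offset needs only a shift and a mask. The 32x16 tile size is an assumption taken from a later slide, and `split_coordinate` is a hypothetical name.

```python
TILE_W_LOG2, TILE_H_LOG2 = 5, 4  # assumed 32x16 tile: 2**5 by 2**4 pixels


def split_coordinate(x, y):
    """Return ((tile_x, tile_y), (offset_x, offset_y)) using shifts and masks."""
    tile = (x >> TILE_W_LOG2, y >> TILE_H_LOG2)
    offset = (x & ((1 << TILE_W_LOG2) - 1), y & ((1 << TILE_H_LOG2) - 1))
    return tile, offset
```

In hardware this amounts to selecting bit fields of the coordinate, with no divider or multiplier involved.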
7 Rasterization with edge functions
The Edge Function
- Most notable property
- can be computed incrementally by simple addition for adjacent pixels
- and, generalizing to larger pixel offsets, by multi-operand addition if the offsets are very small
Triangle representation using edge functions.
Notational conventions for the edge function.
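The incremental property can be sketched in Python. The slide's own formula is not reproduced here; the sketch assumes the commonly used Pineda-style form E(x, y) = (x - x0)*dy - (y - y0)*dx for an edge from (x0, y0) to (x1, y1), with dx = x1 - x0 and dy = y1 - y0, and the sign convention may differ from the talk's. Function names are hypothetical.

```python
def edge_direct(x0, y0, x1, y1, x, y):
    """Evaluate the assumed edge function from scratch (two multiplications)."""
    dx, dy = x1 - x0, y1 - y0
    return (x - x0) * dy - (y - y0) * dx


def edge_row_incremental(x0, y0, x1, y1, x_start, y, count):
    """Evaluate `count` adjacent pixels on one row with one addition each."""
    dy = y1 - y0
    e = edge_direct(x0, y0, x1, y1, x_start, y)  # full evaluation only once
    values = []
    for _ in range(count):
        values.append(e)
        e += dy  # E(x + 1, y) = E(x, y) + dy
    return values


def edge_offset(e, dx, dy, a, b):
    """Generalization: E(x + a, y + b) = E(x, y) + a*dy - b*dx."""
    return e + a * dy - b * dx
```

For very small a and b, the products a*dy and b*dx reduce to a few shifted copies of dy and dx, which is where a multi-operand adder fits.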
8 Rasterization with edge functions
Traversing the tile entirely.
Efficient triangle traversal algorithm.
Triangle Traversal Algorithm
9 Rasterization with edge functions
- Triangle rasterization algorithm (inner loop) for the full-screen (FSR) approach
- Save the rasterization context
- Move to a new rasterization position
- Test the edge functions
- Hit? Then communicate the position downstream and update the context
- Miss? Restore the rasterization context
- Predict the next rasterization position
10 Rasterization with edge functions in TBR context
Ghost triangle for tiles (0, 2), (1, 0), (2, 0), and (2, 2)
- Can the FSR algorithm be applied efficiently in TBR?
- Yes, if a hit position can be detected very quickly
- but this will not happen, for two reasons
- not enough information to find a hit position quickly (overhead in searches 50-300)
- ghost primitives
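The ghost-primitive problem can be reproduced with a small sketch. It takes "ghost" to mean a triangle that bounding-box binning assigns to a tile in which it covers no pixel, which is an interpretation of the slide rather than a quoted definition. The 4x4 tile size, the (column, row) tile indexing, and the function names are assumptions chosen to keep the example small.

```python
def edge(ax, ay, bx, by, px, py):
    """Twice the signed area of triangle (A, B, P)."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def inside(verts, x, y):
    """True if pixel (x, y) lies inside or on the border of the triangle."""
    a, b, c = verts
    area = edge(*a, *b, *c)
    ws = (edge(*b, *c, x, y), edge(*c, *a, x, y), edge(*a, *b, x, y))
    return area != 0 and all(w * area >= 0 for w in ws)


def ghost_tiles(verts, tile_w, tile_h):
    """Tiles (column, row) touched by the bounding box but holding no hit."""
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    ghosts = []
    for ty in range(min(ys) // tile_h, max(ys) // tile_h + 1):
        for tx in range(min(xs) // tile_w, max(xs) // tile_w + 1):
            hit = any(inside(verts, x, y)
                      for y in range(ty * tile_h, (ty + 1) * tile_h)
                      for x in range(tx * tile_w, (tx + 1) * tile_w))
            if not hit:
                ghosts.append((tx, ty))
    return ghosts
```

For the triangle (0, 0), (7, 0), (0, 7) with 4x4 tiles, the bounding box touches four tiles, but tile (1, 1) contains no covered pixel. An FSR-style traversal would search that whole tile without ever finding a hit.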
11 Rasterization with edge functions in TBR context
- Additional issues (applying equally to FSR and TBR) that increase rasterization complexity
- What should the triangle traversal order be in order to
- Obtain a good texture cache hit ratio for a pull texture architecture?
- Minimize frame buffer bank conflicts caused by the read-modify-write dependency in the Z-test and color blending?
- There must be another way to perform TBR than the FSR algorithm!
12 Rasterization with edge functions in TBR context
[Block diagram: tile-based rasterizer architecture, with the following labels]
- System: input/output ports, SOC bus, CPU, SDRAM memory, other peripherals
- Triangle Setup: gradient computation for edge functions, depth (z), color components, texture coordinates
- Triangle Traversal: systolic subsystem, logic-enhanced memory
- Triangle Rasterization: interpolators for depth (z), color components, texture coordinates; pixel coverage value for antialiasing
- Texture Mapping: mip-map level selection, texel combiners
- Texel BIU: texel block fetching, texel block unpacking
- Per-Fragment Operation Stage: Scissor, Alpha, Stencil/Depth, Blending/Logical Op
- Tile Pixel BIU: transfer tile to global FB
- Memories: Texel Cache, Tile Stencil/Depth SRAM, Tile Color SRAM
- OpenGL State Registers (repeated across the stages)
13 Proposed pixel rasterization order for TBR
- Space-filling path
- Good for texture cache
- Path alternates banks
- The rasterization order given by the path can be enforced by hierarchical priority encoding, and resetting of the highest-priority locations afterwards
Fragment locations in blocks (groups, quads) encountered earlier on the path have higher priority than any fragment locations in a block (group, quad) encountered later on the path
14 Traversal algorithm hardware
- Traversal algorithm hardware
- Systolic primitive scan-conversion subsystem to compute the stencil of the triangle
- Uses edge functions
- Works on a sliding window of 8x8 pixels (a block)
- Outputs the primitive shape for a different 4x4 pixel region (a group) every clock cycle
- An entire tile (32x16 pixels) is processed in 32 cycles
- Logic-enhanced memory
- Coupled back-to-back with the systolic subsystem
- Enforces the space-filling path rasterization order by employing hierarchical priority encoding, read then reset
- On request, delivers in one clock cycle at least 1 and up to 4 hit locations following the space-filling path
15 Systolic computation of the primitive stencil
- The edge function can be reformulated in terms of
- Tile coordinates on the screen
- Block coordinates in the tile
- Pixel offsets in the block
Goal: compute the edge function values for a block (8x8 pixel region) in parallel!
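Because the edge function is linear in x and y, the three-level split can be sketched as follows. The Pineda-style form E(x, y) = (x - x0)*dy - (y - y0)*dx is an assumption, since the slide's own formula is not reproduced here, and `block_edge_values` is a hypothetical name.

```python
def block_edge_values(x0, y0, x1, y1, tile_xy, block_xy, size=8):
    """Edge function values for every pixel of one size x size block.

    tile_xy:  screen position of the tile origin.
    block_xy: position of the block origin inside the tile.
    """
    dx, dy = x1 - x0, y1 - y0
    tx, ty = tile_xy
    bx, by = block_xy
    e_tile = (tx - x0) * dy - (ty - y0) * dx   # evaluated once per tile
    e_block = bx * dy - by * dx                # evaluated once per block
    base = e_tile + e_block
    # Per-pixel terms depend only on the small offsets (i, j), so all 64
    # values can be formed independently of one another.
    return [[base + i * dy - j * dx for i in range(size)]
            for j in range(size)]
```

Each entry equals the directly evaluated edge function at screen position (tx + bx + i, ty + by + j); only the small-offset terms differ between pixels of the same block.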
16 Systolic computation of the primitive stencil
The computations to be solved in parallel for the pixel offsets in a block can be implemented with the following tree
- Disadvantage
- Large area: 78 28-bit adders
- Large latency: critical path spans 4 28-bit adders
Solution: compute only 7 bits of the result at a time (in a clock cycle), starting from the least significant bits
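The 7-bits-per-cycle idea can be modelled in software as a digit-serial addition: a 28-bit sum is produced in four steps, least significant digit first, with the carry handed from one step to the next. This is a behavioural sketch only; it does not model the systolic array itself, and `digit_serial_add` is a hypothetical name.

```python
DIGIT_BITS = 7    # bits produced per clock cycle
WORD_BITS = 28    # operand width


def digit_serial_add(a, b):
    """Add two 28-bit words 7 bits at a time, least significant digit first."""
    mask = (1 << DIGIT_BITS) - 1
    carry = 0
    result = 0
    for cycle in range(WORD_BITS // DIGIT_BITS):  # 4 cycles
        shift = cycle * DIGIT_BITS
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift   # 7 result bits for this cycle
        carry = s >> DIGIT_BITS         # carry into the next cycle
    return result  # final carry dropped: result is modulo 2**28
```

Each step needs only a 7-bit adder, so the adder is narrower and its carry chain shorter, at the price of four cycles per 28-bit result.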
17 Systolic computation of the primitive stencil
Cell processing element circuit diagram
18 Systolic computation of the primitive stencil
[Figure: systolic computation]
19 Systolic computation of the primitive stencil
[Figure: systolic computation]
20 Systolic computation of the primitive stencil
Node processing element circuit diagram
21 Systolic computation of the edge function for an 8x8 pixel window
22 Systolic computation of the primitive stencil
- Traversal algorithm hardware
- Systolic primitive scan-conversion subsystem to compute the stencil of the triangle
- Uses edge functions
- Works on a sliding window of 8x8 pixels (a block)
- Outputs the primitive shape for a different 4x4 pixel region (a group) every clock cycle
- An entire tile (32x16 pixels) is processed in 32 cycles
23 Logic-Enhanced Memory
- Logic-enhanced memory
- Coupled back-to-back with the systolic subsystem
- Enforces the space-filling path rasterization order by employing hierarchical priority encoding, read then reset
- On request, delivers in one clock cycle at least 1 and up to 4 hit locations following the space-filling path
- Interface
- Write: a group in one clock cycle, with a write protocol similar to any general-purpose SRAM memory
- Read: on request, delivers in one clock cycle a quad that contains at least 1 hit location, the quad encoding in (Block, Group, Quad) format, and 1 signal indicating whether all the hit locations were transferred out
- Organization
- 32 wordlines
- Each wordline contains a group, each group contains four quads, each quad contains four location bits
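The interface and organization above can be captured in a small behavioural model. Several details are assumptions made for the sketch, not facts from the talk: a lower wordline address or quad index is taken to come earlier on the space-filling path (higher priority), the wordline address is taken as 4*block + group, and quad q is taken to occupy bits 4q to 4q+3 of the 16-bit group word. The class and method names are hypothetical.

```python
class LogicEnhancedMemory:
    """Behavioural model: 32 wordlines x 4 quads x 4 location bits."""

    WORDLINES = 32

    def __init__(self):
        self.words = [0] * self.WORDLINES  # one 16-bit group per wordline

    def write_group(self, address, bits):
        """Store the 16 location bits of one group (SRAM-like write)."""
        self.words[address] = bits & 0xFFFF

    def read_quad(self):
        """Return (block, group, quad, hits, done) for the highest-priority
        quad holding at least one hit, then reset that quad.
        Returns None when no hit locations remain."""
        for address, word in enumerate(self.words):      # wordline priority
            for quad in range(4):                        # quad priority
                hits = (word >> (4 * quad)) & 0xF
                if hits:
                    self.words[address] = word & ~(0xF << (4 * quad))
                    block, group = divmod(address, 4)
                    done = not any(self.words)           # all hits delivered?
                    return block, group, quad, hits, done
        return None
```

Each read returns a 4-bit hit mask with between 1 and 4 bits set, matching the "at least 1 and up to 4 hit locations" per cycle stated above. Resetting the quad after the read is what lets the next read move on along the path.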
24 Logic-Enhanced Memory
Quad Cell
25 Logic-Enhanced Memory
Group Cell
26 Logic-Enhanced Memory
Group priority encoder logic table
27 Logic-Enhanced Memory
Group priority encoder: high-speed, low-power n-type domino logic priority encoder with one level of lookahead
28 Logic-Enhanced Memory
29 Logic-Enhanced Memory
Priority encoder obtained by chaining four high-speed, low-power 8-bit priority encoder macros with three levels of lookahead
NB: the schematic was drawn to show how the PE can be fitted to the memory cell pitch
30 Hardware Implementation Results
- IC Technology: UMC Logic18-1.8V/3.3V-1P6M
- Tools: Synopsys Design Compiler, Alliance, H-SPICE
- Systolic Subsystem
- Critical path latency: 3.8 ns
- Area: 244,980 µm²
- Equivalent gate count: 17 kgates
- Power: 18 mW (preliminary)
- Logic-Enhanced Memory
- Critical path latency: 2.387 ns, therefore fCLK = 200 MHz
- Area (total, including peripheral circuitry): 118,985 µm²
- Bit cell: 144 µm², quad cell: 636 µm², group cell: 2894 µm²
- Equivalent gate count: 8 kgates
- Power: 6 mW (preliminary)
- Total hardware area is approx. ¾ of the tile color buffer (SRAM, 512 x 32 bits = 16 kbits)
31 Conclusions
- Efficient Tile-Based Traversal Algorithm Hardware
- Good throughput: at least 1 and up to 4 pixels pushed per clock cycle
- Rasterization order generates high texture cache hit ratios
- Breaks the read-modify-write dependency, which leads to efficient pipelining
32 Author Contact Information
Dan CRISU
Sorin COTOFANA
Stamatis VASSILIADIS
E-mail: {dan, sorin, stamatis}_at_ce.et.tudelft.nl
Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology
Mekelweg 4 (15th floor), 2628 CD Delft, The Netherlands
Phone: (+31) 15 2783644, Fax: (+31) 15 2784898