Iosif Antochi - PowerPoint PPT Presentation

1
Optimizations and Trade-offs for Low-Power 3D
Graphics Tile-Based Rendering Architectures
  • Iosif Antochi

Computer Engineering Laboratory, Delft University
of Technology, The Netherlands
2
Summary
  • Introduction
  • The GRAAL environment
  • Tile-based rendering
  • Memory Requirements
  • Scene Management
  • State Management

3
GRAAL Environment Overview
Applications (Benchmarks)
Tracer (Player)
SIMULATOR
HUMAN
INTERACTION
New Architecture
Simulator Front-end (Augmented Mesa)
TCP/IP or files
Performance Evaluation
Fast Back-end (Qt)
RTL Back-end (SystemC)
4
Introduction: Graphics Application Data
Structures
World
Object 1
Object 2
Object n

Texture mapping
Position
Shape

Textures
Texture
5
A Rendered Scene
Scene from the Quake 3 FPS game developed by id
Software
6
The Structure of the Scene
7
3D Graphics Pipeline
8
Tile Based Rendering
  • Part 1: Memory Requirements

9
Overview
  • Motivation
  • Background
  • Traditional and tile-based rendering methods
  • Tile size vs. external data traffic
  • Tile-based vs. conventional rendering
  • Conclusions

10
Motivation
  • External memory accesses consume a lot of power
  • Tile-based rendering might be used to reduce
    external memory traffic.

11
Traditional Rendering
12
Tile-Based Rendering
13
Rendering Models
Traditional rendering
Tile-Based rendering
14
Tile Size Vs. External Data Traffic
15
Triangle Size Histogram
  • First indication of required tile size
  • Very few triangles (7) are larger than 1024
    pixels

16
Number of Kilotriangles Transferred per Frame for
Various Tile Sizes
  • As expected, external traffic decreases as the
    tile size increases
  • A tile size of 32 × 32 yields a good trade-off
    between external traffic and on-chip memory

17
Tile-Based vs. Conventional Rendering
18
Data Traffic Front
  • The front data traffic increases for tile-based
    rendering
  • State change increases more than geometry for
    tile-based rendering

19
Data Traffic Back
  • Usually, there is no z data traffic for
    tile-based rendering
  • The color data traffic decreases for tile-based
    rendering

20
Total Data Traffic
  • Tile-based rendering reduces the total amount of
    data traffic by a factor of 1.96
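
As a quick reading of this figure (simple arithmetic on the stated factor, not an additional measurement), a reduction by a factor of 1.96 means that tile-based rendering generates roughly half of the traditional traffic. The symbols T_trad and T_tile are introduced here only for illustration:

```latex
\frac{T_{\mathrm{tile}}}{T_{\mathrm{trad}}} = \frac{1}{1.96} \approx 0.51,
\qquad
\frac{T_{\mathrm{trad}} - T_{\mathrm{tile}}}{T_{\mathrm{trad}}} = 1 - \frac{1}{1.96} \approx 0.49
```

That is, about 49% of the total data traffic is saved.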

21
TBR Data Traffic - Conclusions
  • A tile size of 32 × 32 pixels yields a good
    trade-off between the amount of on-chip memory
    and the amount of external data traffic.
  • Tile-based rendering reduces the total amount of
    external traffic by a factor of 1.96.
  • For workloads with a high overlap and low
    overdraw, traditional rendering can outperform
    tile-based rendering. For workloads with a low
    overlap and high overdraw, tile-based rendering
    is better than traditional rendering.

22
Tile Based Rendering
  • Part 2: Scene Management Details

23
Scene Management Overview
  • Motivation
  • Background
  • The Two-Stage Model
  • Overlap Tests
  • Scene Management Algorithms
  • Results
  • Conclusions

24
Motivation
  • Tile-based rendering requires that primitives are
    sorted into bins corresponding to tiles.

25
Background: Scene Management for Tile-Based
Rendering
26
Two Stage Model
  • Tile-based rendering model (based on retained-mode
    execution):

      procedure call_driver_for_instr(i)
        if (!buffer_is_full && IS_Bufferable(i)) {
          Buffer_Instr(i)
        } else {
          // we have to render all buffered instructions,
          // since either we have a Swap_Buffers instruction
          // or we ran out of buffer space
          inictx = Save_Context()
          for (tile = 0; tile < maxtile; tile++) {
            if (tile != 0) Set_Context(inictx)
            RestoreCurrentTile(tile)
            Render_All_Instr_For_Tile(tile)
            SwapTile(tile)
          }
        }

  • Stage 1: Buffering (and initial sorting)
  • Stage 2: (Second sorting and) primitive sending

27
BBOX Overlap Test
Determines if the bounding box of a triangle
overlaps with a tile
28
LET Overlap Test
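Consider a 2D vector defined by two points A
(X, Y) and B (X + dX, Y + dY), and a line L_AB
that passes through the two points. The edge
function for a certain point P (x, y) is defined
as
$E_{AB}(x, y) = (x - X)\,dY - (y - Y)\,dX$
Triangle T(A,B,C) overlaps tile T(xc,yc,l) if
(the three overlap conditions were shown as images
on the original slide and are not preserved in this
transcript)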
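
The exact overlap conditions of the slide are not available here, so the following C code is only a sketch of one common way to build a triangle/tile test from edge functions: run the BBOX test first, then reject the tile if all four of its corners lie on the outer side of any triangle edge. The names Point, Box, edge and let_overlaps are hypothetical, and the tile is described by its corner coordinates rather than by (xc, yc, l).

```c
#include <assert.h>

typedef struct { float x, y; } Point;
typedef struct { float min_x, min_y, max_x, max_y; } Box; /* tile rectangle */

/* Edge function of P = (px, py) for the vector a -> b:
 * (x - X) * dY - (y - Y) * dX, with (X, Y) = a and (dX, dY) = b - a. */
static float edge(Point a, Point b, float px, float py)
{
    return (px - a.x) * (b.y - a.y) - (py - a.y) * (b.x - a.x);
}

/* 1 if every tile corner is strictly on the outer side of edge a -> b.
 * "sign" is the edge-function value of the third vertex, so it is
 * positive or negative depending on the triangle orientation. */
static int tile_outside_edge(Point a, Point b, const Box *t, float sign)
{
    return sign * edge(a, b, t->min_x, t->min_y) < 0 &&
           sign * edge(a, b, t->max_x, t->min_y) < 0 &&
           sign * edge(a, b, t->min_x, t->max_y) < 0 &&
           sign * edge(a, b, t->max_x, t->max_y) < 0;
}

/* Returns 1 if triangle (a, b, c) overlaps the tile, 0 otherwise. */
static int let_overlaps(Point a, Point b, Point c, const Box *t)
{
    assert(t->min_x <= t->max_x && t->min_y <= t->max_y);

    /* BBOX part: reject if the boxes are separated along x or y. */
    float lo_x = a.x < b.x ? (a.x < c.x ? a.x : c.x) : (b.x < c.x ? b.x : c.x);
    float hi_x = a.x > b.x ? (a.x > c.x ? a.x : c.x) : (b.x > c.x ? b.x : c.x);
    float lo_y = a.y < b.y ? (a.y < c.y ? a.y : c.y) : (b.y < c.y ? b.y : c.y);
    float hi_y = a.y > b.y ? (a.y > c.y ? a.y : c.y) : (b.y > c.y ? b.y : c.y);
    if (hi_x < t->min_x || lo_x > t->max_x) return 0;
    if (hi_y < t->min_y || lo_y > t->max_y) return 0;

    /* Orientation: the third vertex is on the inner side of each edge. */
    float sign = edge(a, b, c.x, c.y);
    if (sign == 0.0f) return 1; /* degenerate triangle: keep the BBOX answer */

    if (tile_outside_edge(a, b, t, sign)) return 0;
    if (tile_outside_edge(b, c, t, sign)) return 0;
    if (tile_outside_edge(c, a, t, sign)) return 0;
    return 1;
}
```

For the triangle (0,0), (10,0), (0,10), a tile spanning (6,6) to (9,9) passes the BBOX test but is rejected by the edge test, because it lies entirely beyond the hypotenuse.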
29
Algorithm DIRECT
  • For each tile scan the whole list of primitives
    and send the primitives that (potentially)
    overlap the tile to the rasterizer.
  • Pseudocode:

      for each triangle Tr
        buffer Tr
      for each tile T
        for each triangle Tr
          compute bbox of Tr
          if bbox of Tr and T overlap
            send Tr

30
Algorithm TWO_STEP
  • Compute and store the bounding box of each
    triangle during the buffering stage. This avoids
    having to recompute the bounding box for each
    triangle/tile tuple, but it requires more memory.
  • Pseudocode:

      for each triangle Tr
        buffer Tr
        compute and store bbox of Tr
      for each tile T
        for each triangle Tr
          if bbox of Tr and T overlap
            send Tr
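
A minimal C sketch of the two stages of TWO_STEP, under stated assumptions: the types Point, Triangle and Box and the functions two_step_buffer and two_step_select are hypothetical names, and "send Tr" is modelled by writing the triangle index into an output array instead of driving a rasterizer.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y; } Point;
typedef struct { Point a, b, c; } Triangle;
typedef struct { float min_x, min_y, max_x, max_y; } Box;

static float min3(float p, float q, float r)
{
    float m = p < q ? p : q;
    return m < r ? m : r;
}

static float max3(float p, float q, float r)
{
    float m = p > q ? p : q;
    return m > r ? m : r;
}

/* Stage 1: compute and store the bounding box of every buffered triangle. */
static void two_step_buffer(const Triangle *tris, size_t n, Box *bboxes)
{
    assert(tris != NULL && bboxes != NULL);
    for (size_t i = 0; i < n; i++) {
        const Triangle *t = &tris[i];
        bboxes[i].min_x = min3(t->a.x, t->b.x, t->c.x);
        bboxes[i].min_y = min3(t->a.y, t->b.y, t->c.y);
        bboxes[i].max_x = max3(t->a.x, t->b.x, t->c.x);
        bboxes[i].max_y = max3(t->a.y, t->b.y, t->c.y);
    }
}

/* BBOX test: 0 if the boxes are separated along x or y, 1 otherwise. */
static int bbox_overlaps(const Box *bb, const Box *tile)
{
    if (bb->max_x < tile->min_x) return 0;
    if (bb->min_x > tile->max_x) return 0;
    if (bb->max_y < tile->min_y) return 0;
    if (bb->min_y > tile->max_y) return 0;
    return 1;
}

/* Stage 2 for one tile: scan the stored boxes and record the index of
 * every triangle that would be sent. Returns how many were recorded. */
static size_t two_step_select(const Box *bboxes, size_t n,
                              const Box *tile, size_t *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (bbox_overlaps(&bboxes[i], tile))
            out[count++] = i;
    }
    return count;
}
```

The stored boxes cost one Box per triangle, which is the extra memory the slide mentions; in exchange, stage 2 performs only comparisons for each triangle/tile pair.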

31
Algorithm TWO_STEP_L
  • In the second stage the LET overlap test is used
    instead of BBOX. Since the LET test contains the
    BBOX test, the main LET is applied only to
    triangles that have passed the BBOX test.
  • Pseudocode:

      for each triangle Tr
        buffer Tr
        compute and store bbox of Tr
      for each tile T
        for each triangle Tr
          if Tr and T overlap (using LET)
            send Tr

32
Algorithm SORT
  • For each tile a buffer with pointers to the
    primitives that overlap the tile is created.
    During the second step these buffers are scanned.
  • Pseudocode:

      for each triangle Tr
        buffer Tr
        compute bbox of Tr
        for each tile T that overlaps bbox of Tr
          insert pointer to Tr in the buffer of T
      for each tile T
        for each triangle Tr in the buffer of T
          send Tr
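
The SORT binning step can be sketched in C as follows. This is an illustration under assumptions, not the driver's actual code: the 4 × 4 tile grid, the fixed bin capacity, integer pixel coordinates, and the names TileBin, sort_insert and sort_send_tile are all hypothetical. Only the 32-pixel tile size comes from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_SIZE    32  /* tile edge in pixels */
#define TILES_X      4   /* assumed grid width in tiles */
#define TILES_Y      4   /* assumed grid height in tiles */
#define BIN_CAPACITY 64  /* assumed maximum pointers per tile */

typedef struct { int x, y; } Point;
typedef struct { Point a, b, c; } Triangle;

/* Per-tile buffer of pointers into the global primitive list. */
typedef struct {
    const Triangle *items[BIN_CAPACITY];
    size_t count;
} TileBin;

static int min3(int p, int q, int r) { int m = p < q ? p : q; return m < r ? m : r; }
static int max3(int p, int q, int r) { int m = p > q ? p : q; return m > r ? m : r; }
static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

/* Stage 1: insert a pointer to tri into the bin of every tile that the
 * bounding box of tri overlaps. Returns 0 if a bin is full, 1 otherwise. */
static int sort_insert(TileBin bins[TILES_Y][TILES_X], const Triangle *tri)
{
    assert(tri != NULL);
    int min_x = min3(tri->a.x, tri->b.x, tri->c.x);
    int min_y = min3(tri->a.y, tri->b.y, tri->c.y);
    int max_x = max3(tri->a.x, tri->b.x, tri->c.x);
    int max_y = max3(tri->a.y, tri->b.y, tri->c.y);

    /* Bounding box entirely off-screen: nothing to insert. */
    if (max_x < 0 || max_y < 0 ||
        min_x >= TILES_X * TILE_SIZE || min_y >= TILES_Y * TILE_SIZE)
        return 1;

    /* Range of tile columns and rows covered by the bounding box. */
    int tx0 = clampi(min_x / TILE_SIZE, 0, TILES_X - 1);
    int tx1 = clampi(max_x / TILE_SIZE, 0, TILES_X - 1);
    int ty0 = clampi(min_y / TILE_SIZE, 0, TILES_Y - 1);
    int ty1 = clampi(max_y / TILE_SIZE, 0, TILES_Y - 1);

    for (int ty = ty0; ty <= ty1; ty++) {
        for (int tx = tx0; tx <= tx1; tx++) {
            TileBin *bin = &bins[ty][tx];
            if (bin->count == BIN_CAPACITY)
                return 0;
            bin->items[bin->count++] = tri;
        }
    }
    return 1;
}

/* Stage 2 for one tile: copy the pointers out, standing in for "send Tr". */
static size_t sort_send_tile(const TileBin *bin, const Triangle **out)
{
    for (size_t i = 0; i < bin->count; i++)
        out[i] = bin->items[i];
    return bin->count;
}
```

With the triangle (10,10), (40,10), (10,40), the bounding box covers four tiles, so the pointer is also inserted into the tile starting at (32,32) even though the triangle itself does not reach it. Removing such false positives is what an exact overlap test would add, at the cost of more computation per insertion.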

33
Algorithm SORT_L
  • Identical to SORT except that the LET is used to
    determine if a triangle and a tile overlap.
  • Pseudocode:

      for each triangle Tr
        buffer Tr
        compute bbox of Tr
        for each tile T that overlaps bbox of Tr
          if LET indicates Tr and T overlap
            insert pointer to Tr in the buffer of T
      for each tile T
        for each triangle Tr in the buffer of T
          send Tr

34
Experimental Setup
  • OpenGL benchmarks
      • Quake III (Q3), at low and high resolution
      • Tux Racer (Tux)
      • AWadvs-04 (AW), part of Viewperf
      • VRML scenes: Austrian National Library (ANL),
        Graz 3D (GRA), Dino (DIN)
  • Estimations and assumptions
      • Some parameters, such as the average number
        of operations needed to compute the bounding
        box or to perform the BBOX or LET test, were
        determined by running the benchmarks.
      • Other parameters, such as the average number
        of operations needed to buffer a primitive or
        to insert a primitive into a tile buffer, were
        statistically estimated.

35
Estimated Running Time
Estimated Running Time Relative to DIRECT
36
Amount of Additional Memory Required
Kbytes
37
TBR Scene Management - Conclusions
  • Which algorithm is preferable depends on the
    available additional memory and computational
    power.
  • The DIRECT algorithm has poor performance. On
    average the DIRECT algorithm is 44 times slower
    than SORT.
  • The TWO_STEP algorithm is slower than SORT by a
    factor of 6 while reducing the amount of
    additional memory by a factor of 3.2.
  • SORT_L is slower than SORT by a factor of 1.6.

38
Efficient Bounding-Box Computation
Let a triangle Tr be defined by three points
A(x,y), B(x,y), C(x,y). The bounding box (BBOX) of
Tr is defined by the tuple (BBOX.MinX, BBOX.MinY,
BBOX.MaxX, BBOX.MaxY), where

    BBOX.MinX = MIN(A.x, B.x, C.x)
    BBOX.MinY = MIN(A.y, B.y, C.y)
    BBOX.MaxX = MAX(A.x, B.x, C.x)
    BBOX.MaxY = MAX(A.y, B.y, C.y)

For a tile T given by the tuple (T.MinX,
T.MinY, T.MaxX, T.MaxY), a possible
implementation of the BBOX test in C is:

    if (BBOX.MaxX < T.MinX) /* Test 1 */ return NoOverlap;
    if (BBOX.MinX > T.MaxX) /* Test 2 */ return NoOverlap;
    if (BBOX.MaxY < T.MinY) /* Test 3 */ return NoOverlap;
    if (BBOX.MinY > T.MaxY) /* Test 4 */ return NoOverlap;
    return MightOverlap;
39
Bounding-Box Tests Order
  • The four comparisons required to determine if a
    BBOX and a tile overlap can be performed in an
    arbitrary order.
  • This gives a total of 24 possible arrangements.
    However, not every order produces the same number
    of comparisons on average.
  • A tile divides the screen into five, possibly
    intersecting regions: the tile itself, the region
    to the east of the tile (x > T.MaxX), the region
    to the west of the tile (x < T.MinX), the region
    to the north (y > T.MaxY), and the region to the
    south (y < T.MinY).

40
Static Bounding-Box Tests
STATIC1: If a certain test (comparison) fails,
then there is a high probability that the test in
the opposite direction along the same dimension
succeeds. This is because after these two tests
there is only a small region left where the BBOX
of a primitive can be situated.

STATIC2: The first and second (and, hence, the
third and fourth) comparisons check different
dimensions. For example, one possible order is
west, south, east, north.
41
Dynamic Bounding-Box Tests
The probability that a primitive is completely
located in the largest region is the highest.
This observation is the basis of our dynamic
versions of the bounding box test.

DYNAMIC1: First checks the largest region.
Thereafter, the opposite direction along the same
dimension is tested. The third test examines the
largest region in the other dimension, and the
fourth test checks the remaining region.

DYNAMIC2: The comparison corresponding to the
largest region is applied first, then the
comparison corresponding to the second largest
region, etc. The region to the east of the tile
is the largest and is checked first, then the region
to the south, then the one to the north, and,
finally, the region to the west.

We remark that although these schemes are called
dynamic, the order in which the comparisons are
applied depends only on the tile position and can
be determined statically off-line. For example,
for all tiles in the upper left sub-scene under
the main diagonal, the order is east, south,
north, west.
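
The DYNAMIC2 ordering can be sketched in C as follows. The assumptions are labeled: "largest region" is taken to mean largest area in pixels, tile bounds are inclusive, y grows towards the north, and the names Region, Rect, dynamic2_order and bbox_test_ordered are hypothetical.

```c
#include <assert.h>

typedef enum { EAST, WEST, NORTH, SOUTH } Region;
typedef struct { int min_x, min_y, max_x, max_y; } Rect; /* inclusive bounds */

/* Fill order[0..3] with the four regions around tile t, sorted by
 * decreasing area. Depends only on the tile position, so it can be
 * computed once per tile, off-line. */
static void dynamic2_order(int width, int height, const Rect *t, Region order[4])
{
    long area[4];
    assert(width > 0 && height > 0);
    area[EAST]  = (long)(width - 1 - t->max_x) * height;  /* x > T.MaxX */
    area[WEST]  = (long)t->min_x * height;                /* x < T.MinX */
    area[NORTH] = (long)(height - 1 - t->max_y) * width;  /* y > T.MaxY */
    area[SOUTH] = (long)t->min_y * width;                 /* y < T.MinY */

    order[0] = EAST; order[1] = WEST; order[2] = NORTH; order[3] = SOUTH;

    /* Insertion sort on four elements, largest area first. */
    for (int i = 1; i < 4; i++) {
        Region key = order[i];
        int j = i;
        while (j > 0 && area[order[j - 1]] < area[key]) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = key;
    }
}

/* BBOX test with the comparisons applied in the given order.
 * Returns 0 for no overlap, 1 for a possible overlap, and adds the
 * number of comparisons performed to *comparisons. */
static int bbox_test_ordered(const Rect *bb, const Rect *t,
                             const Region order[4], int *comparisons)
{
    for (int i = 0; i < 4; i++) {
        ++*comparisons;
        switch (order[i]) {
        case EAST:  if (bb->min_x > t->max_x) return 0; break;
        case WEST:  if (bb->max_x < t->min_x) return 0; break;
        case NORTH: if (bb->min_y > t->max_y) return 0; break;
        case SOUTH: if (bb->max_y < t->min_y) return 0; break;
        }
    }
    return 1;
}
```

On an assumed 640 × 480 screen, the tile spanning x 64..95 and y 320..351 gets the order east, south, north, west, and a bounding box lying to the east of that tile is rejected after a single comparison.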
42
Bounding-Box Tests Experimental Results
The average number of comparisons per primitive
for each workload
43
Tile Based Rendering
  • Part 3: State Management Details

44
SW Driver Block Diagram
Main Memory
Applications
Global Primitive List (GPL)
Mesa Core
Texture Images List (TIL)
Texture Objects List (TOL)
  • Global Instr. Buffering
  • Initial primitive sorting
  • Small triangle filter
  • Texture Preprocessing

Graal Device Driver
Per Tile Sorted Primitive List (Pointers to GPL)
  • Tile Instr. Iterator
  • Sends instructions to the Graal Accelerator in a
    tile based order until all tiles are completed

GPP (ARM)
SoC Bus
Rasterizer instructions: LDTRI, DEPTHEN,
SETDPTFCT, ..., SWPBUFF
Graal 3D Graphics Accelerator
45
State Information In Detail
  • Unit enable/disable
      • Enable blending
      • Disable depth
  • Unit functionality change
      • Change blending mode
      • Change depth test function
  • Texturing state (much larger than the rest of the
    state information)

46
Lazy State Update
  • Initial stream: Bindtex 1, Endepth, Tri 1,
    Disdepth, Tri 2, Endepth, Bindtex 2, Tri 4,
    Bindtex 3, Tri 5
  • Current tile stream: Bindtex 1, Endepth, Tri 1,
    Disdepth, Endepth, Bindtex 2, Bindtex 3, Tri 5
  • Current tile stream using lazy update: Bindtex 1,
    Endepth, Tri 1, Bindtex 3, Tri 5
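
One way to obtain the lazy-update stream is to track the state the application has requested separately from the state already sent for the current tile, and to emit a state change only just before a triangle that needs it. The C sketch below is an assumption about how this could be done, not the Graal driver code; the names Op, Instr and lazy_tile_stream are hypothetical, and only the texture binding and depth-enable state from the example are modelled.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BINDTEX, ENDEPTH, DISDEPTH, TRI } Op;
typedef struct { Op op; int arg; } Instr; /* arg: texture id or triangle id */

/* Build the instruction stream of one tile using lazy state update.
 * tri_in_tile[id] is nonzero if triangle id overlaps the tile.
 * out must have room for n instructions. Returns the output length. */
static size_t lazy_tile_stream(const Instr *in, size_t n,
                               const int *tri_in_tile, Instr *out)
{
    int want_tex = -1, sent_tex = -1;     /* -1: nothing requested or sent yet */
    int want_depth = -1, sent_depth = -1;
    size_t m = 0;

    assert(in != NULL && out != NULL && tri_in_tile != NULL);
    for (size_t i = 0; i < n; i++) {
        switch (in[i].op) {
        case BINDTEX:  want_tex = in[i].arg; break;  /* remember, do not send */
        case ENDEPTH:  want_depth = 1; break;
        case DISDEPTH: want_depth = 0; break;
        case TRI:
            if (!tri_in_tile[in[i].arg])
                break;                    /* triangle belongs to other tiles */
            if (want_tex != sent_tex) {   /* send only state that changed */
                out[m].op = BINDTEX;
                out[m].arg = want_tex;
                m++;
                sent_tex = want_tex;
            }
            if (want_depth != sent_depth) {
                out[m].op = want_depth ? ENDEPTH : DISDEPTH;
                out[m].arg = 0;
                m++;
                sent_depth = want_depth;
            }
            out[m++] = in[i];
            break;
        }
    }
    return m;
}
```

Fed with the initial stream of this slide and a tile overlapped by triangles 1 and 5, the sketch produces Bindtex 1, Endepth, Tri 1, Bindtex 3, Tri 5. The Disdepth/Endepth pair and Bindtex 2 never reach the accelerator because no triangle of the tile is drawn while they are in effect.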

47
Texture State Handling (I)
48
Texture State Handling (II)
  • Late commit of texture images
  • We use a global list of texture images, but a
    separate list of (bindable) texture objects for
    each context, thus we share the texture images in
    order to save space and memory transfers.
  • Deleting texture objects can be solved either by
    partial rendering or by postponing it when
    possible.

49
State Management Algorithms
  • What can we do whenever an instruction that has
    side effects (e.g., DeleteTexture) is encountered
    in the input stream?
  • PARTIAL RENDERING (DIRECT)
      • Render all previously buffered instructions.
      • Execute the instruction.
      • This algorithm might also introduce significant
        rendering overhead.
  • DELAYED EXECUTION
      • The driver postpones the execution of the
        current instruction until all the primitives
        depending on the current instruction are
        rendered or the end of the current frame is
        reached.
      • Execute the instruction.

50
Example
Assumptions: triangle 1 overlaps tile 2,
triangle 2 overlaps tile 1, and triangle 3
overlaps tiles 1 and 2.

Initial instruction stream:
  Start Frame
  CreateTexture(i); MakeCurrentTexture(i);
  Triangle(1); Triangle(2); DeleteTexture(i);
  CreateTexture(i); MakeCurrentTexture(i);
  Triangle(3)
  End Frame

Tiled instruction stream using partial rendering:
  Start Frame
  Tile 1: c1 = SaveCurrentContext;
    RestoreTileFromGlobalBuffer; CreateTexture(i);
    MakeCurrentTexture(i); Triangle(2);
    SaveTileToGlobalBuffer; c2 = SaveCurrentContext
  Tile 2: RestoreContext(c1);
    RestoreTileFromGlobalBuffer;
    MakeCurrentTexture(i); Triangle(1);
    SaveTileToGlobalBuffer
  DeleteTexture(i)
  Tile 1: c1 = SaveCurrentContext;
    RestoreTileFromGlobalBuffer; CreateTexture(i);
    MakeCurrentTexture(i); Triangle(3);
    SaveTileToGlobalBuffer
  Tile 2: RestoreContext(c1);
    RestoreTileFromGlobalBuffer;
    MakeCurrentTexture(i); Triangle(3);
    SaveTileToGlobalBuffer
  End Frame

Delayed tiled instruction stream using delayed commit:
  Start Frame
  Tile 1: c1 = SaveCurrentContext;
    RestoreTileFromGlobalBuffer; CreateTexture(i);
    MakeCurrentTexture(i); Triangle(2);
    MarkDeleteTexture(i); RenameTexture(i,j);
    MakeCurrentTexture(j); Triangle(3);
    SaveTileToGlobalBuffer
  Tile 2: RestoreContext(c1);
    RestoreTileFromGlobalBuffer;
    MakeCurrentTexture(i); Triangle(1);
    MakeCurrentTexture(j); Triangle(3);
    SaveTileToGlobalBuffer
  After Last Tile: DeleteTexture(i);
    MoveTextureLinks(i,j)
  End Frame
51
State Management Experimental Results (I)
Percentage of state information and triangles
sent to the accelerator per frame.
52
State Management Experimental Results (II)
Average number of state information writes to the
accelerator per frame.
53
State Management Conclusions
  • In traditional (non-tile-based) rendering, the
    state information traffic can be negligible
    compared to the traffic generated by the
    primitives. In tile-based rendering
    architectures, the state information might need
    to be duplicated in multiple streams, so the
    required processing power and generated traffic
    can increase significantly.
  • To remove a state change instruction from the
    instruction stream of a tile, information about
    the previous or the following state change
    instructions and/or primitives is required. Thus,
    sending an optimal state change stream to the
    accelerator, i.e., one that uses minimal
    bandwidth, requires additional processing power
    and more processor bandwidth.
  • By sending an optimized state change stream to
    the accelerator, the state change traffic to the
    accelerator was decreased by up to 58%.

54
Questions?
  • Thank You !