Provided by: mattc159

Learn more at: https://courses.cs.vt.edu

1
Parallel Graphics Rendering

Matthew Campbell
Senior, Computer Science
mcampbel_at_vt.edu

2
Overview

Motivation
Three categories of parallel rendering
Our approach
Results
Questions

3
Motivation

PC graphics cards are getting faster at an
exponential rate.
PC graphics boards are much cheaper than
proprietary SGI hardware.
GeForce4 FX: $150.00 (130 Mtris/sec)
SGI Onyx 300: $145,000 (80 Mtris/sec)
Maintenance costs are lower.
Replacement parts are easy to get.
PCs are not as complicated as proprietary
hardware.

4
Parallel Rendering

String together numerous PCs with good graphics
boards and render the models in parallel.
Increased performance
Better technology tracking
Three groups of algorithms
Sort-First
Sort-Middle
Sort-Last

5
Rendering Pipeline

Transformation stage
Per-Vertex operations
Primitive Assembly
3D World Space!
Rasterization stage
Per-fragment operations
Texture mapping
2D Image Space!

6
Parallel Rendering: Sort Last

Sort Last
Distribute polygons
Round robin distribution resulting in an equal
load on each processor.
Pass through entire rendering pipeline.
Transformation / Rasterization (see last slide)
Each CPU now holds a full-screen image
But each individual image is incomplete
Hidden polygons may be visible
Solution: Image composition
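
The round-robin distribution step can be sketched as follows. This is a minimal illustration, not the presenter's code; the function name and the use of integers as stand-ins for polygons are assumptions.

```python
def distribute_round_robin(polygons, num_nodes):
    """Assign polygon i to node i % num_nodes."""
    buckets = [[] for _ in range(num_nodes)]
    for index, polygon in enumerate(polygons):
        buckets[index % num_nodes].append(polygon)
    return buckets


# Ten placeholder polygons spread over four nodes.
buckets = distribute_round_robin(list(range(10)), 4)
for node, bucket in enumerate(buckets):
    print(node, bucket)
```

Bucket sizes differ by at most one polygon, which is the "equal load" the slide refers to, assuming every polygon costs about the same to render.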

7
Sort Last: Image Composition

The scene at each CPU has a frame buffer with
color values for each pixel and a depth buffer
with Z values for each pixel.
Composition: Given 2 scenes, compute the color
of the pixel at each screen coordinate.
Compare the depth buffer values at each pixel
location. The resulting color is that of the
pixel with the lower Z value (the one closer to
the viewer).
Alpha blending is more complex.
Why?
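
The depth comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: buffers are flat Python lists with one entry per pixel, a smaller Z means closer to the viewer, and empty pixels carry an infinite depth. The name `composite` is hypothetical.

```python
FAR = float("inf")  # depth of a pixel that no polygon covered


def composite(color_a, depth_a, color_b, depth_b):
    """Merge two partial images, keeping the nearer fragment at each pixel."""
    out_color, out_depth = [], []
    for ca, za, cb, zb in zip(color_a, depth_a, color_b, depth_b):
        if za <= zb:  # image A is at least as close at this pixel
            out_color.append(ca)
            out_depth.append(za)
        else:
            out_color.append(cb)
            out_depth.append(zb)
    return out_color, out_depth


# Two 4-pixel partial images from two different nodes.
color_a, depth_a = ["red", "red", None, None], [0.2, 0.7, FAR, FAR]
color_b, depth_b = [None, "blue", "blue", None], [FAR, 0.4, 0.5, FAR]
color, depth = composite(color_a, depth_a, color_b, depth_b)
print(color)
```

A depth test keeps exactly one fragment per pixel, so the order of the two inputs does not matter. Blending translucent fragments does depend on the order in which they are combined, which is one reason the alpha case is harder.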

8
Sort Last: Image Composition

Time complexity of the previous sort algorithm is
O(n), which is pretty bad.
Can we improve it?
Alternate algorithms
Tree composition.
Rotating rings.
Binary composition.
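
One plausible reading of tree composition is a pairwise reduction: images are merged two at a time, and the survivors are merged again until one remains. The sketch below assumes each image is a list of (color, depth) pixels and counts the rounds; `merge` and `tree_composite` are hypothetical names, and the slides do not specify the exact scheme used.

```python
def merge(image_a, image_b):
    """Depth-composite two images given as lists of (color, depth) pixels."""
    return [a if a[1] <= b[1] else b for a, b in zip(image_a, image_b)]


def tree_composite(images):
    """Merge a non-empty list of images pairwise; return (image, rounds)."""
    rounds = 0
    while len(images) > 1:
        merged = [merge(images[i], images[i + 1])
                  for i in range(0, len(images) - 1, 2)]
        if len(images) % 2 == 1:
            merged.append(images[-1])  # odd image out advances unmerged
        images = merged
        rounds += 1
    return images[0], rounds


# Eight one-pixel images; node 5 holds the nearest fragment.
images = [[("node%d" % i, float(abs(i - 5)))] for i in range(8)]
final, rounds = tree_composite(images)
print(final, rounds)
```

On a cluster, the merges within one round can run on different nodes at the same time, so 8 images need 3 rounds rather than 7 sequential merges.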

9
Sort-Last Performance

Sort-Last has very high communication bandwidth
requirement.
Each processor needs to send and receive an
entire frame
1280x1024 resolution, 24 bits for color, 16 bits
for depth, 30 fps
(3.9 MB + 2.6 MB) x 30 = 196 MB/sec bidirectional!
Need a very fast network interconnecting the CPUs
in the cluster.
In actuality, we need more bandwidth, because we
haven't taken into account the time it takes to
render the scene!
But... no overhead for rendering the actual scene!
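
The bandwidth figure on this slide can be reproduced with a short calculation. The function name is hypothetical, and megabytes are taken as decimal (10^6 bytes), which matches the slide's numbers.

```python
def sort_last_bandwidth(width, height, color_bits, depth_bits, fps):
    """Bytes per second one node moves in each direction under sort-last."""
    pixels = width * height
    color_bytes = pixels * color_bits // 8
    depth_bytes = pixels * depth_bits // 8
    return (color_bytes + depth_bytes) * fps


rate = sort_last_bandwidth(1280, 1024, 24, 16, 30)
print(rate / 1e6, "MB/sec")
```

The result depends only on resolution, buffer formats, and frame rate, not on how many polygons are in the scene.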

10
Parallel Rendering: Sort Middle

Sort Middle
Distribute polygons in a round robin fashion
Trap polygons between geometry and rasterization
phases
Each CPU in the cluster is responsible for a
specific region in screen coordinates
Calculate the bounding boxes (screen space) for
the trapped polygons and redistribute them to the
appropriate CPU responsible for the region.
Collate Images

11
Parallel Rendering: Sort Middle

How do you divide the screen into regions?
Strips (either horizontal or vertical)
Squares
What is the mapping ratio between CPUs and
regions?
One-to-One: Each CPU manages 1 region
One-to-Many: Each CPU manages many regions
What about polygons that cross region boundaries?
Multiple CPUs render the same polygon.
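
The redistribution step can be sketched as follows, assuming the screen is cut into a grid of equal tiles and that triangles arrive already clipped to the screen. The bounding box is conservative, so a triangle may be sent to a tile it does not actually touch. The function names and the interleaved tile-to-CPU mapping are assumptions for illustration.

```python
def tiles_for_triangle(vertices, screen_w, screen_h, cols, rows):
    """Return the (col, row) tiles overlapped by the screen-space bounding box."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    tile_w = screen_w / cols
    tile_h = screen_h / rows

    def clamp(index, count):
        return max(0, min(count - 1, index))

    col_lo = clamp(int(min(xs) // tile_w), cols)
    col_hi = clamp(int(max(xs) // tile_w), cols)
    row_lo = clamp(int(min(ys) // tile_h), rows)
    row_hi = clamp(int(max(ys) // tile_h), rows)
    return [(c, r)
            for r in range(row_lo, row_hi + 1)
            for c in range(col_lo, col_hi + 1)]


def owner(tile, cols, num_cpus):
    """One-to-many mapping: tiles are dealt to CPUs in row-major order."""
    c, r = tile
    return (r * cols + c) % num_cpus


# 1024x1024 screen split into a 2x2 grid of 512-pixel tiles.
inside = tiles_for_triangle([(10, 10), (100, 20), (50, 90)], 1024, 1024, 2, 2)
crossing = tiles_for_triangle([(500, 100), (530, 120), (510, 200)], 1024, 1024, 2, 2)
print(inside, crossing)
```

The second triangle straddles the vertical boundary at x = 512, so it is sent to two tiles and rendered twice, which is the duplicated work the slide mentions.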

12
Sort-Middle Performance

Load-balancing can be poor. The slowest CPU will
block the system from rendering the next scene.
Load balancing is highly scene and view
dependent.
Need adaptive load-balancing schemes.
In high polygon count scenes, the size of each
polygon can be very small (1-2 pixels).
In this case, sort middle requires more bandwidth
than sort-last.
Communication bandwidth required is dependent on
the scene complexity. (Bad)

13
Parallel Rendering: Sort First

Sort First
Distribute polygons round-robin to all CPUs.
Calculate bounding volumes for each polygon
Remember, we are still in the world coordinate
system.
Each CPU is responsible for 1 volume.
Redistribute polygons based on bounding volumes.
Pass through complete rendering pipeline
In the end we have sub-images at each processor.
Designate a coordinator node, which receives
sub-images from all other processors.
Coordinator collates sub-images into the final
image.
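
The collation step at the coordinator can be sketched as below. The layout is an assumption: each node's sub-image is taken to be a horizontal band of rows tagged with its first row, and bands may arrive in any order. The name `collate` is hypothetical.

```python
def collate(sub_images, height):
    """Assemble horizontal bands into one frame.

    sub_images: iterable of (first_row, rows), where rows is a list of
    pixel rows rendered by one node.
    """
    frame = [None] * height
    for first_row, rows in sub_images:
        for offset, row in enumerate(rows):
            frame[first_row + offset] = row
    missing = [y for y, row in enumerate(frame) if row is None]
    if missing:
        raise ValueError("no node supplied rows %s" % missing)
    return frame


# Two nodes, two rows each, arriving out of order.
frame = collate([(2, ["row2", "row3"]), (0, ["row0", "row1"])], 4)
print(frame)
```

No depth comparison is needed here, because each pixel of the final image comes from exactly one node.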

14
Sort First - Performance

Communication bandwidth required is based only on
screen space resolution.
Example
4 CPUs, 1024x1024 scene, 32 bits/color
The coordinator node receives 1024 x 1024 x 24
bits/frame = 3 MB.
Bandwidth: 90 MB/sec for 30 fps.
Problem: Similar to sort-middle, load balancing
is scene dependent.
Bigger issue: Can't use a one-to-many
CPU-to-region mapping.
Or can you?

15
Parallel Rendering Issues

Cannot break the rendering pipeline
Pipeline is implemented in hardware
Therefore, very expensive. Could lead to
excessive stalls, cache misses, etc.
Modern graphics cards have large amounts of
memory on the board and much faster access times.
8GB/sec vs. 1GB/sec for AGP4x
Graphics driver source code is unavailable
Additional cost/overhead due to framebuffer
accesses.

16
Our Approach

High Performance real-time rendering.
High scene complexity and/or multiple displays as
in a VE.
Target: 200-300 million triangles/sec. In
comparison, the best SGI platform (Reality
Monster) is capable of 80 million polygons/sec.
Approach
Distributed Sort-First.
Two level sorting.
Organize your model in a spatial tree data
structure.
At run-time compare bounding volumes for interior
nodes of the tree. The bounding volume for an
interior node is a superset of its children. This
minimizes comparisons.
Fine pruning based on viewing frustum.
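
The tree-based comparison can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: bounding volumes are axis-aligned boxes, and the view volume is approximated by a box as well, whereas a real frustum test would check six planes. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple  # minimum corner (x, y, z)
    hi: tuple  # maximum corner (x, y, z)
    children: list = field(default_factory=list)
    triangles: list = field(default_factory=list)  # payload stored at leaves


def boxes_overlap(lo_a, hi_a, lo_b, hi_b):
    """True if two axis-aligned boxes intersect on every axis."""
    return all(la <= hb and lb <= ha
               for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b))


def collect_visible(node, view_lo, view_hi, out, stats):
    """Gather triangles from subtrees whose bounds touch the view box."""
    stats["tests"] += 1
    if not boxes_overlap(node.lo, node.hi, view_lo, view_hi):
        return  # the whole subtree is rejected with a single comparison
    out.extend(node.triangles)
    for child in node.children:
        collect_visible(child, view_lo, view_hi, out, stats)


leaf_a = Node((0, 0, 0), (2, 2, 4), triangles=["a"])
leaf_b = Node((0, 2, 0), (2, 4, 4), triangles=["b"])
leaf_c = Node((3, 0, 0), (4, 2, 4), triangles=["c"])
leaf_d = Node((3, 2, 0), (4, 4, 4), triangles=["d"])
left = Node((0, 0, 0), (2, 4, 4), children=[leaf_a, leaf_b])
right = Node((3, 0, 0), (4, 4, 4), children=[leaf_c, leaf_d])
root = Node((0, 0, 0), (4, 4, 4), children=[left, right])

visible, stats = [], {"tests": 0}
collect_visible(root, (0, 0, 0), (1, 1, 1), visible, stats)
print(visible, stats)
```

Because the right interior node fails the test, its two leaves are never examined. That is the saving the slide attributes to an interior node's volume being a superset of its children.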

17
Hardware

32 Intel Xeon processor cluster (1.5 GHz
processor)
256 MB RDRAM/node (3.2 GB/sec memory bandwidth)
Myrinet (4 Gbps) and Fast Ethernet (200 Mbps
full-duplex) communication fabrics.
64 bit/66 MHz PCI bus (4 Gbps throughput)
4x AGP (1GB/sec throughput)

18
Software

Extensible Parallel 3D Rendering Engine
Supports large geometric databases, including
standard formats such as 3D Studio
Provides an extensible API.
Underlying system is based on OpenGL.
Based on dynamic shared object model.
Dynamic Load Balancing
Adaptively resizes volumes assigned to a
processor for single display systems.
Adaptively changes the number of processors and
rendering volumes for multi-display systems.
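
One simple way to resize regions adaptively is to give each node a share proportional to how fast it rendered the previous frame. The sketch below is only a heuristic for illustration: the slides do not describe the actual scheme, the regions are assumed to be horizontal strips measured in rows, and cost is assumed to scale with strip height. The name `rebalance` is hypothetical.

```python
def rebalance(heights, frame_times):
    """Return new strip heights with the same total, favoring faster nodes.

    heights: rows currently assigned to each node (all positive).
    frame_times: seconds each node took for the last frame (all positive).
    """
    total = sum(heights)
    # Rows per second each node achieved on the last frame.
    speeds = [h / t for h, t in zip(heights, frame_times)]
    speed_sum = sum(speeds)
    # Keep at least one row per node so its speed can still be measured.
    new = [max(1, int(total * s / speed_sum)) for s in speeds]
    # Rounding leaves a small remainder; hand it to the fastest node.
    new[speeds.index(max(speeds))] += total - sum(new)
    return new


# Four nodes with equal strips; node 0 finished fastest, node 3 slowest.
new_heights = rebalance([256, 256, 256, 256], [0.010, 0.020, 0.020, 0.040])
print(new_heights)
```

The slowest node ends up with the smallest strip, which shortens the wait imposed by the slowest CPU on each frame.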

19
Software Architecture

Master-Slave arrangement
Multi-threaded
Two stage parallel rendering pipeline.

20
Results: Rendering Rate

Figure 1: Scalability of our implementation.
"Actual" depicts the performance taking into
account triangle overlap among nodes; "effective"
depicts what the system is capable of
delivering. The left image uses a real-world
dataset (LIDAR data). The right image uses a
generated dataset to fully exploit the overlap issue.

21
Results: Load Balancing

Figure 2: The effects of load balancing on 4
nodes (left) and 16 nodes (right). The graph
depicts the individual frame times for the first
100 frames.
22
