Title: Parallel Graphics Rendering
1 Parallel Graphics Rendering
- Matthew Campbell
- Senior, Computer Science
- mcampbel_at_vt.edu
2 Overview
- Motivation
- Three categories of parallel rendering
- Our approach
- Results
- Questions
3 Motivation
- PC graphics cards are getting faster at an exponential rate.
- PC graphics boards are much cheaper than proprietary SGI hardware.
- Geforce4 FX: $150.00 (130 Mtris/sec)
- SGI Onyx 300: $145,000 (80 Mtris/sec)
- Maintenance costs are lower.
- Replacement parts are easy to get.
- PCs are not as complicated as proprietary hardware.
4 Parallel Rendering
- String together numerous PCs with good graphics boards and render the models in parallel.
- Increased performance
- Better technology tracking
- Three groups of algorithms
- Sort-First
- Sort-Middle
- Sort-Last
5 Rendering Pipeline
- Transformation stage
- Per-Vertex operations
- Primitive Assembly
- 3D World Space!
- Rasterization stage
- Per-fragment operations
- Texture mapping
- 2D Image Space!
6 Parallel Rendering - Sort Last
- Sort Last
- Distribute polygons
- Round-robin distribution, resulting in an equal load on each processor.
- Pass through the entire rendering pipeline.
- Transformation / Rasterization (see last slide)
- Each CPU now has a full-screen image of the scene
- But each individual image is incomplete
- Hidden polygons may be visible
- Solution: image composition
7 Sort Last - Image Composition
- The scene at each CPU has a frame buffer with color values for each pixel and a depth buffer with Z values for each pixel.
- Composition: given 2 scenes, compute the color of the pixel at each screen coordinate.
- Compare the depth buffer values at each pixel location. The resultant color value is the color of the pixel with the lower Z value.
- Alpha blending is more complex.
- Why?
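The depth comparison described on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: images are flattened to plain Python lists, and the names `composite` and `INF` (the depth of an untouched pixel) are assumptions made for the example.

```python
INF = float("inf")  # assumed depth value for a pixel no polygon touched

def composite(color_a, depth_a, color_b, depth_b):
    """Merge two renderings pixel by pixel; the lower Z value wins."""
    color_out, depth_out = [], []
    for ca, za, cb, zb in zip(color_a, depth_a, color_b, depth_b):
        if za <= zb:
            color_out.append(ca)
            depth_out.append(za)
        else:
            color_out.append(cb)
            depth_out.append(zb)
    return color_out, depth_out

# Two 4-pixel partial images from two hypothetical nodes.
color_a, depth_a = ["red", "red", None, None], [0.2, 0.8, INF, INF]
color_b, depth_b = [None, "blue", "blue", None], [INF, 0.5, 0.5, INF]
color, depth = composite(color_a, depth_a, color_b, depth_b)
```

On the "Why?": a depth test keeps only the nearest fragment, whereas blending a translucent fragment needs the colors behind it, combined in depth order, and those fragments may have been rendered on different nodes.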
8 Sort Last - Image Composition
- Time complexity of the previous composition algorithm is O(n), which is pretty bad.
- Can we improve it?
- Alternate algorithms
- Tree composition.
- Rotating rings.
- Binary composition.
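As a rough illustration of the first alternative, tree composition can be read as merging images in pairs, round by round, so that n images need about log2(n) rounds instead of n - 1 sequential merges. The slides do not spell out the algorithm, so the sketch below is one common interpretation; `composite` and `tree_composite` are hypothetical names, and each image is a list of (color, depth) pixels.

```python
def composite(a, b):
    """Depth-composite two images given as lists of (color, depth) pixels."""
    return [pa if pa[1] <= pb[1] else pb for pa, pb in zip(a, b)]

def tree_composite(images):
    """Merge images pairwise, round by round; returns (final image, rounds)."""
    rounds = 0
    while len(images) > 1:
        merged = [composite(images[i], images[i + 1])
                  for i in range(0, len(images) - 1, 2)]
        if len(images) % 2:          # odd image out advances unchanged
            merged.append(images[-1])
        images = merged
        rounds += 1
    return images[0], rounds

# Eight hypothetical nodes, two pixels each: node k is nearest at
# pixel 0 when k is small and nearest at pixel 1 when k is large.
images = [[(k, k + 1), (k, 8 - k)] for k in range(8)]
final, rounds = tree_composite(images)
```

The merges inside one round are independent of each other, so on a cluster each pair could be composited on a different node at the same time.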
9 Sort-Last Performance
- Sort-Last has a very high communication bandwidth requirement.
- Each processor needs to send and receive an entire frame.
- 1280x1024 resolution, 24 bits for color, 16 bits for depth, 30 fps
- (3.9 MB + 2.6 MB) x 30 = 196 MB/sec bidirectional!
- Need a very fast network interconnecting the CPUs in the cluster.
- In actuality, we need more bandwidth, because we haven't taken into account the time it takes to render the scene!
- But: no overhead for rendering the actual scene!
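The bandwidth figure on this slide can be reproduced with a few lines of arithmetic. The function names below are made up for the example; the inputs are the slide's own numbers.

```python
def frame_bytes(width, height, color_bits, depth_bits):
    """Bytes of color data and of depth data in one full frame."""
    pixels = width * height
    return pixels * color_bits // 8, pixels * depth_bits // 8

def sort_last_bandwidth(width, height, color_bits, depth_bits, fps):
    """Bytes per second each node moves in one direction."""
    color, depth = frame_bytes(width, height, color_bits, depth_bits)
    return (color + depth) * fps

color, depth = frame_bytes(1280, 1024, 24, 16)
bandwidth = sort_last_bandwidth(1280, 1024, 24, 16, 30)
print(color / 1e6, depth / 1e6, bandwidth / 1e6)  # values in MB and MB/sec
```

With 1 MB taken as 10^6 bytes, this gives roughly 3.9 MB of color and 2.6 MB of depth per frame, and about 196.6 MB/sec at 30 fps, matching the slide.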
10 Parallel Rendering - Sort Middle
- Sort Middle
- Distribute polygons in a round-robin fashion
- Trap polygons between the geometry and rasterization phases
- Each CPU in the cluster is responsible for a specific region in screen coordinates
- Calculate the bounding boxes (screen space) for the trapped polygons and redistribute them to the appropriate CPU responsible for the region.
- Collate images
11 Parallel Rendering - Sort Middle
- How do you divide the screen into regions?
- Strips (either horizontal or vertical)
- Squares
- What is the mapping ratio between CPUs and regions?
- One-to-One: each CPU manages 1 region
- One-to-Many: each CPU manages many regions
- What about polygons that cross region boundaries?
- Multiple CPUs render the same polygon.
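A small sketch of the redistribution step may help here. It assumes vertical strips of equal width with a one-to-one mapping (strip i belongs to CPU i), and polygons given as lists of already-projected (x, y) vertices; all function names are hypothetical. A polygon whose bounding box spans a strip boundary is sent to every strip it touches.

```python
def screen_bbox(vertices):
    """Axis-aligned bounding box of projected (x, y) vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)

def strips_for(polygon, screen_width, num_cpus):
    """Indices of the vertical strips that the polygon's box touches."""
    strip_width = screen_width / num_cpus
    x_min, _, x_max, _ = screen_bbox(polygon)
    # Clamp so geometry on the screen edge still maps to a valid strip.
    first = max(0, min(num_cpus - 1, int(x_min // strip_width)))
    last = max(0, min(num_cpus - 1, int(x_max // strip_width)))
    return list(range(first, last + 1))

def redistribute(polygons, screen_width, num_cpus):
    """Bucket polygons per CPU; boundary-crossing polygons are duplicated."""
    buckets = [[] for _ in range(num_cpus)]
    for poly in polygons:
        for cpu in strips_for(poly, screen_width, num_cpus):
            buckets[cpu].append(poly)
    return buckets

tri_a = [(10, 10), (100, 50), (50, 200)]      # entirely inside strip 0
tri_b = [(200, 10), (300, 50), (250, 200)]    # straddles strips 0 and 1
tri_c = [(900, 0), (1000, 10), (950, 100)]    # entirely inside strip 3
buckets = redistribute([tri_a, tri_b, tri_c], screen_width=1024, num_cpus=4)
```

Here `tri_b` ends up in two buckets, which is the duplicated rendering work the last bullet refers to.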
12 Sort-Middle Performance
- Load balancing can be poor. The slowest CPU will block the system from rendering the next scene.
- Load balancing is highly scene and view dependent.
- Need adaptive load-balancing schemes.
- In high polygon count scenes, the size of each polygon can be very small (1-2 pixels).
- In this case, sort-middle requires more bandwidth than sort-last.
- Communication bandwidth required is dependent on the scene complexity. (Bad)
13 Parallel Rendering - Sort First
- Sort First
- Distribute polygons round-robin to all CPUs.
- Calculate bounding volumes for each polygon
- Remember, we are still in the world coordinate system.
- Each CPU is responsible for 1 volume.
- Redistribute polygons based on bounding volumes.
- Pass through the complete rendering pipeline
- In the end we have sub-images at each processor.
- Designate a coordinator node, which receives sub-images from all other processors.
- Coordinator collates sub-images into the final image.
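The final collation step can be sketched as pasting tiles into one frame buffer. The sketch assumes that each node's volume projects to its own non-overlapping screen rectangle, which the slides imply but do not state; `collate` and the tile layout are hypothetical.

```python
def collate(width, height, tiles, background=0):
    """Paste sub-images into a full frame.

    tiles: list of ((x0, y0), rows), where rows is a list of pixel rows
    and (x0, y0) is the tile's top-left corner in the final image.
    """
    final = [[background] * width for _ in range(height)]
    for (x0, y0), rows in tiles:
        for dy, row in enumerate(rows):
            final[y0 + dy][x0:x0 + len(row)] = row
    return final

# A 4x2 frame assembled from two 2x2 sub-images (pixel values are node ids).
left = ((0, 0), [[1, 1], [1, 1]])
right = ((2, 0), [[2, 2], [2, 2]])
frame = collate(4, 2, [left, right])
```

Because the tiles do not overlap, no depth comparison is needed at this stage, unlike sort-last composition.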
14 Sort First - Performance
- Communication bandwidth required is based only on screen space resolution.
- Example
- 4 CPUs, 1024x1024 scene, 32 bits/color
- The coordinator node receives 1024 x 1024 x 24 bits/frame = 3 MB.
- Bandwidth: 90 MB/sec for 30 fps.
- Problem: similar to sort-middle, load balancing is scene dependent.
- Bigger issue: can't use a one-to-many CPU-to-region mapping.
- Or can you?
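One reading that reconciles the 32 bits/color with the 24 in the per-frame figure is that the coordinator renders its own quarter of the screen and receives only the other three quarters: 32 x 3/4 = 24. That interpretation is an assumption, as are the equal-sized regions and the function name below.

```python
def coordinator_traffic(width, height, bits_per_pixel, num_cpus, fps):
    """Bytes per frame and per second arriving at the coordinator.

    Assumes equal-sized regions and that the coordinator renders one
    region itself, so it receives (num_cpus - 1) / num_cpus of the frame.
    """
    frame_bytes = width * height * bits_per_pixel // 8
    received = frame_bytes * (num_cpus - 1) // num_cpus
    return received, received * fps

per_frame, per_second = coordinator_traffic(1024, 1024, 32, 4, 30)
MB = 2 ** 20  # the slide's figures work out exactly with 1 MB = 2^20 bytes
print(per_frame / MB, per_second / MB)
```

Under these assumptions the coordinator receives 3 MB per frame and 90 MB/sec at 30 fps, independent of how many polygons the scene contains.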
15 Parallel Rendering Issues
- Cannot break the rendering pipeline
- Pipeline is implemented in hardware
- Therefore, very expensive. Could lead to excessive stalls, cache misses, etc.
- Modern graphics cards have large amounts of memory on the board and much faster access times.
- 8 GB/sec vs. 1 GB/sec for AGP 4x
- Graphics driver source code is unavailable
- Additional cost/overhead due to framebuffer
accesses.
16 Our Approach
- High-performance real-time rendering.
- High scene complexity and/or multiple displays, as in a virtual environment (VE).
- Target: 200-300 million triangles/sec. In comparison, the best SGI platform, Reality Monster, is capable of 80 million polygons/sec.
- Approach
- Distributed Sort-First.
- Two-level sorting.
- Organize your model in a spatial tree data structure.
- At run-time, compare bounding volumes for interior nodes of the tree. The bounding volume of an interior node encloses the bounding volumes of its children. This minimizes comparisons.
- Fine pruning based on the viewing frustum.
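The coarse level of this sorting can be sketched as a traversal that rejects a whole subtree as soon as its bounding volume misses a node's region. The slides do not say which kind of bounding volume or tree is used, so the axis-aligned boxes, the `Node` class, and the box-shaped region below are all assumptions; the fine frustum pruning is not shown.

```python
class Node:
    """Spatial tree node; box is ((xmin, ymin, zmin), (xmax, ymax, zmax))."""
    def __init__(self, box, children=(), polygons=()):
        self.box = box
        self.children = list(children)
        self.polygons = list(polygons)

def overlaps(a, b):
    """True if two axis-aligned boxes intersect on every axis."""
    return all(a[0][i] <= b[1][i] and b[0][i] <= a[1][i] for i in range(3))

def collect(node, region, stats):
    """Gather polygons from subtrees whose bounding box overlaps the region."""
    stats["tests"] += 1
    if not overlaps(node.box, region):
        return []  # the whole subtree is rejected with one comparison
    found = list(node.polygons)
    for child in node.children:
        found.extend(collect(child, region, stats))
    return found

def leaf(x0, x1, name):
    return Node(((x0, 0, 0), (x1, 10, 10)), polygons=[name])

left = Node(((0, 0, 0), (4, 10, 10)), [leaf(0, 2, "a"), leaf(2, 4, "b")])
right = Node(((6, 0, 0), (10, 10, 10)), [leaf(6, 8, "c"), leaf(8, 10, "d")])
root = Node(((0, 0, 0), (10, 10, 10)), [left, right])

stats = {"tests": 0}
region = ((7, 0, 0), (10, 10, 10))  # hypothetical volume owned by one node
visible = collect(root, region, stats)
```

In this toy tree the left subtree is discarded after a single test, so only 5 of the 7 boxes are examined.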
17 Hardware
- 32-processor Intel Xeon cluster (1.5 GHz processors)
- 256 MB RDRAM/node (3.2 GB/sec memory bandwidth)
- Myrinet (4 Gbps) and Fast Ethernet (200 Mbps full-duplex) communication fabrics.
- 64-bit/66 MHz PCI bus (4 Gbps throughput)
- 4x AGP (1 GB/sec throughput)
18 Software
- Extensible Parallel 3D Rendering Engine
- Supports large geometric databases, including standard formats such as 3D Studio
- Provides an extensible API.
- Underlying system is based on OpenGL.
- Based on dynamic shared object model.
- Dynamic Load Balancing
- Adaptively resizes volumes assigned to a processor for single display systems.
- Adaptively changes the number of processors and rendering volumes for multi-display systems.
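The slides do not describe how the volumes are resized, so the following is only a hypothetical heuristic to make the idea concrete: treat each node's last frame time as the cost of its strip, assume that cost is spread evenly across the strip, and move the boundaries so every node gets an equal share of the total. The function name and the one-dimensional strips are assumptions.

```python
def rebalance(boundaries, frame_times):
    """Move strip boundaries so each node gets an equal share of estimated cost.

    boundaries: n + 1 increasing positions delimiting n strips.
    frame_times: last measured render time per strip (assumed > 0).
    """
    n = len(frame_times)
    target = sum(frame_times) / n
    new_bounds = [boundaries[0]]
    strip, used = 0, 0.0  # current old strip and cost already taken from it
    for _ in range(n - 1):
        need = target
        # Swallow whole old strips while their remaining cost is too small.
        while frame_times[strip] - used < need:
            need -= frame_times[strip] - used
            strip, used = strip + 1, 0.0
        used += need
        width = boundaries[strip + 1] - boundaries[strip]
        new_bounds.append(boundaries[strip] + width * used / frame_times[strip])
    new_bounds.append(boundaries[-1])
    return new_bounds

# Node 0 took 30 ms and node 1 took 10 ms, so node 0's strip shrinks.
bounds = rebalance([0.0, 0.5, 1.0], [30.0, 10.0])
```

In this example the shared boundary moves from 0.5 to about 0.33, giving both nodes an estimated 20 ms of work. A real scheme would also have to damp oscillation between frames, which this sketch ignores.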
19 Software Architecture
- Master-Slave arrangement
- Multi-threaded
- Two stage parallel rendering pipeline.
20 Results - Rendering Rate
Figure 1: Scalability of our implementation. "Actual" depicts the performance taking into account triangle overlap among nodes; "effective" depicts what the system is capable of delivering. The left image uses a real-world dataset (LIDAR data). The right image uses a generated dataset to fully expose the overlap issue.
21 Results - Load Balancing
Figure 2: The effects of load balancing on 4 nodes (left) and 16 nodes (right). The graph depicts the individual frame times for the first 100 frames.