Title: Parallel Graphics Rendering


1
Parallel Graphics Rendering
  • Matthew Campbell
  • Senior, Computer Science
  • mcampbel_at_vt.edu

2
Overview
  • Motivation
  • Three categories of parallel rendering
  • Our approach
  • Results
  • Questions

3
Motivation
  • PC graphics cards are getting faster at an
    exponential rate.
  • PC graphics boards are much cheaper than
    proprietary SGI hardware.
  • GeForce4 FX: $150.00 (130 Mtris/sec)
  • SGI Onyx 300: $145,000 (80 Mtris/sec)
  • Maintenance costs are lower
  • Replacement parts are easy to get.
  • PCs are not as complicated as proprietary
    hardware.

4
Parallel Rendering
  • String together numerous PCs with good graphics
    boards and render the models in parallel.
  • Increased performance
  • Better technology tracking
  • Three groups of algorithms
  • Sort-First
  • Sort-Middle
  • Sort-Last

5
Rendering Pipeline
  • Transformation stage
  • Per-Vertex operations
  • Primitive Assembly
  • 3D World Space!
  • Rasterization stage
  • Per-fragment operations
  • Texture mapping
  • 2D Image Space!

6
Parallel Rendering: Sort Last
  • Sort Last
  • Distribute polygons
  • Round robin distribution resulting in an equal
    load on each processor.
  • Pass through entire rendering pipeline.
  • Transformation / Rasterization (see last slide)
  • Each CPU now holds a full-screen image of the
    scene
  • But each image is incomplete: it contains only
    that CPU's share of the polygons
  • Polygons that should be hidden may be visible
  • Solution: image composition
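
The round-robin distribution step can be sketched in a few lines. This is only an illustration: the function name `round_robin` and the use of integers as stand-ins for polygons are assumptions, not part of the presentation.

```python
def round_robin(polygons, num_cpus):
    """Deal polygons out like cards: polygon i goes to CPU i % num_cpus."""
    return [polygons[i::num_cpus] for i in range(num_cpus)]


# Ten placeholder "polygons" (plain integers, purely illustrative) on 4 CPUs.
polygons = list(range(10))
shares = round_robin(polygons, 4)
print(shares)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

The share sizes differ by at most one polygon, which is the "equal load" the slide refers to, at least when polygons cost roughly the same to render.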

7
Sort Last: Image Composition
  • The scene at each CPU has a frame buffer with
    color values for each pixel and a depth buffer
    with Z values for each pixel.
  • Composition: given 2 scenes, compute the color
    of the pixel at each screen coordinate.
  • Compare the depth buffer values at each pixel
    location. The resulting color is that of the
    pixel with the lower z value.
  • Alpha blending is more complex.
  • Why?
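
The depth comparison described on this slide can be sketched as follows. It is a minimal illustration, assuming each image is a flat list of pixels, that a lower z means closer to the viewer, and that empty pixels carry an infinite depth. The names `composite` and `FAR` are made up for the example.

```python
FAR = float("inf")  # depth of a pixel that no polygon covered (assumption)


def composite(color_a, depth_a, color_b, depth_b):
    """Merge two partial renderings pixel by pixel, keeping the nearer fragment."""
    out_color, out_depth = [], []
    for ca, za, cb, zb in zip(color_a, depth_a, color_b, depth_b):
        if za <= zb:
            out_color.append(ca)
            out_depth.append(za)
        else:
            out_color.append(cb)
            out_depth.append(zb)
    return out_color, out_depth


# Two 4-pixel images from two CPUs.
color_a = ["red", "red", None, None]
depth_a = [0.2, 0.7, FAR, FAR]
color_b = [None, "blue", "blue", None]
depth_b = [FAR, 0.4, 0.9, FAR]

merged_color, merged_depth = composite(color_a, depth_a, color_b, depth_b)
print(merged_color)  # ['red', 'blue', 'blue', None]
```

Taking the per-pixel minimum depth gives the same answer whatever order the images are merged in. Blending translucent fragments does not have that property, because the result depends on the order in which fragments are combined, which is one reason alpha blending is harder to composite.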

8
Sort Last: Image Composition
  • Time complexity of the previous composition
    algorithm is O(n) in the number of images to
    merge, which is pretty bad.
  • Can we improve it?
  • Alternate algorithms
  • Tree composition.
  • Rotating rings.
  • Binary composition.
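
The slide names tree composition without giving details, so the following is only a generic sketch of the idea: merge images pairwise in rounds, so that n images need about log2(n) rounds instead of n - 1 sequential merges. The representation of a pixel as a `(color, z)` tuple and the function names are assumptions. The rounds are simulated serially here, whereas on a cluster the merges within one round would run on different nodes at the same time.

```python
def composite(image_a, image_b):
    """Depth-composite two images given as lists of (color, z) pixels."""
    return [pa if pa[1] <= pb[1] else pb for pa, pb in zip(image_a, image_b)]


def tree_composite(images):
    """Merge images pairwise in rounds; return (final_image, number_of_rounds)."""
    rounds = 0
    while len(images) > 1:
        merged = [composite(images[i], images[i + 1])
                  for i in range(0, len(images) - 1, 2)]
        if len(images) % 2:          # odd image out advances unmerged
            merged.append(images[-1])
        images = merged
        rounds += 1
    return images[0], rounds


# Eight single-pixel images, one per node, with arbitrary depths.
depths = [5, 3, 8, 1, 7, 2, 6, 4]
images = [[("node%d" % i, z)] for i, z in enumerate(depths)]
final, rounds = tree_composite(images)
print(final, rounds)  # [('node3', 1)] 3
```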

9
Sort-Last Performance
  • Sort-Last has very high communication bandwidth
    requirement.
  • Each processor needs to send and receive an
    entire frame
  • 1280x1024 resolution, 24 bits for color, 16 bits
    for depth, 30 fps
  • (3.9 MB + 2.6 MB) × 30 = 196 MB/sec bidirectional!
  • Need a very fast network interconnecting the CPUs
    in the cluster.
  • In actuality, we need more bandwidth, because we
    haven't taken into account the time it takes to
    render the scene!
  • But: no overhead for rendering the actual scene!
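
The bandwidth figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal megabytes (1 MB = 10^6 bytes), which is what reproduces the slide's 3.9 MB and 2.6 MB. The helper name `frame_bytes` is made up for the example.

```python
def frame_bytes(width, height, bits_per_pixel):
    """Size in bytes of one full-screen buffer."""
    return width * height * bits_per_pixel // 8


color = frame_bytes(1280, 1024, 24)   # color buffer: 3,932,160 bytes
depth = frame_bytes(1280, 1024, 16)   # depth buffer: 2,621,440 bytes
fps = 30

bytes_per_second = (color + depth) * fps
print(color / 1e6, depth / 1e6, bytes_per_second / 1e6)
```

The total comes to 196,608,000 bytes per second in each direction, matching the slide's 196 MB/sec.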

10
Parallel Rendering: Sort Middle
  • Sort Middle
  • Distribute polygons in a round robin fashion
  • Trap polygons between geometry and rasterization
    phases
  • Each CPU in the cluster is responsible for a
    specific region in screen coordinates
  • Calculate the bounding boxes (screen space) for
    the trapped polygons and redistribute them to the
    appropriate CPU responsible for the region.
  • Collate Images

11
Parallel Rendering: Sort Middle
  • How do you divide the screen into regions?
  • Strips (either horizontal or vertical)
  • Squares
  • What is the mapping ratio between CPUs and
    regions?
  • One-to-One: each CPU manages 1 region
  • One-to-Many: each CPU manages many regions
  • What about polygons that cross region boundaries?
  • Multiple CPUs render the same polygon.
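
Mapping a polygon's screen-space bounding box to regions can be sketched as below. This is an illustration under stated assumptions: the screen is cut into a uniform grid of `cols` by `rows` tiles (strips are the special case of one column or one row), coordinates are integer pixels, and the function names are hypothetical. A polygon whose box straddles a boundary is returned for every region it touches, which is the "multiple CPUs render the same polygon" case.

```python
def bbox_of(triangle):
    """Screen-space bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) vertices."""
    xs = [x for x, _ in triangle]
    ys = [y for _, y in triangle]
    return min(xs), min(ys), max(xs), max(ys)


def regions_for_bbox(bbox, screen_w, screen_h, cols, rows):
    """Return the (col, row) id of every tile the bounding box overlaps."""
    tile_w = screen_w // cols
    tile_h = screen_h // rows
    xmin, ymin, xmax, ymax = bbox
    c0, c1 = max(0, xmin // tile_w), min(cols - 1, xmax // tile_w)
    r0, r1 = max(0, ymin // tile_h), min(rows - 1, ymax // tile_h)
    return [(c, r) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


# 1024x1024 screen split into 2x2 square regions of 512x512 pixels.
inside = [(100, 100), (200, 150), (150, 300)]      # fits in one region
straddling = [(500, 100), (530, 120), (510, 140)]  # crosses x = 512

print(regions_for_bbox(bbox_of(inside), 1024, 1024, 2, 2))      # [(0, 0)]
print(regions_for_bbox(bbox_of(straddling), 1024, 1024, 2, 2))  # [(0, 0), (1, 0)]
```

A bounding box is conservative: it can touch a region that the polygon itself does not cover, so some polygons are sent to a CPU that ends up drawing nothing for them.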

12
Sort-Middle Performance
  • Load-balancing can be poor. The slowest CPU will
    block the system from rendering the next scene.
  • Load balancing is highly scene and view
    dependent.
  • Need adaptive load-balancing schemes.
  • In high polygon count scenes, the size of each
    polygon can be very small (1-2 pixels).
  • In this case, sort middle requires more bandwidth
    than sort-last.
  • Communication bandwidth required is dependent on
    the scene complexity. (Bad)

13
Parallel Rendering: Sort First
  • Sort First
  • Distribute polygons round-robin to all CPUs.
  • Calculate bounding volumes for each polygon
  • Remember, we are still in the world coordinate
    system.
  • Each CPU is responsible for 1 volume.
  • Redistribute polygons based on bounding volumes.
  • Pass through complete rendering pipeline
  • In the end we have sub-images at each processor.
  • Designate a coordinator node, which receives
    sub-images from all other processors.
  • Coordinator collates sub-images into the final
    image.
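
The final collation step at the coordinator can be sketched as follows. The sketch assumes the sub-images arrive as equally sized rectangular tiles keyed by `(col, row)`, each stored as a list of pixel rows. The presentation does not describe the layout, so both the layout and the function name `collate` are assumptions.

```python
def collate(sub_images, cols, rows):
    """Stitch tiles keyed by (col, row) into one image, returned as a list of pixel rows."""
    image = []
    for r in range(rows):
        tiles = [sub_images[(c, r)] for c in range(cols)]
        # Walk the tiles of this tile-row scanline by scanline, left to right.
        for scanlines in zip(*tiles):
            image.append([pixel for line in scanlines for pixel in line])
    return image


# Four 2x2-pixel tiles; each pixel is labelled with the node that rendered it.
sub_images = {
    (0, 0): [["A", "A"], ["A", "A"]],
    (1, 0): [["B", "B"], ["B", "B"]],
    (0, 1): [["C", "C"], ["C", "C"]],
    (1, 1): [["D", "D"], ["D", "D"]],
}
final_image = collate(sub_images, cols=2, rows=2)
for row in final_image:
    print("".join(row))
```

Unlike sort-last, no depth comparison is needed here: each pixel of the final image comes from exactly one node.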

14
Sort First - Performance
  • Communication bandwidth required is based only on
    screen space resolution.
  • Example
  • 4 CPUs, 1024×1024 scene, 32 bits/color
  • The coordinator node receives 1024 × 1024 × 24
    bits/frame = 3 MB.
  • Bandwidth: 90 MB/sec for 30 fps.
  • Problem: similar to sort-middle, load balancing
    is scene dependent.
  • Bigger issue: can't use a one-to-many
    CPU-to-region mapping.
  • Or can you?
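
The example's numbers can be reproduced with the sketch below, which makes two assumptions explicit. First, the slide's 3 MB per frame corresponds to 24 bits per pixel with 1 MB taken as 2^20 bytes; at 32 bits per pixel the same calculation gives 4 MB and 120 MB/sec. Second, like the slide, it counts the whole frame as arriving at the coordinator. The function name is hypothetical.

```python
MIB = 2 ** 20  # the slide's "MB" in this example, taken as 2^20 bytes (assumption)


def coordinator_bandwidth(width, height, bits_per_pixel, fps):
    """Return (bytes per frame, bytes per second) arriving at the coordinator."""
    per_frame = width * height * bits_per_pixel // 8
    return per_frame, per_frame * fps


frame_24, rate_24 = coordinator_bandwidth(1024, 1024, 24, 30)
frame_32, rate_32 = coordinator_bandwidth(1024, 1024, 32, 30)
print(frame_24 / MIB, rate_24 / MIB)  # 3.0 90.0
print(frame_32 / MIB, rate_32 / MIB)  # 4.0 120.0
```

The polygon count never appears in the calculation, which is the point of the first bullet: the traffic depends only on resolution, pixel size and frame rate.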

15
Parallel Rendering Issues
  • Cannot break the rendering pipeline
  • Pipeline is implemented in hardware
  • Therefore, breaking into it is very expensive and
    could lead to excessive stalls, cache misses,
    etc.
  • Modern graphics cards have large amounts of
    memory on the board and much faster access times.
  • 8 GB/sec on board vs. 1 GB/sec for AGP 4x
  • Graphics driver source code is unavailable
  • Additional cost/overhead due to framebuffer
    accesses.

16
Our Approach
  • High Performance real-time rendering.
  • High scene complexity and/or multiple displays as
    in a VE.
  • Target: 200-300 million triangles/sec. In
    comparison, the best SGI platform, Reality
    Monster, is capable of 80 million polygons/sec.
  • Approach
  • Distributed Sort-First.
  • Two level sorting.
  • Organize your model in a spatial tree data
    structure.
  • At run-time, compare bounding volumes for interior
    nodes of the tree. The bounding volume of an
    interior node encloses those of its children, so
    rejecting it rejects the whole subtree. This
    minimizes comparisons.
  • Fine pruning based on viewing frustum.
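
The coarse pass over the spatial tree can be sketched as below. This is not the engine's code: it assumes axis-aligned boxes as bounding volumes, a box-shaped region per processor, and hypothetical names (`Node`, `overlaps`, `collect`). It only illustrates why testing interior nodes first saves comparisons, and it leaves out the fine pruning against the viewing frustum.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple                                  # min corner (x, y, z)
    hi: tuple                                  # max corner (x, y, z)
    triangles: list = field(default_factory=list)
    children: list = field(default_factory=list)


def overlaps(lo_a, hi_a, lo_b, hi_b):
    """True if two axis-aligned boxes intersect on every axis."""
    return all(la <= hb and lb <= ha
               for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b))


def collect(node, region_lo, region_hi, out, stats):
    """Gather triangles whose subtree box touches the region, counting box tests."""
    stats["tests"] += 1
    if not overlaps(node.lo, node.hi, region_lo, region_hi):
        return                                 # whole subtree rejected with one test
    out.extend(node.triangles)
    for child in node.children:
        collect(child, region_lo, region_hi, out, stats)


left = Node((0, 0, 0), (4, 10, 10), children=[
    Node((0, 0, 0), (4, 5, 10), triangles=["t0"]),
    Node((0, 5, 0), (4, 10, 10), triangles=["t1"]),
])
right = Node((6, 0, 0), (10, 10, 10), children=[
    Node((6, 0, 0), (10, 5, 10), triangles=["t2"]),
    Node((6, 5, 0), (10, 10, 10), triangles=["t3"]),
])
root = Node((0, 0, 0), (10, 10, 10), children=[left, right])

found, stats = [], {"tests": 0}
collect(root, (7, 0, 0), (10, 10, 10), found, stats)
print(found, stats["tests"])  # ['t2', 't3'] 5
```

The tree has seven nodes, but only five boxes are tested because the left subtree is discarded at its root. On a deep tree over a large model the saving grows accordingly.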

17
Hardware
  • 32-processor Intel Xeon cluster (1.5 GHz
    processors)
  • 256 MB RDRAM/node (3.2 GB/sec memory bandwidth)
  • Myrinet (4 Gbps) and Fast Ethernet (200 Mbps
    full-duplex) communication fabrics.
  • 64 bit/66 MHz PCI bus (4 Gbps throughput)
  • 4x AGP (1GB/sec throughput)

18
Software
  • Extensible Parallel 3D Rendering Engine
  • Supports large geometric databases, including
    standard formats such as 3D Studio
  • Provides an extensible API.
  • Underlying system is based on OpenGL.
  • Based on dynamic shared object model.
  • Dynamic Load Balancing
  • Adaptively resizes volumes assigned to a
    processor for single display systems.
  • Adaptively changes the number of processors and
    rendering volumes for multi-display systems.
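
The presentation does not say how the volumes are resized, so the following is only one plausible heuristic, reduced to one dimension: the screen is split into vertical strips, and each strip is resized in proportion to how many pixel columns per millisecond its node managed in the last frame. The function name `rebalance` and the strip model are assumptions, and the heuristic treats rendering cost as uniform within a strip.

```python
def rebalance(widths, frame_times_ms):
    """Resize strips in proportion to each node's measured speed; total width is preserved."""
    total = sum(widths)
    speeds = [w / t for w, t in zip(widths, frame_times_ms)]  # columns per ms
    scale = total / sum(speeds)
    new_widths = [int(s * scale) for s in speeds]
    new_widths[-1] += total - sum(new_widths)  # rounding remainder goes to the last strip
    return new_widths


# Four nodes share a 1024-pixel-wide display; node 0 took twice as long as the others.
widths = [256, 256, 256, 256]
frame_times_ms = [40, 20, 20, 20]
new_widths = rebalance(widths, frame_times_ms)
print(new_widths)  # [146, 292, 292, 294]
```

The slow node's strip shrinks and the others grow, so the next frame's times should be closer together. A real system would likely damp the adjustment to avoid oscillating from frame to frame.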

19
Software Architecture
  • Master-Slave arrangement
  • Multi-threaded
  • Two stage parallel rendering pipeline.

20
Results: Rendering Rate
Figure 1: Scalability of our implementation.
"Actual" depicts the performance taking into
account triangle overlap among nodes; "effective"
depicts what the system is capable of
delivering. The left image uses a real-world dataset
(LIDAR data). The right image uses a generated
dataset to fully exercise the overlap issue.

21
Results: Load Balancing
Figure 2: The effects of load balancing on 4
nodes (left) and 16 nodes (right). The graph
depicts the individual frame times for the first 100
frames.

22
  • Questions?