1. Sort-First, Distributed Memory Parallel Visualization and Rendering
Wes Bethel, R3vis Corporation and Lawrence Berkeley National Laboratory
- Parallel Visualization and Graphics Workshop
- Sunday, October 18, 2003
- Seattle, Washington
2. The Actual Presentation Title
- Why Distributed Memory Parallel Rendering is a Challenge: Combining OpenRM Scene Graph and Chromium for General Purpose Use on Distributed Memory Clusters
- Outline
- Problem Statement, Desired Outcomes
- Sort-First Parallel Architectural Overview
- The Many Ways I Broke Chromium
- Scene Graph Considerations
- Demo?
- Conclusions
3. Motivation and Problem Statement
- The Allure of COTS solutions:
- Performance of COTS GPUs exceeds that of custom silicon.
- Attractive price/performance of COTS platforms (e.g., x86 PCs).
- Gigabit Ethernet is cheap: about $100/NIC, $500 for an 8-port switch.
- Can build a screamer cluster for about $2K/node.
- We're accustomed to nice, friendly software infrastructure, e.g., hardware-accelerated Xinerama.
- Enter Chromium: the means to use a bunch of PCs to do parallel rendering.
- Parallel submission of graphics streams is a custom solution, and presents challenges.
- Want a flexible, resilient API to interface between parallel visualization/rendering applications and Chromium.
5. Our Approach
- Distributed memory parallel visualization application design amortizes expensive data I/O and visualization across many nodes.
- The scene graph layer mediates interaction between the application and the rendering subsystem: it provides portability, hides the icky parallel rendering details, and provides an infrastructure for accelerating rendering.
- Chromium provides routing of graphics commands to support hardware accelerated rendering on a variety of platforms.
- Focus on COTS solutions: all hardware and software we used is cheap (PC cluster) or free (software).
- Focus on simplicity: our sample applications are straightforward in implementation, easily reproducible by others, and highly portable.
- Want an infrastructure that is suitable for use regardless of the type of parallel programming model used by the application.
- No parallel objects in the Scene Graph!!!!!
6. Our Approach, ctd.
7. The Many Ways I Broke Chromium
- Retained-mode object namespace collisions in parallel submission.
- Broadcasting: how to burn up bandwidth without even trying!
- Scene graph issues: to be discussed in our PVG paper presentation on Monday.
8. The Collision Problem
- Want to use OpenGL retained-mode semantics and structures to realize performance gains in a distributed memory environment.
- Problem:
- Namespace collision of retained-mode identifiers during parallel submission of graphics commands. The problem exists for all OpenGL retained-mode objects: display lists, texture object ids, and programs.
- Example:
Process A:
    GLuint n = glGenLists(1);
    printf("id=%d\n", n);    // id = 0
    // build list, draw with list

Process B:
    GLuint n = glGenLists(1);
    printf("id=%d\n", n);    // id = 0, same identifier as Process A: collision
    // build list, draw with list
9. Manifestation of the Collision Problem
- (Image: four textured quads as rendered when the problem is present.)
10. Desired Result
- (Image: the same four textured quads as rendered when the problem is fixed.)
11. Resolving the Collision Problem
- New CR configuration file options: shared_textures, shared_display_lists, shared_programs (see the configuration sketch below).
- When set to 1, beware of collisions in parallel submission.
- When set to 0, collisions are resolved in parallel submission.
- Setting the shared_* options to zero enforces unique retained-mode identifiers across all parallel submitters.
- Thanks, Brian!
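
A minimal sketch of how these options might appear in a Chromium mothership configuration script (Chromium configs are Python). The host names, tile layout, and the choice of node on which the shared_* options are set are assumptions for illustration; only the option names come from the slide above.

    from mothership import *

    cr = CR()

    # Render server: disable sharing so Chromium keeps retained-mode
    # identifiers from different parallel submitters distinct.
    servernode = CRNetworkNode('render-host')
    servernode.AddTile(0, 0, 1024, 768)
    servernode.AddSPU(SPU('render'))
    servernode.Conf('shared_textures', 0)
    servernode.Conf('shared_display_lists', 0)
    servernode.Conf('shared_programs', 0)
    cr.AddNode(servernode)

    # One of N parallel application nodes, submitting through tilesort.
    appnode = CRApplicationNode('app-host')
    appnode.AddSPU(SPU('tilesort'))
    cr.AddNode(appnode)

    cr.Go()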
12. The Broadcast Problem
- What's the problem?
- Geometry and textures from N application PEs are replicated across M crservers. We bumped into limits of memory and bandwidth.
- To Chromium, a display list is an opaque blob of stuff. Tilesort doesn't peek inside a display list to see where it should be sent.
- Early performance testing showed two types of broadcasting:
- Display lists being broadcast from one tilesort to all servers.
- Textures associated with textured geometry in display lists being broadcast from one tilesort to all servers.
13. Broadcast Workarounds (Short Term)
- Help Chromium decide how to route textures with the GL_OBJECT_BBOX_CR extension (see the sketch after this list).
- Don't use display lists (for now).
- Immediate-mode geometry isn't broadcast. Sorting/routing is accelerated by using GL_OBJECT_BBOX_CR to provide hints to tilesort: it doesn't have to look at all vertices in a geometry blob to make routing decisions.
- For scientific visualization, which generates lots of geometry, this is clearly a problem.
- Our volume rendering application uses 3D textures (N³ data) and textured geometry (N² data), so the heavy payload data isn't broadcast. The cost is immediate-mode transmission of geometry (approximately 36 KB/frame of geometry as compared to 160 MB/frame of texture data).
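
A minimal sketch of the bounding-box hint, in C. glChromiumParametervCR() and GL_OBJECT_BBOX_CR come from Chromium's extension header; the surrounding function, its arguments, and the (xmin, ymin, zmin, xmax, ymax, zmax) ordering are assumptions for illustration.

    #include <GL/gl.h>
    #include "chromium.h"   /* GL_OBJECT_BBOX_CR, glChromiumParametervCR */

    /* Draw a batch of triangles, first telling tilesort the batch's
       bounding box so it can route the batch without scanning
       every vertex. */
    static void draw_batch(const GLfloat *verts, int nverts,
                           const GLfloat bbox[6])
    {
        /* Hint: the geometry that follows lies inside bbox. */
        glChromiumParametervCR(GL_OBJECT_BBOX_CR, GL_FLOAT, 6, bbox);

        glBegin(GL_TRIANGLES);
        for (int i = 0; i < nverts; i++)
            glVertex3fv(verts + 3 * i);
        glEnd();
    }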
14. Broadcast Workarounds (Long Term)
- Funding for Chromium developers to implement display list caching and routing, similar to existing capabilities for managing texture objects.
- Lurking problems:
- Aging of display lists: the crserver is like a roach motel; display lists check in, but they never check out.
- Adding retained-mode object aging and management to applications is an unreasonable burden (IMO).
- There exists no commonly accepted mechanism for LRU aging, etc., in the graphics API. Such an aging mechanism will probably show up as an extension.
- Better as an extension with tunable parameters than requiring applications to reach deeply into graphics API implementations.
15. Scene Graph Issues and Performance Analysis
- Discussed in our 2003 PVG paper (Monday afternoon).
- Our parallel scene graph implementation can be used by any parallel application, regardless of the type of parallel programming model used by the application developer.
- The big issue in sort-first is how much data is duplicated. What we've seen so far: about 1.8x duplication was required for the first frame (in a hardware-accelerated volume rendering application).
- While the scene graph supports any type of parallel operation, certain types of synchronization are required to ensure correct rendering. These can be achieved using only Chromium barriers; no parallel objects in the scene graph are required.
16. Some Performance Graphs
17. Parallel Scene Graph API Stuff
- Collectives:
- rmPipeSetCommSize(), rmPipeGetCommSize()
- rmPipeSetMyRank(), rmPipeGetMyRank()
- Chromium-specific:
- rmPipeBarrierCreateCR()
- Creates a Chromium barrier; the number of participants is set by the value specified using rmPipeSetCommSize().
- rmPipeBarrierExecCR()
- Doesn't block application code execution.
- Used to synchronize rendering execution across rmPipeGetCommSize() streams of graphics commands. A usage sketch follows.
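
A minimal usage sketch in C, assuming an MPI application. Only the rmPipe* calls above come from the slide; the header path, argument lists, and the rmFrame() draw call reflect typical OpenRM usage but should be treated as assumptions here.

    #include <mpi.h>
    #include <rm/rm.h>   /* assumed OpenRM header path */

    /* Per-process render loop: each PE submits its own graphics stream. */
    void render_loop(RMpipe *pipe, RMnode *scene_root)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Describe the parallel configuration to the scene graph layer. */
        rmPipeSetCommSize(pipe, size);
        rmPipeSetMyRank(pipe, rank);

        /* Create a Chromium barrier sized by the comm size set above. */
        rmPipeBarrierCreateCR(pipe);

        for (;;) {
            /* Executed inside the graphics stream: synchronizes the
               rendering streams without blocking application code. */
            rmPipeBarrierExecCR(pipe);
            rmFrame(pipe, scene_root);   /* draw this PE's portion */
        }
    }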
18. Demo Application: Parallel Isosurface
19. Demo Application: Parallel Volume Rendering
20. Demo Application: Parallel Volume Rendering with LOD Volumes
21. Conclusions
- We met our objectives:
- A general-purpose infrastructure for doing parallel visualization and hardware-accelerated rendering on PC clusters.
- The infrastructure can be used by any parallel application, regardless of the parallel programming model.
- The architecture scaled well from one to 24 displays, supporting extremely high-resolution output (e.g., 7860x4096).
- We bumped into network bandwidth limits (not a big surprise).
- Display lists are still broadcast in Chromium. Please fund the Chromium developers to add this much-needed capability; it is fundamental for efficient sort-first operation of clusters.
22. Sources of Software
- OpenRM Scene Graph: www.openrm.org
- Source code for the OpenRM/Chromium applications: www.openrm.org
- Chromium: chromium.sourceforge.net
23. Acknowledgement
- This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under SBIR grant DE-FE03-02ER83443.
- The authors wish to thank Randall Frank of the ASCI/VIEWS program at Lawrence Livermore National Laboratory and the Scientific Computing and Imaging Institute at the University of Utah for use of computing facilities during the course of this research.
- The Argon shock bubble dataset was provided courtesy of John Bell and Vince Beckner at the Center for Computational Sciences and Engineering, Lawrence Berkeley National Laboratory.