Practical Parallel Processing for Today - PowerPoint PPT Presentation

Title: Practical Parallel Processing for Today's Rendering Challenges
Slides: 280

Transcript and Presenter's Notes

1
  • Practical Parallel Processing for Today's
    Rendering Challenges
  • SIGGRAPH 2001 Course 40
  • Los Angeles, CA

2
Speakers
  • Alan Chalmers, University of Bristol
  • Tim Davis, Clemson University
  • Erik Reinhard, University of Utah
  • Toshi Kato, SquareUSA

3
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Summary / Discussion

4
Schedule
  • Introduction (Davis)
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Summary / Discussion

5
The Need for Speed
  • Graphics rendering is time-consuming
  • large amount of data in a single image
  • animations much worse
  • Demand continues to rise for high-quality graphics

6
Rendering and Parallel Processing
  • A holy union
  • Many graphics rendering tasks can be performed in
    parallel
  • Often embarrassingly parallel

7
3-D Graphics Boards
  • Getting better
  • Perform tricks with texture mapping
  • Steve Jobs' remark on constant frame rendering
    time

8
Parallel / Distributed Rendering
  • Fundamental Issues
  • Task Management
  • Task subdivision, Migration, Load balancing
  • Data Management
  • Data distributed across system
  • Communication

9
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
    (Chalmers)
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Summary / Discussion

10
Introduction
  • "Parallel processing is like a dog's walking on
    its hind legs. It is not done well; but you are
    surprised to find it done at all."
  • Steve Fiddes (apologies to Samuel Johnson)
  • Co-operation
  • Dependencies
  • Scalability
  • Control

11
Co-operation
  • Solution of a single problem
  • One person takes a certain time to solve the
    problem
  • Divide problem into a number of sub-problems
  • Each sub-problem solved by a single worker
  • Reduced problem solution time
  • BUT
  • co-operation → overheads

12
Working Together
  • Overheads
  • access to pool
  • collision avoidance

13
Dependencies
  • Divide a problem into a number of distinct stages
  • Parallel solution of one stage before next can
    start
  • May be too severe → no parallel solution
  • each sub-problem dependent on previous stage
  • Dependency-free problems
  • order of task completion unimportant
  • BUT co-operation still required

14
Building with Blocks
Strictly sequential
Dependency-free
15
Scalability
  • Upper bound on the number of workers
  • Additional workers will NOT improve solution
    time
  • Shows how suitable a problem is for parallel
    processing
  • Given problem → finite number of sub-problems
  • more workers than tasks
  • Upper bound may be (a lot) less than number of
    tasks
  • bottlenecks

16
Bottleneck at Doorway
More workers may result in LONGER solution time
17
Control
  • Required by all parallel implementations
  • What constitutes a task
  • When has the problem been solved
  • How to deal with multiple stages
  • Forms of control
  • centralised
  • distributed

18
Control Required
Sequential
Parallel
19
Inherent Difficulties
  • Failure to successfully complete
  • Sequential solution
  • deficiencies in algorithm or data
  • Parallel solution
  • deficiencies in algorithm or data
  • deadlock
  • data consistency

20
Novel Difficulties
  • Factors arising from implementation
  • Deadlock
  • processor waiting indefinitely for an event
  • Data consistency
  • data is distributed amongst processors
  • Communication overheads
  • latency in message transfer

21
Evaluating Parallel Implementations
  • Realisation penalties
  • Algorithmic penalty
  • nature of the algorithm chosen
  • Implementation penalty
  • need to communicate
  • concurrent computation & communication activities
  • idle time

22
Solution Times
23
Task Management
  • Providing tasks to the processors
  • Problem decomposition
  • algorithmic decomposition
  • domain decomposition
  • Definition of a task
  • Computational Model

24
Problem Decomposition
  • Exploit parallelism
  • Inherent in algorithm
  • algorithmic decomposition
  • parallelising compilers
  • Applying same algorithm to different data items
  • domain decomposition
  • need for explicit system software support

25
Abstract Definition of a Task
  • Principal Data Item (PDI) - application of
    algorithm
  • Additional Data Items (ADIs) - needed to complete
    computation

26
Computational Models
  • Determines the manner in which tasks are allocated
    to PEs
  • Maximise PE computation time
  • Minimise idle time
  • load balancing
  • Evenly allocate tasks amongst the processors

27
Data Driven Models
  • All PDIs allocated to specific PEs before
    computation starts
  • Each PE knows a priori which PDIs it is
    responsible for
  • Balanced (geometric decomposition)
  • evenly allocate tasks amongst the processors
  • if PDIs not an exact multiple of PEs then some PEs
    do one extra task
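A minimal sketch of this balanced (geometric) decomposition; the helper name and contiguous-range choice are illustrative, not from the course:

```python
def balanced_allocation(num_pdis, num_pes):
    """Evenly allocate PDIs to PEs before computation starts; if the
    PDI count is not an exact multiple of the PE count, the first
    (num_pdis % num_pes) PEs do one extra task."""
    base, extra = divmod(num_pdis, num_pes)
    counts = [base + 1 if pe < extra else base for pe in range(num_pes)]
    # Contiguous index ranges preserve spatial locality (geometric decomposition).
    ranges, start = [], 0
    for c in counts:
        ranges.append(range(start, start + c))
        start += c
    return ranges

print([len(r) for r in balanced_allocation(24, 3)])   # [8, 8, 8]
print([len(r) for r in balanced_allocation(10, 4)])   # [3, 3, 2, 2]
```

Each PE then knows a priori exactly which PDIs it is responsible for, with no task communication during the run.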

28
Balanced Data Driven
(Diagram: solution time = initial distribution + computation time (24 PDIs across 3 PEs) + result collation)
29
Demand Driven Model
  • Task computation time unknown
  • Work is allocated dynamically as PEs become idle
  • PEs no longer bound to particular PDIs
  • PEs explicitly demand new tasks
  • Task supplier process must satisfy these demands

30
Dynamic Allocation of Tasks
solution time = (total comp time for all PDIs / number of PEs) + 2 × total comms time
31
Task Supplier Process
PROCESS Task_Supplier()
Begin
    remaining_tasks = total_number_of_tasks
    ( initialise all processors with one task )
    FOR p = 1 TO number_of_PEs
        SEND task TO PE[p]
        remaining_tasks = remaining_tasks - 1
    WHILE results_outstanding DO
        RECEIVE result FROM PE[i]
        IF remaining_tasks > 0 THEN
            SEND task TO PE[i]
            remaining_tasks = remaining_tasks - 1
        ENDIF
End ( Task_Supplier )
Simple demand driven task supplier
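The same demand-driven supplier can be sketched as runnable Python, using threads to stand in for PEs and queues for the SEND/RECEIVE channels; the squaring "computation" is a placeholder, not part of the course material:

```python
import queue
import threading

def task_supplier(tasks, num_pes):
    """Demand-driven supplier: initialise every PE with one task, then
    hand a new task to whichever PE returns a result, until done."""
    task_q = [queue.Queue() for _ in range(num_pes)]   # one channel per PE
    result_q = queue.Queue()
    results = []

    def pe_worker(pe_id):
        while True:
            task = task_q[pe_id].get()
            if task is None:                           # shutdown signal
                break
            result_q.put((pe_id, task * task))         # stand-in computation

    workers = [threading.Thread(target=pe_worker, args=(p,))
               for p in range(num_pes)]
    for w in workers:
        w.start()

    it = iter(tasks)
    remaining, outstanding = len(tasks), 0
    for p in range(num_pes):                           # one task per PE to start
        try:
            task_q[p].put(next(it)); remaining -= 1; outstanding += 1
        except StopIteration:
            break
    while outstanding:
        pe_id, result = result_q.get()                 # RECEIVE result FROM PE[i]
        results.append(result)
        outstanding -= 1
        if remaining:                                  # SEND task TO PE[i]
            task_q[pe_id].put(next(it)); remaining -= 1; outstanding += 1
    for q in task_q:
        q.put(None)
    for w in workers:
        w.join()
    return results

print(sorted(task_supplier(list(range(6)), 3)))        # [0, 1, 4, 9, 16, 25]
```

Because PEs demand work only when idle, fast PEs naturally take on more tasks than slow ones.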
32
Load Balancing
  • All PEs should complete at the same time
  • Some PEs busy with complex tasks
  • Other PEs available for easier tasks
  • Computation effort of each task unknown
  • hot spot at end of processing → unbalanced
    solution
  • Any knowledge about hot spots should be used

33
Task Definition Granularity
  • Computational elements
  • Atomic element (ray-object intersection)
  • sequential problem's lowest computational element
  • Task (trace complete path of one ray)
  • parallel problem's smallest computational element
  • Task granularity
  • number of atomic elements in one task

34
Task Packet
  • Unit of task distribution
  • Informs a PE of which task(s) to perform
  • Task packet may include
  • indication of which task(s) to compute
  • data items (the PDI and (possibly) ADIs)
  • Task packet for ray tracer → one or more rays to
    be traced
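A task packet might be modelled as a small record like the following; the field names are illustrative, not from the course:

```python
from dataclasses import dataclass, field

@dataclass
class TaskPacket:
    """Unit of task distribution, informing a PE what to compute."""
    task_ids: list                            # indication of which task(s) to compute
    pdi: object                               # principal data item (e.g. rays to trace)
    adis: list = field(default_factory=list)  # additional data items, if sent eagerly

packet = TaskPacket(task_ids=[42], pdi="primary ray for pixel (10, 20)")
print(packet.task_ids)  # [42]
```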

35
Algorithmic Dependencies
  • Algorithm adopted for parallelisation
  • May specify order of task completion
  • Dependencies MUST be preserved
  • Algorithmic dependencies introduce
  • synchronisation points → distinct problem stages
  • data dependencies → careful data management

36
Distributed Task Management
  • Centralised task supply
  • All requests for new tasks go to the System
    Controller → bottleneck
  • Significant delay in fetching new tasks
  • Distributed task supply
  • task requests handled remotely from System
    Controller
  • spread of communication load across system
  • reduced time to satisfy task request

37
Preferred Bias Allocation
  • Combining Data driven & Demand driven
  • Balanced data driven
  • tasks allocated in a predetermined manner
  • Demand driven
  • tasks allocated dynamically on demand
  • Preferred Bias Regions are purely conceptual
  • enables the exploitation of any coherence

38
Conceptual Regions
  • task allocation no longer arbitrary

39
Data Management
  • Providing data to the processors
  • World model
  • Virtual shared memory
  • Data manager process
  • local data cache
  • requesting & locating data
  • Consistency

40
Remote Data Fetches
  • Advanced data management
  • Minimising communication latencies
  • Prefetching
  • Multi-threading
  • Profiling
  • Multi-stage problems

41
Data Requirements
  • Requirements may be large
  • Fit in the local memory of each processor
  • world model
  • Too large for each local memory
  • distributed data
  • provide virtual world model/virtual shared memory

42
Virtual Shared Memory (VSM)
  • Providing a conceptual single memory space
  • Memory is in fact distributed
  • Request is the same for both local & remote data
  • Speed of access may be (very) different

43
Consistency
  • Read/write can result in inconsistencies
  • Distributed memory
  • multiple copies of the same data item
  • Updating such a data item
  • update all copies of this data item
  • invalidate all other copies of this data item
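The invalidation strategy can be sketched as a toy data manager; this is an illustrative protocol sketch, not the course's implementation:

```python
class DataManager:
    """Toy write-invalidate sketch: each PE may hold a local copy of a
    data item; a write keeps only the writer's copy valid."""
    def __init__(self, num_pes, value):
        self.copies = {pe: value for pe in range(num_pes)}  # all PEs hold a copy

    def write(self, pe, value):
        self.copies = {pe: value}        # invalidate all other copies

    def read(self, pe, fetch):
        if pe not in self.copies:        # local miss -> remote fetch from an owner
            self.copies[pe] = fetch()
        return self.copies[pe]

dm = DataManager(3, "v0")
dm.write(1, "v1")                        # PEs 0 and 2 invalidated
print(dm.read(0, fetch=lambda: dm.copies[1]))  # v1, after a remote fetch
```

Invalidation trades cheap writes for remote fetches on the next read, whereas update-all-copies trades write-time communication for cheap reads.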

44
Minimising Impact of Remote Data
  • Failure to find a data item locally → remote
    fetch
  • Time to find data item can be significant
  • Processor idle during this time
  • Latency difficult to predict
  • e.g. depends on current message densities
  • Data management must minimise this idle time

45
Data Management Techniques
  • Hiding the Latency
  • Overlapping the communication with computation
  • prefetching
  • multi-threading
  • Minimising the Latency
  • Reducing the time of a remote fetch
  • profiling
  • caching

46
Prefetching
  • Exploiting knowledge of data requests
  • A priori knowledge of data requirements
  • nature of the problem
  • choice of computational model
  • DM can prefetch them (up to some specified
    horizon)
  • available locally when required
  • overlapping communication with computation
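A minimal prefetching sketch (with sleeps standing in for communication and computation; the function names are assumptions): while the PE computes on item i, item i+1 is already being fetched in the background.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def remote_fetch(i):
    time.sleep(0.01)                 # stand-in for communication latency
    return i

def compute(item):
    time.sleep(0.01)                 # stand-in for per-item computation
    return item * 2

def render_with_prefetch(n):
    """Prefetch item i+1 while computing on item i, so data is
    available locally when required."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(remote_fetch, 0)
        for i in range(n):
            item = future.result()                     # ready (usually) by now
            if i + 1 < n:
                future = pool.submit(remote_fetch, i + 1)  # prefetch next
            results.append(compute(item))
    return results

print(render_with_prefetch(4))  # [0, 2, 4, 6]
```

With perfect overlap the fetch latency is hidden entirely; in practice the prefetch horizon is limited by local memory.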

47
Multi-Threading
  • Keeping PE busy with useful computation
  • Remote data fetch → current task stalled
  • Start another task (Processor kept busy)
  • separate threads of computation (BSP)
  • Disadvantages: overheads
  • Context switches between threads
  • Increased message densities
  • Reduced local cache for each thread

48
Results for Multi-Threading
  • More threads than optimal reduces performance
  • Cache-22 situation
  • less local cache → more data misses → more
    threads

49
Profiling
  • Reducing the remote fetch time
  • At the end of computation all data requests are
    known
  • if known then can be prefetched
  • Monitor data requests for each task
  • build up a picture of possible requirements
  • Exploit spatial coherence (with preferred bias
    allocation)
  • prefetch those data items likely to be required

50
Spatial Coherence
51
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
    (Davis)
  • Practical Applications
  • Summary / Discussion

52
Classification of Parallel Rendering Systems
  • Parallel rendering performed in many ways
  • Classification by
  • task subdivision
  • polygon rendering
  • ray tracing
  • hardware
  • parallel hardware
  • distributed computing

53
Classification by Task Subdivision
  • Original rendering task broken into smaller
    pieces to be processed in parallel
  • Depends on type of rendering
  • Goals
  • maximize parallelism
  • minimize overhead, including communication

54
Task Subdivision in Polygon Rendering
  • Rendering many primitives
  • Polygon rendering pipeline
  • geometry processing (transformation, clipping,
    lighting)
  • rasterization (scan conversion, visibility,
    shading)

55
Polygon Rendering Pipeline
56
Primitive Processing and Sorting
  • View processing of primitives as sorting problem
  • primitives can fall anywhere on or off the screen
  • Sorting can be done in either software or
    hardware, but mostly done in hardware

57
Primitive Processing and Sorting
  • Sorting can occur at various places in the
    rendering pipeline
  • during geometry processing (sort-first)
  • between geometry processing and rasterization
    (sort-middle)
  • during rasterization (sort-last)

58
Sort-first
59
Sort-first Method
  • Each processor (renderer) assigned a portion of
    the screen
  • Primitives arbitrarily assigned to processors
  • Processors perform enough calculations to send
    primitives to correct renderers
  • Processors then perform geometry processing and
    rasterization for their primitives in parallel

60
Screen Subdivision
61
Sort-first Discussion
  • Communication costs can be kept low
  • - Duplication of effort if primitives fall into
    more than one screen area
  • - Load imbalance if primitives concentrated
  • - Very few, if any, sort-first renderers built

62
Sort-middle
63
Sort-middle Method
  • Primitives arbitrarily assigned to renderers
  • Each renderer performs geometry processing on its
    primitives
  • Primitives then redistributed to rasterizers
    according to screen region
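The redistribution step can be sketched as follows; vertical screen strips are an illustrative choice of region, and the names are not from the course:

```python
def sort_middle(primitives, num_rasterizers, screen_width=640):
    """After geometry processing, send each (already transformed)
    primitive to the rasterizer(s) owning the screen strip(s) it
    touches; primitives spanning strips are duplicated."""
    strip = screen_width / num_rasterizers
    bins = [[] for _ in range(num_rasterizers)]
    for prim in primitives:                  # prim = list of (x, y) screen vertices
        xs = [x for x, _ in prim]
        first = int(min(xs) // strip)
        last = int(max(xs) // strip)
        for r in range(first, min(last, num_rasterizers - 1) + 1):
            bins[r].append(prim)
    return bins

tri = [(100, 10), (300, 40), (200, 90)]      # spans strips 0 and 1 of 4
print([len(b) for b in sort_middle([tri], 4)])  # [1, 1, 0, 0]
```

The concentration of primitives in a few strips is exactly the load imbalance discussed on the next slide.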

64
Sort-middle Discussion
  • Natural breaking point in graphics pipeline
  • - Load imbalance if primitives concentrated in
    particular screen regions
  • Several successful hardware implementations
  • PixelPlanes 5
  • SGI Reality Engine

65
Sort-last
66
Sort-last Method
  • Primitives arbitrarily distributed to renderers
  • Each renderer computes pixel values for its
    primitives
  • Pixel values are then sent to processors
    according to screen location
  • Rasterizers perform visibility and compositing

67
Sort-last Discussion
  • Less prone to load imbalance
  • - Pixel traffic can be high
  • Some working systems
  • Denali

68
Task Subdivision in Ray Tracing
  • Ray tracing often prohibitively expensive on
    single processor
  • Prime candidate for parallelization
  • each pixel can be rendered independently
  • Processing easily subdivided
  • image space subdivision
  • object space subdivision
  • object subdivision

69
Image Space Subdivision
70
Image Space Subdivision Discussion
  • Straightforward
  • High parallelism possible
  • - Entire scene database must reside on each
    processor
  • need adequate storage
  • Low processor communication

71
Image Space Subdivision Discussion
  • - Load imbalance possible
  • screen space may be further subdivided
  • Used in many parallel ray tracers
  • works better with MIMD machines
  • distributed computing environments
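A sketch of the underlying tile assignment (illustrative names; round-robin interleaving is one simple way to further subdivide screen space against load imbalance):

```python
def image_space_tiles(width, height, tile, num_procs):
    """Split the image into square tiles and deal them round-robin to
    processors; each processor holds the whole scene database, so no
    communication is needed beyond collecting finished tiles."""
    tiles = [(x, y, min(tile, width - x), min(tile, height - y))
             for y in range(0, height, tile)
             for x in range(0, width, tile)]
    return {p: tiles[p::num_procs] for p in range(num_procs)}

assign = image_space_tiles(64, 32, 16, 3)
print(sum(len(t) for t in assign.values()))  # 8 tiles in total
```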

72
Object Space Subdivision
  • 3-D object space divided into voxels
  • Each voxel assigned to a processor
  • Rays are passed from processor to processor as
    voxel space is traversed

73
Object Space Subdivision Discussion
  • Each processor needs only scene information
    associated with its voxel(s)
  • - Rays must be tracked through voxel space
  • Load balance good
  • - Communication can be high
  • Some successful systems

74
Object Partitioning
  • Each object in the scene is assigned to a
    processor
  • Rays passed as messages between processors
  • Processors check for intersection

75
Object Partitioning Discussion
  • Load balancing good
  • - Communication high due to ray message traffic
  • - Fewer implementations

76
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Rendering at Clemson / Distributed Computing and
    Spatial/Temporal Coherence (Davis)
  • Interactive Ray Tracing
  • Parallel Rendering and the Quest for Realism The
    Kilauea Massively Parallel Ray Tracer
  • Summary / Discussion

77
Practical Experiences at Clemson
  • Problems with Rendering
  • Current Resources
  • Deciding on a Solution
  • A New Render Farm

78
A Demand for Rendering
  • Computer Animation course
  • 3 SIGGRAPH animation submissions
  • render over semester break

79
Current Resources
  • dedicated lab
  • 8 SGI 02s (R12000, 384 MB)
  • general-purpose lab
  • 4 SGI 02s
  • shared lab
  • dual-pipe Onyx2 (8 R12000, 8 GB)
  • 10 SGI 02s (R12000, 256 MB)
  • offices
  • 5 SGI 02s

80
Resource Problems
  • Rendering prohibits interactive sessions
  • Little organized control over resources
  • users must be self-monitoring
  • m renders on n machines → 1 render on n/m
    machines
  • Disk space
  • Cross-platform distributed rendering to PCs
    problematic
  • security (rsh)
  • distributed rendering software
  • directory paths

81
Short-term Solutions
  • Distributed rendering restricted to late night
  • Resources partitioned

82
Problems with Maya
  • video
  • Traditional distributed computing problems
  • dropped frames
  • incomplete frames
  • tools developed

83
Problems with Maya
  • Tools (DropCheck)

84
Problems with Maya
  • Tools (Load Scan)

85
Problems with Maya
  • Animation inconsistencies
  • next slide
  • Some frames would not render
  • Particle system inconsistencies

86
Problems with Maya
87
Rendering Tips
  • Layering

88
Rendering Tips
  • Layering

89
Deciding on a Solution - RenderDrive
  • RenderDrive by ART (Advanced Rendering
    Technology)
  • network appliance for ray tracing
  • 16-48 specialized processors
  • claims speedups of 15-40 over Pentium III
  • 768MB to 1.5GB memory
  • 4GB hard disk cache

90
Deciding on a Solution - RenderDrive
  • plug-in interface to Maya
  • Renderman ray tracer
  • 15K - 25K

91
Deciding on a Solution - PCs
  • Network of PCs as a render farm
  • 10 PCs each with 1.4GHz, 1GB memory, and 40GB
    hard drive
  • Maya will run under Windows 2000 or Linux (Maya
    4.0)
  • Distributed rendering software not included for
    Windows 2000

92
Deciding on a Solution - PCs Win
  • RenderDrive had some unusual anomalies
  • Interactive capabilities
  • Scan-line or ray tracing
  • Distributed rendering software may be included
  • Problems with security still exist
  • shared file system

93
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Rendering at Clemson / Distributed Computing and
    Spatial/Temporal Coherence (Davis)
  • Interactive Ray Tracing
  • Parallel Rendering and the Quest for Realism The
    Kilauea Massively Parallel Ray Tracer
  • Summary / Discussion

94
Agenda
  • Background
  • Temporal Depth-Buffer
  • Frame Coherence Algorithm
  • Parallel Frame Coherence Algorithm

95
Background - Ray Tracing
  • Closest to physical model of light
  • High cost in terms of time / complexity

96
Background - Frame Coherence
  • Frame coherence
  • those pixels that do not change from one frame to
    the next
  • derived from object and temporal coherence
  • We should not have to re-compute those pixels
    whose values will not change
  • writing pixels to frame files

97
Background - Test Animation
  • Glass Bounce (60 frames at 320x240, 5 objects)

98
Background - Frame Coherence
99
Previous Work
  • Frame coherence
  • moving camera/static world [Hubschman and Zucker
    81]
  • estimated frames [Badt 88]
  • stereoscopic pairs [Adelson and Hodges 93/95]
  • 4D bounding volumes [Glassner 88]
  • voxels and ray tracking [Jevans 92]
  • incremental ray tracing [Murakami 90]

100
Previous Work (cont.)
  • Distributed computing
  • Alias and 3D Studio
  • most major productions starting with Toy Story
    [Henne 96]

101
Goals
  • Render exactly the same set of frames in much
    less time
  • Work in conjunction with other optimization
    techniques
  • Run on a variety of platforms
  • Extend a currently popular ray tracer (POV-Ray)
    to allow for general use

102
Temporal Depth-Buffer
  • Similar to traditional z-buffer
  • For each pixel, store a temporal depth in frame
    units

103
Frame Coherence Algorithm
104
Frame Coherence Algorithm
105
Frame Coherence Algorithm
  • Identify volume within 3D object space where
    movement occurs
  • Divide volume uniformly into voxels
  • For each voxel, create a list of frame numbers in
    which changing objects inhabit this voxel

106
Frame Coherence Algorithm
  • In each frame, track rays through voxels for each
    pixel
  • From the voxels traversed, find the one with the
    lowest frame number
  • Record that number in the temporal depth-buffer

107
Frame Coherence Algorithm
for each frame of the animation
    for each pixel that needs to be computed for this frame
        trace the rays for this pixel
        for each voxel that any of these rays intersect
            get the next frame number to compute
        set the t-buffer entry to the lowest frame number found
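A minimal sketch of the t-buffer update for one frame (data layout and names are illustrative): each voxel carries the list of frame numbers in which changing objects inhabit it, and each pixel records the lowest upcoming frame number over the voxels its rays traverse.

```python
def update_t_buffer(frames_by_voxel, voxels_per_pixel, current_frame):
    """For each pixel, find the lowest frame number after the current
    one in which a changing object inhabits any traversed voxel; that
    is the next frame at which the pixel must be recomputed."""
    t_buffer = {}
    for pixel, voxels in voxels_per_pixel.items():
        candidates = [f for v in voxels
                        for f in frames_by_voxel.get(v, [])
                        if f > current_frame]
        t_buffer[pixel] = min(candidates, default=None)  # None: never recompute
    return t_buffer

frames_by_voxel = {(0, 0, 0): [3, 7], (1, 0, 0): [5]}
voxels_per_pixel = {(10, 20): [(0, 0, 0), (1, 0, 0)], (11, 20): [(2, 0, 0)]}
print(update_t_buffer(frames_by_voxel, voxels_per_pixel, current_frame=1))
# {(10, 20): 3, (11, 20): None}
```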
108
Frame Coherence Algorithm
(Diagram: temporal depth-buffer for one frame — most entries hold 5; regions traversed by moving objects hold 3, 2, or 1, marking the next frame at which those pixels must be recomputed)
109
Voxel Volume
  • Uniform voxel spatial subdivision
  • Voxel can be non-cubical
  • Ways to determine voxel volume
  • user-supplied
  • pre-processing phase
  • active voxel marking
  • in distributed environment, done by master or
    slave or both

110
Frame Coherence Example
111
Test Animation
  • Pool Shark (620 frames at 640x480, 174 objects)

112
Test Animations - Problem
  • Bounding box problem

113
Results
114
Frame Coherence Discussion
  • Localized movement can have global effects
  • Performance depends on both the number and
    complexity of recomputed pixels
  • Issues
  • overhead
  • antialiasing
  • motion blur

115
Temporal Depth-Buffer Discussion
  • Uses less memory than other methods
  • Simple
  • Can be used with other algorithms

116
Parallel Frame Coherence Algorithm
  • Distributed computing environment
  • 1-8 Sun Sparc Ultra 5 processors running at 270
    MHz
  • Coarse-grain parallelism
  • Load balancing
  • divide work among processors
  • keep data together for frame coherence

117
Load Balancing
  • Image space subdivision
  • each processor computes a subregion for the
    entire length of the run
  • Recursively subdivide subsequences to keep
    processors busy

118
Screen Subdivision
119
Load Balancing
  • Coarse bin packing: find the block with the
    smallest number of computed frames
  • Keep statistics on average first frame time and
    average coherent frame time
  • Find a hole in the sequence
  • Leave some free frames before new start

120
Load Balancing Example
121
Results - Parallel Frame Coherence
122
Results
123
Another Test Animation
  • Soda Worship (60 frames at 160x120, 839 objects)

124
Another Test Animation
125
Results
126
Results Discussion
  • Good speedup
  • Multiplicative speedup with both
  • Speedup limitations
  • voxel approximation
  • writing pixels to frame files (communication)

127
Conclusions
  • Frame coherence algorithm combined with
    distributed computing provides good speedup
  • Algorithm scales well
  • Techniques are useful and accessible to a wide
    variety of users
  • Benefits depend on inherent properties of the
    animation

128
Shameless Advertisement
  • Masters of Fine Arts in Computing (MFAC)
  • special effects and animation courses
  • two year program
  • Clemson Computer Animation Festival in Fall 2002

129
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Rendering at Clemson / Distributed Computing and
    Spatial/Temporal Coherence
  • Interactive Ray Tracing (Reinhard)
  • Parallel Rendering and the Quest for Realism The
    Kilauea Massively Parallel Ray Tracer
  • Summary / Discussion

130
Overview
  • Introduction
  • Interactive ray tracer
  • Animation and interactive ray tracing
  • Sample reuse techniques

131
Introduction
132
Interactive Ray Tracing
  • Renders effects not available using other
    rendering algorithms
  • Feasible on high-end supercomputers provided
    suitable hardware is chosen
  • Scales sub-linearly in scene complexity
  • Scales almost linearly in number of processors

133
Hardware Choices
  • Shared memory vs. distributed memory
  • Latency and throughput for pixel communication
  • Choice → Shared memory
  • This section of the course focuses on SGI Origin
    series super computers

134
Shared Memory
  • Shared address space
  • Physically distributed memory
  • ccNUMA architecture

135
SGI Origin 2000 Architecture
136
Implications
  • ccNUMA machines are easy to program
  • But it is more difficult to generate efficient
    code
  • Memory mapping and processor placement may be
    important for certain applications
  • Topic returns later in this course

137
Overview
  • Introduction
  • Interactive ray tracer
  • Animation and interactive ray tracing
  • Sample re-use techniques

138
Interactive Ray Tracing
139
Basic Algorithm
  • Master-slave configuration
  • Master (display thread) displays results and
    farms out ray tasks
  • Slaves produce new rays
  • Task size reduced towards end of each frame
  • Load balancing
  • Cache coherence

140
Tracing a Single Ray
  • Use spatial subdivisions for ray acceleration
    (assumed familiar)
  • Use grid or bounding volume hierarchy
  • Could be optimized further, but good results have
    been obtained with these acceleration structures
  • Efficiency mainly due to low level optimization

141
Low Level Optimization
  • Ray tracing in general
  • Ray coherence: neighboring rays tend to intersect
    the same objects
  • Cache coherence: objects previously intersected
    are likely to still reside in cache for the
    current ray

142
Memory Access
  • On SGI Origin series computers
  • Memory allocated for a specific process may be
    located elsewhere in the machine → reading memory
    may be expensive
  • Processes may migrate to other processors when
    executing a system call → whole cache becomes
    invalidated; previously local memory may now be
    remote and more expensive to access

143
Memory Access (2)
  • Pin down processes to processors
  • Allocate memory close to where the processes run
    that will use this memory
  • Use sysmp and sproc for processor placement
  • Use mmap or dplace for memory placement

144
Further Low Level Optimizations
  • Know the architecture you work on (Appendix III.A
    in the course notes)
  • Use profiling to find expensive bits of code and
    cache misses (Appendix III.B in the course notes)
  • Use padding to fit important data structures on a
    single cache line

145
Frameless Rendering
  • Display pixel as soon as it is computed
  • No concept of frames
  • Perceptually preferable
  • Equivalent of a full frame takes longer to
    compute
  • Less efficient exploitation of cache coherence
  • This alternative will return later in this course

146
Overview
  • Introduction
  • Interactive ray tracer
  • Animation and interactive ray tracing
  • Sample re-use techniques

147
Animation and Interactive Ray Tracing
148
Why Animation?
  • Once interactive rendering is feasible,
    walk-through is not enough
  • Desire to manipulate the scene interactively
  • Render preprogrammed animation paths

149
Issues to Be Addressed
  • What stops us from animating objects?
  • Answer spatial subdivisions
  • Acceleration structures normally built during
    pre-processing
  • They assume objects are stationary

150
Possible Solutions
  • Target applications that require a small number
    of objects to be manipulated/ animated
  • Render these objects separately
  • Traversal cost will be linear in the number of
    animated objects
  • Only feasible for extremely small number of
    objects

151
Possible Solutions (2)
  • Target small number of manipulated or animated
    objects
  • Modify existing spatial subdivisions
  • For each frame delete object from data structure
  • Update the object's coordinates
  • Re-insert object into data structure
  • This is our preferred approach

152
Spatial Subdivision
  • Should be able to deal with
  • Basic operations such as insertion and deletion
    of objects should be rapid
  • User manipulation can cause the extent of the
    scene to grow

153
Subdivisions Investigated
  • Regular grid
  • Hierarchical grid
  • Borrows from octree spatial subdivision
  • In our case this is a full tree: all leaf nodes
    are at the same depth
  • Both acceleration structures are investigated in
    the next few slides

154
Regular Grid Data Structure
  • We assume familiarity with spatial subdivisions!

155
Object Insertion Into Grid
  • Compute bounding box of object
  • Compute overlap of bounding box with grid voxels
  • Object is inserted into overlapping voxels
  • Object deletion works similarly
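These steps can be sketched as follows (grid resolution, voxel size, and names are illustrative assumptions):

```python
def overlapping_voxels(bbox_min, bbox_max, voxel_size, grid_res):
    """Map an axis-aligned bounding box to the range of grid voxels it
    overlaps, clamped to the grid bounds."""
    lo = [max(0, int(bbox_min[a] // voxel_size)) for a in range(3)]
    hi = [min(grid_res - 1, int(bbox_max[a] // voxel_size)) for a in range(3)]
    return [(x, y, z)
            for x in range(lo[0], hi[0] + 1)
            for y in range(lo[1], hi[1] + 1)
            for z in range(lo[2], hi[2] + 1)]

def insert(grid, obj_id, bbox_min, bbox_max, voxel_size=1.0, grid_res=8):
    for v in overlapping_voxels(bbox_min, bbox_max, voxel_size, grid_res):
        grid.setdefault(v, []).append(obj_id)   # object enters every overlapped voxel

grid = {}
insert(grid, "sphere", (0.5, 0.5, 0.5), (1.5, 0.5, 0.5))
print(sorted(grid))  # [(0, 0, 0), (1, 0, 0)]
```

Deletion simply removes the object id from the same set of voxels, which is why per-frame delete/update/re-insert stays cheap for small objects.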

156
Extensions to Regular Grid
  • Dealing with expanding scenes requires
  • Modifications to object insertion/deletion
  • Ray traversal

157
Extensions to Regular Grid (2)
158
Features of New Grid Data Structure
  • We call this an Interactive Grid
  • Straightforward object insertion/deletion
  • Deals with expanding scenes
  • Insertion cost depends on relative object size
  • Traversal cost somewhat higher than for
    regular grid

159
Hierarchical Grid
  • Objectives
  • Reduce insertion/deletion cost for larger objects
  • Retain advantages of interactive grid

160
Hierarchical Grid (2)
161
Hierarchical Grid (3)
  • Build full octree with all leaf nodes at the same
    level
  • Allow objects to reside in leaf nodes as well as
    in nodes higher up in the hierarchy
  • Each object can be inserted into one or more
    voxels of at most one level in the hierarchy
  • Small objects reside in leaf nodes, large objects
    reside elsewhere in the hierarchy
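One way to pick the single level an object belongs to (an illustrative sketch; the level-selection rule is an assumption, not the course's exact scheme):

```python
def level_for_object(obj_size, leaf_voxel_size, max_level):
    """Choose the deepest hierarchy level whose voxels are at least as
    large as the object: small objects land in leaf nodes (level
    max_level), larger objects land higher up (level 0 is the root)."""
    level, size = max_level, leaf_voxel_size
    while level > 0 and obj_size > size:
        size *= 2                          # parent voxels are twice as large
        level -= 1
    return level

print(level_for_object(0.5, 1.0, max_level=3))  # 3 (a leaf node)
print(level_for_object(3.0, 1.0, max_level=3))  # 1 (higher up)
```

Since an object occupies voxels of at most one level, insertion and deletion touch only a handful of similarly sized voxels, which is what reduces the update cost for larger objects.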

162
Hierarchical Grid (4)
  • Features
  • Deals with expanding scenes similar to
    interactive grid
  • Reduced insertion/deletion cost
  • Traversal cost somewhat higher than interactive
    grid

163
Test Scenes
164
Video
165
Measurements
  • We measure
  • Traversal cost of
  • Interactive grid
  • Hierarchical grid
  • Regular grid
  • Object update rates of
  • Interactive grid
  • Hierarchical grid

166
Framerate vs. Grid Size (Sphereflake)
167
Framerate vs. Grid Size (Triangles)
168
Framerate Over Time (Sphereflake)
169
Framerate Over Time (Triangles)
170
Conclusions
  • Interactive manipulation of ray traced scenes is
    both desirable and feasible using these
    modifications to grid and hierarchical grids
  • Slight impact on traversal cost
  • (More results available in course notes)

171
Overview
  • Introduction
  • Interactive ray tracer
  • Animation and interactive ray tracing
  • Sample re-use techniques

172
Sample Re-use Techniques
173
Brute Force Ray Tracing
  • Enables interactive ray tracing
  • Does not allow large image sizes
  • Does not scale to scenes with
    high depth complexity

174
Solution
  • Exploit temporal coherence
  • Re-use results from previous frames

175
Practical Solutions
  • Tapestry (Simmons et al. 2000)
  • Focuses on complex lighting simulation
  • Render cache (Walter et al. 1999)
  • Addresses scene complexity issues
  • Explained next
  • Parallel render cache (Reinhard et al. 2000)
  • Builds on Walter's render cache
  • Explained next

176
Render Cache Algorithm
  • Basic setup
  • One front-end for
  • Displaying pixels
  • Managing previous results
  • Parallel back-end for
  • Producing new pixels

177
Render Cache Front-end
  • Frame based rendering
  • For each frame do
  • Project existing points
  • Smooth image and display
  • Select new rays using heuristics
  • Request samples from back-end
  • Insert new points into point cloud
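The projection step can be sketched with a toy pinhole camera (focal length only; the camera model and names are simplifying assumptions): cached 3-D sample points are reprojected into the new view instead of being retraced.

```python
def reproject(points, camera):
    """Project cached 3-D sample points into the new view, keeping the
    nearest point per pixel (z-buffering the point cloud)."""
    image = {}
    for (x, y, z), color in points:
        if z <= 0:
            continue                      # behind the camera
        px = round(camera["f"] * x / z)   # perspective divide
        py = round(camera["f"] * y / z)
        if (px, py) not in image or z < image[(px, py)][0]:
            image[(px, py)] = (z, color)
    return {p: c for p, (_, c) in image.items()}

points = [((1.0, 0.0, 2.0), "red"), ((1.0, 0.0, 4.0), "blue")]
print(reproject(points, {"f": 100}))  # {(50, 0): 'red', (25, 0): 'blue'}
```

Pixels that no cached point lands on are the holes that the smoothing pass fills and the new-ray heuristics target.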

178
Render Cache
179
Render Cache (2)
  • Point reprojection is relatively cheap
  • Smooth camera movement for small images
  • Does not scale to large images or large numbers
    of renderers → front-end becomes a bottleneck

180
Parallel Render Cache
  • Aim: remove front-end bottleneck
  • Distribute point reprojection functionality
  • Integrate point reprojection with renderers
  • Front-end only displays results

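Distributing the reprojection means each renderer owns a fixed set of image tiles and handles only the points that land there. A minimal sketch of such a pixel-to-processor mapping (round-robin tile assignment is an illustrative choice, not necessarily the scheme used in the paper):

```python
def tile_owner(px, py, tile_size, ntiles_x, nprocs):
    """Map a pixel to the processor owning its tile.

    Tiles are tile_size x tile_size pixels, numbered row-major across
    ntiles_x columns and dealt round-robin over nprocs renderers.
    """
    tx, ty = px // tile_size, py // tile_size
    return (ty * ntiles_x + tx) % nprocs

# 64x64 image, 16-pixel tiles (4 tiles per row), 4 renderers:
owner = tile_owner(17, 0, tile_size=16, ntiles_x=4, nprocs=4)
```

Because ownership is fixed, a reprojected point can be routed to its tile's renderer without going through the front-end, which now only displays results. The fixed seams are also where the tile-boundary artifacts mentioned later come from.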
181
Parallel Render Cache (2)
182
Parallel Render Cache (3)
  • Features
  • Scalable behavior for scene complexity
  • Scalable in number of processors
  • Allows larger images to be rendered
  • Retains artifacts from render cache
  • Introduces new artifacts

183
Artifacts
  • Render cache artifacts at tile boundaries
  • Image deteriorates during camera movement
  • These artifacts are deemed more acceptable than
    loss of smooth camera movement!

184
Video
185
Test Scenes
186
Results
  • Sub-parts of algorithm measured individually
  • Measure time per call to subroutine
  • Sum over all processors and all invocations
  • Afterwards divide by number of processors and
    number of invocations
  • Results are measured in events per second per
    processor

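The normalization above can be written out directly. A hypothetical helper (the input numbers in the example are made up for illustration, not measured data):

```python
def events_per_sec_per_proc(total_time, nprocs, ninvocations):
    """total_time: summed wall time of one subroutine over all processors
    and all invocations. Average the per-call cost over processors and
    invocations, then invert to get events per second per processor."""
    avg_call = total_time / (nprocs * ninvocations)
    return 1.0 / avg_call

# e.g. 4 processors, 1000 calls each, 2 s of summed subroutine time:
rate = events_per_sec_per_proc(total_time=2.0, nprocs=4, ninvocations=1000)
# each call averages 0.5 ms, i.e. 2000 events/s/processor
```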
187
Scalability (Teapot Model)
188
Scalability (Room Model)
189
Samples Per Second
190
Reprojections Per Second
191
Conclusions
  • Exploitation of temporal coherence gives
    significantly smoother results than available
    with brute force ray tracing alone
  • This is at the cost of some artifacts which
    require further investigation
  • (More results available in course notes)

192
Acknowledgements
  • Thanks to
  • Steven Parker for writing the interactive ray
    tracer in the first place
  • Brian Smits, Peter Shirley and Charles Hansen for
    involvement in the animation and parallel point
    reprojection projects
  • Bruce Walter and George Drettakis for the render
    cache source code

193
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Rendering at Clemson / Distributed Computing and
    Spatial/Temporal Coherence
  • Interactive Ray Tracing
  • Parallel Rendering and the Quest for Realism:
    The Kilauea Massively Parallel Ray Tracer (Kato)
  • Summary / Discussion

194
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

195
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

196
Objective
  • Global illumination
  • Extremely complex scenes

197
Parallel Processing
  • Hardware
  • Multi-CPU machine
  • Linux PC cluster
  • Software
  • Threading (Pthread)
  • Message passing (MPI)

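The two layers combine naturally: threads share a task queue inside one machine, while MPI carries messages between machines. A Python threading analogue of the intra-machine half (the "trace" here is a placeholder, and real Kilauea workers are Pthreads exchanging MPI messages, not Python threads):

```python
import queue
import threading

def worker(tasks, results):
    """Worker thread: pull a bundle of rays, trace it, post results back."""
    while True:
        bundle = tasks.get()
        if bundle is None:                      # shutdown sentinel
            break
        results.put([ray * 2 for ray in bundle])  # placeholder for tracing

tasks, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for t in threads:
    t.start()
for bundle in ([1, 2], [3, 4]):                 # enqueue ray bundles
    tasks.put(bundle)
for _ in threads:                               # one sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()
```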
198
Our Render Farm
199
Global Illumination
  • Photon map

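A photon map estimates indirect illumination by density estimation over the photons nearest a shading point (Jensen's estimate). A naive brute-force sketch of the lookup, not Kilauea's parallel implementation (which distributes the map, as later slides show):

```python
import math

def irradiance_estimate(photons, x, k=3):
    """Gather the k photons nearest point x and divide their summed power
    by the area of the disc they cover. photons: list of (position, power)
    with 3-tuple positions; real systems use a kd-tree, not a full sort."""
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(p, x))

    nearest = sorted(photons, key=lambda ph: dist2(ph[0]))[:k]
    r2 = dist2(nearest[-1][0])          # squared radius of the gathered disc
    power = sum(ph[1] for ph in nearest)
    return power / (math.pi * r2) if r2 > 0 else 0.0

photons = [((1, 0, 0), 1.0), ((0, 1, 0), 1.0), ((0, 2, 0), 1.0)]
e = irradiance_estimate(photons, (0, 0, 0), k=2)  # = 2 / (pi * 1^2)
```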
200
Ray Tracing Renderer
201
Ray Tracing Renderer
202
Ray Tracing Renderer
203
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

204
Parallel Ray Tracing
  • Simple case
  • Complex case

205
Parallel Ray Tracing
  • Simple case
  • Complex case

206
Accel Grid
207
Simple Case (scene distribution)
208
Simple Case (ray tracing)
209
Parallel Ray Tracing
  • Simple case
  • Complex case

210
Complex Case (scene distribution)
211
Complex Case (accel grid construction)
212
Complex Case (ray tracing)
213
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

214
Parallel Photon Mapping
  • Photon trace
  • Photon lookup

215
Parallel Photon Mapping
  • Photon trace
  • Photon lookup

216
Photon Tracing (simple case)
217
Photon Tracing (complex case)
218
Parallel Photon Mapping
  • Photon trace
  • Photon lookup

219
Photon Lookup (simple case)
220
Photon Lookup (complex case)
221
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

222
Task
  • Mtask
  • Wtask
  • Btask
  • Stask
  • Rtask
  • Atask
  • Etask
  • Ltask
  • Ptask
  • Otask

223
Task Assignment
224
Roles of Tasks
225
Task Configuration
226
Task Configuration
227
Task Configuration
228
Task Interaction
229
Task Interaction
230
Task Interaction
231
Task Interaction
232
Task Interaction
233
Task Interaction
234
Task Interaction (simple case)
235
Roles of Tasks (photon map)
236
Task Configuration (photon map)
237
Task Configuration (photon map)
238
Task Interaction (photon map)
239
Task Interaction (photon map)
240
Task Interaction (photon map)
241
Task Interaction (photon map)
242
Task Configuration (simple photon)
243
Task Priority
244
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

245
Parallel Shading Problem
246
Parallel Shading Problem
247
Parallel Shading Problem (solution)
248
Parallel Shading Problem (solution)
249
Parallel Shading Problem (solution)
250
Parallel Shading Problem (solution)
251
Parallel Shading Problem (solution)
252
Parallel Shading Problem (solution)
253
Parallel Shading Problem (solution)
254
Parallel Shading Problem (solution)
255
Decomposing Shading Computation
256
Decomposing Shading Computation
257
Decomposing Shading Computation
258
SPOT
259
SPOT Condition
260
Parallel Shading Solution using SPOT
261
Parallel Shading Solution using SPOT
262
Shader SPOT Network Example
263
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

264
Rendering Results
  • Test machine specification
  • 1GHz Dual Pentium III
  • 512Mbyte memory
  • 100BaseT Ethernet
  • 18 machines connected via 100BaseT switch

265
Quatro
  • 700,223 triangles, 1 area point sky light, 1280
    x 692
  • 18 machines: 7min 19sec

266
Quatro single Atask test
267
Jeep
  • 715,059 triangles, 1 directional sky light,
    1280 x 692
  • 18 machines: 8min 27sec

268
Jeep4
  • 2,859,636 triangles, 1 directional sky light,
    1280 x 692
  • 18 machines: 12min 38sec, 2 Atasks x 1

269
Jeep4 2 Atasks test
  • 1 Atask group: 2 machines

270
Jeep8
  • 5,719,072 triangles, 1 directional sky light,
    1280 x 692
  • 16 machines: 18min 43sec, 4 Atasks x 4

271
Escape POD
  • 468,321 triangles, 1 directional sky light,
    1280 x 692
  • 18 machines: 14min 55sec

272
ansGun
  • 20,279 triangles, 1 spot sky light, 1280 x 960
  • 18 machines: 16min 38sec

273
SCN101
  • 787,255 triangles, 1 area light, 1280 x 692
  • 18 machines: 9min 10sec

274
Video
275
Conclusion / Future Work
  • We achieved
  • Close to linear parallel performance
  • Highly extensible architecture
  • We will achieve even more
  • Speed
  • Stability
  • Usability (user interface)
  • Etc.

276
Additional Information
  • Kilauea live rendering demo
  • BOOTH 1927 SquareUSA
  • http://www.squareusa.com/kilauea/

277
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Summary / Discussion (Chalmers)

278
Summary
279
Contact Information
  • Alan Chalmers
  • alan_at_cs.bris.ac.uk
  • Tim Davis
  • tadavis_at_cs.clemson.edu
  • Toshi Kato
  • http//www.squareusa.com/kilauea/
  • Erik Reinhard
  • reinhard_at_cs.utah.edu
  • Slides
  • http//www.cs.clemson.edu/tadavis