Practical Parallel Processing for Today - PowerPoint PPT Presentation

Title: Practical Parallel Processing for Today's Rendering Challenges
Slides: 280

Transcript and Presenter's Notes

1
  • Practical Parallel Processing for Today's
    Rendering Challenges
  • SIGGRAPH 2001 Course 40
  • Los Angeles, CA

2
Speakers
  • Alan Chalmers, University of Bristol
  • Tim Davis, Clemson University
  • Erik Reinhard, University of Utah
  • Toshi Kato, SquareUSA

3
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Summary / Discussion

4
Schedule
  • Introduction (Davis)
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Summary / Discussion

5
The Need for Speed
  • Graphics rendering is time-consuming
  • large amount of data in a single image
  • animations much worse
  • Demand continues to rise for high-quality graphics

6
Rendering and Parallel Processing
  • A holy union
  • Many graphics rendering tasks can be performed in
    parallel
  • Often embarrassingly parallel

7
3-D Graphics Boards
  • Getting better
  • Perform tricks with texture mapping
  • Steve Jobs' remark on constant frame rendering
    time

8
Parallel / Distributed Rendering
  • Fundamental Issues
  • Task Management
  • Task subdivision, Migration, Load balancing
  • Data Management
  • Data distributed across system
  • Communication

9
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
    (Chalmers)
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Summary / Discussion

10
Introduction
  • "Parallel processing is like a dog's walking on
    its hind legs. It is not done well; but you are
    surprised to find it done at all."
  • Steve Fiddes (apologies to Samuel Johnson)
  • Co-operation
  • Dependencies
  • Scalability
  • Control

11
Co-operation
  • Solution of a single problem
  • One person takes a certain time to solve the
    problem
  • Divide problem into a number of sub-problems
  • Each sub-problem solved by a single worker
  • Reduced problem solution time
  • BUT
  • co-operation → overheads

12
Working Together
  • Overheads
  • access to pool
  • collision avoidance

13
Dependencies
  • Divide a problem into a number of distinct stages
  • Parallel solution of one stage before next can
    start
  • May be too severe → no parallel solution
  • each sub-problem dependent on previous stage
  • Dependency-free problems
  • order of task completion unimportant
  • BUT co-operation still required

14
Building with Blocks
Strictly sequential
Dependency-free
15
Scalability
  • Upper bound on the number of workers
  • Additional workers will NOT improve solution
    time
  • Shows how suitable a problem is for parallel
    processing
  • Given problem → finite number of sub-problems
  • more workers than tasks
  • Upper bound may be (a lot) less than number of
    tasks
  • bottlenecks

16
Bottleneck at Doorway
More workers may result in LONGER solution time
17
Control
  • Required by all parallel implementations
  • What constitutes a task
  • When has the problem been solved
  • How to deal with multiple stages
  • Forms of control
  • centralised
  • distributed

18
Control Required
Sequential
Parallel
19
Inherent Difficulties
  • Failure to successfully complete
  • Sequential solution
  • deficiencies in algorithm or data
  • Parallel solution
  • deficiencies in algorithm or data
  • deadlock
  • data consistency

20
Novel Difficulties
  • Factors arising from implementation
  • Deadlock
  • processor waiting indefinitely for an event
  • Data consistency
  • data is distributed amongst processors
  • Communication overheads
  • latency in message transfer

21
Evaluating Parallel Implementations
  • Realisation penalties
  • Algorithmic penalty
  • nature of the algorithm chosen
  • Implementation penalty
  • need to communicate
  • concurrent computation & communication activities
  • idle time

22
Solution Times
23
Task Management
  • Providing tasks to the processors
  • Problem decomposition
  • algorithmic decomposition
  • domain decomposition
  • Definition of a task
  • Computational Model

24
Problem Decomposition
  • Exploit parallelism
  • Inherent in algorithm
  • algorithmic decomposition
  • parallelising compilers
  • Applying same algorithm to different data items
  • domain decomposition
  • need for explicit system software support

25
Abstract Definition of a Task
  • Principal Data Item (PDI) - application of
    algorithm
  • Additional Data Items (ADIs) - needed to complete
    computation

26
Computational Models
  • Determines the manner in which tasks are allocated
    to PEs
  • Maximise PE computation time
  • Minimise idle time
  • load balancing
  • Evenly allocate tasks amongst the processors

27
Data Driven Models
  • All PDIs allocated to specific PEs before
    computation starts
  • Each PE knows a priori which PDIs it is
    responsible for
  • Balanced (geometric decomposition)
  • evenly allocate tasks amongst the processors
  • if PDIs not an exact multiple of PEs then some PEs
    do one extra task
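A minimal sketch of this balanced (geometric) decomposition; the helper name and contiguous-range choice are illustrative, not from the course:

```python
def balanced_allocation(num_pdis, num_pes):
    """Evenly allocate PDIs to PEs before computation starts; if the
    PDI count is not an exact multiple of the PE count, the first
    (num_pdis % num_pes) PEs do one extra task."""
    base, extra = divmod(num_pdis, num_pes)
    counts = [base + 1 if pe < extra else base for pe in range(num_pes)]
    # Contiguous index ranges preserve spatial locality (geometric decomposition).
    ranges, start = [], 0
    for c in counts:
        ranges.append(range(start, start + c))
        start += c
    return ranges

print([len(r) for r in balanced_allocation(24, 3)])   # [8, 8, 8]
print([len(r) for r in balanced_allocation(10, 4)])   # [3, 3, 2, 2]
```

Each PE then knows a priori exactly which PDIs it is responsible for, with no task communication during the run.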

28
Balanced Data Driven
(Diagram: solution time = initial distribution + computation time (24 PDIs across 3 PEs) + result collation)
29
Demand Driven Model
  • Task computation time unknown
  • Work is allocated dynamically as PEs become idle
  • PEs no longer bound to particular PDIs
  • PEs explicitly demand new tasks
  • Task supplier process must satisfy these demands

30
Dynamic Allocation of Tasks
solution time = (total comp time for all PDIs / number of PEs) + 2 × total comms time
31
Task Supplier Process
PROCESS Task_Supplier()
Begin
    remaining_tasks = total_number_of_tasks
    ( initialise all processors with one task )
    FOR p = 1 TO number_of_PEs
        SEND task TO PE[p]
        remaining_tasks = remaining_tasks - 1
    WHILE results_outstanding DO
        RECEIVE result FROM PE[i]
        IF remaining_tasks > 0 THEN
            SEND task TO PE[i]
            remaining_tasks = remaining_tasks - 1
        ENDIF
End ( Task_Supplier )
Simple demand driven task supplier
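The same demand-driven supplier can be sketched as runnable Python, using threads to stand in for PEs and queues for the SEND/RECEIVE channels; the squaring "computation" is a placeholder, not part of the course material:

```python
import queue
import threading

def task_supplier(tasks, num_pes):
    """Demand-driven supplier: initialise every PE with one task, then
    hand a new task to whichever PE returns a result, until done."""
    task_q = [queue.Queue() for _ in range(num_pes)]   # one channel per PE
    result_q = queue.Queue()
    results = []

    def pe_worker(pe_id):
        while True:
            task = task_q[pe_id].get()
            if task is None:                           # shutdown signal
                break
            result_q.put((pe_id, task * task))         # stand-in computation

    workers = [threading.Thread(target=pe_worker, args=(p,))
               for p in range(num_pes)]
    for w in workers:
        w.start()

    it = iter(tasks)
    remaining, outstanding = len(tasks), 0
    for p in range(num_pes):                           # one task per PE to start
        try:
            task_q[p].put(next(it)); remaining -= 1; outstanding += 1
        except StopIteration:
            break
    while outstanding:
        pe_id, result = result_q.get()                 # RECEIVE result FROM PE[i]
        results.append(result)
        outstanding -= 1
        if remaining:                                  # SEND task TO PE[i]
            task_q[pe_id].put(next(it)); remaining -= 1; outstanding += 1
    for q in task_q:
        q.put(None)
    for w in workers:
        w.join()
    return results

print(sorted(task_supplier(list(range(6)), 3)))        # [0, 1, 4, 9, 16, 25]
```

Because PEs demand work only when idle, fast PEs naturally take on more tasks than slow ones.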
32
Load Balancing
  • All PEs should complete at the same time
  • Some PEs busy with complex tasks
  • Other PEs available for easier tasks
  • Computation effort of each task unknown
  • hot spot at end of processing → unbalanced
    solution
  • Any knowledge about hot spots should be used

33
Task Definition Granularity
  • Computational elements
  • Atomic element (ray-object intersection)
  • sequential problem's lowest computational element
  • Task (trace complete path of one ray)
  • parallel problem's smallest computational element
  • Task granularity
  • number of atomic elements in one task

34
Task Packet
  • Unit of task distribution
  • Informs a PE of which task(s) to perform
  • Task packet may include
  • indication of which task(s) to compute
  • data items (the PDI and (possibly) ADIs)
  • Task packet for ray tracer → one or more rays to
    be traced
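A task packet might be modelled as a small record like the following; the field names are illustrative, not from the course:

```python
from dataclasses import dataclass, field

@dataclass
class TaskPacket:
    """Unit of task distribution, informing a PE what to compute."""
    task_ids: list                            # indication of which task(s) to compute
    pdi: object                               # principal data item (e.g. rays to trace)
    adis: list = field(default_factory=list)  # additional data items, if sent eagerly

packet = TaskPacket(task_ids=[42], pdi="primary ray for pixel (10, 20)")
print(packet.task_ids)  # [42]
```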

35
Algorithmic Dependencies
  • Algorithm adopted for parallelisation
  • May specify order of task completion
  • Dependencies MUST be preserved
  • Algorithmic dependencies introduce
  • synchronisation points → distinct problem stages
  • data dependencies → careful data management

36
Distributed Task Management
  • Centralised task supply
  • All requests for new tasks go to the System
    Controller → bottleneck
  • Significant delay in fetching new tasks
  • Distributed task supply
  • task requests handled remotely from System
    Controller
  • spread of communication load across system
  • reduced time to satisfy task request

37
Preferred Bias Allocation
  • Combining Data driven & Demand driven
  • Balanced data driven
  • tasks allocated in a predetermined manner
  • Demand driven
  • tasks allocated dynamically on demand
  • Preferred Bias Regions are purely conceptual
  • enables the exploitation of any coherence

38
Conceptual Regions
  • task allocation no longer arbitrary

39
Data Management
  • Providing data to the processors
  • World model
  • Virtual shared memory
  • Data manager process
  • local data cache
  • requesting & locating data
  • Consistency

40
Remote Data Fetches
  • Advanced data management
  • Minimising communication latencies
  • Prefetching
  • Multi-threading
  • Profiling
  • Multi-stage problems

41
Data Requirements
  • Requirements may be large
  • Fit in the local memory of each processor
  • world model
  • Too large for each local memory
  • distributed data
  • provide virtual world model/virtual shared memory

42
Virtual Shared Memory (VSM)
  • Providing a conceptual single memory space
  • Memory is in fact distributed
  • Request is the same for both local & remote data
  • Speed of access may be (very) different

43
Consistency
  • Read/write can result in inconsistencies
  • Distributed memory
  • multiple copies of the same data item
  • Updating such a data item
  • update all copies of this data item
  • invalidate all other copies of this data item
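The invalidation strategy can be sketched as a toy data manager; this is an illustrative protocol sketch, not the course's implementation:

```python
class DataManager:
    """Toy write-invalidate sketch: each PE may hold a local copy of a
    data item; a write keeps only the writer's copy valid."""
    def __init__(self, num_pes, value):
        self.copies = {pe: value for pe in range(num_pes)}  # all PEs hold a copy

    def write(self, pe, value):
        self.copies = {pe: value}        # invalidate all other copies

    def read(self, pe, fetch):
        if pe not in self.copies:        # local miss -> remote fetch from an owner
            self.copies[pe] = fetch()
        return self.copies[pe]

dm = DataManager(3, "v0")
dm.write(1, "v1")                        # PEs 0 and 2 invalidated
print(dm.read(0, fetch=lambda: dm.copies[1]))  # v1, after a remote fetch
```

Invalidation trades cheap writes for remote fetches on the next read, whereas update-all-copies trades write-time communication for cheap reads.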

44
Minimising Impact of Remote Data
  • Failure to find a data item locally → remote
    fetch
  • Time to find data item can be significant
  • Processor idle during this time
  • Latency difficult to predict
  • e.g. depends on current message densities
  • Data management must minimise this idle time

45
Data Management Techniques
  • Hiding the Latency
  • Overlapping the communication with computation
  • prefetching
  • multi-threading
  • Minimising the Latency
  • Reducing the time of a remote fetch
  • profiling
  • caching

46
Prefetching
  • Exploiting knowledge of data requests
  • A priori knowledge of data requirements
  • nature of the problem
  • choice of computational model
  • DM can prefetch them (up to some specified
    horizon)
  • available locally when required
  • overlapping communication with computation
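A minimal prefetching sketch (with sleeps standing in for communication and computation; the function names are assumptions): while the PE computes on item i, item i+1 is already being fetched in the background.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def remote_fetch(i):
    time.sleep(0.01)                 # stand-in for communication latency
    return i

def compute(item):
    time.sleep(0.01)                 # stand-in for per-item computation
    return item * 2

def render_with_prefetch(n):
    """Prefetch item i+1 while computing on item i, so data is
    available locally when required."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(remote_fetch, 0)
        for i in range(n):
            item = future.result()                     # ready (usually) by now
            if i + 1 < n:
                future = pool.submit(remote_fetch, i + 1)  # prefetch next
            results.append(compute(item))
    return results

print(render_with_prefetch(4))  # [0, 2, 4, 6]
```

With perfect overlap the fetch latency is hidden entirely; in practice the prefetch horizon is limited by local memory.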

47
Multi-Threading
  • Keeping PE busy with useful computation
  • Remote data fetch → current task stalled
  • Start another task (Processor kept busy)
  • separate threads of computation (BSP)
  • Disadvantages: overheads
  • Context switches between threads
  • Increased message densities
  • Reduced local cache for each thread

48
Results for Multi-Threading
  • More threads than optimal reduces performance
  • Cache-22 situation
  • less local cache → more data misses → more
    threads

49
Profiling
  • Reducing the remote fetch time
  • At the end of computation all data requests are
    known
  • if known then can be prefetched
  • Monitor data requests for each task
  • build up a picture of possible requirements
  • Exploit spatial coherence (with preferred bias
    allocation)
  • prefetch those data items likely to be required

50
Spatial Coherence
51
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
    (Davis)
  • Practical Applications
  • Summary / Discussion

52
Classification of Parallel Rendering Systems
  • Parallel rendering performed in many ways
  • Classification by
  • task subdivision
  • polygon rendering
  • ray tracing
  • hardware
  • parallel hardware
  • distributed computing

53
Classification by Task Subdivision
  • Original rendering task broken into smaller
    pieces to be processed in parallel
  • Depends on type of rendering
  • Goals
  • maximize parallelism
  • minimize overhead, including communication

54
Task Subdivision in Polygon Rendering
  • Rendering many primitives
  • Polygon rendering pipeline
  • geometry processing (transformation, clipping,
    lighting)
  • rasterization (scan conversion, visibility,
    shading)

55
Polygon Rendering Pipeline
56
Primitive Processing and Sorting
  • View processing of primitives as sorting problem
  • primitives can fall anywhere on or off the screen
  • Sorting can be done in either software or
    hardware, but mostly done in hardware

57
Primitive Processing and Sorting
  • Sorting can occur at various places in the
    rendering pipeline
  • during geometry processing (sort-first)
  • between geometry processing and rasterization
    (sort-middle)
  • during rasterization (sort-last)

58
Sort-first
59
Sort-first Method
  • Each processor (renderer) assigned a portion of
    the screen
  • Primitives arbitrarily assigned to processors
  • Processors perform enough calculations to send
    primitives to correct renderers
  • Processors then perform geometry processing and
    rasterization for their primitives in parallel

60
Screen Subdivision
61
Sort-first Discussion
  • Communication costs can be kept low
  • - Duplication of effort if primitives fall into
    more than one screen area
  • - Load imbalance if primitives concentrated
  • - Very few, if any, sort-first renderers built

62
Sort-middle
63
Sort-middle Method
  • Primitives arbitrarily assigned to renderers
  • Each renderer performs geometry processing on its
    primitives
  • Primitives then redistributed to rasterizers
    according to screen region
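The redistribution step can be sketched as follows; vertical screen strips are an illustrative choice of region, and the names are not from the course:

```python
def sort_middle(primitives, num_rasterizers, screen_width=640):
    """After geometry processing, send each (already transformed)
    primitive to the rasterizer(s) owning the screen strip(s) it
    touches; primitives spanning strips are duplicated."""
    strip = screen_width / num_rasterizers
    bins = [[] for _ in range(num_rasterizers)]
    for prim in primitives:                  # prim = list of (x, y) screen vertices
        xs = [x for x, _ in prim]
        first = int(min(xs) // strip)
        last = int(max(xs) // strip)
        for r in range(first, min(last, num_rasterizers - 1) + 1):
            bins[r].append(prim)
    return bins

tri = [(100, 10), (300, 40), (200, 90)]      # spans strips 0 and 1 of 4
print([len(b) for b in sort_middle([tri], 4)])  # [1, 1, 0, 0]
```

The concentration of primitives in a few strips is exactly the load imbalance discussed on the next slide.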

64
Sort-middle Discussion
  • Natural breaking point in graphics pipeline
  • - Load imbalance if primitives concentrated in
    particular screen regions
  • Several successful hardware implementations
  • PixelPlanes 5
  • SGI Reality Engine

65
Sort-last
66
Sort-last Method
  • Primitives arbitrarily distributed to renderers
  • Each renderer computes pixel values for its
    primitives
  • Pixel values are then sent to processors
    according to screen location
  • Rasterizers perform visibility and compositing

67
Sort-last Discussion
  • Less prone to load imbalance
  • - Pixel traffic can be high
  • Some working systems
  • Denali

68
Task Subdivision in Ray Tracing
  • Ray tracing often prohibitively expensive on
    single processor
  • Prime candidate for parallelization
  • each pixel can be rendered independently
  • Processing easily subdivided
  • image space subdivision
  • object space subdivision
  • object subdivision

69
Image Space Subdivision
70
Image Space Subdivision Discussion
  • Straightforward
  • High parallelism possible
  • - Entire scene database must reside on each
    processor
  • need adequate storage
  • Low processor communication

71
Image Space Subdivision Discussion
  • - Load imbalance possible
  • screen space may be further subdivided
  • Used in many parallel ray tracers
  • works better with MIMD machines
  • distributed computing environments
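A sketch of the underlying tile assignment (illustrative names; round-robin interleaving is one simple way to further subdivide screen space against load imbalance):

```python
def image_space_tiles(width, height, tile, num_procs):
    """Split the image into square tiles and deal them round-robin to
    processors; each processor holds the whole scene database, so no
    communication is needed beyond collecting finished tiles."""
    tiles = [(x, y, min(tile, width - x), min(tile, height - y))
             for y in range(0, height, tile)
             for x in range(0, width, tile)]
    return {p: tiles[p::num_procs] for p in range(num_procs)}

assign = image_space_tiles(64, 32, 16, 3)
print(sum(len(t) for t in assign.values()))  # 8 tiles in total
```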

72
Object Space Subdivision
  • 3-D object space divided into voxels
  • Each voxel assigned to a processor
  • Rays are passed from processor to processor as
    voxel space is traversed

73
Object Space Subdivision Discussion
  • Each processor needs only scene information
    associated with its voxel(s)
  • - Rays must be tracked through voxel space
  • Load balance good
  • - Communication can be high
  • Some successful systems

74
Object Partitioning
  • Each object in the scene is assigned to a
    processor
  • Rays passed as messages between processors
  • Processors check for intersection

75
Object Partitioning Discussion
  • Load balancing good
  • - Communication high due to ray message traffic
  • - Fewer implementations

76
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Rendering at Clemson / Distributed Computing and
    Spatial/Temporal Coherence (Davis)
  • Interactive Ray Tracing
  • Parallel Rendering and the Quest for Realism The
    Kilauea Massively Parallel Ray Tracer
  • Summary / Discussion

77
Practical Experiences at Clemson
  • Problems with Rendering
  • Current Resources
  • Deciding on a Solution
  • A New Render Farm

78
A Demand for Rendering
  • Computer Animation course
  • 3 SIGGRAPH animation submissions
  • render over semester break

79
Current Resources
  • dedicated lab
  • 8 SGI 02s (R12000, 384 MB)
  • general-purpose lab
  • 4 SGI 02s
  • shared lab
  • dual-pipe Onyx2 (8 R12000, 8 GB)
  • 10 SGI 02s (R12000, 256 MB)
  • offices
  • 5 SGI 02s

80
Resource Problems
  • Rendering prohibits interactive sessions
  • Little organized control over resources
  • users must be self-monitoring
  • m renders on n machines → 1 render on n/m
    machines
  • Disk space
  • Cross-platform distributed rendering to PCs
    problematic
  • security (rsh)
  • distributed rendering software
  • directory paths

81
Short-term Solutions
  • Distributed rendering restricted to late night
  • Resources partitioned

82
Problems with Maya
  • video
  • Traditional distributed computing problems
  • dropped frames
  • incomplete frames
  • tools developed

83
Problems with Maya
  • Tools (DropCheck)

84
Problems with Maya
  • Tools (Load Scan)

85
Problems with Maya
  • Animation inconsistencies
  • next slide
  • Some frames would not render
  • Particle system inconsistencies

86
Problems with Maya
87
Rendering Tips
  • Layering

88
Rendering Tips
  • Layering

89
Deciding on a Solution - RenderDrive
  • RenderDrive by ART (Advanced Rendering
    Technology)
  • network appliance for ray tracing
  • 16-48 specialized processors
  • claims speedups of 15-40 over Pentium III
  • 768MB to 1.5GB memory
  • 4GB hard disk cache

90
Deciding on a Solution - RenderDrive
  • plug-in interface to Maya
  • Renderman ray tracer
  • 15K - 25K

91
Deciding on a Solution - PCs
  • Network of PCs as a render farm
  • 10 PCs each with 1.4GHz, 1GB memory, and 40GB
    hard drive
  • Maya will run under Windows 2000 or Linux (Maya
    4.0)
  • Distributed rendering software not included for
    Windows 2000

92
Deciding on a Solution - PCs Win
  • RenderDrive had some unusual anomalies
  • Interactive capabilities
  • Scan-line or ray tracing
  • Distributed rendering software may be included
  • Problems with security still exist
  • shared file system

93
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Rendering at Clemson / Distributed Computing and
    Spatial/Temporal Coherence (Davis)
  • Interactive Ray Tracing
  • Parallel Rendering and the Quest for Realism The
    Kilauea Massively Parallel Ray Tracer
  • Summary / Discussion

94
Agenda
  • Background
  • Temporal Depth-Buffer
  • Frame Coherence Algorithm
  • Parallel Frame Coherence Algorithm

95
Background - Ray Tracing
  • Closest to physical model of light
  • High cost in terms of time / complexity

96
Background - Frame Coherence
  • Frame coherence
  • those pixels that do not change from one frame to
    the next
  • derived from object and temporal coherence
  • We should not have to re-compute those pixels
    whose values will not change
  • writing pixels to frame files

97
Background - Test Animation
  • Glass Bounce (60 frames at 320x240, 5 objects)

98
Background - Frame Coherence
99
Previous Work
  • Frame coherence
  • moving camera/static world [Hubschman and Zucker
    81]
  • estimated frames [Badt 88]
  • stereoscopic pairs [Adelson and Hodges 93/95]
  • 4D bounding volumes [Glassner 88]
  • voxels and ray tracking [Jevans 92]
  • incremental ray tracing [Murakami 90]

100
Previous Work (cont.)
  • Distributed computing
  • Alias and 3D Studio
  • most major productions starting with Toy Story
    [Henne 96]

101
Goals
  • Render exactly the same set of frames in much
    less time
  • Work in conjunction with other optimization
    techniques
  • Run on a variety of platforms
  • Extend a currently popular ray tracer (POV-Ray)
    to allow for general use

102
Temporal Depth-Buffer
  • Similar to traditional z-buffer
  • For each pixel, store a temporal depth in frame
    units

103
Frame Coherence Algorithm
104
Frame Coherence Algorithm
105
Frame Coherence Algorithm
  • Identify volume within 3D object space where
    movement occurs
  • Divide volume uniformly into voxels
  • For each voxel, create a list of frame numbers in
    which changing objects inhabit this voxel

106
Frame Coherence Algorithm
  • In each frame, track rays through voxels for each
    pixel
  • From the voxels traversed, find the one with the
    lowest frame number
  • Record that number in the temporal depth-buffer

107
Frame Coherence Algorithm
for each frame of the animation
    for each pixel that needs to be computed for this frame
        trace the rays for this pixel
        for each voxel that any of these rays intersect
            get the next frame number to compute
        set the t-buffer entry to the lowest frame number found
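A minimal sketch of the t-buffer update for one frame (data layout and names are illustrative): each voxel carries the list of frame numbers in which changing objects inhabit it, and each pixel records the lowest upcoming frame number over the voxels its rays traverse.

```python
def update_t_buffer(frames_by_voxel, voxels_per_pixel, current_frame):
    """For each pixel, find the lowest frame number after the current
    one in which a changing object inhabits any traversed voxel; that
    is the next frame at which the pixel must be recomputed."""
    t_buffer = {}
    for pixel, voxels in voxels_per_pixel.items():
        candidates = [f for v in voxels
                        for f in frames_by_voxel.get(v, [])
                        if f > current_frame]
        t_buffer[pixel] = min(candidates, default=None)  # None: never recompute
    return t_buffer

frames_by_voxel = {(0, 0, 0): [3, 7], (1, 0, 0): [5]}
voxels_per_pixel = {(10, 20): [(0, 0, 0), (1, 0, 0)], (11, 20): [(2, 0, 0)]}
print(update_t_buffer(frames_by_voxel, voxels_per_pixel, current_frame=1))
# {(10, 20): 3, (11, 20): None}
```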
108
Frame Coherence Algorithm
(Diagram: temporal depth-buffer for one frame — most entries hold 5; regions traversed by moving objects hold 3, 2, or 1, marking the next frame at which those pixels must be recomputed)
109
Voxel Volume
  • Uniform voxel spatial subdivision
  • Voxel can be non-cubical
  • Ways to determine voxel volume
  • user-supplied
  • pre-processing phase
  • active voxel marking
  • in distributed environment, done by master or
    slave or both

110
Frame Coherence Example
111
Test Animation
  • Pool Shark (620 frames at 640x480, 174 objects)

112
Test Animations - Problem
  • Bounding box problem

113
Results
114
Frame Coherence Discussion
  • Localized movement can have global effects
  • Performance depends on both the number and
    complexity of recomputed pixels
  • Issues
  • overhead
  • antialiasing
  • motion blur

115
Temporal Depth-Buffer Discussion
  • Uses less memory than other methods
  • Simple
  • Can be used with other algorithms

116
Parallel Frame Coherence Algorithm
  • Distributed computing environment
  • 1-8 Sun Sparc Ultra 5 processors running at 270
    MHz
  • Coarse-grain parallelism
  • Load balancing
  • divide work among processors
  • keep data together for frame coherence

117
Load Balancing
  • Image space subdivision
  • each processor computes a subregion for the
    entire length of the run
  • Recursively subdivide subsequences to keep
    processors busy

118
Screen Subdivision
119
Load Balancing
  • Coarse bin packing: find the block with the
    smallest number of computed frames
  • Keep statistics on average first frame time and
    average coherent frame time
  • Find a hole in the sequence
  • Leave some free frames before new start

120
Load Balancing Example
121
Results - Parallel Frame Coherence
122
Results
123
Another Test Animation
  • Soda Worship (60 frames at 160x120, 839 objects)

124
Another Test Animation
125
Results
126
Results Discussion
  • Good speedup
  • Multiplicative speedup with both
  • Speedup limitations
  • voxel approximation
  • writing pixels to frame files (communication)

127
Conclusions
  • Frame coherence algorithm combined with
    distributed computing provides good speedup
  • Algorithm scales well
  • Techniques are useful and accessible to a wide
    variety of users
  • Benefits depend on inherent properties of the
    animation

128
Shameless Advertisement
  • Masters of Fine Arts in Computing (MFAC)
  • special effects and animation courses
  • two year program
  • Clemson Computer Animation Festival in Fall 2002

129
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Rendering at Clemson / Distributed Computing and
    Spatial/Temporal Coherence
  • Interactive Ray Tracing (Reinhard)
  • Parallel Rendering and the Quest for Realism The
    Kilauea Massively Parallel Ray Tracer
  • Summary / Discussion

130
Overview
  • Introduction
  • Interactive ray tracer
  • Animation and interactive ray tracing
  • Sample reuse techniques

131
Introduction
132
Interactive Ray Tracing
  • Renders effects not available using other
    rendering algorithms
  • Feasible on high-end supercomputers provided
    suitable hardware is chosen
  • Scales sub-linearly in scene complexity
  • Scales almost linearly in number of processors

133
Hardware Choices
  • Shared memory vs. distributed memory
  • Latency and throughput for pixel communication
  • Choice → Shared memory
  • This section of the course focuses on SGI Origin
    series super computers

134
Shared Memory
  • Shared address space
  • Physically distributed memory
  • ccNUMA architecture

135
SGI Origin 2000 Architecture
136
Implications
  • ccNUMA machines are easy to program
  • But it is more difficult to generate efficient
    code
  • Memory mapping and processor placement may be
    important for certain applications
  • Topic returns later in this course

137
Overview
  • Introduction
  • Interactive ray tracer
  • Animation and interactive ray tracing
  • Sample re-use techniques

138
Interactive Ray Tracing
139
Basic Algorithm
  • Master-slave configuration
  • Master (display thread) displays results and
    farms out ray tasks
  • Slaves produce new rays
  • Task size reduced towards end of each frame
  • Load balancing
  • Cache coherence

140
Tracing a Single Ray
  • Use spatial subdivisions for ray acceleration
    (assumed familiar)
  • Use grid or bounding volume hierarchy
  • Could be optimized further, but good results have
    been obtained with these acceleration structures
  • Efficiency mainly due to low level optimization

141
Low Level Optimization
  • Ray tracing in general
  • Ray coherence: neighboring rays tend to intersect
    the same objects
  • Cache coherence: objects previously intersected
    are likely to still reside in cache for the
    current ray

142
Memory Access
  • On SGI Origin series computers
  • Memory allocated for a specific process may be
    located elsewhere in the machine → reading memory
    may be expensive
  • Processes may migrate to other processors when
    executing a system call → whole cache becomes
    invalidated; previously local memory may now be
    remote and more expensive to access

143
Memory Access (2)
  • Pin down processes to processors
  • Allocate memory close to where the processes run
    that will use this memory
  • Use sysmp and sproc for processor placement
  • Use mmap or dplace for memory placement

144
Further Low Level Optimizations
  • Know the architecture you work on (Appendix III.A
    in the course notes)
  • Use profiling to find expensive bits of code and
    cache misses (Appendix III.B in the course notes)
  • Use padding to fit important data structures on a
    single cache line

145
Frameless Rendering
  • Display pixel as soon as it is computed
  • No concept of frames
  • Perceptually preferable
  • Equivalent of a full frame takes longer to
    compute
  • Less efficient exploitation of cache coherence
  • This alternative will return later in this course

146
Overview
  • Introduction
  • Interactive ray tracer
  • Animation and interactive ray tracing
  • Sample re-use techniques

147
Animation and Interactive Ray Tracing
148
Why Animation?
  • Once interactive rendering is feasible,
    walk-through is not enough
  • Desire to manipulate the scene interactively
  • Render preprogrammed animation paths

149
Issues to Be Addressed
  • What stops us from animating objects?
  • Answer spatial subdivisions
  • Acceleration structures normally built during
    pre-processing
  • They assume objects are stationary

150
Possible Solutions
  • Target applications that require a small number
    of objects to be manipulated/ animated
  • Render these objects separately
  • Traversal cost will be linear in the number of
    animated objects
  • Only feasible for extremely small number of
    objects

151
Possible Solutions (2)
  • Target small number of manipulated or animated
    objects
  • Modify existing spatial subdivisions
  • For each frame delete object from data structure
  • Update the object's coordinates
  • Re-insert object into data structure
  • This is our preferred approach

152
Spatial Subdivision
  • Should be able to deal with
  • Basic operations such as insertion and deletion
    of objects should be rapid
  • User manipulation can cause the extent of the
    scene to grow

153
Subdivisions Investigated
  • Regular grid
  • Hierarchical grid
  • Borrows from octree spatial subdivision
  • In our case this is a full tree: all leaf nodes
    are at the same depth
  • Both acceleration structures are investigated in
    the next few slides

154
Regular Grid Data Structure
  • We assume familiarity with spatial subdivisions!

155
Object Insertion Into Grid
  • Compute bounding box of object
  • Compute overlap of bounding box with grid voxels
  • Object is inserted into overlapping voxels
  • Object deletion works similarly
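These steps can be sketched as follows (grid resolution, voxel size, and names are illustrative assumptions):

```python
def overlapping_voxels(bbox_min, bbox_max, voxel_size, grid_res):
    """Map an axis-aligned bounding box to the range of grid voxels it
    overlaps, clamped to the grid bounds."""
    lo = [max(0, int(bbox_min[a] // voxel_size)) for a in range(3)]
    hi = [min(grid_res - 1, int(bbox_max[a] // voxel_size)) for a in range(3)]
    return [(x, y, z)
            for x in range(lo[0], hi[0] + 1)
            for y in range(lo[1], hi[1] + 1)
            for z in range(lo[2], hi[2] + 1)]

def insert(grid, obj_id, bbox_min, bbox_max, voxel_size=1.0, grid_res=8):
    for v in overlapping_voxels(bbox_min, bbox_max, voxel_size, grid_res):
        grid.setdefault(v, []).append(obj_id)   # object enters every overlapped voxel

grid = {}
insert(grid, "sphere", (0.5, 0.5, 0.5), (1.5, 0.5, 0.5))
print(sorted(grid))  # [(0, 0, 0), (1, 0, 0)]
```

Deletion simply removes the object id from the same set of voxels, which is why per-frame delete/update/re-insert stays cheap for small objects.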

156
Extensions to Regular Grid
  • Dealing with expanding scenes requires
  • Modifications to object insertion/deletion
  • Ray traversal

157
Extensions to Regular Grid (2)
158
Features of New Grid Data Structure
  • We call this an Interactive Grid
  • Straightforward object insertion/deletion
  • Deals with expanding scenes
  • Insertion cost depends on relative object size
  • Traversal cost somewhat higher than for
    regular grid

159
Hierarchical Grid
  • Objectives
  • Reduce insertion/deletion cost for larger objects
  • Retain advantages of interactive grid

160
Hierarchical Grid (2)
161
Hierarchical Grid (3)
  • Build full octree with all leaf nodes at the same
    level
  • Allow objects to reside in leaf nodes as well as
    in nodes higher up in the hierarchy
  • Each object can be inserted into one or more
    voxels of at most one level in the hierarchy
  • Small objects reside in leaf nodes, large objects
    reside elsewhere in the hierarchy
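One way to pick the single level an object belongs to (an illustrative sketch; the level-selection rule is an assumption, not the course's exact scheme):

```python
def level_for_object(obj_size, leaf_voxel_size, max_level):
    """Choose the deepest hierarchy level whose voxels are at least as
    large as the object: small objects land in leaf nodes (level
    max_level), larger objects land higher up (level 0 is the root)."""
    level, size = max_level, leaf_voxel_size
    while level > 0 and obj_size > size:
        size *= 2                          # parent voxels are twice as large
        level -= 1
    return level

print(level_for_object(0.5, 1.0, max_level=3))  # 3 (a leaf node)
print(level_for_object(3.0, 1.0, max_level=3))  # 1 (higher up)
```

Since an object occupies voxels of at most one level, insertion and deletion touch only a handful of similarly sized voxels, which is what reduces the update cost for larger objects.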

162
Hierarchical Grid (4)
  • Features
  • Deals with expanding scenes similar to
    interactive grid
  • Reduced insertion/deletion cost
  • Traversal cost somewhat higher than interactive
    grid

163
Test Scenes
164
Video
165
Measurements
  • We measure
  • Traversal cost of
  • Interactive grid
  • Hierarchical grid
  • Regular grid
  • Object update rates of
  • Interactive grid
  • Hierarchical grid

166
Framerate vs. Grid Size (Sphereflake)
167
Framerate vs. Grid Size (Triangles)
168
Framerate Over Time (Sphereflake)
169
Framerate Over Time (Triangles)
170
Conclusions
  • Interactive manipulation of ray traced scenes is
    both desirable and feasible using these
    modifications to grid and hierarchical grids
  • Slight impact on traversal cost
  • (More results available in course notes)

171
Overview
  • Introduction
  • Interactive ray tracer
  • Animation and interactive ray tracing
  • Sample re-use techniques

172
Sample Re-use Techniques
173
Brute Force Ray Tracing
  • Enables interactive ray tracing
  • Does not allow large image sizes
  • Does not scale to scenes with
    high depth complexity

174
Solution
  • Exploit temporal coherence
  • Re-use results from previous frames

175
Practical Solutions
  • Tapestry (Simmons et al. 2000)
  • Focuses on complex lighting simulation
  • Render cache (Walter et al. 1999)
  • Addresses scene complexity issues
  • Explained next
  • Parallel render cache (Reinhard et al. 2000)
  • Builds on Walter's render cache
  • Explained next

176
Render Cache Algorithm
  • Basic setup
  • One front-end for
  • Displaying pixels
  • Managing previous results
  • Parallel back-end for
  • Producing new pixels

177
Render Cache Front-end
  • Frame based rendering
  • For each frame do
  • Project existing points
  • Smooth image and display
  • Select new rays using heuristics
  • Request samples from back-end
  • Insert new points into point cloud
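The projection step can be sketched with a toy pinhole camera (focal length only; the camera model and names are simplifying assumptions): cached 3-D sample points are reprojected into the new view instead of being retraced.

```python
def reproject(points, camera):
    """Project cached 3-D sample points into the new view, keeping the
    nearest point per pixel (z-buffering the point cloud)."""
    image = {}
    for (x, y, z), color in points:
        if z <= 0:
            continue                      # behind the camera
        px = round(camera["f"] * x / z)   # perspective divide
        py = round(camera["f"] * y / z)
        if (px, py) not in image or z < image[(px, py)][0]:
            image[(px, py)] = (z, color)
    return {p: c for p, (_, c) in image.items()}

points = [((1.0, 0.0, 2.0), "red"), ((1.0, 0.0, 4.0), "blue")]
print(reproject(points, {"f": 100}))  # {(50, 0): 'red', (25, 0): 'blue'}
```

Pixels that no cached point lands on are the holes that the smoothing pass fills and the new-ray heuristics target.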

178
Render Cache
179
Render Cache (2)
  • Point reprojection is relatively cheap
  • Smooth camera movement for small images
  • Does not scale to large images or large numbers
    of renderers → front-end becomes a bottleneck

180
Parallel Render Cache
  • Aim: remove front-end bottleneck
  • Distribute point reprojection functionality
  • Integrate point reprojection with renderers
  • Front-end only displays results

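Distributing the reprojection means each renderer owns a fixed set of image tiles and handles only the points that land there. A minimal sketch of such a pixel-to-processor mapping (round-robin tile assignment is an illustrative choice, not necessarily the scheme used in the paper):

```python
def tile_owner(px, py, tile_size, ntiles_x, nprocs):
    """Map a pixel to the processor owning its tile.

    Tiles are tile_size x tile_size pixels, numbered row-major across
    ntiles_x columns and dealt round-robin over nprocs renderers.
    """
    tx, ty = px // tile_size, py // tile_size
    return (ty * ntiles_x + tx) % nprocs

# 64x64 image, 16-pixel tiles (4 tiles per row), 4 renderers:
owner = tile_owner(17, 0, tile_size=16, ntiles_x=4, nprocs=4)
```

Because ownership is fixed, a reprojected point can be routed to its tile's renderer without going through the front-end, which now only displays results. The fixed seams are also where the tile-boundary artifacts mentioned later come from.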
181
Parallel Render Cache (2)
182
Parallel Render Cache (3)
  • Features
  • Scalable behavior for scene complexity
  • Scalable in number of processors
  • Allows larger images to be rendered
  • Retains artifacts from render cache
  • Introduces new artifacts

183
Artifacts
  • Render cache artifacts at tile boundaries
  • Image deteriorates during camera movement
  • These artifacts are deemed more acceptable than
    loss of smooth camera movement!

184
Video
185
Test Scenes
186
Results
  • Sub-parts of algorithm measured individually
  • Measure time per call to subroutine
  • Sum over all processors and all invocations
  • Afterwards divide by number of processors and
    number of invocations
  • Results are measured in events per second per
    processor

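The normalization above can be written out directly. A hypothetical helper (the input numbers in the example are made up for illustration, not measured data):

```python
def events_per_sec_per_proc(total_time, nprocs, ninvocations):
    """total_time: summed wall time of one subroutine over all processors
    and all invocations. Average the per-call cost over processors and
    invocations, then invert to get events per second per processor."""
    avg_call = total_time / (nprocs * ninvocations)
    return 1.0 / avg_call

# e.g. 4 processors, 1000 calls each, 2 s of summed subroutine time:
rate = events_per_sec_per_proc(total_time=2.0, nprocs=4, ninvocations=1000)
# each call averages 0.5 ms, i.e. 2000 events/s/processor
```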
187
Scalability (Teapot Model)
188
Scalability (Room Model)
189
Samples Per Second
190
Reprojections Per Second
191
Conclusions
  • Exploitation of temporal coherence gives
    significantly smoother results than available
    with brute force ray tracing alone
  • This is at the cost of some artifacts which
    require further investigation
  • (More results available in course notes)

192
Acknowledgements
  • Thanks to
  • Steven Parker for writing the interactive ray
    tracer in the first place
  • Brian Smits, Peter Shirley and Charles Hansen for
    involvement in the animation and parallel point
    reprojection projects
  • Bruce Walter and George Drettakis for the render
    cache source code

193
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Rendering at Clemson / Distributed Computing and
    Spatial/Temporal Coherence
  • Interactive Ray Tracing
  • Parallel Rendering and the Quest for Realism:
    The Kilauea Massively Parallel Ray Tracer (Kato)
  • Summary / Discussion

194
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

195
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

196
Objective
  • Global illumination
  • Extremely complex scenes

197
Parallel Processing
  • Hardware
  • Multi-CPU machine
  • Linux PC cluster
  • Software
  • Threading (Pthread)
  • Message passing (MPI)

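The two layers combine naturally: threads share a task queue inside one machine, while MPI carries messages between machines. A Python threading analogue of the intra-machine half (the "trace" here is a placeholder, and real Kilauea workers are Pthreads exchanging MPI messages, not Python threads):

```python
import queue
import threading

def worker(tasks, results):
    """Worker thread: pull a bundle of rays, trace it, post results back."""
    while True:
        bundle = tasks.get()
        if bundle is None:                      # shutdown sentinel
            break
        results.put([ray * 2 for ray in bundle])  # placeholder for tracing

tasks, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for t in threads:
    t.start()
for bundle in ([1, 2], [3, 4]):                 # enqueue ray bundles
    tasks.put(bundle)
for _ in threads:                               # one sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()
```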
198
Our Render Farm
199
Global Illumination
  • Photon map

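A photon map estimates indirect illumination by density estimation over the photons nearest a shading point (Jensen's estimate). A naive brute-force sketch of the lookup, not Kilauea's parallel implementation (which distributes the map, as later slides show):

```python
import math

def irradiance_estimate(photons, x, k=3):
    """Gather the k photons nearest point x and divide their summed power
    by the area of the disc they cover. photons: list of (position, power)
    with 3-tuple positions; real systems use a kd-tree, not a full sort."""
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(p, x))

    nearest = sorted(photons, key=lambda ph: dist2(ph[0]))[:k]
    r2 = dist2(nearest[-1][0])          # squared radius of the gathered disc
    power = sum(ph[1] for ph in nearest)
    return power / (math.pi * r2) if r2 > 0 else 0.0

photons = [((1, 0, 0), 1.0), ((0, 1, 0), 1.0), ((0, 2, 0), 1.0)]
e = irradiance_estimate(photons, (0, 0, 0), k=2)  # = 2 / (pi * 1^2)
```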
200
Ray Tracing Renderer
201
Ray Tracing Renderer
202
Ray Tracing Renderer
203
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

204
Parallel Ray Tracing
  • Simple case
  • Complex case

205
Parallel Ray Tracing
  • Simple case
  • Complex case

206
Accel Grid
207
Simple Case (scene distribution)
208
Simple Case (ray tracing)
209
Parallel Ray Tracing
  • Simple case
  • Complex case

210
Complex Case (scene distribution)
211
Complex Case (accel grid construction)
212
Complex Case (ray tracing)
213
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

214
Parallel Photon Mapping
  • Photon trace
  • Photon lookup

215
Parallel Photon Mapping
  • Photon trace
  • Photon lookup

216
Photon Tracing (simple case)
217
Photon Tracing (complex case)
218
Parallel Photon Mapping
  • Photon trace
  • Photon lookup

219
Photon Lookup (simple case)
220
Photon Lookup (complex case)
221
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

222
Task
  • Mtask
  • Wtask
  • Btask
  • Stask
  • Rtask
  • Atask
  • Etask
  • Ltask
  • Ptask
  • Otask

223
Task Assignment
224
Roles of Tasks
225
Task Configuration
226
Task Configuration
227
Task Configuration
228
Task Interaction
229
Task Interaction
230
Task Interaction
231
Task Interaction
232
Task Interaction
233
Task Interaction
234
Task Interaction (simple case)
235
Roles of Tasks (photon map)
236
Task Configuration (photon map)
237
Task Configuration (photon map)
238
Task Interaction (photon map)
239
Task Interaction (photon map)
240
Task Interaction (photon map)
241
Task Interaction (photon map)
242
Task Configuration (simple photon)
243
Task Priority
244
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

245
Parallel Shading Problem
246
Parallel Shading Problem
247
Parallel Shading Problem (solution)
248
Parallel Shading Problem (solution)
249
Parallel Shading Problem (solution)
250
Parallel Shading Problem (solution)
251
Parallel Shading Problem (solution)
252
Parallel Shading Problem (solution)
253
Parallel Shading Problem (solution)
254
Parallel Shading Problem (solution)
255
Decomposing Shading Computation
256
Decomposing Shading Computation
257
Decomposing Shading Computation
258
SPOT
259
SPOT Condition
260
Parallel Shading Solution using SPOT
261
Parallel Shading Solution using SPOT
262
Shader SPOT Network Example
263
Outline
  • What is Kilauea?
  • Parallel ray tracing & photon mapping
  • Kilauea architecture
  • Shading logic
  • Rendering results

264
Rendering Results
  • Test machine specification
  • 1GHz Dual Pentium III
  • 512Mbyte memory
  • 100BaseT Ethernet
  • 18 machines connected via 100BaseT switch

265
Quatro
  • 700,223 triangles, 1 area point sky light, 1280
    x 692
  • 18 machines: 7min 19sec

266
Quatro single Atask test
267
Jeep
  • 715,059 triangles, 1 directional sky light,
    1280 x 692
  • 18 machines: 8min 27sec

268
Jeep4
  • 2,859,636 triangles, 1 directional sky light,
    1280 x 692
  • 18 machines: 12min 38sec, 2 Atasks x 1

269
Jeep4 2 Atasks test
  • 1 Atask group: 2 machines

270
Jeep8
  • 5,719,072 triangles, 1 directional sky light,
    1280 x 692
  • 16 machines: 18min 43sec, 4 Atasks x 4

271
Escape POD
  • 468,321 triangles, 1 directional sky light,
    1280 x 692
  • 18 machines: 14min 55sec

272
ansGun
  • 20,279 triangles, 1 spot sky light, 1280 x 960
  • 18 machines: 16min 38sec

273
SCN101
  • 787,255 triangles, 1 area light, 1280 x 692
  • 18 machines: 9min 10sec

274
Video
275
Conclusion / Future Work
  • We achieved
  • Close to linear parallel performance
  • Highly extensible architecture
  • We will achieve even more
  • Speed
  • Stability
  • Usability (user interface)
  • Etc.

276
Additional Information
  • Kilauea live rendering demo
  • BOOTH 1927 SquareUSA
  • http://www.squareusa.com/kilauea/

277
Schedule
  • Introduction
  • Parallel / Distributed Rendering Issues
  • Classification of Parallel Rendering Systems
  • Practical Applications
  • Summary / Discussion (Chalmers)

278
Summary
279
Contact Information
  • Alan Chalmers
  • alan_at_cs.bris.ac.uk
  • Tim Davis
  • tadavis_at_cs.clemson.edu
  • Toshi Kato
  • http//www.squareusa.com/kilauea/
  • Erik Reinhard
  • reinhard_at_cs.utah.edu
  • Slides
  • http//www.cs.clemson.edu/tadavis