Title: A Coherent Grid Traversal Algorithm for Volume Rendering
UCL Department of Computer Science
- Ioannis Makris
- Supervisors: Philipp Slusallek, Céline Loscos
- Computer Graphics Lab, Universität des Saarlandes
Overview
- Introduction
- Previous work in software Direct Volume Rendering
- Introduction to the Cell Broadband Engine
- The Coherent Grid Traversal Algorithm
- Parallelisation Schemes
Introduction to Direct Volume Rendering
- Technique of displaying a 2D projection of a 3D sampled dataset (a volume), by accumulating samples along lines of sight with some transfer function.
- There are several types of sampled data; we will only deal with rectilinear grids.
Direct Volume Rendering
- Ray Casting (Levoy 1988, 1990)
- Image order algorithm
- Splatting (Westover 1990)
- Object order
- Shear Warp (Lacroute 1994, 1996)
- Hybrid order
Ray Casting
- Cast a ray from the viewpoint into the volume for every pixel.
- Obtain samples from the volume at equal intervals by trilinearly interpolating neighbouring voxels; accumulate them with some operator to get the final colour.
- Several acceleration techniques have been suggested: early ray termination (Levoy 1990), adaptive sampling, octrees (Ogata et al. 1998), kd-trees (Wald et al. 2005).
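The sampling and accumulation step above can be sketched as follows. This is a minimal illustration, not the talk's implementation: the demo volume, the greyscale transfer function, and all names are hypothetical.

```c
#include <stddef.h>

typedef struct { float r, g, b, a; } RGBA;

/* hypothetical demo volume: a 2x2x2 grid with voxel values 0..7 */
static const float demo_vol[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };

/* Trilinearly interpolate vol (nx*ny*nz voxels) at point (x, y, z). */
float sample_trilinear(const float *vol, int nx, int ny, int nz,
                       float x, float y, float z)
{
    int ix = (int)x, iy = (int)y, iz = (int)z;
    float fx = x - ix, fy = y - iy, fz = z - iz;
    (void)nz;  /* extent is assumed checked by the caller in this sketch */
#define V(i, j, k) vol[((size_t)(k) * ny + (j)) * nx + (i)]
    float c00 = V(ix, iy,     iz    ) * (1 - fx) + V(ix + 1, iy,     iz    ) * fx;
    float c10 = V(ix, iy + 1, iz    ) * (1 - fx) + V(ix + 1, iy + 1, iz    ) * fx;
    float c01 = V(ix, iy,     iz + 1) * (1 - fx) + V(ix + 1, iy,     iz + 1) * fx;
    float c11 = V(ix, iy + 1, iz + 1) * (1 - fx) + V(ix + 1, iy + 1, iz + 1) * fx;
#undef V
    float c0 = c00 * (1 - fy) + c10 * fy;
    float c1 = c01 * (1 - fy) + c11 * fy;
    return c0 * (1 - fz) + c1 * fz;
}

/* Accumulate one greyscale sample with opacity `a`, front to back;
   the caller stops once acc->a nears 1 (early ray termination). */
void composite(RGBA *acc, float intensity, float a)
{
    float w = (1.0f - acc->a) * a;
    acc->r += w * intensity;
    acc->g += w * intensity;
    acc->b += w * intensity;
    acc->a += w;
}
```

A ray caster calls `sample_trilinear` at equal intervals along each ray and feeds the samples to `composite` until the accumulated opacity saturates.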
Shear-Warp
- Considered the fastest known Direct Volume Rendering algorithm.
- Steps:
  - Transform the volume to sheared object space
  - Project the sheared slices onto an intermediate image
  - Warp the intermediate image to image space
- Requires 3 copies of the data, one for every principal axis, but RLE compression can help.
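The shear step above can be sketched for a parallel projection. This uses one common sign convention and illustrative names; it is not Lacroute's code, and the final 2D warp is omitted.

```c
/* Shear factors for a parallel projection with z as the principal
   axis: after shifting slice k by (sx * k, sy * k), all viewing rays
   are parallel to the z axis, so each slice projects axis-aligned
   onto the intermediate image. */

typedef struct { float sx, sy; } Shear;

Shear shear_from_view(float dx, float dy, float dz)
{
    Shear s;
    s.sx = -dx / dz;   /* per-slice shift in x */
    s.sy = -dy / dz;   /* per-slice shift in y */
    return s;
}
```

Because the shifts are the same for every slice, the projection loop becomes a simple 2D resample-and-composite per slice, which is what makes Shear-Warp fast.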
Characteristics of modern x86 processors
- Deep instruction pipeline.
- Very sophisticated hardware branch prediction
- 2 levels of cache, supports software prefetching
- Rich SIMD instruction set
The Cell processor
- Developed jointly by IBM, Sony and Toshiba.
- Combines a PowerPC general-purpose processor with 8 separate SIMD execution units (SPUs).
- Exceptional FLOPS/cost ratio, and more powerful than the Itanium!
- Needs fast memory, which is relatively expensive.
Notable Characteristics of the SPUs
- Software managed local store (i.e. no caches)
- No branch prediction, expensive branch misses
- SIMD loads/stores ONLY
- Favours streaming code
Motivation for a new algorithm
- Ray casting algorithms are typically not cache friendly, and performance depends on the viewing axis.
- Acceleration structures may produce non-streaming code and several overheads.
- Shear-Warp may require too much memory for certain data.
A Coherent Grid Traversal Algorithm for Volume Rendering (1)
- Original idea from Ray Tracing Animated Scenes using Coherent Grid Traversal (Wald et al., SIGGRAPH 2006).
- Bundles (frusta) of coherent rays are traced in grid space by incrementally computing their overlap with grid slices. The overlap of the frustum is computed with a single SIMD addition and a single SIMD truncation.
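The incremental overlap computation can be sketched in scalar C over four lanes; in the SIMD version the loop body collapses into one vector add followed by one vector truncation. The struct layout and names are illustrative, after Wald et al. 2006.

```c
typedef struct {
    float u[4], v[4];    /* corner-ray crossing points on the current slice */
    float du[4], dv[4];  /* per-slice increments (dir / dir_major)          */
} Frustum4;

/* Advance the four corner rays one slice along the major axis and
   return the range of grid cells [*u0, *u1] x [*v0, *v1] that the
   frustum overlaps on the new slice. */
void next_slice_overlap(Frustum4 *f, int *u0, int *u1, int *v0, int *v1)
{
    float umin = 1e30f, umax = -1e30f, vmin = 1e30f, vmax = -1e30f;
    for (int i = 0; i < 4; ++i) {       /* one SIMD add in practice */
        f->u[i] += f->du[i];
        f->v[i] += f->dv[i];
        if (f->u[i] < umin) umin = f->u[i];
        if (f->u[i] > umax) umax = f->u[i];
        if (f->v[i] < vmin) vmin = f->v[i];
        if (f->v[i] > vmax) vmax = f->v[i];
    }
    *u0 = (int)umin; *u1 = (int)umax;   /* one SIMD truncation */
    *v0 = (int)vmin; *v1 = (int)vmax;
}
```

Because only the four corner rays are advanced, the per-slice cost is independent of how many rays the bundle contains.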
A Coherent Grid Traversal Algorithm for Volume Rendering (2)
- The volume rendering version of the algorithm uses a bricked volume (Sakas et al. 1994); bricks replace the grid elements.
- Bricks are referenced by 3 maps, one for each principal axis.
- Compression is achieved by not storing empty bricks.
A Coherent Grid Traversal Algorithm for Volume Rendering (3)
A Coherent Grid Traversal Algorithm for Volume Rendering (4)
- Traversal is performed along the principal axis, using the corresponding map.
- Indices are computed incrementally.
- If all the overlapping bricks of a slice are empty, the slice is skipped.
- If only some bricks are empty, they are associated with a locally stored empty brick and processed redundantly (but not fetched).
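The per-slice brick handling above can be sketched as follows. The assumption (labelled here, not stated on the slide) is that the per-axis map stores NULL for empty bricks; every name is illustrative.

```c
#include <stddef.h>

#define BRICKS_U 8
#define BRICKS_V 8

static float empty_brick[16] = { 0 };  /* one shared, locally stored empty brick */

/* Gather the bricks a frustum overlaps on one slice. Returns 0 if every
   overlapped map entry is NULL (the whole slice can be skipped); else
   fills `bricks` with one pointer per overlapped brick, aliasing the
   empty ones to `empty_brick` so the sampling loop can stream over them
   uniformly without fetching anything. */
int gather_slice_bricks(float *map[BRICKS_V][BRICKS_U],
                        int u0, int u1, int v0, int v1,
                        const float *bricks[])
{
    int any = 0, n = 0;
    for (int v = v0; v <= v1; ++v)
        for (int u = u0; u <= u1; ++u) {
            const float *b = map[v][u];
            if (b) any = 1;
            bricks[n++] = b ? b : empty_brick;
        }
    return any;
}
```

Processing the aliased empty bricks redundantly trades a few wasted samples for branch-free, streaming inner loops, which suits the SPUs described earlier.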
A Coherent Grid Traversal Algorithm for Volume Rendering (examples)
Bundle Parallelisation
- Bundle parallelisation is trivial: in an x86 C OpenMP implementation, it only required 1 line of code.
- It is possible for some blocks to be fetched multiple times by neighbouring bundles.
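The "1 line of code" would look roughly like this in a C/OpenMP renderer; `render_bundle`, the flag array, and the bundle count are stand-ins, not the talk's code.

```c
static int rendered[64];  /* per-bundle completion flags (demo only) */

/* Stand-in for tracing one frustum of coherent rays. */
static void render_bundle(int b)
{
    rendered[b] = 1;
}

void render_frame(int num_bundles)
{
    /* the single added line: bundles are independent, so a plain
       parallel-for distributes them over the available cores */
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < num_bundles; ++b)
        render_bundle(b);
}
```

Dynamic scheduling is a reasonable choice here because bundles that traverse empty regions finish much earlier than dense ones.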
Slice Parallelisation
- A slice parallelisation is less likely to exhibit this problem, but the traversal of brick slices is no longer incremental!
- So, how does a processing element know which bundles to process for a given slice?
Slice Parallelisation
- Most bundles will start on the first slice or end on the last slice (or both).
- During tracing, we create 2 vectors of references to bundles, which we shall call A and D, along with 2 index tables for the corresponding slices, which we shall call P and Q.
- The bundles that run through a given slice s can then be expressed in terms of these tables.
- Only 2 memory reads are required for that, or no memory reads if the bundles are large enough for A and D to fit in the cache/local store.
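A hedged reconstruction of this lookup (the slide's exact formula is not reproduced here): assume A holds the bundles entering on the first slice, sorted by the last slice they intersect in descending order, so the bundles of A still active at slice s form a prefix A[0 .. P[s]-1]; D is the symmetric list for bundles leaving on the last slice, with table Q. Reading P[s] and Q[s] would then be the "only 2 memory reads". All names follow the slide; the sorting convention is an assumption.

```c
typedef struct { int first, last; } Bundle;  /* slice span of a bundle */

/* Build P for A (A sorted by descending `last`): P[s] is the number of
   leading entries of A whose span still covers slice s, i.e. the length
   of the active prefix. */
void build_prefix_table(const Bundle *A, int nA, int num_slices, int *P)
{
    for (int s = 0; s < num_slices; ++s) {
        int n = 0;
        while (n < nA && A[n].last >= s)
            ++n;
        P[s] = n;
    }
}
```

With the tables precomputed during tracing, a processing element assigned slice s touches only P[s] and Q[s] at render time.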
Slice Parallelisation
- The remaining bundles can make up to 33% of the total (about 14% on average).
- We use two more lists, which we shall call S and E, with index tables M and N. S holds references to the remaining bundles sorted by the first slice they intersect, and E sorted by the last.
- The remaining bundles that run through s are then found analogously.
- We need to run through both these lists to find that out, but this does not hurt performance.
A notable problem of the CGT algorithm as described in Wald 2006
- When the roll angle of the bundles relative to the volume is close to π/4, the number of block fetches can be double the number required.
- There is a good solution to that (not yet published).
Results
- First results demonstrated a speed increase of up to 2 orders of magnitude over ray casting.
- This may increase with further optimisations.
Conclusion
- We have developed a scalable algorithm for coherent volume traversal, with performance on par with Shear-Warp but with reduced memory requirements.
- We demonstrated parallel implementations.
Future Work
- Investigate mixed parallelisation schemes
- Optimise the computation performed per brick.
The End
- Thank you for your attention
- Questions?