Steven F. Ashby Center for Applied Scientific Computing Month DD, 1997 - PowerPoint PPT Presentation

About This Presentation

Slides: 67

Provided by: Computations

Learn more at: http://vis.computer.org

Transcript and Presenter's Notes

1
Mesh Layouts for Block-Based Caches
Sung-Eui Yoon, Peter Lindstrom
Lawrence Livermore National Laboratory
2
Goal

Provide cache-coherent layouts of meshes and
graphs
Derive metrics that measure cache-coherence of
layouts
Generality
Simplicity
Efficiency
Accuracy

3
Cache-Coherent Metrics

Measure the expected number of cache misses of a
layout given block-based caches
Should correlate well with the observed number of
cache misses
Cache-aware metrics
Measure cache-coherence given known cache
parameters (e.g., block size)
Cache-oblivious metrics
Consider all possible cache parameters

4
Motivation

Lower growth rate of data access speed

[Figure: accumulated growth rate during 1993 - 2004 (log scale); values labeled 130X, 46X, 20X, and 1.5X, with one annotation reading "during 99 - 04"]
Courtesy: Anselmo Lastra, http://www.hcibook.com/e3/online/moores-law/
5
Memory Hierarchies and Block-Based Caches
[Figure: memory hierarchy from the CPU through fast memory (cache) and slow memory to disk, with data moved between levels by block transfer]
Access times shown: 10^-8 sec, 10^-7 sec, 10^-2 sec
6
Main Contributions

Propose novel and practical cache-aware and
cache-oblivious metrics
Derive metrics given block-based caches
Propose efficient cache-coherent layout
constructions
Apply to different applications

7
Related Work

Computation reordering
Data layout optimization

8
Computational Reordering

Cache-aware [Coleman and McKinley 95, Vitter 01,
Sen et al. 02]
Cache-oblivious [Frigo et al. 99, Arge et al. 04]

Focus on specific problems such as sorting and
linear algebra computations
9
Data Layout Optimization

Graph and matrix layout [Diaz et al. 02]
Minimum linear arrangement (MLA), bandwidth, and
wavefront, etc.
Space-filling curves
[Sagan 94, Pascucci and Frank 01, Lindstrom and
Pascucci 01, Gopi and Eppstein 04]
Rendering and processing sequences
[Deering 95, Hoppe 99, Bogomjakov and Gotsman 02,
Isenburg and Lindstrom 05]
Cache-oblivious mesh layout
[Yoon et al. 05]

10
Outline

Computation models
Cache-aware and cache-oblivious metrics
Results

11
Outline

Computation models
Cache-aware and cache-oblivious metrics
Results

12
General Framework of Layout Computation
Input: directed graph, G = (N, A), with nodes n_a, n_b, n_c, n_d
Cache-coherent metric
Layout algorithm, f
Output: 1D layout, f(N): n_a, n_d, n_c, n_b, ..
13
Two-Level I/O Model [Aggarwal and Vitter 88]
Input: directed graph with nodes n_a, n_b, n_c, n_d
Layout algorithm
Cache: M cache blocks, whose size is B
1D layout with block size 3: n_a, n_d, n_c, n_b, ..
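To make the model concrete, here is a minimal sketch that counts cache misses for a sequence of node accesses under a 1D layout, with M cache blocks of B nodes each. The function name, the access sequence, and the LRU replacement policy are assumptions made for illustration; the slides do not specify a replacement policy.

```python
from collections import OrderedDict

def count_cache_misses(layout, accesses, block_size, num_blocks):
    """Count misses for node accesses under a 1D layout.

    layout: dict mapping node -> 0-based position in the 1D layout.
    accesses: iterable of nodes in the order they are accessed.
    block_size: B, the number of nodes per cache block.
    num_blocks: M, the number of blocks the cache can hold.
    LRU replacement is an assumption of this sketch.
    """
    cache = OrderedDict()  # block index -> None, ordered by recency of use
    misses = 0
    for node in accesses:
        block = layout[node] // block_size
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
        else:
            misses += 1
            cache[block] = None
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses

# Layout order taken from the slide: na, nd, nc, nb with block size 3.
layout = {"na": 0, "nd": 1, "nc": 2, "nb": 3}
walk = ["na", "nd", "nc", "nb", "na"]  # hypothetical access sequence
misses_m1 = count_cache_misses(layout, walk, block_size=3, num_blocks=1)
misses_m2 = count_cache_misses(layout, walk, block_size=3, num_blocks=2)
```

With a single cache block, returning from nb to na reloads the first block, so the walk incurs one more miss than with two blocks.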
14
Graph Representation

Directed graph, G = (N, A)
Represent access patterns between nodes
Nodes, N
Data element
(e.g., mesh vertex or mesh triangle)
Directed arcs, A
Connects two nodes if they are accessed
sequentially

[Figure: example graph with nodes n_a, n_b, n_c, n_d]
15
Weights of Nodes and Arcs

Indicate probabilities that each element will be
accessed
Computed in an equilibrium state during infinite
random walks
Assume that applications infinitely access the
data according to the input graph
Correspond to the dominant eigenvector of the
probability transition matrix
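As an illustration of how such weights could be computed, the sketch below runs power iteration on a small row-stochastic transition matrix to approximate the equilibrium node probabilities, then derives arc weights as the node probability times the transition probability. The 3-node matrix and all names are hypothetical, and power iteration is only one possible way to obtain the equilibrium distribution.

```python
def stationary_distribution(transition, iterations=200):
    """Approximate the equilibrium distribution of a random walk.

    transition: row-stochastic matrix as nested lists, where
    transition[i][j] is the probability of moving from node i to node j.
    """
    n = len(transition)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        # One step of the walk: p_next[j] = sum_i p[i] * transition[i][j]
        p = [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return p

# Hypothetical 3-node graph (irreducible and aperiodic, so the walk converges).
transition = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
node_weights = stationary_distribution(transition)

# Arc weight: probability of being at node i and then traversing arc (i, j).
arc_weights = {
    (i, j): node_weights[i] * transition[i][j]
    for i in range(3)
    for j in range(3)
    if transition[i][j] > 0
}
```

For this matrix the node weights converge to 4/9, 1/3, and 2/9, and the arc weights sum to 1.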

16
Cache-Coherence of a Layout given Block-Based
Caches

Expected number of cache misses of a layout
Probability of accessing a node from another node by
traversing an arc
Conditional probability that we will have a cache
miss given the above access pattern

[Figure: example graph with nodes n_a, n_b, n_c, n_d]
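The labels on this slide annotate a formula. One reading consistent with them, with the symbol names chosen here as assumptions, sums over all arcs the probability of traversing the arc times the conditional probability of a cache miss given that traversal:

```latex
\mathrm{ECM}(f) \;=\; \sum_{(i,j) \in A} \Pr(i \to j) \cdot \Pr\bigl(\mathrm{miss} \mid i \to j\bigr)
```

Here f is the layout, A is the set of directed arcs, and ECM is a hypothetical name for the expected number of cache misses.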
17
Specialization to Meshes

Expected number of cache misses of a layout
Probability of accessing a node from another node by
traversing an arc: constant, assuming
1. Two opposite directed arcs for each mesh edge
2. Uniform distribution to access adjacent nodes
given a node
Conditional probability that we will have a cache
miss given the above access pattern

[Figure: an input mesh and its implicitly derived graph, each with nodes n_a, n_b, n_c, n_d]
18
Outline

Computation models
Cache-aware and cache-oblivious metrics
Results

19
Four Different Cases
Cache-aware case: single cache block, M = 1
Cache-oblivious case: single cache block, M = 1
Cache-aware case: multiple cache blocks, M > 1
Cache-oblivious case: multiple cache blocks, M > 1
20
Cache-Aware: Single Cache Block, M = 1
Input: directed graph with nodes n_a, n_b, n_c, n_d
Cache, whose block size is B
1D layout with block size 3: n_a, n_d, n_c, n_b, ..
Straddling arcs: arcs whose two nodes lie in different blocks of the layout
21
Cache-Aware: Multiple Cache Blocks, M > 1
Input: directed graph with nodes n_a, n_b, n_c, n_d
Cache
1D layout with block boundary: n_a, n_d, n_c, n_b, ..
Straddling arcs
22
Final Cache-Aware Metric

Counts the number of straddling arcs of the
layout given a block size B

Defined using:
Block index containing the node, i
Unit step function: 1 if x > 0, 0 otherwise.
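A minimal sketch of this count, assuming the layout is given as a map from node to 0-based position and the block index is the position divided by B, rounded down. The arc list is hypothetical, since the arcs of the example graph are not recoverable from the transcript.

```python
def straddling_arcs(layout, arcs, block_size):
    """Count arcs whose two endpoints fall in different blocks of size B."""
    count = 0
    for i, j in arcs:
        block_i = layout[i] // block_size  # block index containing node i
        block_j = layout[j] // block_size  # block index containing node j
        # Unit step of the block-index difference: 1 if the blocks differ.
        count += 1 if abs(block_i - block_j) > 0 else 0
    return count

# Layout order from the slides: na, nd, nc, nb with block size 3.
layout = {"na": 0, "nd": 1, "nc": 2, "nb": 3}
# Hypothetical arcs for illustration.
arcs = [("na", "nb"), ("nb", "nc"), ("nc", "nd"), ("nd", "na")]
num_straddling = straddling_arcs(layout, arcs, block_size=3)
```

Only the two arcs that touch nb cross the block boundary here, so the count is 2. Dividing by the number of arcs would turn the count into a fraction of straddling arcs.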
23
High Accuracy of Cache-Aware Metric
Tested block size: 4KB

Linear correlation [-1, 1] with the observed number of cache misses:
                     With 5 cache blocks   With 25 cache blocks
Cache-aware metric   0.97                  0.97

[Figure: Z-curve on a uniform grid]
Tested layouts: Z-curve, Hilbert curve, H-order,
minimum linear arrangement layout, βΩ-layout,
geometric CO layout, (bi or uni) row-by-row, (bi
or uni) diagonal layouts
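For readers unfamiliar with the Z-curve, one of the tested layouts, the sketch below builds a Z-curve (Morton order) layout of a uniform grid by interleaving the bits of the cell coordinates. The function names and the choice of putting x in the even bits are assumptions of this illustration, not details taken from the slides.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y (x in even bits, y in odd bits)."""
    index = 0
    for b in range(bits):
        index |= ((x >> b) & 1) << (2 * b)
        index |= ((y >> b) & 1) << (2 * b + 1)
    return index

def z_curve_layout(width, height):
    """Map each grid cell (x, y) to its position along the Z-curve."""
    cells = [(x, y) for y in range(height) for x in range(width)]
    cells.sort(key=lambda cell: morton_index(cell[0], cell[1]))
    return {cell: position for position, cell in enumerate(cells)}

grid_layout = z_curve_layout(4, 4)
```

The resulting dictionary can be passed to a metric or a cache simulation in place of a row-by-row layout of the same grid.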
24
Four Different Cases
Cache-aware case: single cache block, M = 1
Cache-oblivious case: single cache block, M = 1
Cache-aware case: multiple cache blocks, M > 1
Cache-oblivious case: multiple cache blocks, M > 1
25
Cache-Oblivious: Single Cache Block, M = 1
Does not assume a particular block size
Then, what are good representatives for block sizes?
26
Two Possible Block Size Progressions

Arithmetic progression
1, 2, 3, 4, ...
Geometric progression
2^0, 2^1, 2^2, 2^3, ...
Well reflects current caching architectures
E.g., L1: 32B, L2: 64B, Page: 4KB, etc.

27
Probability that an Arc is a Straddling Arc
Computed as a function of arc length, l
Is an arc straddling given a block size?
[Figure: 1D layout n_a, n_d, n_c, n_b, .. with an arc of length l = 2]
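A small sketch of this probability, under the assumption that every alignment of the arc relative to the block boundaries is equally likely. It enumerates the possible starting offsets within a block and counts those for which the two endpoints land in different blocks; under that assumption the result equals min(l / B, 1). The function name is hypothetical.

```python
from fractions import Fraction

def straddle_probability(arc_length, block_size):
    """Fraction of block alignments for which an arc crosses a block boundary.

    The arc connects positions p and p + arc_length, where the offset p
    ranges over every position within a block, each assumed equally likely.
    """
    straddling = sum(
        1
        for p in range(block_size)
        if p // block_size != (p + arc_length) // block_size
    )
    return Fraction(straddling, block_size)

prob_small_block = straddle_probability(2, 3)  # arc length 2, block size 3
prob_large_block = straddle_probability(2, 8)  # arc length 2, block size 8
```

An arc of length 2 straddles a boundary for 2 of the 3 alignments when B = 3, but only 2 of 8 when B = 8, so shorter arcs and larger blocks both lower the probability.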
28
Two Cache-Oblivious Metrics

Arithmetic cache-oblivious metric: the arithmetic mean of arc lengths (MLA metric)
Geometric cache-oblivious metric: the geometric mean of arc lengths

Both are defined over the arc length of each arc (i, j).
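The two means can be sketched as follows, taking the arc length of arc (i, j) to be the distance between the positions of i and j in the 1D layout. The function names and the example arcs are assumptions for illustration; the geometric mean is computed through logarithms to avoid overflow on large graphs.

```python
import math

def arc_lengths(layout, arcs):
    """Distance in the 1D layout between the endpoints of each arc."""
    return [abs(layout[i] - layout[j]) for i, j in arcs]

def arithmetic_co_metric(layout, arcs):
    """Arithmetic mean of arc lengths."""
    lengths = arc_lengths(layout, arcs)
    return sum(lengths) / len(lengths)

def geometric_co_metric(layout, arcs):
    """Geometric mean of arc lengths, computed as exp(mean(log(length)))."""
    lengths = arc_lengths(layout, arcs)
    return math.exp(sum(math.log(length) for length in lengths) / len(lengths))

layout = {"na": 0, "nd": 1, "nc": 2, "nb": 3}
# Hypothetical arcs between distinct nodes, so every length is at least 1.
arcs = [("na", "nb"), ("nb", "nc"), ("nc", "nd"), ("nd", "na")]
arithmetic_value = arithmetic_co_metric(layout, arcs)
geometric_value = geometric_co_metric(layout, arcs)
```

With arc lengths 3, 1, 1, 1 the arithmetic mean is 1.5 while the geometric mean is the fourth root of 3, about 1.32: the geometric mean penalizes the single long arc less heavily.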
29
Validation for Cache-Oblivious (CO) Metrics
73% of tested power-of-two block sizes
97% of tested block sizes

Geometric cache-oblivious metric
Practical and useful

[Figure: the number of cache misses when M = 1 (log scale), comparing the geometric CO layout and the arithmetic CO layout]
30
Correlations between Metrics and Observed Number
of Cache Misses
Tested block size: 4KB

Linear correlation [-1, 1] with the observed number of cache misses:
                       With 1 cache block   With 5 cache blocks
Geometric CO metric    0.98                 0.81
Arithmetic CO metric   -0.19                -0.32

Tested with 10 different layouts on a uniform grid
31
Efficient Layout Computation for Our Metrics

Cache-aware layouts
Optimized with cache-aware metric given a block
size B
Computed from the graph partitioning
Geometric cache-oblivious metric
Very efficient
Can be used in different layout methods

32
Layout Computation with Geometric Cache-Oblivious
Metric

Multi-level construction method
Partition an input mesh into k different sets
Lay out partitions based on our metric

Generalized layout method
for unstructured meshes

1. Partition
2. Lay out
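To illustrate the "lay out" step on a very small scale, the sketch below tries every ordering of a handful of nodes (standing in for partitions) and keeps the one with the smallest geometric mean arc length. Exhaustive search over permutations is an assumption chosen here for clarity, not necessarily what the authors' implementation does, and it is only feasible for a few items at a time.

```python
import itertools
import math

def geometric_mean_arc_length(order, arcs):
    """Geometric mean of arc lengths when nodes are placed in the given order."""
    position = {node: p for p, node in enumerate(order)}
    logs = [math.log(abs(position[i] - position[j])) for i, j in arcs]
    return math.exp(sum(logs) / len(logs))

def best_ordering(nodes, arcs):
    """Exhaustively search all orderings; practical only for a few nodes."""
    return min(
        itertools.permutations(nodes),
        key=lambda order: geometric_mean_arc_length(order, arcs),
    )

# Hypothetical example: four partitions connected in a chain a-b-c-d.
nodes = ["c", "a", "d", "b"]
arcs = [("a", "b"), ("b", "c"), ("c", "d")]
best = best_ordering(nodes, arcs)
```

For a chain, the best ordering places connected partitions next to each other, giving every arc length 1 and a geometric mean of 1.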
33
Outline

Computation models
Cache-aware and cache-oblivious metrics
Results

34
Applications

Isosurface extraction
View-dependent rendering

35
Iso-Surface Extraction
Spx model (140K vertices)

Uses contour tree [van Kreveld et al. 97]
Runtime is dominated by the traversal of
iso-surface
Layout graph
Use an input tetrahedral mesh

36
High Correlation with Number of Cache Misses
Tested block size: 4KB

Linear correlation [-1, 1] with the observed number of cache misses:
                      With 1 cache block   With 10K cache blocks
Geometric CO metric   0.99                 0.98

Tested with 8 different layouts: our geometric
CO, our cache-aware, breadth-first (and
depth-first) layouts, spectral [Juvan and Mohar
92], cache-oblivious mesh [Yoon et al. 05],
Z-curve [Sagan 94], X-axis sorted layouts
37
High Correlation with Runtime Performance
Memory access time is major bottleneck
Disk I/O time is major bottleneck

Linear correlation [-1, 1] with runtime:
                      First iso-surface extraction time   Second iso-surface extraction time
Geometric CO metric   0.94                                0.94
38
Comparison with Other Layouts
The first iso-surface extraction time (sec)
8 - 77% improvement and very close to the
cache-aware performance
39
View-Dependent Rendering

Lay out vertices and triangles of progressive
meshes
Used in an efficient VDR system [Yoon et al. 04]
Reduce misses in GPU vertex cache

40
Cache Miss Ratio on Bunny Model
[Figure axes: GPU vertex cache miss ratio, vertex cache size]
Compared layouts:
Universal rendering seq. [Bogomjakov and Gotsman 02]
Hoppe [Hoppe 99]
Theoretical lower bound [Bar-Yehuda and Gotsman 96]
Geometric CO layout
41
Cache Miss Ratio on Power Plant Model
[Figure axes: GPU vertex cache miss ratio, vertex cache size]
Compared layouts:
Z-curve
COML [Yoon et al. 05]
Hoppe's rendering seq. [Hoppe 99]
Theoretical lower bound [Bar-Yehuda and Gotsman 96]
Geometric CO layout
42
Conclusion

Novel cache-aware and cache-oblivious metrics to
evaluate layouts
Derived metrics based on two-level I/O model
Improved the performance of applications without
modifying codes

OpenCCL, open source library: http://gamma.cs.unc.edu/COL/OpenCCL
43
Ongoing and Future Work

Derive a lower bound on our geometric
cache-oblivious metric
Employ mesh compression to further reduce disk
I/O accesses
Investigate efficient layout method for deforming
models
Apply to non-graphics applications
e.g., shortest path or other graph computations

44
Cache-Efficient Layouts of Bounding Volume
Hierarchies

[Yoon and Manocha, Eurographics 06]

Ray tracing
Collision detection
45
Acknowledgements

Ajith Mascarenhas
Martin Isenburg
Dinesh Manocha
Fabio Bernardon, Joao Comba, and Claudio Silva
For their unstructured tetrahedra rendering
program
Members of data analysis group in LLNL
Anonymous reviewers

46
UCRL-PRES-225448
This work was performed under the auspices of the
U.S. Department of Energy by University of
California Lawrence Livermore National Laboratory
under contract No. W-7405-ENG-48.
47
Additional slides
48
Cache-Coherence of Layouts

Well known heuristics for cache-coherent layouts
Space-filling curves [Sagan 94]
How can we compute better layouts?
Requires metrics measuring cache-coherence of
layouts

49
Main Results

Define cache-coherence of layout as
Expected number of cache misses during random
walks of a graph given block-based caches
Then, the expected number of cache misses is the
Number of straddling arcs in a cache-aware case
Geometric mean of arc lengths in a
cache-oblivious case

50
Data Layout Optimization

Rendering sequences
Triangle strips
[Deering 95, Hoppe 99, Bogomjakov and Gotsman 02]
Processing sequences
[Isenburg and Gumhold 03, Isenburg and Lindstrom 05]

Assume that access pattern globally follows the
layout order
51
Data Layout Optimization

Space-filling curves
[Sagan 94, Velho and Gomes 91, Pascucci and Frank
01, Lindstrom and Pascucci 01, Gopi and Eppstein
04]

Assume geometric regularity
52
Data Layout Optimization

Graph and matrix layout
A survey [Diaz et al. 02]
Minimum linear arrangement (MLA)
Bandwidth
Profile
Wavefront, etc.

Does not necessarily produce good layouts for
block-based caches
53
Cache-Aware Metric

Expected number of cache misses given a block
size B

54
Correlation between Cache-Aware Metric and
Observed Number of Cache Misses
[Figure: observed number of cache misses against the cache-aware metric, for M = 5 (R^2 = 0.97) and M = 25 (R^2 = 0.97)]
Cache block size: 4KB
55
Evaluating Existing Layouts
An existing layout, f

No known tight bound
Compare against the best layout we can construct
Employ an efficient sampling method

Is it close to the optimal layout?
If so, use it
Otherwise, build a new one
56
Implementation

Modify our open source layout computation codes,
OpenCCL
Based on METIS graph partitioning library
[Karypis and Kumar 98]
Processing speed
15k triangles / second

57
Comparison with Cache-Oblivious Mesh Layouts
(COML) [Yoon et al. 05]

Two major improvements
Accuracy
Usability

58
Pros and Cons

Limitations
Not directly applicable to dynamic models
Pros
Generality
Can have benefits without modifying underlying
codes

59
Specialization to Meshes
[Figure: an input mesh and its implicitly derived graph, each with nodes n_a, n_b, n_c, n_d]

Assume an equal likelihood of accessing adjacent
nodes given a node
60
Two Possible Block Size Progressions

Arithmetic progression
1, 2, 3, 4, ...
Geometric progression
2^0, 2^1, 2^2, 2^3, ...
Well reflects current caching architectures
E.g., L1: 32B, L2: 64B, Page: 4KB, etc.

[Figure: two plots of Pr(B) against B, labeled "Uniform distribution" and "Geometric distribution"]
61
Cache Miss Ratio with Different Mesh Resolutions
of Bunny Model
[Figure: GPU vertex cache miss ratio (cache size 32) against mesh resolution, comparing Hoppe's rendering sequence [Hoppe 99] and COLg]
62
Correlation between Cache-Aware Metric and Number
of Cache Misses
[Figure: observed number of cache misses for M = 5 (R^2 = 0.97) and M = 25 (R^2 = 0.97)]
Cache block size: 4KB
63
Correlations with Observed Number of Cache Misses
R^2 = 0.98
R^2 = 0.81
Cache block size: 4KB

[Figure: cache misses with one block and with multiple blocks, plotted for the arithmetic CO metric and the geometric CO metric]
64
Comparison with Other Layouts
[Figure: comparison of layouts X-axis, BFL, DFL, spectral layout [Juvan and Mohar 92], COML [Yoon et al. 05], Z-curve, COLg, and Aware]
Correlation values shown: 0.94, 0.98, 0.94, 0.99
Up to 2X speedup; 9% lower than cache-aware layout
65
Comparison with Other Layouts