1Mesh Layouts for Block-Based Caches
Sung-Eui Yoon and Peter Lindstrom
Lawrence Livermore National Laboratory
2Goal
- Provide cache-coherent layouts of meshes and graphs
- Derive metrics that measure cache-coherence of layouts
- Generality
- Simplicity
- Efficiency
- Accuracy
3Cache-Coherent Metrics
- Measure the expected number of cache misses of a layout given block-based caches
- Should correlate well with the observed number of cache misses
- Cache-aware metrics
- Measure cache-coherence given known cache parameters (e.g., block size)
- Cache-oblivious metrics
- Consider all possible cache parameters
4Motivation
- Lower growth rate of data access speed
[Figure: accumulated growth rate during 1993-2004 (log scale); bar labels: 130X, 46X, 20X, and 1.5X (during 99-04)]
Courtesy: Anselmo Lastra, http://www.hcibook.com/e3/online/moores-law/
5Memory Hierarchies and Block-Based Caches
[Figure: memory hierarchy of CPU, fast memory or cache, slow memory, and disk, connected by block transfers; access times shown: 10^-8 sec, 10^-7 sec, and 10^-2 sec]
6Main Contributions
- Propose novel and practical cache-aware and cache-oblivious metrics
- Derive metrics given block-based caches
- Propose efficient cache-coherent layout constructions
- Apply to different applications
7Related Work
- Computation reordering
- Data layout optimization
8Computational Reordering
- Cache-aware: Coleman and McKinley 95, Vitter 01, Sen et al. 02
- Cache-oblivious: Frigo et al. 99, Arge et al. 04
Focus on specific problems such as sorting and linear algebra computations
9Data Layout Optimization
- Graph and matrix layout: Diaz et al. 02
- Minimum linear arrangement (MLA), bandwidth, wavefront, etc.
- Space-filling curves
- Sagan 94, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04
- Rendering and processing sequences
- Deering 95, Hoppe 99, Bogomjakov and Gotsman 02, Isenburg and Lindstrom 05
- Cache-oblivious mesh layout
- Yoon et al. 05
10Outline
- Computation models
- Cache-aware and cache-oblivious metrics
- Results
12General Framework of Layout Computation
[Figure: an input directed graph, G = (N, A), with nodes na, nb, nc, nd is passed, together with a cache-coherent metric, to a layout algorithm, f, which outputs the 1D layout f(N) = na, nd, nc, nb, ...]
13Two-Level I/O Model (Aggarwal and Vitter 88)
[Figure: an input directed graph with nodes na, nb, nc, nd is mapped by a layout algorithm to the 1D layout na, nd, nc, nb, ... with block size 3; the cache holds M cache blocks, each of size B]
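The two-level model above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' code: it assumes an LRU replacement policy (the slide does not name one), and the function name `count_cache_misses`, the access sequence, and the node names are illustrative only.

```python
from collections import OrderedDict

def count_cache_misses(accesses, layout, block_size, num_blocks):
    """Count block misses for a sequence of node accesses.

    accesses   : iterable of node ids in access order
    layout     : dict mapping node id -> position in the 1D layout
    block_size : B, number of nodes stored per block
    num_blocks : M, number of blocks the cache can hold
    Assumption: least-recently-used (LRU) replacement.
    """
    cache = OrderedDict()  # block index -> None, least recently used first
    misses = 0
    for node in accesses:
        block = layout[node] // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark block as most recent
        else:
            misses += 1                    # miss: fetch the block
            cache[block] = None
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses

# 1D layout na, nd, nc, nb with block size 3, as in the figure.
layout = {"na": 0, "nd": 1, "nc": 2, "nb": 3}
accesses = ["na", "nb", "na", "nd", "nc", "nb"]  # hypothetical access sequence
print(count_cache_misses(accesses, layout, block_size=3, num_blocks=1))  # 4
print(count_cache_misses(accesses, layout, block_size=3, num_blocks=2))  # 2
```

With a single cache block, every access that crosses the block boundary between nc and nb is a miss; with two blocks, only the two compulsory misses remain.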
14Graph Representation
- Directed graph, G = (N, A)
- Represents access patterns between nodes
- Nodes, N
- Data elements (e.g., mesh vertices or mesh triangles)
- Directed arcs, A
- Connect two nodes if they are accessed sequentially
[Figure: example directed graph with nodes na, nb, nc, nd]
15Weights of Nodes and Arcs
- Indicate the probability that each element will be accessed
- Computed at the equilibrium of an infinite random walk
- Assume that applications access the data indefinitely according to the input graph
- Correspond to the dominant eigenvector of the probability transition matrix
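As an illustration of how such weights could be computed, the sketch below runs power iteration on a random walk. It is not taken from the talk: it assumes a strongly connected graph, a uniform choice among outgoing arcs, and a "lazy" walk (stay in place with probability 1/2), which has the same equilibrium but avoids oscillation on periodic graphs. The function names are hypothetical.

```python
def stationary_distribution(successors, iterations=2000):
    """Approximate equilibrium node probabilities of a random walk.

    successors : dict mapping node -> list of nodes reachable by one arc
    Assumptions: every node has at least one outgoing arc, the graph is
    strongly connected, and outgoing arcs are chosen uniformly.
    """
    nodes = list(successors)
    prob = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Lazy step: keep half of the mass in place, spread the rest.
        nxt = {n: 0.5 * prob[n] for n in nodes}
        for n in nodes:
            share = 0.5 * prob[n] / len(successors[n])
            for m in successors[n]:
                nxt[m] += share
        prob = nxt
    return prob

def arc_probabilities(successors, node_prob):
    """Probability of traversing each arc (i, j) at equilibrium."""
    return {(i, j): node_prob[i] / len(successors[i])
            for i in successors for j in successors[i]}

# Hypothetical three-node path a - b - c with arcs in both directions.
successors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
node_prob = stationary_distribution(successors)
arc_prob = arc_probabilities(successors, node_prob)
print(node_prob)
print(arc_prob)
```

For this small example the node weights converge to 1/4, 1/2, 1/4, and every arc ends up with the same traversal probability.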
16Cache-Coherence of a Layout given Block-Based
Caches
- Expected number of cache misses of a layout, built from two terms:
- Probability of accessing a node from another node by traversing an arc
- Conditional probability that we will have a cache miss given the above access pattern
[Figure: example directed graph with nodes na, nb, nc, nd]
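One way to write the two terms above as a single expression is shown below. The notation is an assumption made for illustration (the slide's own formula did not survive extraction): A is the arc set, and i → j denotes accessing node j from node i by traversing the arc (i, j).

```latex
\mathbb{E}[\text{cache misses}]
  \;=\; \sum_{(i,j) \in A} \Pr(i \to j)\,\Pr(\text{miss} \mid i \to j)
```

The first factor comes from the arc weights of the graph; the second depends on the layout and on the cache parameters.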
17Specialization to Meshes
- Expected number of cache misses of a layout
- Probability of accessing a node from another node by traversing an arc: constant for meshes
- Conditional probability that we will have a cache miss given the above access pattern
Assumptions for meshes:
1. Two opposite directed arcs for each mesh edge
2. Uniform distribution to access adjacent nodes given a node
[Figure: an input mesh with vertices na, nb, nc, nd and the implicitly derived graph]
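The two mesh assumptions can be checked on a toy example. The sketch below is illustrative only: the two-triangle mesh is hypothetical (the connectivity of the mesh in the figure is not recoverable), and it uses the standard fact that a random walk on an undirected graph visits each node with probability proportional to its degree.

```python
def mesh_to_arcs(triangles):
    """Derive the directed graph: two opposite arcs per mesh edge."""
    arcs = set()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            arcs.add((u, v))
            arcs.add((v, u))
    return arcs

def arc_probabilities(arcs):
    """Equilibrium probability of traversing each arc.

    Node probability is degree / total degree; the next node is then
    chosen uniformly among the neighbors (probability 1 / degree).
    """
    degree = {}
    for u, _ in arcs:
        degree[u] = degree.get(u, 0) + 1
    total = sum(degree.values())  # equals the number of arcs
    return {(u, v): (degree[u] / total) * (1.0 / degree[u]) for u, v in arcs}

# Hypothetical mesh: two triangles sharing the edge na - nc.
triangles = [("na", "nb", "nc"), ("na", "nc", "nd")]
arcs = mesh_to_arcs(triangles)
probs = arc_probabilities(arcs)
print(len(arcs))            # 10 arcs for 5 mesh edges
print(set(round(p, 12) for p in probs.values()))
```

Every arc gets probability 1/|A|, so under these assumptions the arc-traversal term is the same for all arcs and only the conditional miss probability distinguishes one layout from another.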
18Outline
- Computation models
- Cache-aware and cache-oblivious metrics
- Results
19Four Different Cases
Cache-aware case: single cache block, M = 1
Cache-oblivious case: single cache block, M = 1
Cache-aware case: multiple cache blocks, M > 1
Cache-oblivious case: multiple cache blocks, M > 1
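20Cache-Aware: Single Cache Block, M = 1
[Figure: an input directed graph with nodes na, nb, nc, nd and its 1D layout na, nd, nc, nb, ... with block size 3; arcs whose endpoints lie in different blocks are straddling arcs; the cache holds one block of size B]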
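21Cache-Aware: Multiple Cache Blocks, M > 1
[Figure: an input directed graph with nodes na, nb, nc, nd and its 1D layout na, nd, nc, nb, ... with a block boundary marked; straddling arcs cross the block boundary; the cache holds multiple blocks]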
22Final Cache-Aware Metric
- Counts the number of straddling arcs of the layout given a block size B
- Uses the index of the block containing each node, i
- Uses the unit step function: 1 if x > 0, 0 otherwise
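Counting straddling arcs is straightforward to express in code. The sketch below is an illustration under stated assumptions: blocks are taken to start at layout position 0, the arc list is hypothetical, and the function name `cache_aware_metric` is not from the talk.

```python
def cache_aware_metric(layout, arcs, block_size):
    """Number of arcs whose two endpoints fall in different blocks.

    layout     : dict mapping node id -> position in the 1D layout
    arcs       : iterable of (i, j) node pairs
    block_size : B, number of nodes per block
    """
    def block_index(node):
        return layout[node] // block_size

    # The unit step of |block_index(i) - block_index(j)| is 1 exactly
    # when the two block indices differ.
    return sum(1 for i, j in arcs if block_index(i) != block_index(j))

# 1D layout na, nd, nc, nb with block size 3; the arcs are hypothetical.
layout = {"na": 0, "nd": 1, "nc": 2, "nb": 3}
arcs = [("na", "nb"), ("nb", "nc"), ("nc", "nd"), ("nd", "na")]
print(cache_aware_metric(layout, arcs, block_size=3))  # 2
```

Only the two arcs touching nb cross the block boundary here, so the metric is 2; a layout that lowers this count is expected to incur fewer misses for that block size.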
23High Accuracy of Cache-Aware Metric
Tested block size: 4KB
Linear correlation (range [-1, 1]) with the observed number of cache misses:

Metric               With 5 cache blocks   With 25 cache blocks
Cache-aware metric   0.97                  0.97

[Figure: Z-curve on a uniform grid]
Tested layouts: Z-curve, Hilbert curve, H-order, minimum linear arrangement layout, βΩ-layout, geometric CO layout, (bi or uni) row-by-row, (bi or uni) diagonal layouts
24Four Different Cases
Cache-aware case: single cache block, M = 1
Cache-oblivious case: single cache block, M = 1
Cache-aware case: multiple cache blocks, M > 1
Cache-oblivious case: multiple cache blocks, M > 1
25Cache-Oblivious: Single Cache Block, M = 1
Does not assume a particular block size. Then, what are good representatives for block sizes?
26Two Possible Block Size Progressions
- Arithmetic progression
- 1, 2, 3, 4, ...
- Geometric progression
- 2^0, 2^1, 2^2, 2^3, ...
- Well reflects current caching architectures
- E.g., L1: 32B, L2: 64B, page: 4KB, etc.
27Probability that an Arc is a Straddling Arc
Expressed as a function of the arc length, l
Is an arc straddling given a block size?
[Figure: 1D layout na, nd, nc, nb, ... with an arc of length l = 2]
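The dependence on arc length can be checked by brute force. The sketch below assumes that the position of an arc relative to the block boundaries is uniformly random, which is an assumption made here for illustration; the function name is hypothetical.

```python
from fractions import Fraction

def straddle_probability(arc_length, block_size):
    """Fraction of block alignments for which an arc straddles a boundary.

    Enumerates every offset of the arc's first node within a block and
    checks whether its second node lands in a different block.
    """
    straddling = 0
    for offset in range(block_size):
        start = offset
        end = offset + arc_length
        if start // block_size != end // block_size:
            straddling += 1
    return Fraction(straddling, block_size)

print(straddle_probability(2, 3))  # 2/3
print(straddle_probability(1, 4))  # 1/4
print(straddle_probability(5, 4))  # 1
```

Under this assumption the probability is l/B for arcs shorter than the block size and 1 otherwise, so short arcs are rarely straddling across a wide range of block sizes.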
28Two Cache-Oblivious Metrics
- Arithmetic cache-oblivious metric: arithmetic mean of arc lengths (the MLA metric)
- Geometric cache-oblivious metric: geometric mean of arc lengths
- Both are computed from the arc length of each arc (i, j)
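Both means are simple to compute from a layout. The sketch below is illustrative: it takes the arc length of (i, j) to be the distance between the two nodes' positions in the 1D layout, and the function names and arc list are assumptions.

```python
import math

def arc_lengths(layout, arcs):
    """Distance in the 1D layout between the endpoints of each arc."""
    return [abs(layout[i] - layout[j]) for i, j in arcs]

def arithmetic_co_metric(layout, arcs):
    """Arithmetic mean of arc lengths (the MLA objective, normalized)."""
    lengths = arc_lengths(layout, arcs)
    return sum(lengths) / len(lengths)

def geometric_co_metric(layout, arcs):
    """Geometric mean of arc lengths, computed via the mean of logs."""
    lengths = arc_lengths(layout, arcs)
    return math.exp(sum(math.log(l) for l in lengths) / len(lengths))

# 1D layout na, nd, nc, nb; the arcs are hypothetical.
layout = {"na": 0, "nd": 1, "nc": 2, "nb": 3}
arcs = [("na", "nb"), ("nb", "nc"), ("nc", "nd"), ("nd", "na")]
print(arithmetic_co_metric(layout, arcs))  # 1.5
print(geometric_co_metric(layout, arcs))   # fourth root of 3, about 1.316
```

The geometric mean is less sensitive than the arithmetic mean to a few long arcs, since each arc contributes through its logarithm.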
29Validation for Cache-Oblivious (CO) Metrics
- Geometric cache-oblivious metric
- Practical and useful
[Figure: the number of cache misses when M = 1 (log scale) for the geometric CO layout and the arithmetic CO layout; annotations: 73% of tested power-of-two block sizes, 97% of tested block sizes]
30Correlations between Metrics and Observed Number
of Cache Misses
Tested block size: 4KB
Linear correlation (range [-1, 1]) with the observed number of cache misses:

Metric                 With 1 cache block   With 5 cache blocks
Geometric CO metric    0.98                 0.81
Arithmetic CO metric   -0.19                -0.32

Tested with 10 different layouts on a uniform grid
31Efficient Layout Computation for Our Metrics
- Cache-aware layouts
- Optimized with cache-aware metric given a block size B
- Computed from the graph partitioning
- Geometric cache-oblivious metric
- Very efficient
- Can be used in different layout methods
32Layout Computation with Geometric Cache-Oblivious
Metric
- Multi-level construction method
- Partition an input mesh into k different sets
- Lay out partitions based on our metric
- Generalized layout method for unstructured meshes
[Figure: 1. Partition, 2. Lay out]
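The "lay out" step can be illustrated on a tiny example. The sketch below is not the OpenCCL algorithm: it assumes the partitions are already given, keeps the node order inside each partition fixed, and simply tries every ordering of the k partitions, keeping the one with the smallest sum of log arc lengths (equivalently, the smallest geometric mean). All names are hypothetical.

```python
import math
from itertools import permutations

def order_partitions(partitions, arcs):
    """Pick the ordering of partitions that minimizes the geometric metric.

    partitions : list of lists of node ids (node order inside each is kept)
    arcs       : iterable of (i, j) node pairs with i != j
    Brute force over k! orderings, so only suitable for small k.
    """
    best_order, best_cost = None, math.inf
    for perm in permutations(partitions):
        order = [node for part in perm for node in part]
        position = {node: idx for idx, node in enumerate(order)}
        # Sum of logs is monotone in the geometric mean of arc lengths.
        cost = sum(math.log(abs(position[i] - position[j])) for i, j in arcs)
        if cost < best_cost:
            best_order, best_cost = order, cost
    return best_order

# Hypothetical chain a - b - c - d - e - f split into three partitions.
partitions = [["a", "b"], ["e", "f"], ["c", "d"]]
arcs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
order = order_partitions(partitions, arcs)
print(order)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

In a multi-level method the same idea would be applied recursively, partitioning each set again and ordering the sub-partitions, which keeps k small at every level.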
33Outline
- Computation models
- Cache-aware and cache-oblivious metrics
- Results
34Applications
- Isosurface extraction
- View-dependent rendering
35Iso-Surface Extraction
Spx model (140K vertices)
- Uses contour tree (van Kreveld et al. 97)
- Runtime is dominated by the traversal of iso-surface
- Layout graph
- Use an input tetrahedral mesh
36High Correlation with Number of Cache Misses
Tested block size: 4KB
Linear correlation (range [-1, 1]) with the observed number of cache misses:

Metric                With 1 cache block   With 10K cache blocks
Geometric CO metric   0.99                 0.98

Tested with 8 different layouts: our geometric CO, our cache-aware, breadth-first (and depth-first) layouts, spectral (Juvan and Mohar 92), cache-oblivious mesh (Yoon et al. 05), Z-curve (Sagan 94), and X-axis sorted layouts
37High Correlation with Runtime Performance
Memory access time is major bottleneck
Disk I/O time is major bottleneck
Linear correlation (range [-1, 1]) with iso-surface extraction time:

Metric                First iso-surface extraction time   Second iso-surface extraction time
Geometric CO metric   0.94                                0.94
38Comparison with Other Layouts
[Figure: the first iso-surface extraction time (sec) for each layout]
8-77% improvement and very close to the cache-aware performance
39View-Dependent Rendering
- Lay out vertices and triangles of progressive meshes
- Used in an efficient VDR system (Yoon et al. 04)
- Reduce misses in GPU vertex cache
40Cache Miss Ratio on Bunny Model
[Figure: GPU vertex cache miss ratio versus vertex cache size; curves: universal rendering sequence (Bogomjakov and Gotsman 02), Hoppe's rendering sequence (Hoppe 99), geometric CO layout, and theoretical lower bound (Bar-Yehuda and Gotsman 96)]
41Cache Miss Ratio on Power Plant Model
[Figure: GPU vertex cache miss ratio versus vertex cache size; curves: Z-curve, COML (Yoon et al. 05), Hoppe's rendering sequence (Hoppe 99), geometric CO layout, and theoretical lower bound (Bar-Yehuda and Gotsman 96)]
42Conclusion
- Novel cache-aware and cache-oblivious metrics to evaluate layouts
- Derived metrics based on two-level I/O model
- Improved the performance of applications without modifying codes
OpenCCL, open source library: http://gamma.cs.unc.edu/COL/OpenCCL
43Ongoing and Future Work
- Derive a lower bound on our geometric cache-oblivious metric
- Employ mesh compression to further reduce disk I/O accesses
- Investigate efficient layout method for deforming models
- Apply to non-graphics applications
- e.g., shortest path or other graph computations
44Cache-Efficient Layouts of Bounding Volume
Hierarchies
- Yoon and Manocha, Eurographics 06
Ray tracing
Collision detection
45Acknowledgements
- Ajith Mascarenhas
- Martin Isenburg
- Dinesh Manocha
- Fabio Bernardon, Joao Comba, and Claudio Silva
- For their unstructured tetrahedra rendering program
- Members of data analysis group in LLNL
- Anonymous reviewers
46UCRL-PRES-225448
This work was performed under the auspices of the
U.S. Department of Energy by University of
California Lawrence Livermore National Laboratory
under contract No. W-7405-ENG-48.
47Additional slides
48Cache-Coherence of Layouts
- Well known heuristics for cache-coherent layouts
- Space-filling curves Sagan 94
- How can we compute better layouts?
- Requires metrics measuring cache-coherence of
layouts
49Main Results
- Define cache-coherence of layout as
- Expected number of cache misses during random walks of a graph given block-based caches
- Then, the expected number of cache misses reduces to
- Number of straddling arcs in a cache-aware case
- Geometric mean of arc lengths in a cache-oblivious case
50Data Layout Optimization
- Rendering sequences
- Triangle strips
- Deering 95, Hoppe 99, Bogomjakov and Gotsman 02
- Processing sequences
- Isenburg and Gumhold 03, Isenburg and Lindstrom
05
Assume that access pattern globally follows the
layout order
51Data Layout Optimization
- Space-filling curves
- Sagan 94, Velho and Gomes 91, Pascucci and Frank
01, Lindstrom and Pascucci 01, Gopi and Eppstein
04
Assume geometric regularity
52Data Layout Optimization
- Graph and matrix layout
- A survey Diaz et al. 02
- Minimum linear arrangement (MLA)
- Bandwidth
- Profile
- Wavefront, etc.
Does not necessarily produce good layouts for
block-based caches
53Cache-Aware Metric
- Expected number of cache misses given a block
size B
54Correlation between Cache-Aware Metric and
Observed Number of Cache Misses
[Figure: two scatter plots of observed number of cache misses versus cache-aware metric, for M = 5 (R^2 = 0.97) and M = 25 (R^2 = 0.97); cache block size 4KB]
55Evaluating Existing Layouts
- No known tight bound
- Compare against the best layout we can construct
- Employ an efficient sampling method
[Figure: flowchart: given an existing layout, f, is it close to the optimal layout? If so, use it; otherwise, build a new one]
56Implementation
- Modify our open source layout computation codes, OpenCCL
- Based on METIS graph partitioning library (Karypis and Kumar 98)
- Processing speed
- 15k triangles / second
57Comparison with Cache-Oblivious Mesh Layouts
(COML) Yoon et al. 05
- Two major improvements
- Accuracy
- Usability
58Pros and Cons
- Limitations
- Not directly applicable to dynamic models
- Pros
- Generality
- Can have benefits without modifying underlying
codes
59Specialization to Meshes
- Assume an equal likelihood of accessing adjacent nodes given a node
[Figure: an input mesh with vertices na, nb, nc, nd and the implicitly derived graph]
60Two Possible Block Size Progressions
- Arithmetic progression: uniform distribution
- 1, 2, 3, 4, ...
- Geometric progression: geometric distribution
- 2^0, 2^1, 2^2, 2^3, ...
- Well reflects current caching architectures
- E.g., L1: 32B, L2: 64B, page: 4KB, etc.
[Figure: two plots of Pr(B) versus B, one for the uniform distribution and one for the geometric distribution]
61Cache Miss Ratio with Different Mesh Resolutions
of Bunny Model
[Figure: GPU vertex cache miss ratio (cache size 32) versus mesh resolution; curves: Hoppe's rendering sequence (Hoppe 99) and COLg]
62Correlation between Cache-Aware Metric and Number
of Cache Misses
[Figure: two scatter plots of observed number of cache misses versus cache-aware metric, for M = 5 (R^2 = 0.97) and M = 25 (R^2 = 0.97); cache block size 4KB]
63Correlations with Observed Number of Cache Misses
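[Figure: scatter plots of observed cache misses, with one block and with multiple blocks, versus the arithmetic CO metric and the geometric CO metric; values shown: R^2 = 0.98 and R^2 = 0.81; cache block size 4KB]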
64Comparison with Other Layouts
[Figure: comparison of the layouts X-axis, BFL, DFL, spectral layout (Juvan and Mohar 92), COML (Yoon et al. 05), Z-curve, COLg, and Aware; correlation values shown: 0.94, 0.98, 0.94, 0.99]
Up to 2X speedup; 9% lower than cache-aware layout
65Comparison with Other Layouts
- Compute eight different layouts
- Our geometric cache-oblivious layout
- Our cache-aware layout
- Breadth-first layout
- Depth-first layout
- Spectral layout Juvan and Mohar 92
- Cache-oblivious mesh layout Yoon et al. 05
- Z-curve Sagan 94
- X-axis sorted layout
66Our Layout of Bunny Model