Title: Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring
1Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring
Dept. of Computer Science University of Pittsburgh
2Multicore distributed L2 caches
- L2 caches typically sub-banked and distributed
- IBM Power4/5: 3 banks
- Sun Microsystems T1: 4 banks
- Intel Itanium2 (L3): many sub-arrays
- (Distributed L2 caches + switched NoC) → NUCA
- Hardware-based management schemes
- Private caching
- Shared caching
- Hybrid caching
3Private and shared caching
- Private caching
- short hit latency (always local)
- high on-chip miss rate
- long miss resolution time
- complex coherence enforcement
- Shared caching
- low on-chip miss rate
- straightforward data location
- simple coherence (no replication)
- long average hit latency
4Other approaches
- Hybrid/flexible schemes
- Core clustering (Speight et al., ISCA 2005)
- Flexible CMP cache sharing (Huh et al., ICS 2004)
- Flexible bank mapping (Liu et al., HPCA 2004)
- Improving shared caching
- Victim replication (Zhang and Asanovic, ISCA 2005)
- Improving private caching
- Cooperative caching (Chang and Sohi, ISCA 2006)
- CMP-NuRAPID (Chishti et al., ISCA 2005)
5Motivation
Hit latency
Miss rate
What is the optimal balance between miss rate and hit latency?
6Talk roadmap
- Data mapping, a key property (Cho and Jin, MICRO 2006)
- Two-dimensional (2D) page coloring algorithm
- Evaluation and results
- Conclusion and future work
7Data mapping
- Data mapping
- Memory data → location in L2 cache
- Private caching
- Data mapping determined by program location
- Mapping created at miss time
- No explicit control
- Shared caching
- Data mapping determined by address
- slice number = (block address) mod (N_slice)
- Mapping is static
- No explicit control
8Change mapping granularity
- Block granularity: slice number = (block address) mod (N_slice)
- Page granularity: slice number = (page address) mod (N_slice)
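The difference between the two granularities can be illustrated with a minimal sketch. The block size, page size, and slice count below are assumptions chosen for illustration (they are not stated on this slide), and the function names are hypothetical.

```python
# Assumed parameters for illustration only.
BLOCK_SIZE = 64     # bytes per cache block (assumption)
PAGE_SIZE = 4096    # bytes per page (assumption)
N_SLICE = 16        # number of L2 slices (assumption)


def slice_by_block(addr: int) -> int:
    """Block granularity: slice number = (block address) mod N_slice."""
    return (addr // BLOCK_SIZE) % N_SLICE


def slice_by_page(addr: int) -> int:
    """Page granularity: slice number = (page address) mod N_slice."""
    return (addr // PAGE_SIZE) % N_SLICE


# Every block address inside the first page.
page0_blocks = range(0, PAGE_SIZE, BLOCK_SIZE)

# At block granularity, one page is scattered over many slices.
slices_block = {slice_by_block(a) for a in page0_blocks}

# At page granularity, the whole page lands in a single slice,
# so whoever picks the physical page also picks the slice.
slices_page = {slice_by_page(a) for a in page0_blocks}

print(len(slices_block), len(slices_page))
```

With page-granularity mapping, the choice of physical page number fixes the slice, which is what lets a page allocator steer data placement.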
9OS controlled page mapping
[Figure: OS page allocation maps pages in the virtual address spaces of Program 1 and Program 2 to memory pages in the physical address space.]
10 2D page coloring: the problem
[Figure: five candidate placements of a page relative to P, each labeled with its access count, miss count, and resulting cost.]
access  miss  cost
500     30    9000
500     3     6900
500     10    9000
500     7     8100
500     12    9600
- Network latency per hop: 3 cycles
- Memory latency: 300 cycles
- Cost(color) = (access x hop x 3 cycles) + (miss x 300 cycles)
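The cost formula on this slide can be checked against the listed numbers with a short sketch. The hop counts are not present in the slide text. The values 0 and 4 used below are an assumption, back-solved so that each (access, miss) pair reproduces its listed cost. The function name is hypothetical.

```python
HOP_LATENCY = 3     # cycles per network hop (from the slide)
MEM_LATENCY = 300   # cycles per memory access (from the slide)


def color_cost(accesses: int, hops: int, misses: int) -> int:
    """Cost(color) = (access x hop x 3 cycles) + (miss x 300 cycles)."""
    return accesses * hops * HOP_LATENCY + misses * MEM_LATENCY


# (accesses, assumed hops, misses) for the five candidates on the slide.
# Hop counts are inferred, not given: 0 for the first, 4 for the rest.
candidates = [
    (500, 0, 30),
    (500, 4, 3),
    (500, 4, 10),
    (500, 4, 7),
    (500, 4, 12),
]

costs = [color_cost(a, h, m) for a, h, m in candidates]
best = min(range(len(costs)), key=costs.__getitem__)

print(costs, best)
```

Under these assumed hop counts, the cheapest candidate is a remote one with few misses rather than the local one, which is the tension the slide illustrates.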
11 2D coloring algorithm
- Collect L2 reference trace
- Derive conflict information (Sherwood et al., ICS 1999)
12 2D coloring algorithm (contd)
- Derive conflict information
Reference Matrix
    A  B  C
A   0  0  0
B   0  0  0
C   0  0  0
Conflict Matrix
    A  B  C
A   0  0  0
B   0  0  0
C   0  0  0
13 2D coloring algorithm (contd)
- Derive conflict information
Reference Matrix
    A  B  C
A   0  0  0
B   1  0  0
C   1  0  0
Conflict Matrix
    A  B  C
A   0  0  0
B   0  0  0
C   0  0  0
14 2D coloring algorithm (contd)
- Derive conflict information
Reference Matrix
    A  B  C
A   0  0  0
B   1  0  0
C   1  0  0
Conflict Matrix
    A  B  C
A   0  0  0
B   0  0  0
C   0  0  0
15 2D coloring algorithm (contd)
- Derive conflict information
Reference Matrix
    A  B  C
A   0  1  0
B   1  0  0
C   1  1  0
Conflict Matrix
    A  B  C
A   0  0  0
B   0  0  0
C   0  0  0
16 2D coloring algorithm (contd)
- Derive conflict information
Reference Matrix
    A  B  C
A   0  1  0
B   0  0  0
C   1  1  0
Conflict Matrix
    A  B  C
A   0  0  0
B   1  0  0
C   0  0  0
17 2D coloring algorithm (contd)
- Derive conflict information
Reference Matrix
    A  B  C
A   0  1  0
B   0  0  0
C   1  1  0
Conflict Matrix
    A  B  C
A   0  0  0
B   1  0  0
C   0  0  0
18 2D coloring algorithm (contd)
- Derive conflict information
Reference Matrix
    A  B  C
A   0  1  1
B   0  0  1
C   1  1  0
Conflict Matrix
    A  B  C
A   0  0  0
B   1  0  0
C   0  0  0
19 2D coloring algorithm (contd)
Conflict Matrix
    A  B  C
A   0  0  0
B   1  0  0
C   1  1  0
Access Counter
A  B  C
1  2  1
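The matrix bookkeeping stepped through on the preceding slides can be sketched as follows. The update rule is inferred from the slide sequence. On an access to page p, every page flagged in p's row of the reference matrix gets a conflict count against p and the row is cleared. Then p is flagged in every other page's row. The trace A, B, B, C is an assumed input, chosen because it reproduces the final Conflict Matrix and Access Counter shown. The slides do not state the trace, and the function name is hypothetical.

```python
from collections import Counter


def derive_conflicts(trace, pages):
    """Build the conflict matrix and access counter from a page trace."""
    ref = {p: {q: 0 for q in pages} for p in pages}
    conflict = {p: {q: 0 for q in pages} for p in pages}
    access = Counter()

    for p in trace:
        access[p] += 1
        # Pages flagged in p's row were touched since the row was
        # last cleared: count each as a conflict, then clear the flag.
        for q in pages:
            if ref[p][q]:
                conflict[p][q] += 1
                ref[p][q] = 0
        # Flag p in every other page's row.
        for q in pages:
            if q != p:
                ref[q][p] = 1

    return conflict, access


pages = ["A", "B", "C"]
trace = ["A", "B", "B", "C"]   # assumed trace, not given in the slides
conflict, access = derive_conflicts(trace, pages)

for p in pages:
    print(p, [conflict[p][q] for q in pages])
print([access[p] for p in pages])
```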
20 2D coloring algorithm (contd)
Conflict Matrix
    A  B  C
A   0  0  0
B   1  0  0
C   1  1  0
Access Counter
A  B  C
1  2  1
Cost(color, page) = a x (Conflict(color) x mem latency) + (1-a) x (Access x hop(color) x hop delay)
Optimal color(page) = {C | Cost(C) = MIN[Cost(color, page)] for all colors}
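The color selection on this slide can be sketched as a weighted cost minimization, reading the formula as a conflict term weighted by a plus a distance term weighted by (1-a). Several details below are assumptions, not taken from the slides: Conflict(color) is computed as the sum of the page's conflict-matrix entries (in both directions) against pages already assigned to that color, the latencies are 300 and 3 cycles, and all names are hypothetical.

```python
def pick_color(page, access, conflict, assigned, hops, a,
               mem_latency=300, hop_delay=3):
    """Return (best_color, best_cost) for `page`.

    access:   dict page -> access count
    conflict: dict page -> dict page -> conflict count
    assigned: dict color -> list of pages already placed there
    hops:     list indexed by color, hop distance from the requester
    a:        weight between the conflict term and the distance term
    """
    best_color, best_cost = None, float("inf")
    for color, hop in enumerate(hops):
        # Assumed definition of Conflict(color): conflicts between
        # `page` and the pages already mapped to this color.
        n_conf = sum(
            conflict.get(page, {}).get(q, 0) + conflict.get(q, {}).get(page, 0)
            for q in assigned.get(color, [])
        )
        cost = (a * n_conf * mem_latency
                + (1 - a) * access[page] * hop * hop_delay)
        if cost < best_cost:
            best_color, best_cost = color, cost
    return best_color, best_cost


# Toy input: page C conflicts with A and B, which sit in local color 0.
access = {"C": 1}
conflict = {"C": {"A": 1, "B": 1}}
assigned = {0: ["A", "B"], 1: []}
hops = [0, 4]

far_color, _ = pick_color("C", access, conflict, assigned, hops, a=0.5)
near_color, _ = pick_color("C", access, conflict, assigned, hops, a=1 / 256)
print(far_color, near_color)
```

In this toy input, a large a pushes the page to the remote, conflict-free color, while a small a keeps it local despite the conflicts.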
21Experiments setup
- Experiments were carried out using a simulator derived from the SimpleScalar toolset.
- The simulator models a 16-core tile-based CMP.
- Each core has private 32KB I/D L1 caches and a 256KB slice of the globally shared L2 (4MB total).
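As a rough illustration of how hop counts could arise in this setup, the sketch below assumes the 16 tiles form a 4x4 mesh with dimension-ordered routing, so the hop count between two tiles is their Manhattan distance. The mesh shape, tile numbering, and function names are assumptions; the slides only state that the CMP is tile-based with 16 cores.

```python
N_TILES = 16          # from the slide: 16-core tile-based CMP
L2_SLICE_KB = 256     # from the slide: 256KB L2 slice per tile
MESH_WIDTH = 4        # assumption: tiles arranged as a 4x4 mesh


def tile_xy(tile: int) -> tuple:
    """Row-major tile numbering (assumed)."""
    return tile % MESH_WIDTH, tile // MESH_WIDTH


def hop_count(src: int, dst: int) -> int:
    """Manhattan distance between two tiles on the assumed mesh."""
    sx, sy = tile_xy(src)
    dx, dy = tile_xy(dst)
    return abs(sx - dx) + abs(sy - dy)


total_l2_kb = N_TILES * L2_SLICE_KB   # aggregate L2 capacity in KB
print(total_l2_kb, hop_count(0, 15), hop_count(5, 6))
```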
22Optimal page mapping
[Figure: optimal page mapping for gcc with a = 1/64 and a = 1/256; each panel plots the number of pages mapped to each tile position (x, y).]
23Access distribution
24Relative performance
25Value of a
26Conclusions
- With careful data placement, there is substantial room for performance improvement.
- Dynamic mapping schemes assisted by hardware-collected information could achieve similar performance improvement.
- This method can also be applied to other optimization targets.
27Current and future work
- Dynamic mapping schemes
- Performance
- Power
- Multiprogrammed and parallel workloads
28Thank you. Questions?
29Private caching
- short hit latency (always local)
- high on-chip miss rate
- long miss resolution time
- complex coherence enforcement
- L1 miss
- L2 access
- Hit
- Miss
- Access directory
- A copy on chip
- Global miss
Local L2 access
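The private-caching lookup flow listed above (L1 miss, local L2 access, then a directory lookup on a miss) can be sketched as a small decision function. The directory is modeled here as a simple scan over the other cores' L2 contents, which is a simplification for illustration only. The data layout and names are hypothetical.

```python
def private_l2_lookup(block, core, l2_contents):
    """Classify an L1 miss under private L2 caching.

    l2_contents: dict core -> set of blocks held in that core's L2.
    Returns "local hit", "on-chip copy", or "global miss".
    """
    # Step 1: local L2 access.
    if block in l2_contents[core]:
        return "local hit"
    # Step 2: local miss, so access the directory (modeled as a scan).
    for other, blocks in l2_contents.items():
        if other != core and block in blocks:
            return "on-chip copy"
    # Step 3: no copy anywhere on chip.
    return "global miss"


l2_contents = {0: {"x"}, 1: {"y"}}
results = [private_l2_lookup(b, 0, l2_contents) for b in ("x", "y", "z")]
print(results)
```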
30Shared caching
- L1 miss
- L2 access
- Hit
- Miss
- low on-chip miss rate
- straightforward data location
- simple coherence (no replication)
- long average hit latency
31Performance