Title: Scavenger: A New Last Level Cache Architecture with Global Block Priority
Slide 1: Scavenger: A New Last Level Cache Architecture with Global Block Priority
- Arkaprava Basu, IIT Kanpur
- Mainak Chaudhuri, IIT Kanpur
- Nevin Kirman, Cornell
- Meyrem Kirman, Cornell
- Jose F. Martinez, Cornell
Slide 2: Talk in one slide
- Observation 1: a large number of blocks miss repeatedly in the last-level cache
- Observation 2: the number of evictions between an eviction-reuse pair is too large to be captured by a conventional fully associative victim file
- How do we exploit this temporal behavior with such a large period?
- Our solution prioritizes blocks evicted from the last-level cache by their miss frequencies
- The top k most frequently missing blocks are scavenged and retained in a fast k-entry victim file
Slide 3: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 4: Observations and hypothesis
[Figure: stacked bar chart of the percentage of ROB stall cycles attributable to blocks that miss 1, 2-9, 10-99, 100-999, and > 1000 times, for 512 KB 8-way and 1 MB 8-way L2 caches, across the SPEC 2000 applications (gz, wu, sw, ap, vp, gc, me, ar, mc, eq, cr, am, pe, bz, tw, aps).]
Slide 5: Observations and hypothesis
[Figure with annotations: "Too small" and "Wish, but too large (FA?)", contrasting conventional victim file sizes with the size needed to capture the observed reuse.]
Slide 6: Observations and hypothesis
- Block addresses repeat in the miss address stream of the last-level cache
- Repeating block addresses in the miss stream cause significant ROB stall
- Hypothesis: identifying and retaining the most frequently missing blocks in a victim file should be beneficial, but
- The number of evictions between an eviction-reuse pair is very large
- The temporal behavior happens at too large a scale to be captured by any reasonably sized fully associative victim file
Slide 7: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 8: Scavenger overview (Contributions)
- Functional requirements
  - Determine the frequency of occurrence of an evicted block address in the miss stream seen so far
  - Determine (preferably in O(1) time) the minimum frequency among the top k frequently missing blocks; if the frequency of the current block is greater than or equal to this minimum, replace the minimum, insert the new frequency, and compute the new minimum quickly
  - Allocate a new block in the victim file by replacing the minimum-frequency block, irrespective of the addresses of these blocks
Slide 9: Scavenger overview (L2 eviction)
[Figure: an evicted block's address goes to the memory controller and is also hashed into the Bloom filters to obtain its miss frequency; if Freq. > Min. (the minimum priority held in the min-heap), the minimum entry is replaced in the victim file and the new frequency is inserted into the heap.]
Slide 10: Scavenger overview (L1 miss)
[Figure: an L1 miss address looks up the L2 tags and the victim file; on a victim-file hit the block is de-allocated and moved into the main L2 cache, otherwise the request goes to the memory controller.]
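The two overview flows above can be sketched end to end. The class below is a toy software stand-in, not the paper's hardware: a dict models the Bloom-filter frequency estimator, another dict models the victim file with its priorities, and `min()` stands in for the O(1) minimum that the real min-heap provides.

```python
# Toy model of Scavenger's two control flows: scavenging on an L2
# eviction, and probing the victim file on an L1 miss. All structures
# here are illustrative stand-ins for the hardware described later.
class ToyScavenger:
    def __init__(self, vf_entries):
        self.freq = {}            # stands in for the Bloom-filter estimator
        self.vf = {}              # victim file: block address -> priority
        self.vf_entries = vf_entries

    def on_l2_eviction(self, addr):
        # Count this eviction in the miss-frequency estimator.
        self.freq[addr] = self.freq.get(addr, 0) + 1
        f = self.freq[addr]
        if len(self.vf) < self.vf_entries:
            self.vf[addr] = f
            return
        # Replace the minimum-priority victim if this block's frequency
        # is at least as large (the real min-heap gives this min in O(1)).
        victim = min(self.vf, key=self.vf.get)
        if f >= self.vf[victim]:
            del self.vf[victim]
            self.vf[addr] = f

    def on_l1_miss(self, addr):
        # Victim-file hit: de-allocate and move the block to the L2 cache
        # (modeled here by simply removing it from the victim file).
        if addr in self.vf:
            del self.vf[addr]
            return "vf-hit"
        return "memory"
```

Note the asymmetry with a conventional victim cache: a hit removes the block from the victim file entirely, since the block migrates to the main L2 cache.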
Slide 11: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 12: Miss frequency estimator
[Figure: the block address indexes five Bloom filter banks (BF0-BF4); the selected counters read 189, 2419, 2523, 2215, and 140, and the frequency estimate is the minimum of these counts (here 140).]
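The estimator on this slide can be sketched as a small count-min-style structure: each Bloom filter bank keeps counters, a miss increments one counter per bank, and the estimate is the minimum across banks. The table size and the hash mixing below are illustrative assumptions; the slide only specifies the min-over-filters estimate.

```python
# Sketch of the miss-frequency estimator: several counting Bloom filter
# banks are indexed by different hashes of the block address, and the
# estimate is the minimum of the selected counters. Collisions can only
# inflate individual counters, so the minimum bounds the true count.
class FrequencyEstimator:
    def __init__(self, num_filters=5, size=1024):
        self.tables = [[0] * size for _ in range(num_filters)]
        self.size = size

    def _index(self, block_addr, i):
        # Illustrative hash: mix the address differently per filter bank.
        return (block_addr * (2 * i + 3) + (block_addr >> (4 * i))) % self.size

    def record_miss(self, block_addr):
        for i, table in enumerate(self.tables):
            table[self._index(block_addr, i)] += 1

    def estimate(self, block_addr):
        # Minimum over all banks, as in the figure (min of 189, 2419,
        # 2523, 2215, 140 would yield 140).
        return min(table[self._index(block_addr, i)]
                   for i, table in enumerate(self.tables))
```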
Slide 13: Priority queue (min-heap)
[Figure: a 15-entry array-based min-heap (nodes T1-T15) with root priority 5 and internal priorities such as 6, 10, 13, 11, 9, 15, 12, and 18. For node i, the left child is at index i << 1 and the right child at (i << 1) | 1. Each heap node holds a priority and a VPTR back-pointer to its VF tag.]
Slide 14: Pipelined min-heap
- Both insertion and de-allocation require O(log k) steps for a k-entry heap
- Each step involves read, comparison, and write operations; step latency: r + c + w cycles
- A latency of (r + c + w) log(k) cycles is too high to cope with bursty cache misses
- Both insertion and de-allocation must be pipelined
- We unify insertion and de-allocation into a single pipelined operation called replacement
- De-allocation is the same as a zero insertion
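Ignoring the pipelining, the replacement operation itself can be sketched in a few lines: overwrite the minimum at the root with the new priority and sift it down, using the slide's child-index arithmetic on a 1-based array. This is a software sketch of the sequential version only; the hardware overlaps the per-level read-compare-write steps of successive replacements.

```python
# Sketch of one (non-pipelined) replacement on a 1-based array min-heap:
# the minimum-priority root is overwritten with the new priority, which
# then sifts down. Children of node i live at i << 1 and (i << 1) | 1.
def heap_replace_min(heap, new_priority):
    """heap[1..n] is a 1-based min-heap; returns the evicted minimum."""
    n = len(heap) - 1              # heap[0] is unused (1-based indexing)
    evicted = heap[1]
    heap[1] = new_priority
    i = 1
    while True:
        left, right = i << 1, (i << 1) | 1
        smallest = i
        if left <= n and heap[left] < heap[smallest]:
            smallest = left
        if right <= n and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:          # heap property restored
            break
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest
    return evicted
```

A de-allocation is the same operation with `new_priority = 0`: the zero immediately becomes the minimum, so the freed entry is the next one replaced.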
Slide 15: Pipelined heap replacement
[Figure: animation frame 1. A replacement operation (20, 10, 0) enters the heap pipeline at the root (priority 5); each heap level performs read (R), compare (C), and write (W) steps, with left child at i << 1 and right child at (i << 1) | 1.]
Slide 16: Pipelined heap replacement
[Figure: animation frame 2. The operation completes the root level's read-compare-write while the next level's read begins, allowing a following replacement to enter the pipeline behind it.]
Slide 17: Pipelined heap replacement
[Figure: animation frame 3. The new priority 20 has replaced the old minimum 5 and is compared with the children (6, 10); the smaller child 6 is promoted into the parent slot.]
Slide 18: Pipelined heap replacement
[Figure: animation frame 4. The value 20 continues sifting down and is compared with the next pair of children (11, 9); the heap property is restored level by level behind the operation.]
Slide 19: Victim file
- Functional requirements
  - Should be able to replace a block with minimum priority by a block of higher or equal priority, irrespective of addresses (fully associative functionality)
  - Should offer fast lookup (a conventional fully associative array won't do)
  - On a hit, should de-allocate the block and move it to the main L2 cache (different from conventional victim caches)
Slide 20: Victim file organization
- Tag array
  - Direct-mapped hash table with collisions (i.e., conflicts) resolved by chaining
  - Each tag entry contains an upstream (toward head) and a downstream (toward tail) pointer, and a head (H) and a tail (T) bit
- A victim file lookup at address A walks the tag list sequentially, starting at the direct-mapped index of A
- Each tag lookup has latency equal to that of a direct-mapped cache of the same size
- A replacement delinks the replaced tag from its list and links it into the list of the new tag
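The tag organization above can be sketched with a plain chained hash table; Python lists stand in for the hardware's upstream/downstream pointer links, and the returned step count models one direct-mapped-latency tag access per chain element.

```python
# Sketch of the victim-file tag lookup: tags hash to a direct-mapped
# index, and colliding tags chain into a list that is walked
# sequentially, one tag access per step.
class VictimFileTags:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.chains = [[] for _ in range(num_sets)]  # one chain per index

    def _index(self, tag):
        return tag % self.num_sets       # direct-mapped hash of the tag

    def insert(self, tag):
        self.chains[self._index(tag)].append(tag)

    def lookup(self, tag):
        # Walk the chain from the direct-mapped index; each iteration
        # models one tag access at direct-mapped latency.
        for steps, t in enumerate(self.chains[self._index(tag)], start=1):
            if t == tag:
                return steps             # number of tag accesses on a hit
        return None                      # miss
```

The common case (slide on victim file characteristics) is a one-element chain, so most lookups pay a single direct-mapped tag access.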
Slide 21: Victim file lookup
[Figure: the lookup index is (A >> BO) & (k-1); the k-entry VF tag array (with head and tail bits) points into the VF data array, and the walk ends on a hit or at an invalid entry. Each VF entry requires a back pointer into the heap so that, on a hit, a zero priority can be inserted into the matching heap node.]
Slide 22: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 23: Simulation environment
- Single-stream evaluation in this paper
- Configurations differ only in the L2 cache architecture
- Common attributes (more in the paper)
  - 4 GHz, 4-4-6 pipe, 128-entry ROB, 160-entry integer/FP register files
  - L1 caches: 32 KB/4-way/64B/LRU/0.75 ns
  - L2 cache miss latency (load-to-use): 121 ns
  - 16-stream stride prefetcher between the L2 cache and memory with max. stride 256B
- Applications: 1 billion representative dynamic instructions from sixteen SPEC 2000 applications (we will discuss results for nine memory-bound applications; the rest are in the paper)
Slide 24: Simulation environment
- L2 cache configurations
  - Baseline: 1 MB/8-way/64B/LRU/2.25 ns/15.54 mm2
  - Scavenger: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB VF (8192 entries x 64 B/entry)/0.5 ns, 0.75 ns + auxiliary data structures (8192-entry priority queue, BFs, pointer RAMs)/0.5 ns; 16.75 mm2 total
  - 16-way: 1 MB/16-way/64B/LRU/2.75 ns/26.4 mm2
  - 512KB-FA-VC: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB/FA/64B/Random/3.5 ns conventional VC
Slide 25: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 26: Victim file characteristics
- Number of tag accesses per L1 cache miss request
  - Mean below 1.5 for 14 applications
  - Mode (common case) is one for 15 applications (these enjoy direct-mapped latency)
  - More than 90% of requests require at most three tag accesses for 15 applications
Slide 27: Performance (Speedup)
[Figure: bar chart of speedup over the baseline (higher is better) for wu, sw, ap, vp, ar, mc, eq, am, tw. Averages (nine memory-bound applications, all sixteen): 16-way (1.01, 1.00), 512 KB-FA-VC (1.01, 1.01), Scavenger (1.14, 1.08); the best case reaches 1.63.]
Slide 28: Performance (L2 cache misses)
[Figure: bar chart of normalized L2 cache misses (lower is better) for wu, sw, ap, vp, ar, mc, eq, am, tw. Averages (nine memory-bound applications, all sixteen): 16-way (0.98, 0.98), 512 KB-FA-VC (0.94, 0.96), Scavenger (0.85, 0.90).]
Slide 29: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 30: L2 cache misses in recent proposals
[Figure: bar chart of normalized L2 cache misses (lower is better) for wu, sw, ap, vp, ar, mc, eq, am, tw. DIP [ISCA '07] (0.84) beats Scavenger in art and mcf only; V-way [ISCA '05] (0.87) beats Scavenger only in ammp; Scavenger (0.84) improves across the board. Annotation: the bottleneck is the BFs.]
Slide 31: Summary of Scavenger
- A last-level cache architecture with algorithms to discover global block priority
- Divides the storage into a conventional set-associative cache and a large, fast VF offering the functionality of a fully associative VF without using any CAM
- Insertion into the VF is controlled by a priority queue backed by a cache block miss frequency estimator
- Offers an IPC improvement of up to 63% and on average 8% for a set of sixteen SPEC 2000 applications
Slide 32: Scavenger: A New Last Level Cache Architecture with Global Block Priority
THANK YOU!
- Arkaprava Basu, IIT Kanpur
- Mainak Chaudhuri, IIT Kanpur
- Nevin Kirman, Cornell
- Meyrem Kirman, Cornell
- Jose F. Martinez, Cornell