Title: Scavenger: A New Last Level Cache Architecture with Global Block Priority
Slide 1: Scavenger: A New Last Level Cache Architecture with Global Block Priority
- Arkaprava Basu, IIT Kanpur
- Mainak Chaudhuri, IIT Kanpur
- Nevin Kirman, Cornell
- Meyrem Kirman, Cornell
- Jose F. Martinez, Cornell
Slide 2: Talk in one slide
- Observation 1: a large number of blocks miss repeatedly in the last-level cache
- Observation 2: the number of evictions between an eviction-reuse pair is too large to be captured by a conventional fully associative victim file
- How do we exploit this temporal behavior with such a large period?
- Our solution prioritizes blocks evicted from the last-level cache by their miss frequencies
- The top k most frequently missing blocks are scavenged and retained in a fast k-entry victim file
Slide 3: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 4: Observations and hypothesis
[Figure: stacked bar chart of the percentage of ROB stall cycles attributable to blocks that miss 1, 2-9, 10-99, 100-999, and > 1000 times, for 512 KB 8-way and 1 MB 8-way L2 caches, across the SPEC 2000 applications (gz, wu, sw, ap, vp, gc, me, ar, mc, eq, cr, am, pe, bz, tw, aps).]
Slide 5: Observations and hypothesis
[Figure with annotations: "Too small" and "Wish, but too large (FA?)", contrasting conventional victim file sizes with the size needed to capture the observed reuse.]
Slide 6: Observations and hypothesis
- Block addresses repeat in the miss address stream of the last-level cache
- Repeating block addresses in the miss stream cause significant ROB stall
- Hypothesis: identifying and retaining the most frequently missing blocks in a victim file should be beneficial, but
- The number of evictions between an eviction-reuse pair is very large
- The temporal behavior happens at too large a scale to be captured by any reasonably sized fully associative victim file
Slide 7: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 8: Scavenger overview (Contributions)
- Functional requirements
  - Determine the frequency of occurrence of an evicted block address in the miss stream seen so far
  - Determine (preferably in O(1) time) the minimum frequency among the top k frequently missing blocks; if the frequency of the current block is greater than or equal to this minimum, replace the minimum, insert the new frequency, and compute the new minimum quickly
  - Allocate a new block in the victim file by replacing the minimum-frequency block, irrespective of the addresses of these blocks
Slide 9: Scavenger overview (L2 eviction)
[Figure: an evicted block's address goes to the memory controller and is also hashed into the Bloom filters to obtain its miss frequency; if Freq. > Min. (the minimum priority held in the min-heap), the minimum entry is replaced in the victim file and the new frequency is inserted into the heap.]
Slide 10: Scavenger overview (L1 miss)
[Figure: an L1 miss address looks up the L2 tags and the victim file; on a victim-file hit the block is de-allocated and moved into the main L2 cache, otherwise the request goes to the memory controller.]
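The two overview flows above can be sketched end to end. The class below is a toy software stand-in, not the paper's hardware: a dict models the Bloom-filter frequency estimator, another dict models the victim file with its priorities, and `min()` stands in for the O(1) minimum that the real min-heap provides.

```python
# Toy model of Scavenger's two control flows: scavenging on an L2
# eviction, and probing the victim file on an L1 miss. All structures
# here are illustrative stand-ins for the hardware described later.
class ToyScavenger:
    def __init__(self, vf_entries):
        self.freq = {}            # stands in for the Bloom-filter estimator
        self.vf = {}              # victim file: block address -> priority
        self.vf_entries = vf_entries

    def on_l2_eviction(self, addr):
        # Count this eviction in the miss-frequency estimator.
        self.freq[addr] = self.freq.get(addr, 0) + 1
        f = self.freq[addr]
        if len(self.vf) < self.vf_entries:
            self.vf[addr] = f
            return
        # Replace the minimum-priority victim if this block's frequency
        # is at least as large (the real min-heap gives this min in O(1)).
        victim = min(self.vf, key=self.vf.get)
        if f >= self.vf[victim]:
            del self.vf[victim]
            self.vf[addr] = f

    def on_l1_miss(self, addr):
        # Victim-file hit: de-allocate and move the block to the L2 cache
        # (modeled here by simply removing it from the victim file).
        if addr in self.vf:
            del self.vf[addr]
            return "vf-hit"
        return "memory"
```

Note the asymmetry with a conventional victim cache: a hit removes the block from the victim file entirely, since the block migrates to the main L2 cache.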
Slide 11: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 12: Miss frequency estimator
[Figure: the block address indexes five Bloom filter banks (BF0-BF4); the selected counters read 189, 2419, 2523, 2215, and 140, and the frequency estimate is the minimum of these counts (here 140).]
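The estimator on this slide can be sketched as a small count-min-style structure: each Bloom filter bank keeps counters, a miss increments one counter per bank, and the estimate is the minimum across banks. The table size and the hash mixing below are illustrative assumptions; the slide only specifies the min-over-filters estimate.

```python
# Sketch of the miss-frequency estimator: several counting Bloom filter
# banks are indexed by different hashes of the block address, and the
# estimate is the minimum of the selected counters. Collisions can only
# inflate individual counters, so the minimum bounds the true count.
class FrequencyEstimator:
    def __init__(self, num_filters=5, size=1024):
        self.tables = [[0] * size for _ in range(num_filters)]
        self.size = size

    def _index(self, block_addr, i):
        # Illustrative hash: mix the address differently per filter bank.
        return (block_addr * (2 * i + 3) + (block_addr >> (4 * i))) % self.size

    def record_miss(self, block_addr):
        for i, table in enumerate(self.tables):
            table[self._index(block_addr, i)] += 1

    def estimate(self, block_addr):
        # Minimum over all banks, as in the figure (min of 189, 2419,
        # 2523, 2215, 140 would yield 140).
        return min(table[self._index(block_addr, i)]
                   for i, table in enumerate(self.tables))
```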
Slide 13: Priority queue (min-heap)
[Figure: a 15-entry array-based min-heap (nodes T1-T15) with root priority 5 and internal priorities such as 6, 10, 13, 11, 9, 15, 12, and 18. For node i, the left child is at index i << 1 and the right child at (i << 1) | 1. Each heap node holds a priority and a VPTR back-pointer to its VF tag.]
Slide 14: Pipelined min-heap
- Both insertion and de-allocation require O(log k) steps for a k-entry heap
- Each step involves read, comparison, and write operations; step latency: r + c + w cycles
- A latency of (r + c + w) log(k) cycles is too high to cope with bursty cache misses
- Both insertion and de-allocation must be pipelined
- We unify insertion and de-allocation into a single pipelined operation called replacement
- De-allocation is the same as a zero insertion
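Ignoring the pipelining, the replacement operation itself can be sketched in a few lines: overwrite the minimum at the root with the new priority and sift it down, using the slide's child-index arithmetic on a 1-based array. This is a software sketch of the sequential version only; the hardware overlaps the per-level read-compare-write steps of successive replacements.

```python
# Sketch of one (non-pipelined) replacement on a 1-based array min-heap:
# the minimum-priority root is overwritten with the new priority, which
# then sifts down. Children of node i live at i << 1 and (i << 1) | 1.
def heap_replace_min(heap, new_priority):
    """heap[1..n] is a 1-based min-heap; returns the evicted minimum."""
    n = len(heap) - 1              # heap[0] is unused (1-based indexing)
    evicted = heap[1]
    heap[1] = new_priority
    i = 1
    while True:
        left, right = i << 1, (i << 1) | 1
        smallest = i
        if left <= n and heap[left] < heap[smallest]:
            smallest = left
        if right <= n and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:          # heap property restored
            break
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest
    return evicted
```

A de-allocation is the same operation with `new_priority = 0`: the zero immediately becomes the minimum, so the freed entry is the next one replaced.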
Slide 15: Pipelined heap replacement
[Figure: animation frame 1. A replacement operation (20, 10, 0) enters the heap pipeline at the root (priority 5); each heap level performs read (R), compare (C), and write (W) steps, with left child at i << 1 and right child at (i << 1) | 1.]
Slide 16: Pipelined heap replacement
[Figure: animation frame 2. The operation completes the root level's read-compare-write while the next level's read begins, allowing a following replacement to enter the pipeline behind it.]
Slide 17: Pipelined heap replacement
[Figure: animation frame 3. The new priority 20 has replaced the old minimum 5 and is compared with the children (6, 10); the smaller child 6 is promoted into the parent slot.]
Slide 18: Pipelined heap replacement
[Figure: animation frame 4. The value 20 continues sifting down and is compared with the next pair of children (11, 9); the heap property is restored level by level behind the operation.]
Slide 19: Victim file
- Functional requirements
  - Should be able to replace a block with minimum priority by a block of higher or equal priority, irrespective of addresses (fully associative functionality)
  - Should offer fast lookup (a conventional fully associative array won't do)
  - On a hit, should de-allocate the block and move it to the main L2 cache (different from conventional victim caches)
Slide 20: Victim file organization
- Tag array
  - Direct-mapped hash table with collisions (i.e., conflicts) resolved by chaining
  - Each tag entry contains an upstream (toward head) and a downstream (toward tail) pointer, and a head (H) and a tail (T) bit
- A victim file lookup at address A walks the tag list sequentially, starting at the direct-mapped index of A
- Each tag lookup has latency equal to that of a direct-mapped cache of the same size
- A replacement delinks the replaced tag from its list and links it into the list of the new tag
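The tag organization above can be sketched with a plain chained hash table; Python lists stand in for the hardware's upstream/downstream pointer links, and the returned step count models one direct-mapped-latency tag access per chain element.

```python
# Sketch of the victim-file tag lookup: tags hash to a direct-mapped
# index, and colliding tags chain into a list that is walked
# sequentially, one tag access per step.
class VictimFileTags:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.chains = [[] for _ in range(num_sets)]  # one chain per index

    def _index(self, tag):
        return tag % self.num_sets       # direct-mapped hash of the tag

    def insert(self, tag):
        self.chains[self._index(tag)].append(tag)

    def lookup(self, tag):
        # Walk the chain from the direct-mapped index; each iteration
        # models one tag access at direct-mapped latency.
        for steps, t in enumerate(self.chains[self._index(tag)], start=1):
            if t == tag:
                return steps             # number of tag accesses on a hit
        return None                      # miss
```

The common case (slide on victim file characteristics) is a one-element chain, so most lookups pay a single direct-mapped tag access.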
Slide 21: Victim file lookup
[Figure: the lookup index is (A >> BO) & (k-1); the k-entry VF tag array (with head and tail bits) points into the VF data array, and the walk ends on a hit or at an invalid entry. Each VF entry requires a back pointer into the heap so that, on a hit, a zero priority can be inserted into the matching heap node.]
Slide 22: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 23: Simulation environment
- Single-stream evaluation in this paper
- Configurations differ only in the L2 cache architecture
- Common attributes (more in the paper)
  - 4 GHz, 4-4-6 pipe, 128-entry ROB, 160-entry integer/FP register files
  - L1 caches: 32 KB/4-way/64B/LRU/0.75 ns
  - L2 cache miss latency (load-to-use): 121 ns
  - 16-stream stride prefetcher between the L2 cache and memory with max. stride 256B
- Applications: 1 billion representative dynamic instructions from sixteen SPEC 2000 applications (we will discuss results for nine memory-bound applications; the rest are in the paper)
Slide 24: Simulation environment
- L2 cache configurations
  - Baseline: 1 MB/8-way/64B/LRU/2.25 ns/15.54 mm2
  - Scavenger: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB VF (8192 entries x 64 B/entry)/0.5 ns, 0.75 ns + auxiliary data structures (8192-entry priority queue, BFs, pointer RAMs)/0.5 ns; 16.75 mm2 total
  - 16-way: 1 MB/16-way/64B/LRU/2.75 ns/26.4 mm2
  - 512KB-FA-VC: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB/FA/64B/Random/3.5 ns conventional VC
Slide 25: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 26: Victim file characteristics
- Number of tag accesses per L1 cache miss request
  - Mean below 1.5 for 14 applications
  - Mode (common case) is one for 15 applications (these enjoy direct-mapped latency)
  - More than 90% of requests require at most three tag accesses for 15 applications
Slide 27: Performance (Speedup)
[Figure: bar chart of speedup over the baseline (higher is better) for wu, sw, ap, vp, ar, mc, eq, am, tw. Averages (nine memory-bound applications, all sixteen): 16-way (1.01, 1.00), 512 KB-FA-VC (1.01, 1.01), Scavenger (1.14, 1.08); the best case reaches 1.63.]
Slide 28: Performance (L2 cache misses)
[Figure: bar chart of normalized L2 cache misses (lower is better) for wu, sw, ap, vp, ar, mc, eq, am, tw. Averages (nine memory-bound applications, all sixteen): 16-way (0.98, 0.98), 512 KB-FA-VC (0.94, 0.96), Scavenger (0.85, 0.90).]
Slide 29: Sketch
- Observations and hypothesis
- Scavenger overview (Contributions)
- Scavenger architecture
- Frequency estimator
- Priority queue
- Victim file
- Simulation environment
- Simulation results
- Related work
- Summary
Slide 30: L2 cache misses in recent proposals
[Figure: bar chart of normalized L2 cache misses (lower is better) for wu, sw, ap, vp, ar, mc, eq, am, tw. DIP [ISCA '07] (0.84) beats Scavenger in art and mcf only; V-way [ISCA '05] (0.87) beats Scavenger only in ammp; Scavenger (0.84) improves across the board. Annotation: the bottleneck is the BFs.]
Slide 31: Summary of Scavenger
- A last-level cache architecture with algorithms to discover global block priority
- Divides the storage into a conventional set-associative cache and a large, fast VF offering the functionality of a fully associative VF without using any CAM
- Insertion into the VF is controlled by a priority queue backed by a cache block miss frequency estimator
- Offers an IPC improvement of up to 63% and on average 8% for a set of sixteen SPEC 2000 applications
Slide 32: Scavenger: A New Last Level Cache Architecture with Global Block Priority
THANK YOU!
- Arkaprava Basu, IIT Kanpur
- Mainak Chaudhuri, IIT Kanpur
- Nevin Kirman, Cornell
- Meyrem Kirman, Cornell
- Jose F. Martinez, Cornell