Scavenger: A New Last Level Cache Architecture with Global Block Priority - PowerPoint PPT Presentation

Learn more at: https://microarch.org

Transcript and Presenter's Notes

1
Scavenger: A New Last Level Cache Architecture with Global Block Priority
  • Arkaprava Basu, IIT Kanpur
  • Mainak Chaudhuri, IIT Kanpur
  • Nevin Kirman, Cornell
  • Meyrem Kirman, Cornell
  • Jose F. Martinez, Cornell

2
Talk in one slide
  • Observation 1: a large number of blocks miss
    repeatedly in the last-level cache
  • Observation 2: the number of evictions between an
    eviction-reuse pair is too large to be captured
    by a conventional fully associative victim file
  • How to exploit this temporal behavior with such a
    large period?
  • Our solution prioritizes blocks evicted from the
    last-level cache by their miss frequencies
  • The top k most frequently missing blocks are
    scavenged and retained in a fast k-entry victim file

3
Sketch
  • Observations and hypothesis
  • Scavenger overview (Contributions)
  • Scavenger architecture
  • Frequency estimator
  • Priority queue
  • Victim file
  • Simulation environment
  • Simulation results
  • Related work
  • Summary

4
Observations and hypothesis
[Bar chart: ROB stall cycles (%) broken down by the categories 1, 2-9, 10-99, 100-999, and > 1000, for 512 KB 8-way and 1 MB 8-way L2 caches; y-axis 0-100, over the SPEC 2000 applications gz, wu, sw, ap, vp, gc, me, ar, mc, eq, cr, am, pe, bz, tw, aps]
5
Observations and hypothesis
[Figure: the distance between an eviction and its reuse — a conventional victim cache is "too small", while a fully associative (FA) structure large enough to capture it would be impractical ("wish, but too large")]
6
Observations and hypothesis
  • Block addresses repeat in the miss address stream
    of the last-level cache
  • Repeating block addresses in the miss stream
    cause significant ROB stall
  • Hypothesis: identifying and retaining the most
    frequently missing blocks in a victim file should
    be beneficial, but
  • the number of evictions between an eviction-reuse
    pair is very large
  • the temporal behavior happens at too large a scale to
    be captured by any reasonably sized fully
    associative victim file

7
Sketch
  • Observations and hypothesis
  • Scavenger overview (Contributions)
  • Scavenger architecture
  • Frequency estimator
  • Priority queue
  • Victim file
  • Simulation environment
  • Simulation results
  • Related work
  • Summary

8
Scavenger overview (Contributions)
  • Functional requirements
  • Determine the frequency of occurrence of an
    evicted block address in the miss stream seen so
    far
  • Determine (preferably in O(1) time) the minimum
    frequency among the top k frequently missing
    blocks; if the frequency of the current block
    is greater than or equal to this minimum, replace the
    minimum, insert the new frequency, and compute the new
    minimum quickly
  • Allocate a new block in the victim file by
    replacing the minimum-frequency block,
    irrespective of the addresses of these blocks

9
Scavenger overview (L2 eviction)
[Diagram: on an L2 eviction, the evicted block address (headed to the MC) probes the Bloom filter for a frequency estimate; the min-heap supplies the current minimum priority (Min.), and if Freq. >= Min. the heap replaces the minimum and inserts the new entry, allocating the block in the victim file]
10
Scavenger overview (L1 miss)
[Diagram: an L1 miss address looks up the L2 tags and the victim file; a victim file hit de-allocates the block, while a miss is forwarded to the MC; the Bloom filter and min-heap are updated alongside]
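The two flows above (L2 eviction and L1 miss) can be sketched end-to-end in software. This is a minimal illustration only: the Bloom filter, min-heap, and victim file are reduced to a plain dict, a `heapq` list, and a set purely to show the control flow, and `K` and all names are assumptions, not the paper's structures.

```python
import heapq

miss_freq = {}       # stands in for the Bloom-filter frequency estimator
victim_heap = []     # (priority, block_addr) min-heap, up to K entries
victim_file = set()  # stands in for the k-entry victim file data store
K = 4                # tiny k, for illustration only

def on_l2_eviction(addr):
    # Estimate the block's miss frequency and compare against the
    # minimum priority currently tracked by the heap.
    freq = miss_freq.get(addr, 0) + 1
    miss_freq[addr] = freq
    if len(victim_heap) < K:
        heapq.heappush(victim_heap, (freq, addr))
        victim_file.add(addr)
    elif freq >= victim_heap[0][0]:
        # Freq. >= Min.: replace the minimum-priority block.
        _, evicted = heapq.heapreplace(victim_heap, (freq, addr))
        victim_file.discard(evicted)
        victim_file.add(addr)
    # else: the block bypasses the victim file entirely

def on_l1_miss(addr):
    if addr in victim_file:
        victim_file.discard(addr)  # de-allocate on hit, move block to L2
        return "VF hit"
    return "miss (to MC)"
```

A block that misses often accumulates priority in `miss_freq`, so it eventually displaces a low-priority resident, which is the global-priority behavior the slide describes.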
11
Sketch
  • Observations and hypothesis
  • Scavenger overview (Contributions)
  • Scavenger architecture
  • Frequency estimator
  • Priority queue
  • Victim file
  • Simulation environment
  • Simulation results
  • Related work
  • Summary

12
Miss frequency estimator
[Diagram: the block address indexes five Bloom filter banks (BF0-BF4) whose counters read 189, 2419, 2523, 2215, and 140 in the example; the minimum counter (140) is taken as the frequency estimate]
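The estimator above can be sketched as a count-min-sketch-style structure: several counter banks, each indexed by a different hash of the block address, with the minimum counter as the estimate. The bank count, bank size, and hash choice here are illustrative assumptions, not the paper's exact parameters.

```python
import hashlib

class MissFrequencyEstimator:
    def __init__(self, num_banks=5, bank_size=4096):
        # One counter array per Bloom filter bank (BF0..BF4 on the slide).
        self.banks = [[0] * bank_size for _ in range(num_banks)]
        self.bank_size = bank_size

    def _index(self, bank, block_addr):
        # Derive an independent hash per bank from the block address
        # (hash choice is an assumption; hardware would use simple XOR nets).
        h = hashlib.blake2b(block_addr.to_bytes(8, "little"),
                            salt=bank.to_bytes(4, "little")).digest()
        return int.from_bytes(h[:4], "little") % self.bank_size

    def record_miss(self, block_addr):
        # On a miss, bump the indexed counter in every bank.
        for b in range(len(self.banks)):
            self.banks[b][self._index(b, block_addr)] += 1

    def estimate(self, block_addr):
        # Collisions only inflate counters, so the minimum across banks
        # is an upper bound on (and a good estimate of) the true count.
        return min(self.banks[b][self._index(b, block_addr)]
                   for b in range(len(self.banks)))
```

Taking the minimum across banks is what makes a handful of small tables approximate a per-block counter for the huge miss address space.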
13
Priority queue (min-heap)
[Diagram: a 15-node example min-heap (nodes T1-T15), each entry holding a priority, a victim-file pointer (VPTR), and a VF tag; the root T1 holds the minimum priority, 5. For a node at index i, the left child is i << 1 and the right child is (i << 1) + 1]
14
Pipelined min-heap
  • Both insertion and de-allocation require O(log k)
    steps for a k-entry heap
  • Each step involves read, comparison, and write
    operations; step latency is r + c + w cycles
  • A latency of (r + c + w) log(k) cycles is too high to
    cope with bursty cache misses
  • Both insertion and de-allocation must be
    pipelined
  • We unify insertion and de-allocation into a
    single pipelined operation called replacement
  • De-allocation is the same as a zero insertion
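The unified replacement operation can be sketched in software (ignoring the hardware pipelining) as an array min-heap with the slide's 1-indexed child arithmetic: left child i << 1, right child (i << 1) + 1. Class and method names are illustrative assumptions.

```python
class PriorityHeap:
    """Software sketch of the priority min-heap, 1-indexed."""

    def __init__(self, priorities):
        self.heap = [None] + list(priorities)  # slot 0 unused, for 1-indexing
        for i in range((len(self.heap) - 1) // 2, 0, -1):
            self._sift_down(i)

    def min(self):
        return self.heap[1]

    def replace_min(self, priority):
        # Replacement: overwrite the root (current minimum) with the new
        # priority and sift it down one level per R/C/W step.
        self.heap[1] = priority
        self._sift_down(1)

    def deallocate(self, i):
        # De-allocation is a zero insertion: the entry sifts up to the
        # root and becomes the next replacement victim.
        self.heap[i] = 0
        while i > 1 and self.heap[i] < self.heap[i >> 1]:
            self.heap[i], self.heap[i >> 1] = self.heap[i >> 1], self.heap[i]
            i >>= 1

    def _sift_down(self, i):
        n = len(self.heap) - 1
        while True:
            left, right = i << 1, (i << 1) + 1
            smallest = i
            if left <= n and self.heap[left] < self.heap[smallest]:
                smallest = left
            if right <= n and self.heap[right] < self.heap[smallest]:
                smallest = right
            if smallest == i:
                return
            self.heap[i], self.heap[smallest] = self.heap[smallest], self.heap[i]
            i = smallest
```

Because each sift step touches only one heap level, the hardware can pipeline one R/C/W stage per level and accept a new replacement every r + c + w cycles.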

15
Pipelined heap replacement
[Animation, four frames: a replacement operation (a new priority, 20 in the example) enters at the root and moves down one heap level per stage; each stage performs a read (R), compare (C), and write (W) on a node and its children (left child i << 1, right child (i << 1) + 1), swapping the incoming priority with the smaller child where needed, so consecutive replacements overlap in the pipeline]
19
Victim file
  • Functional requirements
  • Should be able to replace a block of minimum
    priority with a block of higher or equal priority,
    irrespective of addresses (fully associative
    functionality)
  • Should offer fast lookup (a conventional fully
    associative array won't do)
  • On a hit, should de-allocate the block and move
    it to the main L2 cache (different from conventional
    victim caches)

20
Victim file organization
  • Tag array
  • Direct-mapped hash table with collisions (i.e.,
    conflicts) resolved by chaining
  • Each tag entry contains an upstream (toward head)
    and a downstream (toward tail) pointer, and a
    head (H) and a tail (T) bit
  • Victim file lookup at address A walks the tag
    list sequentially starting at direct-mapped index
    of A
  • Each tag lookup has latency equal to that of a
    direct-mapped cache of the same size
  • A replacement delinks the replaced tag from its
    list and links it up with the list of the new tag
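The tag organization above can be sketched as a direct-mapped table whose collisions chain through the upstream/downstream pointers in the tag entries. This is a lookup-only sketch with illustrative field names and sizes; the replacement-time delinking/relinking the last bullet describes is only noted in comments.

```python
BLOCK_OFFSET_BITS = 6   # 64 B blocks
NUM_ENTRIES = 8192      # k-entry victim file (assumed size)

class TagEntry:
    def __init__(self):
        self.valid = False
        self.tag = None
        self.upstream = None    # pointer toward the head of the chain
        self.downstream = None  # pointer toward the tail
        self.head = False       # H bit: this entry starts a chain
        self.tail = False       # T bit: this entry ends a chain

class VictimFileTags:
    def __init__(self):
        self.entries = [TagEntry() for _ in range(NUM_ENTRIES)]

    def _index(self, addr):
        # Direct-mapped index: (A >> BO) & (k - 1)
        return (addr >> BLOCK_OFFSET_BITS) & (NUM_ENTRIES - 1)

    def lookup(self, addr):
        # Walk the chain rooted at the direct-mapped index; each hop
        # costs one direct-mapped tag access, so the common case of a
        # one-entry chain has direct-mapped latency.
        e = self.entries[self._index(addr)]
        if not e.valid or not e.head:
            return None  # no chain rooted at this index: miss
        while True:
            if e.tag == addr >> BLOCK_OFFSET_BITS:
                return e  # hit
            if e.tail or e.downstream is None:
                return None  # end of chain: miss
            e = self.entries[e.downstream]
    # A replacement would delink the victim entry from its old chain via
    # its upstream/downstream pointers and append it to the new tag's
    # chain, as described above (omitted in this sketch).
```

This gives fully associative placement (any entry can hold any tag) without a CAM, at the cost of a variable-length pointer walk on lookup.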

21
Victim file lookup
[Diagram: victim file lookup. The direct-mapped index is (A >> BO) & (k - 1); the lookup walks the tag chain from head (H) to tail (T) across the k-entry VF tag and data arrays, stopping at an invalid entry. A hit inserts zero priority into the corresponding heap node, which requires a back pointer from the tag entry to the heap]
22
Sketch
  • Observations and hypothesis
  • Scavenger overview (Contributions)
  • Scavenger architecture
  • Frequency estimator
  • Priority queue
  • Victim file
  • Simulation environment
  • Simulation results
  • Related work
  • Summary

23
Simulation environment
  • Single-stream evaluation in this paper
  • Configurations differ only in L2 cache architecture
  • Common attributes (more in the paper):
  • 4 GHz, 4-4-6 pipe, 128-entry ROB, 160 int/fp RF
  • L1 caches: 32 KB/4-way/64B/LRU/0.75 ns
  • L2 cache miss latency (load-to-use): 121 ns
  • 16-stream stride prefetcher between L2 cache and
    memory with max. stride 256B
  • Applications: 1 billion representative dynamic
    instructions from sixteen SPEC 2000 applications
    (results discussed for nine memory-bound
    applications; rest in the paper)

24
Simulation environment
  • L2 cache configurations
  • Baseline: 1 MB/8-way/64B/LRU/2.25 ns/15.54 mm2
  • Scavenger: 512 KB/8-way/64B/LRU/2 ns conventional
    L2 cache + 512 KB VF (8192 entries x 64
    B/entry)/0.5 ns, 0.75 ns + auxiliary data
    structures (8192-entry priority queue, BFs,
    pointer RAMs)/0.5 ns; 16.75 mm2
  • 16-way: 1 MB/16-way/64B/LRU/2.75 ns/26.4 mm2
  • 512KB-FA-VC: 512 KB/8-way/64B/LRU/2 ns
    conventional L2 cache + 512 KB/FA/64B/Random/3.5
    ns conventional VC

25
Sketch
  • Observations and hypothesis
  • Scavenger overview (Contributions)
  • Scavenger architecture
  • Frequency estimator
  • Priority queue
  • Victim file
  • Simulation environment
  • Simulation results
  • Related work
  • Summary

26
Victim file characteristics
  • Number of tag accesses per L1 cache miss request:
  • Mean below 1.5 for 14 applications
  • Mode (the common case) is one for 15 applications
    (these enjoy direct-mapped latency)
  • More than 90% of requests require at most three
    accesses for 15 applications

27
Performance (Speedup)
[Bar chart, higher is better; y-axis speedup 0.9-1.4, with one peak at 1.63. Configurations, with the two averages quoted on the slide: 16-way (1.01, 1.00), 512 KB-FA-VC (1.01, 1.01), Scavenger (1.14, 1.08). Benchmarks: wu, sw, ap, vp, ar, mc, eq, am, tw]
28
Performance (L2 cache misses)
[Bar chart, lower is better; y-axis normalized L2 cache misses 0.6-1.1. Configurations, with the two averages quoted on the slide: 16-way (0.98, 0.98), 512 KB-FA-VC (0.94, 0.96), Scavenger (0.85, 0.90). Benchmarks: wu, sw, ap, vp, ar, mc, eq, am, tw]
29
Sketch
  • Observations and hypothesis
  • Scavenger overview (Contributions)
  • Scavenger architecture
  • Frequency estimator
  • Priority queue
  • Victim file
  • Simulation environment
  • Simulation results
  • Related work
  • Summary

30
L2 cache misses in recent proposals
[Bar chart, lower is better; y-axis normalized L2 cache misses 0.4-1.00. DIP [ISCA'07] (0.84): beats Scavenger in art and mcf only. V-way [ISCA'05] (0.87): beats Scavenger only in ammp. Scavenger (0.84): improvement across the board; bottleneck: BFs. Benchmarks: wu, sw, ap, vp, ar, mc, eq, am, tw]
31
Summary of Scavenger
  • Last-level cache architecture with algorithms to
    discover global block priority
  • Divides the storage into a conventional
    set-associative cache and a large, fast VF
    offering the functionality of a FA VF without
    using any CAM
  • Insertion into the VF is controlled by a priority
    queue backed by a cache-block miss frequency
    estimator
  • Offers IPC improvement of up to 63% and on
    average 8% for a set of sixteen SPEC 2000
    applications

32
Scavenger: A New Last Level Cache Architecture with Global Block Priority
THANK YOU!
  • Arkaprava Basu, IIT Kanpur
  • Mainak Chaudhuri, IIT Kanpur
  • Nevin Kirman, Cornell
  • Meyrem Kirman, Cornell
  • Jose F. Martinez, Cornell