1
High Performance Cache Replacement
Using Re-Reference Interval Prediction (RRIP)
  • Aamer Jaleel, Kevin Theobald,
  • Simon Steely Jr., Joel Emer
  • Intel Corporation, VSSAD

International Symposium on Computer Architecture
( ISCA 2010 )
2
Motivation
  • Factors making caching important
  • Increasing ratio of CPU speed to memory speed
  • Multi-core processors pose new challenges for
    shared cache management
  • LRU has been the standard replacement policy at
    the LLC
  • However, LRU has problems!

3
Problems with LRU Replacement
Working set larger than the cache (Wsize > LLCsize)
causes thrashing: every reference misses
References to non-temporal data (scans) discard the
frequently referenced working set: blocks that were
hitting begin to miss after each scan
Our studies show that scans occur frequently in
many commercial workloads
4
Desired Behavior from Cache Replacement
Working set larger than the cache → preserve some
of the working set in the cache (some references
hit instead of all missing)
Recurring scans → preserve the frequently referenced
working set in the cache
5
Prior Solutions to Enhance Cache Replacement
Working set larger than the cache → preserve some
of the working set in the cache
  • Dynamic Insertion Policy (DIP) →
    thrash-resistance with minimal changes to HW

Recurring scans → preserve the frequently referenced
working set in the cache
  • Least Frequently Used (LFU) → addresses scans, but
    LFU adds complexity and also performs badly for
    recency-friendly workloads

GOAL: Design a high-performing, scan-resistant
policy that requires minimal changes to HW
6
Belady's Optimal (OPT) Replacement Policy
  • Replacement decisions use perfect knowledge of the
    future reference order
  • Victim Selection Policy
  • Replace the block that will be re-referenced
    furthest in the future

[Figure: cache with blocks a–h; OPT evicts the block
re-referenced furthest in the future]
7
Practical Cache Replacement Policies
  • Replacement decisions made by predicting the
    future reference order
  • Victim Selection Policy
  • Replace the block predicted to be re-referenced
    furthest in the future
  • Continually update predictions of the future
    reference order
  • Natural update opportunities are on cache fills
    and cache hits

[Figure: same cache; the victim is the block predicted
to be re-referenced furthest in the future]
8
LRU Replacement in the Prediction Framework
  • The LRU chain maintains the re-reference
    prediction
  • Head of chain (i.e., MRU position) predicted to be
    re-referenced soon
  • Tail of chain (i.e., LRU position) predicted to be
    re-referenced far in the future
  • LRU predicts that blocks are re-referenced in
    reverse order of reference
  • Rename the LRU chain to the Re-Reference
    Prediction (RRP) Chain
  • Rename MRU position → RRP Head and LRU
    position → RRP Tail

[Figure: RRP chain h, g, f, e, d, c, b, a from
RRP Head to RRP Tail]
9
Practicality of Chain Based Replacement
  • Problem: chain-based replacement is too
    expensive!
  • log2(associativity) bits required per cache block
    (16-way requires 4 bits/block)
  • Solution: LRU chain positions can be quantized
    into different buckets
  • Each bucket corresponds to a predicted
    Re-Reference Interval
  • The value of a bucket is called the Re-Reference
    Prediction Value (RRPV)
  • Hardware cost: n bits per block; ideally you
    would like n < log2(associativity)
    (see the sketch below)

[Figure: RRP chain quantized into RRPV buckets, n = 2]
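For concreteness, the per-block state can be held in a small bitfield. This sketch (mine, not from the slides) contrasts the 4-bit exact stack position chain-based LRU needs for a 16-way cache with a 2-bit RRPV:

```c
#include <stdint.h>

/* Per-block replacement state for a 16-way set (illustrative sketch).
 * Chain-based LRU needs log2(16) = 4 bits per block to encode the
 * exact stack position; quantized RRIP needs only n bits (n = 2 here). */
struct lru_block  { uint64_t tag; unsigned stack_pos : 4; };  /* 0..15 */
struct rrip_block { uint64_t tag; unsigned rrpv      : 2; };  /* 0..3  */
```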

10
Representation of Quantized Replacement (n = 2)
[Figure: the four RRPV buckets for n = 2, from 0
(predicted to be re-referenced soonest) to 3
(predicted to be re-referenced in the distant future)]
11
Emulating LRU with Quantized Buckets (n = 2)
  • Victim Selection Policy: evict a block with
    distant RRPV (i.e., 2^n − 1 = 3)
  • If no distant RRPV (i.e., 3) is found, increment
    all RRPVs and repeat the search
  • If multiple are found, we need a tie breaker; let us
    always start the search from physical way 0
  • Insertion Policy: insert new block with RRPV = 0
  • Update Policy: cache hits update the block's
    RRPV = 0 (see the sketch below)

But We Want to do BETTER than LRU!!!
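The mechanics above fit in a few lines of C. This is a minimal software sketch (mine, not from the slides) of the quantized victim search plus the LRU-emulating insertion and update rules:

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 16
#define N_BITS 2
#define RRPV_MAX ((1u << N_BITS) - 1)   /* "distant" = 3 for n = 2 */

typedef struct { uint64_t tag; uint8_t rrpv; bool valid; } Block;

/* Victim selection: find a block with distant RRPV (== RRPV_MAX),
 * searching from physical way 0 as the tie breaker. If none exists,
 * age the whole set by incrementing every RRPV, then search again.
 * (Invalid ways are taken first; the slides do not discuss them.) */
static int select_victim(Block set[WAYS]) {
    for (;;) {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid || set[w].rrpv == RRPV_MAX)
                return w;
        for (int w = 0; w < WAYS; w++)
            set[w].rrpv++;   /* terminates: RRPVs reach RRPV_MAX */
    }
}

/* LRU emulation: both insertion and hits set RRPV = 0. */
static void on_fill(Block *b, uint64_t tag) { b->tag = tag; b->valid = true; b->rrpv = 0; }
static void on_hit (Block *b)               { b->rrpv = 0; }
```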
12
Re-Reference Interval Prediction (RRIP)
  • The framework enables re-reference predictions to be
    tuned at insertion/update
  • Unlike LRU, can use a non-zero RRPV on insertion
  • Unlike LRU, can use a non-zero RRPV on cache
    hits
  • Static Re-Reference Interval Prediction (SRRIP)
  • Determine the best insertion/update prediction using
    profiling and apply it to all apps
  • Dynamic Re-Reference Interval Prediction (DRRIP)
  • Dynamically determine the best re-reference
    prediction at insertion
13
Static RRIP Insertion Policy: Learn the Block's
Re-reference Interval
  • Key idea: do not give new blocks too much (or
    too little) time in the cache
  • Predict that a new cache block will not be
    re-referenced soon
  • Insert the new block with some RRPV other than 0
  • Similar to inserting in the middle of the RRP
    chain
  • However it is NOT identical to a fixed insertion
    position on the RRP chain (see paper)
14
Static RRIP Update Policy on Cache Hits
  • Hit Priority (HP)
  • Like LRU, always update RRPV = 0 on cache hits
  • Intuition: predicts that blocks receiving hits
    after insertion will be re-referenced soon
    (a combined sketch follows below)

An Alternative Update Scheme Is Also Described in the
Paper
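Putting the SRRIP insertion rule and the Hit Priority update together, and reusing the Block type, RRPV_MAX, and select_victim from the earlier sketch, a minimal SRRIP-HP looks like this (mine, not from the slides):

```c
/* SRRIP-Hit Priority. Only the insertion value changes versus the
 * LRU emulation above: new blocks enter with a "long" prediction
 * (RRPV_MAX - 1 = 2 for n = 2), so a scan must age them to distant
 * before it can displace working-set blocks that sit at RRPV = 0.
 * Hits still promote the block all the way to RRPV = 0. */
static void srrip_on_fill(Block *b, uint64_t tag) {
    b->tag = tag; b->valid = true;
    b->rrpv = RRPV_MAX - 1;   /* 2^n - 2: best insertion per slide 16 */
}
static void srrip_on_hit(Block *b) { b->rrpv = 0; }
```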
15
SRRIP Hit Priority: Sensitivity to Cache Insertion
Prediction at LLC
[Chart: performance vs. RRPV at insertion for n = 1,
averaged across PC games, multimedia, server, and
SPEC CPU2006 workloads on a 16-way 2MB LLC]
n = 1 is in fact the NRU replacement policy
commonly used in commercial processors
16
SRRIP Hit Priority: Sensitivity to Cache Insertion
Prediction at LLC
[Chart: performance vs. RRPV at insertion for
n = 1 through n = 5, averaged across PC games,
multimedia, server, and SPEC CPU2006 workloads
on a 16-way 2MB LLC]
Regardless of n, Static RRIP performs best when
RRPV_insertion is 2^n − 2
Regardless of n, Static RRIP performs worst when
RRPV_insertion is 2^n − 1
17
Why Does RRPV_insertion of 2^n − 2 Work Best for
SRRIP?
[Figure: working set of size Wsize receiving hits,
interrupted by a scan of length Slen]
  • Before the scan, the re-reference prediction of the
    active working set is 0
  • Recall, NRU (n = 1) is not scan-resistant
  • For scan resistance, RRPV_insertion MUST be
    different from the RRPV of working-set blocks
  • A larger insertion RRPV tolerates larger scans
  • The maximum insertion prediction (i.e., 2^n − 2)
    works best!
  • In general, re-references after a scan hit IF
    Slen < (RRPV_insertion − Starting-RRPV_workingset)
    × (LLCsize − Wsize)
  • SRRIP is scan-resistant for Slen <
    RRPV_insertion × (LLCsize − Wsize)

For n > 1, Static RRIP is Scan Resistant! What
about Thrash Resistance? (a worked numeric instance
follows below)
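As an illustrative numeric instance (numbers mine, not from the slide): take n = 2 so that RRPV_insertion = 2^n − 2 = 2, a 16-block cache, and an 8-block working set sitting at RRPV = 0. The condition above becomes

$$ S_{len} < \mathrm{RRPV}_{insertion} \times (\mathrm{LLC}_{size} - W_{size}) = 2 \times (16 - 8) = 16, $$

so SRRIP keeps the working set resident across scans of up to 15 blocks between working-set re-references.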
18
DRRIP: Extending Scan-Resistant SRRIP to Be
Thrash-Resistant
  • Always using the same prediction for all insertions
    will thrash the cache
  • Like DIP, need to preserve some fraction of the
    working set in the cache
  • Extend DIP to SRRIP to provide thrash resistance
  • Dynamic Re-Reference Interval Prediction
  • Dynamically select between inserting blocks with
    2^n − 1 and 2^n − 2 using Set Dueling
  • Inserting blocks with 2^n − 1 is the same as no
    update on insertion (a set-dueling sketch follows
    below)

DRRIP Provides Both Scan-Resistance and
Thrash-Resistance
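A simplified sketch of the set-dueling mechanics, in the spirit of the paper's DRRIP (which duels SRRIP insertion against a bimodal variant). The leader-set mapping, the 10-bit PSEL width, and the 1/32 bimodal probability are illustrative choices of mine, not values from the slides:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define RRPV_MAX 3                      /* n = 2, as in the earlier sketch */
#define PSEL_BITS 10
#define PSEL_MAX ((1 << PSEL_BITS) - 1)
static unsigned psel = PSEL_MAX / 2;    /* saturating policy selector */

/* Hypothetical leader-set mapping: a few sets always use SRRIP
 * insertion, a few always use bimodal; all others follow the winner. */
static bool is_srrip_leader(unsigned set)   { return (set % 64) == 0; }
static bool is_bimodal_leader(unsigned set) { return (set % 64) == 1; }

/* Choose the RRPV for a newly inserted block. */
static uint8_t drrip_insert_rrpv(unsigned set) {
    bool use_srrip;
    if (is_srrip_leader(set))        use_srrip = true;
    else if (is_bimodal_leader(set)) use_srrip = false;
    else                             use_srrip = (psel < PSEL_MAX / 2);
    if (use_srrip)
        return RRPV_MAX - 1;             /* SRRIP: always 2^n - 2 */
    /* Bimodal: insert at "distant" (2^n - 1) most of the time, and
     * only rarely at 2^n - 2, preserving a fraction of a thrashing
     * working set in the cache. */
    return (rand() % 32 == 0) ? RRPV_MAX - 1 : RRPV_MAX;
}

/* On a miss in a leader set, steer PSEL away from that leader's policy. */
static void drrip_on_leader_miss(unsigned set) {
    if (is_srrip_leader(set) && psel < PSEL_MAX)   psel++;
    else if (is_bimodal_leader(set) && psel > 0)   psel--;
}
```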
19
Performance Comparison of Replacement Policies
[Chart: performance of replacement policies on a
16-way 2MB LLC]
Static RRIP always outperforms LRU replacement;
Dynamic RRIP further improves performance over
Static RRIP
20
Cache Replacement Competition (CRC) Results
[Chart: DRRIP ranking, averaged across PC games,
multimedia, enterprise server, and SPEC CPU2006
workloads]
  • 16-way 1MB private cache,
    65 single-threaded workloads
  • 16-way 4MB shared cache,
    165 4-core workloads

Un-tuned DRRIP would be ranked 2nd and is within
1% of the CRC winner; unlike the CRC winner, DRRIP
does not require any changes to the cache structure
21
Total Storage Overhead (16-way Set Associative
Cache)
  • LRU: 4 bits / cache block
  • NRU: 1 bit / cache block
  • DRRIP-3: 3 bits / cache block
  • CRC Winner: 8 bits / cache block

DRRIP outperforms LRU with less storage than LRU;
NRU can be easily extended to realize DRRIP!
(see the worked total below)
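As a worked total (assuming 64-byte cache blocks, an assumption not stated on the slide), the 2MB LLC used throughout holds 2MB / 64B = 32,768 blocks, so the overheads above come to

$$ \text{DRRIP-3: } 32768 \times 3\ \text{bits} = 12\ \text{KB}, \qquad \text{LRU: } 32768 \times 4\ \text{bits} = 16\ \text{KB}. $$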
22
Summary
  • Scan-resistance is an important problem in
    commercial workloads
  • State-of-the-art policies do not address
    scan-resistance
  • We propose a simple and practical replacement
    policy
  • Static RRIP (SRRIP) for scan-resistance
  • Dynamic RRIP (DRRIP) for thrash-resistance and
    scan-resistance
  • DRRIP requires ONLY 3 bits per block
  • In fact, it incurs less storage than LRU
  • Un-tuned DRRIP would take 2nd place in the CRC
    Championship
  • DRRIP requires significantly less storage than the
    CRC winner

23
Q&A
24
Q&A
25
Q&A
26
Static RRIP with n = 1
  • Static RRIP with n = 1 is the commonly used NRU
    policy (polarity inverted)
  • Victim Selection Policy: evict a block with
    RRPV = 1
  • Insertion Policy: insert new block with RRPV = 0
  • Update Policy: cache hits update the block's
    RRPV = 0

But NRU Is Not Scan-Resistant!
27
SRRIP Update Policy on Cache Hits
  • Frequency Priority (FP)
  • Improve the re-reference prediction to be shorter
    than before on hits (i.e., RRPV−−)
  • Intuition: like LFU, predicts that frequently
    referenced blocks should have higher priority to
    stay in the cache (see the sketch below)
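For contrast with Hit Priority, a minimal Frequency Priority update, continuing the same Block sketch from earlier (mine, not from the slides):

```c
/* SRRIP-Frequency Priority: a hit makes the prediction one step
 * shorter rather than jumping straight to 0 (saturating at 0), so a
 * block must accumulate hits to reach the "near" bucket. */
static void srrip_fp_on_hit(Block *b) {
    if (b->rrpv > 0)
        b->rrpv--;
}
```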
28
SRRIP-HP and SRRIP-FP Cache Performance
[Chart: SRRIP-FP vs. SRRIP-HP cache performance]
  • SRRIP-HP has 2X better cache performance relative
    to LRU than SRRIP-FP
  • We do not need to precisely detect frequently
    referenced blocks
  • We need to preserve blocks that receive hits
29
Common Access Patterns in Workloads
  • Stack Access Pattern: (a1, a2, …, ak, …, a2, a1)^A
  • Solution: for any k, LRU performs well for such
    access patterns
  • Streaming Access Pattern: (a1, a2, …, ak) for
    k ≫ assoc
  • No solution: cache replacement cannot solve this
    problem
  • Thrashing Access Pattern: (a1, a2, …, ak)^A, for
    k > assoc
  • LRU receives no cache hits due to cache thrashing
  • Solution: preserve some fraction of the working set
    in the cache (e.g., use BIP)
  • BIP does NOT update replacement state for the
    majority of cache insertions
  • Mixed Access Pattern: (a1, a2, …, ak, …, a2, a1)^A
    (b1, b2, …, bm)^N, m > assoc − k
  • LRU always misses on the frequently referenced
    (a1, a2, …, ak, …, a2, a1)^A
  • (b1, b2, …, bm) is commonly referred to as a scan
    in the literature
  • In the absence of scans, LRU performs well for such
    access patterns
  • Solution: preserve the frequently referenced working
    set in the cache (e.g., use LFU)
  • LFU replaces infrequently referenced blocks in
    the presence of frequently referenced blocks

30
Performance of Hybrid Replacement Policies at LLC
[Chart: PC games/multimedia, SPEC CPU2006, server,
and average; 4-way OoO processor, 32KB L1, 256KB L2,
2MB LLC]
  • DIP addresses SPEC workloads but NOT PC games /
    multimedia workloads
  • Real-world workloads prefer scan-resistance
    over thrash-resistance

31
Understanding LRU Enhancements in the Prediction
Framework
  • Recent policies, e.g., DIP, say "Insert new
    blocks at the LRU position"
  • What does it mean to insert an MRU line in the
    LRU position?
  • It is a prediction that the new block will be
    re-referenced later than the existing blocks in
    the cache
  • What DIP really means is "Insert new blocks at
    the RRP Tail"
  • Other policies, e.g., PIPP, say "Insert new
    blocks in the middle of the LRU chain"
  • A prediction that the new block will be
    re-referenced at an intermediate time

The Re-Reference Prediction Framework Helps
Describe the Intuitions Behind Existing
Replacement Policy Enhancements
32
Performance Comparison of Replacement Policies
[Chart: performance of replacement policies on a
16-way 2MB LLC]
Static RRIP always outperforms LRU replacement;
Dynamic RRIP further improves performance over
Static RRIP