Title: High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)
1. High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)
- Aamer Jaleel, Kevin Theobald, Simon Steely Jr., Joel Emer
- Intel Corporation, VSSAD
International Symposium on Computer Architecture (ISCA 2010)
2. Motivation
- Factors making caching important:
  - Increasing ratio of CPU speed to memory speed
  - Multi-core poses challenges for better shared cache management
- LRU has been the standard replacement policy at the LLC
- However, LRU has problems!
3. Problems with LRU Replacement
- A working set larger than the cache (Wsize > LLCsize) causes thrashing: every reference misses
- References to non-temporal data (scans) discard the frequently referenced working set: blocks that were hitting miss after each scan
- Our studies show that scans occur frequently in many commercial workloads
4. Desired Behavior from Cache Replacement
- Working set larger than the cache → preserve some of the working set in the cache
- Recurring scans → preserve the frequently referenced working set in the cache
5. Prior Solutions to Enhance Cache Replacement
- Working set larger than the cache → preserve some of the working set in the cache
  - Dynamic Insertion Policy (DIP): thrash resistance with minimal changes to HW
- Recurring scans → preserve the frequently referenced working set in the cache
  - Least Frequently Used (LFU) addresses scans, but LFU adds complexity and also performs poorly for recency-friendly workloads
GOAL: Design a High-Performing Scan-Resistant Policy that Requires Minimal Changes to HW
6. Belady's Optimal (OPT) Replacement Policy
- Replacement decisions use perfect knowledge of the future reference order
- Victim Selection Policy:
  - Replaces the block that will be re-referenced furthest in the future
7. Practical Cache Replacement Policies
- Replacement decisions are made by predicting the future reference order
- Victim Selection Policy:
  - Replace the block predicted to be re-referenced furthest in the future
- Continually update predictions of the future reference order
  - Natural update opportunities are on cache fills and cache hits
8. LRU Replacement in the Prediction Framework
- The LRU chain maintains the re-reference prediction
  - Head of chain (i.e., MRU position): predicted to be re-referenced soon
  - Tail of chain (i.e., LRU position): predicted to be re-referenced far in the future
- LRU predicts that blocks are re-referenced in reverse order of reference
- Rename the LRU chain to the Re-Reference Prediction (RRP) chain
  - Rename MRU position → RRP Head, and LRU position → RRP Tail
9. Practicality of Chain-Based Replacement
- Problem: chain-based replacement is too expensive!
  - log2(associativity) bits required per cache block (16-way requires 4 bits/block)
- Solution: LRU chain positions can be quantized into different buckets
  - Each bucket corresponds to a predicted re-reference interval
  - The value of a bucket is called the Re-Reference Prediction Value (RRPV)
  - Hardware cost: n bits per block; ideally n < log2(associativity)
10. Representation of Quantized Replacement (n = 2)
11. Emulating LRU with Quantized Buckets (n = 2)
- Victim Selection Policy: evict a block with distant RRPV (i.e., 2^n − 1 = 3)
  - If no distant RRPV (i.e., 3) is found, increment all RRPVs and repeat the search
  - If multiple are found, a tie breaker is needed; always start the search from physical way 0
- Insertion Policy: insert new blocks with RRPV = 0
- Update Policy: cache hits update the block's RRPV to 0
But We Want to do BETTER than LRU!!!
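The victim selection steps above can be sketched in a few lines. This is a minimal illustrative model, not the paper's hardware description; the function name and list-based state are my own.

```python
# Sketch of quantized (RRIP-style) victim selection for n = 2.
# rrpv holds one Re-Reference Prediction Value per physical way.

DISTANT = 3  # 2^n - 1 with n = 2: "re-referenced in the distant future"

def select_victim(rrpv):
    """Return the way index of the victim, aging RRPVs in place if needed."""
    while True:
        # Tie breaker from the slide: always search from physical way 0.
        for way, value in enumerate(rrpv):
            if value == DISTANT:
                return way
        # No distant block found: increment every RRPV and repeat the search.
        for way in range(len(rrpv)):
            rrpv[way] += 1

rrpvs = [2, 0, 1, 2]
victim = select_victim(rrpvs)   # ages all RRPVs once, then picks way 0
```

Note how the aging loop preserves the relative ordering of the blocks while making at least one block eligible for eviction.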
12. Re-Reference Interval Prediction (RRIP)
- The framework enables re-reference predictions to be tuned at insertion/update
  - Unlike LRU, can use a non-zero RRPV on insertion
  - Unlike LRU, can use a non-zero RRPV on cache hits
- Static Re-Reference Interval Prediction (SRRIP)
  - Determine the best insertion/update prediction using profiling and apply it to all apps
- Dynamic Re-Reference Interval Prediction (DRRIP)
  - Dynamically determine the best re-reference prediction at insertion
13. Static RRIP Insertion Policy: Learn a Block's Re-Reference Interval
- Key Idea: do not give new blocks too much (or too little) time in the cache
  - Predict that a new cache block will not be re-referenced soon
  - Insert new blocks with some RRPV other than 0
- Similar to inserting in the middle of the RRP chain
  - However, it is NOT identical to a fixed insertion position on the RRP chain (see paper)
14. Static RRIP Update Policy on Cache Hits
- Hit Priority (HP)
  - Like LRU, always update RRPV to 0 on cache hits
  - Intuition: predicts that blocks receiving hits after insertion will be re-referenced soon
An Alternative Update Scheme Is Also Described in the Paper
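Combining the insertion policy (slide 13) and the Hit Priority update (slide 14) gives a complete single-set SRRIP-HP model. The sketch below is illustrative only: the class name and list-based state are mine, and real hardware keeps this state per set in the tag array.

```python
# Minimal sketch of one SRRIP-HP cache set (n = 2):
#   insert at RRPV = 2^n - 2, reset RRPV to 0 on a hit,
#   evict a block with RRPV = 2^n - 1 (aging all RRPVs when none exists).

class SRRIPSet:
    def __init__(self, ways, n=2):
        self.distant = (1 << n) - 1           # 2^n - 1
        self.insert_rrpv = self.distant - 1   # 2^n - 2, best per the slides
        self.tags = [None] * ways
        self.rrpv = [self.distant] * ways     # empty ways look "distant"

    def access(self, tag):
        """Return True on a hit, False on a miss (after evict + fill)."""
        if tag in self.tags:                  # hit: predict near re-reference
            self.rrpv[self.tags.index(tag)] = 0
            return True
        while self.distant not in self.rrpv:  # age until a victim exists
            self.rrpv = [v + 1 for v in self.rrpv]
        way = self.rrpv.index(self.distant)   # tie breaker: lowest way
        self.tags[way] = tag
        self.rrpv[way] = self.insert_rrpv     # long, not distant, prediction
        return False

s = SRRIPSet(ways=2)
s.access('a'); s.access('b')   # two fills
hit = s.access('a')            # hit: 'a' drops to RRPV = 0
```

Because 'a' now sits at RRPV = 0 while new blocks enter at RRPV = 2, a subsequent scan block is evicted before 'a' is, which is exactly the scan resistance argued on the next slides.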
15. SRRIP Hit Priority: Sensitivity to Cache Insertion Prediction at LLC
Averaged across PC Games, Multimedia, Server, and SPEC06 workloads on a 16-way 2MB LLC
[Figure: performance vs. RRIP value at insertion, n = 1]
n = 1 is in fact the NRU replacement policy commonly used in commercial processors
16. SRRIP Hit Priority: Sensitivity to Cache Insertion Prediction at LLC
Averaged across PC Games, Multimedia, Server, and SPEC06 workloads on a 16-way 2MB LLC
[Figure: performance vs. RRIP value at insertion, for n = 1 through n = 5]
Regardless of n, Static RRIP Performs Best When RRPVinsertion is 2^n − 2
Regardless of n, Static RRIP Performs Worst When RRPVinsertion is 2^n − 1
17. Why Does an RRPVinsertion of 2^n − 2 Work Best for SRRIP?
- Before a scan, the re-reference prediction of the active working set is 0
- Recall, NRU (n = 1) is not scan-resistant
  - For scan resistance, RRPVinsertion MUST be different from the RRPV of the working-set blocks
- A larger insertion RRPV tolerates larger scans
  - The maximum insertion prediction (i.e., 2^n − 2) works best!
- In general, re-references after a scan hit IF:
  - Slen < (RRPVinsertion − Starting-RRPVworkingset) × (LLCsize − Wsize)
- SRRIP is scan-resistant for Slen < (RRPVinsertion) × (LLCsize − Wsize)
For n > 1, Static RRIP is Scan-Resistant! What about Thrash Resistance?
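The scan-resistance bound above can be made concrete with a worked example. All the numbers below (block size, working-set size) are illustrative assumptions of mine, not figures from the paper; only the formula comes from the slide.

```python
# Worked example of the slide's scan-resistance bound:
#     Slen < RRPVinsertion * (LLCsize - Wsize)
# All quantities are measured in cache blocks.

n = 2
rrpv_insertion = (1 << n) - 2        # 2^n - 2 = 2, the best insertion RRPV
llc_blocks = 2 * 1024 * 1024 // 64   # 2 MB LLC of 64 B blocks = 32768 blocks
w_blocks = 24 * 1024                 # assumed 1.5 MB working set, in blocks

max_scan = rrpv_insertion * (llc_blocks - w_blocks)
# Scans shorter than max_scan blocks leave the working set intact;
# a larger n (larger RRPVinsertion) tolerates proportionally longer scans.
```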
18. DRRIP: Extending Scan-Resistant SRRIP to Be Thrash-Resistant
- Always using the same prediction for all insertions will thrash the cache
- Like DIP, we need to preserve some fraction of the working set in the cache
- Extend DIP to SRRIP to provide thrash resistance: Dynamic Re-Reference Interval Prediction
  - Dynamically select between inserting blocks with 2^n − 1 and 2^n − 2 using Set Dueling
  - Inserting blocks with 2^n − 1 is the same as no update on insertion
DRRIP Provides Both Scan-Resistance and Thrash-Resistance
19. Performance Comparison of Replacement Policies
16-way 2MB LLC
Static RRIP Always Outperforms LRU Replacement
Dynamic RRIP Further Improves Performance of
Static RRIP
20. Cache Replacement Competition (CRC) Results
Averaged across PC Games, Multimedia, Enterprise Server, and SPEC CPU2006 workloads
- 16-way 1MB private cache: 65 single-threaded workloads
- 16-way 4MB shared cache: 165 4-core workloads
Un-tuned DRRIP Would Be Ranked 2nd and Is Within 1% of the CRC Winner; Unlike the CRC Winner, DRRIP Does Not Require Any Changes to the Cache Structure
21. Total Storage Overhead (16-way Set-Associative Cache)
- LRU: 4 bits / cache block
- NRU: 1 bit / cache block
- DRRIP-3: 3 bits / cache block
- CRC Winner: 8 bits / cache block
DRRIP Outperforms LRU With Less Storage Than LRU; NRU Can Be Easily Extended to Realize DRRIP!
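The per-block figures above translate into total replacement-state storage as follows; the 2 MB capacity and 64 B block size are assumptions of mine for the arithmetic, not slide data.

```python
# Back-of-envelope totals for the slide's per-block replacement-state costs,
# assuming a 16-way cache of 2 MB with 64 B blocks.
import math

ways = 16
blocks = 2 * 1024 * 1024 // 64            # 32768 blocks in the cache

bits_per_block = {
    'LRU':   math.ceil(math.log2(ways)),  # 4 bits: position in the chain
    'NRU':   1,                           # single "not recently used" bit
    'DRRIP': 3,                           # 3-bit RRPV (n = 3)
}
total_kib = {p: b * blocks / 8 / 1024 for p, b in bits_per_block.items()}
# LRU: 16 KiB, NRU: 4 KiB, DRRIP: 12 KiB of replacement state
```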
22. Summary
- Scan-resistance is an important problem in commercial workloads
  - State-of-the-art policies do not address scan-resistance
- We propose a simple and practical replacement policy
  - Static RRIP (SRRIP) for scan-resistance
  - Dynamic RRIP (DRRIP) for thrash-resistance and scan-resistance
- DRRIP requires ONLY 3 bits per block
  - In fact, it incurs less storage than LRU
- Un-tuned DRRIP would take 2nd place in the CRC championship
  - DRRIP requires significantly less storage than the CRC winner
23–25. Q&A
26. Static RRIP with n = 1
- Static RRIP with n = 1 is the commonly used NRU policy (polarity inverted)
- Victim Selection Policy: evict a block with RRPV = 1
- Insertion Policy: insert new blocks with RRPV = 0
- Update Policy: cache hits update the block's RRPV to 0
But NRU Is Not Scan-Resistant!
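NRU viewed through the RRIP lens fits in one function. A minimal sketch under my own naming, with the slide's polarity (RRPV = 1 means "distant", i.e., evictable); real NRU inverts the bit's meaning.

```python
# NRU as SRRIP with n = 1: one bit per block.
# RRPV = 1 -> predicted distant (evictable); insertion and hits set RRPV = 0.

def nru_access(tags, rrpv, tag):
    """Return True on a hit; on a miss, evict + fill and return False."""
    if tag in tags:
        rrpv[tags.index(tag)] = 0          # hit: mark "re-referenced soon"
        return True
    if 1 not in rrpv:                      # every block looks recent:
        rrpv[:] = [1] * len(rrpv)          # age all (the n = 1 increment)
    way = rrpv.index(1)                    # tie breaker: lowest physical way
    tags[way], rrpv[way] = tag, 0          # insert with RRPV = 0
    return False

tags, rrpv = ['a', 'b'], [0, 0]
nru_access(tags, rrpv, 'c')   # miss: ages both bits, then evicts way 0
```

Since every new block enters at RRPV = 0, scan blocks are indistinguishable from working-set blocks, which is exactly why NRU is not scan-resistant.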
27. SRRIP Update Policy on Cache Hits
- Frequency Priority (FP)
  - Improve the re-reference prediction to be shorter than before on hits (i.e., RRPV--)
  - Intuition: like LFU, predicts that frequently referenced blocks should have higher priority to stay in the cache
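The FP update is a one-liner: decrement the block's RRPV on a hit, saturating at 0, rather than zeroing it outright as HP does. The helper name is mine.

```python
# Frequency Priority (FP) hit update: RRPV-- with saturation at 0,
# so a block earns a shorter re-reference prediction hit by hit (LFU-like).

def fp_hit_update(rrpv, way):
    rrpv[way] = max(0, rrpv[way] - 1)

r = [2, 1, 0]
fp_hit_update(r, 0)   # way 0 steps from 2 down to 1
fp_hit_update(r, 2)   # way 2 is already 0: saturates, stays 0
```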
28. SRRIP-HP and SRRIP-FP Cache Performance
[Figure: SRRIP-Frequency Priority vs. SRRIP-Hit Priority]
- SRRIP-HP has 2X better cache performance relative to LRU than SRRIP-FP
- We do not need to precisely detect frequently referenced blocks
- We need to preserve blocks that receive hits
29. Common Access Patterns in Workloads
- Stack Access Pattern: (a1, a2, ..., ak, ..., a2, a1)^A
  - Solution: for any k, LRU performs well for such access patterns
- Streaming Access Pattern: (a1, a2, ..., ak) for k >> assoc
  - No solution: cache replacement cannot solve this problem
- Thrashing Access Pattern: (a1, a2, ..., ak)^A, for k > assoc
  - LRU receives no cache hits due to cache thrashing
  - Solution: preserve some fraction of the working set in the cache (e.g., use BIP)
  - BIP does NOT update replacement state for the majority of cache insertions
- Mixed Access Pattern: (a1, a2, ..., ak, ..., a2, a1)^A (b1, b2, ..., bm)^N, m > assoc − k
  - LRU always misses on the frequently referenced (a1, a2, ..., ak, ..., a2, a1)^A
  - (b1, b2, ..., bm) is commonly referred to as a scan in the literature
  - In the absence of scans, LRU performs well for such access patterns
  - Solution: preserve the frequently referenced working set in the cache (e.g., use LFU)
  - LFU replaces infrequently referenced blocks in the presence of frequently referenced blocks
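The pattern notation above can be made concrete as reference streams; the function names and parameter choices below are mine, purely for illustration.

```python
# Turn the slide's access-pattern notation into concrete reference streams.

def stack(k):
    """Stack pattern (a1, a2, ..., ak, ..., a2, a1): up then back down."""
    ids = [f'a{i}' for i in range(1, k + 1)]
    return ids + ids[-2::-1]

def mixed(k, A, m):
    """Mixed pattern: A repetitions of the stack, then one scan (b1..bm)."""
    return stack(k) * A + [f'b{i}' for i in range(1, m + 1)]

trace = mixed(k=3, A=2, m=4)
# Two passes over the a-blocks followed by a 4-block scan; with m > assoc - k,
# LRU lets the scan push the a-blocks out, while a scan-resistant policy
# (SRRIP) keeps them resident.
```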
30. Performance of Hybrid Replacement Policies at LLC
[Figure: PC Games/Multimedia, SPEC CPU2006, Server, and Average]
4-way OoO processor, 32KB L1, 256KB L2, 2MB LLC
- DIP addresses SPEC workloads but NOT PC games and multimedia workloads
- Real-world workloads prefer scan-resistance over thrash-resistance
31. Understanding LRU Enhancements in the Prediction Framework
- Recent policies, e.g., DIP, say "insert new blocks at the LRU position"
  - What does it mean to insert an MRU line in the LRU position?
  - It is a prediction that the new block will be re-referenced later than the existing blocks in the cache
  - What DIP really means is "insert new blocks at the RRP Tail"
- Other policies, e.g., PIPP, say "insert new blocks in the middle of the LRU chain"
  - It is a prediction that the new block will be re-referenced at an intermediate time
The Re-Reference Prediction Framework Helps Describe the Intuitions Behind Existing Replacement Policy Enhancements