1
High Performance Cache Replacement
Using Re-Reference Interval Prediction (RRIP)
  • Aamer Jaleel, Kevin Theobald,
  • Simon Steely Jr., Joel Emer
  • Intel Corporation, VSSAD

International Symposium on Computer Architecture
( ISCA 2010 )
2
Motivation
  • Factors making caching important
  • Increasing ratio of CPU speed to memory speed
  • Multi-core processors pose new challenges for
    shared cache management
  • LRU has been the standard replacement policy at
    the LLC
  • However, LRU has problems!

3
Problems with LRU Replacement
Working set larger than the cache (Wsize > LLCsize)
causes thrashing: every reference misses
References to non-temporal data (scans) discard the
frequently referenced working set: blocks that were
hitting begin to miss after each scan
Our studies show that scans occur frequently in
many commercial workloads
4
Desired Behavior from Cache Replacement
Working set larger than the cache → preserve some
of the working set in the cache (some references
hit instead of all missing)
Recurring scans → preserve the frequently referenced
working set in the cache
5
Prior Solutions to Enhance Cache Replacement
Working set larger than the cache → preserve some
of the working set in the cache
  • Dynamic Insertion Policy (DIP) →
    thrash-resistance with minimal changes to HW

Recurring scans → preserve the frequently referenced
working set in the cache
  • Least Frequently Used (LFU) → addresses scans, but
    LFU adds complexity and also performs badly for
    recency-friendly workloads

GOAL: Design a high-performing, scan-resistant
policy that requires minimal changes to HW
6
Belady's Optimal (OPT) Replacement Policy
  • Replacement decisions use perfect knowledge of the
    future reference order
  • Victim Selection Policy
  • Replace the block that will be re-referenced
    furthest in the future

[Figure: cache with blocks a–h; OPT evicts the block
re-referenced furthest in the future]
7
Practical Cache Replacement Policies
  • Replacement decisions made by predicting the
    future reference order
  • Victim Selection Policy
  • Replace the block predicted to be re-referenced
    furthest in the future
  • Continually update predictions of the future
    reference order
  • Natural update opportunities are on cache fills
    and cache hits

[Figure: same cache; the victim is the block predicted
to be re-referenced furthest in the future]
8
LRU Replacement in the Prediction Framework
  • The LRU chain maintains the re-reference
    prediction
  • Head of chain (i.e., MRU position) predicted to be
    re-referenced soon
  • Tail of chain (i.e., LRU position) predicted to be
    re-referenced far in the future
  • LRU predicts that blocks are re-referenced in
    reverse order of reference
  • Rename the LRU chain to the Re-Reference
    Prediction (RRP) Chain
  • Rename MRU position → RRP Head and LRU
    position → RRP Tail

[Figure: RRP chain h, g, f, e, d, c, b, a from
RRP Head to RRP Tail]
9
Practicality of Chain Based Replacement
  • Problem: chain-based replacement is too
    expensive!
  • log2(associativity) bits required per cache block
    (16-way requires 4 bits/block)
  • Solution: LRU chain positions can be quantized
    into different buckets
  • Each bucket corresponds to a predicted
    Re-Reference Interval
  • The value of a bucket is called the Re-Reference
    Prediction Value (RRPV)
  • Hardware cost: n bits per block; ideally you
    would like n < log2(associativity)
    (see the sketch below)

[Figure: RRP chain quantized into RRPV buckets, n = 2]
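For concreteness, the per-block state can be held in a small bitfield. This sketch (mine, not from the slides) contrasts the 4-bit exact stack position chain-based LRU needs for a 16-way cache with a 2-bit RRPV:

```c
#include <stdint.h>

/* Per-block replacement state for a 16-way set (illustrative sketch).
 * Chain-based LRU needs log2(16) = 4 bits per block to encode the
 * exact stack position; quantized RRIP needs only n bits (n = 2 here). */
struct lru_block  { uint64_t tag; unsigned stack_pos : 4; };  /* 0..15 */
struct rrip_block { uint64_t tag; unsigned rrpv      : 2; };  /* 0..3  */
```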

10
Representation of Quantized Replacement (n = 2)
[Figure: the four RRPV buckets for n = 2, from 0
(predicted to be re-referenced soonest) to 3
(predicted to be re-referenced in the distant future)]
11
Emulating LRU with Quantized Buckets (n = 2)
  • Victim Selection Policy: evict a block with
    distant RRPV (i.e., 2^n − 1 = 3)
  • If no distant RRPV (i.e., 3) is found, increment
    all RRPVs and repeat the search
  • If multiple are found, we need a tie breaker; let us
    always start the search from physical way 0
  • Insertion Policy: insert new block with RRPV = 0
  • Update Policy: cache hits update the block's
    RRPV = 0 (see the sketch below)

But We Want to do BETTER than LRU!!!
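The mechanics above fit in a few lines of C. This is a minimal software sketch (mine, not from the slides) of the quantized victim search plus the LRU-emulating insertion and update rules:

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 16
#define N_BITS 2
#define RRPV_MAX ((1u << N_BITS) - 1)   /* "distant" = 3 for n = 2 */

typedef struct { uint64_t tag; uint8_t rrpv; bool valid; } Block;

/* Victim selection: find a block with distant RRPV (== RRPV_MAX),
 * searching from physical way 0 as the tie breaker. If none exists,
 * age the whole set by incrementing every RRPV, then search again.
 * (Invalid ways are taken first; the slides do not discuss them.) */
static int select_victim(Block set[WAYS]) {
    for (;;) {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid || set[w].rrpv == RRPV_MAX)
                return w;
        for (int w = 0; w < WAYS; w++)
            set[w].rrpv++;   /* terminates: RRPVs reach RRPV_MAX */
    }
}

/* LRU emulation: both insertion and hits set RRPV = 0. */
static void on_fill(Block *b, uint64_t tag) { b->tag = tag; b->valid = true; b->rrpv = 0; }
static void on_hit (Block *b)               { b->rrpv = 0; }
```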
12
Re-Reference Interval Prediction (RRIP)
  • The framework enables re-reference predictions to be
    tuned at insertion/update
  • Unlike LRU, can use a non-zero RRPV on insertion
  • Unlike LRU, can use a non-zero RRPV on cache
    hits
  • Static Re-Reference Interval Prediction (SRRIP)
  • Determine the best insertion/update prediction using
    profiling and apply it to all apps
  • Dynamic Re-Reference Interval Prediction (DRRIP)
  • Dynamically determine the best re-reference
    prediction at insertion
13
Static RRIP Insertion Policy: Learn the Block's
Re-reference Interval
  • Key idea: do not give new blocks too much (or
    too little) time in the cache
  • Predict that a new cache block will not be
    re-referenced soon
  • Insert the new block with some RRPV other than 0
  • Similar to inserting in the middle of the RRP
    chain
  • However it is NOT identical to a fixed insertion
    position on the RRP chain (see paper)
14
Static RRIP Update Policy on Cache Hits
  • Hit Priority (HP)
  • Like LRU, always update RRPV = 0 on cache hits
  • Intuition: predicts that blocks receiving hits
    after insertion will be re-referenced soon
    (a combined sketch follows below)

An Alternative Update Scheme Is Also Described in the
Paper
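Putting the SRRIP insertion rule and the Hit Priority update together, and reusing the Block type, RRPV_MAX, and select_victim from the earlier sketch, a minimal SRRIP-HP looks like this (mine, not from the slides):

```c
/* SRRIP-Hit Priority. Only the insertion value changes versus the
 * LRU emulation above: new blocks enter with a "long" prediction
 * (RRPV_MAX - 1 = 2 for n = 2), so a scan must age them to distant
 * before it can displace working-set blocks that sit at RRPV = 0.
 * Hits still promote the block all the way to RRPV = 0. */
static void srrip_on_fill(Block *b, uint64_t tag) {
    b->tag = tag; b->valid = true;
    b->rrpv = RRPV_MAX - 1;   /* 2^n - 2: best insertion per slide 16 */
}
static void srrip_on_hit(Block *b) { b->rrpv = 0; }
```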
15
SRRIP Hit Priority: Sensitivity to Cache Insertion
Prediction at LLC
[Chart: performance vs. RRPV at insertion for n = 1,
averaged across PC games, multimedia, server, and
SPEC CPU2006 workloads on a 16-way 2MB LLC]
n = 1 is in fact the NRU replacement policy
commonly used in commercial processors
16
SRRIP Hit Priority: Sensitivity to Cache Insertion
Prediction at LLC
[Chart: performance vs. RRPV at insertion for
n = 1 through n = 5, averaged across PC games,
multimedia, server, and SPEC CPU2006 workloads
on a 16-way 2MB LLC]
Regardless of n, Static RRIP performs best when
RRPV_insertion is 2^n − 2
Regardless of n, Static RRIP performs worst when
RRPV_insertion is 2^n − 1
17
Why Does RRPV_insertion of 2^n − 2 Work Best for
SRRIP?
[Figure: working set of size Wsize receiving hits,
interrupted by a scan of length Slen]
  • Before the scan, the re-reference prediction of the
    active working set is 0
  • Recall, NRU (n = 1) is not scan-resistant
  • For scan resistance, RRPV_insertion MUST be
    different from the RRPV of working-set blocks
  • A larger insertion RRPV tolerates larger scans
  • The maximum insertion prediction (i.e., 2^n − 2)
    works best!
  • In general, re-references after a scan hit IF
    Slen < (RRPV_insertion − Starting-RRPV_workingset)
    × (LLCsize − Wsize)
  • SRRIP is scan-resistant for Slen <
    RRPV_insertion × (LLCsize − Wsize)

For n > 1, Static RRIP is Scan Resistant! What
about Thrash Resistance? (a worked numeric instance
follows below)
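As an illustrative numeric instance (numbers mine, not from the slide): take n = 2 so that RRPV_insertion = 2^n − 2 = 2, a 16-block cache, and an 8-block working set sitting at RRPV = 0. The condition above becomes

$$ S_{len} < \mathrm{RRPV}_{insertion} \times (\mathrm{LLC}_{size} - W_{size}) = 2 \times (16 - 8) = 16, $$

so SRRIP keeps the working set resident across scans of up to 15 blocks between working-set re-references.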
18
DRRIP: Extending Scan-Resistant SRRIP to Be
Thrash-Resistant
  • Always using the same prediction for all insertions
    will thrash the cache
  • Like DIP, need to preserve some fraction of the
    working set in the cache
  • Extend DIP to SRRIP to provide thrash resistance
  • Dynamic Re-Reference Interval Prediction
  • Dynamically select between inserting blocks with
    2^n − 1 and 2^n − 2 using Set Dueling
  • Inserting blocks with 2^n − 1 is the same as no
    update on insertion (a set-dueling sketch follows
    below)

DRRIP Provides Both Scan-Resistance and
Thrash-Resistance
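A simplified sketch of the set-dueling mechanics, in the spirit of the paper's DRRIP (which duels SRRIP insertion against a bimodal variant). The leader-set mapping, the 10-bit PSEL width, and the 1/32 bimodal probability are illustrative choices of mine, not values from the slides:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define RRPV_MAX 3                      /* n = 2, as in the earlier sketch */
#define PSEL_BITS 10
#define PSEL_MAX ((1 << PSEL_BITS) - 1)
static unsigned psel = PSEL_MAX / 2;    /* saturating policy selector */

/* Hypothetical leader-set mapping: a few sets always use SRRIP
 * insertion, a few always use bimodal; all others follow the winner. */
static bool is_srrip_leader(unsigned set)   { return (set % 64) == 0; }
static bool is_bimodal_leader(unsigned set) { return (set % 64) == 1; }

/* Choose the RRPV for a newly inserted block. */
static uint8_t drrip_insert_rrpv(unsigned set) {
    bool use_srrip;
    if (is_srrip_leader(set))        use_srrip = true;
    else if (is_bimodal_leader(set)) use_srrip = false;
    else                             use_srrip = (psel < PSEL_MAX / 2);
    if (use_srrip)
        return RRPV_MAX - 1;             /* SRRIP: always 2^n - 2 */
    /* Bimodal: insert at "distant" (2^n - 1) most of the time, and
     * only rarely at 2^n - 2, preserving a fraction of a thrashing
     * working set in the cache. */
    return (rand() % 32 == 0) ? RRPV_MAX - 1 : RRPV_MAX;
}

/* On a miss in a leader set, steer PSEL away from that leader's policy. */
static void drrip_on_leader_miss(unsigned set) {
    if (is_srrip_leader(set) && psel < PSEL_MAX)   psel++;
    else if (is_bimodal_leader(set) && psel > 0)   psel--;
}
```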
19
Performance Comparison of Replacement Policies
[Chart: performance of replacement policies on a
16-way 2MB LLC]
Static RRIP always outperforms LRU replacement;
Dynamic RRIP further improves performance over
Static RRIP
20
Cache Replacement Competition (CRC) Results
[Chart: DRRIP ranking, averaged across PC games,
multimedia, enterprise server, and SPEC CPU2006
workloads]
  • 16-way 1MB private cache,
    65 single-threaded workloads
  • 16-way 4MB shared cache,
    165 4-core workloads

Un-tuned DRRIP would be ranked 2nd and is within
1% of the CRC winner; unlike the CRC winner, DRRIP
does not require any changes to the cache structure
21
Total Storage Overhead (16-way Set Associative
Cache)
  • LRU: 4 bits / cache block
  • NRU: 1 bit / cache block
  • DRRIP-3: 3 bits / cache block
  • CRC Winner: 8 bits / cache block

DRRIP outperforms LRU with less storage than LRU;
NRU can be easily extended to realize DRRIP!
(see the worked total below)
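As a worked total (assuming 64-byte cache blocks, an assumption not stated on the slide), the 2MB LLC used throughout holds 2MB / 64B = 32,768 blocks, so the overheads above come to

$$ \text{DRRIP-3: } 32768 \times 3\ \text{bits} = 12\ \text{KB}, \qquad \text{LRU: } 32768 \times 4\ \text{bits} = 16\ \text{KB}. $$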
22
Summary
  • Scan-resistance is an important problem in
    commercial workloads
  • State-of-the-art policies do not address
    scan-resistance
  • We propose a simple and practical replacement
    policy
  • Static RRIP (SRRIP) for scan-resistance
  • Dynamic RRIP (DRRIP) for thrash-resistance and
    scan-resistance
  • DRRIP requires ONLY 3 bits per block
  • In fact, it incurs less storage than LRU
  • Un-tuned DRRIP would take 2nd place in the CRC
    Championship
  • DRRIP requires significantly less storage than the
    CRC winner

23
Q&A
24
Q&A
25
Q&A
26
Static RRIP with n = 1
  • Static RRIP with n = 1 is the commonly used NRU
    policy (polarity inverted)
  • Victim Selection Policy: evict a block with
    RRPV = 1
  • Insertion Policy: insert new block with RRPV = 0
  • Update Policy: cache hits update the block's
    RRPV = 0

But NRU Is Not Scan-Resistant!
27
SRRIP Update Policy on Cache Hits
  • Frequency Priority (FP)
  • Improve the re-reference prediction to be shorter
    than before on hits (i.e., RRPV−−)
  • Intuition: like LFU, predicts that frequently
    referenced blocks should have higher priority to
    stay in the cache (see the sketch below)
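For contrast with Hit Priority, a minimal Frequency Priority update, continuing the same Block sketch from earlier (mine, not from the slides):

```c
/* SRRIP-Frequency Priority: a hit makes the prediction one step
 * shorter rather than jumping straight to 0 (saturating at 0), so a
 * block must accumulate hits to reach the "near" bucket. */
static void srrip_fp_on_hit(Block *b) {
    if (b->rrpv > 0)
        b->rrpv--;
}
```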
28
SRRIP-HP and SRRIP-FP Cache Performance
[Chart: SRRIP-FP vs. SRRIP-HP cache performance]
  • SRRIP-HP has 2X better cache performance relative
    to LRU than SRRIP-FP
  • We do not need to precisely detect frequently
    referenced blocks
  • We need to preserve blocks that receive hits
29
Common Access Patterns in Workloads
  • Stack Access Pattern: (a1, a2, …, ak, …, a2, a1)^A
  • Solution: for any k, LRU performs well for such
    access patterns
  • Streaming Access Pattern: (a1, a2, …, ak) for
    k ≫ assoc
  • No solution: cache replacement cannot solve this
    problem
  • Thrashing Access Pattern: (a1, a2, …, ak)^A, for
    k > assoc
  • LRU receives no cache hits due to cache thrashing
  • Solution: preserve some fraction of the working set
    in the cache (e.g., use BIP)
  • BIP does NOT update replacement state for the
    majority of cache insertions
  • Mixed Access Pattern: (a1, a2, …, ak, …, a2, a1)^A
    (b1, b2, …, bm)^N, m > assoc − k
  • LRU always misses on the frequently referenced
    (a1, a2, …, ak, …, a2, a1)^A
  • (b1, b2, …, bm) is commonly referred to as a scan
    in the literature
  • In the absence of scans, LRU performs well for such
    access patterns
  • Solution: preserve the frequently referenced working
    set in the cache (e.g., use LFU)
  • LFU replaces infrequently referenced blocks in
    the presence of frequently referenced blocks

30
Performance of Hybrid Replacement Policies at LLC
[Chart: PC games/multimedia, SPEC CPU2006, server,
and average; 4-way OoO processor, 32KB L1, 256KB L2,
2MB LLC]
  • DIP addresses SPEC workloads but NOT PC games /
    multimedia workloads
  • Real-world workloads prefer scan-resistance
    over thrash-resistance

31
Understanding LRU Enhancements in the Prediction
Framework
  • Recent policies, e.g., DIP, say "Insert new
    blocks at the LRU position"
  • What does it mean to insert an MRU line in the
    LRU position?
  • It is a prediction that the new block will be
    re-referenced later than the existing blocks in
    the cache
  • What DIP really means is "Insert new blocks at
    the RRP Tail"
  • Other policies, e.g., PIPP, say "Insert new
    blocks in the middle of the LRU chain"
  • A prediction that the new block will be
    re-referenced at an intermediate time

The Re-Reference Prediction Framework Helps
Describe the Intuitions Behind Existing
Replacement Policy Enhancements
32
Performance Comparison of Replacement Policies
[Chart: performance of replacement policies on a
16-way 2MB LLC]
Static RRIP always outperforms LRU replacement;
Dynamic RRIP further improves performance over
Static RRIP