Adaptive Insertion Policies for Managing Shared Caches (PPT Transcript)


1
Adaptive Insertion Policies for Managing Shared
Caches
  • Aamer Jaleel, William Hasenplaugh, Moinuddin
    Qureshi,
  • Julien Sebot, Simon C. Steely, Joel Emer

International Conference on Parallel
Architectures and Compilation Techniques (PACT)
2
Paper Motivation
  • Shared caches are common, and more so with the
    increasing number of cores
  • More concurrent applications → more contention
    for the shared cache
  • High performance → manage the shared cache
    efficiently

3
Addressing Shared Cache Performance
  • Conventional LRU policy allocates resources based
    on rate of demand
  • Applications that do not benefit from cache cause
    destructive cache interference
  • Cache Partitioning: Reserves cache resources
    based on application benefit rather than rate of
    demand
  • Requires HW for detecting benefit
  • Significant changes to cache structure
  • Not scalable to large number of applications

4
Paper Contributions
  • Problem: For shared caches, conventional LRU
    policy allocates cache resources based on
    rate-of-demand rather than benefit
  • Goals: Design a hardware mechanism that
  • 1. Provides High Performance by Allocating
    Cache on a Benefit-basis
  • 2. Is Robust Across Different Concurrently
    Executing Applications
  • 3. Scales to Large Number of Competing
    Applications
  • 4. Requires Low Design Overhead
  • Solution: Thread-Aware Dynamic Insertion Policy
    (TADIP) that improves average throughput by
    12-18% for 2, 4, 8, and 16-core systems with less
    than two bytes of storage per HW-thread

5
Review: Insertion Policies
  • Adaptive Insertion Policies for High-Performance
    Caching
  • Moinuddin Qureshi, Aamer Jaleel, Yale Patt, Simon
    Steely Jr., Joel Emer
  • Appeared in ISCA '07

6
Cache Replacement 101 (ISCA '07)
  • Two components of cache replacement
  • Victim Selection
  • Which line to replace for incoming line? (E.g.
    LRU, Random etc)
  • Insertion Policy
  • With what priority is the new line placed in the
    replacement list? (E.g. insert new line into MRU
    position)

Simple changes to the insertion policy can minimize
cache thrashing and improve cache performance
for memory-intensive workloads
7
Static Insertion Policies (ISCA '07)
  • Conventional (MRU Insertion) Policy
  • Choose victim, promote to MRU
  • LRU Insertion Policy (LIP)
  • Choose victim, DO NOT promote to MRU
  • Unless reused, lines stay at LRU position
  • Bimodal Insertion Policy (BIP)
  • LIP does not age older lines
  • Infrequently insert some misses at MRU
  • Bimodal Throttle β
  • We used β = 1/32

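The three static insertion policies above can be sketched as a small Python model (illustrative only; the `CacheSet` class, its recency-stack representation, and the ε = 1/32 default are assumptions made for this sketch, not the paper's hardware):

```python
import random

class CacheSet:
    """One cache set modeled as a recency stack:
    index 0 = MRU position, last index = LRU position."""
    def __init__(self, ways):
        self.ways = ways
        self.stack = []  # tags ordered MRU -> LRU

    def access(self, tag, policy="mru", epsilon=1/32):
        """Return True on a hit. On a miss, the victim is always the
        LRU line; `policy` only picks the INSERTION position:
        'mru' (conventional), 'lip', or 'bip'."""
        if tag in self.stack:              # hit: promote to MRU
            self.stack.remove(tag)
            self.stack.insert(0, tag)
            return True
        if len(self.stack) == self.ways:   # miss in a full set:
            self.stack.pop()               # evict the LRU line
        if policy == "mru" or (policy == "bip" and random.random() < epsilon):
            self.stack.insert(0, tag)      # insert at MRU
        else:                              # 'lip' (and most 'bip' fills)
            self.stack.append(tag)         # insert at LRU
        return False
```

Under a cyclic working set larger than the set (e.g. 5 tags in 4 ways), MRU insertion thrashes and gets no hits, while LIP retains most of the working set after the first pass.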
8
Dynamic Insertion Policy using Set-Dueling (ISCA '07)
  • Set Dueling Monitors (SDMs): Dedicated sets to
    estimate the performance of a pre-defined policy
  • Divide the cache into three:
  • SDM-LRU: Dedicated LRU sets
  • SDM-BIP: Dedicated BIP sets
  • Follower sets
  • PSEL: n-bit saturating counter
  • misses to SDM-LRU: PSEL++
  • misses to SDM-BIP: PSEL--
  • Follower sets' insertion policy:
  • Use LRU if MSB(PSEL) == 0
  • Use BIP if MSB(PSEL) == 1

  • Parameters based on analytical and empirical studies:
  • 32 sets per SDM
  • 10-bit PSEL counter

HW Required: 10 bits + Combinational Logic
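The set-dueling mechanism above can be sketched in Python (a behavioral model, not the hardware; the `DIPSelector` name, the every-64th-set SDM mapping, and the counter initialization are illustrative assumptions):

```python
class DIPSelector:
    """Set-dueling policy selector (sketch): a few sets are
    dedicated to LRU and BIP, the rest follow the PSEL winner."""
    def __init__(self, psel_bits=10):
        self.bits = psel_bits
        self.psel = (1 << (psel_bits - 1)) - 1  # near midpoint, MSB = 0
        self.psel_max = (1 << psel_bits) - 1

    def set_type(self, set_index):
        # Static carving (assumed): one LRU-SDM set and one BIP-SDM
        # set out of every 64; all the rest are follower sets.
        slot = set_index % 64
        if slot == 0:
            return "sdm_lru"
        if slot == 1:
            return "sdm_bip"
        return "follower"

    def on_miss(self, set_index):
        kind = self.set_type(set_index)
        if kind == "sdm_lru":      # the LRU policy just missed
            self.psel = min(self.psel + 1, self.psel_max)
        elif kind == "sdm_bip":    # the BIP policy just missed
            self.psel = max(self.psel - 1, 0)

    def follower_policy(self):
        # MSB of PSEL picks the winner: 0 -> LRU, 1 -> BIP
        return "bip" if (self.psel >> (self.bits - 1)) & 1 else "lru"
```

Whichever dedicated policy accumulates fewer misses wins: sustained misses in the LRU SDM push PSEL's MSB to 1 and switch the follower sets to BIP, and vice versa.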
9
Extending DIP to Shared Caches
  • DIP uses a single policy (LRU or BIP) for all
    applications competing for the cache
  • DIP cannot distinguish between apps that benefit
    from cache and those that do not
  • Example: soplex + h264ref w/ 2MB cache
  • DIP learns LRU for both apps
  • soplex causes destructive interference
  • Desirable that only h264ref follow LRU and soplex
    follow BIP

[Figure: Misses Per 1000 Instr (under LRU) for soplex and h264ref]
Need a Thread-Aware Dynamic Insertion Policy
(TADIP)
10
Thread Aware Dynamic Insertion Policy (TADIP)
  • Assume N concurrent applications; what is the best
    insertion policy for each? (LRU=0, BIP=1)
  • Insertion policy decision can be thought of as an
    N-bit binary string:
  • < P0, P1, P2 ... PN >
  • If Px = 1, then application x uses BIP, else it
    uses LRU
  • e.g. 0000 → always use conventional LRU, 1111 →
    always use BIP
  • With an N-bit string, 2^N possible combinations. How
    to find the best one?
  • Offline Profiling: Input set/system dependent,
    impractical with large N
  • Brute-Force Search using SDMs: Infeasible with
    large N

Need a PRACTICAL and SCALABLE Implementation of
TADIP
11
Using Set-Dueling As a Practical Approach to TADIP
  • Unnecessary to exhaustively search all 2^N
    combinations
  • Some bits of the best binary insertion string can
    be learned independently
  • Example Always use BIP for applications that do
    not benefit from cache
  • Exponential Search Space ? Linear Search Space
  • Learn best policy (BIP or LRU) for each app in
    presence of all other apps

Use Per-Application SDMs To Determine: In the
presence of other apps, should a given app use
BIP or LRU?
12
TADIP Using Set-Dueling Monitors (SDMs)
  • Assume a cache shared by 4 applications: APP0,
    APP1, APP2, APP3

< P0, P1, P2, P3 >
< 0, P1, P2, P3 >
In the presence of other apps, should APP0 use LRU
or BIP?
< 1, P1, P2, P3 >
< P0, 0, P2, P3 >
< P0, 1, P2, P3 >
< P0, P1, 0, P3 >
< P0, P1, 1, P3 >
< P0, P1, P2, 0 >
< P0, P1, P2, 1 >
Follower Sets
  • Pc = MSB( PSELc )

High-Level View of Cache
Set-Level View of Cache
13
TADIP Using Set-Dueling Monitors (SDMs)
  • Assume a cache shared by 4 applications: APP0,
    APP1, APP2, APP3
  • LRU SDMs for each APP
  • BIP SDMs for each APP
  • Follower sets
  • Per-APP PSEL saturating counters
  • misses to LRU SDM: PSEL++
  • misses to BIP SDM: PSEL--
  • Follower sets' insertion policy
  • SDMs of one thread are follower sets of another
    thread
  • Let Px = MSB( PSELx )
  • Fill Decision: < P0, P1, P2, P3 >

< P0, P1, P2, P3 >
< 0, P1, P2, P3 >
< 1, P1, P2, P3 >
< P0, 0, P2, P3 >
< P0, 1, P2, P3 >
< P0, P1, 0, P3 >
< P0, P1, 1, P3 >
< P0, P1, P2, 0 >
< P0, P1, P2, 1 >
Follower Sets
  • 32 sets per SDM
  • 10-bit PSEL
  • Pc = MSB( PSELc )

HW Required: (10 × T) bits + Combinational Logic
(T = number of HW-threads)
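The thread-aware SDM structure above can be sketched as a behavioral Python model (the `TADIP` class, the static SDM carving in `sdm_owner`, and the counter initialization are illustrative assumptions; the key property it mirrors is that each thread's SDM sets force that thread's policy while every other thread behaves as a follower):

```python
class TADIP:
    """Thread-aware set dueling (sketch): one PSEL per thread.
    Each thread owns one LRU-SDM and one BIP-SDM set group where
    ITS OWN policy is forced; everywhere else it follows MSB(PSEL)."""
    def __init__(self, num_threads, psel_bits=10):
        self.n = num_threads
        self.bits = psel_bits
        self.psel_max = (1 << psel_bits) - 1
        # start near midpoint with MSB = 0 (follower default = LRU)
        self.psel = [(1 << (psel_bits - 1)) - 1] * num_threads

    def sdm_owner(self, set_index):
        """Return (thread, forced_policy) if this set is one of the
        SDMs, else None. Assumed static carving: the first 2*n slots
        of every 64 sets are SDMs, alternating LRU/BIP per thread."""
        slot = set_index % 64
        if slot < 2 * self.n:
            return slot // 2, ("lru" if slot % 2 == 0 else "bip")
        return None

    def policy_for_fill(self, set_index, thread):
        owner = self.sdm_owner(set_index)
        if owner and owner[0] == thread:
            return owner[1]          # policy is forced in own SDM
        # follower behavior: MSB of this thread's own PSEL counter
        return "bip" if (self.psel[thread] >> (self.bits - 1)) & 1 else "lru"

    def on_miss(self, set_index, thread):
        owner = self.sdm_owner(set_index)
        if owner and owner[0] == thread:   # miss in this thread's SDM
            if owner[1] == "lru":
                self.psel[thread] = min(self.psel[thread] + 1, self.psel_max)
            else:
                self.psel[thread] = max(self.psel[thread] - 1, 0)
```

Because the SDMs of one thread are follower sets for all the others, each thread learns LRU vs. BIP in the presence of the other threads' current policies, turning the 2^N search into N concurrent linear searches.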
14
Experimental Setup
  • Simulator and Benchmarks
  • CMP$im: A Pin-based Multi-Core Performance
    Simulator
  • 17 representative SPEC CPU2006 benchmarks
  • Baseline Study
  • 4-core CMP with in-order cores
  • Three-level Non-Inclusive Cache Hierarchy 32KB
    L1, 256KB L2, 4MB L3
  • 15 workload mixes of four different SPEC CPU2006
    benchmarks
  • Scalability Study
  • 2-core, 4-core, 8-core, 16-core systems
  • 50 workload mixes of 2, 4, 8, 16 different SPEC
    CPU2006 benchmarks

15
soplex + h264ref Sharing a 2MB Cache
[Figure: APKI (accesses per 1000 instr) and MPKI (misses per 1000
instr) for SOPLEX and H264REF]
TADIP Improves Throughput by 27% over LRU and DIP
16
TADIP Results Throughput
No Gains from DIP
TADIP Provides More Than Two Times the Performance
Benefit of DIP. TADIP Improves Performance over LRU by 18%
17
TADIP Compared to Offline Best Static Policy
Static Best is almost always better, since it is
optimized for best IPC while TADIP is optimized for
fewer misses. Having TADIP optimize for other
metrics, such as IPC, could reduce the gap.
TADIP Sometimes Better Due to Phase Adaptation
TADIP Is Within 85% of the Best Offline-Determined
Insertion Policy
18
TADIP Vs. Utility Based Cache Partitioning (UCP)
TADIP Out-Performs UCP Without Requiring Any Cache
Partitioning Hardware
Unlike Cache Partitioning Schemes, TADIP Does NOT
Reserve Cache Space Instead TADIP Does Efficient
CACHE MANAGEMENT by Changing Insertion Policy
19
TADIP Results Sensitivity to Cache Size
TADIP Provides Performance Equivalent to Doubling
Cache Size
20
TADIP Results Scalability
Throughput Normalized to Baseline System
TADIP Scales to Large Number of Concurrently
Executing Applications
21
Summary
  • The Problem: LRU causes cache thrashing when
    workloads with differing working sets compete for
    a shared cache
  • Solution: Thread-Aware Dynamic Insertion Policy
    (TADIP)
  • 1. Provides High Performance by Minimizing
    Thrashing
  • - Up to 94%, 64%, 26% and 16% performance gains
    on 2, 4, 8, and 16-core CMPs
  • 2. Is Robust Across Different Workload Mixes
  • - Does not significantly hurt performance
    when LRU works well
  • 3. Scales to Large Number of Competing
    Applications
  • - Evaluated 16-cores in our study
  • 4. Requires Low Design Overhead
  • - Less than 2 bytes of HW required per
    hardware-thread in the system

22
Q&A
23
TADIP Results Weighted Speedup
24
TADIP Results Fairness Metric
25
TADIP In Presence of Prefetching on 4-core CMP
26
Cache Occupancy (16-Cores)
  • Changing fill policy directly controls the amount
    of cache resources provided to an application
  • The figure shows only the fill policy for
    xalancbmk and sphinx3
  • 28% performance improvement

27
TADIP Using Set-Dueling Monitors (SDMs)
  • Assume a cache shared by 2 applications APP0
    and APP1

< P0 , P1 >
< 0 , P1 >
In the presence of other apps, should APP0 use LRU
or BIP? (PSEL0)
< 1 , P1 >
< P0 , 0 >
In the presence of other apps, should APP1 use LRU
or BIP? (PSEL1)
< P0 , 1 >
Follower Sets
  • 32 sets per SDM
  • 9-bit PSEL
  • Pc = MSB( PSELc )

High-Level View of Cache
Set-Level View of Cache
28
TADIP Using Set-Dueling Monitors (SDMs)
  • Assume a cache shared by 2 applications APP0
    and APP1
  • LRU SDMs for each APP
  • BIP SDMs for each APP
  • Follower sets
  • PSEL0, PSEL1: per-APP PSEL counters
  • misses to LRU SDM: PSEL++
  • misses to BIP SDM: PSEL--
  • Follower sets' insertion policy
  • SDMs of one thread are follower sets of another
    thread
  • Let Px = MSB( PSELx )
  • Fill Decision: < P0, P1 >

< P0 , P1 >
< 0 , P1 >
PSEL0
< 1 , P1 >
< P0 , 0 >
PSEL1
< P0 , 1 >
Follower Sets
  • 32 sets per SDM
  • 9-bit PSEL cntr
  • Pc = MSB( PSELc )

HW Required: (9 × T) bits + Combinational Logic