PEEP: Exploiting Predictability of Memory Dependences in SMT Processors

Transcript and Presenter's Notes
1
PEEP: Exploiting Predictability of Memory Dependences in SMT Processors
  • Samantika Subramaniam, Milos Prvulovic, Gabriel
    H. Loh

2
Simplified view of SMT execution
[Diagram: Front-end with Icache feeding Reservation Stations and Execution Units]
  • Store per-thread state
  • Enough work from all the threads put together → high throughput
3
Something bad happens
[Diagram: a producer instruction stalls, clogging the path from the Front-end/Icache through the Reservation Stations to the Execution Units]
  • A low-ILP thread eventually uses up the CPU resources
  • Other independent, high-ILP threads are forced to stall
  • This defeats the purpose of SMT
  • Tackle the problem at the source: the FETCH UNIT
4
Previously proposed solution
  • ICOUNT (Instruction Count), Tullsen et al., ISCA 1996
  • Count the number of instructions in the pipeline per thread
  • Fetch policy: lower priority to the thread with more instructions
[Diagram: the Reservation Stations clog between the Front-end/Icache and the Execution Units. OOPS!]
REACTIVE EXCLUSION!
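The ICOUNT rule above fits in a few lines. This is an illustrative Python sketch, not the paper's implementation; the function name and the dict-based per-thread counts are assumptions for clarity.

```python
# Sketch of the ICOUNT fetch policy (Tullsen et al., ISCA 1996): each cycle,
# fetch from the thread with the fewest instructions in the pipeline.

def icount_pick(in_flight, eligible):
    """Return the eligible thread id with the fewest in-flight instructions."""
    return min(eligible, key=lambda t: in_flight[t])

# Thread 1 has only 3 instructions in flight, so it gets fetch priority.
in_flight = {0: 12, 1: 3, 2: 30, 3: 7}
print(icount_pick(in_flight, eligible={0, 1, 2, 3}))  # 1
```

Because the count only shrinks once a stalled thread's instructions finally drain, ICOUNT reacts to a clog after the fact, which is exactly the "reactive exclusion" limitation the talk targets.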
5
So can we do better?
[Diagram: an oracle guides the Front-end/Icache feeding the Reservation Stations and Execution Units]
PROACTIVE EXCLUSION!
6
Proactive Exclusion Strategies (PE)
  • Load misses, Moursy et al., ISCA 2003
  • predicted load miss → GATE
  • MLP, Eyerman et al., HPCA 2007
  • all available MLP exposed → GATE
  • Memory Dependences

7
A Brief Overview of Memory Dependences
[Diagram: an LSQ holding ST 1 (0xF023), LD 1 (0xF380), ST 2, and LD 2 (0xF060), alongside a PC-indexed Memory Dependence Predictor whose set PRED bit flags LD 2 as dependent on a store whose address is still unresolved (?)]
Predictability of memory dependences: the predictor can indicate future stalls
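As a rough sketch of how such a PC-indexed predictor behaves (the table size, modulo indexing, and single wait bit here are illustrative assumptions in the spirit of a Load Wait Table, which the evaluation later says the MDP is modeled on):

```python
# Sketch of a PC-indexed memory dependence predictor: a load whose PC has
# previously been forced to wait on an older store is predicted to stall again.

class MemDepPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.wait_bit = [0] * entries  # one wait bit per table entry

    def _index(self, pc):
        return pc % self.entries  # simple modulo hash; real tables differ

    def predict_stall(self, load_pc):
        """True if this load is predicted to depend on an unresolved store."""
        return self.wait_bit[self._index(load_pc)] == 1

    def train(self, load_pc):
        """Called when the load actually had to wait on a store."""
        self.wait_bit[self._index(load_pc)] = 1

mdp = MemDepPredictor()
mdp.train(0xF060)                  # this load once waited on a store
print(mdp.predict_stall(0xF060))   # True
print(mdp.predict_stall(0xF380))   # False
```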
8
Proactive Exclusion using Memory Dependences
[Diagram: threads T0–T3 feeding the fetch unit with interleaved LD/ST streams; a thread whose load matches a learned ST–LD pair (ST A → LD A) is gated while the store's address is still unresolved (ST ? → LD A)]
  • Learn ST–LD relationships
9
Starvation Problem with Proactive Exclusion

[Diagram: thread T1's gated instructions enter the Reservation Stations only once its stall resolves, while the other threads keep the RS busy]
  • Exclusion (under any strategy) can cause temporary STARVATION
  • Especially bad for short-duration stalls!
10
Short Duration Stall
[Diagram: timelines for the stream ST A, LD A, ADD, SUB under the original policy and under the original policy plus PE, driven by the Memory Dependence Predictor; for a short stall, gating holds back even the independent ADD and SUB]
11
Can we avoid starvation?
With PE based on memory dependences we can
[Diagram: the Memory Dependence Predictor also records how long each predicted dependence stalled, e.g. 20 cycles for the load at 0xF060]
12
Delay Predictor Details
Memory Dependence Predictor
  • Conservative: maximum observed delay
  • Aggressive: last observed delay
  • Adaptive: average of the last n observed delays

[Diagram: a PC-indexed predictor entry with PRED = 1 and DELAY = 20]
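The three delay-prediction policies can be sketched side by side. This is an illustrative Python sketch; the class layout, window size n, and integer averaging are assumptions, not the paper's hardware design.

```python
# Sketch of the three per-PC delay predictors:
#   conservative = maximum delay ever observed
#   aggressive   = last observed delay
#   adaptive     = average of the last n observed delays
from collections import defaultdict, deque

class DelayPredictor:
    def __init__(self, policy="adaptive", n=4):
        self.policy = policy
        self.max_seen = {}                                   # conservative
        self.last = {}                                       # aggressive
        self.recent = defaultdict(lambda: deque(maxlen=n))   # adaptive window

    def observe(self, pc, delay):
        self.max_seen[pc] = max(self.max_seen.get(pc, 0), delay)
        self.last[pc] = delay
        self.recent[pc].append(delay)

    def predict(self, pc):
        if self.policy == "conservative":
            return self.max_seen.get(pc, 0)
        if self.policy == "aggressive":
            return self.last.get(pc, 0)
        window = self.recent[pc]
        return sum(window) // len(window) if window else 0

for policy in ("conservative", "aggressive", "adaptive"):
    dp = DelayPredictor(policy)
    for d in (20, 30, 10):          # three observed stall delays for one PC
        dp.observe(0x40, d)
    print(policy, dp.predict(0x40))  # conservative 30, aggressive 10, adaptive 20
```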
13
How does this help us?
[Diagram: timelines for the stream ST A, LD A, ADD, SUB under the original policy and under original + PE, marked where the store's address resolves; the Memory Dependence Predictor's delay information tells us when the stall will end]
Choose an appropriate delay threshold
14
Performance Impact of Delay Information
Phase 1
[Diagram: a store A / load B pair flows through the Reservation Stations to the Execution Units; after 20 cycles the store's address (0xF060) resolves, and the MDP records an entry for the load with P = 1 and D = 20]
P = prediction, D = delay
15
Phase 2
Delay threshold = front-end depth = 5
[Diagram: on the next encounter of the store A / load B pair, the MDP entry (P = 1, D = 20) gates the thread at the front-end, since the predicted delay exceeds the threshold]
P = prediction, D = delay
16
PE without delay information
Phase 3
Front-end depth = 5
[Diagram: fetch restarts only once the stall resolves, so gated instructions take 20 (stall) + 5 (front-end) = 25 cycles to enter the Reservation Stations]
Instructions enter the RS only after the stall resolves
17
PE with delay information
Phase 3
Delay threshold = front-end depth = 5
[Diagram: fetch restarts at cycle 15, five cycles before the predicted stall resolves, so instructions enter the Reservation Stations at cycle 20, right as the stall resolves]
Instructions enter the RS just in time as the stall resolves
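The timing argument of the last two slides can be checked with the slides' own numbers (front-end depth 5, predicted stall 20 cycles). The function name below is illustrative.

```python
# Without delay information, fetch restarts only when the stall resolves;
# with it, fetch restarts front-end-depth cycles early, so the refilled
# instructions reach the reservation stations just as the stall resolves.

def rs_entry_cycle(stall_cycles, frontend_depth, use_delay_info):
    """Cycle at which the gated thread's instructions enter the RS."""
    if use_delay_info:
        restart = stall_cycles - frontend_depth  # early parole
    else:
        restart = stall_cycles                   # wait for the stall to end
    return restart + frontend_depth              # add front-end refill time

print(rs_entry_cycle(20, 5, use_delay_info=False))  # 25
print(rs_entry_cycle(20, 5, use_delay_info=True))   # 20
```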
18
What does this give us?
PEEP
  • Proactive Exclusion
  • When a memory dependence stall is predicted
  • Avoid starvation
  • Ignore short stalls
  • Give the thread a head start
  • Restart fetch of the gated thread a few cycles before
    the stall resolves

Early Parole!!!
PROACTIVE EXCLUSION AND EARLY PAROLE
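The three ingredients above combine into one gating decision. A minimal sketch, assuming a simple threshold test; the function name and parameters are illustrative, not the paper's interface.

```python
# Sketch of the PEEP gating decision: gate on a predicted memory-dependence
# stall (proactive exclusion), skip gating for short stalls (starvation
# avoidance), and ungate early so the front-end refills in time (early parole).

def peep_gate_cycles(predicted_delay, delay_threshold, frontend_depth):
    """Return how many cycles to gate the thread's fetch (0 = don't gate)."""
    if predicted_delay <= delay_threshold:
        return 0  # short stall: gating would risk starvation for little gain
    # early parole: ungate frontend_depth cycles before the stall resolves
    return max(predicted_delay - frontend_depth, 0)

print(peep_gate_cycles(predicted_delay=20, delay_threshold=5, frontend_depth=5))  # 15
print(peep_gate_cycles(predicted_delay=3, delay_threshold=5, frontend_depth=5))   # 0
```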
19
PEEP In Our Context
Memory Dependence and Delay Predictor
[Diagram: the predictor (predicted delay 20 cycles) gates the Front-end/Icache feeding the Reservation Stations and Execution Units; fetch restarts after predicted delay − FE pipeline depth = 15 cycles]
20
Simulation Parameters
  • Aggressive four-way SMT processor
  • MDP modeled on the Load Wait Table
  • SPEC2000, MediaBench, and others
  • 32 four-thread application mixes evaluated
  • Application classification:
  • S: sensitive to memory dependences
  • N: non-sensitive to memory dependences
  • L: low ILP
  • M: medium ILP
  • H: high ILP

21
Proactive Exclusion Strategies
S: sensitive, N: non-sensitive, L: low ILP, M: medium ILP, H: high ILP
  • PE using memory dependences shows a 13% speedup
  • Maximum benefit with a mix of sensitive (S) and
    non-sensitive (N) threads
  • With all-sensitive threads, all PE strategies perform
    comparably

22
PEEP
17% speedup
  • PEEP using delay prediction outperforms MLP and
    PE-mdep
  • With all-sensitive threads, PEEP does better since it
    can predict stall durations accurately
  • PEEP with an oracle-based MDP shows a performance
    speedup of 19%

23
2-threaded Workloads
12% speedup
  • Fewer threads mean fewer opportunities to fetch from
    non-stalled threads
  • A 12% performance speedup over 25 application mixes
    shows there is potential benefit even in a
    2-way SMT

An Intel simulator shows an 8% performance speedup over
150 application mixes
24
Relationship with OOO Load Scheduling
Hypothesis: the performance benefit is purely due to a
more efficient fetch policy based on a highly
predictable attribute
Experiment: run PEEP on a processor without OOO memory
scheduling, so the prediction is used only for
controlling the fetch policy

Result: avg. speedup over ICOUNT is 17% (same as
PEEP!)
Conclusion: memory dependences are a very good indicator
of future stalls; even a machine without load
reordering benefits from predicting these stalls
25
Why does it work so well?
[Diagram: the stream LD 1, ST 1, LD 2, LD 3, LD 4 filling the Reservation Stations under LMP and under PEEP]
26
[Diagram: the stream LD 1, ST 1, LD 2, ADD, SUB in the Reservation Stations under LMP, PEEP, and MLP]
Can expose more ILP
27
Key Points
  • Need a mechanism for efficient resource
    management in SMT
  • Improve the fetch unit
  • Memory Dependences and Associated Latencies are
    predictable
  • Proactively Exclude bad threads but give them
    Early Parole to avoid temporary starvation
  • Performance improvements on both 4-way and 2-way
    SMT machines

28
Thank You www.cc.gatech.edu/samantik
[Cartoon: a queue of gated LD instructions, one asking "When will I get paroled?"]
29
B1: Sensitivity Analysis
30
[Charts: sensitivity to predictor size and delay threshold]
31
B2: PEEP
17.3% speedup
  • Memory Dependences are a very good indicator of
    future stalls
  • Performance shows that PEEP works because it
    leverages knowledge of future stalls to improve
    instruction fetch

32
B3: Fairness
19%
  • Speedup is computed as the harmonic mean of weighted
    IPCs
  • Since all PE strategies run on top of ICOUNT,
    they inherit its fairness
  • SDS (standard deviation of speedup): 0.17 for PEEP
    and 0.11 for ICOUNT
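The speedup metric on this slide combines per-thread weighted IPCs (the thread's SMT IPC divided by its single-threaded IPC) with a harmonic mean. A minimal sketch; the function name and the sample IPC values are made up for illustration.

```python
# Harmonic mean of weighted IPCs: each thread's SMT-mode IPC is normalized
# by its single-threaded IPC, then the per-thread ratios are combined with
# a harmonic mean, which penalizes starving any one thread.

def hmean_weighted_ipc(smt_ipc, single_ipc):
    weighted = [m / s for m, s in zip(smt_ipc, single_ipc)]
    return len(weighted) / sum(1.0 / w for w in weighted)

# Two threads, each running at half its single-threaded rate.
print(hmean_weighted_ipc([1.0, 0.5], [2.0, 1.0]))  # 0.5
```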

33
B4: OOO memory scheduling on an SMT machine
34
B5: Accuracy of the MDP
35
B6: Delays associated with PEEP
36
B7: Delay Predictors
  • Conservative: maximum observed delay
  • Aggressive: last observed delay
  • Adaptive: average of the last n observed delays
37
B8: Simulator Configuration
38
4-threaded mixes
39
2-threaded mixes