Title: Efficient Runahead Execution Processors: A Power-Efficient Processing Paradigm for Tolerating Long Main Memory Latencies
1. Efficient Runahead Execution Processors: A Power-Efficient Processing Paradigm for Tolerating Long Main Memory Latencies
- Onur Mutlu
- PhD Defense
- 4/28/2006
2. Talk Outline
- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work
3. Motivation
- Memory latency is very long in today's processors
  - CDC 6600: 10 cycles [Thornton, 1970]
  - Alpha 21264: 120 cycles [Wilkes, 2001]
  - Intel Pentium 4: 300 cycles [Sprangle & Carmean, 2002]
- And it continues to increase (in terms of processor cycles)
  - DRAM latency is not reducing as fast as processor cycle time
- Conventional techniques to tolerate memory latency do not work well enough.
4. Conventional Latency Tolerance Techniques
- Caching [initially by Wilkes, 1965]
  - Widely used, simple, effective, but inefficient, passive
  - Not all applications/phases exhibit temporal or spatial locality
- Prefetching [initially in IBM 360/91, 1967]
  - Works well for regular memory access patterns
  - Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive
- Multithreading [initially in CDC 6600, 1964]
  - Works well if there are multiple threads
  - Improving single thread performance using multithreading hardware is an ongoing research effort
- Out-of-order execution [initially by Tomasulo, 1967]
  - Tolerates cache misses that cannot be prefetched
  - Requires extensive hardware resources for tolerating long latencies
5. Out-of-order Execution
- Instructions are executed out of sequential program order to tolerate latency.
- Instructions are retired in program order to support precise exceptions/interrupts.
- Not-yet-retired instructions and their results are buffered in hardware structures, called the instruction window.
- The size of the instruction window determines how much latency the processor can tolerate.
6. Small Windows: Full-window Stalls
- When a long-latency instruction is not complete, it blocks retirement.
- Incoming instructions fill the instruction window.
- Once the window is full, the processor cannot place new instructions into the window.
  - This is called a full-window stall.
- A full-window stall prevents the processor from making progress in the execution of the program.
7. Small Windows: Full-window Stalls
8-entry instruction window:

  Oldest -> LOAD R1 <- mem[R5]     (L2 miss! Takes 100s of cycles.)
            BEQ  R1, R0, target
            ADD  R2 <- R2, 8
            LOAD R3 <- mem[R2]     (independent of the L2 miss: executed out of
            MUL  R4 <- R4, R3       program order, but cannot be retired)
            ADD  R4 <- R4, R5
            STOR mem[R2] <- R4
            ADD  R2 <- R2, 64
  --------- window full ---------
            LOAD R3 <- mem[R2]     (younger instructions cannot be executed
                                    because there is no space in the window)

The processor stalls until the L2 miss is serviced.
- L2 cache misses are responsible for most full-window stalls.
8. Impact of L2 Cache Misses
[Figure. 512 KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]
9. Impact of L2 Cache Misses
[Figure. 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]
10. The Problem
- Out-of-order execution requires large instruction windows to tolerate today's main memory latencies.
- As main memory latency increases, instruction window size should also increase to fully tolerate the memory latency.
- Building a large instruction window is a challenging task if we would like to achieve:
  - Low power/energy consumption
  - Short cycle time
  - Low design and verification complexity
11. Talk Outline
- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work
12. Overview of Runahead Execution [HPCA'03]
- A technique to obtain the memory-level parallelism benefits of a large instruction window (without having to build it!)
- When the oldest instruction is an L2 miss:
  - Checkpoint architectural state and enter runahead mode
- In runahead mode:
  - Instructions are speculatively pre-executed
  - The purpose of pre-execution is to discover other L2 misses
  - The processor does not stall due to L2 misses
- Runahead mode ends when the original L2 miss returns
  - Checkpoint is restored and normal execution resumes
(A control-flow sketch follows below.)
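To make the entry/exit protocol concrete, here is a minimal control-flow sketch in C, written in simulator style. All identifiers (maybe_enter_runahead, checkpoint_architectural_state, and so on) are illustrative assumptions, not names from the talk or the papers.

    #include <stdint.h>

    /* Hypothetical hooks into the rest of a cycle-level simulator. */
    void checkpoint_architectural_state(void);
    void restore_checkpoint(void);
    void flush_pipeline(void);

    enum mode { NORMAL, RUNAHEAD };
    static enum mode mode = NORMAL;
    static uint64_t runahead_cause;            /* address of the blocking L2 miss */

    /* Called when the oldest instruction in the window is an L2-miss load. */
    void maybe_enter_runahead(uint64_t miss_addr) {
        if (mode == NORMAL) {
            checkpoint_architectural_state();  /* registers, branch history */
            runahead_cause = miss_addr;
            mode = RUNAHEAD;                   /* retirement no longer blocks */
        }
    }

    /* Called when an outstanding L2 miss is serviced. */
    void on_l2_miss_serviced(uint64_t miss_addr) {
        if (mode == RUNAHEAD && miss_addr == runahead_cause) {
            flush_pipeline();                  /* discard all runahead work */
            restore_checkpoint();              /* resume at the blocked load */
            mode = NORMAL;
        }
    }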
13. Runahead Example
[Timeline figure comparing three machines:
- Perfect Caches: Compute, Load 1 Hit, Compute, Load 2 Hit, Compute
- Small Window: Compute, Load 1 Miss, Stall (Miss 1), Compute, Load 2 Miss, Stall (Miss 2), Compute
- Runahead: Compute, Load 1 Miss, Runahead (Load 2 Miss issued, so Miss 2 overlaps Miss 1), Compute, Load 1 Hit, Load 2 Hit -> Saved Cycles]
14. Benefits of Runahead Execution
- Instead of stalling during an L2 cache miss:
  - Pre-executed loads and stores independent of L2-miss instructions generate very accurate data prefetches
    - For both regular and irregular access patterns
  - Instructions on the predicted program path are prefetched into the instruction/trace cache and L2.
  - Hardware prefetcher and branch predictor tables are trained using future access information.
15. Runahead Execution Mechanism
- Entry into runahead mode
  - Checkpoint architectural register state
- Instruction processing in runahead mode
- Exit from runahead mode
  - Restore architectural register state from checkpoint
16. Instruction Processing in Runahead Mode
- Runahead mode processing is the same as normal instruction processing, EXCEPT:
  - It is purely speculative: architectural (software-visible) register/memory state is NOT updated in runahead mode.
  - L2-miss dependent instructions are identified and treated specially.
    - They are quickly removed from the instruction window.
    - Their results are not trusted.
17. L2-Miss Dependent Instructions
- Two types of results are produced: INV and VALID
- INV: dependent on an L2 miss
- INV results are marked using INV bits in the register file and store buffer.
- INV values are not used for prefetching/branch resolution.
18. Removal of Instructions from Window
- The oldest instruction is examined for pseudo-retirement
  - An INV instruction is removed from the window immediately.
  - A VALID instruction is removed when it completes execution.
- Pseudo-retired instructions free their allocated resources.
  - This allows the processing of later instructions.
- Pseudo-retired stores communicate their data to dependent loads.
19. Store/Load Handling in Runahead Mode
- A pseudo-retired store writes its data and INV status to a dedicated memory, called the runahead cache.
- Purpose: data communication through memory in runahead mode.
- A dependent load reads its data from the runahead cache.
- This communication does not need to be always correct -> the runahead cache can be very small. (A sketch follows below.)
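A minimal sketch of such a runahead cache in C, assuming a direct-mapped organization and 8-byte entries; the actual organization is not specified on this slide, so all parameters here are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define RA_ENTRIES 64            /* illustrative; the structure is ~512 bytes */

    struct ra_entry {
        bool     valid;
        bool     inv;                /* written by an INV (L2-miss dependent) store */
        uint64_t tag;
        uint64_t data;
    };
    static struct ra_entry ra_cache[RA_ENTRIES];

    /* Pseudo-retired store: write data and INV status. Conflicting entries
       may simply be overwritten; the cache is never architecturally visible. */
    void ra_store(uint64_t addr, uint64_t data, bool inv) {
        struct ra_entry *e = &ra_cache[(addr >> 3) % RA_ENTRIES];
        e->valid = true; e->inv = inv; e->tag = addr >> 3; e->data = data;
    }

    /* Dependent load: returns false if no forwarding data exists; *inv tells
       the consumer whether the forwarded value must be marked INV. */
    bool ra_load(uint64_t addr, uint64_t *data, bool *inv) {
        struct ra_entry *e = &ra_cache[(addr >> 3) % RA_ENTRIES];
        if (!e->valid || e->tag != (addr >> 3)) return false;
        *data = e->data; *inv = e->inv;
        return true;
    }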
20. Branch Handling in Runahead Mode
- INV branches cannot be resolved.
  - A mispredicted INV branch causes the processor to stay on the wrong program path until the end of runahead execution.
- VALID branches are resolved and initiate recovery if mispredicted.
21. Hardware Cost of Runahead Execution
- Checkpoint of the architectural register state
  - Already exists in current processors
- INV bits per register and store buffer entry
- Runahead cache (512 bytes)
- <0.05% area overhead
22. Talk Outline
- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work
23. Baseline Processor
- 3-wide fetch, 29-stage pipeline x86 processor
- 128-entry instruction window
- 512 KB, 8-way, 16-cycle unified L2 cache
- Approximately 500-cycle L2 miss latency
- Bandwidth, contention, conflicts modeled in detail
- Aggressive streaming data prefetcher (16 streams)
- Next-two-lines instruction prefetcher
24. Evaluated Benchmarks
- 147 Intel x86 benchmarks simulated for 30 million instructions
- Benchmark suites:
  - SPEC CPU 95 (S95): mostly scientific FP applications
  - SPEC FP 2000 (FP00)
  - SPEC INT 2000 (INT00)
  - Internet (WEB): SPECjbb, Webmark2001
  - Multimedia (MM): MPEG, speech recognition, games
  - Productivity (PROD): PowerPoint, Excel, Photoshop
  - Server (SERV): transaction processing, e-commerce
  - Workstation (WS): engineering/CAD applications
25. Performance of Runahead Execution
26. Runahead Execution vs. Large Windows
27. Talk Outline
- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work
28. Limitations of the Baseline Runahead Mechanism
- Energy inefficiency
  - A large number of instructions are speculatively executed
  - -> Efficient Runahead Execution [ISCA'05, IEEE Micro Top Picks'06]
- Ineffectiveness for pointer-intensive applications
  - Runahead cannot parallelize dependent L2 cache misses
  - -> Address-Value Delta (AVD) Prediction [MICRO'05]
- Irresolvable branch mispredictions in runahead mode
  - Cannot recover from a mispredicted L2-miss dependent branch
  - -> Wrong Path Events [MICRO'04]
29. Talk Outline
- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work
30. The Efficiency Problem [ISCA'05]
- A runahead processor pre-executes some instructions speculatively
- Each pre-executed instruction consumes energy
- Runahead execution significantly increases the number of executed instructions, sometimes without providing any performance improvement
31. The Efficiency Problem
[Figure: baseline runahead increases IPC by 22% while increasing the number of executed instructions by 27%.]
32. Efficiency of Runahead Execution
- Goals:
  - Reduce the number of executed instructions without reducing the IPC improvement
  - Increase the IPC improvement without increasing the number of executed instructions
33. Causes of Inefficiency
- Short runahead periods
- Overlapping runahead periods
- Useless runahead periods
34. Short Runahead Periods
- The processor can initiate runahead mode due to an already in-flight L2 miss generated by
  - the prefetcher, wrong-path instructions, or a previous runahead period
- Short periods:
  - are less likely to generate useful L2 misses
  - have high overhead due to the flush penalty at runahead exit
[Timeline figure: runahead is entered on an L2 miss that is already mostly serviced, producing a very short period before Miss 2 returns.]
35. Eliminating Short Runahead Periods
- Mechanism to eliminate short periods (sketched below):
  - Record the number of cycles C that an L2 miss has been in flight
  - If C is greater than a threshold T for an L2 miss, disable entry into runahead mode due to that miss
  - T can be determined statically (at design time) or dynamically
  - T = 400 works well for a minimum main memory latency of 500 cycles
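A minimal sketch of this filter in C, assuming each miss status holding register (MSHR) records the cycle its L2 miss was issued; the structure and names are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define T_THRESHOLD 400   /* works well for a 500-cycle minimum memory latency */

    struct mshr { uint64_t issue_cycle; /* cycle the L2 miss went to memory */ };

    /* Consulted when an L2-miss load becomes the oldest instruction. If the
       miss has already been in flight longer than T, the remaining latency
       would make the runahead period too short to be useful. */
    bool short_period_filter_allows_entry(const struct mshr *m, uint64_t now) {
        uint64_t c = now - m->issue_cycle;   /* C: cycles the miss has been in flight */
        return c <= T_THRESHOLD;
    }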
36. Overlapping Runahead Periods
- Two runahead periods that execute the same instructions
- The second period is inefficient
[Timeline figure: the period caused by Miss 1 already pre-executes past Load 2 (which is INV); the later period caused by Miss 2 re-executes the same instructions (OVERLAP).]
37. Overlapping Runahead Periods (cont.)
- Overlapping periods are not necessarily useless
  - The availability of a new data value can result in the generation of useful L2 misses
  - But this does not happen often enough
- Mechanism to eliminate overlapping periods (sketched below):
  - Keep track of the number of pseudo-retired instructions R during a runahead period
  - Keep track of the number of fetched instructions N since the exit from the last runahead period
  - If N < R, do not enter runahead mode
38. Useless Runahead Periods
- Periods that do not result in prefetches for normal mode
  - They exist due to the lack of memory-level parallelism
- Mechanism to eliminate useless periods:
  - Predict if a period will generate useful L2 misses
  - Estimate a period to be useful if it generated an L2 miss that cannot be captured by the instruction window
  - Useless-period predictors are trained based on this estimation
[Timeline figure: a runahead period caused by Load 1 Miss that generates no new L2 misses.]
39. Predicting Useless Runahead Periods
- Prediction based on the past usefulness of runahead periods caused by the same static load instruction (sketched below)
  - A 2-bit state machine records the past usefulness of a load
- Prediction based on too many INV loads
  - If the fraction of INV loads in a runahead period is greater than T, exit runahead mode
- Sampling (phase) based prediction
  - If the last N runahead periods generated fewer than T L2 misses, do not enter runahead for the next M runahead opportunities
- Compile-time profile-based prediction
  - If runahead modes caused by a load were not useful in the profiling run, mark it as a non-runahead load
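A minimal sketch of the first predictor in C: a table of 2-bit saturating counters indexed by the load's PC. The table size and indexing are illustrative assumptions; the slide specifies only the 2-bit state machine per static load.

    #include <stdint.h>
    #include <stdbool.h>

    #define TABLE_SIZE 256
    static uint8_t useful_ctr[TABLE_SIZE];   /* 2-bit counters, values 0..3 */

    static unsigned idx(uint64_t load_pc) { return load_pc % TABLE_SIZE; }

    /* Trained at runahead exit with the estimate from the previous slide:
       "useful" = the period generated an L2 miss that the instruction
       window could not have captured. */
    void train_useless_predictor(uint64_t load_pc, bool period_was_useful) {
        uint8_t *c = &useful_ctr[idx(load_pc)];
        if (period_was_useful) { if (*c < 3) (*c)++; }
        else                   { if (*c > 0) (*c)--; }
    }

    /* Consulted before entering runahead on a miss caused by this load. */
    bool predict_period_useful(uint64_t load_pc) {
        return useful_ctr[idx(load_pc)] >= 2;
    }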
40. Performance Optimizations for Efficiency
- Both efficiency AND performance can be increased by increasing the usefulness of runahead periods
- Three major optimizations:
  - Turning off the Floating Point Unit (FPU) in runahead mode
    - FP instructions do not contribute to the generation of load addresses
  - Optimizing the update policy of the hardware prefetcher (HWP) in runahead mode
    - Improves the positive interaction between runahead and the HWP
  - Early wake-up of INV instructions
    - Enables the faster removal of INV instructions
41. Overall Impact on Executed Instructions
[Figure: extra instructions executed drop from 26.5% with baseline runahead to 6.2% with the efficiency techniques.]
42. Overall Impact on IPC
[Figure: the IPC improvement is essentially preserved, 22.6% for baseline runahead vs. 22.1% with the efficiency techniques.]
43. Talk Outline
- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work
44. The Problem: Dependent Cache Misses
- Runahead execution cannot parallelize dependent misses
  - wasted opportunity to improve performance
  - wasted energy (useless pre-execution)
- Runahead performance would improve by 25% if this limitation were ideally overcome
[Timeline figure: in runahead mode, Load 2 is dependent on Load 1, so its address cannot be computed; Load 2 becomes INV and Miss 2 is serviced only after Miss 1, in normal mode.]
45. The Goal of AVD Prediction
- Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism
- How:
  - Predict the values of L2-miss address (pointer) loads
  - Address load: loads an address into its destination register, which is later used to calculate the address of another load
    - as opposed to a data load
46. Parallelizing Dependent Cache Misses
[Timeline figure: without value prediction, Load 2's address cannot be computed, so Load 2 is INV and Miss 2 waits for normal mode. With the value of Load 1 predicted, Load 2's address can be computed in runahead mode; Miss 2 overlaps Miss 1, saving both cycles and speculative instructions.]
47. AVD Prediction [MICRO'05]
- The address-value delta (AVD) of a load instruction is defined as:
  AVD = Effective Address of Load - Data Value of Load
- For some address loads, the AVD is stable
- An AVD predictor keeps track of the AVDs of address loads
- When a load is an L2 miss in runahead mode, the AVD predictor is consulted
  - If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted:
  Predicted Value = Effective Address - Predicted AVD
(A worked example follows below.)
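The two defining equations translate directly into code. A minimal sketch in C with a hypothetical worked example (the addresses below are made up for illustration):

    #include <stdint.h>

    typedef int64_t avd_t;

    /* AVD = effective address - data value (computed when the load retires). */
    avd_t compute_avd(uint64_t eff_addr, uint64_t data_value) {
        return (avd_t)(eff_addr - data_value);
    }

    /* Predicted value = effective address - predicted AVD (used on an
       L2 miss in runahead mode). */
    uint64_t predict_value(uint64_t eff_addr, avd_t predicted_avd) {
        return eff_addr - (uint64_t)predicted_avd;
    }

    /* Example: a pointer load at effective address 0x8000 returns 0x8010,
       so AVD = 0x8000 - 0x8010 = -16. If -16 proves stable, a later
       L2-miss instance at 0x9000 is predicted to load
       0x9000 - (-16) = 0x9010. */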
48. Why Do Stable AVDs Occur?
- Regularity in the way data structures are
  - allocated in memory AND
  - traversed
- Two types of loads can have stable AVDs:
  - Traversal address loads
    - Produce addresses consumed by address loads
  - Leaf address loads
    - Produce addresses consumed by data loads
49. Traversal Address Loads
Regularly-allocated linked list: nodes at addresses A, A+k, A+2k, A+3k, ...
A traversal address load loads the pointer to the next node: node = node->next

AVD = Effective Addr - Data Value

  Effective Addr | Data Value | AVD
  A              | A+k        | -k
  A+k            | A+2k       | -k
  A+2k           | A+3k       | -k

Stable AVD; striding data value. (A code illustration follows below.)
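A hypothetical C illustration of why regular allocation produces this stable AVD for the traversal load node = node->next: if the nodes are carved out of one contiguous block, node i+1 always sits exactly sizeof(struct node) bytes after node i.

    #include <stddef.h>
    #include <stdlib.h>

    struct node { int data; struct node *next; };

    struct node *make_list(size_t n) {           /* n >= 1, contiguous chunk */
        struct node *base = malloc(n * sizeof *base);
        for (size_t i = 0; i + 1 < n; i++)
            base[i].next = &base[i + 1];
        base[n - 1].next = NULL;
        return base;
    }

    /* For each load of node->next in "while (node) node = node->next;":
         effective address = (char *)node + offsetof(struct node, next)
         data value        = (char *)node + sizeof(struct node)
       so AVD = offsetof(struct node, next) - sizeof(struct node), the same
       negative constant (the slide's -k) on every iteration, even though
       the loaded values themselves stride upward. */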
50. Leaf Address Loads
Sorted dictionary in parser: nodes point to strings (words); each string and its node are allocated consecutively. The dictionary is looked up for an input word. A leaf address load loads the pointer to the string of each node:

  lookup (node, input) {
      // ...
      ptr_str = node->string;
      m = check_match(ptr_str, input);
      // ...
  }

With strings at addresses A, B, C, ... and each node at its string's address + k:

AVD = Effective Addr - Data Value

  Effective Addr | Data Value | AVD
  A+k            | A          | k
  C+k            | C          | k
  F+k            | F          | k

Stable AVD; no stride! (A code illustration follows below.)
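A hypothetical C illustration of a leaf address load with a stable AVD: each string and its node are allocated consecutively from a fixed-size chunk, mimicking the parser dictionary above. CHUNK and the helper name are illustrative assumptions.

    #include <stdlib.h>
    #include <string.h>

    #define CHUNK 32                 /* fixed-size slot: string at X, node at X+CHUNK */

    struct dict_node {
        char *string;
        struct dict_node *left, *right;
    };

    struct dict_node *make_node(const char *word) {
        char *mem = malloc(CHUNK + sizeof(struct dict_node));
        strncpy(mem, word, CHUNK - 1);
        mem[CHUNK - 1] = '\0';
        struct dict_node *n = (struct dict_node *)(mem + CHUNK);
        n->string = mem;             /* always CHUNK bytes below the node */
        n->left = n->right = NULL;
        return n;
    }

    /* For the leaf load "ptr_str = node->string":
         effective address = (char *)node + 0
         data value        = (char *)node - CHUNK
       so AVD = +CHUNK (the slide's k) for every node, no matter where the
       sort order places the nodes relative to each other: no stride. */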
51. Performance of AVD Prediction
[Figure: AVD prediction improves runahead performance; data labels: runahead, 14.3%, 15.5%.]
52. Talk Outline
- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work
53. Summary of Contributions
- Runahead execution provides the latency tolerance benefit of a large instruction window by parallelizing independent cache misses
  - With a very modest increase in hardware cost and complexity
  - 128-entry window + runahead ~ 384-entry window
- Efficient runahead execution techniques improve the energy-efficiency of base runahead execution
  - Only 6% extra instructions executed for a 22% performance benefit
- Address-Value Delta (AVD) prediction enables the parallelization of dependent cache misses
  - By exploiting regular memory allocation patterns
  - A 16-entry (102-byte) AVD predictor improves the performance of runahead execution by 14% on pointer-intensive applications
54. Talk Outline
- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work
55. Future Work in Runahead Execution
- Compilation/programming techniques for runahead processors
  - Keeping runahead execution on the correct program path
  - Parallelizing dependent cache misses in linked data structure traversals
- Runahead co-processors/accelerators
- Evaluation of runahead execution on multithreaded and multiprocessor systems
56. Research Summary
- Runahead execution
  - Original runahead proposal [HPCA'03, IEEE Micro Top Picks'03]
  - Efficient runahead execution [ISCA'05, IEEE Micro Top Picks'06]
  - AVD prediction [MICRO'05]
  - Result reuse in runahead execution [Comp. Arch. Letters'05]
- High-performance memory system designs
  - Pollution-aware caching [IJPP'05]
  - Parallelism-aware caching [ISCA'06]
  - Performance analysis of speculative memory references [IEEE Trans. on Computers'05]
  - Latency/bandwidth tradeoffs in memory controllers [Patent'04]
- Branch instruction handling techniques through compiler-microarchitecture cooperation
  - Wish branches [MICRO'05, IEEE Micro Top Picks'06]
  - Wrong path events [MICRO'04]
  - Compiler-assisted dynamic predication (in progress)
- Efficient compile-time profiling techniques for detecting input-dependent program behavior
  - 2D profiling [CGO'06]
- Fault tolerant microarchitecture design
  - Microarchitecture-based introspection [DSN'05]
57. Thank you.
58. Backup Slides
59. Thesis Statement
- Efficient runahead execution is a cost- and complexity-effective microarchitectural technique that can tolerate long main memory latencies without requiring
  - unreasonably large, slow, complex, and power-hungry hardware structures
  - significant increases in processor complexity and power consumption.
60. Impact of L2 Cache Misses
[Figure (repeated from slide 9). 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]
61. Entry into Runahead Mode
- When an L2-miss load instruction is the oldest in the instruction window:
  - The processor checkpoints the architectural register state.
  - The processor records the address of the L2-miss load.
  - The L2-miss load marks its destination register as INV (invalid) and is removed from the instruction window.
62. Exit from Runahead Mode
- When the runahead-causing L2 miss is serviced:
  - All instructions in the machine are flushed.
  - INV bits are reset. The runahead cache is flushed.
  - The processor restores the architectural state as it was before the runahead-causing instruction was fetched (Load 1 is re-fetched and re-executed).
- Architecturally, NOTHING happened.
  - But, hopefully, useful prefetch requests were generated (caches warmed up).
- The mode is switched to normal mode.
  - Instructions executed in runahead mode are re-executed in normal mode.
63. When to Enter Runahead Mode
- Why not at the time an L2 miss happens?
  - The missing load is not guaranteed to be a valid correct-path instruction until it becomes the oldest.
  - Limited potential (an L2-miss instruction becomes the oldest instruction 10 cycles later, on average)
  - Need to checkpoint state at the oldest instruction (throw away all instructions older than the L2 miss?)
- Why not when the window becomes full?
  - Delays the removal of instructions from the window, which can result in slow progress in runahead mode.
  - No significant gain (the window becomes full 98% of the time after we see an L2 miss)
- Why not on L1 cache misses?
  - >50% of L1 cache misses hit in the L2 cache -> many short runahead periods
64. When to Exit Runahead Mode
- Why not exit early to fill the pipeline and the window?
  - How do we determine how early? Memory does not have a fixed latency.
  - This reduces the progress made in runahead mode.
  - Exiting early using oracle information has lower performance than exiting when the miss returns.
- Why not exit late to make further progress in runahead?
  - Not necessarily beneficial to stay in runahead longer.
  - On average, this policy hurts performance.
  - But more intelligent runahead period extension schemes improve performance.
65. Modifications to Pipeline
[Pipeline diagram of the baseline x86 processor: trace cache, fetch unit, decoder, uop queue, renamer, frontend RAT, INT/FP/MEM queues and schedulers, register files, execution units, address generation units, L1 data cache, reorder buffer, backend RAT, store buffer, prefetcher, instruction selection logic, L2 access queue, L2 cache, and front side bus access queue. Runahead additions are highlighted: checkpointed state, INV bits, and the runahead cache. <0.05% area overhead.]
66. Effect of a Better Front-end
67. Why is Runahead Better with a Better Front-end?
- A better front-end provides more correct-path instructions (hence, more and more accurate L2 misses) in runahead periods
- Average number of instructions during runahead: 711
  - before a mispredicted INV branch: 431
  - with perfect TC/BP, this average increases to 909
- Average number of L2 misses during runahead: 2.6
  - before a mispredicted INV branch: 2.38
  - with perfect TC/BP, this average increases to 3.18
- If all INV branches were resolved correctly during runahead,
  - the performance gain would be 25% instead of 22%
68. Importance of Store-Load Communication
69. In-order vs. Out-of-order
70. Sensitivity to L2 Cache Size
71. Instruction vs. Data Prefetching Benefits
72. Why Does Runahead Work?
- 70% of instructions are VALID in runahead mode
  - These values show periodic behavior
- Runahead prefetching reduces L1, L2, and TC misses during normal mode
- Data miss reduction:
  - 18% decrease in normal mode L1 misses (base: 13.7 per 1K uops)
  - 33% of normal mode L2 data misses are fully or partially covered by runahead prefetching (base L2 data miss rate: 4.3 per 1K uops)
  - 15% of normal mode L2 data misses are fully covered (these misses are never seen in normal mode)
- Instruction miss reduction:
  - 3% decrease in normal mode TC misses
  - 14% decrease in normal mode L2 fetch misses (some of these are only partially covered by runahead requests)
- Overall increase in data misses:
  - L2 misses are increased by 5% (due to contention and useless prefetches)
73. Correlation Between L2 Miss Reduction and Speedup
74. Runahead on a More Aggressive Processor
75. Runahead on Future Model
76. Future Model with Perfect Front-end
77. Baseline Alpha Processor
- Execution-driven Alpha simulator
- 8-wide superscalar processor
- 128-entry instruction window, 20-stage pipeline
- 64 KB, 4-way, 2-cycle L1 data and instruction caches
- 1 MB, 32-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- Aggressive stream-based prefetcher
- 32 DRAM banks, 32-byte wide processor-memory bus (4:1 frequency ratio), 128 outstanding misses
- Detailed memory model
78. Runahead vs. Large Windows (Alpha)
79. In-order vs. Out-of-order Execution (Alpha)
80. Comparison to 1024-entry Window
81. Runahead vs. HWP (Alpha)
82. Effect of Memory Latency (Alpha)
83. 1K and 2K Memory Latency (Alpha)
84. Efficient Runahead
85. Methods for Efficient Runahead Execution
- Eliminating inefficient runahead periods
- Increasing the usefulness of runahead periods
- Reuse of runahead results
- Value prediction of L2-miss load instructions
- Optimizing the exit policy from runahead mode
86. Impact on Efficiency
[Figure: the efficiency techniques reduce the extra instructions from 26.5% for baseline runahead to as low as 6.7%; intermediate configurations: 22.6%, 20.1%, 15.3%, 14.9%, 11.8%.]
87. Extra Instructions with Efficient Runahead
88. Performance Increase with Efficient Runahead
89. Cache Sizes (Executed Instructions)
90. Cache Sizes (IPC Delta)
91. Turning Off the FPU in Runahead Mode
- FP instructions do not contribute to the generation of load addresses
- FP instructions can be dropped after decode
  - Spares processor resources for more useful instructions
- Increases performance by enabling faster progress
- Enables dynamic/static energy savings
- Results in an unresolvable branch misprediction if a mispredicted branch depends on an FP operation (rare)
- Overall, increases IPC and reduces executed instructions
92. HWP Update Policy in Runahead Mode
- Aggressive hardware prefetching in runahead mode may hurt performance if the prefetcher accuracy is low
  - Runahead requests are more accurate than prefetcher requests
- Three policies:
  - Do not update the prefetcher state
  - Update the prefetcher state just like in normal mode
  - Only train existing streams, but do not create new streams
- Runahead mode improves the timeliness of the prefetcher in many benchmarks
- Only training the existing streams is the best policy
93. Early INV Wake-up
- Keep track of the INV status of an instruction in the scheduler.
- The scheduler wakes up the instruction if any source is INV.
- (+) Enables faster progress during runahead mode by removing the useless INV instructions faster.
- (-) Increases the number of executed instructions.
- (-) Increases the complexity of the scheduling logic.
- Not worth implementing due to the small IPC gain.
94. Short Runahead Periods
95. RCST Counter
96. Sampling for Useless Periods
97. Efficiency Techniques: Extra Instructions
98. Efficiency Techniques: IPC Increase
99. Effect of Memory Latency on Efficient Runahead
100. Usefulness of Runahead Periods
101. L2 Misses per Useful Runahead Period
102. Other Considerations for Efficient Runahead
103. Performance Potential of Result Reuse
- Ideal reuse study
  - To determine the upper bound on the performance gain possible by reusing results of runahead instructions
- Valid pseudo-retired runahead instructions magically update architectural state during normal mode
  - They do not consume any resources (ROB or buffer entries)
- Only invalid pseudo-retired runahead instructions are re-executed
  - They are fed into the renamer (the fetch/decode pipeline is skipped)
104. Ideal Reuse of All Valid Runahead Results
105. Alpha Reuse IPCs
106. Number of Reused Instructions
107. Why Does Reuse Not Work?
108. IPC Increase with Reuse
109. Extra Instructions with Reuse
110. Runahead Period Statistics
111. Mem Latency and BP Accuracy in Reuse
112. Runahead + VP: Extra Instructions
113. Runahead + VP: IPC Increase
114. Late Exit from Runahead (Extra Inst.)
115. Late Exit from Runahead (IPC Increase)
116. AVD Prediction and Optimizations
117. Identifying Address Loads in Hardware
- Observation: if the AVD is too large, the value being loaded is likely NOT an address
- Only predict loads that have satisfied
  -MaxAVD < AVD < MaxAVD
- This identification mechanism eliminates almost all data loads from consideration
  - Enables the AVD predictor to be small
118. An Implementable AVD Predictor
- Set-associative prediction table
- A prediction table entry consists of:
  - Tag (PC of the load)
  - Last AVD seen for the load
  - Confidence counter for the recorded AVD
- Updated when an address load is retired in normal mode
- Accessed when a load misses in the L2 cache in runahead mode
- Recovery-free: no need to recover the state of the processor or the predictor on a misprediction
  - Runahead mode is purely speculative
(A sketch of the update and prediction paths follows below.)
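A minimal sketch of the predictor's update and prediction paths in C, folding in the MaxAVD filter from the previous slide. A direct-mapped table is used here for brevity (the slide says set-associative), and the sizes are illustrative; the talk's results use a 16-entry table.

    #include <stdint.h>
    #include <stdbool.h>

    #define ENTRIES  16
    #define MAX_AVD  (64 * 1024)     /* illustrative MaxAVD */
    #define CONF_MAX 3               /* saturating confidence counter */

    struct avd_entry { uint64_t tag; int64_t last_avd; uint8_t conf; };
    static struct avd_entry table[ENTRIES];

    /* Update path: called when an address load retires in normal mode. */
    void avd_update(uint64_t pc, uint64_t eff_addr, uint64_t value) {
        int64_t avd = (int64_t)(eff_addr - value);
        if (avd <= -MAX_AVD || avd >= MAX_AVD) return;   /* likely a data load */
        struct avd_entry *e = &table[pc % ENTRIES];
        if (e->tag == pc && e->last_avd == avd) {
            if (e->conf < CONF_MAX) e->conf++;           /* same AVD: more confident */
        } else {
            e->tag = pc; e->last_avd = avd; e->conf = 0; /* new load or new AVD */
        }
    }

    /* Prediction path: called when a load misses in L2 during runahead mode.
       No recovery is needed on a wrong prediction: runahead is speculative. */
    bool avd_predict(uint64_t pc, uint64_t eff_addr, uint64_t *pred_value) {
        struct avd_entry *e = &table[pc % ENTRIES];
        if (e->tag != pc || e->conf < CONF_MAX) return false;
        *pred_value = eff_addr - (uint64_t)e->last_avd;  /* EA - predicted AVD */
        return true;
    }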
119. AVD Update Logic
120. AVD Prediction Logic
121. Properties of Traversal-based AVDs
- Stable AVDs can be captured with a stride value predictor
- Stable AVDs disappear with the re-organization of the data structure (e.g., after sorting, the distance between nodes is NOT constant!)
- Stability of AVDs is dependent on the behavior of the memory allocator
  - Allocation of contiguous, fixed-size chunks is useful
122. Properties of Leaf-based AVDs
- Stable AVDs cannot be captured with a stride value predictor
- Stable AVDs do not disappear with the re-organization of the data structure (e.g., after sorting, the distance between node and string is still constant!)
- Stability of AVDs is dependent on the behavior of the memory allocator
123. AVD Prediction vs. Stride Value Prediction
- Performance:
  - Both can capture traversal address loads with stable AVDs
    - e.g., treeadd
  - Stride VP cannot capture leaf address loads with stable AVDs
    - e.g., health, mst, parser
  - The AVD predictor cannot capture data loads with striding data values
    - Predicting these can be useful for the correct resolution of mispredicted L2-miss dependent branches, e.g., parser
- Complexity:
  - The AVD predictor requires many fewer entries (only address loads)
  - AVD prediction logic is simpler (no stride maintenance)
124. AVD vs. Stride VP: Performance
[Figure: speedups with a 16-entry AVD predictor vs. a 4096-entry stride value predictor; data labels: 2.5%, 4.5%, 12.1%, 12.6%, 13.4%, 16%.]
125. AVD vs. Stride VP: Performance
[Figure: 16-entry vs. 4096-entry predictors; data labels: 2.7%, 4.7%, 5.1%, 5.5%, 6.5%, 8.6%.]
126. AVD vs. Stream Prefetching: Performance
[Figure: data labels: 12.1%, 12.1%, 13.4%, 16.5%, 20.1%, 22.5%.]
127. AVD vs. Stream Prefetching (L2 bandwidth)
[Figure: data labels: 35.3%, 32.8%, 26%, 24.5%, 5.1%, 5.1%.]
128. AVD vs. Stream Prefetching (Memory bandwidth)
[Figure: data labels: 19.5%, 16.4%, 14.9%, 12.1%, 3.2%, 3.2%.]
129. Source Code Optimization for AVD Prediction
130. Effect of Code Optimization (parser)
[Figure: data labels: 6.4%, 10.5%.]
131. Accuracy/Coverage with Code Optimization
132. AVD and Efficiency Techniques
133. AVD and Efficiency Techniques
134. Effect of AVD on Runahead Periods
135. AVD Example from treeadd
136. AVD Example from parser
137. AVD Example from health
138. Motivation for NULL-value Optimization
139. Effect of NULL-value Optimization
140. Related Work
141. Methodology
142. Future Research Directions