1
Efficient Runahead Execution Processors:
A Power-Efficient Processing Paradigm
for Tolerating Long Main Memory Latencies
  • Onur Mutlu
  • PhD Defense
  • 4/28/2006

2
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

3
Motivation
  • Memory latency is very long in today's processors
  • CDC 6600: 10 cycles [Thornton, 1970]
  • Alpha 21264: 120 cycles [Wilkes, 2001]
  • Intel Pentium 4: 300 cycles [Sprangle & Carmean,
    2002]
  • And it continues to increase (in terms of
    processor cycles)
  • DRAM latency is not decreasing as fast as
    processor cycle time
  • Conventional techniques to tolerate memory
    latency do not work well enough.

4
Conventional Latency Tolerance Techniques
  • Caching [initially by Wilkes, 1965]
  • Widely used, simple, effective, but inefficient,
    passive
  • Not all applications/phases exhibit temporal or
    spatial locality
  • Prefetching [initially in IBM 360/91, 1967]
  • Works well for regular memory access patterns
  • Prefetching irregular access patterns is
    difficult, inaccurate, and hardware-intensive
  • Multithreading [initially in CDC 6600, 1964]
  • Works well if there are multiple threads
  • Improving single-thread performance using
    multithreading hardware is an ongoing research
    effort
  • Out-of-order execution [initially by Tomasulo,
    1967]
  • Tolerates cache misses that cannot be prefetched
  • Requires extensive hardware resources for
    tolerating long latencies

5
Out-of-order Execution
  • Instructions are executed out of sequential
    program order to tolerate latency.
  • Instructions are retired in program order to
    support precise exceptions/interrupts.
  • Not-yet-retired instructions and their results
    are buffered in hardware structures, called the
    instruction window.
  • The size of the instruction window determines how
    much latency the processor can tolerate.

6
Small Windows: Full-window Stalls
  • When a long-latency instruction is not complete,
    it blocks retirement.
  • Incoming instructions fill the instruction
    window.
  • Once the window is full, the processor cannot place
    new instructions into the window.
  • This is called a full-window stall.
  • A full-window stall prevents the processor from
    making progress in the execution of the program.

7
Small Windows: Full-window Stalls
8-entry instruction window (oldest first):

  LOAD R1 ← mem[R5]      L2 Miss! Takes 100s of cycles.
  BEQ  R1, R0, target
  ADD  R2 ← R2, 8
  LOAD R3 ← mem[R2]      Independent of the L2 miss; executed out of
  MUL  R4 ← R4, R3       program order, but cannot be retired.
  ADD  R4 ← R4, R5
  STOR mem[R2] ← R4
  ADD  R2 ← R2, 64
  ----------------       Younger instructions (e.g., the next
  LOAD R3 ← mem[R2]      LOAD R3 ← mem[R2]) cannot be executed because
                         there is no space in the instruction window.

The processor stalls until the L2 miss is serviced.
  • L2 cache misses are responsible for most
    full-window stalls.

8
Impact of L2 Cache Misses
[Chart: impact of L2 misses. 512 KB L2 cache, 500-cycle DRAM latency,
aggressive stream-based prefetcher. Data averaged over 147
memory-intensive benchmarks on a high-end x86 processor model.]
9
Impact of L2 Cache Misses
[Chart: impact of L2 misses. 500-cycle DRAM latency, aggressive
stream-based prefetcher. Data averaged over 147 memory-intensive
benchmarks on a high-end x86 processor model.]
10
The Problem
  • Out-of-order execution requires large instruction
    windows to tolerate today's main memory
    latencies.
  • As main memory latency increases, instruction
    window size should also increase to fully
    tolerate the memory latency.
  • Building a large instruction window is a
    challenging task if we would like to
    achieve:
  • Low power/energy consumption
  • Short cycle time
  • Low design and verification complexity

11
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

12
Overview of Runahead Execution [HPCA'03]
  • A technique to obtain the memory-level
    parallelism benefits of a large instruction
    window (without having to build it!)
  • When the oldest instruction is an L2 miss:
  • Checkpoint architectural state and enter runahead
    mode
  • In runahead mode:
  • Instructions are speculatively pre-executed
  • The purpose of pre-execution is to discover other
    L2 misses
  • The processor does not stall due to L2 misses
  • Runahead mode ends when the original L2 miss
    returns
  • Checkpoint is restored and normal execution
    resumes

13
Runahead Example

Perfect Caches:
  Compute → Load 1 Hit → Compute → Load 2 Hit → Compute

Small Window:
  Compute → Load 1 Miss (stall until Miss 1 returns) → Compute →
  Load 2 Miss (stall until Miss 2 returns) → Compute

Runahead:
  Compute → Load 1 Miss → Runahead (Load 2 Miss issued, so Miss 1 and
  Miss 2 overlap) → Compute → Load 1 Hit → Load 2 Hit → Compute
  → Saved Cycles
14
Benefits of Runahead Execution
  • Instead of stalling during an L2 cache miss:
  • Pre-executed loads and stores independent of
    L2-miss instructions generate very accurate data
    prefetches
  • For both regular and irregular access patterns
  • Instructions on the predicted program path are
    prefetched into the instruction/trace cache and
    L2.
  • Hardware prefetcher and branch predictor tables
    are trained using future access information.

15
Runahead Execution Mechanism
  • Entry into runahead mode
  • Checkpoint architectural register state
  • Instruction processing in runahead mode
  • Exit from runahead mode
  • Restore architectural register state from
    checkpoint (the overall flow is sketched below)
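A minimal C sketch of this entry/processing/exit flow. Every function
here is a hypothetical stand-in for a hardware action (stubbed so the
sketch compiles); none of it is an API from the talk.

#include <stdbool.h>

typedef enum { NORMAL_MODE, RUNAHEAD_MODE } cpu_mode_t;

typedef struct {
    cpu_mode_t    mode;
    unsigned long runahead_cause;  /* address of the triggering L2 miss */
} cpu_state_t;

/* Hypothetical hardware hooks, stubbed out. */
static bool oldest_is_l2_miss(cpu_state_t *c)              { (void)c; return false; }
static unsigned long oldest_miss_addr(cpu_state_t *c)      { (void)c; return 0; }
static bool miss_serviced(cpu_state_t *c, unsigned long a) { (void)c; (void)a; return false; }
static void checkpoint_arch_state(cpu_state_t *c)          { (void)c; }
static void restore_arch_state(cpu_state_t *c)             { (void)c; }
static void flush_and_reset_runahead_state(cpu_state_t *c) { (void)c; }

/* Conceptually evaluated every cycle. */
void runahead_control(cpu_state_t *cpu)
{
    if (cpu->mode == NORMAL_MODE && oldest_is_l2_miss(cpu)) {
        checkpoint_arch_state(cpu);            /* entry: checkpoint registers */
        cpu->runahead_cause = oldest_miss_addr(cpu);
        cpu->mode = RUNAHEAD_MODE;             /* pre-execute speculatively */
    } else if (cpu->mode == RUNAHEAD_MODE &&
               miss_serviced(cpu, cpu->runahead_cause)) {
        flush_and_reset_runahead_state(cpu);   /* flush pipeline, reset INV
                                                  bits, flush runahead cache */
        restore_arch_state(cpu);               /* exit: restore checkpoint and
                                                  resume normal execution */
        cpu->mode = NORMAL_MODE;
    }
}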

16
Instruction Processing in Runahead Mode
[Timeline: Compute → Load 1 Miss → Runahead until Miss 1 returns]
  • Runahead mode processing is the same as
    normal instruction processing, EXCEPT:
  • It is purely speculative: architectural
    (software-visible) register/memory state is NOT
    updated in runahead mode.
  • L2-miss dependent instructions are identified and
    treated specially.
  • They are quickly removed from the instruction
    window.
  • Their results are not trusted.

17
L2-Miss Dependent Instructions
  • Two types of results are produced: INV and VALID
  • INV: dependent on an L2 miss
  • INV results are marked using INV bits in the
    register file and store buffer.
  • INV values are not used for prefetching/branch
    resolution.

18
Removal of Instructions from Window
  • The oldest instruction is examined for
    pseudo-retirement (see the sketch below)
  • An INV instruction is removed from the window
    immediately.
  • A VALID instruction is removed when it completes
    execution.
  • Pseudo-retired instructions free their allocated
    resources.
  • This allows the processing of later
    instructions.
  • Pseudo-retired stores communicate their data to
    dependent loads.
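A tiny C sketch of this pseudo-retirement rule; the struct fields are
assumed names for illustration, not the talk's hardware design.

#include <stdbool.h>

typedef struct {
    bool inv;        /* marked INV: result depends on an L2 miss */
    bool completed;  /* has finished executing */
} ra_inst_t;

/* May the oldest instruction pseudo-retire (freeing its resources)? */
bool can_pseudo_retire(const ra_inst_t *oldest)
{
    if (oldest->inv)
        return true;           /* INV: removed immediately, result untrusted */
    return oldest->completed;  /* VALID: removed once execution completes */
}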

19
Store/Load Handling in Runahead Mode
  • A pseudo-retired store writes its data and INV
    status to a dedicated memory, called the runahead
    cache.
  • Purpose: data communication through memory in
    runahead mode.
  • A dependent load reads its data from the runahead
    cache.
  • Does not always need to be correct → the runahead
    cache can be very small (sketched below).
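A minimal sketch of a runahead cache, assumed here to be a tiny
direct-mapped structure sized to the 512 bytes quoted on the
hardware-cost slide; the 8-byte entry size and the indexing scheme are
illustrative choices, not the talk's exact design.

#include <stdbool.h>
#include <stdint.h>

#define RA_LINES 64   /* 64 x 8-byte entries = 512 bytes of data */

typedef struct {
    bool     valid;
    bool     inv;     /* the store's data was L2-miss dependent */
    uint64_t tag;
    uint64_t data;
} ra_line_t;

static ra_line_t ra_cache[RA_LINES];

/* A pseudo-retired store writes its data and INV status. */
void ra_store(uint64_t addr, uint64_t data, bool inv)
{
    ra_line_t *l = &ra_cache[(addr >> 3) % RA_LINES];
    l->valid = true;          /* conflicting entries are simply overwritten: */
    l->inv   = inv;           /* runahead results do not have to be correct  */
    l->tag   = addr >> 3;
    l->data  = data;
}

/* A dependent load reads its data; returns false on a miss. */
bool ra_load(uint64_t addr, uint64_t *data, bool *inv)
{
    ra_line_t *l = &ra_cache[(addr >> 3) % RA_LINES];
    if (!l->valid || l->tag != (addr >> 3))
        return false;
    *data = l->data;
    *inv  = l->inv;
    return true;
}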

20
Branch Handling in Runahead Mode
  • INV branches cannot be resolved.
  • A mispredicted INV branch causes the processor
    to stay on the wrong program path until the end
    of runahead execution.
  • VALID branches are resolved and initiate recovery
    if mispredicted.

21
Hardware Cost of Runahead Execution
  • Checkpoint of the architectural register state
  • Already exists in current processors
  • INV bits per register and store buffer entry
  • Runahead cache (512 bytes)
  • < 0.05% area overhead

22
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

23
Baseline Processor
  • 3-wide fetch, 29-stage pipeline x86 processor
  • 128-entry instruction window
  • 512 KB, 8-way, 16-cycle unified L2 cache
  • Approximately 500-cycle L2 miss latency
  • Bandwidth, contention, conflicts modeled in
    detail
  • Aggressive streaming data prefetcher (16 streams)
  • Next-two-lines instruction prefetcher

24
Evaluated Benchmarks
  • 147 Intel x86 benchmarks simulated for 30 million
    instructions
  • Benchmark suites:
  • SPEC CPU 95 (S95): mostly scientific FP
    applications
  • SPEC FP 2000 (FP00)
  • SPEC INT 2000 (INT00)
  • Internet (WEB): SPECjbb, Webmark2001
  • Multimedia (MM): MPEG, speech recognition, games
  • Productivity (PROD): PowerPoint, Excel, Photoshop
  • Server (SERV): transaction processing, e-commerce
  • Workstation (WS): engineering/CAD applications

25
Performance of Runahead Execution
26
Runahead Execution vs. Large Windows
27
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

28
Limitations of the Baseline Runahead Mechanism
  • Energy Inefficiency
  • A large number of instructions are speculatively
    executed
  • Efficient Runahead Execution [ISCA'05, IEEE Micro
    Top Picks'06]
  • Ineffectiveness for pointer-intensive
    applications
  • Runahead cannot parallelize dependent L2 cache
    misses
  • Address-Value Delta (AVD) Prediction [MICRO'05]
  • Irresolvable branch mispredictions in runahead
    mode
  • Cannot recover from a mispredicted L2-miss
    dependent branch
  • Wrong Path Events [MICRO'04]

29
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

30
The Efficiency Problem [ISCA'05]
  • A runahead processor pre-executes some
    instructions speculatively
  • Each pre-executed instruction consumes energy
  • Runahead execution significantly increases the
    number of executed instructions, sometimes
    without providing performance improvement

31
The Efficiency Problem
[Chart: runahead increases IPC by 22% but increases the number of
executed instructions by 27%]
32
Efficiency of Runahead Execution
  • Goals:
  • Reduce the number of executed instructions
    without reducing the IPC improvement
  • Increase the IPC improvement
    without increasing the number of
    executed instructions

33
Causes of Inefficiency
  • Short runahead periods
  • Overlapping runahead periods
  • Useless runahead periods

34
Short Runahead Periods
  • The processor can initiate runahead mode due to an
    already in-flight L2 miss generated by
  • the prefetcher, wrong-path execution, or a previous
    runahead period
  • Short periods
  • are less likely to generate useful L2 misses
  • have high overhead due to the flush penalty at
    runahead exit

[Timeline: Load 2's miss, generated during Load 1's runahead period,
is still in flight at exit; entering runahead on it yields only a
short period before Miss 2 returns.]
35
Eliminating Short Runahead Periods
  • Mechanism to eliminate short periods (see the
    sketch below):
  • Record the number of cycles C an L2 miss has been
    in flight
  • If C is greater than a threshold T for an L2
    miss, disable entry into runahead mode due to
    that miss
  • T can be determined statically (at design time)
    or dynamically
  • T = 400 works well for a minimum main memory
    latency of 500 cycles
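A one-function sketch of this check. The threshold follows the slide's
T = 400 for a 500-cycle minimum memory latency; the function name is
illustrative.

#include <stdbool.h>

#define RA_THRESHOLD_T 400  /* for a 500-cycle minimum main memory latency */

/* cycles_in_flight: how long this L2 miss has been outstanding.
 * Entry is allowed only if enough of the miss latency remains to
 * make the runahead period worthwhile. */
bool short_period_filter_allows_entry(unsigned cycles_in_flight)
{
    return cycles_in_flight <= RA_THRESHOLD_T;
}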

36
Overlapping Runahead Periods
  • Two runahead periods that execute the same
    instructions
  • Second period is inefficient

[Timeline: Load 2 is INV in the first period; a second period entered
on Load 2's miss re-executes the same instructions (OVERLAP).]
37
Overlapping Runahead Periods (cont.)
  • Overlapping periods are not necessarily useless
  • The availability of a new data value can result
    in the generation of useful L2 misses
  • But this does not happen often enough
  • Mechanism to eliminate overlapping periods (see
    the sketch below):
  • Keep track of the number of pseudo-retired
    instructions R during a runahead period
  • Keep track of the number of fetched instructions
    N since the exit from the last runahead period
  • If N < R, do not enter runahead mode
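The N-versus-R check as a minimal sketch; the parameter names mirror
the slide, the rest is illustrative.

#include <stdbool.h>

/* R: instructions pseudo-retired during the last runahead period.
 * N: instructions fetched since exiting that period.
 * If N < R, a new period would mostly re-execute the same
 * instructions, so entry is suppressed. */
bool overlap_filter_allows_entry(unsigned n_fetched_since_exit,
                                 unsigned r_pseudo_retired_last)
{
    return n_fetched_since_exit >= r_pseudo_retired_last;
}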

38
Useless Runahead Periods
  • Periods that do not result in prefetches for
    normal mode
  • They exist due to the lack of memory-level
    parallelism
  • Mechanism to eliminate useless periods:
  • Predict whether a period will generate useful L2
    misses
  • Estimate a period to be useful if it generated an
    L2 miss that cannot be captured by the
    instruction window
  • Useless period predictors are trained based on
    this estimation

[Timeline: a runahead period that generates no new L2 misses, i.e., a
useless period.]
39
Predicting Useless Runahead Periods
  • Prediction based on the past usefulness of
    runahead periods caused by the same static load
    instruction (see the sketch below)
  • A 2-bit state machine records the past usefulness
    of a load
  • Prediction based on too many INV loads
  • If the fraction of INV loads in a runahead period
    is greater than T, exit runahead mode
  • Sampling (phase) based prediction
  • If the last N runahead periods generated fewer than
    T L2 misses, do not enter runahead for the next M
    runahead opportunities
  • Compile-time profile-based prediction
  • If runahead modes caused by a load were not
    useful in the profiling run, mark it as a
    non-runahead load
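A hedged sketch of the first scheme above (the per-load 2-bit
usefulness counter). The table size, indexing, and training hook are
assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

#define UP_ENTRIES 256  /* illustrative table size, not from the talk */

/* One 2-bit saturating counter per static load, indexed by load PC. */
static uint8_t useful_ctr[UP_ENTRIES];

static unsigned up_index(uint64_t load_pc) { return (load_pc >> 2) % UP_ENTRIES; }

/* Train at runahead exit: the period is estimated useful if it
 * generated an L2 miss the instruction window could not capture. */
void train_useless_period_predictor(uint64_t load_pc, bool period_was_useful)
{
    unsigned i = up_index(load_pc);
    if (period_was_useful) { if (useful_ctr[i] < 3) useful_ctr[i]++; }
    else                   { if (useful_ctr[i] > 0) useful_ctr[i]--; }
}

/* Consult before entering runahead on a miss caused by this load. */
bool predict_period_useful(uint64_t load_pc)
{
    return useful_ctr[up_index(load_pc)] >= 2;
}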

40
Performance Optimizations for Efficiency
  • Both efficiency AND performance can be increased
    by increasing the
    usefulness of runahead periods
  • Three major optimizations:
  • Turning off the Floating Point Unit (FPU) in
    runahead mode
  • FP instructions do not contribute to the
    generation of load addresses
  • Optimizing the update policy of the hardware
    prefetcher (HWP) in runahead mode
  • Improves the positive interaction between
    runahead and HWP
  • Early wake-up of INV instructions
  • Enables the faster removal of INV instructions

41
Overall Impact on Executed Instructions
[Chart: baseline runahead executes 26.5% extra instructions; the
efficiency techniques reduce this to 6.2%]
42
Overall Impact on IPC
[Chart: IPC improvement is 22.6% for baseline runahead vs. 22.1% with
the efficiency techniques]
43
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

44
The Problem: Dependent Cache Misses
  • Runahead execution cannot parallelize dependent
    misses
  • a wasted opportunity to improve performance
  • wasted energy (useless pre-execution)
  • Runahead performance would improve by 25% if this
    limitation were ideally overcome

[Timeline: Load 2 is dependent on Load 1; during runahead Load 2's
address cannot be computed, it becomes INV, and Miss 2 is serviced
only after Miss 1 returns.]
45
The Goal of AVD Prediction
  • Enable the parallelization of dependent L2 cache
    misses in runahead mode with a low-cost mechanism
  • How?
  • Predict the values of L2-miss address (pointer)
    loads
  • An address load loads an address into its
    destination register, which is later used to
    calculate the address of another load
  • as opposed to a data load

46
Parallelizing Dependent Cache Misses

Without value prediction:
  [Timeline: Compute → Load 1 Miss (Miss 1) → Runahead: Load 2 INV,
  cannot compute its address → Compute → Load 1 Hit → Load 2 Miss
  (Miss 2 serviced serially)]

With value prediction:
  [Timeline: Compute → Load 1 Miss (Miss 1) → Runahead: Load 1's value
  predicted, so Load 2 can compute its address and Miss 2 overlaps
  Miss 1 → Compute → Load 1 Hit → Load 2 Hit; saved cycles and saved
  speculative instructions]
47
AVD Prediction [MICRO'05]
  • Address-value delta (AVD) of a load instruction,
    defined as
  • AVD = Effective Address of Load - Data
    Value of Load
  • For some address loads, the AVD is stable
  • An AVD predictor keeps track of the AVDs of
    address loads
  • When a load is an L2 miss in runahead mode, the
    AVD predictor is consulted
  • If the predictor returns a stable (confident) AVD
    for that load, the value of the load is predicted
    (see the sketch below)
  • Predicted Value = Effective Address -
    Predicted AVD
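The two defining equations as a minimal C sketch (unsigned wraparound
makes the subtractions well defined; function names are illustrative):

#include <stdint.h>

/* AVD = Effective Address of Load - Data Value of Load */
int64_t compute_avd(uint64_t effective_addr, uint64_t data_value)
{
    return (int64_t)(effective_addr - data_value);
}

/* Predicted Value = Effective Address - Predicted AVD, used in
 * runahead mode when the predictor is confident for this load. */
uint64_t predict_load_value(uint64_t effective_addr, int64_t predicted_avd)
{
    return effective_addr - (uint64_t)predicted_avd;
}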

48
Why Do Stable AVDs Occur?
  • Regularity in the way data structures are
  • allocated in memory AND
  • traversed
  • Two types of loads can have stable AVDs
  • Traversal address loads
  • Produce addresses consumed by address loads
  • Leaf address loads
  • Produce addresses consumed by data loads

49
Traversal Address Loads
Regularly-allocated linked list; a traversal address load loads the
pointer to the next node: node = node->next

Nodes allocated at addresses A, A+k, A+2k, A+3k, ...

AVD = Effective Addr - Data Value

Effective Addr    Data Value    AVD
A                 A+k           -k
A+k               A+2k          -k
A+2k              A+3k          -k

Stable AVD (striding data value); a small demo follows.
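A small, self-contained C demo of the table above (an illustration,
not code from the talk): nodes carved from a contiguous pool stand in
for a regular allocator. The printed AVD is the same negative constant
on every traversal step; it differs from the slide's -k only by the
fixed offset of the next field within the node.

#include <stdio.h>

struct node { int payload; struct node *next; };

int main(void)
{
    struct node pool[4];  /* nodes sizeof(struct node) bytes apart */
    for (int i = 0; i < 3; i++) pool[i].next = &pool[i + 1];
    pool[3].next = 0;

    for (struct node *n = pool; n->next; n = n->next) {
        /* The traversal load "n = n->next" has
         * effective address &n->next and data value n->next. */
        long avd = (char *)&n->next - (char *)n->next;
        printf("AVD = %ld\n", avd);  /* same value on every iteration */
    }
    return 0;
}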
50
Leaf Address Loads
Sorted dictionary in parser: nodes point to strings (words); each
string and its node are allocated consecutively (string at A, node at
A+k). The dictionary is looked up for an input word. A leaf address
load loads the pointer to the string of each node:

lookup (node, input) {
    // ...
    ptr_str = node->string;
    m = check_match(ptr_str, input);
    // ...
}

AVD = Effective Addr - Data Value

Effective Addr    Data Value    AVD
A+k               A             k
C+k               C             k
F+k               F             k

Stable AVD, but no stride in the data values; a small demo follows.
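A matching C demo for the leaf case (again an illustration under an
assumed layout): each string is placed immediately before its node, so
the leaf load's AVD is a positive constant even though the data values
themselves do not stride.

#include <stdio.h>
#include <string.h>

struct dict_node { char *string; };

/* String and node allocated consecutively: string at A, node at A+k. */
struct pair { char word[16]; struct dict_node node; };

int main(void)
{
    struct pair p[3];
    const char *words[3] = { "alpha", "bravo", "charlie" };
    for (int i = 0; i < 3; i++) {
        strcpy(p[i].word, words[i]);
        p[i].node.string = p[i].word;  /* node points to its own string */
    }
    for (int i = 0; i < 3; i++) {
        /* The leaf load "ptr_str = node->string" has effective address
         * &node->string and data value node->string. */
        long avd = (char *)&p[i].node.string - p[i].node.string;
        printf("AVD = %ld\n", avd);    /* constant k; no stride needed */
    }
    return 0;
}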
51
Performance of AVD Prediction
[Chart: data labels 14.3% and 15.5%, relative to the baseline runahead
processor]
52
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

53
Summary of Contributions
  • Runahead execution provides the latency-tolerance
    benefit of a large instruction window by
    parallelizing independent cache misses
  • With a very modest increase in hardware cost and
    complexity
  • 128-entry window + Runahead ≈ 384-entry window
  • Efficient runahead execution techniques improve
    the energy efficiency of base runahead
    execution
  • Only 6% extra instructions executed for 22%
    performance benefit
  • Address-Value Delta (AVD) prediction enables the
    parallelization of dependent cache misses
  • By exploiting regular memory allocation patterns
  • A 16-entry (102-byte) AVD predictor improves the
    performance of runahead execution by 14% on
    pointer-intensive applications

54
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

55
Future Work in Runahead Execution
  • Compilation/programming techniques for runahead
    processors
  • Keeping runahead execution on the correct program
    path
  • Parallelizing dependent cache misses in linked
    data structure traversals
  • Runahead co-processors/accelerators
  • Evaluation of runahead execution on multithreaded
    and multiprocessor systems

56
Research Summary
  • Runahead execution
  • Original runahead proposal [HPCA'03, IEEE Micro
    Top Picks'03]
  • Efficient runahead execution [ISCA'05, IEEE Micro
    Top Picks'06]
  • AVD prediction [MICRO'05]
  • Result reuse in runahead execution [Comp. Arch.
    Letters'05]
  • High-performance memory system designs
  • Pollution-aware caching [IJPP'05]
  • Parallelism-aware caching [ISCA'06]
  • Performance analysis of speculative memory
    references [IEEE Trans. on Computers'05]
  • Latency/bandwidth tradeoffs in memory controllers
    [Patent'04]
  • Branch instruction handling techniques through
    compiler-microarchitecture cooperation
  • Wish branches [MICRO'05, IEEE Micro Top
    Picks'06]
  • Wrong-path events [MICRO'04]
  • Compiler-assisted dynamic predication (in
    progress)
  • Efficient compile-time profiling techniques for
    detecting input-dependent program behavior
  • 2D profiling [CGO'06]
  • Fault-tolerant microarchitecture design
  • Microarchitecture-based introspection [DSN'05]

57
Thank you.
58
Backup Slides
59
Thesis Statement
  • Efficient runahead execution is a cost- and
    complexity-effective microarchitectural technique
    that can tolerate long main memory latencies
    without requiring:
  • unreasonably large, slow, complex, and
    power-hungry hardware structures, or
  • significant increases in processor complexity and
    power consumption.

60
Impact of L2 Cache Misses
[Chart: impact of L2 misses. 500-cycle DRAM latency, aggressive
stream-based prefetcher. Data averaged over 147 memory-intensive
benchmarks on a high-end x86 processor model.]
61
Entry into Runahead Mode
  • When an L2-miss load instruction is the oldest in
    the instruction window:
  • The processor checkpoints architectural register
    state.
  • The processor records the address of the L2-miss
    load.
  • The L2-miss load marks its destination register as
    INV (invalid)
    and is removed from the instruction window.

62
Exit from Runahead Mode
[Timeline: Compute → Load 1 Miss (Miss 1) → Runahead → Compute;
Load 1 is re-fetched and re-executed after the exit.]
  • When the runahead-causing L2 miss is serviced:
  • All instructions in the machine are flushed.
  • INV bits are reset. The runahead cache is flushed.
  • The processor restores the architectural state as
    it was before the runahead-causing instruction was
    fetched.
  • Architecturally, NOTHING happened.
  • But, hopefully, useful prefetch requests were
    generated (caches warmed up).
  • The mode is switched back to normal mode:
  • Instructions executed in runahead mode are
    re-executed in normal mode.

63
When to Enter Runahead Mode
  • Why not at the time an L2 miss happens?
  • Not guaranteed to be a valid correct-path
    instruction until it becomes the oldest.
  • Limited potential (an L2-miss instruction becomes
    the oldest instruction only 10 cycles later, on
    average)
  • Need to checkpoint state at the oldest instruction
    (throw away all instructions older than the L2
    miss?)
  • Why not when the window becomes full?
  • Delays the removal of instructions from the window,
    which can result in slow progress in runahead
    mode.
  • No significant gain (the window becomes full 98%
    of the time after we see an L2 miss)
  • Why not on L1 cache misses?
  • >50% of L1 cache misses hit in the L2 cache →
    many short runahead periods

64
When to Exit Runahead Mode
  • Why not exit early to fill the pipeline and the
    window?
  • How do we determine how early? Memory does not
    have a fixed latency
  • This reduces the progress made in runahead mode
  • Exiting early using oracle information has lower
    performance than exiting when miss returns
  • Why not exit late to make further progress in
    runahead?
  • Not necessarily beneficial to stay in runahead
    longer
  • On average, this policy hurts performance
  • But, more intelligent runahead period extension
    schemes improve performance

65
Modifications to Pipeline
[Pipeline diagram, < 0.05% area overhead: a conventional x86 pipeline
(trace cache, fetch unit, decoder, uop queue, renamer, frontend RAT;
INT/FP/mem queues and schedulers, INT/FP register files, address
generation and execution units, L1 data cache, store buffer, reorder
buffer, backend RAT, prefetcher, L2 cache, front side bus access
queues) augmented for runahead with CHECKPOINTED architectural state,
INV bits in the register files and store buffer, and the RUNAHEAD
CACHE next to the L2 access queue.]
66
Effect of a Better Front-end
67
Why is Runahead Better with a Better Front-end?
  • A better front-end provides more correct-path
    instructions (hence, more and more accurate L2
    misses) in runahead periods
  • Average number of instructions during runahead:
    711
  • before a mispredicted INV branch: 431
  • with perfect TC/BP this average increases to 909
  • Average number of L2 misses during runahead: 2.6
  • before a mispredicted INV branch: 2.38
  • with perfect TC/BP this average increases to 3.18
  • If all INV branches were resolved correctly
    during runahead,
  • the performance gain would be 25% instead of 22%

68
Importance of Store-Load Communication
69
In-order vs. Out-of-order
70
Sensitivity to L2 Cache Size
71
Instruction vs. Data Prefetching Benefits
72
Why Does Runahead Work?
  • 70% of instructions are VALID in runahead mode
  • These values show periodic behavior
  • Runahead prefetching reduces L1, L2, and TC misses
    during normal mode
  • Data Miss Reduction
  • 18% decrease in normal-mode L1 misses (base:
    13.7 misses/1K uops)
  • 33% of normal-mode L2 data misses are fully or
    partially covered by runahead prefetching (base
    L2 data miss rate: 4.3/1K uops)
  • 15% of normal-mode L2 data misses are fully
    covered (these misses are never seen in normal
    mode)
  • Instruction Miss Reduction
  • 3% decrease in normal-mode TC misses
  • 14% decrease in normal-mode L2 fetch misses (some
    of these are only partially covered by runahead
    requests)
  • Overall Increase in Data Misses
  • L2 misses increase by 5% (due to contention
    and useless prefetches)

73
Correlation Between L2 Miss Reduction and Speedup
74
Runahead on a More Aggressive Processor
75
Runahead on Future Model
76
Future Model with Perfect Front-end
77
Baseline Alpha Processor
  • Execution-driven Alpha simulator
  • 8-wide superscalar processor
  • 128-entry instruction window, 20-stage pipeline
  • 64 KB, 4-way, 2-cycle L1 data and instruction
    caches
  • 1 MB, 32-way, 10-cycle unified L2 cache
  • 500-cycle minimum main memory latency
  • Aggressive stream-based prefetcher
  • 32 DRAM banks, 32-byte wide processor-memory bus
    (4:1 frequency ratio), 128 outstanding misses
  • Detailed memory model

78
Runahead vs. Large Windows (Alpha)
79
In-order vs. Out-of-order Execution (Alpha)
80
Comparison to 1024-entry Window
81
Runahead vs. HWP (Alpha)
82
Effect of Memory Latency (Alpha)
83
1K and 2K Memory Latency (Alpha)
84
Efficient Runahead
85
Methods for Efficient Runahead Execution
  • Eliminating inefficient runahead periods
  • Increasing the usefulness of runahead periods
  • Reuse of runahead results
  • Value prediction of L2-miss load instructions
  • Optimizing the exit policy from runahead mode

86
Impact on Efficiency
[Chart: data labels 26.5, 26.5, 26.5, 26.5, 22.6, 20.1, 15.3, 14.9,
11.8, 6.7 (percent)]
87
Extra Instructions with Efficient Runahead
88
Performance Increase with Efficient Runahead
89
Cache Sizes (Executed Instructions)
90
Cache Sizes (IPC Delta)
91
Turning Off the FPU in Runahead Mode
  • FP instructions do not contribute to the
    generation of load addresses
  • FP instructions can be dropped after decode
  • Spares processor resources for more useful
    instructions
  • Increases performance by enabling faster progress
  • Enables dynamic/static energy savings
  • Results in an unresolvable branch misprediction
    if a mispredicted branch depends on an FP
    operation (rare)
  • Overall increases IPC and reduces executed
    instructions

92
HWP Update Policy in Runahead Mode
  • Aggressive hardware prefetching in runahead mode
    may hurt performance if the prefetcher accuracy
    is low
  • Runahead requests are more accurate than
    prefetcher requests
  • Three policies:
  • Do not update the prefetcher state
  • Update the prefetcher state just like in normal
    mode
  • Only train existing streams, but do not create
    new streams
  • Runahead mode improves the timeliness of the
    prefetcher in many benchmarks
  • Only training the existing streams is the best
    policy

93
Early INV Wake-up
  • Keep track of the INV status of an instruction in
    the scheduler.
  • The scheduler wakes up the instruction if any
    source is INV.
  • (+) Enables faster progress during runahead mode by
    removing useless INV instructions faster.
  • (-) Increases the number of executed instructions.
  • (-) Increases the complexity of the scheduling
    logic.
  • Not worth implementing due to the small IPC gain

94
Short Runahead Periods
95
RCST Counter
96
Sampling for Useless Periods
97
Efficiency Techniques Extra Instructions
98
Efficiency Techniques IPC Increase
99
Effect of Memory Latency on Efficient Runahead
100
Usefulness of Runahead Periods
101
L2 Misses per Useful Runahead Period
102
Other Considerations for Efficient Runahead
103
Performance Potential of Result Reuse
  • Ideal reuse study:
  • To determine the upper bound on the performance
    gain possible by reusing results of runahead
    instructions
  • Valid pseudo-retired runahead instructions
    magically update architectural state during
    normal mode
  • They do not consume any resources (ROB or buffer
    entries)
  • Only invalid pseudo-retired runahead instructions
    are re-executed
  • They are fed into the renamer (fetch/decode
    pipeline is skipped)

104
Ideal Reuse of All Valid Runahead Results
105
Alpha Reuse IPCs
106
Number of Reused Instructions
107
Why Does Reuse Not Work?
108
IPC Increase with Reuse
109
Extra Instructions with Reuse
110
Runahead Period Statistics
111
Mem Latency and BP Accuracy in Reuse
112
Runahead VP Extra Instructions
113
Runahead VP IPC Increase
114
Late Exit from Runahead (Extra Inst.)
115
Late Exit from Runahead (IPC Increase)
116
AVD Prediction and Optimizations
117
Identifying Address Loads in Hardware
  • Observation: if the AVD is too large, the value
    being loaded is likely NOT an address
  • Only predict loads that have satisfied
  • -MaxAVD < AVD < MaxAVD
  • This identification mechanism eliminates almost
    all data loads from consideration
  • Enables the AVD predictor to be small

118
An Implementable AVD Predictor
  • Set-associative prediction table
  • A prediction table entry consists of:
  • Tag (PC of the load)
  • Last AVD seen for the load
  • Confidence counter for the recorded AVD
  • Updated when an address load is retired in normal
    mode
  • Accessed when a load misses in the L2 cache in
    runahead mode
  • Recovery-free: no need to recover the state of
    the processor or the predictor on a misprediction
  • Runahead mode is purely speculative (see the
    sketch below)
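A C sketch of the predictor just described, folding in the MaxAVD
filter from the previous slide. It is simplified to a direct-mapped
table (the talk specifies a set-associative one); the sizes, the
confidence threshold, and the reset-on-mismatch training policy are
illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define AVD_ENTRIES 16
#define MAX_AVD     (64 * 1024)  /* loads outside (-MaxAVD, MaxAVD) are
                                    treated as data loads and ignored */
#define CONF_MAX    3
#define CONF_TAKEN  2            /* predict only when confidence >= 2 */

typedef struct {
    bool     valid;
    uint64_t tag;       /* PC of the load */
    int64_t  last_avd;  /* last AVD seen for the load */
    uint8_t  conf;      /* confidence counter for the recorded AVD */
} avd_entry_t;

static avd_entry_t avd_table[AVD_ENTRIES];

static avd_entry_t *avd_lookup(uint64_t pc)
{
    return &avd_table[(pc >> 2) % AVD_ENTRIES];
}

/* Called when an address load retires in normal mode. */
void avd_update(uint64_t pc, uint64_t eff_addr, uint64_t value)
{
    int64_t avd = (int64_t)(eff_addr - value);
    if (avd <= -MAX_AVD || avd >= MAX_AVD)
        return;  /* likely a data load: keep it out of the table */
    avd_entry_t *e = avd_lookup(pc);
    if (e->valid && e->tag == pc && e->last_avd == avd) {
        if (e->conf < CONF_MAX) e->conf++;  /* AVD is stable */
    } else {
        e->valid = true; e->tag = pc; e->last_avd = avd; e->conf = 0;
    }
}

/* Called when a load misses in the L2 cache in runahead mode.
 * Recovery-free: a wrong value only mis-steers speculation. */
bool avd_predict(uint64_t pc, uint64_t eff_addr, uint64_t *pred_value)
{
    avd_entry_t *e = avd_lookup(pc);
    if (!e->valid || e->tag != pc || e->conf < CONF_TAKEN)
        return false;
    *pred_value = eff_addr - (uint64_t)e->last_avd;
    return true;
}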

119
AVD Update Logic
120
AVD Prediction Logic
121
Properties of Traversal-based AVDs
  • Stable AVDs can be captured with a stride value
    predictor
  • Stable AVDs disappear with the re-organization of
    the data structure (e.g., sorting)
  • Stability of AVDs is dependent on the behavior of
    the memory allocator
  • Allocation of contiguous, fixed-size chunks is
    useful

[Figure: after sorting, the distance between nodes is NOT constant]
122
Properties of Leaf-based AVDs
  • Stable AVDs cannot be captured with a stride
    value predictor
  • Stable AVDs do not disappear with the
    re-organization of the data structure (e.g.,
    sorting)
  • Stability of AVDs is dependent on the behavior of
    the memory allocator

[Figure: after sorting, the distance between a node and its string is
still constant]
123
AVD Prediction vs. Stride Value Prediction
  • Performance
  • Both can capture traversal address loads with
    stable AVDs
  • e.g., treeadd
  • Stride VP cannot capture leaf address loads with
    stable AVDs
  • e.g., health, mst, parser
  • AVD predictor cannot capture data loads with
    striding data values
  • Predicting these can be useful for the correct
    resolution of mispredicted L2-miss dependent
    branches, e.g., parser
  • Complexity
  • AVD predictor requires far fewer entries (only
    address loads)
  • AVD prediction logic is simpler (no stride
    maintenance)

124
AVD vs. Stride VP Performance
[Chart: AVD predictor with 16 entries vs. stride value predictor with
4096 entries; data labels 2.5%, 4.5%, 12.1%, 12.6%, 13.4%, 16%]
125
AVD vs. Stride VP Performance
[Chart: AVD predictor with 16 entries vs. stride value predictor with
4096 entries; data labels 2.7%, 4.7%, 5.1%, 5.5%, 6.5%, 8.6%]
126
AVD vs. Stream Prefetching Performance
[Chart: data labels 12.1%, 12.1%, 13.4%, 16.5%, 20.1%, 22.5%]
127
AVD vs. Stream Pref. (L2 bandwidth)
[Chart: data labels 35.3%, 32.8%, 26%, 24.5%, 5.1%, 5.1%]
128
AVD vs. Stream Pref. (Mem bandwidth)
[Chart: data labels 19.5%, 16.4%, 14.9%, 12.1%, 3.2%, 3.2%]
129
Source Code Optimization for AVD Prediction
130
Effect of Code Optimization (parser)
[Chart: data labels 6.4%, 10.5%]
131
Accuracy/Coverage w/ Code Optimization
132
AVD and Efficiency Techniques
133
AVD and Efficiency Techniques
134
Effect of AVD on Runahead Periods
135
AVD Example from treeadd
136
AVD Example from parser
137
AVD Example from health
138
Motivation for NULL-value Optimization
139
Effect of NULL-value Optimization
140
Related Work
141
Methodology
142
Future Research Directions