1
Efficient Runahead Execution Processors:
A Power-Efficient Processing Paradigm
for Tolerating Long Main Memory Latencies
  • Onur Mutlu
  • PhD Defense
  • 4/28/2006

2
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

3
Motivation
  • Memory latency is very long in today's processors
  • CDC 6600: 10 cycles [Thornton, 1970]
  • Alpha 21264: 120 cycles [Wilkes, 2001]
  • Intel Pentium 4: 300 cycles [Sprangle & Carmean,
    2002]
  • And it continues to increase (in terms of
    processor cycles)
  • DRAM latency is not decreasing as fast as
    processor cycle time
  • Conventional techniques to tolerate memory
    latency do not work well enough.

4
Conventional Latency Tolerance Techniques
  • Caching [initially by Wilkes, 1965]
  • Widely used, simple, effective, but inefficient,
    passive
  • Not all applications/phases exhibit temporal or
    spatial locality
  • Prefetching [initially in IBM 360/91, 1967]
  • Works well for regular memory access patterns
  • Prefetching irregular access patterns is
    difficult, inaccurate, and hardware-intensive
  • Multithreading [initially in CDC 6600, 1964]
  • Works well if there are multiple threads
  • Improving single-thread performance using
    multithreading hardware is an ongoing research
    effort
  • Out-of-order execution [initially by Tomasulo,
    1967]
  • Tolerates cache misses that cannot be prefetched
  • Requires extensive hardware resources for
    tolerating long latencies

5
Out-of-order Execution
  • Instructions are executed out of sequential
    program order to tolerate latency.
  • Instructions are retired in program order to
    support precise exceptions/interrupts.
  • Not-yet-retired instructions and their results
    are buffered in hardware structures, called the
    instruction window.
  • The size of the instruction window determines how
    much latency the processor can tolerate.

6
Small Windows: Full-window Stalls
  • When a long-latency instruction is not complete,
    it blocks retirement.
  • Incoming instructions fill the instruction
    window.
  • Once the window is full, the processor cannot place
    new instructions into the window.
  • This is called a full-window stall.
  • A full-window stall prevents the processor from
    making progress in the execution of the program.

7
Small Windows: Full-window Stalls
8-entry instruction window (oldest first):

  LOAD R1 ← mem[R5]      L2 Miss! Takes 100s of cycles.
  BEQ  R1, R0, target
  ADD  R2 ← R2, 8
  LOAD R3 ← mem[R2]      Independent of the L2 miss; executed out of
  MUL  R4 ← R4, R3       program order, but cannot be retired.
  ADD  R4 ← R4, R5
  STOR mem[R2] ← R4
  ADD  R2 ← R2, 64
  ----------------       Younger instructions (e.g., the next
  LOAD R3 ← mem[R2]      LOAD R3 ← mem[R2]) cannot be executed because
                         there is no space in the instruction window.

The processor stalls until the L2 miss is serviced.
  • L2 cache misses are responsible for most
    full-window stalls.

8
Impact of L2 Cache Misses
[Chart: impact of L2 misses. 512 KB L2 cache, 500-cycle DRAM latency,
aggressive stream-based prefetcher. Data averaged over 147
memory-intensive benchmarks on a high-end x86 processor model.]
9
Impact of L2 Cache Misses
[Chart: impact of L2 misses. 500-cycle DRAM latency, aggressive
stream-based prefetcher. Data averaged over 147 memory-intensive
benchmarks on a high-end x86 processor model.]
10
The Problem
  • Out-of-order execution requires large instruction
    windows to tolerate today's main memory
    latencies.
  • As main memory latency increases, instruction
    window size should also increase to fully
    tolerate the memory latency.
  • Building a large instruction window is a
    challenging task if we would like to
    achieve:
  • Low power/energy consumption
  • Short cycle time
  • Low design and verification complexity

11
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

12
Overview of Runahead Execution [HPCA'03]
  • A technique to obtain the memory-level
    parallelism benefits of a large instruction
    window (without having to build it!)
  • When the oldest instruction is an L2 miss:
  • Checkpoint architectural state and enter runahead
    mode
  • In runahead mode:
  • Instructions are speculatively pre-executed
  • The purpose of pre-execution is to discover other
    L2 misses
  • The processor does not stall due to L2 misses
  • Runahead mode ends when the original L2 miss
    returns
  • Checkpoint is restored and normal execution
    resumes

13
Runahead Example

Perfect Caches:
  Compute → Load 1 Hit → Compute → Load 2 Hit → Compute

Small Window:
  Compute → Load 1 Miss (stall until Miss 1 returns) → Compute →
  Load 2 Miss (stall until Miss 2 returns) → Compute

Runahead:
  Compute → Load 1 Miss → Runahead (Load 2 Miss issued, so Miss 1 and
  Miss 2 overlap) → Compute → Load 1 Hit → Load 2 Hit → Compute
  → Saved Cycles
14
Benefits of Runahead Execution
  • Instead of stalling during an L2 cache miss:
  • Pre-executed loads and stores independent of
    L2-miss instructions generate very accurate data
    prefetches
  • For both regular and irregular access patterns
  • Instructions on the predicted program path are
    prefetched into the instruction/trace cache and
    L2.
  • Hardware prefetcher and branch predictor tables
    are trained using future access information.

15
Runahead Execution Mechanism
  • Entry into runahead mode
  • Checkpoint architectural register state
  • Instruction processing in runahead mode
  • Exit from runahead mode
  • Restore architectural register state from
    checkpoint (the overall flow is sketched below)
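A minimal C sketch of this entry/processing/exit flow. Every function
here is a hypothetical stand-in for a hardware action (stubbed so the
sketch compiles); none of it is an API from the talk.

#include <stdbool.h>

typedef enum { NORMAL_MODE, RUNAHEAD_MODE } cpu_mode_t;

typedef struct {
    cpu_mode_t    mode;
    unsigned long runahead_cause;  /* address of the triggering L2 miss */
} cpu_state_t;

/* Hypothetical hardware hooks, stubbed out. */
static bool oldest_is_l2_miss(cpu_state_t *c)              { (void)c; return false; }
static unsigned long oldest_miss_addr(cpu_state_t *c)      { (void)c; return 0; }
static bool miss_serviced(cpu_state_t *c, unsigned long a) { (void)c; (void)a; return false; }
static void checkpoint_arch_state(cpu_state_t *c)          { (void)c; }
static void restore_arch_state(cpu_state_t *c)             { (void)c; }
static void flush_and_reset_runahead_state(cpu_state_t *c) { (void)c; }

/* Conceptually evaluated every cycle. */
void runahead_control(cpu_state_t *cpu)
{
    if (cpu->mode == NORMAL_MODE && oldest_is_l2_miss(cpu)) {
        checkpoint_arch_state(cpu);            /* entry: checkpoint registers */
        cpu->runahead_cause = oldest_miss_addr(cpu);
        cpu->mode = RUNAHEAD_MODE;             /* pre-execute speculatively */
    } else if (cpu->mode == RUNAHEAD_MODE &&
               miss_serviced(cpu, cpu->runahead_cause)) {
        flush_and_reset_runahead_state(cpu);   /* flush pipeline, reset INV
                                                  bits, flush runahead cache */
        restore_arch_state(cpu);               /* exit: restore checkpoint and
                                                  resume normal execution */
        cpu->mode = NORMAL_MODE;
    }
}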

16
Instruction Processing in Runahead Mode
[Timeline: Compute → Load 1 Miss → Runahead until Miss 1 returns]
  • Runahead mode processing is the same as
    normal instruction processing, EXCEPT:
  • It is purely speculative: architectural
    (software-visible) register/memory state is NOT
    updated in runahead mode.
  • L2-miss dependent instructions are identified and
    treated specially.
  • They are quickly removed from the instruction
    window.
  • Their results are not trusted.

17
L2-Miss Dependent Instructions
  • Two types of results are produced: INV and VALID
  • INV: dependent on an L2 miss
  • INV results are marked using INV bits in the
    register file and store buffer.
  • INV values are not used for prefetching/branch
    resolution.

18
Removal of Instructions from Window
  • The oldest instruction is examined for
    pseudo-retirement (see the sketch below)
  • An INV instruction is removed from the window
    immediately.
  • A VALID instruction is removed when it completes
    execution.
  • Pseudo-retired instructions free their allocated
    resources.
  • This allows the processing of later
    instructions.
  • Pseudo-retired stores communicate their data to
    dependent loads.
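A tiny C sketch of this pseudo-retirement rule; the struct fields are
assumed names for illustration, not the talk's hardware design.

#include <stdbool.h>

typedef struct {
    bool inv;        /* marked INV: result depends on an L2 miss */
    bool completed;  /* has finished executing */
} ra_inst_t;

/* May the oldest instruction pseudo-retire (freeing its resources)? */
bool can_pseudo_retire(const ra_inst_t *oldest)
{
    if (oldest->inv)
        return true;           /* INV: removed immediately, result untrusted */
    return oldest->completed;  /* VALID: removed once execution completes */
}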

19
Store/Load Handling in Runahead Mode
  • A pseudo-retired store writes its data and INV
    status to a dedicated memory, called the runahead
    cache.
  • Purpose: data communication through memory in
    runahead mode.
  • A dependent load reads its data from the runahead
    cache.
  • Does not always need to be correct → the runahead
    cache can be very small (sketched below).
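A minimal sketch of a runahead cache, assumed here to be a tiny
direct-mapped structure sized to the 512 bytes quoted on the
hardware-cost slide; the 8-byte entry size and the indexing scheme are
illustrative choices, not the talk's exact design.

#include <stdbool.h>
#include <stdint.h>

#define RA_LINES 64   /* 64 x 8-byte entries = 512 bytes of data */

typedef struct {
    bool     valid;
    bool     inv;     /* the store's data was L2-miss dependent */
    uint64_t tag;
    uint64_t data;
} ra_line_t;

static ra_line_t ra_cache[RA_LINES];

/* A pseudo-retired store writes its data and INV status. */
void ra_store(uint64_t addr, uint64_t data, bool inv)
{
    ra_line_t *l = &ra_cache[(addr >> 3) % RA_LINES];
    l->valid = true;          /* conflicting entries are simply overwritten: */
    l->inv   = inv;           /* runahead results do not have to be correct  */
    l->tag   = addr >> 3;
    l->data  = data;
}

/* A dependent load reads its data; returns false on a miss. */
bool ra_load(uint64_t addr, uint64_t *data, bool *inv)
{
    ra_line_t *l = &ra_cache[(addr >> 3) % RA_LINES];
    if (!l->valid || l->tag != (addr >> 3))
        return false;
    *data = l->data;
    *inv  = l->inv;
    return true;
}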

20
Branch Handling in Runahead Mode
  • INV branches cannot be resolved.
  • A mispredicted INV branch causes the processor
    to stay on the wrong program path until the end
    of runahead execution.
  • VALID branches are resolved and initiate recovery
    if mispredicted.

21
Hardware Cost of Runahead Execution
  • Checkpoint of the architectural register state
  • Already exists in current processors
  • INV bits per register and store buffer entry
  • Runahead cache (512 bytes)
  • < 0.05% area overhead

22
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

23
Baseline Processor
  • 3-wide fetch, 29-stage pipeline x86 processor
  • 128-entry instruction window
  • 512 KB, 8-way, 16-cycle unified L2 cache
  • Approximately 500-cycle L2 miss latency
  • Bandwidth, contention, conflicts modeled in
    detail
  • Aggressive streaming data prefetcher (16 streams)
  • Next-two-lines instruction prefetcher

24
Evaluated Benchmarks
  • 147 Intel x86 benchmarks simulated for 30 million
    instructions
  • Benchmark suites:
  • SPEC CPU 95 (S95): mostly scientific FP
    applications
  • SPEC FP 2000 (FP00)
  • SPEC INT 2000 (INT00)
  • Internet (WEB): SPECjbb, Webmark2001
  • Multimedia (MM): MPEG, speech recognition, games
  • Productivity (PROD): PowerPoint, Excel, Photoshop
  • Server (SERV): transaction processing, e-commerce
  • Workstation (WS): engineering/CAD applications

25
Performance of Runahead Execution
26
Runahead Execution vs. Large Windows
27
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

28
Limitations of the Baseline Runahead Mechanism
  • Energy Inefficiency
  • A large number of instructions are speculatively
    executed
  • Efficient Runahead Execution [ISCA'05, IEEE Micro
    Top Picks'06]
  • Ineffectiveness for pointer-intensive
    applications
  • Runahead cannot parallelize dependent L2 cache
    misses
  • Address-Value Delta (AVD) Prediction [MICRO'05]
  • Irresolvable branch mispredictions in runahead
    mode
  • Cannot recover from a mispredicted L2-miss
    dependent branch
  • Wrong Path Events [MICRO'04]

29
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

30
The Efficiency Problem [ISCA'05]
  • A runahead processor pre-executes some
    instructions speculatively
  • Each pre-executed instruction consumes energy
  • Runahead execution significantly increases the
    number of executed instructions, sometimes
    without providing performance improvement

31
The Efficiency Problem
[Chart: runahead increases IPC by 22% but increases the number of
executed instructions by 27%]
32
Efficiency of Runahead Execution
  • Goals:
  • Reduce the number of executed instructions
    without reducing the IPC improvement
  • Increase the IPC improvement
    without increasing the number of
    executed instructions

33
Causes of Inefficiency
  • Short runahead periods
  • Overlapping runahead periods
  • Useless runahead periods

34
Short Runahead Periods
  • The processor can initiate runahead mode due to an
    already in-flight L2 miss generated by
  • the prefetcher, wrong-path execution, or a previous
    runahead period
  • Short periods
  • are less likely to generate useful L2 misses
  • have high overhead due to the flush penalty at
    runahead exit

[Timeline: Load 2's miss, generated during Load 1's runahead period,
is still in flight at exit; entering runahead on it yields only a
short period before Miss 2 returns.]
35
Eliminating Short Runahead Periods
  • Mechanism to eliminate short periods (see the
    sketch below):
  • Record the number of cycles C an L2 miss has been
    in flight
  • If C is greater than a threshold T for an L2
    miss, disable entry into runahead mode due to
    that miss
  • T can be determined statically (at design time)
    or dynamically
  • T = 400 works well for a minimum main memory
    latency of 500 cycles
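A one-function sketch of this check. The threshold follows the slide's
T = 400 for a 500-cycle minimum memory latency; the function name is
illustrative.

#include <stdbool.h>

#define RA_THRESHOLD_T 400  /* for a 500-cycle minimum main memory latency */

/* cycles_in_flight: how long this L2 miss has been outstanding.
 * Entry is allowed only if enough of the miss latency remains to
 * make the runahead period worthwhile. */
bool short_period_filter_allows_entry(unsigned cycles_in_flight)
{
    return cycles_in_flight <= RA_THRESHOLD_T;
}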

36
Overlapping Runahead Periods
  • Two runahead periods that execute the same
    instructions
  • Second period is inefficient

[Timeline: Load 2 is INV in the first period; a second period entered
on Load 2's miss re-executes the same instructions (OVERLAP).]
37
Overlapping Runahead Periods (cont.)
  • Overlapping periods are not necessarily useless
  • The availability of a new data value can result
    in the generation of useful L2 misses
  • But this does not happen often enough
  • Mechanism to eliminate overlapping periods (see
    the sketch below):
  • Keep track of the number of pseudo-retired
    instructions R during a runahead period
  • Keep track of the number of fetched instructions
    N since the exit from the last runahead period
  • If N < R, do not enter runahead mode
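The N-versus-R check as a minimal sketch; the parameter names mirror
the slide, the rest is illustrative.

#include <stdbool.h>

/* R: instructions pseudo-retired during the last runahead period.
 * N: instructions fetched since exiting that period.
 * If N < R, a new period would mostly re-execute the same
 * instructions, so entry is suppressed. */
bool overlap_filter_allows_entry(unsigned n_fetched_since_exit,
                                 unsigned r_pseudo_retired_last)
{
    return n_fetched_since_exit >= r_pseudo_retired_last;
}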

38
Useless Runahead Periods
  • Periods that do not result in prefetches for
    normal mode
  • They exist due to the lack of memory-level
    parallelism
  • Mechanism to eliminate useless periods:
  • Predict whether a period will generate useful L2
    misses
  • Estimate a period to be useful if it generated an
    L2 miss that cannot be captured by the
    instruction window
  • Useless period predictors are trained based on
    this estimation

[Timeline: a runahead period that generates no new L2 misses, i.e., a
useless period.]
39
Predicting Useless Runahead Periods
  • Prediction based on the past usefulness of
    runahead periods caused by the same static load
    instruction (see the sketch below)
  • A 2-bit state machine records the past usefulness
    of a load
  • Prediction based on too many INV loads
  • If the fraction of INV loads in a runahead period
    is greater than T, exit runahead mode
  • Sampling (phase) based prediction
  • If the last N runahead periods generated fewer than
    T L2 misses, do not enter runahead for the next M
    runahead opportunities
  • Compile-time profile-based prediction
  • If runahead modes caused by a load were not
    useful in the profiling run, mark it as a
    non-runahead load
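A hedged sketch of the first scheme above (the per-load 2-bit
usefulness counter). The table size, indexing, and training hook are
assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

#define UP_ENTRIES 256  /* illustrative table size, not from the talk */

/* One 2-bit saturating counter per static load, indexed by load PC. */
static uint8_t useful_ctr[UP_ENTRIES];

static unsigned up_index(uint64_t load_pc) { return (load_pc >> 2) % UP_ENTRIES; }

/* Train at runahead exit: the period is estimated useful if it
 * generated an L2 miss the instruction window could not capture. */
void train_useless_period_predictor(uint64_t load_pc, bool period_was_useful)
{
    unsigned i = up_index(load_pc);
    if (period_was_useful) { if (useful_ctr[i] < 3) useful_ctr[i]++; }
    else                   { if (useful_ctr[i] > 0) useful_ctr[i]--; }
}

/* Consult before entering runahead on a miss caused by this load. */
bool predict_period_useful(uint64_t load_pc)
{
    return useful_ctr[up_index(load_pc)] >= 2;
}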

40
Performance Optimizations for Efficiency
  • Both efficiency AND performance can be increased
    by increasing the
    usefulness of runahead periods
  • Three major optimizations:
  • Turning off the Floating Point Unit (FPU) in
    runahead mode
  • FP instructions do not contribute to the
    generation of load addresses
  • Optimizing the update policy of the hardware
    prefetcher (HWP) in runahead mode
  • Improves the positive interaction between
    runahead and HWP
  • Early wake-up of INV instructions
  • Enables the faster removal of INV instructions

41
Overall Impact on Executed Instructions
[Chart: baseline runahead executes 26.5% extra instructions; the
efficiency techniques reduce this to 6.2%]
42
Overall Impact on IPC
[Chart: IPC improvement is 22.6% for baseline runahead vs. 22.1% with
the efficiency techniques]
43
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

44
The Problem: Dependent Cache Misses
  • Runahead execution cannot parallelize dependent
    misses
  • a wasted opportunity to improve performance
  • wasted energy (useless pre-execution)
  • Runahead performance would improve by 25% if this
    limitation were ideally overcome

[Timeline: Load 2 is dependent on Load 1; during runahead Load 2's
address cannot be computed, it becomes INV, and Miss 2 is serviced
only after Miss 1 returns.]
45
The Goal of AVD Prediction
  • Enable the parallelization of dependent L2 cache
    misses in runahead mode with a low-cost mechanism
  • How?
  • Predict the values of L2-miss address (pointer)
    loads
  • An address load loads an address into its
    destination register, which is later used to
    calculate the address of another load
  • as opposed to a data load

46
Parallelizing Dependent Cache Misses

Without value prediction:
  [Timeline: Compute → Load 1 Miss (Miss 1) → Runahead: Load 2 INV,
  cannot compute its address → Compute → Load 1 Hit → Load 2 Miss
  (Miss 2 serviced serially)]

With value prediction:
  [Timeline: Compute → Load 1 Miss (Miss 1) → Runahead: Load 1's value
  predicted, so Load 2 can compute its address and Miss 2 overlaps
  Miss 1 → Compute → Load 1 Hit → Load 2 Hit; saved cycles and saved
  speculative instructions]
47
AVD Prediction [MICRO'05]
  • Address-value delta (AVD) of a load instruction,
    defined as
  • AVD = Effective Address of Load - Data
    Value of Load
  • For some address loads, the AVD is stable
  • An AVD predictor keeps track of the AVDs of
    address loads
  • When a load is an L2 miss in runahead mode, the
    AVD predictor is consulted
  • If the predictor returns a stable (confident) AVD
    for that load, the value of the load is predicted
    (see the sketch below)
  • Predicted Value = Effective Address -
    Predicted AVD
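The two defining equations as a minimal C sketch (unsigned wraparound
makes the subtractions well defined; function names are illustrative):

#include <stdint.h>

/* AVD = Effective Address of Load - Data Value of Load */
int64_t compute_avd(uint64_t effective_addr, uint64_t data_value)
{
    return (int64_t)(effective_addr - data_value);
}

/* Predicted Value = Effective Address - Predicted AVD, used in
 * runahead mode when the predictor is confident for this load. */
uint64_t predict_load_value(uint64_t effective_addr, int64_t predicted_avd)
{
    return effective_addr - (uint64_t)predicted_avd;
}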

48
Why Do Stable AVDs Occur?
  • Regularity in the way data structures are
  • allocated in memory AND
  • traversed
  • Two types of loads can have stable AVDs
  • Traversal address loads
  • Produce addresses consumed by address loads
  • Leaf address loads
  • Produce addresses consumed by data loads

49
Traversal Address Loads
Regularly-allocated linked list; a traversal address load loads the
pointer to the next node: node = node->next

Nodes allocated at addresses A, A+k, A+2k, A+3k, ...

AVD = Effective Addr - Data Value

Effective Addr    Data Value    AVD
A                 A+k           -k
A+k               A+2k          -k
A+2k              A+3k          -k

Stable AVD (striding data value); a small demo follows.
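A small, self-contained C demo of the table above (an illustration,
not code from the talk): nodes carved from a contiguous pool stand in
for a regular allocator. The printed AVD is the same negative constant
on every traversal step; it differs from the slide's -k only by the
fixed offset of the next field within the node.

#include <stdio.h>

struct node { int payload; struct node *next; };

int main(void)
{
    struct node pool[4];  /* nodes sizeof(struct node) bytes apart */
    for (int i = 0; i < 3; i++) pool[i].next = &pool[i + 1];
    pool[3].next = 0;

    for (struct node *n = pool; n->next; n = n->next) {
        /* The traversal load "n = n->next" has
         * effective address &n->next and data value n->next. */
        long avd = (char *)&n->next - (char *)n->next;
        printf("AVD = %ld\n", avd);  /* same value on every iteration */
    }
    return 0;
}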
50
Leaf Address Loads
Sorted dictionary in parser: nodes point to strings (words); each
string and its node are allocated consecutively (string at A, node at
A+k). The dictionary is looked up for an input word. A leaf address
load loads the pointer to the string of each node:

lookup (node, input) {
    // ...
    ptr_str = node->string;
    m = check_match(ptr_str, input);
    // ...
}

AVD = Effective Addr - Data Value

Effective Addr    Data Value    AVD
A+k               A             k
C+k               C             k
F+k               F             k

Stable AVD, but no stride in the data values; a small demo follows.
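A matching C demo for the leaf case (again an illustration under an
assumed layout): each string is placed immediately before its node, so
the leaf load's AVD is a positive constant even though the data values
themselves do not stride.

#include <stdio.h>
#include <string.h>

struct dict_node { char *string; };

/* String and node allocated consecutively: string at A, node at A+k. */
struct pair { char word[16]; struct dict_node node; };

int main(void)
{
    struct pair p[3];
    const char *words[3] = { "alpha", "bravo", "charlie" };
    for (int i = 0; i < 3; i++) {
        strcpy(p[i].word, words[i]);
        p[i].node.string = p[i].word;  /* node points to its own string */
    }
    for (int i = 0; i < 3; i++) {
        /* The leaf load "ptr_str = node->string" has effective address
         * &node->string and data value node->string. */
        long avd = (char *)&p[i].node.string - p[i].node.string;
        printf("AVD = %ld\n", avd);    /* constant k; no stride needed */
    }
    return 0;
}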
51
Performance of AVD Prediction
[Chart: data labels 14.3% and 15.5%, relative to the baseline runahead
processor]
52
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

53
Summary of Contributions
  • Runahead execution provides the latency-tolerance
    benefit of a large instruction window by
    parallelizing independent cache misses
  • With a very modest increase in hardware cost and
    complexity
  • 128-entry window + Runahead ≈ 384-entry window
  • Efficient runahead execution techniques improve
    the energy efficiency of base runahead
    execution
  • Only 6% extra instructions executed for 22%
    performance benefit
  • Address-Value Delta (AVD) prediction enables the
    parallelization of dependent cache misses
  • By exploiting regular memory allocation patterns
  • A 16-entry (102-byte) AVD predictor improves the
    performance of runahead execution by 14% on
    pointer-intensive applications

54
Talk Outline
  • Motivation: The Memory Latency Problem
  • Runahead Execution
  • Evaluation
  • Limitations of the Baseline Runahead Mechanism
  • Efficient Runahead Execution
  • Address-Value Delta (AVD) Prediction
  • Summary of Contributions
  • Future Work

55
Future Work in Runahead Execution
  • Compilation/programming techniques for runahead
    processors
  • Keeping runahead execution on the correct program
    path
  • Parallelizing dependent cache misses in linked
    data structure traversals
  • Runahead co-processors/accelerators
  • Evaluation of runahead execution on multithreaded
    and multiprocessor systems

56
Research Summary
  • Runahead execution
  • Original runahead proposal [HPCA'03, IEEE Micro
    Top Picks'03]
  • Efficient runahead execution [ISCA'05, IEEE Micro
    Top Picks'06]
  • AVD prediction [MICRO'05]
  • Result reuse in runahead execution [Comp. Arch.
    Letters'05]
  • High-performance memory system designs
  • Pollution-aware caching [IJPP'05]
  • Parallelism-aware caching [ISCA'06]
  • Performance analysis of speculative memory
    references [IEEE Trans. on Computers'05]
  • Latency/bandwidth tradeoffs in memory controllers
    [Patent'04]
  • Branch instruction handling techniques through
    compiler-microarchitecture cooperation
  • Wish branches [MICRO'05, IEEE Micro Top
    Picks'06]
  • Wrong-path events [MICRO'04]
  • Compiler-assisted dynamic predication (in
    progress)
  • Efficient compile-time profiling techniques for
    detecting input-dependent program behavior
  • 2D profiling [CGO'06]
  • Fault-tolerant microarchitecture design
  • Microarchitecture-based introspection [DSN'05]

57
Thank you.
58
Backup Slides
59
Thesis Statement
  • Efficient runahead execution is a cost- and
    complexity-effective microarchitectural technique
    that can tolerate long main memory latencies
    without requiring:
  • unreasonably large, slow, complex, and
    power-hungry hardware structures, or
  • significant increases in processor complexity and
    power consumption.

60
Impact of L2 Cache Misses
[Chart: impact of L2 misses. 500-cycle DRAM latency, aggressive
stream-based prefetcher. Data averaged over 147 memory-intensive
benchmarks on a high-end x86 processor model.]
61
Entry into Runahead Mode
  • When an L2-miss load instruction is the oldest in
    the instruction window:
  • The processor checkpoints architectural register
    state.
  • The processor records the address of the L2-miss
    load.
  • The L2-miss load marks its destination register as
    INV (invalid)
    and is removed from the instruction window.

62
Exit from Runahead Mode
[Timeline: Compute → Load 1 Miss (Miss 1) → Runahead → Compute;
Load 1 is re-fetched and re-executed after the exit.]
  • When the runahead-causing L2 miss is serviced:
  • All instructions in the machine are flushed.
  • INV bits are reset. The runahead cache is flushed.
  • The processor restores the architectural state as
    it was before the runahead-causing instruction was
    fetched.
  • Architecturally, NOTHING happened.
  • But, hopefully, useful prefetch requests were
    generated (caches warmed up).
  • The mode is switched back to normal mode:
  • Instructions executed in runahead mode are
    re-executed in normal mode.

63
When to Enter Runahead Mode
  • Why not at the time an L2 miss happens?
  • Not guaranteed to be a valid correct-path
    instruction until it becomes the oldest.
  • Limited potential (an L2-miss instruction becomes
    the oldest instruction only 10 cycles later, on
    average)
  • Need to checkpoint state at the oldest instruction
    (throw away all instructions older than the L2
    miss?)
  • Why not when the window becomes full?
  • Delays the removal of instructions from the window,
    which can result in slow progress in runahead
    mode.
  • No significant gain (the window becomes full 98%
    of the time after we see an L2 miss)
  • Why not on L1 cache misses?
  • >50% of L1 cache misses hit in the L2 cache →
    many short runahead periods

64
When to Exit Runahead Mode
  • Why not exit early to fill the pipeline and the
    window?
  • How do we determine how early? Memory does not
    have a fixed latency
  • This reduces the progress made in runahead mode
  • Exiting early using oracle information has lower
    performance than exiting when miss returns
  • Why not exit late to make further progress in
    runahead?
  • Not necessarily beneficial to stay in runahead
    longer
  • On average, this policy hurts performance
  • But, more intelligent runahead period extension
    schemes improve performance

65
Modifications to Pipeline
[Pipeline diagram, < 0.05% area overhead: a conventional x86 pipeline
(trace cache, fetch unit, decoder, uop queue, renamer, frontend RAT;
INT/FP/mem queues and schedulers, INT/FP register files, address
generation and execution units, L1 data cache, store buffer, reorder
buffer, backend RAT, prefetcher, L2 cache, front side bus access
queues) augmented for runahead with CHECKPOINTED architectural state,
INV bits in the register files and store buffer, and the RUNAHEAD
CACHE next to the L2 access queue.]
66
Effect of a Better Front-end
67
Why is Runahead Better with a Better Front-end?
  • A better front-end provides more correct-path
    instructions (hence, more and more accurate L2
    misses) in runahead periods
  • Average number of instructions during runahead:
    711
  • before a mispredicted INV branch: 431
  • with perfect TC/BP this average increases to 909
  • Average number of L2 misses during runahead: 2.6
  • before a mispredicted INV branch: 2.38
  • with perfect TC/BP this average increases to 3.18
  • If all INV branches were resolved correctly
    during runahead,
  • the performance gain would be 25% instead of 22%

68
Importance of Store-Load Communication
69
In-order vs. Out-of-order
70
Sensitivity to L2 Cache Size
71
Instruction vs. Data Prefetching Benefits
72
Why Does Runahead Work?
  • 70% of instructions are VALID in runahead mode
  • These values show periodic behavior
  • Runahead prefetching reduces L1, L2, and TC misses
    during normal mode
  • Data Miss Reduction
  • 18% decrease in normal-mode L1 misses (base:
    13.7 misses/1K uops)
  • 33% of normal-mode L2 data misses are fully or
    partially covered by runahead prefetching (base
    L2 data miss rate: 4.3/1K uops)
  • 15% of normal-mode L2 data misses are fully
    covered (these misses are never seen in normal
    mode)
  • Instruction Miss Reduction
  • 3% decrease in normal-mode TC misses
  • 14% decrease in normal-mode L2 fetch misses (some
    of these are only partially covered by runahead
    requests)
  • Overall Increase in Data Misses
  • L2 misses increase by 5% (due to contention
    and useless prefetches)

73
Correlation Between L2 Miss Reduction and Speedup
74
Runahead on a More Aggressive Processor
75
Runahead on Future Model
76
Future Model with Perfect Front-end
77
Baseline Alpha Processor
  • Execution-driven Alpha simulator
  • 8-wide superscalar processor
  • 128-entry instruction window, 20-stage pipeline
  • 64 KB, 4-way, 2-cycle L1 data and instruction
    caches
  • 1 MB, 32-way, 10-cycle unified L2 cache
  • 500-cycle minimum main memory latency
  • Aggressive stream-based prefetcher
  • 32 DRAM banks, 32-byte wide processor-memory bus
    (4:1 frequency ratio), 128 outstanding misses
  • Detailed memory model

78
Runahead vs. Large Windows (Alpha)
79
In-order vs. Out-of-order Execution (Alpha)
80
Comparison to 1024-entry Window
81
Runahead vs. HWP (Alpha)
82
Effect of Memory Latency (Alpha)
83
1K and 2K Memory Latency (Alpha)
84
Efficient Runahead
85
Methods for Efficient Runahead Execution
  • Eliminating inefficient runahead periods
  • Increasing the usefulness of runahead periods
  • Reuse of runahead results
  • Value prediction of L2-miss load instructions
  • Optimizing the exit policy from runahead mode

86
Impact on Efficiency
[Chart: data labels 26.5, 26.5, 26.5, 26.5, 22.6, 20.1, 15.3, 14.9,
11.8, 6.7 (percent)]
87
Extra Instructions with Efficient Runahead
88
Performance Increase with Efficient Runahead
89
Cache Sizes (Executed Instructions)
90
Cache Sizes (IPC Delta)
91
Turning Off the FPU in Runahead Mode
  • FP instructions do not contribute to the
    generation of load addresses
  • FP instructions can be dropped after decode
  • Spares processor resources for more useful
    instructions
  • Increases performance by enabling faster progress
  • Enables dynamic/static energy savings
  • Results in an unresolvable branch misprediction
    if a mispredicted branch depends on an FP
    operation (rare)
  • Overall increases IPC and reduces executed
    instructions

92
HWP Update Policy in Runahead Mode
  • Aggressive hardware prefetching in runahead mode
    may hurt performance if the prefetcher accuracy
    is low
  • Runahead requests are more accurate than
    prefetcher requests
  • Three policies:
  • Do not update the prefetcher state
  • Update the prefetcher state just like in normal
    mode
  • Only train existing streams, but do not create
    new streams
  • Runahead mode improves the timeliness of the
    prefetcher in many benchmarks
  • Only training the existing streams is the best
    policy

93
Early INV Wake-up
  • Keep track of the INV status of an instruction in
    the scheduler.
  • The scheduler wakes up the instruction if any
    source is INV.
  • (+) Enables faster progress during runahead mode by
    removing useless INV instructions faster.
  • (-) Increases the number of executed instructions.
  • (-) Increases the complexity of the scheduling
    logic.
  • Not worth implementing due to the small IPC gain

94
Short Runahead Periods
95
RCST Counter
96
Sampling for Useless Periods
97
Efficiency Techniques Extra Instructions
98
Efficiency Techniques IPC Increase
99
Effect of Memory Latency on Efficient Runahead
100
Usefulness of Runahead Periods
101
L2 Misses per Useful Runahead Period
102
Other Considerations for Efficient Runahead
103
Performance Potential of Result Reuse
  • Ideal reuse study:
  • To determine the upper bound on the performance
    gain possible by reusing results of runahead
    instructions
  • Valid pseudo-retired runahead instructions
    magically update architectural state during
    normal mode
  • They do not consume any resources (ROB or buffer
    entries)
  • Only invalid pseudo-retired runahead instructions
    are re-executed
  • They are fed into the renamer (fetch/decode
    pipeline is skipped)

104
Ideal Reuse of All Valid Runahead Results
105
Alpha Reuse IPCs
106
Number of Reused Instructions
107
Why Does Reuse Not Work?
108
IPC Increase with Reuse
109
Extra Instructions with Reuse
110
Runahead Period Statistics
111
Mem Latency and BP Accuracy in Reuse
112
Runahead VP Extra Instructions
113
Runahead VP IPC Increase
114
Late Exit from Runahead (Extra Inst.)
115
Late Exit from Runahead (IPC Increase)
116
AVD Prediction and Optimizations
117
Identifying Address Loads in Hardware
  • Observation: if the AVD is too large, the value
    being loaded is likely NOT an address
  • Only predict loads that have satisfied
  • -MaxAVD < AVD < MaxAVD
  • This identification mechanism eliminates almost
    all data loads from consideration
  • Enables the AVD predictor to be small

118
An Implementable AVD Predictor
  • Set-associative prediction table
  • A prediction table entry consists of:
  • Tag (PC of the load)
  • Last AVD seen for the load
  • Confidence counter for the recorded AVD
  • Updated when an address load is retired in normal
    mode
  • Accessed when a load misses in the L2 cache in
    runahead mode
  • Recovery-free: no need to recover the state of
    the processor or the predictor on a misprediction
  • Runahead mode is purely speculative (see the
    sketch below)
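A C sketch of the predictor just described, folding in the MaxAVD
filter from the previous slide. It is simplified to a direct-mapped
table (the talk specifies a set-associative one); the sizes, the
confidence threshold, and the reset-on-mismatch training policy are
illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define AVD_ENTRIES 16
#define MAX_AVD     (64 * 1024)  /* loads outside (-MaxAVD, MaxAVD) are
                                    treated as data loads and ignored */
#define CONF_MAX    3
#define CONF_TAKEN  2            /* predict only when confidence >= 2 */

typedef struct {
    bool     valid;
    uint64_t tag;       /* PC of the load */
    int64_t  last_avd;  /* last AVD seen for the load */
    uint8_t  conf;      /* confidence counter for the recorded AVD */
} avd_entry_t;

static avd_entry_t avd_table[AVD_ENTRIES];

static avd_entry_t *avd_lookup(uint64_t pc)
{
    return &avd_table[(pc >> 2) % AVD_ENTRIES];
}

/* Called when an address load retires in normal mode. */
void avd_update(uint64_t pc, uint64_t eff_addr, uint64_t value)
{
    int64_t avd = (int64_t)(eff_addr - value);
    if (avd <= -MAX_AVD || avd >= MAX_AVD)
        return;  /* likely a data load: keep it out of the table */
    avd_entry_t *e = avd_lookup(pc);
    if (e->valid && e->tag == pc && e->last_avd == avd) {
        if (e->conf < CONF_MAX) e->conf++;  /* AVD is stable */
    } else {
        e->valid = true; e->tag = pc; e->last_avd = avd; e->conf = 0;
    }
}

/* Called when a load misses in the L2 cache in runahead mode.
 * Recovery-free: a wrong value only mis-steers speculation. */
bool avd_predict(uint64_t pc, uint64_t eff_addr, uint64_t *pred_value)
{
    avd_entry_t *e = avd_lookup(pc);
    if (!e->valid || e->tag != pc || e->conf < CONF_TAKEN)
        return false;
    *pred_value = eff_addr - (uint64_t)e->last_avd;
    return true;
}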

119
AVD Update Logic
120
AVD Prediction Logic
121
Properties of Traversal-based AVDs
  • Stable AVDs can be captured with a stride value
    predictor
  • Stable AVDs disappear with the re-organization of
    the data structure (e.g., sorting)
  • Stability of AVDs is dependent on the behavior of
    the memory allocator
  • Allocation of contiguous, fixed-size chunks is
    useful

[Figure: after sorting, the distance between nodes is NOT constant]
122
Properties of Leaf-based AVDs
  • Stable AVDs cannot be captured with a stride
    value predictor
  • Stable AVDs do not disappear with the
    re-organization of the data structure (e.g.,
    sorting)
  • Stability of AVDs is dependent on the behavior of
    the memory allocator

[Figure: after sorting, the distance between a node and its string is
still constant]
123
AVD Prediction vs. Stride Value Prediction
  • Performance
  • Both can capture traversal address loads with
    stable AVDs
  • e.g., treeadd
  • Stride VP cannot capture leaf address loads with
    stable AVDs
  • e.g., health, mst, parser
  • AVD predictor cannot capture data loads with
    striding data values
  • Predicting these can be useful for the correct
    resolution of mispredicted L2-miss dependent
    branches, e.g., parser
  • Complexity
  • AVD predictor requires far fewer entries (only
    address loads)
  • AVD prediction logic is simpler (no stride
    maintenance)

124
AVD vs. Stride VP Performance
[Chart: AVD predictor with 16 entries vs. stride value predictor with
4096 entries; data labels 2.5%, 4.5%, 12.1%, 12.6%, 13.4%, 16%]
125
AVD vs. Stride VP Performance
[Chart: AVD predictor with 16 entries vs. stride value predictor with
4096 entries; data labels 2.7%, 4.7%, 5.1%, 5.5%, 6.5%, 8.6%]
126
AVD vs. Stream Prefetching Performance
[Chart: data labels 12.1%, 12.1%, 13.4%, 16.5%, 20.1%, 22.5%]
127
AVD vs. Stream Pref. (L2 bandwidth)
[Chart: data labels 35.3%, 32.8%, 26%, 24.5%, 5.1%, 5.1%]
128
AVD vs. Stream Pref. (Mem bandwidth)
[Chart: data labels 19.5%, 16.4%, 14.9%, 12.1%, 3.2%, 3.2%]
129
Source Code Optimization for AVD Prediction
130
Effect of Code Optimization (parser)
[Chart: data labels 6.4%, 10.5%]
131
Accuracy/Coverage w/ Code Optimization
132
AVD and Efficiency Techniques
133
AVD and Efficiency Techniques
134
Effect of AVD on Runahead Periods
135
AVD Example from treeadd
136
AVD Example from parser
137
AVD Example from health
138
Motivation for NULL-value Optimization
139
Effect of NULL-value Optimization
140
Related Work
141
Methodology
142
Future Research Directions