Title: On the Importance of Optimizing the Configuration of Stream Prefetchers
1. On the Importance of Optimizing the Configuration of Stream Prefetchers
- Ilya Ganusov
- Martin Burtscher
Computer Systems Laboratory, Cornell University
2. Introduction
- Memory wall
  - Increasing gap between processor and memory speeds
  - Concentration on bandwidth at the expense of latency
- Prefetch important data
  - Do not wait until the processor requests data
  - Proactively fetch the data that is likely to be consumed in the near future
3. Stream Prefetching
- Prefetching with outcome-based prediction
  - Use the history of previous misses to guess data addresses that are likely to miss soon
- Stream prefetching
  - A special case of outcome-based prediction
  - Proposed 15 years ago
  - The only hardware prefetching scheme used in modern microprocessors
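The outcome-based idea above can be sketched in a few lines: scan the recent miss-address history for a repeating stride. This is an illustrative toy, not the talk's implementation; the function name and the `min_matches` threshold are assumptions.

```python
# Minimal sketch of outcome-based stream detection (illustrative only;
# names and thresholds are assumptions, not taken from the talk).

def detect_stream(miss_history, min_matches=2):
    """Return the stride if the most recent misses form a constant-stride
    stream, else None."""
    if len(miss_history) < min_matches + 1:
        return None
    # Deltas between consecutive miss addresses
    strides = [b - a for a, b in zip(miss_history, miss_history[1:])]
    recent = strides[-min_matches:]
    if recent[0] != 0 and all(s == recent[0] for s in recent):
        return recent[0]
    return None

# Misses at 0x100, 0x140, 0x180 suggest a 0x40-byte stride:
print(detect_stream([0x100, 0x140, 0x180]))  # -> 64
```

Once a stride is detected, addresses likely to miss soon are simply the last miss address plus successive multiples of that stride.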
4. Contributions
- Detailed sensitivity analysis of the main prefetcher parameters on SPECcpu2000 programs
  - No such study exists in the literature
  - Many research papers fail to specify prefetcher parameters in comparative studies
- Case study
  - Evaluate the performance of runahead execution on baselines with different stream prefetcher parameters
5. Outline
- Introduction
- Stream Prefetcher Operation
- Evaluation Methodology
- Experimental Results
- Conclusion
6. How Stream Prefetchers Work
[Diagram: cache miss addresses enter a global miss history buffer; a stream table of (valid, stream address, stride) entries is consulted ("Stream exists?"); on a match, the AGU combines the stream address, stride, and lookahead to form the prefetch address.]
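The datapath on this slide can be modeled with a short sketch: a global miss history, a stream table of (address, stride) entries, and an AGU that forms prefetch addr = stream addr + stride × lookahead. All class and field names here are assumptions for illustration, not the simulator's.

```python
# Hypothetical model of the slide's datapath (illustrative only).
from collections import deque

class StreamPrefetcher:
    def __init__(self, history_len=16, num_streams=8, lookahead=4):
        self.history = deque(maxlen=history_len)  # global miss history
        self.table = []                           # stream table: {addr, stride}
        self.num_streams = num_streams
        self.lookahead = lookahead                # prefetch distance

    def on_miss(self, addr):
        # "Stream exists?" -- does this miss continue a tracked stream?
        for s in self.table:
            if addr == s["addr"] + s["stride"]:
                s["addr"] = addr
                # AGU: prefetch `lookahead` strides ahead of the stream
                return s["addr"] + s["stride"] * self.lookahead
        # Otherwise train: two equal consecutive strides allocate a stream
        for prev in self.history:
            stride = addr - prev
            if stride != 0 and (prev - stride) in self.history:
                if len(self.table) >= self.num_streams:
                    self.table.pop(0)             # evict the oldest stream
                self.table.append({"addr": addr, "stride": stride})
                break
        self.history.append(addr)
        return None                               # no prefetch issued

pf = StreamPrefetcher(lookahead=4)
for a in (0x1000, 0x1040, 0x1080, 0x10C0):
    hint = pf.on_miss(a)
print(hex(hint))  # -> 0x11c0
```

The fourth miss confirms the 0x40-byte stream, so the AGU issues a prefetch four strides ahead of 0x10C0.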
7. Measured Parameters
[Diagram: the same datapath as the previous slide, annotated with the three studied parameters: the miss history length, the number of supported streams (stream table entries), and the prefetch distance (AGU lookahead).]
8. Evaluation Methodology
- Benchmarks
  - 22 SPECcpu2000 programs, highly optimized
  - All F77, C, and C++ programs
  - Multiple reference inputs per program
  - SimPoint interval of 500 million instructions
- Simulated architecture
  - SimpleScalar v4.0 cycle-accurate simulator
  - Aggressive superscalar, Alpha 21264-like core
9. Simulated System

Execution Core
  Fetch/issue/commit      4/4/4
  I-window/ROB/LSQ        64/128/64
  LdSt/Int/FP units       2/4/2
  Execution latencies     similar to Alpha 21264
  Branch predictor        16K-entry bimodal/gshare hybrid

Memory Subsystem
  Cache sizes             64KB IL1, 64KB DL1, 1MB L2
  Cache associativity     2-way L1, 4-way L2
  Cache latencies         2 cyc L1, 20 cyc L2
  Main memory latency     400 cycles
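The 400-cycle main memory latency hints at why the prefetch distance matters so much in the results that follow. As a back-of-the-envelope check (an assumption of steady-state streaming, not a formula from the talk): a prefetch issued D blocks ahead hides the latency only if D × cycles-per-block ≥ memory latency.

```python
import math

# Back-of-the-envelope timeliness check (assumption: a steady-state loop
# consuming one cache block every `cycles_per_block` cycles).
def min_prefetch_distance(mem_latency_cycles, cycles_per_block):
    return math.ceil(mem_latency_cycles / cycles_per_block)

# With the simulated 400-cycle main memory and a loop touching one new
# block every 50 cycles, the stream must run at least 8 blocks ahead:
print(min_prefetch_distance(400, 50))  # -> 8
```

A distance that is too small leaves latency exposed; one that is too large risks polluting the cache with blocks fetched long before use.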
10. Outline
- Introduction
- Motivation
- Implementation
- Experimental Results
- Conclusion
11. Miss History Length
- 7 programs are very sensitive
- A 16-entry history is enough
12. Number of Stream Table Entries
- Only 3 programs are sensitive
- More than 8 streams provide little benefit
13. L2 Cache Prefetch Distance
- 11 programs are very sensitive
- FP speedup varies from 80% to 140%
14. Case Study: Runahead Execution
- The performance of stream prefetching is highly dependent on the parameter choice
- Another proposal: runahead execution
  - Pseudo-retire long-latency loads that stall the pipeline and continue executing
  - Roll back to the checkpoint after the load comes back from memory
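The two bullets above can be sketched on a toy load trace: when a load misses with a long latency, checkpoint, pseudo-retire it, speculatively walk ahead issuing prefetches for later loads whose addresses do not depend on the missing value, then roll back and replay. The trace representation and `window` size below are invented for illustration.

```python
# Toy sketch of a runahead episode (illustrative; not the simulator).
def runahead_prefetches(loads, misses, window=8):
    """loads: list of (addr, depends_on_miss) pairs; misses: set of
    addresses that incur a long-latency miss. Returns the prefetch
    addresses one runahead episode would generate."""
    prefetches = []
    for i, (addr, _) in enumerate(loads):
        if addr in misses:
            # Checkpoint here; the miss is pseudo-retired and the core
            # keeps executing the next `window` instructions.
            for future_addr, depends in loads[i + 1 : i + 1 + window]:
                if not depends:          # address computable without the miss
                    prefetches.append(future_addr)
            break                        # roll back to the checkpoint, replay
    return prefetches

trace = [(0x100, False), (0x200, True), (0x300, False)]
print(runahead_prefetches(trace, misses={0x100}))  # -> [768]
```

The load at 0x200 depends on the missing data, so only 0x300 can be prefetched; its warmed-up cache line is what makes the replay after rollback faster.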
15. Speedup over Stream Prefetching
- SPEC fp speedup drops by more than 2x
16. Conclusion
- Key observations
  - The performance of the stream prefetcher is highly dependent on its configuration
  - Varying the prefetch distance alone almost doubles the average performance benefit
  - Choosing a non-optimal stream prefetcher as a baseline can distort results by a factor of two
- Conclusion
  - Parameter optimizations are imperative when comparing stream prefetchers to other prefetching techniques
17. On the Importance of Optimizing the Configuration of Stream Prefetchers
- Ilya Ganusov
- Martin Burtscher
Computer Systems Laboratory, Cornell University