Title: Superscalar Processor Performance Enhancement Through Reliable Dynamic Clock Frequency Tuning
1 Superscalar Processor Performance Enhancement Through Reliable Dynamic Clock Frequency Tuning
- Viswanathan Subramanian, Mikel Bezdek, Naga D. Avirneni and Arun K. Somani
- Dependable Computing Networking Laboratory (DCNL)
- Iowa State University
- 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Edinburgh, UK, June 27th, 2007
2 Outline
- Introduction
- SPRIT3E framework
- Dynamic frequency scaling
- Minimizing short path constraints
- Error Sampling and dynamic clock adjustment
- Simulation Results
3 Worst case design for synchronous circuits
- Clock period is limited by the maximum delay from A to B
- This delay depends on
  - Properties of the physical implementation of the circuit
  - Properties of the environment
    - Temperature and supply voltage
- To avoid errors, worst case delays are assumed
- Result: Overly conservative clock period
- Pipelined processor
  - Longest/slowest stage limits the period of the entire pipeline
4 Related work
- Proposed solutions
  - Deeper pipelines, or superpipelining
    - Cons: Increased branch misprediction penalty
    - Some stages hard to divide
  - Asynchronous designs
    - Cons: Unfamiliar design methodology
    - Lack of tool support
  - Better than Worst Case designs
    - Reliable overclocking
5 Related work
- RAZOR vs. SPRIT3E
- RAZOR uses temporal fault tolerance
  - Achieves lower energy consumption
  - Supply voltage scaled
  - Clock frequency unchanged during run time
- SPRIT3E uses temporal fault tolerance
  - Allows faster execution for non worst case data
  - Clock frequency scaled
  - Operating frequency adjusted dynamically during run time
- SPRIT3E reliably overclocks critical pipeline registers to improve performance of superscalar processors
6 Superscalar PeRformance Improvement through tolerating Timing Errors (SPRIT3E)
- Runs a superscalar pipeline at speeds faster than the worst case limit
- Local Fault Detection and Recovery
- Global Recovery
- Dynamic Clock Frequency Tuning
7 Local Fault Detection and Recovery (LFDR)
- Main register clocked ambitiously
- Backup register always reliable
- PS Clock
  - Phase shifted version of Main Clock
- Local recovery initiated on error detection
- Metastability conditions
  - Detected and recovered from
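The main/backup register pair can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the backup register is taken to sample `phase_shift_ns` after the main register, a late path is modeled as the main register holding its previous value (real hardware may latch an arbitrary or metastable value), and the names `lfdr_cycle` and `LfdrResult` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LfdrResult:
    error_detected: bool
    value_forwarded: int


def lfdr_cycle(old_value, new_value, path_delay_ns, main_period_ns, phase_shift_ns):
    """Model one clock cycle of a main/backup register pair (illustrative only)."""
    # Assumption: the backup register samples phase_shift_ns after the main one.
    backup_sample_time = main_period_ns + phase_shift_ns
    if path_delay_ns > backup_sample_time:
        raise ValueError("path delay exceeds the backup register's sampling time")

    # Simplification: if the logic has not settled at the main clock edge,
    # the main register is modeled as keeping the stale value.
    main_latched = new_value if path_delay_ns <= main_period_ns else old_value
    backup_latched = new_value  # backup register is always reliable

    # A mismatch between the two registers signals a timing error; local
    # recovery then forwards the backup value.
    error = main_latched != backup_latched
    return LfdrResult(error_detected=error, value_forwarded=backup_latched)
```

With a 7 ns main clock and a 3 ns phase shift, a path that settles after 8 ns is flagged as an error and the backup value is forwarded, while a 6 ns path passes without an error.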
8 Global recovery
- Pipeline stalled for one cycle
- Recovery steps initiated based on error location
  - IF Error
    - Stall PC
    - Clear bad data from ID stage
  - ID Error
    - Stall PC and IF
    - Clear most recent entry in ROB
  - FU Error
    - Stall reservation station
    - Invalidate instruction in ROB as well as dependent instructions
  - ROB Error
    - Prevent ROB from committing in the next cycle
    - Clear the delay register
- Reliable Execution Guaranteed
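The per-stage recovery steps above can be summarized as a lookup table. The following is a minimal sketch, not the processor's control logic: the action strings are taken from the slide, while the names `RECOVERY_ACTIONS` and `global_recovery` are hypothetical.

```python
# Recovery steps per error location, as listed on the slide.
RECOVERY_ACTIONS = {
    "IF": ["stall PC", "clear bad data from ID stage"],
    "ID": ["stall PC and IF", "clear most recent entry in ROB"],
    "FU": [
        "stall reservation station",
        "invalidate instruction in ROB as well as dependent instructions",
    ],
    "ROB": [
        "prevent ROB from committing in the next cycle",
        "clear the delay register",
    ],
}

STALL_CYCLES = 1  # the pipeline is stalled for one cycle on any error


def global_recovery(error_stage):
    """Return (stall cycles, recovery actions) for the stage that saw the error."""
    if error_stage not in RECOVERY_ACTIONS:
        raise ValueError(f"unknown error location: {error_stage!r}")
    return STALL_CYCLES, list(RECOVERY_ACTIONS[error_stage])
```

For example, `global_recovery("FU")` returns a one-cycle stall together with the two functional-unit recovery steps.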
9 Error recovery diagram
10 Dynamic frequency scaling
- Case I: No scaling
- Case II: Main Clock 9 ns, Phase Shift 1 ns
- Case III: Main Clock 7 ns (max), Phase Shift 3 ns
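In both scaled cases the main clock period plus the phase shift adds up to 10 ns, which suggests the backup register keeps sampling at a fixed worst-case instant. The sketch below encodes that reading. The 10 ns baseline is inferred from the two cases rather than stated on the slide, the 3 ns limit is taken from Case III being marked "(max)", and the function name `scaled_clock` is hypothetical.

```python
WORST_CASE_PERIOD_NS = 10.0  # assumption: inferred from 9 + 1 and 7 + 3
MAX_PHASE_SHIFT_NS = 3.0     # Case III is marked as the maximum


def scaled_clock(main_period_ns):
    """Return (phase shift in ns, frequency gain) for a scaled main clock."""
    phase_shift_ns = WORST_CASE_PERIOD_NS - main_period_ns
    if not 0.0 <= phase_shift_ns <= MAX_PHASE_SHIFT_NS:
        raise ValueError("main clock period outside the supported scaling range")
    frequency_gain = WORST_CASE_PERIOD_NS / main_period_ns
    return phase_shift_ns, frequency_gain
```

Under these assumptions, a 9 ns main clock needs a 1 ns phase shift and a 7 ns main clock needs 3 ns, matching Cases II and III.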
11 Impact of error rate on performance
- Case III, k = 1: No performance gain when Se > 42%
- told: Original clock period
- tnew: Clock period after frequency scaling
- tdiff: told - tnew
- Se: Fraction of clock cycles affected by errors due to scaling
- k: Number of cycles needed to recover from an error
- n: Number of cycles taken to execute an application
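The slide's equations did not survive extraction, but the 42% threshold can be reproduced from the definitions above. The derivation below is a reconstruction under two assumptions: every error costs k extra cycles at the scaled clock period, and Case III corresponds to told = 10 ns and tnew = 7 ns (the 10 ns baseline is inferred from the 7 ns main clock plus 3 ns phase shift).

```latex
\begin{align*}
  % Scaling pays off only if the scaled run, including recovery, is shorter:
  n \, t_{old} &> n \, t_{new} + S_e \, n \, k \, t_{new} \\
  t_{old} - t_{new} &> S_e \, k \, t_{new} \\
  S_e &< \frac{t_{diff}}{k \, t_{new}} \\
  % Case III, k = 1, assuming t_{old} = 10 ns and t_{new} = 7 ns:
  S_e &< \frac{3}{1 \times 7} \approx 0.4286
\end{align*}
```

Under this model, once roughly 42% of cycles incur an error the recovery overhead cancels the gain from the shorter clock period.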
12 Minimizing short path constraints
- Contamination delay limits phase shift
- Challenges
  - Increasing contamination delay (tcd)
  - Not affecting propagation delay (tpd)
- Experiments performed on CLA adders
  - Buffers added judiciously
  - tcd increased
  - tpd held constant
  - Minimal increase in area
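The idea of raising tcd without touching tpd can be sketched as padding only the short paths. This is a deliberately simplified illustration, not the procedure used in the CLA adder experiments: it treats every path as independent (real circuits share gates between paths), and the function name, the fixed per-buffer delay, and the example numbers are all hypothetical.

```python
def pad_short_paths(path_delays_ns, target_tcd_ns, buffer_delay_ns):
    """Add buffers to short paths until each reaches target_tcd_ns.

    Returns (padded delays, total number of buffers added). The longest path
    is never lengthened, so tpd stays constant.
    """
    if buffer_delay_ns <= 0:
        raise ValueError("buffer delay must be positive")

    tpd = max(path_delays_ns)
    padded = []
    buffers_added = 0
    for delay in path_delays_ns:
        # Keep inserting buffers while the path is too short and another
        # buffer would not push it past the existing propagation delay.
        while delay < target_tcd_ns and delay + buffer_delay_ns <= tpd:
            delay += buffer_delay_ns
            buffers_added += 1
        padded.append(delay)
    return padded, buffers_added
```

For paths of 1.0, 2.0 and 6.0 ns, a 3.0 ns target and 0.5 ns buffers, the shortest delay rises to 3.0 ns while the longest stays at 6.0 ns. The buffer count gives a rough proxy for the area cost.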
13 CLA Adder experiment
[Figure: tcd and tpd (ns) and area (µm²) for the CLA adder]
14 Dynamic frequency tuning
- Number of errors sampled periodically
- Clock period and phase shift controlled
- Voltage controlled oscillator manages clock frequency
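One way to picture the tuning loop is a simple step controller driven by the sampled error count. The slide does not give the control policy, so everything here is an assumption for illustration: the target error count, the step size, the period limits, and the rule that the phase shift is the gap between the worst-case period and the current period.

```python
def tune_period(period_ns, errors_in_window, *, target_errors, step_ns,
                min_period_ns, max_period_ns):
    """Return (new period in ns, phase shift in ns) after one sampling window."""
    if errors_in_window > target_errors:
        period_ns += step_ns   # too many errors: slow the main clock down
    elif errors_in_window < target_errors:
        period_ns -= step_ns   # few errors: try a faster main clock

    # Keep the period between the fastest allowed setting and the worst case.
    period_ns = min(max(period_ns, min_period_ns), max_period_ns)

    # Assumption: the backup register keeps sampling at the worst-case period.
    phase_shift_ns = max_period_ns - period_ns
    return period_ns, phase_shift_ns
```

In a real design the returned period would be translated into a control setting for the voltage controlled oscillator; that mapping is outside the scope of this sketch.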
15 Error sampling techniques
- Three different error monitoring techniques proposed
- Discrete Sampling
  - Uses a counter sampled and cleared once per window
  - Samples every 100,000 clock cycles
  - Simple implementation
- Continuous Sampling
  - Maintains a continuous history of errors in the window
  - Samples in moving window of 100,000 cycles
- Semi-continuous Sampling
  - Divides window into multiple counters
  - Five moving windows of 20,000 cycles
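The three monitoring schemes can be contrasted with a short software model. This is a sketch of one plausible reading of the slide, not the hardware implementation: the class names and the `tick` interface are hypothetical, and the semi-continuous scheme is interpreted as five counters that each cover 20,000 cycles, with the oldest counter dropped when a new one starts.

```python
from collections import deque


class DiscreteSampler:
    """One counter, sampled and cleared once per window."""

    def __init__(self, window=100_000):
        self.window = window
        self.count = 0
        self.cycle = 0

    def tick(self, error):
        """Return the error count at the end of a window, otherwise None."""
        self.count += int(error)
        self.cycle += 1
        if self.cycle == self.window:
            sample, self.count, self.cycle = self.count, 0, 0
            return sample
        return None


class ContinuousSampler:
    """Per-cycle error history over a moving window."""

    def __init__(self, window=100_000):
        self.history = deque(maxlen=window)
        self.count = 0

    def tick(self, error):
        """Return the number of errors in the most recent window."""
        if len(self.history) == self.history.maxlen:
            self.count -= self.history[0]  # oldest cycle leaves the window
        self.history.append(int(error))
        self.count += int(error)
        return self.count


class SemiContinuousSampler:
    """The window is split into several counters; the oldest one is recycled."""

    def __init__(self, sub_window=20_000, parts=5):
        self.sub_window = sub_window
        self.buckets = deque([0], maxlen=parts)
        self.cycle = 0

    def tick(self, error):
        """Return the sum of all counters currently in the window."""
        if self.cycle == self.sub_window:
            self.buckets.append(0)  # drops the oldest counter once full
            self.cycle = 0
        self.buckets[-1] += int(error)
        self.cycle += 1
        return sum(self.buckets)
```

The trade-off is storage against responsiveness: the discrete sampler needs a single counter but reports only once per window, the continuous sampler reports every cycle at the cost of a full history, and the semi-continuous sampler sits between the two.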
16 Simulation Results
- 18x18-bit multiplier implemented in FPGA
  - 44% performance gain achieved
- SPRIT3E evaluated on DLX superscalar processor
  - Modified superscalar processor implemented in FPGA
- Benchmarks used
  - MatrixMult, BubbleSort, RandGen
- Performance results across all benchmark applications
  - Continuous error sampling (57% improvement)
  - Semi-continuous error sampling (56% improvement)
  - Discrete error sampling (47% improvement)
17 Multiplier experiment setup
18 Frequency scaling induced errors: Multiplier circuit
19 Superscalar DLX processor
- Decode / Issue / Commit bandwidth: 2 instructions per cycle
- Out of order execution on 4 functional units
  - Arithmetic Logic (ALU)
  - Multiply Divide (MDU)
  - Load Store (LSU)
  - Branch Resolve (BRU)
- ROB entries: 5
- 64 Byte I and D cache
- Additional 64 KB of instruction and data memory
- Synthesized in FPGA
- Worst case propagation delay: 21.982 ns (MDU to ROB)
20 Benchmark applications
- MatrixMult
  - Multiplies two 50x50 matrices
  - Heavy utilization of MDU, poor cache utilization
  - Executes in 3 million cycles
- BubbleSort
  - Performs bubble sort on 5,000 16-bit numbers
  - No MDU operations, better cache utilization
  - Executes in 118 million cycles
- RandGen
  - Generates 1 million random numbers between 0 and 255
  - Uses MDU, distribution is counted and stored in memory
  - Executes in 15 million cycles
21 Frequency scaling induced errors: Superscalar DLX processor
22 Relative performance gain for different applications
23 Performance evaluation
- exold: Old execution time
- exnew: New execution time
- told: Old clock period
- tnew: New clock period
- Overall speedup (Sov) using SPRIT3E framework depends on
  - Fraction of clock cycles affected by errors (Se)
  - Number of cycles needed to recover from an error (k)
  - Percentage change in number of cycles taken for execution (Sc)
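The slide's speedup equation was lost in extraction, so the following is a hedged sketch of how the listed quantities could combine, not the authors' formula. It assumes Sov is the ratio exold / exnew, that exold = n * told, that exnew = n * (1 + Sc) * tnew, and that Sc (taken here as a fraction rather than a percentage) equals Se * k plus any other cycle overhead. The function name and the `extra_cycle_fraction` parameter are hypothetical.

```python
def overall_speedup(t_old_ns, t_new_ns, s_e, k, extra_cycle_fraction=0.0):
    """Estimate Sov = exold / exnew under the assumptions stated above."""
    # Fractional increase in cycle count: error recovery plus any other
    # overhead (for example clock adjustment penalties).
    s_c = s_e * k + extra_cycle_fraction
    # The cycle count n cancels out of the ratio.
    return t_old_ns / (t_new_ns * (1.0 + s_c))
```

As a sanity check on the model, an error-free run at 7 ns against a 10 ns baseline gives a speedup of 10/7, and the speedup falls to 1 when Se reaches 3/7 with k = 1.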
24 Impact on Area and Power
- Design mapped to Xilinx Virtex II Pro FPGA
- Impact on area because of SPRIT3E framework
  - 3.2% increase in number of flip-flops
  - 0.3% increase in combinational logic
  - 3.2% increase in equivalent gate count
- Impact on power consumption
  - No significant difference (Xilinx power reports)
25 Conclusions
- SPRIT3E Framework proposed
- Initial exploration of possibilities for overcoming worst case design mentality
- Timing error tolerant overclocking framework
- Modest error detection and recovery overhead
- Overcoming short path constraints
- Dynamic clock tuning methodology
- Implementation in FPGA
26 Backup Slides
27 Execution Simulator
- Simulator runs the benchmarks on a cycle by cycle basis
- Each cycle, occurrence of a timing error is determined by
  - Current clock period
  - Probability derived from the data
- Timing errors counted and sampled based on sampling method
- Every error cycle incurs 1 stall cycle
- Clock period adjustment incurs large delay (100 cycles)
- Benchmarks run for either actual execution time or long run time of 120 million cycles
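The simulator loop described above can be sketched in a few lines. This is an illustrative model only: the one-cycle stall and 100-cycle adjustment penalty come from the slide, while the function name, the caller-supplied `error_probability` and `adjust` callbacks, the use of discrete sampling, and charging the adjustment penalty at the new clock period are all assumptions.

```python
import random

STALL_CYCLES_PER_ERROR = 1    # from the slide
ADJUST_PENALTY_CYCLES = 100   # from the slide


def simulate(n_cycles, error_probability, period_ns, window, adjust, seed=0):
    """Return total execution time in ns for n_cycles of useful work.

    error_probability(period_ns) -> probability of a timing error in a cycle.
    adjust(period_ns, errors_in_window) -> clock period for the next window.
    Both callbacks are hypothetical stand-ins for measured data and for the
    tuning policy.
    """
    rng = random.Random(seed)
    total_time_ns = 0.0
    errors_in_window = 0
    for cycle in range(1, n_cycles + 1):
        cycles_spent = 1
        if rng.random() < error_probability(period_ns):
            errors_in_window += 1
            cycles_spent += STALL_CYCLES_PER_ERROR
        total_time_ns += cycles_spent * period_ns

        if cycle % window == 0:  # discrete sampling at the end of each window
            new_period_ns = adjust(period_ns, errors_in_window)
            if new_period_ns != period_ns:
                # Assumption: the penalty is charged at the new clock period.
                total_time_ns += ADJUST_PENALTY_CYCLES * new_period_ns
                period_ns = new_period_ns
            errors_in_window = 0
    return total_time_ns
```

A run with zero error probability and a policy that never changes the period reduces to `n_cycles * period_ns`, which is a convenient baseline when comparing sampling methods.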
28 Effect of Sampling Method
MatrixMult Benchmark for long run execution