Superscalar Processor Performance Enhancement Through Reliable Dynamic Clock Frequency Tuning

1
Superscalar Processor Performance Enhancement
Through Reliable Dynamic Clock Frequency Tuning

Viswanathan Subramanian, Mikel Bezdek, Naga D. Avirneni and Arun K. Somani
Dependable Computing and Networking Laboratory (DCNL)
Iowa State University

37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Edinburgh, UK, June 27th, 2007
2
Outline

Introduction
SPRIT3E framework
Dynamic frequency scaling
Minimizing short path constraints
Error Sampling and dynamic clock adjustment
Simulation Results

3
Worst case design for synchronous circuits

Clock period is limited by the maximum delay from A to B
This delay depends on
  Properties of the physical implementation of the circuit
  Properties of the environment
    Temperature and supply voltage
To avoid errors, worst case delays are assumed
  Result: overly conservative clock period
Pipelined processor
  Longest/slowest stage limits the period of the entire pipeline

4
Related work

Proposed solutions
  Deeper pipelines, or superpipelining
    Cons: increased branch misprediction penalty; some stages hard to divide
  Asynchronous designs
    Cons: unfamiliar design methodology; lack of tool support
  Better than worst case designs
    Reliable overclocking

5
Related work

RAZOR vs. SPRIT3E
RAZOR uses temporal fault tolerance
  Achieves lower energy consumption
  Supply voltage scaled
  Clock frequency unchanged during run time
SPRIT3E uses temporal fault tolerance
  Allows faster execution for non-worst-case data
  Clock frequency scaled
  Operating frequency adjusted dynamically during run time
SPRIT3E reliably overclocks critical pipeline registers to improve performance of superscalar processors

6
Superscalar PeRformance Improvement Through Tolerating Timing Errors (SPRIT3E)

Runs a superscalar pipeline at speeds faster than the worst case limit
Local Fault Detection and Recovery
Global Recovery
Dynamic Clock Frequency Tuning

7
Local Fault Detection and Recovery - LFDR

Main register clocked ambitiously
Backup register always reliable
PS Clock
  Phase shifted version of Main Clock
Local recovery initiated on error detection
Metastability conditions
  Detected and recovered from
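
The main/backup register pair described on this slide can be sketched as a small behavioral model. This is a toy under stated assumptions, not the authors' circuit: the class name, its fields, and the idea of passing in a "stale" value for logic that has not yet settled are all hypothetical, and metastability is not modeled.

```python
from dataclasses import dataclass


@dataclass
class LfdrRegister:
    """Toy model of one LFDR-protected pipeline register (illustrative names)."""
    main_period_ns: float   # period of the ambitious main clock
    phase_shift_ns: float   # delay of the PS clock relative to the main clock

    def capture(self, settled_value, stale_value, path_delay_ns):
        """Return (value_passed_on, error_detected) for one clock cycle.

        settled_value: correct output of the combinational logic
        stale_value:   what the logic output holds before it has settled
        path_delay_ns: time the logic needs for this particular input
        """
        backup_deadline = self.main_period_ns + self.phase_shift_ns
        if path_delay_ns > backup_deadline:
            raise ValueError("backup register must always meet timing")
        # The main register latches a stale value if the logic is still settling.
        if path_delay_ns <= self.main_period_ns:
            main = settled_value
        else:
            main = stale_value
        backup = settled_value  # latched on the PS clock, after the logic settled
        error = main != backup  # mismatch between the two registers flags an error
        # On an error, local recovery replaces the main value with the backup.
        return backup, error
```

With a 7 ns main clock and a 3 ns phase shift, `LfdrRegister(7.0, 3.0).capture(42, 0, 8.0)` flags an error and still passes on the correct value, while a 6.5 ns path raises no error.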

8
Global recovery

Pipeline stalled for one cycle
Recovery steps initiated based on error location
  IF Error
    Stall PC
    Clear bad data from ID stage
  ID Error
    Stall PC and IF
    Clear most recent entry in ROB
  FU Error
    Stall reservation station
    Invalidate instruction in ROB as well as dependent instructions
  ROB Error
    Prevent ROB from committing in the next cycle
    Clear the delay register
Reliable execution guaranteed
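
The location-dependent recovery steps listed on this slide amount to a lookup keyed by the stage that reported the error. A minimal sketch, assuming the steps can be represented as plain strings; `RECOVERY_STEPS` and `global_recovery` are hypothetical names, not part of the SPRIT3E design:

```python
# Recovery actions per error location, copied from the slide text.
RECOVERY_STEPS = {
    "IF": ["stall PC", "clear bad data from ID stage"],
    "ID": ["stall PC and IF", "clear most recent entry in ROB"],
    "FU": ["stall reservation station",
           "invalidate instruction in ROB as well as dependent instructions"],
    "ROB": ["prevent ROB from committing in the next cycle",
            "clear the delay register"],
}


def global_recovery(error_location):
    """Return the ordered list of actions for an error at the given location."""
    if error_location not in RECOVERY_STEPS:
        raise ValueError(f"unknown error location: {error_location!r}")
    # Every recovery starts with the one-cycle pipeline stall.
    return ["stall pipeline for one cycle"] + RECOVERY_STEPS[error_location]
```

For example, `global_recovery("FU")` returns the pipeline stall followed by the two functional-unit steps.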

9
Error recovery diagram
10
Dynamic frequency scaling
Case I: No scaling
Case II: Main Clock 9 ns, Phase Shift 1 ns
Case III: Main Clock 7 ns (max), Phase Shift 3 ns
11
Impact of error rate on performance
Case III, k = 1: No performance gain when Se > 42%

t_old: Original clock period
t_new: Clock period after frequency scaling
t_diff = t_old - t_new
Se: Fraction of clock cycles affected by errors due to scaling
k: Number of cycles needed to recover from an error
n: Number of cycles taken to execute an application
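
The slide's own formula did not survive in this transcript, so the following is a sketch under an assumed model: an error-free run takes n cycles at either frequency, and each error adds k recovery cycles at the scaled period. Scaled run time is then n * t_new * (1 + k * Se), which beats n * t_old only while Se stays below t_diff / (k * t_new). The function names are hypothetical, and the 10 ns original period is inferred from main clock plus phase shift on the previous slide.

```python
def break_even_error_rate(t_old, t_new, k):
    """Largest error fraction Se at which scaling still breaks even (assumed model)."""
    t_diff = t_old - t_new
    return t_diff / (k * t_new)


def speedup(t_old, t_new, k, s_e):
    """Ratio of original to scaled run time; n cancels out of the ratio."""
    return t_old / (t_new * (1 + k * s_e))


# Case III with k = 1: 10 ns scaled to 7 ns.
print(break_even_error_rate(10.0, 7.0, 1))  # 3/7, roughly 0.43
```

Under this model the break-even point for Case III is 3/7, which agrees with the value of about 42 percent quoted for that case.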

12
Minimizing short path constraints

Contamination delay limits phase shift
Challenges
  Increasing contamination delay (tcd)
  Not affecting propagation delay (tpd)
Experiments performed on CLA adders
  Buffers added judiciously
  tcd increased
  tpd held constant
  Minimal increase in area
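
The idea of raising tcd without touching tpd can be illustrated with a toy model that treats each input-to-output path as an independent delay. This is only a sketch: real paths share gates, so a synthesis tool would have to place buffers far more carefully, and `pad_short_paths` with its parameters is a hypothetical helper rather than the procedure used in the CLA adder experiments.

```python
import math


def pad_short_paths(path_delays_ns, target_tcd_ns, buffer_delay_ns):
    """Add buffer delay to short paths until each reaches the target tcd.

    The longest path (tpd) is never lengthened, and no padded path is
    allowed to exceed it.
    """
    t_pd = max(path_delays_ns)  # critical path delay; must stay constant
    padded = []
    for delay in path_delays_ns:
        if delay >= target_tcd_ns:
            padded.append(delay)  # already long enough, leave untouched
            continue
        # Fewest buffers that lift this path to the target.
        needed = math.ceil((target_tcd_ns - delay) / buffer_delay_ns)
        # Most buffers that fit without overtaking the critical path.
        allowed = math.floor((t_pd - delay) / buffer_delay_ns)
        padded.append(delay + min(needed, allowed) * buffer_delay_ns)
    return padded
```

For paths of 1.0, 2.5 and 6.0 ns with 0.5 ns buffers and a 3.0 ns target, the shortest path delay rises from 1.0 ns to 3.0 ns while the 6.0 ns critical path is unchanged.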

13
CLA Adder experiment
[Chart axes: tcd, tpd (ns); Area (µm²)]
14
Dynamic frequency tuning

Number of errors sampled periodically
Clock period and phase shift controlled
Voltage controlled oscillator manages clock frequency
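
One tuning step of such a feedback loop might look like the sketch below. The error budget, the step size, and the rule that main period plus phase shift always equals the worst-case period are assumptions (the last one is inferred from the 9 ns + 1 ns and 7 ns + 3 ns cases on the frequency scaling slide); `tune_clock_period` is a hypothetical name.

```python
def tune_clock_period(period_ns, errors_in_window, *, target_errors,
                      step_ns, min_period_ns, max_period_ns):
    """Return (new_period_ns, new_phase_shift_ns) after one sampling window."""
    if errors_in_window > target_errors:
        period_ns += step_ns   # too many errors: slow the main clock down
    else:
        period_ns -= step_ns   # within budget: try a faster main clock
    # Stay inside the range the clock generator is allowed to produce.
    period_ns = min(max(period_ns, min_period_ns), max_period_ns)
    # Assumed rule: the backup register is always clocked at the worst-case
    # delay, so the phase shift makes up the difference.
    phase_shift_ns = max_period_ns - period_ns
    return period_ns, phase_shift_ns
```

Starting from a 9 ns period with no errors in the window, `tune_clock_period(9.0, 0, target_errors=100, step_ns=1.0, min_period_ns=7.0, max_period_ns=10.0)` moves to 8 ns with a 2 ns phase shift.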

15
Error sampling techniques

Three different error monitoring techniques proposed
Discrete Sampling
  Uses a counter sampled and cleared once per window
  Samples every 100,000 clock cycles
  Simple implementation
Continuous Sampling
  Maintains a continuous history of errors in the window
  Samples in moving window of 100,000 cycles
Semi-continuous Sampling
  Divides window into multiple counters
  Five moving windows of 20,000 cycles
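
The three monitoring schemes can be compared side by side in software. The classes below are a sketch, not the hardware implementation: the class names and the per-cycle `tick` interface are assumptions, while the default window sizes come from the slide.

```python
from collections import deque


class DiscreteSampler:
    """One counter, read and cleared once per window."""

    def __init__(self, window=100_000):
        self.window, self.count, self.cycle = window, 0, 0

    def tick(self, error):
        """Advance one cycle; return the count at a window boundary, else None."""
        self.count += error
        self.cycle += 1
        if self.cycle == self.window:
            result, self.count, self.cycle = self.count, 0, 0
            return result
        return None


class ContinuousSampler:
    """Exact error count over the most recent `window` cycles."""

    def __init__(self, window=100_000):
        self.history = deque(maxlen=window)
        self.count = 0

    def tick(self, error):
        if len(self.history) == self.history.maxlen:
            self.count -= self.history[0]  # oldest cycle leaves the window
        self.history.append(error)
        self.count += error
        return self.count


class SemiContinuousSampler:
    """Window split into sub-counters; the oldest is dropped as a new one starts."""

    def __init__(self, sub_window=20_000, parts=5):
        self.sub_window = sub_window
        self.counters = deque([0], maxlen=parts)
        self.cycle = 0

    def tick(self, error):
        if self.cycle == self.sub_window:
            self.counters.append(0)  # evicts the oldest sub-counter when full
            self.cycle = 0
        self.counters[-1] += error
        self.cycle += 1
        return sum(self.counters)
```

The trade-off is storage against responsiveness: the continuous sampler keeps one history entry per cycle, the discrete sampler keeps a single counter but only reports once per window, and the semi-continuous sampler sits between the two with a handful of counters.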

16
Simulation Results

18x18-bit multiplier implemented in FPGA
  44% performance gain achieved
SPRIT3E evaluated on DLX superscalar processor
  Modified superscalar processor implemented in FPGA
Benchmarks used
  MatrixMult, BubbleSort, RandGen
Performance result across all benchmark applications
  Continuous error sampling (57% improvement)
  Semi-continuous error sampling (56% improvement)
  Discrete error sampling (47% improvement)

17
Multiplier experiment setup
18
Frequency scaling induced errors: Multiplier circuit
19
Superscalar DLX processor

Decode / Issue / Commit bandwidth: 2 instructions per cycle
Out of order execution on 4 functional units
  Arithmetic Logic (ALU)
  Multiply Divide (MDU)
  Load Store (LSU)
  Branch Resolve (BRU)
ROB entries: 5
64 Byte I and D cache
Additional 64 KB of instruction and data memory
Synthesized in FPGA
  Worst case propagation delay: 21.982 ns (MDU to ROB)

20
Benchmark applications

MatrixMult
  Multiplies two 50x50 matrices
  Heavy utilization of MDU, poor cache utilization
  Executes in 3 million cycles
BubbleSort
  Performs bubble sort on 5,000 16-bit numbers
  No MDU operations, better cache utilization
  Executes in 118 million cycles
RandGen
  Generates 1 million random numbers between 0 and 255
  Uses MDU, distribution is counted and stored in memory
  Executes in 15 million cycles

21
Frequency scaling induced errors: Superscalar DLX processor
22
Relative performance gain for different applications
23
Performance evaluation
ex_old: Old execution time
ex_new: New execution time
t_old: Old clock period
t_new: New clock period

Overall speedup (Sov) using SPRIT3E framework depends on
  Fraction of clock cycles affected by errors (Se)
  Number of cycles needed to recover from an error (k)
  Percentage change in no. of cycles taken for execution (Sc)

24
Impact on Area and Power

Design mapped to Xilinx Virtex II Pro FPGA
Impact on area because of SPRIT3E framework
  3.2% increase in number of flip-flops
  0.3% increase in combinational logic
  3.2% increase in equivalent gate count
Impact on power consumption
  No significant difference (Xilinx power reports)

25
Conclusions

SPRIT3E Framework proposed
Initial exploration of possibilities for overcoming worst case design mentality
Timing error tolerant overclocking framework
Modest error detection and recovery overhead
Overcoming short path constraints
Dynamic clock tuning methodology
Implementation in FPGA

26
Backup Slides
27
Execution Simulator

Simulator runs the benchmarks on a cycle by cycle basis
Each cycle, occurrence of a timing error is determined by
  Current clock period
  Probability derived from the data
Timing errors counted and sampled based on sampling method
Every error cycle incurs 1 stall cycle
Clock period adjustment incurs large delay (100 cycles)
Benchmarks run for either actual execution time or long run time of 120 million cycles
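
A cycle-by-cycle simulator along these lines can be sketched in a few lines. This is an illustration under assumptions, not the authors' simulator: `simulate`, the `error_probability` callable standing in for the measured error data, the `tune` callback, and the choice to charge the adjustment penalty at the old period are all hypothetical. Only discrete sampling is modeled.

```python
import random


def simulate(n_cycles, period_ns, error_probability, rng, *,
             window=100_000, tune=None, adjust_penalty_cycles=100):
    """Return (total_time_ns, total_errors) for a cycle-by-cycle run.

    error_probability: callable, period_ns -> chance of a timing error per cycle
    tune: optional callable, (period_ns, errors_in_window) -> new period_ns
    """
    time_ns = 0.0
    total_errors = 0
    window_errors = 0
    for cycle in range(1, n_cycles + 1):
        time_ns += period_ns
        if rng.random() < error_probability(period_ns):
            total_errors += 1
            window_errors += 1
            time_ns += period_ns  # every error cycle incurs 1 stall cycle
        if tune is not None and cycle % window == 0:  # discrete sampling point
            new_period = tune(period_ns, window_errors)
            if new_period != period_ns:
                # Clock adjustment stalls the pipeline (assumed: at the old period).
                time_ns += adjust_penalty_cycles * period_ns
                period_ns = new_period
            window_errors = 0
    return time_ns, total_errors


# Example: fixed 7 ns clock with an assumed 1% error probability per cycle.
total_ns, errors = simulate(10_000, 7.0, lambda p: 0.01, random.Random(0))
```

Passing a seeded `random.Random` keeps runs reproducible, which makes it easier to compare sampling methods on the same error sequence.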