Superscalar Processor Performance Enhancement Through Reliable Dynamic Clock Frequency Tuning

1
Superscalar Processor Performance Enhancement
Through Reliable Dynamic Clock Frequency Tuning
  • Viswanathan Subramanian, Mikel Bezdek, Naga D.
    Avirneni and Arun K. Somani
  • Dependable Computing and Networking Laboratory
    (DCNL)
  • Iowa State University

37th Annual IEEE/IFIP International Conference
on Dependable Systems and Networks, Edinburgh,
UK, June 27th, 2007
2
Outline
  • Introduction
  • SPRIT3E framework
  • Dynamic frequency scaling
  • Minimizing short path constraints
  • Error Sampling and dynamic clock adjustment
  • Simulation Results

3
Worst case design for synchronous circuits
  • Clock period is limited by the maximum delay from
    A to B
  • This delay depends on
  • Properties of the physical implementation of the
    circuit
  • Properties of the environment
  • Temperature and Supply Voltage
  • To avoid errors, worst case delays are assumed
  • Result - Overly conservative clock period
  • Pipelined processor
  • Longest/slowest stage limits the period of the
    entire pipeline
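
As a minimal illustration of the last point, the sketch below computes the worst-case clock period of a pipeline as the delay of its slowest stage. The stage names and delays are hypothetical values chosen for the example, not figures from the presentation.

```python
def pipeline_clock_period(stage_delays_ns):
    """Worst-case design: the slowest stage sets the period for every stage."""
    return max(stage_delays_ns)


# Hypothetical worst-case stage delays in nanoseconds (illustration only).
stage_delays_ns = {"IF": 6.0, "ID": 5.5, "EX": 10.0, "MEM": 8.0, "WB": 4.0}

period_ns = pipeline_clock_period(stage_delays_ns.values())

# Slack is the part of each cycle a stage spends idle under worst-case timing.
slack_ns = {stage: period_ns - delay for stage, delay in stage_delays_ns.items()}
```

Every stage other than the slowest one carries unused slack in each cycle, which is the margin that better-than-worst-case designs try to recover.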

4
Related work
  • Proposed solutions
  • Deeper pipelines, or superpipelining
  • Cons: Increased branch misprediction penalty
  • Some stages hard to divide
  • Asynchronous designs
  • Cons: Unfamiliar design methodology
  • Lack of tool support
  • Better than Worst Case designs
  • Reliable overclocking

5
Related work
  • RAZOR Vs. SPRIT3E
  • RAZOR uses temporal fault tolerance
  • Achieves lower energy consumption
  • Supply voltage scaled
  • Clock frequency unchanged during run time
  • SPRIT3E uses temporal fault tolerance
  • Allows faster execution for non worst case data
  • Clock frequency scaled
  • Operating frequency adjusted dynamically during
    run time
  • SPRIT3E reliably overclocks critical pipeline
    registers to improve performance of superscalar
    processors

6
Superscalar PeRformance Improvement Through
Tolerating Timing Errors (SPRIT3E)
  • Runs a superscalar pipeline at speeds faster than
    the worst case limit
  • Local Fault Detection and Recovery
  • Global Recovery
  • Dynamic Clock Frequency Tuning

7
Local Fault Detection and Recovery - LFDR
  • Main register clocked ambitiously
  • Backup register always reliable
  • PS Clock
  • Phase shifted version of Main Clock
  • Local recovery initiated on error detection
  • Metastability conditions
  • Detected and recovered from
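
The detection idea above can be sketched behaviorally as follows. This is a simplified model, not the circuit from the presentation: the class name `LfdrRegister` and its fields are hypothetical, a late main register is assumed to keep its old value, and metastability is not modeled.

```python
from dataclasses import dataclass


@dataclass
class LfdrRegister:
    """Behavioral model of a main register paired with a backup register."""
    t_clk_ns: float        # ambitious main clock period
    phase_shift_ns: float  # delay of the PS clock relative to the main clock

    def capture(self, new_value, old_value, path_delay_ns):
        """Return (main, backup, error) for one clock edge."""
        # The main register sees the new value only if the logic settled in time.
        main = new_value if path_delay_ns <= self.t_clk_ns else old_value
        # The backup register samples later, at t_clk + phase shift.
        if path_delay_ns > self.t_clk_ns + self.phase_shift_ns:
            raise ValueError("backup register is no longer reliable")
        backup = new_value
        # A mismatch between the two registers signals a timing error.
        return main, backup, main != backup

    @staticmethod
    def recover(backup):
        """Local recovery: the main register is reloaded from the backup."""
        return backup


reg = LfdrRegister(t_clk_ns=7.0, phase_shift_ns=3.0)
on_time = reg.capture(new_value=1, old_value=0, path_delay_ns=6.5)
late = reg.capture(new_value=1, old_value=0, path_delay_ns=9.0)
```

Note that a late path only produces an error when the new value differs from the stale one, which is why the error rate depends on the data being processed.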

8
Global recovery
  • Pipeline stalled for one cycle
  • Recovery steps initiated based on error location
  • IF Error
  • Stall PC
  • Clear bad data from ID stage
  • ID Error
  • Stall PC and IF
  • Clear most recent entry in ROB
  • FU Error
  • Stall reservation station
  • Invalidate instruction in ROB as well as
    dependent instructions
  • ROB Error
  • Prevent ROB from committing in the next cycle
  • Clear the delay register
  • Reliable Execution Guaranteed
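
The recovery steps listed above amount to a lookup keyed by the stage in which the error was detected. The sketch below encodes the slide's list as a table; the names `RECOVERY_ACTIONS` and `global_recovery` are hypothetical and the actions are descriptive strings, not hardware signals.

```python
# Recovery actions per error location, transcribed from the list above.
RECOVERY_ACTIONS = {
    "IF": ["stall PC", "clear bad data from ID stage"],
    "ID": ["stall PC and IF", "clear most recent entry in ROB"],
    "FU": [
        "stall reservation station",
        "invalidate instruction in ROB as well as dependent instructions",
    ],
    "ROB": [
        "prevent ROB from committing in the next cycle",
        "clear the delay register",
    ],
}

STALL_CYCLES = 1  # the pipeline is stalled for one cycle on any error


def global_recovery(error_location):
    """Return (stall cycles, actions) for an error detected at the given stage."""
    if error_location not in RECOVERY_ACTIONS:
        raise ValueError(f"unknown error location: {error_location!r}")
    return STALL_CYCLES, RECOVERY_ACTIONS[error_location]


stall, actions = global_recovery("FU")
```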

9
Error recovery diagram
10
Dynamic frequency scaling
  • Case I: No scaling
  • Case II: Main Clock 9 ns, Phase Shift 1 ns
  • Case III: Main Clock 7 ns (max), Phase Shift 3 ns
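
In both scaled cases the main clock period and the phase shift add up to 10 ns, which suggests an unscaled period of 10 ns and a contamination delay of 3 ns. Both values are inferred from the numbers above and are not stated in the transcript. Under that assumption, a minimal sketch of how the phase shift follows from the chosen period:

```python
def phase_shift_ns(t_old_ns, t_new_ns, t_cd_ns):
    """Phase shift that keeps the backup register at the worst-case period.

    Raises ValueError if the shift would exceed the contamination delay,
    since the backup register could then capture next-cycle data.
    """
    shift = t_old_ns - t_new_ns
    if shift < 0:
        raise ValueError("new period is longer than the original period")
    if shift > t_cd_ns:
        raise ValueError("phase shift exceeds the contamination delay")
    return shift


T_OLD_NS = 10  # assumed original period (inferred from 9 + 1 and 7 + 3)
T_CD_NS = 3    # assumed contamination delay (inferred from the 3 ns maximum)

case_ii = phase_shift_ns(T_OLD_NS, 9, T_CD_NS)
case_iii = phase_shift_ns(T_OLD_NS, 7, T_CD_NS)
```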
11
Impact of error rate on performance
Case III, k = 1: no performance gain when S_e > 42%
  • t_old: Original clock period
  • t_new: Clock period after frequency scaling
  • t_diff = t_old - t_new
  • S_e: Fraction of clock cycles affected by errors
    due to scaling
  • k: Number of cycles needed to recover from an
    error
  • n: Number of cycles taken to execute an
    application
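
A short sketch of where the 42 figure can come from, assuming each error cycle costs k extra cycles at the new clock period. Overclocking then pays off only while n(1 + k·Se)·t_new is less than n·t_old, so the break-even error rate is t_diff / (k·t_new). The original period of 10 ns used here is an assumption inferred from the scaling cases, and the function name is hypothetical.

```python
def break_even_error_rate(t_old_ns, t_new_ns, k):
    """Largest error fraction at which the scaled clock still breaks even."""
    t_diff = t_old_ns - t_new_ns
    return t_diff / (k * t_new_ns)


# Case III with one recovery cycle per error: 3 / 7, a little under 43%.
limit = break_even_error_rate(t_old_ns=10, t_new_ns=7, k=1)
```

A longer recovery penalty k lowers the tolerable error rate proportionally, which is why a cheap, local recovery path matters.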

12
Minimizing short path constraints
  • Contamination delay limits phase shift
  • Challenges
  • Increasing contamination delay (tcd)
  • Not affecting propagation delay (tpd)
  • Experiments performed on CLA adders
  • Buffers added judiciously
  • tcd increased
  • tpd held constant
  • Minimal increase in area
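
The padding idea can be sketched on a list of independent path delays: buffers are added to paths shorter than a target contamination delay, but never so many that a path would exceed the existing propagation delay. This is a simplification with hypothetical numbers; in a real adder, paths share gates, so buffers cannot be placed per path this freely.

```python
def pad_short_paths(path_delays_ns, target_tcd_ns, buffer_delay_ns):
    """Return (padded delays, buffers used) after padding short paths."""
    t_pd = max(path_delays_ns)  # propagation delay must not grow
    padded, buffers = [], 0
    for delay in path_delays_ns:
        # Add buffers while the path is too short and still fits within t_pd.
        while delay < target_tcd_ns and delay + buffer_delay_ns <= t_pd:
            delay += buffer_delay_ns
            buffers += 1
        padded.append(delay)
    return padded, buffers


# Hypothetical path delays in nanoseconds (illustration only).
paths = [1.0, 2.5, 6.0, 10.0]
padded, buffers = pad_short_paths(paths, target_tcd_ns=3.0, buffer_delay_ns=0.5)
```

The buffer count stands in for the area cost: the fewer short paths a circuit has, the cheaper it is to raise tcd.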

13
CLA Adder experiment
[Chart: tcd and tpd (ns) and area (µm²) for the CLA adder]
14
Dynamic frequency tuning
  • Number of errors sampled periodically
  • Clock period and phase shift controlled
  • Voltage controlled oscillator manages clock
    frequency
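
One possible control rule for this loop is sketched below: shorten the period while the sampled error rate stays under a target, lengthen it otherwise, and set the phase shift so the backup register stays at the worst-case period. The target rate, step size, and period bounds are hypothetical parameters; the presentation does not give the actual policy.

```python
def tune_clock(period_ns, errors_in_window, window_cycles, *,
               target_rate, step_ns, t_worst_ns, t_cd_ns):
    """Return (new period, new phase shift) after one sampling window."""
    rate = errors_in_window / window_cycles
    if rate > target_rate:
        period_ns += step_ns  # too many errors: slow down
    else:
        period_ns -= step_ns  # few errors: try a faster clock
    # The phase shift may not exceed the contamination delay, so the
    # period is bounded below by t_worst - t_cd and above by t_worst.
    period_ns = min(max(period_ns, t_worst_ns - t_cd_ns), t_worst_ns)
    return period_ns, t_worst_ns - period_ns


params = dict(target_rate=0.01, step_ns=0.5, t_worst_ns=10.0, t_cd_ns=3.0)
slower = tune_clock(8.0, 5_000, 100_000, **params)
faster = tune_clock(8.0, 0, 100_000, **params)
floor = tune_clock(7.0, 0, 100_000, **params)
```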

15
Error sampling techniques
  • Three different error monitoring techniques
    proposed
  • Discrete Sampling
  • Uses a counter sampled and cleared once per
    window
  • Samples every 100,000 clock cycles
  • Simple implementation
  • Continuous Sampling
  • Maintains a continuous history of errors in the
    window
  • Samples in moving window of 100,000 cycles
  • Semi-continuous Sampling
  • Divides window into multiple counters
  • Five moving windows of 20,000 cycles
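
The three techniques can be sketched as small counters with a common `tick` method. This is one interpretation of the slide, not the authors' hardware: in particular, the semi-continuous sampler is assumed to keep five sub-window counters and report their sum, dropping the oldest counter each time a sub-window completes. All class names are hypothetical.

```python
from collections import deque


class DiscreteSampler:
    """One counter, read and cleared once per window."""

    def __init__(self, window=100_000):
        self.window, self.count, self.cycles = window, 0, 0

    def tick(self, error):
        """Return the error count at the end of a window, otherwise None."""
        self.count += int(bool(error))
        self.cycles += 1
        if self.cycles == self.window:
            total, self.count, self.cycles = self.count, 0, 0
            return total
        return None


class ContinuousSampler:
    """Per-cycle error history over a moving window."""

    def __init__(self, window=100_000):
        self.history, self.total = deque(maxlen=window), 0

    def tick(self, error):
        """Return the number of errors in the most recent window."""
        if len(self.history) == self.history.maxlen:
            self.total -= self.history[0]  # oldest cycle leaves the window
        self.history.append(int(bool(error)))
        self.total += self.history[-1]
        return self.total


class SemiContinuousSampler:
    """Several sub-window counters approximating a moving window."""

    def __init__(self, sub_window=20_000, n_sub=5):
        self.sub_window, self.cycles = sub_window, 0
        self.counters = deque([0] * n_sub, maxlen=n_sub)

    def tick(self, error):
        """Return the error count summed over all sub-window counters."""
        self.counters[-1] += int(bool(error))
        self.cycles += 1
        total = sum(self.counters)
        if self.cycles == self.sub_window:
            self.cycles = 0
            self.counters.append(0)  # start a new counter, drop the oldest
        return total
```

The trade-off is storage against responsiveness: the continuous sampler needs one bit of history per cycle in the window, the discrete sampler needs a single counter, and the semi-continuous sampler sits between the two.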

16
Simulation Results
  • 18x18-bit multiplier implemented in FPGA
  • 44% performance gain achieved
  • SPRIT3E evaluated on DLX superscalar processor
  • Modified superscalar processor implemented in
    FPGA
  • Benchmarks used
  • MatrixMult, BubbleSort, RandGen
  • Performance results across all benchmark
    applications
  • Continuous error sampling (57% improvement)
  • Semi-continuous error sampling (56% improvement)
  • Discrete error sampling (47% improvement)

17
Multiplier experiment setup
18
Frequency scaling induced errors: Multiplier
circuit
19
Superscalar DLX processor
  • Decode / Issue / Commit bandwidth - 2
    instructions per cycle
  • Out of order execution on 4 functional units
  • Arithmetic and logic (ALU)
  • Multiply and divide (MDU)
  • Load and store (LSU)
  • Branch resolve (BRU)
  • ROB entries - 5
  • 64 Byte I and D cache
  • Additional 64 KB of instruction and data memory
  • Synthesized in FPGA
  • Worst case propagation delay: 21.982 ns (MDU to
    ROB)

20
Benchmark applications
  • MatrixMult
  • Multiplies two 50x50 matrices
  • Heavy utilization of MDU, poor cache utilization
  • Executes in 3 million cycles
  • BubbleSort
  • Performs bubble sort on 5,000 16-bit numbers
  • No MDU operations, better cache utilization
  • Executes in 118 million cycles
  • RandGen
  • Generate 1 million random numbers between 0 and
    255
  • Uses MDU, distribution is counted and stored in
    memory
  • Executes in 15 million cycles

21
Frequency scaling induced errors: Superscalar
DLX processor
22
Relative performance gain for different
applications
23
Performance evaluation
  • ex_old: Old execution time
  • ex_new: New execution time
  • t_old: Old clock period
  • t_new: New clock period
  • Overall speedup (Sov) using SPRIT3E framework
    depends on
  • Fraction of clock cycles affected by errors (Se)
  • Number of cycles needed to recover from an error
    (k)
  • Percentage change in no. of cycles taken for
    execution (Sc)
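
The slide's own speedup formula is not preserved in this transcript. The sketch below is one plausible form, assuming execution time is cycles times clock period, that each error adds k cycles, and that Sc is an additional fractional change in cycle count. The function name and the way the terms are combined are assumptions.

```python
def overall_speedup(n_cycles, t_old_ns, t_new_ns, s_e, k, s_c=0.0):
    """Speedup as old execution time divided by new execution time."""
    ex_old = n_cycles * t_old_ns
    # Assumed cycle count: baseline, plus Sc change, plus k cycles per error.
    new_cycles = n_cycles * (1.0 + s_c + s_e * k)
    ex_new = new_cycles * t_new_ns
    return ex_old / ex_new


# Hypothetical inputs: 10 ns scaled to 7 ns with 5% of cycles in error.
error_free = overall_speedup(1_000_000, 10.0, 7.0, s_e=0.0, k=1)
with_errors = overall_speedup(1_000_000, 10.0, 7.0, s_e=0.05, k=1)
```

Under these assumptions the speedup does not depend on the cycle count itself, only on the period ratio and the per-cycle overheads.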

24
Impact on Area and Power
  • Design mapped to Xilinx Virtex II Pro FPGA
  • Impact on Area because of SPRIT3E framework
  • 3.2% increase in number of flip-flops
  • 0.3% increase in combinational logic
  • 3.2% increase in equivalent gate count
  • Impact on power consumption
  • No significant difference (Xilinx power reports)

25
Conclusions
  • SPRIT3E Framework proposed
  • Initial exploration of possibilities for
    overcoming worst case design mentality
  • Timing error tolerant overclocking framework
  • Modest error detection and recovery overhead
  • Overcoming short path constraints
  • Dynamic clock tuning methodology
  • Implementation in FPGA

26
Backup Slides
27
Execution Simulator
  • Simulator runs the benchmarks on a cycle by cycle
    basis
  • Each cycle, occurrence of a timing error is
    determined by
  • Current clock period
  • Probability derived from the data
  • Timing errors counted and sampled based on
    sampling method
  • Every error cycle incurs 1 stall cycle
  • Clock period adjustment incurs large delay (100
    cycles)
  • Benchmarks run for either actual execution time
    or long run time of 120 million cycles
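
A compact sketch of such a simulator is shown below, using discrete sampling. The one-cycle stall per error and the 100-cycle adjustment penalty follow the slide. The linear error-probability model, the tuning rule, and all default parameter values are hypothetical stand-ins, since the presentation derives its probabilities from measured data.

```python
import random


def linear_error_probability(period_ns, t_worst_ns=10.0, t_min_ns=7.0, p_max=0.2):
    """Hypothetical model: no errors at the worst-case period, p_max at t_min."""
    if period_ns >= t_worst_ns:
        return 0.0
    frac = min(1.0, (t_worst_ns - period_ns) / (t_worst_ns - t_min_ns))
    return p_max * frac


def simulate(work_cycles, *, p_error=linear_error_probability, window=1000,
             start_period_ns=10.0, step_ns=0.5, min_period_ns=7.0,
             max_period_ns=10.0, target_rate=0.01,
             adjust_penalty_cycles=100, seed=0):
    """Return (elapsed time in ns, total errors, final clock period in ns)."""
    rng = random.Random(seed)
    period, time_ns = start_period_ns, 0.0
    errors_in_window = cycles_in_window = total_errors = 0
    for _ in range(work_cycles):
        time_ns += period
        cycles_in_window += 1
        if rng.random() < p_error(period):
            errors_in_window += 1
            total_errors += 1
            time_ns += period  # every error cycle incurs one stall cycle
        if cycles_in_window == window:
            # Discrete sampling: read the counter, adjust, then clear it.
            rate = errors_in_window / window
            new_period = period + step_ns if rate > target_rate else period - step_ns
            new_period = min(max(new_period, min_period_ns), max_period_ns)
            if new_period != period:
                time_ns += adjust_penalty_cycles * new_period
                period = new_period
            errors_in_window = cycles_in_window = 0
    return time_ns, total_errors, period


elapsed_ns, errors, final_period = simulate(20_000)
```

Passing a different `p_error` callable makes it possible to plug in a measured error distribution in place of the linear model.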

28
Effect of Sampling Method
MatrixMult Benchmark for long run execution