AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors, Eric Rotenberg, University of Wisconsin-Madison (PowerPoint transcript)

1
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
Eric Rotenberg, University of Wisconsin, Madison
  • Presented by Desta Mickey Tadesse

2
Fault Tolerance
  • Detection of and recovery from faults.
  • What is a fault?
  • Transient faults
  • Permanent faults

3
Transient Faults
  • Traditionally associated with corruption of
    stored data values.
  • History: first detected in 1954, in areas such as
    nuclear test sites.
  • Original causes were cosmic rays and alpha
    particles.
  • Short lifetime in most cases -> hardware
    recovers.
  • Affect memory circuits -> "soft errors".

4
Technology Trends
  • Moore's law: implementations require decreasing
    feature size and supply voltage.
  • Reduced capacitive node charge and noise margins.
  • Flip-flops will inevitably be affected by
    transient faults.
  • High clock rates
  • An increase in clock rate increases the
    probability of a new failure.
  • Example: a momentarily corrupted combinational
    signal is latched by a flip-flop.
  • Necessary evils: pushing performance -> increased
    faults.
  • Checking logic in current implementations will not
    guarantee correct execution.

5
Fault Tolerance Techniques
  • General techniques
  • Information Redundancy
  • Protecting data words with information coding
  • Parity or Hamming codes
  • ECC codes, mainly in memory arrays.
  • Cost is additional storage for coding overhead,
    plus checking logic.
  • Space Redundancy
  • Carrying out the same computation on multiple
    independent hardware units at the same time.
  • Errors are exposed by checking the independent
    results.
  • Causes large hardware overhead.
  • Good for permanent faults.
  • Time Redundancy
  • Execute the same computation on the same hardware
    at different times.
  • These are not all mutually exclusive: mix them up!
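The information-redundancy idea above can be sketched with the simplest code, a single even-parity bit; the bit widths and helper names here are illustrative, not from the slides:

```python
def parity_bit(word: int, width: int = 8) -> int:
    """Even parity: the stored bit makes the total number of 1s even."""
    ones = bin(word & ((1 << width) - 1)).count("1")
    return ones & 1

def encode(word: int) -> tuple[int, int]:
    """Store the word together with its parity bit."""
    return word, parity_bit(word)

def check(word: int, stored_parity: int) -> bool:
    """True if no fault is detected (catches any odd number of bit flips)."""
    return parity_bit(word) == stored_parity

word, p = encode(0b1011_0010)      # four 1s -> parity bit 0
assert check(word, p)              # stored copy is clean
corrupted = word ^ 0b0000_1000     # a single transient bit flip
assert not check(corrupted, p)     # the flip is exposed
```

A single parity bit only detects errors; the Hamming and ECC codes mentioned above spend more redundant bits to also locate and correct the flipped bit.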

6
Microarchitectural based fault tolerance
  • Aim is to detect broad coverage of transient
    faults
  • Low to moderate performance impact
  • Based on time redundancy
  • Active Stream/Redundant Stream Simultaneous
    Multithreading (AR-SMT)

7
Simultaneous Multithreading (SMT)
  • First introduced by researchers at the University
    of Washington in 1995.
  • Combines hardware features of superscalar and
    multithreaded processors.
  • GOAL: issue multiple instructions from multiple
    threads in each cycle.
  • How does it work?
  • Multithreading
  • Fine-grained/coarse-grained.
  • Contains hardware state for several threads.

8
Simultaneous Multithreading (SMT)
  • Select instructions from all threads to enter the
    pipeline and be executed.
  • Machine resources are dynamically allocated.
  • Takes advantage of out-of-order issue.
  • How are the pipeline stages shared?
  • Fetch
  • Focus is the instruction cache port (limited to
    accessing a contiguous range of addresses).
  • Difficult to share a single port among multiple
    threads.
  • Time-share or dual-port it.
  • Decode
  • No data dependence between threads (RISC):
    partition across threads (time share).
  • Complex instructions (CISC):
    share the decode stage.
  • Rename
  • Physical registers are allocated from a common
    pool.
  • Share this stage.

9
Simultaneous Multithreading (SMT)
  • Issue
  • Partition the issue stage.
  • Use Tomasulo's algorithm.
  • Wake-up and select logic
  • Wake-up is restricted to a single thread:
    partition!
  • Execute and Memory
  • Share!!
  • Retire
  • Check for exceptions and commit rename registers.
  • Check for WAW hazards.
  • Partition.
  • Pentium 4 -> Hyper-Threading
  • 2 threads

10
AR-SMT
  • Uses time redundancy.
  • Cheaper than the other two redundancies (minimal
    hardware overhead).
  • Basic approach -> allow a computation to be
    performed multiple times on the same hardware.
  • Run a program back to back and compare results:
    doubles execution time.
  • Alternative: forget about the pipeline and
    duplicate only at the execution stage, but
    that gives limited hardware coverage.

11
AR-SMT
  • Two explicit copies of the program run
    concurrently on the same processor resource.
  • Independent threads have their own program
    context.
  • Duplicate the whole pipeline.
  • Implemented by using SMT.
  • Detect dynamic faults by comparing the two
    threads.

12
AR-SMT: the threads
  • Two threads
  • Active Stream (A-stream): the lead-off batter
  • Instructions are fetched and executed like a
    regular thread.
  • The result of each instruction is pushed onto a
    FIFO queue called the Delay Buffer.
  • Results include
  • modifications to the program counter by branches
  • modifications to registers and memory
  • Redundant Stream (R-stream)
  • A copy of the A-stream
  • Starts behind the A-stream
  • A- and R-streams are concurrently processed using
    the existing SMT architecture.

13
AR-SMT: the threads (diagram)
14
Transient fault detection
  • R-stream results are compared to the buffered
    A-stream results.
  • Results match -> NO FAULT!
  • Results differ -> FAULT!
  • Three possible scenarios
  • Fault in the A-stream
  • Detected after some time, through the Delay
    Buffer.
  • Fault in the R-stream
  • Detected before the first affected instruction is
    committed.
  • Fault in both the A- and R-streams
  • Only the R-stream can detect the error.

15
Other microarchitectural trends
  • Data and control dependences
  • A hindrance to instruction-level parallelism.
  • Handle control dependence with branch predictors.
  • Handle data dependence with value predictors.
  • Assume both predictors work accurately.

16
Predictions
  • Prediction accuracy matters only for the
    A-stream.
  • The Delay Buffer contains the perfect predictions
    from the A-stream.
  • The R-stream will run a lot faster due to the
    perfect predictions and warmed-up caches and TLB.
  • Additional hardware required for detecting a
    difference between results: NONE!
  • Control predictors inherently contain a
    mechanism to detect mispredictions:
    compare the predicted outcome to the computed
    branch condition.
  • In the same way, the R-stream has a perfect
    predictor (the A-stream): compare the values
    predicted by the A-stream to the actual values
    the R-stream computes. A difference between the
    two denotes a fault.

17
Trace Processors
  • Goal is also to detect permanent hardware faults
    along with transient faults.
  • Use a hierarchy to virtually divide the processor
    into smaller processing elements.
  • Trace processors partition the instruction
    stream into larger units of work called traces.
  • Trace length: 16 or 32 instructions.
  • The processor is virtually divided into processing
    elements (PEs).
  • Make sure the A-stream and R-stream execute on
    different PEs.
  • Detect permanent faults by comparing results.
  • Cost: extra bits per trace in the Delay Buffer.
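One way to picture the "different PEs" constraint is a placement rule that offsets the R-stream copy of each trace; the round-robin base and the offset are hypothetical, the slides only require that the two copies never land on the same PE:

```python
def assign_pe(trace_id: int, stream: str, n_pes: int = 4) -> int:
    """Hypothetical placement rule: the R-stream copy of a trace is
    offset by one PE, so a trace never re-executes on the PE that ran
    it first.  A permanent fault in one PE then shows up as a result
    mismatch between the two copies."""
    base = trace_id % n_pes
    return base if stream == "A" else (base + 1) % n_pes

# Every trace's two copies run on distinct PEs.
for t in range(8):
    assert assign_pe(t, "A") != assign_pe(t, "R")
```

The per-trace PE number is the kind of information the extra bits per trace in the Delay Buffer would have to record.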

18
(No Transcript)
19
Implementation Issues
  • Handling register values
  • Each thread must have its own register state.
  • Register dependences in different threads should
    not interfere with each other.
  • Share a single physical register file.
  • Separate register maps per thread
  • This guarantees that the same logical register in
    two different threads will be mapped to different
    physical registers.
  • Advantage: flexibility in register requirements.
  • Handling memory values
  • The disambiguation unit enforces data dependences
    through memory. (SHARED)
  • Add a thread identifier to the memory address to
    stop memory dependences between threads from
    interfering with each other.
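The thread-identifier trick for memory disambiguation can be sketched as tagging each address with the stream's id above the address bits; the 32-bit address width and one tag bit are assumptions for illustration:

```python
ADDR_BITS = 32   # hypothetical address width

def tag_address(addr: int, thread_id: int) -> int:
    """Place the thread id above the address bits, so A- and R-stream
    accesses to the same location never match in the shared
    disambiguation unit and create a false cross-thread dependence."""
    return (thread_id << ADDR_BITS) | addr

a_load = tag_address(0x1000, thread_id=0)   # A-stream access
r_load = tag_address(0x1000, thread_id=1)   # R-stream access to same address

assert a_load != r_load                     # no false dependence detected
assert a_load & 0xFFFFFFFF == r_load & 0xFFFFFFFF   # same underlying address
```

The same idea underlies the separate register maps: identical logical names in the two threads resolve to disjoint physical resources.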

20
Pipeline Implementation: trace processor
  • Fetch/Dispatch
  • Time-shared (a trace is fetched as a unit).
  • Fetch/decode arbitration
  • If the Delay Buffer is full, the R-stream has
    priority for the fetch/decode stage.
  • Execution
  • Space-shared.
  • The unit of sharing is the processing element (PE).
  • Arbitrary scheduling of instructions using simple
    rules.
  • Retire
  • Time-shared.
  • Retirement stage arbitration
  • If the Delay Buffer is not full, the A-stream has
    priority to retire a trace.
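The two arbitration rules above boil down to checking Delay Buffer occupancy; a minimal sketch, with function names of my own choosing:

```python
def fetch_priority(buffered: int, capacity: int) -> str:
    """If the Delay Buffer is full, the A-stream cannot retire more
    results, so give the R-stream the fetch/decode slot to drain it."""
    return "R" if buffered >= capacity else "A"

def retire_priority(buffered: int, capacity: int) -> str:
    """If the Delay Buffer is not full, let the A-stream retire first
    so it keeps running ahead of the R-stream."""
    return "A" if buffered < capacity else "R"

assert fetch_priority(4, 4) == "R"    # buffer full: drain via R-stream
assert fetch_priority(2, 4) == "A"    # room left: A-stream runs ahead
assert retire_priority(2, 4) == "A"
assert retire_priority(4, 4) == "R"
```

Together the two rules form a feedback loop that keeps the A-stream a bounded distance ahead of the R-stream.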

21
Pipeline Implementation: trace processor (diagram)
22
Problems due to AR-SMT
  • The R-stream is not a true software context:
    the OS is not aware that such a program exists.
  • The R-stream needs its own physical memory
    image.
  • Solution
  • When allocating a physical page for a virtual page,
    make the OS allocate two contiguous pages to the
    A-stream.
  • The address translation has to be placed in the
    Delay Buffer by the A-stream for the R-stream.

23
Performance Evaluation
  • Simulate an AR-SMT trace processor.
  • Uses the SimpleScalar simulation platform.
  • Fault coverage is not evaluated.
  • Results
  • Used trace processors with 4 and 8 PEs.
  • 12-29% increase in execution time (4 PEs).
  • 5-27% increase (8 PEs).

24
Performance Evaluation

25
Other approaches to fault tolerance
  • DIVA
  • Detects a variety of faults (even design faults).
  • Uses a verified checker to validate the computation
    of a complex processing core.
  • Uses techniques similar to AR-SMT: the
    checker is able to keep pace with the core by
    using the values it is checking as predictions.
  • Slipstream
  • Uses the basic concepts of AR-SMT.
  • The A-stream (advanced stream) is shortened to run
    faster.
  • Drafting

26
  • My work is done here