AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors, Eric Rotenberg, University of Wisconsin-Madison (PowerPoint transcript)

1
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
Eric Rotenberg, University of Wisconsin, Madison
  • Presented by Desta Mickey Tadesse

2
Fault Tolerance
  • Detection of and recovery from faults.
  • What is a fault?
  • Transient faults
  • Permanent faults

3
Transient Faults
  • Traditionally associated with corruption of
    stored data values.
  • History: first detected in 1954, in areas such as
    nuclear test sites.
  • Original causes were cosmic rays and alpha
    particles.
  • Short lifetime in most cases -> hardware
    recovers.
  • Affect memory circuits -> "soft errors".

4
Technology Trends
  • Moore's law: implementations require decreasing
    feature size and supply voltage.
  • Reduced capacitive node charge and noise margins.
  • Flip-flops will inevitably be affected by
    transient faults.
  • High clock rates
  • An increase in clock rate increases the
    probability of a new failure.
  • Example: a momentarily corrupted combinational
    signal is latched by a flip-flop.
  • Necessary evils: pushing performance -> increased
    faults.
  • Checking logic in current implementations will not
    guarantee correct execution.

5
Fault Tolerance Techniques
  • General techniques
  • Information Redundancy
  • Protecting data words with information coding
  • Parity or Hamming codes
  • ECC codes, mainly in memory arrays.
  • Cost is additional storage for coding overhead,
    plus checking logic.
  • Space Redundancy
  • Carrying out the same computation on multiple
    independent hardware units at the same time.
  • Errors are exposed by checking the independent
    results.
  • Causes large hardware overhead.
  • Good for permanent faults.
  • Time Redundancy
  • Execute the same computation on the same hardware
    at different times.
  • These are not all mutually exclusive: mix them up!
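The information-redundancy idea above can be sketched with the simplest code, a single even-parity bit; the bit widths and helper names here are illustrative, not from the slides:

```python
def parity_bit(word: int, width: int = 8) -> int:
    """Even parity: the stored bit makes the total number of 1s even."""
    ones = bin(word & ((1 << width) - 1)).count("1")
    return ones & 1

def encode(word: int) -> tuple[int, int]:
    """Store the word together with its parity bit."""
    return word, parity_bit(word)

def check(word: int, stored_parity: int) -> bool:
    """True if no fault is detected (catches any odd number of bit flips)."""
    return parity_bit(word) == stored_parity

word, p = encode(0b1011_0010)      # four 1s -> parity bit 0
assert check(word, p)              # stored copy is clean
corrupted = word ^ 0b0000_1000     # a single transient bit flip
assert not check(corrupted, p)     # the flip is exposed
```

A single parity bit only detects errors; the Hamming and ECC codes mentioned above spend more redundant bits to also locate and correct the flipped bit.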

6
Microarchitectural based fault tolerance
  • Aim is to detect broad coverage of transient
    faults
  • Low to moderate performance impact
  • Based on time redundancy
  • Active Stream/Redundant Stream Simultaneous
    Multithreading (AR-SMT)

7
Simultaneous Multithreading (SMT)
  • First introduced by researchers at the University
    of Washington in 1995.
  • Combines hardware features of superscalar and
    multithreaded processors.
  • GOAL: issue multiple instructions from multiple
    threads in each cycle.
  • How does it work?
  • Multithreading
  • Fine-grained/coarse-grained.
  • Contains hardware state for several threads.

8
Simultaneous Multithreading (SMT)
  • Select instructions from all threads to enter the
    pipeline and be executed.
  • Machine resources are dynamically allocated.
  • Takes advantage of out-of-order issue.
  • How are the pipeline stages shared?
  • Fetch
  • Focus is the instruction cache port (limited to
    accessing a contiguous range of addresses).
  • Difficult to share a single port among multiple
    threads.
  • Time-share or dual-port it.
  • Decode
  • No data dependence between threads (RISC):
    partition across threads (time share).
  • Complex instructions (CISC):
    share the decode stage.
  • Rename
  • Physical registers are allocated from a common
    pool.
  • Share this stage.

9
Simultaneous Multithreading (SMT)
  • Issue
  • Partition the issue stage.
  • Use Tomasulo's algorithm.
  • Wake-up and select logic
  • Wake-up is restricted to a single thread:
    partition!
  • Execute and Memory
  • Share!!
  • Retire
  • Check for exceptions and commit rename registers.
  • Check for WAW hazards.
  • Partition.
  • Pentium 4 -> Hyper-Threading
  • 2 threads

10
AR-SMT
  • Uses time redundancy.
  • Cheaper than the other two redundancies (minimal
    hardware overhead).
  • Basic approach -> allow a computation to be
    performed multiple times on the same hardware.
  • Run a program back to back and compare results:
    doubles execution time.
  • Alternative: forget about the pipeline and
    duplicate only at the execution stage, but
    that gives limited hardware coverage.

11
AR-SMT
  • Two explicit copies of the program run
    concurrently on the same processor resource.
  • Independent threads have their own program
    context.
  • Duplicate the whole pipeline.
  • Implemented by using SMT.
  • Detect dynamic faults by comparing the two
    threads.

12
AR-SMT: the threads
  • Two threads
  • Active Stream (A-stream): the lead-off batter
  • Instructions are fetched and executed like a
    regular thread.
  • The result of each instruction is pushed onto a
    FIFO queue called the Delay Buffer.
  • Results include
  • modifications to the program counter by branches
  • modifications to registers and memory
  • Redundant Stream (R-stream)
  • A copy of the A-stream
  • Starts behind the A-stream
  • A- and R-streams are concurrently processed using
    the existing SMT architecture.

13
AR-SMT: the threads (diagram)
14
Transient fault detection
  • R-stream results are compared to the buffered
    A-stream results.
  • Results match -> NO FAULT!
  • Results differ -> FAULT!
  • Three possible scenarios
  • Fault in the A-stream
  • Detected after some time, through the Delay
    Buffer.
  • Fault in the R-stream
  • Detected before the first affected instruction is
    committed.
  • Fault in both the A- and R-streams
  • Only the R-stream can detect the error.

15
Other microarchitectural trends
  • Data and control dependences
  • A hindrance to instruction-level parallelism.
  • Handle control dependence with branch predictors.
  • Handle data dependence with value predictors.
  • Assume both predictors work accurately.

16
Predictions
  • Prediction accuracy matters only for the
    A-stream.
  • The Delay Buffer contains the perfect predictions
    from the A-stream.
  • The R-stream will run a lot faster due to the
    perfect predictions and warmed-up caches and TLB.
  • Additional hardware required for detecting a
    difference between results: NONE!
  • Control predictors inherently contain a
    mechanism to detect mispredictions:
    compare the predicted outcome to the computed
    branch condition.
  • In the same way, the R-stream has a perfect
    predictor (the A-stream): compare the values
    predicted by the A-stream to the actual values
    the R-stream computes. A difference between the
    two denotes a fault.

17
Trace Processors
  • Goal is also to detect permanent hardware faults
    along with transient faults.
  • Use a hierarchy to virtually divide the processor
    into smaller processing elements.
  • Trace processors partition the instruction
    stream into larger units of work called traces.
  • Trace length: 16 or 32 instructions.
  • The processor is virtually divided into processing
    elements (PEs).
  • Make sure the A-stream and R-stream execute on
    different PEs.
  • Detect permanent faults by comparing results.
  • Cost: extra bits per trace in the Delay Buffer.
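One way to picture the "different PEs" constraint is a placement rule that offsets the R-stream copy of each trace; the round-robin base and the offset are hypothetical, the slides only require that the two copies never land on the same PE:

```python
def assign_pe(trace_id: int, stream: str, n_pes: int = 4) -> int:
    """Hypothetical placement rule: the R-stream copy of a trace is
    offset by one PE, so a trace never re-executes on the PE that ran
    it first.  A permanent fault in one PE then shows up as a result
    mismatch between the two copies."""
    base = trace_id % n_pes
    return base if stream == "A" else (base + 1) % n_pes

# Every trace's two copies run on distinct PEs.
for t in range(8):
    assert assign_pe(t, "A") != assign_pe(t, "R")
```

The per-trace PE number is the kind of information the extra bits per trace in the Delay Buffer would have to record.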

18
(No Transcript)
19
Implementation Issues
  • Handling register values
  • Each thread must have its own register state.
  • Register dependences in different threads should
    not interfere with each other.
  • Share a single physical register file.
  • Separate register maps per thread
  • This guarantees that the same logical register in
    two different threads will be mapped to different
    physical registers.
  • Advantage: flexibility in register requirements.
  • Handling memory values
  • The disambiguation unit enforces data dependences
    through memory. (SHARED)
  • Add a thread identifier to the memory address to
    stop memory dependences between threads from
    interfering with each other.
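The thread-identifier trick for memory disambiguation can be sketched as tagging each address with the stream's id above the address bits; the 32-bit address width and one tag bit are assumptions for illustration:

```python
ADDR_BITS = 32   # hypothetical address width

def tag_address(addr: int, thread_id: int) -> int:
    """Place the thread id above the address bits, so A- and R-stream
    accesses to the same location never match in the shared
    disambiguation unit and create a false cross-thread dependence."""
    return (thread_id << ADDR_BITS) | addr

a_load = tag_address(0x1000, thread_id=0)   # A-stream access
r_load = tag_address(0x1000, thread_id=1)   # R-stream access to same address

assert a_load != r_load                     # no false dependence detected
assert a_load & 0xFFFFFFFF == r_load & 0xFFFFFFFF   # same underlying address
```

The same idea underlies the separate register maps: identical logical names in the two threads resolve to disjoint physical resources.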

20
Pipeline Implementation: trace processor
  • Fetch/Dispatch
  • Time-shared (a trace is fetched as a unit).
  • Fetch/decode arbitration
  • If the Delay Buffer is full, the R-stream has
    priority for the fetch/decode stage.
  • Execution
  • Space-shared.
  • The unit of sharing is the processing element (PE).
  • Arbitrary scheduling of instructions using simple
    rules.
  • Retire
  • Time-shared.
  • Retirement stage arbitration
  • If the Delay Buffer is not full, the A-stream has
    priority to retire a trace.
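The two arbitration rules above boil down to checking Delay Buffer occupancy; a minimal sketch, with function names of my own choosing:

```python
def fetch_priority(buffered: int, capacity: int) -> str:
    """If the Delay Buffer is full, the A-stream cannot retire more
    results, so give the R-stream the fetch/decode slot to drain it."""
    return "R" if buffered >= capacity else "A"

def retire_priority(buffered: int, capacity: int) -> str:
    """If the Delay Buffer is not full, let the A-stream retire first
    so it keeps running ahead of the R-stream."""
    return "A" if buffered < capacity else "R"

assert fetch_priority(4, 4) == "R"    # buffer full: drain via R-stream
assert fetch_priority(2, 4) == "A"    # room left: A-stream runs ahead
assert retire_priority(2, 4) == "A"
assert retire_priority(4, 4) == "R"
```

Together the two rules form a feedback loop that keeps the A-stream a bounded distance ahead of the R-stream.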

21
Pipeline Implementation: trace processor (diagram)
22
Problems due to AR-SMT
  • The R-stream is not a true software context:
    the OS is not aware that such a program exists.
  • The R-stream needs its own physical memory
    image.
  • Solution
  • When allocating a physical page for a virtual page,
    make the OS allocate two contiguous pages to the
    A-stream.
  • The address translation has to be placed in the
    Delay Buffer by the A-stream for the R-stream.

23
Performance Evaluation
  • Simulate an AR-SMT trace processor.
  • Uses the SimpleScalar simulation platform.
  • Fault coverage is not evaluated.
  • Results
  • Used trace processors with 4 and 8 PEs.
  • 12-29% increase in execution time (4 PEs).
  • 5-27% increase (8 PEs).

24
Performance Evaluation

25
Other approaches to fault tolerance
  • DIVA
  • Detects a variety of faults (even design faults).
  • Uses a verified checker to validate the computation
    of a complex processing core.
  • Uses techniques similar to AR-SMT: the
    checker is able to keep pace with the core by
    using the values it is checking as predictions.
  • Slipstream
  • Uses the basic concepts of AR-SMT.
  • The A-stream (advanced stream) is shortened to run
    faster.
  • Drafting

26
  • My work is done here