A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor

Transcript and Presenter's Notes

Title: A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor


1
A Microarchitectural Analysis of Soft Error
Propagation in a Production-Level Embedded
Microprocessor
  • Jason Blome, Scott Mahlke,
  • Daryl Bradley, Krisztián Flautner
  • Advanced Computer Architecture Lab, University of
    Michigan
  • ARM Ltd.

2
Embedded Everywhere
  • Not just cellphones
  • Safety critical applications
  • Automotive
  • Healthcare

Patterson and Hennessy 2005
3
Embedded Domain Constraints
  • Power efficient performance
  • Longer clock cycle times
  • Increased logic depth between stages
  • Higher area ratio of combinational logic to state
    elements
  • Less speculative state
  • Potentially less masking
  • Limited real estate

All of these high-level constraints affect the
behavior of faults and the potential of fault
tolerance techniques.
4
Objectives
  • Understand the effects of transient faults on a
    typical embedded design
  • Architectural contributions to soft error effects
  • Production-grade core
  • Reference synthesis flow
  • Design for test methodologies
  • Simulate faults in both combinational and
    sequential logic

5
Soft Error Rate Contributions
Mitra 2005
Shivakumar 2002
Increasing contribution of faults in
combinational logic to the overall soft error rate
6
Processor Model
  • ARM926EJ-S
  • Cell library characterized for 130 nm
  • 5 ns clock cycle time

[Figure: ARM926EJ-S block diagram. Blocks:
Instruction Fetch, Instruction Decode, Instruction
Address Logic, Instruction cache, Data cache, Data
Interface, Data Address Logic, MMU (shown twice),
Register Bank, Mux Array, Shift, Multiply, Write
Buffer/Bus Interface, Bus Interface]
7
Analysis Infrastructure
[Figure: fault injection/error analysis framework.
Components: testbench, benchmark, reference design,
test design, fault injection scheduler, error
checking and logging, report generation]
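The component names on this slide suggest a lock-step comparison between a fault-free reference design and a test design that receives injected faults. The following is a minimal sketch of that idea only, not the authors' framework: a toy 8-bit design stands in for the processor, and every name in it (`step`, `run_trial`, `run_campaign`, `report`, `LOAD_FLAG`) is hypothetical.

```python
import random

LOAD_FLAG = 0x80  # hypothetical: inputs with this bit set overwrite the state

def step(state, inp):
    """One clock cycle of a toy 8-bit design (stand-in for the processor)."""
    if inp & LOAD_FLAG:
        return inp & 0xFF            # load: the previous state is overwritten
    return (state + inp) & 0xFF      # otherwise accumulate

def run_trial(inputs, fault_cycle, fault_bit):
    """Run the reference and test designs in lock step.

    Flips one state bit of the test design at fault_cycle and returns the
    first cycle at which the two states differ, or None if the fault is masked.
    """
    ref = test = 0
    for cycle, inp in enumerate(inputs):
        if cycle == fault_cycle:
            test ^= 1 << fault_bit   # fault injection
        ref, test = step(ref, inp), step(test, inp)
        if ref != test:              # error checking
            return cycle
    return None

def run_campaign(inputs, num_trials, seed=0):
    """Fault injection scheduler: pick random (cycle, bit) pairs and log results."""
    rng = random.Random(seed)
    log = []
    for _ in range(num_trials):
        cycle = rng.randrange(len(inputs))
        bit = rng.randrange(8)
        log.append((cycle, bit, run_trial(inputs, cycle, bit)))
    return log

def report(log):
    """Report generation: percentage of injected faults that caused an error."""
    errors = sum(1 for _, _, first_error in log if first_error is not None)
    return 100.0 * errors / len(log)
```

In this toy design a fault injected in a cycle whose input carries `LOAD_FLAG` is overwritten before it can be observed, so `run_trial` returns `None` for it; faults injected in other cycles show up as a mismatch in the same cycle.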
8
Fault Masking
  • Logical: the faulted value does not affect the
    logical operation of the circuit
  • Architectural/Software: the incorrect state is
    written before it is read
  • Latching-Window: the fault pulse does not reach a
    state element within the latching window
  • Electrical: the fault pulse is electrically
    attenuated by subsequent gates in the circuit
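Two of these masking conditions can be written down as simple predicates. The sketch below is an illustration under stated assumptions, not a model taken from the presentation: logical masking is shown on a single two-input AND gate, and latching-window masking treats the window as the interval from `clock_edge - setup` to `clock_edge + hold`. The function names and the setup and hold parameters are hypothetical.

```python
def logically_masked(a, b_good, b_faulty):
    """Logical masking on an AND gate: the fault on input b is masked
    when the gate output is the same with and without the fault."""
    return (a & b_good) == (a & b_faulty)

def latching_window_masked(pulse_start, pulse_width, clock_edge, setup, hold):
    """Latching-window masking: the pulse is masked when it does not
    overlap the interval [clock_edge - setup, clock_edge + hold]."""
    pulse_end = pulse_start + pulse_width
    window_start = clock_edge - setup
    window_end = clock_edge + hold
    return pulse_end < window_start or pulse_start > window_end
```

For example, with the other AND input at 0 the fault cannot change the output, so `logically_masked(0, 1, 0)` holds. With a clock edge at 5.0 (the 5 ns cycle time from the processor model) and assumed setup and hold times of 0.1, a 0.2-wide pulse starting at 1.0 is masked, while one starting at 4.85 is not.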

9
Observed Error Rates
Faults Occurring in Registers
Error Site                 Error Rate (%)  Masking Rate (%)
Microarchitectural State   94              6
Architectural State        7               93
Top-level Ports            4               96

Faults Occurring in Combinational Logic
Error Site                 Error Rate (%)  Masking Rate (%)
Microarchitectural State   16              84
Architectural State        4               96
Top-level Ports            3               97

At the software interface, error rates are within 3%
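The relationship between the two columns, and one possible reading of the closing remark, can be checked with a few lines of arithmetic. This sketch assumes the slide's values are percentages (each row sums to 100) and that the remark compares register faults with combinational-logic faults at the same error site. Both are assumptions, and the names are hypothetical.

```python
# Error rates from the slide, read as percentages (assumption).
ERROR_RATE = {
    "registers": {
        "Microarchitectural State": 94,
        "Architectural State": 7,
        "Top-level Ports": 4,
    },
    "combinational logic": {
        "Microarchitectural State": 16,
        "Architectural State": 4,
        "Top-level Ports": 3,
    },
}

def masking_rate(error_rate):
    """Masking rate is the complement of the error rate."""
    return 100 - error_rate

def gap_between_sources(site):
    """Difference in error rate between the two fault sources at one site."""
    return abs(ERROR_RATE["registers"][site]
               - ERROR_RATE["combinational logic"][site])

for site in ERROR_RATE["registers"]:
    print(site, gap_between_sources(site))
```

Under this reading the two fault sources differ widely at the microarchitectural level but by only a few points at the architectural state and the top-level ports.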
10
Observed Error Rates
Faults Occurring in Registers
Cycle   Average Bit Errors
1       1.26
2       3.19
3       3.06
4       5.52

Faults Occurring in Combinational Logic
Cycle   Average Bit Errors
1       41.49
2       45.33
3       47.76
4       49.54

Faults in combinational logic have a much more
dramatic effect on system state
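To put a number on "much more dramatic", the per-cycle averages for the two fault sources can be divided. This is plain arithmetic on the values shown on the slide; the dictionary and function names are hypothetical.

```python
# Average bit errors per cycle, copied from the slide.
REGISTER_FAULTS = {1: 1.26, 2: 3.19, 3: 3.06, 4: 5.52}
LOGIC_FAULTS = {1: 41.49, 2: 45.33, 3: 47.76, 4: 49.54}

def logic_to_register_ratio(cycle):
    """How many times more bits are in error for a combinational-logic
    fault than for a register fault in the given cycle."""
    return LOGIC_FAULTS[cycle] / REGISTER_FAULTS[cycle]

for cycle in sorted(REGISTER_FAULTS):
    print(cycle, round(logic_to_register_ratio(cycle), 1))
```

The ratio is largest in the first cycle and shrinks in later cycles as register faults spread, but it stays well above 1 in every cycle listed.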
11
Architectural Errors per Cycle
[Figure: two panels, Faults Occurring in Registers
and Faults Occurring in Combinational Logic]
12
Architectural Corruption Characteristics
[Figure: two panels, Bits per Architectural Register
Corrupted and Number of Architectural Registers
Corrupted]
13
Results Summary
  • Faults occurring in logic
      • Will likely be much more frequent in embedded
        designs
      • Tend to have a more dramatic effect on system
        state
      • Multi-bit/multi-register architectural errors
        are common
  • Design for test methodologies can greatly impact
    soft error characteristics
  • Error rates at the software interface are
    consistent with those observed in
    high-performance microprocessors

14
Traditional Error Detection/Protection
  • Reliable Encoding
      • ECC/Parity
      • Limited use for faults in logic
      • Unclear where/how much to protect
  • Redundant Computation
      • In space
          • Area/energy overhead
      • In time
          • Energy overhead
          • Requires performance slack
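The "limited use for faults in logic" point can be illustrated with a single parity bit. The sketch below is a simplified illustration, not a scheme described in the presentation: the parity is assumed to be computed when a value is written to storage, and all function names are hypothetical.

```python
def parity(word):
    """Even-parity bit: 1 when the word has an odd number of set bits."""
    return bin(word).count("1") & 1

def store(value):
    """Encode at write time: keep the value together with its parity bit."""
    return value, parity(value)

def check(stored):
    """Return True when the stored value still matches its parity bit."""
    value, parity_bit = stored
    return parity(value) == parity_bit

# Fault in a state element: the bit flips after encoding, so the
# stored parity no longer matches and the error is detected.
value, parity_bit = store(0b1011)
state_fault = (value ^ 0b0100, parity_bit)

# Fault in logic: the adder output is already wrong when it is encoded,
# so the parity is computed over the wrong value and the check passes.
faulty_sum = (3 + 4) ^ 0b0100
logic_fault = store(faulty_sum)

print(check(state_fault), check(logic_fault))
```

The second case is why encoding alone does not cover combinational logic: the code protects the value that was stored, not the computation that produced it.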

15
Case Study I
[Figure: ARM926EJ-S block diagram (same blocks as on
slide 6), labeled IRoute]
16
Case Study II
[Figure: ARM926EJ-S block diagram (same blocks as on
slide 6), labeled IPipe]
17
Fault Characteristics
  • Case Study I: uCORE.uIRoute.U600
      • First cycle error sites: 51 errors
      • uIRoute.INSTRHeld_reg0
      • uIRoute.INSTRHeld_reg16
      • uIRoute.INSTRHeld_reg22
      • uIRoute.INSTRHeld_reg31
      • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg0
      • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg16
      • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg31
      • u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg29
      • u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg30
  • Case Study II:
    uCORE.u9EJ.uARM9.uCORECTL.uIPIPE.U3626
      • First cycle error sites: 9 errors
      • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg3
      • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg12
      • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg17
      • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg18
      • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg24

18
Embedded Design Space Potential
  • Leverage significant signal fanout
  • Determine that a fault has occurred during the
    cycle that it occurs
  • Transition detection circuits
  • Selectively deploy fault detection units
  • Intersection of high fanout fault targets
  • No roll-back necessary: simply flush the
    pipeline
  • Low cost/area overhead critical for embedded
    designs
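One way to read "intersection of high fanout fault targets" is to look for registers that appear in the first-cycle error sites of more than one fault. The sketch below applies that reading to the error sites listed for the two case studies. It is an interpretation, not the authors' selection method: the slide lists only some of the sites for each fault, and the helper names (`register_name`, `shared_targets`, `ranked_targets`) are hypothetical.

```python
import re
from collections import Counter

# First-cycle error sites as listed for the two case studies.
ERROR_SITES = {
    "Case Study I": [
        "uIRoute.INSTRHeld_reg0",
        "uIRoute.INSTRHeld_reg16",
        "uIRoute.INSTRHeld_reg22",
        "uIRoute.INSTRHeld_reg31",
        "u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg0",
        "u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg16",
        "u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg31",
        "u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg29",
        "u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg30",
    ],
    "Case Study II": [
        "u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg3",
        "u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg12",
        "u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg17",
        "u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg18",
        "u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg24",
    ],
}

def register_name(site):
    """Drop the hierarchy prefix and the trailing bit index,
    e.g. '...uIPIPE.IDarmDeint_reg16' becomes 'IDarmDeint'."""
    leaf = site.rsplit(".", 1)[-1]
    return re.sub(r"_reg\d+$", "", leaf)

def shared_targets(sites_by_fault):
    """Registers hit by every fault: candidate spots for a detector."""
    per_fault = [{register_name(s) for s in sites}
                 for sites in sites_by_fault.values()]
    return set.intersection(*per_fault)

def ranked_targets(sites_by_fault):
    """Registers ordered by how many listed error bits fall in them."""
    counts = Counter(register_name(s)
                     for sites in sites_by_fault.values() for s in sites)
    return counts.most_common()

print(shared_targets(ERROR_SITES))
print(ranked_targets(ERROR_SITES))
```

For the sites listed, the only register common to both faults is `IDarmDeint`, which is the kind of shared target where a single detection unit could cover several fault sources.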

19
Conclusion
  • Design domain critical
      • Affects fault behavior
      • Limits applicable tolerance techniques
  • Key observations
      • Faults in combinational logic much more likely
        in embedded designs
      • Faults in combinational logic behave
        dramatically differently than those in state
        elements
      • Fault fanout offers potential for low overhead
        detection

20
Soft Error Terminology
transistor
21
Dependence on Fault Duration
22
Pulse Detection
[Figure: pulse detection circuit. Labels: flip-flop,
shadow latch, D, CLK, Q, Q, error]
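The slide gives only the circuit labels, so the following is a behavioral sketch under an assumed mode of operation rather than a description of the actual circuit: the flip-flop samples D at the clock edge, the shadow latch samples the same signal a short delay later, and the error signal is raised when the two samples disagree. The function names and the `shadow_delay` parameter are hypothetical.

```python
def d_signal(t, good_value, pulse_start, pulse_width):
    """Value on D at time t: inverted while the transient pulse is active."""
    in_pulse = pulse_start <= t < pulse_start + pulse_width
    return good_value ^ 1 if in_pulse else good_value

def detect(good_value, pulse_start, pulse_width, clock_edge, shadow_delay):
    """Return (q, q_shadow, error) for one clock edge.

    q is the flip-flop sample at the clock edge, q_shadow is the shadow
    latch sample taken shadow_delay later, and error is their mismatch.
    """
    q = d_signal(clock_edge, good_value, pulse_start, pulse_width)
    q_shadow = d_signal(clock_edge + shadow_delay, good_value,
                        pulse_start, pulse_width)
    return q, q_shadow, q != q_shadow

# A short pulse across the clock edge corrupts q but not the shadow sample.
print(detect(1, 4.9, 0.2, 5.0, 0.5))
# A pulse far from the edge is never latched, so there is nothing to flag.
print(detect(1, 1.0, 0.2, 5.0, 0.5))
# A pulse long enough to cover both samples escapes this simple check.
print(detect(1, 4.9, 1.0, 5.0, 0.5))
```

The third case shows why detection of this kind depends on fault duration: a pulse wider than the assumed sampling gap corrupts both samples identically.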
23
Microarchitectural Errors per Cycle
[Figure: two panels, Faults Occurring in Registers
and Faults Occurring in Combinational Logic]
Multi-bit errors are common for faults in
combinational logic