Title: A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor
1A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor
- Jason Blome, Scott Mahlke
- Daryl Bradley, Krisztián Flautner
- Advanced Computer Architecture Lab, University of Michigan
- ARM Ltd.
2Embedded Everywhere
- Not just cellphones
- Safety critical applications
- Automotive
- Healthcare
Patterson and Hennessy 2005
3Embedded Domain Constraints
- Power efficient performance
- Longer clock cycle times
- Increased logic depth between stages
- Higher area ratio of combinational logic to state elements
- Less speculative state
- Potentially less masking
- Limited real estate
All of these high-level constraints affect the
behavior of faults and the potential of fault
tolerance techniques
4Objectives
- Understand the effects of transient faults on a typical embedded design
- Architectural contributions to soft error effects
- Production-grade core
- Reference synthesis flow
- Design for test methodologies
- Simulate faults in both combinational and
sequential logic
5Soft Error Rate Contributions
Mitra 2005
Shivakumar 2002
Increasing contribution of faults in
combinational logic to the overall soft error rate
6Processor Model
- ARM926EJ-S
- Cell library characterized for 130 nm
- 5 ns clock cycle time
[Block diagram of the ARM926EJ-S: Instruction Fetch, Instruction Decode, Instruction Address Logic, Instruction cache, MMU (labeled twice), Register Bank, Mux Array, Shift, Multiply, Data Address Logic, Data Interface, Data cache, Write Buffer/Bus Interface, Bus Interface]
7Analysis Infrastructure
[Diagram: fault injection/error analysis framework. Components: testbench, reference design, test design, benchmark, fault injection scheduler, error checking and logging, report generation]
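The components named on this slide suggest a lockstep comparison between a fault-free reference design and a faulted test design. A minimal Python sketch of that idea follows, under stated assumptions: `ToyDesign`, `run_campaign`, the 8-bit accumulator, and the single-bit-flip fault model are all hypothetical stand-ins for illustration, not the authors' framework or the ARM926EJ-S netlist.

```python
import random

class ToyDesign:
    """Hypothetical stand-in for a simulated design: one 8-bit accumulator."""
    def __init__(self):
        self.acc = 0

    def step(self, operand):
        # One clock cycle: accumulate the operand, wrapping at 8 bits.
        self.acc = (self.acc + operand) & 0xFF

    def flip_bit(self, bit):
        # Fault injection: invert one state bit (assumed single-bit-flip model).
        self.acc ^= 1 << bit

def run_campaign(benchmark, inject_cycle, inject_bit):
    """Run reference and test designs in lockstep; log cycles where state differs."""
    reference, test = ToyDesign(), ToyDesign()
    error_log = []
    for cycle, operand in enumerate(benchmark):
        reference.step(operand)
        test.step(operand)
        if cycle == inject_cycle:          # fault injection scheduler
            test.flip_bit(inject_bit)
        diff = reference.acc ^ test.acc    # error checking
        if diff:                           # logging: (cycle, number of differing bits)
            error_log.append((cycle, bin(diff).count("1")))
    return error_log

random.seed(0)
benchmark = [random.randrange(256) for _ in range(6)]
log = run_campaign(benchmark, inject_cycle=2, inject_bit=3)
clean_log = run_campaign(benchmark, inject_cycle=-1, inject_bit=3)
print(log)
```

In this toy the injected error is never masked, because the accumulator is never overwritten; a real design would show the masking effects the following slide describes.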
8Fault Masking
- Logical: the faulted value does not affect the logical
operation of the circuit
- Architectural/Software: the incorrect state is
overwritten before it is read
- Latching-Window: the fault pulse does not reach a
state element within the latching window
- Electrical: the fault pulse is electrically
attenuated by subsequent gates in the circuit
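Logical masking, the first category above, can be illustrated by exhaustively inverting one input of a small gate network and counting how often the output is unchanged. The sketch below assumes a hypothetical three-input circuit, `out = (a AND b) OR c`, chosen only for illustration; it is not taken from the processor under study.

```python
from itertools import product

def circuit(a, b, c):
    # Hypothetical 3-input network: out = (a AND b) OR c
    return (a & b) | c

def logical_masking_rate(faulty_input):
    """Fraction of input vectors for which inverting one input
    leaves the circuit output unchanged (the fault is logically masked)."""
    vectors = list(product((0, 1), repeat=3))
    masked = 0
    for vec in vectors:
        faulted = list(vec)
        faulted[faulty_input] ^= 1   # inject the fault on the chosen input
        if circuit(*vec) == circuit(*faulted):
            masked += 1
    return masked / len(vectors)

rate_a = logical_masking_rate(0)   # fault on input a
rate_c = logical_masking_rate(2)   # fault on input c
print(rate_a, rate_c)
```

A fault on `a` is masked whenever `b` is 0 or `c` is 1, whereas a fault on `c` is masked only when `a AND b` is already 1, so the two inputs see very different masking rates.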
9Observed Error Rates
Faults Occurring in Registers
Error Site                 Error Rate (%)   Masking Rate (%)
Microarchitectural State   94               6
Architectural State        7                93
Top-level Ports            4                96
Faults Occurring in Combinational Logic
Error Site                 Error Rate (%)   Masking Rate (%)
Microarchitectural State   16               84
Architectural State        4                96
Top-level Ports            3                97
At the software interface, error rates within 3%
10Observed Error Rates
Faults Occurring in Registers
Cycle   Average Bit Errors
1       1.26
2       3.19
3       3.06
4       5.52
Faults Occurring in Combinational Logic
Cycle   Average Bit Errors
1       41.49
2       45.33
3       47.76
4       49.54
Faults in combinational logic have a much more
dramatic effect on system state
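To put a number on "much more dramatic", the per-cycle averages from the two tables on this slide can be divided directly. The short Python sketch below only restates the slide's figures; the variable names are illustrative.

```python
# Average bit errors per cycle, copied from the two tables on this slide.
register_bit_errors = {1: 1.26, 2: 3.19, 3: 3.06, 4: 5.52}
logic_bit_errors = {1: 41.49, 2: 45.33, 3: 47.76, 4: 49.54}

# Ratio of combinational-logic to register bit errors, cycle by cycle.
ratios = {
    cycle: logic_bit_errors[cycle] / register_bit_errors[cycle]
    for cycle in register_bit_errors
}

for cycle, ratio in ratios.items():
    print(f"cycle {cycle}: {ratio:.1f}x")
```

On these figures, faults in combinational logic corrupt roughly 9 to 33 times as many bits per cycle as faults in registers, with the largest gap in the first cycle.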
11Architectural Errors per Cycle
[Plots: architectural errors per cycle, for faults occurring in registers and for faults occurring in combinational logic]
12Architectural Corruption Characteristics
[Plots: bits per architectural register corrupted; number of architectural registers corrupted]
13Results Summary
- Faults occurring in logic
- Will likely be much more frequent in embedded designs
- Tend to have a more dramatic effect on system state
- Multi-bit/multi-register architectural errors common
- Design for test methodologies can greatly impact soft error characteristics
- Error rates at the software interface consistent with those observed in high-performance microprocessors
14Traditional Error Detection/Protection
- Reliable Encoding
- ECC/Parity
- Limited use for faults in logic
- Unclear where/how much to protect
- Redundant Computation
- In space
- Area/energy overhead
- In time
- Energy overhead
- Requires performance slack
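Why parity is of "limited use for faults in logic" can be seen in a small sketch. It assumes, for illustration only, that the parity bit is generated at the moment a value is written into a state element; the helper names `parity`, `store`, and `check` are hypothetical.

```python
def parity(value):
    # Even-parity bit: 1 when the value has an odd number of set bits.
    return bin(value).count("1") & 1

def store(value):
    # Assumption: parity is generated when the value is written to the register.
    return {"data": value, "parity": parity(value)}

def check(word):
    # True when the stored data still matches its parity bit.
    return parity(word["data"]) == word["parity"]

# Fault in the state element: a bit flips after the parity bit was generated.
reg_fault = store(0b1010_0110)
reg_fault["data"] ^= 1 << 4
detected_register_fault = not check(reg_fault)

# Fault in combinational logic: the adder result is corrupted before it is
# latched, so parity is computed over the already-wrong value.
logic_result = (0b1010_0000 + 0b0000_0110) ^ (1 << 4)
logic_fault = store(logic_result)
detected_logic_fault = not check(logic_fault)

print(detected_register_fault, detected_logic_fault)
```

The register fault is caught, while the logic fault passes the check because the encoding faithfully protects an incorrect result.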
15Case Study I
IRoute
[Block diagram: the ARM926EJ-S blocks shown on slide 6]
16Case Study II
IPipe
[Block diagram: the ARM926EJ-S blocks shown on slide 6]
17Fault Characteristics
- Case Study I: uCORE.uIRoute.U600
- First cycle error sites: 51 errors
- uIRoute.INSTRHeld_reg0
- uIRoute.INSTRHeld_reg16
- uIRoute.INSTRHeld_reg22
- uIRoute.INSTRHeld_reg31
- u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg0
- u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg16
- u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg31
- u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg29
- u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg30
- Case Study II: uCORE.u9EJ.uARM9.uCORECTL.uIPIPE.U3626
- First cycle error sites: 9 errors
- u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg3
- u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg12
- u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg17
- u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg18
- u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg24
18Embedded Design Space Potential
- Leverage significant signal fanout
- Determine that a fault has occurred during the cycle that it occurs
- Transition detection circuits
- Selectively deploy fault detection units
- Intersection of high fanout fault targets
- No roll-back necessary; simply flush the pipeline
- Low cost/area overhead critical for embedded designs
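One way to read "selectively deploy fault detection units" at the "intersection of high fanout fault targets" is as a coverage problem: monitor the few registers that the most fault sites fan out to. The Python sketch below uses a greedy set-cover heuristic, which is a technique swapped in for illustration and not something the slides specify; the gate and register names are hypothetical.

```python
def pick_detector_sites(fanout, budget):
    """Greedily choose up to `budget` registers so that as many fault sites
    as possible have at least one monitored register in their fanout.

    fanout: dict mapping a fault site to the set of registers it can corrupt.
    Returns (chosen registers, fault sites left uncovered).
    """
    uncovered = set(fanout)
    chosen = []
    while uncovered and len(chosen) < budget:
        candidates = {reg for site in uncovered for reg in fanout[site]}
        if not candidates:
            break
        # Pick the register reached by the most still-uncovered fault sites;
        # sorting makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda reg: sum(reg in fanout[s] for s in uncovered))
        chosen.append(best)
        uncovered = {s for s in uncovered if best not in fanout[s]}
    return chosen, uncovered

# Hypothetical first-cycle fanout map (not data from the study).
fanout = {
    "gate_A": {"reg0", "reg1", "reg2"},
    "gate_B": {"reg1", "reg3"},
    "gate_C": {"reg1", "reg4"},
    "gate_D": {"reg4", "reg5"},
}
chosen, uncovered = pick_detector_sites(fanout, budget=2)
print(chosen, uncovered)
```

With a budget of two detectors, the shared target `reg1` covers three of the four fault sites, which is the kind of leverage the slide attributes to high fanout.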
19Conclusion
- Design domain critical
- Affects fault behavior
- Limits applicable tolerance techniques
- Key observations
- Faults in combinational logic much more likely in embedded designs
- Faults in combinational logic behave dramatically differently than those in state elements
- Fault fanout offers potential for low overhead detection
20Soft Error Terminology
[Figure label: transistor]
21Dependence on Fault Duration
22Pulse Detection
[Circuit diagram: a flip-flop (D, Q, CLK) alongside a shadow latch (Q), with an error output]
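The labels on this slide (flip-flop, shadow latch, error) suggest comparing two samples of the same data input. The Python sketch below assumes the shadow latch samples D a short delay after the clock edge and that a mismatch raises the error signal; the slide does not spell out the timing, so the delay, the waveform format, and the function names are all illustrative assumptions.

```python
def sample(waveform, t):
    """Value of a piecewise-constant signal at time t.
    waveform: list of (start_time, value) pairs sorted by start_time."""
    value = waveform[0][1]
    for start, v in waveform:
        if start <= t:
            value = v
    return value

def detect_pulse(d_waveform, clk_edge, shadow_delay):
    """Sample D at the clock edge (main flip-flop) and again after
    shadow_delay (shadow latch); flag an error when the two disagree."""
    q_main = sample(d_waveform, clk_edge)
    q_shadow = sample(d_waveform, clk_edge + shadow_delay)
    return q_main, q_shadow, q_main != q_shadow

# Times in ns. D should be 0, but a transient pulse drives it to 1
# from t=4.9 to t=5.2, straddling the clock edge at t=5.0.
glitchy = [(0.0, 0), (4.9, 1), (5.2, 0)]
clean = [(0.0, 0)]

glitch_result = detect_pulse(glitchy, clk_edge=5.0, shadow_delay=0.5)
clean_result = detect_pulse(clean, clk_edge=5.0, shadow_delay=0.5)
print(glitch_result, clean_result)
```

Because the mismatch is visible in the same cycle the pulse is latched, detection of this kind would fit the earlier point that no roll-back is needed beyond a pipeline flush.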
23Microarchitectural Errors per Cycle
[Plots: microarchitectural errors per cycle, for faults occurring in registers and for faults occurring in combinational logic]
Multi-bit errors common for faults in
combinational logic