Title: Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor. 31st Annual International Symposium on Computer Architecture (ISCA), Munich, Germany, June 2004
1 Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor
31st Annual International Symposium on Computer Architecture (ISCA), Munich, Germany, June 2004
Shubu Mukherjee (1), Christopher Weaver (1), Joel Emer (1), Steve Reinhardt (1,2)
(1) Massachusetts Microprocessor Design Center, Intel
(2) University of Michigan, Ann Arbor
2 Outline
- Trade-off performance for lower soft error rate
- MITF (mean instructions to failure)
- reduce errors by keeping objects longer in protected memory
- False Detected Unrecoverable Errors
- processor would unnecessarily crash on such an error
- techniques to avoid false errors
- π (possibly incorrect) bit
- anti-π bit
3 Alpha or Neutron Particle Strike Changes State of a Single Bit
[Figure: a stored bit flips between 0 and 1]
4 Silent Data Corruption (SDC)
[Flowchart]
- Bit read? no: benign fault, no error
- Bit read? yes: does the bit have error protection (none, detection only, or detection & correction)?
- No protection: does the fault affect program outcome? no: benign fault, no error; yes: SDC
5 SDC Definitions
- SDC = Silent Data Corruption
- MTTF = Mean Time to Failure
- SDC MTTF = time between two SDC events
- Chip SDC Rate (inversely proportional to MTTF)
- = rate of occurrence of SDC events
- = Σ over all bits (Circuit Soft Error Rate) × (SDC AVF)
- Target market will typically set SDC budget
- note: budget is non-zero
- Circuit Soft Error Rate
- determined by alpha or neutron flux, circuit parameters, etc.
- AVF (Architectural Vulnerability Factor), Mukherjee et al., MICRO '03
- fraction of strikes that affect program outcome
- AVF = 0% for branch predictor
- AVF = 100% for program counter
- AVF < 100% for instruction queue
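The chip SDC rate formula on this slide can be sketched as a sum over structures. This is a minimal illustration, not the authors' tool: the bit counts and the per-bit circuit soft error rate are hypothetical placeholders, and every bit in a structure is assumed to share that structure's AVF. The AVF values for the branch predictor and program counter come from the slide; the 29% instruction-queue value is the figure reported later in the talk.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    name: str
    bits: int                   # hypothetical bit count
    circuit_ser_per_bit: float  # hypothetical raw error rate per bit (arbitrary units)
    sdc_avf: float              # fraction of strikes that affect program outcome


def chip_sdc_rate(structures):
    """Sum over all bits of (circuit soft error rate) x (SDC AVF)."""
    return sum(s.bits * s.circuit_ser_per_bit * s.sdc_avf for s in structures)


structures = [
    Structure("branch predictor", 4096, 0.001, 0.0),    # AVF = 0%
    Structure("program counter", 64, 0.001, 1.0),       # AVF = 100%
    Structure("instruction queue", 2048, 0.001, 0.29),  # AVF < 100%
]

rate = chip_sdc_rate(structures)
print(f"chip SDC rate (arbitrary units): {rate:.5f}")
```

Under these assumed sizes the branch predictor contributes nothing despite being the largest structure, which is the point of weighting the raw rate by AVF.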
6 Instruction Queue's SDC AVF
Similar to Mukherjee et al., MICRO '03
CPU2000, Asim, SimPoint, Itanium2-like
7 SDC Reduction Techniques
- Chip SDC Rate = Σ over all bits (Circuit Soft Error Rate) × (SDC AVF)
- Conventional techniques
- process technology (e.g., fully-depleted SOI)
- circuit technology (e.g., radiation-hardened cells)
- error detection or correction codes (e.g., parity, ECC)
- Our new technique
- reduce exposure to radiation to reduce SDC AVF
- trade off between performance and soft error rate
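8 MITF: mean instructions to failure (work between two errors)
$$
\begin{aligned}
\text{MITF} &= \frac{\#\,\text{instructions committed}}{\#\,\text{errors encountered}} \\
&= \frac{\text{IPC} \times \#\,\text{cycles}}{\#\,\text{errors encountered}} \\
&= \frac{\text{IPC} \times \text{total time} \times \text{frequency}}{\#\,\text{errors encountered}} \\
&= \text{IPC} \times \text{MTTF} \times \text{frequency} \\
&= \frac{\text{IPC} \times \text{frequency}}{(\text{Circuit Soft Error Rate}) \times \text{AVF}} \\
&= \frac{\text{frequency}}{\text{Circuit Soft Error Rate}} \times \frac{\text{IPC}}{\text{AVF}}
\end{aligned}
$$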
9 Reducing SDC of an Instruction Queue (IQ) (assume protected instruction cache)
- Increase IPC: fetch aggressively from IC to IQ
- Reduce SDC AVF: prevent instructions from sitting needlessly in IQ
- Net benefit if we improve MITF (proportional to IPC / AVF)
10 Squash Instructions
- Goal
- don't have instructions sit needlessly in the Instruction Queue
- Algorithm to Reduce Exposure to Radiation
- Trigger: Cache Miss
- Action: Squash all instructions in instruction queue following the Load Miss
- Evaluation using
- Asim Performance Model Framework
- First 100 million instruction simpoint of all CPU2000 benchmarks
- Itanium2-like architecture, but scaled (note: in-order machine)
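The trigger/action pair above can be sketched as a small queue operation. This is a simplified model, not the evaluated design: the function name, the list-based queue, and the instruction strings are hypothetical, and it assumes squashed instructions are refetched later from the protected instruction cache.

```python
def squash_after_load_miss(iq, miss_index):
    """Split an in-order instruction queue at a load that missed in the cache.

    Returns (kept, squashed): the load and everything older stay in the
    queue; everything younger is removed so it does not sit exposed to
    particle strikes while the miss is outstanding.
    """
    kept = list(iq[: miss_index + 1])
    squashed = list(iq[miss_index + 1:])
    return kept, squashed


# Oldest instruction first; the load at index 0 misses in the cache.
iq = ["ld r1", "add r2", "mul r3", "st r4"]
kept, squashed = squash_after_load_miss(iq, miss_index=0)
print("kept:", kept)
print("squashed (to be refetched):", squashed)
```

The cost is the refetch work, which lowers IPC slightly; the benefit is a lower SDC AVF, so the technique pays off only if IPC / AVF improves.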
11 SDC MITF Improvement from Reducing Exposure
Design Point             | IPC  | SDC AVF | MITF Improvement
Baseline                 | 1.21 | 29%     | 0%
Squash on L1 load misses | 1.19 | 22%     | 37%
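Because frequency and circuit soft error rate are the same for both design points, MITF can be compared through IPC / AVF alone. The sketch below does that with the rounded values from this slide; the function name is hypothetical.

```python
def relative_mitf(ipc, avf):
    """MITF up to the constant factor (frequency / circuit soft error rate)."""
    return ipc / avf


baseline = relative_mitf(ipc=1.21, avf=0.29)
squash_l1 = relative_mitf(ipc=1.19, avf=0.22)
improvement = squash_l1 / baseline - 1.0

print(f"baseline IPC/AVF  = {baseline:.2f}")
print(f"squash-L1 IPC/AVF = {squash_l1:.2f}")
print(f"improvement       = {improvement:.1%}")
```

With these rounded inputs the ratio works out to roughly a 30% gain rather than the 37% shown on the slide. The slide's figure presumably comes from unrounded or per-benchmark data, which this transcript does not include.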
12 Outline
- Trade-off performance for lower soft error rate
- MITF (mean instructions to failure)
- reduce errors by keeping objects longer in protected memory
- False Detected Unrecoverable Errors
- processor would unnecessarily crash on such an error
- techniques to avoid false errors
- π (possibly incorrect) bit
- anti-π bit
13 Detected Unrecoverable Error (DUE)
[Flowchart]
- Bit read? no: benign fault, no error
- Bit read? yes: does the bit have error protection?
- Detection & correction: no error
- Detection only: does the fault affect program outcome? yes: True DUE; no: False DUE
- No protection: does the fault affect program outcome? yes: SDC; no: benign fault, no error
14 DUE Definitions
- DUE = Detected Unrecoverable Error
- MTTF = Mean Time to Failure
- DUE MTTF = time between two DUE events
- Chip DUE Rate (inversely proportional to MTTF)
- = rate of occurrence of DUE events
- = Σ over all bits (Circuit Soft Error Rate) × (DUE AVF)
- Target market will typically set DUE budget
- note: budget is non-zero
- Circuit Soft Error Rate
- determined by alpha or neutron flux, circuit parameters, etc.
- DUE AVF (Architectural Vulnerability Factor)
- fraction of strikes that result in DUE events
- Total DUE AVF = (True DUE AVF) + (False DUE AVF)
15 DUE AVF of Instruction Queue with Parity
CPU2000, Asim, SimPoint, Itanium2-like
False DUE AVF = 33%
16 Total Soft Error Rate
- Total Soft Error Rate = Σ over all bits [(SDC Rate) + (DUE Rate)]
- Parity converts SDC to DUE
- True DUE AVF (with error detection) = SDC AVF (without detection)
- Parity also introduces False DUE
- e.g., error flagged on wrong-path or dynamically dead instruction
- Parity-protecting a bit increases overall observed soft error rate
- Example: instruction queue
- SDC AVF (without error detection) = 29%
- DUE AVF (with error detection) = 62%
- True DUE AVF = 29%
- False DUE AVF = 33%
- Idle & miscellaneous = 38%
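The instruction-queue example can be checked with a few lines of arithmetic. The sketch below uses only the percentages from this slide; the variable names are illustrative.

```python
sdc_avf_without_parity = 0.29  # from the slide
false_due_avf = 0.33           # from the slide

# Parity converts every SDC into a true DUE, so the true DUE AVF equals
# the SDC AVF measured without detection.
true_due_avf = sdc_avf_without_parity
due_avf = true_due_avf + false_due_avf
idle_and_misc = 1.0 - due_avf

# Factor by which the observed (flagged) error rate rises once parity is added.
observed_increase = due_avf / sdc_avf_without_parity

print(f"DUE AVF             = {due_avf:.0%}")
print(f"idle & miscellaneous = {idle_and_misc:.0%}")
print(f"observed error rate rises by a factor of {observed_increase:.2f}")
```

Adding parity removes silent corruption but roughly doubles the rate of flagged errors for this structure, and all of that increase is False DUE.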
17 Reducing DUE
- Chip DUE Rate = Σ over all bits (Circuit Soft Error Rate) × (DUE AVF)
- DUE AVF = (True DUE AVF) + (False DUE AVF)
- Techniques
- convert back to SDC
- process technology (e.g., fully-depleted SOI)
- circuit technology (e.g., radiation-hardened cells)
- error recovery techniques (e.g., ECC)
- Our new techniques
- exposure reduction techniques (first part of this talk)
- False DUE AVF reduction
18 Sources of False DUE in an Instruction Queue
- Instructions with uncommitted results
- e.g., wrong-path, predicated-false
- solution: π (possibly incorrect) bit till commit
- Instruction types neutral to errors
- e.g., no-ops, prefetches, branch predict hints
- solution: anti-π bit
- Dynamically dead instructions
- instructions whose results will not be used in future
- solution: π bit beyond commit
19 Coping with Wrong-Path Instructions (assume parity-protected instruction queue)
[Figure: instruction queue entries; a strike (X) hits one entry and the error is declared on issue]
- Problem: not enough information at issue
20 The π (Possibly Incorrect) Bit (assume parity-protected instruction queue)
[Figure: instruction queue entries, some tagged with the π bit]
At commit point, declare error only if not wrong-path instruction and π bit is set
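The commit-time rule can be sketched as two small steps: record the parity failure at issue, then decide at commit. This is a behavioral illustration only; the class, field, and function names are hypothetical, and the wrong-path flag is assumed to be known by the commit point.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    name: str
    pi: bool = False          # possibly-incorrect bit
    wrong_path: bool = False  # resolved by the time the instruction reaches commit


def on_issue(inst, parity_ok):
    """On a parity failure, set the π bit instead of raising an error."""
    if not parity_ok:
        inst.pi = True
    return inst


def must_declare_error(inst):
    """At commit: declare a DUE only for a correct-path instruction with π set."""
    return inst.pi and not inst.wrong_path


struck_wrong_path = on_issue(Inst("add", wrong_path=True), parity_ok=False)
struck_correct_path = on_issue(Inst("sub"), parity_ok=False)
clean = on_issue(Inst("mul"), parity_ok=True)

for inst in (struck_wrong_path, struck_correct_path, clean):
    print(inst.name, "-> DUE" if must_declare_error(inst) else "-> no error")
```

Deferring the decision means a strike on a wrong-path instruction is silently discarded with the instruction, which removes that source of False DUE.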
21 anti-π bit: coping with No-ops (assume parity-protected instruction queue)
[Figure: instruction queue entries, some tagged with the anti-π bit]
On issue, if the anti-π bit is set, then do not set the π bit
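The issue-time rule for the anti-π bit can be sketched the same way. The opcode names and the choice to set the anti-π bit at decode are assumptions made for illustration; the slide states only the issue-time behavior.

```python
from dataclasses import dataclass

# Hypothetical opcode names for instruction types that are neutral to errors.
NEUTRAL_OPCODES = {"nop", "prefetch", "branch_hint"}


@dataclass
class Inst:
    opcode: str
    pi: bool = False       # possibly-incorrect bit
    anti_pi: bool = False  # set for instruction types neutral to errors


def decode(opcode):
    """Assumed step: tag neutral instruction types before they enter the queue."""
    return Inst(opcode, anti_pi=opcode in NEUTRAL_OPCODES)


def on_issue(inst, parity_ok):
    """On a parity failure, set the π bit unless the anti-π bit is set."""
    if not parity_ok and not inst.anti_pi:
        inst.pi = True
    return inst


nop = on_issue(decode("nop"), parity_ok=False)
add = on_issue(decode("add"), parity_ok=False)
print("nop π bit:", nop.pi)
print("add π bit:", add.pi)
```

A struck no-op never acquires a π bit, so it can never trigger an error later in the pipeline.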
22 π bit: avoiding False DUE on Dynamically Dead Instructions
[Figure: sequences of writes and reads of R1, some tagged with the π bit]
- Declare the error on reading R1, if π bit is set
- If R1 isn't read (i.e., dynamically dead), then no False DUE
- π bit can be used in caches & main memory
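Carrying the π bit into the register file can be sketched as a per-register flag that is checked only on reads. This is a toy model with hypothetical class and method names; it assumes a later write to the same register replaces both the value and its π bit.

```python
class DetectedUnrecoverableError(Exception):
    """Raised when a value whose π bit is set is actually consumed."""


class RegisterFile:
    def __init__(self):
        self._value = {}
        self._pi = {}

    def write(self, reg, value, pi=False):
        # The π bit travels with the value; an overwrite replaces both.
        self._value[reg] = value
        self._pi[reg] = pi

    def read(self, reg):
        # The error is declared on use, not when it was first detected.
        if self._pi.get(reg, False):
            raise DetectedUnrecoverableError(f"{reg} read with π bit set")
        return self._value[reg]


rf = RegisterFile()

# Dynamically dead: the possibly-incorrect value is overwritten before any read.
rf.write("R1", 7, pi=True)
rf.write("R1", 8)
dead_case = rf.read("R1")

# Live: the possibly-incorrect value is read, so the error must be declared.
rf.write("R1", 9, pi=True)
try:
    rf.read("R1")
    live_case = "no error"
except DetectedUnrecoverableError:
    live_case = "DUE"

print("dead write case read:", dead_case)
print("live write case:", live_case)
```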
23 Scope of the π Bit
- π bit allows declaring an error on use of a value or object
- rather than when the error is detected
- e.g., declare error on register read, rather than when it was detected
- π bit goes out of scope
- when error information cannot be propagated
- e.g., store writes data into cache without π bits
- typically, raise error when π bit goes out of scope
- Design points: increasing levels of π bit protection
- π bit till register commit
- π bit till register read
- π bit till store commit
- π bit till I/O commit
24 False DUE AVF Eliminated (PI = π)
CPU2000, Asim, SimPoint, Itanium2-like
Practical to eliminate most of the False DUE AVF
25 Summary
- Trade-off performance for lower soft error rate
- MITF (mean instructions to failure) ∝ (IPC / AVF)
- reduce errors by keeping objects longer in protected memory
- False Detected Unrecoverable Errors
- processor would unnecessarily crash on such an error
- techniques to avoid false errors
- π (possibly incorrect) bit
- anti-π bit
- PET (post-commit error tracking) buffer, see paper
26
27 % of False DUE Covered
Possible to eliminate most of the False DUE AVF