Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor
31st Annual International Symposium on Computer Architecture (ISCA), Munich, Germany, June 2004


1
Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor
31st Annual International Symposium on Computer Architecture (ISCA), Munich, Germany, June 2004
Shubu Mukherjee (1), Christopher Weaver (1), Joel Emer (1), Steve Reinhardt (1,2)
(1) Massachusetts Microprocessor Design Center, Intel
(2) University of Michigan, Ann Arbor
2
Outline
  • Trade-off performance for lower soft error rate
    • MITF (mean instructions to failure)
    • reduce errors by keeping objects longer in protected memory
  • False Detected Unrecoverable Errors
    • processor would unnecessarily crash on such an error
    • techniques to avoid false errors
      • π (possibly incorrect) bit
      • anti-π bit

3
Alpha or Neutron Particle Strike Changes State of a Single Bit
[Figure: a stored bit shown as 0 and as 1]
4
Silent Data Corruption (SDC)
[Flowchart: outcome of a bit flip]
  • Bit read?
    • no: benign fault, no error
    • yes: bit has error protection?
      • detection & correction
      • detection only
      • no: affects program outcome?
        • yes: SDC
        • no: benign fault, no error
5
SDC Definitions
  • SDC = Silent Data Corruption
  • MTTF = Mean Time to Failure
  • SDC MTTF = time between two SDC events
  • Chip SDC Rate (inversely proportional to MTTF)
    • rate of occurrence of SDC events
    • = Σ over all bits [(Circuit Soft Error Rate) × (SDC AVF)]
  • Target market will typically set SDC budget
    • note: budget is non-zero
  • Circuit Soft Error Rate
    • determined by alpha or neutron flux, circuit parameters, etc.
  • AVF (Architectural Vulnerability Factor), Mukherjee et al., MICRO '03
    • fraction of strikes that affect program outcome
    • AVF = 0% for branch predictor
    • AVF = 100% for program counter
    • AVF < 100% for instruction queue
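
The chip-level sum above can be sketched in a few lines of Python. This is only an illustration: the structure sizes and the per-bit circuit soft error rate are made-up values, and the instruction queue AVF of 0.29 is borrowed from the example later in the talk.

```python
# Hypothetical structures: (name, number of bits, circuit SER per bit, SDC AVF).
# Bit counts and the per-bit rate are illustration values, not measured data.
STRUCTURES = [
    ("branch predictor", 4096, 0.001, 0.00),   # AVF = 0%
    ("program counter", 64, 0.001, 1.00),      # AVF = 100%
    ("instruction queue", 6400, 0.001, 0.29),  # AVF < 100%
]

def chip_sdc_rate(structures):
    """Sum (circuit soft error rate x SDC AVF) over all bits of all structures."""
    return sum(bits * ser_per_bit * avf
               for _name, bits, ser_per_bit, avf in structures)

rate = chip_sdc_rate(STRUCTURES)
print(f"chip SDC rate: {rate:.3f} (arbitrary units)")
```

Structures with AVF = 0% drop out of the sum entirely, which is why a strike on the branch predictor never contributes to the SDC rate.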

6
Instruction Queue's SDC AVF
[Figure: similar to Mukherjee et al., MICRO '03; setup: CPU2000, Asim, SimPoint, Itanium2-like]
7
SDC Reduction Techniques
  • Chip SDC Rate = Σ over all bits [(Circuit Soft Error Rate) × (SDC AVF)]
  • Conventional techniques
    • process technology (e.g., fully-depleted SOI)
    • circuit technology (e.g., radiation-hardened cells)
    • error detection or correction codes (e.g., parity, ECC)
  • Our new technique
    • reduce exposure to radiation to reduce SDC AVF
    • trade off between performance and soft error rate

8
MITF: mean instructions to failure (work between two errors)

MITF = (# instructions committed) / (# errors encountered)
     = IPC × (# cycles) / (# errors encountered)
     = IPC × (total time) × frequency / (# errors encountered)
     = IPC × MTTF × frequency
     = IPC × frequency / ((Circuit Soft Error Rate) × AVF)
     = (frequency / Circuit Soft Error Rate) × (IPC / AVF)
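
As a rough numerical check of the derivation, the sketch below computes MITF from its last two forms. All input values (IPC, frequency, error rate, AVF) are hypothetical and chosen only to make the arithmetic easy to follow; the function names are likewise illustrative.

```python
def mitf(ipc, frequency_hz, circuit_ser_per_s, avf):
    """MITF = IPC x frequency x MTTF, with MTTF = 1 / (circuit SER x AVF)."""
    mttf_s = 1.0 / (circuit_ser_per_s * avf)
    return ipc * frequency_hz * mttf_s

def mitf_gain(ipc_old, avf_old, ipc_new, avf_new):
    """Relative MITF change at fixed frequency and circuit SER: depends only on IPC / AVF."""
    return (ipc_new / avf_new) / (ipc_old / avf_old) - 1.0

# Hypothetical design points: give up 10% IPC to cut AVF from 30% to 20%.
base = mitf(ipc=1.0, frequency_hz=2e9, circuit_ser_per_s=1e-9, avf=0.30)
new = mitf(ipc=0.9, frequency_hz=2e9, circuit_ser_per_s=1e-9, avf=0.20)
print(f"MITF gain: {new / base - 1.0:.2%}")
```

Frequency and circuit soft error rate cancel when two design points on the same process are compared, so the comparison reduces to the ratio of IPC / AVF.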
9
Reducing SDC of an Instruction Queue (IQ) (assume protected instruction cache)
  • Increase IPC: fetch aggressively from IC to IQ
  • Reduce SDC AVF: prevent instructions from sitting needlessly in IQ
  • Net benefit if we improve MITF (proportional to IPC / AVF)

10
Squash Instructions
  • Goal
    • don't have instructions sit needlessly in the Instruction Queue
  • Algorithm to Reduce Exposure to Radiation
    • Trigger: cache miss
    • Action: squash all instructions in the instruction queue following the load miss
  • Evaluation using
    • Asim Performance Model Framework
    • first 100 million instruction SimPoint of all CPU2000 benchmarks
    • Itanium2-like architecture, but scaled (note: in-order machine)
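
The trigger/action pair can be pictured with a toy queue model. This is a minimal sketch, not the evaluated design: the entry layout (`seq`, `op`) and the function name are assumptions, and squashed instructions are assumed to be refetched later from the protected instruction cache.

```python
from collections import deque

def squash_on_load_miss(iq, miss_seq):
    """Drop every IQ entry younger than the missing load.

    Returns the surviving queue and the sequence numbers that must be
    refetched later (assumed to come from a protected instruction cache).
    """
    kept = deque(entry for entry in iq if entry["seq"] <= miss_seq)
    refetch = [entry["seq"] for entry in iq if entry["seq"] > miss_seq]
    return kept, refetch

# Hypothetical queue contents in program order; the load at seq 1 misses.
iq = deque({"seq": n, "op": op}
           for n, op in enumerate(["add", "load", "mul", "sub", "store"]))
kept, refetch = squash_on_load_miss(iq, miss_seq=1)
print([e["seq"] for e in kept], refetch)
```

The squashed entries no longer sit in unprotected storage while the miss is outstanding, which is what lowers the queue's AVF at some cost in IPC.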

11
SDC MITF Improvement from Reducing Exposure

Design Point               IPC    SDC AVF   MITF Improvement
Baseline                   1.21   29%       0%
Squash on L1 load misses   1.19   22%       37%
12
Outline
  • Trade-off performance for lower soft error rate
    • MITF (mean instructions to failure)
    • reduce errors by keeping objects longer in protected memory
  • False Detected Unrecoverable Errors
    • processor would unnecessarily crash on such an error
    • techniques to avoid false errors
      • π (possibly incorrect) bit
      • anti-π bit

13
Detected Unrecoverable Error (DUE)
[Flowchart: outcome of a bit flip]
  • Bit read?
    • no: benign fault, no error
    • yes: bit has error protection?
      • detection & correction: no error
      • detection only: affects program outcome?
        • yes: True DUE
        • no: False DUE
      • no: affects program outcome?
        • yes: SDC
        • no: benign fault, no error
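
One reading of this flowchart, written as a small decision function. The function name and the string labels for the protection levels are assumptions made for the sketch.

```python
def classify_fault(bit_read, protection, affects_outcome):
    """Classify a single-bit fault.

    protection is one of "none", "detect", "detect+correct"
    (hypothetical labels for the three branches of the flowchart).
    """
    if not bit_read:
        return "benign fault"
    if protection == "detect+correct":
        return "no error"
    if protection == "detect":
        # The error is flagged either way; it is only "true" if it mattered.
        return "true DUE" if affects_outcome else "false DUE"
    return "SDC" if affects_outcome else "benign fault"

print(classify_fault(True, "detect", affects_outcome=False))
```

The false DUE case is the one of interest here: the fault would have been benign without detection, but with detection only it is reported as an error.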
14
DUE Definitions
  • DUE = Detected Unrecoverable Error
  • MTTF = Mean Time to Failure
  • DUE MTTF = time between two DUE events
  • Chip DUE Rate (inversely proportional to MTTF)
    • rate of occurrence of DUE events
    • = Σ over all bits [(Circuit Soft Error Rate) × (DUE AVF)]
  • Target market will typically set DUE budget
    • note: budget is non-zero
  • Circuit Soft Error Rate
    • determined by alpha or neutron flux, circuit parameters, etc.
  • DUE AVF (Architectural Vulnerability Factor)
    • fraction of strikes that result in DUE events
    • Total DUE AVF = (True DUE AVF) + (False DUE AVF)

15
DUE AVF of Instruction Queue with Parity
[Figure: CPU2000, Asim, SimPoint, Itanium2-like]
False DUE AVF = 33%
16
Total Soft Error Rate
  • Total Soft Error Rate = Σ over all bits [(SDC Rate) + (DUE Rate)]
  • Parity converts SDC to DUE
    • True DUE AVF (with error detection) = SDC AVF (without detection)
  • Parity also introduces False DUE
    • e.g., error flagged on wrong-path or dynamically dead instruction
  • Parity-protecting a bit increases overall observed soft error rate
  • Example: instruction queue
    • SDC AVF (without error detection) = 29%
    • DUE AVF (with error detection) = 62%
      • True DUE AVF = 29%
      • False DUE AVF = 33%
    • Idle & miscellaneous = 38%
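
The instruction queue example is plain bookkeeping, reproduced here with the percentages from the slide. The variable names are illustrative.

```python
# Percentages from the instruction queue example on this slide.
sdc_avf_without_detection = 29
false_due_avf = 33

# Parity converts every would-be SDC into a true DUE.
true_due_avf = sdc_avf_without_detection
total_due_avf = true_due_avf + false_due_avf

# Whatever is left of the queue's bit-cycles is idle or miscellaneous state.
idle_and_misc = 100 - total_due_avf

print(total_due_avf, idle_and_misc)
```

Adding parity moves the observed error rate for this structure from 29% (silent) to 62% (detected), which is the motivation for removing the false DUE component.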

17
Reducing DUE
  • Chip DUE Rate = Σ over all bits [(Circuit Soft Error Rate) × (DUE AVF)]
  • DUE AVF = (True DUE AVF) + (False DUE AVF)
  • Techniques
    • convert back to SDC
    • process technology (e.g., fully-depleted SOI)
    • circuit technology (e.g., radiation-hardened cells)
    • error recovery techniques (e.g., ECC)
  • Our new techniques
    • exposure reduction techniques (first part of this talk)
    • False DUE AVF reduction

18
Sources of False DUE in an Instruction Queue
  • Instructions with uncommitted results
    • e.g., wrong-path, predicated-false
    • solution: π (possibly incorrect) bit till commit
  • Instruction types neutral to errors
    • e.g., no-ops, prefetches, branch predict hints
    • solution: anti-π bit
  • Dynamically dead instructions
    • instructions whose results will not be used in future
    • solution: π bit beyond commit

19
Coping with Wrong-Path Instructions (assume parity-protected instruction queue)
[Figure: instruction queue entries (inst), one marked X; declare error on issue]
  • Problem: not enough information at issue

20
The π (Possibly Incorrect) Bit (assume parity-protected instruction queue)
[Figure: instruction queue entries (inst), some tagged with the π bit]
At commit point, declare error only if not wrong-path instruction and π bit is set
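
A minimal sketch of that commit rule, assuming a parity check at issue that sets the possibly-incorrect (π) bit instead of raising an error right away. The `Inst` class and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Inst:
    name: str
    parity_ok: bool = True    # result of the parity check on the IQ entry
    wrong_path: bool = False  # only known by the time of commit
    pi: bool = False          # possibly-incorrect bit carried with the instruction

def issue(inst):
    """On a parity failure, mark the instruction instead of declaring an error."""
    if not inst.parity_ok:
        inst.pi = True
    return inst

def must_declare_error(inst):
    """Commit-time check: error only for a correct-path instruction with pi set."""
    return inst.pi and not inst.wrong_path

struck_wrong_path = issue(Inst("add", parity_ok=False, wrong_path=True))
struck_correct_path = issue(Inst("mul", parity_ok=False))
print(must_declare_error(struck_wrong_path), must_declare_error(struck_correct_path))
```

Deferring the decision to commit is what supplies the information that was missing at issue: by then the machine knows whether the instruction was on the wrong path.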
21
anti-π bit: coping with No-ops (assume parity-protected instruction queue)
[Figure: instruction queue entries (inst), some tagged with the anti-π bit]
On issue, if the anti-π bit is set, then do not set the π bit
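
The issue-time rule can be sketched as follows. How and when the anti-π bit gets set is an assumption here (tagged from the opcode when the entry is created); the class, the helper names, and the opcode strings are all hypothetical.

```python
from dataclasses import dataclass

# Instruction types assumed neutral to errors, following the examples in the talk.
NEUTRAL_OPCODES = {"nop", "prefetch", "branch_hint"}

@dataclass
class QueueEntry:
    opcode: str
    parity_ok: bool = True
    anti_pi: bool = False  # set for instruction types neutral to errors
    pi: bool = False       # possibly-incorrect bit

def enqueue(opcode):
    """Create an IQ entry and tag neutral instruction types with the anti-pi bit."""
    return QueueEntry(opcode, anti_pi=opcode in NEUTRAL_OPCODES)

def issue(entry):
    """Set pi on a parity failure unless the anti-pi bit suppresses it."""
    if not entry.parity_ok and not entry.anti_pi:
        entry.pi = True
    return entry

nop = enqueue("nop")
nop.parity_ok = False   # simulate a strike on the no-op's entry
add = enqueue("add")
add.parity_ok = False   # simulate a strike on a real instruction
print(issue(nop).pi, issue(add).pi)
```

A struck no-op therefore never reaches commit with its π bit set, so it cannot cause a false DUE.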
22
π bit: avoiding False DUE on Dynamically Dead Instructions
[Figure: sequences of "write R1" and "read R1" instructions, some tagged with the π bit]
  • Declare the error on reading R1, if π bit is set
  • If R1 isn't read (i.e., dynamically dead), then no False DUE
  • π bit can be used in caches & main memory
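
A toy register file showing the idea of carrying the π bit past commit and checking it on read. The class and its methods are hypothetical, and the choice to raise `RuntimeError` simply stands in for declaring a DUE.

```python
class RegisterFile:
    """Registers that carry a possibly-incorrect (pi) bit alongside each value."""

    def __init__(self):
        self.value = {}
        self.pi = {}

    def write(self, reg, value, pi=False):
        # A later write replaces both the value and any stale pi bit.
        self.value[reg] = value
        self.pi[reg] = pi

    def read(self, reg):
        # The error is declared on use, not when it was first detected.
        if self.pi.get(reg, False):
            raise RuntimeError(f"DUE: {reg} read with pi bit set")
        return self.value[reg]

rf = RegisterFile()
rf.write("R1", 10, pi=True)   # result of a struck, dynamically dead write
rf.write("R1", 20)            # overwritten before anyone reads it
print(rf.read("R1"))          # no error: the struck value was never used
```

If the struck value had been read before being overwritten, `read` would raise, which corresponds to a true DUE.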

23
Scope of the ? Bit
  • π bit allows declaring an error on use of a value or object
    • rather than when the error is detected
    • e.g., declare error on register read, rather than when it was detected
  • π bit goes out of scope
    • when error information cannot be propagated
    • e.g., store writes data into cache without π bits
    • typically, raise error when π bit goes out of scope
  • Design points: increasing levels of π bit protection
    • π bit till register commit
    • π bit till register read
    • π bit till store commit
    • π bit till I/O commit

24
False DUE AVF Eliminated (PI = π)
[Figure: CPU2000, Asim, SimPoint, Itanium2-like]
Practical to eliminate most of the False DUE AVF
25
Summary
  • Trade-off performance for lower soft error rate
    • MITF (mean instructions to failure) ∝ (IPC / AVF)
    • reduce errors by keeping objects longer in protected memory
  • False Detected Unrecoverable Errors
    • processor would unnecessarily crash on such an error
    • techniques to avoid false errors
      • π (possibly incorrect) bit
      • anti-π bit
      • PET (post-commit error tracking) buffer, see paper

26
  • BACKUP SLIDES FOLLOW

27
% of False DUE Covered
Possible to eliminate most of the False DUE AVF