Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report

Massachusetts Microprocessor Design Center, Intel Corporation ... Symposium Pacific Rim Dependable Computing, French Polynesia, March 3-5, 2004 ...

1
Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report
Shubu Mukherjee, Joel Emer, Tryggve Fossum, Steven K. Reinhardt
Fault Aware Computing Technology (FACT) Group
Massachusetts Microprocessor Design Center, Intel Corporation
10th IEEE International Symposium Pacific Rim Dependable Computing,
French Polynesia, March 3-5, 2004
Also, University of Michigan, Ann Arbor
2
Summary
  • SECDED ECC (single error correction, double error
    detection)
  • commonly used in on-chip caches
  • interleaving converts spatial multi-bit errors to
    multiple single bit errors
  • Scrubbing
  • periodically read cache blocks and correct all
    single bit errors
  • this prevents single bit errors from
    accumulating, thereby avoiding temporal double
    bit errors
  • Our conclusion, given a detected error target of 10
    years MTTF:
  • Scrubbing is necessary only for very large caches
    (e.g., 100s of megabytes to gigabytes)

3
Origin of Cosmic Rays
[Figure: protons (p) and neutrons (n) above the Earth's surface]
  • Cosmic rays come from deep space

4
Impact of Neutron Strike on a Si Device
[Figure: neutron strike on a transistor device]
Strikes release electron-hole pairs that can be
absorbed by the source or drain to alter the state of
the device
  • Secondary source of upsets: alpha particles from
    packaging

5
Strike Changes State of a Single Bit
[Figure: a stored bit changing state between 0 and 1]
  • Example solution
  • Error correction codes (ECC) for single bit
    correction
  • Overhead: 7 bits for 64 bits of data

6
Strike Changes State of Two Adjacent Bits: Spatial
Double Bit Error
  • Example solution
  • SECDED ECC (single error correction, double error
    detection)
  • 8 bits of code per 64 bits of data
  • Interleaving for the more general case

7
Interleaving bits
[Figure: interleaved bits]
  • Interleaving converts
  • spatial multi-bit error → multiple single bit
    errors
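One way to see why interleaving helps is to map physical bit columns to logical words and count how many flipped bits each word receives. The sketch below assumes a simple round-robin layout in which column c belongs to word c mod ways; that layout and all names are assumptions for illustration, not details from the presentation:

```python
def physical_to_logical(column, ways):
    """Map a physical bit column to (word index, bit index within the word)
    under round-robin bit interleaving of `ways` words."""
    return column % ways, column // ways


def words_hit(first_column, burst_length, ways):
    """Count flipped bits per logical word when one strike flips
    `burst_length` physically adjacent columns."""
    hits = {}
    for column in range(first_column, first_column + burst_length):
        word, _ = physical_to_logical(column, ways)
        hits[word] = hits.get(word, 0) + 1
    return hits


# Without interleaving, both flipped bits land in the same word.
no_interleave = words_hit(10, 2, ways=1)
# With 4-way interleaving, the same strike touches two different words.
interleaved = words_hit(10, 2, ways=4)

print(no_interleave, interleaved)
```

With `ways=1` the two adjacent flips fall in one word, which SECDED can only detect. With `ways=4` each affected word sees a single flipped bit, which SECDED can correct.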

8
Two Separate Strikes on Different Bits: Temporal
Double Bit Errors
  • SECDED ECC (single error correction, double error
    detection)
  • can detect the error, but cannot correct it
  • if errors accumulate,
  • a single bit correctable error becomes a double bit
    detectable error

9
Solutions for Temporal Double Bit Errors
  • Natural Effects
  • whenever a processor reads a cache block, we can
    correct the single bit error
  • check for errors when cache blocks are replaced
    from the cache
  • More Powerful ECC
  • SECDED ECC requires 8 bits per 64 bits
  • 7 bits for single bit correction
  • 8th bit for double bit detection
  • Overhead: 13%
  • ECC with two bit correction requires 12 bits per
    64 bits
  • Overhead: 19%
  • Scrubbing
  • Periodically read memory and correct all single
    bit errors
  • Disallows accumulation of temporal double bit
    errors
  • Standard technique in main memories (DRAMs)
  • Our calculations (later) will assume the worst
    case for soft errors:
  • cache blocks don't get scrubbed naturally

10
Memory Hierarchy of a Processor
CPU
L1 Cache
kilobytes
L2 Cache
megabytes
Main Memory (gigabytes)
  • Do we need to scrub on-chip caches?
  • depends on the size of these caches

11
Detected Unrecoverable Error (DUE)
  • Interval-based
  • MTTF = Mean Time to Failure
  • E.g., goal = 10 years MTTF for application crash

  • Bossen, IRPS 2002
  • Rate-based
  • FIT = Failure in Time = 1 failure in a billion
    hours
  • 10 year MTTF = $10^9 / (24 \times 365 \times 10)$ FIT = 11,415
    FITs
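The conversion between the interval-based and rate-based views is a single division. A small Python sketch, using the same 24 × 365 hours per year as the slide; the helper names are illustrative:

```python
HOURS_PER_YEAR = 24 * 365
FIT_HOURS = 1e9  # 1 FIT = 1 failure per billion hours


def mttf_years_to_fit(years):
    """Convert an MTTF in years to a failure rate in FIT."""
    return FIT_HOURS / (years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit):
    """Convert a failure rate in FIT to an MTTF in years."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


goal_fit = mttf_years_to_fit(10)
print(f"10-year MTTF is about {goal_fit:.1f} FIT")
```

The result is roughly 11,415.5 FIT, which the slide truncates to 11,415.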

Hypothetical Example
12
MTTF calculations: probabilities
  • 1 quadword = 64 bits + 8 bits = 72 bits of data +
    SECDED ECC
  • $Q$ = number of quadwords in cache memory
  • $P_d(n)$ = probability that a sequence of $n$ strikes
    causes $n - 1$ single bit errors, followed by a
    double bit error on the $n$th strike
  • $P_d(1) = 0$
  • $P_d(2) = 1/Q$

$P_d(2) = (Q/Q) \cdot (1/Q) = 1/Q$
13
MTTF calculations: probabilities
  • 1 quadword = 64 bits + 8 bits = 72 bits of data +
    SECDED ECC
  • $Q$ = number of quadwords in cache memory
  • $P_d(n)$ = probability that a sequence of $n$ strikes
    causes $n - 1$ single bit errors, followed by a
    double bit error on the $n$th strike
  • $P_d(3) = \frac{Q-1}{Q} \cdot \frac{2}{Q}$

$P_d(3) = (Q/Q) \cdot ((Q-1)/Q) \cdot (2/Q)$
14
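MTTF calculations: probabilities
  • 1 quadword = 64 bits + 8 bits = 72 bits of data +
    SECDED ECC
  • $Q$ = number of quadwords in cache memory
  • $P_d(n)$ = probability that a sequence of $n$ strikes
    causes $n - 1$ single bit errors, followed by a
    double bit error on the $n$th strike
  • $P_d(1) = 0$
  • $P_d(2) = 1/Q$
  • $P_d(3) = \frac{Q-1}{Q} \cdot \frac{2}{Q}$
  • $P_d(4) = \frac{Q-1}{Q} \cdot \frac{Q-2}{Q} \cdot \frac{3}{Q}$
  • $P_d(n) = \frac{Q-1}{Q} \cdot \frac{Q-2}{Q} \cdot \frac{Q-3}{Q} \cdots \frac{Q-n+2}{Q} \cdot \frac{n-1}{Q}$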
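The general term can be checked numerically: the first n - 1 strikes must land in distinct quadwords, and the nth must land in one of the n - 1 already hit. A minimal Python sketch under that reading; the function name `p_double` and the small value of Q are chosen only for illustration:

```python
def p_double(n, q):
    """Probability that the first double bit error occurs on strike n,
    given q quadwords that are each equally likely to be struck."""
    if n < 2:
        return 0.0
    p = 1.0
    # Strikes 2 .. n-1 each hit a quadword that has not been hit before.
    for i in range(1, n - 1):
        p *= (q - i) / q
    # Strike n hits one of the n-1 quadwords already holding an error.
    return p * (n - 1) / q


Q = 1000  # small illustrative cache so the full sum is cheap
total = sum(p_double(n, Q) for n in range(1, Q + 2))

print(p_double(2, Q), p_double(3, Q), total)
```

Summing over n = 1 to Q + 1 covers every possible outcome, since Q + 1 strikes cannot all land in distinct quadwords, so `total` should come out at 1 up to rounding.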

15
MTTF calculations: Equation
  • $M$ = mean number of single bit errors to get a double
    bit error
  • = expected value of a random variable with
    $P_d(n)$ as the
    probability distribution function
  • $M$ can be easily generated using a computer
    program
  • MTTF (double bit error) = $M \times$ MTTF (single bit
    error)
  • For a 32 megabyte cache and FIT/bit = 0.001
    (Normand 1996, Tosaka 1996):
  • MTTF (double bit error) = $M \times$ MTTF (single bit
    error)
  • = $2567 \times (1 / \text{Cache FIT})$
  • = $2567 \times (10^9 / (0.001 \times 2^{22} \times 72 \times 24 \times
    365))$
  • = 970 years
  • Saleh et al.'s 1990 closed-form equation:
  • MTTF (double bit error) = $\frac{1}{72 f} \sqrt{\frac{\pi}{2Q}}$
  • = 970 years, where $f$ = FIT/bit
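The 2567 and 970-year figures can be reproduced with a short program. The sketch below is one possible version, assuming the same distribution for the strike on which the first double bit error occurs and 24 × 365 hours per year; the function and variable names are illustrative, and the closed form is coded directly from the expression on the slide:

```python
import math

HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72  # 64 data bits + 8 SECDED bits


def mean_strikes_to_double_bit(q, tail=1e-18):
    """Expected number of strikes until two strikes share a quadword."""
    mean = 0.0
    no_collision = 1.0  # probability that the first n-1 strikes are all distinct
    n = 2
    while no_collision > tail:
        mean += n * no_collision * (n - 1) / q   # n * Pd(n)
        no_collision *= (q - (n - 1)) / q        # extend to n distinct strikes
        n += 1
    return mean


f = 0.001                       # FIT per bit
Q = 32 * 2**20 // 8             # 32 MB cache = 2**22 quadwords
M = mean_strikes_to_double_bit(Q)

cache_fit = f * Q * BITS_PER_QUADWORD
mttf_single_years = 1e9 / (cache_fit * HOURS_PER_YEAR)
mttf_double_years = M * mttf_single_years

# Closed form from the slide, converted from units of 1e9 hours to years.
closed_form_years = (1.0 / (BITS_PER_QUADWORD * f)) * math.sqrt(math.pi / (2 * Q)) \
    * 1e9 / HOURS_PER_YEAR

print(round(M), round(mttf_double_years), round(closed_form_years))
```

Under these assumptions M comes out near 2567, and both the numeric and closed-form MTTF land at about 970 years.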

16
Temporal Double Bit MTTF variations with cache
size
  • FIT/bit = 0.001 to 0.01 (Normand 1996, Tosaka
    1996)
  • higher at higher altitudes (e.g., 3-5x at 1.5 km
    in Denver)
  • Temporal double bit error has very small
    contribution to DUE rate
  • compared to a goal of 10 years DUE MTTF

17
MTTF with Scrubbing
  • $I$ = scrubbing interval; scrub at the end of each
    interval $I$
  • $N$ = number of scrubbing intervals to reach MTTF
  • = expected value of a random variable with
    probability distribution
    function $(1 - p_f)^N \, p_f$, where $p_f$ =
    probability of a temporal double bit
    error at the end of an interval
  • Assuming a 16 GB cache, FIT/bit = 0.001 (Normand
    1996, Tosaka 1996), and
    scrubbing once a year ($I$ = 1 year):
  • MTTF (double bit error) = $N \times I$
  • = $2281 \times 1$ = 2281 years
  • Saleh et al. 1990 closed-form equation:
  • $2 / (Q \cdot I \cdot (72 f)^2)$ = 2341 years, where $f$ = FIT/bit
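The closed-form figure can be checked by expressing the per-quadword error rate in errors per year. The Python sketch below evaluates only the closed-form expression, not the numerically computed 2281 years; the function and parameter names are illustrative:

```python
HOURS_PER_YEAR = 24 * 365


def scrubbed_mttf_years(cache_bytes, fit_per_bit, interval_years,
                        bits_per_quadword=72):
    """Closed-form temporal double bit MTTF with periodic scrubbing:
    2 / (Q * I * rate**2), where rate is errors per quadword per year."""
    q = cache_bytes // 8  # 8 data bytes per quadword
    rate = fit_per_bit * bits_per_quadword * 1e-9 * HOURS_PER_YEAR
    return 2.0 / (q * interval_years * rate ** 2)


mttf_yearly = scrubbed_mttf_years(16 * 2**30, 0.001, 1.0)
mttf_half_yearly = scrubbed_mttf_years(16 * 2**30, 0.001, 0.5)

print(round(mttf_yearly), round(mttf_half_yearly))
```

For a 16 GB cache scrubbed once a year this evaluates to about 2341 years. Because the interval appears in the denominator, halving the scrubbing interval doubles the MTTF in this model.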

18
Impact of Scrubbing on Temporal Double Bit MTTF
  • FIT/bit = 0.001 to 0.01 (Normand 1996, Tosaka
    1996)
  • higher at higher altitudes (e.g., 3-5x at 1.5 km
    in Denver)
  • For 16 gigabytes of cache, scrubbing can help
  • compared to a DUE MTTF goal of 10 years

19
Summary
  • SECDED ECC (single error correction, double error
    detection)
  • commonly used in on-chip caches
  • interleaving converts spatial multi-bit errors to
    multiple single bit errors
  • Scrubbing
  • periodically read cache blocks and correct all
    single bit errors
  • this prevents single bit errors from
    accumulating, thereby avoiding temporal double
    bit errors
  • Our conclusion, given a detected error target of 10
    years MTTF:
  • Scrubbing is necessary only for very large caches
    (e.g., 100s of megabytes to gigabytes)

20
  • BACKUPS

21
Raw soft error rate: 0.001 to 0.010 FIT/bit
  • Y. Tosaka, S. Satoh, K. Suzuki, T. Sugii, H. Ehara,
    G. A. Woffinden, and S. A. Wender, Impact of Cosmic
    Ray Neutron Induced Soft Errors on Advanced
    Submicron CMOS Circuits, Symposium on VLSI
    Technology Digest of Technical Papers, 1996.
  • Normand, Single Event Upset at Ground Level,
    IEEE Transactions on Nuclear Science, Vol. 43,
    No. 6, December 1996.