Title: Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report
1. Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report
Shubu Mukherjee, Joel Emer, Tryggve Fossum, Steven K. Reinhardt
Fault Aware Computing Technology (FACT) Group, Massachusetts Microprocessor Design Center, Intel Corporation
10th IEEE International Symposium, Pacific Rim Dependable Computing, French Polynesia, March 3-5, 2004
Also, University of Michigan, Ann Arbor
2. Summary
- SECDED ECC (single error correction, double error detection)
  - commonly used in on-chip caches
  - interleaving converts spatial multi-bit errors into multiple single bit errors
- Scrubbing
  - periodically read cache blocks and correct all single bit errors
  - this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors
- Our conclusion, given a detected-error target of 10-year MTTF
  - scrubbing is necessary only for very large caches (e.g., 100s of megabytes to gigabytes)
3. Origin of Cosmic Rays
[Figure: particles labeled p and n descending onto the Earth's surface]
- Cosmic rays come from deep space
4. Impact of Neutron Strike on a Si Device
[Figure: neutron strike on a transistor device]
- Strikes release electron-hole pairs that can be absorbed by the source and drain, altering the state of the device
- Secondary source of upsets: alpha particles from packaging
5. Strike Changes State of a Single Bit
[Figure: a single stored bit, with values 0 and 1]
- Example solution
  - Error correction codes (ECC) for single bit correction
  - Overhead: 7 bits for 64 bits of data
6. Strike Changes State of Two Adjacent Bits: Spatial Double Bit Error
- Example solution
  - SECDED ECC (single error correction, double error detection): 8 bits of code per 64 bits of data
  - Interleaving for the more general case
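The check-bit counts quoted on these slides (7 bits to correct a single error in 64 data bits, 8 bits for SECDED) are consistent with the standard Hamming bound. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and it does not implement an actual encoder.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (Hamming bound for
    correcting any single bit error in data_bits + r stored bits)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Single-error-correcting check bits plus one overall parity bit,
    which is what adds double error detection."""
    return sec_check_bits(data_bits) + 1


if __name__ == "__main__":
    for data_bits in (32, 64, 128):
        sec = sec_check_bits(data_bits)
        secded = secded_check_bits(data_bits)
        print(f"{data_bits} data bits: SEC={sec}, SECDED={secded}, "
              f"SECDED overhead={secded / data_bits:.1%}")
```

For 64 data bits this gives 7 and 8 check bits, matching the figures on the slides.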
7. Interleaving
[Figure: interleaved bits]
- Interleaving converts
  - spatial multi-bit error → multiple single bit errors
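As a rough illustration of why interleaving helps, the sketch below assumes a simple modulo layout in which physically adjacent bit columns belong to different ECC words. The layout and the names `physical_to_logical` and `words_hit` are assumptions made for this example, not a description of any particular cache design.

```python
def physical_to_logical(column: int, ways: int) -> tuple:
    """Map a physical bit column to (ECC word index, bit index in that word),
    assuming `ways` ECC words are interleaved column by column."""
    return column % ways, column // ways


def words_hit(first_column: int, burst_len: int, ways: int) -> dict:
    """Count how many flipped bits each ECC word sees when a single strike
    flips `burst_len` physically adjacent columns."""
    hits = {}
    for column in range(first_column, first_column + burst_len):
        word, _bit = physical_to_logical(column, ways)
        hits[word] = hits.get(word, 0) + 1
    return hits


if __name__ == "__main__":
    # No interleaving: both flipped bits land in the same ECC word.
    print(words_hit(first_column=10, burst_len=2, ways=1))
    # 2-way interleaving: each ECC word sees one correctable flipped bit.
    print(words_hit(first_column=10, burst_len=2, ways=2))
```

With `ways` at least as large as the burst length, every affected word sees at most one flipped bit, which SECDED can correct.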
8. Two Separate Strikes on Different Bits: Temporal Double Bit Errors
- SECDED ECC (single error correction, double error detection)
  - can detect the error, but cannot correct it
- If errors accumulate
  - a single bit correctable error becomes a double bit detectable error
9. Solutions for Temporal Double Bit Errors
- Natural effects
  - whenever a processor reads a cache block, the single bit error can be corrected
  - check for errors when cache blocks are replaced from the cache
- More powerful ECC
  - SECDED ECC requires 8 bits per 64 bits
    - 7 bits for single bit correction
    - 8th bit for double bit detection
    - Overhead: 13%
  - ECC with two bit correction requires 12 bits per 64 bits
    - Overhead: 19%
- Scrubbing
  - periodically read memory and correct all single bit errors
  - prevents accumulation of temporal double bit errors
  - standard technique in main memories (DRAMs)
- Our calculations (later) assume the worst case for soft errors
  - cache blocks don't get scrubbed naturally
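The effect of scrubbing on error accumulation can be sketched with a toy model. The class below is hypothetical and tracks only how many bits are flipped in each ECC word. It assumes every strike flips a different bit, and it treats any word with two or more flipped bits as detected but uncorrectable.

```python
class ToyEccCache:
    """Toy model of a SECDED-protected cache: one flipped-bit counter per word."""

    def __init__(self, num_words: int):
        self.flipped = [0] * num_words

    def strike(self, word: int) -> None:
        """Record one particle strike that flips a new bit in `word`."""
        self.flipped[word] += 1

    def scrub(self):
        """Read every word once; return (corrected, uncorrectable) counts.

        Words with exactly one flipped bit are corrected and cleared.
        Words with two or more are only counted as detected errors."""
        corrected = 0
        uncorrectable = 0
        for word, errors in enumerate(self.flipped):
            if errors == 1:
                self.flipped[word] = 0
                corrected += 1
            elif errors >= 2:
                uncorrectable += 1
        return corrected, uncorrectable


if __name__ == "__main__":
    # Two strikes on the same word with no scrub in between: double bit error.
    unscrubbed = ToyEccCache(num_words=4)
    unscrubbed.strike(2)
    unscrubbed.strike(2)
    print(unscrubbed.scrub())

    # Same two strikes, but a scrub pass runs between them.
    scrubbed = ToyEccCache(num_words=4)
    scrubbed.strike(2)
    print(scrubbed.scrub())
    scrubbed.strike(2)
    print(scrubbed.scrub())
```

In the second scenario each scrub pass finds only a single bit error, so nothing accumulates into a temporal double bit error.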
10. Memory Hierarchy of a Processor
[Figure: CPU → L1 Cache (kilobytes) → L2 Cache (megabytes) → Main Memory (gigabytes)]
- Do we need to scrub on-chip caches?
  - depends on the size of these caches
11. Detected Unrecoverable Error (DUE)
- Interval-based
  - MTTF = Mean Time to Failure
  - E.g., goal = 10 years MTTF for application crash (Bossen, IRPS 2002)
- Rate-based
  - FIT = Failure in Time = 1 failure in a billion hours
  - 10 year MTTF = $10^9 / (24 \times 365 \times 10)$ FIT ≈ 11,415 FITs
Hypothetical Example
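The conversion between an MTTF in years and a FIT rate is simple arithmetic. A minimal sketch follows, assuming a 365-day year as on the slide; the helper names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365      # 365-day year, as used on the slide
HOURS_PER_FIT_WINDOW = 1e9     # 1 FIT = 1 failure per billion hours


def mttf_years_to_fit(mttf_years: float) -> float:
    """Convert a mean time to failure in years into a FIT rate."""
    return HOURS_PER_FIT_WINDOW / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """Convert a FIT rate back into a mean time to failure in years."""
    return HOURS_PER_FIT_WINDOW / (fit * HOURS_PER_YEAR)


if __name__ == "__main__":
    fit = mttf_years_to_fit(10)
    print(f"10-year MTTF = {fit:.1f} FIT")
    print(f"round trip   = {fit_to_mttf_years(fit):.1f} years")
```

A 10-year MTTF comes out at roughly 11,415 FIT, the figure quoted on the slide.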
12. MTTF calculations: probabilities
- 1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits
- Q = number of quadwords in cache memory
- $Pd_n$ = probability that a sequence of $n$ strikes causes $n - 1$ single bit errors, followed by a double bit error on the $n$th strike
$$Pd_2 = \frac{Q}{Q} \cdot \frac{1}{Q} = \frac{1}{Q}$$
13. MTTF calculations: probabilities
- 1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits
- Q = number of quadwords in cache memory
- $Pd_n$ = probability that a sequence of $n$ strikes causes $n - 1$ single bit errors, followed by a double bit error on the $n$th strike
$$Pd_3 = \frac{Q}{Q} \cdot \frac{Q-1}{Q} \cdot \frac{2}{Q}$$
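14. MTTF calculations: probabilities
- 1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits
- Q = number of quadwords in cache memory
- $Pd_n$ = probability that a sequence of $n$ strikes causes $n - 1$ single bit errors, followed by a double bit error on the $n$th strike
  - $Pd_1 = 0$
  - $Pd_2 = \frac{1}{Q}$
  - $Pd_3 = \frac{Q-1}{Q} \cdot \frac{2}{Q}$
  - $Pd_4 = \frac{Q-1}{Q} \cdot \frac{Q-2}{Q} \cdot \frac{3}{Q}$
  - ...
  - $Pd_n = \frac{Q-1}{Q} \cdot \frac{Q-2}{Q} \cdot \frac{Q-3}{Q} \cdots \frac{Q-n+2}{Q} \cdot \frac{n-1}{Q}$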
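The probabilities above can be evaluated exactly for small Q as a sanity check. The sketch below assumes, as the slides do, that each strike lands on a uniformly random quadword and that a second strike on an already-struck quadword produces a double bit error. The function name `pd` is hypothetical.

```python
from fractions import Fraction


def pd(n: int, q: int) -> Fraction:
    """Probability that the first n-1 strikes hit distinct quadwords and the
    nth strike hits one of the n-1 quadwords already struck."""
    if n < 2:
        return Fraction(0)
    prob = Fraction(1)
    for i in range(1, n - 1):          # factors (Q-1)/Q ... (Q-n+2)/Q
        prob *= Fraction(q - i, q)
    return prob * Fraction(n - 1, q)   # nth strike lands on a struck quadword


if __name__ == "__main__":
    q = 8
    for n in range(1, q + 2):
        print(n, pd(n, q))
    # A double bit error must occur by strike Q+1, so the terms sum to 1.
    print("total:", sum(pd(n, q) for n in range(1, q + 2)))
```

Using exact fractions keeps the check free of rounding error; for realistic cache sizes a floating-point loop is more practical.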
15. MTTF calculations: Equation
- M = mean number of single bit errors needed to get a double bit error
  - expected value of the random variable with $Pd_n$ as its probability distribution function
  - M can be easily generated using a computer program
- MTTF (double bit error) = M × MTTF (single bit error)
- For a 32 megabyte cache, FIT/bit = 0.001 (Normand 1996, Tosaka 1996)
  - MTTF (double bit error) = M × MTTF (single bit error)
  - = 2567 × (1 / Cache FIT)
  - = $2567 \times 10^9 / (0.001 \times 2^{22} \times 72 \times 24 \times 365)$
  - ≈ 970 years
- Saleh et al.'s (1990) closed-form equation
  - MTTF (double bit error) = $\frac{1}{72 f} \sqrt{\frac{\pi}{2Q}}$
  - = 970 years, with f = FIT/bit
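The slide notes that M can be generated with a computer program. One possible version is sketched below; it is not the authors' program. It sums $n \cdot Pd_n$ in floating point, stops once the remaining probability mass is negligible (the `tol` cutoff is an assumption of this sketch), and compares the result with the closed form for the 32 MB example.

```python
import math

HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72  # 64 data bits + 8 SECDED check bits


def mean_strikes_to_double_bit(q: int, tol: float = 1e-15) -> float:
    """M = sum over n of n * Pd_n for a cache of q quadwords."""
    mean = 0.0
    all_distinct = 1.0  # P(first n-1 strikes landed in distinct quadwords)
    n = 2
    while all_distinct > tol:
        mean += n * all_distinct * (n - 1) / q   # n * Pd_n
        all_distinct *= (q - (n - 1)) / q        # extend to n distinct strikes
        n += 1
    return mean


def double_bit_mttf_years(q: int, fit_per_bit: float) -> float:
    """MTTF(double bit error) = M * MTTF(single bit error), in years."""
    cache_fit = fit_per_bit * q * BITS_PER_QUADWORD
    single_bit_mttf_hours = 1e9 / cache_fit
    return mean_strikes_to_double_bit(q) * single_bit_mttf_hours / HOURS_PER_YEAR


def saleh_mttf_years(q: int, fit_per_bit: float) -> float:
    """Closed form (1 / (72 f)) * sqrt(pi / (2 Q)), with f in upsets per bit-hour."""
    f = fit_per_bit * 1e-9
    hours = math.sqrt(math.pi / (2 * q)) / (BITS_PER_QUADWORD * f)
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    q = 2 ** 22  # 32 MB of data / 8 data bytes per quadword
    print(f"M               = {mean_strikes_to_double_bit(q):.1f}")
    print(f"summed MTTF     = {double_bit_mttf_years(q, 0.001):.0f} years")
    print(f"closed-form MTTF = {saleh_mttf_years(q, 0.001):.0f} years")
```

Both routes land at roughly 970 years for this configuration, in line with the slide.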
16. Temporal Double Bit MTTF: variations with cache size
- FIT/bit = 0.001 to 0.01 (Normand 1996, Tosaka 1996)
  - higher at higher altitudes (e.g., 3-5x at 1.5 km in Denver)
- Temporal double bit errors make a very small contribution to the DUE rate
  - compared to a goal of 10 years DUE MTTF
17. MTTF with Scrubbing
- I = scrubbing interval; scrub at the end of each interval I
- N = number of scrubbing intervals to reach MTTF
  - expected value of the random variable with probability distribution function $(1 - p_f)^N p_f$, where $p_f$ = probability of a temporal double bit error at the end of an interval
- Assuming a 16 GB cache, FIT/bit = 0.001 (Normand 1996, Tosaka 1996), scrubbing once a year (I = 1 year)
  - MTTF (double bit error) = N × I
  - = 2281 × 1 = 2281 years
- Saleh et al. (1990) closed-form equation
  - $\frac{2}{Q \, I \, (72 f)^2}$ = 2341 years, with f = FIT/bit
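The scrubbed case can be checked numerically in the same spirit. The sketch below evaluates the closed form and, separately, a geometric-distribution estimate in which the number of strikes per quadword per interval is assumed to be Poisson distributed. That Poisson model is an assumption of this sketch, not necessarily the model behind the slide's figure of 2281 years, so a difference of a few percent from that figure is expected.

```python
import math

HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72  # 64 data bits + 8 SECDED check bits


def saleh_scrubbed_mttf_years(q: int, fit_per_bit: float,
                              interval_years: float) -> float:
    """Closed form 2 / (Q * I * (72 f)^2), with f in upsets per bit-year."""
    f = fit_per_bit * 1e-9 * HOURS_PER_YEAR
    return 2.0 / (q * interval_years * (BITS_PER_QUADWORD * f) ** 2)


def geometric_scrubbed_mttf_years(q: int, fit_per_bit: float,
                                  interval_years: float) -> float:
    """N * I with N = 1 / pf, under a Poisson strike model (an assumption)."""
    # Expected strikes on one quadword during one scrubbing interval.
    lam = (fit_per_bit * 1e-9 * HOURS_PER_YEAR
           * BITS_PER_QUADWORD * interval_years)
    # P(one quadword takes >= 2 strikes) = 1 - exp(-lam) * (1 + lam),
    # written with expm1/log1p to avoid cancellation for tiny lam.
    p_word = -math.expm1(math.log1p(lam) - lam)
    # pf = P(at least one of the q quadwords takes >= 2 strikes).
    pf = -math.expm1(q * math.log1p(-p_word))
    return interval_years / pf


if __name__ == "__main__":
    q = 2 ** 31  # 16 GB of data / 8 data bytes per quadword
    print(f"closed form: {saleh_scrubbed_mttf_years(q, 0.001, 1.0):.0f} years")
    print(f"geometric  : {geometric_scrubbed_mttf_years(q, 0.001, 1.0):.0f} years")
```

Both estimates come out close to 2,340 years for a 16 GB cache scrubbed once a year, matching the closed-form value on the slide.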
18. Impact of Scrubbing on Temporal Double Bit MTTF
- FIT/bit = 0.001 to 0.01 (Normand 1996, Tosaka 1996)
  - higher at higher altitudes (e.g., 3-5x at 1.5 km in Denver)
- For 16 gigabytes of cache, scrubbing can help
  - compared to a DUE MTTF goal of 10 years
19. Summary
- SECDED ECC (single error correction, double error detection)
  - commonly used in on-chip caches
  - interleaving converts spatial multi-bit errors into multiple single bit errors
- Scrubbing
  - periodically read cache blocks and correct all single bit errors
  - this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors
- Our conclusion, given a detected-error target of 10-year MTTF
  - scrubbing is necessary only for very large caches (e.g., 100s of megabytes to gigabytes)
21. Raw soft error rate: 0.001 to 0.010 FIT/bit
- Y. Tosaka, S. Satoh, K. Suzuki, T. Sugii, H. Ehara, G. A. Woffinden, and S. A. Wender, Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits, Symposium on VLSI Technology Digest of Technical Papers, 1996.
- Normand, Single Event Upset at Ground Level, IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.