Title: On Cosmic Rays, Bat Droppings, and what to do about them
1On Cosmic Rays, Bat Droppings, and what to do
about them
- David Walker
- Princeton University
- with Jay Ligatti, Lester Mackey, George Reis and
David August
2A Little-Publicized Fact
1 + 1 = 2 ... or 3
3How do Soft Faults Happen?
- Galactic particles: high-energy particles that penetrate to Earth's surface, through buildings and walls
- Solar particles: affect satellites; cause < 5% of terrestrial problems
- Alpha particles from bat droppings
- High-energy particles pass through devices and collide with silicon atoms
- A collision generates an electric charge that can flip a single bit
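As a minimal sketch of what "flip a single bit" means for a stored value, the snippet below inverts one bit of an integer with XOR. The helper name `flip_bit` and the chosen word and bit positions are illustrative assumptions, not something taken from the slides.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the bit at position `bit` inverted (XOR with a one-hot mask)."""
    return value ^ (1 << bit)

word = 2  # the stored value before the particle strike
for bit in (0, 3, 7):
    faulty = flip_bit(word, bit)
    # Show the word before and after the flip in binary and decimal.
    print(f"bit {bit}: {word:#010b} -> {faulty:#010b} ({faulty})")
```

Flipping the same bit twice restores the original value, which is why a transient fault leaves no physical trace once the location is rewritten.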
6How Often do Soft Faults Happen?
[Figure: IBM Soft Fail Rate Study, mainframes, 1983-86; sites: Leadville, CO; Denver, CO; Tucson, AZ; NYC. Source: Zeiger-Puchner 2004]
- Some Data Points [Zeiger-Puchner 2004]
  - 1983-86: Leadville (highest incorporated city in the US): 1 fail / 2 days
  - 1983-86: Subterranean experiment under 50 ft of rock: no fails in 9 months
  - 2004: 1 fail / year for a laptop with 1 GB RAM at sea level
  - 2004: 1 fail / trans-Pacific round trip
8How Often do Soft Faults Happen?
[Figure: Soft Error Rate Trends (Shekhar Borkar, Intel, 2004); annotations: "we are approximately here", "6 years from now"]
- Soft error rates go up as:
  - Voltages decrease
  - Feature sizes decrease
  - Transistor density increases
  - Clock rates increase
- These are all future manufacturing trends
9How Often do Soft Faults Happen?
- In 1948, Presper Eckert notes that cascading effects of a single-bit error destroyed hours of ENIAC's work. [Zeiger-Puchner 2004]
- In 2000, Sun server systems deployed to America Online, eBay, and others crashed due to cosmic rays. [Baumann 2002]
- "The wake-up call came in the end of 2001 ... billion-dollar factory ground to a halt every month due to ... a single bit flip." [Zeiger-Puchner 2004]
- Los Alamos National Lab's Hewlett-Packard ASC Q 2048-node supercomputer was crashing regularly from soft faults due to cosmic radiation. [Michalak 2005]
10What Problems do Soft Faults Cause?
- a single bit in memory gets flipped
- a single bit in the processor logic gets flipped, and:
  - there's no difference in externally observable behavior
  - the processor locks up
  - the computation is silently corrupted:
    - register value corrupted (simple data fault)
    - control-flow transfer goes to wrong place (control-flow fault)
    - different opcode interpreted (instruction fault)
11FT Solutions
- Redundancy in Information
  - e.g., error-correcting codes (ECC)
  - pros: protects stored values efficiently
  - cons: difficult to design for arithmetic units and control logic
- Redundancy in Space
  - multiple redundant hardware devices
  - e.g., Compaq Non-stop Himalaya runs two identical programs on two processors, comparing pins on every cycle
  - pros: efficient in time
  - cons: expensive in hardware (double the space)
- Redundancy in Time
  - perform the same computations at different times (e.g., in sequence)
  - pros: efficient in hardware (space is reused)
  - cons: expensive in time (slower, but not twice as slow)
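To make "redundancy in information" concrete, here is a small sketch of one classic error-correcting code, Hamming(7,4), which stores 4 data bits in 7 and corrects any single flipped bit. The slides do not say which code any particular ECC memory uses; the choice of Hamming(7,4) and the function names are assumptions made for illustration only.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as positions 1..7."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]

stored = hamming74_encode([1, 0, 1, 1])
stored[5] ^= 1  # a soft fault flips one stored bit
print(hamming74_decode(stored))  # recovers the original data bits
```

The code protects a value while it sits in storage, which matches the "pros" bullet; it says nothing about values being transformed by an ALU, which is the difficulty the "cons" bullet points at.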
12Solutions in Time
- Compiler generates code containing replicated computations, fault detection checks and recovery routines
  - e.g., [Rebaudengo 01], CFCSS [Oh et al. 02], SWIFT or CRAFT [Reis et al. 05], ...
  - pros: software-controlled: new code with better reliability properties may be deployed whenever, wherever needed
  - cons: for a fixed reliability policy, slower than specialized hardware solutions
13Solutions in Time
- Compiler generates code containing replicated computations, fault detection checks and recovery routines
  - e.g., [Rebaudengo 01], CFCSS [Oh et al. 02], SWIFT or CRAFT [Reis et al. 05], ...
  - pros: flexibility: new code with better reliability properties may be deployed whenever, wherever needed
  - cons: for a fixed reliability policy, slower than specialized hardware solutions
  - cons: it might not actually work
14It might not actually work
15Agenda
- Answer basic scientific questions about software-controlled fault tolerance:
  - Do software-only or hybrid SW/HW techniques actually work?
  - For what fault models? How do we specify them?
  - How can we prove it?
- Build compilers that produce software that runs reliably on faulty hardware
  - Moreover: let's not replace faulty hardware with faulty software.
  - Let's prove every binary we produce is fault tolerant relative to the specified fault model
17A Possible Compiler Architecture
compiler front end -> ordinary program -> reliability transform -> fault tolerant program -> optimization -> optimized FT program
Testing Requirements: all combinations of features multiplied by all combinations of faults
18A More Reliable Compiler Architecture
compiler front end -> ordinary program -> reliability transform -> fault tolerant program + reliability proof -> optimization -> optimized FT program + modified proof -> proof checker
19A More Reliable Compiler Architecture
compiler front end -> ordinary program -> reliability transform -> fault tolerant program + reliability proof -> optimization -> optimized FT program + modified proof -> proof checker
Testing Requirements: all combinations of features multiplied by all combinations of faults
20Central Technical Challenges
21Step 1 Lambda Zap
- Lambda Zap [ICFP 06]
  - a lambda calculus that exhibits intermittent data faults + operators to detect and correct them
  - a type system that guarantees observable outputs of well-typed programs do not change in the presence of a single fault
    - types act as the proofs of fault tolerance
  - expressive enough to implement an ordinary typed lambda calculus
- End result
  - the foundation for a fault-tolerant typed intermediate language
22The Fault Model
- Lambda zap models simple data faults only:
    ( M, F[ v1 ] ) ---> ( M, F[ v2 ] )
- Not modelled:
  - memory faults (better protected using ECC hardware)
  - control-flow faults (i.e., faults during control-flow transfer)
  - instruction faults (i.e., faults in instruction opcodes)
- Goal: to construct programs that tolerate 1 fault
  - observers cannot distinguish between fault-free and 1-fault runs
23Lambda to Lambda Zap The main idea
let x = 2 in let y = x + x in out y
24Lambda to Lambda Zap The main idea
Source:
let x = 2 in let y = x + x in out y

Translation (replicate instructions; atomic majority vote + output):
let x1 = 2 in let x2 = 2 in let x3 = 2 in
let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
out [y1, y2, y3]
26Lambda to Lambda Zap The main idea
Source:
let x = 2 in let y = x + x in out y

Translation, with a fault in x3:
let x1 = 2 in let x2 = 2 in let x3 = 7 in
let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
out [y1, y2, y3]

corrupted values are copied and percolate through the computation, but the final output is unchanged
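The replicate-and-vote idea on these slides can be simulated in a few lines. This is only a sketch under stated assumptions: it takes the operator elided in the slide text to be addition, models a fault as overwriting one variable at the moment it is bound, and uses hypothetical helper names (`majority`, `run_replicated`) that do not come from the talk.

```python
from collections import Counter

def majority(votes):
    """Return the value that at least two of the three voters agree on."""
    value, count = Counter(votes).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority (more than one fault?)")
    return value

def run_replicated(fault=None):
    """Run the triplicated program; `fault` is an optional (variable, bad_value) pair."""
    env = {}

    def let(name, value):
        # A single data fault replaces the value as it is bound to `name`.
        if fault is not None and fault[0] == name:
            value = fault[1]
        env[name] = value

    for i in (1, 2, 3):
        let(f"x{i}", 2)
    for i in (1, 2, 3):
        let(f"y{i}", env[f"x{i}"] + env[f"x{i}"])
    # `out` performs the majority vote and the output in one atomic step.
    return majority([env["y1"], env["y2"], env["y3"]])

print(run_replicated())           # fault-free run
print(run_replicated(("x3", 7)))  # x3 is zapped from 2 to 7
```

Both calls print 4: the corrupted 7 flows into y3, but the other two copies outvote it.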
28Lambda to Lambda Zap Control-flow
Source:
let x = 2 in if x then e1 else e2

Translation (recursively translate subexpressions e1 and e2):
let x1 = 2 in let x2 = 2 in let x3 = 2 in
if [x1, x2, x3] then e1 else e2

majority vote on control-flow transfer
(function calls replicate arguments, results and the function itself)
30Almost too easy, can anything go wrong?...
Yes!
- Optimization reduces replication overhead dramatically (e.g., 43% for 2 copies), but can be unsound!
- The original implementation of SWIFT [Reis et al.] optimized away all redundancy, leaving them with an unreliable implementation!!
31Faulty Optimizations
Before CSE:
let x1 = 2 in let x2 = 2 in let x3 = 2 in
let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
out [y1, y2, y3]

After CSE:
let x1 = 2 in let y1 = x1 + x1 in
out [y1, y1, y1]

In general, optimizations eliminate redundancy; fault tolerance requires redundancy.
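A short simulation shows why the CSE result is unsound. As before this is a sketch, assuming the elided operator is addition and that a fault overwrites one bound variable; `REPLICATED`, `AFTER_CSE`, `run` and `tolerates_one_fault` are hypothetical names introduced here.

```python
from collections import Counter

def majority(votes):
    """Return the value that at least two of the three voters agree on."""
    value, count = Counter(votes).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority")
    return value

# Each binding is (name, function computing its value from the environment).
REPLICATED = [
    ("x1", lambda e: 2), ("x2", lambda e: 2), ("x3", lambda e: 2),
    ("y1", lambda e: e["x1"] + e["x1"]),
    ("y2", lambda e: e["x2"] + e["x2"]),
    ("y3", lambda e: e["x3"] + e["x3"]),
]
AFTER_CSE = [("x1", lambda e: 2), ("y1", lambda e: e["x1"] + e["x1"])]

def run(bindings, voters, fault=None):
    """Evaluate the bindings in order, optionally zapping one variable, then vote."""
    env = {}
    for name, compute in bindings:
        value = compute(env)
        if fault is not None and fault[0] == name:
            value = fault[1]
        env[name] = value
    return majority([env[v] for v in voters])

def tolerates_one_fault(bindings, voters, bad_value=7):
    """True if zapping any single variable leaves the voted output unchanged."""
    expected = run(bindings, voters)
    return all(run(bindings, voters, (name, bad_value)) == expected
               for name, _ in bindings)

print(tolerates_one_fault(REPLICATED, ["y1", "y2", "y3"]))  # True
print(tolerates_one_fault(AFTER_CSE, ["y1", "y1", "y1"]))   # False
```

In the optimized program all three voters read the same y1, so one corrupted value wins the vote unanimously.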
34The Essential Problem
good code:
let x1 = 2 in let x2 = 2 in let x3 = 2 in
let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
out [y1, y2, y3]
voters do not depend on a common value (red on red, green on green, blue on blue)

bad code:
let x1 = 2 in let y1 = x1 + x1 in
out [y1, y1, y1]
voters depend on a common value (x1)
35A Type System for Lambda Zap
- Key idea: types track the "color" of the underlying value and prevent interference between colors

Colors  C ::= R | G | B
Types   T ::= C int | C bool | C (T1,T2,T3) -> (T1,T2,T3)
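The color discipline can be approximated by a very small checker. This is a deliberately simplified sketch, not the Lambda Zap type system: it only checks that every binding reads operands of its own color and that the three voters are red, green and blue in that order. The representation of programs as `(name, color, operands)` tuples and the name `check_colors` are assumptions made for illustration.

```python
COLORS = ("R", "G", "B")

def check_colors(bindings, voters):
    """Return True if no binding mixes colors and the voters are colored R, G, B."""
    color_of = {}
    for name, color, operands in bindings:
        for operand in operands:
            # Interference: a value of one color computed from another color.
            if color_of.get(operand) != color:
                return False
        color_of[name] = color
    return tuple(color_of.get(v) for v in voters) == COLORS

good = [
    ("x1", "R", []), ("x2", "G", []), ("x3", "B", []),
    ("y1", "R", ["x1", "x1"]),
    ("y2", "G", ["x2", "x2"]),
    ("y3", "B", ["x3", "x3"]),
]
bad = [("x1", "R", []), ("y1", "R", ["x1", "x1"])]

print(check_colors(good, ["y1", "y2", "y3"]))  # True
print(check_colors(bad, ["y1", "y1", "y1"]))   # False: all three voters are red
```

The triplicated program passes, while a program whose three voters all read the same red value is rejected, which is the property the colored types are meant to enforce.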
36Sample Typing Rules
Judgement Form: G |-z e : T    where z ::= C | .

simple value typing rules:

(x : T) in G
---------------
G |-z x : T

------------------------
G |-z C n : C int

------------------------------
G |-z C true : C bool
37Sample Typing Rules
Judgement Form: G |-z e : T    where z ::= C | .

sample expression typing rules:

G |-z e1 : C int    G |-z e2 : C int
-------------------------------------------------
G |-z e1 + e2 : C int

G |-z e1 : R bool   G |-z e2 : G bool   G |-z e3 : B bool   G |-z e4 : T   G |-z e5 : T
-----------------------------------------------------
G |-z if [e1, e2, e3] then e4 else e5 : T

G |-z e1 : R int   G |-z e2 : G int   G |-z e3 : B int   G |-z e4 : T
------------------------------------
G |-z out [e1, e2, e3]; e4 : T
38Sample Typing Rules
Judgement Form: G |-z e : T    where z ::= C | .

recall the "zap rule" from the operational semantics:
( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

before:  |- v1 : T
after:   |- v2 : ?? T    => how will we obtain type preservation?
39Sample Typing Rules
Judgement Form: G |-z e : T    where z ::= C | .

recall the "zap rule" from the operational semantics:
( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

before:  |- v1 : C U
after:   |-C v2 : C U, by the rule:

(no conditions)
----------------------
G |-C C v : C U

"faulty typing" occurs within a single color only.
40Theorems
- Theorem 1: Well-typed programs are safe, even when there is a single error.
- Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors (with a caveat).
- Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap (that satisfies the caveat).
[ICFP 06]
41The Caveat
Goal: 0-fault and 1-fault executions should be indistinguishable

bad, but well-typed code:
out [2, 3, 3]                       outputs 3 after no faults
out [2, 3, 3] ---> out [2, 2, 3]    outputs 2 after 1 fault

More importantly:
- out [2, 3, 3] is obviously a symptom of a compiler bug
- out [2, 3, 4] is even worse: good runs never come to consensus

Solution: computations must be independent, but equivalent
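The caveat is easy to reproduce with a plain majority vote. The sketch below uses a hypothetical `vote` helper that returns `None` when no two voters agree; it is an illustration of the slide's three triples, not code from the paper.

```python
from collections import Counter

def vote(triple):
    """Return the majority value of a triple, or None if no two voters agree."""
    value, count = Counter(triple).most_common(1)[0]
    return value if count >= 2 else None

print(vote([2, 3, 3]))  # 3: the fault-free output of the bad code
print(vote([2, 2, 3]))  # 2: one fault in the middle voter changes the output
print(vote([2, 3, 4]))  # None: even a fault-free run reaches no consensus

def tolerant(triple, bad_value):
    """True if replacing any one voter with bad_value leaves the vote unchanged."""
    expected = vote(triple)
    return all(vote(triple[:i] + [bad_value] + triple[i + 1:]) == expected
               for i in range(3))

print(tolerant([3, 3, 3], 2))  # True: equivalent voters mask any single fault
print(tolerant([2, 3, 3], 2))  # False: the voters were never equivalent
```

Only a triple whose three components are equal to begin with keeps the same output under every single fault, which is why the solution asks for computations that are independent but equivalent.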
42The Caveat
modified typing:

G |-z e1 : R U   G |-z e2 : G U   G |-z e3 : B U   G |-z e4 : T   G |-z e1 = e2   G |-z e2 = e3
----------------------------------------------------------------------------
G |- out [e1, e2, e3]; e4 : T
43The Caveat
More generally, programmers may form "triples" of equivalent values

Introduction form: [e1, e2, e3]
- a collection of 3 items
- each of the 3 is stored in a separate register
- a single fault affects at most one

Elimination form: let [x1, x2, x3] = e1 in e2
44The Caveat
More generally, programmers may form "triples" of equivalent values

Introduction form:

G |-z e1 : R U   G |-z e2 : G U   G |-z e3 : B U   G |-z e1 = e2   G |-z e2 = e3
---------------------------------------------
G |- [e1, e2, e3] : [R U, G U, B U]

Elimination form:

G |-z e1 : [R U, G U, B U]   G, x1:R U, x2:G U, x3:B U, x1 = x2, x2 = x3 |-z e2 : T
---------------------------------------------
G |- let [x1, x2, x3] = e1 in e2 : T
45Theorems
- Theorem 1: Well-typed programs are safe, even when there is a single error.
- Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors.
- Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap.

There is still one "i" to be dotted in the proofs of these theorems. Lester Mackey, brilliant Princeton undergrad, has proven all key theorems modulo the dotted "i".
46Step 2 Fault Tolerant Typed Assembly Language
(TAL/FT)
- Lambda zap is a playground for studying the principles of fault tolerance in an idealized setting
- TAL/FT is a more realistic assembly-level, hybrid HW/SW, fault tolerance scheme with:
  - a formal fault model
  - a formal definition of fault tolerance relative to memory-mapped I/O
  - a sound type system for proving compiled programs are fault tolerant
47A More Reliable Compiler Architecture
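[Figure: the compiler architecture with types in place of proofs: compiler front end -> ordinary program -> reliability transform -> TAL/FT with types (the reliability proof) -> optimization -> optimized TAL/FT with types (the modified proof) -> type checker (the proof checker)]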
48TAL/FT Key Ideas (Fault Model)
- Fault model
  - registers may incur arbitrary faults in between execution of any two instructions
  - memory (including code) is protected by ECC
  - fault model formalized as part of the hardware operational semantics
49TAL/FT Key Ideas (Properties)
[Figure: Processor, ECC-protected memory, and mem-mapped I/O device, linked by store and read arrows]
- Primary Goal: if there is one fault then either
  (1) the mem-mapped I/O device sees exactly the same sequence of stores as a fault-free execution, or
  (2) hardware detects and signals a fault, and mem-mapped I/O sees a prefix of the stores from a fault-free execution
- Secondary Goal: no false positives
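The primary goal can be read as a predicate over what the I/O device observes. The sketch below encodes that reading under assumptions of my own: stores are modelled as `(address, value)` pairs, a run is summarized by its observed store list plus a flag saying whether hardware signalled a fault, and the function name `acceptable` is hypothetical.

```python
def acceptable(fault_free_stores, observed_stores, fault_signalled):
    """Check the primary goal for a run that suffered at most one fault."""
    if not fault_signalled:
        # Case (1): the device must see exactly the fault-free sequence.
        return observed_stores == fault_free_stores
    # Case (2): a fault was signalled, so only a prefix may have been emitted.
    n = len(observed_stores)
    return observed_stores == fault_free_stores[:n]

golden = [(0x10, 4), (0x14, 8), (0x18, 12)]

print(acceptable(golden, golden, False))                  # True: same sequence
print(acceptable(golden, golden[:2], True))               # True: detected, prefix only
print(acceptable(golden, [(0x10, 4), (0x14, 9)], False))  # False: silent corruption
print(acceptable(golden, [(0x10, 4), (0x14, 9)], True))   # False: bad store escaped
```

The secondary goal is a separate requirement on fault-free runs: for them, `fault_signalled` should never be true.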
50TAL/FT Key Ideas (Mechanisms)
- Compiler strategy
  - create two redundant computations, as in lambda zap
  - two copies => fault detection
  - fault recovery handled by a higher-level process
- Hardware support
  - special instructions + modified store buffer for implementing reliable stores
  - special instructions for reliable control-flow transfers
- Type system
  - simple typing mechanisms based on original TAL [Morrisett, Walker, et al.]
  - redundant values with separate colors, like in lambda zap
  - value identities needed for equivalence checking tracked using singleton types combined with some ideas drawn from traditional Hoare logics
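One way to picture a "modified store buffer" is as a queue that holds a store from the first copy of the computation until the second copy issues a matching one. The sketch below is a guess at the flavor of such a mechanism, not the TAL/FT hardware design: the class, the method names, the green/blue naming of the two copies and the in-order matching policy are all assumptions.

```python
from collections import deque

class FaultDetected(Exception):
    """Raised when the two redundant copies disagree about a store."""

class CheckedStoreQueue:
    def __init__(self):
        self.pending = deque()  # (address, value) pairs awaiting confirmation
        self.memory = {}        # committed stores, visible to the outside world

    def store_green(self, address, value):
        """First copy: buffer the store without committing it."""
        self.pending.append((address, value))

    def store_blue(self, address, value):
        """Second copy: commit only if it matches the oldest buffered store."""
        if not self.pending or self.pending.popleft() != (address, value):
            raise FaultDetected(f"mismatched store to {address:#x}")
        self.memory[address] = value

queue = CheckedStoreQueue()
queue.store_green(0x10, 4)
queue.store_blue(0x10, 4)      # copies agree, so the store commits
queue.store_green(0x14, 8)
try:
    queue.store_blue(0x14, 9)  # one copy was corrupted before the store
except FaultDetected as err:
    print("fault signalled:", err)
print(queue.memory)            # only the agreed store reached memory
```

Because a store becomes visible only after both copies agree, a single corrupted register can stop the program but cannot push a wrong value out to the device, which is the prefix property stated as the primary goal.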
51Current & Future Work
- Build the first compiler that can automatically generate reliability proofs for compiled programs
  - TAL/FT refinement and implementation
  - type- and reliability-preserving optimizations
- Study alternative fault detection schemes
  - fault detection & recovery on current hardware
  - exploit multi-core alternatives
- Understand the foundational theoretical principles that allow programs to tolerate transient faults
  - general-purpose program logics for reasoning about faulty programs
52Other Research I Do
- PADS [POPL 06, SIGMOD 06 demo, POPL 07]
  - automatic generation of tools (parsers, printers, validators, format translators, query engines, etc.) for "ad hoc" data formats
  - with Kathleen Fisher (AT&T)
- Program Monitoring [POPL 00, ICFP 03, PLDI 05, POPL 06, ...]
  - semantics, design and implementation of programs that monitor other programs for security (or other purposes)
- TAL & other type systems [POPL 98, POPL 99, TOPLAS 99, JFP 02, ...]
  - theory, design and implementation of type systems for compiler target and intermediate languages
53Conclusions
- Semiconductor manufacturers are deeply worried about how to deal with soft faults in future architectures (10 years out)
- Using proofs and types, I think we are going to be able to develop highly reliable software that runs on unreliable hardware
54end!
55The Caveat
56Function O.S. follows
58Lambda to Lambda Zap Control-flow
Source:
let f = \x.e in f 2

Translation:
let [f1, f2, f3] = \x. e in [f1, f2, f3] [2, 2, 2]

operational semantics:
(M; let [f1, f2, f3] = \x.e1 in e2) ---> (M, l = \x.e1; e2[l / f1][l / f2][l / f3])

majority vote on control-flow transfer
59TAL/FT Hardware
replicated program counters
store queue
ECC-protected Caches/Memory
60Related Work Follows
62Software Mitigation Techniques
- Examples
  - N-version programming, EDDI, CFCSS [Oh et al. 2002], SWIFT [Reis et al. 2005], etc.
  - Hybrid hardware-software techniques: Watchdog Processors, CRAFT [Reis et al. 2005], etc.
- Pros
  - immediate deployment: if your system is suffering soft error-related failures, you may deploy new software immediately
    - would have benefitted Los Alamos Labs, etc.
  - policies may be customized to the environment, application
  - reduced hardware cost
- Cons
  - for the same universal policy, slower (but not as much as you'd think)
  - IT MIGHT NOT ACTUALLY WORK!
64Mitigation Techniques
- Hardware
- error-correcting codes
- redundant hardware
- Pros
- fast for fixed policy
- Cons
- FT policy decided at hardware design time
- mistakes cost millions
- one-size-fits-all policy
- expensive
- Software and hybrid schemes
- replicate computations
- Pros
- immediate deployment
- policies customized to environment, application
- reduced hardware cost
- Cons
- for the same universal policy, slower (but not as much as you'd think)
- It may not actually work!
  - much research in the HW/compilers community is completely lacking proof
65Solutions in Time
- Solutions in Hardware
  - replication of instructions and checking implemented in special-purpose hardware
  - e.g., [Reinhardt & Mukherjee 2000]
  - pros: transparent to software
  - cons: one-size-fits-all reliability policy
  - cons: can't fix existing problems; specialized hardware has a reduced market
- Solutions in Software (or hybrid Hardware/Software)
  - compiler generates replicated instructions and checking code
  - e.g., [Reis et al. 05]
  - pros: flexibility: new reliability policies may be deployed whenever needed
  - cons: for a fixed reliability policy, slower than specialized hardware solutions
  - cons: it might not actually work