On Cosmic Rays, Bat Droppings, and what to do about them

About This Presentation

Description:

Title: a theory of aspects Author: CS Last modified by: dpw Created Date: 8/12/2003 2:51:00 PM Document presentation format: On-screen Show Company – PowerPoint PPT presentation

Slides: 66

Provided by: cs171

Learn more at: https://www.cs.princeton.edu

Transcript and Presenter's Notes

1
On Cosmic Rays, Bat Droppings, and what to do
about them

David Walker
Princeton University
with Jay Ligatti, Lester Mackey, George Reis and
David August

2
A Little-Publicized Fact
1 + 1 = 2 (or 3)
3
How do Soft Faults Happen?
Galactic particles: high-energy particles that penetrate to Earth's surface, through buildings and walls
Solar particles: affect satellites; cause < 5% of terrestrial problems
Alpha particles from bat droppings

High-energy particles pass through devices and collide with silicon atoms
A collision generates an electric charge that can flip a single bit
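
The effect of such a collision on a stored value can be sketched in a few lines. This is only an illustration: the function name `flip_bit`, the 32-bit width, and the example value are assumptions, not part of the talk.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with exactly one bit inverted (a simulated soft fault)."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


stored = 2                       # binary 010
corrupted = flip_bit(stored, 2)  # binary 110
print(stored, "->", corrupted)   # prints "2 -> 6"
```

Flipping the same bit a second time restores the original value, which is why a fault of this kind is transient rather than permanent damage.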

4
How Often do Soft Faults Happen?
5
How Often do Soft Faults Happen?
[Figure: IBM Soft Fail Rate Study, mainframes, 1983-86. Sites shown: Leadville, CO; Denver, CO; Tucson, AZ; NYC]
6
How Often do Soft Faults Happen?
[Figure: IBM Soft Fail Rate Study, mainframes, 1983-86 [Zeiger-Puchner 2004]. Sites shown: Leadville, CO; Denver, CO; Tucson, AZ; NYC]

Some Data Points
1983-86: Leadville (highest incorporated city in the US): 1 fail / 2 days
1983-86: subterranean experiment under 50 ft of rock: no fails in 9 months
2004: 1 fail / year for a laptop with 1 GB RAM at sea level
2004: 1 fail / trans-Pacific round trip
[Zeiger-Puchner 2004]

7
How Often do Soft Faults Happen?
[Figure: Soft Error Rate Trends (Shekhar Borkar, Intel, 2004). Annotations: "we are approximately here", "6 years from now"]
8
How Often do Soft Faults Happen?
[Figure: Soft Error Rate Trends (Shekhar Borkar, Intel, 2004). Annotations: "we are approximately here", "6 years from now"]

Soft error rates go up as:
voltages decrease
feature sizes decrease
transistor density increases
clock rates increase

These are all future manufacturing trends.
9
How Often do Soft Faults Happen?

In 1948, Presper Eckert notes that cascading effects of a single-bit error destroyed hours of ENIAC's work. [Zeiger-Puchner 2004]
In 2000, Sun server systems deployed to America Online, eBay, and others crashed due to cosmic rays. [Baumann 2002]
"The wake-up call came in the end of 2001 ... billion-dollar factory ground to a halt every month due to ... a single bit flip" [Zeiger-Puchner 2004]
The Los Alamos National Lab Hewlett-Packard ASC Q 2048-node supercomputer was crashing regularly from soft faults due to cosmic radiation. [Michalak 2005]

10
What Problems do Soft Faults Cause?

A single bit in memory gets flipped, or
a single bit in the processor logic gets flipped,
and then one of the following happens:
there's no difference in externally observable behavior
the processor locks up
the computation is silently corrupted:
register value corrupted (simple data fault)
control-flow transfer goes to the wrong place (control-flow fault)
different opcode interpreted (instruction fault)

11
FT Solutions

Redundancy in Information
e.g. error-correcting codes (ECC)
pros: protects stored values efficiently
cons: difficult to design for arithmetic units and control logic
Redundancy in Space
multiple redundant hardware devices
e.g. Compaq Non-stop Himalaya runs two identical programs on two processors, comparing pins on every cycle
pros: efficient in time
cons: expensive in hardware (double the space)
Redundancy in Time
perform the same computations at different times (e.g. in sequence)
pros: efficient in hardware (space is reused)
cons: expensive in time (slower, but not twice as slow)
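
As a concrete instance of redundancy in information, here is a minimal sketch of a Hamming(7,4) code, a textbook single-error-correcting code. The talk does not say which ECC scheme it has in mind, so the choice of code and the function names are assumptions made for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out in positions 1..7."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1          # repair the flipped bit
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                        # simulate a single-bit upset in memory
assert hamming74_decode(stored) == word
```

Three extra bits protect four stored bits, which fits the "protects stored values efficiently" point; nothing comparable falls out for values that are being transformed inside an arithmetic unit.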

12
Solutions in Time

Compiler generates code containing replicated computations, fault detection checks and recovery routines
e.g. [Rebaudengo 01], CFCSS [Oh et al. 02], SWIFT or CRAFT [Reis et al. 05], ...
pros: software-controlled; new code with better reliability properties may be deployed whenever, wherever needed
cons: for a fixed reliability policy, slower than specialized hardware solutions

13
Solutions in Time

Compiler generates code containing replicated computations, fault detection checks and recovery routines
e.g. [Rebaudengo 01], CFCSS [Oh et al. 02], SWIFT or CRAFT [Reis et al. 05], ...
pros: flexibility; new code with better reliability properties may be deployed whenever, wherever needed
cons: for a fixed reliability policy, slower than specialized hardware solutions
cons: it might not actually work

14
It might not actually work
15
Agenda

Answer basic scientific questions about software-controlled fault tolerance:
Do software-only or hybrid SW/HW techniques actually work?
For what fault models? How do we specify them?
How can we prove it?
Build compilers that produce software that runs reliably on faulty hardware
Moreover: let's not replace faulty hardware with faulty software.
Let's prove every binary we produce is fault tolerant relative to the specified fault model

16
A Possible Compiler Architecture
compiler front end
ordinary program
reliability transform
fault tolerant program
optimization
optimized FT program
17
A Possible Compiler Architecture
compiler front end
ordinary program
reliability transform
fault tolerant program
optimization
optimized FT program
Testing requirements: all combinations of features multiplied by all combinations of faults
18
A More Reliable Compiler Architecture
compiler front end
ordinary program
reliability transform
fault tolerant program
reliability proof
optimization
optimized FT program
modified proof
proof checker
19
A More Reliable Compiler Architecture
compiler front end
ordinary program
reliability transform
fault tolerant program
reliability proof
optimization
optimized FT program
modified proof
proof checker
Testing requirements: all combinations of features multiplied by all combinations of faults
20
Central Technical Challenges

Designing

21
Step 1 Lambda Zap

Lambda Zap [ICFP 06]
a lambda calculus that exhibits intermittent data faults, plus operators to detect and correct them
a type system that guarantees observable outputs of well-typed programs do not change in the presence of a single fault
types act as the proofs of fault tolerance
expressive enough to implement an ordinary typed lambda calculus
End result:
the foundation for a fault-tolerant typed intermediate language

22
The Fault Model

Lambda zap models simple data faults only:

( M, F v1 ) ---> ( M, F v2 )

Not modelled:
memory faults (better protected using ECC hardware)
control-flow faults (i.e. faults during control-flow transfer)
instruction faults (i.e. faults in instruction opcodes)
Goal: to construct programs that tolerate 1 fault
observers cannot distinguish between fault-free and 1-fault runs

23
Lambda to Lambda Zap The main idea
let x = 2 in let y = x + x in out y
24
Lambda to Lambda Zap The main idea
Source program:
let x = 2 in let y = x + x in out y
Translated program (replicate instructions; atomic majority-vote output):
let x1 = 2 in let x2 = 2 in let x3 = 2 in
let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
out y1, y2, y3
25
Lambda to Lambda Zap The main idea
Source program:
let x = 2 in let y = x + x in out y
Translated program, with x3 corrupted to 7:
let x1 = 2 in let x2 = 2 in let x3 = 7 in
let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
out y1, y2, y3
26
Lambda to Lambda Zap The main idea
Source program:
let x = 2 in let y = x + x in out y
Translated program, with x3 corrupted to 7:
let x1 = 2 in let x2 = 2 in let x3 = 7 in
let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
out y1, y2, y3
corrupted values are copied and percolate through the computation,
but the final output is unchanged
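
The triplicated program and its voted output can be simulated directly. This is a sketch, not the Lambda Zap semantics: it assumes the operator lost from the slide text is addition, and `vote` and `run_replicated` are illustrative names.

```python
def vote(a, b, c):
    """Majority of three copies; None if all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def run_replicated(x3_fault=None):
    """Triplicated form of: let x = 2 in let y = x + x in out y."""
    x1, x2, x3 = 2, 2, 2
    if x3_fault is not None:
        x3 = x3_fault                  # injected data fault in the third copy
    y1, y2, y3 = x1 + x1, x2 + x2, x3 + x3
    return vote(y1, y2, y3)            # the atomic majority-vote output


print(run_replicated())                # fault-free run
print(run_replicated(x3_fault=7))      # y3 becomes 14, but the vote still yields 4
```

Both calls print 4: the corrupted value propagates into y3 only, so two healthy copies still outvote it.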
27
Lambda to Lambda Zap Control-flow
recursively translate subexpressions
Source program:
let x = 2 in if x then e1 else e2
Translated program (majority vote on control-flow transfer):
let x1 = 2 in let x2 = 2 in let x3 = 2 in
if x1, x2, x3 then e1 else e2
28
Lambda to Lambda Zap Control-flow
recursively translate subexpressions
Source program:
let x = 2 in if x then e1 else e2
Translated program (majority vote on control-flow transfer):
let x1 = 2 in let x2 = 2 in let x3 = 2 in
if x1, x2, x3 then e1 else e2
(function calls replicate arguments, results and the function itself)
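
A voted branch can be sketched the same way: the three condition copies are reconciled first, and only then is a branch taken. The names `replicated_if`, `then_branch`, and `else_branch` are hypothetical, and Python callables stand in for the subexpressions e1 and e2.

```python
def vote(a, b, c):
    """Majority of three copies; raises if all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority among the three copies")


def replicated_if(c1, c2, c3, then_branch, else_branch):
    """Branch on the majority of three condition copies, never on a single one."""
    return then_branch() if vote(c1, c2, c3) else else_branch()


# One corrupted condition copy does not redirect control flow.
print(replicated_if(True, True, False, lambda: "e1", lambda: "e2"))  # prints "e1"
```

The branches are passed as thunks so that only the chosen one is evaluated, mirroring an ordinary conditional.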
29
Almost too easy, can anything go wrong?...
30
Almost too easy, can anything go wrong?...
yes! optimization reduces replication overhead dramatically (e.g. 43% for 2 copies), but can be unsound!
the original implementation of SWIFT [Reis et al.] optimized away all redundancy, leaving them with an unreliable implementation!
31
Faulty Optimizations
Before CSE:
let x1 = 2 in let x2 = 2 in let x3 = 2 in
let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
out y1, y2, y3
After CSE:
let x1 = 2 in let y1 = x1 + x1 in out y1, y1, y1
In general, optimizations eliminate redundancy; fault tolerance requires redundancy.
32
The Essential Problem
Bad code:
let x1 = 2 in let y1 = x1 + x1 in out y1, y1, y1
voters depend on the common value x1
33
The Essential Problem
Good code:
let x1 = 2 in let x2 = 2 in let x3 = 2 in
let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
out y1, y2, y3
voters do not depend on a common value
Bad code:
let x1 = 2 in let y1 = x1 + x1 in out y1, y1, y1
voters depend on the common value x1
34
The Essential Problem
Good code:
let x1 = 2 in let x2 = 2 in let x3 = 2 in
let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
out y1, y2, y3
voters do not depend on a common value (red on red, green on green, blue on blue)
Bad code:
let x1 = 2 in let y1 = x1 + x1 in out y1, y1, y1
voters depend on a common value
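
The difference between the two programs shows up as soon as a fault is injected into x1. This is a sketch under the same assumption that the elided operator is addition; `good_code` and `bad_code` are illustrative names for the programs before and after CSE.

```python
def vote(a, b, c):
    """Majority of three copies; None if all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def good_code(x1_fault=None):
    """Three independent copies: each voter has its own inputs."""
    x1, x2, x3 = 2, 2, 2
    if x1_fault is not None:
        x1 = x1_fault
    return vote(x1 + x1, x2 + x2, x3 + x3)


def bad_code(x1_fault=None):
    """After CSE: all three voters are the same value derived from x1."""
    x1 = 2
    if x1_fault is not None:
        x1 = x1_fault
    y1 = x1 + x1
    return vote(y1, y1, y1)


print(good_code(), bad_code())                        # both fault-free runs
print(good_code(x1_fault=7), bad_code(x1_fault=7))    # same fault in x1
```

The fault-free runs agree on 4. With x1 corrupted to 7, the good code still outputs 4, while the bad code unanimously "votes" for 14.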
35
A Type System for Lambda Zap

Key idea: types track the color of the underlying value and prevent interference between colors.

Colors: $C ::= R \mid G \mid B$
Types: $T ::= C\ \mathsf{int} \mid C\ \mathsf{bool} \mid C\ (T_1, T_2, T_3) \to (T_1, T_2, T_3)$
36
Sample Typing Rules
Judgement form: $\Gamma \vdash_z e : T$, where $z ::= C \mid \cdot$
simple value typing rules:
$$\frac{(x : T) \in \Gamma}{\Gamma \vdash_z x : T} \qquad \frac{}{\Gamma \vdash_z C\ n : C\ \mathsf{int}} \qquad \frac{}{\Gamma \vdash_z C\ \mathsf{true} : C\ \mathsf{bool}}$$
37
Sample Typing Rules
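Judgement form: $\Gamma \vdash_z e : T$, where $z ::= C \mid \cdot$
sample expression typing rules:
$$\frac{\Gamma \vdash_z e_1 : C\ \mathsf{int} \quad \Gamma \vdash_z e_2 : C\ \mathsf{int}}{\Gamma \vdash_z e_1 + e_2 : C\ \mathsf{int}}$$
$$\frac{\Gamma \vdash_z e_1 : R\ \mathsf{bool} \quad \Gamma \vdash_z e_2 : G\ \mathsf{bool} \quad \Gamma \vdash_z e_3 : B\ \mathsf{bool} \quad \Gamma \vdash_z e_4 : T \quad \Gamma \vdash_z e_5 : T}{\Gamma \vdash_z \mathsf{if}\ e_1, e_2, e_3\ \mathsf{then}\ e_4\ \mathsf{else}\ e_5 : T}$$
$$\frac{\Gamma \vdash_z e_1 : R\ \mathsf{int} \quad \Gamma \vdash_z e_2 : G\ \mathsf{int} \quad \Gamma \vdash_z e_3 : B\ \mathsf{int} \quad \Gamma \vdash_z e_4 : T}{\Gamma \vdash_z \mathsf{out}\ e_1, e_2, e_3;\ e_4 : T}$$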
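
To make the color discipline concrete, here is a toy checker for a fragment of these rules. It is a deliberate simplification and not the Lambda Zap type system: it has no context, no z index, and no continuation after `out`; the tuple encoding of expressions and the name `type_of` are assumptions made for the sketch.

```python
# Types are (color, base) pairs such as ("R", "int"); expressions are tagged tuples.
def type_of(expr):
    tag = expr[0]
    if tag == "lit":                       # ("lit", color, value): a colored constant
        _, color, value = expr
        base = "bool" if isinstance(value, bool) else "int"
        return (color, base)
    if tag == "add":                       # ("add", e1, e2): operands must share a color
        t1, t2 = type_of(expr[1]), type_of(expr[2])
        if t1 != t2 or t1[1] != "int":
            raise TypeError("operands must be ints of the same color")
        return t1
    if tag == "out":                       # ("out", e1, e2, e3): one voter per color
        types = [type_of(e) for e in expr[1:4]]
        if types != [("R", "int"), ("G", "int"), ("B", "int")]:
            raise TypeError("out needs one R, one G and one B int, in that order")
        return "unit"
    raise ValueError(f"unknown expression tag: {tag}")


def copy(color):
    """One colored copy of the computation 2 + 2."""
    return ("add", ("lit", color, 2), ("lit", color, 2))


well_typed = ("out", copy("R"), copy("G"), copy("B"))
after_cse = ("out", copy("R"), copy("R"), copy("R"))   # voters share a color

print(type_of(well_typed))                 # accepted
try:
    type_of(after_cse)
except TypeError as err:
    print("rejected:", err)
```

The checker accepts the program whose voters have three distinct colors and rejects the one in which all three voters are red, which is the shape a redundancy-eliminating optimization would produce.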
38
Sample Typing Rules
Judgement form: $\Gamma \vdash_z e : T$, where $z ::= C \mid \cdot$
recall the zap rule from the operational semantics:
( M, F v1 ) ---> ( M, F v2 )
before: $\vdash v_1 : T$
after: $\vdash v_2 \mathrel{??} T$
=> how will we obtain type preservation?
39
Sample Typing Rules
Judgement form: $\Gamma \vdash_z e : T$, where $z ::= C \mid \cdot$
recall the zap rule from the operational semantics:
( M, F v1 ) ---> ( M, F v2 )
before: $\vdash v_1 : C\ U$
after: $\vdash_C v_2 : C\ U$, by the rule (no conditions):
$$\frac{}{\Gamma \vdash_C C\ v : C\ U}$$
faulty typing occurs within a single color only.
40
Theorems

Theorem 1: Well-typed programs are safe, even when there is a single error.
Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors, with a caveat.
Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap that satisfies the caveat.

[ICFP 06]
41
The Caveat
Goal: 0-fault and 1-fault executions should be indistinguishable
Bad, but well-typed code:
out 2, 3, 3 (outputs 3 after no faults)
out 2, 3, 3 becomes out 2, 2, 3 (outputs 2 after 1 fault)
More importantly, out 2, 3, 3 is obviously a symptom of a compiler bug; out 2, 3, 4 is even worse: good runs never come to consensus
Solution: computations must be independent, but equivalent
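
The caveat can be checked by brute force: enumerate every way a single copy can be corrupted and collect the voted outputs. This is a sketch; `single_fault_outputs` and the small value domain are assumptions chosen for illustration.

```python
def vote(a, b, c):
    """Majority of three copies; None if all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def single_fault_outputs(copies, domain):
    """All voted outputs reachable by corrupting at most one of the three copies."""
    outputs = {vote(*copies)}              # the fault-free run
    for i in range(3):
        for value in domain:
            faulty = list(copies)
            faulty[i] = value              # corrupt exactly one copy
            outputs.add(vote(*faulty))
    return outputs


print(single_fault_outputs((4, 4, 4), range(8)))   # equivalent copies
print(single_fault_outputs((2, 3, 3), range(8)))   # well-colored but not equivalent
```

Equivalent copies yield the single output 4 whatever the fault. The triple 2, 3, 3 yields 3 without faults, 2 when the second copy is corrupted to 2, and no majority at all for other corruptions, so an observer can tell the runs apart.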
42
The Caveat
modified typing:
$$\frac{\Gamma \vdash_z e_1 : R\ U \quad \Gamma \vdash_z e_2 : G\ U \quad \Gamma \vdash_z e_3 : B\ U \quad \Gamma \vdash_z e_4 : T \quad \Gamma \vdash_z e_1 = e_2 \quad \Gamma \vdash_z e_2 = e_3}{\Gamma \vdash_z \mathsf{out}\ e_1, e_2, e_3;\ e_4 : T}$$
43
The Caveat
More generally, programmers may form triples of equivalent values
Introduction form: e1, e2, e3
Elimination form: let x1, x2, x3 = e1 in e2

a collection of 3 items
each of the 3 is stored in a separate register
a single fault affects at most one

44
The Caveat
More generally, programmers may form triples of equivalent values
Introduction form:
$$\frac{\Gamma \vdash_z e_1 : R\ U \quad \Gamma \vdash_z e_2 : G\ U \quad \Gamma \vdash_z e_3 : B\ U \quad \Gamma \vdash_z e_1 = e_2 \quad \Gamma \vdash_z e_2 = e_3}{\Gamma \vdash_z e_1, e_2, e_3 : (R\ U,\ G\ U,\ B\ U)}$$
Elimination form:
$$\frac{\Gamma \vdash_z e_1 : (R\ U,\ G\ U,\ B\ U) \quad \Gamma,\ x_1 : R\ U,\ x_2 : G\ U,\ x_3 : B\ U,\ x_1 = x_2,\ x_2 = x_3 \vdash_z e_2 : T}{\Gamma \vdash_z \mathsf{let}\ x_1, x_2, x_3 = e_1\ \mathsf{in}\ e_2 : T}$$
45
Theorems

Theorem 1: Well-typed programs are safe, even when there is a single error.
Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors.
Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap.

There is still one "i" to be dotted in the proofs of these theorems. Lester Mackey, brilliant Princeton undergrad, has proven all key theorems modulo the dotted "i".
46
Step 2 Fault Tolerant Typed Assembly Language
(TAL/FT)

Lambda zap is a playground for studying the principles of fault tolerance in an idealized setting
TAL/FT is a more realistic assembly-level, hybrid HW/SW fault tolerance scheme with:
a formal fault model
a formal definition of fault tolerance relative to memory-mapped I/O
a sound type system for proving compiled programs are fault tolerant

47
A More Reliable Compiler Architecture
compiler front end
ordinary program
reliability transform
types
reliability proof
TAL/FT
optimization
types
optimized TAL/FT
modified proof
type
proof checker
48
TAL/FT Key Ideas (Fault Model)

Fault model
registers may incur arbitrary faults in between
execution of any two instructions
memory (including code) is protected by ECC
fault model formalized as part of hardware
operational semantics

49
TAL/FT Key Ideas (Properties)
[Diagram: Processor, ECC-protected memory, and mem-mapped I/O device, connected by "read" and "store" arrows]

Primary goal: if there is one fault, then either
(1) the mem-mapped I/O device sees exactly the same sequence of stores as a fault-free execution, or
(2) hardware detects and signals a fault, and the mem-mapped I/O device sees a prefix of the stores from a fault-free execution
Secondary goal: no false positives
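
The primary goal reads naturally as a predicate over store traces. The sketch below is only an informal restatement: representing a store as an (address, value) pair and the name `satisfies_primary_goal` are assumptions, not the formal TAL/FT definition.

```python
def satisfies_primary_goal(fault_free, observed, fault_signaled):
    """Check one faulty run against the fault-free sequence of stores."""
    fault_free, observed = list(fault_free), list(observed)
    if observed == fault_free:
        return True                                    # case (1): identical stores
    is_prefix = observed == fault_free[:len(observed)]
    return fault_signaled and is_prefix                # case (2): detected, prefix only


golden = [(0x10, 1), (0x14, 2), (0x18, 3)]
print(satisfies_primary_goal(golden, golden, False))                     # True
print(satisfies_primary_goal(golden, golden[:2], True))                  # True
print(satisfies_primary_goal(golden, golden[:2], False))                 # False
print(satisfies_primary_goal(golden, [(0x10, 1), (0x14, 9)], True))      # False
```

The third case is a run that silently stops early, and the fourth lets a corrupted store reach the device; both violate the goal even though the latter was eventually detected.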

50
TAL/FT Key Ideas (Mechanisms)

Compiler strategy:
create two redundant computations, as in lambda zap
two copies => fault detection
fault recovery handled by a higher-level process
Hardware support:
special instructions + modified store buffer for implementing reliable stores
special instructions for reliable control-flow transfers
Type system:
simple typing mechanisms based on original TAL [Morrisett, Walker, et al.]
redundant values with separate colors, like in lambda zap
value identities needed for equivalence checking tracked using singleton types combined with some ideas drawn from traditional Hoare logics
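
A rough software analogy for "two copies => fault detection" is a store that commits only when both redundant computations agree on the address and the value. The slides do not describe the actual TAL/FT instructions or store buffer, so `CheckedDevice`, `FaultDetected`, and `program` below are hypothetical stand-ins.

```python
class FaultDetected(Exception):
    """Raised when the two redundant copies disagree at a store."""


class CheckedDevice:
    """Hypothetical stand-in for a reliable store to a memory-mapped device."""

    def __init__(self):
        self.stores = []

    def store(self, green, blue):
        # green and blue are (address, value) pairs from the two redundant copies
        if green != blue:
            raise FaultDetected(f"copies disagree: {green} vs {blue}")
        self.stores.append(green)


def program(device, corrupt_blue=None):
    g, b = 2, 2                                   # two redundant copies of one value
    device.store((0x10, g + g), (0x10, b + b))
    if corrupt_blue is not None:
        b = corrupt_blue                          # injected register fault
    device.store((0x14, g * 3), (0x14, b * 3))


device = CheckedDevice()
try:
    program(device, corrupt_blue=7)
except FaultDetected as err:
    print("fault signaled:", err)
print(device.stores)                              # only the first store was committed
```

With two copies the mismatch can be detected but not repaired, since there is no third copy to break the tie; the device is left with a prefix of the fault-free stores, and recovery is left to a higher-level process.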

51
Current & Future Work

Build the first compiler that can automatically generate reliability proofs for compiled programs
TAL/FT refinement and implementation
type- and reliability-preserving optimizations
Study alternative fault detection schemes
fault detection & recovery on current hardware
exploit multi-core alternatives
Understand the foundational theoretical principles that allow programs to tolerate transient faults
general purpose program logics for reasoning about faulty programs

52
Other Research I Do

PADS [popl 06, sigmod 06 demo, popl 07]
automatic generation of tools (parsers, printers, validators, format translators, query engines, etc.) for ad hoc data formats
with Kathleen Fisher (AT&T)
Program Monitoring [popl 00, icfp 03, pldi 05, popl 06, ...]
semantics, design and implementation of programs that monitor other programs for security (or other purposes)
TAL & other type systems [popl 98, popl 99, toplas 99, jfp 02, ...]
theory, design and implementation of type systems for compiler target and intermediate languages

53
Conclusions

Semiconductor manufacturers are deeply worried about how to deal with soft faults in future architectures (10 years out)
Using proofs and types, I think we are going to be able to develop highly reliable software that runs on unreliable hardware

54
end!
55
The Caveat
56
Function O.S. follows
57
Lambda to Lambda Zap Control-flow
Source program:
let f = \x.e in f 2
Translated program (majority vote on control-flow transfer):
let f1, f2, f3 = \x.e in f1, f2, f3 2, 2, 2
58
Lambda to Lambda Zap Control-flow
Source program:
let f = \x.e in f 2
Translated program (majority vote on control-flow transfer):
let f1, f2, f3 = \x.e in f1, f2, f3 2, 2, 2
operational semantics:
(M, let f1, f2, f3 = \x.e1 in e2) ---> (M, l = \x.e1, e2[l/f1][l/f2][l/f3])
59
TAL/FT Hardware
replicated program counters
store queue
ECC-protected Caches/Memory
60
Related Work Follows
61
Software Mitigation Techniques

Examples:
N-version programming, EDDI, CFCSS [Oh et al. 2002], SWIFT [Reis et al. 2005], ...
Hybrid hardware-software techniques: Watchdog Processors, CRAFT [Reis et al. 2005], ...
Pros:
immediate deployment
would have benefitted Los Alamos Labs, etc.
policies may be customized to the environment, application
reduced hardware cost
Cons:
For the same universal policy, slower (but not as much as you'd think).

62
Software Mitigation Techniques

Examples:
N-version programming, EDDI, CFCSS [Oh et al. 2002], SWIFT [Reis et al. 2005], etc.
Hybrid hardware-software techniques: Watchdog Processors, CRAFT [Reis et al. 2005], etc.
Pros:
immediate deployment: if your system is suffering soft error-related failures, you may deploy new software immediately
would have benefitted Los Alamos Labs, etc.
policies may be customized to the environment, application
reduced hardware cost
Cons:
For the same universal policy, slower (but not as much as you'd think).
IT MIGHT NOT ACTUALLY WORK!

63
Mitigation Techniques

Hardware
error-correcting codes
redundant hardware
Pros
fast for a fixed policy
Cons
FT policy decided at hardware design time
mistakes cost millions
one-size-fits-all policy
expensive

Software and hybrid schemes
replicate computations
Pros
immediate deployment
policies customized to environment, application
reduced hardware cost
Cons
for the same universal policy, slower (but not as much as you'd think).

64
Mitigation Techniques

Hardware
error-correcting codes
redundant hardware
Pros
fast for fixed policy
Cons
FT policy decided at hardware design time
mistakes cost millions
one-size-fits-all policy
expensive

Software and hybrid schemes
replicate computations
Pros
immediate deployment
policies customized to environment, application
reduced hardware cost
Cons
for the same universal policy, slower (but not as much as you'd think).
It may not actually work!
much research in the HW/compilers community is completely lacking proof

65
Solutions in Time

Solutions in Hardware
replication of instructions and checking implemented in special-purpose hardware
e.g. [Reinhardt & Mukherjee 2000]
pros: transparent to software
cons: one-size-fits-all reliability policy
cons: can't fix existing problems; specialized hardware has a reduced market
Solutions in Software (or hybrid Hardware/Software)
compiler generates replicated instructions and checking code
e.g. [Reis et al. 05]
pros: flexibility; new reliability policies may be deployed whenever needed
cons: for a fixed reliability policy, slower than specialized hardware solutions
cons: it might not actually work
