On Cosmic Rays, Bat Droppings, and what to do about them

Title: a theory of aspects Author: CS Last modified by: dpw Created Date: 8/12/2003 2:51:00 PM Document presentation format: On-screen Show Company – PowerPoint PPT presentation

1
On Cosmic Rays, Bat Droppings, and what to do
about them
  • David Walker
  • Princeton University
  • with Jay Ligatti, Lester Mackey, George Reis and
    David August

2
A Little-Publicized Fact
1 + 1 = 2 (or, occasionally, 3)
3
How do Soft Faults Happen?
  • Galactic particles: high-energy particles that penetrate to Earth's
    surface, through buildings and walls
  • Solar particles: affect satellites; cause < 5% of terrestrial problems
  • Alpha particles from bat droppings
  • High-energy particles pass through devices and collide with silicon
    atoms
  • A collision generates an electric charge that can flip a single bit
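
A single bit flip of this kind can be modelled as an XOR with a one-bit mask. The sketch below is only an illustration; the name `flip_bit` and the example values are assumptions, not part of the talk.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with one bit position inverted (models one soft fault)."""
    return value ^ (1 << bit)


original = 2
low = flip_bit(original, 0)    # flips the least significant bit
high = flip_bit(original, 7)   # flips bit 7 of the same value

# The same physical event can cause a small or a large numeric change,
# depending on which bit is hit.
print(original, low, high)
```

Flipping the same bit twice restores the original value, which is why the fault is called transient: the hardware is not damaged, only the stored value.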

6
How Often do Soft Faults Happen?
[Figure: IBM Soft Fail Rate Study, mainframes, 1983-86; sites shown:
Leadville, CO; Denver, CO; Tucson, AZ; NYC. Source: Zeiger-Puchner 2004]
  • Some data points:
      • 1983-86: Leadville (highest incorporated city in the US):
        1 fail / 2 days
      • 1983-86: subterranean experiment under 50 ft of rock:
        no fails in 9 months
      • 2004: 1 fail / year for a laptop with 1 GB RAM at sea level
      • 2004: 1 fail / trans-Pacific round trip
        [Zeiger-Puchner 2004]

8
How Often do Soft Faults Happen?
[Figure: Soft Error Rate Trends, Shekhar Borkar, Intel, 2004.
Annotations: "we are approximately here"; "6 years from now"]
  • Soft error rates go up as:
      • voltages decrease
      • feature sizes decrease
      • transistor density increases
      • clock rates increase
  • These are all future manufacturing trends
9
How Often do Soft Faults Happen?
  • In 1948, Presper Eckert notes that cascading effects of a single-bit
    error destroyed hours of ENIAC's work. [Zeiger-Puchner 2004]
  • In 2000, Sun server systems deployed to America Online, eBay, and
    others crashed due to cosmic rays. [Baumann 2002]
  • "The wake-up call came in the end of 2001 ... billion-dollar factory
    ground to a halt every month due to ... a single bit flip."
    [Zeiger-Puchner 2004]
  • Los Alamos National Lab's Hewlett-Packard ASC Q 2048-node
    supercomputer was crashing regularly from soft faults due to cosmic
    radiation. [Michalak 2005]

10
What Problems do Soft Faults Cause?
  • A single bit in memory gets flipped
  • A single bit in the processor logic gets flipped, and then either:
      • there's no difference in externally observable behavior
      • the processor locks up
      • the computation is silently corrupted:
          • register value corrupted (simple data fault)
          • control-flow transfer goes to the wrong place (control-flow
            fault)
          • different opcode interpreted (instruction fault)

11
FT Solutions
  • Redundancy in information
      • e.g., error-correcting codes (ECC)
      • Pros: protects stored values efficiently
      • Cons: difficult to design for arithmetic units and control logic
  • Redundancy in space
      • multiple redundant hardware devices
      • e.g., Compaq NonStop Himalaya runs two identical programs on two
        processors, comparing pins on every cycle
      • Pros: efficient in time
      • Cons: expensive in hardware (double the space)
  • Redundancy in time
      • perform the same computations at different times (e.g., in
        sequence)
      • Pros: efficient in hardware (space is reused)
      • Cons: expensive in time (slower, but not twice as slow)
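
To make "redundancy in information" concrete, here is a minimal sketch of a Hamming(7,4) code, which stores 4 data bits in 7 bits and corrects any single flipped bit. The talk does not name a particular code; Hamming(7,4) is a textbook choice made here for illustration, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # check over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # check over positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # check over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4  # 0 means no error, else the 1-based position
    if syndrome:
        c[syndrome - 1] ^= 1         # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[5] ^= 1                       # inject a single soft fault
print(hamming74_decode(stored) == word)
```

This protects a value while it sits in storage. It does not protect the value while an arithmetic unit is computing with it, which is the limitation the "Cons" bullet points at.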

12
Solutions in Time
  • Compiler generates code containing replicated computations, fault
    detection checks and recovery routines
      • e.g., [Rebaudengo 01], CFCSS [Oh et al. 02], SWIFT or CRAFT
        [Reis et al. 05], ...
  • Pros: software-controlled; new code with better reliability
    properties may be deployed whenever, wherever needed
  • Cons: for a fixed reliability policy, slower than specialized
    hardware solutions

13
Solutions in Time
  • Compiler generates code containing replicated computations, fault
    detection checks and recovery routines
      • e.g., [Rebaudengo 01], CFCSS [Oh et al. 02], SWIFT or CRAFT
        [Reis et al. 05], ...
  • Pros: flexibility; new code with better reliability properties may
    be deployed whenever, wherever needed
  • Cons: for a fixed reliability policy, slower than specialized
    hardware solutions
  • Cons: it might not actually work

14
It might not actually work
15
Agenda
  • Answer basic scientific questions about software-controlled fault
    tolerance:
      • Do software-only or hybrid SW/HW techniques actually work?
      • For what fault models? How do we specify them?
      • How can we prove it?
  • Build compilers that produce software that runs reliably on faulty
    hardware
      • Moreover: let's not replace faulty hardware with faulty software.
      • Let's prove every binary we produce is fault tolerant relative to
        the specified fault model

17
A Possible Compiler Architecture
compiler front end --> ordinary program --> reliability transform
  --> fault-tolerant program --> optimization --> optimized FT program
Testing requirements: all combinations of features multiplied by all
combinations of faults
18
A More Reliable Compiler Architecture
compiler front end --> ordinary program --> reliability transform
  --> fault-tolerant program + reliability proof --> optimization
  --> optimized FT program + modified proof --> proof checker
19
A More Reliable Compiler Architecture
compiler front end --> ordinary program --> reliability transform
  --> fault-tolerant program + reliability proof --> optimization
  --> optimized FT program + modified proof --> proof checker
Testing requirements: all combinations of features multiplied by all
combinations of faults
20
Central Technical Challenges
  • Designing

21
Step 1: Lambda Zap
  • Lambda Zap [ICFP 06]
      • a lambda calculus that exhibits intermittent data faults, plus
        operators to detect and correct them
      • a type system that guarantees observable outputs of well-typed
        programs do not change in the presence of a single fault
          • types act as the proofs of fault tolerance
      • expressive enough to implement an ordinary typed lambda calculus
  • End result
      • the foundation for a fault-tolerant typed intermediate language

22
The Fault Model
  • Lambda zap models simple data faults only:

      ( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

  • Not modelled:
      • memory faults (better protected using ECC hardware)
      • control-flow faults (i.e., faults during control-flow transfer)
      • instruction faults (i.e., faults in instruction opcodes)
  • Goal: to construct programs that tolerate 1 fault
      • observers cannot distinguish between fault-free and 1-fault runs

23
Lambda to Lambda Zap: The main idea
  let x = 2 in let y = x + x in out y
24
Lambda to Lambda Zap: The main idea
Source program:
  let x = 2 in let y = x + x in out y
Translation (replicate instructions; atomic majority vote + output):
  let x1 = 2 in let x2 = 2 in let x3 = 2 in
  let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
  out [y1, y2, y3]
26
Lambda to Lambda Zap: The main idea
Source program:
  let x = 2 in let y = x + x in out y
Translation, with a fault that changes x3 from 2 to 7:
  let x1 = 2 in let x2 = 2 in let x3 = 7 in
  let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
  out [y1, y2, y3]
Corrupted values are copied and percolate through the computation,
but the final output is unchanged.
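
The translated program can be mimicked in a few lines of Python. This is a minimal sketch rather than the calculus itself: `majority` and `run_replicated` are hypothetical names, and the injected value 7 is taken from the slide.

```python
from collections import Counter


def majority(votes):
    """Return the value held by at least two of the three replicas."""
    value, count = Counter(votes).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all replicas disagree")
    return value


def run_replicated(fault_in=None, faulty_value=7):
    """Triplicated version of: let x = 2 in let y = x + x in out y.

    fault_in selects which replica of x (0, 1 or 2) is corrupted, if any.
    """
    xs = [2, 2, 2]
    if fault_in is not None:
        xs[fault_in] = faulty_value   # a single injected data fault
    ys = [x + x for x in xs]          # each replica computes independently
    return majority(ys)               # stands in for the atomic vote + output


print(run_replicated())             # fault-free run
print(run_replicated(fault_in=2))   # x3 corrupted, as on the slide
```

Both runs produce the same output, because the corrupted replica is outvoted by the two intact ones.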
28
Lambda to Lambda Zap: Control-flow
Source program:
  let x = 2 in if x then e1 else e2
Translation (recursively translate subexpressions; majority vote on
control-flow transfer):
  let x1 = 2 in let x2 = 2 in let x3 = 2 in
  if [x1, x2, x3] then e1 else e2
(Function calls replicate arguments, results and the function itself.)
30
Almost too easy, can anything go wrong?...
Yes! Optimization reduces replication overhead dramatically (e.g., 43%
for 2 copies), but can be unsound! The original implementation of SWIFT
[Reis et al.] optimized away all redundancy, leaving them with an
unreliable implementation!!
31
Faulty Optimizations
Before optimization:
  let x1 = 2 in let x2 = 2 in let x3 = 2 in
  let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
  out [y1, y2, y3]
After common subexpression elimination (CSE):
  let x1 = 2 in let y1 = x1 + x1 in out [y1, y1, y1]
In general, optimizations eliminate redundancy; fault tolerance
requires redundancy.
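
The effect of the CSE step can be checked directly by injecting the same fault into both versions. A minimal sketch, with hypothetical function names and the faulty value 7 chosen for illustration:

```python
def majority(a, b, c):
    """Majority vote over three replicas; assumes at least two of them agree."""
    return a if a == b or a == c else b


def good_code(x1, x2, x3):
    """Three independent replicas, as produced by the reliability transform."""
    y1, y2, y3 = x1 + x1, x2 + x2, x3 + x3
    return majority(y1, y2, y3)


def cse_code(x1):
    """After CSE: all three votes are copies of the same computation."""
    y1 = x1 + x1
    return majority(y1, y1, y1)


print(good_code(2, 2, 2), cse_code(2))   # fault-free: both agree
print(good_code(7, 2, 2), cse_code(7))   # x1 corrupted: only good_code recovers
```

In the fault-free case the two versions are indistinguishable, which is exactly why an ordinary optimizer considers the rewrite legal.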
33
The Essential Problem
Good code (voters do not depend on a common value):
  let x1 = 2 in let x2 = 2 in let x3 = 2 in
  let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
  out [y1, y2, y3]
Bad code (voters depend on common value x1):
  let x1 = 2 in let y1 = x1 + x1 in out [y1, y1, y1]
34
The Essential Problem
Good code (voters do not depend on a common value: red on red, green
on green, blue on blue):
  let x1 = 2 in let x2 = 2 in let x3 = 2 in
  let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in
  out [y1, y2, y3]
Bad code (voters depend on a common value):
  let x1 = 2 in let y1 = x1 + x1 in out [y1, y1, y1]
35
A Type System for Lambda Zap
  • Key idea: types track the color of the underlying value and prevent
    interference between colors

Colors  C ::= R | G | B
Types   T ::= C int | C bool | C (T1,T2,T3) -> (T1,T2,T3)
36
Sample Typing Rules
Judgement form: Γ ⊢z e : T   where z ::= C | .
Simple value typing rules:

  (x : T) in Γ
  ---------------
  Γ ⊢z x : T

  ------------------------
  Γ ⊢z C n : C int

  ------------------------------
  Γ ⊢z C true : C bool
37
Sample Typing Rules
Judgement form: Γ ⊢z e : T   where z ::= C | .
Sample expression typing rules:

  Γ ⊢z e1 : C int    Γ ⊢z e2 : C int
  -------------------------------------
  Γ ⊢z e1 + e2 : C int

  Γ ⊢z e1 : R bool   Γ ⊢z e2 : G bool   Γ ⊢z e3 : B bool
  Γ ⊢z e4 : T        Γ ⊢z e5 : T
  ------------------------------------------------------
  Γ ⊢z if [e1, e2, e3] then e4 else e5 : T

  Γ ⊢z e1 : R int    Γ ⊢z e2 : G int    Γ ⊢z e3 : B int
  Γ ⊢z e4 : T
  ------------------------------------------------------
  Γ ⊢z out [e1, e2, e3] e4 : T
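
A toy checker shows how the color discipline rejects code whose voters share a value. It is a deliberately reduced sketch: it covers only colored integer literals, addition and the arguments of out, and it omits variables, the context Γ and the zap tag z. The tuple encoding and the names `type_of` and `check_out` are assumptions made for this example.

```python
COLORS = ("R", "G", "B")


def type_of(expr):
    """Return the color C such that expr has type `C int`, or raise TypeError."""
    tag = expr[0]
    if tag == "int":                      # ("int", color, n)  ~  C n : C int
        color = expr[1]
        if color not in COLORS:
            raise TypeError(f"unknown color {color!r}")
        return color
    if tag == "add":                      # ("add", e1, e2): both operands share a color
        c1, c2 = type_of(expr[1]), type_of(expr[2])
        if c1 != c2:
            raise TypeError("operands of + have different colors")
        return c1
    raise TypeError(f"unknown expression {tag!r}")


def check_out(e1, e2, e3):
    """out needs one red, one green and one blue argument, in that order."""
    return (type_of(e1), type_of(e2), type_of(e3)) == COLORS


def two_plus_two(color):
    return ("add", ("int", color, 2), ("int", color, 2))


y1, y2, y3 = two_plus_two("R"), two_plus_two("G"), two_plus_two("B")
print(check_out(y1, y2, y3))   # replicated code is accepted
print(check_out(y1, y1, y1))   # CSE'd code is rejected: all three votes are red
```

Because a value of one color can never flow into a computation of another color, a fault confined to one color can corrupt at most one of the three votes.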
38
Sample Typing Rules
Judgement form: Γ ⊢z e : T   where z ::= C | .
Recall the zap rule from the operational semantics:

  ( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

Before:  ⊢ v1 : T
After:   ⊢ v2 : ?? T   => how will we obtain type preservation?
39
Sample Typing Rules
Judgement form: Γ ⊢z e : T   where z ::= C | .
Recall the zap rule from the operational semantics:

  ( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

Faulty typing occurs within a single color only.
Before:  ⊢ v1 : C U
After:   ⊢C v2 : C U   by the rule (no conditions):

  ----------------------
  Γ ⊢C C v : C U
40
Theorems
  • Theorem 1: Well-typed programs are safe, even when there is a single
    error.
  • Theorem 2: Well-typed programs executing with a single error simulate
    the output of well-typed programs with no errors (with a caveat).
  • Theorem 3: There is a correct, type-preserving translation from the
    simply-typed lambda calculus into lambda zap (that satisfies the
    caveat).

[ICFP 06]
41
The Caveat
Goal: 0-fault and 1-fault executions should be indistinguishable.
Bad, but well-typed code:
  out [2, 3, 3]                       outputs 3 after no faults
  out [2, 3, 3] ---> out [2, 2, 3]    outputs 2 after 1 fault
More importantly:
  • out [2, 3, 3] is obviously a symptom of a compiler bug
  • out [2, 3, 4] is even worse: good runs never come to consensus
Solution: computations must be independent, but equivalent.
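
The caveat can be checked mechanically by trying every single-replica corruption and comparing the vote with the fault-free vote. A minimal sketch; `tolerates_one_fault` and the range of candidate faulty values are assumptions made for this example.

```python
def majority(a, b, c):
    """Majority vote over three replicas; assumes at least two of them agree."""
    return a if a == b or a == c else b


def tolerates_one_fault(triple, candidate_values):
    """True if no single-replica corruption changes the voted output."""
    expected = majority(*triple)
    for position in range(3):
        for bad in candidate_values:
            faulty = list(triple)
            faulty[position] = bad        # corrupt exactly one replica
            if majority(*faulty) != expected:
                return False
    return True


values = range(8)
print(tolerates_one_fault((2, 2, 2), values))   # equivalent replicas
print(tolerates_one_fault((2, 3, 3), values))   # the slide's bad example
```

The triple (2, 3, 3) fails because its fault-free majority rests on only two replicas, so one fault is enough to tip the vote.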
42
The Caveat
Modified typing rule for out:

  Γ ⊢z e1 : R U    Γ ⊢z e2 : G U    Γ ⊢z e3 : B U    Γ ⊢z e4 : T
  Γ ⊢z e1 = e2     Γ ⊢z e2 = e3
  ----------------------------------------------------------------
  Γ ⊢z out [e1, e2, e3] e4 : T
43
The Caveat
More generally, programmers may form triples of equivalent values
Introduction form:  [e1, e2, e3]
Elimination form:   let [x1, x2, x3] = e1 in e2
  • a collection of 3 items
  • each of the 3 is stored in a separate register
  • a single fault affects at most one

44
The Caveat
More generally, programmers may form triples of equivalent values
Introduction form:

  Γ ⊢z e1 : R U    Γ ⊢z e2 : G U    Γ ⊢z e3 : B U
  Γ ⊢z e1 = e2     Γ ⊢z e2 = e3
  ---------------------------------------------
  Γ ⊢z [e1, e2, e3] : [R U, G U, B U]

Elimination form:

  Γ ⊢z e1 : [R U, G U, B U]
  Γ, x1 : R U, x2 : G U, x3 : B U, x1 = x2, x2 = x3 ⊢z e2 : T
  ---------------------------------------------
  Γ ⊢z let [x1, x2, x3] = e1 in e2 : T
45
Theorems
  • Theorem 1: Well-typed programs are safe, even when there is a single
    error.
  • Theorem 2: Well-typed programs executing with a single error simulate
    the output of well-typed programs with no errors.
  • Theorem 3: There is a correct, type-preserving translation from the
    simply-typed lambda calculus into lambda zap.

There is still one i to be dotted in the
proofs of these theorems. Lester Mackey,
brilliant Princeton undergrad, has proven all key
theorems modulo the dotted i.
46
Step 2: Fault Tolerant Typed Assembly Language (TAL/FT)
  • Lambda zap is a playground for studying the principles of fault
    tolerance in an idealized setting
  • TAL/FT is a more realistic assembly-level, hybrid HW/SW fault
    tolerance scheme with:
      • a formal fault model
      • a formal definition of fault tolerance relative to memory-mapped
        I/O
      • a sound type system for proving compiled programs are fault
        tolerant

47
A More Reliable Compiler Architecture
compiler front end --> ordinary program --> reliability transform
  --> TAL/FT + types (the reliability proof) --> optimization
  --> optimized TAL/FT + types (the modified proof)
  --> type checker (the proof checker)
48
TAL/FT Key Ideas (Fault Model)
  • Fault model
      • registers may incur arbitrary faults in between execution of any
        two instructions
      • memory (including code) is protected by ECC
      • fault model formalized as part of the hardware operational
        semantics

49
TAL/FT Key Ideas (Properties)
[Diagram: Processor, ECC-protected memory and mem-mapped I/O device,
connected by arrows labeled "store" and "read"]
  • Primary goal: if there is one fault, then either
      (1) the mem-mapped I/O device sees exactly the same sequence of
          stores as a fault-free execution, or
      (2) hardware detects and signals a fault, and the mem-mapped I/O
          device sees a prefix of the stores from a fault-free execution
  • Secondary goal: no false positives
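
The primary goal can be read as a predicate over store traces. The sketch below assumes a trace is a list of (address, value) pairs as seen by the I/O device; that encoding and the name `satisfies_primary_goal` are illustrative assumptions, not TAL/FT definitions.

```python
def satisfies_primary_goal(fault_free, observed, fault_signalled):
    """Check the observed store trace of a one-fault run against a fault-free run.

    Case (1): the device saw exactly the fault-free sequence of stores.
    Case (2): a fault was signalled and the device saw a prefix of that sequence.
    """
    if observed == fault_free:
        return True
    is_prefix = observed == fault_free[:len(observed)]
    return fault_signalled and is_prefix


reference = [(0x10, 1), (0x14, 2), (0x18, 3)]
print(satisfies_primary_goal(reference, reference, False))        # case (1)
print(satisfies_primary_goal(reference, reference[:2], True))     # case (2)
print(satisfies_primary_goal(reference, reference[:2], False))    # silent truncation
print(satisfies_primary_goal(reference, [(0x10, 1), (0x14, 9)], True))  # corrupted store
```

The last two runs violate the goal: one stops early without signalling, and the other lets a corrupted value reach the device.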

50
TAL/FT Key Ideas (Mechanisms)
  • Compiler strategy
      • create two redundant computations, as in lambda zap
          • two copies => fault detection
          • fault recovery handled by a higher-level process
  • Hardware support
      • special instructions + modified store buffer for implementing
        reliable stores
      • special instructions for reliable control-flow transfers
  • Type system
      • simple typing mechanisms based on original TAL [Morrisett,
        Walker, et al.]
      • redundant values with separate colors, as in lambda zap
      • value identities needed for equivalence checking, tracked using
        singleton types combined with some ideas drawn from traditional
        Hoare logics
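
One plausible way to picture "two copies => fault detection" for stores is a buffer that commits a store only when both redundant computations request the same address and value. This is a sketch under that assumption; the slides do not spell out the store-buffer protocol, and `CheckedStoreQueue`, `store_first`, `store_second` and `FaultDetected` are hypothetical names.

```python
class FaultDetected(Exception):
    """Raised when the two redundant computations disagree about a store."""


class CheckedStoreQueue:
    def __init__(self):
        self.pending = []   # stores requested by the first copy, not yet committed
        self.memory = {}    # stores that both copies agreed on

    def store_first(self, addr, value):
        """First copy announces a store; nothing reaches memory yet."""
        self.pending.append((addr, value))

    def store_second(self, addr, value):
        """Second copy must request the same store before it is committed."""
        if not self.pending or self.pending.pop(0) != (addr, value):
            raise FaultDetected(f"mismatch on store to {addr:#x}")
        self.memory[addr] = value


queue = CheckedStoreQueue()
queue.store_first(0x10, 4)
queue.store_second(0x10, 4)      # copies agree: the store is committed
print(queue.memory)

queue.store_first(0x14, 4)
try:
    queue.store_second(0x14, 14)  # one copy was corrupted
except FaultDetected as err:
    print("fault detected:", err)
```

With only two copies a disagreement can be detected but not resolved, which matches the bullet that leaves recovery to a higher-level process.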

51
Current & Future Work
  • Build the first compiler that can automatically generate reliability
    proofs for compiled programs
      • TAL/FT refinement and implementation
      • type- and reliability-preserving optimizations
  • Study alternative fault detection schemes
      • fault detection & recovery on current hardware
      • exploit multi-core alternatives
  • Understand the foundational theoretical principles that allow
    programs to tolerate transient faults
      • general-purpose program logics for reasoning about faulty
        programs

52
Other Research I Do
  • PADS [popl 06, sigmod 06 demo, popl 07]
      • automatic generation of tools (parsers, printers, validators,
        format translators, query engines, etc.) for ad hoc data formats
      • with Kathleen Fisher (AT&T)
  • Program Monitoring [popl 00, icfp 03, pldi 05, popl 06, ...]
      • semantics, design and implementation of programs that monitor
        other programs for security (or other purposes)
  • TAL & other type systems [popl 98, popl 99, toplas 99, jfp 02, ...]
      • theory, design and implementation of type systems for compiler
        target and intermediate languages

53
Conclusions
  • Semiconductor manufacturers are deeply worried about how to deal with
    soft faults in future architectures (10 years out)
  • Using proofs and types, I think we are going to be able to develop
    highly reliable software that runs on unreliable hardware

54
end!
55
The Caveat
56
Function O.S. follows
57
Lambda to Lambda Zap: Control-flow
Source program:
  let f = \x.e in f 2
Translation (majority vote on control-flow transfer):
  let [f1, f2, f3] = \x.e in [f1, f2, f3] [2, 2, 2]
58
Lambda to Lambda Zap: Control-flow
Source program:
  let f = \x.e in f 2
Translation (majority vote on control-flow transfer):
  let [f1, f2, f3] = \x.e in [f1, f2, f3] [2, 2, 2]
Operational semantics:
  ( M, let [f1, f2, f3] = \x.e1 in e2 )
    ---> ( M, l = \x.e1, e2[l/f1][l/f2][l/f3] )
59
TAL/FT Hardware
[Diagram: replicated program counters, store queue, ECC-protected
caches/memory]
60
Related Work Follows
62
Software Mitigation Techniques
  • Examples
      • N-version programming, EDDI, CFCSS [Oh et al. 2002], SWIFT
        [Reis et al. 2005], etc.
      • Hybrid hardware-software techniques: Watchdog Processors, CRAFT
        [Reis et al. 2005], etc.
  • Pros
      • immediate deployment: if your system is suffering soft
        error-related failures, you may deploy new software immediately
      • would have benefitted Los Alamos Labs, etc.
      • policies may be customized to the environment, application
      • reduced hardware cost
  • Cons
      • For the same universal policy, slower (but not as much as you'd
        think).
      • IT MIGHT NOT ACTUALLY WORK!

64
Mitigation Techniques
  • Hardware
      • error-correcting codes
      • redundant hardware
      • Pros
          • fast for a fixed policy
      • Cons
          • FT policy decided at hardware design time
          • mistakes cost millions
          • one-size-fits-all policy
          • expensive
  • Software and hybrid schemes
      • replicate computations
      • Pros
          • immediate deployment
          • policies customized to environment, application
          • reduced hardware cost
      • Cons
          • for the same universal policy, slower (but not as much as
            you'd think)
          • it may not actually work!
          • much research in the HW/compilers community is completely
            lacking proof

65
Solutions in Time
  • Solutions in hardware
      • replication of instructions and checking implemented in
        special-purpose hardware
      • e.g., [Reinhardt & Mukherjee 2000]
      • Pros: transparent to software
      • Cons: one-size-fits-all reliability policy
      • Cons: can't fix an existing problem; specialized hardware has a
        reduced market
  • Solutions in software (or hybrid hardware/software)
      • compiler generates replicated instructions and checking code
      • e.g., [Reis et al. 05]
      • Pros: flexibility; new reliability policies may be deployed
        whenever needed
      • Cons: for a fixed reliability policy, slower than specialized
        hardware solutions
      • Cons: it might not actually work