Title: FT 101, Jim Gray, Microsoft Research
http://research.microsoft.com/~gray/Talks/
80% of slides are not shown (are hidden), so view with PPT to see them all.

1FT 101: Outline
- Terminology and empirical measures
- General methods to mask faults.
- Software-fault tolerance
- Summary
2Dependability: The 3 ITIES
- Reliability / Integrity: does the right thing. (Also large MTTF.)
- Availability: does it now. (Also small MTTR.) Availability = MTTF / (MTTF + MTTR).
- System availability: if 90% of terminals are up and 99% of the DB is up, then at best 89% of transactions are serviced on time.
- Holistic vs. Reductionist view
[Figure: the four ities: Security, Integrity, Reliability, Availability]
3High Availability System Classes. Goal: Build Class 6 Systems
- Availability of 90%, 99%, 99.9%, 99.99%, 99.999%, 99.9999%, 99.99999% corresponds to class 1 through class 7.
- Unavailability = MTTR / MTBF: can cut it in half by cutting MTTR or MTBF.
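The class arithmetic above can be sketched in a few lines; this is a minimal illustration of the slide's definitions (function names are mine, not from the talk):

```python
# Availability "class" = number of leading 9s, i.e. unavailability ~ 10**-class.
# Unavailability is approximated as MTTR / (MTTF + MTTR), per the slide.
import math

def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def availability_class(mttf_hours, mttr_hours):
    """Class k means unavailability is on the order of 10**-k."""
    unavail = mttr_hours / (mttf_hours + mttr_hours)
    return int(math.floor(-math.log10(unavail)))

# The "hard phone line" row from slide 5: 4,000 hr MTTF, 10 hr MTTR.
a = availability(4000, 10)          # ~0.9975, i.e. class 2
```

Cutting MTTR by 10x (10 hr to 1 hr) moves the same hardware up one class, which is the slide's point that repair time matters as much as failure rate.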
4Demo: looking at some nodes
- Look at http://uptime.netcraft.com/
- Internet node availability: 92% mean, 97% median.
- Darrell Long (UCSC), ftp://ftp.cse.ucsc.edu/pub/tr/:
  - ucsc-crl-90-46.ps.Z "A Study of the Reliability of Internet Sites"
  - ucsc-crl-91-06.ps.Z "Estimating the Reliability of Hosts Using the Internet"
  - ucsc-crl-93-40.ps.Z "A Study of the Reliability of Hosts on the Internet"
  - ucsc-crl-95-16.ps.Z "A Longitudinal Survey of Internet Host Reliability"
5Sources of Failures
- MTTF / MTTR:
  - Power failure: 2,000 hr / 1 hr
  - Phone lines:
    - Soft: >0.1 hr / 0.1 hr
    - Hard: 4,000 hr / 10 hr
  - Hardware modules: 100,000 hr / 10 hr (many failures are transient)
- Software:
  - ~1 bug / 1,000 lines of code (after vendor-user testing)
  - => thousands of bugs in the system!
  - Most software failures are transient: dump and restart the system.
- Useful fact: 8,760 hrs/year, roughly 10k hr/year.
6Case Study: Japan. "Survey on Computer Security", Japan Info Dev Corp., March 1986. (Trans. Eiichi Watanabe.)
[Pie chart of outage causes: Vendor 42%, Telecomm lines 12%, Environment 11.2%, Application software 25%, Operations 9.3%]
- MTTF by cause:
  - Vendor (hardware and software): 5 months
  - Application software: 9 months
  - Communications lines: 1.5 years
  - Operations: 2 years
  - Environment: 2 years
  - Overall: 10 weeks
- 1,383 institutions reported (6/84 - 7/85).
- 7,517 outages, MTTF ~10 weeks, average duration ~90 MINUTES.
- To get a 10-year MTTF, must attack all these areas.
7Case Studies: Tandem Trends. Reported MTTF by component (years):
-              1985  1987  1990
- SOFTWARE        2    53    33
- HARDWARE       29    91   310
- MAINTENANCE    45   162   409
- OPERATIONS     99   171   136
- ENVIRONMENT   142   214   346
- SYSTEM          8    20    21
- Problem: systematic under-reporting.
8Many Software Faults are Soft
- After:
  - Design review
  - Code inspection
  - Alpha test
  - Beta test
  - 10k hrs of gamma test (production)
- Most remaining software faults are transient:
  - MVS functional recovery routines: 5:1
  - Tandem spooler: 100:1
  - Adams: >100:1
- Terminology:
  - Heisenbug: works on retry.
  - Bohrbug: faults again on retry.
- Adams, "Optimizing Preventive Service of Software Products", IBM J. R&D, 28(1), 1984.
- Gray, "Why Do Computers Stop", Tandem TR 85.7, 1985.
- Mourad, "The Reliability of the IBM/XA Operating System", 15th ISFTCS, 1985.
9Summary of FT Studies
- Current situation: ~4-year MTTF => fault tolerance works.
- Hardware is GREAT (maintenance and MTTF).
- Software masks most hardware faults.
- Many hidden software outages in operations:
  - New software.
  - Utilities.
- Must make all software ONLINE.
- Software seems to define a 30-year MTTF ceiling.
- Reasonable goal: 100-year MTTF. Class 4 today => class 6 tomorrow.
10Fault Tolerance vs Disaster Tolerance
- Fault tolerance: mask local faults.
  - RAID disks
  - Uninterruptible power supplies
  - Cluster failover
- Disaster tolerance: mask site failures.
  - Protects against fire, flood, sabotage, ...
  - Redundant system and service at a remote site.
  - Use design diversity.
11Outline
- Terminology and empirical measures
- General methods to mask faults.
- Software-fault tolerance
- Summary
12Fault Model
- Failures are independent, so single-fault tolerance is a big win.
- Hardware fails fast (blue-screen).
- Software fails fast (or goes to sleep).
- Software is often repaired by reboot: Heisenbugs.
- Operations tasks are a major source of outage:
  - Utility operations
  - Software upgrades
13Fault Tolerance Techniques
- Fail-fast modules: work or stop.
- Spare modules: instant repair time.
- Independent modules fail by design: MTTFpair ≈ MTTF² / MTTR (so want tiny MTTR).
- Message-based OS: fault isolation; software has no shared memory.
- Session-oriented communication: reliable messages detect lost/duplicate messages; coordinate messages with commit.
- Process pairs: mask hardware and software faults.
- Transactions: give A.C.I.D. (simple fault model).
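The pair-MTTF estimate above can be checked with the numbers from slide 5. A sketch, assuming the textbook form of the estimate (a pair fails only if the second module dies during the first one's repair window; Gray's slides quote the simpler MTTF²/MTTR, which agrees to within a factor of 2):

```python
# Pair MTTF for two independent fail-fast modules with repair:
# MTTFpair ~ MTTF**2 / (2 * MTTR).  Names are illustrative.

def pair_mttf(mttf, mttr):
    return mttf ** 2 / (2 * mttr)

# Hardware module from slide 5: 100,000 hr MTTF, 10 hr MTTR.
print(pair_mttf(100_000, 10))   # 5e8 hours, roughly 57,000 years
```

The point of the formula: because MTTR appears in the denominator, shrinking repair time is what turns 2x redundancy into orders-of-magnitude MTTF improvement.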
14Example: the FT Bank
- Modularity and repair are KEY.
- von Neumann needed 20,000x redundancy in wires and switches; we use 2x redundancy.
- Redundant hardware can support peak loads (so it is not really redundant).
15Fail-Fast is Good, Repair is Needed
- Lifecycle of a module: fail-fast gives short fault latency.
- High availability is low unavailability: Unavailability ≈ MTTR / MTTF.
- Improving either MTTR or MTTF gives benefit.
- Simple redundancy does not help much.
16Hardware Reliability/Availability (how to make it fail fast)
- Comparator strategies:
  - Duplex fail-fast: fail if either fails (e.g. duplexed CPUs)
  - vs. fail-soft: fail if both fail (e.g. disc, ATM, ...). Note: in recursive pairs, the parent knows which one is bad.
  - Triplex fail-fast: fail if 2 fail (triplexed CPUs)
  - Triplex fail-soft: fail if 3 fail (triplexed fail-fast CPUs)
17Redundant Designs have Worse MTTF!
- The Airplane Rule: a two-engine airplane has twice as many engine problems as a one-engine plane.
- THIS IS NOT GOOD: variance is lower but MTTF is worse.
- Simple redundancy does not improve MTTF (sometimes it hurts). This is just an example of the airplane rule.
18Add Repair: Get 10⁴ Improvement
19When To Repair?
- Chances of tolerating a fault are 1000:1 (class 3).
- A 1995 study: processor and disc rated at 10k hr MTTF.
  - Computed single failures vs. observed double failures:
  - 10k processor fails, 14 double: ~1000:1
  - 40k disc fails, 26 double: ~1000:1
- Hardware maintenance:
  - On-line maintenance "works" 999 times out of 1000.
  - The chance a duplexed disc will fail during maintenance: ~1/1000.
  - Risk is 30x higher during maintenance => do it off peak hours.
- Software maintenance:
  - Repair only virulent bugs.
  - Wait for the next release to fix benign bugs.
20OK So Far
- Hardware fail-fast is easy.
- Redundancy plus repair is great (class 7 availability).
- Hardware redundancy and repair come via modules.
- How can we get instant software repair?
- We know how to get reliable storage: RAID, or dumps and transaction logs.
- We know how to get available storage: fail-soft duplexed discs (RAID 1...N).
- ? How do we get reliable execution?
- ? How do we get available execution?
21Outline
- Terminology and empirical measures
- General methods to mask faults.
- Software-fault tolerance
- Summary
22Key Idea
- Architecture masks hardware faults.
- Software masks environmental faults.
- Distribution masks maintenance.
- Software automates / eliminates operators.
- So, in the limit, there are only software design faults. Software-fault tolerance is the key to dependability. INVENT IT!
23Software Techniques: Learning from Hardware
- Recall that most outages are not hardware. Most outages in fault-tolerant systems are SOFTWARE.
- Fault avoidance techniques: good and correct design.
- After that, software fault-tolerance techniques:
  - Modularity (isolation, fault containment)
  - Design diversity
  - N-version programming: N different implementations
  - Defensive programming: check parameters and data
  - Auditors: check data structures in the background
  - Transactions: clean up state after a failure
- Paradox: need fail-fast software.
24Fail-Fast and High-Availability Execution
- Software N-plexing: design diversity.
  - N-version programming: write the same program N times (N ≥ 3); compare the outputs of all programs and take a majority vote.
- Process pairs: instant restart (repair).
  - Use defensive programming to make a process fail-fast.
  - Have a restarted process ready in a separate environment.
  - The second process takes over if the primary faults.
  - The transaction mechanism can clean up distributed state if takeover happens in the middle of a computation.
25What Is the MTTF of an N-Version Program?
- The first version fails after MTTF/N, the second after MTTF/(N-1), ...
- So system MTTF ≈ MTTF × (1/N + 1/(N-1) + ... + 1/2).
- The harmonic series goes to infinity, but VERY slowly. For example, 100-version programming gives only ~4x the MTTF of 1-version programming.
- It does reduce variance.
- N-version programming needs REPAIR: if a program fails, it must reset its state from the other programs => the programs need a common data/state representation.
- How does this work for database systems? Operating systems? Network systems? Answer: I don't know.
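The harmonic-sum estimate above is easy to evaluate; a minimal sketch (function name is mine):

```python
# System MTTF of an N-version program, as a multiple of one version's
# MTTF: first failure at MTTF/N, next at MTTF/(N-1), ..., down to MTTF/2,
# so the multiplier is 1/N + 1/(N-1) + ... + 1/2.
from fractions import Fraction

def n_version_mttf_multiplier(n):
    return float(sum(Fraction(1, k) for k in range(2, n + 1)))

print(n_version_mttf_multiplier(100))   # ~4.19: 100 versions buy only ~4x MTTF
```

This makes the slide's point concrete: writing the program 100 times quadruples MTTF at 100x the cost, which is why N-version programming alone is a poor availability strategy without repair.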
26Why Process Pairs Mask Faults: Many Software Faults are Soft
- After design review, code inspection, alpha test, beta test, and 10k hrs of gamma test (production):
- Most remaining software faults are transient:
  - MVS functional recovery routines: 5:1
  - Tandem spooler: 100:1
  - Adams: >100:1
- Terminology:
  - Heisenbug: works on retry.
  - Bohrbug: faults again on retry.
- Adams, "Optimizing Preventive Service of Software Products", IBM J. R&D, 28(1), 1984.
- Gray, "Why Do Computers Stop", Tandem TR 85.7, 1985.
- Mourad, "The Reliability of the IBM/XA Operating System", 15th ISFTCS, 1985.
27Heisenbugs: A Probabilistic Approach to Availability
- There is considerable evidence that (1) production systems have about one bug per thousand lines of code; (2) these bugs manifest themselves stochastically: failures are due to a confluence of rare events; and (3) system mean-time-to-failure has a lower bound of a decade or so. To make highly available systems, architects must tolerate these failures by providing instant repair (unavailability is approximated by repair_time / time_to_fail, so cutting the repair time in half makes things twice as good). Ultimately, one builds a set of standby servers which have both design diversity and geographic diversity. This minimizes common-mode failures.
28Process Pair Repair Strategy
- If the software fault (bug) is a Bohrbug, then there is no repair:
  - wait for the next release, or
  - get an emergency bug fix, or
  - get a new vendor.
- If the software fault is a Heisenbug, then repair is:
  - reboot and retry, or
  - switch to the backup process (instant restart).
- PROCESS PAIRS tolerate hardware faults and Heisenbugs.
- Repair time is seconds; it could be milliseconds if time is critical.
- Flavors of process pair:
  - Lockstep
  - Automatic
  - State checkpointing
  - Delta checkpointing
  - Persistent
29How Takeover Masks Failures
- The server resets at takeover, but what about application state? Database state? Network state?
- Answer: use transactions to reset state!
  - Abort the transaction if a process fails.
  - Keeps the network "up".
  - Keeps the system "up".
  - Reprocesses some transactions on failure.
30Process Pairs: Summary
- Transactions give reliability.
- Process pairs give availability.
- Process pairs are expensive and hard to program.
- Transactions + persistent process pairs => fault-tolerant sessions and execution.
- When Tandem converted to this style:
  - Saved 3x messages.
  - Saved 5x message bytes.
  - Made programming easier.
31System Pairs for High Availability
[Figure: Primary and Backup sites]
- Programs, data, and processes are replicated at two sites.
- The pair looks like a single system: the system becomes a logical concept.
- Like process pairs: system pairs.
- The backup receives the transaction log (spooled if the backup is down).
- If the primary fails, or the operator switches, the backup offers service.
32System Pair Configuration Options
- Mutual backup: each site has 1/2 of the database and application.
- Hub: one site acts as backup for many others.
- In general, the configuration can be any directed graph.
- Stale replicas: lazy replication.
[Figure: a mutual-backup pair; a hub backing up several primaries; a primary with several lazy copies]
33System Pairs for Software Maintenance
[Figure, four steps:
- Step 1: Both systems are running V1 (Primary V1, Backup V1).
- Step 2: The backup is cold-loaded as V2.
- Step 3: SWITCH to the backup.
- Step 4: The old primary (now backup) is cold-loaded as V2.]
- Similar ideas apply to:
  - Database reorganization
  - Hardware modification (e.g. add discs, processors, ...)
  - Hardware maintenance
  - Environmental changes (rewire, new air conditioning)
  - Moving the primary or backup to a new location.
34System Pair Benefits
- Protects against ENVIRONMENT: weather, utilities, sabotage.
- Protects against OPERATOR FAILURE: two sites, two sets of operators.
- Protects against MAINTENANCE OUTAGES: work on the backup; software/hardware install/upgrade/move...
- Protects against HARDWARE FAILURES: the backup takes over.
- Protects against TRANSIENT SOFTWARE ERRORS.
- Allows design diversity: different sites can have different software/hardware.
35Key Idea
- Architecture masks hardware faults.
- Software masks environmental faults.
- Distribution masks maintenance.
- Software automates / eliminates operators.
- So, in the limit, there are only software design faults, and many of those are Heisenbugs. Software-fault tolerance is the key to dependability. INVENT IT!
36References
- Adams, E. (1984). "Optimizing Preventive Service of Software Products." IBM Journal of Research and Development, 28(1), 2-14.
- Anderson, T. and B. Randell (1979). Computing Systems Reliability.
- Garcia-Molina, H. and C. A. Polyzois (1990). "Issues in Disaster Recovery." 35th IEEE Compcon 90, 573-577.
- Gray, J. (1986). "Why Do Computers Stop and What Can We Do About It." 5th Symposium on Reliability in Distributed Software and Database Systems, 3-12.
- Gray, J. (1990). "A Census of Tandem System Availability between 1985 and 1990." IEEE Transactions on Reliability, 39(4), 409-418.
- Gray, J. N. and A. Reuter (1993). Transaction Processing: Concepts and Techniques. San Mateo, Morgan Kaufmann.
- Lampson, B. W. (1981). "Atomic Transactions." Distributed Systems: Architecture and Implementation, An Advanced Course. ACM, Springer-Verlag.
- Laprie, J. C. (1985). "Dependable Computing and Fault Tolerance: Concepts and Terminology." 15th FTCS, 2-11.
- Long, D. D., J. L. Carroll, and C. J. Park (1991). "A Study of the Reliability of Internet Sites." Proc. 10th Symposium on Reliable Distributed Systems, 177-186, Pisa, September 1991.
- Long, D., A. Muir, and R. Golding (1995). "A Longitudinal Survey of Internet Host Reliability." Proc. Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, IEEE, September 1995, 2-9.
37(No Transcript)
38Scaleable Replicated Databases
- Jim Gray (Microsoft)
- Pat Helland (Microsoft)
- Dennis Shasha (Columbia)
- Pat O'Neil (U. Mass)
39Outline
- Replication strategies
  - Lazy and Eager
  - Master and Group
- How centralized databases scale
  - deadlocks rise non-linearly with transaction size and concurrency
- Replication systems are unstable on scaleup
- A possible solution
40Scaleup, Replication, Partition
41Why Replicate Databases?
- Give users a local copy for:
  - Performance
  - Availability
  - Mobility (they are disconnected)
- But... what if they update it? Must propagate updates to the other copies.
42Propagation Strategies
- Eager: send the update right away (part of the same transaction): N times larger transactions.
- Lazy: send the update asynchronously (a separate transaction): N times more transactions.
- Either way: N times more updates per second per node; N² times more work overall.
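The N² claim above is simple counting; a minimal sketch (names are illustrative):

```python
# N nodes each generate TPS transactions of Actions updates, and every
# update must be applied at all N nodes (eagerly or lazily), so total
# update work grows as N**2.

def total_update_work(nodes, tps, actions):
    updates_generated = nodes * tps * actions   # N times more updates
    return updates_generated * nodes            # each applied at N nodes

# Doubling the node count quadruples the total work:
assert total_update_work(10, 100, 5) * 4 == total_update_work(20, 100, 5)
```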
43Update Control Strategies
- Master:
  - Each object has a master node.
  - All updates start with the master.
  - Broadcast to the subscribers.
- Group:
  - An object can be updated by anyone.
  - Updates are broadcast to all others.
- Everyone wants Lazy Group: update anywhere, anytime, anyway.
44Quiz Questions: Name One
- Eager
  - Master: N-plexed disks
  - Group: ?
- Lazy
  - Master: Bibles, bank accounts, SQL Server
  - Group: name servers, Oracle, Access...
- Note: Lazy contradicts Serializable.
  - If two lazy updates collide, then reconcile:
    - discard one transaction (or use some other rule), or
    - ask for human advice.
  - Meanwhile, nodes disagree => network DB state diverges: System Delusion.
45Anecdotal Evidence
- Update-anywhere systems are attractive:
  - Products offer the feature.
  - It demos well.
- But when it scales up:
  - Reconciliations start to cascade.
  - The database drifts out of sync (System Delusion).
- What's going on?
46Outline
- Replication strategies
- Lazy and Eager
- Master and Group
- How centralized databases scale
- deadlocks rise non-linearly
- Replication is unstable on scaleup
- A possible solution
47Simple Model of Waits
- DB_size records; TPS transactions per second.
- Each transaction:
  - picks Actions records uniformly from the set of DB_size records,
  - then commits.
- About Transactions × Actions / 2 resources are locked.
- Chance a request waits ≈ Transactions × Actions / (2 × DB_size).
- Action rate = TPS × Actions.
- Active transactions = TPS × Actions × Action_Time.
- Wait rate = Action rate × chance a request waits
  = TPS² × Actions³ × Action_Time / (2 × DB_size).
- 10x more transactions => 100x more waits.
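The wait-rate model above can be coded directly; this sketch mirrors the slide's symbols (variable names are otherwise mine):

```python
# Wait rate for a centralized database, per the slide's model:
# active transactions hold ~ Transactions * Actions / 2 locks, so a
# request waits with probability ~ Transactions * Actions / (2 * DB_size).

def wait_rate(tps, actions, action_time, db_size):
    transactions = tps * actions * action_time        # concurrently active
    p_wait = transactions * actions / (2 * db_size)   # per-request wait chance
    action_rate = tps * actions
    # = TPS**2 * Actions**3 * Action_Time / (2 * DB_size)
    return action_rate * p_wait

# 10x more TPS (concurrency) => 100x more waits:
r1 = wait_rate(10, 5, 0.1, 1_000_000)
r2 = wait_rate(100, 5, 0.1, 1_000_000)
```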
48Simple Model of Deadlocks
- A deadlock is a wait cycle.
- Cycle of length 2:
  - Deadlock rate = Wait rate × chance the waitee waits for the waiter
  - = Wait rate × (P(transaction waits) / Transactions), where P(transaction waits) ≈ Transactions × Actions² / (2 × DB_size)
  - = TPS² × Actions⁵ × Action_Time / (4 × DB_size²).
- Cycles of length 3 go as P(wait)³, so they are ignored.
- 10x bigger transactions => 100,000x more deadlocks.
49Summary So Far
- Even centralized systems are unstable:
  - Wait rate grows with the square of concurrency and the 3rd power of transaction size.
  - Deadlock rate grows with the square of concurrency and the 5th power of transaction size.
[Figure: wait and deadlock rates rising with transaction size and concurrency]
50Outline
- Replication strategies
- How centralized databases scale
- Replication is unstable on scaleup
  - Eager (master and group)
  - Lazy (master, group, disconnected)
- A possible solution
51Eager Transactions are FAT
- With N nodes, an eager transaction is N times bigger and takes N times longer.
- 10x nodes => 1,000x deadlocks (derivation in the paper).
- Master is slightly better than group.
- Good news: eager transactions only deadlock; no need for reconciliation.
52Lazy Master and Group
[Figure: transactions at several nodes, each carrying Write A, Write B, Write C, Commit, tagged with a new timestamp]
- Use optimistic concurrency control:
  - Keep a transaction timestamp with each record.
  - Updates carry the old and new timestamps.
  - If the record has the old timestamp: set the value to the new value and the timestamp to the new timestamp.
  - If the record does not match the old timestamp: reject the lazy transaction.
- Not SNAPSHOT isolation (stale reads).
- Reconciliation: some nodes are updated while some nodes are being reconciled.
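The timestamp rule above fits in a few lines; a sketch with illustrative names, not a real replication API:

```python
# Lazy-master update with optimistic timestamps: an update carries
# (old_ts, new_ts); a replica applies it only if its record still has
# old_ts, otherwise the lazy transaction is rejected for reconciliation.

def apply_lazy_update(record, old_ts, new_ts, new_value):
    if record["ts"] == old_ts:
        record["value"] = new_value
        record["ts"] = new_ts
        return True        # applied
    return False           # timestamp mismatch: reject lazy transaction

rec = {"value": 150, "ts": 7}
assert apply_lazy_update(rec, 7, 8, 100)        # matches: applied
assert not apply_lazy_update(rec, 7, 9, 120)    # stale update: rejected
```

The second call fails because the first one already advanced the timestamp, which is exactly the collision case that forces reconciliation.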
53Reconciliation
- Reconciliation means System Delusion: data inconsistent with itself and with reality.
- How frequent is it?
- Lazy transactions are not fat, but there are N times as many.
- Eager waits become lazy reconciliations.
- Rate ≈ TPS² × (Actions × Nodes)³ × Action_Time / (2 × DB_size), assuming everyone is connected.
54Eager and Lazy: Disconnected
- Suppose mobile nodes are disconnected for a day. On reconnect:
  - get all incoming updates,
  - send all delayed updates.
- Incoming is Nodes × TPS × Actions × Disconnect_Time updates.
- Outgoing is TPS × Actions × Disconnect_Time updates.
- Conflicts are the intersection of these two sets:
  - ≈ Disconnect_Time × (TPS × Actions × Nodes)² / DB_size.
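The conflict estimate above can be evaluated directly; a sketch with illustrative names:

```python
# Expected conflicts when a mobile node reconnects, per the slide:
# Disconnect_Time * (TPS * Actions * Nodes)**2 / DB_size.

def reconnect_conflicts(tps, actions, nodes, disconnect_time, db_size):
    return disconnect_time * (tps * actions * nodes) ** 2 / db_size

# Doubling the number of nodes quadruples the expected conflicts:
c1 = reconnect_conflicts(10, 5, 2, 86_400, 1e9)
c2 = reconnect_conflicts(10, 5, 4, 86_400, 1e9)
```

The squared node count is the instability: disconnected operation gets quadratically worse as the system grows, which motivates the two-tier design that follows.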
55Outline
- Replication strategies (lazy vs. eager, master vs. group)
- How centralized databases scale
- Replication is unstable on scaleup
- A possible solution: two-tier architecture with Mobile and Base nodes
  - Base nodes master objects.
  - Tentative transactions at mobile nodes.
  - Transactions must be commutative.
  - Re-apply transactions on reconnect.
  - Transactions may be rejected.
56Safe Approach
- Each object is mastered at a node.
- Update transactions only read and write master items.
- Lazy replication to other nodes.
- Allow reads of stale data (on user request).
- PROBLEMS:
  - doesn't support mobile users;
  - deadlocks explode with scaleup.
- ?? How do banks work ??
57Two-Tier Replication
- Two kinds of nodes:
  - Base nodes: always connected, always up.
  - Mobile nodes: occasionally connected.
- Data is mastered at base nodes.
- Mobile nodes have stale copies and make tentative updates.
58Mobile Node Makes Tentative Updates
- Updates the local database while disconnected.
- Saves the transactions.
- When the mobile node reconnects, tentative transactions are re-done as Eager-Master (at the original time??).
- Some may be rejected (this replaces reconciliation).
- No System Delusion.
59Tentative Transactions
- Must be commutative with the others: "Debit $50" rather than "Change $150 to $100".
- Must have acceptance criteria, e.g.:
  - The account balance is positive.
  - The ship date is no later than quoted.
  - The price is no greater than quoted.
[Figure: the mobile node sends tentative transactions to the base node, which returns updates and rejects; transactions from others also arrive at the local DB]
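The commutativity-plus-acceptance idea above can be sketched with the slide's own bank example (class and method names are mine, purely illustrative):

```python
# A commutative tentative transaction with an acceptance criterion:
# "debit $50" commutes with other debits/credits, unlike "set balance
# to $100"; on reconnect the base node re-runs the debit and rejects
# it if the acceptance criterion (non-negative balance) fails.

class Account:
    def __init__(self, balance):
        self.balance = balance

    def apply_debit(self, amount):
        """Re-run a tentative debit at the base node."""
        if self.balance - amount < 0:
            return False          # acceptance criterion fails: reject
        self.balance -= amount
        return True               # accepted

acct = Account(120)
assert acct.apply_debit(50)       # tentative debit accepted
assert not acct.apply_debit(100)  # would go negative: rejected
assert acct.balance == 70
```

Because debits commute, the order in which tentative transactions from different mobile nodes are replayed does not matter; only the acceptance check can reject one, and the rejection is reported immediately rather than silently diverging.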
60Refinement: Mobile Node Can Master Some Data
- A mobile node can master private data: only the mobile node updates it; others only read it.
- Examples:
  - Orders generated by a salesman
  - Mail generated by a user
  - Documents generated by a Notes user.
61Virtue of the 2-Tier Approach
- Allows mobile operation.
- No system delusion.
- Rejects are detected at reconnect (you know right away).
- If commutativity works, there are no reconciliations, even though work rises as (Mobile + Base)².
62Outline
- Replication strategies (lazy vs. eager, master vs. group)
- How centralized databases scale
- Replication is unstable on scaleup
- A possible solution (two-tier architecture)
  - Tentative transactions at mobile nodes
  - Re-apply transactions on reconnect
  - Transactions may be rejected and reconciled
  - Avoids system delusion