Title: Exploring Failure Transparency and the Limits of Generic Recovery
1. Exploring Failure Transparency and the Limits of Generic Recovery
- Dave Lowell, Compaq Western Research Lab
- Subhachandra Chandra and Peter M. Chen, University of Michigan
2. Introduction
- Failure transparency: abstraction of failure-free operation
- OS recovers app after hardware, OS, and application failures
- No programmer help
- No slowdown
- Will explore theory, performance, and limitations
3. Consistent recovery
- Visible output equivalent to failure-free run
- equivalence allows duplicates
- avoids exactly-once problem
- Failure transparency: consistent recovery with generic techniques
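The equivalence described above can be illustrated with a minimal sketch. It assumes every visible event carries a unique identifier (an assumption made here so that duplicates can be recognized; the slides do not say how events are named), and the function name is hypothetical.

```python
def equivalent_allowing_duplicates(observed, failure_free):
    """True if observed matches failure_free once repeated events are dropped.

    Assumption: each visible event has a unique identifier, so an identifier
    seen a second time can only be a re-emission after recovery.
    """
    seen = set()
    first_occurrences = []
    for event in observed:
        if event not in seen:
            seen.add(event)
            first_occurrences.append(event)
    return first_occurrences == list(failure_free)


failure_free = ["out1", "out2", "out3"]
# A crash after out2, then rollback and re-execution, emits out2 a second time.
recovered = ["out1", "out2", "out2", "out3"]
# A recovery that skips out2 is not equivalent.
lossy = ["out1", "out3"]
```

Because a repeated output is accepted, recovery never has to decide whether an output was emitted exactly once before the crash.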
4. Guaranteeing consistent recovery
- Key players: non-deterministic events, visible events, commit events
- Save-work invariant (simplified):
- There's a commit after each non-deterministic event that happens-before a visible event.
- Full theorem handles liveness, distinguishes causality and ordering
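The simplified Save-work invariant can be checked mechanically on a recorded trace. The sketch below assumes a single process, so happens-before reduces to trace order, and it assumes the commit has to land before the visible event. The event labels and function name are hypothetical, and the full theorem (liveness, causality versus ordering) is not modeled.

```python
ND, VISIBLE, COMMIT, OTHER = "nd", "visible", "commit", "other"


def save_work_violations(trace):
    """Return (nd_index, visible_index) pairs that break the simplified invariant.

    Assumption: one process, so happens-before is just trace order, and a
    commit covers every non-deterministic event that precedes it.
    """
    violations = []
    uncommitted_nd = []  # indices of ND events with no commit after them yet
    for i, kind in enumerate(trace):
        if kind == ND:
            uncommitted_nd.append(i)
        elif kind == COMMIT:
            uncommitted_nd.clear()
        elif kind == VISIBLE:
            violations.extend((nd, i) for nd in uncommitted_nd)
    return violations


# The ND event is committed before the visible event: invariant holds.
good = [ND, COMMIT, VISIBLE]
# The commit arrives only after the visible event: invariant is violated.
bad = [ND, VISIBLE, COMMIT]
```

Calling `save_work_violations(bad)` reports the pair of indices for the uncommitted ND event and the visible event that depends on it.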
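5. [Figure: Save-work protocol space. Axis: effort to identify/convert ND events.]
6. [Figure: Save-work protocol space. Axes: effort to commit only visible events; effort to identify/convert ND events.]
7. [Figure: protocols placed in the Save-work protocol space. Axes: effort to commit only visible events; effort to identify/convert ND events.]
Protocols shown: CPV-2PC, CBNDV-2PC, CPVS, CBNDVS, CBNDVS-LOG, CAND, CAND-LOG
8. [Figure: Save-work protocol space. Axes: effort to commit only visible events; effort to identify/convert ND events.]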
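To make the protocol names concrete, here is a minimal sketch of where each single-process protocol would place commits on a trace. It reads the names as CAND (commit after non-deterministic), CPVS (commit prior to visible or send) and CBNDVS (commit between non-deterministic and visible or send), with -LOG meaning that loggable ND events are recorded and replayed instead of committed. These readings, the trace encoding, and the function name are assumptions of this sketch; the two-phase-commit variants are not modeled.

```python
ND, VISIBLE, SEND, OTHER = "nd", "visible", "send", "other"


def commit_points(trace, protocol):
    """Return indices i such that a commit is placed just before trace[i].

    trace is a list of (kind, logged) pairs; logged marks an ND event that a
    -LOG variant can record and replay, which this sketch treats as making
    the event deterministic.
    """
    base, _, suffix = protocol.partition("-")
    use_log = suffix == "LOG"
    points = []
    pending_nd = False  # an unlogged ND event has occurred since the last commit
    for i, (kind, logged) in enumerate(trace):
        is_nd = kind == ND and not (use_log and logged)
        if base == "CAND":
            if is_nd:
                points.append(i + 1)  # commit immediately after the ND event
        elif base == "CPVS":
            if kind in (VISIBLE, SEND):
                points.append(i)  # commit before every visible or send event
        elif base == "CBNDVS":
            if kind in (VISIBLE, SEND) and pending_nd:
                points.append(i)  # commit only if an ND event is uncommitted
                pending_nd = False
            if is_nd:
                pending_nd = True
        else:
            raise ValueError(f"unsupported protocol: {protocol}")
    return points


trace = [(ND, True), (OTHER, False), (VISIBLE, False),
         (VISIBLE, False), (ND, False), (SEND, False)]
```

On this trace CPVS commits three times, CBNDVS twice, and the -LOG variants once each, which suggests how extra effort spent identifying or converting ND events buys fewer commits.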
9. Performance study
- Discount Checking: fast checkpoints to reliable memory (Rio)
- Logging and two-phase commit
- Disk version
- Mostly interactive applications
- Localized and distributed
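10. Nvi Text Editor
[Figure: protocol space annotated with measurements for each protocol. Axes: effort to commit only visible events; effort to identify/convert ND events. Values are shown as extracted.]
CPVS 144
CBNDVS 142
CBNDVS-LOG 012
CAND 143
CAND-LOG 013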
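11. TreadMarks Barnes-Hut
[Figure: protocol space annotated with measurements for each protocol. Axes: effort to commit only visible events; effort to identify/convert ND events. Values are shown as extracted.]
CPV-2PC 12319
CBNDV-2PC 12 252
CPVS 1297346
CBNDVS 1015743
CBNDVS-LOG 734973
CAND 19911499
CAND-LOG 1267700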
12. Have only considered stop failures
- Committing everything is okay
- Save-work: when we must commit
- Some failures affect application state
- Can we commit too much?
13. Dangerous Paths
14. Dangerous Paths
15. Lose-work invariant
- To recover from propagation failure, never commit on a dangerous path.
- Save-work and Lose-work conflict!
- Visible event on dangerous path
- Can't guarantee consistent recovery from propagation failures
- Do we see this conflict in practice?
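The conflict can be seen on a small trace. The sketch below simplifies a dangerous path to "every event between fault activation and the crash" in a single linear trace, which is an assumption of this example, not the slides' definition. All names are hypothetical.

```python
ND, VISIBLE, COMMIT, FAULT, CRASH = "nd", "visible", "commit", "fault", "crash"


def dangerous_path(trace):
    """Indices strictly between fault activation and the crash.

    Assumption: a single linear trace in which everything after the fault
    activates is tainted, a simplification of a dangerous path.
    """
    if FAULT not in trace or CRASH not in trace:
        return range(0)
    return range(trace.index(FAULT) + 1, trace.index(CRASH))


def lose_work_violations(trace):
    """Indices of commits that fall on the dangerous path."""
    return [i for i in dangerous_path(trace) if trace[i] == COMMIT]


def unavoidable_conflicts(trace):
    """(nd_index, visible_index) pairs that lie entirely on the dangerous path.

    Save-work demands a commit between the two events, but every slot
    between them is on the dangerous path, where Lose-work forbids commits.
    """
    conflicts = []
    nd_on_path = []
    for i in dangerous_path(trace):
        if trace[i] == ND:
            nd_on_path.append(i)
        elif trace[i] == VISIBLE:
            conflicts.extend((nd, i) for nd in nd_on_path)
    return conflicts


# No commit placement satisfies both invariants for the ND/visible pair here.
conflicted = [ND, COMMIT, FAULT, ND, VISIBLE, CRASH]
# Here the only problem is a commit that was taken after the fault activated.
late_commit = [ND, FAULT, COMMIT, VISIBLE, CRASH]
```

In `conflicted`, upholding Save-work means committing at a point that Lose-work rules out, so one of the two invariants has to give.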
16. Measuring Lose-work violations
- Fault-injection study: OS crashes
- injected faults into running kernel
- induced 350 OS crashes
- recovered nvi and postgres using Discount Checking
- Results:
- nvi: 15% of crashes violate Lose-work
- postgres: 3% of crashes violate Lose-work
17. Application crashes
- Fault-injection study: ND bugs
- nvi: 37% violate Lose-work
- postgres: 33% violate Lose-work
- Published bug distributions: 85-95% of application bugs are deterministic
- intrinsically violate Lose-work
- Perhaps > 90% of app crashes violate Lose-work!
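The closing estimate follows from combining the two sets of figures. This sketch reads the slide's numbers as percentages and assumes that every deterministic bug violates Lose-work while ND bugs violate it at the measured rates. The function name is hypothetical.

```python
def violating_fraction(deterministic, nd_violation_rate):
    """Fraction of application crashes that violate Lose-work.

    Assumption: every deterministic bug violates Lose-work, and ND bugs
    violate it at the rate measured in the fault-injection study.
    """
    return deterministic + (1.0 - deterministic) * nd_violation_rate


# Fewest deterministic bugs combined with the lower (postgres) ND rate.
low = violating_fraction(0.85, 0.33)
# Most deterministic bugs combined with the higher (nvi) ND rate.
high = violating_fraction(0.95, 0.37)
print(f"low={low:.4f} high={high:.4f}")
```

The two corners work out to roughly 0.90 and 0.97, which is consistent with the slide's "perhaps > 90%".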
18. Conclusions
- Save-work and Lose-work invariants
- Save-work protocol space
- Invariants fundamentally conflict
- Failure transparency performance
- 0-12% overhead on reliable memory
- 13-40% overhead on disk (interactive apps)
- > 90% of application failures violate Lose-work