Exploring Failure Transparency and the Limits of Generic Recovery

1
Exploring Failure Transparency and the Limits of
Generic Recovery
  • Dave Lowell, Compaq Western Research Lab
  • Subhachandra Chandra and Peter M. Chen,
    University of Michigan

2
Introduction
  • Failure transparency: abstraction of failure-free
    operation
  • OS recovers app after hardware, OS, and
    application failures
  • No programmer help
  • No slowdown
  • Will explore theory, performance, and limitations

3
Consistent recovery
  • Visible output equivalent to failure-free run
  • equivalence allows duplicates
  • avoids exactly-once problem
  • Failure transparency: consistent recovery with
    generic techniques
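
The equivalence rule on this slide can be sketched as a small check. This is only an illustration under stated assumptions: visible events are modeled as plain comparable values, a "duplicate" is taken to mean a repeat of output already emitted, and the function name `equivalent_allowing_duplicates` is hypothetical rather than anything from the talk.

```python
def equivalent_allowing_duplicates(expected, recovered):
    """Return True if `recovered` matches `expected` once repeats of
    already-emitted visible events are ignored.

    Assumption: events are plain comparable values, and output that is
    missing at the end counts as non-equivalent.
    """
    matched = 0  # number of expected events matched so far
    for event in recovered:
        if matched < len(expected) and event == expected[matched]:
            matched += 1          # the next new visible event
        elif event in expected[:matched]:
            continue              # repeat of earlier output: tolerated
        else:
            return False          # output the failure-free run never emits here
    return matched == len(expected)


failure_free = ["open", "draw", "save"]
after_recovery = ["open", "draw", "open", "draw", "save"]  # replayed after a crash
print(equivalent_allowing_duplicates(failure_free, after_recovery))
```

Tolerating repeats is what lets recovery re-execute from a checkpoint without having to solve the exactly-once problem for output.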

4
Guaranteeing consistent recovery
  • Key players: non-deterministic events, visible
    events, commit events
  • Save-work invariant (simplified):
  • There's a commit after each non-deterministic
    event that happens-before a visible event.
  • Full theorem handles liveness, distinguishes
    causality and ordering
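
A minimal sketch of the simplified invariant for a single process follows. The assumptions are mine: happens-before is reduced to program order, the commit must come before the visible event, and events are encoded as the strings "nd", "commit", "visible", and "det". The full theorem mentioned above (liveness, causality versus ordering) is not modeled.

```python
def save_work_violations(trace):
    """Return the indices of visible events that are preceded by a
    non-deterministic event with no commit in between.

    Assumption: one process, so happens-before is just trace order.
    """
    uncommitted_nd = False
    violations = []
    for index, kind in enumerate(trace):
        if kind == "nd":
            uncommitted_nd = True
        elif kind == "commit":
            uncommitted_nd = False
        elif kind == "visible" and uncommitted_nd:
            violations.append(index)
    return violations


print(save_work_violations(["nd", "commit", "visible"]))  # commit covers the ND event
print(save_work_violations(["nd", "det", "visible"]))     # ND event escapes uncommitted
```

An ND event that no visible event ever follows needs no commit, which is where the room for cheaper protocols comes from.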

5
[Figure: axis "Effort to identify/convert ND events"]
6
[Figure: axes "Effort to commit only visible events" and "Effort to identify/convert ND events"]
7
[Figure: protocols placed on the same two axes: CPV-2PC, CBNDV-2PC, CPVS, CBNDVS, CBNDVS-LOG, CAND, CAND-LOG]
8
[Figure: axes "Effort to commit only visible events" and "Effort to identify/convert ND events"]
9
Performance study
  • Discount Checking: fast checkpoints to reliable
    memory (Rio)
  • Logging and two-phase commit
  • Disk version
  • Mostly interactive applications
  • Localized and distributed

10
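Nvi Text Editor
[Figure: results per protocol on the axes "Effort to commit only visible events" and "Effort to identify/convert ND events". Data labels as extracted, with the digits run together: CPVS144, CBNDVS142, CBNDVS-LOG012, CAND143, CAND-LOG013]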
11
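TreadMarks Barnes-Hut
[Figure: results per protocol on the axes "Effort to commit only visible events" and "Effort to identify/convert ND events". Data labels as extracted, with the digits run together: CPV-2PC12319, CBNDV-2PC12 252, CPVS1297346, CBNDVS1015743, CBNDVS-LOG734973, CAND19911499, CAND-LOG1267700]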
12
Have only considered stop failures
  • Committing everything is okay
  • Save-work: when we must commit
  • Some failures affect application state
  • Can we commit too much?

13
Dangerous Paths
14
Dangerous Paths
15
Lose-work invariant
  • To recover from propagation failure, never commit
    on a dangerous path.
  • Save-work and Lose-work conflict!
  • Visible event on dangerous path
  • Can't guarantee consistent recovery from
    propagation failures
  • Do we see this conflict in practice?
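
The conflict can be illustrated with a toy trace. The sketch assumes the point where execution enters a dangerous path is known and is marked by a hypothetical "enter_dangerous" event (the slides define dangerous paths in figures that did not survive in this transcript). It inserts the commits Save-work requires, then checks whether any of them land on the dangerous path.

```python
def add_save_work_commits(trace):
    """Insert a commit before each visible event that follows an
    uncommitted non-deterministic event (single-process simplification)."""
    result = []
    uncommitted_nd = False
    for kind in trace:
        if kind == "visible" and uncommitted_nd:
            result.append("commit")
            uncommitted_nd = False
        if kind == "nd":
            uncommitted_nd = True
        elif kind == "commit":
            uncommitted_nd = False
        result.append(kind)
    return result


def lose_work_violations(trace):
    """Return indices of commits that occur after the trace has entered
    a dangerous path (marked by the assumed "enter_dangerous" event)."""
    if "enter_dangerous" not in trace:
        return []
    start = trace.index("enter_dangerous")
    return [i for i, kind in enumerate(trace) if i > start and kind == "commit"]


# A visible event on the dangerous path forces a commit there.
doomed = add_save_work_commits(["enter_dangerous", "nd", "visible", "crash"])
print(doomed, lose_work_violations(doomed))

# If the visible event precedes the dangerous path, both invariants hold.
fine = add_save_work_commits(["nd", "visible", "enter_dangerous", "crash"])
print(fine, lose_work_violations(fine))
```

In the first trace, upholding Save-work commits state that already leads to the crash, so recovery restarts on the same dangerous path.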

16
Measuring Lose-work violations
  • Fault-injection study: OS crashes
  • injected faults into running kernel
  • induced 350 OS crashes
  • recovered nvi and postgres using Discount
    Checking
  • Results:
  • nvi: 15% of crashes violate Lose-work
  • postgres: 3% of crashes violate Lose-work

17
Application crashes
  • Fault-injection study: ND bugs
  • nvi: 37% violate Lose-work
  • postgres: 33% violate Lose-work
  • Published bug distributions: 85-95% of
    application bugs are deterministic
  • intrinsically violate Lose-work
  • Perhaps > 90% of app crashes violate Lose-work!
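
One plausible way to arrive at the final estimate is sketched below. It reads the slide's figures as percentages and assumes that every deterministic bug violates Lose-work while non-deterministic bugs violate it at the measured rate. The combining formula and the function name `violating_fraction` are my assumptions, not something the slides state.

```python
def violating_fraction(deterministic_share, nd_violation_rate):
    """Fraction of application crashes that violate Lose-work, assuming
    all deterministic bugs violate it and ND bugs violate it at
    `nd_violation_rate`."""
    nd_share = 1.0 - deterministic_share
    return deterministic_share + nd_share * nd_violation_rate


for app, nd_rate in [("nvi", 0.37), ("postgres", 0.33)]:
    low = violating_fraction(0.85, nd_rate)    # 85% of bugs deterministic
    high = violating_fraction(0.95, nd_rate)   # 95% of bugs deterministic
    print(f"{app}: {low:.1%} to {high:.1%}")
```

Under these assumptions the estimate lands at roughly 90% or above for both applications, in line with the slide's closing bullet.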

18
Conclusions
  • Save-work and Lose-work invariants
  • Save-work protocol space
  • Invariants fundamentally conflict
  • Failure transparency performance:
  • 0-12% overhead on reliable memory
  • 13-40% overhead on disk (interactive apps)
  • > 90% of application failures violate Lose-work
