Embracing Failure: A Case for Recovery-Oriented Computing

About This Presentation

Slides: 11

Provided by: JohnC342

Learn more at: http://people.ee.duke.edu

Transcript and Presenter's Notes

1
Embracing Failure: A Case for Recovery-Oriented Computing

2
Motivation

A survey in 2000 (one year prior to the writing of
this paper) found:
65% of surveyed web sites had customer-visible
downtime at least once every 6 months; 25% had
downtime 3 times
Is this five-nines availability?
259,200 minutes in 180 days
Five-nines: no more than about 2.6 minutes of downtime
(barely customer-visible)
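
The arithmetic behind the five-nines comparison can be checked with a short sketch. The function name `downtime_budget_minutes` is an illustrative assumption, not something from the slides; it simply multiplies the length of the period by the allowed unavailability.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: int) -> float:
    """Maximum downtime, in minutes, allowed over `days` at a given availability."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (1.0 - availability)


# 180 days is roughly the 6-month window used in the survey.
total = 180 * MINUTES_PER_DAY
budget = downtime_budget_minutes(0.99999, 180)

print(f"minutes in 180 days: {total}")
print(f"five-nines downtime budget: {budget:.3f} minutes")
```

Under these assumptions the budget works out to roughly 2.6 minutes per 180 days, which a single customer-visible outage would very likely exceed.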

3
In modern computer systems

Availability is more important than ever
Businesses can lose millions of dollars during a
one hour web site outage
Availability is harder than ever to guarantee
Modern systems are distributed, heterogeneous,
and complex, involving numerous interacting
applications: web server, internal database, etc.
Availability limited by weakest link in system
In such an environment, failures are inevitable
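
A minimal sketch of the weakest-link point, assuming every component is required to serve a request and that components fail independently (neither assumption is stated on the slide). The component names echo the slide; the availability figures are hypothetical.

```python
from math import prod


def series_availability(component_availabilities):
    """Availability of a system that needs every component up at once.

    Assumes independent failures, so the availabilities multiply.
    """
    return prod(component_availabilities)


# Hypothetical per-component availabilities, for illustration only.
components = {
    "web server": 0.999,
    "application": 0.999,
    "internal database": 0.99,
}

overall = series_availability(components.values())
weakest = min(components.values())

print(f"weakest component: {weakest}")
print(f"overall system:    {overall:.6f}")
```

Under these assumptions the overall figure can never exceed the weakest component, and each additional required component pulls it lower.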

4
Traditional Solutions

5
Hardware/Software Failures

Fault-tolerant hardware may exist, but that does
not mean it is used
Commodity hardware is cheap and ubiquitous
And error-prone: IDE disks, non-ECC memory, etc.
Even low per-node failure rates are substantial
in larger clusters (e.g., Google cluster)
It may be possible to develop fault-tolerant
software; however:
Software is being developed, updated, and
deployed faster than ever in the Internet age
In Internet time, people get sloppy
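
The earlier point that low per-node failure rates become substantial in larger clusters can be illustrated with a short calculation. This sketch assumes independent node failures; the per-node daily failure probability and cluster size are hypothetical values, not figures from the slides.

```python
def prob_any_failure(per_node_failure_prob: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - per_node_failure_prob) ** nodes


def expected_failures(per_node_failure_prob: float, nodes: int) -> float:
    """Expected number of failed nodes over the same period."""
    return per_node_failure_prob * nodes


# Hypothetical: each node has a 0.1% chance of failing on a given day.
p_node = 0.001
cluster_size = 1000

p_any = prob_any_failure(p_node, cluster_size)
expected = expected_failures(p_node, cluster_size)

print(f"P(at least one node fails today): {p_any:.3f}")
print(f"expected failures per day:        {expected:.1f}")
```

With these illustrative numbers, a failure that is rare for any single machine becomes more likely than not somewhere in the cluster each day.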

6
Human Failures

Arise primarily during maintenance and repair
Consider trying to diagnose and fix a subtle bug
in even a few thousand lines of code
Also arise during other activities
configuration, upgrading, performance tuning
Human error rates are nowhere near zero
Even highly-trained, intelligent people make
mistakes, especially under pressure
Therefore, maintenance and repair are not
error-free

7
Unanticipated Failures

8
Recovery-Oriented Computing

As we cannot design a system with 100%
availability, modern systems must accept failure
as inevitable
Focus more on recovery and repair in addition to
avoidance
Provides an essential failure safety net that
complements failure avoidance methods
Focus on improving MTTR as well as MTTF
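
Why MTTR matters as much as MTTF can be seen from the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below uses that textbook formula with hypothetical hour values; it is an illustration, not a measurement from the presentation.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: fails every 1000 hours, takes 1 hour to repair.
baseline = availability(1000.0, 1.0)

# Option A: make failures 10x rarer (improve MTTF).
better_mttf = availability(10000.0, 1.0)

# Option B: make repair 10x faster (improve MTTR).
better_mttr = availability(1000.0, 0.1)

print(f"baseline:        {baseline:.6f}")
print(f"10x better MTTF: {better_mttf:.6f}")
print(f"10x better MTTR: {better_mttr:.6f}")
```

Under this formula the two options give the same availability, so faster recovery is as valuable as rarer failure.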

9
Recovery-Oriented Computing

10
Questions

What if there are errors in the recovery-oriented
framework?
How are these failures handled?
Alternately, can the framework be guaranteed not
to fail?
Probably could not be a 100% guarantee
If this is instead a five-nines style of
guarantee, aren't we back where we started?
ROC may not be the catch-all safety net that we
desired