Recovery

Slides: 27

Provided by: camarsK

Transcript and Presenter's Notes

1
Recovery

2
Recovery

Lightweight Recoverable Virtual Memory
Rio Vista

3
Introduction

failure
when a system does not perform in the manner
defined
erroneous state
state that could lead the system to the failure
fault
anomalous physical condition
causes
design/manufacturing error
damage/fatigue
external disturbance
faults lead the system to an erroneous state,
which may or may not result in a failure

4
Failures

process failure
deadlock, timeout, protection violation, ...
OS should confine this failure to the process
system failure
software and hardware
amnesia failure: cannot recover the state just
before the failure
pause failure: the state can be reinstated
halting failure: the system never restarts
disk failure
serious problem when it is the last backup
storage
usually backed up by tape OR
mirrored (it will enhance read throughput anyway)
communication medium failure
does not cause total system failure

5
Error Recovery

Forward Error Recovery
allow the process to proceed after fixing errors
difficult to remove all the errors (in software,
procedures to cope with all kinds of error should
be prepared, which is almost impossible)
Backward Error Recovery
the process should restart from the saved (or
predefined) state
roll-back mechanism is needed
easy to cope with any kind of errors (it is not
necessary to anticipate all kinds of errors)
overhead to restore previous state
checkpointing is needed
same error may occur again

6
Backward Error Recovery

Operation-based approach
using a log, undo(roll-back) what has been done
until an error-free state can be restored
write ahead log (for a write to X)
records in a log new value of X
updates X
State-based approach
checkpoint
a complete state of a process
at crash, rollback to the most recent safe state
needs many checkpoints
shadow page
copy of a page that is to be updated
updates are done only on the original page
at crash, goes back to the shadow page
at commit, keep using the original page
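
The operation-based approach above can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the class and method names are hypothetical, and the log here stores the old value of each item, since that is what an undo needs (a redo-style log would store the new value, as in the write-ahead bullets above).

```python
class UndoLogStore:
    """Key-value store that rolls back through an operation log (hypothetical names)."""

    _MISSING = object()  # marks "key did not exist before the write"

    def __init__(self):
        self.data = {}
        self.log = []  # (key, old_value) records, appended BEFORE each update

    def write(self, key, value):
        # Write-ahead: record the old value first, then update in place.
        self.log.append((key, self.data.get(key, self._MISSING)))
        self.data[key] = value

    def checkpoint(self):
        # The current state is taken as error-free; older undo records are dropped.
        self.log.clear()

    def rollback(self):
        # Undo in reverse order until the last error-free state is restored.
        while self.log:
            key, old = self.log.pop()
            if old is self._MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old


store = UndoLogStore()
store.write("x", 1)
store.checkpoint()
store.write("x", 2)
store.write("y", 3)
store.rollback()
print(store.data)  # {'x': 1}
```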

7
Issues in Recovery(1)

failure and recovery of a process affect other
processes that exchange data with the failed
process
orphan message
when a process rolls back to the point before
sending out a message
actions of other processes depending on the
orphan message should be rolled back, too (domino
effects)
lost message
node Y receives a message from X
Y rolls back to the point before receiving the
message
effects are the same as when the message is lost

8
Issues in Recovery(2)

livelocks

[Figure: timelines of processes X and Y exchanging messages n1 and m1. Step 1: failure, and roll back. Step 2: orphan message, roll back.]

Y sends out m1 and receives an orphan message n1,
and rolls back
m1 becomes an orphan message
receiving m1, X rolls back

9
Checkpoints

local checkpoint
snapshot of a single node
superscalar CPU and out-of-order memory
operations made checkpointing difficult
global checkpoint
strongly consistent set of checkpoints
all the checkpoints are inside a given interval
no information is exchanged between any processes
during this interval
this is the last place any process should roll
back to

10
Checkpoints(2)

consistent set of checkpoints
a message recorded as received in one checkpoint
should be recorded as sent in another
checkpoint
no orphan message
a message recorded as sent may NOT be recorded as
received in another checkpoint
possible lost message
simple to make this set
take a checkpoint after sending every message
or after sending N messages for better efficiency
but with a greater chance of the domino effect
lost messages can be dealt with as in other network
protocols
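
As a rough illustration of the definition above, the check below classifies messages across a set of checkpoints. The data layout (each checkpoint holding a set of sent and a set of received message ids) and the function names are assumptions made for this sketch only.

```python
def classify_messages(checkpoints):
    """checkpoints: dict process -> {"sent": set of ids, "received": set of ids}.

    Returns (orphans, lost) for this set of checkpoints.
    """
    sent = set().union(*(c["sent"] for c in checkpoints.values()))
    received = set().union(*(c["received"] for c in checkpoints.values()))
    orphans = received - sent   # recorded as received, never recorded as sent
    lost = sent - received      # recorded as sent, not recorded as received
    return orphans, lost


def is_consistent(checkpoints):
    # A consistent set tolerates lost messages but no orphan messages.
    orphans, _ = classify_messages(checkpoints)
    return not orphans


good = {
    "X": {"sent": {"m1"}, "received": set()},
    "Y": {"sent": set(), "received": set()},
}
bad = {
    "X": {"sent": set(), "received": set()},
    "Y": {"sent": set(), "received": {"m1"}},
}
print(is_consistent(good), classify_messages(good))  # True (set(), {'m1'})
print(is_consistent(bad), classify_messages(bad))    # False ({'m1'}, set())
```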

11
Synchronous Checkpointing

Assumption
FIFO delivery of messages
no lost message
Operations
an initiating node P broadcasts a message
all the other nodes
take temporary checkpoints if necessary
reply OK to the P
do not send any message until they hear from P
P broadcasts either
GO if all the nodes reply OK to P
Fail otherwise
Nodes make the temporary checkpoint permanent or
discard it
start to send messages from this point
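
The two rounds above resemble a two-phase commit. The following single-process simulation is only a sketch under that reading: `Node`, `coordinate`, and the `can_checkpoint` flag are hypothetical names, and real message passing is replaced by direct method calls.

```python
class Node:
    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint  # hypothetical switch to force a "Fail"
        self.temporary = None
        self.permanent = None
        self.sending_blocked = False

    def on_request(self):
        # Round 1: stop sending, take a temporary checkpoint, reply OK or not.
        self.sending_blocked = True
        if not self.can_checkpoint:
            return False
        self.temporary = dict(self.state)
        return True

    def on_decision(self, go):
        # Round 2: make the temporary checkpoint permanent, or discard it.
        if go and self.temporary is not None:
            self.permanent = self.temporary
        self.temporary = None
        self.sending_blocked = False  # normal messages may flow again


def coordinate(nodes):
    """Run one checkpointing round; the initiator P is simply part of `nodes`."""
    replies = [n.on_request() for n in nodes]  # every node is asked, no short-circuit
    go = all(replies)
    for n in nodes:
        n.on_decision(go)
    return go


cluster = [Node("P", {"a": 1}), Node("Q", {"b": 2})]
print(coordinate(cluster), cluster[1].permanent)  # True {'b': 2}
```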

12
Synchronous Checkpointing

advantages
easy recovery: all processes restart from the
checkpoint
disadvantages
message overhead
hinders normal progress (no computational messages
are allowed during checkpointing)

13
Asynchronous Checkpointing

checkpoint at each node is made independently
no guarantee of consistent set
recovery is complex: it must find the nearest
consistent set
optimization: all incoming messages are logged
after a checkpoint
the recovery algorithm analyzes the log and finds the
most recent consistent set of checkpoints

14
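Asynchronous Checkpointing(2)

[Figure: timelines of nodes X, Y, and Z showing checkpoints and ROLLBACK messages (red and blue lines, red brackets)]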

Y crashes
Y restarts from the last checkpoint
send ROLLBACK(Y,2) to X since the last checkpoint
records that Y has sent 2 msgs to X
ROLLBACK(Y,1) to Z (red lines)
other nodes send back ROLLBACK msgs similarly
(blue lines)
X sends out (X,2), (X,0) to Y and Z, respectively
each node sets its checkpoint so as to prevent orphan
msgs (red brackets)
the number of msgs received from i recorded in the
checkpoint must be at most N, where a ROLLBACK(i,N) msg has arrived
loop until a consistent set of checkpoints comes
up
bounded by N (?)
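
The rollback loop can be simulated centrally as a fixed-point computation. This sketch makes several assumptions that the slides do not state: each checkpoint stores per-peer counters of sent and received messages, the oldest checkpoint of every node has all counters at zero, and a node is moved back while it has received more messages from a peer than that peer's checkpoint records as sent. The helper names are hypothetical.

```python
def cp(sent=None, rcvd=None):
    """Build one checkpoint record (hypothetical layout): per-peer message counters."""
    return {"sent": sent or {}, "rcvd": rcvd or {}}


def find_recovery_line(history):
    """history: dict node -> list of checkpoints, oldest first.

    Assumes index 0 is the initial state with all counters at zero.
    Returns dict node -> index of the checkpoint to restart from.
    """
    line = {node: len(cps) - 1 for node, cps in history.items()}
    changed = True
    while changed:  # loop until a consistent set of checkpoints comes up
        changed = False
        for i in history:
            for j in history:
                if i == j:
                    continue
                # Plays the role of ROLLBACK(j, n) arriving at node i.
                n = history[j][line[j]]["sent"].get(i, 0)
                while history[i][line[i]]["rcvd"].get(j, 0) > n:
                    line[i] -= 1  # received more than j admits sending: orphan
                    changed = True
    return line


history = {
    "X": [cp(), cp({"Y": 1}, {"Y": 1}), cp({"Y": 1, "Z": 1}, {"Y": 2})],
    "Y": [cp(), cp({"X": 1}, {"X": 1})],  # Y crashed after this checkpoint
    "Z": [cp(), cp(rcvd={"X": 1})],
}
print(find_recovery_line(history))  # {'X': 1, 'Y': 1, 'Z': 0}
```

In this made-up history, X must drop its last checkpoint because it recorded two messages from Y while Y only remembers sending one; that in turn orphans the message Z received from X, so Z falls back as well, which is the domino effect in miniature.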

15
Free Transactions with Rio Vista

crash taxonomy
hardware: not frequent
software: frequent, due to bugs in the OS
power: covered by a UPS
motivations
transactions are useful but high overhead (disk
accesses)
file cache is useful, but vulnerable to system
crashes

16
Traditional Approach RVM

at the beginning of a transaction, RVM copies the
page to undo log(shadow page)
user abort is serviced by the undo log
at commit, RVM reclaims undo space, and writes
updated pages to redo log on disk
system/process failure is serviced by the redo
log
at leisure time, database is updated from the
redo log
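
A toy model of the undo/redo flow described above, assuming a simple key-value database. All names are hypothetical and this is not the RVM API; "disk" is just a pair of Python objects that the simulated crash leaves untouched.

```python
_ABSENT = object()  # marks a key that did not exist before the transaction


class ToyRvm:
    def __init__(self):
        self.disk_db = {}    # survives crashes
        self.redo_log = []   # survives crashes (kept on disk)
        self.memory = {}     # lost on a crash
        self.undo = None     # in-memory before-images of the active transaction

    def begin(self):
        self.undo = {}

    def set(self, key, value):
        if key not in self.undo:  # save the before-image once per key
            self.undo[key] = self.memory.get(key, _ABSENT)
        self.memory[key] = value

    def abort(self):
        # A user abort is serviced from the undo log.
        for key, old in self.undo.items():
            if old is _ABSENT:
                self.memory.pop(key, None)
            else:
                self.memory[key] = old
        self.undo = None

    def commit(self):
        # Write the updated values to the redo log, then reclaim the undo space.
        self.redo_log.append({k: self.memory[k] for k in self.undo})
        self.undo = None

    def truncate(self):
        # "At leisure time": apply the redo log to the database on disk.
        for record in self.redo_log:
            self.disk_db.update(record)
        self.redo_log.clear()

    def crash_and_recover(self):
        # Memory is lost; rebuild it from the disk image plus the redo log.
        self.memory = dict(self.disk_db)
        for record in self.redo_log:
            self.memory.update(record)
        self.undo = None


rvm = ToyRvm()
rvm.begin(); rvm.set("a", 1); rvm.commit()
rvm.begin(); rvm.set("a", 2)   # never committed
rvm.crash_and_recover()
print(rvm.memory)  # {'a': 1}
```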

17
Rio file cache

protect cached data from system crashes
cache is as reliable as a disk
then, write ahead log for recovery is not needed
writes to disk can be delayed infinitely
OS errors can corrupt any part of the system
the issue is how to reduce the chances
at a crash
warm reboot process writes the cache to disk

18
file cache vs disk

why do people view memory as more vulnerable than disk?
memory access is a simple write
an error in the address bits will overwrite the
file cache
interface to access disk is complex and explicit
hardware controller is accessed only through
device driver
calls to device drivers are checked for their
arguments
it is extremely unlikely that accidental errors
can forge the logic of a device driver

19
How to protect from system crashes?

prevent OS from accidentally overwriting the file
cache
virtual memory mapping
turn off the write-permission bits in the page
table for the pages in the file cache
unauthorized accesses will encounter protection
violation
file cache module enables the bit before writing
and disables the bit afterwards
the file cache is vulnerable to crashes while
being written
disk has the same problem
solutions
verify after writes
use shadow copy for atomic writes
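
The write-permission idea can be mimicked in user-level code. In the sketch below a boolean flag stands in for the page-table write bit, and the class names are hypothetical; a real kernel would rely on the MMU rather than an explicit check.

```python
class ProtectionViolation(Exception):
    pass


class ProtectedPage:
    def __init__(self, size):
        self._data = bytearray(size)
        self.writable = False  # stands in for the page-table write-permission bit

    def store(self, offset, payload):
        # Every write, authorized or stray, passes through this check.
        if not self.writable:
            raise ProtectionViolation(f"write to protected page at offset {offset}")
        self._data[offset:offset + len(payload)] = payload

    def read(self, offset, length):
        return bytes(self._data[offset:offset + length])


class FileCache:
    def __init__(self, page):
        self.page = page

    def write(self, offset, payload):
        # Enable the bit only for the duration of the authorized write.
        self.page.writable = True
        try:
            self.page.store(offset, payload)
        finally:
            self.page.writable = False


page = ProtectedPage(16)
cache = FileCache(page)
cache.write(0, b"abc")
print(page.read(0, 3))  # b'abc'
try:
    page.store(0, b"zzz")  # a stray write from elsewhere in the "kernel"
except ProtectionViolation as err:
    print("blocked:", err)
```

The window between enabling and disabling the bit is the remaining exposure noted above.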

20
How to protect from system crashes?

some kernels bypass the address translations
(TLB)
many systems can disable such bypasses
otherwise, code insertion (sandboxing)
check for every kernel write using physical
address
20-50% slower
memory-mapped file
kernel procedures that modify the memory-mapped
file should be changed as above
faulty user program can still corrupt files to
which it has write access

21
Warm Reboot

Recovery needs to access many data structures
internal file cache lists
page tables (memory-mapped files)
all these data must be protected from crash but
they are scattered inside the kernel
Registry
a separate physical memory region
contains all the information to recover the file
cache
it is updated only when a buffer is replaced
(reloaded)

22
File System Modifications

writes to disk can be saved
most disk writes are reliability-induced
writes to disk are needed only when the file
cache overflows
writing back dirty copies when the system is idle
reduces the time when a buffer is replaced

23
Vista Recoverable Memory

24
Recovery

operations
prepare undo log
writes directly to the DB's mapped image in Rio
these updates are persistent
at commit, discard the undo log
at abort, restore the undo log to the mapped DB
at recovery
Rio writes back Vista segments that were mapped
at the time of crash
Vista examines the segments for any
uncommitted transactions
roll back (restore undo log)
recovery process should be idempotent
crash can happen while recovering
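
A minimal sketch of these operations, assuming the segment object below plays the role of memory that survives a crash. The function and class names are hypothetical and do not come from Vista. Recovery removes an undo record only after it has been applied, so repeating recovery after a crash in the middle of it gives the same result.

```python
class Segment:
    """State assumed to survive a crash, as Rio-protected memory would."""

    def __init__(self, size):
        self.db = bytearray(size)  # mapped database image
        self.undo = []             # (offset, old_bytes) before-images


def write(seg, offset, payload):
    # Prepare the undo record, then update the mapped image directly.
    seg.undo.append((offset, bytes(seg.db[offset:offset + len(payload)])))
    seg.db[offset:offset + len(payload)] = payload


def commit(seg):
    seg.undo.clear()  # the in-place updates are already persistent


def recover(seg):
    # Also serves as abort: restore before-images, newest first.
    while seg.undo:
        offset, old = seg.undo[-1]
        seg.db[offset:offset + len(old)] = old  # re-applying this is harmless
        seg.undo.pop()                          # drop the record only afterwards


seg = Segment(8)
write(seg, 0, b"AAAA")
commit(seg)
write(seg, 2, b"BBBB")  # uncommitted at the time of the "crash"
recover(seg)
print(bytes(seg.db))    # b'AAAA\x00\x00\x00\x00'
```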

25
Persistent Heap

only transactions can use it
when they abort, all the heap memory they used is returned
undo records mentioned above are stored here
programs can store their original data structures
usually convert them to record style when stored
in a file
meta data for the heap is in user space
why?
need a protection from corruption
reduce the risk by using isolated range of
addresses
software fault isolation
virtual memory protection

26
Fault Tolerance with DSM

DSM maintains multiple copies of a page
if a copy is lost, it can be recovered from
another copy
maintain at least two copies for each page
cope with a single failure
can be extended to cope with n failures
what about state information?
can be rebuilt
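
The two-copy rule can be illustrated with a small in-memory model. The placement policy (least-loaded nodes first) and all names are assumptions made for the sketch; the slides do not specify how copies are placed.

```python
class ReplicatedPages:
    def __init__(self, node_names, copies=2):
        self.nodes = {name: {} for name in node_names}  # node -> {page_id: data}
        self.copies = copies

    def holders(self, page_id):
        return [n for n, pages in self.nodes.items() if page_id in pages]

    def write(self, page_id, data):
        current = self.holders(page_id)
        # Top up to `copies` holders, preferring the least-loaded other nodes.
        spare = sorted((n for n in self.nodes if n not in current),
                       key=lambda n: len(self.nodes[n]))
        targets = (current + spare)[:max(self.copies, len(current))]
        for n in targets:
            self.nodes[n][page_id] = data

    def read(self, page_id):
        current = self.holders(page_id)
        if not current:
            raise KeyError(page_id)
        return self.nodes[current[0]][page_id]

    def fail(self, node):
        # The node's copies are lost; rebuild each page from a surviving copy.
        lost = self.nodes.pop(node)
        for page_id in lost:
            survivors = self.holders(page_id)
            if survivors:
                self.write(page_id, self.nodes[survivors[0]][page_id])


dsm = ReplicatedPages(["A", "B", "C"])
dsm.write("p1", b"x")
dsm.write("p2", b"y")
dsm.fail("A")
print(dsm.read("p1"), dsm.read("p2"))  # b'x' b'y'
```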
