Checkpointing and Recovery in Distributed Systems


1
Checkpointing and Recovery in Distributed Systems

Neeraj Mittal

2
The Main Idea

Processes take checkpoints to store the work they
have done so far.
Checkpoint of a process contains all the data
necessary to restart the process from that point.
When a process fails and restarts, the system may
enter an inconsistent state.
Recovery involves restoring the system to a
consistent state.
May require other processes to restart their
execution from earlier checkpoints.

3
Koo and Toueg's Algorithm

Assumptions
All channels are FIFO.
All channels are bidirectional.
All channels are reliable.
Communication topology need not be a complete
graph.

4
Koo and Toueg's Algorithm (Contd.)

Ensures that the last checkpoints of any two
processes are concurrent.
Consequently
No process has to roll back beyond its last
checkpoint.
Checkpointing by one process may cause other
processes to take a checkpoint as well.
Recovery involves rolling back processes to their
last checkpoints.

5
Koo and Toueg's Algorithm (Contd.)

At any given time, multiple instances of the
checkpointing and recovery algorithms may be in
progress.
A process participates in at most one instance
(checkpointing or recovery) at any given time.
It will refuse to participate in other instances,
thereby causing them to abort.
Aborted instances are restarted later by their
initiators.

6
Koo and Toueg's Checkpointing Algorithm

Consists of two phases.
First Phase
Processes take tentative checkpoints if they can.
A process after taking a tentative checkpoint
cannot send any application messages until the
second phase completes.
Second Phase
If all required processes are able to take
checkpoints in the first phase, then tentative
checkpoints are made permanent.
Otherwise, tentative checkpoints are discarded.
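
The two phases above can be sketched roughly as follows. This is a minimal single-threaded illustration, not the algorithm as published: the Process class, the can_checkpoint flag (which models a process refusing to participate), and run_checkpoint are hypothetical names, and message passing between the initiator and the other processes is replaced by direct method calls.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    name: str
    state: int = 0
    can_checkpoint: bool = True      # hypothetical: False models a refusal
    tentative: Optional[int] = None
    permanent: Optional[int] = None
    sending_blocked: bool = False

    def take_tentative(self) -> bool:
        """Phase 1: save a tentative checkpoint if possible."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        # No application messages may be sent until phase 2 completes.
        self.sending_blocked = True
        return True

    def finish(self, commit: bool) -> None:
        """Phase 2: make the tentative checkpoint permanent or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.sending_blocked = False


def run_checkpoint(required: List[Process]) -> bool:
    # Phase 1: ask every required process (the list avoids short-circuiting).
    replies = [p.take_tentative() for p in required]
    commit = all(replies)
    # Phase 2: commit only if every required process succeeded.
    for p in required:
        p.finish(commit)
    return commit
```

If any required process declines, every tentative checkpoint is dropped and all processes are free to send application messages again.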

7
Koo and Toueg's Checkpointing Algorithm (Contd.)

Minimizes the number of processes that take
checkpoints.
Each process labels its outgoing messages with
monotonically increasing values.
⊥ is a special label.
It is smaller than any other label value.
Each process maintains two vectors with one entry
for each of its neighbors
last_label_rcvd
first_label_sent

8
Koo and Toueg's Checkpointing Algorithm: Details

Consider processes X and Y that are neighbors.
Definition of last_label_rcvd_X[Y]
Let m be the last message that X has received
from Y since its last permanent/tentative
checkpoint.
If m exists, then last_label_rcvd_X[Y] is the
label of m.
Otherwise, last_label_rcvd_X[Y] is ⊥.

9
Koo and Toueg's Checkpointing Algorithm: Details (Contd.)

Definition of first_label_sent_X[Y]
Let m be the first message that X has sent to Y
since its last permanent/tentative checkpoint.
If m exists, then first_label_sent_X[Y] is the
label of m.
Otherwise, first_label_sent_X[Y] is ⊥.
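
The bookkeeping behind these two vectors might look like the sketch below. It assumes labels are positive integers and represents the special smallest label as BOTTOM = 0; the LabelTracker class and its method names are hypothetical and only illustrate the definitions.

```python
BOTTOM = 0  # assumption: stands for the special smallest label


class LabelTracker:
    """Per-process label bookkeeping (hypothetical helper)."""

    def __init__(self, neighbors):
        self.neighbors = list(neighbors)
        self.next_label = 1  # labels keep increasing across checkpoints
        self.reset()

    def reset(self):
        """Call whenever a tentative or permanent checkpoint is taken."""
        self.last_label_rcvd = {n: BOTTOM for n in self.neighbors}
        self.first_label_sent = {n: BOTTOM for n in self.neighbors}

    def on_send(self, to):
        """Assign the next label; remember it if it is the first to `to`."""
        label = self.next_label
        self.next_label += 1
        if self.first_label_sent[to] == BOTTOM:
            self.first_label_sent[to] = label
        return label

    def on_receive(self, frm, label):
        """Record the label of the most recent message from `frm`."""
        self.last_label_rcvd[frm] = label
```

Both vectors are cleared at a checkpoint because their definitions only count messages since the last checkpoint, while the label counter itself is never reset.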

10
Koo and Toueg's Checkpointing Algorithm: Details (Contd.)

Assume that X has taken a (tentative) checkpoint.
Y does not need to take a checkpoint if
last_label_rcvd_X[Y] = ⊥.
Otherwise, X requests Y to take a checkpoint and
sends last_label_rcvd_X[Y] to Y.
Y takes a (tentative) checkpoint if
last_label_rcvd_X[Y] ≥ first_label_sent_Y[X] > ⊥
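
The two decisions on this slide can be written as small predicates. The sketch assumes the intended test is that the last label X received from Y is at least the first label Y sent to X, which in turn is greater than the special smallest label (modeled as BOTTOM = 0 with positive integer labels). The function names are hypothetical.

```python
BOTTOM = 0  # assumption: stands for the special smallest label


def x_should_request(last_label_rcvd_x_y: int) -> bool:
    """X asks Y to checkpoint only if X received something from Y
    since X's previous checkpoint."""
    return last_label_rcvd_x_y != BOTTOM


def y_should_checkpoint(last_label_rcvd_x_y: int,
                        first_label_sent_y_x: int) -> bool:
    """Y checkpoints only if X's checkpoint records the receipt of a
    message that Y sent after Y's own last checkpoint."""
    return last_label_rcvd_x_y >= first_label_sent_y_x > BOTTOM
```

For example, if X last received label 5 from Y and Y's first message to X since its last checkpoint carried label 3, Y must take a checkpoint; if that first label were 7, Y's existing checkpoint already covers everything X has received.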

11
Koo and Toueg's Checkpointing Algorithm: An Illustration

[Figure: communication topology among processes W, X, Y, and Z (used in all illustrations)]

12
Koo and Toueg's Checkpointing Algorithm: An Illustration

[Figure: space-time diagram of processes W, X, Y, and Z with labeled messages. For each process, the first vector is last_label_rcvd and the second vector is first_label_sent.]

13
Koo and Toueg's Recovery Algorithm

Only permanent checkpoints are used in recovery.
Consists of two phases.
First Phase
Processes agree to roll back if they can.
A process after agreeing to roll back stops its
execution until the second phase completes.
Second Phase
If all required processes agree to roll back in
the first phase, then they restart their
execution from the last checkpoint.
Otherwise, processes resume their execution from
their current point.

14
Koo and Toueg's Recovery Algorithm: Details

Consider processes X and Y that are neighbors.
Definition of last_label_sent_X[Y]
Let m be the last message that X sent to Y before
its last permanent checkpoint.
If m exists, then last_label_sent_X[Y] is the
label of m.
Otherwise, last_label_sent_X[Y] is ⊥.

15
Koo and Toueg's Recovery Algorithm: Details (Contd.)

Assume that X has agreed to roll back.
X requests Y to roll back and sends
last_label_sent_X[Y] to Y.
Y agrees to roll back if
last_label_rcvd_Y[X] > last_label_sent_X[Y]
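
The rollback test can be sketched as a single comparison. As before, the special smallest label is modeled as BOTTOM = 0 with positive integer labels, and the function name is hypothetical.

```python
BOTTOM = 0  # assumption: stands for the special smallest label


def y_must_roll_back(last_label_rcvd_y_x: int,
                     last_label_sent_x_y: int) -> bool:
    """Y has received a message that X, once rolled back to its last
    permanent checkpoint, will no longer have sent."""
    return last_label_rcvd_y_x > last_label_sent_x_y
```

If X sent nothing to Y before its last permanent checkpoint, last_label_sent_x_y is BOTTOM, so any message Y has received from X since then forces Y to roll back as well.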

16
Koo and Toueg's Recovery Algorithm: An Illustration

[Figure: space-time diagram of processes W, X, Y, and Z with labeled messages. For each process, the first vector is last_label_rcvd and the second vector is last_label_sent.]

17
Juang and Venkatesan's Algorithm

Assumptions
All channels are FIFO.
All channels are bidirectional.
All channels are reliable.
Communication topology need not be a complete
graph.
A process changes its state only on receiving a
message (except initially).

18
Juang and Venkatesan's Checkpointing Algorithm

A process takes a checkpoint every time it
executes an event.
Checkpoints are taken in volatile storage.
Periodically, checkpoints in volatile storage are
flushed to stable storage.
Checkpoints in volatile storage are lost when a
failure occurs.
A checkpoint consists of
the local state just before a message is received,
and
the message received.

19
Juang and Venkatesan's Checkpointing Algorithm (Contd.)

A checkpoint consists of
the local state just before a message is received,
and
the message received.
A checkpoint can be used to recover the state just
after the message is received.
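
One way to picture this checkpoint format is the sketch below. It assumes the process is a deterministic state machine, so the state after an event can be rebuilt by re-applying a transition function to the saved state and message. LoggingProcess, Checkpoint, and the transition callback are hypothetical names, and the volatile and stable logs are plain in-memory lists.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Checkpoint:
    state_before: Any  # local state just before the message is received
    message: Any       # the message received


class LoggingProcess:
    def __init__(self, initial_state: Any,
                 transition: Callable[[Any, Any], Any]):
        self.initial_state = initial_state
        self.state = initial_state
        self.transition = transition  # assumed deterministic
        self.volatile: List[Checkpoint] = []
        self.stable: List[Checkpoint] = []

    def receive(self, message: Any) -> None:
        # Checkpoint on every event, in volatile storage.
        self.volatile.append(Checkpoint(self.state, message))
        self.state = self.transition(self.state, message)

    def flush(self) -> None:
        # Periodically move volatile checkpoints to stable storage.
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def crash_and_restore(self) -> None:
        # A failure loses the volatile log; rebuild from stable storage.
        self.volatile.clear()
        if self.stable:
            last = self.stable[-1]
            self.state = self.transition(last.state_before, last.message)
        else:
            self.state = self.initial_state
```

With a transition that adds each message to an integer state, receiving 1 and 2, flushing, and then receiving 3 leaves the state at 6; a crash at that point restores the state to 3, the value just after the last flushed event.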

20
Juang and Venkatesan's Recovery Algorithm

Each process maintains two vectors with one entry
for every neighbor
SENT stores the number of messages sent to each
neighbor so far.
RCVD stores the number of messages received from
each neighbor so far.
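
A minimal sketch of these two vectors, using dictionaries keyed by neighbor name. The MessageCounters class and its method names are hypothetical.

```python
class MessageCounters:
    """Per-process SENT and RCVD counters, one entry per neighbor."""

    def __init__(self, neighbors):
        self.sent = {n: 0 for n in neighbors}
        self.rcvd = {n: 0 for n in neighbors}

    def on_send(self, to):
        self.sent[to] += 1

    def on_receive(self, frm):
        self.rcvd[frm] += 1

    def snapshot(self):
        """Copies of both vectors, suitable for storing in a checkpoint."""
        return dict(self.sent), dict(self.rcvd)
```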

21
Juang and Venkatesan's Recovery Algorithm (Contd.)

Consider a collection of checkpoints, one from
each process P1, P2, ..., PN, given by
ckpt1, ckpt2, ..., ckptN
Checkpoints form a consistent global state if,
for each pair of neighbors Pi and Pj, the
following holds
SENT(ckpt_i)[j] ≥ RCVD(ckpt_j)[i]
Otherwise, Pj's state is inconsistent with that
of Pi.
The inconsistency can be removed by rolling back
Pj.
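
The consistency condition can be checked mechanically. The sketch below assumes the condition reads "the number of messages Pi has sent to Pj is at least the number Pj has received from Pi"; the function name and the dictionary layout are hypothetical.

```python
def inconsistent_pairs(sent, rcvd, neighbors):
    """Return every (i, j) where Pj has received more messages from Pi
    than Pi has sent, so Pj would need to roll back.

    sent[i][j]: messages Pi sent to Pj at Pi's checkpoint.
    rcvd[j][i]: messages Pj received from Pi at Pj's checkpoint.
    neighbors[i]: the neighbors of Pi.
    """
    bad = []
    for i, nbrs in neighbors.items():
        for j in nbrs:
            if sent[i][j] < rcvd[j][i]:
                bad.append((i, j))
    return bad


neighbors = {"P1": ["P2"], "P2": ["P1"]}
sent = {"P1": {"P2": 1}, "P2": {"P1": 0}}
rcvd = {"P1": {"P2": 0}, "P2": {"P1": 2}}
# P2 has received 2 messages from P1, but P1's checkpoint records 1 sent.
print(inconsistent_pairs(sent, rcvd, neighbors))
```

An empty result means the collection of checkpoints forms a consistent global state under this condition.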

22
Juang and Venkatesan's Recovery Algorithm (Contd.)

All processes participate in recovery.
Failed process, on restarting, rolls back to its
last stable checkpoint and instructs all
processes to start recovery using flooding.
Recovery algorithm executes in iterations.
In each iteration, every process sends to each of
its neighbors the number of messages it has sent
to it as per the current state.
A process rolls back if its state is inconsistent
with that of its neighbor. It rolls back to its
latest checkpoint that removes the inconsistency.
System is guaranteed to be in a consistent state
after N-1 iterations (N is the number of
processes).
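
The iterations can be simulated centrally as in the sketch below. It makes several simplifying assumptions: each process's history is a list of (SENT, RCVD) vector pairs, oldest first, describing the state each checkpoint restores; the oldest entry has all-zero vectors; flooding has already happened; and message exchange is replaced by reading a per-iteration snapshot. The recover function and the data layout are hypothetical.

```python
def recover(history, neighbors, start):
    """Return the checkpoint index each process ends up at.

    history[p]: list of (SENT, RCVD) dict pairs, oldest first.
    neighbors[p]: the neighbors of p.
    start[p]: index p resumes from (the last stable checkpoint for the
              failed process, the current state for the others).
    """
    current = dict(start)
    n = len(history)
    for _ in range(n - 1):
        # Every process reports its SENT vector as of its current state.
        reported = {p: history[p][current[p]][0] for p in history}
        for p in history:
            for q in neighbors[p]:
                limit = reported[q][p]  # messages q says it sent to p
                # Roll back to the latest checkpoint consistent with q.
                while history[p][current[p]][1][q] > limit:
                    current[p] -= 1
    return current


neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
history = {
    "A": [({"B": 0}, {"B": 0}), ({"B": 1}, {"B": 0})],
    "B": [({"A": 0, "C": 0}, {"A": 0, "C": 0}),
          ({"A": 0, "C": 1}, {"A": 1, "C": 0})],
    "C": [({"B": 0}, {"B": 0}), ({"B": 0}, {"B": 1})],
}
# A failed and lost its latest checkpoint, so it restarts from index 0.
print(recover(history, neighbors, {"A": 0, "B": 1, "C": 1}))
```

In this example B rolls back in the first iteration because it has received a message that A no longer records as sent, and C rolls back in the second iteration for the same reason with respect to B.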

23
Juang and Venkatesan's Recovery Algorithm: An Illustration

[Figure: space-time diagram of processes W, X, Y, and Z with checkpoints cw0 to cw2, cx0 to cx1, cy0 to cy2, and cz0 to cz2, showing flooding followed by Iteration 1, Iteration 2, and Iteration 3 of recovery.]
