Checkpointing and Recovery in Distributed Systems

1
Checkpointing and Recovery in Distributed Systems
  • Neeraj Mittal

2
The Main Idea
  • Processes take checkpoints to store the work they
    have done so far.
  • The checkpoint of a process contains all the data
    necessary to restart the process from that point.
  • When a process fails and restarts, the system may
    enter an inconsistent state.
  • Recovery involves restoring the system to a
    consistent state.
  • Recovery may require other processes to restart
    their execution from earlier checkpoints.

3
Koo and Toueg's Algorithm
  • Assumptions:
      • All channels are FIFO.
      • All channels are bidirectional.
      • All channels are reliable.
      • Communication topology need not be a complete
        graph.

4
Koo and Toueg's Algorithm (Contd.)
  • Ensures that the last checkpoints of any two
    processes are concurrent.
  • Consequently:
      • No process has to roll back beyond its last
        checkpoint.
      • Checkpointing by one process may cause other
        processes to take a checkpoint as well.
      • Recovery involves rolling back processes to their
        last checkpoints.

5
Koo and Toueg's Algorithm (Contd.)
  • At any given time, multiple instances of the
    checkpointing and recovery algorithms may be in
    progress.
  • A process participates in at most one instance
    (checkpointing or recovery) at any given time.
  • It refuses to participate in other instances,
    thereby causing them to abort.
  • Aborted instances are restarted later by their
    initiators.
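
The "at most one instance" rule can be pictured with a small sketch. The class and method names (`Process`, `join`, `leave`) and the instance identifiers are hypothetical, chosen only for illustration; the slides do not specify how instances are identified.

```python
class Process:
    """Tracks the single checkpointing/recovery instance a process is part of."""

    def __init__(self):
        self.active_instance = None   # None means the process is free

    def join(self, instance_id):
        """Return True if the process agrees to participate in instance_id."""
        if self.active_instance is None or self.active_instance == instance_id:
            self.active_instance = instance_id
            return True
        # Busy with another instance: refuse, which makes that instance abort.
        return False

    def leave(self, instance_id):
        """Called when instance_id completes or aborts."""
        if self.active_instance == instance_id:
            self.active_instance = None


p = Process()
print(p.join("ckpt-by-X"))       # first instance is accepted
print(p.join("recovery-by-Z"))   # refused while the first is in progress
p.leave("ckpt-by-X")
print(p.join("recovery-by-Z"))   # the initiator's later retry succeeds
```

An initiator whose instance was refused would simply retry later, as the last bullet says.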

6
Koo and Toueg's Checkpointing Algorithm
  • Consists of two phases.
  • First Phase:
      • Processes take tentative checkpoints if they can.
      • After taking a tentative checkpoint, a process
        cannot send any application messages until the
        second phase completes.
  • Second Phase:
      • If all required processes are able to take
        checkpoints in the first phase, then tentative
        checkpoints are made permanent.
      • Otherwise, tentative checkpoints are discarded.
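
A minimal single-machine sketch of the two phases, assuming the set of required processes is already known and replies arrive synchronously. `Participant`, `can_checkpoint`, and `run_checkpoint` are hypothetical names, not part of the original algorithm description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Participant:
    name: str
    state: int = 0
    can_checkpoint: bool = True        # hypothetical flag: False models a refusal
    tentative: Optional[int] = None
    permanent: Optional[int] = None
    sending_blocked: bool = False

    def take_tentative(self) -> bool:
        """Phase 1: try to take a tentative checkpoint."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.sending_blocked = True    # no application messages until phase 2 ends
        return True

    def finish(self, commit: bool) -> None:
        """Phase 2: make the tentative checkpoint permanent, or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.sending_blocked = False


def run_checkpoint(required: List[Participant]) -> bool:
    # Phase 1: ask every required process (list, so nobody is skipped).
    replies = [p.take_tentative() for p in required]
    commit = all(replies)
    # Phase 2: announce the decision to everyone that was asked.
    for p in required:
        p.finish(commit)
    return commit
```

If any participant refuses, every tentative checkpoint is dropped and the previous permanent checkpoints stay in force.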

7
Koo and Toueg's Checkpointing Algorithm (Contd.)
  • Minimizes the number of processes that take
    checkpoints.
  • Each process assigns monotonically increasing
    labels to its messages.
  • ⊥ is a special label:
      • It is smaller than any other label value.
  • Each process maintains two vectors with one entry
    for each of its neighbors:
      • last_label_rcvd
      • first_label_sent

8
Koo and Toueg's Checkpointing Algorithm: Details
  • Consider processes X and Y that are neighbors.
  • Definition of last_label_rcvd_X[Y]:
      • Let m be the last message that X has received
        from Y since X's last permanent or tentative
        checkpoint.
      • If m exists, then last_label_rcvd_X[Y] is the
        label of m.
      • Otherwise, last_label_rcvd_X[Y] is ⊥.

9
Koo and Toueg's Checkpointing Algorithm: Details (Contd.)
  • Definition of first_label_sent_X[Y]:
      • Let m be the first message that X has sent to Y
        since X's last permanent or tentative checkpoint.
      • If m exists, then first_label_sent_X[Y] is the
        label of m.
      • Otherwise, first_label_sent_X[Y] is ⊥.
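
The two vectors can be maintained with a few lines of bookkeeping. This is a sketch under assumptions: labels are integers from a single per-process counter, the special smallest label is modeled as negative infinity (`BOTTOM`), and the class and method names are hypothetical.

```python
import math

BOTTOM = -math.inf   # stands in for the special label smaller than every real label


class LabelTracker:
    """Per-process bookkeeping of last_label_rcvd and first_label_sent."""

    def __init__(self, neighbors):
        self.next_label = 1   # labels keep increasing across checkpoints
        self.last_label_rcvd = {n: BOTTOM for n in neighbors}
        self.first_label_sent = {n: BOTTOM for n in neighbors}

    def on_send(self, to):
        """Assign the next label; remember it if it is the first sent to `to`."""
        label = self.next_label
        self.next_label += 1
        if self.first_label_sent[to] == BOTTOM:
            self.first_label_sent[to] = label
        return label

    def on_receive(self, frm, label):
        """Record the label of the most recent message received from `frm`."""
        self.last_label_rcvd[frm] = label

    def on_checkpoint(self):
        """Both vectors are defined relative to the last checkpoint, so reset them."""
        for n in self.last_label_rcvd:
            self.last_label_rcvd[n] = BOTTOM
            self.first_label_sent[n] = BOTTOM
```

Note that `next_label` is deliberately not reset in `on_checkpoint`, so labels stay monotonically increasing for the lifetime of the process.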

10
Koo and Toueg's Checkpointing Algorithm: Details (Contd.)
  • Assume that X has taken a (tentative) checkpoint.
  • Y does not need to take a checkpoint if
    last_label_rcvd_X[Y] = ⊥.
  • Otherwise, X requests Y to take a checkpoint and
    sends last_label_rcvd_X[Y] to Y.
  • Y takes a (tentative) checkpoint if:
      • last_label_rcvd_X[Y] ≥ first_label_sent_Y[X] > ⊥
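
The decision rule can be written as two small predicates. This is an illustrative sketch: the function names are hypothetical, and the special smallest label is again modeled as negative infinity.

```python
import math

BOTTOM = -math.inf   # the special label, smaller than any real label


def x_should_request(last_label_rcvd_x_y):
    """X asks Y to checkpoint only if X received something from Y."""
    return last_label_rcvd_x_y > BOTTOM


def y_must_checkpoint(last_label_rcvd_x_y, first_label_sent_y_x):
    """Y checkpoints if X has received a message that Y sent after Y's last
    checkpoint; otherwise X's checkpoint would record an orphan message."""
    return last_label_rcvd_x_y >= first_label_sent_y_x > BOTTOM


print(y_must_checkpoint(3, 2))       # True: X saw label 3, Y's first send was 2
print(y_must_checkpoint(3, 4))       # False: X saw only pre-checkpoint messages
print(y_must_checkpoint(3, BOTTOM))  # False: Y sent nothing since its checkpoint
```

The second case is what lets the algorithm keep the number of checkpointing processes small: Y's existing checkpoint already covers everything X has received.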

11
Koo and Toueg's Checkpointing Algorithm: An Illustration
[Figure: communication topology among processes W, X, Y, and Z (used in all illustrations)]

12
Koo and Toueg's Checkpointing Algorithm: An Illustration
[Figure: example execution of processes W, X, Y, and Z with labeled messages. For each process, the first vector shown is last_label_rcvd and the second vector is first_label_sent.]

13
Koo and Toueg's Recovery Algorithm
  • Only permanent checkpoints are used in recovery.
  • Consists of two phases.
  • First Phase:
      • Processes agree to roll back if they can.
      • After agreeing to roll back, a process stops its
        execution until the second phase completes.
  • Second Phase:
      • If all required processes agree to roll back in
        the first phase, then they restart their
        execution from their last permanent checkpoints.
      • Otherwise, processes resume their execution from
        their current point.

14
Koo and Toueg's Recovery Algorithm: Details
  • Consider processes X and Y that are neighbors.
  • Definition of last_label_sent_X[Y]:
      • Let m be the last message that X sent to Y before
        X's last permanent checkpoint.
      • If m exists, then last_label_sent_X[Y] is the
        label of m.
      • Otherwise, last_label_sent_X[Y] is ⊥.

15
Koo and Toueg's Recovery Algorithm: Details (Contd.)
  • Assume that X has agreed to roll back.
  • X requests Y to roll back and sends
    last_label_sent_X[Y] to Y.
  • Y agrees to roll back if:
      • last_label_rcvd_Y[X] > last_label_sent_X[Y]
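
The rollback test compares what Y has received from X with what X will remember having sent once it is back at its permanent checkpoint. A minimal sketch, with a hypothetical function name and the special smallest label modeled as negative infinity:

```python
import math

BOTTOM = -math.inf   # the special label, smaller than any real label


def y_must_roll_back(last_label_rcvd_y_x, last_label_sent_x_y):
    """Y has to roll back if it received a message from X that X sent after
    X's last permanent checkpoint (X's rollback undoes that send)."""
    return last_label_rcvd_y_x > last_label_sent_x_y


print(y_must_roll_back(5, 3))            # True: message 5 becomes an orphan
print(y_must_roll_back(3, 3))            # False: all received messages survive
print(y_must_roll_back(BOTTOM, BOTTOM))  # False: Y received nothing from X
```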

16
Koo and Toueg's Recovery Algorithm: An Illustration
[Figure: example execution of processes W, X, Y, and Z with labeled messages. For each process, the first vector shown is last_label_rcvd and the second vector is last_label_sent.]

17
Juang and Venkatesan's Algorithm
  • Assumptions:
      • All channels are FIFO.
      • All channels are bidirectional.
      • All channels are reliable.
      • Communication topology need not be a complete
        graph.
      • A process changes its state only on receiving a
        message (except initially).

18
Juang and Venkatesan's Checkpointing Algorithm
  • A process takes a checkpoint every time it
    executes an event.
  • Checkpoints are taken in volatile storage.
  • Periodically, checkpoints in volatile storage are
    flushed to stable storage.
  • Checkpoints in volatile storage are lost when a
    failure occurs.
  • A checkpoint consists of:
      • the local state just before the message is
        received, and
      • the message received.

19
Juang and Venkatesan's Checkpointing Algorithm (Contd.)
  • A checkpoint consists of:
      • the local state just before the message is
        received, and
      • the message received.
  • A checkpoint can be used to recover the state just
    after the message is received.
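
A small sketch of this event-driven checkpointing, assuming the state change caused by a message is a deterministic function `handler(state, message)`. `Checkpoint`, `EventLogger`, and `handler` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Checkpoint:
    state_before: Any   # local state just before the message is received
    message: Any        # the message received


def state_after(ckpt, handler):
    """Replay the logged message to recover the state just after the receive."""
    return handler(ckpt.state_before, ckpt.message)


class EventLogger:
    def __init__(self, initial_state, handler):
        self.initial_state = initial_state
        self.state = initial_state
        self.handler = handler   # assumed deterministic
        self.volatile = []       # lost on failure
        self.stable = []         # survives failure

    def on_message(self, msg):
        self.volatile.append(Checkpoint(self.state, msg))
        self.state = self.handler(self.state, msg)

    def flush(self):
        """Periodic flush of volatile checkpoints to stable storage."""
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def crash_and_restart(self):
        """Volatile checkpoints vanish; resume from the last stable one."""
        self.volatile.clear()
        if self.stable:
            self.state = state_after(self.stable[-1], self.handler)
        else:
            self.state = self.initial_state
```

For example, with `handler = lambda s, m: s + m`, a process that handles 5 and 2, flushes, handles 10, and then crashes would restart in state 7.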

20
Juang and Venkatesan's Recovery Algorithm
  • Each process maintains two vectors with one entry
    for every neighbor:
      • SENT stores the number of messages sent to each
        neighbor so far.
      • RCVD stores the number of messages received from
        each neighbor so far.

21
Juang and Venkatesan's Recovery Algorithm (Contd.)
  • Consider a collection of checkpoints, one from
    each process P1, P2, ..., PN, given by:
      • ckpt1, ckpt2, ..., ckptN
  • Checkpoints form a consistent global state if,
    for each pair of neighbors Pi and Pj, the
    following holds:
      • SENT(ckpti)[j] ≥ RCVD(ckptj)[i]
  • Otherwise, Pj's state is inconsistent with that
    of Pi.
  • The inconsistency can be removed by rolling back
    Pj.
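
The consistency condition amounts to checking, for every neighbor pair, that no process has received more messages than its neighbor has sent. A sketch under the assumption that the vectors are stored as nested dictionaries; `inconsistent_pairs` is a hypothetical helper name.

```python
def inconsistent_pairs(sent, rcvd, neighbors):
    """Return the (Pi, Pj) pairs for which Pj must roll back.

    sent[i][j]: number of messages Pi had sent to Pj at ckpt_i
    rcvd[j][i]: number of messages Pj had received from Pi at ckpt_j
    neighbors[i]: list of Pi's neighbors
    """
    bad = []
    for i in neighbors:
        for j in neighbors[i]:
            if sent[i][j] < rcvd[j][i]:   # Pj received a message Pi never sent
                bad.append((i, j))
    return bad


neighbors = {"P1": ["P2"], "P2": ["P1"]}
sent = {"P1": {"P2": 2}, "P2": {"P1": 1}}
rcvd = {"P1": {"P2": 1}, "P2": {"P1": 3}}
print(inconsistent_pairs(sent, rcvd, neighbors))   # [('P1', 'P2')]
```

Here P2 has recorded three receives from P1 while P1's checkpoint records only two sends, so P2 is the one that has to roll back.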

22
Juang and Venkatesan's Recovery Algorithm (Contd.)
  • All processes participate in recovery.
  • A failed process, on restarting, rolls back to its
    last stable checkpoint and uses flooding to
    instruct all processes to start recovery.
  • The recovery algorithm executes in iterations.
  • In each iteration, every process sends to each of
    its neighbors the number of messages it has sent
    to that neighbor as per its current state.
  • A process rolls back if its state is inconsistent
    with that of its neighbor. It rolls back to its
    latest checkpoint that removes the inconsistency.
  • The system is guaranteed to be in a consistent
    state after N-1 iterations (N is the number of
    processes).
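
The iterations can be simulated centrally to see how rollbacks cascade. This is a sketch, not the distributed protocol itself: it assumes synchronous rounds, that each process's checkpoint history is available as a list of (SENT, RCVD) pairs ordered oldest first, and that the oldest checkpoint has all RCVD counts at zero. The function name `recover` and the data layout are hypothetical.

```python
def recover(history, current, neighbors):
    """Return the checkpoint index each process ends at after N-1 iterations.

    history[p]: list of (sent, rcvd) dict pairs, oldest checkpoint first
    current[p]: index of the state p starts recovery from
    """
    current = dict(current)
    n = len(history)
    for _ in range(n - 1):
        # Every process announces its current SENT counts for this round.
        announced = {p: history[p][current[p]][0] for p in history}
        for p in history:
            for q in neighbors[p]:
                limit = announced[q][p]   # messages q says it has sent to p
                # Roll back to the latest checkpoint with RCVD[q] <= limit.
                while history[p][current[p]][1][q] > limit:
                    current[p] -= 1
    return current


neighbors = {"X": ["Y"], "Y": ["X", "Z"], "Z": ["Y"]}
history = {
    "X": [({"Y": 0}, {"Y": 0}), ({"Y": 1}, {"Y": 1})],
    "Y": [({"X": 1, "Z": 0}, {"X": 0, "Z": 0}),
          ({"X": 1, "Z": 1}, {"X": 1, "Z": 0})],
    "Z": [({"Y": 0}, {"Y": 0}), ({"Y": 0}, {"Y": 1})],
}
# X failed and restarted from its first checkpoint; Y and Z are at their latest.
print(recover(history, {"X": 0, "Y": 1, "Z": 1}, neighbors))
# {'X': 0, 'Y': 0, 'Z': 0}
```

In this example Y rolls back in the first iteration because it received a message X no longer remembers sending, and Z follows in the second iteration because Y's rollback undoes the message Z received.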

23
Juang and Venkatesan's Recovery Algorithm: An Illustration
[Figure: example execution of processes W, X, Y, and Z with checkpoints cw0 to cw2, cx0 to cx1, cy0 to cy2, and cz0 to cz2, showing the flooding step followed by recovery iterations 1, 2, and 3.]