System Reliability and Fault Tolerance - PowerPoint PPT Presentation

Description: System Reliability and Fault Tolerance; Reliable Communication; Byzantine Fault Tolerance
Slides: 42
Provided by: RobertR92
Date added: 5 February 2020

Transcript and Presenter's Notes

Title: System Reliability and Fault Tolerance

1
System Reliability and Fault Tolerance
  • Reliable Communication
  • Byzantine Fault Tolerance

2
Recap: Replication
  • A write is handled only by the remote primary
    server, and the backups are updated accordingly;
    a read is performed locally.
  • Replicated services
  • Sun Network Information Service (NIS, formerly
    Yellow Pages)
  • FAB: Building distributed enterprise disk arrays
    from commodity components, ASPLOS '04

3
Reliable Point-to-Point Comm.
  • Failure Models
  • Process failure: sender vs. receiver
  • Fail-stop: a process crash that can be detected
    by other processes
  • How to detect such a crash? A timeout can indicate
    only that a process is not responding
  • Comm. failure
  • Send failure: a process completes a send, but the
    msg is not put in its outgoing msg buffer
  • Receive failure: a msg is in the incoming buf, but
    is not received by the process
  • Channel failure: the msg fails while being
    transmitted from the outgoing buf to the incoming buf
  • Arbitrary failure (Byzantine failure)
  • Any type of error may occur, e.g. returning a wrong
    value
  • Reliable comm.
  • Validity: any msg in the outgoing buf is
    eventually delivered to the incoming buf
  • Integrity: the msg received is identical to the
    one sent, and no msgs are delivered twice

4
RPC Failure Semantics
  • Five Possible Failures
  • The client is unable to locate the server
  • Server is down, or the stub mismatches the
    skeleton
  • Throws UnknownHostException
  • The request message is lost
  • Start a timer
  • Retransmit the request message
  • The server crashes after receiving a request
  • The reply message is lost
  • The client crashes after sending a request
  • In Java, all remote methods must be prepared to
    catch RemoteException
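The timer-plus-retransmission idea for a lost request message can be sketched as below. This is a minimal illustration, not Java RMI: `send` and `recv` are assumed callbacks supplied by the transport, with `recv` returning `None` when the timer expires.

```python
def call_with_retry(send, recv, request, max_retries=3):
    """Hypothetical RPC client stub (slide 4): start a timer, and
    retransmit the request if no reply arrives before the timeout.
    'recv' is assumed to return None on timeout."""
    for _ in range(max_retries):
        send(request)            # (re)transmit the request
        reply = recv()           # wait until reply or timeout
        if reply is not None:
            return reply
    # after max_retries timeouts, report failure to the caller
    raise TimeoutError("request lost or server unreachable")
```

Note that blind retransmission gives at-least-once behavior; the duplicate-suppression sketch on slide 8 is needed to get at-most-once.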

5
Server Crashes
  • A server in client-server communication
  • Normal case
  • Crash after execution
  • Crash before execution
  • At least once semantics
  • At most once semantics
  • Exactly once semantics
  • java.rmi.ServerRuntimeException

6
Server Crash (Cont.)
  • Assume the client requests the server to print a msg
  • Send a completion msg (M) before printing (P), or
  • Send a completion msg (M) after printing (P)
  • Combinations
  • M→P→C: crash after ack and print
  • M→C(→P)
  • P→M→C
  • P→C(→M)
  • C(→P→M): crash before print and ack
  • C(→M→P)

7
  • When a crashed server recovers, the client can
  • never reissue a request (Never)
  • always reissue a request (Always)
  • reissue only if it received an ack
  • reissue only if it received no ack

Reissue strategy      | M→P: MPC | MC(P) | C(MP) | P→M: PMC | PC(M) | C(PM)
Always                | DUP      | OK    | OK    | DUP      | DUP   | OK
Never                 | OK       | ZERO  | ZERO  | OK       | OK    | ZERO
Only when ACKed       | DUP      | OK    | ZERO  | DUP      | OK    | ZERO
Only when not ACKed   | OK       | ZERO  | OK    | OK       | DUP   | OK

OK = text is printed once; DUP = printed twice; ZERO = no printout
8
  • Lost Reply Messages
  • Some requests can safely be re-executed because
    they have no side-effects (e.g. read 1024 bytes of
    a file); such requests are idempotent. Others
    cannot be.
  • Solutions
  • Structure requests in an idempotent way
  • Assign each request a sequence number to be checked
    by the server
  • Client Crashes, leading to orphan computations
  • Extermination: client-side logging of what each RPC
    is to do; the log is checked after a reboot
  • Reincarnation: the client broadcasts a new epoch
    when it reboots; servers detect orphan computations
    based on epochs, then
  • kill orphan remote computations or locate their
    owners
  • Expiration: set a time quantum for each RPC
    request; if it cannot finish, more quanta are
    requested
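The sequence-number solution above can be sketched as follows. The class and method names are illustrative: the server caches the last reply per client so a retransmitted request is answered from the cache rather than re-executed.

```python
class DedupServer:
    """Sketch of server-side duplicate suppression (slide 8):
    each request carries a client id and a sequence number; the
    server remembers the last (seq, reply) per client."""
    def __init__(self, execute):
        self.execute = execute   # the (possibly non-idempotent) operation
        self.last = {}           # client -> (seq, cached reply)

    def handle(self, client, seq, request):
        if client in self.last and self.last[client][0] == seq:
            # duplicate of the last request: resend the cached reply
            return self.last[client][1]
        reply = self.execute(request)   # execute at most once per seq
        self.last[client] = (seq, reply)
        return reply
```

This turns the at-least-once behavior of retransmission into at-most-once execution for non-idempotent requests.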

9
Reliable Multicast
  • Basic properties
  • Validity: if a correct process multicasts message
    m, then it will eventually deliver m
  • Integrity: a correct process delivers the msg at
    most once
  • Atomic messages (aka agreement)
  • A message is delivered to all members of a group,
    or to none
  • Message ordering guarantees
  • within a group
  • across groups

10
Message Ordering
  • Different members may see messages in different
    orders
  • Ordered group communication requires that all
    members agree about the order of messages
  • Within each group, assign global ordering to
    messages
  • Hold back messages that arrive out of order
    (delay their delivery)
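The hold-back idea above can be sketched for FIFO ordering (names are illustrative): each message carries a per-sender sequence number, and a message that arrives out of order is delayed until the gap before it is filled.

```python
class FifoDelivery:
    """Sketch of a hold-back queue (slide 10): deliver each
    sender's messages strictly in sequence-number order, holding
    back any message that arrives early."""
    def __init__(self):
        self.next_seq = {}   # sender -> next expected sequence number
        self.holdback = {}   # sender -> {seq: msg} awaiting delivery

    def receive(self, sender, seq, msg):
        """Returns the list of messages that become deliverable."""
        self.holdback.setdefault(sender, {})[seq] = msg
        expected = self.next_seq.get(sender, 1)
        delivered = []
        # deliver as long as the next expected message is on hand
        while expected in self.holdback[sender]:
            delivered.append(self.holdback[sender].pop(expected))
            expected += 1
        self.next_seq[sender] = expected
        return delivered
```

Causal and total ordering use the same hold-back structure but different deliverability conditions.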

11
  • (I) Unordered Multicasts

Process P1   Process P2    Process P3
mcast m1     receives m1   receives m2
mcast m2     receives m2   receives m1

(II) FIFO-ordered Multicasts

Process P1   Process P2    Process P3    Process P4
mcast m1     receives m1   receives m3   mcast m3
mcast m2     receives m3   receives m1   mcast m4
             receives m2   receives m2
             receives m4   receives m4

If a process multicasts two msgs m and m' in
order, then every process in the group will
deliver the msgs in the same order
12
(III) Causally-ordered Multicasts
If mcast(g, m) → mcast(g, m'), then any process in
the group should deliver m before m'
(IV) Totally-ordered Multicasts
If a process delivers msg m before m', then any
other process that delivers m' will deliver m
before m'
13
Centralized Impl of Total Ordering
  • A central ordering server (sequencer) assigns
    global sequence numbers
  • Hosts apply to the ordering server for numbers, or
    the ordering server sends all messages itself
  • Hold-back is easy, since sequence numbers are
    sequential
  • Msgs remain in the hold-back queue until they
    can be delivered according to their sequence
    numbers
  • The sequencer is a bottleneck and a single point of
    failure
  • A tricky protocol is needed to deal with the case
    where the ordering server fails
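The sequencer scheme above can be sketched in a few lines (class names are illustrative): the sequencer stamps each multicast with the next global number, and every member holds stamped messages back until they are deliverable in stamp order.

```python
class Sequencer:
    """Sketch of the centralized ordering server (slide 13):
    stamps each multicast with the next global sequence number."""
    def __init__(self):
        self.next = 1
    def order(self, msg):
        stamped = (self.next, msg)
        self.next += 1
        return stamped

class TotalOrderMember:
    """Each group member delivers stamped messages strictly in
    global sequence-number order, using a hold-back queue."""
    def __init__(self):
        self.expected = 1
        self.holdback = {}   # seq -> msg awaiting delivery
    def receive(self, stamped):
        seq, msg = stamped
        self.holdback[seq] = msg
        delivered = []
        while self.expected in self.holdback:
            delivered.append(self.holdback.pop(self.expected))
            self.expected += 1
        return delivered
```

Because every member applies the same stamp order, all members deliver all messages in the same total order, at the cost of funneling everything through one sequencer.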

14
Atomic Messages
  • Each recipient acks the message, and the sender
    retransmits if an ack is not received
  • The sender could crash before the msg is delivered!!
  • Simple approach: if the sender crashes, a recipient
    volunteers to be the backup sender for the message
  • re-sends the message to everybody, waits for acks
  • uses a simple algorithm to choose the volunteer
  • applies the method again if the backup sender crashes
  • No single best solution exists!

15
Reliability due to Replication
  • Blocking update: wait until the backups are updated
  • Blocking updates of backup servers must be atomic
    so as to implement sequential consistency: the
    primary can serialize all incoming writes (in
    order), and all processes see all writes in the
    same order from any backup server
  • Total ordering comes from using the primary as a
    centralized sequencer
  • Atomic
  • What happens if some W4 msgs get a positive ack and
    some get a NAck?
  • Two-phase commit protocol
  • W3: prepare msg from primary to other replicas
  • W3: ack to prepare (if in a prepared state, the
    related objects are preserved in permanent
    storage, so the replica will eventually be able
    to commit)
  • W4: commit or abort msg
  • W4: ack to commit/abort
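The coordinator side of the two-phase commit outlined above can be sketched as follows. The names are illustrative, and each replica object is assumed to offer `prepare() -> bool`, `commit()`, and `abort()`; failure handling (timeouts, blocking) is discussed on the next slide and omitted here.

```python
def two_phase_commit(replicas):
    """Sketch of 2PC as run by the primary (slide 15).
    Phase 1 (W3): send prepare and collect votes.
    Phase 2 (W4): commit only if every vote was positive,
    otherwise abort everywhere."""
    votes = [r.prepare() for r in replicas]   # W3: prepare / ack
    if all(votes):
        for r in replicas:
            r.commit()                        # W4: commit
        return "committed"
    for r in replicas:
        r.abort()                             # W4: abort
    return "aborted"
```

A single negative (or missing) vote in phase 1 is enough to abort the whole update, which is exactly the NAck case asked about above.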

16
2PC Protocol in the Presence of Failures
  • If the ack to the prepare msg times out, the
    primary can send abort to the replicas and safely
    abort itself
  • If a replica waiting for commit or abort times
    out,
  • If its ack to prepare was negative, it simply
    aborts itself
  • If its ack to prepare was positive, it can neither
    commit nor abort: it blocks, waiting for
    primary or network recovery
  • How to handle crash/reboot, particularly primary
    failure?
  • Cannot back out of a commit once it is decided
  • Semantics of failure: store → commit; cannot
    commit before store
  • Recovery protocol w/ non-volatile memory
  • 2PC causes a long waiting time if the primary fails
    after the prepare msg is sent out
  • Three-phase commit protocol: Pre-Prepare,
    Prepare, Commit
  • A replica that times out waiting for the Commit msg
    will commit the trans
  • 2PC: execute the transaction when everyone is
    willing to commit
  • 3PC: execute the transaction when everyone knows it
    will commit

17
Recovery from Primary Failure
  • Need to pick a new primary, defining a new
    view. It could be set by a human operator OR
    autonomically
  • Suppose the lowest-numbered live server is the
    primary
  • Replicas need to ping each other
  • Lost or delayed ping msgs may lead to more than
    one primary
  • Paxos protocol for fault-tolerant consensus
  • At most a single value is chosen
  • Agreement is reached despite lost msgs and crashed
    nodes
  • The Paxos protocol eventually succeeds if a majority
    of replicas are reachable
  • See Lamport '98 (submitted to TOCS in '90) and
    Chandra-Toueg '96 for details

18
Handling Byzantine Failure
  • Byzantine Failure
  • Failed replicas are not necessarily fail-stop
  • Failed replicas may generate arbitrary results!!

19
The Byzantine Generals Problem
  • Leslie Lamport, Robert Shostak, and Marshall
    Pease in 1982

20
Byzantine Generals Problem
  • N divisions of the Byzantine army surround a city
  • Each division is commanded by a general
  • Some of the N generals are traitors
  • Generals communicate via messages
  • Traitors can send different values to different
    generals
  • Requirements
  • All loyal generals decide upon the same plan of
    action
  • A small number of traitors cannot cause loyal
    generals to adopt a bad plan
  • NOT required to identify the traitors

21
Restricted BGP
  • Restate the problem as
  • 1 commanding general
  • N-1 lieutenants
  • Interactive consistency requirements
  • IC1: All loyal lieutenants obey the same order
  • IC2: If the commander is loyal, every loyal
    lieutenant obeys his/her order
  • If we can solve this problem
  • The original BGP problem reduces to N instances of
    this problem: one instance per general acting as
    commander

22
3-General Impossibility Result
  • Assume 2 loyal generals and 1 traitor (shaded)
  • Two messages: ATTACK or RETREAT
  • If Lt. 1 sees "Attack" from the commander but
    "He said Retreat" from Lt. 2, what should it do?
  • If Lt. 2 is the traitor (Fig. 1), Lt. 1 must attack
    to satisfy IC2
  • If the Commander is the traitor (Fig. 2), Lt. 1 and
    Lt. 2 must make the same decision (always obeying
    the commander's order over a lieutenant's violates
    IC1)

23
General Impossibility Result
  • In general, no solution exists with fewer than 3m+1
    generals if there are m traitors
  • Proof by contradiction
  • Assume there is a solution for 3m Albanian generals,
    including m traitors
  • Let each Byzantine general simulate m Albanian
    generals
  • The problem is then reduced to the 3-general problem

24
Solution Example
  • With one faulty process (f = 1), N = 4
  • 1st round: the commander sends a value to each Lt.
  • 2nd round: each Lt. copies the value to all other
    Lts.
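A lieutenant's decision in this N=4, f=1 example can be sketched as a majority vote over the value received from the commander in round 1 and the copies relayed by the other lieutenants in round 2 (the function name is illustrative, and the full recursive algorithm for larger f is not shown):

```python
from collections import Counter

def om1_decide(commander_value, relayed):
    """Sketch of one lieutenant's decision (slide 24): take the
    majority of the commander's value and the values relayed by
    the other lieutenants."""
    votes = [commander_value] + list(relayed)
    return Counter(votes).most_common(1)[0][0]
```

With a loyal commander, a single traitorous lieutenant is outvoted (IC2 holds); with a traitorous commander, all loyal lieutenants vote over the same multiset of values and therefore agree (IC1 holds).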

25
Practical Byzantine Fault Tolerance
  • Miguel Castro and Barbara Liskov
  • OSDI99

26
Assumptions
  • Asynchronous distributed systems
  • Faulty nodes may behave arbitrarily
  • Due to malicious attacks or software errors
  • Independent node failures
  • The network may fail to deliver msgs, delay them,
    duplicate them, or deliver them out of order
  • Any network fault will eventually be repaired
  • An adversary may coordinate faulty nodes, delay
    comm, or delay correct nodes in order to cause
    the most damage to the service. BUT it cannot
    delay correct nodes indefinitely, nor subvert the
    cryptographic techniques
  • E.g. it cannot forge a valid signature of a
    non-faulty node
  • E.g. it cannot find two msgs with the same digest

27
Objectives
  • To be used for the implementation of any
    deterministic replicated service with a state and
    some operations
  • Clients issue requests and block waiting for a
    reply
  • Safety: holds if no more than ⌊(n-1)/3⌋ replicas are
    faulty (i.e. to tolerate f faulty nodes, at least
    n = 3f+1 replicas are needed)
  • Safety: the replicated service satisfies
    linearizability
  • Behaves like a centralized implementation that
    executes ops atomically, one at a time
  • Why is 3f+1 the optimal resiliency?
  • Liveness: clients eventually receive replies to
    their requests, provided
  • At most ⌊(n-1)/3⌋ replicas are faulty
  • Comm delay is bounded with unknown bounds; delay
    is the latency from the time of first sending to
    the time of receipt by the destination

28
Algorithm in a nutshell
[Diagram: the client sends a request to the primary; the primary
multicasts it to the backups; the client accepts the result once
f+1 replies match (OK)]
29
Replicas and Views
  • Set of replicas R, with |R| ≥ 3f + 1
[Diagram: the same replicas R0, R1, R2, ..., R|R|-1 shown in View 0
and View 1, with a different primary in each view]
  • For view v, the primary p is assigned such that
    p = v mod |R|
30
Normal Case Operation
  • The client sends <REQUEST, o, t, c> to the primary
  • o: operation; t: timestamp; c: client
  • Timestamps are totally ordered such that later
    requests have higher timestamps than earlier ones

31
Normal Case Operation
  • The state of each replica is stored in a message log
  • When primary p receives a client request m, it
    starts a three-phase protocol to atomically
    multicast the request to the replicas
  • Pre-prepare, Prepare, Commit
  • The Pre-Prepare and Prepare phases ensure total
    ordering of requests sent in the same view, even
    when the primary is faulty
  • The Prepare and Commit phases ensure committed
    requests are totally ordered across views

32
Pre-Prepare Phase
  • The primary multicasts <<PRE-PREPARE, v, n, d>, m>
    to the backups
  • v: view number
  • n: sequence number
  • m: message
  • d: digest of the message

33
Prepare Phase
  • If replica i accepts the PRE-PREPARE message, it
    enters the prepare phase by multicasting
  • <PREPARE, v, n, d, i>
  • to all other replicas, and adds both messages to
    its log
  • Otherwise it does nothing
  • A replica accepts the PRE-PREPARE message
    provided
  • The signatures are valid and the digest d matches m
  • It is in view v
  • It has not accepted a PRE-PREPARE for the same v
    and n with a different digest
  • The sequence number is within accepted bounds
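The acceptance conditions above can be sketched as a single predicate. This is an illustration, not the PBFT implementation: `replica_state` is an assumed dict holding the current view, a log of accepted (view, seq) pairs, and the sequence-number window, and the signature check is omitted.

```python
def accept_preprepare(replica_state, v, n, d, m, digest_fn):
    """Sketch of the PRE-PREPARE acceptance check (slide 33).
    Signature verification is assumed to have already passed."""
    low, high = replica_state["window"]
    return (digest_fn(m) == d                       # digest matches m
            and v == replica_state["view"]          # replica is in view v
            and (v, n) not in replica_state["log"]  # no conflicting pre-prepare
            and low <= n <= high)                   # n within accepted bounds
```

The window check prevents a faulty primary from exhausting the sequence-number space, and the `(v, n)` log check is what forces a single ordering per view.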

34
Commit Phase
  • When replica i receives 2f PREPARE msgs that match
    the pre-prepare, the replica enters the Commit
    Phase by multicasting
  • <COMMIT, v, n, d, i> to the other replicas
  • Replica i executes the requested operation after it
    has accepted 2f+1 matching commit msgs from
    different replicas
  • Replica i's state then reflects the sequential
    execution of all requests with lower sequence
    numbers. This ensures all non-faulty replicas
    execute requests in the same order
  • To guarantee exactly-once semantics, replicas
    discard requests whose timestamp is lower than
    the timestamp in the last reply they sent to the
    client

35
Normal Operation: Reply
  • All replicas send the reply <REPLY, v, t, c, i, r>
    directly to the client
  • v: current view number
  • t: timestamp of the corresponding request
  • i: replica index
  • r: execution result
  • The client waits for f+1 replies with valid
    signatures from different replicas, with the same
    t and r, before accepting the result r
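The client-side check can be sketched as below (the reply format is simplified to `(t, c, i, r)` tuples, and signature verification is assumed done). With at most f faulty replicas, any set of f+1 matching replies from different replicas includes at least one correct replica, so the result can be trusted.

```python
from collections import Counter

def accept_result(replies, f):
    """Sketch of the client's reply quorum (slide 35): accept r
    once f+1 replies from *different* replicas agree on (t, r);
    return None if no such quorum exists yet."""
    # keep only one vote per replica index i
    by_replica = {i: (t, r) for (t, c, i, r) in replies}
    counts = Counter(by_replica.values())
    for (t, r), n in counts.items():
        if n >= f + 1:
            return r
    return None   # keep waiting for more replies
```

Note that only f+1 (not 2f+1) matches are needed here, because the replicas have already agreed on the order via the commit quorum.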

36
Normal Case Operation Summary
[Diagram: client C and replicas 0 (primary), 1, 2, and faulty
replica 3 (marked X); message flow through the Request,
Pre-prepare, Prepare, Commit, and Reply phases, with replica 3
sending nothing]
37
Safeguards
  • If the client does not receive replies soon
    enough, it broadcasts the request to all replicas
  • If the request has already been processed, the
    replicas simply re-send the reply
  • If the replica is not the primary, it relays the
    request to the primary
  • If the primary does not multicast the request to
    the group, it will eventually be suspected to be
    faulty by enough replicas to cause a view change

38
View Changes
  • A timer is set when a request is received,
    recording the waiting time for the request to
    execute
  • If the timer of a replica expires in view v, the
    replica starts a view change to move to view v+1
    by
  • Stopping accepting pre-prepare/prepare/commit
    messages
  • Multicasting a VIEW-CHANGE message
  • <VIEW-CHANGE, v+1, n, C, P, i>
  • n: seq of the last stable checkpoint s known to i
  • C: 2f+1 checkpoint msgs proving the correctness
    of s
  • P: {Pm} for each message m prepared with seq > n;
    Pm contains the pre-prepare and 2f matching
    prepare msgs

39
New Primary
  • When the primary p of view v+1 receives 2f valid
    VIEW-CHANGE messages
  • It multicasts a <NEW-VIEW, v+1, V, O> message to
    all other replicas, where
  • V: the set of 2f valid VIEW-CHANGE messages
  • O: the set of reissued PRE-PREPARE messages
  • It moves to view v+1
  • A replica that accepts NEW-VIEW
  • Sends PREPARE messages for every pre-prepare in
    set O
  • Moves to view v+1

40
References
  • See OSDI '99 for
  • Optimization, implementation, and evaluation
  • 3% overhead on an NFS daemon
  • See the tech report for
  • A formal presentation of the algorithm in the I/O
    automaton model
  • Proofs of safety and liveness
  • See OSDI '00 for
  • Proactive recovery

41
Further Readings
  • Synchronous systems assume msg exchanges take
    place in rounds and processes can detect the
    absence of a msg through a timeout
  • At least f+1 rounds of msgs are needed
    (Fischer-Lynch, '82)
  • In async systems with unbounded delay, a crashed
    process becomes indistinguishable from a slow
    one
  • Impossibility: no algorithm can guarantee to
    reach consensus in such systems, even with one
    process crash failure (Fischer-Lynch-Paterson,
    J. ACM '85)
  • Approaches to working around the impossibility
  • In partially async systems with bounded but
    unknown delay
  • Practical Byzantine Fault Tolerant Alg.
    (Castro-Liskov '99)
  • Using failure detectors: unresponsive processes are
    treated as failed and their subsequent msgs are
    discarded
  • Consensus can be reached, even with an unreliable
    failure detector, if fewer than N/2 processes
    crash and comm is reliable (Chandra-Toueg '96)
  • Statistical consensus: "no guarantee" doesn't
    mean "cannot"
  • Introduce an element of chance in processes'
    behaviors so that the adversary cannot exercise
    its thwarting strategy effectively