1
Design of High Availability Systems and Networks
Replication
Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering
and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer@crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2
Outline
  • Replication
  • Example systems
  • Example algorithms for replicating multithreaded
    applications
  • Self-checking processes: the ARMOR approach

3
Replication: Basic Idea
  • Use space redundancy to achieve FT
  • Two basic techniques:
  • Active replication (fault masking)
  • Passive replication (standby spare)
  • Use multiple instances of a component so that
    independent failures can be assumed

4
Active Replication
  • Replicas simultaneously perform the same work
  • Replicas start from same initial state
  • Replicas get the same ordered set of inputs
  • In the absence of faults, replicas produce the
    same outputs (assumption!)
  • Voter produces a single response by majority
    voting on the replicas' outputs
  • Masks potential errors/failures
  • Voter is a single point of failure!
  • Active replication schemes in SW:
  • Pass-First: each replica executes client requests
    independently and sends its reply to the leader,
    which forwards the first received reply to the
    client
  • Leader Only (semi-active): only the leader sends
    replies to the client (other replicas suppress
    their replies unless the leader fails)
  • Majority Voting: all replica replies are voted
    on, and the client is delivered the majority
    result (see the voting sketch below)
  • Require replica determinism

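As an illustration only (a minimal sketch with an assumed fixed-size reply format and three replicas, not code from the lecture), bytewise majority voting over replica replies can look like this:

    #include <stdio.h>
    #include <string.h>

    #define N_REPLICAS 3          /* assumed 3-way active replication  */
    #define REPLY_SIZE 64         /* illustrative fixed-size reply     */

    /* Return the index of a reply that a majority of replicas agree
     * with, or -1 if no majority exists (voter detects the failure). */
    static int majority_vote(unsigned char replies[][REPLY_SIZE], int n)
    {
        for (int i = 0; i < n; i++) {
            int votes = 0;
            for (int j = 0; j < n; j++)
                if (memcmp(replies[i], replies[j], REPLY_SIZE) == 0)
                    votes++;
            if (votes > n / 2)
                return i;         /* majority found: minority fault masked  */
        }
        return -1;                /* no majority: disagreement not maskable */
    }

    int main(void)
    {
        unsigned char replies[N_REPLICAS][REPLY_SIZE] = { { 0 } };
        replies[1][0] = 0xFF;     /* simulate one faulty replica output */

        int winner = majority_vote(replies, N_REPLICAS);
        if (winner >= 0)
            printf("deliver reply of replica %d to the client\n", winner);
        else
            printf("no majority: escalate the failure\n");
        return 0;
    }

Note that this only masks value faults; the voter itself remains a single point of failure unless it is replicated as discussed later.
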
5
Passive Replication
  • Only one component (primary) does the computation
  • Spares (backups) ready to switch over when a
    fault is detected in primary
  • Spares need to access primary state
  • Replicas must be both observable and controllable
  • Need to have a detection mechanism
  • Passive replication schemes in SW:
  • State Cast: the primary multicasts its state
  • Stable Storage: the primary saves its state to
    system-accessible stable storage (sketched below)
  • State can be sent/saved:
  • After processing each client request
  • Periodically (requires replica determinism)

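A minimal sketch of the Stable Storage scheme's primary side, assuming a tiny illustrative state record and checkpoint file name (neither is from the lecture):

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative application state; a real replica would also capture
     * infrastructure state (e.g., pending requests). */
    struct app_state {
        long last_request_id;      /* last client request fully processed */
        long balance;              /* example application data            */
    };

    /* The primary checkpoints its state after each processed request (or
     * periodically, which additionally requires replica determinism). */
    static int save_checkpoint(const struct app_state *s, const char *path)
    {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        size_t ok = fwrite(s, sizeof *s, 1, f);
        fclose(f);
        return ok == 1 ? 0 : -1;
    }

    /* A backup restores the last checkpoint when it is promoted after the
     * detection mechanism declares the primary failed. */
    static int load_checkpoint(struct app_state *s, const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        size_t ok = fread(s, sizeof *s, 1, f);
        fclose(f);
        return ok == 1 ? 0 : -1;
    }

    int main(void)
    {
        struct app_state s = { .last_request_id = 42, .balance = 1000 };
        if (save_checkpoint(&s, "primary.ckpt") != 0) return EXIT_FAILURE;

        struct app_state recovered;
        if (load_checkpoint(&recovered, "primary.ckpt") != 0) return EXIT_FAILURE;
        printf("backup resumes after request %ld\n", recovered.last_request_id);
        return 0;
    }
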
6
Replication Issues
  • Replicas must see the same ordered set of inputs
  • Non-determinism in replica execution may cause
    replicas to diverge
  • Voting
  • Re-integration
  • Cost/Performance and Complexity

7
Replication Issues
  • Replicas must see the same ordered set of inputs
  • In HW: use additional wires
  • e.g., Tandem Integrity S2
  • In SW: use the network
  • Group communication protocols
  • Provide reliable broadcast/unicast, total order
    broadcast (atomic broadcast), group membership
    service, and virtual synchrony
  • Fault models: crashes, Byzantine with signatures,
    network partitions
  • Examples: ISIS/HORUS/Ensemble, Totem, Transis,
    SecureRing, Rampart (a toy total-order sketch
    follows)

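The protocols above differ in their details; purely as a hedged illustration of total order broadcast, the toy in-memory model below uses a fixed sequencer that stamps messages with consecutive numbers, and each replica delivers strictly in stamp order (all names and the single-process setup are assumptions of the sketch, not any of the systems listed):

    #include <stdio.h>

    #define MAX_MSGS 16

    struct msg { int seq; const char *payload; };

    static int next_seq = 0;                  /* sequencer state */
    static struct msg sequence_msg(const char *payload)
    {
        struct msg m = { next_seq++, payload };
        return m;
    }

    struct replica {
        int next_to_deliver;                  /* next expected stamp        */
        const char *pending[MAX_MSGS];        /* out-of-order buffer by seq */
    };

    static void receive(struct replica *r, struct msg m, int id)
    {
        r->pending[m.seq] = m.payload;
        /* Deliver in stamp order, never skipping a gap. */
        while (r->next_to_deliver < MAX_MSGS && r->pending[r->next_to_deliver]) {
            printf("replica %d delivers #%d: %s\n",
                   id, r->next_to_deliver, r->pending[r->next_to_deliver]);
            r->next_to_deliver++;
        }
    }

    int main(void)
    {
        struct replica r1 = { 0, { 0 } }, r2 = { 0, { 0 } };
        struct msg a = sequence_msg("deposit"), b = sequence_msg("withdraw");

        receive(&r1, a, 1); receive(&r1, b, 1);   /* arrives in order          */
        receive(&r2, b, 2); receive(&r2, a, 2);   /* reordered, same delivery  */
        return 0;
    }

Both replicas print the same delivery order even though the messages reach them in different orders, which is exactly the property the replicas need.
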
8
Replication Issues
  • Non-determinism in replica execution may cause
    replicas to diverge
  • In HW: execute the same instruction stream
  • Tandem TMR synchronizes CPUs on (1) global
    memory references, (2) external interrupts, and
    (3) periodically; for (2) and (3) it uses a
    cycle counter
  • In SW:
  • State machine approach: avoid non-determinism
    (e.g., multithreading, local timers/random number
    generators, etc.)
  • Force the same instruction streams (instruction
    counter)
  • Use non-preemptive deterministic scheduler

9
Non-preemptive deterministic scheduler
  • Eternal (CORBA)
  • Intercepts system calls, replacing the C standard
    library
  • Only one logical thread can execute at a time (an
    RMI blocks the replica process)
  • Transactional Drago (Ada)
  • Scheduler embedded in the run-time support
  • Multiple logical threads can proceed concurrently
  • Scheduling points: service requests, selective
    receptions, lock requests, server calls, and end
    of execution
  • Uses an internal queue and an external queue
  • Reads on the external queue must be blocking!
  • Non-preemptive → only one physical thread can be
    running at a time (no CPU/I/O overlap, no SMP
    parallelism); see the sketch below

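As a hedged illustration of the non-preemptive idea (not the Eternal or Transactional Drago implementation), the sketch below lets exactly one logical thread run at a time and hands the turn over in a fixed round-robin order at explicit scheduling points, so every replica reproduces the same interleaving:

    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 3
    #define N_STEPS   2

    /* Exactly one logical thread holds the "turn" at any time; the turn is
     * passed in a fixed round-robin order at explicit scheduling points. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  turn_changed = PTHREAD_COND_INITIALIZER;
    static int turn = 0;                 /* id of the thread allowed to run */

    static void wait_for_turn(int id)
    {
        pthread_mutex_lock(&lock);
        while (turn != id)
            pthread_cond_wait(&turn_changed, &lock);
        pthread_mutex_unlock(&lock);
    }

    /* Scheduling point: yield deterministically to the next logical thread. */
    static void yield_turn(int id)
    {
        pthread_mutex_lock(&lock);
        turn = (id + 1) % N_THREADS;
        pthread_cond_broadcast(&turn_changed);
        pthread_mutex_unlock(&lock);
    }

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;
        for (int step = 0; step < N_STEPS; step++) {
            wait_for_turn(id);
            printf("logical thread %d runs step %d\n", id, step);
            yield_turn(id);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[N_THREADS];
        for (long i = 0; i < N_THREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < N_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

A real non-preemptive scheduler would pass the turn at the scheduling points listed above (service requests, lock requests, etc.) rather than round-robin; the point is only that the handoff decision uses no physical-time information, so it is identical in every replica.
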
10
Replication Issues
  • Voting
  • Voter is single point of failure
  • In SW, one can replicate both clients and servers,
    embedding a voter in each client/server
  • Fundamental assumption:
  • Values differ → a replica failed
  • In SW, outputs can contain chunks with
    replica/node-dependent information

11
Replication Issues
  • Re-integration
  • Hardware
  • Power off the failed unit
  • Power on a spare unit (automatically or manually)
  • Set the internal state of the new spare equal to
    that of the running units
  • Synchronize units and restart computation
  • e.g., CPU re-integration in Integrity S2
  • Software
  • Need to get/set internal state (checkpoint)
  • Replicas must be quiescent: no operation that can
    change the state may be in progress (see the
    sketch below)
  • Replica state can be divided into: application
    state, infrastructure state, OS state
  • The application must minimize dependencies on OS
    resources when a checkpoint is taken (e.g., close
    open files)

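A minimal sketch of reaching quiescence before capturing state, assuming an illustrative in-flight operation counter (the counter, hook names, and capture function are assumptions, not from the lecture):

    #include <pthread.h>
    #include <stdio.h>

    /* Count in-flight state-changing operations and hold new ones back while
     * a checkpoint is taken, so state is only captured when quiescent. */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static int in_flight = 0;        /* operations currently mutating state   */
    static int checkpointing = 0;    /* nonzero while state is being captured */

    static void op_begin(void)
    {
        pthread_mutex_lock(&m);
        while (checkpointing)               /* hold new operations back */
            pthread_cond_wait(&cv, &m);
        in_flight++;
        pthread_mutex_unlock(&m);
    }

    static void op_end(void)
    {
        pthread_mutex_lock(&m);
        if (--in_flight == 0)
            pthread_cond_broadcast(&cv);    /* replica may now be quiescent */
        pthread_mutex_unlock(&m);
    }

    /* Capture application/infrastructure state only once quiescent. */
    static void take_checkpoint(void (*capture)(void))
    {
        pthread_mutex_lock(&m);
        checkpointing = 1;
        while (in_flight > 0)               /* drain in-progress operations */
            pthread_cond_wait(&cv, &m);
        capture();
        checkpointing = 0;
        pthread_cond_broadcast(&cv);        /* let blocked operations resume */
        pthread_mutex_unlock(&m);
    }

    static void capture_state(void)
    {
        printf("state captured for re-integration\n");
    }

    int main(void)
    {
        op_begin();                         /* a state-changing operation... */
        op_end();                           /* ...completes                  */
        take_checkpoint(capture_state);
        return 0;
    }
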
12
Replication Issues
  • Cost/Performance and Complexity
  • HW (e.g., TANDEM)
  • Between HW and OS (e.g., Hypervisor)
  • OS (e.g., TARGON/32)
  • Between OS and User App (e.g., TFT)
  • User App (e.g., FT-CORBA)
  • Overhead can range from 20% to 1,000%

13
Examples of Replicated Systems
  • Self-Test And Repair (STAR), JPL
  • Prototype for a real-time satellite-control
    computer
  • Techniques: error-detection coding, on-line
    monitoring, standby redundancy, replication with
    voting, component redundancy, and re-execution
  • The hard-core monitor (TARP) is triplicated
  • Electronic Switching Systems (ESS), Bell Labs
  • Target: 3 minutes of downtime per year
    (R = 0.999994)
  • Technique: redundant components and a
    self-checking duplicated processor
  • Tolerates all single failures
  • Delta-4
  • Network attachment controllers (NACs) run an
    atomic multicast protocol and enforce fail-silence
  • Passive, semi-active, and active replication
  • CORBA-based Replicated Systems
  • Fault tolerance integrated in the ORB (Orbix+Isis,
    Maestro, Electra)
  • Efficient, but a non-standard ORB
  • Modifies the transport-level mapping in the ORB
  • Fault tolerance as services above the ORB
  • Inefficient; FT is explicit to the user
  • Intercept IIOP messages and send them through
    reliable broadcast protocol (Eternal, AQuA)

14
Summary
  • Replication is expensive in terms of
    cost/complexity and overhead
  • Fault models: crash, Byzantine, network partitions
  • HW/SW approaches: similar issues, different
    solutions
  • Replicas must see the same ordered set of inputs
  • Non-determinism in replica execution may cause
    replicas to diverge
  • Voting
  • Re-integration

15
A Preemptive Deterministic Scheduling Algorithm
for Multithreaded Replicas
16
Motivations
  • Replication offers a high degree of data integrity
    and system availability
  • Replicated systems suffer from large overhead
  • Multithreading can help reduce the overhead but
    it results in nondeterminism in replica behavior
  • Not relevant from the application standpoint, but
    it must be removed to support fault masking
  • Most replication approaches do not support
    multithreading
  • Those that do (e.g., Eternal, Transactional
    Drago) limit concurrency to at most one executing
    thread at a time

17
Proposed Approach
  • Synchronize replicas only on shared state updates
  • Intercept lock/unlock requests (see the wrapper
    sketch below)
  • Enforce equivalent thread interleaving across
    replicas
  • Allow multiple physical threads to run at the
    same time
  • MT features that we rely on:
  • Updates to shared state are protected via mutexes
  • Different mutexes protect different shared
    variables
  • Requirements:
  • No other form of nondeterminism is present in the
    application (e.g., local timers or local random
    number gen.)
  • Can replay the application by enforcing the same
    (causal) order of mutex acquisitions

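A minimal sketch of the interception point, with illustrative wrapper names and no-op hooks standing in for the places where a deterministic scheduler would suspend or wake threads (these names are assumptions of the sketch):

    #include <pthread.h>
    #include <stdio.h>

    /* Placeholder hooks: a real deterministic scheduler would suspend the
     * thread here until it may acquire the mutex, and bookkeep releases so
     * that waiters can be granted deterministically. */
    static void enforce_deterministic_order(pthread_t self, pthread_mutex_t *mu)
    {
        (void)self; (void)mu;               /* no-op stand-in */
    }
    static void record_release(pthread_t self, pthread_mutex_t *mu)
    {
        (void)self; (void)mu;               /* no-op stand-in */
    }

    /* The application calls these wrappers instead of the pthread functions
     * (or the pthread calls are interposed at link/load time), so every
     * shared-state update passes through the replication layer. */
    int replicated_mutex_lock(pthread_mutex_t *mu)
    {
        enforce_deterministic_order(pthread_self(), mu); /* may block thread */
        return pthread_mutex_lock(mu);
    }

    int replicated_mutex_unlock(pthread_mutex_t *mu)
    {
        int rc = pthread_mutex_unlock(mu);
        record_release(pthread_self(), mu);              /* may wake a waiter */
        return rc;
    }

    int main(void)
    {
        pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
        replicated_mutex_lock(&mu);
        puts("critical section under an intercepted lock");
        replicated_mutex_unlock(&mu);
        return 0;
    }
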
18
Formal Definitions
  • Correct Application Assumption
  • Each thread releases only mutexes it owns
  • A thread executing infinitely often:
  • Eventually releases each mutex it acquires
  • Requests mutexes infinitely often
  • The mutex dependency graph is acyclic (no
    deadlock)
  • Correctness Property
  • Internal Correctness
  • (Mutual Exclusion) At most one thread holds a
    given mutex
  • (No Lockout) A mutex acquisition request is
    eventually served
  • External Correctness
  • (Safety) Two replicas impose the same causal
    order of mutex acquisitions
  • (Liveness) Any mutex acquisition in one replica
    is eventually performed by the other replica

19
A Solution: Preemptive Deterministic Scheduler
(PDS)
  • No inter-replica communication
  • Basic Idea of PDS
  • Assume a total order < on thread ids
  • The < relation is the same for all replicas
  • Replica execution is broken into a sequence of
    rounds
  • In a round a thread can acquire at most:
  • 1 mutex (PDS-1)
  • 2 mutexes (PDS-2)
  • (the PDS-1 grant rule is sketched below)

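As a hedged, much-simplified model of the PDS-1 decision rule only (releases, threads still executing, and PDS-2's two-step rounds are omitted), the sketch below grants each contested mutex to its lowest-id requester, so every replica makes the same choice without any inter-replica communication; the request pattern is illustrative:

    #include <stdio.h>

    #define N_THREADS 4
    #define N_MUTEXES 3

    int main(void)
    {
        /* requested[t] = mutex that thread t has announced as its next
         * acquisition in this round (an illustrative pattern: t1 and t2
         * both want m1, t3 and t4 both want m2). */
        int requested[N_THREADS] = { 0, 0, 1, 1 };

        int winner[N_MUTEXES];
        for (int m = 0; m < N_MUTEXES; m++)
            winner[m] = -1;

        /* Scan threads in the shared id order: the lowest-id requester of
         * each mutex wins it this round; the others stay suspended until a
         * later round.  Every replica runs the same scan over the same
         * order, so all replicas grant the same mutexes to the same threads. */
        for (int t = 0; t < N_THREADS; t++) {
            int m = requested[t];
            if (winner[m] < 0)
                winner[m] = t;
        }

        for (int t = 0; t < N_THREADS; t++) {
            if (winner[requested[t]] == t)
                printf("round N: thread t%d acquires m%d\n",
                       t + 1, requested[t] + 1);
            else
                printf("round N: thread t%d stays suspended\n", t + 1);
        }
        return 0;
    }

Determinism comes entirely from data every replica already shares (the fixed order on thread ids and the per-round requests), which is why no messages are needed.
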
20
Basic PDS Algorithm Example
  • Threads t1, t2, and t3 have requested their next
    mutex acquisition and have been suspended
  • Thread t4 is still executing

[Figure: round N timeline — t1 and t2 wait on m1, t3 waits on m2, t4 still executing]
21
Basic PDS Algorithm Example
  • Round N fires

[Figure: round N — t1 acquires m1, t3 acquires m2; t2 remains suspended]
22
Basic PDS Algorithm Example
  • Threads t1 and t3 execute concurrently
  • t1 releases m1 → t2 acquires m1

[Figure: round N thread timeline]
23
Basic PDS Algorithm Example
  • Threads t1, t2, and t3 execute concurrently

[Figure: round N thread timeline]
24
Basic PDS Algorithm Example
  • t3 releases m2 → t4 acquires m2

[Figure: round N thread timeline]
25
Basic PDS Algorithm Example
  • All threads execute concurrently

[Figure: round N thread timeline]
26
Basic PDS Algorithm Example
[Figure: round N thread timeline]
27
Basic PDS Algorithm Example
[Figure: round N thread timeline — m3 is now requested]
28
Basic PDS Algorithm Example
  • t1 acquires m3

[Figure: round N thread timeline]
29
Basic PDS Algorithm Example
  • Thread t1 executes

[Figure: rounds N and N+1 thread timeline]
30
Basic PDS Algorithm Example
  • t1 releases m3 → t2 acquires m3

[Figure: rounds N and N+1 thread timeline]
31
Basic PDS Algorithm Example
  • Threads t1 and t2 execute concurrently

[Figure: rounds N and N+1 thread timeline]
32
PDS Algorithm
  • Extend PDS-1 by allowing each thread to acquire
    up to 2 mutexes per round
  • Divide a round into two steps
  • NOTE: threads with lower ids always win ties
    against threads with higher ids
  • Avoid suspending a thread whenever it would
    acquire a mutex first, regardless of other
    threads' (future) mutex requests

33
Performance/Dependability Experimental Evaluation
  • The PDS algorithm was formally specified and
    proved correct
  • Experimental study of Performance/Dependability
    tradeoffs involved in selecting different thread
    scheduling algorithms
  • Studied algorithms
  • Loose Synchronization Algorithm (LSA)
  • Proposed in Basile et al., SRDS 2002
  • Preemptive Deterministic Scheduler (PDS)
  • Non-Preemptive Deterministic Scheduler (NPDS)
  • Based on MTRDS algorithm used in Transactional
    Drago

34
Experimental Setup
  • Triplicated server running a synthetic MT
    benchmark
  • 10 worker threads serve requests from 15 clients
  • Serving a client request involves executing a
    sequence of:
  • mutex acquisitions (modeling accesses to shared
    data)
  • I/O activities (modeling accesses to persistent
    storage); a toy worker in this style is sketched
    below
  • Majority voting
  • Replicas, voter/fanout, and clients run on
    different machines

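A hedged sketch of what one benchmark worker might look like; the shared-data layout, loop counts, and the sleep standing in for I/O are assumptions for illustration, not the actual benchmark used in the study:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define N_WORKERS  10      /* worker threads in the replicated server */
    #define N_REQUESTS 15      /* requests served per worker in this toy  */
    #define N_SHARED    4      /* shared variables, one mutex each        */

    static pthread_mutex_t locks[N_SHARED];
    static long shared_data[N_SHARED];        /* models shared server state */

    /* Serving one request = a mutex-protected update of shared data followed
     * by an I/O phase (modeled here by a short sleep). */
    static void serve_request(int worker, int req)
    {
        int idx = (worker + req) % N_SHARED;
        pthread_mutex_lock(&locks[idx]);
        shared_data[idx]++;                   /* access to shared data     */
        pthread_mutex_unlock(&locks[idx]);
        usleep(1000);                         /* stands in for storage I/O */
    }

    static void *worker_main(void *arg)
    {
        int id = (int)(long)arg;
        for (int req = 0; req < N_REQUESTS; req++)
            serve_request(id, req);
        return NULL;
    }

    int main(void)
    {
        pthread_t w[N_WORKERS];
        for (int i = 0; i < N_SHARED; i++)
            pthread_mutex_init(&locks[i], NULL);
        for (long i = 0; i < N_WORKERS; i++)
            pthread_create(&w[i], NULL, worker_main, (void *)i);
        for (int i = 0; i < N_WORKERS; i++)
            pthread_join(w[i], NULL);

        long total = 0;
        for (int i = 0; i < N_SHARED; i++)
            total += shared_data[i];
        printf("served %ld request steps\n", total);
        return 0;
    }
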
35
Performance Evaluation
  • Major Results
  • LSA provides 5 times more throughput than NPDS
  • PDS-2 provides 2 times more throughput than NPDS
  • LSA incurs 40-60% performance overhead w.r.t. the
    non-replicated benchmark (baseline)

36
Dependability Evaluation Approach
  • Use software-based error injection
  • Dependability Measures
  • Number of catastrophic failures
  • Must be minimized in highly available systems
  • Ratio between manifested and injected errors
  • Provides Pr{failure | error}. Given an error
    arrival rate, one can derive system availability
    (see the example below).
  • Ratio between activated and manifested errors
  • Provides a closer look into the error sensitivity
    of the replica code.

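As a worked illustration of how the measured Pr{failure | error} feeds an availability estimate, with made-up numbers and the steady-state relation A = MTTF / (MTTF + MTTR) as additional assumptions of this sketch (not results from the study):

    #include <stdio.h>

    int main(void)
    {
        double errors_per_hour  = 0.01;   /* assumed error arrival rate         */
        double p_fail_given_err = 0.05;   /* measured manifested/injected ratio */
        double mttr_hours       = 0.2;    /* assumed mean time to recover       */

        double failures_per_hour = errors_per_hour * p_fail_given_err;
        double mttf_hours        = 1.0 / failures_per_hour;
        double availability      = mttf_hours / (mttf_hours + mttr_hours);

        printf("MTTF = %.0f h, availability = %.6f\n", mttf_hours, availability);
        return 0;
    }
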
37
Error Injection Results
  • Uniform injections (text/data/heap)
  • PDS is less sensitive to errors than LSA
  • Some manifested errors result in catastrophic
    failures (0.2% for PDS and 0.6% for LSA)
  • The difference between PDS and LSA is due to
    their different use of the underlying GCS
    (Ensemble)
  • All Ensemble functions used by PDS are used by
    LSA
  • A number of Ensemble functions are used by LSA
    and not by PDS
  • Targeted injections show that errors injected
    into Ensemble generate a significant number of
    catastrophic failures
  • 1-3% of manifested failures

38
Lessons Learned
  • Performance: LSA > PDS > NPDS
  • Dependability:
  • LSA is more sensitive to the underlying GCS's
    fail-silence violations
  • NPDS dependability characteristics are similar to
    PDS
  • Errors in the group communication layer do
    propagate and cause catastrophic failures
  • Using interactive consistency will not suffice
  • A fault tolerance middleware should itself be
    fault-tolerant

39
Multithreaded Apache (2.0.35)
  • Triplicated system with majority voting
  • Pentium III 500 MHz
  • Linux 2.4
  • Ensemble 1.40
  • Uses Apache's worker module
  • 10 server threads.
  • 10 concurrent clients each sending 1000 requests
    to retrieve a 1.5KB HTML page

40
Lessons Learned (Replication)
  • Dedicated hardware solutions such as Tandem
  • achieve high reliability/availability through
    extensive hardware redundancy
  • offer only a fixed level of dependability
    throughout the lifetime of the application
  • require significant investment from the customer
  • Software middleware solutions are a vital
    alternative to hardware-based approaches
  • Provide high reliability/availability services to
    the end user
  • Applications executed in the network system may
    require varying levels of availability.
  • Services must adapt to varying availability
    requirements.
  • Provide efficient error detection and fast
    recovery to minimize system downtime.
  • Ensure minimum error propagation across the
    network by
  • self-checking processes and fail-silent nodes
  • Protect the entities that provide fault tolerance
    services

41
Lessons Learned (Self-Checking Middleware)
  • Progressive fault injections were used to stress
    the error detection and recovery services of the
    SIFT environment.
  • The SIFT environment imposes negligible overhead
    during failure-free execution and < 5% overhead
    during recovery of ARMOR processes.
  • Correlated failures involving application and
    ARMOR processes can impact application
    availability.
  • Successful recovery of many correlated failures
    due to hierarchical error detection and recovery.
  • Targeted heap injections show that internal
    self-checks and micro-checkpointing are useful in
    preventing error propagation.