1
Design of High Availability Systems and Networks
Replication
Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering
and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer@crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2
Outline
  • Replication
  • Example systems
  • Example algorithms for replicating multithreaded
    applications
  • Self-checking processes: the ARMOR approach

3
Replication: Basic Idea
  • Use space redundancy to achieve FT
  • Two basic techniques:
  • Active replication (fault masking)
  • Passive replication (standby spare)
  • Use multiple instances of a component so that
    independent failures can be assumed

4
Active Replication
  • Replicas simultaneously perform the same work
  • Replicas start from same initial state
  • Replicas get the same ordered set of inputs
  • In the absence of faults, replicas produce the
    same outputs (assumption!)
  • Voter produces a single response by majority
    voting on the replicas' outputs
  • Masks potential errors/failures
  • Voter is a single point of failure!
  • Active replication schemes in SW:
  • Pass-First: each replica executes client requests
    independently and sends its reply to the leader,
    which forwards the first received reply to the
    client
  • Leader Only (semi-active): only the leader sends
    replies to the client (other replicas suppress
    their replies unless the leader fails)
  • Majority Voting: all replica replies are voted
    on, and the client is delivered the majority
    result (see the voting sketch below)
  • Require replica determinism

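As an illustration only (a minimal sketch with an assumed fixed-size reply format and three replicas, not code from the lecture), bytewise majority voting over replica replies can look like this:

    #include <stdio.h>
    #include <string.h>

    #define N_REPLICAS 3          /* assumed 3-way active replication  */
    #define REPLY_SIZE 64         /* illustrative fixed-size reply     */

    /* Return the index of a reply that a majority of replicas agree
     * with, or -1 if no majority exists (voter detects the failure). */
    static int majority_vote(unsigned char replies[][REPLY_SIZE], int n)
    {
        for (int i = 0; i < n; i++) {
            int votes = 0;
            for (int j = 0; j < n; j++)
                if (memcmp(replies[i], replies[j], REPLY_SIZE) == 0)
                    votes++;
            if (votes > n / 2)
                return i;         /* majority found: minority fault masked  */
        }
        return -1;                /* no majority: disagreement not maskable */
    }

    int main(void)
    {
        unsigned char replies[N_REPLICAS][REPLY_SIZE] = { { 0 } };
        replies[1][0] = 0xFF;     /* simulate one faulty replica output */

        int winner = majority_vote(replies, N_REPLICAS);
        if (winner >= 0)
            printf("deliver reply of replica %d to the client\n", winner);
        else
            printf("no majority: escalate the failure\n");
        return 0;
    }

Note that this only masks value faults; the voter itself remains a single point of failure unless it is replicated as discussed later.
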
5
Passive Replication
  • Only one component (primary) does the computation
  • Spares (backups) ready to switch over when a
    fault is detected in primary
  • Spares need to access primary state
  • Replicas must be both observable and controllable
  • Need to have a detection mechanism
  • Passive replication schemes in SW:
  • State Cast: the primary multicasts its state
  • Stable Storage: the primary saves its state to
    system-accessible stable storage (sketched below)
  • State can be sent/saved:
  • After processing each client request
  • Periodically (requires replica determinism)

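A minimal sketch of the Stable Storage scheme's primary side, assuming a tiny illustrative state record and checkpoint file name (neither is from the lecture):

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative application state; a real replica would also capture
     * infrastructure state (e.g., pending requests). */
    struct app_state {
        long last_request_id;      /* last client request fully processed */
        long balance;              /* example application data            */
    };

    /* The primary checkpoints its state after each processed request (or
     * periodically, which additionally requires replica determinism). */
    static int save_checkpoint(const struct app_state *s, const char *path)
    {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        size_t ok = fwrite(s, sizeof *s, 1, f);
        fclose(f);
        return ok == 1 ? 0 : -1;
    }

    /* A backup restores the last checkpoint when it is promoted after the
     * detection mechanism declares the primary failed. */
    static int load_checkpoint(struct app_state *s, const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        size_t ok = fread(s, sizeof *s, 1, f);
        fclose(f);
        return ok == 1 ? 0 : -1;
    }

    int main(void)
    {
        struct app_state s = { .last_request_id = 42, .balance = 1000 };
        if (save_checkpoint(&s, "primary.ckpt") != 0) return EXIT_FAILURE;

        struct app_state recovered;
        if (load_checkpoint(&recovered, "primary.ckpt") != 0) return EXIT_FAILURE;
        printf("backup resumes after request %ld\n", recovered.last_request_id);
        return 0;
    }
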
6
Replication Issues
  • Replicas must see the same ordered set of inputs
  • Non-determinism in replica execution may cause
    replicas to diverge
  • Voting
  • Re-integration
  • Cost/Performance and Complexity

7
Replication Issues
  • Replicas must see the same ordered set of inputs
  • In HW: use additional wires
  • e.g., Tandem Integrity S2
  • In SW: use the network
  • Group communication protocols
  • Provide reliable broadcast/unicast, total order
    broadcast (atomic broadcast), group membership
    service, and virtual synchrony
  • Fault models: crashes, Byzantine with signatures,
    network partitions
  • Examples: ISIS/HORUS/Ensemble, Totem, Transis,
    SecureRing, Rampart (a toy total-order sketch
    follows)

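The protocols above differ in their details; purely as a hedged illustration of total order broadcast, the toy in-memory model below uses a fixed sequencer that stamps messages with consecutive numbers, and each replica delivers strictly in stamp order (all names and the single-process setup are assumptions of the sketch, not any of the systems listed):

    #include <stdio.h>

    #define MAX_MSGS 16

    struct msg { int seq; const char *payload; };

    static int next_seq = 0;                  /* sequencer state */
    static struct msg sequence_msg(const char *payload)
    {
        struct msg m = { next_seq++, payload };
        return m;
    }

    struct replica {
        int next_to_deliver;                  /* next expected stamp        */
        const char *pending[MAX_MSGS];        /* out-of-order buffer by seq */
    };

    static void receive(struct replica *r, struct msg m, int id)
    {
        r->pending[m.seq] = m.payload;
        /* Deliver in stamp order, never skipping a gap. */
        while (r->next_to_deliver < MAX_MSGS && r->pending[r->next_to_deliver]) {
            printf("replica %d delivers #%d: %s\n",
                   id, r->next_to_deliver, r->pending[r->next_to_deliver]);
            r->next_to_deliver++;
        }
    }

    int main(void)
    {
        struct replica r1 = { 0, { 0 } }, r2 = { 0, { 0 } };
        struct msg a = sequence_msg("deposit"), b = sequence_msg("withdraw");

        receive(&r1, a, 1); receive(&r1, b, 1);   /* arrives in order          */
        receive(&r2, b, 2); receive(&r2, a, 2);   /* reordered, same delivery  */
        return 0;
    }

Both replicas print the same delivery order even though the messages reach them in different orders, which is exactly the property the replicas need.
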
8
Replication Issues
  • Non-determinism in replica execution may cause
    replicas to diverge
  • In HW: execute the same instruction stream
  • Tandem TMR synchronizes CPUs on (1) global
    memory references, (2) external interrupts, and
    (3) periodically; for (2) and (3) it uses a
    cycle counter
  • In SW:
  • State machine approach: avoid non-determinism
    (e.g., multithreading, local timers/random number
    generators, etc.)
  • Force the same instruction streams (instruction
    counter)
  • Use non-preemptive deterministic scheduler

9
Non-preemptive deterministic scheduler
  • Eternal (CORBA)
  • Intercepts system calls, replacing the C standard
    library
  • Only one logical thread can execute at a time (an
    RMI blocks the replica process)
  • Transactional Drago (Ada)
  • Scheduler embedded in the run-time support
  • Multiple logical threads can proceed concurrently
  • Scheduling points: service requests, selective
    receptions, lock requests, server calls, and end
    of execution
  • Uses an internal queue and an external queue
  • Reads on the external queue must be blocking!
  • Non-preemptive → only one physical thread can be
    running at a time (no CPU/I/O overlap, no SMP
    parallelism); see the sketch below

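As a hedged illustration of the non-preemptive idea (not the Eternal or Transactional Drago implementation), the sketch below lets exactly one logical thread run at a time and hands the turn over in a fixed round-robin order at explicit scheduling points, so every replica reproduces the same interleaving:

    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 3
    #define N_STEPS   2

    /* Exactly one logical thread holds the "turn" at any time; the turn is
     * passed in a fixed round-robin order at explicit scheduling points. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  turn_changed = PTHREAD_COND_INITIALIZER;
    static int turn = 0;                 /* id of the thread allowed to run */

    static void wait_for_turn(int id)
    {
        pthread_mutex_lock(&lock);
        while (turn != id)
            pthread_cond_wait(&turn_changed, &lock);
        pthread_mutex_unlock(&lock);
    }

    /* Scheduling point: yield deterministically to the next logical thread. */
    static void yield_turn(int id)
    {
        pthread_mutex_lock(&lock);
        turn = (id + 1) % N_THREADS;
        pthread_cond_broadcast(&turn_changed);
        pthread_mutex_unlock(&lock);
    }

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;
        for (int step = 0; step < N_STEPS; step++) {
            wait_for_turn(id);
            printf("logical thread %d runs step %d\n", id, step);
            yield_turn(id);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[N_THREADS];
        for (long i = 0; i < N_THREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < N_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

A real non-preemptive scheduler would pass the turn at the scheduling points listed above (service requests, lock requests, etc.) rather than round-robin; the point is only that the handoff decision uses no physical-time information, so it is identical in every replica.
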
10
Replication Issues
  • Voting
  • Voter is single point of failure
  • In SW, one can replicate both clients and servers,
    embedding a voter in each client/server
  • Fundamental assumption:
  • Values differ → a replica failed
  • In SW, outputs can contain chunks with
    replica/node-dependent information

11
Replication Issues
  • Re-integration
  • Hardware
  • Power off the failed unit
  • Power on a spare unit (automatically or manually)
  • Set the internal state of the new spare equal to
    that of the running units
  • Synchronize units and restart computation
  • e.g., CPU re-integration in Integrity S2
  • Software
  • Need to get/set internal state (checkpoint)
  • Replicas must be quiescent: no operation that can
    change the state may be in progress (see the
    sketch below)
  • Replica state can be divided into: application
    state, infrastructure state, OS state
  • The application must minimize dependencies on OS
    resources when a checkpoint is taken (e.g., close
    open files)

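A minimal sketch of reaching quiescence before capturing state, assuming an illustrative in-flight operation counter (the counter, hook names, and capture function are assumptions, not from the lecture):

    #include <pthread.h>
    #include <stdio.h>

    /* Count in-flight state-changing operations and hold new ones back while
     * a checkpoint is taken, so state is only captured when quiescent. */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static int in_flight = 0;        /* operations currently mutating state   */
    static int checkpointing = 0;    /* nonzero while state is being captured */

    static void op_begin(void)
    {
        pthread_mutex_lock(&m);
        while (checkpointing)               /* hold new operations back */
            pthread_cond_wait(&cv, &m);
        in_flight++;
        pthread_mutex_unlock(&m);
    }

    static void op_end(void)
    {
        pthread_mutex_lock(&m);
        if (--in_flight == 0)
            pthread_cond_broadcast(&cv);    /* replica may now be quiescent */
        pthread_mutex_unlock(&m);
    }

    /* Capture application/infrastructure state only once quiescent. */
    static void take_checkpoint(void (*capture)(void))
    {
        pthread_mutex_lock(&m);
        checkpointing = 1;
        while (in_flight > 0)               /* drain in-progress operations */
            pthread_cond_wait(&cv, &m);
        capture();
        checkpointing = 0;
        pthread_cond_broadcast(&cv);        /* let blocked operations resume */
        pthread_mutex_unlock(&m);
    }

    static void capture_state(void)
    {
        printf("state captured for re-integration\n");
    }

    int main(void)
    {
        op_begin();                         /* a state-changing operation... */
        op_end();                           /* ...completes                  */
        take_checkpoint(capture_state);
        return 0;
    }
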
12
Replication Issues
  • Cost/Performance and Complexity
  • HW (e.g., TANDEM)
  • Between HW and OS (e.g., Hypervisor)
  • OS (e.g., TARGON/32)
  • Between OS and User App (e.g., TFT)
  • User App (e.g., FT-CORBA)
  • Overhead can range from 20% to 1,000%

13
Examples of Replicated Systems
  • Self-Test And Repair (STAR), JPL
  • Prototype for a real-time satellite-control
    computer
  • Techniques: error-detection coding, on-line
    monitoring, standby redundancy, replication with
    voting, component redundancy, and re-execution
  • The hard-core monitor (TARP) is triplicated
  • Electronic Switching Systems (ESS), Bell Labs
  • Target: 3 minutes of downtime per year
    (R = 0.999994)
  • Technique: redundant components and a
    self-checking duplicated processor
  • Tolerates all single failures
  • Delta-4
  • Network attachment controllers (NACs) run an
    atomic multicast protocol and enforce fail-silence
  • Passive, semi-active, and active replication
  • CORBA-based Replicated Systems
  • Fault tolerance integrated in the ORB (Orbix+Isis,
    Maestro, Electra)
  • Efficient, but a non-standard ORB
  • Modifies the transport-level mapping in the ORB
  • Fault tolerance as services above the ORB
  • Inefficient; FT is explicit to the user
  • Intercept IIOP messages and send them through
    reliable broadcast protocol (Eternal, AQuA)

14
Summary
  • Replication is expensive in terms of
    cost/complexity and overhead
  • Fault models: crash, Byzantine, network partitions
  • HW/SW approaches: similar issues, different
    solutions
  • Replicas must see the same ordered set of inputs
  • Non-determinism in replica execution may cause
    replicas to diverge
  • Voting
  • Re-integration

15
A Preemptive Deterministic Scheduling Algorithm
for Multithreaded Replicas
16
Motivations
  • Replication offers a high degree of data integrity
    and system availability
  • Replicated systems suffer from large overhead
  • Multithreading can help reduce the overhead but
    it results in nondeterminism in replica behavior
  • Not relevant from the application standpoint, but
    it must be removed to support fault masking
  • Most replication approaches do not support
    multithreading
  • Those that do (e.g., Eternal, Transactional
    Drago) limit concurrency to at most one executing
    thread at a time

17
Proposed Approach
  • Synchronize replicas only on shared state updates
  • Intercept lock/unlock requests (see the wrapper
    sketch below)
  • Enforce equivalent thread interleaving across
    replicas
  • Allow multiple physical threads to run at the
    same time
  • MT features that we rely on:
  • Updates to shared state are protected via mutexes
  • Different mutexes protect different shared
    variables
  • Requirements:
  • No other form of nondeterminism is present in the
    application (e.g., local timers or local random
    number gen.)
  • Can replay the application by enforcing the same
    (causal) order of mutex acquisitions

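A minimal sketch of the interception point, with illustrative wrapper names and no-op hooks standing in for the places where a deterministic scheduler would suspend or wake threads (these names are assumptions of the sketch):

    #include <pthread.h>
    #include <stdio.h>

    /* Placeholder hooks: a real deterministic scheduler would suspend the
     * thread here until it may acquire the mutex, and bookkeep releases so
     * that waiters can be granted deterministically. */
    static void enforce_deterministic_order(pthread_t self, pthread_mutex_t *mu)
    {
        (void)self; (void)mu;               /* no-op stand-in */
    }
    static void record_release(pthread_t self, pthread_mutex_t *mu)
    {
        (void)self; (void)mu;               /* no-op stand-in */
    }

    /* The application calls these wrappers instead of the pthread functions
     * (or the pthread calls are interposed at link/load time), so every
     * shared-state update passes through the replication layer. */
    int replicated_mutex_lock(pthread_mutex_t *mu)
    {
        enforce_deterministic_order(pthread_self(), mu); /* may block thread */
        return pthread_mutex_lock(mu);
    }

    int replicated_mutex_unlock(pthread_mutex_t *mu)
    {
        int rc = pthread_mutex_unlock(mu);
        record_release(pthread_self(), mu);              /* may wake a waiter */
        return rc;
    }

    int main(void)
    {
        pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
        replicated_mutex_lock(&mu);
        puts("critical section under an intercepted lock");
        replicated_mutex_unlock(&mu);
        return 0;
    }
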
18
Formal Definitions
  • Correct Application Assumption
  • Each thread releases only mutexes it owns
  • A thread executing infinitely often:
  • Eventually releases each mutex it acquires
  • Requests mutexes infinitely often
  • The mutex dependency graph is acyclic (no
    deadlock)
  • Correctness Property
  • Internal Correctness
  • (Mutual Exclusion) At most one thread holds a
    given mutex
  • (No Lockout) A mutex acquisition request is
    eventually served
  • External Correctness
  • (Safety) Two replicas impose the same causal
    order of mutex acquisitions
  • (Liveness) Any mutex acquisition in one replica
    is eventually performed by the other replica

19
A Solution: Preemptive Deterministic Scheduler
(PDS)
  • No inter-replica communication
  • Basic Idea of PDS
  • Assume a total order < on thread ids
  • The < relation is the same for all replicas
  • Replica execution is broken into a sequence of
    rounds
  • In a round a thread can acquire at most:
  • 1 mutex (PDS-1)
  • 2 mutexes (PDS-2)
  • (the PDS-1 grant rule is sketched below)

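As a hedged, much-simplified model of the PDS-1 decision rule only (releases, threads still executing, and PDS-2's two-step rounds are omitted), the sketch below grants each contested mutex to its lowest-id requester, so every replica makes the same choice without any inter-replica communication; the request pattern is illustrative:

    #include <stdio.h>

    #define N_THREADS 4
    #define N_MUTEXES 3

    int main(void)
    {
        /* requested[t] = mutex that thread t has announced as its next
         * acquisition in this round (an illustrative pattern: t1 and t2
         * both want m1, t3 and t4 both want m2). */
        int requested[N_THREADS] = { 0, 0, 1, 1 };

        int winner[N_MUTEXES];
        for (int m = 0; m < N_MUTEXES; m++)
            winner[m] = -1;

        /* Scan threads in the shared id order: the lowest-id requester of
         * each mutex wins it this round; the others stay suspended until a
         * later round.  Every replica runs the same scan over the same
         * order, so all replicas grant the same mutexes to the same threads. */
        for (int t = 0; t < N_THREADS; t++) {
            int m = requested[t];
            if (winner[m] < 0)
                winner[m] = t;
        }

        for (int t = 0; t < N_THREADS; t++) {
            if (winner[requested[t]] == t)
                printf("round N: thread t%d acquires m%d\n",
                       t + 1, requested[t] + 1);
            else
                printf("round N: thread t%d stays suspended\n", t + 1);
        }
        return 0;
    }

Determinism comes entirely from data every replica already shares (the fixed order on thread ids and the per-round requests), which is why no messages are needed.
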
20
Basic PDS Algorithm Example
  • Threads t1, t2, and t3 have requested their next
    mutex acquisition and have been suspended
  • Thread t4 is still executing

[Figure: round N timeline — t1 and t2 wait on m1, t3 waits on m2, t4 still executing]
21
Basic PDS Algorithm Example
  • Round N fires

[Figure: round N — t1 acquires m1, t3 acquires m2; t2 remains suspended]
22
Basic PDS Algorithm Example
  • Threads t1 and t3 execute concurrently
  • t1 releases m1 → t2 acquires m1

[Figure: round N thread timeline]
23
Basic PDS Algorithm Example
  • Threads t1, t2, and t3 execute concurrently

[Figure: round N thread timeline]
24
Basic PDS Algorithm Example
  • t3 releases m2 → t4 acquires m2

[Figure: round N thread timeline]
25
Basic PDS Algorithm Example
  • All threads execute concurrently

[Figure: round N thread timeline]
26
Basic PDS Algorithm Example
[Figure: round N thread timeline]
27
Basic PDS Algorithm Example
[Figure: round N thread timeline — m3 is now requested]
28
Basic PDS Algorithm Example
  • t1 acquires m3

[Figure: round N thread timeline]
29
Basic PDS Algorithm Example
  • Thread t1 executes

[Figure: rounds N and N+1 thread timeline]
30
Basic PDS Algorithm Example
  • t1 releases m3 → t2 acquires m3

[Figure: rounds N and N+1 thread timeline]
31
Basic PDS Algorithm Example
  • Threads t1 and t2 execute concurrently

[Figure: rounds N and N+1 thread timeline]
32
PDS Algorithm
  • Extend PDS-1 by allowing each thread to acquire
    up to 2 mutexes per round
  • Divide a round into two steps
  • NOTE: threads with lower ids always win ties
    against threads with higher ids
  • Avoid suspending a thread whenever it would
    acquire a mutex first, regardless of other
    threads' (future) mutex requests

33
Performance/Dependability Experimental Evaluation
  • The PDS algorithm was formally specified and
    proved correct
  • Experimental study of Performance/Dependability
    tradeoffs involved in selecting different thread
    scheduling algorithms
  • Studied algorithms
  • Loose Synchronization Algorithm (LSA)
  • Proposed in Basile et al., SRDS 2002
  • Preemptive Deterministic Scheduler (PDS)
  • Non-Preemptive Deterministic Scheduler (NPDS)
  • Based on MTRDS algorithm used in Transactional
    Drago

34
Experimental Setup
  • Triplicated server running a synthetic MT
    benchmark
  • 10 worker threads serve requests from 15 clients
  • Serving a client request involves executing a
    sequence of:
  • mutex acquisitions (modeling accesses to shared
    data)
  • I/O activities (modeling accesses to persistent
    storage); a toy worker in this style is sketched
    below
  • Majority voting
  • Replicas, voter/fanout, and clients run on
    different machines

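A hedged sketch of what one benchmark worker might look like; the shared-data layout, loop counts, and the sleep standing in for I/O are assumptions for illustration, not the actual benchmark used in the study:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define N_WORKERS  10      /* worker threads in the replicated server */
    #define N_REQUESTS 15      /* requests served per worker in this toy  */
    #define N_SHARED    4      /* shared variables, one mutex each        */

    static pthread_mutex_t locks[N_SHARED];
    static long shared_data[N_SHARED];        /* models shared server state */

    /* Serving one request = a mutex-protected update of shared data followed
     * by an I/O phase (modeled here by a short sleep). */
    static void serve_request(int worker, int req)
    {
        int idx = (worker + req) % N_SHARED;
        pthread_mutex_lock(&locks[idx]);
        shared_data[idx]++;                   /* access to shared data     */
        pthread_mutex_unlock(&locks[idx]);
        usleep(1000);                         /* stands in for storage I/O */
    }

    static void *worker_main(void *arg)
    {
        int id = (int)(long)arg;
        for (int req = 0; req < N_REQUESTS; req++)
            serve_request(id, req);
        return NULL;
    }

    int main(void)
    {
        pthread_t w[N_WORKERS];
        for (int i = 0; i < N_SHARED; i++)
            pthread_mutex_init(&locks[i], NULL);
        for (long i = 0; i < N_WORKERS; i++)
            pthread_create(&w[i], NULL, worker_main, (void *)i);
        for (int i = 0; i < N_WORKERS; i++)
            pthread_join(w[i], NULL);

        long total = 0;
        for (int i = 0; i < N_SHARED; i++)
            total += shared_data[i];
        printf("served %ld request steps\n", total);
        return 0;
    }
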
35
Performance Evaluation
  • Major Results
  • LSA provides 5 times more throughput than NPDS
  • PDS-2 provides 2 times more throughput than NPDS
  • LSA incurs 40-60% performance overhead w.r.t. the
    non-replicated benchmark (baseline)

36
Dependability Evaluation Approach
  • Use software-based error injection
  • Dependability Measures
  • Number of catastrophic failures
  • Must be minimized in highly available systems
  • Ratio between manifested and injected errors
  • Provides Pr{failure | error}. Given an error
    arrival rate, one can derive system availability
    (see the example below).
  • Ratio between activated and manifested errors
  • Provides a closer look into the error sensitivity
    of the replica code.

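As a worked illustration of how the measured Pr{failure | error} feeds an availability estimate, with made-up numbers and the steady-state relation A = MTTF / (MTTF + MTTR) as additional assumptions of this sketch (not results from the study):

    #include <stdio.h>

    int main(void)
    {
        double errors_per_hour  = 0.01;   /* assumed error arrival rate         */
        double p_fail_given_err = 0.05;   /* measured manifested/injected ratio */
        double mttr_hours       = 0.2;    /* assumed mean time to recover       */

        double failures_per_hour = errors_per_hour * p_fail_given_err;
        double mttf_hours        = 1.0 / failures_per_hour;
        double availability      = mttf_hours / (mttf_hours + mttr_hours);

        printf("MTTF = %.0f h, availability = %.6f\n", mttf_hours, availability);
        return 0;
    }
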
37
Error Injection Results
  • Uniform injections (text/data/heap)
  • PDS is less sensitive to errors than LSA
  • Some manifested errors result in catastrophic
    failures (0.2% for PDS and 0.6% for LSA)
  • The difference between PDS and LSA is due to
    their different use of the underlying GCS
    (Ensemble)
  • All Ensemble functions used by PDS are used by
    LSA
  • A number of Ensemble functions are used by LSA
    and not by PDS
  • Targeted injections show that errors injected
    into Ensemble generate a significant number of
    catastrophic failures
  • 1-3% of manifested failures

38
Lessons Learned
  • Performance: LSA > PDS > NPDS
  • Dependability:
  • LSA is more sensitive to the underlying GCS's
    fail-silence violations
  • NPDS dependability characteristics are similar to
    PDS
  • Errors in the group communication layer do
    propagate and cause catastrophic failures
  • Using interactive consistency will not suffice
  • A fault tolerance middleware should itself be
    fault-tolerant

39
Multithreaded Apache (2.0.35)
  • Triplicated system with majority voting
  • Pentium III 500 MHz
  • Linux 2.4
  • Ensemble 1.40
  • Uses Apache's worker module
  • 10 server threads.
  • 10 concurrent clients each sending 1000 requests
    to retrieve a 1.5KB HTML page

40
Lessons Learned (Replication)
  • Dedicated hardware solutions such as Tandem
  • achieve high reliability/availability through
    extensive hardware redundancy
  • offer only a fixed level of dependability
    throughout the lifetime of the application
  • require significant investment from the customer
  • Software middleware solutions are a vital
    alternative to hardware-based approaches
  • Provide high reliability/availability services to
    the end user
  • Applications executed in the network system may
    require varying levels of availability.
  • Services must adapt to varying availability
    requirements.
  • Provide efficient error detection and fast
    recovery to minimize system downtime.
  • Ensure minimum error propagation across the
    network by
  • self-checking processes and fail-silent nodes
  • Protect the entities that provide fault tolerance
    services

41
Lessons Learned (Self-Checking Middleware)
  • Progressive fault injections were used to stress
    the error detection and recovery services of the
    SIFT environment.
  • The SIFT environment imposes negligible overhead
    during failure-free execution and < 5% overhead
    during recovery of ARMOR processes.
  • Correlated failures involving application and
    ARMOR processes can impact application
    availability.
  • Successful recovery of many correlated failures
    due to hierarchical error detection and recovery.
  • Targeted heap injections show that internal
    self-checks and micro-checkpointing are useful in
    preventing error propagation.