EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing - PowerPoint PPT Presentation

Slides: 29
Provided by: wenbin
Transcript and Presenter's Notes

1
EEC 693/793 Special Topics in Electrical
Engineering: Secure and Dependable Computing
  • Lecture 11
  • Wenbing Zhao
  • Department of Electrical and Computer Engineering
  • Cleveland State University
  • wenbing_at_ieee.org

2
Outline
  • Reminder
  • Midterm 2: Monday, April 7
  • Dependability concepts (some review)
  • Fault, error and failure (some review)
  • Fault/failure detection in distributed systems
  • Consensus in asynchronous distributed systems

3
Dependable System
  • Dependability
  • Ability to deliver service that can justifiably
    be trusted
  • Ability to avoid service failures that are more
    frequent or more severe than is acceptable
  • When service failures are more frequent or more
    severe than acceptable, we say there is a
    dependability failure
  • For a system to be dependable, it must be
  • Available - e.g., ready for use when we need it
  • Reliable - e.g., able to provide continuity of
    service while we are using it
  • Safe - e.g., does not have a catastrophic
    consequence on the environment
  • Secure - e.g., able to preserve confidentiality

4
Approaches to Achieving Dependability
  • Fault Avoidance - how to prevent, by
    construction, the fault occurrence or
    introduction
  • Fault Removal - how to minimize, by verification,
    the presence of faults
  • Fault Tolerance - how to provide, by redundancy,
    a service complying with the specification in
    spite of faults
  • Fault Forecasting - how to estimate, by
    evaluation, the presence, the creation, and the
    consequence of faults

5
Graceful Degradation
  • If a specified fault scenario develops, the
    system must still provide a specified level of
    service. Ideally, the performance of the system
    degrades gracefully
  • The system must not suddenly collapse when a
    fault occurs, or as the size of the faults
    increases
  • Rather it should continue to execute part of the
    work load correctly

6
Quantitative Dependability Measures
  • Reliability - a measure of continuous delivery of
    proper service - or, equivalently, of the time to
    failure
  • It is the probability of surviving (potentially
    despite failures) over an interval of time
  • For example, the reliability requirement might be
    stated as a 0.999999 reliability for a 10-hour
    mission. In other words, the probability of
    failure during the mission may be at most 10^-6
  • Hard real-time systems such as flight control and
    process control demand high reliability, in which
    a failure could mean loss of life
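
As a rough illustration of the 10-hour mission figure, the sketch below assumes a constant failure rate, i.e. an exponential reliability model R(t) = exp(-lambda * t). The slides do not state any failure model, so this model and the function names are assumptions made only for the example.

```python
import math

def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """R(t) = exp(-lambda * t), assuming a constant failure rate (assumption)."""
    return math.exp(-failure_rate_per_hour * hours)

def max_failure_rate(target_reliability: float, hours: float) -> float:
    """Largest constant failure rate that still meets the target over `hours`."""
    return -math.log(target_reliability) / hours

# Requirement from the slide: 0.999999 over a 10-hour mission.
rate = max_failure_rate(0.999999, 10.0)
print(f"maximum tolerable failure rate: {rate:.3e} per hour")
print(f"reliability at that rate over 10 h: {reliability(rate, 10.0):.6f}")
```

Under this assumed model, the requirement translates to a failure rate of roughly 10^-7 per hour.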

7
Quantitative Dependability Measures
  • Availability - a measure of the delivery of
    correct service with respect to the alternation
    of correct service and out-of-service
  • It is the probability of being operational at a
    given instant of time
  • A 0.999999 availability means that the system is
    out of service for at most one hour in a million
    hours
  • A system with high availability may in fact fail.
    However, failure frequency and recovery time
    should be small enough to achieve the desired
    availability
  • Soft real-time systems such as telephone
    switching and airline reservation require high
    availability
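
The arithmetic behind the availability figure can be sketched as follows. The downtime calculation follows directly from the slide; the steady-state formula MTTF / (MTTF + MTTR) is a standard textbook relation that does not appear in the slides, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def expected_downtime_hours(availability: float, period_hours: float) -> float:
    """Expected out-of-service time over a period, given the availability."""
    return (1.0 - availability) * period_hours

def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability from mean time to failure and mean time to repair
    (standard relation, assumed here; not taken from the slides)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# One hour of downtime per million hours, as stated on the slide.
print(expected_downtime_hours(0.999999, 1_000_000))
# The same availability expressed as downtime per year, in seconds.
print(expected_downtime_hours(0.999999, HOURS_PER_YEAR) * 3600)
# Frequent failures can still give high availability if recovery is fast.
print(steady_state_availability(mttf_hours=100.0, mttr_hours=0.0001))
```

The last line illustrates the point that a highly available system may fail often, as long as each recovery is short.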

8
Fault, Error, and Failure
  • The adjudged or hypothesized cause of an error is
    called a fault
  • An error is a manifestation of a fault in a
    system, in which the logical state of an element
    differs from its intended value
  • A service failure occurs if the error propagates
    to the service interface and causes the service
    delivered by the system to deviate from correct
    service
  • The failure of a component causes a permanent or
    transient fault in the system that contains the
    component
  • Service failure of a system causes a permanent or
    transient external fault for the other system(s)
    that receive service from the given system

9
Fault
  • Faults can arise during all stages in a computer
    system's evolution - specification, design,
    development, manufacturing, assembly, and
    installation - and throughout its operational
    life
  • Most faults that occur before full system
    deployment are discovered through testing and
    eliminated
  • Faults that are not removed can reduce a system's
    dependability when it is in the field
  • A fault can be classified by its duration, nature
    of output, and correlation to other faults

10
Fault Types - Based on Duration
  • Permanent faults are caused by irreversible
    device/software failures within a component due
    to damage, fatigue, or improper manufacturing, or
    bad design and implementation
  • Permanent software faults are also called
    Bohrbugs
  • Easier to detect
  • Transient/intermittent faults are triggered by
    environmental disturbances or incorrect design
  • Transient software faults are also referred to as
    Heisenbugs
  • Studies show that Heisenbugs make up the majority
    of software faults
  • Harder to detect

11
Fault Types - Based on Nature of Output
  • Malicious fault: a fault that causes a unit to
    behave arbitrarily or maliciously. Also referred
    to as a Byzantine fault
  • A sensor sending conflicting outputs to different
    processors
  • Compromised software system that attempts to
    cause service failure
  • Non-malicious faults: the opposite of malicious
    faults
  • Faults that are not caused with malicious
    intention
  • Faults that exhibit themselves consistently to
    all observers, e.g., fail-stop
  • Malicious faults are much harder to detect than
    non-malicious faults

12
Fail-Stop System
  • A system is said to be fail-stop if it responds
    to up to a certain maximum number of faults by
    simply stopping, rather than producing incorrect
    output
  • A fail-stop system typically has many processors
    running the same tasks and comparing the outputs.
    If the outputs do not agree, the whole unit turns
    itself off
  • A system is said to be fail-safe if one or more
    safe states can be identified, that can be
    accessed in case of a system failure, in order to
    avoid catastrophe
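
The replicate-and-compare idea behind a fail-stop unit can be sketched as below. This is a minimal single-process illustration, not a real multi-processor design; `FailStop`, `run_fail_stop`, and the two replica functions are hypothetical names.

```python
from typing import Callable, Sequence

class FailStop(Exception):
    """Raised when replicas disagree: the unit halts instead of answering."""

def run_fail_stop(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and compare the outputs.

    If any output differs from the first, stop rather than return a
    possibly incorrect result.
    """
    outputs = [replica(x) for replica in replicas]
    if any(out != outputs[0] for out in outputs[1:]):
        raise FailStop(f"replica outputs disagree: {outputs}")
    return outputs[0]

def square(x: int) -> int:
    return x * x

def faulty_square(x: int) -> int:
    return x * x + 1  # hypothetical faulty replica

print(run_fail_stop([square, square], 3))
try:
    run_fail_stop([square, faulty_square], 3)
except FailStop as stop:
    print("unit stopped:", stop)
```

Comparison only catches faults that make replicas disagree. If every replica fails in the same way, the outputs still match, which is why the definition is limited to a maximum number of faults.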

13
Fault Types - Based on Correlation
  • Component faults may be independent of one
    another or correlated
  • A fault is said to be independent if it does not
    directly or indirectly cause another fault
  • Faults are said to be correlated if they are
    related. Faults could be correlated due to
    physical or electrical coupling of components
  • Correlated faults are more difficult to detect
    than independent faults

14
Fail Fast to Reduce Heisenbugs
  • The bugs that software developers hate most
  • The ones that show up only after hours of
    successful operation, under unusual circumstances
  • The stack trace usually does not provide useful
    information
  • This kind of bug can have many causes, such as
  • Not checking the boundary of an array
  • Invalid defensive programming (this is what fail
    fast addresses)
  • Reference
  • http://www.martinfowler.com/ieeeSoftware/failFast.pdf

15
Fail Fast to Reduce Heisenbugs
  • Invalid defensive programming
  • Making your software robust by working around
    problems automatically
  • This results in the software failing slowly
  • That is, it facilitates error propagation - the
    program continues working right after an error
    but fails in strange ways later on
  • Example
      public int maxConnections() {
          string property = getProperty("maxConnections");
          if (property == null) {
              return 10;   // silently falls back to a default
          } else {
              return property.toInt();
          }
      }

16
Fail Fast to Reduce Heisenbugs
  • Fail fast programming
  • When a problem occurs, the software fails
    immediately and visibly
  • It may sound like it would make your software
    more fragile, but it actually makes it more
    robust
  • Bugs are easier to find and fix, so fewer go into
    production
  • Example
      public int maxConnections() {
          string property = getProperty("maxConnections");
          if (property == null) {
              // fail immediately, and say where the problem is
              throw new NullReferenceException(
                  "maxConnections property not found in " + this.configFilePath);
          } else {
              return property.toInt();
          }
      }
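
The contrast between the two styles can also be tried out directly. The following is a small Python sketch of the same idea; the dictionary-based configuration, the function names, and the file name `app.cfg` are assumptions made for the example and are not part of the slide code.

```python
def max_connections_defensive(config: dict) -> int:
    """Defensive style: a missing property is silently replaced by a default."""
    value = config.get("maxConnections")
    if value is None:
        return 10  # the misconfiguration goes unnoticed
    return int(value)

def max_connections_fail_fast(config: dict, config_path: str) -> int:
    """Fail-fast style: a missing property is reported immediately."""
    value = config.get("maxConnections")
    if value is None:
        raise KeyError(f"maxConnections property not found in {config_path}")
    return int(value)

broken_config = {"maxConections": "200"}  # note the misspelled key

print(max_connections_defensive(broken_config))  # runs, but with the wrong limit
try:
    max_connections_fail_fast(broken_config, "app.cfg")
except KeyError as err:
    print("configuration error:", err)
```

With the misspelled key, the defensive version keeps running with a limit nobody asked for, while the fail-fast version points at the configuration file as soon as the value is read.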

17
Failure Detection in Distributed Systems
  • Consider the failure detection problem in an
    asynchronous distributed system, where
  • No upper bound on processing time
  • No upper bound on clock drift rate
  • No upper bound on network delay
  • In an asynchronous distributed system, you cannot
    tell a crashed process from a slow one, even if
    you can assume that messages are sequenced and
    retransmitted (arbitrary numbers of times), so
    they eventually get through
  • This led Fischer, Lynch, and Paterson to prove
    that no deterministic algorithm can guarantee
    consensus in a fully asynchronous distributed
    system if even one process may crash
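
To see why a timeout cannot separate a crashed process from a slow one, consider the heartbeat-based detector sketched below. It is a toy illustration with simulated time passed in as plain numbers; the class and method names are hypothetical.

```python
from typing import Dict

class TimeoutFailureDetector:
    """Suspects a process when no heartbeat arrived within `timeout` time units."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_heartbeat: Dict[str, float] = {}

    def heartbeat(self, process: str, now: float) -> None:
        """Record that a heartbeat from `process` arrived at time `now`."""
        self.last_heartbeat[process] = now

    def suspects(self, process: str, now: float) -> bool:
        """True if the process is suspected at time `now` (may be wrong)."""
        last = self.last_heartbeat.get(process)
        return last is None or now - last > self.timeout

detector = TimeoutFailureDetector(timeout=2.0)
detector.heartbeat("p1", now=0.0)
print(detector.suspects("p1", now=1.0))  # within the timeout
print(detector.suspects("p1", now=5.0))  # silent for too long: crashed or just slow?
detector.heartbeat("p1", now=6.0)        # a delayed heartbeat finally arrives
print(detector.suspects("p1", now=6.5))  # the earlier suspicion was a mistake
```

Because delays have no upper bound, no choice of `timeout` rules out the false suspicion at time 5.0. The detector can only ever suspect a process, never know that it crashed.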

18
Consensus Problem
  • Safety
  • Only a value that has been proposed may be chosen
  • Only a single value is chosen, and
  • A process never learns that a value has been
    chosen unless it actually has been
  • Liveness
  • Some proposed value is eventually chosen and, if
    a value has been chosen, then a process can
    eventually learn the value

19
Impossibility Results
  • FLP Impossibility of Consensus
  • A single faulty process can prevent consensus
  • Because a slow process is indistinguishable from
    a crashed one
  • Chandra/Toueg: showed that the FLP impossibility
    applies to many problems, not just consensus
  • In particular, they show that FLP applies to
    group membership and reliable multicast
  • So these practical problems cannot be solved
    deterministically in fully asynchronous systems
  • They also look at the weakest condition under
    which consensus can be solved
  • Ways to bypass the impossibility result
  • Use unreliable failure detector
  • Use a randomized consensus algorithm

20
The Paxos Algorithm
  • Contribution: safety and liveness are considered
    separately. Safety is always guaranteed, and
    liveness is ensured during periods of synchrony
  • Participants of the algorithm are divided into
    three categories
  • Proposers: those who propose values
  • Acceptors: those who decide which value to choose
  • Learners: those who are interested in learning
    the value chosen

21
The Paxos Algorithm
  • How to choose a value
  • Use a single acceptor: straightforward but not
    fault tolerant
  • Use a number of acceptors: a value is chosen if
    a majority of the acceptors have accepted it
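
The reason a majority works is that any two majorities share at least one acceptor, so two different values cannot both be accepted by a majority unless some acceptor accepts both. The brute-force check below is only an illustration; the function names and the acceptor labels are hypothetical.

```python
from itertools import combinations

def majority(n: int) -> int:
    """Smallest number of acceptors that forms a majority of n."""
    return n // 2 + 1

def all_majorities_intersect(acceptors) -> bool:
    """Check that every pair of minimal majorities shares an acceptor.

    Larger majorities contain a minimal one, so they intersect as well.
    """
    size = majority(len(acceptors))
    quorums = [set(group) for group in combinations(acceptors, size)]
    return all(a & b for a, b in combinations(quorums, 2))

acceptors = ["A", "B", "C", "D", "E"]
print(majority(len(acceptors)))                   # acceptors needed to choose a value
print(len(acceptors) - majority(len(acceptors)))  # acceptors that may be unavailable
print(all_majorities_intersect(acceptors))
```

With five acceptors, three form a majority, so a value can still be chosen when two acceptors are unavailable.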

22
The Paxos Algorithm
  • Requirements for choosing a value
  • P1. An acceptor must accept the first proposal
    that it receives
  • P2. If a proposal with value v is chosen, then
    every higher-numbered proposal that is chosen has
    value v
  • Since the proposal numbers are totally ordered,
    P2 guarantees the safety property

23
The Paxos Algorithm
  • How to guarantee P2?
  • P2a: If a proposal with value v is chosen, then
    every higher-numbered proposal accepted by any
    acceptor has value v
  • But what if an acceptor that has never accepted
    any proposal receives a higher-numbered proposal
    with a different value? P1 requires it to accept
    that proposal, which would violate P2a
  • P2b: If a proposal with value v is chosen, then
    every higher-numbered proposal issued by any
    proposer has value v
  • P2b implies P2a, which implies P2

24
The Paxos Algorithm
  • How to ensure P2b?
  • P2c: For any v and n, if a proposal with value v
    and number n is issued, then there is a set S
    consisting of a majority of acceptors such that
    either
  • (a) no acceptor in S has accepted any proposal
    numbered less than n, or
  • (b) v is the value of the highest-numbered
    proposal among all proposals numbered less than n
    accepted by the acceptors in S

25
The Paxos Algorithm
  • To ensure P2c, an acceptor must make a promise
  • Once it has responded to a prepare request
    numbered n, it will not accept any more proposals
    numbered less than n

26
The Paxos Algorithm
  • Phase 1.
  • (a) A proposer selects a proposal number n and
    sends a prepare request with number n to a
    majority of acceptors.
  • (b) If an acceptor receives a prepare request
    with number n greater than that of any prepare
    request to which it has already responded, then
    it responds to the request with a promise not to
    accept any more proposals numbered less than n
    and with the highest-numbered proposal (if any)
    that it has accepted.

27
The Paxos Algorithm
  • Phase 2.
  • (a) If the proposer receives a response to its
    prepare requests (numbered n) from a majority of
    acceptors, then it sends an accept request to
    each of those acceptors for a proposal numbered n
    with a value v, where v is the value of the
    highest-numbered proposal among the responses, or
    is any value if the responses reported no
    proposals.
  • (b) If an acceptor receives an accept request for
    a proposal numbered n, it accepts the proposal
    unless it has already responded to a prepare
    request having a number greater than n.
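
The two phases can be sketched in a few lines. This is a simplified, single-process illustration under strong assumptions: calls are synchronous, no messages are lost, nothing runs concurrently, and the proposer contacts every acceptor rather than only a majority. The class, method, and function names are hypothetical and do not come from the slides.

```python
from typing import Any, List, Optional, Tuple

class Acceptor:
    def __init__(self) -> None:
        self.promised_n = -1  # highest prepare number responded to so far
        self.accepted: Optional[Tuple[int, Any]] = None  # (number, value)

    def on_prepare(self, n: int) -> Optional[Tuple[str, Optional[Tuple[int, Any]]]]:
        """Phase 1(b): promise, and report the highest-numbered accepted proposal."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return None  # no response to a stale prepare request

    def on_accept(self, n: int, value: Any) -> bool:
        """Phase 2(b): accept unless a higher-numbered prepare was answered."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors: List[Acceptor], n: int, value: Any) -> Optional[Any]:
    """Run both phases; return the chosen value, or None if the round fails."""
    needed = len(acceptors) // 2 + 1
    # Phase 1(a): send prepare(n) and collect the promises.
    responders, reported = [], []
    for acceptor in acceptors:
        reply = acceptor.on_prepare(n)
        if reply is not None:
            responders.append(acceptor)
            if reply[1] is not None:
                reported.append(reply[1])
    if len(responders) < needed:
        return None
    # Phase 2(a): adopt the value of the highest-numbered reported proposal, if any.
    if reported:
        value = max(reported, key=lambda proposal: proposal[0])[1]
    accepted_count = sum(acceptor.on_accept(n, value) for acceptor in responders)
    return value if accepted_count >= needed else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "A"))  # no earlier proposals, so "A" is chosen
print(propose(acceptors, 2, "B"))  # must adopt the already accepted value "A"
print(propose(acceptors, 1, "C"))  # stale number: no promises, the round fails
```

The second call shows the safety argument in action: a later proposer with a higher number cannot overwrite the chosen value, because the promises it collects report that value back to it.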

28
The Paxos Algorithm