EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing - PowerPoint PPT Presentation



EEC 693/793 Special Topics in Electrical
Engineering Secure and Dependable Computing
  • Lecture 11
  • Wenbing Zhao
  • Department of Electrical and Computer Engineering
  • Cleveland State University

  • Reminder
  • midterm2 April 7, Monday
  • Dependability concepts (some review)
  • Fault, error and failure (some review)
  • Fault/failure detection in distributed systems
  • Consensus in asynchronous distributed systems

Dependable System
  • Dependability
  • Ability to deliver service that can justifiably
    be trusted
  • Ability to avoid service failures that are more
    frequent or more severe than is acceptable
  • When service failures are more frequent or more
    severe than acceptable, we say there is a
    dependability failure
  • For a system to be dependable, it must be
  • Available - e.g., ready for use when we need it
  • Reliable - e.g., able to provide continuity of
    service while we are using it
  • Safe - e.g., does not have a catastrophic
    consequence on the environment
  • Secure - e.g., able to preserve confidentiality

Approaches to Achieving Dependability
  • Fault Avoidance - how to prevent, by
    construction, the fault occurrence or
    introduction
  • Fault Removal - how to minimize, by verification,
    the presence of faults
  • Fault Tolerance - how to provide, by redundancy,
    a service complying with the specification in
    spite of faults
  • Fault Forecasting - how to estimate, by
    evaluation, the presence, the creation, and the
    consequence of faults

Graceful Degradation
  • If a specified fault scenario develops, the
    system must still provide a specified level of
    service. Ideally, the performance of the system
    degrades gracefully
  • The system must not suddenly collapse when a
    fault occurs, or as the size of the faults
    increases
  • Rather it should continue to execute part of the
    work load correctly

Quantitative Dependability Measures
  • Reliability - a measure of continuous delivery of
    proper service - or, equivalently, of the time to
    failure
  • It is the probability of surviving (potentially
    despite failures) over an interval of time
  • For example, the reliability requirement might be
    stated as a 0.999999 reliability for a 10-hour
    mission. In other words, the probability of
    failure during the mission may be at most 10^-6
  • Hard real-time systems such as flight control and
    process control demand high reliability, in which
    a failure could mean loss of life
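
The 10-hour mission figure above can be turned into a failure-rate budget. The sketch below assumes a constant failure rate (an exponential lifetime model, R(t) = exp(-lambda * t)), which the slides do not state; the function names are illustrative only.

```python
import math


def reliability(failure_rate_per_hour: float, mission_hours: float) -> float:
    """R(t) = exp(-lambda * t), valid only under the assumed constant failure rate."""
    return math.exp(-failure_rate_per_hour * mission_hours)


def max_failure_rate(target_reliability: float, mission_hours: float) -> float:
    """Largest constant failure rate that still meets the target over the mission."""
    return -math.log(target_reliability) / mission_hours


# Budget for a 0.999999 reliability over a 10-hour mission:
# roughly 1e-7 failures per hour under the exponential assumption.
rate = max_failure_rate(0.999999, 10.0)
```

Under this model, a mission twice as long with the same failure rate has a lower reliability, which is why a reliability requirement is always stated together with an interval of time.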

Quantitative Dependability Measures
  • Availability - a measure of the delivery of
    correct service with respect to the alternation
    of correct service and out-of-service
  • It is the probability of being operational at a
    given instant of time
  • A 0.999999 availability means that the system is
    not operational at most one hour in a million
    hours
  • A system with high availability may in fact fail.
    However, failure frequency and recovery time
    should be small enough to achieve the desired
    availability
  • Soft real-time systems such as telephone
    switching and airline reservation require high
    availability
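
To make the trade-off between failure frequency and recovery time concrete, the sketch below uses the common steady-state formula availability = MTTF / (MTTF + MTTR). The formula and the function names are not taken from the slides; treat them as assumptions for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_seconds_per_year(avail: float) -> float:
    """Expected out-of-service time in a 365-day year for a given availability."""
    seconds_per_year = 365 * 24 * 3600
    return (1.0 - avail) * seconds_per_year


# One hour of repair for every 999,999 hours of correct service
# gives the "one hour in a million" figure.
six_nines = availability(999_999, 1)
yearly_downtime = downtime_seconds_per_year(0.999999)  # about half a minute
```

The same availability can be reached by failing rarely with slow recovery or by failing more often with very fast recovery, which is why a highly available system may still fail.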

Fault, Error, and Failure
  • The adjudged or hypothesized cause of an error is
    called a fault
  • An error is a manifestation of a fault in a
    system, in which the logical state of an element
    differs from its intended value
  • A service failure occurs if the error propagates
    to the service interface and causes the service
    delivered by the system to deviate from correct
    service
  • The failure of a component causes a permanent or
    transient fault in the system that contains the
    component
  • Service failure of a system causes a permanent or
    transient external fault for the other system(s)
    that receive service from the given system

  • Faults can arise during all stages in a computer
    system's evolution - specification, design,
    development, manufacturing, assembly, and
    installation - and throughout its operational
    life
  • Most faults that occur before full system
    deployment are discovered through testing and
    eliminated
  • Faults that are not removed can reduce a system's
    dependability when it is in the field
  • A fault can be classified by its duration, nature
    of output, and correlation to other faults

Fault Types - Based on Duration
  • Permanent faults are caused by irreversible
    device/software failures within a component due
    to damage, fatigue, or improper manufacturing, or
    bad design and implementation
  • Permanent software faults are also called
    Bohrbugs
  • Easier to detect
  • Transient/intermittent faults are triggered by
    environmental disturbances or incorrect design
  • Transient software faults are also referred to as
    Heisenbugs
  • Studies show that Heisenbugs make up the majority
    of software faults
  • Harder to detect

Fault Types - Based on Nature of Output
  • Malicious fault: a fault that causes a unit to
    behave arbitrarily or maliciously. Also referred
    to as a Byzantine fault
  • A sensor sending conflicting outputs to different
  • Compromised software system that attempts to
    cause service failure
  • Non-malicious faults: the opposite of malicious
    faults
  • Faults that are not caused with malicious
    intent
  • Faults that exhibit themselves consistently to
    all observers, e.g., fail-stop
  • Malicious faults are much harder to detect than
    non-malicious faults

Fail-Stop System
  • A system is said to be fail-stop if it responds
    to up to a certain maximum number of faults by
    simply stopping, rather than producing incorrect
    results
  • A fail-stop system typically has many processors
    running the same tasks and comparing the outputs.
    If the outputs do not agree, the whole unit turns
    itself off
  • A system is said to be fail-safe if one or more
    safe states can be identified, that can be
    accessed in case of a system failure, in order to
    avoid catastrophe
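
The compare-and-stop behaviour of a fail-stop unit described above can be sketched as follows. This is a minimal single-process illustration, not a real multiprocessor design; `FailStopUnit` and `UnitStopped` are hypothetical names, and each "processor" is modelled as a plain function.

```python
class UnitStopped(Exception):
    """Raised when the unit has turned itself off."""


class FailStopUnit:
    def __init__(self, replicas):
        self.replicas = list(replicas)  # redundant "processors" running the same task
        self.stopped = False

    def run(self, task_input):
        if self.stopped:
            raise UnitStopped("unit has turned itself off")
        outputs = [replica(task_input) for replica in self.replicas]
        # If any replica disagrees, stop rather than deliver a possibly wrong result.
        if any(out != outputs[0] for out in outputs[1:]):
            self.stopped = True
            raise UnitStopped("replica outputs disagree")
        return outputs[0]


healthy = FailStopUnit([lambda x: x * 2, lambda x: x + x])
faulty = FailStopUnit([lambda x: x * 2, lambda x: x * 2 + 1])
```

Calling `healthy.run(3)` returns the agreed output, while `faulty.run(3)` raises `UnitStopped` and every later call on that unit is refused, so observers see either a correct result or no result.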

Fault Types - Based on Correlation
  • Component faults may be independent of one
    another or correlated
  • A fault is said to be independent if it does not
    directly or indirectly cause another fault
  • Faults are said to be correlated if they are
    related. Faults could be correlated due to
    physical or electrical coupling of components
  • Correlated faults are more difficult to detect
    than independent faults

Fail Fast to Reduce Heisenbugs
  • The bugs that software developers hate most
  • The ones that show up only after hours of
    successful operation, under unusual circumstances
  • The stack trace usually does not provide useful
    information
  • This kind of bug might be caused by many
    reasons, such as
  • Not checking the boundary of an array
  • Invalid defensive programming (what fail fast
    is meant to address)
  • Reference
  • http//

Fail Fast to Reduce Heisenbugs
  • Invalid defensive programming
  • Making your software robust by working around
    problems automatically
  • This results in the software failing slowly
  • That is, it facilitates error propagation - the
    program continues working right after an error
    but fails in strange ways later on
  • Example
      public int maxConnections() {
          string property = getProperty("maxConnections");
          if (property == null) {
              return 10;
          } else {
              return property.toInt();
          }
      }

Fail Fast to Reduce Heisenbugs
  • Fail fast programming
  • When a problem occurs, it fails immediately
  • It may sound like it would make your software
    more fragile, but it actually makes it more
    robust
  • Bugs are easier to find and fix, so fewer go into
    production
  • Example
      public int maxConnections() {
          string property = getProperty("maxConnections");
          if (property == null) {
              throw new NullReferenceException(
                  "maxConnections property not found in " + this.configFilePath);
          } else {
              return property.toInt();
          }
      }
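
For readers who want to run the comparison, here is a Python rendering of the two maxConnections variants from the slides. It is a sketch: the configuration is modelled as a plain dict, and `MissingPropertyError`, `max_connections_fail_slow`, and `max_connections_fail_fast` are hypothetical names, not part of the original example.

```python
DEFAULT_MAX_CONNECTIONS = 10


class MissingPropertyError(Exception):
    """Raised when a required configuration property is absent."""


def max_connections_fail_slow(config: dict) -> int:
    # Defensive workaround: a missing property is silently replaced by a default,
    # so a misconfiguration only shows up later as odd behaviour.
    value = config.get("maxConnections")
    if value is None:
        return DEFAULT_MAX_CONNECTIONS
    return int(value)


def max_connections_fail_fast(config: dict, config_path: str) -> int:
    # Fail fast: report the problem at the point where it is first visible,
    # with enough context to locate the faulty configuration file.
    value = config.get("maxConnections")
    if value is None:
        raise MissingPropertyError(
            f"maxConnections property not found in {config_path}"
        )
    return int(value)
```

With an empty configuration, the first function quietly returns 10 while the second raises immediately and names the file, which is the behaviour the fail-fast slides argue for.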

Failure Detection in Distributed Systems
  • Consider the failure detection problem in an
    asynchronous distributed system, where
  • No upper bound on processing time
  • No upper bound on clock drift rate
  • No upper bound on network delay
  • In an asynchronous distributed system, you cannot
    tell a crashed process from a slow one, even if
    you can assume that messages are sequenced and
    retransmitted (arbitrary numbers of times), so
    they eventually get through
  • This led Fischer, Lynch, and Paterson to prove
    that it is impossible to guarantee consensus in
    a fully asynchronous distributed system if even
    one process may crash

Consensus Problem
  • Safety
  • Only a value that has been proposed may be chosen
  • Only a single value is chosen, and
  • A process never learns that a value has been
    chosen unless it actually has been
  • Liveness
  • Some proposed value is eventually chosen and, if
    a value has been chosen, then a process can
    eventually learn the value
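
The three safety conditions above can be checked mechanically on a recorded run. The sketch below is only an illustration of the definitions; the function name and the way a run is represented (sets of values and a dict from process name to learned value) are assumptions.

```python
def safety_violations(proposed: set, chosen: set, learned: dict) -> list:
    """Return the safety conditions violated by a recorded run (empty list if none)."""
    violations = []
    if not chosen <= proposed:
        violations.append("a value that was never proposed was chosen")
    if len(chosen) > 1:
        violations.append("more than one value was chosen")
    if any(value not in chosen for value in learned.values()):
        violations.append("a process learned a value that was not chosen")
    return violations


good_run = safety_violations({"a", "b"}, {"a"}, {"p1": "a", "p2": "a"})
bad_run = safety_violations({"a"}, {"a", "b"}, {"p1": "b"})
```

Liveness cannot be checked this way on a finite prefix of a run, because "eventually" can always still happen later; that asymmetry is why safety and liveness are treated separately.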

Impossibility Results
  • FLP Impossibility of Consensus
  • A single faulty process can prevent consensus
  • Because a slow process is indistinguishable from
    a crashed one
  • Chandra/Toueg: showed that the FLP impossibility
    applies to many problems, not just consensus
  • In particular, they show that FLP applies to
    group membership, reliable multicast
  • So these practical problems are impossible in
    asynchronous systems
  • They also look at the weakest condition under
    which consensus can be solved
  • Ways to bypass the impossibility result
  • Use an unreliable failure detector
  • Use a randomized consensus algorithm
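
One way to picture an unreliable failure detector is a heartbeat monitor that is allowed to make mistakes and later correct them. The sketch below is a simplified, single-process illustration under stated assumptions: time is passed in explicitly, the class and method names are hypothetical, and the doubling of the timeout after a false suspicion is one possible policy rather than anything prescribed by the slides.

```python
class HeartbeatFailureDetector:
    def __init__(self, initial_timeout: float = 1.0):
        self.initial_timeout = initial_timeout
        self.timeout = {}       # per-process timeout, grows after each mistake
        self.last_heard = {}    # time of the most recent heartbeat
        self.suspected = set()  # processes currently suspected of having crashed

    def heartbeat(self, process: str, now: float) -> None:
        if process in self.suspected:
            # The suspicion was wrong: the process was slow, not crashed.
            # Withdraw it and be more patient with this process from now on.
            self.suspected.discard(process)
            self.timeout[process] *= 2
        self.timeout.setdefault(process, self.initial_timeout)
        self.last_heard[process] = now

    def check(self, now: float) -> set:
        # Suspect every process that has been silent longer than its timeout.
        for process, heard in self.last_heard.items():
            if now - heard > self.timeout[process]:
                self.suspected.add(process)
        return set(self.suspected)
```

A silent process is suspected, but a late heartbeat removes the suspicion and lengthens the timeout, so the detector never claims certainty about a crash. It only offers hints that a consensus algorithm can act on.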

The Paxos Algorithm
  • Contribution: consider safety and liveness
    separately. Safety is always guaranteed, and
    liveness is ensured during periods of synchrony
  • Participants of the algorithm are divided into
    three categories
  • Proposers: those who propose values
  • Acceptors: those who decide which value to choose
  • Learners: those who are interested in learning
    the value chosen

The Paxos Algorithm
  • How to choose a value
  • Use a single acceptor: straightforward but not
    fault tolerant
  • Use a number of acceptors: a value is chosen if
    a majority of the acceptors have accepted it
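
The majority rule works because any two majorities of the same acceptor set share at least one acceptor. The sketch below illustrates both points; it simplifies by tracking only accepted values rather than numbered proposals, and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_size(num_acceptors: int) -> int:
    """Smallest number of acceptors that forms a majority."""
    return num_acceptors // 2 + 1


def chosen_value(accepted: dict, num_acceptors: int):
    """Return the value accepted by a majority of acceptors, or None if there is none."""
    counts = Counter(accepted.values())
    for value, count in counts.items():
        if count >= majority_size(num_acceptors):
            return value
    return None


def majorities_intersect(acceptors: list) -> bool:
    """Brute-force check that every pair of minimal majorities shares an acceptor."""
    quorums = [set(q) for q in combinations(acceptors, majority_size(len(acceptors)))]
    return all(a & b for a, b in combinations(quorums, 2))
```

Because of the intersection property, two different values cannot both be accepted by a majority when each acceptor accepts at most one value. The numbered proposals introduced next are what allow an acceptor to accept more than once without breaking this.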

The Paxos Algorithm
  • Requirements for choosing a value
  • P1. An acceptor must accept the first proposal
    that it receives
  • P2. If a proposal with value v is chosen, then
    every higher-numbered proposal that is chosen has
    value v
  • Since the proposal numbers are totally ordered,
    P2 guarantees the safety property

The Paxos Algorithm
  • How to guarantee P2?
  • P2a. If a proposal with value v is chosen, then
    every higher-numbered proposal accepted by any
    acceptor has value v
  • But what if an acceptor that has never received
    any proposal is sent a higher-numbered proposal
    with a different value v'? By P1 it must accept
    it, which would violate P2a
  • P2b. If a proposal with value v is chosen, then
    every higher-numbered proposal issued by any
    proposer has value v
  • P2b implies P2a, which implies P2

The Paxos Algorithm
  • How to ensure P2b?
  • P2c. For any v and n, if a proposal with value v
    and number n is issued, then there is a set S
    consisting of a majority of acceptors such that
    either
  • (a) no acceptor in S has accepted any proposal
    numbered less than n, or
  • (b) v is the value of the highest-numbered
    proposal among all proposals numbered less than n
    accepted by the acceptors in S

The Paxos Algorithm
  • To ensure P2c, an acceptor must promise
  • It will not accept any more proposals numbered
    less than n, once it has responded to a prepare
    request numbered n

The Paxos Algorithm
  • Phase 1.
  • (a) A proposer selects a proposal number n and
    sends a prepare request with number n to a
    majority of acceptors.
  • (b) If an acceptor receives a prepare request
    with number n greater than that of any prepare
    request to which it has already responded, then
    it responds to the request with a promise not to
    accept any more proposals numbered less than n
    and with the highest-numbered proposal (if any)
    that it has accepted.

The Paxos Algorithm
  • Phase 2.
  • (a) If the proposer receives a response to its
    prepare requests (numbered n) from a majority of
    acceptors, then it sends an accept request to
    each of those acceptors for a proposal numbered n
    with a value v, where v is the value of the
    highest-numbered proposal among the responses, or
    is any value if the responses reported no
    proposals
  • (b) If an acceptor receives an accept request for
    a proposal numbered n, it accepts the proposal
    unless it has already responded to a prepare
    request having a number greater than n.
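
The two phases can be sketched in a few lines for a single decision. This is a minimal in-memory illustration, not a full implementation: there is no network, no message loss, no concurrency, and no learner; `Acceptor`, `propose`, and the field names are hypothetical. Recording the accepted number as promised in `on_accept` is a common implementation choice that the slides do not state.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Proposal = Tuple[int, str]  # (proposal number, value)


@dataclass
class Acceptor:
    promised: int = -1                   # highest prepare number answered so far
    accepted: Optional[Proposal] = None  # highest-numbered proposal accepted so far

    def on_prepare(self, n: int) -> Tuple[bool, Optional[Proposal]]:
        # Phase 1(b): promise only if n exceeds every prepare answered before,
        # and report the highest-numbered proposal accepted so far (if any).
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def on_accept(self, n: int, value: str) -> bool:
        # Phase 2(b): accept unless a prepare with a greater number was answered.
        if n >= self.promised:
            self.promised = n  # assumption: keeps `accepted` the highest-numbered
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run both phases for proposal number n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1(a): send prepare(n) and collect the promises.
    promised = []
    for acceptor in acceptors:
        ok, prior = acceptor.on_prepare(n)
        if ok:
            promised.append((acceptor, prior))
    if len(promised) < majority:
        return None
    # Phase 2(a): adopt the value of the highest-numbered reported proposal,
    # or keep the proposer's own value if no proposal was reported.
    priors = [prior for _, prior in promised if prior is not None]
    if priors:
        value = max(priors, key=lambda p: p[0])[1]
    accepts = sum(acceptor.on_accept(n, value) for acceptor, _ in promised)
    return value if accepts >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "A")   # no earlier proposals, so "A" is chosen
second = propose(acceptors, 2, "B")  # must adopt the already chosen value
stale = propose(acceptors, 1, "C")   # number too low, no promises are given
```

In this run the second proposer asks for "B" but ends up re-proposing "A", because the promises it receives report the earlier accepted proposal. That is property P2b at work.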

The Paxos Algorithm