Building Stable Software Systems - PowerPoint PPT Presentation

1
Building Stable Software Systems
  • Lui Sha
  • lrs_at_cs.uiuc.edu
  • June 1, 2005

2
The challenges of building large systems
  • The FAA's major modernization project, the Advanced
    Automation System (AAS), was originally estimated
    to cost $2.5 billion with a completion date of
    1996. In 1994, the FAA cancelled the AAS program,
    casting aside 11 years of development time and,
    according to the GAO, wasting more than $1.5 billion
    of taxpayer money.
    http://www.asiaweek.com/asiaweek/98/0717/nat_6_clk.html
  • According to a study by IBM, in a typical
    commercial development organization, debugging,
    testing, and verification activities can easily
    range from 50 to 75 percent of the total
    development cost.
    http://www.research.ibm.com/journal/sj/411/hailpern.html

3
Unexpected interactions
  • Incompatible cross-domain protocols
  • Implicit and inconsistent assumptions and
    abstractions
  • Incompatible hardware and software assumptions
    about the operation of the legs led to the loss
    of the Mars Polar Lander.
  • A pathological interaction between real-time and
    synchronization protocols on Mars Pathfinder
    caused repeated resets and nearly doomed the
    mission.

4
System instabilities
  • Faults and failures in one component cascade
    along complex and unexpected dependency relations.
  • Overflow of a velocity variable in a reused
    monitor module led to the destruction of the
    Ariane 5 rocket.
  • A divide-by-zero in a third-party component left
    a warship adrift at sea.
5
Sources of difficulties
  • Unexpected interactions resulting from
    incompatible abstractions, incorrect or implicit
    assumptions in system interfaces, and
    incompatible real time, fault tolerance, and
    security protocols.
  • Inadequate development infrastructure, as
    reflected in the lack of domain-specific
    reference architectures, tools, and design
    patterns with known and parameterized real-time,
    robustness, and security properties.
  • System instabilities that result when faults and
    failures in one component cascade along complex
    and unexpected dependency graphs resulting in
    catastrophic failures in a large part or even an
    entire system.

6
What needs to be done
  • Interface engineering technologies: make the
    semantic assumptions of each component explicit
    and machine checkable via component property
    interface definitions, with tools for two-way
    synchronization between code and interface
    specifications.
  • System integration support: a set of formally
    specified and validated, coherent real-time,
    robustness, security, and networking protocols; a
    set of domain models, reference architectures, and
    design patterns with parameterized real-time,
    robustness, and security properties; and tools to
    support their use.
  • Stable software architecture: use simplicity to
    control complexity; replace depend relations with
    use relations whenever possible; ensure proper
    criticality ordering along semantic, resource
    sharing, and timing dependency trees.
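
One way to read "proper criticality ordering" is that a component should never depend on a less critical one, although it may use it. The sketch below checks that reading over a small dependency list. The component names, the numeric criticality levels, and the "depend"/"use" edge labels are all hypothetical, and only direct edges are checked.

```python
from typing import Dict, List, Tuple

# (source, target, kind), where kind is "depend" or "use" (labels assumed here)
Edge = Tuple[str, str, str]


def criticality_violations(criticality: Dict[str, int],
                           edges: List[Edge]) -> List[Edge]:
    """Return every 'depend' edge whose source is more critical than its target.

    'use' edges are ignored: under this reading a critical component may
    use a less critical service as long as it does not depend on it.
    """
    return [
        (source, target, kind)
        for source, target, kind in edges
        if kind == "depend" and criticality[source] > criticality[target]
    ]


# Hypothetical system: higher number means more critical.
CRITICALITY = {"safety_core": 3, "monitor": 3, "optimizer": 1}
EDGES: List[Edge] = [
    ("safety_core", "monitor", "depend"),    # same level: allowed
    ("safety_core", "optimizer", "use"),     # use relation: allowed
    ("optimizer", "safety_core", "depend"),  # less critical on more critical: allowed
]

if __name__ == "__main__":
    print(criticality_violations(CRITICALITY, EDGES))
```

Turning the `safety_core` to `optimizer` edge from "use" into "depend" would make the function report it, which is the kind of relation the slide asks to replace.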

7
Focusing on stability
  • In the foreseeable future, we can only build a
    small number of modest-size, defect-free
    components, and only at great expense. To plan
    otherwise is overly optimistic at best.
  • We need to learn to build structurally stable
    software systems with:
      • A small number of defect-free components
      • A modest number of nearly defect-free components
      • A majority of COTS-quality components with
        residual bugs
  • Indeed, since the dawn of civilization, there has
    not been a single defect free large system. The
    important role of stability control in so many
    engineering disciplines is not an accident.

8
Building complex and stable systems
  • The United States of America is a highly stable
    and evolvable system. It has grown and made truly
    remarkable progress by the metric of
    civilization, even though many problems remain.
    Yet its basic components, human beings, are
    complex, error prone, and hard to test or verify.
  • There are thousands of residual bugs in the
    telecom network, yet it remains highly reliable.
    There are perhaps millions of bugs in the World
    Wide Web system of systems, but it is remarkably
    stable.
  • Complex but stable systems are uncommon but can
    be and have been built.

9
Some Questions
  • What is the definition of stability in a software
    system?
  • What is the domain of convergence in software
    stability control?
  • How to safely use unreliable services?
  • How can we deal with the infamous state explosion
    problem?
  • How to build a reliable core service?
  • How can we analyze the structural stability of a
    software system?
  • We shall illustrate these ideas with a simple
    example.

10
An example
  • Once upon a time, there was an exam on sorting
    programs. Grades were given as follows:
      • A: Correct and fast, n log(n) in the worst case
      • B: Correct but slow
      • F: Incorrect
  • Joe can verify his bubble sort, but has only a 50%
    chance of writing heap sort correctly.
  • What is his optimal strategy?

11
Stability of a software system
  • Often, requirements can be decomposed into:
  • Critical (correctness) requirements
      • Sorting: output numbers in correct order
      • TSP: visit every city exactly once
      • Control: stable and controllable
  • Performance optimization
      • Sorting: faster
      • TSP: shorter path
      • Control: less time/error/energy

[Figure labels: Heap Sort, Bubble Sort]

Bounded responses to errors: a stable software
system is one that can maintain key properties in
spite of errors in non-critical components.

12
Stability control
  • What if the untrusted sorting program alters an
    item in the input list?
  • Create a verified, simple primitive called
    permute.
  • Untrusted sorting software is not allowed to
    touch the input list except through the permute
    primitive.
  • Enforce the restriction using an object whose
    only mutating method is permute.
  • Under stability control, the untrusted heap sort
    can only produce out-of-order application
    errors.
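
The restriction above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the class and function names are hypothetical, and Python enforces the "permute only" rule by convention rather than by memory protection, which a real system would need. Because the untrusted sort can only exchange positions, its worst outcome is a wrong order, which the verified bubble sort can always repair.

```python
class PermuteOnlyList:
    """Input wrapper whose only mutating operation is permute (name assumed)."""

    def __init__(self, items):
        self._items = list(items)  # private copy; the caller's list is untouched

    def __len__(self):
        return len(self._items)

    def get(self, i):
        return self._items[i]  # read-only access

    def permute(self, i, j):
        """Exchange two positions; no item can be altered, added, or dropped."""
        self._items[i], self._items[j] = self._items[j], self._items[i]

    def is_sorted(self):
        return all(self._items[k] <= self._items[k + 1]
                   for k in range(len(self._items) - 1))

    def snapshot(self):
        return list(self._items)


def untrusted_heap_sort(view):
    """Fast but unverified sort; it may only call get and permute."""
    def sift_down(root, end):
        while True:
            child = 2 * root + 1
            if child >= end:
                return
            if child + 1 < end and view.get(child) < view.get(child + 1):
                child += 1
            if view.get(root) >= view.get(child):
                return
            view.permute(root, child)
            root = child

    n = len(view)
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n)
    for end in range(n - 1, 0, -1):
        view.permute(0, end)
        sift_down(0, end)


def trusted_bubble_sort(view):
    """Slow sort standing in for the verified component."""
    for limit in range(len(view) - 1, 0, -1):
        for i in range(limit):
            if view.get(i) > view.get(i + 1):
                view.permute(i, i + 1)


def stable_sort(items, fast_sort=untrusted_heap_sort):
    view = PermuteOnlyList(items)
    try:
        fast_sort(view)
    except Exception:
        pass  # a crash is handled like any other wrong answer
    if not view.is_sorted():  # acceptance test on the critical property
        trusted_bubble_sort(view)
    return view.snapshot()
```

Passing a deliberately broken `fast_sort` to `stable_sort` still yields a sorted permutation of the input, which is how a new component can be tested safely.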

  • The domain of convergence in software error
    control is the set of states that satisfy the
    precondition of the recovery procedure.
  • Stability control is the mechanism used to
    ensure that the preconditions will hold.
  • State explosion in a stability-controlled
    component is a non-problem.
  • A stable system allows for SAFE TESTING of NEW
    COMPONENTS.

13
Stability control for control software
  • http://www-rtsl.cs.uiuc.edu/ (click project,
    click drii, click telelab download)

14
Transform depend relation to USE relation
  • Given a reliable controller, we identify the
    recovery region within which that controller can
    operate successfully. The recovery region is a
    subset of the states that are admissible with
    respect to the operational constraints.
  • The largest recovery region can be found using
    linear matrix inequalities (LMI). This approach
    is applicable to any linearizable system, which
    covers most practical control systems.
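
As a rough illustration of how a recovery region might be used at run time, the sketch below treats the region as an ellipsoid {x : x'Px <= 1} and accepts the complex controller's command only if the predicted next state stays inside it. Everything numeric here is an assumption made for the example: the matrix `P` is a placeholder for what an LMI solver would return (no LMI is solved), the plant is a toy double integrator, and the gains of `safe_controller` are invented. A one-step prediction is also a simplification of a real stability monitor.

```python
def quad_form(P, x):
    """Compute x' P x for a small dense matrix given as nested lists."""
    n = len(x)
    return sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))


def in_recovery_region(x, P):
    """Membership test for the ellipsoid {x : x' P x <= 1}."""
    return quad_form(P, x) <= 1.0


def predict(x, u, dt=0.1):
    """One Euler step of a toy double integrator: state is [position, velocity]."""
    position, velocity = x
    return [position + dt * velocity, velocity + dt * u]


def safe_controller(x):
    """Stand-in for the trusted simple controller (gains are hypothetical)."""
    return -2.0 * x[0] - 3.0 * x[1]


def choose_command(x, complex_command, P):
    """Use the complex controller's command only if it keeps the state recoverable."""
    if in_recovery_region(predict(x, complex_command), P):
        return complex_command
    return safe_controller(x)


# Placeholder for the matrix an LMI solver would produce.
P = [[1.0, 0.0],
     [0.0, 4.0]]

if __name__ == "__main__":
    print(choose_command([0.0, 0.0], 1.0, P))   # mild command is accepted
    print(choose_command([0.9, 0.0], 50.0, P))  # aggressive command is rejected
```

The trusted controller never needs the complex controller to be correct; it only needs the state to remain inside the region where it can recover.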

[Figure labels: operational constraints, recovery
region, stability envelope]

The system under the new complex controller must
stay within the recovery region.

15
Simplex Architecture for Control
[Data flow block diagram: stability monitoring;
trusted simple and reliable controller; online
upgradeable complex controller; plant]

16
How to build a reliable core service?
  • There are two parties of thought:
  • Fault avoidance party: put all the eggs in a
    bullet-proof basket
  • Fault tolerance party: use diversity, e.g.,
    N-version programming
  • Which party will you vote for?

17
Complexity, diversity and reliability
  • To build a robust software system that can
    tolerate arbitrary application software faults,
    we must understand the relations between software:
      • Complexity: the root cause of software faults
      • Diversity: a necessary condition for software
        fault tolerance
      • Reliability: a function of complexity and
        diversity
  • We shall begin with postulates based on
    self-evident facts.

18
Software development postulates
  • We assert that the following postulates are
    self-evident:
  • P1, Complexity Breeds Bugs: everything else being
    equal, the more complex the software project is,
    the harder it is to make it reliable.
  • P2, All Bugs are Not Equal: you fix a bunch of
    obvious bugs quickly, but finding and fixing the
    last few bugs is much harder.
  • P3, All Budgets are Finite: there is only a
    finite amount of effort (budget) that we can
    spend on any project.
  • How can we model software complexity?

19
Logical complexity
  • Computational complexity -> the number of steps
    in computation.
  • Logical complexity -> the number of steps in
    verification.
  • A program can have different logical and
    computational complexities.
      • Bubble sort: lower logical complexity but
        higher computational complexity.
      • Heap sort: the other way around.
  • Residual logical complexity: a program could have
    high logical complexity initially. However, if it
    has been verified and can be used as is, then the
    residual complexity is zero.

20
The implications
  • P1, Complexity Breeds Bugs: for a given mission
    duration t, the reliability of software decreases
    as complexity increases.
  • P2, All Bugs are Not Equal: for a given degree of
    complexity, the reliability function has a
    monotonically decreasing rate of improvement with
    respect to development effort.
  • P3, All Budgets are Finite: diversity is not free.
    That is, if we go for n-version diversity, we
    must divide the available effort n ways.
  • One simple model that satisfies P1, P2, and P3:
      • Sum of efforts used in diversity = available
        effort
      • Reliability function:
        $R(t) = e^{-k \, (\text{complexity} / \text{effort}) \, t}$
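
The model can be tried out numerically. The sketch below evaluates the reliability function for a single version and for three versions that split the same effort evenly. The majority-vote formula assumes independent version failures and a perfect voter, which are simplifying assumptions added here, and the values of `k`, complexity, and effort are hypothetical units chosen only for illustration.

```python
import math


def reliability(complexity, effort, t=1.0, k=1.0):
    """Reliability exp(-k * (complexity / effort) * t) from the model above."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return math.exp(-k * (complexity / effort) * t)


def three_version_reliability(complexity, total_effort, t=1.0, k=1.0):
    """Majority vote over three versions, each built with a third of the effort.

    Assumes independent failures and a perfect voter (assumptions of this
    sketch, not of the slide).
    """
    r = reliability(complexity, total_effort / 3.0, t, k)
    # System works when all three work, or exactly two of the three work.
    return r ** 3 + 3.0 * r ** 2 * (1.0 - r)


if __name__ == "__main__":
    C, E = 1.0, 10.0  # hypothetical complexity and effort
    print("1-version:          ", reliability(C, E))
    print("3-version:          ", three_version_reliability(C, E))
    print("core with C / 10:   ", reliability(C / 10.0, E))
```

With these particular values the three-version figure comes out below the single-version one, and the low-complexity core comes out highest. The ordering between one and three versions depends on the parameters chosen, so the numbers should be read as an illustration of the trade-off rather than a general result.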

21
Diversity, complexity and reliability
[Figure comparing 3-version programming, 1-version
programming, and a reliable core with 10x
complexity reduction]

Analysis shows that what really counts is not the
degree of diversity. Rather, it is the existence
of a simple and reliable core that can guarantee
the stability of the system. This result is also
robust against changes in the model assumptions.
(L. Sha, "Using Simplicity to Control Complexity,"
IEEE Software, July/August 2001.)

22
Summary: keys to a stable software system
  • A software bug does not fly, nor does it crawl.
    It propagates along three types of dependency
    graphs. In the sorting example:
      • Functional: bubble sort USES but does not
        depend on heap sort.
      • Execution: none, if we give each sorting task
        separate and protected data, storage, and
        computation resources.
      • Timing: bubble sort does not depend on heap
        sort if a complexity-based watchdog timer is
        set.
  • 1. A simple and reliable core for critical
    services
  • 2. A simple and well-formed dependency tree
      • Maximized USE relations
      • Minimized dependency relations
  • 3. Safely exploit useful but unreliable services
    via stability control
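
The timing point in the summary can be sketched with a step budget standing in for the watchdog timer. The budget grows as n log n, so a fast sort that overruns it is cut off and the slow, trusted sort takes over. All names and the budget factor are hypothetical. Counting accesses only catches runaway code that keeps touching the data; a real watchdog would use a timer and preemption.

```python
import math


class BudgetExceeded(Exception):
    """Raised when the untrusted component uses up its step budget."""


class BudgetedList:
    """List wrapper that charges one step per read or exchange."""

    def __init__(self, items, budget):
        self._items = list(items)
        self._remaining = budget

    def __len__(self):
        return len(self._items)

    def _charge(self):
        if self._remaining <= 0:
            raise BudgetExceeded()
        self._remaining -= 1

    def get(self, i):
        self._charge()
        return self._items[i]

    def permute(self, i, j):
        self._charge()
        self._items[i], self._items[j] = self._items[j], self._items[i]

    def snapshot(self):
        return list(self._items)


def step_budget(n, factor=16):
    """Budget proportional to n log n; the factor is an arbitrary choice."""
    return factor * max(1, n) * max(1, math.ceil(math.log2(max(2, n))))


def is_sorted(seq):
    return all(seq[k] <= seq[k + 1] for k in range(len(seq) - 1))


def bubble_sort(items):
    """Trusted fallback, run outside the budget."""
    out = list(items)
    for limit in range(len(out) - 1, 0, -1):
        for i in range(limit):
            if out[i] > out[i + 1]:
                out[i], out[i + 1] = out[i + 1], out[i]
    return out


def guarded_sort(items, untrusted_sort):
    view = BudgetedList(items, step_budget(len(items)))
    try:
        untrusted_sort(view)
    except Exception:  # includes BudgetExceeded
        pass
    candidate = view.snapshot()
    return candidate if is_sorted(candidate) else bubble_sort(candidate)


def runaway_sort(view):
    """Deliberately faulty component that never terminates on its own."""
    while len(view) > 1:
        view.permute(0, 1)
```

Calling `guarded_sort([3, 1, 2], runaway_sort)` returns a sorted list even though `runaway_sort` would otherwise loop forever.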