Transcript and Presenter's Notes

Title: Using Simplicity to Control Complexity


1
Using Simplicity to Control Complexity
  • Lui Sha
  • Department of CS
  • lrs_at_cs.uiuc.edu
  • UIUC
  • June, 2002

2
The Goal
  • Software systems are not static. They evolve.
  • Our goal is to develop an engineering foundation that allows us to evolve software systems dependably:
  • New features can be added easily, preferably online without downtime.
  • The system never performs worse than before, even if the changes have bugs or contain malicious attack code.
  • To realize this goal, we first need to understand the nature of software reliability and demonstrate the viability of this idea in some important class of applications.

3
Which Side Would You Take?
  • How to improve the reliability and availability of increasingly complex software is a serious challenge. There are two philosophical positions:
  • The diversity camp: Diversity in crops resists disease; diversity in software improves reliability. The likelihood of making the same mistakes decreases as the degree of diversity increases. Don't put all your eggs in one basket.
  • The bullet-proof-your-basket camp: Concentrate all the available resources on one version and do it right. Do-it-right-the-first-time is the time-honored approach to quality products.

4
Software Development Postulates
  • In science we rely on facts and logic. Let's begin with well-known observations in software development. We make three postulates:
  • P1 (Complexity Breeds Bugs): Everything else being equal, the more complex the software project is, the harder it is to make it reliable.
  • P2 (All Bugs Are Not Equal): You fix a bunch of obvious bugs quickly, but finding and fixing the last few bugs is much harder, if you can ever hunt them down.
  • P3 (All Budgets Are Finite): There is only a finite amount of effort (budget) that we can spend on any project.
  • Not so fast, Lui! Could you please define software complexity?

5
Residual Logical Complexity
  • Computational complexity is modeled as the number of steps needed to complete the computation. Likewise, logical complexity can be viewed as the number of steps needed to verify correctness.
  • A program can have different logical and computational complexities. For example, compared with heap sort, bubble sort has lower logical complexity but higher computational complexity (a short sketch follows this list). We focus on logical complexity in this talk.
  • Residual logical complexity: a program could have high logical complexity initially. However, if it has been verified and can be used as is, then its residual complexity is zero.
  • In the rest of the discussion, we shall focus on the (residual logical) complexity of software.
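A minimal sketch of the bubble-sort side of that comparison (my own illustration, not from the slides): the loop invariant is short enough to check by hand, which is what low logical complexity means here, even though the algorithm costs O(n^2) comparisons.

    # Bubble sort: low logical complexity (short correctness argument),
    # high computational complexity (O(n^2) comparisons).
    def bubble_sort(xs):
        xs = list(xs)                      # work on a copy; do not mutate the input
        n = len(xs)
        for i in range(n):
            # Invariant: after pass i, the last i elements are in their final positions.
            for j in range(n - 1 - i):
                if xs[j] > xs[j + 1]:
                    xs[j], xs[j + 1] = xs[j + 1], xs[j]
        return xs

    assert bubble_sort([3, 1, 2]) == [1, 2, 3]

Once such a routine has been verified, reusing it as is adds no residual logical complexity in the sense above.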

6
The Implications of the 3 Postulates
  • P1, the Complexity Breeds Bugs postulate, implies that for a given mission duration t, the reliability of software decreases as complexity increases.
  • P2, the All Bugs Are Not Equal postulate, implies that for a given degree of complexity, the reliability function has a monotonically decreasing rate of improvement with respect to development effort.
  • A reliability function of the form R(Effort, Complexity, t) = e^(-kCt/E) satisfies P1 and P2 (written out below).
  • P3, the All Budgets Are Finite postulate, implies that diversity is not free. That is, if we go for n-version diversity, we must divide the available effort n ways. This allows us to compare different approaches fairly.
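Written out (my own elaboration of the slide's claim, with k > 0 a proportionality constant), the model and the two monotonicity properties are:

    % Reliability model
    R(E, C, t) = e^{-kCt/E}
    % P1: for fixed effort E and mission time t, reliability falls as complexity C grows
    \frac{\partial R}{\partial C} = -\frac{kt}{E}\, e^{-kCt/E} < 0
    % P2: more effort always helps, but R is bounded above by 1,
    % so the marginal gain from additional effort eventually diminishes
    \frac{\partial R}{\partial E} = \frac{kCt}{E^{2}}\, e^{-kCt/E} > 0,
    \qquad \lim_{E \to \infty} R(E, C, t) = 1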

7
Modeling the Implications
  • This is equivalent to assuming that:
  • the commonly used reliability function e^(-λt) is a useful model, and
  • the failure rate λ in R(t) is proportional to complexity but inversely proportional to the effort spent on the software, i.e., λ = kC/E.
  • Hold on, Lui, how do you know the failure rate is proportional to complexity and inversely proportional to the effort spent? For God's sake, these could be very non-linear relations!
  • OK, we will examine non-linear relationships later.

8
A Unified Framework
  • Recently, Larry Bernstein extended the reliability model as follows:
  • R = e^(-kCt/(E·ε))
  • where ε expresses the ability to solve a problem with fewer instructions using a new tool such as a compiler.
  • This equation expresses the reliability of a software system in a unified form, related to software engineering parameters. The longer the software system runs, the lower the reliability and the more likely a fault will be executed and become a failure. Reliability can be improved by investing in tools (ε), simplifying the design (C), or increasing the development effort (E) to do more inspections or testing than required by software effort estimation techniques.
  • This is a new idea. For this lecture, we assume ε = 1.

9
N-Version Programming - 1
  • Let's use the simple model to analyze N-version programming under the ideal condition that faults are independent. N-version programming suggests that we independently develop N versions of the program according to the same specification, and then take the majority vote of the outputs.

(Figure: 3-version programming)
10
N-version Programming - 2
  • It turns out that "a single version is better than three versions" is a robust result. Here are two examples (a numerical sketch follows the figure).

(Figure: 3-version programming)
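As a rough numerical sketch under the model above (my own illustration, assuming independent failures, a 2-out-of-3 majority vote, and an even three-way split of the effort E), the single well-funded version wins:

    import math

    def reliability(effort, complexity, t, k=1.0):
        # R(E, C, t) = exp(-k * C * t / E), the model from the earlier slide
        return math.exp(-k * complexity * t / effort)

    E, C, t = 10.0, 1.0, 1.0                  # arbitrary illustrative values
    r_single = reliability(E, C, t)           # one version gets the whole budget
    p = reliability(E / 3.0, C, t)            # each of three versions gets E/3
    r_voted = p**3 + 3 * p**2 * (1 - p)       # at least 2 of the 3 versions correct

    print(f"single version : {r_single:.4f}")  # ~0.905
    print(f"3-version vote : {r_voted:.4f}")   # ~0.833, worse than the single version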
11
Recovery Block
  • The idea of a recovery block is that you develop several alternatives: checkpoint your state, try the primary, and test its output with an acceptance test. If it passes, use it; otherwise, roll back and try another alternative (sketched below). We shall assume that we have a perfect acceptance test for now.
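A minimal sketch of that scheme (my own illustration; the alternatives and the acceptance test are placeholders supplied by the caller):

    import copy

    def recovery_block(state, alternatives, acceptance_test):
        """Try each alternative in order on a copy of the checkpointed state;
        return the first result that passes the acceptance test."""
        checkpoint = copy.deepcopy(state)              # checkpoint before any attempt
        for alternative in alternatives:               # primary first, then the backups
            try:
                result = alternative(copy.deepcopy(checkpoint))   # roll back = fresh copy
            except Exception:
                continue                               # a crash counts as a failed attempt
            if acceptance_test(result):
                return result
        raise RuntimeError("all alternatives failed the acceptance test")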

12
The More Alternatives the Merrier?
13
Power of Simplicity
14
The Fly in the Ointment
  • Alas, it is difficult to develop high-coverage acceptance tests. Consider the case of a uniform random number generator (see the sketch after this list).
  • Can you determine that the distribution is indeed uniform from one isolated data point? No.
  • Can you determine the distribution from a large sample? Yes.
  • Many phenomena require a good-sized sample to diagnose; it is often difficult to diagnose a phenomenon from an isolated instance. This explains why it is so difficult to determine the correctness of each individual program output.
  • Unfortunately, in many applications we cannot buffer a long sequence of outputs before emitting them. We can't do it in interactive applications, nor can we buffer the outputs in control applications.
  • We need to find a way that tolerates incorrect outputs.
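A small illustration of that point (mine; it uses a plain chi-square statistic, not any specific test from the talk): one output from a biased generator is indistinguishable from one output of a fair one, but a sample of a few thousand exposes it.

    import random

    def chi_square_uniform(samples, bins=10):
        """Chi-square statistic of samples in [0, 1) against a uniform distribution."""
        counts = [0] * bins
        for x in samples:
            counts[min(int(x * bins), bins - 1)] += 1
        expected = len(samples) / bins
        return sum((c - expected) ** 2 / expected for c in counts)

    fair = lambda: random.random()
    biased = lambda: random.random() ** 2       # still in [0, 1), but skewed toward 0

    print(fair(), biased())                     # single outputs: both look legitimate

    # With 10,000 samples the statistic separates them: roughly 9 for the fair
    # generator versus thousands for the biased one (the 5% cutoff for 9 degrees
    # of freedom is about 16.9).
    print(chi_square_uniform([fair() for _ in range(10_000)]))
    print(chi_square_uniform([biased() for _ in range(10_000)]))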

15
Feedback Control of Software Execution
  • To tolerate output errors that cannot be detected instantaneously, the application should have the following characteristics:
  • Capability control: when the system is in an operational state, a single incorrect output cannot bring the system down instantaneously. (Cumulative errors can.)
  • Measurable system behavior: we can evaluate the system's behavior under software control.
  • Control applications meet these two requirements. A control software error maps to a measurable actuation error. Errors are measurable and can be bounded by a combination of control authority and monitoring frequency (a back-of-the-envelope sketch follows this list).
  • A simple and reliable core provides acceptable performance.
  • Stability control: the system under complex software control must remain in states that are controllable by the simple and reliable controller.
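A back-of-the-envelope sketch of that bound (mine, with made-up numbers): if the complex controller's commands can move the actuator by at most max_rate units per second and the monitor checks the output every period seconds, a single faulty command can push the plant off course by at most max_rate * period before it is caught.

    def worst_case_actuation_error(max_rate, period):
        """Bound on the drift a faulty command can cause between two monitoring checks."""
        return max_rate * period

    # Example: 5 units/s of control authority, monitored 50 times per second.
    print(worst_case_actuation_error(max_rate=5.0, period=1 / 50))   # 0.1 units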

16
The Idea
  • Joe is a new student who partied a bit too much. He has mastered bubble sort but has only a 50% chance of writing a correct quick sort program.
  • He must submit a program that will be graded as follows:
  • Correct and fast (O(n log n)): A
  • Correct but slow: B
  • Incorrect: F
  • What is Joe's optimal strategy? (A sketch follows below.)

(Diagram: Quick Sort, Bubble Sort.)
Stability control: the set of numbers to be sorted cannot be altered. This is the precondition for Bubble Sort.
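Joe's optimal strategy in code (my own sketch; quick_sort stands for Joe's risky program, bubble_sort could be the routine sketched earlier, and the acceptance test enforces the stability-control precondition by checking that the multiset of numbers was not altered):

    from collections import Counter

    def acceptable(output, original):
        """Ascending order AND a permutation of the original input."""
        in_order = all(a <= b for a, b in zip(output, output[1:]))
        return in_order and Counter(output) == Counter(original)

    def joes_submission(xs, quick_sort, bubble_sort):
        original = list(xs)
        try:
            result = quick_sort(list(xs))          # try the fast, 50%-reliable primary
        except Exception:
            result = None
        if result is not None and acceptable(result, original):
            return result                          # grade A: correct and fast
        return bubble_sort(list(xs))               # grade B at worst: correct but slow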
17
Simplex Architecture
The Simplex architecture combines a simple, verifiable core, diversity in the form of two alternatives, and feedback control of the software execution, with online-replaceable components (a sketch of the supervision loop follows).
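A minimal sketch of the supervision loop implied by the figure (mine; the two controllers and in_recovery_region are assumed placeholders supplied by the system):

    def simplex_step(state, complex_controller, simple_controller, in_recovery_region):
        """One control period: use the complex controller while the plant state stays
        safely inside the recovery region; otherwise fall back to the simple core."""
        if not in_recovery_region(state):
            return simple_controller(state)        # safety path: verified controller
        try:
            return complex_controller(state)       # high-performance path
        except Exception:
            return simple_controller(state)        # crash or exception: fall back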
18
Admissible States
  • In the operation of a plant, there is a set of state constraints representing safety, device physical limitations, environmental requirements, and other operational requirements.
  • They can be represented as a normalized polytope, C^T X ≤ 1, in the N-dimensional state space. We must be able to:
  • take control away from a faulty controller before the system state becomes inadmissible, and
  • ensure that the future trajectory of the system state after the switch stays within the set of admissible states.

(Figure: operational constraints and the set of admissible states.)
19
The Error Bounds
  • We cannot use the boundary of the admissible states as the switching rule, due to the inertia of the physical plant.
  • The recovery region is closed (invariant) with respect to the operation of the simple controller; it is a Lyapunov function level set inside the polytope (see the formulation below).
  • The largest such recovery region can be found using linear matrix inequalities (LMIs).
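In the standard ellipsoidal formulation (my paraphrase; the talk's exact construction may differ), with the simple controller's closed loop \dot{x} = (A + B K_s)x and constraint polytope \{x : c_i^T x \le 1\}, the recovery region is a level set \{x : x^T P x \le 1\} whose invariance and containment can be written as matrix inequalities:

    % Lyapunov decrease under the simple controller (invariance of the level set)
    (A + B K_s)^T P + P (A + B K_s) \prec 0, \qquad P \succ 0
    % The ellipsoid {x : x^T P x <= 1} lies inside every half-space c_i^T x <= 1
    c_i^T P^{-1} c_i \le 1 \quad \text{for each constraint } i
    % The largest such region is found by maximizing the ellipsoid's volume
    % (e.g. maximizing \log\det P^{-1}) subject to these constraints.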

20
System Development Process
  • The high-assurance control subsystem
  • Application level: well-understood classical controllers
  • System software level: high-assurance OS kernels, such as a certifiable Ada runtime
  • Hardware level: well-established, simple fault-tolerant hardware configurations such as pair-pair or TMR
  • High-assurance development and maintenance process, e.g., FAA DO-178B
  • Requirement management: requirements here are limited to critical properties
  • The high-performance control subsystem
  • Application level: advanced control technologies
  • System software level: COTS real-time operating systems and middleware
  • Hardware level: standard industrial hardware, e.g., VME
  • Standard industrial software development process
  • Requirement management: features and performance are handled here
  • System evolution support, e.g., online-replaceable components

21
Semiconductor Wafer Process State Control
(Figure: deposition rate, refractive index, Si-H/Ni-H bonds, uniformity, etc.; DC bias, mass 60 (disilane), mass 76 (triaminosilane); SiH4, RF power, pressure.)
22
DoD Applications
Software fault tolerance is particularly useful for cases in which some new functionality is available that has been only partially tested but that might help to achieve the success of a mission. By providing protection from faults, Simplex enables such functionality to be applied on a mission.

Joint Strike Fighter (JSF): the JSF mission software architecture builds on the architectural principles developed under the INSERT project. http://www.sei.cmu.edu/pub/documents/99.reports/pdf/news-sei-fall-1999.pdf

The Space and Naval Warfare Systems Command (SPAWAR) has initiated a process to transition SIMPLEX technology. The technology will be transitioned to the Surface Combatant for the 21st Century (SC21), the Next Generation Carrier (CV(X)), and other Navy systems. SIMPLEX includes a software architecture, real-time middleware services, and supporting tools to allow the safe insertion of new technology or the upgrading of existing technology in high-assurance real-time systems. It permits the new technology to operate until an error condition (system, timing, or semantic error) occurs, at which time the system rolls back to the baseline technology. http://www.rl.af.mil/tech/programs/edcs/Accomplishments.html
23
Summary
  • We should never trust complex software that is beyond our means to verify.
  • Untrusted complex software is useful, provided that when it malfunctions, its adverse impact on system behavior is observable and bounded by design.
  • We need a simple and reliable core to provide minimal essential services and to constrain the impact of malfunctioning software, so that faults do not turn into failures.

After 30 seconds of a planned 90-second missile test flight in the '70s, the clock was not properly reset. The missile blew up. Some twenty-five years later, AT&T experienced a massive network failure caused by a similar problem in the fault recovery subsystem they were upgrading. In both cases, the system failed because there were no limits placed on the results the software could produce. There were no boundary conditions set. Designers built with a point solution in mind and without bounding the domain of software execution. Testers were rushed to meet schedules and the planned fault recovery mechanisms did not work. --- Larry Bernstein
24
Software Fault Model
  • Timing fault: misses its deadlines
  • Capability abuse:
  • corrupting other components' code or data
  • unauthorized acquisition of process/resource management capabilities
  • Semantic fault: incorrect results that can lead to
  • poor control performance
  • instability in the plant

25
Recent Extensions: Secured Reliable Upgrades
  • Code/data access attacks: compiler-based protection
  • Algorithmic attacks: algorithm-based protection
  • Resource depletion attacks: OS-based protection
26
Telelab
  • www-drii.cs.uiuc.edu/download