Title: SOFTWARE METRICS FOR CONTROL AND QUALITY ASSURANCE COURSE OVERVIEW
1 SOFTWARE METRICS FOR CONTROL AND QUALITY ASSURANCE COURSE OVERVIEW
2 Course Objectives
- At the end of this section of the course you should be able to:
- write a metrics plan (define appropriate software metrics and data collection programmes to satisfy different quality assurance objectives)
- understand the importance of quantification in software engineering
- differentiate between good and bad uses of measurement in software engineering
- know how to use a range of software measurement techniques to monitor product and process quality
- analyse different types of software metrics datasets
- use a method for software risk management that takes account of multiple factors and uncertainty
3 Course Structure
- Software quality metrics basics
- Software metrics practice
- Framework for software metrics
- Software reliability
- Measurement theory and statistical analysis
- Empirical software engineering
- Software metrics for risk and uncertainty
4 Recommended Reading
- The main course text for this part of the course is:
- Fenton NE and Pfleeger SL, Software Metrics: A Rigorous and Practical Approach (2nd Edn), PWS, 1998
5 LESSON 1: SOFTWARE QUALITY METRICS BASICS
6 Lesson 1 objectives
- Understand different definitions of software quality and how you might measure it
- Understand different notions of defects and be able to classify them
- Understand the basic techniques of data collection and how to apply them
7 How many Lines of Code?
8 What is software quality?
- Fitness for purpose?
- Conformance to specification?
- Absence of defects?
- Degree of excellence?
- Timeliness?
- All of the above?
- None of the above?
9 Software quality - relevance
[Chart: quality measures plotted by relevance to producer (vertical axis, low to high) against relevance to customer (horizontal axis, low to high):
- Timeliness: time to market
- Productivity: LOC or FP per month
- Technical product quality: delivered defects per KLOC
- Conformance to schedule: deviation from planned budgets/requirements
- Process maturity/stability: capability index]
10 Software Quality Models
[Diagram: a quality model relating Use, Factor, Criteria, and METRICS. Uses include product operation and product revision; criteria include communicativeness, accuracy, consistency, device efficiency, accessibility, completeness, structuredness, conciseness, device independence, legibility, self-descriptiveness, and traceability]
11 Definition of system reliability
The reliability of a system is the probability that the system will execute without failure in a given environment for a given period of time.
- Implications:
- No single reliability number for a given system - it depends on how the system is used
- Use probability to express our uncertainty
- Time dependent
12 What is a software failure?
- Alternative views:
- Formal view
- Any deviation from specified program behaviour is a failure
- Conformance with specification is all that matters
- This is the view adopted in computer science
- Engineering view
- Any deviation from required, specified or expected behaviour is a failure
- If an input is unspecified the program should produce a sensible output appropriate for the circumstances
- This is the view adopted in dependability assessment
13 Human errors, faults, and failures
[Diagram: a human error can lead to a fault, which can lead to a failure]
- Human Error: designer's mistake
- Fault: encoding of an error into a software document/product
- Failure: deviation of the software system from specified or expected behaviour
14 Processing errors
[Diagram, in the absence of fault tolerance: an input exercises a fault (introduced by human error), which leads to a processing error, which leads to a failure]
15 Relationship between faults and failures (Adams 1984)
[Chart: faults plotted against the failures they cause, sized by MTTF]
35% of all faults only lead to very rare failures (MTTF > 5000 years)
16 The relationship between faults and failures
- Most faults are benign
- For most faults, removal will not lead to greatly improved reliability
- Large reliability improvements only come when we eliminate the small proportion of faults which lead to the more frequent failures
- This does not mean we should stop looking for faults, but it warns us to be careful about equating fault counts with reliability
17 The defect density measure: an important health warning
- Defects = faults ∪ failures
- but sometimes defects = faults, or defects = failures
- System defect density = number of defects found / system size
- where size is usually measured in thousands of lines of code (KLOC)
- Defect density is used as a de-facto measure of software quality
- in the light of the Adams data this is very dangerous
- What are industry norms and what do they mean?
(A small computational sketch follows.)
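As a minimal illustration of the measure (the function name and the example figures, taken from the case study later in the course, are used here purely for illustration):

```python
def defect_density(defects_found: int, loc: int) -> float:
    """System defect density = number of defects found / system size (KLOC)."""
    kloc = loc / 1000.0
    return defects_found / kloc

# e.g. 481 defects found in a 1.6 million LOC system
print(defect_density(481, 1_600_000))  # ~0.3 defects per KLOC
```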
18 Defect density vs module size
[Chart: defect density plotted against lines of code - theory predicts one curve; does observation agree?]
19 A Study in Relative Efficiency of Testing Methods
R B Grady, Practical Software Metrics for Project Management and Process Improvement, Prentice Hall, 1992
20 The problem with problems
- Defects
- Faults
- Failures
- Anomalies
- Bugs
- Crashes
21 Incident Types
- Failure (in pre or post release)
- Fault
- Change request
22 Generic Data
- Applicable to all incident types:
- What: product details
- Where (Location): where is it?
- Who: who found it?
- When (Timing): when did it occur?
- What happened (End Result): what was observed?
- How (Trigger): how did it arise?
- Why (Cause): why did it occur?
- Severity/Criticality/Urgency
- Change
23 Example Failure Data
- What: ABC Software Version 2.3
- Where: Norman's home PC
- Who: Norman
- When: 13 Jan 2000 at 21:08, after 35 minutes of operational use
- End result: program crashed with error message xyz
- How: loaded external file and clicked the command Z
- Why: <BLANK - refer to fault>
- Severity: Major
- Change: <BLANK>
24 Example Fault Data (1) - reactive
- What: ABC Software Version 2.3
- Where: help file, section 5.7
- Who: Norman
- When: 15 Jan 2000, during formal inspection
- End result: likely to cause users to enter invalid passwords
- How: the text wrongly says that passwords are case sensitive
- Why: <BLANK>
- Urgency: Minor
- Change: suggest rewording as follows ...
25 Example Fault Data (2) - responsive
- What: ABC Software Version 2.3
- Where: function <abcd> in module <ts0023>
- Who: Simon
- When: 14 Jan 2000, after 2 hours investigation
- What happened: caused reported failure id <0096>
- How: <BLANK>
- Why: missing exception code for command Z
- Urgency: Major
- Change: exception code for command Z added to function <abcd> and also to function <efgh>. Closed on 15 Jan 2000.
26 Example Change Request
- What: ABC Software Version 2.3
- Where: file save menu options
- Who: Norman
- When: 20 Jan 2000
- End result: <BLANK>
- How: <BLANK>
- Why: must be able to save files in ascii format - currently not possible
- Urgency: Major
- Change: add function to enable ascii format file saving
27 Tracking incidents to components
- Incidents need to be traceable to identifiable components - but at what level of granularity?
- Unit
- Module
- Subsystem
- System
- ...
28 Fault classifications used in Eurostar control system
29 Lesson 1 Summary
- Software quality is a multi-dimensional notion
- Defect density is a common (but confusing) way of measuring software quality
- The notion of defects or problems is highly ambiguous - distinguish between faults and failures
- Removing faults may not lead to large reliability improvements
- Much data collection focuses on incident types: failures, faults, and changes. There are who, when, where, ... type data to collect in each case
- System components must be identified at appropriate levels of granularity
30 LESSON 2: SOFTWARE METRICS PRACTICE
31 Lesson 2 Objectives
- Understand why measurement is important for software quality assurance and assessment
- Understand the basic metrics approaches used in industry and how to apply them
- Understand the importance of goal-driven measurement and know how to identify specific goals
- Understand what a metrics plan is and how to write one
32 Why software measurement?
- To assess software products
- To assess software methods
- To help improve software processes
33 From Goals to Actions
34 Goal Question Metric (GQM)
- There should be a clearly-defined need for every measurement
- Begin with the overall goals of the project or product
- From the goals, generate questions whose answers will tell you if the goals are met
- From the questions, suggest measurements that can help to answer the questions
- From Basili and Rombach's Goal-Question-Metric paradigm, described in their 1988 IEEE Transactions on Software Engineering paper on the TAME project
(A small illustrative sketch follows.)
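One way to see the goal-question-metric chain is as a simple nested data structure; the goal, questions, and metrics below are taken from the GQM example on the next slide, and the structure itself (not any particular API) is the point:

```python
# Sketch of one GQM chain: every metric must trace back to a question,
# and every question must trace back to a goal.
gqm = {
    "goal": "Identify fault-prone modules as early as possible",
    "questions": {
        "What do we mean by a fault-prone module?": [
            "faults found per testing phase",
            "failures traced to module",
        ],
        "Does complexity impact fault-proneness?": [
            "KLOC per module",
            "complexity metrics per module",
        ],
        "How much testing is done per module?": [
            "testing effort per testing phase",
        ],
    },
}

# A proposed metric with no parent question has no defined need.
for question, metrics in gqm["questions"].items():
    print(question, "->", metrics)
```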
35 GQM Example
Goal: identify fault-prone modules as early as possible
Questions:
- What do we mean by a fault-prone module?
- Does complexity impact fault-proneness?
- How much testing is done per module?
Metrics:
- Defect data for each module
- faults found per testing phase
- failures traced to module
- Effort data for each module
- testing effort per testing phase
- Size/complexity data for each module
- KLOC
- complexity metrics
36 The Metrics Plan
- For each technical goal this contains information about:
- WHY metrics can address the goal
- WHAT metrics will be collected, how they will be defined, and how they will be analyzed
- WHO will do the collecting, who will do the analyzing, and who will see the results
- HOW it will be done - what tools, techniques and practices will be used to support metrics collection and analysis
- WHEN in the process and how often the metrics will be collected and analyzed
- WHERE the data will be stored
37 The Enduring LOC Measure
- LOC = Number of Lines Of Code
- The simplest and most widely used measure of program size. Easy to compute and automate
- Used (as a normalising measure) for:
- productivity assessment (LOC/effort)
- effort/cost estimation (Effort = f(LOC))
- quality assessment/estimation (defects/LOC)
- Alternative (similar) measures:
- KLOC: Thousands of Lines Of Code
- KDSI: Thousands of Delivered Source Instructions
- NCLOC: Non-Comment Lines of Code
- Number of Characters or Number of Bytes
(A small counting sketch follows.)
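A minimal sketch of a LOC/NCLOC counter for a Python-style source file; what counts as a "comment" or "blank" line is a convention you must fix in advance, which is exactly the slide's point that LOC has no standard definition:

```python
def count_loc(path: str) -> dict:
    """Count LOC and NCLOC under one (of many possible) conventions."""
    loc = ncloc = 0
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if not stripped:
                continue            # blank lines excluded from both counts
            loc += 1
            if not stripped.startswith("#"):
                ncloc += 1          # non-comment lines of code
    return {"LOC": loc, "NCLOC": ncloc, "KLOC": loc / 1000.0}
```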
38 Example: Software Productivity at Toshiba
[Chart: instructions per programmer-month (0-300) plotted for 1972-1982, with the point at which the Software Workbench System was introduced marked]
39 Problems with LOC-type measures
- No standard definition
- Measures length of programs rather than size
- Wrongly used as a surrogate for:
- effort
- complexity
- functionality
- Fails to take account of redundancy and reuse
- Cannot be used comparatively for different types of programming languages
- Only available at the end of the development life-cycle
40 Fundamental software size attributes
- length: the physical size of the product
- functionality: measures the functions supplied by the product to the user
- complexity:
- Problem complexity measures the complexity of the underlying problem
- Algorithmic complexity reflects the complexity/efficiency of the algorithm implemented to solve the problem
- Structural complexity measures the structure of the software used to implement the algorithm (includes control flow structure, hierarchical structure and modular structure)
- Cognitive complexity measures the effort required to understand the software
41 The search for more discriminating metrics
- Measures that:
- capture cognitive complexity
- capture structural complexity
- capture functionality (or functional complexity)
- are language independent
- can be extracted at early life-cycle phases
42 The 1970s: Measures of Source Code
- Characterized by:
- Halstead's Software Science metrics
- McCabe's Cyclomatic Complexity metric
- Influenced by:
- Growing acceptance of structured programming
- Notions of cognitive complexity
43 Halstead's Software Science Metrics
A program P is a collection of tokens, classified as either operators or operands:
n1 = number of unique operators
n2 = number of unique operands
N1 = total occurrences of operators
N2 = total occurrences of operands
Length of P is N = N1 + N2. Vocabulary of P is n = n1 + n2.
Theory: estimate of N is N^ = n1 log2(n1) + n2 log2(n2)
Theory: effort required to generate P is E = (n1 N2 N log2(n)) / (2 n2) (elementary mental discriminations)
Theory: time required to program P is T = E/18 seconds
(A small computational sketch follows.)
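A direct transcription of the slide's formulas; the token counts would come from a language-specific lexer, which is not shown, and the example counts are made up:

```python
import math

def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead's Software Science measures from the four token counts."""
    N = N1 + N2                                        # length
    n = n1 + n2                                        # vocabulary
    N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)    # estimated length
    E = (n1 * N2 * N * math.log2(n)) / (2 * n2)        # effort (discriminations)
    T = E / 18                                         # time in seconds
    return {"length": N, "vocabulary": n, "est_length": N_hat,
            "effort": E, "time_seconds": T}

# Hypothetical counts for a tiny program
print(halstead(n1=10, n2=7, N1=28, N2=22))
```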
44 McCabe's Cyclomatic Complexity Metric v
If G is the control flowgraph of program P and G has e edges (arcs) and n nodes:
v(P) = e - n + 2
v(P) is the number of linearly independent paths in G
[Example flowgraph: e = 16, n = 13, so v(P) = 5]
More simply, if d is the number of decision nodes in G then:
v(P) = d + 1
McCabe proposed v(P) < 10 for each module P
(A small computational sketch follows.)
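Both forms of the metric from the slide, as a sketch:

```python
def cyclomatic_from_graph(edges: int, nodes: int) -> int:
    """v(P) = e - n + 2 for a connected control flowgraph."""
    return edges - nodes + 2

def cyclomatic_from_decisions(decision_nodes: int) -> int:
    """v(P) = d + 1, the simpler formulation."""
    return decision_nodes + 1

# The slide's example flowgraph: e = 16, n = 13
assert cyclomatic_from_graph(16, 13) == 5
# Applying McCabe's proposed threshold of 10
print("refactor?" if cyclomatic_from_graph(16, 13) >= 10 else "ok")
```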
45 Flowgraph-based measures
- Many software measures are based on a flowgraph model of a program
- Most such measures can be automatically computed once the flowgraph decomposition is known
- The notion of flowgraph decomposition provides a rigorous, generalised theory of structured programming
- There are tools for computing flowgraph decomposition
46 The 1980s: Early Life-Cycle Measures
- Predictive process measures - effort and cost estimation
- Measures of designs
- Measures of specifications
47 Software Cost Estimation
48 Simple COCOMO Effort Prediction
effort = a (size)^b
where effort is in person-months (predicted), size is in KDSI, and a, b are constants depending on type of system:
organic: a = 2.4, b = 1.05
semi-detached: a = 3.0, b = 1.12
embedded: a = 3.6, b = 1.2
49 COCOMO Development Time Prediction
time = a (effort)^b
where effort is in person-months, time is development time (months), and a, b are constants depending on type of system:
organic: a = 2.5, b = 0.38
semi-detached: a = 2.5, b = 0.35
embedded: a = 2.5, b = 0.32
(A small sketch covering both COCOMO formulas follows.)
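A sketch combining the two slides' formulas; the constants are the slides' values and the 32 KDSI example is hypothetical:

```python
# Basic COCOMO constants per system type (from the slides)
EFFORT_PARAMS = {"organic": (2.4, 1.05), "semi-detached": (3.0, 1.12),
                 "embedded": (3.6, 1.2)}
TIME_PARAMS = {"organic": (2.5, 0.38), "semi-detached": (2.5, 0.35),
               "embedded": (2.5, 0.32)}

def cocomo(kdsi: float, mode: str = "organic") -> tuple:
    a, b = EFFORT_PARAMS[mode]
    effort = a * kdsi ** b            # person-months
    c, d = TIME_PARAMS[mode]
    time = c * effort ** d            # development time in months
    return effort, time

effort, months = cocomo(32, "embedded")   # hypothetical 32 KDSI system
print(f"{effort:.0f} person-months over {months:.1f} months")
```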
50 Regression-Based Cost Modelling
[Log-log scatterplot of effort E (person-months, 10 to 10,000) against size S (1K to 10000K): a fitted line with slope b and intercept log a gives log E = log a + b log S, i.e. E = a S^b]
(A small fitting sketch follows.)
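A minimal sketch of fitting E = a S^b by least squares on log-transformed data, as the plot suggests; the five project data points are made up:

```python
import numpy as np

size_kloc = np.array([5, 12, 30, 80, 150], dtype=float)   # hypothetical sizes
effort_pm = np.array([10, 26, 70, 210, 400], dtype=float) # hypothetical efforts

# Fit log E = log a + b log S, then recover a and b
b, log_a = np.polyfit(np.log(size_kloc), np.log(effort_pm), 1)
a = np.exp(log_a)
print(f"E = {a:.2f} * S^{b:.2f}")

# Predict effort for a hypothetical 60 KLOC project
print(a * 60 ** b)
```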
51 Albrecht's Function Points
Count the number of:
External inputs, External outputs, External inquiries, External files, Internal files
giving each a weighting factor.
The Unadjusted Function Count (UFC) is the sum of all these weighted scores.
To get the Adjusted Function Count (FP), multiply by a Technical Complexity Factor (TCF):
FP = UFC x TCF
(A small computational sketch follows.)
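A sketch of the count; the slide does not fix the weighting factors, so the weights below (one commonly published set of "average" Albrecht weights) and the example counts are assumptions:

```python
# Assumed average weights per item type (not specified on the slide)
WEIGHTS = {"external_inputs": 4, "external_outputs": 5,
           "external_inquiries": 4, "external_files": 7,
           "internal_files": 10}

def function_points(counts: dict, tcf: float = 1.0) -> float:
    """FP = UFC x TCF, where UFC is the weighted sum of the five counts."""
    ufc = sum(WEIGHTS[item] * n for item, n in counts.items())
    return ufc * tcf

# Hypothetical system with a TCF of 0.95
counts = {"external_inputs": 20, "external_outputs": 15,
          "external_inquiries": 10, "external_files": 5, "internal_files": 8}
print(function_points(counts, tcf=0.95))
```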
52 Function Points Example
53 Function Points Applications
- Used extensively as a size measure in preference to LOC
- Examples:
- Productivity = FP / person-months of effort
- Quality = defects / FP
- Effort prediction: E = f(FP)
54 Function Points and Program Size
Source statements per FP, by language:
Assembler: 320
C: 150
Algol: 106
COBOL: 106
FORTRAN: 106
Pascal: 91
RPG: 80
PL/1: 80
MODULA-2: 71
PROLOG: 64
LISP: 64
BASIC: 64
4GL Database: 40
APL: 32
SMALLTALK: 21
Query languages: 16
Spreadsheet languages: 6
55 The 1990s: Broader Perspective
- Reports on company-wide measurement programmes
- Benchmarking
- Impact of the SEI's CMM process assessment
- Use of metrics tools
- Measurement theory as a unifying framework
- Emergence of international software measurement standards:
- measuring software quality
- function point counting
- general data collection
56 The SEI Capability Maturity Model
Level 5 Optimising: process change management, technology change management, defect prevention
Level 4 Managed: software quality management, quantitative process management
Level 3 Defined: peer reviews, training programme, intergroup coordination, integrated s/w management, organization process definition/focus
Level 2 Repeatable: s/w configuration management, s/w QA, s/w project planning, s/w subcontract management, s/w requirements management
Level 1 Initial/ad-hoc
57 Results of 1987-1991 SEI Assessments
Percentage of assessed organizations at each maturity level:
Level 1 | Level 2 | Level 3 | Level 4 | Level 5
81% | 12% | 7% | 0% | 0%
87% | 9% | 4% | 0% | 0%
62% | 23% | 15% | 0% | 0%
58 Process improvement at Motorola
[Chart: in-process defects/MAELOC]
59 IBM Space Shuttle Software Metrics Program (1)
[Chart: early detection rate against total inserted error rate]
60 IBM Space Shuttle Software Metrics Program (2)
[Chart: predicted total error rate trend (errors per KLOC, 0-14) across onboard flight software releases 1 through 8F, showing actual values against the expected trend with 95% high and 95% low bounds]
61 IBM Space Shuttle Software Metrics Program (3)
[Chart: onboard flight software failures occurring per base system (basic operational increment)]
62 ISO 9126 Software Product Evaluation Standard
- Quality characteristics and guidelines for their use
- Chosen characteristics are:
- Functionality
- Reliability
- Usability
- Efficiency
- Maintainability
- Portability
63 Lesson 2 Summary
- Measurement activities should be goal-driven
- A Metrics Plan details how to create a metrics programme to meet specific technical objectives
- Software metrics are usually driven by objectives:
- productivity assessment
- cost/effort estimation
- quality assessment and prediction
- All common metrics are traceable to the above objectives
- Recent trend away from specific metrics and models toward company-wide metrics programmes
- Software measurement is now widely accepted as a key subject area in software engineering
64 LESSON 3: SOFTWARE METRICS FRAMEWORK
65 Lesson 3 Objectives
- Learn basic measurement definitions and a software metrics framework that conforms to these
- Understand how and why diverse metrics activities fit into the framework
- Learn how to define your own relevant metrics in a rigorous way
- Bring it all together in a case study
66 Software Measurement Activities
Are these diverse activities related?
67 Opposing Views on Measurement?
- "When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre kind." - Lord Kelvin
- "In truth, a good case could be made that if your knowledge is meagre and unsatisfactory, the last thing in the world you should do is make measurements. The chance is negligible that you will measure the right things accidentally." - George Miller
68 Definition of Measurement
Measurement is the process of empirical, objective assignment of numbers to entities, in order to characterise a specific attribute.
- Entity: an object or event
- Attribute: a feature or property of an entity
- Objective: the measurement process must be based on a well-defined rule whose results are repeatable
69 Example Measures
70 Avoiding Mistakes in Measurement
- Common mistakes in software measurement can be avoided simply by adhering to the definition of measurement. In particular:
- You must specify both entity and attribute
- The entity must be defined precisely
- You must have a reasonable, intuitive understanding of the attribute before you propose a measure
- The theory of measurement formalises these ideas
71 Be Clear About Your Attribute
- It is a mistake to propose a measure if there is no consensus on what attribute it characterises
- Results of an IQ test:
- intelligence?
- or verbal ability?
- or problem solving skills?
- defects found / KLOC:
- quality of code?
- quality of testing?
72 A Cautionary Note
- We must not re-define an attribute to fit in with an existing measure.
73 Types and uses of measurement
- Two distinct types of measurement:
- direct measurement
- indirect measurement
- Two distinct uses of measurement:
- for assessment
- for prediction
- Measurement for prediction requires a prediction system
74 Some Direct Software Measures
- Length of source code (measured by LOC)
- Duration of testing process (measured by elapsed time in hours)
- Number of defects discovered during the testing process (measured by counting defects)
- Effort of a programmer on a project (measured by person-months worked)
75 Some Indirect Software Measures
- Programmer productivity = LOC produced / person-months of effort
- Module defect density = number of defects / module size
- Defect detection efficiency = number of defects detected / total number of defects
- Requirements stability = number of initial requirements / total number of requirements
- Test effectiveness ratio = number of items covered / total number of items
- System spoilage = effort spent fixing faults / total project effort
(A small computational sketch follows.)
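A minimal sketch computing the slide's six ratios from raw counts; every input value below is hypothetical:

```python
def indirect_measures(d: dict) -> dict:
    """Compute the slide's indirect measures from direct counts."""
    return {
        "programmer_productivity": d["loc"] / d["person_months"],
        "module_defect_density": d["defects"] / d["module_kloc"],
        "defect_detection_efficiency": d["defects_detected"] / d["total_defects"],
        "requirements_stability": d["initial_reqs"] / d["total_reqs"],
        "test_effectiveness_ratio": d["items_covered"] / d["total_items"],
        "system_spoilage": d["fix_effort"] / d["total_effort"],
    }

print(indirect_measures({
    "loc": 12000, "person_months": 10, "defects": 18, "module_kloc": 2.0,
    "defects_detected": 45, "total_defects": 60, "initial_reqs": 80,
    "total_reqs": 100, "items_covered": 180, "total_items": 200,
    "fix_effort": 30, "total_effort": 400}))
```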
76 Predictive Measurement
- Measurement for prediction requires a prediction system. This consists of:
- Mathematical model
- e.g. E = aS^b, where E is effort in person-months (to be predicted), S is size (LOC), and a and b are constants
- Procedures for determining model parameters
- e.g. use regression analysis on past project data to determine a and b
- Procedures for interpreting the results
- e.g. use Bayesian probability to determine the likelihood that your prediction is accurate to within 10%
77 No Short Cut to Accurate Prediction
- "Testing your methods on a sample of past data gets to the heart of the scientific approach to gambling. Unfortunately this implies some preliminary spadework, and most people skimp on that bit, preferring to rely on blind faith instead" - Drapkin and Forsyth 1987
- Software prediction (such as cost estimation) is no different from gambling in this respect
78 Products, Processes, and Resources
- Process: a software related activity or event
- testing, designing, coding, etc.
- Product: an object which results from a process
- test plans, specification and design documents, source and object code, minutes of meetings, etc.
- Resource: an item which is input to a process
- people, hardware, software, etc.
79 Internal and External Attributes
- Let X be a product, process, or resource
- External attributes of X are those which can only be measured with respect to how X relates to its environment
- e.g. reliability or maintainability of source code (product)
- Internal attributes of X are those which can be measured purely in terms of X itself
- e.g. length or structuredness of source code (product)
80 The Framework Applied
PRODUCTS (e.g. specification, source code, ...):
- Internal attributes: length, functionality (specification); modularity, structuredness, reuse (source code); ...
- External attributes: maintainability, reliability, ...
PROCESSES (e.g. design, test, ...):
- Internal attributes: time, effort, spec faults found (design); time, effort, failures observed (test); ...
- External attributes: stability, cost-effectiveness, ...
RESOURCES (e.g. people, tools, ...):
- Internal attributes: age, price, CMM level (people); price, size (tools); ...
- External attributes: productivity (people); usability, quality (tools); ...
81 Lesson 3 Summary
- Measurement is about characterising attributes of entities
- Measurement can be either direct or indirect
- Measurement is either for assessment or prediction
- The framework for software measurement is based on:
- classifying software entities as products, processes, and resources
- classifying attributes as internal or external
- determining whether the activity is assessment or prediction
- Only when you can answer all these questions are you ready for measurement
82 CASE STUDY: COMPANY OBJECTIVES
- Monitor and improve product reliability
- requires information about actual operational failures
- Monitor and improve product maintainability
- requires information about fault discovery and fixing
- Process improvement
- too high-level an objective for a metrics programme
- the previous objectives partially characterise process improvement
83 General System Information
- 27 releases since Nov '87 implementation
- Currently 1.6 million LOC in main system (15.2% increase from 1991 to 1992)
[Chart: LOC (0 to 1,600,000) in 1991 and 1992, split between COBOL and Natural]
84 Main Data

Fault Number | Week In | System Area | Fault Type | Week Out | Hours to Repair
... | ... | ... | ... | ... | ...
F254 | 92/14 | C2 | P | 92/17 | 5.5

- "faults" are really failures (the lack of a distinction caused problems)
- 481 (distinct) cleared faults during the year
- 28 system areas (functionally cohesive)
- 11 classes of faults
- Repair time: actual time to locate and fix the defect
85 Case Study Components
- 28 system areas
- All closed faults traced to system area
- System areas made up of Natural, Batch COBOL, and CICS COBOL programs
- Typically 80 programs in each. Typical program 1000 LOC
- No documented mapping of program to system area
- For most faults, batch repair and reporting
- No direct, recorded link between fault and program in most cases
- No database with program size information
- No historical database to capture trends
86 Single Incident Close Report
Fault id: F752
Reported: 18/6/92
Definition: logically deleted work done records appear on enquiries
Description: causes misleading info to users. Amend ADDITIONAL WORK PERFORMED RDVIPG2A to ignore work done records with FLAG-AMEND = 1 or 2
Programs changed: RDVIPG2A, RGHXXZ3B
SPE: Joe Bloggs
Date closed: 26/6/92
87 Single Incident Close Report: Improved Version
Fault id: F752
Reported: 18/6/92
Trigger: delete work done record, then open enquiry
End result: deleted records appear on enquiries, providing misleading info to users
Cause: omission of appropriate flag variables for work done records
Change: amend ADDITIONAL WORK PERFORMED in RDVIPG2A to ignore work done records with FLAG-AMEND = 1 or 2
Programs changed: RDVIPG2A, RGHXXZ3B
SPE: Joe Bloggs
Date closed: 26/6/92
88 Fault Classification
Non-orthogonal classification: Data, Micro, JCL, Operations, Misc, Unresolved, Program, Query, Release, Specification, User
89 Missing Data
- Recoverable:
- Size information
- Static/complexity information
- Mapping of faults to programs
- Severity categories
- Non-recoverable:
- Operational usage per system area
- Success/failure of fixes
- Number of repeated failures
90 Reliability Trend
[Chart: faults received per week (0-50) over weeks 10-50]
91 Identifying Fault-Prone Systems?
[Chart: number of faults per system area in 1992 (0-90), from area C2 (highest) down to area J]
92 Analysis of Fault Types
[Pie chart: the 481 faults by fault type - Program, Data, User, Query, Unresolved, Release, Misc, and others]
93 Fault Types and System Areas
[Chart: counts (0-70) of the most common fault types - Program, Data, User, Release, Unresolved, Query, Miscellaneous - across system areas]
94 Maintainability Across System Areas
[Chart: mean time to repair a fault (0-10 hours) by system area: D, O, S, W1, F, W, C3, P, L, G, C1, J, T, D1, G2, N, Z, C, C2, G1, U]
95 Maintainability Across Fault Types
[Chart: mean time to repair a fault (0-9 hours) by fault type: JCL, Program, Spec, Release, Operations, User, Unresolved, Misc, Data, Query]
96 Case study results with additional data: System Structure
97 Normalised Fault Rates (1)
[Chart: faults per KLOC]
98 Normalised Fault Rates (2)
[Chart: faults per KLOC]
99 Case Study 1 Summary
- The hard-to-collect data was mostly all there
- Exceptional information on post-release faults and maintenance effort
- It is feasible to collect this crucial data
- Some easy-to-collect (but crucial) data was omitted or not accessible
- The addition to the metrics database of some basic information (mostly already collected elsewhere) would have enabled proactive activity
- Goals almost fully met with the simple additional data
- Crucial explanatory analysis possible with simple additional data
- Goals of monitoring reliability and maintainability only partly met with existing data
100 LESSON 4: SOFTWARE METRICS MEASUREMENT THEORY AND STATISTICAL ANALYSIS
101 Lesson 4 Objectives
- To understand in a formal sense what it means to measure something and to know when we have a satisfactory measure
- To understand the different measurement scale types
- To understand which types of statistical analyses are valid for which scale types
- To be able to perform some simple statistical analyses relevant to software measurement data
102 Natural Evolution of Measures
- As our understanding of an attribute grows, it is possible to define more sophisticated measures, e.g. temperature of liquids:
- 200 BC - rankings, "hotter than"
- 1600 - first thermometer preserving "hotter than"
- 1720 - Fahrenheit scale
- 1742 - Centigrade scale
- 1854 - absolute zero, Kelvin scale
103 Measurement Theory Objectives
- Measurement theory is the scientific basis for all types of measurement. It is used to determine formally:
- when we have really defined a measure
- which statements involving measurement are meaningful
- what the appropriate scale type is
- what types of statistical operations can be applied to measurement data
104 Measurement Theory: Key Components
- Empirical relation system
- the relations which are observed on entities in the real world and which characterise our understanding of the attribute in question, e.g. "Fred taller than Joe" (for height of people)
- Representation condition
- real world entities are mapped to numbers (the measurement mapping) in such a way that all empirical relations are preserved in numerical relations and no new relations are created, e.g. M(Fred) > M(Joe) precisely when Fred is taller than Joe
- Uniqueness theorem
- which different mappings satisfy the representation condition? e.g. we can measure height in inches, feet, centimetres, etc., but all such mappings are related in a special way
105 Representation Condition
[Diagram: the measurement mapping M assigns numbers (e.g. 72 and 63) to the real-world entities Joe and Fred; the empirical relation "Joe taller than Fred" is preserved under M as the numerical relation M(Joe) > M(Fred)]
106 Meaningfulness in Measurement
- Some statements involving measurement appear more meaningful than others:
- "Fred is twice as tall as Jane"
- "The temperature in Tokyo today is twice that in London"
- "The difference in temperature between Tokyo and London today is twice what it was yesterday"
Formally, a statement involving measurement is meaningful if its truth value is invariant under transformations of allowable scales
107 Measurement Scale Types
- Some measures seem to be of a different type to others, depending on what kind of statements are meaningful. The 5 most important scale types of measurement, in increasing order of sophistication, are:
- Nominal
- Ordinal
- Interval
- Ratio
- Absolute
108 Nominal Scale Measurement
- Simplest possible measurement
- Empirical relation system consists only of different classes; there is no notion of ordering
- Any distinct numbering of the classes is an acceptable measure (we could even use symbols rather than numbers), but the size of the numbers has no meaning for the measure
109 Ordinal Scale Measurement
- In addition to classifying, the classes are also ordered with respect to the attribute
- Any mapping that preserves the ordering (i.e. any monotonic function) is acceptable
- The numbers represent ranking only, so addition and subtraction (and other arithmetic operations) have no meaning
110 Interval Scale Measurement
- Powerful, but rare in practice
- Distances between entities matter, but not ratios
- Mapping must preserve order and intervals
- Examples:
- Timing of events' occurrence, e.g. we could measure these in units of years, days, hours, etc., all relative to different fixed events. Thus it is meaningless to say "project X started twice as early as project Y", but meaningful to say "the time between project X starting and now is twice the time between project Y starting and now"
- Air temperature measured on the Fahrenheit or Centigrade scale
111 Ratio Scale Measurement
- Common in the physical sciences. The most useful scale of measurement
- Ordering, distance between entities, ratios
- Zero element (representing total lack of the attribute)
- Numbers start at zero and increase at equal intervals (units)
- All arithmetic can be meaningfully applied
112 Absolute Scale Measurement
- Absolute scale measurement is just counting
- The attribute must always be of the form "number of occurrences of x in the entity":
- number of failures observed during integration testing
- number of students in this class
- Only one possible measurement mapping (the actual count)
- All arithmetic is meaningful
113 Problems of measuring program complexity
- The attribute is "complexity" of programs
- Let R be the empirical relation "more complex than", and suppose xRy but neither xRz nor zRy
- Then no real-valued measure of complexity is possible
114 Validation of Measures
- Validation of a software measure is the process of ensuring that the measure is a proper numerical characterisation of the claimed attribute
- Example:
- a valid measure of length of programs must not contradict any intuitive notion about program length
- if program P2 is bigger than P1 then m(P2) > m(P1)
- if m(P1) = 7 and m(P2) = 9, then if P1 and P2 are concatenated, m(P1;P2) must equal m(P1) + m(P2) = 16
- A stricter criterion is to demonstrate that the measure is itself part of a valid prediction system
115 Validation of Prediction Systems
- Validation of a prediction system, in a given environment, is the process of establishing the accuracy of the predictions made by empirical means
- i.e. by comparing predictions against known data points
- Methods:
- Experimentation
- Actual use
- Tools:
- Statistics
- Probability
(A small sketch of one common accuracy check follows.)
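One common way to compare predictions against known data points is the mean magnitude of relative error; the slide does not name a specific statistic, so treat this choice (and the example figures) as an assumption:

```python
def mmre(actuals, predictions):
    """Mean magnitude of relative error over past data points."""
    errors = [abs(a - p) / a for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)

# Hypothetical effort data (person-months): actual vs model output
actual = [10, 26, 70, 210]
predicted = [12, 22, 80, 190]
print(f"MMRE = {mmre(actual, predicted):.2f}")   # lower is better
```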
116 Scale Types Summary
- Nominal: entities are classified. No arithmetic meaningful.
- Ordinal: entities are classified and ordered. Cannot use + or -.
- Interval: entities classified, ordered, and differences between them understood (units). No zero, but can use ordinary arithmetic on intervals.
- Ratio: zeros, units, ratios between entities. All arithmetic.
- Absolute: counting; only one possible measure. All arithmetic.
117 Meaningfulness and Statistics
- The scale type of a measure affects what operations it is meaningful to perform on the data
- Many statistical analyses use arithmetic operators
- These techniques cannot be used on certain data - particularly nominal and ordinal measures
118 Example: The Mean
- Suppose we have a set of values a1, a2, ..., an and wish to compute the "average"
- The mean is (a1 + a2 + ... + an) / n
- The mean is not a meaningful average for a set of ordinal scale data
119 Alternative Measures of Average
- Median: the midpoint of the data when it is arranged in increasing order; it divides the data into two equal parts
- suitable for ordinal data; not suitable for nominal data since it relies on order having meaning
- Mode: the commonest value
- suitable for nominal data
(A small computational sketch follows.)
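A minimal sketch of which average is meaningful at which scale type; the severity ranks and fault-type labels are hypothetical:

```python
from statistics import median, mode

severity = [1, 2, 2, 3, 1, 2, 5, 4, 2]     # ordinal ranks (hypothetical)
fault_type = ["program", "data", "user", "program", "data", "program"]

print(median(severity))     # meaningful for ordinal data
print(mode(fault_type))     # the only meaningful average for nominal data
# sum(severity) / len(severity) would be the mean - not meaningful here,
# since addition has no meaning for ordinal ranks
```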
120 Summary of Meaningful Statistics

Scale Type | Average | Spread
Nominal | Mode | Frequency
Ordinal | Median | Percentile
Interval | Arithmetic mean | Standard deviation
Ratio | Geometric mean | Coefficient of variation
Absolute | Any | Any
121 Non-Parametric Techniques
- Most software measures cannot be assumed to be normally distributed. This restricts the kind of analytical techniques we can apply.
- Hence we use non-parametric techniques:
- Pie charts
- Bar graphs
- Scatter plots
- Box plots
122 Box Plots
- Graphical representation of the spread of data
- Consists of a box with tails drawn relative to a scale
- Constructing the box plot:
- Arrange data in increasing order
- The box is defined by the median, upper quartile (u) and lower quartile (l) of the data. Box length b = u - l
- Upper tail is u + 1.5b, lower tail is l - 1.5b
- Mark any data items outside the upper or lower tail as outliers (x)
- If necessary, truncate tails (usually at 0) to avoid meaningless concepts like negative lines of code
(A small computational sketch follows.)
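A sketch of the slide's construction; the data are hypothetical module sizes in LOC:

```python
from statistics import quantiles

data = sorted([120, 340, 400, 480, 510, 560, 620, 700, 850, 2600])
l, m, u = quantiles(data, n=4)        # lower quartile, median, upper quartile
b = u - l                             # box length
upper_tail = u + 1.5 * b
lower_tail = max(0, l - 1.5 * b)      # truncate at 0: no negative LOC
outliers = [x for x in data if x < lower_tail or x > upper_tail]
print(f"box=({l}, {m}, {u}), tails=({lower_tail}, {upper_tail}), outliers={outliers}")
```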
123 Box Plots: Examples
124 Scatterplots
- Scatterplots are used to represent data for which two measures are given for each entity
- Two-dimensional plot where each axis represents one measure and each entity is plotted as a point in the 2-D plane
125 Example Scatterplot: Length vs Effort
126 Determining Relationships
[Scatterplot annotated with a linear fit, a non-linear fit, and possible outliers]
127 Causes of Outliers
- There may be many causes of outliers, some acceptable and others not. Further investigation is needed to determine the cause.
- Example: a long module with few errors may be due to:
- the code being of high quality
- the module being especially simple
- reuse of code
- poor testing
- Only the last requires action, although if it is the first it would be useful to examine further explanatory factors so that the good lessons can be learnt (was it use of a special tool or method, was it just because of good people or management, or was it just luck?)
128 Control Charts
- Help you to see when your data are within acceptable bounds
- By watching the data trends over time, you can decide whether to take action to prevent problems before they occur
- Calculate the mean and standard deviation of the data, and then two control limits
(A small computational sketch follows.)
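A sketch of the calculation; the slide does not fix the control-limit multiplier, so the 2-sigma limits below are an assumption (3-sigma is also common), and the data are hypothetical preparation-to-inspection ratios like those in the example that follows:

```python
from statistics import mean, stdev

prep_ratio = [2.8, 1.4, 2.2, 0.9, 3.6, 1.8, 2.0]   # hypothetical data

m, s = mean(prep_ratio), stdev(prep_ratio)
ucl = m + 2 * s                  # upper control limit (assumed 2-sigma)
lcl = max(0, m - 2 * s)          # lower control limit, truncated at 0
for i, x in enumerate(prep_ratio, start=1):
    flag = "ok" if lcl <= x <= ucl else "out of control"
    print(f"component {i}: {x:.1f} ({flag})")
```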
129 Control Chart Example
[Chart: preparation hours per hour of inspection (0-4.0) for components 1-7, plotted against the mean and the upper and lower control limits]
130 Lesson 4 Summary
- Measurement theory enables us to determine when a measure is properly defined and what its scale type is
- The scale type for a measure determines:
- which statements about the measure are meaningful
- which statistical operations can be applied to the data
- Most software metrics data comes from a non-normal distribution. This means that we need to use non-parametric analysis techniques:
- pie charts, bar graphs, scatterplots, and box plots
- Scatterplots and box plots are particularly useful for outlier analysis
- Finding outliers is a good starting point for software quality control
131 LESSON 5: EMPIRICAL RESULTS
132 Lesson 5 Objectives
- To see typical metrics from a major system
- To understand how these metrics cast doubt on common software engineering assumptions
- To understand from practical examples both the benefits and limitations of software metrics for quality control and assurance
- To learn how measurement is used to evaluate technologies in software engineering
- To appreciate how little is really known about what really works in software engineering
133 Case study: Basic data
- Major switching system software
- Modules randomly selected from those that were new or modified in each release
- A module is typically 2,000 LOC
- Only distinct faults that were fixed are counted
- Numerous metrics for each module
134 Hypotheses tested
- Hypotheses relating to the Pareto principle of distribution of faults and failures
- Hypotheses relating to the use of early fault data to predict later fault and failure data
- Hypotheses about metrics for fault prediction
- Benchmarking hypotheses
135 Hypothesis 1a: a small number of modules contain most of the faults discovered during testing
[Chart: cumulative % of faults (0-100) against % of modules (30, 60, 90)]
136 Hypothesis 1b
- If a small number of modules contain most of the faults discovered during pre-release testing, then this is simply because those modules constitute most of the code size.
- For release n, the 20% of the modules which account for 60% of the faults (discussed in hypothesis 1a) actually make up just 30% of the system size. The result for release n+1 was almost identical.
137 Hypothesis 2a: a small number of modules contain most of the operational faults?
[Chart: cumulative % of failures (0-100) against % of modules (10, 100)]
138 Hypothesis 2b
- If a small number of modules contain most of the operational faults, then this is simply because those modules constitute most of the code size.
- No: there is very strong evidence in favour of a converse hypothesis
- most operational faults are caused by faults in a small proportion of the code
- For release n, 100% of operational faults were contained in modules that make up just 12% of the entire system size. For release n+1, 80% of operational faults were contained in modules that make up 10% of the entire system size.
139 Higher incidence of faults in function testing implies higher incidence of faults in system testing?
[Chart: cumulative % of faults (0-100) in system testing (ST) and function testing (FT) against % of modules (15-90)]
140 Hypothesis 4: higher incidence of faults pre-release implies higher incidence of faults post-release?
- At the module level
- This hypothesis underlies the wide acceptance of the fault-density measure
141 Pre-release vs post-release faults
Modules that are fault-prone pre-release are NOT fault-prone post-release - this demolishes most defect prediction models
142 Size metrics: good predictors of fault- and failure-prone modules?
- Hypothesis 5a: smaller modules are less likely to be failure-prone than larger ones
- Hypothesis 5b: size metrics are good predictors of the number of pre-release faults in a module
- Hypothesis 5c: size metrics are good predictors of the number of post-release faults in a module
- Hypothesis 5d: size metrics are good predictors of a module's (pre-release) fault-density
- Hypothesis 5e: size metrics are good predictors of a module's (post-release) fault-density
143 Plotting faults against size
[Scatterplot of faults against lines of code: correlation, but poor prediction]
144 Cyclomatic complexity against pre- and post-release faults
Cyclomatic complexity is no better at prediction than KLOC (for either pre- or post-release faults)
145 Defect density vs size
[Scatterplot: defects per KLOC (0-35) against module size (0-10,000). Size is no indicator of defect density - this demolishes many software engineering assumptions]
146 Complexity metrics vs simple size metrics
- Are complexity metrics better predictors of fault- and failure-prone modules than simple size metrics? Not really, but they are available earlier
- The results of hypothesis 4 are devastating for metrics validation
- a "valid" metric is implicitly a very bad predictor of what it is supposed to be predicting
- However:
- complexity metrics can help to identify modules likely to be fault-prone pre-release at a very early stage (metrics like SigFF are available long before LOC)
- complexity metrics may be good indicators of maintainability
147 Benchmarking hypotheses
- Do software systems produced in similar environments have broadly similar fault densities at similar testing and operational phases?
148 Case study conclusions
- Pareto principle confirmed, but the normal explanations are wrong
- Complexity metrics are not significantly better than simple size measures
- Modules which are especially fault-prone pre-release are not especially fault-prone post-release; this result is very damaging to much software metrics work
- Clearly no causal link between size and defect density
- Crucial explanatory variables missing: testing effort and operational usage - incorporated in BBNs
149 Evaluating Software Engineering Technologies through Measurement
150 The Uncertainty of Reliability Achievement Methods
- Software engineering is dominated by revolutionary methods that are supposed to solve the software crisis
- Most methods focus on fault avoidance
- Proponents of methods claim theirs is best
- Adopting a new method can require a massive overhead with uncertain benefits
- Potential users have to rely on what the experts say
151 Actual Promotional Claims for Formal Methods
What are we to make of such claims?
152 The Virtues of Cleanroom
- "... industrial programming teams can produce software with unprecedented quality. Instead of coding in 50 errors per thousand lines of code and removing 90% by debugging to leave 5 errors per thousand lines, programmers using functional verification can produce code that has never been executed with less than 5 errors per thousand lines and remove nearly all of them in statistical testing."
- Mills H, Dyer M, Linger R, Cleanroom software engineering, IEEE Software, Sept 1987, 19-25
153 The Virtues of Verification (in Cleanroom)
- "If a program looks hard to verify, it is the program that should be revised, not the verification. The result is high productivity in producing software that requires little or no debugging."
- Mills H, Dyer M, Linger R, Cleanroom software engineering, IEEE Software, Sept 1987, 19-25
154 Use of Measurement in Evaluating Methods
- Measurement is the only truly convincing means of establishing the efficacy of a method/tool/technique
- Quantitative claims must be supported by empirical evidence
- We cannot rely on anecdotal evidence. There is simply too much at stake.
155 Weinberg-Schulman Experiment
Each team was asked to optimise a different objective; the table shows each team's resulting rank (1 = best) on every objective:

Team objective | Completion time | Program size | Data space used | Program clarity | User-friendly output
Completion time | 1 | 4 | 4 | 5 | 3
Program size | 2-3 | 1 | 2 | 3 | 5
Data space used | 5 | 2 | 1 | 4 | 4
Program clarity | 4 | 3 | 3 | 1-2 | 2
User-friendly output | 2-3 | 5 | 5 | 1-2 | 1

Ref: Weinberg GM and Schulman EL, Goals and performance in computer programming, Human Factors 16(1), 1974, 70-77
156 Empirical Evidence About Software Engineering Methods
- Limited support for n-version programming
- Little public evidence to support claims made for formal methods or OOD
- Conflicting evidence on CASE
- No conclusive evidence even to support structured programming
- Inspection techniques are cost-effective (but ill-defined)
We know almost nothing about which (if any) software engineering methods really work
157 The Case of Flowcharts vs Pseudocode (1)
- "... flowcharts are merely a redundant presentation of the information contained in the programming statements"
- Shneiderman et al, Experimental investigations of the usability of detailed flowcharts in programming, Comm ACM, June 1977, 861-881
- This led to flowcharts being shunned as a means of program or algorithm documentation
- "... flowcharts should be avoided as a form of program documentation"
- J Martin and C McClure, Diagramming Techniques for Analysts and Programmers, Prentice-Hall, 1985
158 The Case of Flowcharts vs Pseudocode (2)
- "... these experiments were flawed in method and/or used unstructured flowcharts"
- "... significantly less time is required to comprehend algorithms presented as flowcharts"
- DA Scanlan, Structured flowcharts outperform pseudocode: an experimental comparison, IEEE Software, Sept 1989, 28-36
159 The Evidence for Structured Programming
- "The precepts of structured programming are compelling, yet the empirical evidence is equivocal"
- I Vessey and R Weber, Research on structured programming: an empiricist's evaluation, IEEE Trans Software Eng, 10, July 1984, 397-407
It is hard to know which claims we can believe
160 The Virtues of Structured Programming
- "When a program was claimed to be 90% done with solid top-down structured programming, it would take only 10% more effort to complete it (instead of another 90%)."
- Mills H, Structured programming: retrospect and prospect, IEEE Software, 3(6), Nov 1986, 55-66
161 Management Before Technology
- Results of SQE's extensive survey were summarised as:
- "Best projects do not necessarily have state of the art methodologies or extensive automation and tooling. They do rely on basic principles such as strong team work, project communication, and project controls. Good organization appears to be far more of a critical success factor than technology or methodology."
- Hetzel B, Making Software Measurement Work, QED, 1993
162 Formal Methods: Rewarding Quantified Success
- The Queen's Award for Technological Achievement 1990 to INMOS and Oxford University PRG
- "Her Majesty the Queen has been graciously pleased to approve the Prime Minister's recommendation that the award should be conferred this year ... for the development of formal methods in the specification and design of microprocessors ... The use of formal methods has enabled development time to be reduced by 12 months"
- The 1991 award went to PRG and IBM Hursley for the use of formal methods (Z) on CICS.
163 IBM/PRG Project: Use of Z in CICS
- Many measurements of the process of developing CICS/ESA V3.1 were conducted by IBM
- Costs of development reduced by almost 5.5M (8%)
- Significant decreases in product failure rate claimed
- "The moral of this tale is that formal methods can not only improve quality, but also the timeliness and cost of producing state-of-the-art products"
- Jones G, Queen's Award for Technology, e-mail broadcast, Oxford University PRG, 1992
But the quantitative evidence is not in the public domain
164 CICS study: problems found during development cycle
[Chart: problems per KLOC at each development phase (Pld, Cld, Mld, Ut, Fv, St, Ca), comparing modules where Z was used against non-Z modules]
165 Comprehensibility of Formal Specifications
- "After a week's training in formal specification, engineers can use it in their work"
- ConForm project summary, European Focus, Issue 8, 1997
- "Use of a formal method is no longer an adventure; it is becoming routine"
- FM99 World Congress of Formal Methods, Publicity Material, 1998
166 Difficulty of understanding Z
[Chart: number of students against number of correct responses]
167 Experiment to assess the effect of structuring Z on comprehension
- 65 students (who had completed an extensive Z course). Blocking applied to groups
- Specification A: monolithic, 121 lines mostly in one Z schema
- Specification B: 6 main schemas, each approx 20 lines. Total spec 159 lines
- Specification C: 18 small schemas. Total spec 165 lines
168 Comparisons of scores for the different specifications
[Chart: score out of 60 for each of the 25 students, comparing Specification A (monolithic), B (6 schemas), and C (small schemas)]
169 Formal Methods for Safety-Critical Systems
- Wide consensus that formal methods must be used
- Formal methods mandatory in Def Stan 00-55
- "These mathematical approaches provide us with the best available approach to the development of high-integrity systems."
- McDermid JA, Safety critical systems: a vignette, IEE Software Eng J, 8(1), 2-3, 1993
170 SMARTIE Formal Methods Study: CDIS Air Traffic Control System
- Best quantitative evidence yet to support FM
- Mixture of formally (VDM, CCS) and informally developed modules
- The techniques used resulted in extraordinarily high levels of reliability (0.81 failures per KLOC)
- Little difference in the total number of pre-delivery faults for formal and informal methods (though unit testing revealed fewer errors in modules developed using formal techniques), but a clear difference in post-delivery failures
171 CDIS fault report form
172 Relative sizes and changes reported for each design type in delivered code

Design Type | Total Lines of Delivered Code | Fault-Report-Generated Code Changes | Changes per KLOC | Number of Delivered Modules | Number of Modules Changed | Percent of Delivered Modules Having Changes
FSM | 19064 | 260 | 13.6 | 67 | 52 | 78%
VDM | 61061 | 1539 | 25.2 | 352 | 284 | 81%
VDM/CCS | 22201 | 202 | 9.1 | 82 | 57 | 70%
Formal | 102326 | 2001 | 19.6 | 501 | 393 | 78%
Informal | 78278 | 1644 | 21.0 | 469 | 335 | 71%
173 Code changes by design type for modules requiring many changes

Design Type | Total Number of Modules Changed | Modules with Over 5 Changes Per Module | Percent of Modules Changed | Modules with Over 10 Changes Per Module | Percent of Modules Changed
FSM | 58 | 11 | 16% | 8 | 12%
VDM | 284 | 89 | 25% | 35 | 19%
VDM/CCS | 58 | 11 | 13% | 3 | 4%
Formal | 400 | 111 | 2