Title: Modeling, Analyzing and Engineering NASA's Safety Culture
1 Modeling, Analyzing and Engineering NASA's Safety Culture
Betty Barrett, John Carroll, Joel Cutcher-Gershenfeld, Nicolas Dulac, Nancy Leveson, Karen Marais, David Zipkin
Massachusetts Institute of Technology
- Presentation to Universities Space Research Association, January 2005
2 Motivation
- "The foam debris hit was not the single cause of the Columbia accident, just as the failure of the joint seal that permitted O-ring erosion was not the single cause of Challenger. Both Columbia and Challenger were lost also because of the failure of NASA's organizational system."
- Columbia Accident Investigation Board (CAIB) report, August 2003, p. 195
3 Core Hypothesis
- Safety decision making and dynamics can be modeled, analyzed and engineered just like physical systems. The models will be useful in designing and validating improvements to the risk management and safety culture, in evaluating the potential impact of changes and policy decisions, in assessing risks, in detecting when risk is increasing to unacceptable levels, and in performing root cause analysis.
- Defining Organizational Culture (three levels; Schein, 1985)
  - Level 1: Visible Artifacts
  - Level 2: Stated Policies and Principles
  - Level 3: Underlying Values and Assumptions
4 Assumptions: NASA Safety Culture
- Gap between vision and reality
- No one single culture
- Mitigation of risk, not elimination of risk
Visual Image for the Project
- An electronic equivalent of the canary in a coal mine
5 Case Example
- Incremental loss of independence
- Shuttle SSRP (originally called the Senior Safety Review Board and now known as the System Safety Review Panel) established in 1981
- Over two decades with twists and turns
- Safety, Reliability, and Quality Assurance (SRQA) established with membership and chair from the safety organizations
- First, advisory input from Space Shuttle Program
- Then, representation from the Program
- Then, leadership from the Program
- Ultimately, full shift in responsibility of SRQA to the Space Shuttle Program in 2000
- Project manager now decides how much safety services to purchase!
6 Introduction to System Safety
- Safety as an emergent, system property
- The Problem
  - Component-level focus on reliability and redundancy was incomplete
  - Fly-fix-fly became unacceptable
- System Safety
  - Emerged after WWII: Jerome Lederer's Flight Safety Foundation
  - Focus on interfaces of particular components or operations and system-level hazards
  - Still a challenge relative to the component-focused mindset
7 Chain-of-Events Accident Causality Models
- Explain accidents in terms of multiple events, sequenced as a forward chain over time.
- Events linked together by direct relationships (ignore indirect, non-linear relationships).
- Events almost always involve component failure, human error, or energy-related events.
8 Limitations of Event-Chain Causality Models
- Social and organizational factors
- System accidents
- Software Error
- Human Error
  - Cannot effectively model human behavior by decomposing it into individual decisions and actions and studying it in isolation from
    - physical and social context
    - value system in which it takes place
    - dynamic work process
- Adaptation
  - Major accidents involve systematic migration of organizational behavior to higher levels of risk.
9 A Systems Theory Model of Accidents
- Return to a core principle: Safety as an Emergent Property
- Accidents arise from interactions among
  - People
  - Societal and organizational structures
  - Engineering activities
  - Physical system components
- that violate the constraints on safe component behavior and interactions
- Need to include entire socio-technical system
10 A Systems Theory Model of Accidents
- Systems should not be treated as a static design
- A socio-technical system is a dynamic process continually adapting to achieve its ends and to react to changes in itself and its environment
- Preventing accidents requires designing a control structure to enforce constraints on system behavior and adaptation
11 (No Transcript)
12 A Systems Theory Model of Accidents
- Views accidents as a control problem
  - O-ring did not control propellant gas release by sealing gap in field joint
  - Software did not adequately control descent speed of Mars Polar Lander
- Events are the result of the inadequate control
  - Result from lack of enforcement of safety constraints
- To understand accidents, we need to examine the control structure itself to determine why it was inadequate to maintain safety constraints
Not a blame model: trying to understand why
13 Modeling Accidents Using STAMP
- Three types of models are needed
  - Static safety control structure
    - Safety requirement and constraint
    - Flawed control action
    - Context (social, political, etc.)
    - Mental model flaws
    - Coordination flaws
  - Structural dynamics
    - How the static safety control structure changed over time
  - Behavioral dynamics
    - Dynamic processes behind the changes (i.e., why the system changes)
Possible to model, analyze, and engineer the safety culture
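The first two model types can be pictured as data: a static control structure is a set of control links, and structural dynamics is the difference between two snapshots of that set. The sketch below is only an illustration under stated assumptions. The class names, the constraint wording, and the two simplified snapshots (loosely based on the 1981 and 2000 dates in the case example) are hypothetical and are not taken from the STAMP models themselves.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ControlLink:
    """One control relationship: who is meant to enforce which safety constraint."""
    controller: str
    controlled: str
    constraint: str


@dataclass
class ControlStructure:
    """Snapshot of the static safety control structure at one point in time."""
    year: int
    links: frozenset = field(default_factory=frozenset)


def structural_changes(before: ControlStructure, after: ControlStructure):
    """Structural dynamics: links removed and links added between two snapshots."""
    removed = before.links - after.links
    added = after.links - before.links
    return removed, added


# Hypothetical, simplified snapshots; names and constraint text are illustrative only.
CONSTRAINT = "hazards are reviewed independently of program pressures"
snapshot_1981 = ControlStructure(
    year=1981,
    links=frozenset({ControlLink("Safety organizations", "Safety review panel", CONSTRAINT)}),
)
snapshot_2000 = ControlStructure(
    year=2000,
    links=frozenset({ControlLink("Space Shuttle Program", "Safety review panel", CONSTRAINT)}),
)

removed, added = structural_changes(snapshot_1981, snapshot_2000)
for link in removed:
    print(f"lost:   {link.controller} -> {link.controlled}")
for link in added:
    print(f"gained: {link.controller} -> {link.controlled}")
```

Behavioral dynamics (why the structure changed) is not captured by a snapshot diff like this; it needs a time-dependent model.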
14 Introduction to System Safety Modeling
- Orientation to Systems Dynamics modeling
- Overall model structure
- Unpacking one element of the model
- Three sample scenarios
15 Orientation to Systems Dynamics Modeling
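As a rough orientation, a system dynamics model is built from stocks (quantities that accumulate) and flows (rates that fill or drain them), stepped forward in time. The sketch below shows one stock with a constant inflow and a proportional outflow, which forms a balancing feedback loop. The function name, parameter names, and values are assumptions chosen for illustration, not part of the NASA model.

```python
def simulate_stock(initial: float, inflow: float, outflow_fraction: float,
                   months: int, dt: float = 1.0) -> list:
    """Euler-integrate d(stock)/dt = inflow - outflow_fraction * stock."""
    stock = initial
    history = [stock]
    for _ in range(int(months / dt)):
        # The outflow depends on the stock itself, which closes the feedback loop.
        net_flow = inflow - outflow_fraction * stock
        stock += net_flow * dt
        history.append(stock)
    return history


# Hypothetical values: 2 units per month arrive, 10% of the stock leaves each month.
trajectory = simulate_stock(initial=0.0, inflow=2.0, outflow_fraction=0.1, months=120)
print(f"stock after 120 months: {trajectory[-1]:.2f}")
```

With these values the stock climbs toward the level where inflow equals outflow (inflow divided by outflow_fraction) and then stays there.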
16 Overall Model Structure
[Figure: overall model structure showing the linked submodels: Launch Rate; System Safety Resource Allocation; System Safety Status; Perceived Success by Administration; Shuttle Aging and Maintenance; System Safety Efforts and Efficacy; Incident Learning and Corrective Action; Risk; System Safety Knowledge, Skills and Staffing]
17 Overall Model Structure
[Figure: overall model structure, with the same submodels as on the previous slide]
18 Complete Learning Model
19 Unpacking One Element: Learning and Corrective Actions
20 How can this model help us learn about NASA?
- It allows us to
  - Understand how and why accidents have occurred
  - Test and validate changes and new policies
  - Learn which levers have a significant and sustainable effect
  - Facilitate the identification and tracking of metrics to detect increasing risk
- In order to do the above, we need to be comfortable with the model
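One simple way to picture "tracking metrics to detect increasing risk" is a trailing-average alarm on a risk indicator. This is a minimal sketch under assumptions: the function name, the indicator series, the window and the threshold are all hypothetical, and the presentation does not specify how its metrics would be monitored.

```python
from typing import List, Optional


def first_alarm(indicator: List[float], window: int, threshold: float) -> Optional[int]:
    """Return the index of the first period whose trailing-window mean exceeds threshold."""
    for end in range(window, len(indicator) + 1):
        mean = sum(indicator[end - window:end]) / window
        if mean > threshold:
            return end - 1
    return None


# Hypothetical monthly indicator values, for illustration only.
readings = [0.1, 0.1, 0.2, 0.3, 0.5, 0.6]
alarm_index = first_alarm(readings, window=3, threshold=0.3)
print(f"first alarm at index: {alarm_index}")
```

Averaging over a window trades responsiveness for fewer false alarms; a real indicator would need its own validated threshold.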
21 Model Results
[Figure: "Attempts to address systemic factors" (dimensionless, 0 to 1) plotted against Time (Months), 0 to 900; a second vertical scale runs from 0 to 400 months]
Changes made after an accident were ineffective over the long run in solving the systemic problems
22 Model Results
[Figure: "Risk" (Risk Units, 0 to 0.2) plotted against Time (Months), 0 to 900; a second vertical scale runs from 0 to 400 months]
Response to accidents had very little impact on actual risk
23 Model Results
[Figure: "Safety and performance compete for resources": Perceived priority of safety and Perceived priority of performance (scale 0 to 4) plotted against Time (Months), 0 to 1000]
Accidents lead to a reevaluation of NASA safety and performance priorities
24 Scenario A: Impact of fixing systemic factors vs. symptoms
[Figure: Risk (0 to 1) plotted against Time (Months), 0 to 1000]
The system risk quickly escalates if only symptoms are fixed and systemic factors are not addressed
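The qualitative contrast in this scenario can be sketched with a toy two-stock model. It is not the authors' model and does not reproduce their curves. The structure (systemic factors drive risk up, symptom fixes drain risk, systemic fixes erode the driver) and every rate below are assumptions made for illustration.

```python
def simulate_risk(months: int, systemic_fix_rate: float, symptom_fix_rate: float = 0.01,
                  risk_growth: float = 0.05, initial_risk: float = 0.1) -> list:
    """Toy two-stock model, stepped monthly with Euler integration.

    systemic: strength of systemic factors (starts at 1.0, dimensionless)
    risk:     system risk on a 0..1 scale
    """
    systemic, risk = 1.0, initial_risk
    history = [risk]
    for _ in range(months):
        # Systemic factors push risk up; symptom fixes pull it down proportionally.
        d_risk = risk_growth * systemic * (1.0 - risk) - symptom_fix_rate * risk
        # Systemic factors erode only if they are addressed directly.
        d_systemic = -systemic_fix_rate * systemic
        risk += d_risk
        systemic += d_systemic
        history.append(risk)
    return history


symptoms_only = simulate_risk(months=1000, systemic_fix_rate=0.0)
systemic_too = simulate_risk(months=1000, systemic_fix_rate=0.02)
print(f"final risk, symptoms only:  {symptoms_only[-1]:.3f}")
print(f"final risk, systemic fixes: {systemic_too[-1]:.3f}")
```

In this toy, fixing only symptoms leaves risk settling at a high level because the driver never weakens, while eroding the systemic factors lets the same symptom fixes eventually bring risk down.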
25 Scenario B: Independence of Safety Decision Makers
[Figure: Risk (0 to 0.2) plotted against Time (Months), 0 to 1000]
- Assumes an Independent Safety Organization that ensures
  - the assignment of high-ranked and highly regarded personnel to the safety organization
  - more power and authority to the safety organization
  - staff can make reports without fear of blame
  - an increase in the percentage of incidents that are reported
  - higher employee participation in the investigation
  - an unbiased evaluation of proposed corrective actions emphasizing solutions that address systemic factors
26 Scenario C: Increased Contracting
[Figure: Risk (0 to 1) plotted against Time (Months), 0 to 1000]
There is a tipping point at which NASA is not able to perform the integration and safety oversight that is its responsibility. After this point, the risk escalates substantially.
27 Lessons Learned
- Without addressing systemic factors, accidents persist and risk increases
- Increasing a safety organization's independence has a positive effect on system risk
- There are certain tipping points beyond which the behavior of the system is significantly different
Many other lessons will be extracted from the model after further analysis!
28 Phase I Accomplishments
- A working model!
- Interrelationships among models for
  - Launch Rate
  - Perceived Program Success
  - Shuttle Aging and Maintenance
  - Incident Learning and Corrective Action
  - System Safety Efforts and Efficacy
  - System Safety Resource Allocation
  - System Safety Knowledge, Skills and Staffing
  - System Safety Status
  - Risk
29 Next Step Implications
- Further validation of the model
- Further analysis using the model
- Development of a robust system safety "what if" flight simulator tool for field use
- Development of user interface and support process for flight simulator
- Pilot deployment in the field and ongoing PDCA improvement
30 Conclusions
- Organizational and institutional aspects of safety systems can be modeled with rigor and utility comparable to technical systems models
- The Promise
  - Detecting in advance indications of migration toward heightened risk
  - Assessing risk/benefits associated with potential changes in organizational structure and systems
  - Building a systems safety approach into organizational strategy, structure and process
31 Appendix
32 Project Work Plan
- Phase I (six months)
  - Models of NASA Shuttle Program Safety Culture and Safety Control Structure
  - Focus on hazards at the interfaces of components and operations, as well as dynamics over time
  - Incorporate insights from Challenger and Columbia accident reports: a rare window into relationships and interactions
  - Build on efforts of weekly study group (faculty, research staff and doctoral students)
- Phase II (two years)
  - Validation of model, incorporation of toolkit for "what if" analysis, and integration of metrics
  - Partnership with NASA around operationalization and implementation
33 Elements of Relevant Social Systems
- Formal organizational safety structure
  - Headquarters Office of Safety and Mission Assurance; SMA offices at NASA centers and facilities; NASA Engineering and Safety Center (NESC); and safety roles of managers, engineers, civil servants, contractors and others
- Organizational sub-systems
  - Communications systems, safety information support systems, analysis and decision making systems, reward and reinforcement systems, selection and retention systems, skills and training systems, organizational learning systems, incident investigation systems (including in-flight anomalies (IFAs)), and conflict resolution systems, etc.
- Safety rules and procedures
  - Specific rules and procedures; underlying assumptions and principles; dynamics over time
- Individual behavior: motivation and capability
  - Commitment to safety values; knowledge, skills, and ability with respect to safety tools and methods; group dynamics; fear of surfacing safety issues; learning from mistakes; etc.
34 Full Social Systems Framework
- Structure and Sub-Systems
  - Structure
    - Groups: ongoing and ad hoc (formal and informal)
    - Organizations: hierarchies, networks, layers (formal and informal)
    - Institutions
    - Industries
    - Markets
  - Sub-Systems
    - Communications systems
    - Information systems
    - Reward and reinforcement systems
    - Selection and retention systems
    - Learning and feedback systems
    - Complaint and conflict resolution systems
- Social Interaction Processes
  - Leadership
  - Negotiations
  - Problem-solving
  - Decision-making
  - Teamwork
  - Partnership
- Capability and Motivation
  - Bias and human judgment
  - Individual knowledge, skills and ability
  - Group stages of development
  - Fear, satisfaction and commitment
- Culture, Vision and Strategy
  - Culture
    - Artifacts, attributes, assumptions
    - Gender and diversity
    - Cross-cultural dynamics
    - Dominant cultures and sub-cultures
  - Vision and Strategy
Importance of decompositional attention to details and integration across elements