RTES Approach to Software Fault Tolerance

Slides: 14

Provided by: jimk53

Learn more at: https://pingprod.fnal.gov

Transcript and Presenter's Notes

1
RTES Approach to Software Fault Tolerance

Jim Kowalkowski, Fermilab-CD

2
Purpose
To give you a brief, high-level overview of some
of the concepts, techniques, and tools that the
RTES collaboration developed to address software
reliability concerns in the BTeV trigger system.
(Although looking at these slides, one may think
it is not so brief.)
3
Very brief history

BTeV was developing a complex trigger system,
which included a large collection of FPGAs and
commodity hardware in two large clusters, with
High availability requirement
High throughput requirement
Real-time processing constraints
The RTES collaboration was formed to help assure
that the availability requirement was met
Four universities with expertise in software and
hardware fault tolerance, reliability
engineering, and real time processing
(I was the liaison between BTeV and RTES)

4
Problem and RTES Goal

Problem: Software reliability depends on
Physics detector-machine performance
Program testing procedures, implementation, and
design quality
Behavior of the electronics (front-end and within
the trigger)
Goal: Create fault handling infrastructure
capable of
Accurately identifying problems (where, what, and
why)
Compensating for problems (shifting the load,
changing thresholds)
Performing automated recovery procedures (restart
/ reconfiguration)
Accurate accounting
Being extended (capturing new detection/recovery
procedures)
Policy driven monitoring and control
(We also wanted to simplify operations.)

5
What aspects are interesting?

Hierarchical decomposition of problem, which
addresses
Real time constraints (react quickly when
necessary)
Scalability
Resource usage constraints
Protocols between levels
Separation of concerns
The various contributors write code specific to
their needs
Very low coupling
Linkage through message subscriptions
Separation of monitoring, problem detection, and
actions
The system can change dynamically (as it is
running)
Interprocess messaging infrastructure
Based on Elvin
A publish/subscribe system
Supplied gateways and routers at various levels
High-level abstractions
System behavior and configuration can be
expressed using domain-specific concepts and
terms
Tools actually executing the configuration can
evolve independently
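
A minimal sketch of the low-coupling idea on this slide, where monitoring, problem detection, and actions are linked only through message subscriptions. This is not the Elvin API: the `Broker` class, the topic names, the node names, and the queue-depth threshold are all assumptions made for illustration, and everything runs in one process.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class Broker:
    """Tiny in-process stand-in for a publish/subscribe service (hypothetical)."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Copy the list so handlers may subscribe while we iterate.
        for handler in list(self._subs[topic]):
            handler(message)


def run_demo() -> List[str]:
    broker = Broker()
    actions: List[str] = []

    # Detection: turns raw metrics into fault reports (threshold is illustrative).
    def detect(msg: dict) -> None:
        if msg["queue_depth"] > 100:
            broker.publish("fault/overload", {"node": msg["node"]})

    # Action: reacts to fault reports and knows nothing about how they were found.
    def mitigate(msg: dict) -> None:
        actions.append(f"shift load away from {msg['node']}")

    broker.subscribe("metrics/queue", detect)
    broker.subscribe("fault/overload", mitigate)

    # Monitoring: publishes measurements and nothing else.
    broker.publish("metrics/queue", {"node": "farm-07", "queue_depth": 40})
    broker.publish("metrics/queue", {"node": "farm-12", "queue_depth": 250})
    return actions


if __name__ == "__main__":
    print(run_demo())
```

Because the detector and the mitigation share only topic names, either can be replaced or added while the system is running without touching the other, which is the property the slide attributes to subscription-based linkage.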

6
Development approach

Generated a series of use cases that describe
typical system behavior.
Generated a set of prototypical problems that may
occur on each of the systems.
Generated a system architecture that looked
similar to the vision of the real system
Created demonstration systems that match this
architecture and emulate operation of various
parts of the system using RTES developed products
The purpose is to detect and react to a given set of
problems
Made Level 1 trigger project using DSP event
processors, Linux regional managers, and a
high-level control system
Made Level 2/3 trigger project using a Linux farm

7
Hierarchical Detection/Mitigation
8
Configuration through Modeling

Multi-aspect tool, separate views of
Hardware components and physical connectivity
Executables configuration and logical
connectivity
Fault handling behavior using hierarchical state
machines
Model interpreters can generate the system image
At the code fragment level (for fault handling)
Download scripts and configuration
Validation and testing
Modeling languages are application specific
Shapes, properties, associations, constraints
Appropriate to application/context
System architecture/configuration
Component states
Message structures
Fault mitigations
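
As a rough illustration of fault handling behavior expressed as a hierarchical state machine, the sketch below lets a child state inherit transitions from its parent. The state names, events, and mitigation actions are hypothetical and are not taken from the RTES models. The tables stand in for what a model interpreter might generate.

```python
from typing import Dict, List, Optional, Tuple

# State -> parent state (None for a top-level state). Names are illustrative only.
PARENT: Dict[str, Optional[str]] = {
    "Operational": None,
    "Running": "Operational",
    "Degraded": "Operational",
    "Failed": None,
}

# (state, event) -> (next state, mitigation action)
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("Running", "slow_response"): ("Degraded", "lower_threshold"),
    ("Degraded", "recovered"): ("Running", "restore_threshold"),
    # Defined once on the parent, inherited by Running and Degraded.
    ("Operational", "heartbeat_lost"): ("Failed", "restart_process"),
    ("Failed", "restarted"): ("Running", "resume_processing"),
}


def step(state: str, event: str) -> Tuple[str, Optional[str]]:
    """Handle an event in the current state, falling back to ancestor states."""
    node: Optional[str] = state
    while node is not None:
        if (node, event) in TRANSITIONS:
            return TRANSITIONS[(node, event)]
        node = PARENT[node]
    return state, None  # Unhandled events leave the state unchanged.


def run(events: List[str]) -> List[Tuple[str, str, Optional[str]]]:
    """Feed events to the machine, starting in Running, and log each step."""
    state = "Running"
    log: List[Tuple[str, str, Optional[str]]] = []
    for event in events:
        state, action = step(state, event)
        log.append((event, state, action))
    return log


if __name__ == "__main__":
    for entry in run(["slow_response", "heartbeat_lost", "restarted"]):
        print(entry)
```

Placing `heartbeat_lost` on the parent state means the restart mitigation is written once and applies to every operational substate, which keeps the tables small as more substates are added.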

9
Modeling Environment
10
ARMORs

Are multithreaded processes composed of
replaceable or pluggable building blocks called
Elements
Elements provide error detection and recovery
services to the trigger and other applications
Restarts, reconfiguration
Removal from service
ARMOR framework routes messages and schedules
Elements based on their message subscriptions
A hierarchy of ARMOR processes forms a
reconfigurable runtime environment
System management, error detection, and error
recovery services are distributed across ARMOR
processes
The ARMOR runtime environment can handle its own failures
ARMOR support for the application
Completely transparent and external support
Instrumentation with ARMOR API
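
The routing idea on this slide, a process built from pluggable Elements that receive messages according to their subscriptions, can be sketched as follows. This is not the ARMOR API: every class, method, and message name here is an assumption, and the sketch is single-threaded for brevity, whereas the slide describes ARMORs as multithreaded.

```python
from collections import deque
from typing import Deque, List, Tuple

Message = Tuple[str, str]  # (message type, payload)


class Element:
    """Pluggable building block; subclasses declare the message types they want."""

    subscriptions: Tuple[str, ...] = ()

    def handle(self, msg: Message) -> List[Message]:
        return []


class HeartbeatDetector(Element):
    subscriptions = ("heartbeat_missed",)

    def handle(self, msg: Message) -> List[Message]:
        # Error detection: turn a missed heartbeat into a failure report.
        return [("app_failed", msg[1])]


class RestartRecovery(Element):
    subscriptions = ("app_failed",)

    def __init__(self) -> None:
        self.restarted: List[str] = []

    def handle(self, msg: Message) -> List[Message]:
        # Recovery: a real Element would restart the process; here we only record it.
        self.restarted.append(msg[1])
        return []


class ArmorLikeProcess:
    """Routes queued messages to the Elements subscribed to each message type."""

    def __init__(self, elements: List[Element]) -> None:
        self.elements = elements
        self.queue: Deque[Message] = deque()

    def post(self, msg: Message) -> None:
        self.queue.append(msg)

    def run(self) -> None:
        # Deliver each message to every subscribed Element; replies are queued too.
        while self.queue:
            msg = self.queue.popleft()
            for element in self.elements:
                if msg[0] in element.subscriptions:
                    self.queue.extend(element.handle(msg))


if __name__ == "__main__":
    recovery = RestartRecovery()
    process = ArmorLikeProcess([HeartbeatDetector(), recovery])
    process.post(("heartbeat_missed", "trigger-worker-3"))
    process.run()
    print(recovery.restarted)
```

Swapping `RestartRecovery` for an Element that removes the application from service would change the recovery behavior without touching the detector, which is the point of making Elements replaceable.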