RTES Approach to Software Fault Tolerance

1
RTES Approach to Software Fault Tolerance
  • Jim Kowalkowski, Fermilab-CD

2
Purpose
To give you a brief, high-level overview of some
of the concepts, techniques, and tools that the
RTES collaboration developed to address software
reliability concerns in the BTeV trigger system.
(Although looking at these slides, one may think
it is not so brief.)
3
Very brief history
  • BTeV was developing a complex trigger system,
    which included a large collection of FPGAs and
    commodity hardware in two large clusters, with
      • High availability requirement
      • High throughput requirement
      • Real time processing constraints
  • The RTES collaboration was formed to help assure
    that the availability requirement was met
      • Four universities with expertise in software and
        hardware fault tolerance, reliability
        engineering, and real time processing
  • (I was the liaison between BTeV and RTES)

4
Problem and RTES Goal
  • Problem: Software reliability depends on
      • Physics detector-machine performance
      • Program testing procedures, implementation, and
        design quality
      • Behavior of the electronics (front-end and within
        the trigger)
  • Goal: Create fault handling infrastructure
    capable of
      • Accurately identifying problems (where, what, and
        why)
      • Compensating for problems (shifting the load,
        changing thresholds)
      • Performing automated recovery procedures (restart
        / reconfiguration)
      • Accurate accounting
      • Being extended (capturing new detection/recovery
        procedures)
      • Policy driven monitoring and control
  • (Also wanted to simplify operations)
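
The goal list above (identify where/what/why, compensate, recover, account, extend, drive by policy) can be sketched as a small policy table. This is only an illustrative sketch, not the RTES implementation: the `Fault` record, the `POLICY` table, the action names, and the node names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Fault:
    where: str  # component that reported the problem
    what: str   # fault type
    why: str    # suspected cause


# Hypothetical policy: fault type -> ordered list of actions to take.
POLICY: Dict[str, List[str]] = {
    "queue_overflow": ["shift_load", "raise_threshold"],  # compensation
    "process_hung": ["restart"],                          # automated recovery
    "bad_config": ["reconfigure"],
}

AccountingEntry = Tuple[str, str, str, str]


def handle(fault: Fault, log: List[AccountingEntry]) -> List[str]:
    """Look up the actions for a fault and record each one for accounting."""
    actions = POLICY.get(fault.what, ["notify_operator"])
    for action in actions:
        log.append((fault.where, fault.what, fault.why, action))
    return actions


log: List[AccountingEntry] = []
chosen = handle(Fault("l1-node-07", "process_hung", "no heartbeat"), log)

# Extending the system with a new recovery procedure is one new table entry.
POLICY["link_errors"] = ["reconfigure", "restart"]
```

Keeping the policy as data rather than code is one way to let operators change behavior without touching the detection or action logic.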

5
What aspects are interesting?
  • Hierarchical decomposition of problem, which
    addresses
      • Real time constraints (react quickly when
        necessary)
      • Scalability
      • Resource usage constraints
      • Protocols between levels
  • Separation of concerns
      • The various contributors write code specific to
        their need
  • Very low coupling
      • Linkage through message subscriptions
  • Separation of monitoring, problem detection, and
    actions
  • The system can change dynamically (as it is
    running)
  • Interprocess messaging infrastructure
      • Based on Elvin
      • A publish/subscribe system
      • Supplied gateways and routers at various levels
  • High-level abstractions
      • System behavior and configuration can be
        expressed using domain specific concepts and
        terms
      • Tools actually executing the configuration can
        evolve independently
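
To illustrate how message subscriptions keep monitoring, problem detection, and actions decoupled, here is a minimal in-process publish/subscribe sketch. It does not use or imitate the Elvin API; the `Broker` class, the topic names, the 0.9 threshold, and the node names are assumptions made up for the example.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List, Tuple

Handler = Callable[[Dict[str, Any]], None]


class Broker:
    """Toy stand-in for a publish/subscribe service (not Elvin)."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, message: Dict[str, Any]) -> None:
        # Copy the list so handlers may subscribe while we iterate.
        for handler in list(self._subs[topic]):
            handler(message)


broker = Broker()
actions: List[Tuple[str, str]] = []


def detector(reading: Dict[str, Any]) -> None:
    """Problem detection: turns raw readings into fault messages."""
    if reading["value"] > 0.9:
        broker.publish("fault/buffer_full", {"node": reading["node"]})


def actor(fault: Dict[str, Any]) -> None:
    """Action: reacts to faults without knowing who detected them."""
    actions.append(("shift_load", fault["node"]))


broker.subscribe("monitor/buffer_occupancy", detector)
broker.subscribe("fault/buffer_full", actor)

# Monitoring: publishes readings without knowing who listens.
broker.publish("monitor/buffer_occupancy", {"node": "farm-12", "value": 0.95})
broker.publish("monitor/buffer_occupancy", {"node": "farm-13", "value": 0.40})
```

Because the three roles only share topic names, any one of them can be replaced or added while the others keep running.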

6
Development approach
  • Generated a series of use cases that describe
    typical system behavior.
  • Generated a set of prototypical problems that may
    occur on each of the systems.
  • Generated a system architecture that looked
    similar to the vision of the real system
  • Created demonstration systems that match this
    architecture and emulate operation of various
    parts of the system using RTES developed products
  • Purpose is to detect and react to a given set of
    problems
  • Made Level 1 trigger project using DSP event
    processors, Linux regional managers, and a
    high-level control system
  • Made Level 2/3 trigger project using a Linux farm

7
Hierarchical Detection/Mitigation
8
Configuration through Modeling
  • Multi-aspect tool, separate views of
      • Hardware components and physical connectivity
      • Executables configuration and logical
        connectivity
      • Fault handling behavior using hierarchical state
        machines
  • Model interpreters can generate the system image
      • At the code fragment level (for fault handling)
      • Download scripts and configuration
      • Validation and testing
  • Modeling languages are application specific
      • Shapes, properties, associations, constraints
      • Appropriate to application/context
      • System architecture/configuration
      • Component states
      • Message structures
      • Fault mitigations
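
As a rough illustration of fault handling behavior expressed as a hierarchical state machine, the sketch below stores the machine as two tables, roughly the kind of data a model interpreter might emit. The state names, event names, and table layout are hypothetical and are not taken from the RTES modeling tool.

```python
from typing import Dict, Optional, Tuple

# Each state names its parent state (None for a top-level state).
PARENT: Dict[str, Optional[str]] = {
    "Operational": None,
    "Running": "Operational",
    "Degraded": "Operational",
    "Failed": None,
}

# (state, event) -> target state. A transition defined on a parent
# state applies to every child that does not handle the event itself.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("Running", "error_rate_high"): "Degraded",
    ("Degraded", "error_rate_normal"): "Running",
    ("Operational", "heartbeat_lost"): "Failed",
    ("Failed", "restart_ok"): "Running",
}


def step(state: str, event: str) -> str:
    """Return the next state, searching from the current state up
    through its ancestors for a matching transition."""
    current: Optional[str] = state
    while current is not None:
        target = TRANSITIONS.get((current, event))
        if target is not None:
            return target
        current = PARENT[current]
    return state  # no state in the chain handles the event; ignore it


state = "Running"
for event in ("error_rate_high", "heartbeat_lost", "restart_ok"):
    state = step(state, event)
```

The hierarchy pays off in the `heartbeat_lost` rule: it is written once on `Operational` and covers both `Running` and `Degraded`.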

9
Modeling Environment
10
ARMORs
  • Are multithreaded processes composed of
    replaceable or pluggable building blocks called
    Elements
  • Elements provide error detection and recovery
    services to the trigger and other applications
      • Restarts, reconfiguration
      • Removal from service
  • ARMOR framework routes messages and schedules
    Elements based on their message subscriptions
  • A hierarchy of ARMOR processes forms a
    reconfigurable runtime environment
  • System management, error detection, and error
    recovery services are distributed across ARMOR
    processes
  • ARMOR runtime environment can handle self failure
  • ARMOR support for the application
      • Completely transparent and external support
      • Instrumentation with ARMOR API
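
The idea of pluggable Elements that are scheduled according to their message subscriptions can be sketched as follows. This is not the ARMOR API (which, per the shortcomings slide, supported only C) and it is single-threaded for brevity; the class names, message names, restart limit, and process name are all hypothetical.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List, Tuple


class Element:
    """Pluggable building block; subclasses list the messages they want."""

    subscriptions: Tuple[str, ...] = ()

    def handle(self, armor: "Armor", topic: str, payload: Dict[str, Any]) -> None:
        raise NotImplementedError


class Armor:
    """Toy framework: routes each message to the subscribed Elements."""

    def __init__(self) -> None:
        self._routes: DefaultDict[str, List[Element]] = defaultdict(list)
        self.log: List[Tuple[str, str]] = []

    def plug(self, element: Element) -> None:
        for topic in element.subscriptions:
            self._routes[topic].append(element)

    def send(self, topic: str, payload: Dict[str, Any]) -> None:
        for element in list(self._routes[topic]):
            element.handle(self, topic, payload)


class RestartElement(Element):
    """Restarts a crashed process a few times, then gives up on it."""

    subscriptions = ("process_crashed",)

    def __init__(self, max_restarts: int = 2) -> None:
        self.max_restarts = max_restarts
        self.count: DefaultDict[str, int] = defaultdict(int)

    def handle(self, armor: Armor, topic: str, payload: Dict[str, Any]) -> None:
        name = payload["process"]
        self.count[name] += 1
        if self.count[name] <= self.max_restarts:
            armor.log.append(("restart", name))  # stand-in for a real restart
        else:
            armor.send("remove_from_service", {"process": name})


class RemovalElement(Element):
    subscriptions = ("remove_from_service",)

    def handle(self, armor: Armor, topic: str, payload: Dict[str, Any]) -> None:
        armor.log.append(("removed", payload["process"]))


armor = Armor()
armor.plug(RestartElement(max_restarts=2))
armor.plug(RemovalElement())
for _ in range(3):
    armor.send("process_crashed", {"process": "l2-filter-3"})
```

The two Elements never reference each other; the escalation from restart to removal happens only through a message, so either one could be swapped out independently.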

11
Very Lightweight Agents (VLA)
  • Message scheduling and priority assignments
  • Fast, simple reactive decisions
  • Read, summarize, and report sensor data
  • Are pluggable components
  • Live alongside the application
  • Some predictive capabilities
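
A very lightweight agent of the kind described above might look roughly like the sketch below: it keeps a short window of sensor readings, makes a fast threshold decision, uses a simple linear extrapolation as a stand-in for "some predictive capabilities", and produces a summary for reporting. The class name, the decision labels, and the temperature-like values are assumptions for illustration only, since the slides do not define a VLA interface.

```python
from collections import deque
from statistics import mean
from typing import Deque, Dict


class LightweightAgent:
    """Toy agent that watches one sensor next to an application."""

    def __init__(self, limit: float, window: int = 5) -> None:
        self.limit = limit
        self.samples: Deque[float] = deque(maxlen=window)

    def read(self, value: float) -> str:
        """Store a reading and return a fast, simple reactive decision."""
        self.samples.append(value)
        if value >= self.limit:
            return "throttle"  # limit already reached: react now
        if self.predict_next() >= self.limit:
            return "warn"      # trend suggests the limit is close
        return "ok"

    def predict_next(self) -> float:
        """Linear extrapolation over the window (a deliberately crude guess)."""
        if len(self.samples) < 2:
            return self.samples[-1]
        slope = (self.samples[-1] - self.samples[0]) / (len(self.samples) - 1)
        return self.samples[-1] + slope

    def summary(self) -> Dict[str, float]:
        """Condensed report to send upward instead of the raw readings."""
        return {
            "n": len(self.samples),
            "mean": mean(self.samples),
            "max": max(self.samples),
        }


agent = LightweightAgent(limit=80.0, window=3)
decisions = [agent.read(v) for v in (60.0, 70.0, 78.0, 85.0)]
report = agent.summary()
```

Sending only the summary upward keeps the message volume low, which fits the resource usage constraints mentioned earlier.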

12
Shortcomings
  • VLAs
      • Standardized API and management framework never
        established
      • Definition much too vague to know precisely if
        your code is one of these or not
  • ARMORs
      • Difficult to write pluggable components
        (complex execution model)
      • Only support for C
      • Management and configuration not adequate

(Please note: some of the developments were cut
short due to the cancellation of BTeV.)
13
Current Activities
  • LQCD cluster, need to automate
      • Routine administration tasks
      • Recovery procedures (for jobs and nodes)
      • Collection of performance information
  • CMS online
      • Automate system validation and testing

(These are the ones I am aware of.)