RTES CD Status Report (most of the material from the BTeV Temple 2003 Review)

Provided by: jimk53

1
RTES CD Status Report (most of the material
from the BTeV Temple 2003 Review)
  • Jim Kowalkowski

2
Deliverables
  • A Toolkit containing
  • Very Lightweight Agents (VLAs)
  • ARMORs
  • Modeling tools and a domain specific environment
    under which they operate
  • Some BTeV trigger and DAQ specific plug-ins
    using the above toolkit, applied to both hardware
    and software

3
Participation in SC2003
  • All the university groups and Fermilab worked
    together to create a system (hardware and
    software) demonstrating their technology in a
    BTeV Level 1 trigger-like setting
  • This was a concrete project with a deadline
  • Created a system that is being reviewed
  • Helped the RTES groups develop an understanding
    of the processing that goes on in the trigger and
    how events are generated
  • They developed code and rules that handle some of
    the problems we expect to encounter
  • Contains initial prototypes of the deliverables:
    GME models, ARMORs, and VLAs

4
SC2003 External view of demo
5
SC2003 Internal view of demo
[Diagram. Labeled components: GME, Linux PC, Windows PC, ARMOR (two instances), Event Generator, Control System, Gateway, Operator, Switch, Buffer Manager, Local Manager, VLA, Fake Physics App, Farmlet-1. Labeled flows: Actions, Commands; Display data; Monitoring, State; Commands; Start/Stop, fault injection, parameter settings.]
6
Milestones with respect to BTeV
  • See draft BTeV document 2079
  • Year 3: Define APIs, make distributed
    application decisions, evaluate modeling tools,
    create a more complete prototype (demo), generate
    a simulator
  • Year 4: Synchronize with or conform to the BTeV
    trigger development environment and TSM system
    for configuration, control, and monitoring
  • Year 5: Address integration issues with
    the BTeV trigger

7
Achieving milestones
  • (+) Many of the BTeV needs (issues addressed in
    the milestones) are also necessary for RTES
    collaborators to carry out their research
  • Simulation, to validate or verify their ideas
  • Scalability issues for the modeling tools and
    ARMORs
  • APIs and ways to move data from place to place
  • (+) Concrete projects have been successful
  • (-) They need to feel that the BTeV goals are
    matching their own research goals
  • (-) The students involved must have adequate
    skills

8
Schedule
  • RTES completes early relative to the
    milestones and completion of the trigger
  • We have the last year (2006) to address
    integration issues
  • The TSM (controls/monitoring for the trigger)
    will be aided by RTES, but will still function (most
    likely in a reduced capacity) without it

9
Acceptance testing
  • We are generating use cases (a work in progress -
    BTeV document 2189) to capture behavioral
    requirements (when you poke the system like this,
    it reacts this way)
  • Use cases translate nicely into test cases and
    acceptance criteria (e.g. how many of the use
    cases does RTES software satisfy?)
  • We are following a semi-formal methodology
    (Cockburn)
  • A detailed simulation will be used to verify RTES
    solutions
  • We are working toward automated component and
    integration test procedures that fit into the
    BTeV development environment
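
The link between use cases and acceptance criteria can be sketched as follows. This is a minimal illustration, not the RTES test procedure: the UseCase fields, the toy_system stand-in, and the stimulus and reaction names are all hypothetical and do not come from BTeV document 2189.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UseCase:
    """One behavioral requirement: poke the system, then check the reaction."""
    name: str
    stimulus: str                    # how the system is poked (hypothetical label)
    check: Callable[[str], bool]     # True if the observed reaction is acceptable


def acceptance_ratio(use_cases: List[UseCase], system: Callable[[str], str]) -> float:
    """Fraction of use cases whose check passes against the given system."""
    if not use_cases:
        raise ValueError("at least one use case is required")
    passed = sum(1 for uc in use_cases if uc.check(system(uc.stimulus)))
    return passed / len(use_cases)


def toy_system(stimulus: str) -> str:
    """Stand-in for the system under test; only knows one recovery action."""
    reactions = {"kill_manager": "manager_restarted"}
    return reactions.get(stimulus, "no_reaction")


USE_CASES = [
    UseCase("Manager process dies", "kill_manager",
            lambda reaction: reaction == "manager_restarted"),
    UseCase("Link to a DSP breaks", "drop_link",
            lambda reaction: reaction == "link_rerouted"),
]

if __name__ == "__main__":
    print(acceptance_ratio(USE_CASES, toy_system))
```

Each use case becomes one automated test, and the ratio gives a single number to track as the software matures.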

10
Technology choices
  • We know the importance of keeping close contact
    with the BTeV development and engineering staff
  • The configuration, distributed computing models,
    coding standards, and APIs will be highly
    influenced by BTeV developers
  • All this will help make consistent RTES/TSM
    systems that require less maintenance effort and
    less manpower to create
  • BTeV developers can make use of product
    evaluations and experiences of the RTES group
  • We are designing a message format and evaluating
    exchange protocols
  • We are investigating the use of an RTOS
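
To make the message format discussion concrete, here is one possible shape for a fixed-header message. The slides do not describe the actual RTES format, so every field here (version, message type, source id, timestamp, payload length) and the 16-byte layout are assumptions chosen for illustration only.

```python
import struct

# Hypothetical header layout, network byte order, no padding:
#   version (1 byte), message type (1 byte), source id (2 bytes),
#   timestamp in microseconds (8 bytes), payload length (4 bytes)
HEADER = struct.Struct("!BBHQI")
VERSION = 1


def encode(msg_type: int, source_id: int, timestamp_us: int, payload: bytes) -> bytes:
    """Prefix the payload with the fixed-size header."""
    return HEADER.pack(VERSION, msg_type, source_id, timestamp_us, len(payload)) + payload


def decode(data: bytes) -> dict:
    """Parse one message and verify that the payload is complete."""
    version, msg_type, source_id, timestamp_us, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return {
        "version": version,
        "type": msg_type,
        "source": source_id,
        "timestamp_us": timestamp_us,
        "payload": payload,
    }
```

A fixed binary header keeps parsing cheap on embedded processors; a text or self-describing format would trade that for easier debugging.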

11
Recent Activities
  • Use cases
  • Working to understand how to do this correctly
    with Margaret V. and Luciano P.
  • Can we bring an expert in for a few days?
  • Evaluation of OSE real-time kernel
  • Fermilab will be using PowerPC 8540 as the
    platform/architecture
  • Target at this time is only the embedded systems
    in the trigger
  • Jim K. wants to be involved in this
  • Review of the prototype/demo system
  • Marc P. and Mark F. are the reviewers
  • Extremely valuable results already
  • Emphasized the need for use cases
  • Strong desire to evaluate parallel-C compilers
    that generate VHDL code (Jim K.)

12
Issues
  • Tools for research use versus tools for
    production use in an experiment
  • Where the tools actually fit in and their
    relationship to the traditional
    controls/monitoring systems
  • Time for research and contemplating solutions
  • Diverse interests within the group

13
RTES hardware for SC2003
14
Backup Slide - errors for SC2003
  • Increased/decreased data rate
  • broken communication link to a DSP
  • trigger filter application hung
  • death of manager process on the host PCs
  • Increased/decreased processing time per event
  • input queue high water mark reached
  • unable to keep up after DSP failure
  • impede processing on one DSP
  • timeout during event processing
  • bad events or lost events
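
Two of the error conditions above, the input queue high water mark and lost events, can be illustrated with a small monitored queue. This is a sketch under assumed semantics: the class name, the report strings, and the policy of dropping events when the queue is full are hypothetical and are not taken from the SC2003 demo code.

```python
from collections import deque
from typing import Any, Optional


class MonitoredQueue:
    """Bounded input queue that reports high-water and lost-event conditions."""

    def __init__(self, capacity: int, high_water: int):
        if not 0 < high_water <= capacity:
            raise ValueError("need 0 < high_water <= capacity")
        self.capacity = capacity
        self.high_water = high_water
        self.items = deque()
        self.lost = 0          # running count of dropped events
        self.above = False     # True while depth is at or above the mark

    def push(self, event: Any) -> Optional[str]:
        """Queue an event; return a report string if a condition is detected."""
        if len(self.items) >= self.capacity:
            self.lost += 1
            return "event_lost"
        self.items.append(event)
        if len(self.items) >= self.high_water and not self.above:
            self.above = True              # report the crossing only once
            return "high_water_mark_reached"
        return None

    def pop(self) -> Any:
        """Remove the oldest event and re-arm the high-water report if drained."""
        event = self.items.popleft()
        if len(self.items) < self.high_water:
            self.above = False
        return event
```

Reporting the crossing once, rather than on every push, keeps the monitoring traffic low while the queue stays congested.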

15
Slides from CHEP2003 talk
16
Goals Summary
  • Implement a large, aggressive trigger, that
  • Applies computation to every interaction
  • Has high sustained computational performance
  • Maintains functional integrity for long periods
    of time
  • Is highly available
  • Is dynamically reconfigurable, maintainable, and
    evolvable
  • Create fault handling infrastructure capable of
  • Accurately identifying problems (where, what, and
    why)
  • Compensating for problems (shift the load,
    changing thresholds)
  • Automated recovery procedures (restart /
    reconfiguration)
  • Accurate accounting
  • Being extended (capturing new detection/recovery
    procedures)
  • Policy driven monitoring and control
  • Simplify operations

17
What is RTES?
  • A collaboration of five institutions, funded by
    NSF ITR grant ACI-0121658
  • University of Illinois (M. Haney, R.K. Iyer, Z.
    Kalbarczyk, Q. Liu, A. Mahajan, M. Selen, Z.
    Yang)
  • University of Pittsburgh (D. Mosse, O.
    Shigiltchoff)
  • Syracuse University (R. Chopade, J. Oh, L.
    Hovey, S. Stone, D. Messie)
  • Vanderbilt University (T. Bapty, S. Neema, S.
    Nordstrom, P. Sheldon, S. Shetty, E. Vaandering,
    D. Vashishtha)
  • Fermilab (J. Appel, J. Butler, E. Gottschalk, J.
    Kowalkowski, L. Piccoli, M. Votava)
  • Physicists and Computer Scientists/Electrical
    Engineers at BTeV institutions with expertise in
  • High performance, real-time, embedded system
    software and hardware,
  • Reliability and fault tolerance,
  • System specification, generation, and modeling
    tools.
  • A group working on fault management in large
    computing clusters

18
Very Lightweight Agents (VLA)
  • Message scheduling and priority assignments
  • Fast, simple reactive decisions
  • Read, summarize, and report sensor data
  • Are pluggable components
  • Live alongside the application
  • Some predictive capabilities
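
The VLA properties listed above (pluggable sensors, summarized reports, fast reactive decisions) can be sketched in a few lines. This is only an illustration of the idea, not the RTES VLA design: the Agent class, its plug and step methods, and the simple threshold rule are all assumptions.

```python
from collections import deque
from statistics import mean
from typing import Callable, Dict


class Agent:
    """Toy lightweight agent: polls plugged-in sensors and summarizes them."""

    def __init__(self, window: int = 5):
        self.window = window                               # samples kept per sensor
        self.sensors: Dict[str, Callable[[], float]] = {}
        self.limits: Dict[str, float] = {}
        self.history: Dict[str, deque] = {}

    def plug(self, name: str, read: Callable[[], float], limit: float) -> None:
        """Register a sensor reader and the limit above which it alarms."""
        self.sensors[name] = read
        self.limits[name] = limit
        self.history[name] = deque(maxlen=self.window)

    def step(self) -> Dict[str, dict]:
        """Read every sensor once and return a summarized report."""
        report = {}
        for name, read in self.sensors.items():
            value = read()
            samples = self.history[name]
            samples.append(value)
            report[name] = {
                "last": value,
                "mean": mean(samples),                 # summary over the window
                "alarm": value > self.limits[name],    # simple reactive decision
            }
        return report
```

A report like this would be passed up to a higher-level manager, while the alarm flag lets the agent react locally without waiting for it.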

19
ARMOR View
[Diagram. Labeled elements: Node 1, Node 2, Node 3, Network, Daemon (one per node), Remote daemons, ARMOR Microkernel, TCP Connection Mgmt., Named Pipe Mgmt., Process Mgmt., Detection Policy, Recovery Policy, Execution Controller, Trigger Application, Local Manager ARMOR.]
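
The pairing of a detection policy with a recovery policy can be illustrated with a generic heartbeat watchdog. This is a common fault-tolerance pattern substituted here for illustration; the slides do not say how ARMORs detect or recover from failures, and the Watchdog class, its method names, and the process names are hypothetical.

```python
from typing import Callable, Dict, List


class Watchdog:
    """Detects silent processes by missed heartbeats and triggers a restart."""

    def __init__(self, timeout: float, restart: Callable[[str], None]):
        self.timeout = timeout          # detection policy: maximum silence allowed
        self.restart = restart          # recovery policy: what to do on failure
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, name: str, now: float) -> None:
        """Record that the named process was alive at time `now`."""
        self.last_seen[name] = now

    def check(self, now: float) -> List[str]:
        """Restart every process that has been silent longer than the timeout."""
        recovered = []
        for name, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.restart(name)
                self.last_seen[name] = now   # fresh deadline after the restart
                recovered.append(name)
        return recovered
```

Passing the time in explicitly, instead of reading a clock inside the class, keeps the sketch deterministic and easy to test.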
20
Modeling Environment
  • Fault handling
  • Process dataflow
  • Hardware Configuration

21
Why is all of this interesting?
  • It is an integrated approach from hardware to
    physics algorithms
  • Standardization of resource monitoring,
    management, error reporting, and integration of
    recovery procedures can make operating the system
    more efficient and make it possible to comprehend
    and extend.
  • There are real-time constraints
  • Scheduling and deadlines
  • Numerous detection and recovery actions
  • The product of this research will
  • Automatically handle simple problems that occur
    frequently
  • Be as smart as the detection/recovery modules
    plugged into it
  • The product can lead to better or increased
  • Trigger uptime by compensating for problems or
    predicting them instead of pausing or stopping a
    run
  • Resource utilization - the trigger will use
    resources that it needs
  • Understanding of the operating characteristics of
    the software
  • Ability to debug and diagnose difficult problems