Christopher A. Monaco - PowerPoint PPT Presentation

1
STEREO (The Solar TErrestrial RElations
Observatory) Flight Software's Unconventional
Solution to Floating Point Error Handling
  • Christopher A. Monaco
  • JHU/APL
  • FSW-07 Workshop
  • Laurel, MD
  • 5/6-Nov-2007

2
Early STEREO FSW Development
  • A task of the STEREO FSW team during development
    was to implement a floating point exception
    handler
  • On the RAD6000 this is not as straightforward as
    it sounds
  • This is what we learned and what we did about it

3
Background
  • Each of the two STEREO spacecraft has 2 RAD6000
    processors that operate the spacecraft (SC) bus
  • One for the C&DH subsystem and one for the G&C
    subsystem
  • The C&DH and G&C processors run the VxWorks 5.3.1
    operating system
  • The RAD6000 is a POWER processor architecture
  • Based on the IBM Federal Systems RSC6000 circa
    1985
  • The first RAD6000 was launched in 1996 on the
    Mars Pathfinder.
  • Since then approximately 150 RAD6000s have been
    flown on various missions, including
  • Rovers Spirit and Opportunity, Deep Space 1,
    Genesis and Stardust, Mars Polar Lander, Mars
    Climate Orbiter, APL's MESSENGER, ...

4
RSC6000
  • RSC6000 has 3 semi-autonomous processor units
    that each implement their own instruction
    pipeline
  • Instruction Stream processor or Branch Unit (BU)
  • Fixed Point Unit (FXU)
  • Floating Point Unit (FPU)
  • The 3 processing units execute somewhat
    independently.
  • Several instructions may be in various phases of
    execution at any particular instant
  • Instructions across the 3 pipelines often finish
    in a different order from that defined by the
    program
  • This is in contrast to the sequential model of
    program execution
  • In the sequential model, each instruction must
    completely finish before the next begins
  • Pipelined instruction execution is responsible
    for significant performance improvements made by
    the POWER architecture

5
IEEE 754 Floating Point Standard
  • The IEEE 754 Floating Point Standard was first
    adopted by IEEE/ANSI in 1985
  • The standard requires that a faulting instruction
    be accurately identifiable within the exception
    trap
  • In general this requirement is met by chip
    designers by implementing a precise interrupt
  • Implementation of precise interrupts is
    complicated by the 3 somewhat independent
    pipelines of the RSC6000
  • Out-of-order instruction sequencing
  • Precise interrupt: an interrupt or exception
    is precise if the saved processor state
    corresponds with the sequential model of program
    execution, where one instruction's execution ends
    before the next begins.

6
RSC6000 Floating Point
  • Designers of the RSC6000 had a choice:
  • (1) Implement precise interrupt: invent a
    complex scheme for identifying a faulting
    instruction, rolling back instructions that
    executed out of sequence in the other pipelines,
    and restoring processor state
  • (2) Implement precise interrupt: enforce
    explicit instruction execution sequencing
    (serialize the pipelines)
  • Each instruction must complete (exception-free)
    before subsequent instructions may begin
  • Performance hit of 2-3x
  • (3) Give up the ability to identify the faulting
    instruction
  • Software must poll floating point registers for
    exceptions

7
Floating Point Exceptions
  • Buried at the end of the Floating Point
    Exceptions section in the POWER Processor
    Architecture Manual Version 1.52, regarding
    trapping floating point exceptions:
  • System performance with MSR(FE) = 1 may be
    significantly degraded
  • Regarding polling for floating point exceptions,
    RSC6000 and RAD6000 literature offers little
    guidance:
  • Inserting test code after each floating point
    operation
  • Adding test code after each floating point
    operation is too invasive, particularly since a
    significant portion of the G&C code is
    autogenerated MATLAB RTW code
  • Each task involving floating point operations
    would require modification
  • The compiler may provide several options at
    different levels: subroutine, loop exit,
    statement assignment, or after each floating
    point instruction
  • No obvious compiler solutions offered detection
    and appropriate handling of floating point errors
    while also avoiding performance loss
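
To make the invasiveness concrete, here is a minimal sketch of what per-operation test code could look like. It is an illustration only: it tests the result value with the standard isfinite() macro rather than reading the FPSCR, and the names checked_div and fp_error_count are hypothetical, not taken from the STEREO flight software.

```c
#include <math.h>

/* Hypothetical error counter; a real system would log or raise an event. */
unsigned fp_error_count = 0;

/* Divide, then immediately test the result. Per-operation polling would
 * require a wrapper like this around every floating point operation. */
double checked_div(double num, double den)
{
    double result = num / den;
    if (!isfinite(result)) {   /* NaN or +/-Inf indicates a bad operation */
        fp_error_count++;
    }
    return result;
}
```

Every division in every task, including the autogenerated code, would have to be routed through a wrapper of this kind, which is the modification burden the slide describes.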

8
STEREO FP Exception Handling Options
  • (2) Enforce explicit instruction sequencing: trap
    floating point exceptions
  • System-wide solution
  • Conventional
  • No latency in error detection
  • - Significant overall system performance
    degradation (2-3x) associated with serializing the
    3 pipelines
  • (3) Give up precise interrupt and poll for
    exceptions: polling for floating point exceptions
  • We don't really NEED to know the exact
    instruction that caused the error. Reset the system
    in case of a critical task floating point error
  • - Polling results in latency between the
    occurrence and detection of an error
  • Small latency can be tolerated
  • Good software practices mean floating point
    exceptions should be VERY rare!
  • 2-3x faster than option (2). Take advantage
    of parallel pipelines!

9
VxWorks
  • VxWorks associates a copy of the FPU registers
    with each user task
  • VxWorks saves and restores FPU registers at
    context switches
  • Polling in a particular task context would only
    catch exceptions occurring within that task since
    the last poll
  • VxWorks offers a hook into the OS in which
    developers may insert user code to execute at
    task context switches
  • taskHookLib: STATUS taskSwitchHookAdd (FUNCPTR
    switchHook)
  • Arguments to the user-supplied task switch hook
    are pointers to the old_tcb and new_tcb
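
The registration pattern can be sketched on a host machine without VxWorks headers. Everything prefixed mock_ or MOCK_ below is a hypothetical stand-in that only mirrors the shape of the call quoted above; it is not the VxWorks implementation, and the STATUS, OK and ERROR definitions are local substitutes.

```c
#include <stddef.h>

/* Host-side stand-ins so the sketch compiles without VxWorks headers. */
typedef int STATUS;
#define OK     0
#define ERROR  (-1)
typedef struct { int task_id; } MOCK_TCB;   /* stand-in for a task control block */
typedef void (*SWITCH_HOOK)(MOCK_TCB *old_tcb, MOCK_TCB *new_tcb);

#define MAX_HOOKS 4
SWITCH_HOOK hook_table[MAX_HOOKS];
int hook_count = 0;

/* Mock of the hook registration call: remember the hook for later dispatch. */
STATUS mock_taskSwitchHookAdd(SWITCH_HOOK hook)
{
    if (hook == NULL || hook_count >= MAX_HOOKS) {
        return ERROR;
    }
    hook_table[hook_count++] = hook;
    return OK;
}

/* Mock of the kernel side: call every registered hook on a context switch. */
void mock_context_switch(MOCK_TCB *old_tcb, MOCK_TCB *new_tcb)
{
    for (int i = 0; i < hook_count; i++) {
        hook_table[i](old_tcb, new_tcb);
    }
}

/* Example user hook: record which task is being switched in. */
int last_switched_in = -1;

void record_switch_hook(MOCK_TCB *old_tcb, MOCK_TCB *new_tcb)
{
    (void)old_tcb;                      /* unused in this example */
    last_switched_in = new_tcb->task_id;
}
```

After registering record_switch_hook once, every call to mock_context_switch updates last_switched_in, which is the property the polling approach relies on: user code runs at every switch without touching the tasks themselves.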

10
STEREO Floating Point Exception Polling
  • VxWorks saves and restores a task's registers
    prior to calling the user task switch hook routine
  • The switch hook behaves as though it were
    executing within the context of the new task.
    Therefore, FPSCRread() within the task switch
    hook supplies the FPSCR associated with the new
    task
  • Did the new task suffer a floating point exception
    the last time it ran?
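
The ordering described on this slide can be modelled with a small host-side simulation. This is a sketch under stated assumptions: SIM_TCB, sim_fpscr, sim_fpscr_read and sim_context_switch are hypothetical names, and the global variable merely stands in for the hardware FPSCR.

```c
#include <stdint.h>

/* Hypothetical task control block holding the task's saved FPSCR copy. */
typedef struct {
    int      task_id;
    uint32_t saved_fpscr;
} SIM_TCB;

/* Simulated hardware FPSCR; stands in for the real register. */
uint32_t sim_fpscr = 0;

/* Stand-in for FPSCRread(): return the live register value. */
uint32_t sim_fpscr_read(void)
{
    return sim_fpscr;
}

/* What the most recent hook invocation observed, kept for inspection. */
uint32_t fpscr_seen_by_hook = 0;
int      task_seen_by_hook  = -1;

void sim_switch_hook(SIM_TCB *old_tcb, SIM_TCB *new_tcb)
{
    (void)old_tcb;
    task_seen_by_hook  = new_tcb->task_id;
    fpscr_seen_by_hook = sim_fpscr_read();  /* already the new task's FPSCR */
}

/* Simulated context switch: save old registers, restore new, then hook. */
void sim_context_switch(SIM_TCB *old_tcb, SIM_TCB *new_tcb)
{
    old_tcb->saved_fpscr = sim_fpscr;       /* save outgoing task state    */
    sim_fpscr = new_tcb->saved_fpscr;       /* restore incoming task state */
    sim_switch_hook(old_tcb, new_tcb);      /* hook runs after the restore */
}
```

In this model a flag raised while task A runs is invisible when A is switched out, and becomes visible to the hook the next time A is switched back in, which matches the question the slide asks.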

11
Detection Latency
  • Maximum latency is deterministic for each task
  • Example: G&C attitude controller task, 50 Hz
  • Maximum latency for G&C floating point error
    detection: 20 ms
  • Acceptable
  • Actual Latency
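
The 20 ms figure follows from the task period. A minimal sketch of the arithmetic, assuming a strictly periodic task that is switched in at least once per period (the function name is hypothetical):

```c
/* Worst-case detection latency, in milliseconds, for a periodic task:
 * one full period between consecutive switch-ins of that task. */
double max_detection_latency_ms(double task_rate_hz)
{
    return 1000.0 / task_rate_hz;
}
```

For the 50 Hz attitude controller this gives 1000 / 50 = 20 ms. A task that is preempted and resumed within its period is polled more often, so the actual latency can only be shorter than this bound.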

12
STEREO Floating Point Exception Polling
  • This approach guarantees that all tasks will be
    monitored
  • Every task runs as a result of a context switch
  • Floating point error monitoring occurs at a
    bounded rate
  • Minimum scheduled task rate
  • Maximum on the order of the rate of the highest
    rate task in the system
  • Acceptable detection latency
  • Insignificant overhead cost
  • Monitoring consists of FPSCRread(), mask and
    test
  • Floating point error handling can easily
    discriminate based upon the predefined
    criticality of the faulting task
  • Floating point errors in non-critical tasks are
    recorded and FPSCR is cleared
  • Floating point errors in critical tasks are
    recorded and initiate a system reset
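
The mask-and-test step and the criticality policy might look roughly like the sketch below. The bit masks follow the PowerPC-style FPSCR layout (bit 0 is the most significant bit) and should be checked against the RAD6000 documentation; which bits STEREO actually monitored is not stated in the slides, so the choice of invalid operation, overflow and zero divide is an assumption, as are all type and function names.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed FPSCR exception bits (PowerPC-style layout, bit 0 = MSB). */
#define FPSCR_VX 0x20000000u   /* invalid operation summary */
#define FPSCR_OX 0x10000000u   /* overflow                  */
#define FPSCR_ZX 0x04000000u   /* zero divide               */
#define FPSCR_ERROR_MASK (FPSCR_VX | FPSCR_OX | FPSCR_ZX)

typedef enum {
    FP_NONE,                   /* no monitored flag set              */
    FP_RECORDED_AND_CLEARED,   /* non-critical task: log and clear   */
    FP_RESET_REQUESTED         /* critical task: log and ask for reset */
} FpAction;

/* Hypothetical per-task record; fpscr is the task's FPSCR copy. */
typedef struct {
    int      task_id;
    bool     critical;
    uint32_t fpscr;
} MonitoredTask;

unsigned fp_errors_recorded = 0;

FpAction poll_task_fpscr(MonitoredTask *task)
{
    uint32_t errors = task->fpscr & FPSCR_ERROR_MASK;   /* mask */
    if (errors == 0) {                                  /* test */
        return FP_NONE;
    }
    fp_errors_recorded++;                 /* record the error in all cases */
    if (task->critical) {
        return FP_RESET_REQUESTED;        /* caller initiates the reset    */
    }
    task->fpscr &= ~FPSCR_ERROR_MASK;     /* clear only the monitored bits */
    return FP_RECORDED_AND_CLEARED;
}
```

The per-switch cost is one register read, one AND and one compare in the common no-error case, which is consistent with the slide's claim of insignificant overhead.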

13
STEREO Floating Point Exception Polling
  • Interesting Note
  • Rarely, the OS executes the task switch hook with
    new task ID 0 and the FPSCR register contains
    seemingly erroneous data
  • Task ID 0, coincidentally, matches the Task ID
    of the C&DH and G&C Idler tasks, which perform no
    floating point operations
  • IdlerTask() { for (;;); }
  • This was investigated extensively by the STEREO
    FSW team; a specific cause could not be
    identified
  • Empirically determined through all phases of
    testing to be a false positive
  • Over 4 processor-years of this code operating
    post-launch validate this assertion
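
The slides do not say how the false positive was handled in code. One possible way to keep it from triggering error handling, sketched here purely as an assumption with hypothetical names, is to skip the check when the incoming task ID is 0:

```c
#include <stdint.h>
#include <stdbool.h>

/* Task ID reported for the Idler tasks, per the note above. */
#define IDLER_TASK_ID 0

/* Return true only when a monitored flag is set for a non-Idler task. */
bool should_report_fp_error(int new_task_id, uint32_t fpscr,
                            uint32_t error_mask)
{
    if (new_task_id == IDLER_TASK_ID) {
        return false;          /* Idler does no floating point work */
    }
    return (fpscr & error_mask) != 0;
}
```

Because the Idler tasks perform no floating point operations, excluding them loses no real coverage under this assumption.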

14
STEREO Floating Point Exception Polling
  • Task switch hook FPSCR polling is a good software
    substitute for a feature traditionally implemented
    in hardware that has come to be taken for granted
  • System-wide approach
  • Can easily be customized per task
  • Guarantees that all system tasks are monitored
    within one solution
  • Tasks are monitored at a sufficiently high rate

15
STEREO Floating Point Exception Polling
  • Questions