Author: Xilinx. Created: 2/11/2004. Source: http://klabs.org
1
A Triple Module Redundancy Scheme for SEU
Mitigation of Static Latch-Based FPGAs
(Birds-of-a-Feather)
Carl Carmichael (1), Brendan Bridgford (1),
Gary Swift (2), Matt Napier (3)
(1) Xilinx Corporation, San Jose, CA
(2) Jet Propulsion Laboratory, Pasadena, CA
(3) Sandia National Laboratories, Albuquerque, NM
"This work was carried out in part by the Jet
Propulsion Laboratory, California Institute of
Technology, under contract with the National
Aeronautics and Space Administration."
"Reference herein to any specific commercial
product, process, or service by trade name,
trademark, manufacturer, or otherwise, does not
constitute or imply its endorsement by the United
States Government or the Jet Propulsion
Laboratory, California Institute of Technology."
2
XTMR SEU Mitigation
  • Xilinx Triple Module Redundancy (XTMR)
  • Single Point Failures are eliminated by
    triplication of every logic node (gates and nets).
  • XTMR confers SEU and SET immunity
  • XTMR does not protect against SEFIs!
  • Any digital design can be XTMRed by:
  • Triplication of throughput (combinational and
    sequential) logic
  • Triplication of feedback logic and inserting
    majority voters
  • Adding redundant IO (outputs with minority
    voters)
  • Design cleanup (removing half-latches, SRL16s,
    etc.)

3
XTMR State-Machines
Figure: Pre-TMR state machine
  • XTMR provides autonomous re-synchronization of
    the separate redundant domains of a state machine
    by inserting majority voters at the origin of
    every registered feedback loop.
  • When a configuration upset disables one domain,
    the other two domains continue to operate
    providing a correct majority representation of
    state data and functionality.
  • When scrubbing repairs the configuration of the
    upset domain, the embedded redundant voters
    automatically correct the state of the upset
    domain without any external intervention.
  • As long as the scrub rate is greater than the
    upset rate, upsets are repaired before they can
    accumulate in more than one redundant domain.
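
The re-synchronization behavior on this slide can be illustrated with a toy model. The following is a minimal Python sketch, assuming a 4-bit counter as the state machine; `majority`, `step`, and the `broken` flag are hypothetical names chosen for illustration and are not part of any Xilinx tool.

```python
from typing import List, Optional


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def step(states: List[int], broken: Optional[int] = None) -> List[int]:
    """Advance three redundant 4-bit counters by one clock.

    Every domain votes on all three current states before computing its
    next state, which models a voter placed in the feedback path.
    `broken` marks a domain whose logic is upset (illustrative only).
    """
    voted = majority(*states)
    nxt = [(voted + 1) & 0xF for _ in states]
    if broken is not None:
        nxt[broken] = 0xA  # arbitrary wrong value from the upset domain
    return nxt


states = [0, 0, 0]
for _ in range(3):
    states = step(states, broken=1)  # domain 1 has a configuration upset

# Scrubbing repairs domain 1; the next clock re-synchronizes its state.
states_after_scrub = step(states)
```

While domain 1 is upset, the two healthy domains keep counting because the vote ignores the outlier. Once the fault is removed, a single clock brings the repaired domain back in step with no reset.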

Figure: Post-XTMR state machine
4
XTMR Inputs
  • Effective SEU Mitigation requires the use of
    triple redundant input pins for every input
    signal.
  • Failing to triplicate global input signals (clk,
    rst, etc.) can seriously compromise SEU resistance.
  • Triplication of input data paths can be traded
    for EDAC.
  • SEU resistance is sometimes traded off against
    resource utilization.

5
XTMR Outputs with Minority Voters
  • Outputs can be triplicated, using three pins for
    each output signal.
  • Minority voters monitor each of the triplicated
    design modules.
  • If one module's output differs from the other
    two, its output pin is driven to high-Z.
  • Voters are triplicated
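
The minority-voter behavior listed above can be sketched in a few lines. This is a minimal Python model under stated assumptions: outputs are single bits, `None` stands in for a high-Z pin, and `drive_pin`, `pins_for`, and `resolve_trace` are hypothetical helper names, not a vendor API.

```python
from typing import List, Optional


def drive_pin(own: int, other1: int, other2: int) -> Optional[int]:
    """Value this domain drives on its pin, or None for high-Z.

    The domain is in the minority when it disagrees with both others.
    """
    in_minority = own != other1 and own != other2
    return None if in_minority else own


def pins_for(tr0: int, tr1: int, tr2: int) -> List[Optional[int]]:
    """Each domain runs its own minority voter (the voters are triplicated)."""
    return [
        drive_pin(tr0, tr1, tr2),
        drive_pin(tr1, tr0, tr2),
        drive_pin(tr2, tr0, tr1),
    ]


def resolve_trace(pins: List[Optional[int]]) -> Optional[int]:
    """Model the three pins joined on the board trace outside the FPGA."""
    driven = {p for p in pins if p is not None}
    if len(driven) > 1:
        raise ValueError("bus contention: pins drive different values")
    return driven.pop() if driven else None


# Domain 1 is upset and outputs 0 while the others output 1.
pins = pins_for(1, 0, 1)
trace_value = resolve_trace(pins)
```

Because the upset domain releases its pin, the two healthy drivers set the trace level without contention.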

Figure: three minority voters, one each for domains
TR0, TR1, and TR2, each driving a pin labeled P.
The convergence point is outside the FPGA, at the
trace.
6
Previous SEE Test Methodology for Mitigation
  • The assertion of the combined mitigation method
    of XTMR and scrubbing is that completely removing
    single-event functional errors from the user
    logic reduces any user design to an overall error
    rate determined by the remaining Single Event
    Functional Interrupts (SEFIs). Therefore, a
    successful mitigation test is expected to produce
    zero errors other than SEFIs.
  • Since the effectiveness of TMR depends on errors
    not accumulating in the configuration, the
    experiments attempted to maintain an upset rate
    that did not exceed the scrub rate. This
    methodology had two significant flaws:
  • One is the impracticality of testing at such low
    fluxes: run times become unreasonably long, so
    the test cannot reach sufficient fluence for
    statistically significant data.
  • The other flaw is that a zero error rate result
    is not useful for making any calculations or
    extrapolations.
  • These issues raise concerns over the validity of
    any results.
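
The run-time problem can be made concrete with a rough calculation. The Python sketch below assumes that expected events equal cross-section times fluence, and borrows three figures quoted elsewhere in this presentation: the combined SEFI cross-section (8.42E-6 cm2), the orbit-equivalent beam flux (1.25E-01 p/cm2/s), and the top of the tested flux range (4E04 p/cm2/s). The function name is hypothetical.

```python
def seconds_per_expected_event(cross_section_cm2: float,
                               flux_per_cm2_s: float) -> float:
    """Beam time needed for one expected event.

    Expected events = cross-section * fluence, so one event is expected
    after a fluence of 1 / cross-section; time = fluence / flux.
    """
    fluence_needed = 1.0 / cross_section_cm2
    return fluence_needed / flux_per_cm2_s


SEFI_CROSS_SECTION = 8.42e-6   # cm2, quoted elsewhere in this deck
LOW_FLUX = 1.25e-1             # p/cm2/s, orbit-equivalent flux
HIGH_FLUX = 4e4                # p/cm2/s, top of the tested range

seconds_low = seconds_per_expected_event(SEFI_CROSS_SECTION, LOW_FLUX)
seconds_high = seconds_per_expected_event(SEFI_CROSS_SECTION, HIGH_FLUX)
days_low = seconds_low / 86_400
```

Under these assumptions a single expected event takes about eleven days of beam time at the low flux but only a few seconds at the high flux, which is why low-flux testing cannot build up useful statistics.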

7
Improved SEE Test Methodology for Mitigation
  • A physical relationship is expected between the
    functional error rate of a mitigated system and
    its bit upset rate. As the upset rate rises, so
    does the probability of upsetting a combination
    of bits that causes a mitigated (TMR) system to
    fail:
  • $MER = \frac{1}{2} \cdot \frac{N_B \, C_A}{T_S} \cdot R_U^2$
  • $MER$ = Mitigation Error Rate
  • $N_B$ = Number of Relevant Bits
  • $C_A$ = Average Cluster Size
  • $T_S$ = Scrub Time
  • $R_U$ = Upset Rate of Relevant Bits
  • Therefore, testing at extremely high fluxes,
    varied over several orders of magnitude, can
    reveal this functional relationship between
    mitigation error rate and bit upset rate.
  • This function can then be extrapolated to make
    predictions at the much lower upset rates of
    Earth orbits.
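
The practical consequence of the squared upset-rate term is that the mitigation error rate falls much faster than the upset rate itself. A minimal Python sketch follows, reading the slide's expression as MER = (1/2)(NB·CA/TS)·RU². The parameter values are placeholders invented for illustration, not measured data, and the units are left abstract because the slide does not state them.

```python
def mitigation_error_rate(n_bits: float, avg_cluster: float,
                          scrub_time: float, upset_rate: float) -> float:
    """MER = 1/2 * (N_B * C_A / T_S) * R_U**2, as read from the slide."""
    return 0.5 * (n_bits * avg_cluster / scrub_time) * upset_rate ** 2


# Placeholder design parameters (hypothetical, for illustration only).
N_B, C_A, T_S = 1000, 2, 0.5

mer_high = mitigation_error_rate(N_B, C_A, T_S, upset_rate=1e-2)
mer_low = mitigation_error_rate(N_B, C_A, T_S, upset_rate=1e-3)

# A 10x lower upset rate gives a 100x lower mitigation error rate.
ratio = mer_high / mer_low
```

This quadratic fall-off is what makes it reasonable to measure at beam fluxes that overwhelm the mitigation and then extrapolate down to orbital rates.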

8
Plot Definitions
  • Predicted SEFI cross-section
  • Static and dynamic SEE characterization of the
    Virtex-II FPGA revealed several Single Event
    Functional Interrupt modes: POR (2.5E-06 cm2),
    SMAP (1.72E-06 cm2), and IOB (4.2E-06 cm2).
  • These combined cross-sections represent the
    minimum functional error cross-section for a
    single Virtex-II (XQR2V6000) device on orbit.
  • Worst Case Orbital Upset Rate
  • The CREME96 calculation of the worst-case orbital
    upset rate for an XQR2V6000 is 7,740
    bit-errors/day (9E-02 bit-errors/sec) in a GEO
    orbit at 36,000 km during the worst day of an
    anomalously large solar flare, accounting for both
    heavy ions and protons. In a 40 MeV Kr beam the
    same upset rate is achieved with a flux of
    1.25E-01 p/cm2/s. The equivalent upset rates for
    all other orbits and solar conditions therefore
    lie to the left of this line.
  • Single Event Functional Interrupts
  • This is the average cross-section of the SEFIs
    observed while collecting the data represented in
    the plot. This cross-section is not flux
    dependent. Variations from the predicted value
    are due to the limited statistical significance
    of the total fluence accumulated during each test.
  • Functional Errors
  • Data plot of the observed events in which the
    Device Under Test (DUT) returned an incorrect
    result. The cross-section is the number of error
    events divided by the total fluence at the
    specified flux. TMR denotes that the DUT design
    was fully mitigated with XTMR and scrubbing. The
    unmitigated results were obtained with a
    functionally identical design without XTMR;
    scrubbing was still used for the unmitigated
    test.
  • Extrapolation
  • A derived function describing mitigation failure
    as a function of upset rate. Extending the
    function predicts functional error cross-sections
    at worst-case orbital upset rates that are lower
    than the SEFI cross-sections.
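
The arithmetic behind these definitions is simple enough to check directly. The Python sketch below uses only figures quoted on this slide; the `cross_section` helper and the example event count are hypothetical.

```python
# SEFI mode cross-sections quoted on this slide (cm2).
sefi_modes_cm2 = {"POR": 2.5e-6, "SMAP": 1.72e-6, "IOB": 4.2e-6}
sefi_total_cm2 = sum(sefi_modes_cm2.values())   # combined SEFI cross-section

# Worst-case orbital upset rate: bit-errors/day to bit-errors/sec.
upsets_per_day = 7740
upsets_per_sec = upsets_per_day / 86_400


def cross_section(error_events: int, fluence_per_cm2: float) -> float:
    """Cross-section = number of error events / total fluence."""
    return error_events / fluence_per_cm2


# Hypothetical run: 5 error events over a fluence of 1E6 particles/cm2.
example_sigma = cross_section(5, 1e6)
```

The three modes sum to 8.42E-6 cm2, and 7,740 bit-errors/day rounds to the quoted 9E-02 bit-errors/sec.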

9
PLOT 1
Top x-axis: Configuration Bit Errors per Scrub
Cycle (ticks from 3.5E-02 to 3.5E03).
Markers: 36,000 km GEO orbit, worst-day solar flare,
8,000 bit-errors/day; all other orbits.
Beam: 40 MeV Kr, LET 22.3 MeV·cm2/mg.
SEFIs drive the error rate for all designs and all
orbits.
Mitigation errors on orbit are always less than
SEFI errors by orders of magnitude.
10
PLOT 2
Top x-axis: Configuration Bit Errors per Scrub
Cycle (ticks from 3.5E-02 to 3.5E03).
Markers: 36,000 km GEO orbit, worst-day solar flare,
8,000 bit-errors/day; all other orbits.
Beam: 40 MeV Kr, LET 22.3 MeV·cm2/mg.
SEFIs drive the error rate for all designs and all
orbits.
Mitigation errors on orbit are always less than
SEFI errors by orders of magnitude.
11
PLOT 3
Top x-axis: Configuration Bit Errors per Scrub
Cycle (ticks from 3.5E-02 to 3.5E03).
Markers: 36,000 km GEO orbit, worst-day solar flare,
8,000 bit-errors/day; all other orbits.
Beam: 40 MeV Kr, LET 22.3 MeV·cm2/mg.
SEFIs drive the error rate for all designs and all
orbits.
Mitigation errors on orbit are always less than
SEFI errors by orders of magnitude.
12
SEE Test Analysis
  • The experiments were conducted over a flux range
    of 7E00 to 4E04 (p/cm2/s).
  • The flux rates are normalized on the secondary
    (top) x-axis of the plots to average bit upsets
    per scrub cycle ($R_S$).
  • Each experiment demonstrated a drop in failure
    cross-section over several orders of magnitude,
    crossing the SEFI cross-section at upset rates
    that are still several orders of magnitude above
    worst case orbital upset rates.
  • Extrapolating the data for each experiment
    clearly demonstrates a mitigation error
    cross-section at least one order of magnitude
    below the SEFI cross-section at worst-case
    orbital upset rates.
  • By superposition of the data fit functions, the
    total effective mitigated error
    cross-section is:
  • $\sigma_{TOTAL} = \sigma_{BRAM} + \sigma_{CLB} + \sigma_{MULT} + \sigma_{SEFI}$
  • $\sigma_{TOTAL} = 5.0 \times 10^{-8} (1.4 R_S)^{2} + 5.0 \times 10^{-6} (0.7 R_S)^{0.5} + 1.75 \times 10^{-6} (1.4 R_S)^{0.35} + 8.42 \times 10^{-6}$ (cm2)
  • Therefore, at the worst-case orbital upset rate
    of 9E-2 upsets/sec ($R_S$ = 4.5E-2 upsets/scrub),
    the effective total cross-section for functional
    error is calculated as:
  • $\sigma_{TOTAL} = 1.05 \times 10^{-5}$ cm2/device
    (orbital worst case)
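
The superposition on this slide can be evaluated numerically. The Python sketch below takes one reading of the fit functions, treating the trailing parenthesized numbers (2, 0.5, 0.35) as exponents and assigning the three terms to BRAM, CLB, and MULT in the order they are listed. Both choices are assumptions, as is the 0.5 s scrub cycle implied by converting 9E-2 upsets/sec to 4.5E-2 upsets/scrub.

```python
SIGMA_SEFI = 8.42e-6  # cm2, combined SEFI cross-section from the slide


def sigma_total(r_s: float) -> float:
    """Total functional-error cross-section (cm2) at r_s upsets per scrub.

    Term-to-resource mapping and exponents are assumed from the slide text.
    """
    sigma_bram = 5.0e-8 * (1.4 * r_s) ** 2
    sigma_clb = 5.0e-6 * (0.7 * r_s) ** 0.5
    sigma_mult = 1.75e-6 * (1.4 * r_s) ** 0.35
    return sigma_bram + sigma_clb + sigma_mult + SIGMA_SEFI


upsets_per_sec = 9e-2          # worst-case orbital upset rate
scrub_time_s = 0.5             # assumed: implied by 4.5E-2 upsets/scrub
r_s = upsets_per_sec * scrub_time_s

total = sigma_total(r_s)
sefi_share = SIGMA_SEFI / total  # fraction of the total due to SEFIs
```

Under this reading the total comes out at roughly 1.0E-5 cm2, close to the 1.05E-5 cm2 quoted on the slide, with the SEFI term contributing well over 80 percent of it.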

13
Conclusions
  • The efficiency and accuracy of validating
    mitigation techniques are greatly improved by
    demonstrating the upset-rate dependency of the
    mitigation method, testing at flux rates that
    overwhelm the mitigation.
  • The static SEFI cross-section is the dominant
    factor in calculating orbital error rates for
    any Virtex-II design mitigated with full XTMR
    and scrubbing.
  • Future Work
  • The authors recognize an anomaly in the data fit
    functions: not all of them are square
    (second-power) functions of the upset rate. This
    is anticipated to be due to the complexity of the
    bit clusters of the experimental designs.
    Additional research is called for to derive the
    separate coefficients of the MER equation for
    each design and explain their functional
    associations.