Title: A Triple Module Redundancy Scheme for SEU Mitigation of Static Latch-Based FPGAs (Birds-of-a-Feather)
Carl Carmichael (1), Brendan Bridgford (1), Gary Swift (2), Matt Napier (3)
(1) Xilinx Corporation, San Jose, CA
(2) Jet Propulsion Laboratory, Pasadena, CA
(3) Sandia National Laboratories, Albuquerque, NM
"This work was carried out in part by the Jet
Propulsion Laboratory, California Institute of
Technology, under contract with the National
Aeronautics and Space Administration."
"Reference herein to any specific commercial
product, process, or service by trade name,
trademark, manufacturer, or otherwise, does not
constitute or imply its endorsement by the United
States Government or the Jet Propulsion
Laboratory, California Institute of Technology."
2. XTMR SEU Mitigation
- Xilinx Triple Module Redundancy (XTMR)
- Single Point Failures are eliminated by triplication of every logic node (gates and nets).
- XTMR confers SEU and SET immunity
- XTMR does not protect against SEFIs!
- Any digital design can be XTMRed by
  - Triplication of throughput (combinational and sequential) logic
  - Triplication of feedback logic and inserting majority voters
  - Adding redundant IO (outputs with minority voters)
  - Design cleanup (removing half-latches, SRL16s, etc.)
3. XTMR State-Machines
Pre-TMR
- XTMR provides autonomous re-synchronization of the separate redundant domains of a state machine by inserting majority voters at the origin of every registered feedback loop.
- When a configuration upset disables one domain, the other two domains continue to operate, providing a correct majority representation of state data and functionality.
- When scrubbing fixes the configuration of the upset domain, the embedded redundant voters automatically correct the state of the upset domain without any external intervention.
- As long as the scrub rate is greater than the upset rate, a single-bit upset cannot disturb more than one redundant domain.
Post-XTMR
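The re-synchronization behaviour described on this slide can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the XTMR tool's implementation: the 4-bit counter, the function names `majority` and `clock`, and the tuple of three integers standing in for the three redundant domains are all hypothetical.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three redundant copies."""
    return (a & b) | (a & c) | (b & c)


def clock(state, width=4):
    """Advance a triplicated counter by one clock cycle.

    Each domain has its own voter on the feedback path, so every domain
    computes its next state from the voted value rather than from its
    own (possibly corrupted) register.
    """
    mask = (1 << width) - 1
    next_state = []
    for _domain in range(3):
        voted = majority(*state)  # this domain's private voter
        next_state.append((voted + 1) & mask)
    return tuple(next_state)


# Hypothetical upset: domain 1 holds 12 while domains 0 and 2 hold 5.
upset_state = (5, 12, 5)

# One clock later the voters have pulled domain 1 back in step.
print(clock(upset_state))  # (6, 6, 6)
```

The model only covers the state register. In the scheme described here, a configuration upset in one domain must first be repaired by scrubbing before that domain's logic can follow the voted state again.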
4. XTMR Inputs
- Effective SEU mitigation requires the use of triple redundant input pins for every input signal.
- Not triplicating global input signals (clk, rst, etc.) can seriously compromise SEU resistance.
- Triplication of input data paths can be traded for EDAC.
- SEU resistance is sometimes traded off against resource utilization.
5. XTMR Outputs with Minority Voters
- Outputs can be triplicated, using three pins for each output signal.
- Minority voters monitor each of the triplicated design modules.
- If one module is different from the others, its output pin is driven to High-Z.
- Voters are triplicated.
[Figure: three Minority Voter blocks, each with an output labeled P, for modules TR0, TR1 and TR2. The convergence point is outside the FPGA, at the trace.]
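The minority-voter rule on this slide (a module that disagrees with the other two has its pin driven to High-Z) can be sketched as follows. The names `HIGH_Z`, `is_minority`, `drive_pins` and `trace_level` are hypothetical, and signals are modelled as single bits; this is an illustration of the rule, not the actual voter primitive.

```python
HIGH_Z = None  # hypothetical marker for a tri-stated (undriven) pin


def is_minority(own, other_a, other_b):
    """True when this module's bit disagrees with both other modules."""
    return own != other_a and own != other_b


def drive_pins(tr0, tr1, tr2):
    """Return the three pin states; a minority module's pin goes High-Z."""
    values = (tr0, tr1, tr2)
    pins = []
    for i, own in enumerate(values):
        others = values[:i] + values[i + 1:]
        pins.append(HIGH_Z if is_minority(own, *others) else own)
    return tuple(pins)


def trace_level(pins):
    """Level seen where the three pins converge on the board trace."""
    driven = {p for p in pins if p is not HIGH_Z}
    if len(driven) != 1:
        raise ValueError("contention or undriven trace")
    return driven.pop()


# Module TR1 is upset and outputs 0 while TR0 and TR2 output 1.
pins = drive_pins(1, 0, 1)
print(pins)               # (1, None, 1)
print(trace_level(pins))  # 1
```

Because the disagreeing pin is released rather than overridden, the two healthy pins set the level at the external convergence point without contention.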
6. Previous SEE Test Methodology for Mitigation
- The assertion of the combined mitigation method of XTMR and scrubbing is that the complete removal of Single Event Functional Errors in the user logic reduces any user design to an overall error rate determined by the remaining Single Event Functional Interrupts (SEFIs). Therefore, a successful mitigation test is expected to produce zero errors other than SEFIs.
- Since the effectiveness of TMR depends on errors not accumulating in the configuration, experiments attempted to maintain an upset rate that did not exceed the scrub rate. This methodology had two significant flaws:
  - One is the impracticality of testing at such low fluxes, which requires unreasonably long run times and so cannot reach sufficient fluence for acceptable statistical significance.
  - The other is that a zero error rate result is not useful for making any calculations or extrapolations.
- These issues raise concerns over the validity of any results.
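Both flaws can be put in numbers with a rough sketch. The two flux values are ones quoted elsewhere in this deck (1.25E-01 p/cm2/s as the beam equivalent of the worst case orbital upset rate, 4E04 p/cm2/s as the top of the tested range). The target fluence of 1E7 particles/cm2 is a hypothetical figure chosen only for illustration, and the zero-event bound is the standard Poisson upper limit, which is not a formula taken from the slides.

```python
import math

TARGET_FLUENCE = 1e7  # particles/cm2, hypothetical target for illustration


def run_time_seconds(target_fluence, flux):
    """Beam time needed to accumulate a fluence at a constant flux."""
    return target_fluence / flux


def zero_event_upper_limit(fluence, confidence=0.95):
    """Poisson upper limit on the cross-section (cm2) when no events are seen."""
    return -math.log(1.0 - confidence) / fluence


low_flux_days = run_time_seconds(TARGET_FLUENCE, 1.25e-1) / 86_400
high_flux_seconds = run_time_seconds(TARGET_FLUENCE, 4e4)
print(f"low flux:  {low_flux_days:.0f} days")
print(f"high flux: {high_flux_seconds:.0f} seconds")

# A run with zero errors yields only a bound, not a rate to extrapolate.
print(f"zero-event bound: {zero_event_upper_limit(TARGET_FLUENCE):.2e} cm2")
```

Under these assumptions the low-flux run takes on the order of years of beam time, while the zero-error outcome still provides only an upper bound.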
7. Improved SEE Test Methodology for Mitigation
- There is an expected physical relationship between the functional error rate of a mitigated system and the upset rate. The relationship predicts the increasing probability of upsetting bit combinations that will cause a mitigated (TMR) system to fail as the bit upset rate rises:
  $\mathrm{MER} = \frac{1}{2} \cdot \frac{N_B \, C_A}{T_S} \cdot R_U^{2}$
  - $\mathrm{MER}$: Mitigation Error Rate
  - $N_B$: Number of Relevant Bits
  - $C_A$: Average Cluster Size
  - $T_S$: Scrub Time
  - $R_U$: Upset Rate of Relevant Bits
- Therefore, testing at extremely high fluxes, varied over several orders of magnitude, can be performed to reveal this functional relationship between mitigation error rate and bit upset rate.
- This function can then be extrapolated to make predictions at the much lower upset rates of earth orbits.
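The quadratic dependence on upset rate is what makes high-flux testing and extrapolation workable. The sketch below assumes the equation on this slide reads MER = (1/2)(N_B C_A / T_S) R_U^2; the function name and every parameter value are hypothetical placeholders, not measured data.

```python
def mitigation_error_rate(n_b, c_a, t_s, r_u):
    """MER = (1/2) * (N_B * C_A / T_S) * R_U**2, as read from the slide.

    n_b: number of relevant bits
    c_a: average cluster size
    t_s: scrub time
    r_u: upset rate of relevant bits
    """
    return 0.5 * (n_b * c_a / t_s) * r_u ** 2


# Hypothetical placeholder parameters, for illustration only.
N_B, C_A, T_S = 1_000_000, 2.0, 0.5

low = mitigation_error_rate(N_B, C_A, T_S, r_u=1e-3)
high = mitigation_error_rate(N_B, C_A, T_S, r_u=1e-2)

# A 10x increase in upset rate gives a 100x increase in MER.
print(high / low)
```

On a log-log plot this relationship is a straight line of slope 2, so points measured at high upset rates can be extended down to orbital rates.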
8. Plot Definitions
- Predicted SEFI cross-section
  - Static and dynamic SEE characterization of the Virtex-II FPGA revealed several Single Event Functional Interrupt modes: POR (2.5E-06), SMAP (1.72E-06), IOB (4.2E-06).
  - These combined cross-sections represent the minimum functional error cross-section for a single Virtex-II (XQR2V6000) device on orbit.
- Worst Case Orbital Upset Rate
  - The CREME96 calculation of the worst case orbital upset rate for an XQR2V6000 is 7,740 bit-errors/day (9E-02 bit-errors/sec) in a GEO orbit at 36,000 km during the worst day of an Anomalously Large Solar Flare, accounting for both heavy ions and protons. In a 40 MeV Kr beam the same upset rate is achieved with a flux of 1.25E-01 p/cm2/s. This means that the equivalent upset rates for all other orbits and solar conditions lie to the LEFT of this line.
- Single Event Functional Interrupts
  - This is the average cross-section of the SEFIs observed while collecting the data represented in the plot. This cross-section is not flux dependent. Variations from the predicted value reflect the statistical limits of the total fluence accumulated during each test.
- Functional Errors
  - Data plot of the observed events in which the Device Under Test (DUT) returned an incorrect result. The cross-section is the number of error events divided by the total fluence at the specified flux. TMR denotes that the DUT design was fully mitigated with XTMR and scrubbing. The Unmitigated results were obtained with a functionally identical design without XTMR; scrubbing was also used for the unmitigated test.
- Extrapolation
  - A derived function describing the relation between mitigation failure and upset rate. Extending the function predicts functional error cross-sections at worst case orbital upset rates to be less than the SEFI cross-sections.
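The arithmetic behind these definitions is short enough to check directly. The sketch below sums the three SEFI mode cross-sections quoted on this slide, converts the daily CREME96 upset rate to a per-second rate, and expresses the cross-section definition (events divided by fluence) as a function. The function names are hypothetical, and the event count and fluence in the last example are placeholders, not test data.

```python
# SEFI mode cross-sections as quoted on the slide.
SEFI_MODES = {"POR": 2.5e-06, "SMAP": 1.72e-06, "IOB": 4.2e-06}


def combined_sefi_cross_section(modes):
    """Sum of the individual SEFI mode cross-sections."""
    return sum(modes.values())


def per_second(per_day):
    """Convert a daily rate to a per-second rate."""
    return per_day / 86_400


def cross_section(events, fluence):
    """Cross-section = number of events / total fluence."""
    return events / fluence


print(f"{combined_sefi_cross_section(SEFI_MODES):.2e}")  # 8.42e-06
print(f"{per_second(7_740):.1e}")                        # 9.0e-02

# Placeholder example: 5 error events over a fluence of 1E6 particles/cm2.
print(f"{cross_section(5, 1e6):.1e}")                    # 5.0e-06
```

The combined value of 8.42E-06 is the same figure that appears as the SEFI term in the SEE Test Analysis slide.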
9. PLOT 1
[Plot: secondary (top) x-axis "Configuration Bit Errors per Scrub Cycle", ticks 3.5E-02 to 3.5E03. Beam: 40 MeV Kr, LET 22.3 MeV·cm2/mg.]
- Reference line: 36,000 km GEO orbit, worst-day solar flare, 8,000 bit-errors/day. All other orbits fall to the left of this line.
- SEFIs drive the error rate for all designs and all orbits.
- Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.
10. PLOT 2
[Plot: secondary (top) x-axis "Configuration Bit Errors per Scrub Cycle", ticks 3.5E-02 to 3.5E03. Beam: 40 MeV Kr, LET 22.3 MeV·cm2/mg.]
- Reference line: 36,000 km GEO orbit, worst-day solar flare, 8,000 bit-errors/day. All other orbits fall to the left of this line.
- SEFIs drive the error rate for all designs and all orbits.
- Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.
11. PLOT 3
[Plot: secondary (top) x-axis "Configuration Bit Errors per Scrub Cycle", ticks 3.5E-02 to 3.5E03. Beam: 40 MeV Kr, LET 22.3 MeV·cm2/mg.]
- Reference line: 36,000 km GEO orbit, worst-day solar flare, 8,000 bit-errors/day. All other orbits fall to the left of this line.
- SEFIs drive the error rate for all designs and all orbits.
- Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.
12. SEE Test Analysis
- The experiments were conducted over a flux range of 7E00 to 4E04 p/cm2/s.
- The flux rates have been normalized on the secondary (top) x-axis of the plots to average bit upsets per scrub cycle ($R_S$).
- Each experiment demonstrated a drop in failure cross-section over several orders of magnitude, crossing the SEFI cross-section at upset rates that are still several orders of magnitude above worst case orbital upset rates.
- Extrapolating this data for each experiment demonstrates a mitigation error cross-section at least one order of magnitude below the SEFI cross-section at worst case orbital upset rates.
- By superposition of the data fit functions, the total effective mitigated error rate cross-section is
  $\sigma_{TOTAL} = \sigma_{BRAM} + \sigma_{CLB} + \sigma_{MULT} + \sigma_{SEFI}$
  $\sigma_{TOTAL} = 5.0 \times 10^{-8} (1.4 R_S)^{2} + 5.0 \times 10^{-6} (0.7 R_S)^{0.5} + 1.75 \times 10^{-6} (1.4 R_S)^{0.35} + 8.42 \times 10^{-6}$ (cm2)
- Therefore, at the worst case orbital upset rate of 9E-2 upsets/sec ($R_S$ = 4.5E-2 upsets/scrub), the effective total cross-section for functional error is calculated as
  $\sigma_{TOTAL} = 1.05 \times 10^{-5}$ cm2/device (orbital worst case)
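The superposition can be evaluated numerically as a sanity check. The sketch below assumes each fit term has the form A * (k * R_S)^p, with the coefficients read from the slide, and assumes the three terms map to BRAM, CLB and MULT in the order listed; both the reading and the mapping are assumptions, and the names `FITS`, `mitigation_terms` and `sigma_total` are hypothetical.

```python
SIGMA_SEFI = 8.42e-6  # cm2, combined SEFI cross-section from the slides

# (A, k, p) for each fit term A * (k * R_S) ** p, as read from the slide.
# The assignment of terms to BRAM / CLB / MULT follows listing order (assumed).
FITS = {
    "BRAM": (5.0e-8, 1.4, 2.0),
    "CLB": (5.0e-6, 0.7, 0.5),
    "MULT": (1.75e-6, 1.4, 0.35),
}


def mitigation_terms(r_s):
    """Per-experiment mitigation error cross-sections (cm2) at rate r_s."""
    return {name: a * (k * r_s) ** p for name, (a, k, p) in FITS.items()}


def sigma_total(r_s):
    """Total effective cross-section: mitigation terms plus SEFI term."""
    return sum(mitigation_terms(r_s).values()) + SIGMA_SEFI


R_S_ORBIT = 4.5e-2  # upsets per scrub cycle, worst case orbit (from the slide)

for name, value in mitigation_terms(R_S_ORBIT).items():
    print(f"{name}: {value:.2e} cm2")
print(f"total: {sigma_total(R_S_ORBIT):.2e} cm2")
print(f"SEFI share: {SIGMA_SEFI / sigma_total(R_S_ORBIT):.0%}")
```

Under this reading the total comes out at roughly 1.0E-5 cm2, the same order as the 1.05E-5 quoted on the slide, with the SEFI term supplying most of it. The small difference is not resolved here; it may come from rounding of the published coefficients or from a different form of the fit terms.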
13. Conclusions
- The efficiency and accuracy of validating mitigation techniques are greatly improved by demonstrating the upset-rate dependency of the mitigation method, testing at flux rates that overwhelm the mitigation.
- The static SEFI cross-section is the dominating factor for calculating orbital error rates for any Virtex-II design when mitigated with full XTMR and scrubbing.
- Future Work
  - The authors recognize an anomaly in the data fit functions in that they were not all expressed as a square function. It is anticipated that this is due to the complexity of the bit clusters of the experimental designs. Additional research is called for to derive the separate coefficients for the MER equation for each design and explain their functional associations.