Title: A Triple Module Redundancy Scheme for SEU Mitigation of Static Latch-Based FPGAs
1. A Triple Module Redundancy Scheme for SEU
Mitigation of Static Latch-Based FPGAs
Carl Carmichael (1), Brendan Bridgford (1), Gary Swift (2), Matt Napier (3)
(1) Xilinx Corporation, San Jose CA
(2) Jet Propulsion Laboratory, Pasadena CA
(3) Sandia National Laboratories, Albuquerque NM
"This work was carried out in part by the Jet
Propulsion Laboratory, California Institute of
Technology, under contract with the National
Aeronautics and Space Administration."
"Reference herein to any specific commercial
product, process, or service by trade name,
trademark, manufacturer, or otherwise, does not
constitute or imply its endorsement by the United
States Government or the Jet Propulsion
Laboratory, California Institute of Technology."
2. ABSTRACT
- Xilinx Triple Module Redundancy, or XTMR, is an SEU mitigation
  technique and design methodology intended to remove all single
  points of failure within the configuration control cells and user
  logic elements, including those in the voting circuitry, and to
  prevent the propagation of single event transients, by triplicating
  all inputs, outputs, logic, clock domains and voters. Voters are
  also inserted on all state logic feedback paths, conferring full
  SEU and SET immunity while allowing for autonomous
  re-synchronization of just-reconfigured state logic to the
  redundant domains.
- This paper presents the fundamental philosophy of the XTMR method,
  the automated implementation of XTMR provided by the new release of
  the Xilinx TMRTool, and Single Event Effects testing and analysis
  of the combined SEU mitigation technique of XTMR and autonomous
  partial re-configuration (scrubbing).
- The SEE test analysis demonstrates that this combined SEU
  mitigation technique pushes the cross-section for functional error
  for any design in any orbit to at least one order of magnitude
  below the established cross-sections for device level Single Event
  Functional Interrupts (SEFI). This study has the potential to
  relieve many users of the need to perform independent SEE testing
  on individual design implementations.
3. XTMR SEU Mitigation
- Xilinx Triple Module Redundancy (XTMR)
  - Single Point Failures are eliminated by triplication of every
    logic node (gates and nets).
  - XTMR confers SEU and SET immunity
  - XTMR does not protect against SEFIs!
- Any digital design can be XTMRed by
  - Triplication of throughput (combinational and sequential) logic
  - Triplication of feedback logic and inserting majority voters
  - Adding redundant IO (outputs with minority voters)
  - Design cleanup (removing half-latches, SRL16s, etc.)
4. XTMR State-Machines
- XTMR provides autonomous re-synchronization of the separate
  redundant domains of a state-machine by inserting majority voters
  at the origin of any registered feedback loop.
- When a configuration upset disables one domain, the other two
  domains continue to operate, providing a correct majority
  representation of state data and functionality.
- When Scrubbing fixes the configuration of the upset domain, the
  embedded redundant voters automatically correct the state of the
  upset domain without any external intervention.
- As long as the scrub rate is greater than the upset rate, a single
  bit upset cannot disturb more than one redundant domain.
[Figures: state-machine schematics labeled Pre-TMR and Post-XTMR]
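The re-synchronization behavior described on this slide can be illustrated with a small behavioral model. This is a minimal sketch, not the TMRTool output: the counter, the function names `majority` and `clock`, and the choice to model an upset domain as loading 0 are all assumptions made for illustration.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def clock(state, upset_domain=None, width=8):
    """Advance a triplicated counter by one clock.

    state        -- tuple holding the register value of each domain
    upset_domain -- index of a domain whose logic is currently broken by a
                    configuration upset, or None when all domains are healthy
    """
    mask = (1 << width) - 1
    nxt = []
    for domain in range(3):
        # Each domain has its own voter on the feedback path, so the next
        # state is always computed from the majority of the three registers.
        voted = majority(*state)
        if domain == upset_domain:
            nxt.append(0)  # assumption: a broken domain loads a wrong value
        else:
            nxt.append((voted + 1) & mask)
    return tuple(nxt)


state = (5, 5, 5)
for _ in range(3):                  # domain 1 is upset for three clocks
    state = clock(state, upset_domain=1)
upset_state = state                 # (8, 0, 8): domains 0 and 2 kept counting
state = clock(state)                # scrubbing has repaired domain 1
print(upset_state, state)           # (8, 0, 8) (9, 9, 9)
```

In this model the repaired domain rejoins the other two on the first clock after the repair, because its feedback voter hands it the majority state rather than its own stale register value.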
5. XTMR Inputs
- Effective SEU Mitigation requires the use of triple redundant input
  pins for every input signal.
- Not triplicating global input signals (clk, rst, etc.) can
  seriously compromise SEU resistance.
- Triplication of input data paths can be traded for EDAC.
- SEU resistance is sometimes traded off against resource
  utilization.
6. XTMR Outputs with Minority Voters
- Outputs can be triplicated, using three pins for each output
  signal.
- Minority voters monitor each of the triplicated design modules.
- If one module is different from the others, its output pin is
  driven to High-Z.
- Voters are triplicated.
[Figure: three Minority Voter blocks, labeled P / TR0, P / TR1, and
P / TR2. Caption: Convergence point is outside FPGA, at trace]
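The output scheme above can be sketched as a small model. This is an illustrative sketch under stated assumptions: signals are single bits, and the names `minority`, `drive_pins`, and `trace_value` are hypothetical helpers, not identifiers from the Xilinx libraries.

```python
from typing import List, Optional


def minority(own: int, other_a: int, other_b: int) -> int:
    """Return 1 when `own` disagrees with both of the other domains."""
    return (own ^ other_a) & (own ^ other_b) & 1


def drive_pins(tr0: int, tr1: int, tr2: int) -> List[Optional[int]]:
    """Value driven on each of the three output pins; None means High-Z."""
    values = (tr0, tr1, tr2)
    pins = []
    for i, own in enumerate(values):
        others = [v for j, v in enumerate(values) if j != i]
        # One voter per domain: a domain in the minority releases its pin.
        pins.append(None if minority(own, *others) else own)
    return pins


def trace_value(pins: List[Optional[int]]) -> Optional[int]:
    """Resolve the board trace where the three pins converge."""
    driven = {p for p in pins if p is not None}
    # With at most one faulty domain, every pin still driving agrees.
    return driven.pop() if len(driven) == 1 else None


pins = drive_pins(1, 0, 1)   # domain TR1 is upset
print(pins, trace_value(pins))
```

With domain TR1 upset, its pin is released and the two healthy pins drive the trace to the correct value, which is why the convergence point can sit outside the device.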
7. Xilinx TMRTool
- The Xilinx TMRTool is a graphical application that automates the
  implementation of XTMR for FPGA designs.
- The designer has the flexibility to selectively apply XTMR to a
  design at the instance, component, and hierarchical levels.
- Custom mitigation methods may be employed for specific portions of
  the design through user-created library macros.
- Designs are imported from a Xilinx netlist (NGO/NGC) and exported
  as a single standard EDIF project source.
8. XTMR SEE Testing
- Validation of mitigation of architectural resources by
  superposition.
- Separate experiments were created to cover the major elements of
  the Virtex-II architecture
  - Configurable Logic Block
    - Combinatorial Logic, Sequential Logic, Arithmetic,
      Multiplexing.
    - Design implementation is an array of state-machines.
  - Multipliers
    - Dedicated 18 x 18 bit multiply function blocks.
    - Design implementation is an array of Multiply and Accumulate
      functions.
  - Block Memories
    - Synchronous Dual Port 18k bit RAM blocks.
    - Design implemented as a single large memory space for high
      speed store and fetch functions.
  - Input Output Blocks
    - Multi-standard discrete bi-directional un/registered device IO.
    - Design implemented as feed-thru channels from IOB to IOB.
  - Digital Clock Managers
    - Clock frequency synthesis and phase delay re-alignment.
    - This will be tested in future work.
9. 2V6000 Dynamic SEU Test
[Figure: test setup. Labels: BEAM; Thinned DUT; Inside target room;
Functional Monitor / Strip Chart; Front Side; Configuration Monitor /
Strip Chart; Back Side]
10. CLB Test Design
11. CLB Test Functional Description
- The CLB test pre-TMR design consists of 512 (32 bit) counters
  created as 16 modules of 32 counters per module. Each counter in
  the module increments by a different value. The output of each
  module is a multiplex of the 32 counters. The outputs of all the
  modules are again multiplexed to a single 16 bit bus. A 10 bit
  address bus is used to select the output of a specific counter and
  to select between the upper and lower 16 bit banks of the 32 bit
  module outputs.
- The Xilinx TMRTool software is used to process the design into a
  fully XTMR mitigated design. Both the TMR and pre-TMR designs
  undergo active scrubbing (partial reconfiguration for SEU
  correction) for the configuration of the DUT.
- All counters are running continuously. Each counter is selected
  sequentially for sampling of its current state and operation.
- For each module sample taken, the actual and expected values are
  recorded, along with the sequential count of state errors and the
  running count of event errors, into a strip chart file on the host
  PC.
- When counters are observed to be permanently in the wrong state,
  the design is reset to regain a fully functioning test.
- The final error count is calculated as the number of events in
  which a counter either lost its state or moved to the wrong state.
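The sampling scheme above implies a golden model that the functional monitor can compare against. The following is a minimal sketch of such a model. The slide does not give the address bit layout or the per-counter increments, so both are assumptions here: bit 0 selects the 16 bit bank, bits 1 to 5 select the counter, bits 6 to 9 select the module, and counter k increments by k + 1. The names `decode` and `expected_sample` are hypothetical.

```python
COUNTER_MASK = 0xFFFFFFFF   # 32 bit counters
BANK_MASK = 0xFFFF          # 16 bit output bus


def decode(addr: int):
    """Split a 10 bit address into (module, counter, bank).

    Assumed layout: bit 0 = bank, bits 1-5 = counter, bits 6-9 = module.
    """
    bank = addr & 0x1
    counter = (addr >> 1) & 0x1F
    module = (addr >> 6) & 0xF
    return module, counter, bank


def expected_sample(addr: int, cycles: int) -> int:
    """Expected 16 bit bus value for `addr` after `cycles` clocks from reset."""
    _module, counter, bank = decode(addr)
    increment = counter + 1  # assumption: counter k increments by k + 1
    value = (cycles * increment) & COUNTER_MASK
    return (value >> 16) & BANK_MASK if bank else value & BANK_MASK


# Compare one observed sample against the model.
observed = 20
is_error = observed != expected_sample(addr=2, cycles=10)
print(decode(0x3FF), is_error)
```

A monitor built this way only needs the clock count since the last reset to decide whether a sampled counter has lost its state.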
12. Multiplier Test Design
[Figure: block diagram of the multiplier test. Recoverable labels:
DUT; SERVICE; Configuration Manager Core; Functional Monitor; Error
Counters; MODULE (mod0 to mod15); MAC; Constant; MUX 3x2x32; MUX
32x1x16; bus widths 36, 32, 16, 8, 5, 3]
13. Multiplier Test Functional Description
- The Multiplier test pre-TMR design consists of 48 (18x18x36 bit)
  Multiply and Accumulate (MAC) blocks created as 16 modules of 3
  MACs per module. Each MAC in the module increments by 1 and
  multiplies by a different constant (1, 10, and 11, respectively).
  The output of each module is a multiplex of the 3 MACs and a select
  of the lower 32 bits and upper 4 bits of the 36 bit registered
  multiplier output. The outputs of all the modules are again
  multiplexed to a single 16 bit bus. An 8 bit address bus is used to
  select the output of a specific MAC and to select between the upper
  and lower 16 bit banks of the 32 bit module outputs.
- The Xilinx TMRTool software is used to process the design into a
  fully TMR mitigated design. Both the TMR and pre-TMR designs
  undergo active scrubbing (partial reconfiguration for SEU
  correction) for the configuration of the DUT.
- All MACs are constantly accumulating. Each MAC is selected
  sequentially for a periodic sampling of its sequence.
- For each module sample taken, the actual and expected values are
  recorded, along with the sequential count of state errors and the
  running count of event errors, into a strip chart file on the host
  PC.
- When MACs are observed to be permanently in the wrong state, the
  design is reset to regain a fully functioning test.
- The final error count is calculated as the number of events in
  which a MAC lost its state or produced an incorrect result.
14. BRAM Test Design
[Figure: block diagram of the BRAM test. Recoverable labels: DUT;
SERVICE; Configuration Manager Core; 128k byte RAM; Functional
Monitor; Error Counters; DATA (16); ADDRESS (16); -1]
15. BRAM Test Functional Description
- The Block Memory test pre-TMR design consists of a single large
  128k byte single port memory space created from 64 memory blocks of
  16k bits each.
- The Xilinx TMRTool software is used to process the design into a
  fully TMR mitigated design. Both the TMR and pre-TMR designs
  undergo active scrubbing (partial reconfiguration for SEU
  correction) for the configuration of the DUT.
- Separate WRITE and READ routines are executed to all memory address
  locations. The data is derived from a decrement of the address
  value. The entire memory space is refreshed with a write operation
  and then the data is retrieved with a read operation.
- During the read operation the retrieved data is compared against
  the expected value.
- For each data sample taken, the actual and expected values are
  recorded with the running count of event errors into a strip chart
  file on the host PC.
- Each error event is measured for its total word error size in bits
  (1, 32, 64, 512, 1024, etc.).
- The final error count is calculated as the number of separate
  events of word errors.
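The write, read, and compare routine described above can be sketched in software. This is a minimal sketch under stated assumptions: the memory is modeled as 64k words of 16 bits (128k bytes), the data pattern is the address minus one truncated to 16 bits, and consecutive failing addresses are grouped into a single event. The grouping rule and the names `expected`, `write_pass`, `read_pass`, and `group_events` are assumptions, not taken from the test hardware.

```python
WORDS = 64 * 1024        # 128k bytes viewed as 16 bit words
WORD_MASK = 0xFFFF


def expected(addr: int) -> int:
    """Data pattern: a decrement of the address value, truncated to 16 bits."""
    return (addr - 1) & WORD_MASK


def write_pass(mem: list) -> None:
    """Refresh the whole memory space with the pattern."""
    for addr in range(len(mem)):
        mem[addr] = expected(addr)


def read_pass(mem: list) -> list:
    """Return (addr, actual, expected, bad_bits) for every failing word."""
    errors = []
    for addr, actual in enumerate(mem):
        exp = expected(addr)
        if actual != exp:
            errors.append((addr, actual, exp, bin(actual ^ exp).count("1")))
    return errors


def group_events(errors: list) -> list:
    """Group failing words at consecutive addresses into one event each."""
    events = []
    for addr, _actual, _exp, _bits in errors:
        if events and addr == events[-1][-1] + 1:
            events[-1].append(addr)
        else:
            events.append([addr])
    return events


mem = [0] * WORDS
write_pass(mem)
mem[100] ^= 0x0001       # inject upsets for the demonstration
mem[101] ^= 0x0003
mem[5000] ^= 0x8000
errors = read_pass(mem)
events = group_events(errors)
print(len(errors), "bad words in", len(events), "events")
```

Counting events rather than bad words matches the slide's definition of the final error count, since one upset in addressing or control logic can corrupt many words at once.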
16. Configuration Error Detection and Correction Algorithm
- Configure the target FPGA with the configuration data stored in the
  configuration PROM(s).
- Read back the configuration programming data from the target FPGA
  and calculate a 16 bit CRC. Store the CRC value as Config-CRC.
- Perform a Write/Read check on the internal Frame Address Register
  of the target FPGA.
- Scrub (background refresh) the configuration data of the target
  FPGA.
- Read back the configuration programming data from the target FPGA
  and calculate a 16 bit CRC. Store the CRC value as Rdbk-CRC and
  perform bit-for-bit error detection of the configuration data.
- Compare Rdbk-CRC with Config-CRC.
- If the CRC values mismatch a second time, assert SEFI_ERROR and
  RECONFIGURE.
[Flowchart of the algorithm. Recoverable labels: START; DONE (0/1);
SEFI; PREV SCRUB; CONFIG; SCRUB; CRC ERROR (0/1); YES/NO branches.
The connections between the boxes are not recoverable.]
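The detection and correction steps listed on this slide can be sketched as one scrub cycle of a configuration manager. This is an illustrative sketch only. `FakeFpga` and its methods are hypothetical stand-ins for the real configuration port, `binascii.crc_hqx` (CRC-CCITT) stands in for whichever 16 bit CRC the test hardware used, and treating a failed Frame Address Register check as a SEFI is an assumption.

```python
import binascii


def crc16(data: bytes) -> int:
    """16 bit CRC (CRC-CCITT here, as a stand-in)."""
    return binascii.crc_hqx(data, 0xFFFF)


class FakeFpga:
    """Hypothetical model of the DUT configuration memory."""

    def __init__(self):
        self.config = bytearray()
        self.far_ok = True
        self.stuck = False   # True models a SEFI: scrubbing has no effect

    def configure(self, bitstream: bytes) -> None:
        self.config = bytearray(bitstream)
        self.far_ok = True
        self.stuck = False

    def scrub(self, bitstream: bytes) -> None:
        if not self.stuck:
            self.config = bytearray(bitstream)

    def readback(self) -> bytes:
        return bytes(self.config)

    def far_check(self) -> bool:
        return self.far_ok


def scrub_cycle(fpga, golden: bytes, config_crc: int, strikes: int):
    """Run one cycle; return (event, strikes) with strikes = prior mismatches."""
    if not fpga.far_check():            # assumption: FAR failure is a SEFI
        fpga.configure(golden)
        return "SEFI_ERROR", 0
    fpga.scrub(golden)
    if crc16(fpga.readback()) == config_crc:
        return "OK", 0
    if strikes >= 1:                    # mismatch a second time
        fpga.configure(golden)          # RECONFIGURE
        return "SEFI_ERROR", 0
    return "CRC_ERROR", strikes + 1


golden = bytes(range(64))               # placeholder bitstream
fpga = FakeFpga()
fpga.configure(golden)
config_crc = crc16(fpga.readback())

fpga.config[3] ^= 0x10                  # ordinary upset, repaired by the scrub
first = scrub_cycle(fpga, golden, config_crc, 0)

fpga.config[3] ^= 0x10                  # upset that scrubbing cannot repair
fpga.stuck = True
second = scrub_cycle(fpga, golden, config_crc, 0)
third = scrub_cycle(fpga, golden, config_crc, second[1])
print(first, second, third)
```

In this model an ordinary upset is repaired silently by the scrub, while a persistent mismatch is escalated to SEFI_ERROR only on the second consecutive failure, mirroring the "mismatch a second time" rule.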
17. Previous SEE Test Methodology for Mitigation
- The assertion of the combined mitigation method of XTMR and
  Scrubbing is that the complete removal of Single Event Functional
  Errors in the user logic reduces the overall error rate of any user
  design to the rate set by the remaining Single Event Functional
  Interrupts. Therefore, a successful mitigation test is expected to
  produce zero errors other than SEFIs.
- Since the effectiveness of TMR depends on errors not accumulating
  in the configuration, earlier experiments attempted to maintain an
  upset rate that did not exceed the scrub rate. This methodology had
  two significant flaws:
  - Testing at such low fluxes is impractical. It requires
    unreasonably long run times and therefore cannot reach sufficient
    fluence for acceptable statistical significance of the data.
  - A zero error rate result is not useful for making any
    calculations or extrapolations.
- These issues raise concerns over the validity of any results.
18. Improved SEE Test Methodology for Mitigation
- There is an expected physical relationship between the functional
  error rate of a mitigated system and the upset rate. The expected
  relationship is a function that predicts the increasing probability
  of upsetting bit combinations that will cause a mitigated (TMR)
  system to fail as a function of bit upset rate:
  - MER = (1/2) (N_B C_A / T_S) R_U^2
  - MER = Mitigation Error Rate
  - N_B = Number of Relevant Bits
  - C_A = Average Cluster Size
  - T_S = Scrub Time
  - R_U = Upset Rate of Relevant Bits
- Therefore, testing at extremely high fluxes over several orders of
  magnitude variation can be performed to reveal this functional
  relationship between mitigation error rate and bit upset rate.
- This function can then be extrapolated to make predictions at the
  much lower upset rates of earth orbits.
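The practical consequence of the squared term can be checked numerically. The sketch below evaluates the MER expression as read from this slide, MER = (1/2) (N_B C_A / T_S) R_U^2, at two upset rates. The coefficient values are placeholders chosen only to show the scaling and are not measured values from the tests; the function name is hypothetical.

```python
def mitigation_error_rate(n_bits: float, cluster_size: float,
                          scrub_time: float, upset_rate: float) -> float:
    """MER = (1/2) * (N_B * C_A / T_S) * R_U**2, as read from the slide."""
    return 0.5 * (n_bits * cluster_size / scrub_time) * upset_rate ** 2


# Placeholder coefficients (assumptions, for illustration only).
N_B, C_A, T_S = 1.0e6, 2.0, 0.5

high = mitigation_error_rate(N_B, C_A, T_S, upset_rate=1.0e-3)
low = mitigation_error_rate(N_B, C_A, T_S, upset_rate=1.0e-4)
ratio = high / low
print(f"a 10x lower upset rate gives a {ratio:.0f}x lower MER")
```

Whatever the coefficients, a tenfold reduction in upset rate lowers the predicted mitigation error rate a hundredfold, which is what makes extrapolation from high-flux data down to orbital rates attractive.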
19. Plot Definitions
- Predicted SEFI cross-section
  - Static and Dynamic SEE Characterization of the Virtex-II FPGA
    revealed several Single Event Functional Interrupt modes: POR
    (2.5E-06), SMAP (1.72E-06), IOB (4.2E-06).
  - These combined cross-sections represent the minimum functional
    error cross-section for a single Virtex-II (XQR2V6000) device on
    orbit.
- Worst Case Orbital Upset Rate
  - The CREME96 calculation of the worst case orbital upset rate for
    an XQR2V6000 is 7,740 bit-errors/day (9E-02 bit-errors/sec) in a
    GEO orbit at 36,000 km during the worst day of an Anomalously
    Large Solar Flare, accounting for both heavy ions and protons. In
    a 40 MeV Kr beam the same upset rate is achieved with a flux of
    1.25E-01 p/cm2/s. The equivalent upset rates for all other orbits
    and solar conditions therefore lie to the LEFT of this line.
- Single Event Functional Interrupts
  - This is the average cross-section of the SEFI(s) observed while
    collecting the data represented in the plot. This cross-section
    is not flux dependent. Variations from the predicted value are
    due to the limited statistical significance of the total fluence
    accumulated during each test.
- Functional Errors
  - Data plot of the observed events in which the Device Under Test
    returned an incorrect result. The cross-section is the number of
    error events divided by the total fluence at the specified flux.
    TMR denotes that the DUT design was fully mitigated with XTMR and
    scrubbing. The Unmitigated results were obtained with a
    functionally identical design without XTMR; scrubbing was also
    used for the unmitigated test.
- Extrapolation
  - A derived function describing mitigation failure as a function of
    upset rate. Extending the function predicts functional error
    cross-sections at worst case orbital upset rates to be less than
    the SEFI cross-sections.
20. PLOT 1
[Plot. Top x-axis: Configuration Bit Errors per Scrub Cycle, ticks
3.5E-02 to 3.5E03. Beam: 40 MeV Kr, LET 22.3 MeV-cm2/mg. Marked line:
36,000 km GEO orbit, worst day solar flare, 8,000 bit-errors/day, with
all other orbits on one side of it.]
- SEFIs drive error rate for all designs and all orbits.
- Mitigation errors on orbit are always less than SEFI errors by
  orders of magnitude.
21. PLOT 2
[Plot. Top x-axis: Configuration Bit Errors per Scrub Cycle, ticks
3.5E-02 to 3.5E03. Beam: 40 MeV Kr, LET 22.3 MeV-cm2/mg. Marked line:
36,000 km GEO orbit, worst day solar flare, 8,000 bit-errors/day, with
all other orbits on one side of it.]
- SEFIs drive error rate for all designs and all orbits.
- Mitigation errors on orbit are always less than SEFI errors by
  orders of magnitude.
22. PLOT 3
[Plot. Top x-axis: Configuration Bit Errors per Scrub Cycle, ticks
3.5E-02 to 3.5E03. Beam: 40 MeV Kr, LET 22.3 MeV-cm2/mg. Marked line:
36,000 km GEO orbit, worst day solar flare, 8,000 bit-errors/day, with
all other orbits on one side of it.]
- SEFIs drive error rate for all designs and all orbits.
- Mitigation errors on orbit are always less than SEFI errors by
  orders of magnitude.
23. SEE Test Analysis
- The experiments were conducted over a flux range of 7E00 to 4E04
  (p/cm2/s).
- The flux rates have been normalized in the secondary (top) x-axis
  of the plots to average bit upsets per scrub cycle (R_S).
- Each experiment demonstrated a drop in failure cross-section over
  several orders of magnitude, crossing the SEFI cross-section at
  upset rates that are still several orders of magnitude above worst
  case orbital upset rates.
- Extrapolating this data for each experiment demonstrates a
  mitigation error cross-section at least one order of magnitude
  below the SEFI cross-section at worst case orbital upset rates.
- By superposition of the data fit functions, the total effective
  mitigated error rate cross-section is
  - Sigma_TOTAL = Sigma_BRAM + Sigma_CLB + Sigma_MULT + Sigma_SEFI
  - Sigma_TOTAL = 5.0E-8 (1.4 R_S)^2 + 5.0E-6 (0.7 R_S)^0.5
    + 1.75E-6 (1.4 R_S)^0.35 + 8.42E-6 (cm2)
- Therefore, at the worst case orbital upset rate of 9E-2 upsets/sec
  (R_S = 4.5E-2 upsets/scrub) the effective total cross-section for
  functional error is calculated as
  - Sigma_TOTAL = 1.05E-5 (cm2/device), Orbital Worst Case
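The superposition on this slide can be evaluated directly. The sketch below is a reading of the fit terms in which the trailing parenthesized numbers are exponents (2, 0.5, and 0.35) and the terms appear in the order BRAM, CLB, MULT, SEFI; that reading, and the function and variable names, are assumptions. The SEFI term is the sum of the POR, SMAP, and IOB cross-sections given under Plot Definitions.

```python
# SEFI cross-sections from the Plot Definitions slide (cm2).
SIGMA_SEFI = 2.5e-6 + 1.72e-6 + 4.2e-6   # POR + SMAP + IOB = 8.42E-6


def sigma_terms(r_s: float) -> dict:
    """Per-experiment fit terms (cm2) at r_s bit upsets per scrub cycle."""
    return {
        "BRAM": 5.0e-8 * (1.4 * r_s) ** 2,
        "CLB": 5.0e-6 * (0.7 * r_s) ** 0.5,
        "MULT": 1.75e-6 * (1.4 * r_s) ** 0.35,
        "SEFI": SIGMA_SEFI,
    }


def sigma_total(r_s: float) -> float:
    """Total effective cross-section by superposition of the terms."""
    return sum(sigma_terms(r_s).values())


# 9E-2 upsets/sec at 4.5E-2 upsets/scrub implies a 0.5 s scrub cycle.
R_S_WORST = 9e-2 * 0.5
terms = sigma_terms(R_S_WORST)
total = sigma_total(R_S_WORST)
for name, value in terms.items():
    print(f"{name:5s} {value:.3e} cm2")
print(f"TOTAL {total:.3e} cm2")
```

Under this reading the total at the worst case orbital rate comes out close to 1.0E-5 cm2, in line with the 1.05E-5 figure quoted on the slide, and the SEFI term is by far the largest contribution.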
24. Conclusions
- The efficiency and accuracy of validating mitigation techniques is
  greatly improved by demonstrating the upset rate dependency of the
  mitigation method, through testing at flux rates that overwhelm the
  mitigation.
- The static SEFI cross-section is the dominating factor for
  calculating orbital error rates for any Virtex-II design when
  mitigated with full XTMR and Scrubbing.
- Future Work
  - The authors recognize an anomaly in the data fit functions in
    that they were not all expressed as a square function. It is
    anticipated that this is due to the complexity of the bit
    clusters of the experimental designs. Additional research is
    called for to derive the separate coefficients for the MER
    equation for each design and to explain their functional
    associations.