A Triple Module Redundancy Scheme for SEU Mitigation of Static Latch-Based FPGAs - PowerPoint PPT Presentation

Learn more at: http://klabs.org
1
A Triple Module Redundancy Scheme for SEU
Mitigation of Static Latch-Based FPGAs
Carl Carmichael (1), Brendan Bridgford (1), Gary
Swift (2), Matt Napier (3)
(1) Xilinx Corporation, San Jose, CA
(2) Jet Propulsion Laboratory, Pasadena, CA
(3) Sandia National Laboratories, Albuquerque, NM
"This work was carried out in part by the Jet
Propulsion Laboratory, California Institute of
Technology, under contract with the National
Aeronautics and Space Administration."
"Reference herein to any specific commercial
product, process, or service by trade name,
trademark, manufacturer, or otherwise, does not
constitute or imply its endorsement by the United
States Government or the Jet Propulsion
Laboratory, California Institute of Technology."
2
ABSTRACT
  • Xilinx Triple Module Redundancy, or XTMR, is an
    SEU mitigation technique and design methodology
    intended to remove all single points of failure
    within the configuration control cells and user
    logic elements, including those in the voting
    circuitry, as well as preventing the propagation
    of single event transients, by triplicating all
    inputs, outputs, logic, clock domains and voters.
    Voters are also inserted on all state logic
    feedback paths, conferring full SEU and SET
    immunity while allowing for autonomous
    re-synchronization of just-reconfigured state
    logic to the redundant domains.
  • This paper presents the fundamental philosophy of
    the XTMR method, the automated implementation of
    XTMR provided by the new release of the Xilinx
    TMRTool, as well as Single Event Effects testing
    and analysis of the combined SEU mitigation
    technique of XTMR and autonomous partial
    re-configuration (scrubbing).
  • The SEE test analysis demonstrates that this
    combined SEU mitigation technique pushes the
    functional error cross-section for any design
    in any orbit to at least one order of magnitude
    below the established cross-sections for
    device-level Single Event Functional Interrupts
    (SEFIs). This study has the potential to relieve
    many users of the requirement to perform
    independent SEE testing on individual design
    implementations.

3
XTMR SEU Mitigation
  • Xilinx Triple Module Redundancy (XTMR)
  • Single points of failure are eliminated by
    triplication of every logic node (gates and
    nets).
  • XTMR confers SEU and SET immunity.
  • XTMR does not protect against SEFIs!
  • Any digital design can be XTMRed by:
      • Triplication of throughput (combinational
        and sequential) logic
      • Triplication of feedback logic and insertion
        of majority voters
      • Adding redundant IO (outputs with minority
        voters)
      • Design cleanup (removing half-latches,
        SRL16s, etc.)

4
XTMR State-Machines
Pre-TMR
  • XTMR provides autonomous re-synchronization of
    the separate redundant domains of a state-machine
    by inserting majority voters at the origin of
    every registered feedback loop.
  • When a configuration upset disables one domain,
    the other two domains continue to operate
    providing a correct majority representation of
    state data and functionality.
  • When Scrubbing fixes the configuration of the
    upset domain, the embedded redundant voters
    automatically correct the state of the upset
    domain without any external intervention.
  • As long as the scrub rate is greater than the
    upset rate, a single bit upset cannot disturb
    more than one redundant domain.
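The re-synchronization behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the 8-bit counter, the stuck value 0xAA, and the names `majority` and `step` are hypothetical and are not taken from the presentation or from the TMRTool.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def step(states, broken=None, width=8):
    """Advance three redundant counters by one clock.

    Every domain votes on all three current states before computing its
    next state, which models a voter on the registered feedback path.
    The domain index given in `broken` models a configuration upset and
    loads a wrong (hypothetical) value instead of the correct next state.
    """
    mask = (1 << width) - 1
    voted = majority(*states)
    nxt = []
    for domain in range(3):
        if domain == broken:
            nxt.append(0xAA & mask)          # wrong value while upset
        else:
            nxt.append((voted + 1) & mask)   # next state from voted state
    return nxt


states = [0, 0, 0]
for _ in range(5):                  # all three domains healthy
    states = step(states)
for _ in range(3):                  # domain 1 has an upset configuration
    states = step(states, broken=1)
during_upset = list(states)
states = step(states)               # configuration repaired by scrubbing
after_scrub = list(states)

print(during_upset, after_scrub)
```

In this model the repaired domain picks up the voted state on the first clock after the repair, with no reset or external intervention, which is the property the slide attributes to voters placed on feedback paths.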

Post-XTMR
5
XTMR Inputs
  • Effective SEU Mitigation requires the use of
    triple redundant input pins for every input
    signal.
  • Not triplicating global input signals (clk, rst,
    etc.) can seriously compromise SEU resistance.
  • Triplication of input data paths can be traded
    for EDAC.
  • SEU resistance is sometimes traded off against
    resource utilization.

6
XTMR Outputs with Minority Voters
  • Outputs can be triplicated, using three pins for
    each output signal.
  • Minority voters monitor each of the triplicated
    design modules.
  • If one module differs from the other two, its
    output pin is driven to High-Z.
  • The voters themselves are triplicated.
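A minimal single-bit model of the minority-voter outputs may help here. It is a sketch, not the TMRTool implementation: the function names are hypothetical, and High-Z is represented by `None`.

```python
def majority(a: int, b: int, c: int) -> int:
    """2-of-3 majority of single-bit values."""
    return (a & b) | (a & c) | (b & c)


def minority(own: int, other_a: int, other_b: int) -> bool:
    """True when this domain disagrees with the majority."""
    return own != majority(own, other_a, other_b)


def drive_pins(tr0: int, tr1: int, tr2: int):
    """Return the level driven on each of the three output pins.

    A pin whose domain is in the minority is released (None = High-Z).
    Each pin has its own voter, so the voters are triplicated as well.
    """
    bits = (tr0, tr1, tr2)
    pins = []
    for i, own in enumerate(bits):
        others = bits[:i] + bits[i + 1:]
        pins.append(None if minority(own, *others) else own)
    return pins


def trace_level(pins):
    """Level seen where the three pins converge on the board trace."""
    driven = {p for p in pins if p is not None}
    assert len(driven) == 1, "pins that still drive must agree"
    return driven.pop()


print(drive_pins(1, 0, 1), trace_level(drive_pins(1, 0, 1)))
```

With one domain in error, the two remaining pins drive the same level, so the converged trace carries the majority value.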

[Figure: three minority voters, one per redundant domain (TR0, TR1, TR2), each paired with an output labeled P. The convergence point is outside the FPGA, at the trace.]
7
Xilinx TMRTool
  • The Xilinx TMRTool is a graphical application
    that automates the implementation of XTMR for
    FPGA designs.
  • The designer is provided the flexibility to
    selectively apply XTMR to their design at the
    instance, component, and hierarchical levels.
  • Use of custom mitigation methods may be employed
    for specific portions of the design with the use
    of user created library macros.
  • Designs are imported from a Xilinx netlist
    (NGO/NGC) and exported as a single standard EDIF
    project source.

8
XTMR SEE Testing
  • Validation of mitigation of architectural
    resources by superposition.
  • Separate experiments were created to cover the
    major elements of the Virtex-II architecture:
      • Configurable Logic Block
          • Combinatorial logic, sequential logic,
            arithmetic, multiplexing.
          • Design implementation is an array of
            state-machines.
      • Multipliers
          • Dedicated 18 x 18 bit multiply function
            blocks.
          • Design implementation is an array of
            Multiply and Accumulate functions.
      • Block Memories
          • Synchronous Dual Port 18k bit RAM blocks.
          • Design implemented as a single large
            memory space for high speed store and
            fetch functions.
      • Input Output Blocks
          • Multi-standard discrete bi-directional
            un/registered device IO.
          • Design implemented as feed-thru channels
            from IOB to IOB.
      • Digital Clock Managers
          • Clock frequency synthesis and phase delay
            realignment.
          • This will be tested in future work.
9
2V6000 Dynamic SEU Test
[Figure: test setup. Labels: BEAM, Thinned DUT, Inside target room, Functional Monitor / Strip Chart, Configuration Monitor / Strip Chart, Front Side, Back Side.]
10
CLB Test Design
11
CLB Test Functional Description
  • The CLB test pre-TMR design consists of 512 (32
    bit) counters created as 16 modules of 32
    counters per module. Each counter in the module
    increments by a different value. The output of
    each module is a multiplex of the 32 counters.
    The outputs of all the modules are again
    multiplexed to a single 16 bit bus. A 10 bit
    address bus is used to select the output of a
    specific counter and select between the upper and
    lower 16 bit banks of the 32 bit module outputs.
  • The Xilinx TMRTool software is used to process
    the design into a fully XTMR mitigated design.
    Both the TMR and pre-TMR designs undergo active
    scrubbing (partial reconfiguration for SEU
    correction) for the configuration of the DUT.
  • All counters are running continuously. Each
    counter is selected sequentially for sampling of
    its current state and operation.
  • For each module sample taken, the actual and
    expected values are recorded, along with the
    sequential count of state errors and the running
    count of event errors, in a strip chart file on
    the host PC.
  • When counters are observed to be permanently in
    the wrong state, the design is reset to regain a
    fully functioning test.
  • The final error count is calculated as the number
    of events in which a counter either lost its
    state or moved to the wrong state.
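The addressing and error counting described above can be sketched as follows. The 10-bit address is consistent with 16 modules (4 bits), 32 counters (5 bits), and one upper/lower select bit, but the field order, the increment values (counter index plus one), and all function names here are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def decode_address(addr: int):
    """Split a 10-bit address into (module, counter, upper_half).

    Assumed layout: [9:6] module, [5:1] counter, [0] upper/lower bank.
    """
    assert 0 <= addr < (1 << 10)
    upper_half = bool(addr & 1)
    counter = (addr >> 1) & 0x1F
    module = (addr >> 6) & 0xF
    return module, counter, upper_half


def expected_sample(addr: int, cycles: int) -> int:
    """Expected 16-bit bus value after `cycles` clocks from reset."""
    _module, counter, upper_half = decode_address(addr)
    increment = counter + 1              # assumed per-counter increment
    value = (increment * cycles) & MASK32
    return (value >> 16) & 0xFFFF if upper_half else value & 0xFFFF


def count_error_events(samples) -> int:
    """Count events: a mismatch that follows a matching sample.

    Consecutive mismatches belong to the same event, mirroring the
    distinction between state errors and event errors in the text.
    """
    events = 0
    in_error = False
    for actual, expected in samples:
        if actual != expected and not in_error:
            events += 1
        in_error = actual != expected
    return events


print(decode_address(0b1111111111), expected_sample(4, 100_000))
```

For example, `count_error_events([(1, 1), (2, 9), (3, 9), (4, 4), (5, 0)])` reports two events, because the two adjacent mismatches are treated as one persistent wrong state.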

12
Multiplier Test Design
[Figure: Multiplier test block diagram. DUT side: modules mod0 to mod15, each with three MAC blocks (36-bit outputs) feeding a 3x2x32 MUX, followed by a 32x1x16 MUX. SERVICE side: Configuration Manager Core, Functional Monitor, Error Counters.]
13
Multiplier Test Functional Description
  • The Multiplier test pre-TMR design consists of
    48 (18x18x36 bit) Multiply and Accumulate (MAC)
    blocks created as 16 modules of 3 MACs per
    module. Each MAC in the module increments by 1
    and multiplies by a different constant (1, 10,
    and 11, respectively). The output of each module
    is a multiplex of the 3 MACs and a select of the
    lower 32 bits and upper 4 bits of the 36 bit
    registered multiplier output. The outputs of all
    the modules are again multiplexed to a single 16
    bit bus. An 8 bit address bus is used to select
    the output of a specific MAC and select between
    the upper and lower 16 bit banks of the 32 bit
    module outputs.
  • The Xilinx TMRTool software is used to process
    the design into a fully TMR mitigated design.
    Both the TMR and pre-TMR designs undergo active
    scrubbing (partial reconfiguration for SEU
    correction) for the configuration of the DUT.
  • All MACs are constantly accumulating. Each MAC is
    selected sequentially for a periodic sampling of
    its sequence.
  • For each module sample taken, the actual and
    expected values are recorded, along with the
    sequential count of state errors and the running
    count of event errors, in a strip chart file on
    the host PC.
  • When MACs are observed to be permanently in the
    wrong state, the design is reset to regain a
    fully functioning test.
  • The final error count is calculated as the number
    of events in which a MAC lost its state or
    produced an incorrect result.

14
BRAM Test Design
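[Figure: BRAM test block diagram. DUT side: 128k byte RAM with 16-bit DATA and ADDRESS buses; a "-1" label marks the data derived from the address. SERVICE side: Configuration Manager Core, Functional Monitor, Error Counters.]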
15
BRAM Test Functional Description
  • The Block Memory test pre-TMR design consists
    of a single large 128k byte single-port memory
    space created from 64 memory blocks of 16k bits
    each.
  • The Xilinx TMRTool software is used to process
    the design into a fully TMR mitigated design.
    Both the TMR and pre-TMR designs undergo active
    scrubbing (partial reconfiguration for SEU
    correction) for the configuration of the DUT.
  • Separate WRITE and READ routines are executed to
    all memory address locations. The data is derived
    from a decrement of the address value. The entire
    memory space is refreshed with a write operation
    and then the data is retrieved with a read
    operation.
  • During the read operation the retrieved data is
    compared against the expected value.
  • For each data sample taken, the actual and
    expected values are recorded with the running
    count of event errors into a strip chart file on
    the host PC.
  • Each error event is measured for its total word
    error size in bits (1, 32, 64, 512, 1024, etc.).
  • The final error count is calculated as the number
    of separate events of word errors.
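The write/read routine above can be modelled in a few lines. This is a sketch under assumptions: the memory is treated as 65,536 words of 16 bits (which matches 128k bytes), the data pattern is taken to be the address minus one modulo 2^16, and the injected fault and function names are hypothetical.

```python
WORDS = 64 * 1024          # 128k bytes viewed as 16-bit words (assumed)
WORD_MASK = 0xFFFF


def pattern(addr: int) -> int:
    """Data derived from a decrement of the address value."""
    return (addr - 1) & WORD_MASK


def write_pass(mem) -> None:
    """Refresh the entire memory space with the address-derived data."""
    for addr in range(len(mem)):
        mem[addr] = pattern(addr)


def read_pass(mem):
    """Compare every word with its expected value.

    Returns a list of (address, actual, expected, bits_in_error).
    """
    errors = []
    for addr, actual in enumerate(mem):
        expected = pattern(addr)
        if actual != expected:
            bits = bin(actual ^ expected).count("1")
            errors.append((addr, actual, expected, bits))
    return errors


mem = [0] * WORDS
write_pass(mem)
mem[0x1234] ^= 0x0005      # hypothetical upset flipping two bits
errors = read_pass(mem)
print(errors)
```

Grouping the mismatches returned by `read_pass` into separate events (for example by adjacency in address or time) is left out, since the presentation does not state how events were delimited.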

16
Configuration Error Detection and Correction
Algorithm
  • Configure target FPGA with configuration data
    stored in the configuration PROM(s).
  • Read back configuration programming data from
    target FPGA and calculate 16 bit CRC. Store CRC
    value as Config-CRC.
  • Perform a Write/Read check on the internal Frame
    Address Register of target FPGA.
  • Scrub (background refresh) configuration data of
    target FPGA.
  • Read back configuration programming data from
    target FPGA and calculate 16 bit CRC. Store CRC
    value as Rdbk-CRC and perform bit-for-bit error
    detection of configuration data.
  • Compare Rdbk-CRC with Config-CRC.
  • If the CRC values mismatch a second time, assert
    SEFI_ERROR and RECONFIGURE.
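The steps above can be sketched as a loop body in software. Everything device-specific here is an assumption: `FakeFpga` is a hypothetical in-memory stand-in for the configuration port, the CRC variant (CRC-CCITT via `binascii.crc_hqx`) is chosen only because it is in the standard library, and treating a failed Frame Address Register check as a SEFI is a guess at the flowchart's intent.

```python
import binascii


def crc16(data: bytes) -> int:
    # CRC-CCITT from the standard library; the CRC variant is assumed.
    return binascii.crc_hqx(data, 0xFFFF)


class FakeFpga:
    """Hypothetical in-memory stand-in for a configuration port."""

    def __init__(self, scrub_works: bool = True):
        self.config = bytearray()
        self.scrub_works = scrub_works

    def configure(self, bitstream: bytes) -> None:
        self.config = bytearray(bitstream)

    def readback(self) -> bytes:
        return bytes(self.config)

    def scrub(self, bitstream: bytes) -> None:
        if self.scrub_works:             # a stuck device ignores scrubbing
            self.config[:] = bitstream

    def far_check(self) -> bool:
        # A real check would write, then read, the Frame Address Register.
        return True


def scrub_cycle(fpga, golden, config_crc, prev_crc_error):
    """Run one pass; return (status, crc_error, bit_errors)."""
    if not fpga.far_check():
        return "SEFI_ERROR", prev_crc_error, 0     # assumed handling
    fpga.scrub(golden)
    rdbk = fpga.readback()
    # Bit-for-bit comparison against the stored configuration data.
    bit_errors = sum(bin(a ^ b).count("1") for a, b in zip(rdbk, golden))
    crc_error = crc16(rdbk) != config_crc
    if crc_error and prev_crc_error:
        return "SEFI_ERROR", True, bit_errors      # second mismatch
    return "OK", crc_error, bit_errors


golden = bytes(range(256)) * 4

good = FakeFpga()
good.configure(golden)
config_crc = crc16(good.readback())                # Config-CRC
good.config[10] ^= 0x01                            # injected upset
good_result = scrub_cycle(good, golden, config_crc, False)

stuck = FakeFpga(scrub_works=False)
stuck.configure(golden)
stuck.config[10] ^= 0x01
first = scrub_cycle(stuck, golden, config_crc, False)
second = scrub_cycle(stuck, golden, config_crc, first[1])

print(good_result, first, second)
```

The healthy device is repaired by the scrub and reports no error, while the device that cannot be scrubbed reports a CRC mismatch twice in a row and is flagged for reconfiguration.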

[Flowchart: START; DONE (0 / 1); PREV SCRUB; SEFI; CONFIG / SCRUB; CRC ERROR (0 / 1); YES / NO decision branches.]
17
Previous SEE Test Methodology for Mitigation
  • The premise of the combined mitigation method of
    XTMR and scrubbing is that the complete removal
    of Single Event Functional Errors in the user
    logic leaves any user design with an overall
    error rate determined by the remaining Single
    Event Functional Interrupts. Therefore, a
    successful mitigation test is expected to
    produce zero errors other than SEFIs.
  • Since the effectiveness of TMR depends on errors
    not accumulating in the configuration, previous
    experiments attempted to maintain an upset rate
    that did not exceed the scrub rate. This
    methodology had two significant flaws:
      • Testing at such low fluxes is impractical:
        it requires unreasonably long run times and
        therefore cannot reach sufficient fluence
        for statistically significant data.
      • A zero error rate result is not useful for
        making any calculations or extrapolations.
  • These issues raise concerns over the validity of
    any results.

18
Improved SEE Test Methodology for Mitigation
  • The functional error rate of a mitigated system
    is expected to depend physically on the upset
    rate. The expected relationship predicts the
    increasing probability of upsetting bit
    combinations that cause a mitigated (TMR)
    system to fail as the bit upset rate increases:
  • $MER = \frac{1}{2} \cdot \frac{N_B C_A}{T_S} \cdot R_U^2$
      • $MER$ = Mitigation Error Rate
      • $N_B$ = Number of Relevant Bits
      • $C_A$ = Average Cluster Size
      • $T_S$ = Scrub Time
      • $R_U$ = Upset Rate of Relevant Bits
  • Therefore, testing at extremely high fluxes over
    several orders of magnitude variation can be
    performed to reveal this functional relationship
    between mitigation error rate and bit upset rate.
  • This function can then be extrapolated to make
    predictions at the much lower upset rates of
    earth orbits.
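The extrapolation idea can be sketched as a power-law fit in log-log space. This is an illustration only: the parameter values passed to `mer` are placeholders, the data points are synthetic (generated from the MER formula itself, not measured), and the presentation does not say which fitting procedure the authors used.

```python
import math


def mer(n_b: float, c_a: float, t_s: float, r_u: float) -> float:
    """Mitigation error rate as written on the slide."""
    return 0.5 * (n_b * c_a / t_s) * r_u ** 2


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**k in log-log space; returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    k = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
         / sum((u - mean_x) ** 2 for u in lx))
    a = math.exp(mean_y - k * mean_x)
    return a, k


# Synthetic "high flux" points spanning several orders of magnitude.
# N_B = 1000, C_A = 2, T_S = 0.5 are placeholder values, not test data.
rates = [1e1, 1e2, 1e3, 1e4]
errors = [mer(1000, 2, 0.5, r) for r in rates]

a, k = fit_power_law(rates, errors)
low_rate = 9e-2                     # worst-case orbital rate from the text
extrapolated = a * low_rate ** k

print(a, k, extrapolated)
```

On this synthetic data the fitted exponent is 2, as the formula requires; measured data could produce a different exponent for each design.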

19
Plot Definitions
  • Predicted SEFI cross-section
      • Static and dynamic SEE characterization of
        the Virtex-II FPGA revealed several Single
        Event Functional Interrupt modes: POR
        (2.5E-06), SMAP (1.72E-06), IOB (4.2E-06).
      • These combined cross-sections represent the
        minimum functional error cross-section for a
        single Virtex-II (XQR2V6000) device on orbit.
  • Worst Case Orbital Upset Rate
      • The CREME96 calculation of the worst case
        orbital upset rate for an XQR2V6000 is 7,740
        bit-errors/day (9E-02 bit-errors/sec) in a
        GEO orbit at 36,000 km during the worst day
        of an Anomalously Large Solar Flare,
        accounting for both heavy ions and protons.
        In a 40 MeV Kr beam the same upset rate is
        achieved with a flux of 1.25E-01 p/cm2/s.
        This means that the equivalent upset rates
        for all other orbits and solar conditions
        lie to the LEFT of this line.
  • Single Event Functional Interrupts
      • This is the average cross-section of the
        observed SEFI(s) while collecting the data
        represented in the plot. This cross-section
        is not flux dependent. Variations from the
        predicted value are due to the statistical
        significance of the total accumulated
        fluence during each test.
  • Functional Errors
      • Data plot of the observed events in which the
        Device Under Test returned an incorrect
        result. The cross-section is the number of
        error events divided by the total fluence at
        the specified flux. TMR denotes that the DUT
        design was fully mitigated with XTMR and
        scrubbing. The Unmitigated results were
        obtained with a functionally identical
        design without XTMR; scrubbing was also used
        for the unmitigated test.
      • Extrapolation
      • A derived function describing mitigation
        failure as a function of upset rate.
        Extension of the function predicts
        functional error cross-sections at worst
        case orbital upset rates to be less than
        SEFI cross-sections.
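The arithmetic behind these definitions can be checked with a short script. The variable names are illustrative, and the SEFI cross-sections are assumed to be in cm2 per device, since the slide gives no unit at this point.

```python
SECONDS_PER_DAY = 86_400

# Worst-case orbital upset rate quoted from the CREME96 calculation.
upsets_per_day = 7_740
upsets_per_sec = upsets_per_day / SECONDS_PER_DAY   # about 9E-02

# SEFI modes quoted in the text (unit assumed: cm2 per device).
sefi_modes = {"POR": 2.5e-6, "SMAP": 1.72e-6, "IOB": 4.2e-6}
sefi_total = sum(sefi_modes.values())               # combined SEFI floor


def cross_section(error_events: int, fluence: float) -> float:
    """Cross-section = number of error events / total fluence."""
    return error_events / fluence


print(f"{upsets_per_sec:.3e} upsets/sec")
print(f"{sefi_total:.3e} combined SEFI cross-section")
print(f"{cross_section(5, 1e6):.1e} for 5 events at 1e6 particles/cm2")
```

The fluence and event count in the last line are made-up inputs that only show how the definition is applied.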

20
PLOT 1
[Plot residue; recoverable labels follow.]
  • Top x-axis: Configuration Bit Errors per Scrub
    Cycle (3.5E-02 to 3.5E03).
  • Marker: 36,000 km GEO orbit, worst-day solar
    flare, 8,000 bit-errors/day. Region: all other
    orbits.
  • Beam: 40 MeV Kr, LET 22.3 MeV·cm2/mg.
  • SEFIs drive the error rate for all designs and
    all orbits.
  • Mitigation errors on orbit are always less than
    SEFI errors by orders of magnitude.
21
PLOT 2
[Plot residue; recoverable labels follow.]
  • Top x-axis: Configuration Bit Errors per Scrub
    Cycle (3.5E-02 to 3.5E03).
  • Marker: 36,000 km GEO orbit, worst-day solar
    flare, 8,000 bit-errors/day. Region: all other
    orbits.
  • Beam: 40 MeV Kr, LET 22.3 MeV·cm2/mg.
  • SEFIs drive the error rate for all designs and
    all orbits.
  • Mitigation errors on orbit are always less than
    SEFI errors by orders of magnitude.
22
PLOT 3
[Plot residue; recoverable labels follow.]
  • Top x-axis: Configuration Bit Errors per Scrub
    Cycle (3.5E-02 to 3.5E03).
  • Marker: 36,000 km GEO orbit, worst-day solar
    flare, 8,000 bit-errors/day. Region: all other
    orbits.
  • Beam: 40 MeV Kr, LET 22.3 MeV·cm2/mg.
  • SEFIs drive the error rate for all designs and
    all orbits.
  • Mitigation errors on orbit are always less than
    SEFI errors by orders of magnitude.
23
SEE Test Analysis
  • The experiments were conducted over a flux range
    of 7E00 to 4E04 (p/cm2/s).
  • The Flux rates have been normalized in the
    secondary (top) x-axis of the plots to average
    bit upsets per scrub cycle (RS).
  • Each experiment demonstrated a drop in failure
    cross-section over several orders of magnitude,
    crossing the SEFI cross-section at upset rates
    that are still several orders of magnitude above
    worst case orbital upset rates.
  • Extrapolating the data for each experiment
    gives a mitigation error cross-section at least
    one order of magnitude below the SEFI
    cross-section at worst case orbital upset rates.
  • By superposition of the data fit functions, the
    total effective mitigated error rate
    cross-section is:
      • $\sigma_{TOTAL} = \sigma_{BRAM} + \sigma_{CLB} + \sigma_{MULT} + \sigma_{SEFI}$
      • $\sigma_{TOTAL} = 5.0\text{E-}8\,(1.4\,R_S)^{2} + 5.0\text{E-}6\,(0.7\,R_S)^{0.5} + 1.75\text{E-}6\,(1.4\,R_S)^{0.35} + 8.42\text{E-}6$ (cm2)
  • Therefore, at the worst case orbital upset rate
    of 9E-2 upsets/sec ($R_S$ = 4.5E-2 upsets/scrub)
    the effective total cross-section for functional
    error is calculated as:
      • $\sigma_{TOTAL}$ = 1.05E-5 (cm2/device), Orbital
        Worst Case

24
Conclusions
  • The efficiency and accuracy of validating
    mitigation techniques are greatly improved by
    demonstrating the upset rate dependency of the
    mitigation method, testing at flux rates that
    overwhelm the mitigation.
  • The static SEFI cross-section is the dominating
    factor for calculating orbital error rates for
    any Virtex-II design when mitigated with full
    XTMR and scrubbing.
  • Future Work
      • The authors recognize an anomaly in the data
        fit functions in that they were not all
        expressed as a square function. It is
        anticipated that this is due to the
        complexity of the bit clusters of the
        experimental designs. Additional research is
        called for to derive the separate
        coefficients of the MER equation for each
        design and explain their functional
        associations.