FPGA Self-Repair using an Organic Embedded System Architecture

1
FPGA Self-Repair using an Organic Embedded
System Architecture
Kening Zhang, Jaafar Alghazo and Ronald F. DeMara
University of Central Florida
06 December 2007
2
Organic Computing (OC): biologically-inspired
computing with self-x properties
Technical Objective:
support long-lifetime missions with multiple
failure occurrences
Research Focus:
OC approach addresses system controllability
with increasing complexity
Communication networks among autonomous systems
Composed of a large collection of autonomous
systems
Sensors and actuators owned by each autonomous system
System Property
  • Self-organization
  • Self-configuration
  • Self-optimization
  • Self-healing
  • Self-protection
  • Self-explaining

Self-x Characteristics
  • Context-awareness
  • Self-synchronization
  • Example Relevance:
  • How to achieve a sustainable presence under NASA's
    Moon, Mars and Beyond objective?

Reconfigurable Hardware with Self-Healing based
on SRAM FPGA platform
Sponsors: NASA (FPGA platform and Genetic
Algorithm research), DARPA (OC approach and
SOAR Longevity Platform)
3
Goal: Autonomous FPGA Refurbishment
increase availability without carrying
pre-configured spares
Comparison criteria
  • Overhead from Unutilized Spares: weight, size,
    power
  • Granularity of Fault Coverage: resolution where
    the fault is handled
  • Fault-Resolution Latency: availability via
    downtime required to handle the fault
  • Quality of Repair: likelihood and completeness
  • Autonomous Operation: fix without outside
    intervention
Redundancy
  • Overhead: increases with amount of spare capacity
  • Granularity: restricted at design-time
  • Latency: based on time required to select spare
    resource
  • Quality of Repair: determined by adequacy of
    spares available
  • Autonomous Operation: yes
Refurbishment
  • Overhead: weakly related to recovery capacity
  • Granularity: variable at recovery-time
  • Latency: based on time required to find suitable
    recovery
  • Quality of Repair: affected by multiple
    characteristics (+ or -)
  • Autonomous Operation: yes
4
Fault-Handling Techniques for SRAM-based FPGAs
[Chart: fault-handling methods compared by device failure characteristics and handling stage]
Device Failure Characteristics
  • Duration: Transient (SEU) or Permanent (SEL, Oxide
    Breakdown, Electromigration, LPD)
  • Target: Device Configuration or Processing Datapath
Methods: BIST, STARS, Scrubbing, TMR, CED, Evolutionary
Approach (Vigander), OC
Stages compared: Detection, Isolation, Diagnosis, Recovery
Chart entries (column assignment not recoverable):
Supplementary Testbench; Duplex Output Comparison;
Duplex/Triplex Output Comparison; Cartesian Intersection;
Bitwise Comparison; Majority Vote; Autonomous Element (AE);
Fast Run-time Location; Worst-case Clock Period Dilation;
Autonomous Supervisor (AS); Population-based GA using
Extrinsic Fitness Evaluation; Evolutionary Algorithm using
Intrinsic Fitness Evaluation; Replicate in Spare Resource;
Select Spare Resource; Reload Bitstream / Invert Bit Value;
Ignore Discrepancy; (not addressed); unnecessary
5
Autonomous System-on-a-Chip (ASoC) Architecture
  • Dual-layer ASoC proposed by Lipsa et al. [Lipsa
    05]
  • Functional Layer
  • Functional Elements (FEs), e.g. CPU, RAM, Network
    interface
  • Autonomic Layer
  • Autonomic Elements (AEs)
  • Monitor
  • Actuator
  • Communication interface
  • Autonomic Supervisor (AS)
  • UCF approach for fault coverage of both the
    Functional Layer and the Autonomic Layer
  • achieved by assessing consensus among elements
  • first to realize failure detection
  • consensus provides an organic method for fitness
    evaluation of competing alternatives during
    evolution, providing a self-regulating approach
    to fault resolution

6
EHW Environments
  • Evolvable Hardware (EHW) Environments enable
    experimental methods to research soft
    computing intelligent search techniques
  • EHW operates by repetitive reprogramming of
    real-world physical devices using an iterative
    refinement process

Two modes of Evolvable Hardware:
  • Extrinsic Evolution: Genetic Algorithm with
    simulation in the loop (software model);
    device design-time refinement. Done? Build it.
  • Intrinsic Evolution: Genetic Algorithm with
    hardware in the loop; device run-time refinement;
    a new approach to Autonomous Repair of failed
    devices
Application: Deep Space Satellite with >100 FPGAs
onboard in a hostile environment (radiation, thermal
stress). How to achieve reliability to avoid
mission failure?
7
Genetic Algorithms (GAs)
  • Mechanism coarsely modeled after neo-Darwinism
    (natural selection + genetics)

GA cycle: start -> population of candidate solutions
-> evaluate fitness of individuals (fitness function)
-> goal reached? If not: selection of parents ->
crossover and mutation of parents -> offspring ->
replacement into the population
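The GA cycle on this slide can be illustrated with a minimal sketch. It is not the presenters' implementation: the bit-string encoding, truncation selection, single-point crossover, the parameter values, and the "one-max" fitness function (count of 1 bits) are all assumptions chosen only to keep the example short.

```python
import random

def evolve(fitness, length=8, pop_size=20, generations=200,
           mutation_rate=0.05, goal=None, seed=0):
    """Minimal generational GA over bit strings (illustrative only)."""
    rng = random.Random(seed)
    # start: random population of candidate solutions
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)        # evaluate fitness of individuals
        if goal is not None and fitness(pop[0]) >= goal:
            break                                  # goal reached
        parents = pop[: pop_size // 2]             # selection of parents (truncation)
        offspring = []
        while len(offspring) < pop_size - len(parents):
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, length)         # single-point crossover
            child = p1[:cut] + p2[cut:]
            # mutation: flip each bit with a small probability
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            offspring.append(child)
        pop = parents + offspring                  # replacement
    return max(pop, key=fitness)

# Hypothetical fitness function: number of 1 bits in the chromosome.
best = evolve(fitness=sum, goal=8)
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next in this sketch.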
8
Genetic Mechanisms
  • Guided trial-and-error search techniques using
    principles of Darwinian evolution
  • iterative selection, survival of the fittest
  • genetic operators: mutation, crossover
  • implementor must define fitness function
  • GAs frequently use strings of 1s and 0s to
    represent candidate solutions
  • Genotype: chromosomes of GA operation; if 100101
    is better than 010001, it will have more chance to
    breed and influence the future population
  • Genotype changes during evolution must adhere to
    the Xilinx-defined bitstream format
  • To prevent undesirable conditions that may damage
    the FPGA, such as a mutation which ties two logic
    outputs together, a logical genotype is used
    for evolution and mapped to a physical phenotype
  • Logic: functional logic index number for LUT
  • Row/Column: physical location of LUT in FPGA
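As a rough illustration of a logical genotype mapped to a physical phenotype, the sketch below stores one gene per LUT and refuses a mapping that would place two logical LUTs on the same site. The class name, field names, and the single legality rule are assumptions made for the example; the real constraints come from the Xilinx bitstream format and are not modeled here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LutGene:
    logic: int      # functional logic index number of the LUT
    row: int        # physical row of the LUT in the FPGA
    col: int        # physical column of the LUT in the FPGA
    contents: int   # truth table of a 4-input LUT packed into 16 bits

def to_phenotype(genotype):
    """Map logical genes to physical sites, rejecting site conflicts."""
    placement = {}
    for gene in genotype:
        site = (gene.row, gene.col)
        if site in placement:
            # Hypothetical legality rule standing in for hardware-safety checks.
            raise ValueError(f"site {site} already holds LUT {placement[site].logic}")
        placement[site] = gene
    return placement

genotype = [LutGene(0, 0, 0, 0x8888), LutGene(1, 0, 1, 0x6666)]
phenotype = to_phenotype(genotype)
```

Genetic operators can then work on the list of genes, and only genotypes that pass the mapping step would be written to the device.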

9
Loosely Coupled Solution on Xilinx Virtex II Pro /
Virtex 4
The Virtex II Pro or Virtex 4 is mounted on a development
board which can then be interfaced with a
workstation running Xilinx EDK and ISE.
The entire system operates on a 32-bit basis.
10
Organic Embedded System (OES) Architecture
One Dimensional Column-oriented OES based on
Xilinx Virtex II Pro FPGA platform
  • FEs and AEs reside on two distinct layers with
    interconnection structure between them
  • AEs and FEs can either be realized in hardware,
    software, or co-design
  • AE layer supervises functionality of FE elements
    while requiring no application-specific
    algorithms on the AE layer
  • The Observer/Controller architecture includes an AS
    element which has no counterpart to evaluate whether
    the AS itself is fault-free; the proposed approach
    addresses this by minimizing AS complexity
  • Utilizes Xilinx partial reconfiguration technology
    to manipulate relocatable bitstreams

11
OES AE Component Design
  • AEs decentralize Observer/Controller
    functionality
  • Concurrent Error Detection (CED) unit collects 2
    FE Outputs for discrepancy identification
  • A Checksum unit for AE fault detection, whose
    output is checked against Stored Checksum values
  • An Evaluator of outputs from 2 FEs against the
    checksum, and an Actuator which initiates the
    recovery phase
  • An important architectural property is that all
    AE components are identical in structure even
    though they monitor different types of FEs.
  • Homogeneous characteristics deliver a
    uniform-behavior property leveraged for the
    consensus-based evaluation fault-handling
    methodology
  • OC Concept: although AE components add
    complexity to the design, they ease the
    fault-handling integration difficulties
    inherent with current commercial IP cores

12
Consensus-Based Evaluation (CBE)
  • Uses a Relative Fitness Measure
  • Pairwise discrepancy checking yields relative
    fitness measure
  • Broad temporal consensus in the population used
    to determine fitness metric
  • Transition between Fitness States occurs in the
    population
  • Provides graceful degradation in presence of
    changing environments, applications and inputs,
    since this is a moving measure
  • Test Inputs = Normal Inputs for Data Throughput
  • CBE does not utilize additional functional or
    resource test vectors
  • Potential for higher availability as regeneration
    is integrated with normal operation
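A relative fitness measure from pairwise discrepancy checking might look like the sketch below. It is a simplification under stated assumptions: every pair of individuals is compared on every input, whereas the slides describe consensus accumulating over time during normal operation, and the example circuits (an incrementer and a copy with its output LSB stuck at zero) are hypothetical.

```python
import itertools

def consensus_fitness(individuals, inputs):
    """Relative fitness: fraction of pairwise comparisons in which an
    individual agreed with its partner, using only normal inputs."""
    n = len(individuals)
    agree = [0] * n
    total = [0] * n
    for x in inputs:
        outs = [f(x) for f in individuals]
        for i, j in itertools.combinations(range(n), 2):
            total[i] += 1
            total[j] += 1
            if outs[i] == outs[j]:       # no discrepancy for this pair
                agree[i] += 1
                agree[j] += 1
    return [a / t for a, t in zip(agree, total)]

good = lambda x: x + 1                   # hypothetical fault-free FE
faulty = lambda x: (x + 1) & ~1          # same FE with output LSB stuck at zero
scores = consensus_fitness([good, good, good, faulty], range(8))
```

No individual is assumed to be a golden reference here. The faulty individual simply ends up with the lowest score because it disagrees with the broad consensus more often than any other.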

13
Genetic Operators: Mutation
Typical Approach: bit inversion of LUT
functionality. Selected Approach: input
interconnection of LUTs mutated
Rearranges input interconnections to search for unused
LUT resources which occlude the faulty resource
Mutation: Genotype chromosomes
  • original functionality is
  • F = F1(F3F4) w/ input F2 unassigned by
    synthesis tool
  • mutation operator will change input F4 to the unused
    input F2, giving F = F1(F3F2)
  • shading shows the changed input and LUT contents
  • offers some opportunity to bypass an input stuck-at
    fault or a LUT-content stuck-at fault
  • functionality of the LUTs remains undistorted while
    the search space is explored
  • Mutation: Phenotype chromosomes
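The input-interconnection mutation can be sketched as follows, assuming a 4-input LUT whose truth table is indexed by its pin values. Moving a signal from one pin to an unused pin requires permuting the table so the logical function stays the same. The logic function, signal names, and pin numbering below are hypothetical; the slide's own operators are not recoverable from the transcript.

```python
import itertools

def swap_bits(idx, a, b):
    """Return idx with bit positions a and b exchanged."""
    if ((idx >> a) & 1) != ((idx >> b) & 1):
        idx ^= (1 << a) | (1 << b)
    return idx

def remap_input(table, wiring, src, dst):
    """Move the signal on pin src to the unused pin dst, permuting the
    LUT contents so the function of the signals is unchanged."""
    if wiring[dst] is not None:
        raise ValueError("destination pin is already in use")
    new_wiring = list(wiring)
    new_wiring[src], new_wiring[dst] = None, wiring[src]
    new_table = [table[swap_bits(i, src, dst)] for i in range(len(table))]
    return new_table, new_wiring

def evaluate(table, wiring, signals):
    """Look up the LUT output; unused pins are read as 0."""
    idx = 0
    for pin, name in enumerate(wiring):
        if name is not None and signals[name]:
            idx |= 1 << pin
    return table[idx]

# Hypothetical function a AND (c OR d) on pins 0, 2, 3; pin 1 is unused.
wiring = ["a", None, "c", "d"]
table = [(i & 1) & (((i >> 2) & 1) | ((i >> 3) & 1)) for i in range(16)]
new_table, new_wiring = remap_input(table, wiring, src=3, dst=1)

same = all(
    evaluate(table, wiring, dict(zip("acd", v)))
    == evaluate(new_table, new_wiring, dict(zip("acd", v)))
    for v in itertools.product([0, 1], repeat=3)
)
```

After the move, the new table no longer depends on pin 3, so under these assumptions a stuck-at fault on that pin would not affect the output.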

14
Genetic Operators: Cell Swapping
Cell-Swap operation on Phenotype chromosomes
Cell-Swap operation on Genotype chromosomes
Interchanges two distinct LUT blocks while
maintaining correct logic order and
functionality in the genotype
  • exchanges all LUT input interconnections, LUT
    content and the physical 2-tuple (Col, Row) as well
    as the logic sequence

15
Genetic Operators: PMX Operator
Partial Match Crossover (PMX) maintains crossover
information as well as order information
  • two genotype configuration streams are aligned
    at a LUT boundary
  • crossover site is selected at random along the LUT
    boundary
  • this crossover point defines a left/right
    partition used to effect crossover through
    LUT-by-LUT exchange
  • suppose the crossover point is at position 4 of the LUT
    vector
  • first step is to map configuration B to
    configuration A by exchanging the following
    aligned LUTs: (4,7), (5,2), (6,1), (7,5).
  • Applying PMX results in two new configurations
    derived from A and B
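One way to realize a PMX-style LUT-by-LUT exchange is the swap-based sketch below. The two parent vectors are hypothetical: they were chosen only so that the aligned pairs right of the crossover point match the slide's (4,7), (5,2), (6,1), (7,5). The transcript does not preserve the full configurations, and this is not claimed to be the presenters' exact operator.

```python
def pmx_single_point(a, b, cut):
    """Swap-based PMX: each child keeps its receiver's LUTs but, from
    index cut onward, is rearranged to match the donor's order."""
    def build(receiver, donor):
        child = list(receiver)
        pos = {v: i for i, v in enumerate(child)}
        for i in range(cut, len(child)):
            want, have = donor[i], child[i]
            if want == have:
                continue
            j = pos[want]                      # where the wanted LUT sits now
            child[i], child[j] = want, have    # LUT-by-LUT exchange
            pos[want], pos[have] = i, j
        return child
    return build(a, b), build(b, a)

# Hypothetical parents; index 3 corresponds to "position 4" counted from 1.
A = [1, 2, 3, 4, 5, 6, 7]
B = [3, 4, 6, 7, 2, 1, 5]
child_a, child_b = pmx_single_point(A, B, cut=3)
```

Each child remains a permutation of the same LUT indices, so no LUT is duplicated or lost, and the right partition of each child matches the other parent.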

16
Illustrative Example: Gate-Level Design of OES
  • Experiment circuit: 1-bit
    Full-adder
  • Fault-free model: Duplex
  • Fault-impact model: TMR
  • Fault-detect model: CBE
  • Fault recovery strategy: GA operation
  • Experimental setup
  • Hardware prototype implemented in Xilinx
    Virtex-II Pro FPGA
  • VHDL implementation
  • Using the GNAT library along with the MRRA
    framework and JTAG reconfiguration interface

17
MCNC-91 Benchmark Case Studies
System Availability under Multiple Faults
  • Fc: number of correct behaviors of FE observed
    during evolutionary recovery phase
  • Fe: number of errant or discrepant behaviors
  • 1: exactly one output required to detect the fault
    during the original CED configuration
  • 2: number of reconfigurations required, i.e. one
    from CED to TMR, and one back from TMR to CED
  • Fc1, Fe1: correct and faulty output number of the
    FE during the AE repair period
  • Fc2, Fe2: correct and faulty output number during
    the FE repair period
  • n: number of reconfigurations of the FE
  • β: ratio of reconfiguration time to computation
    time
18
Experimental Results
  • Fault-free arrangement: CED FEs with cold
    standby FE
  • Inject a stuck-at-zero or stuck-at-one fault at
    one of the FE's LUT input pins
  • CED -> TMR to identify faulty FE or AE
  • CBE used to resolve faulty AE

Redundancy for both FE (RFE) and AE (RAE): ratio
of unused LUT inputs to total number of LUT
inputs
  • Fc: number of correct behaviors of FE observed
    during evolutionary recovery phase
  • Fe: number of errant or discrepant behaviors
  • n: number of reconfigurations of the FE
  • β: ratio of reconfiguration time to computation
    time
21
Conclusion
  • A self-adapting and self-healing OES
    architecture was developed for autonomic operation
    without human intervention.
  • The OES architecture is capable of handling many
    single-fault scenarios and several multiple-fault
    scenarios for small digital logic designs.
  • Experimental results support our design objectives:
    availability during the repair phase averaged
    75.05%, 82.21%, and 65.21% for the z4ml, cm85a, and
    cm138a circuits respectively under stated conditions.
  • The reconfiguration time ratio (β) is a key
    factor limiting availability during AE repair
  • Future work: evaluate extensions of the OES
    architecture addressing scalability in terms
    of pipelined stages

22
Backup Slides
  • On following pages

23
Isolation of a single faulty individual with
1-out-of-64 impact
Plot: instantaneous DV (point values) for a sample
individual in the population, and population oracles
(solid lines)
Sliding Window
  • Outliers are identified after EW iterations have
    elapsed
  • Expected D.V. = (1/64) x 600 = 9.375 for an
    individual impacted by the fault
  • An isolated faulty individual's DV differs from the
    average DV by 3σ after 1 or more observation
    intervals of length EW
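The arithmetic and the 3σ outlier rule above can be sketched as follows. The population of 20 and the observed counts are hypothetical values for illustration, and using the population standard deviation over all individuals (including the suspect itself) is an assumption, not a detail taken from the slides.

```python
from statistics import mean, pstdev

def expected_dv(impact, window):
    """Expected discrepancy count when a fault affects a fraction
    `impact` of inputs over `window` evaluations."""
    return impact * window

def outliers(dv, k=3.0):
    """Indices whose DV differs from the population average by more
    than k standard deviations."""
    mu, sigma = mean(dv), pstdev(dv)
    return [i for i, v in enumerate(dv) if sigma > 0 and abs(v - mu) > k * sigma]

ev = expected_dv(1 / 64, 600)      # 9.375

# Hypothetical window: 19 clean individuals and one with 9 discrepancies.
observed = [0] * 19 + [9]
flagged = outliers(observed)
```

With only one affected individual, the flagged list contains just its index, matching the idea of isolating a single faulty individual after one evaluation window.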

24
Future Work: Development Board to Self-Contained
FPGA
  • Qualitative Analysis of CRR model
  • Number of iterations and completeness of
    regeneration repair
  • Percentage of time the device remains online
    despite physical resource fault (availability)
  • Hardware Resource Management
  • Optimization of hardware profile for Xilinx
    Virtex II Pro
  • Field Testing on SRAM-based FPGA in a CubeSat
    mission

25
OES Integrated FE and AE Failure Detection
Procedure
  • System Initialization
  • FE Initialization step
  • Compute Checksum step
  • FE Fault Detection/Recovery
  • AE-CED fault detection
  • FE fault-recovery
  • AE fault detection Phase
  • A fault may exist in the CED, Actuator, or
    Evaluator,
  • A fault may exist in Check Sum component, or
  • A fault may exist in the Stored CheckSum-LUT.

Runtime inputs to the FE are applied to both active
instances under a CED strategy. After allowing for
FE input propagation time through the AE, the
expected output is supplied to the AE-CED for
fault detection. The outputs of the FEs are then
compared in the AE-CED module, and any discrepancy
between the two values indicates that a fault
has occurred in either one of the FEs or the AE-CED
itself. Further detection is required to
distinguish which of the two is faulty. If the AE
component is identified as innocent, then the
fault must have occurred in an FE: this output is
discarded and control branches to a fault
identification phase, which wakes up the cold
standby FE and constructs a temporary TMR system
that can identify the faulty FE under the newly
supplied external input. Furthermore, as
described in Section 3.3, the actuator
initiates a repair cycle, which may require
automatic evolutionary repair of the identified
faulty FE; that FE is set as standby-under-repair,
and the AE-CED returns to receiving the outputs of
the remaining two active FEs. The
decision-making procedure causes at least one
throughput-delay penalty.
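The escalation from duplex comparison to a temporary TMR vote can be sketched as below. This is a simplified model under assumptions: FEs are plain Python callables, the vote reuses the same input rather than the next external input, and faults in the AE-CED, Evaluator, or checksum logic are not modeled. The function name and return convention are hypothetical.

```python
from collections import Counter

def ced_step(active, standby, x):
    """Compare two active FEs; on a discrepancy, wake the cold standby
    and take a majority vote to identify the faulty FE.
    Returns (output, faulty) where faulty is None, an index into
    [active[0], active[1], standby], or "unresolved"."""
    a, b = active[0](x), active[1](x)
    if a == b:
        return a, None                       # CED sees no discrepancy
    votes = [a, b, standby(x)]               # temporary TMR arrangement
    winner, count = Counter(votes).most_common(1)[0]
    if count < 2:
        return None, "unresolved"            # no majority to rely on
    faulty = next(i for i, v in enumerate(votes) if v != winner)
    return winner, faulty

good = lambda x: x ^ 1                       # hypothetical fault-free FE
bad = lambda x: 0                            # hypothetical faulty FE

ok_case = ced_step([good, good], good, 5)
fault_case = ced_step([good, bad], good, 0)
```

In a fuller model, the FE identified as faulty would be marked standby-under-repair and the comparison would resume on the two remaining FEs.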
26
Previous Work
  • Detection Characteristics of FPGA Fault-Handling
    Schemes

Strategy 1) Evolve redundancy into
design before the anticipated failure
or
27
Previous Work
  • Fault Recovery Characteristics of Selected
    Approaches

Strategy 2) Evolve recovery from a specific
failure after (and if) it occurs
28
CRR Arrangement in SRAM FPGA
  • Configurations in Population
  • C = CL ∪ CR
  • CL: subset of left-half configurations
  • CR: subset of right-half configurations
  • |CL| = |CR| = |C|/2
  • Discrepancy Operator
  • Baseline Discrepancy Operator is a dyadic
    operator with binary output
  • Z(Ci) is the FPGA data throughput output of
    configuration Ci
  • Each half-configuration evaluates the discrepancy
    operator using an embedded checker (XNOR gate)
    within each individual
  • Any fault in the checker lowers that individual's
    fitness, so that individual is no longer preferred
    and eventually undergoes repair

WTA
(Equivalence)
29
Terminology and Characteristics
  • Pristine Pool CP: any Ci ∈ C is a member of CP
    at generation G if and only if [condition not
    preserved in transcript]
  • Suspect Pool CS: any Ci ∈ C is a member of CS at
    generation G if and only if at least one of
    [conditions not preserved in transcript]
  • Under Repair Pool CU: any Ci ∈ C is a member of CU
    at generation G if and only if [condition not
    preserved in transcript]
  • Refurbished Pool CR: after a Genetic Operator is
    applied, the newly generated individual is a member
    of CR at generation G if and only if [condition not
    preserved in transcript]
  • ED is the Discrepancy Count of Ci and EC is the
    Correctness Count of Ci
  • Length of Fitness Evaluation Window: EW = ED + EC
  • Fitness Metric: f(Ci) = EC / EW
30
Sketch of CRR Approach
Premise: Recovery Complexity << Design Complexity
  • Initialization
  • Population P of functionally-identical yet
    physically-distinct configurations
  • Partition P into sub-populations that use
    supersets of physically-distinct resources, e.g.
    size |P|/2 to designate physical FPGA
    left-half or right-half resource utilization
  • Fitness Assessment
  • Discrepancy Operator is some function of
    bitwise agreement between each half's output
  • Four Fitness States defined for Configurations as
    CP, CS, CU, CR with transitions, respectively
  • Pristine -> Suspect -> Under Repair ->
    Refurbished
  • Fitness Evaluation Window W determines
    comparison interval
  • Regeneration
  • Genetic Operators used to recover from fault
    based on Reintroduction Rate
  • Operators are applied only once; offspring are then
    returned to service without concern for
    increasing fitness

fitness assessment via pairwise discrepancy
(temporal voting vs. spatial voting)
31
State Transitions during Lifetime of the i-th
Half-Configuration
Configuration Health States
32
Procedural Flow under Competitive Runtime
Reconfiguration
  • Integrates all fault handling stages using EC
    strategy
  • Detects faults by the occurrence of discrepancy
  • Isolates faults by accumulation of discrepancies
  • Failure-specific refurbishment using Genetic
    Operators
  • Intra-Module-Crossover, Inter-Module-Crossover,
    Intra-Module-Mutation
  • Realize online device refurbishment
  • Refurbished online without additional function or
    resource test vectors
  • Repair during the normal data throughput process

33
Fitness Evaluation Window
  • Fitness Evaluation Window W
  • denotes the number of iterations used to evaluate
    fitness before the state of an individual is
    determined
  • Determination of W for 3x3 multiplier
  • 6 input pins articulating 2^6 = 64 possible inputs
  • W should be selected so that all possible inputs
    appear
  • More formally,
  • Let rand(X) return some xi ∈ X at random
  • Seek W such that W draws of rand(X) cover X with
    high probability
  • xK: number of distinct orderings in which all K
    inputs show in D trials
  • if D is constant, PK for K > 1 can be calculated
    successively
  • probability PK of K inputs showing after D trials
    is the ratio xK / K^D
34
W Determination
When K = 64
35
Integer Multiplier Case Study
  • 3-bit x 3-bit unsigned multiplier automated design
  • Building blocks
  • Half-Adder: 18 templates created
  • Full-Adder: 24 templates created
  • Parallel-And: 1 template created
  • Randomly select templates for instantiation in
    modules

GA operators: External-Module-Crossover,
Internal-Module-Crossover, Internal-Module-Mutation
GA parameters: Population size 20 individuals,
Crossover rate 5, Mutation rate up to
80 per bit
Experiments Demonstrate
Experimental Evaluation: Xilinx Virtex II Pro on
Avnet PCI board
  • Objective fitness function replaced by the
    Consensus-based Evaluation Approach and Relative
    Fitness
  • Elimination of additional test vectors
  • Temporal Assessment process

36
Template Fault Coverage
  • Half-Adder Template A
  • Gate3 is an AND gate
  • Will lose correctness if a Stuck-At-Zero fault
    occurs in the second input line of Gate3
  • Half-Adder Template B
  • Gate3 is a NOT gate and only uses the first input
    line
  • Will work correctly even if the second input line is
    stuck at Zero or One

37
Regeneration Performance

Parameters
  • Difference (vs. Hamming Distance)
  • Evaluation Window: Ew = 600
  • Suspect Threshold: 1 - 6/600 = 99%
  • Repair Threshold: 1 - 4/600 = 99.3%
  • Re-introduction rate: 0.1
Repairs evolved in-situ, in real-time, without
additional test vectors, while allowing device to
remain partially online.
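As a quick check of the parameter arithmetic, the sketch below recomputes the two thresholds and the expected fitness of an individual whose fault affects 1 of 64 inputs, using f = EC / EW as defined earlier in the deck. The variable names are hypothetical, and the way the thresholds drive state transitions is not preserved in the transcript, so only the numeric comparison is shown.

```python
EW = 600                                  # evaluation window length

suspect_threshold = 1 - 6 / EW            # 0.99
repair_threshold = 1 - 4 / EW             # about 0.9933

def fitness(correct_count, window=EW):
    """Fitness metric f = EC / EW."""
    return correct_count / window

# Expected discrepancies for a fault hit by 1 of 64 inputs: 600 / 64 = 9.375.
expected_discrepancies = EW / 64
f_faulty = fitness(EW - expected_discrepancies)   # 63/64 = 0.984375

below_both = f_faulty < suspect_threshold < repair_threshold
```

Under these numbers, even a minimum-impact fault is expected to pull an individual's fitness below both thresholds within one evaluation window.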
38
Isolation of a single faulty individual with
1-out-of-64 impact
  • Outliers are identified after W iterations have
    elapsed
  • E.V. = (1/64) x 600 = 9.375 for the minimum-impact
    faulty individual
  • An isolated individual's f differs from the average
    DV by 3σ after 1 or more observation intervals of
    length W