Application Level Fault Tolerance and Detection - PowerPoint PPT Presentation


Title: Application Level Fault Tolerance and Detection


1
Application Level Fault Tolerance and Detection
  • Principal Investigators
  • C. Mani Krishna, Israel Koren
  • Graduate Students
  • Diganta, Eric, Janhavi, Osman, Vijay

Architecture and Real-Time Systems (ARTS) Lab.,
Department of Electrical and Computer Engineering,
University of Massachusetts, Amherst, MA 01003
2
What is ALFTD?
  • Application Level Fault Tolerance and Detection
  • ALFTD complements existing system or algorithm
    level fault tolerance by leveraging information
    available only at the application level
  • Using such application level semantic information
    significantly reduces the overall cost of
    providing fault tolerance
  • ALFTD may be used alone or to supplement other
    fault detection schemes
  • ALFTD is scalable
  • Output error can be traded off against the time
    overhead invested in fault tolerance

3
ALFTD Overview
  • Application Level Fault Tolerance and Detection
    allows the system to survive both data faults and
    system (instruction/hardware) faults
  • System faults cause a process to eventually cease
    functioning
  • Data faults cause a process to continue running
    with incorrect results
  • ALFTD has been implemented in OTIS to determine
    its feasibility as a fault detection and
    tolerance method for REE applications
  • OTIS has two sets of related output data, the
    temperature and emissivity
  • Experiments have focused mostly on the
    temperature output

4
OTIS Structure
(Figure: OTIS structure diagram; only the step label "4. Slave Calculations" survives in this transcript)
5
OTIS Work Distribution
  • OTIS dynamic workload distribution allows it to
    compensate for system faults
  • Work originally partitioned for a failed
    processor is instead taken by the remaining
    processes
  • OTIS does not compensate for data faults
  • As long as the work is completed, there is no
    measure of correctness
  • OTIS does not consider deadline repercussions
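
The redistribution described above can be sketched as a small simulation. This is a minimal illustration, not the OTIS implementation: the function `distribute_rows`, the worker names, and the fault model (a worker stops responding after a fixed number of rows) are all assumptions made for the example.

```python
from collections import deque


def distribute_rows(num_rows, workers, failed):
    """Simulate a master handing out rows one at a time.

    `failed` maps a worker name to the number of rows it completes
    before it stops responding (hypothetical fault model).
    Returns a dict mapping each completed row to the worker that did it.
    """
    pending = deque(range(num_rows))
    completed = {w: 0 for w in workers}
    done_by = {}
    alive = list(workers)
    while pending and alive:
        for w in list(alive):
            if not pending:
                break
            row = pending.popleft()
            if w in failed and completed[w] >= failed[w]:
                # The worker died holding this row: drop the worker and
                # put the row back so a surviving worker picks it up.
                alive.remove(w)
                pending.append(row)
                continue
            done_by[row] = w
            completed[w] += 1
    return done_by


assignment = distribute_rows(10, ["a", "b", "c"], {"b": 2})
```

A scheme like this only ensures that every row is computed by some surviving worker. As the slide notes, it does not check whether the returned values are correct, and it does not account for deadlines.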

6
OTIS Fault Cases
7
ALFTD OTIS Structure
8
Secondaries in OTIS
  • The secondary required for ALFTD is implemented
    to be functionally similar to the primary
  • Secondary scaling occurs through resolution
    reduction
  • OTIS natural data input exhibits spatial
    locality
  • Points not directly calculated can be
    approximately estimated using interpolation
    between calculated points
  • Secondary processes have been tested at 20-50%
    of the primary calculation overhead
  • While 50% affords better quality, 20% has less
    overhead
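
Resolution reduction with interpolation can be illustrated on a single row. The sketch below is an assumption-laden example rather than the OTIS secondary: `compute` stands in for the per-pixel primary calculation, `step` controls how many columns are actually evaluated, and linear interpolation is one possible choice of estimator.

```python
def secondary_row(compute, width, step):
    """Evaluate `compute(x)` at every `step`-th column only and fill the
    remaining columns by linear interpolation.

    Returns the estimated row and the number of columns that were
    actually computed.
    """
    xs = list(range(0, width, step))
    if xs[-1] != width - 1:
        xs.append(width - 1)  # always anchor the last column
    samples = {x: compute(x) for x in xs}
    row = [samples[xs[0]]] * width
    for left, right in zip(xs, xs[1:]):
        for x in range(left, right + 1):
            t = (x - left) / (right - left)
            row[x] = samples[left] * (1 - t) + samples[right] * t
    return row, len(xs)


# Hypothetical smooth input: a linear ramp is reproduced exactly.
estimate, evaluated = secondary_row(lambda x: 2.0 * x + 1.0, 10, 4)
```

A larger `step` evaluates fewer points and therefore costs less, at the price of a coarser estimate. The approach relies on the spatial locality noted above: the smoother the data, the smaller the interpolation error.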

9
Example of Secondary Resolution
(Figure panels: 100%, 50%, 33%, and 25% secondary resolution)
  • (ALFTD Compensation for 10 rows in a sample
    dataset)

10
ALFTD Benefit
11
ALFTD Benefit (contd)
12
Fault Detection
  • When to run the secondary, and when to use the
    secondary output, is determined by output filters
  • Output filters are created to check for
    application-specific trends in data
  • Aberrations from normal data characteristics can
    be considered to be the product of potentially
    faulty processes
  • OTIS relies on natural temperature
    characteristics to detect potentially faulty data
  • Spatial locality: temperature changes gradually
    over small areas
  • Absolute bounds: temperature should not exceed
    certain values
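
The two temperature filters can be sketched as a single pass over a row of output. The function name `suspect_pixels` and all threshold values below are hypothetical; the presentation does not state the bounds or the allowed neighbour-to-neighbour change that OTIS uses.

```python
def suspect_pixels(row, lower, upper, max_step):
    """Return the indices in `row` that violate either output filter.

    lower, upper: absolute bounds on a plausible value.
    max_step: largest change allowed between adjacent values
              (spatial locality).
    """
    suspects = set()
    for i, value in enumerate(row):
        if not (lower <= value <= upper):
            suspects.add(i)
    for i in range(1, len(row)):
        if abs(row[i] - row[i - 1]) > max_step:
            # A jump does not say which side is wrong, so flag both.
            suspects.update((i - 1, i))
    return sorted(suspects)


# Hypothetical row with one corrupted value at index 3.
flagged = suspect_pixels([290.0, 291.0, 290.5, 900.0, 291.5, 292.0],
                         lower=200.0, upper=350.0, max_step=5.0)
```

Flagging both sides of a jump is a conservative choice: it may mark good values as suspect, which costs extra secondary work but does not by itself discard correct data.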

13
Data Sets
  • Three data sets were chosen for their interesting
    characteristics

14
Data Frequency (Values)
15
Data Frequency (Spatial Locality)
16
Validation Through Secondaries
  • When the primary deadline is hit, a row is
    re-delegated to the secondaries if (and only if)
    one of the following holds
  • The primary has returned results for that row
    that are suspected to be faulty, in which case
    the secondary results are used to decide whether
    the primary results are indeed faulty
  • The row was never successfully calculated, in
    which case the secondary results are used
    immediately in place of the missing primary
    results
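
The re-delegation rule can be written as a small planning function. This is a sketch under assumptions: the name `rows_for_secondary`, the dictionary of primary results, and the "suspect"/"missing" labels are illustrative and do not come from the presentation.

```python
def rows_for_secondary(primary_results, suspected_rows, num_rows):
    """Decide which rows go to the secondaries at the primary deadline.

    primary_results: dict mapping row index to the primary's output
                     (rows that never completed are absent).
    suspected_rows:  set of row indices flagged by the output filters.
    Returns a dict mapping each re-delegated row to the reason.
    """
    plan = {}
    for row in range(num_rows):
        if row not in primary_results:
            plan[row] = "missing"   # secondary output replaces the row
        elif row in suspected_rows:
            plan[row] = "suspect"   # secondary output validates the row
    return plan


plan = rows_for_secondary({0: [1.0], 1: [9.9], 3: [1.2]}, {1}, 4)
```

Rows that completed and passed the filters never reach the secondaries, which is what keeps the added overhead below that of full redundancy.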

17
Validation Through Secondaries (contd)
  • After the secondary has been run to verify a
    primary's results, the better data is chosen
    according to a logic grid

(Table: logic grid; only the axis label "Secondary" survives in this transcript)
18
Fault Tolerance Results: Spots
  • Fault Tolerance with injected faults in Spots

19
Fault Tolerance Results: Spots (contd)
(Figure panels: faulty output; fault-free output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead)
20
Fault Tolerance Results: Spots (contd)
(Figure: difference plots, faulty output versus faultless output. Panels: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: no error to max error)
21
Fault Tolerance Results: Blob
  • Fault Tolerance with injected faults in Blob

22
Fault Tolerance Results: Blob (contd)
(Figure panels: faulty output; fault-free output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead)
23
Fault Tolerance Results: Blob (contd)
(Figure: difference plots, faulty output versus faultless output. Panels: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: no error to max error)
24
Fault Tolerance Results: Stripe
  • Fault Tolerance with injected faults in Stripe

25
Fault Tolerance Results: Stripe (contd)
(Figure panels: faulty output; fault-free output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead)
26
Fault Tolerance Results: Stripe (contd)
(Figure: difference plots, faulty output versus faultless output. Panels: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: no error to max error)
27
Emissivity Data
  • Emissivity is loosely proportional to temperature
    data
  • Emissivity exhibits spatial locality
  • Emissivity has natural bounds of expected data

< 0.5: Faulty
Natural metal: 0.5
Rock: 0.8 - 0.95
Vegetation, water: 1.0
> 1.0: Faulty
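
The natural bounds listed above translate directly into a range check. The sketch below assumes the emissivity output is available as a list of rows; the function name `faulty_emissivity_cells` is hypothetical, and only the 0.5 and 1.0 limits come from the slide.

```python
def faulty_emissivity_cells(grid, low=0.5, high=1.0):
    """Return (row, column) positions whose emissivity lies outside the
    expected natural range [low, high]."""
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        # Written as a negated range test so that NaN is also flagged.
        if not (low <= value <= high)
    ]


# Hypothetical 2 x 3 emissivity output with two out-of-range cells.
cells = faulty_emissivity_cells([[0.5, 0.92, 1.0],
                                 [0.3, 0.80, 1.2]])
```

A check like this catches only values that are physically implausible. A corrupted value that still falls between 0.5 and 1.0 passes unnoticed.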
28
Emissivity Data (contd)
  • Emissivity does not exhibit the same data
    closeness as temperature output
  • This makes it very difficult to distinguish
    faulty from non-faulty data
  • Luckily, faults present in temperature output are
    easily detected, and reflect faults in emissivity
    output.
  • Emissivity does not have per-pixel independence
    of calculation
  • Dependence on the correctness of neighboring
    pixels makes resolution reduction a viable, but
    not the best, method for secondary reduction

29
Data Frequency (Emissivity Values)
30
Conclusion
  • ALFTD has already been shown to be a worthwhile
    alternative to full redundancy
  • Improvements on the scheme will increase fault
    coverage and decrease secondary calculation
    overhead in both the emissivity and temperature
    outputs
  • OTIS, as a general matrix-based, master/slave
    program, is a springboard to other, similar
    programs (e.g., NGST)
  • ALFTD as a fault-detection scheme will continue
    to be effective in programs whose output
    exhibits natural characteristics, such as
    spatial locality and bounded values

31
Thank You!
32
Relative Error Calculation
  • Error in OTIS output is calculated relative to a
    faultless template
  • The average relative error is the average of all
    relative errors of the entire output
  • Faulty value: f(x,y)
  • Faultless value: F(x,y)
  • Error: (formula not preserved in this transcript)
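
Because the error formula itself did not survive, the sketch below assumes the conventional definition of relative error, |f(x,y) - F(x,y)| / |F(x,y)|, averaged over every point of the output. Both that definition and the function name `average_relative_error` are assumptions, not taken from the presentation.

```python
def average_relative_error(faulty, faultless):
    """Average of |f - F| / |F| over all points.

    faulty:    rows of values f(x, y) from the run under test.
    faultless: rows of values F(x, y) from the faultless template.
    Assumes both grids have the same shape and that no template value
    is zero.
    """
    total = 0.0
    count = 0
    for f_row, template_row in zip(faulty, faultless):
        for f, template in zip(f_row, template_row):
            total += abs(f - template) / abs(template)
            count += 1
    return total / count


# Hypothetical 2 x 2 output with one point off by 10%.
error = average_relative_error([[100.0, 220.0], [300.0, 400.0]],
                               [[100.0, 200.0], [300.0, 400.0]])
```

With one of four points off by 10%, the average relative error under this assumed definition is 0.025.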