Design of Reliable Systems and Networks (ECE 442 / CS 435), Lecture 5: Error Detection Techniques

1
Design of Reliable Systems and Networks
ECE 442 / CS 435
Lecture 5: Error Detection Techniques
Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer_at_crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2
Outline
  • Watchdog timers
  • Heartbeats
  • Consistency and capability checking
  • Data audits
  • Runtime generated assertions

3
Watchdog Timer (hardware/software)
  • An inexpensive method of error detection
  • The process being watched must reset the timer before
    the timer expires; otherwise the watched process
    is assumed to be faulty
  • Watchdog timers only detect errors which manifest
    themselves as a control-flow error such that the
    system does not continue to reset the timer
  • Only processes with relatively deterministic
    runtimes can be checked, since the error
    detection is based entirely on the time between
    timer resets

4
Watchdog Timer (hardware/software)
  • A watchdog timer provides only an indication of
    possible process failure
  • A partially failed process may still be able to
    reset the timer
  • The fault model is limited: only control-flow
    errors that prevent a reset are detected
  • Timer resets must be frequent, but not placed inside loops
  • Processes must have a relatively deterministic runtime
  • Coverage is limited, as neither the data nor the
    results are checked
  • When used to reset the system, a watchdog timer
    can improve availability (the mean time to
    recovery is shortened) but not reliability
    (failures are just as likely to occur)
  • When the availability of a system is more
    important than the loss of data, using a
    watchdog timer to reset the system on
    detection of an error is an appropriate choice.
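The reset-before-expiry rule described on the two watchdog slides can be sketched in software as follows. This is a minimal illustration, not a specific product's API: the type `watchdog_t`, the functions `watchdog_init`, `watchdog_kick`, and `watchdog_expired`, and the caller-supplied tick counter are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software watchdog driven by a caller-supplied tick count. */
typedef struct {
    uint64_t timeout_ticks; /* maximum allowed gap between two resets */
    uint64_t last_reset;    /* tick at which the timer was last reset */
} watchdog_t;

/* Arm the watchdog at time `now`. */
void watchdog_init(watchdog_t *wd, uint64_t timeout_ticks, uint64_t now) {
    wd->timeout_ticks = timeout_ticks;
    wd->last_reset = now;
}

/* Called by the watched process to show that it is still making progress. */
void watchdog_kick(watchdog_t *wd, uint64_t now) {
    wd->last_reset = now;
}

/* Called by the monitor. True means the reset was missed and the process is
 * assumed faulty; false only means that no error was detected. */
bool watchdog_expired(const watchdog_t *wd, uint64_t now) {
    return (now - wd->last_reset) > wd->timeout_ticks;
}
```

A monitor would call `watchdog_expired` periodically and trigger recovery (for example a system reset) when it returns true. As the slides note, a false result says nothing about the correctness of data or results.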

5
Example Applications of Watchdog Timers
  • Pluribus Reliable Multiprocessor.
  • Hardware and software timers (ranging from 5 µs to 2
    minutes in duration) monitor almost every
    subsystem.
  • Each subsystem performs a self-check every cycle; if
    this takes too long, an error is signaled.
  • Examples:
  • A buffer is not returned to the free message buffer list
    (2-minute timeout)
  • A lock failure can leave a resource locked
    when no subsystem is using it; because the lock
    failed, the resource will not become free on its own
  • A 1/15-second timer interrupts the processor and
    unlocks the resource. The result is a temporary (1/15-second)
    degradation in system performance; the system is otherwise
    unaffected by the error.

6
Example Applications of Watchdog Timers
  • VAX-11/780
  • A multiprocessor system for commercial
    applications
  • The console processor runs a watchdog process
    that is reset when an interrupt line is strobed
  • If the interrupt line is not strobed by a
    processor within 200 microseconds, this indicates
    a failure and the console processor attempts to
    determine the reason for the failure
  • Bell System Telephone Switch
  • External watchdog timers monitor correct program
    operation by triggering recovery when timers are
    not periodically reset.
  • Allows an early (before the error propagates)
    detection of problems caused by software errors
    and consequently easier recovery
  • Mars rover (real-time multi-threaded OS): priority
    inversion problem

7
Heartbeats
  • A common approach to detecting process and node
    failures in a distributed (networked) computing
    environment.
  • Periodically, a monitoring entity sends a message
    (a heartbeat) to a monitored node or process and
    waits for a reply.
  • If the monitored node does not respond within a
    predefined timeout interval, the node is declared
    as failed and appropriate recovery action is
    initiated.
  • Issues
  • The timeout period is pre-negotiated by the two
    parties or hard-coded by the programmer
  • The predefined timeout value cannot adapt to
    changes in network traffic or to load variability
    on individual nodes
  • The monitored node is assumed to be healthy if it
    is able to respond to a heartbeat message
  • Process/thread responding to the heartbeat
    message may operate correctly, while other
    processes/threads may be in a deadlock situation
    or operating incorrectly

8
Adaptive Smart Heartbeat
  • Adaptive heartbeat - the timeout value used by
    the monitor process is not fixed but is
    periodically negotiated between the two parties
    to adapt to changes in the network traffic or
    node load.
  • Smart heartbeat - the entity being monitored
    executes a set of predefined checks to verify the
    robustness of the entire process and only then
    responds to the monitoring process
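The two ideas above can be sketched together. The slides say only that the timeout is renegotiated periodically; the smoothing rule below (a weighted average of observed reply times, with the timeout set to twice that estimate) is an assumption chosen for illustration, as are all names (`hb_monitor_t`, `hb_on_reply`, `hb_smart_reply`, `check_pass`, `check_fail`).

```c
#include <stdbool.h>
#include <stddef.h>

/* Monitor side: adaptive timeout derived from observed reply times. */
typedef struct {
    unsigned est_ms;     /* smoothed estimate of the reply time */
    unsigned timeout_ms; /* timeout currently in force */
} hb_monitor_t;

void hb_init(hb_monitor_t *m, unsigned initial_ms) {
    m->est_ms = initial_ms;
    m->timeout_ms = 2 * initial_ms;
}

/* Assumed adaptation rule: est = (3*est + sample)/4, timeout = 2*est. */
void hb_on_reply(hb_monitor_t *m, unsigned reply_ms) {
    m->est_ms = (3 * m->est_ms + reply_ms) / 4;
    m->timeout_ms = 2 * m->est_ms;
}

/* The node is declared failed when no reply arrived within the timeout. */
bool hb_failed(const hb_monitor_t *m, unsigned waited_ms) {
    return waited_ms > m->timeout_ms;
}

/* Monitored side: a smart heartbeat replies only if every check passes. */
typedef bool (*hb_check_fn)(void);

bool hb_smart_reply(const hb_check_fn *checks, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!checks[i]()) {
            return false; /* withhold the reply so the monitor times out */
        }
    }
    return true;
}

/* Placeholder checks used only to exercise the sketch. */
bool check_pass(void) { return true; }
bool check_fail(void) { return false; }
```

In a real deployment the checks would inspect other threads, queues, or data structures of the process, which addresses the issue that a responding thread says little about the health of the rest of the process.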

9
Adaptive Heartbeat with Load Generator
10
Consistency and Capability Checking
  • Capability Checking
  • can be implemented as a hardware mechanism or can
    be part of the operating system (usually the
    case)
  • access to objects (memory segments, I/O devices)
    is limited to users (processors or processes)
    with the proper authorization
  • Examples
  • virtual address management (MMU usually has a
    capability check)
  • permission vs. activity: if these do not match,
    an error trap is raised
  • password checking
  • Consistency Checks
  • range check - confirms that a computed value is
    in a valid range, e.g., a computed probability
    must be in the range 0 to 1
  • address checking - verifies that the address to
    be accessed exists
  • opcode checking - checks whether the instruction
    to be executed has one of the defined (documented)
    opcodes
  • arithmetic overflow and underflow
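Two of the consistency checks listed above, the range check and the arithmetic overflow check, are small enough to sketch directly. The function names are illustrative assumptions; the overflow test is the usual pre-check against `INT_MAX` and `INT_MIN`, performed before the addition so that the check itself never overflows.

```c
#include <limits.h>
#include <stdbool.h>

/* Range check: a computed probability must lie in [0, 1].
 * A NaN fails both comparisons and is therefore rejected as well. */
bool probability_in_range(double p) {
    return p >= 0.0 && p <= 1.0;
}

/* Overflow/underflow check for a + b on signed ints, done before adding. */
bool add_would_overflow(int a, int b) {
    if (b > 0) {
        return a > INT_MAX - b; /* result would exceed INT_MAX */
    }
    if (b < 0) {
        return a < INT_MIN - b; /* result would fall below INT_MIN */
    }
    return false;
}
```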

11
Control-flow Monitoring Using Signatures
Hardware Approaches
  • Employ a Watchdog (a simple co-processor) to
    monitor behavior of a Main Processor
  • Suitable for single embedded applications with
    little or no caching
  • Limited applicability in off-the-shelf systems,
    as they require additional specialized resources,
    e.g., a watchdog and a pre-compiler.

12
Control-flow Monitoring Using Signatures
Hardware Approaches (cont.)
  • Problems with both approaches
  • Assumes straight line code with no interleaving
    of processes or threads
  • Need for a tightly coupled Watchdog Processor
  • Require a customized compiler
  • Difficult to apply in a networked context

  • Embedded Signature Monitoring: pre-computed
    signature embedded in the application program
  • Requires recompilation of existing programs
  • Performance degradation of the application
  • Autonomous Signature Monitoring: the Watchdog
    Processor stores the pre-computed signature in
    memory and mimics the control flow of the application
  • Watchdog Processor is rather complex
  • High memory overhead
13
Control-flow Monitoring Using Signatures
Software Approaches
  • Software techniques partition the application
    into blocks, either in the assembly language or
    in the high-level language
  • Appropriate instrumentation inserted at the
    beginning and/or end of the blocks
  • The checking code is inserted in the instruction
    stream eliminating the need for a hardware
    watchdog processor
  • Two classes of approaches
  • non-preemptive signature checking
  • preemptive signature checking

14
Fine-grained Signature
[Figure: GST/RST tables at the element level (Element 1, signature expression AB(CDB)E) and at the Armor level (PR 1, PR 2)]
  • Capture control flow error within an element
    (a simple software component)
  • At initialization time, GST formed from
    regular expression of valid control signatures
    (paths)
  • At runtime, RST formed through emit()
    function calls inserted into the element

[Figure: Element 1 instrumented with Emit(A), Emit(B), Emit(C), Emit(D), and Emit(E) calls, followed by Elements 2, 3, ...]
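A runtime check of emitted signatures against the set of valid paths can be sketched as a small state machine. The slide's expression reads AB(CDB)E; the sketch assumes that the group (CDB) may repeat, i.e. A B (C D B)* E, which is a guess about a repetition operator that may have been lost in the transcript. The type `sig_checker_t`, the helper `sig_path_valid`, and this particular `emit` signature are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed valid paths: A B (C D B)* E.
 * States: 0 start, 1 after A, 2 after B, 3 after C, 4 after D, 5 accepted. */
typedef struct {
    int state;
    bool error;
} sig_checker_t;

void sig_init(sig_checker_t *c) {
    c->state = 0;
    c->error = false;
}

/* Hypothetical emit(): records one signature and checks the transition. */
void emit(sig_checker_t *c, char sig) {
    int next = -1;
    switch (c->state) {
    case 0: if (sig == 'A') next = 1; break;
    case 1: if (sig == 'B') next = 2; break;
    case 2: if (sig == 'C') next = 3; else if (sig == 'E') next = 5; break;
    case 3: if (sig == 'D') next = 4; break;
    case 4: if (sig == 'B') next = 2; break;
    default: break; /* nothing may follow E */
    }
    if (next < 0) {
        c->error = true; /* the emitted signature is not on any valid path */
    } else {
        c->state = next;
    }
}

/* Convenience helper: feeds a whole path and reports whether it is valid. */
bool sig_path_valid(const char *path) {
    sig_checker_t c;
    sig_init(&c);
    for (size_t i = 0; path[i] != '\0'; i++) {
        emit(&c, path[i]);
    }
    return !c.error && c.state == 5;
}
```

In this simplified form the table of valid transitions plays the role of the GST and the checker state plays the role of the RST.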
15
Problems with Control Flow Signatures
[Figure: three execution traces]
  • Correct execution
  • Incorrect execution without preemptive checking:
    erroneous change in control flow; AB not reached
  • Incorrect execution with preemptive checking: the
    preemptive check detects the erroneous control
    flow; computation stops
16
Preemptive Control Signatures (PECOS)
  • PECOS determines the runtime target address and
    compares that against the valid addresses before
    the jump to the target address is made
  • As a result (unlike other techniques), executing
    instructions from an invalid target location is
    unlikely
  • High-level control structure of the Assertion Block:
  • Determine the runtime target address Xout.
  • Extract the list of valid target addresses
    X1, X2.
  • Calculate ID = Xout * 1/P,
    where P = !((Xout - X1) * (Xout - X2))
  • The calculation of ID raises a DIV-BY-ZERO
    exception in case of error (P = 0 when Xout
    matches neither X1 nor X2)
  • Can handle single (jumps), two (branches), or
    multiple (calls and returns) target addresses
  • Assertion Block does not introduce any new
    control flow instruction
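The assertion-block arithmetic can be illustrated in C for the two-target case. This is a sketch under assumptions, not the PECOS implementation: the function names are invented, each difference is normalized to 0 or 1 before multiplying so that the product cannot wrap around to zero, and the sketch returns an error code where the real assertion block would perform the division and trap.

```c
#include <stdint.h>

/* P = !((Xout - X1) * (Xout - X2)), with each factor reduced to 0 or 1.
 * P is 1 when Xout equals one of the valid targets and 0 otherwise. */
unsigned pecos_p(uintptr_t xout, uintptr_t x1, uintptr_t x2) {
    unsigned d1 = (xout != x1); /* 0 when Xout matches X1 */
    unsigned d2 = (xout != x2); /* 0 when Xout matches X2 */
    return !(d1 * d2);
}

/* Returns 0 and stores ID = Xout * 1/P for a valid target. Returns -1 at the
 * point where the real assertion block would divide by zero; the division is
 * skipped here because integer division by zero is undefined in C. */
int pecos_check(uintptr_t xout, uintptr_t x1, uintptr_t x2, uintptr_t *id) {
    unsigned p = pecos_p(xout, x1, x2);
    if (p == 0) {
        return -1;
    }
    *id = xout * (1 / p);
    return 0;
}
```

The explicit `if` exists only to keep the sketch well-defined in portable C. The technique itself avoids adding control-flow instructions by letting the hardware exception do the signalling.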

17
How to apply PECOS to an Application?
18
What Can We Cover with Preemptive Software Control Signature?
[Figure: CPU, memory, address bus, and data bus]
  • Errors in cache: not covered
  • Errors on the bus: covered
  • Errors in the memory: covered
  • Solution: insert a programmable error detection
    core into the CPU
19
Process Crash/Hang Detection
  • Instruction Count Heartbeat (ICH) leverages
    processor performance registers to detect
    process/OS crashes/hangs
  • Infinite Loop Hang Detection (ILHD) by tracking
    loop entry and exit points
  • Sequential Code Hang Detection (SCHD) detects
    illegal repetition of sequence of instructions

N. Nakka, G. P. Saggese, Z. T. Kalbarczyk, R. K. Iyer, "An architectural framework for Process Crash/Hang detection," Proceedings of EDCC-5, 2005
20
Process Crash/Hang Detection
  • Crash detection
  • Instruction Count Heartbeat (ICH)
  • Uses processor performance counters to detect
    process and OS crashes
  • Can be extended to support failure detection in
    distributed systems

21
Process Crash/Hang Detection
  • Process hang in legal loops
  • Infinite loop Hang Detector (ILHD)
  • Profile-based analysis of application to estimate
    loop execution time
  • The module is reconfigured with the loop's timeout as the
    loop is entered, using CHECKs at loop entry and loop exit
  • Process hang in illegal loops
  • Sequential code hang detector (SCHD)
  • Parameterize module with length of loop
  • Any loop shorter than given length indicates
    control error

22
Infinite Loop Hang Detector
  • Property: loop execution behavior
  • Assembly code analysis to find loop entry and
    exit
  • Profiling the application with representative inputs
  • Derive statistical bounds for execution time
  • At runtime, monitor loop execution time
  • Use the expected loop execution time to detect a
    possible infinite loop
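The profile-then-monitor steps can be sketched as follows. The slides do not say which statistical bound is used; the sketch assumes "mean plus k standard deviations" of the profiled execution times, with `k` chosen by the user. The names `loop_profile_t`, `profile_loop`, and `loop_suspected_hung` are illustrative. The comparison is done on squared values so that no square root is needed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Summary of the loop execution times seen during profiling. */
typedef struct {
    double mean;
    double variance;
} loop_profile_t;

/* Offline step: derive mean and variance from profiled execution times. */
loop_profile_t profile_loop(const double *samples, size_t n) {
    loop_profile_t p = { 0.0, 0.0 };
    if (n == 0) {
        return p;
    }
    for (size_t i = 0; i < n; i++) {
        p.mean += samples[i];
    }
    p.mean /= (double)n;
    for (size_t i = 0; i < n; i++) {
        double d = samples[i] - p.mean;
        p.variance += d * d;
    }
    p.variance /= (double)n;
    return p;
}

/* Runtime step: flag the loop when the time since loop entry exceeds
 * mean + k * stddev (assumed bound; compared in squared form). */
bool loop_suspected_hung(const loop_profile_t *p, double elapsed, double k) {
    double excess = elapsed - p->mean;
    return excess > 0.0 && excess * excess > k * k * p->variance;
}
```

A larger `k` lowers the false-alarm rate at the cost of later detection. The bound is only as good as the representative inputs used for profiling.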

23
Detection of Instruction Dependency Violations
  • RAW dependency imposes sequential order on
    execution of instructions
  • Errors in the processor control logic or in the
    instruction binary can lead to a violation
  • The Sequence Checker Module (SCM) detects such
    violations
  • It monitors issue and execute events in the pipeline
  • Representative instruction sequences extracted
    using static analysis
  • CHECKs used to dynamically reconfigure the module
    with sequences

24
SCM Detection Mechanism
  • SCM state for a sequence: (i, e)
  • i: instruction on which an event is awaited
  • e: event (issue/execute) awaited
  • Property: at any instant of time, at most one
    instruction of a sequence can be issued or
    executed
  • Instructions in the issue and execute queues are matched
    against the instructions of the sequence
  • At most one instruction from the queue should
    match the correct state of the SCM
  • An error is detected when there is
  • an execute or issue mismatch
  • a match other than the expected state
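The (i, e) state can be modelled in software to show how a mismatch is detected. This is a simplified sketch, not the hardware module: it assumes that instructions are identified by distinct integers, that each instruction of the sequence must be issued and then executed before the next one is issued, and that the checker wraps around when the sequence completes. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { EV_ISSUE, EV_EXECUTE } scm_event_t;

typedef struct {
    const unsigned *seq; /* identifiers of the instructions in the sequence */
    size_t len;
    size_t i;            /* index of the instruction an event is awaited on */
    scm_event_t e;       /* event awaited on that instruction */
} scm_t;

void scm_init(scm_t *s, const unsigned *seq, size_t len) {
    s->seq = seq;
    s->len = len;
    s->i = 0;
    s->e = EV_ISSUE;
}

/* Feed one pipeline event. Returns false when a violation is detected. */
bool scm_observe(scm_t *s, unsigned instr, scm_event_t ev) {
    bool monitored = false;
    for (size_t k = 0; k < s->len; k++) {
        if (s->seq[k] == instr) {
            monitored = true;
        }
    }
    if (!monitored) {
        return true; /* instruction is not part of the sequence */
    }
    if (instr != s->seq[s->i] || ev != s->e) {
        return false; /* match other than the expected state */
    }
    if (ev == EV_ISSUE) {
        s->e = EV_EXECUTE;
    } else {
        s->e = EV_ISSUE;
        s->i = (s->i + 1) % s->len; /* await the next instruction */
    }
    return true;
}
```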

25
Detected and undetected faults
26
SCM Reconfiguration Architecture
  • Achieved with help of CHECK instructions
  • Extracted sequences loaded as part of program
    image
  • At runtime SCM loads sequences into set of
    registers
  • Each sequence has additional registers
  • length, state

27
Runtime Generated Assertions
  • Goals
  • Generate runtime assertions by monitoring the
    values of selected variables in a program
  • Use the monitored data to abstract out, via
    statistical pattern recognition techniques, the
    key relationships between the variables,
    separately and jointly, and to establish their
    probabilistic behavior
  • Approach
  • Identify clusters of values traversed by
    different variables
  • Use this information to automatically generate
    runtime assertions capable of capturing abnormal
    behavior of an application due to hardware or
    software errors
  • Cross-check with other entities in the system
    their views on the state of selected variables
  • if a variable is globally accessible, then
    multiple entities (e.g., multiple execution
    threads) may have their own opinions about the
    correct value of the variable
  • can improve coverage and reduce false alarms

28
Automated Derivation of Application-based
Invariants

29
Introduction
  • Applications fail due to a variety of errors in
    the field
  • Hardware design errors
  • Runtime errors (soft errors, process variation)
  • Software design errors
  • Many techniques focus on particular classes of
    errors
  • May not provide high coverage for errors in the
    field
  • We need a unified technique to detect multiple
    classes
  • Static analysis techniques
  • Dynamic analysis techniques
  • Hybrid static/dynamic techniques

30
Static Analysis Techniques
  • Attempt to find bugs by approximating program
    state through compiler-based source code analysis
  • E.g. Prefix/Prefast, LINT, ESP, ESC/Java, etc.
  • Advantages
  • Can find S/W design errors without executing
    program
  • Can find errors on all possible paths (in theory)
  • Disadvantages
  • Cannot reason about H/W design errors or runtime
    errors
  • May find errors on infeasible paths, i.e., paths
    never executed by the program, which leads to
    wasteful detections.
  • Impossible to eliminate all infeasible paths
    (halting problem)
  • Approximations for finding paths may lead to
    missed detections

31
Static Analysis Example
    int size = 0;
    char *str = NULL;
    char *src;
    while (src[size] != '\0')
        size++;
    if (size > 0)
        str = malloc(size + 1);
    strncpy(str, src, size);

Questions the analysis must answer:
  • Is src[size] == '\0'?
  • Is (size > 0)?
  • Is str == NULL?
32
Static Analysis Example (CFG)
[Figure: control-flow graph of the example code]
  • Entry node: size = 0; str = NULL
  • Loop node: while (src[size] != '\0') size++
  • Branch node: if (size > 0)
  • then: str = malloc()
  • else: no assignment to str
  • Join node: strncpy(str, src, size)
33
Dynamic Analysis Techniques
  • Learn application invariants by monitoring
    application execution at runtime and use the
    invariants learned to detect errors at runtime
  • DAIKON learns invariants based on predefined
    templates by executing the program over multiple
    inputs. The invariants are learned offline.
  • DIDUCE learns invariants in the initial portion
    of an application's execution; subsequent
    violations of the invariants are signaled.
  • Advantages
  • No need for complex feasible path analysis
  • Can detect all three classes of errors (but what
    about coverage ?)
  • Disadvantages
  • Coverage may not be high as program may crash
    before reaching check (or detector may be
    bypassed in program)
  • May lead to false positives if training set is
    not appropriately chosen (DAIKON), or if
    application behavior changes over time (DIDUCE)
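The learn-then-check cycle can be sketched with the simplest possible invariant, a value range for one variable. This is a deliberate simplification: the slides do not describe the invariant forms that DAIKON or DIDUCE actually use, and the names `range_inv_t`, `inv_train`, and `inv_holds` are invented for the example.

```c
#include <stdbool.h>

/* Learned invariant for one monitored variable: min <= value <= max. */
typedef struct {
    long min;
    long max;
    bool trained; /* false until at least one value has been observed */
} range_inv_t;

void inv_init(range_inv_t *inv) {
    inv->min = 0;
    inv->max = 0;
    inv->trained = false;
}

/* Training phase: widen the range to cover every observed value. */
void inv_train(range_inv_t *inv, long v) {
    if (!inv->trained) {
        inv->min = v;
        inv->max = v;
        inv->trained = true;
        return;
    }
    if (v < inv->min) inv->min = v;
    if (v > inv->max) inv->max = v;
}

/* Detection phase: false signals a violation of the learned invariant. */
bool inv_holds(const range_inv_t *inv, long v) {
    if (!inv->trained) {
        return true; /* nothing learned yet, so nothing to violate */
    }
    return v >= inv->min && v <= inv->max;
}
```

The false-positive risk noted above is visible here: any legitimate value outside the training range is flagged, so the quality of the training inputs determines the quality of the detector.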

34
DAIKON Example Invariants
    // Return the sum of the
    // elements of array b, which
    // has length n.
    long array_sum(int *b, long n) {
      long s = 0;
      for (int i = 0; i < n; i++)
        s = s + b[i];
      return s;
    }

Preconditions:
  • N = size(B) (length)
  • N >= 0 (range)
Loop invariants:
  • N = size(B) (length)
  • S = sum(B[0..I-1]) (sum)
  • N >= 0 (range)
  • I >= 0 (range)
  • I <= N (range)
Postconditions:
  • B = B_orig (const)
  • N = I = N_orig (equality)
  • S = sum(B) (sum)
  • N >= 0 (range)
35
DAIKON Example Detected Error
    // Return the sum of the
    // elements of array b, which
    // has length n.
    long array_sum(int *b, long n) {
      long s = 0;
      for (int i = 0; i < n; i++)
        s = s + b[i];
      return s;
    }

Preconditions:
  • N = size(B) (length)
  • N >= 0 (range)
Loop invariants:
  • N = size(B) (length)
  • S = sum(B[0..I-1]) (sum)
  • N >= 0 (range)
  • I >= 0 (range)
  • I <= N (range)
Error!
Postconditions:
  • B = B_orig (const)
  • N = I = N_orig (equality)
  • S = sum(B) (sum)
  • N >= 0 (range)
36
DAIKON Example Undetected Error
    // Return the sum of the
    // elements of array b, which
    // has length n.
    long array_sum(int *b, long n) {
      long s = 0;
      for (int i = 0; i < n; i++)
        s = s + b[i];
      return s;
    }

Preconditions:
  • N = size(B) (length)
  • N >= 0 (range)
Loop invariants:
  • N = size(B) (length)
  • S = sum(B[0..I-1]) (sum)
  • N >= 0 (range)
  • I >= 0 (range)
  • I <= N (range)
Error!
Crash!
Postconditions:
  • B = B_orig (const)
  • N = I = N_orig (equality)
  • S = sum(B) (sum)
  • N >= 0 (range)
37
Big Picture Detection Techniques
Existing techniques use either static or dynamic
analysis exclusively. The CVR technique, on the
other hand, uses static analysis to derive
detectors, but the detectors are executed at
runtime (dynamic detection). This allows the
technique to detect runtime errors in hardware
and software with:
  • (1) no false positives (errors flagged when no
    error exists), and
  • (2) no wasteful detections (errors flagged that
    may never manifest), and
  • (3) the ability to deal with arbitrary errors
    that can occur at runtime
38
Critical Variable Re-computation (CVR)
  • Identify instructions that compute critical
    variables (dynamic analysis)
  • Variables with high dynamic fanouts [PRDC 2005]
  • Encode the check as an optimized symbolic expression
    (static analysis)
  • Based on the backward slice of the critical variable
    [IOLTS 2007]
  • Specialized according to program control paths
  • Choose the expression based on the executed path
    (runtime checking)
  • Re-compute the value of the critical variable using
    the expression
  • Compare with the original; a mismatch indicates an error
  • Advantages of the checking expression
  • Simplicity: higher performance, easier for H/W
    checking
  • Diversity: protection from common-mode errors
  • Formalism: can be formally proven to detect
    different error classes
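The recompute-and-compare idea can be illustrated with a two-path function. Everything concrete here is an assumption made for the example: the additions in `original_code`, the critical variable `f`, and the hand-derived checking expressions. The point is only that each path gets its own simplified expression over the inputs, selected at runtime by the tracked path.

```c
#include <stdbool.h>

/* Result of the original computation plus the path that was executed. */
typedef struct {
    int path; /* 1 = then branch, 2 = else branch */
    int f;    /* critical variable */
} cvr_result_t;

/* Hypothetical original code: f is computed differently on each path. */
cvr_result_t original_code(int a, int c, int d, int e) {
    cvr_result_t r;
    int b;
    if (a == 0) {   /* path 1 */
        b = a + c;
        d = b + e;
        r.f = d + b;
        r.path = 1;
    } else {        /* path 2 */
        c = a + d;
        b = d + e;
        r.f = b + c;
        r.path = 2;
    }
    return r;
}

/* Checking expressions from the backward slice of f, one per path.
 * Path 1 is specialized with the path condition a == 0: f = 2*c + e.
 * Path 2 simplifies to f = a + 2*d + e. */
int checking_expression(int path, int a, int c, int d, int e) {
    return (path == 1) ? 2 * c + e : a + 2 * d + e;
}

/* Runtime check: recompute f for the executed path and compare. */
bool cvr_check(int f, int path, int a, int c, int d, int e) {
    return f == checking_expression(path, a, c, d, e);
}
```

Because the checking expression is computed differently from the original instruction sequence, an error that corrupts an intermediate value such as `b` shows up as a mismatch.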

39
Critical Variable Identification
  • Identify critical variables (and locations) for
    placing detectors (checks)
  • Construction of program's Dynamic Dependence
    Graph
  • Compute heuristics to choose candidate points
    for placement
  • E.g., Fanouts, Lifetimes, Execution
  • Evaluation of coverage using fault injections
  • Assumes detectors are ideal (golden run)
  • 10 ideal detectors placed according to the Fanouts
    heuristic provide up to 80% coverage for a large
    application such as gcc95
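The Fanouts heuristic can be sketched as a count over the edges of a dependence graph. This is an illustration under assumptions: nodes are numbered from 0, each edge records that one node's value is used by another, and the sketch returns only the single highest-fanout node. The names `dep_edge_t` and `max_fanout_node` are hypothetical.

```c
#include <stddef.h>

/* One edge of the dependence graph: `consumer` uses the value of `producer`. */
typedef struct {
    int producer;
    int consumer;
} dep_edge_t;

/* Fills fanout[0..n_nodes-1] with the number of uses of each node's value and
 * returns the node with the largest fanout (lowest index wins ties), or -1
 * when n_nodes is 0. */
int max_fanout_node(const dep_edge_t *edges, size_t n_edges,
                    int *fanout, int n_nodes) {
    int best = -1;
    for (int v = 0; v < n_nodes; v++) {
        fanout[v] = 0;
    }
    for (size_t k = 0; k < n_edges; k++) {
        int p = edges[k].producer;
        if (p >= 0 && p < n_nodes) {
            fanout[p]++;
        }
    }
    for (int v = 0; v < n_nodes; v++) {
        if (best < 0 || fanout[v] > fanout[best]) {
            best = v;
        }
    }
    return best;
}
```

Choosing several detector locations would amount to sorting the nodes by the `fanout` array and taking the first few.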

Coverage: detection of errors that matter to the
application
40
CVR Conceptual Example
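[Figure: original code alongside its checking expressions]
  • Original code: if (a == 0) then path1, else path2,
    followed by "use f" and the rest of the code
  • path1 (then): b is computed from a and c, d from b
    and e, f from d and b
  • path2 (else): c is computed from a and d, b from d
    and e, f from b and c
  • Path tracking selects the checking expression for
    the executed path; a mismatch signals Error!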
41
Fault Model
  • Runtime errors
  • Soft errors in register file, cache, main memory
  • Errors in arithmetic, logic unit
  • Hardware design errors
  • Errors in data-forwarding mechanisms
  • Instruction scheduling errors
  • Errors in instruction commit mechanism
  • Software design errors
  • Memory corruption errors (e.g., buffer overflows,
    dangling pointers)
  • Race conditions (or more generally atomicity
    violations)

42
Automated Design Flow (S/W to H/W)
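[Figure: design flow from application source code to hardware checks]
  • Application source code is profiled to produce a
    dynamic execution profile
  • Fanouts analysis yields application code annotated
    with critical variables
  • The Value Recomputation Pass produces application
    code instrumented with instructions to invoke H/W
    checks, path-tracking state machines, and checking
    expressions
  • The instrumented code passes through the regular
    compiler to the general-purpose processor
  • State machines and checking expressions pass
    through VHDL translation and synthesis into the
    Reliability and Security Engine (RSE), attached to
    the processor through the RSE interface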
43
VRP - Value Recomputation Pass (Implemented using
LLVM compiler)
  • Placement: finds critical variables using profile
    data and the static dependence graph (based on
    Fanouts)
  • Chooses the top N fanout variables for each function
  • Computes the backward slice of each critical variable
    for all paths
  • Intra-procedural slice: goes back to the top of the
    function
  • Acyclic: breaks checks across loops into two
    checks
  • Places each extracted slice in a separate basic
    block
  • Invokes optimization passes to produce the checking
    expression
  • Adds instrumentation for tracking control paths
    at runtime
  • Invokes the hardware module for path tracking and
    checking

44
Results (Software Implementation)
  • Average perf. overhead
  • Checking: 25%
  • Modification: 8%
  • Total: 33%
  • Average coverage
  • Before Prop: 64%
  • Before Crash: 13%
  • Total detected: 77%

45
Hardware Implementation RSE Module
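[Figure: DLX superscalar processor with RSE; static-detector module]
  • Path tracking: invoked when a PATH_CHECK
    instruction is committed; maintains the runtime path
  • Static checking: invoked when an EXPR_CHECK
    instruction is committed; raises "Error Detected"
  • Processor structures shown: register file, main
    memory, write buffer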
46
Hardware Performance Evaluation
Significant performance gain over software
implementation