Title: Design of Reliable Systems and Networks ECE 442 CS 435 Lecture 5 Error Detection Techniques
1 Design of Reliable Systems and Networks, ECE 442 / CS 435
Lecture 5: Error Detection Techniques
Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer_at_crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2 Outline
- Watchdog timers
- Heartbeats
- Consistency and capability checking
- Data audits
- Runtime generated assertions
3 Watchdog Timer (hardware/software)
- An inexpensive method of error detection
- The process being watched must reset the timer before the timer expires; otherwise the watched process is assumed to be faulty
- Watchdog timers detect only errors that manifest themselves as control-flow errors such that the system stops resetting the timer
- Only processes with relatively deterministic runtimes can be checked, since error detection is based entirely on the time between timer resets
4 Watchdog Timer (hardware/software)
- A watchdog timer provides only an indication of possible process failure
  - A partially failed process may still be able to reset the timer
- The fault model is not limited, but only control-flow errors that prevent a reset are checked
- Timer resets must be frequent, but not placed inside loops
- Processes must have a relatively deterministic runtime
- Coverage is limited, as neither the data nor the results are checked
- When used to reset the system, a watchdog timer can improve availability (the mean time to recovery is shortened) but not reliability (failures are just as likely to occur)
  - When the availability of a system is more important than the loss of data, using a watchdog timer to reset the system on detection of an error is an appropriate choice
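The reset-before-expiry protocol described on these two slides can be sketched as a small software watchdog. This is a minimal illustration, not taken from the lecture: the type `watchdog_t`, the function names, and the use of a logical tick counter in place of a hardware timer are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software watchdog driven by a logical tick counter. */
typedef struct {
    uint32_t timeout_ticks; /* ticks allowed between two resets */
    uint32_t remaining;     /* ticks left before expiry */
    bool     expired;       /* latched once the timer runs out */
} watchdog_t;

void watchdog_init(watchdog_t *wd, uint32_t timeout_ticks) {
    assert(timeout_ticks > 0);
    wd->timeout_ticks = timeout_ticks;
    wd->remaining = timeout_ticks;
    wd->expired = false;
}

/* Called by the watched process to show that it is still making progress. */
void watchdog_kick(watchdog_t *wd) {
    if (!wd->expired) {
        wd->remaining = wd->timeout_ticks;
    }
}

/* Called once per tick by the timer side.
 * Returns true once the watched process is presumed faulty. */
bool watchdog_tick(watchdog_t *wd) {
    if (wd->expired) {
        return true;
    }
    if (--wd->remaining == 0) {
        wd->expired = true; /* a recovery action (e.g., a reset) would start here */
    }
    return wd->expired;
}
```

Note that the sketch only observes whether `watchdog_kick` keeps being called; as the slide says, neither data nor results are checked, so a partially failed process that still kicks the timer goes unnoticed.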
5 Example Applications of Watchdog Timers
- Pluribus Reliable Multiprocessor
  - Hardware and software timers (ranging from 5 µs to 2 minutes in duration) monitor almost every subsystem
  - Each subsystem performs a self-check/cycle; if this takes too long, an error is flagged
  - Examples
    - A buffer is not returned to the free message buffer list (2-minute timeout)
    - A lock failure can cause a resource to stay locked when no subsystem is using it; since the lock failed, the resource will not become free
    - A 1/15-second timer interrupts the processor and unlocks the resource; the result is a temporary (1/15-second) degradation in system performance, and the system is otherwise unaffected by the error
6 Example Applications of Watchdog Timers
- VAX-11/780
  - A multiprocessor system for commercial applications
  - The console processor runs a watchdog process that is reset when an interrupt line is strobed
  - If the interrupt line is not strobed by a processor within 200 microseconds, this indicates a failure, and the console processor attempts to determine the reason for the failure
- Bell System Telephone Switch
  - External watchdog timers monitor correct program operation by triggering recovery when timers are not periodically reset
  - Allows early detection (before the error propagates) of problems caused by software errors, and consequently easier recovery
- Mars Rover (real-time multi-threaded OS): priority inversion problem
7 Heartbeats
- A common approach to detecting process and node failures in a distributed (networked) computing environment
- Periodically, a monitoring entity sends a message (a heartbeat) to a monitored node or process and waits for a reply
- If the monitored node does not respond within a predefined timeout interval, the node is declared failed and appropriate recovery action is initiated
- Issues
  - The timeout period is pre-negotiated by the two parties or hard-coded by the programmer
  - The predefined timeout value cannot adapt to changes in network traffic or to load variability on individual nodes
  - The monitored node is assumed to be healthy if it is able to respond to a heartbeat message
  - The process/thread responding to the heartbeat message may operate correctly while other processes/threads are deadlocked or operating incorrectly
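As an illustration of the basic scheme (fixed timeout, failure declared when no reply arrives in time), here is a minimal sketch of the monitor side. The names `hb_monitor_t`, `hb_send`, `hb_reply`, and `hb_check` are hypothetical, and time is passed in as a plain integer so that the logic stays independent of any particular clock or network API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical monitor-side state for one monitored node. */
typedef struct {
    uint64_t timeout;  /* predefined timeout interval, in time units */
    uint64_t sent_at;  /* time at which the outstanding heartbeat was sent */
    bool     awaiting; /* true while a reply is outstanding */
    bool     failed;   /* latched once the node is declared failed */
} hb_monitor_t;

void hb_init(hb_monitor_t *m, uint64_t timeout) {
    assert(timeout > 0);
    m->timeout = timeout;
    m->sent_at = 0;
    m->awaiting = false;
    m->failed = false;
}

/* Record that a heartbeat message was sent at time `now`. */
void hb_send(hb_monitor_t *m, uint64_t now) {
    m->sent_at = now;
    m->awaiting = true;
}

/* Declare the node failed if a reply has been outstanding for longer
 * than the timeout. Returns the (latched) failure status. */
bool hb_check(hb_monitor_t *m, uint64_t now) {
    if (m->awaiting && now - m->sent_at > m->timeout) {
        m->failed = true; /* recovery action would be initiated here */
    }
    return m->failed;
}

/* Record a reply received at time `now`; a late reply does not undo
 * a failure that has already been declared. */
void hb_reply(hb_monitor_t *m, uint64_t now) {
    if (!hb_check(m, now)) {
        m->awaiting = false;
    }
}
```

The sketch exhibits both issues listed on the slide: `timeout` never changes, and any reply at all counts as proof of health.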
8 Adaptive Smart Heartbeat
- Adaptive heartbeat: the timeout value used by the monitor process is not fixed but is periodically negotiated between the two parties to adapt to changes in network traffic or node load
- Smart heartbeat: the entity being monitored runs a set of predefined checks to verify the robustness of the entire process and only then responds to the monitoring process
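Both ideas can be sketched briefly. The slides do not say how the timeout is renegotiated, so the adaptive part below borrows a smoothed-response-time estimator in the style of TCP retransmission timers; that choice, the integer weights, and every name in the code are assumptions made for illustration. The smart part replies only when every registered self-check passes; `check_passes` and `check_fails` are placeholders for real application checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Adaptive part: timeout derived from observed response times (assumed scheme). */
typedef struct {
    uint32_t srtt;   /* smoothed response time, in ms */
    uint32_t rttvar; /* smoothed deviation of the response time, in ms */
} adaptive_timeout_t;

void at_init(adaptive_timeout_t *a, uint32_t first_sample_ms) {
    a->srtt = first_sample_ms;
    a->rttvar = first_sample_ms / 2;
}

/* Fold a newly measured response time into the estimate. */
void at_update(adaptive_timeout_t *a, uint32_t sample_ms) {
    uint32_t err = a->srtt > sample_ms ? a->srtt - sample_ms
                                       : sample_ms - a->srtt;
    a->rttvar = (3 * a->rttvar + err) / 4;
    a->srtt = (7 * a->srtt + sample_ms) / 8;
}

/* Timeout the two parties would agree on for the next period. */
uint32_t at_timeout(const adaptive_timeout_t *a) {
    return a->srtt + 4 * a->rttvar;
}

/* Smart part: the monitored entity answers only if all checks pass. */
typedef bool (*self_check_fn)(void);

bool smart_heartbeat_reply(const self_check_fn *checks, size_t n) {
    assert(checks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!checks[i]()) {
            return false; /* stay silent so the monitor times out */
        }
    }
    return true;
}

/* Placeholder self-checks standing in for real application checks. */
bool check_passes(void) { return true; }
bool check_fails(void)  { return false; }
```

With this estimator, a run of steady response times tightens the timeout, while a single slow response widens it again, which is the kind of adaptation to load the slide describes.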
9 Adaptive Heartbeat with Load Generator
10 Consistency and Capability Checking
- Capability checking
  - Can be implemented as a hardware mechanism or as part of the operating system (usually the case)
  - Access to objects (memory segments, I/O devices) is limited to users (processors or processes) with the proper authorization
  - Examples
    - Virtual address management (the MMU usually has a capability check)
    - Permission vs. activity: if these are not valid, there is an error trap
    - Password checking
- Consistency checks
  - Range check: confirms that a computed value is in a valid range, e.g., a computed probability must be in the range 0 to 1
  - Address checking: verifies that the address to be accessed exists
  - Opcode checking: checks whether the instruction to be executed has one of the defined (documented) opcodes
  - Arithmetic overflow and underflow
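Three of the consistency checks listed above can be written as small predicates. This is a sketch with hypothetical function names; in particular, address checking is reduced here to a bounds check on an array index, which is only one simple software analogue of the hardware check the slide refers to.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Range check: e.g., a computed probability must lie in [0, 1].
 * A NaN fails both comparisons and is therefore rejected. */
bool in_range(double value, double lo, double hi) {
    return value >= lo && value <= hi;
}

/* Overflow/underflow check: add two ints only if the result is representable.
 * Returns false (and leaves *out untouched) when the sum would overflow. */
bool checked_add(int a, int b, int *out) {
    assert(out != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}

/* Address check (simplified): the index must refer to an existing element. */
bool index_valid(size_t index, size_t length) {
    return index < length;
}
```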
11 Control-flow Monitoring Using Signatures: Hardware Approaches
- Employ a watchdog (a simple co-processor) to monitor the behavior of a main processor
- Suitable for single embedded applications with little or no caching
- Limited applicability in off-the-shelf systems, as they require additional specialized resources, e.g., a watchdog or a pre-compiler
12 Control-flow Monitoring Using Signatures: Hardware Approaches (cont.)
- Embedded Signature Monitoring
  - Pre-computed signature is embedded in the application program
  - Requires recompilation of existing programs
  - Performance degradation of the application
- Autonomous Signature Monitoring
  - Watchdog processor stores the pre-computed signature in memory and mimics the control flow of the application
  - Watchdog processor is rather complex
  - High memory overhead
- Problems with both approaches
  - Assume straight-line code with no interleaving of processes or threads
  - Need a tightly coupled watchdog processor
  - Require a customized compiler
  - Difficult to apply in a networked context
13 Control-flow Monitoring Using Signatures: Software Approaches
- Software techniques partition the application into blocks, either in assembly language or in a high-level language
- Appropriate instrumentation is inserted at the beginning and/or end of the blocks
- The checking code is inserted in the instruction stream, eliminating the need for a hardware watchdog processor
- Two classes of approaches
  - Non-preemptive signature checking
  - Preemptive signature checking
14 Fine-grained Signature
- Captures control-flow errors within an element (a simple software component)
- At initialization time, the GST is formed from a regular expression of valid control signatures (paths)
- At runtime, the RST is formed through emit() function calls inserted into the element
[Figure: ELEMENT 1 instrumented with Emit(A), Emit(B), Emit(C), Emit(D), and Emit(E) calls; its GST entry reads AB(CDB)E; an Armor holds the GST and RST for elements 1, 2, 3, ...; further labels: PR 1, PR 2]
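A runtime signature can be checked against the golden one with a small state machine. The sketch below assumes that the signature shown for Element 1 stands for the regular expression AB(CDB)*E, i.e., that the C-D-B part is a loop that may repeat zero or more times; that reading, and all names in the code, are assumptions rather than something the slide states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* States of a hand-built automaton for the assumed expression AB(CDB)*E. */
typedef enum { S_START, S_A, S_B, S_C, S_D, S_DONE, S_ERROR } sig_state_t;

/* Advance the automaton by one emitted symbol. In an instrumented element,
 * each emit() call would perform one such step on the runtime signature. */
sig_state_t sig_step(sig_state_t s, char sym) {
    switch (s) {
    case S_START: return sym == 'A' ? S_A : S_ERROR;
    case S_A:     return sym == 'B' ? S_B : S_ERROR;
    case S_B:     return sym == 'C' ? S_C : (sym == 'E' ? S_DONE : S_ERROR);
    case S_C:     return sym == 'D' ? S_D : S_ERROR;
    case S_D:     return sym == 'B' ? S_B : S_ERROR;
    default:      return S_ERROR; /* nothing may follow E or an error */
    }
}

/* Check a complete runtime signature (a string of emitted symbols). */
bool signature_valid(const char *rst) {
    assert(rst != NULL);
    sig_state_t s = S_START;
    for (size_t i = 0; rst[i] != '\0'; i++) {
        s = sig_step(s, rst[i]);
        if (s == S_ERROR) {
            return false; /* control-flow error: illegal transition */
        }
    }
    return s == S_DONE;
}
```

For example, "ABE" and "ABCDBE" are accepted, while "ABCE" (the element skipped D and B) is rejected at the first illegal symbol.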
15 Problems with Control Flow Signatures
[Figure: three executions compared: correct execution, incorrect execution without preemptive checking, and incorrect execution with preemptive checking]
- Without preemptive checking: an erroneous change in control flow occurs and the AB is not reached
- With preemptive checking: the preemptive check detects the erroneous control flow and the computation stops
16 Preemptive Control Signatures (PECOS)
- PECOS determines the runtime target address and compares it against the valid addresses before the jump to the target address is made
- As a result (unlike other techniques), executing instructions from an invalid target location is unlikely
- High-level control structure of an Assertion Block
  - Determine the runtime target address Xout
  - Extract the list of valid target addresses X1, X2
  - Calculate ID := Xout * 1/P, where P = !((Xout - X1) * (Xout - X2))
  - The calculation of ID raises a DIV-BY-ZERO exception in case of error
- Can handle single (jumps), two (branches), or multiple (calls and returns) target addresses
- The Assertion Block does not introduce any new control-flow instructions
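The arithmetic of the Assertion Block can be illustrated in C. This is a portable model of the calculation, not the PECOS implementation: integer division by zero is undefined behavior in C, so the sketch reports the error through a return value at the point where the real assertion block would take the divide-by-zero trap, and it therefore contains a branch that the real block avoids. The function name and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the assertion-block calculation ID := Xout * 1/P.
 * P is 1 when xout equals one of the valid targets and 0 otherwise.
 * The slide writes P as !((Xout - X1) * (Xout - X2)); the comparisons below
 * give the same truth value without relying on a fixed-width product,
 * which could wrap around to zero for two nonzero differences.
 * Returns false where the real block would raise DIV-BY-ZERO. */
bool pecos_assert(uintptr_t xout, uintptr_t x1, uintptr_t x2, uintptr_t *id) {
    assert(id != NULL);
    uintptr_t p = (uintptr_t)((xout == x1) | (xout == x2));
    if (p == 0) {
        return false;     /* invalid target: the jump must not be taken */
    }
    *id = xout * (1 / p); /* with P == 1 this is simply Xout */
    return true;
}
```

A jump with a single valid target can be modeled by passing the same address for `x1` and `x2`.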
17 How to Apply PECOS to an Application?
18 What Can We Cover with Preemptive Software Control Signatures?
[Figure: CPU, memory, address bus, and data bus]
- Errors in the cache: not covered
- Errors on the bus: covered
- Errors in the memory: covered
- Solution: insert a programmable error detection core into the CPU
19 Process Crash/Hang Detection
- Instruction Count Heartbeat (ICH): leverages processor performance registers to detect process/OS crashes and hangs
- Infinite Loop Hang Detection (ILHD): tracks loop entry and exit points
- Sequential Code Hang Detection (SCHD): detects illegal repetition of a sequence of instructions
N. Nakka, G. P. Saggese, Z. T. Kalbarczyk, R. K. Iyer, "An architectural framework for Process Crash/Hang detection," Proceedings of EDCC-5, 2005
20 Process Crash/Hang Detection
- Crash detection: Instruction Count Heartbeat (ICH)
  - Uses processor performance counters to detect process and OS crashes
  - Can be extended to support failure detection in distributed systems
21 Process Crash/Hang Detection
- Process hang in legal loops: Infinite Loop Hang Detector (ILHD)
  - Profile-based analysis of the application to estimate loop execution time
  - The module is reconfigured with a timeout for the loop as it is entered (CHECK at loop entry and loop exit)
- Process hang in illegal loops: Sequential Code Hang Detector (SCHD)
  - The module is parameterized with the length of a loop
  - Any loop shorter than the given length indicates a control error
22 Infinite Loop Hang Detector
- Property: loop execution behavior
- Assembly code analysis to find loop entry and exit
- Profiling the application with representative inputs
  - Derive statistical bounds for execution time
- At runtime, monitor loop execution time
  - Use the expected loop execution time to detect a possible infinite loop
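The profile-then-monitor idea can be sketched in software. The slides do not specify which statistical bound is derived, so the code below assumes a simple "mean plus k standard deviations" rule over profiled loop times; that rule and all names are illustrative. The comparison is done on squared quantities, which avoids calling `sqrt`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Execution-time statistics of one loop, gathered by offline profiling. */
typedef struct {
    double mean; /* mean loop execution time */
    double var;  /* population variance of the loop execution time */
} loop_profile_t;

/* Profiling step: derive statistics from runs with representative inputs. */
loop_profile_t profile_loop(const double *samples, size_t n) {
    assert(samples != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
    }
    loop_profile_t p = { sum / (double)n, 0.0 };
    for (size_t i = 0; i < n; i++) {
        double d = samples[i] - p.mean;
        p.var += d * d;
    }
    p.var /= (double)n;
    return p;
}

/* Runtime step: flag a possible infinite loop when the time spent since
 * loop entry exceeds mean + k * stddev (compared in squared form). */
bool loop_hang_suspected(const loop_profile_t *p, double elapsed, double k) {
    assert(p != NULL && k >= 0.0);
    if (elapsed <= p->mean) {
        return false;
    }
    double d = elapsed - p->mean;
    return d * d > k * k * p->var;
}
```

If all profiled runs took exactly the same time, the variance is zero and any longer execution is flagged, so a real deployment would need enough representative inputs to obtain a meaningful spread.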
23 Detection of Instruction Dependency Violations
- A RAW (read-after-write) dependency imposes a sequential order on the execution of instructions
- Errors in the processor control logic or in the binary of an instruction can lead to a violation
- The Sequence Checker Module (SCM) detects such violations
  - Monitors issue and execute events in the pipeline
- Representative instruction sequences are extracted using static analysis
- CHECKs are used to dynamically reconfigure the module with sequences
24 SCM Detection Mechanism
- SCM state for a sequence: (i, e)
  - i: instruction on which an event is awaited
  - e: event (issue/execute) awaited
- Property: at any instant, at most one instruction of a sequence can be issued or executed
- Instructions in the issue and execute queues are matched against the instructions of the sequence
  - At most one instruction from the queue should match the correct state of the SCM
- An error is detected when there is
  - An execute or issue mismatch
  - A match other than the expected state
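The (i, e) state and its error conditions can be modeled as a small state machine. This is a simplified software model under stated assumptions, not the hardware design: instructions are identified by made-up integer IDs, pipeline events arrive one at a time rather than as queue contents, and the checker restarts at the first instruction once the whole sequence has executed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { EV_ISSUE, EV_EXECUTE } scm_event_t;

/* Checker state for one dependent instruction sequence. */
typedef struct {
    const uint32_t *seq;   /* instruction IDs in their required order */
    size_t          len;   /* number of instructions in the sequence */
    size_t          i;     /* index of the instruction whose event is awaited */
    scm_event_t     e;     /* event awaited on that instruction */
    bool            error; /* latched once a violation is seen */
} scm_t;

void scm_init(scm_t *m, const uint32_t *seq, size_t len) {
    assert(seq != NULL && len > 0);
    m->seq = seq;
    m->len = len;
    m->i = 0;
    m->e = EV_ISSUE;
    m->error = false;
}

static bool scm_in_sequence(const scm_t *m, uint32_t instr) {
    for (size_t k = 0; k < m->len; k++) {
        if (m->seq[k] == instr) {
            return true;
        }
    }
    return false;
}

/* Feed one pipeline event to the checker. Returns false once a dependency
 * violation has been detected; events of unrelated instructions are ignored. */
bool scm_observe(scm_t *m, uint32_t instr, scm_event_t ev) {
    if (m->error) {
        return false;
    }
    if (!scm_in_sequence(m, instr)) {
        return true;
    }
    if (instr != m->seq[m->i] || ev != m->e) {
        m->error = true; /* a match other than the expected state */
        return false;
    }
    if (m->e == EV_ISSUE) {
        m->e = EV_EXECUTE;             /* now wait for this instruction to execute */
    } else {
        m->e = EV_ISSUE;               /* then for the next one to issue */
        m->i = (m->i + 1) % m->len;
    }
    return true;
}
```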
25 Detected and Undetected Faults
26 SCM Reconfiguration Architecture
- Achieved with the help of CHECK instructions
- Extracted sequences are loaded as part of the program image
- At runtime, the SCM loads sequences into a set of registers
- Each sequence has additional registers: length, state
27 Runtime Generated Assertions
- Goals
  - Generate runtime assertions by monitoring the values of selected variables in a program
  - Use the monitored data to abstract out, via statistical pattern recognition techniques, the key relationships between the variables, separately and jointly, and to establish their probabilistic behavior
- Approach
  - Identify clusters of values traversed by different variables
  - Use this information to automatically generate runtime assertions capable of capturing abnormal behavior of an application due to hardware or software errors
  - Cross-check with other entities in the system their views on the state of selected variables
    - If a variable is globally accessible, multiple entities (e.g., multiple execution threads) may have their own opinions about the correct value of the variable
    - Can improve coverage and reduce false alarms
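As a very reduced illustration of generating an assertion from monitored values, the sketch below learns a single interval per variable during a training phase and then checks new values against it. Treating the observed values as one cluster with a tolerance margin is a deliberate simplification of the statistical pattern recognition the slide mentions, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Value range learned for one monitored variable (a single "cluster"). */
typedef struct {
    double lo; /* smallest value observed during training */
    double hi; /* largest value observed during training */
    size_t n;  /* number of training observations */
} learned_range_t;

void lr_init(learned_range_t *r) {
    assert(r != NULL);
    r->lo = 0.0;
    r->hi = 0.0;
    r->n = 0;
}

/* Training phase: record one observed value of the variable. */
void lr_observe(learned_range_t *r, double v) {
    if (r->n == 0 || v < r->lo) {
        r->lo = v;
    }
    if (r->n == 0 || v > r->hi) {
        r->hi = v;
    }
    r->n++;
}

/* Generated assertion: the value must lie within the learned range, widened
 * on both sides by `slack` times the range width to reduce false alarms.
 * With no training data there is nothing to assert. */
bool lr_holds(const learned_range_t *r, double v, double slack) {
    if (r->n == 0) {
        return true;
    }
    double margin = slack * (r->hi - r->lo);
    return v >= r->lo - margin && v <= r->hi + margin;
}
```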
28 Automated Derivation of Application-based Invariants
29 Introduction
- Applications fail due to a variety of errors in the field
  - Hardware design errors
  - Runtime errors (soft errors, process variation)
  - Software design errors
- Many techniques focus on particular classes of errors
  - May not provide high coverage for errors in the field
- We need a unified technique to detect multiple classes
  - Static analysis techniques
  - Dynamic analysis techniques
  - Hybrid static/dynamic techniques
30 Static Analysis Techniques
- Attempt to find bugs by approximating program state through compiler-based source code analysis
  - E.g., PREfix/PREfast, LINT, ESP, ESC/Java, etc.
- Advantages
  - Can find S/W design errors without executing the program
  - Can find errors on all possible paths (in theory)
- Disadvantages
  - Cannot reason about H/W design errors or runtime errors
  - May find errors on infeasible paths, i.e., paths never executed by the program; this leads to wasteful detections
  - Impossible to eliminate all infeasible paths (halting problem)
  - Approximations for finding paths may lead to missed detections
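31 Static Analysis Example
    int size = 0;
    char *str = NULL;
    char *src;                      /* input string, set elsewhere */
    while (src[size] != '\0')
        size++;
    if (size > 0)
        str = malloc(size + 1);

    strncpy(str, src, size);        /* str is still NULL when size == 0 */
Questions the analysis has to answer:
- Is *src == '\0'?
- Is (size > 0)?
- Is str == NULL?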
32 Static Analysis Example (CFG)
[Figure: control-flow graph of the code on the previous slide, with the following nodes]
- size = 0; str = NULL
- while (src[size] != '\0') size++
- if (size > 0): the "then" edge leads to str = malloc(); the "else" edge bypasses it
- strncpy(str, src, size)
33 Dynamic Analysis Techniques
- Learn application invariants by monitoring application execution, and use the learned invariants to detect errors at runtime
  - DAIKON learns invariants based on predefined templates by executing the program over multiple inputs; the invariants are learned offline
  - DIDUCE learns invariants in the initial portion of an application's execution, and subsequent violations of the invariants are signaled
- Advantages
  - No need for complex feasible-path analysis
  - Can detect all three classes of errors (but what about coverage?)
- Disadvantages
  - Coverage may not be high, as the program may crash before reaching a check (or the detector may be bypassed in the program)
  - May lead to false positives if the training set is not appropriately chosen (DAIKON) or if application behavior changes over time (DIDUCE)
34 DAIKON Example: Invariants
    // Return the sum of the
    // elements of array b, which
    // has length n.
    long array_sum(int *b, long n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s = s + b[i];
        return s;
    }
Invariants (invariant type in parentheses):
- Preconditions: N = size(B) (length); N >= 0 (range)
- Loop invariants: N = size(B) (length); S = sum(B[0..I-1]) (sum); N >= 0 (range); I >= 0 (range); I <= N (range)
- Postconditions: B = B_orig (const); N = I = N_orig (equality); S = sum(B) (sum); N >= 0 (range)
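Once learned, invariants of this kind can be turned into runtime checks. The sketch below restates the array-sum routine with the preconditions, loop invariants, and postconditions written as `assert` calls. It is an illustration of the idea rather than output produced by DAIKON: the extra `b_size` parameter (needed because C arrays do not carry their length) and the helper `prefix_sum` are assumptions, and the range invariants are read as N >= 0 and 0 <= I <= N.

```c
#include <assert.h>
#include <stddef.h>

/* Helper for the "sum" invariant: sum of b[0..count-1]. */
static long prefix_sum(const int *b, long count) {
    long s = 0;
    for (long k = 0; k < count; k++) {
        s += b[k];
    }
    return s;
}

/* array_sum with the inferred invariants checked at runtime.
 * b_size is the true number of elements in b (hypothetical parameter). */
long array_sum_checked(const int *b, long n, long b_size) {
    assert(b != NULL || n == 0);
    /* Preconditions */
    assert(n == b_size);               /* length */
    assert(n >= 0);                    /* range */

    long s = 0;
    long i;
    for (i = 0; i < n; i++) {
        /* Loop invariants */
        assert(i >= 0 && i <= n);      /* range */
        assert(s == prefix_sum(b, i)); /* sum; recomputed, so O(n) per check */
        s = s + b[i];
    }

    /* Postconditions */
    assert(i == n);                    /* equality */
    assert(s == prefix_sum(b, n));     /* sum */
    return s;
}
```

A corrupted loop counter or accumulator would trip one of these assertions, provided the program reaches the check before crashing.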
35 DAIKON Example: Detected Error
[Figure: the array_sum code and invariants from the previous slide, annotated with an "Error !" marker]
36 DAIKON Example: Undetected Error
[Figure: the same array_sum code and invariants, annotated with "Error !" followed by "Crash !"]
37 Big Picture: Detection Techniques
Existing techniques use either static or dynamic analysis exclusively. The CVR technique, on the other hand, uses static analysis to derive detectors, but the detectors are executed at runtime (dynamic detection). This allows the technique to detect runtime errors in hardware and software with:
(1) no false positives (detecting errors when no errors exist), AND
(2) no wasteful detections (detecting errors that may not manifest), AND
(3) the ability to deal with arbitrary errors that can occur at runtime
38 Critical Variable Re-computation (CVR)
- Identify instructions that compute critical variables (dynamic analysis)
  - Variables with high dynamic fanouts [PRDC 2005]
- Encode the check as an optimized symbolic expression (static analysis)
  - Based on the backward slice of the critical variable [IOLTS 2007]
  - Specialized according to program control paths
- Choose the expression based on the executed path (runtime checking)
  - Re-compute the value of the critical variable using the expression
  - Compare with the original; a mismatch indicates an error
- Advantages of the checking expression
  - Simplicity: higher performance, easier for H/W checking
  - Diversity: protection from common-mode errors
  - Formalism: can be formally proven to detect different error classes
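The three steps (path tracking, path-specific checking expression, comparison) can be shown on a toy function. The arithmetic below is invented for illustration and is not the example from the slides; the names `compute_f`, `recompute_f`, and `cvr_check` are hypothetical. The critical variable is `f`, and each checking expression is the algebraically simplified backward slice of `f` along one path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Original code: the critical variable f is computed differently on each
 * control path. The path taken is recorded for the checker (path tracking). */
int compute_f(int a, int c, int e, int *path) {
    assert(path != NULL);
    int b, d, f;
    if (a > 0) {
        *path = 1;
        b = a + c;
        d = b - e;
        f = d + b;
    } else {
        *path = 2;
        b = a - c;
        d = b + e;
        f = d - b;
    }
    return f;
}

/* Checking expressions: the backward slice of f on each path, simplified.
 * Path 1: f = (a + c - e) + (a + c) = 2 * (a + c) - e
 * Path 2: f = (a - c + e) - (a - c) = e */
int recompute_f(int a, int c, int e, int path) {
    return path == 1 ? 2 * (a + c) - e : e;
}

/* Runtime check: re-compute f for the executed path and compare.
 * Returns true when the original value agrees with the recomputation. */
bool cvr_check(int f, int a, int c, int e, int path) {
    return f == recompute_f(a, c, e, path);
}
```

Because the checking expression takes a different computational route from the original code, a fault that corrupts an intermediate value such as `b` or `d` shows up as a mismatch, which is the diversity advantage listed above.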
39 Critical Variable Identification
- Identify critical variables (and locations) for placing detectors (checks)
  - Construct the program's Dynamic Dependence Graph
  - Compute heuristics to choose candidate points for placement, e.g., Fanouts, Lifetimes, Execution
- Evaluate coverage using fault injection
  - Assumes detectors are ideal (golden run)
  - 10 ideal detectors placed according to the Fanouts heuristic provide up to 80% coverage for a large application such as gcc95
- Coverage: detection of errors that matter to the application
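40 CVR Conceptual Example
[Figure: original code that branches on a into a "then" path (path1) and an "else" path (path2), each assigning intermediate variables and f, followed by "use f" and the rest of the code; path tracking selects the checking expression for the executed path, and a mismatch signals "Error !"]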
41 Fault Model
- Runtime errors
  - Soft errors in the register file, cache, and main memory
  - Errors in the arithmetic/logic unit
- Hardware design errors
  - Errors in data-forwarding mechanisms
  - Instruction scheduling errors
  - Errors in the instruction commit mechanism
- Software design errors
  - Memory corruption errors (e.g., buffer overflows, dangling pointers)
  - Race conditions (or, more generally, atomicity violations)
42 Automated Design Flow (S/W to H/W)
[Figure: design flow from application source code to hardware. Stages shown: profiling (dynamic execution profile); Fanouts analysis (application code annotated with critical variables); Value Recomputation Pass (application code instrumented with instructions to invoke H/W checks; path-tracking state machines; checking expressions); regular compiler; VHDL translation and synthesis; RSE interface between the general-purpose processor and the Reliability and Security Engine (RSE)]
43 VRP - Value Recomputation Pass (implemented using the LLVM compiler)
- Placement: finds critical variables using profile data and the static dependence graph (based on Fanouts)
  - Chooses the top N fanout variables for each function
- Computes the backward slice of the critical variable for all paths
  - Intra-procedural slice: goes back to the top of the function
  - Acyclic: breaks checks across loops into two checks
- Places each extracted slice in a separate basic block
  - Invokes optimization passes to produce the checking expression
- Adds instrumentation for tracking control paths at runtime
  - Invokes the hardware module for path tracking and checking
44 Results (Software Implementation)
- Average performance overhead
  - Checking: 25%
  - Modification: 8%
  - Total: 33%
- Average coverage
  - Before propagation: 64%
  - Before crash: 13%
  - Total detected: 77%
45 Hardware Implementation: RSE Module
[Figure: DLX superscalar processor with the RSE and its Static-Detector Module. Components shown: Path Tracking (on commit of a PATH_CHECK instruction; produces the runtime path), Static-Checking (on commit of an EXPR_CHECK instruction), register file, main memory, write buffer, and an "Error Detected" output]
46 Hardware Performance Evaluation
Significant performance gain over software
implementation