Roll-Forward Techniques for Fault Detection and Correction in Condor

Slides: 39

Provided by: Mane

1
Roll-Forward Techniques for Fault Detection and Correction in Condor
ECE 753 Project
Vaishali Karanth, Janaki Jillella
2
Overview

Brief introduction to Condor
Roll-Forward Technique
Implementation Details
Simulation
Performance and Overhead Evaluation
Extensions to Roll-Forward
Conclusion

4
Introduction to Condor

What is Condor?
A specialized workload management system for compute-intensive workloads
A cluster environment that uses idle CPU power to deliver high throughput
Extremely useful for running time-consuming workloads

5
Introduction to Condor (continued)

Usage
Submit a job description file using a Condor command (condor_submit)
Example: submitting a Standard Universe job
Executable = program1
Arguments = 10 100
Universe = standard
Log = program1.log
Output = program1.out
Error = program1.err
Input = program1.in
Queue 1
Results are written to the user-specified output file
User receives notification upon completion
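A monitor that launches several identical copies of a job needs one submit description per copy, each with its own log and output files. The sketch below (Python, used here only for illustration) renders the description from this slide as text. The function name `render_submit_description` and the `copy_tag` parameter are hypothetical and are not part of Condor.

```python
def render_submit_description(executable, arguments, copy_tag=""):
    """Return submit description text for one copy of a Standard Universe job.

    copy_tag (hypothetical) is appended to the log/output/error file names so
    that two copies of the same program do not overwrite each other's files.
    """
    base = f"{executable}{copy_tag}"
    lines = [
        f"Executable = {executable}",
        f"Arguments = {arguments}",
        "Universe = standard",
        f"Log = {base}.log",
        f"Output = {base}.out",
        f"Error = {base}.err",
        f"Input = {executable}.in",  # every copy reads the same input file
        "Queue 1",
    ]
    return "\n".join(lines) + "\n"


description = render_submit_description("program1", "10 100", copy_tag=".copy1")
print(description)
```

Writing this text to a file and passing that file to `condor_submit` would queue one copy; a second call with a different `copy_tag` would describe the second copy.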

6
Fault tolerance in Condor

Checkpointing
Supports standalone and periodic checkpointing
Allows job migration during system downtime

7
Fault tolerance in Condor (continued)

Limitations
Not supported on all cluster types
Requires program recompilation
Depends on the nature of the job
Example:
1) Record a checkpoint image.
2) Read data from a file.
3) Write data to the same file.
4) Execution fails, so roll back to step 2, which now reads the modified file.
Transient errors are not handled
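The read-then-write example can be reproduced in a few lines. This is a minimal sketch in Python, assuming a checkpoint that captures only in-memory state and not file contents; the names `run_demo` and `read_then_write` are hypothetical, and a real Condor checkpoint image is far more involved.

```python
import copy
import os
import tempfile


def run_demo():
    """Show that restoring in-memory state does not undo a file write."""
    path = os.path.join(tempfile.mkdtemp(), "data.txt")
    with open(path, "w") as f:
        f.write("1")

    state = {"total": 0}
    checkpoint = copy.deepcopy(state)      # step 1: record a checkpoint image

    def read_then_write(current):
        with open(path) as f:
            value = int(f.read())          # step 2: read data from the file
        current["total"] += value
        with open(path, "w") as f:
            f.write(str(value + 1))        # step 3: write to the same file
        return value

    first_read = read_then_write(state)

    # step 4: execution fails, so memory is restored from the checkpoint,
    # but the file on disk keeps the value written in step 3
    state = copy.deepcopy(checkpoint)
    second_read = read_then_write(state)
    return first_read, second_read


print(run_demo())
```

The re-executed step 2 sees a different value than the original run did, so the restarted job no longer computes the same result.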

8
Fault tolerance in Condor (continued)

Fault tolerance in grid computing: Condor-G
Fault tolerance in Stork applications
Data placement jobs are restarted upon error
Handled at the data transfer level

9
Fault tolerance in Condor (continued)

Fault tolerance for MPI interfaces
Proposals to implement coordinated checkpointing, application-level checkpointing, and user-transparent checkpointing
Summary
No general solution

11
Roll-Forward Technique

The traditional method is based on checkpointing
We introduce a software-based solution for Condor

12
Roll-Forward in Condor

Use the abundant computing power of the cluster
Start two identical copies of a job in Condor
Provide a mechanism to compare the computation results of the two copies
Flag any mismatch in results
Start a third copy of the job
Identify the faulty job
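The steps on this slide can be sketched as a small decision procedure. This is an illustrative Python sketch, not the authors' implementation: `roll_forward` and `run_copy` are hypothetical names, the copies run one after another here whereas Condor would run them on separate machines, and results are assumed to be directly comparable values.

```python
def roll_forward(run_copy):
    """Run two copies, compare, and use a third copy to isolate a fault.

    run_copy(name) is a caller-supplied function (hypothetical) that runs one
    copy of the job and returns its result.
    Returns (accepted_result, faulty_copy_name_or_None, copies_run).
    """
    result_1 = run_copy("copy-1")
    result_2 = run_copy("copy-2")
    if result_1 == result_2:
        return result_1, None, 2          # no mismatch flagged

    # Mismatch: start a third copy and accept the result it agrees with.
    result_3 = run_copy("copy-3")
    if result_3 == result_1:
        return result_1, "copy-2", 3
    if result_3 == result_2:
        return result_2, "copy-1", 3
    raise RuntimeError("no two copies agree; the fault cannot be isolated")


def demo_job(name):
    """Toy job: copy-2 returns a corrupted sum to emulate a faulty machine."""
    correct = sum(range(1000))
    return correct + 1 if name == "copy-2" else correct


result, faulty, copies_run = roll_forward(demo_job)
print(result, faulty, copies_run)
```

In the fault-free case only two copies run, which is where the resource saving relative to always running three copies comes from.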

13
Roll-Forward vs. TMR
14
Advantages

Simple
Software-based and lightweight
No hardware modification
Independent of the nature of the job
Independent of the Condor cluster type

16
Implementation Details

A C program, the monitor program, monitors the jobs
The monitor program submits identical jobs to Condor
Reads the results and checks them for equality
Submits a new job, the checker job, on mismatch
Eventually removes the faulty job
User gets the correct result
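The equality check is the step most easily shown in isolation. The slides describe the monitor as a C program; the sketch below uses Python purely for illustration and assumes each copy writes its result to its own output file. The helper name `outputs_match` and the file names are hypothetical.

```python
import filecmp
import os
import tempfile


def outputs_match(path_a, path_b):
    """Return True if the two output files have identical contents.

    shallow=False forces a byte-by-byte comparison instead of trusting
    file size and modification time alone.
    """
    return filecmp.cmp(path_a, path_b, shallow=False)


# Demo with three small output files standing in for job results.
workdir = tempfile.mkdtemp()
paths = {}
for name, content in [("copy1", "42\n"), ("copy2", "43\n"), ("checker", "42\n")]:
    paths[name] = os.path.join(workdir, name + ".out")
    with open(paths[name], "w") as f:
        f.write(content)

mismatch = not outputs_match(paths["copy1"], paths["copy2"])
copy1_agrees_with_checker = outputs_match(paths["copy1"], paths["checker"])
print(mismatch, copy1_agrees_with_checker)
```

Once the disagreeing copy is known, the monitor could remove it from the queue, for example with the `condor_rm` command.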

17
Algorithm
19
Simulation

Evaluated schemes
RF-method-A: tolerates a fault in a single machine
RF-method-B: tolerates faults in two machines
TMR: tolerates faults in two machines

20
Simulation

Any long-running program
For example, matrix multiplication on a large input data set
Static fault injection at different intervals of program execution
Collection of program execution statistics
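Static fault injection into a matrix multiplication might look like the following. This is a minimal Python sketch under stated assumptions: the fault is modeled as a single corrupted element in one output row, and the names `matmul_with_fault`, `fault_at_row`, and `first_mismatch` are hypothetical. The slides do not describe how the authors injected faults.

```python
def matmul_with_fault(a, b, fault_at_row=None):
    """Multiply two matrices given as lists of rows.

    If fault_at_row is set, one element of that output row is corrupted,
    emulating a fault injected at that point of the execution.
    """
    inner, cols = len(b), len(b[0])
    product = []
    for i, a_row in enumerate(a):
        row = [sum(a_row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        if fault_at_row == i:
            row[0] += 1                   # the injected fault
        product.append(row)
    return product


def first_mismatch(c1, c2):
    """Return the index of the first differing row, or None if equal."""
    for i, (r1, r2) in enumerate(zip(c1, c2)):
        if r1 != r2:
            return i
    return None


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
good = matmul_with_fault(a, b)
bad = matmul_with_fault(a, b, fault_at_row=1)
print(good, bad, first_mismatch(good, bad))
```

Varying `fault_at_row` corresponds to injecting the fault at different intervals of the run, which changes how early a mismatch between copies can be detected.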

21
Parameters of interest

Worst case job completion time
Average job completion time
Resource Overhead
Error detection/correction capability

23
RF-method-A

Additional 34% increase in average execution time
Worst-case execution time is twice the standalone execution time
Average resource overhead is 1.52 times that of standalone execution

24
RF-method-B

Additional 52% increase in average execution time
Worst-case execution time is twice the standalone execution time
Average resource overhead is 1.83 times that of standalone execution

25
TMR method

No increase in job completion time
High overhead of 1.6 times standalone execution

26
Resource Overhead Comparison
28
Extensions to Roll-Forward

Techniques to reduce resource overhead further
Speculative Job Removal
Pair-and-spare

29
Speculative Job Removal

Randomly remove one copy upon error detection
Start the checker job
Determine whether the speculation was right
The checker job runs to completion upon mis-speculation
The checker job's result is presented to the user
Terminate the checker job if the speculation was right
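The effect of speculation on completion time can be illustrated with a simple model. This Python sketch assumes, beyond what the slides state, that the checker job starts from the beginning at the moment the error is detected and that a surviving correct copy finishes at the standalone time. The function name and parameters are hypothetical, and the model is not claimed to reproduce the measured figures.

```python
import random


def speculative_removal(t_standalone, t_detect, faulty_copy, rng):
    """Model one run of speculative job removal.

    t_standalone: run time of the job on its own
    t_detect:     time at which the mismatch is detected
    faulty_copy:  "a" or "b", the copy that actually produced the error
    Returns (completion_time, speculation_was_right).
    """
    removed = rng.choice(["a", "b"])      # remove one copy at random
    speculation_right = removed == faulty_copy
    if speculation_right:
        # The surviving copy is correct; the checker is terminated early.
        completion = t_standalone
    else:
        # The correct copy was removed; the checker must run to completion.
        completion = t_detect + t_standalone
    return completion, speculation_right


rng = random.Random(0)
outcomes = [speculative_removal(100.0, 80.0, "a", rng) for _ in range(10)]
late = [speculative_removal(100.0, 100.0, "a", rng) for _ in range(10)]
print(outcomes)
```

Under this model a mis-speculation detected at the very end of the run costs one full extra execution, consistent with a worst case of twice the standalone time, while errors detected early add little latency.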

30
Mis-speculation Effects

Additional job completion latency
Reasonable performance for errors detected in the early stages of execution
Average increase in latency of 1.6 times
Worst-case job completion time is twice the standalone job completion time

31
Speculation Extremes

Average resource overhead is lower than in the earlier methods
Average resource overhead is 1.23 times that of standalone execution

32
Pair-and-Spare

One spare for multiple (N) jobs
The checker job for any program runs on the spare
Performance depends on the error probability of the Condor workstations
Performance impact
Increase in error correction latency

33
Pair-and-Spare contd.

An error probability of 0.001 per hour is assumed for a single machine
Inject errors into N programs (2N copies) with this probability at random instants
Repeat several times to collect statistics
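The injection experiment can be approximated with a short Monte Carlo simulation. This is a sketch in Python under assumptions the slides do not state: errors are drawn independently for each copy and each hour of execution, and the job length `hours` is a made-up parameter. The function name `simulate_pairs` is hypothetical.

```python
import random


def simulate_pairs(n_programs, hours, p_error_per_hour=0.001, rng=None):
    """Inject random errors into n_programs pairs (2 * n_programs copies).

    Returns (single_errors, double_errors): the number of pairs in which
    exactly one copy, or both copies, suffered at least one error.
    """
    rng = rng or random.Random()
    single = double = 0
    for _ in range(n_programs):
        copies_hit = 0
        for _copy in range(2):
            # A copy is hit if an error occurs in any hour of its run.
            if any(rng.random() < p_error_per_hour for _ in range(hours)):
                copies_hit += 1
        if copies_hit == 1:
            single += 1
        elif copies_hit == 2:
            double += 1
    return single, double


# Repeat the experiment several times to collect statistics.
rng = random.Random(1)
trials = [simulate_pairs(n_programs=4, hours=24, rng=rng) for _ in range(1000)]
total_single = sum(s for s, _ in trials)
total_double = sum(d for _, d in trials)
print(total_single, total_double)
```

A single error in a pair is what the shared spare must handle; a double error within one pair would leave no correct copy to roll forward from.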

34
Error Statistics

For N = 4, program execution time is nearly 2.56 times the standalone execution time
No double errors were found among the pairs

35
Conclusion

A more general fault-tolerant solution for Condor
A fault-tolerant method for Condor based on the roll-forward technique
Less resource overhead than the software TMR solution
Variants of the roll-forward technique are discussed
