RollForward Techniques for Fault Detection and Correction in Condor - PowerPoint PPT Presentation

1
Roll-Forward Techniques for Fault Detection and
Correction in Condor
ECE 753 PROJECT
Vaishali Karanth Janaki Jillella
2
Overview
  • Brief introduction to Condor
  • Roll-Forward Technique
  • Implementation Details
  • Simulation
  • Performance and Overhead Evaluation
  • Extensions to Roll-Forward
  • Conclusion

3
Overview
  • Brief introduction to Condor
  • Roll-Forward Technique
  • Implementation Details
  • Simulation
  • Performance and Overhead Evaluation
  • Extensions to Roll-Forward
  • Conclusion

4
Introduction to Condor
  • What is Condor?
  • A specialized workload management system for
    compute-intensive workloads
  • A cluster environment that uses idle CPU
    power to deliver high throughput
  • Extremely useful for running time-consuming
    workloads

5
Continued
  • Usage
  • Submit a job description file using a Condor command (condor_submit)
  • Example: submitting a Standard Universe job
  • Executable = program1
  • Arguments = 10 100
  • Universe = standard
  • Log = program1.log
  • Output = program1.out
  • Error = program1.err
  • Input = program1.in
  • Queue 1
  • Result is written to the user-specified output file
  • User receives notification upon completion
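
The submit description shown on this slide can also be generated from a program, which is convenient when several identical copies of a job have to be submitted. The following is a minimal sketch, not the authors' code: the helper name `make_submit_description` is hypothetical, and the file names simply follow the `program1` pattern used on the slide.

```python
def make_submit_description(executable, arguments, universe="standard", copies=1):
    """Return the text of a Condor submit description file.

    Hypothetical helper for illustration; the keys mirror the slide.
    """
    lines = [
        f"executable = {executable}",
        f"arguments = {arguments}",
        f"universe = {universe}",
        f"log = {executable}.log",
        f"output = {executable}.out",
        f"error = {executable}.err",
        f"input = {executable}.in",
        f"queue {copies}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_submit_description("program1", "10 100"), end="")
```

Writing this text to a file and passing that file to `condor_submit` would submit the job.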

6
Fault tolerance in Condor
  • Checkpointing
  • Supports Standalone, Periodic Checkpointing
  • Allows job migration during system down time

7
Continued..
  • Limitations
  • Not supported on all cluster types
  • Requires program recompilation
  • Depends on nature of the job
  • Example:
  • 1) Record a checkpoint image.
  • 2) Read data from a file.
  • 3) Write data to the same file.
  • 4) Execution fails, so roll back to step 2, which now reads the data written in step 3.
  • Transient errors not handled
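
The file example above can be reproduced in a few lines. This is a minimal sketch under stated assumptions: a dictionary stands in for the file, the "computation" is a hypothetical increment, and the checkpoint is assumed to capture process state only, not file contents.

```python
def run_with_rollback(storage, fail_once):
    """Run steps 2-4 of the slide's example against a dict standing in for the file.

    The checkpoint from step 1 is assumed to hold process state only,
    so rolling back does not undo the write made in step 3.
    """
    failed_already = False
    while True:
        value = storage["data"]        # step 2: read data from the file
        storage["data"] = value + 1    # step 3: write data to the same file
        if fail_once and not failed_already:
            failed_already = True      # step 4: execution failure
            continue                   # roll back to step 2
        return storage["data"]


if __name__ == "__main__":
    print(run_with_rollback({"data": 10}, fail_once=False))  # fault-free run
    print(run_with_rollback({"data": 10}, fail_once=True))   # run with one rollback
```

The fault-free run and the rolled-back run end with different file contents, because the second pass through step 2 reads what step 3 already wrote.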

8
Continued
  • Fault tolerance in Grid Computing: Condor-G
  • Fault tolerance in Stork applications
  • Data placement jobs are restarted upon error
  • Operates at the data transfer level

9
Contd..
  • Fault tolerance for MPI interfaces
  • Proposals to implement coordinated checkpointing,
    application-level and user-transparent
    checkpointing
  • Summary
  • No general solution

10
Overview
  • Brief introduction to Condor
  • Roll-Forward Technique
  • Implementation Details
  • Simulation
  • Performance and Overhead Evaluation
  • Extensions to Roll-Forward
  • Conclusion

11
Roll-Forward Technique
  • Traditional method based on checkpointing
  • We introduce a software-based solution for Condor

12
Roll-Forward in Condor
  • Use abundant Computing Power
  • Start two identical copies of a job in Condor
  • Mechanism to compare computation results of two
    copies
  • Flag any mismatch in results
  • Start a third copy of the job
  • Identify the faulty job
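
The steps listed on this slide can be sketched as a small function. This is an illustration only: `run_copy` is a hypothetical callable that starts copy number `i` of the job and returns its result, and a single faulty copy is assumed.

```python
def roll_forward(run_copy):
    """Return (result, faulty_copy) following the duplicate-and-compare scheme.

    run_copy(i) is a hypothetical callable that runs copy i and returns its result.
    faulty_copy is None when the first two copies agree.
    """
    first, second = run_copy(0), run_copy(1)   # two identical copies
    if first == second:
        return first, None                     # no mismatch, no third copy needed
    third = run_copy(2)                        # mismatch flagged: start a third copy
    if third == first:
        return first, 1                        # copy 1 disagreed with the other two
    if third == second:
        return second, 0                       # copy 0 disagreed with the other two
    raise RuntimeError("no two copies agree; more than one fault assumed")


if __name__ == "__main__":
    results = {0: 7, 1: 9, 2: 7}               # copy 1 is faulty in this example
    print(roll_forward(results.get))
```

Unlike TMR, the third copy is started only after a mismatch, so a fault-free run uses two copies instead of three.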

13
Roll-Forward vs. TMR
14
Advantages
  • Simple
  • Software-based, lightweight
  • No hardware modification
  • Independent of nature of the job
  • Condor cluster independent

15
Overview
  • Brief introduction to Condor
  • Roll-Forward Technique
  • Implementation Details
  • Simulation
  • Performance and Overhead Evaluation
  • Extensions to Roll-Forward
  • Conclusion

16
Implementation Details
  • C program to monitor jobs (the monitor program)
  • Monitor program submits identical jobs to Condor
  • Reads the results and checks them for equality
  • Submits a new job on mismatch (the checker job)
  • Eventually removes the faulty job
  • User gets the correct result
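
The comparison step of the monitor program can be sketched as follows. The presentation's monitor is written in C; this Python version is only an illustration of the logic. The function name `pick_correct_output` and the `run_checker` callable (which is assumed to run the checker job and return the path of its output file) are hypothetical, and job submission and removal are left out.

```python
import filecmp


def pick_correct_output(out_a, out_b, run_checker):
    """Return (good_path, faulty_path) for two output files of identical jobs.

    run_checker is a hypothetical callable that runs the checker job and
    returns the path of its output file. faulty_path is None if both agree.
    """
    # shallow=False forces a byte-by-byte comparison of the file contents.
    if filecmp.cmp(out_a, out_b, shallow=False):
        return out_a, None
    out_c = run_checker()                      # mismatch: run the checker job
    if filecmp.cmp(out_a, out_c, shallow=False):
        return out_a, out_b
    if filecmp.cmp(out_b, out_c, shallow=False):
        return out_b, out_a
    raise RuntimeError("checker output matches neither copy")
```

The monitor would then remove the job that produced `faulty_path` and hand `good_path` to the user.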

17
Algorithm
18
Overview
  • Brief introduction to Condor
  • Roll-Forward Technique
  • Implementation Details
  • Simulation
  • Performance and Overhead Evaluation
  • Extensions to Roll-Forward
  • Conclusion

19
Simulation
  • Evaluated Schemes
  • RF-method-A: tolerates a fault in a single machine
  • RF-method-B: tolerates faults in two machines
  • TMR: tolerates faults in two machines

20
Simulation
  • Any long running program
  • For example, matrix multiplication on a large
    input data set
  • Static fault injection at different intervals of
    program execution
  • Program execution statistics assimilation
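
Static fault injection into a matrix multiplication can be illustrated as below. This is a sketch, not the simulation used in the project: the `fault_at_row` parameter is a hypothetical stand-in for "the interval of execution at which the fault is injected", and the fault itself is modeled as a single corrupted output element.

```python
def matmul(a, b, fault_at_row=None):
    """Multiply two matrices given as lists of rows.

    If fault_at_row is set, the first element of that output row is corrupted,
    modeling a fault injected at that point of the computation.
    """
    columns = list(zip(*b))
    result = []
    for i, row in enumerate(a):
        out_row = [sum(x * y for x, y in zip(row, col)) for col in columns]
        if i == fault_at_row:
            out_row[0] += 1                    # injected fault
        result.append(out_row)
    return result


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul(a, b) == matmul(a, b, fault_at_row=1))  # copies disagree
```

Running one clean copy and one copy with an injected fault gives the mismatch that the monitor has to detect.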

21
Parameters of interest
  • Worst case job completion time
  • Average job completion time
  • Resource Overhead
  • Error detection/correction capability

22
Overview
  • Brief introduction to Condor
  • Roll-Forward Technique
  • Implementation Details
  • Simulation
  • Performance and Overhead Evaluation
  • Extensions to Roll-Forward
  • Conclusion

23
RF-method-A
  • Additional 34% increase in average
    execution time
  • Worst case execution time is twice the
    standalone execution time
  • Average resource overhead 1.52 times
    standalone execution

24
RF-method-B
  • Additional 52% increase in average
    execution time
  • Worst case execution time is twice the
    standalone execution time
  • Average resource overhead 1.83 times
    standalone execution

25
TMR method
  • No increase in job completion time
  • High overhead of 1.6 times standalone
    execution

26
Resource Overhead Comparison
27
Overview
  • Brief introduction to Condor
  • Roll-Forward Technique
  • Implementation Details
  • Simulation
  • Performance and Overhead Evaluation
  • Extensions to Roll-Forward
  • Conclusion

28
Extensions to Roll-Forward
  • Techniques to reduce resource overhead further
  • Speculative Job Removal
  • Pair-and-spare

29
Speculative Job Removal
  • Randomly remove one copy upon error detection
  • Start checker job
  • Determine if speculation was right
  • Checker job runs to completion upon
    mis-speculation
  • Checker job result is presented to the user
  • Terminate checker job if speculation was right
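
One possible reading of these steps is sketched below. The slides do not say how the monitor decides whether the speculation was right, so this sketch assumes that each job exposes a partial result at the point where the error was detected and that the checker job is fault-free. The dictionary keys `partial` and `final` and the function name `speculate` are hypothetical.

```python
import random


def speculate(copies, checker, rng=random):
    """Return (final_result, speculation_was_right).

    copies: two dicts with a 'partial' result (at the detection point) and a
    'final' result. checker: a dict of the same shape, assumed fault-free.
    """
    survivor = rng.choice(copies)              # randomly remove the other copy
    if checker["partial"] == survivor["partial"]:
        # Speculation was right: the checker can be terminated early.
        return survivor["final"], True
    # Mis-speculation: the checker runs to completion and supplies the result.
    return checker["final"], False


if __name__ == "__main__":
    good = {"partial": 5, "final": 10}
    bad = {"partial": 6, "final": 11}
    print(speculate([good, bad], checker=good))
```

Under these assumptions the user receives the correct result either way; a wrong guess only adds the latency of letting the checker job finish.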

30
Mis-speculation Effects
  • Additional job completion latency
  • Reasonable performance for errors detected in
    early stage of execution
  • Average increase in latency of 1.6 times
  • Worst case job completion time is twice the
    standalone job completion time

31
Speculation Extremes
  • Average resource overhead is less than earlier
    methods
  • Average resource overhead 1.23 times the
    standalone execution

32
Pair-and-Spare
  • One spare for multiple (N) jobs
  • Checker job for any program runs on spare
  • Performance depends on error probability of
    Condor workstations
  • Performance Impact
  • Increase in error correction latency

33
Pair-and-Spare contd.
  • Error probability of a single machine assumed to be
    0.001 per hour
  • Inject errors into N programs (2N copies) with this
    probability at random instants
  • Repeat several times to collect statistics
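
The experiment described above can be approximated with a small Monte Carlo sketch. Only the 0.001 per hour error probability comes from the slide; the job length in `hours`, the number of `trials`, the seed, and the assumption that machines fail independently are all choices made for this illustration.

```python
import random


def simulate_pairs(n_programs, hours, p_per_hour=0.001, trials=1000, seed=0):
    """Estimate the fraction of pairs with exactly one and with two faulty copies.

    Assumes each machine fails independently with probability p_per_hour in
    every hour of a job that runs for the given number of hours.
    """
    rng = random.Random(seed)
    p_copy = 1 - (1 - p_per_hour) ** hours     # P(copy is hit at least once)
    single = double = 0
    for _ in range(trials):
        for _ in range(n_programs):            # each program runs as a pair
            hits = sum(rng.random() < p_copy for _ in range(2))
            if hits == 1:
                single += 1                    # correctable by the shared spare
            elif hits == 2:
                double += 1                    # double error within a pair
    total = trials * n_programs
    return single / total, double / total


if __name__ == "__main__":
    print(simulate_pairs(n_programs=4, hours=10))
```

The first returned value indicates how often the shared spare is needed, and the second how often a pair suffers a double error.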

34
Error Statistics
  • For N = 4, program execution time is nearly 2.56
    times the standalone execution time
  • No double errors among the pairs found

35
Conclusion
  • More general fault tolerant solution for Condor
  • Roll-forward technique based fault tolerant
    method for Condor
  • Less resource overhead compared to software TMR
    solution
  • Variants of roll-forward techniques are discussed

36
References
  • Pradhan et al. "Roll-Forward Checkpointing
    Scheme: A Novel Fault-Tolerant Architecture"
  • http://www.ibmdatabasemag.com
  • Gyung-Leen et al. "A New Approach for High
    Performance Computing with Various Checkpointing
    Schemes"
  • Xu et al. "Roll-Forward Error Recovery in
    Embedded Real-Time Systems"
  • Condor Team. "Condor Version 7.2.1 Manual."
    University of Wisconsin-Madison.
    http://www.cs.wisc.edu/condor/
  • Voas J. "Fault Injection for the Masses"
  • Israel Koren and C. Mani Krishna. Fault-Tolerant
    Systems. Elsevier/Morgan Kaufmann, 2007.
  • Thain et al. "Distributed Computing in Practice:
    The Condor Experience"
  • Wakerly et al. "Microcomputer Reliability
    Improvement Using Triple-Modular Redundancy"

37
???
38
Thank You!