Checkpointing Mechanism for the Grid Environment

Provided by: krit8

1
Checkpointing Mechanism for the Grid Environment
  • K Sajadah, G Terstyanszky, S Winter, P. Kacsuk
  • University of Westminster

2
The Grid Environment
  • Nature of Grid Environment
  • Generic, heterogeneous, and dynamic, with many
    unreliable resources that leave it exposed to
    failures.
  • Solution
  • Fault tolerant mechanisms should ensure
    successful execution of applications.

3
Fault Tolerant Solutions
  • Retrying
  • When a job fails, it is re-executed a certain
    number of times.
  • The expected job completion time becomes very
    large.
  • Replication
  • Replicas of a job are executed on different Grid
    resources simultaneously.
  • It requires extra processing power.
  • Checkpointing
  • It stores a snapshot of an application's state
    and uses it to restart the execution in case of
    failure.
  • It is very efficient in environments where the
    failure rate is high.

4
Checkpointing
  • Transparent Checkpointing
  • The mechanism, rather than the programmer,
    orchestrates the checkpointing process.
  • Message synchronisation is performed.
  • The checkpointing and recovery process is
    transparent to the programmer.
  • Non-Transparent Checkpointing
  • Mechanism provides support for checkpointing
    through run-time libraries.
  • Programmer can specify data that should be
    included in checkpoint file.
  • Approach is not transparent to the programmer.

5
Challenges in Checkpointing
  • When to take the checkpoint
  • How to synchronise (or how to minimise
    inter-process communication)
  • What information to store at the checkpoint
  • Where to store the checkpoint information
  • How to restore the execution after a fault

6
Checkpointing (2)
  • Performance constraints in existing solutions
  • Overheads due to synchronisation of messages.
  • Checkpoint intervals are either user-defined with
    no regular pattern or are periodic.
  • Proposed solution
  • Take checkpoints at the best possible
    pre-defined intervals.
  • Minimise (or optimise) inter-process
    communication as much as possible.

7
Checkpointing (3)
  • Inter-process communications can cause
    inconsistent checkpoints due to lost messages or
    orphan messages.
  • To achieve a globally consistent checkpoint,
    synchronization must be performed.
  • Synchronization introduces extra communications
    among processes.
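
The two failure modes named on this slide can be illustrated with a small sketch. It assumes, purely for illustration, that every message carries send and receive timestamps on a shared clock and that each process takes one checkpoint at a known time; the `Message` type and `classify` function are hypothetical names, not part of the presented mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    sent_at: float      # time the sender issued the message
    received_at: float  # time the receiver consumed the message


def classify(messages, checkpoint_time):
    """Return (orphans, lost) for one checkpoint per process.

    checkpoint_time maps a process name to the time of its checkpoint.
    An orphan message has its receipt recorded but not its send.
    A lost message has its send recorded but not its receipt.
    """
    orphans, lost = [], []
    for m in messages:
        send_recorded = m.sent_at <= checkpoint_time[m.sender]
        receipt_recorded = m.received_at <= checkpoint_time[m.receiver]
        if receipt_recorded and not send_recorded:
            orphans.append(m)
        elif send_recorded and not receipt_recorded:
            lost.append(m)
    return orphans, lost


# Hypothetical trace: P1 checkpoints at t=3, P2 at t=5.
messages = [
    Message("P1", "P2", 1.0, 2.0),  # sent and received before both checkpoints
    Message("P1", "P2", 4.0, 4.5),  # sent after P1's checkpoint, received before P2's
    Message("P2", "P1", 2.5, 6.0),  # sent before P2's checkpoint, received after P1's
]
orphans, lost = classify(messages, {"P1": 3.0, "P2": 5.0})
```

A set of checkpoints for which both lists are empty needs no message logging or waiting, which is the situation the synchronization step is meant to guarantee.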

8
Approaches Used
  • Combination of
  • First Order Approximation.
  • Natural Synchronisation Points.
  • First Order Approximation
  • Calculate the optimal checkpointing intervals.
  • Based on the Poisson process.
  • Occurrence of failure is random with failure rate
    λ.

9
First Order Approximation
  • The optimal checkpoint interval $T_c$ is
  • $T_c = \sqrt{2 T_s T_f}$, where
  • $T_s$ is the time required to save information at a
    checkpoint.
  • $T_f$ is the mean time between failures, with
    $T_f = T_h / \lambda_k$.
  • The following data are needed
  • The number of hours the program will run on the
    machines ($T_h$).
  • The known failure rate during that time
    ($\lambda_k$).
  • The time required to save information at a
    checkpoint ($T_s$).
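
As a worked illustration of the formula on this slide, the sketch below computes the mean time between failures and the optimal interval. The function names and the sample numbers are assumptions chosen for the example; the failure figure is treated as the number of failures expected over the run, so that dividing the run time by it gives a mean time between failures. All times must be in the same unit.

```python
import math


def mean_time_between_failures(run_hours, failures_in_run):
    """Tf = Th / lambda_k, with both arguments covering the same period."""
    if failures_in_run <= 0:
        raise ValueError("failures_in_run must be positive")
    return run_hours / failures_in_run


def optimal_checkpoint_interval(save_time, mtbf):
    """First order approximation: Tc = sqrt(2 * Ts * Tf)."""
    if save_time < 0 or mtbf <= 0:
        raise ValueError("save_time must be >= 0 and mtbf must be > 0")
    return math.sqrt(2.0 * save_time * mtbf)


# Hypothetical inputs: a 100-hour run, 4 failures over that run,
# and 0.02 hours (72 seconds) to save one checkpoint.
tf = mean_time_between_failures(100.0, 4.0)
tc = optimal_checkpoint_interval(0.02, tf)
print(f"Tf = {tf:.2f} h, Tc = {tc:.2f} h")
```

With these made-up inputs the mean time between failures is 25 hours and the interval comes out at roughly one hour; a cheaper checkpoint or a less reliable resource both shorten it.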

10
First Order Approximation (2)
[Figure legend: Tc = checkpoint interval; Ts = time to save a checkpoint; tr = rerun time of a failed application]
11
First Order Approximation (3)
  • Using the PROVE toolset, we can measure both the
    execution time and the checkpointing time of an
    application.
  • Nagios can be used to determine the failure rate
    of Grid resources.

12
Natural Synchronisation Points
  • Examples of natural synchronization points
  • Barriers.
  • Top or bottom of a main loop.
  • Collective operations (broadcast, gather,
    scatter, etc.)
  • No inter-process communication takes place at
    these points.
  • Therefore, there is no need to be concerned with
    the state of the communication channels or
    possible in-transit messages.
  • This eliminates the overhead incurred by the
    synchronization process during checkpointing.

13
Natural Synchronisation Points (2)
[Figure: Application execution with interacting processes]
[Figure: Coordinated checkpoint, waiting for in-transit messages]
14
Natural Synchronisation Points (4)
[Figure: Coordinated checkpoint, logging in-transit messages]
[Figure: Checkpointing at natural synchronisation points]
15
New Checkpointing Approach
  • Using First Order Approximation only
  • Involves synchronisation of messages and
    capturing in-transit messages.
  • Checkpointing at natural synchronisation points
    only
  • May not be very effective because these points
    occur with no regular pattern.

16
New Checkpointing Approach (2)
  • Use a combination of both the Natural
    Synchronisation Points and the First Order
    Approximation.
  • Take checkpoints at natural synchronization
    points which are closest to the optimal
    checkpoint intervals.
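
The selection rule on this slide can be sketched offline, assuming the times of the natural synchronisation points are known in advance (in practice they are discovered at run time). The function name `plan_checkpoints` and the sample times are hypothetical.

```python
def plan_checkpoints(sync_times, optimal_interval, end_time):
    """Pick natural synchronisation points closest to each optimal interval.

    The next target is always one optimal interval after the checkpoint
    chosen last, so drift from earlier choices does not accumulate.
    """
    if optimal_interval <= 0:
        raise ValueError("optimal_interval must be positive")
    chosen = []
    last = 0.0
    while last + optimal_interval <= end_time:
        target = last + optimal_interval
        # Only points after the previous checkpoint are candidates.
        candidates = [t for t in sync_times if t > last]
        if not candidates:
            break
        nearest = min(candidates, key=lambda t: abs(t - target))
        chosen.append(nearest)
        last = nearest
    return chosen


# Hypothetical synchronisation points (minutes) and an 8-minute interval.
plan = plan_checkpoints([3, 7, 10, 15, 18, 24, 26], 8, 30)
```

Here the targets fall at 8, 15 and 23 minutes, and the nearest synchronisation points are 7, 15 and 24.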

17
Choosing Checkpoint Intervals
[Figure: Choosing appropriate checkpointing intervals]
18
Choosing Checkpoint Intervals (2)
  • The decision to take a checkpoint is based on
  • Optimal checkpoint interval,
  • Natural synchronisation points and
  • Critical Region.
  • The checkpointing process is triggered by signals
    sent to the coordinated process whenever
    synchronization points are encountered.

19
The Checkpointing Process
  • When the coordinated process receives a signal,
    it checks whether the signal falls within the
    critical region.
  • If so, a checkpoint is taken and the clock is
    reset.
  • If not, no checkpointing is performed.
  • If no natural synchronization points are met
    within the critical region, a checkpoint is
    forced at the end of the critical region.
  • In such cases, the checkpointing mechanism
    performs synchronization to ensure there are no
    lost or orphan messages.
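
The decision procedure above can be sketched as a small state machine. This is a simplified illustration, not the presented implementation: the class name, the method names, and the choice to model the critical region as a symmetric window around the optimal interval are all assumptions, and forced checkpoints are detected lazily when the next signal arrives rather than by a timer.

```python
class CheckpointCoordinator:
    """Decides whether a synchronisation signal triggers a checkpoint."""

    def __init__(self, optimal_interval, region_half_width):
        if optimal_interval <= 0 or region_half_width < 0:
            raise ValueError("interval must be > 0 and half width >= 0")
        self.optimal_interval = optimal_interval
        self.region_half_width = region_half_width
        self.last_checkpoint = 0.0
        self.log = []  # (time, "natural" or "forced")

    def critical_region(self):
        """Assumed window around the next optimal checkpoint time."""
        centre = self.last_checkpoint + self.optimal_interval
        return centre - self.region_half_width, centre + self.region_half_width

    def _take(self, time, kind):
        self.log.append((time, kind))
        self.last_checkpoint = time  # reset the clock

    def advance(self, now):
        """Force a checkpoint for every critical region that ended before now."""
        while True:
            _, end = self.critical_region()
            if now <= end:
                break
            # A real forced checkpoint would also synchronise messages here.
            self._take(end, "forced")

    def on_sync_signal(self, now):
        """Handle a natural synchronisation point; return True if checkpointed."""
        self.advance(now)
        start, end = self.critical_region()
        if start <= now <= end:
            self._take(now, "natural")
            return True
        return False


# Hypothetical run: 8-minute optimal interval, window of 2 minutes either side.
coordinator = CheckpointCoordinator(8, 2)
results = [coordinator.on_sync_signal(t) for t in [3, 7, 12, 16, 30]]
```

In this run the signals at 7 and 16 minutes fall inside the window and trigger checkpoints, while the gap before the signal at 30 minutes causes one forced checkpoint at 26 minutes.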

20
The Testbed
  • The MadCity traffic simulation tool was used.
  • Simulates traffic on a road network and shows how
    individual vehicles behave on roads and at
    junctions.
  • MadCity traffic simulator can be parallelised
    using PGRADE.

21
The Testbed (2)
[Figure: Proposed checkpointing solution]
22
The Testbed (3)
  • Through the First Order Approximation, the
    calculated optimal checkpoint interval was 8
    minutes.
  • A critical region within 2 minutes of the
    optimal checkpoint interval was defined.
  • Checkpoints were taken at Ns1, Ns2, Ns5, Fs1,
    Ns6, and Ns9.
  • Overall average time between checkpoints: 8.2
    minutes.

23
Conclusion
  • The proposed checkpointing mechanism provides a
    better and more efficient way to save checkpoint
    images.
  • It minimises the need for message
    synchronisation.
  • It keeps the average checkpointing interval
    close to the optimal checkpointing interval
    defined by the First Order Approximation.

24
Future Work
  • Integrate the checkpointing solution in PGRADE to
    provide an efficient fault tolerant solution to
    applications executed as Grid workflows.
  • Provide an efficient and reliable storage
    mechanism.

25
Questions