Title: FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI
- Gengbin Zheng
- Lixia Shi
- Laxmikant V. Kale
- Parallel Programming Lab
- University of Illinois at Urbana-Champaign
2 Motivation
- As machines grow in size
  - MTBF decreases
  - Applications have to tolerate faults
- Applications need fast, low-cost, and scalable fault tolerance support
- Fault tolerant runtime for:
  - Charm++ (parallel C++ language and runtime)
  - Adaptive MPI
3 Requirements
- Low impact on fault-free execution
- Provides fast and automatic restart capability
- Does not rely on extra (spare) processors
- Maintains execution efficiency after restart
- Does not rely on any fault-free component
- Does not assume stable storage
4 Background
- Checkpoint-based methods
  - Coordinated: blocking [Tamir84], non-blocking [Chandy85]
    - Co-Check, Starfish, Clip: fault-tolerant MPI implementations
  - Uncoordinated: suffers from rollback propagation
  - Communication-induced [Briatico84]: does not scale well
- Log-based methods
  - Message logging
5 Design Overview
- Coordinated checkpointing scheme
  - Simple, low overhead on fault-free execution
  - Targets iterative scientific applications
- Double checkpointing (see the sketch after this list)
  - Tolerates one failure at a time
- In-memory checkpointing
  - Diskless checkpointing
  - Efficient for applications with a small memory footprint
- When there are no extra processors, the program continues to run on the remaining processors
- Load balancing after restart
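As a concrete illustration, the sketch below shows how an iterative application might trigger the scheme every few steps. CkStartMemCheckpoint is the Charm++ call for double in-memory checkpointing; the Main chare, its method names, and the iteration counts here are hypothetical, and the entry methods would need matching declarations in the .ci interface file.

    // Minimal sketch (not the paper's code): a mainchare that checkpoints
    // every 10 steps. Assumes a Charm++ build with in-memory checkpoint
    // support and a .ci file declaring iterate() and resume() entry methods.
    #include "main.decl.h"

    class Main : public CBase_Main {
      int step;
    public:
      Main(CkArgMsg *m) : step(0) { delete m; iterate(); }
      void iterate() {
        if (step % 10 == 0) {
          // Checkpoint all objects into the memory of two buddy
          // processors, then resume through the callback.
          CkCallback cb(CkIndex_Main::resume(), thisProxy);
          CkStartMemCheckpoint(cb);
        } else {
          resume();
        }
      }
      void resume() {
        // ... one step of the actual computation would go here ...
        if (++step < 100) iterate();
        else CkExit();
      }
    };

    #include "main.def.h"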
6 Charm++ Processor Virtualization
[Figure: user view of many virtual processors mapped onto physical processors]
- Charm++
  - Parallel C++ with data-driven objects (chares)
  - Chares are migratable
  - Asynchronous method invocation
- Adaptive MPI (AMPI)
  - Implemented on Charm++ with migratable threads
  - Multiple virtual processors on a physical processor (launch example below)
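The practical payoff is that the degree of parallelism is decoupled from the machine size. For illustration (the program name and counts here are made up), an AMPI job can run many MPI ranks as migratable user-level threads on a few physical processors:

    ./charmrun +p8 ./jacobi3d +vp64   # 64 virtual MPI ranks on 8 processors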
7 Benefits of Virtualization
- Latency tolerance
  - Adaptive overlap of communication and computation
- Supports migration of virtual processors for load balancing
- Checkpoint data
  - Objects (instead of a process image)
  - Checkpoint = migrate object to another processor (sketched below)
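Because checkpointing reuses the same serialization machinery as migration, an object only has to describe its state once. Below is a minimal sketch of a migratable array element using Charm++'s PUP (pack/unpack) framework; the Block class and its fields are hypothetical.

    // Sketch: one pup() method serves migration and checkpointing alike.
    class Block : public CBase_Block {
      int n;          // number of local grid points
      double *data;   // local portion of the grid
    public:
      Block() : n(0), data(nullptr) {}
      Block(CkMigrateMessage *m) : CBase_Block(m) {}  // migrate/restore ctor
      ~Block() { delete[] data; }
      void pup(PUP::er &p) {
        CBase_Block::pup(p);                    // pup the superclass first
        p | n;                                  // scalar state
        if (p.isUnpacking()) data = new double[n];
        p(data, n);                             // raw array state
      }
    };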
8 Checkpoint Protocol
- Adopts a coordinated checkpointing strategy
- The Charm++ runtime provides the checkpointing functionality
  - Programmers can decide what to checkpoint
- Each object packs its data and sends it to two different (buddy) processors (one possible mapping is sketched below)
- Charm++ runtime data is also checkpointed
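The slides do not fix a particular buddy assignment, so the sketch below shows one plausible mapping: each processor keeps one copy locally and places the second on its neighbor, so a single failure never destroys both copies.

    // Illustrative buddy mapping (not necessarily the runtime's policy).
    #include <utility>

    std::pair<int, int> checkpointBuddies(int pe, int numPes) {
      int first  = pe;                 // one copy stays local
      int second = (pe + 1) % numPes;  // second copy on the next processor
      return {first, second};          // distinct whenever numPes > 1
    }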
9 Restart Protocol
- Initiated by the failure of a physical processor
- Every object rolls back to the state preserved in the most recent checkpoint (the per-object decision is sketched below)
- Combined with the load balancer to sustain performance
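A sketch of the per-object decision, with a hypothetical Checkpoint type and helper functions standing in for what the runtime does internally:

    // Illustrative only: restore logic after a crash of `crashedPe`.
    struct Checkpoint { /* serialized object state */ };

    void recreateFrom(const Checkpoint &c);  // rebuild object on a survivor
    void rollBackTo(const Checkpoint &c);    // restore an existing object

    void restoreObject(int homePe, int crashedPe,
                       const Checkpoint &localCopy,
                       const Checkpoint &buddyCopy) {
      if (homePe == crashedPe) {
        // Home processor died: re-create the object from its buddy's copy.
        recreateFrom(buddyCopy);
      } else {
        // Processor survived, but all objects roll back together to the
        // last coordinated checkpoint so the global state stays consistent.
        rollBackTo(localCopy);
      }
    }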
10 Checkpoint/Restart Protocol
[Figure: objects A-J and their two checkpoint copies distributed across PE0-PE3; after PE1 crashes (one processor lost), its objects are restored on the remaining processors from the buddy checkpoints. Legend: checkpoint 1, checkpoint 2, object, restored object.]
11 Local Disk-Based Protocol
- Double in-memory checkpointing
  - Memory concern
  - Pick a checkpoint time when the global state is small
- Double in-disk checkpointing (a write sketch follows this list)
  - Makes use of the local disk
  - Also does not rely on any reliable storage
  - Useful for applications with a very large memory footprint
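A sketch of how the disk-based variant might write one checkpoint copy; the path scheme and function are hypothetical, but the point is that only node-local disks are used, so no reliable shared storage is assumed. Alternating phases keep the previous checkpoint intact until the new one is fully written.

    #include <cstddef>
    #include <cstdio>
    #include <string>

    // Write one checkpoint copy to this node's local scratch disk.
    bool writeLocalCheckpoint(int pe, int phase,
                              const char *buf, std::size_t len) {
      // e.g. /tmp/ckpt.<pe>.<0|1>; two slots alternate so one complete
      // checkpoint always survives a crash mid-write.
      std::string path = "/tmp/ckpt." + std::to_string(pe) + "."
                       + std::to_string(phase % 2);
      std::FILE *f = std::fopen(path.c_str(), "wb");
      if (!f) return false;
      std::size_t written = std::fwrite(buf, 1, len, f);
      std::fclose(f);
      return written == len;
    }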
12 Performance Evaluation
- IA-32 Linux cluster at NCSA
- 512 dual 1 GHz Intel Pentium III processors
- 1.5 GB RAM each
- Connected by both Myrinet and 100 Mbit Ethernet
13 Checkpoint Overhead Evaluation
- Jacobi3D (MPI)
- Up to 128 processors
- Myrinet vs. 100 Mbit Ethernet
14 Single Checkpoint Overhead
- AMPI Jacobi3D
- Problem size: 200 MB
- 128 processors
15 Comparisons of Program Execution Time
- Jacobi (200 MB data size) on up to 128 processors
- 8 checkpoints in 100 steps
16 Performance Comparisons with Traditional Disk-Based Checkpointing
17 Recovery Performance
- Molecular dynamics simulation application: LeanMD
- Apoa1 benchmark (92K atoms)
- 128 processors
- Crash simulated by killing processes
- No backup processors
- With load balancing
18 Performance Improvement with Load Balancing
LeanMD, Apoa1, 128 processors
19 Recovery Performance
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps
20 Recovery Performance (continued)
- LeanMD with Apoa1 benchmark
- 90K atoms
- 8498 objects
21 Future Work
- Use our scheme on extremely large parallel machines
- Reduce memory usage of the protocol
- Message logging
- Paper appeared in IPDPS 2004