Title: FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI
- Gengbin Zheng
- Lixia Shi
- Laxmikant V. Kale
- Parallel Programming Lab
- University of Illinois at Urbana-Champaign
2 Motivation
- As machines grow in size
  - MTBF decreases
  - Applications have to tolerate faults
- Applications need fast, low-cost, and scalable fault tolerance support
- Fault tolerant runtime for:
  - Charm++ (parallel C++ language and runtime)
  - Adaptive MPI
3 Requirements
- Low impact on fault-free execution
- Provides fast and automatic restart capability
- Does not rely on extra (spare) processors
- Maintains execution efficiency after restart
- Does not rely on any fault-free component
- Does not assume stable storage
4 Background
- Checkpoint-based methods
  - Coordinated: blocking [Tamir84], non-blocking [Chandy85]
    - Co-Check, Starfish, Clip: fault-tolerant MPI implementations
  - Uncoordinated: suffers from rollback propagation
  - Communication-induced [Briatico84]: does not scale well
- Log-based methods
  - Message logging
5 Design Overview
- Coordinated checkpointing scheme
  - Simple, low overhead on fault-free execution
  - Targets iterative scientific applications
- Double checkpointing (see the sketch after this list)
  - Tolerates one failure at a time
- In-memory checkpointing
  - Diskless checkpointing
  - Efficient for applications with a small memory footprint
- When there are no extra processors, the program continues to run on the remaining processors
- Load balancing after restart
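As a concrete illustration, the sketch below shows how an iterative application might trigger the scheme every few steps. CkStartMemCheckpoint is the Charm++ call for double in-memory checkpointing; the Main chare, its method names, and the iteration counts here are hypothetical, and the entry methods would need matching declarations in the .ci interface file.

    // Minimal sketch (not the paper's code): a mainchare that checkpoints
    // every 10 steps. Assumes a Charm++ build with in-memory checkpoint
    // support and a .ci file declaring iterate() and resume() entry methods.
    #include "main.decl.h"

    class Main : public CBase_Main {
      int step;
    public:
      Main(CkArgMsg *m) : step(0) { delete m; iterate(); }
      void iterate() {
        if (step % 10 == 0) {
          // Checkpoint all objects into the memory of two buddy
          // processors, then resume through the callback.
          CkCallback cb(CkIndex_Main::resume(), thisProxy);
          CkStartMemCheckpoint(cb);
        } else {
          resume();
        }
      }
      void resume() {
        // ... one step of the actual computation would go here ...
        if (++step < 100) iterate();
        else CkExit();
      }
    };

    #include "main.def.h"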
6 Charm++ Processor Virtualization
[Figure: user view of many virtual processors mapped onto physical processors]
- Charm++
  - Parallel C++ with data-driven objects (chares)
  - Chares are migratable
  - Asynchronous method invocation
- Adaptive MPI (AMPI)
  - Implemented on Charm++ with migratable threads
  - Multiple virtual processors on a physical processor (launch example below)
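The practical payoff is that the degree of parallelism is decoupled from the machine size. For illustration (the program name and counts here are made up), an AMPI job can run many MPI ranks as migratable user-level threads on a few physical processors:

    ./charmrun +p8 ./jacobi3d +vp64   # 64 virtual MPI ranks on 8 processors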
7 Benefits of Virtualization
- Latency tolerance
  - Adaptive overlap of communication and computation
- Supports migration of virtual processors for load balancing
- Checkpoint data
  - Objects (instead of a process image)
  - Checkpoint = migrate object to another processor (sketched below)
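Because checkpointing reuses the same serialization machinery as migration, an object only has to describe its state once. Below is a minimal sketch of a migratable array element using Charm++'s PUP (pack/unpack) framework; the Block class and its fields are hypothetical.

    // Sketch: one pup() method serves migration and checkpointing alike.
    class Block : public CBase_Block {
      int n;          // number of local grid points
      double *data;   // local portion of the grid
    public:
      Block() : n(0), data(nullptr) {}
      Block(CkMigrateMessage *m) : CBase_Block(m) {}  // migrate/restore ctor
      ~Block() { delete[] data; }
      void pup(PUP::er &p) {
        CBase_Block::pup(p);                    // pup the superclass first
        p | n;                                  // scalar state
        if (p.isUnpacking()) data = new double[n];
        p(data, n);                             // raw array state
      }
    };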
8 Checkpoint Protocol
- Adopts a coordinated checkpointing strategy
- The Charm++ runtime provides the checkpointing functionality
  - Programmers can decide what to checkpoint
- Each object packs its data and sends it to two different (buddy) processors (one possible mapping is sketched below)
- Charm++ runtime data is also checkpointed
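The slides do not fix a particular buddy assignment, so the sketch below shows one plausible mapping: each processor keeps one copy locally and places the second on its neighbor, so a single failure never destroys both copies.

    // Illustrative buddy mapping (not necessarily the runtime's policy).
    #include <utility>

    std::pair<int, int> checkpointBuddies(int pe, int numPes) {
      int first  = pe;                 // one copy stays local
      int second = (pe + 1) % numPes;  // second copy on the next processor
      return {first, second};          // distinct whenever numPes > 1
    }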
9 Restart Protocol
- Initiated by the failure of a physical processor
- Every object rolls back to the state preserved in the most recent checkpoint (the per-object decision is sketched below)
- Combined with the load balancer to sustain performance
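A sketch of the per-object decision, with a hypothetical Checkpoint type and helper functions standing in for what the runtime does internally:

    // Illustrative only: restore logic after a crash of `crashedPe`.
    struct Checkpoint { /* serialized object state */ };

    void recreateFrom(const Checkpoint &c);  // rebuild object on a survivor
    void rollBackTo(const Checkpoint &c);    // restore an existing object

    void restoreObject(int homePe, int crashedPe,
                       const Checkpoint &localCopy,
                       const Checkpoint &buddyCopy) {
      if (homePe == crashedPe) {
        // Home processor died: re-create the object from its buddy's copy.
        recreateFrom(buddyCopy);
      } else {
        // Processor survived, but all objects roll back together to the
        // last coordinated checkpoint so the global state stays consistent.
        rollBackTo(localCopy);
      }
    }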
10 Checkpoint/Restart Protocol
[Figure: objects A-J and their two checkpoint copies distributed across PE0-PE3; after PE1 crashes (one processor lost), its objects are restored on the remaining processors from the buddy checkpoints. Legend: checkpoint 1, checkpoint 2, object, restored object.]
11 Local Disk-Based Protocol
- Double in-memory checkpointing
  - Memory concern
  - Pick a checkpoint time when the global state is small
- Double in-disk checkpointing (a write sketch follows this list)
  - Makes use of the local disk
  - Also does not rely on any reliable storage
  - Useful for applications with a very large memory footprint
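A sketch of how the disk-based variant might write one checkpoint copy; the path scheme and function are hypothetical, but the point is that only node-local disks are used, so no reliable shared storage is assumed. Alternating phases keep the previous checkpoint intact until the new one is fully written.

    #include <cstddef>
    #include <cstdio>
    #include <string>

    // Write one checkpoint copy to this node's local scratch disk.
    bool writeLocalCheckpoint(int pe, int phase,
                              const char *buf, std::size_t len) {
      // e.g. /tmp/ckpt.<pe>.<0|1>; two slots alternate so one complete
      // checkpoint always survives a crash mid-write.
      std::string path = "/tmp/ckpt." + std::to_string(pe) + "."
                       + std::to_string(phase % 2);
      std::FILE *f = std::fopen(path.c_str(), "wb");
      if (!f) return false;
      std::size_t written = std::fwrite(buf, 1, len, f);
      std::fclose(f);
      return written == len;
    }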
12 Performance Evaluation
- IA-32 Linux cluster at NCSA
- 512 dual 1 GHz Intel Pentium III processors
- 1.5 GB RAM each
- Connected by both Myrinet and 100 Mbit Ethernet
13 Checkpoint Overhead Evaluation
- Jacobi3D (MPI)
- Up to 128 processors
- Myrinet vs. 100 Mbit Ethernet
14 Single Checkpoint Overhead
- AMPI Jacobi3D
- Problem size: 200 MB
- 128 processors
15 Comparisons of Program Execution Time
- Jacobi (200 MB data size) on up to 128 processors
- 8 checkpoints in 100 steps
16 Performance Comparisons with Traditional Disk-Based Checkpointing
17 Recovery Performance
- Molecular dynamics simulation application: LeanMD
- Apoa1 benchmark (92K atoms)
- 128 processors
- Crash simulated by killing processes
- No backup processors
- With load balancing
18 Performance Improvement with Load Balancing
LeanMD, Apoa1, 128 processors
19 Recovery Performance
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps
20 Recovery Performance (continued)
- LeanMD with Apoa1 benchmark
- 90K atoms
- 8498 objects
21 Future Work
- Use our scheme on extremely large parallel machines
- Reduce memory usage of the protocol
- Message logging
- Paper appeared in IPDPS 2004