FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI

Learn more at: http://charm.cs.uiuc.edu
Transcript and Presenter's Notes



1
FTC-Charm++: An In-Memory Checkpoint-Based Fault
Tolerant Runtime for Charm++ and MPI
  • Gengbin Zheng
  • Lixia Shi
  • Laxmikant V. Kale
  • Parallel Programming Lab
  • University of Illinois at Urbana-Champaign

2
Motivation
  • As machines grow in size
  • MTBF decreases
  • Applications have to tolerate faults
  • Applications need fast, low-cost, and scalable
    fault tolerance support
  • Fault tolerant runtime for
  • Charm++ (parallel C++ language and runtime)
  • Adaptive MPI

3
Requirements
  • Low impact on fault-free execution
  • Provide fast and automatic restart capability
  • Does not rely on extra processors
  • Maintain execution efficiency after restart
  • Does not rely on any fault-free component
  • Does not assume stable storage

4
Background
  • Checkpoint-based methods
  • Coordinated blocking [Tamir84], non-blocking
    [Chandy85]
  • Co-check, Starfish, Clip fault-tolerant MPI
  • Uncoordinated suffers from rollback propagation
  • Communication-induced [Briatico84] doesn't scale
    well
  • Log-based methods
  • Message logging

5
Design Overview
  • Coordinated checkpointing scheme
  • Simple, low overhead on fault-free execution
  • Suited to iterative scientific applications
  • Double checkpointing (sketch after this list)
  • Tolerates one failure at a time
  • In-memory checkpointing
  • Diskless checkpointing
  • Efficient for applications with a small memory
    footprint
  • When there are no extra processors
  • Program continues to run on the remaining
    processors
  • Load balancing for restart
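A minimal sketch of how an application could trigger the double in-memory checkpoint described above. CkStartMemCheckpoint is the call current Charm++ exposes for this scheme; the Main chare, its entry methods, and the checkpoint interval here are hypothetical.

    #include "main.decl.h"   // generated from a hypothetical main.ci:
                             //   mainchare Main {
                             //     entry Main(CkArgMsg *m);
                             //     entry void iterate();
                             //     entry void checkpointDone();
                             //   };

    class Main : public CBase_Main {
      int step;
    public:
      Main(CkArgMsg *m) : step(0) { thisProxy.iterate(); }

      void iterate() {
        ++step;
        // ... run one iteration of the application ...
        if (step % 10 == 0) {
          // Save every chare into the memory of two buddy processors;
          // the runtime resumes at checkpointDone() when all copies exist.
          CkCallback cb(CkIndex_Main::checkpointDone(), thisProxy);
          CkStartMemCheckpoint(cb);
        } else {
          thisProxy.iterate();
        }
      }

      void checkpointDone() { thisProxy.iterate(); }
    };

    #include "main.def.h"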

6
Charm Processor Virtualization
[Figure: user view of processor virtualization]
  • Charm++
  • Parallel C++ with data-driven objects (chares)
  • Chares are migratable
  • Asynchronous method invocation (sketch after this
    list)
  • Adaptive MPI
  • Implemented on Charm++ with migratable threads
  • Multiple virtual processors on a physical
    processor
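A minimal sketch of this programming model: a migratable chare array element with an asynchronous entry method. The Hello class and its .ci interface are illustrative, not taken from the talk.

    // hello.ci (interface file):
    //   array [1D] Hello {
    //     entry Hello();
    //     entry void greet(int step);
    //   };
    #include "hello.decl.h"

    class Hello : public CBase_Hello {
    public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}  // used when the object arrives
                                     // on a new processor
      void greet(int step) {
        // Asynchronous method invocation: the caller does not block.
        CkPrintf("chare %d on PE %d, step %d\n",
                 thisIndex, CkMyPe(), step);
      }
    };
    #include "hello.def.h"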

7
Benefits of Virtualization
  • Latency tolerant
  • Adaptive overlap of communication and computation
  • Supports migration of virtual processors for load
    balancing
  • Checkpoint data
  • Objects (instead of process image)
  • Checkpoint = migrate object to another processor
    (sketch below)
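Because checkpointing reuses the migration machinery, the same kind of object can also be moved explicitly. migrateMe() is the Charm++ call for moving an array element; the method wrapping it below is illustrative.

    // Illustrative member of the Hello element sketched earlier.
    void Hello::moveTo(int targetPe) {
      if (CkMyPe() != targetPe) {
        // Serializes the object (via its pup routine) and rebuilds it
        // on targetPe through the migration constructor.
        migrateMe(targetPe);
      }
    }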

8
Checkpoint Protocol
  • Adopts a coordinated checkpointing strategy
  • Charm++ runtime provides the functionality for
    checkpointing
  • Programmers decide what to checkpoint (sketch
    below)
  • Each object packs its data and sends it to two
    different (buddy) processors
  • Charm++ runtime data
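A sketch of the pup() routine through which an object packs its state; the runtime drives it when shipping the checkpoint to the two buddy processors (and when migrating). The Jacobi class and its fields are hypothetical.

    #include <vector>
    #include "jacobi.decl.h"  // from a hypothetical jacobi.ci
    #include "pup_stl.h"      // pup support for STL containers

    class Jacobi : public CBase_Jacobi {
      int step;
      std::vector<double> grid;
    public:
      Jacobi() : step(0) {}
      Jacobi(CkMigrateMessage *m) {}
      void pup(PUP::er &p) {
        CBase_Jacobi::pup(p);  // runtime's own per-object data
        // The programmer decides what to checkpoint: only the state
        // listed here is saved and restored.
        p | step;
        p | grid;
      }
    };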

9
Restart Protocol
  • Initiated by the failure of a physical processor
  • Every object rolls back to the state preserved in
    its most recent checkpoint
  • Combined with the load balancer to sustain
    performance after restart

10
Checkpoint/Restart Protocol
[Diagram: objects A-J distributed across PE0-PE3; each
object's checkpoints are stored on two buddy processors
(checkpoint 1, checkpoint 2). After PE1 crashes (one
processor lost), its objects are restored from the buddy
checkpoints onto the remaining processors PE0, PE2, and
PE3.]
11
Local Disk-Based Protocol
  • Double in-memory checkpointing
  • Memory concern
  • Pick a checkpoint time at which the global state
    is small
  • Double in-disk checkpointing (sketch after this
    list)
  • Makes use of the local disk
  • Also does not rely on any reliable storage
  • Useful for applications with a very large memory
    footprint
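For the disk variant, current Charm++ offers CkStartCheckpoint, which writes all objects to a directory; pointing it at a node-local path approximates the double local-disk scheme described here. The path and the surrounding method are assumptions.

    // Drop-in alternative to the in-memory trigger in the earlier
    // Main::iterate() sketch.
    void Main::diskCheckpoint() {
      CkCallback cb(CkIndex_Main::checkpointDone(), thisProxy);
      // Node-local directory (assumed), not shared/NFS storage.
      CkStartCheckpoint("/tmp/ftc-ckpt", cb);
    }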

12
Performance Evaluation
  • IA-32 Linux cluster at NCSA
  • 512 dual 1 GHz Intel Pentium III processors
  • 1.5 GB RAM per processor
  • Connected by both Myrinet and 100 Mbit Ethernet

13
Checkpoint Overhead Evaluation
  • Jacobi3D MPI
  • Up to 128 processors
  • Myrinet vs. 100 Mbit Ethernet

14
Single Checkpoint Overhead
  • AMPI Jacobi3D
  • Problem size: 200 MB
  • 128 processors

15
Comparisons of Program Execution Time
  • Jacobi (200 MB data size) on up to 128 processors
  • 8 checkpoints in 100 steps

16
Performance Comparisons with Traditional
Disk-based Checkpointing
17
Recovery Performance
  • Molecular Dynamics Simulation application -
    LeanMD
  • Apoa1 benchmark (92K atoms)
  • 128 processors
  • Crash simulated by killing processes
  • No backup processors
  • With load balancing

18
Performance Improvement with Load Balancing
LeanMD, Apoa1, 128 processors
19
Recovery Performance
  • 10 crashes
  • 128 processors
  • Checkpoint every 10 time steps

20
  • LeanMD with Apoa1 benchmark
  • 90K atoms
  • 8498 objects

21
Future work
  • Use our scheme on extremely large parallel
    machines
  • Reduce the memory usage of the protocol
  • Message logging
  • Paper appeared in IPDPS 2004