Title: MPICH-V2, a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender-Based Message Logging
1. MPICH-V2, a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender-Based Message Logging
- Joint work with A. Bouteiller, F. Cappello, G. Krawezik, P. Lemarinier, F. Magniette - Parallelism team, Grand Large Project
- Thomas Hérault, herault@lri.fr, http://www.lri.fr/~herault
2. MPICH-V2
- Computing nodes of clusters are subject to failure
- Many applications use MPI as their communication library
- Goal: design a fault-tolerant MPI library
- MPICH-V1 is a fault-tolerant MPI implementation, but it requires many stable components to provide high performance
- MPICH-V2 addresses this requirement, and provides higher performance
3. Outline
- Introduction
- Architecture
- Performance
- Perspectives and Conclusion
4. Large Scale Parallel and Distributed Systems and Node Volatility
- Industry and academia are building larger and larger computing facilities for technical computing (research and production).
- Platforms with 1000s of nodes are becoming common: Tera-scale machines (US ASCI, French Tera), large-scale clusters (Score III, etc.), Grids, PC Grids (SETI@home, XtremWeb, Entropia, UD, BOINC)
- These large-scale systems have frequent failures/disconnections:
  - The ASCI-Q full-system MTBF is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job with 4096 processors has less than a 50% chance of terminating.
  - PC Grid nodes are volatile → disconnections/interruptions are expected to be very frequent (several per hour)
- When failures/disconnections cannot be avoided, they become a characteristic of the system, called Volatility
- We need a volatility-tolerant message-passing library
5. Goal: Execute Existing or New MPI Applications
Problems:
1) Volatile nodes (any number, at any time)
2) Non-named receptions (→ they must be replayed in the same order as in the previous, failed execution)

Objective summary:
1) Automatic fault tolerance
2) Transparency for the programmer and user
3) Tolerate n faults (n being the number of MPI processes)
4) Scalable infrastructure/protocols
5) Avoid global synchronizations (ckpt/restart)
6) Theoretical verification of protocols
6. Related Works
A classification of fault-tolerant message-passing environments, considering (A) the level in the software stack where fault tolerance is managed (API, communication library, or framework) and (B) the fault-tolerance technique (non-automatic vs. automatic; checkpoint-based vs. log-based with optimistic, pessimistic, or causal logging):
- Non-automatic: FT-MPI, modification of MPI routines, user fault treatment [FD00] (API); MPI/FT, redundancy of tasks [BNC01] (API)
- Checkpoint-based, coordinated checkpoint: Cocheck, independent of MPI [Ste96] (framework); Starfish, enrichment of MPI [AF99] (framework); Clip, semi-transparent checkpoint [CLP97] (API)
- Log-based, optimistic log (sender-based): optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85]; sender-based message logging, 1 fault [JZ87] (communication library)
- Log-based, causal log: Manetho, n faults [EZ92]; Egida [RAV99]
- Log-based, pessimistic log: Pruitt 98, 2 faults, sender-based [PRU98]; MPI-FT, n faults, centralized server [LNLE00]; MPICH-V2, n faults, distributed logging (communication library)
7. Checkpoint Techniques
Coordinated Checkpoint (Chandy/Lamport)
- The objective is to checkpoint the application when there are no in-transit messages between any two nodes → global synchronization, network flush: not scalable
[Diagram: all nodes Sync then Ckpt together; after a failure is detected, the whole system stops and every node restarts from the coordinated checkpoint.]

Uncoordinated Checkpoint
- No global synchronization (scalable)
- Nodes may checkpoint at any time (independently of the others)
- Need to log nondeterministic events and in-transit messages
[Diagram: nodes Ckpt independently; after a failure is detected, only the failed node restarts from its last checkpoint.]
8. Outline
- Introduction
- Architecture
- Performance
- Perspectives and Conclusion
9. MPICH-V1
[Architecture diagram: a Dispatcher, Channel Memories and Checkpoint Servers run on stable nodes; the computing nodes communicate through the Channel Memories over the network.]
10. MPICH-V2 Protocol
A new protocol (not yet published), based on: 1) splitting message logging and event logging, 2) sender-based message logging, and 3) a pessimistic approach (reliable event logger).
- Definition 3 (Pessimistic Logging Protocol). Let P be a communication protocol, and E an execution of P with at most f concurrent failures. Let M_C denote the set of messages transmitted between the initial configuration and the configuration C of E. P is a pessimistic message logging protocol if and only if
  $\forall C \in E,\ \forall m \in M_C,\ (|Depend_C(m)| > 1) \Rightarrow \text{Re-Executable}(m)$
- Theorem 2. The protocol of MPICH-V2 is a pessimistic message logging protocol.
Key points of the proof:
A. Every nondeterministic event has its logical clock logged on reliable media.
B. Every message reception logged on reliable media is re-executable: the message payload is saved on the sender; the sender will produce the message again and associate with it the same unique logical clock.
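To make points A and B concrete, here is a minimal sketch in C of the pessimistic rule (the names and the trivially "reliable" in-memory logger are illustrative assumptions, not MPICH-V2's actual API): a reception event must be acknowledged by the event logger before the process is allowed to send any new message.

```c
/* Minimal sketch of the pessimistic rule, NOT MPICH-V2's actual API:
 * a reception event (source, logical clock) must be acknowledged by
 * the event logger before the process may send any new message. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { int source; long clock; } recv_event;

/* Stand-in for the reliable event logger: the real daemon sends the
 * event over TCP and blocks until the ack comes back. */
static bool event_logger_log(recv_event ev) {
    printf("logged reception (src=%d, clock=%ld)\n", ev.source, ev.clock);
    return true;                       /* ack received */
}

static bool may_send = true;

static void on_receive(int source, long clock) {
    may_send = false;                  /* freeze sends until the ack */
    recv_event ev = { source, clock };
    if (event_logger_log(ev))          /* blocking event-logging step */
        may_send = true;               /* the reception is now re-executable */
}

int main(void) {
    on_receive(3, 42);
    printf("may_send=%d\n", may_send);
    return 0;
}
```

Blocking the sender on this ack is precisely what adds the extra TCP round trip visible in the latency measurements later in the talk.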
11. Message Logger and Event Logger
[Diagram: process p receives message m from q and logs the reception event (id, l) on its event logger. After p crashes and restarts from its checkpoint, the re-execution phase replays the receptions from q and r (messages A, B, C, D) in the order recorded by the event logger for p.]
12. Computing Node
[Architecture diagram: on each node, the MPI process runs over the V2 daemon and CSAC. Sends keep their payload in the local message logger; each reception event is sent to the Event Logger and acknowledged (ack) before execution proceeds; the Ckpt Control sends the checkpoint image to the Ckpt Server.]
13. Impact of Uncoordinated Checkpoint and Sender-Based Message Logging
[Diagram: P0 and P1 checkpoint independently; the reception events of messages 1 and 2 are on the event logger (EL), their payloads are in P1's message logger (ML), and the checkpoint images are on the checkpoint server (CS).]
- Obligation to checkpoint the message loggers on the computing nodes
- A garbage collector is required to reduce the ML checkpoint size
14. Garbage Collection
[Diagram: P1's message logger holds the payloads of messages 1, 2 and 3; once P0's new checkpoint image reaches the checkpoint server (CS), messages 1 and 2 can be deleted → garbage collector.]
Receiver checkpoint completion triggers the garbage collector of the senders.
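A minimal sketch of this garbage collection in C, using a hypothetical linked-list log (not MPICH-V2's actual data structures): when a receiver reports a completed checkpoint, every payload it covers is freed on the sender.

```c
/* Sketch of sender-based logging with garbage collection, using a
 * hypothetical linked-list log (not MPICH-V2's actual structures). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct log_entry {
    int dest;                 /* receiver rank */
    long clock;               /* logical clock of the reception */
    char *payload;
    struct log_entry *next;
} log_entry;

static log_entry *message_log = NULL;

/* Every send keeps its payload in the sender's log. */
static void log_send(int dest, long clock, const char *data) {
    log_entry *e = malloc(sizeof *e);
    e->dest = dest; e->clock = clock; e->payload = strdup(data);
    e->next = message_log; message_log = e;
}

/* Receiver `dest' completed a checkpoint covering receptions up to
 * `clock': those payloads can never be requested again, free them. */
static void garbage_collect(int dest, long clock) {
    log_entry **p = &message_log;
    while (*p) {
        if ((*p)->dest == dest && (*p)->clock <= clock) {
            log_entry *dead = *p; *p = dead->next;
            free(dead->payload); free(dead);
        } else p = &(*p)->next;
    }
}

int main(void) {
    log_send(1, 1, "m1"); log_send(1, 2, "m2"); log_send(1, 3, "m3");
    garbage_collect(1, 2);    /* receiver 1 checkpointed after message 2 */
    for (log_entry *e = message_log; e; e = e->next)
        printf("still logged: clock=%ld\n", e->clock);   /* prints 3 only */
    return 0;
}
```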
15. Scheduling Checkpoints
- Uncoordinated checkpointing leads to logging in-transit messages
- Scheduling checkpoints simultaneously would lead to bursts in the network traffic
- Checkpoint size can be reduced by removing message logs, as in coordinated checkpointing (Chandy/Lamport), but that requires global synchronization
- Checkpoint traffic should be flattened
- Checkpoint scheduling should evaluate the cost and benefit of each checkpoint
[Diagram: when both P0's and P1's checkpoints cover them, messages 1, 2 and 3 can all be deleted by the garbage collector; when only the receiver's checkpoint covers messages 1 and 2, those are deleted while in-transit message 3 still needs to be checkpointed; with no logged message, no message checkpoint is needed.]
16. Node (Volatile) Checkpointing
- User-level checkpoint: Condor Stand Alone Checkpointing (CSAC)
- Clone checkpointing: non-blocking checkpoint (see the sketch below)
- On a checkpoint order, libmpichv (1) forks; the clone (2) terminates the ongoing communications, (3) closes its sockets and (4) calls ckpt_and_exit(), while the original process resumes execution. A restarted node resumes, using CSAC, just after (4), reopens its sockets and returns.
- The checkpoint image is sent to the CS on the fly (not stored locally)
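A minimal sketch of the clone checkpoint in C; write_checkpoint() is a placeholder standing in for CSAC's ckpt_and_exit(), and in the real system the image is streamed to the checkpoint server rather than printed.

```c
/* Sketch of the clone (fork-based) non-blocking checkpoint.
 * write_checkpoint() is a placeholder for CSAC's ckpt_and_exit(). */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static void write_checkpoint(void) {
    printf("clone %d: sending checkpoint image to the CS\n", getpid());
}

static void checkpoint_now(void) {
    pid_t pid = fork();                /* (1) fork */
    if (pid == 0) {
        /* the clone would (2) terminate ongoing comms and (3) close
         * its sockets here (none in this toy), then (4) checkpoint */
        write_checkpoint();
        _exit(0);
    }
    /* the original process resumes execution immediately */
}

int main(void) {
    checkpoint_now();
    printf("parent %d: computing while the image is written\n", getpid());
    wait(NULL);                        /* reap the clone in this example */
    return 0;
}
```

Forking gives copy-on-write isolation, which is why the computation and the checkpoint transfer can proceed in parallel.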
17. Library Based on MPICH 1.2.5
- A new device: the ch_v2 device
- All ch_v2 device functions are blocking communication functions built over the TCP layer
- Binding: MPI_Send is layered over MPID_SendControl / MPID_SendChannel
- V2 device interface:
  - _v2Init: initialize the client
  - _v2bsend: blocking send
  - _v2from: get the src of the last message
  - _v2Finalize: finalize the client
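The shape of that interface, sketched in C; the signatures and bodies are illustrative guesses for this talk, not MPICH's actual prototypes.

```c
/* Shape of the ch_v2 device interface listed above; these signatures
 * are illustrative guesses, not MPICH's actual prototypes. */
#include <stdio.h>

int _v2Init(void)     { printf("init: connect to the V2 daemon\n"); return 0; }
int _v2bsend(int dest, const void *buf, int len) {
    (void)buf;
    printf("blocking send of %d bytes to %d (payload kept in sender log)\n",
           len, dest);
    return 0;
}
int _v2from(void)     { return 0; /* src of the last received message */ }
int _v2Finalize(void) { printf("finalize: flush logs, close sockets\n"); return 0; }

/* MPI_Send reaches these primitives through the MPID_SendControl /
 * MPID_SendChannel binding of the ch_v2 device. */
int main(void) {
    _v2Init();
    _v2bsend(1, "hello", 5);
    _v2Finalize();
    return 0;
}
```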
18. Outline
- Introduction
- Architecture
- Performance
- Perspectives and Conclusion
19. Performance Evaluation
Cluster: 32 Athlon 1800+ CPUs, 1 GB RAM, IDE disk; 16 dual Pentium III, 500 MHz, 512 MB, IDE disk; 48-port 100 Mb/s Ethernet switch; Linux 2.4.18, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp athlonxp)
[Deployment diagram: the Checkpoint Server, Event Logger, Checkpoint Scheduler and Dispatcher all run on a single reliable node; the computing nodes reach it over the network.]
20. Bandwidth and Latency
Latency for a 0-byte MPI message: MPICH-P4 (77 µs), MPICH-V1 (154 µs), MPICH-V2 (277 µs).
Latency is high due to the event logging → a receiving process can send a new message only when the reception event has been successfully logged (3 TCP messages per communication). Bandwidth is high because event messages are short.
21. NAS Benchmark Class A and B
[Plots: performance in Megaflops on the NAS benchmarks.]
22. Breakdown of the Execution Time
23. Faulty Execution Performance
[Plot: execution with 1 fault every 45 seconds completes in 190 s (80%).]
24. Outline
- Introduction
- Architecture
- Performance
- Perspectives and Conclusion
25. Perspectives
- Compare to coordinated techniques: the threshold of fault frequency where logging techniques become more valuable (MPICH-V/CL, Cluster 2003)
- Hierarchical logging for Grids: tolerate node failures and cluster failures (MPICH-V3, SC 2003 poster session)
- Address the latency of MPICH-V2: use causal logging techniques?
26. Conclusion
- MPICH-V2 is a completely new protocol replacing MPICH-V1, removing the Channel Memories
- The new protocol is pessimistic and sender-based
- MPICH-V2 reaches a ping-pong bandwidth close to that of MPICH-P4
- MPICH-V2 cannot compete with MPICH-P4 on latency; however, for applications with large messages, performance is close to that of P4
- In addition, MPICH-V2 resists up to one fault every 45 seconds
- Main conclusion: MPICH-V2 requires far fewer stable nodes than MPICH-V1, with better performance

Come see the MPICH-V demos at Booth 3315 (INRIA)
27. Re-execution Performance (1)
Time for the re-execution of a token ring on 8 nodes, according to the token size and the number of restarted nodes.
28. Re-execution Performance (2)
29. Logging Techniques
[Diagram: the initial execution takes a checkpoint (ckpt) and later crashes; the replayed execution of the failed process starts from its last checkpoint.]
The system must provide the messages to be replayed, and discard the re-emissions.
- Main problems:
  - Discarding re-emissions (technical)
  - Ensuring that messages are replayed in a consistent order
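A minimal sketch in C of both problems (illustrative, not MPICH-V2's code): per-sender sequence numbers let the receiver discard re-emissions, while replay delivers receptions in the logged order.

```c
/* Sketch of the two replay problems (illustrative, not MPICH-V2's
 * code): per-sender sequence numbers let the receiver discard
 * re-emissions, and replay delivers receptions in the logged order. */
#include <stdio.h>

#define NPROCS 4
static long last_delivered[NPROCS];      /* highest seq seen per sender */

/* Returns 1 if the message is new, 0 if it is a re-emission. */
static int accept_message(int src, long seq) {
    if (seq <= last_delivered[src]) return 0;    /* duplicate: discard */
    last_delivered[src] = seq;
    return 1;
}

int main(void) {
    /* replay: iterate the logged order; the duplicate (1,1) is dropped */
    struct { int src; long seq; } logged[] = {{1,1},{2,1},{1,1},{1,2}};
    for (int i = 0; i < 4; i++)
        printf("msg(src=%d, seq=%ld): %s\n", logged[i].src, logged[i].seq,
               accept_message(logged[i].src, logged[i].seq) ? "deliver"
                                                            : "discard");
    return 0;
}
```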
30. Large Scale Parallel and Distributed Systems and Programming
- Many HPC applications use the message-passing paradigm
- Message passing: MPI
- We need a volatility-tolerant Message Passing Interface implementation
- Based on MPICH-1.2.5, which implements the MPI 1.1 standard
31. Checkpoint Server (stable)
- Checkpoint images are stored on reliable media (disk): 1 file per node, name given by the node
- Multiprocess server: the main process polls, treats events and dispatches jobs to the other processes
- Incoming messages: Put-checkpoint transactions; outgoing messages: Get-checkpoint transactions and control
- Open sockets: one per attached node, and one per home CM of the attached nodes
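A toy sketch in C of such a multiprocess server (hypothetical layout, not the actual implementation): the main process dispatches each put/get transaction to a forked worker, so a slow transfer never blocks the others.

```c
/* Toy sketch of a multiprocess checkpoint server (hypothetical
 * layout): the main process dispatches each put/get transaction to a
 * forked worker, storing one image file per node, named by the node. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

typedef enum { PUT_CKPT, GET_CKPT } txn_kind;

static void handle(txn_kind k, const char *node) {
    char path[256];
    snprintf(path, sizeof path, "ckpt_%s.img", node);  /* 1 file per node */
    printf("worker %d: %s %s\n", getpid(),
           k == PUT_CKPT ? "storing" : "serving", path);
}

int main(void) {
    /* toy event loop: two transactions instead of a real poll() */
    txn_kind kinds[]    = { PUT_CKPT, GET_CKPT };
    const char *nodes[] = { "node12", "node12" };
    for (int i = 0; i < 2; i++)
        if (fork() == 0) { handle(kinds[i], nodes[i]); _exit(0); }
    while (wait(NULL) > 0) ;     /* the main process reaps its workers */
    return 0;
}
```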
32. NAS Benchmark Class A and B
[Backup plots: latency and memory capacity (logging on disk).]