1
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes
based on Pessimistic Sender Based Message Logging
  • joint work with
  • A. Bouteiller, F. Cappello, G. Krawezik,
    P. Lemarinier, F. Magniette
  • Parallelism team, Grand Large Project
  • Thomas Hérault, herault@lri.fr
    http://www.lri.fr/herault

2
MPICH-V2
  • Computing nodes of clusters are subject to
    failure
  • Many applications use MPI as their
    communication library
  • Design a fault-tolerant MPI library
  • MPICH-V1 is a fault-tolerant MPI implementation
  • It requires many stable components to provide
    high performance
  • MPICH-V2 addresses this requirement
  • and provides higher performance

3
Outline
  • Introduction
  • Architecture
  • Performance
  • Perspectives and Conclusion

4
Large Scale Parallel and Distributed systems and
node Volatility
  • Industry and academia are building larger and
    larger computing facilities for technical
    computing (research and production).
  • Platforms with 1000s of nodes are becoming
    common: Tera-scale machines (US ASCI, French
    Tera), large scale clusters (Score III, etc.),
    Grids, PC-Grids (SETI@home, XtremWeb, Entropia,
    UD, BOINC)
  • These large scale systems have frequent
    failures/disconnections
  • The ASCI-Q full-system MTBF is estimated
    (analytically) at a few hours (Petrini, LANL); a
    5-hour job with 4096 procs has less than a 50%
    chance to terminate.
  • PC-Grid nodes are volatile →
    disconnections/interruptions are expected to be
    very frequent (several per hour)
  • When failures/disconnections cannot be avoided,
    they become a characteristic of the system
    called Volatility
  • We need a volatility-tolerant message passing
    library

5
Goal: execute existing or new MPI applications
Problems:
1) volatile nodes (any number, at any time)
2) non-named receptions (→ they should be replayed
in the same order as in the previous, failed
execution)
Objective summary:
1) Automatic fault tolerance
2) Transparency for the programmer and the user
3) Tolerate n faults (n being the number of MPI
processes)
4) Scalable infrastructure/protocols
5) Avoid global synchronizations (ckpt/restart)
6) Theoretical verification of protocols
6
Related works
A classification of fault tolerant message passing
environments, considering:
A) the level in the software stack where fault
tolerance is managed (Framework, API, or
Communication Library), and
B) the fault tolerance techniques: checkpoint
based (coordinated checkpoint) vs. log based
(optimistic, pessimistic or causal log), automatic
vs. non automatic.
[Figure: classification chart placing, among
others: Cocheck (independent of MPI) [Ste96],
Starfish (enrichment of MPI) [AF99], Clip
(semi-transparent checkpoint) [CLP97], Egida
[RAV99], FT-MPI (modification of MPI routines,
user fault treatment) [FD00], MPI/FT (redundancy
of tasks) [BNC01], optimistic recovery in
distributed systems (n faults with coherent
checkpoint) [SY85], Manetho (n faults) [EZ92],
Sender based message logging (1 fault, sender
based) [JZ87], Pruitt 98 (2 faults, sender based)
[PRU98], MPI-FT (n faults, centralized server)
[LNLE00], and MPICH-V2 (n faults, distributed
logging).]
7
Checkpoint techniques
Coordinated Checkpoint (Chandy/Lamport): the
objective is to checkpoint the application when
there are no in-transit messages between any two
nodes → global synchronization → network flush →
not scalable.
[Figure: timeline of coordinated checkpointing
across nodes: Sync, Ckpt, failure,
detection/global stop, restart]
Uncoordinated Checkpoint
  • No global synchronization (scalable)
  • Nodes may checkpoint at any time (independently
    of the others)
  • Need to log nondeterministic events and
    in-transit messages
[Figure: timeline of uncoordinated checkpointing
across nodes: Ckpt, failure, detection, restart]
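For reference, a minimal C sketch of the Chandy/Lamport marker rule mentioned above; the helper names (save_local_state, send_marker, log_in_transit) are hypothetical and not part of MPICH-V. It only illustrates why every channel has to be flushed with a marker, which is what makes the coordinated approach a global synchronization.

/* Minimal sketch of the Chandy/Lamport coordinated checkpoint rule.
 * All helpers are hypothetical; this is not MPICH-V code. */
#include <stdbool.h>

#define MAX_CHANNELS 64

static bool state_recorded = false;          /* has this node checkpointed yet?    */
static bool recording[MAX_CHANNELS];         /* still logging in-transit messages? */

void save_local_state(void);                 /* write the process checkpoint image */
void send_marker(int channel);               /* flush a marker on one channel      */
void log_in_transit(int channel, const void *msg, int len);
void deliver(const void *msg, int len);

/* Called on a local checkpoint order (channel < 0) or when a marker
 * arrives on `channel`. */
void on_marker(int channel, int nchannels)
{
    if (!state_recorded) {
        state_recorded = true;
        save_local_state();
        for (int c = 0; c < nchannels; c++) {
            recording[c] = true;             /* record until a marker comes back   */
            send_marker(c);
        }
    }
    if (channel >= 0)
        recording[channel] = false;          /* this channel's state is complete   */
}

/* Application messages received while a channel is being recorded are
 * in-transit messages and belong to the channel state. */
void on_message(int channel, const void *msg, int len)
{
    if (recording[channel])
        log_in_transit(channel, msg, len);
    deliver(msg, len);
}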
8
Outline
  • Introduction
  • Architecture
  • Performance
  • Perspectives and Conclusion

9
MPICH-V1
[Figure: MPICH-V1 architecture: a Dispatcher,
Channel Memories and Checkpoint Servers connected
to the computing nodes through the network.]
10
MPICH-V2 protocol
A new protocol (never published before), based on:
1) splitting message logging and event logging,
2) sender based message logging,
3) a pessimistic approach (reliable event logger).
  • Definition 3 (Pessimistic Logging protocol). Let P
    be a communication protocol, and E an execution
    of P with at most f concurrent failures. Let M_C
    denote the set of messages transmitted between
    the initial configuration and the configuration C
    of E.
  • P is a pessimistic message logging protocol if
    and only if
  • ∀C ∈ E, ∀m ∈ M_C,
  • (Depend_C(m) > 1) ⇒ Re-Executable(m)

Theorem 2. The protocol of MPICH-V2 is a
pessimistic message logging protocol.
Key points of the proof:
A. Every nondeterministic event has its logical
clock logged on reliable media.
B. Every message reception logged on reliable
media is re-executable: the message payload is
saved on the sender, and the sender will produce
the message again and associate the same unique
logical clock with it.
11
Message logger and event logger
[Figure: two time diagrams over processes p, q, r
and p's event logger. In the initial execution, q
sends message m to p, p logs the reception event
(id, l) at its event logger, then p crashes. After
the restart of p, the re-execution phase replays
the receptions (labeled A, B, C, D) from q and r
in the order recorded by the event logger.]
12
Computing node
[Figure: architecture of a computing node. The MPI
process, linked with CSAC, runs next to the V2
daemon. On Send, the daemon keeps the message
payload (sender based logging) before sending. On
Receive, the reception event is sent to the Event
Logger and acknowledged. Under Ckpt Control, the
checkpoint image is sent to the Checkpoint
Server.]
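To make the figure concrete, here is a minimal C sketch of what the daemon's send and receive paths could look like. Every name (v2_send, v2_recv, payload_log_append, event_logger_log_sync) is hypothetical, not the actual MPICH-V2 code; the sketch only illustrates the two key rules: the payload is kept on the sender, and the reception event must be acknowledged by the event logger before the receiver may emit new messages.

/* Hypothetical sketch of the V2 daemon logic (not the actual MPICH-V2
 * source). Sender side: keep the payload so the message can be produced
 * again during a replay. Receiver side: log the (id, logical clock)
 * reception event on the reliable event logger and wait for its ack. */
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t id; uint64_t clock; } recv_event_t;

void payload_log_append(int dest, uint64_t id, const void *buf, size_t len);
void tcp_send(int dest, uint64_t id, const void *buf, size_t len);
void tcp_recv(int *src, uint64_t *id, void *buf, size_t *len);
void event_logger_log_sync(const recv_event_t *ev);   /* blocks until the ack */

static uint64_t next_msg_id = 1;   /* unique id per message sent by this node  */
static uint64_t recv_clock  = 1;   /* logical clock of receptions on this node */

void v2_send(int dest, const void *buf, size_t len)
{
    uint64_t id = next_msg_id++;
    payload_log_append(dest, id, buf, len);   /* sender based message logging */
    tcp_send(dest, id, buf, len);
}

void v2_recv(int *src, void *buf, size_t *len)
{
    uint64_t id;
    tcp_recv(src, &id, buf, len);

    /* Pessimistic event logging: the nondeterministic reception order is
     * logged on reliable media before this process may send anything new
     * (hence the extra round-trip that shows up in the latency). */
    recv_event_t ev = { .id = id, .clock = recv_clock++ };
    event_logger_log_sync(&ev);
}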
13
Impact of uncoordinated checkpoint and sender
based message logging
[Figure: processes P0 and P1 with the Event Logger
(EL) and the Checkpoint Server (CS). The payloads
of messages 1 and 2 remain in P1's message logger
(ML), so they become part of P1's checkpoint
image.]
  • Obligation to checkpoint the Message Loggers on
    the computing nodes
  • A garbage collector is required to reduce the ML
    checkpoint size.
14
Garbage collection
[Figure: processes P0 and P1 with the Event Logger
(EL) and the Checkpoint Server (CS). Once P0's
checkpoint covering the receptions of messages 1
and 2 is complete, the copies of 1 and 2 kept in
P1's message logger (ML) can be deleted → garbage
collector.]
Receiver checkpoint completion triggers the
garbage collector of the senders.
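A minimal C sketch of this garbage collection rule, with hypothetical names (ml_entry_t, on_receiver_checkpoint_complete); not the actual MPICH-V2 code. The idea is simply that once a receiver's checkpoint covers a reception, the corresponding payload kept on the sender will never be needed for a replay and can be reclaimed.

/* Hypothetical sketch of sender side garbage collection (not the actual
 * MPICH-V2 code): logged payloads become useless once the receiver's
 * checkpoint covers their reception. */
#include <stdint.h>
#include <stdlib.h>

typedef struct ml_entry {
    int      dest;               /* receiver rank                        */
    uint64_t msg_id;             /* id of the logged message             */
    void    *payload;
    struct ml_entry *next;
} ml_entry_t;

static ml_entry_t *message_log = NULL;   /* this sender's message logger */

/* Called when receiver `dest` reports a completed checkpoint covering
 * every message with id <= last_ckpt_id. */
void on_receiver_checkpoint_complete(int dest, uint64_t last_ckpt_id)
{
    ml_entry_t **cur = &message_log;
    while (*cur) {
        ml_entry_t *e = *cur;
        if (e->dest == dest && e->msg_id <= last_ckpt_id) {
            *cur = e->next;      /* unlink: never needed for replay again */
            free(e->payload);
            free(e);
        } else {
            cur = &e->next;
        }
    }
}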
15
Scheduling Checkpoint
  • Uncoordinated checkpointing leads to logging
    in-transit messages
  • Scheduling checkpoints simultaneously will lead
    to bursts in the network traffic
  • Checkpoint size can be reduced by removing
    message logs → coordinated checkpoint (Lamport),
    which requires global synchronization
  • Checkpoint traffic should be flattened
  • Checkpoint scheduling should evaluate the cost
    and benefit of each checkpoint

[Figure: processes P0 and P1 with their message
loggers (ML) and the Checkpoint Server (CS). In
P0's ML, messages 1, 2 and 3 can be deleted →
garbage collector, so no message needs to be
checkpointed; in P1's ML, messages 1 and 2 can be
deleted → garbage collector, but message 3 still
needs to be checkpointed.]
16
Node (Volatile) Checkpointing
  • User-level checkpoint: Condor Stand Alone
    Checkpointing (CSAC)
  • Clone checkpointing: non blocking checkpoint
On a checkpoint order, libmpichv (1) forks; the
clone (2) terminates the ongoing communications,
(3) closes the sockets and (4) calls
ckpt_and_exit(). When the image is restarted,
CSAC resumes execution just after (4); the code
then reopens the sockets and returns.
  • The checkpoint image is sent to the CS on the fly
    (not stored locally)
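A minimal C sketch of this clone (fork based) checkpoint, assuming a CSAC style ckpt_and_exit() entry point and hypothetical helpers (finish_ongoing_comms, close_mpi_sockets, reopen_mpi_sockets); this is not the actual libmpichv code. The parent keeps computing, which is what makes the checkpoint non blocking, while the clone produces the image; streaming the image to the Checkpoint Server is not shown.

/* Hypothetical sketch of clone checkpointing (not the actual libmpichv
 * code). The clone produced by fork() takes the checkpoint while the
 * parent resumes the MPI computation immediately. */
#include <sys/types.h>
#include <unistd.h>

void finish_ongoing_comms(void);   /* (2) drain in-flight communications     */
void close_mpi_sockets(void);      /* (3) sockets must not be in the image   */
void reopen_mpi_sockets(void);     /* restart path: reconnect to the daemons */
void ckpt_and_exit(void);          /* (4) CSAC style: write the image, exit  */

void on_checkpoint_order(void)
{
    pid_t pid = fork();            /* (1) clone the process image            */
    if (pid == 0) {
        finish_ongoing_comms();
        close_mpi_sockets();
        ckpt_and_exit();           /* the image is streamed to the CS        */
        /* Execution only reaches this point when the saved image is
         * restarted after a failure: CSAC resumes just after the call. */
        reopen_mpi_sockets();
        return;                    /* restarted process continues from here  */
    }
    /* pid > 0: the parent resumes the computation without waiting
     * (error handling and reaping of the clone are omitted). */
}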

17
Library based on MPICH 1.2.5
  • A new device: the ch_v2 device
  • All ch_v2 device functions are blocking
    communication functions built over the TCP
    layer
[Figure: binding of the MPICH layer to the V2
device interface. MPI_Send maps to
MPID_SendControl / MPID_SendChannel, which bind to
the ch_v2 functions: _v2Init (initialize the
client), _v2bsend, _v2from (get the source of the
last message), _v2Finalize (finalize the client).]
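Since all ch_v2 device functions are blocking and built over TCP, a send ultimately reduces to pushing the whole buffer onto a stream socket. Below is a minimal sketch with a hypothetical v2_blocking_send helper; it is not the actual ch_v2 source.

/* Hypothetical sketch of a blocking send over TCP in the spirit of the
 * ch_v2 device (not the actual MPICH-V2 source). write(2) may transmit
 * fewer bytes than requested, so the loop retries until the whole
 * buffer has been sent. */
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

int v2_blocking_send(int sockfd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(sockfd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry        */
            return -1;             /* real error: report it to the caller   */
        }
        p   += n;
        len -= (size_t)n;
    }
    return 0;                      /* the whole message has been pushed out */
}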
18
Outline
  • Introduction
  • Architecture
  • Performance
  • Perspectives and Conclusion

19
Performance evaluation
Cluster: 32 Athlon 1800 CPUs, 1 GB RAM, IDE disc;
16 dual Pentium III, 500 MHz, 512 MB, IDE disc;
48-port 100 Mb/s Ethernet switch; Linux 2.4.18,
GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp athlonxp)
The Checkpoint Server, Event Logger, Checkpoint
Scheduler and Dispatcher all run on a single
reliable node.
[Figure: the single reliable node connected to the
computing nodes through the network.]
20
Bandwidth and Latency
Latency for a 0-byte MPI message: MPICH-P4
(77 µs), MPICH-V1 (154 µs), MPICH-V2 (277 µs).
Latency is high due to the event logging → a
receiving process can send a new message only when
the reception event has been successfully logged
(3 TCP messages per communication). Bandwidth is
high because event messages are short.
21
NAS Benchmark Class A and B
[Graphs: performance in Megaflops]
22
Breakdown of the execution time
23
Faulty execution performance
[Graph: 1 fault every 45 sec! 190 s (80)]
24
Outline
  • Introduction
  • Architecture
  • Performance
  • Perspectives and Conclusion

25
Perspectives
  • Compare to coordinated techniques
  • Threshold of fault frequency where logging
    techniques are more valuable
  • MPICH-V/CL
  • Cluster 2003
  • Hierarchical logging for Grids
  • Tolerate node failures and cluster failures
  • MPICH-V3
  • SC 2003 poster session
  • Address the latency of MPICH-V2
  • Use causal logging techniques?

26
Conclusion
  • MPICH-V2 is a completely new protocol replacing
    MPICH-V1, removing the channel memories
  • The new protocol is pessimistic and sender based
  • MPICH-V2 reaches a ping-pong bandwidth close to
    that of MPICH-P4
  • MPICH-V2 cannot compete with MPICH-P4 on latency
  • However, for applications with large messages,
    performance is close to that of P4
  • In addition, MPICH-V2 resists up to one fault
    every 45 seconds
  • Main conclusion: MPICH-V2 requires far fewer
    stable nodes than MPICH-V1, with better
    performance

Come and see the MPICH-V demos at Booth 3315 (INRIA)
27
Re-execution performance (1)
Time for the re-execution of a token ring on 8
nodes, as a function of the token size and of the
number of restarted nodes
28
Re-execution performance (2)
29
Logging techniques
[Figure: initial execution with a checkpoint
(ckpt) and a crash; the replayed execution starts
from the last checkpoint of this process. The
system must provide the messages to be replayed
and discard the re-emissions.]
  • Main problems:
  • Discard re-emissions (technical)
  • Ensure that messages are replayed
    in a consistent order
30
Large Scale Parallel and Distributed Systems and
programming
  • Many HPC applications use the message passing
    paradigm
  • Message passing: MPI
  • We need a volatility-tolerant
  • Message Passing Interface implementation
  • Based on MPICH-1.2.5, which implements the MPI
    standard 1.1

31
Checkpoint Server (stable)
Checkpoint images are stored on reliable media,
one file per node (name given by the node).
[Figure: multiprocess server. The main process
polls, treats events and dispatches jobs to the
other processes; incoming messages carry put
checkpoint transactions, outgoing messages carry
get checkpoint transactions and control; open
sockets: one per attached node and one per home CM
of the attached nodes; checkpoint images are kept
on disc.]
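A minimal C sketch of such a multiprocess server loop: the main process polls one socket per attached node and dispatches each checkpoint transaction to a worker process. The helper names (handle_put_ckpt, handle_get_ckpt, next_request_is_put) are hypothetical; this is not the actual MPICH-V2 checkpoint server.

/* Hypothetical sketch of a poll based multiprocess checkpoint server
 * (not the actual MPICH-V2 server). A worker process per transaction
 * keeps a slow image transfer from blocking the other nodes. */
#include <poll.h>
#include <sys/wait.h>
#include <unistd.h>

void handle_put_ckpt(int fd);   /* receive an image, store one file per node */
void handle_get_ckpt(int fd);   /* send a stored image to a restarting node  */
int  next_request_is_put(int fd);

void server_loop(struct pollfd fds[], int nfds)
{
    for (;;) {
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;                                  /* reap finished workers       */
        if (poll(fds, nfds, -1) < 0)
            continue;                          /* interrupted: poll again     */
        for (int i = 0; i < nfds; i++) {
            if (!(fds[i].revents & POLLIN))
                continue;
            if (fork() == 0) {                 /* dispatch to a worker        */
                if (next_request_is_put(fds[i].fd))
                    handle_put_ckpt(fds[i].fd);
                else
                    handle_get_ckpt(fds[i].fd);
                _exit(0);
            }
            /* A real server would stop polling this descriptor until the
             * worker has finished with it (omitted for brevity). */
        }
    }
}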
32
NAS Benchmark Class A and B
[Graphs: latency; memory capacity (logging on
disc)]