1
Application-Level Fault Tolerance for MPI
Programs
  • Keshav Pingali
  • http://iss.cs.cornell.edu

Joint work with Greg Bronevetsky, Rohit
Fernandes, Daniel Marques, Paul Stodghill
2
The Problem
  • Old picture of high-performance computing
  • Turn-key big-iron platforms
  • Short-running codes
  • Modern high-performance computing
  • Less Reliable Platforms
  • Extremely large systems (millions of parts)
  • Large clusters of commodity parts
  • Grid Computing
  • Long-running codes
  • Program runtimes greatly exceed mean time to
    failure
  • ASCI, Blue Gene, PSC, Illinois Rocket Center
  • ⇒ Fault tolerance is critical

3
Fault tolerance
  • Fault tolerance comes in different flavors
  • Mission-critical systems, e.g., air traffic
    control
  • No down-time, fail-over, redundancy
  • Computational applications
  • Restart after failure, minimizing lost work
  • Guarantee progress

4
Fault Models
  • Fail-stop
  • Failed process dies silently
  • Does not send corrupted messages
  • Does not corrupt shared data
  • Byzantine
  • Arbitrary misbehavior is allowed
  • Our focus
  • Fail-stop faults

5
Fault tolerance strategies
  • Our experience
  • Scientific programs communicate more frequently
    than distributed systems.
  • Message-logging is not practical for scientific
    programs.

6
Checkpoint/restart (CPR)
  • System-level checkpointing (SLC), e.g., Condor
  • core-dump style snapshots of computations
  • very architecture and OS dependent
  • checkpoints are not portable
  • Application-level checkpointing (ALC)
  • program saves and restores its own state
  • e.g., n-body codes save and restore positions
    and velocities of particles (see the sketch
    after this list)
  • programs are self-checkpointing and
    self-restarting
  • amount of state saved can be much smaller than
    SLC
  • IBM's Blue Gene protein folding
  • Megabytes vs terabytes
  • Alegra (Sandia)
  • Application-level restart file is only 5% of
    core size
  • Disadvantages of current application-level
    checkpointing
  • manual implementation
  • requires global barriers in programs
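
To make the contrast concrete, here is a minimal sketch of manual application-level state saving for an n-body code; save_state, the file layout, and the error handling are our own illustration, not code from this work.

    /* Hypothetical sketch: an n-body code checkpoints itself by
       writing only particle positions and velocities -- far less
       state than a core-dump-style snapshot. */
    #include <stdio.h>

    void save_state(const char *path, const double *pos,
                    const double *vel, int n)
    {
        FILE *f = fopen(path, "wb");
        if (!f) return;                 /* real code would report the error */
        fwrite(&n, sizeof n, 1, f);     /* particle count */
        fwrite(pos, sizeof *pos, 3 * (size_t)n, f);  /* x, y, z each */
        fwrite(vel, sizeof *vel, 3 * (size_t)n, f);
        fclose(f);
    }

On restart, a matching load_state would read the same fields back and the program would resume from the saved time step.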

7
Our Approach
  • Automate application-level check-pointing
  • Minimize programmer annotations
  • Generalize to arbitrary MPI programs w/o barriers

[Architecture diagram: a precompiler transforms the original application into one that saves its own state; at runtime, a thin coordination layer with a failure detector sits between the application and the MPI implementation, on top of a reliable communication layer]
8
Outline
  • Saving single-process state
  • Stack, heap, globals, locals,
  • Coordination of single-process states into a
    global snapshot
  • Basic issues
  • Crossing messages, non-determinism,
  • Our protocol for point-to-point messages
  • Collective communication
  • Implementation & Results
  • Overheads are minimal

9
System Architecture: Single-Processor Checkpointing
[Same architecture diagram, with the precompiler and application state saving highlighted]
10
Precompiler
  • Where to checkpoint
  • At calls to potentialCheckpoint() function
  • Mandatory calls in main process (initiator)
  • Other calls are optional
  • Process checks if global checkpoint has been
    initiated, and if so, joins in protocol to save
    state
  • Inserted by programmer or automated tool
  • Currently inserted by programmer
  • Transformed program can save its state at calls
    to potentialCheckpoint() (see the sketch below)
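
A minimal sketch of how these calls appear in application code; the solver loop and do_timestep() are invented for illustration, while potentialCheckpoint() is the call named above.

    void do_timestep(int step);      /* application work (assumed) */
    void potentialCheckpoint(void);  /* provided by the runtime */

    void solve(int nsteps)
    {
        for (int step = 0; step < nsteps; step++) {
            do_timestep(step);
            /* if a global checkpoint has been initiated, this
               process joins the protocol and saves its state here */
            potentialCheckpoint();
        }
    }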

11
Saving Application State
  • Must save
  • Heap: we provide a special malloc that tracks the
    memory it allocates (a sketch follows this list)
  • Globals: the precompiler knows the globals and
    inserts statements to explicitly save them
  • Call Stack, Locals and Program Counter: maintain
    a separate stack which records all functions that
    got called and the local vars inside them
  • Similar to work done with PORCH (MIT)
  • PORCH is portable but not transparent to
    programmer
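
A minimal sketch, with invented names, of how such a tracking malloc might look; the checkpointer can then walk the list and write every live heap block out.

    #include <stdlib.h>

    typedef struct block {
        void         *ptr;    /* start of the allocation */
        size_t        size;   /* its length in bytes */
        struct block *next;
    } block_t;

    static block_t *heap_blocks = NULL;   /* all live allocations */

    void *ckpt_malloc(size_t size)
    {
        void *p = malloc(size);
        if (p) {
            block_t *b = malloc(sizeof *b);
            if (b) {
                b->ptr  = p;
                b->size = size;
                b->next = heap_blocks;    /* prepend to the list */
                heap_blocks = b;
            }
        }
        return p;
    }

A matching ckpt_free would unlink the block before releasing it.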

12
Example
    // VDS: variable-description stack; LS: label stack
    // recording the call path taken before the checkpoint
    main()
      int a
      VDS.push(&a, sizeof a)       // register a's address and size
      if (restart)
        load LS
        copy LS to LS.old
        jump dequeue(LS.old)       // re-follow the original call path
      // ...
      LS.push(2)                   // about to call from site 2
      label2:
      function()
      LS.pop()
      // ...
      VDS.pop()

    function()
      int b
      VDS.push(&b, sizeof b)
      if (restart)
        jump dequeue(LS.old)
      // ...
      LS.push(2)
      take_ckpt()                  // state is saved here
      label2:
      if (restart)
        load VDS
        restore variables
      LS.pop()
      // ...
      VDS.pop()

13
Reducing saved state
  • Statically determine spots in the code with the
    least amount of state
  • Determine live data at the time of a checkpoint
  • Incremental state-saving
  • Recomputation vs saving state
  • e.g., protein folding: A + B → C
  • Prior work
  • CATCH (Illinois) uses runtime learning rather
    than static analysis
  • Beck, Plank and Kingsley (UTK): memory exclusion
    analysis of static data

14
Outline
  • Saving single-process state
  • Stack, Heap, Globals, Locals,
  • Coordination of single-process states into a
    global snapshot
  • Basic issues
  • Crossing messages, non-determinism,
  • Our Protocol
  • Collective communication
  • Implementation & Results
  • Overheads are minimal

15
System Architecture: Distributed Checkpointing
[Same architecture diagram, with the thin coordination layer highlighted]
16
Need for Coordination
[Timeline diagram: processes P and Q with their checkpoints; messages on one side of the recovery line are past or future, while messages crossing it are early or late]
  • Horizontal lines: events in each process
  • Recovery line
  • line connecting checkpoints on each processor
  • represents global system state on recovery
  • Problem with communication
  • messages may cross the recovery line

17
Late Messages
[Diagram: a late message is sent before the sender's checkpoint and received after the receiver's checkpoint]
  • Must record message data at the receiver as part
    of the checkpoint
  • On recovery, re-read the recorded message data
    (a sketch follows)
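
A sketch of what the receive side might look like, with invented helper names; the real system's bookkeeping differs, but the idea is that a message stamped with an epoch older than the receiver's is late and must be logged.

    #include <mpi.h>

    extern int recording, current_epoch;
    extern int  epoch_of(const MPI_Status *st);  /* from piggybacked data */
    extern void log_late_message(const void *buf, int count,
                                 MPI_Datatype type, const MPI_Status *st);

    int ckpt_recv(void *buf, int count, MPI_Datatype type, int src,
                  int tag, MPI_Comm comm, MPI_Status *st)
    {
        int err = MPI_Recv(buf, count, type, src, tag, comm, st);
        if (err == MPI_SUCCESS && recording &&
            epoch_of(st) < current_epoch)
            log_late_message(buf, count, type, st);  /* saved with ckpt */
        return err;
    }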

18
Early Messages
[Diagram: an early message is sent after the sender's checkpoint and received before the receiver's checkpoint]
  • Must suppress the resending of message on recovery

19
Early Messages
[Diagram: the same early message, crossing the recovery line]
  • Must suppress the resending of the message on
    recovery
  • What about non-deterministic events before the
    send?
  • Must ensure the application generates the same
    early message on recovery
  • Record and replay all non-deterministic events
    between the checkpoint and the send (see the
    sketch after this list)
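
One classic source of non-determinism in MPI is the wildcard receive. A sketch with invented names: while recording, log which sender actually matched; on recovery, force the same match so the same early message is regenerated.

    #include <mpi.h>

    extern int recording, replaying;
    extern int  next_logged_source(void);   /* reads the event log */
    extern void log_source(int src);        /* appends to the event log */

    int ckpt_recv_any(void *buf, int count, MPI_Datatype type, int tag,
                      MPI_Comm comm, MPI_Status *st)
    {
        if (replaying) {
            /* replay: receive from the sender recorded originally */
            return MPI_Recv(buf, count, type, next_logged_source(),
                            tag, comm, st);
        }
        int err = MPI_Recv(buf, count, type, MPI_ANY_SOURCE,
                           tag, comm, st);
        if (err == MPI_SUCCESS && recording)
            log_source(st->MPI_SOURCE);     /* record the actual match */
        return err;
    }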

20
Difficulty of Coordination
[Diagram: P and Q checkpoint independently, with no messages between them]
  • No communication → no coordination necessary

21
Difficulty of Coordination
[Diagram: only past and future messages; the checkpoints align at a barrier]
  • No communication → no coordination necessary
  • BSP-style programs → checkpoint at barrier

22
Difficulty of Coordination
[Diagram: as above, plus a late message crossing the recovery line]
  • No communication → no coordination necessary
  • BSP-style programs → checkpoint at barrier
  • General MIMD programs

23
Difficulty of Coordination
[Diagram: past, future, and late messages around the recovery line]
  • No communication → no coordination necessary
  • BSP-style programs → checkpoint at barrier
  • General MIMD programs
  • System-level checkpointing (e.g., Chandy-Lamport)
  • Forces checkpoints to avoid early messages
  • Assumed by existing work

24
Difficulty of Coordination
[Diagram: past, future, late, and early messages around the recovery line]
  • No communication → no coordination necessary
  • BSP-style programs → checkpoint at barrier
  • General MIMD programs
  • System-level checkpointing (e.g., Chandy-Lamport)
  • Only late messages
  • Application-level checkpointing
  • Checkpoint locations are fixed; cannot force
    extra checkpoints
  • Late and early messages
  • Requires a new protocol

25
MPI-specific issues
  • Non-FIFO communication (tags)
  • Non-blocking communication
  • Collective communication
  • MPI_Reduce(), MPI_Allgather(), MPI_Bcast()
  • Internal MPI library state
  • Visible
  • non-blocking request objects, datatypes,
    communicators, attributes
  • Invisible
  • internal timers, buffers, IP address mappings,
    etc.

26
Outline
  • Saving single-process state
  • Stack, heap, globals, locals,
  • Coordination of single-process states into a
    global snapshot
  • Basic issues
  • Crossing messages, non-determinism,
  • Our Protocol
  • Collective communication
  • Implementation & Results
  • Overheads are minimal

27
The Global View
[Diagram: the executions of the initiator, P, and Q are divided into epochs 0, 1, 2, ..., n by recovery lines]
  • A program's execution is divided into a series of
    disjoint epochs
  • Epochs are separated by recovery lines
  • A failure in Epoch n means all processes roll
    back to the prior recovery line

28
Protocol Outline (I)
[Diagram: the initiator takes its checkpoint and sends pleaseCheckpoint to P and Q]
  • Initiator checkpoints, sends pleaseCheckpoint
    message to all others
  • After receiving this message, process checkpoints
    at the next available spot

29
Protocol Outline (II)
[Diagram: after checkpointing, P and Q each enter a recording phase]
  • After checkpointing, each process keeps a record
    containing
  • data of messages from the last epoch (late
    messages)
  • non-deterministic events (that happened before
    early message sends)

30
Protocol Outline (IIIa)
  • Globally, ready to stop recording when
  • all processes have received their late messages
  • all processes have sent their early messages

31
Protocol Outline (IIIb)
[Diagram: P and Q send readyToStopRecording to the initiator]
  • Locally, when a process
  • has received all its late messages
  • has sent all its early messages
  • → it sends a readyToStopRecording message to the
    Initiator

32
Protocol Outline (IV)
[Diagram: the initiator sends stopRecording to P and Q; meanwhile an application message arrives from a process that has already stopped recording]
  • When the initiator receives readyToStopRecording
    from everyone, it sends stopRecording to everyone
  • Stop recording when
  • the stopRecording message is received, OR
  • a message arrives from a process that has stopped
    recording
  • (a control-flow sketch of the protocol follows)
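
A control-flow sketch of the per-process side of the protocol, using our own names rather than the real implementation's; it only restates the transitions described on the last few slides.

    enum mode { NORMAL, RECORDING };
    static enum mode mode = NORMAL;

    extern void checkpoint_at_next_spot(void); /* next potentialCheckpoint() */
    extern int  all_late_received(void);
    extern int  all_early_sent(void);
    extern void tell_initiator_ready(void);    /* readyToStopRecording */

    void on_pleaseCheckpoint(void)
    {
        checkpoint_at_next_spot();
        mode = RECORDING;   /* log late data and non-determinism */
    }

    void after_each_send_or_recv(void)
    {
        if (mode == RECORDING && all_late_received() && all_early_sent())
            tell_initiator_ready();
    }

    /* triggered by the stopRecording message, or by receiving any
       message from a process that has already stopped recording */
    void on_stopRecording(void)
    {
        mode = NORMAL;
    }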

33
Protocol Discussion
[Diagram: before P's stopRecording arrives, P receives an application message from Q, which has already stopped recording]
  • Why can't we just wait to receive the
    stopRecording message?
  • Our record would depend on a non-deterministic
    event, invalidating it.
  • The application message may be different or may
    not be resent on recovery.

34
Protocol Details
  • Piggyback 4 bytes of control data to tell if a
    message is late, early, etc. (a sketch follows
    this list)
  • Can be reduced to 2 bits
  • On recovery
  • reinitialize MPI library using MPI_Init()
  • restore the single-process state
  • recreate datatypes and communicators
  • ensure that all calls to MPI_Send() and
    MPI_Recv() are suppressed/fed with data as
    necessary
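
A sketch of how piggybacking might be layered over MPI_Send(), with our own names; the sender's epoch is packed ahead of the payload so the receiver can classify the message (a real implementation would also handle the receive side and non-blocking calls).

    #include <mpi.h>
    #include <stdlib.h>

    extern int current_epoch;

    int ckpt_send(const void *buf, int count, MPI_Datatype type,
                  int dest, int tag, MPI_Comm comm)
    {
        int payload, pos = 0;
        MPI_Pack_size(count, type, comm, &payload);
        int total = payload + (int)sizeof current_epoch;
        char *packed = malloc(total);
        /* control word first, then the payload */
        MPI_Pack(&current_epoch, 1, MPI_INT, packed, total, &pos, comm);
        MPI_Pack((void *)buf, count, type, packed, total, &pos, comm);
        int err = MPI_Send(packed, pos, MPI_PACKED, dest, tag, comm);
        free(packed);
        return err;
    }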

35
Outline
  • Saving single-process state
  • Stack, heap, globals, locals,
  • Coordination of single-process states into a
    global snapshot
  • Basic issues
  • Crossing messages, non-determinism,
  • Our Protocol
  • Collective communication
  • Implementation & Results
  • Overheads are minimal

36
MPI Collective Communications
  • Single communication involving multiple processes
  • Single-Receiver: multiple senders, one receiver
  • e.g. Gather, Reduce
  • Single-Sender: one sender, multiple receivers
  • e.g. Bcast, Scatter
  • AlltoAll: every process in the group sends data
    to every other process
  • e.g. AlltoAll, AllGather, AllReduce, Scan
  • Barrier: everybody waits for everybody else to
    reach the barrier before going on
  • (Only collective call with explicit
    synchronization guarantee)

37
Possible Solutions
  • We have a protocol for point-to-point messages.
    Why not reimplement all collectives as
    point-to-point messages?
  • Lots of work and less efficient than native
    implementation.
  • Checkpoint collectives directly without breaking
    them up.
  • May be complex but requires no reimplementation
    of MPI internals.

38
AlltoAll Example
[Diagram: P, Q, and R each call MPI_AlltoAll(); data-flow edges form a bidirectional clique]
  • Data flows represent the application-level
    semantics of how data travels
  • Do NOT correspond to real messages
  • Used to reason about the application's view of
    communications
  • AlltoAll data flows are a bidirectional clique
    since data flows in both directions
  • The recovery line may be
  • Before the AlltoAll
  • After the AlltoAll
  • Straddling the AlltoAll

39
AlltoAll Example
[Diagram: the same AlltoAll data flows, with recovery lines drawn before and after the call]
  • Before the AlltoAll: no problem
  • On recovery the application will re-execute the
    AlltoAll
  • After the AlltoAll: no problem
  • The application won't care about the AlltoAll

40
Straddling AlltoAll: What to Do
[Diagram: the recovery line straddles the AlltoAll; P has already completed the call, and its result is part of the record]
  • Straddling the AlltoAll: the only remaining case
  • P→Q and P→R: late/early data flows
  • Record the result and replay it for P
  • Suppress P's call to MPI_AlltoAll
  • Record/replay non-determinism before P's
    MPI_AlltoAll call (see the sketch below)
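
A sketch, with invented helper names, of how a straddling collective might be handled in a wrapper: a process that completed the AlltoAll before its checkpoint replays the recorded result instead of re-entering the collective.

    #include <mpi.h>

    extern int recording, replaying;
    extern int  have_recorded_collective(void);
    extern void replay_collective(void *recvbuf);
    extern void record_collective(const void *recvbuf, int bytes);

    int ckpt_alltoall(const void *sendbuf, int sendcount,
                      MPI_Datatype sendtype, void *recvbuf,
                      int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
    {
        if (replaying && have_recorded_collective()) {
            replay_collective(recvbuf);   /* suppress the real call */
            return MPI_SUCCESS;
        }
        int err = MPI_Alltoall(sendbuf, sendcount, sendtype,
                               recvbuf, recvcount, recvtype, comm);
        if (err == MPI_SUCCESS && recording) {
            int nprocs, bytes;
            MPI_Comm_size(comm, &nprocs);
            MPI_Type_size(recvtype, &bytes);
            record_collective(recvbuf, nprocs * recvcount * bytes);
        }
        return err;
    }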

41
Collective Communication
  • Single-sender / single-receiver collectives have
    a similar solution
  • May also reissue some MPI calls
  • Barrier is very different and requires a new
    solution

42
Barrier
[Diagram: P, Q, and R each call MPI_Barrier()]
  • Recovery line before or after the Barrier: no
    problem

43
Barrier
[Diagram: the recovery line straddles the barrier]
  • Recovery line straddles the Barrier: problem!
  • No way for recovery to uphold the Barrier's
    synchronization semantics
  • No process may pass the barrier until every other
    process has reached it in real time

44
Barrier
[Diagram: a forced potential-checkpoint spot precedes the barrier on each process]
  • Solution: ensure that barriers may not straddle
    recovery lines
  • Precede each Barrier with a special checkpoint
    spot (see the sketch below)
  • If one node took a checkpoint before the Barrier,
    everybody else does too
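
The rule is simple enough to show as a wrapper; ckpt_barrier is our own name for it, not the system's API.

    #include <mpi.h>

    /* a forced potential-checkpoint spot precedes every barrier, so if
       any process checkpointed before the barrier, all others do too,
       and no recovery line can straddle the barrier */
    extern void potentialCheckpoint(void);

    int ckpt_barrier(MPI_Comm comm)
    {
        potentialCheckpoint();
        return MPI_Barrier(comm);
    }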

45
Outline
  • Saving single-process state
  • Stack, heap, globals, locals,
  • Coordination of single-process states into a
    global snapshot
  • Basic issues
  • Crossing messages, non-determinism,
  • Our Protocol
  • Collective communication
  • Implementation & Results

46
Implementation
  • Several sequential platforms (cf. Condor)
  • Linux: Dell PowerEdge 1650
  • Solaris: Sun V210
  • Two large-scale parallel platforms
  • Lemieux: 750-node AlphaServer at the Pittsburgh
    Supercomputing Center
  • Velocity 2: 128 dual-processor nodes (Windows
    cluster)
  • Benchmarks
  • NAS suite: CG, LU, SP
  • SMG2000, HPL

47
Sequential Experiments (vs Condor)
  • Checkpoint sizes are comparable.
  • Ongoing work: reduce checkpoint sizes through
    compiler analysis

48
Runtimes on Lemieux
  • Compare the original running time of the code
    with its running time under the C3 system,
    without taking any checkpoints
  • Chronic overhead is small (< 10%)

49
Runtimes on V2
  • Overheads on Windows cluster are also small,
    except for SMG2000.
  • Relatively large overhead in SMG2000 might be due
    to initialization code.

50
Overheads w/checkpointing on Lemieux
  • Configuration 1: no checkpoints
  • Configuration 2: go through the motions of taking
    one checkpoint, but nothing is written to disk
  • Configuration 3: write checkpoint data to the
    local disk on each machine
  • Measurement noise: about 2-3%
  • Conclusion: the relative cost of taking a
    checkpoint is fairly small

51
Overhead w/checkpointing on V2
  • Overheads on V2 for taking checkpoints are also
    fairly small.

52
Other work
  • Designed a similar protocol for shared-memory
    programs.
  • Implemented protocol and evaluated on SPLASH-2
    programs
  • Overheads are small (< 10%).
  • Paper in ASPLOS 2004

53
Ongoing work
  • Compiler analysis to reduce the amount of saved
    state (with Radu Rugina)
  • Identify live data
  • Incremental checkpointing
  • Recomputation vs state-saving
  • Portable checkpointing (Rohit Fernandes)
  • Restart checkpoint on different machine
  • Useful for task migration in grid environment
  • MPI-2
  • One-sided communication

54
Contributions
  • Developed system for making MPI apps fault
    tolerant
  • Precompiler-based single-process checkpointer
  • Minimal programmer annotations
  • Developed and implemented a novel protocol for
    distributed application-level checkpointing
  • Works with any single-process checkpointer
  • Can transparently handle all features of MPI
  • Non-FIFO, non-blocking, collective,
    communicators, etc.
  • Portable across MPI implementations
  • Components orthogonal
  • Can be used/applied independently
  • Extended to shared-memory (OpenMP) programs
  • Overhead is low