Title: Formal Analysis and Code Generation Support for MPI
1. Formal Analysis and Code Generation Support for MPI
- Ganesh Gopalakrishnan
- Mike Kirby
- School of Computing and Scientific Computing and Imaging (SCI) Institute
- University of Utah, Salt Lake City, UT, USA
2. We are the Gauss Group, School of Computing, University of Utah, Salt Lake City, UT
- Faculty
  - Ganesh Gopalakrishnan (main area: FV)
  - Mike Kirby (main area: HPC)
- Students
  - Robert Palmer (PhD)
  - Yu Yang (PhD)
  - Salman Pervez (PhD)
  - Sonjong Hwang (BS/MS)
  - Geoffrey Sawaya (BS)
- Summer Visitor
  - Igor Melatti
- Funding Acknowledgement
  - Microsoft HPC Institute: Formal Analysis and Code Generation Support for MPI (also supported by a complementary NSF grant, CSR-SMA: Toward Reliable and Efficient Message Passing Software Through Formal Analysis)
3. What is Formal Analysis, and Why Do It?
- The Engineering Approach: Model, Analyze, Debug, and Improve software / systems
- Success Stories
  - Windows Device Driver Development Kit
    - Initial ideas from the area of predicate abstraction for C programs (Ball, Rajamani, ...)
    - Later optimized with powerful analysis engines (Ball, Cook, ...)
    - Automated to analyze source trees and run checks
    - 4 years from possibility to reality
  - Similar story in cache coherence / MP design (SRC project at Utah)
- Our vision: make this happen for HPC software!
- Having found four focal areas, we are going after them! Many differences in terms of FV problem formulation / solution
4. Focal Areas
- Formal Modeling of MPI
- Model Checking MPI programs
- Verifying Advanced / New MPI Features (e.g., One-Sided Communication)
- Parallel Model Checking
5. Organization of the Rest of the Talk
- Modeling of the MPI Library (overview: GG)
- Model Checking, traditional and in situ (overview: GG)
- Verifying One-Sided MPI Constructs (overview: GG)
- Parallel/Distributed Model Checking (overview: GG)
- Verifying Byte-Range Locking (details: MK)
- Parallel/Distributed Model Checking (details: MK)
- Future Plans (MK)
6. Reasons for Our Thrusts
- Modeling of the MPI Library: MPI is a widely used HPC library with COMPLEX and EVOLVING semantics (one-sided, threading)
- In-situ Model Checking: large MPI programs are MPI calls hanging off a program scaffolding; hence finite-state-machine model extraction plus model checking is ineffective in many cases
- Verifying One-Sided MPI Constructs: some of the new MPI extensions are poorly understood
- Parallel Model Checking: parallelism can benefit even the verification process!
7. Brief Overview of Specific Results
8. Result 1: MPI Library Modeling
- Informal spec documents do not help answer "what ifs"; we need specs from which we can calculate outcomes
- A +CAL (Lamport, MSR) formal spec for MPI (Palmer)
  - Semantics: +CAL -> TLA+ -> mathematical logic / sets
  - Executability: +CAL -> TLC model checker -> scenario oracle
  - Analyzability: +CAL -> Boolean propositions -> answering more open-ended scenarios with all possible answers, for instance
- Coverage: covered many aspects of MPI-2
- Presentation: the formal spec / scenario oracle will be hyperlinked into the MPI HTML spec document
9. Result 2: Model Checking
- Traditional: extract / analyze control skeletons
- Now developing a finite model checker tailored for MPI (the M language and model checker), written in C
  - Integration into the Visual Studio environment (in progress)
- In-situ model checking
  - Instrument downscaled MPI programs to record states; intelligently schedule state exploration
  - Socket-based instrumentation (prototype runs)
  - PMPI-based instrumentation (collaboration with Argonne)
10. Result 3: Verifying One-Sided MPI Constructs
- MPI's one-sided primitives are poorly understood: bugs in simple algorithms, proposed fixes, ...
- The semantics are non-trivial (writes / reads into windows are unordered), much like weak memory models
- Pervez analyzed and found a bug in the byte-range locking protocol published at EuroPVM 2005 (by Thakur and Latham)
- Proposed two fixes (joint work with Thakur, Gropp); details in Mike's talk
11. Result 4: Parallel Model Checking
- Finite-state verification models of even downscaled MPI programs are VERY large
- Parallel and distributed exploration of these FSMs using clusters running MPI is attractive!
- A model checker, Eddy, has been developed
  - Two threads per CPU: one thread for state generation, another for state send / receive
  - States coalesced into lines before shipping
  - Win32 threads / MCC porting done
  - Gives linear speedups (CPU and memory across clusters utilized)
12. Details: Verification of a Byte-Range Locking Algorithm
13. Byte-Range Locking Using MPI One-Sided Communication
- One process makes its memory space available for communication
- Global state is stored in this memory space
- Each process is associated with a flag, a start value, and an end value, stored in array A
- Pi's flag value is in A[3i], for all i
14. Lock Acquire

lock_acquire(start, end)
 1  val[0] = 1 /* flag */; val[1] = start; val[2] = end
 2  while (1) {
 3    lock_win
 4    place val in win
 5    get values of other processes from win
 6    unlock_win
 7    for all i, if (Pi conflicts with my range)
 8      conflict = 1
 9    if (conflict) {
10      val[0] = 0
11      lock_win
12      place val in win
13      unlock_win
14      MPI_Recv(ANY_SOURCE)
15    }
16    else {
17      /* lock is acquired */
18      break
19    } }
15. Lock Release

lock_release(start, end)
  val[0] = 0 /* flag */; val[1] = -1; val[2] = -1
  lock_win
  place val in win
  get values of other processes from win
  unlock_win
  for all i, if (Pi conflicts with my range)
    MPI_Send(Pi)
16. Error Trace
17. Error Discussion
- Problem: too many Sends, not enough Recvs
- Not really a problem?
  - Messages are 0 bytes only
  - Sends could be made non-blocking
- Maybe a problem:
  - Even 0-byte messages cause unknown memory leaks; not desirable
  - More importantly, if there are unused Sends in the system, processes that were supposed to be blocked may wake up by consuming these. This ties up processor resources and hurts performance!
18. Some Details on Parallel Model Checking: The Eddy-Murphi Model Checker
19. Parallel Model Checking
- Each computation node owns a portion of the state space
- Each node locally stores and analyzes its own states
- Newly generated states that do not belong to the current node are sent to the owner node
- A standard distributed algorithm may be chosen for termination
20. Eddy Algorithm
- For each node, two threads are used
  - The worker thread analyzes, generates, and partitions states; if there are no states to be visited, it sleeps
  - The communication thread repeatedly sends/receives states to/from the other nodes; it also handles termination
- Communication between the two threads
  - Via shared memory
  - Via mutex / signal primitives
21. Worker Thread / Communication Thread
[Diagram: the worker thread takes a state off the consumption queue, expands it to get a new set of states, and decides (via the hash table) where each new state goes, placing remote states on the communication queue; the communication thread receives and processes inbound messages, initiates Isends, and checks completion of Isends.]
22. The Communication Queue
- There is one communication queue for each node
- Each communication queue has N lines and M states per line
- State additions are made (by the worker thread) only on one active line
- The other lines may be already full or empty
23. Eddy-Murphi Performance
24. Eddy-Murphi Performance
25. Summary
- Complex systems can benefit from formal methods
- Rich interdisciplinary ground for FM and HPC to interact
- Win-win scenarios exist
26. Future Plans
- Publish a formal and analyzable MPI-2 spec
- Develop formal property-based tests for MPI libraries such as MCCS
- Build a model checker for MPI programs that treats MPI calls as primitives (more efficient partial-order reductions, thanks to MPI semantics)
- Large-scale Eddy-Murphi runs on MCCS
- In-situ model checker for MCCS MPI programs
- Release of tools
27. Software and Publications
- Eddy software available
- Formal spec of MPI-2
  - A preliminary stable version will be hyperlinked into the HTML MPI-2 spec
- Two workshop papers submitted (Yu, Palmer) to Verify06 and RTVA06
- EuroPVM06 paper (Pervez, Ganesh, Mike, Gropp, Thakur) accepted
28. Workshop on Thread Verification (TV06, sponsored by Microsoft)
- 2-day program, 13 papers, 5 invited talks
- Seattle, WA, Aug 21-22
- See http://www.cs.utah.edu/tv06
29. Extra Slides
30. Our Approach Toward Code Generation Support for MPI
- Describe an executable MPI-2 semantic model
  - Develop MPI formal semantics
  - Develop a facility to calculate scenario outcomes from the formal semantics
  - Develop test suites for MPI libraries based on the formal semantics
- Detect errors in user MPI code
  - Deadlocks
  - Resource (e.g., buffer) usage violations
- Support code optimizations
  - Develop a facility to show correctness of designer-introduced code transformations
31. The Communication Queue
- Summing up, this is the evolution of a line's status: WTBA -> Active -> WTBS -> CBS
32. Eddy-Murphi Performance
- Tuning of the communication queuing mechanism
  - A high number of states per line is required: it is much better to send many states at a time
  - Not too few lines, or the worker will not be able to submit new states
33. Eddy-Murphi Performance
- Comparison with previous versions of parallel Murphi: when ported to MPI, old versions of parallel Murphi perform worse than serial Murphi
- Comparison with serial Murphi: almost linear speedup is expected
34. A Formal Model of MPI Process Interaction
- Robert Palmer, Mike Kirby, Ganesh Gopalakrishnan
35. Goals
- Create a human-readable and understandable formal document to supplement the written standard
  - +CAL: ZF set theory + first-order logic + weak temporal logic (TLA)
  - Sequencing, scope, processes (pseudo-code flavored)
- Model behaviors of MPI so as to preserve correctness properties
  - Deadlock
  - Local assertion violations
- Capture communication behavior implied by the MPI standard
  - Standard/asynchronous-mode point-to-point operations
  - Collective operations
- Set the formality sufficiently high that automated reasoning assistance can be applied
  - Theorem proving and model checking
  - Useful for discovering difficult-to-reproduce errors in reactive systems
- Cover an interesting (but still limited) subset of the standard
  - Point-to-point and collective communications that transmit data
  - Wait on topology, communicator manipulation, etc.
  - Abstract away most of the data passing (i.e., focus on the semantics of communication)
  - Assume a static process system
36. Communicator
- Defines the communication universe for processes in MPI
  - Group (set of processes and a ranking function)
  - Context (where messages are posted)
  - Virtual topology
  - Attributes
- Shared-memory based model
  - Communicator objects are globally visible to processes in the computation
  - Accessed by an integer handle, as in MPICH

(* The set of communicator objects in the computation. *)
variables comm = [i \in (0..(MAX_COMM - 1)) |->
    [p2p |-> [data |-> 0, src |-> 0, dest |-> 0,
              type |-> "MPI_INT", tag |-> 1, state |-> "vacant"],
     col |-> [root |-> 0, type |-> "MPI_INT", state |-> "vacant",
              processes_in |-> {}, participants |-> {}],
     group |-> {pr : pr \in (0..(N-1))},
     groupsize |-> N,
     ranking |-> [p \in (0..(N-1)) |-> 0],
     inverseranking |-> [p \in (0..(N-1)) |-> 0],
     lastrank |-> 0]];
37. Point-to-Point Basis
- Communication happens with respect to the context of a communicator
- Algorithm for posting a point-to-point send (receive) to the context:
  - When the state of the context is vacant, the send (receive) can be posted directly to the context. The sender (receiver) then waits for the state to change to transmitting, indicating that the matching receive has been posted. The sender (receiver) then changes the state of the context back to vacant and exits.
  - Or, when the state of the context is initially recv (send) and the message matches the message already posted to the context, the sender (receiver) changes the state of the context to transmitting and exits.

post_send1: either when comm[c].p2p.state = "vacant";
                comm[c].p2p := m;
    post_send2: when comm[c].p2p.state = "transmitting";
                comm[c].p2p.state := "vacant";
  or when comm[c].p2p.state = "recv"
       /\ comm[c].p2p.data = m.data
       /\ comm[c].p2p.src = m.src
       /\ comm[c].p2p.dest = m.dest
       /\ comm[c].p2p.tag = m.tag
       /\ comm[c].p2p.type = m.type;
     comm[c].p2p.state := "transmitting";
  end either;
38. MPI-Specified P2P Operations
- Asynchronous-mode send start
  - Enqueue communication requests into a process-local FIFO queue
  - Consume buffer or check for a match, as appropriate
- Asynchronous-mode send complete
  - Post messages from the process-local FIFO to the context of the communicator until the requested message completes
  - Release resources as appropriate
- Standard-mode communications
  - Couple corresponding asynchronous start and complete
39. Collective Operations
- Three flavors of collective operations (recall: data is abstracted away)
  1. Require all processes to wait until some particular process has entered the communication (MPI_Bcast, MPI_Scatter)
  2. Require at least one process to wait until all processes have entered the communication (MPI_Gather, MPI_Reduce)
  3. Require all processes to wait until all processes have entered the communication (MPI_Barrier, MPI_Allreduce)
- The standard does not require, but allows, (1) and (2) above
40. Type 1 Collective
- Three states and three transitions are used to implement this protocol
- All processes entering the communication add themselves to the participant set. These processes then wait until the root process is in the participant set. All processes are allowed to exit once root has entered the communication. The last process to exit resets the collective state and participant set.

col_one1: when (comm[c].col.state = "vacant"
                \/ comm[c].col.state = "collective in")
            /\ self \notin comm[c].col.participants;
    comm[c] := [comm[c] EXCEPT !.col.state = "collective in",
                !.col.participants = @ \cup {self},
                !.col.processes_in = @ \cup {self}];
col_one2: when root \in comm[c].col.participants;
    if (comm[c].col.processes_in = {self}
        /\ comm[c].col.participants = comm[c].group) then
        comm[c] := [comm[c] EXCEPT !.col.state = "vacant",
                    !.col.participants = {}, !.col.processes_in = {}];
    else
        comm[c] := [comm[c] EXCEPT !.col.processes_in = @ \ {self}];
    end if;
41. Type 2 Collective
- Three states and four transitions are used to implement this protocol
- If the process entering the communication is not root, it adds itself to the set of participants and exits. If the process is root, it also adds itself to the set of participants. The root process then waits until the set of participants is equal to the group for the communicator. The root process is then allowed to exit.

col_two1: either when comm[c].col.state = "vacant";
        comm[c].col.state := "col_two in";
        comm[c].col.participants := {self};
    or when comm[c].col.state = "col_two in";
        comm[c].col.participants := comm[c].col.participants \cup {self};
    end either;
col_two2: if comm[c].ranking[self] = root then
        when comm[c].col.participants = comm[c].group;
        comm[c].col.state := "vacant";
        comm[c].col.participants := {};
    end if;
42. Type 3 Collective
- The algorithm requires three states and five transitions
- The first process to post this type of message changes the state from vacant to "col3 in". Other processes in the communicator are collected until all processes have joined the communication. The guard at col3_2 then becomes enabled, changing the state to "col3 out". All processes then exit the communication. The last process to exit returns the state of the collective context to vacant.

col3_1: either when comm[c].col.state = "vacant";
        (* First process in the barrier. *)
        comm[c] := [comm[c] EXCEPT !.col.state = "col3 in",
                    !.col.participants = @ \cup {self}];
    or when comm[c].col.state = "col3 in"
         /\ self \notin comm[c].col.participants;
        comm[c] := [comm[c] EXCEPT !.col.participants = @ \cup {self}];
    end either;
col3_2: either when comm[c].col.participants = comm[c].group;
        (* First out. *)
        comm[c] := [comm[c] EXCEPT !.col.participants = @ \ {self},
                    !.col.state = "col3 out"];
    or when comm[c].col.state = "col3 out"
         /\ comm[c].col.participants /= {self};
        (* Middle out. *)
        comm[c] := [comm[c] EXCEPT !.col.participants = @ \ {self}];
    or when comm[c].col.state = "col3 out"
         /\ comm[c].col.participants = {self};
        (* Last out. *)
        comm[c] := [comm[c] EXCEPT !.col.state = "vacant",
                    !.col.participants = {}];
    end either;
43. Motivation
- Software model checking in the presence of library calls is hard
  - Model extraction and verification may not scale
  - It may also miss bugs due to modeling assumptions
- In-situ model checking can help
  - Run the instrumented program directly, with an external scheduler to control the program's execution
  - Need to avoid redundant interleavings
  - Need to retain enough central scheduling control
44. Avoiding Redundant Interleavings with DPOR [1]
- Static partial-order reduction relies on static analysis to yield approximate information about run-time behavior
  - Pointers -> coarse information -> limited POR -> state explosion
- Partial-order reduction explores a subset of the state space without sacrificing soundness
- Dynamic partial-order reduction
  - While the model checker explores the program's state space, it sees exactly which threads access which locations
  - Use this to reduce the state space while model checking
45. An Example of DPOR
46. In-Situ Model Checking of Concurrent Programs
- 1. Instrument the program to add request/permit routines before the synchronization routines / accesses to shared objects (this phase can be automated)
- 2. Compile and run the program, and have the scheduler record the request trace
- 3. The scheduler generates other possible interleavings
- 4. Run the program again according to those interleavings
47. Current Status
- Implementation underway
- Efficiency will be measured: naive schedules vs. DPOR-generated schedules
- Rules for filtering infeasible executions; proving them complete
48. References
- [1] Cormac Flanagan and Patrice Godefroid, "Dynamic Partial-Order Reduction for Model Checking Software," POPL 2005.
49. Independent Transitions [1]
- B and R are independent transitions if
  - they commute: B; R = R; B
  - neither enables nor disables the other
- Example: x = 3 and y = 1 are independent
50. Byte-Range Locking Using MPI One-Sided Communication
- One process makes its memory space available for communication
- Global state is stored in this memory space
- Each process is associated with a flag, a start value, and an end value, stored in array A
- Pi's flag value is in A[3i], for all i
51. Lock Acquire

lock_acquire(start, end)
 1  val[0] = 1 /* flag */; val[1] = start; val[2] = end
 2  while (1) {
 3    lock_win
 4    place val in win
 5    get values of other processes from win
 6    unlock_win
 7    for all i, if (Pi conflicts with my range)
 8      conflict = 1
 9    if (conflict) {
10      val[0] = 0
11      lock_win
12      place val in win
13      unlock_win
14      MPI_Recv(ANY_SOURCE)
15    }
16    else {
17      /* lock is acquired */
18      break
19    } }
52. Lock Release

lock_release(start, end)
  val[0] = 0 /* flag */; val[1] = -1; val[2] = -1
  lock_win
  place val in win
  get values of other processes from win
  unlock_win
  for all i, if (Pi conflicts with my range)
    MPI_Send(Pi)
53. Error Trace
54. Solution: Picking
- Main idea: the process about to be blocked (Pi) picks who will wake it up (Pj), and indicates so by writing to shared memory in lines 11 and 13
- It is possible that Pj leaves before seeing this information. If this is the case, the next conflicting process (Pk) must be picked
- If no such Pk exists, the lock must be retried
- Similarly, if picking Pj causes deadlock, Pk must be picked instead
55. Avoiding Deadlock
- Once processes declare their intentions globally, deadlock can be avoided
- For there to be deadlock, a dependency cycle must exist
- The last process to complete this cycle will know this, and must not do so
56. Lock Acquire: Picking

lock_acquire(start, end)
 1  val[0] = 1 /* flag */; val[1] = start; val[2] = end; val[3] = -1 /* pick */
 2  while (1) {
 3    lock_win
 4    place val in win
 5    get values of other processes from win
 6    unlock_win
 7    for all i, if (Pi conflicts with my range)
 8      conflict = 1; remember Pi
 9    if (conflict with Pi) {
10      val[0] = 0; val[3] = Pi
11      lock_win
12      place val in win
13      unlock_win
14      if (Pi has left || deadlock_possible) { i++; goto 9 }
15      else MPI_Recv(ANY_SOURCE)
16    }
17    else ...
57. Orthogonal Composition of Primitives
- Can perhaps model MPI this way: Bsend-init, Rsend-init, Recv-init, Probe, Test, Wait on the LHS; nonlocal is a problem (need conceptual bits)
- Send -> {S, R, B} -> Ssend, Rsend, Bsend -> I -> Issend, Irsend, Ibsend
- Issend and Irsend are global; the other I-variants are not
58. Expensive Resources are Involved!
- $10k/week on Blue Gene (180 GFLOPS) at IBM's Deep Computing Lab
- 136,800 GFLOPS max
59. HPC Software Development is Inherently Quite Complex!
- Understand what is simulated (the physics, biology, etc.)
- Develop a mathematical model
- Generate a numerical discretization of the model
- Solve the numerical problem
- Experiment with a serial solution (gain understanding)
- Develop a parallel algorithm
- ...
- Check consistency with physics, etc. (energy conservation)
- Improve load balancing
- Fight grubby realities of libraries, program complexity, ...
60. General Reasons for Picking Our Four Thrusts
- Natural conclusion of our own study of key issues in the area
- Close to no previous research in the area
- Complementary strengths of PIs: Mike (HPC) and Ganesh (FV)
- Opportunity and luck
  - NSF funding, MS funding
  - Past HPC and FV research at Utah
  - Collaboration with Argonne
- Enthusiastic students!
61. Specific Reasons for Our Thrusts
- Modeling of the MPI Library: MPI is a widely used HPC library with COMPLEX and EVOLVING semantics
- In-situ Model Checking: large MPI programs are MPI calls hanging off a program scaffolding; hence finite-state-machine model extraction plus model checking is ineffective in many cases
- Verifying One-Sided MPI Constructs: some of the new MPI extensions are extremely poorly understood
- Parallel Model Checking: parallelism can benefit even the verification process!
62. Some Nasty Realities
- MPI programming
  - Code optimized to take advantage of relaxed send / receive / probe orderings may be buggy
  - Too many MPI functions (over 200 in MPI-2) -> misunderstandings, buggy MPI libraries
- Threads and MPI often used together
  - Thread programming bugs, unexpected interactions
- MPI libraries vary in the allowed range of semantics
  - Code that takes advantage of one library doesn't port
63. What is Model-Checking-Based Formal Verification?
- The analog of wind-tunnel testing of airplane wings, for programs and hardware designs
  - Build a scaled-down model, retaining essential features
  - Exhaustively verify the model
- Experience shows that exhaustive verification of a downscaled model is often superior to ad hoc testing of full system descriptions, which have a HUGE state space
- Key advances in recent times
  - Very large models can be verified
  - Richer assertions can be verified
- Yet, little (or no) work in FV for HPC
64. Yet, Verification (Bug-Hunting) is Only a Small Part of the Complex Picture!
- Provide formal models that unambiguously specify MPI library function semantics
- Make these formal models runnable
- Use the formal specification to assist MPI program transformations
- Help debug large MPI programs and / or MPI library implementations
- Study specific tricky MPI constructs and programs (e.g., locking protocols implemented using MPI's new one-sided constructs)
- Speed up model checking by using multiple threads and MPI processes
65. Past Work on FV for HPC
- Avrunin and Siegel have published the following results
  - Hand-modeled MPI library functions in Promela
  - Identified some issues (bugs?) with the help of SPIN used as a model checker
  - Proved that, under certain conditions, showing absence of deadlocks in an MPI program using synchronous communications guarantees absence of deadlocks even after the communications are switched back to being asynchronous
- Pioneering work
  - But none recently
  - Lots of areas not covered
66. Example 1: Modeling of the MPI Library
67. Variety of Bugs Common in Parallel Scientific Programs
- Deadlock
- Communication race conditions
- Misunderstanding the semantics of MPI procedures
- Resource-related assumptions
- Incorrectly matched sends/receives
68. State of the Art in Debugging
- TotalView
  - Parallel debugger; trace visualization
- Parallel DBX
- gdb
- MPICHECK
  - Does some deadlock checking
  - Uses trace analysis
69. Related Work
- Verification of wildcard-free models [Siegel, Avrunin, 2005]
  - Deadlock-free with length-zero buffers => deadlock-free with buffers of length greater than zero
- SPIN models of MPI programs [Avrunin, Siegel, Siegel, 2005] and [Siegel, Mironova, Avrunin, Clarke, 2005]
  - Compare serial and parallel versions of numerical computations for numerical equivalence
70. The Big Picture

MPI library model (Promela fragment):

  proctype MPI_Send(chan out; int c) { out!c }
  proctype MPI_Bsend(chan out; int c) { out!c }
  proctype MPI_Isend(chan out; int c) { out!c }
  typedef MPI_Status { int MPI_SOURCE; int MPI_TAG; int MPI_ERROR }

Program model (Promela fragment):

  int y;
  active proctype T1() { int x; x = 1; if :: x == 0 -> x = 2 fi; y = x }
  active proctype T2() { int x; x = 2; if :: y == x -> y = 0 fi; assert(y == 0) }

[Diagram: a compiler / model generator turns the user program and the MPI library model into a program model in the modeling language, together with an environment model, error simulator, abstractor, and refinement loop; an MC server farms the model out to many MC clients, and a result analyzer reports the outcome ("OK" or an error trace).]
71. Goal
- Verification / transformation of MPI programs
  - "It is nice that you may be able to show my program does not deadlock, but can you make it faster?"
- Verification of safety properties
- Automatic optimization through verifiably safe transformation (replacing Send with Isend/Wait, etc.)
72. Example 2: In-Situ Model Checking
73. Motivation
- Software model checking in the presence of library calls is hard
  - Model extraction and verification may not scale
  - It may also miss bugs due to modeling assumptions
- In-situ model checking can help
  - Run the instrumented program directly, with an external scheduler to control the program's execution
  - Need to avoid redundant interleavings
  - Need to retain enough central scheduling control
74. In-Situ Model Checking of Concurrent Programs
- 1. Instrument the program to add request/permit routines before the synchronization routines / accesses to shared objects (this phase can be automated)
- 2. Compile and run the program, and have the scheduler record the request trace
- 3. The scheduler generates other possible interleavings
- 4. Run the program again according to those interleavings