Title: Formal Analysis and Code Generation Support for MPI
1. Formal Analysis and Code Generation Support for MPI
- Ganesh Gopalakrishnan
- Mike Kirby
- School of Computing and Scientific Computing and Imaging (SCI) Institute
- University of Utah, Salt Lake City, UT, USA
2. We are the Gauss Group, School of Computing, University of Utah, Salt Lake City, UT
- Faculty
  - Ganesh Gopalakrishnan (main area: FV)
  - Mike Kirby (main area: HPC)
- Students
  - Robert Palmer (PhD)
  - Yu Yang (PhD)
  - Salman Pervez (PhD)
  - Sonjong Hwang (BS/MS)
  - Geoffrey Sawaya (BS)
- Summer Visitor
  - Igor Melatti
- Funding Acknowledgement
  - Microsoft HPC Institute: Formal Analysis and Code Generation Support for MPI (also supported by a complementary NSF grant, CSR-SMA: Toward Reliable and Efficient Message Passing Software Through Formal Analysis)
3. What is Formal Analysis, and Why Do It?
- The Engineering Approach: Model, Analyze, Debug, and Improve software / systems
- Success Stories
  - Windows Device Driver Development Kit
    - Initial ideas from the area of predicate abstraction for C programs (Ball, Rajamani, ...)
    - Later optimized with powerful analysis engines (Ball, Cook, ...)
    - Automated to analyze source trees and run checks
    - 4 years from possibility to reality
  - Similar story in cache coherence / MP design (SRC project at Utah)
- Our vision: make this happen for HPC software!
- Having found four focal areas, we are going after them! Many differences in terms of FV problem formulation / solution
4. Focal Areas
- Formal Modeling of MPI
- Model Checking MPI programs
- Verifying Advanced / New MPI Features (e.g., One-Sided Communication)
- Parallel Model Checking
5. Organization of the Rest of the Talk
- Modeling of the MPI Library (overview: GG)
- Model Checking, traditional and in situ (overview: GG)
- Verifying One-Sided MPI Constructs (overview: GG)
- Parallel/Distributed Model Checking (overview: GG)
- Verifying Byte-Range Locking (details: MK)
- Parallel/Distributed Model Checking (details: MK)
- Future Plans (MK)
6. Reasons for Our Thrusts
- Modeling of the MPI Library: MPI is a widely used HPC library with COMPLEX and EVOLVING semantics (one-sided, threading)
- In-situ Model Checking: large MPI programs are MPI calls hanging off a program scaffolding; hence finite-state-machine model extraction plus model checking is ineffective in many cases
- Verifying One-Sided MPI Constructs: some of the new MPI extensions are poorly understood
- Parallel Model Checking: parallelism can benefit even the verification process!
7. Brief Overview of Specific Results
8. Result 1: MPI Library Modeling
- Informal spec documents do not help answer "what ifs"; we need specs from which we can calculate outcomes
- A +CAL (Lamport, MSR) formal spec for MPI (Palmer)
  - Semantics: +CAL -> TLA+ -> mathematical logic / sets
  - Executability: +CAL -> TLC model checker -> scenario oracle
  - Analyzability: +CAL -> Boolean propositions -> answering more open-ended scenarios with all possible answers, for instance
- Coverage: covered many aspects of MPI-2
- Presentation: the formal spec / scenario oracle will be hyperlinked into the MPI HTML spec document
9. Result 2: Model Checking
- Traditional: extract / analyze control skeletons
- Now developing a finite model checker tailored for MPI (the M language and model checker), written in C
  - Integration into the Visual Studio environment (in progress)
- In-situ model checking
  - Instrument downscaled MPI programs to record states; intelligently schedule state exploration
  - Socket-based instrumentation (prototype runs)
  - PMPI-based instrumentation (collaboration with Argonne)
10. Result 3: Verifying One-Sided MPI Constructs
- MPI's one-sided primitives are poorly understood: bugs in simple algorithms, proposed fixes, ...
- The semantics are non-trivial (writes / reads into windows are unordered), much like weak memory models
- Pervez analyzed and found a bug in the byte-range locking protocol published at EuroPVM 2005 (by Thakur and Latham)
- Proposed two fixes (joint work with Thakur, Gropp); details in Mike's talk
11. Result 4: Parallel Model Checking
- Finite-state verification models of even downscaled MPI programs are VERY large
- Parallel and distributed exploration of these FSMs using clusters running MPI is attractive!
- A model checker, Eddy, has been developed
  - Two threads per CPU: one thread for state generation, another for state send / receive
  - States coalesced into lines before shipping
  - Win32 threads / MCC porting done
  - Gives linear speedups (CPU and memory across clusters utilized)
12. Details: Verification of a Byte-Range Locking Algorithm
13. Byte-Range Locking Using MPI One-Sided Communication
- One process makes its memory space available for communication
- Global state is stored in this memory space
- Each process is associated with a flag, a start value, and an end value, stored in array A
- Pi's flag value is in A[3i], for all i
14. Lock Acquire

lock_acquire(start, end)
 1  val[0] = 1 /* flag */; val[1] = start; val[2] = end
 2  while (1) {
 3    lock_win
 4    place val in win
 5    get values of other processes from win
 6    unlock_win
 7    for all i, if (Pi conflicts with my range)
 8      conflict = 1
 9    if (conflict) {
10      val[0] = 0
11      lock_win
12      place val in win
13      unlock_win
14      MPI_Recv(ANY_SOURCE)
15    }
16    else {
17      /* lock is acquired */
18      break
19    } }
15. Lock Release

lock_release(start, end)
  val[0] = 0 /* flag */; val[1] = -1; val[2] = -1
  lock_win
  place val in win
  get values of other processes from win
  unlock_win
  for all i, if (Pi conflicts with my range)
    MPI_Send(Pi)
16. Error Trace
17. Error Discussion
- Problem: too many Sends, not enough Recvs
- Not really a problem?
  - Messages are 0 bytes only
  - Sends could be made non-blocking
- Maybe a problem:
  - Even 0-byte messages cause unknown memory leaks; not desirable
  - More importantly, if there are unused Sends in the system, processes that were supposed to be blocked may wake up by consuming these. This ties up processor resources and hurts performance!
18. Some Details on Parallel Model Checking: The Eddy-Murphi Model Checker
19. Parallel Model Checking
- Each computation node owns a portion of the state space
- Each node locally stores and analyzes its own states
- Newly generated states that do not belong to the current node are sent to the owner node
- A standard distributed algorithm may be chosen for termination
20. Eddy Algorithm
- For each node, two threads are used
  - The worker thread analyzes, generates, and partitions states; if there are no states to be visited, it sleeps
  - The communication thread repeatedly sends/receives states to/from the other nodes; it also handles termination
- Communication between the two threads
  - Via shared memory
  - Via mutex / signal primitives
21. Worker Thread / Communication Thread
[Diagram: the worker thread takes a state off the consumption queue, expands it to get a new set of states, and decides (via the hash table) where each new state goes, placing remote states on the communication queue; the communication thread receives and processes inbound messages, initiates Isends, and checks completion of Isends.]
22. The Communication Queue
- There is one communication queue for each node
- Each communication queue has N lines and M states per line
- State additions are made (by the worker thread) only on one active line
- The other lines may be already full or empty
23. Eddy-Murphi Performance
24. Eddy-Murphi Performance
25. Summary
- Complex systems can benefit from formal methods
- Rich interdisciplinary ground for FM and HPC to interact
- Win-win scenarios exist
26. Future Plans
- Publish a formal and analyzable MPI-2 spec
- Develop formal property-based tests for MPI libraries such as MCCS
- Build a model checker for MPI programs that treats MPI calls as primitives (more efficient partial-order reductions, thanks to MPI semantics)
- Large-scale Eddy-Murphi runs on MCCS
- In-situ model checker for MCCS MPI programs
- Release of tools
27. Software and Publications
- Eddy software available
- Formal spec of MPI-2
  - A preliminary stable version will be hyperlinked into the HTML MPI-2 spec
- Two workshop papers submitted (Yu, Palmer) to Verify06 and RTVA06
- EuroPVM06 paper (Pervez, Ganesh, Mike, Gropp, Thakur) accepted
28. Workshop on Thread Verification (TV06, sponsored by Microsoft)
- 2-day program, 13 papers, 5 invited talks
- Seattle, WA, Aug 21-22
- See http://www.cs.utah.edu/tv06
29. Extra Slides
30. Our Approach Toward Code Generation Support for MPI
- Describe an executable MPI-2 semantic model
  - Develop MPI formal semantics
  - Develop a facility to calculate scenario outcomes from the formal semantics
  - Develop test suites for MPI libraries based on the formal semantics
- Detect errors in user MPI code
  - Deadlocks
  - Resource (e.g., buffer) usage violations
- Support code optimizations
  - Develop a facility to show correctness of designer-introduced code transformations
31. The Communication Queue
- Summing up, this is the evolution of a line's status: WTBA -> Active -> WTBS -> CBS
32. Eddy-Murphi Performance
- Tuning of the communication queuing mechanism
  - A high number of states per line is required: it is much better to send many states at a time
  - Not too few lines, or the worker will not be able to submit new states
33. Eddy-Murphi Performance
- Comparison with previous versions of parallel Murphi: when ported to MPI, old versions of parallel Murphi perform worse than serial Murphi
- Comparison with serial Murphi: almost linear speedup is expected
34. A Formal Model of MPI Process Interaction
- Robert Palmer, Mike Kirby, Ganesh Gopalakrishnan
35. Goals
- Create a human-readable and understandable formal document to supplement the written standard
  - +CAL: ZF set theory + first-order logic + weak temporal logic (TLA)
  - Sequencing, scope, processes (pseudo-code flavored)
- Model behaviors of MPI so as to preserve correctness properties
  - Deadlock
  - Local assertion violations
- Capture communication behavior implied by the MPI standard
  - Standard/asynchronous-mode point-to-point operations
  - Collective operations
- Set the formality sufficiently high that automated reasoning assistance can be applied
  - Theorem proving and model checking
  - Useful for discovering difficult-to-reproduce errors in reactive systems
- Cover an interesting (but still limited) subset of the standard
  - Point-to-point and collective communications that transmit data
  - Wait on topology, communicator manipulation, etc.
  - Abstract away most of the data passing (i.e., focus on the semantics of communication)
  - Assume a static process system
36. Communicator
- Defines the communication universe for processes in MPI
  - Group (set of processes and a ranking function)
  - Context (where messages are posted)
  - Virtual topology
  - Attributes
- Shared-memory based model
  - Communicator objects are globally visible to processes in the computation
  - Accessed by an integer handle, as in MPICH

(* The set of communicator objects in the computation. *)
variables comm = [i \in (0..(MAX_COMM - 1)) |->
    [p2p |-> [data |-> 0, src |-> 0, dest |-> 0,
              type |-> "MPI_INT", tag |-> 1, state |-> "vacant"],
     col |-> [root |-> 0, type |-> "MPI_INT", state |-> "vacant",
              processes_in |-> {}, participants |-> {}],
     group |-> {pr : pr \in (0..(N-1))},
     groupsize |-> N,
     ranking |-> [p \in (0..(N-1)) |-> 0],
     inverseranking |-> [p \in (0..(N-1)) |-> 0],
     lastrank |-> 0]];
37. Point-to-Point Basis
- Communication happens with respect to the context of a communicator
- Algorithm for posting a point-to-point send (receive) to the context:
  - When the state of the context is vacant, the send (receive) can be posted directly to the context. The sender (receiver) then waits for the state to change to transmitting, indicating that the matching receive has been posted. The sender (receiver) then changes the state of the context back to vacant and exits.
  - Or, when the state of the context is initially recv (send) and the message matches the message already posted to the context, the sender (receiver) changes the state of the context to transmitting and exits.

post_send1: either when comm[c].p2p.state = "vacant";
                comm[c].p2p := m;
    post_send2: when comm[c].p2p.state = "transmitting";
                comm[c].p2p.state := "vacant";
  or when comm[c].p2p.state = "recv"
       /\ comm[c].p2p.data = m.data
       /\ comm[c].p2p.src = m.src
       /\ comm[c].p2p.dest = m.dest
       /\ comm[c].p2p.tag = m.tag
       /\ comm[c].p2p.type = m.type;
     comm[c].p2p.state := "transmitting";
  end either;
38. MPI-Specified P2P Operations
- Asynchronous-mode send start
  - Enqueue communication requests into a process-local FIFO queue
  - Consume buffer or check for a match, as appropriate
- Asynchronous-mode send complete
  - Post messages from the process-local FIFO to the context of the communicator until the requested message completes
  - Release resources as appropriate
- Standard-mode communications
  - Couple corresponding asynchronous start and complete
39. Collective Operations
- Three flavors of collective operations (recall: data is abstracted away)
  1. Require all processes to wait until some particular process has entered the communication (MPI_Bcast, MPI_Scatter)
  2. Require at least one process to wait until all processes have entered the communication (MPI_Gather, MPI_Reduce)
  3. Require all processes to wait until all processes have entered the communication (MPI_Barrier, MPI_Allreduce)
- The standard does not require, but allows, (1) and (2) above
40. Type 1 Collective
- Three states and three transitions are used to implement this protocol
- All processes entering the communication add themselves to the participant set. These processes then wait until the root process is in the participant set. All processes are allowed to exit once root has entered the communication. The last process to exit resets the collective state and participant set.

col_one1: when (comm[c].col.state = "vacant"
                \/ comm[c].col.state = "collective in")
            /\ self \notin comm[c].col.participants;
    comm[c] := [comm[c] EXCEPT !.col.state = "collective in",
                !.col.participants = @ \cup {self},
                !.col.processes_in = @ \cup {self}];
col_one2: when root \in comm[c].col.participants;
    if (comm[c].col.processes_in = {self}
        /\ comm[c].col.participants = comm[c].group) then
        comm[c] := [comm[c] EXCEPT !.col.state = "vacant",
                    !.col.participants = {}, !.col.processes_in = {}];
    else
        comm[c] := [comm[c] EXCEPT !.col.processes_in = @ \ {self}];
    end if;
41. Type 2 Collective
- Three states and four transitions are used to implement this protocol
- If the process entering the communication is not root, it adds itself to the set of participants and exits. If the process is root, it also adds itself to the set of participants. The root process then waits until the set of participants is equal to the group for the communicator. The root process is then allowed to exit.

col_two1: either when comm[c].col.state = "vacant";
        comm[c].col.state := "col_two in";
        comm[c].col.participants := {self};
    or when comm[c].col.state = "col_two in";
        comm[c].col.participants := comm[c].col.participants \cup {self};
    end either;
col_two2: if comm[c].ranking[self] = root then
        when comm[c].col.participants = comm[c].group;
        comm[c].col.state := "vacant";
        comm[c].col.participants := {};
    end if;
42. Type 3 Collective
- The algorithm requires three states and five transitions
- The first process to post this type of message changes the state from vacant to "col3 in". Other processes in the communicator are collected until all processes have joined the communication. The guard at col3_2 then becomes enabled, changing the state to "col3 out". All processes then exit the communication. The last process to exit returns the state of the collective context to vacant.

col3_1: either when comm[c].col.state = "vacant";
        (* First process in the barrier. *)
        comm[c] := [comm[c] EXCEPT !.col.state = "col3 in",
                    !.col.participants = @ \cup {self}];
    or when comm[c].col.state = "col3 in"
         /\ self \notin comm[c].col.participants;
        comm[c] := [comm[c] EXCEPT !.col.participants = @ \cup {self}];
    end either;
col3_2: either when comm[c].col.participants = comm[c].group;
        (* First out. *)
        comm[c] := [comm[c] EXCEPT !.col.participants = @ \ {self},
                    !.col.state = "col3 out"];
    or when comm[c].col.state = "col3 out"
         /\ comm[c].col.participants /= {self};
        (* Middle out. *)
        comm[c] := [comm[c] EXCEPT !.col.participants = @ \ {self}];
    or when comm[c].col.state = "col3 out"
         /\ comm[c].col.participants = {self};
        (* Last out. *)
        comm[c] := [comm[c] EXCEPT !.col.state = "vacant",
                    !.col.participants = {}];
    end either;
43. Motivation
- Software model checking in the presence of library calls is hard
  - Model extraction and verification may not scale
  - It may also miss bugs due to modeling assumptions
- In-situ model checking can help
  - Run the instrumented program directly, with an external scheduler to control the program's execution
  - Need to avoid redundant interleavings
  - Need to retain enough central scheduling control
44. Avoiding Redundant Interleavings with DPOR [1]
- Static partial-order reduction relies on static analysis to yield approximate information about run-time behavior
  - Pointers -> coarse information -> limited POR -> state explosion
- Partial-order reduction explores a subset of the state space without sacrificing soundness
- Dynamic partial-order reduction
  - While the model checker explores the program's state space, it sees exactly which threads access which locations
  - Use this to reduce the state space while model checking
45. An Example of DPOR
46. In-Situ Model Checking of Concurrent Programs
- 1. Instrument the program to add request/permit routines before the synchronization routines / accesses to shared objects (this phase can be automated)
- 2. Compile and run the program, and have the scheduler record the request trace
- 3. The scheduler generates other possible interleavings
- 4. Run the program again according to those interleavings
47. Current Status
- Implementation underway
- Efficiency will be measured: naive schedules vs. DPOR-generated schedules
- Rules for filtering infeasible executions; proving them complete
48. References
- [1] Cormac Flanagan and Patrice Godefroid, "Dynamic Partial-Order Reduction for Model Checking Software," POPL 2005.
49. Independent Transitions [1]
- B and R are independent transitions if
  - they commute: B; R = R; B
  - neither enables nor disables the other
- Example: x = 3 and y = 1 are independent
50. Byte-Range Locking Using MPI One-Sided Communication
- One process makes its memory space available for communication
- Global state is stored in this memory space
- Each process is associated with a flag, a start value, and an end value, stored in array A
- Pi's flag value is in A[3i], for all i
51. Lock Acquire

lock_acquire(start, end)
 1  val[0] = 1 /* flag */; val[1] = start; val[2] = end
 2  while (1) {
 3    lock_win
 4    place val in win
 5    get values of other processes from win
 6    unlock_win
 7    for all i, if (Pi conflicts with my range)
 8      conflict = 1
 9    if (conflict) {
10      val[0] = 0
11      lock_win
12      place val in win
13      unlock_win
14      MPI_Recv(ANY_SOURCE)
15    }
16    else {
17      /* lock is acquired */
18      break
19    } }
52. Lock Release

lock_release(start, end)
  val[0] = 0 /* flag */; val[1] = -1; val[2] = -1
  lock_win
  place val in win
  get values of other processes from win
  unlock_win
  for all i, if (Pi conflicts with my range)
    MPI_Send(Pi)
53. Error Trace
54. Solution: Picking
- Main idea: the process about to be blocked (Pi) picks who will wake it up (Pj), and indicates so by writing to shared memory in lines 11 and 13
- It is possible that Pj leaves before seeing this information. If this is the case, the next conflicting process (Pk) must be picked
- If no such Pk exists, the lock must be retried
- Similarly, if picking Pj causes deadlock, Pk must be picked instead
55. Avoiding Deadlock
- Once processes declare their intentions globally, deadlock can be avoided
- For there to be deadlock, a dependency cycle must exist
- The last process to complete this cycle will know this, and must not do so
56. Lock Acquire: Picking

lock_acquire(start, end)
 1  val[0] = 1 /* flag */; val[1] = start; val[2] = end; val[3] = -1 /* pick */
 2  while (1) {
 3    lock_win
 4    place val in win
 5    get values of other processes from win
 6    unlock_win
 7    for all i, if (Pi conflicts with my range)
 8      conflict = 1; remember Pi
 9    if (conflict with Pi) {
10      val[0] = 0; val[3] = Pi
11      lock_win
12      place val in win
13      unlock_win
14      if (Pi has left || deadlock_possible) { i++; goto 9 }
15      else MPI_Recv(ANY_SOURCE)
16    }
17    else ...
57. Orthogonal Composition of Primitives
- Can perhaps model MPI this way: Bsend-init, Rsend-init, Recv-init, Probe, Test, Wait on the LHS; nonlocal is a problem (need conceptual bits)
- Send -> {S, R, B} -> Ssend, Rsend, Bsend -> I -> Issend, Irsend, Ibsend
- Issend and Irsend are global; the other I-variants are not
58. Expensive Resources are Involved!
- $10k/week on Blue Gene (180 GFLOPS) at IBM's Deep Computing Lab
- 136,800 GFLOPS max
59. HPC Software Development is Inherently Quite Complex!
- Understand what is simulated (the physics, biology, etc.)
- Develop a mathematical model
- Generate a numerical discretization of the model
- Solve the numerical problem
- Experiment with a serial solution (gain understanding)
- Develop a parallel algorithm
- ...
- Check consistency with physics, etc. (energy conservation)
- Improve load balancing
- Fight grubby realities of libraries, program complexity, ...
60. General Reasons for Picking Our Four Thrusts
- Natural conclusion of our own study of key issues in the area
- Close to no previous research in the area
- Complementary strengths of PIs: Mike (HPC) and Ganesh (FV)
- Opportunity and luck
  - NSF funding, MS funding
  - Past HPC and FV research at Utah
  - Collaboration with Argonne
- Enthusiastic students!
61. Specific Reasons for Our Thrusts
- Modeling of the MPI Library: MPI is a widely used HPC library with COMPLEX and EVOLVING semantics
- In-situ Model Checking: large MPI programs are MPI calls hanging off a program scaffolding; hence finite-state-machine model extraction plus model checking is ineffective in many cases
- Verifying One-Sided MPI Constructs: some of the new MPI extensions are extremely poorly understood
- Parallel Model Checking: parallelism can benefit even the verification process!
62. Some Nasty Realities
- MPI programming
  - Code optimized to take advantage of relaxed send / receive / probe orderings may be buggy
  - Too many MPI functions (over 200 in MPI-2) -> misunderstandings, buggy MPI libraries
- Threads and MPI often used together
  - Thread programming bugs, unexpected interactions
- MPI libraries vary in the allowed range of semantics
  - Code that takes advantage of one library doesn't port
63. What is Model-Checking-Based Formal Verification?
- The analog of wind-tunnel testing of airplane wings, for programs and hardware designs
  - Build a scaled-down model, retaining essential features
  - Exhaustively verify the model
- Experience shows that exhaustive verification of a downscaled model is often superior to ad hoc testing of full system descriptions, which have a HUGE state space
- Key advances in recent times
  - Very large models can be verified
  - Richer assertions can be verified
- Yet, little (or no) work in FV for HPC
64. Yet, Verification (Bug-Hunting) is Only a Small Part of the Complex Picture!
- Provide formal models that unambiguously specify MPI library function semantics
- Make these formal models runnable
- Use the formal specification to assist MPI program transformations
- Help debug large MPI programs and / or MPI library implementations
- Study specific tricky MPI constructs and programs (e.g., locking protocols implemented using MPI's new one-sided constructs)
- Speed up model checking by using multiple threads and MPI processes
65. Past Work on FV for HPC
- Avrunin and Siegel have published the following results
  - Hand-modeled MPI library functions in Promela
  - Identified some issues (bugs?) with the help of SPIN used as a model checker
  - Proved that, under certain conditions, showing absence of deadlocks in an MPI program using synchronous communications guarantees absence of deadlocks even after the communications are switched back to being asynchronous
- Pioneering work
  - But none recently
  - Lots of areas not covered
66. Example 1: Modeling of the MPI Library
67. Variety of Bugs Common in Parallel Scientific Programs
- Deadlock
- Communication race conditions
- Misunderstanding the semantics of MPI procedures
- Resource-related assumptions
- Incorrectly matched sends/receives
68. State of the Art in Debugging
- TotalView
  - Parallel debugger; trace visualization
- Parallel DBX
- gdb
- MPICHECK
  - Does some deadlock checking
  - Uses trace analysis
69. Related Work
- Verification of wildcard-free models [Siegel, Avrunin, 2005]
  - Deadlock-free with length-zero buffers => deadlock-free with buffers of length greater than zero
- SPIN models of MPI programs [Avrunin, Siegel, Siegel, 2005] and [Siegel, Mironova, Avrunin, Clarke, 2005]
  - Compare serial and parallel versions of numerical computations for numerical equivalence
70. The Big Picture

MPI library model (Promela fragment):

  proctype MPI_Send(chan out; int c) { out!c }
  proctype MPI_Bsend(chan out; int c) { out!c }
  proctype MPI_Isend(chan out; int c) { out!c }
  typedef MPI_Status { int MPI_SOURCE; int MPI_TAG; int MPI_ERROR }

Program model (Promela fragment):

  int y;
  active proctype T1() { int x; x = 1; if :: x == 0 -> x = 2 fi; y = x }
  active proctype T2() { int x; x = 2; if :: y == x -> y = 0 fi; assert(y == 0) }

[Diagram: a compiler / model generator turns the user program and the MPI library model into a program model in the modeling language, together with an environment model, error simulator, abstractor, and refinement loop; an MC server farms the model out to many MC clients, and a result analyzer reports the outcome ("OK" or an error trace).]
71. Goal
- Verification / transformation of MPI programs
  - "It is nice that you may be able to show my program does not deadlock, but can you make it faster?"
- Verification of safety properties
- Automatic optimization through verifiably safe transformation (replacing Send with Isend/Wait, etc.)
72. Example 2: In-Situ Model Checking
73. Motivation
- Software model checking in the presence of library calls is hard
  - Model extraction and verification may not scale
  - It may also miss bugs due to modeling assumptions
- In-situ model checking can help
  - Run the instrumented program directly, with an external scheduler to control the program's execution
  - Need to avoid redundant interleavings
  - Need to retain enough central scheduling control
74. In-Situ Model Checking of Concurrent Programs
- 1. Instrument the program to add request/permit routines before the synchronization routines / accesses to shared objects (this phase can be automated)
- 2. Compile and run the program, and have the scheduler record the request trace
- 3. The scheduler generates other possible interleavings
- 4. Run the program again according to those interleavings