Transcript and Presenter's Notes

Title: High Performance Computing Course Notes 2007-2008 Message Passing Programming I


1
High Performance Computing Course Notes 2007-2008
Message Passing Programming I
2
Message Passing Programming
  • Message passing is the most widely used parallel programming model
  • Message passing works by creating a number of tasks, uniquely named, that interact by sending and receiving messages to and from one another (hence the name message passing)
  • Generally, processes communicate by sending data from the address space of one process to that of another
  • Processes communicate via files, pipes or sockets
  • Threads within a process communicate via a global data area
  • Programs based on message passing can be based on standard sequential language programs (C/C++, Fortran), augmented with calls to library functions for sending and receiving messages

3
Message Passing Interface (MPI)
  • MPI is a specification, not a particular
    implementation
  • Does not specify process startup, error codes,
    amount of system buffer, etc
  • MPI is a library, not a language
  • The goals of MPI: functionality, portability and efficiency
  • Message passing model -> MPI specification -> MPI implementation

4
OpenMP vs MPI
  • In a nutshell
  • MPI is used on distributed-memory systems
  • OpenMP is used for code parallelisation on
    shared-memory systems
  • Both are explicit parallelism
  • High-level control (OpenMP), lower-level control
    (MPI)

5
A little history
  • Message-passing libraries developed for a number
    of early distributed memory computers
  • By 1993 there were many vendor-specific implementations
  • By 1994 MPI-1 came into being
  • By 1996 MPI-2 was finalized

6
The MPI programming model
  • MPI standards -
  • MPI-1 (1.1, 1.2), MPI-2 (2.0)
  • Forwards compatibility preserved between versions
  • Standard bindings - for C, C++ and Fortran. Non-standard MPI bindings also exist for Python, Java, etc.
  • We will stick to the C binding for the lectures and coursework. More info on MPI: www.mpi-forum.org
  • Implementations - for your laptop, pick up MPICH, a free portable implementation of MPI (http://www-unix.mcs.anl.gov/mpi/mpich/index.htm)
  • Coursework will use MPICH

7
MPI
  • MPI is a complex system comprising 129 functions with numerous parameters and variants
  • Six of them are indispensable, and with just those six you can already write a large number of useful programs
  • Other functions add flexibility (datatypes), robustness (non-blocking send/receive), efficiency (ready-mode communication), modularity (communicators, groups) or convenience (collective operations, topology)
  • In the lectures, we are going to cover the most commonly encountered functions

8
The MPI programming model
  • Computation comprises one or more processes that communicate by calling library routines to send and receive messages to and from other processes
  • (Generally) a fixed set of processes created at
    outset, one process per processor
  • Different from PVM

9
Intuitive Interfaces for sending and receiving
messages
  • Send(data, destination), Receive(data, source)
  • is the minimal interface
  • Not enough in some situations; we also need
  • Message matching: add a message_id at both the send and receive interfaces
  • They become Send(data, destination, msg_id) and Receive(data, source, msg_id)
  • Message_id
  • is expressed using an integer, termed the message tag
  • allows the programmer to deal with the arrival of messages in an orderly fashion (e.g. queue them and then deal with them later)

10
How to express the data in the send/receive
interfaces
  • In the early stages:
  • (address, length) for the send interface
  • (address, max_length) for the receive interface
  • These are not always adequate:
  • The data to be sent may not be in contiguous memory locations
  • The storage format of the data may not be the same, or known in advance, on heterogeneous platforms
  • Eventually, a triple (address, count, datatype) is used to express the data to be sent, and (address, max_count, datatype) for the data to be received
  • This reflects the fact that a message contains much more structure than just a string of bits; for example, (vector_A, 300, MPI_REAL)
  • Programmers can construct their own datatypes
  • Now the interfaces become send(address, count, datatype, destination, msg_id) and receive(address, max_count, datatype, source, msg_id); see the sketch below
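
A minimal C sketch of the (address, count, datatype) triple; MPI_FLOAT is the C counterpart of MPI_REAL, and the destination rank, tag and use of MPI_Send (introduced on later slides) are choices made for the illustration:

    float vector_A[300];
    /* (address, count, datatype) = (vector_A, 300, MPI_FLOAT), sent to process 1 with message id 99 */
    MPI_Send(vector_A, 300, MPI_FLOAT, 1, 99, MPI_COMM_WORLD);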

11
How to distinguish messages
  • The message tag is necessary, but not sufficient, to distinguish messages
  • So the notion of a communicator is introduced

12
Communicators
  • Messages are put into contexts
  • Contexts are allocated at run time by the system
    in response to programmer requests
  • The system can guarantee that each generated
    context is unique
  • The processes belong to groups
  • The notions of context and group are combined in
    a single object, which is called a communicator
  • A communicator identifies a group of processes
    and a communication context
  • The MPI library defines an initial communicator, MPI_COMM_WORLD, which contains all the processes running in the system
  • The messages from different process groups can
    have the same tag
  • So the send interface becomes send(address,
    count, datatype, destination, tag, comm)

13
Status of the received messages
  • The structure of the message status is added to the receive interface
  • Status holds the information about source, tag and the actual message size
  • In the C binding, the source can be retrieved by accessing status.MPI_SOURCE,
  • the tag can be retrieved from status.MPI_TAG, and
  • the actual message size can be retrieved by calling the function MPI_Get_count(&status, datatype, &count)
  • The receive interface becomes receive(address, maxcount, datatype, source, tag, communicator, status); see the sketch below
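
A minimal sketch of querying the status object after a receive; the buffer size and the use of the wildcards MPI_ANY_SOURCE/MPI_ANY_TAG (introduced on the following slides) are choices made for the example:

    int buf[100], count;
    MPI_Status status;
    MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    printf("message came from rank %d with tag %d\n", status.MPI_SOURCE, status.MPI_TAG);
    MPI_Get_count(&status, MPI_INT, &count);   /* actual number of ints received */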

14
How to express source and destination
  • The processes in a communicator (group) are
    identified by ranks
  • If a communicator contains n processes, process
    ranks are integers from 0 to n-1
  • Source and destination processes in the
    send/receive interface are the ranks

15
Some other issues
  • In the receive interface, tag can be a wildcard (MPI_ANY_TAG), which means any message will be received
  • In the receive interface, source can also be a wildcard (MPI_ANY_SOURCE), which matches any source

16
MPI basics
  • First six functions (C bindings)
  • MPI_Send (buf, count, datatype, dest, tag, comm)
  • Send a message
  • buf: address of send buffer
  • count: no. of elements to send (>0)
  • datatype: datatype of elements
  • dest: process id of destination
  • tag: message tag
  • comm: communicator (handle)

19
MPI basics
  • First six functions (C bindings)
  • MPI_Send (buf, count, datatype, dest, tag, comm)
  • Calculating the size of the data to be sent
  • buf: address of send buffer
  • count * sizeof(datatype) bytes of data

22
MPI basics
  • First six functions (C bindings)
  • MPI_Recv (buf, count, datatype, source, tag, comm, status)
  • Receive a message
  • buf: address of receive buffer (var param)
  • count: max no. of elements in receive buffer (>0)
  • datatype: datatype of receive buffer elements
  • source: process id of source process, or MPI_ANY_SOURCE
  • tag: message tag, or MPI_ANY_TAG
  • comm: communicator
  • status: status object

23
MPI basics
  • First six functions (C bindings)
  • MPI_Init (int *argc, char ***argv)
  • Initiate a computation
  • argc (number of arguments) and argv (argument vector) are the main program's arguments
  • Must be called first, and once per process
  • MPI_Finalize ( )
  • Shut down a computation
  • The last thing that happens - no MPI calls may follow it

24
MPI basics
  • First six functions (C bindings)
  • MPI_Comm_size (MPI_Comm comm, int *size)
  • Determine the number of processes in comm
  • comm is the communicator handle; MPI_COMM_WORLD is the default (including all MPI processes)
  • size holds the number of processes in the group
  • MPI_Comm_rank (MPI_Comm comm, int *pid)
  • Determine the id of the current (or calling) process
  • pid holds the id of the current process

25
MPI basics: a basic example

    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Hello, world.  I am %d of %d\n", rank, nprocs);
        MPI_Finalize();
        return 0;
    }

mpirun -np 4 myprog
Hello, world. I am 1 of 4
Hello, world. I am 3 of 4
Hello, world. I am 0 of 4
Hello, world. I am 2 of 4
26
MPI basics: send and recv example (1)

    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, i;
        int buffer[10];
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (size < 2)
        {
            printf("Please run with two processes.\n");
            MPI_Finalize();
            return 0;
        }
        if (rank == 0)
        {
            for (i = 0; i < 10; i++)
                buffer[i] = i;
            MPI_Send(buffer, 10, MPI_INT, 1, 123, MPI_COMM_WORLD);
        }

27
MPI basics: send and recv example (2)

        if (rank == 1)
        {
            for (i = 0; i < 10; i++)
                buffer[i] = -1;
            MPI_Recv(buffer, 10, MPI_INT, 0, 123, MPI_COMM_WORLD, &status);
            for (i = 0; i < 10; i++)
            {
                if (buffer[i] != i)
                    printf("Error: buffer[%d] = %d but is expected to be %d\n", i, buffer[i], i);
            }
        }
        MPI_Finalize();
        return 0;
    }

28
MPI language bindings
  • Standard (accepted) bindings for Fortran, C and C++
  • Java bindings are work in progress
  • JavaMPI: Java wrapper to native calls
  • mpiJava: JNI wrappers
  • jmpi: pure Java implementation of the MPI library
  • MPIJ: same idea
  • Java Grande Forum is trying to sort it all out
  • We will use the C bindings

29
High Performance Computing Course Notes 2007-2008
  • Message Passing Programming II

30
Modularity
  • MPI supports modular programming via
    communicators
  • Provides information hiding by encapsulating
    local communications and having local namespaces
    for processes
  • All MPI communication operations specify a
    communicator (process group that is engaged in
    the communication)

31
Forming new communicators: one approach
  • MPI_Comm world, workers;
  • MPI_Group world_group, worker_group;
  • int ranks[1];
  • MPI_Init(&argc, &argv);
  • world = MPI_COMM_WORLD;
  • MPI_Comm_size(world, &numprocs);
  • MPI_Comm_rank(world, &myid);
  • server = numprocs - 1;
  • MPI_Comm_group(world, &world_group);
  • ranks[0] = server;
  • MPI_Group_excl(world_group, 1, ranks, &worker_group);
  • MPI_Comm_create(world, worker_group, &workers);
  • MPI_Group_free(&world_group);
  • MPI_Comm_free(&workers);

32
Forming new communicators - functions
  • int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
  • int MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)
  • int MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)
  • int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
  • int MPI_Group_free(MPI_Group *group)
  • int MPI_Comm_free(MPI_Comm *comm)

33
Forming new communicators: another approach (1)
  • MPI_Comm_split (comm, colour, key, newcomm)
  • Creates one or more new communicators from the original comm
  • comm: communicator (handle)
  • colour: control of subset assignment (processes with the same colour are in the same new communicator)
  • key: control of rank assignment
  • newcomm: new communicator
  • Is a collective communication operation (must be executed by all processes in the process group comm)
  • Is used to (re-)allocate processes to communicators (groups)

34
Forming new communicators: another approach (2)
  • MPI_Comm_split (comm, colour, key, newcomm)
  • MPI_Comm comm, newcomm; int myid, color;
  • MPI_Comm_rank(comm, &myid);   // id of the current process
  • color = myid % 3;
  • MPI_Comm_split(comm, color, myid, &newcomm);

[Figure: eight processes with ranks 0-7 in comm are split with colour = myid % 3; this yields three new communicators (colours 0, 1 and 2), and within each new communicator the member processes are re-ranked starting from 0.]
35
Forming new communicators: another approach (3)
  • MPI_Comm_split (comm, colour, key, newcomm)
  • New communicator created for each new value of
    colour
  • Each new communicator (sub-group) comprises those
    processes that specify its value in colour
  • These processes are assigned new identifiers
    (ranks, starting at zero) with the order
    determined by the value of key (or by their ranks
    in the old communicator in event of ties)

36
Communications
  • Point-to-point communications involve exactly two processes, one sender and one receiver
  • For example, MPI_Send() and MPI_Recv()
  • Collective communications involve a group of processes

37
Collective operations
  • i.e. coordinated communication operations involving multiple processes
  • The programmer could do this by hand (tedious); MPI provides specialized collective communication operations
  • barrier: synchronizes all processes
  • broadcast: sends data from one process to all processes
  • gather: gathers data from all processes to one process
  • scatter: scatters data from one process to all processes
  • reduction operations: sum, multiply, etc. over distributed data
  • all executed collectively (on all processes in the group, at the same time, with the same parameters)

38
Collective operations
  • MPI_Barrier (comm)
  • Global synchronization
  • comm is the communicator handle
  • No process returns from the function until all processes have called it
  • A good way of separating one phase from another; see the sketch below
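
A minimal sketch of using a barrier to separate two phases (compute_phase_one/compute_phase_two are hypothetical functions standing in for real work):

    compute_phase_one();              /* hypothetical: first phase of the computation */
    MPI_Barrier(MPI_COMM_WORLD);      /* no process continues until all processes have reached this point */
    compute_phase_two();              /* hypothetical: second phase */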

39
Barrier synchronizations
  • You are only as quick as your slowest process

[Figure: timeline of processes reaching two barrier synchronisation points; faster processes wait at each barrier for the slowest one.]
40
Collective operations
  • MPI_Bcast (buf, count, type, root, comm)
  • Broadcast data from root to all processes
  • buf: address of input buffer or output buffer (at root)
  • count: no. of entries in buffer (>0)
  • type: datatype of buffer elements
  • root: process id of root process
  • comm: communicator

[Figure: one-to-all broadcast - MPI_BCAST copies the root's data block A0 to every process.]
41
Example of MPI_Bcast
  • Broadcast 100 ints from process 0 to every
    process in the group
  • MPI_Comm comm;
  • int array[100];
  • int root = 0;
  • MPI_Bcast(array, 100, MPI_INT, root, comm);

42
Collective operations
  • MPI_Gather (inbuf, incount, intype, outbuf, outcount, outtype, root, comm)
  • Collective data movement function
  • inbuf: address of input buffer
  • incount: no. of elements sent from each (>0)
  • intype: datatype of input buffer elements
  • outbuf: address of output buffer (var param)
  • outcount: no. of elements received from each
  • outtype: datatype of output buffer elements
  • root: process id of root process
  • comm: communicator

[Figure: all-to-one gather - MPI_GATHER collects blocks A0, A1, A2, A3, one from each of processes 0-3, into the root's buffer.]
46
MPI_Gather example
  • Gather 100 ints from every process in group to
    root
  • MPI_Comm comm;
  • int gsize, sendarray[100];
  • int root, myrank, *rbuf;
  • ...
  • MPI_Comm_rank(comm, &myrank);   // find proc. id
  • if (myrank == root) {
  •     MPI_Comm_size(comm, &gsize);   // find group size
  •     rbuf = (int *) malloc(gsize*100*sizeof(int));   // allocate receive buffer
  • }
  • MPI_Gather(sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

47
Collective operations
  • MPI_Scatter (inbuf, incount, intype, outbuf, outcount, outtype, root, comm)
  • Collective data movement function
  • inbuf: address of input buffer
  • incount: no. of elements sent to each (>0)
  • intype: datatype of input buffer elements
  • outbuf: address of output buffer
  • outcount: no. of elements received by each
  • outtype: datatype of output buffer elements
  • root: process id of root process
  • comm: communicator

[Figure: one-to-all scatter - MPI_SCATTER distributes the root's blocks A0, A1, A2, A3 to processes 0-3, one block per process.]
48
Example of MPI_Scatter
  • MPI_Scatter is the reverse of MPI_Gather
  • It is as if the root sends to each process i using
  • MPI_Send(inbuf + i*incount*sizeof(intype), incount, intype, i, ...)
  • MPI_Comm comm;
  • int gsize, *sendbuf;
  • int root, rbuf[100];
  • MPI_Comm_size(comm, &gsize);
  • sendbuf = (int *) malloc(gsize*100*sizeof(int));
  • MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

49
Collective operations
  • MPI_Reduce (inbuf, outbuf, count, type, op, root, comm)
  • Collective reduction function
  • inbuf: address of input buffer
  • outbuf: address of output buffer
  • count: no. of elements in input buffer (>0)
  • type: datatype of input buffer elements
  • op: reduction operation
  • root: process id of root process
  • comm: communicator

[Figure: MPI_REDUCE with op = MPI_MIN and root = 0 - the element-wise minimum of (2,4), (5,7), (0,3) and (2,6) held by the four processes, i.e. (0,2), is returned at process 0.]
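
A minimal sketch of the reduction above in the C binding (the variable names are made up for the example; the initial values follow the MPI_MIN picture):

    int local_vals[2] = {2, 4};   /* this process's contribution (values from the picture) */
    int global_min[2];
    /* element-wise minimum across all processes; the result is stored only at root 0 */
    MPI_Reduce(local_vals, global_min, 2, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);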
50
Collective operations
  • MPI_Reduce (inbuf, outbuf, count, type, op,
    root, comm)
  • Collective reduction function
  • inbuf address of input buffer
  • outbuf address of output buffer
  • count no. of elements in input buffer (gt0)
  • type datatype of input buffer elements
  • op operation
  • root process id of root process
  • comm communicator

[Figure: MPI_REDUCE with op = MPI_SUM and root = 1 - the element-wise sums of the values held by the four processes are returned at process 1.]
51
Collective operations
  • MPI_Allreduce (inbuf, outbuf, count, type, op, comm)
  • Collective reduction function
  • inbuf: address of input buffer
  • outbuf: address of output buffer (var param)
  • count: no. of elements in input buffer (>0)
  • type: datatype of input buffer elements
  • op: reduction operation
  • comm: communicator

[Figure: MPI_ALLREDUCE with op = MPI_MIN - the element-wise minimum (0,2) of the processes' values is returned on every process.]
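
A minimal sketch of the corresponding MPI_Allreduce call (same made-up variable names as above); the differences from MPI_Reduce are that there is no root argument and every process receives the result:

    int local_vals[2] = {2, 4};
    int global_min[2];
    /* element-wise minimum across all processes; every process receives the result in global_min */
    MPI_Allreduce(local_vals, global_min, 2, MPI_INT, MPI_MIN, MPI_COMM_WORLD);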
52
Buffering in MPI communications
  • Application buffer: specified by the first parameter of the MPI_Send/Recv functions
  • System buffer:
  • Hidden from the programmer and managed by the MPI library
  • Is limited and can be easy to exhaust

53
Blocking and non-blocking communications
  • Blocking send
  • The sender doesn't return until the application buffer can be re-used (which often means that the data have been copied from the application buffer to a system buffer); it doesn't mean that the data have been received
  • MPI_Send (buf, count, datatype, dest, tag, comm)
  • Blocking receive
  • The receiver doesn't return until the data are ready to be used by the receiver (which often means that the data have been copied from the system buffer to the application buffer)
  • Non-blocking send/receive
  • The calling process returns immediately
  • It just requests that the MPI library perform the operation; the user cannot predict when that will happen
  • It is unsafe to modify the application buffer until you can make sure the requested operation has been performed (MPI provides routines to test this)
  • Can be used to overlap computation with communication, for possible performance gains
  • MPI_Isend (buf, count, datatype, dest, tag, comm, request); see the sketch below
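
A minimal sketch of the non-blocking send call (buffer contents, destination and tag are made-up values; the completion tests that make it safe to reuse the buffer are covered on the following slides):

    int buf[10] = {0};                /* application buffer (example data) */
    MPI_Request request;
    /* returns immediately; buf must not be modified until the request is known to have completed */
    MPI_Isend(buf, 10, MPI_INT, /* dest */ 1, /* tag */ 0, MPI_COMM_WORLD, &request);
    /* ... computation that does not touch buf can be overlapped with the communication here ... */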

54
Testing non-blocking communications for completion
  • Completion tests come in two types
  • WAIT type
  • TEST type
  • WAIT type: the WAIT-type testing routines block until the communication has completed.
  • A non-blocking communication immediately followed by a WAIT-type test is equivalent to the corresponding blocking communication
  • TEST type: these routines return a TRUE or FALSE value immediately
  • The process can perform other tasks if the communication has not yet completed

55
Testing non-blocking communications for completion
  • The WAIT-type test is:
  • MPI_Wait (request, status)
  • This routine blocks until the communication specified by the handle request has completed. The request handle will have been returned by an earlier call to a non-blocking communication routine.
  • The TEST-type test is:
  • MPI_Test (request, flag, status)
  • In this case the communication specified by the handle request is simply queried to see if it has completed, and the result of the query (TRUE or FALSE) is returned immediately in flag; see the sketch below.
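
A minimal sketch of the two completion tests applied to a request returned by an earlier MPI_Isend/MPI_Irecv; the two options are alternatives for the same request, and do_useful_work is a hypothetical function standing in for other work:

    MPI_Status status;
    int flag = 0;

    /* Option 1 - WAIT type: block until the communication has completed */
    MPI_Wait(&request, &status);

    /* Option 2 - TEST type: poll, doing other work until it completes */
    while (!flag) {
        MPI_Test(&request, &flag, &status);
        if (!flag)
            do_useful_work();         /* hypothetical work performed while waiting */
    }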

56
Testing multiple non-blocking communications for
completion
  • Wait for all communications to complete
  • MPI_Waitall (count, array_of_requests,
    array_of_statuses)
  • This routine blocks until all the communications
    specified by the request handles,
    array_of_requests, have completed. The statuses
    of the communications are returned in the array
    array_of_statuses and each can be queried in the
    usual way for the source and tag if required
  • Test if all communications have completed
  • MPI_Testall (count, array_of_requests, flag,
    array_of_statuses)
  • If all the communications have completed, flag is
    set to TRUE, and information about each of the
    communications is returned in array_of_statuses.
    Otherwise flag is set to FALSE and
    array_of_statuses is undefined.
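
A minimal sketch of waiting on several outstanding requests at once (NREQ and the requests array are assumptions; the requests would come from earlier non-blocking calls):

    #define NREQ 4
    MPI_Request requests[NREQ];       /* filled in by earlier MPI_Isend/MPI_Irecv calls */
    MPI_Status  statuses[NREQ];
    /* block until all NREQ communications have completed */
    MPI_Waitall(NREQ, requests, statuses);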

57
Testing multiple non-blocking communications for
completion
  • Query a number of communications at a time to
    find out if any of them have completed
  • Wait: MPI_Waitany (count, array_of_requests, index, status)
  • MPI_WAITANY blocks until one or more of the
    communications associated with the array of
    request handles, array_of_requests, has
    completed.
  • The index of the completed communication in the
    array_of_requests handles is returned in index,
    and its status is returned in status.
  • Should more than one communication have
    completed, the choice of which is returned is
    arbitrary.
  • Test: MPI_Testany (count, array_of_requests, index, flag, status)
  • The result of the test (TRUE or FALSE) is returned immediately in flag; see the sketch below.
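
A minimal sketch of servicing whichever request completes first (nreq, requests and process_message are made-up names for the example):

    int index;
    MPI_Status status;
    /* block until at least one of the nreq outstanding requests has completed */
    MPI_Waitany(nreq, requests, &index, &status);
    /* requests[index] has completed; its source and tag can be read from status */
    process_message(index, &status);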