1
MPI Workshop - I
  • Introduction to Point-to-Point Communications
  • HPC@UNM Research Staff
  • Dr. Andrew C. Pineda, Dr. Paul M. Alsing
  • Week 1 of 2

2
Table of Contents
  • Introduction to Parallelism and MPI
  • MPI Standard
  • MPI Course Map
  • MPI Routines/Exercises
  • point-to-point communications basics
  • communicators
  • blocking versus non-blocking calls
  • collective communications basics
  • how to run MPI routines at HPC@UNM
  • References

3
Parallelism and MPI
  • Parallelism is accomplished by
  • Breaking up the task into smaller tasks
  • Assigning the smaller tasks to multiple workers
    to work on simultaneously
  • Coordinating the workers
  • Not breaking up the task so small that it takes
    longer to tell the worker what to do than it does
    to do it
  • Buzzwords: latency, bandwidth

4
Parallelism & MPI
  • Message Passing Model
  • Multiple processors operate independently but
    each has its own private memory (distributed
    processors and memory)
  • Data is shared across a communications network
    using cooperative operations via library calls
  • User responsible for synchronization using
    message passing
  • Advantages
  • Memory is scalable with the number of processors:
    increasing the number of processors also increases
    the total memory and bandwidth.
  • Each processor can rapidly access its own memory
    without interference from others.

5
Parallelism & MPI
  • Another advantage is flexibility of programming
    schemes
  • Functional parallelism - different tasks done at
    the same time by different nodes
  • Master-Slave (Client-server) parallelism - one
    process assigns subtasks to other processes.
  • Data parallelism - data can be distributed
  • SPMD parallelism - Single Program, Multiple Data
    - same code replicated to each process but
    operating on different data

6
Parallelism & MPI
  • Disadvantages
  • Sometimes difficult to map existing data
    structures to this memory organization
  • User responsible for sending and receiving data
    among processors
  • To minimize overhead and latency, data should be
    blocked up in large chunks and shipped before
    receiving node needs it

7
Parallelism & MPI
  • Message Passing Interface - MPI
  • A standard portable message-passing library
    definition developed in 1993 by a group of
    parallel computer vendors, computer scientists,
    and applications developers.
  • Available to both Fortran and C programs (and,
    through these, to F90 and C++).
  • Available on a wide variety of parallel machines.
  • Target platform is a distributed memory system
    such as the Los Lobos Linux Cluster.
  • All inter-task communication is by message
    passing.
  • All parallelism is explicit: the programmer is
    responsible for identifying the parallelism in the
    program and implementing it with MPI constructs.

8
MPI Standardization Effort
  • MPI Forum initiated in April 1992 at a Workshop
    on Message Passing Standards.
  • Initially about 60 people from 40 organizations
    participated.
  • Defines an interface that can be implemented on
    many vendors' platforms with no significant
    changes in the underlying communication and
    system software.
  • Allow for implementations that can be used in a
    heterogeneous environment.
  • Semantics of the interface should be language
    independent.
  • There are currently over 125 people from 52
    organizations who have contributed to this
    effort.

9
MPI-Standard Releases
  • May 1994: MPI-Standard version 1.0
  • June 1995: MPI-Standard version 1.1
  • includes minor revisions of 1.0
  • July 1997: MPI-Standard versions 1.2 and 2.0
  • with extended functions
  • 2.0 - supports real-time operations, spawning of
    processes, more collective operations
  • 2.0 - explicit C++ and F90 bindings
  • Complete postscript and HTML documentation can be
    found at http://www.mpi-forum.org/docs/docs.html
  • Currently available at HPC@UNM

10
MPI Implementations
  • Message Passing Libraries
  • MPI - Message Passing Interface
  • PVM - Parallel Virtual Machine
  • Public Domain MPI Implementations
  • MPICH (ANL and MSU) (v. 1.2.5, 6 Jan 2003)
  • www-unix.mcs.anl.gov/mpi/mpich/
  • MPICH2 (ANL and MSU) (v. 0.94, 22 Aug 2003)
  • LAM (v. 7.0, MPI 1.2 + much of 2.0, 2 Jul 2003)
  • www.lam-mpi.org
  • Vendor MPI Implementations
  • IBM-MPI, SGI (based on MPICH), others
  • Available on HPC@UNM platforms.

11
Course Roadmap
12
Program examples/MPI calls
  • Hello - Basic MPI code with no communications.
  • Illustrates some basic aspects of parallel
    programming
  • MPI_INIT - starts MPI communications
  • MPI_COMM_RANK - get processor id
  • MPI_COMM_SIZE - get number of processors
  • MPI_FINALIZE - end MPI communications
  • Swap - Basic MPI point-to-point messages
  • MPI_SEND - blocking send
  • MPI_RECV - blocking receive
  • MPI_IRECV, MPI_WAIT - non-blocking receive

13
Program examples/MPI calls
  • client-server - Basic MPI code illustrating
    functional parallelism
  • MPI_INIT - starts MPI communications
  • MPI_COMM_RANK - get processor id
  • MPI_COMM_SIZE - get number of processors
  • MPI_ATTR_GET - get attributes (in this case
    maximum allowed tag value)
  • MPI_BARRIER - synchronization
  • MPI_BCAST - collective call - broadcast
  • MPI_PROBE - peek at message queue to see what
    next message is/wait for message
  • MPI_SEND/MPI_RECV
  • MPI_FINALIZE - end MPI communications

14
MPI 1.x Language Bindings
  • Fortran 77
  • include 'mpif.h'
  • call MPI_ABCDEF(list of arguments, IERROR)
  • Fortran 90 via Fortran 77 Library
  • F90 strong type checking of arguments can cause
    difficulties
  • cannot handle more than one object type
  • include mpif90.h in F90.
  • ANSI C
  • #include "mpi.h"
  • IERROR = MPI_Abcdef(list of arguments)
  • C++ via C Library
  • via extern "C" declaration, #include "mpi.h"

15
1st Example
  • Hello, World in MPI
  • hello.f - Fortran 90 version
  • hello.c - C version
  • Can be found in the mpi1 subdirectory of your
    guest account.
  • Goals
  • Demonstrate basic elements of an MPI program
  • Describe available compiler/network combinations
  • Basic introduction to job scheduler (PBS) used on
    HPC@UNM platforms

16
1st Example
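(The code listing shown on this slide is not preserved in the transcript. Below is a minimal C sketch of an MPI "Hello, World" built from the calls listed on slide 12; it is an illustration only, not the original hello.c from the mpi1 directory.)

    /* Minimal MPI hello sketch: init, query rank/size, print, finalize. */
    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start MPI communications */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id        */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes      */

        printf("Hello, World from process %d of %d\n", rank, size);

        MPI_Finalize();                         /* end MPI communications   */
        return 0;
    }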
17
Communicators in MPI
  • Communicators provide a handle to a group of
    processes/processors.
  • Within each communicator, a process is assigned a
    rank.
  • MPI provides a standard communicator
    MPI_COMM_WORLD that represents all of the
    processes/processors.
  • MPI_Comm_rank and MPI_Comm_size return rank and
    number of processors.
  • Mapping of processors to rank is implementation
    dependent.
  • Appear in virtually every MPI call.
  • Communicators can be created
  • Can create subgroups of processors
  • Can be used to map topology of processors onto
    topology of data using MPI Cartesian Topology
    functions
  • Useful if your data can be mapped onto a grid.
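(A short C sketch of creating a subgroup communicator with MPI_Comm_split, a standard MPI call; the even/odd split below is only an illustration and assumes MPI has already been initialized.)

    /* Split MPI_COMM_WORLD into two subgroups by rank parity. */
    MPI_Comm half_comm;
    int world_rank, half_rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    /* color picks the subgroup, key orders ranks within it */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &half_comm);
    MPI_Comm_rank(half_comm, &half_rank);  /* rank within the new communicator */
    MPI_Comm_free(&half_comm);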

18
1st Example
19
Compiling MPI codes
  • You invoke your compiler via scripts that tack on
    the appropriate MPI include and library files
  • mpif77 -o <progname> <filename>.f
  • mpif77 -c <filename>.f
  • mpif77 -o <progname> <filename>.o
  • mpif90 -o <progname> <filename>.f90
  • mpicc -o <progname> <filename>.c
  • mpiCC -o <progname> <filename>.cc (not
    supported)
  • These scripts save you from having to know where
    the libraries are located. Use them!!!
  • The underlying compiler (NAG, PGI, etc.) is
    determined by how MPIHOME and PATH are set up in
    your environment.

20
Compiling MPI codes contd.
  • MPICH
  • Two choices of communications networks
  • eth - FastEthernet (100Mb/sec)
  • gm - Myrinet (1.2 Gb/sec, Los Lobos)
  • Gigabit Ethernet (1.0Gb/sec, Azul)
  • Many compilers
  • NAG F95 - f95
  • PGI - pgf77, pgcc, pgCC, pgf90
  • GNU Compiler Suite - gcc, g77
  • Combination is determined by your environment.

21

Compiling MPI codes contd.
  • MPIHOME settings determine your choice of the
    underlying compilers and communications network
    for your compilation.
  • Compilers
  • PGI - Portland Group (pgcc, pgf77, pgf90)
  • GCC - Gnu Compiler Suite (gcc, g77, g++)
  • NAG - Numerical Algorithms Group (f95)
  • Networks
  • FastEthernet (ch_p4)
  • Myrinet (ch_gm)

22
Supported MPIHOME values
Stealth PGI!!!
23
Portable Batch Scheduler(PBS)
  • To submit job use
  • qsub file.pbs
  • file.pbs is a shell script that invokes mpirun
  • qsub -q R1234 file.pbs
  • submit to a reservation queue R1234
  • qsub -I -l nodes=1
  • Interactive session
  • To check status
  • qstat
  • qstat -a (shows status of everyone's
    jobs)
  • qstat -n jobid (shows nodes assigned to job)
  • To cancel job
  • qdel job_id

24
PBS command file (file.pbs)
  • Just a variation on a shell script
  • #PBS -l nodes=4:ppn=2,walltime=4:00:00
  • any set up you need to do,
  • e.g. staging data
  • mpirun -np 8 -machinefile $PBS_NODEFILE
    <executable or script>
  • cleanup or save auxiliary files

See man qsub for other -l options
The script runs on the head node. Use ssh or dsh
to run commands on others. In most cases, you can
rely on PBS to clean up after you.
25
Lab exercise 1
  • Download, compile and run hello.f or hello.c.
  • Run several times with different numbers of
    processors.
  • Do the results always come out the same?
  • If not, can you explain why?
  • Copy files from
  • mpi1 subdirectory of your guest account.

26
2nd Example Code
  • Swap
  • swap.f - F90 implementation
  • swap.c - C implementation
  • Goals
  • Illustrate a basic message exchange among a few
    processors
  • Introduce basic flavors of send and receive
    operations
  • Illustrate potential pitfalls such as deadlock
    situations
  • Can be found in the mpi1 subdirectory of your
    guest account.

27
Basic Send/Receive operations
  • Send
  • MPI_Send - standard mode blocking send
  • blocking: does not return until the message
    buffer can be reused
  • semantics are blocking, but may not block in
    practice
  • vendor free to implement in most efficient manner
  • in most cases you should use this
  • MPI_Isend - immediate non-blocking send
  • lets MPI know we are sending data and where it
    is, but we return immediately with the promise
    that we will not touch the data until we have
    verified that the message buffer can be safely
    reused.
  • Pair with MPI_Wait or MPI_Test to complete/verify
    completion
  • Allows overlap of communication and computation
    by letting processor do work while we wait on
    completion.
  • Many other flavors MPI_Bsend (Buffered Send),
    MPI_Ssend (Synchronous Send), MPI_Rsend (Ready
    Send) to name a few.

28
Basic Send/Receive operations
  • Receive
  • MPI_Recv - Standard mode blocking receive
  • Does not return until data from matching send is
    in receive buffer.
  • MPI_Irecv - Immediate mode non-blocking receive
  • lets MPI know we are expecting data and where to
    put it.
  • Pair with MPI_Wait or MPI_Test to complete
  • Unlike sends, these are the only two flavors.
  • Completing non-blocking calls
  • MPI_Wait - blocking
  • MPI_Test - non-blocking
  • Getting information about an incoming message
  • MPI_Probe - blocking
  • MPI_Iprobe - non-blocking
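(A C fragment sketching the non-blocking receive pattern described above; the buffer name and count are invented, and MPI is assumed to be initialized.)

    /* Post the receive, do useful work, then complete it with MPI_Wait. */
    double buf[100];
    MPI_Request req;
    MPI_Status status;

    MPI_Irecv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
              MPI_COMM_WORLD, &req);   /* tell MPI where the data will go     */
    /* ... computation that does not touch buf ... */
    MPI_Wait(&req, &status);           /* block until the message has arrived */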

29
Send/Receive Structure
  • Basic structure of a standard send/receive call
  • Fortran
  • MPI_Recv(data components, message envelope,
    status, ierror)
  • C
  • ierror = MPI_Recv(data components, message
    envelope, status)
  • Data components consists of 3 parts
  • data buffer (holds data you are sending)
  • size of buffer - in units of the data type (e.g.
    5 integers, or 5 floats)
  • type descriptor - corresponds to standard
    language data types
  • Message envelope consists of 4 parts, 3 of which
    are specified
  • source/destination - integer
  • tag - an integer label between 0 and an
    implementation-dependent maximum value (>= 32K-1)
  • communicator
  • Status (does not appear in Send operations), and
    Ierror
  • In Fortran, an array used to return information
    about the received message, e.g. source, tag.
    Example: status(MPI_SOURCE)
  • In C, this is a C structure. Example:
    status.MPI_SOURCE.
  • Ierror returns error status.
  • Other types of sends, receives require additional
    arguments, see supplementary materials.
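(A hedged C fragment mapping the pieces above onto concrete calls; the array name, count and tag are invented, and at least two ranks plus an initialized MPI are assumed.)

    /* Data components: buffer, count, type.  Envelope: dest/source, tag, comm. */
    double work[50];
    MPI_Status status;
    int rank, ierror;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        ierror = MPI_Send(work, 50, MPI_DOUBLE,    /* data components             */
                          1, 99, MPI_COMM_WORLD);  /* envelope: dest, tag, comm   */
    } else if (rank == 1) {
        ierror = MPI_Recv(work, 50, MPI_DOUBLE,    /* data components             */
                          0, 99, MPI_COMM_WORLD,   /* envelope: source, tag, comm */
                          &status);                /* actual source/tag returned  */
    }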

30
MPI Type Descriptors
  • C types
  • MPI_INT
  • MPI_CHAR
  • MPI_FLOAT
  • MPI_DOUBLE
  • MPI_LONG
  • MPI_UNSIGNED (unsigned int)
  • Many others corresponding to other C types
  • MPI_BYTE
  • MPI_PACKED
  • Fortran 77/Fortran 90 types
  • MPI_INTEGER
  • MPI_CHARACTER
  • MPI_REAL
  • MPI_DOUBLE_PRECISION
  • MPI_COMPLEX
  • MPI_LOGICAL
  • MPI_BYTE
  • MPI_PACKED

31
Matching Sends to Receives
  • Message Envelope - consists of the source,
    destination, tag, and communicator values.
  • A message can only be received if the specified
    envelope agrees with the message envelope.
  • The source and tag portions can be wildcarded
    using MPI_ANY_SOURCE and MPI_ANY_TAG. (Useful for
    writing client-server applications.)
  • Source = destination is allowed except for
    blocking operations.
  • Variable types of the messages must match.
  • In heterogeneous systems, MPI automatically
    handles data conversions, e.g. big-endian to
    little-endian.
  • Messages (with the same envelope) are
    non-overtaking.
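(A small C fragment illustrating wildcarded receives and reading the actual envelope back from the status; the buffer and count are invented, and stdio.h plus an initialized MPI are assumed.)

    /* Accept a message from any sender with any tag, then inspect the envelope. */
    int data[10];
    MPI_Status status;

    MPI_Recv(data, 10, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    printf("message from rank %d with tag %d\n",
           status.MPI_SOURCE, status.MPI_TAG);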

32
2nd Example
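(The swap code listing shown on this slide is not preserved in the transcript. Below is a hedged C sketch of a two-process exchange using blocking calls, ordered so they cannot deadlock; it is not the original swap.c.)

    /* Deadlock-free blocking exchange between ranks 0 and 1
       (assumes MPI_Init has already been called). */
    double mine = 1.0, theirs = 0.0;
    int rank, partner;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = 1 - rank;                   /* 0 <-> 1 */

    if (rank == 0) {                      /* one side sends first...     */
        MPI_Send(&mine,   1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);
    } else {                              /* ...the other receives first */
        MPI_Recv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&mine,   1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    }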
33
2nd Example
34
Lab exercise 2
  • swap.f, swap.c
  • Compile and run either the Fortran or C code with
    two processes.
  • Try running with the send and receive operations
    in the two sections in the sequences shown below
    (in addition to that in the code). What happens
    in each case?
  • Can be found in the mpi1 subdirectory of your
    guest account.

35
Non-blocking communications
  • Here's what the MPI standard says about how your
    experiment should have worked out.

The last case fails because both processors are
blocked in the receive operation and never
execute their sends. Case 1 works if the send is
buffered. This allows the sends to complete.
36
2nd Example Revisited: Non-blocking calls
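(The code for this slide is not preserved in the transcript. Below is a hedged C sketch of the same exchange rewritten with a non-blocking receive, so both ranks can execute the identical sequence without deadlocking; it is not the original code.)

    /* Post the receive first, then send, then wait for completion. */
    double mine = 1.0, theirs = 0.0;
    int rank, partner;
    MPI_Request req;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = 1 - rank;

    MPI_Irecv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req);
    MPI_Send(&mine,    1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, &status);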
37
Non-blocking calls
  • Can use MPI_TEST in place of MPI_WAIT to
    periodically check on a message rather than
    blocking and waiting.
  • Client-server applications can use MPI_WAITANY or
    MPI_TESTANY.
  • Can peek ahead at messages with MPI_PROBE and
    MPI_IPROBE. MPI_PROBE is used in our
    client-server application.
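(For example, a polling loop with MPI_Test might look roughly like this; the request is assumed to come from an earlier MPI_Irecv.)

    /* Check for completion periodically while doing other work. */
    int done = 0;
    MPI_Status status;

    while (!done) {
        MPI_Test(&req, &done, &status);   /* non-blocking completion check */
        if (!done) {
            /* ... do some useful work before checking again ... */
        }
    }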

38
Lab exercise 3
  • Change your broken copy of swap to use
    MPI_IRECV and MPI_WAIT instead of MPI_RECV and
    try running again. Below is the syntax of these
    calls.
  • C language
  • int MPI_Irecv(void *buf, int count, MPI_Datatype
    datatype, int source, int tag,
    MPI_Comm comm, MPI_Request *request)
  • int MPI_Wait(MPI_Request *request, MPI_Status
    *status)
  • Fortran
  • <type> buf(*)
  • integer count, datatype, source, tag, comm,
    request, ierror
  • integer request, status(MPI_STATUS_SIZE), ierror
  • call MPI_WAIT(request, status, ierror)
  • call MPI_IRECV(buf, count, datatype, source, tag,
    comm, request, ierror)

39
3rd example code
  • Client-Server Application
  • Illustrates use of point-to-point communications
    calls, message tags.
  • Illustrates one of the basic paradigms of
    parallel computing - task decomposition -
    which provides a general framework for solving a wide
    range of problems.
  • Easy to understand example - multiplication of a
    vector by a matrix.
  • In next week's workshop, we'll re-implement this
    code entirely using collective communications
    calls.
  • This example uses 2 collective calls: MPI_Barrier
    and MPI_Bcast (the latter for clarity)

40
Collective Communications
MPI_Barrier(communicator, ierror) - used to
synchronize processes within a communicator.
MPI_Bcast(data components, source, communicator, ierror)
- broadcasts a copy of a piece of data to all
processes. Equivalent to N-1 sends from the processor
with the data to the remaining processes.
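(A short C sketch of these two calls; the variable and root rank are invented and MPI is assumed to be initialized.)

    /* Broadcast an integer from rank 0, then synchronize everyone. */
    int n = 0;
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) n = 42;                         /* only the root has the value */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* now every rank has n == 42  */
    MPI_Barrier(MPI_COMM_WORLD);                   /* wait for all ranks          */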
41
3rd Example: Matrix-vector Multiplication
  • Goal: Compute r = Ax.
  • r_i = Σ_j a_ij x_j
  • Can be decomposed into either a row or column
    oriented algorithm.
  • Important because different programming languages
    have different conventions for storing 2-D
    arrays.
  • C language - arrays stored in row major order -
    multiply using traditional dot product
    interpretation of matrix multiplication.
  • Fortran language - arrays stored in column major
    order - multiplication is linear combination of
    columns.
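(A rough C illustration of the two orderings; m, n, a, x and r are assumed to be declared elsewhere, e.g. double a[m][n], x[n], r[m]. C stores arrays in row-major order, so the row-oriented loop walks memory contiguously.)

    /* Row-oriented: each r[i] is the dot product of row i with x. */
    for (int i = 0; i < m; i++) {
        r[i] = 0.0;
        for (int j = 0; j < n; j++)
            r[i] += a[i][j] * x[j];
    }

    /* Column-oriented: r accumulates a linear combination of the columns
       of a, scaled by the x[j] values (the natural order for Fortran). */
    for (int i = 0; i < m; i++) r[i] = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            r[i] += a[i][j] * x[j];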

42
3rd Example: Abridged client-server.f code
  • call MPI_INIT (ierror)
  • call MPI_COMM_SIZE (MPI_COMM_WORLD,
    num_processes,ierror)
  • call MPI_COMM_RANK (MPI_COMM_WORLD, rank,
    ierror)
  • call MPI_ATTR_GET(MPI_COMM_WORLD,
    MPI_TAG_UB, NO_DATA, flag, ierror)
  • ! Set NO_DATA to the largest allowed value
    for a tag - about 2^27 under
  • ! the MPICH implementation.
  • if (rank .eq. server) then
  • ! read dimensions of A(m,n)
  • call MPI_BCAST(m, 1, MPI_INTEGER,
    server, MPI_COMM_WORLD, ierror)
  • call MPI_BCAST(n, 1, MPI_INTEGER,
    server, MPI_COMM_WORLD, ierror)
  • ! Allocate memory for arrays, only server
    keeps full A.
  • ! Allocate memory for arrays, initialize
    to zero if necessary.
  • buffer(1:m) = 0.0d0
  • ! Server - read and broadcast vector
    x(0),...,x(n)
  • read(f_ptr,*) (x(j), j=1,n)
  • call MPI_BCAST (x, n, MPI_REAL, server,
    MPI_COMM_WORLD, ierror)
  • ! read A(m,n)

43
3rd Example: Priming the pump
  • ! Here we give the compute processes their
    1st batch of data
  • ! Send out up to num_processes-1 rows to
    clients.
  • i = 1
  • do while ((i.le.n).and.(i.lt.num_processes))
  • ! Distribute a(m,n) by columns
  • alocal(1:m) = a(1:m,i)
  • call MPI_SEND (alocal, m, MPI_REAL, i,
    i,MPI_COMM_WORLD, ierror)
  • active_client_count = active_client_count + 1
  • i = i + 1
  • end do
  • ! note: At the end of the above loop
    i = min(n,num_processes).
  • ! Handle the case where there are more processes
    than rows.
  • ! Use tag = NO_DATA as message to go to
    waiting area pending finish
  • do j = i, num_processes-1, 1
  • call MPI_SEND (alocal, m, MPI_REAL, j,
    NO_DATA, MPI_COMM_WORLD, ierror)
  • end do
  • ! Note that the message is the TAG = NO_DATA,
    not alocal.

44
3rd Example: Server loop
  • do while ((active_client_count.gt.0) .or. (i.le.
    n))
  • call MPI_RECV (buffer, m,
    MPI_REAL,MPI_ANY_SOURCE,
  • MPI_ANY_TAG, MPI_COMM_WORLD, status,
    ierror)
  • if (status(MPI_TAG).ne.NO_DATA) then
  • result(1:m) = result(1:m) + buffer(1:m)
    ! Accumulate result, F90 array syntax
  • if (i.le.n) then
  • alocal(1:m) = a(1:m,i)
  • call MPI_SEND (alocal, m, MPI_REAL,
  • status(MPI_SOURCE), i, MPI_COMM_WORLD,
    ierror)
  • i = i + 1
  • else
  • ! No more data
  • active_client_count = active_client_count - 1
  • call MPI_SEND (alocal, m, MPI_REAL,
  • status(MPI_SOURCE), NO_DATA,
    MPI_COMM_WORLD, ierror)
  • endif
  • else
  • ! Handle node errors
  • endif

45
3rd Example: Client Loop
  • do i = 1, m
  • print *, result(i)
  • end do
  • else
  • ! Client side - Receive broadcasts of row, and
    column dimension
  • call MPI_BCAST (m, 1, MPI_INTEGER, server,
    MPI_COMM_WORLD, ierror)
  • call MPI_BCAST (n, 1, MPI_INTEGER, server,
    MPI_COMM_WORLD, ierror)
  • ! Allocate memory then get x values from the
    broadcast.
  • call MPI_BCAST (x, n, MPI_REAL, server,
    MPI_COMM_WORLD, ierror)
  • ! Listen for 1st message from server - decide
    if data or loop terminator.
  • call MPI_PROBE(server, MPI_ANY_TAG,
    MPI_COMM_WORLD, status, ierror)
  • do while(status(MPI_TAG).ne.NO_DATA) ! Loop
    until NO_DATA message recd.
  • call MPI_RECV (alocal,m,MPI_REAL, server,
  • status(MPI_TAG),MPI_COMM_WORLD,
    status,ierror)
  • buffer(1:m) = alocal(1:m) * x(status(MPI_TAG))
    ! multiply array by const.
  • call MPI_SEND (buffer, m, MPI_REAL, server,
  • status(MPI_TAG), MPI_COMM_WORLD,
    ierror) ! return results
  • ! Listen for the next message
  • call MPI_PROBE(server, MPI_ANY_TAG,
    MPI_COMM_WORLD,status, ierror)

46
Program Termination
  • endif
  • ! everyone waits here until all are done.
    Waiting area
  • call MPI_Barrier(MPI_COMM_WORLD, ierror)
  • call MPI_Finalize(ierror)
  • end program

47
Exercises
  • Pick an example code in a language you are
    familiar with and rewrite one of the broadcast
    operations using the equivalent sends and
    receives.
  • Located in /mpi1.
  • server_client.f - column-oriented Fortran 90
  • server_client_row2.c - row-oriented C
  • server_client_col2.c - column-oriented C
  • How would you rewrite one of the column-oriented
    example codes to do a full matrix-matrix
    multiplication? (Hint: look back at the pictures
    and think a bit about how the arrays have to be
    distributed.) What issues would you need to
    resolve?
  • Can you rewrite this without the MPI_Probe call?

48
References - MPI Tutorial
  • PACS online course
  • http://webct.ncsa.uiuc.edu:8900/
  • Edinburgh Parallel Computing Center
  • http://www.epcc.ed.ac.uk/epic/mpi/notes/mpi-course-epic.book_1.html
  • Argonne National Laboratory (MPICH)
  • http://www-unix.mcs.anl.gov/mpi/
  • MPI Forum
  • http://www.mpi-forum.org/docs/docs.html
  • MPI: The Complete Reference (vols. 1, 2)
  • Vol. 1 at http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
  • IBM (MPI on the RS/6000 (IBM SP))
  • http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts

49
References Some useful books
  • MPI: The Complete Reference
  • Marc Snir, Steve Otto, Steven Huss-Lederman,
    David Walker and Jack Dongarra, MIT Press
  • examples/mpidocs/mpi_complete_reference.ps.Z
  • Parallel Programming with MPI
  • Peter S. Pacheco, Morgan Kaufmann Publishers, Inc.
  • Using MPI: Portable Parallel Programming with the
    Message-Passing Interface
  • William Gropp, E. Lusk and A. Skjellum, MIT Press