
1
CS 267: Applications of Parallel Computers
Lecture 7: Message Passing Programming (MPI)
  • Jonathan Carter
  • jtcarter@lbl.gov

2
Overview
  • Message Passing Programming Review
  • What is MPI ?
  • Parallel Programming with MPI
  • Basic Send and Receive
  • Buffering and message delivery
  • Non-blocking communication
  • Collective Communication
  • Notes and Further Information

3
Message Passing Programming - Review
  • Model
  • Set of processes that each have local data and
    are able to communicate with each other by
    sending and receiving messages
  • Advantages
  • Useful and complete model to express parallel
    algorithms
  • Potentially fast
  • What is used in practice

4
What is MPI ?
  • A coherent effort to produce a standard for
    message passing
  • Before MPI: proprietary (Cray shmem, IBM MPL)
    and research community (PVM, p4) libraries
  • A message-passing library specification
  • For Fortran, C, and C++

5
MPI History
  • MPI forum: government, academia, and industry
  • November 1992: committee formed
  • May 1994: MPI 1.0 published
  • June 1995: MPI 1.1 published (clarifications)
  • April 1995: MPI 2.0 committee formed
  • July 1997: MPI 2.0 published
  • July 1997: MPI 1.2 published (clarifications)

6
Current Status
  • MPI 1.2
  • MPICH from ANL/MSU
  • LAM from Indiana University (Bloomington)
  • IBM, Cray, HP, SGI, NEC, Fujitsu
  • MPI 2.0
  • Fujitsu (all), IBM, Cray, NEC (most), MPICH, LAM,
    HP (some)

7
Parallel Programming With MPI
  • Communication
  • Basic send/receive (blocking)
  • Collective
  • Non-blocking
  • One-sided (MPI 2)
  • Synchronization
  • Implicit in point-to-point communication
  • Global synchronization via collective
    communication
  • Parallel I/O (MPI 2)

8
Creating Parallelism
  • Single Program Multiple Data (SPMD)
  • Each MPI process runs a copy of the same program
    on different data
  • Each copy runs at its own rate and is not
    explicitly synchronized
  • May take different paths through the program
  • Control through rank and number of tasks

9
Creating Parallelism
  • Multiple Program Multiple Data
  • Each MPI process can be a separate program
  • With OpenMP, pthreads
  • Each MPI process can be explicitly
    multi-threaded, or threaded via some directive
    set such as OpenMP

10
Hello World
  • include "mpi.h"
  • include ltstdio.hgt
  • int main ( int argc, char argv )
  • int rank, size
  • MPI_Init( argc, argv )
  • MPI_Comm_rank( MPI_COMM_WORLD, rank )
  • MPI_Comm_size( MPI_COMM_WORLD, size )
  • printf( "I am d of d\n", rank, size )
  • MPI_Finalize()
  • return 0

11
MPI Basic Send/Receive
[Diagram: Process 0 calls send, Process 1 calls recv]
  • Need to describe
  • How to identify process
  • How to identify message
  • How to identify data

12
Identifying Processes
  • MPI Communicator
  • Defines a group (set of ordered processes) and a
    context (a virtual network)
  • Rank
  • Process number within the group
  • MPI_ANY_SOURCE will receive from any process
  • Default communicator
  • MPI_COMM_WORLD: the whole group

13
Identifying Messages
  • An MPI Communicator defines a virtual network;
    send/recv pairs must use the same communicator
  • send/recv routines have a tag (integer variable)
    argument that can be used to identify a message,
    or screen for a particular message.
  • MPI_ANY_TAG will receive a message with any tag

14
Identifying Data
  • Data is described by a triple (address, type,
    count)
  • For send, this defines the message
  • For recv, this defines the size of the receive
    buffer
  • Amount of data received, source, and tag
    available via status data structure
  • Useful if using MPI_ANY_SOURCE, MPI_ANY_TAG, or
    unsure of message size (must be smaller than
    buffer)

15
MPI Types
  • Type may be recursively defined as
  • An MPI predefined type
  • A contiguous array of types
  • An array of equally spaced blocks
  • An array of arbitrarily spaced blocks
  • Arbitrary structure
  • Each user-defined type constructed via an MPI
    routine, e.g. MPI_TYPE_VECTOR
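
As an illustration (not from the original slides), a minimal C
sketch of MPI_Type_vector describing one column of a small matrix;
the matrix size, tag, and destination are made up for the example:

  #include "mpi.h"

  /* Sketch: describe one column of a row-major 10x10 C matrix as
     10 blocks of 1 double separated by a stride of 10 doubles,
     then send the whole column with a single call. */
  void send_column(double a[10][10], int dest)
  {
      MPI_Datatype column;
      MPI_Type_vector(10, 1, 10, MPI_DOUBLE, &column);
      MPI_Type_commit(&column);
      MPI_Send(&a[0][3], 1, column, dest, 99, MPI_COMM_WORLD); /* column 3 */
      MPI_Type_free(&column);
  }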

16
MPI Predefined Types
  • C Fortran
  • MPI_INT MPI_INTEGER
  • MPI_FLOAT MPI_REAL
  • MPI_DOUBLE MPI_DOUBLE_PRECISION
  • MPI_CHAR MPI_CHARACTER
  • MPI_UNSIGNED MPI_LOGICAL
  • MPI_LONG MPI_COMPLEX
  • Language Independent
  • MPI_BYTE

17
MPI Types
  • Explicit data description is useful
  • Simplifies programming, e.g. send row/column of a
    matrix with a single call
  • Heterogeneous machines
  • May improve performance
  • Reduce memory-to-memory copies
  • Allow use of scatter/gather hardware
  • May hurt performance
  • User packing of data likely faster

18
Point-to-point Example
Process 0:
  #define TAG 999
  float a[10];
  int dest = 1;
  MPI_Send(a, 10, MPI_FLOAT, dest, TAG, MPI_COMM_WORLD);

Process 1:
  #define TAG 999
  MPI_Status status;
  int count;
  float b[20];
  int sender = 0;
  MPI_Recv(b, 20, MPI_FLOAT, sender, TAG, MPI_COMM_WORLD, &status);
  MPI_Get_count(&status, MPI_FLOAT, &count);
19
MPI_Send
  • MPI_Send(address, count, type, dest, tag, comm)
  • address: pointer to data
  • count: number of elements to be sent
  • type: data type
  • dest: destination process
  • tag: identifying tag
  • comm: communicator
  • When MPI_Send returns, the message is sent and the
    data can be reused. The message has not
    necessarily been received.

20
MPI_Recv
  • MPI_Recv(address, count, type, source, tag, comm,
    status)
  • address: pointer to data
  • count: maximum number of elements to receive (size
    of the receive buffer)
  • type: data type
  • source: source process (or MPI_ANY_SOURCE)
  • tag: identifying tag (or MPI_ANY_TAG)
  • comm: communicator
  • status: sender, tag, and message size
  • When MPI_Recv returns, the message has been
    received and the data can be used.

21
MPI Status Data Structure
  • In C
      MPI_Status status;
      int recvd_tag, recvd_from, recvd_count;
      recvd_tag  = status.MPI_TAG;
      recvd_from = status.MPI_SOURCE;
      MPI_Get_count( &status, MPI_INT, &recvd_count );
  • In Fortran
      integer status(MPI_STATUS_SIZE)

22
MPI Programming with Six Routines
  • Some programs can be written with only six
    routines
  • MPI_Init
  • MPI_Finalize
  • MPI_Comm_size
  • MPI_Comm_rank
  • MPI_Send
  • MPI_Recv
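
For illustration (assembled from the calls above, not taken from the
slides), a complete program that uses only these six routines; the
payload value is arbitrary:

  #include "mpi.h"
  #include <stdio.h>

  int main( int argc, char *argv[] )
  {
      int rank, size, value = 0;
      MPI_Status status;

      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      MPI_Comm_size( MPI_COMM_WORLD, &size );

      if (size > 1) {
          if (rank == 1) {
              value = 42;   /* arbitrary payload */
              MPI_Send( &value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD );
          } else if (rank == 0) {
              MPI_Recv( &value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status );
              printf( "rank 0 received %d from rank 1\n", value );
          }
      }
      MPI_Finalize();
      return 0;
  }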

23
Data Exchange
Process 0:  MPI_Recv(…,1,…);  MPI_Send(…,1,…)
Process 1:  MPI_Recv(…,0,…);  MPI_Send(…,0,…)

Deadlock. MPI_Recv will not return until the send is
posted.
24
Data Exchange
Process 0:  MPI_Send(…,1,…);  MPI_Recv(…,1,…)
Process 1:  MPI_Send(…,0,…);  MPI_Recv(…,0,…)

May deadlock, depending on the implementation. If
the messages can be buffered, the program will run.
Called 'unsafe' in the MPI standard.
25
Buffering in MPI
  • Implementation may buffer on sending process,
    receiving process, both, or none.
  • In practice, tend to buffer "small" messages on
    receiving process.
  • MPI has a buffered send-mode
  • MPI_Buffer_attach
  • MPI_Buffer_detach
  • MPI_Bsend
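
A minimal sketch (not in the slides) of the buffered send mode; the
message size and the use of MPI_Pack_size to size the buffer are
illustrative:

  #include "mpi.h"
  #include <stdlib.h>

  /* Sketch: attach a user buffer so MPI_Bsend can complete locally
     even if no matching receive has been posted yet. */
  void bsend_example(double *data, int n, int dest)
  {
      int bufsize;
      void *buf;

      /* room for one message plus MPI's per-message overhead */
      MPI_Pack_size(n, MPI_DOUBLE, MPI_COMM_WORLD, &bufsize);
      bufsize += MPI_BSEND_OVERHEAD;
      buf = malloc(bufsize);

      MPI_Buffer_attach(buf, bufsize);
      MPI_Bsend(data, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
      MPI_Buffer_detach(&buf, &bufsize);  /* waits until buffered data is sent */
      free(buf);
  }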

26
Message Delivery
[Diagram: eager vs. rendezvous message delivery between P0 and P1]
  • Eager: send data immediately; store in remote buffer
  • No synchronization
  • Only one message sent
  • Data is copied
  • Uses memory for buffering (less for application)
  • Rendezvous: send message header; wait for recv to be
    posted; send data
  • No data copy
  • More memory for application
  • More messages required
  • Synchronization (send blocks until recv posted)

27
Message Delivery
  • Many MPI implementations use both the eager and
    rendezvous methods of message delivery
  • Switch between the two methods according to
    message size
  • Often the cutover point is controllable via an
    environment variable, e.g. MP_EAGER_LIMIT and
    MP_USE_FLOW_CONTROL on the IBM SP

28
Message Delivery
  • Non-overtaking messages
  • Messages sent from the same process will arrive
    in the order sent
  • No fairness
  • On a wildcard receive, it is possible to receive from
    only one source despite other messages being sent
  • Progress
  • For a matched send and receive pair, at least
    one will complete, independent of other messages.

29
Performance Comparison
30
Data Exchange
  • Ways around 'unsafe' data exchange
  • Match send/recv pairs: hard in the general case
  • Use MPI_Sendrecv (see the sketch below)
  • Use non-blocking communication
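
For example (a sketch, not in the slides), the exchange of slides 23
and 24 written with MPI_Sendrecv; the helper name and float payload
are illustrative:

  #include "mpi.h"

  /* Sketch: exchange n floats with a partner rank in a single call.
     MPI_Sendrecv pairs the send and the receive internally, so neither
     process has to order them and no buffering is assumed. */
  void exchange(float *sendbuf, float *recvbuf, int n, int partner)
  {
      MPI_Status status;
      MPI_Sendrecv(sendbuf, n, MPI_FLOAT, partner, 0,
                   recvbuf, n, MPI_FLOAT, partner, 0,
                   MPI_COMM_WORLD, &status);
  }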

31
Non-blocking Communication
  • Communication split into two parts
  • MPI_Isend or MPI_Irecv starts communication and
    returns a request data structure.
  • MPI_Wait (also MPI_Waitall, MPI_Waitany) uses
    request as an argument and blocks until
    communication is complete.
  • MPI_Test uses request as an argument and checks
    for completion.
  • Advantages
  • No deadlocks
  • Overlap communication with computation
  • Exploit bi-directional communication

32
Data Exchange with Isend/recv
Process 0:  MPI_Isend(…,1,…, request);  MPI_Recv(…,1,…);  MPI_Wait(request, status)
Process 1:  MPI_Isend(…,0,…, request);  MPI_Recv(…,0,…);  MPI_Wait(request, status)
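
A concrete C sketch (not in the slides) of the pattern above; the
helper name and float payload are illustrative:

  #include "mpi.h"

  /* Sketch: start the send with MPI_Isend, receive from the partner,
     then wait for the send to complete. */
  void exchange_nonblocking(float *sendbuf, float *recvbuf, int n, int partner)
  {
      MPI_Request request;
      MPI_Status status;

      MPI_Isend(sendbuf, n, MPI_FLOAT, partner, 0, MPI_COMM_WORLD, &request);
      MPI_Recv(recvbuf, n, MPI_FLOAT, partner, 0, MPI_COMM_WORLD, &status);
      MPI_Wait(&request, &status);   /* sendbuf may be reused after this */
  }
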
33
Non-blocking send/recv buffers
  • May not modify or read the message buffer between
    MPI_Irecv and MPI_Wait calls.
  • May not modify or read the message buffer between
    MPI_Isend and MPI_Wait calls.
  • May not have two MPI_Irecv pending on the same
    buffer.
  • May not have two MPI_Isend pending on the same
    buffer.
  • Restrictions provide flexibility for implementers.

34
Performance Comparison
35
Collective Communication
  • Optimized algorithms, scaling as log(n)
  • Differences from point-to-point
  • Amount of data sent must match amount of data
    specified by receivers
  • No tags
  • Blocking only
  • MPI_Barrier(comm)
  • All processes in the communicator are
    synchronized. The only collective call where
    synchronization is guaranteed.

36
Collective Move Functions
  • MPI_Bcast(data, count, type, src, comm)
  • Broadcast data from src to all processes in the
    communicator.
  • MPI_Gather(in, count, type, out, count, type,
    dest, comm)
  • Gathers data from all nodes to dest node
  • MPI_Scatter(in, count, type, out, count, type,
    src, comm)
  • Scatters data from src node to all nodes
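
A short sketch (not in the slides) combining scatter and gather; the
block size and buffer names are illustrative:

  #include "mpi.h"

  /* Sketch: the root scatters one block of 'all' to every process
     (including itself); each process works on its block 'mine' and
     the blocks are gathered back into 'all' on the root. */
  void scatter_gather(float *all, float *mine, int blocksize, int root)
  {
      MPI_Scatter(all,  blocksize, MPI_FLOAT,
                  mine, blocksize, MPI_FLOAT, root, MPI_COMM_WORLD);

      /* ... compute on 'mine' here ... */

      MPI_Gather(mine, blocksize, MPI_FLOAT,
                 all,  blocksize, MPI_FLOAT, root, MPI_COMM_WORLD);
  }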

37
Collective Move Functions
[Diagram: broadcast, scatter, and gather of data across processes]
38
Collective Move Functions
  • Additional functions
  • MPI_Allgather, MPI_Gatherv, MPI_Scatterv,
    MPI_Allgatherv, MPI_Alltoall

39
Collective Reduce Functions
  • MPI_Reduce(send, recv, count, type, op, root,
    comm)
  • Global reduction operation, op, on send buffer.
    Result is at process root in recv buffer. op may
    be user defined, MPI predefined operation.
  • MPI_Allreduce(send, recv, count, type, op, comm)
  • As above, except result broadcast to all
    processes.
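
A minimal sketch (not in the slides) of MPI_Allreduce computing a
global sum that every process receives:

  #include "mpi.h"

  /* Sketch: each process contributes a partial sum; all of them get
     the global total back. */
  float global_sum(float local)
  {
      float total;
      MPI_Allreduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
      return total;
  }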

40
Collective Reduce Functions
[Diagram: allreduce combines data from all processes and returns the result to every process]
41
Collective Reduce Functions
  • Additional functions
  • MPI_Reduce_scatter, MPI_Scan
  • Predefined operations
  • Sum, product, min, max, …
  • User-defined operations
  • MPI_Op_create
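
A sketch (not in the slides) of a user-defined operation registered
with MPI_Op_create; the "absolute maximum" operation and all names
are made up, and the function assumes MPI_FLOAT data:

  #include "mpi.h"

  /* Sketch: combine function that keeps the element of larger
     absolute value (assumes the data are floats). */
  void absmax_fn(void *in, void *inout, int *len, MPI_Datatype *type)
  {
      float *a = (float *)in, *b = (float *)inout;
      for (int i = 0; i < *len; i++)
          if ((a[i] < 0 ? -a[i] : a[i]) > (b[i] < 0 ? -b[i] : b[i]))
              b[i] = a[i];
  }

  float global_absmax(float local)
  {
      MPI_Op absmax;
      float result;
      MPI_Op_create(absmax_fn, 1, &absmax);   /* 1 = operation commutes */
      MPI_Allreduce(&local, &result, 1, MPI_FLOAT, absmax, MPI_COMM_WORLD);
      MPI_Op_free(&absmax);
      return result;
  }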

42
Notes on C, Fortran, C++
  • In C
  • #include "mpi.h"
  • MPI functions return an error code or MPI_SUCCESS
  • In Fortran
  • include 'mpif.h'
  • use mpi (MPI 2)
  • All MPI calls are subroutines; the return code is
    the final argument
  • In C++
  • size = MPI::COMM_WORLD.Get_size() (MPI 2)

43
Other Features
  • Other send modes
  • synchronous mode can be used to check whether a
    program is safe, since it forces a rendezvous
    protocol (see the sketch after this list)
  • ready mode is difficult to use and doesn't boost
    performance on any implementation: the user must
    ensure that the recv is already posted, which
    requires more explicit synchronization
  • persistent communication: pre-specify a message
    envelope and data
  • Create new communicators
  • Libraries, logically partitioning tasks
  • Topologies
  • Cartesian and graph topologies can map physical
    hardware to processes
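
As a sketch (not in the slides) of the safety check mentioned above:
MPI_Ssend takes the same arguments as MPI_Send but completes only
once the matching receive has started, so an 'unsafe' program will
hang instead of silently relying on buffering:

  #include "mpi.h"

  /* Sketch: drop-in replacement for MPI_Send used while testing;
     if the program deadlocks with it, it was relying on buffering. */
  void checked_send(float *buf, int n, int dest, int tag)
  {
      MPI_Ssend(buf, n, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
  }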

44
Other Features
  • Probe and cancel
  • Check for characteristics of incoming message,
    possibly cancel
  • I/O (MPI 2)
  • Individual, shared or explicit file pointers
  • Collective or individual (by process) file access
  • Blocking and non-blocking access
  • One-sided communication (MPI 2)
  • Put, get, and accumulate
  • Loose synchronization model
  • Remote lock/unlock of memory
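
A minimal sketch (not in the slides) of MPI-2 one-sided communication
with put and fence synchronization; the window contents and ranks are
illustrative, and at least two processes are assumed:

  #include "mpi.h"

  /* Sketch: expose 'window' as remotely accessible memory, have
     rank 0 put n doubles directly into rank 1's window, and use
     MPI_Win_fence for the loose synchronization. */
  void put_example(double *window, double *local, int n, int rank)
  {
      MPI_Win win;
      MPI_Win_create(window, n * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);                 /* open access epoch */
      if (rank == 0)
          MPI_Put(local, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
      MPI_Win_fence(0, win);                 /* data now visible at rank 1 */

      MPI_Win_free(&win);
  }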

45
Free MPI Implementations
  • MPICH from Argonne National Lab and Mississippi
    State Univ.
  • http://www-unix.mcs.anl.gov/mpi/mpich/
  • Runs on
  • network of workstations
  • SMP using shared memory
  • MPP system support limited
  • Windows

46
Free MPI Implementations
  • LAM from Ohio Supercomputer Center, University of
    Notre Dame, Indiana University
  • http://www.lam-mpi.org/
  • Many MPI 2 features
  • Runs on
  • Network of workstations

47
IBM MPI Implementation
  • MPI Programming Guide and MPI Subroutine
    Reference
  • http://www1.ibm.com/servers/eserver/pseries/library/sp_books/pe.html
  • Compatible with Pthreads and OpenMP
  • All of MPI 2 except for process spawning

48
Further Information
  • MPI Standards
  • http://www.mpi-forum.org/
  • Books
  • Using MPI: Portable Parallel Programming with the
    Message-Passing Interface (second edition), by
    Gropp, Lusk and Skjellum
  • Using MPI-2: Advanced Features of the
    Message-Passing Interface, by Gropp, Lusk and
    Thakur
  • MPI: The Complete Reference. Volume 1, by Snir,
    Otto, Huss-Lederman, Walker and Dongarra
  • MPI: The Complete Reference. Volume 2, by Gropp,
    Huss-Lederman, Lumsdaine, Lusk, Nitzberg, Saphir,
    and Snir

49
Example
  • Calculate the energy of a system of particles
    interacting via a Coulomb potential.

  real coord(3,n), charge(n)

  energy = 0.0
  do i = 1, n
    do j = 1, i-1
      rdist = 1.0/sqrt( (coord(1,i)-coord(1,j))**2 +   &
                        (coord(2,i)-coord(2,j))**2 +   &
                        (coord(3,i)-coord(3,j))**2 )
      energy = energy + charge(i)*charge(j)*rdist
    end do
  end do
50
MPI Example 1
  • Functional decomposition
  • each process will compute roughly the same number
    of interactions
  • accomplish this by dividing up the outer loop
  • replicate data to make communication simple
  • this approach will not scale

51
MPI - Example 1
  include 'mpif.h'
  parameter (n=50000)
  dimension coord(3,n), charge(n)

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, npes, ierr)

  call initdata(n, coord, charge, mype)
  e = energy(mype, npes, n, coord, charge)

  etotal = 0.0
  call mpi_reduce(e, etotal, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  if (mype.eq.0) write(*,*) etotal
  call mpi_finalize(ierr)
  end
52
MPI - Example 1
  subroutine initdata(n, coord, charge, mype)
  include 'mpif.h'
  dimension coord(3,n), charge(n)

  if (mype.eq.0) then
     GENERATE coords, charge
  end if
  ! broadcast data to slaves
  call mpi_bcast(coord, 3*n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
  call mpi_bcast(charge, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
  return
  end
53
MPI - Example 1
  real function energy(mype, npes, n, coord, charge)
  dimension coord(3,n), charge(n)

  inter   = n*(n-1)/npes
  nstart  = nint(sqrt(real(mype*inter))) + 1
  nfinish = nint(sqrt(real((mype+1)*inter)))
  if (mype.eq.npes-1) nfinish = n

  total = 0.0
  do i = nstart, nfinish
    do j = 1, i-1
      ....
      total = total + charge(i)*charge(j)*rdist
    end do
  end do
  energy = total
  return
54
MPI - Example 2
  • Domain decomposition
  • each task takes a chunk of particles
  • in turn, receives particle data from another
    process and computes all interactions between own
    data and received data
  • repeat until all interactions are done

55
MPI - Example 2
[Diagram: ring exchange of particle blocks among 5 processes. Each
process keeps its own block of 20 particles (Proc 0: 1-20, Proc 1:
21-40, Proc 2: 41-60, Proc 3: 61-80, Proc 4: 81-100) and at each
step also holds one block received from a neighbour:]

           Proc 0    Proc 1    Proc 2    Proc 3    Proc 4
  Step 1   21-40     41-60     61-80     81-100    1-20
  Step 2   41-60     61-80     81-100    1-20      21-40
  Step 3   61-80     81-100    1-20      21-40     41-60
56
  subroutine initdata(n, coord, charge, mype, npes, npepmax, nmax, nmin)
  include 'mpif.h'
  dimension coord(3,n), charge(n)
  integer status(MPI_STATUS_SIZE)

  itag    = 0
  isender = 0
  if (mype.eq.0) then
     do ipe = 1, npes-1
        GENERATE coord, charge for PE ipe
        call mpi_send(coord, nj*3, MPI_REAL, ipe, itag, MPI_COMM_WORLD, ierror)
        call mpi_send(charge, nj, MPI_REAL, ipe, itag, MPI_COMM_WORLD, ierror)
     end do
     GENERATE coord, charge for self
  else
     ! receive particles
     call mpi_recv(coord, 3*n, MPI_REAL, isender, itag, MPI_COMM_WORLD, status, ierror)
     call mpi_recv(charge, n, MPI_REAL, isender, itag, MPI_COMM_WORLD, status, ierror)
  endif
  return
57
  niter = npes/2
  do iter = 1, niter
     ! PE to send to and receive from
     if (ipsend.eq.npes-1) then
        ipsend = 0
     else
        ipsend = ipsend + 1
     end if
     if (iprecv.eq.0) then
        iprecv = npes-1
     else
        iprecv = iprecv - 1
     end if
     ! send and receive particles
     call mpi_sendrecv(coordi, 3*n, MPI_REAL, ipsend, itag,    &
                       coordj, 3*n, MPI_REAL, iprecv, itag,    &
                       MPI_COMM_WORLD, status, ierror)
     call mpi_sendrecv(chargei, n, MPI_REAL, ipsend, itag,     &
                       chargej, n, MPI_REAL, iprecv, itag,     &
                       MPI_COMM_WORLD, status, ierror)
     ! accumulate energy
     e = e + energy2(n, coordi, chargei, n, coordj, chargej)
  end do