1
MPI Workshop - I
  • Introduction to Point-to-Point Communications
  • HPC@UNM Research Staff
  • Dr. Andrew C. Pineda, Dr. Paul M. Alsing
  • Week 1 of 2

2
Table of Contents
  • Introduction to Parallelism and MPI
  • MPI Standard
  • MPI Course Map
  • MPI Routines/Exercises
  • point-to-point communications basics
  • communicators
  • blocking versus non-blocking calls
  • collective communications basics
  • how to run MPI routines at HPC@UNM
  • References

3
Parallelism and MPI
  • Parallelism is accomplished by
  • Breaking up the task into smaller tasks
  • Assigning the smaller tasks to multiple workers
    to work on simultaneously
  • Coordinating the workers
  • Not breaking up the task so small that it takes
    longer to tell the worker what to do than it does
    to do it
  • Buzzwords: latency, bandwidth

4
Parallelism & MPI
  • Message Passing Model
  • Multiple processors operate independently but
    each has its own private memory (distributed
    processors and memory)
  • Data is shared across a communications network
    using cooperative operations via library calls
  • User responsible for synchronization using
    message passing
  • Advantages
  • Memory is scalable with the number of processors:
    increasing the number of processors also increases
    the total memory and bandwidth.
  • Each processor can rapidly access its own memory
    without interference from others.

5
Parallelism & MPI
  • Another advantage is flexibility of programming
    schemes
  • Functional parallelism - different tasks done at
    the same time by different nodes
  • Master-Slave (Client-server) parallelism - one
    process assigns subtasks to other processes.
  • Data parallelism - data can be distributed
  • SPMD parallelism - Single Program, Multiple Data
    - same code replicated to each process but
    operating on different data

6
Parallelism & MPI
  • Disadvantages
  • Sometimes difficult to map existing data
    structures to this memory organization
  • User responsible for sending and receiving data
    among processors
  • To minimize overhead and latency, data should be
    blocked up in large chunks and shipped before
    receiving node needs it

7
Parallelism & MPI
  • Message Passing Interface - MPI
  • A standard portable message-passing library
    definition developed in 1993 by a group of
    parallel computer vendors, computer scientists,
    and applications developers.
  • Available to both Fortran and C programs (and,
    through these, to F90 and C++).
  • Available on a wide variety of parallel machines.
  • Target platform is a distributed memory system
    such as the Los Lobos Linux Cluster.
  • All inter-task communication is by message
    passing.
  • All parallelism is explicit: the programmer is
    responsible for identifying the parallelism in the
    program and implementing it with MPI constructs.

8
MPI Standardization Effort
  • MPI Forum initiated in April 1992 at a Workshop
    on Message Passing Standards.
  • Initially about 60 people from 40 organizations
    participated.
  • Defines an interface that can be implemented on
    many vendors' platforms with no significant
    changes in the underlying communication and
    system software.
  • Allow for implementations that can be used in a
    heterogeneous environment.
  • Semantics of the interface should be language
    independent.
  • There are currently over 125 people from 52
    organizations who have contributed to this
    effort.

9
MPI-Standard Releases
  • May 1994: MPI-Standard version 1.0
  • June 1995: MPI-Standard version 1.1
  • includes minor revisions of 1.0
  • July 1997: MPI-Standard versions 1.2 and 2.0
  • with extended functions
  • 2.0 - supports real-time operations, spawning of
    processes, more collective operations
  • 2.0 - explicit C++ and F90 bindings
  • Complete postscript and HTML documentation can be
    found at http://www.mpi-forum.org/docs/docs.html
  • Currently available at HPC@UNM

10
MPI Implementations
  • Message Passing Libraries
  • MPI - Message Passing Interface
  • PVM - Parallel Virtual Machine
  • Public Domain MPI Implementations
  • MPICH (ANL and MSU) (v. 1.2.5, 6 Jan 2003)
  • www-unix.mcs.anl.gov/mpi/mpich/
  • MPICH2 (ANL and MSU) (v. 0.94, 22 Aug 2003)
  • LAM (v. 7.0, MPI 1.2 + much of 2.0, 2 Jul 2003)
  • www.lam-mpi.org
  • Vendor MPI Implementations
  • IBM-MPI, SGI (based on MPICH), others
  • Available on HPC@UNM platforms.

11
Course Roadmap
12
Program examples/MPI calls
  • Hello - Basic MPI code with no communications.
  • Illustrates some basic aspects of parallel
    programming
  • MPI_INIT - starts MPI communications
  • MPI_COMM_RANK - get processor id
  • MPI_COMM_SIZE - get number of processors
  • MPI_FINALIZE - end MPI communications
  • Swap - Basic MPI point-to-point messages
  • MPI_SEND - blocking send
  • MPI_RECV - blocking receive
  • MPI_IRECV, MPI_WAIT - non-blocking receive

13
Program examples/MPI calls
  • client-server - Basic MPI code illustrating
    functional parallelism
  • MPI_INIT - starts MPI communications
  • MPI_COMM_RANK - get processor id
  • MPI_COMM_SIZE - get number of processors
  • MPI_ATTR_GET - get attributes (in this case
    maximum allowed tag value)
  • MPI_BARRIER - synchronization
  • MPI_BCAST - collective call - broadcast
  • MPI_PROBE - peek at message queue to see what
    next message is/wait for message
  • MPI_SEND/MPI_RECV
  • MPI_FINALIZE - end MPI communications

14
MPI 1.x Language Bindings
  • Fortran 77
  • include 'mpif.h'
  • call MPI_ABCDEF(list of arguments, IERROR)
  • Fortran 90 via Fortran 77 Library
  • F90 strong type checking of arguments can cause
    difficulties
  • cannot handle more than one object type
  • include mpif90.h in F90.
  • ANSI C
  • #include "mpi.h"
  • IERROR = MPI_Abcdef(list of arguments)
  • C++ via C Library
  • via extern "C" declaration, #include "mpi.h"

15
1st Example
  • Hello, World in MPI
  • hello.f - Fortran 90 version
  • hello.c - C version
  • Can be found in the mpi1 subdirectory of your
    guest account.
  • Goals
  • Demonstrate basic elements of an MPI program
  • Describe available compiler/network combinations
  • Basic introduction to job scheduler (PBS) used on
    HPC@UNM platforms

16
1st Example
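(The code listing shown on this slide is not preserved in the transcript. Below is a minimal C sketch of an MPI "Hello, World" built from the calls listed on slide 12; it is an illustration only, not the original hello.c from the mpi1 directory.)

    /* Minimal MPI hello sketch: init, query rank/size, print, finalize. */
    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start MPI communications */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id        */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes      */

        printf("Hello, World from process %d of %d\n", rank, size);

        MPI_Finalize();                         /* end MPI communications   */
        return 0;
    }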
17
Communicators in MPI
  • Communicators provide a handle to a group of
    processes/processors.
  • Within each communicator, a process is assigned a
    rank.
  • MPI provides a standard communicator
    MPI_COMM_WORLD that represents all of the
    processes/processors.
  • MPI_Comm_rank and MPI_Comm_size return rank and
    number of processors.
  • Mapping of processors to rank is implementation
    dependent.
  • Appear in virtually every MPI call.
  • Communicators can be created
  • Can create subgroups of processors
  • Can be used to map topology of processors onto
    topology of data using MPI Cartesian Topology
    functions
  • Useful if your data can be mapped onto a grid.
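(A short C sketch of creating a subgroup communicator with MPI_Comm_split, a standard MPI call; the even/odd split below is only an illustration and assumes MPI has already been initialized.)

    /* Split MPI_COMM_WORLD into two subgroups by rank parity. */
    MPI_Comm half_comm;
    int world_rank, half_rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    /* color picks the subgroup, key orders ranks within it */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &half_comm);
    MPI_Comm_rank(half_comm, &half_rank);  /* rank within the new communicator */
    MPI_Comm_free(&half_comm);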

18
1st Example
19
Compiling MPI codes
  • You invoke your compiler via scripts that tack on
    the appropriate MPI include and library files
  • mpif77 -o <progname> <filename>.f
  • mpif77 -c <filename>.f
  • mpif77 -o <progname> <filename>.o
  • mpif90 -o <progname> <filename>.f90
  • mpicc -o <progname> <filename>.c
  • mpiCC -o <progname> <filename>.cc (not
    supported)
  • These scripts save you from having to know where
    the libraries are located. Use them!!!
  • The underlying compiler (NAG, PGI, etc.) is
    determined by how MPIHOME and PATH are set up in
    your environment.

20
Compiling MPI codes contd.
  • MPICH
  • Two choices of communications networks
  • eth - FastEthernet (100Mb/sec)
  • gm - Myrinet (1.2 Gb/sec, Los Lobos)
  • Gigabit Ethernet (1.0Gb/sec, Azul)
  • Many compilers
  • NAG F95 - f95
  • PGI - pgf77, pgcc, pgCC, pgf90
  • GNU Compiler Suite - gcc, g77
  • Combination is determined by your environment.

21

Compiling MPI codes contd.
  • MPIHOME settings determine your choice of the
    underlying compilers and communications network
    for your compilation.
  • Compilers
  • PGI - Portland Group (pgcc, pgf77, pgf90)
  • GCC - Gnu Compiler Suite (gcc, g77, g++)
  • NAG - Numerical Algorithms Group (f95)
  • Networks
  • FastEthernet (ch_p4)
  • Myrinet (ch_gm)

22
Supported MPIHOME values
Stealth PGI!!!
23
Portable Batch Scheduler(PBS)
  • To submit job use
  • qsub file.pbs
  • file.pbs is a shell script that invokes mpirun
  • qsub -q R1234 file.pbs
  • submit to a reservation queue R1234
  • qsub -I -l nodes=1
  • Interactive session
  • To check status
  • qstat
  • qstat -a (shows status of everyone's
    jobs)
  • qstat -n jobid (shows nodes assigned to job)
  • To cancel job
  • qdel job_id

24
PBS command file (file.pbs)
  • Just a variation on a shell script
  • #PBS -l nodes=4:ppn=2,walltime=4:00:00
  • any set up you need to do,
  • e.g. staging data
  • mpirun -np 8 -machinefile $PBS_NODEFILE
    <executable or script>
  • cleanup or save auxiliary files

See man qsub for other -l options
The script runs on the head node. Use ssh or dsh
to run commands on others. In most cases, you can
rely on PBS to clean up after you.
25
Lab exercise 1
  • Download, compile and run hello.f or hello.c.
  • Run several times with different numbers of
    processors.
  • Do the results always come out the same?
  • If not, can you explain why?
  • Copy files from
  • mpi1 subdirectory of your guest account.

26
2nd Example Code
  • Swap
  • swap.f - F90 implementation
  • swap.c - C implementation
  • Goals
  • Illustrate a basic message exchange among a few
    processors
  • Introduce basic flavors of send and receive
    operations
  • Illustrate potential pitfalls such as deadlock
    situations
  • Can be found in the mpi1 subdirectory of your
    guest account.

27
Basic Send/Receive operations
  • Send
  • MPI_Send - standard mode blocking send
  • blocking: does not return until the message
    buffer can be reused
  • semantics are blocking, but may not block in
    practice
  • vendor free to implement in most efficient manner
  • in most cases you should use this
  • MPI_Isend - immediate non-blocking send
  • lets MPI know we are sending data and where it
    is, but we return immediately with the promise
    that we will not touch the data until we have
    verified that the message buffer can be safely
    reused.
  • Pair with MPI_Wait or MPI_Test to complete/verify
    completion
  • Allows overlap of communication and computation
    by letting processor do work while we wait on
    completion.
  • Many other flavors MPI_Bsend (Buffered Send),
    MPI_Ssend (Synchronous Send), MPI_Rsend (Ready
    Send) to name a few.

28
Basic Send/Receive operations
  • Receive
  • MPI_Recv - Standard mode blocking receive
  • Does not return until data from matching send is
    in receive buffer.
  • MPI_Irecv - Immediate mode non-blocking receive
  • lets MPI know we are expecting data and where to
    put it.
  • Pair with MPI_Wait or MPI_Test to complete
  • Unlike sends, these are the only two flavors.
  • Completing non-blocking calls
  • MPI_Wait - blocking
  • MPI_Test - non-blocking
  • Getting information about an incoming message
  • MPI_Probe - blocking
  • MPI_Iprobe - non-blocking
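(A C fragment sketching the non-blocking receive pattern described above; the buffer name and count are invented, and MPI is assumed to be initialized.)

    /* Post the receive, do useful work, then complete it with MPI_Wait. */
    double buf[100];
    MPI_Request req;
    MPI_Status status;

    MPI_Irecv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
              MPI_COMM_WORLD, &req);   /* tell MPI where the data will go     */
    /* ... computation that does not touch buf ... */
    MPI_Wait(&req, &status);           /* block until the message has arrived */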

29
Send/Receive Structure
  • Basic structure of a standard send/receive call
  • Fortran
  • MPI_Recv(data components, message envelope,
    status, ierror)
  • C
  • ierror = MPI_Recv(data components, message
    envelope, status)
  • Data components consists of 3 parts
  • data buffer (holds data you are sending)
  • size of buffer - in units of the data type (e.g.
    5 integers, or 5 floats)
  • type descriptor - corresponds to standard
    language data types
  • Message envelope consists of 4 parts, 3 of which
    are specified
  • source/destination - integer
  • tag - an integer label between 0 and an
    implementation-dependent maximum value (>= 32K-1)
  • communicator
  • Status (does not appear in Send operations), and
    Ierror
  • In Fortran, an array used to return information
    about the received message, e.g. source, tag.
    Example: status(MPI_SOURCE)
  • In C, this is a C structure. Example:
    status.MPI_SOURCE.
  • Ierror returns error status.
  • Other types of sends, receives require additional
    arguments, see supplementary materials.
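(A hedged C fragment mapping the pieces above onto concrete calls; the array name, count and tag are invented, and at least two ranks plus an initialized MPI are assumed.)

    /* Data components: buffer, count, type.  Envelope: dest/source, tag, comm. */
    double work[50];
    MPI_Status status;
    int rank, ierror;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        ierror = MPI_Send(work, 50, MPI_DOUBLE,    /* data components             */
                          1, 99, MPI_COMM_WORLD);  /* envelope: dest, tag, comm   */
    } else if (rank == 1) {
        ierror = MPI_Recv(work, 50, MPI_DOUBLE,    /* data components             */
                          0, 99, MPI_COMM_WORLD,   /* envelope: source, tag, comm */
                          &status);                /* actual source/tag returned  */
    }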

30
MPI Type Descriptors
  • C types
  • MPI_INT
  • MPI_CHAR
  • MPI_FLOAT
  • MPI_DOUBLE
  • MPI_LONG
  • MPI_UNSIGNED (unsigned int)
  • Many others corresponding to other C types
  • MPI_BYTE
  • MPI_PACKED
  • Fortran 77/Fortran 90 types
  • MPI_INTEGER
  • MPI_CHARACTER
  • MPI_REAL
  • MPI_DOUBLE_PRECISION
  • MPI_COMPLEX
  • MPI_LOGICAL
  • MPI_BYTE
  • MPI_PACKED

31
Matching Sends to Receives
  • Message Envelope - consists of the source,
    destination, tag, and communicator values.
  • A message can only be received if the specified
    envelope agrees with the message envelope.
  • The source and tag portions can be wildcarded
    using MPI_ANY_SOURCE and MPI_ANY_TAG. (Useful for
    writing client-server applications.)
  • Source = destination is allowed except for
    blocking operations.
  • Variable types of the messages must match.
  • In heterogeneous systems, MPI automatically
    handles data conversions, e.g. big-endian to
    little-endian.
  • Messages (with the same envelope) are
    non-overtaking.
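(A small C fragment illustrating wildcarded receives and reading the actual envelope back from the status; the buffer and count are invented, and stdio.h plus an initialized MPI are assumed.)

    /* Accept a message from any sender with any tag, then inspect the envelope. */
    int data[10];
    MPI_Status status;

    MPI_Recv(data, 10, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    printf("message from rank %d with tag %d\n",
           status.MPI_SOURCE, status.MPI_TAG);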

32
2nd Example
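(The swap code listing shown on this slide is not preserved in the transcript. Below is a hedged C sketch of a two-process exchange using blocking calls, ordered so they cannot deadlock; it is not the original swap.c.)

    /* Deadlock-free blocking exchange between ranks 0 and 1
       (assumes MPI_Init has already been called). */
    double mine = 1.0, theirs = 0.0;
    int rank, partner;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = 1 - rank;                   /* 0 <-> 1 */

    if (rank == 0) {                      /* one side sends first...     */
        MPI_Send(&mine,   1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);
    } else {                              /* ...the other receives first */
        MPI_Recv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&mine,   1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    }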
33
2nd Example
34
Lab exercise 2
  • swap.f, swap.c
  • Compile and run either the Fortran or C code with
    two processes.
  • Try running with the send and receive operations
    in the two sections in the sequences shown below
    (in addition to that in the code). What happens
    in each case?
  • Can be found in the mpi1 subdirectory of your
    guest account.

35
Non-blocking communications
  • Here's what the MPI standard says about how your
    experiment should have worked out.

The last case fails because both processors are
blocked in the receive operation and never
execute their sends. Case 1 works if the send is
buffered. This allows the sends to complete.
36
2nd Example Revisited: Non-blocking calls
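(The code for this slide is not preserved in the transcript. Below is a hedged C sketch of the same exchange rewritten with a non-blocking receive, so both ranks can execute the identical sequence without deadlocking; it is not the original code.)

    /* Post the receive first, then send, then wait for completion. */
    double mine = 1.0, theirs = 0.0;
    int rank, partner;
    MPI_Request req;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = 1 - rank;

    MPI_Irecv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req);
    MPI_Send(&mine,    1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, &status);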
37
Non-blocking calls
  • Can use MPI_TEST in place of MPI_WAIT to
    periodically check on a message rather than
    blocking and waiting.
  • Client-server applications can use MPI_WAITANY or
    MPI_TESTANY.
  • Can peek ahead at messages with MPI_PROBE and
    MPI_IPROBE. MPI_PROBE is used in our
    client-server application.
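(For example, a polling loop with MPI_Test might look roughly like this; the request is assumed to come from an earlier MPI_Irecv.)

    /* Check for completion periodically while doing other work. */
    int done = 0;
    MPI_Status status;

    while (!done) {
        MPI_Test(&req, &done, &status);   /* non-blocking completion check */
        if (!done) {
            /* ... do some useful work before checking again ... */
        }
    }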

38
Lab exercise 3
  • Change your broken copy of swap to use
    MPI_IRECV and MPI_WAIT instead of MPI_RECV and
    try running again. Below is the syntax of these
    calls.
  • C language
  • int MPI_Irecv(void *buf, int count, MPI_Datatype
    datatype, int source, int tag,
    MPI_Comm comm, MPI_Request *request)
  • int MPI_Wait(MPI_Request *request, MPI_Status
    *status)
  • Fortran
  • <type> buf(*)
  • integer count, datatype, source, tag, comm,
    request, ierror
  • integer request, status(MPI_STATUS_SIZE), ierror
  • call MPI_WAIT(request, status, ierror)
  • call MPI_IRECV(buf, count, datatype, source, tag,
    comm, request, ierror)

39
3rd example code
  • Client-Server Application
  • Illustrates use of point-to-point communications
    calls, message tags.
  • Illustrates one of the basic paradigms of
    parallel computing - task decomposition -
    which provides a general framework for solving a wide
    range of problems.
  • Easy to understand example - multiplication of a
    vector by a matrix.
  • In next week's workshop, we'll re-implement this
    code entirely using collective communications
    calls.
  • This example uses 2 collective calls: MPI_Barrier
    and MPI_Bcast (the latter for clarity)

40
Collective Communications
MPI_Barrier(communicator, ierror) - used to
synchronize processes within a communicator.
MPI_Bcast(data components, source, communicator, ierror)
- broadcasts a copy of a piece of data to all
processes. Equivalent to N-1 sends from the processor
with the data to the remaining processes.
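(A short C sketch of these two calls; the variable and root rank are invented and MPI is assumed to be initialized.)

    /* Broadcast an integer from rank 0, then synchronize everyone. */
    int n = 0;
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) n = 42;                         /* only the root has the value */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* now every rank has n == 42  */
    MPI_Barrier(MPI_COMM_WORLD);                   /* wait for all ranks          */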
41
3rd Example: Matrix-vector Multiplication
  • Goal: Compute r = Ax.
  • r_i = Σ_j a_ij x_j
  • Can be decomposed into either a row or column
    oriented algorithm.
  • Important because different programming languages
    have different conventions for storing 2-D
    arrays.
  • C language - arrays stored in row major order -
    multiply using traditional dot product
    interpretation of matrix multiplication.
  • Fortran language - arrays stored in column major
    order - multiplication is linear combination of
    columns.
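(A rough C illustration of the two orderings; m, n, a, x and r are assumed to be declared elsewhere, e.g. double a[m][n], x[n], r[m]. C stores arrays in row-major order, so the row-oriented loop walks memory contiguously.)

    /* Row-oriented: each r[i] is the dot product of row i with x. */
    for (int i = 0; i < m; i++) {
        r[i] = 0.0;
        for (int j = 0; j < n; j++)
            r[i] += a[i][j] * x[j];
    }

    /* Column-oriented: r accumulates a linear combination of the columns
       of a, scaled by the x[j] values (the natural order for Fortran). */
    for (int i = 0; i < m; i++) r[i] = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            r[i] += a[i][j] * x[j];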

42
3rd Example: Abridged client-server.f code
  • call MPI_INIT (ierror)
  • call MPI_COMM_SIZE (MPI_COMM_WORLD,
    num_processes,ierror)
  • call MPI_COMM_RANK (MPI_COMM_WORLD, rank,
    ierror)
  • call MPI_ATTR_GET(MPI_COMM_WORLD,
    MPI_TAG_UB, NO_DATA, flag, ierror)
  • ! Set NO_DATA to the largest allowed value
    for a tag - about 2^27 under
  • ! the MPICH implementation.
  • if (rank .eq. server) then
  • ! read dimensions of A(m,n)
  • call MPI_BCAST(m, 1, MPI_INTEGER,
    server, MPI_COMM_WORLD, ierror)
  • call MPI_BCAST(n, 1, MPI_INTEGER,
    server, MPI_COMM_WORLD, ierror)
  • ! Allocate memory for arrays, only server
    keeps full A.
  • ! Allocate memory for arrays, initialize
    to zero if necessary.
  • buffer(1:m) = 0.0d0
  • ! Server - read and broadcast vector
    x(0),...,x(n)
  • read(f_ptr,*) (x(j), j=1,n)
  • call MPI_BCAST (x, n, MPI_REAL, server,
    MPI_COMM_WORLD, ierror)
  • ! read A(m,n)

43
3rd Example: Priming the pump
  • ! Here we give the compute processes their
    1st batch of data
  • ! Send out up to num_processes-1 rows to
    clients.
  • i = 1
  • do while ((i.le.n).and.(i.lt.num_processes))
  • ! Distribute a(m,n) by columns
  • alocal(1:m) = a(1:m,i)
  • call MPI_SEND (alocal, m, MPI_REAL, i,
    i,MPI_COMM_WORLD, ierror)
  • active_client_count = active_client_count + 1
  • i = i + 1
  • end do
  • ! note: At the end of the above loop
    i = min(n,num_processes).
  • ! Handle the case where there are more processes
    than rows.
  • ! Use tag = NO_DATA as message to go to
    waiting area pending finish
  • do j = i, num_processes-1, 1
  • call MPI_SEND (alocal, m, MPI_REAL, j,
    NO_DATA, MPI_COMM_WORLD, ierror)
  • end do
  • ! Note that the message is the TAG = NO_DATA,
    not alocal.

44
3rd Example: Server loop
  • do while ((active_client_count.gt.0) .or. (i.le.
    n))
  • call MPI_RECV (buffer, m,
    MPI_REAL,MPI_ANY_SOURCE,
  • MPI_ANY_TAG, MPI_COMM_WORLD, status,
    ierror)
  • if (status(MPI_TAG).ne.NO_DATA) then
  • result(1:m) = result(1:m) + buffer(1:m)
    ! Accumulate result, F90 array syntax
  • if (i.le.n) then
  • alocal(1:m) = a(1:m,i)
  • call MPI_SEND (alocal, m, MPI_REAL,
  • status(MPI_SOURCE), i, MPI_COMM_WORLD,
    ierror)
  • i = i + 1
  • else
  • ! No more data
  • active_client_count = active_client_count - 1
  • call MPI_SEND (alocal, m, MPI_REAL,
  • status(MPI_SOURCE), NO_DATA,
    MPI_COMM_WORLD, ierror)
  • endif
  • else
  • ! Handle node errors
  • endif

45
3rd Example: Client Loop
  • do i = 1, m
  • print *, result(i)
  • end do
  • else
  • ! Client side - Receive broadcasts of row, and
    column dimension
  • call MPI_BCAST (m, 1, MPI_INTEGER, server,
    MPI_COMM_WORLD, ierror)
  • call MPI_BCAST (n, 1, MPI_INTEGER, server,
    MPI_COMM_WORLD, ierror)
  • ! Allocate memory then get x values from the
    broadcast.
  • call MPI_BCAST (x, n, MPI_REAL, server,
    MPI_COMM_WORLD, ierror)
  • ! Listen for 1st message from server - decide
    if data or loop terminator.
  • call MPI_PROBE(server, MPI_ANY_TAG,
    MPI_COMM_WORLD, status, ierror)
  • do while(status(MPI_TAG).ne.NO_DATA) ! Loop
    until NO_DATA message recd.
  • call MPI_RECV (alocal,m,MPI_REAL, server,
  • status(MPI_TAG),MPI_COMM_WORLD,
    status,ierror)
  • buffer(1:m) = alocal(1:m) * x(status(MPI_TAG))
    ! multiply array by const.
  • call MPI_SEND (buffer, m, MPI_REAL, server,
  • status(MPI_TAG), MPI_COMM_WORLD,
    ierror) ! return results
  • ! Listen for the next message
  • call MPI_PROBE(server, MPI_ANY_TAG,
    MPI_COMM_WORLD,status, ierror)

46
Program Termination
  • endif
  • ! everyone waits here until all are done.
    Waiting area
  • call MPI_Barrier(MPI_COMM_WORLD, ierror)
  • call MPI_Finalize(ierror)
  • end program

47
Exercises
  • Pick an example code in a language you are
    familiar with and rewrite one of the broadcast
    operations using the equivalent sends and
    receives.
  • Located in /mpi1.
  • server_client.f - column-oriented Fortran 90
  • server_client_row2.c - row-oriented C
  • server_client_col2.c - column-oriented C
  • How would you rewrite one of the column-oriented
    example codes to do a full matrix-matrix
    multiplication? (Hint: look back at the pictures
    and think a bit about how the arrays have to be
    distributed.) What issues would you need to
    resolve?
  • Can you rewrite this without the MPI_Probe call?

48
References - MPI Tutorial
  • PACS online course
  • http://webct.ncsa.uiuc.edu:8900/
  • Edinburgh Parallel Computing Center
  • http://www.epcc.ed.ac.uk/epic/mpi/notes/mpi-course-epic.book_1.html
  • Argonne National Laboratory (MPICH)
  • http://www-unix.mcs.anl.gov/mpi/
  • MPI Forum
  • http://www.mpi-forum.org/docs/docs.html
  • MPI: The Complete Reference (vols. 1, 2)
  • Vol. 1 at http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
  • IBM (MPI on the RS/6000 (IBM SP))
  • http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts

49
References Some useful books
  • MPI: The Complete Reference
  • Marc Snir, Steve Otto, Steven Huss-Lederman,
    David Walker and Jack Dongarra, MIT Press
  • examples/mpidocs/mpi_complete_reference.ps.Z
  • Parallel Programming with MPI
  • Peter S. Pacheco, Morgan Kaufmann Publishers, Inc.
  • Using MPI: Portable Parallel Programming with the
    Message-Passing Interface
  • William Gropp, E. Lusk and A. Skjellum, MIT Press