1
An Introduction to Parallel Programming with MPI
  • March 22, 24, 29, 31, 2005
  • David Adams

2
Outline
  • Disclaimers
  • Overview of basic parallel programming on a
    cluster with the goals of MPI
  • Batch system interaction
  • Startup procedures
  • Blocking message passing
  • Non-blocking message passing
  • Collective Communications

3
Disclaimers
  • I do not have all the answers.
  • Completion of this short course will give you
    enough tools to begin making use of MPI. It will
    not automagically allow your code to run on a
    parallel machine simply by logging in.
  • Some codes are easier to parallelize than others.

4
The Goals of MPI
  • Design an application programming interface.
  • Allow efficient communication.
  • Allow for implementations that can be used in a
    heterogeneous environment.
  • Allow convenient C and Fortran 77 bindings.
  • Provide a reliable communication interface.
  • Portable.
  • Thread safe.

5
Message Passing Paradigm
6
Message Passing Paradigm
7
Message Passing Paradigm
  • Conceptually, all processors communicate through
    messages (even though some may share memory
    space).
  • Low level details of message transport are
    handled by MPI and are invisible to the user.
  • Every processor is running the same program but
    will take different logical paths determined by
    self processor identification (Who am I?).
  • Programs are written, in general, for an
    arbitrary number of processors though they may be
    more efficient on specific numbers (powers of
    2?).

8
Distributed Memory and I/O Systems
  • The cluster machines available at Virginia Tech
    are distributed-memory, distributed-I/O systems.
  • Each node (processor pair) has its own memory and
    local hard disk.
  • Allows asynchronous execution of multiple
    instruction streams.
  • Heavy disk I/O should go to the local disk
    rather than across the network, and should be
    minimized as much as possible.
  • While getting your program running, another goal
    to keep in mind is to see that it makes good use
    of the hardware available to you.
  • What does good use mean?

9
Speedup
  • The speedup achieved by a parallel algorithm
    running on p processors is the ratio between the
    time taken by that parallel computer executing
    the fastest serial algorithm and the time taken
    by the same parallel computer executing the
    parallel algorithm using p processors.
  • -Designing Efficient Algorithms for Parallel
    Computers, Michael J. Quinn

10
Speedup
  • Sometimes a fastest serial version of the code
    is unavailable.
  • The speedup of a parallel algorithm can be
    measured based on the speed of the parallel
    algorithm run serially but this gives an unfair
    advantage to the parallel code as the
    inefficiencies of making the code parallel will
    also appear in the serial version.

11
Speedup Example
  • Our really_big_code01 executes on a single
    processor in 100 hours.
  • The same code on 10 processors takes 10 hours.
  • 100 hrs. / 10 hrs. = 10 speedup.
  • When speedup = p it is called ideal (or perfect)
    speedup.
  • Speedup by itself is not very meaningful. A
    speedup of 10 may sound good (We are solving the
    problem 10 times as fast!) but what if we were
    using 1000 processors to get that number?

12
Efficiency
  • The efficiency of a parallel algorithm running on
    p processors is the speedup divided by p.
  • -Designing Efficient Algorithms for Parallel
    Computers, Michael J. Quinn
  • From our last example,
  • when p = 10 the efficiency is 10/10 = 1 (great!),
  • When p = 1000 the efficiency is 10/1000 = 0.01
    (bad!).
  • Speedup and efficiency give us an idea of how
    well our parallel code is making use of the
    available resources.

13
Concurrency
  • The first step in parallelizing any code is to
    identify the types of concurrency found in the
    problem itself (not necessarily the serial
    algorithm).
  • Many parallel algorithms show few resemblances to
    the (fastest known) serial version they are
    compared to and sometimes require an unusual
    perspective on the problem.

14
Concurrency
  • Consider the problem of finding the sum of n
    integer values.
  • A sequential algorithm may look something like
    this:
        BEGIN
          sum = A[0]
          FOR i = 1 TO n - 1 DO
            sum = sum + A[i]
          ENDFOR
        END

15
Concurrency
  • Suppose n = 4. Then the additions would be done
    in a precise order as follows:
  • ((A[0] + A[1]) + A[2]) + A[3]
  • Without any insight into the problem itself we
    might assume that the process is completely
    sequential and cannot be parallelized.
  • Of course, we know that addition is associative
    (mostly). The same expression could be written
    as
  • (A[0] + A[1]) + (A[2] + A[3])
  • By using our insight into the problem of addition
    we can exploit the inherent concurrency of the
    problem and not the algorithm.

16
Communication is Slow
  • Continuing our example of adding n integers we
    may want to parallelize the process to exploit as
    much concurrency as possible. We call on the
    services of Clovus the Parallel Guru.
  • Let n = 128.
  • Clovus divides the integers into pairs and
    distributes them to 64 processors, maximizing the
    concurrency inherent in the problem.
  • The solutions to the 64 sub-problems are
    distributed to 32 processors, those 32 results
    to 16, and so on.

17
Communication Overhead
  • Suppose it takes t units of time to perform a
    floating-point addition.
  • Suppose it takes 100t units of time to pass a
    floating-point number from one processor to
    another.
  • The entire calculation on a single processor
    would take 127t time units.
  • Using the maximum number of processors possible
    (64) Clovus finds the sum of the first set of
    pairs in 101t time units. Further steps for 32,
    16, 8, 4, and 2 follow to obtain the final
    solution.
  • Processors per step: (64) (32) (16) (8) (4) (2)
  • 101t + 101t + 101t + 101t + 101t + 101t = 606t
    total time units

18
Parallelism and Pipelining to Achieve Concurrency
  • There are two primary ways to achieve concurrency
    in an algorithm.
  • Parallelism: the use of multiple resources to
    increase concurrency (partitioning).
    Example: our summation problem.
  • Pipelining: dividing the computation into a
    number of steps that are repeated throughout the
    algorithm; an ordered set of segments in which
    the output of each segment is the input of its
    successor. Example: an automobile assembly line.

19
Examples (Jacobi style update)
  • Imagine we have a cellular automaton that we want
    to parallelize.

20
Examples
  • We try to distribute the rows evenly between two
    processors.

21
Examples
  • Columns seem to work better for this problem.

22
Examples
  • Minimizing communication.

23
Examples (Gauss-Seidel style update)
  • Emulating a serial Gauss-Seidel update style with
    a pipe.

39
Batch System Interaction
  • Both Anantham (400 processors) and System X
    (2200 processors) will normally operate in batch
    mode.
  • Jobs are not interactive.
  • Multi-user etiquette is enforced by a job
    scheduler and queuing system.
  • Users will submit jobs using a script file built
    by the administrator and modified by the user.

40
PBS (Portable Batch System) Submission Script
    #!/bin/bash
    #!
    #! Example of job file to submit parallel MPI applications.
    #! Lines starting with #PBS are options for the qsub command.
    #! Lines starting with #! are comments
    #! Set queue (production queue --- the only one right now) and
    #! the number of nodes.
    #! In this case we require 10 nodes from the entire set ("all").
    #PBS -q prod_q
    #PBS -l nodes=10:all

41
PBS Submission Script
    #! Set time limit.
    #! The default is 30 minutes of cpu time.
    #! Here we ask for up to 1 hour.
    #! (Note that this is total cpu time, e.g., 10 minutes on
    #! each of 4 processors is 40 minutes)
    #! Hours:minutes:seconds
    #PBS -l cput=01:00:00
    #! Name of output files for std output and error
    #! Defaults are <job-name>.o<job-number> and <job-name>.e<job-number>
    #!PBS -e ZCA.err
    #!PBS -o ZCA.log

42
PBS Submission Script
    #! Mail to user when job terminates or aborts
    #! #PBS -m ae
    #! Change the working directory (default is home directory)
    cd $PBS_O_WORKDIR
    #! Run the parallel MPI executable (change the default a.out)
    #! (Note: omit "-kill" if you are running a 1 node job)
    /usr/local/bin/mpiexec -kill a.out

43
Common Scheduler Commands
  • qsub <script file name>
  • Submits your script file for scheduling. It is
    immediately checked for validity and if it passes
    the check you will get a message that your job
    has been added to the queue.
  • qstat
  • Displays information on jobs waiting in the queue
    and jobs that are running, including how much
    time they have left and how many processors they
    are using.
  • Each job acquires a unique job_id that can be used
    to communicate with a job that is already running
    (perhaps to kill it).
  • qdel <job_id>
  • If for some reason you have a job that you need
    to remove from the queue, this command will do
    it. It will also kill a job in progress.
  • You can, of course, delete only your own jobs.

44
MPI Data Types
  • MPI thinks of every message as a starting point
    in memory and some measure of length along with a
    possible interpretation of the data.
  • The direct measure of length (number of bytes) is
    hidden from the user through the use of MPI data
    types.
  • Each language binding (C and Fortran 77) has its
    own list of MPI types that are intended to
    increase portability as the length of these types
    can change from machine to machine.
  • Interpretations of data can change from machine
    to machine in heterogeneous clusters (Macs and
    PCs in the same cluster for example).

45
MPI types in C
  • MPI_CHAR            signed char
  • MPI_SHORT           signed short int
  • MPI_INT             signed int
  • MPI_LONG            signed long int
  • MPI_UNSIGNED_CHAR   unsigned char
  • MPI_UNSIGNED_SHORT  unsigned short int
  • MPI_UNSIGNED        unsigned int
  • MPI_UNSIGNED_LONG   unsigned long int
  • MPI_FLOAT           float
  • MPI_DOUBLE          double
  • MPI_LONG_DOUBLE     long double
  • MPI_BYTE
  • MPI_PACKED

46
MPI Types in Fortran 77
  • MPI_INTEGER           INTEGER
  • MPI_REAL              REAL
  • MPI_DOUBLE_PRECISION  DOUBLE PRECISION
  • MPI_COMPLEX           COMPLEX
  • MPI_LOGICAL           LOGICAL
  • MPI_CHARACTER         CHARACTER(1)
  • MPI_BYTE
  • MPI_PACKED
  • Caution: Fortran 90 does not always store arrays
    contiguously.

47
Functions Appearing in all MPI Programs (Fortran 77)
  • MPI_INIT(IERROR)
  • INTEGER IERROR
  • Must be called before any other MPI routine.
  • Can be visualized as the point in the code where
    every processor obtains its own copy of the
    program and continues to execute, though this may
    happen earlier.

48
Functions Appearing in all MPI Programs (Fortran 77)
  • MPI_FINALIZE(IERROR)
  • INTEGER IERROR
  • This routine cleans up all MPI state.
  • Once this routine is called no MPI routine may be
    called.
  • It is the user's responsibility to ensure that ALL
    pending communications involving a process
    complete before the process calls MPI_FINALIZE.

49
Typical Startup Functions
  • MPI_COMM_SIZE(COMM, SIZE, IERROR)
  • IN INTEGER COMM
  • OUT INTEGER SIZE, IERROR
  • Returns the size of the group associated with the
    communicator COMM.
  • What's a communicator?

50
Communicators
  • A communicator is an integer that tells MPI what
    communication domain it is in.
  • There is a special communicator that exists in
    every MPI program called MPI_COMM_WORLD.
  • MPI_COMM_WORLD can be thought of as the superset
    of all communication domains. Every processor
    requested by your initial script is a member of
    MPI_COMM_WORLD.

51
Typical Startup Functions
  • MPI_COMM_SIZE(COMM, SIZE, IERROR)
  • IN INTEGER COMM
  • OUT INTEGER SIZE, IERROR
  • Returns the size of the group associated with the
    communicator COMM.
  • A typical program contains the following command
    as one of the very first MPI calls to provide the
    code with the number of processors it has
    available for this execution. (Step one of self
    identification).
  • CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size_p, ierr_p)

52
Typical Startup Functions
  • MPI_COMM_RANK(COMM, RANK, IERROR)
  • IN INTEGER COMM
  • OUT INTEGER RANK, IERROR
  • Indicates the rank of the process that calls it
    in the range from 0..size-1, where size is the
    return value of MPI_COMM_SIZE.
  • This rank is relative to the communication domain
    specified by the communicator COMM.
  • For MPI_COMM_WORLD, this function will return the
    absolute rank of the process, a unique
    identifier. (Step 2 of self identification).
  • CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank_p, ierr_p)

53
Startup Variables
  • SIZE: INTEGER size_p
  • RANK: INTEGER rank_p
  • STATUS (more on this guy later):
    INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status_p
  • IERROR (Fortran 77): INTEGER ierr_p

54
Hello World (Fortran 90)
    PROGRAM Hello_World
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: ierr_p, rank_p, size_p
      INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status_p

      ! Start MPI, then find out who we are and how many processes exist.
      CALL MPI_INIT(ierr_p)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank_p, ierr_p)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size_p, ierr_p)

      IF (rank_p == 0) THEN
        WRITE(*,*) 'Hello world! I am process 0 and I am special!'
      ELSE
        WRITE(*,*) 'Hello world! I am process ', rank_p
      END IF

      ! Shut MPI down; no MPI calls are allowed after this point.
      CALL MPI_FINALIZE(ierr_p)
    END PROGRAM Hello_World

55
Hello World (C)
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int node;

        /* Start MPI and ask for this process's rank in MPI_COMM_WORLD. */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &node);

        if (node == 0)
            printf("Hello world! I am C process 0 and I am special!\n");
        else
            printf("Hello world! I am C process %d\n", node);

        /* Shut MPI down; no MPI calls are allowed after this point. */
        MPI_Finalize();
        return 0;
    }