1
Parallel Programming With MPI
  • Self Test

2
Message Passing Fundamentals
3
Self Test
  • A shared memory computer has access to
  • the memory of other nodes via a proprietary
    high-speed communications network
  • a directives-based data-parallel language
  • a global memory space
  • communication time

4
Self Test
  • A domain decomposition strategy turns out not to
    be the most efficient algorithm for a parallel
    program when
  • data can be divided into pieces of approximately
    the same size.
  • the pieces of data assigned to the different
    processes require greatly different lengths of
    time to process.
  • one needs the advantage of maintaining a single
    flow of control.
  • one must parallelize a finite differencing
    scheme.

5
Self Test
  • In the message passing approach
  • serial code is made parallel by adding directives
    that tell the compiler how to distribute data and
    work across the processors.
  • details of how data distribution, computation,
    and communications are to be done are left to the
    compiler.
  • the approach is not very flexible.
  • it is left up to the programmer to explicitly
    divide data and work across the processors as
    well as manage the communications among them.

6
Self Test
  • Total execution time does not involve
  • computation time.
  • compiling time.
  • communications time.
  • idle time.

7
Self Test
  • One can minimize idle time by
  • occupying a process with one or more new tasks
    while it waits for communication to finish so it
    can proceed on another task.
  • always using nonblocking communications.
  • never using nonblocking communications.
  • frequent use of barriers.

8
Matching Question
  • When each node has rapid access to its own local
    memory and access to the memory of other nodes
    via some sort of communications network.
  • When multiple processor units share access to a
    global memory space via a high speed memory bus.
  • Data are divided into pieces of approximately the
    same size, and then mapped to different
    processors.
  • The problem is decomposed into a large number of
    smaller tasks and then the tasks are assigned to
    the processors as they become available.
  • Serial code is made parallel by adding directives
    that tell the compiler how to distribute data and
    work across the processors.
  • The programmer explicitly divides data and work
    across the processors as well as managing the
    communications among them.
  • Dividing the work equally among the available
    processes.
  • The time spent performing computations on the
    data.
  • The time a process spends waiting for data from
    other processors.
  • The time for processes to send and receive
    messages
  • Message passing
  • Domain decomposition
  • Idle time
  • Load balancing
  • Directives-based data parallel language
  • Distributed memory
  • Shared memory
  • Computation time
  • Functional decomposition
  • Communication time

9
Course Problem
  • The initial problem implements a parallel search
    of an extremely large (several thousand elements)
    integer array. The program finds all occurrences
    of a certain integer, called the target, and
    writes all the array indices where the target was
    found to an output file. In addition, the program
    reads both the target value and all the array
    elements from an input file.
  • Using these concepts of parallel programming,
    write a description of a parallel approach to
    solving the problem described above. (No coding
    is required for this exercise.)

10
Getting Started with MPI
11
Self Test
  • Is a blocking send necessarily also synchronous?
  • Yes
  • No

12
Self Test
  • Consider the following fragment of MPI
    pseudo-code
  • ...
  • x = fun(y)
  • MPI_SOME_SEND(the value of x to some other
    processor)
  • x = fun(z)
  • ...
  • where MPI_SOME_SEND is a generic send routine. In
    this case, it would be best to use
  • A blocking send
  • A nonblocking send
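
For reference, a minimal C sketch of the two choices, under the assumption that x is a single double being sent from rank 1 to rank 0 and that fun() is a simple stand-in: with a blocking MPI_Send the buffer may be reused as soon as the call returns, while a nonblocking MPI_Isend must be completed with MPI_Wait before x is overwritten by fun(z).

    #include <mpi.h>

    /* hypothetical stand-in for the fun() of the fragment above */
    static double fun(double v) { return 2.0 * v; }

    int main(int argc, char **argv)
    {
        double x, y = 1.0, z = 2.0, recv;
        int rank;
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            /* blocking variant: x may be overwritten as soon as MPI_Send returns */
            x = fun(y);
            MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            x = fun(z);

            /* nonblocking variant: complete the send before reusing x */
            x = fun(y);
            MPI_Isend(&x, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* wait before overwriting x */
            x = fun(z);
        } else if (rank == 0) {
            /* matching receives for the two sends above */
            MPI_Recv(&recv, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
            MPI_Recv(&recv, 1, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }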

13
Self Test
  • Which of the following is true for all send
    routines?
  • It is always safe to overwrite the sent
    variable(s) on the sending processor after the
    send returns.
  • Completion implies that the message has been
    received at its destination.
  • It is always safe to overwrite the sent
    variable(s) on the sending processor after the
    send is complete.
  • All of the above.
  • None of the above.

14
Matching Question
  • Point-to-point communication
  • Collective communication
  • Communication mode
  • Blocking send
  • Synchronous send
  • Broadcast
  • Scatter
  • Gather
  • A send routine that does not return until it is
    complete
  • Communication involving one or more groups of
    processes
  • A send routine that is not complete until receipt
    of the message at its destination has been
    acknowledged
  • An operation in which one process sends the same
    data to several others
  • Communication involving a single pair of
    processes
  • An operation in which one process distributes
    different elements of a local array to several
    others
  • An operation in which one process collects data
    from several others and assembles them in a local
    array
  • Specification of the method of operation and
    completion criteria for a communication routine

15
Course Problem
  • The initial problem implements a parallel search
    of an extremely large (several thousand elements)
    integer array. The program finds all occurrences
    of a certain integer, called the target, and
    writes all the array indices where the target was
    found to an output file. In addition, the program
    reads both the target value and all the array
    elements from an input file.
  • Before writing a parallel version of a program,
    first write a serial version (that is, a version
    that runs on one processor). That is the task for
    this chapter. You can use C/C++ and should
    confirm that the program works by using a test
    input array. (One possible serial version is
    sketched below for reference.)
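
The sketch below assumes the input file is named "b.data", the output file "found.data", that the target value appears before the array elements, and an arbitrary upper bound on the array size; Chapter 2 leaves these details open.

    #include <stdio.h>

    #define MAX_ELEMENTS 100000   /* assumed upper bound on the array size */

    int main(void)
    {
        int b[MAX_ELEMENTS];
        int target, n = 0, i;
        FILE *in  = fopen("b.data", "r");     /* assumed input file name  */
        FILE *out = fopen("found.data", "w"); /* assumed output file name */

        if (in == NULL || out == NULL)
            return 1;

        fscanf(in, "%d", &target);            /* first value: the target  */
        while (n < MAX_ELEMENTS && fscanf(in, "%d", &b[n]) == 1)
            n++;                              /* remaining values: array  */

        for (i = 0; i < n; i++)               /* record every match       */
            if (b[i] == target)
                fprintf(out, "%d\n", i);

        fclose(in);
        fclose(out);
        return 0;
    }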

16
MPI Program Structure
17
Self Test
  • How would you modify "Hello World" so that only
    even-numbered processors print the greeting
    message?
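
One possible answer, as a hedged sketch: obtain the rank with MPI_Comm_rank and gate the print on its parity. The greeting text itself is an assumption.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank % 2 == 0)                    /* only even-numbered ranks print */
            printf("Hello from even-numbered processor %d\n", rank);

        MPI_Finalize();
        return 0;
    }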

18
Self Test
  • Consider the following MPI pseudo-code, which
    sends a piece of data from processor 1 to
    processor 2
  • MPI_INIT()
  • MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  • if (myrank == 1)
  • MPI_SEND (some data to processor 2 in
    MPI_COMM_WORLD)
  • else
  • MPI_RECV (data from processor 1 in
    MPI_COMM_WORLD)
  • print "Message received!"
  • MPI_FINALIZE()
  • where MPI_SEND and MPI_RECV are blocking send and
    receive routines. Thus, for example, a process
    encountering the MPI_RECV statement will block
    while waiting for a message from processor 1.
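
A concrete C rendering of this pseudo-code may make the next two questions easier to reason about; the payload and message size are assumptions, but the control flow matches the slide.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int myrank, data = 42;                /* assumed payload */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        if (myrank == 1) {
            /* blocking send of the data to processor 2 */
            MPI_Send(&data, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
        } else {
            /* every other rank blocks here until a message from rank 1 arrives */
            MPI_Recv(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("Message received!\n");
        }

        MPI_Finalize();
        return 0;
    }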

19
Self Test
  • If this code is run on a single processor, what
    do you expect to happen?
  • The code will print "Message received!" and then
    terminate.
  • The code will terminate normally with no output.
  • The code will hang with no output.
  • An error condition will result.

20
Self Test
  • If the code is run on three processors, what do
    you expect?
  • The code will terminate after printing "Message
    received!".
  • The code will hang with no output.
  • The code will hang after printing "Message
    received!".
  • The code will give an error message and exit
    (possibly leaving a core file).

21
Self Test
  • Consider an MPI code running on four processors,
    denoted A, B, C, and D. In the default
    communicator MPI_COMM_WORLD their ranks are 0-3,
    respectively. Assume that we have defined another
    communicator, called USER_COMM, consisting of
    processors B and D. Which one of the following
    statements about USER_COMM is always true?
  • Processors B and D have ranks 1 and 3,
    respectively.
  • Processors B and D have ranks 0 and 1,
    respectively.
  • Processors B and D have ranks 1 and 3, but which
    has which is in general undefined.
  • Processors B and D have ranks 0 and 1, but which
    has which is in general undefined.

22
Course Problem
  • Description
  • The initial problem implements a parallel search
    of an extremely large (several thousand elements)
    integer array. The program finds all occurrences
    of a certain integer, called the target, and
    writes all the array indices where the target was
    found to an output file. In addition, the program
    reads both the target value and all the array
    elements from an input file.
  • Exercise
  • You now have enough knowledge to write
    pseudo-code for the parallel search algorithm
    introduced in Chapter 1. In the pseudo-code, you
    should correctly initialize MPI, have each
    processor determine and use its rank, and
    terminate MPI. By tradition, the Master processor
    has rank 0. Assume in your pseudo-code that the
    real code will be run on 4 processors.
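
A minimal skeleton of the MPI housekeeping the exercise asks for is sketched below; the search and I/O steps are deliberately left as comments, since filling them in is the reader's task.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;

        MPI_Init(&argc, &argv);                      /* initialize MPI        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* this process's rank   */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);      /* expected to be 4 here */

        if (rank == 0) {
            /* master: read the target and the array, hand out work,
               collect the indices found by the slaves, write the output file */
        } else {
            /* slaves: receive a subarray, search it for the target,
               and send any matching indices back to rank 0 */
        }

        MPI_Finalize();                              /* terminate MPI         */
        return 0;
    }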

23
Point-to-Point Communication
24
Self Test
  • MPI_SEND is used to send an array of 10 4-byte
    integers. At the time MPI_SEND is called, MPI has
    over 50 Kbytes of internal message buffer free on
    the sending process. Choose the best answer.
  • This is a blocking send. Most MPI implementations
    will copy the message into MPI internal message
    buffer and return.
  • This is a blocking send. Most MPI implementations
    will block the sending process until the
    destination process has received the message.
  • This is a non-blocking send. Most MPI
    implementations will copy the message into MPI
    internal message buffer and return.

25
Self Test
  • MPI_SEND is used to send an array of 100,000
    8-byte reals. At the time MPI_SEND is called, MPI
    has less than 50 Kbytes of internal message
    buffer free on the sending process. Choose the
    best answer.
  • This is a blocking send. Most MPI implementations
    will block the calling process until enough
    message buffer becomes available.
  • This is a blocking send. Most MPI implementations
    will block the sending process until the
    destination process has received the message.

26
Self Test
  • MPI_SEND is used to send a large array. When
    MPI_SEND returns, the programmer may safely
    assume
  • The destination process has received the message.
  • The array has been copied into MPI internal
    message buffer.
  • Either the destination process has received the
    message or the array has been copied into MPI
    internal message buffer.
  • None of the above.

27
Self Test
  • MPI_ISEND is used to send an array of 10 4-byte
    integers. At the time MPI_ISEND is called, MPI
    has over 50 Kbytes of internal message buffer
    free on the sending process. Choose the best
    answer.
  • This is a non-blocking send. MPI will generate a
    request id and then return.
  • This a non-blocking send. Most MPI
    implementations will copy the message into MPI
    internal message buffer and return.

28
Self Test
  • MPI_ISEND is used to send an array of 10 4-byte
    integers. At the time MPI_ISEND is called, MPI
    has over 50 Kbytes of internal message buffer
    free on the sending process. After calling
    MPI_ISEND, the sending process calls MPI_WAIT to
    wait for completion of the send operation. Choose
    the best answer.
  • MPI_Wait will not return until the destination
    process has received the message.
  • MPI_WAIT may return before the destination
    process has received the message.
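
The pattern this question describes, as a sketch (the destination rank, tag, and buffer contents are assumptions): MPI_Isend returns immediately with a request handle, and MPI_Wait later completes the send, after which the buffer may be reused.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf[10] = {0};                 /* ten 4-byte integers */
        MPI_Request req;
        MPI_Status  status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            /* start the nonblocking send to rank 0 and get a request handle */
            MPI_Isend(buf, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

            /* ... work that does not touch buf could overlap here ... */

            /* completion: buf may be reused after this, but completion alone
               does not guarantee the message has reached the destination */
            MPI_Wait(&req, &status);
        } else if (rank == 0) {
            MPI_Recv(buf, 10, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }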

29
Self Test
  • MPI_ISEND is used to send an array of 100,000
    8-byte reals. At the time MPI_ISEND is called,
    MPI has less than 50 Kbytes of internal message
    buffer free on the sending process. Choose the
    best answer.
  • This is a non-blocking send. MPI will generate a
    request id and return.
  • This is a blocking send. Most MPI implementations
    will block the sending process until the
    destination process has received the message.

30
Self Test
  • MPI_ISEND is used to send an array of 100,000
    8-byte reals. At the time MPI_ISEND is called,
    MPI has less than 50 Kbytes of internal message
    buffer free on the sending process. After calling
    MPI_ISEND, the sending process calls MPI_WAIT to
    wait for completion of the send operation. Choose
    the best answer.
  • This is a blocking send. In most implementations,
    MPI_WAIT will not return until the destination
    process has received the message.
  • This is a non-blocking send. In most
    implementations, MPI_Wait will not return until
    the destination process has received the message.
  • This is a non-blocking send. In most
    implementations, MPI_WAIT will return before the
    destination process has received the message.

31
Self Test
  • Assume the only communicator used in this problem
    is MPI_COMM_WORLD. After calling MPI_INIT,
    process 1 immediately sends two messages to
    process 0. The first message sent has tag 100,
    and the second message sent has tag 200. After
    calling MPI_INIT and verifying there are at least
    2 processes in MPI_COMM_WORLD, process 0 calls
    MPI_RECV with the source argument set to 1 and
    the tag argument set to MPI_ANY_TAG. Choose the
    best answer.
  • The tag of the first message received by process
    0 is 100.
  • The tag of the first message received by process
    0 is 200.
  • The tag of the first message received by process
    0 is either 100 or 200.
  • Based on the information given, one cannot safely
    assume any of the above.
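
A sketch of the two-message scenario used in this question and the next; the payload is an assumption. Checking status.MPI_TAG after each receive shows which message was matched.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0;                      /* assumed payload */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            MPI_Send(&buf, 1, MPI_INT, 0, 100, MPI_COMM_WORLD);  /* tag 100 sent first  */
            MPI_Send(&buf, 1, MPI_INT, 0, 200, MPI_COMM_WORLD);  /* tag 200 sent second */
        } else if (rank == 0) {
            /* accept any tag from rank 1, then inspect which one was matched */
            MPI_Recv(&buf, 1, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            printf("first receive matched tag %d\n", status.MPI_TAG);

            /* a receive posted with the explicit tag 200 instead would only
               match the message carrying that tag */
            MPI_Recv(&buf, 1, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }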

32
Self Test
  • Assume the only communicator used in this problem
    is MPI_COMM_WORLD. After calling MPI_INIT,
    process 1 immediately sends two messages to
    process 0. The first message sent has tag 100,
    and the second message sent has tag 200. After
    calling MPI_INIT and verifying there are at least
    2 processes in MPI_COMM_WORLD, process 0 calls
    MPI_RECV with the source argument set to 1 and
    the tag argument set to 200. Choose the best
    answer.
  • Process 0 is deadlocked, since it attempted to
    receive the second message before receiving the
    first.
  • Process 0 receives the second message sent by
    process 1, even though the first message has not
    yet been received.
  • None of the above.

33
Course Problem
  • Description
  • The initial problem implements a parallel search
    of an extremely large (several thousand elements)
    integer array. The program finds all occurrences
    of a certain integer, called the target, and
    writes all the array indices where the target was
    found to an output file. In addition, the program
    reads both the target value and all the array
    elements from an input file.
  • Exercise
  • Go ahead and write the real parallel code for the
    search problem! Using the pseudo-code from the
    previous chapter as a guide, fill in all the
    sends and receives with calls to the actual MPI
    send and receive routines. For this task, use
    only the blocking routines. If you have access to
    a parallel computer with the MPI library
    installed, run your parallel code using 4
    processors. See if you get the same results as
    those obtained with the serial version of Chapter
    2. Of course, you should.

34
Derived Datatypes and Related Features
35
Self Test
  • You are writing a parallel program to be run on
    100 processors. Each processor is working with
    only one section of a skeleton outline of a 3-D
    model of a house. In the course of constructing
    the model house each processor often has to send
    the three cartesian coordinates (x,y,z) of nodes
    that are to be used to make the boundaries
    between the house sections. Each coordinate will
    be a real value.
  • Why would it be advantageous for you to define a
    new data type called Point which contained the
    three coordinates?

36
Self Test
  • My program will be more readable and self
    commenting.
  • Since many x,y,z values will be used in MPI
    communication routines they can be sent as a
    single Point type entity instead of packing and
    unpacking three reals each time.
  • Since all three values are real, there is no
    purpose in making a derived data type.
  • It would be impossible to use MPI_Pack and
    MPI_Unpack to send the three real values.

37
Self Test
  • What is the simplest MPI derived datatype
    creation function to make the Point datatype
    described in problem 1? (Three of the answers
    given can actually be used to construct the type,
    but one is the simplest).
  • MPI_TYPE_CONTIGUOUS
  • MPI_TYPE_VECTOR
  • MPI_TYPE_STRUCT
  • MPI_TYPE_COMMIT

38
Self Test
  • The C syntax for the MPI_TYPE_CONTIGUOUS
    subroutine is
  • MPI_Type_contiguous (count, oldtype, newtype)
  • The argument names should be fairly
    self-explanatory, but if you want their exact
    definitions you can look them up at the MPI home
    page.
  • For the derived datatype we have been discussing
    in the previous problems, what would be the
    values for the count, oldtype, and newtype
    arguments respectively?
  • 2, MPI_REAL, Point
  • 3, REAL, Point
  • 3, MPI_INTEGER, Coord
  • 3, MPI_REAL, Point
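
A hedged sketch of building the Point type discussed in these questions: three contiguous reals, committed before use. MPI_FLOAT is assumed for the coordinates (MPI_DOUBLE would work the same way).

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Datatype Point;
        float node[3] = {0.0f, 0.0f, 0.0f};     /* x, y, z of one boundary node */

        MPI_Init(&argc, &argv);

        MPI_Type_contiguous(3, MPI_FLOAT, &Point);  /* count, oldtype, newtype */
        MPI_Type_commit(&Point);                    /* must commit before use  */

        /* the three coordinates can now travel as one entity, e.g.
           MPI_Send(node, 1, Point, dest, tag, MPI_COMM_WORLD); */

        MPI_Type_free(&Point);
        MPI_Finalize();
        return 0;
    }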

39
Self Test
  • In Section Using MPI Derived Types for
    User-Defined Types, the code for creating the
    derived data type MPI_SparseElt is shown. Using
    the MPI_Type_extent function, determine the size (in
    bytes) of a variable of type MPI_SparseElt. You
    should probably modify the code found in that
    section, compile and run it to get the answer.

40
Course Problem
  • Description
  • The new problem still implements a parallel
    search of an integer array.
  • The program should find all occurrences of a
    certain integer which will be called the target.
  • It should then calculate the average of the
    target value and its index.
  • Both the target location and the average should
    be written to an output file.
  • In addition, the program should read both the
    target value and all the array elements from an
    input file.

41
Course Problem
  • Exercise
  • Modify your code from Chapter 4 to create a
    program that solves the new Course Problem.
  • Use the techniques/routines of this chapter to
    make a new derived type called MPI_PAIR that will
    contain both the target location and the average.
  • All of the slave sends and the master receives
    must use the MPI_Pair type.
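
Since the pair mixes an integer location with a real average, MPI_Type_contiguous is not enough; a struct type is needed. The sketch below uses MPI_Type_create_struct (the modern name for the MPI_Type_struct routine listed in this chapter) and a hypothetical host struct; keeping the course's name MPI_PAIR is for readability only, as the MPI_ prefix is normally reserved for the library.

    #include <stddef.h>   /* offsetof */
    #include <mpi.h>

    /* hypothetical host struct for one result pair */
    typedef struct { int location; float average; } Pair;

    int main(int argc, char **argv)
    {
        MPI_Datatype MPI_PAIR;
        int          blocklens[2] = { 1, 1 };
        MPI_Aint     displs[2]    = { offsetof(Pair, location),
                                      offsetof(Pair, average) };
        MPI_Datatype types[2]     = { MPI_INT, MPI_FLOAT };

        MPI_Init(&argc, &argv);

        MPI_Type_create_struct(2, blocklens, displs, types, &MPI_PAIR);
        MPI_Type_commit(&MPI_PAIR);

        /* slave:  MPI_Send(&pair, 1, MPI_PAIR, 0, tag, MPI_COMM_WORLD);
           master: MPI_Recv(&pair, 1, MPI_PAIR, src, tag, MPI_COMM_WORLD, &status); */

        MPI_Type_free(&MPI_PAIR);
        MPI_Finalize();
        return 0;
    }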

42
Collective Communications
43
Self Test
  • We want to do a simple broadcast of variable
    abc[7] in processor 0 to the same location in all
    other processors of the communicator. What is the
    correct syntax of the call to do this broadcast?
  • MPI_Bcast( &abc[7], 1, MPI_REAL, 0, comm)
  • MPI_Bcast( abc, 7, MPI_REAL, 0, comm)
  • MPI_Broadcast( &abc[7], 1, MPI_REAL, 0, comm)

44
Self Test
  • Each processor has a local array a with 50
    elements. Each local array is a slice of a larger
    global array. We wish to compute the average
    (mean) of all elements in the global array. Our
    preferred approach is to add all of the data
    element-by-element onto the root processor, sum
    elements of the resulting array, divide, and
    broadcast the result to all processes. Which
    sequence of calls will accomplish this? Assume
    variables are typed and initialized appropriately.

45
Self Test
  • start = 0
  • final = 49
  • count = final - start + 1
  • mysum = 0
  • for (i = start; i <= final; i++) mysum += a[i]
  • MPI_Reduce ( &mysum, &sum, 1, MPI_REAL,
  • MPI_SUM, root, comm )
  • total_count = nprocs * count
  • if ( my_rank == root ) average = sum / total_count
  • MPI_Bcast ( &average, 1, MPI_REAL, root, comm )

46
Self Test
  • start = 0
  • final = 49
  • count = final - start + 1
  • MPI_Reduce ( a, sum_array, count,
  • MPI_REAL, MPI_SUM, root, comm )
  • sum = 0
  • for (i = start; i <= final; i++) sum +=
    sum_array[i]
  • total_count = nprocs * count
  • if ( my_rank == root ) average = sum / total_count
  • MPI_Bcast ( &average, 1, MPI_REAL, root, comm )

47
Self Test
  • start = 0
  • final = 49
  • count = final - start + 1
  • mysum = 0
  • for (i = start; i <= final; i++) mysum += a[i]
  • my_average = mysum / count
  • MPI_Reduce ( &my_average, &sum, 1, MPI_REAL,
  • MPI_SUM, root, comm )
  • if ( my_rank == root ) average = sum / nprocs
  • MPI_Bcast ( &average, 1, MPI_REAL, root, comm )

48
Self Test
  • Consider a communicator with 4 processes. How
    many total MPI_Send()'s and MPI_Recv()'s would be
    required to accomplish the following
  • MPI_Allreduce ( &a, &x, 1, MPI_REAL, MPI_SUM,
    comm )
  • 3
  • 4
  • 12
  • 16

49
Course Problem
  • Description
  • The new problem still implements a parallel
    search of an integer array. The program should
    find all occurrences of a certain integer which
    will be called the target. It should then
    calculate the average of the target value and its
    index. Both the target location and the average
    should be written to an output file. In addition,
    the program should read both the target value and
    all the array elements from an input file.

50
Course Problem
  • Exercise
  • Modify your code from Chapter 5, to change how
    the master first sends out the target and
    subarray data to the slaves.
  • Use the MPI broadcast routines to give each slave
    the target.
  • Use the MPI scatter routine to give all
    processors a section of the array b it will
    search.
  • When you use the standard MPI scatter routine you
    will see that the global array b is now split up
    into four parts and the master process now has
    the first fourth of the array to search.
  • So you should add a search loop (similar to the
    slaves') in the master section of code to search
    for the target and calculate the average and then
    write the result to the output file.
  • This is actually an improvement in performance
    since all the processors perform part of the
    search in parallel.
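
A hedged sketch of the distribution step described in this exercise. The array length N, its even divisibility by the number of processes, and the variable names are assumptions carried over from the text.

    #include <mpi.h>

    #define N 300                      /* assumed global array length */

    int main(int argc, char **argv)
    {
        int rank, nprocs, target = 0;
        int b[N];                      /* global array, filled only on rank 0 */
        int a[N];                      /* local slice; N/nprocs entries used  */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {
            /* ... the master reads the target and the array b from the input file ... */
        }

        /* every process, master included, gets the target and one slice of b */
        MPI_Bcast(&target, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Scatter(b, N / nprocs, MPI_INT,
                    a, N / nprocs, MPI_INT, 0, MPI_COMM_WORLD);

        /* ... each rank, master included, now searches its N/nprocs elements ... */

        MPI_Finalize();
        return 0;
    }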

51
Communicators
52
Self Test
  • MPI_Comm_group may be used to
  • create a new group.
  • determine group handle of a communicator.
  • create a new communicator.

53
Self Test
  • MPI_Group_incl is used to select processes of
    old_group to form new_group. If the selection
    includes process(es) not in old_group, it would
    cause the MPI job
  • to print error messages and abort the job.
  • to print error messages but execution continues.
  • to continue with warning messages.

54
Self Test
  • Assuming that a calling process belongs to the
    new_group of MPI_Group_excl(old_group, count,
    nonmembers, new_group), if nonmembers' order were
    altered,
  • the corresponding rank of the calling process in
    the new_group would change.
  • the corresponding rank of the calling process in
    the new_group remain unchanged.
  • the corresponding rank of the calling process in
    the new_group might or might not change.

55
Self Test
  • In MPI_Group_excl(old_group, count, nonmembers,
    new_group), if count = 0, then
  • new_group is identical to old_group.
  • new_group has no members.
  • error results.

56
Self Test
  • In MPI_Group_excl(old_group, count, nonmembers,
    new_group) if the nonmembers array is not unique
    (i.e., one or more entries of nonmembers point
    to the same rank in old_group), then
  • MPI ignores the repetition.
  • error results.
  • it returns MPI_GROUP_EMPTY.

57
Self Test
  • MPI_Group_rank is used to query the calling
    process' rank number in group. If the calling
    process does not belong to the group, then
  • error results.
  • the returned group rank has a value of -1,
    indicating that the calling process is not a
    member of the group.
  • the returned group rank is MPI_UNDEFINED.

58
Self Test
  • In MPI_Comm_split, if two processes of the same
    color are assigned the same key, then
  • error results.
  • their rank numbers in the new communicator are
    ordered according to their relative rank order in
    the old communicator.
  • they both share the same rank in the new
    communicator.

59
Self Test
  • MPI_Comm_split(old_comm, color, key, new_comm) is
    equivalent to MPI_Comm_create(old_comm, group,
    new_comm) when
  • color = Iam, key = 0 if calling process Iam belongs
    to group; ELSE color = MPI_UNDEFINED for all other
    processes in old_comm.
  • color = 0, key = Iam if calling process Iam belongs
    to group; ELSE color = MPI_UNDEFINED for all other
    processes in old_comm.
  • color = 0, key = 0
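
A sketch of MPI_Comm_split for reference; splitting by rank parity is an arbitrary choice for illustration. color selects the subgroup and key orders the ranks inside it, with ties broken by the old relative order.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int world_rank, new_rank;
        MPI_Comm new_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* even ranks form one communicator, odd ranks the other */
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &new_comm);

        MPI_Comm_rank(new_comm, &new_rank);
        printf("world rank %d -> new rank %d\n", world_rank, new_rank);

        MPI_Comm_free(&new_comm);
        MPI_Finalize();
        return 0;
    }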

60
Course Problem
  • For this chapter, a new version of the Course
    Problem is presented in which the average value
    each processor calculates when a target location
    is found, is calculated in a different manner.
  • Specifically, the average will be calculated from
    the "neighbor" values of the target.
  • This is a classic style of programming (called
    calculations with a stencil) used at important
    array locations. Stencil calculations are used in
    many applications including numerical solutions
    of differential equations and image processing,
    to name two.
  • This new Course Problem will also entail more
    message passing between the searching processors
    because in order to calculate the average they
    will have to get values of the global array they
    do not have in their subarray.

61
Description
  • Our new problem still implements a parallel
    search of an integer array. The program should
    find all occurrences of a certain integer which
    will be called the target. When a processor of a
    certain rank finds a target location, it should
    then calculate the average of
  • The target value
  • An element from the processor with rank one
    higher (the "right" processor). The right
    processor should send the first element from its
    local array.
  • An element from the processor with rank one less
    (the "left" processor). The left processor should
    send the first element from its local array.

62
Description
  • For example, if processor 1 finds the target at
    index 33 in its local array, it should get from
    processors 0 (left) and 2 (right) the first
    element of their local arrays. These three
    numbers should then be averaged.
  • In terms of right and left neighbors, you should
    visualize the four processors connected in a
    ring. That is, the left neighbor for P0 should be
    P3, and the right neighbor for P3 should be P0.
  • Both the target location and the average should
    be written to an output file. As usual, the
    program should read both the target value and all
    the array elements from an input file.
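
The ring bookkeeping described above can be written with modulo arithmetic, as in the sketch below (4 processes are assumed in the problem, but the formulas work for any count): P0's left neighbor wraps around to the last rank, and the last rank's right neighbor wraps to P0.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, left, right;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        left  = (rank - 1 + nprocs) % nprocs;   /* e.g. P0's left neighbor is P3  */
        right = (rank + 1) % nprocs;            /* e.g. P3's right neighbor is P0 */

        printf("rank %d: left = %d, right = %d\n", rank, left, right);

        MPI_Finalize();
        return 0;
    }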

63
Exercise
  • Solve this new version of the Course Problem by
    modifying your code from Chapter 6. Specifically,
    change the code to perform the new method of
    calculating the average value at each target
    location.

64
Virtual Topologies
65
Self Test
  • When using MPI_Cart_create, if the cartesian grid
    size is smaller than processes available in
    old_comm, then
  • error results.
  • new_comm returns MPI_COMM_NULL for calling
    processes not used for grid.
  • new_comm returns MPI_UNDEFINED for calling
    processes not used for grid.

66
Self Test
  • When using MPI_Cart_create, if the cartesian grid
    size is larger than processes available in
    old_comm, then
  • error results.
  • the cartesian grid is automatically reduced to
    match processes available in old_comm.
  • more processes are added to match the requested
    cartesian grid size if possible otherwise error
    results.

67
Self Test
  • After using MPI_Cart_create to generate a
    cartesian grid with grid size smaller than
    processes available in old_comm, a call to
    MPI_Cart_coords or MPI_Cart_rank
    unconditionally (i.e., without regard to whether
    it is appropriate to call it) ends in error because
  • calling processes not belonging to group have
    been assigned the communicator MPI_UNDEFINED,
    which is not a valid communicator for
    MPI_Cart_coords or MPI_Cart_rank.
  • calling processes not belonging to group have
    been assigned the communicator MPI_COMM_NULL,
    which is not a valid communicator for
    MPI_Cart_coords or MPI_Cart_rank.
  • grid size does not match what is in old_comm.

68
Self Test
  • When using MPI_Cart_rank to translate cartesian
    coordinates into equivalent rank, if some or all
    of the indices of the coordinates are outside of
    the defined range, then
  • error results.
  • error results unless periodicity is imposed in
    all dimensions.
  • error results unless each of the out-of-range
    indices is periodic.

69
Self Test
  • With MPI_Cart_shift(comm, direction, displ,
    source, dest), if the calling process is the
    first or the last entry along the shift direction
    and displ is greater than 0, then
  • error results.
  • MPI_Cart_shift returns source and dest if
    periodicity is imposed along the shift direction.
    Otherwise, source and/or dest return
    MPI_UNDEFINED.
  • error results unless periodicity is imposed along
    the shift direction.
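
For reference, a sketch of a one-dimensional periodic cartesian grid and a shift along it; using all available processes and a displacement of 1 are assumptions for illustration.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int nprocs, my_rank, source, dest;
        int dims[1], periods[1] = { 1 };          /* periodic: the ends wrap      */
        MPI_Comm ring_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        dims[0] = nprocs;                         /* one dimension, all processes */
        MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &ring_comm);
        MPI_Comm_rank(ring_comm, &my_rank);

        /* shift by +1 along dimension 0: source is the left neighbor, dest the
           right; without periodicity an end process has no neighbor that way */
        MPI_Cart_shift(ring_comm, 0, 1, &source, &dest);
        printf("rank %d: source = %d, dest = %d\n", my_rank, source, dest);

        MPI_Comm_free(&ring_comm);
        MPI_Finalize();
        return 0;
    }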

70
Self Test
  • MPI_Cart_sub can be used to subdivide a cartesian
    grid into subgrids of lower dimensions. These
    subgrids
  • have dimensions one lower than the original grid.
  • need attributes such as periodicity to be reimposed.
  • possess appropriate attributes of the original
    cartesian grid.

71
Course Problem
  • Description
  • The new problem still implements a parallel
    search of an integer array. The program should
    find all occurrences of a certain integer which
    will be called the target. When a processor of a
    certain rank finds a target location, it should
    then calculate the average of
  • The target value
  • An element from the processor with rank one
    higher (the "right" processor). The right
    processor should send the first element from its
    local array.
  • An element from the processor with rank one less
    (the "left" processor). The left processor should
    send the first element from its local array.

72
Course Problem
  • For example, if processor 1 finds the target at
    index 33 in its local array, it should get from
    processors 0 (left) and 2 (right) the first
    element of their local arrays. These three
    numbers should then be averaged.
  • In terms of right and left neighbors, you should
    visualize the four processors connected in a
    ring. That is, the left neighbor for P0 should be
    P3, and the right neighbor for P3 should be P0.
  • Both the target location and the average should
    be written to an output file. As usual, the
    program should read both the target value and all
    the array elements from an input file.

73
Course Problem
  • Exercise
  • Modify your code from Chapter 7 to solve this
    latest version of the Course Problem using a
    virtual topology. First, create the topology
    (which should be called MPI_RING) in which the
    four processors are connected in a ring. Then,
    use the utility routines to determine which
    neighbors a given processor has.

74
MPI Program Performance
75
Matching
  • The time elapsed from when the first processor
    starts executing a problem to when the last
    processor completes execution.
  • T1/(P*Tp), where T1 is the execution time on one
    processor and Tp is the execution time on P
    processors.
  • The execution time on one processor of the
    fastest sequential program divided by the
    execution time on P processors.
  • When the sequential component of an algorithm
    accounts for 1/s of the program's execution time,
    then the maximum possible speedup that can be
    achieved on a parallel computer is s.
  • Characterizing performance in a large limit.
  • When an algorithm suffers from computation or
    communication imbalances among processors.
  • When the fast memory on a processor gets used
    more often in a parallel implementation, causing
    an unexpected decrease in the computation time.
  • A performance tool that shows the amount of time
    a program spends on different program components.
  • A performance tool that determines the length of
    time spent executing a particular piece of code.
  • The most detailed performance tool that generates
    a file which records the significant events in
    the running of a program.
  • Amdahl's Law
  • Profiles
  • Relative efficiency
  • Load imbalances
  • Timers
  • Asymptotic analysis
  • Execution time
  • Cache effects
  • Event traces
  • Absolute speedup

76
Self Test
  • The following is not a performance metric
  • speedup
  • efficiency
  • problem size

77
Self Test
  • A good question to ask in scalability analysis
    is
  • How can one overlap computation and
    communications tasks in an efficient manner?
  • How can a single performance measure give an
    accurate picture of an algorithm's overall
    performance?
  • How does efficiency vary with increasing problem
    size?
  • In what parameter regime can I apply Amdahl's
    law?

78
Self Test
  • If an implementation has unaccounted-for
    overhead, a possible reason is
  • an algorithm may suffer from computation or
    communication imbalances among processors.
  • the cache, or fast memory, on a processor may get
    used more often in a parallel implementation
    causing an unexpected decrease in the computation
    time.
  • you failed to employ a domain decomposition.
  • there is not enough communication between
    processors.

79
Self Test
  • Which one of the following is not a data
    collection technique used to gather performance
    data
  • counters
  • profiles
  • abstraction
  • event traces

80
Course Problem
  • In this chapter, the broad subject of parallel
    code performance is discussed both in terms of
    theoretical concepts and some specific tools for
    measuring performance metrics that work on
    certain parallel machines. Put in its simplest
    terms, improving code performance boils down to
    speeding up your parallel code and/or improving
    how your code uses memory.

81
Course Problem
  • As you have learned new features of MPI in this
    course, you have also improved the performance of
    the code. Here is a list of performance
    improvements so far
  • Using Derived Datatypes instead of sending and
    receiving the separate pieces of data
  • Using Collective Communication routines instead
    of repeating/looping individual sends and
    receives
  • Using a Virtual Topology and its utility routines
    to avoid extraneous calculations
  • Changing the original master-slave algorithm so
    that the master also searches part of the global
    array (The slave rebellion Spartacus!)
  • Using "true" parallel I/O so that all processors
    write to the output file simultaneously instead
    of just one (the master)

82
Course Problem
  • But more remains to be done - especially in terms
    of how the program affects memory. And that is
    the last exercise for this course. The problem
    description is the same as the one given in
    Chapter 9 but you will modify the code you wrote
    using what you learned in this chapter.

83
Course Problem
  • Description
  • The initial problem implements a parallel search
    of an extremely large (several thousand elements)
    integer array. The program finds all occurrences
    of a certain integer, called the target, and
    writes all the array indices where the target was
    found to an output file. In addition, the program
    reads both the target value and all the array
    elements from an input file.
  • Exercise
  • Modify your code from Chapter 9 so that it uses
    dynamic memory allocation to use only the amount
    of memory it needs and only for as long as it
    needs it. Make both the arrays a and b ALLOCATED
    DYNAMICALLY and connect them to memory properly.
    You may also assume that the input data file
    "b.data" now has on its first line the number of
    elements in the global array b. The second line
    now has the target value. The remaining lines are
    the contents of the global array b.
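
A hedged sketch of the allocation pattern the exercise asks for: the master reads the element count from the first line of "b.data" and the target from the second, every rank allocates only the memory it needs, and everything is freed as soon as it is no longer required. Even division of the element count by the number of processes is an assumption.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, n = 0, target = 0, i;
        int *b = NULL;                            /* global array, master only   */
        int *a = NULL;                            /* local slice on every rank   */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {
            FILE *in = fopen("b.data", "r");
            fscanf(in, "%d", &n);                 /* line 1: number of elements  */
            fscanf(in, "%d", &target);            /* line 2: target value        */
            b = malloc(n * sizeof(int));          /* allocated only when needed  */
            for (i = 0; i < n; i++)
                fscanf(in, "%d", &b[i]);
            fclose(in);
        }

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&target, 1, MPI_INT, 0, MPI_COMM_WORLD);

        a = malloc((n / nprocs) * sizeof(int));   /* just the local slice        */
        MPI_Scatter(b, n / nprocs, MPI_INT,
                    a, n / nprocs, MPI_INT, 0, MPI_COMM_WORLD);

        /* ... search a[] for the target, compute the averages, write output ... */

        free(a);                                  /* release memory when done    */
        if (rank == 0)
            free(b);
        MPI_Finalize();
        return 0;
    }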