1
Parallel Programming With MPI
  • Self Test

2
Message Passing Fundamentals
3
Self Test
  • A shared memory computer has access to
  • the memory of other nodes via a proprietary
    high-speed communications network
  • a directives-based data-parallel language
  • a global memory space
  • communication time

4
Self Test
  • A domain decomposition strategy turns out not to
    be the most efficient algorithm for a parallel
    program when
  • data can be divided into pieces of approximately
    the same size.
  • the pieces of data assigned to the different
    processes require greatly different lengths of
    time to process.
  • one needs the advantage of maintaining a single
    flow of control.
  • one must parallelize a finite differencing
    scheme.

5
Self Test
  • In the message passing approach
  • serial code is made parallel by adding directives
    that tell the compiler how to distribute data and
    work across the processors.
  • details of how data distribution, computation,
    and communications are to be done are left to the
    compiler.
  • the approach is not very flexible.
  • it is left up to the programmer to explicitly
    divide data and work across the processors as
    well as manage the communications among them.

6
Self Test
  • Total execution time does not involve
  • computation time.
  • compiling time.
  • communications time.
  • idle time.

7
Self Test
  • One can minimize idle time by
  • occupying a process with one or more new tasks
    while it waits for communication to finish so it
    can proceed on another task.
  • always using nonblocking communications.
  • never using nonblocking communications.
  • frequent use of barriers.

8
Matching Question
  • When each node has rapid access to its own local
    memory and access to the memory of other nodes
    via some sort of communications network.
  • When multiple processor units share access to a
    global memory space via a high speed memory bus.
  • Data are divided into pieces of approximately the
    same size, and then mapped to different
    processors.
  • The problem is decomposed into a large number of
    smaller tasks and then the tasks are assigned to
    the processors as they become available.
  • Serial code is made parallel by adding directives
    that tell the compiler how to distribute data and
    work across the processors.
  • The programmer explicitly divides data and work
    across the processors as well as managing the
    communications among them.
  • Dividing the work equally among the available
    processes.
  • The time spent performing computations on the
    data.
  • The time a process spends waiting for data from
    other processors.
  • The time for processes to send and receive
    messages
  • Message passing
  • Domain decomposition
  • Idle time
  • Load balancing
  • Directives-based data parallel language
  • Distributed memory
  • Shared memory
  • Computation time
  • Functional decomposition
  • Communication time

9
Course Problem
  • The initial problem implements a parallel search
    of an extremely large (several thousand elements)
    integer array. The program finds all occurrences
    of a certain integer, called the target, and
    writes all the array indices where the target was
    found to an output file. In addition, the program
    reads both the target value and all the array
    elements from an input file.
  • Using these concepts of parallel programming,
    write a description of a parallel approach to
    solving the problem described above. (No coding
    is required for this exercise.)

10
Getting Started with MPI
11
Self Test
  • Is a blocking send necessarily also synchronous?
  • Yes
  • No

12
Self Test
  • Consider the following fragment of MPI
    pseudo-code
  • ...
  • x = fun(y)
  • MPI_SOME_SEND(the value of x to some other
    processor)
  • x = fun(z)
  • ...
  • where MPI_SOME_SEND is a generic send routine. In
    this case, it would be best to use
  • A blocking send
  • A nonblocking send
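
For reference, a minimal C sketch of the two choices, under the assumption that x is a single double being sent from rank 1 to rank 0 and that fun() is a simple stand-in: with a blocking MPI_Send the buffer may be reused as soon as the call returns, while a nonblocking MPI_Isend must be completed with MPI_Wait before x is overwritten by fun(z).

    #include <mpi.h>

    /* hypothetical stand-in for the fun() of the fragment above */
    static double fun(double v) { return 2.0 * v; }

    int main(int argc, char **argv)
    {
        double x, y = 1.0, z = 2.0, recv;
        int rank;
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            /* blocking variant: x may be overwritten as soon as MPI_Send returns */
            x = fun(y);
            MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            x = fun(z);

            /* nonblocking variant: complete the send before reusing x */
            x = fun(y);
            MPI_Isend(&x, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* wait before overwriting x */
            x = fun(z);
        } else if (rank == 0) {
            /* matching receives for the two sends above */
            MPI_Recv(&recv, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
            MPI_Recv(&recv, 1, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }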

13
Self Test
  • Which of the following is true for all send
    routines?
  • It is always safe to overwrite the sent
    variable(s) on the sending processor after the
    send returns.
  • Completion implies that the message has been
    received at its destination.
  • It is always safe to overwrite the sent
    variable(s) on the sending processor after the
    send is complete.
  • All of the above.
  • None of the above.

14
Matching Question
  • Point-to-point communication
  • Collective communication
  • Communication mode
  • Blocking send
  • Synchronous send
  • Broadcast
  • Scatter
  • Gather
  • A send routine that does not return until it is
    complete
  • Communication involving one or more groups of
    processes
  • A send routine that is not complete until receipt
    of the message at its destination has been
    acknowledged
  • An operation in which one process sends the same
    data to several others
  • Communication involving a single pair of
    processes
  • An operation in which one process distributes
    different elements of a local array to several
    others
  • An operation in which one process collects data
    from several others and assembles them in a local
    array
  • Specification of the method of operation and
    completion criteria for a communication routine

15
Course Problem
  • The initial problem implements a parallel search
    of an extremely large (several thousand elements)
    integer array. The program finds all occurrences
    of a certain integer, called the target, and
    writes all the array indices where the target was
    found to an output file. In addition, the program
    reads both the target value and all the array
    elements from an input file.
  • Before writing a parallel version of a program,
    first write a serial version (that is, a version
    that runs on one processor). That is the task for
    this chapter. You can use C/C++ and should
    confirm that the program works by using a test
    input array. (One possible serial version is
    sketched below for reference.)
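
The sketch below assumes the input file is named "b.data", the output file "found.data", that the target value appears before the array elements, and an arbitrary upper bound on the array size; Chapter 2 leaves these details open.

    #include <stdio.h>

    #define MAX_ELEMENTS 100000   /* assumed upper bound on the array size */

    int main(void)
    {
        int b[MAX_ELEMENTS];
        int target, n = 0, i;
        FILE *in  = fopen("b.data", "r");     /* assumed input file name  */
        FILE *out = fopen("found.data", "w"); /* assumed output file name */

        if (in == NULL || out == NULL)
            return 1;

        fscanf(in, "%d", &target);            /* first value: the target  */
        while (n < MAX_ELEMENTS && fscanf(in, "%d", &b[n]) == 1)
            n++;                              /* remaining values: array  */

        for (i = 0; i < n; i++)               /* record every match       */
            if (b[i] == target)
                fprintf(out, "%d\n", i);

        fclose(in);
        fclose(out);
        return 0;
    }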

16
MPI Program Structure
17
Self Test
  • How would you modify "Hello World" so that only
    even-numbered processors print the greeting
    message?
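
One possible answer, as a hedged sketch: obtain the rank with MPI_Comm_rank and gate the print on its parity. The greeting text itself is an assumption.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank % 2 == 0)                    /* only even-numbered ranks print */
            printf("Hello from even-numbered processor %d\n", rank);

        MPI_Finalize();
        return 0;
    }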

18
Self Test
  • Consider the following MPI pseudo-code, which
    sends a piece of data from processor 1 to
    processor 2
  • MPI_INIT()
  • MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  • if (myrank == 1)
  • MPI_SEND (some data to processor 2 in
    MPI_COMM_WORLD)
  • else
  • MPI_RECV (data from processor 1 in
    MPI_COMM_WORLD)
  • print "Message received!"
  • MPI_FINALIZE()
  • where MPI_SEND and MPI_RECV are blocking send and
    receive routines. Thus, for example, a process
    encountering the MPI_RECV statement will block
    while waiting for a message from processor 1.
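
A concrete C rendering of this pseudo-code may make the next two questions easier to reason about; the payload and message size are assumptions, but the control flow matches the slide.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int myrank, data = 42;                /* assumed payload */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        if (myrank == 1) {
            /* blocking send of the data to processor 2 */
            MPI_Send(&data, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
        } else {
            /* every other rank blocks here until a message from rank 1 arrives */
            MPI_Recv(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("Message received!\n");
        }

        MPI_Finalize();
        return 0;
    }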

19
Self Test
  • If this code is run on a single processor, what
    do you expect to happen?
  • The code will print "Message received!" and then
    terminate.
  • The code will terminate normally with no output.
  • The code will hang with no output.
  • An error condition will result.

20
Self Test
  • If the code is run on three processors, what do
    you expect?
  • The code will terminate after printing "Message
    received!".
  • The code will hang with no output.
  • The code will hang after printing "Message
    received!".
  • The code will give an error message and exit
    (possibly leaving a core file).

21
Self Test
  • Consider an MPI code running on four processors,
    denoted A, B, C, and D. In the default
    communicator MPI_COMM_WORLD their ranks are 0-3,
    respectively. Assume that we have defined another
    communicator, called USER_COMM, consisting of
    processors B and D. Which one of the following
    statements about USER_COMM is always true?
  • Processors B and D have ranks 1 and 3,
    respectively.
  • Processors B and D have ranks 0 and 1,
    respectively.
  • Processors B and D have ranks 1 and 3, but which
    has which is in general undefined.
  • Processors B and D have ranks 0 and 1, but which
    has which is in general undefined.

22
Course Problem
  • Description
  • The initial problem implements a parallel search
    of an extremely large (several thousand elements)
    integer array. The program finds all occurrences
    of a certain integer, called the target, and
    writes all the array indices where the target was
    found to an output file. In addition, the program
    reads both the target value and all the array
    elements from an input file.
  • Exercise
  • You now have enough knowledge to write
    pseudo-code for the parallel search algorithm
    introduced in Chapter 1. In the pseudo-code, you
    should correctly initialize MPI, have each
    processor determine and use its rank, and
    terminate MPI. By tradition, the Master processor
    has rank 0. Assume in your pseudo-code that the
    real code will be run on 4 processors.
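
A minimal skeleton of the MPI housekeeping the exercise asks for is sketched below; the search and I/O steps are deliberately left as comments, since filling them in is the reader's task.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;

        MPI_Init(&argc, &argv);                      /* initialize MPI        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* this process's rank   */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);      /* expected to be 4 here */

        if (rank == 0) {
            /* master: read the target and the array, hand out work,
               collect the indices found by the slaves, write the output file */
        } else {
            /* slaves: receive a subarray, search it for the target,
               and send any matching indices back to rank 0 */
        }

        MPI_Finalize();                              /* terminate MPI         */
        return 0;
    }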

23
Point-to-Point Communication
24
Self Test
  • MPI_SEND is used to send an array of 10 4-byte
    integers. At the time MPI_SEND is called, MPI has
    over 50 Kbytes of internal message buffer free on
    the sending process. Choose the best answer.
  • This is a blocking send. Most MPI implementations
    will copy the message into MPI internal message
    buffer and return.
  • This is a blocking send. Most MPI implementations
    will block the sending process until the
    destination process has received the message.
  • This is a non-blocking send. Most MPI
    implementations will copy the message into MPI
    internal message buffer and return.

25
Self Test
  • MPI_SEND is used to send an array of 100,000
    8-byte reals. At the time MPI_SEND is called, MPI
    has less than 50 Kbytes of internal message
    buffer free on the sending process. Choose the
    best answer.
  • This is a blocking send. Most MPI implementations
    will block the calling process until enough
    message buffer becomes available.
  • This is a blocking send. Most MPI implementations
    will block the sending process until the
    destination process has received the message.

26
Self Test
  • MPI_SEND is used to send a large array. When
    MPI_SEND returns, the programmer may safely
    assume
  • The destination process has received the message.
  • The array has been copied into MPI internal
    message buffer.
  • Either the destination process has received the
    message or the array has been copied into MPI
    internal message buffer.
  • None of the above.

27
Self Test
  • MPI_ISEND is used to send an array of 10 4-byte
    integers. At the time MPI_ISEND is called, MPI
    has over 50 Kbytes of internal message buffer
    free on the sending process. Choose the best
    answer.
  • This is a non-blocking send. MPI will generate a
    request id and then return.
  • This a non-blocking send. Most MPI
    implementations will copy the message into MPI
    internal message buffer and return.

28
Self Test
  • MPI_ISEND is used to send an array of 10 4-byte
    integers. At the time MPI_ISEND is called, MPI
    has over 50 Kbytes of internal message buffer
    free on the sending process. After calling
    MPI_ISEND, the sending process calls MPI_WAIT to
    wait for completion of the send operation. Choose
    the best answer.
  • MPI_Wait will not return until the destination
    process has received the message.
  • MPI_WAIT may return before the destination
    process has received the message.
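
The pattern this question describes, as a sketch (the destination rank, tag, and buffer contents are assumptions): MPI_Isend returns immediately with a request handle, and MPI_Wait later completes the send, after which the buffer may be reused.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf[10] = {0};                 /* ten 4-byte integers */
        MPI_Request req;
        MPI_Status  status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            /* start the nonblocking send to rank 0 and get a request handle */
            MPI_Isend(buf, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

            /* ... work that does not touch buf could overlap here ... */

            /* completion: buf may be reused after this, but completion alone
               does not guarantee the message has reached the destination */
            MPI_Wait(&req, &status);
        } else if (rank == 0) {
            MPI_Recv(buf, 10, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }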

29
Self Test
  • MPI_ISEND is used to send an array of 100,000
    8-byte reals. At the time MPI_ISEND is called,
    MPI has less than 50 Kbytes of internal message
    buffer free on the sending process. Choose the
    best answer.
  • This is a non-blocking send. MPI will generate a
    request id and return.
  • This is a blocking send. Most MPI implementations
    will block the sending process until the
    destination process has received the message.

30
Self Test
  • MPI_ISEND is used to send an array of 100,000
    8-byte reals. At the time MPI_ISEND is called,
    MPI has less than 50 Kbytes of internal message
    buffer free on the sending process. After calling
    MPI_ISEND, the sending process calls MPI_WAIT to
    wait for completion of the send operation. Choose
    the best answer.
  • This is a blocking send. In most implementations,
    MPI_WAIT will not return until the destination
    process has received the message.
  • This is a non-blocking send. In most
    implementations, MPI_Wait will not return until
    the destination process has received the message.
  • This is a non-blocking send. In most
    implementations, MPI_WAIT will return before the
    destination process has received the message.

31
Self Test
  • Assume the only communicator used in this problem
    is MPI_COMM_WORLD. After calling MPI_INIT,
    process 1 immediately sends two messages to
    process 0. The first message sent has tag 100,
    and the second message sent has tag 200. After
    calling MPI_INIT and verifying there are at least
    2 processes in MPI_COMM_WORLD, process 0 calls
    MPI_RECV with the source argument set to 1 and
    the tag argument set to MPI_ANY_TAG. Choose the
    best answer.
  • The tag of the first message received by process
    0 is 100.
  • The tag of the first message received by process
    0 is 200.
  • The tag of the first message received by process
    0 is either 100 or 200.
  • Based on the information given, one cannot safely
    assume any of the above.
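
A sketch of the two-message scenario used in this question and the next; the payload is an assumption. Checking status.MPI_TAG after each receive shows which message was matched.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0;                      /* assumed payload */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            MPI_Send(&buf, 1, MPI_INT, 0, 100, MPI_COMM_WORLD);  /* tag 100 sent first  */
            MPI_Send(&buf, 1, MPI_INT, 0, 200, MPI_COMM_WORLD);  /* tag 200 sent second */
        } else if (rank == 0) {
            /* accept any tag from rank 1, then inspect which one was matched */
            MPI_Recv(&buf, 1, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            printf("first receive matched tag %d\n", status.MPI_TAG);

            /* a receive posted with the explicit tag 200 instead would only
               match the message carrying that tag */
            MPI_Recv(&buf, 1, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }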

32
Self Test
  • Assume the only communicator used in this problem
    is MPI_COMM_WORLD. After calling MPI_INIT,
    process 1 immediately sends two messages to
    process 0. The first message sent has tag 100,
    and the second message sent has tag 200. After
    calling MPI_INIT and verifying there are at least
    2 processes in MPI_COMM_WORLD, process 0 calls
    MPI_RECV with the source argument set to 1 and
    the tag argument set to 200. Choose the best
    answer.
  • Process 0 is deadlocked, since it attempted to
    receive the second message before receiving the
    first.
  • Process 0 receives the second message sent by
    process 1, even though the first message has not
    yet been received.
  • None of the above.

33
Course Problem
  • Description
  • The initial problem implements a parallel search
    of an extremely large (several thousand elements)
    integer array. The program finds all occurrences
    of a certain integer, called the target, and
    writes all the array indices where the target was
    found to an output file. In addition, the program
    reads both the target value and all the array
    elements from an input file.
  • Exercise
  • Go ahead and write the real parallel code for the
    search problem! Using the pseudo-code from the
    previous chapter as a guide, fill in all the
    sends and receives with calls to the actual MPI
    send and receive routines. For this task, use
    only the blocking routines. If you have access to
    a parallel computer with the MPI library
    installed, run your parallel code using 4
    processors. See if you get the same results as
    those obtained with the serial version of Chapter
    2. Of course, you should.

34
Derived Datatypes and Related Features
35
Self Test
  • You are writing a parallel program to be run on
    100 processors. Each processor is working with
    only one section of a skeleton outline of a 3-D
    model of a house. In the course of constructing
    the model house each processor often has to send
    the three cartesian coordinates (x,y,z) of nodes
    that are to be used to make the boundaries
    between the house sections. Each coordinate will
    be a real value.
  • Why would it be advantageous for you to define a
    new data type called Point which contained the
    three coordinates?

36
Self Test
  • My program will be more readable and self
    commenting.
  • Since many x,y,z values will be used in MPI
    communication routines they can be sent as a
    single Point type entity instead of packing and
    unpacking three reals each time.
  • Since all three values are real, there is no
    purpose in making a derived data type.
  • It would be impossible to use MPI_Pack and
    MPI_Unpack to send the three real values.

37
Self Test
  • What is the simplest MPI derived datatype
    creation function to make the Point datatype
    described in problem 1? (Three of the answers
    given can actually be used to construct the type,
    but one is the simplest).
  • MPI_TYPE_CONTIGUOUS
  • MPI_TYPE_VECTOR
  • MPI_TYPE_STRUCT
  • MPI_TYPE_COMMIT

38
Self Test
  • The C syntax for the MPI_TYPE_CONTIGUOUS
    subroutine is
  • MPI_Type_contiguous (count, oldtype, newtype)
  • The argument names should be fairly
    self-explanatory, but if you want their exact
    definitions you can look them up at the MPI home
    page.
  • For the derived datatype we have been discussing
    in the previous problems, what would be the
    values for the count, oldtype, and newtype
    arguments respectively?
  • 2, MPI_REAL, Point
  • 3, REAL, Point
  • 3, MPI_INTEGER, Coord
  • 3, MPI_REAL, Point
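
A hedged sketch of building the Point type discussed in these questions: three contiguous reals, committed before use. MPI_FLOAT is assumed for the coordinates (MPI_DOUBLE would work the same way).

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Datatype Point;
        float node[3] = {0.0f, 0.0f, 0.0f};     /* x, y, z of one boundary node */

        MPI_Init(&argc, &argv);

        MPI_Type_contiguous(3, MPI_FLOAT, &Point);  /* count, oldtype, newtype */
        MPI_Type_commit(&Point);                    /* must commit before use  */

        /* the three coordinates can now travel as one entity, e.g.
           MPI_Send(node, 1, Point, dest, tag, MPI_COMM_WORLD); */

        MPI_Type_free(&Point);
        MPI_Finalize();
        return 0;
    }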

39
Self Test
  • In Section Using MPI Derived Types for
    User-Defined Types, the code for creating the
    derived data type MPI_SparseElt is shown. Using
    the MPI_Type_extent function, determine the size (in
    bytes) of a variable of type MPI_SparseElt. You
    should probably modify the code found in that
    section, compile and run it to get the answer.

40
Course Problem
  • Description
  • The new problem still implements a parallel
    search of an integer array.
  • The program should find all occurrences of a
    certain integer which will be called the target.
  • It should then calculate the average of the
    target value and its index.
  • Both the target location and the average should
    be written to an output file.
  • In addition, the program should read both the
    target value and all the array elements from an
    input file.

41
Course Problem
  • Exercise
  • Modify your code from Chapter 4 to create a
    program that solves the new Course Problem.
  • Use the techniques/routines of this chapter to
    make a new derived type called MPI_PAIR that will
    contain both the target location and the average.
  • All of the slave sends and the master receives
    must use the MPI_Pair type.
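
Since the pair mixes an integer location with a real average, MPI_Type_contiguous is not enough; a struct type is needed. The sketch below uses MPI_Type_create_struct (the modern name for the MPI_Type_struct routine listed in this chapter) and a hypothetical host struct; keeping the course's name MPI_PAIR is for readability only, as the MPI_ prefix is normally reserved for the library.

    #include <stddef.h>   /* offsetof */
    #include <mpi.h>

    /* hypothetical host struct for one result pair */
    typedef struct { int location; float average; } Pair;

    int main(int argc, char **argv)
    {
        MPI_Datatype MPI_PAIR;
        int          blocklens[2] = { 1, 1 };
        MPI_Aint     displs[2]    = { offsetof(Pair, location),
                                      offsetof(Pair, average) };
        MPI_Datatype types[2]     = { MPI_INT, MPI_FLOAT };

        MPI_Init(&argc, &argv);

        MPI_Type_create_struct(2, blocklens, displs, types, &MPI_PAIR);
        MPI_Type_commit(&MPI_PAIR);

        /* slave:  MPI_Send(&pair, 1, MPI_PAIR, 0, tag, MPI_COMM_WORLD);
           master: MPI_Recv(&pair, 1, MPI_PAIR, src, tag, MPI_COMM_WORLD, &status); */

        MPI_Type_free(&MPI_PAIR);
        MPI_Finalize();
        return 0;
    }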

42
Collective Communications
43
Self Test
  • We want to do a simple broadcast of variable
    abc[7] in processor 0 to the same location in all
    other processors of the communicator. What is the
    correct syntax of the call to do this broadcast?
  • MPI_Bcast( &abc[7], 1, MPI_REAL, 0, comm)
  • MPI_Bcast( abc, 7, MPI_REAL, 0, comm)
  • MPI_Broadcast( &abc[7], 1, MPI_REAL, 0, comm)

44
Self Test
  • Each processor has a local array a with 50
    elements. Each local array is a slice of a larger
    global array. We wish to compute the average
    (mean) of all elements in the global array. Our
    preferred approach is to add all of the data
    element-by-element onto the root processor, sum
    elements of the resulting array, divide, and
    broadcast the result to all processes. Which
    sequence of calls will accomplish this? Assume
    variables are typed and initialized appropriately.

45
Self Test
  • start = 0
  • final = 49
  • count = final - start + 1
  • mysum = 0
  • for (i = start; i <= final; i++) mysum += a[i]
  • MPI_Reduce ( &mysum, &sum, 1, MPI_REAL,
  • MPI_SUM, root, comm )
  • total_count = nprocs * count
  • if ( my_rank == root ) average = sum / total_count
  • MPI_Bcast ( &average, 1, MPI_REAL, root, comm )

46
Self Test
  • start = 0
  • final = 49
  • count = final - start + 1
  • MPI_Reduce ( a, sum_array, count,
  • MPI_REAL, MPI_SUM, root, comm )
  • sum = 0
  • for (i = start; i <= final; i++) sum +=
    sum_array[i]
  • total_count = nprocs * count
  • if ( my_rank == root ) average = sum / total_count
  • MPI_Bcast ( &average, 1, MPI_REAL, root, comm )

47
Self Test
  • start = 0
  • final = 49
  • count = final - start + 1
  • mysum = 0
  • for (i = start; i <= final; i++) mysum += a[i]
  • my_average = mysum / count
  • MPI_Reduce ( &my_average, &sum, 1, MPI_REAL,
  • MPI_SUM, root, comm )
  • if ( my_rank == root ) average = sum / nprocs
  • MPI_Bcast ( &average, 1, MPI_REAL, root, comm )

48
Self Test
  • Consider a communicator with 4 processes. How
    many total MPI_Send()'s and MPI_Recv()'s would be
    required to accomplish the following
  • MPI_Allreduce ( &a, &x, 1, MPI_REAL, MPI_SUM,
    comm )
  • 3
  • 4
  • 12
  • 16

49
Course Problem
  • Description
  • The new problem still implements a parallel
    search of an integer array. The program should
    find all occurrences of a certain integer which
    will be called the target. It should then
    calculate the average of the target value and its
    index. Both the target location and the average
    should be written to an output file. In addition,
    the program should read both the target value and
    all the array elements from an input file.

50
Course Problem
  • Exercise
  • Modify your code from Chapter 5, to change how
    the master first sends out the target and
    subarray data to the slaves.
  • Use the MPI broadcast routines to give each slave
    the target.
  • Use the MPI scatter routine to give all
    processors a section of the array b it will
    search.
  • When you use the standard MPI scatter routine you
    will see that the global array b is now split up
    into four parts and the master process now has
    the first fourth of the array to search.
  • So you should add a search loop (similar to the
    slaves') in the master section of code to search
    for the target and calculate the average and then
    write the result to the output file.
  • This is actually an improvement in performance
    since all the processors perform part of the
    search in parallel.
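
A hedged sketch of the distribution step described in this exercise. The array length N, its even divisibility by the number of processes, and the variable names are assumptions carried over from the text.

    #include <mpi.h>

    #define N 300                      /* assumed global array length */

    int main(int argc, char **argv)
    {
        int rank, nprocs, target = 0;
        int b[N];                      /* global array, filled only on rank 0 */
        int a[N];                      /* local slice; N/nprocs entries used  */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {
            /* ... the master reads the target and the array b from the input file ... */
        }

        /* every process, master included, gets the target and one slice of b */
        MPI_Bcast(&target, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Scatter(b, N / nprocs, MPI_INT,
                    a, N / nprocs, MPI_INT, 0, MPI_COMM_WORLD);

        /* ... each rank, master included, now searches its N/nprocs elements ... */

        MPI_Finalize();
        return 0;
    }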

51
Communicators
52
Self Test
  • MPI_Comm_group may be used to
  • create a new group.
  • determine group handle of a communicator.
  • create a new communicator.

53
Self Test
  • MPI_Group_incl is used to select processes of
    old_group to form new_group. If the selection
    includes process(es) not in old_group, it would
    cause the MPI job
  • to print error messages and abort the job.
  • to print error messages but execution continues.
  • to continue with warning messages.

54
Self Test
  • Assuming that a calling process belongs to the
    new_group of MPI_Group_excl(old_group, count,
    nonmembers, new_group), if nonmembers' order were
    altered,
  • the corresponding rank of the calling process in
    the new_group would change.
  • the corresponding rank of the calling process in
    the new_group remain unchanged.
  • the corresponding rank of the calling process in
    the new_group might or might not change.

55
Self Test
  • In MPI_Group_excl(old_group, count, nonmembers,
    new_group), if count = 0, then
  • new_group is identical to old_group.
  • new_group has no members.
  • error results.

56
Self Test
  • In MPI_Group_excl(old_group, count, nonmembers,
    new_group) if the nonmembers array is not unique
    (i.e., one or more entries of nonmembers point
    to the same rank in old_group), then
  • MPI ignores the repetition.
  • error results.
  • it returns MPI_GROUP_EMPTY.

57
Self Test
  • MPI_Group_rank is used to query the calling
    process' rank number in group. If the calling
    process does not belong to the group, then
  • error results.
  • the returned group rank has a value of -1,
    indicating that the calling process is not a
    member of the group.
  • the returned group rank is MPI_UNDEFINED.

58
Self Test
  • In MPI_Comm_split, if two processes of the same
    color are assigned the same key, then
  • error results.
  • their rank numbers in the new communicator are
    ordered according to their relative rank order in
    the old communicator.
  • they both share the same rank in the new
    communicator.

59
Self Test
  • MPI_Comm_split(old_comm, color, key, new_comm) is
    equivalent to MPI_Comm_create(old_comm, group,
    new_comm) when
  • color = Iam, key = 0 if calling process Iam belongs
    to group; ELSE color = MPI_UNDEFINED for all other
    processes in old_comm.
  • color = 0, key = Iam if calling process Iam belongs
    to group; ELSE color = MPI_UNDEFINED for all other
    processes in old_comm.
  • color = 0, key = 0
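
A sketch of MPI_Comm_split for reference; splitting by rank parity is an arbitrary choice for illustration. color selects the subgroup and key orders the ranks inside it, with ties broken by the old relative order.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int world_rank, new_rank;
        MPI_Comm new_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* even ranks form one communicator, odd ranks the other */
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &new_comm);

        MPI_Comm_rank(new_comm, &new_rank);
        printf("world rank %d -> new rank %d\n", world_rank, new_rank);

        MPI_Comm_free(&new_comm);
        MPI_Finalize();
        return 0;
    }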

60
Course Problem
  • For this chapter, a new version of the Course
    Problem is presented in which the average value
    each processor calculates when a target location
    is found, is calculated in a different manner.
  • Specifically, the average will be calculated from
    the "neighbor" values of the target.
  • This is a classic style of programming (called
    calculations with a stencil) used at important
    array locations. Stencil calculations are used in
    many applications including numerical solutions
    of differential equations and image processing,
    to name two.
  • This new Course Problem will also entail more
    message passing between the searching processors
    because in order to calculate the average they
    will have to get values of the global array they
    do not have in their subarray.

61
Description
  • Our new problem still implements a parallel
    search of an integer array. The program should
    find all occurrences of a certain integer which
    will be called the target. When a processor of a
    certain rank finds a target location, it should
    then calculate the average of
  • The target value
  • An element from the processor with rank one
    higher (the "right" processor). The right
    processor should send the first element from its
    local array.
  • An element from the processor with rank one less
    (the "left" processor). The left processor should
    send the first element from its local array.

62
Description
  • For example, if processor 1 finds the target at
    index 33 in its local array, it should get from
    processors 0 (left) and 2 (right) the first
    element of their local arrays. These three
    numbers should then be averaged.
  • In terms of right and left neighbors, you should
    visualize the four processors connected in a
    ring. That is, the left neighbor for P0 should be
    P3, and the right neighbor for P3 should be P0.
  • Both the target location and the average should
    be written to an output file. As usual, the
    program should read both the target value and all
    the array elements from an input file.
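
The ring bookkeeping described above can be written with modulo arithmetic, as in the sketch below (4 processes are assumed in the problem, but the formulas work for any count): P0's left neighbor wraps around to the last rank, and the last rank's right neighbor wraps to P0.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, left, right;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        left  = (rank - 1 + nprocs) % nprocs;   /* e.g. P0's left neighbor is P3  */
        right = (rank + 1) % nprocs;            /* e.g. P3's right neighbor is P0 */

        printf("rank %d: left = %d, right = %d\n", rank, left, right);

        MPI_Finalize();
        return 0;
    }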

63
Exercise
  • Solve this new version of the Course Problem by
    modifying your code from Chapter 6. Specifically,
    change the code to perform the new method of
    calculating the average value at each target
    location.

64
Virtual Topologies
65
Self Test
  • When using MPI_Cart_create, if the cartesian grid
    size is smaller than processes available in
    old_comm, then
  • error results.
  • new_comm returns MPI_COMM_NULL for calling
    processes not used for grid.
  • new_comm returns MPI_UNDEFINED for calling
    processes not used for grid.

66
Self Test
  • When using MPI_Cart_create, if the cartesian grid
    size is larger than processes available in
    old_comm, then
  • error results.
  • the cartesian grid is automatically reduced to
    match processes available in old_comm.
  • more processes are added to match the requested
    cartesian grid size if possible otherwise error
    results.

67
Self Test
  • After using MPI_Cart_create to generate a
    cartesian grid with grid size smaller than
    processes available in old_comm, a call to
    MPI_Cart_coords or MPI_Cart_rank
    unconditionally (i.e., without regard to whether
    it is appropriate to call it) ends in error because
  • calling processes not belonging to group have
    been assigned the communicator MPI_UNDEFINED,
    which is not a valid communicator for
    MPI_Cart_coords or MPI_Cart_rank.
  • calling processes not belonging to group have
    been assigned the communicator MPI_COMM_NULL,
    which is not a valid communicator for
    MPI_Cart_coords or MPI_Cart_rank.
  • grid size does not match what is in old_comm.

68
Self Test
  • When using MPI_Cart_rank to translate cartesian
    coordinates into equivalent rank, if some or all
    of the indices of the coordinates are outside of
    the defined range, then
  • error results.
  • error results unless periodicity is imposed in
    all dimensions.
  • error results unless each of the out-of-range
    indices is periodic.

69
Self Test
  • With MPI_Cart_shift(comm, direction, displ,
    source, dest), if the calling process is the
    first or the last entry along the shift direction
    and displ is greater than 0, then
  • error results.
  • MPI_Cart_shift returns source and dest if
    periodicity is imposed along the shift direction.
    Otherwise, source and/or dest return
    MPI_UNDEFINED.
  • error results unless periodicity is imposed along
    the shift direction.
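
For reference, a sketch of a one-dimensional periodic cartesian grid and a shift along it; using all available processes and a displacement of 1 are assumptions for illustration.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int nprocs, my_rank, source, dest;
        int dims[1], periods[1] = { 1 };          /* periodic: the ends wrap      */
        MPI_Comm ring_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        dims[0] = nprocs;                         /* one dimension, all processes */
        MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &ring_comm);
        MPI_Comm_rank(ring_comm, &my_rank);

        /* shift by +1 along dimension 0: source is the left neighbor, dest the
           right; without periodicity an end process has no neighbor that way */
        MPI_Cart_shift(ring_comm, 0, 1, &source, &dest);
        printf("rank %d: source = %d, dest = %d\n", my_rank, source, dest);

        MPI_Comm_free(&ring_comm);
        MPI_Finalize();
        return 0;
    }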

70
Self Test
  • MPI_Cart_sub can be used to subdivide a cartesian
    grid into subgrids of lower dimensions. These
    subgrids
  • have dimensions one lower than the original grid.
  • need attributes such as periodicity to be reimposed.
  • possess appropriate attributes of the original
    cartesian grid.

71
Course Problem
  • Description
  • The new problem still implements a parallel
    search of an integer array. The program should
    find all occurrences of a certain integer which
    will be called the target. When a processor of a
    certain rank finds a target location, it should
    then calculate the average of
  • The target value
  • An element from the processor with rank one
    higher (the "right" processor). The right
    processor should send the first element from its
    local array.
  • An element from the processor with rank one less
    (the "left" processor). The left processor should
    send the first element from its local array.

72
Course Problem
  • For example, if processor 1 finds the target at
    index 33 in its local array, it should get from
    processors 0 (left) and 2 (right) the first
    element of their local arrays. These three
    numbers should then be averaged.
  • In terms of right and left neighbors, you should
    visualize the four processors connected in a
    ring. That is, the left neighbor for P0 should be
    P3, and the right neighbor for P3 should be P0.
  • Both the target location and the average should
    be written to an output file. As usual, the
    program should read both the target value and all
    the array elements from an input file.

73
Course Problem
  • Exercise
  • Modify your code from Chapter 7 to solve this
    latest version of the Course Problem using a
    virtual topology. First, create the topology
    (which should be called MPI_RING) in which the
    four processors are connected in a ring. Then,
    use the utility routines to determine which
    neighbors a given processor has.

74
MPI Program Performance
75
Matching
  • The time elapsed from when the first processor
    starts executing a problem to when the last
    processor completes execution.
  • T1/(P*Tp), where T1 is the execution time on one
    processor and Tp is the execution time on P
    processors.
  • The execution time on one processor of the
    fastest sequential program divided by the
    execution time on P processors.
  • When the sequential component of an algorithm
    accounts for 1/s of the program's execution time,
    then the maximum possible speedup that can be
    achieved on a parallel computer is s.
  • Characterizing performance in a large limit.
  • When an algorithm suffers from computation or
    communication imbalances among processors.
  • When the fast memory on a processor gets used
    more often in a parallel implementation, causing
    an unexpected decrease in the computation time.
  • A performance tool that shows the amount of time
    a program spends on different program components.
  • A performance tool that determines the length of
    time spent executing a particular piece of code.
  • The most detailed performance tool that generates
    a file which records the significant events in
    the running of a program.
  • Amdahl's Law
  • Profiles
  • Relative efficiency
  • Load imbalances
  • Timers
  • Asymptotic analysis
  • Execution time
  • Cache effects
  • Event traces
  • Absolute speedup

76
Self Test
  • The following is not a performance metric
  • speedup
  • efficiency
  • problem size

77
Self Test
  • A good question to ask in scalability analysis
    is
  • How can one overlap computation and
    communications tasks in an efficient manner?
  • How can a single performance measure give an
    accurate picture of an algorithm's overall
    performance?
  • How does efficiency vary with increasing problem
    size?
  • In what parameter regime can I apply Amdahl's
    law?

78
Self Test
  • If an implementation has unaccounted-for
    overhead, a possible reason is
  • an algorithm may suffer from computation or
    communication imbalances among processors.
  • the cache, or fast memory, on a processor may get
    used more often in a parallel implementation
    causing an unexpected decrease in the computation
    time.
  • you failed to employ a domain decomposition.
  • there is not enough communication between
    processors.

79
Self Test
  • Which one of the following is not a data
    collection technique used to gather performance
    data
  • counters
  • profiles
  • abstraction
  • event traces

80
Course Problem
  • In this chapter, the broad subject of parallel
    code performance is discussed both in terms of
    theoretical concepts and some specific tools for
    measuring performance metrics that work on
    certain parallel machines. Put in its simplest
    terms, improving code performance boils down to
    speeding up your parallel code and/or improving
    how your code uses memory.

81
Course Problem
  • As you have learned new features of MPI in this
    course, you have also improved the performance of
    the code. Here is a list of performance
    improvements so far
  • Using Derived Datatypes instead of sending and
    receiving the separate pieces of data
  • Using Collective Communication routines instead
    of repeating/looping individual sends and
    receives
  • Using a Virtual Topology and its utility routines
    to avoid extraneous calculations
  • Changing the original master-slave algorithm so
    that the master also searches part of the global
    array (The slave rebellion Spartacus!)
  • Using "true" parallel I/O so that all processors
    write to the output file simultaneously instead
    of just one (the master)

82
Course Problem
  • But more remains to be done - especially in terms
    of how the program affects memory. And that is
    the last exercise for this course. The problem
    description is the same as the one given in
    Chapter 9 but you will modify the code you wrote
    using what you learned in this chapter.

83
Course Problem
  • Description
  • The initial problem implements a parallel search
    of an extremely large (several thousand elements)
    integer array. The program finds all occurrences
    of a certain integer, called the target, and
    writes all the array indices where the target was
    found to an output file. In addition, the program
    reads both the target value and all the array
    elements from an input file.
  • Exercise
  • Modify your code from Chapter 9 so that it uses
    dynamic memory allocation to use only the amount
    of memory it needs and only for as long as it
    needs it. Make both the arrays a and b ALLOCATED
    DYNAMICALLY and connect them to memory properly.
    You may also assume that the input data file
    "b.data" now has on its first line the number of
    elements in the global array b. The second line
    now has the target value. The remaining lines are
    the contents of the global array b.
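
A hedged sketch of the allocation pattern the exercise asks for: the master reads the element count from the first line of "b.data" and the target from the second, every rank allocates only the memory it needs, and everything is freed as soon as it is no longer required. Even division of the element count by the number of processes is an assumption.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, n = 0, target = 0, i;
        int *b = NULL;                            /* global array, master only   */
        int *a = NULL;                            /* local slice on every rank   */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {
            FILE *in = fopen("b.data", "r");
            fscanf(in, "%d", &n);                 /* line 1: number of elements  */
            fscanf(in, "%d", &target);            /* line 2: target value        */
            b = malloc(n * sizeof(int));          /* allocated only when needed  */
            for (i = 0; i < n; i++)
                fscanf(in, "%d", &b[i]);
            fclose(in);
        }

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&target, 1, MPI_INT, 0, MPI_COMM_WORLD);

        a = malloc((n / nprocs) * sizeof(int));   /* just the local slice        */
        MPI_Scatter(b, n / nprocs, MPI_INT,
                    a, n / nprocs, MPI_INT, 0, MPI_COMM_WORLD);

        /* ... search a[] for the target, compute the averages, write output ... */

        free(a);                                  /* release memory when done    */
        if (rank == 0)
            free(b);
        MPI_Finalize();
        return 0;
    }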