Chapter 4, CLR Textbook

Transcript and Presenter's Notes

1
Chapter 4, CLR Textbook
  • Algorithms on Rings of Processors

2
Algorithms on Rings of Processors
  • When using message passing, it is common to
    abstract away from the physical network and to
    choose a convenient logical network instead.
  • This chapter presents several algorithms
    intended for the logical ring network studied
    earlier.
  • Coverage of how logical networks map onto
    physical networks is deferred to Sections 4.6
    and 4.7.
  • A ring is a linear interconnection network.
  • Ideal for a first look at distributed memory
    algorithms
  • Each processor has a single predecessor and
    successor

3
Matrix-Vector Multiplication
  • The first unidirectional ring algorithm will be
    the multiplication y = Ax of an n×n matrix A by a
    vector x of dimension n. The sequential algorithm
    is (a C sketch appears at the end of this slide)
  • 1. for i = 0 to n-1 do
  • 2.   y[i] ← 0
  • 3.   for j = 0 to n-1 do
  • 4.     y[i] ← y[i] + A[i,j] × x[j]
  • Each iteration of the outer (i) loop computes the
    scalar product of one row of A with the vector x.
  • These scalar products can be performed in any
    order.
  • These scalar products will be distributed among
    the processors so that they can be computed in
    parallel.
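  • A minimal C sketch of this sequential pseudocode
    is shown below (row-major storage and the function
    name are assumptions for illustration)

    /* y = A x for an n x n matrix A stored row-major */
    void matvec(const double *A, const double *x, double *y, int n)
    {
        for (int i = 0; i < n; i++) {
            y[i] = 0.0;
            for (int j = 0; j < n; j++)
                y[i] += A[i * n + j] * x[j];  /* scalar product of row i with x */
        }
    }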

4
Matrix-Vector Multiplication (cont.)
  • We assume that n is divisible by p and let
    r = n/p.
  • Each processor must store r contiguous rows of
    matrix A and r scalar products.
  • This is called a block row.
  • The corresponding r components of the vector y
    and x are also stored with each processor.
  • Each processor Pq will then store
  • Rows qr to (q+1)r - 1 of matrix A (a block of
    dimension r×n)
  • Components qr to (q+1)r - 1 of vectors x and y.
  • For simplicity, we will ignore the case where n
    is not divisible by p.
  • However, this case can be handled by temporarily
    adding rows of zeros to the matrix and zeros to
    vector x so that the resulting number of rows is
    divisible by p.

5
Matrix-Vector Multiplication (cont.)
  • Declarations needed
  • var A: array[0..r-1, 0..n-1] of real
  • var x, y: array[0..r-1] of real
  • Then the local element A[0,0] corresponds to the
    global element A(0,0) on P0, but to A(r,0) on P1.
  • Note that the subscripts are global while the
    array indices are local.
  • Also, note that global index (i, j) corresponds
    to local index (i mod r, j) on processor Pk,
    where k = ⌊i/r⌋ (see the sketch at the end of
    this slide).
  • The next figure illustrates how the rows and
    vectors are partitioned among the processors.
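  • A small helper (hypothetical names) that computes
    the owner and the local index of a global row
    under this block-row distribution

    /* Global row i of the n x n matrix, block-row distribution with r = n/p */
    void global_to_local(int i, int r, int *owner, int *local_row)
    {
        *owner     = i / r;  /* processor Pk holding row i, k = floor(i/r) */
        *local_row = i % r;  /* row index inside that processor's block    */
    }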

6
(No Transcript)
7
Matrix-Vector Multiplication (cont.)
  • The partitioning of the data makes it possible to
    solve larger problems in parallel.
  • The parallel algorithm can solve a problem
    roughly p times larger than the sequential
    algorithm.
  • Algorithm 4.1 is given on the next slide.
  • For each loop in Algorithm 4.1, each processor Pq
    computes the scalar product of its r?r matrix
    with a vector of size r.
  • This is a partial result.
  • The values of the components of y assigned to Pq
    is obtained by adding all of these partial
    results together.

8
(No Transcript)
9
Matrix-Vector Multiplication (cont.)
  • In the first pass through the loop, the
    x-components in Pq are the ones originally
    assigned.
  • During each pass through the loop, Pq computes
    the scalar product of the appropriate part of its
    block of A with its current components of x.
  • Concurrently, each Pq sends its block of x-values
    to Pq+1 (mod p) and receives a new block of
    x-values from Pq-1 (mod p).
  • At the conclusion, each Pq has its original block
    of x-values, but has calculated the correct
    values for its y-components.
  • These steps are illustrated in Figure 4.2.
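  • A hedged MPI sketch of this ring algorithm is
    shown below. It is not the textbook's Algorithm
    4.1 verbatim: buffer and function names are
    assumed, and the shift of the x block is done
    with MPI_Sendrecv after each local product rather
    than overlapped with it.

    #include <mpi.h>
    #include <stdlib.h>

    /* y = A x on a ring: each of the p processes owns r = n/p contiguous
       rows of A (r x n, row-major) and the matching r entries of x and y;
       blocks of x circulate around the ring. */
    void ring_matvec(const double *A, const double *x, double *y,
                     int n, int r, int rank, int p, MPI_Comm comm)
    {
        double *cur = malloc(r * sizeof(double));  /* x block held this step */
        double *tmp = malloc(r * sizeof(double));
        for (int i = 0; i < r; i++) { y[i] = 0.0; cur[i] = x[i]; }

        for (int step = 0; step < p; step++) {
            /* the block currently held originated at process (rank - step) mod p */
            int owner = (rank - step + p) % p;
            for (int i = 0; i < r; i++)
                for (int j = 0; j < r; j++)
                    y[i] += A[i * n + owner * r + j] * cur[j];

            if (step < p - 1) {  /* shift x block: send to successor, receive from predecessor */
                MPI_Sendrecv(cur, r, MPI_DOUBLE, (rank + 1) % p, 0,
                             tmp, r, MPI_DOUBLE, (rank - 1 + p) % p, 0,
                             comm, MPI_STATUS_IGNORE);
                double *swap = cur; cur = tmp; tmp = swap;
            }
        }
        free(cur); free(tmp);
    }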

10
(No Transcript)
11
Analysis of Matrix-Vector Multiplication
  • There are p identical steps
  • Each step involves three activities: compute,
    send, and receive.
  • The times to send and to receive are identical
    and the two are concurrent, so the execution time
    is
  • T(p) = p · max(r²w, L + rb)
  • where w is the time to multiply one matrix
    element by one vector component and add the
    product to the running sum,
  • b is the inverse of the bandwidth, and L is the
    communication startup cost.
  • As r = n/p, the computation cost becomes
    asymptotically larger than the communication cost
    as n increases, since

  • r²w / (L + rb) = (n²w/p²) / (L + nb/p) → ∞
    (for n large)

12
Matrix-Vector Multiplication Analysis (cont)
  • Next, we calculate various metrics and their
    complexity
  • For large n,
  • T(p) ≈ p(r²w) = n²w/p, which is O(n²/p), or O(n²)
    if p is constant
  • The cost = (n²w/p) · p = n²w, or O(n²)
  • The speedup = ts/T(p) = cn² · (p/(n²w)) = (c/w)p,
    or O(p)
  • However, if p is constant/small, the speedup is
    only O(1)
  • The efficiency = ts/cost = cn²/(n²w) = c/w, or
    O(1)
  • Note: efficiency = ts/(p · Tp) is O(1)
  • Note that if vector x were duplicated across all
    processors, then there would be no need for any
    communication and parallel efficiency would be
    O(1) for all values of n.
  • However, there would be an increased memory cost

13
Matrix-Matrix Multiplication
  • Using matrix-vector multiplication, this is easy.
  • Let C = A × B, where all are n×n matrices
  • The multiplication consists of computing n²
    scalar products
  • for i = 0 to n-1 do
  •   for j = 0 to n-1 do
  •     C[i,j] ← 0
  •     for k = 0 to n-1 do
  •       C[i,j] ← C[i,j] + A[i,k] × B[k,j]
  • We will distribute the matrices over the p
    processors, giving the first processor the first
    r = n/p rows, etc.
  • Declaration
  • var A, B, C: array[0..r-1, 0..n-1] of real

14
(No Transcript)
15
Matrix-Matrix Multiplication Analysis
  • This algorithm is very similar to the one for
    matrix-vector multiplication
  • Scalar products are replaced by sub-matrix
    multiplication
  • Circular shifting of a vector is replaced by
    circular shifting of matrix rows
  • Analysis
  • Each step lasts as long as the longest of the
    three activities performed during the step:
    compute, send, and receive.
  • T(p) = p · max(nr²w, L + nrb)
  • As before, the asymptotic parallel efficiency is
    1 when n is large.
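  • A hedged MPI sketch of this ring algorithm is
    shown below (assumed names; the block of B rows
    circulates, and the shift again uses MPI_Sendrecv
    after the local product rather than overlapping
    with it)

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* C = A * B on a ring with block-row distribution: each process owns
       r = n/p rows of A, B and C (each block is r x n, row-major). */
    void ring_matmat(const double *A, const double *B, double *C,
                     int n, int r, int rank, int p, MPI_Comm comm)
    {
        double *cur = malloc(r * n * sizeof(double));  /* B block held this step */
        double *tmp = malloc(r * n * sizeof(double));
        memcpy(cur, B, r * n * sizeof(double));
        memset(C, 0, r * n * sizeof(double));

        for (int step = 0; step < p; step++) {
            int owner = (rank - step + p) % p;  /* global origin of cur's rows */
            for (int i = 0; i < r; i++)
                for (int k = 0; k < r; k++)     /* uses A columns owner*r .. owner*r+r-1 */
                    for (int j = 0; j < n; j++)
                        C[i * n + j] += A[i * n + owner * r + k] * cur[k * n + j];

            if (step < p - 1) {                 /* shift the B block around the ring */
                MPI_Sendrecv(cur, r * n, MPI_DOUBLE, (rank + 1) % p, 0,
                             tmp, r * n, MPI_DOUBLE, (rank - 1 + p) % p, 0,
                             comm, MPI_STATUS_IGNORE);
                double *swap = cur; cur = tmp; tmp = swap;
            }
        }
        free(cur); free(tmp);
    }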

16
Matrix-Matrix Multiplication Analysis
  • Naïve algorithm: matrix-matrix multiplication
    could be achieved by executing matrix-vector
    multiplication n times
  • Analysis of naïve algorithm
  • Execution time is just the time for matrix-vector
    multiplication, multiplied by n.
  • T'(p) = p · max(nr²w, nL + nrb)
  • The only difference between T and T' is that the
    term L has become nL
  • In the naïve approach, processors exchange
    vectors of size r, while in the algorithm
    developed in this section they exchange matrices
    of size r × n
  • This does not change asymptotic efficiency
  • However, sending data in bulk can significantly
    reduce the communications overhead.

17
Stencil Applications
  • Popular applications that operate on a discrete
    domain that consists of cells.
  • Each cell holds some value(s) and has neighbor
    cells.
  • The application repeatedly applies pre-defined
    rules to update the value(s) of each cell using
    the values of its neighbor cells.
  • The location of the neighbor cells and the
    function used to update cell values constitute a
    stencil that is applied to all cells in the
    domain.
  • These types of applications arise in many areas
    of science and engineering.
  • Examples include image processing, approximate
    solutions to differential equations, and
    simulation of complex cellular automata (e.g.,
    Conway's Game of Life)

18
A Simple Sequential Algorithm
  • We consider a stencil application on a 2D domain
    of size n×n.
  • Each cell has 8 neighbors, as shown below
  • NW N NE
  • W c E
  • SW S SE
  • The algorithm we consider updates the value of
    cell c based on the already updated values of its
    West and North neighbors.
  • The stencil is shown on the next slide and can be
    formalized as
  • c_new ← UPDATE(c_old, W_new, N_new)
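  • A minimal sequential C sketch of this sweep is
    shown below; the UPDATE rule is a placeholder
    (the real rule is application-specific), and NIL
    is represented here by NaN.

    #include <math.h>

    #define NIL NAN   /* assumed marker for "no neighbor" */

    /* placeholder rule: average of the available values */
    static double UPDATE(double c_old, double w_new, double n_new)
    {
        double sum = c_old; int cnt = 1;
        if (!isnan(w_new)) { sum += w_new; cnt++; }
        if (!isnan(n_new)) { sum += n_new; cnt++; }
        return sum / cnt;
    }

    /* one sweep over an n x n domain A (row-major); the W and N neighbors
       are already updated because of the row-major traversal order */
    void stencil_sweep(double *A, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double w = (j > 0) ? A[i * n + j - 1] : NIL;
                double north = (i > 0) ? A[(i - 1) * n + j] : NIL;
                A[i * n + j] = UPDATE(A[i * n + j], w, north);
            }
    }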

19
(No Transcript)
20
A Simple Sequential Algorithm (cont)
  • This simple stencil is similar to important
    applications
  • Gauss-Seidel numerical method algorithm
  • Smith-Waterman biological string comparison
    algorithm
  • This stencil cannot be applied directly to cells
    in the top row or left column.
  • These cells are handled by the update function.
  • To indicate that no neighbor exists for a cell
    update, we pass a Nil argument to UPDATE.

21
Greedy Parallel Algorithm for Stencil
  • Consider a ring of p processors, P0, P1, ...,
    Pp-1.
  • Must decide how to allocate cells among
    processors.
  • Need to balance computational load without
    creating overly expensive communications.
  • Assume initially that p is equal to n
  • We will allocate row i of domain A to the ith
    processor, Pi.
  • Declaration needed: var A: array[0..n-1] of real
  • As soon as Pi has computed a cell value, it sends
    that value to Pi+1 (0 ≤ i < p-1).
  • Initially, only A(0,0) can be computed
  • Once A(0,0) is computed, then A(1,0) and A(0,1)
    can be computed.
  • The computation proceeds in steps. At step k, all
    values on the k-th anti-diagonal are computed.

22
(No Transcript)
23
General Steps of Greedy Algorithm
  • At time i+j, processor Pi performs the following
    operations
  • It receives A(i-1,j) from Pi-1
  • It computes A(i,j)
  • Then it sends A(i,j) to Pi+1
  • Exceptions
  • P0 does not need to receive cell values to update
    its cells.
  • Pp-1 does not send its cell values after updating
    its cells.
  • Above exceptions do not influence algorithm
    performance.
  • This algorithm is captured in Algorithm 4.3 on
    next slide.
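  • A hedged MPI sketch of these steps (for the p = n
    case, one row per process) is shown below;
    function names and the placeholder UPDATE rule
    are assumptions, not the textbook's Algorithm 4.3.

    #include <mpi.h>

    /* Process i owns row i (array A of n cells). For each column j it
       receives the updated North value from its predecessor, updates its
       own cell, and forwards the new value to its successor. */
    void greedy_stencil_row(double *A, int n, int rank, int p, MPI_Comm comm)
    {
        for (int j = 0; j < n; j++) {
            double north = 0.0;
            int have_north = 0;
            if (rank > 0) {                    /* P0 has no predecessor */
                MPI_Recv(&north, 1, MPI_DOUBLE, rank - 1, j, comm,
                         MPI_STATUS_IGNORE);
                have_north = 1;
            }
            double west = (j > 0) ? A[j - 1] : 0.0;
            int have_west = (j > 0);
            /* placeholder UPDATE: average of the available values */
            A[j] = (A[j] + (have_west ? west : 0.0) + (have_north ? north : 0.0))
                   / (1 + have_west + have_north);
            if (rank < p - 1)                  /* Pp-1 has no successor */
                MPI_Send(&A[j], 1, MPI_DOUBLE, rank + 1, j, comm);
        }
    }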

24
(No Transcript)
25
Tracing Steps in Preceding Algorithm
  • Re-read pages 72-73 of CLR on send/receive for
    synchronous rings.
  • See slides 35-40, esp. 37-38, in the slides on
    synchronous networks
  • Steps 1-3 are performed by all processors.
  • All processors obtain an array A of n reals,
    their ID number, and the number of processors.
  • Steps 4-6 are performed only by P0.
  • In Step 5, P0 updates the cell A(0,0) in the NW
    (top-left) corner.
  • In Step 6, P0 sends the contents of A[0] (cell
    A(0,0)) to its successor, P1.
  • Steps 7-8 are executed only by P1, since it is
    the only processor receiving a message. (Note
    this is not a blocking receive, as a blocking
    receive would block all Pi for i > 1.)
  • In Step 8, P1 stores the updated value of A(0,0)
    received from P0 in variable v.
  • In Step 9, P1 uses the value in v to update the
    value in A[0] of its cell A(1,0).

26
Tracing Steps in Algorithm (cont)
  • Steps 12-13 are executed by P0 to update the
    value A[j] of its next cell A(0,j) in the top row
    and send its value to P1.
  • Steps 14-16 are executed only by Pn-1 on the
    bottom row to update the value A[j] of its next
    cell A(n-1,j).
  • This value will be used by Pn-1 to update its
    next cell in the next round.
  • Pn-1 does not send a value since its row is the
    last one.
  • Only Pi for 0 < i < n-1 can execute Steps 18-19.
  • In Step 18, the processors Pi executing Steps
    18-19 on the j-th loop iteration are further
    restricted to those receiving a message (i.e., a
    blocking receive)
  • In Step 18, Pi executes the send and the receive
    in parallel
  • In Step 19, Pi uses the received value to update
    the value A[j] of its next cell A(i,j).

27
Algorithm for Fewer Processors
  • Typically, we have far fewer processors than
    rows.
  • WLOG, assume p divides n.
  • If n/p rows are assigned to each processor, then
    at least n/p steps must occur before P0 can send
    a value to P1.
  • This situation repeats for each Pi and Pi1,
    severely restricting parallelism.
  • Instead, we assign rows to processors cyclically,
    with row j assigned to P(j mod p) (see the helper
    sketch at the end of this slide).
  • Each processor has the following declaration
  • var A: array[0..n/p-1, 0..n-1] of real
  • The local array stores its rows contiguously, but
    the corresponding global rows are not contiguous.
  • Algorithm 4.4 for the stencil application on a
    ring of processors using a cyclic data
    distribution is given next.
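  • Small helpers (assumed names) for this cyclic row
    distribution

    /* Global row j is owned by process j mod p and stored locally
       as row j / p (the mapping described above). */
    int cyclic_owner(int j, int p)     { return j % p; }
    int cyclic_local_row(int j, int p) { return j / p; }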

28
(No Transcript)
29
Cyclic Stencil Algorithm Execution Time
  • Let T(n,p) be the execution time for preceding
    algorithm.
  • We assume that receiving is blocking while
    sending is not.
  • The sending of a message in step k is followed by
    the reception of the message in step k+1.
  • The time needed to perform one algorithm step is
    w + b + L, where
  • the time needed to update a cell is w,
  • the time to communicate one cell value (the
    inverse bandwidth) is b, and
  • the startup cost is L.
  • The computation terminates when Pp-1 finishes
    computing the rightmost cell value of its last
    row of cells.

30
Cyclic Stencil Algorithm Run Time (cont)
  • The number of algorithm steps is p - 1 + n²/p
  • Pp-1 is idle for the first p-1 steps
  • Once Pp-1 starts computing, it computes a cell
    each step until the algorithm computation is
    completed.
  • There are n² cells, split evenly among the
    processors, so each processor is assigned n²/p
    cells
  • This yields
  • T(n,p) = (p - 1 + n²/p)(w + b + L)
  • Additional problem
  • The algorithm was designed to minimize the time
    between the computation of a cell update and its
    reception by the next processor
  • However, the algorithm performs many
    communications of small data items.
  • L can be orders of magnitude larger than b if the
    cell values are small.

31
Cyclic Stencil Algorithm Run Time (cont)
  • Stencil application characteristics
  • The cell value is often as small as a single
    integer or real number.
  • The computation to update a cell may involve only
    a few operations, so w may also be small.
  • For many computations, most of the execution time
    could be due to the L term in the equation for
    T(n,p).
  • Spending a large amount of time in communications
    overhead reduces the parallel efficiency
    considerably.
  • Note that Ep(n) = Tseq(n) / (p·Tpar(n))
    = n²w / (p·Tpar(n))
  • Ep(n) thus reduces to
  • Ep(n) = n²w / ((p(p-1) + n²)(w + b + L))
  • Note that even as n increases, this efficiency
    may remain well below 1 (it tends to
    w / (w + b + L), which is small when L dominates).

32
Augmenting Granularity of Algorithm
  • The communication overhead due to startup
    latencies can be decreased by sending fewer
    messages that are larger.
  • Let each processor compute k contiguous cell
    values in each row during each step, instead of
    just 1 value.
  • To simplify the analysis, we assume k divides n,
    so each row has n/k segments of k contiguous
    cells.
  • When k does not divide n, the last (incomplete)
    segment of a row can spill over into the next
    row; only the last segment of the last row may
    then have fewer than k elements.
  • With this algorithm, cell values are communicated
    in bulk, k at a time.

33
Augmenting Granularity of Algorithm (cont)
  • Effect of communicating k items in bulk on the
    algorithm
  • Larger values of k produce less communication
    overhead.
  • However, larger values of k increase the time
    between a cell value update and its reception in
    the next processor
  • In this algorithm, processors will start
    computing cell values later, leading to more idle
    time for processors.
  • This approach is illustrated in next diagram.

34
(No Transcript)
35
Block-Cyclic Allocation of Cells
  • A second way to reduce communication costs is to
    decrease the number of cell values that are being
    communicated.
  • This is done by allocating blocks of r
    consecutive rows to processors cyclically.
  • To simplify the analysis, we assume r·p divides
    n.
  • This idea of a block-cyclic allocation is very
    useful, and is illustrated below

36
Block-Cyclic Allocation of Cells (cont)
  • Each processor computes k contiguous cells in
    each row from a block of r rows.
  • At each step, each processor now computes r×k
    cells
  • Note: blocks are r×k (r rows, k columns) in size
  • Note: only the values on the edges of a block
    have to be sent to other processors.
  • This general approach can dramatically decrease
    the number of cells whose updates have to be sent
    to other processors.
  • The algorithm for this allocation is similar to
    those shown for the cyclic row assignment scheme
    in Figure 4.6,
  • Simply replace rows by blocks of rows.
  • A processor calculates all cell values in its
    first block of rows in n/k steps of the
    algorithm.
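  • Small helpers (assumed names) for this
    block-cyclic distribution of rows

    /* Blocks of r rows are dealt to processes cyclically: global row i
       sits in block i/r, owned by process (i/r) mod p, and is stored as
       row i mod r of local block (i/r)/p. */
    int bc_owner(int i, int r, int p)       { return (i / r) % p; }
    int bc_local_block(int i, int r, int p) { return (i / r) / p; }
    int bc_row_in_block(int i, int r)       { return i % r; }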

37
Block-Cyclic Allocations (cont)
  • Processor Pp-1 sends its k cell values to P0
    after p algorithm steps.
  • P0 needs these values to compute its second block
    of rows.
  • As a result, we need n ≥ kp in order to keep the
    processors busy.
  • If n > kp, then processors must temporarily store
    received cell values while they finish computing
    their block of rows for the previous step.
  • Recall processors only have to exchange data at
    the boundaries between blocks.
  • Using r rows per block, the amount of data
    communicated is r times smaller than the previous
    algorithm.

38
Block-Cyclic Allocations (cont)
  • Processor activities in computing a block
  • Receive k cell values from its predecessor
  • Compute k·r cell values
  • Send k cell values to its successor
  • Again we assume receives are blocking while
    sends are not.
  • The time required to perform one step of the
    algorithm is
  • krw + kb + L
  • The computation finishes when processor Pp-1
    finishes computing its rightmost segment in its
    last block of rows of cells.
  • Pp-1 computes one segment of a block row in each
    step

39
Optimizing Block-Cyclic Allocations
  • There are n²/(kr) such segments, so the p
    processors can compute them in n²/(pkr) steps
  • It takes p-1 algorithm steps before processor
    Pp-1 can start doing any computation.
  • Afterwards, Pp-1 will compute one segment at
    each step.
  • Overall, the algorithm runs for p - 1 + n²/(pkr)
    steps, with a total execution time of
  • T(n,p) = (p - 1 + n²/(pkr)) (krw + kb + L)
  • The efficiency of this algorithm is
  • E = n²w / (p · T(n,p))
    = n²w / (p (p - 1 + n²/(pkr)) (krw + kb + L))

40
Optimizing Block-Cyclic Allocations (cont)
  • For large n, this gives an asymptotic efficiency
    of
  • krw / (krw + kb + L)
  • Note that by increasing r and k, it is possible
    to achieve significantly higher efficiency.
  • However, increasing r and k also increases the
    time before updated cell values reach the next
    processor, increasing idle time, so there is a
    trade-off.
  • The text also outlines how to determine optimal
    values for k and r, using a fair amount of
    mathematics (a brute-force numerical search over
    the model is sketched below).
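  • As a hedged illustration (not the textbook's
    derivation), the predicted run time
    T(n,p) = (p - 1 + n²/(pkr)) (krw + kb + L) can
    simply be searched numerically for good k and r

    #include <stdio.h>

    /* execution-time model derived on the previous slides */
    double model_T(int n, int p, int k, int r, double w, double b, double L)
    {
        double steps = (p - 1) + (double)n * n / ((double)p * k * r);
        return steps * (k * r * w + k * b + L);
    }

    int main(void)
    {
        int n = 4096, p = 16;                    /* illustrative sizes      */
        double w = 1e-8, b = 1e-9, L = 1e-5;     /* illustrative parameters */
        int best_k = 1, best_r = 1;
        double best = model_T(n, p, 1, 1, w, b, L);
        for (int k = 1; k <= n; k *= 2)          /* brute-force power-of-two search */
            for (int r = 1; r * p <= n; r *= 2) {
                double t = model_T(n, p, k, r, w, b, L);
                if (t < best) { best = t; best_k = k; best_r = r; }
            }
        printf("best k = %d, r = %d, predicted T = %g s\n", best_k, best_r, best);
        return 0;
    }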

41
Implementing Logical Topologies
  • Designers of parallel algorithms should choose
    the logical topology.
  • In Section 4.5, switching from a unidirectional
    ring to a bidirectional ring made the program
    much simpler and lowered the communication time.
  • Message passing libraries, such as MPI, allow
    communication between any two processors using
    Send and Recv functions.
  • Using a logical topology restricts communications
    to only a few paths, which usually makes the
    algorithm design simpler.
  • The logical topology can be implemented by
    creating a set of functions that allows each
    processor to identify its neighbors.
  • A unidirectional ring only needs NextNode(P)
  • A bidirectional ring would also need
    PreviousNode(P), as in the sketch below
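  • Minimal ring-topology helpers corresponding to
    the functions named above (implementation
    assumed)

    /* processor numbers 0..p-1 arranged as a ring */
    int NextNode(int q, int p)     { return (q + 1) % p; }
    int PreviousNode(int q, int p) { return (q - 1 + p) % p; }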

42
Logical Topologies (cont)
  • Some systems (e.g., many modern supercomputers)
    provide multiple physical networks, but the
    creation of logical topologies is sometimes left
    to the user.
  • A difficult task is matching the logical topology
    to the physical topology for the application.
  • The common wisdom is that a logical topology that
    resembles the underlying physical topology should
    produce good performance.
  • Sometimes the reason for using a logical topology
    is to hide the complexity of the physical
    topology.
  • Often extensive benchmarking is required to
    determine the best topology for a given algorithm
    on a given platform.
  • The logical topologies studied in this chapter
    and the next are known to be useful in the
    majority of scenarios.

43
Distributed vs Centralized Implementations
  • In the CLR text, the data is already distributed
    among the processors at the start of the
    execution.
  • One may wonder how the data was distributed to
    the processors and whether that should also be
    part of the algorithm.
  • There are two approaches: distributed and
    centralized.
  • In the centralized approach, one assumes that the
    data resides in a single master location
  • A single processor
  • A file on a disk, if the data size is large.
  • The CLR book takes the distributed approach. The
    Akl book usually takes the distributed approach,
    but occasionally takes the centralized approach.

44
Distributed vs Centralized (cont)
  • An advantage of the centralized approach is that
    the library routine can choose the data
    distribution scheme to enforce.
  • The best performance requires that the choice for
    each algorithm consider its underlying topology.
  • This cannot be done in advance
  • Often the library developer will provide multiple
    versions with different data distributions
  • The user can then choose the version that best
    fits the underlying platform.
  • This choice may be difficult without extensive
    benchmarking.
  • The main disadvantage of the centralized approach
    appears when a user applies successive algorithms
    to the same data.
  • Data will be repeatedly distributed and
    un-distributed.
  • This causes most library developers to opt for
    the distributed option.

45
Summary of Algorithmic Principles (For
Asynchronous Message Passing)
  • Although illustrated here only for the ring
    topology, the principles below are general.
    Unfortunately, they often conflict with each
    other.
  • Sending data in bulk
  • Reduces communication overhead due to network
    latencies
  • Sending data early
  • Sending data as early as possible allows other
    processors to start computing as early as
    possible.

46
Summary of Algorithmic Principles (For
Asynchronous Message Passing) -- Continued --
  • Overlapping communication and computation
  • If both can be performed at the same time, the
    communication cost is often hidden
  • Block Data Distribution
  • Having processors assigned blocks of contiguous
    data elements reduces the amount of communication
  • Cyclic Data Distribution
  • Having data elements interleaved among processors
    makes it possible to reduce idle time and achieve
    a better load balance