1
High-Performance Grid Computing and Research Networking
Algorithms on a Ring (II)
Presented by Yuming Zhang (Minton)
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
2
Acknowledgements
  • The content of many of the slides in these lecture
    notes has been adapted from online resources
    prepared previously by the people listed below.
    Many thanks!
  • Henri Casanova
  • Principles of High Performance Computing
  • http://navet.ics.hawaii.edu/casanova
  • henric_at_hawaii.edu

3
Stencil Application
  • We've talked about stencil applications in the
    context of shared-memory programs
  • We found that we had to cut the matrix into small
    blocks
  • On a ring the same basic idea applies, but let's
    do it step-by-step

[Figure: the domain annotated with the wavefront of computation steps; cell (i,j) of iteration t+1 becomes computable at step i+j, so the computation sweeps the domain along anti-diagonals]

C(i,j)^(t+1) = F( C(i-1,j)^(t+1), C(i,j-1)^(t+1), C(i+1,j)^(t), C(i,j+1)^(t) )
4
40x40 Example with p=1
  • Recap: sequential code
  • Example with n=40 and p=1
  • Row after row
  • Wave front

5
3x3 Example with p=9
[Figure: a 3x3 domain, cells (0,0) through (2,2), with one cell assigned to each of the 9 processors]
6
3x3 Example with p=9
[Figure: the same 3x3 domain, showing the computation advancing step by step starting at Step 0]
7
Stencil Application
  • Let us, for now, consider that the domain is of
    size n x n and that we have p = n processors
  • Each processor is responsible for computing one
    row of the domain (at each iteration)
  • One first simple idea is to have each processor
    send each cell value to its neighbor as soon as
    that cell value is computed
  • Basic principle #1: do communication as early as
    possible to get your neighbors started as early
    as possible
  • Remember that one of the goals of a parallel
    program is to reduce idle time on the processors
  • We call this algorithm the Greedy algorithm, and
    seek an evaluation of its performance

8
3x3 Example with p=3
[Figure: the 3x3 domain with one row of cells per processor (p=3)]
9
3x3 Example with p=3
[Figure: the same distribution, with the computation advancing step by step from Step 0]
10
Greedy Algorithm with n = p

float C[n/p][n]        // for now n/p = 1
// code for only one iteration
my_rank ← rank()
p ← num_procs()
for (j=0; j<n; j++) {
  if (my_rank > 0) RECV(tmp,1); else tmp = -1;
  C[0][j] = update(j,tmp);           // implements the stencil
  if (my_rank < num_procs-1) SEND(&(C[0][j]),1);
}

  • We made a few assumptions about the
    implementation of the update function
  • At time i+j, processor Pi does three things:
  • it receives c(i-1,j) from P(i-1)
  • it computes c(i,j)
  • it sends c(i,j) to P(i+1)
  • Not technically true for P0 and P(p-1), but we
    don't care for performance analysis
    (an MPI sketch of this algorithm follows below)
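For concreteness, here is a minimal MPI sketch of this greedy, one-row-per-processor scheme, assuming n equals the number of ranks; the update() rule and the -1 boundary value are placeholders rather than the course's exact stencil F.

/* Sketch only: greedy stencil on a ring, one row per MPI rank (n == p).
 * update() is a placeholder for the stencil function F.               */
#include <mpi.h>
#include <stdlib.h>

static float update(float left, float up) {
    /* hypothetical stencil rule; the real F comes from the application */
    return 0.5f * (left + up);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int my_rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = p;                       /* one row of n cells per rank */
    float *C = calloc(n, sizeof *C);

    for (int j = 0; j < n; j++) {
        float tmp = -1.0f;           /* boundary value for the top row */
        if (my_rank > 0)             /* receive c(i-1,j) from the predecessor */
            MPI_Recv(&tmp, 1, MPI_FLOAT, my_rank - 1, j,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        C[j] = update(j > 0 ? C[j-1] : -1.0f, tmp);
        if (my_rank < p - 1)         /* send c(i,j) to the successor right away */
            MPI_Send(&C[j], 1, MPI_FLOAT, my_rank + 1, j, MPI_COMM_WORLD);
    }

    free(C);
    MPI_Finalize();
    return 0;
}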

11
Greedy Algorithm with n > p
  • This is all well and good, but really, we almost
    always have more rows in the domain than
    processors
  • Example with n=40 and p=10

[Figure: the 40x40 domain distributed over processors P0 through P9]
12
Greedy Algorithm with n > p
  • This is all well and good, but really, we almost
    always have more rows in the domain than
    processors
  • Example with n=40 and p=10
  • First algorithm:
  • P1 has to wait for 10 steps
  • P2 has to wait for 36 steps
  • P9 has to wait for 666 steps

[Figure: the corresponding schedule (processors P0 through P5 shown)]
13
Greedy Algorithm with n > p
  • This is all well and good, but really, we almost
    always have more rows in the domain than
    processors
  • Example with n=40 and p=10
  • First algorithm:
  • P1 has to wait for 10 steps
  • P2 has to wait for 36 steps
  • P9 has to wait for 666 steps
  • Second algorithm:
  • P1 has to wait for 4 steps
  • P2 has to wait for 8 steps
  • P9 has to wait for 36 steps

[Figure: the corresponding schedule (processors P0 through P5 shown)]
14
Greedy Algorithm
  • Let's assume that we have more rows in the domain
    than processors (a realistic assumption!)
  • The question is then: how do we allocate matrix
    rows to processors?
  • Similarly to what we saw in the shared-memory
    case, we use a cyclic (i.e., interleaved)
    distribution
  • Basic Principle #2: a cyclic distribution of data
    among processors is a good way to achieve good
    load balancing
  • Remember that in the mat-vec and mat-mat multiply
    we did not use a cyclic distribution
  • There was no such need there because all
    computations were independent and we could just
    assign blocks of rows to processors and everybody
    could compute
  • If we did this here, all processors would be idle
    while processor P0 computes its block of rows,
    and then all processors would be idle while P1
    computes its block of rows, etc. It would be a
    sequential execution!

15
Cyclic Greedy Algorithm
  • Assumption: n > p
  • n=40 and p=10

[Figure: cyclic assignment of the 40 rows to processors P0 through P9]
16
Cyclic Greedy Algorithm
  • Assumption: n > p
  • n=40 and p=10
  • With the cyclic distribution:
  • P1 waits 1 step
  • P2 waits 2 steps
  • P9 waits 9 steps
  • Pp-1 waits p-1 steps
  • It then needs to compute n²/p cells
  • It finishes after (p-1) + n²/p steps

[Figure: the cyclic schedule on processors P0 through P9]
17
Cyclic Greedy Algorithm
float C[n/p][n]        // n/p > 1 and is an integer!
// code for only one iteration
my_rank ← rank()
p ← num_procs()
for (i=0; i<n/p; i++)
  for (j=0; j<n; j++) {
    if (my_rank + i*p > 0) RECV(tmp,1); else tmp = -1;
    C[i][j] = update(i,j,tmp);
    if (my_rank + i*p < n-1) SEND(&(C[i][j]),1);
  }
(An MPI sketch of this cyclic version follows below.)
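As with the n = p case, here is a minimal MPI sketch under the same caveats (placeholder update(), artificial -1 boundary); the change is the outer loop over the n/p rows each rank owns cyclically, and the sends are non-blocking, matching the assumption made explicit on the next slide.

/* Sketch only: cyclic greedy stencil on a ring. Rank q owns global rows
 * q, q+p, q+2p, ...  update() is a placeholder for the stencil F.      */
#include <mpi.h>
#include <stdlib.h>

static float update(int i, int j, float up) {
    (void)i; (void)j;
    return up + 1.0f;                    /* hypothetical rule, not the real F */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int my_rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = 40;                          /* assume p divides n */
    int rows = n / p;
    float *C = calloc((size_t)rows * n, sizeof *C);
    MPI_Request *reqs = malloc((size_t)rows * n * sizeof *reqs);
    int nreq = 0;

    for (int i = 0; i < rows; i++) {     /* local row i is global row i*p + my_rank */
        int grow = i * p + my_rank;
        for (int j = 0; j < n; j++) {
            float tmp = -1.0f;
            if (grow > 0)                /* blocking receive from the row above */
                MPI_Recv(&tmp, 1, MPI_FLOAT, (my_rank + p - 1) % p,
                         grow * n + j, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            C[i*n + j] = update(grow, j, tmp);
            if (grow < n - 1)            /* non-blocking send to the row below */
                MPI_Isend(&C[i*n + j], 1, MPI_FLOAT, (my_rank + 1) % p,
                          (grow + 1) * n + j, MPI_COMM_WORLD, &reqs[nreq++]);
        }
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

    free(reqs); free(C);
    MPI_Finalize();
    return 0;
}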

18
Cyclic Greedy Algorithm
  • Let us compute the execution time of this
    algorithm, T(n,p)
  • Remember that n > p
  • We can assume that sending a message is done in a
    non-blocking fashion (while receiving is
    blocking)
  • Then, when a processor sends a message at step k
    of the algorithm, in parallel it receives the
    message for step k+1
  • This is a reasonable assumption because the
    message sizes are identical
  • Remember that in performance analysis we make
    simplifying assumptions, otherwise reasoning
    becomes overly complex
  • Therefore, at each algorithm step, processors
    (i.e., at least one) do:
  • Call update on one cell: takes time Ta
  • Ta = computation in Flops / machine speed in
    Flop/sec
  • Send/receive a cell value: takes time b + Tc
  • b = communication start-up latency
  • Tc = cell value size (in bytes) / network bandwidth
  • Each step lasts Ta + b + Tc

19
Cyclic Greedy Algorithm
  • Each step takes time Ta + b + Tc
  • How many steps are there?
  • We've done this for the shared-memory version
    (sort of)
  • It takes p-1 steps before processor Pp-1 can
    start computing
  • Then it computes n²/p cells
  • Therefore there are (p-1) + n²/p steps
  • T(n,p) = ((p-1) + n²/p) (Ta + Tc + b)
  • This formula points to a big problem: a large
    component of the execution time is caused by the
    communication start-up time b
  • In practice b can be as large as or larger than
    Ta + Tc!
  • The reason is that we send many small messages
  • A bunch of SEND(...,1)!
  • Therefore we can fix it by sending larger
    messages
  • What we need to do is increase the granularity of
    the algorithm
  • Basic Principle #3: sending large messages
    reduces communication overhead
  • Conflicts with principle #1

20
Higher Granularity
  • As opposed to sending a cell value as soon as it
    is computed, compute k cell values and send them
    all in bulk
  • We assume that k divides n
  • As opposed to having each processor hold n/p
    non-contiguous rows of the domain, have each
    processor hold blocks of r consecutive rows
  • We assume that p*r divides n

21
Higher Granularity
  • Assumption: n > p
  • n=40 and p=10
  • r=2 and k=5

[Figure: the domain split into blocks of r=2 consecutive rows per processor (P0..P9), each row sent in chunks of k=5 cells]
22
Idle processors?
  • In the previous picture, it may be that, after it
    finishes computing its first block row, processor
    P0 has to wait for data from Pp-1
  • Processor P0 computes its first block row in n/k
    algorithm steps
  • Processor Pp-1 computes the first sub-block of its
    first block row after p algorithm steps
  • Therefore, P0 is not idle if n/k > p, i.e., n > kp
  • If n < kp: idle time, which is not a good idea
  • If n > kp: processors need to receive and store
    values for a while before being able to use them

23
Higher Granularity
  • Assumption: n > p
  • n=40 and p=10
  • r=2 and k=5
  • n < kp
  • 40 < 5x10

[Figure: the resulting schedule, with idle time since n < kp]
24
Cyclic Greedy Algorithm
  • Assumption: n > p
  • n=40 and p=10
  • r=1 and k=1
  • n > kp
  • 40 > 1x10

[Figure: the cyclic greedy schedule on P0..P9 (the k=1, r=1 special case)]
25
Higher Granularity
  • Assumption: n > p
  • n=40 and p=10
  • r=2 and k=4
  • n = kp
  • 40 = 4x10

[Figure: the schedule when n = kp; messages of k cells, blocks of r consecutive rows per processor]
26
Higher Granularity
  • Assumption: n > p
  • n=40 and p=10
  • r=4 and k=5
  • n < kp
  • 40 < 5x10

[Figure: the schedule with larger blocks (r=4 rows per processor, messages of k=5 cells)]
27
Performance Analysis
  • Very similar to what we did before
  • Each step takes time krTa + kTc + b
  • time to compute k*r cells
  • time to send k cells
  • There are (p-1) + n²/(pkr) steps
  • p-1 steps before processor Pp-1 can start any
    computation
  • Then Pp-1 has to compute (n²/(kr))/p blocks
  • T(n,p,k,r) = ((p-1) + n²/(pkr)) (krTa + kTc + b)
  • Compare to T(n,p) = ((p-1) + n²/p) (Ta + Tc + b)
  • We traded off Principle #1 for Principle #3,
    which is probably better in practice
  • Values of k and r can be experimented with to
    find what works best in practice (a small search
    sketch follows below)
  • Note that optimal values can be computed
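As a toy illustration of "experimenting with k and r", here is a small C sketch that evaluates the model T(n,p,k,r) = ((p-1) + n²/(pkr))(krTa + kTc + b) over admissible (k, r) pairs and reports the best one; the numeric values of Ta, Tc and b below are made-up placeholders, not measurements.

/* Sketch: brute-force search of the performance model for good (k, r). */
#include <stdio.h>

int main(void) {
    const int n = 1024, p = 16;
    const double Ta = 1e-9, Tc = 4e-9, b = 1e-6;   /* placeholder machine parameters */

    double best_T = -1.0;
    int best_k = 1, best_r = 1;

    for (int r = 1; p * r <= n; r++) {
        if (n % (p * r) != 0) continue;            /* need p*r to divide n */
        for (int k = 1; k <= n; k++) {
            if (n % k != 0) continue;              /* need k to divide n */
            double steps = (p - 1) + (double)n * n / ((double)p * k * r);
            double step_time = k * r * Ta + k * Tc + b;
            double T = steps * step_time;
            if (best_T < 0 || T < best_T) { best_T = T; best_k = k; best_r = r; }
        }
    }
    printf("best k=%d, r=%d, predicted T=%g s\n", best_k, best_r, best_T);
    return 0;
}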

28
Solving Linear Systems of Eq.
  • Methods for solving linear systems
  • The need to solve linear systems arises in an
    estimated 75% of all scientific computing
    problems [Dahlquist 1974]
  • Gaussian Elimination is perhaps the most
    well-known method
  • It is based on the fact that the solution of a linear
    system is invariant under scaling and under row
    additions
  • One can multiply a row of the matrix by a
    constant as long as one multiplies the
    corresponding element of the right-hand side by
    the same constant
  • One can add a row of the matrix to another one as
    long as one adds the corresponding elements of
    the right-hand side
  • Idea: scale and add equations so as to transform
    matrix A into an upper triangular matrix

[Figure: A x = b transformed into an upper triangular system; in the triangular system, equation n-i has i unknowns]
29
Gaussian Elimination
[ 1  1  1 ]       [ 0 ]
[ 1 -2  2 ]  x  = [ 4 ]
[ 1  2 -1 ]       [ 2 ]

Subtract row 1 from rows 2 and 3:

[ 1  1  1 ]       [ 0 ]
[ 0 -3  1 ]  x  = [ 4 ]
[ 0  1 -2 ]       [ 2 ]

Multiply row 3 by 3 and add row 2:

[ 1  1  1 ]       [ 0 ]
[ 0 -3  1 ]  x  = [ 4 ]
[ 0  0 -5 ]       [ 10 ]

Solving the equations in reverse order (back-solving):
-5x3 = 10       =>  x3 = -2
-3x2 + x3 = 4   =>  x2 = -2
x1 + x2 + x3 = 0  =>  x1 = 4
30
Gaussian Elimination
  • The algorithm goes through the matrix from the
    top-left corner to the bottom-right corner
  • The i-th step eliminates the non-zero sub-diagonal
    elements in column i, subtracting the i-th row
    scaled by a_ji/a_ii from row j, for j = i+1, ..., n

[Figure: at step i, the pivot row i and, below it, the elements to be zeroed; values above and to the left are already computed, values below and to the right are yet to be updated]
31
Sequential Gaussian Elimination
  • Simple sequential algorithm:
  • // for each column i
  • // zero it out below the diagonal by adding
  • // multiples of row i to later rows
  • for i = 1 to n-1
  •   // for each row j below row i
  •   for j = i+1 to n
  •     // add a multiple of row i to row j
  •     for k = i to n
  •       A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
  • Several tricks that do not change the spirit of
    the algorithm but make implementation easier
    and/or more efficient:
  • The right-hand side is typically kept in column n+1
    of the matrix, and one then speaks of an augmented
    matrix
  • Compute the A(j,i)/A(i,i) term outside of the inner
    loop (this also keeps the update correct, since
    A(j,i) is overwritten when k = i); see the C sketch
    below
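Here is a hedged C sketch of this sequential elimination on an augmented matrix (the right-hand side stored as an extra column), with the scaling factor hoisted out of the inner loop; it assumes non-zero pivots, i.e., no pivoting yet.

/* Sketch: sequential Gaussian elimination on an augmented N x (N+1) matrix.
 * A[i][N] holds the right-hand side. Assumes all pivots are non-zero.      */
#include <stdio.h>

#define N 3

void gaussian_elimination(double A[N][N + 1]) {
    for (int i = 0; i < N - 1; i++) {           /* for each pivot column i  */
        for (int j = i + 1; j < N; j++) {       /* for each row below row i */
            double factor = A[j][i] / A[i][i];  /* hoisted scaling factor   */
            for (int k = i; k <= N; k++)        /* includes the RHS column  */
                A[j][k] -= factor * A[i][k];
        }
    }
}

void back_solve(double A[N][N + 1], double x[N]) {
    for (int i = N - 1; i >= 0; i--) {
        x[i] = A[i][N];
        for (int k = i + 1; k < N; k++) x[i] -= A[i][k] * x[k];
        x[i] /= A[i][i];
    }
}

int main(void) {
    /* the 3x3 example from the earlier slide */
    double A[N][N + 1] = {{1, 1, 1, 0}, {1, -2, 2, 4}, {1, 2, -1, 2}};
    double x[N];
    gaussian_elimination(A);
    back_solve(A, x);
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);   /* expect (4, -2, -2) */
    return 0;
}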

32
Pivoting Motivation
[ 0 1 ]
[ 1 1 ]
  • A few pathological cases (e.g., the matrix above,
    whose first pivot is zero)
  • Division by small numbers → round-off error in
    computer arithmetic
  • Consider the following system:
  • 0.0001 x1 + x2 = 1.000
  • x1 + x2 = 2.000
  • exact solution: x1 = 1.00010... and x2 = 0.99990...
  • say we round off after 3 digits after the decimal
    point
  • Multiply the first equation by 10^4 and subtract
    it from the second equation:
  • (1 - 1) x1 + (1 - 10^4) x2 = 2 - 10^4
  • But, in finite precision with only 3 digits:
  • 1 - 10^4 = -0.9999 x 10^4 ≈ -0.999 x 10^4
  • 2 - 10^4 = -0.9998 x 10^4 ≈ -0.999 x 10^4
  • Therefore, x2 = 1 and x1 = 0 (from the first
    equation)
  • Very far from the real solution!
33
Partial Pivoting
  • One can just swap rows:
  • x1 + x2 = 2.000
  • 0.0001 x1 + x2 = 1.000
  • Multiplying the first equation by 0.0001 and
    subtracting it from the second equation gives:
  • (1 - 0.0001) x2 = 1 - 0.0001
  • 0.9999 x2 = 0.9999  =>  x2 = 1
  • and then x1 = 1
  • The final solution is much closer to the real
    solution. (Magical!)
  • Partial Pivoting:
  • For numerical stability, one doesn't go in order,
    but picks the next row among rows i to n that has
    the largest element in column i
  • This row is swapped with row i (along with the
    corresponding elements of the right-hand side)
    before the subtractions
  • the swap is not done in memory but rather one
    keeps an indirection array (sketched below)
  • Total Pivoting:
  • Look for the greatest element ANYWHERE in the
    matrix
  • Swap columns
  • Swap rows
  • Numerical stability is really a difficult field
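The "indirection array" idea can be sketched as follows: keep a permutation perm[] of row indices, pick the pivot by scanning |A[perm[j]][i]| down column i, and swap entries of perm instead of moving rows. This is an illustrative sketch of the technique, not the exact bookkeeping of any particular library.

/* Sketch: choose the partial pivot for column i using an indirection
 * array perm[], so rows are "swapped" without moving data in memory.  */
#include <math.h>

/* Returns 0 on success, -1 if the column is (numerically) all zero. */
int pivot_column(int n, double **A, int *perm, int i) {
    int best = i;
    for (int j = i + 1; j < n; j++)            /* largest |a_ji| in column i */
        if (fabs(A[perm[j]][i]) > fabs(A[perm[best]][i]))
            best = j;
    if (A[perm[best]][i] == 0.0) return -1;
    int tmp = perm[i]; perm[i] = perm[best]; perm[best] = tmp;  /* logical swap */
    return 0;
}

Call pivot_column(n, A, perm, i) at the top of step i (with perm initialized to 0..n-1); afterwards, always access rows through the indirection, e.g. A[perm[j]][k].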

34
Parallel Gaussian Elimination?
  • Assume that we have one processor per matrix
    element

[Figure: the communication/computation pattern of one elimination step]
  • Reduction: to find the max |a_ji| (needed to compute
    the scaling factors)
  • Compute: independent computation of the scaling
    factors
  • Broadcasts: every update needs the scaling factor
    and the element from the pivot row
  • Compute: independent computations (the updates)
35
LU Factorization
  • Gaussian Elimination is simple but:
  • What if we have to solve many Ax = b systems for
    different values of b?
  • This happens a LOT in real applications
  • Another method is LU Factorization
  • Ax = b
  • Say we could rewrite A = L U, where L is a lower
    triangular matrix and U is an upper triangular
    matrix: O(n³)
  • Then Ax = b is written L U x = b
  • Solve L y = b: O(n²)
  • Solve U x = y: O(n²)

[Figure: triangular system solves are easy; in a lower triangular system equation i has i unknowns, in an upper triangular system equation n-i has i unknowns — a sketch of the two solves follows below]
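To make "triangular system solves are easy" concrete, here is a hedged C sketch of the two O(n²) solves, assuming L has a unit diagonal (which is what the factorization on the next slides produces) and U has non-zero diagonal entries.

/* Sketch: forward substitution (L y = b, unit-diagonal L) followed by
 * back substitution (U x = y). Each solve is O(n^2).
 * L and U are n x n, stored row-major as flat arrays: L[i*n + j].     */
void forward_subst(int n, const double *L, const double *b, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = b[i];
        for (int j = 0; j < i; j++)        /* equation i has i earlier unknowns */
            y[i] -= L[i*n + j] * y[j];
        /* diagonal of L is 1, so no division needed */
    }
}

void back_subst(int n, const double *U, const double *y, double *x) {
    for (int i = n - 1; i >= 0; i--) {
        x[i] = y[i];
        for (int j = i + 1; j < n; j++)
            x[i] -= U[i*n + j] * x[j];
        x[i] /= U[i*n + i];
    }
}

Given the L and U produced below, solving Ax = b for a new b is just forward_subst followed by back_subst.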
36
LU Factorization Principle
  • It works just like Gaussian Elimination, but
    instead of zeroing out elements, one saves the
    scaling coefficients.
  • Magically, A = L x U !
  • Should be done with pivoting as well

A:
[ 1  2 -1 ]
[ 4  3  1 ]
[ 2  2  3 ]

Gaussian elimination on column 1, saving the scaling factors in place of the zeros:
[ 1   2  -1 ]
[ 4  -5   5 ]
[ 2  -2   5 ]

Gaussian elimination on column 2, saving the scaling factor:
[ 1   2   -1 ]
[ 4  -5    5 ]
[ 2  2/5   3 ]

This packs together:
U:                        L:
[ 1  2 -1 ]               [ 1   0   0 ]
[ 0 -5  5 ]               [ 4   1   0 ]
[ 0  0  3 ]               [ 2  2/5  1 ]
37
LU Factorization
  • We're going to look at the simplest possible
    version
  • No pivoting: it just creates a bunch of indirections
    that are easy but make the code look complicated
    without changing the overall principle

LU-sequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      a(i,k) ← -a(i,k) / a(k,k)        // stores the scaling factors
    for j = k+1 to n-1
      // Task T(k,j): update of column j
      for i = k+1 to n-1
        a(i,j) ← a(i,j) + a(i,k) * a(k,j)
  }
}

[Figure: after preparation, column k holds the scaling factors below the diagonal]
38
LU Factorization
  • We're going to look at the simplest possible
    version
  • No pivoting: it just creates a bunch of indirections
    that are easy but make the code look complicated
    without changing the overall principle
    (a C sketch follows below)

LU-sequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      a(i,k) ← -a(i,k) / a(k,k)
    for j = k+1 to n-1
      // Task T(k,j): update of column j
      for i = k+1 to n-1
        a(i,j) ← a(i,j) + a(i,k) * a(k,j)
  }
}

[Figure: task T(k,j) updates the elements of column j in rows i = k+1, ..., n-1]
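A hedged C sketch of this sequential, no-pivoting LU: following the pseudocode, the negated scaling factors -a(i,k)/a(k,k) are stored in place below the diagonal, so after the loop the array packs U (upper triangle) together with the scaling information for L.

/* Sketch: sequential LU factorization without pivoting, mirroring the
 * LU-sequential pseudocode. The strictly lower part of L is the NEGATED
 * stored factors (L has a unit diagonal).                               */
#include <stdio.h>

#define N 3

void lu_sequential(double a[N][N]) {
    for (int k = 0; k <= N - 2; k++) {
        for (int i = k + 1; i < N; i++)          /* prepare column k */
            a[i][k] = -a[i][k] / a[k][k];
        for (int j = k + 1; j < N; j++)          /* task T(k,j): update column j */
            for (int i = k + 1; i < N; i++)
                a[i][j] = a[i][j] + a[i][k] * a[k][j];
    }
}

int main(void) {
    /* the example from the "LU Factorization Principle" slide */
    double a[N][N] = {{1, 2, -1}, {4, 3, 1}, {2, 2, 3}};
    lu_sequential(a);
    for (int i = 0; i < N; i++)
        printf("%8.3f %8.3f %8.3f\n", a[i][0], a[i][1], a[i][2]);
    /* expect:  1  2 -1 / -4 -5  5 / -2 -0.4  3  (factors stored negated) */
    return 0;
}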
39
Parallel LU on a ring
  • Since the algorithm operates by columns from left
    to right, we should distribute columns to
    processors
  • Principle of the algorithm:
  • At each step, the processor that owns column k
    does the prepare task and then broadcasts the
    bottom part of column k to all others
  • Annoying if the matrix is stored in row-major
    fashion
  • Remember that one is free to store the matrix
    any way one wants, as long as it is consistent and
    the right output is generated
  • After the broadcast, the other processors can
    then update their data.
  • Assume there is a function alloc(k) that returns
    the rank of the processor that owns column k
  • Basically so that we don't clutter our program
    with too many global-to-local index translations
  • In fact, we will first write everything in terms
    of global indices, so as to avoid all the annoying
    index arithmetic

40
LU-broadcast algorithm
LU-broadcast(A,n) {
  q ← rank()
  p ← numprocs()
  for k = 0 to n-2 {
    if (alloc(k) == q)
      // preparing column k
      for i = k+1 to n-1
        buffer[i-k-1] ← a(i,k) ← -a(i,k) / a(k,k)
    broadcast(alloc(k), buffer, n-k-1)
    for j = k+1 to n-1
      if (alloc(j) == q)
        // update of column j
        for i = k+1 to n-1
          a(i,j) ← a(i,j) + buffer[i-k-1] * a(k,j)
  }
}
(An MPI sketch of this broadcast version follows below.)
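Here is a hedged MPI sketch of this broadcast version, assuming the cyclic alloc(k) = k mod p introduced later, columns stored locally (column j at local index j/p on rank j%p), p dividing n, and a toy diagonally dominant matrix so that skipping pivoting is harmless.

/* Sketch: LU-broadcast on p ranks, columns dealt cyclically.
 * Local storage: a[i*r + jl] holds element (i, j) with j = jl*p + q. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int q, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &q);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = 8;                                   /* toy size; assume p divides n */
    int r = n / p;                               /* local columns per rank */
    double *a = malloc((size_t)n * r * sizeof *a);
    double *buffer = malloc((size_t)n * sizeof *buffer);

    for (int i = 0; i < n; i++)                  /* diagonally dominant test matrix */
        for (int jl = 0; jl < r; jl++) {
            int j = jl * p + q;                  /* global column index */
            a[i*r + jl] = (i == j) ? n : 1.0 / (1 + i + j);
        }

    for (int k = 0; k <= n - 2; k++) {
        int root = k % p, kl = k / p;
        if (root == q)                           /* prepare column k */
            for (int i = k + 1; i < n; i++)
                buffer[i-k-1] = a[i*r + kl] = -a[i*r + kl] / a[k*r + kl];
        MPI_Bcast(buffer, n - k - 1, MPI_DOUBLE, root, MPI_COMM_WORLD);
        for (int jl = 0; jl < r; jl++) {         /* update my columns j > k */
            int j = jl * p + q;
            if (j <= k) continue;
            for (int i = k + 1; i < n; i++)
                a[i*r + jl] += buffer[i-k-1] * a[k*r + jl];
        }
    }

    free(a); free(buffer);
    MPI_Finalize();
    return 0;
}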

41
Dealing with local indices
  • Assume that p divides n
  • Each processor needs to store r = n/p columns and
    its local indices go from 0 to r-1
  • After step k, only columns with indices greater
    than k will be used
  • Simple idea: use a local index, l, that everyone
    initializes to 0
  • At step k, processor alloc(k) increases its local
    index so that next time it will point to its next
    local column

42
LU-broadcast algorithm
...
double a[n-1][r-1]
q ← rank()
p ← numprocs()
l ← 0
for k = 0 to n-2 {
  if (alloc(k) == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a(i,l) ← -a(i,l) / a(k,l)
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a(i,j) ← a(i,j) + buffer[i-k-1] * a(k,j)
}
43
What about the Alloc function?
  • One thing we have left completely unspecified is
    how to write the alloc function: how are columns
    distributed among processors?
  • There are two complications:
  • The amount of data to process varies throughout
    the algorithm's execution
  • At step k, columns k+1 to n-1 are updated
  • Fewer and fewer columns to update
  • The amount of computation varies among columns
  • e.g., column n-1 is updated more often than
    column 2
  • Holding columns on the right of the matrix leads
    to much more work
  • There is a strong need for load balancing
  • All processes should do the same amount of work
    (a cyclic alloc is sketched below)
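Anticipating the cyclic distribution justified on the next slides, a one-line choice for the alloc helper assumed by the pseudocode is to deal columns out round-robin (written here, for illustration, with p as an explicit parameter):

/* Sketch: cyclic (round-robin) column-to-processor mapping, so column k
 * lives on rank k % p, at local index k / p.                           */
int alloc(int k, int p)       { return k % p; }   /* owning rank           */
int local_index(int k, int p) { return k / p; }   /* position on that rank */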

44
Bad load balancing
[Figure: a block distribution of columns over P1..P4; the columns on the left are already done, so only the processor holding the "working on it" block is busy]
45
Good Load Balancing?
[Figure: with a cyclic distribution of columns, the "already done", "working on it", and remaining columns are spread over all processors]
46
Proof that load balancing is good
  • The computation consists of two types of
    operations:
  • column preparations
  • matrix element updates
  • There are many more updates than preparations, so
    we really care about good balancing of the
    updates
  • Consider column j
  • Let's count the number of updates performed by
    the processor holding column j
  • Column j is updated at steps k = 0, ..., j-1
  • At step k, elements i = k+1, ..., n-1 are updated
  • (indices start at 0)
  • Therefore, at step k, the update of column j
    entails n-k-1 updates
  • The total number of updates for column j over the
    whole execution is the sum of n-k-1 for
    k = 0, ..., j-1
47
Proof that load balancing is good
  • Consider processor Pi, which holds columns l*p + i
    for l = 0, ..., n/p - 1
  • Processor Pi needs to perform this many updates:
    the sum, over its columns, of the per-column count
    above
  • It turns out this can be computed (sketched below):
  • separate the terms
  • use the formulas for sums of integers and sums of
    squares
  • What it all boils down to is roughly n³/(3p)
  • This does not depend on i
  • Therefore it is (asymptotically) the same for all
    processors Pi
  • Therefore we have (asymptotically) perfect load
    balancing!
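A hedged sketch of the sum, keeping only the leading terms (exact lower-order terms omitted):

\text{updates on } P_i \;=\; \sum_{l=0}^{n/p-1} \sum_{k=0}^{lp+i-1} (n-k-1)
\;\approx\; \sum_{l=0}^{n/p-1} \Big( n\,lp - \tfrac{(lp)^2}{2} \Big)
\;\approx\; \frac{n^3}{2p} - \frac{n^3}{6p}
\;=\; \frac{n^3}{3p}

which is independent of i, as claimed.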

48
Load-balanced program
...
double a[n-1][r-1]
q ← rank()
p ← numprocs()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a(i,l) ← -a(i,l) / a(k,l)
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a(i,j) ← a(i,j) + buffer[i-k-1] * a(k,j)
}
49
Performance Analysis
  • How long does this code take to run?
  • This is not an easy question because there are
    many tasks and many communications
  • A little bit of analysis shows that the execution
    time is the sum of three terms:
  • n-1 communications: n*b + (n²/2)*Tcomm + O(1)
  • n-1 column preparations: (n²/2)*Tcomp + O(1)
  • column updates: (n³/3p)*Tcomp + O(n²)
  • Therefore, the execution time is ~ (n³/3p)*Tcomp
  • Note that the sequential time is (n³/3)*Tcomp
  • Therefore, we have perfect asymptotic efficiency!
  • once again
  • This is good, but isn't always the best in
    practice
  • How can we improve this algorithm?
50
Pipelining on the Ring
  • So far, the algorithm we've seen uses a simple
    broadcast
  • Nothing was specific to being on a ring of
    processors, and it's portable
  • in fact you could just write raw MPI that looks
    like our pseudo-code and have a very limited LU
    factorization that is inefficient for small n and
    works only for some numbers of processors
  • But it's not efficient
  • The n-1 communication steps are not overlapped
    with computations
  • Therefore Amdahl's law, etc.
  • It turns out that on a ring, with a cyclic
    distribution of the columns, one can interleave
    pieces of the broadcast with the computation
  • It almost looks like inserting the source code of
    the broadcast we saw at the very beginning
    throughout the LU code

51
Previous program
...
double a[n-1][r-1]
q ← rank()
p ← numprocs()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a(i,l) ← -a(i,l) / a(k,l)
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a(i,j) ← a(i,j) + buffer[i-k-1] * a(k,j)
}

52
LU-pipeline algorithm
double a[n-1][r-1]
q ← rank()
p ← numprocs()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a(i,l) ← -a(i,l) / a(k,l)
    l ← l+1
    send(buffer, n-k-1)
  } else {
    recv(buffer, n-k-1)
    if (q ≠ k-1 mod p) send(buffer, n-k-1)
  }
  for j = l to r-1
    for i = k+1 to n-1
      a(i,j) ← a(i,j) + buffer[i-k-1] * a(k,j)
}

53
Why is it better?
  • During a broadcast, the root's successor just sits
    idle while the message goes along the ring
  • This is partially because of the way we have
    implemented the broadcast
  • With a better broadcast on a general topology the
    wait may be smaller
  • But there is still a wait
  • What we have done is allow each processor to move
    on to other business after receiving and
    forwarding the message
  • This is possible by writing the code with just
    sends and receives
  • More complicated, more efficient: the usual
    trade-off
  • Let's look at an (idealized) time-line

54
[Figure: idealized time-line of the first four stages (k = 0..3) of the pipelined algorithm on four processors. Each processor does Recv(k), forwards the data with Send(k) as soon as it receives it, then performs all of its Update(k,j) tasks before Prep(k+1). Some communication occurs in parallel with computation.]
55
Can we do better?
  • In the previous algorithm, a processor does all
    its updates before doing the Prep() computation
    that then leads to a communication
  • But in fact, some of these updates can be done
    later
  • Idea: send out the pivot as soon as possible
  • Example:
  • In the previous algorithm:
  • P1: Receive(0), Send(0)
  • P1: Update(0,1), Update(0,5), Update(0,9),
    Update(0,13)
  • P1: Prep(1)
  • P1: Send(1)
  • ...
  • In the new algorithm:
  • P1: Receive(0), Send(0)
  • P1: Update(0,1)
  • P1: Prep(1)
  • P1: Send(1)
  • P1: Update(0,5), Update(0,9), Update(0,13)
  • ...

56
[Figure: idealized time-line of the first four stages with the look-ahead ordering. A processor still sends out data as soon as it receives it, but now performs only the one update needed for the next Prep(k+1) before sending column k+1, and defers the remaining Update(k,j) tasks; many more communications occur in parallel with computation.]
57
LU-look-ahead algorithm
q ← rank()
p ← numprocs()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p) {
    Prep(k)
    Send(buffer, n-k-1)
    for all j ≡ k mod p, j > k: Update(k-1,j)
    for all j ≡ k mod p, j > k: Update(k,j)
  } else {
    Recv(buffer, n-k-1)
    if (q ≠ k-1 mod p) then Send(buffer, n-k-1)
    if (q == k+1 mod p) then
      Update(k, k+1)
    else
      for all j ≡ q mod p, j > k: Update(k,j)
  }
}

58
Further improving performance
  • One can use local overlap of communication and
    computation
  • multi-threading, a good non-blocking MPI
    implementation, etc.
  • There is much more to be said about parallel LU
    factorization
  • Many research articles
  • Many libraries available
  • It's a good example of an application for which
    one can think hard about operation orderings and
    try to find improved sequences
  • The basic principle is always the same: send
    things as early as possible
  • The modified principle: send things as early as
    required, but not earlier! You can avoid extra
    communication overhead by sending fewer, longer
    messages.