1
Introduction to Parallel Programming (Message Passing)
Francisco Almeida (falmeida@ull.es)
Parallel Computing Group
2
Beowulf Computers

Distributed Memory

COTS: Commercial-Off-The-Shelf computers

4
The Parallel Model
Computational Models: PRAM, BSP, LogP
Programming Models: PVM, MPI, HPF, Threads, OpenMP
Architectural Models: Parallel Architectures
5
The Message Passing Model
Send(parameters) Recv(parameters)
6
Network of Workstations
Hardware

Distributed Memory
Non Shared Memory Space
Star Topology

Sun SPARC Ultra 1, 143 MHz
Etherswitch

7
SGI Origin 2000
Hardware

Shared Distributed Memory
Hypercubic Topology

C4-CEPBA
64 R10000 processors
8 GB memory
32 Gflop/s

8
Digital AlphaServer 8400
Hardware

Shared Memory
Bus Topology

C4-CEPBA
10 Alpha 21164 processors
2 GB memory
8.8 Gflop/s

9
Drawbacks that arise when solving problems using parallelism

Parallel programming is more complex than sequential programming.
Results may vary as a consequence of the intrinsic nondeterminism.
New problems arise: deadlocks, starvation, ...
Parallel programs are more difficult to debug.
Parallel programs are less portable.

10
MPI
[Diagram: the message-passing systems CMMD, PVM, Express, Zipcode, p4, PARMACS and EUI converge into MPI, which in turn supports Parallel Libraries, Parallel Applications and Parallel Languages.]
11
MPI

What Is MPI?
Message Passing Interface standard
The first standard and portable message-passing library with good performance
"Standard" by consensus of MPI Forum participants from over 40 organizations
Finished and published in May 1994, updated in June 1995
What does MPI offer?
Standardization - on many levels
Portability - to existing and new systems
Performance - comparable to vendors' proprietary libraries
Richness - extensive functionality, many quality implementations

12
A Simple MPI Program
/* MPI hello.c */
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int name, p, source, dest, tag = 0;
    char message[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &name);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &p);      /* number of processes  */

    if (name != 0) {
        /* every process except 0 sends a greeting to process 0 */
        printf("Processor %d of %d\n", name, p);
        sprintf(message, "greetings from process %d!", name);
        dest = 0;
        MPI_Send(message, (int)strlen(message) + 1, MPI_CHAR, dest, tag,
                 MPI_COMM_WORLD);
    } else {
        /* process 0 receives and prints the greetings in rank order */
        printf("processor 0, p = %d ", p);
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag,
                     MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }
    MPI_Finalize();
    return 0;
}

Compile and run:
mpicc -o hello hello.c
mpirun -np 4 hello

Output:
Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4 greetings from process 1!
greetings from process 2!
greetings from process 3!

13
Basic Communication Operations
14
One-to-All Broadcast, Single-Node Accumulation
[Figure: one-to-all broadcast, in which processor 0 sends the message M to processors 1, ..., p; single-node accumulation, carried out in steps 1, 2, ..., p over processors 0, 1, ..., p.]
15
Broadcast on Hypercubes
16
Broadcast on Hypercubes
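The figures for these two slides did not survive in the transcript. As a rough illustration of the usual scheme, the sketch below simulates a one-to-all broadcast from node 0 on a d-dimensional hypercube in plain C (no MPI): in each step, every node that already holds the message forwards it across one dimension, so p = 2^d nodes are reached in d steps. The function name and the array representation are assumptions made for this example, not taken from the slides.

```c
/* Simulate a one-to-all broadcast from node 0 on a d-dimensional
   hypercube with p = 2^d nodes (hypothetical helper for illustration).
   has[i] is set to 1 once node i holds the message.
   Returns the number of communication steps performed. */
int hypercube_broadcast(int d, int has[]) {
    int p = 1 << d;
    int steps = 0;
    for (int i = 0; i < p; i++)
        has[i] = (i == 0);              /* only the root starts with M */

    /* Go through the dimensions from the highest bit to the lowest. */
    for (int bit = d - 1; bit >= 0; bit--) {
        int mask = 1 << bit;
        for (int i = 0; i < p; i++) {
            /* A holder whose bit is 0 sends to its neighbour across
               this dimension; receivers have the bit set, so they do
               not forward again within the same step. */
            if (has[i] && !(i & mask))
                has[i | mask] = 1;
        }
        steps++;
    }
    return steps;
}
```

With d = 3 the eight nodes are covered in three steps, which is the log2(p) behaviour that makes the hypercube attractive for broadcasts.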
17
MPI Broadcast

int MPI_Bcast(
    void *buffer,
    int count,
    MPI_Datatype datatype,
    int root,
    MPI_Comm comm
)
Broadcasts a message from the process with rank "root" to all other processes of the group

18
Reduction on Hypercubes
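@: commutative and associative operator
$A_i$ is stored in processor i
Every processor has to obtain $A_0 \mathbin{@} A_1 \mathbin{@} \cdots \mathbin{@} A_{P-1}$

[Figure: hypercube whose nodes are labeled with $A_i$ and the binary representation of i, for example A0 000, A1 001, A6 110.]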
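A minimal sketch of how such a reduction can proceed on a hypercube, simulated in plain C: in step k every node exchanges its partial result with the neighbour that differs in bit k and both keep the combined value, so after d steps all 2^d nodes hold the full result. Integer addition stands in for the generic operator @, and the function name is hypothetical; the slide's own figure may show a different schedule.

```c
/* Simulated all-to-all reduction on a d-dimensional hypercube.
   On entry val[i] holds A_i; on return every val[i] holds
   A_0 + A_1 + ... + A_{p-1}, where p = 2^d.
   Addition is an assumed stand-in for the operator @. */
void hypercube_allreduce_sum(int d, long val[]) {
    int p = 1 << d;
    for (int bit = 0; bit < d; bit++) {
        int mask = 1 << bit;
        for (int i = 0; i < p; i++) {
            int partner = i ^ mask;     /* neighbour across dimension bit */
            if (i < partner) {          /* handle each pair exactly once  */
                long combined = val[i] + val[partner];
                val[i] = combined;
                val[partner] = combined;
            }
        }
    }
}
```

Because the operator is commutative and associative, the order in which the pairs are combined does not change the result.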
19
Reductions with MPI

int MPI_Reduce(
    void *sendbuf, void *recvbuf, int count,
    MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
Reduces values on all processes to a single value on the root process

int MPI_Allreduce(
    void *sendbuf, void *recvbuf, int count,
    MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
Combines values from all processes and distributes the result back to all processes

20
All-to-All Broadcast, Multinode Accumulation
[Figure: all-to-all broadcast, in which the messages M0, M1, M2, ..., Mp, one per processor, end up on every processor; accumulation as the dual operation.]
Reductions, prefix sums
21
MPI Collective Operations

MPI Operator    Operation
-----------------------------------------
MPI_MAX         maximum
MPI_MIN         minimum
MPI_SUM         sum
MPI_PROD        product
MPI_LAND        logical and
MPI_BAND        bitwise and
MPI_LOR         logical or
MPI_BOR         bitwise or
MPI_LXOR        logical exclusive or
MPI_BXOR        bitwise exclusive or
MPI_MAXLOC      max value and location
MPI_MINLOC      min value and location

22
The Master Slave Paradigm
Master
Slaves
23
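Computing π
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
/* each process takes every numprocs-th interval, starting at myid + 1 */
for (i = myid + 1; i <= n; i += numprocs) {
    x = h * ((double)i - 0.5);
    sum += f(x);
}
mypi = h * sum;    /* partial result of this process */
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

[Figure: plot of f(x) for x from 0.0 to 1.0; the vertical axis is marked at 2 and 4.]
mpirun -np 3 cpi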
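To see why the cyclic split of the loop gives the same answer as a sequential run, the fragment can be simulated without MPI. The sketch below assumes the integrand of the classic cpi example, f(x) = 4 / (1 + x^2), whose integral over [0, 1] is π; the slide does not show the definition of f, so this is an assumption. The function names are hypothetical, and a plain loop plays the role of MPI_Reduce with MPI_SUM.

```c
/* Assumed integrand: the integral of 4 / (1 + x^2) over [0, 1] is pi. */
static double f(double x) { return 4.0 / (1.0 + x * x); }

/* Partial sum that "process" myid out of numprocs would compute with
   the midpoint rule on n intervals (cyclic distribution of intervals). */
double partial_pi(int myid, int numprocs, int n) {
    double h = 1.0 / (double)n;
    double sum = 0.0;
    for (int i = myid + 1; i <= n; i += numprocs) {
        double x = h * ((double)i - 0.5);
        sum += f(x);
    }
    return h * sum;
}

/* Stand-in for MPI_Reduce(..., MPI_SUM, ...): add up all partial sums. */
double simulated_pi(int numprocs, int n) {
    double pi = 0.0;
    for (int id = 0; id < numprocs; id++)
        pi += partial_pi(id, numprocs, n);
    return pi;
}
```

Calling simulated_pi(3, 1000) mimics the three-process run and approximates π to roughly seven digits.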
24
The Portability of the Efficiency
25
The Sequential Algorithm

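$f_k[c] = \max\{\, f_{k-1}[c],\; f_{k-1}[c - w_k] + p_k \,\}$ for $c \ge w_k$

void mochila01_sec (void) {
    unsigned v1;
    int c, k;
    for (c = 0; c <= C; c++)
        f[0][c] = 0;
    for (k = 1; k <= N; k++)
        for (c = 0; c <= C; c++) {
            f[k][c] = f[k-1][c];               /* object k is left out */
            if (c >= w[k]) {
                v1 = f[k-1][c - w[k]] + p[k];  /* object k is packed   */
                if (f[k][c] < v1)              /* keep the maximum     */
                    f[k][c] = v1;
            }
        }
}

[Figure: the dynamic programming table with n rows and C columns; row $f_k$ is computed from row $f_{k-1}$.]
Time complexity: O(nC)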
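The recurrence on this slide can be turned into a small self-contained function. The sketch below keeps only two rows of the table, since row k depends only on row k-1. The array bound MAX_C, the 0-based item indexing and the function name are choices made for this example, not part of the original code.

```c
#include <string.h>

#define MAX_C 1000   /* assumed upper bound on the capacity */

/* 0/1 knapsack by dynamic programming.
   w[k], p[k]: weight and profit of item k (0-based), n items,
   capacity C <= MAX_C. Returns the best total profit. */
unsigned knapsack01(int n, const int w[], const unsigned p[], int C) {
    unsigned prev[MAX_C + 1], cur[MAX_C + 1];
    memset(prev, 0, sizeof prev);               /* f_0[c] = 0 */
    memset(cur, 0, sizeof cur);
    for (int k = 0; k < n; k++) {
        for (int c = 0; c <= C; c++) {
            cur[c] = prev[c];                   /* item k left out */
            if (c >= w[k] && prev[c - w[k]] + p[k] > cur[c])
                cur[c] = prev[c - w[k]] + p[k]; /* item k packed   */
        }
        memcpy(prev, cur, sizeof cur);
    }
    return prev[C];
}
```

For example, with weights {2, 3, 4, 5}, profits {3, 4, 5, 6} and capacity 5, the best choice is the first two items, with total profit 7.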
26
The Parallel Algorithm
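void transition (int stage) {
    unsigned x;
    int c, k;
    k = stage;
    for (c = 0; c <= C; c++)
        f[c] = 0;
    for (c = 0; c <= C; c++) {
        IN(&x);                            /* f_{k-1}[c] from the previous stage */
        f[c] = max(f[c], x);
        OUT(&f[c], 1, sizeof(unsigned));   /* f_k[c] to the next stage */
        if (C >= c + w[k])
            f[c + w[k]] = x + p[k];
    }
}
$f_k[c] = \max\{\, f_{k-1}[c],\; f_{k-1}[c - w_k] + p_k \,\}$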
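The pipeline stage can be checked on a single machine by replacing IN and OUT with array accesses and running the stages one after another. This is only a sequential simulation under stated assumptions: weights are strictly positive, the first stage is fed a stream of zeros, and MAX_C, run_stage and knapsack_pipeline are hypothetical names introduced here.

```c
#include <string.h>

#define MAX_C 1000   /* assumed upper bound on the capacity */

/* One pipeline stage for an item with weight w > 0 and profit p.
   in[0..C] plays the role of the IN stream (row f_{k-1});
   out[0..C] plays the role of the OUT stream (row f_k). */
static void run_stage(int w, unsigned p, int C,
                      const unsigned in[], unsigned out[]) {
    unsigned f[MAX_C + 1];
    memset(f, 0, sizeof f);
    for (int c = 0; c <= C; c++) {
        unsigned x = in[c];           /* IN(&x)                 */
        if (x > f[c]) f[c] = x;       /* f[c] = max(f[c], x)    */
        out[c] = f[c];                /* OUT(&f[c], ...)        */
        if (C >= c + w)
            f[c + w] = x + p;         /* value used w steps later */
    }
}

/* Chain n stages, one per item, as the pipeline would. */
unsigned knapsack_pipeline(int n, const int w[], const unsigned p[], int C) {
    unsigned a[MAX_C + 1], b[MAX_C + 1];
    memset(a, 0, sizeof a);           /* stream fed to the first stage */
    memset(b, 0, sizeof b);
    for (int k = 0; k < n; k++) {
        run_stage(w[k], p[k], C, a, b);
        memcpy(a, b, sizeof b);
    }
    return a[C];
}
```

Each stage needs only the stream of its predecessor, which is what allows stage k to start as soon as stage k-1 has produced its first value.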
27
The Evolution of the Pipeline
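[Figure: evolution of the pipeline; labels: n, C.]
28
The Running Time
[Figure: timing diagram; labels: n, n - 1, C.]
29
Processor Virtualization
Block Mapping
[Figure: blocks of n/p stages assigned to processors 0, 1, 2; label: C.]
30
Processor Virtualization
Block Mapping
[Figure: blocks of n/p stages assigned to processors 0, 1, 2; label: C.]
31
Processor Virtualization
[Figure: labels: n/p, C; processors 0, 1, 2.]
32
The Running Time
[Figure: timing diagram; labels: n/p, (n/p - 1)C, C, nC/p, nC; processors 0, 1, 2.]
33
Processor Virtualization
[Figure: labels: n/p, C.]
34
The Running Time
[Figure: timing diagram; labels: n/p, C.]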
35
Block Mapping
void transition (void) {
    unsigned c, k, i, inData;
    for (c = 0; c <= C; c++) {
        IN(&inData);
        k = calcInitStage();
        /* run the width consecutive stages owned by this processor */
        for (i = 0; i < width; k++, i++) {
            f[i][c] = max(f[i][c], inData);
            if (c + w[k] <= C)
                f[i][c + w[k]] = inData + p[k];
            inData = f[i][c];
        }
        OUT(&f[i-1][c], 1, sizeof(unsigned));
    }
}

width = N / num_proc;
if (f_name < N % num_proc)   /* Load Balancing */
    width++;

int calcInitStage(void) {
    return (f_name < N % num_proc) ? f_name * width
                                   : (f_name * width) + (N % num_proc);
}
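The load-balancing arithmetic is easy to get wrong, so here is a minimal standalone version of it. The processor rank is passed as a parameter instead of being read from the global f_name; block_width and block_first_stage are names chosen for this sketch.

```c
/* Number of consecutive stages owned by processor `rank` when N stages
   are split into blocks over num_proc processors. The first
   N % num_proc processors get one extra stage (load balancing). */
int block_width(int rank, int N, int num_proc) {
    int width = N / num_proc;
    if (rank < N % num_proc)
        width++;
    return width;
}

/* Index of the first stage owned by processor `rank`. */
int block_first_stage(int rank, int N, int num_proc) {
    int width = block_width(rank, N, num_proc);
    return (rank < N % num_proc) ? rank * width
                                 : rank * width + N % num_proc;
}
```

With N = 10 stages on 3 processors, the blocks start at stages 0, 4 and 7 and have widths 4, 3 and 3.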
36
Cyclic Mapping
[Figure: stages assigned cyclically to processors 0, 1, 2.]
37
The Running Time
[Figure: timing diagram; labels: (p-1), n/p C.]
38
Cyclic Mapping
int bands = num_bands(n);
for (i = 0; i < bands; i++) {
    stage = f_name + i * num_proc;   /* stages are dealt out round-robin */
    if (stage <= n - 1)
        transition(stage);
}

/* number of bands: n / num_proc rounded up */
unsigned num_bands (unsigned n) {
    float aux_f;
    unsigned aux;
    aux_f = (float) n / (float) num_proc;
    aux = (unsigned) aux_f;
    if (aux_f > aux) return (aux + 1);
    return (aux);
}

void transition (int stage) {
    unsigned x;
    int c, k;
    k = stage;
    for (c = 0; c <= C; c++)
        f[c] = 0;
    for (c = 0; c <= C; c++) {
        IN(&x);
        f[c] = max(f[c], x);
        OUT(&f[c], 1, sizeof(unsigned));
        if (C >= c + w[k])
            f[c + w[k]] = x + p[k];
    }
}
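A minimal sketch of the cyclic assignment on its own, with the rank and processor count passed as parameters rather than read from globals. The ceiling is computed with integer arithmetic instead of floats, which is a substitution made here to avoid rounding issues for large n; cyclic_stages is a hypothetical helper.

```c
/* Number of bands: n / num_proc rounded up, in integer arithmetic. */
unsigned num_bands(unsigned n, unsigned num_proc) {
    return (n + num_proc - 1) / num_proc;
}

/* Store in stages[] the stages that processor `rank` runs under the
   cyclic mapping, in execution order; returns how many there are. */
unsigned cyclic_stages(unsigned rank, unsigned n, unsigned num_proc,
                       unsigned stages[]) {
    unsigned count = 0;
    unsigned bands = num_bands(n, num_proc);
    for (unsigned i = 0; i < bands; i++) {
        unsigned stage = rank + i * num_proc;
        if (stage < n)                 /* the last band may be partial */
            stages[count++] = stage;
    }
    return count;
}
```

For n = 7 stages on 3 processors there are 3 bands: processor 0 runs stages 0, 3, 6, processor 1 runs 1, 4, and processor 2 runs 2, 5.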

39
Advantages and Disadvantages

Block Distribution
Minimizes the Number of Communications
Penalizes the Startup Time of the Pipeline
Cyclic Distribution
Minimizes the Startup Time of the Pipeline
May Produce Communications Overhead

40
Transputer Network - Local Area Network

Transputer Network
Fine Grain
Parallel Communications

Local Area Network
Coarse Grain
Serial Communications

41
Computational Results
[Figure: two plots of time versus number of processors, one for the transputer network and one for the local area network.]
42
The Resource Allocation Problem

M units of an indivisible resource and a set of N tasks.
$f_j(x)$: benefit obtained when x units of resource are allocated to task j.

$$\max \sum_{j=1}^{N} f_j(x_j)$$
subject to
$$\sum_{j=1}^{N} x_j = M, \qquad 0 \le x_j \le B_j, \quad x_j \text{ integer}, \quad j = 1, \ldots, N$$
where N, M and $B_j$ are integers.
43
RAP- The Sequential Algorithm
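$G_k[m] = \max\{\, G_{k-1}[m - i] + f_k(i) \;:\; 0 \le i \le m \,\}$
int rap_seq(void) {
    int i, k, m;
    for (m = 0; m <= M; m++)
        G[0][m] = 0;
    /* G[k][m] is assumed to start at 0 (global array, benefits >= 0) */
    for (k = 1; k <= N; k++)
        for (m = 0; m <= M; m++)
            for (i = 0; i <= m; i++)
                G[k][m] = max(G[k][m], G[k-1][i] + f(k, m - i));
    return G[N][M];
}
Time complexity: $O(N M^2)$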
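A self-contained sketch of this dynamic program, keeping two rows of G. The benefit function is passed as a function pointer; example_benefit is a made-up benefit used only to exercise the code, and the bound MAX_M is an assumption. The per-task upper bounds B_j are not modeled here.

```c
#define MAX_M 256   /* assumed upper bound on the number of resource units */

typedef int (*benefit_fn)(int task, int units);   /* f_task(units) */

/* G_k[m] = max over 0 <= i <= m of G_{k-1}[m - i] + f_k(i). */
int rap_seq(int N, int M, benefit_fn f) {
    int prev[MAX_M + 1], cur[MAX_M + 1];
    for (int m = 0; m <= M; m++)
        prev[m] = 0;                              /* G_0[m] = 0 */
    for (int k = 1; k <= N; k++) {
        for (int m = 0; m <= M; m++) {
            cur[m] = prev[m] + f(k, 0);           /* i = 0 */
            for (int i = 1; i <= m; i++) {
                int v = prev[m - i] + f(k, i);
                if (v > cur[m]) cur[m] = v;
            }
        }
        for (int m = 0; m <= M; m++)
            prev[m] = cur[m];
    }
    return prev[M];
}

/* Hypothetical benefit for illustration only: task j earns j per unit,
   but gains nothing beyond 3 units. */
int example_benefit(int task, int units) {
    return task * (units < 3 ? units : 3);
}
```

With N = 2 tasks and M = 4 units under this made-up benefit, the best allocation gives 3 units to task 2 and 1 unit to task 1, for a total of 7. The three nested loops make the O(N M^2) cost visible.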
44
RAP - The Parallel Algorithm
void transition (int stage) {
    int m, j, x, k;
    for (m = 0; m <= M; m++)
        G[m] = 0;
    k = stage;
    for (m = 0; m <= M; m++) {
        IN(&x);
        G[m] = max(G[m], x + f(k - 1, 0));
        OUT(&G[m], 1, sizeof(int));
        for (j = m + 1; j <= M; j++)
            G[j] = max(G[j], x + f(k - 1, j - m));
    } /* for m ... */
} /* transition */
$G_k[m] = \max\{\, G_{k-1}[m - i] + f_k(i) \;:\; 0 \le i \le m \,\}$
45
The Cray T3E

CRAY T3E
Shared Address Space
Three-Dimensional Toroidal Network

46
Block - Cyclic Mapping
[Figure: block-cyclic mapping onto processors 0, 1, 2; labels: g(p-1), gM^2, n/gp.]
47
Computational Results
48
Linear Model to Predict Communication Performance

Time to send n bytes: $T(n) \approx \tau n + \beta$
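A linear model of this kind is usually calibrated from measured message times. The sketch below fits a startup cost beta and a per-byte cost tau by ordinary least squares and then predicts the time for a new message size. Reading the slide's garbled formula as beta + tau * n is an assumption, as are the function names; at least two distinct message sizes are needed for the fit.

```c
/* Least-squares fit of time[i] ~ beta + tau * bytes[i].
   Requires count >= 2 with at least two distinct sizes. */
void fit_linear_model(int count, const double bytes[], const double time[],
                      double *beta, double *tau) {
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (int i = 0; i < count; i++) {
        sx  += bytes[i];
        sy  += time[i];
        sxx += bytes[i] * bytes[i];
        sxy += bytes[i] * time[i];
    }
    double denom = count * sxx - sx * sx;
    *tau  = (count * sxy - sx * sy) / denom;   /* slope: cost per byte    */
    *beta = (sy - *tau * sx) / count;          /* intercept: startup cost */
}

/* Predicted time to send n bytes under the fitted model. */
double predict_time(double beta, double tau, double n) {
    return beta + tau * n;
}
```

On noise-free data generated with beta = 50 and tau = 0.01, the fit recovers those two values up to rounding.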

49
PAPI

http://icl.cs.utk.edu/projects/papi/
PAPI aims to provide the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors.

50
Buffering Data
Virtual process name runs on real processor fname if (name / grain) mod p = fname
[Figure: P = 2, Grain = 3. Virtual processes 0, 1, 2 run on Processor 0; 3, 4, 5 on Processor 1; 6, 7, 8 on Processor 0; and so on. Buffer size: B.]
SET_BUFIO(1, size)
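The mapping rule on this slide is a one-liner; a sketch with a hypothetical function name makes it easy to check against the figure's example (P = 2, Grain = 3).

```c
/* Real processor that runs virtual process `name` when virtual
   processes are grouped in blocks of `grain` and the blocks are dealt
   out cyclically over p processors: (name / grain) mod p. */
int real_processor(int name, int grain, int p) {
    return (name / grain) % p;
}
```

With p = 2 and grain = 3, virtual processes 0 to 2 map to processor 0, 3 to 5 map to processor 1, and 6 to 8 map to processor 0 again. Setting grain = 1 gives the pure cyclic mapping, while a grain of about n/p gives the block mapping.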
51
The Knapsack Problem: N = 12800, M = 12800. Cray T3E
52
The Resource Allocation Problem. Cray T3E
53
Portability of the Efficiency

One disappointing contrast in parallel systems is between the peak performance of the systems and the actual performance of parallel applications.
Metrics, techniques and tools have been developed to understand the sources of performance degradation.
An effective parallel program development cycle may iterate many times before achieving the desired performance.
Performance prediction is important in achieving efficient execution of parallel programs, since it avoids the coding and debugging cost of inefficient strategies.
Most approaches to performance analysis fall into two categories: Analytical Modeling and Performance Profiling.

54
Performance Analysis

Profiling may be conducted on an existing parallel system to recognize current performance bottlenecks, correct them, and identify and prevent potential future performance problems.
Architecture dependent.
The majority of performance metrics and tools devised reflect their orientation towards the measurement-modify paradigm.
PICL, Dimemas, Kpi.
ParaGraph, Vampir, Paraver.

55
Performance Analysis

Analytical Modeling
Provides a structured way of understanding performance problems.
Architecture independent.
Has predictive ability.
Modeling is not a trivial task. The model must be simple enough to be tractable, and sufficiently detailed to be accurate.
PRAM, LogP, BSP, BSPWB, etc.

[Diagram: Analytical Modeling feeds Optimal Parameter Prediction, Run Time Prediction and Error Prediction Computation.]
56
Standard Loop on a Pipeline Algorithm

void f() {
    Compute(body0);
    while (running) {
        Receive();
        Compute(body1);
        Send();
        Compute(body2);
    }
}

body0 takes constant time; body1 and body2 depend on the iteration of the loop.
Analytical model: numerical solutions for every case.
57
The Analytical Model

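$T_s$ denotes the startup time between two processors.
$$T_s = t_0 (G - 1) + G \sum_{i=1}^{B-1} (t_{1i} + t_{2i}) + 2 \beta_I (G - 1) B + \beta_E B + \beta + \tau B$$

$T_c$ denotes the whole evaluation of G processes, including the time to send M/B packets of size B.
$$T_c = t_0 G + G \sum_{i=1}^{M} (t_{1i} + t_{2i}) + 2 \beta_I (G - 1) M + \beta_E M + (\beta + \tau B) \frac{M}{B}$$

[Figure: bands of G processes exchanging M/B packets of size B.]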
58
The Analytical Model

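$$T_1(G, B) = T_s (p - 1) + T_c \frac{N}{G p}, \qquad 1 \le G \le N/p, \quad 1 \le B \le M$$
R1: values (G, B) where $T_s \, p \le T_c$
[Figure: processors 0, 1, ..., p-1, 0 running bands of G processes with packets of size B, region R1.]

$$T_2(G, B) = T_s \left( \frac{N}{G} - 1 \right) + T_c$$
R2: values (G, B) where $T_s \, p \ge T_c$
[Figure: processors 0, 1, ..., p-1, 0 running bands of G processes with packets of size B, region R2.]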
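To make the two regions concrete, the model can be evaluated numerically. The sketch below is one reading of these slides under loud assumptions: the operators lost in the transcript are restored as additions (and as N/G - 1 in the second formula), region R1 is taken to be Ts * p <= Tc, and body1 and body2 are given constant times t1 and t2 so that the sums collapse to products. The struct and function names are hypothetical.

```c
/* Parameters of the model; symbol names follow the slides, the struct
   itself is an assumption of this sketch. */
typedef struct {
    double t0, t1, t2;       /* times of body0, body1, body2 (assumed constant) */
    double beta_I, beta_E;   /* the slide's beta_I and beta_E terms             */
    double beta, tau;        /* startup and per-unit cost of one packet         */
} ModelParams;

/* Ts: startup time between two processors. */
double startup_time(const ModelParams *q, int G, int B) {
    return q->t0 * (G - 1) + (double)G * (B - 1) * (q->t1 + q->t2)
         + 2.0 * q->beta_I * (G - 1) * B + q->beta_E * B
         + q->beta + q->tau * B;
}

/* Tc: whole evaluation of G processes, sending M/B packets of size B. */
double band_time(const ModelParams *q, int G, int B, int M) {
    return q->t0 * G + (double)G * M * (q->t1 + q->t2)
         + 2.0 * q->beta_I * (G - 1) * M + q->beta_E * M
         + (q->beta + q->tau * B) * ((double)M / B);
}

/* T1 in region R1 (Ts * p <= Tc), T2 otherwise. */
double predicted_time(const ModelParams *q, int G, int B,
                      int N, int M, int p) {
    double Ts = startup_time(q, G, B);
    double Tc = band_time(q, G, B, M);
    if (Ts * p <= Tc)
        return Ts * (p - 1) + Tc * ((double)N / ((double)G * p));
    return Ts * ((double)N / G - 1.0) + Tc;
}
```

Under this reading the two expressions agree on the boundary Ts * p = Tc, which is a useful sanity check on the reconstruction.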
59
Validation of The Model
60
The Tuning Problem

Given an algorithm A, $F_A$ is the input/output function computed by the algorithm:
FA D D1x...xDn ? ? ? ?
$F_A(z)$ is the output value of the algorithm A for the entry z belonging to D.
$\mathrm{Time}_M(A(z))$ is the execution time of the algorithm A over the input z on a machine M.
$\mathrm{CTime}_M(A(z))$ is the analytical complexity time formula that approximates $\mathrm{Time}_M(A(z))$.
$T = D_1 \times \cdots \times D_k$: tuning parameters. $I = D_{k+1} \times \cdots \times D_n$: input parameters.
$x \in T$ if and only if x affects only the performance of the algorithm and not its output:
$F_A(x, z) = F_A(y, z)$ for any $x, y \in T$
$\mathrm{Time}_M(A(x, z)) \ne \mathrm{Time}_M(A(y, z))$ in general
The Tuning Problem is to find $x_0 \in T$ such that
$\mathrm{CTime}_M(A(x_0, z)) = \min \{\, \mathrm{CTime}_M(A(x, z)) : x \in T \,\}$
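When the tuning space is small and discrete, the minimization can be done by exhaustive search over the complexity time function. The sketch below assumes two integer tuning parameters, called G and B here, and takes the cost function as a function pointer; tune and example_ctime are hypothetical, and the example cost is made up purely to exercise the search.

```c
/* Analytical cost of running with tuning parameters (G, B). */
typedef double (*ctime_fn)(int G, int B);

/* Exhaustive search for the (G, B) in [1, max_G] x [1, max_B] that
   minimizes ctime. Stores the argmin and returns the minimum. */
double tune(ctime_fn ctime, int max_G, int max_B,
            int *best_G, int *best_B) {
    double best = ctime(1, 1);
    *best_G = 1;
    *best_B = 1;
    for (int G = 1; G <= max_G; G++) {
        for (int B = 1; B <= max_B; B++) {
            double t = ctime(G, B);
            if (t < best) {
                best = t;
                *best_G = G;
                *best_B = B;
            }
        }
    }
    return best;
}

/* Made-up cost with its minimum at G = 3, B = 5 (illustration only). */
double example_ctime(int G, int B) {
    return (double)((G - 3) * (G - 3) + (B - 5) * (B - 5)) + 10.0;
}
```

The search evaluates only the model, never the program itself, which is the point of the approach: candidate settings are compared without coding and running each one.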

61
Tuning Parameters

The list of tuning parameters in parallel computing is extensive.
The most obvious tuning parameter is the number of processors.
The size of the buffers used during data exchange.
Under the Master-Slave paradigm, the size and the number of data items generated by the master.
In the parallel Divide and Conquer technique, the size of a subproblem to be considered trivial and the processor assignment policy.
In regular numerical HPF-like algorithms, the block size allocation.

62
The Methodology

Profile the execution to compute the parameters needed for the complexity time function $\mathrm{CTime}_M(A(x, z))$.
Compute $x_0 \in T$ that minimizes the complexity time function:
$\mathrm{CTime}_M(A(x_0, z)) = \min \{\, \mathrm{CTime}_M(A(x, z)) : x \in T \,\}$
At this point, the predictive ability of the complexity time function can be used to predict the execution time $\mathrm{Time}_M(A(z))$ of an optimal execution, or to execute the algorithm with the tuning parameters set to $x_0$.

[Diagram: Analytical Modeling and Instrumentation feed Optimal Parameter Computation, Run Time Prediction and Error Prediction Computation.]
63
llp Solver
[Diagram: the IN and OUT calls, each timed with gettime().]
64
The MALLBA Infrastructure
65
Performance Prediction: BA - ULL
66
The MALLBA Project

Library for the resolution of combinatorial optimisation problems.
3 types of resolution techniques
Exact
Heuristic
Hybrid
3 implementations
Sequential
LAN
WAN
Goals
Genericity
Ease of utilization
Locally- and geographically-distributed computation
