Introduction to Parallel Programming Message Passing - PowerPoint PPT Presentation

About This Presentation
Title:

Introduction to Parallel Programming Message Passing

Description:

Beowulf Computers. Distributed Memory. COTS: Commercial-Off-The-Shelf computers ... Drawbacks that arise when solving Problems using Parallelism ... – PowerPoint PPT presentation

Slides: 68
Provided by: paco9
Transcript and Presenter's Notes

Title: Introduction to Parallel Programming Message Passing


1
Introduction to Parallel Programming (Message
Passing)
Francisco Almeida falmeida_at_ull.es
Parallel Computing Group
2
Beowulf Computers
  • Distributed Memory
  • COTS: Commercial-Off-The-Shelf computers

3
(No Transcript)
4
The Parallel Model
  • Computational Models: PRAM, BSP, LogP
  • Programming Models: PVM, MPI, HPF, Threads, OpenMP
  • Architectural Models: Parallel Architectures
5
The Message Passing Model
Send(parameters) Recv(parameters)
6
Network of Workstations
Hardware
  • Distributed Memory
  • Non Shared Memory Space
  • Star Topology
  • Sun Sparc Ultra 1, 143 MHz
  • Etherswitch

7
SGI Origin 2000
Hardware
  • Shared Distributed Memory
  • Hypercubic Topology
  • C4-CEPBA
  • 64 R10000 processors
  • 8 GB memory
  • 32 Gflop/s

8
Digital AlphaServer 8400
Hardware
  • Shared Memory
  • Bus Topology
  • C4-CEPBA
  • 10 Alpha 21164 processors
  • 2 GB Memory
  • 8.8 Gflop/s

9
Drawbacks that arise when solving Problems using
Parallelism
  • Parallel programming is more complex than
    sequential programming.
  • Results may vary as a consequence of
    intrinsic non-determinism.
  • New problems arise: deadlocks, starvation...
  • Parallel programs are more difficult to debug.
  • Parallel programs are less portable.

10
MPI
(Figure: CMMD, PVM, Express, Zipcode, p4, PARMACS and EUI feed into MPI; Parallel Libraries, Parallel Applications and Parallel Languages sit on top of MPI.)
11
MPI
  • What Is MPI?
  • Message Passing Interface standard
  • The first standard and portable message
    passing library with good performance
  • "Standard" by consensus of MPI Forum
    participants from over 40 organizations
  • Finished and published in May 1994, updated
    in June 1995
  • What does MPI offer?
  • Standardization - on many levels
  • Portability - to existing and new systems
  • Performance - comparable to vendors'
    proprietary libraries
  • Richness - extensive functionality, many
    quality implementations

12
A Simple MPI Program
MPI hello.c

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int name, p, source, dest, tag = 0;
    char message[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &name);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &p);      /* number of processes */

    if (name != 0) {
        /* every process except 0 sends a greeting to process 0 */
        printf("Processor %d of %d\n", name, p);
        sprintf(message, "greetings from process %d!", name);
        dest = 0;
        MPI_Send(message, (int) strlen(message) + 1, MPI_CHAR, dest, tag,
                 MPI_COMM_WORLD);
    } else {
        /* process 0 receives the greetings in rank order and prints them */
        printf("processor 0, p = %d ", p);
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD,
                     &status);
            printf("%s\n", message);
        }
    }
    MPI_Finalize();
    return 0;
}

Output:

Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4 greetings from process 1!
greetings from process 2!
greetings from process 3!

Compile and run:

mpicc -o hello hello.c
mpirun -np 4 hello

13
Basic Communication Operations
14
One-to-All Broadcast and Single-Node Accumulation
(Figure: one-to-all broadcast of a message M among nodes 0, 1, ..., p; single-node accumulation over nodes 0, 1, 2, ..., p in Step 1, Step 2, ..., Step p.)
15
Broadcast on Hypercubes
16
Broadcast on Hypercubes
17
MPI Broadcast
  • int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
  • Broadcasts a message from the process with rank "root" to
    all other processes of the group
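
The hypercube broadcast named on the preceding slides can be sketched without MPI. This is a minimal simulation, assuming node 0 is the root and that in step s every node that already holds the message forwards it across dimension s; the function name `hypercube_broadcast` is hypothetical, and the sketch does not claim to describe how any MPI library implements MPI_Bcast.

```c
#include <assert.h>
#include <string.h>

/* Simulate a one-to-all broadcast from node 0 on a d-dimensional hypercube.
   has_msg must have room for 2^d entries; has_msg[i] becomes 1 once node i
   holds the message. Returns the number of communication steps used. */
int hypercube_broadcast(int d, int *has_msg)
{
    int p = 1 << d;                     /* number of nodes */
    int step, node;

    assert(d >= 0 && d < 30);
    memset(has_msg, 0, (size_t)p * sizeof(int));
    has_msg[0] = 1;                     /* the root already has the message */

    for (step = 0; step < d; step++) {
        int bit = 1 << step;
        /* nodes 0 .. bit-1 hold the message at this point; each one sends
           it to the neighbour whose address differs in bit `step` */
        for (node = 0; node < bit; node++) {
            assert(has_msg[node]);
            has_msg[node | bit] = 1;
        }
    }
    return d;
}
```

With d = 3 the eight nodes are reached in three steps, which is the logarithmic behaviour that makes the hypercube attractive for this operation.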

18
Reduction on Hypercubes
  • @ is a commutative and associative operator
  • Ai resides in processor i
  • Every processor has to obtain A0 @ A1 @ ... @ AP-1

(Figure: hypercube in which the node holding Ai is labeled with its binary address, e.g. A0 000, A1 001, A6 110.)
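
One common way to give every node of a hypercube the combined value is pairwise exchange along each dimension. The following is a sketch under assumptions: the operator is taken to be integer addition, the machine is simulated with a plain array, and the name `hypercube_allreduce_sum` is hypothetical.

```c
#include <assert.h>

/* Simulate an all-reduce with operator + on a hypercube of p = 2^d nodes.
   On entry a[i] holds A_i; on exit every a[i] holds A_0 + ... + A_{p-1}. */
void hypercube_allreduce_sum(int d, long *a)
{
    int p = 1 << d;
    int step, node;

    assert(d >= 0 && d < 30);
    for (step = 0; step < d; step++) {
        int bit = 1 << step;
        for (node = 0; node < p; node++) {
            int partner = node ^ bit;       /* neighbour along this dimension */
            if (node < partner) {           /* handle each pair once */
                long combined = a[node] + a[partner];
                a[node] = combined;
                a[partner] = combined;
            }
        }
    }
}
```

Because the operator is commutative and associative, the order in which the partial results are combined does not change the final value.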
19
Reductions with MPI
  • int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
  • Reduces values on all processes to a single value
    on the root process
  • int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
  • Combines values from all processes and
    distributes the result back to all processes

20
All-to-All Broadcast and Multinode Accumulation
(Figure: all-to-all broadcast, after which every node holds the messages M0, M1, ..., Mp.)
Reductions, Prefix sums
21
MPI Collective Operations
  MPI Operator    Operation
  ------------    ----------------------
  MPI_MAX         maximum
  MPI_MIN         minimum
  MPI_SUM         sum
  MPI_PROD        product
  MPI_LAND        logical and
  MPI_BAND        bitwise and
  MPI_LOR         logical or
  MPI_BOR         bitwise or
  MPI_LXOR        logical exclusive or
  MPI_BXOR        bitwise exclusive or
  MPI_MAXLOC      max value and location
  MPI_MINLOC      min value and location

22
The Master Slave Paradigm
Master
Slaves
23
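Computing π

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
/* process myid handles intervals myid+1, myid+1+numprocs, ... */
for (i = myid + 1; i <= n; i += numprocs) {
    x = h * ((double) i - 0.5);
    sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

(Figure: plot of the integrand over the interval [0.0, 1.0], with values ranging up to 4.)

mpirun -np 3 cpi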
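
The cyclic split of the integration intervals can be checked without MPI. This is a minimal sketch, assuming the integrand is 4 / (1 + x^2) (the slide does not show f, so that choice is an assumption) and replacing MPI_Reduce by a plain loop that adds the partial results; `partial_pi` and `simulated_pi` are hypothetical names.

```c
#include <assert.h>

/* Assumed integrand: its integral over [0, 1] equals pi. */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Midpoint-rule contribution of process `myid` out of `numprocs`,
   using n intervals dealt out cyclically. */
double partial_pi(int myid, int numprocs, int n)
{
    double h = 1.0 / (double)n;
    double sum = 0.0;
    int i;

    assert(n > 0 && numprocs > 0 && myid >= 0 && myid < numprocs);
    for (i = myid + 1; i <= n; i += numprocs) {
        double x = h * ((double)i - 0.5);   /* midpoint of interval i */
        sum += f(x);
    }
    return h * sum;
}

/* Stand-in for the reduction with MPI_SUM: add up all partial results. */
double simulated_pi(int numprocs, int n)
{
    double pi = 0.0;
    int id;

    for (id = 0; id < numprocs; id++)
        pi += partial_pi(id, numprocs, n);
    return pi;
}
```

Calling `simulated_pi(3, 1000)` mirrors a run on three processes: each interval is handled by exactly one process, so the total matches what a single process would compute, up to rounding.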
24
The Portability of the Efficiency
25
The Sequential Algorithm
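  • $f_k[c] = \max\{ f_{k-1}[c],\; f_{k-1}[c - w_k] + p_k \}$ for $c \ge w_k$

void mochila01_sec(void)          /* "mochila" is Spanish for knapsack */
{
    unsigned v1;
    int c, k;

    for (c = 0; c <= C; c++)
        f[0][c] = 0;                              /* no objects: zero profit */
    for (k = 1; k <= N; k++)
        for (c = 0; c <= C; c++) {
            f[k][c] = f[k-1][c];                  /* object k left out */
            if (c >= w[k]) {
                v1 = f[k-1][c - w[k]] + p[k];     /* object k included */
                if (f[k][c] < v1)
                    f[k][c] = v1;
            }
        }
}

(Figure: the dynamic programming table with n rows and C columns; row fk is computed from row fk-1.)

Running time: O(nC)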
26
The Parallel Algorithm
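void transition(int stage)
{
    unsigned x;
    int c, k;

    k = stage;
    for (c = 0; c <= C; c++)
        f[c] = 0;
    for (c = 0; c <= C; c++) {
        IN(&x);                              /* f_{k-1}[c] from the previous stage */
        f[c] = max(f[c], x);
        OUT(&f[c], 1, sizeof(unsigned));     /* f_k[c] to the next stage */
        if (C >= c + w[k])
            f[c + w[k]] = x + p[k];          /* candidate that includes object k */
    }
}

$f_k[c] = \max\{ f_{k-1}[c],\; f_{k-1}[c - w_k] + p_k \}$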
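
The behaviour of the pipeline can be reproduced on a single processor by running the stages one after another and passing each stage's output stream to the next. This is a sketch under assumptions: arrays stand in for the IN/OUT channels, weights are strictly positive, objects are numbered from 0, and `stage` and `knapsack_pipeline` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* One pipeline stage: consumes in[0..C] (row f_{k-1}) in order and
   produces out[0..C] (row f_k), keeping only a local vector f. */
static void stage(int C, int wk, unsigned pk, const unsigned *in, unsigned *out)
{
    unsigned *f = calloc((size_t)C + 1, sizeof *f);   /* local f[c] = 0 */
    int c;

    assert(f != NULL && wk > 0);
    for (c = 0; c <= C; c++) {
        unsigned x = in[c];             /* plays the role of IN */
        if (x > f[c])
            f[c] = x;                   /* f[c] = max(f[c], x) */
        out[c] = f[c];                  /* plays the role of OUT */
        if (c + wk <= C)
            f[c + wk] = x + pk;         /* candidate that takes the object */
    }
    free(f);
}

/* Runs n stages in sequence and returns the best profit for capacity C. */
unsigned knapsack_pipeline(int n, int C, const int *w, const unsigned *p)
{
    unsigned *a = calloc((size_t)C + 1, sizeof *a);   /* f_0[c] = 0 */
    unsigned *b = calloc((size_t)C + 1, sizeof *b);
    unsigned best;
    int k;

    assert(a != NULL && b != NULL && C >= 0);
    for (k = 0; k < n; k++) {
        unsigned *t;
        stage(C, w[k], p[k], a, b);
        t = a; a = b; b = t;            /* output of stage k feeds stage k+1 */
    }
    best = a[C];
    free(a);
    free(b);
    return best;
}
```

For weights {2, 3, 4}, profits {3, 4, 5} and capacity 5, the stages produce the rows [0,0,3,3,3,3], [0,0,3,4,4,7] and [0,0,3,4,5,7], so the result is 7. In the real pipeline the stages overlap in time, since stage k+1 can start as soon as stage k has emitted its first value.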
27
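The Evolution of the Pipeline
(Figure: evolution of the pipeline, with axes n and C.)
28
The Running Time
(Figure: timing diagram labeled n, n - 1 and C.)
29
Processor Virtualization
Block Mapping
(Figure: block mapping onto processors 0, 1, 2, labeled n/p and C.)
30
Processor Virtualization
Block Mapping
(Figure: block mapping onto processors 0, 1, 2, labeled n/p and C.)
31
Processor Virtualization
(Figure: processors 0, 1, 2, labeled n/p and C.)
32
The Running Time
(Figure: timing diagram for processors 0, 1, 2, labeled n/p, (n/p - 1)C, C, nC/p and nC.)
33
Processor Virtualization
(Figure: labeled n/p and C.)
34
The Running Time
(Figure: timing diagram labeled n/p and C.)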
35
Block Mapping
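void transition(void)
{
    unsigned c, k, i, inData;

    for (c = 0; c <= C; c++) {
        IN(&inData);
        k = calcInitStage();
        /* this processor evaluates `width` consecutive stages locally */
        for (i = 0; i < width; k++, i++) {
            f[i][c] = max(f[i][c], inData);
            if (c + w[k] <= C)
                f[i][c + w[k]] = inData + p[k];
            inData = f[i][c];           /* input of the next local stage */
        }
        OUT(&f[i-1][c], 1, sizeof(unsigned));
    }
}

width = N / num_proc;
if (f_name < N % num_proc)              /* Load Balancing */
    width++;

int calcInitStage(void)
{
    return (f_name < N % num_proc) ? f_name * width
                                   : (f_name * width) + (N % num_proc);
}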
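
The load-balancing arithmetic of the block mapping can be isolated in two small functions. This is a sketch: the names `block_width` and `block_first_stage` are hypothetical, and the globals of the slide (f_name, N, num_proc) are passed as parameters so the functions can be checked in isolation.

```c
#include <assert.h>

/* Number of consecutive stages owned by processor f_name. */
int block_width(int f_name, int N, int num_proc)
{
    int width;

    assert(num_proc > 0 && f_name >= 0 && f_name < num_proc && N >= 0);
    width = N / num_proc;
    if (f_name < N % num_proc)      /* the first N % num_proc processors */
        width++;                    /* take one extra stage each */
    return width;
}

/* Index of the first stage owned by processor f_name. */
int block_first_stage(int f_name, int N, int num_proc)
{
    int width = block_width(f_name, N, num_proc);

    return (f_name < N % num_proc) ? f_name * width
                                   : f_name * width + N % num_proc;
}
```

With N = 10 stages on 3 processors, the widths are 4, 3, 3 and the first stages are 0, 4, 7, so the blocks cover stages 0 to 9 without gaps or overlap.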
36
Cyclic Mapping
(Figure: cyclic mapping of the stages onto processors 0, 1, 2.)
37
The Running Time
(Figure: timing diagram labeled (p-1) and n/p C.)
38
Cyclic Mapping
int bands = num_bands(n);
for (i = 0; i < bands; i++) {
    stage = f_name + i * num_proc;      /* stages are dealt out round-robin */
    if (stage <= n - 1)
        transition(stage);
}

unsigned num_bands(unsigned n)
{
    float aux_f;
    unsigned aux;

    aux_f = (float) n / (float) num_proc;
    aux = (unsigned) aux_f;
    if (aux_f > aux)
        return (aux + 1);               /* round up */
    return (aux);
}

void transition(int stage)
{
    unsigned x;
    int c, k;

    k = stage;
    for (c = 0; c <= C; c++)
        f[c] = 0;
    for (c = 0; c <= C; c++) {
        IN(&x);
        f[c] = max(f[c], x);
        OUT(&f[c], 1, sizeof(unsigned));
        if (C >= c + w[k])
            f[c + w[k]] = x + p[k];
    }
}
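
The stage assignment of the cyclic mapping can be listed explicitly. This is a sketch: `cyclic_stages` is a hypothetical helper, num_proc is passed as a parameter instead of being a global, and the rounding up is done with integer arithmetic rather than the float conversion used on the slide.

```c
#include <assert.h>

/* Number of bands: n / num_proc rounded up. */
unsigned num_bands(unsigned n, unsigned num_proc)
{
    assert(num_proc > 0);
    return (n + num_proc - 1) / num_proc;
}

/* Stores in stages[] the stages that processor f_name evaluates under the
   cyclic mapping and returns how many there are. stages[] needs room for
   num_bands(n, num_proc) entries. */
unsigned cyclic_stages(unsigned f_name, unsigned n, unsigned num_proc,
                       unsigned *stages)
{
    unsigned bands = num_bands(n, num_proc);
    unsigned i, count = 0;

    for (i = 0; i < bands; i++) {
        unsigned stage = f_name + i * num_proc;
        if (stage < n)                  /* the last band may be incomplete */
            stages[count++] = stage;
    }
    return count;
}
```

With n = 7 stages and 3 processors there are 3 bands: processor 0 evaluates stages 0, 3, 6, processor 1 evaluates 1, 4 and processor 2 evaluates 2, 5.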

39
Advantages and Disadvantages
  • Block Distribution: minimizes the number of
    communications, but penalizes the startup time
    of the pipeline.
  • Cyclic Distribution: minimizes the startup time
    of the pipeline, but may produce communication
    overhead.

40
Transputer Network - Local Area Network
  • Transputer Network: fine grain, parallel
    communications
  • Local Area Network: coarse grain, serial
    communications

41
Computational Results
(Figure: two plots of Time against Processors, one for Transputers and one for the Local Area Network.)
42
The Resource Allocation Problem
  • M units of an indivisible resource and a set of
    N tasks.
  • fj(x): benefit obtained when x units of
    resource are allocated to task j.

$$\max \sum_{j=1}^{N} f_j(x_j)$$

Subject to

$$\sum_{j=1}^{N} x_j = M, \qquad x_j \in [0, B_j], \; x_j \text{ integer}, \qquad j = 1, \dots, N$$
43
RAP- The Sequential Algorithm
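$G_k[m] = \max\{ G_{k-1}[m - i] + f_k(i) \;:\; 0 \le i \le m \}$

int rap_seq(void)
{
    int i, k, m;

    for (m = 0; m <= M; m++)
        G[0][m] = 0;
    for (k = 1; k <= N; k++)
        for (m = 0; m <= M; m++)
            for (i = 0; i <= m; i++)    /* i units go to tasks 1..k-1, m - i to task k */
                G[k][m] = max(G[k][m], G[k-1][i] + f(k, m - i));
    return G[N][M];
}

Running time: O(nM^2)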
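
A self-contained version of the recurrence makes it easy to check on a tiny instance. This is a sketch under assumptions: tasks are numbered from 0, the upper bounds Bj are ignored, only two rows of G are kept, and both `benefit_fn` and the table in `example_benefit` are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* benefit(k, i): benefit of giving i units to task k (tasks numbered from 0). */
typedef int (*benefit_fn)(int k, int i);

/* Hypothetical benefits for 2 tasks and up to 3 units, for illustration only. */
int example_benefit(int k, int i)
{
    static const int table[2][4] = { {0, 3, 4, 5}, {0, 2, 4, 6} };

    assert(k >= 0 && k < 2 && i >= 0 && i <= 3);
    return table[k][i];
}

/* Returns G_N[M], keeping only rows G_{k-1} (prev) and G_k (cur) in memory. */
int rap_seq(int N, int M, benefit_fn f)
{
    int *prev = calloc((size_t)M + 1, sizeof *prev);   /* G_0[m] = 0 */
    int *cur = calloc((size_t)M + 1, sizeof *cur);
    int i, k, m, result;

    assert(prev != NULL && cur != NULL && M >= 0);
    for (k = 0; k < N; k++) {
        int *tmp;
        for (m = 0; m <= M; m++) {
            int best = prev[m] + f(k, 0);          /* task k gets nothing */
            for (i = 1; i <= m; i++) {             /* task k gets i units */
                int cand = prev[m - i] + f(k, i);
                if (cand > best)
                    best = cand;
            }
            cur[m] = best;
        }
        tmp = prev; prev = cur; cur = tmp;
    }
    result = prev[M];
    free(prev);
    free(cur);
    return result;
}
```

For the example table and M = 3, the best split gives 1 unit to the first task and 2 units to the second, for a total benefit of 3 + 4 = 7. The triple loop is where the quadratic dependence on M in the running time comes from.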
44
RAP - The Parallel Algorithm
void transition(int stage)
{
    int m, j, x, k;

    for (m = 0; m <= M; m++)
        G[m] = 0;
    k = stage;
    for (m = 0; m <= M; m++) {
        IN(&x);                                     /* G_{k-1}[m] from the previous stage */
        G[m] = max(G[m], x + f(k - 1, 0));
        OUT(&G[m], 1, sizeof(int));                 /* G_k[m] to the next stage */
        for (j = m + 1; j <= M; j++)
            G[j] = max(G[j], x + f(k - 1, j - m));
    } /* for m ... */
} /* transition */

$G_k[m] = \max\{ G_{k-1}[m - i] + f_k(i) \;:\; 0 \le i \le m \}$
45
The Cray T3E
  • CRAY T3E
  • Shared Address Space
  • Three-Dimensional Toroidal Network

46
Block - Cyclic Mapping
(Figure: block-cyclic mapping onto processors 0, 1, 2, with timing labels g(p-1) and gM^2 n/(gp).)
47
Computational Results
48
Linear Model to Predict Communication Performance
  • Time to send n bytes = β + τ n, where β is the startup latency and τ the time per byte
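
The two parameters of such a model are usually obtained by fitting measured transfer times. This is a sketch under assumptions: the model is taken to be time = beta + tau * bytes (a startup latency plus a per-byte cost), the fit is ordinary least squares, and `fit_linear_model` and `predict_send_time` are hypothetical names.

```c
#include <assert.h>

/* Least-squares fit of time = beta + tau * bytes over `count` samples.
   Returns 0 on success, -1 if all message sizes are identical. */
int fit_linear_model(const double *bytes, const double *time, int count,
                     double *beta, double *tau)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0, denom;
    int i;

    assert(count >= 2);
    for (i = 0; i < count; i++) {
        sx += bytes[i];
        sy += time[i];
        sxx += bytes[i] * bytes[i];
        sxy += bytes[i] * time[i];
    }
    denom = count * sxx - sx * sx;
    if (denom == 0.0)
        return -1;
    *tau = (count * sxy - sx * sy) / denom;     /* slope: time per byte */
    *beta = (sy - *tau * sx) / count;           /* intercept: startup latency */
    return 0;
}

/* Predicted time to send a message of the given size. */
double predict_send_time(double beta, double tau, double bytes)
{
    return beta + tau * bytes;
}
```

In practice the samples would come from timing ping-pong exchanges of several message sizes on the target machine; the fitted values can then be plugged into an analytical running-time formula.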

49
PAPI
  • http://icl.cs.utk.edu/projects/papi/
  • PAPI aims to provide the tool designer and
    application engineer with a consistent interface
    and methodology for use of the performance
    counter hardware found in most major
    microprocessors.

50
Buffering Data
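Virtual process name runs on real processor fname
if (name / grain) mod p = fname
(Figure: with P = 2 and Grain = 3, virtual processes 0, 1, 2 run on Processor 0, virtual processes 3, 4, 5 on Processor 1, virtual processes 6, 7, 8 on Processor 0, and so on; the buffers have size B.)
SET_BUFIO(1, size)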
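
The mapping rule is a single line of integer arithmetic. This is a minimal sketch; the function name `real_processor` is hypothetical.

```c
#include <assert.h>

/* Real processor that runs virtual process `name` when virtual processes
   are grouped in blocks of `grain` and the blocks are dealt out cyclically
   among p processors. */
int real_processor(int name, int grain, int p)
{
    assert(name >= 0 && grain > 0 && p > 0);
    return (name / grain) % p;
}
```

With grain = 1 this is a pure cyclic mapping, and with grain equal to the number of virtual processes divided by p it becomes a pure block mapping, so the grain interpolates between the two distributions.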
51
The Knapsack Problem. N = 12800, M = 12800. Cray T3E
52
The Resource Allocation Problem. Cray T3E
53
Portability of the Efficiency
  • One disappointing contrast in parallel systems is
    between the peak performance of the parallel
    systems and the actual performance of parallel
    applications.
  • Metrics, techniques and tools have been developed
    to understand the sources of performance
    degradation.
  • An effective parallel program development cycle
    may iterate many times before achieving the
    desired performance.
  • Performance prediction is important in achieving
    efficient execution of parallel programs, since
    it avoids the coding and debugging cost
    of inefficient strategies.
  • Most of the approaches to performance analysis
    fall into two categories: Analytical Modeling and
    Performance Profiling.

54
Performance Analysis
  • Profiling may be conducted on an existing
    parallel system to recognize current performance
    bottlenecks, correct them, and identify and
    prevent potential future performance problems.
  • Architecture dependent.
  • The majority of performance metrics and tools
    devised reflect their orientation towards the
    measurement-modify paradigm.
  • PICL, Dimemas, Kpi.
  • ParaGraph, Vampir, Paraver.

55
Performance Analysis
  • Analytical Modeling
  • Provides a structured way for understanding
    performance problems
  • Architecture independent
  • Has predictive ability
  • Modeling is not a trivial task. The model must be
    simple enough to be tractable, and sufficiently
    detailed to be accurate.
  • PRAM, LogP, BSP, BSPWB, etc...

(Figure: diagram relating Analytical Modeling to optimal parameter computation, run time prediction and prediction error.)
56
Standard Loop on a Pipeline Algorithm
void f()
{
    Compute(body0);
    while (running) {
        Receive();
        Compute(body1);
        Send();
        Compute(body2);
    }
}

body0 takes constant time; body1 and body2 depend
on the iteration of the loop.
Analytical Model: numerical solutions for every
case
57
The Analytical Model
  • Ts denotes the startup time between
    two processors.
  • $T_s = t_0 (G - 1) + G \sum_{i=1}^{B-1} (t_{1i} + t_{2i}) + 2 \beta_I (G - 1) B + \beta_E B + \beta + \tau B$
  • Tc denotes the whole evaluation of G processes,
    including the time to send M/B packets of size B.
  • $T_c = t_0 G + G \sum_{i=1}^{M} (t_{1i} + t_{2i}) + 2 \beta_I (G - 1) M + \beta_E M + (\beta + \tau B) M/B$

(Figure: bands of G processes exchanging M/B packets of size B.)
58
The Analytical Model
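  • $T_1(G, B) = T_s (p - 1) + T_c \, N/(Gp)$
  • R1: values (G, B) where $T_s \, p \le T_c$
  • $T_2(G, B) = T_s (N/G - 1) + T_c$
  • R2: values (G, B) where $T_s \, p \ge T_c$
  • $1 \le G \le N/p$ and $1 \le B \le M$

(Figure: timing diagrams for regions R1 and R2 on processors 0, 1, ..., p-1, with bands of G processes and packets of size B.)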
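
The two regimes can be combined into one prediction function. This is a sketch under assumptions: the regimes are read as T1 = Ts (p - 1) + Tc N/(G p) when Ts p <= Tc and T2 = Ts (N/G - 1) + Tc otherwise, Ts and Tc are supplied by the caller (in the model they depend on G and B), and `pipeline_time` is a hypothetical name.

```c
#include <assert.h>

/* Predicted running time of the pipeline for given startup time Ts and
   band evaluation time Tc, with N stages, grain G and p processors. */
double pipeline_time(double Ts, double Tc, double N, double G, double p)
{
    assert(G > 0.0 && p > 0.0);
    if (Ts * p <= Tc)
        return Ts * (p - 1.0) + Tc * N / (G * p);   /* region R1 */
    return Ts * (N / G - 1.0) + Tc;                 /* region R2 */
}
```

On the boundary Ts p = Tc both expressions reduce to Ts (p - 1 + N/G), so under this reading the prediction is continuous across the two regions.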
59
Validation of The Model
60
The Tuning Problem
  • Given an algorithm A, FA is the input/output
    function computed by the algorithm.
  • FA is defined on $D = D_1 \times \dots \times D_n$.
  • FA(z) is the output value of the algorithm A for
    the entry z belonging to D.
  • TimeM(A(z)) is the execution time of the
    algorithm A over the input z on a machine M.
  • CTimeM(A(z)) is the analytical Complexity Time
    formula that approximates TimeM(A(z)).
  • $T = D_1 \times \dots \times D_k$ is the set of tuning parameters and
    $I = D_{k+1} \times \dots \times D_n$ is the set of input parameters.
  • $x \in T$ if and only if x has an impact only on
    the performance of the algorithm and not on
    its output:
  • $F_A(x, z) = F_A(y, z)$ for any $x, y \in T$
  • $Time_M(A(x, z)) \ne Time_M(A(y, z))$
  • The Tuning Problem
  • is to find $x_0 \in T$ such that
    $CTime_M(A(x_0, z)) = \min \{ CTime_M(A(x, z)) : x \in T \}$

61
Tuning Parameters
  • The list of tuning parameters in parallel
    computing is extensive.
  • The most obvious tuning parameter is the number
    of processors.
  • The size of the buffers used during data
    exchange.
  • Under the Master-Slave paradigm, the size and the
    number of data items generated by the master.
  • In the parallel Divide and Conquer technique, the
    size of a subproblem to be considered trivial
    and the processor assignment policy.
  • On regular numerical HPF-like algorithms, the
    block size allocation.

62
The Methodology
  • Profile the execution to compute the parameters
    needed for the Complexity Time function
    CTimeM(A(x, z)).
  • Compute $x_0 \in T$ that minimizes the Complexity
    Time function CTimeM(A(x, z)):
  • $CTime_M(A(x_0, z)) = \min \{ CTime_M(A(x, z)) : x \in T \}$
  • At this point, the predictive ability of the
    Complexity Time function can be used to predict
    the execution time TimeM(A(z)) of an optimal
    execution, or to execute the algorithm according
    to the tuning parameters $x_0$.

(Figure: Analytical Modeling, Instrumentation, Optimal Parameter Computation, Run Time Prediction, Error Prediction Computation.)
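
When the tuning parameters are a grain G and a buffer size B, the minimization step can be done by exhaustive search over the admissible values. This is a sketch under assumptions: `ctime_fn` stands for whatever Complexity Time formula has been instantiated for the machine, `toy_ctime` is a made-up cost used only to exercise the search, and `tune` is a hypothetical name.

```c
#include <assert.h>
#include <float.h>

/* Complexity Time formula evaluated at tuning parameters (G, B). */
typedef double (*ctime_fn)(int G, int B);

/* Made-up cost with its minimum at G = 3, B = 2; for illustration only. */
double toy_ctime(int G, int B)
{
    return (double)((G - 3) * (G - 3) + (B - 2) * (B - 2)) + 10.0;
}

/* Exhaustive search over 1 <= G <= maxG and 1 <= B <= maxB.
   Stores the minimizing pair in *bestG, *bestB and returns the
   predicted time at that point. */
double tune(ctime_fn ctime, int maxG, int maxB, int *bestG, int *bestB)
{
    double best = DBL_MAX;
    int G, B;

    assert(maxG >= 1 && maxB >= 1);
    for (G = 1; G <= maxG; G++)
        for (B = 1; B <= maxB; B++) {
            double t = ctime(G, B);
            if (t < best) {
                best = t;
                *bestG = G;
                *bestB = B;
            }
        }
    return best;
}
```

The returned value doubles as the run time prediction for the optimal execution; comparing it with a measured run gives the prediction error. Exhaustive search is only one option, and a cheaper numerical method may be preferable when the parameter ranges are large.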
63
llp Solver
IN OUT
gettime() gettime()
64
The MALLBA Infrastructure
65
Performance Prediction BA - ULL
66
The MALLBA Project
  • Library for the resolution of combinatorial
    optimisation problems.
  • 3 types of resolution techniques: Exact,
    Heuristic, Hybrid
  • 3 implementations: Sequential, LAN, WAN
  • Goals: Genericity, Ease of utilization,
    Locally- and geographically-distributed
    computation

67
References
  • Wilkinson B., Allen M. Parallel Programming.
    Techniques and Applications Using Networked
    Workstations and Parallel Computers. 1999.
    Prentice-Hall.
  • Gropp W., Lusk E., Skjellum A. Using MPI.
    Portable Parallel Programming with the
    Message-Passing Interface. 1999. The MIT Press.
  • Pacheco P. Parallel Programming with MPI. 1997.
    Morgan Kaufmann Publishers.
  • Wu X. Performance Evaluation, Prediction and
    Visualization of Parallel Systems.
  • nereida.deioc.ull.es