
Introduction to Parallel Programming (Message Passing)

Francisco Almeida falmeida_at_ull.es

Parallel Computing Group

Beowulf Computers

- Distributed Memory

- COTS Commercial-Off-The-Shelf computers


The Parallel Model

- Computational Models: PRAM, BSP, LogP
- Programming Models: PVM, MPI, HPF, Threads, OpenMP
- Architectural Models: Parallel Architectures

The Message Passing Model

Send(parameters) Recv(parameters)

Network of Workstations

Hardware

- Distributed Memory
- Non Shared Memory Space
- Star Topology

- Sun Sparc Ultra 1, 143 MHz
- Etherswitch

SGI Origin 2000

Hardware

- Shared Distributed Memory
- Hypercubic Topology

- C4-CEPBA
- 64 R10000 processors
- 8 GB memory
- 32 Gflop/s

Digital AlphaServer 8400

Hardware

- Shared Memory
- Bus Topology

- C4-CEPBA
- 10 Alpha 21164 processors
- 2 GB memory
- 8.8 Gflop/s

Drawbacks that Arise when Solving Problems Using Parallelism

- Parallel programming is more complex than sequential programming.
- Results may vary as a consequence of intrinsic nondeterminism.
- New problems appear: deadlocks, starvation...
- Parallel programs are more difficult to debug.
- Parallel programs are less portable.

MPI

[Diagram: earlier message-passing systems (CMMD, PVM, Express, Zipcode, p4, PARMACS, EUI) lead into MPI, which in turn supports parallel libraries, parallel applications and parallel languages.]

MPI

- What is MPI?
  - Message Passing Interface standard
  - The first standard and portable message passing library with good performance
  - "Standard" by consensus of MPI Forum participants from over 40 organizations
  - Finished and published in May 1994, updated in June 1995
- What does MPI offer?
  - Standardization: on many levels
  - Portability: to existing and new systems
  - Performance: comparable to vendors' proprietary libraries
  - Richness: extensive functionality, many quality implementations

A Simple MPI Program

/* MPI hello.c */
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int name, p, source, dest, tag = 0;
    char message[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &name);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &p);      /* number of processes */

    if (name != 0) {
        /* every process except 0 sends a greeting to process 0 */
        printf("Processor %d of %d\n", name, p);
        sprintf(message, "greetings from process %d!", name);
        dest = 0;
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag,
                 MPI_COMM_WORLD);
    } else {
        /* process 0 receives and prints the greetings in rank order */
        printf("processor 0, p = %d\n", p);
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag,
                     MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }
    MPI_Finalize();
    return 0;
}

Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4
greetings from process 1!
greetings from process 2!
greetings from process 3!

mpicc -o hello hello.c
mpirun -np 4 hello

Basic Communication Operations

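One-to-all broadcast and single-node accumulation

[Figure: one-to-all broadcast. A source processor sends the message M to processors 0, 1, ..., p.]

[Figure: single-node accumulation. Contributions from processors 0, 1, 2, ..., p are combined in Step 1, Step 2, ..., Step p.]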

Broadcast on Hypercubes


MPI Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

- Broadcasts a message from the process with rank "root" to all other processes of the group.
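The hypercube broadcast named in the slides above can be sketched without MPI. The following is a minimal simulation, assuming node 0 is the root and that in step k every node already holding the message forwards it across dimension k; the function name `hypercube_broadcast` and the `has_msg` array are illustrative and not part of the original material.

```c
#include <assert.h>
#include <stddef.h>

/* Simulates a one-to-all broadcast from node 0 on a d-dimensional
 * hypercube with 2^d nodes. has_msg[i] becomes 1 once node i holds the
 * message M. Returns the number of communication steps used (d). */
int hypercube_broadcast(int d, int *has_msg)
{
    int nodes = 1 << d;
    int steps = 0;
    assert(d >= 0 && d < 20 && has_msg != NULL);

    for (int i = 0; i < nodes; i++)
        has_msg[i] = (i == 0);          /* only the root starts with M */

    for (int k = 0; k < d; k++) {       /* one step per dimension */
        int bit = 1 << k;
        /* nodes 0 .. 2^k - 1 already hold M; each sends it to the
         * neighbour whose address differs in bit k */
        for (int i = 0; i < bit; i++)
            if (has_msg[i])
                has_msg[i | bit] = 1;
        steps++;
    }
    return steps;
}
```

With d = 3 the eight nodes are reached in three steps, which is the logarithmic cost that makes the hypercube scheme attractive compared with p separate sends from the root.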

Reduction on Hypercubes

- @ is a commutative and associative operator
- A_i is held by processor i
- Every processor has to obtain A_0 @ A_1 @ ... @ A_{P-1}

[Figure: hypercube whose nodes are labelled with the value they hold and their binary address, e.g. A0 at 000, A1 at 001, A6 at 110.]
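A reduction in which every processor ends with the combined value can be sketched as a sequential simulation. This is only an illustration, assuming "+" as the commutative and associative operator and P = 2^d processors; `hypercube_allreduce_sum` and `MAX_NODES` are hypothetical names.

```c
#include <assert.h>

#define MAX_NODES 1024

/* All-reduce on a d-dimensional hypercube with "+" as the operator.
 * On entry a[i] is the value A_i held by node i; on exit every a[i]
 * holds A_0 + A_1 + ... + A_{P-1}, where P = 2^d. */
void hypercube_allreduce_sum(int d, long *a)
{
    int nodes = 1 << d;
    long next[MAX_NODES];
    assert(d >= 0 && nodes <= MAX_NODES);

    for (int k = 0; k < d; k++) {
        int bit = 1 << k;
        /* every node exchanges its partial result with the neighbour
         * whose address differs in bit k, then both combine them */
        for (int i = 0; i < nodes; i++)
            next[i] = a[i] + a[i ^ bit];
        for (int i = 0; i < nodes; i++)
            a[i] = next[i];
    }
}
```

After d exchange steps each node holds the full result, so no separate broadcast of the reduced value is needed.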

Reductions with MPI

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

- Reduces values on all processes to a single value.

int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

- Combines values from all processes and distributes the result back to all processes.

All-to-All Broadcast, Multinode Accumulation

[Figure: all-to-all broadcast and accumulation of the messages M0, M1, M2, ..., Mp across the processors.]

Reductions, Prefix Sums

MPI Collective Operations

MPI Operator    Operation
------------    ----------------------
MPI_MAX         maximum
MPI_MIN         minimum
MPI_SUM         sum
MPI_PROD        product
MPI_LAND        logical and
MPI_BAND        bitwise and
MPI_LOR         logical or
MPI_BOR         bitwise or
MPI_LXOR        logical exclusive or
MPI_BXOR        bitwise exclusive or
MPI_MAXLOC      max value and location
MPI_MINLOC      min value and location

The Master Slave Paradigm

Master

Slaves

Computing π

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
    x = h * ((double)i - 0.5);
    sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

[Figure: plot of f(x) for x from 0.0 to 1.0, with y-axis marks at 2 and 4.]

mpirun -np 3 cpi
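The work split in this example can be checked without an MPI installation. The sketch below assumes the integrand of the classic cpi program, f(x) = 4 / (1 + x^2), and replaces MPI_Reduce with a plain loop over the simulated ranks; `partial_pi` and `simulated_reduce_pi` are illustrative names.

```c
#include <assert.h>

/* Integrand assumed from the classic cpi example: the integral of
 * 4 / (1 + x^2) over [0, 1] equals pi. */
static double f(double x) { return 4.0 / (1.0 + x * x); }

/* Partial result computed by process `myid` out of `numprocs`:
 * rectangles myid+1, myid+1+numprocs, ... (cyclic distribution). */
double partial_pi(int myid, int numprocs, int n)
{
    double h = 1.0 / (double)n, sum = 0.0;
    for (int i = myid + 1; i <= n; i += numprocs) {
        double x = h * ((double)i - 0.5);   /* midpoint of rectangle i */
        sum += f(x);
    }
    return h * sum;
}

/* Stand-in for MPI_Reduce with MPI_SUM: add up every process's share. */
double simulated_reduce_pi(int numprocs, int n)
{
    double pi = 0.0;
    assert(numprocs > 0 && n > 0);
    for (int myid = 0; myid < numprocs; myid++)
        pi += partial_pi(myid, numprocs, n);
    return pi;
}
```

Calling `simulated_reduce_pi(3, 10000)` mirrors the three-process run: each simulated rank handles every third rectangle, and the sum of the three partial results approximates pi.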

The Portability of the Efficiency

The Sequential Algorithm

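$f_k(c) = \max\{\, f_{k-1}(c),\ f_{k-1}(c - w_k) + p_k \,\}$ for $c \ge w_k$

void mochila01_sec (void)
{
    unsigned v1;
    int c, k;

    for (c = 0; c <= C; c++)
        f[0][c] = 0;
    for (k = 1; k <= N; k++)
        for (c = 0; c <= C; c++) {
            f[k][c] = f[k-1][c];              /* object k left out */
            if (c >= w[k]) {
                v1 = f[k-1][c - w[k]] + p[k]; /* object k included */
                if (v1 > f[k][c])
                    f[k][c] = v1;
            }
        }
}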

[Figure: the dynamic programming table with n rows and C columns; row f_k is computed from row f_{k-1}.]

Time complexity: O(nC)
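A self-contained version of the sequential recurrence may help when experimenting with it. This sketch keeps only two rows of the table (f_{k-1} and f_k) and assumes positive integer weights; the function name `knapsack01` and its parameter layout are illustrative, not the deck's own code.

```c
#include <assert.h>
#include <stdlib.h>

/* 0/1 knapsack with n objects, weights w[0..n-1], profits p[0..n-1]
 * and capacity C. Returns the best achievable total profit. */
unsigned knapsack01(int n, const int *w, const unsigned *p, int C)
{
    unsigned *prev = calloc((size_t)C + 1, sizeof *prev);  /* row f_{k-1} */
    unsigned *cur  = calloc((size_t)C + 1, sizeof *cur);   /* row f_k */
    unsigned best;
    assert(prev != NULL && cur != NULL);

    for (int k = 0; k < n; k++) {
        for (int c = 0; c <= C; c++) {
            cur[c] = prev[c];                       /* leave object k out */
            if (c >= w[k] && prev[c - w[k]] + p[k] > cur[c])
                cur[c] = prev[c - w[k]] + p[k];     /* put object k in */
        }
        unsigned *tmp = prev;                       /* f_k becomes f_{k-1} */
        prev = cur;
        cur = tmp;
    }
    best = prev[C];
    free(prev);
    free(cur);
    return best;
}
```

The two nested loops make the O(nC) cost visible: one pass over the capacities for each object.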

The Parallel Algorithm

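void transition (int stage)
{
    unsigned x;
    int c, k;

    k = stage;
    for (c = 0; c <= C; c++)
        f[c] = 0;
    for (c = 0; c <= C; c++) {
        IN(&x);                           /* f_{k-1}(c) from the previous stage */
        f[c] = max(f[c], x);
        OUT(&f[c], 1, sizeof(unsigned));  /* f_k(c) to the next stage */
        if (C >= c + w[k])
            f[c + w[k]] = x + p[k];
    }
}

$f_k(c) = \max\{\, f_{k-1}(c),\ f_{k-1}(c - w_k) + p_k \,\}$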

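The Evolution of the Pipeline

[Figure: pipeline evolution over n stages and C columns.]

The Running Time

[Figure: running-time diagram annotated with n, n - 1 and C.]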

Processor Virtualization

[Figure: block mapping of n/p stages per processor onto processors 0, 1, 2; C columns.]


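The Running Time

[Figure: running time under block mapping on processors 0, 1, 2, annotated with n/p, (n/p - 1)C, C, nC/p and nC.]

Processor Virtualization

[Figure: n/p virtual processes per processor, C columns.]

The Running Time

[Figure: three bands of n/p stages each, C columns.]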

Block Mapping

void transition (void)
{
    unsigned c, k, i, inData;

    for (c = 0; c <= C; c++) {
        IN(&inData);
        k = calcInitStage();
        /* evaluate the whole block of stages locally */
        for (i = 0; i < width; k++, i++) {
            f[i][c] = max(f[i][c], inData);
            if (c + w[k] <= C)
                f[i][c + w[k]] = inData + p[k];
            inData = f[i][c];
        }
        OUT(&f[i-1][c], 1, sizeof(unsigned));
    }
}

width = N / num_proc;
if (f_name < N % num_proc)   /* Load Balancing */
    width++;

int calcInitStage(void)
{
    return (f_name < N % num_proc) ? f_name * width
                                   : (f_name * width) + (N % num_proc);
}
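The load-balancing arithmetic of the block mapping can be isolated and tested on its own. In this sketch the globals of the original code become explicit parameters; `block_width` and `block_first_stage` are illustrative names.

```c
#include <assert.h>

/* Number of consecutive stages owned by processor f_name when N stages
 * are split into blocks over num_proc processors. */
int block_width(int f_name, int N, int num_proc)
{
    assert(num_proc > 0 && f_name >= 0 && f_name < num_proc);
    int width = N / num_proc;
    if (f_name < N % num_proc)   /* load balancing: spread the remainder */
        width++;
    return width;
}

/* First stage owned by processor f_name. */
int block_first_stage(int f_name, int N, int num_proc)
{
    int width = block_width(f_name, N, num_proc);
    return (f_name < N % num_proc)
        ? f_name * width
        : f_name * width + N % num_proc;
}
```

For N = 10 stages on 3 processors this yields blocks of 4, 3 and 3 stages starting at stages 0, 4 and 7, so the first N mod num_proc processors absorb the remainder.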

Cyclic Mapping

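[Figure: cyclic mapping of the stages onto processors 0, 1, 2.]

The Running Time

[Figure: running-time diagram annotated with (p - 1) and n/p C.]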

Cyclic Mapping

int bands = num_bands(n);
for (i = 0; i < bands; i++) {
    stage = f_name + i * num_proc;
    if (stage <= n - 1)
        transition(stage);
}

unsigned num_bands (unsigned n)
{
    float aux_f;
    unsigned aux;

    aux_f = (float) n / (float) num_proc;
    aux = (unsigned) aux_f;
    if (aux_f > aux) return (aux + 1);   /* round up */
    return (aux);
}

void transition (int stage)
{
    unsigned x;
    int c, k;

    k = stage;
    for (c = 0; c <= C; c++)
        f[c] = 0;
    for (c = 0; c <= C; c++) {
        IN(&x);
        f[c] = max(f[c], x);
        OUT(&f[c], 1, sizeof(unsigned));
        if (C >= c + w[k])
            f[c + w[k]] = x + p[k];
    }
}
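The stage assignment under cyclic mapping can be sketched independently of the pipeline code. This version uses an integer ceiling instead of the float conversion and passes the processor count explicitly; `cyclic_stages` is a hypothetical helper, and `num_bands` here takes an extra parameter that the original does not have.

```c
#include <assert.h>

/* Number of bands (rounds) needed to run n stages on num_proc
 * processors: the integer ceiling of n / num_proc. */
unsigned num_bands(unsigned n, unsigned num_proc)
{
    assert(num_proc > 0);
    return (n + num_proc - 1) / num_proc;
}

/* Fills stages[] with the stages that processor f_name executes under
 * cyclic mapping and returns how many there are. stages[] needs room
 * for num_bands(n, num_proc) entries. */
unsigned cyclic_stages(unsigned f_name, unsigned n, unsigned num_proc,
                       unsigned *stages)
{
    unsigned count = 0;
    unsigned bands = num_bands(n, num_proc);

    for (unsigned i = 0; i < bands; i++) {
        unsigned stage = f_name + i * num_proc;
        if (stage < n)               /* the last band may be incomplete */
            stages[count++] = stage;
    }
    return count;
}
```

With n = 7 and 3 processors there are 3 bands: processor 0 runs stages 0, 3 and 6, while processors 1 and 2 sit idle in the last band.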

Advantages and Disadvantages

- Block Distribution
- Minimizes the Number of Communications
- Penalizes the Startup Time of the Pipeline
- Cyclic Distribution
- Minimizes the Startup Time of the Pipeline
- May Produce Communications Overhead

Transputer Network - Local Area Network

- Transputer Network
- Fine Grain
- Parallel Communications

- Local Area Network
- Coarse Grain
- Serial Communications

Computational Results

[Figure: two plots of Time against Processors, one for the Transputers and one for the Local Area Network.]

The Resource Allocation Problem

- M units of an indivisible resource and a set of N tasks.
- f_j(x): benefit obtained when x units of the resource are allocated to task j.

$$\max \sum_{j=1}^{N} f_j(x_j)$$

subject to

$$\sum_{j=1}^{N} x_j = M, \qquad 0 \le x_j \le B_j, \quad x_j \ \text{integer}, \quad j = 1, \ldots, N$$

RAP- The Sequential Algorithm

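$G_k(m) = \max\{\, G_{k-1}(m - i) + f_k(i) \;:\; 0 \le i \le m \,\}$

int rap_seq(void)
{
    int i, k, m;

    for (m = 0; m <= M; m++)
        G[0][m] = 0;
    for (k = 1; k <= N; k++)
        for (m = 0; m <= M; m++)
            for (i = 0; i <= m; i++)
                G[k][m] = max(G[k][m], G[k-1][i] + f(k, m - i));
    return G[N][M];
}

Time complexity: $O(nM^2)$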
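The recurrence can be tried out with a small runnable version. This sketch keeps two rows of G, takes the benefit function as a callback, and ignores any per-task upper bound; `benefit_fn`, `rap_solve` and `demo_benefit` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Benefit callback: value of giving x units to task k (k = 1..N). */
typedef int (*benefit_fn)(int k, int x);

/* Best total benefit when at most M units are shared among N tasks. */
int rap_solve(int N, int M, benefit_fn f)
{
    int *prev = calloc((size_t)M + 1, sizeof *prev);  /* G_{k-1}; G_0 = 0 */
    int *cur  = calloc((size_t)M + 1, sizeof *cur);   /* G_k */
    int best;
    assert(prev != NULL && cur != NULL);

    for (int k = 1; k <= N; k++) {
        for (int m = 0; m <= M; m++) {
            cur[m] = prev[m] + f(k, 0);       /* task k gets nothing */
            for (int i = 1; i <= m; i++) {    /* task k gets i units */
                int v = prev[m - i] + f(k, i);
                if (v > cur[m])
                    cur[m] = v;
            }
        }
        int *tmp = prev;
        prev = cur;
        cur = tmp;
    }
    best = prev[M];
    free(prev);
    free(cur);
    return best;
}

/* Hypothetical benefit for the usage example: task k earns k per unit,
 * but gains nothing beyond 3 units. */
int demo_benefit(int k, int x)
{
    return k * (x < 3 ? x : 3);
}
```

For example, `rap_solve(2, 4, demo_benefit)` gives three units to task 2 and one to task 1. The innermost loop over i is what raises the cost from O(nM) to O(nM^2).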

RAP - The Parallel Algorithm

void transition (int stage)
{
    int m, j, x, k;

    for (m = 0; m <= M; m++)
        G[m] = 0;
    k = stage;
    for (m = 0; m <= M; m++) {
        IN(&x);
        G[m] = max(G[m], x + f(k - 1, 0));
        OUT(&G[m], 1, sizeof(int));
        for (j = m + 1; j <= M; j++)
            G[j] = max(G[j], x + f(k - 1, j - m));
    } /* for m ... */
} /* transition */

$G_k(m) = \max\{\, G_{k-1}(m - i) + f_k(i) \;:\; 0 \le i \le m \,\}$

The Cray T3E

- CRAY T3E
- Shared Address Space
- Three-Dimensional Toroidal Network

Block - Cyclic Mapping

[Figure: block-cyclic mapping onto processors 0, 1, 2, annotated with g(p-1), gM^2 and n/(gp).]

Computational Results

Linear Model to Predict Communication Performance

- Time to send n bytes: $T(n) \approx \tau n + \beta$
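One common way to obtain the two constants of such a linear model is an ordinary least-squares fit over measured (message size, time) pairs. The sketch below assumes that approach; the slides do not say how the parameters were estimated, and `fit_linear_model` and `predict_send_time` are illustrative names.

```c
#include <assert.h>

/* Least-squares fit of time ~ beta + tau * bytes over `count` samples.
 * beta is the fixed start-up cost, tau the cost per byte. */
void fit_linear_model(const double *bytes, const double *time, int count,
                      double *beta, double *tau)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    assert(count >= 2);

    for (int i = 0; i < count; i++) {
        sx  += bytes[i];
        sy  += time[i];
        sxx += bytes[i] * bytes[i];
        sxy += bytes[i] * time[i];
    }
    double denom = count * sxx - sx * sx;
    assert(denom != 0.0);            /* needs at least two distinct sizes */

    *tau  = (count * sxy - sx * sy) / denom;
    *beta = (sy - *tau * sx) / count;
}

/* Predicted time to send a message of the given size. */
double predict_send_time(double beta, double tau, double bytes)
{
    return beta + tau * bytes;
}
```

The fitted pair (beta, tau) can then be plugged into any formula that charges beta + tau * B for a packet of B bytes.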

PAPI

- http://icl.cs.utk.edu/projects/papi/
- PAPI aims to provide the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors.

Buffering Data

Virtual process name runs on real processor fname if (name / grain) mod p = fname.

[Figure: P = 2, Grain = 3. Virtual processes 0, 1, 2 run on Processor 0; 3, 4, 5 on Processor 1; 6, 7, 8 on Processor 0; and so on. Data is sent in buffers of size B.]

SET_BUFIO(1, size)
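The mapping rule stated on this slide is a one-line function. A minimal sketch, with the hypothetical name `owner`:

```c
#include <assert.h>

/* Real processor that runs virtual process `name` when virtual
 * processes are dealt out in groups of `grain` over `p` processors. */
int owner(int name, int grain, int p)
{
    assert(name >= 0 && grain > 0 && p > 0);
    return (name / grain) % p;
}
```

With p = 2 and grain = 3, virtual processes 0 to 2 map to processor 0, 3 to 5 to processor 1, and 6 to 8 back to processor 0. Setting grain = 1 gives the pure cyclic mapping, and a grain of n/p gives the pure block mapping.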

The Knapsack Problem: N = 12800, M = 12800. Cray T3E

The Resource Allocation Problem. Cray T3E

Portability of the Efficiency

- One disappointing contrast in parallel systems is between the peak performance of the systems and the actual performance of parallel applications.
- Metrics, techniques and tools have been developed to understand the sources of performance degradation.
- An effective parallel program development cycle may iterate many times before achieving the desired performance.
- Performance prediction is important for achieving efficient execution of parallel programs, since it avoids the coding and debugging cost of inefficient strategies.
- Most approaches to performance analysis fall into two categories: Analytical Modeling and Performance Profiling.

Performance Analysis

- Profiling may be conducted on an existing parallel system to recognize current performance bottlenecks, correct them, and identify and prevent potential future performance problems.
- Architecture dependent.
- The majority of performance metrics and tools devised reflect their orientation towards the measurement-modify paradigm.
- PICL, Dimemas, Kpi.
- ParaGraph, Vampir, Paraver.

Performance Analysis

- Analytical Modeling
- Provides a structured way of understanding performance problems.
- Architecture independent.
- Has predictive ability.
- Modeling is not a trivial task. The model must be simple enough to be tractable, and sufficiently detailed to be accurate.
- PRAM, LogP, BSP, BSPWB, etc.

[Diagram: Analytical Modeling; Optimal Parameter Computation; Run Time Prediction; Error Prediction Computation.]

Standard Loop on a Pipeline Algorithm

void f()
{
    Compute(body0);
    While (running) {
        Receive();
        Compute(body1);
        Send();
        Compute(body2);
    }
}

body0 takes constant time; body1 and body2 depend on the iteration of the loop.

Analytical Model: numerical solutions for every case.

The Analytical Model

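- Ts denotes the startup time between two processors:

$$T_s = t_0 (G - 1) + G \sum_{i=1}^{B-1} (t_{1i} + t_{2i}) + 2 \beta_I (G - 1) B + \beta_E B + \beta + \tau B$$

- Tc denotes the time for the whole evaluation of G processes, including the time to send M/B packets of size B:

$$T_c = t_0 G + G \sum_{i=1}^{M} (t_{1i} + t_{2i}) + 2 \beta_I (G - 1) M + \beta_E M + (\beta + \tau B) \frac{M}{B}$$

[Figure: G processes per processor exchanging M/B packets of size B.]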

The Analytical Model

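- $T_1(G, B) = T_s (p - 1) + T_c \frac{N}{G p}$
- $1 \le G \le N/p$ and $1 \le B \le M$

R1: values (G, B) where $T_s \, p \le T_c$

[Figure: processors 0, 1, ..., p-1, each running bands of G processes with packets of size B; region R1.]

- $T_2(G, B) = T_s \left( \frac{N}{G} - 1 \right) + T_c$

R2: values (G, B) where $T_s \, p \ge T_c$

[Figure: processors 0, 1, ..., p-1, each running bands of G processes with packets of size B; region R2.]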
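To see how the two regions combine into a single prediction, the model can be written as a small C function. This is only a sketch under stated assumptions: t1 and t2 are taken as constant per iteration (the slides let them vary with i), the stripped operators are read as sums, and region R1 is taken to be the case where Ts times p does not exceed Tc. The struct layout and the names `model_ts`, `model_tc` and `model_time` are illustrative.

```c
#include <assert.h>

/* Model parameters; in practice they would come from profiling. */
struct model {
    double t0, t1, t2;      /* costs of body0, body1, body2 (t1, t2 assumed constant) */
    double beta_i, beta_e;  /* the two per-item communication constants of the model */
    double beta, tau;       /* start-up cost and per-byte cost of a packet */
    int p, N, M;            /* processors, stages, items per stage */
};

/* Startup time between two processors. */
double model_ts(const struct model *m, int G, int B)
{
    return m->t0 * (G - 1) + (double)G * (B - 1) * (m->t1 + m->t2)
         + 2.0 * m->beta_i * (G - 1) * B + m->beta_e * B
         + m->beta + m->tau * B;
}

/* Time for the whole evaluation of G processes (one band). */
double model_tc(const struct model *m, int G, int B)
{
    return m->t0 * G + (double)G * m->M * (m->t1 + m->t2)
         + 2.0 * m->beta_i * (G - 1) * m->M + m->beta_e * m->M
         + (m->beta + m->tau * B) * ((double)m->M / B);
}

/* Predicted running time for grain G and buffer size B. */
double model_time(const struct model *m, int G, int B)
{
    assert(G >= 1 && B >= 1 && B <= m->M);
    double ts = model_ts(m, G, B), tc = model_tc(m, G, B);

    if (ts * m->p <= tc)    /* region R1: processor 0 never waits */
        return ts * (m->p - 1) + tc * ((double)m->N / (G * m->p));
    return ts * ((double)m->N / G - 1) + tc;   /* region R2 */
}
```

As a sanity check, with all communication costs set to zero and t1 = 1, four stages of four items on two processors are predicted to take 8 time units, which is the total work of 16 divided evenly.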

Validation of The Model

The Tuning Problem

- Given an algorithm A, F_A is the input/output function computed by the algorithm.
- F_A is defined on D = D_1 x ... x D_n.
- F_A(z) is the output value of the algorithm A for the input z belonging to D.
- Time_M(A(z)) is the execution time of the algorithm A over the input z on a machine M.
- CTime_M(A(z)) is the analytical complexity time formula that approximates Time_M(A(z)).
- T = D_1 x ... x D_k: tuning parameters. I = D_{k+1} x ... x D_n: input parameters.
- x ∈ T if and only if x affects only the performance of the algorithm and not its output:
  - F_A(x, z) = F_A(y, z) for any x, y ∈ T
  - Time_M(A(x, z)) ≠ Time_M(A(y, z))
- The Tuning Problem is to find x0 ∈ T such that CTime_M(A(x0, z)) = min { CTime_M(A(x, z)) : x ∈ T }.

Tuning Parameters

- The list of tuning parameters in parallel computing is extensive.
- The most obvious tuning parameter is the number of processors.
- The size of the buffers used during data exchange.
- Under the Master-Slave paradigm, the size and the number of data items generated by the master.
- In the parallel Divide and Conquer technique, the size of a subproblem to be considered trivial and the processor assignment policy.
- On regular numerical HPF-like algorithms, the block size allocation.

The Methodology

- Profile the execution to compute the parameters needed for the complexity time function CTime_M(A(x, z)).
- Compute x0 ∈ T that minimizes the complexity time function: CTime_M(A(x0, z)) = min { CTime_M(A(x, z)) : x ∈ T }.
- At this point, the predictive ability of the complexity time function can be used either to predict the execution time Time_M(A(z)) of an optimal execution or to execute the algorithm with the tuning parameter x0.
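When the tuning parameters are a pair of small integers, such as a grain G and a buffer size B, the minimization step can be done by exhaustive search over the complexity function. The sketch below assumes that setting; the slides do not state which search method was used, and `cost_fn`, `tune` and `demo_cost` are hypothetical names.

```c
#include <assert.h>

/* Complexity time function restricted to the tuning pair x = (G, B);
 * `ctx` carries the input z and the profiled machine parameters. */
typedef double (*cost_fn)(int G, int B, const void *ctx);

struct tuning {
    int G, B;       /* best tuning parameters found */
    double time;    /* predicted time at (G, B) */
};

/* Exhaustive search for the (G, B) pair with the lowest predicted time,
 * for 1 <= G <= max_G and 1 <= B <= max_B. */
struct tuning tune(cost_fn cost, const void *ctx, int max_G, int max_B)
{
    struct tuning best = {1, 1, 0.0};
    assert(cost != 0 && max_G >= 1 && max_B >= 1);

    best.time = cost(1, 1, ctx);
    for (int G = 1; G <= max_G; G++)
        for (int B = 1; B <= max_B; B++) {
            double t = cost(G, B, ctx);
            if (t < best.time) {
                best.G = G;
                best.B = B;
                best.time = t;
            }
        }
    return best;
}

/* Hypothetical cost with its minimum at G = 3, B = 5, for the example. */
double demo_cost(int G, int B, const void *ctx)
{
    (void)ctx;
    return (G - 3) * (G - 3) + (B - 5) * (B - 5) + 10.0;
}
```

Because only the analytical formula is evaluated, the search costs max_G times max_B function calls and no parallel runs; the returned `time` field is the run-time prediction for the chosen parameters.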

[Diagram: Analytical Modeling; Instrumentation; Optimal Parameter Computation; Run Time Prediction; Error Prediction Computation.]

[Diagram: llp Solver, with gettime() calls around IN and OUT.]

The MALLBA Infrastructure

Performance Prediction: BA - ULL

The MALLBA Project

- Library for the resolution of combinatorial optimisation problems.
- 3 types of resolution techniques
  - Exact
  - Heuristic
  - Hybrid
- 3 implementations
  - Sequential
  - LAN
  - WAN
- Goals
  - Genericity
  - Ease of utilization
  - Locally- and geographically-distributed computation

References

- Wilkinson B., Allen M. Parallel Programming. Techniques and Applications Using Networked Workstations and Parallel Computers. 1999. Prentice-Hall.
- Gropp W., Lusk E., Skjellum A. Using MPI. Portable Parallel Programming with the Message-Passing Interface. 1999. The MIT Press.
- Pacheco P. Parallel Programming with MPI. 1997. Morgan Kaufmann Publishers.
- Wu X. Performance Evaluation, Prediction and Visualization of Parallel Systems.
- nereida.deioc.ull.es