Parallel distributed computing techniques - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Parallel distributed computing techniques


1
Parallel distributed computing techniques
  • Advisor (GVHD): Phạm Trần Vũ
  • Students: Lê Trọng Tín, Mai Văn Ninh, Phùng Quang Chánh, Nguyễn Đức Cảnh, Đặng Trung Tín
2
Contents
3
Contents
4
Motivation of Parallel Computing Techniques
  • Demand for Computational Speed
  • Continual demand for greater computational speed
    from a computer system than is currently possible
  • Areas requiring great computational speed include
    numerical modeling and simulation of scientific
    and engineering problems.
  • Computations must be completed within a
    reasonable time period.

5
Contents
6
Message-Passing Computing
  • Basics of Message-Passing Programming using
    user-level message passing libraries
  • Two primary mechanisms needed
  • A method of creating separate processes for
    execution on different computers
  • A method of sending and receiving messages
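
The two mechanisms listed above can be made concrete with a minimal MPI sketch (an assumed example, not taken from the slides): processes are created statically when the program is launched with mpirun, and a message is passed between two of them with MPI_Send/MPI_Recv.

    /* Minimal MPI sketch: static process creation plus one send/receive.
       Compile with mpicc and run with mpirun -np 2. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, x = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send x to process 1 */
        else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", x);             /* message arrived */
        }
        MPI_Finalize();
        return 0;
    }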

7
Message-Passing Computing
  • Static process creation

[Figure: static process creation, the basic MPI way - a single source file is compiled to suit each processor, producing one executable per processor, from Processor 0 to Processor n-1]
8
Message-Passing Computing
  • Dynamic process creation

[Figure: dynamic process creation, the PVM way - a process running on Processor 1 calls spawn() to start execution of process 2 on Processor 2; time runs downward]
9
Message-Passing Computing
Method of sending and receiving messages?
10
Contents
11
Pipelined Computation
  • Problem divided into a series of tasks
    that have to be completed one after the other
    (the basis of sequential programming).
  • Each task executed by a separate process or
    processor.

12
Pipelined Computation
  • Where pipelining can be used to good effect
  • 1-If more than one instance of the
    complete problem is to be executed
  • 2-If a series of data items must be
    processed, each requiring multiple operations
  • 3-If information to start the next process
    can be passed forward before the process has
    completed all its internal operations

13
Pipelined Computation
  • Execution time is m + p - 1 cycles for a p-stage
    pipeline and m instances of the problem; for example,
    m = 10 instances on a p = 4 stage pipeline complete
    in 10 + 4 - 1 = 13 cycles.

14
Pipelined Computation
15
Pipelined Computation
16
Pipelined Computation
17
Pipelined Computations
18
Pipelined Computation
19
Contents
20
Ideal Parallel Computation
  • A computation that can obviously be divided into
    a number of completely independent parts
  • Each of which can be executed by a separate
    processor
  • Each process can do its task without any
    interaction with the other processes

21
Ideal Parallel Computation
  • Practical embarrassingly parallel computation
    with static process creation and the master-slave
    approach

22
Ideal Parallel Computation
  • Practical embarrassingly parallel computation
    with dynamic process creation and the master-slave
    approach

23
Embarrassingly parallel examples
  • Geometrical Transformations of Images
  • Mandelbrot set
  • Monte Carlo Method

24
Geometrical Transformations of Images
  • A transformation is performed on the coordinates of
    each pixel to move the position of the pixel without
    affecting its value
  • The transformation of each pixel is totally
    independent of all other pixels
  • Some geometrical operations
  • Shifting
  • Scaling
  • Rotation
  • Clipping
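
Since each output pixel depends only on its own input coordinates, a block of rows can be handed to each process. The sketch below is an assumed example, not from the slides (the 640 x 480 size and the shift_rows name are illustrative); it applies a shift to one block of rows.

    /* Shift transformation applied to one block of rows: the new position is
       x' = x + dx, y' = y + dy, computed independently for every pixel. */
    #define W 640
    #define H 480

    void shift_rows(const unsigned char in[H][W], unsigned char out[H][W],
                    int row_start, int row_end, int dx, int dy)
    {
        for (int y = row_start; y < row_end; y++)      /* rows owned by this process */
            for (int x = 0; x < W; x++) {
                int nx = x + dx, ny = y + dy;
                if (nx >= 0 && nx < W && ny >= 0 && ny < H)
                    out[ny][nx] = in[y][x];            /* pixel value is unchanged */
            }
    }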

25
Geometrical Transformations of Images
  • Partitioning into regions for individual
    processes
  • Square region for each process, or row region for
    each process

[Figure: a 640 x 480 image partitioned two ways - into 80 x 80 square regions, one per process, or into row strips of 640 x 10 pixels, one per process]
26
Mandelbrot Set
  • Set of points in a complex plane that are
    quasi-stable when computed by iterating the
    function z(k+1) = z(k)^2 + c
  • where z(k+1) is the (k+1)th iteration of the
    complex number z = a + bi and c is a complex number
    giving the position of the point in the complex plane.
    The initial value for z is zero.
  • Iterations are continued until the magnitude of z is
    greater than 2 or the number of iterations reaches an
    arbitrary limit. The magnitude of z is the length of
    the vector, given by z_length = sqrt(a^2 + b^2)
    (a sketch of the iteration loop follows).
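
A sketch of the per-point computation, reconstructed along the lines of the textbook's cal_pixel routine (the MAX_ITER value is an assumption): iterate z = z^2 + c on the real and imaginary parts until |z| exceeds 2 or the iteration limit is reached. Each point is independent of every other point, which is what makes the problem embarrassingly parallel.

    #define MAX_ITER 256

    int cal_pixel(float c_real, float c_imag)
    {
        float z_real = 0.0f, z_imag = 0.0f, lengthsq;    /* z starts at zero */
        int count = 0;
        do {
            float temp = z_real * z_real - z_imag * z_imag + c_real;  /* Re(z^2 + c) */
            z_imag = 2.0f * z_real * z_imag + c_imag;                 /* Im(z^2 + c) */
            z_real = temp;
            lengthsq = z_real * z_real + z_imag * z_imag;             /* |z|^2 */
            count++;
        } while (lengthsq < 4.0f && count < MAX_ITER);                /* i.e. |z| < 2 */
        return count;                                                 /* used to colour the pixel */
    }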

27
Mandelbrot Set

28
Mandelbrot Set

29
Mandelbrot Set
  • c.real = real_min + x * (real_max - real_min)/disp_width
  • c.imag = imag_min + y * (imag_max - imag_min)/disp_height
  • Static Task Assignment
  • Simply divide the region into a fixed number of
    parts, each computed by a separate processor
  • Not very successful because different regions
    require different numbers of iterations and time
  • Dynamic Task Assignment
  • Have processors request new regions after computing
    previous regions

30
Mandelbrot Set
  • Dynamic Task Assignment
  • Have processors request new regions after computing
    previous regions

31
Monte Carlo Method
  • Another embarrassingly parallel computation
  • Monte Carlo methods use random selections
  • Example: to calculate π
  • A circle is formed within a square, with unit radius
    so that the square has sides 2 x 2. The ratio of the
    area of the circle to the area of the square is given by
    (π x 1^2) / (2 x 2) = π/4

32
Monte Carlo Method
  • One quadrant of the construction can be described
    by the integral: the integral from 0 to 1 of
    sqrt(1 - x^2) dx = π/4
  • Random pairs of numbers, (xr, yr), are generated, each
    between 0 and 1. A pair is counted as inside the circle
    if xr^2 + yr^2 <= 1, that is, if sqrt(xr^2 + yr^2) <= 1
    (a sketch of this estimator follows).
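
A minimal sketch (assumed, not from the slides) of this estimator: sample random points in the unit square, count the fraction falling inside the quarter circle, and multiply by 4.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long N = 1000000;                       /* number of random samples */
        long in_circle = 0;
        for (long i = 0; i < N; i++) {
            double x = (double)rand() / RAND_MAX;     /* xr in [0, 1] */
            double y = (double)rand() / RAND_MAX;     /* yr in [0, 1] */
            if (x * x + y * y <= 1.0)
                in_circle++;                          /* point lies inside the circle */
        }
        printf("pi is approximately %f\n", 4.0 * in_circle / N);
        return 0;
    }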

33
Monte Carlo Method
  • Alternative method to compute an integral
  • Use random values of x to compute f(x) and sum the
    values of f(x):
    area = lim (N -> infinity) (1/N) x (sum of f(xr)
    over the N samples) x (x2 - x1)
  • where the xr are randomly generated values of x
    between x1 and x2
  • The Monte Carlo method is very useful if the function
    cannot be integrated numerically (perhaps having a
    large number of variables)

34
Monte Carlo Method
  • Example: computing an integral over [x1, x2]
  • Sequential code (a reconstructed sketch follows)
  • The routine randv(x1, x2) returns a pseudorandom
    number between x1 and x2
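
Since the sequential code itself did not survive the transcript, here is a reconstructed sketch. The integrand f(x) = x^2 - 3x is an assumption for illustration; randv(x1, x2) is defined here exactly as the slide describes it, returning a pseudorandom number between x1 and x2.

    #include <stdlib.h>

    static double randv(double x1, double x2)          /* pseudorandom value in [x1, x2] */
    {
        return x1 + (x2 - x1) * rand() / (double)RAND_MAX;
    }

    double integrate(double x1, double x2, long N)
    {
        double sum = 0.0;
        for (long i = 0; i < N; i++) {
            double xr = randv(x1, x2);                 /* random sample in [x1, x2] */
            sum += xr * xr - 3.0 * xr;                 /* accumulate f(xr) */
        }
        return (x2 - x1) * sum / N;                    /* mean value times interval width */
    }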

35
Monte Carlo Method
  • Parallel Monte Carlo integration

[Figure: parallel Monte Carlo integration - slave processes send requests to a separate random-number process, receive random numbers, and return partial sums to the master]
36
Contents
37
Partitioning and Divide-and-Conquer
Strategies
38
Partitioning
  • Partitioning simply divides the problem into
    parts.
  • It is the basis of all parallel programming.
  • Partitioning can be applied to the program data
    (data partitioning or domain decomposition) and
    the functions of a program (functional
    decomposition).
  • It is much less common to find concurrent
    functions in a problem, but data partitioning is
    a main strategy for parallel programming.

39
Partitioning (cont)
A sequence of numbers, x0, ..., xn-1, is to be
added
n = number of items, p = number of processors
Partitioning a sequence of numbers into parts and
adding them (a sketch follows)
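
A minimal sketch (assumed, not from the slides) of this partitioning: the n numbers are divided into p parts, each part is summed separately (each iteration of the outer loop could be a separate process), and the p partial sums are then added together.

    #include <stdio.h>

    #define N 1000
    #define P 4                                       /* number of parts/processes */

    int main(void)
    {
        int x[N];
        for (int i = 0; i < N; i++) x[i] = i + 1;     /* example data: 1..N */

        long partial[P] = {0}, total = 0;
        int part = N / P;                             /* assume P divides N evenly */
        for (int k = 0; k < P; k++)                   /* each part summed independently */
            for (int i = k * part; i < (k + 1) * part; i++)
                partial[k] += x[i];
        for (int k = 0; k < P; k++)
            total += partial[k];                      /* combine the partial sums */
        printf("sum = %ld\n", total);
        return 0;
    }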
40
Divide and Conquer
  • Characterized by dividing problem into
    subproblems of same form as larger problem.
    Further divisions into still smaller
    sub-problems, usually done by recursion.
  • Recursive divide and conquer amenable to
    parallelization because separate processes can be
    used for divided parts. Also usually data is
    naturally localized.

41
Divide and Conquer (cont)
  • A sequential recursive definition for adding
    a list of numbers is
  • int add(int *s)                /* add list of numbers, s */
  • {
  •   if (number(s) <= 2) return (n1 + n2);
  •   else {
  •     Divide(s, s1, s2);         /* divide s into two parts, s1, s2 */
  •     part_sum1 = add(s1);       /* recursive calls to add sub lists */
  •     part_sum2 = add(s2);
  •     return (part_sum1 + part_sum2);
  •   }
  • }

42
Divide and Conquer (cont)
[Figure: tree construction for divide and conquer - the initial problem is divided repeatedly into subproblems, with the final tasks at the leaves]
43
Divide and Conquer (cont)
[Figure: dividing a list x0 ... xn-1 across eight processes - P0 starts with the original list; the problem is divided in stages, P0 -> P0, P4 -> P0, P2, P4, P6 -> P0 through P7, which hold the final tasks]
44
Partitioning/Divide and Conquer Examples
  • Many possibilities.
  • Operations on sequences of numbers, such as simply
    adding them together
  • Several sorting algorithms can often be
    partitioned or constructed in a recursive fashion
  • Numerical integration
  • N-body problem

45
Bucket sort
  • One bucket assigned to hold numbers that fall
    within each region.
  • Numbers in each bucket sorted using a sequential
    sorting algorithm.
  • Sequential sorting time complexity: O(n log(n/m)).
  • Works well if the original numbers uniformly
    distributed across a known interval, say 0 to a -
    1.

n = number of items, m = number of buckets (a sequential sketch follows)
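
A sequential sketch (assumed, not from the slides): each number's bucket is found directly from its value, and each bucket is then sorted on its own (qsort stands in for any comparison sort of the roughly n/m items per bucket).

    #include <stdlib.h>

    static int cmp_int(const void *p, const void *q)
    {
        int a = *(const int *)p, b = *(const int *)q;
        return (a > b) - (a < b);
    }

    /* Sort n numbers uniformly distributed in [0, a) using m buckets. */
    void bucket_sort(int *x, int n, int a, int m)
    {
        int **bucket = malloc(m * sizeof *bucket);
        int *count = calloc(m, sizeof *count);
        for (int b = 0; b < m; b++)
            bucket[b] = malloc(n * sizeof **bucket);   /* worst case: everything in one bucket */

        for (int i = 0; i < n; i++) {                  /* place each number in its bucket */
            int b = (int)((long long)x[i] * m / a);
            bucket[b][count[b]++] = x[i];
        }
        int k = 0;
        for (int b = 0; b < m; b++) {                  /* sort each bucket, then concatenate */
            qsort(bucket[b], count[b], sizeof(int), cmp_int);
            for (int i = 0; i < count[b]; i++)
                x[k++] = bucket[b][i];
            free(bucket[b]);
        }
        free(bucket);
        free(count);
    }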
46
Parallel version of bucket sort
  • Simple approach
  • Assign one processor for each bucket.

47
Further Parallelization
  • Partition sequence into m regions, one region for
    each processor.
  • Each processor maintains p small buckets and
    separates the numbers in its region into its own
    small buckets.
  • Small buckets are then emptied into p final buckets
    for sorting, which requires each processor to send
    one small bucket to each of the other processors
    (bucket i to processor i).

48
Another parallel version of bucket sort
  • Introduces new message-passing operation -
    all-to-all broadcast.

49
all-to-all broadcast routine
  • Sends data from each process to every other
    process

50
all-to-all broadcast routine (cont)
  • The all-to-all routine actually transfers the rows of
    an array to columns
  • It transposes a matrix (see the MPI sketch below).
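
A minimal sketch (assumed, not from the slides) of this transpose effect using MPI_Alltoall: each of n processes starts with one row of an n x n matrix and finishes with one column. Run with mpirun -np <n>.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, n;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &n);

        int *row = malloc(n * sizeof(int));   /* this process's row of the matrix */
        int *col = malloc(n * sizeof(int));   /* will receive this process's column */
        for (int j = 0; j < n; j++)
            row[j] = rank * n + j;            /* element (rank, j) of the matrix */

        /* element j of row goes to process j; element i received comes from process i */
        MPI_Alltoall(row, 1, MPI_INT, col, 1, MPI_INT, MPI_COMM_WORLD);

        printf("process %d now holds column %d:", rank, rank);
        for (int i = 0; i < n; i++) printf(" %d", col[i]);
        printf("\n");
        free(row); free(col);
        MPI_Finalize();
        return 0;
    }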

51
Contents
52
Synchronous Computations
  • Synchronous
  • Barrier
  • Barrier Implementation
  • Centralized Counter implementation
  • Tree Barrier Implementation
  • Butterfly Barrier
  • Synchronized Computations
  • Fully synchronous
  • Data Parallel Computations
  • Synchronous Iteration (Synchronous Parallelism)
  • Locally synchronous
  • Heat Distribution Problem
  • Sequential Code
  • Parallel Code

53
Barrier
  • A basic mechanism for synchronizing processes -
    inserted at the point in each process where it
    must wait.
  • All processes can continue from this point when
    all the processes have reached it
  • Processes reaching barrier at different times

54
Barrier Image
55
Barrier Implementation
  • Centralized Counter implementation ( linear
    barrier)
  • Tree Barrier Implementation.
  • Butterfly Barrier
  • Local Synchronization
  • Deadlock

56
Centralized Counter implementation
  • Has two phases
  • Arrival phase (trapping)
  • Departure phase (release)
  • A process enters arrival phase and does not leave
    this phase until all processes have arrived in
    this phase
  • Then processes move to departure phase and are
    released

57
  • Example code
  • Master:
  •   for (i = 0; i < n; i++)   /* count slaves as they reach the barrier */
  •     recv(Pany);
  •   for (i = 0; i < n; i++)   /* release slaves */
  •     send(Pi);
  • Slave processes:
  •   send(Pmaster);
  •   recv(Pmaster);

58
Tree Barrier Implementation
  • Suppose 8 processes, P0, P1, P2, P3, P4, P5, P6,
    P7
  • First stage
  • P1 sends message to P0 (when P1 reaches its
    barrier)
  • P3 sends message to P2 (when P3 reaches its
    barrier)
  • P5 sends message to P4 (when P5 reaches its
    barrier)
  • P7 sends message to P6 (when P7 reaches its
    barrier)
  • Second stage
  • P2 sends message to P0 (P2 and P3 have reached
    their barriers)
  • P6 sends message to P4 (P6 and P7 have reached
    their barriers)
  • Third stage
  • P4 sends message to P0 (P4, P5, P6, P7 have
    reached their barriers)
  • P0 terminates the arrival phase (when P0 has reached
    its barrier and received the message from P4)

59
Tree Barrier Implementation
  • Release with a reverse tree construction.

Tree barrier
60
Butterfly Barrier 
  • This would be used if data were exchanged between
    the processes

61
Local Synchronization
  • Suppose a process Pi needs to be synchronized
    and to exchange data with process Pi-1 and
    process Pi+1
  • Not a perfect three-process barrier because
    process Pi-1 will only synchronize with Pi and
    continue as soon as Pi allows. Similarly, process
    Pi+1 only synchronizes with Pi.

62
Synchronized Computations
  • Fully synchronous
  • In fully synchronous, all processes involved in
    the computation must be synchronized.
  • Data Parallel Computations
  • Synchronous Iteration (Synchronous Parallelism)
  • Locally synchronous
  • In locally synchronous, processes only need to
    synchronize with a set of logically nearby
    processes, not all processes involved in the
    computation
  • Heat Distribution Problem
  • Sequential Code
  • Parallel Code

63
Data Parallel Computations
  • Same operation performed on different data
    elements simultaneously (SIMD)
  • Data parallel programming is very convenient for
    two reasons
  • The first is its ease of programming (essentially
    only one program)
  • The second is that it can scale easily to larger
    problem sizes

64
Synchronous Iteration
  • Each iteration is composed of several processes that
    start together at the beginning of the iteration. The
    next iteration cannot begin until all processes have
    finished the previous iteration. Using forall:
  • for (j = 0; j < n; j++)          /* for each synchronous iteration */
  •   forall (i = 0; i < N; i++)     /* N processes, each using */
  •     body(i);                     /* a specific value of i */

65
Synchronous Iteration
  • Solving a General System of Linear Equations by
    Iteration
  • Suppose the equations are of a general form with
    n equations and n unknowns, where the unknowns are
    x0, x1, x2, ..., xn-1 (0 <= i < n):
  • a(n-1,0)x0 + a(n-1,1)x1 + a(n-1,2)x2 + ... + a(n-1,n-1)x(n-1) = b(n-1)
  • ...
  • a(2,0)x0 + a(2,1)x1 + a(2,2)x2 + ... + a(2,n-1)x(n-1) = b(2)
  • a(1,0)x0 + a(1,1)x1 + a(1,2)x2 + ... + a(1,n-1)x(n-1) = b(1)
  • a(0,0)x0 + a(0,1)x1 + a(0,2)x2 + ... + a(0,n-1)x(n-1) = b(0)

66
Synchronous Iteration
  • By rearranging the ith equation
  • a(i,0)x0 + a(i,1)x1 + a(i,2)x2 + ... + a(i,n-1)x(n-1) = b(i)
  • to
  • x(i) = (1/a(i,i)) [ b(i) - ( a(i,0)x0 + a(i,1)x1 + ...
    + a(i,i-1)x(i-1) + a(i,i+1)x(i+1) + ... + a(i,n-1)x(n-1) ) ]
  • or, equivalently, x(i) = (1/a(i,i)) ( b(i) - sum over j != i
    of a(i,j)x(j) ), which can be iterated until the values
    converge (Jacobi iteration; a sketch follows)
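
A minimal sketch (assumed, not from the slides) of this iteration in sequential form; in a synchronous parallel version each unknown could be assigned to a separate process, with a barrier between iterations. N and LIMIT are illustrative values.

    #define N 4
    #define LIMIT 100

    void jacobi(const double a[N][N], const double b[N], double x[N])
    {
        double new_x[N];
        for (int i = 0; i < N; i++) x[i] = b[i];           /* simple initial guess */
        for (int iter = 0; iter < LIMIT; iter++) {
            for (int i = 0; i < N; i++) {                  /* one synchronous iteration */
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    if (j != i) sum += a[i][j] * x[j];
                new_x[i] = (b[i] - sum) / a[i][i];         /* x_i = (b_i - sum)/a_ii */
            }
            for (int i = 0; i < N; i++) x[i] = new_x[i];   /* update all x together */
        }
    }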

67
Heat Distribution Problem
  • An area has known temperatures along each of its
    edges. Find the temperature distribution within.
    Divide the area into a fine mesh of points, h(i,j).
    The temperature at an inside point is taken to be the
    average of the temperatures of the four neighboring
    points.
  • The temperature of each point is found by iterating
    the equation
  • h(i,j) = ( h(i-1,j) + h(i+1,j) + h(i,j-1) + h(i,j+1) ) / 4
    (0 < i < n, 0 < j < n)

68
Heat Distribution Problem
69
Sequential Code
  • Using a fixed number of iterations
  • for (iteration = 0; iteration < limit; iteration++) {
  •   for (i = 1; i < n; i++)
  •     for (j = 1; j < n; j++)
  •       g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j]
  •                       + h[i][j-1] + h[i][j+1]);
  •   for (i = 1; i < n; i++)          /* update points */
  •     for (j = 1; j < n; j++)
  •       h[i][j] = g[i][j];
  • }

70
Parallel Code
  • With a fixed number of iterations, for process P(i,j)
    (except for the boundary points):
  • for (iteration = 0; iteration < limit; iteration++) {
  •   g = 0.25 * (w + x + y + z);
  •   send(&g, P(i-1,j));              /* non-blocking sends */
  •   send(&g, P(i+1,j));
  •   send(&g, P(i,j-1));
  •   send(&g, P(i,j+1));
  •   recv(&w, P(i-1,j));              /* synchronous receives */
  •   recv(&x, P(i+1,j));
  •   recv(&y, P(i,j-1));
  •   recv(&z, P(i,j+1));
  • }

Local Barrier
71
Contents
72
Load Balancing and Termination Detection
73
Load Balancing and Termination Detection
74
Load Balancing
75
Load Balancing and Termination Detection
76
Static Load Balancing
  • Round robin algorithm: passes out tasks in
    sequential order of processes, coming back to the
    first when all processes have been given a task
  • Randomized algorithms: select processes at
    random to take tasks
  • Recursive bisection: recursively divides the
    problem into subproblems of equal computational
    effort while minimizing message passing
  • Simulated annealing: an optimization technique
  • Genetic algorithm: another optimization technique

77
Static Load Balancing
  • Several fundamental flaws with static load
    balancing even if a mathematical solution exists
  • Very difficult to estimate accurately the
    execution times of various parts of a program
    without actually executing the parts.
  • Communication delays that vary under different
    circumstances
  • Some problems have an indeterminate number of
    steps to reach their solution.

78
Dynamic Load Balancing
79
Centralized dynamic load balancing
  • Tasks handed out from a centralized location.
    Master-slave structure
  • Master process(or) holds the collection of tasks
    to be performed.
  • Tasks are sent to the slave processes. When a
    slave process completes one task, it requests
    another task from the master process.
  • (Terms used: work pool, replicated worker,
    processor farm.) A sketch of such a work pool follows.
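
A minimal sketch (assumed, not from the slides) of such a work pool with MPI: the master holds the task queue and hands out task indices on request; slaves request another task whenever they finish one, and a stop tag releases them when the queue is empty. The tag names and do_task() are illustrative, not from the presentation.

    #include <mpi.h>

    #define TASK_TAG 1
    #define STOP_TAG 2
    #define NUM_TASKS 100

    static int do_task(int t) { return t * t; }        /* placeholder computation */

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                               /* master: holds the task queue */
            int next = 0, active = size - 1, result;
            MPI_Status st;
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);         /* a request (carrying the last result) */
                if (next < NUM_TASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TASK_TAG, MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, STOP_TAG, MPI_COMM_WORLD);
                    active--;                          /* queue empty: release this slave */
                }
            }
        } else {                                       /* slave: request tasks until released */
            int task, result = 0;
            MPI_Status st;
            for (;;) {
                MPI_Send(&result, 1, MPI_INT, 0, TASK_TAG, MPI_COMM_WORLD);   /* request */
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == STOP_TAG) break;
                result = do_task(task);
            }
        }
        MPI_Finalize();
        return 0;
    }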

80
Centralized dynamic load balancing
81
Termination
  • Computation terminates when
  • The task queue is empty and
  • Every process has made a request for
    another task without any new tasks being
    generated
  • It is not sufficient to terminate when the task queue
    is empty while one or more processes are still running,
    because a running process may provide new tasks for
    the task queue.

82
Decentralized dynamic load balancing
83
Fully Distributed Work Pool
  • Processes execute tasks from each other
  • Tasks could be transferred by
  • - Receiver-initiated methods
  • - Sender-initiated methods

84
Process Selection
  • Algorithms for selecting a process
  • Round robin algorithm: process Pi requests tasks
    from process Px, where x is given by a counter
    that is incremented after each request, using
    modulo n arithmetic (n processes), excluding x = i.
  • Random polling algorithm: process Pi requests
    tasks from process Px, where x is a number
    selected randomly between 0 and n - 1
    (excluding i). A sketch of both rules follows.
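
A minimal sketch (assumed, not from the slides) of the two selection rules: each process i either steps a counter modulo n or draws a random target, always skipping itself.

    #include <stdlib.h>

    /* Round robin: counter stepped modulo n after each request, never returning i itself. */
    int next_round_robin(int i, int n, int *counter)
    {
        int x = *counter % n;
        if (x == i) x = (x + 1) % n;      /* skip ourselves */
        *counter = (x + 1) % n;           /* advance for the next request */
        return x;
    }

    /* Random polling: a target chosen uniformly from the other n - 1 processes. */
    int next_random(int i, int n)
    {
        int x = rand() % (n - 1);         /* 0 .. n-2 */
        return (x < i) ? x : x + 1;       /* map onto 0 .. n-1 excluding i */
    }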

85
Distributed Termination Detection Algorithms
  • Termination Conditions
  • Application-specific local termination conditions
    exist throughout the collection of processes, at
    time t.
  • There are no messages in transit between
    processes at time t.
  • Second condition necessary because a message
    in transit might restart a terminated process.
    More difficult to recognize. The time that it
    takes for messages to travel between processes
    will not be known in advance.

86
Using Acknowledgment Messages
  • Each process is in one of two states
  • Inactive - without any task to perform
  • Active
  • The process that sent the task that made it enter
    the active state becomes its parent.

87
Using Acknowledgment Messages
  • When a process receives a task, it immediately
    sends an acknowledgment message, unless the
    process it received the task from is its parent
    process. It only sends an acknowledgment message to
    its parent when it is ready to become inactive,
    i.e. when
  • its local termination condition exists (all tasks
    are completed), and
  • it has transmitted all its acknowledgments for
    tasks it has received, and
  • it has received all its acknowledgments for tasks
    it has sent out.
  • A process must become inactive before its parent
    process. When the first process becomes idle, the
    computation can terminate.

88
Load balancing/termination detection example
Example: finding the shortest distance between two
points on a graph.
89
References: Parallel Programming: Techniques and
Applications Using Networked Workstations and
Parallel Computers, Barry Wilkinson and Michael
Allen, Second Edition, Prentice Hall, 2005.
90
Q&A
91
Thank You !