Parallel distributed computing techniques - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Parallel distributed computing techniques


1
Parallel distributed computing techniques
  • Advisor (GVHD): Phạm Trần Vũ
  • Students: Lê Trọng Tín, Mai Văn Ninh, Phùng Quang Chánh, Nguyễn Đức Cảnh, Đặng Trung Tín
2
Contents
3
Contents
4
Motivation of Parallel Computing Techniques
  • Demand for Computational Speed
  • Continual demand for greater computational speed
    from a computer system than is currently possible
  • Areas requiring great computational speed include
    numerical modeling and simulation of scientific
    and engineering problems.
  • Computations must be completed within a
    reasonable time period.

5
Contents
6
Message-Passing Computing
  • Basics of Message-Passing Programming using
    user-level message passing libraries
  • Two primary mechanisms needed
  • A method of creating separate processes for
    execution on different computers
  • A method of sending and receiving messages
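
The two mechanisms listed above can be made concrete with a minimal MPI sketch (an assumed example, not taken from the slides): processes are created statically when the program is launched with mpirun, and a message is passed between two of them with MPI_Send/MPI_Recv.

    /* Minimal MPI sketch: static process creation plus one send/receive.
       Compile with mpicc and run with mpirun -np 2. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, x = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send x to process 1 */
        else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", x);             /* message arrived */
        }
        MPI_Finalize();
        return 0;
    }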

7
Message-Passing Computing
  • Static process creation

[Figure: static process creation, the basic MPI way - a single source file is compiled to suit each processor, producing one executable per processor, from Processor 0 to Processor n-1]
8
Message-Passing Computing
  • Dynamic process creation

[Figure: dynamic process creation, the PVM way - a process running on Processor 1 calls spawn() to start execution of process 2 on Processor 2; time runs downward]
9
Message-Passing Computing
Method of sending and receiving messages?
10
Contents
11
Pipelined Computation
  • Problem divided into a series of tasks
    that have to be completed one after the other
    (the basis of sequential programming).
  • Each task executed by a separate process or
    processor.

12
Pipelined Computation
  • Where pipelining can be used to good effect
  • 1-If more than one instance of the
    complete problem is to be executed
  • 2-If a series of data items must be
    processed, each requiring multiple operations
  • 3-If information to start the next process
    can be passed forward before the process has
    completed all its internal operations

13
Pipelined Computation
  • Execution time is m + p - 1 cycles for a p-stage
    pipeline and m instances of the problem; for example,
    m = 10 instances on a p = 4 stage pipeline complete
    in 10 + 4 - 1 = 13 cycles.

14
Pipelined Computation
15
Pipelined Computation
16
Pipelined Computation
17
Pipelined Computations
18
Pipelined Computation
19
Contents
20
Ideal Parallel Computation
  • A computation that can obviously be divided into
    a number of completely independent parts
  • Each of which can be executed by a separate
    processor
  • Each process can do its task without any
    interaction with the other processes

21
Ideal Parallel Computation
  • Practical embarrassingly parallel computation
    with static process creation and the master-slave
    approach

22
Ideal Parallel Computation
  • Practical embarrassingly parallel computation
    with dynamic process creation and the master-slave
    approach

23
Embarrassingly parallel examples
  • Geometrical Transformations of Images
  • Mandelbrot set
  • Monte Carlo Method

24
Geometrical Transformations of Images
  • A transformation is performed on the coordinates of
    each pixel to move the position of the pixel without
    affecting its value
  • The transformation of each pixel is totally
    independent of all other pixels
  • Some geometrical operations
  • Shifting
  • Scaling
  • Rotation
  • Clipping
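
Since each output pixel depends only on its own input coordinates, a block of rows can be handed to each process. The sketch below is an assumed example, not from the slides (the 640 x 480 size and the shift_rows name are illustrative); it applies a shift to one block of rows.

    /* Shift transformation applied to one block of rows: the new position is
       x' = x + dx, y' = y + dy, computed independently for every pixel. */
    #define W 640
    #define H 480

    void shift_rows(const unsigned char in[H][W], unsigned char out[H][W],
                    int row_start, int row_end, int dx, int dy)
    {
        for (int y = row_start; y < row_end; y++)      /* rows owned by this process */
            for (int x = 0; x < W; x++) {
                int nx = x + dx, ny = y + dy;
                if (nx >= 0 && nx < W && ny >= 0 && ny < H)
                    out[ny][nx] = in[y][x];            /* pixel value is unchanged */
            }
    }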

25
Geometrical Transformations of Images
  • Partitioning into regions for individual
    processes
  • Square region for each process, or row region for
    each process

[Figure: a 640 x 480 image partitioned two ways - into 80 x 80 square regions, one per process, or into row strips of 640 x 10 pixels, one per process]
26
Mandelbrot Set
  • Set of points in a complex plane that are
    quasi-stable when computed by iterating the
    function z(k+1) = z(k)^2 + c
  • where z(k+1) is the (k+1)th iteration of the
    complex number z = a + bi and c is a complex number
    giving the position of the point in the complex plane.
    The initial value for z is zero.
  • Iterations are continued until the magnitude of z is
    greater than 2 or the number of iterations reaches an
    arbitrary limit. The magnitude of z is the length of
    the vector, given by z_length = sqrt(a^2 + b^2)
    (a sketch of the iteration loop follows).
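
A sketch of the per-point computation, reconstructed along the lines of the textbook's cal_pixel routine (the MAX_ITER value is an assumption): iterate z = z^2 + c on the real and imaginary parts until |z| exceeds 2 or the iteration limit is reached. Each point is independent of every other point, which is what makes the problem embarrassingly parallel.

    #define MAX_ITER 256

    int cal_pixel(float c_real, float c_imag)
    {
        float z_real = 0.0f, z_imag = 0.0f, lengthsq;    /* z starts at zero */
        int count = 0;
        do {
            float temp = z_real * z_real - z_imag * z_imag + c_real;  /* Re(z^2 + c) */
            z_imag = 2.0f * z_real * z_imag + c_imag;                 /* Im(z^2 + c) */
            z_real = temp;
            lengthsq = z_real * z_real + z_imag * z_imag;             /* |z|^2 */
            count++;
        } while (lengthsq < 4.0f && count < MAX_ITER);                /* i.e. |z| < 2 */
        return count;                                                 /* used to colour the pixel */
    }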

27
Mandelbrot Set

28
Mandelbrot Set

29
Mandelbrot Set
  • c.real = real_min + x * (real_max - real_min)/disp_width
  • c.imag = imag_min + y * (imag_max - imag_min)/disp_height
  • Static Task Assignment
  • Simply divide the region into a fixed number of
    parts, each computed by a separate processor
  • Not very successful because different regions
    require different numbers of iterations and time
  • Dynamic Task Assignment
  • Have processors request new regions after computing
    previous regions

30
Mandelbrot Set
  • Dynamic Task Assignment
  • Have processors request new regions after computing
    previous regions

31
Monte Carlo Method
  • Another embarrassingly parallel computation
  • Monte Carlo methods use random selections
  • Example: to calculate π
  • A circle is formed within a square, with unit radius
    so that the square has sides 2 x 2. The ratio of the
    area of the circle to the area of the square is given by
    (π x 1^2) / (2 x 2) = π/4

32
Monte Carlo Method
  • One quadrant of the construction can be described
    by the integral: the integral from 0 to 1 of
    sqrt(1 - x^2) dx = π/4
  • Random pairs of numbers, (xr, yr), are generated, each
    between 0 and 1. A pair is counted as inside the circle
    if xr^2 + yr^2 <= 1, that is, if sqrt(xr^2 + yr^2) <= 1
    (a sketch of this estimator follows).
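
A minimal sketch (assumed, not from the slides) of this estimator: sample random points in the unit square, count the fraction falling inside the quarter circle, and multiply by 4.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long N = 1000000;                       /* number of random samples */
        long in_circle = 0;
        for (long i = 0; i < N; i++) {
            double x = (double)rand() / RAND_MAX;     /* xr in [0, 1] */
            double y = (double)rand() / RAND_MAX;     /* yr in [0, 1] */
            if (x * x + y * y <= 1.0)
                in_circle++;                          /* point lies inside the circle */
        }
        printf("pi is approximately %f\n", 4.0 * in_circle / N);
        return 0;
    }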

33
Monte Carlo Method
  • Alternative method to compute an integral
  • Use random values of x to compute f(x) and sum the
    values of f(x):
    area = lim (N -> infinity) (1/N) x (sum of f(xr)
    over the N samples) x (x2 - x1)
  • where the xr are randomly generated values of x
    between x1 and x2
  • The Monte Carlo method is very useful if the function
    cannot be integrated numerically (perhaps having a
    large number of variables)

34
Monte Carlo Method
  • Example: computing an integral over [x1, x2]
  • Sequential code (a reconstructed sketch follows)
  • The routine randv(x1, x2) returns a pseudorandom
    number between x1 and x2
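
Since the sequential code itself did not survive the transcript, here is a reconstructed sketch. The integrand f(x) = x^2 - 3x is an assumption for illustration; randv(x1, x2) is defined here exactly as the slide describes it, returning a pseudorandom number between x1 and x2.

    #include <stdlib.h>

    static double randv(double x1, double x2)          /* pseudorandom value in [x1, x2] */
    {
        return x1 + (x2 - x1) * rand() / (double)RAND_MAX;
    }

    double integrate(double x1, double x2, long N)
    {
        double sum = 0.0;
        for (long i = 0; i < N; i++) {
            double xr = randv(x1, x2);                 /* random sample in [x1, x2] */
            sum += xr * xr - 3.0 * xr;                 /* accumulate f(xr) */
        }
        return (x2 - x1) * sum / N;                    /* mean value times interval width */
    }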

35
Monte Carlo Method
  • Parallel Monte Carlo integration

[Figure: parallel Monte Carlo integration - slave processes send requests to a separate random-number process, receive random numbers, and return partial sums to the master]
36
Contents
37
Partitioning and Divide-and-Conquer
Strategies
38
Partitioning
  • Partitioning simply divides the problem into
    parts.
  • It is the basis of all parallel programming.
  • Partitioning can be applied to the program data
    (data partitioning or domain decomposition) and
    the functions of a program (functional
    decomposition).
  • It is much less common to find concurrent
    functions in a problem, but data partitioning is
    a main strategy for parallel programming.

39
Partitioning (cont)
A sequence of numbers, x0, ..., xn-1, is to be
added
n = number of items, p = number of processors
Partitioning a sequence of numbers into parts and
adding them (a sketch follows)
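
A minimal sketch (assumed, not from the slides) of this partitioning: the n numbers are divided into p parts, each part is summed separately (each iteration of the outer loop could be a separate process), and the p partial sums are then added together.

    #include <stdio.h>

    #define N 1000
    #define P 4                                       /* number of parts/processes */

    int main(void)
    {
        int x[N];
        for (int i = 0; i < N; i++) x[i] = i + 1;     /* example data: 1..N */

        long partial[P] = {0}, total = 0;
        int part = N / P;                             /* assume P divides N evenly */
        for (int k = 0; k < P; k++)                   /* each part summed independently */
            for (int i = k * part; i < (k + 1) * part; i++)
                partial[k] += x[i];
        for (int k = 0; k < P; k++)
            total += partial[k];                      /* combine the partial sums */
        printf("sum = %ld\n", total);
        return 0;
    }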
40
Divide and Conquer
  • Characterized by dividing problem into
    subproblems of same form as larger problem.
    Further divisions into still smaller
    sub-problems, usually done by recursion.
  • Recursive divide and conquer amenable to
    parallelization because separate processes can be
    used for divided parts. Also usually data is
    naturally localized.

41
Divide and Conquer (cont)
  • A sequential recursive definition for adding
    a list of numbers is
  • int add(int *s)                /* add list of numbers, s */
  • {
  •   if (number(s) <= 2) return (n1 + n2);
  •   else {
  •     Divide(s, s1, s2);         /* divide s into two parts, s1, s2 */
  •     part_sum1 = add(s1);       /* recursive calls to add sub lists */
  •     part_sum2 = add(s2);
  •     return (part_sum1 + part_sum2);
  •   }
  • }

42
Divide and Conquer (cont)
[Figure: tree construction for divide and conquer - the initial problem is divided repeatedly into subproblems, with the final tasks at the leaves]
43
Divide and Conquer (cont)
[Figure: dividing a list x0 ... xn-1 across eight processes - P0 starts with the original list; the problem is divided in stages, P0 -> P0, P4 -> P0, P2, P4, P6 -> P0 through P7, which hold the final tasks]
44
Partitioning/Divide and Conquer Examples
  • Many possibilities.
  • Operations on sequences of numbers, such as simply
    adding them together
  • Several sorting algorithms can often be
    partitioned or constructed in a recursive fashion
  • Numerical integration
  • N-body problem

45
Bucket sort
  • One bucket assigned to hold numbers that fall
    within each region.
  • Numbers in each bucket sorted using a sequential
    sorting algorithm.
  • Sequential sorting time complexity: O(n log(n/m)).
  • Works well if the original numbers uniformly
    distributed across a known interval, say 0 to a -
    1.

n = number of items, m = number of buckets (a sequential sketch follows)
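
A sequential sketch (assumed, not from the slides): each number's bucket is found directly from its value, and each bucket is then sorted on its own (qsort stands in for any comparison sort of the roughly n/m items per bucket).

    #include <stdlib.h>

    static int cmp_int(const void *p, const void *q)
    {
        int a = *(const int *)p, b = *(const int *)q;
        return (a > b) - (a < b);
    }

    /* Sort n numbers uniformly distributed in [0, a) using m buckets. */
    void bucket_sort(int *x, int n, int a, int m)
    {
        int **bucket = malloc(m * sizeof *bucket);
        int *count = calloc(m, sizeof *count);
        for (int b = 0; b < m; b++)
            bucket[b] = malloc(n * sizeof **bucket);   /* worst case: everything in one bucket */

        for (int i = 0; i < n; i++) {                  /* place each number in its bucket */
            int b = (int)((long long)x[i] * m / a);
            bucket[b][count[b]++] = x[i];
        }
        int k = 0;
        for (int b = 0; b < m; b++) {                  /* sort each bucket, then concatenate */
            qsort(bucket[b], count[b], sizeof(int), cmp_int);
            for (int i = 0; i < count[b]; i++)
                x[k++] = bucket[b][i];
            free(bucket[b]);
        }
        free(bucket);
        free(count);
    }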
46
Parallel version of bucket sort
  • Simple approach
  • Assign one processor for each bucket.

47
Further Parallelization
  • Partition sequence into m regions, one region for
    each processor.
  • Each processor maintains p small buckets and
    separates the numbers in its region into its own
    small buckets.
  • Small buckets are then emptied into p final buckets
    for sorting, which requires each processor to send
    one small bucket to each of the other processors
    (bucket i to processor i).

48
Another parallel version of bucket sort
  • Introduces new message-passing operation -
    all-to-all broadcast.

49
all-to-all broadcast routine
  • Sends data from each process to every other
    process

50
all-to-all broadcast routine (cont)
  • The all-to-all routine actually transfers the rows of
    an array to columns
  • It transposes a matrix (see the MPI sketch below).
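
A minimal sketch (assumed, not from the slides) of this transpose effect using MPI_Alltoall: each of n processes starts with one row of an n x n matrix and finishes with one column. Run with mpirun -np <n>.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, n;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &n);

        int *row = malloc(n * sizeof(int));   /* this process's row of the matrix */
        int *col = malloc(n * sizeof(int));   /* will receive this process's column */
        for (int j = 0; j < n; j++)
            row[j] = rank * n + j;            /* element (rank, j) of the matrix */

        /* element j of row goes to process j; element i received comes from process i */
        MPI_Alltoall(row, 1, MPI_INT, col, 1, MPI_INT, MPI_COMM_WORLD);

        printf("process %d now holds column %d:", rank, rank);
        for (int i = 0; i < n; i++) printf(" %d", col[i]);
        printf("\n");
        free(row); free(col);
        MPI_Finalize();
        return 0;
    }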

51
Contents
52
Synchronous Computations
  • Synchronous
  • Barrier
  • Barrier Implementation
  • Centralized Counter implementation
  • Tree Barrier Implementation
  • Butterfly Barrier
  • Synchronized Computations
  • Fully synchronous
  • Data Parallel Computations
  • Synchronous Iteration (Synchronous Parallelism)
  • Locally synchronous
  • Heat Distribution Problem
  • Sequential Code
  • Parallel Code

53
Barrier
  • A basic mechanism for synchronizing processes -
    inserted at the point in each process where it
    must wait.
  • All processes can continue from this point when
    all the processes have reached it
  • Processes reaching barrier at different times

54
Barrier Image
55
Barrier Implementation
  • Centralized Counter implementation ( linear
    barrier)
  • Tree Barrier Implementation.
  • Butterfly Barrier
  • Local Synchronization
  • Deadlock

56
Centralized Counter implementation
  • Has two phases
  • Arrival phase (trapping)
  • Departure phase (release)
  • A process enters arrival phase and does not leave
    this phase until all processes have arrived in
    this phase
  • Then processes move to departure phase and are
    released

57
  • Example code
  • Master:
  •   for (i = 0; i < n; i++)   /* count slaves as they reach the barrier */
  •     recv(Pany);
  •   for (i = 0; i < n; i++)   /* release slaves */
  •     send(Pi);
  • Slave processes:
  •   send(Pmaster);
  •   recv(Pmaster);

58
Tree Barrier Implementation
  • Suppose 8 processes, P0, P1, P2, P3, P4, P5, P6,
    P7
  • First stage
  • P1 sends message to P0 (when P1 reaches its
    barrier)
  • P3 sends message to P2 (when P3 reaches its
    barrier)
  • P5 sends message to P4 (when P5 reaches its
    barrier)
  • P7 sends message to P6 (when P7 reaches its
    barrier)
  • Second stage
  • P2 sends message to P0 (P2 and P3 have reached
    their barriers)
  • P6 sends message to P4 (P6 and P7 have reached
    their barriers)
  • Third stage
  • P4 sends message to P0 (P4, P5, P6, P7 have
    reached their barriers)
  • P0 terminates the arrival phase (when P0 has reached
    its barrier and received the message from P4)

59
Tree Barrier Implementation
  • Release with a reverse tree construction.

Tree barrier
60
Butterfly Barrier 
  • This would be used if data were exchanged between
    the processes

61
Local Synchronization
  • Suppose a process Pi needs to be synchronized
    and to exchange data with process Pi-1 and
    process Pi+1
  • Not a perfect three-process barrier because
    process Pi-1 will only synchronize with Pi and
    continue as soon as Pi allows. Similarly, process
    Pi+1 only synchronizes with Pi.

62
Synchronized Computations
  • Fully synchronous
  • In fully synchronous, all processes involved in
    the computation must be synchronized.
  • Data Parallel Computations
  • Synchronous Iteration (Synchronous Parallelism)
  • Locally synchronous
  • In locally synchronous, processes only need to
    synchronize with a set of logically nearby
    processes, not all processes involved in the
    computation
  • Heat Distribution Problem
  • Sequential Code
  • Parallel Code

63
Data Parallel Computations
  • Same operation performed on different data
    elements simultaneously (SIMD)
  • Data parallel programming is very convenient for
    two reasons
  • The first is its ease of programming (essentially
    only one program)
  • The second is that it can scale easily to larger
    problem sizes

64
Synchronous Iteration
  • Each iteration is composed of several processes that
    start together at the beginning of the iteration. The
    next iteration cannot begin until all processes have
    finished the previous iteration. Using forall:
  • for (j = 0; j < n; j++)          /* for each synchronous iteration */
  •   forall (i = 0; i < N; i++)     /* N processes, each using */
  •     body(i);                     /* a specific value of i */

65
Synchronous Iteration
  • Solving a General System of Linear Equations by
    Iteration
  • Suppose the equations are of a general form with
    n equations and n unknowns, where the unknowns are
    x0, x1, x2, ..., xn-1 (0 <= i < n):
  • a(n-1,0)x0 + a(n-1,1)x1 + a(n-1,2)x2 + ... + a(n-1,n-1)x(n-1) = b(n-1)
  • ...
  • a(2,0)x0 + a(2,1)x1 + a(2,2)x2 + ... + a(2,n-1)x(n-1) = b(2)
  • a(1,0)x0 + a(1,1)x1 + a(1,2)x2 + ... + a(1,n-1)x(n-1) = b(1)
  • a(0,0)x0 + a(0,1)x1 + a(0,2)x2 + ... + a(0,n-1)x(n-1) = b(0)

66
Synchronous Iteration
  • By rearranging the ith equation
  • a(i,0)x0 + a(i,1)x1 + a(i,2)x2 + ... + a(i,n-1)x(n-1) = b(i)
  • to
  • x(i) = (1/a(i,i)) [ b(i) - ( a(i,0)x0 + a(i,1)x1 + ...
    + a(i,i-1)x(i-1) + a(i,i+1)x(i+1) + ... + a(i,n-1)x(n-1) ) ]
  • or, equivalently, x(i) = (1/a(i,i)) ( b(i) - sum over j != i
    of a(i,j)x(j) ), which can be iterated until the values
    converge (Jacobi iteration; a sketch follows)
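
A minimal sketch (assumed, not from the slides) of this iteration in sequential form; in a synchronous parallel version each unknown could be assigned to a separate process, with a barrier between iterations. N and LIMIT are illustrative values.

    #define N 4
    #define LIMIT 100

    void jacobi(const double a[N][N], const double b[N], double x[N])
    {
        double new_x[N];
        for (int i = 0; i < N; i++) x[i] = b[i];           /* simple initial guess */
        for (int iter = 0; iter < LIMIT; iter++) {
            for (int i = 0; i < N; i++) {                  /* one synchronous iteration */
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    if (j != i) sum += a[i][j] * x[j];
                new_x[i] = (b[i] - sum) / a[i][i];         /* x_i = (b_i - sum)/a_ii */
            }
            for (int i = 0; i < N; i++) x[i] = new_x[i];   /* update all x together */
        }
    }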

67
Heat Distribution Problem
  • An area has known temperatures along each of its
    edges. Find the temperature distribution within.
    Divide the area into a fine mesh of points, h(i,j).
    The temperature at an inside point is taken to be the
    average of the temperatures of the four neighboring
    points.
  • The temperature of each point is found by iterating
    the equation
  • h(i,j) = ( h(i-1,j) + h(i+1,j) + h(i,j-1) + h(i,j+1) ) / 4
    (0 < i < n, 0 < j < n)

68
Heat Distribution Problem
69
Sequential Code
  • Using a fixed number of iterations
  • for (iteration = 0; iteration < limit; iteration++) {
  •   for (i = 1; i < n; i++)
  •     for (j = 1; j < n; j++)
  •       g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j]
  •                       + h[i][j-1] + h[i][j+1]);
  •   for (i = 1; i < n; i++)          /* update points */
  •     for (j = 1; j < n; j++)
  •       h[i][j] = g[i][j];
  • }

70
Parallel Code
  • With a fixed number of iterations, for process P(i,j)
    (except for the boundary points):
  • for (iteration = 0; iteration < limit; iteration++) {
  •   g = 0.25 * (w + x + y + z);
  •   send(&g, P(i-1,j));              /* non-blocking sends */
  •   send(&g, P(i+1,j));
  •   send(&g, P(i,j-1));
  •   send(&g, P(i,j+1));
  •   recv(&w, P(i-1,j));              /* synchronous receives */
  •   recv(&x, P(i+1,j));
  •   recv(&y, P(i,j-1));
  •   recv(&z, P(i,j+1));
  • }

Local Barrier
71
Contents
72
Load Balancing and Termination Detection
73
Load Balancing and Termination Detection
74
Load Balancing
75
Load Balancing and Termination Detection
76
Static Load Balancing
  • Round robin algorithm: passes out tasks in
    sequential order of processes, coming back to the
    first when all processes have been given a task
  • Randomized algorithms: select processes at
    random to take tasks
  • Recursive bisection: recursively divides the
    problem into subproblems of equal computational
    effort while minimizing message passing
  • Simulated annealing: an optimization technique
  • Genetic algorithm: another optimization technique

77
Static Load Balancing
  • Several fundamental flaws with static load
    balancing even if a mathematical solution exists
  • Very difficult to estimate accurately the
    execution times of various parts of a program
    without actually executing the parts.
  • Communication delays that vary under different
    circumstances
  • Some problems have an indeterminate number of
    steps to reach their solution.

78
Dynamic Load Balancing
79
Centralized dynamic load balancing
  • Tasks handed out from a centralized location.
    Master-slave structure
  • Master process(or) holds the collection of tasks
    to be performed.
  • Tasks are sent to the slave processes. When a
    slave process completes one task, it requests
    another task from the master process.
  • (Terms used: work pool, replicated worker,
    processor farm.) A sketch of such a work pool follows.
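
A minimal sketch (assumed, not from the slides) of such a work pool with MPI: the master holds the task queue and hands out task indices on request; slaves request another task whenever they finish one, and a stop tag releases them when the queue is empty. The tag names and do_task() are illustrative, not from the presentation.

    #include <mpi.h>

    #define TASK_TAG 1
    #define STOP_TAG 2
    #define NUM_TASKS 100

    static int do_task(int t) { return t * t; }        /* placeholder computation */

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                               /* master: holds the task queue */
            int next = 0, active = size - 1, result;
            MPI_Status st;
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);         /* a request (carrying the last result) */
                if (next < NUM_TASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TASK_TAG, MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, STOP_TAG, MPI_COMM_WORLD);
                    active--;                          /* queue empty: release this slave */
                }
            }
        } else {                                       /* slave: request tasks until released */
            int task, result = 0;
            MPI_Status st;
            for (;;) {
                MPI_Send(&result, 1, MPI_INT, 0, TASK_TAG, MPI_COMM_WORLD);   /* request */
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == STOP_TAG) break;
                result = do_task(task);
            }
        }
        MPI_Finalize();
        return 0;
    }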

80
Centralized dynamic load balancing
81
Termination
  • Computation terminates when
  • The task queue is empty and
  • Every process has made a request for
    another task without any new tasks being
    generated
  • It is not sufficient to terminate when the task queue
    is empty while one or more processes are still running,
    because a running process may provide new tasks for
    the task queue.

82
Decentralized dynamic load balancing
83
Fully Distributed Work Pool
  • Processes execute tasks from each other
  • Tasks could be transferred by
  • - Receiver-initiated methods
  • - Sender-initiated methods

84
Process Selection
  • Algorithms for selecting a process
  • Round robin algorithm: process Pi requests tasks
    from process Px, where x is given by a counter
    that is incremented after each request, using
    modulo n arithmetic (n processes), excluding x = i.
  • Random polling algorithm: process Pi requests
    tasks from process Px, where x is a number
    selected randomly between 0 and n - 1
    (excluding i). A sketch of both rules follows.
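
A minimal sketch (assumed, not from the slides) of the two selection rules: each process i either steps a counter modulo n or draws a random target, always skipping itself.

    #include <stdlib.h>

    /* Round robin: counter stepped modulo n after each request, never returning i itself. */
    int next_round_robin(int i, int n, int *counter)
    {
        int x = *counter % n;
        if (x == i) x = (x + 1) % n;      /* skip ourselves */
        *counter = (x + 1) % n;           /* advance for the next request */
        return x;
    }

    /* Random polling: a target chosen uniformly from the other n - 1 processes. */
    int next_random(int i, int n)
    {
        int x = rand() % (n - 1);         /* 0 .. n-2 */
        return (x < i) ? x : x + 1;       /* map onto 0 .. n-1 excluding i */
    }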

85
Distributed Termination Detection Algorithms
  • Termination Conditions
  • Application-specific local termination conditions
    exist throughout the collection of processes, at
    time t.
  • There are no messages in transit between
    processes at time t.
  • Second condition necessary because a message
    in transit might restart a terminated process.
    More difficult to recognize. The time that it
    takes for messages to travel between processes
    will not be known in advance.

86
Using Acknowledgment Messages
  • Each process is in one of two states
  • Inactive - without any task to perform
  • Active
  • The process that sent the task that made it enter
    the active state becomes its parent.

87
Using Acknowledgment Messages
  • When a process receives a task, it immediately
    sends an acknowledgment message, unless the
    process it received the task from is its parent
    process. It only sends an acknowledgment message to
    its parent when it is ready to become inactive,
    i.e. when
  • its local termination condition exists (all tasks
    are completed), and
  • it has transmitted all its acknowledgments for
    tasks it has received, and
  • it has received all its acknowledgments for tasks
    it has sent out.
  • A process must become inactive before its parent
    process. When the first process becomes idle, the
    computation can terminate.

88
Load balancing/termination detection example
Example: finding the shortest distance between two
points on a graph.
89
References: Parallel Programming: Techniques and
Applications Using Networked Workstations and
Parallel Computers, Barry Wilkinson and Michael
Allen, Second Edition, Prentice Hall, 2005.
90
Q&A
91
Thank You !