# Introduction to Parallel Computing by Grama, Gupta, Karypis, Kumar - PowerPoint PPT Presentation

1
Introduction to Parallel Computing by Grama, Gupta, Karypis, Kumar
• Selected Topics from Chapter 3: Principles of
Parallel Algorithm Design

2
Elements of a Parallel Algorithm
• Pieces of work that can be done concurrently
• Mapping of the tasks onto multiple processors
• Processes vs processors
• Distributing the input, output, and intermediate
results across different processors
• Either input or intermediate
• Synchronization of the processors at various
points of the parallel execution

3
Finding Concurrent Pieces of Work
• Decomposition
• The process of dividing the computation into
smaller pieces of work called tasks
• Tasks are programmer-defined and are considered
to be indivisible.

4
Tasks can be of different sizes
5
• In most cases, there are dependencies between the tasks
• Certain task(s) can only start once some other task(s) have finished
• Example: Producer-consumer relationships
• These dependencies are represented using a DAG

6
• A task-dependency graph is a directed acyclic
graph in which the nodes represent tasks and the
directed edges indicate the dependences between
them
• The task corresponding to a node can be executed
when all tasks connected to this node by incoming
edges have been completed.
• The number and size of the tasks that the problem
is decomposed into determines the granularity of
the decomposition.
• Called fine-grained for a large number of small tasks
• Called coarse-grained for a small number of large tasks

7
• Key Concepts Derived from Task-Dependency Graph
• Degree of Concurrency
• The number of tasks that can be executed
concurrently
• We are usually most concerned about the average
degree of concurrency
• Critical Path
• The longest vertex-weighted path in the graph
• The weights of the nodes represent the task sizes
• Its length is the sum of the weights of the nodes along the path
• The degree of concurrency and critical path
length normally increase as granularity becomes
smaller
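Both measures can be computed directly from a task-dependency graph. A minimal sketch (the DAG, task names, and weights below are illustrative, not from the textbook):

```python
# Illustrative sketch: critical-path length and average degree of
# concurrency for a small task-dependency graph.

def critical_path_length(weights, deps):
    """Longest vertex-weighted path in a task-dependency graph."""
    memo = {}

    def longest_to(task):  # longest path ending at `task`
        if task not in memo:
            memo[task] = weights[task] + max(
                (longest_to(p) for p in deps.get(task, [])), default=0)
        return memo[task]

    return max(longest_to(t) for t in weights)

# A 4-task DAG: t3 depends on t1; t4 depends on t2 and t3.
weights = {"t1": 4, "t2": 2, "t3": 3, "t4": 1}
deps = {"t3": ["t1"], "t4": ["t2", "t3"]}

cp = critical_path_length(weights, deps)
avg_concurrency = sum(weights.values()) / cp  # total work / critical path
print(cp, avg_concurrency)  # 8 1.25
```

Here the critical path is t1 → t3 → t4 with length 4 + 3 + 1 = 8, and the average degree of concurrency is total work divided by critical-path length, 10 / 8 = 1.25.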

8
Task-Interaction Graph
• Captures the pattern of interaction between tasks
• This graph usually contains the task-dependency
graph as a subgraph.
• True since there may be interactions between
tasks even if there are no dependencies.
• These interactions are usually due to accesses of
shared data

9
• These graphs are important in developing an
effective mapping of the tasks onto the different
processors
• Need to maximize concurrency and minimize
interaction among processors
10
Common Decomposition Methods
• Data Decomposition
• Recursive Decomposition
• Exploratory Decomposition
• Speculative Decomposition
• Hybrid Decomposition

11
Recursive Decomposition
• Suitable for problems that can be solved using
the divide-and-conquer strategy
• Each of the subproblems generated by the divide
step becomes a task

12
Example: Quicksort
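The recursive decomposition can be sketched in code: each recursive call on a sub-list is a task, and the two sub-lists produced by partitioning are independent of each other. This is an illustrative sketch, not the textbook's formulation; a real parallel quicksort would fall back to serial recursion below some cutoff size.

```python
# Illustrative sketch of quicksort's recursive decomposition.
from concurrent.futures import ThreadPoolExecutor

def quicksort(a, pool):
    if len(a) <= 1:
        return a
    pivot, rest = a[0], a[1:]
    lo = [x for x in rest if x < pivot]
    hi = [x for x in rest if x >= pivot]
    # The two subproblems have no mutual dependency: run one as a
    # separate task while this task handles the other.
    left = pool.submit(quicksort, lo, pool)
    right = quicksort(hi, pool)
    return left.result() + [pivot] + right

with ThreadPoolExecutor(max_workers=8) as pool:
    print(quicksort([5, 3, 8, 1, 9, 2], pool))  # [1, 2, 3, 5, 8, 9]
```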
13
Another Example: Finding the Minimum
• Note that we can obtain divide-and-conquer
algorithms for problems that are usually solved
by using other methods.
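The min-finding decomposition has the same divide-and-conquer shape (shown here as a serial sketch; the two recursive calls on the halves are the independent tasks):

```python
# Illustrative divide-and-conquer formulation of minimum finding.
# Usually done with one sequential scan, but the recursive form
# exposes concurrency: the two halves are independent tasks.

def dc_min(a):
    if len(a) == 1:
        return a[0]
    mid = len(a) // 2
    # The two recursive calls have no mutual dependency and could
    # be executed on different processors.
    return min(dc_min(a[:mid]), dc_min(a[mid:]))

print(dc_min([7, 4, 9, 1, 6, 3]))  # 1
```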

14
Recursive Decomposition
• How good are the decompositions produced?
• Average Concurrency?
• Length of critical path?
• How do the quicksort and min-finding
decompositions measure up?

15
Data Decomposition
• Used to derive concurrency for problems that
operate on large amounts of data
• The idea is to derive the tasks by focusing on
the multiplicity of data
• Data decomposition is often performed in two
steps
• Step 1: Partition the data
• Step 2: Induce a computational partitioning from
the data partitioning.

16
Data Decomposition (cont)
• Which data should we partition?
• Input/Output/Intermediate?
• All of the above
• This leads to different data decomposition
methods
• How do we induce a computational partitioning?
• Use the owner-computes rule
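As an illustration (not from the textbook), here is output-data decomposition with the owner-computes rule applied to dense matrix-vector multiplication: each task "owns" a block of output rows and performs exactly the computation that produces them.

```python
# Illustrative sketch: output-data partitioning of y = A*x.
# Each task owns a block of output rows (owner-computes rule).

def matvec_block(A, x, rows):
    """Task owning output rows `rows` computes exactly those entries."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in rows]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
# Partition the output vector into two blocks; each induces one task.
task0 = matvec_block(A, x, range(0, 2))   # rows 0-1
task1 = matvec_block(A, x, range(2, 4))   # rows 2-3
print(task0 + task1)  # [3, 7, 11, 15]
```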

17
Exploratory Decomposition
• Used to decompose computations that correspond to
a search of the space of solutions.

18
Example: 15-puzzle Problem
19
Exploratory Decomposition
• Not general purpose
• After sufficient branches are generated, each
node can be assigned the task to explore further
down one branch
• As soon as one task finds a solution, the other
tasks can be terminated
• It can result in speedup and slowdown anomalies
• The work performed by the parallel formulation of
an algorithm can be either smaller or greater
than that performed by the serial algorithm.
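A minimal sketch of the idea (the "search space" here is just a list of candidate ranges, chosen for illustration): independent branches are explored as separate tasks, and remaining work is abandoned as soon as any task succeeds.

```python
# Illustrative sketch of exploratory decomposition: each branch of a
# search space is a task; a shared flag stops the others once one
# task finds the solution.
from concurrent.futures import ThreadPoolExecutor
import threading

found = threading.Event()

def explore(branch, target):
    for candidate in branch:
        if found.is_set():        # another task already succeeded
            return None
        if candidate == target:
            found.set()
            return candidate
    return None

branches = [range(0, 100), range(100, 200), range(200, 300)]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda b: explore(b, 150), branches))
print([r for r in results if r is not None])  # [150]
```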

20
Speculative Decomposition
• Used to extract concurrency in problems in which
the next step is one of many possible actions
that can only be determined when the current task
finishes.
• This decomposition assumes a certain outcome of
the currently executed task and executes some of
the next steps
• Similar to speculative execution in
microprocessors

21
Speculative Decomposition (cont)
• Difference from exploratory decomposition:
• In speculative decomposition, the input at a
branch leading to multiple parallel tasks is
unknown
• In exploratory decomposition, the output of the
multiple tasks originating at the branch is
unknown.

22
Example: Discrete Event Simulation
23
Speculative Execution
• If predictions are wrong
• Work is wasted
• Work may need to be undone
• Memory/computations
• However, it may be the only way to extract
concurrency!
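A minimal sketch of the mechanism (the condition and branch functions are made up for illustration): both possible next steps run speculatively while the branch condition is still being evaluated, and the result on the untaken path is discarded as wasted work.

```python
# Illustrative sketch of speculative decomposition: evaluate both
# possible next steps before the branch outcome is known.
from concurrent.futures import ThreadPoolExecutor

def slow_condition(x):
    return x % 2 == 0          # stands in for a long-running test

def then_step(x):
    return x * 10

def else_step(x):
    return x - 1

x = 6
with ThreadPoolExecutor() as pool:
    cond = pool.submit(slow_condition, x)
    spec_then = pool.submit(then_step, x)   # speculative
    spec_else = pool.submit(else_step, x)   # speculative
    result = spec_then.result() if cond.result() else spec_else.result()
print(result)  # 60: condition held, so the else-branch work is wasted
```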

24
• A good mapping strives to achieve the following
conflicting goals:
• Reducing the amount of time that processors spend
interacting with each other
• Reducing the amount of total time that some
processors are active while others are idle
• Good mappings attempt to reduce the parallel
processing overheads
• If Tp is the parallel runtime using p processors
and Ts is the sequential runtime (for the same
algorithm), then the total overhead To is pTp -
Ts.
• This is the work that is done by the parallel
system beyond that required for the serial
system.
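A worked instance of the overhead formula, with made-up timings (Ts = 100, Tp = 30, p = 4):

```python
# Total overhead To = p*Tp - Ts, with illustrative example timings.
p, Ts, Tp = 4, 100.0, 30.0     # processors, serial time, parallel time
To = p * Tp - Ts               # work beyond what the serial run needs
speedup = Ts / Tp
print(To, speedup)             # 20.0 units of overhead; speedup ~3.33
```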

25
Add Slides from Karypis Lecture Slides
• Add Slides 37-52 here from the PDF lecture slides
by Karypis for Chapter 3 of the textbook,
• Introduction to Parallel Computing, Second
Edition, Ananth Grama, Anshul Gupta, George
Karypis, Vipin Kumar, Addison Wesley, 2003.
• Topics covered on these slides are sections
3.3-3.5
• Characteristics of Tasks and Interactions
• Mapping Techniques for Load Balancing
• Methods for Containing Interaction Overheads
• These slides are easiest to view by going to
View and choosing Full Screen; exit from
Full Screen using the Esc key.

26
Parallel Algorithm Models
• The Task Graph Model
• Closely related to Foster's Task/Channel Model
• Includes the task dependency graph where
dependencies usually result from communications
• Also includes the task-interaction graph, which
also captures other interactions between tasks
such as data sharing
• The Work Pool Model
• The Master-Slave Model
• The Pipeline or Producer-Consumer Model
• Hybrid Models

27
The Task Graph Model
• The computations in a parallel algorithm can be
viewed as a task-dependency graph
• Tasks are mapped to processors so that locality
is promoted
• Volume and frequency of interactions are reduced
• Asynchronous interaction methods are used to
overlap interactions with computation
• Typically used to solve problems in which the
data related to a task is rather large compared
to the amount of computation.

28
The Task Graph Model (cont)
• Examples of algorithms based on the task graph
model:
• Parallel Quicksort (Section 9.4.1)
• Sparse Matrix Factorization
• Multiple parallel algorithms derived from
divide-and-conquer decompositions.
• The type of parallelism that is expressed by the
task graph is called task parallelism
29
The Work Pool Model
• Also called the Task Pool Model
• Involves dynamic mapping of tasks onto processes
• Any task may potentially be performed by any
process
• The mapping of tasks to processors can be
centralized or decentralized.
• Pointers to tasks may be stored in
• a physically shared list, a priority queue, hash
table, or tree
• a physically distributed data structure.

30
The Work Pool Model (cont.)
• When work is generated dynamically and a
decentralized mapping is used, then a termination
detection algorithm is required
• When used with a message passing paradigm,
normally the data required by the tasks is
relatively small when compared to the
computations

31
The Work Pool Model (cont.)
• Examples of algorithms based on the Work Pool
Model
• Chunk-Scheduling
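A minimal sketch of the work-pool model with a centralized, physically shared task queue (the chunks and the per-chunk computation are illustrative): idle workers repeatedly pull the next available task, giving a dynamic mapping.

```python
# Illustrative sketch of the work-pool model: a shared task queue
# from which worker threads dynamically grab work.
import queue
import threading

tasks = queue.Queue()
for chunk in range(8):                  # 8 chunks of work
    tasks.put(chunk)

results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            chunk = tasks.get_nowait()  # dynamic mapping: grab next task
        except queue.Empty:
            return                      # pool is empty: worker terminates
        value = chunk * chunk           # stand-in for real computation
        with lock:
            results.append(value)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that with a single centralized queue, termination detection is trivial (the queue runs empty); the decentralized case discussed above needs a genuine termination-detection algorithm.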

32
Master-Slave Model
• Also called the Manager-Worker model
• One or more master processes generate work and
allocate it to worker processes
• Work may be allocated a priori if the master can
estimate the size of tasks or if a random mapping
suffices for load balancing
• Normally, workers are assigned smaller tasks, as
needed
• Work can be performed in phases
• Work in each phase is completed and workers are
synchronized before the next phase is started.
• Normally, any worker can do any assigned task

33
Master-Slave Model (cont)
• Can be generalized to a multi-level
manager-worker model
• Top level managers feed large chunks of tasks to
second-level managers
• Second-level managers subdivide tasks to their
workers and may also perform some of the work
• Danger of manager becoming a bottleneck
• Can happen if tasks are too small
• Granularity of tasks should be chosen so that
cost of doing work dominates cost of
synchronization
• Waiting time may be reduced if worker requests
are non-deterministic.
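The phased variant can be sketched as follows (threads stand in for worker processes; the tasks and phase contents are made up for illustration): the master generates one phase of work, hands it out, and waits for the whole phase to finish before starting the next.

```python
# Illustrative sketch of the manager-worker model with phased
# execution and a synchronization point between phases.
from concurrent.futures import ThreadPoolExecutor

def work(task):
    return task * 2                    # stand-in for real computation

phases = [[1, 2, 3], [10, 20]]         # master's work, generated per phase
results = []
with ThreadPoolExecutor(max_workers=2) as workers:
    for phase in phases:
        # Master allocates this phase's tasks; map() blocks until the
        # whole phase is done, synchronizing workers before the next one.
        results.extend(workers.map(work, phase))
print(results)  # [2, 4, 6, 20, 40]
```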

34
Master-Slave Model (cont)
• Examples of algorithms based on the Master-Slave
Model
• A master-slave example for centralized
dynamic load balancing is given in Section 3.4.2
(page 130)
• Several examples are given in the textbook: Barry
Wilkinson and Michael Allen, Parallel
Programming: Techniques and Applications Using
Networked Workstations and Parallel Computers,
1st or 2nd Edition, 1999 or 2005, Prentice Hall.

35
Pipeline or Producer-Consumer Model
• Usually similar to the linear array model
studied in Akl's textbook
• A stream of data is passed through a succession
of processes, each of which performs some task on
it
• Called stream parallelism
• With the exception of the process initiating the
work for the pipeline, the arrival of new data
triggers the execution of a new task by a process
in the pipeline
• Each process can be viewed as a consumer of the
data items produced by the process preceding it

36
Pipeline or Producer-Consumer Model (cont)
• Each process in pipeline can be viewed as a
producer of data for the process following it.
• The pipeline is a chain of producers and
consumers
• The pipeline does not need to be a linear chain.
Instead, it can be a directed graph.
• Processes could form pipelines in the form of
• Linear or multidimensional arrays
• Trees
• General graphs with or without cycles
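A minimal sketch of a linear (two-stage) pipeline, with illustrative stage functions: items flow through stages connected by queues, and the arrival of an item triggers the next stage's task.

```python
# Illustrative sketch of a linear producer-consumer pipeline:
# producer -> stage 1 -> stage 2, connected by queues.
import queue
import threading

q01 = queue.Queue()   # producer -> stage 1
q12 = queue.Queue()   # stage 1 -> stage 2
DONE = object()       # sentinel marking the end of the stream

def stage1():
    while True:
        item = q01.get()
        if item is DONE:
            q12.put(DONE)             # propagate end-of-stream
            return
        q12.put(item + 1)             # first transformation

out = []

def stage2():
    while True:
        item = q12.get()
        if item is DONE:
            return
        out.append(item * 10)         # second transformation

t1 = threading.Thread(target=stage1)
t2 = threading.Thread(target=stage2)
t1.start()
t2.start()
for x in [1, 2, 3]:                   # producer feeds the stream
    q01.put(x)
q01.put(DONE)
t1.join()
t2.join()
print(out)  # [20, 30, 40]
```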

37
Pipeline or Producer-Consumer Model (cont)
• With larger tasks, it takes longer to fill up the
pipeline
• Too fine a granularity increases overhead, as
processes will need to receive new data and
initiate a new task after a small amount of
computation
• Examples of algorithms based on this model
• A two-dimensional pipeline is used in the
parallel LU factorization algorithm discussed in
Section 8.3.1
• An entire chapter is devoted to this model in
the previously mentioned textbook by Wilkinson
and Allen.

38
Hybrid Models
• In some cases, more than one model may be used in
designing an algorithm, resulting in a hybrid
algorithm
• Parallel quicksort (Section 3.2.5 and 9.4.1) is
an application for which a hybrid model is ideal.