# Introduction to Parallel Computing by Grama, Gupta, Karypis, Kumar - PowerPoint PPT Presentation

1
Introduction to Parallel Computing by Grama, Gupta, Karypis, Kumar
• Selected Topics from Chapter 3: Principles of
Parallel Algorithm Design

2
Elements of a Parallel Algorithm
• Pieces of work that can be done concurrently
• Mapping of the tasks onto multiple processors
• Processes vs processors
• Distributing the input, output, and intermediate
results across different processors
• Either input or intermediate
• Synchronization of the processors at various
points of the parallel execution

3
Finding Concurrent Pieces of Work
• Decomposition
• The process of dividing the computation into
smaller pieces of work called tasks
• Tasks are programmer-defined and are considered
to be indivisible.

4
Tasks can be of different sizes
5
• In most cases, there are dependencies between the tasks
• Certain task(s) can only start once some other task(s) have finished
• Example: Producer-consumer relationships
• These dependencies are represented using a DAG

6
• A task-dependency graph is a directed acyclic
graph in which the nodes represent tasks and the
directed edges indicate the dependences between
them
• The task corresponding to a node can be executed
when all tasks connected to this node by incoming
edges have been completed.
• The number and size of the tasks that the problem
is decomposed into determines the granularity of
the decomposition.
• Called fine-grained for a large number of small tasks
• Called coarse-grained for a small number of large tasks

7
• Key Concepts Derived from Task-Dependency Graph
• Degree of Concurrency
• The number of tasks that can be executed
concurrently
• We are usually most concerned about the average
degree of concurrency
• Critical Path
• The longest vertex-weighted path in the graph
• The weights of the nodes represent the task sizes
• Its length is the sum of the weights of the nodes along the path
• The degree of concurrency and critical path
length normally increase as granularity becomes
smaller
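Both measures can be computed directly from a task-dependency graph. A minimal sketch (the DAG, task names, and weights below are illustrative, not from the textbook):

```python
# Illustrative sketch: critical-path length and average degree of
# concurrency for a small task-dependency graph.

def critical_path_length(weights, deps):
    """Longest vertex-weighted path in a task-dependency graph."""
    memo = {}

    def longest_to(task):  # longest path ending at `task`
        if task not in memo:
            memo[task] = weights[task] + max(
                (longest_to(p) for p in deps.get(task, [])), default=0)
        return memo[task]

    return max(longest_to(t) for t in weights)

# A 4-task DAG: t3 depends on t1; t4 depends on t2 and t3.
weights = {"t1": 4, "t2": 2, "t3": 3, "t4": 1}
deps = {"t3": ["t1"], "t4": ["t2", "t3"]}

cp = critical_path_length(weights, deps)
avg_concurrency = sum(weights.values()) / cp  # total work / critical path
print(cp, avg_concurrency)  # 8 1.25
```

Here the critical path is t1 → t3 → t4 with length 4 + 3 + 1 = 8, and the average degree of concurrency is total work divided by critical-path length, 10 / 8 = 1.25.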

8
Task-Interaction Graph
• Captures the pattern of interaction between tasks
• This graph usually contains the task-dependency
graph as a subgraph.
• True since there may be interactions between
tasks even if there are no dependencies.
• These interactions are usually due to accesses of
shared data

9
• These graphs are important in developing an
effective mapping of the tasks onto the different
processors
• Need to maximize concurrency and minimize
interaction among processors
10
Common Decomposition Methods
• Data Decomposition
• Recursive Decomposition
• Exploratory Decomposition
• Speculative Decomposition
• Hybrid Decomposition

11
Recursive Decomposition
• Suitable for problems that can be solved using
the divide-and-conquer strategy
• Each of the subproblems generated by the divide
step becomes a task

12
Example: Quicksort
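The recursive decomposition can be sketched in code: each recursive call on a sub-list is a task, and the two sub-lists produced by partitioning are independent of each other. This is an illustrative sketch, not the textbook's formulation; a real parallel quicksort would fall back to serial recursion below some cutoff size.

```python
# Illustrative sketch of quicksort's recursive decomposition.
from concurrent.futures import ThreadPoolExecutor

def quicksort(a, pool):
    if len(a) <= 1:
        return a
    pivot, rest = a[0], a[1:]
    lo = [x for x in rest if x < pivot]
    hi = [x for x in rest if x >= pivot]
    # The two subproblems have no mutual dependency: run one as a
    # separate task while this task handles the other.
    left = pool.submit(quicksort, lo, pool)
    right = quicksort(hi, pool)
    return left.result() + [pivot] + right

with ThreadPoolExecutor(max_workers=8) as pool:
    print(quicksort([5, 3, 8, 1, 9, 2], pool))  # [1, 2, 3, 5, 8, 9]
```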
13
Another Example: Finding the Minimum
• Note that we can obtain divide-and-conquer
algorithms for problems that are usually solved
by using other methods.
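The min-finding decomposition has the same divide-and-conquer shape (shown here as a serial sketch; the two recursive calls on the halves are the independent tasks):

```python
# Illustrative divide-and-conquer formulation of minimum finding.
# Usually done with one sequential scan, but the recursive form
# exposes concurrency: the two halves are independent tasks.

def dc_min(a):
    if len(a) == 1:
        return a[0]
    mid = len(a) // 2
    # The two recursive calls have no mutual dependency and could
    # be executed on different processors.
    return min(dc_min(a[:mid]), dc_min(a[mid:]))

print(dc_min([7, 4, 9, 1, 6, 3]))  # 1
```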

14
Recursive Decomposition
• How good are the decompositions produced?
• Average Concurrency?
• Length of critical path?
• How do the quicksort and min-finding
decompositions measure up?

15
Data Decomposition
• Used to derive concurrency for problems that
operate on large amounts of data
• The idea is to derive the tasks by focusing on
the multiplicity of data
• Data decomposition is often performed in two
steps
• Step 1: Partition the data
• Step 2: Induce a computational partitioning from
the data partitioning.

16
Data Decomposition (cont)
• Which data should we partition?
• Input/Output/Intermediate?
• All of the above
• This leads to different data decomposition
methods
• How do we induce a computational partitioning?
• Use the owner-computes rule
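As an illustration (not from the textbook), here is output-data decomposition with the owner-computes rule applied to dense matrix-vector multiplication: each task "owns" a block of output rows and performs exactly the computation that produces them.

```python
# Illustrative sketch: output-data partitioning of y = A*x.
# Each task owns a block of output rows (owner-computes rule).

def matvec_block(A, x, rows):
    """Task owning output rows `rows` computes exactly those entries."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in rows]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
# Partition the output vector into two blocks; each induces one task.
task0 = matvec_block(A, x, range(0, 2))   # rows 0-1
task1 = matvec_block(A, x, range(2, 4))   # rows 2-3
print(task0 + task1)  # [3, 7, 11, 15]
```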

17
Exploratory Decomposition
• Used to decompose computations that correspond to
a search of the space of solutions.

18
Example: 15-puzzle Problem
19
Exploratory Decomposition
• Not general purpose
• After sufficient branches are generated, each
node can be assigned the task to explore further
down one branch
• As soon as one task finds a solution, the other
tasks can be terminated
• It can result in speedup and slowdown anomalies
• The work performed by the parallel formulation of
an algorithm can be either smaller or greater
than that performed by the serial algorithm.
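A minimal sketch of the idea (the "search space" here is just a list of candidate ranges, chosen for illustration): independent branches are explored as separate tasks, and remaining work is abandoned as soon as any task succeeds.

```python
# Illustrative sketch of exploratory decomposition: each branch of a
# search space is a task; a shared flag stops the others once one
# task finds the solution.
from concurrent.futures import ThreadPoolExecutor
import threading

found = threading.Event()

def explore(branch, target):
    for candidate in branch:
        if found.is_set():        # another task already succeeded
            return None
        if candidate == target:
            found.set()
            return candidate
    return None

branches = [range(0, 100), range(100, 200), range(200, 300)]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda b: explore(b, 150), branches))
print([r for r in results if r is not None])  # [150]
```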

20
Speculative Decomposition
• Used to extract concurrency in problems in which
the next step is one of many possible actions
that can only be determined when the current task
finishes.
• This decomposition assumes a certain outcome of
the currently executed task and executes some of
the next steps
• Similar to speculative execution in
microprocessors

21
Speculative Decomposition (cont)
• Difference from exploratory decomposition:
• In speculative decomposition, the input at a
branch leading to multiple parallel tasks is
unknown
• In exploratory decomposition, the output of the
multiple tasks originating at the branch is
unknown.

22
Example: Discrete Event Simulation
23
Speculative Execution
• If predictions are wrong
• Work is wasted
• Work may need to be undone
• Memory/computations
• However, it may be the only way to extract
concurrency!
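A minimal sketch of the mechanism (the condition and branch functions are made up for illustration): both possible next steps run speculatively while the branch condition is still being evaluated, and the result on the untaken path is discarded as wasted work.

```python
# Illustrative sketch of speculative decomposition: evaluate both
# possible next steps before the branch outcome is known.
from concurrent.futures import ThreadPoolExecutor

def slow_condition(x):
    return x % 2 == 0          # stands in for a long-running test

def then_step(x):
    return x * 10

def else_step(x):
    return x - 1

x = 6
with ThreadPoolExecutor() as pool:
    cond = pool.submit(slow_condition, x)
    spec_then = pool.submit(then_step, x)   # speculative
    spec_else = pool.submit(else_step, x)   # speculative
    result = spec_then.result() if cond.result() else spec_else.result()
print(result)  # 60: condition held, so the else-branch work is wasted
```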

24
• A good mapping strives to achieve the following
conflicting goals:
• Reducing the amount of time that processors spend
interacting with each other
• Reducing the amount of total time that some
processors are active while others are idle
• Good mappings attempt to reduce the parallel
processing overheads
• If Tp is the parallel runtime using p processors
and Ts is the sequential runtime (for the same
algorithm), then the total overhead To is pTp -
Ts.
• This is the work that is done by the parallel
system beyond that required for the serial
system.
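A worked instance of the overhead formula, with made-up timings (Ts = 100, Tp = 30, p = 4):

```python
# Total overhead To = p*Tp - Ts, with illustrative example timings.
p, Ts, Tp = 4, 100.0, 30.0     # processors, serial time, parallel time
To = p * Tp - Ts               # work beyond what the serial run needs
speedup = Ts / Tp
print(To, speedup)             # 20.0 units of overhead; speedup ~3.33
```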

25
Add Slides from Karypis Lecture Slides
• Add Slides 37-52 here from the PDF lecture slides
by Karypis for Chapter 3 of the textbook,
• Introduction to Parallel Computing, Second
Edition, Ananth Grama, Anshul Gupta, George
Karypis, Vipin Kumar, Addison Wesley, 2003.
• Topics covered on these slides are sections
3.3-3.5
• Characteristics of Tasks and Interactions
• Mapping Techniques for Load Balancing
• Methods for Containing Interaction Overheads
• These slides are easiest to view by going to
View and choosing Full Screen; exit from
Full Screen using the Esc key.

26
Parallel Algorithm Models
• The Task Graph Model
• Closely related to Foster's Task/Channel Model
• Includes the task dependency graph where
dependencies usually result from communications
• Also includes the task-interaction graph, which
also captures other interactions between tasks
such as data sharing
• The Work Pool Model
• The Master-Slave Model
• The Pipeline or Producer-Consumer Model
• Hybrid Models

27
The Task Graph Model
• The computations in a parallel algorithm can be
viewed as a task-dependency graph
• Tasks are mapped to processors so that locality
is promoted
• Volume and frequency of interactions are reduced
• Asynchronous interaction methods are used to
overlap interactions with computation
• Typically used to solve problems in which the
data related to a task is rather large compared
to the amount of computation.

28
The Task Graph Model (cont)
• Examples of algorithms based on the task graph
model:
• Parallel Quicksort (Section 9.4.1)
• Sparse Matrix Factorization
• Multiple parallel algorithms derived from
divide-and-conquer decompositions.
• The type of parallelism that is expressed by the
task graph is called task parallelism
29
The Work Pool Model
• Also called the Task Pool Model
• Involves dynamic mapping of tasks onto processes
• Any task may potentially be performed by any
process
• The mapping of tasks to processors can be
centralized or decentralized.
• Pointers to tasks may be stored in
• a physically shared list, a priority queue, hash
table, or tree
• a physically distributed data structure.

30
The Work Pool Model (cont.)
• When work is generated dynamically and a
decentralized mapping is used, then a termination
detection algorithm is required
• When used with a message passing paradigm,
normally the data required by the tasks is
relatively small when compared to the
computations

31
The Work Pool Model (cont.)
• Examples of algorithms based on the Work Pool
Model
• Chunk-Scheduling
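A minimal sketch of the work-pool model with a centralized, physically shared task queue (the chunks and the per-chunk computation are illustrative): idle workers repeatedly pull the next available task, giving a dynamic mapping.

```python
# Illustrative sketch of the work-pool model: a shared task queue
# from which worker threads dynamically grab work.
import queue
import threading

tasks = queue.Queue()
for chunk in range(8):                  # 8 chunks of work
    tasks.put(chunk)

results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            chunk = tasks.get_nowait()  # dynamic mapping: grab next task
        except queue.Empty:
            return                      # pool is empty: worker terminates
        value = chunk * chunk           # stand-in for real computation
        with lock:
            results.append(value)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that with a single centralized queue, termination detection is trivial (the queue runs empty); the decentralized case discussed above needs a genuine termination-detection algorithm.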

32
Master-Slave Model
• Also called the Manager-Worker model
• One or more master processes generate work and
allocate it to worker processes
• Work may be allocated a priori if the master can
estimate the size of tasks or if a random mapping
suffices for load balancing
• Normally, workers are assigned smaller tasks, as
needed
• Work can be performed in phases
• Work in each phase is completed and workers are
synchronized before the next phase is started.
• Normally, any worker can do any assigned task

33
Master-Slave Model (cont)
• Can be generalized to a multi-level
manager-worker model
• Top level managers feed large chunks of tasks to
second-level managers
• Second-level managers subdivide tasks to their
workers and may also perform some of the work
• Danger of manager becoming a bottleneck
• Can happen if tasks are too small
• Granularity of tasks should be chosen so that
cost of doing work dominates cost of
synchronization
• Waiting time may be reduced if worker requests
are non-deterministic.
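The phased variant can be sketched as follows (threads stand in for worker processes; the tasks and phase contents are made up for illustration): the master generates one phase of work, hands it out, and waits for the whole phase to finish before starting the next.

```python
# Illustrative sketch of the manager-worker model with phased
# execution and a synchronization point between phases.
from concurrent.futures import ThreadPoolExecutor

def work(task):
    return task * 2                    # stand-in for real computation

phases = [[1, 2, 3], [10, 20]]         # master's work, generated per phase
results = []
with ThreadPoolExecutor(max_workers=2) as workers:
    for phase in phases:
        # Master allocates this phase's tasks; map() blocks until the
        # whole phase is done, synchronizing workers before the next one.
        results.extend(workers.map(work, phase))
print(results)  # [2, 4, 6, 20, 40]
```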

34
Master-Slave Model (cont)
• Examples of algorithms based on the Master-Slave
Model
• A master-slave example for centralized
dynamic load balancing is given in Section 3.4.2
(page 130)
• Several examples are given in the textbook: Barry
Wilkinson and Michael Allen, Parallel
Programming: Techniques and Applications Using
Networked Workstations and Parallel Computers,
1st or 2nd Edition, 1999 or 2005, Prentice Hall.

35
Pipeline or Producer-Consumer Model
• Usually similar to the linear array model
studied in Akl's textbook
• A stream of data is passed through a succession
of processes, each of which performs some task on
it
• Called stream parallelism
• With the exception of the process initiating the
work for the pipeline, the arrival of new data
triggers the execution of a new task by a process
in the pipeline
• Each process can be viewed as a consumer of the
data items produced by the process preceding it

36
Pipeline or Producer-Consumer Model (cont)
• Each process in pipeline can be viewed as a
producer of data for the process following it.
• The pipeline is a chain of producers and
consumers
• The pipeline does not need to be a linear chain.
Instead, it can be a directed graph.
• Processes could form pipelines in the form of
• Linear or multidimensional arrays
• Trees
• General graphs with or without cycles
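A minimal sketch of a linear (two-stage) pipeline, with illustrative stage functions: items flow through stages connected by queues, and the arrival of an item triggers the next stage's task.

```python
# Illustrative sketch of a linear producer-consumer pipeline:
# producer -> stage 1 -> stage 2, connected by queues.
import queue
import threading

q01 = queue.Queue()   # producer -> stage 1
q12 = queue.Queue()   # stage 1 -> stage 2
DONE = object()       # sentinel marking the end of the stream

def stage1():
    while True:
        item = q01.get()
        if item is DONE:
            q12.put(DONE)             # propagate end-of-stream
            return
        q12.put(item + 1)             # first transformation

out = []

def stage2():
    while True:
        item = q12.get()
        if item is DONE:
            return
        out.append(item * 10)         # second transformation

t1 = threading.Thread(target=stage1)
t2 = threading.Thread(target=stage2)
t1.start()
t2.start()
for x in [1, 2, 3]:                   # producer feeds the stream
    q01.put(x)
q01.put(DONE)
t1.join()
t2.join()
print(out)  # [20, 30, 40]
```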

37
Pipeline or Producer-Consumer Model (cont)
• With larger tasks, it takes longer to fill up the
pipeline
• Too fine a granularity increases overhead, as
processes will need to receive new data and
initiate a new task after a small amount of
computation
• Examples of algorithms based on this model
• A two-dimensional pipeline is used in the
parallel LU factorization algorithm discussed in
Section 8.3.1
• An entire chapter is devoted to this model in
the previously mentioned textbook by Wilkinson
and Allen.

38
Hybrid Models
• In some cases, more than one model may be used in
designing an algorithm, resulting in a hybrid
algorithm
• Parallel quicksort (Section 3.2.5 and 9.4.1) is
an application for which a hybrid model is ideal.