Introduction to Parallel Computing by Grama,
Gupta, Karypis, Kumar
  • Selected Topics from Chapter 3: Principles of
    Parallel Algorithm Design

Elements of a Parallel Algorithm
  • Pieces of work that can be done concurrently
  • Tasks
  • Mapping of the tasks onto multiple processors
  • Processes vs processors
  • Distributing the input, output, and intermediate
    results across different processors
  • Management of access to shared data
  • Either input or intermediate
  • Synchronization of the processors at various
    points of the parallel execution

Finding Concurrent Pieces of Work
  • Decomposition
  • The process of dividing the computation into
    smaller pieces of work called tasks
  • Tasks are programmer defined and are considered
    to be indivisible.

Tasks can be of different sizes
Task-Dependency Graph
  • In most cases, there are dependencies between the
    different tasks
  • Certain task(s) can only start once some other
    task(s) have finished
  • Example: producer-consumer relationships
  • These dependencies are represented using a DAG
    called a task-dependency graph

Task-Dependency Graph (cont)
  • A task-dependency graph is a directed acyclic
    graph in which the nodes represent tasks and the
    directed edges indicate the dependences between
    them.
  • The task corresponding to a node can be executed
    when all tasks connected to this node by incoming
    edges have been completed.
  • The number and size of the tasks that the problem
    is decomposed into determines the granularity of
    the decomposition.
  • Called fine-grained for a large number of small
    tasks
  • Called coarse-grained for a small number of large
    tasks

Task-Dependency Graph (cont)
  • Key Concepts Derived from Task-Dependency Graph
  • Degree of Concurrency
  • The number of tasks that can be executed
    concurrently
  • We are usually most concerned about the average
    degree of concurrency
  • Critical Path
  • The longest vertex-weighted path in the graph
  • The weights inside nodes represent the task size
  • Is the sum of the weights of nodes along the path
  • The degree of concurrency and critical path
    length normally increase as granularity becomes
    finer.
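
Both measures can be computed directly from a task-dependency graph. The sketch below is only an illustration: the graph, task names, and weights are made up, and the average degree of concurrency is taken as total work divided by critical path length.

```python
from functools import lru_cache

# Hypothetical task-dependency graph: task -> (weight, tasks it depends on).
TASKS = {
    "a": (10, []),
    "b": (10, []),
    "c": (10, []),
    "d": (10, []),
    "e": (6, ["a", "b"]),
    "f": (9, ["c", "d"]),
    "g": (8, ["e", "f"]),
}


def critical_path_length(tasks):
    """Largest sum of node weights along any path in the DAG."""

    @lru_cache(maxsize=None)
    def longest_ending_at(name):
        weight, deps = tasks[name]
        return weight + max((longest_ending_at(d) for d in deps), default=0)

    return max(longest_ending_at(name) for name in tasks)


def average_degree_of_concurrency(tasks):
    """Total work divided by the critical path length."""
    total_work = sum(weight for weight, _ in tasks.values())
    return total_work / critical_path_length(tasks)
```

For this graph the critical path is c, f, g with length 27, and the total work is 63, so the average degree of concurrency is 63/27, about 2.33.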

Task-Interaction Graph
  • Captures the pattern of interaction between tasks
  • This graph usually contains the task-dependency
    graph as a subgraph.
  • This holds because there may be interactions
    between tasks even if there are no dependencies.
  • These interactions are usually due to accesses of
    shared data

Task Dependency and Interaction Graphs
  • These graphs are important in developing
    effective mapping of the tasks onto the different
    processors
  • Need to maximize concurrency and minimize

Common Decomposition Methods
  • Data Decomposition
  • Recursive Decomposition
  • Exploratory Decomposition
  • Speculative Decomposition
  • Hybrid Decomposition

task decomposition methods
Recursive Decomposition
  • Suitable for problems that can be solved using
    the divide and conquer paradigm
  • Each of the subproblems generated by the divide
    step becomes a new task.

Example: Quicksort
Another Example: Finding the Minimum
  • Note that we can obtain divide-and-conquer
    algorithms for problems that are usually solved
    by using other methods.
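
A minimal sketch of such a divide-and-conquer formulation for finding the minimum follows. It runs serially; the point is only that the two recursive calls do not depend on each other, so each could become a separate task. The function name and signature are illustrative assumptions.

```python
def recursive_min(values, lo=0, hi=None):
    """Minimum of values[lo:hi] by divide and conquer."""
    if hi is None:
        hi = len(values)
    if hi <= lo:
        raise ValueError("cannot take the minimum of an empty range")
    if hi - lo == 1:
        return values[lo]
    mid = (lo + hi) // 2
    left = recursive_min(values, lo, mid)    # independent subproblem (task)
    right = recursive_min(values, mid, hi)   # independent subproblem (task)
    return min(left, right)                  # combine step depends on both
```

With n elements the task-dependency graph is a binary tree, so the combine steps form a chain of about log2(n) levels.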

Recursive Decomposition
  • How good are the decompositions produced?
  • Average Concurrency?
  • Length of critical path?
  • How do the quicksort and min-finding
    decompositions measure up?

Data Decomposition
  • Used to derive concurrency for problems that
    operate on large amounts of data
  • The idea is to derive the tasks by focusing on
    the multiplicity of data
  • Data decomposition is often performed in two
    steps:
  • Step 1: Partition the data
  • Step 2: Induce a computational partitioning from
    the data partitioning.

Data Decomposition (cont)
  • Which data should we partition?
  • Input/Output/Intermediate?
  • All of the above
  • This leads to different data decomposition
  • How to induce a computational partitioning?
  • Use the owner-computes rule
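
As a sketch of the two steps, the example below partitions the output vector of a matrix-vector product into contiguous blocks and lets each task compute only the entries it owns. The helper names and the block partitioning are assumptions chosen for illustration, and the tasks are run one after another rather than in parallel.

```python
def partition(n, num_tasks):
    """Step 1: split indices 0..n-1 into num_tasks contiguous blocks."""
    base, extra = divmod(n, num_tasks)
    blocks, start = [], 0
    for t in range(num_tasks):
        size = base + (1 if t < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def matvec_task(A, x, rows):
    """Step 2 (owner-computes): compute only the output entries in rows."""
    return {i: sum(a * b for a, b in zip(A[i], x)) for i in rows}


def matvec(A, x, num_tasks):
    """y = A x, assembled from independent per-block tasks."""
    y = [0] * len(A)
    for rows in partition(len(A), num_tasks):
        for i, value in matvec_task(A, x, rows).items():
            y[i] = value
    return y
```

Partitioning the output is only one choice; partitioning the input or intermediate data would induce a different set of tasks.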

Exploratory Decomposition
  • Used to decompose computations that correspond to
    a search of the space of solutions.

Example: 15-puzzle Problem
Exploratory Decomposition
  • Not general purpose
  • After sufficient branches are generated, each
    node can be assigned the task to explore further
    down one branch
  • As soon as one task finds a solution, the other
    tasks can be terminated.
  • It can result in speedup and slowdown anomalies
  • The work performed by the parallel formulation of
    an algorithm can be either smaller or greater
    than that performed by the serial algorithm.
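
The anomaly can be seen in a small simulation. The sketch below is an assumption-laden toy, not the 15-puzzle: the search space is a flat list, the serial search scans it left to right, and the parallel search gives each task one contiguous chunk and advances all tasks in lockstep until one of them finds the solution.

```python
def serial_work(space, is_solution):
    """Number of states examined by a left-to-right serial search."""
    for examined, state in enumerate(space, start=1):
        if is_solution(state):
            return examined
    return len(space)


def parallel_work(space, is_solution, num_tasks):
    """Total states examined by all tasks in a lockstep simulation."""
    if not space:
        return 0
    chunk = -(-len(space) // num_tasks)  # ceiling division
    chunks = [space[i:i + chunk] for i in range(0, len(space), chunk)]
    examined = 0
    for step in range(chunk):
        found = False
        for c in chunks:                 # every task examines one state
            if step < len(c):
                examined += 1
                found = found or is_solution(c[step])
        if found:                        # remaining tasks are terminated
            return examined
    return examined
```

With 16 states and 4 tasks, a solution at index 12 costs the serial search 13 examinations but the parallel search only 4 in total. A solution at index 3 costs the serial search 4 examinations but the parallel search 16.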

Speculative Decomposition
  • Used to extract concurrency in problems in which
    the next step is one of many possible actions
    that can only be determined when the current task
    finishes
  • This decomposition assumes a certain outcome of
    the currently executed task and executes some of
    the next steps
  • Similar to speculative execution at the

Speculative Decomposition
  • Difference from exploratory decomposition
  • In speculative decomposition, the input at a
    branch leading to multiple tasks is unknown.
  • In exploratory decomposition, the output of the
    multiple tasks originating at the branch is
    unknown.
Example: Discrete Event Simulation
Speculative Execution
  • If predictions are wrong
  • Work is wasted
  • Work may need to be undone
  • State-restoring overhead
  • Memory/computations
  • However, it may be the only way to extract
    concurrency
Mapping Tasks to Processors
  • A good mapping strives to achieve the following
    conflicting goals
  • Reducing the amount of time that processors spend
    interacting with each other.
  • Reducing the amount of total time that some
    processors are active while others are idle.
  • Good mappings attempt to reduce the parallel
    processing overheads
  • If Tp is the parallel runtime using p processors
    and Ts is the sequential runtime (for the same
    algorithm), then the total overhead To is
    To = pTp - Ts
  • This is the work that is done by the parallel
    system that is beyond that required for the
    serial system.
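
A short worked example of the overhead formula To = pTp - Ts follows. The function name and the runtimes are made-up values for illustration only.

```python
def total_overhead(p, t_parallel, t_serial):
    """Total overhead To = p * Tp - Ts, in the same time unit as the inputs."""
    return p * t_parallel - t_serial


# Hypothetical measurements: 100 s serially, 30 s on 4 processors.
example_overhead = total_overhead(p=4, t_parallel=30, t_serial=100)
```

Here the 4 processors spend 4 * 30 = 120 processor-seconds in total, so 20 of them go to interaction, idling, or extra computation rather than to the 100 seconds of useful serial work.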

Add Slides from Karypis Lecture Slides
  • Add Slides 37-52 here from the PDF lecture slides
    by Karypis for Chapter 3 of the textbook,
  • Introduction to Parallel Computing, Second
    Edition, Ananth Grama, Anshul Gupta, George
    Karypis, Vipin Kumar, Addison Wesley, 2003.
  • Topics covered on these slides are sections
  • Characteristics of Tasks and Interactions
  • Mapping Techniques for Load Balancing
  • Methods for Containing Interaction Overheads
  • These slides are most easily viewed by going to
    View and choosing Full Screen. Exit from
    Full Screen using the Esc key.

Parallel Algorithm Models
  • The Task Graph Model
  • Closely related to Foster's Task/Channel Model
  • Includes the task dependency graph where
    dependencies usually result from communications
    between two tasks
  • Also includes the task-interaction graph, which
    also captures other interactions between tasks
    such as data sharing
  • The Work Pool Model
  • The Master-Slave Model
  • The Pipeline or Producer-Consumer Model
  • Hybrid Models

The Task Graph Model
  • The computations in a parallel algorithm can be
    viewed as a task-dependency graph.
  • Tasks are mapped to processors so that locality
    is promoted
  • Volume and frequency of interactions are reduced
  • Asynchronous interaction methods are used to
    overlap interactions with computation
  • Typically used to solve problems in which the
    data related to a task is rather large compared
    to the amount of computation.

The Task Graph Model (cont.)
  • Examples of algorithms based on task graph model
  • Parallel Quicksort (Section 9.4.1)
  • Sparse Matrix Factorization
  • Multiple parallel algorithms derived from
    divide-and-conquer decompositions.
  • Task Parallelism
  • The type of parallelism that is expressed by the
    independent tasks in a task-dependency graph.

The Work Pool Model
  • Also called the Task Pool Model
  • Involves dynamic mapping of tasks onto processes
    for load balancing
  • Any task may potentially be performed by any
    process
  • The mapping of tasks to processors can be
    centralized or decentralized.
  • Pointers to tasks may be stored in a physically
    shared list, priority queue, hash table, or tree,
    or in a physically distributed data structure.

The Work Pool Model (cont.)
  • When work is generated dynamically and a
    decentralized mapping is used, then a termination
    detection algorithm is required
  • When used with a message passing paradigm,
    normally the data required by the tasks is
    relatively small when compared to the
    computation associated with the tasks
  • Tasks can be readily moved around without causing
    too much data interaction overhead
  • Granularity of tasks can be adjusted to obtain
    desired tradeoff between load imbalance and the
    overhead of adding and extracting tasks
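
A minimal work pool sketch using Python threads and a shared queue follows. It is a centralized pool, so `Queue.join()` doubles as the termination check; a decentralized pool would need a real termination detection algorithm. The names `run_work_pool` and `split_sum` are hypothetical, and `process` is assumed not to raise.

```python
import queue
import threading


def run_work_pool(initial_tasks, process, num_workers=4):
    """Run tasks from a shared pool; process(task) -> (result, new_tasks)."""
    pool = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = pool.get()
            if task is None:            # sentinel: no more work
                pool.task_done()
                return
            result, new_tasks = process(task)
            with lock:
                results.append(result)
            for new_task in new_tasks:  # dynamically generated work
                pool.put(new_task)
            pool.task_done()            # only after the children are queued

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for task in initial_tasks:
        pool.put(task)
    pool.join()                         # every queued task has been processed
    for _ in threads:
        pool.put(None)
    for t in threads:
        t.join()
    return results


def split_sum(task):
    """Example task: sum a range, splitting it while it is larger than 4."""
    lo, hi = task
    if hi - lo <= 4:
        return sum(range(lo, hi)), []
    mid = (lo + hi) // 2
    return 0, [(lo, mid), (mid, hi)]
```

The threshold of 4 in `split_sum` is the granularity knob: a smaller value gives more, smaller tasks and better balance, at the cost of more queue operations.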

The Work Pool Model (cont.)
  • Examples of algorithms based on the Work Pool
    Model
  • Chunk-Scheduling

Master-Slave Model
  • Also called the Manager-Worker model
  • One or more master processes generate work and
    allocate it to workers
  • Managers can allocate tasks in advance if they
    can estimate the size of tasks or if a random
    mapping can avoid load-balancing problems
  • Normally, workers are assigned smaller tasks, as
  • Work can be performed in phases
  • Work in each phase is completed and workers
    synchronized before next phase is started.
  • Normally, any worker can do any assigned task
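
The phased variant can be sketched with Python's standard `concurrent.futures` thread pool, where the calling thread plays the manager. The function name and the representation of a phase as a list of zero-argument callables are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_phases(phases, num_workers=4):
    """Run each phase's tasks on the workers, one phase at a time.

    phases is a list of phases; each phase is a list of zero-argument
    callables. Returns one list of results per phase, in task order.
    """
    all_results = []
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        for tasks in phases:
            futures = [workers.submit(task) for task in tasks]
            # Waiting for every result is the synchronization point:
            # the next phase starts only after this one has finished.
            all_results.append([f.result() for f in futures])
    return all_results
```

Any idle worker picks up the next submitted task, which matches the assumption that any worker can do any assigned task.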

Master-Slave Model (cont)
  • Can be generalized to a multi-level
    manager-worker model
  • Top level managers feed large chunks of tasks to
    second-level managers
  • Second-level managers subdivide tasks to their
    workers and may also perform some of the work
  • Danger of manager becoming a bottleneck
  • Can happen if tasks are too small
  • Granularity of tasks should be chosen so that
    cost of doing work dominates cost of
  • Waiting time may be reduced if worker requests
    are non-deterministic.

Master-Slave Model (cont)
  • Examples of algorithms based on the Master-Slave
    Model
  • A master-slave example is mentioned under
    centralized dynamic load balancing in
    Section 3.4.2 (page
  • Several examples are given in the textbook by
    Barry Wilkinson and Michael Allen, Parallel
    Programming: Techniques and Applications Using
    Networked Workstations and Parallel Computers,
    1st or 2nd Edition (1999 or 2005), Prentice Hall.

Pipeline or Producer-Consumer Model
  • Usually similar to the linear array model
    studied in Akl's textbook.
  • A stream of data is passed through a succession
    of processes, each of which performs some task on
    it
  • Called Stream Parallelism
  • With the exception of the process initiating the
    work for the pipeline, the arrival of new data
    triggers the execution of a new task by a process
    in the pipeline.
  • Each process can be viewed as a consumer of the
    data items produced by the process preceding it

Pipeline or Producer-Consumer Model (cont)
  • Each process in pipeline can be viewed as a
    producer of data for the process following it.
  • The pipeline is a chain of producers and
    consumers
  • The pipeline does not need to be a linear chain.
    Instead, it can be a directed graph.
  • Processes could form pipelines in the form of
  • Linear or multidimensional arrays
  • Trees
  • General graphs with or without cycles
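
A linear pipeline, the simplest of these shapes, can be sketched with one thread per stage and a queue between neighbours. This is an illustrative sketch only; `run_pipeline`, `stage`, and the `_DONE` marker are hypothetical names.

```python
import queue
import threading

_DONE = object()  # end-of-stream marker passed down the pipeline


def stage(func, inbox, outbox):
    """Consume items from inbox, apply func, produce results to outbox."""
    while True:
        item = inbox.get()          # arrival of new data triggers a new task
        if item is _DONE:
            outbox.put(_DONE)       # tell the next stage to stop as well
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Pass each item through funcs in order, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:              # the initiating process feeds the pipeline
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Each stage is a consumer of the queue before it and a producer for the queue after it. Because every stage is a single thread reading a FIFO queue, results come out in input order.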

Pipeline or Producer-Consumer Model (cont)
  • Load balancing is a function of task granularity
  • With larger tasks, it takes longer to fill up the
    pipeline
  • This keeps tasks waiting
  • Too fine a granularity increases overhead, as
    processes will need to receive new data and
    initiate a new task after a small amount of
    computation
  • Examples of algorithms based on this model
  • A two-dimensional pipeline is used in the
    parallel LU factorization algorithm discussed in
    Section 8.3.1
  • An entire chapter is devoted to this model in
    the previously mentioned textbook by Wilkinson
    and Allen

Hybrid Models
  • In some cases, more than one model may be used in
    designing an algorithm, resulting in a hybrid
    model
  • Parallel quicksort (Sections 3.2.5 and 9.4.1) is
    an application for which a hybrid model is ideal.