Parallel Programming Concepts - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Parallel Programming Concepts


1
Parallel Programming Concepts
Performance measures and related issues
Parallelisation approaches
Code organization
Sources of parallelism
2
Performance Measures and Related Issues
Speedup
Amdahl's law
Load Balancing
Granularity
3
Superlinear Speedup
  • Should not be possible
  • in principle, you can simulate the fast parallel algorithm on
    a single processor and thus beat the best sequential
    algorithm
  • Yet it sometimes happens in practice
  • more memory available in a parallel computer
  • different search order in search problems

Sequential search: 12 moves
Parallel search: 2 moves
4
Additional terms
  • Efficiency: speedup / p
  • Cost: time x p
  • Scalability
  • how efficiently the hardware/algorithm can use additional
    processors
  • Gustafson's law (observation)
  • the situation is not as tragic as Amdahl's law suggests
  • the serial fraction usually stays (nearly) constant as the
    problem size increases
  • consequence: nearly linear speedup is possible if the problem
    size increases with the number of processors
  • Isoefficiency function: see the Kumar et al. book
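A quick sketch of the arithmetic behind these definitions (not part of the original slides; the function names are my own). With serial fraction f, Amdahl's law caps speedup at 1/(f + (1-f)/p), while Gustafson's scaled speedup p - f(p-1) grows nearly linearly:

```python
# Definitions from the slides: speedup S = T1/Tp, efficiency E = S/p,
# cost C = p * Tp.
def speedup(t1, tp):       return t1 / tp
def efficiency(t1, tp, p): return speedup(t1, tp) / p
def cost(tp, p):           return p * tp

# Amdahl's law: serial fraction f limits speedup to 1 / (f + (1-f)/p)
def amdahl(f, p):
    return 1.0 / (f + (1.0 - f) / p)

# Gustafson's law: scaled speedup S(p) = p - f*(p-1)
def gustafson(f, p):
    return p - f * (p - 1)

f, p = 0.05, 100
print(amdahl(f, p))     # ~16.8: capped far below p
print(gustafson(f, p))  # ~95.05: nearly linear
```

Even a 5% serial fraction caps Amdahl speedup near 17 on 100 processors, while the Gustafson view (problem size growing with p) stays close to linear.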

5
Exercises Speedup, Efficiency, Cost
Example 1: Compute the sums of columns in an upper triangular
matrix.

Program1 (for processor i, n processors):
    sum[i] = 0;
    for (j = 0; j <= i; j++)
        sum[i] += A[j][i];

Program2 (for processor i, p processors):
    for (k = i*p; k <= (i+1)*p - 1; k++) {
        sum[k] = 0;
        for (j = 0; j <= k; j++)
            sum[k] += A[j][k];
    }

Program3 (for processor i, p processors):
    for (k = i; k < n; k += p) {
        sum[k] = 0;
        for (j = 0; j <= k; j++)
            sum[k] += A[j][k];
    }
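A small simulation (not from the slides; helper names are my own) of why Program3's cyclic distribution balances better than Program2's block distribution: column k costs k+1 additions, so consecutive blocks give later processors far more work.

```python
# Compare per-processor work (number of additions) for block vs. cyclic
# column distribution in an n x n upper triangular matrix, p processors.
# Column k costs k+1 additions.
def block_work(n, p, i):               # Program2-style: consecutive columns
    cols = range(i * (n // p), (i + 1) * (n // p))
    return sum(k + 1 for k in cols)

def cyclic_work(n, p, i):              # Program3-style: every p-th column
    return sum(k + 1 for k in range(i, n, p))

n, p = 16, 4
print([block_work(n, p, i) for i in range(p)])   # [10, 26, 42, 58] - skewed
print([cyclic_work(n, p, i) for i in range(p)])  # [28, 32, 36, 40] - nearly even
```

The total work is the same in both cases; only its distribution across processors changes, which is exactly what determines the parallel finishing time.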
6
Exercises Speedup, Efficiency, Cost
Can we do any better?

    int AddColumn(int col) {
        int sum = 0;
        for (i = 0; i <= col; i++)
            sum += A[i][col];
        return sum;
    }

Program4 (for processor i, p processors):
    col = -i;
    while (col < n) {
        col += 2*i;
        sum[col] = AddColumn(col);
        col += 2*p - 2*i;
        sum[col] = AddColumn(col);
    }
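The idea behind Program4 is to pair a cheap (short) column with an expensive (long) one so every processor's pair costs the same. A simplified Python rendering of that pairing idea (not the slide's exact schedule; the pairing j with n-1-j is my illustrative variant):

```python
# Pair column j (cost j+1) with column n-1-j (cost n-j): every pair costs
# n+1 additions, so the work is balanced by construction.
def paired_work(n, p, i):
    total = 0
    for j in range(i, n // 2, p):      # processor i takes pairs (j, n-1-j)
        total += (j + 1) + (n - j)     # cheap column + expensive column
    return total

n, p = 16, 4
print([paired_work(n, p, i) for i in range(p)])  # [34, 34, 34, 34]
```

Every processor ends up with exactly the same number of additions, which block and cyclic distributions only approximate.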
7
Exercises Speedup, Efficiency, Cost
Example 2: Compute the sum of n numbers.

Program (for processor i, n processors):
    tmpSum = A[i];
    for (j = 2; j <= n; j *= 2) {
        if (i % j == 0) {
            receive(i + j/2, &hisSum);
            tmpSum += hisSum;
        } else {
            send(i - j/2, tmpSum);
            break;
        }
    }
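A serial simulation of this tree reduction (not from the slides; `tree_sum` and the mailbox dictionary are my own devices for modelling send/receive):

```python
# Simulate the tree reduction: in round j, each active processor not
# aligned to j sends its partial sum to partner i - j/2 and drops out;
# processors aligned to j receive and accumulate. Assumes len(A) is a
# power of two.
def tree_sum(A):
    n = len(A)
    tmp = list(A)                      # each processor's tmpSum
    j = 2
    while j <= n:
        half = j // 2
        # sends of this round: active processors with i % j != 0
        mailbox = {i - half: tmp[i] for i in range(0, n, half) if i % j != 0}
        # receives: processors with i % j == 0 add their partner's sum
        for i in range(0, n, j):
            tmp[i] += mailbox[i]
        j *= 2
    return tmp[0]

print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

The reduction takes log2(n) rounds, versus n-1 sequential additions, at the price of one message per dropped-out processor per round.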
8
Sources of inefficiency
[Diagram: execution timeline for processors P0..P8, shaded by
computation vs. idle time]
9
Sources of inefficiency II
[Diagram: execution timeline for processors P0..P8, shaded by
computation, idle, and communication time]
10
Sources of inefficiency
[Diagram: execution timeline for processors P0..P8, shaded by
computation, idle, communication, and additional or repeated
computation]
11
Load Balancing
Efficiency is adversely affected by an uneven workload.

[Diagram: execution timeline for processors P0..P4 with uneven
computation and idle (wasted) time]
12
Load Balancing (cont.)
  • Load balancing: shifting work from heavily loaded processors
    to lightly loaded ones.

[Diagram: timelines for processors P0..P4 showing computation,
idle (wasted) time, and moved work]

  • Static load balancing
  • before execution
  • Dynamic load balancing
  • during execution
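A minimal sketch of dynamic load balancing (not from the slides): idle workers pull the next task from a shared pool, so uneven task sizes spread themselves out. The simulation below is my own illustration.

```python
# Simulate greedy dynamic scheduling: each task goes to whichever worker
# becomes free first. Returns each worker's finishing time.
import heapq

def dynamic_schedule(task_costs, p):
    workers = [(0.0, i) for i in range(p)]     # (time when free, worker id)
    heapq.heapify(workers)
    finish = [0.0] * p
    for cost in task_costs:
        t, i = heapq.heappop(workers)          # earliest-free worker
        finish[i] = t + cost
        heapq.heappush(workers, (finish[i], i))
    return finish

# One long task plus seven short ones, two workers: a static half/half
# split would finish at time 12; dynamic scheduling finishes at time 9.
print(dynamic_schedule([9, 1, 1, 1, 1, 1, 1, 1], 2))
```

The same pull-based pattern underlies real work queues and master-slave task farms.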

13
Granularity
  • The size of the computation segments between communication.
  • spectrum: fine grained (ILP) - loop parallelism - task
    parallelism (coarse grained)
  • The most efficient granularity depends on the algorithm and
    the hardware environment in which it runs
  • In most cases the overhead associated with communication and
    synchronization is high relative to execution speed, so it is
    advantageous to have coarse granularity.
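A toy cost model of this trade-off (my own construction, not from the slides): splitting W units of work into chunks of size g incurs a fixed communication/synchronization overhead o per chunk, so too-fine grains let overhead dominate.

```python
# Toy granularity model: p processors, W total work units, chunk size g,
# per-chunk overhead o, per-unit compute time t.
def parallel_time(W, p, g, o, t=1.0):
    chunks_per_proc = W / (p * g)          # chunks each processor handles
    return chunks_per_proc * (g * t + o)   # compute + per-chunk overhead

W, p, o = 10_000, 10, 50.0
for g in (1, 10, 100, 1000):
    print(g, parallel_time(W, p, g, o))
```

With o = 50, grain size 1 spends 50x more time on overhead than on computation; at grain 1000 the overhead is about 5% of the total, matching the slide's advice to prefer coarse granularity when overhead is high.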

14
Fine Grain Parallelism
  • All tasks execute a small number of instructions
    between communication cycles
  • Low computation to communication ratio
  • Facilitates load balancing
  • Implies high communication overhead and less
    opportunity for performance enhancement
  • If granularity is too fine it is possible that
    the overhead required for communications and
    synchronization between tasks takes longer than
    the computation

15
Coarse Grain Parallelism
  • Typified by long computations consisting of large numbers of
    instructions between communication/synchronization points
  • High computation to communication ratio
  • Lower communication overhead, more opportunity for
    performance increase
  • Harder to load balance efficiently

[Diagram: timelines for processors P0..P4 showing long
computation segments separated by communication]
16
Granularity vs. Coupling
  • Granularity: fine grained - coarse grained
  • Coupling: tightly - loosely
  • hardware spectrum: SMP, ccNUMA, NUMA, MPP, ethernet cluster
    (tightest to loosest coupling)
  • the looser the coupling, the coarser the granularity must be
    for the communication not to overwhelm the computation

17
Parallel Programming Concepts
Performance measures and related issues
Parallelisation approaches
Code organization
Sources of parallelism
18
Parallelisation Approaches
  • Parallelizing compiler
  • advantage: use your current code
  • disadvantage: very limited abilities
  • Parallel domain-specific libraries
  • e.g. linear algebra, numerical libraries, quantum chemistry
  • usually a good choice, use when possible
  • Communication libraries
  • message passing libraries: MPI, PVM
  • shared memory libraries: declare and access shared memory
    variables (on MPP machines done by emulation)
  • advantage: use a standard compiler
  • disadvantage: low level programming ("parallel assembler")

19
Parallelisation Approaches (cont.)
  • New parallel languages
  • use a language with built-in explicit control for parallelism
  • no language is the best in every domain
  • needs a new compiler
  • fights against inertia
  • Parallel features in existing languages
  • adding parallel features to an existing language
  • e.g. for expressing loop parallelism (pardo) and data
    placement
  • example: High Performance Fortran
  • Additional possibilities in shared-memory systems
  • use threads
  • preprocessor compiler directives (OpenMP)

20
Parallelisation Approaches Our Focus
  • Communication libraries: MPI, PVM
  • industry standard, available for every platform
  • very general, low level approach
  • perfect match for clusters
  • most likely to be useful for you
  • Shared memory programming
  • also very important
  • likely to be useful in coming generations of PCs

21
Parallel Programming Concepts
Performance measures and related issues
Parallelisation approaches
Code organization
Sources of parallelism
22
Code Organization - SPMD
  • Single Program Multiple Data
  • well suited for SIMD computers
  • popular choice even for MIMD, as it keeps
    everything in one place
  • typical in MPI programs
  • static process creation
  • may waste memory
  • Example: a heap-like computation, the SPMD way

main() {
    if (id == 0)
        rootNode();
    else if (id < p/2)
        innerNode();
    else
        leafNode();
}
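The SPMD pattern can be simulated in a few lines (my own illustration, not from the slides): every "process" runs the same program and branches only on its own id.

```python
# Single Program Multiple Data: the same function runs everywhere; the
# process id decides which role each instance plays.
def spmd_program(id, p):
    if id == 0:
        return "root"
    elif id < p // 2:
        return "inner"
    else:
        return "leaf"

p = 8
print([spmd_program(id, p) for id in range(p)])
# ['root', 'inner', 'inner', 'inner', 'leaf', 'leaf', 'leaf', 'leaf']
```

This is exactly how typical MPI programs are organized: one executable, with `rank`-based branching replacing the `id` parameter here.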
23
Code Organization - MPMD
  • Multiple Programs Multiple Data
  • allows dynamic process creation
  • typically master-slave approach
  • more memory-efficient
  • typical in PVM

master.c:
    main() {
        for (i = 1; i < p; i++)
            sid[i] = spawn(slave(i));
    }

slave.c:
    main(int id) {
        // slave code here
    }
24
Parallel Programming Concepts
Performance measures and related
issues Parallelisation approaches Code
organization Sources of parallelism
25
Sources of Parallelism
Data Parallelism
Task Parallelism
Pipelining
26
Data Parallelism
  • divide data up amongst processors.
  • process different data segments in parallel
  • communicate boundary information, if necessary
  • includes loop parallelism
  • well suited for SIMD machines
  • communication is often implicit (HPF)
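Data parallelism in miniature (my own sketch, not from the slides): each worker gets a contiguous slice of the data and applies the same operation to it, with partial results combined at the end.

```python
# Divide the data among p workers; each sums its own slice "in parallel",
# and the partial sums are then combined.
def split(data, p):
    n = len(data)
    return [data[i * n // p:(i + 1) * n // p] for i in range(p)]

data = list(range(12))
partial = [sum(chunk) for chunk in split(data, 4)]
print(partial, sum(partial))  # [3, 12, 21, 30] 66
```

Loop parallelism is this same pattern applied to iteration ranges instead of explicit array slices.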

27
Task Parallelism
  • decompose algorithm into different sections
  • assign sections to different processors
  • often uses fork()/join()/spawn()
  • usually does not lend itself to a high degree of
    parallelism

28
Pipelining
  • a sequence of tasks whose execution can overlap
  • a sequential processor must execute them one after another,
    without overlap
  • a parallel computer can overlap the tasks, increasing
    throughput (but not decreasing latency)
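The throughput/latency distinction above is plain arithmetic; a small sketch (my own, not from the slides): with s stages of time t each, one item still takes s*t, but n items finish in (s + n - 1)*t instead of n*s*t.

```python
# Pipeline timing: s stages, t time per stage, n items.
def sequential_time(n, s, t): return n * s * t        # no overlap
def pipelined_time(n, s, t):  return (s + n - 1) * t  # stages overlap

n, s, t = 100, 4, 1.0
print(sequential_time(n, s, t))  # 400.0
print(pipelined_time(n, s, t))   # 103.0
```

For large n the pipeline approaches one result per stage time t (throughput up by nearly s), yet a single item's latency is unchanged at s*t.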

29
New Concepts and Terms - Summary
  • speedup, efficiency, cost, scalability
  • Amdahl's law, Gustafson's law
  • Load Balancing: static, dynamic
  • Granularity: fine, coarse
  • Tightly vs. loosely coupled systems
  • SPMD, MPMD
  • Data Parallelism, Task Parallelism, Pipelining