Programming Parallel Algorithms NESL presentation

1
Programming Parallel Algorithms - NESL

Guy E. Blelloch
Presented by
Michael Sirivianos
Barbara Theodorides

2
Problem Statement

Why design a new language specifically for
programming parallel algorithms?
In the past 20 years there has been tremendous
progress in developing and analyzing parallel
algorithms.
Over the same period there has been less success
in developing good languages for programming
parallel algorithms.
There is a large gap between languages that are
too low level (requiring details that obscure the
meaning of the algorithm) and languages that are
too high level (making performance implications
unclear).

3
NESL

Nested Data Parallel Language
Useful for teaching and implementing parallel
algorithms.
Bridges the gap: allows high-level descriptions
of parallel algorithms but also has a
straightforward mapping onto a performance model.
Goals when designing NESL:
A language-based performance model that uses work
and depth rather than a machine-based model that
uses running time
Support for nested data-parallel constructs
(ability to nest parallel calls)

4
Analyzing performance

Processor-based models: Performance is calculated
in terms of the number of instruction cycles a
computation takes (its running time), as a
function of input size and number of processors.
Virtual models: Higher-level models that can be
mapped onto various real machines (e.g. PRAM -
Parallel Random Access Machine).
They can be mapped efficiently onto more realistic
machines by simulating multiple processors of the
PRAM on a single processor of a host machine.
Virtual models are easier to program.

5
Measuring performance: Work and Depth

Work: the total number of operations executed by
a computation
specifies the running time on a sequential
processor
Depth: the longest chain of sequential
dependencies in the computation
represents the best possible running time
assuming an ideal machine with an unlimited
number of processors
Example: Summing 16 numbers using a balanced
binary tree
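
The slide's figure is not in this transcript, so here is the arithmetic for the example, assuming each tree node is one addition and each tree level is one parallel step:

```latex
\begin{align*}
W &= 8 + 4 + 2 + 1 = 15 && \text{(one addition per internal node)} \\
D &= \log_2 16 = 4      && \text{(one step per level of the tree)}
\end{align*}
```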

6
How can work and depth be incorporated into a
computational model?

Circuit model
Designing a circuit of logic gates
In the previous example, design a circuit in which
the inputs are at the top, each + is an adder
circuit, and each of the lines between adders is
a bundle of wires.
Work: circuit size (number of gates)
Depth: longest path from an input to an output

7
How can work and depth be incorporated into a
computational model? (cont)

Vector Machine Models
VRAM is a sequential RAM extended with a set of
instructions that operate on vectors.
Each location in memory contains a whole vector.
Vectors can vary in size during the computation.
Vector instructions include elementwise
operations (adding corresponding elements).
Depth: number of instructions executed by the machine
Work: sum of the lengths of the vectors

8
How can work and depth be incorporated into a
computational model? (cont)

Vector Machine Models: Example
Summation tree code
Work: O(n + n/2 + ...) = O(n)
Depth: O(log n)
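
The summation-tree code itself did not survive in this transcript. The following is a minimal Python sketch of the idea, not the original VRAM code. The function name `vector_tree_sum` is hypothetical. It assumes each round is a single elementwise vector addition, that work is charged as the length of the vector each round produces, and that depth is the number of rounds.

```python
def vector_tree_sum(values):
    """Sum a vector by repeatedly adding adjacent pairs elementwise.

    Returns (total, work, depth) under the assumptions stated above:
    work  = sum of the lengths of the vectors produced,
    depth = number of vector instructions (rounds) executed.
    """
    v = list(values)
    if not v:
        return 0, 0, 0
    work = 0
    depth = 0
    while len(v) > 1:
        if len(v) % 2 == 1:
            v.append(0)  # pad odd-length vectors so elements pair up
        half = len(v) // 2
        # One elementwise vector instruction on a vector of length `half`.
        v = [v[2 * i] + v[2 * i + 1] for i in range(half)]
        work += half
        depth += 1
    return v[0], work, depth


total, work, depth = vector_tree_sum(range(1, 17))
print(total, work, depth)
```

For 16 inputs the rounds operate on vectors of length 8, 4, 2, and 1, which is the n/2 + n/4 + ... pattern behind the O(n) work bound.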

9
How can work and depth be incorporated into a
computational model? (cont)

Language-Based Models
Specify the costs of the primitive instructions
and a set of rules for composing costs across
program expressions.
Discuss the running time of the algorithms
without introducing a specific machine model.
Using work and depth: work and depth costs are
assigned to each function and scalar primitive of
a language, and rules are specified for combining
parallel and sequential expressions.
Roughly speaking, when executing a set of tasks
in parallel:
work = sum of the work of the tasks
depth = maximum of the depth of the tasks
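
As a rough illustration of these composition rules, the sketch below sums a list recursively and tracks cost as it goes. The name `tree_sum` is hypothetical. It assumes a leaf costs nothing, that the two recursive calls run in parallel, and that the combining addition costs one unit of work and one unit of depth.

```python
def tree_sum(xs):
    """Return (value, work, depth) for a divide-and-conquer sum of xs.

    The two halves are treated as parallel tasks, so their work adds
    and their depth is the maximum; the final addition adds 1 to both.
    """
    if len(xs) == 1:
        return xs[0], 0, 0  # assumption: reading a leaf is free
    mid = len(xs) // 2
    left_value, left_work, left_depth = tree_sum(xs[:mid])
    right_value, right_work, right_depth = tree_sum(xs[mid:])
    value = left_value + right_value
    work = left_work + right_work + 1          # work: sum over tasks
    depth = max(left_depth, right_depth) + 1   # depth: max over tasks
    return value, work, depth


print(tree_sum(list(range(1, 17))))
```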

10
Why Work and Depth?

Work and depth have been used informally for many
years to describe the performance of parallel
algorithms:
easier to describe
easier to think about
easier to analyze algorithms in terms of work and
depth than in terms of running time and number of
processors (processor-based model)
Why are models based on work and depth better than
processor-based models for programming and
analyzing parallel algorithms?
Performance analysis is closely related to the
code, and the code provides a clear abstraction of
parallelism.

11
Why Work and Depth? (cont)

To support this claim they consider Quicksort.
Sequential algorithm:
Average-case running time O(n log n); depth of
recursive calls O(log n)
Parallel algorithm:
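
The slide's code for the parallel algorithm is not in this transcript. The sketch below shows the structure a nested data-parallel Quicksort usually takes: three filters over the whole sequence, then one apply-to-each over the two recursive calls. It is an illustration only. Python runs these comprehensions sequentially, and the choice of the middle element as pivot is an assumption.

```python
def quicksort(a):
    """Quicksort written in a data-parallel style.

    Each comprehension stands in for a parallel filter, and the
    comprehension over (lesser, greater) stands in for running both
    recursive calls in parallel (nested parallelism).
    """
    if len(a) < 2:
        return list(a)
    pivot = a[len(a) // 2]  # assumption: middle element as pivot
    lesser = [e for e in a if e < pivot]
    equal = [e for e in a if e == pivot]
    greater = [e for e in a if e > pivot]
    result = [quicksort(v) for v in (lesser, greater)]
    return result[0] + equal + result[1]


print(quicksort([8, 14, -8, -9, 5, -9, -3, 0, 17, 19]))
```

Nothing in the code says how data or calls are distributed over processors; that is the contrast the processor-based version draws.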

12
Quicksort (cont.)

Code and analysis based on a processor-based
model:
The code will have to specify
how the sequence is partitioned across processors
how the subselection is implemented in parallel
how the recursive calls get partitioned among the
processors
how the subcalls are synchronized
In the case of Quicksort, this gets even more
complicated, because the recursive calls are not
of equal sizes.

13
Work and Depth and running time

Running time at the two limits:
Single processor: RT = work
Unlimited number of processors: RT = depth
We can place upper and lower bounds on the running
time T for a given number of processors P:
W/P <= T < W/P + D
valid under assumptions about communication and
scheduling costs.
e.g. given memory latency L:
W/P <= T < W/P + L*D
The communication cost among processors is not unit
time, thus D is multiplied by a latency factor.
Bandwidth is not taken into account. In the case of
significantly different bandwidths, W should be
divided by a large B factor and D by a small B
factor.
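
To make the bounds concrete, here is a small helper that evaluates them for given values. The function name and parameters are hypothetical, and it only evaluates the two formulas on this slide; it does not model scheduling or bandwidth.

```python
def running_time_bounds(work, depth, processors, latency=1):
    """Return (lower, upper) bounds on running time T.

    lower = W / P
    upper = W / P + L * D   (L = 1 gives the latency-free bound)
    """
    if processors < 1:
        raise ValueError("processors must be at least 1")
    lower = work / processors
    upper = work / processors + latency * depth
    return lower, upper


# The 16-number summation tree: W = 15, D = 4, on 4 processors.
print(running_time_bounds(15, 4, 4))
print(running_time_bounds(15, 4, 4, latency=10))
```

When W/P is much larger than L*D the two bounds are close, so the work term dominates the running time.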

14
Work and Depth and running time (cont)

Communication Bounds
Work and depth do not take into account
communication costs:
latency: time between making a remote request and
receiving the reply
bandwidth: rate at which a processor can access
memory
Latency can be hidden.
Each processor has multiple parallel tasks
(threads) to execute and therefore has plenty to
do while waiting for replies.
Bandwidth cannot be hidden. While a processor is
waiting for a data transfer to complete, it is not
able to perform other operations, and therefore
remains idle.

15
Nested Data-Parallelism and NESL

Data-Parallelism: the ability to operate in
parallel over sets of data
Data-Parallel Languages or Collection-Oriented
Languages:
languages based on data-parallelism. Can be
either flat or nested
Importance of nested parallelism:
Used to implement nested loops and
divide-and-conquer algorithms in parallel
Existing languages, such as C, do not have direct
support for such nesting!
NESL:
Is a nested data-parallel language.
Designed in order to express nested parallelism
in a simple way with a minimum set of structures

16
NESL

Supports data-parallelism by means of operations
on sequences
Apply-to-each construct, which uses a set-like
notation
e.g. {a * a : a in [3, -4, -9, 5]}
Used over multiple sequences:
{a * b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]}
Ability to subselect elements of a sequence
based on a filter
e.g. {a * a : a in [3, -4, -9, 5] | a > 0}
Any function may be applied to each element of a
sequence
e.g. {factorial(i) : i in [3, 1, 7]}
Provides a set of functions on sequences, each
of which can be implemented in parallel (sum,
reverse, write)
e.g. write([0, 0, 0, 0, 0, 0, 0, 0],
[(4,-2),(2,5),(5,9)])
Nested parallelism: allows sequences to be nested
and allows parallel functions to be used in an
apply-to-each.
e.g. {sum(a) : a in [[2,3], [8,3,9], [7]]}
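
For readers unfamiliar with the notation, the examples on this slide correspond closely to Python list comprehensions. The sketch below is a sequential analogue only. The helper `write` is a hypothetical stand-in that takes (index, value) pairs and returns an updated copy of the destination sequence.

```python
from math import factorial

# Apply-to-each over one sequence.
squares = [a * a for a in [3, -4, -9, 5]]

# Apply-to-each over two sequences of equal length.
products = [a * b for a, b in zip([3, -4, -9, 5], [1, 2, 3, 4])]

# Subselection with a filter.
positive_squares = [a * a for a in [3, -4, -9, 5] if a > 0]

# Any function can be applied to each element.
factorials = [factorial(i) for i in [3, 1, 7]]


def write(dest, updates):
    """Return a copy of dest with each (index, value) pair written in."""
    out = list(dest)
    for index, value in updates:
        out[index] = value
    return out


written = write([0, 0, 0, 0, 0, 0, 0, 0], [(4, -2), (2, 5), (5, 9)])

# Nested parallelism: a parallel function (sum) inside an apply-to-each.
row_sums = [sum(a) for a in [[2, 3], [8, 3, 9], [7]]]

print(squares, products, positive_squares, factorials, written, row_sums)
```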

17
The Performance Model

Defines work and depth in terms of the work and
depth of the primitive operations, and rules for
composing the measures across expressions.
In most cases W(e1 + e2) = 1 + W(e1) + W(e2),
where the ei are expressions.
A similar rule is used for the depth.
Rules for:
apply-to-each expression
if expression
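
The formulas for these two rules did not survive in this transcript. The block below is a reconstruction from the usual presentation of the NESL cost model and should be checked against the original paper. For the apply-to-each, work sums over the elements while depth takes the maximum. For the if, only the branch that is taken is charged.

```latex
\begin{align*}
W(\{e_1(a) : a \in e_2\}) &= 1 + W(e_2) + \sum_{a \in e_2} W(e_1(a)) \\
D(\{e_1(a) : a \in e_2\}) &= 1 + D(e_2) + \max_{a \in e_2} D(e_1(a)) \\
W(\texttt{if } e_1 \texttt{ then } e_2 \texttt{ else } e_3) &= 1 + W(e_1) +
  \begin{cases}
    W(e_2) & \text{if } e_1 \text{ is true} \\
    W(e_3) & \text{otherwise}
  \end{cases}
\end{align*}
```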

18
The Performance Model (cont)

Example: Factorial
Consider the evaluation of the expression
e = {factorial(n) : n in a} where a = [3, 1, 5, 2].
function factorial(n) =
  if (n == 1) then 1
  else n * factorial(n-1);
Using the rules for work and depth, we can write
equations for the cost of factorial,
where W_=, W_*, and W_- each have cost 1.
The two unit constants come from the cost of the
function call and the if-then-else rule.
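
The cost equations themselves are missing from this transcript, so the following sketch works the example through under explicitly assumed unit costs: 1 for the function call, 1 for the if-then-else, and 1 each for the comparison, multiplication, and subtraction. The function names are hypothetical, and the constants may differ slightly from the original slide's accounting.

```python
def factorial_cost(n):
    """Return (work, depth) of factorial(n) under the assumed unit costs.

    factorial is purely sequential, so depth equals work.
    """
    call_cost, if_cost, compare_cost = 1, 1, 1
    base = call_cost + if_cost + compare_cost
    if n == 1:
        return base, base
    multiply_cost, subtract_cost = 1, 1
    sub_work, sub_depth = factorial_cost(n - 1)
    step = base + multiply_cost + subtract_cost
    return step + sub_work, step + sub_depth


def apply_to_each_cost(values):
    """Cost of {factorial(n) : n in values}: sum the work, max the depth."""
    costs = [factorial_cost(n) for n in values]
    work = 1 + sum(w for w, _ in costs)
    depth = 1 + max(d for _, d in costs)
    return work, depth


print(apply_to_each_cost([3, 1, 5, 2]))
```

With these assumptions factorial(n) costs 5n - 2, so the expression over [3, 1, 5, 2] has total work 1 + (13 + 3 + 23 + 8) = 48 and depth 1 + 23 = 24. The depth is set by the largest element alone.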

19
Examples of Parallel Algorithms in NESL

Principles
An important aspect of developing a good parallel
algorithm is designing one whose work is close to
the time for a good sequential algorithm that
solves the same problem.
Work-efficient: Parallel algorithms are referred
to as work-efficient relative to a sequential
algorithm if their work is within a constant
factor of the time of the sequential algorithm.

20
Examples of Parallel Algorithms in NESL (cont)

Primes
Sieve of Eratosthenes
1 procedure PRIMES(n)
2 let A be an array of length n
3 set all but the first element of A to TRUE
4 for i from 2 to sqrt(n)
5 begin
6 if A[i] is TRUE
7 then set all multiples of i up to n to FALSE
8 end
Line 7 is implemented by looping over the
multiples, thus the algorithm takes O(n log log
n) time.
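
A runnable version of the pseudocode above, as a sketch. The name `primes_up_to` is hypothetical. It assumes a 0-indexed array of length n + 1, so index i stands for the number i, and it starts marking at 2i so that i itself stays TRUE.

```python
from math import isqrt


def primes_up_to(n):
    """Sequential Sieve of Eratosthenes: return all primes <= n."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, isqrt(n) + 1):
        if is_prime[i]:
            # Line 7 of the pseudocode: loop over the multiples of i.
            for multiple in range(2 * i, n + 1, i):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]


print(primes_up_to(30))
```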

21
Examples of Parallel Algorithms in NESL (cont)

Primes (parallelized)
Parallelize the line "set all multiples of i up
to n to FALSE":
the multiples of a value i can be generated in
parallel by [2*i:n:i]
and can be written into the array A in
parallel with the write function.
The depth of this algorithm is O(sqrt(n)), since
each iteration of the loop has constant depth and
there are sqrt(n) iterations.
The number of multiples is the same as the time
of the sequential version.
Since it does the same number of operations,
the work is the same: O(n log log n).

22
Examples of Parallel Algorithms in NESL (cont)

Primes: Improving depth
If we are given all the primes from 2 up to
sqrt(n), we could then generate
all the multiples of these primes at once:
{[2*p:n:p] : p in sqr_primes}
function primes(n) =
  if n == 2 then ([] int)
  else
    let sqr_primes = primes(isqrt(n));
        composites = {[2*p:n:p] : p in sqr_primes};
        flat_comps = flatten(composites);
        flags = write(dist(true, n), {(i,false) : i in flat_comps});
        indices = {i in [0:n]; fl in flags | fl}
    in drop(indices, 2);
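
A Python sketch that follows the same steps as the NESL function, with two deliberate adjustments that are assumptions of this sketch rather than part of the slide. The base case is `n <= 2`, and the recursive call uses `isqrt(n - 1) + 1`, so that every prime p with p*p < n is included and the recursion always terminates. The helper `write` is a hypothetical stand-in for the NESL primitive.

```python
from math import isqrt


def write(dest, updates):
    """Return a copy of dest with each (index, value) pair written in."""
    out = list(dest)
    for index, value in updates:
        out[index] = value
    return out


def primes(n):
    """Return all primes strictly less than n."""
    if n <= 2:
        return []
    # Primes needed to sieve the range [0, n): all p with p * p <= n - 1.
    sqr_primes = primes(isqrt(n - 1) + 1)
    # Nested sequence: one list of multiples per prime.
    composites = [list(range(2 * p, n, p)) for p in sqr_primes]
    flat_comps = [i for multiples in composites for i in multiples]
    flags = write([True] * n, [(i, False) for i in flat_comps])
    indices = [i for i, fl in zip(range(n), flags) if fl]
    return indices[2:]  # drop 0 and 1


print(primes(50))
```

Each level does a constant number of sequence operations; only the recursion on a square-root-sized problem is sequential.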

23
Examples of Parallel Algorithms in NESL (cont)

Primes: Improving depth
Analysis of Work and Depth
Work: clearly most of the work is done at the top
level of recursion, which does O(n log log n)
work, and therefore the total work is
O(n log log n).
Depth: since each recursion level has constant
depth, the total depth is proportional to the
number of levels. The number of levels is log log
n (the size of the problem at the dth level is
n^(1/2^d), which reaches a constant when
d = log log n), and therefore the depth
is O(log log n).
This algorithm remains work-efficient and greatly
improves the depth.

24
Examples of Parallel Algorithms in NESL (cont)

Sparse Matrix Multiplication
Sparse matrices: most elements are zero
Representation in NESL:
    [  2.0  -1.0    0      0   ]
A = [ -1.0   2.0  -1.0     0   ]
    [   0   -1.0   2.0   -1.0  ]
    [   0     0   -1.0    2.0  ]
A = [[(0, 2.0), (1, -1.0)],
     [(0, -1.0), (1, 2.0), (2, -1.0)],
     [(1, -1.0), (2, 2.0), (3, -1.0)],
     [(2, -1.0), (3, 2.0)]]
E.g. multiply a sparse matrix A with a dense
vector x.
The product Ax in NESL is
{sum({v * x[i] : (i,v) in row}) : row in A}
Let n be the number of nonzero elements in the
row; then
depth of the computation = the depth of the
sum = O(log n)
work = sum of the work across the elements
= O(n)
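
The same representation and product, sketched in Python. Each row is a list of (column index, value) pairs; the function name `sparse_matvec` and the sample vector `x` are made up for illustration.

```python
# Sparse rows: only the nonzero entries, stored as (column, value) pairs.
A = [
    [(0, 2.0), (1, -1.0)],
    [(0, -1.0), (1, 2.0), (2, -1.0)],
    [(1, -1.0), (2, 2.0), (3, -1.0)],
    [(2, -1.0), (3, 2.0)],
]


def sparse_matvec(matrix, x):
    """Multiply a sparse matrix by a dense vector x.

    The outer comprehension is the apply-to-each over rows; the inner
    sum is the per-row dot product over the nonzero entries only.
    """
    return [sum(v * x[i] for (i, v) in row) for row in matrix]


x = [1.0, 2.0, 3.0, 4.0]
print(sparse_matvec(A, x))
```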

25
Examples of Parallel Algorithms in NESL (cont)

Planar Convex Hull
Problem: Given n points in the plane, find which
of them lie on the perimeter of the smallest
convex region that contains all points.
An example of nested parallelism for
divide-and-conquer algorithms.
Quickhull algorithm (similar to Quicksort):
The strategy is to pick a pivot element, split
the data based on the pivot, and recurse on each
of the split sets.
The worst-case work is O(n^2) and the worst-case
depth is O(n).

26
Examples of Parallel Algorithms in NESL (cont)

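Quickhull step (figure):
Start with hsplit(set,A,P) for the points on one
side of the line A-P and hsplit(set,P,A) for the
points on the other side.
cross_product(p, (A,P)) measures how far each
point p is from the line A-P
pm = the point farthest from the line A-P
Recursively call hsplit(set,A,pm) and
hsplit(set,pm,P)
Each call ignores the elements below its line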

27
Examples of Parallel Algorithms in NESL (cont)

Performance analysis of Quickhull
Each recursive call has constant depth and O(n)
work.
However, since many points might be deleted on
each step, the work could be significantly less.
As in Quicksort, the worst-case work is O(n^2)
and the worst-case depth is O(n).
For m hull points, the best-case costs are O(n)
work and O(log m) depth.
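
A Python sketch of Quickhull following the hsplit steps described above. It is an illustration only: Python runs the comprehensions sequentially, points are assumed to be (x, y) tuples, and the function names are chosen to mirror the slides rather than taken from any library.

```python
def cross_product(o, line):
    """Signed area test: positive when o lies to the left of p1 -> p2."""
    (xo, yo) = o
    ((x1, y1), (x2, y2)) = line
    return (x1 - xo) * (y2 - yo) - (y1 - yo) * (x2 - xo)


def hsplit(points, p1, p2):
    """Return hull points from p1 up to (not including) p2."""
    cross = [cross_product(p, (p1, p2)) for p in points]
    # Keep only the points strictly on the outer side of the line.
    packed = [p for p, c in zip(points, cross) if c > 0]
    if len(packed) < 2:
        return [p1] + packed
    # pm: the point farthest from the line p1-p2.
    pm = points[max(range(len(cross)), key=cross.__getitem__)]
    # The two recursive calls are independent (nested parallelism).
    return hsplit(packed, p1, pm) + hsplit(packed, pm, p2)


def quick_hull(points):
    """Return the convex hull, starting from the leftmost point."""
    minx = min(points, key=lambda p: p[0])
    maxx = max(points, key=lambda p: p[0])
    return hsplit(points, minx, maxx) + hsplit(points, maxx, minx)


pts = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
print(quick_hull(pts))
```

Interior points such as (2, 2) and (1, 3) are filtered out at the first or second split, which is why the work can fall well below the worst case.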

28
Summary

They formalize a clear-cut language-based
model for analyzing performance.
The work-and-depth model is defined directly
through a programming language, rather than a
specific machine.
It can be applied to various classes of machines
using mappings that account for the number of
processors and for processing and communication
costs.
NESL allows simple descriptions of parallel
algorithms and makes use of data-parallel
constructs and the ability to nest such
constructs.

29
Summary

NESL hides the CPU/Memory allocation, and
inter-processor communication details by
providing an abstraction of parallelism.
The current NESL implementation is based on an
intermediate language (VCODE) and a library of
low-level vector routines (CVL).
For more information on how the NESL compiler is
implemented, see:
"Implementation of a Portable Nested
Data-Parallel Language", Guy E. Blelloch,
Siddhartha Chatterjee, Jonathan C. Hardwick, Jay
Sipelstein, and Marco Zagha.

30
Discussion

Parallel Processing - Sensor Network Analogy
Local processing -> Aggregation. Work
corresponds to total aggregation cost.
Moving levels up -> Collecting aggregated
results from children nodes.
Depth -> Depth of routing tree in sensor network.
Implies communication cost.
Latency -> Cost to transmit data between motes.
In parallel computation the goal is to reduce
execution time.
Sensor networks aim to reduce power consumption
by minimizing communications. Execution time is
also an issue when real time requirements are
imposed.

31
Discussion

NESL and TAG queries?
Can latency be hidden by assigning multiple tasks
to motes?
Can you perform different operations on an
array's elements in parallel? Is it hard to add
one more parallelism mechanism besides
apply-to-each and parallel functions?
