Programming Parallel Algorithms - NESL

- Guy E. Blelloch
- Presented by
- Michael Sirivianos
- Barbara Theodorides

Problem Statement

- Why design a new language specifically for programming parallel algorithms?
- In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms.
- Over the same period there has been less success in developing good languages for programming parallel algorithms.
- There is a large gap between languages that are too low level (details that obscure the meaning of the algorithm) and languages that are too high level (making performance implications unclear).

NESL

- Nested Data Parallel Language
- Useful for teaching and implementing parallel algorithms.
- Bridges the gap: allows high-level descriptions of parallel algorithms but also has a straightforward mapping onto a performance model.
- Goals when designing NESL:
- A language-based performance model that uses work and depth rather than a machine-based model that uses running time
- Support for nested data-parallel constructs (ability to nest parallel calls)

Analyzing performance

- Processor-based models: performance is calculated in terms of the number of instruction cycles a computation takes (its running time), as a function of input size and number of processors.
- Virtual models: higher-level models that can be mapped onto various real machines (e.g. PRAM, the Parallel Random Access Machine).
- Can be mapped efficiently onto more realistic machines by simulating multiple processors of the PRAM on a single processor of a host machine. Virtual models are easier to program.

Measuring Performance: Work and Depth

- Work: the total number of operations executed by a computation
- specifies the running time on a sequential processor
- Depth: the longest chain of sequential dependencies in the computation
- represents the best possible running time assuming an ideal machine with an unlimited number of processors
- Example: summing 16 numbers using a balanced binary tree
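The 16-number example can be checked with a short Python sketch. The function name `tree_sum` and the cost accounting (one unit of work per addition, one unit of depth per tree level) are assumptions made for illustration; the input is assumed to be non-empty.

```python
def tree_sum(values):
    """Sum values pairwise, level by level, as in a balanced binary tree.

    Returns (total, work, depth): work counts additions, depth counts levels.
    Assumes at least one input value.
    """
    level = list(values)
    work = 0
    depth = 0
    while len(level) > 1:
        # All additions on one level are independent, so they could run in parallel.
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        work += len(pairs)
        if len(level) % 2 == 1:
            # An unpaired last element is carried up to the next level unchanged.
            pairs.append(level[-1])
        level = pairs
        depth += 1
    return level[0], work, depth


total, work, depth = tree_sum(range(1, 17))
print(total, work, depth)  # 136 15 4
```

For 16 inputs this gives 15 additions of work but only 4 levels of depth, which is the gap between sequential time and ideal parallel time that the two measures are meant to expose.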

How can work and depth be incorporated into a computational model?

- Circuit model
- Designing a circuit of logic gates
- In the previous example, design a circuit in which the inputs are at the top, each + node is an adder circuit, and each of the lines between adders is a bundle of wires.
- Work: circuit size (number of gates)
- Depth: longest path from an input to an output

How can work and depth be incorporated into a computational model? (cont)

- Vector Machine Models
- VRAM is a sequential RAM extended with a set of instructions that operate on vectors.
- Each location in memory contains a whole vector
- Vectors can vary in size during the computation
- Vector instructions include element-wise operations (adding corresponding elements)
- Depth: number of instructions executed by the machine
- Work: sum of the lengths of the vectors

How can work and depth be incorporated into a computational model? (cont)

- Vector Machine Models: Example
- Summation tree code
- Work: $O(n + n/2 + \dots) = O(n)$
- Depth: $O(\log n)$
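The summation tree code itself did not survive on this slide. The following Python sketch shows one way a vector-style summation could look; it is not the original code. As a simplifying assumption, each round is charged as a single vector instruction whose cost is the length of its input vector, and the input length is assumed to be a power of two.

```python
def vector_sum(v):
    """Sum a vector by repeatedly adding its even and odd elements.

    Returns (total, work, depth). Assumes len(v) is a power of two.
    """
    v = list(v)
    work = 0
    depth = 0
    while len(v) > 1:
        evens, odds = v[0::2], v[1::2]
        # Charged as one vector instruction over a vector of length len(v).
        work += len(v)
        depth += 1
        # Element-wise add of two half-length vectors.
        v = [a + b for a, b in zip(evens, odds)]
    return v[0], work, depth


print(vector_sum(range(1, 17)))  # (136, 30, 4)
```

The work is 16 + 8 + 4 + 2 = 30, a geometric series bounded by 2n, and the depth is the number of rounds, log2(16) = 4.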

How can work and depth be incorporated into a computational model? (cont)

- Language-Based Models
- Specify the costs of the primitive instructions and a set of rules for composing costs across program expressions.
- Discuss the running time of the algorithms without introducing a specific machine model.
- Using work and depth: work and depth costs are assigned to each function and scalar primitive of a language, and rules are specified for combining parallel and sequential expressions.
- Roughly speaking, when executing a set of tasks in parallel:
- work = sum of the work of the tasks
- depth = maximum of the depth of the tasks
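The composition rule for parallel tasks can be written down directly. In this Python sketch the `Cost` type and the `parallel` and `sequential` helpers are hypothetical names; the sequential rule (both work and depth add up) is the usual convention and is an assumption here, since the slide only spells out the parallel case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    work: int
    depth: int


def parallel(*costs):
    """Tasks run side by side: work adds up, depth is the longest task."""
    return Cost(sum(c.work for c in costs), max(c.depth for c in costs))


def sequential(*costs):
    """Tasks run one after another (assumed rule): work and depth both add up."""
    return Cost(sum(c.work for c in costs), sum(c.depth for c in costs))


tasks = [Cost(10, 3), Cost(4, 4), Cost(7, 2)]
print(parallel(*tasks))    # Cost(work=21, depth=4)
print(sequential(*tasks))  # Cost(work=21, depth=9)
```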

Why Work and Depth?

- Work and depth have been used informally for many years to describe the performance of parallel algorithms
- easier to describe
- easier to think about
- easier to analyze algorithms in terms of work and depth than in terms of running time and number of processors (processor-based model)
- Why are models based on work and depth better than processor-based models for programming and analyzing parallel algorithms?
- Performance analysis is closely related to the code, and the code provides a clear abstraction of parallelism.

Why Work and Depth? (cont)

- To support this claim they consider Quicksort.
- Sequential algorithm
- Average-case run time $O(n \log n)$, depth of recursive calls $O(\log n)$
- Parallel algorithm
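The parallel Quicksort code is not preserved on this slide. The Python sketch below is an illustration rather than the original: it runs sequentially but tracks the work and depth the algorithm would have if the two recursive calls ran in parallel. Charging each subselection step n work and constant depth, and picking the middle element as pivot, are assumptions.

```python
def quicksort(s):
    """Return (sorted list, work, depth) for Quicksort with parallel subcalls."""
    if len(s) <= 1:
        return list(s), 1, 1
    pivot = s[len(s) // 2]
    # Each subselection touches every element once; assumed constant depth.
    lesser = [e for e in s if e < pivot]
    equal = [e for e in s if e == pivot]
    greater = [e for e in s if e > pivot]
    left, w1, d1 = quicksort(lesser)
    right, w2, d2 = quicksort(greater)
    work = len(s) + w1 + w2      # work of parallel subcalls adds up
    depth = 1 + max(d1, d2)      # depth is set by the deeper subcall
    return left + equal + right, work, depth


result, work, depth = quicksort([5, 3, 8, 1, 9, 2, 7, 3])
print(result)  # [1, 2, 3, 3, 5, 7, 8, 9]
```

Note that nothing in the code says how elements or subcalls are assigned to processors; the cost analysis follows the structure of the recursion alone.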

Quicksort (cont.)

- Code and analysis based on a processor-based model:
- The code will have to specify how the sequence is partitioned across processors
- how the subselection is implemented in parallel
- how the recursive calls get partitioned among the processors
- how the subcalls are synchronized
- In the case of Quicksort, this gets even more complicated: the recursive calls are not of equal sizes.

Work and Depth and Running Time

- Running time (RT) at the two limits:
- Single processor: RT = work
- Unlimited number of processors: RT = depth
- We can place upper and lower bounds on the running time T for a given number of processors P, with work W and depth D:
- $W/P \le T < W/P + D$
- valid under assumptions about communication and scheduling costs
- e.g. given memory latency L:
- $W/P \le T < W/P + L \cdot D$
- Communication cost among processors is not unit time, thus D is multiplied by a latency factor. Bandwidth is not taken into account. If bandwidth differs significantly, W should be divided by a large bandwidth factor B and D by a small one.
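As a worked example of these bounds, the Python sketch below evaluates W/P and W/P + L·D for given values. The function name and the sample numbers (the 16-number summation tree on 4 processors) are chosen for illustration only.

```python
def running_time_bounds(work, depth, processors, latency=1):
    """Return (lower, upper) bounds on running time: W/P and W/P + L*D."""
    lower = work / processors
    upper = work / processors + latency * depth
    return lower, upper


# Summation tree over 16 numbers: W = 15, D = 4, on P = 4 processors.
print(running_time_bounds(15, 4, 4))             # (3.75, 7.75)
print(running_time_bounds(15, 4, 4, latency=2))  # (3.75, 11.75)
```

With a single processor and unit latency the lower bound equals the work, and as P grows the upper bound approaches L·D, matching the two limits listed above.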

Work and Depth and Running Time (cont)

- Communication bounds
- Work and depth do not take communication costs into account:
- latency: time between making a remote request and receiving the reply
- bandwidth: rate at which a processor can access memory
- Latency can be hidden.
- Each processor has multiple parallel tasks (threads) to execute and therefore has plenty to do while waiting for replies.
- Bandwidth cannot be hidden. While a processor is waiting for a data transfer to complete it is not able to perform other operations, and therefore remains idle.

Nested Data-Parallelism and NESL

- Data-parallelism: the ability to operate in parallel over sets of data
- Data-parallel languages or collection-oriented languages
- languages based on data-parallelism; can be either flat or nested
- Importance of nested parallelism
- Used to implement nested loops and divide-and-conquer algorithms in parallel
- Existing languages, such as C, do not have direct support for such nesting!
- NESL
- Is a nested data-parallel language.
- Designed to express nested parallelism in a simple way with a minimum set of structures

NESL

- Supports data-parallelism by means of operations on sequences
- Apply-to-each construct, which uses a set-like notation
- e.g. {a * a : a in [3, -4, -9, 5]}
- Can be used over multiple sequences: {a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]}
- Ability to subselect elements of a sequence based on a filter
- e.g. {a * a : a in [3, -4, -9, 5] | a > 0}
- Any function may be applied to each element of a sequence
- e.g. {factorial(i) : i in [3, 1, 7]}
- Provides a set of functions on sequences, each of which can be implemented in parallel (sum, reverse, write)
- e.g. write([0, 0, 0, 0, 0, 0, 0, 0], [(4,-2),(2,5),(5,9)])
- Nested parallelism: allows sequences to be nested and allows parallel functions to be used in an apply-to-each
- e.g. {sum(a) : a in [[2,3], [8,3,9], [7]]}
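For readers without a NESL interpreter, the examples on this slide map closely onto Python list comprehensions. This is only an analogy: Python evaluates the comprehensions sequentially, the `write` helper is a hypothetical stand-in for NESL's `write`, and reading the first two examples as `a * a` and `a + b` is an assumption, since the operators are missing from the slide text.

```python
from math import factorial


def write(dest, updates):
    """Return a copy of dest with each (index, value) pair written into it."""
    result = list(dest)
    for index, value in updates:
        result[index] = value
    return result


# Apply-to-each over one sequence and over two sequences of equal length.
squares = [a * a for a in [3, -4, -9, 5]]
sums = [a + b for a, b in zip([3, -4, -9, 5], [1, 2, 3, 4])]

# Subselection with a filter.
positive_squares = [a * a for a in [3, -4, -9, 5] if a > 0]

# Any function applied to each element.
factorials = [factorial(i) for i in [3, 1, 7]]

# Sequence function and nested parallelism.
written = write([0] * 8, [(4, -2), (2, 5), (5, 9)])
nested_sums = [sum(a) for a in [[2, 3], [8, 3, 9], [7]]]

print(squares)           # [9, 16, 81, 25]
print(sums)              # [4, -2, -6, 9]
print(positive_squares)  # [9, 25]
print(factorials)        # [6, 1, 5040]
print(written)           # [0, 0, 5, 0, -2, 9, 0, 0]
print(nested_sums)       # [5, 20, 7]
```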

The Performance Model

- Defines work and depth in terms of the work and depth of the primitive operations, and rules for composing the measures across expressions.
- In most cases $W(e_1 + e_2) = 1 + W(e_1) + W(e_2)$, where the $e_i$ are expressions.
- A similar rule is used for the depth.
- Rules:
- apply-to-each expression
- if expression
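The formulas for the two rules are not preserved on this slide. A plausible form, consistent with the sum/max rule for parallel tasks and the unit overhead in the rule for $e_1 + e_2$, is sketched below; the exact constants are an assumption.

```latex
\begin{align*}
W(\{e_1(a) : a \text{ in } e_2\}) &= 1 + W(e_2) + \sum_{a \in e_2} W(e_1(a)) \\
D(\{e_1(a) : a \text{ in } e_2\}) &= 1 + D(e_2) + \max_{a \in e_2} D(e_1(a)) \\
W(\text{if } e_1 \text{ then } e_2 \text{ else } e_3) &=
  \begin{cases}
    1 + W(e_1) + W(e_2) & \text{if } e_1 \text{ evaluates to true} \\
    1 + W(e_1) + W(e_3) & \text{otherwise}
  \end{cases}
\end{align*}
```

The depth rule for the if expression would have the same shape with $D$ in place of $W$, since the condition and the chosen branch are evaluated one after the other.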

The Performance Model (cont)

- Example: Factorial
- Consider the evaluation of the expression
- e = {factorial(n) : n in a} where a = [3, 1, 5, 2].
- function factorial(n) =
- if (n == 1) then 1
- else n * factorial(n-1)
- Using the rules for work and depth
- where $W_{==}$, $W_{*}$, $W_{-}$ have cost 1.
- The two unit constants come from the cost of the function call and the if-then-else statement.
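The resulting numbers can be reproduced with a small Python sketch. It charges 2 per call (function call plus if-then-else) and 1 for each of the equality test, multiplication and subtraction, as stated above. Ignoring the cost of building the sequence a, and adding 1 for the apply-to-each itself, are assumptions; the helper names are hypothetical.

```python
def factorial_cost(n):
    """Return (work, depth) of factorial(n) with unit-cost primitives, n >= 1."""
    cost = 2 + 1  # function call and if-then-else, plus the equality test
    if n == 1:
        return cost, cost
    sub_work, sub_depth = factorial_cost(n - 1)
    cost += 1 + 1  # one multiplication and one subtraction
    # The body is purely sequential, so depth accumulates exactly like work.
    return cost + sub_work, cost + sub_depth


def apply_to_each_cost(costs):
    """Combine per-element (work, depth) pairs: sum the work, max the depth."""
    work = 1 + sum(w for w, _ in costs)
    depth = 1 + max(d for _, d in costs)
    return work, depth


a = [3, 1, 5, 2]
print([factorial_cost(n) for n in a])                      # [(13, 13), (3, 3), (23, 23), (8, 8)]
print(apply_to_each_cost([factorial_cost(n) for n in a]))  # (48, 24)
```

Under these assumptions a single call costs 5n - 2 in both work and depth, so the apply-to-each has work dominated by the sum over all elements and depth dominated by the largest element, 5.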

Examples of Parallel Algorithms in NESL

- Principles
- An important aspect of developing a good parallel algorithm is designing one whose work is close to the time for a good sequential algorithm that solves the same problem.
- Work-efficient: parallel algorithms are referred to as work-efficient relative to a sequential algorithm if their work is within a constant factor of the time of the sequential algorithm.

Examples of Parallel Algorithms in NESL (cont)

- Primes
- Sieve of Eratosthenes
- 1 procedure PRIMES(n)
- 2 let A be an array of length n
- 3 set all but the first element of A to TRUE
- 4 for i from 2 to sqrt(n)
- 5 begin
- 6 if A[i] is TRUE
- 7 then set all multiples of i up to n to FALSE
- 8 end
- Line 7 is implemented by looping over the multiples, thus the algorithm takes $O(n \log \log n)$ time.
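The pseudocode above translates almost line for line into Python. In this sketch the array is 0-indexed and the function returns the primes strictly below n; both choices, and the name `primes_sieve`, are assumptions about how to read the pseudocode.

```python
from math import isqrt


def primes_sieve(n):
    """Return the primes below n using the Sieve of Eratosthenes."""
    if n < 2:
        return []
    flags = [True] * n
    flags[0] = flags[1] = False  # 0 and 1 are not prime
    for i in range(2, isqrt(n) + 1):
        if flags[i]:
            # Line 7 of the pseudocode: loop over the multiples of i.
            for multiple in range(2 * i, n, i):
                flags[multiple] = False
    return [i for i, is_prime in enumerate(flags) if is_prime]


print(primes_sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```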

Examples of Parallel Algorithms in NESL (cont)

- Primes (parallelized)
- Parallelize the line "set all multiples of i up to n to FALSE"
- multiples of a value i can be generated in parallel by [2*i:n:i]
- and can be written into the array A in parallel with the write function
- The depth of this algorithm is $O(\sqrt{n})$, since each iteration of the loop has constant depth and there are $\sqrt{n}$ iterations.
- The number of multiples is the same as the time of the sequential version.
- Since it does the same number of operations, the work is the same: $O(n \log \log n)$.

Examples of Parallel Algorithms in NESL (cont)

- Primes: improving depth
- If we are given all the primes from 2 up to sqrt(n), we could then generate all the multiples of these primes at once: {[2*p:n:p] : p in sqr_primes}
- function primes(n) =
- if n == 2 then ([] int)
- else
- let sqr_primes = primes(isqrt(n));
- composites = {[2*p:n:p] : p in sqr_primes};
- flat_comps = flatten(composites);
- flags = write(dist(true, n), {(i,false) : i in flat_comps});
- indices = {i in [0:n]; fl in flags | fl}
- in drop(indices, 2);
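A Python rendering of the recursive algorithm is sketched below, keeping the NESL variable names. It runs sequentially and only mirrors the structure. Two details are assumptions made so that the sketch terminates and is correct for every n: the base case covers all n <= 2, and the recursion asks for the primes below isqrt(n - 1) + 1 rather than isqrt(n).

```python
from math import isqrt


def primes(n):
    """Return the primes below n, recursing on a problem of size about sqrt(n)."""
    if n <= 2:
        return []
    # Primes up to sqrt(n - 1) suffice to cross out every composite below n.
    sqr_primes = primes(isqrt(n - 1) + 1)
    # One sequence of multiples per prime: the nested, parallel step.
    composites = [list(range(2 * p, n, p)) for p in sqr_primes]
    flat_comps = [c for comp in composites for c in comp]
    # Equivalent of write(dist(true, n), ...): clear the flag of each composite.
    flags = [True] * n
    for i in flat_comps:
        flags[i] = False
    indices = [i for i, fl in enumerate(flags) if fl]
    return indices[2:]  # drop 0 and 1


print(primes(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```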

Examples of Parallel Algorithms in NESL (cont)

- Primes: improving depth
- Analysis of work and depth
- Work: clearly most of the work is done at the top level of recursion, which does $O(n \log \log n)$ work, and therefore the total work is $O(n \log \log n)$.
- Depth: since each recursion level has constant depth, the total depth is proportional to the number of levels. The size of the problem at the $d$-th level is $n^{1/2^d}$; this reaches a constant when $d = \log \log n$, so the number of levels is $\log \log n$ and the depth is $O(\log \log n)$.
- This algorithm remains work-efficient and greatly improves the depth.

Examples of Parallel Algorithms in NESL (cont)

- Sparse Matrix Multiplication
- Sparse matrices: most elements are zero
- Representation in NESL: each row of A is stored as a sequence of (column, value) pairs for its nonzero elements
- Row 0: 2.0 -1.0 0 0 is stored as [(0, 2.0), (1, -1.0)]
- Row 1: -1.0 2.0 -1.0 0 is stored as [(0, -1.0), (1, 2.0), (2, -1.0)]
- Row 2: 0 -1.0 2.0 -1.0 is stored as [(1, -1.0), (2, 2.0), (3, -1.0)]
- Row 3: 0 0 -1.0 2.0 is stored as [(2, -1.0), (3, 2.0)]
- E.g. multiply a sparse matrix A with a dense vector x.
- The product Ax in NESL is {sum({v * x[i] : (i,v) in row}) : row in A}
- Let n be the number of nonzero elements in the row; then
- depth of the computation = the depth of the sum = $O(\log n)$
- work = sum of the work across the elements = $O(n)$
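The same product can be tried out in Python using the matrix from this slide. The vector `x` is an arbitrary example value, and `sparse_matvec` is a hypothetical name; the nested comprehension mirrors the nested apply-to-each.

```python
# Each row lists its nonzero elements as (column, value) pairs.
A = [
    [(0, 2.0), (1, -1.0)],
    [(0, -1.0), (1, 2.0), (2, -1.0)],
    [(1, -1.0), (2, 2.0), (3, -1.0)],
    [(2, -1.0), (3, 2.0)],
]


def sparse_matvec(matrix, x):
    """Multiply a sparse matrix (list of rows of (column, value)) by a dense vector."""
    # Outer comprehension: one independent task per row.
    # Inner sum: a reduction over that row's nonzero elements.
    return [sum(v * x[i] for i, v in row) for row in matrix]


x = [1.0, 2.0, 3.0, 4.0]
print(sparse_matvec(A, x))  # [0.0, 0.0, 0.0, 5.0]
```

Only the nonzero entries are ever touched, so the work is proportional to their number rather than to the full size of the matrix.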

Examples of Parallel Algorithms in NESL (cont)

- Planar Convex Hull
- Problem: given n points in the plane, find which of them lie on the perimeter of the smallest convex region that contains all points.
- An example of nested parallelism for divide-and-conquer algorithms.
- Quickhull algorithm (similar to Quicksort)
- The strategy is to pick a pivot element, split the data based on the pivot, and recurse on each of the split sets.
- Worst-case work is $O(n^2)$ and worst-case depth is $O(n)$.

Examples of Parallel Algorithms in NESL (cont)

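- Start with hsplit(set,A,P) and hsplit(set,P,A), one call for each side of the line A-P
- The cross product of a point p with the line (A,P) measures how far p is from the line
- pm: the point farthest from the line A-P
- Recursively call hsplit(set,A,pm) and hsplit(set,pm,P)
- Elements below the line are ignored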
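The steps on this slide can be sketched in Python as follows. This is an illustration and not the original NESL code: the helper names, the tuple representation of points, and the choice of the leftmost and rightmost points as A and P are assumptions, and points lying exactly on a splitting line are dropped.

```python
def cross_product(o, line):
    """Signed area test: positive when point o lies on the kept side of line."""
    (xo, yo) = o
    ((x1, y1), (x2, y2)) = line
    return (x1 - xo) * (y2 - yo) - (y1 - yo) * (x2 - xo)


def hsplit(points, p1, p2):
    """Return the hull points from p1 up to (not including) p2 on one side."""
    cross = [cross_product(p, (p1, p2)) for p in points]
    # Keep only the points strictly on one side of the line p1-p2.
    packed = [p for p, c in zip(points, cross) if c > 0]
    if len(packed) < 2:
        return [p1] + packed
    # pm is the point farthest from the line; it must be on the hull.
    pm = points[cross.index(max(cross))]
    # The two recursive calls are independent and could run in parallel.
    return hsplit(packed, p1, pm) + hsplit(packed, pm, p2)


def quickhull(points):
    """Return the convex hull of a list of (x, y) tuples."""
    minx = min(points)  # leftmost point, plays the role of A
    maxx = max(points)  # rightmost point, plays the role of P
    return hsplit(points, minx, maxx) + hsplit(points, maxx, minx)


pts = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 1), (3, 2)]
print(quickhull(pts))  # [(0, 0), (0, 4), (4, 4), (4, 0)]
```

Each call discards every point on the wrong side of its line before recursing, which is why the work can shrink quickly when many points lie in the interior.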

Examples of Parallel Algorithms in NESL (cont)

- Performance analysis of Quickhull
- Each recursive call has constant depth and $O(n)$ work.
- However, since many points might be deleted on each step, the work could be significantly less.
- As in Quicksort, the worst-case work is $O(n^2)$ and the worst-case depth is $O(n)$.
- For m hull points the best-case costs are $O(n)$ work and $O(\log m)$ depth.

Summary

- They formalize a clear-cut language-based model for analyzing performance.
- The work-and-depth-based model is defined directly through a programming language, rather than a specific machine.
- It can be applied to various classes of machines using mappings that account for the number of processors and for processing and communication costs.
- NESL allows simple descriptions of parallel algorithms and makes use of data-parallel constructs and the ability to nest such constructs.

Summary

- NESL hides the CPU/memory allocation and inter-processor communication details by providing an abstraction of parallelism.
- The current NESL implementation is based on an intermediate language (VCODE) and a library of low-level vector routines (CVL).
- For more information on how the NESL compiler is implemented:
- "Implementation of a Portable Nested Data-Parallel Language", Guy E. Blelloch, Siddhartha Chatterjee, Jonathan C. Hardwick, Jay Sipelstein, and Marco Zagha.

Discussion

- Parallel Processing - Sensor Network Analogy
- Local processing -> Aggregation. Work corresponds to total aggregation cost.
- Moving levels up -> Collecting aggregated results from children nodes.
- Depth -> Depth of routing tree in sensor network. Implies communication cost.
- Latency -> Cost to transmit data between motes.
- In parallel computation the goal is to reduce execution time.
- Sensor networks aim to reduce power consumption by minimizing communications. Execution time is also an issue when real-time requirements are imposed.

Discussion

- NESL and TAG queries?
- Can latency be hidden by assigning multiple tasks to motes?
- Can you perform different operations on an array's elements in parallel? Is it hard to add one more parallelism mechanism besides apply-to-each and parallel functions?