Programming Parallel Algorithms - NESL
  • Guy E. Blelloch
  • Presented by
  • Michael Sirivianos
  • Barbara Theodorides

Problem Statement
  • Why design a new language specifically for
    programming parallel algorithms?
  • In the past 20 years there has been tremendous
    progress in developing and analyzing parallel
    algorithms
  • Over the same period there has been less success
    in developing good languages for programming
    parallel algorithms
  • There is a large gap between languages that are
    too low level (details that obscure the meaning
    of the algorithm) and languages that are too high
    level (making performance implications unclear)

  • Nested Data Parallel Language
  • Useful for teaching and implementing parallel
    algorithms
  • Bridges the gap: allows high-level descriptions
    of parallel algorithms but also has a
    straightforward mapping onto a performance model.
  • Goals when designing NESL
  • A language-based performance model that uses work
    and depth rather than a machine-based model that
    uses running time
  • Support for nested data-parallel constructs
    (ability to nest parallel calls)

Analyzing performance
  • Processor-based models: performance is calculated
    in terms of the number of instruction cycles a
    computation takes (its running time), as a
    function of input size and number of processors
  • Virtual models: higher-level models that can be
    mapped onto various real machines (e.g. PRAM,
    the Parallel Random Access Machine)
  • A virtual model can be mapped efficiently onto
    more realistic machines by simulating multiple
    processors of the PRAM on a single processor of a
    host machine. Virtual models are easier to
    program.

Measuring performance: Work and Depth
  • Work: the total number of operations executed by
    a computation
  • specifies the running time on a sequential
    processor
  • Depth: the longest chain of sequential
    dependencies in the computation
  • represents the best possible running time
    assuming an ideal machine with an unlimited
    number of processors
  • Example: summing 16 numbers using a balanced
    binary tree takes 15 additions (work 15) arranged
    in 4 levels (depth 4)

How can work and depth be incorporated into a
computational model?
  • Circuit model
  • Designing a circuit of logic gates
  • In the previous example, design a circuit in which
    the inputs are at the top, each + is an adder
    circuit, and each of the lines between adders is
    a bundle of wires.
  • Work: circuit size (number of gates)
  • Depth: longest path from an input to an output

How can work and depth be incorporated into a
computational model? (cont)
  • Vector Machine Models
  • VRAM is a sequential RAM extended with a set of
    instructions that operate on vectors.
  • Each location in memory contains a whole vector
  • Vectors can vary in size during the computation
  • Vector instructions include element-wise
    operations (e.g. adding corresponding elements)
  • Depth: number of instructions executed by the
    machine
  • Work: sum of the lengths of the vectors

How can work and depth be incorporated into a
computational model? (cont)
  • Vector Machine Models: Example
  • Summation tree code
  • Work: $O(n + n/2 + \dots) = O(n)$
  • Depth: $O(\log n)$
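The summation-tree cost above can be illustrated with a small Python sketch. It is not the VRAM code from the slide (which did not survive in this transcript); `vector_sum` is a hypothetical helper that halves the vector at each step, charges work equal to the length of the vector consumed, and counts one unit of depth per halving step.

```python
def vector_sum(values):
    """Return (total, work, depth) for a halving summation.

    Assumed cost model: each halving step is charged the length of the
    vector it consumes (work) and one unit of depth.
    """
    vec = list(values)
    work = 0
    depth = 0
    while len(vec) > 1:
        work += len(vec)      # lengths n, n/2, n/4, ... add up to O(n)
        depth += 1            # one step per halving gives O(log n)
        evens = vec[0::2]
        odds = vec[1::2]
        if len(odds) < len(evens):
            odds.append(0)    # pad so an odd-length vector still pairs up
        vec = [a + b for a, b in zip(evens, odds)]
    total = vec[0] if vec else 0
    return total, work, depth


print(vector_sum(range(1, 17)))
```

For 16 inputs the consumed lengths are 16, 8, 4 and 2, so the work stays linear in n while the depth is log2(16) = 4.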

How can work and depth be incorporated into a
computational model? (cont)
  • Language-Based Models
  • Specify the costs of the primitive instructions
    and a set of rules for composing costs across
    program expressions.
  • Discuss the running time of the algorithms
    without introducing a specific machine model.
  • Using work and depth: work and depth costs are
    assigned to each function and scalar primitive of
    a language, and rules are specified for combining
    parallel and sequential expressions.
  • Roughly speaking, when executing a set of tasks
    in parallel:
  • work = sum of the work of the tasks
  • depth = maximum of the depth of the tasks
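The composition rules above can be written down as a tiny cost algebra. This is a minimal sketch, not NESL's actual cost semantics: the names `Cost`, `seq`, `par` and `tree_sum_cost` are made up for illustration, and each addition is assumed to cost one unit of work and one unit of depth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    work: int
    depth: int


def seq(*costs):
    """Sequential composition: work and depth both add up."""
    return Cost(sum(c.work for c in costs), sum(c.depth for c in costs))


def par(*costs):
    """Parallel composition: work adds up, depth is the maximum."""
    return Cost(sum(c.work for c in costs),
                max((c.depth for c in costs), default=0))


UNIT = Cost(1, 1)  # assumed cost of one scalar addition


def tree_sum_cost(n):
    """Cost of summing n >= 1 numbers with a balanced binary tree."""
    if n == 1:
        return Cost(0, 0)
    left = tree_sum_cost(n // 2)
    right = tree_sum_cost(n - n // 2)
    # Both halves run in parallel, then one addition combines them.
    return seq(par(left, right), UNIT)


print(tree_sum_cost(16))
```

Under these assumptions, summing 16 numbers costs 15 units of work and 4 units of depth.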

Why Work and Depth?
  • Work and depth have been used informally for many
    years to describe the performance of parallel
    algorithms
  • easier to describe
  • easier to think about
  • easier to analyze algorithms in terms of work and
    depth than in terms of running time and number of
    processors (processor-based model)
  • Why are models based on work and depth better than
    processor-based models for programming and
    analyzing parallel algorithms?
  • Performance analysis is closely related to the
    code, and the code provides a clear abstraction of
    the parallelism

Why Work and Depth? (cont)
  • To support this claim the authors consider
    Quicksort.
  • Sequential algorithm
  • Average-case running time O(n log n); depth
    of recursive calls O(log n)
  • Parallel algorithm

Quicksort (cont.)
  • Code and analysis based on a processor-based
    model
  • Code will have to specify how the sequence is
    partitioned across processors
  • how the subselection is implemented in parallel
  • how the recursive calls get partitioned among the
    processors
  • how the subcalls are synchronized
  • In the case of Quicksort, this gets even more
    complicated:
  • The recursive calls are not of equal sizes.
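By contrast, a work and depth analysis can be read straight off the code. The Python sketch below is an illustration, not the code from the slides: it sorts sequentially, but it tracks the cost a nested data-parallel version would have under two stated assumptions, namely that each of the three subselections costs work proportional to the sequence length with constant depth, and that the two recursive calls run in parallel.

```python
def quicksort(seq):
    """Return (sorted_list, work, depth) under the assumed cost model."""
    if len(seq) < 2:
        return list(seq), 1, 1
    pivot = seq[len(seq) // 2]
    # Three subselections; assumed O(len(seq)) work each, O(1) depth overall.
    lesser = [x for x in seq if x < pivot]
    equal = [x for x in seq if x == pivot]
    greater = [x for x in seq if x > pivot]
    # The two recursive calls are independent, so they count as parallel.
    sorted_lesser, work_l, depth_l = quicksort(lesser)
    sorted_greater, work_g, depth_g = quicksort(greater)
    work = 3 * len(seq) + work_l + work_g   # work adds up
    depth = 1 + max(depth_l, depth_g)       # depth takes the maximum
    return sorted_lesser + equal + sorted_greater, work, depth


print(quicksort([3, 1, 2]))
```

Nothing here says how data is spread over processors or how the unequal recursive calls are balanced; those decisions are left to the mapping from the cost model to a machine.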

Work, Depth and running time
  • Running time at the two limits
  • Single processor: RT = work
  • Unlimited number of processors: RT = depth
  • We can place upper and lower bounds for a given
    number of processors P:
  • $W/P \le T < W/P + D$
  • valid under assumptions about communication and
    scheduling costs.
  • e.g. given memory latency L:
  • $W/P \le T < W/P + L \cdot D$
  • Communication cost among processors is not unit
    time, thus D is multiplied by a latency factor.
    Bandwidth is not taken into account. In the case
    of significantly different bandwidth, W should be
    divided by a large B factor and D by a small B
    factor.
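The bounds above can be checked on the 16-number summation tree. The sketch below assumes unit-time operations and a simple level-by-level schedule; `level_schedule_time` and `bounds` are hypothetical helpers written for this illustration, not part of NESL.

```python
import math


def level_schedule_time(level_sizes, processors):
    """Steps needed when each level's independent unit-time operations are
    split over the processors and the levels run one after another."""
    return sum(math.ceil(size / processors) for size in level_sizes)


def bounds(work, depth, processors, latency=1):
    """Lower and upper bound on running time: W/P and W/P + L*D."""
    lower = work / processors
    upper = work / processors + latency * depth
    return lower, upper


levels = [8, 4, 2, 1]            # additions per level when summing 16 numbers
W, D = sum(levels), len(levels)  # work 15, depth 4

for P in (1, 4, 100):
    T = level_schedule_time(levels, P)
    lower, upper = bounds(W, D, P)
    print(P, lower, T, upper)
```

With one processor the schedule takes 15 steps (the work), and with 100 processors it takes 4 steps (the depth), matching the two limits listed above.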

Work, Depth and running time (cont)
  • Communication Bounds
  • Work and depth do not take into account
    communication costs
  • latency: time between making a remote request and
    receiving the reply
  • bandwidth: rate at which a processor can access
    memory
  • Latency can be hidden.
  • Each processor has multiple parallel tasks
    (threads) to execute and therefore has plenty to
    do while waiting for replies
  • Bandwidth cannot be hidden. While a processor is
    waiting for a data transfer to complete it is not
    able to perform other operations, and therefore
    remains idle.

Nested Data-Parallelism and NESL
  • Data-parallelism: the ability to operate in
    parallel over sets of data
  • Data-Parallel Languages or Collection-Oriented
    Languages
  • languages based on data-parallelism. Can be
    either flat or nested
  • Importance of nested parallelism
  • Used to implement nested loops and
    divide-and-conquer algorithms in parallel
  • Existing languages, such as C, do not have direct
    support for such nesting!
  • NESL
  • Is a nested data-parallel language.
  • Designed in order to express nested parallelism
    in a simple way with a minimum set of structures

  • Supports data-parallelism by means of operations
    on sequences
  • Apply-to-each construct, which uses a set-like
    notation
  • e.g. {a * a : a in [3, -4, -9, 5]}
  • Used over multiple sequences: {a + b : a in
    [3, -4, -9, 5]; b in [1, 2, 3, 4]}
  • Ability to subselect elements of a sequence
    based on a filter.
  • e.g. {a * a : a in [3, -4, -9, 5] | a > 0}
  • Any function may be applied to each element of a
    sequence
  • e.g. {factorial(i) : i in [3, 1, 7]}
  • Provides a set of functions on sequences, each
    of which can be implemented in parallel (sum,
    reverse, write)
  • e.g. write([0, 0, 0, 0, 0, 0, 0, 0], ...)
  • Nested parallelism: allows sequences to be nested
    and allows parallel functions to be used in an
    apply-to-each
  • e.g. {sum(a) : a in [[2, 3], [8, 3, 9], [7]]}
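For readers without a NESL interpreter, the constructs above map closely onto Python list comprehensions. The sketch below takes the first two examples to be element-wise squaring and addition, and `write` is a hypothetical stand-in for the NESL function of the same name; the update pairs passed to it are made up, since the second argument is cut off in the slide. Python evaluates these sequentially, so only the meaning is shown, not the parallelism.

```python
from math import factorial

# Apply-to-each over one sequence, and over two sequences in lockstep.
squares = [a * a for a in [3, -4, -9, 5]]
sums = [a + b for a, b in zip([3, -4, -9, 5], [1, 2, 3, 4])]

# Subselection with a filter.
positive_squares = [a * a for a in [3, -4, -9, 5] if a > 0]

# Any function applied to each element.
factorials = [factorial(i) for i in [3, 1, 7]]

# Nested sequences: a parallel function (sum) inside an apply-to-each.
nested_sums = [sum(a) for a in [[2, 3], [8, 3, 9], [7]]]


def write(dest, updates):
    """Return a copy of dest with each (index, value) pair written into it."""
    result = list(dest)
    for index, value in updates:
        result[index] = value
    return result


# Hypothetical update pairs, chosen only to show the effect of write.
written = write([0, 0, 0, 0, 0, 0, 0, 0], [(4, -2), (2, 5), (5, 9)])
print(squares, sums, positive_squares, factorials, nested_sums, written)
```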

The Performance Model
  • Defines work and depth in terms of the work and
    depth of the primitive operations, and rules for
    composing the measures across expressions.
  • In most cases $W(e_1 + e_2) = 1 + W(e_1) + W(e_2)$,
    where the $e_i$ are expressions
  • A similar rule is used for the depth.
  • Rules
  • apply-to-each expression
  • if expression

The Performance Model (cont)
  • Example: Factorial
  • Consider the evaluation of the expression
  • e = {factorial(n) : n in a} where a = [3, 1, 5, ...]
  • function factorial(n) =
  • if (n == 1) then 1
  • else n * factorial(n-1);
  • Using the rules for work and depth
  • where W(==), W(*) and W(-) each have cost 1.
  • The two unit constants come from the cost of the
    function call and the if-then-else statement.
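The cost equations themselves did not survive in this transcript, so the following Python sketch reconstructs a plausible version rather than the slide's exact formula. Assumptions: the comparison, multiplication and subtraction each cost 1; the function call and the if-then-else contribute the two unit constants; and an apply-to-each adds 1 to the summed work and to the maximum depth of its elements. The helper names are hypothetical.

```python
def factorial_cost(n):
    """Return (value, work, depth) for factorial(n), n >= 1."""
    overhead = 2   # assumed: function call + if-then-else
    test = 1       # assumed: the comparison n == 1
    if n == 1:
        return 1, overhead + test, overhead + test
    value, work, depth = factorial_cost(n - 1)
    step = overhead + test + 2   # plus one subtraction and one multiplication
    # The body is sequential, so work and depth grow by the same amount.
    return n * value, work + step, depth + step


def apply_to_each_cost(a):
    """Cost of {factorial(n) : n in a} under the assumed rules."""
    results = [factorial_cost(n) for n in a]
    values = [value for value, _, _ in results]
    work = 1 + sum(w for _, w, _ in results)                 # work: sum
    depth = 1 + max((d for _, _, d in results), default=0)   # depth: max
    return values, work, depth


print(apply_to_each_cost([3, 1, 5]))
```

The point of the example is the shape of the result: the work of the whole expression is the sum over all elements, while its depth is set by the largest element alone.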

Examples of Parallel Algorithms in NESL
  • Principles
  • An important aspect of developing a good parallel
    algorithm is designing one whose work is close to
    the time for a good sequential algorithm that
    solves the same problem.
  • Work-efficient: parallel algorithms are referred
    to as work-efficient relative to a sequential
    algorithm if their work is within a constant
    factor of the time of the sequential algorithm.

Examples of Parallel Algorithms in NESL (cont)
  • Primes
  • Sieve of Eratosthenes
  • 1 procedure PRIMES(n)
  • 2 let A be an array of length n
  • 3 set all but the first element of A to TRUE
  • 4 for i from 2 to sqrt(n)
  • 5 begin
  • 6 if A[i] is TRUE
  • 7 then set all multiples of i up to n to FALSE
  • 8 end
  • Line 7 is implemented by looping over the
    multiples, thus the algorithm takes O(n log log
    n) time.

Examples of Parallel Algorithms in NESL (cont)
  • Primes (parallelized)
  • Parallelize the line "set all multiples of i up
    to n to FALSE"
  • multiples of a value i can be generated in
    parallel by [2*i:n:i]
  • and can be written into the array A in
    parallel with the write function
  • The depth of this algorithm is O(sqrt(n)), since
    each iteration of the loop has constant depth and
    there are sqrt(n) iterations.
  • The number of multiples is the same as the time
    of the sequential version.
  • Since it does the same number of operations,
    the work is the same: O(n log log n).
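A Python sketch of this loop-based version follows. It is an illustration under stated assumptions: `write` is a hypothetical stand-in for NESL's parallel write, `range(2 * i, n, i)` plays the role of the NESL range of multiples, indices 0 and 1 are both treated as non-prime, and depth is counted as one unit per loop iteration.

```python
import math


def write(dest, updates):
    """Return a copy of dest with each (index, value) pair written into it."""
    result = list(dest)
    for index, value in updates:
        result[index] = value
    return result


def primes_loop(n):
    """Return (primes below n, depth), counting one step per loop iteration."""
    flags = write([True] * n, [(0, False), (1, False)][:n])
    depth = 0
    for i in range(2, math.isqrt(n) + 1):
        depth += 1                       # each iteration has constant depth
        if flags[i]:
            multiples = range(2 * i, n, i)
            # All multiples of i are cleared by a single write.
            flags = write(flags, [(m, False) for m in multiples])
    return [i for i, flag in enumerate(flags) if flag], depth


print(primes_loop(30))
```

The loop itself stays sequential, which is why the depth grows with sqrt(n) even though each write could run in parallel.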

Examples of Parallel Algorithms in NESL (cont)
  • Primes: Improving depth
  • If we are given all the primes from 2 up to
    sqrt(n), we could then generate
  • all the multiples of these primes at once:
    {[2*p:n:p] : p in sqr_primes}
  • function primes(n) =
  • if n == 2 then ([] int)
  • else
  • let sqr_primes = primes(isqrt(n));
  • composites = {[2*p:n:p] : p in sqr_primes};
  • flat_comps = flatten(composites);
  • flags = write(dist(true, n), {(i,false) :
    i in flat_comps});
  • indices = {i in [0:n]; fl in flags | fl}
  • in drop(indices, 2);
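The recursive algorithm can be mirrored in Python, keeping the variable names from the slide. Two details are choices made here, not taken from the slides: the base case is `n <= 2`, and the recursive call uses the bound `isqrt(n - 1) + 1` so that the Python version always terminates and still flags perfect squares. The loop over `flat_comps` stands in for the single parallel write.

```python
import math


def primes(n):
    """Return all primes below n, following the recursive sieve above."""
    if n <= 2:
        return []
    # Primes up to sqrt(n - 1) are enough to flag every composite below n.
    sqr_primes = primes(math.isqrt(n - 1) + 1)
    composites = [range(2 * p, n, p) for p in sqr_primes]  # nested sequence
    flat_comps = [c for multiples in composites for c in multiples]
    flags = [True] * n                  # dist(true, n)
    for i in flat_comps:                # stands in for the parallel write
        flags[i] = False
    indices = [i for i, flag in enumerate(flags) if flag]
    return indices[2:]                  # drop 0 and 1


print(primes(30))
```

Each level does a constant number of sequence operations, and the problem size shrinks from n to roughly sqrt(n) at every level.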

Examples of Parallel Algorithms in NESL (cont)
  • Primes: Improving depth
  • Analysis of Work and Depth
  • Work: clearly most of the work is done at the top
    level of recursion, which does O(n log log n)
    work, and therefore the total work is
  • O(n log log n)
  • Depth: since each recursion level has constant
    depth, the total depth is proportional to the
    number of levels. The number of levels is log log
    n (the size of the problem at the d-th level is
    $n^{1/2^d}$, so $d = \log \log n$) and therefore
    the depth is O(log log n)
  • This algorithm remains work-efficient and greatly
    improves the depth.

Examples of Parallel Algorithms in NESL (cont)
  • Sparse Matrix Multiplication
  • Sparse matrices: most elements are zero
  • Representation in NESL: each row is a sequence of
    (column index, value) pairs for its nonzero
    elements
  •      2.0 -1.0    0    0
    A = -1.0  2.0 -1.0    0
           0 -1.0  2.0 -1.0
           0    0 -1.0  2.0
  • A = [[(0, 2.0), (1, -1.0)],
         [(0, -1.0), (1, 2.0), (2, -1.0)],
         [(1, -1.0), (2, 2.0), (3, -1.0)],
         [(2, -1.0), (3, 2.0)]]
  • E.g. multiply a sparse matrix A with a dense
    vector x.
  • The product Ax in NESL is {sum({v * x[i] :
    (i,v) in row}) : row in A}
  • Let n be the number of nonzero elements in the
    row, then
  • depth of the computation = the depth of the
    sum = O(log n)
  • work = sum of the work across the elements
    = O(n)
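The same nested expression can be tried out in Python, using the matrix from the slide stored as rows of (column index, value) pairs. The vector `x` is an arbitrary example chosen here, and `sparse_matvec` is a hypothetical name for the one-line product.

```python
A = [
    [(0, 2.0), (1, -1.0)],
    [(0, -1.0), (1, 2.0), (2, -1.0)],
    [(1, -1.0), (2, 2.0), (3, -1.0)],
    [(2, -1.0), (3, 2.0)],
]


def sparse_matvec(matrix, x):
    """Multiply a sparse matrix (rows of (column, value) pairs) by a dense
    vector: the inner sum is one row's dot product, the outer
    comprehension is the apply-to-each over rows."""
    return [sum(v * x[i] for i, v in row) for row in matrix]


x = [1.0, 2.0, 3.0, 4.0]  # example input, not from the slides
print(sparse_matvec(A, x))
```

Only the stored nonzero entries are touched, which is why the work depends on the number of nonzeros rather than on the full matrix size.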

Examples of Parallel Algorithms in NESL (cont)
  • Planar Convex Hull
  • Problem: given n points in the plane, find which
    of them lie on the perimeter of the smallest
    convex region that contains all points.
  • An example of nested parallelism for
    divide-and-conquer algorithms.
  • Quickhull algorithm (similar to Quicksort)
  • The strategy is to pick a pivot element, split
    the data based on the pivot, and recurse on each
    of the split sets.
  • Worst-case work is O(n^2) and the worst-case
    depth is O(n).

Examples of Parallel Algorithms in NESL (cont)
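  • The hull is hsplit(set,A,P) followed by
    hsplit(set,P,A), one call for each side of the
    line A-P
  • cross product (p, (A,P)) measures how far a point
    p lies from the line A-P
  • pm: the point farthest from the line A-P
  • Recursively: hsplit(set,A,pm) and
    hsplit(set,pm,P)
  • Ignores elements below the line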

Examples of Parallel Algorithms in NESL (cont)
  • Performance analysis of Quickhull
  • Each recursive call has constant depth and O(n)
    work
  • However, since many points might be deleted on
    each step, the work could be significantly less.
  • As in Quicksort, the worst-case work is O(n^2)
    and the worst-case depth is O(n).
  • For m hull points the best-case costs are O(n)
    work and O(log m) depth.
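A compact Python sketch of Quickhull, following the hsplit structure described above. The sign convention of `cross_product`, the choice of the lexicographically smallest and largest points as A and P, and the assumption of at least two distinct input points are choices made for this illustration.

```python
def cross_product(p, line):
    """Positive when p lies to the left of the directed line a -> b;
    the magnitude grows with the distance of p from the line."""
    (ax, ay), (bx, by) = line
    px, py = p
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def hsplit(points, p1, p2):
    """Hull points from p1 up to (not including) p2 on the left of p1 -> p2."""
    # Subselection: keep only the points strictly to the left of the line.
    packed = [p for p in points if cross_product(p, (p1, p2)) > 0]
    if len(packed) < 2:
        return [p1] + packed
    pm = max(packed, key=lambda p: cross_product(p, (p1, p2)))
    # The two recursive calls are independent (nested parallelism in NESL).
    return hsplit(packed, p1, pm) + hsplit(packed, pm, p2)


def quickhull(points):
    """Return the hull points, starting at the leftmost point."""
    a = min(points)   # smallest x (ties broken by y)
    p = max(points)   # largest x
    return hsplit(points, a, p) + hsplit(points, p, a)


pts = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3), (2, 5)]
print(quickhull(pts))
```

Interior points such as (2, 2) and (1, 3) are discarded by the subselection at an early level, which is the effect behind the remark that the work can be much less than the worst case.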

  • The authors formalize a clear-cut language-based
    model for analyzing performance
  • The work and depth based model is defined
    directly through a programming language, rather
    than a specific machine
  • It can be applied to various classes of machines
    using mappings that account for the number of
    processors, processing and communication costs.
  • NESL allows simple description of parallel
    algorithms and makes use of data-parallel
    constructs and the ability to nest such
    constructs

  • NESL hides the CPU/Memory allocation, and
    inter-processor communication details by
    providing an abstraction of parallelism.
  • The current NESL implementation is based on an
    intermediate language (VCODE) and a library of
    low-level vector routines (CVL)
  • For more information on how the NESL compiler is
    implemented, see:
  • Implementation of a Portable Nested
    Data-Parallel Language. Guy E. Blelloch,
    Siddhartha Chatterjee, Jonathan C. Hardwick, Jay
    Sipelstein, and Marco Zagha.

  • Parallel Processing - Sensor Network Analogy
  • Local processing -> Aggregation. Work
    corresponds to total aggregation cost.
  • Moving levels up -> Collecting aggregated
    results from children nodes.
  • Depth -> Depth of routing tree in sensor network.
    Implies communication cost.
  • Latency -> Cost to transmit data between motes.
  • In parallel computation the goal is to reduce
    execution time.
  • Sensor networks aim to reduce power consumption
    by minimizing communications. Execution time is
    also an issue when real time requirements are

  • NESL and TAG queries?
  • Can latency be hidden by assigning multiple tasks
    to motes?
  • Can you perform different operations on an
    array's elements in parallel? Is it hard to add
    one more parallelism mechanism besides
    apply-to-each and parallel functions?