Tricks with Trees - PowerPoint PPT Presentation

1
Tricks with Trees
  • From slides by Jim Demmel, Kathy Yelick, Alan
    Edelman, and a cast of thousands

2
Parallel Vector Operations
  • Some common vector operations for vectors x, y,
    z
  • Vector add: $z = x + y$
  • Trivial to parallelize if vectors are aligned
  • AXPY: $z = a x + y$ (here a is a scalar)
  • Broadcast a, followed by independent * and +
  • Dot product: $s = x^T y = \sum_j x_j y_j$
  • Independent * followed by + reduction
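
A minimal serial sketch of the three operations above, in Python. The function names are illustrative and not from the slides. Each list comprehension is the independent elementwise part that parallelizes trivially; the final sum in dot is the reduction.

```python
def vector_add(x, y):
    # z = x + y: every element is independent of the others
    return [xi + yi for xi, yi in zip(x, y)]


def axpy(a, x, y):
    # z = a*x + y: the scalar a is "broadcast" to every element,
    # then one multiply and one add per element, all independent
    return [a * xi + yi for xi, yi in zip(x, y)]


def dot(x, y):
    # independent multiplies ...
    products = [xi * yi for xi, yi in zip(x, y)]
    # ... followed by a + reduction (done serially here)
    return sum(products)
```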

3
Broadcast and reduction
  • Broadcast of 1 value to p processors in log p
    time
  • Reduction of p values to 1 in log p time
  • Takes advantage of associativity in +, *, min,
    max, etc.

[Figure: broadcast of a value a to all processors; add-reduction of 1 3 1 0 4 -6 3 2 gives 8]
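
The log p bound for reduction can be illustrated with a small Python sketch that combines values pairwise, one tree level per round. The name tree_reduce is hypothetical, and the rounds are simulated serially; on a real machine all combinations within a round would run at once. A non-empty input is assumed.

```python
from operator import add


def tree_reduce(values, op):
    """Reduce a non-empty list with an associative op.

    Returns (result, rounds), where rounds is the number of tree levels.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # combine neighbours (0,1), (2,3), ...; these are independent
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is carried up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds
```

For the eight values in the figure this takes three rounds, matching log2 8.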
4
Broadcast algorithms
  • Sequential or centralized algorithm
  • P0 sends value to P-1 other processors in
    sequence
  • O(P) algorithm
  • Note variations in UPC/Titanium model based on
    whether P0 writes to all others, or others read
    from P0
  • Tree-based algorithm
  • May vary branching factor
  • O(log P) algorithm
  • If broadcasting large data blocks, may break into
    pieces and pipeline

[Figure: tree-based broadcast of a from P0 through intermediate nodes (P0, P4, then P2, P6) to all processors P0 P1 P2 P3 P4 P5 P6 P7]
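
One possible binary-tree schedule, sketched in Python under the assumption that P is a power of two and P0 holds the value. The function name broadcast_schedule is illustrative. It only lists who sends to whom in each round and does not perform any communication.

```python
def broadcast_schedule(p):
    """Return a list of rounds; each round is a list of (sender, receiver).

    Assumes p is a power of two and that processor 0 starts with the value.
    """
    rounds = []
    stride = p // 2
    while stride >= 1:
        # every processor that already holds the value sends it `stride` away
        rounds.append([(s, s + stride) for s in range(0, p, 2 * stride)])
        stride //= 2
    return rounds
```

With p = 8 this yields three rounds: P0 to P4, then P0 and P4 to P2 and P6, then the four holders to their odd neighbours.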
5
Lower Bound on Parallel Performance
  • To compute a function of n inputs $x_1, \ldots, x_n$
  • Given only binary operations on our machine.
  • In 1 time step, output depends on at most 2
    inputs
  • In 2 time steps, output depends on at most 4
    inputs
  • Adding a time step increases possible inputs by
    at most 2x
  • In $k = \log_2 n$ time steps, output depends on at most
    n inputs
  • ⇒ A function of n inputs requires at least $\log_2 n$
    parallel steps.

[Figure: binary tree computing $f(x_1, x_2, \ldots, x_n)$ from the leaves $x_1, x_2, \ldots, x_n$]
6
Scan (Parallel Prefix) Operations
  • What if you want to compute partial sums?
  • Definition: the parallel prefix operation takes a
    binary associative operator $\oplus$ and an array of
    n elements
  • $[a_0, a_1, a_2, \ldots, a_{n-1}]$
  • and produces the array
  • $[a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1})]$
  • Example: add scan of
  • [1, 2, 0, 4, 2, 1, 1, 3] is [1, 3, 3, 7, 9, 10, 11, 14]
  • Can be implemented in O(n) time by a serial
    algorithm
  • Obvious: n-1 applications of the operator
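
The serial algorithm is short enough to write out. This is a sketch in Python; serial_scan is an illustrative name. The standard library's itertools.accumulate computes the same thing.

```python
from itertools import accumulate
from operator import add


def serial_scan(values, op):
    """Inclusive scan: n-1 applications of op for n inputs."""
    result = []
    for v in values:
        # the first element is copied; every later one costs a single op
        result.append(v if not result else op(result[-1], v))
    return result
```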

7
Applications of scans
  • Many applications, some more obvious than others
  • lexically compare strings of characters
  • add multi-precision numbers
  • add binary numbers fast in hardware
  • evaluate polynomials
  • implement bucket sort, radix sort, and even
    quicksort
  • solve tridiagonal linear systems
  • solve recurrence relations
  • dynamically allocate processors
  • search for regular expression (grep)
  • image processing primitives

8
Prefix sum in parallel
Algorithm: 1. Pairwise sum 2. Recursive prefix 3. Pairwise sum
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
3 7 11 15 19 23 27 31
(Recursively compute prefix sums)
3 10 21 36 55 78 105 136
1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136
Slide source: Alan Edelman, MIT
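
The three steps can be sketched recursively in Python. The name parallel_prefix is illustrative, and the loops run serially here; in each step the individual operations are independent, which is what a parallel machine would exploit. Operand order is preserved, so the operator only needs to be associative, not commutative.

```python
from operator import add


def parallel_prefix(a, op):
    """Inclusive scan by pairwise combine, recursive prefix, pairwise fix-up."""
    n = len(a)
    if n <= 1:
        return list(a)
    # 1. pairwise sums of neighbours (a0+a1, a2+a3, ...)
    pairs = [op(a[i], a[i + 1]) for i in range(0, n - 1, 2)]
    # 2. recursive prefix on the half-length array
    pp = parallel_prefix(pairs, op)
    # 3. fill in the result
    out = [a[0]] * n
    for i in range(1, n):
        if i % 2 == 1:
            out[i] = pp[i // 2]                   # comes straight from step 2
        else:
            out[i] = op(pp[i // 2 - 1], a[i])     # one extra op per entry
    return out
```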
9
Parallel prefix cost
  • Parallel prefix works on any associative
    operator
  • Input: 1 2 3 4 5 6 7 8
  • Pairwise sums: 3 7 11 15
  • Recursive prefix: 3 10 21 36
  • Update odds: 1 3 6 10 15 21 28 36
  • Names: +\ (APL), cumsum (Matlab), MPI_SCAN
  • Warning: about 2n operations, where only n-1 are needed serially

Slide source: Alan Edelman, MIT
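
The 2n figure can be checked with a short recurrence. This is a sketch assuming n is a power of two, n/2 operations for the pairwise sums, and n/2 - 1 operations for the update step (the first output is copied and half of the rest come directly from the recursion). W(n) is the operation count and D(n) the number of parallel steps; both symbols are introduced here.

```latex
W(1) = 0, \qquad
W(n) = \frac{n}{2} + W\!\left(\frac{n}{2}\right) + \left(\frac{n}{2} - 1\right)
\;\Longrightarrow\;
W(n) = 2n - 2 - \log_2 n

D(2) = 1, \qquad
D(n) = D\!\left(\frac{n}{2}\right) + 2
\;\Longrightarrow\;
D(n) = 2\log_2 n - 1
```

For n = 8 this gives 11 operations in 5 parallel steps, against 7 operations serially.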
10
Implementing parallel prefix scans
  • Tree summation: two phases
  • up sweep
  • get values L and R from left and right child
  • save L in local variable Mine
  • compute Tmp = L + R and pass to parent
  • down sweep
  • get value Tmp from parent
  • send Tmp to left child
  • send Tmp + Mine to right child

Up sweep: mine = left; tmp = left + right
Down sweep: tmp = parent (root is 0); right = tmp + mine
[Figure: up-sweep and down-sweep trees for X = 3 1 2 0 4 1 1 3. Leaf values after the down sweep: 0 3 4 6 6 10 11 12. Final scan: 3 4 6 6 10 11 12 15]
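
A Python sketch of the two sweeps, storing the tree level by level. It assumes the input length is a power of two; the name tree_scan and the level-list layout are choices made for this example. The down sweep leaves each leaf with the sum of everything to its left, so one more add per leaf gives the inclusive scan.

```python
from operator import add


def tree_scan(x, op=add, identity=0):
    """Return (exclusive, inclusive) scans of x; len(x) must be a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    # up sweep: at each internal node, mine = left and tmp = left op right
    tmp = list(x)
    mines = []
    while len(tmp) > 1:
        mines.append([tmp[i] for i in range(0, len(tmp), 2)])
        tmp = [op(tmp[i], tmp[i + 1]) for i in range(0, len(tmp), 2)]
    # down sweep: the root receives the identity (0 for addition)
    down = [identity]
    for mine in reversed(mines):
        nxt = []
        for t, m in zip(down, mine):
            nxt.append(t)          # left child gets tmp
            nxt.append(op(t, m))   # right child gets tmp op mine
        down = nxt
    inclusive = [op(d, v) for d, v in zip(down, x)]
    return down, inclusive
```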
11
E.g., Using Scans for Array Compression
  • Given an array of n elements
  • $[a_0, a_1, a_2, \ldots, a_{n-1}]$
  • and an array of flags
  • [1, 0, 1, 1, 0, 0, 1, ...]
  • compress the flagged elements into
  • $[a_0, a_2, a_3, a_6, \ldots]$
  • Compute an add scan of [0, flags]
  • [0, 1, 1, 2, 3, 3, 3, 4, ...]
  • Gives the index of the ith element in the
    compressed array
  • If the flag for this element is 1, write it into
    the result array at the given position

Slide source: Alan Edelman, MIT
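
A Python sketch of the compression step. The name compress is illustrative, and the scan is done serially with itertools.accumulate as a stand-in for a parallel add scan. Flags are assumed to be 0 or 1 and the two arrays to have equal length.

```python
from itertools import accumulate


def compress(a, flags):
    """Keep the elements of a whose flag is 1, preserving order."""
    # add scan of [0, flags]: entry i is the output index of element i,
    # and the last entry is the number of flagged elements
    scan = list(accumulate([0] + list(flags)))
    index, total = scan[:-1], scan[-1]
    result = [None] * total
    for i, (elem, flag) in enumerate(zip(a, flags)):
        if flag:
            result[index[i]] = elem  # writes go to distinct slots
    return result
```

Because every flagged element has its own target index, the writes do not conflict and could all happen at once.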
12
E.g., Fibonacci via Matrix Multiply Prefix
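$F_{n+1} = F_n + F_{n-1}$

$$\begin{pmatrix} F_{n+1} \\ F_n \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} F_n \\ F_{n-1} \end{pmatrix}$$

Can compute all $F_n$ by matmul_prefix on a list of copies of the matrix $\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$, then select the upper left entry

Slide source: Alan Edelman, MIT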
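
A small Python sketch of the idea, with itertools.accumulate standing in for a parallel matmul_prefix. The helper names are illustrative. With the convention F_0 = 0, F_1 = 1, the upper left entry of the k-th prefix product is F_{k+1}.

```python
from itertools import accumulate


def matmul2(A, B):
    """Product of two 2-by-2 matrices given as nested tuples."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))


def fibonacci_by_prefix(n):
    """Return [F_2, F_3, ..., F_{n+1}] from prefix products M, M^2, ..., M^n."""
    M = ((1, 1), (1, 0))
    prefixes = accumulate([M] * n, matmul2)
    return [P[0][0] for P in prefixes]  # upper left entry of each product
```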
13
Segmented Operations
Inputs: ordered pairs (operand, boolean), e.g. (x, T) or (x, F)
Change of segment indicated by switching T/F

$\oplus_2$ | (y, T)              | (y, F)
(x, T)     | ($x \oplus y$, T)   | (y, F)
(x, F)     | (y, T)              | ($x \oplus y$, F)

e.g.    1  2  3  4  5  6  7  8
        T  T  F  F  F  T  F  T
Result: 1  3  3  7 12  6  7  8
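
The pair operator in the table can be written down directly. The sketch below is in Python with illustrative names and applies it strictly left to right using itertools.accumulate. With this T/F encoding the grouping matters: ((1,T) (2,F)) (3,T) gives (3,T), while (1,T) ((2,F) (3,T)) gives (4,T). A tree-based version would therefore need a different segment encoding; that is an observation about this sketch, not something stated on the slide.

```python
from itertools import accumulate
from operator import add


def segmented_op(op):
    """Lift op to (operand, flag) pairs as in the table above."""
    def combine(left, right):
        x, fx = left
        y, fy = right
        # same flag: still in the same segment, keep accumulating;
        # different flag: a new segment starts at y
        return (op(x, y), fy) if fx == fy else (y, fy)
    return combine


def segmented_scan(values, flags, op=add):
    """Left-to-right scan that restarts whenever the flag switches."""
    pairs = zip(values, flags)
    return [v for v, _ in accumulate(pairs, segmented_op(op))]
```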
14
Adding two n-bit integers in O(log n) time
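  • Let $a = a_{n-1} a_{n-2} \ldots a_0$ and $b = b_{n-1} b_{n-2} \ldots b_0$
    be two n-bit binary numbers
  • We want their sum $s = a + b = s_n s_{n-1} \ldots s_0$
  • Challenge: compute all $c_i$ in O(log n) time via
    parallel prefix
  • Used in all computers to implement addition -
    Carry look-ahead

c_{-1} = 0                                              ... rightmost carry bit
for i = 0 to n-1
    c_i = ((a_i xor b_i) and c_{i-1}) or (a_i and b_i)  ... next carry bit
    s_i = a_i xor b_i xor c_{i-1}

for all (0 <= i <= n-1)  p_i = a_i xor b_i              ... propagate bit
for all (0 <= i <= n-1)  g_i = a_i and b_i              ... generate bit

$$\begin{pmatrix} c_i \\ 1 \end{pmatrix} = \begin{pmatrix} (p_i \text{ and } c_{i-1}) \text{ or } g_i \\ 1 \end{pmatrix} = \begin{pmatrix} p_i & g_i \\ 0 & 1 \end{pmatrix} \begin{pmatrix} c_{i-1} \\ 1 \end{pmatrix} = M_i \begin{pmatrix} c_{i-1} \\ 1 \end{pmatrix} = M_i M_{i-1} \cdots M_0 \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$

... 2-by-2 Boolean matrix multiplication (associative)
... evaluate each product $M_i M_{i-1} \cdots M_0$ by parallel prefix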
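
A Python sketch of the carry computation as a prefix over the matrices M_i. Each M_i is stored as the pair (p_i, g_i), since its bottom row is always (0, 1). Bits are given least significant first, at least one bit is assumed, and itertools.accumulate stands in for the parallel prefix. All function names are illustrative.

```python
from itertools import accumulate


def combine(lo, hi):
    """Boolean product M_hi * M_lo for matrices [[p, g], [0, 1]] stored as (p, g)."""
    p_lo, g_lo = lo
    p_hi, g_hi = hi
    return (p_hi & p_lo, (p_hi & g_lo) | g_hi)


def add_bits(a, b):
    """Add two equal-length bit lists (LSB first); returns n+1 sum bits."""
    p = [ai ^ bi for ai, bi in zip(a, b)]   # propagate bits
    g = [ai & bi for ai, bi in zip(a, b)]   # generate bits
    prefix = accumulate(zip(p, g), combine)  # M_i * ... * M_0 for every i
    # applying the product to (0, 1) leaves c_i equal to the g-entry
    c = [G for _, G in prefix]
    carry_in = [0] + c[:-1]                  # c_{i-1}, with c_{-1} = 0
    s = [pi ^ ci for pi, ci in zip(p, carry_in)]
    return s + [c[-1]]                       # top bit is the last carry


def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))
```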

15
Multiplying n-by-n matrices in O(log n) time
  • For all (1 <= i,j,k <= n): $P(i,j,k) = A(i,k) \cdot B(k,j)$
  • cost = 1 time unit, using $n^3$ processors
  • For all (1 <= i,j <= n): $C(i,j) = \sum_{k=1}^{n} P(i,j,k)$
  • cost = O(log n) time, using a tree with $n^3 / 2$
    processors
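
The two phases can be sketched in Python, serially, for square matrices given as nested lists. The name matmul_log_depth is illustrative. The first phase forms all n^3 products independently; the second sums over k with a pairwise tree, so the number of halving rounds is about log2 n.

```python
def matmul_log_depth(A, B):
    """Multiply square matrices via n^3 independent products plus tree sums."""
    n = len(A)
    # phase 1: P(i,j,k) = A(i,k) * B(k,j), all independent
    P = [[[A[i][k] * B[k][j] for k in range(n)]
          for j in range(n)] for i in range(n)]

    def tree_sum(vals):
        vals = list(vals)
        while len(vals) > 1:
            # add neighbours pairwise; a trailing odd element passes through
            vals = [sum(vals[t:t + 2]) for t in range(0, len(vals), 2)]
        return vals[0]

    # phase 2: C(i,j) = sum over k of P(i,j,k)
    return [[tree_sum(P[i][j]) for j in range(n)] for i in range(n)]
```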

16
Inverting dense n-by-n matrices in $O(\log^2 n)$ time
  • Lemma 1: Cayley-Hamilton Theorem
  • expression for $A^{-1}$ via characteristic polynomial
    in A
  • Lemma 2: Newton's Identities
  • Triangular system of equations for coefficients
    of characteristic polynomial
  • Lemma 3: $\mathrm{trace}(A^k) = \sum_{i=1}^{n} A^k[i,i] = \sum_{i=1}^{n} [\lambda_i(A)]^k$
  • Csanky's Algorithm (1976)
  • Completely numerically unstable

1) Compute the powers $A^2, A^3, \ldots, A^{n-1}$ by parallel prefix; cost = $O(\log^2 n)$
2) Compute the traces $s_k = \mathrm{trace}(A^k)$; cost = $O(\log n)$
3) Solve Newton identities for coefficients of characteristic polynomial; cost = $O(\log^2 n)$
4) Evaluate $A^{-1}$ using Cayley-Hamilton Theorem; cost = $O(\log n)$
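
The four steps can be followed in a serial Python sketch. It uses exact rational arithmetic (fractions.Fraction) to sidestep the instability mentioned above; that choice, the function names, and the convention that the characteristic polynomial is written as λ^n + c_1 λ^(n-1) + ... + c_n are assumptions of this example. Nothing here is parallel; it only shows that the algebra works.

```python
from fractions import Fraction


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def csanky_inverse(A):
    """Invert a square matrix via traces of powers and Newton's identities."""
    n = len(A)
    A = [[Fraction(v) for v in row] for row in A]
    I = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    # 1) powers[k] = A^k for k = 0 .. n-1 (a parallel prefix on the slide)
    powers = [I]
    for _ in range(n - 1):
        powers.append(matmul(powers[-1], A))
    # 2) traces s_k = trace(A^k) for k = 1 .. n; s_n is taken from
    #    A^(n-1) and A without forming A^n
    s = [None] + [sum(powers[k][i][i] for i in range(n)) for k in range(1, n)]
    s.append(sum(powers[n - 1][i][j] * A[j][i]
                 for i in range(n) for j in range(n)))
    # 3) Newton's identities: s_k + c_1 s_{k-1} + ... + c_{k-1} s_1 + k c_k = 0
    c = [Fraction(1)]
    for k in range(1, n + 1):
        c.append(-sum(c[j] * s[k - j] for j in range(k)) / k)
    if c[n] == 0:
        raise ValueError("matrix is singular")
    # 4) Cayley-Hamilton: A^-1 = -(A^(n-1) + c_1 A^(n-2) + ... + c_{n-1} I) / c_n
    return [[-sum(c[k] * powers[n - 1 - k][i][j] for k in range(n)) / c[n]
             for j in range(n)] for i in range(n)]
```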
17
Evaluating arbitrary expressions
  • Let E be an arbitrary expression formed from +,
    -, *, /, parentheses, and n variables, where each
    appearance of each variable is counted separately
  • Can think of E as arbitrary expression tree with
    n leaves (the variables) and internal nodes
    labelled by +, -, * and /
  • Theorem (Brent): E can be evaluated in O(log n)
    time, if we reorganize it using laws of
    commutativity, associativity and distributivity
  • Sketch of (modern) proof: evaluate expression
    tree E greedily by
  • collapsing all leaves into their parents at each
    time step
  • evaluating all chains in E with parallel prefix

18
The myth of log n
  • The $\log_2 n$ parallel steps are not the main reason
    for the usefulness of parallel prefix.
  • Say n = 1000000p (1000000 summands per processor)
  • Cost = (2000000 adds) + ($\log_2 P$ message passings)
  • = fast & embarrassingly parallel
  • (2000000 local adds are serial for each
    processor, of course)
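
The cost estimate corresponds to a blocked scan, sketched here in Python with p simulated processors (p >= 1 assumed). The name blocked_scan is illustrative. Each processor scans its own block, only the p block totals go through the tree scan, and then each processor adds its offset locally: about 2n/p local adds plus a scan over p values.

```python
from itertools import accumulate


def blocked_scan(values, p):
    """Inclusive add scan of values, split into p contiguous blocks."""
    n = len(values)
    size = -(-n // p)  # ceiling of n / p
    blocks = [values[i * size:(i + 1) * size] for i in range(p)]
    # local scan on every processor: about n/p adds each, no communication
    local = [list(accumulate(b)) for b in blocks]
    # only one total per processor takes part in the global scan
    totals = [l[-1] if l else 0 for l in local]
    offsets = [0] + list(accumulate(totals))[:-1]
    # second local pass: add the offset of all earlier blocks
    return [off + v for l, off in zip(local, offsets) for v in l]
```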

19
Summary of tree algorithms
  • Lots of problems can be done quickly - in theory
    - using trees
  • Some algorithms are widely used
  • broadcasts, reductions, parallel prefix
  • carry look-ahead addition
  • Some are of theoretical interest only
  • Csanky's method for matrix inversion
  • Solving tridiagonal linear systems (without
    pivoting)
  • Both numerically unstable
  • Csanky needs too many processors
  • Embedded in various systems
  • CM-5 hardware control network
  • MPI, UPC, Titanium, NESL, other languages