1
CS 267: Applications of Parallel Computers
Lecture 19: Graph Partitioning, Part II
  • Kathy Yelick
  • http://www-inst.eecs.berkeley.edu/cs267

2
Recap of Last Lecture
  • Partitioning with nodal coordinates
  • Inertial method
  • Projection onto a sphere
  • Algorithms are efficient
  • Rely on graphs having nodes connected (mostly) to
    nearest neighbors in space
  • Partitioning without nodal coordinates
  • Breadth-First Search: simple, but not a great
    partition
  • Kernighan-Lin: a good corrector given a reasonable
    partition
  • Spectral Method: good partitions, but slow
  • Today
  • Spectral methods revisited
  • Multilevel methods

3
Basic Definitions
  • Definition: The Laplacian matrix L(G) of a graph
    G(N,E) is an |N|-by-|N| symmetric matrix, with
    one row and column for each node. It is defined
    by
  • L(G)(i,i) = degree of node i (number of incident
    edges)
  • L(G)(i,j) = -1 if i ≠ j and there is an edge
    (i,j)
  • L(G)(i,j) = 0 otherwise

Example: graph G with nodes 1,…,5 and edges
(1,2), (1,3), (2,3), (3,4), (3,5), (4,5), and its Laplacian L(G):

         [  2  -1  -1   0   0 ]
         [ -1   2  -1   0   0 ]
L(G)  =  [ -1  -1   4  -1  -1 ]
         [  0   0  -1   2  -1 ]
         [  0   0  -1  -1   2 ]
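A minimal NumPy sketch (not from the slides) that builds this L(G) from the
edge list above and checks that the rows sum to zero:

import numpy as np

edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]  # edges of the example graph G
n = 5
L = np.zeros((n, n))
for i, j in edges:
    L[i - 1, i - 1] += 1        # L(G)(i,i) = degree of node i
    L[j - 1, j - 1] += 1
    L[i - 1, j - 1] = -1        # L(G)(i,j) = -1 for each edge (i,j), i != j
    L[j - 1, i - 1] = -1

print(L)                        # matches the 5x5 matrix above
print(L @ np.ones(n))           # rows sum to zero: L(G) * e = 0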
4
Properties of Laplacian Matrix
  • Theorem 1: Given G, L(G) has the following
    properties (proof on web page)
  • L(G) is symmetric.
  • This means the eigenvalues of L(G) are real and
    its eigenvectors are real and orthogonal.
  • Rows of L sum to zero
  • Let e = (1,…,1)^T, i.e. the column vector of all
    ones. Then L(G)*e = 0.
  • The eigenvalues of L(G) are nonnegative:
  • 0 = λ1 ≤ λ2 ≤ … ≤ λn
  • The number of connected components of G is equal
    to the number of λi equal to 0.
  • Definition: λ2(L(G)) is the algebraic
    connectivity of G
  • The magnitude of λ2 measures connectivity
  • In particular, λ2 ≠ 0 if and only if G is
    connected.

5
Spectral Bisection Algorithm
  • Spectral Bisection Algorithm
  • Compute eigenvector v2 corresponding to λ2(L(G))
  • For each node n of G
  • if v2(n) < 0, put node n in partition N-
  • else put node n in partition N+
  • Why does this make sense?
  • Recall λ2(L(G)) is the algebraic connectivity of
    G
  • Theorem (Fiedler): Let G1(N,E1) be a subgraph of
    G(N,E), so that G1 is less connected than G.
    Then λ2(L(G1)) ≤ λ2(L(G)), i.e. the algebraic
    connectivity of G1 is less than or equal to the
    algebraic connectivity of G. (proof on web page)
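A minimal Python/NumPy sketch of the algorithm above; it uses a dense
eigensolver for clarity, whereas a large graph would use a Lanczos-based
sparse solver (see below). Nodes with v2(n) = 0 are arbitrarily sent to N+
here.

import numpy as np

def spectral_bisection(L):
    # L: n-by-n graph Laplacian; returns node index lists (N_plus, N_minus)
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    v2 = eigvecs[:, 1]                       # eigenvector for lambda_2 (Fiedler vector)
    N_minus = [n for n in range(L.shape[0]) if v2[n] < 0]
    N_plus = [n for n in range(L.shape[0]) if v2[n] >= 0]
    return N_plus, N_minus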

6
Motivation for Spectral Bisection
  • Vibrating string has modes of vibration, or
    harmonics
  • Modes computable as follows
  • Model string as masses connected by springs (a 1D
    mesh)
  • Write down F = ma for the coupled system, get matrix A
  • Eigenvalues and eigenvectors of A are frequencies
    and shapes of modes
  • Label nodes by whether the mode is - or + there to
    get N- and N+
  • Same idea for other graphs (e.g. a planar graph
    behaves like a trampoline)

7
Eigenvectors of L(1D mesh)
Eigenvector 1 (all ones)
Eigenvector 2
Eigenvector 3
8
2nd eigenvector of L(planar mesh)
9
Computing v2 and λ2 of L(G) using Lanczos
  • Given any n-by-n symmetric matrix A (such as
    L(G)), Lanczos computes a k-by-k approximation
    T by doing k matrix-vector products, k << n
  • Approximate A's eigenvalues/vectors using T's

Choose an arbitrary starting vector r
b(0) = ||r||
j = 0
repeat
    j = j+1
    q(j) = r / b(j-1)          … scale a vector
    r = A * q(j)               … matrix-vector multiplication,
                                 the most expensive step
    r = r - b(j-1)*q(j-1)      … saxpy, or scalar*vector + vector
    a(j) = q(j)^T * r          … dot product
    r = r - a(j)*q(j)          … saxpy
    b(j) = ||r||               … compute vector norm
until convergence              … details omitted
        [ a(1)  b(1)                               ]
        [ b(1)  a(2)  b(2)                         ]
T   =   [       b(2)  a(3)  b(3)                   ]
        [              …     …       …             ]
        [             b(k-2) a(k-1)  b(k-1)        ]
        [                    b(k-1)  a(k)          ]
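A Python/NumPy sketch of the loop above, run for a fixed number of steps and
with no reorthogonalization or convergence test (both matter in practice):

import numpy as np

def lanczos(A, k, seed=0):
    # Build the k-by-k tridiagonal T whose eigenvalues approximate those of
    # the symmetric matrix A, using k matrix-vector products.
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(n)               # arbitrary starting vector
    b = np.linalg.norm(r)                    # b(0)
    q_prev = np.zeros(n)
    alphas, betas = [], []
    for _ in range(k):
        q = r / b                            # scale a vector
        r = A @ q                            # matrix-vector multiply, the most expensive step
        r = r - b * q_prev                   # saxpy
        a = q @ r                            # dot product
        r = r - a * q                        # saxpy
        b = np.linalg.norm(r)                # vector norm
        alphas.append(a)
        betas.append(b)
        q_prev = q
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    return T                                 # use T's eigenvalues/vectors to approximate A's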
10
Spectral Bisection Summary
  • Laplacian matrix represents graph connectivity
  • Second eigenvector gives a graph bisection
  • Roughly equal weights in two parts
  • Weak connection in the graph will be separator
  • Implementation via the Lanczos Algorithm
  • To optimize sparse-matrix-vector multiply, we
    graph partition
  • To graph partition, we find an eigenvector of a
    matrix associated with the graph
  • To find an eigenvector, we do sparse-matrix
    vector multiply
  • Have we made progress?
  • The first matrix-vector multiplies are slow, but
    use them to learn how to make the rest faster

11
Introduction to Multilevel Partitioning
  • If we want to partition G(N,E), but it is too big
    to do efficiently, what can we do?
  • 1) Replace G(N,E) by a coarse approximation
    Gc(Nc,Ec), and partition Gc instead
  • 2) Use partition of Gc to get a rough
    partitioning of G, and then iteratively improve
    it
  • What if Gc still too big?
  • Apply same idea recursively

12
Multilevel Partitioning - High Level Algorithm
(N+, N-) = Multilevel_Partition(N, E)
    … recursive partitioning routine returns N+ and N-
      where N = N+ U N-
    if |N| is small
        (1) Partition G = (N, E) directly to get N = N+ U N-
            Return (N+, N-)
    else
        (2) Coarsen G to get an approximation Gc = (Nc, Ec)
        (3) (Nc+, Nc-) = Multilevel_Partition(Nc, Ec)
        (4) Expand (Nc+, Nc-) to a partition (N+, N-) of N
        (5) Improve the partition (N+, N-)
            Return (N+, N-)
    endif
(Figure: V-cycle. Steps (2,3) coarsen the graph on the way down,
step (1) partitions the smallest graph at the bottom, and steps
(4) expand and (5) improve on the way back up.)
How do we Coarsen? Expand? Improve?
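A Python sketch of the Multilevel_Partition recursion above. The coarsen,
expand, and improve arguments are placeholders for the schemes on the
following slides (maximal matching or independent sets for coarsening,
Kernighan-Lin or Rayleigh Quotient Iteration for improvement); the base-case
split and the node_map returned by coarsen are illustrative assumptions, not
part of the slides.

def multilevel_partition(N, E, coarsen, expand, improve, threshold=32):
    # Returns (N_plus, N_minus) with N = N_plus U N_minus.
    if len(N) <= threshold:
        nodes = sorted(N)                                  # (1) partition G directly
        half = len(nodes) // 2                             #     (any direct method will do)
        return set(nodes[:half]), set(nodes[half:])
    Nc, Ec, node_map = coarsen(N, E)                       # (2) coarsen G to Gc = (Nc, Ec)
    Nc_plus, Nc_minus = multilevel_partition(              # (3) recurse on the coarse graph
        Nc, Ec, coarsen, expand, improve, threshold)
    N_plus, N_minus = expand(Nc_plus, Nc_minus, node_map)  # (4) expand to a partition of N
    return improve(N_plus, N_minus, E)                     # (5) improve (e.g., Kernighan-Lin)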
13
Multilevel Kernighan-Lin
  • Coarsen graph and expand partition using maximal
    matchings
  • Improve partition using Kernighan-Lin

14
Maximal Matching
  • Definition: A matching of a graph G(N,E) is a
    subset Em of E such that no two edges in Em share
    an endpoint
  • Definition: A maximal matching of a graph G(N,E)
    is a matching Em to which no more edges can be
    added and remain a matching
  • A simple greedy algorithm computes a maximal
    matching

let Em be empty
mark all nodes in N as unmatched
for i = 1 to |N|               … visit the nodes in any order
    if i has not been matched
        mark i as matched
        if there is an edge e = (i,j) where j is also unmatched,
            add e to Em
            mark j as matched
        endif
    endif
endfor
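The same greedy loop as a Python sketch, with the graph given as an
adjacency list (a dict mapping each node to its neighbors):

def maximal_matching(adj):
    Em = []                              # the matching under construction
    matched = set()
    for i in adj:                        # visit the nodes in any order
        if i not in matched:
            matched.add(i)
            for j in adj[i]:
                if j not in matched:     # pair i with its first unmatched neighbor
                    Em.append((i, j))
                    matched.add(j)
                    break
    return Em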
15
Maximal Matching Example
16
Coarsening using a maximal matching
1) Construct a maximal matching Em of G(N,E)
for all edges e = (j,k) in Em          … 2) collapse matched nodes into a single one
    Put node n(e) in Nc
    W(n(e)) = W(j) + W(k)              … gray statements update node/edge weights
for all nodes n in N not incident on an edge in Em    … 3) add unmatched nodes
    Put n in Nc
    do not change W(n)
… Now each node r in N is inside a unique node n(r) in Nc
… 4) Connect two nodes in Nc if nodes inside them are connected in E
for all edges e = (j,k) in Em
    for each other edge e' = (j,r) in E incident on j
        Put edge ee = (n(e), n(r)) in Ec
        W(ee) = W(e')
    for each other edge e' = (r,k) in E incident on k
        Put edge ee = (n(r), n(e)) in Ec
        W(ee) = W(e')
If there are multiple edges connecting two nodes in Nc,
collapse them, adding edge weights
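A Python sketch of the same coarsening, under the assumption that node and
edge weights are given as dicts (W_node, W_edge). It loops over all fine
edges and drops those internal to a coarse node, which yields the same Ec
and weights as the per-matched-edge loop above; multi-edges are collapsed by
summing.

def coarsen_by_matching(N, E, W_node, W_edge, Em):
    n_of = {}                                    # fine node -> coarse node id
    Wc_node = {}
    for c, (j, k) in enumerate(Em):              # 2) collapse matched pairs
        n_of[j] = n_of[k] = c
        Wc_node[c] = W_node[j] + W_node[k]
    next_id = len(Em)
    for r in N:                                  # 3) add unmatched nodes unchanged
        if r not in n_of:
            n_of[r] = next_id
            Wc_node[next_id] = W_node[r]
            next_id += 1
    Wc_edge = {}                                 # 4) connect coarse nodes
    for (j, k) in E:
        cj, ck = n_of[j], n_of[k]
        if cj != ck:                             # skip edges inside a coarse node
            key = (min(cj, ck), max(cj, ck))
            Wc_edge[key] = Wc_edge.get(key, 0) + W_edge[(j, k)]
    return set(Wc_node), Wc_edge, Wc_node, n_of  # Nc, weighted Ec, W(Nc), node mapping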

17
Example of Coarsening
18
Expanding a partition of Gc to a partition of G
19
Multilevel Spectral Bisection
  • Coarsen graph and expand partition using maximal
    independent sets
  • Improve partition using Rayleigh Quotient
    Iteration

20
Maximal Independent Sets
  • Definition: An independent set of a graph G(N,E)
    is a subset Ni of N such that no two nodes in Ni
    are connected by an edge
  • Definition: A maximal independent set of a graph
    G(N,E) is an independent set Ni to which no more
    nodes can be added and remain an independent set
  • A simple greedy algorithm computes a maximal
    independent set

let Ni be empty
for k = 1 to |N|               … visit the nodes in any order
    if node k is not adjacent to any node already in Ni
        add k to Ni
    endif
endfor
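The same greedy loop as a Python sketch (adjacency given as a dict mapping
each node to a set of neighbors):

def maximal_independent_set(adj):
    Ni = set()
    for k in adj:                    # visit the nodes in any order
        if not (adj[k] & Ni):        # k is not adjacent to any node already in Ni
            Ni.add(k)
    return Ni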
21
Coarsening using Maximal Independent Sets
… Build domains D(k) around each node k in Ni to get the nodes in Nc
… Add an edge to Ec whenever it would connect two such domains
Ec = empty set
for all nodes k in Ni
    D(k) = ( {k}, empty set )
        … first set contains nodes in D(k), second set contains edges in D(k)
unmark all edges in E
repeat
    choose an unmarked edge e = (k,j) from E
    if exactly one of k and j (say k) is in some D(m)
        mark e; add j and e to D(m)
    else if k and j are in two different D(m)'s (say D(mi) and D(mj))
        mark e; add edge (mi, mj) to Ec
    else if both k and j are in the same D(m)
        mark e; add e to D(m)
    else
        leave e unmarked
    endif
until no unmarked edges
22
Example of Coarsening
(Figure: outlined regions enclose the domains D(k); the highlighted
nodes are the nodes of Nc)
23
Expanding a partition of Gc to a partition of G
  • Need to convert an eigenvector vc of L(Gc) to an
    approximate eigenvector v of L(G)
  • Use interpolation

For each node j in N
    if j is also a node in Nc, then
        v(j) = vc(j)           … use same eigenvector component
    else
        v(j) = average of vc(k) for all neighbors k of j in Nc
    endif
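A Python sketch of this interpolation, assuming the nodes of Nc reuse the
ids of the corresponding nodes of G (as in the independent-set coarsening)
and vc is a dict over Nc. By maximality of the independent set, every node
not in Nc has at least one neighbor in Nc, so the average is well defined.

def expand_eigenvector(nodes, adj, vc):
    v = {}
    for j in nodes:
        if j in vc:
            v[j] = vc[j]                               # use same eigenvector component
        else:
            nbrs = [k for k in adj[j] if k in vc]      # neighbors of j that are in Nc
            v[j] = sum(vc[k] for k in nbrs) / len(nbrs)
    return v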
24
Example: 1D mesh of 9 nodes
25
Improve eigenvector: Rayleigh Quotient Iteration
j = 0
pick starting vector v(0) from expanding vc
repeat
    j = j+1
    r(j) = v(j-1)^T * L(G) * v(j-1)
        … r(j) = Rayleigh Quotient of v(j-1),
          a good approximate eigenvalue
    v(j) = (L(G) - r(j)*I)^(-1) * v(j-1)
        … expensive to do exactly, so solve approximately using an
          iteration called SYMMLQ, which uses matrix-vector
          multiply (no surprise)
    v(j) = v(j) / ||v(j)||     … normalize v(j)
until v(j) converges
Convergence is very fast: cubic
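A Python/NumPy sketch of the iteration above, using a fixed number of steps
and an exact dense solve in place of the approximate SYMMLQ solve mentioned
on the slide:

import numpy as np

def rayleigh_quotient_iteration(L, v0, steps=5):
    n = L.shape[0]
    v = v0 / np.linalg.norm(v0)                       # starting vector from expanding vc
    for _ in range(steps):
        rho = v @ L @ v                               # Rayleigh Quotient: approximate eigenvalue
        v = np.linalg.solve(L - rho * np.eye(n), v)   # v = (L - rho*I)^-1 * v
        v = v / np.linalg.norm(v)                     # normalize
    return v @ L @ v, v                               # final eigenvalue estimate and eigenvector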
26
Example of convergence for 1D mesh
27
Available Implementations
  • Multilevel Kernighan/Lin
  • METIS (www.cs.umn.edu/metis)
  • ParMETIS - parallel version
  • Multilevel Spectral Bisection
  • S. Barnard and H. Simon, "A fast multilevel
    implementation of recursive spectral bisection,"
    Proc. 6th SIAM Conf. on Parallel Processing,
    1993
  • Chaco (www.cs.sandia.gov/CRF/papers_chaco.html)
  • Hybrids are possible
  • Ex: using Kernighan/Lin to improve a partition
    from spectral bisection

28
Comparison of methods
  • Compare only methods that use edges, not nodal
    coordinates
  • CS267 webpage and KK95a (see below) have other
    comparisons
  • Metrics
  • Speed of partitioning
  • Number of edge cuts
  • Other application dependent metrics
  • Summary
  • No one method best
  • Multi-level Kernighan/Lin fastest by far,
    comparable to Spectral in the number of edge cuts
  • www-users.cs.umn.edu/karypis/metis/publications/mail.html
  • see publications KK95a and KK95b
  • Spectral gives much better cuts for some
    applications
  • Ex: image segmentation
  • www.cs.berkeley.edu/jshi/Grouping/overview.html
  • see "Normalized Cuts and Image Segmentation"

29
Number of edges cut for a 64-way partition
For Multilevel Kernighan/Lin, as implemented in METIS (see KK95a)

Graph      Description       # of Nodes   # of Edges   Edges cut for      Expected cuts   Expected cuts
                                                        64-way partition   for 2D mesh     for 3D mesh
144        3D FE Mesh            144649      1074393          88806             6427           31805
4ELT       2D FE Mesh             15606        45878           2965             2111            7208
ADD32      32-bit adder            4960         9462            675             1190            3357
AUTO       3D FE Mesh            448695      3314611         194436            11320           67647
BBMAT      2D Stiffness M.        38744       993481          55753             3326           13215
FINAN512   Lin. Prog.             74752       261120          11388             4620           20481
LHR10      Chem. Eng.             10672       209093          58784             1746            5595
MAP1       Highway Net.          267241       334931           1388             8736           47887
MEMPLUS    Memory circuit         17758        54196          17894             2252            7856
SHYY161    Navier-Stokes          76480       152002           4365             4674           20796
TORSO      3D FE Mesh            201142      1479989         117997             7579           39623

Expected cuts for a 64-way partition of a 2D mesh of n nodes:
    n^(1/2) + 2*(n/2)^(1/2) + 4*(n/4)^(1/2) + … + 32*(n/32)^(1/2)  ≈  17 * n^(1/2)
Expected cuts for a 64-way partition of a 3D mesh of n nodes:
    n^(2/3) + 2*(n/2)^(2/3) + 4*(n/4)^(2/3) + … + 32*(n/32)^(2/3)  ≈  11.5 * n^(2/3)
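A quick check of the constants above: both are six-term geometric sums, one
term per level of recursive bisection.

s2d = sum(2**i * (1 / 2**i) ** (1 / 2) for i in range(6))   # about 16.9, i.e. ~17
s3d = sum(2**i * (1 / 2**i) ** (2 / 3) for i in range(6))   # about 11.5
print(round(s2d, 1), round(s3d, 1))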
30
Speed of 256-way partitioning (from KK95a)
Partitioning time in seconds

Graph      Description       # of Nodes   # of Edges   Multilevel Spectral   Multilevel
                                                        Bisection             Kernighan/Lin
144        3D FE Mesh            144649      1074393          607.3                48.1
4ELT       2D FE Mesh             15606        45878           25.0                 3.1
ADD32      32-bit adder            4960         9462           18.7                 1.6
AUTO       3D FE Mesh            448695      3314611         2214.2               179.2
BBMAT      2D Stiffness M.        38744       993481          474.2                25.5
FINAN512   Lin. Prog.             74752       261120          311.0                18.0
LHR10      Chem. Eng.             10672       209093          142.6                 8.1
MAP1       Highway Net.          267241       334931          850.2                44.8
MEMPLUS    Memory circuit         17758        54196          117.9                 4.3
SHYY161    Navier-Stokes          76480       152002          130.0                10.1
TORSO      3D FE Mesh            201142      1479989         1053.4                63.9
Kernighan/Lin much faster than Spectral Bisection!
31
Coordinate-Free Partitioning Summary
  • Several techniques for partitioning without
    coordinates
  • Breadth-First Search: simple, but not a great
    partition
  • Kernighan-Lin: a good corrector given a reasonable
    partition
  • Spectral Method: good partitions, but slow
  • Multilevel methods
  • Used to speed up problems that are too large/slow
  • Coarsen, partition, expand, improve
  • Can be used with K-L and Spectral methods and
    others
  • Speed/quality
  • For load balancing of grids, multi-level K-L
    probably best
  • For other partitioning problems (vision,
    clustering, etc.) spectral may be better
  • Good software available

32
Is Graph Partitioning a Solved Problem?
  • Myths of partitioning, due to Bruce Hendrickson
  • Edge cut = communication cost
  • Simple graphs are sufficient
  • Edge cut is the right metric
  • Existing tools solve the problem
  • Key is finding the right partition
  • Graph partitioning is a solved problem
  • Slides and myths based on Bruce Hendrickson's
    "Load Balancing Myths, Fictions & Legends"

33
Myth 1: Edge Cut = Communication Cost
  • Myth 1: the edge-cut deceit
  • edge-cut = communication cost
  • Not quite true
  • # of vertices on the boundary is the actual
    communication volume
  • Do not communicate same node value twice
  • Cost of communication depends on # of messages
    too (the α term)
  • Congestion may also affect communication cost
  • Why is this OK for most applications?
  • Mesh-based problems match the model: cost is
    # of edge cuts
  • Other problems (data mining, etc.) do not

34
Myth 2 Simple Graphs are Sufficient
  • Graphs often used to encode data dependencies
  • Do X before doing Y
  • Graph partitioning determines data partitioning
  • Assumes graph nodes can be evaluated in parallel
  • Communication on edges can also be done in
    parallel
  • Only dependence is between sweeps over the graph
  • More general graph models include
  • Hypergraph: nodes are computation, edges are
    communication, but an edge can connect a set (> 2)
    of nodes
  • Bipartite model: use a bipartite graph for a
    directed graph
  • Multi-object, Multi-Constraint model: use when a
    single structure may involve multiple
    computations with differing costs

35
Myth 3: Partition Quality is Paramount
  • When structures are changing dynamically during a
    simulation, we need to partition dynamically
  • Speed may be more important than quality
  • Partitioner must run fast in parallel
  • Partition should be incremental
  • Change minimally relative to prior one
  • Must not use too much memory
  • Example from Touheed, Selwood, Jimack and Berzins
  • 1 M elements with adaptive refinement on SGI
    Origin
  • Timing data for different partitioning
    algorithms
  • Repartition time from 3.0 to 15.2 secs
  • Migration time 17.8 to 37.8 secs
  • Solve time 2.54 to 3.11 secs

36
Load Balancing in General
  • In some communities, load balancing is equated
    with graph partitioning
  • Some load balancing problems do not fit this
    model
  • Made several assumptions about the problem
  • Task costs (node weights) are known
  • Communication volumes (edge weights) are known
  • Dependencies are known
  • For basic partitioning techniques covered in
    class, the dependencies were only between
    iterations
  • What if we have less information?

37
Load Balancing in General
  • Spectrum of solutions
  • Static - all information available before
    starting
  • Semi-Static - some info before starting
  • Dynamic - little or no info before starting
  • Survey of solutions
  • How each one works
  • Theoretical bounds, if any
  • When to use it
  • Enormous and diverse literature on load balancing
  • Computer Science systems
  • operating systems, compilers, distributed
    computing
  • Computer Science theory
  • Operations research (IEOR)
  • Application domains

38
Understanding Load Balancing Problems
  • Load balancing problems differ in
  • Task costs
  • Do all tasks have equal costs?
  • If not, when are the costs known?
  • Before starting, when task created, or only when
    task ends
  • Task dependencies
  • Can all tasks be run in any order (including
    parallel)?
  • If not, when are the dependencies known?
  • Before starting, when task created, or only when
    task ends
  • Locality
  • Is it important for some tasks to be scheduled on
    the same processor (or nearby) to reduce
    communication cost?
  • When is the information about communication
    between tasks known?

39
Task Cost Spectrum
40
Task Dependency Spectrum
41
Task Locality Spectrum (Data Dependencies)
42
Spectrum of Solutions
  • One of the key questions is when certain
    information about the load balancing problem is
    known
  • Leads to a spectrum of solutions
  • Static scheduling. All information is available
    to scheduling algorithm, which runs before any
    real computation starts. (offline algorithms)
  • Semi-static scheduling. Information may be known
    at program startup, or the beginning of each
    timestep, or at other well-defined points.
    Offline algorithms may be used even though the
    problem is dynamic.
  • Dynamic scheduling. Information is not known
    until mid-execution. (online algorithms)

43
Approaches
  • Static load balancing
  • Semi-static load balancing
  • Self-scheduling
  • Distributed task queues
  • Diffusion-based load balancing
  • DAG scheduling
  • Mixed Parallelism
  • Note: these are not all-inclusive, but they represent
    some of the problems for which good solutions
    exist.

44
Static Load Balancing
  • Static load balancing is used when all information
    is available in advance
  • Common cases
  • dense matrix algorithms, such as LU factorization
  • done using blocked/cyclic layout
  • blocked for locality, cyclic for load balance
  • most computations on a regular mesh, e.g., FFT
  • done using a cyclic + transpose + blocked layout
    for 1D
  • similar for higher dimensions, i.e., with
    transpose
  • sparse-matrix-vector multiplication
  • use graph partitioning
  • assumes graph does not change over time (or at
    least within a timestep during iterative solve)

45
Semi-Static Load Balance
  • If domain changes slowly over time and locality
    is important
  • use static algorithm
  • do some computation (usually one or more
    timesteps) allowing some load imbalance on later
    steps
  • recompute a new load balance using static
    algorithm
  • Often used in
  • particle simulations, particle-in-cell (PIC)
    methods
  • poor locality may be more of a problem than load
    imbalance as particles move from one grid
    partition to another
  • tree-structured computations (Barnes-Hut, etc.)
  • grid computations with dynamically changing grid,
    which changes slowly

46
Self-Scheduling
  • Self scheduling
  • Keep a centralized pool of tasks that are
    available to run
  • When a processor completes its current task, look
    at the pool
  • If the computation of one task generates more,
    add them to the pool
  • Originally used for
  • Scheduling loops by compiler (really the
    runtime-system)
  • Original paper by Tang and Yew, ICPP 1986

47
When is Self-Scheduling a Good Idea?
  • Useful when
  • A batch (or set) of tasks without dependencies
  • can also be used with dependencies, but most
    analysis has only been done for task sets without
    dependencies
  • The cost of each task is unknown
  • Locality is not important
  • Using a shared memory multiprocessor, so a
    centralized pool of tasks is fine

48
Variations on Self-Scheduling
  • Typically, we don't want to grab the smallest unit
    of parallel work.
  • Instead, choose a chunk of tasks of size K.
  • If K is large, access overhead for task queue is
    small
  • If K is small, we are likely to have even finish
    times (load balance)
  • Four variations
  • Use a fixed chunk size
  • Guided self-scheduling
  • Tapering
  • Weighted Factoring
  • Note there are more

49
Variation 1: Fixed Chunk Size
  • Kruskal and Weiss give a technique for computing
    the optimal chunk size
  • Requires a lot of information about the problem
    characteristics
  • e.g., task costs, number of tasks
  • Results in an off-line algorithm. Not very
    useful in practice.
  • For use in a compiler, for example, the compiler
    would have to estimate the cost of each task
  • All tasks must be known in advance

50
Variation 2: Guided Self-Scheduling
  • Idea: use larger chunks at the beginning to avoid
    excessive overhead and smaller chunks near the
    end to even out the finish times.
  • The chunk size Ki at the ith access to the task
    pool is given by
  • Ki = ceiling(Ri / p)
  • where Ri is the total number of tasks remaining
    and
  • p is the number of processors
  • See Polychronopoulos, "Guided Self-Scheduling: A
    Practical Scheduling Scheme for Parallel
    Supercomputers," IEEE Transactions on Computers,
    Dec. 1987.
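A small Python sketch of the resulting chunk sizes (Ki = ceiling(Ri/p)):

from math import ceil

def gss_chunks(n_tasks, p):
    chunks, remaining = [], n_tasks
    while remaining > 0:
        k = ceil(remaining / p)      # Ki = ceiling(Ri / p)
        chunks.append(k)
        remaining -= k
    return chunks

print(gss_chunks(100, 4))            # chunk sizes shrink from 25 down to 1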

51
Variation 3: Tapering
  • Idea: the chunk size Ki is a function of not
    only the remaining work, but also the task cost
    variance
  • variance is estimated using history information
  • high variance => small chunk size should be used
  • low variance => larger chunks OK
  • See S. Lucco, "Adaptive Parallel Programs," PhD
    Thesis, UCB, CSD-95-864, 1994.
  • Gives analysis (based on workload distribution)
  • Also gives experimental results -- tapering
    always works at least as well as GSS, although
    the difference is often small

52
Variation 4: Weighted Factoring
  • Idea: similar to self-scheduling, but divide task
    cost by the computational power of the requesting
    node
  • Useful for heterogeneous systems
  • Also useful for shared-resource NOWs, e.g., built
    using all the machines in a building
  • as with Tapering, historical information is used
    to predict future speed
  • speed may depend on the other loads currently
    on a given processor
  • See Hummel, Schmidt, Uma, and Wein, SPAA '96
  • includes experimental data and analysis

53
Distributed Task Queues
  • The obvious extension of self-scheduling to
    distributed memory is
  • a distributed task queue (or bag)
  • When are these a good idea?
  • Distributed memory multiprocessors
  • Or, shared memory with significant
    synchronization overhead
  • Locality is not (very) important
  • Tasks that are
  • known in advance, e.g., a bag of independent ones
  • or have dependencies, i.e., are generated on the
    fly
  • The cost of tasks is not known in advance

54
Theoretical Results
  • Main result: A simple randomized algorithm is
    optimal with high probability
  • Adler et al 95 show this for independent, equal
    sized tasks
  • throw balls into random bins
  • tight bounds on load imbalance show p log p
    tasks leads to good balance
  • Karp and Zhang 88 show this for a tree of unit
    cost (equal size) tasks
  • parent must be done before children, tree unfolds
    at runtime
  • children pushed to random processors
  • Blumofe and Leiserson 94 show this for a fixed
    task tree of variable cost tasks
  • their algorithm uses task pulling (stealing)
    instead of pushing, which is good for locality
  • i.e., when a processor becomes idle, it steals
    from a random processor
  • also have (loose) bounds on the total memory
    required
  • Chakrabarti et al 94 show this for a dynamic
    tree of variable cost tasks
  • works for branch and bound, i.e. the tree structure
    can depend on execution order
  • uses randomized pushing of tasks instead of
    pulling, so locality is worse
  • Open problem: does task pulling provably work
    well for dynamic trees?

55
Engineering Distributed Task Queues
  • A lot of papers on engineering these systems on
    various machines, and their applications
  • If nothing is known about task costs when created
  • organize local tasks as a stack (push/pop from
    top)
  • steal from the stack bottom (as if it were a
    queue), because old tasks are likely to cost more
  • If something is known about task costs and
    communication costs, it can be used as hints. (See
    Wen, UCB PhD, 1996.)
  • Part of Multipol (www.cs.berkeley.edu/projects/multipol)
  • Try to push tasks with a high ratio of cost to
    compute / cost to push
  • Ex: for matmul, ratio = 2n^3 * cost(flop) /
    (2n^2 * cost(send a word))
  • Goldstein, Rogers, Grunwald, and others
    (independent work) have all shown
  • advantages of integrating into the language
    framework
  • very lightweight thread creation
  • Cilk (Leiserson et al)
    (supertech.lcs.mit.edu/cilk)

56
Diffusion-Based Load Balancing
  • In the randomized schemes, the machine is treated
    as fully-connected.
  • Diffusion-based load balancing takes topology
    into account
  • Locality properties better than prior work
  • Load balancing somewhat slower than randomized
  • Cost of tasks must be known at creation time
  • No dependencies between tasks

57
Diffusion-based load balancing
  • The machine is modeled as a graph
  • At each step, we compute the weight of the tasks
    remaining on each processor
  • This is simply the number of tasks if they are
    unit-cost tasks
  • Each processor compares its weight with its
    neighbors and performs some averaging
  • Markov chain analysis
  • See Ghosh et al, SPAA '96, for a second-order
    diffusive load balancing algorithm
  • takes into account amount of work sent last time
  • avoids some oscillation of first-order schemes
  • Note: locality is still not a major concern,
    although balancing with neighbors may be better
    than random
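A Python sketch of one first-order diffusion step on the machine graph; the
uniform rate alpha is an assumption (a small value, e.g. at most
1/(max degree + 1), keeps the averaging stable), and total load is conserved
because each edge's transfer appears with opposite signs at its two
endpoints.

def diffusion_step(adj, load, alpha):
    # adj: processor -> neighbors in the machine graph; load: processor -> task weight
    new_load = dict(load)
    for p in adj:
        for q in adj[p]:
            new_load[p] += alpha * (load[q] - load[p])   # move load toward the lighter side
    return new_load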

58
DAG Scheduling
  • For some problems, you have a directed acyclic
    graph (DAG) of tasks
  • nodes represent computation (may be weighted)
  • edges represent orderings and usually
    communication (may also be weighted)
  • not that common to have the DAG in advance
  • Two application domains where DAGs are known
  • Digital Signal Processing computations
  • Sparse direct solvers (mainly Cholesky, since it
    doesn't require pivoting). More on this in
    another lecture.
  • The basic offline strategy: partition the DAG to
    minimize communication and keep all processors
    busy
  • NP-complete, so need approximations
  • Different from graph partitioning, which was for
    tasks with communication but no dependencies
  • See Gerasoulis and Yang, IEEE Transactions on
    PDS, June '93.

59
Mixed Parallelism
  • As another variation, consider a problem with 2
    levels of parallelism
  • coarse-grained task parallelism
  • good when many tasks, bad if few
  • fine-grained data parallelism
  • good when much parallelism within a task, bad if
    little
  • Appears in
  • Adaptive mesh refinement
  • Discrete event simulation, e.g., circuit
    simulation
  • Database query processing
  • Sparse matrix direct solvers

60
Mixed Parallelism Strategies
61
Which Strategy to Use
62
Switch Parallelism A Special Case
63
Simple Performance Model for Data Parallelism
64
(No Transcript)
65
Values of Sigma (Problem Size for Half Peak)
66
Modeling Performance
  • To predict performance, make assumptions about the
    task tree
  • complete tree with branching factor d ≥ 2
  • the d child tasks of a parent of size N are all of
    size N/c, c > 1
  • work to do a task of size N is O(N^a), a ≥ 1
  • Example: sign-function-based eigenvalue routine
  • d=2, c=4 (on average), a=1.5
  • Example: sparse Cholesky on a 2D mesh
  • d=4, c=4, a=1.5
  • Combine these assumptions with the model of data
    parallelism
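A small Python sketch of what these assumptions imply: level k of the tree
has d^k tasks of size N/c^k, so the total work at that level is
d^k * (N/c^k)^a, which shrinks geometrically when c^a > d.

def work_per_level(N, d, c, a, levels):
    return [d**k * (N / c**k) ** a for k in range(levels)]

# Eigensolver example above (d=2, c=4, a=1.5): work per level drops by 4x
print(work_per_level(1_000_000, 2, 4, 1.5, 4))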

67
Simulated Efficiency of Eigensolver
  • Starred lines are optimal mixed parallelism
  • Solid lines are data parallelism
  • Dashed lines are switched parallelism

68
Simulated efficiency of Sparse Cholesky
  • Starred lines are optimal mixed parallelism
  • Solid lines are data parallelism
  • Dashed lines are switched parallelism

69
Actual Speed of Sign Function Eigensolver
  • Starred lines are optimal mixed parallelism
  • Solid lines are data parallelism
  • Dashed lines are switched parallelism
  • Intel Paragon, built on ScaLAPACK
  • Switched parallelism worthwhile!