1
CS 267: Applications of Parallel Computers
Lecture 19: Graph Partitioning, Part II
  • Kathy Yelick
  • http://www-inst.eecs.berkeley.edu/cs267

2
Recap of Last Lecture
  • Partitioning with nodal coordinates
  • Inertial method
  • Projection onto a sphere
  • Algorithms are efficient
  • Rely on graphs having nodes connected (mostly) to
    nearest neighbors in space
  • Partitioning without nodal coordinates
  • Breadth-First Search: simple, but not a great
    partition
  • Kernighan-Lin: a good corrector given a reasonable
    partition
  • Spectral Method: good partitions, but slow
  • Today
  • Spectral methods revisited
  • Multilevel methods

3
Basic Definitions
  • Definition: The Laplacian matrix L(G) of a graph
    G(N,E) is an |N|-by-|N| symmetric matrix, with
    one row and column for each node. It is defined
    by
  • L(G)(i,i) = degree of node i (number of incident
    edges)
  • L(G)(i,j) = -1 if i ≠ j and there is an edge
    (i,j)
  • L(G)(i,j) = 0 otherwise

Example: graph G with nodes 1,…,5 and edges
(1,2), (1,3), (2,3), (3,4), (3,5), (4,5), and its Laplacian L(G):

         [  2  -1  -1   0   0 ]
         [ -1   2  -1   0   0 ]
L(G)  =  [ -1  -1   4  -1  -1 ]
         [  0   0  -1   2  -1 ]
         [  0   0  -1  -1   2 ]
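A minimal NumPy sketch (not from the slides) that builds this L(G) from the
edge list above and checks that the rows sum to zero:

import numpy as np

edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]  # edges of the example graph G
n = 5
L = np.zeros((n, n))
for i, j in edges:
    L[i - 1, i - 1] += 1        # L(G)(i,i) = degree of node i
    L[j - 1, j - 1] += 1
    L[i - 1, j - 1] = -1        # L(G)(i,j) = -1 for each edge (i,j), i != j
    L[j - 1, i - 1] = -1

print(L)                        # matches the 5x5 matrix above
print(L @ np.ones(n))           # rows sum to zero: L(G) * e = 0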
4
Properties of Laplacian Matrix
  • Theorem 1: Given G, L(G) has the following
    properties (proof on web page)
  • L(G) is symmetric.
  • This means the eigenvalues of L(G) are real and
    its eigenvectors are real and orthogonal.
  • Rows of L sum to zero
  • Let e = (1,…,1)^T, i.e. the column vector of all
    ones. Then L(G)*e = 0.
  • The eigenvalues of L(G) are nonnegative:
  • 0 = λ1 ≤ λ2 ≤ … ≤ λn
  • The number of connected components of G is equal
    to the number of λi equal to 0.
  • Definition: λ2(L(G)) is the algebraic
    connectivity of G
  • The magnitude of λ2 measures connectivity
  • In particular, λ2 ≠ 0 if and only if G is
    connected.

5
Spectral Bisection Algorithm
  • Spectral Bisection Algorithm
  • Compute eigenvector v2 corresponding to λ2(L(G))
  • For each node n of G
  • if v2(n) < 0, put node n in partition N-
  • else put node n in partition N+
  • Why does this make sense?
  • Recall λ2(L(G)) is the algebraic connectivity of
    G
  • Theorem (Fiedler): Let G1(N,E1) be a subgraph of
    G(N,E), so that G1 is less connected than G.
    Then λ2(L(G1)) ≤ λ2(L(G)), i.e. the algebraic
    connectivity of G1 is less than or equal to the
    algebraic connectivity of G. (proof on web page)
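A minimal Python/NumPy sketch of the algorithm above; it uses a dense
eigensolver for clarity, whereas a large graph would use a Lanczos-based
sparse solver (see below). Nodes with v2(n) = 0 are arbitrarily sent to N+
here.

import numpy as np

def spectral_bisection(L):
    # L: n-by-n graph Laplacian; returns node index lists (N_plus, N_minus)
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    v2 = eigvecs[:, 1]                       # eigenvector for lambda_2 (Fiedler vector)
    N_minus = [n for n in range(L.shape[0]) if v2[n] < 0]
    N_plus = [n for n in range(L.shape[0]) if v2[n] >= 0]
    return N_plus, N_minus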

6
Motivation for Spectral Bisection
  • Vibrating string has modes of vibration, or
    harmonics
  • Modes computable as follows
  • Model string as masses connected by springs (a 1D
    mesh)
  • Write down F = ma for the coupled system, get matrix A
  • Eigenvalues and eigenvectors of A are frequencies
    and shapes of modes
  • Label nodes by whether the mode is - or + there to
    get N- and N+
  • Same idea for other graphs (e.g. a planar graph
    behaves like a trampoline)

7
Eigenvectors of L(1D mesh)
Eigenvector 1 (all ones)
Eigenvector 2
Eigenvector 3
8
2nd eigenvector of L(planar mesh)
9
Computing v2 and λ2 of L(G) using Lanczos
  • Given any n-by-n symmetric matrix A (such as
    L(G)), Lanczos computes a k-by-k approximation
    T by doing k matrix-vector products, k << n
  • Approximate A's eigenvalues/vectors using T's

Choose an arbitrary starting vector r
b(0) = ||r||
j = 0
repeat
    j = j+1
    q(j) = r / b(j-1)          … scale a vector
    r = A * q(j)               … matrix-vector multiplication,
                                 the most expensive step
    r = r - b(j-1)*q(j-1)      … saxpy, or scalar*vector + vector
    a(j) = q(j)^T * r          … dot product
    r = r - a(j)*q(j)          … saxpy
    b(j) = ||r||               … compute vector norm
until convergence              … details omitted
        [ a(1)  b(1)                               ]
        [ b(1)  a(2)  b(2)                         ]
T   =   [       b(2)  a(3)  b(3)                   ]
        [              …     …       …             ]
        [             b(k-2) a(k-1)  b(k-1)        ]
        [                    b(k-1)  a(k)          ]
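A Python/NumPy sketch of the loop above, run for a fixed number of steps and
with no reorthogonalization or convergence test (both matter in practice):

import numpy as np

def lanczos(A, k, seed=0):
    # Build the k-by-k tridiagonal T whose eigenvalues approximate those of
    # the symmetric matrix A, using k matrix-vector products.
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(n)               # arbitrary starting vector
    b = np.linalg.norm(r)                    # b(0)
    q_prev = np.zeros(n)
    alphas, betas = [], []
    for _ in range(k):
        q = r / b                            # scale a vector
        r = A @ q                            # matrix-vector multiply, the most expensive step
        r = r - b * q_prev                   # saxpy
        a = q @ r                            # dot product
        r = r - a * q                        # saxpy
        b = np.linalg.norm(r)                # vector norm
        alphas.append(a)
        betas.append(b)
        q_prev = q
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    return T                                 # use T's eigenvalues/vectors to approximate A's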
10
Spectral Bisection Summary
  • Laplacian matrix represents graph connectivity
  • Second eigenvector gives a graph bisection
  • Roughly equal weights in two parts
  • Weak connection in the graph will be separator
  • Implementation via the Lanczos Algorithm
  • To optimize sparse-matrix-vector multiply, we
    graph partition
  • To graph partition, we find an eigenvector of a
    matrix associated with the graph
  • To find an eigenvector, we do sparse-matrix
    vector multiply
  • Have we made progress?
  • The first matrix-vector multiplies are slow, but
    use them to learn how to make the rest faster

11
Introduction to Multilevel Partitioning
  • If we want to partition G(N,E), but it is too big
    to do efficiently, what can we do?
  • 1) Replace G(N,E) by a coarse approximation
    Gc(Nc,Ec), and partition Gc instead
  • 2) Use partition of Gc to get a rough
    partitioning of G, and then iteratively improve
    it
  • What if Gc still too big?
  • Apply same idea recursively

12
Multilevel Partitioning - High Level Algorithm
(N+, N-) = Multilevel_Partition(N, E)
    … recursive partitioning routine returns N+ and N-
      where N = N+ U N-
    if |N| is small
        (1) Partition G = (N, E) directly to get N = N+ U N-
            Return (N+, N-)
    else
        (2) Coarsen G to get an approximation Gc = (Nc, Ec)
        (3) (Nc+, Nc-) = Multilevel_Partition(Nc, Ec)
        (4) Expand (Nc+, Nc-) to a partition (N+, N-) of N
        (5) Improve the partition (N+, N-)
            Return (N+, N-)
    endif
(Figure: V-cycle. Steps (2,3) coarsen the graph on the way down,
step (1) partitions the smallest graph at the bottom, and steps
(4) expand and (5) improve on the way back up.)
How do we Coarsen? Expand? Improve?
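A Python sketch of the Multilevel_Partition recursion above. The coarsen,
expand, and improve arguments are placeholders for the schemes on the
following slides (maximal matching or independent sets for coarsening,
Kernighan-Lin or Rayleigh Quotient Iteration for improvement); the base-case
split and the node_map returned by coarsen are illustrative assumptions, not
part of the slides.

def multilevel_partition(N, E, coarsen, expand, improve, threshold=32):
    # Returns (N_plus, N_minus) with N = N_plus U N_minus.
    if len(N) <= threshold:
        nodes = sorted(N)                                  # (1) partition G directly
        half = len(nodes) // 2                             #     (any direct method will do)
        return set(nodes[:half]), set(nodes[half:])
    Nc, Ec, node_map = coarsen(N, E)                       # (2) coarsen G to Gc = (Nc, Ec)
    Nc_plus, Nc_minus = multilevel_partition(              # (3) recurse on the coarse graph
        Nc, Ec, coarsen, expand, improve, threshold)
    N_plus, N_minus = expand(Nc_plus, Nc_minus, node_map)  # (4) expand to a partition of N
    return improve(N_plus, N_minus, E)                     # (5) improve (e.g., Kernighan-Lin)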
13
Multilevel Kernighan-Lin
  • Coarsen graph and expand partition using maximal
    matchings
  • Improve partition using Kernighan-Lin

14
Maximal Matching
  • Definition: A matching of a graph G(N,E) is a
    subset Em of E such that no two edges in Em share
    an endpoint
  • Definition: A maximal matching of a graph G(N,E)
    is a matching Em to which no more edges can be
    added and remain a matching
  • A simple greedy algorithm computes a maximal
    matching

let Em be empty
mark all nodes in N as unmatched
for i = 1 to |N|               … visit the nodes in any order
    if i has not been matched
        mark i as matched
        if there is an edge e = (i,j) where j is also unmatched,
            add e to Em
            mark j as matched
        endif
    endif
endfor
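The same greedy loop as a Python sketch, with the graph given as an
adjacency list (a dict mapping each node to its neighbors):

def maximal_matching(adj):
    Em = []                              # the matching under construction
    matched = set()
    for i in adj:                        # visit the nodes in any order
        if i not in matched:
            matched.add(i)
            for j in adj[i]:
                if j not in matched:     # pair i with its first unmatched neighbor
                    Em.append((i, j))
                    matched.add(j)
                    break
    return Em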
15
Maximal Matching Example
16
Coarsening using a maximal matching
1) Construct a maximal matching Em of G(N,E)
for all edges e = (j,k) in Em          … 2) collapse matched nodes into a single one
    Put node n(e) in Nc
    W(n(e)) = W(j) + W(k)              … gray statements update node/edge weights
for all nodes n in N not incident on an edge in Em    … 3) add unmatched nodes
    Put n in Nc
    do not change W(n)
… Now each node r in N is inside a unique node n(r) in Nc
… 4) Connect two nodes in Nc if nodes inside them are connected in E
for all edges e = (j,k) in Em
    for each other edge e' = (j,r) in E incident on j
        Put edge ee = (n(e), n(r)) in Ec
        W(ee) = W(e')
    for each other edge e' = (r,k) in E incident on k
        Put edge ee = (n(r), n(e)) in Ec
        W(ee) = W(e')
If there are multiple edges connecting two nodes in Nc,
collapse them, adding edge weights
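A Python sketch of the same coarsening, under the assumption that node and
edge weights are given as dicts (W_node, W_edge). It loops over all fine
edges and drops those internal to a coarse node, which yields the same Ec
and weights as the per-matched-edge loop above; multi-edges are collapsed by
summing.

def coarsen_by_matching(N, E, W_node, W_edge, Em):
    n_of = {}                                    # fine node -> coarse node id
    Wc_node = {}
    for c, (j, k) in enumerate(Em):              # 2) collapse matched pairs
        n_of[j] = n_of[k] = c
        Wc_node[c] = W_node[j] + W_node[k]
    next_id = len(Em)
    for r in N:                                  # 3) add unmatched nodes unchanged
        if r not in n_of:
            n_of[r] = next_id
            Wc_node[next_id] = W_node[r]
            next_id += 1
    Wc_edge = {}                                 # 4) connect coarse nodes
    for (j, k) in E:
        cj, ck = n_of[j], n_of[k]
        if cj != ck:                             # skip edges inside a coarse node
            key = (min(cj, ck), max(cj, ck))
            Wc_edge[key] = Wc_edge.get(key, 0) + W_edge[(j, k)]
    return set(Wc_node), Wc_edge, Wc_node, n_of  # Nc, weighted Ec, W(Nc), node mapping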

17
Example of Coarsening
18
Expanding a partition of Gc to a partition of G
19
Multilevel Spectral Bisection
  • Coarsen graph and expand partition using maximal
    independent sets
  • Improve partition using Rayleigh Quotient
    Iteration

20
Maximal Independent Sets
  • Definition: An independent set of a graph G(N,E)
    is a subset Ni of N such that no two nodes in Ni
    are connected by an edge
  • Definition: A maximal independent set of a graph
    G(N,E) is an independent set Ni to which no more
    nodes can be added and remain an independent set
  • A simple greedy algorithm computes a maximal
    independent set

let Ni be empty
for k = 1 to |N|               … visit the nodes in any order
    if node k is not adjacent to any node already in Ni
        add k to Ni
    endif
endfor
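The same greedy loop as a Python sketch (adjacency given as a dict mapping
each node to a set of neighbors):

def maximal_independent_set(adj):
    Ni = set()
    for k in adj:                    # visit the nodes in any order
        if not (adj[k] & Ni):        # k is not adjacent to any node already in Ni
            Ni.add(k)
    return Ni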
21
Coarsening using Maximal Independent Sets
… Build domains D(k) around each node k in Ni to get the nodes in Nc
… Add an edge to Ec whenever it would connect two such domains
Ec = empty set
for all nodes k in Ni
    D(k) = ( {k}, empty set )
        … first set contains nodes in D(k), second set contains edges in D(k)
unmark all edges in E
repeat
    choose an unmarked edge e = (k,j) from E
    if exactly one of k and j (say k) is in some D(m)
        mark e; add j and e to D(m)
    else if k and j are in two different D(m)'s (say D(mi) and D(mj))
        mark e; add edge (mi, mj) to Ec
    else if both k and j are in the same D(m)
        mark e; add e to D(m)
    else
        leave e unmarked
    endif
until no unmarked edges
22
Example of Coarsening
(Figure: outlined regions enclose the domains D(k); the highlighted
nodes are the nodes of Nc)
23
Expanding a partition of Gc to a partition of G
  • Need to convert an eigenvector vc of L(Gc) to an
    approximate eigenvector v of L(G)
  • Use interpolation

For each node j in N
    if j is also a node in Nc, then
        v(j) = vc(j)           … use same eigenvector component
    else
        v(j) = average of vc(k) for all neighbors k of j in Nc
    endif
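A Python sketch of this interpolation, assuming the nodes of Nc reuse the
ids of the corresponding nodes of G (as in the independent-set coarsening)
and vc is a dict over Nc. By maximality of the independent set, every node
not in Nc has at least one neighbor in Nc, so the average is well defined.

def expand_eigenvector(nodes, adj, vc):
    v = {}
    for j in nodes:
        if j in vc:
            v[j] = vc[j]                               # use same eigenvector component
        else:
            nbrs = [k for k in adj[j] if k in vc]      # neighbors of j that are in Nc
            v[j] = sum(vc[k] for k in nbrs) / len(nbrs)
    return v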
24
Example: 1D mesh of 9 nodes
25
Improve eigenvector: Rayleigh Quotient Iteration
j = 0
pick starting vector v(0) from expanding vc
repeat
    j = j+1
    r(j) = v(j-1)^T * L(G) * v(j-1)
        … r(j) = Rayleigh Quotient of v(j-1),
          a good approximate eigenvalue
    v(j) = (L(G) - r(j)*I)^(-1) * v(j-1)
        … expensive to do exactly, so solve approximately using an
          iteration called SYMMLQ, which uses matrix-vector
          multiply (no surprise)
    v(j) = v(j) / ||v(j)||     … normalize v(j)
until v(j) converges
Convergence is very fast: cubic
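A Python/NumPy sketch of the iteration above, using a fixed number of steps
and an exact dense solve in place of the approximate SYMMLQ solve mentioned
on the slide:

import numpy as np

def rayleigh_quotient_iteration(L, v0, steps=5):
    n = L.shape[0]
    v = v0 / np.linalg.norm(v0)                       # starting vector from expanding vc
    for _ in range(steps):
        rho = v @ L @ v                               # Rayleigh Quotient: approximate eigenvalue
        v = np.linalg.solve(L - rho * np.eye(n), v)   # v = (L - rho*I)^-1 * v
        v = v / np.linalg.norm(v)                     # normalize
    return v @ L @ v, v                               # final eigenvalue estimate and eigenvector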
26
Example of convergence for 1D mesh
27
Available Implementations
  • Multilevel Kernighan/Lin
  • METIS (www.cs.umn.edu/metis)
  • ParMETIS - parallel version
  • Multilevel Spectral Bisection
  • S. Barnard and H. Simon, "A fast multilevel
    implementation of recursive spectral bisection,"
    Proc. 6th SIAM Conf. on Parallel Processing,
    1993
  • Chaco (www.cs.sandia.gov/CRF/papers_chaco.html)
  • Hybrids are possible
  • Ex: using Kernighan/Lin to improve a partition
    from spectral bisection

28
Comparison of methods
  • Compare only methods that use edges, not nodal
    coordinates
  • CS267 webpage and KK95a (see below) have other
    comparisons
  • Metrics
  • Speed of partitioning
  • Number of edge cuts
  • Other application dependent metrics
  • Summary
  • No one method best
  • Multi-level Kernighan/Lin fastest by far,
    comparable to Spectral in the number of edge cuts
  • www-users.cs.umn.edu/karypis/metis/publications/mail.html
  • see publications KK95a and KK95b
  • Spectral gives much better cuts for some
    applications
  • Ex: image segmentation
  • www.cs.berkeley.edu/jshi/Grouping/overview.html
  • see "Normalized Cuts and Image Segmentation"

29
Number of edges cut for a 64-way partition
For Multilevel Kernighan/Lin, as implemented in METIS (see KK95a)

Graph      Description       # of Nodes   # of Edges   Edges cut for      Expected cuts   Expected cuts
                                                        64-way partition   for 2D mesh     for 3D mesh
144        3D FE Mesh            144649      1074393          88806             6427           31805
4ELT       2D FE Mesh             15606        45878           2965             2111            7208
ADD32      32-bit adder            4960         9462            675             1190            3357
AUTO       3D FE Mesh            448695      3314611         194436            11320           67647
BBMAT      2D Stiffness M.        38744       993481          55753             3326           13215
FINAN512   Lin. Prog.             74752       261120          11388             4620           20481
LHR10      Chem. Eng.             10672       209093          58784             1746            5595
MAP1       Highway Net.          267241       334931           1388             8736           47887
MEMPLUS    Memory circuit         17758        54196          17894             2252            7856
SHYY161    Navier-Stokes          76480       152002           4365             4674           20796
TORSO      3D FE Mesh            201142      1479989         117997             7579           39623

Expected cuts for a 64-way partition of a 2D mesh of n nodes:
    n^(1/2) + 2*(n/2)^(1/2) + 4*(n/4)^(1/2) + … + 32*(n/32)^(1/2)  ≈  17 * n^(1/2)
Expected cuts for a 64-way partition of a 3D mesh of n nodes:
    n^(2/3) + 2*(n/2)^(2/3) + 4*(n/4)^(2/3) + … + 32*(n/32)^(2/3)  ≈  11.5 * n^(2/3)
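A quick check of the constants above: both are six-term geometric sums, one
term per level of recursive bisection.

s2d = sum(2**i * (1 / 2**i) ** (1 / 2) for i in range(6))   # about 16.9, i.e. ~17
s3d = sum(2**i * (1 / 2**i) ** (2 / 3) for i in range(6))   # about 11.5
print(round(s2d, 1), round(s3d, 1))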
30
Speed of 256-way partitioning (from KK95a)
Partitioning time in seconds

Graph      Description       # of Nodes   # of Edges   Multilevel Spectral   Multilevel
                                                        Bisection             Kernighan/Lin
144        3D FE Mesh            144649      1074393          607.3                48.1
4ELT       2D FE Mesh             15606        45878           25.0                 3.1
ADD32      32-bit adder            4960         9462           18.7                 1.6
AUTO       3D FE Mesh            448695      3314611         2214.2               179.2
BBMAT      2D Stiffness M.        38744       993481          474.2                25.5
FINAN512   Lin. Prog.             74752       261120          311.0                18.0
LHR10      Chem. Eng.             10672       209093          142.6                 8.1
MAP1       Highway Net.          267241       334931          850.2                44.8
MEMPLUS    Memory circuit         17758        54196          117.9                 4.3
SHYY161    Navier-Stokes          76480       152002          130.0                10.1
TORSO      3D FE Mesh            201142      1479989         1053.4                63.9
Kernighan/Lin much faster than Spectral Bisection!
31
Coordinate-Free Partitioning Summary
  • Several techniques for partitioning without
    coordinates
  • Breadth-First Search: simple, but not a great
    partition
  • Kernighan-Lin: a good corrector given a reasonable
    partition
  • Spectral Method: good partitions, but slow
  • Multilevel methods
  • Used to speed up problems that are too large/slow
  • Coarsen, partition, expand, improve
  • Can be used with K-L and Spectral methods and
    others
  • Speed/quality
  • For load balancing of grids, multi-level K-L
    probably best
  • For other partitioning problems (vision,
    clustering, etc.) spectral may be better
  • Good software available

32
Is Graph Partitioning a Solved Problem?
  • Myths of partitioning, due to Bruce Hendrickson
  • Edge cut = communication cost
  • Simple graphs are sufficient
  • Edge cut is the right metric
  • Existing tools solve the problem
  • Key is finding the right partition
  • Graph partitioning is a solved problem
  • Slides and myths based on Bruce Hendrickson's
    "Load Balancing Myths, Fictions & Legends"

33
Myth 1: Edge Cut = Communication Cost
  • Myth 1: the edge-cut deceit
  • edge-cut = communication cost
  • Not quite true
  • # of vertices on the boundary is the actual
    communication volume
  • Do not communicate same node value twice
  • Cost of communication depends on # of messages
    too (the α term)
  • Congestion may also affect communication cost
  • Why is this OK for most applications?
  • Mesh-based problems match the model: cost is
    # of edge cuts
  • Other problems (data mining, etc.) do not

34
Myth 2 Simple Graphs are Sufficient
  • Graphs often used to encode data dependencies
  • Do X before doing Y
  • Graph partitioning determines data partitioning
  • Assumes graph nodes can be evaluated in parallel
  • Communication on edges can also be done in
    parallel
  • Only dependence is between sweeps over the graph
  • More general graph models include
  • Hypergraph: nodes are computation, edges are
    communication, but an edge can connect a set (> 2)
    of nodes
  • Bipartite model: use a bipartite graph for a
    directed graph
  • Multi-object, Multi-Constraint model: use when a
    single structure may involve multiple
    computations with differing costs

35
Myth 3: Partition Quality is Paramount
  • When structures are changing dynamically during a
    simulation, we need to partition dynamically
  • Speed may be more important than quality
  • Partitioner must run fast in parallel
  • Partition should be incremental
  • Change minimally relative to prior one
  • Must not use too much memory
  • Example from Touheed, Selwood, Jimack and Berzins
  • 1 M elements with adaptive refinement on SGI
    Origin
  • Timing data for different partitioning
    algorithms
  • Repartition time from 3.0 to 15.2 secs
  • Migration time 17.8 to 37.8 secs
  • Solve time 2.54 to 3.11 secs

36
Load Balancing in General
  • In some communities, load balancing is equated
    with graph partitioning
  • Some load balancing problems do not fit this
    model
  • Made several assumptions about the problem
  • Task costs (node weights) are known
  • Communication volumes (edge weights) are known
  • Dependencies are known
  • For basic partitioning techniques covered in
    class, the dependencies were only between
    iterations
  • What if we have less information?

37
Load Balancing in General
  • Spectrum of solutions
  • Static - all information available before
    starting
  • Semi-Static - some info before starting
  • Dynamic - little or no info before starting
  • Survey of solutions
  • How each one works
  • Theoretical bounds, if any
  • When to use it
  • Enormous and diverse literature on load balancing
  • Computer Science systems
  • operating systems, compilers, distributed
    computing
  • Computer Science theory
  • Operations research (IEOR)
  • Application domains

38
Understanding Load Balancing Problems
  • Load balancing problems differ in
  • Task costs
  • Do all tasks have equal costs?
  • If not, when are the costs known?
  • Before starting, when task created, or only when
    task ends
  • Task dependencies
  • Can all tasks be run in any order (including
    parallel)?
  • If not, when are the dependencies known?
  • Before starting, when task created, or only when
    task ends
  • Locality
  • Is it important for some tasks to be scheduled on
    the same processor (or nearby) to reduce
    communication cost?
  • When is the information about communication
    between tasks known?

39
Task Cost Spectrum
40
Task Dependency Spectrum
41
Task Locality Spectrum (Data Dependencies)
42
Spectrum of Solutions
  • One of the key questions is when certain
    information about the load balancing problem is
    known
  • Leads to a spectrum of solutions
  • Static scheduling. All information is available
    to scheduling algorithm, which runs before any
    real computation starts. (offline algorithms)
  • Semi-static scheduling. Information may be known
    at program startup, or the beginning of each
    timestep, or at other well-defined points.
    Offline algorithms may be used even though the
    problem is dynamic.
  • Dynamic scheduling. Information is not known
    until mid-execution. (online algorithms)

43
Approaches
  • Static load balancing
  • Semi-static load balancing
  • Self-scheduling
  • Distributed task queues
  • Diffusion-based load balancing
  • DAG scheduling
  • Mixed Parallelism
  • Note: these are not all-inclusive, but they represent
    some of the problems for which good solutions
    exist.

44
Static Load Balancing
  • Static load balancing is used when all information
    is available in advance
  • Common cases
  • dense matrix algorithms, such as LU factorization
  • done using blocked/cyclic layout
  • blocked for locality, cyclic for load balance
  • most computations on a regular mesh, e.g., FFT
  • done using a cyclic + transpose + blocked layout
    for 1D
  • similar for higher dimensions, i.e., with
    transpose
  • sparse-matrix-vector multiplication
  • use graph partitioning
  • assumes graph does not change over time (or at
    least within a timestep during iterative solve)

45
Semi-Static Load Balance
  • If domain changes slowly over time and locality
    is important
  • use static algorithm
  • do some computation (usually one or more
    timesteps) allowing some load imbalance on later
    steps
  • recompute a new load balance using static
    algorithm
  • Often used in
  • particle simulations, particle-in-cell (PIC)
    methods
  • poor locality may be more of a problem than load
    imbalance as particles move from one grid
    partition to another
  • tree-structured computations (Barnes-Hut, etc.)
  • grid computations with dynamically changing grid,
    which changes slowly

46
Self-Scheduling
  • Self scheduling
  • Keep a centralized pool of tasks that are
    available to run
  • When a processor completes its current task, look
    at the pool
  • If the computation of one task generates more,
    add them to the pool
  • Originally used for
  • Scheduling loops by compiler (really the
    runtime-system)
  • Original paper by Tang and Yew, ICPP 1986

47
When is Self-Scheduling a Good Idea?
  • Useful when
  • A batch (or set) of tasks without dependencies
  • can also be used with dependencies, but most
    analysis has only been done for task sets without
    dependencies
  • The cost of each task is unknown
  • Locality is not important
  • Using a shared memory multiprocessor, so a
    centralized pool of tasks is fine

48
Variations on Self-Scheduling
  • Typically, we don't want to grab the smallest unit
    of parallel work.
  • Instead, choose a chunk of tasks of size K.
  • If K is large, access overhead for task queue is
    small
  • If K is small, we are likely to have even finish
    times (load balance)
  • Four variations
  • Use a fixed chunk size
  • Guided self-scheduling
  • Tapering
  • Weighted Factoring
  • Note there are more

49
Variation 1: Fixed Chunk Size
  • Kruskal and Weiss give a technique for computing
    the optimal chunk size
  • Requires a lot of information about the problem
    characteristics
  • e.g., task costs, number of tasks
  • Results in an off-line algorithm. Not very
    useful in practice.
  • For use in a compiler, for example, the compiler
    would have to estimate the cost of each task
  • All tasks must be known in advance

50
Variation 2: Guided Self-Scheduling
  • Idea: use larger chunks at the beginning to avoid
    excessive overhead and smaller chunks near the
    end to even out the finish times.
  • The chunk size Ki at the ith access to the task
    pool is given by
  • Ki = ceiling(Ri / p)
  • where Ri is the total number of tasks remaining
    and
  • p is the number of processors
  • See Polychronopoulos, "Guided Self-Scheduling: A
    Practical Scheduling Scheme for Parallel
    Supercomputers," IEEE Transactions on Computers,
    Dec. 1987.
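A small Python sketch of the resulting chunk sizes (Ki = ceiling(Ri/p)):

from math import ceil

def gss_chunks(n_tasks, p):
    chunks, remaining = [], n_tasks
    while remaining > 0:
        k = ceil(remaining / p)      # Ki = ceiling(Ri / p)
        chunks.append(k)
        remaining -= k
    return chunks

print(gss_chunks(100, 4))            # chunk sizes shrink from 25 down to 1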

51
Variation 3: Tapering
  • Idea: the chunk size Ki is a function of not
    only the remaining work, but also the task cost
    variance
  • variance is estimated using history information
  • high variance => small chunk size should be used
  • low variance => larger chunks OK
  • See S. Lucco, "Adaptive Parallel Programs," PhD
    Thesis, UCB, CSD-95-864, 1994.
  • Gives analysis (based on workload distribution)
  • Also gives experimental results -- tapering
    always works at least as well as GSS, although
    the difference is often small

52
Variation 4: Weighted Factoring
  • Idea: similar to self-scheduling, but divide task
    cost by the computational power of the requesting
    node
  • Useful for heterogeneous systems
  • Also useful for shared-resource NOWs, e.g., built
    using all the machines in a building
  • as with Tapering, historical information is used
    to predict future speed
  • speed may depend on the other loads currently
    on a given processor
  • See Hummel, Schmidt, Uma, and Wein, SPAA '96
  • includes experimental data and analysis

53
Distributed Task Queues
  • The obvious extension of self-scheduling to
    distributed memory is
  • a distributed task queue (or bag)
  • When are these a good idea?
  • Distributed memory multiprocessors
  • Or, shared memory with significant
    synchronization overhead
  • Locality is not (very) important
  • Tasks that are
  • known in advance, e.g., a bag of independent ones
  • or have dependencies, i.e., are generated on the
    fly
  • The cost of tasks is not known in advance

54
Theoretical Results
  • Main result: A simple randomized algorithm is
    optimal with high probability
  • Adler et al 95 show this for independent, equal
    sized tasks
  • throw balls into random bins
  • tight bounds on load imbalance show p log p
    tasks leads to good balance
  • Karp and Zhang 88 show this for a tree of unit
    cost (equal size) tasks
  • parent must be done before children, tree unfolds
    at runtime
  • children pushed to random processors
  • Blumofe and Leiserson 94 show this for a fixed
    task tree of variable cost tasks
  • their algorithm uses task pulling (stealing)
    instead of pushing, which is good for locality
  • i.e., when a processor becomes idle, it steals
    from a random processor
  • also have (loose) bounds on the total memory
    required
  • Chakrabarti et al 94 show this for a dynamic
    tree of variable cost tasks
  • works for branch and bound, i.e. the tree structure
    can depend on execution order
  • uses randomized pushing of tasks instead of
    pulling, so locality is worse
  • Open problem: does task pulling provably work
    well for dynamic trees?

55
Engineering Distributed Task Queues
  • A lot of papers on engineering these systems on
    various machines, and their applications
  • If nothing is known about task costs when created
  • organize local tasks as a stack (push/pop from
    top)
  • steal from the stack bottom (as if it were a
    queue), because old tasks are likely to cost more
  • If something is known about task costs and
    communication costs, it can be used as hints. (See
    Wen, UCB PhD, 1996.)
  • Part of Multipol (www.cs.berkeley.edu/projects/multipol)
  • Try to push tasks with a high ratio of cost to
    compute / cost to push
  • Ex: for matmul, ratio = 2n^3 * cost(flop) /
    (2n^2 * cost(send a word))
  • Goldstein, Rogers, Grunwald, and others
    (independent work) have all shown
  • advantages of integrating into the language
    framework
  • very lightweight thread creation
  • Cilk (Leiserson et al)
    (supertech.lcs.mit.edu/cilk)

56
Diffusion-Based Load Balancing
  • In the randomized schemes, the machine is treated
    as fully-connected.
  • Diffusion-based load balancing takes topology
    into account
  • Locality properties better than prior work
  • Load balancing somewhat slower than randomized
  • Cost of tasks must be known at creation time
  • No dependencies between tasks

57
Diffusion-based load balancing
  • The machine is modeled as a graph
  • At each step, we compute the weight of the tasks
    remaining on each processor
  • This is simply the number of tasks if they are
    unit-cost tasks
  • Each processor compares its weight with its
    neighbors and performs some averaging
  • Markov chain analysis
  • See Ghosh et al, SPAA '96, for a second-order
    diffusive load balancing algorithm
  • takes into account amount of work sent last time
  • avoids some oscillation of first-order schemes
  • Note: locality is still not a major concern,
    although balancing with neighbors may be better
    than random
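A Python sketch of one first-order diffusion step on the machine graph; the
uniform rate alpha is an assumption (a small value, e.g. at most
1/(max degree + 1), keeps the averaging stable), and total load is conserved
because each edge's transfer appears with opposite signs at its two
endpoints.

def diffusion_step(adj, load, alpha):
    # adj: processor -> neighbors in the machine graph; load: processor -> task weight
    new_load = dict(load)
    for p in adj:
        for q in adj[p]:
            new_load[p] += alpha * (load[q] - load[p])   # move load toward the lighter side
    return new_load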

58
DAG Scheduling
  • For some problems, you have a directed acyclic
    graph (DAG) of tasks
  • nodes represent computation (may be weighted)
  • edges represent orderings and usually
    communication (may also be weighted)
  • not that common to have the DAG in advance
  • Two application domains where DAGs are known
  • Digital Signal Processing computations
  • Sparse direct solvers (mainly Cholesky, since it
    doesn't require pivoting). More on this in
    another lecture.
  • The basic offline strategy: partition the DAG to
    minimize communication and keep all processors
    busy
  • NP-complete, so need approximations
  • Different from graph partitioning, which was for
    tasks with communication but no dependencies
  • See Gerasoulis and Yang, IEEE Transactions on
    PDS, June '93.

59
Mixed Parallelism
  • As another variation, consider a problem with 2
    levels of parallelism
  • coarse-grained task parallelism
  • good when many tasks, bad if few
  • fine-grained data parallelism
  • good when much parallelism within a task, bad if
    little
  • Appears in
  • Adaptive mesh refinement
  • Discrete event simulation, e.g., circuit
    simulation
  • Database query processing
  • Sparse matrix direct solvers

60
Mixed Parallelism Strategies
61
Which Strategy to Use
62
Switch Parallelism A Special Case
63
Simple Performance Model for Data Parallelism
64
(No Transcript)
65
Values of Sigma (Problem Size for Half Peak)
66
Modeling Performance
  • To predict performance, make assumptions about the
    task tree
  • complete tree with branching factor d ≥ 2
  • the d child tasks of a parent of size N are all of
    size N/c, c > 1
  • work to do a task of size N is O(N^a), a ≥ 1
  • Example: sign-function-based eigenvalue routine
  • d=2, c=4 (on average), a=1.5
  • Example: sparse Cholesky on a 2D mesh
  • d=4, c=4, a=1.5
  • Combine these assumptions with the model of data
    parallelism
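A small Python sketch of what these assumptions imply: level k of the tree
has d^k tasks of size N/c^k, so the total work at that level is
d^k * (N/c^k)^a, which shrinks geometrically when c^a > d.

def work_per_level(N, d, c, a, levels):
    return [d**k * (N / c**k) ** a for k in range(levels)]

# Eigensolver example above (d=2, c=4, a=1.5): work per level drops by 4x
print(work_per_level(1_000_000, 2, 4, 1.5, 4))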

67
Simulated Efficiency of Eigensolver
  • Starred lines are optimal mixed parallelism
  • Solid lines are data parallelism
  • Dashed lines are switched parallelism

68
Simulated efficiency of Sparse Cholesky
  • Starred lines are optimal mixed parallelism
  • Solid lines are data parallelism
  • Dashed lines are switched parallelism

69
Actual Speed of Sign Function Eigensolver
  • Starred lines are optimal mixed parallelism
  • Solid lines are data parallelism
  • Dashed lines are switched parallelism
  • Intel Paragon, built on ScaLAPACK
  • Switched parallelism worthwhile!