1
Sparse LA
  • Sathish Vadhiyar

2
Motivation
  • Sparse computations are much more challenging than
    dense ones due to complex data structures and memory
    references
  • Many physical systems produce sparse matrices
  • Most of the research, and the base case, is in sparse
    symmetric positive definite matrices

3
Sparse Cholesky
  • To solve Ax = b
  • Factor A = LL^T, then solve Ly = b and L^Tx = y
  • Cholesky factorization introduces fill-in

4
Column-oriented left-looking Cholesky
5
Fill-in
Fill: new nonzeros in the factor
6
Permutation Matrix or Ordering
  • Thus ordering is used to reduce fill or to enhance
    numerical stability
  • Choose a permutation matrix P so that the Cholesky
    factor L' of PAP^T has less fill than L
  • Triangular solves become
  • Ly = Pb, L^Tz = y, x = P^Tz
  • The fill can be predicted in advance
  • So a static data structure can be used - symbolic
    factorization

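The effect of ordering on fill can be seen in a small pure-Python sketch (the helper names and the 6x6 "arrow" matrix are made up for illustration): eliminating a node coupled to every other node first fills the factor completely, while the reverse permutation of the same matrix factors with no fill at all.

```python
import math

def cholesky(A):
    """Plain dense Cholesky A = L L^T (A given as a list of lists)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def fill_in(A):
    """Count nonzeros of L that are zero in the lower triangle of A."""
    L = cholesky(A)
    return sum(1 for i in range(len(A)) for j in range(i + 1)
               if abs(L[i][j]) > 1e-12 and A[i][j] == 0)

# Hypothetical "arrow" matrix: node 1 is coupled to every other node.
n = 6
A = [[n if i == j else (-1.0 if 0 in (i, j) else 0.0) for j in range(n)]
     for i in range(n)]
print(fill_in(A))   # eliminating the dense node first causes fill

# Reverse permutation PAP^T: the arrow now points the other way.
B = [[A[n - 1 - i][n - 1 - j] for j in range(n)] for i in range(n)]
print(fill_in(B))   # no fill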
7
Steps
  • Ordering
  • Find a permutation P of matrix A
  • Symbolic factorization
  • Set up a data structure for the Cholesky factor L
    of PAP^T
  • Numerical factorization
  • Decompose PAP^T into LL^T
  • Triangular system solution
  • Ly = Pb, L^Tz = y, x = P^Tz

8
Sparse Matrices and Graph Theory
[Figure: a 7x7 sparse matrix A and its graph G(A) on vertices 1-7]
9
Sparse and Graph
[Figure: the filled matrix and its filled graph F(A) for the same 7x7 example]
10
Ordering
  • The above order of elimination is the natural one
  • The first heuristic is minimum degree ordering
  • Simple and effective
  • But its efficiency depends on the tie-breaking strategy

11
Minimum degree ordering for the previous Matrix
[Figure: minimum degree ordering for the previous matrix. Ordering 2, 4, 5, 7, 3, 1, 6 - no fill-in!]
12
Ordering
  • Another ordering is nested dissection
    (divide-and-conquer)
  • Find a separator S of nodes whose removal (along
    with incident edges) divides the graph into 2 disjoint
    pieces
  • Variables in each piece are numbered
    contiguously and variables in S are numbered last
  • Leads to a bordered block diagonal non-zero pattern
  • Can be applied recursively

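The recursion is easiest to see on a 1-D path graph, where the middle vertex is a perfect separator; this is a hedged sketch only, since real codes use graph partitioners to find small separators in general graphs.

```python
def nested_dissection(vertices):
    """Number each half recursively; separator vertices are numbered last."""
    if len(vertices) <= 2:
        return list(vertices)
    mid = len(vertices) // 2
    left, sep, right = vertices[:mid], [vertices[mid]], vertices[mid + 1:]
    # Each half is numbered contiguously; the separator comes last.
    return nested_dissection(left) + nested_dissection(right) + sep

order = nested_dissection(list(range(1, 8)))  # path 1-2-...-7
print(order)
```

On the path of 7 vertices this returns the halves first and the separators (4, then 2 and 6) last, which is the bordered-block-diagonal numbering described above.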
13
Nested Dissection Illustration
[Figure: a 5x5 grid graph numbered 1-25 in natural (column-wise) order, with the separator S highlighted]
14
Nested Dissection Illustration
[Figure: the same 5x5 grid after nested dissection - the two halves are numbered first and the separator column S is numbered last, 21-25]
15
Symbolic factorization
  • Can simulate the numerical factorization
  • Struct(M_i*) = {k < i | m_ik ≠ 0} - the row structure
  • Struct(M_*j) = {k > j | m_kj ≠ 0} - the column structure
  • p(j) = min{i ∈ Struct(L_*j)} if Struct(L_*j) ≠ ∅,
  • j otherwise
  • Struct(L_*j) ⊆ Struct(L_*p(j)) ∪ {p(j)}
  • Struct(L_*j) = Struct(A_*j) ∪
    (∪_{i<j, p(i)=j} Struct(L_*i)) \ {j}

16
Symbolic Factorization
  • for j = 1 to n do
  •   R_j = ∅
  • for j = 1 to n do
  •   S = Struct(A_*j)
  •   for i ∈ R_j do
  •     S = S ∪ Struct(L_*i) \ {j}
  •   Struct(L_*j) = S
  •   if Struct(L_*j) ≠ ∅ then
  •     p(j) = min{i ∈ Struct(L_*j)}
  •     R_p(j) = R_p(j) ∪ {j}

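The pseudocode above transcribes almost line for line into Python; here Struct(A_*j) is assumed to be given as the set of below-diagonal nonzero row indices of column j (0-based), and the 4x4 arrow example is made up for illustration.

```python
def symbolic_factorization(struct_A, n):
    """struct_A[j]: set of row indices i > j with a_ij != 0 (0-based).
    Returns the column structures of L and the elimination-tree parents."""
    R = [set() for _ in range(n)]       # R[j]: children of j found so far
    struct_L = [None] * n
    parent = [None] * n
    for j in range(n):
        S = set(struct_A[j])
        for i in R[j]:                  # merge structures of children
            S |= struct_L[i] - {j}
        struct_L[j] = S
        if S:
            parent[j] = min(S)          # p(j) = min of Struct(L_*j)
            R[parent[j]].add(j)
        # else: j is a root of the elimination forest
    return struct_L, parent

# Arrow matrix pointing the wrong way: column 0 dense below the diagonal.
n = 4
struct_A = [set(range(1, n)), set(), set(), set()]
L, p = symbolic_factorization(struct_A, n)
print([sorted(s) for s in L])   # fill propagates into every later column
print(p)
```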
17
Numerical Factorization
cmod(j, k): modification of column j by column k,
k < j. cdiv(j): division of column j by a scalar
18
Algorithms
19
Elimination Tree
  • T(A) has an edge between two vertices i and j,
    with i > j, if i = p(j); i is the parent of j.

[Figure: elimination tree on 10 nodes]
20
Supernode
  • A set of contiguous columns in the Cholesky
    factor L that share essentially the same sparsity
    structure.
  • The set of contiguous columns {j, j+1, ..., j+t}
    constitutes a supernode if
  • Struct(L_*k) = Struct(L_*(k+1)) ∪ {k+1} for
    j ≤ k ≤ j+t-1
  • Columns in the same supernode can be treated as a
    unit
  • Used for enhancing the efficiency of minimum
    degree ordering and symbolic factorization.

21
Parallelization of Sparse Cholesky
  • Most of the parallel algorithms are based on
    elimination trees
  • Work associated with two disjoint subtrees can
    proceed independently
  • Same steps as in sequential sparse
    factorization
  • One additional step: assignment of tasks to
    processors

22
Ordering
  • 2 issues:
  • - ordering in parallel
  • - finding an ordering that will help in
    parallelizing the subsequent steps

23
Ordering in Parallel: Nested dissection
  • Nested dissection can be carried out in parallel
  • Also leads to elimination trees that can be
    parallelized during subsequent factorizations
  • But parallelization occurs only in the later levels of
    dissection
  • Can be applied to only a limited class of problems
  • More later.

24
Ordering for Parallel Factorization
  • No agreed objective for ordering for parallel
    factorization: not all orderings that reduce
    fill-in provide scope for parallelization

[Figure: 7-node example in natural order; the elimination tree is a single chain. No fill, but no scope for parallelization]
25
Example (Contd..)
[Figure: the same example in nested dissection order; the elimination tree is balanced. Fill, but scope for parallelization]
26
Ordering for parallel factorization: Tree
restructuring
  • Decouple the fill-reducing ordering from the ordering for
    parallel elimination
  • Determine a fill-reducing ordering P of G(A)
  • Form the elimination tree T(PAP^T)
  • Transform this tree T(PAP^T) to one with smaller
    height and record the corresponding equivalent
    reordering, P'

27
Ordering for parallel factorization: Tree
restructuring
  • Efficiency depends on whether such an equivalent
    reordering can be found
  • Also on the limitations of the initial ordering,
    P
  • Only minor modifications to the initial ordering.
    Hence only limited improvement in parallelism
  • Algorithm by Liu (1989) based on elimination tree
    rotation to reduce the height
  • Algorithm by Jess and Kees (1982) based on
    chordal graph to reduce the height

28
Chordal graph and equivalent ordering
  • A graph is chordal if every cycle of length 4 or
    more has a chord, i.e. an edge connecting two
    non-consecutive vertices of the cycle
  • Non-deficient (zero-deficient, or simplicial) node:
    a node in a graph whose adjacent set is a
    clique.

29
Properties of Chordal Graphs, Some Preliminaries
  • The filled graph G(F), corresponding to the filled matrix
    F of A, is chordal
  • A chordal graph has at least one perfect
    elimination ordering (no fills)

30
Chordal graph and equivalent ordering
  • Given: matrix A, an ordered matrix A' = PAP^T, the
    fill matrix F, the elimination tree T(A'), and the
    graphs G(A'), G(F)
  • To solve: find a reordering Pb equivalent to P
    such that the fill matrix Fb corresponding to Ab =
    PbAPb^T has no fill-ins beyond those in F, and the
    elimination tree T(Ab) has minimum height (or
    is otherwise suited for exploiting parallel elimination)
  • or: determine a perfect elimination ordering Pb
    for the chordal graph G(F) such that the
    elimination tree T(Ab) has minimum height (or
    is otherwise suited for exploiting parallel elimination)

31
Properties of Chordal Graphs, Some Preliminaries
  • A chordal graph has at least one simplicial node.
  • If the graph is not a clique, there are at least
    two such independent (non-adjacent) simplicial
    nodes.
  • The subgraph of simplicial nodes with no
    deficiency consists of disconnected cliques.

32
Perfect Ordering
  • Eliminating a simplicial node has been shown to
    create no filled edge
  • Thus, to find a perfect ordering, keep finding and
    eliminating simplicial nodes at each step

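The sequential version of this idea is short enough to sketch directly (the Jess-Kees algorithm below instead eliminates whole independent sets of such nodes per step); the small chordal graph here is a made-up example.

```python
from itertools import combinations

def is_simplicial(adj, v):
    """True if v's neighbours form a clique."""
    return all(b in adj[a] for a, b in combinations(adj[v], 2))

def perfect_elimination_order(adj):
    """Repeatedly eliminate a simplicial node of a chordal graph."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    order = []
    while adj:
        v = next(u for u in sorted(adj) if is_simplicial(adj, u))
        for u in adj.pop(v):       # remove v; no fill edges are needed
            adj[u].discard(v)
        order.append(v)
    return order

# A chordal graph: triangle a-b-c plus a pendant d attached to c.
adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
print(perfect_elimination_order(adj))
```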
33
Jess and Kees - Algorithm
Parallel_Elimination(G)
begin
  G0 = G; i = 0
  while Gi ≠ ∅ do
  begin
    Ri = a maximum independent subset of non-deficient nodes
    order the nodes in Ri next
    Gi+1 = Gi \ Ri (eliminate the nodes of Ri from Gi)
    i = i + 1
  end
end
34
Jess and Kees - Example
[Figure: example graph on vertices a-h]
  • Step i    Selected Ri
  • 0         {a, c, f, h}
  • 1         {b, d}
  • 2         {g}
  • 3         {e}
35
Jess and Kees Algorithm: Minimum Height Property
  • This algorithm produces an elimination tree of
    minimum height, h, among all perfect reorderings
    of the filled matrix F.
  • Can be proven by induction:
  • Considering the elimination tree without the leaf
    nodes, assume the reduced tree satisfies
    minimum height h-1.
  • Considering the full elimination tree, prove that
    adding the leaf nodes doesn't change the
    minimum height property and yields h.

36
Height and Parallel Completion time
  • Not all elimination trees with minimum height
    give rise to small parallel completion times.
  • Let each node v in the elimination tree be
    associated with (time[v], level[v])
  • time[v]: time for factorization of column v
  • level[v] = time[v] if v is the root of the
    elimination tree; time[v] + level[parent of v]
    otherwise
  • level[v] represents the minimum time to completion
    starting at node v
  • Parallel completion time = maximum level value
    among all nodes

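The level recurrence can be evaluated top-down over the parent array; the small tree and per-column times below are hypothetical.

```python
def completion_time(parent, time):
    """level[v] = time[v] + level[parent[v]] (root: level = time).
    Parallel completion time is the maximum level over all nodes."""
    n = len(parent)
    level = [None] * n
    def lev(v):
        if level[v] is None:
            up = 0 if parent[v] is None else lev(parent[v])
            level[v] = time[v] + up
        return level[v]
    levels = [lev(v) for v in range(n)]
    return levels, max(levels)

# Chain 0 <- 1 <- 2, plus a separate leaf 3 under the root 0.
parent = [None, 0, 1, 0]
time = [5, 3, 2, 4]
levels, t_parallel = completion_time(parent, time)
print(levels, t_parallel)
```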
37
Height and Parallel Completion time
[Figure: two minimum-height elimination trees over nodes a-i, each node annotated with (time, level); the maximum level, and hence the parallel completion time, differs between the two]
38
Minimization of Cost
  • Thus some recent algorithms (Weng-Yang Lin, J. of
    Supercomputing, 2003) pick, at each step, the
    nodes with the minimum cost (a greedy approach)

39
Nested Dissection Algorithms
  • Use a graph partitioning heuristic to obtain a
    small edge separator of the graph
  • Transform the small edge separator into a small
    node separator
  • Number nodes of the separator last and
    recursively apply

40
Algorithm 1 - Level Structures
  • d(x, y): distance between x and y
  • Eccentricity e(x) = max_{y ∈ X} d(x, y)
  • Diameter d(G): max of the eccentricities
  • Peripheral node: x in X with e(x) = d(G)
  • Level structure: a partitioning L = (L0, ..., Ll)
    such that Adj(Li) ⊆ Li-1 ∪ Li+1

41
Example
[Figure: an 8-node graph and the level structure rooted at node 6]
42
Breadth First Search
  • One way of finding level structures is BFS
    starting from a peripheral node
  • Finding a peripheral node is expensive. Hence
    settle for a pseudo-peripheral node

43
Pseudo peripheral node
  • 1. Pick an arbitrary node r in X
  • 2. Generate a level structure with e(r) levels
  • 3. Choose a node x in the last level with minimum degree
  • 4. Generate a level structure rooted at x
  • 5. If e(x) > e(r), set r = x and go to step 3; else x is the
    pseudo-peripheral node

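The steps above can be sketched with plain BFS level structures; the path graph at the end is a made-up example whose true peripheral nodes are its two endpoints.

```python
def bfs_levels(adj, root):
    """Level structure rooted at `root`; len(levels) - 1 = e(root)."""
    levels, seen, frontier = [], {root}, [root]
    while frontier:
        levels.append(frontier)
        nxt = [w for v in frontier for w in adj[v] if w not in seen]
        frontier = []
        for w in nxt:
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return levels

def pseudo_peripheral(adj, r):
    levels = bfs_levels(adj, r)
    while True:
        # minimum-degree node in the last level
        x = min(levels[-1], key=lambda v: len(adj[v]))
        xlevels = bfs_levels(adj, x)
        if len(xlevels) <= len(levels):   # eccentricity stopped growing
            return x, xlevels
        r, levels = x, xlevels

# Path graph 1-2-3-4-5, starting from the middle node.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
node, struct = pseudo_peripheral(adj, 3)
print(node, [sorted(l) for l in struct])  # ends at an endpoint of the path
```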
[Figure: successive level structures rooted at candidate nodes r during the pseudo-peripheral search]
44
ND Heuristic based on BFS
  • Construct a level structure with l levels
  • Form the separator S from the nodes in level (l+1)/2
  • Recursively apply

45
(No Transcript)
46
(No Transcript)
47
Example
[Figure: an edge separator between parts A and B, and its conversion into a node separator]
48
(No Transcript)
49
K-L for ND
  • Form a random initial partition
  • Form an edge separator by applying K-L to form
    partitions P1 and P2
  • Let V1 ⊆ P1 be the nodes of P1 incident on
    at least one edge in the separator set; similarly
    V2
  • V1 ∪ V2 gives a wide node separator,
  • V1 or V2 alone gives a narrow node separator, by Gilbert and
    Zmijewski (1987)

50
Step 2: Mapping Problems onto processors
  • Based on elimination trees
  • But elimination trees are determined from the
    structure of L, which is produced by symbolic
    factorization (step 3) - a bootstrapping problem!
  • Efficient algorithms exist to find elimination
    trees directly from the structure of A.
  • Parallel calculation of the elimination tree by
    Zmijewski and Gilbert: each processor
    computes a local version of the elimination tree and
    the local versions are then combined.
  • Various strategies exist to map columns to processors
    based on elimination trees:
  • Strategy 1: successive levels in the elimination tree
    are wrap-mapped onto processors
  • Strategy 2: Subtree-to-Subcube
  • Strategy 3: Bin-Pack, by Geist and Ng

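One such efficient algorithm, finding parents directly from the row structure of A with path compression (a hedged sketch following Liu's elimination-tree construction; the 4x4 arrow example is made up):

```python
def elimination_tree(row_struct, n):
    """row_struct[i]: columns k < i with a_ik != 0 (0-based).
    Returns parent[] of the elimination tree (None = root)."""
    parent = [None] * n
    ancestor = [None] * n          # path-compression shortcut pointers
    for i in range(n):
        for k in row_struct[i]:
            r = k
            # walk up from k to the current root, shortcutting toward i
            while ancestor[r] is not None and ancestor[r] != i:
                t = ancestor[r]
                ancestor[r] = i
                r = t
            if ancestor[r] is None:   # found a root: attach it under i
                ancestor[r] = i
                parent[r] = i
    return parent

# Arrow matrix: every row i > 0 has a nonzero in column 0.
print(elimination_tree([[], [0], [0], [0]], 4))
```

The resulting parent array matches what symbolic factorization would produce, but is computed without forming the structure of L.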
51
Strategy 1
[Figure: elimination tree with successive levels wrap-mapped onto processors 0-3]
52
Strategy 2: Subtree-to-subcube mapping
  • Select an appropriate set of P subtrees of the
    elimination tree, say T0, T1
  • Assign columns corresponding to Ti to Pi
  • Where two subtrees merge into a single subtree,
    their processor sets are merged together and
    wrap-mapped onto the nodes/columns of the
    separator that begins at that point.
  • The root separator is wrap-mapped onto the set of
    all processors.

53
Strategy 2
[Figure: subtree-to-subcube mapping of an elimination tree onto processors 0-3; processor sets merge and are wrap-mapped on the separators]
54
Strategy 3: Bin-Pack (Geist and Ng)
  • Try to find disjoint subtrees
  • Map the subtrees to p bins based on the
    first-fit-decreasing bin-packing heuristic
  • Subtrees are processed in decreasing order of
    workload
  • A subtree is packed into the currently lightest bin
  • Weight imbalance a: the ratio between the lightest and
    heaviest bins
  • If a > the user-specified tolerance, stop
  • Else remove the heaviest subtree from the tree,
    split it into subtrees, and repack the p bins
    using bin-packing again
  • Repeat until a > the tolerance or the largest subtree cannot
    be split further
  • Load balance is thus based on a user-specified tolerance
  • The remaining nodes, from the roots of the
    subtrees to the root of the tree, are wrap-mapped.

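The packing step itself is plain first-fit-decreasing; in this sketch subtrees are represented only by hypothetical workload numbers.

```python
def bin_pack(workloads, p):
    """First-fit-decreasing packing of subtree workloads into p bins."""
    bins = [[] for _ in range(p)]
    loads = [0] * p
    for w in sorted(workloads, reverse=True):   # heaviest subtrees first
        i = loads.index(min(loads))             # currently lightest bin
        bins[i].append(w)
        loads[i] += w
    imbalance = min(loads) / max(loads)         # ratio lightest/heaviest
    return bins, imbalance

bins, a = bin_pack([7, 5, 4, 3, 2, 2], 2)
print(bins, a)  # if a is below tolerance, split the heaviest subtree and repack
```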
55
Parallel symbolic factorization
  • Sequential symbolic factorization is very
    efficient
  • Hard to achieve good speedups with parallel
    versions: limited parallelism, small task sizes,
    high communication overhead
  • Mapping strategies typically wrap-map processors
    to the columns of the same supernode - hence more
    storage and work than the sequential version
  • With a supernodal structure, only the processor
    holding the 1st column of the supernode
    calculates the structure
  • Other processors holding other columns of the
    supernode simply retrieve the structure from the
    processor holding the first column.

56
Parallel Numerical Factorization: Submatrix
Cholesky
Tsub(k) is partitioned into subtasks
Tsub(k,1), ..., Tsub(k,P), where Tsub(k,p) =
{cmod(j,k) | j ∈ Struct(L_*k) ∩ mycols(p)}
57
Definitions
  • mycols(p): set of columns owned by p
  • map[k]: processor containing column k
  • procs(L_*k) = {map[j] | j ∈ Struct(L_*k)}

58
Parallel Submatrix Cholesky
  • for j ∈ mycols(p) do
  •   if j is a leaf node in T(A) do
  •     cdiv(j)
  •     send L_*j to the processors in procs(L_*j)
  •     mycols(p) = mycols(p) \ {j}
  • while mycols(p) ≠ ∅ do
  •   receive any column of L, say L_*k
  •   for j ∈ Struct(L_*k) ∩ mycols(p) do
  •     cmod(j, k)
  •     if column j requires no more cmods do
  •       cdiv(j)
  •       send L_*j to the processors in procs(L_*j)
  •       mycols(p) = mycols(p) \ {j}
  • Disadvantages
  • Both L_*k and Struct(L_*k) have to be sent
  • Communication is not localized

59
Parallel Numerical Factorization: Sub-column
Cholesky
Tcol(j) is partitioned into subtasks
Tcol(j,1), ..., Tcol(j,P), where Tcol(j,p) aggregates
into a single update vector every update vector
u(j,k) for which k ∈ Struct(L_j*) ∩ mycols(p)
60
Definitions
  • mycols(p): set of columns owned by p
  • map[k]: processor containing column k
  • procs(L_j*) = {map[k] | k ∈ Struct(L_j*)}
  • u(j, k): scaled column of k accumulated into the
    factor column by cmod(j, k)

61
Parallel Sub-column Cholesky
  • for j = 1 to n do
  •   if j ∈ mycols(p) or Struct(L_j*) ∩ mycols(p) ≠ ∅ do
  •     u = 0
  •     for k ∈ Struct(L_j*) ∩ mycols(p) do
  •       u = u + u(j,k)
  •     if map[j] ≠ p do
  •       send u to processor q = map[j]
  •     else
  •       incorporate u into the factor column j
  •       while any aggregated update column for
        column j remains unreceived do
  •         receive in u another aggregated update
          column for column j
  •         incorporate u into the factor column j
  •       cdiv(j)

Has uniform and lower communication than the
submatrix version. The difference is due to the access
patterns of the row structure Struct(L_j*) versus the
column structure Struct(L_*k)
62
A refined version: compute-ahead fan-in
  • The previous version can lead to processor idling
    while waiting for the aggregates for updating
    column j
  • Updating column j can be mixed with compute-ahead
    tasks:
  • Aggregate u(i, k) for i > j for each completed
    column k in Struct(L_i*) ∩ mycols(p)
  • Receive an aggregated update column for i > j and
    incorporate it into factor column i

63
Triangular Solve: Parallel Forward and Back
Substitution (Anshul Gupta, Vipin Kumar,
Supercomputing 95)
64
Forward Substitution
  • Computation starts with leaf supernodes of the
    elimination trees
  • The portion of L corresponding to a supernode is
    a dense trapezoid of width t and height n
  • t - number of nodes/columns in supernode
  • n - number of non-zeros in the leftmost column of
    the supernode

65
Forward Substitution - Example
66
Steps at Supernode
  • Initial processing: a vector rhs of size n is
    formed.
  • The 1st t elements correspond to the elements in the
    RHS vector with the same indices as the nodes of the
    supernode.
  • The remaining n-t elements are filled with 0s.
  • Computation:
  • 1. Solve the dense triangular system at the top of the
    trapezoid in the supernode.
  • 2. Form updates corresponding to the remaining n-t rows
    of the supernode:
  • 2.1 Vector x = product of the bottom (n-t) x t submatrix
    of L and the size-t vector of solutions from
    step 1
  • 2.2 Subtract x from the bottom n-t elements of rhs
  • 2.3 Add the bottom n-t elements of rhs into the corresponding
    (same index) entries of rhs at the parent supernode
  • Step 2.1 at any supernode can begin only after the
    contributions from all its children have arrived

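The per-supernode work reduces to one small dense triangular solve plus a dense matrix-vector update; a minimal sketch with a hypothetical 4x2 trapezoid (t = 2 supernode columns, n = 4 rows):

```python
def supernode_fwd(trap, rhs, t):
    """trap: n x t dense trapezoid of L; rhs: length-n vector.
    Returns (solutions for the t supernode columns,
             updates for the bottom n-t rows, to add into the parent's rhs)."""
    n = len(trap)
    y = [0.0] * t
    for i in range(t):   # step 1: dense lower-triangular solve at the top
        y[i] = (rhs[i] - sum(trap[i][k] * y[k] for k in range(i))) / trap[i][i]
    # step 2: bottom (n-t) x t submatrix times y, subtracted from rhs
    updates = [rhs[i] - sum(trap[i][k] * y[k] for k in range(t))
               for i in range(t, n)]
    return y, updates

trap = [[2.0, 0.0],
        [1.0, 2.0],
        [1.0, 1.0],
        [0.0, 1.0]]
y, up = supernode_fwd(trap, [2.0, 3.0, 5.0, 7.0], 2)
print(y, up)   # y holds solution entries; up feeds the parent supernode
```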
67
Parallelization
  • For levels > log P, the above steps are performed
    sequentially on a single processor
  • For a supernode at level 0 < l < log P, the
    computation steps are performed in parallel on
    P/2^l processors.
  • A pipelined (wavefront) algorithm is used.

68
Partitioning
  • Assuming unlimited parallelism:
  • At a single time step, only t processors are
    used.
  • At a single time step, only one block per row and
    one block per column are active
  • Might as well use a 1-D block-cyclic distribution.

69
1-D block cyclic along rows
70
Redistribution
  • In this scheme, the conclusion is that
    scalability is achieved with a 1-D block-cyclic
    distribution
  • But numerical factorization uses a 2-D block-cyclic
    distribution
  • Hence redistribution has to be performed.
  • Claim: the redistribution cost is of the same order as
    the triangular solve

71
Sparse Iterative Methods
72
Iterative vs. Direct methods: Pros and Cons
  • Iterative methods do not give exact results.
  • Convergence cannot always be predicted
  • But there is absolutely no fill.

73
Parallel Jacobi, Gauss-Seidel, SOR
  • For problems with grid structure (1-D, 2-D etc.),
    Jacobi is easily parallelizable
  • Gauss-Seidel and SOR need recent values. Hence
    ordering of updates and sequencing among
    processors
  • But Gauss-Seidel and SOR can be parallelized
    using red-black (checkerboard) ordering

74
2D Grid example
[Figure: 4x4 grid with natural row-wise numbering 1-16]
75
Red-Black Ordering
  • Color alternate nodes in each dimension red and
    black
  • Red nodes can be updated simultaneously, followed
    by simultaneous updates of the black nodes
  • In general, reordering can affect convergence

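A red-black Gauss-Seidel half-sweep touches only same-coloured points, so each half is fully parallel; a minimal serial sketch for the 2-D Laplacian on an m x m grid (grid size, right-hand side, and mesh width below are made up):

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for -Laplace(u) = f on an m x m grid
    with fixed (zero) boundary values."""
    m = len(u)
    for colour in (0, 1):                      # red points, then black points
        for i in range(1, m - 1):
            for j in range(1, m - 1):
                if (i + j) % 2 == colour:      # all same-colour updates are
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +   # independent
                                      u[i][j - 1] + u[i][j + 1] +
                                      h * h * f[i][j])

m, h = 6, 1.0
u = [[0.0] * m for _ in range(m)]
f = [[1.0] * m for _ in range(m)]
for _ in range(100):
    red_black_sweep(u, f, h)
print(u[m // 2][m // 2])   # converges to the discrete Poisson solution
```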
76
2D Grid Example: Red-Black Ordering
[Figure: the same 4x4 grid with red-black numbering - red nodes 1-8 interleaved with black nodes 9-16]
77
Multi-Color orderings
  • In general multi-color orderings for an arbitrary
    graph
  • The ordering can reduce the convergence rate
    but can provide more parallelism
  • Need to strike a balance
  • Multi-color orderings can also be used for
    pre-conditioned CG

78
Pre-conditioned CG
  • Instead of solving Ax = b
  • Solve A'x' = b' where
  • A' = C^-1 A C^-1,
  • x' = Cx,
  • b' = C^-1 b
  • to improve convergence
  • M = C^2 is called the pre-conditioner

79
Incomplete Cholesky Preconditioner
  • M = HH^T where H is the incomplete Cholesky
    factor of A
  • One way of forming the incomplete Cholesky factor:
    set h_ij = 0 whenever a_ij = 0

80
Pre-Conditioned CG
  • k = 0
  • r0 = b - Ax0
  • while (rk ≠ 0)
  •   Solve Mzk = rk (2 triangular solves;
      their parallelization is not straightforward)
  •   k = k + 1
  •   if k = 1
  •     p1 = z0
  •   else
  •     βk = (rk-1)^T zk-1 / (rk-2)^T zk-2
  •     pk = zk-1 + βk pk-1
  •   end
  •   αk = (rk-1)^T zk-1 / (pk)^T A pk
  •   xk = xk-1 + αk pk
  •   rk = rk-1 - αk A pk
  • end
  • x = xk

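The loop above transcribes directly into plain Python; in this sketch a diagonal (Jacobi) preconditioner stands in for the incomplete-Cholesky M, so "solve Mz = r" becomes a trivial division, and the 3x3 SPD system is a made-up example.

```python
def pcg(A, b, M_diag, tol=1e-10, max_iter=100):
    """Preconditioned CG with a diagonal preconditioner M = diag(M_diag)."""
    n = len(b)
    mv = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    x = [0.0] * n
    r = b[:]                                   # r0 = b - A x0 with x0 = 0
    z = [ri / d for ri, d in zip(r, M_diag)]   # solve M z0 = r0
    p, rz = z[:], dot(r, z)                    # p1 = z0
    for _ in range(max_iter):
        Ap = mv(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) < tol * tol:
            break
        z = [ri / d for ri, d in zip(r, M_diag)]
        rz, rz_old = dot(r, z), rz
        beta = rz / rz_old
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b, M_diag=[4.0, 3.0, 2.0])
print(x)   # satisfies Ax = b to tolerance
```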
81
Graph Coloring
  • Graph-colored ordering for parallel computation of
    Gauss-Seidel and for applying incomplete Cholesky
    preconditioners
  • It was shown (Schreiber and Tang) that the minimum
    number of parallel steps in a triangular solve is
    given by the chromatic number of the symmetric graph
  • Thus a permutation matrix P is built from the graph-color
    ordering
  • Incomplete Cholesky is applied to PAP^T
  • Unknowns corresponding to nodes of the same color are
    solved in parallel; computation proceeds in steps

82
Parallel Triangular Solve based on Multi-Coloring
  • Triangular solve Ly = b (2 steps):
  • bw = bw - Lwv yv (corresponds to traversing the
    edge <v, w>)
  • yw = bw / Lww (corresponds to visiting vertex w)
  • The steps can be done in parallel for all v with the
    same color
  • Thus the parallel triangular solve proceeds in a number
    of steps equal to the number of colors

[Figure: 10-node graph labeled with (new order, original order) pairs under a multi-color ordering]
83
Graph Coloring Problem
  • Given G(A) = (V, E)
  • σ: V → {1, 2, ..., s} is an s-coloring of G if σ(i) ≠
    σ(j) for every edge (i, j) in E
  • The minimum possible value of s is the chromatic number
    of G
  • The graph coloring problem is to color the nodes with
    chromatic-number many colors
  • NP-complete problem

84
Heuristics: Greedy Heuristic
  • 1. Compute a vertex ordering v1, ..., vn of V
  • 2. For i = 1 to n, set σ(vi) equal to the smallest
    available consistent color
  • How to do step 1?

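The sensitivity to step 1 is visible even on a 4-node path (a made-up example): one ordering greedily uses 2 colors, another forces 3.

```python
def greedy_colour(adj, order):
    """Colour each vertex, in the given order, with the smallest colour
    not already used by a coloured neighbour."""
    colour = {}
    for v in order:
        used = {colour[w] for w in adj[v] if w in colour}
        colour[v] = next(c for c in range(len(adj)) if c not in used)
    return colour

# Path 1-2-3-4: natural order needs 2 colours, a bad order needs 3.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
good = greedy_colour(adj, [1, 2, 3, 4])
bad = greedy_colour(adj, [1, 4, 2, 3])
print(max(good.values()) + 1, max(bad.values()) + 1)
```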
[Figure: two colorings of the same 5-node graph under different vertex orderings. A non-optimal ordering leads to more colors; hence step 1 is important.]
85
Heuristics: Saturation Degree Ordering
  • Let v1, ..., vi-1 have been chosen
  • Choose vi such that vi is adjacent to the maximum
    number of different colors in {v1, ..., vi-1}

86
Parallel Graph Coloring: General algorithm
87
Parallel Graph Coloring: Finding Maximal
Independent Sets - Luby (1986)
  • I = ∅
  • V' = V
  • G' = G
  • while G' ≠ ∅ do
  •   choose an independent set I' in G'
  •   I = I ∪ I'; X = I' ∪ N(I') (N(I'):
    vertices adjacent to I')
  •   V' = V' \ X; G' = G(V')
  • end
  • Choosing the independent set I' (Monte Carlo
    heuristic):
  • For each vertex v in V', determine a distinct
    random number p(v)
  • v ∈ I' iff p(v) > p(w) for every w ∈ adj(v)
  • Color each MIS a different color
  • Disadvantage:
  • Each new choice of random numbers requires a
    global synchronization of the processors.

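A serial sketch of the Monte Carlo step (the path graph at the end is a made-up example): each live vertex draws a random number and enters the independent set if its number beats all live neighbours; each extracted maximal independent set becomes one colour class.

```python
import random

def luby_colour(adj, seed=0):
    """Colour a graph by repeatedly extracting a maximal independent set
    via Luby's random-number contest; one colour per MIS."""
    rng = random.Random(seed)
    live = set(adj)          # vertices not yet coloured
    colour, c = {}, 0
    while live:
        mis, remaining = set(), set(live)
        while remaining:     # build a maximal independent set in G[live]
            p = {v: rng.random() for v in remaining}
            chosen = {v for v in remaining
                      if all(p[v] > p[w] for w in adj[v] if w in remaining)}
            mis |= chosen
            # chosen vertices and their neighbours leave this round
            remaining -= chosen | {w for v in chosen for w in adj[v]}
        for v in mis:
            colour[v] = c
        live -= mis
        c += 1
    return colour

adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
col = luby_colour(adj)
print(all(col[u] != col[v] for u in adj for v in adj[u]))  # proper colouring
```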
88
Parallel Graph Coloring: Enhancement by Jones
and Plassmann (1993)
  • Partition the graph G into p partitions for p
    processors using some graph partitioning
    algorithm
  • VS: vertices incident on separator edges
  • VL = V \ VS
  • ViS = Vi ∩ VS for processor i
  • ViL = Vi ∩ VL for processor i
  • Algorithm
  • Color G(VS) using the asynchronous Monte Carlo
    heuristic
  • On processor i, color G(ViL) given colors of ViS
    using a sequential heuristic

89
Parallel Graph Coloring: Gebremedhin and Manne
(2003) - Pseudo-Coloring
90
References in Graph Coloring
  • M. Luby. A simple parallel algorithm for the
    maximal independent set problem. SIAM Journal on
    Computing, 15(4): 1036-1054, 1986.
  • M.T. Jones, P.E. Plassmann. A parallel graph
    coloring heuristic. SIAM Journal on Scientific
    Computing, 14(3): 654-669, May 1993.
  • L.V. Kale, B.H. Richards, T.D. Allen.
    Efficient Parallel Graph Coloring with
    Prioritization. Lecture Notes in Computer
    Science, vol. 1068, August 1995, pp. 190-208.
    Springer-Verlag.
  • A.H. Gebremedhin, F. Manne. Scalable parallel
    graph coloring algorithms. Concurrency: Practice
    and Experience, 12 (2000): 1131-1146.
  • A.H. Gebremedhin, I.G. Lassous, J. Gustedt,
    J.A. Telle. Graph coloring on coarse grained
    multicomputers. Discrete Applied Mathematics,
    131(1): 179-198, September 2003.

91
References
  • M.T. Heath, E. Ng, B.W. Peyton. Parallel
    Algorithms for Sparse Linear Systems. SIAM
    Review. Vol. 33, No. 3, pp. 420-460, September
    1991.
  • A. George, J.W.H. Liu. The Evolution of the
    Minimum Degree Ordering Algorithm. SIAM Review.
    Vol. 31, No. 1, pp. 1-19, March 1989.
  • J. W. H. Liu. Reordering sparse matrices for
    parallel elimination. Parallel Computing 11
    (1989) 73-91

92
References
  • Anshul Gupta, Vipin Kumar. Parallel algorithms
    for forward and back substitution in direct
    solution of sparse linear systems. Conference on
    High Performance Networking and Computing.
    Proceedings of the 1995 ACM/IEEE conference on
    Supercomputing (CDROM).
  • P. Raghavan. Efficient Parallel Triangular
    Solution Using Selective Inversion. Parallel
    Processing Letters, Vol. 8, No. 1, pp. 29-40,
    1998

93
References
  • Joseph W. H. Liu. The Multifrontal Method for
    Sparse Matrix Factorization. SIAM Review. Vol.
    34, No. 1, pp. 82-109, March 1992.
  • Gupta, Karypis and Kumar. Highly Scalable
    Parallel Algorithms for Sparse Matrix
    Factorization. TPDS. 1997.