1
Sparse LA
  • Sathish Vadhiyar

2
Motivation
  • Sparse computations are much more challenging than
    dense ones due to complex data structures and memory
    references
  • Many physical systems produce sparse matrices
  • Most of the research, and the base case, is in sparse
    symmetric positive definite matrices

3
Sparse Cholesky
  • To solve Ax = b
  • Factor A = LL^T, then solve Ly = b and L^Tx = y
  • Cholesky factorization introduces fill-in

4
Column-oriented left-looking Cholesky
5
Fill-in
Fill: new nonzeros in the factor
6
Permutation Matrix or Ordering
  • Thus ordering is used to reduce fill or to enhance
    numerical stability
  • Choose a permutation matrix P so that the Cholesky
    factor L' of PAP^T has less fill than L
  • Triangular solves become
  • Ly = Pb, L^Tz = y, x = P^Tz
  • The fill can be predicted in advance
  • So a static data structure can be used - symbolic
    factorization

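The effect of ordering on fill can be seen in a small pure-Python sketch (the helper names and the 6x6 "arrow" matrix are made up for illustration): eliminating a node coupled to every other node first fills the factor completely, while the reverse permutation of the same matrix factors with no fill at all.

```python
import math

def cholesky(A):
    """Plain dense Cholesky A = L L^T (A given as a list of lists)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def fill_in(A):
    """Count nonzeros of L that are zero in the lower triangle of A."""
    L = cholesky(A)
    return sum(1 for i in range(len(A)) for j in range(i + 1)
               if abs(L[i][j]) > 1e-12 and A[i][j] == 0)

# Hypothetical "arrow" matrix: node 1 is coupled to every other node.
n = 6
A = [[n if i == j else (-1.0 if 0 in (i, j) else 0.0) for j in range(n)]
     for i in range(n)]
print(fill_in(A))   # eliminating the dense node first causes fill

# Reverse permutation PAP^T: the arrow now points the other way.
B = [[A[n - 1 - i][n - 1 - j] for j in range(n)] for i in range(n)]
print(fill_in(B))   # no fill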
7
Steps
  • Ordering
  • Find a permutation P of matrix A
  • Symbolic factorization
  • Set up a data structure for the Cholesky factor L
    of PAP^T
  • Numerical factorization
  • Decompose PAP^T into LL^T
  • Triangular system solution
  • Ly = Pb, L^Tz = y, x = P^Tz

8
Sparse Matrices and Graph Theory
[Figure: a 7x7 sparse matrix A and its graph G(A) on vertices 1-7]
9
Sparse and Graph
[Figure: the filled matrix and its filled graph F(A) for the same 7x7 example]
10
Ordering
  • The above order of elimination is the natural one
  • The first heuristic is minimum degree ordering
  • Simple and effective
  • But its efficiency depends on the tie-breaking strategy

11
Minimum degree ordering for the previous Matrix
[Figure: minimum degree ordering for the previous matrix. Ordering 2, 4, 5, 7, 3, 1, 6 - no fill-in!]
12
Ordering
  • Another ordering is nested dissection
    (divide-and-conquer)
  • Find a separator S of nodes whose removal (along
    with incident edges) divides the graph into 2 disjoint
    pieces
  • Variables in each piece are numbered
    contiguously and variables in S are numbered last
  • Leads to a bordered block diagonal non-zero pattern
  • Can be applied recursively

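The recursion is easiest to see on a 1-D path graph, where the middle vertex is a perfect separator; this is a hedged sketch only, since real codes use graph partitioners to find small separators in general graphs.

```python
def nested_dissection(vertices):
    """Number each half recursively; separator vertices are numbered last."""
    if len(vertices) <= 2:
        return list(vertices)
    mid = len(vertices) // 2
    left, sep, right = vertices[:mid], [vertices[mid]], vertices[mid + 1:]
    # Each half is numbered contiguously; the separator comes last.
    return nested_dissection(left) + nested_dissection(right) + sep

order = nested_dissection(list(range(1, 8)))  # path 1-2-...-7
print(order)
```

On the path of 7 vertices this returns the halves first and the separators (4, then 2 and 6) last, which is the bordered-block-diagonal numbering described above.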
13
Nested Dissection Illustration
[Figure: a 5x5 grid graph numbered 1-25 in natural (column-wise) order, with the separator S highlighted]
14
Nested Dissection Illustration
[Figure: the same 5x5 grid after nested dissection - the two halves are numbered first and the separator column S is numbered last, 21-25]
15
Symbolic factorization
  • Can simulate the numerical factorization
  • Struct(M_i*) = {k < i | m_ik ≠ 0} - the row structure
  • Struct(M_*j) = {k > j | m_kj ≠ 0} - the column structure
  • p(j) = min{i ∈ Struct(L_*j)} if Struct(L_*j) ≠ ∅,
  • j otherwise
  • Struct(L_*j) ⊆ Struct(L_*p(j)) ∪ {p(j)}
  • Struct(L_*j) = Struct(A_*j) ∪
    (∪_{i<j, p(i)=j} Struct(L_*i)) \ {j}

16
Symbolic Factorization
  • for j = 1 to n do
  •   R_j = ∅
  • for j = 1 to n do
  •   S = Struct(A_*j)
  •   for i ∈ R_j do
  •     S = S ∪ Struct(L_*i) \ {j}
  •   Struct(L_*j) = S
  •   if Struct(L_*j) ≠ ∅ then
  •     p(j) = min{i ∈ Struct(L_*j)}
  •     R_p(j) = R_p(j) ∪ {j}

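The pseudocode above transcribes almost line for line into Python; here Struct(A_*j) is assumed to be given as the set of below-diagonal nonzero row indices of column j (0-based), and the 4x4 arrow example is made up for illustration.

```python
def symbolic_factorization(struct_A, n):
    """struct_A[j]: set of row indices i > j with a_ij != 0 (0-based).
    Returns the column structures of L and the elimination-tree parents."""
    R = [set() for _ in range(n)]       # R[j]: children of j found so far
    struct_L = [None] * n
    parent = [None] * n
    for j in range(n):
        S = set(struct_A[j])
        for i in R[j]:                  # merge structures of children
            S |= struct_L[i] - {j}
        struct_L[j] = S
        if S:
            parent[j] = min(S)          # p(j) = min of Struct(L_*j)
            R[parent[j]].add(j)
        # else: j is a root of the elimination forest
    return struct_L, parent

# Arrow matrix pointing the wrong way: column 0 dense below the diagonal.
n = 4
struct_A = [set(range(1, n)), set(), set(), set()]
L, p = symbolic_factorization(struct_A, n)
print([sorted(s) for s in L])   # fill propagates into every later column
print(p)
```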
17
Numerical Factorization
cmod(j, k): modification of column j by column k,
k < j. cdiv(j): division of column j by a scalar
18
Algorithms
19
Elimination Tree
  • T(A) has an edge between two vertices i and j,
    with i > j, if i = p(j); i is the parent of j.

[Figure: elimination tree on 10 nodes]
20
Supernode
  • A set of contiguous columns in the Cholesky
    factor L that share essentially the same sparsity
    structure.
  • The set of contiguous columns {j, j+1, ..., j+t}
    constitutes a supernode if
  • Struct(L_*k) = Struct(L_*(k+1)) ∪ {k+1} for
    j ≤ k ≤ j+t-1
  • Columns in the same supernode can be treated as a
    unit
  • Used for enhancing the efficiency of minimum
    degree ordering and symbolic factorization.

21
Parallelization of Sparse Cholesky
  • Most of the parallel algorithms are based on
    elimination trees
  • Work associated with two disjoint subtrees can
    proceed independently
  • Same steps as in sequential sparse
    factorization
  • One additional step: assignment of tasks to
    processors

22
Ordering
  • 2 issues:
  • - ordering in parallel
  • - finding an ordering that will help in
    parallelizing the subsequent steps

23
Ordering in Parallel: Nested dissection
  • Nested dissection can be carried out in parallel
  • Also leads to elimination trees that can be
    parallelized during subsequent factorizations
  • But parallelization occurs only in the later levels of
    dissection
  • Can be applied to only a limited class of problems
  • More later.

24
Ordering for Parallel Factorization
  • No agreed objective for ordering for parallel
    factorization: not all orderings that reduce
    fill-in provide scope for parallelization

[Figure: 7-node example in natural order; the elimination tree is a single chain. No fill, but no scope for parallelization]
25
Example (Contd..)
[Figure: the same example in nested dissection order; the elimination tree is balanced. Fill, but scope for parallelization]
26
Ordering for parallel factorization: Tree
restructuring
  • Decouple the fill-reducing ordering from the ordering for
    parallel elimination
  • Determine a fill-reducing ordering P of G(A)
  • Form the elimination tree T(PAP^T)
  • Transform this tree T(PAP^T) to one with smaller
    height and record the corresponding equivalent
    reordering, P'

27
Ordering for parallel factorization: Tree
restructuring
  • Efficiency depends on whether such an equivalent
    reordering can be found
  • Also on the limitations of the initial ordering,
    P
  • Only minor modifications to the initial ordering.
    Hence only limited improvement in parallelism
  • Algorithm by Liu (1989) based on elimination tree
    rotation to reduce the height
  • Algorithm by Jess and Kees (1982) based on
    chordal graph to reduce the height

28
Chordal graph and equivalent ordering
  • A graph is chordal if every cycle of length 4 or
    more has a chord, i.e. an edge connecting two
    non-consecutive vertices of the cycle
  • Non-deficient (zero-deficient, or simplicial) node:
    a node in a graph whose adjacent set is a
    clique.

29
Properties of Chordal Graphs, Some Preliminaries
  • The filled graph G(F), corresponding to the filled matrix
    F of A, is chordal
  • A chordal graph has at least one perfect
    elimination ordering (no fills)

30
Chordal graph and equivalent ordering
  • Given: matrix A, an ordered matrix A' = PAP^T, the
    fill matrix F, the elimination tree T(A'), and the
    graphs G(A'), G(F)
  • To solve: find a reordering Pb equivalent to P
    such that the fill matrix Fb corresponding to Ab =
    PbAPb^T has no fill-ins beyond those in F, and the
    elimination tree T(Ab) has minimum height (or
    is otherwise suited for exploiting parallel elimination)
  • or: determine a perfect elimination ordering Pb
    for the chordal graph G(F) such that the
    elimination tree T(Ab) has minimum height (or
    is otherwise suited for exploiting parallel elimination)

31
Properties of Chordal Graphs, Some Preliminaries
  • A chordal graph has at least one simplicial node.
  • If the graph is not a clique, there are at least
    two such independent (non-adjacent) simplicial
    nodes.
  • The subgraph of simplicial nodes with no
    deficiency consists of disconnected cliques.

32
Perfect Ordering
  • Eliminating a simplicial node has been shown to
    create no filled edge
  • Thus, to find a perfect ordering, keep finding and
    eliminating simplicial nodes at each step

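The sequential version of this idea is short enough to sketch directly (the Jess-Kees algorithm below instead eliminates whole independent sets of such nodes per step); the small chordal graph here is a made-up example.

```python
from itertools import combinations

def is_simplicial(adj, v):
    """True if v's neighbours form a clique."""
    return all(b in adj[a] for a, b in combinations(adj[v], 2))

def perfect_elimination_order(adj):
    """Repeatedly eliminate a simplicial node of a chordal graph."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    order = []
    while adj:
        v = next(u for u in sorted(adj) if is_simplicial(adj, u))
        for u in adj.pop(v):       # remove v; no fill edges are needed
            adj[u].discard(v)
        order.append(v)
    return order

# A chordal graph: triangle a-b-c plus a pendant d attached to c.
adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
print(perfect_elimination_order(adj))
```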
33
Jess and Kees - Algorithm
Parallel_Elimination(G)
begin
  G0 = G; i = 0
  while Gi ≠ ∅ do
  begin
    Ri = a maximum independent subset of non-deficient nodes
    order the nodes in Ri next
    Gi+1 = Gi \ Ri (eliminate the nodes of Ri from Gi)
    i = i + 1
  end
end
34
Jess and Kees - Example
[Figure: example graph on vertices a-h]
  • Step i    Selected Ri
  • 0         {a, c, f, h}
  • 1         {b, d}
  • 2         {g}
  • 3         {e}
35
Jess and Kees Algorithm: Minimum Height Property
  • This algorithm produces an elimination tree of
    minimum height, h, among all perfect reorderings
    of the filled matrix F.
  • Can be proven by induction:
  • Considering the elimination tree without the leaf
    nodes, assume the reduced tree satisfies
    minimum height h-1.
  • Considering the full elimination tree, prove that
    adding the leaf nodes doesn't change the
    minimum height property and yields h.

36
Height and Parallel Completion time
  • Not all elimination trees with minimum height
    give rise to small parallel completion times.
  • Let each node v in the elimination tree be
    associated with (time[v], level[v])
  • time[v]: time for factorization of column v
  • level[v] = time[v] if v is the root of the
    elimination tree; time[v] + level[parent of v]
    otherwise
  • level[v] represents the minimum time to completion
    starting at node v
  • Parallel completion time = maximum level value
    among all nodes

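The level recurrence can be evaluated top-down over the parent array; the small tree and per-column times below are hypothetical.

```python
def completion_time(parent, time):
    """level[v] = time[v] + level[parent[v]] (root: level = time).
    Parallel completion time is the maximum level over all nodes."""
    n = len(parent)
    level = [None] * n
    def lev(v):
        if level[v] is None:
            up = 0 if parent[v] is None else lev(parent[v])
            level[v] = time[v] + up
        return level[v]
    levels = [lev(v) for v in range(n)]
    return levels, max(levels)

# Chain 0 <- 1 <- 2, plus a separate leaf 3 under the root 0.
parent = [None, 0, 1, 0]
time = [5, 3, 2, 4]
levels, t_parallel = completion_time(parent, time)
print(levels, t_parallel)
```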
37
Height and Parallel Completion time
[Figure: two minimum-height elimination trees over nodes a-i, each node annotated with (time, level); the maximum level, and hence the parallel completion time, differs between the two]
38
Minimization of Cost
  • Thus some recent algorithms (Weng-Yang Lin, J. of
    Supercomputing, 2003) pick, at each step, the
    nodes with the minimum cost (a greedy approach)

39
Nested Dissection Algorithms
  • Use a graph partitioning heuristic to obtain a
    small edge separator of the graph
  • Transform the small edge separator into a small
    node separator
  • Number nodes of the separator last and
    recursively apply

40
Algorithm 1 - Level Structures
  • d(x, y): distance between x and y
  • Eccentricity e(x) = max_{y ∈ X} d(x, y)
  • Diameter d(G): max of the eccentricities
  • Peripheral node: x in X with e(x) = d(G)
  • Level structure: a partitioning L = (L0, ..., Ll)
    such that Adj(Li) ⊆ Li-1 ∪ Li+1

41
Example
[Figure: an 8-node graph and the level structure rooted at node 6]
42
Breadth First Search
  • One way of finding level structures is BFS
    starting from a peripheral node
  • Finding a peripheral node is expensive. Hence
    settle for a pseudo-peripheral node

43
Pseudo peripheral node
  • 1. Pick an arbitrary node r in X
  • 2. Generate a level structure with e(r) levels
  • 3. Choose a node x in the last level with minimum degree
  • 4. Generate a level structure rooted at x
  • 5. If e(x) > e(r), set r = x and go to step 3; else x is the
    pseudo-peripheral node

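The steps above can be sketched with plain BFS level structures; the path graph at the end is a made-up example whose true peripheral nodes are its two endpoints.

```python
def bfs_levels(adj, root):
    """Level structure rooted at `root`; len(levels) - 1 = e(root)."""
    levels, seen, frontier = [], {root}, [root]
    while frontier:
        levels.append(frontier)
        nxt = [w for v in frontier for w in adj[v] if w not in seen]
        frontier = []
        for w in nxt:
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return levels

def pseudo_peripheral(adj, r):
    levels = bfs_levels(adj, r)
    while True:
        # minimum-degree node in the last level
        x = min(levels[-1], key=lambda v: len(adj[v]))
        xlevels = bfs_levels(adj, x)
        if len(xlevels) <= len(levels):   # eccentricity stopped growing
            return x, xlevels
        r, levels = x, xlevels

# Path graph 1-2-3-4-5, starting from the middle node.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
node, struct = pseudo_peripheral(adj, 3)
print(node, [sorted(l) for l in struct])  # ends at an endpoint of the path
```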
[Figure: successive level structures rooted at candidate nodes r during the pseudo-peripheral search]
44
ND Heuristic based on BFS
  • Construct a level structure with l levels
  • Form the separator S from the nodes in level (l+1)/2
  • Recursively apply

45
(No Transcript)
46
(No Transcript)
47
Example
[Figure: an edge separator between parts A and B, and its conversion into a node separator]
48
(No Transcript)
49
K-L for ND
  • Form a random initial partition
  • Form an edge separator by applying K-L to form
    partitions P1 and P2
  • Let V1 ⊆ P1 be the nodes of P1 incident on
    at least one edge in the separator set; similarly
    V2
  • V1 ∪ V2 gives a wide node separator,
  • V1 or V2 alone gives a narrow node separator, by Gilbert and
    Zmijewski (1987)

50
Step 2: Mapping Problems onto processors
  • Based on elimination trees
  • But elimination trees are determined from the
    structure of L, which is produced by symbolic
    factorization (step 3) - a bootstrapping problem!
  • Efficient algorithms exist to find elimination
    trees directly from the structure of A.
  • Parallel calculation of the elimination tree by
    Zmijewski and Gilbert: each processor
    computes a local version of the elimination tree and
    the local versions are then combined.
  • Various strategies exist to map columns to processors
    based on elimination trees:
  • Strategy 1: successive levels in the elimination tree
    are wrap-mapped onto processors
  • Strategy 2: Subtree-to-Subcube
  • Strategy 3: Bin-Pack, by Geist and Ng

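One such efficient algorithm, finding parents directly from the row structure of A with path compression (a hedged sketch following Liu's elimination-tree construction; the 4x4 arrow example is made up):

```python
def elimination_tree(row_struct, n):
    """row_struct[i]: columns k < i with a_ik != 0 (0-based).
    Returns parent[] of the elimination tree (None = root)."""
    parent = [None] * n
    ancestor = [None] * n          # path-compression shortcut pointers
    for i in range(n):
        for k in row_struct[i]:
            r = k
            # walk up from k to the current root, shortcutting toward i
            while ancestor[r] is not None and ancestor[r] != i:
                t = ancestor[r]
                ancestor[r] = i
                r = t
            if ancestor[r] is None:   # found a root: attach it under i
                ancestor[r] = i
                parent[r] = i
    return parent

# Arrow matrix: every row i > 0 has a nonzero in column 0.
print(elimination_tree([[], [0], [0], [0]], 4))
```

The resulting parent array matches what symbolic factorization would produce, but is computed without forming the structure of L.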
51
Strategy 1
[Figure: elimination tree with successive levels wrap-mapped onto processors 0-3]
52
Strategy 2: Subtree-to-subcube mapping
  • Select an appropriate set of P subtrees of the
    elimination tree, say T0, T1
  • Assign columns corresponding to Ti to Pi
  • Where two subtrees merge into a single subtree,
    their processor sets are merged together and
    wrap-mapped onto the nodes/columns of the
    separator that begins at that point.
  • The root separator is wrap-mapped onto the set of
    all processors.

53
Strategy 2
[Figure: subtree-to-subcube mapping of an elimination tree onto processors 0-3; processor sets merge and are wrap-mapped on the separators]
54
Strategy 3: Bin-Pack (Geist and Ng)
  • Try to find disjoint subtrees
  • Map the subtrees to p bins based on the
    first-fit-decreasing bin-packing heuristic
  • Subtrees are processed in decreasing order of
    workload
  • A subtree is packed into the currently lightest bin
  • Weight imbalance a: the ratio between the lightest and
    heaviest bins
  • If a > the user-specified tolerance, stop
  • Else remove the heaviest subtree from the tree,
    split it into subtrees, and repack the p bins
    using bin-packing again
  • Repeat until a > the tolerance or the largest subtree cannot
    be split further
  • Load balance is thus based on a user-specified tolerance
  • The remaining nodes, from the roots of the
    subtrees to the root of the tree, are wrap-mapped.

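The packing step itself is plain first-fit-decreasing; in this sketch subtrees are represented only by hypothetical workload numbers.

```python
def bin_pack(workloads, p):
    """First-fit-decreasing packing of subtree workloads into p bins."""
    bins = [[] for _ in range(p)]
    loads = [0] * p
    for w in sorted(workloads, reverse=True):   # heaviest subtrees first
        i = loads.index(min(loads))             # currently lightest bin
        bins[i].append(w)
        loads[i] += w
    imbalance = min(loads) / max(loads)         # ratio lightest/heaviest
    return bins, imbalance

bins, a = bin_pack([7, 5, 4, 3, 2, 2], 2)
print(bins, a)  # if a is below tolerance, split the heaviest subtree and repack
```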
55
Parallel symbolic factorization
  • Sequential symbolic factorization is very
    efficient
  • Hard to achieve good speedups with parallel
    versions: limited parallelism, small task sizes,
    high communication overhead
  • Mapping strategies typically wrap-map processors
    to the columns of the same supernode - hence more
    storage and work than the sequential version
  • With a supernodal structure, only the processor
    holding the 1st column of the supernode
    calculates the structure
  • Other processors holding other columns of the
    supernode simply retrieve the structure from the
    processor holding the first column.

56
Parallel Numerical Factorization: Submatrix
Cholesky
Tsub(k) is partitioned into subtasks
Tsub(k,1), ..., Tsub(k,P), where Tsub(k,p) =
{cmod(j,k) | j ∈ Struct(L_*k) ∩ mycols(p)}
57
Definitions
  • mycols(p): set of columns owned by p
  • map[k]: processor containing column k
  • procs(L_*k) = {map[j] | j ∈ Struct(L_*k)}

58
Parallel Submatrix Cholesky
  • for j ∈ mycols(p) do
  •   if j is a leaf node in T(A) do
  •     cdiv(j)
  •     send L_*j to the processors in procs(L_*j)
  •     mycols(p) = mycols(p) \ {j}
  • while mycols(p) ≠ ∅ do
  •   receive any column of L, say L_*k
  •   for j ∈ Struct(L_*k) ∩ mycols(p) do
  •     cmod(j, k)
  •     if column j requires no more cmods do
  •       cdiv(j)
  •       send L_*j to the processors in procs(L_*j)
  •       mycols(p) = mycols(p) \ {j}
  • Disadvantages
  • Both L_*k and Struct(L_*k) have to be sent
  • Communication is not localized

59
Parallel Numerical Factorization: Sub-column
Cholesky
Tcol(j) is partitioned into subtasks
Tcol(j,1), ..., Tcol(j,P), where Tcol(j,p) aggregates
into a single update vector every update vector
u(j,k) for which k ∈ Struct(L_j*) ∩ mycols(p)
60
Definitions
  • mycols(p): set of columns owned by p
  • map[k]: processor containing column k
  • procs(L_j*) = {map[k] | k ∈ Struct(L_j*)}
  • u(j, k): scaled column of k accumulated into the
    factor column by cmod(j, k)

61
Parallel Sub-column Cholesky
  • for j = 1 to n do
  •   if j ∈ mycols(p) or Struct(L_j*) ∩ mycols(p) ≠ ∅ do
  •     u = 0
  •     for k ∈ Struct(L_j*) ∩ mycols(p) do
  •       u = u + u(j,k)
  •     if map[j] ≠ p do
  •       send u to processor q = map[j]
  •     else
  •       incorporate u into the factor column j
  •       while any aggregated update column for
        column j remains unreceived do
  •         receive in u another aggregated update
          column for column j
  •         incorporate u into the factor column j
  •       cdiv(j)

Has uniform and lower communication than the
submatrix version. The difference is due to the access
patterns of the row structure Struct(L_j*) versus the
column structure Struct(L_*k)
62
A refined version: compute-ahead fan-in
  • The previous version can lead to processor idling
    while waiting for the aggregates for updating
    column j
  • Updating column j can be mixed with compute-ahead
    tasks:
  • Aggregate u(i, k) for i > j for each completed
    column k in Struct(L_i*) ∩ mycols(p)
  • Receive an aggregated update column for i > j and
    incorporate it into factor column i

63
Triangular Solve: Parallel Forward and Back
Substitution (Anshul Gupta, Vipin Kumar,
Supercomputing 95)
64
Forward Substitution
  • Computation starts with leaf supernodes of the
    elimination trees
  • The portion of L corresponding to a supernode is
    a dense trapezoid of width t and height n
  • t - number of nodes/columns in supernode
  • n - number of non-zeros in the leftmost column of
    the supernode

65
Forward Substitution - Example
66
Steps at Supernode
  • Initial processing: a vector rhs of size n is
    formed.
  • The 1st t elements correspond to the elements in the
    RHS vector with the same indices as the nodes of the
    supernode.
  • The remaining n-t elements are filled with 0s.
  • Computation:
  • 1. Solve the dense triangular system at the top of the
    trapezoid in the supernode.
  • 2. Form updates corresponding to the remaining n-t rows
    of the supernode:
  • 2.1 Vector x = product of the bottom (n-t) x t submatrix
    of L and the size-t vector of solutions from
    step 1
  • 2.2 Subtract x from the bottom n-t elements of rhs
  • 2.3 Add the bottom n-t elements of rhs into the corresponding
    (same index) entries of rhs at the parent supernode
  • Step 2.1 at any supernode can begin only after the
    contributions from all its children have arrived

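The per-supernode work reduces to one small dense triangular solve plus a dense matrix-vector update; a minimal sketch with a hypothetical 4x2 trapezoid (t = 2 supernode columns, n = 4 rows):

```python
def supernode_fwd(trap, rhs, t):
    """trap: n x t dense trapezoid of L; rhs: length-n vector.
    Returns (solutions for the t supernode columns,
             updates for the bottom n-t rows, to add into the parent's rhs)."""
    n = len(trap)
    y = [0.0] * t
    for i in range(t):   # step 1: dense lower-triangular solve at the top
        y[i] = (rhs[i] - sum(trap[i][k] * y[k] for k in range(i))) / trap[i][i]
    # step 2: bottom (n-t) x t submatrix times y, subtracted from rhs
    updates = [rhs[i] - sum(trap[i][k] * y[k] for k in range(t))
               for i in range(t, n)]
    return y, updates

trap = [[2.0, 0.0],
        [1.0, 2.0],
        [1.0, 1.0],
        [0.0, 1.0]]
y, up = supernode_fwd(trap, [2.0, 3.0, 5.0, 7.0], 2)
print(y, up)   # y holds solution entries; up feeds the parent supernode
```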
67
Parallelization
  • For levels > log P, the above steps are performed
    sequentially on a single processor
  • For a supernode at level 0 < l < log P, the
    computation steps are performed in parallel on
    P/2^l processors.
  • A pipelined (wavefront) algorithm is used.

68
Partitioning
  • Assuming unlimited parallelism:
  • At a single time step, only t processors are
    used.
  • At a single time step, only one block per row and
    one block per column are active
  • Might as well use a 1-D block-cyclic distribution.

69
1-D block cyclic along rows
70
Redistribution
  • In this scheme, the conclusion is that
    scalability is achieved with a 1-D block-cyclic
    distribution
  • But numerical factorization uses a 2-D block-cyclic
    distribution
  • Hence redistribution has to be performed.
  • Claim: the redistribution cost is of the same order as
    the triangular solve

71
Sparse Iterative Methods
72
Iterative vs. Direct methods: Pros and Cons
  • Iterative methods do not give exact results.
  • Convergence cannot always be predicted
  • But there is absolutely no fill.

73
Parallel Jacobi, Gauss-Seidel, SOR
  • For problems with grid structure (1-D, 2-D etc.),
    Jacobi is easily parallelizable
  • Gauss-Seidel and SOR need recent values. Hence
    ordering of updates and sequencing among
    processors
  • But Gauss-Seidel and SOR can be parallelized
    using red-black (checkerboard) ordering

74
2D Grid example
[Figure: 4x4 grid with natural row-wise numbering 1-16]
75
Red-Black Ordering
  • Color alternate nodes in each dimension red and
    black
  • Red nodes can be updated simultaneously, followed
    by simultaneous updates of the black nodes
  • In general, reordering can affect convergence

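A red-black Gauss-Seidel half-sweep touches only same-coloured points, so each half is fully parallel; a minimal serial sketch for the 2-D Laplacian on an m x m grid (grid size, right-hand side, and mesh width below are made up):

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for -Laplace(u) = f on an m x m grid
    with fixed (zero) boundary values."""
    m = len(u)
    for colour in (0, 1):                      # red points, then black points
        for i in range(1, m - 1):
            for j in range(1, m - 1):
                if (i + j) % 2 == colour:      # all same-colour updates are
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +   # independent
                                      u[i][j - 1] + u[i][j + 1] +
                                      h * h * f[i][j])

m, h = 6, 1.0
u = [[0.0] * m for _ in range(m)]
f = [[1.0] * m for _ in range(m)]
for _ in range(100):
    red_black_sweep(u, f, h)
print(u[m // 2][m // 2])   # converges to the discrete Poisson solution
```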
76
2D Grid Example: Red-Black Ordering
[Figure: the same 4x4 grid with red-black numbering - red nodes 1-8 interleaved with black nodes 9-16]
77
Multi-Color orderings
  • In general multi-color orderings for an arbitrary
    graph
  • The ordering can reduce the convergence rate
    but can provide more parallelism
  • Need to strike a balance
  • Multi-color orderings can also be used for
    pre-conditioned CG

78
Pre-conditioned CG
  • Instead of solving Ax = b
  • Solve A'x' = b' where
  • A' = C^-1 A C^-1,
  • x' = Cx,
  • b' = C^-1 b
  • to improve convergence
  • M = C^2 is called the pre-conditioner

79
Incomplete Cholesky Preconditioner
  • M = HH^T where H is the incomplete Cholesky
    factor of A
  • One way of forming the incomplete Cholesky factor:
    set h_ij = 0 whenever a_ij = 0

80
Pre-Conditioned CG
  • k = 0
  • r0 = b - Ax0
  • while (rk ≠ 0)
  •   Solve Mzk = rk (2 triangular solves;
      their parallelization is not straightforward)
  •   k = k + 1
  •   if k = 1
  •     p1 = z0
  •   else
  •     βk = (rk-1)^T zk-1 / (rk-2)^T zk-2
  •     pk = zk-1 + βk pk-1
  •   end
  •   αk = (rk-1)^T zk-1 / (pk)^T A pk
  •   xk = xk-1 + αk pk
  •   rk = rk-1 - αk A pk
  • end
  • x = xk

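The loop above transcribes directly into plain Python; in this sketch a diagonal (Jacobi) preconditioner stands in for the incomplete-Cholesky M, so "solve Mz = r" becomes a trivial division, and the 3x3 SPD system is a made-up example.

```python
def pcg(A, b, M_diag, tol=1e-10, max_iter=100):
    """Preconditioned CG with a diagonal preconditioner M = diag(M_diag)."""
    n = len(b)
    mv = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    x = [0.0] * n
    r = b[:]                                   # r0 = b - A x0 with x0 = 0
    z = [ri / d for ri, d in zip(r, M_diag)]   # solve M z0 = r0
    p, rz = z[:], dot(r, z)                    # p1 = z0
    for _ in range(max_iter):
        Ap = mv(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) < tol * tol:
            break
        z = [ri / d for ri, d in zip(r, M_diag)]
        rz, rz_old = dot(r, z), rz
        beta = rz / rz_old
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b, M_diag=[4.0, 3.0, 2.0])
print(x)   # satisfies Ax = b to tolerance
```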
81
Graph Coloring
  • Graph-colored ordering for parallel computation of
    Gauss-Seidel and for applying incomplete Cholesky
    preconditioners
  • It was shown (Schreiber and Tang) that the minimum
    number of parallel steps in a triangular solve is
    given by the chromatic number of the symmetric graph
  • Thus a permutation matrix P is built from the graph-color
    ordering
  • Incomplete Cholesky is applied to PAP^T
  • Unknowns corresponding to nodes of the same color are
    solved in parallel; computation proceeds in steps

82
Parallel Triangular Solve based on Multi-Coloring
  • Triangular solve Ly = b (2 steps):
  • bw = bw - Lwv yv (corresponds to traversing the
    edge <v, w>)
  • yw = bw / Lww (corresponds to visiting vertex w)
  • The steps can be done in parallel for all v with the
    same color
  • Thus the parallel triangular solve proceeds in a number
    of steps equal to the number of colors

[Figure: 10-node graph labeled with (new order, original order) pairs under a multi-color ordering]
83
Graph Coloring Problem
  • Given G(A) = (V, E)
  • σ: V → {1, 2, ..., s} is an s-coloring of G if σ(i) ≠
    σ(j) for every edge (i, j) in E
  • The minimum possible value of s is the chromatic number
    of G
  • The graph coloring problem is to color the nodes with
    chromatic-number many colors
  • NP-complete problem

84
Heuristics: Greedy Heuristic
  • 1. Compute a vertex ordering v1, ..., vn of V
  • 2. For i = 1 to n, set σ(vi) equal to the smallest
    available consistent color
  • How to do step 1?

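The sensitivity to step 1 is visible even on a 4-node path (a made-up example): one ordering greedily uses 2 colors, another forces 3.

```python
def greedy_colour(adj, order):
    """Colour each vertex, in the given order, with the smallest colour
    not already used by a coloured neighbour."""
    colour = {}
    for v in order:
        used = {colour[w] for w in adj[v] if w in colour}
        colour[v] = next(c for c in range(len(adj)) if c not in used)
    return colour

# Path 1-2-3-4: natural order needs 2 colours, a bad order needs 3.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
good = greedy_colour(adj, [1, 2, 3, 4])
bad = greedy_colour(adj, [1, 4, 2, 3])
print(max(good.values()) + 1, max(bad.values()) + 1)
```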
[Figure: two colorings of the same 5-node graph under different vertex orderings. A non-optimal ordering leads to more colors; hence step 1 is important.]
85
Heuristics: Saturation Degree Ordering
  • Let v1, ..., vi-1 have been chosen
  • Choose vi such that vi is adjacent to the maximum
    number of different colors in {v1, ..., vi-1}

86
Parallel Graph Coloring: General algorithm
87
Parallel Graph Coloring: Finding Maximal
Independent Sets - Luby (1986)
  • I = ∅
  • V' = V
  • G' = G
  • while G' ≠ ∅ do
  •   choose an independent set I' in G'
  •   I = I ∪ I'; X = I' ∪ N(I') (N(I'):
    vertices adjacent to I')
  •   V' = V' \ X; G' = G(V')
  • end
  • Choosing the independent set I' (Monte Carlo
    heuristic):
  • For each vertex v in V', determine a distinct
    random number p(v)
  • v ∈ I' iff p(v) > p(w) for every w ∈ adj(v)
  • Color each MIS a different color
  • Disadvantage:
  • Each new choice of random numbers requires a
    global synchronization of the processors.

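A serial sketch of the Monte Carlo step (the path graph at the end is a made-up example): each live vertex draws a random number and enters the independent set if its number beats all live neighbours; each extracted maximal independent set becomes one colour class.

```python
import random

def luby_colour(adj, seed=0):
    """Colour a graph by repeatedly extracting a maximal independent set
    via Luby's random-number contest; one colour per MIS."""
    rng = random.Random(seed)
    live = set(adj)          # vertices not yet coloured
    colour, c = {}, 0
    while live:
        mis, remaining = set(), set(live)
        while remaining:     # build a maximal independent set in G[live]
            p = {v: rng.random() for v in remaining}
            chosen = {v for v in remaining
                      if all(p[v] > p[w] for w in adj[v] if w in remaining)}
            mis |= chosen
            # chosen vertices and their neighbours leave this round
            remaining -= chosen | {w for v in chosen for w in adj[v]}
        for v in mis:
            colour[v] = c
        live -= mis
        c += 1
    return colour

adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
col = luby_colour(adj)
print(all(col[u] != col[v] for u in adj for v in adj[u]))  # proper colouring
```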
88
Parallel Graph Coloring: Enhancement by Jones
and Plassmann (1993)
  • Partition the graph G into p partitions for p
    processors using some graph partitioning
    algorithm
  • VS: vertices incident on separator edges
  • VL = V \ VS
  • ViS = Vi ∩ VS for processor i
  • ViL = Vi ∩ VL for processor i
  • Algorithm
  • Color G(VS) using the asynchronous Monte Carlo
    heuristic
  • On processor i, color G(ViL) given colors of ViS
    using a sequential heuristic

89
Parallel Graph Coloring: Gebremedhin and Manne
(2003) - Pseudo-Coloring
90
References in Graph Coloring
  • M. Luby. A simple parallel algorithm for the
    maximal independent set problem. SIAM Journal on
    Computing, 15(4): 1036-1054, 1986.
  • M.T. Jones, P.E. Plassmann. A parallel graph
    coloring heuristic. SIAM Journal on Scientific
    Computing, 14(3): 654-669, May 1993.
  • L.V. Kale, B.H. Richards, T.D. Allen.
    Efficient Parallel Graph Coloring with
    Prioritization. Lecture Notes in Computer
    Science, vol. 1068, August 1995, pp. 190-208.
    Springer-Verlag.
  • A.H. Gebremedhin, F. Manne. Scalable parallel
    graph coloring algorithms. Concurrency: Practice
    and Experience, 12 (2000): 1131-1146.
  • A.H. Gebremedhin, I.G. Lassous, J. Gustedt,
    J.A. Telle. Graph coloring on coarse grained
    multicomputers. Discrete Applied Mathematics,
    131(1): 179-198, September 2003.

91
References
  • M.T. Heath, E. Ng, B.W. Peyton. Parallel
    Algorithms for Sparse Linear Systems. SIAM
    Review. Vol. 33, No. 3, pp. 420-460, September
    1991.
  • A. George, J.W.H. Liu. The Evolution of the
    Minimum Degree Ordering Algorithm. SIAM Review.
    Vol. 31, No. 1, pp. 1-19, March 1989.
  • J. W. H. Liu. Reordering sparse matrices for
    parallel elimination. Parallel Computing 11
    (1989) 73-91

92
References
  • Anshul Gupta, Vipin Kumar. Parallel algorithms
    for forward and back substitution in direct
    solution of sparse linear systems. Conference on
    High Performance Networking and Computing.
    Proceedings of the 1995 ACM/IEEE conference on
    Supercomputing (CDROM).
  • P. Raghavan. Efficient Parallel Triangular
    Solution Using Selective Inversion. Parallel
    Processing Letters, Vol. 8, No. 1, pp. 29-40,
    1998

93
References
  • Joseph W. H. Liu. The Multifrontal Method for
    Sparse Matrix Factorization. SIAM Review. Vol.
    34, No. 1, pp. 82-109, March 1992.
  • Gupta, Karypis and Kumar. Highly Scalable
    Parallel Algorithms for Sparse Matrix
    Factorization. TPDS. 1997.