1
Parallel Sparse LU Factorization on Second-class
Message Passing Platforms
  • Kai Shen
  • University of Rochester

2
Preliminary: Parallel Sparse LU Factorization
  • LU factorization with partial pivoting (PA = LU),
    used for solving a linear system Ax = b
  • Applications
  • Device/circuit simulation, fluid dynamics, ...
  • In Newton's method for solving non-linear
    systems (a sequential sketch follows this list)
  • Challenges for parallel sparse LU factorization
  • Runtime data structure variation
  • Non-uniform computation/communication patterns
  • ⇒ Irregular
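To ground the preliminary: a minimal sequential sketch, assuming SciPy's SuperLU wrapper `splu` as a stand-in for a sparse LU solver (the talk's own solver is the parallel S+, which is not shown here). It performs one Newton step by solving the sparse linear system J(x)·dx = −F(x) with LU factorization and partial pivoting; the toy system `F` and Jacobian `J` are invented for illustration.

```python
# Minimal sequential illustration (NOT the parallel S+ solver from the talk):
# one Newton step solves the sparse linear system J(x) dx = -F(x)
# via LU factorization with partial pivoting (PA = LU).
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def newton_step(F, J, x):
    """Return x + dx, where J(x) dx = -F(x) is solved by sparse LU."""
    lu = splu(csc_matrix(J(x)))   # LU factorization with partial pivoting
    return x + lu.solve(-F(x))

# Toy nonlinear system: F(x) = x^2 - [1, 4] componentwise, Jacobian = diag(2x).
F = lambda x: x**2 - np.array([1.0, 4.0])
J = lambda x: np.diag(2.0 * x)
x = np.array([3.0, 3.0])
for _ in range(8):
    x = newton_step(F, J, x)
print(x)  # converges to [1, 2]
```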

3
Existing Solvers and Their Portability
  • Shared memory solvers
  • SuperLU [Li, Demmel et al. 1999], WSMP [Gupta
    2000], PARDISO [Schenk & Gärtner 2004]
  • Message passing solvers
  • S+ [Shen et al. 2000], MUMPS [Amestoy et al.
    2001], SuperLU_DIST [Li & Demmel 2004]
  • Existing message passing solvers are portable,
    but perform poorly on platforms with slow message
    passing
  • Mostly designed for parallel computers with fast
    interconnect
  • Performance portability is desirable, given the
    large variation in the characteristics of
    available platforms

4
Example Message Passing Platforms
  • Three platforms running MPI
  • Regatta-shmem, Regatta-TCP/IP, PC cluster
  • Per-CPU peak BLAS-3 performance is 971 MFLOPS on
    Regatta and 1382 MFLOPS on a PC

5
Parallel Sparse LU Factorization on the Three
Platforms
  • Performance of S+ [Shen et al. 2000]
  • We investigate communication-reduction techniques
    to improve performance on platforms with slow
    message passing

6
Data Structure and Computation Steps
[Figure: the matrix partitioned into blocks, with row block K and column block K labeled]
  • for each column block K (1…N)
  • Perform Factor(K)
  • Perform SwapScale(K)
  • Perform Update(K)
  • endfor (a dense stand-in sketch of this loop
    follows below)
  • Processor mapping
  • 1-D cyclic
  • 2-D cyclic (more scalable)

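A dense, sequential stand-in for the loop above, assuming standard blocked right-looking LU; this is an illustration, not the S+ kernel. Factor(K) is the panel factorization with partial pivoting, SwapScale(K) the row exchanges (the step that generates row-swap communication in the parallel version), and Update(K) the BLAS-3 trailing update. The 2-D cyclic owner formula in the comment is the textbook mapping, not taken from the slides.

```python
# Dense, sequential stand-in for the column-block loop (illustration only).
# In the parallel code, block (I, J) on a Pr x Pc processor grid is owned
# under a 2-D cyclic mapping by processor (I % Pr) * Pc + (J % Pc).
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, bs):
    """In-place blocked LU with partial pivoting; returns the row permutation."""
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(0, n, bs):               # for each column block K
        e = min(k + bs, n)
        for j in range(k, e):               # Factor(K): panel factorization
            p = j + int(np.argmax(np.abs(A[j:, j])))
            A[[j, p]] = A[[p, j]]           # SwapScale(K): pivot row swap
            perm[[j, p]] = perm[[p, j]]     # (row-exchange communication
            A[j+1:, j] /= A[j, j]           #  in the parallel version)
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:                           # Update(K): triangular solve
            A[k:e, e:] = solve_triangular(A[k:e, k:e], A[k:e, e:],
                                          lower=True, unit_diagonal=True)
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]  # BLAS-3 trailing update
    return perm

A0 = np.random.randn(9, 9)
A = A0.copy()
perm = blocked_lu(A, bs=3)
L = np.tril(A, -1) + np.eye(9)
U = np.triu(A)
assert np.allclose(L @ U, A0[perm])         # PA = LU
```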
7
Large Diagonal Batch Pivoting
  • Large diagonal batch pivoting
  • Locate the largest elements for all columns in a
    block using one round of communication
  • Use them as pivoting elements
  • May be numerically unstable, since the pivots
    are chosen from the not-yet-updated columns
  • We check the error and fall back to the original
    pivoting if necessary
  • Previous approaches [Duff and Koster 1999, 2001;
    Li & Demmel 2004] use it in iterative methods
  • Batch pivoting reduces communication (see the
    sketch below)
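A minimal numpy sketch of the idea on one column block; the function names and the threshold-style stability test are assumptions, not code from the talk. The pivots are located in the not-yet-updated columns, which is exactly why the result can be unstable and must be verified before being committed.

```python
# Hedged sketch of large diagonal batch pivoting on one column block
# (dense stand-in; all helper names are invented for illustration).
import numpy as np

def batch_pivots_large_diagonal(panel):
    """One pivot row per column, chosen in a single pass over the block.
    In the parallel code this takes ONE round of communication: each
    processor reports its local column maxima and a reduction picks the
    global winners. The maxima come from the NOT-YET-UPDATED columns,
    which is why the resulting pivot sequence can be unstable."""
    m, bs = panel.shape
    pivots, used = [], set()
    for j in range(bs):
        for r in np.argsort(-np.abs(panel[:, j])):
            if int(r) not in used:          # distinct pivot row per column
                used.add(int(r)); pivots.append(int(r)); break
    return pivots

def stable_enough(panel, pivots, threshold=0.1):
    """Accept only if every chosen pivot is within `threshold` of its
    column's true maximum (a threshold-pivoting-style check; assumption)."""
    colmax = np.abs(panel).max(axis=0)
    chosen = np.abs(panel[pivots, np.arange(panel.shape[1])])
    return bool(np.all(chosen >= threshold * colmax))

panel = np.random.randn(8, 3)
piv = batch_pivots_large_diagonal(panel)
if not stable_enough(panel, piv):
    print("fall back to original column-by-column partial pivoting")
```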

8
Speculative Batch Pivoting
  • Large diagonal batch pivoting frequently fails
    the numerical stability test
  • Speculative batch pivoting
  • Collect candidate pivot rows (for all columns in
    a block) at one processor using one gather
    communication
  • Perform factorization at that processor to
    determine the pivots
  • Check the error and fall back to the original
    pivoting if necessary
  • Both batch pivoting strategies (the speculative
    scheme is sketched below)
  • Require additional computation
  • May slightly weaken the numerical stability
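A sketch of the speculative scheme under the same caveats (invented names, sequential stand-in for the gather): candidate rows for all columns of the block are collected as one gather communication would collect them in the parallel code, a small trial factorization with ordinary partial pivoting determines the pivots, and the caller still checks the error and falls back if needed.

```python
# Hedged sketch of speculative batch pivoting (sequential stand-in;
# communication points are marked in comments; helper names are invented).
import numpy as np

def speculative_batch_pivots(panel, cand_per_col=4):
    m, bs = panel.shape
    # Nominate candidate pivot rows: the few largest rows of each column.
    # In the parallel code each processor nominates local candidates and
    # ONE GATHER collects the corresponding rows at a single processor.
    cand = np.unique(np.concatenate(
        [np.argsort(-np.abs(panel[:, j]))[:cand_per_col] for j in range(bs)]))
    assert len(cand) >= bs, "need at least bs distinct candidate rows"
    sub = panel[cand].copy()
    pivots = []
    for j in range(bs):      # trial factorization with ordinary partial
        p = j + int(np.argmax(np.abs(sub[j:, j])))  # pivoting, performed
        sub[[j, p]] = sub[[p, j]]                   # at that one processor
        cand[[j, p]] = cand[[p, j]]
        sub[j+1:, j] /= sub[j, j]
        sub[j+1:, j+1:] -= np.outer(sub[j+1:, j], sub[j, j+1:])
        pivots.append(int(cand[j]))
    return pivots  # broadcast to all processors; check error, else fall back

print(speculative_batch_pivots(np.random.randn(10, 3)))
```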

9
Performance on Regatta-shmem
  • LD = large diagonal batch pivoting; SBP =
    speculative batch pivoting
  • TP = threshold pivoting [Duff et al. 1986]
  • Virtually no performance benefit

10
Performance on Platforms with Slower Message
Passing
  • PC cluster
  • The improvement from SBP is 28-292% for a set of
    8 test matrices
  • Regatta-TCP/IP
  • The improvement is up to 48%

11
Application Adaptation
  • Communication-reduction techniques
  • Effective on platforms with relatively slow
    message passing
  • Ineffective on first-class platforms, where their
    by-products (e.g., additional computation) may
    not be worthwhile
  • Sampling-based adaptation (a toy decision rule is
    sketched below)
  • Collect application statistics in a sampling phase
  • Couple them with platform characteristics to
    adaptively determine whether the candidate
    techniques should be employed
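A toy decision rule in the spirit of this adaptation; every statistic name, latency figure, and threshold below is an invented assumption, and only the shape of the trade-off (message latency saved by batching vs. extra computation added) comes from the slides. The per-CPU MFLOPS figures reuse slide 4's numbers.

```python
# Hedged sketch of sampling-based adaptation: decide per matrix/platform
# whether to enable the communication-reduction techniques.
def choose_version(sample_stats, platform):
    """sample_stats: measured in a short sampling phase (e.g., the first
    few column blocks); platform: measured once per installation.
    All dictionary keys here are illustrative assumptions."""
    # Estimated per-block saving from batching the pivot messages:
    saved_msgs = sample_stats["pivot_messages_per_block"] - 1
    comm_saving = saved_msgs * platform["msg_latency_sec"]
    # Estimated cost of the extra factorization work batching requires:
    extra_compute = (sample_stats["extra_flops_per_block"]
                     / platform["flops_per_sec"])
    return "TP+SBP" if comm_saving > extra_compute else "Original"

# Fast interconnect (e.g., Regatta-shmem): batching is not worthwhile.
platform_fast = {"msg_latency_sec": 2e-6, "flops_per_sec": 9.7e8}
# Slow interconnect (e.g., PC cluster over TCP/IP): batching wins.
platform_slow = {"msg_latency_sec": 1e-4, "flops_per_sec": 1.38e9}
stats = {"pivot_messages_per_block": 32, "extra_flops_per_block": 2e5}
print(choose_version(stats, platform_fast))  # Original
print(choose_version(stats, platform_slow))  # TP+SBP
```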

12
Adaptation on Regatta-shmem
  • The Adaptive version
  • Disables the comm-reduction techniques for most
    matrices
  • Achieves numerical stability similar to the
    Original version

13
Adaptation on the PC Cluster
  • The Adaptive version
  • Employs the comm-reduction techniques for all
    matrices
  • Performs close to the TP+SBP version

14
Conclusion
  • Contributions
  • Propose communication-reduction techniques to
    improve the LU factorization performance on
    platforms with relatively slow message passing
  • Runtime sampling-based adaptation to
    automatically choose the appropriate version of
    the application
  • http://www.cs.rochester.edu/~kshen/research/s+/