Title: Parallel Sparse LU Factorization on Second-class Message Passing Platforms
1. Parallel Sparse LU Factorization on Second-class Message Passing Platforms
- Kai Shen
- University of Rochester
2. Preliminary: Parallel Sparse LU Factorization
- LU factorization with partial pivoting (PA = LU) is used for solving a linear system Ax = b.
- Applications
  - Device/circuit simulation, fluid dynamics, ...
  - Newton's method for solving non-linear systems
- Challenges for parallel sparse LU factorization
  - Runtime data structure variation
  - Non-uniform computation/communication patterns
  - => Irregular
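For reference, this is the standard way a PA = LU factorization is used to solve Ax = b. The symbols follow textbook convention rather than anything defined on the slide: P is the row permutation chosen by partial pivoting, L is unit lower triangular, U is upper triangular, and y is an intermediate vector.

```latex
\begin{aligned}
PA &= LU, \\
Ly &= Pb \quad \text{(forward substitution)}, \\
Ux &= y \quad \text{(back substitution)}.
\end{aligned}
```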
3. Existing Solvers and Their Portability
- Shared memory solvers
  - SuperLU [Li, Demmel et al. 1999], WSMP [Gupta 2000], PARDISO [Schenk & Gärtner 2004]
- Message passing solvers
  - S+ [Shen et al. 2000], MUMPS [Amestoy et al. 2001], SuperLU_DIST [Li & Demmel 2004]
- Existing message passing solvers are portable, but perform poorly on platforms with slow message passing
  - Mostly designed for parallel computers with fast interconnect
- Performance portability is desirable
  - Large variation in the characteristics of available platforms
4. Example Message Passing Platforms
- Three platforms running MPI
- Regatta-shmem, Regatta-TCP/IP, PC cluster
- Per-CPU peak BLAS-3 performance is 971 MFLOPS on
Regatta and 1382 MFLOPS on a PC
5. Parallel Sparse LU Factorization on the Three Platforms
- Performance of S+ [Shen et al. 2000]
- We investigate communication-reduction techniques to improve performance on platforms with slow communication
6. Data Structure and Computation Steps
[Figure: block-partitioned matrix with labels "Row block K" and "Column block K"]
- for each column block K (K = 1 to N)
  - Perform Factor(K)
  - Perform SwapScale(K)
  - Perform Update(K)
- endfor
- Processor mapping
  - 1-D cyclic
  - 2-D cyclic (more scalable)
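The loop and the two mappings above can be sketched as follows. This is a minimal illustration, not the solver's code: the function names, the 0-based block indices, and the row-major numbering of the processor grid are all assumptions made for the example.

```python
def owner_1d(col_block, num_procs):
    """1-D cyclic: column block K is owned by processor K mod P."""
    return col_block % num_procs


def owner_2d(row_block, col_block, proc_rows, proc_cols):
    """2-D cyclic: block (I, J) goes to grid position (I mod Pr, J mod Pc).

    The rank is computed in row-major order (an assumption for this sketch).
    """
    return (row_block % proc_rows) * proc_cols + (col_block % proc_cols)


def factorization_schedule(num_blocks):
    """List the per-block steps in the order the slide's loop performs them."""
    steps = []
    for k in range(num_blocks):  # 0-based here; the slide counts K = 1 to N
        steps.append(("Factor", k))
        steps.append(("SwapScale", k))
        steps.append(("Update", k))
    return steps
```

With a 1-D mapping every block of a column lives on one processor, whereas the 2-D mapping spreads a column block over `proc_rows` processors, which is why pivot search within a column then requires communication.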
7. Large Diagonal Batch Pivoting
- Large diagonal batch pivoting
  - Locate the largest elements for all columns in a block using one round of communication
  - Use them as pivoting elements
  - May be numerically unstable
    - We check the error and fall back to the original pivoting if necessary
- Previous approaches [Duff and Koster 1999, 2001; Li & Demmel 2004] use it in iterative methods
- Batch pivoting to reduce communication
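A sequential sketch of the pivot selection step may help. It picks, for every column of a block, the largest-magnitude entry from the unfactored values, so all pivots can be chosen in one pass. The function name is hypothetical, and the greedy rule that skips rows already claimed by an earlier column is an assumption of this sketch; the slides do not say how conflicts are resolved or how the error check is computed.

```python
def large_diagonal_pivots(panel):
    """Choose one pivot row per column of a column block.

    panel: list of rows, each a list with one value per column in the block.
    Requires at least as many rows as columns.
    Returns a list where entry j is the row index chosen for column j.
    """
    num_cols = len(panel[0])
    taken = set()
    pivots = []
    for j in range(num_cols):
        # Consider only rows not yet used as a pivot (assumed tie-break rule).
        candidates = [i for i in range(len(panel)) if i not in taken]
        p = max(candidates, key=lambda i: abs(panel[i][j]))
        taken.add(p)
        pivots.append(p)
    return pivots
```

In a message passing setting each processor would compute its local maxima for all columns and a single reduction would combine them, instead of one reduction per column.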
8. Speculative Batch Pivoting
- Large diagonal batch pivoting frequently fails the numerical stability test
- Speculative batch pivoting
  - Collect candidate pivot rows (for all columns in a block) at one processor using one gather communication
  - Perform factorization at that processor to determine the pivots
  - Error checking and fall back to original pivoting if necessary
- Both batch pivoting strategies
  - Require additional computation
  - May slightly weaken the numerical stability
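The work done at the gathering processor can be sketched as below, assuming the candidate rows have already arrived through one gather. The function name, the `(row_id, values)` layout, and the tiny-pivot tolerance used as the failure test are assumptions for illustration; the slides do not specify the actual error check.

```python
def speculative_pivots(candidates, tol=1e-12):
    """Factor the gathered candidate rows locally to decide the pivots.

    candidates: list of (global_row_id, values) pairs, with one value per
    column in the block and at least as many rows as columns.
    Returns the global row ids chosen as pivots, in column order, or None
    to signal that the caller should fall back to the original pivoting.
    """
    ids = [rid for rid, _ in candidates]
    m = [list(vals) for _, vals in candidates]
    num_cols = len(m[0])
    pivots = []
    for k in range(num_cols):
        # Ordinary partial pivoting, restricted to the gathered rows.
        p = max(range(k, len(m)), key=lambda i: abs(m[i][k]))
        if abs(m[p][k]) <= tol:
            return None  # stand-in failure test (assumption)
        m[k], m[p] = m[p], m[k]
        ids[k], ids[p] = ids[p], ids[k]
        pivots.append(ids[k])
        # Eliminate column k from the remaining candidate rows.
        for i in range(k + 1, len(m)):
            f = m[i][k] / m[k][k]
            for j in range(k, num_cols):
                m[i][j] -= f * m[k][j]
    return pivots
```

Unlike the large diagonal variant, the second and later pivots here are chosen after elimination by the earlier ones, which is the extra computation the slide mentions.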
9. Performance on Regatta-shmem
- LD: large diagonal batch pivoting; SBP: speculative batch pivoting
- TP: threshold pivoting [Duff et al. 1986]
- Virtually no performance benefits
10. Performance on Platforms with Slower Message Passing
- PC cluster
  - Improvement of SBP is 28-292% for a set of 8 test matrices
- Regatta-TCP/IP
  - The improvement is up to 48%
11. Application Adaptation
- Communication-reduction techniques
  - Effective on platforms with relatively slow message passing
  - Ineffective on first-class platforms
    - Their by-products (e.g., additional computation) may not be worthwhile
- Sampling-based adaptation
  - Collect application statistics in a sampling phase
  - Combine them with platform characteristics to adaptively determine whether candidate techniques should be employed
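One way to picture the adaptation decision is a simple cost comparison. This sketch is an assumed model, not the one used in the talk: the function name, the choice of statistics (messages saved, extra floating-point operations), and the linear cost estimates are all hypothetical.

```python
def should_enable(msgs_saved, extra_flops, msg_latency_s, flops_per_s):
    """Decide whether a communication-reduction technique is worthwhile.

    msgs_saved, extra_flops: application statistics from the sampling phase.
    msg_latency_s, flops_per_s: platform characteristics.
    Returns True when the estimated communication time saved exceeds the
    estimated time spent on the additional computation.
    """
    comm_time_saved = msgs_saved * msg_latency_s
    extra_compute_time = extra_flops / flops_per_s
    return comm_time_saved > extra_compute_time
```

Under this model the same application statistics can lead to opposite decisions on different platforms: a high per-message latency favors enabling the technique, while a very low latency makes the extra computation the dominant cost.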
12. Adaptation on Regatta-shmem
- The Adaptive version
  - Disables the comm-reduction techniques for most matrices
  - Achieves numerical stability similar to that of the Original version
13. Adaptation on the PC Cluster
- The Adaptive version
  - Employs the comm-reduction techniques for all matrices
  - Performs close to the TP+SBP version
14. Conclusion
- Contributions
  - Propose communication-reduction techniques to improve LU factorization performance on platforms with relatively slow message passing
  - Runtime sampling-based adaptation to automatically choose the appropriate version of the application
- http://www.cs.rochester.edu/kshen/research/s/