1
Parallel Sparse LU Factorization on Second-class
Message Passing Platforms
  • Kai Shen
  • University of Rochester

2
Preliminary: Parallel Sparse LU Factorization
  • LU factorization with partial pivoting (PA = LU),
    used for solving a linear system Ax = b
  • Applications
  • Device/circuit simulation, fluid dynamics, ...
  • In Newton's method for solving non-linear
    systems (a sequential sketch follows this list)
  • Challenges for parallel sparse LU factorization
  • Runtime data structure variation
  • Non-uniform computation/communication patterns
  • ⇒ Irregular
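To ground the preliminary: a minimal sequential sketch, assuming SciPy's SuperLU wrapper `splu` as a stand-in for a sparse LU solver (the talk's own solver is the parallel S+, which is not shown here). It performs one Newton step by solving the sparse linear system J(x)·dx = −F(x) with LU factorization and partial pivoting; the toy system `F` and Jacobian `J` are invented for illustration.

```python
# Minimal sequential illustration (NOT the parallel S+ solver from the talk):
# one Newton step solves the sparse linear system J(x) dx = -F(x)
# via LU factorization with partial pivoting (PA = LU).
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def newton_step(F, J, x):
    """Return x + dx, where J(x) dx = -F(x) is solved by sparse LU."""
    lu = splu(csc_matrix(J(x)))   # LU factorization with partial pivoting
    return x + lu.solve(-F(x))

# Toy nonlinear system: F(x) = x^2 - [1, 4] componentwise, Jacobian = diag(2x).
F = lambda x: x**2 - np.array([1.0, 4.0])
J = lambda x: np.diag(2.0 * x)
x = np.array([3.0, 3.0])
for _ in range(8):
    x = newton_step(F, J, x)
print(x)  # converges to [1, 2]
```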

3
Existing Solvers and Their Portability
  • Shared memory solvers
  • SuperLU [Li, Demmel et al. 1999], WSMP [Gupta
    2000], PARDISO [Schenk & Gärtner 2004]
  • Message passing solvers
  • S+ [Shen et al. 2000], MUMPS [Amestoy et al.
    2001], SuperLU_DIST [Li & Demmel 2004]
  • Existing message passing solvers are portable,
    but perform poorly on platforms with slow message
    passing
  • Mostly designed for parallel computers with fast
    interconnect
  • Performance portability is desirable, given the
    large variation in the characteristics of
    available platforms

4
Example Message Passing Platforms
  • Three platforms running MPI
  • Regatta-shmem, Regatta-TCP/IP, PC cluster
  • Per-CPU peak BLAS-3 performance is 971 MFLOPS on
    Regatta and 1382 MFLOPS on a PC

5
Parallel Sparse LU Factorization on the Three
Platforms
  • Performance of S+ [Shen et al. 2000]
  • We investigate communication-reduction techniques
    to improve performance on platforms with slow
    message passing

6
Data Structure and Computation Steps
[Figure: the matrix partitioned into blocks, with row block K and column block K labeled]
  • for each column block K (1…N)
  • Perform Factor(K)
  • Perform SwapScale(K)
  • Perform Update(K)
  • endfor (a dense stand-in sketch of this loop
    follows below)
  • Processor mapping
  • 1-D cyclic
  • 2-D cyclic (more scalable)

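A dense, sequential stand-in for the loop above, assuming standard blocked right-looking LU; this is an illustration, not the S+ kernel. Factor(K) is the panel factorization with partial pivoting, SwapScale(K) the row exchanges (the step that generates row-swap communication in the parallel version), and Update(K) the BLAS-3 trailing update. The 2-D cyclic owner formula in the comment is the textbook mapping, not taken from the slides.

```python
# Dense, sequential stand-in for the column-block loop (illustration only).
# In the parallel code, block (I, J) on a Pr x Pc processor grid is owned
# under a 2-D cyclic mapping by processor (I % Pr) * Pc + (J % Pc).
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, bs):
    """In-place blocked LU with partial pivoting; returns the row permutation."""
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(0, n, bs):               # for each column block K
        e = min(k + bs, n)
        for j in range(k, e):               # Factor(K): panel factorization
            p = j + int(np.argmax(np.abs(A[j:, j])))
            A[[j, p]] = A[[p, j]]           # SwapScale(K): pivot row swap
            perm[[j, p]] = perm[[p, j]]     # (row-exchange communication
            A[j+1:, j] /= A[j, j]           #  in the parallel version)
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:                           # Update(K): triangular solve
            A[k:e, e:] = solve_triangular(A[k:e, k:e], A[k:e, e:],
                                          lower=True, unit_diagonal=True)
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]  # BLAS-3 trailing update
    return perm

A0 = np.random.randn(9, 9)
A = A0.copy()
perm = blocked_lu(A, bs=3)
L = np.tril(A, -1) + np.eye(9)
U = np.triu(A)
assert np.allclose(L @ U, A0[perm])         # PA = LU
```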
7
Large Diagonal Batch Pivoting
  • Large diagonal batch pivoting
  • Locate the largest elements for all columns in a
    block using one round of communication
  • Use them as pivoting elements
  • May be numerically unstable, since the pivots
    are chosen from the not-yet-updated columns
  • We check the error and fall back to the original
    pivoting if necessary
  • Previous approaches [Duff and Koster 1999, 2001;
    Li & Demmel 2004] use it in iterative methods
  • Batch pivoting reduces communication (see the
    sketch below)
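A minimal numpy sketch of the idea on one column block; the function names and the threshold-style stability test are assumptions, not code from the talk. The pivots are located in the not-yet-updated columns, which is exactly why the result can be unstable and must be verified before being committed.

```python
# Hedged sketch of large diagonal batch pivoting on one column block
# (dense stand-in; all helper names are invented for illustration).
import numpy as np

def batch_pivots_large_diagonal(panel):
    """One pivot row per column, chosen in a single pass over the block.
    In the parallel code this takes ONE round of communication: each
    processor reports its local column maxima and a reduction picks the
    global winners. The maxima come from the NOT-YET-UPDATED columns,
    which is why the resulting pivot sequence can be unstable."""
    m, bs = panel.shape
    pivots, used = [], set()
    for j in range(bs):
        for r in np.argsort(-np.abs(panel[:, j])):
            if int(r) not in used:          # distinct pivot row per column
                used.add(int(r)); pivots.append(int(r)); break
    return pivots

def stable_enough(panel, pivots, threshold=0.1):
    """Accept only if every chosen pivot is within `threshold` of its
    column's true maximum (a threshold-pivoting-style check; assumption)."""
    colmax = np.abs(panel).max(axis=0)
    chosen = np.abs(panel[pivots, np.arange(panel.shape[1])])
    return bool(np.all(chosen >= threshold * colmax))

panel = np.random.randn(8, 3)
piv = batch_pivots_large_diagonal(panel)
if not stable_enough(panel, piv):
    print("fall back to original column-by-column partial pivoting")
```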

8
Speculative Batch Pivoting
  • Large diagonal batch pivoting frequently fails
    the numerical stability test
  • Speculative batch pivoting
  • Collect candidate pivot rows (for all columns in
    a block) at one processor using one gather
    communication
  • Perform factorization at that processor to
    determine the pivots
  • Check the error and fall back to the original
    pivoting if necessary
  • Both batch pivoting strategies (the speculative
    scheme is sketched below)
  • Require additional computation
  • May slightly weaken the numerical stability
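A sketch of the speculative scheme under the same caveats (invented names, sequential stand-in for the gather): candidate rows for all columns of the block are collected as one gather communication would collect them in the parallel code, a small trial factorization with ordinary partial pivoting determines the pivots, and the caller still checks the error and falls back if needed.

```python
# Hedged sketch of speculative batch pivoting (sequential stand-in;
# communication points are marked in comments; helper names are invented).
import numpy as np

def speculative_batch_pivots(panel, cand_per_col=4):
    m, bs = panel.shape
    # Nominate candidate pivot rows: the few largest rows of each column.
    # In the parallel code each processor nominates local candidates and
    # ONE GATHER collects the corresponding rows at a single processor.
    cand = np.unique(np.concatenate(
        [np.argsort(-np.abs(panel[:, j]))[:cand_per_col] for j in range(bs)]))
    assert len(cand) >= bs, "need at least bs distinct candidate rows"
    sub = panel[cand].copy()
    pivots = []
    for j in range(bs):      # trial factorization with ordinary partial
        p = j + int(np.argmax(np.abs(sub[j:, j])))  # pivoting, performed
        sub[[j, p]] = sub[[p, j]]                   # at that one processor
        cand[[j, p]] = cand[[p, j]]
        sub[j+1:, j] /= sub[j, j]
        sub[j+1:, j+1:] -= np.outer(sub[j+1:, j], sub[j, j+1:])
        pivots.append(int(cand[j]))
    return pivots  # broadcast to all processors; check error, else fall back

print(speculative_batch_pivots(np.random.randn(10, 3)))
```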

9
Performance on Regatta-shmem
  • LD = large diagonal batch pivoting; SBP =
    speculative batch pivoting
  • TP = threshold pivoting [Duff et al. 1986]
  • Virtually no performance benefit

10
Performance on Platforms with Slower Message
Passing
  • PC cluster
  • The improvement from SBP is 28-292% for a set of
    8 test matrices
  • Regatta-TCP/IP
  • The improvement is up to 48%

11
Application Adaptation
  • Communication-reduction techniques
  • Effective on platforms with relatively slow
    message passing
  • Ineffective on first-class platforms, where their
    by-products (e.g., additional computation) may
    not be worthwhile
  • Sampling-based adaptation (a toy decision rule is
    sketched below)
  • Collect application statistics in a sampling phase
  • Couple them with platform characteristics to
    adaptively determine whether the candidate
    techniques should be employed
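A toy decision rule in the spirit of this adaptation; every statistic name, latency figure, and threshold below is an invented assumption, and only the shape of the trade-off (message latency saved by batching vs. extra computation added) comes from the slides. The per-CPU MFLOPS figures reuse slide 4's numbers.

```python
# Hedged sketch of sampling-based adaptation: decide per matrix/platform
# whether to enable the communication-reduction techniques.
def choose_version(sample_stats, platform):
    """sample_stats: measured in a short sampling phase (e.g., the first
    few column blocks); platform: measured once per installation.
    All dictionary keys here are illustrative assumptions."""
    # Estimated per-block saving from batching the pivot messages:
    saved_msgs = sample_stats["pivot_messages_per_block"] - 1
    comm_saving = saved_msgs * platform["msg_latency_sec"]
    # Estimated cost of the extra factorization work batching requires:
    extra_compute = (sample_stats["extra_flops_per_block"]
                     / platform["flops_per_sec"])
    return "TP+SBP" if comm_saving > extra_compute else "Original"

# Fast interconnect (e.g., Regatta-shmem): batching is not worthwhile.
platform_fast = {"msg_latency_sec": 2e-6, "flops_per_sec": 9.7e8}
# Slow interconnect (e.g., PC cluster over TCP/IP): batching wins.
platform_slow = {"msg_latency_sec": 1e-4, "flops_per_sec": 1.38e9}
stats = {"pivot_messages_per_block": 32, "extra_flops_per_block": 2e5}
print(choose_version(stats, platform_fast))  # Original
print(choose_version(stats, platform_slow))  # TP+SBP
```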

12
Adaptation on Regatta-shmem
  • The Adaptive version
  • Disables the comm-reduction techniques for most
    matrices
  • Achieves numerical stability similar to the
    Original version

13
Adaptation on the PC Cluster
  • The Adaptive version
  • Employs the comm-reduction techniques for all
    matrices
  • Performs close to the TP+SBP version

14
Conclusion
  • Contributions
  • Propose communication-reduction techniques to
    improve the LU factorization performance on
    platforms with relatively slow message passing
  • Runtime sampling-based adaptation to
    automatically choose the appropriate version of
    the application
  • http://www.cs.rochester.edu/~kshen/research/s+/