Title: Trends in Sparse Linear Research and Software Developments in France
1. Trends in Sparse Linear Research and Software Developments in France
- P. Amestoy (1), M. Daydé (1), I. Duff (2), L. Giraud (1), A. Haidar (3), S. Lanteri (4), J.-Y. L'Excellent (5), P. Ramet (6)
- (1) IRIT INPT / ENSEEIHT
- (2) CERFACS / RAL
- (3) CERFACS
- (4) INRIA, nachos project-team
- (5) INRIA / LIP-ENSL
- (6) INRIA Futurs / LaBRI
- 1st French-Japanese Workshop on Petascale Applications, Algorithms and Programming (PAAP), RIKEN, Tokyo, Nov. 12, 2007
2. Scope of this overview
- Mainly teams involved in the development of computational kernels / software for scientific libraries
- Not intended to be exhaustive
- Some aspects already covered by the talks of
  - Serge Petiton
  - Nahid Emad
  - Stéphane Lanteri
  - Franck Cappello
3. Overview of sparse linear algebra techniques
- From Iain Duff
- CERFACS / RAL
4. Introduction
- Solution of Ax = b
- where A is large (of order 10^6 or greater) and sparse
5. Direct methods
- Key idea: factorize the matrix into a product of matrices that are easy to invert (triangular), with possibly some permutations to preserve sparsity and maintain numerical stability
- E.g. Gaussian elimination PAQ = LU, where
  - P and Q are row / column permutations,
  - L is lower triangular (sparse),
  - U is upper triangular (sparse)
- Then forward / backward substitution: Ly = Pb, then U Q^T x = y
- Good points
  - Can solve very large problems, even in 3D
  - Generally fast ... even on fancy machines
  - Robust and well packaged
- Bad points
  - Need A explicitly (element-wise or assembled); can have very high storage requirements
  - Can be costly
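The factor-then-substitute pattern above can be sketched in a few lines. This is a minimal illustration using SciPy's sparse LU (SuperLU under the hood) as a stand-in for a production direct solver such as MUMPS or PaStiX; the matrix is a toy 1D Laplacian, not one of the industrial problems discussed later.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Small sparse test matrix: 1D Laplacian (tridiagonal).
n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Factorize PAQ = LU (permutations chosen for sparsity and stability),
# then solve by forward / backward substitution.
lu = spla.splu(A)
x = lu.solve(b)

print(np.linalg.norm(A @ x - b))  # residual near machine precision
```

Once the (costly) factorization is computed, each additional right-hand side costs only a cheap pair of triangular solves.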
6. Iterative methods
- Generate a sequence of vectors converging towards the solution of the linear system
- E.g. iterative methods based on Krylov subspaces K_k(A, r0) = span{r0, A r0, ..., A^(k-1) r0}, where r0 = b - A x0
- The idea is then to choose a suitable xk in x0 + K_k(A, r0)
- For example, so that it minimizes ||b - A xk||_2 (GMRES)
- There are many Krylov methods, depending on the criteria used for choosing xk
- Good points
  - May not require forming A explicitly
  - Usually very low storage requirements
  - Can be efficient on 3D problems
- Bad points
  - May require many iterations to converge
  - May require preconditioning
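The "may not require forming A" point can be made concrete: a Krylov method such as GMRES only needs the action v -> Av. A minimal matrix-free sketch, assuming SciPy (the tridiagonal operator here is purely illustrative):

```python
import numpy as np
import scipy.sparse.linalg as spla

n = 100

def matvec(v):
    # Apply tridiag(-1, 4, -1) without ever storing the matrix.
    w = 4.0 * v
    w[:-1] -= v[1:]
    w[1:] -= v[:-1]
    return w

A_op = spla.LinearOperator((n, n), matvec=matvec)
b = np.ones(n)

# GMRES picks x_k in x_0 + K_k(A, r_0) minimizing ||b - A x_k||_2.
x, info = spla.gmres(A_op, b)  # info == 0 signals convergence
```

Storage is limited to a handful of vectors (plus the Krylov basis for GMRES), in contrast to the fill-in generated by a direct factorization.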
7. Hybrid methods
- Attempt to retain the good properties of both direct and iterative methods
- Not just preconditioning
  - In ILU, sparse approximate inverse or incomplete Cholesky, core techniques of direct methods are used
- Cf. the MERAM method (talk by N. Emad)
- We focus on using a direct method/code combined with an iterative method
8. Hybrid methods (cont.)
- Generic examples of hybrid methods are
  - Domain decomposition ... using a direct method on local subdomains and/or a direct preconditioner on the interface (cf. talk by Stéphane Lanteri)
  - Block iterative methods ... direct solver on subblocks
  - Factorization of a nearby problem as a preconditioner
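The last item above can be sketched directly: factor a simplified, nearby matrix exactly with a direct solver, and use that factorization as the preconditioner inside a Krylov iteration. A hedged toy example, assuming SciPy as a stand-in for a direct solver such as MUMPS (the matrices are illustrative tridiagonal operators):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 300
# Operator of interest: slightly nonsymmetric tridiagonal matrix.
A = sp.diags([-1.1, 2.5, -0.9], [-1, 0, 1], shape=(n, n), format="csc")
# Nearby, easier problem: its symmetrized version.
A_near = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

lu = spla.splu(A_near)  # exact direct factorization of the nearby matrix
M = spla.LinearOperator((n, n), matvec=lu.solve)  # preconditioner M ~ A^-1

x, info = spla.gmres(A, b, M=M)  # preconditioned Krylov iteration
```

Because A_near is close to A, the preconditioned spectrum clusters near 1 and GMRES converges in few iterations, while the factorization cost is paid on the simpler matrix.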
9. Sparse direct methods
10. Sparse Direct Solvers
- Usually three steps
  - Pre-processing / symbolic factorization
  - Numerical factorization
  - Forward-backward substitution
11. MUMPS and sparse direct methods
http://graal.ens-lyon.fr/MUMPS and http://mumps.enseeiht.fr
12. History
- Main contributors since 1996: Patrick Amestoy, Iain Duff, Abdou Guermouche, Jacko Koster, Jean-Yves L'Excellent, Stéphane Pralet
- Current development team
  - Patrick Amestoy, ENSEEIHT-IRIT
  - Abdou Guermouche, LaBRI-INRIA
  - Jean-Yves L'Excellent, INRIA
  - Stéphane Pralet, now working for SAMTECH
- PhD students
  - Emmanuel Agullo, ENS-Lyon
  - Tzvetomila Slavova, CERFACS
13. Users
- Around 1000 users, 2 requests per day
- Academic or industrial
- Types of applications
  - Structural mechanics, CAD
  - Fluid dynamics, magnetohydrodynamics, physical chemistry
  - Wave propagation and seismic imaging, ocean modelling
  - Acoustics and electromagnetic propagation
  - Biology
  - Finite element analysis, numerical optimization, simulation
  - ...
14. MUMPS: A MUltifrontal Massively Parallel Solver
- MUMPS solves large systems of linear equations of the form Ax = b by factorizing A into A = LU or A = LDL^T
- Symmetric or unsymmetric matrices (partial pivoting)
- Parallel factorization and solution phases (uniprocessor version also available)
- Iterative refinement and backward error analysis
- Various matrix input formats
  - assembled format
  - distributed assembled format
  - sum of elemental matrices
- Partial factorization and Schur complement matrix
- Version for complex arithmetic
- Several orderings interfaced: AMD, AMF, PORD, METIS, SCOTCH
15. The multifrontal method (Duff, Reid '83)
- Memory is divided into two parts (that can overlap in time)
  - the factors
  - the active memory
- The elimination tree represents task dependencies
16. Implementation
- Distributed multifrontal solver (MPI / F90 based)
- Dynamic distributed scheduling to accommodate both numerical fill-in and multi-user environments
- Use of BLAS, ScaLAPACK
- A fully asynchronous distributed solver (VAMPIR trace, 8 processors)
17. MUMPS
- 3 main steps (plus initialization and termination)
  - JOB=-1: initialize solver type (LU, LDL^T) and default parameters
  - JOB=1: analyse the matrix, build an ordering, prepare factorization
  - JOB=2: (parallel) numerical factorization A = LU
  - JOB=3: (parallel) solution step, forward and backward substitutions (Ly = b, Ux = y)
  - JOB=-2: termination, deallocate all MUMPS data structures
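The point of separating analysis/factorization (JOB=1, JOB=2) from the solution step (JOB=3) is that the expensive phases are done once and the cheap solve is repeated. A hedged sketch of that phase separation, using SciPy's splu as an analogy rather than the actual MUMPS API:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")

# Analysis + numerical factorization (MUMPS JOB=1 and JOB=2), done once.
lu = spla.splu(A)

# Solution phase (MUMPS JOB=3), reused for each new right-hand side:
# forward substitution Ly = Pb, then backward substitution U Q^T x = y.
for k in range(3):
    b = np.full(n, float(k + 1))
    x = lu.solve(b)
```

In the real MUMPS driver the same effect is obtained by calling the solver repeatedly with JOB=3 after a single JOB=2 factorization.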
18. (No transcript)
19. Competitive performance
- Comparison with SuperLU, extracted from ACM TOMS 2001 and obtained with S. Li
- Recent performance results (ops = number of operations)
- Factorization time in seconds of large matrices on the CRAY T3E (1 proc: not enough memory)
20. Functionalities, features
- Recent features
  - Symmetric indefinite matrices: preprocessing and 2-by-2 pivots
  - Hybrid scheduling
  - 2D cyclic distributed Schur complement
  - Sparse, multiple right-hand sides
  - Singular matrices with detection of null pivots
  - Interfaces to MUMPS: Fortran, C, Matlab (S. Pralet, while at ENSEEIHT-IRIT) and Scilab (A. Fèvre, INRIA)
- Future functionalities
  - Out-of-core execution
  - Parallel analysis phase
  - Rank-revealing algorithms
  - Hybrid direct-iterative solvers (with Luc Giraud)
21. On-going research on out-of-core solvers
- (Ph.D. E. Agullo, ENS Lyon and Ph.D. M. Slavova, CERFACS)
- Use disk storage to solve very large problems
- Parallel out-of-core factorization
- Preprocessing to minimize the volume of I/O
- Scheduling for the out-of-core solution
- Figure: ratio of active to total memory peak on different numbers of processors for several large problems
22. Hybrid scheduling
- Both memory and workload information are used to obtain better behaviour in terms of estimated memory, memory used and factorization time in the context of parallel factorization algorithms
- Estimated memory much closer to the memory effectively used
- Table: estimated and effective memory (millions of reals) for the factorization on 64 processors; Max = maximum amount of memory, Avg = average memory per processor
23. Memory-minimizing schedules
- Multifrontal methods can use a large amount of temporary data
- By decoupling task allocation and task processing, we can reduce the amount of temporary data: a new optimal schedule was proposed in this context (Guermouche, L'Excellent, ACM TOMS)
- Memory gains: active memory ratio (new algorithm vs. Liu's ordering)
- Remark: gains relative to Liu's algorithm are equal to 27.1, 17.5 and 19.6 for matrices 8, 9 and 10 (the Gupta matrices), respectively
24. Preprocessing for symmetric matrices (S. Pralet, ENSEEIHT-IRIT)
- Preprocessing: new scaling available, symmetric weighted matching and automatic tuning of the preprocessing strategies
- Pivoting strategy (2-by-2 pivots and static pivoting)
- Improvements in
  - factorization time
  - robustness, in particular on KKT systems arising from optimization
  - memory estimation
- Factorization time on a Linux PC (Pentium 4, 2.80 GHz)
25. Scotch, PaStiX
- PaStiX team
- INRIA Futurs / LaBRI
26. PaStiX solver
- Functionalities
  - LL^T, LDL^T, LU factorization (symmetric pattern) with supernodal implementation
  - Static pivoting (max. weight matching) + iterative refinement / CG / GMRES
  - 1D/2D block distribution + full BLAS3
  - Supports external ordering libraries (Scotch ordering provided)
  - MPI/threads implementation (SMP node / cluster / multi-core / NUMA)
  - Simple/double precision, real/complex operations
  - Requires only C + MPI + POSIX threads
  - Multiple RHS (direct factorization)
  - Incomplete factorization ILU(k) preconditioner
27. PaStiX solver (cont.)
- Available on INRIA GForge
  - All-in-one source code
  - Easy to install on Linux or AIX systems
  - Simple API (WSMP-like)
  - Thread safe (can be called from multiple threads in multiple MPI communicators)
- Current work
  - Use of parallel ordering (PT-Scotch) and parallel symbolic factorization
  - Dynamic scheduling inside SMP nodes (static mapping)
  - Out-of-core implementation
  - Generic finite element assembly (domain decomposition associated with the matrix distribution)
28. Direct solver chain (in PaStiX)
- Analysis (sequential steps)
- Parallel factorization and solve
29. Direct solver chain (in PaStiX)
- Sparse matrix ordering (minimizes fill-in)
- Scotch: a hybrid algorithm
  - incomplete Nested Dissection
  - the resulting subgraphs being ordered with an Approximate Minimum Degree method under constraints (HAMD)
30. Direct solver chain (in PaStiX)
- The symbolic block factorization
- Key property: Q(G,P)* = Q(G*,P), so the cost is linear in the number of blocks!
- Dense block structures: only a few extra pointers are needed to store the matrix
31. Direct solver chain (in PaStiX)
- CPU time prediction, exact memory resources
32. Hybrid 1D and 2D distributions
- Yields 1D and 2D block distributions
- BLAS efficiency on small supernodes: 1D
- Scalability on larger supernodes: 2D
- => switching criterion
- 1D block distribution
33. Supernodal scheme
- 2D to 2D communication scheme
- A dynamic technique is used to improve the 1D to 1D/2D communication scheme
34. Direct solver chain (in PaStiX)
- Modern architecture management (SMP nodes): hybrid threads/MPI implementation (all processors in the same SMP node work directly in shared memory)
- Less MPI communication and lower parallel memory overhead
35. Incomplete factorization in PaStiX
- Start from the acknowledgement that it is difficult to build a generic and robust preconditioner for
  - large-scale 3D problems
  - high-performance computing
- Derive direct solver techniques into a preconditioner
- What's new: (dense) block formulation
  - Incomplete block symbolic factorization
  - Remove blocks using algebraic criteria
  - Use an amalgamation algorithm to get dense blocks
- Provides incomplete factorizations LDL^T, Cholesky, LU (with static pivoting for symmetric patterns)
36. Numerical experiments (TERA1)
- Successful approach for a large collection of industrial test cases (PARASOL, Boeing-Harwell, CEA) on IBM SP3
- TERA1 supercomputer of CEA Ile-de-France (ES45 SMP, 4 procs)
- COUPOLE40000: 26.5 x 10^6 unknowns, 1.5 x 10^10 NNZ in L and 10.8 Tflops
  - 356 procs: 34 s
  - 512 procs: 27 s
  - 768 procs: 20 s
  - (> 500 Gflop/s, about 35% of peak performance)
37. Numerical experiments (TERA10)
- Successful approach on a 3D mesh problem with about 30 million unknowns on the TERA10 supercomputer
- But memory is the bottleneck!
- ODYSSEE code of French CEA/CESTA: electromagnetism code (finite element method + integral equation), complex double precision, Schur complement
38. SP3, AUDI matrix: N = 943 x 10^3, NNZ in L = 1.21 x 10^9, 5.3 TFlops
- Num. of threads / MPI process
39. ASTER project
- ANR CIS 2006: Adaptive MHD Simulation of Tokamak ELMs for ITER (ASTER)
- The ELM (Edge-Localized Mode) is an MHD instability localised at the boundary of the plasma
- The energy losses induced by ELMs within several hundred microseconds are a real concern for ITER
- The non-linear MHD simulation code JOREK is under development at the CEA to study the evolution of the ELM instability
- To simulate the complete cycle of the ELM instability, a large range of time scales needs to be resolved to study
  - the evolution of the equilibrium pressure gradient (seconds)
  - the ELM instability (hundreds of microseconds)
- A fully implicit time evolution scheme is used in the JOREK code
- This leads to a large sparse matrix system to be solved at every time step
- This scheme is made possible by the recent advances in the parallel direct solution of general sparse matrices (MUMPS, PaStiX, SuperLU, ...)
40. NUMASIS project
- Middleware for NUMA architectures
- Application to seismology simulation
- New dynamic scheduling algorithms for solvers
  - Take non-uniform memory access into account
- Use of MadMPI (R. Namyst)
  - Need a middleware to handle complex architectures
  - Multi-rail, packet re-ordering, ...
- PhD of Mathieu Faverge (INRIA-LaBRI)
41. Industrial applications
- OSSAU code of French CEA/CESTA
  - 2D / 3D structural mechanics code
- ODYSSEE code of French CEA/CESTA
  - Electromagnetism code (finite element method + integral equation)
  - Complex double precision, Schur complement
- Fluid mechanics
  - LU factorization with static pivoting (SuperLU-like approach)
42. Other parallel sparse direct codes
43. Other parallel sparse direct codes (cont.)
44. Sparse solver for Ax = b: only a black box?
- Preprocessing and postprocessing
  - Symmetric permutations to reduce fill-in (Ax = b => (PAP^T)(Px) = Pb)
  - Numerical pivoting, scaling to preserve numerical accuracy
  - Maximum transversal (set large entries on the diagonal)
  - Preprocessing for parallelism (influence of task mapping on parallelism)
  - Iterative refinement, error analysis
- A default (often automatic/adaptive) setting of the options is usually available
- However, better knowledge of the options can help the user to further improve
  - memory usage,
  - time for solution,
  - numerical accuracy.
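The first preprocessing item, a symmetric fill-reducing permutation, can be demonstrated with the reverse Cuthill-McKee ordering available in SciPy (production solvers use stronger orderings such as AMD, METIS or Scotch, listed earlier; RCM is just an easy-to-show example, and the random test matrix below is illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Random sparse matrix, symmetrized, with a nonzero diagonal.
n = 200
A = sp.random(n, n, density=0.02, random_state=0, format="csr")
A = (A + A.T + sp.identity(n)).tocsr()

# Symmetric permutation PAP^T chosen to cluster nonzeros near the diagonal.
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
P = A[perm][:, perm]

def bandwidth(M):
    coo = M.tocoo()
    return int(np.max(np.abs(coo.row - coo.col)))

print(bandwidth(A), bandwidth(P))  # bandwidth drops after reordering
```

A smaller bandwidth (or, for nested-dissection-type orderings, a better separator tree) directly bounds the fill-in created by the factorization, which is why the choice of ordering is one of the most influential solver options.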
45. The GRID-TLSE project
- A web expert site for sparse linear algebra
46. Overview of GRID-TLSE: http://gridtlse.org
- Supported by
  - ANR LEGO project
  - ANR SOLSTICE project
  - CNRS / JST program: REDIMPS project
  - ACI GRID-TLSE project
- Partners
47. A sparse matrices expert site?
- Expert site: help users choose the right solvers and their parameters for a given problem
- Chosen approach: expert scenarios which answer common user requests
- Main goal: provide a friendly test environment for expert and non-expert users of sparse linear algebra software
48. A sparse matrices expert site?
- Easy access to
  - software and tools
  - a wide range of computer architectures
  - matrix collections
  - expert scenarios
- Also provides a testbed for sparse linear algebra software
49. Why use the grid?
- Sparse linear algebra software makes use of sophisticated algorithms for (pre-/post-)processing the matrix
- Multiple parameters interact in the efficient execution of a sparse direct solver
  - ordering
  - amount of memory
  - architecture of the computer
  - available libraries
- Determining the best combination of parameter values is a multi-parametric problem
- Well-suited for execution over a grid
50. Components
- Software components
- DIET / ITBL / AEGIS
- Middleware: DIET, developed within GRID-ASP (LIP, Loria Resedas, LIFC-SDRP), and soon ITBL (?)
51. Hybrid solvers
52. Parallel hybrid iterative/direct solver for the solution of large sparse linear systems arising from 3D elliptic discretizations
- L. Giraud (1), A. Haidar (2), S. Watson (3)
- (1) ENSEEIHT, Parallel Algorithms and Optimization Group, 2 rue Camichel, 31071 Toulouse, France
- (2) CERFACS, Parallel Algorithms Project, 42 Avenue Coriolis, 31057 Toulouse, France
- (3) Departments of Computer Science and Mathematics, Virginia Polytechnic Institute, USA
53. Non-overlapping domain decomposition
- Natural approach for PDEs, extended to general sparse matrices
- Partition the problem into subdomains / subgraphs
- Use a direct solver on the subdomains (MUMPS package)
- Robust algebraically preconditioned iterative solver on the interface (algebraic additive Schwarz preconditioner, possibly with sparsified and mixed-arithmetic variants)
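The interior/interface splitting behind this approach can be written down in a few lines. A dense toy sketch in NumPy (a random SPD matrix standing in for a discretized PDE; in the actual solver the interior solves use MUMPS and the Schur complement system is solved iteratively, not by a dense direct solve):

```python
import numpy as np

rng = np.random.default_rng(1)
n_i, n_g = 8, 3  # interior and interface ("Gamma") unknowns
K = rng.standard_normal((n_i + n_g, n_i + n_g))
A = K @ K.T + (n_i + n_g) * np.eye(n_i + n_g)  # SPD test matrix
b = rng.standard_normal(n_i + n_g)

Aii, Aig = A[:n_i, :n_i], A[:n_i, n_i:]
Agi, Agg = A[n_i:, :n_i], A[n_i:, n_i:]
bi, bg = b[:n_i], b[n_i:]

# Eliminate the interior unknowns: Schur complement on the interface,
# S = Agg - Agi Aii^{-1} Aig, with reduced right-hand side g.
S = Agg - Agi @ np.linalg.solve(Aii, Aig)
g = bg - Agi @ np.linalg.solve(Aii, bi)

xg = np.linalg.solve(S, g)                # interface solve (Krylov in practice)
xi = np.linalg.solve(Aii, bi - Aig @ xg)  # back-substitute the interior

x = np.concatenate([xi, xg])
```

With several subdomains, Aii is block diagonal, so the interior solves are independent direct factorizations done in parallel, and only the (much smaller) interface system is iterated on.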
54. Numerical behaviour of the preconditioners
- Convergence history on a 43-million-dof problem on 1000 System X processors (Virginia Tech)
55. Parallel scaled scalability study
- Parallel elapsed time with fixed sub-problem size (43,000 dof) when the number of processors varies from 27 (1.1 x 10^6 dof) up to 1000 (43 x 10^6 dof)
56. Hybrid iterative/direct strategies for solving large sparse linear systems resulting from the finite element discretization of the time-harmonic Maxwell equations
- L. Giraud (1), A. Haidar (2), S. Lanteri (3)
- (1) ENSEEIHT, Parallel Algorithms and Optimization Group, 2 rue Camichel, 31071 Toulouse, France
- (2) CERFACS, Parallel Algorithms Project, 42 Avenue Coriolis, 31057 Toulouse, France
- (3) INRIA, nachos project-team, 2004 Route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France
57. Context and objectives
- Solution of time-harmonic electromagnetic wave propagation
- Discretization in space
  - Discontinuous Galerkin time-harmonic methods
  - Unstructured tetrahedral meshes
  - Based on P_k nodal (Lagrange) interpolation
  - Centered or upwind fluxes for the calculation of jump terms at cell boundaries
- Discretization in space results in a large sparse linear system with complex coefficients
- Direct (sparse LU) solvers for 2D problems
- Parallel solvers are mandatory for 3D problems
- Related publications
  - H. Fol (PhD thesis, 2006)
  - V. Dolean, H. Fol, S. Lanteri and R. Perrussel (J. Comp. Appl. Math., to appear, 2007)
58. Solution algorithm
- Parallel hybrid iterative/direct solver
- Domain decomposition framework
  - Schwarz algorithm with characteristic interface conditions
  - Sparse LU subdomain solver (MUMPS; P.R. Amestoy, I.S. Duff and J.-Y. L'Excellent, Comput. Meth. Appl. Mech. Engng., Vol. 184, 2000)
- Interface (Schur complement type) formulation
- Iterative interface solver (GMRES or BiCGstab)
- Algebraic block preconditioning of the interface system: exploits the structure of the system (free to construct and store)
59. Scattering of a plane wave by a PEC cube
- Plane wave frequency: 900 MHz
- Tetrahedral mesh
  - vertices: 67,590
  - elements: 373,632
- Total number of DOF: 2,241,792
60. Scattering of a plane wave by a PEC cube
- Performance results on various numbers of processors (IBM JS21)
- Number of iterations
61. Scattering of a plane wave by a head
- Plane wave frequency: 1800 MHz
- Tetrahedral mesh
  - vertices: 188,101
  - elements: 1,118,952
- Total number of DOF: 6,713,712
62. Scattering of a plane wave by a head
- Performance results on various numbers of processors (Blue Gene/L)
- Number of iterations
63. Conclusion
- Increased size of problems
  - 3D problems using hybrid techniques
  - Large-scale computing platforms
  - Memory, out-of-core, scalability issues
- Increased precision and numerical stability
  - Multi-arithmetic (128 bits, interval computing, successive refinements, ...)
- Algorithm design for new architectures
  - Petaflop computing
  - Power-aware scientific computing
  - Scalability when using thousands of processors
- Supporting the development of software implementing computational kernels
  - Portable, reliable and efficient
  - Available to the scientific community
64. Links
- MUMPS: http://mumps.enseeiht.fr/ and http://graal.ens-lyon.fr/MUMPS
- Scotch: http://gforge.inria.fr/projects/scotch
- PaStiX: http://gforge.inria.fr/projects/pastix
- ScAlApplix: http://www.labri.fr/project/scalapplix
- ANR CIS: LEGO
- ANR CIGC: Numasis
- ANR CIS: Solstice, Aster
- CNRS / JST projects: REDIMPS, NGEST
- ...