Title: Trends in Sparse Linear Research and Software Developments in France
1. Trends in Sparse Linear Research and Software Developments in France
- P. Amestoy (1), M. Daydé (1), I. Duff (2), L. Giraud (1), A. Haidar (3), S. Lanteri (4), J.-Y. L'Excellent (5), P. Ramet (6)
- (1) IRIT INPT / ENSEEIHT
- (2) CERFACS / RAL
- (3) CERFACS
- (4) INRIA, nachos project-team
- (5) INRIA / LIP-ENSL
- (6) INRIA Futurs / LaBRI
- 1st French-Japanese Workshop on Petascale Applications, Algorithms and Programming (PAAP), RIKEN, Tokyo, Nov. 12, 2007
2. Scope of this overview
- Mainly teams involved in the development of computational kernels / software for scientific libraries
- Not intended to be exhaustive
- Some aspects already covered by the talks of
  - Serge Petiton
  - Nahid Emad
  - Stéphane Lanteri
  - Franck Cappello
3. Overview of sparse linear algebra techniques
- From Iain Duff
- CERFACS / RAL
4. Introduction
- Solution of Ax = b
- where A is large (of order 10^6 or greater) and sparse
5. Direct methods
- Key idea: factorize the matrix into a product of matrices that are easy to invert (triangular), with possibly some permutations to preserve sparsity and maintain numerical stability
- E.g. Gaussian elimination PAQ = LU, where
  - P and Q are row / column permutations,
  - L is lower triangular (sparse),
  - U is upper triangular (sparse)
- Then forward / backward substitution: Ly = Pb, then U Q^T x = y
- Good points
  - Can solve very large problems, even in 3D
  - Generally fast ... even on fancy machines
  - Robust and well packaged
- Bad points
  - Need A explicitly (element-wise or assembled); can have very high storage requirements
  - Can be costly
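The factor-then-substitute pattern above can be sketched in a few lines. This is a minimal illustration using SciPy's sparse LU (SuperLU under the hood) as a stand-in for a production direct solver such as MUMPS or PaStiX; the matrix is a toy 1D Laplacian, not one of the industrial problems discussed later.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Small sparse test matrix: 1D Laplacian (tridiagonal).
n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Factorize PAQ = LU (permutations chosen for sparsity and stability),
# then solve by forward / backward substitution.
lu = spla.splu(A)
x = lu.solve(b)

print(np.linalg.norm(A @ x - b))  # residual near machine precision
```

Once the (costly) factorization is computed, each additional right-hand side costs only a cheap pair of triangular solves.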
6. Iterative methods
- Generate a sequence of vectors converging towards the solution of the linear system
- E.g. iterative methods based on Krylov subspaces K_k(A, r0) = span{r0, A r0, ..., A^(k-1) r0}, where r0 = b - A x0
- The idea is then to choose a suitable xk in x0 + K_k(A, r0)
- For example, so that it minimizes ||b - A xk||_2 (GMRES)
- There are many Krylov methods, depending on the criteria used for choosing xk
- Good points
  - May not require forming A explicitly
  - Usually very low storage requirements
  - Can be efficient on 3D problems
- Bad points
  - May require many iterations to converge
  - May require preconditioning
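The "may not require forming A" point can be made concrete: a Krylov method such as GMRES only needs the action v -> Av. A minimal matrix-free sketch, assuming SciPy (the tridiagonal operator here is purely illustrative):

```python
import numpy as np
import scipy.sparse.linalg as spla

n = 100

def matvec(v):
    # Apply tridiag(-1, 4, -1) without ever storing the matrix.
    w = 4.0 * v
    w[:-1] -= v[1:]
    w[1:] -= v[:-1]
    return w

A_op = spla.LinearOperator((n, n), matvec=matvec)
b = np.ones(n)

# GMRES picks x_k in x_0 + K_k(A, r_0) minimizing ||b - A x_k||_2.
x, info = spla.gmres(A_op, b)  # info == 0 signals convergence
```

Storage is limited to a handful of vectors (plus the Krylov basis for GMRES), in contrast to the fill-in generated by a direct factorization.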
7. Hybrid methods
- Attempt to retain the good properties of both direct and iterative methods
- Not just preconditioning
  - In ILU, sparse approximate inverse or incomplete Cholesky, core techniques of direct methods are used
- Cf. the MERAM method (talk by N. Emad)
- We focus on using a direct method/code combined with an iterative method
8. Hybrid methods (cont.)
- Generic examples of hybrid methods are
  - Domain decomposition ... using a direct method on local subdomains and/or a direct preconditioner on the interface (cf. talk by Stéphane Lanteri)
  - Block iterative methods ... direct solver on subblocks
  - Factorization of a nearby problem as a preconditioner
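The last item above can be sketched directly: factor a simplified, nearby matrix exactly with a direct solver, and use that factorization as the preconditioner inside a Krylov iteration. A hedged toy example, assuming SciPy as a stand-in for a direct solver such as MUMPS (the matrices are illustrative tridiagonal operators):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 300
# Operator of interest: slightly nonsymmetric tridiagonal matrix.
A = sp.diags([-1.1, 2.5, -0.9], [-1, 0, 1], shape=(n, n), format="csc")
# Nearby, easier problem: its symmetrized version.
A_near = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

lu = spla.splu(A_near)  # exact direct factorization of the nearby matrix
M = spla.LinearOperator((n, n), matvec=lu.solve)  # preconditioner M ~ A^-1

x, info = spla.gmres(A, b, M=M)  # preconditioned Krylov iteration
```

Because A_near is close to A, the preconditioned spectrum clusters near 1 and GMRES converges in few iterations, while the factorization cost is paid on the simpler matrix.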
9. Sparse direct methods
10. Sparse Direct Solvers
- Usually three steps
  - Pre-processing / symbolic factorization
  - Numerical factorization
  - Forward-backward substitution
11. MUMPS and sparse direct methods
http://graal.ens-lyon.fr/MUMPS and http://mumps.enseeiht.fr
12. History
- Main contributors since 1996: Patrick Amestoy, Iain Duff, Abdou Guermouche, Jacko Koster, Jean-Yves L'Excellent, Stéphane Pralet
- Current development team
  - Patrick Amestoy, ENSEEIHT-IRIT
  - Abdou Guermouche, LaBRI-INRIA
  - Jean-Yves L'Excellent, INRIA
  - Stéphane Pralet, now working for SAMTECH
- PhD students
  - Emmanuel Agullo, ENS-Lyon
  - Tzvetomila Slavova, CERFACS
13. Users
- Around 1000 users, 2 requests per day
- Academic or industrial
- Types of applications
  - Structural mechanics, CAD
  - Fluid dynamics, magnetohydrodynamics, physical chemistry
  - Wave propagation and seismic imaging, ocean modelling
  - Acoustics and electromagnetic propagation
  - Biology
  - Finite element analysis, numerical optimization, simulation
  - ...
14. MUMPS: A MUltifrontal Massively Parallel Solver
- MUMPS solves large systems of linear equations of the form Ax = b by factorizing A into A = LU or A = LDL^T
- Symmetric or unsymmetric matrices (partial pivoting)
- Parallel factorization and solution phases (uniprocessor version also available)
- Iterative refinement and backward error analysis
- Various matrix input formats
  - assembled format
  - distributed assembled format
  - sum of elemental matrices
- Partial factorization and Schur complement matrix
- Version for complex arithmetic
- Several orderings interfaced: AMD, AMF, PORD, METIS, SCOTCH
15. The multifrontal method (Duff, Reid '83)
- Memory is divided into two parts (that can overlap in time)
  - the factors
  - the active memory
- The elimination tree represents task dependencies
16. Implementation
- Distributed multifrontal solver (MPI / F90 based)
- Dynamic distributed scheduling to accommodate both numerical fill-in and multi-user environments
- Use of BLAS, ScaLAPACK
- A fully asynchronous distributed solver (VAMPIR trace, 8 processors)
17. MUMPS
- 3 main steps (plus initialization and termination)
  - JOB=-1: initialize solver type (LU, LDL^T) and default parameters
  - JOB=1: analyse the matrix, build an ordering, prepare factorization
  - JOB=2: (parallel) numerical factorization A = LU
  - JOB=3: (parallel) solution step, forward and backward substitutions (Ly = b, Ux = y)
  - JOB=-2: termination, deallocate all MUMPS data structures
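The point of separating analysis/factorization (JOB=1, JOB=2) from the solution step (JOB=3) is that the expensive phases are done once and the cheap solve is repeated. A hedged sketch of that phase separation, using SciPy's splu as an analogy rather than the actual MUMPS API:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")

# Analysis + numerical factorization (MUMPS JOB=1 and JOB=2), done once.
lu = spla.splu(A)

# Solution phase (MUMPS JOB=3), reused for each new right-hand side:
# forward substitution Ly = Pb, then backward substitution U Q^T x = y.
for k in range(3):
    b = np.full(n, float(k + 1))
    x = lu.solve(b)
```

In the real MUMPS driver the same effect is obtained by calling the solver repeatedly with JOB=3 after a single JOB=2 factorization.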
18. (No transcript)
19. Competitive performance
- Comparison with SuperLU, extracted from ACM TOMS 2001 and obtained with S. Li
- Recent performance results (ops = number of operations)
- Factorization time in seconds of large matrices on the CRAY T3E (1 proc: not enough memory)
20. Functionalities, features
- Recent features
  - Symmetric indefinite matrices: preprocessing and 2-by-2 pivots
  - Hybrid scheduling
  - 2D cyclic distributed Schur complement
  - Sparse, multiple right-hand sides
  - Singular matrices with detection of null pivots
  - Interfaces to MUMPS: Fortran, C, Matlab (S. Pralet, while at ENSEEIHT-IRIT) and Scilab (A. Fèvre, INRIA)
- Future functionalities
  - Out-of-core execution
  - Parallel analysis phase
  - Rank-revealing algorithms
  - Hybrid direct-iterative solvers (with Luc Giraud)
21. On-going research on out-of-core solvers
- (Ph.D. E. Agullo, ENS Lyon and Ph.D. M. Slavova, CERFACS)
- Use disk storage to solve very large problems
- Parallel out-of-core factorization
- Preprocessing to minimize the volume of I/O
- Scheduling for the out-of-core solution
- Figure: ratio of active to total memory peak on different numbers of processors for several large problems
22. Hybrid scheduling
- Both memory and workload information are used to obtain better behaviour in terms of estimated memory, memory used and factorization time in the context of parallel factorization algorithms
- Estimated memory much closer to the memory effectively used
- Table: estimated and effective memory (millions of reals) for the factorization on 64 processors; Max = maximum amount of memory, Avg = average memory per processor
23. Memory-minimizing schedules
- Multifrontal methods can use a large amount of temporary data
- By decoupling task allocation and task processing, we can reduce the amount of temporary data: a new optimal schedule was proposed in this context (Guermouche, L'Excellent, ACM TOMS)
- Memory gains: active memory ratio (new algorithm vs. Liu's ordering)
- Remark: gains relative to Liu's algorithm are equal to 27.1, 17.5 and 19.6 for matrices 8, 9 and 10 (the Gupta matrices), respectively
24. Preprocessing for symmetric matrices (S. Pralet, ENSEEIHT-IRIT)
- Preprocessing: new scaling available, symmetric weighted matching and automatic tuning of the preprocessing strategies
- Pivoting strategy (2-by-2 pivots and static pivoting)
- Improvements in
  - factorization time
  - robustness, in particular on KKT systems arising from optimization
  - memory estimation
- Factorization time on a Linux PC (Pentium 4, 2.80 GHz)
25. Scotch, PaStiX
- PaStiX team
- INRIA Futurs / LaBRI
26. PaStiX solver
- Functionalities
  - LL^T, LDL^T, LU factorization (symmetric pattern) with supernodal implementation
  - Static pivoting (max. weight matching) + iterative refinement / CG / GMRES
  - 1D/2D block distribution + full BLAS3
  - Supports external ordering libraries (Scotch ordering provided)
  - MPI/threads implementation (SMP node / cluster / multi-core / NUMA)
  - Simple/double precision, real/complex operations
  - Requires only C + MPI + POSIX threads
  - Multiple RHS (direct factorization)
  - Incomplete factorization ILU(k) preconditioner
27. PaStiX solver (cont.)
- Available on INRIA GForge
  - All-in-one source code
  - Easy to install on Linux or AIX systems
  - Simple API (WSMP-like)
  - Thread safe (can be called from multiple threads in multiple MPI communicators)
- Current work
  - Use of parallel ordering (PT-Scotch) and parallel symbolic factorization
  - Dynamic scheduling inside SMP nodes (static mapping)
  - Out-of-core implementation
  - Generic finite element assembly (domain decomposition associated with the matrix distribution)
28. Direct solver chain (in PaStiX)
- Analysis (sequential steps)
- Parallel factorization and solve
29. Direct solver chain (in PaStiX)
- Sparse matrix ordering (minimizes fill-in)
- Scotch: a hybrid algorithm
  - incomplete Nested Dissection
  - the resulting subgraphs being ordered with an Approximate Minimum Degree method under constraints (HAMD)
30. Direct solver chain (in PaStiX)
- The symbolic block factorization
- Key property: Q(G,P)* = Q(G*,P), so the cost is linear in the number of blocks!
- Dense block structures: only a few extra pointers are needed to store the matrix
31. Direct solver chain (in PaStiX)
- CPU time prediction, exact memory resources
32. Hybrid 1D and 2D distributions
- Yields 1D and 2D block distributions
- BLAS efficiency on small supernodes: 1D
- Scalability on larger supernodes: 2D
- => switching criterion
- 1D block distribution
33. Supernodal scheme
- 2D to 2D communication scheme
- A dynamic technique is used to improve the 1D to 1D/2D communication scheme
34. Direct solver chain (in PaStiX)
- Modern architecture management (SMP nodes): hybrid threads/MPI implementation (all processors in the same SMP node work directly in shared memory)
- Less MPI communication and lower parallel memory overhead
35. Incomplete factorization in PaStiX
- Start from the acknowledgement that it is difficult to build a generic and robust preconditioner for
  - large-scale 3D problems
  - high-performance computing
- Derive direct solver techniques into a preconditioner
- What's new: (dense) block formulation
  - Incomplete block symbolic factorization
  - Remove blocks using algebraic criteria
  - Use an amalgamation algorithm to get dense blocks
- Provides incomplete factorizations LDL^T, Cholesky, LU (with static pivoting for symmetric patterns)
36. Numerical experiments (TERA1)
- Successful approach for a large collection of industrial test cases (PARASOL, Boeing-Harwell, CEA) on IBM SP3
- TERA1 supercomputer of CEA Ile-de-France (ES45 SMP, 4 procs)
- COUPOLE40000: 26.5 x 10^6 unknowns, 1.5 x 10^10 NNZ in L and 10.8 Tflops
  - 356 procs: 34 s
  - 512 procs: 27 s
  - 768 procs: 20 s
  - (> 500 Gflop/s, about 35% of peak performance)
37. Numerical experiments (TERA10)
- Successful approach on a 3D mesh problem with about 30 million unknowns on the TERA10 supercomputer
- But memory is the bottleneck!
- ODYSSEE code of French CEA/CESTA: electromagnetism code (finite element method + integral equation), complex double precision, Schur complement
38. SP3, AUDI matrix: N = 943 x 10^3, NNZ in L = 1.21 x 10^9, 5.3 TFlops
- Num. of threads / MPI process
39. ASTER project
- ANR CIS 2006: Adaptive MHD Simulation of Tokamak ELMs for ITER (ASTER)
- The ELM (Edge-Localized Mode) is an MHD instability localised at the boundary of the plasma
- The energy losses induced by ELMs within several hundred microseconds are a real concern for ITER
- The non-linear MHD simulation code JOREK is under development at the CEA to study the evolution of the ELM instability
- To simulate the complete cycle of the ELM instability, a large range of time scales needs to be resolved to study
  - the evolution of the equilibrium pressure gradient (seconds)
  - the ELM instability (hundreds of microseconds)
- A fully implicit time evolution scheme is used in the JOREK code
- This leads to a large sparse matrix system to be solved at every time step
- This scheme is made possible by the recent advances in the parallel direct solution of general sparse matrices (MUMPS, PaStiX, SuperLU, ...)
40. NUMASIS project
- Middleware for NUMA architectures
- Application to seismology simulation
- New dynamic scheduling algorithms for solvers
  - Take non-uniform memory access into account
- Use of MadMPI (R. Namyst)
  - Need a middleware to handle complex architectures
  - Multi-rail, packet re-ordering, ...
- PhD of Mathieu Faverge (INRIA-LaBRI)
41. Industrial applications
- OSSAU code of French CEA/CESTA
  - 2D / 3D structural mechanics code
- ODYSSEE code of French CEA/CESTA
  - Electromagnetism code (finite element method + integral equation)
  - Complex double precision, Schur complement
- Fluid mechanics
  - LU factorization with static pivoting (SuperLU-like approach)
42. Other parallel sparse direct codes
43. Other parallel sparse direct codes (cont.)
44. Sparse solver for Ax = b: only a black box?
- Preprocessing and postprocessing
  - Symmetric permutations to reduce fill-in (Ax = b => (PAP^T)(Px) = Pb)
  - Numerical pivoting, scaling to preserve numerical accuracy
  - Maximum transversal (set large entries on the diagonal)
  - Preprocessing for parallelism (influence of task mapping on parallelism)
  - Iterative refinement, error analysis
- A default (often automatic/adaptive) setting of the options is usually available
- However, better knowledge of the options can help the user to further improve
  - memory usage,
  - time for solution,
  - numerical accuracy.
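The first preprocessing item, a symmetric fill-reducing permutation, can be demonstrated with the reverse Cuthill-McKee ordering available in SciPy (production solvers use stronger orderings such as AMD, METIS or Scotch, listed earlier; RCM is just an easy-to-show example, and the random test matrix below is illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Random sparse matrix, symmetrized, with a nonzero diagonal.
n = 200
A = sp.random(n, n, density=0.02, random_state=0, format="csr")
A = (A + A.T + sp.identity(n)).tocsr()

# Symmetric permutation PAP^T chosen to cluster nonzeros near the diagonal.
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
P = A[perm][:, perm]

def bandwidth(M):
    coo = M.tocoo()
    return int(np.max(np.abs(coo.row - coo.col)))

print(bandwidth(A), bandwidth(P))  # bandwidth drops after reordering
```

A smaller bandwidth (or, for nested-dissection-type orderings, a better separator tree) directly bounds the fill-in created by the factorization, which is why the choice of ordering is one of the most influential solver options.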
45. The GRID-TLSE project
- A web expert site for sparse linear algebra
46. Overview of GRID-TLSE: http://gridtlse.org
- Supported by
  - ANR LEGO project
  - ANR SOLSTICE project
  - CNRS / JST program: REDIMPS project
  - ACI GRID-TLSE project
- Partners
47. A sparse matrices expert site?
- Expert site: help users choose the right solvers and their parameters for a given problem
- Chosen approach: expert scenarios which answer common user requests
- Main goal: provide a friendly test environment for expert and non-expert users of sparse linear algebra software
48. A sparse matrices expert site?
- Easy access to
  - software and tools
  - a wide range of computer architectures
  - matrix collections
  - expert scenarios
- Also provides a testbed for sparse linear algebra software
49. Why use the grid?
- Sparse linear algebra software makes use of sophisticated algorithms for (pre-/post-)processing the matrix
- Multiple parameters interact in the efficient execution of a sparse direct solver
  - ordering
  - amount of memory
  - architecture of the computer
  - available libraries
- Determining the best combination of parameter values is a multi-parametric problem
- Well-suited for execution over a grid
50. Components
- Software components
- DIET / ITBL / AEGIS
- Middleware: DIET, developed within GRID-ASP (LIP, Loria Resedas, LIFC-SDRP), and soon ITBL (?)
51. Hybrid solvers
52. Parallel hybrid iterative/direct solver for the solution of large sparse linear systems arising from 3D elliptic discretizations
- L. Giraud (1), A. Haidar (2), S. Watson (3)
- (1) ENSEEIHT, Parallel Algorithms and Optimization Group, 2 rue Camichel, 31071 Toulouse, France
- (2) CERFACS, Parallel Algorithms Project, 42 Avenue Coriolis, 31057 Toulouse, France
- (3) Departments of Computer Science and Mathematics, Virginia Polytechnic Institute, USA
53. Non-overlapping domain decomposition
- Natural approach for PDEs, extended to general sparse matrices
- Partition the problem into subdomains / subgraphs
- Use a direct solver on the subdomains (MUMPS package)
- Robust algebraically preconditioned iterative solver on the interface (algebraic additive Schwarz preconditioner, possibly with sparsified and mixed-arithmetic variants)
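The interior/interface splitting behind this approach can be written down in a few lines. A dense toy sketch in NumPy (a random SPD matrix standing in for a discretized PDE; in the actual solver the interior solves use MUMPS and the Schur complement system is solved iteratively, not by a dense direct solve):

```python
import numpy as np

rng = np.random.default_rng(1)
n_i, n_g = 8, 3  # interior and interface ("Gamma") unknowns
K = rng.standard_normal((n_i + n_g, n_i + n_g))
A = K @ K.T + (n_i + n_g) * np.eye(n_i + n_g)  # SPD test matrix
b = rng.standard_normal(n_i + n_g)

Aii, Aig = A[:n_i, :n_i], A[:n_i, n_i:]
Agi, Agg = A[n_i:, :n_i], A[n_i:, n_i:]
bi, bg = b[:n_i], b[n_i:]

# Eliminate the interior unknowns: Schur complement on the interface,
# S = Agg - Agi Aii^{-1} Aig, with reduced right-hand side g.
S = Agg - Agi @ np.linalg.solve(Aii, Aig)
g = bg - Agi @ np.linalg.solve(Aii, bi)

xg = np.linalg.solve(S, g)                # interface solve (Krylov in practice)
xi = np.linalg.solve(Aii, bi - Aig @ xg)  # back-substitute the interior

x = np.concatenate([xi, xg])
```

With several subdomains, Aii is block diagonal, so the interior solves are independent direct factorizations done in parallel, and only the (much smaller) interface system is iterated on.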
54. Numerical behaviour of the preconditioners
- Convergence history on a 43-million-dof problem on 1000 System X processors (Virginia Tech)
55. Parallel scaled scalability study
- Parallel elapsed time with fixed sub-problem size (43,000 dof) when the number of processors varies from 27 (1.1 x 10^6 dof) up to 1000 (43 x 10^6 dof)
56. Hybrid iterative/direct strategies for solving large sparse linear systems resulting from the finite element discretization of the time-harmonic Maxwell equations
- L. Giraud (1), A. Haidar (2), S. Lanteri (3)
- (1) ENSEEIHT, Parallel Algorithms and Optimization Group, 2 rue Camichel, 31071 Toulouse, France
- (2) CERFACS, Parallel Algorithms Project, 42 Avenue Coriolis, 31057 Toulouse, France
- (3) INRIA, nachos project-team, 2004 Route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France
57. Context and objectives
- Solution of time-harmonic electromagnetic wave propagation
- Discretization in space
  - Discontinuous Galerkin time-harmonic methods
  - Unstructured tetrahedral meshes
  - Based on P_k nodal (Lagrange) interpolation
  - Centered or upwind fluxes for the calculation of jump terms at cell boundaries
- Discretization in space results in a large sparse linear system with complex coefficients
- Direct (sparse LU) solvers for 2D problems
- Parallel solvers are mandatory for 3D problems
- Related publications
  - H. Fol (PhD thesis, 2006)
  - V. Dolean, H. Fol, S. Lanteri and R. Perrussel (J. Comp. Appl. Math., to appear, 2007)
58. Solution algorithm
- Parallel hybrid iterative/direct solver
- Domain decomposition framework
  - Schwarz algorithm with characteristic interface conditions
  - Sparse LU subdomain solver (MUMPS; P.R. Amestoy, I.S. Duff and J.-Y. L'Excellent, Comput. Meth. Appl. Mech. Engng., Vol. 184, 2000)
- Interface (Schur complement type) formulation
- Iterative interface solver (GMRES or BiCGstab)
- Algebraic block preconditioning of the interface system: exploits the structure of the system (free to construct and store)
59. Scattering of a plane wave by a PEC cube
- Plane wave frequency: 900 MHz
- Tetrahedral mesh
  - vertices: 67,590
  - elements: 373,632
- Total number of DOF: 2,241,792
60. Scattering of a plane wave by a PEC cube
- Performance results on various numbers of processors (IBM JS21)
- Number of iterations
61. Scattering of a plane wave by a head
- Plane wave frequency: 1800 MHz
- Tetrahedral mesh
  - vertices: 188,101
  - elements: 1,118,952
- Total number of DOF: 6,713,712
62. Scattering of a plane wave by a head
- Performance results on various numbers of processors (Blue Gene/L)
- Number of iterations
63. Conclusion
- Increased size of problems
  - 3D problems using hybrid techniques
  - Large-scale computing platforms
  - Memory, out-of-core, scalability issues
- Increased precision and numerical stability
  - Multi-arithmetic (128 bits, interval computing, successive refinements, ...)
- Algorithm design for new architectures
  - Petaflop computing
  - Power-aware scientific computing
  - Scalability when using thousands of processors
- Supporting the development of software implementing computational kernels
  - Portable, reliable and efficient
  - Available to the scientific community
64. Links
- MUMPS: http://mumps.enseeiht.fr/ and http://graal.ens-lyon.fr/MUMPS
- Scotch: http://gforge.inria.fr/projects/scotch
- PaStiX: http://gforge.inria.fr/projects/pastix
- ScAlApplix: http://www.labri.fr/project/scalapplix
- ANR CIS: LEGO
- ANR CIGC: Numasis
- ANR CIS: Solstice, Aster
- CNRS / JST projects: REDIMPS, NGEST
- ...