Trends in Sparse Linear Research and Software Developments in France

Transcript and Presenter's Notes
1
Trends in Sparse Linear Research and Software
Developments in France
  • P. Amestoy1, M. Daydé1, I. Duff2, L. Giraud1, A.
    Haidar3, S. Lanteri4, J.-Y. L'Excellent5, P.
    Ramet6
  • 1 IRIT, INPT / ENSEEIHT
  • 2 CERFACS & RAL
  • 3 CERFACS
  • 4 INRIA, nachos project-team
  • 5 INRIA / LIP-ENSL
  • 6 INRIA Futurs / LaBRI
  • 1st French-Japanese Workshop on Petascale
    Applications, Algorithms and Programming (PAAP),
    Riken, Tokyo, Nov. 12, 2007

2
Scope of this overview
  • Mainly teams involved in the development of
    computational kernels / software for scientific
    libraries
  • Not intended to be exhaustive
  • Some aspects are already covered by the talks of
  • Serge Petiton
  • Nahid Emad
  • Stéphane Lanteri
  • Franck Cappello

3
Overview of sparse linear algebra techniques
  • From Iain Duff
  • CERFACS & RAL

4
Introduction
  • Solution of
  • Ax = b
  • where A is large (order 10^6 or greater) and
    sparse

5
Direct methods
  • Key idea: factorize the matrix into a product
    of matrices that are easy to invert (triangular),
    with possibly some permutations to preserve
    sparsity and maintain numerical stability
  • E.g. Gaussian elimination: PAQ = LU, where
  • P and Q are row / column permutations,
  • L = lower triangular (sparse),
  • U = upper triangular (sparse)
  • Then forward / backward substitution: Ly = Pb,
    UQ^T x = y (see the sketch below)
  • Good points
  • Can solve very large problems, even in 3D
  • Generally fast ... even on fancy machines
  • Robust and well packaged
  • Bad points
  • Need to have A either element-wise or assembled;
    can have very high storage requirements
  • Can be costly
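A minimal sketch of this factorize-then-substitute flow,
using SciPy's SuperLU-based sparse LU as a stand-in for the
solvers discussed later, on an assumed 5-point 2D Poisson
test matrix:

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n = 50                                               # grid is n x n, so A is 2500 x 2500
  T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
  A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsc()
  b = np.ones(A.shape[0])

  lu = spla.splu(A)                                    # Pr A Pc = L U, with a fill-reducing column permutation
  x = lu.solve(b)                                      # triangular solves: forward with L, backward with U
  print("nnz(A) =", A.nnz, " nnz(L)+nnz(U) =", lu.L.nnz + lu.U.nnz)   # fill-in
  print("||Ax - b|| =", np.linalg.norm(A @ x - b))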

6
Iterative methods
  • Generate a sequence of vectors converging towards
    the solution of the linear system
  • E.g. iterative methods based on Krylov subspaces
    K_k(A, r0) = span{r0, Ar0, ..., A^(k-1) r0},
    where r0 = b - Ax0
  • The idea is then to choose a suitable xk in
    x0 + K_k(A, r0)
  • For example, so that it minimizes ||b - Axk||_2
    (GMRES)
  • There are many Krylov methods, depending on the
    criteria used for choosing xk (see the sketch below)
  • Good points
  • May not require forming A explicitly
  • Usually very low storage requirements
  • Can be efficient on 3D problems
  • Bad points
  • May require many iterations to converge
  • May require preconditioning
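A minimal Krylov-method sketch with SciPy's GMRES on the
same assumed Poisson test matrix; note that only
matrix-vector products with A are needed, so A could also
be a matrix-free operator:

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n = 50
  T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
  A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()
  b = np.ones(A.shape[0])

  x0 = np.zeros(A.shape[0])
  r0 = b - A @ x0                        # K_k(A, r0) is spanned by r0, A r0, ..., A^(k-1) r0
  print("||r0||_2 =", np.linalg.norm(r0))
  x, info = spla.gmres(A, b, x0=x0, restart=50, maxiter=1000)
  print("converged:", info == 0, " ||b - Ax||_2 =", np.linalg.norm(b - A @ x))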

7
Hybrid methods
  • Attempt to retain the good properties of both
    direct and iterative methods
  • Not just preconditioning
  • ILU, sparse approximate inverse or incomplete
    Cholesky: core techniques of direct methods are
    used (see the sketch below)
  • Cf. the MERAM method (talk by N. Emad)
  • We focus on
  • using a direct method/code combined with an
    iterative method
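A hedged sketch of the preconditioning idea: an incomplete
LU factorization (SciPy's spilu, standing in for the ILU /
incomplete Cholesky kernels mentioned above) accelerating
GMRES on the assumed Poisson test matrix:

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n = 50
  T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
  A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsc()
  b = np.ones(A.shape[0])

  ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)   # incomplete LU: direct-method kernels, sparser factors
  M = spla.LinearOperator(A.shape, matvec=ilu.solve, dtype=A.dtype)
  x, info = spla.gmres(A, b, M=M, restart=50, maxiter=1000)
  print("converged:", info == 0, " ||b - Ax||_2 =", np.linalg.norm(b - A @ x))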

8
Hybrid methods (cont.)
  • Generic examples of hybrid methods are
  • Domain decomposition ... using a direct method on
    local subdomains and/or a direct preconditioner on
    the interface (cf. talk by Stéphane Lanteri)
  • Block iterative methods ... direct solver on
    subblocks
  • Factorization of a nearby problem as a
    preconditioner (see the sketch below)
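A sketch of the last item under assumed data: the symmetric
diffusion part K is the "nearby" problem, factorized
exactly and used to precondition GMRES on a slightly
perturbed, nonsymmetric A (K, C and lu_near are
illustrative names):

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n = 50
  T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
  K = sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))   # "nearby" symmetric diffusion problem
  C = 0.05 * sp.diags([-1, 1], [-1, 1], shape=(n * n, n * n))   # small nonsymmetric (convection-like) perturbation
  A = (K + C).tocsc()
  b = np.ones(A.shape[0])

  lu_near = spla.splu(K.tocsc())                                # exact factorization of the nearby problem
  M = spla.LinearOperator(A.shape, matvec=lu_near.solve, dtype=A.dtype)
  x, info = spla.gmres(A, b, M=M, restart=50, maxiter=1000)
  print("converged:", info == 0, " ||b - Ax||_2 =", np.linalg.norm(b - A @ x))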

9
Sparse direct methods
10
Sparse Direct Solvers
  • Usually three steps
  • Pre-processing and symbolic factorization
  • Numerical factorization
  • Forward-backward substitution

11
MUMPS and sparse direct methods
  • MUMPS team

http://graal.ens-lyon.fr/MUMPS and
http://mumps.enseeiht.fr
12
History
  • Main contributors since 1996: Patrick Amestoy,
    Iain Duff, Abdou Guermouche, Jacko Koster,
    Jean-Yves L'Excellent, Stéphane Pralet
  • Current development team
  • Patrick Amestoy, ENSEEIHT-IRIT
  • Abdou Guermouche, LaBRI-INRIA
  • Jean-Yves L'Excellent, INRIA
  • Stéphane Pralet, now working for SAMTECH
  • PhD students
  • Emmanuel Agullo, ENS-Lyon
  • Tzvetomila Slavova, CERFACS

13
Users
  • Around 1000 users, about 2 requests per day
  • Academic and industrial users
  • Types of applications
  • Structural mechanics, CAD
  • Fluid dynamics, Magnetohydrodynamic, Physical
    Chemistry
  • Wave propagation and seismic imaging, Ocean
    modelling
  • Acoustics and electromagnetics propagation
  • Biology
  • Finite Element Analysis, Numerical Optimization,
    Simulation
  • . . .

14
MUMPS: A MUltifrontal Massively Parallel Solver
  • MUMPS solves large systems of linear equations of
    the form Ax = b by factorizing A into A = LU or LDL^T
  • Symmetric or unsymmetric matrices (partial
    pivoting)
  • Parallel factorization and solution phases
    (uniprocessor version also available)
  • Iterative refinement and backward error analysis
  • Various matrix input formats
  • assembled format (see the sketch after this list)
  • distributed assembled format
  • sum of elemental matrices
  • Partial factorization and Schur complement matrix
  • Version for complex arithmetic
  • Several orderings interfaced: AMD, AMF, PORD,
    METIS, SCOTCH
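The centralized "assembled format" above is coordinate
(triplet) input, one (row, column, value) entry per
nonzero. A minimal sketch of that data structure using
SciPy's COO type as a stand-in (irn, jcn, a are
illustrative names, not the MUMPS API, and indices are
0-based here):

  import numpy as np
  import scipy.sparse as sp

  # assembled (coordinate) format: row indices, column indices, values
  irn = np.array([0, 0, 1, 2, 2, 3])
  jcn = np.array([0, 2, 1, 0, 2, 3])
  a   = np.array([4.0, -1.0, 3.0, -1.0, 4.0, 2.0])
  A = sp.coo_matrix((a, (irn, jcn)), shape=(4, 4))

  print(A.toarray())            # duplicate (i, j) entries are summed on assembly
  # a "sum of elemental matrices" input would instead provide small dense
  # element matrices plus their global index lists, to be summed into A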

15
The multifrontal method (Duff and Reid, 1983)
  • Memory is divided into two parts (that can
  • overlap in time)
  • the factors
  • the active memory

The elimination tree represents the task dependencies
16
Implementation
  • Distributed multifrontal solver (MPI / F90 based)
  • Dynamic distributed scheduling to accommodate
    both numerical fill-in and multi-user
    environments
  • Use of BLAS, ScaLAPACK

A fully asynchronous distributed solver (VAMPIR
trace, 8 processors)
17
MUMPS
  • 3 main steps (plus initialization and
    termination)
  • JOB=-1: initialize solver type (LU, LDL^T) and
    default parameters
  • JOB=1: analyse the matrix, build an ordering,
    prepare the factorization
  • JOB=2: (parallel) numerical factorization A = LU
  • JOB=3: (parallel) solution step, forward and
    backward substitutions (Ly = b, Ux = y)
  • JOB=-2: termination, deallocate all MUMPS data
    structures

18
(No Transcript)
19
  • COMPETITIVE PERFORMANCE
  • Comparison with SuperLU, extracted from ACM TOMS
    2001 and obtained with S. Li
  • Recent performance results (ops = number of
    operations)

Factorization time in seconds of large matrices
on the CRAY T3E (1 proc: not enough memory)
20
Functionalities, Features
  • Recent features
  • Symmetric indefinite matrices: preprocessing and
    2-by-2 pivots
  • Hybrid scheduling
  • 2D cyclic distributed Schur complement
  • Sparse, multiple right-hand sides
  • Singular matrices with detection of null pivots
  • Interfaces to MUMPS: Fortran, C, Matlab (S.
    Pralet, while at ENSEEIHT-IRIT) and Scilab (A.
    Fèvre, INRIA)
  • Future functionalities
  • Out-of-core execution
  • Parallel analysis phase
  • Rank revealing algorithms
  • Hybrid direct-iterative solvers (with Luc Giraud)

21
On-going research on Out-of-Core solvers
  • (Ph.D. E. Agullo, ENS Lyon and Ph.D. M. Slavova,
    CERFACS)
  • Use disk storage to solve very large problems
  • Parallel out-of-core factorization
  • Preprocessing to minimize the volume of I/O
  • Scheduling for the out-of-core solution

Ratio of active to total memory peak on
different numbers of processors for several large
problems
22
Hybrid Scheduling
  • Both memory and workload information are used to
    obtain a better behaviour in terms of estimated
    memory, memory used and factorization time in the
    context of parallel factorization algorithms
  • Estimated memory much closer to effective memory
    used

Estimated and effective memory (millions of
reals) for the factorization on 64
processors Max maximum amount of memory Avg
average memory per processor
23
Memory minimizing schedules
  • Multifrontal methods can use a large amount of
    temporary data
  • By decoupling task allocation and task
    processing, we can reduce the amount of temporary
    data: a new optimal schedule has been proposed in
    this context (Guermouche, L'Excellent, ACM TOMS)
  • Memory gains

Active memory ratio (new algorithm vs Liu's
ordering)
Remark: gains relative to Liu's algorithm are
equal to 27.1, 17.5 and 19.6 for matrices 8, 9,
and 10 (Gupta matrices), respectively
24
Preprocessing for symmetric matrices (S. Pralet,
ENSEEIHT-IRIT)
  • Preprocessing: new scaling available, symmetric
    weighted matching and automatic tuning of the
    preprocessing strategies
  • Pivoting strategy (2-by-2 pivots and static
    pivoting)
  • Improvements in
  • factorization time
  • robustness, in particular on KKT systems arising
    from optimization
  • memory estimation

Factorization time on a Linux PC (Pentium 4, 2.80
GHz)
25
Scotch, PaStiX
  • PaStiX Team
  • INRIA Futurs / LaBRI

26
PaStiX solver
  • Functionalities
  • LL^T, LDL^T, LU factorization (symmetric pattern)
    with supernodal implementation
  • Static pivoting (max. weight matching) + iterative
    refinement / CG / GMRES
  • 1D/2D block distribution + full BLAS3
  • Supports external ordering libraries (Scotch
    ordering provided)
  • MPI/Threads implementation (SMP node / cluster /
    multi-core / NUMA)
  • Single/double precision + real/complex
    operations
  • Requires only C + MPI + POSIX threads
  • Multiple RHS (direct factorization)
  • Incomplete factorization ILU(k) preconditioner

27
PaStiX solver (cont)
  • Available on INRIA Gforge
  • All-in-One source code
  • Easy to install on Linux or AIX systems
  • Simple API (WSMP-like)
  • Thread safe (can be called from multiple threads
    in multiple MPI communicators)
  • Current work
  • Use of parallel ordering (PT-Scotch) and parallel
    symbolic factorization
  • Dynamic scheduling inside SMP nodes (static
    mapping)
  • Out-of-core implementation
  • Generic finite element assembly (domain
    decomposition associated to the matrix distribution)

28
Direct solver chain (in PaStiX)
Analysis (sequential steps)
Parallel factorization and solve
29
Direct solver chain (in PaStiX)
  • Sparse matrix ordering (minimizes fill-in)
  • Scotch: a hybrid algorithm
  • incomplete nested dissection
  • the resulting subgraphs being ordered with an
    Approximate Minimum Degree method under
    constraints (HAMD); see the sketch below
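A small illustration of the effect of the ordering on
fill-in, using the orderings shipped with SciPy's SuperLU
(not Scotch's nested dissection + HAMD) on an assumed 2D
Poisson matrix:

  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n = 60
  T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
  A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsc()

  for spec in ("NATURAL", "COLAMD", "MMD_AT_PLUS_A"):
      lu = spla.splu(A, permc_spec=spec)       # same matrix, different fill-reducing orderings
      print(f"{spec:15s} nnz(L)+nnz(U) = {lu.L.nnz + lu.U.nnz}")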

30
Direct solver chain (in PaStiX)
The symbolic block factorization
  • Q(G,P)* = Q(G*,P) => linear in the number of
    blocks!
  • Dense block structures => only a few extra
    pointers to store the matrix

31
Direct solver chain (in PaStiX)
CPU time prediction, exact memory resources
32
Hybrid 1D and 2D distributions
  • Yields 1D and 2D block distributions
  • BLAS efficiency on small supernodes: 1D
  • Scalability on larger supernodes: 2D
  • => switching criterion

1D block distribution
33
Supernodal scheme
  • "2D to 2D" communication scheme
  • A dynamic technique is used to improve the "1D to
    1D/2D" communication scheme

34
Direct solver chain (in PaStiX)
  • Modern architecture management (SMP nodes):
    hybrid Threads/MPI implementation (all processors
    in the same SMP node work directly in shared
    memory)
  • Less MPI communication and lower parallel
    memory overhead

35
Incomplete factorization in PaStiX
  • Start from the observation that it is
    difficult to build a generic and robust
    preconditioner for
  • large-scale 3D problems
  • high performance computing
  • Derive a preconditioner from direct solver
    techniques
  • What's new: (dense) block formulation
  • Incomplete block symbolic factorization
  • Remove blocks with algebraic criteria
  • Use an amalgamation algorithm to get dense blocks
  • Provide incomplete factorizations LDL^T, Cholesky,
    LU (with static pivoting for symmetric pattern)

36
Numerical experiments (TERA1)
  • Successful approach for a large collection of
    industrial test cases (PARASOL, Harwell-Boeing,
    CEA) on IBM SP3
  • TERA1 supercomputer of CEA Ile-de-France (ES45
    SMP, 4 procs)
  • COUPOLE40000: 26.5 x 10^6 unknowns, 1.5 x 10^10
    NNZL and 10.8 Tflops
  • 356 procs: 34 s
  • 512 procs: 27 s
  • 768 procs: 20 s
  • (> 500 Gflop/s, about 35% of peak performance)

37
Numerical experiments (TERA10)
  • Successful approach on a 3D mesh problem with
    about 30 million unknowns on the TERA10
    supercomputer
  • But memory is the bottleneck!

ODYSSEE code of French CEA/CESTA: electromagnetism
code (Finite Element Method + Integral
Equation), complex double precision, Schur complement
38
SP3, AUDI matrix: N = 943 x 10^3, NNZL = 1.21 x 10^9,
5.3 TFlops
Num. of threads / MPI process
39
ASTER project
  • ANR CIS 2006: Adaptive MHD Simulation of Tokamak
    ELMs for ITER (ASTER)
  • The ELM (Edge Localized Mode) is an MHD
    instability localised at the boundary of the plasma
  • The energy losses induced by the ELMs within
    several hundred microseconds are a real concern
    for ITER
  • The non-linear MHD simulation code JOREK is under
    development at the CEA to study the evolution of
    the ELM instability
  • To simulate the complete cycle of the ELM
    instability, a large range of time scales needs to
    be resolved to study
  • the evolution of the equilibrium pressure
    gradient (seconds)
  • the ELM instability (hundreds of microseconds)
  • A fully implicit time evolution scheme is used in
    the JOREK code
  • This leads to a large sparse matrix system to be
    solved at every time step (see the sketch below)
  • This scheme is made possible by the recent
    advances in the parallel direct solution
    of general sparse matrices (MUMPS, PaStiX,
    SuperLU, ...)
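A minimal sketch of the last two items (not JOREK itself),
assuming a linear heat-equation-like model problem: a
backward-Euler step requires a sparse solve at every time
step, and a direct factorization computed once can be
reused across steps when the operator and dt are fixed:

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n = 100                                              # 1D model problem (stand-in for the MHD system)
  L = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")
  dt = 1e-3
  M = (sp.identity(n, format="csc") + dt * L).tocsc()  # backward Euler for du/dt = -L u: (I + dt L) u_new = u_old

  solve = spla.factorized(M)                           # sparse LU computed once
  u = np.random.rand(n)                                # initial state
  for step in range(100):                              # each implicit time step needs a sparse solve
      u = solve(u)                                     # reuse of the factorization across steps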

40
NUMASIS project
  • Middleware for NUMA architectures
  • Application to seismology simulation
  • New dynamic scheduling algorithm for solvers
  • Takes into account non-uniform memory access
  • Use of MadMPI (R. Namyst)
  • Need middleware to handle complex
    architectures
  • Multi-rail, packet re-ordering, ...
  • PhD of Mathieu Faverge (INRIA-LaBRI)

41
Industrial applications
  • OSSAU code of French CEA/CESTA
  • 2D / 3D structural mechanics code
  • ODYSSEE code of French CEA/CESTA
  • Electromagnetism code (Finite Element Method +
    Integral Equation)
  • Complex double precision, Schur complement
  • Fluid mechanics
  • LU factorization with static pivoting (SuperLU-like
    approach)

42
Other parallel sparse direct codes
  • Shared-memory codes

43
Other parallel sparse direct codes
  • Distributed-memory codes

44
Sparse solver for Ax = b: only a black box?
  • Preprocessing and postprocessing
  • Symmetric permutations to reduce fill-in (Ax = b
    => PAPt (Px) = Pb)
  • Numerical pivoting, scaling to preserve numerical
    accuracy
  • Maximum transversal (set large entries on the
    diagonal)
  • Preprocessing for parallelism (influence of task
    mapping on parallelism)
  • Iterative refinement, error analysis (see the
    sketch after this list)
  • A default (often automatic/adaptive) setting of
    the options is usually available.
  • However, a better knowledge of the options can
    help the user to further improve
  • memory usage,
  • time for solution,
  • numerical accuracy.
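A sketch of iterative refinement under assumed data: an
approximate factorization (SciPy's spilu, standing in for a
low-accuracy or statically pivoted factorization) is reused
to drive the residual down:

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n = 50
  T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
  A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsc()
  b = np.ones(A.shape[0])

  fac = spla.spilu(A, drop_tol=1e-3)   # approximate factorization (stand-in for low-accuracy factors)
  x = fac.solve(b)                     # initial solution from the approximate factors
  for k in range(5):                   # iterative refinement reuses the same factors
      r = b - A @ x                    # residual of the current iterate
      print("step", k, "residual norm", np.linalg.norm(r))
      x = x + fac.solve(r)             # correction step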

45
The GRID-TLSE Project
  • A web expert site for sparse linear algebra

46
Overview of GRID-TLSE: http://gridtlse.org
  • Supported by
  • ANR LEGO Project
  • ANR SOLSTICE Project
  • CNRS / JST Program REDIMPS Project
  • ACI GRID-TLSE Project
  • Partners

47
Sparse Matrices Expert Site?
  • Expert site: help users in choosing the right
    solvers and their parameters for a given problem
  • Chosen approach: expert scenarios which answer
    common user requests
  • Main goal: provide a friendly test environment
    for expert and non-expert users of sparse linear
    algebra software.

48
Sparse Matrices Expert Site?
  • Easy access to
  • Software and tools
  • A wide range of computer architectures
  • Matrix collections
  • Expert Scenarios.
  • Also: provide a testbed for sparse linear
    algebra software

49
Why use the grid?
  • Sparse linear algebra software makes use of
    sophisticated algorithms for (pre-/post-)
    processing the matrix.
  • Multiple parameters interfere with the efficient
    execution of a sparse direct solver
  • Ordering
  • Amount of memory
  • Architecture of computer
  • Available libraries.
  • Determining the best combination of parameter
    values is a multi-parametric problem.
  • Well-suited for execution over a Grid.

50
Components
Software components
Diet / ITBL / AEGIS
  • Middleware: DIET, developed within GRID-ASP (LIP,
    Loria Resedas, LIFC-SDRP), and soon ITBL (?)

51
Hybrid Solvers
52
Parallel hybrid iterative/direct solver for the
solution of large sparse linear systems arising
from 3D elliptic discretization
  • L. Giraud1, A. Haidar2, S. Watson3
  • 1 ENSEEIHT, Parallel Algorithms and Optimization
    Group, 2 rue Camichel, 31071 Toulouse, France
  • 2 CERFACS - Parallel Algorithms Project,
    42 Avenue Coriolis, 31057 Toulouse, France
  • 3 Departments of Computer Science and Mathematics,
    Virginia Polytechnic Institute, USA

53
Non-overlapping domain decomposition
  • Natural approach for PDEs, extended to general
    sparse matrices
  • Partition the problem into subdomains, subgraphs
  • Use a direct solver on the subdomains (MUMPS
    package)
  • Robust algebraically preconditioned iterative
    solver on the interface (Algebraic Additive
    Schwarz preconditioner, possibly with sparsified
    and mixed-arithmetic variants); see the sketch below
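A minimal sketch of this interface (Schur complement)
approach on an assumed 1D Laplacian with a single interface
node: the interiors are eliminated with a direct solver and
the interface system is solved iteratively (schur_matvec
and the index sets are illustrative, not the actual
preconditioned solver described here):

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n = 101                                            # 1D Laplacian; the middle node is the interface
  A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
  b = np.ones(n)

  mid = n // 2
  I = np.r_[np.arange(mid), np.arange(mid + 1, n)]   # interior unknowns (two decoupled subdomains)
  G = np.array([mid])                                # interface unknowns

  AII = A[I, :][:, I].tocsc()
  AIG = A[I, :][:, G]
  AGI = A[G, :][:, I]
  AGG = A[G, :][:, G]
  lu_II = spla.splu(AII)                             # direct solver on the (block-diagonal) interior problem

  def schur_matvec(v):                               # apply S = AGG - AGI AII^{-1} AIG without forming S
      return AGG @ v - AGI @ lu_II.solve(AIG @ v)

  S = spla.LinearOperator((G.size, G.size), matvec=schur_matvec, dtype=A.dtype)
  g = b[G] - AGI @ lu_II.solve(b[I])                 # reduced right-hand side on the interface
  xG, info = spla.gmres(S, g)                        # iterative solve of the interface system
  xI = lu_II.solve(b[I] - AIG @ xG)                  # back-substitution for the interior unknowns

  x = np.empty(n)
  x[I] = xI
  x[G] = xG
  print("converged:", info == 0, " ||Ax - b|| =", np.linalg.norm(A @ x - b))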

54
Numerical behaviour of the preconditioners
Convergence history on a 43 million dof problem
on 1000 System X processors - Virginia Tech
55
Parallel scaled scalability study
Parallel elapsed time with fixed sub-problem size
(43,000 dof) when the number of processors varies
from 27 (1.1 x 10^6 dof) up to 1000 (43 x 10^6 dof)
56
Hybrid iterative/direct strategies for solving
large sparse linear systems resulting from the
finite element discretization of the
time-harmonic Maxwell equations
  • L. Giraud1, A. Haidar2, S. Lanteri3
  • 1 ENSEEIHT, Parallel Algorithms and Optimization
    Group, 2 rue Camichel, 31071 Toulouse, France
  • 2 CERFACS - Parallel Algorithms Project,
    42 Avenue Coriolis, 31057 Toulouse, France
  • 3 INRIA, nachos project-team,
    2004 Route des Lucioles, BP 93, 06902 Sophia
    Antipolis Cedex, France

57
Context and objectives
  • Solution of time-harmonic electromagnetic wave
    propagation problems
  • Discretization in space
  • Discontinuous Galerkin time-harmonic methods
  • Unstructured tetrahedral meshes
  • Based on P? nodal (Lagrange) interpolation
  • Centered or upwind fluxes for the calculation of
    jump terms at cell boundaries
  • Discretization in space results in a large,
    sparse linear system with complex coefficients
  • Direct (sparse LU) solvers for 2D problems
  • Parallel solvers are mandatory for 3D problems
  • Related publications
  • H. Fol (PhD thesis, 2006)
  • V. Dolean, H. Fol, S. Lanteri and R. Perrussel (J.
    Comp. Appl. Math., to appear, 2007)

58
Solution algorithm
  • Parallel hybrid iterative/direct solver
  • Domain decomposition framework
  • Schwarz algorithm with characteristic interface
    conditions
  • Sparse LU subdomain solver (MUMPS; P.R. Amestoy,
    I.S. Duff and J.-Y. L'Excellent, Comput. Meth.
    Appl. Mech. Engng., Vol. 184, 2000)
  • Interface (Schur complement type) formulation
  • Iterative interface solver (GMRES or BiCGstab)
  • Algebraic block preconditioning of the interface
    system, exploiting the structure of the system
    (free to construct and store)

59
Scattering of a plane wave by a PEC cube
  • Plane wave frequency: 900 MHz
  • Tetrahedral mesh
  • vertices: 67,590
  • elements: 373,632
  • Total number of DOF: 2,241,792

60
Scattering of a plane wave by a PEC cube
  • Performance results on various numbers of
    processors (IBM JS21)

Number of iterations
61
Scattering of a plane wave by a head
  • Plane wave frequency: 1800 MHz
  • Tetrahedral mesh
  • vertices: 188,101
  • elements: 1,118,952
  • Total number of DOF: 6,713,712

62
Scattering of a plane wave by a head
  • Performance results on various numbers of
    processors (IBM Blue Gene/L)

Number of iterations
63
Conclusion
  • Increased size of problems
  • 3D problems using hybrid techniques
  • Large-scale computing platforms
  • Memory, out-of-core, scalability issues
  • Increased precision and numerical stability
  • Multi-arithmetic (128 bits, interval computing,
    successive refinements, ...)
  • Algorithm design for new architectures
  • Petaflop computing
  • Power-aware scientific computing
  • Scalability when using thousands of processors
  • Supporting the development of software
    implementing computational kernels
  • Portable, reliable and efficient
  • Available to the scientific community

64
Links
  • MUMPS: http://mumps.enseeiht.fr/ and
    http://graal.ens-lyon.fr/MUMPS
  • Scotch: http://gforge.inria.fr/projects/scotch
  • PaStiX: http://gforge.inria.fr/projects/pastix
  • ScAlApplix: http://www.labri.fr/project/scalapplix
  • ANR CIS LEGO
  • ANR CIGC Numasis
  • ANR CIS Solstice, Aster
  • CNRS / JST Projects REDIMPS, NGEST
  • ...