Numerical Parallel Algorithms for Large-Scale Quantum Transport Problem

Description:

Current-Voltage Characteristics obtained by self-consistent simulations: ... NEGF Formalism. Electron Density n(V) ELECTROSTATICS. Poisson equation. Potential V(n) ...

1
Numerical Parallel Algorithms for Large-Scale
Quantum Transport Problem
  • E. Polizzi, A. Sameh
  • Department of Computer Sciences
  • Laboratory for Modeling, Numerical Algorithms,
    and Simulations
  • F. Saied, M. Sayeed
  • Computing Research Institute (CRI)
  • Information Technology at Purdue (ITaP)
  • Purdue University

2
Quantum transport modeling: How to proceed?
  • Current-voltage characteristics are obtained by
    self-consistent simulations coupling transport
    and electrostatics
  • High level of detail and realism in the
    simulation of nanoelectronic devices requires
    huge computational capacity
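
The self-consistent coupling above can be pictured as a fixed-point loop: the transport step returns a density n(V), the electrostatics step returns a potential V(n), and the two are iterated until V stops changing. The sketch below is only a toy illustration under stated assumptions: the scalar functions `electron_density` and `electrostatic_potential`, the mixing factor, and all numerical values are hypothetical and are not taken from NESSIE.

```python
import math

def electron_density(potential):
    # Hypothetical transport step: a Fermi-like occupation that drops as V rises.
    return 1.0 / (1.0 + math.exp(potential))

def electrostatic_potential(density, v_offset=-0.5, coupling=1.0):
    # Hypothetical electrostatics step: the potential grows linearly with charge.
    return v_offset + coupling * density

def self_consistent_potential(v_start=0.0, mixing=0.5, tol=1e-12, max_iter=200):
    """Damped fixed-point iteration V <- (1 - mixing) * V + mixing * V(n(V))."""
    v = v_start
    for iteration in range(1, max_iter + 1):
        v_new = electrostatic_potential(electron_density(v))
        if abs(v_new - v) < tol:
            return v_new, iteration
        v = (1.0 - mixing) * v + mixing * v_new
    raise RuntimeError("self-consistent loop did not converge")

v_sc, n_iter = self_consistent_potential()
```

In a real device simulation each evaluation of n(V) involves solving many large linear systems, which is where the computational cost mentioned above comes from.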

3
NESSIE: a state-of-the-art nanoelectronics
simulator and a multidisciplinary simulation
environment
  • Our parallel finite element code NESSIE can
    simulate the I-V characteristics of arbitrary
    2D/3D devices using a full quantum approach
    (PDE-based model)
  • By allowing the integration of different physical
    models, new discretization schemes, robust
    mathematical methods, and new numerical parallel
    techniques, NESSIE becomes a robust
    multidisciplinary simulation environment

4
Numerical Techniques
Computational problems
The potential Vi is given
  • Basic strategy: a parallel MPI procedure over the
    energy, where each processor handles many linear
    systems → not suitable for large-scale
    nanoelectronics problems
  • New strategy: each linear system is solved in
    parallel → the SPIKE algorithm
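
As a rough illustration of the basic strategy, the energy points can be divided into contiguous chunks, one per processor, with every processor solving its own linear systems independently. The helper name `split_energy_points` and the block distribution are assumptions of this sketch; the slides do not say how the energies are actually assigned.

```python
def split_energy_points(num_energies, num_procs):
    """Return, for each processor rank, the list of energy indices it owns."""
    base, extra = divmod(num_energies, num_procs)
    chunks, start = [], 0
    for rank in range(num_procs):
        # The first `extra` ranks take one additional energy point.
        count = base + (1 if rank < extra else 0)
        chunks.append(list(range(start, start + count)))
        start += count
    return chunks

chunks = split_energy_points(num_energies=10, num_procs=4)
```

With this layout each processor still has to handle a complete linear system by itself, which is the limitation the new strategy removes by parallelizing each solve.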
5
SPIKE: Introduction
After RCM (reverse Cuthill-McKee) reordering
  • Engineering problems usually produce large sparse
    linear systems
  • Banded, or banded with low-rank perturbations,
    structure is often obtained after reordering
  • SPIKE partitions the banded matrix into a block
    tridiagonal form
  • Each partition is associated with one CPU or
    one node → multilevel parallelism
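
The partitioning step can be sketched as follows: a banded matrix with half-bandwidth m is cut into p diagonal blocks A_j plus small m × m coupling blocks, B_j above the diagonal and C_j below it. The function name `partition_banded` and the dense list-of-lists storage are assumptions made for readability; a production code would use banded storage.

```python
def partition_banded(a, p, m):
    """Split a banded matrix into p diagonal blocks and m-by-m coupling blocks."""
    n = len(a)
    size = n // p
    assert size * p == n and size >= m, "n must divide evenly and blocks must fit"
    diag, upper, lower = [], [], []
    for j in range(p):
        lo, hi = j * size, (j + 1) * size
        diag.append([row[lo:hi] for row in a[lo:hi]])                 # A_j
        if j < p - 1:
            upper.append([row[hi:hi + m] for row in a[hi - m:hi]])    # B_j
        if j > 0:
            lower.append([row[lo - m:lo] for row in a[lo:lo + m]])    # C_j
    return diag, upper, lower

# Small tridiagonal example (m = 1) split into p = 3 partitions.
n = 6
a = [[0.0] * n for _ in range(n)]
for i in range(n):
    a[i][i] = 4.0
    if i + 1 < n:
        a[i][i + 1] = -1.0
        a[i + 1][i] = -2.0

diag_blocks, upper, lower = partition_banded(a, p=3, m=1)
```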

6
SPIKE: General algorithm (A. Sameh et al.)
$A$ ($n \times n$); $B_j$, $C_j$ ($m \times m$); $m \ll n$
$AX = F \;\Rightarrow\; SX = \mathrm{diag}(A_1^{-1}, \ldots, A_p^{-1})\,F$
Reduced system ($(p-1) \times 2m$)
Retrieve solution
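
A minimal sketch of these steps for the simplest case, assuming p = 2 partitions and a tridiagonal matrix (m = 1): each partition solves with its own diagonal block, the two spikes are formed explicitly, a 2 × 2 reduced system couples the interface unknowns, and the full solution is then retrieved. The dense `solve` routine is only a stand-in for a LAPACK factorization, and the function names are hypothetical.

```python
def solve(a, b):
    """Dense Gaussian elimination with partial pivoting (stand-in for LAPACK)."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for r in range(k + 1, n):
            factor = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= factor * aug[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(aug[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (aug[k][n] - s) / aug[k][k]
    return x

def spike_two_partitions(a, f, split):
    """Solve a tridiagonal system A x = f with a two-partition SPIKE scheme."""
    n = len(a)
    a1 = [row[:split] for row in a[:split]]
    a2 = [row[split:] for row in a[split:]]
    b = a[split - 1][split]   # coupling entry above the diagonal (B_1)
    c = a[split][split - 1]   # coupling entry below the diagonal (C_2)
    # Independent work per partition: modified right-hand sides and spikes.
    g1, g2 = solve(a1, f[:split]), solve(a2, f[split:])
    v = solve(a1, [0.0] * (split - 1) + [b])       # spike V_1 = A_1^{-1} [0; b]
    w = solve(a2, [c] + [0.0] * (n - split - 1))   # spike W_2 = A_2^{-1} [c; 0]
    # Reduced system for t = last unknown of partition 1, s = first of partition 2:
    #   t + v[-1] * s = g1[-1],   w[0] * t + s = g2[0]
    det = 1.0 - v[-1] * w[0]
    t = (g1[-1] - v[-1] * g2[0]) / det
    s = (g2[0] - w[0] * g1[-1]) / det
    # Retrieve the full solution from the interface values.
    return [g - vi * s for g, vi in zip(g1, v)] + [g - wi * t for g, wi in zip(g2, w)]

n = 6
a = [[0.0] * n for _ in range(n)]
for i in range(n):
    a[i][i] = 4.0
    if i + 1 < n:
        a[i][i + 1], a[i + 1][i] = -1.0, -2.0
x_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f = [sum(a[i][j] * x_true[j] for j in range(n)) for i in range(n)]
x = spike_two_partitions(a, f, split=3)
```

In an actual implementation each diagonal block would be factored once and the factors reused for the right-hand side and the spikes; the sketch calls the stand-in solver repeatedly only to stay short.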
7
SPIKE: A Hybrid Algorithm
Different choices are available, depending on the
properties of the matrix and/or the architecture of
the machine
  • The spikes can be computed
    • Explicitly (fully or partially)
    • On the fly
    • Approximately
  • The diagonal blocks can be solved
    • Directly (LU, Cholesky, or sparse counterparts)
    • Iteratively (with a preconditioning strategy)
  • The reduced system can be solved
    • Directly (Recursive SPIKE)
    • Iteratively with a preconditioning scheme
    • Approximately (Truncated SPIKE)

8
SPIKE: A Direct Solver and a Preconditioner
SPIKE as a direct solver: SPIKE preprocessing
on A
SPIKE as a preconditioner: SPIKE preprocessing
on M
SPIKE solver: $Ax = f$ (direct, or iterative with
outer/inner iterations)
Iterative method:
  • SPIKE solver: $Mz = r$
  • Matrix-vector multiplication: $Ax = y$
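
The two operations listed for the iterative method can be seen in a very small sketch. The slide does not name the outer iterative method, so a plain preconditioned Richardson iteration is swapped in here for illustration; M is taken as the tridiagonal part of A, and the dense `solve` routine is only a stand-in for the SPIKE solve of M z = r. All names and values are assumptions of this sketch.

```python
def solve(a, b):
    """Dense Gaussian elimination with partial pivoting (stand-in for SPIKE on M)."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for r in range(k + 1, n):
            factor = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= factor * aug[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(aug[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (aug[k][n] - s) / aug[k][k]
    return x

def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def preconditioned_richardson(a, m, f, tol=1e-10, max_iter=50):
    """Iterate x <- x + z with M z = r and r = f - A x."""
    x = [0.0] * len(f)
    for iteration in range(max_iter + 1):
        r = [fi - yi for fi, yi in zip(f, matvec(a, x))]   # matrix-vector mult. with A
        if max(abs(ri) for ri in r) < tol:
            return x, iteration
        z = solve(m, r)                                     # preconditioner solve M z = r
        x = [xi + zi for xi, zi in zip(x, z)]
    raise RuntimeError("outer iteration did not converge")

n = 6
a = [[0.0] * n for _ in range(n)]
for i in range(n):
    a[i][i] = 4.0
    if i + 1 < n:
        a[i][i + 1] = a[i + 1][i] = -1.0
m = [row[:] for row in a]        # M: banded (tridiagonal) part of A
a[0][n - 1] = a[n - 1][0] = 0.05  # small entries outside the band, present in A only
x_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f = matvec(a, x_true)
x, n_iter = preconditioned_richardson(a, m, f)
```

Because M captures almost all of A in this toy case, only a handful of outer iterations are needed; a Krylov method would normally replace the Richardson update.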

9
SPIKE: The recursive scheme
Reduced system: p partitions ($p = 2^n$)
New reduced system obtained by applying SPIKE
again
Reduced system: p/2 partitions
→ Better balance achieved between computational
and communication costs as compared to iterative
methods
10
SPIKE: The truncated scheme for diagonally
dominant matrices
Approximate reduced system
  • LU is used to compute $[0, I]\,V_j$ (the bottom
    block of the spike $V_j$)
  • UL is used to compute $[I, 0]\,W_j$ (the top
    block of the spike $W_j$)
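
The reason truncation works for diagonally dominant matrices is that the spikes decay rapidly away from the partition interface, so only their tips matter. The sketch below checks this numerically for a hypothetical diagonally dominant tridiagonal block (diagonal 4, off-diagonals -1, size 40). For illustration it computes the full spikes with a Thomas solve, whereas the truncated scheme only ever forms the tips.

```python
def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system by LU without pivoting (Thomas algorithm)."""
    n = len(diag)
    d, r = diag[:], rhs[:]
    for i in range(1, n):
        factor = lower[i - 1] / d[i - 1]
        d[i] -= factor * upper[i - 1]
        r[i] -= factor * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x

n = 40
lower = [-1.0] * (n - 1)
diag = [4.0] * n
upper = [-1.0] * (n - 1)
b = c = -1.0   # coupling entries to the neighbouring partitions

# Right spike V = A_j^{-1} [0; ...; 0; b]: large at the bottom, tiny at the top.
v = thomas(lower, diag, upper, [0.0] * (n - 1) + [b])
# Left spike W = A_j^{-1} [c; 0; ...; 0]: large at the top, tiny at the bottom.
w = thomas(lower, diag, upper, [c] + [0.0] * (n - 1))

decay_v = abs(v[0]) / abs(v[-1])
decay_w = abs(w[-1]) / abs(w[0])
```

Dropping the negligible far ends of the spikes decouples the reduced system into small independent blocks, which is what makes the truncated variant cheap.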

11
Parallel architectures used
  • The results were obtained on an IBM Power4
    (DataStar at SDSC) and an SGI Altix (ORNL)
  • DataStar has 176 (8-way) P655 nodes. The 8-way
    nodes have 16 GB of memory and 1.5 GHz CPU speed.
    The use of 8-way nodes is exclusive: only one
    user is allowed on a node at any time, regardless
    of the number of CPUs needed on that node.
  • The SGI Altix is a shared-memory machine (NUMA).
    It has 256 Intel Itanium2 processors running at
    1.5 GHz, each with 6 MB of L3 cache, 256 KB of L2
    cache, and 32 KB of L1 cache. It has 8 GB of
    memory per processor, for a total of 2 TB. It
    supports up to 128 processors in a single system.

12
Performance Experiments
The test matrices are dense within the band
13
SPIKE improvement over ScaLAPACK
Case 1: Spike-RP0 for non-diagonally dominant
systems
  • LU is performed with partial pivoting (LAPACK)
  • All the spikes are computed explicitly
  • The reduced system is solved directly (recursive
    SPIKE)

14
N = 480,000, RHS = 1, procs = 32
IBM-Power4
Spike with pivoting (RP0)
15
N = 480,000, RHS = 1, procs = 32
SGI-Altix
Spike with pivoting (RP0)
16
SPIKE improvement over ScaLAPACK
Case 2: Spike-RL0 for non-diagonally dominant
systems
  • LU is performed without pivoting but with
    diagonal boosting
  • All the spikes are computed explicitly
  • The reduced system is solved directly (recursive
    SPIKE)

→ Spike is used as a preconditioner
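
A minimal sketch of LU without pivoting but with diagonal boosting: whenever a pivot is found to be (nearly) zero, it is nudged by a small amount instead of swapping rows. The factors then belong to a slightly perturbed matrix, which is why the factorization is used as a preconditioner rather than as an exact solver. The default thresholds mirror the values quoted on the following slides, but how they are scaled (here by the largest entry of A) and the function name are assumptions of this sketch.

```python
def lu_with_boosting(a, zero_pivot=1e-7, epsilon=1e-9):
    """LU without pivoting; tiny pivots are boosted instead of swapping rows.

    Returns (L, U, boosted), where boosted tells whether any pivot was modified.
    """
    n = len(a)
    scale = max(abs(v) for row in a for v in row)   # assumed scaling
    u = [row[:] for row in a]
    l = [[float(i == j) for j in range(n)] for i in range(n)]
    boosted = False
    for k in range(n):
        if abs(u[k][k]) < zero_pivot * scale:
            # Boost the pivot away from zero, keeping its sign.
            u[k][k] += epsilon * scale if u[k][k] >= 0 else -epsilon * scale
            boosted = True
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return l, u, boosted

# No zero pivot: the factorization is exact and no outer iteration is needed.
a_ok = [[4.0, -1.0, 0.0], [-2.0, 4.0, -1.0], [0.0, -2.0, 4.0]]
l_ok, u_ok, boosted_ok = lu_with_boosting(a_ok)

# Zero leading pivot: boosting is triggered, so L U only approximates A.
a_bad = [[0.0, 1.0], [1.0, 0.0]]
l_bad, u_bad, boosted_bad = lu_with_boosting(a_bad)
```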
17
New zero-pivot: $10^{-7}$; epsilon boosting: $10^{-9}$

N = 480,000, RHS = 1, procs = 32
IBM-Power4
Spike w/o pivoting (RL0) - no zero-pivot
detected
no outer-iteration needed
18
New zero-pivot: $10^{-7}$; epsilon boosting: $10^{-9}$

N = 480,000, RHS = 1, procs = 32
SGI-Altix
Spike w/o pivoting (RL0) - no zero-pivot
detected
no outer-iteration needed
19
IBM-Power4 and SGI Altix
Spike w/o pivoting (RL0) - no zero-pivot
detected
no outer-iteration needed
N = 480,000, RHS = 1, procs = 32
20
New zero-pivot: $10^{-7}$; epsilon boosting: $10^{-9}$

N = 480,000, RHS = 1, procs = 32
IBM-Power4
Spike w/o pivoting (RL0): zero-pivot detected,
diagonal boosting, outer iteration needed
(res < $10^{-8}$)
21
SPIKE improvement over ScaLAPACK
Case 3: Spike-TU0 for diagonally dominant systems
  • LU/UL are performed without pivoting
  • Only the top/bottom blocks of the spikes are
    computed
  • The truncated reduced system is solved directly

22
N = 480,000, RHS = 1, procs = 32
IBM-Power4
Spike and LU/UL without pivoting (TU0)
23
N = 480,000, RHS = 1, procs = 32
SGI-Altix
Spike and LU/UL without pivoting (TU0)
24
Performance Experiments
Last column: time on IBM-Power4, Spike method.
N = 480,000, b = 401; preprocessing and solve times.
25
SPIKE Scalability
Spike (RL0)
IBM-Power4
b = 161, RHS = 1
N = 0.5M, 1M, 2M
26
SPIKE Scalability
Spike (RL0)
IBM-Power4
b = 401, RHS = 1
X: computational time; Y: communication time
If $X \gg Y$: Tsca/Tspike ? ?
If $X \ll Y$: Tsca/Tspike ? ?
27
Conclusion and future directions
  • Spike is an effective parallel scheme for narrow
    banded systems that are dense or sparse within
    the band. SPIKE outperforms the ScaLAPACK banded
    solver.
  • Spike can be used as an effective parallel
    preconditioner for banded systems
  • Spike can be adapted for preconditioning sparse
    linear systems after applying an appropriate
    reordering scheme
  • Spike has been included as a preconditioner in
    NESSIE to address large-scale nanoelectronics
    problems, including full 3D simulations and
    future full atomistic modeling