Numerical Parallel Algorithms for Large-Scale Quantum Transport Problem

1
Numerical Parallel Algorithms for Large-Scale
Quantum Transport Problem

E. Polizzi, A. Sameh
Department of Computer Sciences
Laboratory for
Modeling, Numerical Algorithms, and Simulations
F. Saied, M. Sayeed
Computing Research Institute (CRI)
Information Technology at Purdue (ITaP)
Purdue University

2
Quantum transport modeling: How to proceed?

Current-voltage characteristics are obtained by
self-consistent simulations coupling transport
(NEGF formalism, electron density n(V)) and
electrostatics (Poisson equation, potential V(n))
High level of detail and realism in the
simulation of nanoelectronic devices requires
huge computational capacity

3
NESSIE: a state-of-the-art nanoelectronics
simulator and a multidisciplinary simulation
environment

Our parallel finite element code NESSIE can
simulate the I-V characteristics of arbitrary
2D/3D devices using a full quantum approach
(PDE-based model)
By allowing the integration of different physical
models, new discretization schemes, robust
mathematical methods, and new numerical parallel
techniques, NESSIE becomes a robust
multidisciplinary simulation environment

4
Numerical Techniques
Computational problems: the potential $V_i$ is given
Basic strategy: parallel MPI procedure over the
energy, where each processor handles many linear
systems → Not suitable for large-scale
nanoelectronics problems
New strategy: each linear system is solved in
parallel → The SPIKE algorithm
5
SPIKE: Introduction
After RCM reordering

Engineering problems usually produce large sparse
linear systems
Banded, or banded with low-rank perturbations,
structure is often obtained after reordering
SPIKE partitions the banded matrix into a block
tridiagonal form
Each partition is associated with one CPU or
one node → multilevel parallelism
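
As a rough illustration of how reordering can expose a banded structure, the following Python sketch applies a textbook reverse Cuthill-McKee (RCM) pass to a small, made-up sparsity graph. The graph, the function names, and the degree-based tie-breaking are assumptions chosen for illustration; none of this is taken from the SPIKE or NESSIE code.

```python
from collections import deque

def bandwidth(adj, order):
    """Half-bandwidth of the matrix with sparsity graph `adj` under ordering `order`."""
    pos = {node: i for i, node in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)

def reverse_cuthill_mckee(adj):
    """RCM ordering of an undirected graph given as {node: set(neighbours)}."""
    visited, order = set(), []
    # Cover every connected component, starting each from a low-degree node.
    for start in sorted(adj, key=lambda n: len(adj[n])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            # Breadth-first search, visiting neighbours by increasing degree.
            for v in sorted(adj[u] - visited, key=lambda n: len(adj[n])):
                visited.add(v)
                queue.append(v)
    return order[::-1]  # reversing the Cuthill-McKee order gives RCM

# Hypothetical example: a chain of 8 unknowns stored in a scrambled numbering.
chain = [3, 6, 0, 5, 1, 7, 2, 4]
adj = {n: set() for n in chain}
for a, b in zip(chain, chain[1:]):
    adj[a].add(b)
    adj[b].add(a)

natural = sorted(adj)
rcm = reverse_cuthill_mckee(adj)
print("half-bandwidth before:", bandwidth(adj, natural))
print("half-bandwidth after: ", bandwidth(adj, rcm))
```

Once the matrix is banded, contiguous row blocks of the reordered matrix can be handed to separate CPUs or nodes, which is the partitioning described on this slide.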

6
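SPIKE: General algorithm (A. Sameh et al.)
$AX = F \;\Rightarrow\; SX = \mathrm{diag}(A_1^{-1}, \dots, A_p^{-1})\, F$
Reduced system of size $(p-1) \times 2m$
Retrieve solution
$A$ is $n \times n$; $B_j$, $C_j$ are $m \times m$, with $m \ll n$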
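
The three steps on this slide (block-diagonal solves, reduced system, retrieval) can be sketched in a few lines for the simplest possible case. The Python example below assumes a tridiagonal matrix (so $m = 1$) split into only two partitions, and uses the Thomas algorithm for the diagonal blocks; the function names and the test matrix are hypothetical, and this is a minimal sketch of the idea rather than the authors' implementation.

```python
def thomas(lower, diag, upper, rhs):
    """Tridiagonal solve without pivoting; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def tridiag_matvec(lower, diag, upper, x):
    n = len(diag)
    return [(lower[i] * x[i - 1] if i > 0 else 0.0) + diag[i] * x[i]
            + (upper[i] * x[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def spike_two_partitions(lower, diag, upper, rhs, n1):
    """Solve A x = rhs with A split after row n1 into blocks A1 and A2."""
    n2 = len(diag) - n1
    blk1 = (lower[:n1], diag[:n1], upper[:n1])
    blk2 = (lower[n1:], diag[n1:], upper[n1:])
    B, C = upper[n1 - 1], lower[n1]          # coupling entries between partitions
    # Step 1: independent solves per partition (these could run in parallel).
    g1 = thomas(*blk1, rhs[:n1])
    g2 = thomas(*blk2, rhs[n1:])
    v = thomas(*blk1, [0.0] * (n1 - 1) + [B])  # spike V = A1^{-1} [0, ..., 0, B]^T
    w = thomas(*blk2, [C] + [0.0] * (n2 - 1))  # spike W = A2^{-1} [C, 0, ..., 0]^T
    # Step 2: reduced system for the two interface unknowns.
    det = 1.0 - v[-1] * w[0]
    x_bot = (g1[-1] - v[-1] * g2[0]) / det     # last unknown of partition 1
    x_top = (g2[0] - w[0] * g1[-1]) / det      # first unknown of partition 2
    # Step 3: retrieve the full solution.
    x1 = [g - vi * x_top for g, vi in zip(g1, v)]
    x2 = [g - wi * x_bot for g, wi in zip(g2, w)]
    return x1 + x2

n = 8
lower = [0.0] + [-1.0] * (n - 1)
diag = [4.0] * n
upper = [-1.0] * (n - 1) + [0.0]
x_true = [float(i + 1) for i in range(n)]
rhs = tridiag_matvec(lower, diag, upper, x_true)
x = spike_two_partitions(lower, diag, upper, rhs, n1=4)
print(max(abs(a - b) for a, b in zip(x, x_true)))
```

With $p$ partitions and bandwidth $m$, the same pattern applies, but the interface unknowns form a larger reduced system instead of the 2 x 2 system solved here.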
7
SPIKE: A Hybrid Algorithm
Different choices depending on the properties of the
matrix and/or the architecture of the machine

The spikes can be computed:
Explicitly (fully or partially)
On the fly
Approximately
The diagonal blocks can be solved:
Directly (LU, Cholesky, or sparse counterparts)
Iteratively (with a preconditioning strategy)
The reduced system can be solved:
Directly (Recursive SPIKE)
Iteratively with a preconditioning scheme
Approximately (Truncated SPIKE)

8
SPIKE: A Direct Solver and a Preconditioner
SPIKE as a direct solver: SPIKE preprocessing
on A
SPIKE as a preconditioner: SPIKE preprocessing
on M
Iterative method
SPIKE solver: $Ax = f$ (direct
/ iterative, outer/inner)

SPIKE solver: $Mz = r$
Matrix-vector multiplication: $Ax = y$

9
SPIKE: The recursive scheme
Reduced system: p partitions ($p = 2^n$)
New reduced system obtained by applying SPIKE
again
Reduced system: p/2 partitions
→ Better balance achieved between computational
and communication costs as compared to iterative
methods
10
SPIKE: The truncated scheme for diagonally
dominant matrices
Approximate reduced system

LU is used to compute the bottom block $[0 \;\; I]\,V_j$
UL is used to compute the top block $[I \;\; 0]\,W_j$
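
The two statements above can be checked numerically on a small case. The Python sketch below assumes a diagonally dominant tridiagonal block (so each spike tip is a single number) with made-up entries; all names are hypothetical. It shows that the bottom tip of $V_j$ only needs the last LU pivot, that the top tip of $W_j$ only needs the first UL pivot, and that the opposite ends of both spikes are negligible, which is what makes truncating the reduced system reasonable.

```python
def thomas(lower, diag, upper, rhs):
    """Tridiagonal solve without pivoting; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def last_lu_pivot(lower, diag, upper):
    """Last pivot of LU without pivoting (elimination sweeps top to bottom)."""
    piv = diag[0]
    for i in range(1, len(diag)):
        piv = diag[i] - lower[i] * upper[i - 1] / piv
    return piv

def first_ul_pivot(lower, diag, upper):
    """First pivot of UL without pivoting (elimination sweeps bottom to top)."""
    piv = diag[-1]
    for i in range(len(diag) - 2, -1, -1):
        piv = diag[i] - upper[i] * lower[i + 1] / piv
    return piv

# Hypothetical diagonally dominant block A_j and coupling entries B, C.
n = 60
lower = [0.0] + [-1.0] * (n - 1)
diag = [4.0] * n
upper = [-1.5] * (n - 1) + [0.0]
B, C = -1.5, -1.0

# Spike tips obtained from a single pivot each, without a full solve.
v_bottom = B / last_lu_pivot(lower, diag, upper)
w_top = C / first_ul_pivot(lower, diag, upper)

# Full spikes, computed only to check the tips and to show the decay.
v = thomas(lower, diag, upper, [0.0] * (n - 1) + [B])
w = thomas(lower, diag, upper, [C] + [0.0] * (n - 1))

print("V bottom tip:", v_bottom, " full solve:", v[-1], " far end:", v[0])
print("W top tip:   ", w_top, " full solve:", w[0], " far end:", w[-1])
```

For a block bandwidth $m > 1$ the tips become $m \times m$ blocks and the pivots become the last block of U (or the first block of L in the UL factorization), but the structure of the argument is the same.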

11
Parallel architectures used

The results were obtained on an IBM Power4
(DataStar at SDSC)
and an SGI Altix (ORNL)
DataStar has 176 (8-way) P655 nodes. The 8-way
nodes have 16 GB of memory and 1.5 GHz CPU speed.
The use of 8-way nodes is exclusive: only one
user is allowed on a node at any time, regardless
of the number of CPUs needed on that node.
The SGI Altix is a shared-memory machine (NUMA).
It has 256 Intel Itanium2 processors running at
1.5 GHz, each with 6 MB of L3 cache, 256 KB of L2
cache, and 32 KB of L1 cache. It has 8 GB of memory
per processor, for a total of 2 TB. It supports
up to 128 processors in a single system.

12
Performance Experiments
The test matrices are dense within the band
13
SPIKE improvement over ScaLAPACK
Case 1: Spike-RP0 for non-diagonally dominant
systems

LU is performed with partial pivoting (LAPACK)
All the spikes are computed explicitly
The reduced system is solved directly (recursive
SPIKE)

14
N = 480,000; RHS = 1; procs = 32
IBM-Power4
Spike with pivoting (RP0)
15
N = 480,000; RHS = 1; procs = 32
SGI-Altix
Spike with pivoting (RP0)
16
SPIKE improvement over ScaLAPACK
Case 2: Spike-RL0 for non-diagonally dominant
systems

LU is performed without pivoting but with
diagonal boosting
All the spikes are computed explicitly
The reduced system is solved directly (recursive
SPIKE)

→ Spike is used as a preconditioner
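
To make the interplay between diagonal boosting and the outer iteration concrete, here is a small dense Python sketch. The names `zero_pivot` and `boost`, the scaling of both thresholds by the infinity norm of the matrix, and the sign rule for the boost are assumptions (the default values simply echo the $10^{-7}$ and $10^{-9}$ quoted on the result slides). The outer iteration is plain iterative refinement, swapped in as a simple stand-in because the slides do not say which iterative method was used.

```python
def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def lu_boosted(a, zero_pivot=1e-7, boost=1e-9):
    """LU without pivoting; pivots below the threshold are boosted, not swapped."""
    n = len(a)
    lu = [row[:] for row in a]
    norm = max(sum(abs(v) for v in row) for row in a)  # infinity norm of A
    boosted = 0
    for k in range(n):
        if abs(lu[k][k]) < zero_pivot * norm:
            # Assumed rule: push the pivot away from zero, keeping its sign.
            lu[k][k] += boost * norm if lu[k][k] >= 0 else -boost * norm
            boosted += 1
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, boosted

def lu_solve(lu, b):
    """Solve M z = b using the packed factors (unit lower L, upper U)."""
    n = len(lu)
    y = b[:]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    x = y[:]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x

# Hypothetical nonsingular matrix whose first pivot is exactly zero.
a = [[0.0, 1.0, 2.0],
     [1.0, 1.0, 1.0],
     [2.0, 1.0, 3.0]]
x_true = [1.0, 2.0, 3.0]
f = matvec(a, x_true)

lu, boosted = lu_boosted(a)      # M = L U is only an approximation of A
x = [0.0] * len(f)
iterations = 0
for _ in range(20):              # outer iteration: x <- x + M^{-1} (f - A x)
    r = [fi - yi for fi, yi in zip(f, matvec(a, x))]
    if max(abs(ri) for ri in r) <= 1e-12 * max(abs(fi) for fi in f):
        break
    z = lu_solve(lu, r)
    x = [xi + zi for xi, zi in zip(x, z)]
    iterations += 1

print("boosted pivots:", boosted, " outer iterations:", iterations)
print("max error:", max(abs(p - q) for p, q in zip(x, x_true)))
```

When no pivot is boosted, the factorization is that of A itself and the loop would typically stop after a single correction, which matches the "no outer-iteration needed" cases reported on the following slides.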
17
New zero-pivot: $10^{-7}$; Epsilon boosting: $10^{-9}$

N = 480,000; RHS = 1; procs = 32
IBM-Power4
Spike w/o pivoting (RL0): no zero-pivot detected,
no outer iteration needed
18
New zero-pivot: $10^{-7}$; Epsilon boosting: $10^{-9}$

N = 480,000; RHS = 1; procs = 32
SGI-Altix
Spike w/o pivoting (RL0): no zero-pivot detected,
no outer iteration needed
19
IBM-Power4 and SGI Altix
Spike w/o pivoting (RL0): no zero-pivot detected,
no outer iteration needed
N = 480,000; RHS = 1; procs = 32
20
New zero-pivot: $10^{-7}$; Epsilon boosting: $10^{-9}$

N = 480,000; RHS = 1; procs = 32
IBM-Power4
Spike w/o pivoting (RL0): zero-pivot detected,
diagonal boosting, outer iteration needed,
res < $10^{-8}$
21
SPIKE improvement over ScaLAPACK
Case 3: Spike-TU0 for diagonally dominant systems

LU/UL are performed without pivoting
Only the top/bottom blocks of the spikes are
computed
The truncated reduced system is solved directly

22
N = 480,000; RHS = 1; procs = 32
IBM-Power4
Spike and LU/UL without pivoting (TU0)
23
N = 480,000; RHS = 1; procs = 32
SGI-Altix
Spike and LU/UL without pivoting (TU0)
24
Performance Experiments
Last column: time on IBM-Power4 with the Spike method.
N = 480,000, b = 401; preprocessing and solve times.
25
SPIKE: Scalability
Spike (RL0)
IBM-Power4
b = 161; RHS = 1
N = 0.5M
N = 1M
N = 2M
26
SPIKE: Scalability
Spike (RL0)
IBM-Power4
b = 401; RHS = 1
X: computational time; Y: communication time
If $X \gg Y$: $T_{sca}/T_{spike}$ ? ?
If $X \ll Y$: $T_{sca}/T_{spike}$ ? ?
27
Conclusion and future directions

Spike is an effective parallel scheme for narrow
banded systems that are dense or sparse within
the band. SPIKE outperforms the ScaLAPACK banded
solver.
Spike can be used as an effective parallel
preconditioner for banded systems.
Spike can be adapted for preconditioning sparse
linear systems after applying an appropriate
reordering scheme.
Spike has been included as a preconditioner in
NESSIE to address large-scale nanoelectronics
problems, including full 3D simulations and
future full atomistic modeling.