Title: Numerical Parallel Algorithms for Large-Scale Quantum Transport Problem
1. Numerical Parallel Algorithms for Large-Scale Quantum Transport Problem
- E. Polizzi, A. Sameh
- Department of Computer Sciences
- Laboratory for Modeling, Numerical Algorithms, and Simulations
- F. Saied, M. Sayeed
- Computing Research Institute (CRI)
- Information Technology at Purdue (ITaP)
- Purdue University
2. Quantum transport modeling: How to proceed?
- Current-voltage characteristics are obtained by self-consistent simulations (transport-electrostatics)
- A high level of detail and realism in the simulation of nanoelectronic devices requires huge computational capacity
3. NESSIE: a state-of-the-art nanoelectronics simulator and a multidisciplinary simulation environment
- Our parallel finite element code NESSIE can simulate the I-V characteristics of arbitrary 2D/3D devices using a full quantum approach (PDE-based model)
- By allowing the integration of different physical models, new discretization schemes, robust mathematical methods, and new numerical parallel techniques, NESSIE becomes a robust multidisciplinary simulation environment
4. Numerical Techniques
Computational problems: the potential $V_i$ is given
Basic strategy: parallel MPI procedure on the energy, where each processor handles many linear systems → not suitable for large-scale nanoelectronics problems
New strategy: each linear system is solved in parallel → the SPIKE algorithm
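The basic strategy on this slide can be pictured as a static split of the energy points over MPI ranks, with every rank solving its own linear systems serially. The sketch below only illustrates that bookkeeping in plain Python (no MPI calls); the function name `energy_indices_for_rank` and the round-robin rule are assumptions made for illustration, not taken from NESSIE.

```python
def energy_indices_for_rank(n_energy, n_ranks, rank):
    """Return the energy-point indices owned by `rank`.

    Hypothetical round-robin rule: rank r owns points r, r + n_ranks, ...
    """
    if not 0 <= rank < n_ranks:
        raise ValueError("rank must satisfy 0 <= rank < n_ranks")
    return list(range(rank, n_energy, n_ranks))


if __name__ == "__main__":
    # Example: 10 energy points shared by 4 ranks. Each rank would then
    # assemble and solve one linear system per energy point it owns.
    for rank in range(4):
        print(rank, energy_indices_for_rank(10, 4, rank))
```

In this layout no linear system is ever shared between ranks, so each system has to be held and solved by a single processor. The new strategy instead parallelizes each individual solve.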
5. SPIKE: Introduction
After RCM (reverse Cuthill-McKee) reordering
- Engineering problems usually produce large sparse linear systems
- Banded, or banded with low-rank perturbations, structure is often obtained after reordering
- SPIKE partitions the banded matrix into a block tridiagonal form
- Each partition is associated with one CPU or one node → multilevel parallelism
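To make the reordering step concrete, here is a small pure-Python sketch of a reverse Cuthill-McKee (RCM) ordering applied to the sparsity graph of a symmetric matrix. It is a textbook variant written for illustration, not the reordering code used by the authors; the helper names `rcm_order` and `half_bandwidth` are made up for this example, and a production code would normally call a library routine instead.

```python
from collections import deque


def half_bandwidth(adj, order):
    """Largest |position(i) - position(j)| over all edges (i, j)."""
    pos = {v: k for k, v in enumerate(order)}
    return max((abs(pos[i] - pos[j]) for i in adj for j in adj[i]), default=0)


def rcm_order(adj):
    """RCM ordering of an undirected graph given as {node: set(neighbours)}."""
    degree = {v: len(adj[v]) for v in adj}
    visited, order = set(), []
    # Treat each connected component, starting from a minimum-degree node.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Breadth-first search, neighbours taken by increasing degree.
            for w in sorted(adj[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(w)
                queue.append(w)
    return order[::-1]  # the "reverse" in RCM


if __name__ == "__main__":
    # A chain of 8 unknowns whose labels are scrambled.
    chain = [0, 5, 2, 7, 4, 1, 6, 3]
    adj = {v: set() for v in chain}
    for a, b in zip(chain, chain[1:]):
        adj[a].add(b)
        adj[b].add(a)
    print("before:", half_bandwidth(adj, sorted(adj)))
    print("after: ", half_bandwidth(adj, rcm_order(adj)))
```

Once the band is narrow, consecutive groups of rows can be assigned to partitions, which is the block tridiagonal form SPIKE works on.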
6. SPIKE: General algorithm (A. Sameh et al.)
$A$ is $n \times n$; $B_j$, $C_j$ are $m \times m$, with $m \ll n$
$AX = F \;\Rightarrow\; SX = \mathrm{diag}(A_1^{-1}, \dots, A_p^{-1})\,F$
Reduced system (order $2m(p-1)$)
Retrieve solution
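As an illustration of the steps on this slide (solve with the diagonal blocks, solve the reduced system, retrieve the solution), the following sketch applies SPIKE to the simplest banded case: a tridiagonal matrix ($m = 1$) split into $p = 2$ partitions, so the reduced system has order $2m(p-1) = 2$. It is a serial pure-Python toy. The Thomas algorithm stands in for the block solves, which assumes that no pivoting is needed, and all function names are illustrative only.

```python
def thomas(lower, diag, upper, rhs):
    """Tridiagonal solve without pivoting.

    Row i reads lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i];
    lower[0] and upper[-1] are ignored.
    """
    n = len(diag)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * cp[i - 1]
        cp[i] = upper[i] / denom
        dp[i] = (rhs[i] - lower[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def spike_two_partitions(lower, diag, upper, rhs, split):
    """Solve a tridiagonal system with SPIKE using partitions [0, split) and [split, n)."""
    n = len(diag)
    if not 0 < split < n:
        raise ValueError("split must satisfy 0 < split < n")
    top, bot = slice(0, split), slice(split, n)
    b = upper[split - 1]  # coupling block B (a scalar here, since m = 1)
    c = lower[split]      # coupling block C

    def block_solve(part, r):
        return thomas(lower[part], diag[part], upper[part], r)

    # Modified right-hand side G = diag(A1^-1, A2^-1) F.
    g1, g2 = block_solve(top, rhs[top]), block_solve(bot, rhs[bot])
    # Spikes: V = A1^-1 [0; B] and W = A2^-1 [C; 0].
    v = block_solve(top, [0.0] * (split - 1) + [b])
    w = block_solve(bot, [c] + [0.0] * (n - split - 1))
    # Reduced 2x2 system coupling the last unknown of partition 1
    # with the first unknown of partition 2.
    det = 1.0 - v[-1] * w[0]
    x1_last = (g1[-1] - v[-1] * g2[0]) / det
    x2_first = (g2[0] - w[0] * g1[-1]) / det
    # Retrieve the full solution from S X = G.
    x1 = [g - s * x2_first for g, s in zip(g1, v)]
    x2 = [g - s * x1_last for g, s in zip(g2, w)]
    return x1 + x2


if __name__ == "__main__":
    n = 8
    lower, diag, upper = [-1.0] * n, [4.0] * n, [-1.0] * n
    rhs = [float(i + 1) for i in range(n)]
    x = spike_two_partitions(lower, diag, upper, rhs, split=4)
    ref = thomas(lower, diag, upper, rhs)
    print("max difference:", max(abs(a - r) for a, r in zip(x, ref)))
```

The two calls that produce `g1`, `v` and `g2`, `w` are independent of each other, which is where the partition-level parallelism comes from; only the small reduced system couples the partitions.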
7. SPIKE: A Hybrid Algorithm
Different choices are possible, depending on the properties of the matrix and/or the architecture of the machine
- The spikes can be computed:
  - Explicitly (fully or partially)
  - On the fly
  - Approximately
- The diagonal blocks can be solved:
  - Directly (LU, Cholesky, or sparse counterparts)
  - Iteratively (with a preconditioning strategy)
- The reduced system can be solved:
  - Directly (recursive SPIKE)
  - Iteratively with a preconditioning scheme
  - Approximately (truncated SPIKE)
8. SPIKE: A Direct Solver and a Preconditioner
SPIKE as a direct solver: SPIKE preprocessing on $A$
SPIKE as a preconditioner: SPIKE preprocessing on $M$
Iterative method
SPIKE solver: $Ax = f$ (direct / iterative, outer/inner)
- SPIKE solver: $Mz = r$
- Matrix-vector multiplication: $Ax = y$
9. SPIKE: The recursive scheme
Reduced system: $p$ partitions ($p = 2^n$)
New reduced system obtained by applying SPIKE again
Reduced system: $p/2$ partitions
→ Better balance achieved between computational and communication costs as compared to iterative methods
10. SPIKE: The truncated scheme for diagonally dominant matrices
Approximate reduced system
- LU is used to compute $[0,\ I]\,V_j$ (the bottom block of the spike $V_j$)
- UL is used to compute $[I,\ 0]\,W_j$ (the top block of the spike $W_j$)
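The truncation can be written out for two neighbouring partitions $j$ and $j+1$. The notation below is an assumption added for illustration: superscripts $(t)$ and $(b)$ denote the top and bottom $m \times m$ blocks of a spike (or the top and bottom $m$ entries of a vector), and $g_j$ is the part of $\mathrm{diag}(A_1^{-1}, \dots, A_p^{-1})\,F$ that belongs to partition $j$. Terms whose index falls outside $1, \dots, p$ are absent.

```latex
% Rows of S X = G at the interface between partitions j and j+1:
\begin{aligned}
x_j^{(b)} + V_j^{(b)}\, x_{j+1}^{(t)} + W_j^{(b)}\, x_{j-1}^{(b)} &= g_j^{(b)}, \\
x_{j+1}^{(t)} + W_{j+1}^{(t)}\, x_j^{(b)} + V_{j+1}^{(t)}\, x_{j+2}^{(t)} &= g_{j+1}^{(t)}.
\end{aligned}

% Dropping W_j^{(b)} and V_{j+1}^{(t)} leaves p-1 independent 2m x 2m systems:
\begin{pmatrix} I & V_j^{(b)} \\ W_{j+1}^{(t)} & I \end{pmatrix}
\begin{pmatrix} x_j^{(b)} \\ x_{j+1}^{(t)} \end{pmatrix}
\approx
\begin{pmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{pmatrix},
\qquad j = 1, \dots, p-1.
```

Only $V_j^{(b)} = [0,\ I]\,V_j$ and $W_{j+1}^{(t)} = [I,\ 0]\,W_{j+1}$ appear in the truncated systems, which is why computing the tips of the spikes is enough. For diagonally dominant matrices the dropped blocks are expected to be small, because the spike entries decay away from the coupling blocks.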
11. Parallel architectures used
- The results were obtained on an IBM Power4 (DataStar at SDSC) and an SGI-Altix (ORNL)
- DataStar has 176 (8-way) P655 nodes. The 8-way nodes have 16 GB of memory and 1.5 GHz CPU speed. The use of 8-way nodes is exclusive: only one user is allowed on a node at any time, regardless of the number of CPUs needed on that node.
- The SGI-Altix is a shared-memory machine (NUMA). It has 256 Intel Itanium2 processors running at 1.5 GHz, each with 6 MB of L3 cache, 256 KB of L2 cache, and 32 KB of L1 cache. It has 8 GB of memory per processor, for a total of 2 TB. It supports up to 128 processors in a single system.
12. Performance Experiments
The test matrices are dense within the band
13. SPIKE: improvement over ScaLAPACK
Case 1: Spike-RP0 for non-diagonally dominant systems
- LU is performed with partial pivoting (LAPACK)
- All the spikes are computed explicitly
- The reduced system is solved directly (recursive SPIKE)
14. N = 480,000, RHS = 1, procs = 32
IBM-Power4
Spike with pivoting (RP0)
15. N = 480,000, RHS = 1, procs = 32
SGI-Altix
Spike with pivoting (RP0)
16. SPIKE: improvement over ScaLAPACK
Case 2: Spike-RL0 for non-diagonally dominant systems
- LU is performed without pivoting but with diagonal boosting
- All the spikes are computed explicitly
- The reduced system is solved directly (recursive SPIKE)
→ Spike is used as a preconditioner
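A minimal sketch of the idea in Case 2, on a small dense matrix rather than a banded one: LU without pivoting, a boosted pivot when a near-zero pivot is met, and an outer iteration that uses the perturbed factors as a preconditioner. The boosting rule, the scaling by the largest matrix entry, and the plain Richardson-type outer iteration are assumptions made for this example. The defaults `1e-7` and `1e-9` reuse the zero-pivot and boosting numbers quoted on the result slides, read here as relative thresholds; SPIKE's actual rule and outer solver may differ.

```python
def lu_no_pivot_boosted(a, zero_pivot=1e-7, boost=1e-9):
    """LU without pivoting; pivots that are too small are boosted (assumed rule)."""
    n = len(a)
    lu = [row[:] for row in a]
    scale = max(abs(v) for row in a for v in row)
    boosted = False
    for k in range(n):
        if abs(lu[k][k]) < zero_pivot * scale:
            # Push the pivot away from zero, keeping its sign.
            lu[k][k] += boost * scale if lu[k][k] >= 0 else -boost * scale
            boosted = True
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, boosted


def lu_solve(lu, rhs):
    """Forward and backward substitution with the packed factors."""
    n = len(lu)
    y = rhs[:]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y


def solve_with_outer_iteration(a, f, tol=1e-8, max_iter=20):
    """Solve A x = f; returns (x, number of outer corrections, boosted flag)."""
    lu, boosted = lu_no_pivot_boosted(a)
    x = lu_solve(lu, f)
    norm_f = max(abs(v) for v in f)
    for it in range(max_iter + 1):
        ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in a]
        r = [fi - axi for fi, axi in zip(f, ax)]
        if max(abs(v) for v in r) <= tol * norm_f:
            return x, it, boosted
        z = lu_solve(lu, r)  # preconditioning step: M z = r
        x = [xi + zi for xi, zi in zip(x, z)]
    raise RuntimeError("outer iteration did not converge")


if __name__ == "__main__":
    # The leading entry is zero, so LU without pivoting needs boosting here.
    a = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
    f = [3.0, 2.0, 3.0]  # chosen so that the exact solution is [1, 1, 1]
    print(solve_with_outer_iteration(a, f))
```

When no pivot is boosted, the factors are those of $A$ itself and the first solve is normally already accurate, which corresponds to the "no outer iteration needed" situation; when a pivot is boosted, the factors belong to a nearby matrix $M$ and the outer iteration recovers the accuracy.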
17. New zero-pivot: $10^{-7}$; epsilon boosting: $10^{-9}$
N = 480,000, RHS = 1, procs = 32
IBM-Power4
Spike w/o pivoting (RL0)
- no zero-pivot detected
- no outer iteration needed
18. New zero-pivot: $10^{-7}$; epsilon boosting: $10^{-9}$
N = 480,000, RHS = 1, procs = 32
SGI-Altix
Spike w/o pivoting (RL0)
- no zero-pivot detected
- no outer iteration needed
19. IBM-Power4 and SGI-Altix
Spike w/o pivoting (RL0)
- no zero-pivot detected
- no outer iteration needed
N = 480,000, RHS = 1, procs = 32
20. New zero-pivot: $10^{-7}$; epsilon boosting: $10^{-9}$
N = 480,000, RHS = 1, procs = 32
IBM-Power4
Spike w/o pivoting (RL0)
- zero-pivot detected: diagonal boosting
- outer iteration needed, res < $10^{-8}$
21. SPIKE: improvement over ScaLAPACK
Case 3: Spike-TU0 for diagonally dominant systems
- LU/UL are performed without pivoting
- Only the top/bottom blocks of the spikes are computed
- The truncated reduced system is solved directly
22. N = 480,000, RHS = 1, procs = 32
IBM-Power4
Spike and LU/UL without pivoting (TU0)
23. N = 480,000, RHS = 1, procs = 32
SGI-Altix
Spike and LU/UL without pivoting (TU0)
24. Performance Experiments
Last column: time on IBM-Power4, Spike method. N = 480,000, b = 401; preprocess and solve times.
25. SPIKE: Scalability
Spike (RL0)
IBM-Power4
b = 161, RHS = 1
N = 0.5M
N = 1M
N = 2M
26. SPIKE: Scalability
Spike (RL0)
IBM-Power4
b = 401, RHS = 1
X: computational time; Y: communication time
If $X \gg Y$: Tsca/Tspike ? ?
If $X \ll Y$: Tsca/Tspike ? ?
27. Conclusion and future directions
- Spike is an effective parallel scheme for narrow banded systems that are dense or sparse within the band. SPIKE outperforms the ScaLAPACK banded solver.
- Spike can be used as an effective parallel preconditioner for banded systems
- Spike can be adapted for preconditioning sparse linear systems after applying an appropriate reordering scheme
- Spike has been included as a preconditioner in NESSIE to address large-scale nanoelectronics problems, including full 3D simulations and future full atomistic modeling