Title: Numerical Libraries in High Performance Computing
1 Numerical Libraries in High Performance Computing
- Dmitry Pekurovsky
- dmitry_at_sdsc.edu
- CIEG workshop, April 25 2007
2 Tutorial Outline
- General comments
- Overview of commonly used libraries
- MASS
- ScaLAPACK and its components
- BLAS, BLACS, PBLAS, LAPACK
- PETSc
- NAG
- ESSL, PESSL
- FFTW
- P3DFFT
- Lab assignment
3 Numerical libraries
- Use numerical (a.k.a. math or scientific) libraries whenever possible
- Minimizes code development effort
- Improved performance
- Robust, accurate, up-to-date numerical algorithms
- A library is a collection (archive) of precompiled routines
- Use nm on the .a file to look inside if needed
- Linking: mpxlf /lib/loc/libnum.a OR mpxlf -L/lib/loc -lnum
- Reminder: order of linking matters
- If two libraries contain the same routine name, the first one to appear on the linking command line gets used
- -l options should come after the object files (.o) of your source
- If library A calls routines from library B, -lA has to come before -lB
4 Calling Fortran Libraries from C/C++
- Fortran uses column-major, C/C++ uses row-major array storage
- In C array indices begin with 0, but in Fortran they begin with 1
- Pass arguments by reference
- Pass characters as strings
- In C, prototype the functions to be called, e.g. extern void func(int *n, ...)
- Link with the -lxlf90_r flag added
- For more details see http://www.sdsc.edu/us/newsletter/common/newsletter.php?issue=2005-10&corner=software or ask SDSC consulting
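The column-major vs. row-major difference can be made concrete with NumPy (used here purely for illustration; the libraries in this tutorial are C/Fortran). The logical matrix is the same, but the linear walk through memory, which is what a Fortran routine actually sees, differs:

```python
import numpy as np

# Same 2x3 matrix in C (row-major) and Fortran (column-major) layout.
a_c = np.array([[1, 2, 3],
                [4, 5, 6]], order="C")
a_f = np.asfortranarray(a_c)

# Logical contents are identical...
assert (a_c == a_f).all()

# ...but the memory order differs; order="K" ravels in memory order.
print(list(a_c.ravel(order="K")))  # [1, 2, 3, 4, 5, 6]
print(list(a_f.ravel(order="K")))  # [1, 4, 2, 5, 3, 6]
```

Passing a C row-major array to a Fortran routine without accounting for this effectively hands it the transpose.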
5 Third-party libraries
- Advantages
- Portable code
- Extended set of features
- Disadvantages
- Suboptimal performance
- Sometimes costly, though many are free
- Sometimes you need to install them yourself
- A good set is already installed on SDSC systems
- Look in the /usr/local/apps directory for 64-bit apps, in /usr/local/apps32 for 32-bit apps
- Examples: ScaLAPACK, PETSc, FFTW, NAG
6 Vendor-supplied libraries (specific to a given architecture)
- Advantages
- Highly optimized performance (for a given processor, memory system, etc.)
- Installed, tested, supported
- Usually free for the user
- Disadvantages
- Sometimes code is not portable to other platforms
- Some features may not be implemented
- Examples on IBM: ESSL, MASS
- Example on Intel: MKL
7 MASS: Mathematical Acceleration SubSystem
- IBM product
- Scalar library functions
- Replace standard system intrinsics from libm
- Call from Fortran or C
- No source code modification required
- Double precision routines: sin cos tan atan sqrt sinh cosh tanh atan2 rsqrt exp dnint log, pow (C/C++), x**y (Fortran)
- Provide better performance
- Sometimes less accurate results (usually in the last bit)
- Linking: -L/usr/local/apps/mass -lmass
8 Scalar MASS Performance (cycles per call, length-1000 loop)

(spd = speedup, libm cycles / MASS cycles, per platform)

Function Rng   604e: libm mass spd   pwr3: libm mass spd   pwr4: libm mass spd   pwr5: libm mass spd
acos      B          91   91  1.0          70   69  1.0         108  116  0.9         113  115  1.0
acosh     G         318  182  1.7         256  159  1.6         483  253  1.9         527  248  2.1
asin      B          84   83  1.0          63   63  1.0         100  101  1.0         112  116  1.0
asinh     D         352  232  1.5         264  178  1.5         463  301  1.5         482  286  1.7
atan2     D         601  103  5.8         441   79  5.6         754  100  7.5         847   95  8.9
atanh     B         244  143  1.7         186  116  1.6         300  169  1.8         312  165  1.9
cbrt      D         279  282  1.0         216  216  1.0         328  333  1.0         335  336  1.0
cos       B          53   25  2.1          38   16  2.4          53   21  2.5          65   29  2.2
cos       D          76   45  1.7          58   37  1.6          74   58  1.3          86   60  1.4
cosh      D         191   45  4.2         154   32  4.8         237   46  5.2         255   50  5.1
cosisin   B         111   58  1.9          82   34  2.4         101   46  2.2         123   50  2.5
cosisin   D         161   97  1.7         125   77  1.6         151  102  1.5         170  105  1.6
div       D          32   32  1.0         8.8  8.8  1.0          28   28  1.0          29   29  1.0
exp       D         120   35  3.4         106   27  3.9         162   35  4.6         168   40  4.2
expm1     D         133   41  3.2         111   28  4.0         218   43  5.1         198   43  4.6
9 Scalar MASS Performance (cycles per call, length-1000 loop), cont.

Function Rng   604e: libm mass spd   pwr3: libm mass spd   pwr4: libm mass spd   pwr5: libm mass spd
log       C         134   55  2.4         105   52  2.0         204   84  2.4         210   86  2.4
log10     C         133   56  2.4         107   57  1.9         206   90  2.3         207   92  2.3
log1p     H         157   58  2.7         124   47  2.6         201   68  3.0         220   70  3.1
pow       C         309  113  2.7         235   90  2.6         453  146  3.1         463  150  3.1
qdrt      C         134   98  1.4         116   88  1.3         250  156  1.6         283  159  1.8
rcbrt     D         309  309  1.0         236  237  1.0         354  354  1.0         362  363  1.0
rec       D          32   32  1.0         9.2  8.8  1.0          29   29  1.0          29   29  1.0
rqdrt     C         149  100  1.5         126   95  1.3         243  155  1.6         276  153  1.8
rsqrt     C          86   52  1.7          70   52  1.3         120   73  1.6         156   74  2.1
sin       B          52   27  1.9          37   16  2.3          54   24  2.3          66   32  2.1
sin       D          78   45  1.7          61   37  1.6          77   57  1.4          82   60  1.4
sincos    B         108   56  1.9          80   34  2.4         101   48  2.1         121   54  2.2
sinh      D         250   50  5.0         197   35  5.6         345   52  6.6         329   55  6.0
sqrt      C          69   50  1.4          59   46  1.3         128   72  1.8         143   73  2.0
tan       D         137   72  1.9         112   43  2.6         194   62  3.1         183   64  2.9
tanh      F         256   64  4.0         199   47  4.2         357   65  5.5         333   67  5.0
10 Vector MASS library - MASSV
- Provides optimized versions of intrinsic functions for vectorized operations
- Example:
      for(i=0; i<N; i++)
          Z[i] = exp(x[i]);
- Replace with:
      vexp(Z,x,N)
- Takes advantage of pipelining
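The same loop-to-vector-call transformation can be illustrated with NumPy standing in for MASSV (NumPy is used only as an illustration; MASSV itself is an IBM C/Fortran library):

```python
import math
import numpy as np

x = np.linspace(0.0, 1.0, 1000)

# Scalar loop, as in the "before" example on the slide: one call per element.
z_loop = np.empty_like(x)
for i in range(len(x)):
    z_loop[i] = math.exp(x[i])

# Vectorized call over the whole array, analogous to vexp(Z, x, N):
# one library call processes all N elements, letting the implementation pipeline.
z_vec = np.exp(x)

assert np.allclose(z_loop, z_vec)
```

The results agree; the vector form simply amortizes call overhead and keeps the floating-point pipeline full.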
11 Vector MASS library, cont.
- Some source code modification required
- Note: accuracy may differ even from scalar MASS functions (usually in the last bit)
- Library source code available for portability to other platforms
- Linking: -lmassvp4 on power4 platforms, otherwise -lmassv
- MASS URL: http://www-306.ibm.com/software/awdtools/mass/
12 Vector MASS Performance - double precision functions (cycles per evaluation, length-1000 loop)

(spd = speedup, libm cycles / MASSV cycles; the vector library is massv on 604e and massvp3/4/5 on pwr3/4/5)

Function  Rng   604e: libm massv spd   pwr3: libm massv spd   pwr4: libm massv spd   pwr5: libm massv spd
vacos      B          91    41  2.2          70    17  4.1         108    24  4.5         113    23  4.9
vacosh     G         318    39  8.2         256    15 17.1         483    21 23.0         527    19 27.7
vasin      B          84    41  2.0          63    17  3.7         100    24  4.2         112    23  4.9
vasinh     D         352    44  8.0         264    18 14.7         463    23 20.1         482    22 21.9
vatan2     D         601    40 15.0         441    27 16.3         754    60 12.6         847    57 14.9
vatanh     B         244    41  6.0         186    16 11.6         300    21 14.3         312    19 16.4
vcbrt      D         279    20 14.0         216   8.8 24.5         328    11 29.8         335    12 27.9
vcos       B          53   9.3  5.7          38     4  9.5          53   6.5  8.2          65   6.5 10.0
vcos       D          76    27  2.8          58    12  4.8          74    20  3.7          86    20  4.3
vcosh      D         191    25  7.6         154   9.2 16.7         237    14 16.9         255    13 19.6
vcosisin   B         111    19  5.8          82     8 10.3         101    13  7.8         123    12 10.3
vcosisin   D         161    29  5.6         125    12 10.4         151    21  7.2         170    20  8.5
vdiv       D          32    11  2.9         8.8   6.8  1.3          28    13  2.2          29    13  2.2
vexp       D         120    16  7.5         106   6.4 16.6         162    14 11.6         168    13 12.9
vexpm1     D         133    19  7.0         111     8 13.9         218    12 18.2         198    12 16.5
vlog       C         134    20  6.7         105   7.2 14.6         204   9.7 21.0         210   9.5 22.1
13 Vector MASS Performance - double precision functions (cycles per evaluation, length-1000 loop), cont.

Function  Rng   604e: libm massv spd   pwr3: libm massv spd   pwr4: libm massv spd   pwr5: libm massv spd
vlog10     C         133    19  7.0         107   7.6 14.1         206    10 20.6         207   9.9 20.9
vlog1p     H         157    26  6.0         124    11 11.3         201    13 15.5         220    12 18.3
vpow       C         309    52  5.9         235    20 11.8         453    29 15.6         463    30 15.4
vqdrt      C         134    15  8.9         116     6 19.3         250   8.2 30.5         283     8 35.4
vrcbrt     D         309    20 15.5         236   8.8 26.8         354    11 32.2         362    11 32.9
vrec       D          32    10  3.2         9.2   6.4  1.4          29    12  2.4          29    11  2.6
vrqdrt     C         149    14 10.6         126     6 21.0         243   7.8 31.2         276     7 39.4
vrsqrt     C          86    16  5.4          70   8.8  8.0         120    18  6.7         156    17  9.2
vsin       B          52    11  4.7          37   4.8  7.7          54   7.2  7.5          66   6.9  9.6
vsin       D          78    27  2.9          61    12  5.1          77    20  3.9          82    19  4.3
vsinh      D         250    27  9.3         197    10 19.7         345    15 23.0         329    13 25.3
vsqrt      C          69    16  4.3          59   8.8  6.7         128    17  7.5         143    17  8.4
vtan       D         137    31  4.4         112    13  8.6         194    19 10.2         183    17 10.8
vtanh      F         256    35  7.3         199    13 15.3         357    19 18.8         333    18 18.5
14 ScaLAPACK
- Scalable Linear Algebra PACKage, http://www.netlib.org/scalapack
- Developing team from University of Tennessee, University of California Berkeley, ORNL, Rice U., UCLA, UIUC, etc.
- Support in commercial packages:
- NAG Parallel Library
- IBM PESSL
- CRAY Scientific Library
- VNI IMSL
- Fujitsu, HP/Convex, Hitachi, NEC
- Handles dense and banded matrices
15 Serial Parallel
16 BLAS: Basic Linear Algebra Subprograms
17 BLAS, cont.
- Use blocked operations
- Optimized for a given memory hierarchy, e.g. good use of cache
- Other libraries build on top of BLAS
- On any given system, try to use a system-optimized version
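The blocked-operation idea can be sketched in a few lines (Python/NumPy here for readability; real BLAS implementations do this in tuned C/assembly with block sizes matched to the cache):

```python
import numpy as np

def blocked_matmul(A, B, bs=32):
    """Multiply A @ B using bs x bs blocks - the cache-friendly access
    pattern optimized BLAS implementations use internally."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for l0 in range(0, k, bs):
                # Each small block product reuses its operands many times
                # while they are resident in cache.
                C[i0:i0+bs, j0:j0+bs] += A[i0:i0+bs, l0:l0+bs] @ B[l0:l0+bs, j0:j0+bs]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 80))
B = rng.standard_normal((80, 60))
assert np.allclose(blocked_matmul(A, B), A @ B)
```

The arithmetic is identical to a plain triple loop; only the traversal order changes, which is where the memory-hierarchy win comes from.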
18 LAPACK
- Linear Algebra PACKage
- Systems of linear equations
- Eigenvalue problems
- Built on top of BLAS
- On SDSC systems, installed at /usr/local/apps/LAPACK
19 BLACS - Basic Linear Algebra Communication Subprograms
20 BLACS basics
- nprow=1, npcol=3
- order='r', zero=0
- blacs_get(zero,zero,icontxt)
- blacs_gridinit(icontxt,order,nprow,npcol)
- blacs_gridinfo(icontxt,nprow,npcol,myrow,mycol)
- icontxt: like MPI_COMM_WORLD
- blacs_gridinit specifies the topology
- blacs_gridinfo gets processor information
- Here we define a 3-element array of processors
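What blacs_gridinit/blacs_gridinfo compute for a row-major ('r') grid can be modeled in a couple of lines. The `grid_coords` helper below is hypothetical, not part of BLACS; it just shows the rank-to-(myrow, mycol) mapping:

```python
# Model of the row-major rank -> (myrow, mycol) mapping a BLACS grid uses.
# Hypothetical helper for illustration only.
def grid_coords(rank, nprow, npcol):
    return rank // npcol, rank % npcol

nprow, npcol = 1, 3  # the 1 x 3 grid from the slide
coords = [grid_coords(r, nprow, npcol) for r in range(nprow * npcol)]
print(coords)  # [(0, 0), (0, 1), (0, 2)]
```

Each MPI process asks for its own coordinates once at startup and then uses them to locate its share of every distributed matrix.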
21 PBLAS
- Parallel version of BLAS
- Built on top of BLAS and BLACS
22 ScaLAPACK data layout
23 Block-cyclic partitioning (1D)
24 Block-cyclic partitioning (2D)
25 Block-cyclic partitioning (2D), cont.
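The essence of block-cyclic partitioning fits in one line: consecutive blocks of size nb are dealt out to processes round-robin. A small sketch (the `owner` helper and the sizes are illustrative, not ScaLAPACK API):

```python
# Block-cyclic distribution: the block index cycles over the process grid.
# Owner of global index i, with block size nb over nprocs processes:
def owner(i, nb, nprocs):
    return (i // nb) % nprocs

nb, nprocs = 2, 3  # block size 2, 3 processes (illustrative values)
owners = [owner(i, nb, nprocs) for i in range(12)]
print(owners)  # [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]
```

The cyclic reuse of processes is what balances the load as factorizations shrink the active part of the matrix; in 2D the same mapping is applied independently to rows and columns.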
26 ScaLAPACK Array Descriptors
27 Syntax for descinit
descinit(idesc, m, n, mb, nb, i, j, icon, mla, ier)

In/Out  Arg    Description
OUT     idesc  Descriptor
IN      m      Row size (global)
IN      n      Column size (global)
IN      mb     Row block size
IN      nb     Column block size
IN      i      Starting grid row
IN      j      Starting grid column
IN      icon   BLACS context
IN      mla    Leading dimension of local matrix
OUT     ier    Error number

The starting grid location is usually (i=0, j=0).
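The leading dimension mla must be at least the number of local rows, which ScaLAPACK computes with its NUMROC function. A Python sketch of that formula (the translation is ours; the real routine is Fortran):

```python
# Sketch of ScaLAPACK's NUMROC: how many elements of a block-cyclically
# distributed global dimension land on one process.
def numroc(n, nb, iproc, isrcproc, nprocs):
    mydist = (nprocs + iproc - isrcproc) % nprocs  # distance from source process
    nblocks = n // nb                              # number of full blocks
    num = (nblocks // nprocs) * nb                 # whole rounds of blocks
    extrablocks = nblocks % nprocs
    if mydist < extrablocks:
        num += nb                                  # one leftover full block
    elif mydist == extrablocks:
        num += n % nb                              # the trailing partial block
    return num

# 10 rows, block size 2, distributed over 3 process rows starting at process 0:
sizes = [numroc(10, 2, p, 0, 3) for p in range(3)]
print(sizes)            # [4, 4, 2]
assert sum(sizes) == 10
```

Calling this once for rows and once for columns gives the local array shape to allocate before filling in the descriptor.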
28 ScaLAPACK functionality
29 ScaLAPACK API
- Prepend LAPACK equivalent names with P: PXYYZZZ
- X: data type
- YY: matrix type
- ZZZ: computation performed
30 ScaLAPACK API, cont.
- Matrix Types (YY) in PXYYZZZ:
- DB: General Band (diagonally dominant-like)
- DT: General Tridiagonal (diagonally dominant-like)
- GB: General Band
- GE: GEneral matrices (e.g., unsymmetric, rectangular, etc.)
- GG: General matrices, Generalized problem
- HE: Complex Hermitian
- OR: Orthogonal Real
- PB: Positive definite Banded (symmetric or Hermitian)
- PO: Positive definite (symmetric or Hermitian)
- PT: Positive definite Tridiagonal (symmetric or Hermitian)
- ST: Symmetric Tridiagonal Real
- SY: SYmmetric
- TR: TRiangular (possibly quasi-triangular)
- TZ: TrapeZoidal
- UN: UNitary complex
31 ScaLAPACK API, cont.
- Drivers (ZZZ) in PXYYZZZ:
- SL: Linear Equations (SVX expert driver: estimates condition numbers, provides error bounds, equilibrates the system)
- SV: LU Solver
- VD: Singular Value
- EV: Eigenvalue (EVX expert driver: selected eigenvalues)
- GVX: Generalized Eigenvalue
32 ScaLAPACK function call example
- Solving Ax=B for a general square matrix
- A (N x N), B (N x NRHS)
- CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )
- Global indices: IA = JA = IB = JB = 1
- IPIV: integer array (N), on output containing the permutation (row i of the matrix was interchanged with row IPIV(i))
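The mathematics PDGESV performs, shown serially with NumPy standing in for the distributed LU solve (illustration only; shapes mirror the call above):

```python
import numpy as np

# Serial analogue of the PDGESV call: solve A X = B, where A is N x N
# and B holds NRHS right-hand sides as columns.
N, NRHS = 4, 2
rng = np.random.default_rng(1)
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, NRHS))

X = np.linalg.solve(A, B)  # LU factorization with partial pivoting inside

assert X.shape == (N, NRHS)
assert np.allclose(A @ X, B)  # each column of X solves for that column of B
```

In the ScaLAPACK version, A and B are block-cyclically distributed, and the descriptors plus the pivot array carry the information that this serial call keeps implicit.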
33 ScaLAPACK performance tips
- Use native BLAS
- Choose a suitable number of processors, so as not to make the local matrix size too large (1000x1000 is about right)
- Try to use a square processor grid (Nrow = Ncol)
- Use a block size of about 64
34 PETSc
- Portable, Extensible Toolkit for Scientific Computation
- Intended to facilitate the scalable (parallel) solution of linear and non-linear partial differential equations
- Focus on problems discretized using semi- or fully implicit methods
- Developed by Argonne National Laboratory, MCS
35 PETSc
- Fully usable from Fortran, C, and C++
- Uses MPI for all parallel communication
- Object-oriented
- Designed for both efficiency and ease of use
- Available freely from http://www.mcs.anl.gov/petsc
- Installed on Datastar at /usr/local/apps/petsc-2.3.1
36 PETSc features
- Parallel vectors and discrete functions
- Distributed arrays (for regular grid problems)
- Parallel vector scatters/gathers (for unstructured grid problems)
- Parallel (sparse) matrices
- Parallel Krylov subspace methods
- Parallel preconditioners
- Parallel (Newton-based) nonlinear solvers
- Parallel time-stepping code
37 PETSc Numerical components
38 MPI use features
- Communicators and attributes to manage messages inside PETSc objects
- Nonblocking sends and receives to overlap communication with computation
- Global reduction operations
- Derived data types
39 NAG: Numerical Algorithms Group
- Proprietary package, currently installed on Datastar in /usr/local/apps/nag (64-bit only)
- Differential Equations
- Linear Equations
- Interpolation, Curve and Surface Fitting
- Linear Algebra
- Eigensystem Analysis
- Statistical Analysis
- Random Number Generators
- Fourier Transforms, Summation of Series
- Time Series Analysis
- Mesh Generation
40 ESSL - Engineering and Scientific Subroutine Library
- IBM product optimized for IBM platforms
- Over 400 high-performance mathematical functions
- C/C++ and Fortran, 32- and 64-bit, thread-safe
- API is Fortran-based. From C/C++ you must use Fortran calling conventions, e.g. column-major arrays
- Linking: -lessl
- ESSL User Guide: http://publib.boulder.ibm.com/epubs/pdf/am501403.pdf
41 ESSL features
- Numerical function categories:
- Linear Algebra - vector-scalar, sparse vector-scalar, matrix-vector, sparse matrix-vector
- Matrix Operations - addition, subtraction, multiplication, rank-k updates, transpose
- Linear Algebra Equation Solvers - dense, banded, sparse, linear least-squares
- Eigensystem Analysis - standard, generalized
- Signal Processing - Fourier transforms, convolutions, correlations
- Sorting and Searching
- Interpolation - polynomial, cubic spline
- Numerical Quadrature
- Random Number Generation
- Utilities
42 ESSL example: 1D real-to-complex Fast Fourier Transform (FFT)

    #include <essl.h>
    main()
    {
      double R[M][N], *work1, *work2;   /* M real sequences of length N */
      cmplx  C[M][N/2+1];               /* transformed output */
      int nwork1, nwork2;
      Init(R, N, M);                    /* fill R with input data */
      ...
      work1 = malloc(nwork1*sizeof(double));
      work2 = malloc(nwork2*sizeof(double));
      /* init = 1: initialize the work areas; init = 0: perform the transform */
      drcft(1, R, N, C, N/2+1, N, M, 1, 1.0, work1, nwork1, work2, nwork2);
      drcft(0, R, N, C, N/2+1, N, M, 1, 1.0, work1, nwork1, work2, nwork2);
    }
43 PESSL
- Invokes ESSL
- Uses MPI for communication
- PESSL Categories
- Subset of Level 2, 3 PBLAS
- Subset of ScaLAPACK (dense, banded)
- Sparse Routines
- Subset of ScaLAPACK Eigensystem Routines
- Fourier Transforms
- Uniform Random Number Generation
- BLACS
- Link with -lpessl
44 FFTW: Fastest Fourier Transform in the West
- Third-party software, free: http://www.fftw.org
- Latest version installed on SDSC systems in /usr/local/apps/fftw312d and /usr/local/apps/fftw312s
- 1-, 2-, and 3-dimensional FFTs, and related functions (sin/cos transforms, convolution, etc.)
- Uses plans
- Initialize transform parameters, e.g.:
- fftw_plan plan1;
- plan1 = fftw_plan_dft_2d(N, M, A, B, FFTW_FORWARD, FFTW_ESTIMATE);
- fftw_execute(plan1);
- Work arrays allocated and initialized behind the scenes
- FFTW_ESTIMATE, FFTW_MEASURE, FFTW_PATIENT initialization modes
- +'s: portable, free
- -'s: performance about 3 times worse than ESSL's FFTs!
45 3D FFT in parallel
- FFTW version 3 does not have an MPI implementation
- PESSL and other libraries use a 1D processor decomposition (slabs/planes)
- Limitation: Np < N
- Not enough for some state-of-the-art simulations, for example turbulence in a periodic domain of size 4096^3
- 2D decomposition is crucial for scaling on petascale platforms
46 2D decomposition
(figure: the domain with its z, x, y axes, decomposed over a 2D processor grid)
47 2D decomposition
- 2D decomposition: processors form a square grid P1 x P2
- Columns (pencils) instead of slabs
- Local geometry:
- Stage 1: N x K x M
- Stage 2: K x N x M
- Stage 3: K x M x N
- K = N/P1, M = N/P2
- At each stage, transform along the local dimension (N)
- Software scaling limit removed! Now the limit is P < N^2
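The pencil algorithm works because a 3D FFT factors into 1D FFTs applied along each axis in turn; between stages only a transpose (a data exchange in the parallel case) is needed. A serial NumPy check of that factorization (illustration only; the parallel code interleaves these stages with communication):

```python
import numpy as np

N = 8
rng = np.random.default_rng(2)
u = rng.standard_normal((N, N, N))

# Three stages of 1D FFTs, one axis at a time, as in the pencil decomposition...
staged = np.fft.fft(np.fft.fft(np.fft.fft(u, axis=0), axis=1), axis=2)

# ...equal the direct 3D FFT.
direct = np.fft.fftn(u)

assert np.allclose(staged, direct)
```

Each MPI task only ever needs one full axis locally at a time, which is exactly what the N x K x M, K x N x M, K x M x N stage shapes above provide.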
48 Performance of DNS (2D) on Blue Gene at SDSC and IBM's T.J. Watson Lab, and on SDSC's Datastar
- VN: two processors per node
- CO: one processor per node
49 P3DFFT
- Open-source library for efficient, highly scalable 3D FFT on parallel platforms
- Built on top of an optimized 1D FFT library
- Currently ESSL or FFTW
- In the future, more libraries
- Uses 2D decomposition, which includes the 1D option
- Developed at SDSC
- Outcome of a Strategic Applications Collaborations project (DNS turbulence code)
- Available at http://www.sdsc.edu/us/resources/p3dfft.php
50 P3DFFT features
- Implements real-to-complex (R2C) and complex-to-real (C2R) 3D transforms
- Written in Fortran 90 with MPI
- Implemented as an F90 module
- Single or double precision
- Arbitrary dimensions
- Handles many uneven cases (Ni does not have to be a factor of Pj)
- Assumes Ni > Pj
- Can do either an in-place transform or otherwise
- In the latter case, the input and output arrays should not overlap
- Memory requirements: input and output arrays plus 1 buffer
51 P3DFFT Storage
- R2C transform input:
- Global: (Nx,Ny,Nz) real array
- Local: (Nx,Ny/P1,Nz/P2) real array
- R2C output:
- Global: (Nx/2,Ny,Nz) complex array
- Conjugate symmetry: F(N-i) = F*(i)
- F(N/2+1) through F(N-1) are redundant
- F(0) and F(N/2) are real - pack them into one pseudo-complex pair
- Total: N/2 complex storage units
- Local: (Nx/(2P1),Ny/P2,Nz) complex
- C2R: input and output interchanged from R2C
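The conjugate symmetry behind this packed storage is easy to verify numerically (NumPy, for illustration):

```python
import numpy as np

# Conjugate symmetry of the FFT of real input: F(N-i) = conj(F(i)).
N = 8
x = np.random.default_rng(3).standard_normal(N)
F = np.fft.fft(x)

# Elements 1..N-1 mirror-conjugate each other, so only N/2+1 are independent...
assert np.allclose(F[1:], np.conj(F[1:][::-1]))

# ...and F(0) and F(N/2) are purely real, so they can share one complex slot.
assert abs(F[0].imag) < 1e-12 and abs(F[N // 2].imag) < 1e-12
```

This is why the R2C output needs only N/2 complex storage units per transform instead of N.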
52 Building P3DFFT
- Subdirectory build/
- Edit the makefile for your architecture
- Specify -DSINGLE_PREC or -DDOUBLE_PREC
- Default is single precision
- Specify -DESSL or -DFFTW
- Choose P3DFFT_ROOT
- Type make - this will compile and install the library in the P3DFFT_ROOT directory
- On SDSC platforms, it is already installed in /usr/local/apps/p3dfft
- Subdirectory sample/
- Contains sample programs
- For the manual, see the README file in the top-level directory
53 Using P3DFFT
- use p3dfft
- Defines mytype: 4- or 8-byte reals
- Initialization: p3dfft_setup(proc_dims,nx,ny,nz)
- proc_dims(2): two integers, P1 and P2
- For 1D decomposition, set P1 = 1
- Sets up the square 2D grid of MPI communicators in rows and columns
- Initializes local array dimensions
- Local dimensions: get_dims(istart,iend,isize,Q)
- integer istart(3),iend(3),isize(3)
- Set Q=1 for R2C input, Q=2 for R2C output
54 Using P3DFFT (2)
- Now allocate your arrays
- When ready to call the Fourier transform:
- p3dfft_ftran_r2c(IN,OUT,flg_inplace): forward transform, exp(-ikx/N)
- p3dfft_btran_c2r(IN,OUT,flg_inplace): backward (inverse) transform, exp(ikx/N)
- flg_inplace: boolean, true if doing an in-place transform
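The forward/backward pair is inverse-consistent: transforming forward and then backward (with normalization) recovers the input. A serial NumPy model of the R2C/C2R round trip (illustration only; P3DFFT does the same thing on distributed data):

```python
import numpy as np

nx = ny = nz = 8
u = np.random.default_rng(4).standard_normal((nx, ny, nz))

spec = np.fft.rfftn(u)                 # real-to-complex forward transform
back = np.fft.irfftn(spec, s=u.shape)  # complex-to-real backward transform
                                       # (irfftn applies the 1/N^3 normalization)
assert np.allclose(back, u)
```

Checking this round trip on a small grid is also a quick sanity test after building the library against a new 1D FFT backend.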
55 Future work
- Include more array distribution schemes
- Expand options for memory ordering
- Include more transform types
- Derivatives
- Complex-to-complex
- C interface
- Convenience functions
- Break down into stages
- Expand choice of 1D FFT libraries
- Questions? Comments?
- dmitry_at_sdsc.edu
56 Acknowledgements
- Amit Majumdar, SDSC
- ScaLAPACK development team: http://www.netlib.org/scalapack
- ACTS tools presentation from NERSC: http://acts.nersc.gov
57 Lab assignment
- Copy the tar file ~dmitry/scalapack_examples.tar
- Untar it
- Type make
- Test the example programs on several processor counts - do they scale?
- Modify the source code to vary:
- matrix size
- block sizes
- Rewrite example1.f in C or C++