Numerical Libraries in High Performance Computing

Transcript and Presenter's Notes

1
Numerical Libraries in High Performance Computing
  • Dmitry Pekurovsky
  • dmitry@sdsc.edu
  • CIEG workshop, April 25, 2007

2
Tutorial Outline
  • General comments
  • Overview of commonly used libraries
  • MASS
  • ScaLAPACK and its components
  • BLAS, BLACS, PBLAS, LAPACK
  • PETSc
  • NAG
  • ESSL, PESSL
  • FFTW
  • P3DFFT
  • Lab assignment

3
Numerical libraries
  • Use numerical (a.k.a. math or scientific)
    libraries whenever possible
  • Minimize code development effort
  • Improved performance
  • Robust, accurate, up-to-date numerical algorithms
  • Collection (archive) of precompiled routines
  • Use nm -a to look inside if needed
  • Linking: mpxlf /lib/loc/libnum.a OR
  • mpxlf -L/lib/loc -lnum
  • Reminder: order of linking matters
  • In case two libraries contain the same routine
    name, the first one to appear on the linking
    command line gets used
  • -l options should come after the object files
    (.o) of your source
  • If library A calls routines from library B, -lA
    has to come before -lB.

4
Calling Fortran Libraries from C/C++
  • Fortran uses column-major, C/C++ uses row-major
    array storage
  • In C array indices begin with 0, but in Fortran
    they begin with 1
  • Pass arguments by reference
  • Pass characters as strings
  • In C, prototype functions to be called as
    follows:
  • extern "FORTRAN" func(int& n, ...)
  • Link with the -lxlf90_r flag added
  • More details: see http://www.sdsc.edu/us/newsletter/common/newsletter.php?issue=2005-10&corner=software
    or ask SDSC consulting
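
A minimal sketch of these rules in C, using a hypothetical Fortran subroutine MYSUB(N, X). The unmangled lowercase symbol name follows the IBM xlf convention; other compilers may append a trailing underscore, so check with nm on the library:

    /* Hypothetical Fortran routine: SUBROUTINE MYSUB(N, X), N integer,
       X a double precision array.  All arguments are passed by reference. */
    #include <stdio.h>

    extern void mysub(int *n, double *x);   /* may be mysub_ on other compilers */

    int main(void)
    {
        /* A Fortran array X(2,3) is column-major: element X(i,j) maps to
           x[(j-1)*2 + (i-1)] in this flat C array. */
        double x[6] = {1, 2, 3, 4, 5, 6};
        int n = 6;

        mysub(&n, x);                        /* pass addresses, not values */
        printf("x(1,1) = %g\n", x[0]);
        return 0;
    }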

5
Third-party libraries
  • Advantages
  • Portable code
  • Extended set of features
  • Disadvantages
  • Suboptimal performance
  • Sometimes costly, though many are free
  • Sometimes, need to install yourself
  • A good set is already installed on SDSC systems
  • Look in /usr/local/apps directory for 64-bit
    apps, in /usr/local/apps32 for 32-bit apps
  • Examples: ScaLapack, PETSc, FFTW, NAG

6
Vendor-supplied libraries (specific to a given
architecture)
  • Advantages
  • Highly optimized performance (for a given
    processor, memory system etc)
  • Installed, tested, supported
  • Usually free for the user
  • Disadvantages
  • Sometimes code not portable to other platforms
  • Some features may not be implemented
  • Examples on IBM: ESSL, MASS
  • Examples on Intel: MKL

7
MASS - Mathematical Acceleration SubSystem
  • IBM product
  • Scalar Library Functions
  • Replace standard system intrinsics from libm
  • Call from Fortran or C
  • No source code modification required
  • Double precision routines: sin, cos, tan, atan, sqrt,
    sinh, cosh, tanh, atan2, rsqrt, exp, dnint, log,
    pow (C/C++), x**y (Fortran)
  • Provide better performance
  • Sometimes less accurate results (usually in the
    last bit)
  • Linking: -L/usr/local/apps/mass -lmass
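
As a minimal illustration, ordinary libm code needs no changes; the compile line below is an assumption of a typical build (the library path is the one shown above):

    /* Unmodified libm-style code: exp() and sin() are resolved from libmass
       simply by relinking, e.g.  xlc mass_demo.c -L/usr/local/apps/mass -lmass */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double s = 0.0;
        for (int i = 0; i < 1000; i++)
            s += exp(0.001 * i) * sin(0.001 * i);
        printf("s = %.15g\n", s);
        return 0;
    }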

8
Scalar MASS Performance (cycles per call, length
1000 loop)
    Columns: libm cycles, MASS cycles, and speedup, for each processor.

    Function  Range   604e                pwr3                pwr4                pwr5
                      libm mass speedup   libm mass speedup   libm mass speedup   libm mass speedup
    acos      B         91   91  1.0        70   69  1.0       108  116  0.9       113  115  1.0
    acosh     G        318  182  1.7       256  159  1.6       483  253  1.9       527  248  2.1
    asin      B         84   83  1.0        63   63  1.0       100  101  1.0       112  116  1.0
    asinh     D        352  232  1.5       264  178  1.5       463  301  1.5       482  286  1.7
    atan2     D        601  103  5.8       441   79  5.6       754  100  7.5       847   95  8.9
    atanh     B        244  143  1.7       186  116  1.6       300  169  1.8       312  165  1.9
    cbrt      D        279  282  1.0       216  216  1.0       328  333  1.0       335  336  1.0
    cos       B         53   25  2.1        38   16  2.4        53   21  2.5        65   29  2.2
    cos       D         76   45  1.7        58   37  1.6        74   58  1.3        86   60  1.4
    cosh      D        191   45  4.2       154   32  4.8       237   46  5.2       255   50  5.1
    cosisin   B        111   58  1.9        82   34  2.4       101   46  2.2       123   50  2.5
    cosisin   D        161   97  1.7       125   77  1.6       151  102  1.5       170  105  1.6
    div       D         32   32  1.0       8.8  8.8  1.0        28   28  1.0        29   29  1.0
    exp       D        120   35  3.4       106   27  3.9       162   35  4.6       168   40  4.2
    expm1     D        133   41  3.2       111   28  4.0       218   43  5.1       198   43  4.6

9
Scalar MASS Performance (cycles per call, length
1000 loop), cont.
    Columns: libm cycles, MASS cycles, and speedup, for each processor.

    Function  Range   604e                pwr3                pwr4                pwr5
                      libm mass speedup   libm mass speedup   libm mass speedup   libm mass speedup
    log       C        134   55  2.4       105   52  2.0       204   84  2.4       210   86  2.4
    log10     C        133   56  2.4       107   57  1.9       206   90  2.3       207   92  2.3
    log1p     H        157   58  2.7       124   47  2.6       201   68  3.0       220   70  3.1
    pow       C        309  113  2.7       235   90  2.6       453  146  3.1       463  150  3.1
    qdrt      C        134   98  1.4       116   88  1.3       250  156  1.6       283  159  1.8
    rcbrt     D        309  309  1.0       236  237  1.0       354  354  1.0       362  363  1.0
    rec       D         32   32  1.0       9.2  8.8  1.0        29   29  1.0        29   29  1.0
    rqdrt     C        149  100  1.5       126   95  1.3       243  155  1.6       276  153  1.8
    rsqrt     C         86   52  1.7        70   52  1.3       120   73  1.6       156   74  2.1
    sin       B         52   27  1.9        37   16  2.3        54   24  2.3        66   32  2.1
    sin       D         78   45  1.7        61   37  1.6        77   57  1.4        82   60  1.4
    sincos    B        108   56  1.9        80   34  2.4       101   48  2.1       121   54  2.2
    sinh      D        250   50  5.0       197   35  5.6       345   52  6.6       329   55  6.0
    sqrt      C         69   50  1.4        59   46  1.3       128   72  1.8       143   73  2.0
    tan       D        137   72  1.9       112   43  2.6       194   62  3.1       183   64  2.9
    tanh      F        256   64  4.0       199   47  4.2       357   65  5.5       333   67  5.0

10
Vector MASS library - MASSV
  • Provides optimized versions of intrinsic
    functions for vectorized operations
  • Example:
  • for(i=0; i<N; i++)
  •   Z[i] = exp(x[i]);
  • Replace with:
  • vexp(Z,x,N)
  • Takes advantage of pipelining

11
Vector MASS library cont.
  • Some source code modification required
  • Note: accuracy may differ even from scalar MASS
    functions (usually in the last bit)
  • Library source code available for portability to
    other platforms
  • Linking: -lmassvp4 on POWER4 platforms, otherwise
    -lmassv
  • MASS URL: http://www-306.ibm.com/software/awdtools/mass/
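
A hedged sketch of the vexp() replacement from the previous slide, written in C. It assumes the MASSV C interface passes the vector length by address (Fortran-style); verify the prototype against massv.h on the target system:

    #include <stdlib.h>

    /* MASSV vector exponential; prototype as commonly declared for C callers
       (assumption -- check massv.h). */
    void vexp(double y[], double x[], int *n);

    int main(void)
    {
        int n = 1000;
        double *x = malloc(n * sizeof(double));
        double *z = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) x[i] = 0.001 * i;

        /* Scalar form:  for (i = 0; i < n; i++) z[i] = exp(x[i]);
           Vector form:  one pipelined call over the whole array. */
        vexp(z, x, &n);

        free(x);
        free(z);
        return 0;
    }

Link with -lmassv (or -lmassvp4 on POWER4, as noted above).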

12
Vector MASS Performance -- double precision
functions (cycles per evaluation, length 1000
loop)
    Columns: libm cycles, MASSV cycles, and speedup, for each processor.

    Function  Range   604e                 pwr3                 pwr4                 pwr5
                      libm massv speedup   libm massv speedup   libm massv speedup   libm massv speedup
    vacos     B         91    41   2.2       70    17   4.1      108    24   4.5      113    23   4.9
    vacosh    G        318    39   8.2      256    15  17.1      483    21  23.0      527    19  27.7
    vasin     B         84    41   2.0       63    17   3.7      100    24   4.2      112    23   4.9
    vasinh    D        352    44   8.0      264    18  14.7      463    23  20.1      482    22  21.9
    vatan2    D        601    40  15.0      441    27  16.3      754    60  12.6      847    57  14.9
    vatanh    B        244    41   6.0      186    16  11.6      300    21  14.3      312    19  16.4
    vcbrt     D        279    20  14.0      216   8.8  24.5      328    11  29.8      335    12  27.9
    vcos      B         53   9.3   5.7       38     4   9.5       53   6.5   8.2       65   6.5  10.0
    vcos      D         76    27   2.8       58    12   4.8       74    20   3.7       86    20   4.3
    vcosh     D        191    25   7.6      154   9.2  16.7      237    14  16.9      255    13  19.6
    vcosisin  B        111    19   5.8       82     8  10.3      101    13   7.8      123    12  10.3
    vcosisin  D        161    29   5.6      125    12  10.4      151    21   7.2      170    20   8.5
    vdiv      D         32    11   2.9      8.8   6.8   1.3       28    13   2.2       29    13   2.2
    vexp      D        120    16   7.5      106   6.4  16.6      162    14  11.6      168    13  12.9
    vexpm1    D        133    19   7.0      111     8  13.9      218    12  18.2      198    12  16.5
    vlog      C        134    20   6.7      105   7.2  14.6      204   9.7  21.0      210   9.5  22.1

13
Vector MASS Performance - double precision
functions (cycles per evaluation, length 1000
loop), cont.
    Columns: libm cycles, MASSV cycles, and speedup, for each processor.

    Function  Range   604e                 pwr3                 pwr4                 pwr5
                      libm massv speedup   libm massv speedup   libm massv speedup   libm massv speedup
    vlog10    C        133    19   7.0      107   7.6  14.1      206    10  20.6      207   9.9  20.9
    vlog1p    H        157    26   6.0      124    11  11.3      201    13  15.5      220    12  18.3
    vpow      C        309    52   5.9      235    20  11.8      453    29  15.6      463    30  15.4
    vqdrt     C        134    15   8.9      116     6  19.3      250   8.2  30.5      283     8  35.4
    vrcbrt    D        309    20  15.5      236   8.8  26.8      354    11  32.2      362    11  32.9
    vrec      D         32    10   3.2      9.2   6.4   1.4       29    12   2.4       29    11   2.6
    vrqdrt    C        149    14  10.6      126     6  21.0      243   7.8  31.2      276     7  39.4
    vrsqrt    C         86    16   5.4       70   8.8   8.0      120    18   6.7      156    17   9.2
    vsin      B         52    11   4.7       37   4.8   7.7       54   7.2   7.5       66   6.9   9.6
    vsin      D         78    27   2.9       61    12   5.1       77    20   3.9       82    19   4.3
    vsinh     D        250    27   9.3      197    10  19.7      345    15  23.0      329    13  25.3
    vsqrt     C         69    16   4.3       59   8.8   6.7      128    17   7.5      143    17   8.4
    vtan      D        137    31   4.4      112    13   8.6      194    19  10.2      183    17  10.8
    vtanh     F        256    35   7.3      199    13  15.3      357    19  18.8      333    18  18.5

14
ScaLapack
  • Scalable Linear Algebra PACKage,
    http://www.netlib.org/scalapack
  • Developed by a team from the University of Tennessee,
    University of California Berkeley, ORNL, Rice U.,
    UCLA, UIUC, etc.
  • Support in Commercial Packages
  • NAG Parallel Library
  • IBM PESSL
  • CRAY Scientific Library
  • VNI IMSL
  • Fujitsu, HP/Convex, Hitachi, NEC
  • Handles dense and banded matrices

15
(Diagram: serial vs. parallel library components)
16
BLAS - Basic Linear Algebra Subprograms

17
BLAS, cont.
  • Use blocked operations
  • Optimized for a given memory hierarchy e.g.
    good use of cache
  • Other libraries build on top of BLAS
  • On any given system, try to use a
    system-optimized version

18
LAPACK
  • Linear Algebra PACKage
  • System of linear equations
  • Eigenvalue problems
  • Built on top of BLAS
  • On SDSC systems, installed at /usr/local/apps/LAPACK

19
BLACS - Basic Linear Algebra Communication
Subprograms
  • Built on top of MPI

20
BLACS basics
  • nprow=1, npcol=3
  • order='r', zero=0
  • blacs_get(zero,zero,icontxt)
  • blacs_gridinit(icontxt,order,nprow,npcol)
  • blacs_gridinfo(icontxt,nprow,npcol,myrow,mycol)
  • icontxt: like MPI_COMM_WORLD
  • blacs_gridinit specifies topology
  • blacs_gridinfo gets processor information
  • Here we define a 3-element (1 x 3) grid of processors
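
For C programs, the BLACS also provide Cblacs_* entry points. Below is a minimal sketch of the same 1 x 3 grid setup; the prototypes are declared by hand since they are not in a standard header, and their availability depends on how the BLACS were built:

    #include <stdio.h>

    /* BLACS C interface (declared by hand; check your BLACS installation). */
    void Cblacs_get(int icontxt, int what, int *val);
    void Cblacs_gridinit(int *icontxt, char *order, int nprow, int npcol);
    void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol, int *myrow, int *mycol);
    void Cblacs_exit(int keep_mpi);

    int main(void)
    {
        int nprow = 1, npcol = 3;           /* 1 x 3 process grid, as above */
        int icontxt, myrow, mycol;

        Cblacs_get(0, 0, &icontxt);         /* get the default system context */
        Cblacs_gridinit(&icontxt, "r", nprow, npcol);   /* row-major ordering */
        Cblacs_gridinfo(icontxt, &nprow, &npcol, &myrow, &mycol);

        printf("process (%d,%d) in a %d x %d grid\n", myrow, mycol, nprow, npcol);

        Cblacs_exit(0);                     /* 0 = also shut down MPI */
        return 0;
    }

Run with 3 MPI tasks (e.g. via poe or mpirun).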

21
PBLAS
  • Parallel version of BLAS
  • Built on top of BLAS and BLACS

22
ScaLapack data layout
23
Block-Cyclic Partitioning (1D)
24
Block-cyclic partitioning (2D)
25
Block-cyclic partitioning (2D), cont.
26
ScaLapack Array Descriptors
27
Syntax for descinit:
descinit(idesc, m, n, mb, nb, i, j, icon, mla, ier)

  In/Out  Arg    Description
  OUT     idesc  Descriptor
  IN      m      Row Size (Global)
  IN      n      Column Size (Global)
  IN      mb     Row Block Size
  IN      nb     Column Block Size
  IN      i      Starting Grid Row
  IN      j      Starting Grid Column
  IN      icon   BLACS context
  IN      mla    Leading Dimension of the Local Matrix
  OUT     ier    Error number

The starting grid location is usually (i=0, j=0).
28
ScaLapack functionality
29
ScaLapack API
  • Prepend LAPACK equivalent names with P

  • Naming convention: PXYYZZZ
  • X - Data Type
  • YY - Matrix Type
  • ZZZ - Computation Performed
30
ScaLapack API, cont.
  • Matrix Types (YY in PXYYZZZ)
  • DB General Band (diagonally dominant-like)
  • DT General Tridiagonal (diagonally dominant-like)
  • GB General Band
  • GE GEneral matrices (e.g., unsymmetric,
    rectangular, etc.)
  • GG General matrices, Generalized problem
  • HE Complex Hermitian
  • OR Orthogonal Real
  • PB Positive definite Banded (symmetric or
    Hermitian)
  • PO Positive definite (symmetric or Hermitian)
  • PT Positive definite Tridiagonal (symmetric or
    Hermitian)
  • ST Symmetric Tridiagonal Real
  • SY SYmmetric
  • TR TRiangular (possibly quasi-triangular)
  • TZ TrapeZoidal
  • UN UNitary complex

31
ScaLapack API, cont.
Drivers (ZZZ in PXYYZZZ)
  • SL Linear Equations (SVX)
  • SV LU Solver
  • VD Singular Value
  • EV Eigenvalue (EVX)
  • GVX Generalized Eigenvalue
  • Expert (X) drivers also estimate condition numbers,
    provide error bounds, equilibrate the system, and
    can compute selected eigenvalues

32
ScaLapack function call example
  • Solving Ax=B for a general square matrix
  • A (N x N), B (N x NRHS)
  • CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B,
    IB, JB, DESCB, INFO )
  • Global indices: IA = JA = IB = JB = 1
  • IPIV: integer array of size N, on output containing
    the pivot information (row i of the matrix was
    interchanged with row IPIV(i))

33
ScaLapack performance tips
  • Use native BLAS
  • Choose a suitable number of processors, so as not
    to make the local matrix size too large
    (1000x1000 is about right)
  • Try to use a square processor grid (Nrow = Ncol)
  • Block size: 64

34
PETSc
  • Portable Extensible Toolkit for Scientific
    Computation
  • Intended to facilitate the scalable (parallel)
    solution of linear and non-linear partial
    differential equations
  • Focus on problems discretized using semi or fully
    implicit methods
  • Developed by Argonne National Laboratory, MCS

35
PETSc
  • Fully usable from Fortran, C, and C++
  • Uses MPI for all parallel communications
  • Object-oriented
  • Designed for both efficiency and ease of use
  • Available freely from http://www.mcs.anl.gov/petsc
  • Installed on Datastar at /usr/local/apps/petsc-2.3.1

36
PETSc features
  • Parallel Vectors and discrete functions
  • Distributed arrays (for regular grid problems)
  • Parallel vector scatter/gathers (for unstructured
    grid problems)
  • Parallel (sparse) matrices
  • Parallel Krylov subspace methods
  • Parallel preconditioners
  • Parallel (Newton-based) nonlinear solvers
  • Parallel time-stepping code
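
A hedged sketch of the object-based style these components imply, using PETSc's parallel vector object. It is written against the current public API, which differs in small details from the petsc-2.3.1 install mentioned above (e.g. VecDestroy takes a pointer in recent releases):

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
        Vec       x;
        PetscReal nrm;

        PetscInitialize(&argc, &argv, NULL, NULL);

        VecCreate(PETSC_COMM_WORLD, &x);     /* parallel vector object */
        VecSetSizes(x, PETSC_DECIDE, 100);   /* global size 100, local split decided by PETSc */
        VecSetFromOptions(x);
        VecSet(x, 1.0);
        VecNorm(x, NORM_2, &nrm);
        PetscPrintf(PETSC_COMM_WORLD, "||x|| = %g\n", (double)nrm);

        VecDestroy(&x);
        PetscFinalize();
        return 0;
    }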

37
PETSc Numerical components

(Diagram of PETSc numerical components)
38
MPI use features
  • Communicators and attributes to manage messages
    inside PETSc objects
  • Nonblocking sends and receives to overlap
    communication with computation
  • Global reduction operations
  • Derived data types

39
NAG - Numerical Algorithms Group
  • Proprietary package, currently installed on
    Datastar in /usr/local/apps/nag (64-bit only)
  • Differential Equations
  • Linear Equations
  • Interpolation, Curve and Surface Fitting
  • Linear Algebra
  • Eigensystem Analysis
  • Statistical Analysis
  • Random Number Generator
  • Fourier Transforms, Summation of Series
  • Time series analysis
  • Mesh generation

40
ESSL - Engineering and Scientific Subroutine
Library
  • IBM product optimized for IBM platforms
  • Over 400 high performance mathematical functions
  • C/C++ and Fortran, 32- and 64-bit, thread-safe
  • API is Fortran-based. From C/C++ one must use
    Fortran calling conventions, e.g. column-major
    arrays
  • Linking: -lessl
  • ESSL User Guide: http://publib.boulder.ibm.com/epubs/pdf/am501403.pdf

41
ESSL features
  • Numerical functions categories
  • Linear Algebra - vector-scalar, sparse
    vector-scalar, matrix-vector, sparse
    matrix-vector
  • Matrix Operations - addition, subtraction,
    multiplication, rank-k updates, transpose
  • Linear Algebra Equation solvers - dense, banded,
    sparse, linear least-squares
  • Eigensystem Analysis - standard, generalized
  • Signal Processing - Fourier transform,
    convolutions, correlations
  • Sorting & Searching
  • Interpolation - polynomial, cubic spline
  • Numerical Quadrature
  • Random Number Generation
  • Utilities

42
ESSL example: 1D real-to-complex Fast Fourier
Transform (FFT)
  • #include <essl.h>
  • main()
  • {
  •   double R[M][N], *work1, *work2;
  •   cmplx C[M][N/2+1];
  •   int nwork1, nwork2;
  •   Init(R,N,M);   /* user routine filling R (not shown) */
  •   work1 = malloc(nwork1*sizeof(double));
  •   work2 = malloc(nwork2*sizeof(double));
  •   drcft(1,R,N,C,N/2+1,N,M,1,1.0,work1,nwork1,work2,nwork2); /* init=1: set up work areas */
  •   drcft(0,R,N,C,N/2+1,N,M,1,1.0,work1,nwork1,work2,nwork2); /* init=0: perform the transform */
  • }

43
PESSL
  • Invokes ESSL
  • Uses MPI for communication
  • PESSL Categories
  • Subset of Level 2, 3 PBLAS
  • Subset of ScaLAPACK (dense, banded)
  • Sparse Routines
  • Subset of ScaLAPACK Eigensystem Routines
  • Fourier Transforms
  • Uniform Random Number Generation
  • BLACS
  • Link with -lpessl

44
FFTW - Fastest Fourier Transform in the West
  • Third-party software, free: http://www.fftw.org
  • Latest version installed on SDSC systems in
    /usr/local/apps/fftw312d and /usr/local/apps/fftw312s
  • 1, 2, and 3 dimensional FFTs, and related
    functions (sin/cos transforms, convolution, etc.)
  • Uses plans:
  • Initialize transform parameters
  • long integer plan1
  • plan1 = fftw_plan(...N,M,...)
  • fftw_execute(plan1,A,B)
  • Work arrays allocated and initialized behind the
    scenes
  • FFTW_ESTIMATE, FFTW_MEASURE, FFTW_PATIENT
    initialization modes
  • Pros: portable, free
  • Cons: performance about 3 times worse than ESSL's FFTs!
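
For reference, a minimal sketch of the actual FFTW 3.x plan/execute pattern (1D real-to-complex). In FFTW 3 the arrays are bound when the plan is created, and fftw_execute() takes only the plan:

    #include <fftw3.h>

    int main(void)
    {
        int n = 1024;
        double       *in  = fftw_malloc(sizeof(double) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (n / 2 + 1));

        for (int i = 0; i < n; i++) in[i] = i % 17;   /* arbitrary test data */

        /* FFTW_ESTIMATE / FFTW_MEASURE / FFTW_PATIENT trade planning time
           for run-time speed. */
        fftw_plan p = fftw_plan_dft_r2c_1d(n, in, out, FFTW_ESTIMATE);
        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }

Link with -lfftw3 (double precision) or -lfftw3f (single precision).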

45
3D FFT in parallel
  • FFTW version 3 does not have an MPI implementation
  • PESSL and other libraries use 1D processor
    decomposition (slabs/planes)
  • Limitation: Np < N
  • Not enough for some state-of-the-art simulations,
    for example turbulence in a periodic domain of 4096^3
  • 2D decomposition is crucial for scaling on
    Petascale platforms

46
2D decomposition
(Figure: 2D decomposition of the 3D domain along the x, y, z axes)
47
2D decomposition
  • 2D decomposition: processors form a square grid
    P1 x P2
  • Columns (pencils) instead of slabs
  • Local geometry:
  • Stage 1: N x K x M
  • Stage 2: K x N x M
  • Stage 3: K x M x N
  • K = N/P1, M = N/P2
  • At each stage, transform along the local
    dimension (N)
  • Software scaling limit removed! Now the limit is
    P < N^2

48
Performance of DNS (2D) on Blue Gene at SDSC and
IBM's T.J. Watson Lab, and on SDSC's Datastar
  • VN: two processors per node
  • CO: one processor per node

49
P3DFFT
  • Open source library for efficient, highly
    scalable 3D FFT on parallel platforms
  • Built on top of an optimized 1D FFT library
  • Currently ESSL or FFTW
  • In the future, more libraries
  • Uses 2D decomposition, which includes 1D option.
  • Developed at SDSC
  • Outcome of a Strategic Applications
    Collaborations Project (DNS turbulence code)
  • Available at http://www.sdsc.edu/us/resources/p3dfft.php

50
P3DFFT features
  • Implements real-to-complex (R2C) and
    complex-to-real (C2R) 3D transforms
  • Written in Fortran 90 with MPI
  • Implemented as F90 module
  • Single or double precision
  • Arbitrary dimensions
  • Handles many uneven cases (Ni does not have to be
    divisible by Pj)
  • Assumes Ni > Pj
  • Can do either in-place or out-of-place transforms
  • In the latter case the input and output arrays should
    not overlap
  • Memory requirements: input and output arrays plus 1
    buffer

51
P3DFFT Storage
  • R2C transform input:
  • Global: (Nx,Ny,Nz) real array
  • Local: (Nx,Ny/P1,Nz/P2) real array
  • R2C output:
  • Global: (Nx/2,Ny,Nz) complex array
  • Conjugate symmetry: F(N-i) = F(i)*
  • F(N/2+1) through F(N-1) are redundant
  • F(0) and F(N/2) are real - pack them into one
    pseudo-complex pair
  • Total: N/2 complex storage units
  • Local: (Nx/(2P1),Ny/P2,Nz) complex
  • C2R: input and output interchanged from R2C

52
Building P3DFFT
  • Subdirectory build/
  • Edit makefile for your architecture
  • Specify -DSINGLE_PREC or -DDOUBLE_PREC
  • Default is single precision
  • Specify -DESSL or -DFFTW
  • Choose P3DFFT_ROOT
  • Type make - this will compile and install the
    library in the P3DFFT_ROOT directory
  • On SDSC platforms, it is already installed in
    /usr/local/apps/p3dfft
  • Subdirectory sample/
  • Contains sample programs
  • For manual see README file in upper level
    directory

53
Using P3DFFT
  • use p3dfft
  • Defines mytype: 4- or 8-byte reals
  • Initialization: p3dfft_setup(proc_dims,nx,ny,nz)
  • proc_dims(2): two integers, P1 and P2
  • For 1D decomposition, set P1 = 1
  • Sets up the square 2D grid of MPI communicators
    in rows and columns
  • Initializes local array dimensions
  • Local dimensions: get_dims(istart,iend,isize,Q)
  • integer istart(3),iend(3),isize(3)
  • Set Q=1 for R2C input, Q=2 for R2C output

54
Using P3DFFT (2)
  • Now allocate your arrays
  • When ready to call the Fourier transform:
  • p3dfft_ftran_r2c(IN,OUT,flg_inplace) - forward
    transform
  • exp(-ikx/N)
  • p3dfft_btran_c2r(IN,OUT,flg_inplace) - backward
    (inverse) transform
  • exp(ikx/N)
  • flg_inplace: boolean, true if doing an in-place
    transform

55
Future work
  • Include more array distribution schemes
  • Expand options for memory ordering
  • Include more transform types
  • Derivatives
  • Complex-to-complex
  • C interface
  • Convenience functions
  • Break down in stages
  • Expand choice of 1D FFT libraries
  • Questions? Comments?
  • dmitry@sdsc.edu

56
Acknowledgements
  • Amit Majumdar, SDSC
  • ScaLapack development team: http://www.netlib.org/scalapack
  • ACTS tools presentation from NERSC:
    http://acts.nersc.gov

57
Lab assignment
  • Copy tar file dmitry/scalapack_examples.tar
  • Untar
  • Type make
  • Test the example programs on several processor
    counts - do they scale?
  • Modify source code to vary:
  • matrix size
  • block sizes
  • Rewrite example1.f in C or C++