Title: The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev
1The Future of LAPACK and ScaLAPACK
www.netlib.org/lapack-dev
- Jim Demmel
- UC Berkeley
- 28 Sept 2005
2Outline
- Motivation
- Participants
- Goals
- Better numerics (faster and more accurate
algorithms) - Expand contents (more functions, more parallel
implementations) - Automate performance tuning
- Improve ease of use
- Better maintenance and support
- Increase community involvement (this means you!)
- Questions for the audience
- Selected Highlights
- Concluding poem
3Motivation
- LAPACK and ScaLAPACK are widely used
- Adopted by Cray, Fujitsu, HP, IBM, IMSL,
MathWorks, NAG, NEC, SGI, - gt52M web hits _at_ Netlib (incl. CLAPACK, LAPACK95)
- Many ways to improve them, based on
- Own algorithmic research
- Enthusiastic participation of research community
- On-going user/vendor survey (url below)
- Opportunities and demands of new architectures,
programming languages - New releases planned (NSF support)
- Your feedback desired
- www.netlib.org/lapack-dev
4Participants
- UC Berkeley
- Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett,
Xiaoye Li, Osni Marques, Christof Voemel, David
Bindel, Yozo Hida, Jason Riedy, Jianlin Xia,
Jiang Zhu, undergrads - U Tennessee, Knoxville
- Jack Dongarra, Victor Eijkhout, Julien Langou,
Julie Langou, Piotr Luszczek, Stan Tomov - Other Academic Institutions
- UT Austin, UC Davis, Florida IT, U Kansas, U
Maryland, North Carolina SU, San Jose SU,
UC Santa Barbara - TU Berlin, FU Hagen, U Madrid, U Manchester, U
Umeå, U Wuppertal, U Zagreb - Research Institutions
- CERFACS, LBL
- Industrial Partners
- Cray, HP, Intel, MathWorks, NAG, SGI
5Goal 1 Better Numerics
- Fastest algorithm providing standard backward
stability - MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques /
Willems - Up to 10x faster HQR Byers / Mathias / Braman
- Extensions to QZ Kågström / Kressner
- Faster Hessenberg, tridiagonal, bidiagonal
reductions van de Geijn, Bischof /
Lang , Howell / Fulton - Recursive blocked layouts for packed formats
Gustavson / Kågström / Elmroth / Jonsson/
6Goal 1 Better Numerics
- Fastest algorithm providing standard backward
stability - MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques /
Willems - Up to 10x faster HQR Byers / Mathias / Braman
- Extensions to QZ Kågström / Kressner
- Faster Hessenberg, tridiagonal, bidiagonal
reductions van de Geijn, Bischof /
Lang , Howell / Fulton - Recursive blocked layouts for packed formats
Gustavson / Kågström / Elmroth / Jonsson/ - New Most accurate algorithm providing standard
speed - Iterative refinement for Axb, least squares
- Assume availability of Extra Precise BLAS
(Li/Hida/) - www.netlib.org/blas/blast-forum/
- Jacobi SVD Drmac/Veselic
7Goal 1 Better Numerics
- Fastest algorithm providing standard backward
stability - MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques /
Willems - Up to 10x faster HQR Byers / Mathias / Braman
- Extensions to QZ Kågström / Kressner
- Faster Hessenberg, tridiagonal, bidiagonal
reductions van de Geijn, Bischof /
Lang , Howell / Fulton - Recursive blocked layouts for packed formats,
Gustavson / Kågström / Elmroth / Jonsson/ - New Most accurate algorithm providing standard
speed - Iterative refinement for Axb, least squares
- Assume availability of Extra Precise BLAS
(Li/Hida/) - www.netlib.org/blas/blast-forum/
- Jacobi SVD Drmac/Veselic
- Condition estimates for (almost) everything
(ongoing)
8Goal 1 Better Numerics
- Fastest algorithm providing standard backward
stability - MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques /
Willems - Up to 10x faster HQR Byers / Mathias / Braman
- Extensions to QZ Kågström / Kressner
- Faster Hessenberg, tridiagonal, bidiagonal
reductions van de Geijn, Bischof /
Lang , Howell / Fulton - Recursive blocked layouts for packed formats,
Gustavson / Kågström / Elmroth / Jonsson/ - New Most accurate algorithm providing standard
speed - Iterative refinement for Axb, least squares
- Assume availability of Extra Precise BLAS
(Li/Hida/) - www.netlib.org/blas/blast-forum/
- Jacobi SVD Drmac/Veselic
- Condition estimates for (almost) everything
(ongoing) - What is not fast or accurate enough?
9What goes into Sca/LAPACK?
For all linear algebra problems
For all matrix structures
For all data types
For all architectures and networks
For all programming interfaces
Produce best algorithm(s) w.r.t.
performance and accuracy (including condition
estimates, etc)
Need to automate and prioritize!
10Goal 2 Expanded Content
- Make content of ScaLAPACK mirror LAPACK as much
as possible - Full Automation would be nice, not yet robust,
general enough - Telescoping languages, Bernoulli, Rose, FLAME,
- New functions (examples)
- Updating / downdating of factorizations Stewart,
Langou - More generalized SVDs Bai , Wang
- More generalized Sylvester/Lyapunov eqns
Kågström, Jonsson, Granat - Structured eigenproblems
- O(n2) version of roots(p) Gu, Chandrasekaran,
Zhu et al - Selected matrix polynomials Mehrmann
- How should we prioritize missing functions?
11Goal 3 Automate Performance Tuning
- Not just BLAS
- 1300 calls to ILAENV() to get block sizes, etc.
- Never been systematically tuned
- Extend automatic tuning techniques of ATLAS, etc.
to these other parameters - Automation important as architectures evolve
- Convert ScaLAPACK data layouts on the fly
- How important is peak performance?
12Goal 4 Improved Ease of Use
A \ B
CALL PDGESV( N ,NRHS, A, IA, JA, DESCA, IPIV, B,
IB, JB, DESCB, INFO)
CALL PDGESVX( FACT, TRANS, N ,NRHS, A, IA, JA,
DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C,
B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR,
BERR, WORK, LWORK, IWORK, LIWORK, INFO)
13Goal 4 Improved Ease of Use, Software
Engineering (1)
- Development versus Research
- Development practical approach to produce useful
code - Research If we could start over and do it right,
how would we? - Life after F77?
- Fortran95, C, C, Java, Matlab, Python,
- Easy interfaces vs access to details
- No universal agreement across systems on easiest
interface - Leave decision to higher level packages
- Keep expert driver / simple driver /
computational routines - Conclusion Subset of F95 for core wrappers for
drivers - What subset?
- Recursion, for new data structures
- Modules, to produce multiple precision versions
- Environmental enquiries, to replace xLAMCH
- Wrappers for Fortran95, Java, Matlab, Python,
even for CLAPACK - Automatic memory allocation of workspace
14Goal 4 Improved Ease of Use, Software
Engineering (2)
- Why not full F95 for core?
- Would make interfacing to other languages and
packages harder - Some users want control over memory allocation
- Why not C for core?
- High cost/benefit ratio for full rewrite
- Performance
- Automation would be nice
- Use Babel/SIDL to produce native looking
interfaces when possible
15Precisions beyond double (1)
- Range of designs possible
- Just run in quad (or some other fixed precision)
- Support codes like
bits 32 Repeat bits 2 bits
Solve(A, b, x, error_bound, bits) Until
error_bound lt tol
16Precisions beyond double (2)
- Easiest approach fixed precision
- Use F95 modules to produce any precision on
request - Could use QD, ARPREC, GMP,
- Keep current memory allocation (twiddle GMP)
- Next easier approach maximum precision
- Build maximum allowable precision on request
- Pass in precision parameter, up to this amount
- More flexible (and difficult) approach
- Choose any precision at run time
- Dynamically allocate all variables
- Most aggressive approach
- New algorithms that minimize work to get desired
prec. - What do users want?
- Compatibility with symbolic manipulation systems?
17Goal 4 Improved Ease of Use, Software
Engineering (3)
- Research Issues
- May or may not impact development
- How to map multiple software layers to emerging
architectural layers? - Are emerging HPCS languages better?
- How much can we automate? Do we keep having to
write Gaussian elimination over and over again? - Statistical modeling to limit performance tuning
costs, improve use of shared clusters
18Goal 5Better Maintenance and Support
- Website for user feedback and requests
- New developer and discussion forums
- URL www.netlib.org/lapack-dev
- Includes NSF proposal
- Version control and bug tracking system
- Automatic Build and Test environment
- Wide variety of supported platforms
- Cooperation with vendors
- Anything else desired?
19Goal 6 Involve the Community
- To help identify priorities
- More interesting tasks than we are funded to do
- See www.netlib.org/lapack-dev for list
- To help identify promising algorithms
- What have we missed?
- To help do the work
- Bug reports, provide fixes
- Again, more tasks than we are funded to do
- Already happening thank you!
- We retain final decisions on content
- Anything else?
20Some Highlights
- Putting more of LAPACK into ScaLAPACK
- ScaLAPACK performance on 1D vs 2D grids
- MRRR Holy Grail algorithm for symmetric EVD
- Iterative Refinement for Axb
- O(n2) polynomial root finder
- Generalized SVD
21Missing Drivers in Sca/LAPACK
LAPACK ScaLAPACK
Linear Equations LU Cholesky LDLT xGESV xPOSV xSYSV PxGESV PxPOSV missing
Least Squares (LS) QR QRpivot SVD/QR SVD/DC SVD/MRRR QR iterative refine. xGELS xGELSY xGELSS xGELSD missing missing PxGELS missing driver missing driver missing (intent) missing missing
Generalized LS LS equality constr. Generalized LM Above Iterative ref. xGGLSE xGGGLM missing missing missing missing
22More missing drivers
LAPACK ScaLAPACK
Symmetric EVD QR / BisectionInvit DC MRRR xSYEV / X xSYEVD xSYEVR PxSYEV / X missing (intent) missing
Nonsymmetric EVD Schur form Vectors too xGEES / X xGEEV /X missing driver missing driver
SVD QR DC MRRR Jacobi xGESVD xGESDD missing missing PxGESVD missing (intent) missing missing
Generalized Symmetric EVD QR / BisectionInvit DC MRRR xSYGV / X xSYGVD missing PxSYGV / X missing (intent) missing
Generalized Nonsymmetric EVD Schur form Vectors too xGGES / X xGGEV / X missing missing
Generalized SVD Kogbetliantz MRRR xGGSVD missing missing (intent) missing
23Missing matrix types in ScaLAPACK
- Symmetric, Hermitian, triangular
- Band, Packed
- Positive Definite
- Packed
- Orthogonal, Unitary
- Packed
24Speedups for using 2D processor grid range from
2x to 8x
Times obtained on 60 processors, Dual AMD
Opteron 1.4GHz Cluster w/Myrinet Interconnect 2GB
Memory
25Cost of redistributing matrix to optimal layout
is small
Times obtained on 60 processors, Dual AMD
Opteron 1.4GHz Cluster w/Myrinet Interconnect 2GB
Memory
26MRRR Algorithm for eig(tridiagonal) and
svd(bidiagonal)
- Multiple Relatively Robust Representation
- 1999 Householder Award honorable mention for
Dhillon - O(nk) flops to find k eigenvalues/vectors of nxn
tridiagonal matrix (similar for SVD) - Minimum possible!
- Naturally parallelizable
- Accurate
- Small residuals Txi li xi O(n e)
- Orthogonal eigenvectors xiTxj O(n e)
- Hence nickname Holy Grail
- 2 versions
- LAPACK 3.0 large error on hard cases
- Next release fixed!
- How should we tradeoff speed and accuracy?
27Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
28Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
29Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
30Timing of Eigensolvers(only matrices where time
gt .1 sec)
31Accuracy Results (old vs new Grail)
QQT I / (n e )
maxi Tqi li qi / ( n e )
32Accuracy Results (Grail vs QR vs DC)
QQT I / (n e )
maxi Tqi li qi / ( n e )
33More Accurate Solve Axb
Conventional Gaussian Elimination
e
e n1/2 2-24
34Whats new?
- Need extra precision (beyond double)
- Part of new BLAS standard
- Cost O(n2) extra per right-hand-side, vs O(n3)
to factor - Get tiny componentwise bounds too
- Error in xi small compared to xi, not just
maxj xj - Guarantees based on condition number estimates
- No bad bounds in 6.2M tests
- Different condition number for componentwise
bounds - Traditional iterative refinement can fail
- Only get matrix close to singular message when
answer wrong? - Extends to least squares
- Demmel, Kahan, Hida, Riedy, X. Li, Sarkisyan,
- LAPACK Working Note 165
35Can condition estimators lie?
- Yes, but rarely, unless they cost as much as
matrix multiply cost of LU factorization - Demmel/Diament/Malajovich (FCM2001)
- But what if matrix multiply costs O(n2)?
- Cohn/Umans/Kleinberg (FOCS 2003/5)
36New algorithm for roots(p)
- To find roots of polynomial p
- Roots(p) does eig(C(p))
- Costs O(n3), stable, reliable
- O(n2) Alternatives
- Newton, Jenkins-Traub, Laguerre,
- Stable? Reliable?
- New Exploit semiseparable structure of C(p)
- Low rank of any submatrix of upper triangle of
C(p) preserved under QR iteration - Complexity drops from O(n3) to O(n2)
- Related work Gemignani, Bini, Pan, et al
- Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin
Xia, David Bindel, David Garmire, Jim Demmel
37Properties of new roots(p)
- First (?) algorithm that
- Is O(n2)
- Is backward stable, in the matrix sense
- Is backward stable, in the sense that the
computed roots are the exact roots of a slightly
perturbed input polynomial - Depends on balancing scaling roots by a
constant ? - Still need to automate choice of ?
- Byers, Mathias, Braman
- Tisseur, Higham, Mackey
38New GSVD Algorithm Timing Comparisons
- 1.0GHz Itanium2 (2GB RAM)
- Intel's Math Kernel Library 7.2.1 (incl. BLAS,
LAPACK)
Bai et al, UC Davis
PSVD, CSD on the way
39Conclusions
- Lots to do in Dense Linear Algebra
- New numerical algorithms
- Continuing architectural challenges
- Parallelism, performance tuning
- Grant support, but success depends on
contributions from community
40Extra Slides
41Downa Dating
With apologies to Robert Burns Nick Higham
- Should some equations be forgot
- when overdetermined?
- Should some equations be forgot
- using hyperbolic sines?
-
- With hyperbolic sines, my dear
- with hyperbolic sines,
- Well hope to get some boundedness
- with hyperbolic sines.
42New GSVD Algorithm (XGGQSV)
- UT A Z ?a( 0 R ), VT B Z ?b( 0 R ) A M x
N, B P xN - Modified Van Loan's method
- (1) Pre-processing reveal rank of ( AT BT )T
- (2) Split QR reduce two upper triangular into
one - (3) CSD Cosine-Sine Decomposition
- (4) Post-processing assemble resulted matrices
- Workspace ??(max(M, N, P)) 5L2 where L
rank(B) - Outperforms current XGGSVD in LAPACK
43Profile of SGGQSV
- 1.0GHz Itanium2 (2GB RAM)
- Intel's Math Kernel Library 7.2.1 (incl. BLAS,
LAPACK)
44Related New Routines
- CSD (XORCSD)
- UTQ1Z ?a, VTQ2Z ?b
- Based on Von Loan's method (1985)
- Dominated by the cost of SVD
- Workspace ???(Max(PM, N))
- PSVD (XGGPSV)
- UTABV ?
- Based on work by Golub, Solna, Van Dooren (2000)
- Workspace ???(Max(M, P, N))
45Benchmark Details
- AMD 1.2 GHz Athlon, 2GB mem, Redhat Intel
compiler - Compute all eigenvalues and eigenvectors of a
symmetric tridiagonal matrix T - Codes compared
- qr QR iteration from LAPACK dsteqr
- dc Cuppens DivideConquer from LAPACK dstedc
- gr New implementation of MRRR algorithm
(Grail) - ogr MRRR from LAPACK 3.0 dstegr (old Grail)
46Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec
47Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
48More Accurate Solve Axb
- Old idea Use Newtons method on f(x) Ax-b
- On a linear system?
- Roundoff in Ax-b makes it interesting
(nonlinear) - Iterative refinement
- Snyder, Wilkinson, Moler, Skeel,
Repeat r Ax-b compute with extra
precision Solve Ad r using LU
factorization of A Update x x d Until
accurate enough or no progress
49Speedups from 1.33x to 11.5x
Times obtained on 60 processors, Dual AMD
Opteron 1.4GHz Cluster w/Myrinet Interconnect 2GB
Memory
50Times obtained on 60 processors, Dual Opteron
1.4GHz, 64-bit machine 2GB Memory, Myrinet
Interconnect
51Speedups from 1.8x to 7.5x
Times obtained on 60 processors, Dual Intel Xeon
2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB
Memory
52Times obtained on 60 processors, Dual Intel Xeon
2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB
Memory
53Speedups from 2.2x to 6x
Times obtained on 60 processors, Dual Intel Xeon
2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB
Memory
54Times obtained on 60 processors, Dual Intel Xeon
2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB
Memory
55Speedups from 2.2x to 2.8x
Times obtained on 60 processors, IBM SP RS/6000,
16 shared memory (16 Gbytes) processors per node,
4 nodes connected with high-bandwidth low-latency
switching network Block size 64
56Times obtained on 60 processors, IBM SP RS/6000,
16 shared memory (16 Gbytes) processors per node,
4 nodes connected with high-bandwidth low-latency
switching network Block size 64
57The Future of LAPACK and ScaLAPACKwww.cs.berkel
ey.edu/demmel/Sca-LAPACK-Proposal.pdf
- Jim Demmel
- UC Berkeley
- Householder XVI
58OO-like interfaces (VE)
- Aim native looking interfaces in modern
languages (C, F90, Java, Python) - Overloading to have identical interfaces for all
data types (precision/storage) - Overloading to omit certain parameters (stride,
transpose,) - Expose data only when necessary (automatic
allocation of temporaries, pivoting
permutation,.) - Mechanism Use Babel/SIDL to interface to
existing F77 code base
59Example of HLL interface (VE)
// C example LapackPDTridiagonalMatrix A
LapackPDTridiagonalMatrix_create() x
(double) malloc(SIZEsizeof(double)) y
(double) malloc(SIZEsizeof(double))
A.MatVec(x,y) // or A.MatVec(MatTranspose,x,y)
A.Factor() A.Solve(y)
! F90 example ! Create matrix from user
array allocate(data_array(20,25)) ! Fill in the
array. call CreateWithArray(A,data_array) !
Retrieve the array of a (factored) matrix call
GetArray(A,mat) mat_array gt matd_data print
,'pivots',mat_array(0,0),mat_array(1,1)
Lots of details omitted here!