The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev - PowerPoint PPT Presentation

About This Presentation

Title:

The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev

Description:

For all linear algebra problems. For all matrix structures. For all data types ... Only get 'matrix close to singular' message when answer wrong? Extends to ... – PowerPoint PPT presentation

Number of Views:110

Avg rating:3.0/5.0

Slides: 60

Provided by: jimde3

Category:

more less

Transcript and Presenter's Notes

Title: The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev

1
The Future of LAPACK and ScaLAPACK
www.netlib.org/lapack-dev

Jim Demmel
UC Berkeley
28 Sept 2005

2
Outline

Motivation
Participants
Goals
Better numerics (faster and more accurate
algorithms)
Expand contents (more functions, more parallel
implementations)
Automate performance tuning
Improve ease of use
Better maintenance and support
Increase community involvement (this means you!)
Questions for the audience
Selected Highlights
Concluding poem

3
Motivation

LAPACK and ScaLAPACK are widely used
Adopted by Cray, Fujitsu, HP, IBM, IMSL,
MathWorks, NAG, NEC, SGI,
gt52M web hits _at_ Netlib (incl. CLAPACK, LAPACK95)
Many ways to improve them, based on
Own algorithmic research
Enthusiastic participation of research community
On-going user/vendor survey (url below)
Opportunities and demands of new architectures,
programming languages
New releases planned (NSF support)
Your feedback desired
www.netlib.org/lapack-dev

4
Participants

UC Berkeley
Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett,
Xiaoye Li, Osni Marques, Christof Voemel, David
Bindel, Yozo Hida, Jason Riedy, Jianlin Xia,
Jiang Zhu, undergrads
U Tennessee, Knoxville
Jack Dongarra, Victor Eijkhout, Julien Langou,
Julie Langou, Piotr Luszczek, Stan Tomov
Other Academic Institutions
UT Austin, UC Davis, Florida IT, U Kansas, U
Maryland, North Carolina SU, San Jose SU,
UC Santa Barbara
TU Berlin, FU Hagen, U Madrid, U Manchester, U
Umeå, U Wuppertal, U Zagreb
Research Institutions
CERFACS, LBL
Industrial Partners
Cray, HP, Intel, MathWorks, NAG, SGI

5
Goal 1 Better Numerics

Fastest algorithm providing standard backward
stability
MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques /
Willems
Up to 10x faster HQR Byers / Mathias / Braman
Extensions to QZ Kågström / Kressner
Faster Hessenberg, tridiagonal, bidiagonal
reductions van de Geijn, Bischof /
Lang , Howell / Fulton
Recursive blocked layouts for packed formats
Gustavson / Kågström / Elmroth / Jonsson/

6
Goal 1 Better Numerics

Fastest algorithm providing standard backward
stability
MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques /
Willems
Up to 10x faster HQR Byers / Mathias / Braman
Extensions to QZ Kågström / Kressner
Faster Hessenberg, tridiagonal, bidiagonal
reductions van de Geijn, Bischof /
Lang , Howell / Fulton
Recursive blocked layouts for packed formats
Gustavson / Kågström / Elmroth / Jonsson/
New Most accurate algorithm providing standard
speed
Iterative refinement for Axb, least squares
Assume availability of Extra Precise BLAS
(Li/Hida/)
www.netlib.org/blas/blast-forum/
Jacobi SVD Drmac/Veselic

7
Goal 1 Better Numerics

Fastest algorithm providing standard backward
stability
MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques /
Willems
Up to 10x faster HQR Byers / Mathias / Braman
Extensions to QZ Kågström / Kressner
Faster Hessenberg, tridiagonal, bidiagonal
reductions van de Geijn, Bischof /
Lang , Howell / Fulton
Recursive blocked layouts for packed formats,
Gustavson / Kågström / Elmroth / Jonsson/
New Most accurate algorithm providing standard
speed
Iterative refinement for Axb, least squares
Assume availability of Extra Precise BLAS
(Li/Hida/)
www.netlib.org/blas/blast-forum/
Jacobi SVD Drmac/Veselic
Condition estimates for (almost) everything
(ongoing)

8
Goal 1 Better Numerics

Fastest algorithm providing standard backward
stability
MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques /
Willems
Up to 10x faster HQR Byers / Mathias / Braman
Extensions to QZ Kågström / Kressner
Faster Hessenberg, tridiagonal, bidiagonal
reductions van de Geijn, Bischof /
Lang , Howell / Fulton
Recursive blocked layouts for packed formats,
Gustavson / Kågström / Elmroth / Jonsson/
New Most accurate algorithm providing standard
speed
Iterative refinement for Axb, least squares
Assume availability of Extra Precise BLAS
(Li/Hida/)
www.netlib.org/blas/blast-forum/
Jacobi SVD Drmac/Veselic
Condition estimates for (almost) everything
(ongoing)
What is not fast or accurate enough?

9
What goes into Sca/LAPACK?
For all linear algebra problems
For all matrix structures
For all data types
For all architectures and networks
For all programming interfaces
Produce best algorithm(s) w.r.t.
performance and accuracy (including condition
estimates, etc)
Need to automate and prioritize!
10
Goal 2 Expanded Content

Make content of ScaLAPACK mirror LAPACK as much
as possible
Full Automation would be nice, not yet robust,
general enough
Telescoping languages, Bernoulli, Rose, FLAME,
New functions (examples)
Updating / downdating of factorizations Stewart,
Langou
More generalized SVDs Bai , Wang
More generalized Sylvester/Lyapunov eqns
Kågström, Jonsson, Granat
Structured eigenproblems
O(n2) version of roots(p) Gu, Chandrasekaran,
Zhu et al
Selected matrix polynomials Mehrmann
How should we prioritize missing functions?

11
Goal 3 Automate Performance Tuning

Not just BLAS
1300 calls to ILAENV() to get block sizes, etc.
Never been systematically tuned
Extend automatic tuning techniques of ATLAS, etc.
to these other parameters
Automation important as architectures evolve
Convert ScaLAPACK data layouts on the fly
How important is peak performance?

12
Goal 4 Improved Ease of Use

Which do you prefer?

A \ B
CALL PDGESV( N ,NRHS, A, IA, JA, DESCA, IPIV, B,
IB, JB, DESCB, INFO)
CALL PDGESVX( FACT, TRANS, N ,NRHS, A, IA, JA,
DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C,
B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR,
BERR, WORK, LWORK, IWORK, LIWORK, INFO)
13
Goal 4 Improved Ease of Use, Software
Engineering (1)

Development versus Research
Development practical approach to produce useful
code
Research If we could start over and do it right,
how would we?
Life after F77?
Fortran95, C, C, Java, Matlab, Python,
Easy interfaces vs access to details
No universal agreement across systems on easiest
interface
Leave decision to higher level packages
Keep expert driver / simple driver /
computational routines
Conclusion Subset of F95 for core wrappers for
drivers
What subset?
Recursion, for new data structures
Modules, to produce multiple precision versions
Environmental enquiries, to replace xLAMCH
Wrappers for Fortran95, Java, Matlab, Python,
even for CLAPACK
Automatic memory allocation of workspace

14
Goal 4 Improved Ease of Use, Software
Engineering (2)

Why not full F95 for core?
Would make interfacing to other languages and
packages harder
Some users want control over memory allocation
Why not C for core?
High cost/benefit ratio for full rewrite
Performance
Automation would be nice
Use Babel/SIDL to produce native looking
interfaces when possible

15
Precisions beyond double (1)

Range of designs possible
Just run in quad (or some other fixed precision)
Support codes like

bits 32 Repeat bits 2 bits
Solve(A, b, x, error_bound, bits) Until
error_bound lt tol
16
Precisions beyond double (2)

Easiest approach fixed precision
Use F95 modules to produce any precision on
request
Could use QD, ARPREC, GMP,
Keep current memory allocation (twiddle GMP)
Next easier approach maximum precision
Build maximum allowable precision on request
Pass in precision parameter, up to this amount
More flexible (and difficult) approach
Choose any precision at run time
Dynamically allocate all variables
Most aggressive approach
New algorithms that minimize work to get desired
prec.
What do users want?
Compatibility with symbolic manipulation systems?

17
Goal 4 Improved Ease of Use, Software
Engineering (3)

Research Issues
May or may not impact development
How to map multiple software layers to emerging
architectural layers?
Are emerging HPCS languages better?
How much can we automate? Do we keep having to
write Gaussian elimination over and over again?
Statistical modeling to limit performance tuning
costs, improve use of shared clusters

18
Goal 5Better Maintenance and Support

Website for user feedback and requests
New developer and discussion forums
URL www.netlib.org/lapack-dev
Includes NSF proposal
Version control and bug tracking system
Automatic Build and Test environment
Wide variety of supported platforms
Cooperation with vendors
Anything else desired?

19
Goal 6 Involve the Community

To help identify priorities
More interesting tasks than we are funded to do
See www.netlib.org/lapack-dev for list
To help identify promising algorithms
What have we missed?
To help do the work
Bug reports, provide fixes
Again, more tasks than we are funded to do
Already happening thank you!
We retain final decisions on content
Anything else?

20
Some Highlights

Putting more of LAPACK into ScaLAPACK
ScaLAPACK performance on 1D vs 2D grids
MRRR Holy Grail algorithm for symmetric EVD
Iterative Refinement for Axb
O(n2) polynomial root finder
Generalized SVD

21
Missing Drivers in Sca/LAPACK
LAPACK ScaLAPACK
Linear Equations LU Cholesky LDLT xGESV xPOSV xSYSV PxGESV PxPOSV missing
Least Squares (LS) QR QRpivot SVD/QR SVD/DC SVD/MRRR QR iterative refine. xGELS xGELSY xGELSS xGELSD missing missing PxGELS missing driver missing driver missing (intent) missing missing
Generalized LS LS equality constr. Generalized LM Above Iterative ref. xGGLSE xGGGLM missing missing missing missing
22
More missing drivers
LAPACK ScaLAPACK
Symmetric EVD QR / BisectionInvit DC MRRR xSYEV / X xSYEVD xSYEVR PxSYEV / X missing (intent) missing
Nonsymmetric EVD Schur form Vectors too xGEES / X xGEEV /X missing driver missing driver
SVD QR DC MRRR Jacobi xGESVD xGESDD missing missing PxGESVD missing (intent) missing missing
Generalized Symmetric EVD QR / BisectionInvit DC MRRR xSYGV / X xSYGVD missing PxSYGV / X missing (intent) missing
Generalized Nonsymmetric EVD Schur form Vectors too xGGES / X xGGEV / X missing missing
Generalized SVD Kogbetliantz MRRR xGGSVD missing missing (intent) missing
23
Missing matrix types in ScaLAPACK

Symmetric, Hermitian, triangular
Band, Packed
Positive Definite
Packed
Orthogonal, Unitary
Packed

24
Speedups for using 2D processor grid range from
2x to 8x
Times obtained on 60 processors, Dual AMD
Opteron 1.4GHz Cluster w/Myrinet Interconnect 2GB
Memory
25
Cost of redistributing matrix to optimal layout
is small
Times obtained on 60 processors, Dual AMD
Opteron 1.4GHz Cluster w/Myrinet Interconnect 2GB
Memory
26
MRRR Algorithm for eig(tridiagonal) and
svd(bidiagonal)

Multiple Relatively Robust Representation
1999 Householder Award honorable mention for
Dhillon
O(nk) flops to find k eigenvalues/vectors of nxn
tridiagonal matrix (similar for SVD)
Minimum possible!
Naturally parallelizable
Accurate
Small residuals Txi li xi O(n e)
Orthogonal eigenvectors xiTxj O(n e)
Hence nickname Holy Grail
2 versions
LAPACK 3.0 large error on hard cases
Next release fixed!
How should we tradeoff speed and accuracy?

27
Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
28
Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
29
Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
30
Timing of Eigensolvers(only matrices where time
gt .1 sec)
31
Accuracy Results (old vs new Grail)
QQT I / (n e )
maxi Tqi li qi / ( n e )
32
Accuracy Results (Grail vs QR vs DC)
QQT I / (n e )
maxi Tqi li qi / ( n e )
33
More Accurate Solve Axb
Conventional Gaussian Elimination

e
e n1/2 2-24
34
Whats new?

Need extra precision (beyond double)
Part of new BLAS standard
Cost O(n2) extra per right-hand-side, vs O(n3)
to factor
Get tiny componentwise bounds too
Error in xi small compared to xi, not just
maxj xj
Guarantees based on condition number estimates
No bad bounds in 6.2M tests
Different condition number for componentwise
bounds
Traditional iterative refinement can fail
Only get matrix close to singular message when
answer wrong?
Extends to least squares
Demmel, Kahan, Hida, Riedy, X. Li, Sarkisyan,
LAPACK Working Note 165

35
Can condition estimators lie?

Yes, but rarely, unless they cost as much as
matrix multiply cost of LU factorization
Demmel/Diament/Malajovich (FCM2001)
But what if matrix multiply costs O(n2)?
Cohn/Umans/Kleinberg (FOCS 2003/5)

36
New algorithm for roots(p)

To find roots of polynomial p
Roots(p) does eig(C(p))
Costs O(n3), stable, reliable
O(n2) Alternatives
Newton, Jenkins-Traub, Laguerre,
Stable? Reliable?
New Exploit semiseparable structure of C(p)
Low rank of any submatrix of upper triangle of
C(p) preserved under QR iteration
Complexity drops from O(n3) to O(n2)
Related work Gemignani, Bini, Pan, et al
Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin
Xia, David Bindel, David Garmire, Jim Demmel

37
Properties of new roots(p)

First (?) algorithm that
Is O(n2)
Is backward stable, in the matrix sense
Is backward stable, in the sense that the
computed roots are the exact roots of a slightly
perturbed input polynomial
Depends on balancing scaling roots by a
constant ?
Still need to automate choice of ?
Byers, Mathias, Braman
Tisseur, Higham, Mackey

38
New GSVD Algorithm Timing Comparisons

1.0GHz Itanium2 (2GB RAM)
Intel's Math Kernel Library 7.2.1 (incl. BLAS,
LAPACK)

Bai et al, UC Davis
PSVD, CSD on the way
39
Conclusions

Lots to do in Dense Linear Algebra
New numerical algorithms
Continuing architectural challenges
Parallelism, performance tuning
Grant support, but success depends on
contributions from community

40
Extra Slides
41
Downa Dating
With apologies to Robert Burns Nick Higham

Should some equations be forgot
when overdetermined?
Should some equations be forgot
using hyperbolic sines?
With hyperbolic sines, my dear
with hyperbolic sines,
Well hope to get some boundedness
with hyperbolic sines.

42
New GSVD Algorithm (XGGQSV)

UT A Z ?a( 0 R ), VT B Z ?b( 0 R ) A M x
N, B P xN
Modified Van Loan's method
(1) Pre-processing reveal rank of ( AT BT )T
(2) Split QR reduce two upper triangular into
one
(3) CSD Cosine-Sine Decomposition
(4) Post-processing assemble resulted matrices
Workspace ??(max(M, N, P)) 5L2 where L
rank(B)
Outperforms current XGGSVD in LAPACK

43
Profile of SGGQSV

1.0GHz Itanium2 (2GB RAM)
Intel's Math Kernel Library 7.2.1 (incl. BLAS,
LAPACK)

44
Related New Routines

CSD (XORCSD)
UTQ1Z ?a, VTQ2Z ?b
Based on Von Loan's method (1985)
Dominated by the cost of SVD
Workspace ???(Max(PM, N))
PSVD (XGGPSV)
UTABV ?
Based on work by Golub, Solna, Van Dooren (2000)
Workspace ???(Max(M, P, N))

45
Benchmark Details

AMD 1.2 GHz Athlon, 2GB mem, Redhat Intel
compiler
Compute all eigenvalues and eigenvectors of a
symmetric tridiagonal matrix T
Codes compared
qr QR iteration from LAPACK dsteqr
dc Cuppens DivideConquer from LAPACK dstedc
gr New implementation of MRRR algorithm
(Grail)
ogr MRRR from LAPACK 3.0 dstegr (old Grail)

46
Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec
47
Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
48
More Accurate Solve Axb

Old idea Use Newtons method on f(x) Ax-b
On a linear system?
Roundoff in Ax-b makes it interesting
(nonlinear)
Iterative refinement
Snyder, Wilkinson, Moler, Skeel,

Repeat r Ax-b compute with extra
precision Solve Ad r using LU
factorization of A Update x x d Until
accurate enough or no progress
49
Speedups from 1.33x to 11.5x
Times obtained on 60 processors, Dual AMD
Opteron 1.4GHz Cluster w/Myrinet Interconnect 2GB
Memory
50
Times obtained on 60 processors, Dual Opteron
1.4GHz, 64-bit machine 2GB Memory, Myrinet
Interconnect
51
Speedups from 1.8x to 7.5x
Times obtained on 60 processors, Dual Intel Xeon
2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB
Memory
52
Times obtained on 60 processors, Dual Intel Xeon
2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB
Memory
53
Speedups from 2.2x to 6x
Times obtained on 60 processors, Dual Intel Xeon
2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB
Memory
54
Times obtained on 60 processors, Dual Intel Xeon
2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB
Memory
55
Speedups from 2.2x to 2.8x
Times obtained on 60 processors, IBM SP RS/6000,
16 shared memory (16 Gbytes) processors per node,
4 nodes connected with high-bandwidth low-latency
switching network Block size 64
56
Times obtained on 60 processors, IBM SP RS/6000,
16 shared memory (16 Gbytes) processors per node,
4 nodes connected with high-bandwidth low-latency
switching network Block size 64
57
The Future of LAPACK and ScaLAPACKwww.cs.berkel
ey.edu/demmel/Sca-LAPACK-Proposal.pdf

Jim Demmel
UC Berkeley
Householder XVI

58
OO-like interfaces (VE)

Aim native looking interfaces in modern
languages (C, F90, Java, Python)
Overloading to have identical interfaces for all
data types (precision/storage)
Overloading to omit certain parameters (stride,
transpose,)
Expose data only when necessary (automatic
allocation of temporaries, pivoting
permutation,.)
Mechanism Use Babel/SIDL to interface to
existing F77 code base

59
Example of HLL interface (VE)
// C example LapackPDTridiagonalMatrix A
LapackPDTridiagonalMatrix_create() x
(double) malloc(SIZEsizeof(double)) y
(double) malloc(SIZEsizeof(double))
A.MatVec(x,y) // or A.MatVec(MatTranspose,x,y)
A.Factor() A.Solve(y)
! F90 example ! Create matrix from user
array allocate(data_array(20,25)) ! Fill in the
array. call CreateWithArray(A,data_array) !
Retrieve the array of a (factored) matrix call
GetArray(A,mat) mat_array gt matd_data print
,'pivots',mat_array(0,0),mat_array(1,1)
Lots of details omitted here!

Write a Comment

User Comments (0)