Title: A Look at Library Software for Linear Algebra: Past, Present, and Future
1. A Look at Library Software for Linear Algebra: Past, Present, and Future
- Jack Dongarra
- University of Tennessee
- and
- Oak Ridge National Laboratory
2. 1960s
- 1961
  - IBM Stretch delivered to LANL
- 1962
  - Virtual memory from U Manchester, T. Kilburn
- 1964
  - AEC urges manufacturers to look at "radical new" machine structures
    - This leads to the CDC Star-100, TI ASC, and Illiac-IV
  - CDC 6600, S. Cray's design
    - Functional parallelism, leading to RISC
    - (3 times faster than the IBM Stretch)
- 1965
  - DEC ships first PDP-8
  - IBM ships 360
  - First CS PhD, U of Penn., Richard Wexelblat
  - Wilkinson's The Algebraic Eigenvalue Problem published
- 1966
  - DARPA contract with U of I to build the ILLIAC IV
  - Fortran 66
- 1967
  - Forsythe & Moler published
  - Fortran, Algol and PL/I
- 1969
  - ARPANET work begins
    - 4 computers connected: UC-SB, UCLA, SRI and U of Utah
  - CDC 7600
    - Pipelined architecture, 3.1 Mflop/s
  - Unix developed by Thompson and Ritchie
3. Wilkinson-Reinsch Handbook
- "In general we have aimed to include only algorithms that provide something approaching an optimal solution; this may be from the point of view of generality, elegance, speed, or economy of storage."
- Part 1: Linear Systems
- Part 2: Algebraic Eigenvalue Problem
- Before publishing the Handbook
  - Virginia Klema and others at ANL began translation of the Algol procedures into Fortran subroutines
4. 1970s
- 1970
  - NATS Project conceived
    - Concept of certified math software and the process involved with its production
  - Purdue Math Software Symposium
  - NAG project begins
- 1971
  - Handbook for Automatic Computation, Vol. II
    - Landmark in the development of numerical algorithms and software
    - Basis for a number of software projects: EISPACK, a number of linear algebra routines in IMSL, and the F chapters of NAG
  - IBM 370/195
    - Pipelined architecture, out-of-order execution, 2.5 Mflop/s
  - Intel 4004, 60 Kop/s
  - IMSL founded
- 1972
  - Cray Research founded
  - Intel 8008
  - 1/4-size Illiac IV installed at NASA Ames
    - 15 Mflop/s achieved, 64 processors
  - Paper by S. Reddaway on massive bit-level parallelism
  - EISPACK available
    - 150 installations; EISPACK Users' Guide
    - 5 versions (IBM, CDC, Univac, Honeywell and PDP), distributed free via the Argonne Code Center
  - M. Flynn publishes paper on architectural taxonomy
  - ARPANet
    - 37 computers connected
- 1973
  - BLAS report in SIGNUM
    - Lawson, Hanson, Krogh
5. NATS Project
- National Activity for Testing Software
  - (NSF, Argonne, Texas and Stanford)
- Project to explore the problems of testing, certifying, disseminating and maintaining quality math software
- First EISPACK, later FUNPACK
- Influenced other PACKs
  - ELLPACK, FISHPACK, ITPACK, MINPACK, PDEPACK, QUADPACK, SPARSPAK, ROSEPACK, TOOLPACK, TESTPACK, LINPACK, LAPACK, ScaLAPACK, ...
- Key attributes of math software
  - reliability, robustness, structure, usability, and validity
6. EISPACK
- Algol versions of the algorithms rewritten in Fortran
- Restructured to avoid underflow
- Check users' claims, such as whether a matrix really is positive definite
- Programs formatted in a unified fashion
  - Burt Garbow
- Field test sites
- 1971: U of Michigan Summer Conference
  - JHW algorithms, CBM software
- 1972: Software released via the Argonne Code Center
  - 5 versions
- Software certified in the sense that reports of poor or incorrect performance would gain the immediate attention of the developers
7. EISPACK
- EISPAC Control Program, Boyle
  - One interface allowed access to the whole package on IBM machines
- Argonne's interactive RESCUE system
  - Allowed us to easily manipulate tens of thousands of lines
- Generalizer/Selector
  - Generalizer converts the IBM version to a general form
  - Selector extracts the appropriate version
- 1976: Extensions to the package ready
  - EISPACK II
8. 1970s continued
- 1974
  - Intel 8080
  - Level 1 BLAS activity started by the community, Purdue 5/74
  - LINPACK meeting in summer at ANL
- 1975
  - First issue of TOMS
  - Second LINPACK meeting at ANL
    - Lay the groundwork and hammer out what was and was not to be included in the package; proposal submitted to the NSF
- 1976
  - Cray 1, the model for vector computing
    - 4 Mflop/s in '79, 12 Mflop/s in '83, 27 Mflop/s in '89
  - LINPACK work during summer at ANL
  - EISPACK: second edition of the Users' Guide
- 1977
  - DEC VAX 11/780 super mini
    - 0.14 Mflop/s, 4.3 GB virtual memory
  - LINPACK test software developed and sent out
  - EISPACK second release
  - IEEE arithmetic standard meetings
    - Paper by Palmer on the Intel standard for floating point
- 1978
  - Fortran 77
  - LINPACK software released
    - Sent to NESC and IMSL for distribution
- 1979
  - John Cocke designs the 801
  - ICL DAP delivered to QMC, London
  - Level 1 BLAS published/released
  - LINPACK Users' Guide
    - Appendix: 17 machines, PDP-10 to Cray-1
9. Basic Linear Algebra Subprograms
- A design tool for software in numerical linear algebra
- Improve readability and aid documentation
- Aid modularity and maintenance, and improve robustness of software calling the BLAS
- Improve portability, without sacrificing efficiency, through standardization
- BLAS 1973-1977
  - Consensus on:
    - Names
    - Calling sequences
    - Functional descriptions
  - Low-level linear algebra operations
- Success results from:
  - Extensive public involvement
  - Careful consideration of implications
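The flavor of a Level 1 BLAS operation can be sketched in C. This is a hypothetical illustration, not the reference Fortran interface: the real routines (e.g. DAXPY) also take stride arguments for both vectors.

```c
#include <stddef.h>

/* Sketch of the Level 1 BLAS AXPY operation: y <- alpha*x + y.
   Unit stride is assumed here for clarity. */
void axpy(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

Standardizing even an operation this small meant every library above it could share one tuned, documented kernel per machine.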
10. CRAY and VAX
- "VAXination" of groups and departments
- Cray's introduction of vector computing
- Both had a significant impact on scientific computing
11. LINPACK
- June 1974, ANL
  - Jim Pool's meeting
- February 1975, ANL
  - Groundwork for the project
- January 1976
  - Funded by NSF and DOE: ANL/UNM/UM/UCSD
- Summer 1976 - Summer 1977
- Fall 1977
  - Software to 26 test sites
- December 1978
  - Software released via NESC and IMSL
12. LINPACK
- Research into the mechanics of software production
- Provide a yardstick against which future software would be measured
- Produce a library used both by people and by those who wished to modify/extend the software to handle special problems
- Hope it would be used in the classroom
- Machine independent and efficient
- No mixed-mode arithmetic
13. LINPACK Efficiency
- Condition estimator
- Inner loops via the BLAS
- Column access
- Unrolled loops (done for the BLAS)
- BABE algorithm for tridiagonal matrices
- TAMPR system allowed easy generation of versions
- LINPACK Benchmark
  - Today reports machines from the Cray T90 to the Palm Pilot
  - (1.0 Gflop/s to 1.4 Kflop/s)
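Loop unrolling of the kind applied inside the BLAS can be sketched in C. The function name and unrolling depth are illustrative; real implementations tuned the depth per machine.

```c
/* 4-way unrolled AXPY loop: the unrolled body amortizes loop overhead
   and exposes independent operations to the hardware.  A cleanup loop
   handles the leftover iterations when n is not a multiple of 4. */
void daxpy_unrolled(int n, double a, const double *x, double *y) {
    int i;
    int m = n - (n % 4);
    for (i = 0; i < m; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)   /* cleanup */
        y[i] += a * x[i];
}
```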
14. 1980s: Vector Computing (Parallel Processing)
- 1980
  - Total computers in use in the US exceeds 1 M
  - CDC introduces the Cyber 205
- 1981
  - IBM introduces the PC
    - Intel 8088/DOS
  - BBN Butterfly delivered
  - FPS delivers the FPS-164
    - Start of the mini-supercomputer market
  - SGI founded by Jim Clark and others
  - Loop unrolling at the outer level for data locality and parallelism
    - Amounts to matrix-vector operations
  - Cuppen's method for the symmetric eigenvalue problem (divide and conquer) published
  - Talk at Oxford Gatlinburg (1980)
- 1982
  - Illiac IV decommissioned
  - Steve Chen's group at Cray produces the X-MP
  - First Denelcor HEP installed (0.21 Mflop/s)
  - Sun Microsystems, Convex and Alliant founded
- 1983
  - Total computers in use in the US exceed 10 M
  - DARPA starts the Strategic Computing Initiative
    - Helps fund Thinking Machines, BBN, WARP
  - Cosmic Cube hypercube running at Caltech
  - John Palmer, after seeing the Caltech machine, leaves Intel to found Ncube
  - Encore, Sequent, TMC, SCS, Myrias founded
  - Cray 2 introduced
  - NEC SX-1 and SX-2; Fujitsu ships the VP-200
  - ETA Systems spun off from CDC
  - Golub & Van Loan published
15. 1980s continued
- 1984
  - NSFNET: 5000 computers, 56 Kb/s lines
  - MathWorks founded
  - EISPACK third release
  - Netlib begins 1/3/84
  - Level 2 BLAS activity started
    - Gatlinburg (Waterloo), Purdue, SIAM
  - Intel Scientific Computers started by J. Rattner
    - Produce a commercial hypercube
  - Cray X-MP 1 processor, 21 Mflop/s Linpack
  - Multiflow founded by J. Fisher: VLIW architecture
  - Apple introduces the Mac; IBM introduces the PC AT
  - IJK paper
- 1985
  - IEEE Standard 754 for floating point
  - IBM delivers 3090 vector: 16 Mflop/s Linpack, 138 peak
  - TMC demos the CM-1 to DARPA
  - Intel produces the first iPSC/1 hypercube
    - 80286s connected via Ethernet controllers
  - Fujitsu VP-400, NEC SX-2, Cray 2, Convex C1
  - Ncube/10: 0.1 Mflop/s Linpack, 1 processor
  - FPS-264: 5.9 Mflop/s Linpack, 38 peak
  - IBM begins the RP3 project
  - Stellar (Poduska), Ardent (Michels), Supertek Computer founded
  - Denelcor closes doors
16. 1980s (Lost Decade for Parallel Software)
- 1986
  - # of computers in the US exceeds 30 M
  - TMC ships the CM-1: 64K 1-bit processors
  - Cray X-MP
  - IBM and MIPS release first RISC workstations
- 1987
  - ETA Systems family of supercomputers
  - Sun Microsystems introduces its first RISC workstation
  - IBM invests in Steve Chen's SSI
  - Cray Y-MP
  - First NA-DIGEST
  - Level 3 BLAS work begun
  - LAPACK prospectus: Development of a LA Library for HPC
- 1988
  - AMT delivers the first re-engineered DAP
  - Intel produces the iPSC/2
  - Stellar and Ardent begin delivering single-user graphics workstations
  - Level 2 BLAS paper published
- 1989
  - # of computers in the US > 50 M
  - Stellar and Ardent merge, forming Stardent
  - S. Cray leaves Cray Research to form Cray Computer
  - Ncube 2nd generation machine
  - ETA out of business
  - Intel 80486 and i860: 1 M transistors
    - i860: RISC, 64-bit floating point
17. EISPACK 3 and BLAS 2 & 3
- Machine independence for EISPACK
  - Reduce the possibility of overflow and underflow
  - Mods to the Algol from S. Hammarling
  - Rewrite reductions to tridiagonal form to involve sequential access to memory
  - Official double precision version
  - Inverse iteration routines modified to reduce the size for reorthogonalization
- BLAS (Level 1) vector operations involve too much data movement
- Community effort to define extensions
  - Matrix-vector ops
  - Matrix-matrix ops
18. Netlib: Mathematical Software and Data
- Began in 1985
  - JD and Eric Grosse, AT&T Bell Labs
- Motivated by the need for cost-effective, timely distribution of high-quality mathematical software to the community
- Designed to send requested items by return electronic mail
- Automatic mechanism for electronic dissemination of freely available software
- Still in use and growing
  - Mirrored at 9 sites around the world
  - Moderated collection / distributed maintenance
- NA-DIGEST and NA-Net
  - Gene Golub, Mark Kent and Cleve Moler
19. Netlib Growth
- Just over 6,000 requests in 1985; over 29,000,000 total
- Over 9 million hits in 1997; 5.4 million so far in 1998
- LAPACK best seller: 1.6 M hits
20. 1990s
- 1990
  - Internet
  - World Wide Web
  - Motorola introduces the 68040
  - NEC ships the SX-3, the first parallel Japanese vector supercomputer
  - IBM announces the RS/6000 family
    - Has FMA instruction
  - Intel hypercube based on the i860 chip
    - 128 processors
  - Alliant delivers the FX/2800, based on the i860
  - Fujitsu VP-2600
  - PVM project started
  - Level 3 BLAS published
- 1991
  - Stardent to sell business and close
  - Cray C-90
  - Kendall Square Research delivers the 32-processor KSR-1
  - TMC produces the CM-200 and announces the CM-5 MIMD computer
  - DEC announces the Alpha
  - TMC produces the first CM-5
  - Fortran 90
  - Workshop to consider a message passing standard; beginnings of MPI
    - Community effort
  - Xnetlib running
- 1992
  - LAPACK software released; Users' Guide published
22. Parallel Processing Comes of Age
- "There are three rules for programming parallel computers. We just don't know what they are yet." -- Gary Montry
- "Embarrassingly Parallel", Cleve Moler
- "Humiliatingly Parallel"
23. Memory Hierarchy and LAPACK
- ijk implementations
  - Affect the order in which data is referenced; some are better at keeping data in the higher levels of the memory hierarchy
  - Applies to matrix multiply and reductions to condensed form
  - May do slightly more flops
  - Up to 3 times faster
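The effect of loop ordering can be seen in a small C sketch of matrix multiply (function names are illustrative). Both orderings compute the same C += A*B, but with row-major storage the ikj order streams contiguously through rows of B and C in its inner loop, so data stays in the faster levels of the hierarchy longer.

```c
/* ijk order: inner loop strides down a column of B
   (poor locality in row-major storage). */
void matmul_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

/* ikj order: inner loop walks rows of B and C contiguously. */
void matmul_ikj(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```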
24. Why Higher Level BLAS?
- Can only do arithmetic on data at the top of the hierarchy
- Higher level BLAS let us do this
- Development of blocked algorithms important for performance
25. History of Block-Partitioned Algorithms
- Ideas not new
- Early algorithms involved use of a small main memory, with tapes as secondary storage
- Recent work centers on use of vector registers, level 1 and 2 cache, main memory, and out-of-core memory
26. LAPACK
- Linear algebra library in Fortran 77
  - Solution of systems of equations
  - Solution of eigenvalue problems
- Combines algorithms from LINPACK and EISPACK into a single package
- Block algorithms
  - Efficient on a wide range of computers: RISC, vector, SMPs
- User interface similar to LINPACK
  - Single, Double, Complex, Double Complex
- Built on the Level 1, 2, and 3 BLAS
- Runs on everything from the HP-48G to the CRAY T-90
(Figure: performance in Mflop/s)
28. 1990s continued
- 1993
  - Intel Pentium systems start to ship
  - ScaLAPACK prototype software released
    - First portable library for distributed memory machines
    - Intel, TMC and workstations using PVM
  - PVM 3.0 available
- 1994
  - MPI-1 finished
- 1995
  - Templates project
- 1996
  - Internet: 34 M users
  - Nintendo 64
    - More computing power than a Cray 1, and much, much better graphics
- 1997
  - MPI-2 finished
  - Fortran 95
- 1998
  - Issues of parallel and numerical stability
  - Divide time
  - DSM architectures
  - "New" algorithms
    - Chaotic iteration
    - Sparse LU without pivoting
    - Pipelined HQR
    - Graph partitioning
    - Algorithmic bombardment
29. Templates Project
- Iterative methods for large sparse systems
- Communicate state-of-the-art algorithms to the HPC community
- Subtle algorithmic issues addressed, e.g. convergence, preconditioners, data structures
- Performance and parallelism considerations
- Gave computational scientists algorithms in the form they wanted
30. ScaLAPACK
- Library of software for dense and banded matrices
  - Sparse direct being developed
- Distributed memory, message passing
  - PVM and MPI
- MIMD computers, networks of workstations, and clumps of SMPs
- SPMD Fortran 77 with object-based design
- Built on various modules
  - PBLAS (BLACS and BLAS)
  - PVM, MPI, IBM SP, CRI T3, Intel, TMC
  - Provides the right level of notation
31. High-Performance Computing Directions
- Move toward shared memory
  - SMPs and distributed shared memory
  - Shared address space with deep memory hierarchy
- Clustering of shared memory machines for scalability
- Emergence of PC commodity systems
  - Pentium based, NT or Linux driven
  - At UTK, a cluster of 14 (dual) Pentiums: 7.2 Gflop/s
- Efficiency of message passing and data parallel programming
  - Helped by standards efforts such as PVM, MPI and HPF
- Complementing supercomputing with metacomputing -> the Computational Grid
32. Heterogeneous Computing
- Heterogeneity introduces new bugs in parallel code
- Slightly different floating point arithmetic can make data-dependent branches go different ways when we expect identical behavior
- A correct algorithm on a network of identical workstations may fail if a slightly different machine is introduced
- Some bugs are easy to fix (compare s < tol on 1 processor and broadcast the result)
- Some are hard to fix (handling denorms; getting the same answer independent of the number of processors)
33. Java for Numerical Computations?
- Java likely to be a dominant language
- Provides machine independent code
- C-like language
  - No pointers, gotos, overloading of arithmetic ops, or memory deallocation
- Portability achieved via an abstract machine
- Java is a convenient user interface builder which allows one to quickly develop customized interfaces
34. Network Enabled Servers
- Allow networked resources to be integrated into the desktop
- Many hosts co-existing in a loose confederation, tied together with high-speed links
- Users have the illusion of a very powerful computer on the desk
- Locate and deliver software or solutions to the user in a directly usable and conventional form
- Part of the motivation: software maintenance
35. Future: Petaflops (10^15 fl pt ops/s)
- Dynamic redistribution of workload
- New languages and constructs
- Role of numerical libraries
- Algorithm adaptation to hardware failure
- A Pflop for 1 second ~ a typical workstation computing for 1 year
- From an algorithmic standpoint
  - concurrency
  - data locality
  - latency & synchronization
  - floating point accuracy
- May be feasible and affordable by the year 2010
36. Summary
- As a community we have a lot to be proud of in terms of the algorithms and software we have produced
  - Generality, elegance, speed, and economy of storage
- Software still being used, in many cases 30 years after its release
38. End of talk
39. BLAS Were Intended To
- Be a design tool for the development of software in numerical linear algebra
- Improve readability and aid documentation
- Aid modularity and maintenance, and improve robustness of software calling the BLAS
- Promote efficiency
- Improve portability, without sacrificing efficiency, through standardization
40. LINPACK Benchmark
- Appeared in the Users' Guide
- Dense system of linear equations
  - LU decomposition via Gaussian elimination
- Users' Guide appendix reported AXPY operation time
- Has a life of its own
- Today reports machines from the Cray T90 to the Palm Pilot
  - (1.0 Gflop/s to 1.4 Kflop/s)
41. Future Research Directions and Challenges
Prediction is hard.
42. ScaLAPACK Structure
(Figure: Global vs. Local layers)
43. Metacomputing Objectives
- Flexibility and extensibility
- Site autonomy
- Scalable architecture
- Single global namespace
- Easy-to-use, seamless environment
- High performance through parallelism
- Security
- Management / exploitation of heterogeneity
- Multilanguage interoperability
- Fault-tolerance
44. Grid Based Computations
- Long running computations
- Grid based computation
- Network of Workstations
- Fault tolerance
- Reproducibility of solution
- Auditability of computation
45. Heterogeneous Conclusions
- Defensive programming
- Machine parameters
- Communication and representation between
processors - Controlling processor
- Additional communication
- Testing strategies
46. Metacomputing Issues
- Logically only one system
  - System should determine where the computation executes
- Fault-tolerance should be transparent
  - Applications freed from the mechanics of distributed programming
- Self-configuring, with new resources added automatically
- Just-in-time binding of software, data, and hardware
- User shouldn't have to decide where the computation is performed
- User should be able to walk up to a machine and have their program and data follow them
47. Grid Computing and Numerical Computations
- The Computational Grid raises certain challenges from a numerical standpoint
- Some users would like/want/expect/demand the same results they obtained yesterday the next time they run the application
- How can we guarantee reproducibility?
48. Heterogeneous Computing
- Software intended to be used in this context
- Machine precision and other machine-specific parameters
- Communication of floating point numbers between processors
- Repeatability
  - Run-to-run differences, e.g. the order in which quantities are summed
- Coherency
  - Within-run differences, e.g. arithmetic varies from processor to processor
- Iterative convergence across clusters of processors
- Defensive programming required
- Important for the Computational Grid
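A minimal C sketch of why summation order matters: floating point addition is not associative, so two processors that accumulate the same values in different orders can disagree.

```c
/* Floating point addition is not associative.  Under IEEE double
   arithmetic the two groupings below can give different answers,
   which is exactly the kind of run-to-run difference a
   heterogeneous (or differently scheduled) system exposes. */
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }
```

With a = 1e16, b = -1e16, c = 1.0, sum_left gives 1.0 while sum_right gives 0.0, because c is absorbed by rounding when added to b first.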
49. Machine Precision
- Smallest floating point number u such that 1 + u > 1
- Used in
  - Convergence tests: has an iteration converged?
  - Tolerance tests: is a matrix numerically singular?
  - Error analysis: how does floating point arithmetic affect results?
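The classic probe for this quantity can be sketched in C. This is a simplification of what LAPACK's DLAMCH does far more carefully; the function name is illustrative.

```c
/* Halve u until 1 + u/2 is no longer distinguishable from 1.
   On IEEE double hardware this terminates with u = 2^-52.
   `volatile` discourages the compiler from keeping the comparison
   in extended-precision registers. */
double mach_eps(void) {
    volatile double u = 1.0;
    while (1.0 + u / 2.0 > 1.0)
        u /= 2.0;
    return u;
}
```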
50. Heterogeneous Machine Precision
- Return the maximum relative machine precision over all the processors
  - LAPACK -> DLAMCH
  - ScaLAPACK -> PDLAMCH
- Note that even on IEEE machines, the machine precision can vary slightly. For double precision, eps = 2^-53 + q, where q = 2^-105 or q = 2^-64 + 2^-105.
51. Grid Enabled Math Software
- Predictability and robustness of accuracy and performance
- Run-time resource management and algorithm selection
- Support for a multiplicity of programming environments
- Reproducibility, fault tolerance, and auditability of the computations
- New algorithmic techniques for latency tolerant applications
53. Automatically Tuned Numerical Software
- Automatic generation of computational kernels for RISC architectures
- A package that adapts itself to differing architectures via code generation coupled with timing
- Code generator takes about 1-2 hours to run
  - Done once for a new architecture
- Written in ANSI C (generates C)
- Today: Level 2 and 3 BLAS
- Extensions to higher level operations
  - SMPs
  - SGI/Vector
  - DSP
  - FFTs
54. Why Such a System Is Needed
- BLAS require many man-hours per platform
  - Only done if the financial incentive is there
  - Many platforms will never have an optimal version
  - Lags behind hardware
  - May not be affordable by everyone
- Allows for portably optimal codes
- Package contains
  - Code generators
  - Sophisticated timers
  - Robust search routines
55. Future Research Directions and Challenges
Prediction is hard.
56. Future Research Directions and Challenges
"Prediction is hard. Especially the future . . ." -- Yogi Berra, philosopher in the Baseball Hall of Fame
- Numerical behavior in a heterogeneous environment
- Efficient software for core routines
- Fault tolerant aspects
- "New" algorithms
  - Chaotic iteration
  - Divide and conquer
  - Sparse LU without pivoting
  - Pipelined HQR
  - Graph partitioning
  - Algorithmic bombardment
57. Research Directions
- Grid enabled strategies
- Parameterizable libraries
- Annotated libraries
- Hierarchical algorithm libraries
- A new division of labor among compiler writers, library writers, algorithm developers and application developers will emerge
60. Motivation for NetSolve
- Design an easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX platforms
- Basics
  - Client-server design
  - Non-hierarchical system
  - Load balancing and fault tolerance
  - Heterogeneous environment supported
  - Multiple and simple client interfaces
- http://www.cs.utk.edu/netsolve/
61. NetSolve: The Big Picture
- http://www.cs.utk.edu/netsolve/
62. Network Computing Classifications
- Code shipping (Java model)
  - Program: Server -> Client
  - Data: Client
  - Result: Client
- Remote computing (NetSolve model)
  - Program: Server
  - Data: Client -> Server
  - Result: Server -> Client
- http://www.cs.utk.edu/netsolve/
63. NetSolve: Typical Utilizations
- Intranet, Extranet, Internet
  - Proprietary, collaborative, open, controlled, ...
- Used by scientists as a computational engine
- Customized local servers
- Operating environment for higher level applications
- Client/server/agent could be on the same machine
64. NetSolve: The Client
- Multiple interfaces
  - C, FORTRAN, MATLAB, Mathematica
  - Java GUI and API
- Natural problem specification
  - some input objects
  - some output objects
  - a name (example: LinearSystem)
65. Architecture Features
- Cache
  - Goes back to the Atlas computer
- Pipelining
  - CDC 7600
- Vector instructions
  - Chaining
  - Overlapping
- Loop unrolling
66.
- Linpack benchmark
  - LINPACK is a benchmark, but it is really a collection of software routines for solving systems of linear equations
  - Some vendors, I am told, have been accused of having a switch called the "LINPACK switch": a program device to recognize the LINPACK program and optimize based on that
67.
- Data locality
  - Give Cyber 205 example
68. Problem Solving Environments & the Computational Grid
- Many hosts co-existing in a loose confederation, tied together with high-speed links
- Users have the illusion of a very powerful computer on the desk
- System provides all computational facilities to solve a target class of problems
- Automatic or semiautomatic selection of solution methods; ways to easily incorporate novel solution methods
- Communicate using the language of the target class of problems
- Allow users to do computational steering
69. The Importance of Standards: Software
- Writing programs for MPPs is hard ...
- But ... not one-off efforts if written in a standard language
- Past lack of parallel programming standards ...
  - has restricted uptake of the technology (to "enthusiasts")
  - reduced portability (over a range of current architectures and between future generations)
  - a lost decade
- Now standards exist (PVM, MPI & HPF), which ...
  - allow users & manufacturers to protect their software investment
  - encourage growth of a "third party" parallel software industry & parallel versions of widely used codes
- Others: POSIX, CORBA, ...
70. LAPACK: Linear Algebra Library in F77
- Solution of systems of equations Ax = b for general dense, symmetric, banded, tridiagonal matrix A
- Solution of linear least squares problems min_x || Ax - b ||_2 for general dense matrix A
- Singular value decomposition A = U S V^T for general dense, banded matrix A
- Solution of eigenvalue problems Ax = lambda x for general dense, symmetric, banded matrix A
- Solution of generalized problems: (GSEP), (GNEP), (GSVD)
- Error bounds, condition estimation for (almost) everything
- Real and complex, single and double precision
- Easy-to-use drivers plus computational kernels
71. LAPACK: Ongoing Work
- Add functionality
  - Updating/downdating, divide and conquer least squares, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods, ...
- Move to new generation of high performance machines
  - IBM SPs, CRAY T3E, SGI Origin, clusters of workstations
- New challenges
  - New languages: Fortran 90, HPF, ...
  - Many flavors of message passing (CMMD, MPL, NX, ...); need a standard (PVM, MPI): BLACS
  - Highly varying ratio of computation to communication speed
  - Many ways to lay out data
  - Fastest parallel algorithm sometimes less stable numerically
72. LAPACK Blocked Algorithms

      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit',
     $               J-1, JB, ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE,
     $               A( 1, J ), LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE

- On the Y-MP, Level 3 BLAS squeezes a little more out of 1 processor, but makes a large improvement when using 8 processors.
73. Challenges in Developing Distributed Memory Libraries
- How to integrate software?
  - Until recently, no standards
  - Many parallel languages
  - Various parallel programming models
  - Assumptions about the parallel environment
    - granularity
    - topology
    - overlapping of communication and computation
    - development tools
- Where is the data?
  - Who owns it?
  - Optimal data distribution
- Who determines data layout?
  - Determined by the user?
  - Determined by the library developer?
  - Allow dynamic data distribution?
  - Load balancing
75. Portability of Numerical Software
- Numerical software should work portably on different computers, with good numerical accuracy, stability and robustness of solution
- EISPACK: DEC, IBM, CDC (early 70s)
- LINPACK: vector, pipelined (late 70s)
- LAPACK: vector, RISC, SMPs (late 80s / early 90s)
- ScaLAPACK (late 90s)
  - MPP
  - DSM
  - Clumps
  - Heterogeneous NOWs
76. ScaLAPACK
- LAPACK software expertise/quality
  - Numerical methods
- Library of software dealing with dense and banded routines
- Distributed memory, message passing
  - MPP computers and NOWs
- SPMD Fortran 77 with object-based design
- Built on various modules
  - PBLAS: interprocessor communication
    - BLAS/BLACS
    - PVM, MPI, IBM SP, CRI T3, Intel, TMC
  - Provides the right level of notation
77. Choosing a Data Distribution
- Main issues are
  - Load balancing
  - Use of the Level 3 BLAS
- 1D block and cyclic column distributions
- 1D block-cyclic column and 2D block-cyclic distributions
- 2D block-cyclic used in ScaLAPACK for dense matrices
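The 2D block-cyclic map can be sketched in C; the same formula is applied independently to rows and columns. This mirrors the idea behind ScaLAPACK's indexing utilities, with zero-based indices and a zero source process assumed, and illustrative function names.

```c
/* For global index g, block size nb, and p processes in one dimension:
   which process owns g, and where g lives in that process's local array. */
int owner(int g, int nb, int p) {
    return (g / nb) % p;                  /* blocks dealt out cyclically */
}

int local_index(int g, int nb, int p) {
    return (g / (nb * p)) * nb + g % nb;  /* local block, then offset */
}
```

With nb = 2 and p = 2, global indices 0..7 map to processes 0,0,1,1,0,0,1,1: every process holds whole nb-sized blocks, which balances the load while leaving blocks large enough for Level 3 BLAS.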
78. Programming Style
- SPMD Fortran 77 with object-based design
- Built on various modules
  - PBLAS: interprocessor communication
    - BLACS
      - PVM, MPI, IBM SP, CRI T3, Intel, TMC
    - Provides the right level of notation
  - BLAS
- LAPACK software expertise/quality
  - Software approach
  - Numerical methods
79. Parallelism in ScaLAPACK
- Level 3 BLAS block operations
  - All the reduction routines
- Pipelining
  - QR algorithm, triangular solvers, classic factorizations
- Redundant computations
  - Condition estimators
- Static work assignment
  - Bisection
- Task parallelism
  - Sign function eigenvalue computations
- Divide and conquer
  - Tridiagonal and band solvers, symmetric eigenvalue problem and sign function
- Cyclic reduction
  - Reduced system in the band solver
- Data parallelism
  - Sign function
80. ScaLAPACK: What's Included
- Timing and testing routines for almost everything; these are a large component of the package
- Prebuilt libraries available for SP, PGON, HPPA, DEC, Sun, RS6K
81. Direct Sparse Solvers
- CAPSS is a package to solve Ax = b on a message passing multiprocessor; the matrix A is SPD and associated with a mesh in 2 or 3D (version for Intel only)
- SuperLU and UMFPACK: sequential implementations of Gaussian elimination with partial pivoting
82. Parallel Iterative Solvers
- Many packages for message passing systems
  - PETSc, AZTEC, BLOCKSOLVE95, P-Sparselib, ...
- PETSc: a set of libraries and structures that assist in the solution of PDEs and related problems
  - Parallel algebraic data structures, solvers, and related infrastructure required for solving PDEs using implicit methods involving finite elements, finite differences, or finite volumes
83. Parallel Sparse Eigenvalue Solvers
- P_ARPACK (D. Sorensen et al.)
- Designed to compute a few eigenvalues and corresponding eigenvectors of a general matrix
- Appropriate for large sparse or structured matrices A
- Based on an Arnoldi process called the Implicitly Restarted Arnoldi Method
- Reverse communication interface
84. Java for Numerical Computations?
- A few months ago: 1 Mflop/s on a 600 Mflop/s processor
- Top performer today is 46 Mflop/s for a P6 using the MS Explorer 3.0 JIT (62 Mflop/s in Fortran)
- http://www.netlib.org/benchmark/linpackjava/
85. LAPACK to Java
- Allows Java programmers access to BLAS/LAPACK routines
- Working on a translator to go from LAPACK to Java byte code
  - lap2j: a formal compiler of a subset of f77 sufficient for BLAS & LAPACK
- Plan to enable all of LAPACK
- Compiler provides quick, reliable translation
- Focus on LAPACK Fortran
  - Simple: no COMMON, EQUIVALENCE, SAVE, or I/O
86. Parameterized Libraries
- Architecture features for performance
  - Cache size
  - Latency
  - Bandwidth
- Latency tolerant algorithms
  - Latency reduction
  - Granularity management
- High levels of concurrency
- Issues of precision
87. Motivation for Network Enabled Solvers
- Design an easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX platforms
- Client-server design
- Non-hierarchical system
- Load balancing
- Fault tolerance
- Heterogeneous environment supported
88. NetSolve: Future Paradigm
(Figure: NetSolve Client, Agent, and Server; software and hardware lookup; dynamic hardware/software binding)
89. On Mathematical Software
- Labor intensive activity
- Enabling technology
  - Has helped many computational scientists
  - Provides a basis for many commercial efforts
- Often software people are not given credit
  - Lip service given to the work
- Funding to universities and laboratories
  - NSF, DOE, DARPA, AFOSR, ...
90. References
- Copy of slides: http://www.netlib.org/utk/people/JackDongarra/siam-797/
- Annotated table of freely available software: http://www.netlib.org/utk/people/JackDongarra/la-sw.html
- http://www.netlib.org/
- http://www.nhse.org/
- http://www.nhse.org/hpc-netlib/
91.
- Design started in 1955 by IBM when it lost the bid for the U of C Radiation Lab (LLNL)
  - Univac won with the LARC
- Was to be 100X faster than what was available
- At IBM: Fred Brooks, John Cocke, Erich Bloch, ...
- Look-ahead unit
- Memory operand prefetch
- Out-of-order execution
- Speculative execution based on branch prediction
- Partitioned function units
92Stretch
93      double precision function pythag(a,b)
      double precision a,b
c
c     finds dsqrt(a**2+b**2) without overflow or destructive underflow
c
      double precision p,r,s,t,u
      p = dmax1(dabs(a),dabs(b))
      if (p .eq. 0.0d0) go to 20
      r = (dmin1(dabs(a),dabs(b))/p)**2
   10 continue
      t = 4.0d0 + r
      if (t .eq. 4.0d0) go to 20
      s = r/t
      u = 1.0d0 + 2.0d0*s
      p = u*p
      r = (s/u)**2 * r
      go to 10
   20 pythag = p
      return
      end
94- Nintendo 64
- more computing power than a Cray 1 and also much
better graphics
95      double precision function epslon (x)
      double precision x
c
c     estimate unit roundoff in quantities of size x.
c
      double precision a,b,c,eps
c
c     this program should function properly on all systems
c     satisfying the following two assumptions,
c        1. the base used in representing floating point
c           numbers is not a power of three.
c        2. the quantity a in statement 10 is represented to
c           the accuracy used in floating point variables
c           that are stored in memory.
c     the statement number 10 and the go to 10 are intended to
c     force optimizing compilers to generate code satisfying
c     assumption 2.
c     under these assumptions, it should be true that,
c            a is not exactly equal to four-thirds,
96Outline
- Architectural Opportunities and Challenges.
- Focus on High-Performance Computers
- MPP/Clusters/NOWs
- Language
- Fortran 77, 90
- HPF
- C, C++, Java
- Dense direct solvers
- ScaLAPACK, PLAPACK
- Sparse direct solvers
- Iterative solvers
97LAPACK Motivation
- LAPACK built on high level BLAS
- Use the manufacturer's BLAS for high performance.
- Portable Fortran 77 (766 K lines)
98LAPACK Motivation
- Example: Cray C-90, 1 and 16 processors
- LINPACK shows low performance due to excess data movement through the memory hierarchy
99Derivation of Blocked Algorithms: Cholesky Factorization A = U^T U
Equating coefficients of the jth column, we obtain
    aj = U11^T uj
    ajj = uj^T uj + ujj^2
Hence, if U11 has already been computed, we can compute uj and ujj from the equations
    U11^T uj = aj
    ujj^2 = ajj - uj^T uj
100LINPACK Implementation
- Here is the body of the LINPACK routine SPOFA which implements the method:
         DO 30 J = 1, N
            INFO = J
            S = 0.0E0
            JM1 = J - 1
            IF( JM1.LT.1 ) GO TO 20
            DO 10 K = 1, JM1
               T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
               T = T / A( K, K )
               A( K, J ) = T
               S = S + T*T
   10       CONTINUE
   20       CONTINUE
            S = A( J, J ) - S
C     ......EXIT
            IF( S.LE.0.0E0 ) GO TO 40
            A( J, J ) = SQRT( S )
   30    CONTINUE
102LAPACK from HP-48G to CRAY T-90
- (Trans)portable Fortran 77 (BLAS), 766K lines (354K library, 412K testing and timing)
- Targets workstations, vector machines, and shared memory computers
- Initial release February 1992
- Linear systems, least squares, eigenproblems
- Manual available from SIAM; translated into Japanese
- Fifth release later this year.
103Physical and Logical Views of PVM
104Performance Today in Terms of LINPACK Benchmark
- ASCI Red at 1.3 Tflop/s
- Sun Sparc Ultra 2 at 110 Mflop/s
- Thinkpad 600 at 50 Mflop/s
- Palm Pilot at 1.4 Kflop/s
105Standards - The MPI Forum
- Communicators - communication context
- needed for safe parallel library design.
- Derived Datatypes - user defined datatypes for non-homogeneous, non-contiguous data.
- Process Groups - collective operations can operate over a subset of the processes.
- Created using the HPF model
- A group of 30-40 experts in message-passing
- MPP vendors,
- CS researchers,
- Application developers.
- Met 3 days every 6 weeks for 1.5 years and
created the MPI-1 specification draft.
106Super-LU
- Supernodes
- Permit use of higher level BLAS.
- Reduce inefficient indirect addressing.
- Reduce symbolic time by traversing the supernodal graph.
- Algorithm: PA = LU, with A sparse and unsymmetric.
- Exploit dense submatrices in the L and U factors of PA = LU.
107Performance on Computational Kernels
- New implementations of architectures appear every few months.
- How to keep up with the changing designs so as to have optimized numerical kernels?
- Effort to aid the compiler
- Pre-compiled software for machine-specific implementations
108Grid Enabled Math Software
- Predictability and robustness of accuracy and performance.
- Run-time resource management and algorithm selection.
- Support for a multiplicity of programming environments.
- Reproducibility, fault tolerance, and auditability of the computations.
- New algorithmic techniques for latency tolerant applications.