Title: A Look at Library Software for Linear Algebra: Past, Present, and Future
1. A Look at Library Software for Linear Algebra: Past, Present, and Future
- Jack Dongarra
- University of Tennessee
- and
- Oak Ridge National Laboratory
2. 1960s
- 1961
  - IBM Stretch delivered to LANL
- 1962
  - Virtual memory from U Manchester, T. Kilburn
- 1964
  - AEC urges manufacturers to look at "radical new" machine structures
    - This leads to the CDC Star-100, TI ASC, and Illiac-IV
  - CDC 6600, S. Cray's design
    - Functional parallelism, leading to RISC
    - (3 times faster than the IBM Stretch)
- 1965
  - DEC ships first PDP-8
  - IBM ships 360
  - First CS PhD, U of Penn., Richard Wexelblat
  - Wilkinson's The Algebraic Eigenvalue Problem published
- 1966
  - DARPA contract with U of I to build the ILLIAC IV
  - Fortran 66
- 1967
  - Forsythe & Moler published
  - Fortran, Algol and PL/I
- 1969
  - ARPANET work begins
    - 4 computers connected: UC-SB, UCLA, SRI and U of Utah
  - CDC 7600
    - Pipelined architecture, 3.1 Mflop/s
  - Unix developed by Thompson and Ritchie
3. Wilkinson-Reinsch Handbook
- "In general we have aimed to include only algorithms that provide something approaching an optimal solution; this may be from the point of view of generality, elegance, speed, or economy of storage."
- Part 1: Linear Systems
- Part 2: Algebraic Eigenvalue Problem
- Before publishing the Handbook
  - Virginia Klema and others at ANL began translation of the Algol procedures into Fortran subroutines
4. 1970s
- 1970
  - NATS Project conceived
    - Concept of certified math software and the process involved with its production
  - Purdue Math Software Symposium
  - NAG project begins
- 1971
  - Handbook for Automatic Computation, Vol. II
    - Landmark in the development of numerical algorithms and software
    - Basis for a number of software projects: EISPACK, a number of linear algebra routines in IMSL, and the F chapters of NAG
  - IBM 370/195
    - Pipelined architecture, out-of-order execution, 2.5 Mflop/s
  - Intel 4004, 60 Kop/s
  - IMSL founded
- 1972
  - Cray Research founded
  - Intel 8008
  - 1/4-size Illiac IV installed at NASA Ames
    - 15 Mflop/s achieved, 64 processors
  - Paper by S. Reddaway on massive bit-level parallelism
  - EISPACK available
    - 150 installations; EISPACK Users' Guide
    - 5 versions (IBM, CDC, Univac, Honeywell and PDP), distributed free via the Argonne Code Center
  - M. Flynn publishes paper on architectural taxonomy
  - ARPANet
    - 37 computers connected
- 1973
  - BLAS report in SIGNUM
    - Lawson, Hanson, Krogh
5. NATS Project
- National Activity for Testing Software
  - (NSF, Argonne, Texas and Stanford)
- Project to explore the problems of testing, certifying, disseminating and maintaining quality math software
- First EISPACK, later FUNPACK
- Influenced other PACKs
  - ELLPACK, FISHPACK, ITPACK, MINPACK, PDEPACK, QUADPACK, SPARSPAK, ROSEPACK, TOOLPACK, TESTPACK, LINPACK, LAPACK, ScaLAPACK, ...
- Key attributes of math software
  - reliability, robustness, structure, usability, and validity
6. EISPACK
- Algol versions of the algorithms rewritten in Fortran
- Restructured to avoid underflow
- Check users' claims, such as whether a matrix really is positive definite
- Programs formatted in a unified fashion
  - Burt Garbow
- Field test sites
- 1971: U of Michigan Summer Conference
  - JHW algorithms, CBM software
- 1972: Software released via the Argonne Code Center
  - 5 versions
- Software certified in the sense that reports of poor or incorrect performance would gain the immediate attention of the developers
7. EISPACK
- EISPAC Control Program, Boyle
  - One interface allowed access to the whole package on IBM machines
- Argonne's interactive RESCUE system
  - Allowed us to easily manipulate tens of thousands of lines
- Generalizer/Selector
  - Generalizer converts the IBM version to a general form
  - Selector extracts the appropriate version
- 1976: Extensions to the package ready
  - EISPACK II
8. 1970s continued
- 1974
  - Intel 8080
  - Level 1 BLAS activity started by the community, Purdue 5/74
  - LINPACK meeting in summer at ANL
- 1975
  - First issue of TOMS
  - Second LINPACK meeting at ANL
    - Lay the groundwork and hammer out what was and was not to be included in the package; proposal submitted to the NSF
- 1976
  - Cray 1, the model for vector computing
    - 4 Mflop/s in '79, 12 Mflop/s in '83, 27 Mflop/s in '89
  - LINPACK work during summer at ANL
  - EISPACK: second edition of the Users' Guide
- 1977
  - DEC VAX 11/780 super mini
    - 0.14 Mflop/s, 4.3 GB virtual memory
  - LINPACK test software developed and sent out
  - EISPACK second release
  - IEEE arithmetic standard meetings
    - Paper by Palmer on the Intel standard for floating point
- 1978
  - Fortran 77
  - LINPACK software released
    - Sent to NESC and IMSL for distribution
- 1979
  - John Cocke designs the 801
  - ICL DAP delivered to QMC, London
  - Level 1 BLAS published/released
  - LINPACK Users' Guide
    - Appendix: 17 machines, PDP-10 to Cray-1
9. Basic Linear Algebra Subprograms
- A design tool for software in numerical linear algebra
- Improve readability and aid documentation
- Aid modularity and maintenance, and improve robustness of software calling the BLAS
- Improve portability, without sacrificing efficiency, through standardization
- BLAS 1973-1977
  - Consensus on:
    - Names
    - Calling sequences
    - Functional descriptions
  - Low-level linear algebra operations
- Success results from:
  - Extensive public involvement
  - Careful consideration of implications
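The flavor of a Level 1 BLAS operation can be sketched in C. This is a hypothetical illustration, not the reference Fortran interface: the real routines (e.g. DAXPY) also take stride arguments for both vectors.

```c
#include <stddef.h>

/* Sketch of the Level 1 BLAS AXPY operation: y <- alpha*x + y.
   Unit stride is assumed here for clarity. */
void axpy(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

Standardizing even an operation this small meant every library above it could share one tuned, documented kernel per machine.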
10. CRAY and VAX
- "VAXination" of groups and departments
- Cray's introduction of vector computing
- Both had a significant impact on scientific computing
11. LINPACK
- June 1974, ANL
  - Jim Pool's meeting
- February 1975, ANL
  - Groundwork for the project
- January 1976
  - Funded by NSF and DOE: ANL/UNM/UM/UCSD
- Summer 1976 - Summer 1977
- Fall 1977
  - Software to 26 test sites
- December 1978
  - Software released via NESC and IMSL
12. LINPACK
- Research into the mechanics of software production
- Provide a yardstick against which future software would be measured
- Produce a library used both by people and by those who wished to modify/extend the software to handle special problems
- Hope it would be used in the classroom
- Machine independent and efficient
- No mixed-mode arithmetic
13. LINPACK Efficiency
- Condition estimator
- Inner loops via the BLAS
- Column access
- Unrolled loops (done for the BLAS)
- BABE algorithm for tridiagonal matrices
- TAMPR system allowed easy generation of versions
- LINPACK Benchmark
  - Today reports machines from the Cray T90 to the Palm Pilot
  - (1.0 Gflop/s to 1.4 Kflop/s)
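Loop unrolling of the kind applied inside the BLAS can be sketched in C. The function name and unrolling depth are illustrative; real implementations tuned the depth per machine.

```c
/* 4-way unrolled AXPY loop: the unrolled body amortizes loop overhead
   and exposes independent operations to the hardware.  A cleanup loop
   handles the leftover iterations when n is not a multiple of 4. */
void daxpy_unrolled(int n, double a, const double *x, double *y) {
    int i;
    int m = n - (n % 4);
    for (i = 0; i < m; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)   /* cleanup */
        y[i] += a * x[i];
}
```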
14. 1980s: Vector Computing (Parallel Processing)
- 1980
  - Total computers in use in the US exceeds 1 M
  - CDC introduces the Cyber 205
- 1981
  - IBM introduces the PC
    - Intel 8088/DOS
  - BBN Butterfly delivered
  - FPS delivers the FPS-164
    - Start of the mini-supercomputer market
  - SGI founded by Jim Clark and others
  - Loop unrolling at the outer level for data locality and parallelism
    - Amounts to matrix-vector operations
  - Cuppen's method for the symmetric eigenvalue problem (divide and conquer) published
  - Talk at Oxford Gatlinburg (1980)
- 1982
  - Illiac IV decommissioned
  - Steve Chen's group at Cray produces the X-MP
  - First Denelcor HEP installed (0.21 Mflop/s)
  - Sun Microsystems, Convex and Alliant founded
- 1983
  - Total computers in use in the US exceed 10 M
  - DARPA starts the Strategic Computing Initiative
    - Helps fund Thinking Machines, BBN, WARP
  - Cosmic Cube hypercube running at Caltech
  - John Palmer, after seeing the Caltech machine, leaves Intel to found Ncube
  - Encore, Sequent, TMC, SCS, Myrias founded
  - Cray 2 introduced
  - NEC SX-1 and SX-2; Fujitsu ships the VP-200
  - ETA Systems spun off from CDC
  - Golub & Van Loan published
15. 1980s continued
- 1984
  - NSFNET: 5000 computers, 56 Kb/s lines
  - MathWorks founded
  - EISPACK third release
  - Netlib begins 1/3/84
  - Level 2 BLAS activity started
    - Gatlinburg (Waterloo), Purdue, SIAM
  - Intel Scientific Computers started by J. Rattner
    - Produce a commercial hypercube
  - Cray X-MP 1 processor, 21 Mflop/s Linpack
  - Multiflow founded by J. Fisher: VLIW architecture
  - Apple introduces the Mac; IBM introduces the PC AT
  - IJK paper
- 1985
  - IEEE Standard 754 for floating point
  - IBM delivers 3090 vector: 16 Mflop/s Linpack, 138 peak
  - TMC demos the CM-1 to DARPA
  - Intel produces the first iPSC/1 hypercube
    - 80286s connected via Ethernet controllers
  - Fujitsu VP-400, NEC SX-2, Cray 2, Convex C1
  - Ncube/10: 0.1 Mflop/s Linpack, 1 processor
  - FPS-264: 5.9 Mflop/s Linpack, 38 peak
  - IBM begins the RP3 project
  - Stellar (Poduska), Ardent (Michels), Supertek Computer founded
  - Denelcor closes doors
16. 1980s (Lost Decade for Parallel Software)
- 1986
  - # of computers in the US exceeds 30 M
  - TMC ships the CM-1: 64K 1-bit processors
  - Cray X-MP
  - IBM and MIPS release first RISC workstations
- 1987
  - ETA Systems family of supercomputers
  - Sun Microsystems introduces its first RISC workstation
  - IBM invests in Steve Chen's SSI
  - Cray Y-MP
  - First NA-DIGEST
  - Level 3 BLAS work begun
  - LAPACK prospectus: Development of a LA Library for HPC
- 1988
  - AMT delivers the first re-engineered DAP
  - Intel produces the iPSC/2
  - Stellar and Ardent begin delivering single-user graphics workstations
  - Level 2 BLAS paper published
- 1989
  - # of computers in the US > 50 M
  - Stellar and Ardent merge, forming Stardent
  - S. Cray leaves Cray Research to form Cray Computer
  - Ncube 2nd generation machine
  - ETA out of business
  - Intel 80486 and i860: 1 M transistors
    - i860: RISC, 64-bit floating point
17. EISPACK 3 and BLAS 2 & 3
- Machine independence for EISPACK
  - Reduce the possibility of overflow and underflow
  - Mods to the Algol from S. Hammarling
  - Rewrite reductions to tridiagonal form to involve sequential access to memory
  - Official double precision version
  - Inverse iteration routines modified to reduce the size for reorthogonalization
- BLAS (Level 1) vector operations involve too much data movement
- Community effort to define extensions
  - Matrix-vector ops
  - Matrix-matrix ops
18. Netlib: Mathematical Software and Data
- Began in 1985
  - JD and Eric Grosse, AT&T Bell Labs
- Motivated by the need for cost-effective, timely distribution of high-quality mathematical software to the community
- Designed to send requested items by return electronic mail
- Automatic mechanism for electronic dissemination of freely available software
- Still in use and growing
  - Mirrored at 9 sites around the world
  - Moderated collection / distributed maintenance
- NA-DIGEST and NA-Net
  - Gene Golub, Mark Kent and Cleve Moler
19. Netlib Growth
- Just over 6,000 requests in 1985; over 29,000,000 total
- Over 9 million hits in 1997; 5.4 million so far in 1998
- LAPACK best seller: 1.6 M hits
20. 1990s
- 1990
  - Internet
  - World Wide Web
  - Motorola introduces the 68040
  - NEC ships the SX-3, the first parallel Japanese vector supercomputer
  - IBM announces the RS/6000 family
    - Has FMA instruction
  - Intel hypercube based on the i860 chip
    - 128 processors
  - Alliant delivers the FX/2800, based on the i860
  - Fujitsu VP-2600
  - PVM project started
  - Level 3 BLAS published
- 1991
  - Stardent to sell business and close
  - Cray C-90
  - Kendall Square Research delivers the 32-processor KSR-1
  - TMC produces the CM-200 and announces the CM-5 MIMD computer
  - DEC announces the Alpha
  - TMC produces the first CM-5
  - Fortran 90
  - Workshop to consider a message passing standard; beginnings of MPI
    - Community effort
  - Xnetlib running
- 1992
  - LAPACK software released; Users' Guide published
22. Parallel Processing Comes of Age
- "There are three rules for programming parallel computers. We just don't know what they are yet." -- Gary Montry
- "Embarrassingly Parallel", Cleve Moler
- "Humiliatingly Parallel"
23. Memory Hierarchy and LAPACK
- ijk implementations
  - Affect the order in which data is referenced; some are better at keeping data in the higher levels of the memory hierarchy
  - Applies to matrix multiply and reductions to condensed form
  - May do slightly more flops
  - Up to 3 times faster
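The effect of loop ordering can be seen in a small C sketch of matrix multiply (function names are illustrative). Both orderings compute the same C += A*B, but with row-major storage the ikj order streams contiguously through rows of B and C in its inner loop, so data stays in the faster levels of the hierarchy longer.

```c
/* ijk order: inner loop strides down a column of B
   (poor locality in row-major storage). */
void matmul_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

/* ikj order: inner loop walks rows of B and C contiguously. */
void matmul_ikj(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```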
24. Why Higher Level BLAS?
- Can only do arithmetic on data at the top of the hierarchy
- Higher level BLAS let us do this
- Development of blocked algorithms important for performance
25. History of Block-Partitioned Algorithms
- Ideas not new
- Early algorithms involved use of a small main memory, with tapes as secondary storage
- Recent work centers on use of vector registers, level 1 and 2 cache, main memory, and out-of-core memory
26. LAPACK
- Linear algebra library in Fortran 77
  - Solution of systems of equations
  - Solution of eigenvalue problems
- Combines algorithms from LINPACK and EISPACK into a single package
- Block algorithms
  - Efficient on a wide range of computers: RISC, vector, SMPs
- User interface similar to LINPACK
  - Single, Double, Complex, Double Complex
- Built on the Level 1, 2, and 3 BLAS
- Runs on everything from the HP-48G to the CRAY T-90
(Figure: performance in Mflop/s)
28. 1990s continued
- 1993
  - Intel Pentium systems start to ship
  - ScaLAPACK prototype software released
    - First portable library for distributed memory machines
    - Intel, TMC and workstations using PVM
  - PVM 3.0 available
- 1994
  - MPI-1 finished
- 1995
  - Templates project
- 1996
  - Internet: 34 M users
  - Nintendo 64
    - More computing power than a Cray 1, and much, much better graphics
- 1997
  - MPI-2 finished
  - Fortran 95
- 1998
  - Issues of parallel and numerical stability
  - Divide time
  - DSM architectures
  - "New" algorithms
    - Chaotic iteration
    - Sparse LU without pivoting
    - Pipelined HQR
    - Graph partitioning
    - Algorithmic bombardment
29. Templates Project
- Iterative methods for large sparse systems
- Communicate state-of-the-art algorithms to the HPC community
- Subtle algorithmic issues addressed, e.g. convergence, preconditioners, data structures
- Performance and parallelism considerations
- Gave computational scientists algorithms in the form they wanted
30. ScaLAPACK
- Library of software for dense and banded matrices
  - Sparse direct being developed
- Distributed memory, message passing
  - PVM and MPI
- MIMD computers, networks of workstations, and clumps of SMPs
- SPMD Fortran 77 with object-based design
- Built on various modules
  - PBLAS (BLACS and BLAS)
  - PVM, MPI, IBM SP, CRI T3, Intel, TMC
  - Provides the right level of notation
31. High-Performance Computing Directions
- Move toward shared memory
  - SMPs and distributed shared memory
  - Shared address space with deep memory hierarchy
- Clustering of shared memory machines for scalability
- Emergence of PC commodity systems
  - Pentium based, NT or Linux driven
  - At UTK, a cluster of 14 (dual) Pentiums: 7.2 Gflop/s
- Efficiency of message passing and data parallel programming
  - Helped by standards efforts such as PVM, MPI and HPF
- Complementing supercomputing with metacomputing -> the Computational Grid
32. Heterogeneous Computing
- Heterogeneity introduces new bugs in parallel code
- Slightly different floating point arithmetic can make data-dependent branches go different ways when we expect identical behavior
- A correct algorithm on a network of identical workstations may fail if a slightly different machine is introduced
- Some bugs are easy to fix (compare s < tol on 1 processor and broadcast the result)
- Some are hard to fix (handling denorms; getting the same answer independent of the number of processors)
33. Java for Numerical Computations?
- Java likely to be a dominant language
- Provides machine independent code
- C-like language
  - No pointers, gotos, overloading of arithmetic ops, or memory deallocation
- Portability achieved via an abstract machine
- Java is a convenient user interface builder which allows one to quickly develop customized interfaces
34. Network Enabled Servers
- Allow networked resources to be integrated into the desktop
- Many hosts co-existing in a loose confederation, tied together with high-speed links
- Users have the illusion of a very powerful computer on the desk
- Locate and deliver software or solutions to the user in a directly usable and conventional form
- Part of the motivation: software maintenance
35. Future: Petaflops (10^15 fl pt ops/s)
- Dynamic redistribution of workload
- New languages and constructs
- Role of numerical libraries
- Algorithm adaptation to hardware failure
- A Pflop for 1 second ~ a typical workstation computing for 1 year
- From an algorithmic standpoint
  - concurrency
  - data locality
  - latency & synchronization
  - floating point accuracy
- May be feasible and affordable by the year 2010
36. Summary
- As a community we have a lot to be proud of in terms of the algorithms and software we have produced
  - Generality, elegance, speed, and economy of storage
- Software still being used, in many cases 30 years after its release
38. End of talk
39. BLAS Were Intended To
- Be a design tool for the development of software in numerical linear algebra
- Improve readability and aid documentation
- Aid modularity and maintenance, and improve robustness of software calling the BLAS
- Promote efficiency
- Improve portability, without sacrificing efficiency, through standardization
40. LINPACK Benchmark
- Appeared in the Users' Guide
- Dense system of linear equations
  - LU decomposition via Gaussian elimination
- Users' Guide appendix reported AXPY operation time
- Has a life of its own
- Today reports machines from the Cray T90 to the Palm Pilot
  - (1.0 Gflop/s to 1.4 Kflop/s)
41. Future Research Directions and Challenges
Prediction is hard.
42. ScaLAPACK Structure
(Figure: Global vs. Local layers)
43. Metacomputing Objectives
- Flexibility and extensibility
- Site autonomy
- Scalable architecture
- Single global namespace
- Easy-to-use, seamless environment
- High performance through parallelism
- Security
- Management / exploitation of heterogeneity
- Multilanguage interoperability
- Fault-tolerance
44. Grid Based Computations
- Long running computations
- Grid based computation
- Network of Workstations
- Fault tolerance
- Reproducibility of solution
- Auditability of computation
45. Heterogeneous Conclusions
- Defensive programming
- Machine parameters
- Communication and representation between
processors - Controlling processor
- Additional communication
- Testing strategies
46. Metacomputing Issues
- Logically only one system
  - System should determine where the computation executes
- Fault-tolerance should be transparent
  - Applications freed from the mechanics of distributed programming
- Self-configuring, with new resources added automatically
- Just-in-time binding of software, data, and hardware
- User shouldn't have to decide where the computation is performed
- User should be able to walk up to a machine and have their program and data follow them
47. Grid Computing and Numerical Computations
- The Computational Grid raises certain challenges from a numerical standpoint
- Some users would like/want/expect/demand the same results they obtained yesterday the next time they run the application
- How can we guarantee reproducibility?
48. Heterogeneous Computing
- Software intended to be used in this context
- Machine precision and other machine-specific parameters
- Communication of floating point numbers between processors
- Repeatability
  - Run-to-run differences, e.g. the order in which quantities are summed
- Coherency
  - Within-run differences, e.g. arithmetic varies from processor to processor
- Iterative convergence across clusters of processors
- Defensive programming required
- Important for the Computational Grid
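A minimal C sketch of why summation order matters: floating point addition is not associative, so two processors that accumulate the same values in different orders can disagree.

```c
/* Floating point addition is not associative.  Under IEEE double
   arithmetic the two groupings below can give different answers,
   which is exactly the kind of run-to-run difference a
   heterogeneous (or differently scheduled) system exposes. */
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }
```

With a = 1e16, b = -1e16, c = 1.0, sum_left gives 1.0 while sum_right gives 0.0, because c is absorbed by rounding when added to b first.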
49. Machine Precision
- Smallest floating point number u such that 1 + u > 1
- Used in
  - Convergence tests: has an iteration converged?
  - Tolerance tests: is a matrix numerically singular?
  - Error analysis: how does floating point arithmetic affect results?
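The classic probe for this quantity can be sketched in C. This is a simplification of what LAPACK's DLAMCH does far more carefully; the function name is illustrative.

```c
/* Halve u until 1 + u/2 is no longer distinguishable from 1.
   On IEEE double hardware this terminates with u = 2^-52.
   `volatile` discourages the compiler from keeping the comparison
   in extended-precision registers. */
double mach_eps(void) {
    volatile double u = 1.0;
    while (1.0 + u / 2.0 > 1.0)
        u /= 2.0;
    return u;
}
```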
50. Heterogeneous Machine Precision
- Return the maximum relative machine precision over all the processors
  - LAPACK -> DLAMCH
  - ScaLAPACK -> PDLAMCH
- Note that even on IEEE machines, the machine precision can vary slightly. For double precision, eps = 2^-53 + q, where q = 2^-105 or q = 2^-64 + 2^-105.
51. Grid Enabled Math Software
- Predictability and robustness of accuracy and performance
- Run-time resource management and algorithm selection
- Support for a multiplicity of programming environments
- Reproducibility, fault tolerance, and auditability of the computations
- New algorithmic techniques for latency tolerant applications
53. Automatically Tuned Numerical Software
- Automatic generation of computational kernels for RISC architectures
- A package that adapts itself to differing architectures via code generation coupled with timing
- Code generator takes about 1-2 hours to run
  - Done once for a new architecture
- Written in ANSI C (generates C)
- Today: Level 2 and 3 BLAS
- Extensions to higher level operations
  - SMPs
  - SGI/Vector
  - DSP
  - FFTs
54. Why Such a System Is Needed
- BLAS require many man-hours per platform
  - Only done if the financial incentive is there
  - Many platforms will never have an optimal version
  - Lags behind hardware
  - May not be affordable by everyone
- Allows for portably optimal codes
- Package contains
  - Code generators
  - Sophisticated timers
  - Robust search routines
55. Future Research Directions and Challenges
Prediction is hard.
56. Future Research Directions and Challenges
"Prediction is hard. Especially the future . . ." -- Yogi Berra, philosopher in the Baseball Hall of Fame
- Numerical behavior in a heterogeneous environment
- Efficient software for core routines
- Fault tolerant aspects
- "New" algorithms
  - Chaotic iteration
  - Divide and conquer
  - Sparse LU without pivoting
  - Pipelined HQR
  - Graph partitioning
  - Algorithmic bombardment
57. Research Directions
- Grid enabled strategies
- Parameterizable libraries
- Annotated libraries
- Hierarchical algorithm libraries
- A new division of labor among compiler writers, library writers, algorithm developers and application developers will emerge
60. Motivation for NetSolve
- Design an easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX platforms
- Basics
  - Client-server design
  - Non-hierarchical system
  - Load balancing and fault tolerance
  - Heterogeneous environment supported
  - Multiple and simple client interfaces
- http://www.cs.utk.edu/netsolve/
61. NetSolve: The Big Picture
- http://www.cs.utk.edu/netsolve/
62. Network Computing Classifications
- Code shipping (Java model)
  - Program: Server -> Client
  - Data: Client
  - Result: Client
- Remote computing (NetSolve model)
  - Program: Server
  - Data: Client -> Server
  - Result: Server -> Client
- http://www.cs.utk.edu/netsolve/
63. NetSolve: Typical Utilizations
- Intranet, Extranet, Internet
  - Proprietary, collaborative, open, controlled, ...
- Used by scientists as a computational engine
- Customized local servers
- Operating environment for higher level applications
- Client/server/agent could be on the same machine
64. NetSolve: The Client
- Multiple interfaces
  - C, FORTRAN, MATLAB, Mathematica
  - Java GUI and API
- Natural problem specification
  - some input objects
  - some output objects
  - a name (example: LinearSystem)
65. Architecture Features
- Cache
  - Goes back to the Atlas computer
- Pipelining
  - CDC 7600
- Vector instructions
  - Chaining
  - Overlapping
- Loop unrolling
66.
- Linpack benchmark
  - LINPACK is a benchmark, but it is really a collection of software routines for solving systems of linear equations
  - Some vendors, I am told, have been accused of having a switch called the "LINPACK switch": a program device to recognize the LINPACK program and optimize based on that
67.
- Data locality
  - Give Cyber 205 example
68. Problem Solving Environments & the Computational Grid
- Many hosts co-existing in a loose confederation, tied together with high-speed links
- Users have the illusion of a very powerful computer on the desk
- System provides all computational facilities to solve a target class of problems
- Automatic or semiautomatic selection of solution methods; ways to easily incorporate novel solution methods
- Communicate using the language of the target class of problems
- Allow users to do computational steering
69. The Importance of Standards: Software
- Writing programs for MPPs is hard ...
- But ... not one-off efforts if written in a standard language
- Past lack of parallel programming standards ...
  - has restricted uptake of the technology (to "enthusiasts")
  - reduced portability (over a range of current architectures and between future generations)
  - a lost decade
- Now standards exist (PVM, MPI & HPF), which ...
  - allow users & manufacturers to protect their software investment
  - encourage growth of a "third party" parallel software industry & parallel versions of widely used codes
- Others: POSIX, CORBA, ...
70. LAPACK: Linear Algebra Library in F77
- Solution of systems of equations Ax = b for general dense, symmetric, banded, tridiagonal matrix A
- Solution of linear least squares problems min_x || Ax - b ||_2 for general dense matrix A
- Singular value decomposition A = U S V^T for general dense, banded matrix A
- Solution of eigenvalue problems Ax = lambda x for general dense, symmetric, banded matrix A
- Solution of generalized problems: (GSEP), (GNEP), (GSVD)
- Error bounds, condition estimation for (almost) everything
- Real and complex, single and double precision
- Easy-to-use drivers plus computational kernels
71. LAPACK: Ongoing Work
- Add functionality
  - Updating/downdating, divide and conquer least squares, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods, ...
- Move to new generation of high performance machines
  - IBM SPs, CRAY T3E, SGI Origin, clusters of workstations
- New challenges
  - New languages: Fortran 90, HPF, ...
  - Many flavors of message passing (CMMD, MPL, NX, ...); need a standard (PVM, MPI): BLACS
  - Highly varying ratio of computation to communication speed
  - Many ways to lay out data
  - Fastest parallel algorithm sometimes less stable numerically
72. LAPACK Blocked Algorithms

      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit',
     $               J-1, JB, ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE,
     $               A( 1, J ), LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE

- On the Y-MP, Level 3 BLAS squeezes a little more out of 1 processor, but makes a large improvement when using 8 processors.
73. Challenges in Developing Distributed Memory Libraries
- How to integrate software?
  - Until recently, no standards
  - Many parallel languages
  - Various parallel programming models
  - Assumptions about the parallel environment
    - granularity
    - topology
    - overlapping of communication and computation
    - development tools
- Where is the data?
  - Who owns it?
  - Optimal data distribution
- Who determines data layout?
  - Determined by the user?
  - Determined by the library developer?
  - Allow dynamic data distribution?
  - Load balancing
75. Portability of Numerical Software
- Numerical software should work portably on different computers, with good numerical accuracy, stability and robustness of solution
- EISPACK: DEC, IBM, CDC (early 70s)
- LINPACK: vector, pipelined (late 70s)
- LAPACK: vector, RISC, SMPs (late 80s / early 90s)
- ScaLAPACK (late 90s)
  - MPP
  - DSM
  - Clumps
  - Heterogeneous NOWs
76. ScaLAPACK
- LAPACK software expertise/quality
  - Numerical methods
- Library of software dealing with dense and banded routines
- Distributed memory, message passing
  - MPP computers and NOWs
- SPMD Fortran 77 with object-based design
- Built on various modules
  - PBLAS: interprocessor communication
    - BLAS/BLACS
    - PVM, MPI, IBM SP, CRI T3, Intel, TMC
  - Provides the right level of notation
77. Choosing a Data Distribution
- Main issues are
  - Load balancing
  - Use of the Level 3 BLAS
- 1D block and cyclic column distributions
- 1D block-cyclic column and 2D block-cyclic distributions
- 2D block-cyclic used in ScaLAPACK for dense matrices
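The 2D block-cyclic map can be sketched in C; the same formula is applied independently to rows and columns. This mirrors the idea behind ScaLAPACK's indexing utilities, with zero-based indices and a zero source process assumed, and illustrative function names.

```c
/* For global index g, block size nb, and p processes in one dimension:
   which process owns g, and where g lives in that process's local array. */
int owner(int g, int nb, int p) {
    return (g / nb) % p;                  /* blocks dealt out cyclically */
}

int local_index(int g, int nb, int p) {
    return (g / (nb * p)) * nb + g % nb;  /* local block, then offset */
}
```

With nb = 2 and p = 2, global indices 0..7 map to processes 0,0,1,1,0,0,1,1: every process holds whole nb-sized blocks, which balances the load while leaving blocks large enough for Level 3 BLAS.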
78. Programming Style
- SPMD Fortran 77 with object-based design
- Built on various modules
  - PBLAS: interprocessor communication
    - BLACS
      - PVM, MPI, IBM SP, CRI T3, Intel, TMC
    - Provides the right level of notation
  - BLAS
- LAPACK software expertise/quality
  - Software approach
  - Numerical methods
79. Parallelism in ScaLAPACK
- Level 3 BLAS block operations
  - All the reduction routines
- Pipelining
  - QR algorithm, triangular solvers, classic factorizations
- Redundant computations
  - Condition estimators
- Static work assignment
  - Bisection
- Task parallelism
  - Sign function eigenvalue computations
- Divide and conquer
  - Tridiagonal and band solvers, symmetric eigenvalue problem and sign function
- Cyclic reduction
  - Reduced system in the band solver
- Data parallelism
  - Sign function
80. ScaLAPACK: What's Included
- Timing and testing routines for almost everything; these are a large component of the package
- Prebuilt libraries available for SP, PGON, HPPA, DEC, Sun, RS6K
81. Direct Sparse Solvers
- CAPSS is a package to solve Ax = b on a message passing multiprocessor; the matrix A is SPD and associated with a mesh in 2 or 3D (version for Intel only)
- SuperLU and UMFPACK: sequential implementations of Gaussian elimination with partial pivoting
82. Parallel Iterative Solvers
- Many packages for message passing systems
  - PETSc, AZTEC, BLOCKSOLVE95, P-Sparselib, ...
- PETSc: a set of libraries and structures that assist in the solution of PDEs and related problems
  - Parallel algebraic data structures, solvers, and related infrastructure required for solving PDEs using implicit methods involving finite elements, finite differences, or finite volumes
83. Parallel Sparse Eigenvalue Solvers
- P_ARPACK (D. Sorensen et al.)
- Designed to compute a few eigenvalues and corresponding eigenvectors of a general matrix
- Appropriate for large sparse or structured matrices A
- Based on an Arnoldi process called the Implicitly Restarted Arnoldi Method
- Reverse communication interface
84. Java for Numerical Computations?
- A few months ago: 1 Mflop/s on a 600 Mflop/s processor
- Top performer today is 46 Mflop/s for a P6 using the MS Explorer 3.0 JIT (62 Mflop/s in Fortran)
- http://www.netlib.org/benchmark/linpackjava/
85. LAPACK to Java
- Allows Java programmers access to BLAS/LAPACK routines
- Working on a translator to go from LAPACK to Java byte code
  - lap2j: a formal compiler of a subset of f77 sufficient for BLAS & LAPACK
- Plan to enable all of LAPACK
- Compiler provides quick, reliable translation
- Focus on LAPACK Fortran
  - Simple: no COMMON, EQUIVALENCE, SAVE, or I/O
86. Parameterized Libraries
- Architecture features for performance
  - Cache size
  - Latency
  - Bandwidth
- Latency tolerant algorithms
  - Latency reduction
  - Granularity management
- High levels of concurrency
- Issues of precision
87. Motivation for Network Enabled Solvers
- Design an easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX platforms
- Client-server design
- Non-hierarchical system
- Load balancing
- Fault tolerance
- Heterogeneous environment supported
88. NetSolve: Future Paradigm
(Figure: NetSolve Client, Agent, and Server; software and hardware lookup; dynamic hardware/software binding)
89. On Mathematical Software
- Labor intensive activity
- Enabling technology
  - Has helped many computational scientists
  - Provides a basis for many commercial efforts
- Often software people are not given credit
  - Lip service given to the work
- Funding to universities and laboratories
  - NSF, DOE, DARPA, AFOSR, ...
90. References
- Copy of slides: http://www.netlib.org/utk/people/JackDongarra/siam-797/
- Annotated table of freely available software: http://www.netlib.org/utk/people/JackDongarra/la-sw.html
- http://www.netlib.org/
- http://www.nhse.org/
- http://www.nhse.org/hpc-netlib/
91.
- Design started in 1955 by IBM when it lost the bid for the U of C Radiation Lab (LLNL)
  - Univac won with the LARC
- Was to be 100X faster than what was available
- At IBM: Fred Brooks, John Cocke, Erich Bloch, ...
- Look-ahead unit
- Memory operand prefetch
- Out-of-order execution
- Speculative execution based on branch prediction
- Partitioned function units
92Stretch
93      double precision function pythag(a,b)
      double precision a,b
c
c     finds dsqrt(a**2+b**2) without overflow or destructive underflow
c
      double precision p,r,s,t,u
      p = dmax1(dabs(a),dabs(b))
      if (p .eq. 0.0d0) go to 20
      r = (dmin1(dabs(a),dabs(b))/p)**2
   10 continue
      t = 4.0d0 + r
      if (t .eq. 4.0d0) go to 20
      s = r/t
      u = 1.0d0 + 2.0d0*s
      p = u*p
      r = (s/u)**2 * r
      go to 10
   20 pythag = p
      return
      end
94- Nintendo 64
- more computing power than a Cray 1 and also much
better graphics
95      double precision function epslon (x)
      double precision x
c
c     estimate unit roundoff in quantities of size x.
c
      double precision a,b,c,eps
c
c     this program should function properly on all systems
c     satisfying the following two assumptions,
c        1. the base used in representing floating point
c           numbers is not a power of three.
c        2. the quantity a in statement 10 is represented to
c           the accuracy used in floating point variables
c           that are stored in memory.
c     the statement number 10 and the go to 10 are intended to
c     force optimizing compilers to generate code satisfying
c     assumption 2.
c     under these assumptions, it should be true that,
c            a is not exactly equal to four-thirds,
96Outline
- Architectural Opportunities and Challenges.
- Focus on High-Performance Computers
- MPP/Clusters/NOWs
- Language
- Fortran 77, 90
- HPF
- C, C++, Java
- Dense direct solvers
- ScaLAPACK, PLAPACK
- Sparse direct solvers
- Iterative solvers
97LAPACK Motivation
- LAPACK built on high level BLAS
- Use the manufacturer's BLAS for high performance.
- Portable Fortran 77 (766 K lines)
98LAPACK Motivation
- Example: Cray C-90, 1 and 16 processors
- LINPACK shows low performance due to excess data movement through the memory hierarchy
99Derivation of Blocked Algorithms: Cholesky Factorization A = U^T U
Equating coefficients of the jth column, we obtain
    aj = U11^T uj
    ajj = uj^T uj + ujj^2
Hence, if U11 has already been computed, we can compute uj and ujj from the equations
    U11^T uj = aj
    ujj^2 = ajj - uj^T uj
100LINPACK Implementation
- Here is the body of the LINPACK routine SPOFA which implements the method:
         DO 30 J = 1, N
            INFO = J
            S = 0.0E0
            JM1 = J - 1
            IF( JM1.LT.1 ) GO TO 20
            DO 10 K = 1, JM1
               T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
               T = T / A( K, K )
               A( K, J ) = T
               S = S + T*T
   10       CONTINUE
   20       CONTINUE
            S = A( J, J ) - S
C     ......EXIT
            IF( S.LE.0.0E0 ) GO TO 40
            A( J, J ) = SQRT( S )
   30    CONTINUE
102LAPACK from HP-48G to CRAY T-90
- (Trans)portable Fortran 77 (BLAS), 766K lines (354K library, 412K testing and timing)
- Targets workstations, vector machines, and shared memory computers
- Initial release February 1992
- Linear systems, least squares, eigenproblems
- Manual available from SIAM; translated into Japanese
- Fifth release later this year.
103Physical and Logical Views of PVM
104Performance Today in Terms of LINPACK Benchmark
- ASCI Red at 1.3 Tflop/s
- Sun Sparc Ultra 2 at 110 Mflop/s
- Thinkpad 600 at 50 Mflop/s
- Palm Pilot at 1.4 Kflop/s
105Standards - The MPI Forum
- Communicators - communication context
- needed for safe parallel library design.
- Derived Datatypes - user defined datatypes for non-homogeneous, non-contiguous data.
- Process Groups - collective operations can operate over a subset of the processes.
- Created using the HPF model
- A group of 30-40 experts in message-passing
- MPP vendors,
- CS researchers,
- Application developers.
- Met 3 days every 6 weeks for 1.5 years and
created the MPI-1 specification draft.
106Super-LU
- Supernodes
- Permit use of higher level BLAS.
- Reduce inefficient indirect addressing.
- Reduce symbolic time by traversing the supernodal graph.
- Algorithm: PA = LU, with A sparse and unsymmetric.
- Exploit dense submatrices in the L and U factors of PA = LU.
107Performance on Computational Kernels
- New implementations of architectures appear every few months.
- How to keep up with the changing designs so as to have optimized numerical kernels?
- Effort to aid the compiler
- Pre-compiled software for machine-specific implementations
108Grid Enabled Math Software
- Predictability and robustness of accuracy and performance.
- Run-time resource management and algorithm selection.
- Support for a multiplicity of programming environments.
- Reproducibility, fault tolerance, and auditability of the computations.
- New algorithmic techniques for latency tolerant applications.