1
A Look at Library Software for Linear Algebra
Past, Present, and Future
  • Jack Dongarra
  • University of Tennessee
  • and
  • Oak Ridge National Laboratory

2
1960s
  • 1965
  • DEC ships first PDP-8; IBM ships the System/360
  • First CS PhD, U of Penn., Richard Wexelblat
  • Wilkinson's The Algebraic Eigenvalue Problem
    published
  • 1966
  • DARPA contract with U of I to build the ILLIAC IV
  • Fortran 66
  • 1967
  • Forsythe & Moler published
  • Fortran, Algol, and PL/I
  • 1969
  • ARPANET work begins
  • 4 computers connected
  • UC-SB, UCLA, SRI, and U of Utah
  • CDC 7600
  • Pipelined architecture 3.1 Mflop/s
  • Unix developed by Thompson and Ritchie
  • 1961
  • IBM Stretch delivered to LANL
  • 1962
  • Virtual memory from U Manchester, T. Kilburn
  • 1964
  • AEC urges manufacturers to look at "radical
    new" machine structures.
  • This leads to the CDC Star-100, TI ASC, and
    Illiac-IV.
  • CDC 6600, S. Cray's design
  • Functional parallelism, leading to RISC
  • (3 times faster than the IBM Stretch)

3
Wilkinson-Reinsch Handbook
  • "In general we have aimed to include only
    algorithms that provide something approaching an
    optimal solution; this may be from the point of
    view of generality, elegance, speed, or economy
    of storage."
  • Part 1: Linear Systems
  • Part 2: Algebraic Eigenvalue Problem
  • Before the Handbook was published
  • Virginia Klema and others at ANL began
    translation of the Algol procedures into Fortran
    subroutines

4
1970s
  • 1970
  • NATS Project conceived
  • Concept of certified mathematical software and the
    process involved in its production
  • Purdue Math Software Symposium
  • NAG project begins
  • 1971
  • Handbook for Automatic Computation, Vol II
  • Landmark in the development of numerical
    algorithms and software
  • Basis for a number of software projects: EISPACK,
    a number of linear algebra routines in IMSL, and
    the F chapters of NAG
  • IBM 370/195
  • Pipelined architecture, out-of-order execution,
    2.5 Mflop/s
  • Intel 4004, 60 Kop/s
  • IMSL founded
  • 1972
  • Cray Research founded
  • Intel 8008
  • 1/4-size Illiac IV installed at NASA Ames
  • 15 Mflop/s achieved with 64 processors
  • Paper by S. Reddaway on massive bit-level
    parallelism
  • EISPACK available
  • 150 installations; EISPACK Users' Guide
  • 5 versions (IBM, CDC, Univac, Honeywell, and PDP),
    distributed free via the Argonne Code Center
  • M. Flynn publishes paper on architectural
    taxonomy
  • ARPANet
  • 37 computers connected
  • 1973
  • BLAS report in SIGNUM
  • Lawson, Hanson, Krogh

5
NATS Project
  • National Activity for Testing Software
  • (NSF, Argonne, Texas and Stanford)
  • Project to explore the problems of testing,
    certifying, disseminating and maintaining quality
    math software.
  • First EISPACK, later FUNPACK
  • Influenced other PACKs
  • ELLPACK, FISHPACK, ITPACK, MINPACK, PDEPACK,
    QUADPACK, SPARSPAK, ROSEPACK, TOOLPACK,
    TESTPACK, LINPACK, LAPACK, ScaLAPACK . . .
  • Key attributes of math software
  • reliability, robustness, structure, usability,
    and validity

6
EISPACK
  • Algol versions of the algorithms written in
    Fortran
  • Restructured to avoid underflow
  • Check users' claims, such as whether a matrix is
    positive definite
  • Format programs in a unified fashion
  • Burt Garbow
  • Field test sites
  • 1971 U of Michigan Summer Conference
  • J. H. Wilkinson on algorithms, C. B. Moler on software
  • 1972 Software released via Argonne Code Center
  • 5 versions
  • Software certified in the sense that reports of
    poor or incorrect performance would gain the
    immediate attention of the developers

7
EISPACK
  • EISPAC Control Program, Boyle
  • One interface allowed access to whole package on
    IBM
  • Argonne's interactive RESCUE system
  • Allowed easy manipulation of tens of thousands of
    lines
  • Generalizer/Selector
  • Generalizer converts the IBM version to a general
    form
  • Selector extracts the appropriate version
  • 1976 Extensions to package ready
  • EISPACK II

8
1970s continued
  • 1974
  • Intel 8080
  • Level 1 BLAS activity started by the community
    (Purdue, 5/74)
  • LINPACK meeting in summer at ANL
  • 1975
  • First issue of TOMS
  • Second LINPACK meeting at ANL
  • "lay the groundwork and hammer out what was and
    was not to be included in the package"; proposal
    submitted to the NSF
  • 1976
  • Cray 1 - the model for vector computing
  • 4 Mflop/s in '79, 12 Mflop/s in '83, 27 Mflop/s in
    '89
  • LINPACK work during summer at ANL
  • EISPACK Second edition of User's Guide
  • 1977
  • DEC VAX 11/780 super-mini
  • 0.14 Mflop/s, 4.3 GB virtual memory
  • LINPACK test software developed and sent out
  • EISPACK second release
  • IEEE arithmetic standard meetings
  • Paper by Palmer on the Intel standard for floating
    point
  • 1978
  • Fortran 77
  • LINPACK software released
  • Sent to NESC and IMSL for distribution
  • 1979
  • John Cocke designs 801
  • ICL DAP delivered to QMC, London
  • Level 1 BLAS published/released
  • LINPACK Users' Guide
  • Appendix: 17 machines, from the PDP-10 to the
    Cray-1

9
Basic Linear Algebra Subprograms
  • A design tool for software in numerical linear
    algebra
  • Improve readability and aid documentation
  • Aid modularity and maintenance, and improve
    robustness of software calling the BLAS
  • Improve portability, without sacrificing
    efficiency, through standardization
  • BLAS 1973-1977
  • Consensus on
  • Names
  • Calling sequences
  • Functional Descriptions
  • Low-level linear algebra operations (a call sketch
    follows this list)
  • Success results from
  • Extensive Public Involvement
  • Careful consideration of implications
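
To make the call sketch concrete: a minimal example (not from the talk; program name and data are made up) of calling the Level 1 BLAS routine DAXPY, which computes y <- alpha*x + y. The argument list (N, ALPHA, X, INCX, Y, INCY) is exactly the kind of consensus on names and calling sequences that the 1973-1977 effort produced.

      PROGRAM AXPDEM
*     Minimal sketch (not from the talk): call the Level 1 BLAS
*     routine DAXPY to compute Y <- ALPHA*X + Y.  The data are
*     made up; link against any reference or vendor BLAS.
      INTEGER N, I
      PARAMETER ( N = 5 )
      DOUBLE PRECISION ALPHA, X( N ), Y( N )
      ALPHA = 2.0D0
      DO 10 I = 1, N
         X( I ) = DBLE( I )
         Y( I ) = 1.0D0
   10 CONTINUE
      CALL DAXPY( N, ALPHA, X, 1, Y, 1 )
      WRITE( *, * ) 'Y =', Y
      END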

10
CRAY and VAX
  • VAXination of groups and departments
  • Cray's introduction of vector computing
  • Both had a significant impact on scientific
    computing

11
LINPACK
  • June 1974, ANL
  • Jim Pool's meeting
  • February 1975, ANL
  • groundwork for project
  • January 1976,
  • funded by NSF and DOE ANL/UNM/UM/UCSD
  • Summer 1976 - Summer 1977
  • Fall 1977
  • Software to 26 test sites
  • December 1978
  • Software released, NESC and IMSL

12
LINPACK
  • Research into the
    mechanics of software production.
  • Provide a yardstick against which future software
    would be measured.
  • Produce library used by people and those that
    wished to modify/extend software to handle
    special problems.
  • Hoped it would be used in the classroom
  • Machine independent and efficient
  • No mixed-mode arithmetic

13
LINPACK Efficiency
  • Condition estimator
  • Inner loops via BLAS
  • Column access
  • Unrolling loops (done for the BLAS; see the sketch
    after this list)
  • BABE Algorithm for tridiagonal matrices
  • TAMPR system allowed easy generation of
    versions
  • LINPACK Benchmark
  • Today reports machines from Cray T90 to Palm
    Pilot
  • (1.0 Gflop/s to 1.4 Kflop/s)
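
A minimal sketch, not taken from the LINPACK sources, of what "unrolling loops" means for a DAXPY-style inner loop: the body is replicated four times so that independent operations are exposed to the compiler and hardware.

      SUBROUTINE AXPY4( N, ALPHA, X, Y )
*     Illustrative only: Y <- Y + ALPHA*X with the loop unrolled by 4.
*     The first loop cleans up the N mod 4 leftover elements.
      INTEGER N, I, M
      DOUBLE PRECISION ALPHA, X( * ), Y( * )
      M = MOD( N, 4 )
      DO 10 I = 1, M
         Y( I ) = Y( I ) + ALPHA*X( I )
   10 CONTINUE
      DO 20 I = M + 1, N, 4
         Y( I )     = Y( I )     + ALPHA*X( I )
         Y( I + 1 ) = Y( I + 1 ) + ALPHA*X( I + 1 )
         Y( I + 2 ) = Y( I + 2 ) + ALPHA*X( I + 2 )
         Y( I + 3 ) = Y( I + 3 ) + ALPHA*X( I + 3 )
   20 CONTINUE
      RETURN
      END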

14
1980s Vector Computing (Parallel Processing)
  • 1982
  • Illiac IV decommissioned
  • Steve Chen's group at Cray produces X-MP
  • First Denelcor HEP installed (0.21 Mflop/s)
  • Sun Microsystems, Convex, and Alliant founded
  • 1983
  • Total computers in use in the US exceeds 10 M
  • DARPA starts Strategic Computing Initiative
  • Helps fund Thinking Machines, BBN, WARP
  • Cosmic Cube hypercube running at Caltech
  • John Palmer, after seeing Caltech machine, leaves
    Intel to found Ncube
  • Encore , Sequent, TMC, SCS, Myrias founded
  • Cray 2 introduced
  • NEC SX-1 and SX-2, Fujitsu ships VP-200
  • ETA System spun off from CDC
  • Golub & Van Loan published
  • 1980
  • Total computers in use in the US exceeds 1M
  • CDC Introduces Cyber 205
  • 1981
  • IBM introduces the PC
  • Intel 8088/DOS
  • BBN Butterfly delivered
  • FPS delivers FPS-164
  • Start of mini-supercomputer market
  • SGI founded by Jim Clark and others
  • Loop unrolling at outer level for data locality
    and parallelism
  • Amounts to matrix-vector operations
  • Cuppen's divide-and-conquer method for the
    symmetric eigenvalue problem published
  • Talk at the Oxford Gatlinburg meeting (1980)

15
1980s continued
  • 1985
  • IEEE Standard 754 for floating point
  • IBM delivers 3090 vector: 16 Mflop/s Linpack, 138
    peak
  • TMC demos CM1 to DARPA
  • Intel produces first iPSC/1 Hypercube
  • 80286 connected via Ethernet controllers
  • Fujitsu VP-400, NEC SX-2, Cray 2, Convex C1
  • Ncube/10: 0.1 Mflop/s Linpack, 1 processor
  • FPS-264: 5.9 Mflop/s Linpack, 38 peak
  • IBM begins RP3 project
  • Stellar (Poduska), Ardent (Michels), and Supertek
    Computers founded
  • Denelcor closes doors
  • 1984
  • NSFNET: 5,000 computers, 56 Kb/s lines
  • MathWorks founded
  • EISPACK third release
  • Netlib begins 1/3/84
  • Level 2 BLAS activity started
  • Gatlinburg (Waterloo), Purdue, SIAM
  • Intel Scientific Computers started by J. Rattner
  • to produce a commercial hypercube
  • Cray X-MP 1 processor, 21 Mflop/s Linpack
  • Multiflow founded by J. Fisher; VLIW architecture
  • Apple introduces the Mac; IBM introduces the PC AT
  • IJK paper

16
1980s (Lost Decade for Parallel Software)
  • 1986
  • # of computers in US exceeds 30 M
  • TMC ships CM-1 64K 1 bit processors
  • Cray X-MP
  • IBM and MIPS release first RISC WS
  • 1987
  • ETA Systems family of supercomputers
  • Sun Microsystems introduces its first RISC WS
  • IBM invests in Steve Chen's SSI
  • Cray Y-MP
  • First NA-DIGEST
  • Level 3 BLAS work begun
  • LAPACK prospectus: Development of a Linear Algebra
    Library for HPC
  • 1988
  • AMT delivers first re-engineered DAP
  • Intel produces iPSC/2
  • Stellar and Ardent begin delivering single user
    graphics workstations
  • Level 2 BLAS paper published
  • 1989
  • # of computers in the US > 50 M
  • Stellar and Ardent merge, forming Stardent
  • S. Cray leaves Cray Research to form Cray
    Computer
  • Ncube 2nd generation machine
  • ETA out of business
  • Intel 80486 and i860: 1 M transistors
  • i860: RISC, 64-bit floating point

17
EISPACK 3 and BLAS 2 & 3
  • Machine independence for EISPACK
  • Reduce the possibility of overflow and underflow.
  • Mods to the Algol from S. Hammarling
  • Rewrite reductions to tridiagonal form to
    involve sequential access to memory
  • Official Double Precision version
  • Inverse iteration routines modified to reduce the
    size for reorthogonalization.
  • BLAS (Level 1) vector operations provide for too
    much data movement.
  • Community effort to define extensions
  • Matrix-vector ops (Level 2)
  • Matrix-matrix ops (Level 3); see the sketch after
    this list
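
A sketch of the two kinds of extensions (the routine and array names below are hypothetical): DGEMV is a Level 2 matrix-vector product and DGEMM a Level 3 matrix-matrix product. The Level 3 routine performs O(n**3) operations on O(n**2) data, which is what makes data reuse in the memory hierarchy possible.

      SUBROUTINE BLAS23( M, N, K, A, LDA, B, LDB, C, LDC, X, Y )
*     Illustrative only; A, B, C, X, Y are hypothetical arrays.
      INTEGER M, N, K, LDA, LDB, LDC
      DOUBLE PRECISION A( LDA, * ), B( LDB, * ), C( LDC, * ),
     $                 X( * ), Y( * )
      DOUBLE PRECISION ONE, ZERO
      PARAMETER ( ONE = 1.0D0, ZERO = 0.0D0 )
*     Level 2 BLAS: y <- A*x (A is M by K, x has K elements)
      CALL DGEMV( 'No transpose', M, K, ONE, A, LDA, X, 1,
     $            ZERO, Y, 1 )
*     Level 3 BLAS: C <- A*B + C (A is M by K, B is K by N)
      CALL DGEMM( 'No transpose', 'No transpose', M, N, K, ONE,
     $            A, LDA, B, LDB, ONE, C, LDC )
      RETURN
      END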

18
Netlib - Mathematical Software and Data
  • Began in 1985
  • JD and Eric Grosse, AT&T Bell Labs
  • Motivated by the need for cost-effective, timely
    distribution of high-quality mathematical
    software to the community.
  • Designed to send, by return electronic mail,
    requested items (see the example request after
    this list).
  • Automatic mechanism for electronic dissemination
    of freely available software.
  • Still in use and growing
  • Mirrored at 9 sites around the world
  • Moderated collection /Distributed maintenance
  • NA-DIGEST and NA-Net
  • Gene Golub, Mark Kent and Cleve Moler
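
As an illustration of the e-mail interface (the address and exact phrasing here are from memory and should be treated as an assumption rather than a transcription), a request was a short message whose body contained lines such as:

    To: netlib@ornl.gov

    send index
    send dgefa from linpack

The server parsed the body and mailed back the index or the requested routine by return mail.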

19
Netlib Growth
Just over 6,000 requests in 1985; over 29,000,000
total. Over 9 million hits in 1997, 5.4 million so
far in 1998.
LAPACK is the best seller: 1.6 M hits
20
1990s
  • 1991
  • Stardent to sell business and close
  • Cray C-90
  • Kendall Square Research delivers 32 processor
    KSR-1
  • TMC produces CM-200 and announces CM-5 MIMD
    computer
  • DEC announces the Alpha
  • TMC produces the first CM-5
  • Fortran 90
  • Workshop to consider Message Passing Standard,
    beginnings of MPI
  • Community effort
  • Xnetlib running
  • 1992
  • LAPACK software released; Users' Guide published
  • 1990
  • Internet
  • World Wide Web
  • Motorola introduces 68040
  • NEC ships SX-3, the first Japanese parallel
    vector supercomputer
  • IBM announces RS/6000 family
  • has FMA instruction
  • Intel hypercube based on 860 chip
  • 128 processors
  • Alliant delivers FX/2800 based on i860
  • Fujitsu VP-2600
  • PVM project started
  • Level 3 BLAS Published

21
(No Transcript)
22
Parallel Processing Comes of Age
  • "There are three rules for programming parallel
    computers.
  • We just don't know what they are yet." --
    Gary Montry
  • Embarrassingly Parallel, Cleve Moler
  • Humiliatingly Parallel

23
Memory Hierarchy and LAPACK
  • ijk implementations
  • Affects the order in which data is referenced;
    some orderings are better at keeping data in the
    higher levels of the memory hierarchy (see the
    sketch after this list).
  • Applies for matrix multiply, reductions to
    condensed form
  • May do slightly more
    flops
  • Up to 3 times faster
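
A minimal sketch (not from the slide) of one of those orderings for C <- C + A*B: in the jki form the innermost loop walks down a column of A and of C, which is stride-1 access in Fortran's column-major storage and therefore keeps data in cache and registers longer than, say, the ijk form.

      SUBROUTINE MMJKI( M, N, K, A, LDA, B, LDB, C, LDC )
*     Illustrative jki ordering of C <- C + A*B.
      INTEGER M, N, K, LDA, LDB, LDC, I, J, L
      DOUBLE PRECISION A( LDA, * ), B( LDB, * ), C( LDC, * )
      DO 30 J = 1, N
         DO 20 L = 1, K
            DO 10 I = 1, M
*              Innermost loop: unit-stride access to A(:,L) and C(:,J)
               C( I, J ) = C( I, J ) + A( I, L )*B( L, J )
   10       CONTINUE
   20    CONTINUE
   30 CONTINUE
      RETURN
      END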

24
Why Higher Level BLAS?
  • Can only do arithmetic on data at the top of the
    hierarchy
  • Higher level BLAS lets us do this
  • Development of blocked algorithms important for
    performance

25
History of Block Partitioned Algorithms
  • Ideas not new.
  • Early algorithms involved use of small main
    memory using tapes as secondary storage.
  • Recent work centers on use of vector registers,
    level 1 and 2 cache, main memory, and out of
    core memory.

26
LAPACK
  • Linear Algebra library in Fortran 77
  • Solution of systems of equations
  • Solution of eigenvalue problems
  • Combine algorithms from LINPACK
    and EISPACK into a single package
  • Block algorithms
  • Efficient on a wide range of computers
  • RISC, Vector, SMPs
  • User interface similar to LINPACK
  • Single, Double, Complex, Double Complex
  • Built on the Level 1, 2, and 3 BLAS
  • HP-48G to CRAY T-90

27
(No Transcript)
28
1990s continued
  • 1993
  • Intel Pentium systems start to ship
  • ScaLAPACK Prototype software released
  • First portable library for distributed memory
    machines
  • Intel, TMC and workstations using PVM
  • PVM 3.0 available
  • 1994
  • MPI-1 Finished
  • 1995
  • Templates project
  • 1996
  • Internet 34M Users
  • Nintendo 64
  • More computing power than a Cray 1 and much much
    better graphics
  • 1997
  • MPI-2 Finished
  • Fortran 95
  • 1998
  • Issues of parallel and numerical stability
  • Divide time
  • DSM architectures
  • "New" Algorithms
  • Chaotic iteration
  • Sparse LU w/o pivoting
  • Pipeline HQR
  • Graph partitioning
  • Algorithmic bombardment

29
Templates Project
  • Iterative methods for large sparse systems
  • Communicate to HPC community State of the Art
    algorithms
  • Subtle algorithmic issues addressed, e.g.
    convergence, preconditioners, data structures
  • Performance and parallelism considerations
  • Gave the computational scientists algorithms in
    the form they wanted.

30
ScaLAPACK
  • Library of software for dense & banded matrices
  • Sparse direct being developed
  • Distributed Memory - Message Passing
  • PVM and MPI
  • MIMD Computers, Networks of Workstations, and
    Clumps of SMPs
  • SPMD Fortran 77 with object based design
  • Built on various modules
  • PBLAS (BLACS and BLAS)
  • PVM, MPI, IBM SP, CRI T3, Intel, TMC
  • Provides right level of notation.

31
High-Performance Computing Directions
  • Move toward shared memory
  • SMPs and Distributed Shared Memory
  • Shared address space w/deep memory hierarchy
  • Clustering of shared memory machines for
    scalability
  • Emergence of PC commodity systems
  • Pentium based, NT or Linux driven
  • At UTK, a cluster of 14 (dual) Pentium-based
    nodes: 7.2 Gflop/s
  • Efficiency of message passing and data parallel
    programming
  • Helped by standards efforts such as PVM, MPI and
    HPF
  • Complementing supercomputing with metacomputing
    → the Computational Grid

32
Heterogeneous Computing
  • Heterogeneity introduces new bugs in parallel
    code
  • Slightly different floating point arithmetic can
    make data-dependent branches go different ways
    when we expect identical behavior.
  • A correct algorithm on a network of identical
    workstations may fail if a slightly different
    machine is introduced.
  • Some are easy to fix (compare s < tol on 1 proc
    and broadcast the result; see the sketch after
    this list)
  • Some are hard to fix (handling denorms; getting
    the same answer independent of the # of procs)
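
A sketch of the "easy fix", written here with MPI for concreteness (the routine and variable names are hypothetical): one designated process evaluates the data-dependent test and broadcasts its decision, so every process takes the same branch even when the local floating point arithmetic differs slightly.

      SUBROUTINE CNVTST( S, TOL, DONE )
*     Illustrative only: process 0 decides whether S < TOL and
*     broadcasts the decision so all processes branch identically.
      INCLUDE 'mpif.h'
      DOUBLE PRECISION S, TOL
      LOGICAL DONE
      INTEGER RANK, IERR
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, RANK, IERR )
      IF( RANK.EQ.0 ) DONE = ( S.LT.TOL )
      CALL MPI_BCAST( DONE, 1, MPI_LOGICAL, 0, MPI_COMM_WORLD, IERR )
      RETURN
      END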

33
Java - For Numerical Computations?
  • Java likely to be a dominant language.
  • Provides for machine independent code.
  • C-like language
  • No pointers, gotos, overloading of arithmetic
    operators, or explicit memory deallocation
  • Portability achieved via abstract machine
  • Java is a convenient user interface builder which
    allows one to quickly develop customized
    interfaces.

34
Network Enabled Servers
  • Allow networked resources to be
    integrated into the desktop.
  • Many hosts, co-existing in a loose
    confederation tied together with
    high-speed links.
  • Users have the illusion of a very powerful
    computer on the desk.
  • Locate and deliver software or solutions to the
    user in a directly usable and conventional
    form.
  • Part of the motivation: software maintenance

35
Future Petaflops (10^15 fl pt ops/s)
  • dynamic redistribution of
    workload
  • new language and constructs
  • role of numerical libraries
  • algorithm adaptation to hardware failure
  • A Pflop/s machine for 1 second ≈ a typical
    workstation computing for 1 year.
  • From an algorithmic standpoint
  • concurrency
  • data locality
  • latency & synchronization
  • floating point accuracy
  • May be feasible and affordable
    by the
    year 2010

36
Summary
  • As a community we have a lot to be proud of in
    terms of the algorithms and software we have
    produced.
  • generality, elegance, speed, or economy of
    storage
  • Software still being used, in many cases 30 years
    later

37
(No Transcript)
38
End of talk
39
BLAS Were Intended To
  • Be a design tool for the development of software
    in numerical linear algebra
  • Improve readability and aid documentation
  • Aid modularity and maintenance, and improve
    robustness of software calling the BLAS
  • Promote efficiency
  • Improve portability, without sacrificing
    efficiency, through standardization

40
LINPACK Benchmark
  • Appeared in the Users' Guide
  • Dense system of linear equations
  • LU decomposition via Gaussian elimination.
  • Users Guide Appendix
  • Reported AXPY op time
  • Has a life of its own
  • Today reports machines from Cray T90
    to Palm Pilot
  • (1.0 Gflop/s to 1.4 Kflop/s)

41
Future Research Directions and Challenges
Prediction is hard.
42
ScaLAPACK Structure
Global
Local
43
Metacomputing Objectives
  • Flexibility and extensibility
  • Site autonomy
  • Scalable architecture
  • Single global namespace
  • Easy-to-use, seamless environment
  • High performance through parallelism
  • Security
  • Management / exploitation of heterogeneity
  • Multilanguage interoperability
  • Fault-tolerance

44
Grid Based Computations
  • Long running computations
  • Grid based computation
  • Network of Workstations
  • Fault tolerance
  • Reproducibility of solution
  • Auditability of computation

45
Heterogeneous Conclusions
  • Defensive programming
  • Machine parameters
  • Communication and representation between
    processors
  • Controlling processor
  • Additional communication
  • Testing strategies

46
Metacomputing Issues
  • Logically only one system
  • System should determine where the
    computation executes
  • Fault-tolerance should be transparent
  • Applications freed from the mechanics of
    distributed programming
  • Self configuring with new resources added
    automatically
  • Just-in-time binding of software, data, and
    hardware.
  • User shouldn't have to decide where the
    computation is performed
  • User would be able to walk up to a machine and
    have their program and data follow them

47
Grid Computing and Numerical Computations
  • The Computational Grid raises certain challenges
    from a numerical standpoint.
  • Some users would like/want/expect/demand the same
    results they obtained yesterday the next time
    they run the application.
  • How can we guarantee reproducibility?

48
Heterogeneous Computing
  • Software intended to be used in this context
  • Machine precision and other machine specific
    parameters
  • Communication of fl. pt. numbers between
    processors
  • Repeatability -
  • run to run differences, e.g. order in which
    quantities summed
  • Coherency -
  • within run differences, e.g. arithmetic varies
    from proc to proc
  • Iterative convergence across clusters of
    processors
  • Defensive programming required
  • Important for the Computational Grid

49
Machine Precision
  • Smallest floating point number u such that 1 + u
    > 1 (a sketch follows this list).
  • Used in
  • Convergence tests -- has an iteration converged?
  • Tolerance tests -- is a matrix numerically
    singular?
  • Error analysis -- how does floating point
    arithmetic affect results?
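
A minimal sketch (not from the slide) of the textbook way to estimate u, together with the LAPACK inquiry routine DLAMCH mentioned on the next slide. Note that extended-precision registers can fool the simple loop, which is exactly why library routines such as EISPACK's epslon (shown later) go to more trouble.

      PROGRAM EPSDEM
*     Illustrative only: halve U until 1 + U is indistinguishable
*     from 1, then compare with LAPACK's DLAMCH('Epsilon').
      DOUBLE PRECISION U, DLAMCH
      EXTERNAL DLAMCH
      U = 1.0D0
   10 IF( 1.0D0 + U / 2.0D0 .GT. 1.0D0 ) THEN
         U = U / 2.0D0
         GO TO 10
      END IF
      WRITE( *, * ) 'loop estimate of u   :', U
      WRITE( *, * ) 'DLAMCH( ''Epsilon'' ):', DLAMCH( 'Epsilon' )
      END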

50
Heterogeneous Machine Precision
  • Return the maximum relative machine precision
    over all the processors
  • LAPACK --> DLAMCH
  • ScaLAPACK --> PDLAMCH
  • Note that even on IEEE machines, the machine
    precision can vary slightly. For double
    precision, eps = 2^-53 + q, where q = 2^-105 or
    q = 2^-64 + 2^-105.

51
Grid Enabled Math Software
  • Predictability and robustness of accuracy and
    performance.
  • Run-time resource management and algorithm
    selection.
  • Support for a multiplicity of programming
    environments.
  • Reproducibility, fault tolerance, and auditability
    of the computations.
  • New algorithmic techniques for latency tolerant
    applications.

52
Heterogeneous Computing
  • Software intended to be used in this context
  • Machine precision and other machine specific
    parameters
  • Communication of fl. pt. numbers between
    processors
  • Repeatability -
  • run to run differences, e.g. order in which
    quantities summed
  • Coherency -
  • within run differences, e.g. arithmetic varies
    from proc to proc
  • Iterative convergence across clusters of
    processors
  • Defensive programming required
  • Important for the Computational Grid

53
Automatically Tuned Numerical Software
  • Automatic generation of computational kernels for
    RISC architectures.
  • A package that adapts itself to differing
    architectures via code generation coupled with
    timing (see the sketch after this list)
  • Code generator takes about 1-2 hours to run.
  • Done once for a new architecture.
  • Written in ANSI C (generates C)
  • Today Level 2 and 3 BLAS
  • Extension to higher level
    operations.
  • SMPs
  • SGI/Vector
  • DSP
  • FFTs
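
A minimal sketch of the adapt-by-timing idea (this is not the actual code generator): try several panel widths for a blocked update built on DGEMM, time each, and keep the fastest. It assumes a linked BLAS library and a Fortran 90 compiler for SYSTEM_CLOCK; C accumulates across trials, but only the timings matter here.

      SUBROUTINE PICKNB( N, A, B, C, LDA, BEST )
*     Illustrative only: time C <- C + A*B done as rank-NB updates
*     through DGEMM for several candidate panel widths NB and keep
*     the fastest one.
      INTEGER N, LDA, BEST
      DOUBLE PRECISION A( LDA, * ), B( LDA, * ), C( LDA, * )
      INTEGER NBTRY( 4 ), K, J, JB, T0, T1, RATE
      DOUBLE PRECISION TBEST, T, ONE
      PARAMETER ( ONE = 1.0D0 )
      DATA NBTRY / 16, 32, 64, 128 /
      TBEST = -1.0D0
      DO 30 K = 1, 4
         CALL SYSTEM_CLOCK( T0, RATE )
         DO 20 J = 1, N, NBTRY( K )
            JB = MIN( NBTRY( K ), N - J + 1 )
            CALL DGEMM( 'N', 'N', N, N, JB, ONE, A( 1, J ), LDA,
     $                  B( J, 1 ), LDA, ONE, C, LDA )
   20    CONTINUE
         CALL SYSTEM_CLOCK( T1, RATE )
         T = DBLE( T1 - T0 ) / DBLE( RATE )
         IF( TBEST.LT.0.0D0 .OR. T.LT.TBEST ) THEN
            TBEST = T
            BEST  = NBTRY( K )
         END IF
   30 CONTINUE
      RETURN
      END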

54
Why Such a System is Needed
  • BLAS require many man-hours per platform
  • Only done if financial incentive is there
  • Many platforms will never have an optimal version
  • Lags behind hardware
  • May not be affordable by everyone
  • Allows for portably optimal codes
  • Package contains
  • Code generators
  • Sophisticated timers
  • Robust search routines

55
Future Research Directions and Challenges
Prediction is hard.
56
Future Research Directions and Challenges
Prediction is hard. Especially the future . . .
Yogi Berra, philosopher in the Baseball Hall of
Fame
  • Numerical behavior in a heterogeneous
    environment.
  • Efficient software for core routines.
  • Fault tolerant aspects
  • "New" Algorithms
  • Chaotic iteration
  • Divide and Conquer
  • Sparse LU w/o pivoting
  • Pipeline HQR
  • Graph partitioning
  • Algorithmic bombardment

57
Research Directions
  • Grid enabled strategies
  • Parameterizable libraries
  • Annotated libraries
  • Hierarchical algorithm libraries
  • A new division of labor among compiler writers,
    library writers, algorithm developers, and
    application developers will emerge.

58
Motivation for NetSolve
59
Motivation for NetSolve
60
Motivation for NetSolve
Design an easy-to-use tool to provide efficient
and uniform access to a variety of scientific
packages on UNIX platforms
Basics
  • Client-Server Design
  • Non-hierarchical system
  • Load Balancing and Fault Tolerance
  • Heterogeneous Environment Supported
  • Multiple and simple client interfaces

http://www.cs.utk.edu/netsolve/
61
NetSolve - The Big Picture
http://www.cs.utk.edu/netsolve/
62
Network Computing Classifications
  • Code shipping (Java model)
  • Program: Server → Client
  • Data: Client
  • Result: Client
  • Remote computing (NetSolve model)

  • Program: Server
  • Data: Client → Server
  • Result: Server → Client

http://www.cs.utk.edu/netsolve/
63
NetSolve - Typical Utilizations
  • Intranet, Extranet, Internet
  • Proprietary, Collaborative, Open, Controlled,
    ...
  • Used by scientists as a computational engine
  • Customized local servers
  • Operating environment for higher level
    applications
  • Client/Server/Agent could be on the same machine.

64
NetSolve - The Client
  • Multiple interfaces
  • C, FORTRAN, MATLAB, Mathematica
  • Java GUI and API
  • Natural problem specification
  • some input objects
  • some output objects
  • a name (example LinearSystem)

65
Architecture features
  • Cache
  • Goes back to the ATLAS computer
  • pipelining
  • CDC 7600
  • vector instructions
  • chaining
  • overlapping
  • Loop unrolling

66
  • Linpack benchmark
  • LINPACK is known as a benchmark, but it is really
    a collection of software routines for solving
    systems of linear equations.
  • Some vendors, I am told, have been accused of
    having a switch which is called the "LINPACK
    switch", a program device to recognize the
    LINPACK program and try to optimize based on
    that.

67
  • Data locality
  • Give Cyber 205 example

68
Problem Solving Environments Computational Grid
  • Many hosts, co-existing in a loose
    confederation tied together with
    high-speed links.
  • Users have the illusion of a very powerful
    computer on the desk.
  • System provides all computational facilities to
    solve target class of problems
  • Automatic or semiautomatic selection of solution
    methods; ways to easily incorporate novel
    solution methods.
  • Communicate using language of target class of
    problems.
  • Allow users to do computational steering.

69
The Importance of Standards - Software
  • Writing programs for MPP is hard ...
  • But ... one-off efforts if written in a standard
    language
  • Past lack of parallel programming standards ...
  • has restricted uptake of technology (to
    "enthusiasts")
  • reduced portability (over a range of
    current architectures and between future
    generations)
  • lost decade
  • Now standards exist (PVM, MPI & HPF), which ...
  • allow users & manufacturers to protect software
    investment
  • encourage growth of a "third party" parallel
    software industry & parallel versions of widely
    used codes
  • Others: POSIX, CORBA, ...

70
LAPACK Linear Algebra Library in F77
  • Solution of systems of equations Ax = b for
    general dense, symmetric, banded, tridiagonal
    matrix A,
  • Solution of linear least squares problems
    min_x ||Ax - b||_2 for general dense matrix A,
  • Singular value decomposition A = U Σ V^T for
    general dense, banded matrix A,
  • Solution of eigenvalue problems Ax = λx for
    general dense, symmetric, banded matrix A,
  • Solution of general linear least squares, (GSEP),
    (GNEP), (GSVD),
  • Error bounds, condition estimation for (almost)
    everything,
  • Real and complex, single and double precision,
  • Easy-to-use drivers plus computational kernels.

71
LAPACK Ongoing Work
  • Add functionality
  • updating/downdating, divide and conquer least
    squares,bidiagonal bisection, bidiagonal inverse
    iteration, band SVD, Jacobi methods, ...
  • Move to new generation of high performance
    machines
  • IBM SPs, CRAY T3E, SGI Origin, clusters of
    workstations
  • New challenges
  • New languages: Fortran 90, HPF, ...
  • Many flavors of message passing (CMMD, MPL, NX,
    ...); need a standard (PVM, MPI) → BLACS
  • Highly varying ratio
  • Many ways to lay out data,
  • Fastest parallel algorithm sometimes less stable
    numerically.

72
LAPACK Blocked Algorithms
      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit',
     $               J-1, JB, ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE,
     $               A( 1, J ), LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE
  • On Y-MP, L3 BLAS squeezes a little more out of 1
    proc, but
  • makes a large improvement when using 8 procs.

73
Challenges in Developing Distributed Memory
Libraries
  • How to integrate software?
  • Until recently no standards
  • Many parallel languages
  • Various parallel programming models
  • Assumptions about the parallel environment
  • granularity
  • topology
  • overlapping of communication/computation
  • development tools
  • Where is the data?
  • Who owns it?
  • Optimal data distribution
  • Who determines data layout
  • Determined by user?
  • Determined by library developer?
  • Allow dynamic data dist.
  • Load balancing

74
Heterogeneous Computing
  • Software intended to be used in this context
  • Machine precision and other machine specific
    parameters
  • Communication of fl. pt. numbers between
    processors
  • Repeatability -
  • run to run differences, e.g. order in which
    quantities summed
  • Coherency -
  • within run differences, e.g. arithmetic varies
    from proc to proc
  • Iterative convergence across clusters of
    processors
  • Defensive programming required

75
Portability of Numerical Software
  • Numerical software should work portably on
    different computers, with good numerical accuracy,
    stability, and robustness of solution.
  • EISPACK - DEC, IBM, CDC (early 70s)
  • LINPACK - Vector, pipelined (late 70s)
  • LAPACK - Vector, RISC, SMPs (late 80s early
    90s)
  • ScaLAPACK (late 90s)
  • MPP
  • DSM
  • Clumps
  • Heterogeneous NOWs

76
ScaLAPACK
  • LAPACK software expertise/quality
  • Numerical methods
  • Library of software dealing with dense & banded
    routines
  • Distributed Memory - Message Passing
  • MPP Computers and NOWs
  • SPMD Fortran 77 with object based design
  • Built on various modules
  • PBLAS Interprocessor communication
  • BLAS/BLACS
  • PVM, MPI, IBM SP, CRI T3, Intel, TMC
  • Provides right level of notation.

77
Choosing a Data Distribution
  • Main issues are
  • Load balancing
  • Use of the Level 3 BLAS
  • 1D block and cyclic column distributions
  • 1D block-cyclic column and 2D block-cyclic
    distribution
  • 2D block-cyclic used in ScaLAPACK for dense
    matrices (see the sketch after this list)
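
A sketch of the index arithmetic behind the 2D block-cyclic distribution (the routine name, and the assumption that the first block lives on process (0, 0), are illustrative): global entry (IG, JG) sits in global block ((IG-1)/NB, (JG-1)/NB), and blocks are dealt out cyclically over the NPROW-by-NPCOL process grid.

      SUBROUTINE BCOWNR( IG, JG, NB, NPROW, NPCOL, PROW, PCOL )
*     Illustrative only: process-grid coordinates (PROW, PCOL) that
*     own global entry (IG, JG) under an NB-by-NB 2D block-cyclic
*     distribution whose first block lives on process (0, 0).
      INTEGER IG, JG, NB, NPROW, NPCOL, PROW, PCOL
      PROW = MOD( ( IG - 1 ) / NB, NPROW )
      PCOL = MOD( ( JG - 1 ) / NB, NPCOL )
      RETURN
      END

Cycling the blocks keeps the load balanced as a factorization shrinks the active part of the matrix, while the NB-by-NB blocks keep the local work inside Level 3 BLAS.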

78
Programming Style
  • SPMD Fortran 77 with object based design
  • Built on various modules
  • PBLAS Interprocessor communication
  • BLACS
  • PVM, MPI, IBM SP, CRI T3, Intel, TMC
  • Provides right level of notation.
  • BLAS
  • LAPACK software expertise/quality
  • software approach
  • numerical methods

79
Parallelism in ScaLAPACK
  • Level 3 BLAS block operations
  • All the reduction routines
  • Pipelining
  • QR Algorithm, Triangular Solvers, classic
    factorizations
  • Redundant computations
  • Condition estimators
  • Static work assignment
  • Bisection
  • Task parallelism
  • Sign function eigenvalue computations
  • Divide and Conquer
  • Tridiagonal and band solvers, symmetric
    eigenvalue problem and Sign function
  • Cyclic reduction
  • Reduced system in the band solver
  • Data parallelism
  • Sign function

80
ScaLAPACK - Whats Included
  • Timing and testing routines for almost all
    routines; these are a large component of the
    package
  • Prebuilt libraries available for SP, PGON, HPPA,
    DEC, Sun, RS6K

81
Direct Sparse Solvers
  • CAPSS is a package to solve Ax = b on a
    message-passing multiprocessor; the matrix A is
    SPD and associated with a mesh in 2 or 3D.
    (Version for Intel only.)
  • SuperLU and UMFPACK - sequential implementations
    of Gaussian elimination with partial pivoting.

82
Parallel Iterative Solvers
  • Many packages for message passing systems
  • PETSc, AZTEC, BLOCKSOLVE95, P-Sparselib,...
  • PETSc: a set of libraries and data structures that
    assist in the solution of PDEs and related
    problems.
  • Parallel algebraic data structures, solvers, and
    related infrastructure required for solving PDEs
    using implicit methods involving finite elements,
    finite differences, or finite volumes.

83
Parallel Sparse Eigenvalue Solvers
  • P_ARPACK (D. Sorensen et al)
  • Designed to compute a few eigenvalues and
    corresponding eigenvectors of a general matrix.
  • Appropriate for large sparse or structured
    matrices A
  • This software is based on a variant of the Arnoldi
    process called the Implicitly Restarted Arnoldi
    Method.
  • Reverse Communication Interface.

84
Java for Numerical Computations?
  • A few months ago: 1 Mflop/s on a 600 Mflop/s
    processor.
  • Top performer today is 46 Mflop/s for a P6 using
    MS Explorer 3.0 JIT (62 Mflop/s in Fortran).

http://www.netlib.org/benchmark/linpackjava/
85
LAPACK to JAVA
  • Allows Java programmers access to BLAS/LAPACK
    routines.
  • Working on translator to go from LAPACK to Java
    Byte Code
  • lap2j: a formal compiler of a subset of f77
    sufficient for BLAS & LAPACK
  • Plan to enable all of LAPACK
  • Compiler provides quick, reliable translation.
  • Focus on LAPACK Fortran
  • Simple - no COMMON, EQUIVALENCE, SAVE, I/O

86
Parameterized Libraries
  • Architecture features for performance
  • Cache size
  • Latency
  • Bandwidth
  • Latency tolerant algorithms
  • Latency reduction
  • Granularity management
  • High levels of concurrency
  • Issues of precision

87
Motivation for Network Enabled Solvers
Design an easy-to-use tool to provide efficient
and uniform access to a variety of scientific
packages on UNIX platforms
  • Client-Server Design
  • Non-hierarchical system
  • Load Balancing
  • Fault Tolerance
  • Heterogeneous Environment Supported

88
NetSolve-Future Paradigm
Software lookup
NetSolve Agent
Dynamic Hardware/Software Binding
Hardware lookup
NetSolve Client
NetSolve Server
89
On Mathematical Software
  • Labor intensive activity
  • Enabling Technology
  • Helped many computational scientists
  • Provides a basis for many commercial efforts
  • Often software people not given credit
  • Lip Service given to work
  • Funding to universities and laboratories
  • NSF, DOE, DARPA, AFOSR, ..

90
References
  • Copy of slides at
    http://www.netlib.org/utk/people/JackDongarra/siam-797/
  • Annotated table of freely available software at
    http://www.netlib.org/utk/people/JackDongarra/la-sw.html
  • http://www.netlib.org/
  • http://www.nhse.org/
  • http://www.nhse.org/hpc-netlib/

91
  • Design started in 1955 by IBM after it lost the
    bid for the U of C Radiation Lab (LLNL)
  • Univac won with the LARC
  • Was to be 100 X faster than what was available
  • At IBM Fred Brooks, John Cocke, Erich Bloch, . .
    .
  • Look ahead unit
  • Memory operand prefetch
  • Out of order execution
  • Speculative execution based on branch prediction
  • Partitioned function units

92
Stretch
  • about

93
      double precision function pythag(a,b)
      double precision a,b
c
c     finds dsqrt(a**2+b**2) without overflow or
c     destructive underflow
c
      double precision p,r,s,t,u
      p = dmax1(dabs(a),dabs(b))
      if (p .eq. 0.0d0) go to 20
      r = (dmin1(dabs(a),dabs(b))/p)**2
   10 continue
         t = 4.0d0 + r
         if (t .eq. 4.0d0) go to 20
         s = r/t
         u = 1.0d0 + 2.0d0*s
         p = u*p
         r = (s/u)**2 * r
      go to 10
   20 pythag = p
      return
      end

94
  • Nintendo 64
  • more computing power than a Cray 1 and also much
    better graphics

95
      double precision function epslon (x)
      double precision x
c
c     estimate unit roundoff in quantities of
c     size x.
c
      double precision a,b,c,eps
c
c     this program should function properly on
c     all systems satisfying the following two
c     assumptions,
c        1. the base used in representing floating
c           point numbers is not a power of three.
c        2. the quantity  a  in statement 10 is
c           represented to the accuracy used in
c           floating point variables that are
c           stored in memory.
c     the statement number 10 and the go to 10
c     are intended to force optimizing compilers
c     to generate code satisfying assumption 2.
c     under these assumptions, it should be true
c     that,
c            a  is not exactly equal to
c            four-thirds,

96
Outline
  • Architectural Opportunities and Challenges.
  • Focus on High-Performance Computers
  • MPP/Clusters/NOWs
  • Language
  • Fortran 77, 90
  • HPF
  • C, C++, Java
  • Dense direct solvers
  • ScaLAPACK, PLAPACK
  • Sparse direct solvers
  • Iterative solvers

97
LAPACK Motivation
  • LAPACK using high level BLAS
  • Use the manufacturer's BLAS for high performance.
  • Portable Fortran 77 (766 K lines)

98
LAPACK Motivation
  • Example Cray C-90, 1 and 16 processors
  • LINPACK low performance due to excess data
    movement through the memory hierarchy

99
Derivation of Blocked Algorithms: Cholesky
Factorization A = UᵀU

Equating coefficients of the jth column, we obtain
the relations below; hence, if U11 has already been
computed, we can compute uj and ujj from these
equations.
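
The equations were an image on the original slide; what follows is a reconstruction of the standard derivation (partitioning the leading j columns of A and U), offered as a sketch rather than a transcription.

\[
\begin{pmatrix} A_{11} & a_j \\ a_j^T & a_{jj} \end{pmatrix}
=
\begin{pmatrix} U_{11}^T & 0 \\ u_j^T & u_{jj} \end{pmatrix}
\begin{pmatrix} U_{11} & u_j \\ 0 & u_{jj} \end{pmatrix}
\quad\Longrightarrow\quad
a_j = U_{11}^T u_j , \qquad a_{jj} = u_j^T u_j + u_{jj}^2 .
\]
Hence, once \(U_{11}\) is known, \(u_j\) and \(u_{jj}\) follow from
\[
U_{11}^T u_j = a_j \quad\text{(a triangular solve)}, \qquad
u_{jj} = \sqrt{a_{jj} - u_j^T u_j}\,.
\]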
100
LINPACK Implementation
  • Here is the body of the LINPACK routine SPOFA
    which implements the method
      DO 30 J = 1, N
         INFO = J
         S = 0.0E0
         JM1 = J - 1
         IF( JM1.LT.1 ) GO TO 20
         DO 10 K = 1, JM1
            T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
            T = T / A( K, K )
            A( K, J ) = T
            S = S + T*T
   10    CONTINUE
   20    CONTINUE
         S = A( J, J ) - S
C     ......EXIT
         IF( S.LE.0.0E0 ) GO TO 40
         A( J, J ) = SQRT( S )
   30 CONTINUE

101
(No Transcript)
102
LAPACK from HP-48G to CRAY T-90
  • (Trans)portable FORTRAN 77 (BLAS), 766K lines
    (354K library, 412K testing and timing)
  • Targets workstations, vector machines, and shared
    memory computers
  • Initial release February 1992
  • Linear systems, least squares, eigenproblems
  • Manual available from SIAM; translated into
    Japanese
  • Fifth release later this year.

103
Physical and Logical Views of PVM
  • Physical
  • Logical

104
Performance Today in Terms of LINPACK Benchmark
  • ASCI Red at 1.3 Tflop/s
  • Sun Sparc Ultra 2 at 110 Mflop/s
  • Thinkpad 600 at 50 Mflop/s
  • Palm Pilot at 1.4 Kflop/s

105
Standards - The MPI Forum
  • Communicators - communication context
  • needed for safe parallel library design (see the
    sketch after this list).
  • Derived Datatypes - user defined datatypes for
    non-homogeneous, non-contiguous data.
  • Process Groups - collective operations can
    operate over a subset of the processes.
  • Created using the HPF model
  • A group of 30-40 experts in message-passing
  • MPP vendors,
  • CS researchers,
  • Application developers.
  • Met 3 days every 6 weeks for 1.5 years and
    created the MPI-1 specification draft.
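
A sketch of the communicator idea (the library name is hypothetical): by duplicating the caller's communicator, a library obtains a private communication context, so its internal messages can never be mistaken for, or intercept, the caller's own traffic.

      SUBROUTINE LIBINI( COMM, LIBCOM )
*     Illustrative only: give a (hypothetical) library its own
*     communication context by duplicating the caller's communicator.
      INCLUDE 'mpif.h'
      INTEGER COMM, LIBCOM, IERR
      CALL MPI_COMM_DUP( COMM, LIBCOM, IERR )
      RETURN
      END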

106
Super-LU
  • Supernodes
  • Permit use of higher level BLAS.
  • Reduce inefficient indirect addressing.
  • Reduce symbolic time by traversing supernodal
    graph.
  • Algorithm: PA = LU, with A sparse and unsymmetric.
  • Exploit dense submatrices in the L and U factors
    of PA = LU.

107
Performance on Computational Kernels
  • New implementations of architectures appear every
    few months.
  • How to keep up with the changing design so as to
    have optimized numerical kernels.
  • Effort to aid the compiler
  • Pre-compiled software for machine specific
    implementations

108
Grid Enabled Math Software
  • Predictability and robustness of accuracy and
    performance.
  • Run-time resource management and algorithm
    selection.
  • Support for a multiplicity of programming
    environments.
  • Reproducibility, fault tolerance, and auditability
    of the computations.
  • New algorithmic techniques for latency tolerant
    applications.