Linear Algebra Libraries
1
Linear Algebra Libraries
  • James Demmel,
  • Mathematics and EECS
  • UC Berkeley

2
A few more participants (not all!)
  • Jack Dongarra, U. Tennessee
  • Kathy Yelick, UCB
  • Xiaoye Li, NERSC/LBNL
  • Tony Drummond, NERSC/LBNL
  • Osni Marques, NERSC/LBNL
  • Inderjit Dhillon, UT Austin
  • Beresford Parlett, UC Berkeley
  • Mark Adams, SNL

3
Success Stories (with NERSC, LBNL)
  • Cosmic Microwave Background Analysis, BOOMERanG
    collaboration, MADCAP code (Apr. 27, 2000) -
    ScaLAPACK
  • Scattering in a quantum system of three charged
    particles (Rescigno, Baertschy, Isaacs and
    McCurdy, Dec. 24, 1999) - SuperLU
4
SuperLU - Scalable Sparse Solvers
Contact: Xiaoye Li, NERSC/LBNL,
www.nersc.gov/xiaoye
  • SuperLU
  • Solve the sparse linear system Ax = b using
    Gaussian elimination (a toy dense sketch of the
    solve appears below)
  • Efficient and portable
  • Sequential SuperLU
  • Achieved up to 40% of peak Mflop rate
  • SuperLU_MT (shared-memory)
  • Achieved up to 10-fold speedup
  • SuperLU_DIST (distributed-memory)
  • Achieved up to 100-fold speedup
  • Included in HYPRE, PETSc, ML
  • To appear in Matlab, Sun Perf Lib, BCSLIB-EXT
  • Enabled Scientific Discovery
  • Solved complex unsymmetric systems of order 2
    million, on IBM SP.

Rescigno, Baertschy, Isaacs and McCurdy, Science,
24 Dec. 1999
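For orientation only, here is a minimal dense Gaussian elimination with partial pivoting in Java that solves Ax = b; it illustrates the factor-and-solve idea behind this slide, not SuperLU's supernodal sparse algorithm, and it assumes A is square and nonsingular.

```java
// Minimal dense Gaussian elimination with partial pivoting: solves A x = b.
// Illustrative only -- SuperLU implements supernodal *sparse* LU, not this dense kernel.
public class DenseSolve {
    public static double[] solve(double[][] A, double[] b) {
        int n = b.length;
        for (int k = 0; k < n; k++) {
            // Partial pivoting: pick the largest entry in column k at or below the diagonal.
            int p = k;
            for (int i = k + 1; i < n; i++)
                if (Math.abs(A[i][k]) > Math.abs(A[p][k])) p = i;
            double[] rowTmp = A[k]; A[k] = A[p]; A[p] = rowTmp;
            double bTmp = b[k]; b[k] = b[p]; b[p] = bTmp;
            // Eliminate entries below the pivot.
            for (int i = k + 1; i < n; i++) {
                double m = A[i][k] / A[k][k];
                for (int j = k; j < n; j++) A[i][j] -= m * A[k][j];
                b[i] -= m * b[k];
            }
        }
        // Back substitution on the resulting upper-triangular system.
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double s = b[i];
            for (int j = i + 1; j < n; j++) s -= A[i][j] * x[j];
            x[i] = s / A[i][i];
        }
        return x;
    }
}
```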
5
The Holy Grail
Eigensolver for symmetric matrices
To be propagated throughout LAPACK and ScaLAPACK
6
Parallel Multigrid for Irregular FE Problems
  • Sample problem
  • Compute crushing of a stiff sphere w/ 17 steel
    and rubber layers, in rubber cube
  • 80K to 56M degrees of freedom (dof)
  • Up to 640 Cray T3E processors
  • 50% scaled parallel efficiency

www.cs.berkeley.edu/madams
  • 76M dof solved in 70 seconds on 1920-processor
    ASCI Red (SC '01)
  • Prize for Best Industrial Application at the
    Mannheim SuParCup '99

7
The ACTS Toolkit (acts.nersc.gov)
  • Advanced Computational Testing and Simulation
  • Collection of (mostly) DOE-developed tools
  • Documented and improved
  • Paid for by DOE and NSF/NPACI
  • Freely available to all users
  • Education, training and consulting
  • Tutorials in 10/01 and 3/02
  • Application Level Libraries
  • ScaLAPACK, SuperLU, Aztec, PETSc, Hypre, PVODE,
    TAO, ATLAS, PHiPAC,
  • System Level Libraries
  • Global Arrays, Overture, POET, POOMA, Globus,
    Nexus,
  • Software Development Tools
  • CUMULVS, PAWS, SILOON, TAU, Tulip, PADRE, PETE

8
Templates: Eigensolver Tutorial
www.siam.org
www.netlib.org
Train users to match algorithms and software to
problems
9
Future Work
  • Continue tech transfer
  • Leverage DOE investment in algorithms, software,
    training
  • Propagate Holy Grail Eigensolver throughout
    libraries
  • Will change eigenvalue solvers, singular value
    solvers, least squares
  • Better Multigrid solvers
  • New coarseners, smoothers
  • Listen to the users
  • 22M hits on netlib to (Sca)LAPACK, hundreds of
    e-mails per year
  • Example: thread-safe versions for use on SMP
    nodes, like BH
  • Automatically tuned kernels
  • Harden research results from PHiPAC, Sparsity,
    BeBOP

10
Motivation for Automatic Tuning
  • Why replace conventional hand tuning by user or
    vendor?
  • Time consuming and tedious
  • Hard to predict performance from source code
  • Growing list of kernels to tune
  • Must be redone for every architecture and
    compiler
  • Can't depend on the compiler
  • Compiler technology often lags architecture
  • Best algorithm may depend on the input, so some
    tuning must happen at run time
  • Not all algorithms are semantically or
    mathematically equivalent

11
Automatic Performance Tuning
  • Two steps
  • 1. Identify and generate a space of algorithms,
    with various
  • Instruction mixes and orders
  • Memory access patterns
  • Data structures
  • Mathematical formulations
  • 2. Search for the fastest one, by running them
    (a toy timing search appears below)
  • When do we search?
  • Once per kernel and architecture
  • At compile time
  • At run time, e.g., once the sparsity structure is
    known
  • All of the above
  • Many examples
  • PHiPAC, ATLAS, Sparsity, FFTW, Spiral,
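A minimal sketch of the generate-and-search idea, assuming a toy search space of cache-blocking factors for dense matrix multiply: each candidate is simply run and timed, and the fastest is kept. The kernel, block sizes, and class names are illustrative, not the PHiPAC/ATLAS generators (real tuners also control for JIT warm-up, repetitions, and problem sizes).

```java
// Toy autotuner: run each candidate blocking factor and keep the fastest.
public class TinyTuner {
    static void blockedMatMul(double[][] A, double[][] B, double[][] C, int bs) {
        int n = A.length;
        for (int ii = 0; ii < n; ii += bs)
            for (int jj = 0; jj < n; jj += bs)
                for (int kk = 0; kk < n; kk += bs)
                    for (int i = ii; i < Math.min(ii + bs, n); i++)
                        for (int j = jj; j < Math.min(jj + bs, n); j++) {
                            double s = C[i][j];
                            for (int k = kk; k < Math.min(kk + bs, n); k++)
                                s += A[i][k] * B[k][j];
                            C[i][j] = s;
                        }
    }

    public static void main(String[] args) {
        int n = 256;
        double[][] A = new double[n][n], B = new double[n][n];
        int[] candidates = {8, 16, 32, 64};   // step 1: the "space of algorithms"
        int best = -1;
        long bestTime = Long.MAX_VALUE;
        for (int bs : candidates) {           // step 2: search by running and timing each one
            double[][] C = new double[n][n];
            long t0 = System.nanoTime();
            blockedMatMul(A, B, C, bs);
            long t = System.nanoTime() - t0;
            if (t < bestTime) { bestTime = t; best = bs; }
        }
        System.out.println("fastest block size: " + best);
    }
}
```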

12
Register Blocking Optimization
  • Identify small dense blocks of nonzeros.
  • Fill in extra zeros to complete blocks
  • Use an optimized multiplication code for the
    particular block size.

[Figure: a small sparse matrix stored as 2x2 register blocks; explicit zeros are filled in to complete partial blocks]
  • Improves register reuse, lowers indexing
    overhead.
  • Challenge: adds (potentially) extra storage and
    computation (see the BCSR sketch below)
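As a concrete sketch of register blocking, here is sparse matrix-vector multiply over a 2x2 block-compressed-sparse-row (BCSR) layout. The field names and layout are assumptions for illustration, not the Sparsity/BeBOP data structures; each stored block is a dense 2x2 tile whose two output values stay in registers across the inner loop.

```java
// 2x2 register-blocked sparse matrix-vector multiply (y = A*x), BCSR-style storage.
// Each stored block is a dense 2x2 tile; explicit zeros may have been filled in to complete blocks.
public class BlockedSpMV {
    int nBlockRows;     // number of 2-row block rows
    int[] blockRowPtr;  // start of each block row in blockCol/values (length nBlockRows+1)
    int[] blockCol;     // block-column index of each stored 2x2 block
    double[] values;    // 4 values per block, row-major: a00 a01 a10 a11

    void multiply(double[] x, double[] y) {
        for (int bi = 0; bi < nBlockRows; bi++) {
            double y0 = 0.0, y1 = 0.0;   // the two rows of this block row stay in registers
            for (int k = blockRowPtr[bi]; k < blockRowPtr[bi + 1]; k++) {
                int j = 2 * blockCol[k];
                double x0 = x[j], x1 = x[j + 1];
                int v = 4 * k;
                y0 += values[v]     * x0 + values[v + 1] * x1;
                y1 += values[v + 2] * x0 + values[v + 3] * x1;
            }
            y[2 * bi]     = y0;
            y[2 * bi + 1] = y1;
        }
    }
}
```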

13
Sparsity: Sparse MatVec Multiply
Speedups on Itanium
14
Speedup of AᵀAx
  • Speedups on Pentium 4
  • Access A only once (see the sketch below)
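A minimal sketch of the "access A only once" trick: for each row aᵢ of A, compute t = aᵢ·x and accumulate y += t·aᵢ, so y = AᵀAx is produced in a single pass over A instead of separate passes for Ax and Aᵀ(Ax). Dense row storage here is only for clarity; the tuned kernel works on sparse, register-blocked A.

```java
// Computes y = A^T * (A * x) while touching each row of A only once.
public class AtAx {
    static double[] multiply(double[][] A, double[] x) {
        int m = A.length, n = x.length;
        double[] y = new double[n];
        for (int i = 0; i < m; i++) {
            double[] row = A[i];
            double t = 0.0;                                   // t = a_i . x
            for (int j = 0; j < n; j++) t += row[j] * x[j];
            for (int j = 0; j < n; j++) y[j] += t * row[j];   // y += t * a_i
        }
        return y;
    }
}
```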

15
Future Work on Tuning
  • Further research
  • More kernels
  • For numerics, communication, other problems
  • Higher level algorithm choices (templates)
  • NPACI problem
  • How to make tuning usable by non-experts
  • Interface design for sparse problems

16
Titanium
  • Susan Graham and Katherine Yelick
  • Computer Science Division, EECS
  • U.C. Berkeley
  • http://titanium.cs.berkeley.edu/

17
Titanium Group
  • Susan Graham
  • Katherine Yelick
  • Paul Hilfinger
  • Phillip Colella (LBNL)
  • Alex Aiken
  • Greg Balls (SDSC)
  • Peter McCorquodale (LBNL)
  • Andrew Begel
  • Dan Bonachea
  • Szu-Huey Chang
  • Tyson Condie
  • Carrie Fei
  • David Gay
  • Ben Liblit
  • Chang Sun Lin
  • Geoff Pike
  • Jimmy Su
  • Ellen Tsai
  • Mike Welcome (LBNL)
  • Siu Man Yau

18
Titanium Overview
  • Object-oriented language based on Java with
  • Scalable parallelism
  • SPMD model with global address space
  • Multidimensional arrays
  • points and index sets as first-class values
  • Immutable classes
  • user-definable non-reference types for
    performance
  • Operator overloading
  • by demand from our user community
  • Semi-automated memory management
  • uses memory regions for high performance

19
SciMark Benchmark
  • Numerical benchmark for Java and C/C++
  • Five kernels
  • FFT (complex, 1D)
  • Successive Over-Relaxation (SOR)
  • Monte Carlo integration (MC; see the sketch below)
  • Sparse matrix multiply
  • dense LU factorization
  • Results are reported in Mflops
  • Download and run on your machine from
  • http://math.nist.gov/scimark2
  • C and Java sources also provided

Roldan Pozo, NIST, http://math.nist.gov/Rpozo
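For a flavor of the kernels, here is the idea behind the Monte Carlo integration test (estimating pi by sampling the unit square); this is a hedged restatement of the technique, not the NIST SciMark source.

```java
import java.util.Random;

// Monte Carlo estimate of pi: fraction of random points in the unit square
// that fall inside the quarter circle, times 4.
public class MonteCarloPi {
    static double estimate(int samples, long seed) {
        Random rng = new Random(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = rng.nextDouble(), y = rng.nextDouble();
            if (x * x + y * y <= 1.0) inside++;
        }
        return 4.0 * inside / samples;
    }
}
```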
20
SciMark: Java vs. C (Sun UltraSPARC 60)
Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O,
SunOS 5.7
Roldan Pozo, NIST, http://math.nist.gov/Rpozo
21
SciMark: Java vs. C (Intel PIII 500 MHz, Win98)
Sun JDK 1.2, javac -O; Microsoft VC 5.0, cl -O,
Win98
Roldan Pozo, NIST, http://math.nist.gov/Rpozo
22
Java Compiled by Titanium Compiler
23
Titanium Compiler on the Power 3 (SP)
24
Java Compiled by Titanium Compiler
25
SOR red-black loop (small data)
  • Using a red-black algorithm, Titanium arrays (191
    Mflops) are faster than Java arrays (a sweep
    sketch appears below)
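A minimal Java sketch of one red-black SOR sweep on a 2D grid, the kind of loop being timed above; it assumes a 5-point Laplacian with the h² factor folded into f. Titanium's version uses its own multidimensional arrays and foreach loops, so treat this only as the algorithmic shape.

```java
// One red-black SOR sweep for the 5-point Laplacian on an (n+2)x(n+2) grid
// with fixed boundary values. omega is the over-relaxation factor; f is
// assumed to already include the h^2 scaling.
public class RedBlackSOR {
    static void sweep(double[][] u, double[][] f, double omega) {
        int n = u.length - 2;
        for (int color = 0; color < 2; color++) {      // the two interleaved point sets
            for (int i = 1; i <= n; i++) {
                for (int j = 1 + (i + color) % 2; j <= n; j += 2) {
                    double gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1] - f[i][j]);
                    u[i][j] += omega * (gs - u[i][j]); // relax toward the Gauss-Seidel value
                }
            }
        }
    }
}
```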

26
Parallel Applications
  • Genome Application
  • Heart simulation
  • AMR elliptic and hyperbolic solvers
  • Scalable Poisson for infinite domains
  • Several smaller benchmarks: EM3D, MatMul, LU,
    FFT, Join,

27
MOOSE Application
  • Problem: microarray construction
  • Used for genome experiments
  • Possible medical applications long-term
  • Microarray Optimal Oligo Selection Engine (MOOSE)
  • A parallel engine for selecting the best
    oligonucleotide sequences for genetic microarray
    testing
  • Uses dynamic load balancing within Titanium (see
    the sketch below)
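One common way to picture such dynamic load balancing is a shared work queue that idle workers pull from; this plain-Java sketch (with a hypothetical Task type) illustrates only that pattern, not the Titanium implementation used in MOOSE.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Dynamic load balancing via a shared work queue: each worker thread repeatedly
// pulls the next unit of work, so faster workers automatically take more tasks.
public class WorkQueueDemo {
    interface Task { void run(); }   // hypothetical unit of work (e.g., scoring one candidate oligo)

    static void process(ConcurrentLinkedQueue<Task> queue, int nWorkers)
            throws InterruptedException {
        Thread[] workers = new Thread[nWorkers];
        for (int w = 0; w < nWorkers; w++) {
            workers[w] = new Thread(() -> {
                Task t;
                while ((t = queue.poll()) != null) t.run();  // grab work until the queue drains
            });
            workers[w].start();
        }
        for (Thread t : workers) t.join();
    }
}
```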

28
Heart Simulation
  • Problem: compute blood flow in the heart
  • Modeled as an elastic structure in an
    incompressible fluid
  • The immersed boundary method of Peskin and
    McQueen
  • 20 years of development in the model
  • Many other applications: blood clotting, inner
    ear, paper making, embryo growth, and more
  • Can be used for design
    of prosthetics
  • Artificial heart valves
  • Cochlear implants

29
Scalable Poisson Solver
  • MLC for Finite-Differences by Balls and Colella
  • Poisson equation with infinite boundaries
  • Arises in astrophysics, some biological systems,
    etc.
  • Method is scalable
  • Low communication
  • Performance on
  • SP2 (shown) and T3E
  • scaled speedups
  • nearly ideal (flat)
  • Currently 2D and non-adaptive

30
Error on High-Wavenumber Problem
  • Charge is
  • (1) a charge of concentric waves
  • (2) star-shaped charges
  • Largest error is where the charge is changing
    rapidly. Note:
  • discretization error
  • faint decomposition error
  • Run on 16 procs

31
AMR Gas Dynamics
  • Developed by McCorquodale and Colella
  • 2D Example (3D supported)
  • Mach-10 shock on solid surface
    at
    oblique angle
  • Future: self-gravitating gas dynamics package

32
Unstructured Mesh Kernel
  • EM3D: relaxation on a 3D unstructured mesh
  • Speedup on an UltraSPARC SMP
  • Simple kernel; mesh not partitioned

33
Recent Developments
  • Interfaces to libraries
  • KeLP and (older) PETSc and Metis
  • New IBM SP implementation
  • Uses LAPI rather than MPI, about 2x performance
    gain
  • New release: IBM, SGI, Cray, Linux cluster,
    Threads,
  • Uniprocessor optimizations
  • Method inlining, both automated and manual
  • Cache optimizations
  • Shared pointer analysis
  • Support for unstructured computation
  • General sub-array copy now with arbitrary points

34
Future Plans
  • Merge communication layer with UPC
  • Unified Parallel C has broad vendor support.
  • Uses the same execution model as Titanium
  • Push vendors to expose low-overhead communication
  • Automated communication overlap
  • Analysis and refinement of cache optimizations
  • Additional support for unstructured grids
  • Conjugate gradient and particle methods are
    motivations
  • Better uniprocessor optimizations, possibly new
    arrays

35
End of Slides
36
Target Problems
  • Many modeling problems in astrophysics, biology,
    material science, and other areas require
  • Enormous range of spatial and temporal scales
  • Requires
  • Adaptive methods
  • Large scale parallel machines
  • Titanium supports
  • Structured grids
  • Locally-structured grids (AMR)
  • Unstructured grids (in progress)

37
Local Pointer Analysis
  • Compiler can infer many uses of local
  • Data structures must be well partitioned