The Future of Numerical Linear Algebra Automatic Performance Tuning of Sparse Matrix codes The Next LAPACK and ScaLAPACK www.cs.berkeley.edu/~demmel/Utah_Apr05.ppt

About This Presentation

Title:

The Future of Numerical Linear Algebra Automatic Performance Tuning of Sparse Matrix codes The Next LAPACK and ScaLAPACK www.cs.berkeley.edu/~demmel/Utah_Apr05.ppt

Description:

Best choice can depend on knowing a lot of applied mathematics and ... Algorithm and its implementation may strongly depend on data only known at run-time ... – PowerPoint PPT presentation

Number of Views:106

Avg rating:3.0/5.0

Slides: 79

Provided by: jamesd72

Category:

more less

Transcript and Presenter's Notes

Title: The Future of Numerical Linear Algebra Automatic Performance Tuning of Sparse Matrix codes The Next LAPACK and ScaLAPACK www.cs.berkeley.edu/~demmel/Utah_Apr05.ppt

1
The Future of Numerical Linear Algebra
Automatic Performance Tuning of Sparse Matrix
codesThe Next LAPACK and ScaLAPACK
www.cs.berkeley.edu/demmel/Utah_Apr05.ppt

James Demmel
UC Berkeley

2
Outline

Automatic Performance Tuning of Sparse Matrix
Kernels
Updating LAPACK and ScaLAPACK

3
Outline

Automatic Performance Tuning of Sparse Matrix
Kernels
Updating LAPACK and ScaLAPACK

4
Berkeley Benchmarking and OPtimization (BeBOP)

Prof. Katherine Yelick
Rich Vuduc
Many results in this talk are from Vuducs PhD
thesis, www.cs.berkeley.edu/richie
Rajesh Nishtala, Mark Hoemmen, Hormozd Gahvari
Eun-Jim Im, many other earlier contributors
Other papers at bebop.cs.berkeley.edu

5
Motivation for Automatic Performance Tuning

Writing high performance software is hard
Make programming easier while getting high speed
Ideal program in your favorite high level
language (Matlab, PETSc) and get a high fraction
of peak performance
Reality Best algorithm (and its implementation)
can depend strongly on the problem, computer
architecture, compiler,
Best choice can depend on knowing a lot of
applied mathematics and computer science
How much of this can we teach?

6
Motivation for Automatic Performance Tuning

Writing high performance software is hard
Make programming easier while getting high speed
Ideal program in your favorite high level
language (Matlab, PETSc) and get a high fraction
of peak performance
Reality Best algorithm (and its implementation)
can depend strongly on the problem, computer
architecture, compiler,
Best choice can depend on knowing a lot of
applied mathematics and computer science
How much of this can we teach?
How much of this can we automate?

7
Examples of Automatic Performance Tuning

Dense BLAS
Sequential
PHiPAC (UCB), then ATLAS (UTK)
Now in Matlab, many other releases
math-atlas.sourceforge.net/
Fast Fourier Transform (FFT) variations
FFTW (MIT)
Sequential and Parallel
1999 Wilkinson Software Prize
www.fftw.org
Digital Signal Processing
SPIRAL www.spiral.net (CMU)
MPI Collectives (UCB, UTK)
More projects, conferences, government reports,

8
Tuning Dense BLAS PHiPAC
9
Tuning Dense BLAS ATLAS
Extends applicability of PHIPAC Incorporated in
Matlab (with rest of LAPACK)
10
How tuning works, so far

What do dense BLAS, FFTs, signal processing, MPI
reductions have in common?
Can do the tuning off-line once per
architecture, algorithm
Can take as much time as necessary (hours, a
week)
At run-time, algorithm choice may depend only on
few parameters
Matrix dimension, size of FFT, etc.

11
Register Tile Size Selection inDense Matrix
Multiply
m
k
m
k0
m0
m0
k
n0
n0
n
.
k0

n
12
Tuning Register Tile Sizes (Dense Matrix
Multiply)
333 MHz Sun Ultra 2i 2-D slice of 3-D space
implementations color-coded by performance in
Mflop/s 16 registers, but 2-by-3 tile size
fastest
Needle in a haystack
13
(No Transcript)
14
Limits of off-line tuning

Algorithm and its implementation may strongly
depend on data only known at run-time
Ex Sparse matrix nonzero pattern determines both
best data structure and implementation of
Sparse-matrix-vector-multiplication (SpMV)
Cant afford to generate and test thousands of
algorithms and implementations at run-time!
BEBOP project addresses sparse tuning

15
A Sparse Matrix You Use Every Day
16
Motivation for Automatic Performance Tuning of
SpMV

SpMV widely used in practice
Kernel of iterative solvers for
linear systems
eigenvalue problems
Singular value problems
Historical trends
Sparse matrix-vector multiply (SpMV) 10 of peak
or less
2x faster than CSR with hand-tuning
Tuning becoming more difficult over time

17
SpMV Historical Trends Fraction of Peak
18
Approach to Automatic Performance Tuning of SpMV

Our approach empirical modeling and search
Off-line measure performance of variety of data
structures and SpMV algorithms
On-line sample matrix, use performance model to
predict which data structure/algorithm is best
Results
Up to 4x speedups and 31 of peak for SpMV
Using register blocking
Many other optimization techniques for SpMV

19
SpMV with Compressed Sparse Row (CSR) Storage
Matrix-vector multiply kernel y(i) ? y(i)
A(i,j)x(j) for each row i for kptri to
ptri1 do yi yi valkxindk
Matrix-vector multiply kernel y(i) ? y(i)
A(i,j)x(j) for each row i for kptri to
ptri1 do yi yi valkxindk
20
Example 1 The Difficulty of Tuning

n 21216
nnz 1.5 M
kernel SpMV
Source NASA structural analysis problem

21
Example 1 The Difficulty of Tuning

n 21216
nnz 1.5 M
kernel SpMV
Source NASA structural analysis problem

8x8 dense substructure

22
Taking advantage of block structure in SpMV

Bottleneck is time to get matrix from memory
Only 2 flops for each nonzero in matrix
Goal decrease size of data structure
Dont store each nonzero with index, instead
store each nonzero r-by-c block with one index
Storage drops by up to 2x (if rc gtgt 1, all
32-bit quantities)
Time to fetch matrix from memory decreases
Change both data structure and algorithm
Need to pick r and c
Need to change algorithm accordingly
In example, is rc8 best choice?
Minimizes storage, so looks like a good idea

23
Speedups on Itanium 2 The Need for Search
Mflop/s
Mflop/s
24
SpMV Performance (Matrix 2) Generation 1
Power3 - 13
Power4 - 14
195 Mflop/s
703 Mflop/s
100 Mflop/s
469 Mflop/s
Itanium 2 - 31
Itanium 1 - 7
225 Mflop/s
1.1 Gflop/s
103 Mflop/s
276 Mflop/s
25
Register Profile Itanium 2
1190 Mflop/s
190 Mflop/s
26
Register Profiles IBM and Intel IA-64
Power3 - 17
Power4 - 16
252 Mflop/s
820 Mflop/s
122 Mflop/s
459 Mflop/s
Itanium 2 - 33
Itanium 1 - 8
247 Mflop/s
1.2 Gflop/s
107 Mflop/s
190 Mflop/s
27
Register Profiles Sun and Intel x86
Ultra 2i - 11
Ultra 3 - 5
72 Mflop/s
90 Mflop/s
35 Mflop/s
50 Mflop/s
Pentium III-M - 15
Pentium III - 21
108 Mflop/s
122 Mflop/s
42 Mflop/s
58 Mflop/s
28
Example 2 The Difficulty of Tuning

n 21216
nnz 1.5 M
kernel SpMV
Source NASA structural analysis problem

29
Zoom in to top corner

More complicated non-zero structure in general

30
3x3 blocks look natural, but

More complicated non-zero structure in general
Example 3x3 blocking
Logical grid of 3x3 cells

31
3x3 blocks look natural, but

More complicated non-zero structure in general
Example 3x3 blocking
Logical grid of 3x3 cells
But would lead to lots of fill-in 1.5x

32
Extra Work Can Improve Efficiency!

More complicated non-zero structure in general
Example 3x3 blocking
Logical grid of 3x3 cells
Fill-in explicit zeros
Unroll 3x3 block multiplies
Fill ratio 1.5
On Pentium III 1.5x speedup!
Actual mflop rate 1.52 2.25 higher

33
Automatic Register Block Size Selection

Selecting the r x c block size
Off-line benchmark
Precompute Mflops(r,c) using dense A for each r x
c
Once per machine/architecture
Run-time search
Sample A to estimate Fill(r,c) for each r x c
Run-time heuristic model
Choose r, c to minimize time ? Fill(r,c) /
Mflops(r,c)

34
Accurate and Efficient Adaptive Fill Estimation

Idea Sample matrix
Fraction of matrix to sample s Î 0,1
Cost O(s nnz)
Control cost by controlling s
Search at run-time the constant matters!
Control s automatically by computing statistical
confidence intervals
Idea Monitor variance
Cost of tuning
Heuristic costs 1 to 11 unblocked SpMVs
Converting matrix costs 5 to 40 unblocked SpMVs
Tuning a good idea when doing lots of SpMVs

35
Test Matrix Collection

Many on-line sources (see Vuducs thesis)
Matrix 1 dense (in sparse format)
Matrices 2-9 FEM with one block size r x c
N from 14K to 62K, NNZ from 1M to 3M
Fluid flow, structural mechanics, materials
Matrices 10-17 FEM with multiple block sizes
N from 17K to 52K, NNZ from .5M to 2.7M
Fluid flow, buckling,
Matrices 18 37 Other
N from 5K to 75K, NNZ from 33K to .5M
Power grid, chem eng, finance, semiconductors,
Matrices 40 44 Linear Programming
(N,M) from (3K,13K) to (15K,77K), NNZ from 50K to
2M

36
Accuracy of the Tuning Heuristics (1/4)
See p. 375 of Vuducs thesis for matrices
NOTE Fair flops used (ops on explicit zeros
not counted as work)
37
Accuracy of the Tuning Heuristics (2/4)
38
Accuracy of the Tuning Heuristics (2/4)
DGEMV
39
Evaluating algorithms and machines for SpMV

Some speedups look good, but could we do better?
Questions
What is the best speedup possible?
Independent of instruction scheduling, selection
Can SpMV be further improved or not?
What machines are good for SpMV?
How can architectures be changed to improve SpMV?

40
Upper Bounds on Performance for register blocked
SpMV

P (flops) / (time)
Flops 2 nnz(A) dont count extra work on
zeros
Lower bound on time Two main assumptions
1. Count memory ops only (streaming)
2. Count only compulsory, capacity misses ignore
conflicts
Account for line sizes
Account for matrix size and nnz
Charge minimum access latency ai at Li cache
amem
e.g., Saavedra-Barrera and PMaC MAPS benchmarks

41
Example L2 Misses on Itanium 2
Misses measured using PAPI Browne 00
42
Example Bounds on Itanium 2
43
Example Bounds on Itanium 2
44
Example Bounds on Itanium 2
45
Fraction of Upper Bound Across Platforms
46
Summary of Other Performance Optimizations

Optimizations for SpMV
Register blocking (RB) up to 4x over CSR
Variable block splitting 2.1x over CSR, 1.8x
over RB
Diagonals 2x over CSR
Reordering to create dense structure splitting
2x over CSR
Symmetry 2.8x over CSR, 2.6x over RB
Cache blocking 2.8x over CSR
Multiple vectors (SpMM) 7x over CSR
And combinations
Sparse triangular solve
Hybrid sparse/dense data structure 1.8x over CSR
Higher-level kernels
AATx, ATAx 4x over CSR, 1.8x over RB
A2x 2x over CSR, 1.5x over RB

47
Example Sparse Triangular Factor

Raefsky4 (structural problem) SuperLU colmmd
N19779, nnz12.6 M

48
Cache Optimizations for AATx

Cache-level Interleave multiplication by A, AT

Algorithmic-level transformations for A2x, A3x,

49
Impact on Applications Omega3P

Application accelerator cavity design Ko
Relevant optimization techniques
Symmetric storage
Register blocking
Reordering rows and columns to improve blocking
Reverse Cuthill-McKee ordering to reduce
bandwidth
Traveling Salesman Problem-based ordering to
create blocks
Pinar Heath 97
Make columns adjacent if they have many common
nonzero rows
2.1x speedup on Power 4

50
Source Accelerator Cavity Design Problem (Ko via
Husbands)
51
100x100 Submatrix Along Diagonal
52
Post-RCM Reordering
53
Microscopic Effect of RCM Reordering
Before Green Red After Green Blue
54
Microscopic Effect of Combined RCMTSP
Reordering
Before Green Red After Green Blue
55
(Omega3P)
56
Optimized Sparse Kernel Interface OSKI

Provides sparse kernels automatically tuned for
users matrix machine
BLAS-style functionality SpMV, Ax ATy, TrSV
Hides complexity of run-time tuning
Includes ATAx, Akx
For advanced users solver library writers
Available as stand-alone library
Alpha release at bebop.cs.berkeley.edu/oski
Release v1.0 soon
Will be available as PETSc extension

57
How OSKI Tunes
Application Run-Time
Library Install-Time (offline)
1. Build for Target Arch.
2. Benchmark
Workload from program monitoring
History
Matrix
Benchmark data
Heuristic models
1. Evaluate Models
Generated code variants
2. Select Data Struct. Code
To user Matrix handle for kernel calls
Extensibility Advanced users may write
dynamically add Code variants and Heuristic
models to system.
58
What about the Google Matrix?

Google approach
Approx. once a month rank all pages using
connectivity structure
Find dominant eigenvector of a matrix
At query-time return list of pages ordered by
rank
Matrix A aG (1-a)(1/n)uuT
Markov model Surfer follows link with
probability a, jumps to a random page with
probability 1-a
G is n x n connectivity matrix n billions
gij is non-zero if page i links to page j
Steady-state probability xi of landing on page i
is solution to x Ax
Approximate x by power method x Akx0
In practice, k 25

59
Awards

Best Paper, Intern. Conf. Parallel Processing,
2004
Performance models for evaluation and automatic
performance tuning of symmetric sparse
matrix-vector multiply
Best Student Paper, Intern. Conf. Supercomputing,
Workshop on Performance Optimization via
High-Level Languages and Libraries, 2003
Best Student Presentation too, to Richard Vuduc
Automatic performance tuning and analysis of
sparse triangular solve
Finalist, Best Student Paper, Supercomputing 2002
To Richard Vuduc
Performance Optimization and Bounds for Sparse
Matrix-vector Multiply
Best Presentation Prize, MICRO-33 3rd ACM
Workshop on Feedback-Directed Dynamic
Optimization, 2000
To Richard Vuduc
Statistical Modeling of Feedback Data in an
Automatic Tuning System

60
Outline

Automatic Performance Tuning of Sparse Matrix
Kernels
Updating LAPACK and ScaLAPACK

61
LAPACK and ScaLAPACK

Widely used dense and banded linear algebra
libraries
In Matlab (thanks to tuning), NAG, PETSc,
Used in vendor libraries from Cray, Fujitsu, HP,
IBM, Intel, NEC, SGI
over 49M web hits at www.netlib.org
New NSF grant for new, improved releases
Joint with Jack Dongarra, many others
Community effort (academic and industry)
www.cs.berkeley.edu/demmel/Sca-LAPACK-Proposal.pd
f

62
Goals (highlights)

Putting more of LAPACK into ScaLAPACK
Lots of routines not yet parallelized
New functionality
Ex Updating/downdating of factorizations
Improving ease of use
Life after F77?
Automatic Performance Tuning
Over 1300 calls to ILAENV() to get tuning
parameters
New Algorithms
Some faster, some more accurate, some new

63
Faster eig() and svd()

Nonsymmetric eigenproblem
Incorporate SIAM Prize winning work of Byers /
Braman / Mathias on faster HQR
Up to 10x faster for large enough problems
Symmetric eigenproblem and SVD
Reduce from dense to narrow band
Incorporate work of Bischof/Lang, Howell/Fulton
Move work from BLAS2 to BLAS3
Narrow band (tri/bidiagonal) problem
Incorporate MRRR algorithm of Parlett/Dhillon
Voemel, Marques, Willems

64
MRRR Algorithm for eig(tridiagonal) and
svd(bidiagonal)

Multiple Relatively Robust Representation
1999 Householder Award honorable mention for
Dhillon
O(nk) flops to find k eigenvalues/vectors of nxn
tridiagonal matrix (similar for SVD)
Minimum possible!
Naturally parallelizable
Accurate
Small residuals Txi li xi O(n e)
Orthogonal eigenvectors xiTxj O(n e)
Hence nickname Holy Grail

65
Benchmark Details

AMD 1.2 GHz Athlon, 2GB mem, Redhat Intel
compiler
Compute all eigenvalues and eigenvectors of a
symmetric tridiagonal matrix T
Codes compared
qr QR iteration from LAPACK dsteqr
dc Cuppens DivideConquer from LAPACK dstedc
gr New implementation of MRRR algorithm
(Grail)
ogr MRRR from LAPACK dstegr (old Grail)

66
Timing of Eigensolvers(only matrices where time
gt .1 sec)
67
Timing of Eigensolvers(only matrices where time
gt .1 sec)
68
Timing of Eigensolvers(only matrices where time
gt .1 sec)
69
Timing of Eigensolvers(only matrices where time
gt .1 sec)
70
Accuracy Results (old vs new Grail)
71
Accuracy Results (Grail vs QR vs DC)
72
More Accurate Solve Axb
Conventional Gaussian Elimination

e
e n 2-24
73
More Accurate Solve Axb

Old idea Use Newtons method on f(x) Ax-b
On a linear system?
Roundoff in Ax-b makes it interesting
(nonlinear)
Iterative refinement
Snyder, Wilkinson, Moler, Skeel,

Repeat r Ax-b compute with extra
precision Solve Ad r using LU
factorization of A Update x x d Until
accurate enough or no progress
74
Whats new?

Need extra precision (beyond double)
Part of new BLAS standard
www.netlib.org/blas/blast-forum
Cost O(n2) extra per right-hand-side, vs O(n3)
for factorization
Get tiny componentwise bounds too
Error in xi small compared to xi, not just
maxj xj
Guarantees based on condition number estimates
No bad bounds in 6.2M tests, unlike old method
Different condition number for componentwise
bounds
Extends to least squares
Kahan, Hida, Riedy, X. Li, many undergrads
LAPACK Working Note 165

75
Can condition estimators lie?

Yes, unless they cost as much as matrix multiply
Demmel/Diament/Malajovich (FCM2001)
But what if matrix multiply costs O(n2)?
Cohn/Umans (FOCS 2003)

76
New algorithm for roots(p)

To find roots of polynomial p
Roots(p) does eig(C(p))
Costs O(n3), stable, reliable
O(n2) Alternatives
Newton, Laguerre, bisection,
Stable? Reliable?
New Exploit semiseparable structure of C(p)
Low rank of any submatrix of upper triangle of
C(p) preserved under QR iteration
Complexity drops from O(n3) to O(n2), same
stability
Ming Gu, Jiang Zhu, Jianlin Xia, David Bindel,

77
Conclusions

Lots of opportunities for faster and more
accurate solutions to classical problems
Tuning algorithms for problems and architectures
Automated search to deal with complexity
Surprising optima, needle in a haystack
Exploiting mathematical structure to find faster
algorithms

78
Thanks to NSF and DOE for supportThese
slides available at www.cs.berkeley.edu/demmel/
Utah_Apr05.ppt

Write a Comment

User Comments (0)