Lecture 6: Linear Algebra Algorithms

About This Presentation

Title:

Lecture 6: Linear Algebra Algorithms

Description:

Linear Algebra Algorithms Jack Dongarra, U of Tennessee Slides are adapted from Jim Demmel, UCB s Lecture on Linear Algebra Algorithms – PowerPoint PPT presentation

Number of Views:532

Avg rating:3.0/5.0

Slides: 135

Provided by: Davi1892

Learn more at: https://netlib.org

Category:

more less

Transcript and Presenter's Notes

Title: Lecture 6: Linear Algebra Algorithms

1
Lecture 6Linear Algebra Algorithms

Jack Dongarra, U of Tennessee

Slides are adapted from Jim Demmel, UCBs Lecture
on Linear Algebra Algorithms
2
Homework 3 Grading Rules

part1 (2 points) Write an MPI program
(1) code produce correct result 2/2.
(2) code run but could not give accurate result
1.5/2.
(3) code could not run or run but nothing 1/2
Note If you only submit one program which can
print "processor
contribution" and "total integral" correctly, I
will give 1/2
for this part and 2/2 for part2. so you total
score for part1
and part2 will be 3/4.
part2 (2 points) Modify your program to sum the
contributions to give the total value of the
integral.
(1) code produce correct result 2/2.
(2) code run but could not give accurate result
1.5/2.
(3) code could not run or run but nothing 1/2
part3 (4 points) Give an expression for the
efficiency of the parallel algorithm for m
subintervals
(1) give t_p(1), t_p(p), S(p) and E(p) correctly
4/4.
(2) give t_p(1), t_p(p) and S(p) correctly 3/4.
(3) give t_p(1) and t_p(p) correctly 2/4.
(4) give other 1/4.

3
Adding numbers together

If we are adding an array of numbers and have a
choice do we add them from largest to smallest or
smallest to largest?

4
Parallel Performance Metrics
Absolute Elapsed (wall-clock) Time T(n )
Speedup S(n ) T(1) / T(n ) where T1 is
the time for the best serial implementation.
gt Performance improvement due to parallelsm
Parallel Efficiency E(n ) T(1) / n T(n )
Ideal Speedup SI (n ) n -
Theoretical limit obtainable rarely - Ignores
all of real life These definitions apply to a
fixed-problem experiment.
5
Amdahls Law
Amdahls Law places a strict limit on the speedup
that can be realized by using multiple
processors. Two equivalent expressions for
Amdahls Law are given below
tN (fp/N fs)t1 Effect of multiple
processors on run time S 1/(fs fp/N)
Effect of multiple processors on
speedup Where fs serial fraction of code fp
parallel fraction of code 1 - fs N number
of processors
6
Illustration of Amdahls Law
It takes only a small fraction of serial content
in a code to degrade the parallel performance. It
is essential to determine the scaling behavior of
your code before doing production runs using
large numbers of processors
7
Amdahls Law Vs. Reality
Amdahls Law provides a theoretical upper limit
on parallel speedup assuming that there are no
costs for communications. In reality,
communications (and I/O) will result in a further
degradation of performance.
So whats going on?
8
More on Amdahls Law

Amdahls Law can be generalized to any two
processes of with different speeds
Ex. Apply to fprocessor and fmemory
The growing processor-memory performance gap will
undermine our efforts at achieving maximum
possible speedup!

9
Gustafsons Law

Thus, Amdahls Law predicts that there is a
maximum scalability for an application,
determined by its parallel fraction, and this
limit is generally not large.
tN (fp/N fs)t1 Effect of multiple
processors on run time
S 1/(fs fp/N) Effect of multiple
processors on speedup
Where
fs serial fraction of code
fp parallel fraction of code 1 - fs
N number of processors
There is a way around this increase the problem
size
bigger problems mean bigger grids or more
particles bigger arrays
number of serial operations generally remains
constant number of parallel operations
increases parallel fraction increases

10
Fixed-Problem Size Scaling
a.k.a. Fixed-load, Fixed-Problem Size, Strong
Scaling, Problem-Constrained, constant-problem
size (CPS), variable subgrid
1 Amdahl Limit SA(n) T(1) / T(n)
----------------
f / n ( 1 -
f ) This bounds the speedup based only on the
fraction of the code that cannot use parallelism
( 1- f ) it ignores all other factors SA
--gt 1 / ( 1- f ) as n --gt
11
Fixed-Problem Size Scaling (Contd)
Efficiency (n) T(1) / T(n) n
Memory requirements decrease with n
Surface-to-volume ratio increases with n
Superlinear speedup possible from cache
effects Motivation what is the largest of
procs I can use effectively and what is the
fastest time that I can solve a given problem?
Problems - Sequential runs often not possible
(large problems) - Speedup (and efficiency) is
misleading if processors are slow
12
Fixed-Problem Size Scaling Examples
S. Goedecker and Adolfy Hoisie, Achieving High
Performance in Numerical Computations on RISC
Workstations and Parallel Systems,International
Conference on Computational Physics PC'97 Santa
Cruz, August 25-28 1997.
13
Fixed-Problem Size Scaling Examples
14
Scaled Speedup Experiments
a.k.a. Fixed Subgrid-Size, Weak Scaling,
Gustafson scaling. Motivation Want to use a
larger machine to solve a larger global problem
in the same amount of time. Memory and
surface-to-volume effects remain constant.
15
Top500 Data
Rank Manufacturer Computer Installation Site Rmax TF/s R peak TF/s Ratio Max/Peak Proc
1 IBM ASCI WhiteSP Power3 Lawrence Livermore National Laboratory 7.23 12 60.25 8192
2 Compaq AlphaServer SC ES45 1 GHz Pittsburgh Supercomputing Center 4.06 6.05 67.11 3024
3 IBM SP Power3 375 MHz NERSC/LBNL 3.05 5.00 61.00 3328
4 Intel ASCI Red Sandia National Laboratory 2.38 3.207 74.21 9632
5 IBM ASCI Blue PacificSST, IBM SP 604E Lawrence Livermore National Laboratory 2.14 3.868 55.33 5808
6 Compaq AlphaServer SC ES45 1 GHz Los Alamos National Laboratory 2.10 3.04 69.08 1536
7 Hitachi SR8000/MPP University of Tokyo 1.71 2.69 63.57 1152
8 SGI ASCI Blue Mountain Los Alamos National Laboratory 1.61 2.52 63.89 6144
9 IBM SP Power3 375Mhz Naval Oceanographic Office (NAVOCEANO) 1.42 1.97 72.08 1336
10 IBM SP Power3 375Mhz Deutscher Wetterdienst 1.29 1.83 70.49 1280
16
Example of a Scaled Speedup Experiment
17
Parallel Performance Metrics Speedup
Absolute performance
Relative performance
MFLOPS
Speedup
Processors
Processors
Speedup is only one characteristic of a program -
it is not synonymous with performance. In this
comparison of two machines the code achieves
comparable speedups but one of the machines is
faster.
18
(No Transcript)
19
Improving Ratio of Floating Point Operations to
Memory Accesses

subroutine mult(n1,nd1,n2,nd2,y,a,x)
implicit real8 (a-h,o-z)
dimension a(nd1,nd2),y(nd2),x(nd1)
do 20, i1,n1
t0.d0
do 10, j1,n2
tta(j,i)x(j)
10 continue
y(i)t
20 continue
return
end
Unroll the loops

2 FLOPS 2 LOADS
20
Improving Ratio of Floating Point Operations to
Memory Accesses
c works correctly when n1,n2 are multiples of 4
dimension a(nd1,nd2), y(nd2), x(nd1) do
i1,n1-3,4 t10.d0 t20.d0
t30.d0 t40.d0 do j1,n2-3,4
t1t1a(j0,i0)x(j0)a(j1,i0)x(j1) 1
a(j2,i0)x(j2)a(j3,i1)x(j3)
t2t2a(j0,i1)x(j0)a(j1,i1)x(j1) 1
a(j2,i1)x(j2)a(j3,i0)x(j3)
t3t3a(j0,i2)x(j0)a(j1,i2)x(j1) 1
a(j2,i2)x(j2)a(j3,i2)x(j3)
t4t4a(j0,i3)x(j0)a(j1,i3)x(j1) 1
a(j2,i3)x(j2)a(j3,i3)x(j3) enddo
y(i0)t1 y(i1)t2 y(i2)t3
y(i3)t4 enddo
32 FLOPS 20 LOADS
21
Summary of Single-Processor Optimization
Techniques (I)

Spatial and temporal data locality
Loop unrolling
Blocking
Software pipelining
Optimization of data structures
Special functions, library subroutines

22
Summary of Optimization Techniques (II)

Achieving high-performance requires code
restructuring. Minimization of memory traffic is
the single most important goal.
Compilers are getting better good at software
pipelining. But they are not there yet can do
loop transformations only in simple cases,
usually fail to produce optimal blocking,
heuristics for unrolling may not match your code
well, etc.
The optimization process is machine-specific and
requires detailed architectural knowledge.

23
Optimizing Matrix Addition for Caches

Dimension A(n,n), B(n,n), C(n,n)
A, B, C stored by column (as in Fortran)
Algorithm 1
for i1n, for j1n, A(i,j) B(i,j) C(i,j)
Algorithm 2
for j1n, for i1n, A(i,j) B(i,j) C(i,j)
What is memory access pattern for Algs 1 and 2?
Which is faster?
What if A, B, C stored by row (as in C)?

24
Using a Simpler Model of Memory to Optimize

Assume just 2 levels in the hierarchy, fast and
slow
All data initially in slow memory
m number of memory elements (words) moved
between fast and slow memory
tm time per slow memory operation
f number of arithmetic operations
tf time per arithmetic operation lt tm
q f/m average number of flops per slow element
access
Minimum possible Time ftf, when all data in
fast memory
Actual Time ftf mtm ftf(1
(tm/tf)(1/q))
Larger q means Time closer to minimum ftf
Want large q

25
Simple example using memory model

To see results of changing q, consider simple
computation

Assume tf1 Mflop/s on fast memory
Assume moving data is tm 10
Assume h takes q flops
Assume array X is in slow memory

s 0 for i 1, n s s h(Xi)

So m n and f qn
Time read X compute 10n qn
Mflop/s f/t q/(10 q)
As q increases, this approaches the peak speed
of 1 Mflop/s

26
Warm up Matrix-vector multiplication y y Ax

for i 1n
for j 1n
y(i) y(i) A(i,j)x(j)

A(i,)

y(i)
y(i)
x()
27
Warm up Matrix-vector multiplication y y Ax

read x(1n) into fast memory
read y(1n) into fast memory
for i 1n
read row i of A into fast memory
for j 1n
y(i) y(i) A(i,j)x(j)
write y(1n) back to slow memory

m number of slow memory refs 3n n2
f number of arithmetic operations 2n2
q f/m 2
Matrix-vector multiplication limited by slow
memory speed

28
Matrix Multiply CCAB

for i 1 to n
for j 1 to n
for k 1 to n
C(i,j) C(i,j) A(i,k) B(k,j)

A(i,)
C(i,j)
C(i,j)
B(,j)

29
Matrix Multiply CCAB(unblocked, or untiled)

for i 1 to n
read row i of A into fast memory
for j 1 to n
read C(i,j) into fast memory
read column j of B into fast memory
for k 1 to n
C(i,j) C(i,j) A(i,k) B(k,j)
write C(i,j) back to slow memory

A(i,)
C(i,j)
C(i,j)
B(,j)

30
Matrix Multiply (unblocked, or untiled)
qops/slow mem ref

Number of slow memory references on unblocked
matrix multiply
m n3 read each column of B n times
n2 read each column of A once for
each i
2n2 read and write each element of C
once
n3 3n2
So q f/m (2n3)/(n3 3n2)
2 for large n, no improvement over
matrix-vector mult

A(i,)
C(i,j)
C(i,j)
B(,j)

31
Matrix Multiply (blocked, or tiled)

Consider A,B,C to be N by N matrices of b by b
subblocks where bn/N is called the blocksize
for i 1 to N
for j 1 to N
read block C(i,j) into fast memory
for k 1 to N
read block A(i,k) into fast
memory
read block B(k,j) into fast
memory
C(i,j) C(i,j) A(i,k)
B(k,j) do a matrix multiply on blocks
write block C(i,j) back to slow memory

A(i,k)
C(i,j)
C(i,j)

B(k,j)
n is the size of the matrix, N blocks of size b
n Nb
32
Matrix Multiply (blocked or tiled)
qops/slow mem ref
n is the size of the matrix, N blocks of size b
n Nb

Why is this algorithm correct?
Number of slow memory references on blocked
matrix multiply
m Nn2 read each block of B N3 times (N3
n/N n/N)
Nn2 read each block of A N3 times
2n2 read and write each block of
C once
(2N 2)n2
So q f/m 2n3 / ((2N 2)n2)
n/N b for large n
So we can improve performance by increasing the
blocksize b
Can be much faster than matrix-vector multiplty
(q2)
Limit All three blocks from A,B,C must fit in
fast memory (cache), so we
cannot make these blocks arbitrarily large 3b2
lt M, so q b lt sqrt(M/3)
Theorem (Hong, Kung, 1981) Any reorganization of
this algorithm
(that uses only associativity) is limited to q
O(sqrt(M))

33
Model

As much as possible will be overlapped
Dot Product
ACC 0
do i x,n
ACC ACC x(i) y(i)
end do
Experiments done on an IBM RS6000/530
25 MHz
2 cycle to complete FMA can be pipelined
gt 50 Mflop/s peak
one cycle from cache

FMA
FMA
FMA
FMA
FMA
FMA
cycles
34
DOT Operation - Data in Cache

Do 10 I 1, n
T T X(I)Y(I)
10 CONTINUE
Theoretically, 2 loads for X(I) and Y(I), one FMA
operation, no re-use of data
Pseudo-assembler
LOAD fp0,T
label
LOAD fp1,X(I)
LOAD fp2,Y(I)
FMA fp0,fp0,fp1,fp2
BRANCH label

Load y
Load x
FMA
Load y
Load x
FMA
1 result per cycle 25 Mflop/s
35
Matrix-Vector Product

DOT version
DO 20 I 1, M
DO 10 J 1, N
Y(I) Y(I) A(I,J)X(J)
10 CONTINUE
20 CONTINUE
From Cache 22.7 Mflops
From Memory 12.4 Mflops

36
Loop Unrolling

DO 20 I 1, M, 2
T1 Y(I )
T2 Y(I1)
DO 10 J 1, N
T1 T1 A(I,J )X(J)
T2 T2 A(I1,J)X(J)
10 CONTINUE
Y(I ) T1
Y(I1) T2
20 CONTINUE
3 loads, 4 flops
Speed of yyATx, N48

Depth 1 2 3 4
?
Speed 25 33.3 37.5 40 50
Measured 22.7 30.5 34.3 36.5
Memory 12.4 12.7 12.7 12.6
unroll 1 2 loads 2 ops per 2 cycles
unroll 2 3 loads 4 ops per 3 cycles
unroll 3 4 loads 6 ops per 4 cycles
unroll n n1 loads 2n ops per n1 cycles
problem only so many registers

37
Matrix Multiply

DOT version - 25 Mflops in cache
DO 30 J 1, M
DO 20 I 1, M
DO 10 K 1, L
C(I,J) C(I,J)
A(I,K)B(K,J)
10 CONTINUE
20 CONTINUE
30 CONTINUE

38
How to Get Near Peak

Inner loop
4 loads, 8 operations, optimal.
In practice we have measured 48.1 out of a peak
of 50 Mflop/s when in cache

DO 30 J 1, M, 2
DO 20 I 1, M, 2
T11 C(I, J )
T12 C(I, J1)
T21 C(I1,J )
T22 C(I1,J1)
DO 10 K 1, L
T11 T11 A(I, K) B(K,J
)
T12 T12 A(I, K)
B(K,J1)
T21 T21 A(I1,K)B(K,J )
T22 T22 A(I1,K)B(K,J1)
10 CONTINUE
C(I, J ) T11
C(I, J1) T12
C(I1,J ) T21
C(I1,J1) T22
20 CONTINUE
30 CONTINUE

39
BLAS -- Introduction

Clarity code is shorter and easier to read,
Modularity gives programmer larger building
blocks,
Performance manufacturers will provide tuned
machine-specific BLAS,
Program portability machine dependencies are
confined to the BLAS

40
Memory Hierarchy

Key to high performance in effective use of
memory hierarchy
True on all architectures

41
Level 1, 2 and 3 BLAS

Level 1 BLAS Vector-Vector operations
Level 2 BLAS Matrix-Vector operations
Level 3 BLAS Matrix-Matrix operations

42
More on BLAS (Basic Linear Algebra Subroutines)

Industry standard interface(evolving)
Vendors, others supply optimized implementations
History
BLAS1 (1970s)
vector operations dot product, saxpy (y?xy),
etc
m2n, f2n, q 1 or less
BLAS2 (mid 1980s)
matrix-vector operations matrix vector multiply,
etc
mn2, f2n2, q2, less overhead
somewhat faster than BLAS1
BLAS3 (late 1980s)
matrix-matrix operations matrix matrix multiply,
etc
m gt 4n2, fO(n3), so q can possibly be as large
as n, so BLAS3 is potentially much faster than
BLAS2
Good algorithms used BLAS3 when possible (LAPACK)
www.netlib.org/blas, www.netlib.org/lapack

43
Why Higher Level BLAS?

Can only do arithmetic on data at the top of the
hierarchy
Higher level BLAS lets us do this

44
BLAS for Performance

Development of blocked algorithms important for
performance

45
BLAS for Performance

Development of blocked algorithms important for
performance

46
BLAS for Performance

Development of blocked algorithms important for
performance

BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2
(n-by-n matrix vector multiply) vs BLAS 1 (saxpy
of n vectors)
47
Fast linear algebra kernels BLAS

Simple linear algebra kernels such as
matrix-matrix multiply
More complicated algorithms can be built from
these basic kernels.
The interfaces of these kernels have been
standardized as the Basic Linear Algebra
Subroutines (BLAS).
Early agreement on standard interface (1980)
Led to portable libraries for vector and shared
memory parallel machines.
On distributed memory, there is a less-standard
interface called the PBLAS

48
Level 1 BLAS

Operate on vectors or pairs of vectors
perform O(n) operations
return either a vector or a scalar.
saxpy
y(i) a x(i) y(i), for i1 to n.
s stands for single precision, daxpy is for
double precision, caxpy for complex, and zaxpy
for double complex,
sscal y a x, for scalar a and vectors x,y
sdot computes s S ni1 x(i)y(i)

49
Level 2 BLAS

Operate on a matrix and a vector
return a matrix or a vector
O(n2) operations
sgemv matrix-vector multiply
y y Ax
where A is m-by-n, x is n-by-1 and y is m-by-1.
sger rank-one update
A A yxT, i.e., A(i,j) A(i,j)y(i)x(j)
where A is m-by-n, y is m-by-1, x is n-by-1,
strsv triangular solve
solves yTx for x, where T is triangular

50
Level 3 BLAS

Operate on pairs or triples of matrices
returning a matrix
complexity is O(n3).
sgemm Matrix-matrix multiplication
C C AB,
where C is m-by-n, A is m-by-k, and B is k-by-n
strsm multiple triangular solve
solves Y TX for X,
where T is a triangular matrix, and X is a
rectangular matrix.

51
Optimizing in practice

Tiling for registers
loop unrolling, use of named register variables
Tiling for multiple levels of cache
Exploiting fine-grained parallelism within the
processor
super scalar
pipelining
Complicated compiler interactions
Hard to do by hand (but youll try)
Automatic optimization an active research area
PHIPAC www.icsi.berkeley.edu/bilmes/phipac
www.cs.berkeley.edu/iyer/asci_sli
des.ps
ATLAS www.netlib.org/atlas/index.html

52
BLAS -- References

BLAS software and documentation can be obtained
via
WWW http//www.netlib.org/blas,
(anonymous) ftp ftp.netlib.org cd blas get
index
email netlib_at_www.netlib.org with the message
send index from blas
Comments and questions can be addressed to
lapack_at_cs.utk.edu

53
BLAS Papers

C. Lawson, R. Hanson, D. Kincaid, and F. Krogh,
Basic Linear Algebra Subprograms for Fortran
Usage, ACM Transactions on Mathematical Software,
5308--325, 1979.
J. Dongarra, J. Du Croz, S. Hammarling, and R.
Hanson, An Extended Set of Fortran Basic Linear
Algebra Subprograms, ACM Transactions on
Mathematical Software, 14(1)1--32, 1988.
J. Dongarra, J. Du Croz, I. Duff, S. Hammarling,
A Set of Level 3 Basic Linear Algebra
Subprograms, ACM Transactions on Mathematical
Software, 16(1)1--17, 1990.

54
Performance of BLAS

BLAS are specially optimized by the vendor
Sun BLAS uses features in the Ultrasparc
Big payoff for algorithms that can be expressed
in terms of the BLAS3 instead of BLAS2 or BLAS1.
The top speed of the BLAS3
Algorithms like Gaussian elimination organized so
that they use BLAS3

55
How To Get Performance From Commodity Processors?

Todays processors can achieve high-performance,
but this requires extensive machine-specific hand
tuning.
Routines have a large design space w/many
parameters
blocking sizes, loop nesting permutations, loop
unrolling depths, software pipelining strategies,
register allocations, and instruction schedules.
Complicated interactions with the increasingly
sophisticated microarchitectures of new
microprocessors.
A few months ago no tuned BLAS for Pentium for
Linux.
Need for quick/dynamic deployment of optimized
routines.
ATLAS - Automatic Tuned Linear Algebra Software
PhiPac from Berkeley

56
Adaptive Approach for Level 3

Do a parameter study of the operation on the
target machine, done once.
Only generated code is on-chip multiply
BLAS operation written in terms of generated
on-chip multiply
All tranpose cases coerced through data copy to 1
case of on-chip multiply
Only 1 case generated per platform

57
Code Generation Strategy

On-chip multiply optimizes for
TLB access
L1 cache reuse
FP unit usage
Memory fetch
Register reuse
Loop overhead minimization
Takes a couple of hours to run.

Code is iteratively generated timed until
optimal case is found. We try
Differing NBs
Breaking false dependencies
M, N and K loop unrolling

58
(No Transcript)
59
(No Transcript)
60
500x500 gemm-based BLAS on SGI R10000ip28
61
500x500 gemm-based BLAS on UltraSparc 2200
62
Recursive Approach for Other Level
3 BLAS

Recur down to L1 cache block size
Need kernel at bottom of recursion
Use gemm-based kernel for portability

Recursive TRMM
63
500x500 Level 2 BLAS DGEMV
64
Multi-Threaded DGEMMIntel PIII 550 MHz
65
ATLAS

Keep a repository of kernels for specific
machines.
Develop a means of dynamically downloading code
Extend work to allow sparse matrix operations
Extend work to include arbitrary code segments
See http//www.netlib.org/atlas/

66
Algorithms and Architecture

The key to performance is to understand the
algorithm and architecture interaction.
A significant improvement in performance can be
obtained by matching algorithm to the
architecture or vice versa.

67
Algorithm Issues

Use of memory hierarchy
Algorithm pre-fetching
Loop unrolling
Simulating higher precision arithmetic

68
Blocking

TLB Blocking - minimize TLB misses
Cache Blocking - minimize cache misses
Register Blocking - minimize load/stores
The general idea of blocking is to get the
information to a high-speed storage and use it
multiple times so as to amortize the cost of
moving the data
Cache blocking reduced traffic between memory and
cache
Register blocking reduces traffic between cache
and CPU

69
Loop Unrolling

Reduces data dependency delay
Exploits multiple functional units and quad
load/stores effectively.
Minimizes load/stores
Reduces loop overheads
Gives more flexibility to compiler in scheduling
Facilitates algorithm pre-fetching.
What about vector computing?

70
Whats Wrong With Speedup T1/Tp ?

Can lead to very false conclusions.
Speedup in isolation without taking into account
the speed of the processor is unrealistic and
pointless.
Speedup over what?
T1/Tp
There is usually no doubt about Tp
Often considerable dispute over the meaning of T1
Serial code? Same algorithm?

71
Speedup

Can be used to
Study, in isolation, the scaling of one algorithm
on one computer.
As a dimensionless variable in the theory of
scaling.
Should not be used to compare
Different algorithms on the same computer
The same algorithm on different computers.
Different interconnection structures.

72
(No Transcript)
73
Strassens Algorithm for Matrix Multiply
Usual Matrix Multiply
74
Strassens Algorithm
75
Strassens Algorithm

The count of arithmetic operations is
One matrix multiply is replaced by 14 matrix
additions

76
Strassens Algorithm

In reality the use of Strassens Algorithm is
limited by
Additional memory required for storing the P
matrices.
More memory accesses are needed.

77
Outline

Motivation for Dense Linear Algebra
Axb Computational Electromagnetics
Ax lx Quantum Chemistry
Review Gaussian Elimination (GE) for solving Axb
Optimizing GE for caches on sequential machines
using matrix-matrix multiplication (BLAS)
LAPACK library overview and performance
Data layouts on parallel machines
Parallel matrix-matrix multiplication
Parallel Gaussian Elimination
ScaLAPACK library overview
Eigenvalue problem

78
Parallelism in Sparse Matrix-vector multiplication

y Ax, where A is sparse and n x n
Questions
which processors store
yi, xi, and Ai,j
which processors compute
yi sum (from 1 to n) Ai,j xj
(row i of A) . x a
sparse dot product
Partitioning
Partition index set 1,,n N1 u N2 u u Np
For all i in Nk, Processor k stores yi, xi,
and row i of A
For all i in Nk, Processor k computes yi (row
i of A) . x
owner computes rule Processor k compute the
yis it owns
Goals of partitioning
balance load (how is load measured?)
balance storage (how much does each processor
store?)
minimize communication (how much is communicated?)

79
Graph Partitioning and Sparse Matrices

Relationship between matrix and graph

1 2 3 4 5 6
1 1 1 1 2 1 1
1 1 3
1 1 1 4 1 1
1 1 5 1 1 1
1 6 1 1 1 1

A good partition of the graph has
equal (weighted) number of nodes in each part
(load and storage balance)
minimum number of edges crossing between
(minimize communication)
Can reorder the rows/columns of the matrix by
putting all the nodes in one partition together

80
More on Matrix Reordering via Graph Partitioning

Ideal matrix structure for parallelism
(nearly) block diagonal
p (number of processors) blocks
few non-zeros outside these blocks, since these
require communication

P0 P1 P2 P3 P4
81
What about implicit methods and eigenproblems?

Direct methods (Gaussian elimination)
Called LU Decomposition, because we factor A
LU
Future lectures will consider both dense and
sparse cases
More complicated than sparse-matrix vector
multiplication
Iterative solvers
Will discuss several of these in future
Jacobi, Successive overrelaxiation (SOR) ,
Conjugate Gradients (CG), Multigrid,...
Most have sparse-matrix-vector multiplication in
kernel
Eigenproblems
Future lectures will discuss dense and sparse
cases
Also depend on sparse-matrix-vector
multiplication, direct methods
Graph partitioning

Partial Differential Equations
PDEs

83
Continuous Variables, Continuous Parameters

Examples of such systems include
Heat flow Temperature(position, time)
Diffusion Concentration(position, time)
Electrostatic or Gravitational Potential Pote
ntial(position)
Fluid flow Velocity,Pressure,Density(position,tim
e)
Quantum mechanics Wave-function(position,time)
Elasticity Stress,Strain(position,time)

84
Example Deriving the Heat Equation
x
x-h
0
1
xh

Consider a simple problem
A bar of uniform material, insulated except at
ends
Let u(x,t) be the temperature at position x at
time t
Heat travels from x-h to xh at rate proportional
to

d u(x,t) (u(x-h,t)-u(x,t))/h -
(u(x,t)- u(xh,t))/h dt
h
C

As h 0, we get the heat equation

85
Explicit Solution of the Heat Equation

For simplicity, assume C1
Discretize both time and position
Use finite differences with uj,i as the heat at
time t idt (i 0,1,2,) and position x jh
(j0,1,,N1/h)
initial conditions on uj,0
boundary conditions on u0,i and uN,i
At each timestep i 0,1,2,...
This corresponds to
matrix vector multiply (what is matrix?)
nearest neighbors on grid

t5 t4 t3 t2 t1 t0
For j0 to N uj,i1 zuj-1,i
(1-2z)uj,i zuj1,i where z dt/h2
u0,0 u1,0 u2,0 u3,0 u4,0 u5,0
86
Parallelism in Explicit Method for PDEs

Partitioning the space (x) into p largest chunks
good load balance (assuming large number of
points relative to p)
minimized communication (only p chunks)
Generalizes to
multiple dimensions
arbitrary graphs ( sparse matrices)
Problem with explicit approach
numerical instability
solution blows up eventually if z dt/h gt .5
need to make the timesteps very small when h is
small dt lt .5h

2
2
87
Instability in solving the heat equation
explicitly
88
Implicit Solution

As with many (stiff) ODEs, need an implicit
method
This turns into solving the following equation
Here I is the identity matrix and T is
I.e., essentially solving Poissons equation in 1D

(I (z/2)T) u,i1 (I - (z/2)T) u,i
2 -1 -1 2 -1 -1 2 -1
-1 2 -1 -1 2
Graph and stencil
T
2
-1
-1
89
2D Implicit Method

Similar to the 1D case, but the matrix T is now
Multiplying by this matrix (as in the explicit
case) is simply nearest neighbor computation on
2D grid
To solve this system, there are several techniques

Graph and stencil
4 -1 -1 -1 4 -1 -1
-1 4 -1 -1
4 -1 -1 -1 -1 4
-1 -1 -1
-1 4 -1
-1 4 -1
-1 -1 4 -1
-1 -1 4
-1
4
-1
-1
T
-1
90
Algorithms for 2D Poisson Equation with N unknowns

Algorithm Serial PRAM Memory Procs
Dense LU N3 N N2 N2
Band LU N2 N N3/2 N
Jacobi N2 N N N
Explicit Inv. N log N N N
Conj.Grad. N 3/2 N 1/2 log N N N
RB SOR N 3/2 N 1/2 N N
Sparse LU N 3/2 N 1/2 Nlog N N
FFT Nlog N log N N N
Multigrid N log2 N N N
Lower bound N log N N
PRAM is an idealized parallel model with zero
cost communication
(see next slide for explanation)

2
2
2
91
Short explanations of algorithms on previous slide

Sorted in two orders (roughly)
from slowest to fastest on sequential machines
from most general (works on any matrix) to most
specialized (works on matrices like T)
Dense LU Gaussian elimination works on any
N-by-N matrix
Band LU exploit fact that T is nonzero only on
sqrt(N) diagonals nearest main diagonal, so
faster
Jacobi essentially does matrix-vector multiply
by T in inner loop of iterative algorithm
Explicit Inverse assume we want to solve many
systems with T, so we can precompute and store
inv(T) for free, and just multiply by it
Its still expensive!
Conjugate Gradients uses matrix-vector
multiplication, like Jacobi, but exploits
mathematical properies of T that Jacobi does not
Red-Black SOR (Successive Overrelaxation)
Variation of Jacobi that exploits yet different
mathematical properties of T
Used in Multigrid
Sparse LU Gaussian elimination exploiting
particular zero structure of T
FFT (Fast Fourier Transform) works only on
matrices very like T
Multigrid also works on matrices like T, that
come from elliptic PDEs
Lower Bound serial (time to print answer)
parallel (time to combine N inputs)

92
Composite mesh from a mechanical structure
93
Converting the mesh to a matrix
94
Effects of Ordering Rows and Columns on Gaussian
Elimination
95
Irregular mesh NASA Airfoil in 2D (direct
solution)
96
Irregular mesh Tapered Tube (multigrid)
97
Adaptive Mesh Refinement (AMR)

Adaptive mesh around an explosion
John Bell and Phil Colella at LBL (see class web
page for URL)
Goal of Titanium is to make these algorithms
easier to implement
in parallel

98
Computational Electromagnetics

Developed during 1980s, driven by defense
applications
Determine the RCS (radar cross section) of
airplane
Reduce signature of plane (stealth technology)
Other applications are antenna design, medical
equipment
Two fundamental numerical approaches
MOM methods of moments ( frequency domain), and
Finite differences (time domain)

99
Computational Electromagnetics
- Discretize surface into triangular facets using
standard modeling tools - Amplitude of currents
on surface are unknowns
- Integral equation is discretized into a set of
linear equations
image NW Univ. Comp. Electromagnetics Laboratory
http//nueml.ece.nwu.edu/
100
Computational Electromagnetics (MOM)
After discretization the integral equation has
the form A x b where A is
the (dense) impedance matrix, x is the unknown
vector of amplitudes, and b is the excitation
vector. (see Cwik, Patterson, and Scott,
Electromagnetic Scattering on the Intel
Touchstone Delta, IEEE Supercomputing 92, pp 538
- 542)
101
Computational Electromagnetics (MOM)
The main steps in the solution process are
Fill computing the matrix elements
of A Factor factoring the dense matrix
A Solve solving for one or more
excitations b Field Calc computing the fields
scattered from the object
102
Analysis of MOM for Parallel Implementation
Task Work Parallelism
Parallel Speed
Fill O(n2) embarrassing
low Factor O(n3)
moderately diff. very high Solve
O(n2) moderately diff.
high Field Calc. O(n)
embarrassing high
103
Results for Parallel Implementation on Delta
Task Time (hours)
Fill 9.20
Factor 8.25 Solve
2 .17
Field Calc. 0.12
The problem solved was for a matrix of size
48,672. (The world record in 1991.)
104
Current Records for Solving Dense Systems
Year System Size Machine
Procs Gflops (Peak) 1950's O(100)

1995 128,600 Intel Paragon
6768 281 ( 338) 1996
215,000 Intel ASCI Red 7264
1068 (1453) 1998 148,000
Cray T3E 1488 1127
(1786) 1998 235,000 Intel
ASCI Red 9152 1338 (1830) 1999
374,000 SGI ASCI Blue 5040
1608 (2520) 1999 362,880
Intel ASCI Red 9632 2379 (3207) 2000
430,000 IBM ASCI White 8192
4928 (12000)
source Alan Edelman http//www-math.mit.edu/edel
man/records.html LINPACK Benchmark
http//www.netlib.org/performance/html/PDSreports.
html
105
Computational Chemistry

Seek energy levels of a molecule, crystal, etc.
Solve Schroedingers Equation for energy levels
eigenvalues
Discretize to get Ax lBx, solve for eigenvalues
l and eigenvectors x
A and B large, symmetric or Hermitian matrices (B
positive definite)
May want some or all eigenvalues/eigenvectors
MP-Quest (Sandia NL)
Si and sapphire crystals of up to 3072 atoms
Local Density Approximation to Schroedinger
Equation
A and B up to n40000, Hermitian
Need all eigenvalues and eigenvectors
Need to iterate up to 20 times (for
self-consistency)
Implemented on Intel ASCI Red
9200 Pentium Pro 200 processors (4600 Duals, a
CLUMP)
Overall application ran at 605 Gflops (out of
1800 Glops peak),
Eigensolver ran at 684 Gflops
www.cs.berkeley.edu/stanley/gbell/index.html
Runner-up for Gordon Bell Prize at Supercomputing
98

106
Review of Gaussian Elimination (GE) for solving
Axb

Add multiples of each row to later rows to make A
upper triangular
Solve resulting triangular system Ux c by
substitution

for each column i zero it out below the
diagonal by adding multiples of row i to later
rows for i 1 to n-1 for each row j below
row i for j i1 to n add a
multiple of row i to row j for k i to
n A(j,k) A(j,k) -
(A(j,i)/A(i,i)) A(i,k)
107
Refine GE Algorithm (1)

Initial Version
Remove computation of constant A(j,i)/A(i,i) from
inner loop

Last version
Dont compute what we already know
zeros below diagonal in column i

for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i to n
A(j,k) A(j,k) - m A(i,k)
for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i1 to n
A(j,k) A(j,k) - m A(i,k)
109
Refine GE Algorithm (3)

Last version
Store multipliers m below diagonal in zeroed
entries for later use

for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i1 to n
A(j,k) A(j,k) - m A(i,k)
for i 1 to n-1 for j i1 to n
A(j,i) A(j,i)/A(i,i) for k i1 to
n A(j,k) A(j,k) - A(j,i) A(i,k)
110
Refine GE Algorithm (4)

Last version
Express using matrix operations (BLAS)

for i 1 to n-1 for j i1 to n
A(j,i) A(j,i)/A(i,i) for k i1 to
n A(j,k) A(j,k) - A(j,i) A(i,k)
for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) A(i1n,i1n) A(i1n , i1n )
- A(i1n , i) A(i , i1n)
111
What GE really computes

Call the strictly lower triangular matrix of
multipliers M, and let L IM
Call the upper triangle of the final matrix U
Lemma (LU Factorization) If the above algorithm
terminates (does not divide by zero) then A LU
Solving Axb using GE
Factorize A LU using GE
(cost 2/3 n3 flops)
Solve Ly b for y, using substitution (cost
n2 flops)
Solve Ux y for x, using substitution (cost
n2 flops)
Thus Ax (LU)x L(Ux) Ly b as desired

for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) A(i1n,i1n) A(i1n , i1n ) -
A(i1n , i) A(i , i1n)
112
Problems with basic GE algorithm

What if some A(i,i) is zero? Or very small?
Result may not exist, or be unstable, so need
to pivot
Current computation all BLAS 1 or BLAS 2, but we
know that BLAS 3 (matrix multiply) is fastest
(Lecture 2)

for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) BLAS 1 (scale a vector)
A(i1n,i1n) A(i1n , i1n ) BLAS 2
(rank-1 update) - A(i1n , i)
A(i , i1n)
Peak
BLAS 3
BLAS 2
BLAS 1
113
Pivoting in Gaussian Elimination

A 0 1 fails completely, even though
A is easy
1 0
Illustrate problems in 3-decimal digit
arithmetic
A 1e-4 1 and b 1 ,
correct answer to 3 places is x 1
1 1
2
1
Result of LU decomposition is
L 1 0 1
0 No roundoff error yet
fl(1/1e-4) 1 1e4
1
U 1e-4 1
1e-4 1 Error in 4th decimal place
0 fl(1-1e41)
0 -1e4
Check if A LU 1e-4 1
(2,2) entry entirely wrong
1
0

114
Gaussian Elimination with Partial Pivoting (GEPP)

Partial Pivoting swap rows so that each
multiplier
L(i,j) A(j,i)/A(i,i) lt
1

for i 1 to n-1 find and record k where
A(k,i) maxi lt j lt n A(j,i)
i.e. largest entry in rest of column i if
A(k,i) 0 exit with a warning that A
is singular, or nearly so elseif k ! i
swap rows i and k of A end if
A(i1n,i) A(i1n,i) / A(i,i) each
quotient lies in -1,1 A(i1n,i1n)
A(i1n , i1n ) - A(i1n , i) A(i , i1n)

Lemma This algorithm computes A PLU, where
P is a
permutation matrix
Since each entry of L(i,j) lt 1, this
algorithm is considered
numerically stable
For details see LAPACK code at
www.netlib.org/lapack/single/sgetf2.f

115
Converting BLAS2 to BLAS3 in GEPP

Blocking
Used to optimize matrix-multiplication
Harder here because of data dependencies in GEPP
Delayed Updates
Save updates to trailing matrix from several
consecutive BLAS2 updates
Apply many saved updates simultaneously in one
BLAS3 operation
Same idea works for much of dense linear algebra
Open questions remain
Need to choose a block size b
Algorithm will save and apply b updates
b must be small enough so that active submatrix
consisting of b columns of A fits in cache
b must be large enough to make BLAS3 fast

116
Blocked GEPP (www.netlib.org/lapack/single/sgetr
f.f)
for ib 1 to n-1 step b Process matrix b
columns at a time end ib b-1
Point to end of block of b columns
apply BLAS2 version of GEPP to get A(ibn ,
ibend) P L U let LL denote the
strict lower triangular part of A(ibend ,
ibend) I A(ibend , end1n) LL-1
A(ibend , end1n) update next b rows
of U A(end1n , end1n ) A(end1n ,
end1n ) - A(end1n , ibend)
A(ibend , end1n)
apply delayed updates with
single matrix-multiply
with inner dimension b
(For a correctness proof, see on-lines notes.)
117
Efficiency of Blocked GEPP
118
Overview of LAPACK

Standard library for dense/banded linear algebra
Linear systems Axb
Least squares problems minx Ax-b 2
Eigenvalue problems Ax lx, Ax lBx
Singular value decomposition (SVD) A USVT
Algorithms reorganized to use BLAS3 as much as
possible
Basis of math libraries on many computers
Many algorithmic innovations remain
Projects available

119
Performance of LAPACK (n1000)
120
Performance of LAPACK (n100)
121
Parallelizing Gaussian Elimination

parallelization steps
Decomposition identify enough parallel work, but
not too much
Assignment load balance work among threads
Orchestrate communication and synchronization
Mapping which processors execute which threads
Decomposition
In BLAS 2 algorithm nearly each flop in inner
loop can be done in parallel, so with n2
processors, need 3n parallel steps
This is too fine-grained, prefer calls to local
matmuls instead

for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) BLAS 1 (scale a vector)
A(i1n,i1n) A(i1n , i1n ) BLAS 2
(rank-1 update) - A(i1n , i)
A(i , i1n)
122
Assignment of parallel work in GE

Think of assigning submatrices to threads, where
each thread responsible for updating submatrix it
owns
owner computes rule natural because of locality
What should submatrices look like to achieve load
balance?

123
Different Data Layouts for Parallel GE (on 4
procs)
Load balanced, but cant easily use BLAS2 or BLAS3
Bad load balance P0 idle after first n/4 steps
Can trade load balance and BLAS2/3 performance
by choosing b, but factorization of block column
is a bottleneck
The winner!
Complicated addressing
124
Blocked Partitioned Algorithms

Orthogonal reduction to
(upper) Hessenberg form
symmetric tridiagonal form
bidiagonal form
Block QR iteration for nonsymmetric eigenvalue
problems

LU Factorization
Cholesky factorization
Symmetric indefinite factorization
Matrix inversion
QR, QL, RQ, LQ factorizations
Form Q or QTC

125
Memory Hierarchy and LAPACK

ijk - implementations
Effects order in which
data referenced some better at
allowing data to keep in higher levels of
memory hierarchy.
Applies for matrix multiply, reductions to
condensed form
May do slightly more
flops
Up to 3 times faster

126
Derivation of Blocked AlgorithmsCholesky
Factorization A UTU

Equating coefficient of the jth column, we obtain
Hence, if U11 has already been computed, we can
compute uj and ujj from the equations
127
LINPACK Implementation

Here is the body of the LINPACK routine SPOFA
which implements the method
DO 30 J 1, N
INFO J
S 0.0E0
JM1 J - 1
IF( JM1.LT.1 ) GO TO 20
DO 10 K 1, JM1
T A( K, J ) - SDOT( K-1,
A( 1, K ), 1,A( 1, J ), 1 )
T T / A( K, K )
A( K, J ) T
S S TT
10 CONTINUE
20 CONTINUE
S A( J, J ) - S
C ...EXIT
IF( S.LE.0.0E0 ) GO TO 40
A( J, J ) SQRT( S )
30 CONTINUE

128
LAPACK Implementation

DO 10 J 1, N
CALL STRSV( 'Upper', 'Transpose',
'Non-Unit, J-1, A, LDA, A( 1, J ), 1 )
S A( J, J ) - SDOT( J-1, A( 1, J ),
1, A( 1, J ), 1 )
IF( S.LE.ZERO ) GO TO 20
A( J, J ) SQRT( S )
10 CONTINUE
This change by itself is sufficient to
significantly improve the performance on a number
of machines.
From 238 to 312 Mflop/s for a matrix of order 500
on a Pentium 4-1.7 GHz.
However on peak is 1,700 Mflop/s.
Suggest further work needed.

129
Derivation of Blocked Algorithms

Equating coefficient of second block of columns,
we obtain
Hence, if U11 has already been computed, we can
compute U12 as the solution of the following
equations by a call to the Level 3 BLAS routine
STRSM
130
LAPACK Blocked Algorithms
DO 10 J 1, N, NB CALL STRSM( 'Left',
'Upper', 'Transpose','Non-Unit', J-1, JB, ONE, A,
LDA, A( 1, J ), LDA )
CALL SSYRK( 'Upper', 'Transpose', JB, J-1,-ONE,
A( 1, J ), LDA, ONE, A( J, J ),
LDA ) CALL SPOTF2( 'Upper', JB, A( J, J ),
LDA, INFO ) IF( INFO.NE.0 ) GO TO 20 10
CONTINUE

On Pentium 4, L3 BLAS squeezes a lot more out of
1 proc

Intel Pentium 4 1.7 GHz N 500 Rate of Execution
Linpack variant (L1B) 238 Mflop/s
Level 2 BLAS Variant 312 Mflop/s
Level 3 BLAS Variant 1262 Mflop/s
131
LAPACK Contents

Combines algorithms from LINPACK and EISPACK into
a single package. User interface similar to
LINPACK.
Built on the L 1, 2 and 3 BLAS, for high
performance (manufacturers optimize BLAS)
LAPACK does not provide routines for structured
problems or general sparse matrices (i.e sparse
storage formats such as compressed-row, -column,
-diagonal, skyline ...).

132
LAPACK Ongoing Work

Add functionality
updating/downdating, divide and conquer least
squares,bidiagonal bisection, bidiagonal inverse
iteration, band SVD, Jacobi methods, ...
Move to new generation of high performance
machines
IBM SPs, CRAY T3E, SGI Origin, clusters of
workstations
New challenges
New languages FORTRAN 90, HP FORTRAN, ...
(CMMD, MPL, NX ...)
many flavors of message passing, need standard
(PVM, MPI) BLACS
Highly varying ratio
Many ways to layout data,
Fastest parallel algorithm sometimes less stable
numerically.

133
History of Block Partitioned Algorithms

Early algorithms involved use of small main
memory using tapes as secondary storage.
Recent work centers on use of vector registers,
level 1 and 2 cache, main memory, and out of
core memory.

134
Blocked Partitioned Algorithms