# Lecture 6: Linear Algebra Algorithms - PowerPoint PPT Presentation

View by Category
Title:

## Lecture 6: Linear Algebra Algorithms

Description:

### Slides are adapted from Jim Demmel, UCB's Lecture ... and the summation can be performed in a tree fashion, the algorithm ... Illustration of Amdahl's Law. 7 ... – PowerPoint PPT presentation

Number of Views:603
Avg rating:3.0/5.0
Slides: 135
Provided by: david3092
Category:
Tags:
Transcript and Presenter's Notes

Title: Lecture 6: Linear Algebra Algorithms

1
Lecture 6Linear Algebra Algorithms
• Jack Dongarra, U of Tennessee

Slides are adapted from Jim Demmel, UCBs Lecture
on Linear Algebra Algorithms
2
• part1 (2 points) Write an MPI program
• (1) code produce correct result 2/2.
• (2) code run but could not give accurate result
1.5/2.
• (3) code could not run or run but nothing 1/2
• Note If you only submit one program which can
print "processor
• contribution" and "total integral" correctly, I
will give 1/2
• for this part and 2/2 for part2. so you total
score for part1
• and part2 will be 3/4.
• part2 (2 points) Modify your program to sum the
contributions to give the total value of the
integral.
• (1) code produce correct result 2/2.
• (2) code run but could not give accurate result
1.5/2.
• (3) code could not run or run but nothing 1/2
• part3 (4 points) Give an expression for the
efficiency of the parallel algorithm for m
subintervals
• (1) give t_p(1), t_p(p), S(p) and E(p) correctly
4/4.
• (2) give t_p(1), t_p(p) and S(p) correctly 3/4.
• (3) give t_p(1) and t_p(p) correctly 2/4.
• (4) give other 1/4.

3
• If we are adding an array of numbers and have a
choice do we add them from largest to smallest or
smallest to largest?

4
Parallel Performance Metrics
Absolute Elapsed (wall-clock) Time T(n )
Speedup S(n ) T(1) / T(n ) where T1 is
the time for the best serial implementation.
gt Performance improvement due to parallelsm
Parallel Efficiency E(n ) T(1) / n T(n )
Ideal Speedup SI (n ) n -
Theoretical limit obtainable rarely - Ignores
all of real life These definitions apply to a
fixed-problem experiment.
5
Amdahls Law
Amdahls Law places a strict limit on the speedup
that can be realized by using multiple
processors. Two equivalent expressions for
Amdahls Law are given below
tN (fp/N fs)t1 Effect of multiple
processors on run time S 1/(fs fp/N)
Effect of multiple processors on
speedup Where fs serial fraction of code fp
parallel fraction of code 1 - fs N number
of processors
6
Illustration of Amdahls Law
It takes only a small fraction of serial content
in a code to degrade the parallel performance. It
is essential to determine the scaling behavior of
your code before doing production runs using
large numbers of processors
7
Amdahls Law Vs. Reality
Amdahls Law provides a theoretical upper limit
on parallel speedup assuming that there are no
costs for communications. In reality,
communications (and I/O) will result in a further
So whats going on?
8
More on Amdahls Law
• Amdahls Law can be generalized to any two
processes of with different speeds
• Ex. Apply to fprocessor and fmemory
• The growing processor-memory performance gap will
undermine our efforts at achieving maximum
possible speedup!

9
Gustafsons Law
• Thus, Amdahls Law predicts that there is a
maximum scalability for an application,
determined by its parallel fraction, and this
limit is generally not large.
• tN (fp/N fs)t1 Effect of multiple
processors on run time
• S 1/(fs fp/N) Effect of multiple
processors on speedup
• Where
• fs serial fraction of code
• fp parallel fraction of code 1 - fs
• N number of processors
• There is a way around this increase the problem
size
• bigger problems mean bigger grids or more
particles bigger arrays
• number of serial operations generally remains
constant number of parallel operations
increases parallel fraction increases

10
Fixed-Problem Size Scaling
Scaling, Problem-Constrained, constant-problem
size (CPS), variable subgrid
1 Amdahl Limit SA(n) T(1) / T(n)
----------------
f / n ( 1 -
f ) This bounds the speedup based only on the
fraction of the code that cannot use parallelism
( 1- f ) it ignores all other factors SA
--gt 1 / ( 1- f ) as n --gt
11
Fixed-Problem Size Scaling (Contd)
Efficiency (n) T(1) / T(n) n
Memory requirements decrease with n
Surface-to-volume ratio increases with n
Superlinear speedup possible from cache
effects Motivation what is the largest of
procs I can use effectively and what is the
fastest time that I can solve a given problem?
Problems - Sequential runs often not possible
(large problems) - Speedup (and efficiency) is
12
Fixed-Problem Size Scaling Examples
S. Goedecker and Adolfy Hoisie, Achieving High
Performance in Numerical Computations on RISC
Workstations and Parallel Systems,International
Conference on Computational Physics PC'97 Santa
Cruz, August 25-28 1997.
13
Fixed-Problem Size Scaling Examples
14
Scaled Speedup Experiments
a.k.a. Fixed Subgrid-Size, Weak Scaling,
Gustafson scaling. Motivation Want to use a
larger machine to solve a larger global problem
in the same amount of time. Memory and
surface-to-volume effects remain constant.
15
Top500 Data
Rank Manufacturer Computer Installation Site Rmax TF/s R peak TF/s Ratio Max/Peak Proc
1 IBM ASCI WhiteSP Power3 Lawrence Livermore National Laboratory 7.23 12 60.25 8192
2 Compaq AlphaServer SC ES45 1 GHz Pittsburgh Supercomputing Center 4.06 6.05 67.11 3024
3 IBM SP Power3 375 MHz NERSC/LBNL 3.05 5.00 61.00 3328
4 Intel ASCI Red Sandia National Laboratory 2.38 3.207 74.21 9632
5 IBM ASCI Blue PacificSST, IBM SP 604E Lawrence Livermore National Laboratory 2.14 3.868 55.33 5808
6 Compaq AlphaServer SC ES45 1 GHz Los Alamos National Laboratory 2.10 3.04 69.08 1536
7 Hitachi SR8000/MPP University of Tokyo 1.71 2.69 63.57 1152
8 SGI ASCI Blue Mountain Los Alamos National Laboratory 1.61 2.52 63.89 6144
9 IBM SP Power3 375Mhz Naval Oceanographic Office (NAVOCEANO) 1.42 1.97 72.08 1336
10 IBM SP Power3 375Mhz Deutscher Wetterdienst 1.29 1.83 70.49 1280
16
Example of a Scaled Speedup Experiment
17
Parallel Performance Metrics Speedup
Absolute performance
Relative performance
MFLOPS
Speedup
Processors
Processors
Speedup is only one characteristic of a program -
it is not synonymous with performance. In this
comparison of two machines the code achieves
comparable speedups but one of the machines is
faster.
18
(No Transcript)
19
Improving Ratio of Floating Point Operations to
Memory Accesses
• subroutine mult(n1,nd1,n2,nd2,y,a,x)
• implicit real8 (a-h,o-z)
• dimension a(nd1,nd2),y(nd2),x(nd1)
• do 20, i1,n1
• t0.d0
• do 10, j1,n2
• tta(j,i)x(j)
• 10 continue
• y(i)t
• 20 continue
• return
• end
• Unroll the loops

20
Improving Ratio of Floating Point Operations to
Memory Accesses
c works correctly when n1,n2 are multiples of 4
dimension a(nd1,nd2), y(nd2), x(nd1) do
i1,n1-3,4 t10.d0 t20.d0
t30.d0 t40.d0 do j1,n2-3,4
t1t1a(j0,i0)x(j0)a(j1,i0)x(j1) 1
a(j2,i0)x(j2)a(j3,i1)x(j3)
t2t2a(j0,i1)x(j0)a(j1,i1)x(j1) 1
a(j2,i1)x(j2)a(j3,i0)x(j3)
t3t3a(j0,i2)x(j0)a(j1,i2)x(j1) 1
a(j2,i2)x(j2)a(j3,i2)x(j3)
t4t4a(j0,i3)x(j0)a(j1,i3)x(j1) 1
a(j2,i3)x(j2)a(j3,i3)x(j3) enddo
y(i0)t1 y(i1)t2 y(i2)t3
y(i3)t4 enddo
21
Summary of Single-Processor Optimization
Techniques (I)
• Spatial and temporal data locality
• Loop unrolling
• Blocking
• Software pipelining
• Optimization of data structures
• Special functions, library subroutines

22
Summary of Optimization Techniques (II)
• Achieving high-performance requires code
restructuring. Minimization of memory traffic is
the single most important goal.
• Compilers are getting better good at software
pipelining. But they are not there yet can do
loop transformations only in simple cases,
usually fail to produce optimal blocking,
heuristics for unrolling may not match your code
well, etc.
• The optimization process is machine-specific and
requires detailed architectural knowledge.

23
• Dimension A(n,n), B(n,n), C(n,n)
• A, B, C stored by column (as in Fortran)
• Algorithm 1
• for i1n, for j1n, A(i,j) B(i,j) C(i,j)
• Algorithm 2
• for j1n, for i1n, A(i,j) B(i,j) C(i,j)
• What is memory access pattern for Algs 1 and 2?
• Which is faster?
• What if A, B, C stored by row (as in C)?

24
Using a Simpler Model of Memory to Optimize
• Assume just 2 levels in the hierarchy, fast and
slow
• All data initially in slow memory
• m number of memory elements (words) moved
between fast and slow memory
• tm time per slow memory operation
• f number of arithmetic operations
• tf time per arithmetic operation lt tm
• q f/m average number of flops per slow element
access
• Minimum possible Time ftf, when all data in
fast memory
• Actual Time ftf mtm ftf(1
(tm/tf)(1/q))
• Larger q means Time closer to minimum ftf
• Want large q

25
Simple example using memory model
• To see results of changing q, consider simple
computation
• Assume tf1 Mflop/s on fast memory
• Assume moving data is tm 10
• Assume h takes q flops
• Assume array X is in slow memory

s 0 for i 1, n s s h(Xi)
• So m n and f qn
• Time read X compute 10n qn
• Mflop/s f/t q/(10 q)
• As q increases, this approaches the peak speed
of 1 Mflop/s

26
Warm up Matrix-vector multiplication y y Ax
• for i 1n
• for j 1n
• y(i) y(i) A(i,j)x(j)

A(i,)

y(i)
y(i)
x()
27
Warm up Matrix-vector multiplication y y Ax
• read x(1n) into fast memory
• read y(1n) into fast memory
• for i 1n
• read row i of A into fast memory
• for j 1n
• y(i) y(i) A(i,j)x(j)
• write y(1n) back to slow memory
• m number of slow memory refs 3n n2
• f number of arithmetic operations 2n2
• q f/m 2
• Matrix-vector multiplication limited by slow
memory speed

28
Matrix Multiply CCAB
• for i 1 to n
• for j 1 to n
• for k 1 to n
• C(i,j) C(i,j) A(i,k) B(k,j)

A(i,)
C(i,j)
C(i,j)
B(,j)

29
Matrix Multiply CCAB(unblocked, or untiled)
• for i 1 to n
• read row i of A into fast memory
• for j 1 to n
• read C(i,j) into fast memory
• read column j of B into fast memory
• for k 1 to n
• C(i,j) C(i,j) A(i,k) B(k,j)
• write C(i,j) back to slow memory

A(i,)
C(i,j)
C(i,j)
B(,j)

30
Matrix Multiply (unblocked, or untiled)
qops/slow mem ref
• Number of slow memory references on unblocked
matrix multiply
• m n3 read each column of B n times
• n2 read each column of A once for
each i
• 2n2 read and write each element of C
once
• n3 3n2
• So q f/m (2n3)/(n3 3n2)
• 2 for large n, no improvement over
matrix-vector mult

A(i,)
C(i,j)
C(i,j)
B(,j)

31
Matrix Multiply (blocked, or tiled)
• Consider A,B,C to be N by N matrices of b by b
subblocks where bn/N is called the blocksize
• for i 1 to N
• for j 1 to N
• read block C(i,j) into fast memory
• for k 1 to N
• read block A(i,k) into fast
memory
• read block B(k,j) into fast
memory
• C(i,j) C(i,j) A(i,k)
B(k,j) do a matrix multiply on blocks
• write block C(i,j) back to slow memory

A(i,k)
C(i,j)
C(i,j)

B(k,j)
n is the size of the matrix, N blocks of size b
n Nb
32
Matrix Multiply (blocked or tiled)
qops/slow mem ref
n is the size of the matrix, N blocks of size b
n Nb
• Why is this algorithm correct?
• Number of slow memory references on blocked
matrix multiply
• m Nn2 read each block of B N3 times (N3
n/N n/N)
• Nn2 read each block of A N3 times
• 2n2 read and write each block of
C once
• (2N 2)n2
• So q f/m 2n3 / ((2N 2)n2)
• n/N b for large n
• So we can improve performance by increasing the
blocksize b
• Can be much faster than matrix-vector multiplty
(q2)
• Limit All three blocks from A,B,C must fit in
fast memory (cache), so we
• cannot make these blocks arbitrarily large 3b2
lt M, so q b lt sqrt(M/3)
• Theorem (Hong, Kung, 1981) Any reorganization of
this algorithm
• (that uses only associativity) is limited to q
O(sqrt(M))

33
Model
• As much as possible will be overlapped
• Dot Product
• ACC 0
• do i x,n
• ACC ACC x(i) y(i)
• end do
• Experiments done on an IBM RS6000/530
• 25 MHz
• 2 cycle to complete FMA can be pipelined
• gt 50 Mflop/s peak
• one cycle from cache

FMA
FMA
FMA
FMA
FMA
FMA
cycles
34
DOT Operation - Data in Cache
• Do 10 I 1, n
• T T X(I)Y(I)
• 10 CONTINUE
• Theoretically, 2 loads for X(I) and Y(I), one FMA
operation, no re-use of data
• Pseudo-assembler
• label
• FMA fp0,fp0,fp1,fp2
• BRANCH label

FMA
FMA
1 result per cycle 25 Mflop/s
35
Matrix-Vector Product
• DOT version
• DO 20 I 1, M
• DO 10 J 1, N
• Y(I) Y(I) A(I,J)X(J)
• 10 CONTINUE
• 20 CONTINUE
• From Cache 22.7 Mflops
• From Memory 12.4 Mflops

36
Loop Unrolling
• DO 20 I 1, M, 2
• T1 Y(I )
• T2 Y(I1)
• DO 10 J 1, N
• T1 T1 A(I,J )X(J)
• T2 T2 A(I1,J)X(J)
• 10 CONTINUE
• Y(I ) T1
• Y(I1) T2
• 20 CONTINUE
• Speed of yyATx, N48
• Depth 1 2 3 4
?
• Speed 25 33.3 37.5 40 50
• Measured 22.7 30.5 34.3 36.5
• Memory 12.4 12.7 12.7 12.6
• unroll 1 2 loads 2 ops per 2 cycles
• unroll 2 3 loads 4 ops per 3 cycles
• unroll 3 4 loads 6 ops per 4 cycles
• unroll n n1 loads 2n ops per n1 cycles
• problem only so many registers

37
Matrix Multiply
• DOT version - 25 Mflops in cache
• DO 30 J 1, M
• DO 20 I 1, M
• DO 10 K 1, L
• C(I,J) C(I,J)
A(I,K)B(K,J)
• 10 CONTINUE
• 20 CONTINUE
• 30 CONTINUE

38
How to Get Near Peak
• Inner loop
• 4 loads, 8 operations, optimal.
• In practice we have measured 48.1 out of a peak
of 50 Mflop/s when in cache
• DO 30 J 1, M, 2
• DO 20 I 1, M, 2
• T11 C(I, J )
• T12 C(I, J1)
• T21 C(I1,J )
• T22 C(I1,J1)
• DO 10 K 1, L
• T11 T11 A(I, K) B(K,J
)
• T12 T12 A(I, K)
B(K,J1)
• T21 T21 A(I1,K)B(K,J )
• T22 T22 A(I1,K)B(K,J1)
• 10 CONTINUE
• C(I, J ) T11
• C(I, J1) T12
• C(I1,J ) T21
• C(I1,J1) T22
• 20 CONTINUE
• 30 CONTINUE

39
BLAS -- Introduction
• Clarity code is shorter and easier to read,
• Modularity gives programmer larger building
blocks,
• Performance manufacturers will provide tuned
machine-specific BLAS,
• Program portability machine dependencies are
confined to the BLAS

40
Memory Hierarchy
• Key to high performance in effective use of
memory hierarchy
• True on all architectures

41
Level 1, 2 and 3 BLAS
• Level 1 BLAS Vector-Vector operations
• Level 2 BLAS Matrix-Vector operations
• Level 3 BLAS Matrix-Matrix operations

42
More on BLAS (Basic Linear Algebra Subroutines)
• Industry standard interface(evolving)
• Vendors, others supply optimized implementations
• History
• BLAS1 (1970s)
• vector operations dot product, saxpy (y?xy),
etc
• m2n, f2n, q 1 or less
• BLAS2 (mid 1980s)
• matrix-vector operations matrix vector multiply,
etc
• mn2, f2n2, q2, less overhead
• somewhat faster than BLAS1
• BLAS3 (late 1980s)
• matrix-matrix operations matrix matrix multiply,
etc
• m gt 4n2, fO(n3), so q can possibly be as large
as n, so BLAS3 is potentially much faster than
BLAS2
• Good algorithms used BLAS3 when possible (LAPACK)
• www.netlib.org/blas, www.netlib.org/lapack

43
Why Higher Level BLAS?
• Can only do arithmetic on data at the top of the
hierarchy
• Higher level BLAS lets us do this

44
BLAS for Performance
• Development of blocked algorithms important for
performance

45
BLAS for Performance
• Development of blocked algorithms important for
performance

46
BLAS for Performance
• Development of blocked algorithms important for
performance

BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2
(n-by-n matrix vector multiply) vs BLAS 1 (saxpy
of n vectors)
47
Fast linear algebra kernels BLAS
• Simple linear algebra kernels such as
matrix-matrix multiply
• More complicated algorithms can be built from
these basic kernels.
• The interfaces of these kernels have been
standardized as the Basic Linear Algebra
Subroutines (BLAS).
• Early agreement on standard interface (1980)
• Led to portable libraries for vector and shared
memory parallel machines.
• On distributed memory, there is a less-standard
interface called the PBLAS

48
Level 1 BLAS
• Operate on vectors or pairs of vectors
• perform O(n) operations
• return either a vector or a scalar.
• saxpy
• y(i) a x(i) y(i), for i1 to n.
• s stands for single precision, daxpy is for
double precision, caxpy for complex, and zaxpy
for double complex,
• sscal y a x, for scalar a and vectors x,y
• sdot computes s S ni1 x(i)y(i)

49
Level 2 BLAS
• Operate on a matrix and a vector
• return a matrix or a vector
• O(n2) operations
• sgemv matrix-vector multiply
• y y Ax
• where A is m-by-n, x is n-by-1 and y is m-by-1.
• sger rank-one update
• A A yxT, i.e., A(i,j) A(i,j)y(i)x(j)
• where A is m-by-n, y is m-by-1, x is n-by-1,
• strsv triangular solve
• solves yTx for x, where T is triangular

50
Level 3 BLAS
• Operate on pairs or triples of matrices
• returning a matrix
• complexity is O(n3).
• sgemm Matrix-matrix multiplication
• C C AB,
• where C is m-by-n, A is m-by-k, and B is k-by-n
• strsm multiple triangular solve
• solves Y TX for X,
• where T is a triangular matrix, and X is a
rectangular matrix.

51
Optimizing in practice
• Tiling for registers
• loop unrolling, use of named register variables
• Tiling for multiple levels of cache
• Exploiting fine-grained parallelism within the
processor
• super scalar
• pipelining
• Complicated compiler interactions
• Hard to do by hand (but youll try)
• Automatic optimization an active research area
• PHIPAC www.icsi.berkeley.edu/bilmes/phipac
• www.cs.berkeley.edu/iyer/asci_sli
des.ps
• ATLAS www.netlib.org/atlas/index.html

52
BLAS -- References
• BLAS software and documentation can be obtained
via
• WWW http//www.netlib.org/blas,
• (anonymous) ftp ftp.netlib.org cd blas get
index
• email netlib_at_www.netlib.org with the message
send index from blas
lapack_at_cs.utk.edu

53
BLAS Papers
• C. Lawson, R. Hanson, D. Kincaid, and F. Krogh,
Basic Linear Algebra Subprograms for Fortran
Usage, ACM Transactions on Mathematical Software,
5308--325, 1979.
• J. Dongarra, J. Du Croz, S. Hammarling, and R.
Hanson, An Extended Set of Fortran Basic Linear
Algebra Subprograms, ACM Transactions on
Mathematical Software, 14(1)1--32, 1988.
• J. Dongarra, J. Du Croz, I. Duff, S. Hammarling,
A Set of Level 3 Basic Linear Algebra
Subprograms, ACM Transactions on Mathematical
Software, 16(1)1--17, 1990.

54
Performance of BLAS
• BLAS are specially optimized by the vendor
• Sun BLAS uses features in the Ultrasparc
• Big payoff for algorithms that can be expressed
in terms of the BLAS3 instead of BLAS2 or BLAS1.
• The top speed of the BLAS3
• Algorithms like Gaussian elimination organized so
that they use BLAS3

55
How To Get Performance From Commodity Processors?
• Todays processors can achieve high-performance,
but this requires extensive machine-specific hand
tuning.
• Routines have a large design space w/many
parameters
• blocking sizes, loop nesting permutations, loop
unrolling depths, software pipelining strategies,
register allocations, and instruction schedules.
• Complicated interactions with the increasingly
sophisticated microarchitectures of new
microprocessors.
• A few months ago no tuned BLAS for Pentium for
Linux.
• Need for quick/dynamic deployment of optimized
routines.
• ATLAS - Automatic Tuned Linear Algebra Software
• PhiPac from Berkeley

56
• Do a parameter study of the operation on the
target machine, done once.
• Only generated code is on-chip multiply
• BLAS operation written in terms of generated
on-chip multiply
• All tranpose cases coerced through data copy to 1
case of on-chip multiply
• Only 1 case generated per platform

57
Code Generation Strategy
• On-chip multiply optimizes for
• TLB access
• L1 cache reuse
• FP unit usage
• Memory fetch
• Register reuse
• Takes a couple of hours to run.
• Code is iteratively generated timed until
optimal case is found. We try
• Differing NBs
• Breaking false dependencies
• M, N and K loop unrolling

58
(No Transcript)
59
(No Transcript)
60
500x500 gemm-based BLAS on SGI R10000ip28
61
500x500 gemm-based BLAS on UltraSparc 2200
62
Recursive Approach for Other Level
3 BLAS
• Recur down to L1 cache block size
• Need kernel at bottom of recursion
• Use gemm-based kernel for portability

Recursive TRMM
63
500x500 Level 2 BLAS DGEMV
64
65
ATLAS
• Keep a repository of kernels for specific
machines.
• Extend work to allow sparse matrix operations
• Extend work to include arbitrary code segments
• See http//www.netlib.org/atlas/

66
Algorithms and Architecture
• The key to performance is to understand the
algorithm and architecture interaction.
• A significant improvement in performance can be
obtained by matching algorithm to the
architecture or vice versa.

67
Algorithm Issues
• Use of memory hierarchy
• Algorithm pre-fetching
• Loop unrolling
• Simulating higher precision arithmetic

68
Blocking
• TLB Blocking - minimize TLB misses
• Cache Blocking - minimize cache misses
• Register Blocking - minimize load/stores
• The general idea of blocking is to get the
information to a high-speed storage and use it
multiple times so as to amortize the cost of
moving the data
• Cache blocking reduced traffic between memory and
cache
• Register blocking reduces traffic between cache
and CPU

69
Loop Unrolling
• Reduces data dependency delay
• Exploits multiple functional units and quad
• Gives more flexibility to compiler in scheduling
• Facilitates algorithm pre-fetching.

70
Whats Wrong With Speedup T1/Tp ?
• Can lead to very false conclusions.
• Speedup in isolation without taking into account
the speed of the processor is unrealistic and
pointless.
• Speedup over what?
• T1/Tp
• There is usually no doubt about Tp
• Often considerable dispute over the meaning of T1
• Serial code? Same algorithm?

71
Speedup
• Can be used to
• Study, in isolation, the scaling of one algorithm
on one computer.
• As a dimensionless variable in the theory of
scaling.
• Should not be used to compare
• Different algorithms on the same computer
• The same algorithm on different computers.
• Different interconnection structures.

72
(No Transcript)
73
Strassens Algorithm for Matrix Multiply
Usual Matrix Multiply
74
Strassens Algorithm
75
Strassens Algorithm
• The count of arithmetic operations is
• One matrix multiply is replaced by 14 matrix

76
Strassens Algorithm
• In reality the use of Strassens Algorithm is
limited by
• Additional memory required for storing the P
matrices.
• More memory accesses are needed.

77
Outline
• Motivation for Dense Linear Algebra
• Axb Computational Electromagnetics
• Ax lx Quantum Chemistry
• Review Gaussian Elimination (GE) for solving Axb
• Optimizing GE for caches on sequential machines
• using matrix-matrix multiplication (BLAS)
• LAPACK library overview and performance
• Data layouts on parallel machines
• Parallel matrix-matrix multiplication
• Parallel Gaussian Elimination
• ScaLAPACK library overview
• Eigenvalue problem

78
Parallelism in Sparse Matrix-vector multiplication
• y Ax, where A is sparse and n x n
• Questions
• which processors store
• yi, xi, and Ai,j
• which processors compute
• yi sum (from 1 to n) Ai,j xj
• (row i of A) . x a
sparse dot product
• Partitioning
• Partition index set 1,,n N1 u N2 u u Np
• For all i in Nk, Processor k stores yi, xi,
and row i of A
• For all i in Nk, Processor k computes yi (row
i of A) . x
• owner computes rule Processor k compute the
yis it owns
• Goals of partitioning
• balance storage (how much does each processor
store?)
• minimize communication (how much is communicated?)

79
Graph Partitioning and Sparse Matrices
• Relationship between matrix and graph

1 2 3 4 5 6
1 1 1 1 2 1 1
1 1 3
1 1 1 4 1 1
1 1 5 1 1 1
1 6 1 1 1 1
• A good partition of the graph has
• equal (weighted) number of nodes in each part
• minimum number of edges crossing between
(minimize communication)
• Can reorder the rows/columns of the matrix by
putting all the nodes in one partition together

80
More on Matrix Reordering via Graph Partitioning
• Ideal matrix structure for parallelism
(nearly) block diagonal
• p (number of processors) blocks
• few non-zeros outside these blocks, since these
require communication

P0 P1 P2 P3 P4
81
What about implicit methods and eigenproblems?
• Direct methods (Gaussian elimination)
• Called LU Decomposition, because we factor A
LU
• Future lectures will consider both dense and
sparse cases
• More complicated than sparse-matrix vector
multiplication
• Iterative solvers
• Will discuss several of these in future
• Jacobi, Successive overrelaxiation (SOR) ,
• Most have sparse-matrix-vector multiplication in
kernel
• Eigenproblems
• Future lectures will discuss dense and sparse
cases
• Also depend on sparse-matrix-vector
multiplication, direct methods
• Graph partitioning

82
• Partial Differential Equations
• PDEs

83
Continuous Variables, Continuous Parameters
• Examples of such systems include
• Heat flow Temperature(position, time)
• Diffusion Concentration(position, time)
• Electrostatic or Gravitational Potential Pote
ntial(position)
• Fluid flow Velocity,Pressure,Density(position,tim
e)
• Quantum mechanics Wave-function(position,time)
• Elasticity Stress,Strain(position,time)

84
Example Deriving the Heat Equation
x
x-h
0
1
xh
• Consider a simple problem
• A bar of uniform material, insulated except at
ends
• Let u(x,t) be the temperature at position x at
time t
• Heat travels from x-h to xh at rate proportional
to

d u(x,t) (u(x-h,t)-u(x,t))/h -
(u(x,t)- u(xh,t))/h dt
h
C
• As h 0, we get the heat equation

85
Explicit Solution of the Heat Equation
• For simplicity, assume C1
• Discretize both time and position
• Use finite differences with uj,i as the heat at
• time t idt (i 0,1,2,) and position x jh
(j0,1,,N1/h)
• initial conditions on uj,0
• boundary conditions on u0,i and uN,i
• At each timestep i 0,1,2,...
• This corresponds to
• matrix vector multiply (what is matrix?)
• nearest neighbors on grid

t5 t4 t3 t2 t1 t0
For j0 to N uj,i1 zuj-1,i
(1-2z)uj,i zuj1,i where z dt/h2
u0,0 u1,0 u2,0 u3,0 u4,0 u5,0
86
Parallelism in Explicit Method for PDEs
• Partitioning the space (x) into p largest chunks
• good load balance (assuming large number of
points relative to p)
• minimized communication (only p chunks)
• Generalizes to
• multiple dimensions
• arbitrary graphs ( sparse matrices)
• Problem with explicit approach
• numerical instability
• solution blows up eventually if z dt/h gt .5
• need to make the timesteps very small when h is
small dt lt .5h

2
2
87
Instability in solving the heat equation
explicitly
88
Implicit Solution
• As with many (stiff) ODEs, need an implicit
method
• This turns into solving the following equation
• Here I is the identity matrix and T is
• I.e., essentially solving Poissons equation in 1D

(I (z/2)T) u,i1 (I - (z/2)T) u,i
2 -1 -1 2 -1 -1 2 -1
-1 2 -1 -1 2
Graph and stencil
T
2
-1
-1
89
2D Implicit Method
• Similar to the 1D case, but the matrix T is now
• Multiplying by this matrix (as in the explicit
case) is simply nearest neighbor computation on
2D grid
• To solve this system, there are several techniques

Graph and stencil
4 -1 -1 -1 4 -1 -1
-1 4 -1 -1
4 -1 -1 -1 -1 4
-1 -1 -1
-1 4 -1
-1 4 -1
-1 -1 4 -1
-1 -1 4
-1
4
-1
-1
T
-1
90
Algorithms for 2D Poisson Equation with N unknowns
• Algorithm Serial PRAM Memory Procs
• Dense LU N3 N N2 N2
• Band LU N2 N N3/2 N
• Jacobi N2 N N N
• Explicit Inv. N log N N N
• Conj.Grad. N 3/2 N 1/2 log N N N
• RB SOR N 3/2 N 1/2 N N
• Sparse LU N 3/2 N 1/2 Nlog N N
• FFT Nlog N log N N N
• Multigrid N log2 N N N
• Lower bound N log N N
• PRAM is an idealized parallel model with zero
cost communication
• (see next slide for explanation)

2
2
2
91
Short explanations of algorithms on previous slide
• Sorted in two orders (roughly)
• from slowest to fastest on sequential machines
• from most general (works on any matrix) to most
specialized (works on matrices like T)
• Dense LU Gaussian elimination works on any
N-by-N matrix
• Band LU exploit fact that T is nonzero only on
sqrt(N) diagonals nearest main diagonal, so
faster
• Jacobi essentially does matrix-vector multiply
by T in inner loop of iterative algorithm
• Explicit Inverse assume we want to solve many
systems with T, so we can precompute and store
• Its still expensive!
multiplication, like Jacobi, but exploits
mathematical properies of T that Jacobi does not
• Red-Black SOR (Successive Overrelaxation)
Variation of Jacobi that exploits yet different
mathematical properties of T
• Used in Multigrid
• Sparse LU Gaussian elimination exploiting
particular zero structure of T
• FFT (Fast Fourier Transform) works only on
matrices very like T
• Multigrid also works on matrices like T, that
come from elliptic PDEs
• Lower Bound serial (time to print answer)
parallel (time to combine N inputs)

92
Composite mesh from a mechanical structure
93
Converting the mesh to a matrix
94
Effects of Ordering Rows and Columns on Gaussian
Elimination
95
Irregular mesh NASA Airfoil in 2D (direct
solution)
96
Irregular mesh Tapered Tube (multigrid)
97
• Adaptive mesh around an explosion
• John Bell and Phil Colella at LBL (see class web
page for URL)
• Goal of Titanium is to make these algorithms
easier to implement
• in parallel

98
Computational Electromagnetics
• Developed during 1980s, driven by defense
applications
• Determine the RCS (radar cross section) of
airplane
• Reduce signature of plane (stealth technology)
• Other applications are antenna design, medical
equipment
• Two fundamental numerical approaches
• MOM methods of moments ( frequency domain), and
• Finite differences (time domain)

99
Computational Electromagnetics
- Discretize surface into triangular facets using
standard modeling tools - Amplitude of currents
on surface are unknowns
- Integral equation is discretized into a set of
linear equations
image NW Univ. Comp. Electromagnetics Laboratory
http//nueml.ece.nwu.edu/
100
Computational Electromagnetics (MOM)
After discretization the integral equation has
the form A x b where A is
the (dense) impedance matrix, x is the unknown
vector of amplitudes, and b is the excitation
vector. (see Cwik, Patterson, and Scott,
Electromagnetic Scattering on the Intel
Touchstone Delta, IEEE Supercomputing 92, pp 538
- 542)
101
Computational Electromagnetics (MOM)
The main steps in the solution process are
Fill computing the matrix elements
of A Factor factoring the dense matrix
A Solve solving for one or more
excitations b Field Calc computing the fields
scattered from the object
102
Analysis of MOM for Parallel Implementation
Parallel Speed
Fill O(n2) embarrassing
low Factor O(n3)
moderately diff. very high Solve
O(n2) moderately diff.
high Field Calc. O(n)
embarrassing high
103
Results for Parallel Implementation on Delta
Fill 9.20
Factor 8.25 Solve
2 .17
Field Calc. 0.12
The problem solved was for a matrix of size
48,672. (The world record in 1991.)
104
Current Records for Solving Dense Systems
Year System Size Machine
Procs Gflops (Peak) 1950's O(100)

1995 128,600 Intel Paragon
6768 281 ( 338) 1996
215,000 Intel ASCI Red 7264
1068 (1453) 1998 148,000
Cray T3E 1488 1127
(1786) 1998 235,000 Intel
ASCI Red 9152 1338 (1830) 1999
374,000 SGI ASCI Blue 5040
1608 (2520) 1999 362,880
Intel ASCI Red 9632 2379 (3207) 2000
430,000 IBM ASCI White 8192
4928 (12000)
source Alan Edelman http//www-math.mit.edu/edel
man/records.html LINPACK Benchmark
http//www.netlib.org/performance/html/PDSreports.
html
105
Computational Chemistry
• Seek energy levels of a molecule, crystal, etc.
• Solve Schroedingers Equation for energy levels
eigenvalues
• Discretize to get Ax lBx, solve for eigenvalues
l and eigenvectors x
• A and B large, symmetric or Hermitian matrices (B
positive definite)
• May want some or all eigenvalues/eigenvectors
• MP-Quest (Sandia NL)
• Si and sapphire crystals of up to 3072 atoms
• Local Density Approximation to Schroedinger
Equation
• A and B up to n40000, Hermitian
• Need all eigenvalues and eigenvectors
• Need to iterate up to 20 times (for
self-consistency)
• Implemented on Intel ASCI Red
• 9200 Pentium Pro 200 processors (4600 Duals, a
CLUMP)
• Overall application ran at 605 Gflops (out of
1800 Glops peak),
• Eigensolver ran at 684 Gflops
• www.cs.berkeley.edu/stanley/gbell/index.html
• Runner-up for Gordon Bell Prize at Supercomputing
98

106
Review of Gaussian Elimination (GE) for solving
Axb
• Add multiples of each row to later rows to make A
upper triangular
• Solve resulting triangular system Ux c by
substitution

for each column i zero it out below the
diagonal by adding multiples of row i to later
rows for i 1 to n-1 for each row j below
row i for j i1 to n add a
multiple of row i to row j for k i to
n A(j,k) A(j,k) -
(A(j,i)/A(i,i)) A(i,k)
107
Refine GE Algorithm (1)
• Initial Version
• Remove computation of constant A(j,i)/A(i,i) from
inner loop

for each column i zero it out below the
diagonal by adding multiples of row i to later
rows for i 1 to n-1 for each row j below
row i for j i1 to n add a
multiple of row i to row j for k i to
n A(j,k) A(j,k) -
(A(j,i)/A(i,i)) A(i,k)
for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i to n
A(j,k) A(j,k) - m A(i,k)
108
Refine GE Algorithm (2)
• Last version
• Dont compute what we already know
zeros below diagonal in column i

for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i to n
A(j,k) A(j,k) - m A(i,k)
for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i1 to n
A(j,k) A(j,k) - m A(i,k)
109
Refine GE Algorithm (3)
• Last version
• Store multipliers m below diagonal in zeroed
entries for later use

for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i1 to n
A(j,k) A(j,k) - m A(i,k)
for i 1 to n-1 for j i1 to n
A(j,i) A(j,i)/A(i,i) for k i1 to
n A(j,k) A(j,k) - A(j,i) A(i,k)
110
Refine GE Algorithm (4)
• Last version
• Express using matrix operations (BLAS)

for i 1 to n-1 for j i1 to n
A(j,i) A(j,i)/A(i,i) for k i1 to
n A(j,k) A(j,k) - A(j,i) A(i,k)
for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) A(i1n,i1n) A(i1n , i1n )
- A(i1n , i) A(i , i1n)
111
What GE really computes
• Call the strictly lower triangular matrix of
multipliers M, and let L IM
• Call the upper triangle of the final matrix U
• Lemma (LU Factorization) If the above algorithm
terminates (does not divide by zero) then A LU
• Solving Axb using GE
• Factorize A LU using GE
(cost 2/3 n3 flops)
• Solve Ly b for y, using substitution (cost
n2 flops)
• Solve Ux y for x, using substitution (cost
n2 flops)
• Thus Ax (LU)x L(Ux) Ly b as desired

for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) A(i1n,i1n) A(i1n , i1n ) -
A(i1n , i) A(i , i1n)
112
Problems with basic GE algorithm
• What if some A(i,i) is zero? Or very small?
• Result may not exist, or be unstable, so need
to pivot
• Current computation all BLAS 1 or BLAS 2, but we
know that BLAS 3 (matrix multiply) is fastest
(Lecture 2)

for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) BLAS 1 (scale a vector)
A(i1n,i1n) A(i1n , i1n ) BLAS 2
(rank-1 update) - A(i1n , i)
A(i , i1n)
Peak
BLAS 3
BLAS 2
BLAS 1
113
Pivoting in Gaussian Elimination
• A 0 1 fails completely, even though
A is easy
• 1 0
• Illustrate problems in 3-decimal digit
arithmetic
• A 1e-4 1 and b 1 ,
correct answer to 3 places is x 1
• 1 1
2
1
• Result of LU decomposition is
• L 1 0 1
0 No roundoff error yet
• fl(1/1e-4) 1 1e4
1
• U 1e-4 1
1e-4 1 Error in 4th decimal place
• 0 fl(1-1e41)
0 -1e4
• Check if A LU 1e-4 1
(2,2) entry entirely wrong
• 1
0

114
Gaussian Elimination with Partial Pivoting (GEPP)
• Partial Pivoting swap rows so that each
multiplier
• L(i,j) A(j,i)/A(i,i) lt
1

for i 1 to n-1 find and record k where
A(k,i) maxi lt j lt n A(j,i)
i.e. largest entry in rest of column i if
A(k,i) 0 exit with a warning that A
is singular, or nearly so elseif k ! i
swap rows i and k of A end if
A(i1n,i) A(i1n,i) / A(i,i) each
quotient lies in -1,1 A(i1n,i1n)
A(i1n , i1n ) - A(i1n , i) A(i , i1n)
• Lemma This algorithm computes A PLU, where
P is a
• permutation matrix
• Since each entry of L(i,j) lt 1, this
algorithm is considered
• numerically stable
• For details see LAPACK code at
www.netlib.org/lapack/single/sgetf2.f

115
Converting BLAS2 to BLAS3 in GEPP
• Blocking
• Used to optimize matrix-multiplication
• Harder here because of data dependencies in GEPP
• Save updates to trailing matrix from several
• Apply many saved updates simultaneously in one
BLAS3 operation
• Same idea works for much of dense linear algebra
• Open questions remain
• Need to choose a block size b
• Algorithm will save and apply b updates
• b must be small enough so that active submatrix
consisting of b columns of A fits in cache
• b must be large enough to make BLAS3 fast

116
Blocked GEPP (www.netlib.org/lapack/single/sgetr
f.f)
for ib 1 to n-1 step b Process matrix b
columns at a time end ib b-1
Point to end of block of b columns
apply BLAS2 version of GEPP to get A(ibn ,
ibend) P L U let LL denote the
strict lower triangular part of A(ibend ,
ibend) I A(ibend , end1n) LL-1
A(ibend , end1n) update next b rows
of U A(end1n , end1n ) A(end1n ,
end1n ) - A(end1n , ibend)
A(ibend , end1n)
single matrix-multiply
with inner dimension b
(For a correctness proof, see on-lines notes.)
117
Efficiency of Blocked GEPP
118
Overview of LAPACK
• Standard library for dense/banded linear algebra
• Linear systems Axb
• Least squares problems minx Ax-b 2
• Eigenvalue problems Ax lx, Ax lBx
• Singular value decomposition (SVD) A USVT
• Algorithms reorganized to use BLAS3 as much as
possible
• Basis of math libraries on many computers
• Many algorithmic innovations remain
• Projects available

119
Performance of LAPACK (n1000)
120
Performance of LAPACK (n100)
121
Parallelizing Gaussian Elimination
• parallelization steps
• Decomposition identify enough parallel work, but
not too much
• Orchestrate communication and synchronization
• Mapping which processors execute which threads
• Decomposition
• In BLAS 2 algorithm nearly each flop in inner
loop can be done in parallel, so with n2
processors, need 3n parallel steps
• This is too fine-grained, prefer calls to local

for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) BLAS 1 (scale a vector)
A(i1n,i1n) A(i1n , i1n ) BLAS 2
(rank-1 update) - A(i1n , i)
A(i , i1n)
122
Assignment of parallel work in GE
• Think of assigning submatrices to threads, where
each thread responsible for updating submatrix it
owns
• owner computes rule natural because of locality
• What should submatrices look like to achieve load
balance?

123
Different Data Layouts for Parallel GE (on 4
procs)
Load balanced, but cant easily use BLAS2 or BLAS3
by choosing b, but factorization of block column
is a bottleneck
The winner!
124
Blocked Partitioned Algorithms
• Orthogonal reduction to
• (upper) Hessenberg form
• symmetric tridiagonal form
• bidiagonal form
• Block QR iteration for nonsymmetric eigenvalue
problems
• LU Factorization
• Cholesky factorization
• Symmetric indefinite factorization
• Matrix inversion
• QR, QL, RQ, LQ factorizations
• Form Q or QTC

125
Memory Hierarchy and LAPACK
• ijk - implementations
• Effects order in which
data referenced some better at
allowing data to keep in higher levels of
memory hierarchy.
• Applies for matrix multiply, reductions to
condensed form
• May do slightly more
flops
• Up to 3 times faster

126
Derivation of Blocked AlgorithmsCholesky
Factorization A UTU

Equating coefficient of the jth column, we obtain
Hence, if U11 has already been computed, we can
compute uj and ujj from the equations
127
LINPACK Implementation
• Here is the body of the LINPACK routine SPOFA
which implements the method
• DO 30 J 1, N
• INFO J
• S 0.0E0
• JM1 J - 1
• IF( JM1.LT.1 ) GO TO 20
• DO 10 K 1, JM1
• T A( K, J ) - SDOT( K-1,
A( 1, K ), 1,A( 1, J ), 1 )
• T T / A( K, K )
• A( K, J ) T
• S S TT
• 10 CONTINUE
• 20 CONTINUE
• S A( J, J ) - S
• C ...EXIT
• IF( S.LE.0.0E0 ) GO TO 40
• A( J, J ) SQRT( S )
• 30 CONTINUE

128
LAPACK Implementation
• DO 10 J 1, N
• CALL STRSV( 'Upper', 'Transpose',
'Non-Unit, J-1, A, LDA, A( 1, J ), 1 )
• S A( J, J ) - SDOT( J-1, A( 1, J ),
1, A( 1, J ), 1 )
• IF( S.LE.ZERO ) GO TO 20
• A( J, J ) SQRT( S )
• 10 CONTINUE
• This change by itself is sufficient to
significantly improve the performance on a number
of machines.
• From 238 to 312 Mflop/s for a matrix of order 500
on a Pentium 4-1.7 GHz.
• However on peak is 1,700 Mflop/s.
• Suggest further work needed.

129
Derivation of Blocked Algorithms

Equating coefficient of second block of columns,
we obtain
Hence, if U11 has already been computed, we can
compute U12 as the solution of the following
equations by a call to the Level 3 BLAS routine
STRSM
130
LAPACK Blocked Algorithms
DO 10 J 1, N, NB CALL STRSM( 'Left',
'Upper', 'Transpose','Non-Unit', J-1, JB, ONE, A,
LDA, A( 1, J ), LDA )
CALL SSYRK( 'Upper', 'Transpose', JB, J-1,-ONE,
A( 1, J ), LDA, ONE, A( J, J ),
LDA ) CALL SPOTF2( 'Upper', JB, A( J, J ),
LDA, INFO ) IF( INFO.NE.0 ) GO TO 20 10
CONTINUE
• On Pentium 4, L3 BLAS squeezes a lot more out of
1 proc

Intel Pentium 4 1.7 GHz N 500 Rate of Execution
Linpack variant (L1B) 238 Mflop/s
Level 2 BLAS Variant 312 Mflop/s
Level 3 BLAS Variant 1262 Mflop/s
131
LAPACK Contents
• Combines algorithms from LINPACK and EISPACK into
a single package. User interface similar to
LINPACK.
• Built on the L 1, 2 and 3 BLAS, for high
performance (manufacturers optimize BLAS)
• LAPACK does not provide routines for structured
problems or general sparse matrices (i.e sparse
storage formats such as compressed-row, -column,
-diagonal, skyline ...).

132
LAPACK Ongoing Work
• updating/downdating, divide and conquer least
squares,bidiagonal bisection, bidiagonal inverse
iteration, band SVD, Jacobi methods, ...
• Move to new generation of high performance
machines
• IBM SPs, CRAY T3E, SGI Origin, clusters of
workstations
• New challenges
• New languages FORTRAN 90, HP FORTRAN, ...
• (CMMD, MPL, NX ...)
• many flavors of message passing, need standard
(PVM, MPI) BLACS
• Highly varying ratio
• Many ways to layout data,
• Fastest parallel algorithm sometimes less stable
numerically.

133
History of Block Partitioned Algorithms
• Early algorithms involved use of small main
memory using tapes as secondary storage.
• Recent work centers on use of vector registers,
level 1 and 2 cache, main memory, and out of
core memory.

134
Blocked Partitioned Algorithms
• Orthogonal reduction to
• (upper) Hessenberg form
• symmetric tridiagonal form
• bidiagonal form
• Block QR iteration for nonsymmetric eigenvalue
problems
• LU Factorization
• Cholesky factorization
• Symmetric indefinite factorization
• Matrix inversion
• QR, QL, RQ, LQ factorizations
• Form Q or QTC