Title: Lecture 6: Linear Algebra Algorithms
1Lecture 6Linear Algebra Algorithms
- Jack Dongarra, U of Tennessee
Slides are adapted from Jim Demmel, UCBs Lecture
on Linear Algebra Algorithms
2Homework 3 Grading Rules
- part1 (2 points) Write an MPI program
- (1) code produce correct result 2/2.
- (2) code run but could not give accurate result
1.5/2. - (3) code could not run or run but nothing 1/2
- Note If you only submit one program which can
print "processor - contribution" and "total integral" correctly, I
will give 1/2 - for this part and 2/2 for part2. so you total
score for part1 - and part2 will be 3/4.
- part2 (2 points) Modify your program to sum the
contributions to give the total value of the
integral. - (1) code produce correct result 2/2.
- (2) code run but could not give accurate result
1.5/2. - (3) code could not run or run but nothing 1/2
- part3 (4 points) Give an expression for the
efficiency of the parallel algorithm for m
subintervals - (1) give t_p(1), t_p(p), S(p) and E(p) correctly
4/4. - (2) give t_p(1), t_p(p) and S(p) correctly 3/4.
- (3) give t_p(1) and t_p(p) correctly 2/4.
- (4) give other 1/4.
3Adding numbers together
- If we are adding an array of numbers and have a
choice do we add them from largest to smallest or
smallest to largest?
4Parallel Performance Metrics
Absolute Elapsed (wall-clock) Time T(n )
Speedup S(n ) T(1) / T(n ) where T1 is
the time for the best serial implementation.
gt Performance improvement due to parallelsm
Parallel Efficiency E(n ) T(1) / n T(n )
Ideal Speedup SI (n ) n -
Theoretical limit obtainable rarely - Ignores
all of real life These definitions apply to a
fixed-problem experiment.
5Amdahls Law
Amdahls Law places a strict limit on the speedup
that can be realized by using multiple
processors. Two equivalent expressions for
Amdahls Law are given below
tN (fp/N fs)t1 Effect of multiple
processors on run time S 1/(fs fp/N)
Effect of multiple processors on
speedup Where fs serial fraction of code fp
parallel fraction of code 1 - fs N number
of processors
6Illustration of Amdahls Law
It takes only a small fraction of serial content
in a code to degrade the parallel performance. It
is essential to determine the scaling behavior of
your code before doing production runs using
large numbers of processors
7Amdahls Law Vs. Reality
Amdahls Law provides a theoretical upper limit
on parallel speedup assuming that there are no
costs for communications. In reality,
communications (and I/O) will result in a further
degradation of performance.
So whats going on?
8More on Amdahls Law
- Amdahls Law can be generalized to any two
processes of with different speeds - Ex. Apply to fprocessor and fmemory
- The growing processor-memory performance gap will
undermine our efforts at achieving maximum
possible speedup!
9Gustafsons Law
- Thus, Amdahls Law predicts that there is a
maximum scalability for an application,
determined by its parallel fraction, and this
limit is generally not large. - tN (fp/N fs)t1 Effect of multiple
processors on run time - S 1/(fs fp/N) Effect of multiple
processors on speedup - Where
- fs serial fraction of code
- fp parallel fraction of code 1 - fs
- N number of processors
- There is a way around this increase the problem
size - bigger problems mean bigger grids or more
particles bigger arrays - number of serial operations generally remains
constant number of parallel operations
increases parallel fraction increases
10Fixed-Problem Size Scaling
a.k.a. Fixed-load, Fixed-Problem Size, Strong
Scaling, Problem-Constrained, constant-problem
size (CPS), variable subgrid
1 Amdahl Limit SA(n) T(1) / T(n)
----------------
f / n ( 1 -
f ) This bounds the speedup based only on the
fraction of the code that cannot use parallelism
( 1- f ) it ignores all other factors SA
--gt 1 / ( 1- f ) as n --gt
11Fixed-Problem Size Scaling (Contd)
Efficiency (n) T(1) / T(n) n
Memory requirements decrease with n
Surface-to-volume ratio increases with n
Superlinear speedup possible from cache
effects Motivation what is the largest of
procs I can use effectively and what is the
fastest time that I can solve a given problem?
Problems - Sequential runs often not possible
(large problems) - Speedup (and efficiency) is
misleading if processors are slow
12Fixed-Problem Size Scaling Examples
S. Goedecker and Adolfy Hoisie, Achieving High
Performance in Numerical Computations on RISC
Workstations and Parallel Systems,International
Conference on Computational Physics PC'97 Santa
Cruz, August 25-28 1997.
13Fixed-Problem Size Scaling Examples
14Scaled Speedup Experiments
a.k.a. Fixed Subgrid-Size, Weak Scaling,
Gustafson scaling. Motivation Want to use a
larger machine to solve a larger global problem
in the same amount of time. Memory and
surface-to-volume effects remain constant.
15Top500 Data
Rank Manufacturer Computer Installation Site Rmax TF/s R peak TF/s Ratio Max/Peak Proc
1 IBM ASCI WhiteSP Power3 Lawrence Livermore National Laboratory 7.23 12 60.25 8192
2 Compaq AlphaServer SC ES45 1 GHz Pittsburgh Supercomputing Center 4.06 6.05 67.11 3024
3 IBM SP Power3 375 MHz NERSC/LBNL 3.05 5.00 61.00 3328
4 Intel ASCI Red Sandia National Laboratory 2.38 3.207 74.21 9632
5 IBM ASCI Blue PacificSST, IBM SP 604E Lawrence Livermore National Laboratory 2.14 3.868 55.33 5808
6 Compaq AlphaServer SC ES45 1 GHz Los Alamos National Laboratory 2.10 3.04 69.08 1536
7 Hitachi SR8000/MPP University of Tokyo 1.71 2.69 63.57 1152
8 SGI ASCI Blue Mountain Los Alamos National Laboratory 1.61 2.52 63.89 6144
9 IBM SP Power3 375Mhz Naval Oceanographic Office (NAVOCEANO) 1.42 1.97 72.08 1336
10 IBM SP Power3 375Mhz Deutscher Wetterdienst 1.29 1.83 70.49 1280
16Example of a Scaled Speedup Experiment
17Parallel Performance Metrics Speedup
Absolute performance
Relative performance
MFLOPS
Speedup
Processors
Processors
Speedup is only one characteristic of a program -
it is not synonymous with performance. In this
comparison of two machines the code achieves
comparable speedups but one of the machines is
faster.
18(No Transcript)
19Improving Ratio of Floating Point Operations to
Memory Accesses
- subroutine mult(n1,nd1,n2,nd2,y,a,x)
- implicit real8 (a-h,o-z)
- dimension a(nd1,nd2),y(nd2),x(nd1)
- do 20, i1,n1
- t0.d0
- do 10, j1,n2
- tta(j,i)x(j)
- 10 continue
- y(i)t
- 20 continue
- return
- end
- Unroll the loops
2 FLOPS 2 LOADS
20Improving Ratio of Floating Point Operations to
Memory Accesses
c works correctly when n1,n2 are multiples of 4
dimension a(nd1,nd2), y(nd2), x(nd1) do
i1,n1-3,4 t10.d0 t20.d0
t30.d0 t40.d0 do j1,n2-3,4
t1t1a(j0,i0)x(j0)a(j1,i0)x(j1) 1
a(j2,i0)x(j2)a(j3,i1)x(j3)
t2t2a(j0,i1)x(j0)a(j1,i1)x(j1) 1
a(j2,i1)x(j2)a(j3,i0)x(j3)
t3t3a(j0,i2)x(j0)a(j1,i2)x(j1) 1
a(j2,i2)x(j2)a(j3,i2)x(j3)
t4t4a(j0,i3)x(j0)a(j1,i3)x(j1) 1
a(j2,i3)x(j2)a(j3,i3)x(j3) enddo
y(i0)t1 y(i1)t2 y(i2)t3
y(i3)t4 enddo
32 FLOPS 20 LOADS
21Summary of Single-Processor Optimization
Techniques (I)
- Spatial and temporal data locality
- Loop unrolling
- Blocking
- Software pipelining
- Optimization of data structures
- Special functions, library subroutines
22Summary of Optimization Techniques (II)
- Achieving high-performance requires code
restructuring. Minimization of memory traffic is
the single most important goal. - Compilers are getting better good at software
pipelining. But they are not there yet can do
loop transformations only in simple cases,
usually fail to produce optimal blocking,
heuristics for unrolling may not match your code
well, etc. - The optimization process is machine-specific and
requires detailed architectural knowledge.
23Optimizing Matrix Addition for Caches
- Dimension A(n,n), B(n,n), C(n,n)
- A, B, C stored by column (as in Fortran)
- Algorithm 1
- for i1n, for j1n, A(i,j) B(i,j) C(i,j)
- Algorithm 2
- for j1n, for i1n, A(i,j) B(i,j) C(i,j)
- What is memory access pattern for Algs 1 and 2?
- Which is faster?
- What if A, B, C stored by row (as in C)?
24 Using a Simpler Model of Memory to Optimize
- Assume just 2 levels in the hierarchy, fast and
slow - All data initially in slow memory
- m number of memory elements (words) moved
between fast and slow memory - tm time per slow memory operation
- f number of arithmetic operations
- tf time per arithmetic operation lt tm
- q f/m average number of flops per slow element
access - Minimum possible Time ftf, when all data in
fast memory - Actual Time ftf mtm ftf(1
(tm/tf)(1/q)) - Larger q means Time closer to minimum ftf
- Want large q
25Simple example using memory model
- To see results of changing q, consider simple
computation
- Assume tf1 Mflop/s on fast memory
- Assume moving data is tm 10
- Assume h takes q flops
- Assume array X is in slow memory
s 0 for i 1, n s s h(Xi)
- So m n and f qn
- Time read X compute 10n qn
- Mflop/s f/t q/(10 q)
- As q increases, this approaches the peak speed
of 1 Mflop/s
26Warm up Matrix-vector multiplication y y Ax
- for i 1n
- for j 1n
- y(i) y(i) A(i,j)x(j)
A(i,)
y(i)
y(i)
x()
27Warm up Matrix-vector multiplication y y Ax
- read x(1n) into fast memory
- read y(1n) into fast memory
- for i 1n
- read row i of A into fast memory
- for j 1n
- y(i) y(i) A(i,j)x(j)
- write y(1n) back to slow memory
- m number of slow memory refs 3n n2
- f number of arithmetic operations 2n2
- q f/m 2
- Matrix-vector multiplication limited by slow
memory speed
28Matrix Multiply CCAB
- for i 1 to n
- for j 1 to n
- for k 1 to n
- C(i,j) C(i,j) A(i,k) B(k,j)
A(i,)
C(i,j)
C(i,j)
B(,j)
29Matrix Multiply CCAB(unblocked, or untiled)
- for i 1 to n
- read row i of A into fast memory
- for j 1 to n
- read C(i,j) into fast memory
- read column j of B into fast memory
- for k 1 to n
- C(i,j) C(i,j) A(i,k) B(k,j)
- write C(i,j) back to slow memory
A(i,)
C(i,j)
C(i,j)
B(,j)
30Matrix Multiply (unblocked, or untiled)
qops/slow mem ref
- Number of slow memory references on unblocked
matrix multiply - m n3 read each column of B n times
- n2 read each column of A once for
each i - 2n2 read and write each element of C
once - n3 3n2
- So q f/m (2n3)/(n3 3n2)
- 2 for large n, no improvement over
matrix-vector mult
A(i,)
C(i,j)
C(i,j)
B(,j)
31Matrix Multiply (blocked, or tiled)
- Consider A,B,C to be N by N matrices of b by b
subblocks where bn/N is called the blocksize - for i 1 to N
- for j 1 to N
- read block C(i,j) into fast memory
- for k 1 to N
- read block A(i,k) into fast
memory - read block B(k,j) into fast
memory - C(i,j) C(i,j) A(i,k)
B(k,j) do a matrix multiply on blocks - write block C(i,j) back to slow memory
A(i,k)
C(i,j)
C(i,j)
B(k,j)
n is the size of the matrix, N blocks of size b
n Nb
32Matrix Multiply (blocked or tiled)
qops/slow mem ref
n is the size of the matrix, N blocks of size b
n Nb
- Why is this algorithm correct?
- Number of slow memory references on blocked
matrix multiply - m Nn2 read each block of B N3 times (N3
n/N n/N) - Nn2 read each block of A N3 times
- 2n2 read and write each block of
C once - (2N 2)n2
- So q f/m 2n3 / ((2N 2)n2)
- n/N b for large n
- So we can improve performance by increasing the
blocksize b - Can be much faster than matrix-vector multiplty
(q2) - Limit All three blocks from A,B,C must fit in
fast memory (cache), so we - cannot make these blocks arbitrarily large 3b2
lt M, so q b lt sqrt(M/3) - Theorem (Hong, Kung, 1981) Any reorganization of
this algorithm - (that uses only associativity) is limited to q
O(sqrt(M))
33Model
- As much as possible will be overlapped
- Dot Product
- ACC 0
- do i x,n
- ACC ACC x(i) y(i)
- end do
- Experiments done on an IBM RS6000/530
- 25 MHz
- 2 cycle to complete FMA can be pipelined
- gt 50 Mflop/s peak
- one cycle from cache
FMA
FMA
FMA
FMA
FMA
FMA
cycles
34DOT Operation - Data in Cache
- Do 10 I 1, n
- T T X(I)Y(I)
- 10 CONTINUE
- Theoretically, 2 loads for X(I) and Y(I), one FMA
operation, no re-use of data - Pseudo-assembler
- LOAD fp0,T
- label
- LOAD fp1,X(I)
- LOAD fp2,Y(I)
- FMA fp0,fp0,fp1,fp2
- BRANCH label
Load y
Load x
FMA
Load y
Load x
FMA
1 result per cycle 25 Mflop/s
35Matrix-Vector Product
- DOT version
- DO 20 I 1, M
- DO 10 J 1, N
- Y(I) Y(I) A(I,J)X(J)
- 10 CONTINUE
- 20 CONTINUE
- From Cache 22.7 Mflops
- From Memory 12.4 Mflops
36Loop Unrolling
- DO 20 I 1, M, 2
- T1 Y(I )
- T2 Y(I1)
- DO 10 J 1, N
- T1 T1 A(I,J )X(J)
- T2 T2 A(I1,J)X(J)
- 10 CONTINUE
- Y(I ) T1
- Y(I1) T2
- 20 CONTINUE
- 3 loads, 4 flops
- Speed of yyATx, N48
- Depth 1 2 3 4
? - Speed 25 33.3 37.5 40 50
- Measured 22.7 30.5 34.3 36.5
- Memory 12.4 12.7 12.7 12.6
- unroll 1 2 loads 2 ops per 2 cycles
- unroll 2 3 loads 4 ops per 3 cycles
- unroll 3 4 loads 6 ops per 4 cycles
-
- unroll n n1 loads 2n ops per n1 cycles
- problem only so many registers
37Matrix Multiply
- DOT version - 25 Mflops in cache
- DO 30 J 1, M
- DO 20 I 1, M
- DO 10 K 1, L
- C(I,J) C(I,J)
A(I,K)B(K,J) - 10 CONTINUE
- 20 CONTINUE
- 30 CONTINUE
38How to Get Near Peak
- Inner loop
- 4 loads, 8 operations, optimal.
- In practice we have measured 48.1 out of a peak
of 50 Mflop/s when in cache
- DO 30 J 1, M, 2
- DO 20 I 1, M, 2
- T11 C(I, J )
- T12 C(I, J1)
- T21 C(I1,J )
- T22 C(I1,J1)
- DO 10 K 1, L
- T11 T11 A(I, K) B(K,J
) - T12 T12 A(I, K)
B(K,J1) - T21 T21 A(I1,K)B(K,J )
- T22 T22 A(I1,K)B(K,J1)
- 10 CONTINUE
- C(I, J ) T11
- C(I, J1) T12
- C(I1,J ) T21
- C(I1,J1) T22
- 20 CONTINUE
- 30 CONTINUE
39BLAS -- Introduction
- Clarity code is shorter and easier to read,
- Modularity gives programmer larger building
blocks, - Performance manufacturers will provide tuned
machine-specific BLAS, - Program portability machine dependencies are
confined to the BLAS
40Memory Hierarchy
- Key to high performance in effective use of
memory hierarchy - True on all architectures
41Level 1, 2 and 3 BLAS
- Level 1 BLAS Vector-Vector operations
- Level 2 BLAS Matrix-Vector operations
- Level 3 BLAS Matrix-Matrix operations
42More on BLAS (Basic Linear Algebra Subroutines)
- Industry standard interface(evolving)
- Vendors, others supply optimized implementations
- History
- BLAS1 (1970s)
- vector operations dot product, saxpy (y?xy),
etc - m2n, f2n, q 1 or less
- BLAS2 (mid 1980s)
- matrix-vector operations matrix vector multiply,
etc - mn2, f2n2, q2, less overhead
- somewhat faster than BLAS1
- BLAS3 (late 1980s)
- matrix-matrix operations matrix matrix multiply,
etc - m gt 4n2, fO(n3), so q can possibly be as large
as n, so BLAS3 is potentially much faster than
BLAS2 - Good algorithms used BLAS3 when possible (LAPACK)
- www.netlib.org/blas, www.netlib.org/lapack
43Why Higher Level BLAS?
- Can only do arithmetic on data at the top of the
hierarchy - Higher level BLAS lets us do this
44BLAS for Performance
- Development of blocked algorithms important for
performance
45BLAS for Performance
- Development of blocked algorithms important for
performance
46BLAS for Performance
- Development of blocked algorithms important for
performance
BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2
(n-by-n matrix vector multiply) vs BLAS 1 (saxpy
of n vectors)
47Fast linear algebra kernels BLAS
- Simple linear algebra kernels such as
matrix-matrix multiply - More complicated algorithms can be built from
these basic kernels. - The interfaces of these kernels have been
standardized as the Basic Linear Algebra
Subroutines (BLAS). - Early agreement on standard interface (1980)
- Led to portable libraries for vector and shared
memory parallel machines. - On distributed memory, there is a less-standard
interface called the PBLAS
48Level 1 BLAS
- Operate on vectors or pairs of vectors
- perform O(n) operations
- return either a vector or a scalar.
- saxpy
- y(i) a x(i) y(i), for i1 to n.
- s stands for single precision, daxpy is for
double precision, caxpy for complex, and zaxpy
for double complex, - sscal y a x, for scalar a and vectors x,y
- sdot computes s S ni1 x(i)y(i)
49Level 2 BLAS
- Operate on a matrix and a vector
- return a matrix or a vector
- O(n2) operations
- sgemv matrix-vector multiply
- y y Ax
- where A is m-by-n, x is n-by-1 and y is m-by-1.
- sger rank-one update
- A A yxT, i.e., A(i,j) A(i,j)y(i)x(j)
- where A is m-by-n, y is m-by-1, x is n-by-1,
- strsv triangular solve
- solves yTx for x, where T is triangular
50Level 3 BLAS
- Operate on pairs or triples of matrices
- returning a matrix
- complexity is O(n3).
- sgemm Matrix-matrix multiplication
- C C AB,
- where C is m-by-n, A is m-by-k, and B is k-by-n
- strsm multiple triangular solve
- solves Y TX for X,
- where T is a triangular matrix, and X is a
rectangular matrix.
51Optimizing in practice
- Tiling for registers
- loop unrolling, use of named register variables
- Tiling for multiple levels of cache
- Exploiting fine-grained parallelism within the
processor - super scalar
- pipelining
- Complicated compiler interactions
- Hard to do by hand (but youll try)
- Automatic optimization an active research area
- PHIPAC www.icsi.berkeley.edu/bilmes/phipac
- www.cs.berkeley.edu/iyer/asci_sli
des.ps - ATLAS www.netlib.org/atlas/index.html
52BLAS -- References
- BLAS software and documentation can be obtained
via - WWW http//www.netlib.org/blas,
- (anonymous) ftp ftp.netlib.org cd blas get
index - email netlib_at_www.netlib.org with the message
send index from blas - Comments and questions can be addressed to
lapack_at_cs.utk.edu
53BLAS Papers
- C. Lawson, R. Hanson, D. Kincaid, and F. Krogh,
Basic Linear Algebra Subprograms for Fortran
Usage, ACM Transactions on Mathematical Software,
5308--325, 1979. - J. Dongarra, J. Du Croz, S. Hammarling, and R.
Hanson, An Extended Set of Fortran Basic Linear
Algebra Subprograms, ACM Transactions on
Mathematical Software, 14(1)1--32, 1988. - J. Dongarra, J. Du Croz, I. Duff, S. Hammarling,
A Set of Level 3 Basic Linear Algebra
Subprograms, ACM Transactions on Mathematical
Software, 16(1)1--17, 1990.
54Performance of BLAS
- BLAS are specially optimized by the vendor
- Sun BLAS uses features in the Ultrasparc
- Big payoff for algorithms that can be expressed
in terms of the BLAS3 instead of BLAS2 or BLAS1. - The top speed of the BLAS3
- Algorithms like Gaussian elimination organized so
that they use BLAS3
55How To Get Performance From Commodity Processors?
- Todays processors can achieve high-performance,
but this requires extensive machine-specific hand
tuning. - Routines have a large design space w/many
parameters - blocking sizes, loop nesting permutations, loop
unrolling depths, software pipelining strategies,
register allocations, and instruction schedules. - Complicated interactions with the increasingly
sophisticated microarchitectures of new
microprocessors. - A few months ago no tuned BLAS for Pentium for
Linux. - Need for quick/dynamic deployment of optimized
routines. - ATLAS - Automatic Tuned Linear Algebra Software
- PhiPac from Berkeley
56Adaptive Approach for Level 3
- Do a parameter study of the operation on the
target machine, done once. - Only generated code is on-chip multiply
- BLAS operation written in terms of generated
on-chip multiply - All tranpose cases coerced through data copy to 1
case of on-chip multiply - Only 1 case generated per platform
57Code Generation Strategy
- On-chip multiply optimizes for
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization
- Takes a couple of hours to run.
- Code is iteratively generated timed until
optimal case is found. We try - Differing NBs
- Breaking false dependencies
- M, N and K loop unrolling
58(No Transcript)
59(No Transcript)
60500x500 gemm-based BLAS on SGI R10000ip28
61500x500 gemm-based BLAS on UltraSparc 2200
62Recursive Approach for Other Level
3 BLAS
- Recur down to L1 cache block size
- Need kernel at bottom of recursion
- Use gemm-based kernel for portability
Recursive TRMM
63500x500 Level 2 BLAS DGEMV
64Multi-Threaded DGEMMIntel PIII 550 MHz
65ATLAS
- Keep a repository of kernels for specific
machines. - Develop a means of dynamically downloading code
- Extend work to allow sparse matrix operations
- Extend work to include arbitrary code segments
- See http//www.netlib.org/atlas/
66Algorithms and Architecture
- The key to performance is to understand the
algorithm and architecture interaction. - A significant improvement in performance can be
obtained by matching algorithm to the
architecture or vice versa.
67Algorithm Issues
- Use of memory hierarchy
- Algorithm pre-fetching
- Loop unrolling
- Simulating higher precision arithmetic
68Blocking
- TLB Blocking - minimize TLB misses
- Cache Blocking - minimize cache misses
- Register Blocking - minimize load/stores
- The general idea of blocking is to get the
information to a high-speed storage and use it
multiple times so as to amortize the cost of
moving the data - Cache blocking reduced traffic between memory and
cache - Register blocking reduces traffic between cache
and CPU
69Loop Unrolling
- Reduces data dependency delay
- Exploits multiple functional units and quad
load/stores effectively. - Minimizes load/stores
- Reduces loop overheads
- Gives more flexibility to compiler in scheduling
- Facilitates algorithm pre-fetching.
- What about vector computing?
70Whats Wrong With Speedup T1/Tp ?
- Can lead to very false conclusions.
- Speedup in isolation without taking into account
the speed of the processor is unrealistic and
pointless. - Speedup over what?
- T1/Tp
- There is usually no doubt about Tp
- Often considerable dispute over the meaning of T1
- Serial code? Same algorithm?
71Speedup
- Can be used to
- Study, in isolation, the scaling of one algorithm
on one computer. - As a dimensionless variable in the theory of
scaling. - Should not be used to compare
- Different algorithms on the same computer
- The same algorithm on different computers.
- Different interconnection structures.
72(No Transcript)
73Strassens Algorithm for Matrix Multiply
Usual Matrix Multiply
74Strassens Algorithm
75Strassens Algorithm
- The count of arithmetic operations is
- One matrix multiply is replaced by 14 matrix
additions
76Strassens Algorithm
- In reality the use of Strassens Algorithm is
limited by - Additional memory required for storing the P
matrices. - More memory accesses are needed.
77Outline
- Motivation for Dense Linear Algebra
- Axb Computational Electromagnetics
- Ax lx Quantum Chemistry
- Review Gaussian Elimination (GE) for solving Axb
- Optimizing GE for caches on sequential machines
- using matrix-matrix multiplication (BLAS)
- LAPACK library overview and performance
- Data layouts on parallel machines
- Parallel matrix-matrix multiplication
- Parallel Gaussian Elimination
- ScaLAPACK library overview
- Eigenvalue problem
78Parallelism in Sparse Matrix-vector multiplication
- y Ax, where A is sparse and n x n
- Questions
- which processors store
- yi, xi, and Ai,j
- which processors compute
- yi sum (from 1 to n) Ai,j xj
- (row i of A) . x a
sparse dot product - Partitioning
- Partition index set 1,,n N1 u N2 u u Np
- For all i in Nk, Processor k stores yi, xi,
and row i of A - For all i in Nk, Processor k computes yi (row
i of A) . x - owner computes rule Processor k compute the
yis it owns - Goals of partitioning
- balance load (how is load measured?)
- balance storage (how much does each processor
store?) - minimize communication (how much is communicated?)
79Graph Partitioning and Sparse Matrices
- Relationship between matrix and graph
1 2 3 4 5 6
1 1 1 1 2 1 1
1 1 3
1 1 1 4 1 1
1 1 5 1 1 1
1 6 1 1 1 1
- A good partition of the graph has
- equal (weighted) number of nodes in each part
(load and storage balance) - minimum number of edges crossing between
(minimize communication) - Can reorder the rows/columns of the matrix by
putting all the nodes in one partition together
80More on Matrix Reordering via Graph Partitioning
- Ideal matrix structure for parallelism
(nearly) block diagonal - p (number of processors) blocks
- few non-zeros outside these blocks, since these
require communication
P0 P1 P2 P3 P4
81What about implicit methods and eigenproblems?
- Direct methods (Gaussian elimination)
- Called LU Decomposition, because we factor A
LU - Future lectures will consider both dense and
sparse cases - More complicated than sparse-matrix vector
multiplication - Iterative solvers
- Will discuss several of these in future
- Jacobi, Successive overrelaxiation (SOR) ,
Conjugate Gradients (CG), Multigrid,... - Most have sparse-matrix-vector multiplication in
kernel - Eigenproblems
- Future lectures will discuss dense and sparse
cases - Also depend on sparse-matrix-vector
multiplication, direct methods - Graph partitioning
82- Partial Differential Equations
- PDEs
83Continuous Variables, Continuous Parameters
- Examples of such systems include
- Heat flow Temperature(position, time)
- Diffusion Concentration(position, time)
- Electrostatic or Gravitational Potential Pote
ntial(position) - Fluid flow Velocity,Pressure,Density(position,tim
e) - Quantum mechanics Wave-function(position,time)
- Elasticity Stress,Strain(position,time)
84Example Deriving the Heat Equation
x
x-h
0
1
xh
- Consider a simple problem
- A bar of uniform material, insulated except at
ends - Let u(x,t) be the temperature at position x at
time t - Heat travels from x-h to xh at rate proportional
to
d u(x,t) (u(x-h,t)-u(x,t))/h -
(u(x,t)- u(xh,t))/h dt
h
C
- As h 0, we get the heat equation
85Explicit Solution of the Heat Equation
- For simplicity, assume C1
- Discretize both time and position
- Use finite differences with uj,i as the heat at
- time t idt (i 0,1,2,) and position x jh
(j0,1,,N1/h) - initial conditions on uj,0
- boundary conditions on u0,i and uN,i
- At each timestep i 0,1,2,...
- This corresponds to
- matrix vector multiply (what is matrix?)
- nearest neighbors on grid
t5 t4 t3 t2 t1 t0
For j0 to N uj,i1 zuj-1,i
(1-2z)uj,i zuj1,i where z dt/h2
u0,0 u1,0 u2,0 u3,0 u4,0 u5,0
86Parallelism in Explicit Method for PDEs
- Partitioning the space (x) into p largest chunks
- good load balance (assuming large number of
points relative to p) - minimized communication (only p chunks)
- Generalizes to
- multiple dimensions
- arbitrary graphs ( sparse matrices)
- Problem with explicit approach
- numerical instability
- solution blows up eventually if z dt/h gt .5
- need to make the timesteps very small when h is
small dt lt .5h
2
2
87Instability in solving the heat equation
explicitly
88Implicit Solution
- As with many (stiff) ODEs, need an implicit
method - This turns into solving the following equation
- Here I is the identity matrix and T is
- I.e., essentially solving Poissons equation in 1D
(I (z/2)T) u,i1 (I - (z/2)T) u,i
2 -1 -1 2 -1 -1 2 -1
-1 2 -1 -1 2
Graph and stencil
T
2
-1
-1
892D Implicit Method
- Similar to the 1D case, but the matrix T is now
- Multiplying by this matrix (as in the explicit
case) is simply nearest neighbor computation on
2D grid - To solve this system, there are several techniques
Graph and stencil
4 -1 -1 -1 4 -1 -1
-1 4 -1 -1
4 -1 -1 -1 -1 4
-1 -1 -1
-1 4 -1
-1 4 -1
-1 -1 4 -1
-1 -1 4
-1
4
-1
-1
T
-1
90Algorithms for 2D Poisson Equation with N unknowns
- Algorithm Serial PRAM Memory Procs
- Dense LU N3 N N2 N2
- Band LU N2 N N3/2 N
- Jacobi N2 N N N
- Explicit Inv. N log N N N
- Conj.Grad. N 3/2 N 1/2 log N N N
- RB SOR N 3/2 N 1/2 N N
- Sparse LU N 3/2 N 1/2 Nlog N N
- FFT Nlog N log N N N
- Multigrid N log2 N N N
- Lower bound N log N N
- PRAM is an idealized parallel model with zero
cost communication - (see next slide for explanation)
2
2
2
91Short explanations of algorithms on previous slide
- Sorted in two orders (roughly)
- from slowest to fastest on sequential machines
- from most general (works on any matrix) to most
specialized (works on matrices like T) - Dense LU Gaussian elimination works on any
N-by-N matrix - Band LU exploit fact that T is nonzero only on
sqrt(N) diagonals nearest main diagonal, so
faster - Jacobi essentially does matrix-vector multiply
by T in inner loop of iterative algorithm - Explicit Inverse assume we want to solve many
systems with T, so we can precompute and store
inv(T) for free, and just multiply by it - Its still expensive!
- Conjugate Gradients uses matrix-vector
multiplication, like Jacobi, but exploits
mathematical properies of T that Jacobi does not - Red-Black SOR (Successive Overrelaxation)
Variation of Jacobi that exploits yet different
mathematical properties of T - Used in Multigrid
- Sparse LU Gaussian elimination exploiting
particular zero structure of T - FFT (Fast Fourier Transform) works only on
matrices very like T - Multigrid also works on matrices like T, that
come from elliptic PDEs - Lower Bound serial (time to print answer)
parallel (time to combine N inputs)
92Composite mesh from a mechanical structure
93Converting the mesh to a matrix
94Effects of Ordering Rows and Columns on Gaussian
Elimination
95Irregular mesh NASA Airfoil in 2D (direct
solution)
96Irregular mesh Tapered Tube (multigrid)
97Adaptive Mesh Refinement (AMR)
- Adaptive mesh around an explosion
- John Bell and Phil Colella at LBL (see class web
page for URL) - Goal of Titanium is to make these algorithms
easier to implement - in parallel
98Computational Electromagnetics
- Developed during 1980s, driven by defense
applications - Determine the RCS (radar cross section) of
airplane - Reduce signature of plane (stealth technology)
- Other applications are antenna design, medical
equipment - Two fundamental numerical approaches
- MOM methods of moments ( frequency domain), and
- Finite differences (time domain)
99Computational Electromagnetics
- Discretize surface into triangular facets using
standard modeling tools - Amplitude of currents
on surface are unknowns
- Integral equation is discretized into a set of
linear equations
image NW Univ. Comp. Electromagnetics Laboratory
http//nueml.ece.nwu.edu/
100Computational Electromagnetics (MOM)
After discretization the integral equation has
the form A x b where A is
the (dense) impedance matrix, x is the unknown
vector of amplitudes, and b is the excitation
vector. (see Cwik, Patterson, and Scott,
Electromagnetic Scattering on the Intel
Touchstone Delta, IEEE Supercomputing 92, pp 538
- 542)
101Computational Electromagnetics (MOM)
The main steps in the solution process are
Fill computing the matrix elements
of A Factor factoring the dense matrix
A Solve solving for one or more
excitations b Field Calc computing the fields
scattered from the object
102Analysis of MOM for Parallel Implementation
Task Work Parallelism
Parallel Speed
Fill O(n2) embarrassing
low Factor O(n3)
moderately diff. very high Solve
O(n2) moderately diff.
high Field Calc. O(n)
embarrassing high
103Results for Parallel Implementation on Delta
Task Time (hours)
Fill 9.20
Factor 8.25 Solve
2 .17
Field Calc. 0.12
The problem solved was for a matrix of size
48,672. (The world record in 1991.)
104Current Records for Solving Dense Systems
Year System Size Machine
Procs Gflops (Peak) 1950's O(100)
1995 128,600 Intel Paragon
6768 281 ( 338) 1996
215,000 Intel ASCI Red 7264
1068 (1453) 1998 148,000
Cray T3E 1488 1127
(1786) 1998 235,000 Intel
ASCI Red 9152 1338 (1830) 1999
374,000 SGI ASCI Blue 5040
1608 (2520) 1999 362,880
Intel ASCI Red 9632 2379 (3207) 2000
430,000 IBM ASCI White 8192
4928 (12000)
source Alan Edelman http//www-math.mit.edu/edel
man/records.html LINPACK Benchmark
http//www.netlib.org/performance/html/PDSreports.
html
105Computational Chemistry
- Seek energy levels of a molecule, crystal, etc.
- Solve Schroedingers Equation for energy levels
eigenvalues - Discretize to get Ax lBx, solve for eigenvalues
l and eigenvectors x - A and B large, symmetric or Hermitian matrices (B
positive definite) - May want some or all eigenvalues/eigenvectors
- MP-Quest (Sandia NL)
- Si and sapphire crystals of up to 3072 atoms
- Local Density Approximation to Schroedinger
Equation - A and B up to n40000, Hermitian
- Need all eigenvalues and eigenvectors
- Need to iterate up to 20 times (for
self-consistency) - Implemented on Intel ASCI Red
- 9200 Pentium Pro 200 processors (4600 Duals, a
CLUMP) - Overall application ran at 605 Gflops (out of
1800 Glops peak), - Eigensolver ran at 684 Gflops
- www.cs.berkeley.edu/stanley/gbell/index.html
- Runner-up for Gordon Bell Prize at Supercomputing
98
106Review of Gaussian Elimination (GE) for solving
Axb
- Add multiples of each row to later rows to make A
upper triangular - Solve resulting triangular system Ux c by
substitution
for each column i zero it out below the
diagonal by adding multiples of row i to later
rows for i 1 to n-1 for each row j below
row i for j i1 to n add a
multiple of row i to row j for k i to
n A(j,k) A(j,k) -
(A(j,i)/A(i,i)) A(i,k)
107Refine GE Algorithm (1)
- Initial Version
- Remove computation of constant A(j,i)/A(i,i) from
inner loop
for each column i zero it out below the
diagonal by adding multiples of row i to later
rows for i 1 to n-1 for each row j below
row i for j i1 to n add a
multiple of row i to row j for k i to
n A(j,k) A(j,k) -
(A(j,i)/A(i,i)) A(i,k)
for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i to n
A(j,k) A(j,k) - m A(i,k)
108Refine GE Algorithm (2)
- Last version
- Dont compute what we already know
zeros below diagonal in column i
for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i to n
A(j,k) A(j,k) - m A(i,k)
for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i1 to n
A(j,k) A(j,k) - m A(i,k)
109Refine GE Algorithm (3)
- Last version
- Store multipliers m below diagonal in zeroed
entries for later use
for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i1 to n
A(j,k) A(j,k) - m A(i,k)
for i 1 to n-1 for j i1 to n
A(j,i) A(j,i)/A(i,i) for k i1 to
n A(j,k) A(j,k) - A(j,i) A(i,k)
110Refine GE Algorithm (4)
- Last version
- Express using matrix operations (BLAS)
for i 1 to n-1 for j i1 to n
A(j,i) A(j,i)/A(i,i) for k i1 to
n A(j,k) A(j,k) - A(j,i) A(i,k)
for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) A(i1n,i1n) A(i1n , i1n )
- A(i1n , i) A(i , i1n)
111What GE really computes
- Call the strictly lower triangular matrix of
multipliers M, and let L IM - Call the upper triangle of the final matrix U
- Lemma (LU Factorization) If the above algorithm
terminates (does not divide by zero) then A LU - Solving Axb using GE
- Factorize A LU using GE
(cost 2/3 n3 flops) - Solve Ly b for y, using substitution (cost
n2 flops) - Solve Ux y for x, using substitution (cost
n2 flops) - Thus Ax (LU)x L(Ux) Ly b as desired
for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) A(i1n,i1n) A(i1n , i1n ) -
A(i1n , i) A(i , i1n)
112Problems with basic GE algorithm
- What if some A(i,i) is zero? Or very small?
- Result may not exist, or be unstable, so need
to pivot - Current computation all BLAS 1 or BLAS 2, but we
know that BLAS 3 (matrix multiply) is fastest
(Lecture 2)
for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) BLAS 1 (scale a vector)
A(i1n,i1n) A(i1n , i1n ) BLAS 2
(rank-1 update) - A(i1n , i)
A(i , i1n)
Peak
BLAS 3
BLAS 2
BLAS 1
113Pivoting in Gaussian Elimination
- A 0 1 fails completely, even though
A is easy - 1 0
- Illustrate problems in 3-decimal digit
arithmetic - A 1e-4 1 and b 1 ,
correct answer to 3 places is x 1 - 1 1
2
1 - Result of LU decomposition is
- L 1 0 1
0 No roundoff error yet - fl(1/1e-4) 1 1e4
1 - U 1e-4 1
1e-4 1 Error in 4th decimal place - 0 fl(1-1e41)
0 -1e4 - Check if A LU 1e-4 1
(2,2) entry entirely wrong - 1
0
114Gaussian Elimination with Partial Pivoting (GEPP)
- Partial Pivoting swap rows so that each
multiplier - L(i,j) A(j,i)/A(i,i) lt
1
for i 1 to n-1 find and record k where
A(k,i) maxi lt j lt n A(j,i)
i.e. largest entry in rest of column i if
A(k,i) 0 exit with a warning that A
is singular, or nearly so elseif k ! i
swap rows i and k of A end if
A(i1n,i) A(i1n,i) / A(i,i) each
quotient lies in -1,1 A(i1n,i1n)
A(i1n , i1n ) - A(i1n , i) A(i , i1n)
- Lemma This algorithm computes A PLU, where
P is a - permutation matrix
- Since each entry of L(i,j) lt 1, this
algorithm is considered - numerically stable
- For details see LAPACK code at
www.netlib.org/lapack/single/sgetf2.f
115Converting BLAS2 to BLAS3 in GEPP
- Blocking
- Used to optimize matrix-multiplication
- Harder here because of data dependencies in GEPP
- Delayed Updates
- Save updates to trailing matrix from several
consecutive BLAS2 updates - Apply many saved updates simultaneously in one
BLAS3 operation - Same idea works for much of dense linear algebra
- Open questions remain
- Need to choose a block size b
- Algorithm will save and apply b updates
- b must be small enough so that active submatrix
consisting of b columns of A fits in cache - b must be large enough to make BLAS3 fast
116Blocked GEPP (www.netlib.org/lapack/single/sgetr
f.f)
for ib 1 to n-1 step b Process matrix b
columns at a time end ib b-1
Point to end of block of b columns
apply BLAS2 version of GEPP to get A(ibn ,
ibend) P L U let LL denote the
strict lower triangular part of A(ibend ,
ibend) I A(ibend , end1n) LL-1
A(ibend , end1n) update next b rows
of U A(end1n , end1n ) A(end1n ,
end1n ) - A(end1n , ibend)
A(ibend , end1n)
apply delayed updates with
single matrix-multiply
with inner dimension b
(For a correctness proof, see on-lines notes.)
117Efficiency of Blocked GEPP
118Overview of LAPACK
- Standard library for dense/banded linear algebra
- Linear systems Axb
- Least squares problems minx Ax-b 2
- Eigenvalue problems Ax lx, Ax lBx
- Singular value decomposition (SVD) A USVT
- Algorithms reorganized to use BLAS3 as much as
possible - Basis of math libraries on many computers
- Many algorithmic innovations remain
- Projects available
119Performance of LAPACK (n1000)
120Performance of LAPACK (n100)
121Parallelizing Gaussian Elimination
- parallelization steps
- Decomposition identify enough parallel work, but
not too much - Assignment load balance work among threads
- Orchestrate communication and synchronization
- Mapping which processors execute which threads
- Decomposition
- In BLAS 2 algorithm nearly each flop in inner
loop can be done in parallel, so with n2
processors, need 3n parallel steps - This is too fine-grained, prefer calls to local
matmuls instead
for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) BLAS 1 (scale a vector)
A(i1n,i1n) A(i1n , i1n ) BLAS 2
(rank-1 update) - A(i1n , i)
A(i , i1n)
122Assignment of parallel work in GE
- Think of assigning submatrices to threads, where
each thread responsible for updating submatrix it
owns - owner computes rule natural because of locality
- What should submatrices look like to achieve load
balance?
123Different Data Layouts for Parallel GE (on 4
procs)
Load balanced, but cant easily use BLAS2 or BLAS3
Bad load balance P0 idle after first n/4 steps
Can trade load balance and BLAS2/3 performance
by choosing b, but factorization of block column
is a bottleneck
The winner!
Complicated addressing
124Blocked Partitioned Algorithms
- Orthogonal reduction to
- (upper) Hessenberg form
- symmetric tridiagonal form
- bidiagonal form
- Block QR iteration for nonsymmetric eigenvalue
problems
- LU Factorization
- Cholesky factorization
- Symmetric indefinite factorization
- Matrix inversion
- QR, QL, RQ, LQ factorizations
- Form Q or QTC
125Memory Hierarchy and LAPACK
- ijk - implementations
- Effects order in which
data referenced some better at
allowing data to keep in higher levels of
memory hierarchy. - Applies for matrix multiply, reductions to
condensed form - May do slightly more
flops - Up to 3 times faster
126Derivation of Blocked AlgorithmsCholesky
Factorization A UTU
Equating coefficient of the jth column, we obtain
Hence, if U11 has already been computed, we can
compute uj and ujj from the equations
127LINPACK Implementation
- Here is the body of the LINPACK routine SPOFA
which implements the method - DO 30 J 1, N
- INFO J
- S 0.0E0
- JM1 J - 1
- IF( JM1.LT.1 ) GO TO 20
- DO 10 K 1, JM1
- T A( K, J ) - SDOT( K-1,
A( 1, K ), 1,A( 1, J ), 1 ) - T T / A( K, K )
- A( K, J ) T
- S S TT
- 10 CONTINUE
- 20 CONTINUE
- S A( J, J ) - S
- C ...EXIT
- IF( S.LE.0.0E0 ) GO TO 40
- A( J, J ) SQRT( S )
- 30 CONTINUE
128LAPACK Implementation
- DO 10 J 1, N
- CALL STRSV( 'Upper', 'Transpose',
'Non-Unit, J-1, A, LDA, A( 1, J ), 1 ) - S A( J, J ) - SDOT( J-1, A( 1, J ),
1, A( 1, J ), 1 ) - IF( S.LE.ZERO ) GO TO 20
- A( J, J ) SQRT( S )
- 10 CONTINUE
- This change by itself is sufficient to
significantly improve the performance on a number
of machines. - From 238 to 312 Mflop/s for a matrix of order 500
on a Pentium 4-1.7 GHz. - However on peak is 1,700 Mflop/s.
- Suggest further work needed.
129Derivation of Blocked Algorithms
Equating coefficient of second block of columns,
we obtain
Hence, if U11 has already been computed, we can
compute U12 as the solution of the following
equations by a call to the Level 3 BLAS routine
STRSM
130LAPACK Blocked Algorithms
DO 10 J 1, N, NB CALL STRSM( 'Left',
'Upper', 'Transpose','Non-Unit', J-1, JB, ONE, A,
LDA, A( 1, J ), LDA )
CALL SSYRK( 'Upper', 'Transpose', JB, J-1,-ONE,
A( 1, J ), LDA, ONE, A( J, J ),
LDA ) CALL SPOTF2( 'Upper', JB, A( J, J ),
LDA, INFO ) IF( INFO.NE.0 ) GO TO 20 10
CONTINUE
- On Pentium 4, L3 BLAS squeezes a lot more out of
1 proc
Intel Pentium 4 1.7 GHz N 500 Rate of Execution
Linpack variant (L1B) 238 Mflop/s
Level 2 BLAS Variant 312 Mflop/s
Level 3 BLAS Variant 1262 Mflop/s
131LAPACK Contents
- Combines algorithms from LINPACK and EISPACK into
a single package. User interface similar to
LINPACK. - Built on the L 1, 2 and 3 BLAS, for high
performance (manufacturers optimize BLAS) - LAPACK does not provide routines for structured
problems or general sparse matrices (i.e sparse
storage formats such as compressed-row, -column,
-diagonal, skyline ...).
132LAPACK Ongoing Work
- Add functionality
- updating/downdating, divide and conquer least
squares,bidiagonal bisection, bidiagonal inverse
iteration, band SVD, Jacobi methods, ... - Move to new generation of high performance
machines - IBM SPs, CRAY T3E, SGI Origin, clusters of
workstations - New challenges
- New languages FORTRAN 90, HP FORTRAN, ...
- (CMMD, MPL, NX ...)
- many flavors of message passing, need standard
(PVM, MPI) BLACS - Highly varying ratio
- Many ways to layout data,
- Fastest parallel algorithm sometimes less stable
numerically.
133History of Block Partitioned Algorithms
- Early algorithms involved use of small main
memory using tapes as secondary storage. - Recent work centers on use of vector registers,
level 1 and 2 cache, main memory, and out of
core memory.
134Blocked Partitioned Algorithms
- Orthogonal reduction to
- (upper) Hessenberg form
- symmetric tridiagonal form
- bidiagonal form
- Block QR iteration for nonsymmetric eigenvalue
problems
- LU Factorization
- Cholesky factorization
- Symmetric indefinite factorization
- Matrix inversion
- QR, QL, RQ, LQ factorizations
- Form Q or QTC