Lecture 6Linear Algebra Algorithms

- Jack Dongarra, U of Tennessee

Slides are adapted from Jim Demmel, UCBs Lecture

on Linear Algebra Algorithms

Homework 3 Grading Rules

- part1 (2 points) Write an MPI program
- (1) code produce correct result 2/2.
- (2) code run but could not give accurate result

1.5/2. - (3) code could not run or run but nothing 1/2
- Note If you only submit one program which can

print "processor - contribution" and "total integral" correctly, I

will give 1/2 - for this part and 2/2 for part2. so you total

score for part1 - and part2 will be 3/4.
- part2 (2 points) Modify your program to sum the

contributions to give the total value of the

integral. - (1) code produce correct result 2/2.
- (2) code run but could not give accurate result

1.5/2. - (3) code could not run or run but nothing 1/2
- part3 (4 points) Give an expression for the

efficiency of the parallel algorithm for m

subintervals - (1) give t_p(1), t_p(p), S(p) and E(p) correctly

4/4. - (2) give t_p(1), t_p(p) and S(p) correctly 3/4.
- (3) give t_p(1) and t_p(p) correctly 2/4.
- (4) give other 1/4.

Adding numbers together

- If we are adding an array of numbers and have a

choice do we add them from largest to smallest or

smallest to largest?

Parallel Performance Metrics

Absolute Elapsed (wall-clock) Time T(n )

Speedup S(n ) T(1) / T(n ) where T1 is

the time for the best serial implementation.

gt Performance improvement due to parallelsm

Parallel Efficiency E(n ) T(1) / n T(n )

Ideal Speedup SI (n ) n -

Theoretical limit obtainable rarely - Ignores

all of real life These definitions apply to a

fixed-problem experiment.

Amdahls Law

Amdahls Law places a strict limit on the speedup

that can be realized by using multiple

processors. Two equivalent expressions for

Amdahls Law are given below

tN (fp/N fs)t1 Effect of multiple

processors on run time S 1/(fs fp/N)

Effect of multiple processors on

speedup Where fs serial fraction of code fp

parallel fraction of code 1 - fs N number

of processors

Illustration of Amdahls Law

It takes only a small fraction of serial content

in a code to degrade the parallel performance. It

is essential to determine the scaling behavior of

your code before doing production runs using

large numbers of processors

Amdahls Law Vs. Reality

Amdahls Law provides a theoretical upper limit

on parallel speedup assuming that there are no

costs for communications. In reality,

communications (and I/O) will result in a further

degradation of performance.

So whats going on?

More on Amdahls Law

- Amdahls Law can be generalized to any two

processes of with different speeds - Ex. Apply to fprocessor and fmemory
- The growing processor-memory performance gap will

undermine our efforts at achieving maximum

possible speedup!

Gustafsons Law

- Thus, Amdahls Law predicts that there is a

maximum scalability for an application,

determined by its parallel fraction, and this

limit is generally not large. - tN (fp/N fs)t1 Effect of multiple

processors on run time - S 1/(fs fp/N) Effect of multiple

processors on speedup - Where
- fs serial fraction of code
- fp parallel fraction of code 1 - fs
- N number of processors
- There is a way around this increase the problem

size - bigger problems mean bigger grids or more

particles bigger arrays - number of serial operations generally remains

constant number of parallel operations

increases parallel fraction increases

Fixed-Problem Size Scaling

a.k.a. Fixed-load, Fixed-Problem Size, Strong

Scaling, Problem-Constrained, constant-problem

size (CPS), variable subgrid

1 Amdahl Limit SA(n) T(1) / T(n)

----------------

f / n ( 1 -

f ) This bounds the speedup based only on the

fraction of the code that cannot use parallelism

( 1- f ) it ignores all other factors SA

--gt 1 / ( 1- f ) as n --gt

Fixed-Problem Size Scaling (Contd)

Efficiency (n) T(1) / T(n) n

Memory requirements decrease with n

Surface-to-volume ratio increases with n

Superlinear speedup possible from cache

effects Motivation what is the largest of

procs I can use effectively and what is the

fastest time that I can solve a given problem?

Problems - Sequential runs often not possible

(large problems) - Speedup (and efficiency) is

misleading if processors are slow

Fixed-Problem Size Scaling Examples

S. Goedecker and Adolfy Hoisie, Achieving High

Performance in Numerical Computations on RISC

Workstations and Parallel Systems,International

Conference on Computational Physics PC'97 Santa

Cruz, August 25-28 1997.

Fixed-Problem Size Scaling Examples

Scaled Speedup Experiments

a.k.a. Fixed Subgrid-Size, Weak Scaling,

Gustafson scaling. Motivation Want to use a

larger machine to solve a larger global problem

in the same amount of time. Memory and

surface-to-volume effects remain constant.

Top500 Data

Rank Manufacturer Computer Installation Site Rmax TF/s R peak TF/s Ratio Max/Peak Proc

1 IBM ASCI WhiteSP Power3 Lawrence Livermore National Laboratory 7.23 12 60.25 8192

2 Compaq AlphaServer SC ES45 1 GHz Pittsburgh Supercomputing Center 4.06 6.05 67.11 3024

3 IBM SP Power3 375 MHz NERSC/LBNL 3.05 5.00 61.00 3328

4 Intel ASCI Red Sandia National Laboratory 2.38 3.207 74.21 9632

5 IBM ASCI Blue PacificSST, IBM SP 604E Lawrence Livermore National Laboratory 2.14 3.868 55.33 5808

6 Compaq AlphaServer SC ES45 1 GHz Los Alamos National Laboratory 2.10 3.04 69.08 1536

7 Hitachi SR8000/MPP University of Tokyo 1.71 2.69 63.57 1152

8 SGI ASCI Blue Mountain Los Alamos National Laboratory 1.61 2.52 63.89 6144

9 IBM SP Power3 375Mhz Naval Oceanographic Office (NAVOCEANO) 1.42 1.97 72.08 1336

10 IBM SP Power3 375Mhz Deutscher Wetterdienst 1.29 1.83 70.49 1280

Example of a Scaled Speedup Experiment

Parallel Performance Metrics Speedup

Absolute performance

Relative performance

MFLOPS

Speedup

Processors

Processors

Speedup is only one characteristic of a program -

it is not synonymous with performance. In this

comparison of two machines the code achieves

comparable speedups but one of the machines is

faster.

(No Transcript)

Improving Ratio of Floating Point Operations to

Memory Accesses

- subroutine mult(n1,nd1,n2,nd2,y,a,x)
- implicit real8 (a-h,o-z)
- dimension a(nd1,nd2),y(nd2),x(nd1)
- do 20, i1,n1
- t0.d0
- do 10, j1,n2
- tta(j,i)x(j)
- 10 continue
- y(i)t
- 20 continue
- return
- end
- Unroll the loops

2 FLOPS 2 LOADS

Improving Ratio of Floating Point Operations to

Memory Accesses

c works correctly when n1,n2 are multiples of 4

dimension a(nd1,nd2), y(nd2), x(nd1) do

i1,n1-3,4 t10.d0 t20.d0

t30.d0 t40.d0 do j1,n2-3,4

t1t1a(j0,i0)x(j0)a(j1,i0)x(j1) 1

a(j2,i0)x(j2)a(j3,i1)x(j3)

t2t2a(j0,i1)x(j0)a(j1,i1)x(j1) 1

a(j2,i1)x(j2)a(j3,i0)x(j3)

t3t3a(j0,i2)x(j0)a(j1,i2)x(j1) 1

a(j2,i2)x(j2)a(j3,i2)x(j3)

t4t4a(j0,i3)x(j0)a(j1,i3)x(j1) 1

a(j2,i3)x(j2)a(j3,i3)x(j3) enddo

y(i0)t1 y(i1)t2 y(i2)t3

y(i3)t4 enddo

32 FLOPS 20 LOADS

Summary of Single-Processor Optimization

Techniques (I)

- Spatial and temporal data locality
- Loop unrolling
- Blocking
- Software pipelining
- Optimization of data structures
- Special functions, library subroutines

Summary of Optimization Techniques (II)

- Achieving high-performance requires code

restructuring. Minimization of memory traffic is

the single most important goal. - Compilers are getting better good at software

pipelining. But they are not there yet can do

loop transformations only in simple cases,

usually fail to produce optimal blocking,

heuristics for unrolling may not match your code

well, etc. - The optimization process is machine-specific and

requires detailed architectural knowledge.

Optimizing Matrix Addition for Caches

- Dimension A(n,n), B(n,n), C(n,n)
- A, B, C stored by column (as in Fortran)
- Algorithm 1
- for i1n, for j1n, A(i,j) B(i,j) C(i,j)
- Algorithm 2
- for j1n, for i1n, A(i,j) B(i,j) C(i,j)
- What is memory access pattern for Algs 1 and 2?
- Which is faster?
- What if A, B, C stored by row (as in C)?

Using a Simpler Model of Memory to Optimize

- Assume just 2 levels in the hierarchy, fast and

slow - All data initially in slow memory
- m number of memory elements (words) moved

between fast and slow memory - tm time per slow memory operation
- f number of arithmetic operations
- tf time per arithmetic operation lt tm
- q f/m average number of flops per slow element

access - Minimum possible Time ftf, when all data in

fast memory - Actual Time ftf mtm ftf(1

(tm/tf)(1/q)) - Larger q means Time closer to minimum ftf
- Want large q

Simple example using memory model

- To see results of changing q, consider simple

computation

- Assume tf1 Mflop/s on fast memory
- Assume moving data is tm 10
- Assume h takes q flops
- Assume array X is in slow memory

s 0 for i 1, n s s h(Xi)

- So m n and f qn
- Time read X compute 10n qn
- Mflop/s f/t q/(10 q)
- As q increases, this approaches the peak speed

of 1 Mflop/s

Warm up Matrix-vector multiplication y y Ax

- for i 1n
- for j 1n
- y(i) y(i) A(i,j)x(j)

A(i,)

y(i)

y(i)

x()

Warm up Matrix-vector multiplication y y Ax

- read x(1n) into fast memory
- read y(1n) into fast memory
- for i 1n
- read row i of A into fast memory
- for j 1n
- y(i) y(i) A(i,j)x(j)
- write y(1n) back to slow memory

- m number of slow memory refs 3n n2
- f number of arithmetic operations 2n2
- q f/m 2
- Matrix-vector multiplication limited by slow

memory speed

Matrix Multiply CCAB

- for i 1 to n
- for j 1 to n
- for k 1 to n
- C(i,j) C(i,j) A(i,k) B(k,j)

A(i,)

C(i,j)

C(i,j)

B(,j)

Matrix Multiply CCAB(unblocked, or untiled)

- for i 1 to n
- read row i of A into fast memory
- for j 1 to n
- read C(i,j) into fast memory
- read column j of B into fast memory
- for k 1 to n
- C(i,j) C(i,j) A(i,k) B(k,j)
- write C(i,j) back to slow memory

A(i,)

C(i,j)

C(i,j)

B(,j)

Matrix Multiply (unblocked, or untiled)

qops/slow mem ref

- Number of slow memory references on unblocked

matrix multiply - m n3 read each column of B n times
- n2 read each column of A once for

each i - 2n2 read and write each element of C

once - n3 3n2
- So q f/m (2n3)/(n3 3n2)
- 2 for large n, no improvement over

matrix-vector mult

A(i,)

C(i,j)

C(i,j)

B(,j)

Matrix Multiply (blocked, or tiled)

- Consider A,B,C to be N by N matrices of b by b

subblocks where bn/N is called the blocksize - for i 1 to N
- for j 1 to N
- read block C(i,j) into fast memory
- for k 1 to N
- read block A(i,k) into fast

memory - read block B(k,j) into fast

memory - C(i,j) C(i,j) A(i,k)

B(k,j) do a matrix multiply on blocks - write block C(i,j) back to slow memory

A(i,k)

C(i,j)

C(i,j)

B(k,j)

n is the size of the matrix, N blocks of size b

n Nb

Matrix Multiply (blocked or tiled)

qops/slow mem ref

n is the size of the matrix, N blocks of size b

n Nb

- Why is this algorithm correct?
- Number of slow memory references on blocked

matrix multiply - m Nn2 read each block of B N3 times (N3

n/N n/N) - Nn2 read each block of A N3 times
- 2n2 read and write each block of

C once - (2N 2)n2
- So q f/m 2n3 / ((2N 2)n2)
- n/N b for large n
- So we can improve performance by increasing the

blocksize b - Can be much faster than matrix-vector multiplty

(q2) - Limit All three blocks from A,B,C must fit in

fast memory (cache), so we - cannot make these blocks arbitrarily large 3b2

lt M, so q b lt sqrt(M/3) - Theorem (Hong, Kung, 1981) Any reorganization of

this algorithm - (that uses only associativity) is limited to q

O(sqrt(M))

Model

- As much as possible will be overlapped
- Dot Product
- ACC 0
- do i x,n
- ACC ACC x(i) y(i)
- end do
- Experiments done on an IBM RS6000/530
- 25 MHz
- 2 cycle to complete FMA can be pipelined
- gt 50 Mflop/s peak
- one cycle from cache

FMA

FMA

FMA

FMA

FMA

FMA

cycles

DOT Operation - Data in Cache

- Do 10 I 1, n
- T T X(I)Y(I)
- 10 CONTINUE
- Theoretically, 2 loads for X(I) and Y(I), one FMA

operation, no re-use of data - Pseudo-assembler
- LOAD fp0,T
- label
- LOAD fp1,X(I)
- LOAD fp2,Y(I)
- FMA fp0,fp0,fp1,fp2
- BRANCH label

Load y

Load x

FMA

Load y

Load x

FMA

1 result per cycle 25 Mflop/s

Matrix-Vector Product

- DOT version
- DO 20 I 1, M
- DO 10 J 1, N
- Y(I) Y(I) A(I,J)X(J)
- 10 CONTINUE
- 20 CONTINUE
- From Cache 22.7 Mflops
- From Memory 12.4 Mflops

Loop Unrolling

- DO 20 I 1, M, 2
- T1 Y(I )
- T2 Y(I1)
- DO 10 J 1, N
- T1 T1 A(I,J )X(J)
- T2 T2 A(I1,J)X(J)
- 10 CONTINUE
- Y(I ) T1
- Y(I1) T2
- 20 CONTINUE
- 3 loads, 4 flops
- Speed of yyATx, N48

- Depth 1 2 3 4

? - Speed 25 33.3 37.5 40 50
- Measured 22.7 30.5 34.3 36.5
- Memory 12.4 12.7 12.7 12.6
- unroll 1 2 loads 2 ops per 2 cycles
- unroll 2 3 loads 4 ops per 3 cycles
- unroll 3 4 loads 6 ops per 4 cycles
- unroll n n1 loads 2n ops per n1 cycles
- problem only so many registers

Matrix Multiply

- DOT version - 25 Mflops in cache
- DO 30 J 1, M
- DO 20 I 1, M
- DO 10 K 1, L
- C(I,J) C(I,J)

A(I,K)B(K,J) - 10 CONTINUE
- 20 CONTINUE
- 30 CONTINUE

How to Get Near Peak

- Inner loop
- 4 loads, 8 operations, optimal.
- In practice we have measured 48.1 out of a peak

of 50 Mflop/s when in cache

- DO 30 J 1, M, 2
- DO 20 I 1, M, 2
- T11 C(I, J )
- T12 C(I, J1)
- T21 C(I1,J )
- T22 C(I1,J1)
- DO 10 K 1, L
- T11 T11 A(I, K) B(K,J

) - T12 T12 A(I, K)

B(K,J1) - T21 T21 A(I1,K)B(K,J )
- T22 T22 A(I1,K)B(K,J1)
- 10 CONTINUE
- C(I, J ) T11
- C(I, J1) T12
- C(I1,J ) T21
- C(I1,J1) T22
- 20 CONTINUE
- 30 CONTINUE

BLAS -- Introduction

- Clarity code is shorter and easier to read,
- Modularity gives programmer larger building

blocks, - Performance manufacturers will provide tuned

machine-specific BLAS, - Program portability machine dependencies are

confined to the BLAS

Memory Hierarchy

- Key to high performance in effective use of

memory hierarchy - True on all architectures

Level 1, 2 and 3 BLAS

- Level 1 BLAS Vector-Vector operations
- Level 2 BLAS Matrix-Vector operations
- Level 3 BLAS Matrix-Matrix operations

More on BLAS (Basic Linear Algebra Subroutines)

- Industry standard interface(evolving)
- Vendors, others supply optimized implementations
- History
- BLAS1 (1970s)
- vector operations dot product, saxpy (y?xy),

etc - m2n, f2n, q 1 or less
- BLAS2 (mid 1980s)
- matrix-vector operations matrix vector multiply,

etc - mn2, f2n2, q2, less overhead
- somewhat faster than BLAS1
- BLAS3 (late 1980s)
- matrix-matrix operations matrix matrix multiply,

etc - m gt 4n2, fO(n3), so q can possibly be as large

as n, so BLAS3 is potentially much faster than

BLAS2 - Good algorithms used BLAS3 when possible (LAPACK)
- www.netlib.org/blas, www.netlib.org/lapack

Why Higher Level BLAS?

- Can only do arithmetic on data at the top of the

hierarchy - Higher level BLAS lets us do this

BLAS for Performance

- Development of blocked algorithms important for

performance

BLAS for Performance

- Development of blocked algorithms important for

performance

BLAS for Performance

- Development of blocked algorithms important for

performance

BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2

(n-by-n matrix vector multiply) vs BLAS 1 (saxpy

of n vectors)

Fast linear algebra kernels BLAS

- Simple linear algebra kernels such as

matrix-matrix multiply - More complicated algorithms can be built from

these basic kernels. - The interfaces of these kernels have been

standardized as the Basic Linear Algebra

Subroutines (BLAS). - Early agreement on standard interface (1980)
- Led to portable libraries for vector and shared

memory parallel machines. - On distributed memory, there is a less-standard

interface called the PBLAS

Level 1 BLAS

- Operate on vectors or pairs of vectors
- perform O(n) operations
- return either a vector or a scalar.
- saxpy
- y(i) a x(i) y(i), for i1 to n.
- s stands for single precision, daxpy is for

double precision, caxpy for complex, and zaxpy

for double complex, - sscal y a x, for scalar a and vectors x,y
- sdot computes s S ni1 x(i)y(i)

Level 2 BLAS

- Operate on a matrix and a vector
- return a matrix or a vector
- O(n2) operations
- sgemv matrix-vector multiply
- y y Ax
- where A is m-by-n, x is n-by-1 and y is m-by-1.
- sger rank-one update
- A A yxT, i.e., A(i,j) A(i,j)y(i)x(j)
- where A is m-by-n, y is m-by-1, x is n-by-1,
- strsv triangular solve
- solves yTx for x, where T is triangular

Level 3 BLAS

- Operate on pairs or triples of matrices
- returning a matrix
- complexity is O(n3).
- sgemm Matrix-matrix multiplication
- C C AB,
- where C is m-by-n, A is m-by-k, and B is k-by-n
- strsm multiple triangular solve
- solves Y TX for X,
- where T is a triangular matrix, and X is a

rectangular matrix.

Optimizing in practice

- Tiling for registers
- loop unrolling, use of named register variables
- Tiling for multiple levels of cache
- Exploiting fine-grained parallelism within the

processor - super scalar
- pipelining
- Complicated compiler interactions
- Hard to do by hand (but youll try)
- Automatic optimization an active research area
- PHIPAC www.icsi.berkeley.edu/bilmes/phipac
- www.cs.berkeley.edu/iyer/asci_sli

des.ps - ATLAS www.netlib.org/atlas/index.html

BLAS -- References

- BLAS software and documentation can be obtained

via - WWW http//www.netlib.org/blas,
- (anonymous) ftp ftp.netlib.org cd blas get

index - email netlib_at_www.netlib.org with the message

send index from blas - Comments and questions can be addressed to

lapack_at_cs.utk.edu

BLAS Papers

- C. Lawson, R. Hanson, D. Kincaid, and F. Krogh,

Basic Linear Algebra Subprograms for Fortran

Usage, ACM Transactions on Mathematical Software,

5308--325, 1979. - J. Dongarra, J. Du Croz, S. Hammarling, and R.

Hanson, An Extended Set of Fortran Basic Linear

Algebra Subprograms, ACM Transactions on

Mathematical Software, 14(1)1--32, 1988. - J. Dongarra, J. Du Croz, I. Duff, S. Hammarling,

A Set of Level 3 Basic Linear Algebra

Subprograms, ACM Transactions on Mathematical

Software, 16(1)1--17, 1990.

Performance of BLAS

- BLAS are specially optimized by the vendor
- Sun BLAS uses features in the Ultrasparc
- Big payoff for algorithms that can be expressed

in terms of the BLAS3 instead of BLAS2 or BLAS1. - The top speed of the BLAS3
- Algorithms like Gaussian elimination organized so

that they use BLAS3

How To Get Performance From Commodity Processors?

- Todays processors can achieve high-performance,

but this requires extensive machine-specific hand

tuning. - Routines have a large design space w/many

parameters - blocking sizes, loop nesting permutations, loop

unrolling depths, software pipelining strategies,

register allocations, and instruction schedules. - Complicated interactions with the increasingly

sophisticated microarchitectures of new

microprocessors. - A few months ago no tuned BLAS for Pentium for

Linux. - Need for quick/dynamic deployment of optimized

routines. - ATLAS - Automatic Tuned Linear Algebra Software
- PhiPac from Berkeley

Adaptive Approach for Level 3

- Do a parameter study of the operation on the

target machine, done once. - Only generated code is on-chip multiply
- BLAS operation written in terms of generated

on-chip multiply - All tranpose cases coerced through data copy to 1

case of on-chip multiply - Only 1 case generated per platform

Code Generation Strategy

- On-chip multiply optimizes for
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization
- Takes a couple of hours to run.

- Code is iteratively generated timed until

optimal case is found. We try - Differing NBs
- Breaking false dependencies
- M, N and K loop unrolling

(No Transcript)

(No Transcript)

500x500 gemm-based BLAS on SGI R10000ip28

500x500 gemm-based BLAS on UltraSparc 2200

Recursive Approach for Other Level

3 BLAS

- Recur down to L1 cache block size
- Need kernel at bottom of recursion
- Use gemm-based kernel for portability

Recursive TRMM

500x500 Level 2 BLAS DGEMV

Multi-Threaded DGEMMIntel PIII 550 MHz

ATLAS

- Keep a repository of kernels for specific

machines. - Develop a means of dynamically downloading code
- Extend work to allow sparse matrix operations
- Extend work to include arbitrary code segments
- See http//www.netlib.org/atlas/

Algorithms and Architecture

- The key to performance is to understand the

algorithm and architecture interaction. - A significant improvement in performance can be

obtained by matching algorithm to the

architecture or vice versa.

Algorithm Issues

- Use of memory hierarchy
- Algorithm pre-fetching
- Loop unrolling
- Simulating higher precision arithmetic

Blocking

- TLB Blocking - minimize TLB misses
- Cache Blocking - minimize cache misses
- Register Blocking - minimize load/stores
- The general idea of blocking is to get the

information to a high-speed storage and use it

multiple times so as to amortize the cost of

moving the data - Cache blocking reduced traffic between memory and

cache - Register blocking reduces traffic between cache

and CPU

Loop Unrolling

- Reduces data dependency delay
- Exploits multiple functional units and quad

load/stores effectively. - Minimizes load/stores
- Reduces loop overheads
- Gives more flexibility to compiler in scheduling
- Facilitates algorithm pre-fetching.
- What about vector computing?

Whats Wrong With Speedup T1/Tp ?

- Can lead to very false conclusions.
- Speedup in isolation without taking into account

the speed of the processor is unrealistic and

pointless. - Speedup over what?
- T1/Tp
- There is usually no doubt about Tp
- Often considerable dispute over the meaning of T1
- Serial code? Same algorithm?

Speedup

- Can be used to
- Study, in isolation, the scaling of one algorithm

on one computer. - As a dimensionless variable in the theory of

scaling. - Should not be used to compare
- Different algorithms on the same computer
- The same algorithm on different computers.
- Different interconnection structures.

(No Transcript)

Strassens Algorithm for Matrix Multiply

Usual Matrix Multiply

Strassens Algorithm

Strassens Algorithm

- The count of arithmetic operations is
- One matrix multiply is replaced by 14 matrix

additions

Strassens Algorithm

- In reality the use of Strassens Algorithm is

limited by - Additional memory required for storing the P

matrices. - More memory accesses are needed.

Outline

- Motivation for Dense Linear Algebra
- Axb Computational Electromagnetics
- Ax lx Quantum Chemistry
- Review Gaussian Elimination (GE) for solving Axb
- Optimizing GE for caches on sequential machines
- using matrix-matrix multiplication (BLAS)
- LAPACK library overview and performance
- Data layouts on parallel machines
- Parallel matrix-matrix multiplication
- Parallel Gaussian Elimination
- ScaLAPACK library overview
- Eigenvalue problem

Parallelism in Sparse Matrix-vector multiplication

- y Ax, where A is sparse and n x n
- Questions
- which processors store
- yi, xi, and Ai,j
- which processors compute
- yi sum (from 1 to n) Ai,j xj
- (row i of A) . x a

sparse dot product - Partitioning
- Partition index set 1,,n N1 u N2 u u Np
- For all i in Nk, Processor k stores yi, xi,

and row i of A - For all i in Nk, Processor k computes yi (row

i of A) . x - owner computes rule Processor k compute the

yis it owns - Goals of partitioning
- balance load (how is load measured?)
- balance storage (how much does each processor

store?) - minimize communication (how much is communicated?)

Graph Partitioning and Sparse Matrices

- Relationship between matrix and graph

1 2 3 4 5 6

1 1 1 1 2 1 1

1 1 3

1 1 1 4 1 1

1 1 5 1 1 1

1 6 1 1 1 1

- A good partition of the graph has
- equal (weighted) number of nodes in each part

(load and storage balance) - minimum number of edges crossing between

(minimize communication) - Can reorder the rows/columns of the matrix by

putting all the nodes in one partition together

More on Matrix Reordering via Graph Partitioning

- Ideal matrix structure for parallelism

(nearly) block diagonal - p (number of processors) blocks
- few non-zeros outside these blocks, since these

require communication

P0 P1 P2 P3 P4

What about implicit methods and eigenproblems?

- Direct methods (Gaussian elimination)
- Called LU Decomposition, because we factor A

LU - Future lectures will consider both dense and

sparse cases - More complicated than sparse-matrix vector

multiplication - Iterative solvers
- Will discuss several of these in future
- Jacobi, Successive overrelaxiation (SOR) ,

Conjugate Gradients (CG), Multigrid,... - Most have sparse-matrix-vector multiplication in

kernel - Eigenproblems
- Future lectures will discuss dense and sparse

cases - Also depend on sparse-matrix-vector

multiplication, direct methods - Graph partitioning

- Partial Differential Equations
- PDEs

Continuous Variables, Continuous Parameters

- Examples of such systems include
- Heat flow Temperature(position, time)
- Diffusion Concentration(position, time)
- Electrostatic or Gravitational Potential Pote

ntial(position) - Fluid flow Velocity,Pressure,Density(position,tim

e) - Quantum mechanics Wave-function(position,time)
- Elasticity Stress,Strain(position,time)

Example Deriving the Heat Equation

x

x-h

0

1

xh

- Consider a simple problem
- A bar of uniform material, insulated except at

ends - Let u(x,t) be the temperature at position x at

time t - Heat travels from x-h to xh at rate proportional

to

d u(x,t) (u(x-h,t)-u(x,t))/h -

(u(x,t)- u(xh,t))/h dt

h

C

- As h 0, we get the heat equation

Explicit Solution of the Heat Equation

- For simplicity, assume C1
- Discretize both time and position
- Use finite differences with uj,i as the heat at
- time t idt (i 0,1,2,) and position x jh

(j0,1,,N1/h) - initial conditions on uj,0
- boundary conditions on u0,i and uN,i
- At each timestep i 0,1,2,...
- This corresponds to
- matrix vector multiply (what is matrix?)
- nearest neighbors on grid

t5 t4 t3 t2 t1 t0

For j0 to N uj,i1 zuj-1,i

(1-2z)uj,i zuj1,i where z dt/h2

u0,0 u1,0 u2,0 u3,0 u4,0 u5,0

Parallelism in Explicit Method for PDEs

- Partitioning the space (x) into p largest chunks
- good load balance (assuming large number of

points relative to p) - minimized communication (only p chunks)
- Generalizes to
- multiple dimensions
- arbitrary graphs ( sparse matrices)
- Problem with explicit approach
- numerical instability
- solution blows up eventually if z dt/h gt .5
- need to make the timesteps very small when h is

small dt lt .5h

2

2

Instability in solving the heat equation

explicitly

Implicit Solution

- As with many (stiff) ODEs, need an implicit

method - This turns into solving the following equation
- Here I is the identity matrix and T is
- I.e., essentially solving Poissons equation in 1D

(I (z/2)T) u,i1 (I - (z/2)T) u,i

2 -1 -1 2 -1 -1 2 -1

-1 2 -1 -1 2

Graph and stencil

T

2

-1

-1

2D Implicit Method

- Similar to the 1D case, but the matrix T is now
- Multiplying by this matrix (as in the explicit

case) is simply nearest neighbor computation on

2D grid - To solve this system, there are several techniques

Graph and stencil

4 -1 -1 -1 4 -1 -1

-1 4 -1 -1

4 -1 -1 -1 -1 4

-1 -1 -1

-1 4 -1

-1 4 -1

-1 -1 4 -1

-1 -1 4

-1

4

-1

-1

T

-1

Algorithms for 2D Poisson Equation with N unknowns

- Algorithm Serial PRAM Memory Procs
- Dense LU N3 N N2 N2
- Band LU N2 N N3/2 N
- Jacobi N2 N N N
- Explicit Inv. N log N N N
- Conj.Grad. N 3/2 N 1/2 log N N N
- RB SOR N 3/2 N 1/2 N N
- Sparse LU N 3/2 N 1/2 Nlog N N
- FFT Nlog N log N N N
- Multigrid N log2 N N N
- Lower bound N log N N
- PRAM is an idealized parallel model with zero

cost communication - (see next slide for explanation)

2

2

2

Short explanations of algorithms on previous slide

- Sorted in two orders (roughly)
- from slowest to fastest on sequential machines
- from most general (works on any matrix) to most

specialized (works on matrices like T) - Dense LU Gaussian elimination works on any

N-by-N matrix - Band LU exploit fact that T is nonzero only on

sqrt(N) diagonals nearest main diagonal, so

faster - Jacobi essentially does matrix-vector multiply

by T in inner loop of iterative algorithm - Explicit Inverse assume we want to solve many

systems with T, so we can precompute and store

inv(T) for free, and just multiply by it - Its still expensive!
- Conjugate Gradients uses matrix-vector

multiplication, like Jacobi, but exploits

mathematical properies of T that Jacobi does not - Red-Black SOR (Successive Overrelaxation)

Variation of Jacobi that exploits yet different

mathematical properties of T - Used in Multigrid
- Sparse LU Gaussian elimination exploiting

particular zero structure of T - FFT (Fast Fourier Transform) works only on

matrices very like T - Multigrid also works on matrices like T, that

come from elliptic PDEs - Lower Bound serial (time to print answer)

parallel (time to combine N inputs)

Composite mesh from a mechanical structure

Converting the mesh to a matrix

Effects of Ordering Rows and Columns on Gaussian

Elimination

Irregular mesh NASA Airfoil in 2D (direct

solution)

Irregular mesh Tapered Tube (multigrid)

Adaptive Mesh Refinement (AMR)

- Adaptive mesh around an explosion
- John Bell and Phil Colella at LBL (see class web

page for URL) - Goal of Titanium is to make these algorithms

easier to implement - in parallel

Computational Electromagnetics

- Developed during 1980s, driven by defense

applications - Determine the RCS (radar cross section) of

airplane - Reduce signature of plane (stealth technology)
- Other applications are antenna design, medical

equipment - Two fundamental numerical approaches
- MOM methods of moments ( frequency domain), and
- Finite differences (time domain)

Computational Electromagnetics

- Discretize surface into triangular facets using

standard modeling tools - Amplitude of currents

on surface are unknowns

- Integral equation is discretized into a set of

linear equations

image NW Univ. Comp. Electromagnetics Laboratory

http//nueml.ece.nwu.edu/

Computational Electromagnetics (MOM)

After discretization the integral equation has

the form A x b where A is

the (dense) impedance matrix, x is the unknown

vector of amplitudes, and b is the excitation

vector. (see Cwik, Patterson, and Scott,

Electromagnetic Scattering on the Intel

Touchstone Delta, IEEE Supercomputing 92, pp 538

- 542)

Computational Electromagnetics (MOM)

The main steps in the solution process are

Fill computing the matrix elements

of A Factor factoring the dense matrix

A Solve solving for one or more

excitations b Field Calc computing the fields

scattered from the object

Analysis of MOM for Parallel Implementation

Task Work Parallelism

Parallel Speed

Fill O(n2) embarrassing

low Factor O(n3)

moderately diff. very high Solve

O(n2) moderately diff.

high Field Calc. O(n)

embarrassing high

Results for Parallel Implementation on Delta

Task Time (hours)

Fill 9.20

Factor 8.25 Solve

2 .17

Field Calc. 0.12

The problem solved was for a matrix of size

48,672. (The world record in 1991.)

Current Records for Solving Dense Systems

Year System Size Machine

Procs Gflops (Peak) 1950's O(100)

1995 128,600 Intel Paragon

6768 281 ( 338) 1996

215,000 Intel ASCI Red 7264

1068 (1453) 1998 148,000

Cray T3E 1488 1127

(1786) 1998 235,000 Intel

ASCI Red 9152 1338 (1830) 1999

374,000 SGI ASCI Blue 5040

1608 (2520) 1999 362,880

Intel ASCI Red 9632 2379 (3207) 2000

430,000 IBM ASCI White 8192

4928 (12000)

source Alan Edelman http//www-math.mit.edu/edel

man/records.html LINPACK Benchmark

http//www.netlib.org/performance/html/PDSreports.

html

Computational Chemistry

- Seek energy levels of a molecule, crystal, etc.
- Solve Schroedingers Equation for energy levels

eigenvalues - Discretize to get Ax lBx, solve for eigenvalues

l and eigenvectors x - A and B large, symmetric or Hermitian matrices (B

positive definite) - May want some or all eigenvalues/eigenvectors
- MP-Quest (Sandia NL)
- Si and sapphire crystals of up to 3072 atoms
- Local Density Approximation to Schroedinger

Equation - A and B up to n40000, Hermitian
- Need all eigenvalues and eigenvectors
- Need to iterate up to 20 times (for

self-consistency) - Implemented on Intel ASCI Red
- 9200 Pentium Pro 200 processors (4600 Duals, a

CLUMP) - Overall application ran at 605 Gflops (out of

1800 Glops peak), - Eigensolver ran at 684 Gflops
- www.cs.berkeley.edu/stanley/gbell/index.html
- Runner-up for Gordon Bell Prize at Supercomputing

98

Review of Gaussian Elimination (GE) for solving

Axb

- Add multiples of each row to later rows to make A

upper triangular - Solve resulting triangular system Ux c by

substitution

for each column i zero it out below the

diagonal by adding multiples of row i to later

rows for i 1 to n-1 for each row j below

row i for j i1 to n add a

multiple of row i to row j for k i to

n A(j,k) A(j,k) -

(A(j,i)/A(i,i)) A(i,k)

Refine GE Algorithm (1)

- Initial Version
- Remove computation of constant A(j,i)/A(i,i) from

inner loop

for each column i zero it out below the

diagonal by adding multiples of row i to later

rows for i 1 to n-1 for each row j below

row i for j i1 to n add a

multiple of row i to row j for k i to

n A(j,k) A(j,k) -

(A(j,i)/A(i,i)) A(i,k)

for i 1 to n-1 for j i1 to n

m A(j,i)/A(i,i) for k i to n

A(j,k) A(j,k) - m A(i,k)

Refine GE Algorithm (2)

- Last version
- Dont compute what we already know

zeros below diagonal in column i

for i 1 to n-1 for j i1 to n

m A(j,i)/A(i,i) for k i to n

A(j,k) A(j,k) - m A(i,k)

for i 1 to n-1 for j i1 to n

m A(j,i)/A(i,i) for k i1 to n

A(j,k) A(j,k) - m A(i,k)

Refine GE Algorithm (3)

- Last version
- Store multipliers m below diagonal in zeroed

entries for later use

for i 1 to n-1 for j i1 to n

m A(j,i)/A(i,i) for k i1 to n

A(j,k) A(j,k) - m A(i,k)

for i 1 to n-1 for j i1 to n

A(j,i) A(j,i)/A(i,i) for k i1 to

n A(j,k) A(j,k) - A(j,i) A(i,k)

Refine GE Algorithm (4)

- Last version
- Express using matrix operations (BLAS)

for i 1 to n-1 for j i1 to n

A(j,i) A(j,i)/A(i,i) for k i1 to

n A(j,k) A(j,k) - A(j,i) A(i,k)

for i 1 to n-1 A(i1n,i) A(i1n,i) /

A(i,i) A(i1n,i1n) A(i1n , i1n )

- A(i1n , i) A(i , i1n)

What GE really computes

- Call the strictly lower triangular matrix of

multipliers M, and let L IM - Call the upper triangle of the final matrix U
- Lemma (LU Factorization) If the above algorithm

terminates (does not divide by zero) then A LU - Solving Axb using GE
- Factorize A LU using GE

(cost 2/3 n3 flops) - Solve Ly b for y, using substitution (cost

n2 flops) - Solve Ux y for x, using substitution (cost

n2 flops) - Thus Ax (LU)x L(Ux) Ly b as desired

for i 1 to n-1 A(i1n,i) A(i1n,i) /

A(i,i) A(i1n,i1n) A(i1n , i1n ) -

A(i1n , i) A(i , i1n)

Problems with basic GE algorithm

- What if some A(i,i) is zero? Or very small?
- Result may not exist, or be unstable, so need

to pivot - Current computation all BLAS 1 or BLAS 2, but we

know that BLAS 3 (matrix multiply) is fastest

(Lecture 2)

for i 1 to n-1 A(i1n,i) A(i1n,i) /

A(i,i) BLAS 1 (scale a vector)

A(i1n,i1n) A(i1n , i1n ) BLAS 2

(rank-1 update) - A(i1n , i)

A(i , i1n)

Peak

BLAS 3

BLAS 2

BLAS 1

Pivoting in Gaussian Elimination

- A 0 1 fails completely, even though

A is easy - 1 0
- Illustrate problems in 3-decimal digit

arithmetic - A 1e-4 1 and b 1 ,

correct answer to 3 places is x 1 - 1 1

2

1 - Result of LU decomposition is
- L 1 0 1

0 No roundoff error yet - fl(1/1e-4) 1 1e4

1 - U 1e-4 1

1e-4 1 Error in 4th decimal place - 0 fl(1-1e41)

0 -1e4 - Check if A LU 1e-4 1

(2,2) entry entirely wrong - 1

0

Gaussian Elimination with Partial Pivoting (GEPP)

- Partial Pivoting swap rows so that each

multiplier - L(i,j) A(j,i)/A(i,i) lt

1

for i 1 to n-1 find and record k where

A(k,i) maxi lt j lt n A(j,i)

i.e. largest entry in rest of column i if

A(k,i) 0 exit with a warning that A

is singular, or nearly so elseif k ! i

swap rows i and k of A end if

A(i1n,i) A(i1n,i) / A(i,i) each

quotient lies in -1,1 A(i1n,i1n)

A(i1n , i1n ) - A(i1n , i) A(i , i1n)

- Lemma This algorithm computes A PLU, where

P is a - permutation matrix
- Since each entry of L(i,j) lt 1, this

algorithm is considered - numerically stable
- For details see LAPACK code at

www.netlib.org/lapack/single/sgetf2.f

Converting BLAS2 to BLAS3 in GEPP

- Blocking
- Used to optimize matrix-multiplication
- Harder here because of data dependencies in GEPP
- Delayed Updates
- Save updates to trailing matrix from several

consecutive BLAS2 updates - Apply many saved updates simultaneously in one

BLAS3 operation - Same idea works for much of dense linear algebra
- Open questions remain
- Need to choose a block size b
- Algorithm will save and apply b updates
- b must be small enough so that active submatrix

consisting of b columns of A fits in cache - b must be large enough to make BLAS3 fast

Blocked GEPP (www.netlib.org/lapack/single/sgetr

f.f)

for ib 1 to n-1 step b Process matrix b

columns at a time end ib b-1

Point to end of block of b columns

apply BLAS2 version of GEPP to get A(ibn ,

ibend) P L U let LL denote the

strict lower triangular part of A(ibend ,

ibend) I A(ibend , end1n) LL-1

A(ibend , end1n) update next b rows

of U A(end1n , end1n ) A(end1n ,

end1n ) - A(end1n , ibend)

A(ibend , end1n)

apply delayed updates with

single matrix-multiply

with inner dimension b

(For a correctness proof, see on-lines notes.)

Efficiency of Blocked GEPP

Overview of LAPACK

- Standard library for dense/banded linear algebra
- Linear systems Axb
- Least squares problems minx Ax-b 2
- Eigenvalue problems Ax lx, Ax lBx
- Singular value decomposition (SVD) A USVT
- Algorithms reorganized to use BLAS3 as much as

possible - Basis of math libraries on many computers
- Many algorithmic innovations remain
- Projects available

Performance of LAPACK (n1000)

Performance of LAPACK (n100)

Parallelizing Gaussian Elimination

- parallelization steps
- Decomposition identify enough parallel work, but

not too much - Assignment load balance work among threads
- Orchestrate communication and synchronization
- Mapping which processors execute which threads
- Decomposition
- In BLAS 2 algorithm nearly each flop in inner

loop can be done in parallel, so with n2

processors, need 3n parallel steps - This is too fine-grained, prefer calls to local

matmuls instead

for i 1 to n-1 A(i1n,i) A(i1n,i) /

A(i,i) BLAS 1 (scale a vector)

A(i1n,i1n) A(i1n , i1n ) BLAS 2

(rank-1 update) - A(i1n , i)

A(i , i1n)

Assignment of parallel work in GE

- Think of assigning submatrices to threads, where

each thread responsible for updating submatrix it

owns - owner computes rule natural because of locality
- What should submatrices look like to achieve load

balance?

Different Data Layouts for Parallel GE (on 4

procs)

Load balanced, but cant easily use BLAS2 or BLAS3

Bad load balance P0 idle after first n/4 steps

Can trade load balance and BLAS2/3 performance

by choosing b, but factorization of block column

is a bottleneck

The winner!

Complicated addressing

Blocked Partitioned Algorithms

- Orthogonal reduction to
- (upper) Hessenberg form
- symmetric tridiagonal form
- bidiagonal form
- Block QR iteration for nonsymmetric eigenvalue

problems

- LU Factorization
- Cholesky factorization
- Symmetric indefinite factorization
- Matrix inversion
- QR, QL, RQ, LQ factorizations
- Form Q or QTC

Memory Hierarchy and LAPACK

- ijk - implementations
- Effects order in which

data referenced some better at

allowing data to keep in higher levels of

memory hierarchy. - Applies for matrix multiply, reductions to

condensed form - May do slightly more

flops - Up to 3 times faster

Derivation of Blocked AlgorithmsCholesky

Factorization A UTU

Equating coefficient of the jth column, we obtain

Hence, if U11 has already been computed, we can

compute uj and ujj from the equations

LINPACK Implementation

- Here is the body of the LINPACK routine SPOFA

which implements the method - DO 30 J 1, N
- INFO J
- S 0.0E0
- JM1 J - 1
- IF( JM1.LT.1 ) GO TO 20
- DO 10 K 1, JM1
- T A( K, J ) - SDOT( K-1,

A( 1, K ), 1,A( 1, J ), 1 ) - T T / A( K, K )
- A( K, J ) T
- S S TT
- 10 CONTINUE
- 20 CONTINUE
- S A( J, J ) - S
- C ...EXIT
- IF( S.LE.0.0E0 ) GO TO 40
- A( J, J ) SQRT( S )
- 30 CONTINUE

LAPACK Implementation

- DO 10 J 1, N
- CALL STRSV( 'Upper', 'Transpose',

'Non-Unit, J-1, A, LDA, A( 1, J ), 1 ) - S A( J, J ) - SDOT( J-1, A( 1, J ),

1, A( 1, J ), 1 ) - IF( S.LE.ZERO ) GO TO 20
- A( J, J ) SQRT( S )
- 10 CONTINUE
- This change by itself is sufficient to

significantly improve the performance on a number

of machines. - From 238 to 312 Mflop/s for a matrix of order 500

on a Pentium 4-1.7 GHz. - However on peak is 1,700 Mflop/s.
- Suggest further work needed.

Derivation of Blocked Algorithms

Equating coefficient of second block of columns,

we obtain

Hence, if U11 has already been computed, we can

compute U12 as the solution of the following

equations by a call to the Level 3 BLAS routine

STRSM

LAPACK Blocked Algorithms

DO 10 J 1, N, NB CALL STRSM( 'Left',

'Upper', 'Transpose','Non-Unit', J-1, JB, ONE, A,

LDA, A( 1, J ), LDA )

CALL SSYRK( 'Upper', 'Transpose', JB, J-1,-ONE,

A( 1, J ), LDA, ONE, A( J, J ),

LDA ) CALL SPOTF2( 'Upper', JB, A( J, J ),

LDA, INFO ) IF( INFO.NE.0 ) GO TO 20 10

CONTINUE

- On Pentium 4, L3 BLAS squeezes a lot more out of

1 proc

Intel Pentium 4 1.7 GHz N 500 Rate of Execution

Linpack variant (L1B) 238 Mflop/s

Level 2 BLAS Variant 312 Mflop/s

Level 3 BLAS Variant 1262 Mflop/s

LAPACK Contents

- Combines algorithms from LINPACK and EISPACK into

a single package. User interface similar to

LINPACK. - Built on the L 1, 2 and 3 BLAS, for high

performance (manufacturers optimize BLAS) - LAPACK does not provide routines for structured

problems or general sparse matrices (i.e sparse

storage formats such as compressed-row, -column,

-diagonal, skyline ...).

LAPACK Ongoing Work

- Add functionality
- updating/downdating, divide and conquer least

squares,bidiagonal bisection, bidiagonal inverse

iteration, band SVD, Jacobi methods, ... - Move to new generation of high performance

machines - IBM SPs, CRAY T3E, SGI Origin, clusters of

workstations - New challenges
- New languages FORTRAN 90, HP FORTRAN, ...
- (CMMD, MPL, NX ...)
- many flavors of message passing, need standard

(PVM, MPI) BLACS - Highly varying ratio
- Many ways to layout data,
- Fastest parallel algorithm sometimes less stable

numerically.

History of Block Partitioned Algorithms

- Early algorithms involved use of small main

memory using tapes as secondary storage. - Recent work centers on use of vector registers,

level 1 and 2 cache, main memory, and out of

core memory.

Blocked Partitioned Algorithms

- Orthogonal reduction to
- (upper) Hessenberg form
- symmetric tridiagonal form
- bidiagonal form
- Block QR iteration for nonsymmetric eigenvalue

problems

- LU Factorization
- Cholesky factorization
- Symmetric indefinite factorization
- Matrix inversion
- QR, QL, RQ, LQ factorizations
- Form Q or QTC