
Parallelism and Locality in Matrix Computations
www.cs.berkeley.edu/demmel/cs267_Spr09
Dense Linear Algebra: Optimizing Sequential Matrix Multiply

- James Demmel
- EECS and Math Departments, UC Berkeley
- demmel@cs.berkeley.edu

Outline of Dense Linear Algebra

- A brief history of linear algebra software
- Optimizing Sequential Matrix Multiplication (MatMul)
- Optimizing Parallel MatMul
- Optimizing Gaussian Elimination
- Lower bounds on communication (moving data)

A brief history of (Dense) Linear Algebra software (1/6)

- In the beginning was the do-loop
- Libraries like EISPACK (for eigenvalue problems)
- Then the BLAS (1) were invented (1973-1977)
- Standard library of 15 operations (mostly) on vectors
- AXPY (y = α·x + y), dot product, scale (x = α·x), etc.
- Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
- Goals
- Common pattern to ease programming, readability, self-documentation
- Robustness, via careful coding (avoiding over/underflow)
- Portability and efficiency via machine-specific implementations
- Why "BLAS 1"? They do O(n¹) ops on O(n¹) data
- Used in libraries like LINPACK (for linear systems)
- Source of the name of the LINPACK Benchmark (not the code!)

A brief history of (Dense) Linear Algebra software (2/6)

- But the BLAS-1 weren't enough
- Consider AXPY (y = α·x + y): 2n flops on 3n read/writes
- Computational intensity = #flops / #mem_refs = (2n)/(3n) = 2/3
- Too low to run near peak speed (time for mem_refs dominates)
- Hard to vectorize ("SIMDize") on supercomputers of the day (1980s)
- So the BLAS-2 were invented (1984-1986)
- Standard library of 25 operations (mostly) on matrix/vector pairs
- GEMV: y = α·A·x + β·y, GER: A = A + α·x·yᵀ, TRSV: y = T⁻¹·x
- Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
- Why "BLAS 2"? They do O(n²) ops on O(n²) data
- So computational intensity still just ≈ (2n²)/(n²) = 2
- OK for vector machines, but not for machines with caches
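To make the flop/memory-reference count concrete, here is a minimal C sketch of AXPY (the loop is illustrative, not the reference BLAS source):

```c
/* y = alpha*x + y: 2n flops (one multiply, one add per i)
   against roughly 3n memory references (read x[i], read y[i], write y[i]),
   so the computational intensity is q = 2n/3n = 2/3. */
void axpy(int n, double alpha, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```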

A brief history of (Dense) Linear Algebra software (3/6)

- The next step: BLAS-3 (1987-1988)
- Standard library of 9 operations (mostly) on matrix/matrix pairs
- GEMM: C = α·A·B + β·C, SYRK: C = α·A·Aᵀ + β·C, TRSM: C = T⁻¹·B
- Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
- Why "BLAS 3"? They do O(n³) ops on O(n²) data
- So computational intensity = (2n³)/(4n²) = n/2, big at last!
- Tuning opportunities: machines with caches, other memory hierarchy levels
- How much BLAS1/2/3 code so far (all at www.netlib.org/blas)?
- Source: 142 routines, 31K LOC; Testing: 28K LOC
- Reference (unoptimized) implementation only
- Ex: 3 nested loops for GEMM
- Lots more optimized code
- Most computer vendors provide their own optimized versions
- Motivates automatic tuning of the BLAS (details later)
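In practice, optimized BLAS-3 is reached through the standard interface. A hedged sketch of calling DGEMM through CBLAS, assuming a CBLAS implementation (the Netlib reference or a vendor library) is installed and linked; the wrapper name is illustrative:

```c
#include <cblas.h>

/* Compute C = alpha*A*B + beta*C for column-major n-by-n matrices,
   letting the (possibly vendor-tuned) BLAS-3 routine do the blocking. */
void gemm_example(int n, const double *A, const double *B, double *C) {
    double alpha = 1.0, beta = 1.0;
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,        /* M, N, K */
                alpha, A, n,    /* A and its leading dimension */
                B, n,           /* B and its leading dimension */
                beta, C, n);    /* C and its leading dimension */
}
```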

A brief history of (Dense) Linear Algebra software (4/6)

- LAPACK: Linear Algebra PACKage - uses BLAS-3 (1989 - now)
- Ex: The obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows - BLAS-1
- How do we reorganize GE to use BLAS-3? (details later)
- Contents of LAPACK (summary)
- Algorithms we can turn into (nearly) 100% BLAS-3
- Linear Systems: solve Ax = b for x
- Least Squares: choose x to minimize ||r||₂ = (Σ rᵢ²)^(1/2) where r = Ax - b
- Algorithms we can only make up to ~50% BLAS-3 (so far)
- Eigenproblems: find λ and x where Ax = λx
- Singular Value Decomposition (SVD): AᵀAx = σ²x
- Error bounds for everything
- Lots of variants depending on A's structure (banded, A = Aᵀ, etc.)
- How much code? (Release 3.2, Nov 2008) (www.netlib.org/lapack)
- Source: 1582 routines, 490K LOC; Testing: 352K LOC
- Ongoing development (at UCB, UTK and elsewhere)

What could go into a linear algebra library?

For all linear algebra problems

For all matrix/problem structures

For all data types

For all architectures and networks

For all programming interfaces

Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc.)

Need to prioritize, automate!

A brief history of (Dense) Linear Algebra software (5/6)

- Is LAPACK parallel?
- Only if the BLAS are parallel (possible in shared memory)
- ScaLAPACK: Scalable LAPACK (1995 - now)
- For distributed memory - uses MPI
- More complex data structures and algorithms than LAPACK
- Only a (small) subset of LAPACK's functionality available
- Details later (class projects!)
- All at www.netlib.org/scalapack

Success Stories for Sca/LAPACK

- Widely used
- Adopted by Mathworks, Cray, Fujitsu, HP, IBM, IMSL, Intel, NAG, NEC, SGI, ...
- >100M web hits @ Netlib in 2009 (56M in 2006), incl. CLAPACK, LAPACK95
- New science discovered through the solution of dense matrix systems
- Nature article on the flat universe used ScaLAPACK
- Other articles in Physics Review B that also use it
- 1998 Gordon Bell Prize
- www.nersc.gov/news/reports/newNERSCresults050703.pdf

Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000).

ScaLAPACK

A brief history/future of (Dense) Linear Algebra software (6/6)

- Versions for GPUs - some algorithms change significantly
- PLASMA - extensions to multicore
- Communication-minimizing Linear Algebra
- Lower bounds on data movement for all direct linear algebra
- LU, QR, eig, svd, ...; dense and sparse; sequential and parallel
- New algorithms that attain these lower bounds
- Can one software infrastructure accommodate all algorithms and platforms of current (and future) interest?
- How much code generation and tuning can we automate?
- MAGMA - planned PLASMA extension to multicore/GPU/heterogeneous
- FLAME (www.cs.utexas.edu/users/flame/)
- Related projects
- BLAST Forum (www.netlib.org/blas/blast-forum) - BLAS extensions
- Other languages, some new functions, sparse matrices, extra precision
- Extra precision BLAS in latest LAPACK release

Organizing Linear Algebra in books

www.netlib.org/lapack

www.netlib.org/scalapack

gams.nist.gov

www.cs.utk.edu/dongarra/etemplates

www.netlib.org/templates

Outline of Dense Linear Algebra

- A brief history of linear algebra software
- Optimizing Sequential Matrix Multiplication (MatMul)
- Simple performance models to understand performance
- Warm-up: Matrix-vector multiplication
- Reducing memory traffic for MatMul
- Lower bound on memory traffic for MatMul
- Other variants on MatMul
- Automatic vs Hand-Tuning of Algorithms
- Optimizing Parallel MatMul
- Optimizing Gaussian Elimination
- Lower bounds on communication (moving data)

Why Matrix Multiplication?

- An important kernel in many problems
- Used in many linear algebra algorithms
- Closely related to other algorithms, e.g., transitive closure on a graph using Floyd-Warshall
- Optimization ideas can be used in other problems
- The best case for optimization payoffs
- The most-studied algorithm in high performance computing

Matrix-multiply, optimized several ways

Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops

Note on Matrix Storage

- A matrix is (usually) a 2-D array of elements, but memory addresses are 1-D
- Conventions for matrix layout
- by column, or "column major" (Fortran default): A(i,j) at A + i + j*n
- by row, or "row major" (C default): A(i,j) at A + i*n + j
- recursive (later)
- Column major (for now)

Column major matrix in memory (figure; numbers are the memory address of each element):

Column major:        Row major:
 0  5 10 15           0  1  2  3
 1  6 11 16           4  5  6  7
 2  7 12 17           8  9 10 11
 3  8 13 18          12 13 14 15
 4  9 14 19          16 17 18 19

Blue row of matrix is stored in red cachelines.

Figure source: Larry Carter, UCSD
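A minimal C sketch of the two addressing conventions (the macro and function names are illustrative only):

```c
/* Column-major indexing for an m-by-n matrix stored in one flat array:
   element (i,j) (0-based) lives at offset i + j*m. */
#define IDX(i, j, m) ((i) + (j) * (m))

double get_elem(const double *A, int m, int i, int j) {
    return A[IDX(i, j, m)];   /* row-major storage would use i*n + j instead */
}
```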

Using a Simple Model of Memory to Optimize

- Assume just 2 levels in the memory hierarchy, fast and slow
- All data initially in slow memory
- m = number of memory elements (words) moved between fast and slow memory
- tm = time per slow memory operation
- f = number of arithmetic operations
- tf = time per arithmetic operation; tf << tm
- q = f / m = average number of flops per slow memory access
- Minimum possible time = f·tf, when all data is in fast memory
- Actual time
- f·tf + m·tm = f·tf·(1 + (tm/tf)·(1/q))
- Larger q means time closer to the minimum f·tf
- q ≥ tm/tf needed to get at least half of peak speed
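A few lines of C make the model concrete; the timings and counts plugged in below are placeholders, not measurements:

```c
#include <stdio.h>

/* Predicted time from the two-level model: f*tf + m*tm = f*tf*(1 + (tm/tf)/q). */
double model_time(double f, double m, double tf, double tm) {
    return f * tf + m * tm;
}

int main(void) {
    double tf = 1e-9, tm = 100e-9;   /* hypothetical: 1 ns per flop, 100 ns per memory op */
    double f = 2e6, m = 1e6;         /* hypothetical counts, so q = f/m = 2 */
    printf("q = %g, predicted time = %g s (minimum possible %g s)\n",
           f / m, model_time(f, m, tf, tm), f * tf);
    return 0;
}
```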

Warm up: Matrix-vector multiplication

- implements y = y + A·x
- for i = 1:n
- for j = 1:n
- y(i) = y(i) + A(i,j)·x(j)

[Figure: row A(i,:) times vector x(:) updates y(i)]

Warm up: Matrix-vector multiplication

- read x(1:n) into fast memory
- read y(1:n) into fast memory
- for i = 1:n
- read row i of A into fast memory
- for j = 1:n
- y(i) = y(i) + A(i,j)·x(j)
- write y(1:n) back to slow memory

- m = number of slow memory refs = 3n + n²
- f = number of arithmetic operations = 2n²
- q = f / m ≈ 2
- Matrix-vector multiplication is limited by slow memory speed
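A minimal C sketch of the loop above (column-major A, 0-based indices; the function name is illustrative):

```c
/* y = y + A*x for a column-major n-by-n matrix A.
   Every element of A is read exactly once (n^2 reads) while x and y are
   reused, so q = 2n^2 / (n^2 + 3n) is about 2 for large n. */
void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i] += A[i + j * n] * x[j];
}
```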

Modeling Matrix-Vector Multiplication

- Compute time for an n x n = 1000 x 1000 matrix
- Time
- = f·tf + m·tm = f·tf·(1 + (tm/tf)·(1/q))
- = 2n²·tf·(1 + (tm/tf)·(1/2))
- For tf and tm, use data from R. Vuduc's PhD (pp. 351-3)
- http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf
- For tm, use minimum memory latency / words per cache line

[Table omitted: per-machine clock rates, peak rates, memory latencies, and the resulting machine balance tm/tf; q must be at least this for half of peak speed, but here q = 2]

Validating the Model

- How well does the model predict actual performance?
- Actual DGEMV: the most highly optimized code for the platform
- Model is sufficient to compare across machines
- But it under-predicts on the most recent ones, due to the latency estimate

Naïve Matrix Multiply

- implements C = C + A·B
- for i = 1 to n
- for j = 1 to n
- for k = 1 to n
- C(i,j) = C(i,j) + A(i,k) · B(k,j)

Algorithm has 2n³ = O(n³) flops and operates on 3n² words of memory, so q is potentially as large as 2n³ / 3n² = O(n)

[Figure: C(i,j) updated by row A(i,:) times column B(:,j)]
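A minimal C version of the naïve triple loop (column-major, 0-based; illustrative only):

```c
/* C = C + A*B for column-major n-by-n matrices.
   2n^3 flops, but each column of B is re-read n times, so the achieved
   computational intensity stays near 2 once the matrices outgrow cache. */
void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i + j * n] += A[i + k * n] * B[k + j * n];
}
```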

Naïve Matrix Multiply

- implements C = C + A·B
- for i = 1 to n
- read row i of A into fast memory
- for j = 1 to n
- read C(i,j) into fast memory
- read column j of B into fast memory
- for k = 1 to n
- C(i,j) = C(i,j) + A(i,k) · B(k,j)
- write C(i,j) back to slow memory

[Figure: C(i,j) updated by row A(i,:) times column B(:,j)]

Naïve Matrix Multiply

- Number of slow memory references for unblocked matrix multiply
- m = n³ (read each column of B n times)
- + n² (read each row of A once)
- + 2n² (read and write each element of C once)
- = n³ + 3n²
- So q = f / m = 2n³ / (n³ + 3n²)
- ≈ 2 for large n: no improvement over matrix-vector multiply
- The inner two loops are just a matrix-vector multiply, of row i of A times B
- Similar for any other order of the 3 loops

[Figure: C(i,j) updated by row A(i,:) times column B(:,j)]

Matrix-multiply, optimized several ways

Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops

Naïve Matrix Multiply on IBM RS/6000

Size 12,000 would take 1095 years; size 2000 took 5 days

T = N^4.7

O(N³) performance would have constant cycles/flop; performance here looks like O(N^4.7)

Slide source: Larry Carter, UCSD

Naïve Matrix Multiply on RS/6000

Page miss every iteration

TLB miss every iteration

Cache miss every 16 iterations

Page miss every 512 iterations

Slide source: Larry Carter, UCSD

From Naïve to Blocked (or Tiled) Matrix Multiply

- C(i,j) = C(i,j) + A(i,1)·B(1,j) + A(i,2)·B(2,j) + ... + A(i,n)·B(n,j)
- = C(i,j) + A(i,1:n)·B(1:n,j)
- True if C(i,j) is a scalar (1x1 submatrix)
- A(i,1:n) is a row of A and B(1:n,j) is a column of B
- Operation is a dot product
- Also true if C(i,j) is a b x b submatrix
- A(i,1:n) is a b x n submatrix of A and B(1:n,j) is an n x b submatrix of B
- C(i,j) = C(i,j) + A(i,1)·B(1,j) + A(i,2)·B(2,j) + ... + A(i,N)·B(N,j), where
- All factors are b x b subblocks
- N = n/b = number of b x b blocks

Blocked (Tiled) Matrix Multiply

- Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size
- for i = 1 to N
- for j = 1 to N
- read block C(i,j) into fast memory
- for k = 1 to N
- read block A(i,k) into fast memory
- read block B(k,j) into fast memory
- C(i,j) = C(i,j) + A(i,k) · B(k,j)   {do a matrix multiply on blocks}
- write block C(i,j) back to slow memory

[Figure: block C(i,j) updated by block row A(i,:) times block column B(:,j)]
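A hedged C sketch of the blocked loop (column-major, 0-based; it leaves the block loads to the cache rather than copying blocks explicitly, and the names are illustrative):

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Blocked C = C + A*B for column-major n-by-n matrices with block size b.
   Each b-by-b block of A and B is brought into cache once per block-level
   iteration, so roughly 2n^3/b + 2n^2 words move and q is about b.
   MIN handles n not divisible by b; the best b is machine-dependent. */
void matmul_blocked(int n, int b, const double *A, const double *B, double *C) {
    for (int i0 = 0; i0 < n; i0 += b)
        for (int j0 = 0; j0 < n; j0 += b)
            for (int k0 = 0; k0 < n; k0 += b)
                /* multiply the (i0,k0) block of A by the (k0,j0) block of B */
                for (int j = j0; j < MIN(j0 + b, n); j++)
                    for (int k = k0; k < MIN(k0 + b, n); k++)
                        for (int i = i0; i < MIN(i0 + b, n); i++)
                            C[i + j * n] += A[i + k * n] * B[k + j * n];
}
```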

Blocked (Tiled) Matrix Multiply

- Recall
- m is the amount of memory traffic between slow and fast memory
- the matrix has n x n elements, and N x N blocks, each of size b x b
- f is the number of floating point operations, 2n³ for this problem
- q = f / m is our measure of algorithm efficiency in the memory system
- So:
- m = N·n² (read each block of B N³ times: N³·b² = N³·(n/N)² = N·n²)
- + N·n² (read each block of A N³ times)
- + 2n² (read and write each block of C once)
- = (2N + 2)·n²
- So computational intensity q = f / m = 2n³ / ((2N + 2)·n²) ≈ n/N = b for large n
- So we can improve performance by increasing the block size b
- Can be much faster than matrix-vector multiply (q = 2)

How large can we make the computational intensity of matmul?

- The blocked algorithm has computational intensity q ≈ b
- The larger the block size, the more efficient the algorithm will be
- Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large
- Assume your fast memory has size Mfast
- 3b² ≤ Mfast, so q ≈ b ≤ (Mfast/3)^(1/2)

- To build a machine to run matrix multiply at 1/2 the peak arithmetic speed of the machine, we need a fast memory of size
- Mfast ≥ 3b² ≈ 3q² = 3(tm/tf)²
- This size is reasonable for L1 cache, but not for register sets
- Note: the analysis assumes it is possible to schedule the instructions perfectly
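As a worked example under an assumed cache size (the 32 KB figure is an illustration, not a measurement): a 32 KB L1 cache holds Mfast = 4096 double-precision words, so 3b² ≤ 4096 gives b ≤ (4096/3)^(1/2) ≈ 37, and hence q ≈ 37 flops per word moved, far above the q ≈ 2 of the naïve code, provided the machine balance tm/tf is not much larger than 37.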

Limits to Optimizing Matrix Multiply

- The blocked algorithm changes the order in which values are accumulated into each C(i,j), by applying commutativity and associativity
- Get slightly different answers than the naïve code, because of roundoff - OK
- We just showed that the blocked algorithm has computational intensity
- q ≈ b ≤ (Mfast/3)^(1/2)
- Theorem (Hong & Kung, 1981): Any reorganization of this algorithm (that uses only commutativity/associativity) is limited to q = O((Mfast)^(1/2))
- Said another way, #words moved between fast and slow memory = Ω(n³ / Mfast^(1/2))
- Proof of a more general result later
- Can get a lower bound on the latency cost (#messages sent) by dividing the above lower bound by the maximum message size = Mfast
- #messages for sequential MatMul = Ω(n³ / Mfast^(3/2))
- Ex: disk accesses

BLAS speeds on an IBM RS6000/590

Peak speed = 266 Mflops

[Figure: performance curves for Peak, BLAS 3, BLAS 2, BLAS 1]

BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n vectors)

Dense Linear Algebra: BLAS2 vs. BLAS3

- BLAS2 and BLAS3 have very different computational intensity, and therefore different performance

Data source: Jack Dongarra

Recursion: Cache Oblivious Algorithms

- The tiled algorithm requires finding a good block size
- Machine dependent
- What if there are multiple levels of cache? Need to block the b x b matrix multiply in the innermost loop
- 1 level of memory → 3 nested loops (naïve algorithm)
- 2 levels of memory → 6 nested loops
- 3 levels of memory → 9 nested loops
- Cache Oblivious Algorithms offer an alternative
- Treat the n x n matrix multiply as a set of smaller problems
- Eventually, these will fit in cache
- Will minimize #words moved between every level of the memory hierarchy (between L1 and L2 cache, L2 and L3, L3 and main memory, etc.), at least asymptotically

Recursive Matrix Multiplication (RMM) (1/2)

- For simplicity: square matrices with n = 2^m
- C = A · B, written as a 2x2 block matrix product
- True when each A(i,j), etc., is 1x1 or n/2 x n/2

func C = RMM (A, B, n)
  if n = 1, C = A · B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

Recursive Matrix Multiplication (2/2)

func C = RMM (A, B, n)
  if n = 1, C = A · B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8·A(n/2) + 4·(n/2)² if n > 1, else 1
     = 2n³ ... same operations as usual, in a different order

M(n) = # words moved between fast and slow memory by RMM(·, ·, n)
     = 8·M(n/2) + 4·(n/2)² if 3n² > Mfast, else 3n²
     = O(n³ / (Mfast)^(1/2) + n²) ... same as blocked matmul
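A hedged C sketch of the recursion, assuming column-major n-by-n matrices with leading dimension ld and n a power of 2; the CUTOFF base-case size is an illustrative choice, and a tuned micro-kernel would replace the naïve loops:

```c
#define CUTOFF 32

/* Cache-oblivious recursive C = C + A*B. */
static void rmm(int n, int ld, const double *A, const double *B, double *C) {
    if (n <= CUTOFF) {                       /* base case: naïve multiply */
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                for (int i = 0; i < n; i++)
                    C[i + j * ld] += A[i + k * ld] * B[k + j * ld];
        return;
    }
    int h = n / 2;
    const double *A11 = A,          *A21 = A + h,
                 *A12 = A + h * ld, *A22 = A + h + h * ld;
    const double *B11 = B,          *B21 = B + h,
                 *B12 = B + h * ld, *B22 = B + h + h * ld;
    double *C11 = C,          *C21 = C + h,
           *C12 = C + h * ld, *C22 = C + h + h * ld;

    rmm(h, ld, A11, B11, C11); rmm(h, ld, A12, B21, C11);  /* C11 += A11*B11 + A12*B21 */
    rmm(h, ld, A11, B12, C12); rmm(h, ld, A12, B22, C12);  /* C12 += A11*B12 + A12*B22 */
    rmm(h, ld, A21, B11, C21); rmm(h, ld, A22, B21, C21);  /* C21 += A21*B11 + A22*B21 */
    rmm(h, ld, A21, B12, C22); rmm(h, ld, A22, B22, C22);  /* C22 += A21*B12 + A22*B22 */
}
```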

Recursion: Cache Oblivious Algorithms

- Recursion for general A (n x m) · B (m x p)
- Case 1: m > max{n, p}: split A horizontally
- Case 2: n > max{m, p}: split A vertically and B horizontally
- Case 3: p > max{m, n}: split B vertically
- Attains the lower bound in the O() sense

[Figure: the splits for Case 1 and Case 3]

Experience with Cache-Oblivious Algorithms

- In practice, need to cut off the recursion well before 1x1 blocks
- Call a micro-kernel for small blocks, e.g., 16 x 16
- Implementing a high-performance cache-oblivious code is not easy
- Using a fully recursive approach with a highly optimized recursive micro-kernel, Pingali et al. report that they never got more than 2/3 of peak.
- Issues with the cache-oblivious (recursive) approach
- Recursive micro-kernels yield less performance than iterative ones using the same scheduling techniques
- Pre-fetching is needed to compete with the best code; it is not well understood in the context of cache-oblivious codes

Unpublished work, presented at LACSI 2006

Recursive Data Layouts

- A related idea is to use a recursive structure for the matrix
- Goal: any square submatrix (suitably aligned) should lie in contiguous memory
- Improves locality with a machine-independent data structure
- Can minimize latency across multiple levels of the memory hierarchy
- The figure shows Z-Morton ordering (a space-filling curve)
- Other orderings possible, e.g., C-Morton
- See papers on cache-oblivious algorithms and recursive layouts
- Gustavson, Kagstrom, et al., SIAM Review, 2004

- Advantages
- The recursive layout works well for any cache size
- Disadvantages
- The index calculations to find A[i,j] are expensive
- Implementations switch to column-major for small sizes

Strassen's Matrix Multiply

- The traditional algorithm (with or without tiling) has O(n³) flops
- Strassen discovered an algorithm with an asymptotically lower flop count
- O(n^2.81)
- Consider a 2x2 matrix multiply, normally done with 8 multiplies and 4 adds
- Strassen does it with 7 multiplies and 18 adds

Let M = [m11 m12; m21 m22] = [a11 a12; a21 a22] · [b11 b12; b21 b22]

Let p1 = (a12 - a22) · (b21 + b22)     p5 = a11 · (b12 - b22)
    p2 = (a11 + a22) · (b11 + b22)     p6 = a22 · (b21 - b11)
    p3 = (a11 - a21) · (b11 + b12)     p7 = (a21 + a22) · b11
    p4 = (a11 + a12) · b22

Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7

Extends to n x n by divide-and-conquer
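A tiny C check of the 2x2 formulas above on scalar entries, purely for illustration; a real implementation applies them recursively to n/2 x n/2 blocks:

```c
#include <stdio.h>

int main(void) {
    /* arbitrary 2x2 inputs */
    double a11 = 1, a12 = 2, a21 = 3, a22 = 4;
    double b11 = 5, b12 = 6, b21 = 7, b22 = 8;

    /* Strassen's 7 products */
    double p1 = (a12 - a22) * (b21 + b22);
    double p2 = (a11 + a22) * (b11 + b22);
    double p3 = (a11 - a21) * (b11 + b12);
    double p4 = (a11 + a12) * b22;
    double p5 = a11 * (b12 - b22);
    double p6 = a22 * (b21 - b11);
    double p7 = (a21 + a22) * b11;

    double m11 = p1 + p2 - p4 + p6, m12 = p4 + p5;
    double m21 = p6 + p7,           m22 = p2 - p3 + p5 - p7;

    /* compare against the usual 8-multiply formulas */
    printf("%g %g (expect %g %g)\n", m11, m12, a11*b11 + a12*b21, a11*b12 + a12*b22);
    printf("%g %g (expect %g %g)\n", m21, m22, a21*b11 + a22*b21, a21*b12 + a22*b22);
    return 0;
}
```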

Strassen Matrix Multiplication (2/2)

func C = StrMM (A, B, n)
  if n = 1 (or small enough), C = A · B, else
    P1 = StrMM (A12 - A22, B21 + B22, n/2)
    P2 = StrMM (A11 + A22, B11 + B22, n/2)
    P3 = StrMM (A11 - A21, B11 + B12, n/2)
    P4 = StrMM (A11 + A12, B22, n/2)
    P5 = StrMM (A11, B12 - B22, n/2)
    P6 = StrMM (A22, B21 - B11, n/2)
    P7 = StrMM (A21 + A22, B11, n/2)
    C11 = P1 + P2 - P4 + P6
    C12 = P4 + P5
    C21 = P6 + P7
    C22 = P2 - P3 + P5 - P7
  return

Strassen (continued)

- Asymptotically faster
- Several times faster for large n in practice
- Cross-over point depends on the machine
- Available in some libraries (but see below)
- "Tuning Strassen's Matrix Multiplication for Memory Efficiency", M. S. Thottethodi, S. Chatterjee, and A. Lebeck, in Proceedings of Supercomputing '98
- Caveats
- Needs more memory than the standard algorithm
- Can be a little less accurate because of roundoff error
- Forbidden by the rules of the Top500 list (so not widely used)

Other Fast Matrix Multiplication Algorithms

- Current world's record is O(n^2.376...)
- Coppersmith & Winograd, 1987
- Why does the Hong/Kung theorem not apply?
- Extension is an open problem
- Possibility of an O(n^(2+ε)) algorithm!
- Cohn, Umans, Kleinberg, 2003
- Can show they all can be made numerically stable
- D., Dumitriu, Holtz, Kleinberg, 2007
- Can do the rest of linear algebra (solve Ax = b, least squares, Ax = λx, etc.) as fast, and numerically stably
- D., Dumitriu, Holtz, 2008
- Fast methods (besides Strassen) may need unrealistically large n
- But we can reuse the ideas to minimize communication

Tuning Code in Practice

- Tuning code can be tedious
- Lots of code variations to try besides blocking - details later
- Machine hardware performance is hard to predict
- Compiler behavior is hard to predict
- Response: autotuning
- Let the computer generate a large set of possible code variations, and search them for the fastest ones
- Field started with a CS267 homework assignment in the mid 1990s
- PHiPAC, leading to ATLAS, incorporated in Matlab
- We still use the same assignment
- We (and others) are extending autotuning to other dwarfs / motifs
- Still need to understand how to do it by hand
- Not every code will have an autotuner
- Need to know this if you want to build autotuners

Search Over Block Sizes

- Performance models are useful for high-level algorithms
- Help in developing a blocked algorithm
- Models have not proven very useful for block size selection
- Too complicated to be useful
- See work by Sid Chatterjee for a detailed model
- Too simple to be accurate
- Multiple multidimensional arrays, virtual memory, etc.
- Speed depends on matrix dimensions, details of code, compiler, processor

What the Search Space Can Look Like

[Figure: axes are the number of rows and the number of columns in the register block]

A 2-D slice of a 3-D register-tile search space. The dark blue region was pruned. (Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler)

Tiling Alone Might Not Be Enough

- Naïve and naïvely tiled code on Itanium 2
- Searched all block sizes to find the best, b = 56
- Starting point for the next homework

Example: Select a Matmul Implementation

Example: Support Vector Classification

ATLAS (DGEMM, n = 500)

Source: Jack Dongarra

- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.

Optimizing in Practice

- Tiling for registers
- loop unrolling, use of named "register" variables
- Tiling for multiple levels of cache and TLB
- Exploiting fine-grained parallelism in the processor
- superscalar; pipelining
- Complicated compiler interactions
- Hard to do by hand (but you should try!)
- Automatic optimization is an active research area
- BeBOP: bebop.cs.berkeley.edu; FFTW; Spiral; ...
- Older work
- PHiPAC: www.icsi.berkeley.edu/bilmes/phipac
- in particular tr-98-035.ps.gz
- ATLAS: www.netlib.org/atlas
- "Performance Optimization of Numerically Intensive Codes", by Stefan Goedecker and Adolfy Hoisie, SIAM 2001.

Removing False Dependencies

- Using local variables, reorder operations to remove false dependencies

a[i]   = b[i]   + c;
a[i+1] = b[i+1] * d;

(false read-after-write hazard between a[i] and b[i+1])

float f1 = b[i];
float f2 = b[i+1];
a[i]   = f1 + c;
a[i+1] = f2 * d;

- With some compilers, you can declare a and b unaliased.
- Done via "restrict" pointers, compiler flag, or pragma
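A minimal sketch of the restrict-pointer route (restrict is a C99 keyword; the function and variable names are illustrative):

```c
/* 'restrict' promises the compiler that a and b never alias, so it is free
   to reorder the loads and stores itself, without the manual temporaries. */
void update(float * restrict a, const float * restrict b,
            float c, float d, int n) {
    for (int i = 0; i + 1 < n; i += 2) {
        a[i]     = b[i] + c;
        a[i + 1] = b[i + 1] * d;
    }
}
```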

Exploit Multiple Registers

- Reduce demands on memory bandwidth by pre-loading into local variables

while( ... ) {
  *res++ = filter[0]*signal[0]
         + filter[1]*signal[1]
         + filter[2]*signal[2];
  signal++;
}

float f0 = filter[0];
float f1 = filter[1];
float f2 = filter[2];
while( ... ) {
  *res++ = f0*signal[0]
         + f1*signal[1]
         + f2*signal[2];
  signal++;
}

(also: register float f0;)

Example is a convolution

Minimize Pointer Updates

- Replace pointer updates for strided memory addressing with constant array offsets

f0 = *r8; r8 += 4;
f1 = *r8; r8 += 4;
f2 = *r8; r8 += 4;

f0 = r8[0];
f1 = r8[4];
f2 = r8[8];
r8 += 12;

- Pointer vs. array expression costs may differ.
- Some compilers do a better job of analyzing one than the other

Loop Unrolling

- Expose instruction-level parallelism

float f0 = filter[0], f1 = filter[1], f2 = filter[2];
float s0 = signal[0], s1 = signal[1], s2 = signal[2];
*res = f0*s0 + f1*s1 + f2*s2;
do {
  signal += 3;
  s0 = signal[0];
  res[0] = f0*s1 + f1*s2 + f2*s0;
  s1 = signal[1];
  res[1] = f0*s2 + f1*s0 + f2*s1;
  s2 = signal[2];
  res[2] = f0*s0 + f1*s1 + f2*s2;
  res += 3;
} while( ... );

Expose Independent Operations

- Hide instruction latency
- Use local variables to expose independent operations that can execute in parallel or in a pipelined fashion
- Balance the instruction mix (what functional units are available?)

f1 = f5 * f9;
f2 = f6 + f10;
f3 = f7 * f11;
f4 = f8 + f12;

Copy optimization

- Copy input operands or blocks
- Reduce cache conflicts
- Constant array offsets for fixed-size blocks
- Expose page-level locality
- Alternative: use different data structures from the start (if users are willing)

Original matrix (numbers are addresses, column major):    Reorganized into 2x2 blocks:

 0  4  8 12                                                 0  2  8 10
 1  5  9 13                                                 1  3  9 11
 2  6 10 14                                                 4  6 12 14
 3  7 11 15                                                 5  7 13 15
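A hedged C sketch of the copy optimization: pack a column-major matrix into contiguous b-by-b blocks (blocks ordered down each block column, column major inside each block, matching the 2x2 example above). It assumes n is a multiple of b; the names are illustrative:

```c
/* Copy A (column-major, n-by-n) into Apacked, one contiguous b-by-b block
   at a time, so a blocked matmul can use constant offsets inside each block
   and avoid cache conflicts between blocks. */
void pack_blocks(int n, int b, const double *A, double *Apacked) {
    double *dst = Apacked;
    for (int j0 = 0; j0 < n; j0 += b)          /* block column */
        for (int i0 = 0; i0 < n; i0 += b)      /* block row    */
            for (int j = 0; j < b; j++)        /* inside block */
                for (int i = 0; i < b; i++)
                    *dst++ = A[(i0 + i) + (j0 + j) * n];
}
```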

Summary of Performance Tuning

- Performance programming on uniprocessors requires
- understanding of the memory system: moving data is slower than arithmetic
- understanding of fine-grained parallelism in the processor
- Simple performance models can aid in understanding
- Two ratios are key to efficiency (relative to peak)
- computational intensity of the algorithm
- q = f/m = # floating point operations / # slow memory references
- machine balance in the memory system
- tm/tf = time per slow memory reference / time per floating point operation
- Want q > tm/tf to get half of machine peak
- Blocking (tiling) is a basic approach to increase q
- Techniques apply generally, but the details (e.g., block size) are architecture dependent
- Similar techniques are possible on other data structures and algorithms

Extra slides

More Reading for Today

- Sourcebook Chapter 3 (note that chapters 2 and 3 cover the material of lectures 2 and 3, but not in the same order).
- "Performance Optimization of Numerically Intensive Codes", by Stefan Goedecker and Adolfy Hoisie, SIAM 2001.
- Web pages for reference
- BeBOP Homepage
- ATLAS Homepage
- BLAS (Basic Linear Algebra Subroutines): reference for (unoptimized) implementations of the BLAS, with documentation.
- LAPACK (Linear Algebra PACKage): a standard linear algebra library optimized to use the BLAS effectively on uniprocessors and shared-memory machines (software, documentation and reports).
- ScaLAPACK (Scalable LAPACK): a parallel version of LAPACK for distributed-memory machines (software, documentation and reports).
- "Tuning Strassen's Matrix Multiplication for Memory Efficiency", Mithuna S. Thottethodi, Siddhartha Chatterjee, and Alvin R. Lebeck, in Proceedings of Supercomputing '98, November 1998 (postscript).
- "Recursive Array Layouts and Fast Parallel Matrix Multiplication" by Chatterjee et al., IEEE TPDS, November 2002.

Questions You Should Be Able to Answer

- What is the key to understanding algorithm efficiency in our simple memory model?
- What is the key to understanding machine efficiency in our simple memory model?
- What is tiling?
- Why does blocked matrix multiply reduce the number of memory references?
- What are the BLAS?
- Why does loop unrolling improve uniprocessor performance?

Proof of Communication Lower Bound on C = A·B (1/6)

- Proof from Irony/Tiskin/Toledo (2004)
- We'll need it for the rest of linear algebra
- Think of the instruction stream being executed
- Looks like "... add, load, multiply, store, load, add, ..."
- We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n³ flops, possibly reordered assuming addition is commutative/associative
- It actually isn't associative in floating point, but close enough
- Assume that at most M words can be stored in fast memory
- Outline
- Break the instruction stream into segments, each containing M loads and stores
- Somehow bound the maximum number of adds and multiplies that can be done in each segment; call it F
- So F · #segments ≥ 2n³, and #segments ≥ 2n³ / F
- So #loads + #stores = M · #segments ≥ M · 2n³ / F

Proof of Communication Lower Bound on C = A·B (2/6)

- Given a segment of the instruction stream with M loads and stores, how many adds and multiplies (F) can we do?
- At most 2M entries of C, 2M entries of A, and/or 2M entries of B can be accessed
- Use geometry
- Represent the 2n³ operations by an n x n x n cube
- One n x n face represents A
- each 1 x 1 subsquare represents one A(i,k)
- One n x n face represents B
- each 1 x 1 subsquare represents one B(k,j)
- One n x n face represents C
- each 1 x 1 subsquare represents one C(i,j)
- Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)

Proof of Communication Lower Bound on C = A·B (3/6)

[Figure: the n x n x n cube with axes i, j, k; the C face, A face, and B face, with example entries C(1,1), C(2,3), A(1,3), A(2,1), B(1,3), B(3,1)]

Proof of Communication Lower Bound on C = A·B (4/6)

- Given a segment of the instruction stream with M loads and stores, how many adds and multiplies (F) can we do?
- At most 2M entries of C, and/or of A, and/or of B can be accessed
- Use geometry
- Represent the 2n³ operations by an n x n x n cube
- One n x n face represents A
- each 1 x 1 subsquare represents one A(i,k)
- One n x n face represents B
- each 1 x 1 subsquare represents one B(k,j)
- One n x n face represents C
- each 1 x 1 subsquare represents one C(i,j)
- Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)

- If we have at most 2M "A squares", 2M "B squares", and 2M "C squares" on the faces, how many cubes can we have?

Proof of Communication Lower Bound on C = A·B (5/6)

# cubes in a black box with side lengths x, y and z
= volume of the black box = x·y·z
= (#A's · #B's · #C's)^(1/2) = (xz · zy · yx)^(1/2)

Proof of Communication Lower Bound on C = A·B (6/6)

- Consider one segment of instructions with M loads and stores
- There can be at most 2M entries of A, B, C available in one segment
- Volume of the set of cubes representing possible multiply/adds ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) = F
- #segments ≥ 2n³ / F
- #loads + #stores = M · #segments ≥ M · 2n³ / F = n³ / (2M)^(1/2)

- Parallel case: apply the reasoning to one processor out of P
- #adds and #muls ≥ 2n³ / P (assuming load balanced)
- M = n² / P (each processor gets an equal fraction of each matrix)
- #loads + #stores = #words communicated with other procs ≥ M · (2n³/P) / F = M · (2n³/P) / (2M)^(3/2) = n² / (2P)^(1/2)

Communication Lower Bound for Parallel MatMul

- Apply the reasoning to one processor out of P
- #adds and #muls ≥ 2n³ / P
- At least one processor must do this much work
- M = n² / P
- Assume each processor gets an equal fraction of each matrix
- (Other assumptions possible)
- #loads + #stores = #words communicated with other procs ≥ M · (2n³/P) / F = M · (2n³/P) / (2M)^(3/2) = n² / (2P)^(1/2)

- Possible to communicate less than this
- Need to increase M by having multiple copies of the matrices
- "3D" algorithm
- P^(1/3) copies of each matrix, so M = n² / P^(2/3)
- Lower bound becomes Ω(n² / (2P)^(2/3)), which is O(P^(1/6)) times smaller than before - attainable

Lower bounds on latency (#messages)

- The Hong/Kung and Irony/Tiskin/Toledo theorem is a lower bound on the amount of data communicated
- Bandwidth cost
- Can get a lower bound on the latency cost (number of messages sent) by dividing the above lower bound by the maximum message size = fast memory size M
- #messages for sequential MatMul = Ω(n³ / M^(3/2))
- Ex: disk accesses
- #messages for parallel MatMul = Ω((n² / P^(1/2)) / (n² / P)) = Ω(P^(1/2))

Communication lower bounds for all Linear Algebra

- The Irony/Tiskin/Toledo proof can be extended
- Proof does not depend on doing matrix multiplication
- Out(i,j) = func( In1(i,k), In2(k,j) for k ∈ S, where S is some set )
- Only need Out(i,j), etc., to be a one-to-one map from (i,j) to memory
- Proof does not depend on the inputs being dense
- S can be any set, depending on (i,j)
- Proof does not assume inputs and outputs do not overlap
- E.g., in Gaussian Elimination A = LU, L and U overwrite A
- Applies to LU, Cholesky, solving triangular systems, ...
- Dense or sparse matrices
- Sequential or parallel algorithms
- Lower bound on #words moved = Ω(#flops / M^(1/2))
- Lower bound on #messages = Ω(#flops / M^(3/2))
- Also extends to QR, eigenvalue algorithms, SVD
- New proof: need to account for data that is created in fast memory, used, and discarded without being written to memory
- Same lower bounds

- Review of material on performance of matrix multiplication so far
- More about tuning matrix multiplication, other codes

Using a Simple Model of Memory to Optimize

- Assume just 2 levels in the hierarchy, fast and slow
- All data initially in slow memory
- m = number of memory elements (words) moved between fast and slow memory
- tm = time per slow memory operation
- f = number of arithmetic operations
- tf = time per arithmetic operation; tf << tm
- q = f / m = average number of flops per slow memory access
- Minimum possible time = f·tf, when all data is in fast memory
- Actual time
- f·tf + m·tm = f·tf·(1 + (tm/tf)·(1/q))
- Larger q means time closer to the minimum f·tf
- q ≥ tm/tf needed to get at least half of peak speed
- Goal: reorganize the algorithm (change the order of operations) to maximize q (i.e., minimize m)

Naïve Matrix Multiply - no better than Matrix-vector multiply

- implements C = C + A·B
- for i = 1 to n
- read row i of A into fast memory (n reads, n² total)
- for j = 1 to n
- read C(i,j) into fast memory (1 read, n² total)
- read column j of B into fast memory (n reads, n³ total)
- for k = 1 to n
- C(i,j) = C(i,j) + A(i,k) · B(k,j)   {all data in fast memory}
- write C(i,j) back to slow memory (1 write, n² total)

m = # slow memory accesses = n³ + 3n², so q = f/m = 2n³ / (n³ + 3n²) ≈ 2

From Naïve to Blocked Matrix Multiply

- C(i,j) = C(i,j) + A(i,1)·B(1,j) + A(i,2)·B(2,j) + ... + A(i,n)·B(n,j)
- = C(i,j) + A(i,1:n)·B(1:n,j)
- True if C(i,j) is a scalar (1x1 submatrix)
- A(i,1:n) is a row of A and B(1:n,j) is a column of B
- Operation is a dot product
- Also true if C(i,j) is a b x b submatrix
- A(i,1:n) is a b x n submatrix of A and B(1:n,j) is an n x b submatrix of B
- C(i,j) = C(i,j) + A(i,1)·B(1,j) + A(i,2)·B(2,j) + ... + A(i,N)·B(N,j), where
- All factors are b x b subblocks
- N = n/b = number of b x b blocks

Blocked (Tiled) Matrix Multiply

- Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where
- b = n / N is called the block size
- for i = 1 to N
- for j = 1 to N
- read block C(i,j) into fast memory (b² reads, N²·b² = n² total)
- for k = 1 to N
- read block A(i,k) into fast memory (b² reads, N³·b² = n³/b total)
- read block B(k,j) into fast memory (b² reads, N³·b² = n³/b total)
- C(i,j) = C(i,j) + A(i,k) · B(k,j)   {do a matrix multiply on blocks; all data in fast memory}
- write block C(i,j) back to slow memory (b² writes, N²·b² = n² total)

m = # slow memory accesses = 2n³/b + 2n², so q = f/m ≈ b

Blocked (Tiled) Matrix Multiply (Review)

- m = # slow memory accesses = 2n³/b + 2n²,
- so q = f/m = 2n³ / (2n³/b + 2n²) ≈ b for large n
- To minimize m (= maximize q) we want to make b as large as possible
- 3b² ≤ Mfast, so that all three b x b submatrices C(i,j), A(i,k), B(k,j) fit
- So q ≈ b ≤ (Mfast/3)^(1/2)
- Theorem (Hong/Kung, 1981; Irony/Tiskin/Toledo, 2004): Any reorganization of this algorithm (that uses only associativity and commutativity of addition) is limited to q = O((Mfast)^(1/2))
- Blocked matrix multiplication minimizes the number of slow memory accesses (up to a constant factor)


Current Records for Solving Dense Systems (11/2008)

www.netlib.org, click on Performance Database Server

Gigaflops:

Machine                          n=100      n=1000     Any n     Peak
Roadrunner (LANL)                                      1.1M      1.5M
  (1.1 Petaflops, 130K cores, 2.5 MW, n=2.3M)
NEC SX-8 (fastest in 2005)
  (8 proc, 2 GHz)                           75.1                 128
  (1 proc, 2 GHz)                2.2        15.0                 16
Palm Pilot III                   .00000169
                                 (1.69 Kiloflops)

For all linear algebra problems (Ex: LAPACK Table of Contents)

- Linear Systems
- Least Squares
- Overdetermined, underdetermined
- Unconstrained, constrained, weighted
- Eigenvalues and vectors of Symmetric Matrices
- Standard (Ax = λx), Generalized (Ax = λBx)
- Eigenvalues and vectors of Unsymmetric matrices
- Eigenvalues, Schur form, eigenvectors, invariant subspaces
- Standard, Generalized
- Singular Values and vectors (SVD)
- Standard, Generalized
- Level of detail
- Simple Driver
- Expert Drivers with error bounds, extra precision, other options
- Lower level routines ("apply a certain kind of orthogonal transformation")

For all matrix/problem structures (Ex: LAPACK Table of Contents)

- SB symmetric, banded
- SP symmetric, packed
- ST symmetric, tridiagonal
- SY symmetric
- TB triangular, banded
- TG triangular, pair
- TB triangular, banded
- TP triangular, packed
- TR triangular
- TZ trapezoidal
- UN unitary
- UP unitary packed

- BD bidiagonal
- GB general banded
- GE general
- GG general , pair
- GT tridiagonal
- HB Hermitian banded
- HE Hermitian
- HG upper Hessenberg, pair
- HP Hermitian, packed
- HS upper Hessenberg
- OR (real) orthogonal
- OP (real) orthogonal, packed
- PB positive definite, banded
- PO positive definite
- PP positive definite, packed
- PT positive definite, tridiagonal


Motif/Dwarf: Common Computational Methods (Red Hot → Blue Cool)

What do commercial and CSE applications have in common?

Basic Linear Algebra Subroutines (BLAS)

- Industry-standard interface (evolving)
- www.netlib.org/blas, www.netlib.org/blas/blast-forum
- Vendors and others supply optimized implementations
- History
- BLAS1 (1970s)
- vector operations: dot product, saxpy (y = α·x + y), etc.
- m = 2n, f = 2n, q ≈ 1 or less
- BLAS2 (mid 1980s)
- matrix-vector operations: matrix-vector multiply, etc.
- m = n², f = 2n², q ≈ 2, less overhead
- somewhat faster than BLAS1
- BLAS3 (late 1980s)
- matrix-matrix operations: matrix-matrix multiply, etc.
- m ≤ 3n², f = O(n³), so q = f/m can possibly be as large as n (or Mfast^(1/2)), so BLAS3 is potentially much faster than BLAS2
- Good algorithms use BLAS3 when possible (LAPACK & ScaLAPACK)
- More later in the course

Locality in Other Algorithms

- The performance of any algorithm is limited by q = its computational intensity
- In matrix multiply, we increase q by changing the computation order
- increased temporal locality
- For other algorithms and data structures, even hand-transformations are still an open problem
- Lots of open problems

Simplifying Assumptions

- What simplifying assumptions did we make in this analysis?
- Ignored parallelism between memory and arithmetic within the processor
- Sometimes drop the arithmetic term in this type of analysis
- Assumed fast memory was large enough to hold three vectors
- Reasonable if we are talking about any level of cache
- Not if we are talking about registers (~32 words)
- Assumed the cost of a fast memory access is 0
- Reasonable if we are talking about registers
- Not necessarily if we are talking about cache (1-2 cycles for L1)
- Memory latency is constant
- Could simplify even further by ignoring memory operations in the X and Y vectors
- Mflop rate per element = 2 / (2·tf + tm)

Experiments on Search vs. Modeling

- Study compares search (ATLAS) to optimization selection based on performance models
- Ten architectures
- Model did well in most cases
- Better on UltraSparc
- Worse on Itanium
- Eliminating performance gaps: think globally, search locally
- small performance gaps: local search
- large performance gaps: refine the model
- Substantial gap between ATLAS CGw/S and ATLAS Unleashed on some machines

Source: K. Pingali. Results from IEEE '05 paper by K. Yotov, X. Li, G. Ren, M. Garzarán, D. Padua, K. Pingali, P. Stodghill.