# Parallelism and Locality in Matrix Computations (www.cs.berkeley.edu/~demmel/cs267_Spr09): Dense Linear Algebra: Optimizing Sequential Matrix Multiply


1
Parallelism and Locality in Matrix Computations
www.cs.berkeley.edu/~demmel/cs267_Spr09
Dense Linear Algebra: Optimizing Sequential Matrix Multiply
• James Demmel
• EECS and Math Departments, UC Berkeley
• demmel@cs.berkeley.edu

2
Outline of Dense Linear Algebra
• A brief history of linear algebra software
• Optimizing Sequential Matrix Multiplication (MatMul)
• Optimizing Parallel MatMul
• Optimizing Gaussian Elimination
• Lower bounds on communication (moving data)

3
A brief history of (Dense) Linear Algebra software (1/6)
• In the beginning was the do-loop
• Libraries like EISPACK (for eigenvalue problems)
• Then the BLAS (1) were invented (1973-1977)
• Standard library of 15 operations (mostly) on vectors
• AXPY ( y = α·x + y ), dot product, scale ( x = α·x ), etc.
• Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
• Goals
• Common pattern to ease programming
• Robustness, via careful coding (avoiding over/underflow)
• Portability + efficiency via machine-specific implementations
• Why "BLAS 1"? They do O(n^1) ops on O(n^1) data
• Used in libraries like LINPACK (for linear systems)
• Source of the name of the LINPACK Benchmark (not the code!)

4
A brief history of (Dense) Linear Algebra software (2/6)
• But the BLAS-1 weren't enough
• Consider AXPY ( y = α·x + y ): 2n flops on 3n words
• Computational intensity = flops / mem_refs = (2n)/(3n) = 2/3
• Too low to run near peak speed (time for mem_refs dominates)
• Hard to vectorize ("SIMDize") on supercomputers of the day (1980s)
• So the BLAS-2 were invented (1984-1986)
• Standard library of 25 operations (mostly) on matrix/vector pairs
• GEMV: y = α·A·x + β·y, GER: A = A + α·x·y^T, TRSV: y = T^(-1)·x
• Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
• Why "BLAS 2"? They do O(n^2) ops on O(n^2) data
• So computational intensity still just (2n^2)/(n^2) = 2
• OK for vector machines, but not for machines with caches

5
A brief history of (Dense) Linear Algebra software (3/6)
• The next step: BLAS-3 (1987-1988)
• Standard library of 9 operations (mostly) on matrix/matrix pairs
• GEMM: C = α·A·B + β·C, SYRK: C = α·A·A^T + β·C, TRSM: C = α·T^(-1)·B
• Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
• Why "BLAS 3"? They do O(n^3) ops on O(n^2) data
• So computational intensity = (2n^3)/(4n^2) = n/2, big at last!
• Tuning opportunities: machines with caches, other memory hierarchy levels
• How much BLAS1/2/3 code so far? (all at www.netlib.org/blas)
• Source: 142 routines, 31K LOC; Testing: 28K LOC
• Reference (unoptimized) implementation only
• Ex: 3 nested loops for GEMM
• Lots more optimized code
• Most computer vendors provide their own optimized versions
• Motivates automatic tuning of the BLAS (details later)

6
A brief history of (Dense) Linear Algebra software (4/6)
• LAPACK ("Linear Algebra PACKage") - uses BLAS-3 (1989 - now)
• Ex: the obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows - BLAS-1
• How do we reorganize GE to use BLAS-3? (details later)
• Contents of LAPACK (summary)
• Algorithms we can turn into (nearly) 100% BLAS 3
• Linear Systems: solve Ax = b for x
• Least Squares: choose x to minimize ||r||_2 = sqrt(Σ_i r_i^2) where r = A·x - b
• Algorithms we can only make up to ~50% BLAS 3 (so far)
• Eigenproblems: find λ and x where A·x = λ·x
• Singular Value Decomposition (SVD): A^T·A·x = σ^2·x
• Error bounds for everything
• Lots of variants depending on A's structure (banded, A = A^T, etc.)
• How much code? (Release 3.2, Nov 2008) (www.netlib.org/lapack)
• Source: 1582 routines, 490K LOC; Testing: 352K LOC
• Ongoing development (at UCB, UTK and elsewhere)

7
What could go into a linear algebra library?
For all linear algebra problems
For all matrix/problem structures
For all data types
For all architectures and networks
For all programming interfaces
Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc.)
Need to prioritize, automate!
8
A brief history of (Dense) Linear Algebra software (5/6)
• Is LAPACK parallel?
• Only if the BLAS are parallel (possible in shared memory)
• ScaLAPACK - "Scalable LAPACK" (1995 - now)
• For distributed memory - uses MPI
• More complex data structures and algorithms than LAPACK
• Only a (small) subset of LAPACK's functionality available
• Details later (class projects!)
• All at www.netlib.org/scalapack

9
Success Stories for Sca/LAPACK
• Widely used
• Adopted by Mathworks, Cray, Fujitsu, HP, IBM, IMSL, Intel, NAG, NEC, SGI,
• >100M web hits at Netlib (in 2009; 56M in 2006) (incl. CLAPACK, LAPACK95)
• New science discovered through the solution of dense matrix systems
• Nature article on the flat universe used ScaLAPACK
• Other articles in Physical Review B also use it
• 1998 Gordon Bell Prize
• www.nersc.gov/news/reports/newNERSCresults050703.pdf

Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000). ScaLAPACK
10
A brief history/future of (Dense) Linear Algebra software (6/6)
• Versions for GPUs - some algorithms change significantly
• PLASMA - extensions to multicore
• Communication-minimizing Linear Algebra
• Lower bounds on data movement for all direct linear algebra
• LU, QR, eig, SVD; dense and sparse; sequential and parallel
• New algorithms that attain these lower bounds
• Can one software infrastructure accommodate all algorithms and platforms of current (and future) interest?
• How much code generation and tuning can we automate?
• MAGMA - planned PLASMA extension to multicore/GPU/heterogeneous
• FLAME (www.cs.utexas.edu/users/flame/)
• Related projects
• BLAST Forum (www.netlib.org/blas/blast-forum) - BLAS extensions
• Other languages, some new functions, sparse matrices, extra precision
• Extra-precision BLAS in latest LAPACK release

11
Organizing Linear Algebra in books
www.netlib.org/lapack
www.netlib.org/scalapack
gams.nist.gov
www.cs.utk.edu/dongarra/etemplates
www.netlib.org/templates
12
Outline of Dense Linear Algebra
• A brief history of linear algebra software
• Optimizing Sequential Matrix Multiplication (MatMul)
• Simple performance models to understand performance
• Warm-up: Matrix-vector multiplication
• Reducing memory traffic for MatMul
• Lower bound on memory traffic for MatMul
• Other variants on MatMul
• Automatic vs Hand-Tuning of Algorithms
• Optimizing Parallel MatMul
• Optimizing Gaussian Elimination
• Lower bounds on communication (moving data)

13
Why Matrix Multiplication?
• An important kernel in many problems
• Used in many linear algebra algorithms
• Closely related to other algorithms, e.g., transitive closure on a graph using Floyd-Warshall
• Optimization ideas can be used in other problems
• The best case for optimization payoffs
• The most-studied algorithm in high performance computing

14
Matrix-multiply, optimized several ways
Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops
15
Note on Matrix Storage
• A matrix is (usually) a 2-D array of elements
• Conventions for matrix layout
• by column, or "column major" (Fortran default): A(i,j) at A + i + j*n
• by row, or "row major" (C default): A(i,j) at A + i*n + j
• recursive (later)
• Column major (for now)

(Figure: the same matrix laid out in memory row major and column major; a blue row of the column-major matrix is scattered across red cachelines. Figure source: Larry Carter, UCSD)
16
Using a Simple Model of Memory to Optimize
• Assume just 2 levels in the memory hierarchy, fast and slow
• All data initially in slow memory
• m = number of memory elements (words) moved between fast and slow memory
• t_m = time per slow memory operation
• f = number of arithmetic operations
• t_f = time per arithmetic operation << t_m
• q = f / m = average number of flops per slow memory access
• Minimum possible time = f·t_f when all data in fast memory
• Actual time = f·t_f + m·t_m = f·t_f · (1 + (t_m/t_f)·(1/q))
• Larger q means time closer to minimum f·t_f
• q ≥ t_m/t_f needed to get at least half of peak speed

17
Warm up: Matrix-vector multiplication
• implements y = y + A·x
• for i = 1:n
•   for j = 1:n
•     y(i) = y(i) + A(i,j)·x(j)

(Figure: y(i) is updated by the product of row i of A with the vector x)
18
Warm up: Matrix-vector multiplication
• read x(1:n) into fast memory
• read y(1:n) into fast memory
• for i = 1:n
•   read row i of A into fast memory
•   for j = 1:n
•     y(i) = y(i) + A(i,j)·x(j)
• write y(1:n) back to slow memory
• m = number of slow memory refs = 3n + n^2
• f = number of arithmetic operations = 2n^2
• q = f / m ≈ 2
• Matrix-vector multiplication limited by slow memory speed

19
Modeling Matrix-Vector Multiplication
• Compute time for an n×n = 1000×1000 matrix
• Time = f·t_f + m·t_m = f·t_f · (1 + (t_m/t_f)·(1/q)) = 2n^2 · t_f · (1 + (t_m/t_f)·(1/2))
• For t_f and t_m, using data from R. Vuduc's PhD thesis (pp. 351-3)
• http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf
• For t_m use minimum-memory-latency / words-per-cache-line

(Table: machine balance t_m/t_f per machine; q must be at least this for 1/2 peak speed, but q ≈ 2 for matvec)
20
Validating the Model
• How well does the model predict actual performance?
• Actual DGEMV: most highly optimized code for the platform
• Model sufficient to compare across machines
• But under-predicting on most recent ones due to latency estimate

21
Naïve Matrix Multiply
• implements C = C + A·B
• for i = 1 to n
•   for j = 1 to n
•     for k = 1 to n
•       C(i,j) = C(i,j) + A(i,k)·B(k,j)

Algorithm has 2n^3 = O(n^3) flops and operates on 3n^2 words of memory; q potentially as large as 2n^3 / 3n^2 = O(n)

(Figure: C(i,j) is updated by the product of row i of A and column j of B)

22
Naïve Matrix Multiply
• implements C = C + A·B
• for i = 1 to n
•   read row i of A into fast memory
•   for j = 1 to n
•     read C(i,j) into fast memory
•     read column j of B into fast memory
•     for k = 1 to n
•       C(i,j) = C(i,j) + A(i,k)·B(k,j)
•     write C(i,j) back to slow memory

23
Naïve Matrix Multiply
• Number of slow memory references on unblocked matrix multiply
• m = n^3 (read each column of B n times)
•   + n^2 (read each row of A once)
•   + 2n^2 (read and write each element of C once)
•   = n^3 + 3n^2
• So q = f / m = 2n^3 / (n^3 + 3n^2) ≈ 2 for large n: no improvement over matrix-vector multiply
• Inner two loops are just a matrix-vector multiply, of row i of A times B
• Similar for any other order of the 3 loops

24
Matrix-multiply, optimized several ways
Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops
25
Naïve Matrix Multiply on IBM RS/6000
Size 2000 took 5 days; size 12000 would take 1095 years
T ≈ c·N^4.7
O(N^3) performance would have constant cycles/flop; performance looks like O(N^4.7)
Slide source: Larry Carter, UCSD
26
Naïve Matrix Multiply on RS/6000
Page miss every iteration
TLB miss every iteration
Cache miss every 16 iterations
Page miss every 512 iterations
Slide source: Larry Carter, UCSD
27
From Naïve to Blocked (or Tiled) Matrix Multiply
• C(i,j) = C(i,j) + A(i,1)·B(1,j) + A(i,2)·B(2,j) + ... = C(i,j) + A(i,1:n)·B(1:n,j)
• True if C(i,j) is a scalar (1x1 submatrix)
• A(i,1:n) is a row of A and B(1:n,j) is a column of B
• Operation is a dot product
• Also true if C(i,j) is a b x b submatrix
• A(i,1:n) is a b x n submatrix of A and B(1:n,j) is an n x b submatrix of B
• C(i,j) = C(i,j) + A(i,1)·B(1,j) + A(i,2)·B(2,j) + ... + A(i,N)·B(N,j), where
• All factors are b x b subblocks
• N = n/b = number of b x b blocks
28
Blocked (Tiled) Matrix Multiply
• Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size
• for i = 1 to N
•   for j = 1 to N
•     read block C(i,j) into fast memory
•     for k = 1 to N
•       read block A(i,k) into fast memory
•       read block B(k,j) into fast memory
•       C(i,j) = C(i,j) + A(i,k)·B(k,j)  (do a matrix multiply on blocks)
•     write block C(i,j) back to slow memory
29
Blocked (Tiled) Matrix Multiply
• Recall
• m is the amount of memory traffic between slow and fast memory
• the matrix has n×n elements, and N×N blocks each of size b×b
• f is the number of floating point operations, 2n^3 for this problem
• q = f / m is our measure of algorithm efficiency in the memory system
• So

m = N·n^2 (read each block of B N^3 times: N^3·b^2 = N^3·(n/N)^2 = N·n^2)
  + N·n^2 (read each block of A N^3 times)
  + 2n^2 (read and write each block of C once)
  = (2N + 2)·n^2
So computational intensity q = f / m = 2n^3 / ((2N + 2)·n^2) ≈ n/N = b for large n
So we can improve performance by increasing the blocksize b
Can be much faster than matrix-vector multiply (q = 2)
30
How large can we make the computational intensity of matmul?
• The blocked algorithm has computational intensity q ≈ b
• The larger the block size, the more efficient our algorithm will be
• Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large
• Assume your fast memory has size M_fast
• 3b^2 ≤ M_fast, so q ≈ b ≤ (M_fast/3)^(1/2)
• To build a machine to run matrix multiply at 1/2 the peak arithmetic speed of the machine, we need fast memory of size
• M_fast ≥ 3b^2 ≈ 3q^2 = 3(t_m/t_f)^2
• This size is reasonable for L1 cache, but not for register sets
• Note: the analysis assumes it is possible to schedule the instructions perfectly

31
Limits to Optimizing Matrix Multiply
• The blocked algorithm changes the order in which values are accumulated into each C(i,j), by applying commutativity and associativity
• Get slightly different answers from naïve code, because of roundoff - OK
• We just showed that the blocked algorithm has computational intensity q ≈ b ≤ (M_fast/3)^(1/2)
• Theorem (Hong & Kung, 1981): Any reorganization of this algorithm (that uses only commutativity/associativity) is limited to q = O( (M_fast)^(1/2) )
• Said another way, #words moved between fast/slow memory = Ω( n^3 / (M_fast)^(1/2) )
• Proof of a more general result later
• Can get a lower bound on the latency cost (#messages sent) by dividing the above lower bound by the maximum message size = M_fast
• #messages for sequential MatMul = Ω( n^3 / (M_fast)^(3/2) )
• Ex: disk accesses
32
BLAS speeds on an IBM RS6000/590
Peak speed = 266 Mflops
(Figure: performance of BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n-vectors), against the peak line)
33
Dense Linear Algebra: BLAS2 vs. BLAS3
• BLAS2 and BLAS3 have very different computational intensity, and therefore different performance

Data source: Jack Dongarra
34
Recursion: Cache Oblivious Algorithms
• The tiled algorithm requires finding a good block size
• Machine dependent
• What if there are multiple levels of cache? Need to block the b x b matrix multiply in the innermost loop
• 1 level of memory ⇒ 3 nested loops (naïve algorithm)
• 2 levels of memory ⇒ 6 nested loops
• 3 levels of memory ⇒ 9 nested loops
• Cache Oblivious Algorithms offer an alternative
• Treat the n×n matrix multiply as a set of smaller problems
• Eventually, these will fit in cache
• Will minimize #words moved between every level of the memory hierarchy (between L1 and L2 cache, L2 and L3, L3 and main memory, etc.), at least asymptotically

35
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = A · B, viewed as 2x2 block matrices
• True when each A_ij etc. is 1x1 or n/2 x n/2

func C = RMM (A, B, n)
  if n = 1, C = A · B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return
36
Recursive Matrix Multiplication (2/2)
func C = RMM (A, B, n)
  if n = 1, C = A · B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM( . , . , n )
     = 8 · A(n/2) + 4·(n/2)^2 if n > 1, else 1
     = 2n^3 ... same operations as usual, in a different order
M(n) = # words moved between fast and slow memory by RMM( . , . , n )
     = 8 · M(n/2) + 4·(n/2)^2 if 3n^2 > M_fast, else 3n^2
     = O( n^3 / (M_fast)^(1/2) + n^2 ) ... same as blocked matmul
37
Recursion: Cache Oblivious Algorithms
• Recursion for general A (n×m) times B (m×p)
• Case 1: m > max{n,p}: split A horizontally
• Case 2: n > max{m,p}: split A vertically and B horizontally
• Case 3: p > max{m,n}: split B vertically
• Attains the lower bound in the O() sense

(Figure: the splits for Case 1 and Case 3)
38
Experience with Cache-Oblivious Algorithms
• In practice, need to cut off the recursion well before 1x1 blocks
• Call a "micro-kernel" for small blocks, e.g. 16 x 16
• Implementing a high-performance cache-oblivious code is not easy
• Using a fully recursive approach with a highly optimized recursive micro-kernel, Pingali et al report that they never got more than 2/3 of peak
• Issues with the cache-oblivious (recursive) approach
• Recursive micro-kernels yield less performance than iterative ones using the same scheduling techniques
• Pre-fetching is needed to compete with the best code: not well understood in the context of cache-oblivious codes

Unpublished work, presented at LACSI 2006
39
Recursive Data Layouts
• A related idea is to use a recursive structure for the matrix itself
• Goal: any square submatrix (suitably aligned) should lie in contiguous memory
• Improves locality with a machine-independent data structure
• Can minimize latency with multiple levels of memory hierarchy
• The figure shows Z-Morton ordering (a space-filling curve)
• Other orderings possible, e.g. C-Morton
• See papers on cache-oblivious algorithms and recursive layouts
• Gustavson, Kagstrom, et al, SIAM Review, 2004
• Pro: the recursive layout works well for any cache size
• Con: the index calculations to find A[i,j] are expensive
• Implementations switch to column-major for small sizes

40
Strassen's Matrix Multiply
• The traditional algorithm (with or without tiling) has O(n^3) flops
• Strassen discovered an algorithm with asymptotically lower flops: O(n^2.81)
• Consider a 2x2 matrix multiply: it normally takes 8 multiplies, but Strassen does it with 7 multiplies and 18 adds

Let M = [m11 m12] = [a11 a12] · [b11 b12]
        [m21 m22]   [a21 a22]   [b21 b22]
Let p1 = (a12 - a22) · (b21 + b22)    p5 = a11 · (b12 - b22)
    p2 = (a11 + a22) · (b11 + b22)    p6 = a22 · (b21 - b11)
    p3 = (a11 - a21) · (b11 + b12)    p7 = (a21 + a22) · b11
    p4 = (a11 + a12) · b22
Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7
Extends to n×n by divide-and-conquer
41
Strassen Matrix Multiplication (2/2)
func C = StrMM (A, B, n)
  if n = 1 (or small enough), C = A · B, else
    P1 = StrMM (A12 - A22, B21 + B22, n/2)
    P2 = StrMM (A11 + A22, B11 + B22, n/2)
    P3 = StrMM (A11 - A21, B11 + B12, n/2)
    P4 = StrMM (A11 + A12, B22, n/2)
    P5 = StrMM (A11, B12 - B22, n/2)
    P6 = StrMM (A22, B21 - B11, n/2)
    P7 = StrMM (A21 + A22, B11, n/2)
    C11 = P1 + P2 - P4 + P6,  C12 = P4 + P5
    C22 = P2 - P3 + P5 - P7,  C21 = P6 + P7
  return
42
Strassen (continued)
• Asymptotically faster
• Several times faster for large n in practice
• Cross-over point depends on the machine
• Available in some libraries (but see below)
• "Tuning Strassen's Matrix Multiplication for Memory Efficiency", M. S. Thottethodi, S. Chatterjee, and A. Lebeck, in Proceedings of Supercomputing '98
• Caveats
• Needs more memory than the standard algorithm
• Can be a little less accurate because of roundoff error
• Forbidden by the rules of the Top500 list (so not widely used)

43
Other Fast Matrix Multiplication Algorithms
• Current world's record is O(n^2.376...)
• Why does the Hong/Kung theorem not apply? Extension is an open problem
• Possibility of an O(n^(2+ε)) algorithm!
• Cohn, Umans, Kleinberg, 2003
• Can show they all can be made numerically stable
• Demmel, Dumitriu, Holtz, Kleinberg, 2007
• Can do the rest of linear algebra (solve Ax = b, least squares, Ax = λx, etc.) as fast, and numerically stably
• Demmel, Dumitriu, Holtz, 2008
• Fast methods (besides Strassen) may need unrealistically large n
• But we can reuse the ideas to minimize communication

44
Tuning Code in Practice
• Tuning code can be tedious
• Lots of code variations to try besides blocking - details later
• Machine hardware performance hard to predict
• Compiler behavior hard to predict
• Response: Autotuning
• Let the computer generate a large set of possible code variations, and search them for the fastest ones
• Field started with a CS267 homework assignment in the mid 1990s
• PHiPAC, leading to ATLAS, incorporated in Matlab
• We still use the same assignment
• We (and others) are extending autotuning to other dwarfs / motifs
• Still need to understand how to do it by hand
• Not every code will have an autotuner
• Need to know how if you want to build autotuners
45
Search Over Block Sizes
• Performance models are useful for high-level algorithms
• Help in developing a blocked algorithm
• Models have not proven very useful for block size selection
• either too complicated to be useful
• See work by Sid Chatterjee for a detailed model
• or too simple to be accurate
• Multiple multidimensional arrays, virtual memory, etc.
• Speed depends on matrix dimensions, details of code, compiler, processor
46
What the Search Space Can Look Like
(Figure: a 2-D slice of a 3-D register-tile search space; axes are the number of columns and the number of rows in the register block. The dark blue region was pruned. Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler)
47
Tiling Alone Might Not Be Enough
• Naïve and a naïvely tiled code on Itanium 2
• Searched all block sizes to find the best, b = 56
• Starting point for the next homework
48
Example: Select a Matmul Implementation
49
Example: Support Vector Classification
50
ATLAS (DGEMM, n = 500)
Source: Jack Dongarra
• ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
51
Optimizing in Practice
• Tiling for registers
• loop unrolling, use of named "register" variables
• Tiling for multiple levels of cache and TLB
• Exploiting fine-grained parallelism in the processor
• superscalar execution; pipelining
• Complicated compiler interactions
• Hard to do by hand (but you should try!)
• Automatic optimization is an active research area
• BeBOP: bebop.cs.berkeley.edu; FFTW; Spiral; ...
• Older work
• PHiPAC: www.icsi.berkeley.edu/bilmes/phipac
• in particular tr-98-035.ps.gz
• ATLAS: www.netlib.org/atlas
• "Performance Optimization of Numerically Intensive Codes", by Stefan Goedecker and Adolfy Hoisie, SIAM 2001.

52
Removing False Dependencies
• Using local variables, reorder operations to remove false dependencies

```c
a[i]   = b[i]   + c;
a[i+1] = b[i+1] * d;
```

false read-after-write hazard between a[i] and b[i+1]

```c
float f1 = b[i];
float f2 = b[i+1];
a[i]   = f1 + c;
a[i+1] = f2 * d;
```

• With some compilers, you can declare a and b unaliased.
• Done via "restrict" pointers, compiler flag, or pragma

53
Exploit Multiple Registers
• Pre-load reused values into local variables

```c
while( ... ) {
  *res++ = filter[0]*signal[0]
         + filter[1]*signal[1]
         + filter[2]*signal[2];
  signal++;
}
```

```c
float f0 = filter[0];
float f1 = filter[1];
float f2 = filter[2];
while( ... ) {
  *res++ = f0*signal[0]
         + f1*signal[1]
         + f2*signal[2];
  signal++;
}
```

• also: register float f0 = ...;
Example is a convolution
54
• Replace pointer updates for strided memory addressing with constant array offsets

```c
f0 = *r8; r8 += 4;
f1 = *r8; r8 += 4;
f2 = *r8; r8 += 4;
```

```c
f0 = r8[0];
f1 = r8[4];
f2 = r8[8];
r8 += 12;
```

• Pointer vs. array expression costs may differ.
• Some compilers do a better job of analyzing one than the other
55
Loop Unrolling
• Expose instruction-level parallelism

```c
float f0 = filter[0], f1 = filter[1], f2 = filter[2];
float s0 = signal[0], s1 = signal[1], s2 = signal[2];
*res = f0*s0 + f1*s1 + f2*s2;
do {
  signal += 3;
  s0 = signal[0];
  res[0] = f0*s1 + f1*s2 + f2*s0;
  s1 = signal[1];
  res[1] = f0*s2 + f1*s0 + f2*s1;
  s2 = signal[2];
  res[2] = f0*s0 + f1*s1 + f2*s2;
  res += 3;
} while( ... );
```
56
Expose Independent Operations
• Hide instruction latency
• Use local variables to expose independent operations that can execute in parallel or in a pipelined fashion
• Balance the instruction mix (what functional units are available?)

```c
f1 = f5 * f9;
f2 = f6 + f10;
f3 = f7 * f11;
f4 = f8 + f12;
```
57
Copy optimization
• Copy input operands or blocks
• Reduce cache conflicts
• Constant array offsets for fixed size blocks
• Expose page-level locality
• Alternative: use different data structures from the start (if users are willing)

(Figure: a 4x4 column-major matrix reorganized into contiguous 2x2 blocks)
58
Summary of Performance Tuning
• Performance programming on uniprocessors requires
• understanding of the memory system: moving data is slower than arithmetic
• understanding of fine-grained parallelism in the processor
• Simple performance models can aid understanding
• Two ratios are key to efficiency (relative to peak)
• computational intensity of the algorithm: q = f/m = #floating point operations / #slow memory references
• machine balance in the memory system: t_m/t_f = time for slow memory reference / time for floating point operation
• Want q > t_m/t_f to get half of machine peak
• Blocking (tiling) is a basic approach to increase q
• Techniques apply generally, but the details (e.g., block size) are architecture dependent
• Similar techniques are possible on other data structures and algorithms

59
Extra slides
60
• Sourcebook Chapter 3 (note that chapters 2 and 3 cover the material of lectures 2 and 3, but not in the same order).
• "Performance Optimization of Numerically Intensive Codes", by Stefan Goedecker and Adolfy Hoisie, SIAM 2001.
• Web pages for reference:
• BeBOP Homepage
• ATLAS Homepage
• BLAS (Basic Linear Algebra Subroutines): reference for (unoptimized) implementations of the BLAS, with documentation.
• LAPACK (Linear Algebra PACKage): a standard linear algebra library optimized to use the BLAS effectively on uniprocessors and shared memory machines (software, documentation and reports).
• ScaLAPACK (Scalable LAPACK): a parallel version of LAPACK for distributed memory machines (software, documentation and reports).
• "Tuning Strassen's Matrix Multiplication for Memory Efficiency", Mithuna S. Thottethodi, Siddhartha Chatterjee, and Alvin R. Lebeck, in Proceedings of Supercomputing '98, November 1998.
• "Recursive Array Layouts and Fast Parallel Matrix Multiplication", by Chatterjee et al., IEEE TPDS, November 2002.
61
Questions You Should Be Able to Answer
• What is the key to understanding algorithm efficiency in our simple memory model?
• What is the key to understanding machine efficiency in our simple memory model?
• What is tiling?
• Why does block matrix multiply reduce the number of memory references?
• What are the BLAS?
• Why does loop unrolling improve uniprocessor performance?
62
Proof of Communication Lower Bound on C = A·B (1/6)
• Proof from Irony/Tiskin/Toledo (2004)
• We'll need it for the rest of linear algebra
• Think of the instruction stream being executed
• We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly reordered assuming addition is associative
• It actually isn't associative in floating point, but close enough
• Assume that at most M words can be stored in fast memory
• Outline
• Break the instruction stream into segments, each containing M loads and stores
• Somehow bound the maximum number of adds and multiplies that could be done in each segment; call it F
• Then F · #segments ≥ 2n^3, so #segments ≥ 2n^3 / F
• So #loads + #stores = M · #segments ≥ M · 2n^3 / F
63
Proof of Communication Lower Bound on C = A·B (2/6)
• Given a segment of the instruction stream with M loads and stores, how many adds and multiplies (F) can we do?
• At most 2M entries of C, 2M entries of A and/or 2M entries of B can be accessed
• Use geometry:
• Represent the 2n^3 operations by an n x n x n cube
• One n x n face represents A
• each 1 x 1 subsquare represents one A(i,k)
• One n x n face represents B
• each 1 x 1 subsquare represents one B(k,j)
• One n x n face represents C
• each 1 x 1 subsquare represents one C(i,j)
• Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)
64
Proof of Communication Lower Bound on C = A·B (3/6)
(Figure: the n x n x n cube with axes i, j, k; the C face (showing C(1,1) and C(2,3)), the A face (showing A(1,3) and A(2,1)), and the B face (showing B(1,3) and B(3,1)) labeled)
65
Proof of Communication Lower Bound on C = A·B (4/6)
• Given a segment of the instruction stream with M loads and stores, how many adds and multiplies (F) can we do?
• At most 2M entries of C, and/or of A and/or of B can be accessed
• Use geometry:
• Represent the 2n^3 operations by an n x n x n cube
• One n x n face represents A; each 1 x 1 subsquare represents one A(i,k)
• One n x n face represents B; each 1 x 1 subsquare represents one B(k,j)
• One n x n face represents C; each 1 x 1 subsquare represents one C(i,j)
• Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)
• If we have at most 2M "A squares", 2M "B squares", and 2M "C squares" on the faces, how many cubes can we have?
66
Proof of Communication Lower Bound on C = A·B (5/6)
# cubes in a black box with side lengths x, y and z
= volume of black box = x·y·z
= ( (#A's) · (#B's) · (#C's) )^(1/2)
= ( x·z · z·y · y·x )^(1/2)
67
Proof of Communication Lower Bound on C = A·B (6/6)
• Consider one segment of instructions with M loads and stores
• There can be at most 2M entries of A, B, C available in one segment
• Volume of the set of cubes representing possible multiply/adds ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) = F
• #Segments ≥ 2n^3 / F
• #Loads + #Stores = M · #Segments ≥ M · 2n^3 / F = n^3 / (2M)^(1/2)
• Parallel case: apply the reasoning to one processor out of P (assuming the work is load balanced)
• M = n^2 / P (each processor gets an equal fraction of each matrix)
• #Loads + #Stores = #words communicated with other procs ≥ M · (2n^3 / P) / F = M · (2n^3 / P) / (2M)^(3/2) = n^2 / (2P)^(1/2)
68
Communication Lower Bound for Parallel MatMul
• Apply the reasoning to one processor out of P
• #Adds and Muls = 2n^3 / P
• At least one processor must do this much work
• M = n^2 / P
• Assume each processor gets an equal fraction of each matrix
• (Other assumptions possible)
• #Loads + #Stores = #words communicated with other procs ≥ M · (2n^3 / P) / F = M · (2n^3 / P) / (2M)^(3/2) = n^2 / (2P)^(1/2)
• Possible to communicate less than this
• Need to increase M by keeping multiple copies of the matrices
• "3D" algorithm
• P^(1/3) copies of each matrix, so M = n^2 / P^(2/3)
• Lower bound becomes Ω( n^2 / (2P)^(2/3) ),
• O(P^(1/6)) times smaller than before - attainable
69
Lower bounds on latency (#messages)
• The Hong/Kung and Irony/Tiskin/Toledo theorems are lower bounds on the amount of data communicated
• Bandwidth cost
• Can get a lower bound on the latency cost (number of messages sent) by dividing the above lower bound by the maximum message size = fast memory size M
• #messages for sequential MatMul = Ω( n^3 / M^(3/2) )
• Ex: disk accesses
• #messages for parallel MatMul = Ω( (n^2 / P^(1/2)) / (n^2 / P) ) = Ω( P^(1/2) )
70
Communication lower bounds for all Linear Algebra
• The Irony/Tiskin/Toledo proof can be extended
• The proof does not depend on doing matrix multiplication
• Out(i,j) = func( In1(i,k), In2(k,j) for k ∈ S, where S is some set )
• Only need Out(i,j), etc. to be a one-to-one map from (i,j) to memory
• The proof does not depend on the inputs being dense
• S can be any set, depending on (i,j)
• The proof does not assume inputs and outputs do not overlap
• E.g. in Gaussian Elimination A = L·U, L and U overwrite A
• Applies to LU, Cholesky, solving triangular systems, ...
• Dense or sparse matrices
• Sequential or parallel algorithms
• Lower bound on #words moved = Ω( #flops / M^(1/2) )
• Lower bound on #messages = Ω( #flops / M^(3/2) )
• Also extends to QR, eigenvalue algorithms, SVD
• New proof: need to account for data that is created in fast memory, used, and discarded without being written to memory
• Same lower bounds
71
• Review of material on performance of matrix multiplication so far
• More about tuning matrix multiplication and other codes

72
Using a Simple Model of Memory to Optimize
• Assume just 2 levels in the hierarchy, fast and slow
• All data initially in slow memory
• m = number of memory elements (words) moved between fast and slow memory
• t_m = time per slow memory operation
• f = number of arithmetic operations
• t_f = time per arithmetic operation << t_m
• q = f / m = average number of flops per slow memory access
• Minimum possible time = f·t_f when all data in fast memory
• Actual time = f·t_f + m·t_m = f·t_f · (1 + (t_m/t_f)·(1/q))
• Larger q means time closer to minimum f·t_f
• q ≥ t_m/t_f needed to get at least half of peak speed
• Goal: reorganize the algorithm (change the order of operations) to maximize q (= minimize m)
73
Naïve Matrix Multiply: no better than matrix-vector multiply
• implements C = C + A·B
• for i = 1 to n
•   read row i of A into fast memory            … n² total
•   for j = 1 to n
•     read C(i,j) into fast memory              … n² total
•     read column j of B into fast memory       … n³ total
•     for k = 1 to n
•       C(i,j) = C(i,j) + A(i,k)·B(k,j)         … all data in fast memory
•     write C(i,j) back to slow memory          … 1 write per (i,j), n² total

m = # slow memory accesses = n³ + 3n², so q = f/m = 2n³ / (n³ + 3n²) ≈ 2
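A direct transcription of the loop nest above (plain Python, names are mine), with a counter that tallies the slow-memory word traffic it implies:

```python
def naive_matmul(A, B, C, n):
    """C = C + A*B via the naive triple loop; returns the slow-memory
    word count m = n^3 + 3n^2 implied by the access pattern above."""
    m = 0
    for i in range(n):
        m += n                      # read row i of A
        for j in range(n):
            m += 1 + n              # read C(i,j) and column j of B
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
            m += 1                  # write C(i,j) back
    return m

n = 8
A = [[1.0] * n for _ in range(n)]
B = [[1.0] * n for _ in range(n)]
C = [[0.0] * n for _ in range(n)]
m = naive_matmul(A, B, C, n)        # q = 2n^3/m, which approaches 2
```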
74
From Naïve to Blocked Matrix Multiply
• C(i,j) = C(i,j) + A(i,1)·B(1,j) + A(i,2)·B(2,j) + … = C(i,j) + A(i,1:n)·B(1:n,j)
• True if C(i,j) is a scalar (1x1 submatrix)
• A(i,1:n) is a row of A and B(1:n,j) is a column of B
• Operation is a dot product
• True if C(i,j) is a b x b submatrix
• A(i,1:n) is a b x n submatrix of A and B(1:n,j) is an n x b submatrix of B
• C(i,j) = C(i,j) + A(i,1)·B(1,j) + A(i,2)·B(2,j) + … + A(i,N)·B(N,j), where
• All factors are b x b subblocks
• N = n/b = number of b x b blocks

75
Blocked (Tiled) Matrix Multiply
• Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where
• b = n/N is called the block size
• for i = 1 to N
•   for j = 1 to N
•     read block C(i,j) into fast memory       … b² reads, N²·b² = n² total
•     for k = 1 to N
•       read block A(i,k) into fast memory     … b² reads, N³·b² = n³/b total
•       read block B(k,j) into fast memory     … b² reads, N³·b² = n³/b total
•       C(i,j) = C(i,j) + A(i,k)·B(k,j)        … do a matrix multiply on blocks; all data in fast memory
•     write block C(i,j) back to slow memory   … b² writes, N²·b² = n² total

m = # slow memory accesses = 2n³/b + 2n², so q = f/m ≈ b
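The blocked loop nest, again as a plain-Python sketch (names are mine) with an explicit traffic counter; reading or writing a b x b block counts b² words:

```python
def blocked_matmul(A, B, C, n, b):
    """C = C + A*B using b x b blocks (assumes b divides n).
    Returns the slow-memory word count m = 2n^3/b + 2n^2."""
    m = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            m += b * b                      # read block C(i,j)
            for k in range(0, n, b):
                m += 2 * b * b              # read blocks A(i,k) and B(k,j)
                for ii in range(i, i + b):  # block multiply, all in fast memory
                    for jj in range(j, j + b):
                        for kk in range(k, k + b):
                            C[ii][jj] += A[ii][kk] * B[kk][jj]
            m += b * b                      # write block C(i,j) back
    return m

n, b = 8, 4
A = [[1.0] * n for _ in range(n)]
B = [[1.0] * n for _ in range(n)]
C = [[0.0] * n for _ in range(n)]
m = blocked_matmul(A, B, C, n, b)           # q = 2n^3/m, which approaches b
```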
76
Blocked (Tiled) Matrix Multiply (Review)
• m = # slow memory accesses = 2n³/b + 2n²,
• so q = f/m = 2n³ / (2n³/b + 2n²) ≈ b for large n
• To minimize m (i.e., maximize q) we want to make b as large as possible
• Need 3b² ≤ Mfast, so that all three b x b submatrices C(i,j), A(i,k), B(k,j) fit in fast memory
• So q ≈ b ≤ (Mfast/3)^(1/2)
• Theorem (Hong/Kung, 1981; Irony/Tiskin/Toledo, 2004): any reorganization of this algorithm (that uses only associativity and commutativity of addition) is limited to q = O( (Mfast)^(1/2) )
• Blocked matrix multiplication minimizes the number of slow memory accesses (up to a constant factor)
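Choosing the block size follows directly from 3b² ≤ Mfast; a one-line helper (hypothetical name, for illustration):

```python
import math

def best_block_size(m_fast):
    """Largest b with 3*b*b <= m_fast, so one b x b block each of
    A, B, and C fits in a fast memory of m_fast words."""
    return math.isqrt(m_fast // 3)

# A 32 KB L1 cache holds 4096 doubles, giving b = 36 and hence q of about 36.
```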

78
Current Records for Solving Dense Systems (11/2008)
www.netlib.org, click on Performance Database Server

Gigaflops:
Machine                       n=100      n=1000    Any n    Peak
Roadrunner (LANL)                                  1.1M     1.5M
  (1.1 Petaflops, 130K cores, 2.5 MW, n=2.3M)
NEC SX 8, fastest in 2005
  (8 proc, 2 GHz)                        75.1               128
  (1 proc, 2 GHz)             2.2        15.0               16
Palm Pilot III                .00000169 (1.69 Kiloflops)
79
For all linear algebra problems Ex: LAPACK Table of Contents
• Linear Systems
• Least Squares
• Overdetermined, underdetermined
• Unconstrained, constrained, weighted
• Eigenvalues and vectors of Symmetric Matrices
• Standard (Ax = λx), Generalized (Ax = λBx)
• Eigenvalues and vectors of Unsymmetric matrices
• Eigenvalues, Schur form, eigenvectors, invariant
subspaces
• Standard, Generalized
• Singular Values and vectors (SVD)
• Standard, Generalized
• Level of detail
• Simple Driver
• Expert Drivers with error bounds,
extra-precision, other options
• Lower-level routines (apply a certain kind of orthogonal transformation)

80
For all matrix/problem structures Ex: LAPACK
• SB symmetric, banded
• SP symmetric, packed
• ST symmetric, tridiagonal
• SY symmetric
• TB triangular, banded
• TG triangular, pair
• TP triangular, packed
• TR triangular
• TZ trapezoidal
• UN unitary
• UP unitary packed
• BD bidiagonal
• GB general banded
• GE general
• GG general, pair
• GT tridiagonal
• HB Hermitian banded
• HE Hermitian
• HG upper Hessenberg, pair
• HP Hermitian, packed
• HS upper Hessenberg
• OR (real) orthogonal
• OP (real) orthogonal, packed
• PB positive definite, banded
• PO positive definite
• PP positive definite, packed
• PT positive definite, tridiagonal

87
Motif/Dwarf Common Computational Methods (Red = Hot, Blue = Cool)
What do commercial and CSE applications have in
common?
88
Basic Linear Algebra Subroutines (BLAS)
• Industry standard interface (evolving)
• www.netlib.org/blas, www.netlib.org/blas/blast-forum
• Vendors, others supply optimized implementations
• History
• BLAS1 (1970s)
• vector operations: dot product, saxpy (y = a·x + y), etc.
• m = 2n, f = 2n, q ≈ 1 or less
• BLAS2 (mid 1980s)
• matrix-vector operations: matrix-vector multiply, etc.
• m = n², f = 2n², q ≈ 2, less overhead
• somewhat faster than BLAS1
• BLAS3 (late 1980s)
• matrix-matrix operations: matrix-matrix multiply, etc.
• m ≥ 3n², f = O(n³), so q = f/m can possibly be as large as n (or Mfast^(1/2)), so BLAS3 is potentially much faster than BLAS2
• Good algorithms use BLAS3 when possible (LAPACK & ScaLAPACK)
• More later in course
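The q values quoted for each BLAS level can be tabulated directly from the operation counts above (a sketch; the counts follow the slide, and the function name is mine):

```python
def blas_q(level, n):
    """Flops f and slow-memory words m for each BLAS level, using
    the counts from the slide; returns q = f/m."""
    if level == 1:      # saxpy: y = a*x + y
        f, m = 2 * n, 2 * n
    elif level == 2:    # gemv: y = A*x + y (reading A dominates)
        f, m = 2 * n**2, n**2
    else:               # gemm: C = A*B + C, best case m = 3n^2
        f, m = 2 * n**3, 3 * n**2
    return f / m

# q grows from ~1 (BLAS1) to ~2 (BLAS2) to ~2n/3 (BLAS3)
```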

89
Locality in Other Algorithms
• The performance of any algorithm is limited by q = its computational intensity
• In matrix multiply, we increased q by changing the order of computation
• increased temporal locality
• For other algorithms and data structures, even hand-transformations are still an open problem
• Lots of open problems

90
Simplifying Assumptions
• What simplifying assumptions did we make in this
analysis?
• Ignored parallelism between memory operations and arithmetic within the processor
• Sometimes drop arithmetic term in this type of
analysis
• Assumed fast memory was large enough to hold
three vectors
• Reasonable if we are talking about any level of
cache
• Not if we are talking about registers (32 words)
• Assumed the cost of a fast memory access is 0
• Reasonable if we are talking about registers
• Not necessarily if we are talking about cache
(1-2 cycles for L1)
• Memory latency is constant
• Could simplify even further by ignoring memory operations in the X and Y vectors
• Mflop rate per element = 2 / (2·tf + tm)

91
Experiments on Search vs. Modeling
• Study compares search (ATLAS) to optimization parameter selection based on performance models
• Ten architectures
• Model did well in most cases
• Better on UltraSparc
• Worse on Itanium
• Eliminating performance gaps: think globally, search locally
• Small performance gaps: local search
• Large performance gaps: refine model
• Substantial gap between ATLAS CGw/S and ATLAS Unleashed on some machines

Source: K. Pingali. Results from IEEE '05 paper by K. Yotov, X. Li, G. Ren, M. Garzarán, D. Padua, K. Pingali, and P. Stodghill.