Lecture 6: Linear Algebra Algorithms - PowerPoint PPT Presentation

About This Presentation

Lecture 6: Linear Algebra Algorithms


Linear Algebra Algorithms Jack Dongarra, U of Tennessee Slides are adapted from Jim Demmel, UCB s Lecture on Linear Algebra Algorithms – PowerPoint PPT presentation

Number of Views:532
Avg rating:3.0/5.0
Slides: 135
Provided by: Davi1892
Learn more at: https://netlib.org


Transcript and Presenter's Notes

Title: Lecture 6: Linear Algebra Algorithms

Lecture 6Linear Algebra Algorithms
  • Jack Dongarra, U of Tennessee

Slides are adapted from Jim Demmel, UCBs Lecture
on Linear Algebra Algorithms
Homework 3 Grading Rules
  • part1 (2 points) Write an MPI program
  • (1) code produce correct result 2/2.
  • (2) code run but could not give accurate result
  • (3) code could not run or run but nothing 1/2
  • Note If you only submit one program which can
    print "processor
  • contribution" and "total integral" correctly, I
    will give 1/2
  • for this part and 2/2 for part2. so you total
    score for part1
  • and part2 will be 3/4.
  • part2 (2 points) Modify your program to sum the
    contributions to give the total value of the
  • (1) code produce correct result 2/2.
  • (2) code run but could not give accurate result
  • (3) code could not run or run but nothing 1/2
  • part3 (4 points) Give an expression for the
    efficiency of the parallel algorithm for m
  • (1) give t_p(1), t_p(p), S(p) and E(p) correctly
  • (2) give t_p(1), t_p(p) and S(p) correctly 3/4.
  • (3) give t_p(1) and t_p(p) correctly 2/4.
  • (4) give other 1/4.

Adding numbers together
  • If we are adding an array of numbers and have a
    choice do we add them from largest to smallest or
    smallest to largest?

Parallel Performance Metrics
Absolute Elapsed (wall-clock) Time T(n )
Speedup S(n ) T(1) / T(n ) where T1 is
the time for the best serial implementation.
gt Performance improvement due to parallelsm
Parallel Efficiency E(n ) T(1) / n T(n )
Ideal Speedup SI (n ) n -
Theoretical limit obtainable rarely - Ignores
all of real life These definitions apply to a
fixed-problem experiment.
Amdahls Law
Amdahls Law places a strict limit on the speedup
that can be realized by using multiple
processors. Two equivalent expressions for
Amdahls Law are given below
tN (fp/N fs)t1 Effect of multiple
processors on run time S 1/(fs fp/N)
Effect of multiple processors on
speedup Where fs serial fraction of code fp
parallel fraction of code 1 - fs N number
of processors
Illustration of Amdahls Law
It takes only a small fraction of serial content
in a code to degrade the parallel performance. It
is essential to determine the scaling behavior of
your code before doing production runs using
large numbers of processors
Amdahls Law Vs. Reality
Amdahls Law provides a theoretical upper limit
on parallel speedup assuming that there are no
costs for communications. In reality,
communications (and I/O) will result in a further
degradation of performance.
So whats going on?
More on Amdahls Law
  • Amdahls Law can be generalized to any two
    processes of with different speeds
  • Ex. Apply to fprocessor and fmemory
  • The growing processor-memory performance gap will
    undermine our efforts at achieving maximum
    possible speedup!

Gustafsons Law
  • Thus, Amdahls Law predicts that there is a
    maximum scalability for an application,
    determined by its parallel fraction, and this
    limit is generally not large.
  • tN (fp/N fs)t1 Effect of multiple
    processors on run time
  • S 1/(fs fp/N) Effect of multiple
    processors on speedup
  • Where
  • fs serial fraction of code
  • fp parallel fraction of code 1 - fs
  • N number of processors
  • There is a way around this increase the problem
  • bigger problems mean bigger grids or more
    particles bigger arrays
  • number of serial operations generally remains
    constant number of parallel operations
    increases parallel fraction increases

Fixed-Problem Size Scaling
a.k.a. Fixed-load, Fixed-Problem Size, Strong
Scaling, Problem-Constrained, constant-problem
size (CPS), variable subgrid
1 Amdahl Limit SA(n) T(1) / T(n)
f / n ( 1 -
f ) This bounds the speedup based only on the
fraction of the code that cannot use parallelism
( 1- f ) it ignores all other factors SA
--gt 1 / ( 1- f ) as n --gt
Fixed-Problem Size Scaling (Contd)
Efficiency (n) T(1) / T(n) n
Memory requirements decrease with n
Surface-to-volume ratio increases with n
Superlinear speedup possible from cache
effects Motivation what is the largest of
procs I can use effectively and what is the
fastest time that I can solve a given problem?
Problems - Sequential runs often not possible
(large problems) - Speedup (and efficiency) is
misleading if processors are slow
Fixed-Problem Size Scaling Examples
S. Goedecker and Adolfy Hoisie, Achieving High
Performance in Numerical Computations on RISC
Workstations and Parallel Systems,International
Conference on Computational Physics PC'97 Santa
Cruz, August 25-28 1997.
Fixed-Problem Size Scaling Examples
Scaled Speedup Experiments
a.k.a. Fixed Subgrid-Size, Weak Scaling,
Gustafson scaling. Motivation Want to use a
larger machine to solve a larger global problem
in the same amount of time. Memory and
surface-to-volume effects remain constant.
Top500 Data
Rank Manufacturer Computer Installation Site Rmax TF/s R peak TF/s Ratio Max/Peak Proc
1 IBM ASCI WhiteSP Power3 Lawrence Livermore National Laboratory 7.23 12 60.25 8192
2 Compaq AlphaServer SC ES45 1 GHz Pittsburgh Supercomputing Center 4.06 6.05 67.11 3024
3 IBM SP Power3 375 MHz NERSC/LBNL 3.05 5.00 61.00 3328
4 Intel ASCI Red Sandia National Laboratory 2.38 3.207 74.21 9632
5 IBM ASCI Blue PacificSST, IBM SP 604E Lawrence Livermore National Laboratory 2.14 3.868 55.33 5808
6 Compaq AlphaServer SC ES45 1 GHz Los Alamos National Laboratory 2.10 3.04 69.08 1536
7 Hitachi SR8000/MPP University of Tokyo 1.71 2.69 63.57 1152
8 SGI ASCI Blue Mountain Los Alamos National Laboratory 1.61 2.52 63.89 6144
9 IBM SP Power3 375Mhz Naval Oceanographic Office (NAVOCEANO) 1.42 1.97 72.08 1336
10 IBM SP Power3 375Mhz Deutscher Wetterdienst 1.29 1.83 70.49 1280
Example of a Scaled Speedup Experiment
Parallel Performance Metrics Speedup
Absolute performance
Relative performance
Speedup is only one characteristic of a program -
it is not synonymous with performance. In this
comparison of two machines the code achieves
comparable speedups but one of the machines is
(No Transcript)
Improving Ratio of Floating Point Operations to
Memory Accesses
  • subroutine mult(n1,nd1,n2,nd2,y,a,x)
  • implicit real8 (a-h,o-z)
  • dimension a(nd1,nd2),y(nd2),x(nd1)
  • do 20, i1,n1
  • t0.d0
  • do 10, j1,n2
  • tta(j,i)x(j)
  • 10 continue
  • y(i)t
  • 20 continue
  • return
  • end
  • Unroll the loops

Improving Ratio of Floating Point Operations to
Memory Accesses
c works correctly when n1,n2 are multiples of 4
dimension a(nd1,nd2), y(nd2), x(nd1) do
i1,n1-3,4 t10.d0 t20.d0
t30.d0 t40.d0 do j1,n2-3,4
t1t1a(j0,i0)x(j0)a(j1,i0)x(j1) 1
t2t2a(j0,i1)x(j0)a(j1,i1)x(j1) 1
t3t3a(j0,i2)x(j0)a(j1,i2)x(j1) 1
t4t4a(j0,i3)x(j0)a(j1,i3)x(j1) 1
a(j2,i3)x(j2)a(j3,i3)x(j3) enddo
y(i0)t1 y(i1)t2 y(i2)t3
y(i3)t4 enddo
Summary of Single-Processor Optimization
Techniques (I)
  • Spatial and temporal data locality
  • Loop unrolling
  • Blocking
  • Software pipelining
  • Optimization of data structures
  • Special functions, library subroutines

Summary of Optimization Techniques (II)
  • Achieving high-performance requires code
    restructuring. Minimization of memory traffic is
    the single most important goal.
  • Compilers are getting better good at software
    pipelining. But they are not there yet can do
    loop transformations only in simple cases,
    usually fail to produce optimal blocking,
    heuristics for unrolling may not match your code
    well, etc.
  • The optimization process is machine-specific and
    requires detailed architectural knowledge.

Optimizing Matrix Addition for Caches
  • Dimension A(n,n), B(n,n), C(n,n)
  • A, B, C stored by column (as in Fortran)
  • Algorithm 1
  • for i1n, for j1n, A(i,j) B(i,j) C(i,j)
  • Algorithm 2
  • for j1n, for i1n, A(i,j) B(i,j) C(i,j)
  • What is memory access pattern for Algs 1 and 2?
  • Which is faster?
  • What if A, B, C stored by row (as in C)?

Using a Simpler Model of Memory to Optimize
  • Assume just 2 levels in the hierarchy, fast and
  • All data initially in slow memory
  • m number of memory elements (words) moved
    between fast and slow memory
  • tm time per slow memory operation
  • f number of arithmetic operations
  • tf time per arithmetic operation lt tm
  • q f/m average number of flops per slow element
  • Minimum possible Time ftf, when all data in
    fast memory
  • Actual Time ftf mtm ftf(1
  • Larger q means Time closer to minimum ftf
  • Want large q

Simple example using memory model
  • To see results of changing q, consider simple
  • Assume tf1 Mflop/s on fast memory
  • Assume moving data is tm 10
  • Assume h takes q flops
  • Assume array X is in slow memory

s 0 for i 1, n s s h(Xi)
  • So m n and f qn
  • Time read X compute 10n qn
  • Mflop/s f/t q/(10 q)
  • As q increases, this approaches the peak speed
    of 1 Mflop/s

Warm up Matrix-vector multiplication y y Ax
  • for i 1n
  • for j 1n
  • y(i) y(i) A(i,j)x(j)


Warm up Matrix-vector multiplication y y Ax
  • read x(1n) into fast memory
  • read y(1n) into fast memory
  • for i 1n
  • read row i of A into fast memory
  • for j 1n
  • y(i) y(i) A(i,j)x(j)
  • write y(1n) back to slow memory
  • m number of slow memory refs 3n n2
  • f number of arithmetic operations 2n2
  • q f/m 2
  • Matrix-vector multiplication limited by slow
    memory speed

Matrix Multiply CCAB
  • for i 1 to n
  • for j 1 to n
  • for k 1 to n
  • C(i,j) C(i,j) A(i,k) B(k,j)


Matrix Multiply CCAB(unblocked, or untiled)
  • for i 1 to n
  • read row i of A into fast memory
  • for j 1 to n
  • read C(i,j) into fast memory
  • read column j of B into fast memory
  • for k 1 to n
  • C(i,j) C(i,j) A(i,k) B(k,j)
  • write C(i,j) back to slow memory


Matrix Multiply (unblocked, or untiled)
qops/slow mem ref
  • Number of slow memory references on unblocked
    matrix multiply
  • m n3 read each column of B n times
  • n2 read each column of A once for
    each i
  • 2n2 read and write each element of C
  • n3 3n2
  • So q f/m (2n3)/(n3 3n2)
  • 2 for large n, no improvement over
    matrix-vector mult


Matrix Multiply (blocked, or tiled)
  • Consider A,B,C to be N by N matrices of b by b
    subblocks where bn/N is called the blocksize
  • for i 1 to N
  • for j 1 to N
  • read block C(i,j) into fast memory
  • for k 1 to N
  • read block A(i,k) into fast
  • read block B(k,j) into fast
  • C(i,j) C(i,j) A(i,k)
    B(k,j) do a matrix multiply on blocks
  • write block C(i,j) back to slow memory


n is the size of the matrix, N blocks of size b
n Nb
Matrix Multiply (blocked or tiled)
qops/slow mem ref
n is the size of the matrix, N blocks of size b
n Nb
  • Why is this algorithm correct?
  • Number of slow memory references on blocked
    matrix multiply
  • m Nn2 read each block of B N3 times (N3
    n/N n/N)
  • Nn2 read each block of A N3 times
  • 2n2 read and write each block of
    C once
  • (2N 2)n2
  • So q f/m 2n3 / ((2N 2)n2)
  • n/N b for large n
  • So we can improve performance by increasing the
    blocksize b
  • Can be much faster than matrix-vector multiplty
  • Limit All three blocks from A,B,C must fit in
    fast memory (cache), so we
  • cannot make these blocks arbitrarily large 3b2
    lt M, so q b lt sqrt(M/3)
  • Theorem (Hong, Kung, 1981) Any reorganization of
    this algorithm
  • (that uses only associativity) is limited to q

  • As much as possible will be overlapped
  • Dot Product
  • ACC 0
  • do i x,n
  • ACC ACC x(i) y(i)
  • end do
  • Experiments done on an IBM RS6000/530
  • 25 MHz
  • 2 cycle to complete FMA can be pipelined
  • gt 50 Mflop/s peak
  • one cycle from cache

DOT Operation - Data in Cache
  • Do 10 I 1, n
  • T T X(I)Y(I)
  • Theoretically, 2 loads for X(I) and Y(I), one FMA
    operation, no re-use of data
  • Pseudo-assembler
  • LOAD fp0,T
  • label
  • LOAD fp1,X(I)
  • LOAD fp2,Y(I)
  • FMA fp0,fp0,fp1,fp2
  • BRANCH label

Load y
Load x
Load y
Load x
1 result per cycle 25 Mflop/s
Matrix-Vector Product
  • DOT version
  • DO 20 I 1, M
  • DO 10 J 1, N
  • Y(I) Y(I) A(I,J)X(J)
  • From Cache 22.7 Mflops
  • From Memory 12.4 Mflops

Loop Unrolling
  • DO 20 I 1, M, 2
  • T1 Y(I )
  • T2 Y(I1)
  • DO 10 J 1, N
  • T1 T1 A(I,J )X(J)
  • T2 T2 A(I1,J)X(J)
  • Y(I ) T1
  • Y(I1) T2
  • 3 loads, 4 flops
  • Speed of yyATx, N48
  • Depth 1 2 3 4
  • Speed 25 33.3 37.5 40 50
  • Measured 22.7 30.5 34.3 36.5
  • Memory 12.4 12.7 12.7 12.6
  • unroll 1 2 loads 2 ops per 2 cycles
  • unroll 2 3 loads 4 ops per 3 cycles
  • unroll 3 4 loads 6 ops per 4 cycles
  • unroll n n1 loads 2n ops per n1 cycles
  • problem only so many registers

Matrix Multiply
  • DOT version - 25 Mflops in cache
  • DO 30 J 1, M
  • DO 20 I 1, M
  • DO 10 K 1, L
  • C(I,J) C(I,J)

How to Get Near Peak
  • Inner loop
  • 4 loads, 8 operations, optimal.
  • In practice we have measured 48.1 out of a peak
    of 50 Mflop/s when in cache
  • DO 30 J 1, M, 2
  • DO 20 I 1, M, 2
  • T11 C(I, J )
  • T12 C(I, J1)
  • T21 C(I1,J )
  • T22 C(I1,J1)
  • DO 10 K 1, L
  • T11 T11 A(I, K) B(K,J
  • T12 T12 A(I, K)
  • T21 T21 A(I1,K)B(K,J )
  • T22 T22 A(I1,K)B(K,J1)
  • C(I, J ) T11
  • C(I, J1) T12
  • C(I1,J ) T21
  • C(I1,J1) T22

BLAS -- Introduction
  • Clarity code is shorter and easier to read,
  • Modularity gives programmer larger building
  • Performance manufacturers will provide tuned
    machine-specific BLAS,
  • Program portability machine dependencies are
    confined to the BLAS

Memory Hierarchy
  • Key to high performance in effective use of
    memory hierarchy
  • True on all architectures

Level 1, 2 and 3 BLAS
  • Level 1 BLAS Vector-Vector operations
  • Level 2 BLAS Matrix-Vector operations
  • Level 3 BLAS Matrix-Matrix operations

More on BLAS (Basic Linear Algebra Subroutines)
  • Industry standard interface(evolving)
  • Vendors, others supply optimized implementations
  • History
  • BLAS1 (1970s)
  • vector operations dot product, saxpy (y?xy),
  • m2n, f2n, q 1 or less
  • BLAS2 (mid 1980s)
  • matrix-vector operations matrix vector multiply,
  • mn2, f2n2, q2, less overhead
  • somewhat faster than BLAS1
  • BLAS3 (late 1980s)
  • matrix-matrix operations matrix matrix multiply,
  • m gt 4n2, fO(n3), so q can possibly be as large
    as n, so BLAS3 is potentially much faster than
  • Good algorithms used BLAS3 when possible (LAPACK)
  • www.netlib.org/blas, www.netlib.org/lapack

Why Higher Level BLAS?
  • Can only do arithmetic on data at the top of the
  • Higher level BLAS lets us do this

BLAS for Performance
  • Development of blocked algorithms important for

BLAS for Performance
  • Development of blocked algorithms important for

BLAS for Performance
  • Development of blocked algorithms important for

BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2
(n-by-n matrix vector multiply) vs BLAS 1 (saxpy
of n vectors)
Fast linear algebra kernels BLAS
  • Simple linear algebra kernels such as
    matrix-matrix multiply
  • More complicated algorithms can be built from
    these basic kernels.
  • The interfaces of these kernels have been
    standardized as the Basic Linear Algebra
    Subroutines (BLAS).
  • Early agreement on standard interface (1980)
  • Led to portable libraries for vector and shared
    memory parallel machines.
  • On distributed memory, there is a less-standard
    interface called the PBLAS

Level 1 BLAS
  • Operate on vectors or pairs of vectors
  • perform O(n) operations
  • return either a vector or a scalar.
  • saxpy
  • y(i) a x(i) y(i), for i1 to n.
  • s stands for single precision, daxpy is for
    double precision, caxpy for complex, and zaxpy
    for double complex,
  • sscal y a x, for scalar a and vectors x,y
  • sdot computes s S ni1 x(i)y(i)

Level 2 BLAS
  • Operate on a matrix and a vector
  • return a matrix or a vector
  • O(n2) operations
  • sgemv matrix-vector multiply
  • y y Ax
  • where A is m-by-n, x is n-by-1 and y is m-by-1.
  • sger rank-one update
  • A A yxT, i.e., A(i,j) A(i,j)y(i)x(j)
  • where A is m-by-n, y is m-by-1, x is n-by-1,
  • strsv triangular solve
  • solves yTx for x, where T is triangular

Level 3 BLAS
  • Operate on pairs or triples of matrices
  • returning a matrix
  • complexity is O(n3).
  • sgemm Matrix-matrix multiplication
  • C C AB,
  • where C is m-by-n, A is m-by-k, and B is k-by-n
  • strsm multiple triangular solve
  • solves Y TX for X,
  • where T is a triangular matrix, and X is a
    rectangular matrix.

Optimizing in practice
  • Tiling for registers
  • loop unrolling, use of named register variables
  • Tiling for multiple levels of cache
  • Exploiting fine-grained parallelism within the
  • super scalar
  • pipelining
  • Complicated compiler interactions
  • Hard to do by hand (but youll try)
  • Automatic optimization an active research area
  • PHIPAC www.icsi.berkeley.edu/bilmes/phipac
  • www.cs.berkeley.edu/iyer/asci_sli
  • ATLAS www.netlib.org/atlas/index.html

BLAS -- References
  • BLAS software and documentation can be obtained
  • WWW http//www.netlib.org/blas,
  • (anonymous) ftp ftp.netlib.org cd blas get
  • email netlib_at_www.netlib.org with the message
    send index from blas
  • Comments and questions can be addressed to

BLAS Papers
  • C. Lawson, R. Hanson, D. Kincaid, and F. Krogh,
    Basic Linear Algebra Subprograms for Fortran
    Usage, ACM Transactions on Mathematical Software,
    5308--325, 1979.
  • J. Dongarra, J. Du Croz, S. Hammarling, and R.
    Hanson, An Extended Set of Fortran Basic Linear
    Algebra Subprograms, ACM Transactions on
    Mathematical Software, 14(1)1--32, 1988.
  • J. Dongarra, J. Du Croz, I. Duff, S. Hammarling,
    A Set of Level 3 Basic Linear Algebra
    Subprograms, ACM Transactions on Mathematical
    Software, 16(1)1--17, 1990.

Performance of BLAS
  • BLAS are specially optimized by the vendor
  • Sun BLAS uses features in the Ultrasparc
  • Big payoff for algorithms that can be expressed
    in terms of the BLAS3 instead of BLAS2 or BLAS1.
  • The top speed of the BLAS3
  • Algorithms like Gaussian elimination organized so
    that they use BLAS3

How To Get Performance From Commodity Processors?
  • Todays processors can achieve high-performance,
    but this requires extensive machine-specific hand
  • Routines have a large design space w/many
  • blocking sizes, loop nesting permutations, loop
    unrolling depths, software pipelining strategies,
    register allocations, and instruction schedules.
  • Complicated interactions with the increasingly
    sophisticated microarchitectures of new
  • A few months ago no tuned BLAS for Pentium for
  • Need for quick/dynamic deployment of optimized
  • ATLAS - Automatic Tuned Linear Algebra Software
  • PhiPac from Berkeley

Adaptive Approach for Level 3
  • Do a parameter study of the operation on the
    target machine, done once.
  • Only generated code is on-chip multiply
  • BLAS operation written in terms of generated
    on-chip multiply
  • All tranpose cases coerced through data copy to 1
    case of on-chip multiply
  • Only 1 case generated per platform

Code Generation Strategy
  • On-chip multiply optimizes for
  • TLB access
  • L1 cache reuse
  • FP unit usage
  • Memory fetch
  • Register reuse
  • Loop overhead minimization
  • Takes a couple of hours to run.
  • Code is iteratively generated timed until
    optimal case is found. We try
  • Differing NBs
  • Breaking false dependencies
  • M, N and K loop unrolling

(No Transcript)
(No Transcript)
500x500 gemm-based BLAS on SGI R10000ip28
500x500 gemm-based BLAS on UltraSparc 2200
Recursive Approach for Other Level
  • Recur down to L1 cache block size
  • Need kernel at bottom of recursion
  • Use gemm-based kernel for portability

Recursive TRMM
500x500 Level 2 BLAS DGEMV
Multi-Threaded DGEMMIntel PIII 550 MHz
  • Keep a repository of kernels for specific
  • Develop a means of dynamically downloading code
  • Extend work to allow sparse matrix operations
  • Extend work to include arbitrary code segments
  • See http//www.netlib.org/atlas/

Algorithms and Architecture
  • The key to performance is to understand the
    algorithm and architecture interaction.
  • A significant improvement in performance can be
    obtained by matching algorithm to the
    architecture or vice versa.

Algorithm Issues
  • Use of memory hierarchy
  • Algorithm pre-fetching
  • Loop unrolling
  • Simulating higher precision arithmetic

  • TLB Blocking - minimize TLB misses
  • Cache Blocking - minimize cache misses
  • Register Blocking - minimize load/stores
  • The general idea of blocking is to get the
    information to a high-speed storage and use it
    multiple times so as to amortize the cost of
    moving the data
  • Cache blocking reduced traffic between memory and
  • Register blocking reduces traffic between cache
    and CPU

Loop Unrolling
  • Reduces data dependency delay
  • Exploits multiple functional units and quad
    load/stores effectively.
  • Minimizes load/stores
  • Reduces loop overheads
  • Gives more flexibility to compiler in scheduling
  • Facilitates algorithm pre-fetching.
  • What about vector computing?

Whats Wrong With Speedup T1/Tp ?
  • Can lead to very false conclusions.
  • Speedup in isolation without taking into account
    the speed of the processor is unrealistic and
  • Speedup over what?
  • T1/Tp
  • There is usually no doubt about Tp
  • Often considerable dispute over the meaning of T1
  • Serial code? Same algorithm?

  • Can be used to
  • Study, in isolation, the scaling of one algorithm
    on one computer.
  • As a dimensionless variable in the theory of
  • Should not be used to compare
  • Different algorithms on the same computer
  • The same algorithm on different computers.
  • Different interconnection structures.

(No Transcript)
Strassens Algorithm for Matrix Multiply
Usual Matrix Multiply
Strassens Algorithm
Strassens Algorithm
  • The count of arithmetic operations is
  • One matrix multiply is replaced by 14 matrix

Strassens Algorithm
  • In reality the use of Strassens Algorithm is
    limited by
  • Additional memory required for storing the P
  • More memory accesses are needed.

  • Motivation for Dense Linear Algebra
  • Axb Computational Electromagnetics
  • Ax lx Quantum Chemistry
  • Review Gaussian Elimination (GE) for solving Axb
  • Optimizing GE for caches on sequential machines
  • using matrix-matrix multiplication (BLAS)
  • LAPACK library overview and performance
  • Data layouts on parallel machines
  • Parallel matrix-matrix multiplication
  • Parallel Gaussian Elimination
  • ScaLAPACK library overview
  • Eigenvalue problem

Parallelism in Sparse Matrix-vector multiplication
  • y Ax, where A is sparse and n x n
  • Questions
  • which processors store
  • yi, xi, and Ai,j
  • which processors compute
  • yi sum (from 1 to n) Ai,j xj
  • (row i of A) . x a
    sparse dot product
  • Partitioning
  • Partition index set 1,,n N1 u N2 u u Np
  • For all i in Nk, Processor k stores yi, xi,
    and row i of A
  • For all i in Nk, Processor k computes yi (row
    i of A) . x
  • owner computes rule Processor k compute the
    yis it owns
  • Goals of partitioning
  • balance load (how is load measured?)
  • balance storage (how much does each processor
  • minimize communication (how much is communicated?)

Graph Partitioning and Sparse Matrices
  • Relationship between matrix and graph

1 2 3 4 5 6
1 1 1 1 2 1 1
1 1 3
1 1 1 4 1 1
1 1 5 1 1 1
1 6 1 1 1 1
  • A good partition of the graph has
  • equal (weighted) number of nodes in each part
    (load and storage balance)
  • minimum number of edges crossing between
    (minimize communication)
  • Can reorder the rows/columns of the matrix by
    putting all the nodes in one partition together

More on Matrix Reordering via Graph Partitioning
  • Ideal matrix structure for parallelism
    (nearly) block diagonal
  • p (number of processors) blocks
  • few non-zeros outside these blocks, since these
    require communication

P0 P1 P2 P3 P4
What about implicit methods and eigenproblems?
  • Direct methods (Gaussian elimination)
  • Called LU Decomposition, because we factor A
  • Future lectures will consider both dense and
    sparse cases
  • More complicated than sparse-matrix vector
  • Iterative solvers
  • Will discuss several of these in future
  • Jacobi, Successive overrelaxiation (SOR) ,
    Conjugate Gradients (CG), Multigrid,...
  • Most have sparse-matrix-vector multiplication in
  • Eigenproblems
  • Future lectures will discuss dense and sparse
  • Also depend on sparse-matrix-vector
    multiplication, direct methods
  • Graph partitioning

  • Partial Differential Equations
  • PDEs

Continuous Variables, Continuous Parameters
  • Examples of such systems include
  • Heat flow Temperature(position, time)
  • Diffusion Concentration(position, time)
  • Electrostatic or Gravitational Potential Pote
  • Fluid flow Velocity,Pressure,Density(position,tim
  • Quantum mechanics Wave-function(position,time)
  • Elasticity Stress,Strain(position,time)

Example Deriving the Heat Equation
  • Consider a simple problem
  • A bar of uniform material, insulated except at
  • Let u(x,t) be the temperature at position x at
    time t
  • Heat travels from x-h to xh at rate proportional

d u(x,t) (u(x-h,t)-u(x,t))/h -
(u(x,t)- u(xh,t))/h dt
  • As h 0, we get the heat equation

Explicit Solution of the Heat Equation
  • For simplicity, assume C1
  • Discretize both time and position
  • Use finite differences with uj,i as the heat at
  • time t idt (i 0,1,2,) and position x jh
  • initial conditions on uj,0
  • boundary conditions on u0,i and uN,i
  • At each timestep i 0,1,2,...
  • This corresponds to
  • matrix vector multiply (what is matrix?)
  • nearest neighbors on grid

t5 t4 t3 t2 t1 t0
For j0 to N uj,i1 zuj-1,i
(1-2z)uj,i zuj1,i where z dt/h2
u0,0 u1,0 u2,0 u3,0 u4,0 u5,0
Parallelism in Explicit Method for PDEs
  • Partitioning the space (x) into p largest chunks
  • good load balance (assuming large number of
    points relative to p)
  • minimized communication (only p chunks)
  • Generalizes to
  • multiple dimensions
  • arbitrary graphs ( sparse matrices)
  • Problem with explicit approach
  • numerical instability
  • solution blows up eventually if z dt/h gt .5
  • need to make the timesteps very small when h is
    small dt lt .5h

Instability in solving the heat equation
Implicit Solution
  • As with many (stiff) ODEs, need an implicit
  • This turns into solving the following equation
  • Here I is the identity matrix and T is
  • I.e., essentially solving Poissons equation in 1D

(I (z/2)T) u,i1 (I - (z/2)T) u,i
2 -1 -1 2 -1 -1 2 -1
-1 2 -1 -1 2
Graph and stencil
2D Implicit Method
  • Similar to the 1D case, but the matrix T is now
  • Multiplying by this matrix (as in the explicit
    case) is simply nearest neighbor computation on
    2D grid
  • To solve this system, there are several techniques

Graph and stencil
4 -1 -1 -1 4 -1 -1
-1 4 -1 -1
4 -1 -1 -1 -1 4
-1 -1 -1
-1 4 -1
-1 4 -1
-1 -1 4 -1
-1 -1 4
Algorithms for 2D Poisson Equation with N unknowns
  • Algorithm Serial PRAM Memory Procs
  • Dense LU N3 N N2 N2
  • Band LU N2 N N3/2 N
  • Jacobi N2 N N N
  • Explicit Inv. N log N N N
  • Conj.Grad. N 3/2 N 1/2 log N N N
  • RB SOR N 3/2 N 1/2 N N
  • Sparse LU N 3/2 N 1/2 Nlog N N
  • FFT Nlog N log N N N
  • Multigrid N log2 N N N
  • Lower bound N log N N
  • PRAM is an idealized parallel model with zero
    cost communication
  • (see next slide for explanation)

Short explanations of algorithms on previous slide
  • Sorted in two orders (roughly)
  • from slowest to fastest on sequential machines
  • from most general (works on any matrix) to most
    specialized (works on matrices like T)
  • Dense LU Gaussian elimination works on any
    N-by-N matrix
  • Band LU exploit fact that T is nonzero only on
    sqrt(N) diagonals nearest main diagonal, so
  • Jacobi essentially does matrix-vector multiply
    by T in inner loop of iterative algorithm
  • Explicit Inverse assume we want to solve many
    systems with T, so we can precompute and store
    inv(T) for free, and just multiply by it
  • Its still expensive!
  • Conjugate Gradients uses matrix-vector
    multiplication, like Jacobi, but exploits
    mathematical properies of T that Jacobi does not
  • Red-Black SOR (Successive Overrelaxation)
    Variation of Jacobi that exploits yet different
    mathematical properties of T
  • Used in Multigrid
  • Sparse LU Gaussian elimination exploiting
    particular zero structure of T
  • FFT (Fast Fourier Transform) works only on
    matrices very like T
  • Multigrid also works on matrices like T, that
    come from elliptic PDEs
  • Lower Bound serial (time to print answer)
    parallel (time to combine N inputs)

Composite mesh from a mechanical structure
Converting the mesh to a matrix
Effects of Ordering Rows and Columns on Gaussian
Irregular mesh NASA Airfoil in 2D (direct
Irregular mesh Tapered Tube (multigrid)
Adaptive Mesh Refinement (AMR)
  • Adaptive mesh around an explosion
  • John Bell and Phil Colella at LBL (see class web
    page for URL)
  • Goal of Titanium is to make these algorithms
    easier to implement
  • in parallel

Computational Electromagnetics
  • Developed during 1980s, driven by defense
  • Determine the RCS (radar cross section) of
  • Reduce signature of plane (stealth technology)
  • Other applications are antenna design, medical
  • Two fundamental numerical approaches
  • MOM methods of moments ( frequency domain), and
  • Finite differences (time domain)

Computational Electromagnetics
- Discretize surface into triangular facets using
standard modeling tools - Amplitude of currents
on surface are unknowns
- Integral equation is discretized into a set of
linear equations
image NW Univ. Comp. Electromagnetics Laboratory
Computational Electromagnetics (MOM)
After discretization the integral equation has
the form A x b where A is
the (dense) impedance matrix, x is the unknown
vector of amplitudes, and b is the excitation
vector. (see Cwik, Patterson, and Scott,
Electromagnetic Scattering on the Intel
Touchstone Delta, IEEE Supercomputing 92, pp 538
- 542)
Computational Electromagnetics (MOM)
The main steps in the solution process are
Fill computing the matrix elements
of A Factor factoring the dense matrix
A Solve solving for one or more
excitations b Field Calc computing the fields
scattered from the object
Analysis of MOM for Parallel Implementation
Task Work Parallelism
Parallel Speed
Fill O(n2) embarrassing
low Factor O(n3)
moderately diff. very high Solve
O(n2) moderately diff.
high Field Calc. O(n)
embarrassing high
Results for Parallel Implementation on Delta
Task Time (hours)
Fill 9.20
Factor 8.25 Solve
2 .17
Field Calc. 0.12
The problem solved was for a matrix of size
48,672. (The world record in 1991.)
Current Records for Solving Dense Systems
Year System Size Machine
Procs Gflops (Peak) 1950's O(100)

1995 128,600 Intel Paragon
6768 281 ( 338) 1996
215,000 Intel ASCI Red 7264
1068 (1453) 1998 148,000
Cray T3E 1488 1127
(1786) 1998 235,000 Intel
ASCI Red 9152 1338 (1830) 1999
374,000 SGI ASCI Blue 5040
1608 (2520) 1999 362,880
Intel ASCI Red 9632 2379 (3207) 2000
430,000 IBM ASCI White 8192
4928 (12000)
source Alan Edelman http//www-math.mit.edu/edel
man/records.html LINPACK Benchmark
Computational Chemistry
  • Seek energy levels of a molecule, crystal, etc.
  • Solve Schroedingers Equation for energy levels
  • Discretize to get Ax lBx, solve for eigenvalues
    l and eigenvectors x
  • A and B large, symmetric or Hermitian matrices (B
    positive definite)
  • May want some or all eigenvalues/eigenvectors
  • MP-Quest (Sandia NL)
  • Si and sapphire crystals of up to 3072 atoms
  • Local Density Approximation to Schroedinger
  • A and B up to n40000, Hermitian
  • Need all eigenvalues and eigenvectors
  • Need to iterate up to 20 times (for
  • Implemented on Intel ASCI Red
  • 9200 Pentium Pro 200 processors (4600 Duals, a
  • Overall application ran at 605 Gflops (out of
    1800 Glops peak),
  • Eigensolver ran at 684 Gflops
  • www.cs.berkeley.edu/stanley/gbell/index.html
  • Runner-up for Gordon Bell Prize at Supercomputing

Review of Gaussian Elimination (GE) for solving
  • Add multiples of each row to later rows to make A
    upper triangular
  • Solve resulting triangular system Ux c by

for each column i zero it out below the
diagonal by adding multiples of row i to later
rows for i 1 to n-1 for each row j below
row i for j i1 to n add a
multiple of row i to row j for k i to
n A(j,k) A(j,k) -
(A(j,i)/A(i,i)) A(i,k)
Refine GE Algorithm (1)
  • Initial Version
  • Remove computation of constant A(j,i)/A(i,i) from
    inner loop

for each column i zero it out below the
diagonal by adding multiples of row i to later
rows for i 1 to n-1 for each row j below
row i for j i1 to n add a
multiple of row i to row j for k i to
n A(j,k) A(j,k) -
(A(j,i)/A(i,i)) A(i,k)
for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i to n
A(j,k) A(j,k) - m A(i,k)
Refine GE Algorithm (2)
  • Last version
  • Dont compute what we already know
    zeros below diagonal in column i

for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i to n
A(j,k) A(j,k) - m A(i,k)
for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i1 to n
A(j,k) A(j,k) - m A(i,k)
Refine GE Algorithm (3)
  • Last version
  • Store multipliers m below diagonal in zeroed
    entries for later use

for i 1 to n-1 for j i1 to n
m A(j,i)/A(i,i) for k i1 to n
A(j,k) A(j,k) - m A(i,k)
for i 1 to n-1 for j i1 to n
A(j,i) A(j,i)/A(i,i) for k i1 to
n A(j,k) A(j,k) - A(j,i) A(i,k)
Refine GE Algorithm (4)
  • Last version
  • Express using matrix operations (BLAS)

for i 1 to n-1 for j i1 to n
A(j,i) A(j,i)/A(i,i) for k i1 to
n A(j,k) A(j,k) - A(j,i) A(i,k)
for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) A(i1n,i1n) A(i1n , i1n )
- A(i1n , i) A(i , i1n)
What GE really computes
  • Call the strictly lower triangular matrix of
    multipliers M, and let L IM
  • Call the upper triangle of the final matrix U
  • Lemma (LU Factorization) If the above algorithm
    terminates (does not divide by zero) then A LU
  • Solving Axb using GE
  • Factorize A LU using GE
    (cost 2/3 n3 flops)
  • Solve Ly b for y, using substitution (cost
    n2 flops)
  • Solve Ux y for x, using substitution (cost
    n2 flops)
  • Thus Ax (LU)x L(Ux) Ly b as desired

for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) A(i1n,i1n) A(i1n , i1n ) -
A(i1n , i) A(i , i1n)
Problems with basic GE algorithm
  • What if some A(i,i) is zero? Or very small?
  • Result may not exist, or be unstable, so need
    to pivot
  • Current computation all BLAS 1 or BLAS 2, but we
    know that BLAS 3 (matrix multiply) is fastest
    (Lecture 2)

for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) BLAS 1 (scale a vector)
A(i1n,i1n) A(i1n , i1n ) BLAS 2
(rank-1 update) - A(i1n , i)
A(i , i1n)
Pivoting in Gaussian Elimination
  • A 0 1 fails completely, even though
    A is easy
  • 1 0
  • Illustrate problems in 3-decimal digit
  • A 1e-4 1 and b 1 ,
    correct answer to 3 places is x 1
  • 1 1
  • Result of LU decomposition is
  • L 1 0 1
    0 No roundoff error yet
  • fl(1/1e-4) 1 1e4
  • U 1e-4 1
    1e-4 1 Error in 4th decimal place
  • 0 fl(1-1e41)
    0 -1e4
  • Check if A LU 1e-4 1
    (2,2) entry entirely wrong
  • 1

Gaussian Elimination with Partial Pivoting (GEPP)
  • Partial Pivoting swap rows so that each
  • L(i,j) A(j,i)/A(i,i) lt

for i 1 to n-1 find and record k where
A(k,i) maxi lt j lt n A(j,i)
i.e. largest entry in rest of column i if
A(k,i) 0 exit with a warning that A
is singular, or nearly so elseif k ! i
swap rows i and k of A end if
A(i1n,i) A(i1n,i) / A(i,i) each
quotient lies in -1,1 A(i1n,i1n)
A(i1n , i1n ) - A(i1n , i) A(i , i1n)
  • Lemma This algorithm computes A PLU, where
    P is a
  • permutation matrix
  • Since each entry of L(i,j) lt 1, this
    algorithm is considered
  • numerically stable
  • For details see LAPACK code at

Converting BLAS2 to BLAS3 in GEPP
  • Blocking
  • Used to optimize matrix-multiplication
  • Harder here because of data dependencies in GEPP
  • Delayed Updates
  • Save updates to trailing matrix from several
    consecutive BLAS2 updates
  • Apply many saved updates simultaneously in one
    BLAS3 operation
  • Same idea works for much of dense linear algebra
  • Open questions remain
  • Need to choose a block size b
  • Algorithm will save and apply b updates
  • b must be small enough so that active submatrix
    consisting of b columns of A fits in cache
  • b must be large enough to make BLAS3 fast

Blocked GEPP (www.netlib.org/lapack/single/sgetr
for ib 1 to n-1 step b Process matrix b
columns at a time end ib b-1
Point to end of block of b columns
apply BLAS2 version of GEPP to get A(ibn ,
ibend) P L U let LL denote the
strict lower triangular part of A(ibend ,
ibend) I A(ibend , end1n) LL-1
A(ibend , end1n) update next b rows
of U A(end1n , end1n ) A(end1n ,
end1n ) - A(end1n , ibend)
A(ibend , end1n)
apply delayed updates with
single matrix-multiply
with inner dimension b
(For a correctness proof, see on-lines notes.)
Efficiency of Blocked GEPP
Overview of LAPACK
  • Standard library for dense/banded linear algebra
  • Linear systems Axb
  • Least squares problems minx Ax-b 2
  • Eigenvalue problems Ax lx, Ax lBx
  • Singular value decomposition (SVD) A USVT
  • Algorithms reorganized to use BLAS3 as much as
  • Basis of math libraries on many computers
  • Many algorithmic innovations remain
  • Projects available

Performance of LAPACK (n1000)
Performance of LAPACK (n100)
Parallelizing Gaussian Elimination
  • parallelization steps
  • Decomposition identify enough parallel work, but
    not too much
  • Assignment load balance work among threads
  • Orchestrate communication and synchronization
  • Mapping which processors execute which threads
  • Decomposition
  • In BLAS 2 algorithm nearly each flop in inner
    loop can be done in parallel, so with n2
    processors, need 3n parallel steps
  • This is too fine-grained, prefer calls to local
    matmuls instead

for i 1 to n-1 A(i1n,i) A(i1n,i) /
A(i,i) BLAS 1 (scale a vector)
A(i1n,i1n) A(i1n , i1n ) BLAS 2
(rank-1 update) - A(i1n , i)
A(i , i1n)
Assignment of parallel work in GE
  • Think of assigning submatrices to threads, where
    each thread responsible for updating submatrix it
  • owner computes rule natural because of locality
  • What should submatrices look like to achieve load

Different Data Layouts for Parallel GE (on 4
Load balanced, but cant easily use BLAS2 or BLAS3
Bad load balance P0 idle after first n/4 steps
Can trade load balance and BLAS2/3 performance
by choosing b, but factorization of block column
is a bottleneck
The winner!
Complicated addressing
Blocked Partitioned Algorithms
  • Orthogonal reduction to
  • (upper) Hessenberg form
  • symmetric tridiagonal form
  • bidiagonal form
  • Block QR iteration for nonsymmetric eigenvalue
  • LU Factorization
  • Cholesky factorization
  • Symmetric indefinite factorization
  • Matrix inversion
  • QR, QL, RQ, LQ factorizations
  • Form Q or QTC

Memory Hierarchy and LAPACK
  • ijk - implementations
  • Effects order in which
    data referenced some better at
    allowing data to keep in higher levels of
    memory hierarchy.
  • Applies for matrix multiply, reductions to
    condensed form
  • May do slightly more
  • Up to 3 times faster

Derivation of Blocked AlgorithmsCholesky
Factorization A UTU

Equating coefficient of the jth column, we obtain
Hence, if U11 has already been computed, we can
compute uj and ujj from the equations
LINPACK Implementation
  • Here is the body of the LINPACK routine SPOFA
    which implements the method
  • DO 30 J 1, N
  • INFO J
  • S 0.0E0
  • JM1 J - 1
  • IF( JM1.LT.1 ) GO TO 20
  • DO 10 K 1, JM1
  • T A( K, J ) - SDOT( K-1,
    A( 1, K ), 1,A( 1, J ), 1 )
  • T T / A( K, K )
  • A( K, J ) T
  • S S TT
  • S A( J, J ) - S
  • C ...EXIT
  • IF( S.LE.0.0E0 ) GO TO 40
  • A( J, J ) SQRT( S )

LAPACK Implementation
  • DO 10 J 1, N
  • CALL STRSV( 'Upper', 'Transpose',
    'Non-Unit, J-1, A, LDA, A( 1, J ), 1 )
  • S A( J, J ) - SDOT( J-1, A( 1, J ),
    1, A( 1, J ), 1 )
  • IF( S.LE.ZERO ) GO TO 20
  • A( J, J ) SQRT( S )
  • This change by itself is sufficient to
    significantly improve the performance on a number
    of machines.
  • From 238 to 312 Mflop/s for a matrix of order 500
    on a Pentium 4-1.7 GHz.
  • However on peak is 1,700 Mflop/s.
  • Suggest further work needed.

Derivation of Blocked Algorithms

Equating coefficient of second block of columns,
we obtain
Hence, if U11 has already been computed, we can
compute U12 as the solution of the following
equations by a call to the Level 3 BLAS routine
LAPACK Blocked Algorithms
DO 10 J 1, N, NB CALL STRSM( 'Left',
'Upper', 'Transpose','Non-Unit', J-1, JB, ONE, A,
LDA, A( 1, J ), LDA )
CALL SSYRK( 'Upper', 'Transpose', JB, J-1,-ONE,
A( 1, J ), LDA, ONE, A( J, J ),
LDA ) CALL SPOTF2( 'Upper', JB, A( J, J ),
LDA, INFO ) IF( INFO.NE.0 ) GO TO 20 10
  • On Pentium 4, L3 BLAS squeezes a lot more out of
    1 proc

Intel Pentium 4 1.7 GHz N 500 Rate of Execution
Linpack variant (L1B) 238 Mflop/s
Level 2 BLAS Variant 312 Mflop/s
Level 3 BLAS Variant 1262 Mflop/s
LAPACK Contents
  • Combines algorithms from LINPACK and EISPACK into
    a single package. User interface similar to
  • Built on the L 1, 2 and 3 BLAS, for high
    performance (manufacturers optimize BLAS)
  • LAPACK does not provide routines for structured
    problems or general sparse matrices (i.e sparse
    storage formats such as compressed-row, -column,
    -diagonal, skyline ...).

LAPACK Ongoing Work
  • Add functionality
  • updating/downdating, divide and conquer least
    squares,bidiagonal bisection, bidiagonal inverse
    iteration, band SVD, Jacobi methods, ...
  • Move to new generation of high performance
  • IBM SPs, CRAY T3E, SGI Origin, clusters of
  • New challenges
  • New languages FORTRAN 90, HP FORTRAN, ...
  • (CMMD, MPL, NX ...)
  • many flavors of message passing, need standard
  • Highly varying ratio
  • Many ways to layout data,
  • Fastest parallel algorithm sometimes less stable

History of Block Partitioned Algorithms
  • Early algorithms involved use of small main
    memory using tapes as secondary storage.
  • Recent work centers on use of vector registers,
    level 1 and 2 cache, main memory, and out of
    core memory.

Blocked Partitioned Algorithms
  • Orthogonal reduction to
  • (upper) Hessenberg form
  • symmetric tridiagonal form
  • bidiagonal form
  • Block QR iteration for nonsymmetric eigenvalue
  • LU Factorization
  • Cholesky factorization
  • Symmetric indefinite factorization
  • Matrix inversion
  • QR, QL, RQ, LQ factorizations
  • Form Q or QTC
Write a Comment
User Comments (0)
About PowerShow.com