Title: Minimizing Communication in Numerical Linear Algebra: Sparse-Matrix-Vector Multiplication (SpMV) (www.cs.berkeley.edu/~demmel)
1. Minimizing Communication in Numerical Linear Algebra: Sparse-Matrix-Vector Multiplication (SpMV)
www.cs.berkeley.edu/~demmel
Jim Demmel, EECS and Math Departments, UC Berkeley
demmel@cs.berkeley.edu
2. Outline
- Motivation for automatic performance tuning
- Results for sparse matrix kernels
- Sparse Matrix-Vector Multiplication (SpMV)
- Sequential and multicore results
- OSKI = Optimized Sparse Kernel Interface
- Need to understand tuning of a single SpMV to understand the opportunities and limits for tuning entire sparse solvers
3. Berkeley Benchmarking and OPtimization (BeBOP)
- Prof. Katherine Yelick
- Current members: Kaushik Datta, Mark Hoemmen, Marghoob Mohiyuddin, Shoaib Kamil, Rajesh Nishtala, Vasily Volkov, Sam Williams, ...
- Previous members: Hormozd Gahvari, Eun-Jin Im, Ankit Jain, Rich Vuduc, many undergrads, ...
- Many results here are from current and previous students
- bebop.cs.berkeley.edu
4. Automatic Performance Tuning
- Goal: let the machine do the hard work of writing fast code
- What does tuning of dense BLAS, FFTs, signal processing, etc. have in common?
  - Can do the tuning off-line, once per architecture and algorithm
  - Can take as much time as necessary (hours, a week)
  - At run-time, algorithm choice may depend only on a few parameters (matrix dimensions, size of FFT, etc.)
- Can't always do tuning off-line
  - Algorithm and implementation may strongly depend on data known only at run-time
  - Ex: the sparse matrix nonzero pattern determines both the best data structure and the best implementation of sparse-matrix-vector multiplication (SpMV)
  - Part of the search for the best algorithm must be done (very quickly!) at run-time
5. Source: Accelerator Cavity Design Problem (Ko via Husbands)
6. Linear Programming Matrix
7. A Sparse Matrix You Encounter Every Day
8. SpMV with Compressed Sparse Row (CSR) Storage
Matrix-vector multiply kernel: y(i) <- y(i) + A(i,j) * x(j)
for each row i:
    for k = ptr[i] to ptr[i+1] - 1:
        y[i] = y[i] + val[k] * x[ind[k]]
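As a concrete reference point, here is a minimal C rendering of that kernel, assuming 0-based CSR arrays (ptr of length n+1, ind and val of length nnz); the function name and signature are illustrative, not from the slides.

#include <stddef.h>

/* y += A*x for an n-row matrix A stored in 0-based CSR format. */
void spmv_csr(size_t n, const size_t *ptr, const size_t *ind,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double yi = y[i];
        for (size_t k = ptr[i]; k < ptr[i+1]; ++k)
            yi += val[k] * x[ind[k]];   /* 2 flops per stored nonzero */
        y[i] = yi;
    }
}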
9. Example: The Difficulty of Tuning
- n = 21,200
- nnz = 1.5 M
- Kernel: SpMV
- Source: NASA structural analysis problem
10. Example: The Difficulty of Tuning
- n = 21,200
- nnz = 1.5 M
- Kernel: SpMV
- Source: NASA structural analysis problem
- 8x8 dense substructure: exploit this to limit memory references
11. Taking Advantage of Block Structure in SpMV
- Bottleneck is the time to get the matrix from memory
  - Only 2 flops for each nonzero in the matrix
- Don't store each nonzero with its own index; instead store each nonzero r-by-c block with one index
  - Storage drops by up to 2x, if r*c >> 1 and all quantities are 32-bit
  - Time to fetch the matrix from memory decreases
- Change both data structure and algorithm (see the sketch below)
  - Need to pick r and c
  - Need to change the algorithm accordingly
- In the example, is r = c = 8 the best choice?
  - Minimizes storage, so it looks like a good idea
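A minimal sketch of the blocked (BCSR) variant, shown for a fixed 2x2 register block; the function and array names are illustrative. Each block index addresses an r-by-c tile, so the corresponding pieces of x and y can be kept in registers while the block is multiplied.

/* y += A*x for A stored in 2x2 block CSR (BCSR): nb block rows,
 * b_ptr/b_ind index 2x2 blocks, b_val holds each block row-major. */
void spmv_bcsr_2x2(int nb, const int *b_ptr, const int *b_ind,
                   const double *b_val, const double *x, double *y)
{
    for (int I = 0; I < nb; ++I) {
        double y0 = y[2*I], y1 = y[2*I + 1];
        for (int k = b_ptr[I]; k < b_ptr[I+1]; ++k) {
            const double *b = b_val + 4*k;   /* one 2x2 block */
            int j = 2 * b_ind[k];            /* first column of the block */
            double x0 = x[j], x1 = x[j+1];
            y0 += b[0]*x0 + b[1]*x1;
            y1 += b[2]*x0 + b[3]*x1;
        }
        y[2*I]     = y0;
        y[2*I + 1] = y1;
    }
}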
12. Speedups on Itanium 2: The Need for Search
[Figure: register-blocking performance profiles, in Mflop/s]
13. Register Profile: Itanium 2
[Figure: SpMV register-blocking profile on Itanium 2; performance ranges from 190 Mflop/s to 1190 Mflop/s depending on the r x c block size]
14. Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profile heatmaps for four machines; performance across block sizes:]
- Power3 (17% of peak): 122 to 252 Mflop/s
- Power4 (16% of peak): 459 to 820 Mflop/s
- Itanium 1 (8% of peak): 107 to 247 Mflop/s
- Itanium 2 (33% of peak): 190 Mflop/s to 1.2 Gflop/s
15. Register Profiles: Sun and Intel x86
[Figure: register-blocking profile heatmaps for four machines; performance across block sizes:]
- Ultra 2i (11% of peak): 35 to 72 Mflop/s
- Ultra 3 (5% of peak): 50 to 90 Mflop/s
- Pentium III (21% of peak): 42 to 108 Mflop/s
- Pentium III-M (15% of peak): 58 to 122 Mflop/s
16. Another Example of Tuning Challenges
- More complicated nonzero structure in general
- N = 16,614
- NNZ = 1.1 M
17. Zoom In to Top Corner
- More complicated nonzero structure in general
- N = 16,614
- NNZ = 1.1 M
18. 3x3 Blocks Look Natural, but...
- More complicated nonzero structure in general
- Example: 3x3 blocking
  - Logical grid of 3x3 cells
- But would lead to lots of fill-in
19. Extra Work Can Improve Efficiency!
- More complicated nonzero structure in general
- Example: 3x3 blocking
  - Logical grid of 3x3 cells
  - Fill in explicit zeros
  - Unroll 3x3 block multiplies
  - Fill ratio = 1.5
- On Pentium III: 1.5x speedup!
  - Actual Mflop rate is 1.5^2 = 2.25x higher (the relation is spelled out below)
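To make the arithmetic in the last bullet explicit (using the slide's numbers; the notation is ours): the "actual" Mflop rate counts operations on the filled-in zeros, so it grows by both the time speedup and the fill ratio.

\[
\frac{\text{raw Mflop/s}_{\text{blocked}}}{\text{Mflop/s}_{\text{CSR}}}
= \underbrace{\frac{t_{\text{CSR}}}{t_{\text{blocked}}}}_{1.5\times \text{ speedup}}
\times
\underbrace{\frac{\mathrm{nnz}(A_{\text{blocked}})}{\mathrm{nnz}(A)}}_{1.5 \text{ fill ratio}}
= 2.25 .
\]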
20. Automatic Register Block Size Selection
- Selecting the r x c block size
- Off-line benchmark
  - Precompute Mflops(r,c) using a dense matrix A, for each r x c
  - Once per machine/architecture
- Run-time search
  - Sample A to estimate Fill(r,c) for each r x c
- Run-time heuristic model
  - Choose r, c to minimize estimated time ~ Fill(r,c) / Mflops(r,c) (see the sketch below)
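A minimal sketch of that heuristic, assuming the off-line benchmark results and a run-time fill estimator are already available; the table mflops[][] and the function estimate_fill are hypothetical names, not part of any library API.

#define RMAX 8
#define CMAX 8

/* Filled in once per machine by the off-line dense benchmark. */
extern double mflops[RMAX + 1][CMAX + 1];

/* Run-time estimate of nnz(blocked A) / nnz(A) for an r x c blocking. */
extern double estimate_fill(const void *A, int r, int c);

/* Pick the block size minimizing estimated time ~ Fill(r,c) / Mflops(r,c). */
void choose_block_size(const void *A, int *r_best, int *c_best)
{
    double best = 1e300;
    for (int r = 1; r <= RMAX; ++r)
        for (int c = 1; c <= CMAX; ++c) {
            double t = estimate_fill(A, r, c) / mflops[r][c];
            if (t < best) { best = t; *r_best = r; *c_best = c; }
        }
}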
21. Accurate and Efficient Adaptive Fill Estimation
- Idea: sample the matrix
  - Fraction of matrix to sample: s in [0,1]
  - Control cost = O(s * nnz) by controlling s
  - The search happens at run-time, so the constant matters!
- Control s automatically by computing statistical confidence intervals, by monitoring the variance
- Cost of tuning
  - Lower bound: converting the matrix costs 5 to 40 unblocked SpMVs
  - Heuristic: 1 to 11 SpMVs
- Tuning is only useful when we do many SpMVs
  - Common case, e.g., in sparse solvers
(A sampling sketch follows.)
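A minimal sketch of the sampling idea for one (r,c) pair, assuming 0-based CSR input; it ignores the confidence-interval control and simply examines a random fraction s of the block rows. The function name and the use of rand() are illustrative choices, not from the slides.

#include <stdlib.h>

/* Estimate Fill(r,c) = nnz(blocked A) / nnz(A) by examining a random
 * fraction s of the block rows.  n rows, ncols columns, 0-based CSR. */
double estimate_fill_csr(int n, int ncols, const int *ptr, const int *ind,
                         int r, int c, double s)
{
    int nbcols = (ncols + c - 1) / c;
    char *touched = calloc(nbcols, 1);        /* marks block columns seen */
    long nnz_seen = 0, blocks_seen = 0;

    for (int I = 0; I < (n + r - 1) / r; ++I) {
        if ((double)rand() / RAND_MAX > s) continue;   /* sample fraction s */
        long nblocks = 0;
        for (int i = I * r; i < (I + 1) * r && i < n; ++i)
            for (int k = ptr[i]; k < ptr[i + 1]; ++k) {
                int J = ind[k] / c;
                if (!touched[J]) { touched[J] = 1; ++nblocks; }
                ++nnz_seen;
            }
        blocks_seen += nblocks;
        /* reset marks for the next block row */
        for (int i = I * r; i < (I + 1) * r && i < n; ++i)
            for (int k = ptr[i]; k < ptr[i + 1]; ++k)
                touched[ind[k] / c] = 0;
    }
    free(touched);
    return nnz_seen ? (double)(blocks_seen * r * c) / nnz_seen : 1.0;
}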
22. Accuracy of the Tuning Heuristics (1/4)
See p. 375 of Vuduc's thesis for the matrices.
NOTE: "fair" flops used (ops on explicit zeros not counted as work).
23. Accuracy of the Tuning Heuristics (2/4)
24. Accuracy of the Tuning Heuristics (2/4)
DGEMV
25. Upper Bounds on Performance for Blocked SpMV
- P = (flops) / (time)
  - Flops = 2 * nnz(A)
- Upper bound on speed: two main assumptions
  - 1. Count memory ops only (streaming)
  - 2. Count only compulsory and capacity misses: ignore conflicts
- Upper bound on time: assume all references to x() miss
- Account for line sizes
- Account for matrix size and nnz
- Charge the minimum access latency α_i at the L_i cache and α_mem at memory
  - e.g., from the Saavedra-Barrera and PMaC MAPS benchmarks
(A sketch of the resulting bound is below.)
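Under those assumptions the bound takes roughly the following form (our notation, in the style of the model in Vuduc's thesis; H_i is the number of hits at cache level i and M the number of misses that go to memory):

\[
P_{\max}(r,c) = \frac{2\,\mathrm{nnz}(A)}{T_{\min}(r,c)},
\qquad
T_{\min}(r,c) \ge \sum_{i} H_i(r,c)\,\alpha_i + M(r,c)\,\alpha_{\mathrm{mem}},
\]

where the loads and misses are computed from the blocked data-structure size (including fill), the cache line sizes, and the assumption that every reference to x misses.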
26. Example: Bounds on Itanium 2
27. Example: Bounds on Itanium 2
28. Example: Bounds on Itanium 2
29. Summary of Other Performance Optimizations
- Optimizations for SpMV
  - Register blocking (RB): up to 4x over CSR
  - Variable block splitting: 2.1x over CSR, 1.8x over RB
  - Diagonals: 2x over CSR
  - Reordering to create dense structure + splitting: 2x over CSR
  - Symmetry: 2.8x over CSR, 2.6x over RB
  - Cache blocking: 2.8x over CSR
  - Multiple vectors (SpMM): 7x over CSR
  - And combinations...
- Sparse triangular solve
  - Hybrid sparse/dense data structure: 1.8x over CSR
- Higher-level kernels
  - A*A^T*x, A^T*A*x: 4x over CSR, 1.8x over RB
  - A^2*x: 2x over CSR, 1.5x over RB
  - [Ax, A^2x, A^3x, ..., A^kx]: more to say later
30. Example: Sparse Triangular Factor
- Raefsky4 (structural problem) + SuperLU + colmmd
- N = 19,779, nnz = 12.6 M
31. Cache Optimizations for A*A^T*x
- Cache-level: interleave multiplication by A and A^T
  - Only fetch A from memory once (see the sketch below)
- Register-level: treat a_i^T as an r x c block row, or a diagonal row
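A minimal, unblocked sketch of the cache-level interleaving, assuming A is stored by columns (CSC); the function and array names are illustrative. Using A*A^T = sum_i a_i * a_i^T over the columns a_i of A, each column is fetched from memory once and used twice: a dot product, then an axpy. (With row-oriented CSR storage, the same loop computes A^T*A*x instead.)

/* y = (A * A^T) * x for A in compressed sparse column (CSC) form.
 * y must be zeroed on entry and has length equal to the row count of A. */
void aat_x(int ncols, const int *colptr, const int *rowind,
           const double *val, const double *x, double *y)
{
    for (int i = 0; i < ncols; ++i) {
        double t = 0.0;
        for (int k = colptr[i]; k < colptr[i+1]; ++k)   /* t = a_i^T * x  */
            t += val[k] * x[rowind[k]];
        for (int k = colptr[i]; k < colptr[i+1]; ++k)   /* y += t * a_i   */
            y[rowind[k]] += t * val[k];
    }
}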
32. Example: Combining Optimizations (1/2)
- Register blocking, symmetry, multiple (k) vectors
- Three low-level tuning parameters: r, c, v (the multiple-vector part is sketched below)
[Figure: Y = A * X, where A has r x c register blocks, X holds k vectors, and v of them are processed at a time]
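The multiple-vector part of this combination is the easiest to sketch in isolation. Below is a minimal, unblocked, nonsymmetric version for an n x n CSR matrix and v dense vectors stored column-major with leading dimension n; names are illustrative. The point is that each matrix entry is fetched once and reused across all v vectors.

/* Y(:,0..v-1) += A * X(:,0..v-1), CSR A, column-major X and Y. */
void spmm_csr(int n, int v, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; ++i)
        for (int k = ptr[i]; k < ptr[i+1]; ++k) {
            double a = val[k];
            int j = ind[k];
            for (int m = 0; m < v; ++m)      /* reuse a across all vectors */
                y[i + m*n] += a * x[j + m*n];
        }
}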
33. Example: Combining Optimizations (2/2)
- Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB]
- Symmetric, blocked, 1 vector
  - Up to 2.6x over nonsymmetric, blocked, 1 vector
- Symmetric, blocked, k vectors
  - Up to 2.1x over nonsymmetric, blocked, k vectors
  - Up to 7.3x over nonsymmetric, nonblocked, 1 vector
- Symmetric storage: up to 64.7% savings
34. Potential Impact on Applications: Omega3P
- Application: accelerator cavity design [Ko]
- Relevant optimization techniques
  - Symmetric storage
  - Register blocking
  - Reordering, to create more dense blocks
- Reverse Cuthill-McKee ordering to reduce bandwidth
  - Do breadth-first search, number nodes in reverse order visited
- Traveling Salesman Problem-based ordering to create blocks (formalized below)
  - Nodes = columns of A
  - Weight(u, v) = number of nonzeros u and v have in common
  - Tour = ordering of columns
  - Choose the maximum-weight tour
  - See [Pinar & Heath '97]
- 2.1x speedup on IBM Power 4
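In symbols (our notation): for columns u and v of A, the weight is the number of rows in which both are nonzero, and the column ordering π is chosen so that heavy pairs end up adjacent.

\[
w(u,v) = \bigl|\{\, i : A_{iu} \neq 0 \text{ and } A_{iv} \neq 0 \,\}\bigr|,
\qquad
\max_{\pi} \sum_{j=1}^{n-1} w\bigl(\pi(j), \pi(j{+}1)\bigr).
\]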
35. Source: Accelerator Cavity Design Problem (Ko via Husbands)
36. Post-RCM Reordering
37. 100x100 Submatrix Along Diagonal
38. Microscopic Effect of RCM Reordering
Before: Green + Red; After: Green + Blue
39. Microscopic Effect of Combined RCM + TSP Reordering
Before: Green + Red; After: Green + Blue
40. (Omega3P)
41. Optimized Sparse Kernel Interface - OSKI
- Provides sparse kernels automatically tuned for the user's matrix and machine
  - BLAS-style functionality: SpMV (Ax and A^T*y), TrSV
  - Hides the complexity of run-time tuning
  - Includes new, faster locality-aware kernels: A^T*A*x, A^k*x
- Faster than standard implementations
  - Up to 4x faster matvec, 1.8x trisolve, 4x A^T*A*x
- For advanced users and solver library writers
  - Available as a stand-alone library (OSKI 1.0.1h, 6/07)
  - Available as a PETSc extension (OSKI-PETSc .1d, 3/06)
- bebop.cs.berkeley.edu/oski
42. How OSKI Tunes (Overview)
Library install-time (off-line):
  1. Build for the target architecture
  2. Benchmark; the benchmark data and generated code variants are kept for later use
Application run-time:
  1. Evaluate heuristic models, using the benchmark data, the user's matrix, and a workload (from program monitoring or history)
  2. Select a data structure and code; return a matrix handle to the user for kernel calls
Extensibility: advanced users may write and dynamically add code variants and heuristic models to the system.
43. How to Call OSKI: Basic Usage
- May gradually migrate existing apps
  - Step 1: wrap existing data structures
  - Step 2: make BLAS-like kernel calls

int *ptr = ..., *ind = ...; double *val = ...;   /* Matrix, in CSR format */
double *x = ..., *y = ...;                       /* Let x and y be two dense vectors */

/* Compute y = beta*y + alpha*A*x, 500 times */
for (i = 0; i < 500; i++)
    my_matmult(ptr, ind, val, alpha, x, beta, y);
44. How to Call OSKI: Basic Usage
- May gradually migrate existing apps
  - Step 1: wrap existing data structures
  - Step 2: make BLAS-like kernel calls

int *ptr = ..., *ind = ...; double *val = ...;   /* Matrix, in CSR format */
double *x = ..., *y = ...;                       /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                            SHARE_INPUTMAT, ...);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = beta*y + alpha*A*x, 500 times */
for (i = 0; i < 500; i++)
    my_matmult(ptr, ind, val, alpha, x, beta, y);
45. How to Call OSKI: Basic Usage
- May gradually migrate existing apps
  - Step 1: wrap existing data structures
  - Step 2: make BLAS-like kernel calls

int *ptr = ..., *ind = ...; double *val = ...;   /* Matrix, in CSR format */
double *x = ..., *y = ...;                       /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                            SHARE_INPUTMAT, ...);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Step 2: Compute y = beta*y + alpha*A*x, 500 times */
for (i = 0; i < 500; i++)
    oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view);
46. How to Call OSKI: Tune with Explicit Hints
- User calls a tune routine
  - May provide explicit tuning hints (OPTIONAL)

oski_matrix_t A_tunable = oski_CreateMatCSR( ... );
/* ... */

/* Tell OSKI we will call SpMV 500 times (workload hint) */
oski_SetHintMatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view, 500);
/* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

oski_TuneMat(A_tunable);   /* Ask OSKI to tune */

for (i = 0; i < 500; i++)
    oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view);
47. How the User Calls OSKI: Implicit Tuning
- Ask the library to infer the workload
  - Library profiles all kernel calls
  - May periodically re-tune

oski_matrix_t A_tunable = oski_CreateMatCSR( ... );
/* ... */

for (i = 0; i < 500; i++) {
    oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view);
    oski_TuneMat(A_tunable);   /* Ask OSKI to tune */
}
48. Multicore SMPs Used for Tuning SpMV
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- Sun T2 T5140 (Victoria Falls)
- IBM QS20 Cell Blade
49. Multicore SMPs with a Conventional Cache-Based Memory Hierarchy
[Figure: the same four platforms; the AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), and Sun T2 T5140 (Victoria Falls) use conventional cache hierarchies]
50. Multicore SMPs with a Local-Store-Based Memory Hierarchy
[Figure: the same four platforms; the IBM QS20 Cell Blade's SPEs use software-managed local stores]
51. Multicore SMPs with Chip MultiThreading (CMT)
[Figure: the same four platforms]
52. Multicore SMPs: Number of Threads
- AMD Opteron 2356 (Barcelona): 8 threads
- Intel Xeon E5345 (Clovertown): 8 threads
- IBM QS20 Cell Blade: 16 threads (SPEs only)
- Sun T2 T5140 (Victoria Falls): 128 threads
53. Multicore SMPs: Peak Double-Precision Flop Rate
- AMD Opteron 2356 (Barcelona): 74 Gflop/s
- Intel Xeon E5345 (Clovertown): 75 Gflop/s
- IBM QS20 Cell Blade: 29 Gflop/s (SPEs only)
- Sun T2 T5140 (Victoria Falls): 19 Gflop/s
54. Multicore SMPs: Total DRAM Bandwidth
- AMD Opteron 2356 (Barcelona): 21 GB/s
- Intel Xeon E5345 (Clovertown): 21 GB/s (read), 10 GB/s (write)
- IBM QS20 Cell Blade: 51 GB/s (SPEs only)
- Sun T2 T5140 (Victoria Falls): 42 GB/s (read), 21 GB/s (write)
55. Multicore SMPs with Non-Uniform Memory Access (NUMA)
[Figure: the same four platforms; all but the Intel Xeon E5345 (Clovertown), whose sockets share a single front-side-bus memory controller, are NUMA]
56. Set of 14 Test Matrices
- All bigger than the caches of our SMPs
- Dense: 2K x 2K dense matrix stored in sparse format
- Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology
- Poorly structured (hodgepodge): FEM/Accelerator, Circuit, webbase
- Extreme aspect ratio (linear programming): LP
57. SpMV Performance: Naive Parallelization
- Out-of-the-box SpMV performance on a suite of 14 matrices
- Scalability isn't great: compare to the number of hardware threads (8, 8, 128, 16)
[Chart: Naïve and Naïve Pthreads performance on each platform]
58. SpMV Performance: NUMA and Software Prefetching
- NUMA-aware allocation is essential on NUMA SMPs
- Explicit software prefetching can boost bandwidth and change cache replacement policies
- Used exhaustive search
59. SpMV Performance: Matrix Compression
- Compression includes:
  - register blocking
  - other formats
  - smaller indices
- Use a heuristic rather than search
60. SpMV Performance: Cache and TLB Blocking
[Chart legend (top to bottom): +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]
61. SpMV Performance: Architecture-Specific Optimizations
[Chart legend (top to bottom): +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]
62. SpMV Performance: Maximum Speedup
- Fully auto-tuned SpMV performance across the suite of matrices
- Includes the SPE/local-store-optimized version
- Why do some optimizations work better on some architectures?
- Maximum speedups over the naive baseline: 2.7x, 4.0x, 2.9x, and 35x (the 35x on the Cell blade, where the tuned version uses the SPEs)
[Chart legend (top to bottom): +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]
63. Extra slides