Transcript and Presenter's Notes

Title: Minimizing Communication in Numerical Linear Algebra www.cs.berkeley.edu/~demmel Sparse-Matrix-Vector-Multiplication (SpMV)


1
Minimizing Communication in Numerical Linear Algebra
www.cs.berkeley.edu/~demmel
Sparse-Matrix-Vector-Multiplication (SpMV)
Jim Demmel, EECS & Math Departments, UC Berkeley
demmel@cs.berkeley.edu
2
Outline
  • Motivation for Automatic Performance Tuning
  • Results for sparse matrix kernels
  • Sparse Matrix Vector Multiplication (SpMV)
  • Sequential and Multicore results
  • OSKI = Optimized Sparse Kernel Interface
  • Need to understand tuning of single SpMV to
    understand opportunities and limits for tuning
    entire sparse solvers

2
3
Berkeley Benchmarking and OPtimization (BeBOP)
  • Prof. Katherine Yelick
  • Current members
  • Kaushik Datta, Mark Hoemmen, Marghoob Mohiyuddin,
    Shoaib Kamil, Rajesh Nishtala, Vasily Volkov,
    Sam Williams,
  • Previous members
  • Hormozd Gahvari, Eun-Jin Im, Ankit Jain, Rich
    Vuduc, many undergrads, ...
  • Many results here are from current and previous students
  • bebop.cs.berkeley.edu

3
4
Automatic Performance Tuning
  • Goal: Let the machine do the hard work of writing
    fast code
  • What does tuning of dense BLAS, FFTs, signal
    processing, ... have in common?
  • Can do the tuning off-line, once per
    architecture and algorithm
  • Can take as much time as necessary (hours, a
    week, ...)
  • At run-time, algorithm choice may depend only on
    a few parameters (matrix dimensions, size of FFT,
    etc.)
  • Can't always do tuning off-line
  • Algorithm and implementation may strongly depend
    on data known only at run-time
  • Ex: a sparse matrix's nonzero pattern determines both
    the best data structure and the best implementation of
    sparse-matrix-vector multiplication (SpMV)
  • Part of the search for the best algorithm must then be done
    (very quickly!) at run-time

4
5
Source: Accelerator Cavity Design Problem (Ko via Husbands)
5
6
Linear Programming Matrix

6
7
A Sparse Matrix You Encounter Every Day
7
8
SpMV with Compressed Sparse Row (CSR) Storage
Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)·x(j)

    for each row i:
        for k = ptr[i] to ptr[i+1] - 1:
            y[i] = y[i] + val[k] * x[ind[k]]
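For concreteness, the same kernel as a self-contained C routine (a sketch; the array and function names are mine, not from the slides):

/* CSR SpMV sketch: y = y + A*x.
   ptr has n+1 entries; row i's nonzeros are val[ptr[i] .. ptr[i+1]-1],
   with matching column indices in ind[]. */
void spmv_csr(int n, const int *ptr, const int *ind, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            yi += val[k] * x[ind[k]];
        y[i] = yi;
    }
}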
8
9
Example: The Difficulty of Tuning
  • n = 21,200
  • nnz = 1.5 M
  • kernel: SpMV
  • Source: NASA structural analysis problem

9
10
Example: The Difficulty of Tuning
  • n = 21,200
  • nnz = 1.5 M
  • kernel: SpMV
  • Source: NASA structural analysis problem
  • 8x8 dense substructure: exploit this to limit
    #mem_refs

10
11
Taking advantage of block structure in SpMV
  • Bottleneck is the time to get the matrix from memory
  • Only 2 flops for each nonzero in the matrix
  • Don't store each nonzero with its own index; instead
    store each nonzero r-by-c block with one index
  • Storage drops by up to 2x, if r·c >> 1, all 32-bit
    quantities
  • Time to fetch the matrix from memory decreases
  • Change both data structure and algorithm
  • Need to pick r and c
  • Need to change the algorithm accordingly
  • In the example, is r = c = 8 the best choice?
  • It minimizes storage, so it looks like a good idea

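As a hedged illustration of the r-by-c idea (the storage layout and names below are mine, not OSKI's), SpMV with 2x2 register blocks might look like this:

/* Sketch of 2x2 register-blocked SpMV: y = y + A*x (assumes n is even).
   brow_ptr has n/2 + 1 entries; block row I's blocks are b = brow_ptr[I]
   .. brow_ptr[I+1]-1, stored row-major in bval[4*b .. 4*b+3], and bind[b]
   is the first column of block b: one index per 2x2 block instead of four. */
void spmv_bcsr_2x2(int n, const int *brow_ptr, const int *bind,
                   const double *bval, const double *x, double *y)
{
    for (int I = 0; I < n / 2; I++) {
        double y0 = y[2*I], y1 = y[2*I + 1];
        for (int b = brow_ptr[I]; b < brow_ptr[I+1]; b++) {
            const double *v = &bval[4*b];
            int j = bind[b];
            y0 += v[0] * x[j] + v[1] * x[j+1];
            y1 += v[2] * x[j] + v[3] * x[j+1];
        }
        y[2*I] = y0;
        y[2*I + 1] = y1;
    }
}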
12
Speedups on Itanium 2: The Need for Search (performance in Mflop/s)
12
13
Register Profile: Itanium 2
(Performance ranges from 190 Mflop/s to 1190 Mflop/s across register block sizes)
13
14
Register Profiles: IBM and Intel IA-64
  • Power3 (17% of peak): 122 to 252 Mflop/s
  • Power4 (16% of peak): 459 to 820 Mflop/s
  • Itanium 1 (8% of peak): 107 to 247 Mflop/s
  • Itanium 2 (33% of peak): 190 Mflop/s to 1.2 Gflop/s
15
Register Profiles: Sun and Intel x86
  • Ultra 2i (11% of peak): 35 to 72 Mflop/s
  • Ultra 3 (5% of peak): 50 to 90 Mflop/s
  • Pentium III (21% of peak): 42 to 108 Mflop/s
  • Pentium III-M (15% of peak): 58 to 122 Mflop/s
16
Another example of tuning challenges
  • More complicated non-zero structure in general
  • N = 16,614
  • NNZ = 1.1 M

16
17
Zoom in to top corner
  • More complicated non-zero structure in general
  • N = 16,614
  • NNZ = 1.1 M

17
18
3x3 blocks look natural, but
  • More complicated non-zero structure in general
  • Example: 3x3 blocking
  • Logical grid of 3x3 cells
  • But would lead to lots of fill-in

18
19
Extra Work Can Improve Efficiency!
  • More complicated non-zero structure in general
  • Example: 3x3 blocking
  • Logical grid of 3x3 cells
  • Fill in explicit zeros
  • Unroll 3x3 block multiplies
  • Fill ratio = 1.5
  • On Pentium III: 1.5x speedup!
  • Actual Mflop rate is 1.5² = 2.25x higher (the raw rate
    also counts the ops on the filled-in zeros)

19
20
Automatic Register Block Size Selection
  • Selecting the r x c block size
  • Off-line benchmark:
  • Precompute Mflops(r,c) using dense A, for each r x
    c
  • Once per machine/architecture
  • Run-time search:
  • Sample A to estimate Fill(r,c) for each r x c
  • Run-time heuristic model:
  • Choose r, c to minimize time ∝ Fill(r,c) /
    Mflops(r,c)

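A minimal sketch of the heuristic selection step, assuming the off-line Mflops(r,c) table and the run-time Fill(r,c) estimates are already available (function and array names are mine):

/* Pick (r, c) minimizing estimated time, proportional to Fill(r,c) / Mflops(r,c).
   mflops[] is the off-line dense benchmark, fill[] the run-time sampled estimate;
   both are rmax-by-cmax tables stored row-major. */
void choose_block_size(const double *mflops, const double *fill,
                       int rmax, int cmax, int *r_best, int *c_best)
{
    double best = 1e300;
    *r_best = 1; *c_best = 1;
    for (int r = 1; r <= rmax; r++)
        for (int c = 1; c <= cmax; c++) {
            double t = fill[(r-1)*cmax + (c-1)] / mflops[(r-1)*cmax + (c-1)];
            if (t < best) { best = t; *r_best = r; *c_best = c; }
        }
}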
20
21
Accurate and Efficient Adaptive Fill Estimation
  • Idea: Sample the matrix
  • Fraction of matrix to sample: s ∈ [0, 1]
  • Control cost O(s · nnz) by controlling s
  • Search at run-time: the constant matters!
  • Control s automatically by computing statistical
    confidence intervals, i.e. by monitoring variance
  • Cost of tuning:
  • Lower bound: converting the matrix costs 5 to 40
    unblocked SpMVs
  • Heuristic: 1 to 11 SpMVs
  • Tuning is only useful when we do many SpMVs
  • Common case, e.g., in sparse solvers

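A rough sketch of the sampling idea for one (r, c) pair (my own simplification: it samples whole block rows with probability s and omits the confidence-interval logic):

#include <stdlib.h>

/* Estimate Fill(r,c) = (r*c * #blocks) / nnz by scanning a random fraction s
   of the block rows of a CSR matrix. mark[] is scratch of ceil(ncols/c) ints,
   initialized to -1 by the caller. */
double estimate_fill(int n, const int *ptr, const int *ind,
                     int r, int c, double s, int *mark)
{
    long nnz_seen = 0, blocks_seen = 0;
    int nbrows = (n + r - 1) / r;
    for (int I = 0; I < nbrows; I++) {
        if ((double)rand() / RAND_MAX > s) continue;     /* skip unsampled block rows */
        for (int i = I*r; i < I*r + r && i < n; i++)
            for (int k = ptr[i]; k < ptr[i+1]; k++) {
                int J = ind[k] / c;                      /* block column */
                nnz_seen++;
                if (mark[J] != I) { mark[J] = I; blocks_seen++; }
            }
    }
    return nnz_seen ? (double)(blocks_seen * r * c) / nnz_seen : 1.0;
}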
21
22
Accuracy of the Tuning Heuristics (1/4)
See p. 375 of Vuduc's thesis for the list of matrices.
NOTE: "Fair" flops used (ops on explicit zeros
not counted as work)
22
23
Accuracy of the Tuning Heuristics (2/4)
23
24
Accuracy of the Tuning Heuristics (3/4)
DGEMV
24
25
Upper Bounds on Performance for blocked SpMV
  • P = (flops) / (time)
  • Flops = 2 · nnz(A)
  • Upper bound on speed: two main assumptions
  • 1. Count memory ops only (streaming)
  • 2. Count only compulsory and capacity misses; ignore
    conflicts
  • Account for line sizes
  • Account for matrix size and nnz
  • Charge a minimum access latency αᵢ at the Lᵢ cache,
    α_mem at memory
  • e.g., Saavedra-Barrera and PMaC MAPS benchmarks

25
  • Upper bound on time: assume all references to
    x( ) miss

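In my own notation (a sketch of this charging scheme, not necessarily the slide's exact formula): with H_i loads hitting in cache level i at latency alpha_i, and M loads going to memory at latency alpha_mem, the model gives

\[
T \;\ge\; \sum_i \alpha_i H_i \;+\; \alpha_{\mathrm{mem}} M
\qquad\Longrightarrow\qquad
P \;=\; \frac{2\,\mathrm{nnz}(A)}{T}
\;\le\; \frac{2\,\mathrm{nnz}(A)}{\sum_i \alpha_i H_i + \alpha_{\mathrm{mem}} M}.
\]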
26
Example: Bounds on Itanium 2
26
27
Example: Bounds on Itanium 2
27
28
Example: Bounds on Itanium 2
28
29
Summary of Other Performance Optimizations
  • Optimizations for SpMV:
  • Register blocking (RB): up to 4x over CSR
  • Variable block splitting: 2.1x over CSR, 1.8x
    over RB
  • Diagonals: 2x over CSR
  • Reordering to create dense structure + splitting:
    2x over CSR
  • Symmetry: 2.8x over CSR, 2.6x over RB
  • Cache blocking: 2.8x over CSR
  • Multiple vectors (SpMM): 7x over CSR
  • And combinations...
  • Sparse triangular solve:
  • Hybrid sparse/dense data structure: 1.8x over CSR
  • Higher-level kernels:
  • A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  • A²·x: 2x over CSR, 1.5x over RB
  • [Ax, A²x, A³x, ..., Aᵏx]: more to say
    later

29
30
Example: Sparse Triangular Factor
  • Raefsky4 (structural problem) + SuperLU + colmmd
  • N = 19,779, nnz = 12.6 M

30
31
Cache Optimizations for A·Aᵀ·x
  • Cache-level: interleave multiplication by A and Aᵀ
  • Only fetch A from memory once

  • Register-level: aᵢᵀ to be an r×c block row, or diag
    row

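A hedged sketch of the cache-level interleaving for the closely related Aᵀ·A·x kernel with A stored by rows in CSR (names are mine): each row aᵢ is fetched from memory once and used twice, first for the dot product t = aᵢᵀ·x and then for the update y += t·aᵢ.

/* Sketch: y = y + (A^T * A) * x with A (n rows) in CSR; x and y have length
   equal to the number of columns of A. Each row is read from memory only once. */
void ata_x_csr(int n, const int *ptr, const int *ind, const double *val,
               const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)     /* t = a_i . x   */
            t += val[k] * x[ind[k]];
        for (int k = ptr[i]; k < ptr[i+1]; k++)     /* y += t * a_i  */
            y[ind[k]] += t * val[k];
    }
}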
31
32
Example: Combining Optimizations (1/2)
  • Register blocking, symmetry, multiple (k) vectors
  • Three low-level tuning parameters: r, c, v

(Diagram: Y = A · X, where A is stored in r×c register blocks and the k
vectors of X are processed v at a time)
32
33
Example: Combining Optimizations (2/2)
  • Register blocking, symmetry, and multiple vectors
    [Ben Lee @ UCB]
  • Symmetric, blocked, 1 vector:
  • Up to 2.6x over nonsymmetric, blocked, 1 vector
  • Symmetric, blocked, k vectors:
  • Up to 2.1x over nonsymmetric, blocked, k vectors
  • Up to 7.3x over nonsymmetric, nonblocked, 1
    vector
  • Symmetric storage: up to 64.7% savings

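For reference, a minimal sketch of the symmetric-storage kernel behind these numbers (my convention: only the upper triangle, including the diagonal, is stored in CSR), without blocking or multiple vectors:

/* Symmetric SpMV sketch: y = y + A*x, with only entries A(i,j), j >= i, stored.
   Each stored off-diagonal entry updates both y[i] and y[j]. */
void spmv_sym_csr(int n, const int *ptr, const int *ind, const double *val,
                  const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = 0.0, xi = x[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++) {
            int j = ind[k];
            double a = val[k];
            yi += a * x[j];                  /* y[i] += A(i,j) * x[j]      */
            if (j != i) y[j] += a * xi;      /* and y[j] += A(j,i) * x[i]  */
        }
        y[i] += yi;
    }
}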
33
34
Potential Impact on Applications: Omega3P
  • Application: accelerator cavity design (Ko)
  • Relevant optimization techniques:
  • Symmetric storage
  • Register blocking
  • Reordering, to create more dense blocks
  • Reverse Cuthill-McKee (RCM) ordering to reduce
    bandwidth:
  • Do breadth-first search, number nodes in reverse
    order visited
  • Traveling Salesman Problem-based ordering to
    create blocks:
  • Nodes = columns of A
  • Weights(u, v) = no. of nonzeros u and v have in common
  • Tour = ordering of columns
  • Choose the maximum-weight tour
  • See Pinar & Heath '97
  • 2.1x speedup on IBM Power 4

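A hedged sketch of the RCM step (plain BFS from the lowest-numbered unvisited node, visit order reversed; real RCM also picks a pseudo-peripheral start node and orders neighbors by increasing degree):

#include <stdlib.h>

/* Reverse Cuthill-McKee-style ordering on the graph of A (CSR adjacency
   ptr/ind): BFS over each connected component, then reverse the visit order.
   On return, perm[k] is the node placed at position k of the new ordering. */
void rcm_order(int n, const int *ptr, const int *ind, int *perm)
{
    int *visited = calloc(n, sizeof(int));
    int *queue   = malloc(n * sizeof(int));
    int count = 0;
    for (int s = 0; s < n; s++) {            /* handle disconnected components */
        if (visited[s]) continue;
        int head = count;
        visited[s] = 1; queue[count++] = s;
        while (head < count) {
            int u = queue[head++];
            for (int k = ptr[u]; k < ptr[u+1]; k++) {
                int v = ind[k];
                if (!visited[v]) { visited[v] = 1; queue[count++] = v; }
            }
        }
    }
    for (int k = 0; k < n; k++) perm[k] = queue[n - 1 - k];   /* reverse */
    free(visited); free(queue);
}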
34
35
Source: Accelerator Cavity Design Problem (Ko via Husbands)
35
36
Post-RCM Reordering
36
37
100x100 Submatrix Along Diagonal
37
38
Microscopic Effect of RCM Reordering
Before: green + red. After: green + blue.
38
39
Microscopic Effect of Combined RCM+TSP Reordering
Before: green + red. After: green + blue.
39
40
(Omega3P)
40
41
Optimized Sparse Kernel Interface - OSKI
  • Provides sparse kernels automatically tuned for the
    user's matrix & machine
  • BLAS-style functionality: SpMV, A·x & Aᵀ·y, TrSV
  • Hides complexity of run-time tuning
  • Includes new, faster locality-aware kernels:
    AᵀA·x, Aᵏ·x
  • Faster than standard implementations
  • Up to 4x faster matvec, 1.8x trisolve, 4x AᵀA·x
  • For "advanced users" & solver library writers
  • Available as a stand-alone library (OSKI 1.0.1h,
    6/07)
  • Available as a PETSc extension (OSKI-PETSc .1d,
    3/06)
  • bebop.cs.berkeley.edu/oski

41
42
How the OSKI Tunes (Overview)
  • Library Install-Time (offline):
  • 1. Build for target architecture
  • 2. Benchmark, producing benchmark data and generated
    code variants
  • Application Run-Time:
  • Inputs: the user's matrix, the benchmark data, heuristic
    models, and a workload from program monitoring and history
  • 1. Evaluate models
  • 2. Select data structure and code
  • To user: matrix handle for kernel calls
  • Extensibility: advanced users may write and dynamically add
    code variants and heuristic models to the system.
43
How to Call OSKI: Basic Usage
  • May gradually migrate existing apps
  • Step 1: Wrap existing data structures
  • Step 2: Make BLAS-like kernel calls

int *ptr, *ind; double *val;       /* Matrix, in CSR format */
double *x, *y;                     /* Let x and y be two dense vectors */

/* Compute y = alpha*A*x + beta*y, 500 times */
for (i = 0; i < 500; i++)
    my_matmult(ptr, ind, val, alpha, x, beta, y);
43
44
How to Call OSKI: Basic Usage
  • May gradually migrate existing apps
  • Step 1: Wrap existing data structures
  • Step 2: Make BLAS-like kernel calls

int *ptr, *ind; double *val;       /* Matrix, in CSR format */
double *x, *y;                     /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                            SHARE_INPUTMAT, ...);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = alpha*A*x + beta*y, 500 times */
for (i = 0; i < 500; i++)
    my_matmult(ptr, ind, val, alpha, x, beta, y);
44
45
How to Call OSKI: Basic Usage
  • May gradually migrate existing apps
  • Step 1: Wrap existing data structures
  • Step 2: Make BLAS-like kernel calls

int *ptr, *ind; double *val;       /* Matrix, in CSR format */
double *x, *y;                     /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                            SHARE_INPUTMAT, ...);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = alpha*A*x + beta*y, 500 times */
for (i = 0; i < 500; i++)
    oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view);   /* Step 2 */
45
46
How to Call OSKI: Tune with Explicit Hints
  • User calls "tune" routine
  • May provide explicit tuning hints (OPTIONAL)

oski_matrix_t A_tunable = oski_CreateMatCSR( ... );
/* ... */

/* Tell OSKI we will call SpMV 500 times (workload hint) */
oski_SetHintMatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view, 500);
/* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

oski_TuneMat(A_tunable);           /* Ask OSKI to tune */

for (i = 0; i < 500; i++)
    oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view);
46
47
How the User Calls OSKI: Implicit Tuning
  • Ask the library to infer the workload
  • Library profiles all kernel calls
  • May periodically re-tune

oski_matrix_t A_tunable = oski_CreateMatCSR( ... );
/* ... */

for (i = 0; i < 500; i++) {
    oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view);
    oski_TuneMat(A_tunable);       /* Ask OSKI to tune */
}
47
48
Multicore SMPs Used for Tuning SpMV
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
48
49
Multicore SMPs with Conventional cache-based
memory hierarchy
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
Sun T2 T5140 (Victoria Falls)
IBM QS20 Cell Blade
49
50
Multicore SMPs with local store-based memory
hierarchy
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
Sun T2 T5140 (Victoria Falls)
IBM QS20 Cell Blade
50
51
Multicore SMPs with CMT (Chip-MultiThreading)
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
52
Multicore SMPs: number of threads
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
8 threads
8 threads
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
16 threads
128 threads
SPEs only
53
Multicore SMPs: peak double-precision flops
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
75 GFlop/s
74 Gflop/s
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
29 GFlop/s
19 GFlop/s
SPEs only
54
Multicore SMPs: total DRAM bandwidth
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
21 GB/s (read) 10 GB/s (write)
21 GB/s
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
51 GB/s
42 GB/s (read) 21 GB/s (write)
SPEs only
55
Multicore SMPs with Non-Uniform Memory Access -
NUMA
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
SPEs only
56
Set of 14 test matrices
  • All bigger than the caches of our SMPs

2K x 2K Dense matrix stored in sparse format
Dense
Well Structured (sorted by nonzeros/row)
Protein
FEM / Spheres
FEM / Cantilever
Wind Tunnel
FEM / Harbor
QCD
FEM / Ship
Economics
Epidemiology
Poorly Structured hodgepodge
FEM / Accelerator
Circuit
webbase
Extreme Aspect Ratio (linear programming)
LP
56
57
SpMV Performance: Naive Parallelization
  • Out-of-the-box SpMV performance on a suite of 14
    matrices
  • Scalability isn't great
  • Compare to the number of threads: 8, 8, 128, 16

(Legend: Naïve Pthreads, Naïve)
57
58
SpMV Performance: NUMA and Software Prefetching
  • NUMA-aware allocation is essential on NUMA SMPs
  • Explicit software prefetching can boost bandwidth
    and change cache replacement policies
  • Used exhaustive search

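A hedged sketch of both ideas (the OpenMP pragmas, the GCC/Clang __builtin_prefetch intrinsic, and the PF_DIST tunable are my illustration, not the study's exact code):

#define PF_DIST 64   /* prefetch distance in nonzeros; a tunable found by search */

/* NUMA: have each thread touch (initialize) the slice of the matrix it will
   later use, so first-touch page placement puts those pages in local DRAM. */
void first_touch_init(int n, const int *ptr, double *val)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            val[k] = 0.0;                    /* real code fills actual values here */
}

/* CSR SpMV with explicit software prefetch of upcoming matrix entries. */
void spmv_csr_prefetch(int n, const int *ptr, const int *ind,
                       const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++) {
            __builtin_prefetch(&val[k + PF_DIST], 0, 0);   /* read, low temporal locality */
            __builtin_prefetch(&ind[k + PF_DIST], 0, 0);
            yi += val[k] * x[ind[k]];
        }
        y[i] = yi;
    }
}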
58
59
SpMV Performance: Matrix Compression
  • Compression includes:
  • register blocking
  • other formats
  • smaller indices
  • Use a heuristic rather than search

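One common form of the "smaller indices" idea: if a cache block spans fewer than 65,536 columns, its column indices fit in 16 bits, halving index traffic. A hedged sketch of such a layout (struct and names are mine):

#include <stdint.h>

/* CSR-like storage for one cache block covering columns
   [col_start, col_start + 65535]; column indices are 16-bit offsets. */
typedef struct {
    int       nrows;      /* rows in this block        */
    int       col_start;  /* first column of the block */
    int      *ptr;        /* nrows+1 row pointers      */
    uint16_t *ind;        /* 16-bit column offsets     */
    double   *val;        /* nonzero values            */
} csr16_block_t;

/* y = y + A_block * x for one such cache block. */
void spmv_csr16_block(const csr16_block_t *b, const double *x, double *y)
{
    const double *xb = x + b->col_start;
    for (int i = 0; i < b->nrows; i++) {
        double yi = y[i];
        for (int k = b->ptr[i]; k < b->ptr[i+1]; k++)
            yi += b->val[k] * xb[b->ind[k]];
        y[i] = yi;
    }
}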
59
60
SpMV Performance: Cache and TLB Blocking
(Bars show cumulative optimizations: Naïve, Naïve Pthreads, NUMA/Affinity,
SW Prefetching, Matrix Compression, Cache/LS/TLB Blocking)
60
61
SpMV Performance: Architecture-Specific Optimizations
(Bars show cumulative optimizations: Naïve, Naïve Pthreads, NUMA/Affinity,
SW Prefetching, Matrix Compression, Cache/LS/TLB Blocking)
61
62
SpMV Performance: Maximum Speedup
  • Fully auto-tuned SpMV performance across the
    suite of matrices
  • Includes the SPE/local-store-optimized version
  • Why do some optimizations work better on some
    architectures?

Maximum speedups over the naïve code: 2.7x, 4.0x, 2.9x, 35x (one per machine)
(Bars show cumulative optimizations: Naïve, Naïve Pthreads, NUMA/Affinity,
SW Prefetching, Matrix Compression, Cache/LS/TLB Blocking)
63
Extra slides
63