Optimizing the Performance of Sparse Matrix-Vector Multiplication (Presentation Transcript)

1
Optimizing the Performance of Sparse
Matrix-Vector Multiplication
  • Eun-Jin Im
  • U.C. Berkeley

2
Overview
  • Motivation
  • Optimization techniques
  • Register Blocking
  • Cache Blocking
  • Multiple Vectors
  • Sparsity system
  • Related Work
  • Contribution
  • Conclusion

3
Motivation: Usage
  • Sparse Matrix-Vector Multiplication
  • Usage of this operation:
  • Iterative Solvers
  • Explicit Methods
  • Eigenvalue and Singular Value Problems
  • Applications in structural modeling, fluid
    dynamics, document retrieval (Latent Semantic
    Indexing), and many other simulation areas

4
Motivation: Performance (1)
  • Matrix-vector multiplication (BLAS2) is slower
    than matrix-matrix multiplication (BLAS3).
  • For example, on a 167 MHz UltraSPARC I:
  • Vendor-optimized matrix-vector multiplication:
    57 Mflops
  • Vendor-optimized matrix-matrix multiplication:
    185 Mflops
  • The reason: a lower ratio of floating point
    operations to memory operations
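  • For an n x n matrix, matrix-vector multiplication
    performs about 2n^2 flops on roughly n^2 operands,
    while matrix-matrix multiplication performs 2n^3
    flops on 3n^2 operands, leaving far more room for
    register and cache reuse.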

5
Motivation: Performance (2)
  • Sparse matrix operations are slower than dense
    matrix operations.
  • For example, on a 167 MHz UltraSPARC I:
  • Dense matrix-vector multiplication
  • naïve implementation: 38 Mflops
  • vendor-optimized implementation: 57 Mflops
  • Sparse matrix-vector multiplication (naïve
    implementation): 5.7 - 25 Mflops
  • The reason: an indirect data structure, and thus
    inefficient memory accesses

6
Motivation: Optimized Libraries
  • Old approach: hand-optimized libraries
  • Vendor-supplied BLAS, LAPACK
  • New approach: automatic generation of libraries
  • PHiPAC (dense linear algebra)
  • ATLAS (dense linear algebra)
  • FFTW (fast Fourier transform)
  • Our approach: automatic generation of libraries
    for sparse matrices
  • Additional dimension: the nonzero structure of
    sparse matrices

7
Sparse Matrix Formats
  • There are a large number of sparse matrix formats.
  • Point-entry:
  • Coordinate format (COO), Compressed Sparse Row
    (CSR),
  • Compressed Sparse Column (CSC), Sparse Diagonal
    (DIA)
  • Block-entry:
  • Block Coordinate (BCO), Block Sparse Row (BSR),
  • Block Sparse Column (BSC), Block Diagonal (BDI),
  • Variable Block Compressed Sparse Row (VBR)

8
Compressed Sparse Row Format
  • We internally use the CSR format because it is a
    relatively efficient format (sketched below).
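
For reference, a minimal sketch of the naive CSR multiply y = A*x
(the names row_start, col_idx, and val are illustrative, not from
the slides):

    /* Naive CSR sparse matrix-vector multiply: y = A*x.
     * val[] holds the nonzeros row by row, col_idx[] their column
     * indices, and row_start[i]..row_start[i+1]-1 spans row i. */
    void spmv_csr(int m, const int *row_start, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int k = row_start[i]; k < row_start[i+1]; k++)
                sum += val[k] * x[col_idx[k]];  /* indirect access */
            y[i] = sum;
        }
    }

The indirect access x[col_idx[k]] is the inefficiency slide 5
refers to.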

9
Optimization Techniques
  • Register Blocking
  • Cache Blocking
  • Multiple Vectors

10
Register Blocking
  • Blocked Compressed Sparse Row Format
  • Advantages of the format:
  • Better temporal locality in registers
  • The multiplication loop can be unrolled for
    better performance (see the sketch below)
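
As an illustration, an unrolled 2x2 register-blocked multiply might
look like the sketch below. This is one possible shape of the code
such a technique produces for a fixed block size, not code from the
slides:

    /* 2x2 register-blocked CSR (BCSR) multiply: each stored entry
     * is a dense 2x2 block, laid out row-major in val[].  The two
     * partial sums and the two x values stay in registers across
     * the unrolled inner loop.  Assumes an even row count
     * (mb = m/2 block rows). */
    void spmv_bcsr_2x2(int mb, const int *b_row_start,
                       const int *b_col_idx, const double *val,
                       const double *x, double *y)
    {
        for (int i = 0; i < mb; i++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = b_row_start[i]; k < b_row_start[i+1]; k++) {
                const double *b  = val + 4*k;           /* 2x2 block */
                const double *xp = x + 2*b_col_idx[k];  /* its x pair */
                y0 += b[0]*xp[0] + b[1]*xp[1];
                y1 += b[2]*xp[0] + b[3]*xp[1];
            }
            y[2*i]   = y0;
            y[2*i+1] = y1;
        }
    }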

11
Register Blocking Fill Overhead
  • We use a uniform block size, adding fill overhead
    (explicitly stored zeros).
  • In this example, fill overhead = 12/7 ≈ 1.71.
  • This increases both space and the number of
    floating point operations (see the fill-ratio
    sketch below).
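
A sketch of how the fill ratio could be computed exactly for a
candidate r x c block size; the slides instead estimate it (the
function and array names here are illustrative):

    #include <stdlib.h>

    /* Fill ratio for block size r x c on a CSR matrix: every r x c
     * block containing a nonzero is stored in full, so the ratio is
     * (touched blocks * r * c) / nnz.  last[jb] records the last
     * block row that touched block column jb; rows are visited in
     * order, so each touched block is counted exactly once. */
    double fill_ratio(int m, int n, const int *row_start,
                      const int *col_idx, int r, int c)
    {
        long nnz = row_start[m], blocks = 0;
        int nbc = (n + c - 1) / c;      /* number of block columns */
        int *last = malloc(nbc * sizeof *last);
        for (int jb = 0; jb < nbc; jb++)
            last[jb] = -1;
        for (int i = 0; i < m; i++) {
            int ib = i / r;             /* block row of row i */
            for (int k = row_start[i]; k < row_start[i+1]; k++) {
                int jb = col_idx[k] / c;
                if (last[jb] != ib) { last[jb] = ib; blocks++; }
            }
        }
        free(last);
        return (double)(blocks * r * c) / (double)nnz;
    }

On the slide's example this would return 12/7: three 2x2 blocks
(12 stored values) covering 7 nonzeros.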

12
Register Blocking
  • Dense Matrix profile on an UltraSPARC I (input to
    the performance model)

13
Register Blocking: Selecting the Block Size
  • The hard part of the problem is picking the block
    size so that
  • it minimizes the fill overhead, and
  • it maximizes the raw performance.
  • Two approaches:
  • Exhaustive search
  • Using a performance model

14
Register Blocking: Performance Model
  • Two components to the performance model:
  • Multiplication performance of a dense matrix
    represented in sparse format
  • Estimated fill overhead
  • Predicted performance for block size r x c =
    (dense r x c blocked performance) / (fill
    overhead), as sketched below
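
In code form, the model-based selection might look like the
following sketch. Here dense_mflops[r][c] is the measured dense
profile from slide 12, fill_ratio() is the sketch from slide 11,
and the search bound MAXB is an assumed parameter:

    #define MAXB 8   /* assumed upper bound on block dimensions */

    /* Pick the block size maximizing the model's prediction:
     *   predicted(r,c) = dense_mflops[r][c] / fill_ratio(r,c) */
    void choose_block_size(int m, int n, const int *row_start,
                           const int *col_idx,
                           const double dense_mflops[MAXB+1][MAXB+1],
                           int *best_r, int *best_c)
    {
        double best = 0.0;
        *best_r = *best_c = 1;
        for (int r = 1; r <= MAXB; r++) {
            for (int c = 1; c <= MAXB; c++) {
                double p = dense_mflops[r][c]
                         / fill_ratio(m, n, row_start, col_idx, r, c);
                if (p > best) {
                    best = p;
                    *best_r = r;
                    *best_c = c;
                }
            }
        }
    }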

15
Benchmark matrices
  • Matrix 1: Dense matrix (1000 x 1000)
  • Matrices 2-17: Finite Element Method matrices
  • Matrices 18-39: matrices from structural
    engineering and device simulation
  • Matrices 40-44: Linear Programming matrices
  • Matrix 45: document retrieval matrix used for
    Latent Semantic Indexing
  • Matrix 46: random matrix (10000 x 10000, 0.15)

16
Register Blocking: Performance
  • The optimization is most effective on the FEM
    matrices and the dense matrix (the lower-numbered
    matrices).

17
Register Blocking: Performance
  • Speedup is generally best on the MIPS R10000,
    where blocked performance is competitive with
    dense BLAS performance (DGEMV/DGEMM = 0.38).

18
Register Blocking: Validation of the Performance
Model
  • Comparison to the performance of exhaustive
    search (yellow bars, block sizes in the lower row)
    on a subset of the benchmark matrices
  • The exhaustive search does not produce much
    better results.

19
Register Blocking: Overhead
  • Pre-computation overhead:
  • Estimating the fill overhead (red bars)
  • Reorganizing the matrix (yellow bars)
  • The ratio of this overhead to the time saved per
    multiply gives the number of repetitions needed
    for the optimization to be beneficial.

20
Cache Blocking
  • Temporal locality of access to source vector

[Figure: the matrix is divided into cache blocks so that a block of
the source vector x stays in cache while the corresponding piece of
the destination vector y is computed in memory]
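
One way to realize this locality, sketched below under the
assumption that each cache block is stored as its own small CSR
submatrix (the slides do not show code; all names are illustrative):

    /* One cache block of A: a CSR submatrix with its origin in A.
     * Column indices are relative to col0, so while a block is
     * processed only a cache-sized window of x is touched. */
    typedef struct {
        int row0, col0;        /* origin of the block within A */
        int rows;              /* number of rows in the block */
        const int *row_start;  /* CSR row pointers of the submatrix */
        const int *col_idx;    /* column indices relative to col0 */
        const double *val;     /* nonzero values */
    } CacheBlock;

    /* y += A*x with A stored as nb cache blocks; y must be zeroed
     * by the caller first. */
    void spmv_cache_blocked(int nb, const CacheBlock *blk,
                            const double *x, double *y)
    {
        for (int b = 0; b < nb; b++) {
            const CacheBlock *B = &blk[b];
            const double *xb = x + B->col0;  /* cached window of x */
            double *yb = y + B->row0;
            for (int i = 0; i < B->rows; i++) {
                double sum = 0.0;
                for (int k = B->row_start[i]; k < B->row_start[i+1]; k++)
                    sum += B->val[k] * xb[B->col_idx[k]];
                yb[i] += sum;
            }
        }
    }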
21
Cache Blocking: Performance
  • Speedup on MIPS is generally better: it has a
    larger cache and a larger miss penalty (26/589 ns
    for MIPS vs. 36/268 ns for the UltraSPARC).
  • The exceptions are the document retrieval and
    random matrices.

22
Cache Blocking: Performance on the Document
Retrieval Matrix
  • Document retrieval matrix: 10K x 256K, 37M
    nonzeros; SVD is applied for LSI (Latent Semantic
    Indexing).
  • The nonzero elements are spread across the
    matrix, with no dense clusters.
  • Performance peaks at a 16K x 16K cache block,
    with a speedup of 3.1.

23
Cache Blocking: When and How to Use Cache
Blocking
  • From the experiments, the matrices for which
    cache blocking is most effective are large and
    random.
  • We developed a measure of the randomness of a
    matrix.
  • We perform a coarse-grained search to decide the
    cache block size.

24
Combination of Register and Cache Blocking:
UltraSPARC
  • The combination is rarely beneficial, and is often
    slower than either of the two optimizations alone.

25
Combination of Register and Cache Blocking: MIPS
26
Multiple Vector Multiplication
  • Better chance of optimization: BLAS2 vs. BLAS3

[Figure: repetition of the single-vector case vs. the
multiple-vector case, which reuses each matrix element across all
vectors]
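
A sketch of the multiple-vector case in CSR form: each matrix
element is loaded once and applied to all nv vectors. The row-major
layout of X and Y is an assumption, not from the slides:

    /* Y = A*X for nv vectors at once; X and Y are row-major, so
     * X[j*nv + v] is element j of vector v.  Each val[k] is loaded
     * once and reused nv times -- the BLAS3-style reuse the slide
     * refers to. */
    void spmv_csr_multivec(int m, int nv, const int *row_start,
                           const int *col_idx, const double *val,
                           const double *X, double *Y)
    {
        for (int i = 0; i < m; i++) {
            for (int v = 0; v < nv; v++)
                Y[i*nv + v] = 0.0;
            for (int k = row_start[i]; k < row_start[i+1]; k++) {
                double a = val[k];             /* one load of a_ij */
                const double *xj = X + (long)col_idx[k] * nv;
                for (int v = 0; v < nv; v++)
                    Y[i*nv + v] += a * xj[v];  /* reused per vector */
            }
        }
    }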
27
Multiple Vector Multiplication: Performance
  • Register blocking performance
  • Cache blocking performance

28
Multiple Vector Multiplication: Register
Blocking Performance
  • The speedup is larger than for single-vector
    register blocking.
  • Even the performance of the matrices that did not
    speed up before improved (the middle group on the
    UltraSPARC).

29
Multiple Vector Multiplication: Cache Blocking
Performance
[Charts: MIPS and UltraSPARC]
  • Noticeable speedup for the matrices that did not
    speed up with a single vector (UltraSPARC)
  • Block sizes are much smaller than those of
    single-vector cache blocking.

30
Sparsity System: Purpose
  • Guides the choice of optimization
  • Automatic selection of optimization parameters
    such as block size and number of vectors
  • http://comix.cs.berkeley.edu/ejim/sparsity

31
Sparsity System: Organization
[Diagram: the Sparsity Machine Profiler produces a Machine
Performance Profile; the Sparsity Optimizer takes an example matrix,
the performance profile, and the maximum number of vectors, and
emits optimized code and drivers]
32
Summary: Speedup of Sparsity on UltraSPARC
  • On the UltraSPARC, up to 3x for a single vector
    and 4.7x for multiple vectors

[Charts: single-vector and multiple-vector speedups]
33
Summary: Speedup of Sparsity on MIPS
  • On MIPS, up to 3x for a single vector and 6x for
    multiple vectors

[Charts: single-vector and multiple-vector speedups]
34
Summary: Overhead of Sparsity Optimization
  • Break-even number of iterations =
    overhead time / time saved per iteration
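  • For illustration (numbers assumed, not from the
    slides): if reorganizing the matrix costs as much
    as 10 unoptimized multiplies and each optimized
    multiply saves 40% of an unoptimized one, the
    break-even point is 10 / 0.4 = 25 iterations.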
  • The BLAS Technical Forum standard includes a
    parameter in the matrix creation routine to
    indicate how many times the operation will be
    performed.

35
Related Work (1)
  • Dense Matrix Optimization:
  • Loop transformations by compilers (M. Wolf, etc.)
  • Hand-optimized libraries: BLAS, LAPACK
  • Automatic Generation of Libraries:
  • PHiPAC, ATLAS, and FFTW
  • Sparse Matrix Standardization and Libraries:
  • BLAS Technical Forum
  • NIST Sparse BLAS, MV++, SparseLib++, TNT
  • Hand Optimization of Sparse Matrix-Vector
    Multiplication:
  • S. Toledo, Oliker et al.

36
Related Work (2)
  • Sparse Matrix Packages:
  • SPARSKIT, PSPARSELIB, Aztec, BlockSolve95,
    Spark98
  • Compiling Sparse Matrix Code:
  • Sparse compiler (Bik), Bernoulli compiler
    (Kotlyar)
  • On-demand Code Generation:
  • NIST Sparse BLAS, sparse compiler

37
Contribution
  • Thorough investigation of memory hierarchy
    optimization for sparse matrix-vector
    multiplication
  • Performance study on benchmark matrices
  • Development of a performance model to choose
    optimization parameters
  • Sparsity system for automatic tuning and code
    generation of sparse matrix-vector multiplication

38
Conclusion
  • Memory hierarchy optimization for sparse
    matrix-vector multiplication:
  • Register blocking: matrices with dense local
    structure benefit
  • Cache blocking: large matrices with random
    structure benefit
  • Multiple-vector multiplication improves
    performance further because matrix elements are
    reused
  • The choice of optimization depends on both the
    matrix structure and the machine architecture.
  • The automated system helps with this complicated
    and time-consuming process.