Optimizing the Performance of Sparse Matrix-Vector Multiplication (Presentation Transcript)

1
Optimizing the Performance of Sparse
Matrix-Vector Multiplication
  • Eun-Jin Im
  • U.C. Berkeley

2
Overview
  • Motivation
  • Optimization techniques
  • Register Blocking
  • Cache Blocking
  • Multiple Vectors
  • Sparsity system
  • Related Work
  • Contribution
  • Conclusion

3
Motivation: Usage
  • Sparse Matrix-Vector Multiplication
  • Usage of this operation:
  • Iterative Solvers
  • Explicit Methods
  • Eigenvalue and Singular Value Problems
  • Applications in structural modeling, fluid
    dynamics, document retrieval (Latent Semantic
    Indexing), and many other simulation areas

4
Motivation: Performance (1)
  • Matrix-vector multiplication (BLAS2) is slower
    than matrix-matrix multiplication (BLAS3).
  • For example, on a 167 MHz UltraSPARC I:
  • Vendor-optimized matrix-vector multiplication:
    57 Mflops
  • Vendor-optimized matrix-matrix multiplication:
    185 Mflops
  • The reason: a lower ratio of floating point
    operations to memory operations
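  • For an n x n matrix, matrix-vector multiplication
    performs about 2n^2 flops on roughly n^2 operands,
    while matrix-matrix multiplication performs 2n^3
    flops on 3n^2 operands, leaving far more room for
    register and cache reuse.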

5
Motivation: Performance (2)
  • Sparse matrix operations are slower than dense
    matrix operations.
  • For example, on a 167 MHz UltraSPARC I:
  • Dense matrix-vector multiplication
  • naïve implementation: 38 Mflops
  • vendor-optimized implementation: 57 Mflops
  • Sparse matrix-vector multiplication (naïve
    implementation): 5.7 - 25 Mflops
  • The reason: an indirect data structure, and thus
    inefficient memory accesses

6
Motivation: Optimized Libraries
  • Old approach: hand-optimized libraries
  • Vendor-supplied BLAS, LAPACK
  • New approach: automatic generation of libraries
  • PHiPAC (dense linear algebra)
  • ATLAS (dense linear algebra)
  • FFTW (fast Fourier transform)
  • Our approach: automatic generation of libraries
    for sparse matrices
  • Additional dimension: the nonzero structure of
    sparse matrices

7
Sparse Matrix Formats
  • There are a large number of sparse matrix formats.
  • Point-entry:
  • Coordinate format (COO), Compressed Sparse Row
    (CSR),
  • Compressed Sparse Column (CSC), Sparse Diagonal
    (DIA)
  • Block-entry:
  • Block Coordinate (BCO), Block Sparse Row (BSR),
  • Block Sparse Column (BSC), Block Diagonal (BDI),
  • Variable Block Compressed Sparse Row (VBR)

8
Compressed Sparse Row Format
  • We internally use the CSR format because it is a
    relatively efficient format (sketched below).
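
For reference, a minimal sketch of the naive CSR multiply y = A*x
(the names row_start, col_idx, and val are illustrative, not from
the slides):

    /* Naive CSR sparse matrix-vector multiply: y = A*x.
     * val[] holds the nonzeros row by row, col_idx[] their column
     * indices, and row_start[i]..row_start[i+1]-1 spans row i. */
    void spmv_csr(int m, const int *row_start, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int k = row_start[i]; k < row_start[i+1]; k++)
                sum += val[k] * x[col_idx[k]];  /* indirect access */
            y[i] = sum;
        }
    }

The indirect access x[col_idx[k]] is the inefficiency slide 5
refers to.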

9
Optimization Techniques
  • Register Blocking
  • Cache Blocking
  • Multiple Vectors

10
Register Blocking
  • Blocked Compressed Sparse Row Format
  • Advantages of the format:
  • Better temporal locality in registers
  • The multiplication loop can be unrolled for
    better performance (see the sketch below)
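
As an illustration, an unrolled 2x2 register-blocked multiply might
look like the sketch below. This is one possible shape of the code
such a technique produces for a fixed block size, not code from the
slides:

    /* 2x2 register-blocked CSR (BCSR) multiply: each stored entry
     * is a dense 2x2 block, laid out row-major in val[].  The two
     * partial sums and the two x values stay in registers across
     * the unrolled inner loop.  Assumes an even row count
     * (mb = m/2 block rows). */
    void spmv_bcsr_2x2(int mb, const int *b_row_start,
                       const int *b_col_idx, const double *val,
                       const double *x, double *y)
    {
        for (int i = 0; i < mb; i++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = b_row_start[i]; k < b_row_start[i+1]; k++) {
                const double *b  = val + 4*k;           /* 2x2 block */
                const double *xp = x + 2*b_col_idx[k];  /* its x pair */
                y0 += b[0]*xp[0] + b[1]*xp[1];
                y1 += b[2]*xp[0] + b[3]*xp[1];
            }
            y[2*i]   = y0;
            y[2*i+1] = y1;
        }
    }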

11
Register Blocking Fill Overhead
  • We use a uniform block size, adding fill overhead
    (explicitly stored zeros).
  • In this example, fill overhead = 12/7 ≈ 1.71.
  • This increases both space and the number of
    floating point operations (see the fill-ratio
    sketch below).
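
A sketch of how the fill ratio could be computed exactly for a
candidate r x c block size; the slides instead estimate it (the
function and array names here are illustrative):

    #include <stdlib.h>

    /* Fill ratio for block size r x c on a CSR matrix: every r x c
     * block containing a nonzero is stored in full, so the ratio is
     * (touched blocks * r * c) / nnz.  last[jb] records the last
     * block row that touched block column jb; rows are visited in
     * order, so each touched block is counted exactly once. */
    double fill_ratio(int m, int n, const int *row_start,
                      const int *col_idx, int r, int c)
    {
        long nnz = row_start[m], blocks = 0;
        int nbc = (n + c - 1) / c;      /* number of block columns */
        int *last = malloc(nbc * sizeof *last);
        for (int jb = 0; jb < nbc; jb++)
            last[jb] = -1;
        for (int i = 0; i < m; i++) {
            int ib = i / r;             /* block row of row i */
            for (int k = row_start[i]; k < row_start[i+1]; k++) {
                int jb = col_idx[k] / c;
                if (last[jb] != ib) { last[jb] = ib; blocks++; }
            }
        }
        free(last);
        return (double)(blocks * r * c) / (double)nnz;
    }

On the slide's example this would return 12/7: three 2x2 blocks
(12 stored values) covering 7 nonzeros.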

12
Register Blocking
  • Dense Matrix profile on an UltraSPARC I (input to
    the performance model)

13
Register Blocking: Selecting the Block Size
  • The hard part of the problem is picking the block
    size so that
  • it minimizes the fill overhead, and
  • it maximizes the raw performance.
  • Two approaches:
  • Exhaustive search
  • Using a performance model

14
Register Blocking: Performance Model
  • Two components to the performance model:
  • Multiplication performance of a dense matrix
    represented in sparse format
  • Estimated fill overhead
  • Predicted performance for block size r x c =
    (dense r x c blocked performance) / (fill
    overhead), as sketched below
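
In code form, the model-based selection might look like the
following sketch. Here dense_mflops[r][c] is the measured dense
profile from slide 12, fill_ratio() is the sketch from slide 11,
and the search bound MAXB is an assumed parameter:

    #define MAXB 8   /* assumed upper bound on block dimensions */

    /* Pick the block size maximizing the model's prediction:
     *   predicted(r,c) = dense_mflops[r][c] / fill_ratio(r,c) */
    void choose_block_size(int m, int n, const int *row_start,
                           const int *col_idx,
                           const double dense_mflops[MAXB+1][MAXB+1],
                           int *best_r, int *best_c)
    {
        double best = 0.0;
        *best_r = *best_c = 1;
        for (int r = 1; r <= MAXB; r++) {
            for (int c = 1; c <= MAXB; c++) {
                double p = dense_mflops[r][c]
                         / fill_ratio(m, n, row_start, col_idx, r, c);
                if (p > best) {
                    best = p;
                    *best_r = r;
                    *best_c = c;
                }
            }
        }
    }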

15
Benchmark matrices
  • Matrix 1: Dense matrix (1000 x 1000)
  • Matrices 2-17: Finite Element Method matrices
  • Matrices 18-39: matrices from structural
    engineering and device simulation
  • Matrices 40-44: Linear Programming matrices
  • Matrix 45: document retrieval matrix used for
    Latent Semantic Indexing
  • Matrix 46: random matrix (10000 x 10000, 0.15)

16
Register Blocking: Performance
  • The optimization is most effective on the FEM
    matrices and the dense matrix (the lower-numbered
    matrices).

17
Register Blocking: Performance
  • Speedup is generally best on the MIPS R10000,
    where blocked performance is competitive with
    dense BLAS performance (DGEMV/DGEMM = 0.38).

18
Register Blocking: Validation of the Performance
Model
  • Comparison to the performance of exhaustive
    search (yellow bars, block sizes in the lower row)
    on a subset of the benchmark matrices
  • The exhaustive search does not produce much
    better results.

19
Register Blocking: Overhead
  • Pre-computation overhead:
  • Estimating the fill overhead (red bars)
  • Reorganizing the matrix (yellow bars)
  • The ratio of this overhead to the time saved per
    multiply gives the number of repetitions needed
    for the optimization to be beneficial.

20
Cache Blocking
  • Temporal locality of access to source vector

[Figure: the matrix is divided into cache blocks so that a block of
the source vector x stays in cache while the corresponding piece of
the destination vector y is computed in memory]
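
One way to realize this locality, sketched below under the
assumption that each cache block is stored as its own small CSR
submatrix (the slides do not show code; all names are illustrative):

    /* One cache block of A: a CSR submatrix with its origin in A.
     * Column indices are relative to col0, so while a block is
     * processed only a cache-sized window of x is touched. */
    typedef struct {
        int row0, col0;        /* origin of the block within A */
        int rows;              /* number of rows in the block */
        const int *row_start;  /* CSR row pointers of the submatrix */
        const int *col_idx;    /* column indices relative to col0 */
        const double *val;     /* nonzero values */
    } CacheBlock;

    /* y += A*x with A stored as nb cache blocks; y must be zeroed
     * by the caller first. */
    void spmv_cache_blocked(int nb, const CacheBlock *blk,
                            const double *x, double *y)
    {
        for (int b = 0; b < nb; b++) {
            const CacheBlock *B = &blk[b];
            const double *xb = x + B->col0;  /* cached window of x */
            double *yb = y + B->row0;
            for (int i = 0; i < B->rows; i++) {
                double sum = 0.0;
                for (int k = B->row_start[i]; k < B->row_start[i+1]; k++)
                    sum += B->val[k] * xb[B->col_idx[k]];
                yb[i] += sum;
            }
        }
    }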
21
Cache Blocking: Performance
  • Speedup on MIPS is generally better: it has a
    larger cache and a larger miss penalty (26/589 ns
    for MIPS vs. 36/268 ns for the UltraSPARC).
  • The exceptions are the document retrieval and
    random matrices.

22
Cache Blocking: Performance on the Document
Retrieval Matrix
  • Document retrieval matrix: 10K x 256K, 37M
    nonzeros; SVD is applied for LSI (Latent Semantic
    Indexing).
  • The nonzero elements are spread across the
    matrix, with no dense clusters.
  • Performance peaks at a 16K x 16K cache block,
    with a speedup of 3.1.

23
Cache Blocking: When and How to Use Cache
Blocking
  • From the experiments, the matrices for which
    cache blocking is most effective are large and
    random.
  • We developed a measure of the randomness of a
    matrix.
  • We perform a coarse-grained search to decide the
    cache block size.

24
Combination of Register and Cache Blocking:
UltraSPARC
  • The combination is rarely beneficial, and is often
    slower than either of the two optimizations alone.

25
Combination of Register and Cache Blocking: MIPS
26
Multiple Vector Multiplication
  • Better chance of optimization: BLAS2 vs. BLAS3

[Figure: repetition of the single-vector case vs. the
multiple-vector case, which reuses each matrix element across all
vectors]
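
A sketch of the multiple-vector case in CSR form: each matrix
element is loaded once and applied to all nv vectors. The row-major
layout of X and Y is an assumption, not from the slides:

    /* Y = A*X for nv vectors at once; X and Y are row-major, so
     * X[j*nv + v] is element j of vector v.  Each val[k] is loaded
     * once and reused nv times -- the BLAS3-style reuse the slide
     * refers to. */
    void spmv_csr_multivec(int m, int nv, const int *row_start,
                           const int *col_idx, const double *val,
                           const double *X, double *Y)
    {
        for (int i = 0; i < m; i++) {
            for (int v = 0; v < nv; v++)
                Y[i*nv + v] = 0.0;
            for (int k = row_start[i]; k < row_start[i+1]; k++) {
                double a = val[k];             /* one load of a_ij */
                const double *xj = X + (long)col_idx[k] * nv;
                for (int v = 0; v < nv; v++)
                    Y[i*nv + v] += a * xj[v];  /* reused per vector */
            }
        }
    }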
27
Multiple Vector Multiplication: Performance
  • Register blocking performance
  • Cache blocking performance

28
Multiple Vector Multiplication: Register
Blocking Performance
  • The speedup is larger than for single-vector
    register blocking.
  • Even the performance of the matrices that did not
    speed up before improved (the middle group on the
    UltraSPARC).

29
Multiple Vector Multiplication: Cache Blocking
Performance
[Charts: MIPS and UltraSPARC]
  • Noticeable speedup for the matrices that did not
    speed up with a single vector (UltraSPARC)
  • Block sizes are much smaller than those of
    single-vector cache blocking.

30
Sparsity System: Purpose
  • Guides the choice of optimization
  • Automatic selection of optimization parameters
    such as block size and number of vectors
  • http://comix.cs.berkeley.edu/ejim/sparsity

31
Sparsity System: Organization
[Diagram: the Sparsity Machine Profiler produces a Machine
Performance Profile; the Sparsity Optimizer takes an example matrix,
the performance profile, and the maximum number of vectors, and
emits optimized code and drivers]
32
Summary: Speedup of Sparsity on UltraSPARC
  • On the UltraSPARC, up to 3x for a single vector
    and 4.7x for multiple vectors

[Charts: single-vector and multiple-vector speedups]
33
Summary: Speedup of Sparsity on MIPS
  • On MIPS, up to 3x for a single vector and 6x for
    multiple vectors

[Charts: single-vector and multiple-vector speedups]
34
Summary: Overhead of Sparsity Optimization
  • Break-even number of iterations =
    overhead time / time saved per iteration
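  • For illustration (numbers assumed, not from the
    slides): if reorganizing the matrix costs as much
    as 10 unoptimized multiplies and each optimized
    multiply saves 40% of an unoptimized one, the
    break-even point is 10 / 0.4 = 25 iterations.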
  • The BLAS Technical Forum standard includes a
    parameter in the matrix creation routine to
    indicate how many times the operation will be
    performed.

35
Related Work (1)
  • Dense Matrix Optimization:
  • Loop transformations by compilers (M. Wolf, etc.)
  • Hand-optimized libraries: BLAS, LAPACK
  • Automatic Generation of Libraries:
  • PHiPAC, ATLAS, and FFTW
  • Sparse Matrix Standardization and Libraries:
  • BLAS Technical Forum
  • NIST Sparse BLAS, MV++, SparseLib++, TNT
  • Hand Optimization of Sparse Matrix-Vector
    Multiplication:
  • S. Toledo, Oliker et al.

36
Related Work (2)
  • Sparse Matrix Packages:
  • SPARSKIT, PSPARSELIB, Aztec, BlockSolve95,
    Spark98
  • Compiling Sparse Matrix Code:
  • Sparse compiler (Bik), Bernoulli compiler
    (Kotlyar)
  • On-demand Code Generation:
  • NIST Sparse BLAS, sparse compiler

37
Contribution
  • Thorough investigation of memory hierarchy
    optimization for sparse matrix-vector
    multiplication
  • Performance study on benchmark matrices
  • Development of a performance model to choose
    optimization parameters
  • Sparsity system for automatic tuning and code
    generation of sparse matrix-vector multiplication

38
Conclusion
  • Memory hierarchy optimization for sparse
    matrix-vector multiplication:
  • Register blocking: matrices with dense local
    structure benefit
  • Cache blocking: large matrices with random
    structure benefit
  • Multiple-vector multiplication improves
    performance further because matrix elements are
    reused
  • The choice of optimization depends on both the
    matrix structure and the machine architecture.
  • The automated system helps with this complicated
    and time-consuming process.