Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
Sparse Matrix Vector Multiply Algorithms and
Optimizations on Modern Architectures
  • Ankit Jain, Vasily Volkov
  • CS252 Final Presentation
  • 5/9/2007
  • ankit@eecs.berkeley.edu
  • volkov@eecs.berkeley.edu

2
SpMV and its Applications
[Figure: sparse matrix A times dense source vector x gives dense destination vector y]
  • Sparse Matrix Vector Multiply (SpMV): y ← y + Ax
  • x, y are dense vectors
  • x is the source vector
  • y is the destination vector
  • A is a sparse matrix (<1% of entries are nonzero)
  • Applications employing SpMV in the inner loop
  • Least Squares Problems
  • Eigenvalue Problems

3
Storing a Matrix in Memory
  • type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  • foreach row i do
  •   for l = ptr[i] to ptr[i+1] − 1 do
  •     y[i] ← y[i] + val[l] × x[ind[l]]

Compressed Sparse Row (CSR) Data Structure and Algorithm
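The CSR loop above can be written as a short C kernel (a sketch; the names val, ind, and ptr follow the slide's pseudocode, with val holding the nonzeros row by row, ind their column indices, and ptr the row extents):

```c
#include <stddef.h>

/* CSR SpMV: y += A*x for an m-row sparse matrix A.
 * val[k]   -- nonzero values, stored row by row
 * ind[k]   -- column index of each nonzero
 * ptr[m+1] -- nonzeros of row i occupy positions ptr[i]..ptr[i+1]-1 */
void spmv_csr(size_t m, const double *val, const int *ind,
              const int *ptr, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++)
        for (int l = ptr[i]; l < ptr[i + 1]; l++)
            y[i] += val[l] * x[ind[l]];
}
```

Note the two indirect accesses per nonzero (ind[l], then x[ind[l]]) that the later slides attack.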
4
What's so hard about it?
  • Reasons for poor performance of the naïve implementation:
  • Poor locality (indirect and irregular memory accesses)
  • Limited by the speed of main memory
  • Poor instruction mix (low ratio of flops to memory operations)
  • Algorithm depends on the non-zero structure of the matrix
  • Dense matrices vs. sparse matrices

5
Register-Level Blocking (SPARSITY) 3x3 Example
6
Register-Level Blocking (SPARSITY) 3x3 Example
BCSR with uniform, aligned grid
7
Register-Level Blocking (SPARSITY) 3x3 Example
Fill in zeros: trading extra operations for better
efficiency
8
Blocked Compressed Sparse Row
  • Inner loop performs floating point multiply-add
    on each non-zero in block instead of just one
    non-zero
  • Reduces the number of times the source vector x
    has to be brought back into memory
  • Reduces the number of indices that have to be
    stored and loaded
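As an illustration, a 2×2 BCSR kernel might look like the following C sketch (the names and block layout are assumptions, not the presenters' code). Each block needs only one stored column index, and x0/x1 stay in registers across the block's two multiply-adds:

```c
#include <stddef.h>

/* 2x2 BCSR SpMV: y += A*x for a matrix with mb block rows
 * (row count assumed even; partially full blocks are padded
 * with explicit zeros).
 * bval[4*nblocks] -- each 2x2 block's entries, row-major
 * bind[nblocks]   -- column of each block's left edge
 * bptr[mb+1]      -- block extents of each block row */
void spmv_bcsr_2x2(size_t mb, const double *bval, const int *bind,
                   const int *bptr, const double *x, double *y)
{
    for (size_t ib = 0; ib < mb; ib++) {
        double y0 = y[2 * ib], y1 = y[2 * ib + 1];
        for (int b = bptr[ib]; b < bptr[ib + 1]; b++) {
            const double *v = &bval[4 * b];
            double x0 = x[bind[b]];        /* reused by both rows */
            double x1 = x[bind[b] + 1];
            y0 += v[0] * x0 + v[1] * x1;   /* top row of block */
            y1 += v[2] * x0 + v[3] * x1;   /* bottom row of block */
        }
        y[2 * ib] = y0;
        y[2 * ib + 1] = y1;
    }
}
```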

9
The Payoff: Speedups on Itanium 2
10
Explicit Software Pipelining
  • ORIGINAL CODE
  • type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  • foreach row i do
  •   for l = ptr[i] to ptr[i+1] − 1 do
  •     y[i] ← y[i] + val[l] × x[ind[l]]
  • SOFTWARE PIPELINED CODE
  • type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  • foreach row i do
  •   for l = ptr[i] to ptr[i+1] − 1 do
  •     y[i] ← y[i] + val_1 × x_1
  •     val_1 ← val[l+1]
  •     x_1 ← x[ind_2]
  •     ind_2 ← ind[l+2]
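A hand-rolled C sketch of the pipelined inner loop: loads for later iterations are issued ahead of the multiply-add that consumes them, so memory latency overlaps with arithmetic. The look-ahead distances (1 for val, 2 for ind) follow the slide; val and ind are assumed padded with two trailing entries so the look-ahead loads stay in bounds:

```c
#include <stddef.h>

/* Software-pipelined dot product of one CSR row (nonzeros lo..hi-1).
 * val and ind are ASSUMED padded with two trailing entries, and the
 * padding entries of ind must be valid indices into x. */
double row_dot_pipelined(int lo, int hi, const double *val,
                         const int *ind, const double *x)
{
    double yi = 0.0;
    double val_1 = val[lo];       /* prologue: prime the pipeline   */
    double x_1 = x[ind[lo]];
    int ind_2 = ind[lo + 1];
    for (int l = lo; l < hi; l++) {
        yi += val_1 * x_1;        /* consume values fetched earlier */
        val_1 = val[l + 1];       /* fetch val one iteration ahead  */
        x_1 = x[ind_2];           /* fetch x via the pipelined index */
        ind_2 = ind[l + 2];       /* fetch index two iterations ahead */
    }
    return yi;
}
```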

11
Explicit Software Prefetching
  • ORIGINAL CODE
  • type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  • foreach row i do
  •   for l = ptr[i] to ptr[i+1] − 1 do
  •     y[i] ← y[i] + val[l] × x[ind[l]]
  • SOFTWARE PREFETCHED CODE
  • type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  • foreach row i do
  •   for l = ptr[i] to ptr[i+1] − 1 do
  •     y[i] ← y[i] + val[l] × x[ind[l]]
  •     pref(NTA, &val[l + pref_v_amt])
  •     pref(NTA, &ind[l + pref_i_amt])
  •     pref(NONE, &x[ind[l + pref_x_amt]])
  • NTA refers to no temporal locality on all cache levels
  • NONE refers to temporal locality on the highest cache level
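On GCC/Clang, pref(...) can be approximated with __builtin_prefetch, whose last argument is a locality hint (0 roughly corresponds to NTA, 3 to NONE). A sketch, with arbitrarily chosen prefetch distances as the tuning knobs:

```c
#include <stddef.h>

/* CSR SpMV with explicit software prefetch. The distances below are
 * illustrative tuning parameters, not measured optima. Prefetching an
 * address past the arrays is harmless (prefetch never faults), but the
 * actual load ind[l + PF_X] must be guarded against running off nnz. */
#define PF_V 64   /* how far ahead to prefetch val  */
#define PF_I 64   /* how far ahead to prefetch ind  */
#define PF_X 16   /* how far ahead to prefetch x    */

void spmv_csr_pref(size_t m, size_t nnz, const double *val,
                   const int *ind, const int *ptr,
                   const double *x, double *y)
{
    for (size_t i = 0; i < m; i++)
        for (int l = ptr[i]; l < ptr[i + 1]; l++) {
            y[i] += val[l] * x[ind[l]];
            __builtin_prefetch(val + l + PF_V, 0, 0);  /* NTA-like  */
            __builtin_prefetch(ind + l + PF_I, 0, 0);  /* NTA-like  */
            if ((size_t)(l + PF_X) < nnz)              /* keep load valid */
                __builtin_prefetch(&x[ind[l + PF_X]], 0, 3);  /* NONE-like */
        }
}
```

The x-vector prefetch is the interesting one: it needs the index stream to be read ahead, which is exactly the indirect access the hardware prefetcher cannot predict.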

12
Characteristics of Modern Architectures
  • High set associativity in caches
  • 4-way L1, 8-way L2, 12-way L3 on Itanium 2
  • Multiple load/store units
  • Multiple execution units
  • Six integer execution units on Itanium 2
  • Two floating-point multiply-add execution units on Itanium 2
  • Question: what if we broke the matrix into multiple streams of execution?

13
Parallel SpMV
  • Run different rows in different threads
  • Can we do that on data-parallel architectures (SIMD/VLIW, Itanium/GPU)?
  • What if rows have different lengths?
  • One row finishes while the others are still running
  • Waiting threads keep processors idle
  • Can we avoid this idleness?
  • Standard solution: segmented scan

14
Segmented Scan
  • Multiple segments (streams) of simultaneous execution
  • A single loop with branches inside to check whether we've reached the end of a row for each segment
  • Reduces loop overhead
  • Good if the average number of nonzeros per row is small
  • Changes the memory access patterns and can use caches more efficiently for some matrices
  • Future work: pass SpMV through a cache simulator to observe cache behavior

15
Itanium 2 Results (1.3 GHz, Millennium Cluster)
16
Conclusions and Future Work
  • The optimizations studied pay off and should be incorporated into OSKI
  • Develop parallel / multicore versions
  • Dual-core, dual-socket Opterons, etc.

17
Questions?
18
Extra Slides
19
Algorithm 2: Segmented Scan
1x1x2 Segmented Scan Code:
type val : real[k]; type ind : int[k]; type ptr : int[m+1]; type RowStart : int[VectorLength]
r0 ← RowStart[0]; r1 ← RowStart[1]
nnz0 ← ptr[r0]; nnz1 ← ptr[r1]
EoR0 ← ptr[r0+1]; EoR1 ← ptr[r1+1]
  1. while nnz0 < SegmentLength do
  2.   y[r0] ← y[r0] + val[nnz0] × x[ind[nnz0]]
  3.   y[r1] ← y[r1] + val[nnz1] × x[ind[nnz1]]
  4.   if (nnz0 == EoR0)
  5.     r0++
  6.     EoR0 ← ptr[r0+1]
  7.   if (nnz1 == EoR1)
  8.     r1++
  9.     EoR1 ← ptr[r1+1]
  10.  nnz0 ← nnz0 + 1
  11.  nnz1 ← nnz1 + 1
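The two-stream segmented scan can be rendered as a C sketch. For clarity this version increments each stream's nonzero counter before the end-of-row test, which makes the row advance exact; names are assumptions, ptr is assumed to carry one sentinel entry past ptr[m], and rows are assumed not to straddle a segment boundary:

```c
#include <stddef.h>

/* Two-stream segmented-scan CSR SpMV (a sketch of the 1x1x2 variant).
 * The nonzeros are split into two segments of seg_len entries each;
 * each segment is walked by its own stream so the two multiply-adds
 * can issue in parallel on a wide core.
 * row_start[s] -- row in which segment s begins
 * ptr          -- CSR row extents, padded with ONE sentinel entry
 *                 past ptr[m] so the final row advance stays in bounds */
void spmv_segscan2(size_t seg_len, const double *val, const int *ind,
                   const int *ptr, const int *row_start,
                   const double *x, double *y)
{
    int r0 = row_start[0], r1 = row_start[1];
    int nnz0 = ptr[r0], nnz1 = ptr[r1];
    int eor0 = ptr[r0 + 1], eor1 = ptr[r1 + 1];

    for (size_t k = 0; k < seg_len; k++) {
        y[r0] += val[nnz0] * x[ind[nnz0]];   /* stream 0 */
        y[r1] += val[nnz1] * x[ind[nnz1]];   /* stream 1 */
        nnz0++;
        nnz1++;
        if (nnz0 == eor0) { r0++; eor0 = ptr[r0 + 1]; }  /* next row */
        if (nnz1 == eor1) { r1++; eor1 = ptr[r1 + 1]; }
    }
}
```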

20
Measuring Performance
  • Measure Dense Performance(r,c):
  • the performance (Mflop/s) of a dense matrix stored in sparse r × c blocked format
  • Estimate Fill Ratio(r,c):
  • Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)
  • Choose the r, c that maximize
  • Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c)
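This SPARSITY-style heuristic amounts to picking the block size that maximizes dense performance divided by fill ratio. A C sketch, assuming a 4×4 search grid and caller-supplied measurements (the grid size and names are illustrative):

```c
#include <stddef.h>

#define RMAX 4
#define CMAX 4

/* Block-size selection heuristic (a sketch).
 * dense_mflops[r][c] -- measured performance of a dense matrix in
 *                       (r+1) x (c+1) blocked format (per-machine)
 * fill[r][c]         -- estimated fill ratio for this matrix, >= 1.0
 * Picks the 1-based (r, c) maximizing dense_mflops / fill. */
void choose_block(const double dense_mflops[RMAX][CMAX],
                  const double fill[RMAX][CMAX],
                  int *best_r, int *best_c)
{
    double best = -1.0;
    for (int r = 0; r < RMAX; r++)
        for (int c = 0; c < CMAX; c++) {
            double est = dense_mflops[r][c] / fill[r][c];
            if (est > best) {
                best = est;
                *best_r = r + 1;   /* report 1-based block sizes */
                *best_c = c + 1;
            }
        }
}
```

The dense measurements depend only on the machine (done once per platform), while the fill ratio depends only on the matrix, which is what makes this estimate cheap at tuning time.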

21
References
  1. G. Blelloch, M. Heroux, and M. Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, 1993.
  2. E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.
  3. E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vector multiply. International Journal of High Performance Computing Applications, 18(1):135–158, February 2004.
  4. R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. Performance modeling and analysis of cache blocking in sparse matrix vector multiply. Technical Report UCB/CSD-04-1335, University of California, Berkeley, Berkeley, CA, USA, June 2004.
  5. Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, NASA Ames Research Center, Moffett Field, CA, 1990.
  6. A. Schwaighofer. A Matlab interface to SVM light, version 4.0. http://www.cis.tugraz.at/igi/aschwaig/software.html, 2004.
  7. R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, December 2003.
  8. R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing. (To appear.)