Title: Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures
1. Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures
- Ankit Jain, Vasily Volkov
- CS252 Final Presentation
- 5/9/2007
- ankit@eecs.berkeley.edu
- volkov@eecs.berkeley.edu
2. SpMV and its Applications
- Sparse Matrix Vector Multiply (SpMV): y ← y + Ax
- x, y are dense vectors
  - x: source vector
  - y: destination vector
- A is a sparse matrix (<1% of entries are nonzero)
- Applications employing SpMV in the inner loop:
  - Least Squares Problems
  - Eigenvalue Problems
3. Storing a Matrix in Memory
Compressed Sparse Row (CSR) data structure and algorithm:
- type val : real[k]; type ind : int[k]; type ptr : int[m+1]
- foreach row i do
  - for l = ptr[i] to ptr[i+1] - 1 do
    - y[i] ← y[i] + val[l] * x[ind[l]]
4. What's so hard about it?
- Reasons for the poor performance of the naïve implementation:
  - Poor locality (indirect and irregular memory accesses)
  - Limited by the speed of main memory
  - Poor instruction mix (low ratio of flops to memory operations)
  - Algorithm depends on the non-zero structure of the matrix
  - Dense matrices vs. sparse matrices
5. Register-Level Blocking (SPARSITY): 3x3 Example
6. Register-Level Blocking (SPARSITY): 3x3 Example
- BCSR with uniform, aligned grid
7. Register-Level Blocking (SPARSITY): 3x3 Example
- Fill-in zeros: trade off extra ops for better efficiency
8. Blocked Compressed Sparse Row
- Inner loop performs a floating-point multiply-add on each non-zero in the block instead of just one non-zero
- Reduces the number of times the source vector x has to be brought back into memory
- Reduces the number of indices that have to be stored and loaded
9. The Payoff: Speedups on Itanium 2
10. Explicit Software Pipelining
- ORIGINAL CODE
  - type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  - foreach row i do
    - for l = ptr[i] to ptr[i+1] - 1 do
      - y[i] ← y[i] + val[l] * x[ind[l]]
- SOFTWARE PIPELINED CODE
  - type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  - foreach row i do
    - for l = ptr[i] to ptr[i+1] - 1 do
      - y[i] ← y[i] + val_1 * x_1
      - val_1 = val[l+1]
      - x_1 = x[ind_2]
      - ind_2 = ind[l+2]
11. Explicit Software Prefetching
- ORIGINAL CODE
  - type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  - foreach row i do
    - for l = ptr[i] to ptr[i+1] - 1 do
      - y[i] ← y[i] + val[l] * x[ind[l]]
- SOFTWARE PREFETCHED CODE
  - type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  - foreach row i do
    - for l = ptr[i] to ptr[i+1] - 1 do
      - y[i] ← y[i] + val[l] * x[ind[l]]
      - pref(NTA, &val[l + pref_v_amt])
      - pref(NTA, &ind[l + pref_i_amt])
      - pref(NONE, &x[ind[l + pref_x_amt]])
- NTA: no temporal locality at any level
- NONE: temporal locality at the highest level
12. Characteristics of Modern Architectures
- High set associativity in caches
  - 4-way L1, 8-way L2, 12-way L3 on Itanium 2
- Multiple load/store units
- Multiple execution units
  - Six integer execution units on Itanium 2
  - Two floating-point multiply-add execution units on Itanium 2
- Question: what if we broke the matrix into multiple streams of execution?
13. Parallel SpMV
- Run different rows in different threads
- Can we do that on data-parallel architectures (SIMD/VLIW, Itanium/GPU)?
- What if rows have different lengths?
  - One row finishes while others are still running
  - Waiting threads keep processors idle
- Can we avoid idleness?
  - Standard solution: segmented scan
14. Segmented Scan
- Multiple segments (streams) of simultaneous execution
- Single loop with branches inside to check whether we've reached the end of a row for each segment
- Reduces loop overhead
- Good if the average number of non-zeros per row is small
- Changes the memory access patterns and can use caches more efficiently for some matrices
- Future work: pass SpMV through a cache simulator to observe cache behavior
15. Itanium 2 Results (1.3 GHz, Millennium Cluster)
16. Conclusions / Future Work
- The optimizations studied pay off and should be included in OSKI
- Develop parallel / multicore versions
  - Dual-core, dual-socket Opterons, etc.
17. Questions?
18. Extra Slides
19. Algorithm 2: Segmented Scan
1x1x2 Segmented Scan code:
- type val : real[k]; type ind : int[k]; type ptr : int[m+1]; type RowStart : int[VectorLength]
- r0 ← RowStart[0]; r1 ← RowStart[1]
- nnz0 ← ptr[r0]; nnz1 ← ptr[r1]
- EoR0 ← ptr[r0+1]; EoR1 ← ptr[r1+1]
- while nnz0 < SegmentLength do
  - y[r0] ← y[r0] + val[nnz0] * x[ind[nnz0]]
  - y[r1] ← y[r1] + val[nnz1] * x[ind[nnz1]]
  - if (nnz0 == EoR0)
    - r0++
    - EoR0 ← ptr[r0+1]
  - if (nnz1 == EoR1)
    - r1++
    - EoR1 ← ptr[r1+1]
  - nnz0 ← nnz0 + 1
  - nnz1 ← nnz1 + 1
20. Measuring Performance
- Measure Dense Performance(r,c)
  - Performance (Mflop/s) of a dense matrix stored in sparse r x c blocked format
- Estimate Fill Ratio(r,c)
  - Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)
- Choose the r,c that maximizes Estimated Performance(r,c)
21. References
- G. Blelloch, M. Heroux, and M. Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, 1993.
- E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.
- E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vector multiply. International Journal of High Performance Computing Applications, 18(1):135-158, February 2004.
- R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply. Technical Report UCB/CSD-04-1335, University of California, Berkeley, Berkeley, CA, USA, June 2004.
- Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, NASA Ames Research Center, Moffett Field, CA, 1990.
- A. Schwaighofer. A matlab interface to svm light to version 4.0. http://www.cis.tugraz.at/igi/aschwaig/software.html, 2004.
- R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, December 2003.
- R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing. (to appear).