Title: Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures
1. Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures
- Ankit Jain, Vasily Volkov
- CS252 Final Presentation
- 5/9/2007
- ankit@eecs.berkeley.edu
- volkov@eecs.berkeley.edu
2. SpMV and its Applications
- Sparse Matrix Vector Multiply (SpMV): y ← y + Ax
- x, y are dense vectors
  - x: source vector
  - y: destination vector
- A is a sparse matrix (<1% of entries are nonzero)
- Applications employing SpMV in the inner loop:
  - Least Squares Problems
  - Eigenvalue Problems
3. Storing a Matrix in Memory
Compressed Sparse Row (CSR) data structure and algorithm:
- type val : real[k]; type ind : int[k]; type ptr : int[m+1]
- foreach row i do
  - for l = ptr[i] to ptr[i+1] - 1 do
    - y[i] ← y[i] + val[l] * x[ind[l]]
4. What's so hard about it?
- Reasons for the poor performance of the naïve implementation:
  - Poor locality (indirect and irregular memory accesses)
  - Limited by the speed of main memory
  - Poor instruction mix (low ratio of flops to memory operations)
  - Algorithm depends on the non-zero structure of the matrix
  - Dense matrices vs. sparse matrices
5. Register-Level Blocking (SPARSITY): 3x3 Example
6. Register-Level Blocking (SPARSITY): 3x3 Example
- BCSR with uniform, aligned grid
7. Register-Level Blocking (SPARSITY): 3x3 Example
- Fill-in zeros: trade off extra ops for better efficiency
8. Blocked Compressed Sparse Row
- Inner loop performs a floating-point multiply-add on each non-zero in the block instead of just one non-zero
- Reduces the number of times the source vector x has to be brought back into memory
- Reduces the number of indices that have to be stored and loaded
9. The Payoff: Speedups on Itanium 2
10. Explicit Software Pipelining
- ORIGINAL CODE
  - type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  - foreach row i do
    - for l = ptr[i] to ptr[i+1] - 1 do
      - y[i] ← y[i] + val[l] * x[ind[l]]
- SOFTWARE PIPELINED CODE
  - type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  - foreach row i do
    - for l = ptr[i] to ptr[i+1] - 1 do
      - y[i] ← y[i] + val_1 * x_1
      - val_1 = val[l+1]
      - x_1 = x[ind_2]
      - ind_2 = ind[l+2]
11. Explicit Software Prefetching
- ORIGINAL CODE
  - type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  - foreach row i do
    - for l = ptr[i] to ptr[i+1] - 1 do
      - y[i] ← y[i] + val[l] * x[ind[l]]
- SOFTWARE PREFETCHED CODE
  - type val : real[k]; type ind : int[k]; type ptr : int[m+1]
  - foreach row i do
    - for l = ptr[i] to ptr[i+1] - 1 do
      - y[i] ← y[i] + val[l] * x[ind[l]]
      - pref(NTA, &val[l + pref_v_amt])
      - pref(NTA, &ind[l + pref_i_amt])
      - pref(NONE, &x[ind[l + pref_x_amt]])
- NTA: no temporal locality at any level
- NONE: temporal locality at the highest level
12. Characteristics of Modern Architectures
- High set associativity in caches
  - 4-way L1, 8-way L2, 12-way L3 on Itanium 2
- Multiple load/store units
- Multiple execution units
  - Six integer execution units on Itanium 2
  - Two floating-point multiply-add execution units on Itanium 2
- Question: what if we broke the matrix into multiple streams of execution?
13. Parallel SpMV
- Run different rows in different threads
- Can we do that on data-parallel architectures (SIMD/VLIW, Itanium/GPU)?
- What if rows have different lengths?
  - One row finishes while others are still running
  - Waiting threads keep processors idle
- Can we avoid idleness?
  - Standard solution: segmented scan
14. Segmented Scan
- Multiple segments (streams) of simultaneous execution
- Single loop with branches inside to check whether we've reached the end of a row for each segment
- Reduces loop overhead
- Good if the average number of non-zeros per row is small
- Changes the memory access patterns and can use caches more efficiently for some matrices
- Future work: pass SpMV through a cache simulator to observe cache behavior
15. Itanium 2 Results (1.3 GHz, Millennium Cluster)
16. Conclusions / Future Work
- The optimizations studied pay off and should be included in OSKI
- Develop parallel / multicore versions
  - Dual-core, dual-socket Opterons, etc.
17. Questions?
18. Extra Slides
19. Algorithm 2: Segmented Scan
1x1x2 Segmented Scan code:
- type val : real[k]; type ind : int[k]; type ptr : int[m+1]; type RowStart : int[VectorLength]
- r0 ← RowStart[0]; r1 ← RowStart[1]
- nnz0 ← ptr[r0]; nnz1 ← ptr[r1]
- EoR0 ← ptr[r0+1]; EoR1 ← ptr[r1+1]
- while nnz0 < SegmentLength do
  - y[r0] ← y[r0] + val[nnz0] * x[ind[nnz0]]
  - y[r1] ← y[r1] + val[nnz1] * x[ind[nnz1]]
  - if (nnz0 == EoR0)
    - r0++
    - EoR0 ← ptr[r0+1]
  - if (nnz1 == EoR1)
    - r1++
    - EoR1 ← ptr[r1+1]
  - nnz0 ← nnz0 + 1
  - nnz1 ← nnz1 + 1
20. Measuring Performance
- Measure Dense Performance(r,c)
  - Performance (Mflop/s) of a dense matrix stored in sparse r x c blocked format
- Estimate Fill Ratio(r,c)
  - Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)
- Choose the r,c that maximizes Estimated Performance(r,c)
21. References
- G. Blelloch, M. Heroux, and M. Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, 1993.
- E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.
- E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vector multiply. International Journal of High Performance Computing Applications, 18(1):135-158, February 2004.
- R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply. Technical Report UCB/CSD-04-1335, University of California, Berkeley, Berkeley, CA, USA, June 2004.
- Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, NASA Ames Research Center, Moffett Field, CA, 1990.
- A. Schwaighofer. A matlab interface to svm light to version 4.0. http://www.cis.tugraz.at/igi/aschwaig/software.html, 2004.
- R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, December 2003.
- R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing. (to appear).