Transcript: A benchmark for sparse matrix-vector multiplication
1
A benchmark for sparse matrix-vector
multiplication
  • Hormozd Gahvari and Mark Hoemmen
    {hormozd, mhoemmen}@eecs
  • http://mhoemmen.arete.cc/Report/
  • Research made possible by NSF, Argonne
    National Lab, a gift from Intel, the National Energy
    Research Scientific Computing Center, and Tyler
    Berry (tyler@arete.cc)

2
Topics for today
  • Sparse matrix-vector multiplication (SMVM) and
    the Sparsity optimization
  • Preexisting SMVM benchmarks vs. ours
  • Results: Performance predictors
  • Test case: Desktop SIMD

3
Sparse matrix-vector multiplication
  • Sparse vs. dense matrix-vector multiplication
  • Dense: Can take advantage of temporal and spatial
    locality (BLAS levels 2, 3)
  • Sparse: Stream through the matrix one value at a
    time
  • Index arrays: Lose locality
  • Compressed sparse row (CSR) format
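  For concreteness, a minimal CSR kernel for y = A*x looks like this
  (a sketch; array names are illustrative, not the benchmark's code):

    /* val[] holds the nonzeros, col_ind[] their column indices,
       and row_ptr[i] the start of row i within val[]. */
    void smvm_csr(int m, const double *val, const int *col_ind,
                  const int *row_ptr, const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            /* The indirect access x[col_ind[k]] is where locality is lost. */
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
                sum += val[k] * x[col_ind[k]];
            y[i] = sum;
        }
    }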

4
Register block optimization
  • Many matrices have small dense blocks
  • FEM matrices especially
  • 2x2, 3x3, 6x6 common
  • Register blocking: Like unrolling a loop
    (circumvents latencies); see the sketch below
  • Sparsity
  • Automatic, heuristic selection of the optimal block size
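  A minimal sketch of a 2x2 register-blocked (BSR) kernel, assuming
  row-major storage within blocks (names illustrative):

    /* Each block's 4 values are fully unrolled; the partial sums
       stay in registers, hiding load latency behind independent
       multiplies. brow[]/bcol[] index blocks, not scalars. */
    void smvm_bsr_2x2(int mb, const double *val, const int *bcol,
                      const int *brow, const double *x, double *y)
    {
        for (int i = 0; i < mb; i++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = brow[i]; k < brow[i+1]; k++) {
                const double *b = val + 4*k;
                double x0 = x[2*bcol[k]], x1 = x[2*bcol[k] + 1];
                y0 += b[0]*x0 + b[1]*x1;
                y1 += b[2]*x0 + b[3]*x1;
            }
            y[2*i] = y0;
            y[2*i + 1] = y1;
        }
    }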

5
SMVM benchmarks: Three strategies
  • Actually do SMVM with test cases
  • Simpler ops simulating SMVM
  • Analytical / heuristic model

6
1) Actually do SMVM
  • SparseBench: Iterative Krylov solvers
  • Tests other things besides SMVM!
  • SciMark 2.0
  • Fixed problem size
  • Uses unoptimized CSR (no register blocks)
  • Doesn't capture potential performance with many
    types of matrices
  • Register blocking: Large impact (as we will see)

7
2) Microbenchmarks simulating SMVM
  • Goal: Capture SMVM behavior with a simple set of
    operations
  • STREAM: http://www.streambench.org/
  • Sustained memory bandwidth
  • Copy, Scale, Add, Triad
  • Triad: like the dense level-1 BLAS DAXPY (see below)
  • Rich Vuduc's indirectly indexed variants
  • Resemble sparse matrix addressing
  • Still not predictive
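  The Triad kernel, for reference (a sketch of the standard STREAM
  operation):

    /* STREAM Triad: a(i) = b(i) + s * c(i). Bandwidth-bound:
       two streamed loads and one streamed store per iteration. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + s * c[i];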

8
3) Analytical models of SMVM performance
  • Account for miss rates, latencies, and bandwidths
  • Sparsity uses bounds as a heuristic to predict the
    best block dimensions for a machine
  • Upper and lower bounds not tight, so difficult to
    use for performance prediction
  • Sparsity's goal: optimization, not performance
    prediction

9
Our SMVM benchmark
  • Do SMVM with a BSR matrix of randomly scattered
    blocks
  • BSR format: Typically less structured matrices
    anyway
  • Best block size, and 1x1
  • Characterize different matrix types
  • Take advantage of potential optimizations (unlike
    current benchmarks), but in a general way

10
Dense matrix in sparse format
  • Test this with the optimal block size
  • To show that fill doesn't affect performance much
  • Fill affects locality of accesses to the source
    vector

11
Data set sizing
  • Size the vectors to fit in the largest cache, and the
    matrix out of cache
  • Tests streaming in of matrix values
  • Natural scaling to machine parameters!
  • Inspiration: SPECfp92 (small enough that
    manufacturers could size caches to fit all the data)
    vs. SPECfp95 (data sizes increased)
  • Fill is now machine-dependent
  • Tests show fill (locality of source vector
    accesses) has little effect
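  A minimal sketch of such a sizing rule (our illustration of the
  idea, not necessarily the benchmark's exact formula):

    #include <stddef.h>

    /* Source and destination vectors of doubles should both fit in
       the largest cache: 2 * sizeof(double) * n <= cache_bytes.
       The matrix is then made several times larger than the cache. */
    size_t vector_length(size_t cache_bytes)
    {
        return cache_bytes / (2 * sizeof(double));
    }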

12
Results: Best block size
  • Highest Mflop/s value over the block sizes
    tested, for:
  • Sparse matrix (fill chosen as above)
  • Dense matrix in sparse format (4096 x 4096)
  • Compare with Mflop/s for STREAM Triad
    (a(i) = b(i) + s * c(i))

13
(No Transcript)
14
(No Transcript)
15
(No Transcript)
16
Ranking processors according to benchmarks
  • For optimized (best block size) SMVM:
  • Peak memory bandwidth is a good predictor of the
    Itanium 2, P4, PM relationship
  • STREAM mispredicts these
  • STREAM:
  • Better predicts unoptimized (1x1) SMVM
  • Peak bandwidth no longer helpful

17
(No Transcript)
18
Our benchmark: A useful performance indicator
  • Comparison with results for real-life matrices
  • Works well for FEM matrices
  • Not always as well for non-FEM matrices
  • Wasted space in the block data structure (explicit
    zeros stored to fill out blocks) is directly
    proportional to slowdown

19
Comparison of Benchmark with Real Matrices
  • The following two graphs show the Mflop rate of
    matrices generated by our benchmark vs. matrices
    from the BeBOP group and a dense matrix in sparse
    format
  • Plots compare by block size; the matrix number is
    given in parentheses. Matrices 2-9 are FEM
    matrices.
  • A comprehensive list of the BeBOP test suite
    matrices can be found in Vuduc et al.,
    "Performance Optimizations and Bounds for Sparse
    Matrix-Vector Multiply," 2002.

20
(No Transcript)
21
(No Transcript)
22
Comparison: Conclusions
  • Our benchmark does a good job of modeling real data
  • A dense matrix in sparse format looks good on the
    Ultra 3, but is noticeably inferior to our benchmark
    for large block sizes on the Itanium 2

23
Evaluating SIMD instructions
  • SMVM benchmark:
  • A tool to evaluate architectural features
  • e.g., desktop SIMD floating-point
  • SSE-2 ISA
  • Pentium 4, Pentium M, AMD Opteron
  • Parallel ops on pairs of floating-point doubles
  • ADDPD, MULPD, DIVPD: arithmetic
  • MOVAPD: load aligned pair

24
Vectorizing DAXPY
  • Register block = small dense matrix x vector
  • Depends on matrix data ordering (see the sketch below)
  • Column-major (Fortran-style):
  • Need a scalar x vector operation
  • Row-major (C-style):
  • Need a reduce (dot product)
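  A sketch of the two orientations for a 2x2 block (illustrative
  code, not Sparsity's):

    /* Column-major: each column j contributes a scalar * vector
       update, y += A(:,j) * x(j) -- pairs naturally, no reduction. */
    for (int j = 0; j < 2; j++) {
        y[0] += A[0 + 2*j] * x[j];
        y[1] += A[1 + 2*j] * x[j];
    }

    /* Row-major: each row i is a dot product, so a SIMD version
       must reduce (sum) across the vector register. */
    for (int i = 0; i < 2; i++)
        y[i] += A[2*i + 0] * x[0] + A[2*i + 1] * x[1];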

25
Sparsity register block layout
  • Row-major order within a block
  • Vs. the Sparse BLAS proposal (column-major)!
  • Vector reductions change associativity (results
    may differ from the scalar version, due to roundoff)
  • We chose to keep it for now
  • Can't just switch: algorithm orientation affects
    the stride of vector loads
  • Need a good vector reduction

26
Vector reduce
  • e.g., C. Kozyrakis' recent UC Berkeley Ph.D.
    thesis on multimedia vector ops
  • vhalf instruction:
  • Copy the lower half of the source vector register
    to the upper half of the destination
  • Iterate (vhalf, vector add) to reduce.

27
SSE-2 has vhalf!
  • Sum the 2 elements of xmm1:
  • --------------------------------
  • Low 8B of xmm1 -> high 8B of xmm0:
  • SHUFPD xmm0, xmm1
  • High 8B of xmm0 gets the sum:
  • ADDPD xmm0, xmm1
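  The same idiom with SSE-2 intrinsics (a sketch; a one-register
  variant of the shuffle-then-add above):

    #include <emmintrin.h>

    /* Horizontal sum of the two doubles in v via SHUFPD + ADDPD. */
    static double hsum_pd(__m128d v)
    {
        __m128d hi = _mm_shuffle_pd(v, v, 1);    /* swap the two halves   */
        return _mm_cvtsd_f64(_mm_add_pd(v, hi)); /* both lanes hold the sum */
    }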

28
One possible SSE-2 6x6 A*x
  • xmm0 <- (dest(0), 0)
  • 6 MOVAPDs: interleave matrix-row pairs and source-
    vector pairs
  • Update indices
  • 3x (MULPD, then ADDPD to xmm0)
  • Sum the elements of xmm0
  • (SHUFPD and ADDPD)
  • Extract and store the sum
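  One row of that 6x6 block with intrinsics (a sketch following the
  slide's outline; row and x are assumed 16-byte aligned, as MOVAPD
  requires):

    #include <emmintrin.h>

    static double row6_dot(const double *row, const double *x)
    {
        __m128d acc = _mm_setzero_pd();
        for (int k = 0; k < 6; k += 2) {           /* 3x MULPD + ADDPD */
            __m128d r = _mm_load_pd(row + k);      /* MOVAPD: row pair */
            __m128d v = _mm_load_pd(x + k);        /* MOVAPD: x pair   */
            acc = _mm_add_pd(acc, _mm_mul_pd(r, v));
        }
        __m128d hi = _mm_shuffle_pd(acc, acc, 1);  /* SHUFPD           */
        return _mm_cvtsd_f64(_mm_add_pd(acc, hi)); /* ADDPD + extract  */
    }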

29
SSE-2: gcc and Intel C compilers won't vectorize!
  • They use SIMD registers for scalar math!
  • SSE-2 latency is 1 cycle less than x87
  • x87 uses the same functional unit as SIMD anyway
  • Vector reduce sub-optimal?
  • Fewer ops: less latency-hiding potential
  • Only 8 XMM regs: can't unroll
  • Column-major suboptimal:
  • No scalar x vector instruction!
  • Or the alignment issue...

30
Small matrix library
  • From Intel: matrix x vector
  • Optimized for 6x6 or smaller
  • Idea:
  • Replace Sparsity's explicit (BLAS-1-like)
    register block multiplication...
  • ...with an optimized function (BLAS-2-like)
  • We're working on this
  • Needed to say whether SIMD is valuable

31
SIMD load alignment
  • Possible reason for no automatic vectorization
  • Load pair needs alignment on 16-byte boundaries
  • Non-aligned loads are slower
  • Compiler can't guarantee alignment
  • Itanium 2: the same issue reappears...

32
SSE-2 results: Disappointing
  • Pentium M gains nothing
  • Pentium 4 actually gains a little
  • SSE-2: 1 cycle lower latency than x87
  • Small blocks: latency dominates
  • x87 ISA harder to schedule
  • AMD Opteron not available for testing
  • 16 XMM regs (vs. 8): better unrolling capability?

33
How SSE-2 should look: STREAM Scale
  • b[0..N-1] = scalar * c[0..N-1]
  • (speedup 1.72)
  • Loop:
  •   movapd  c(%eax), %xmm4
  •   mulpd   %xmm0, %xmm4
  •   movntpd %xmm4, b(%eax)
  •   addl    $16, %eax
  •   cmpl    $16000000, %eax
  •   jl      Loop

34
NetBurst (Pentium 4/M architecture) (Note: diagram used
without permission)
35
(No Transcript)
36
Can NetBurst keep up with DAXPY?
  • In one cycle:
  • 1 load of an aligned pair, 1 store of an aligned pair,
    1 SIMD flop (alternating ADDPD/MULPD)
  • DAXPY (in row-major): Triad-like:
  • y(i) = y(i) + A(i,j) * x(j)
  • If y(i) is loaded: 2 loads, 1 mul, 1 add, 1 store
  • Ratio of loads to stores inadequate?
  • Itanium 2 changes this...

37
Itanium 2: Streaming floating-point
  • NO SSE-2 support!!!
  • BUT: In 1 cycle, 2 MMF bundles:
  • 2 load-pairs (4 loads), 2 stores
  • 2 FMACs (a + s * b)
  • (Or MFI: load pair, FMAC, update index)
  • 1 cycle: theoretically 2x DAXPY!

38
Itanium 2: Alignment strikes again!
  • The Intel C Compiler won't generate load-pair
    instructions!!!
  • Why?
  • ldfpd (load pair) needs aligned data
  • Compiler doesn't see the underlying dense BLAS 2
    structure?
  • Register pressure?

39
SIMD conclusions
  • STREAM Triad suggests a modest potential speedup
  • Multiple scalar functional units:
  • More flexible than SIMD: speedup is independent of
    orientation
  • Code scheduling is difficult
  • Use pragmas to tell the compiler data is aligned
  • Encapsulate the block A*x in a hand-coded routine

40
Conclusions
  • Our benchmark:
  • Good SMVM performance prediction
  • Scales to any typical uniprocessor
  • With optimal block sizes:
  • Performance tied to memory bandwidth
  • With 1x1 blocks:
  • Performance related more to latency

41
Conclusions (2)
  • SIMD: Need to test custom mini dense matrix x
    vector routines
  • Development will continue after this semester:
  • More testing
  • Parallelization