Auto-tuning Sparse Matrix Kernels
1
Auto-tuning Sparse Matrix Kernels
  • Sam Williams1,2
  • Richard Vuduc3, Leonid Oliker1,2, John Shalf2,
    Katherine Yelick1,2,
  • James Demmel1,2
  • 1University of California Berkeley 2Lawrence
    Berkeley National Laboratory 3Georgia
    Institute of Technology
  • samw@cs.berkeley.edu

2
Motivation
  • Multicore is the de facto solution for improving
    peak performance for the next decade
  • How do we ensure this applies to sustained
    performance as well?
  • Processor architectures are extremely diverse and
    compilers can rarely fully exploit them
  • We require a HW/SW solution that guarantees
    performance without completely sacrificing
    productivity

3
Overview
  • Examine the Sparse Matrix-Vector Multiplication
    (SpMV) kernel
  • Present and analyze two threaded auto-tuned
    implementations
  • Benchmarked performance across 4 diverse
    multicore architectures
  • Intel Xeon (Clovertown)
  • AMD Opteron
  • Sun Niagara2 (Huron)
  • IBM QS20 Cell Blade
  • We show:
  • Auto-tuning can significantly improve performance
  • Cell consistently delivers good performance and
    efficiency
  • Niagara2 delivers good performance and
    productivity

4
Multicore SMPs used
5
Multicore SMP Systems
6
Multicore SMP Systems (memory hierarchy)
Conventional Cache-based Memory Hierarchy
7
Multicore SMP Systems (memory hierarchy)
Conventional Cache-based Memory Hierarchy
Disjoint Local Store Memory Hierarchy
8
Multicore SMP Systems (memory hierarchy)
Cache-based: Pthreads implementations
Local store-based: libspe implementations
9
Multicore SMP Systems (peak flops)
[Clovertown: 75 Gflop/s; Opteron: 17 Gflop/s; Niagara2: 11 Gflop/s; Cell Blade: PPEs 13 Gflop/s, SPEs 29 Gflop/s]
10
Multicore SMP Systems (peak DRAM bandwidth)
[Clovertown: 21 GB/s read, 10 GB/s write; Opteron: 21 GB/s; Cell Blade: 51 GB/s; Niagara2: 42 GB/s read, 21 GB/s write]
11
Multicore SMP Systems
[Non-Uniform Memory Access: Opteron, Cell Blade; Uniform Memory Access: Clovertown, Niagara2]
12
Arithmetic Intensity
[Figure: kernels arranged by arithmetic intensity. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods]
  • Arithmetic intensity = total flops / total DRAM
    bytes
  • Some HPC kernels have an arithmetic intensity
    that scales with problem size (increasing
    temporal locality)
  • But there are many important and interesting
    kernels that don't (a rough worked example for
    SpMV follows below)

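As a rough worked example (my own, not from the slides): assuming 8-byte double values and 4-byte column indices in CSR, and ignoring row pointers and vector traffic, each nonzero costs 2 flops but moves at least 12 bytes, which matches the ">6 bytes/flop" figure quoted on the SpMV slide later in the deck.

    /* Back-of-envelope bound on CSR SpMV arithmetic intensity
       (a sketch; assumptions as stated above). */
    #include <stdio.h>

    int main(void) {
        double flops_per_nnz = 2.0;        /* one multiply + one add             */
        double bytes_per_nnz = 8.0 + 4.0;  /* double value + 32-bit column index */
        printf("AI <= %.2f flops/byte  (i.e. >= %.0f bytes/flop)\n",
               flops_per_nnz / bytes_per_nnz,
               bytes_per_nnz / flops_per_nnz);
        return 0;
    }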
13
Auto-tuning
14
Auto-tuning
  • Hand-optimizing each architecture/dataset
    combination is not feasible
  • Goal: a productive solution for performance
    portability
  • Our auto-tuning approach finds a good solution by
    a combination of heuristics and exhaustive search
    (a minimal search-harness sketch follows below)
  • A Perl script generates many possible kernels
  • (Generates SIMD-optimized kernels)
  • An auto-tuning benchmark examines the kernels and
    reports back with the best one for the current
    architecture/dataset/compiler/...
  • Performance depends on the optimizations
    generated
  • Heuristics are often desirable when the search
    space isn't tractable
  • Proven value in dense linear algebra (ATLAS),
    spectral methods (FFTW, SPIRAL), and sparse
    methods (OSKI)

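A minimal sketch of the exhaustive-search half of such a tuner (hypothetical names; the deck's generator is a Perl script emitting many C kernel variants): time every generated candidate on the target matrix and keep the fastest.

    /* Sketch: pick the fastest of N generated SpMV kernel variants. */
    #define _POSIX_C_SOURCE 199309L
    #include <time.h>

    typedef void (*spmv_kernel_t)(const void *A, const double *x, double *y);

    static double time_kernel(spmv_kernel_t k, const void *A,
                              const double *x, double *y, int trials) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < trials; i++) k(A, x, y);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return ((t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec)) / trials;
    }

    /* Returns the index of the fastest variant on this machine/matrix. */
    int pick_best(spmv_kernel_t *variants, int nvariants,
                  const void *A, const double *x, double *y) {
        int best = 0;
        double best_t = 1e300;
        for (int v = 0; v < nvariants; v++) {
            double t = time_kernel(variants[v], A, x, y, 10);
            if (t < best_t) { best_t = t; best = v; }
        }
        return best;
    }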
15
Sparse Matrix-Vector Multiplication (SpMV)
  • Samuel Williams, Leonid Oliker, Richard Vuduc,
    John Shalf, Katherine Yelick, James Demmel,
    "Optimization of Sparse Matrix-Vector
    Multiplication on Emerging Multicore Platforms",
    Supercomputing (SC), 2007.

16
Sparse Matrix-Vector Multiplication
  • Sparse matrix
  • Most entries are 0.0
  • Performance advantage in only storing/operating
    on the nonzeros
  • Requires significant metadata (a CSR sketch
    follows below)
  • Evaluate y = Ax
  • A is a sparse matrix
  • x, y are dense vectors
  • Challenges
  • Difficult to exploit ILP (bad for superscalar)
  • Difficult to exploit DLP (bad for SIMD)
  • Irregular memory access to the source vector
  • Difficult to load balance
  • Very low computational intensity (often >6
    bytes/flop)
  • Likely memory bound

[Figure: sparse matrix A times dense vector x yields dense vector y]
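A sketch of the metadata a CSR (compressed sparse row) representation carries; field names here are illustrative, not the authors' actual code:

    /* CSR: row offsets, plus a column index and value per nonzero. */
    typedef struct {
        int     nrows, ncols, nnz;
        int    *row_start;  /* nrows+1 entries: offsets into cols/vals   */
        int    *cols;       /* nnz entries: column index of each nonzero */
        double *vals;       /* nnz entries: the nonzero values           */
    } csr_matrix;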
17
Dataset (Matrices)
[Figure: spy plots of the 14 matrices: Dense (a 2K x 2K dense matrix stored in sparse format); Protein; FEM/Spheres; FEM/Cantilever; Wind Tunnel; FEM/Harbor; QCD; FEM/Ship; Economics; Epidemiology; FEM/Accelerator; Circuit; webbase; LP. Categories range from well structured (sorted by nonzeros/row) through a poorly structured hodgepodge to extreme aspect ratio (linear programming)]
  • Pruned original SPARSITY suite down to 14
  • none should fit in cache
  • Subdivided them into 4 categories
  • Rank ranges from 2K to 1M

18
Naïve Serial Implementation
  • Vanilla C implementation (a sketch of the kernel
    follows below)
  • Matrix stored in CSR (compressed sparse row)
  • Explored compiler options, but only the best is
    presented here
  • An x86 core delivers >10x the performance of a
    Niagara2 thread

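A sketch of the naïve serial kernel in the spirit of the "vanilla C" baseline, reusing the csr_matrix type from the earlier sketch (not the authors' exact code):

    /* y = A*x for a CSR matrix, one row at a time. */
    void spmv_csr_serial(const csr_matrix *A, const double *x, double *y) {
        for (int r = 0; r < A->nrows; r++) {
            double sum = 0.0;
            for (int k = A->row_start[r]; k < A->row_start[r + 1]; k++)
                sum += A->vals[k] * x[A->cols[k]];
            y[r] = sum;
        }
    }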
19
Naïve Parallel Implementation
  • SPMD style
  • Partition by rows (a partitioning sketch follows
    below)
  • Load balance by nonzeros
  • Niagara2 delivers roughly 2.5x the performance of
    the x86 machines

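A sketch of the SPMD decomposition described above (illustrative, reusing the csr_matrix type): each thread gets a contiguous block of rows holding roughly nnz/nthreads nonzeros, then runs the serial kernel on its block.

    /* Compute per-thread row ranges so nonzeros are balanced. */
    void partition_rows_by_nnz(const csr_matrix *A, int nthreads,
                               int *first_row /* nthreads+1 entries */) {
        first_row[0] = 0;
        int r = 0;
        for (int t = 1; t < nthreads; t++) {
            long target = (long)A->nnz * t / nthreads;
            while (r < A->nrows && A->row_start[r] < target) r++;
            first_row[t] = r;   /* thread t starts here */
        }
        first_row[nthreads] = A->nrows;
        /* thread t then computes y over rows [first_row[t], first_row[t+1]) */
    }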
[Performance charts: Naïve serial vs. Naïve Pthreads]
20
Naïve Parallel Implementation
  • SPMD style
  • Partition by rows
  • Load balance by nonzeros
  • Niagara2 delivers roughly 2.5x the performance of
    the x86 machines

[Chart annotations: Clovertown: 8x cores, 1.9x performance; Opteron: 4x cores, 1.5x performance; Niagara2: 64x threads, 41x performance; Cell PPEs: 4x threads, 3.4x performance. Series: Naïve, Naïve Pthreads]
21
Naïve Parallel Implementation
  • SPMD style
  • Partition by rows
  • Load balance by nonzeros
  • Niagara2 delivers roughly 2.5x the performance of
    the x86 machines

[Chart annotations: 1.4% of peak flops / 29% of bandwidth; 4% of peak flops / 20% of bandwidth; 25% of peak flops / 39% of bandwidth; 2.7% of peak flops / 4% of bandwidth. Series: Naïve, Naïve Pthreads]
22
Auto-tuned Performance (NUMA, SW Prefetching)
  • Use first touch or libnuma to exploit NUMA
  • Also includes process affinity
  • Tag prefetches with temporal-locality hints
  • Auto-tune: search for the optimal prefetch
    distances (a prefetching sketch follows below)

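A sketch of what tuned software prefetching might look like in the CSR inner loop, with a lookahead PF_DIST that the auto-tuner would sweep; it reuses the csr_matrix type and GCC's __builtin_prefetch with a non-temporal locality hint (illustrative, not the authors' code).

    #ifndef PF_DIST
    #define PF_DIST 64   /* candidate prefetch distance, in nonzeros */
    #endif

    void spmv_csr_prefetch(const csr_matrix *A, const double *x, double *y) {
        for (int r = 0; r < A->nrows; r++) {
            double sum = 0.0;
            for (int k = A->row_start[r]; k < A->row_start[r + 1]; k++) {
                /* hint: stream matrix data PF_DIST nonzeros ahead */
                __builtin_prefetch(&A->vals[k + PF_DIST], 0, 0);
                __builtin_prefetch(&A->cols[k + PF_DIST], 0, 0);
                sum += A->vals[k] * x[A->cols[k]];
            }
            y[r] = sum;
        }
    }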
[Chart series: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching]
23
Auto-tuned Performance (Matrix Compression)
  • If memory bound, the only hope is minimizing
    memory traffic
  • Heuristically compress the parallelized matrix to
    minimize it (a register-blocking sketch follows
    below)
  • Implemented with SSE
  • Benefit of prefetching is hidden by the
    requirement of register blocking
  • Options: register blocking, index size, format,
    etc.

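One compression variant as a sketch: 2x2 register-blocked CSR (BCSR) with 16-bit block-column indices to shrink the index metadata. Types and names are illustrative; the real tuner also tries other block shapes, index sizes, and formats, and the production kernels use SSE.

    #include <stdint.h>

    typedef struct {
        int       nblockrows, nblocks;
        int      *brow_start;  /* nblockrows+1 offsets into bcols/bvals   */
        uint16_t *bcols;       /* block-column index of each 2x2 block    */
        double   *bvals;       /* nblocks*4 values, stored block by block */
    } bcsr22_matrix;

    void spmv_bcsr22(const bcsr22_matrix *A, const double *x, double *y) {
        for (int br = 0; br < A->nblockrows; br++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = A->brow_start[br]; k < A->brow_start[br + 1]; k++) {
                const double *b  = &A->bvals[4 * k];
                const double *xp = &x[2 * A->bcols[k]];
                y0 += b[0] * xp[0] + b[1] * xp[1];
                y1 += b[2] * xp[0] + b[3] * xp[1];
            }
            y[2 * br]     = y0;
            y[2 * br + 1] = y1;
        }
    }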
[Chart series: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression]
24
Auto-tuned Performance (Cache/TLB Blocking)
  • Reorganize the matrix to maximize locality of
    source-vector accesses (a cache-blocking sketch
    follows below)

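A sketch of the idea: split A column-wise into panels so each panel touches only a cache-sized (or TLB-friendly) window of the source vector, then run SpMV panel by panel, accumulating into y. Names are illustrative, each panel is assumed to store panel-local column indices, and the csr_matrix type from the earlier sketch is reused.

    void spmv_cache_blocked(const csr_matrix *const *panels, int npanels,
                            const int *panel_col_start, /* first column of each panel */
                            const double *x, double *y, int nrows) {
        for (int r = 0; r < nrows; r++) y[r] = 0.0;
        for (int p = 0; p < npanels; p++) {
            const csr_matrix *P  = panels[p];              /* columns of panel p only */
            const double     *xp = &x[panel_col_start[p]]; /* cache-resident window   */
            for (int r = 0; r < P->nrows; r++)
                for (int k = P->row_start[r]; k < P->row_start[r + 1]; k++)
                    y[r] += P->vals[k] * xp[P->cols[k]];   /* cols are panel-local */
        }
    }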
[Chart series: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking]
25
Auto-tuned Performance (DIMMs, Firmware, Padding)
  • Clovertown was already fully populated with DIMMs
  • Gave Opteron as many DIMMs as Clovertown
  • Firmware update for Niagara2
  • Array padding to avoid inter-thread conflict
    misses
  • PPEs use 1/3 of Cell chip area

[Chart series: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking, +More DIMMs (Opteron), FW fix, array padding (Niagara2), etc.]
26
Auto-tuned Performance (DIMMs, Firmware, Padding)
  • Clovertown was already fully populated with DIMMs
  • Gave Opteron as many DIMMs as Clovertown
  • Firmware update for Niagara2
  • Array padding to avoid inter-thread conflict
    misses
  • PPEs use 1/3 of Cell chip area

[Chart annotations: 4% of peak flops / 52% of bandwidth; 20% of peak flops / 65% of bandwidth; 54% of peak flops / 57% of bandwidth; 10% of peak flops / 10% of bandwidth. Chart series as on the previous slide]
27
Auto-tuned Performance (Cell/SPE version)
  • Wrote a double-precision Cell/SPE version
  • DMA, local-store blocked, NUMA aware, etc.
  • Only 2x1 and larger BCOO blocks (a plain-C 2x1
    BCOO sketch follows below)
  • Only the SpMV-proper routine changed
  • About 12x faster (median) than using the PPEs
    alone

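A plain-C sketch of a 2x1 BCOO (blocked coordinate) inner kernel; the real Cell code runs the equivalent per SPE over DMA-ed local-store tiles, which is omitted here (types and names are illustrative).

    #include <stdint.h>

    typedef struct {
        int       nblocks;
        uint16_t *brow;   /* block-row index: block covers rows 2*brow, 2*brow+1 */
        uint16_t *bcol;   /* column index of the 2x1 block                       */
        double   *bvals;  /* nblocks*2 values                                    */
    } bcoo21_tile;

    void spmv_bcoo21(const bcoo21_tile *T, const double *x, double *y) {
        for (int k = 0; k < T->nblocks; k++) {
            double xv = x[T->bcol[k]];
            y[2 * T->brow[k]    ] += T->bvals[2 * k]     * xv;
            y[2 * T->brow[k] + 1] += T->bvals[2 * k + 1] * xv;
        }
    }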
[Chart series as on the previous slides]
28
Auto-tuned Performance (Cell/SPE version)
  • Wrote a double-precision Cell/SPE version
  • DMA, local-store blocked, NUMA aware, etc.
  • Only 2x1 and larger BCOO blocks
  • Only the SpMV-proper routine changed
  • About 12x faster than using the PPEs alone

[Chart annotations: 4% of peak flops / 52% of bandwidth; 20% of peak flops / 65% of bandwidth; 54% of peak flops / 57% of bandwidth; with the SPE version, 40% of peak flops / 92% of bandwidth. Chart series as on the previous slides]
29
Auto-tuned Performance (How much did double
precision and 2x1 blocking hurt?)
  • Model faster cores by commenting out the inner
    kernel calls, but still performing all DMAs
  • Enabled 1x1 BCOO
  • 16% improvement

[Chart series as before, plus "better Cell implementation"]
30
Speedup from Auto-tuning: Median (max)
  • Wrote a double precision Cell/SPE version
  • DMA, local store blocked, NUMA aware, etc
  • Only 2x1 and larger BCOO
  • Only the SpMV-proper routine changed
  • About 12x faster than using the PPEs alone.

[Chart annotations, median (max) speedup from auto-tuning per machine: 3.9x (4.4x); 1.6x (2.7x); 1.3x (2.9x); 26x (34x). Chart series as on the previous slides]
31
Summary
32
Aggregate Performance (Fully optimized)
  • Cell consistently delivers the best full-system
    performance
  • Although Niagara2 delivers nearly comparable
    per-socket performance
  • Dual-core Opteron delivers far better performance
    (bandwidth) than Clovertown
  • Clovertown has far too little effective FSB
    bandwidth
  • Huron has far more bandwidth than it can exploit
    (too much latency, too few cores)

33
Parallel Efficiency (average performance per
thread, fully optimized)
  • Aggregate Mflop/s divided by number of cores
  • Niagara2 and Cell showed very good multicore
    scaling
  • Clovertown showed very poor multicore scaling on
    both applications
  • For SpMV, Opteron and Clovertown showed good
    multisocket scaling

34
Power Efficiency (Fully Optimized)
  • Used a digital power meter to measure sustained
    power under load
  • Calculate power efficiency as
  • sustained performance / sustained power
  • All cache-based machines delivered similar power
    efficiency
  • FBDIMMs (12W each) add substantially to sustained
    power
  • 8 DIMMs on Clovertown (total of 330W)
  • 16 DIMMs on the N2 machine (total of 450W)

35
Productivity
  • Niagara2 required significantly less work to
    deliver good performance.
  • Cache-based machines required a search over some
    optimizations, while Cell relied solely on
    heuristics (less time to tune)

36
Summary
  • Paradoxically, the most complex/advanced
    architectures required the most tuning, and
    delivered the lowest performance.
  • Niagara2 delivered both very good performance and
    productivity
  • Cell delivered very good performance and
    efficiency (processor and power)
  • Our multicore-specific auto-tuned SpMV
    implementation significantly outperformed
    existing parallelization strategies, including an
    auto-tuned MPI implementation (as discussed at
    SC07)
  • Architectural transparency is invaluable in
    optimizing code

37
Acknowledgements
  • UC Berkeley
  • RADLab Cluster (Opterons)
  • PSI cluster (Clovertowns)
  • Sun Microsystems
  • Niagara2 donations
  • Forschungszentrum Jülich
  • Cell blade cluster access

38
Questions?
39
switch to pOSKI