Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Understanding the Efficiency of GPU Algorithms
for Matrix-Matrix Multiplication

  • Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
  • Stanford University
  • August 30, 2004

2
Motivation: Harness GPU Performance
[Chart: relative performance (peak FLOPS and memory bandwidth) of the P4 3.4GHz, 6800 Ultra, and X800 XT PE]
3
Streaming Computation on GPUs
  • GPUs accelerate streaming numerical algorithms
  • Data parallelism
  • High ratio of arithmetic to data access
  • Little data reuse

[Diagram: kernel function (shader) maps input elements to output elements]
4
Streaming Computation on GPUs
  • Level 1 BLAS operations: Buck et al. [2004]
  • Fluid solvers: Krüger & Westermann [2003], Bolz et al. [2003]
  • Image processing: Apple Corp. [2004], McCormick et al. [2004]
  • Segmentation: Sherbondy et al. [2003]
  • Database operations: Govindaraju et al. [2004]
  • Data clustering: Hall et al. [2004]

5
Dense Matrix Multiplication


[Diagram: C = A × B]
  • Abundant data parallelism
  • Regular data access (no branching)
  • High ratio of computation to data access
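The compute-to-data ratio can be made concrete with a quick calculation (a sketch, not from the slides): multiplying two n×n matrices takes 2n³ floating-point operations while touching only 3n² matrix entries, so arithmetic intensity grows linearly with n.

```python
# Arithmetic intensity of dense n x n matrix multiplication:
# 2*n**3 flops over 3*n**2 unique elements (4 bytes each as float32).
def flops(n):
    return 2 * n**3          # one multiply + one add per inner-loop step

def elements(n):
    return 3 * n**2          # A and B read once, C written once

for n in (256, 1024):
    intensity = flops(n) / (4 * elements(n))   # flops per byte of unique data
    print(n, intensity)
```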

6
Dense Matrix Multiplication
  • Widely used computational kernel
  • Building block for LAPACK library

7
Matrix Multiplication on GPUs
  • Larsen & McAllister [2001]
  • Moravansky [2003]
  • Hall et al. [2003]
  • Limited analysis of performance

8
Overview
  • GPU Implementations
  • Results
  • Analysis: Why GPUs are slow
  • Ways to Make GPUs Better

9
CPU-Based Approaches
  • High performance matrix multiplication algorithms
    are cache aware



[Diagram: blocked computation of C = A × B]
  • Partition computation into submatrix
    multiplications
  • Load input submatrices into cache
  • Multiply submatrices
  • Store output submatrix to memory
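The blocking steps above can be sketched in plain Python (an illustrative sketch; the block size `b` and the row-major list-of-lists layout are assumptions, not from the slides):

```python
# Cache-aware (blocked) matrix multiplication: C = A * B.
# Each (bi, bj) output submatrix is accumulated from b x b submatrix
# products so the working set of each step fits in cache.
def blocked_matmul(A, B, n, b):
    """A, B: n x n matrices as lists of lists; b: block size dividing n."""
    C = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, b):            # output submatrix row
        for bj in range(0, n, b):        # output submatrix column
            for bk in range(0, n, b):    # "load" input submatrices
                for i in range(bi, bi + b):      # multiply submatrices
                    for k in range(bk, bk + b):
                        a = A[i][k]
                        for j in range(bj, bj + b):
                            C[i][j] += a * B[k][j]
    return C                             # "store" output to memory
```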

10
Method 1: Column Packed (CP)
[Diagram: matrices A, B, C stored as textures]
4 elements stored per texel
4x4 matrix by 4-vector multiplications
Larsen & McAllister [SC2001]; Moravansky [2003]
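A rough simulation of the CP mapping (a NumPy sketch; `cp_matmul` and its loop structure are illustrative, not the authors' shader code):

```python
import numpy as np

# Column-packed (CP) sketch: each "texel" holds 4 consecutive column
# elements, so one shader step multiplies a 4x4 block of A by a
# 4-element texel of B (matrix sizes divisible by 4 are assumed).
def cp_matmul(A, B):
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.float32)
    for j in range(n):                 # one output column at a time
        for ti in range(0, n, 4):      # each output texel: 4 rows of C
            acc = np.zeros(4, dtype=np.float32)
            for tk in range(0, n, 4):  # fetch a 4x4 block and a 4-texel
                acc += A[ti:ti+4, tk:tk+4] @ B[tk:tk+4, j]
            C[ti:ti+4, j] = acc
    return C
```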
11
Method 2: Submatrix Packed (SP)
[Diagram: matrices A, B, C stored as textures]
2x2 submatrix stored per texel
2x2 by 2x2 submatrix multiplications
Hall et al. [2003]
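Likewise, the SP mapping can be simulated (a NumPy sketch; `sp_matmul` is illustrative, not the authors' shader code):

```python
import numpy as np

# Submatrix-packed (SP) sketch: each texel holds a 2x2 submatrix, and
# one shader step performs a 2x2-by-2x2 submatrix multiply-accumulate
# (matrix sizes divisible by 2 are assumed).
def sp_matmul(A, B):
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.float32)
    for ti in range(0, n, 2):
        for tj in range(0, n, 2):          # one 2x2 output texel
            acc = np.zeros((2, 2), dtype=np.float32)
            for tk in range(0, n, 2):      # fetch two 2x2 texels per step
                acc += A[ti:ti+2, tk:tk+2] @ B[tk:tk+2, tj:tj+2]
            C[ti:ti+2, tj:tj+2] = acc
    return C
```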
12
Alternative Approaches (Ineffective)
  • Varied mapping into texture memory
  • Altered rasterization order with geometry
    • Single quad most effective
  • Utilized multiple outputs
  • Varied amount of loop unrolling
    • Column packed: unroll maximally
    • Submatrix packed: unroll 128 times

13
Performance Results
  • Pentium 4 3GHz CPU, 512KB L2 cache
    • 12 GFLOPS peak compute
    • 44.1 GB/sec cache BW
    • Using sgemm routine from the ATLAS package
  • NVIDIA
    • GeForce 5900 Ultra
    • GeForce 6800 Ultra
  • ATI
    • Radeon 9800 XT
    • Radeon X800 XT PE (prerelease: 500MHz mem / 500MHz core clock)

14
Previous Generation GPUs
Multiplication of 1024x1024 Matrices
[Chart: GFLOPS (0-12) and observed bandwidth in GB/sec (0-30) for the P4 3GHz, 5900 Ultra, and 9800 XT]
15
Current Generation GPUs
Multiplication of 1024x1024 Matrices
[Chart: GFLOPS (0-12) and observed bandwidth in GB/sec (0-30) for the P4 3GHz, 6800 Ultra, and X800 XT PE]
16
Fragment Processor Data Paths
[Diagram: fragment processor and texture unit, with the L1 texture cache fed from L2 and results written to the frame buffer]
17
GPU Microbenchmarks
Peak Arithmetic Rate
[Chart: peak arithmetic rate in GFLOPS (0-70) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
18
GPU Microbenchmarks
Observed Bandwidth
[Chart: cache bandwidth and sequential bandwidth in GB/sec (0-30) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
19
Fragment Processor Data Paths
[Diagram: the register path from the texture unit is low bandwidth (1 float/clock), while the texture-filtering path is high bandwidth; the fragment processor executes one 4-wide MAD per clock]
The fragment processor consumes data at 8X the rate the texture unit provides it!
20
Datapaths Designed for Shading
[Diagram: the texture unit performs an 8-to-1 reduction on data from the L1 texture cache, delivering 4 components per clock to the fragment processor]
  • 8-bit components
  • 2 to 1 ratio of compute to bandwidth
  • Texture units filter (reduce) data
  • Shaders use interpolated values and constants

21
Compute and Bandwidth Efficiency
[Chart: percentage of peak compute and bandwidth achieved on the 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz]
GPU algorithms are severely bandwidth limited!
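The bandwidth limit follows a simple roofline-style model: achievable throughput is capped by arithmetic intensity times memory bandwidth until the compute peak is reached. A sketch with hypothetical numbers (not the measured values from the charts):

```python
# Roofline-style check (illustrative numbers only, not measured values):
# achievable FLOPS = min(peak_flops, arithmetic_intensity * peak_bandwidth).
def attainable_gflops(peak_gflops, bw_gb_s, flops_per_byte):
    return min(peak_gflops, flops_per_byte * bw_gb_s)

# With register blocking limited to small submatrices, each 4-byte float
# fetched feeds only a few MADs, so arithmetic intensity stays low and
# the bandwidth term dominates.
peak, bw = 60.0, 30.0            # hypothetical GPU: 60 GFLOPS, 30 GB/sec
for intensity in (0.5, 1.0, 2.0, 4.0):
    print(intensity, attainable_gflops(peak, bw, intensity))
```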
22
Minimize Texture Fetches
  • Block in shader register file
  • Would need 8x8 submatrices to run at peak rates
  • Limited to 4x4 submatrices by available outputs
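The 8x8 figure follows from a little arithmetic (a sketch; the fetch model is a simplification): blocking a b×b output submatrix in registers yields roughly b flops per float fetched, and the fragment processor can consume about 8X what the texture path delivers, so b must be about 8.

```python
# For a b x b output submatrix held in registers, each step along the
# k-dimension fetches about 2*b floats (a column of A's block and a row
# of B's block) and performs 2*b*b flops: b flops per float fetched.
def flops_per_float(b):
    return (2 * b * b) / (2 * b)     # simplifies to b

# The compute path consumes ~8X what the texture path provides, so an
# 8x8 submatrix is needed to balance compute against fetch bandwidth.
for b in (2, 4, 8):
    print(b, flops_per_float(b))
```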

23
Improvement 1: Widen Datapath
  • Fragment processor receives cached data more quickly
  • Expect performance to improve linearly with increase in bandwidth
  • Need 4X improvement to achieve peak performance
  • But L2 may no longer be able to fill L1

24
Improvement 2: Larger Scratch Space
  • Requires a large number of registers
  • Needs a large number of output values
  • Reduces texture bandwidth requirements
    • Performance increases linearly with the dimension of the submatrices
  • Increases amount of per-pixel state
    • Storage increases as the square of the dimension of the submatrices
  • Requires 16X the space of the SP method for peak performance

25
Summary
  • GPU algorithms for matrix-matrix multiplication run inefficiently
  • Best algorithms achieve below 20% of peak performance
  • Saturate the data path between texture and FP units
  • Cache-aware software blocking strategies do not improve performance
    • Cannot exploit data reuse
  • Hardware limits algorithm efficiency

26
Summary
  • Hardware changes required to improve efficiency
  • Widen path between texture and register file
  • Output large number of values from shaders
  • Improved efficiency would make GPUs a powerful platform for a broader
    class of numerical algorithms

27
Acknowledgements
  • Thanks to Ian Buck, Mike Houston, Sean Treichler,
    Nick Triantos, Steve Morein
  • Support from ATI, NVIDIA, DARPA, IBM, SONY
  • Rambus Stanford Graduate Fellowship
  • Stanford School of Engineering Fellowship

28
Questions?