Title: Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
1. Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
- Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
- Stanford University
- August 30, 2004
2. Motivation: Harness GPU Performance
[Chart: relative peak FLOPS and memory bandwidth of the P4 3.4GHz, GeForce 6800 Ultra, and Radeon X800 XT PE]
3. Streaming Computation on GPUs
- GPUs accelerate streaming numerical algorithms
  - Data parallelism
  - High ratio of arithmetic to data access
  - Little data reuse
[Diagram: a kernel function (shader) maps input elements to output elements]
4. Streaming Computation on GPUs
- Level 1 BLAS operations [Buck et al. 2004]
- Fluid solvers [Krüger and Westermann 2003; Bolz et al. 2003]
- Image processing [Apple Corp. 2004; McCormick et al. 2004]
- Segmentation [Sherbondy et al. 2003]
- Database operations [Govindaraju et al. 2004]
- Data clustering [Hall et al. 2004]
5. Dense Matrix Multiplication
[Diagram: C = A x B]
- Abundant data parallelism
- Regular data access (no branching)
- High ratio of computation to data access
6. Dense Matrix Multiplication
- Widely used computational kernel
- Building block for the LAPACK library
7. Matrix Multiplication on GPUs
- Larsen and McAllister 2001
- Moravánszky 2003
- Hall et al. 2003
- Prior work offers limited analysis of performance
8. Overview
- GPU Implementations
- Results
- Analysis: Why GPUs Are Slow
- Ways to Make GPUs Better
9. CPU-Based Approaches
- High-performance matrix multiplication algorithms are cache aware
[Diagram: C = A x B partitioned into submatrix blocks]
- Partition computation into submatrix multiplications
- Load input submatrices into cache
- Multiply submatrices
- Store output submatrix to memory
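The blocking strategy above can be sketched on the CPU as follows; this is a minimal pure-Python illustration (the block size `b` and function name are illustrative, not from the talk), where each `b x b` block of the inputs is reused many times while it would be resident in cache.

```python
# Cache-aware (blocked) matrix multiply: partition C = A * B into
# b x b submatrix products so each loaded input block is reused
# b times before it is evicted.

def blocked_matmul(A, B, n, b):
    """Multiply two n x n matrices (lists of lists), blocking by b."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):              # loop over output blocks
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):      # accumulate one block product
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Tuned libraries such as ATLAS pick `b` to match the cache size; the talk's point is that this software strategy has no effect on the GPU.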
10. Method 1: Column Packed (CP)
[Diagram: textures storing A, B, and C]
- 4 elements stored per texel
- 4x4 matrix by 4-vector multiplications
[Larsen and McAllister SC2001; Moravánszky 2003]
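The CP inner loop can be emulated on the CPU to make the packing concrete; this is an illustrative sketch (names and index conventions are assumptions, not the shader's), in which one fragment accumulates a 4-vector of C from 4x4-matrix-by-4-vector products, with each "texel" of B supplying 4 consecutive column elements.

```python
# Column-packed (CP) scheme, emulated: one fragment computes four
# consecutive elements of a column of C. Each fetch of B yields a
# 4-vector (one texel); the matching 4x4 block of A is spread across
# four textures on the real GPU, flattened here for clarity.

def cp_fragment(A, B, n, row4, col):
    """Compute C[4*row4 : 4*row4+4][col] via 4x4 * 4-vector products."""
    acc = [0.0, 0.0, 0.0, 0.0]
    for k4 in range(n // 4):
        # one 'texel' of B: 4 consecutive elements of column `col`
        bvec = [B[4 * k4 + t][col] for t in range(4)]
        # the matching 4x4 block of A
        for r in range(4):
            acc[r] += sum(A[4 * row4 + r][4 * k4 + t] * bvec[t]
                          for t in range(4))
    return acc
```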
11. Method 2: Submatrix Packed (SP)
[Diagram: textures storing A, B, and C]
- 2x2 submatrix stored per texel
- 2x2 by 2x2 submatrix multiplications
[Hall et al. 2003]
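The SP scheme can be emulated the same way; again an illustrative CPU sketch with assumed names, where one fragment produces the 2x2 block of C stored in one output texel by summing 2x2-by-2x2 products.

```python
# Submatrix-packed (SP) scheme, emulated: each texel holds a 2x2
# submatrix, and a fragment accumulates one 2x2 block of C as a
# sum of 2x2 * 2x2 products.

def sp_fragment(A, B, n, bi, bj):
    """Compute the 2x2 block of C at block coordinates (bi, bj)."""
    acc = [[0.0, 0.0], [0.0, 0.0]]
    for bk in range(n // 2):               # walk the 2x2 block row/column
        for r in range(2):
            for c in range(2):
                for t in range(2):
                    acc[r][c] += A[2 * bi + r][2 * bk + t] * \
                                 B[2 * bk + t][2 * bj + c]
    return acc
```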
12. Alternative Approaches: Ineffective
- Varied mapping into texture memory
- Altered rasterization order with geometry
  - Single quad most effective
- Utilized multiple outputs
- Varied amount of loop unrolling
  - Column packed: unroll maximally
  - Submatrix packed: unroll 128 times
13. Performance Results
- Pentium 4 3GHz CPU, 512KB L2 cache
  - 12 GFLOPS peak compute
  - 44.1 GB/sec cache BW
  - Using sgemm routine from the ATLAS package
- NVIDIA
  - GeForce 5900 Ultra
  - GeForce 6800 Ultra
- ATI
  - Radeon 9800 XT
  - Radeon X800 XT PE (prerelease, 500MHz mem / 500MHz core clock)
14. Previous Generation GPUs
[Chart: multiplication of 1024x1024 matrices; GFLOPS and observed bandwidth (GB/sec) for the P4 3GHz, GeForce 5900 Ultra, and Radeon 9800 XT]
15. Current Generation GPUs
[Chart: multiplication of 1024x1024 matrices; GFLOPS and observed bandwidth (GB/sec) for the P4 3GHz, GeForce 6800 Ultra, and Radeon X800 XT PE]
16. Fragment Processor Data Paths
[Diagram: fragment processor and texture unit connected to the L1 texture cache, fed from L2, writing to the frame buffer]
17. GPU Microbenchmarks
[Chart: peak arithmetic rate (GFLOPS) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
18. GPU Microbenchmarks
[Chart: observed cache bandwidth and sequential bandwidth (GB/sec) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
19. Fragment Processor Data Paths
[Diagram: as in slide 16, with the datapath bandwidths annotated]
- Low bandwidth path into the fragment processor (1 float/clock)
- 1 4-wide MAD/clock in the fragment processor
- High bandwidth path into the texture unit (sized for texture filtering)
- The fragment processor consumes data at 8x the rate the texture unit provides it!
20. Datapaths Designed for Shading
[Diagram: as in slide 16, with the texture unit performing an 8-to-1 reduction in the amount of data and delivering 4 components per clock]
- 8-bit components
- 2-to-1 ratio of compute to bandwidth
- Texture units filter (reduce) data
- Shaders use interpolated values and constants
21. Compute and Bandwidth Efficiency
[Chart: percentage of peak compute and peak bandwidth achieved on the 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz]
GPU algorithms are severely bandwidth limited!
22. Minimize Texture Fetches
- Block in shader register file
  - Would need 8x8 submatrices to run at peak rates
  - Limited to 4x4 submatrices by available outputs
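The 8x8 figure follows from the datapath numbers on slide 19; a back-of-the-envelope sketch (constants are the talk's approximate figures, names are illustrative) makes the arithmetic explicit.

```python
# Why 8x8 register blocking is needed: the shader issues one 4-wide
# MAD per clock (consuming 2 input floats per scalar MAD, i.e. 8
# floats/clock), while the texture path delivers roughly 1 float/clock.
# b x b blocking reuses each fetched value b times.

MADS_PER_CLOCK = 4           # one 4-wide MAD per clock
FETCH_FLOATS_PER_CLOCK = 1   # texture path delivers ~1 float/clock

def floats_needed_per_clock(b):
    """Input floats required per clock with b x b register blocking:
    each of the two operands per scalar MAD is reused b times."""
    return MADS_PER_CLOCK * 2.0 / b
```

Without blocking (b = 1) the shader wants 8 floats/clock, the 8x deficit from slide 19; b = 8 brings demand down to the 1 float/clock the hardware supplies, while the 4 available outputs cap real shaders at b = 4.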
23. Improvement 1: Widen Datapath
- Fragment processor receives cached data more quickly
- Expect performance to improve linearly with increase in bandwidth
  - Need 4x improvement to achieve peak performance
  - But L2 may no longer be able to fill L1
24. Improvement 2: Larger Scratch Space
- Requires a large number of registers
- Needs a large number of output values
- Reduces texture bandwidth requirements
  - Performance increases linearly with the dimension of the submatrices
- Increases amount of per-pixel state
  - Storage increases as the square of the submatrix dimension
  - Requires 16x the space of the SP method for peak performance
25. Summary
- GPU algorithms for matrix-matrix multiplication run inefficiently
  - Best algorithms achieve below 20% of peak performance
  - Saturate the data path between texture and FP units
- Cache-aware software blocking strategies do not improve performance
  - Cannot exploit data reuse
  - Hardware limits algorithm efficiency
26. Summary
- Hardware changes required to improve efficiency
  - Widen path between texture and register file
  - Output a large number of values from shaders
- Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms
27. Acknowledgements
- Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, Steve Morein
- Support from ATI, NVIDIA, DARPA, IBM, SONY
- Rambus Stanford Graduate Fellowship
- Stanford School of Engineering Fellowship
28. Questions?