Title: Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
1. Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
- Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
- Stanford University
- August 30, 2004
2. Motivation: Harness GPU Performance
[Chart: relative peak FLOPS and memory bandwidth of the P4 3.4GHz, GeForce 6800 Ultra, and Radeon X800 XT PE]
3. Streaming Computation on GPUs
- GPUs accelerate streaming numerical algorithms
  - Data parallelism
  - High ratio of arithmetic to data access
  - Little data reuse
[Diagram: a kernel function (shader) maps input elements to output elements]
4. Streaming Computation on GPUs
- Level 1 BLAS operations [Buck et al. 2004]
- Fluid solvers [Krüger and Westermann 2003; Bolz et al. 2003]
- Image processing [Apple Corp. 2004; McCormick et al. 2004]
- Segmentation [Sherbondy et al. 2003]
- Database operations [Govindaraju et al. 2004]
- Data clustering [Hall et al. 2004]
5. Dense Matrix Multiplication
[Diagram: C = A x B]
- Abundant data parallelism
- Regular data access (no branching)
- High ratio of computation to data access
6. Dense Matrix Multiplication
- Widely used computational kernel
- Building block for the LAPACK library
7. Matrix Multiplication on GPUs
- Larsen and McAllister 2001
- Moravánszky 2003
- Hall et al. 2003
- Prior work offers limited analysis of performance
8. Overview
- GPU Implementations
- Results
- Analysis: Why GPUs Are Slow
- Ways to Make GPUs Better
9. CPU-Based Approaches
- High-performance matrix multiplication algorithms are cache aware
[Diagram: C = A x B partitioned into submatrix blocks]
- Partition computation into submatrix multiplications
- Load input submatrices into cache
- Multiply submatrices
- Store output submatrix to memory
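The blocking strategy above can be sketched on the CPU as follows; this is a minimal pure-Python illustration (the block size `b` and function name are illustrative, not from the talk), where each `b x b` block of the inputs is reused many times while it would be resident in cache.

```python
# Cache-aware (blocked) matrix multiply: partition C = A * B into
# b x b submatrix products so each loaded input block is reused
# b times before it is evicted.

def blocked_matmul(A, B, n, b):
    """Multiply two n x n matrices (lists of lists), blocking by b."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):              # loop over output blocks
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):      # accumulate one block product
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Tuned libraries such as ATLAS pick `b` to match the cache size; the talk's point is that this software strategy has no effect on the GPU.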
10. Method 1: Column Packed (CP)
[Diagram: textures storing A, B, and C]
- 4 elements stored per texel
- 4x4 matrix by 4-vector multiplications
[Larsen and McAllister SC2001; Moravánszky 2003]
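The CP inner loop can be emulated on the CPU to make the packing concrete; this is an illustrative sketch (names and index conventions are assumptions, not the shader's), in which one fragment accumulates a 4-vector of C from 4x4-matrix-by-4-vector products, with each "texel" of B supplying 4 consecutive column elements.

```python
# Column-packed (CP) scheme, emulated: one fragment computes four
# consecutive elements of a column of C. Each fetch of B yields a
# 4-vector (one texel); the matching 4x4 block of A is spread across
# four textures on the real GPU, flattened here for clarity.

def cp_fragment(A, B, n, row4, col):
    """Compute C[4*row4 : 4*row4+4][col] via 4x4 * 4-vector products."""
    acc = [0.0, 0.0, 0.0, 0.0]
    for k4 in range(n // 4):
        # one 'texel' of B: 4 consecutive elements of column `col`
        bvec = [B[4 * k4 + t][col] for t in range(4)]
        # the matching 4x4 block of A
        for r in range(4):
            acc[r] += sum(A[4 * row4 + r][4 * k4 + t] * bvec[t]
                          for t in range(4))
    return acc
```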
11. Method 2: Submatrix Packed (SP)
[Diagram: textures storing A, B, and C]
- 2x2 submatrix stored per texel
- 2x2 by 2x2 submatrix multiplications
[Hall et al. 2003]
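The SP scheme can be emulated the same way; again an illustrative CPU sketch with assumed names, where one fragment produces the 2x2 block of C stored in one output texel by summing 2x2-by-2x2 products.

```python
# Submatrix-packed (SP) scheme, emulated: each texel holds a 2x2
# submatrix, and a fragment accumulates one 2x2 block of C as a
# sum of 2x2 * 2x2 products.

def sp_fragment(A, B, n, bi, bj):
    """Compute the 2x2 block of C at block coordinates (bi, bj)."""
    acc = [[0.0, 0.0], [0.0, 0.0]]
    for bk in range(n // 2):               # walk the 2x2 block row/column
        for r in range(2):
            for c in range(2):
                for t in range(2):
                    acc[r][c] += A[2 * bi + r][2 * bk + t] * \
                                 B[2 * bk + t][2 * bj + c]
    return acc
```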
12. Alternative Approaches: Ineffective
- Varied mapping into texture memory
- Altered rasterization order with geometry
  - Single quad most effective
- Utilized multiple outputs
- Varied amount of loop unrolling
  - Column packed: unroll maximally
  - Submatrix packed: unroll 128 times
13. Performance Results
- Pentium 4 3GHz CPU, 512KB L2 cache
  - 12 GFLOPS peak compute
  - 44.1 GB/sec cache BW
  - Using sgemm routine from the ATLAS package
- NVIDIA
  - GeForce 5900 Ultra
  - GeForce 6800 Ultra
- ATI
  - Radeon 9800 XT
  - Radeon X800 XT PE (prerelease, 500MHz mem / 500MHz core clock)
14. Previous Generation GPUs
[Chart: multiplication of 1024x1024 matrices; GFLOPS and observed bandwidth (GB/sec) for the P4 3GHz, GeForce 5900 Ultra, and Radeon 9800 XT]
15. Current Generation GPUs
[Chart: multiplication of 1024x1024 matrices; GFLOPS and observed bandwidth (GB/sec) for the P4 3GHz, GeForce 6800 Ultra, and Radeon X800 XT PE]
16. Fragment Processor Data Paths
[Diagram: fragment processor and texture unit connected to the L1 texture cache, fed from L2, writing to the frame buffer]
17. GPU Microbenchmarks
[Chart: peak arithmetic rate (GFLOPS) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
18. GPU Microbenchmarks
[Chart: observed cache bandwidth and sequential bandwidth (GB/sec) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
19. Fragment Processor Data Paths
[Diagram: as in slide 16, with the datapath bandwidths annotated]
- Low bandwidth path into the fragment processor (1 float/clock)
- 1 4-wide MAD/clock in the fragment processor
- High bandwidth path into the texture unit (sized for texture filtering)
- The fragment processor consumes data at 8x the rate the texture unit provides it!
20. Datapaths Designed for Shading
[Diagram: as in slide 16, with the texture unit performing an 8-to-1 reduction in the amount of data and delivering 4 components per clock]
- 8-bit components
- 2-to-1 ratio of compute to bandwidth
- Texture units filter (reduce) data
- Shaders use interpolated values and constants
21. Compute and Bandwidth Efficiency
[Chart: percentage of peak compute and peak bandwidth achieved on the 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz]
GPU algorithms are severely bandwidth limited!
22. Minimize Texture Fetches
- Block in shader register file
  - Would need 8x8 submatrices to run at peak rates
  - Limited to 4x4 submatrices by available outputs
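The 8x8 figure follows from the datapath numbers on slide 19; a back-of-the-envelope sketch (constants are the talk's approximate figures, names are illustrative) makes the arithmetic explicit.

```python
# Why 8x8 register blocking is needed: the shader issues one 4-wide
# MAD per clock (consuming 2 input floats per scalar MAD, i.e. 8
# floats/clock), while the texture path delivers roughly 1 float/clock.
# b x b blocking reuses each fetched value b times.

MADS_PER_CLOCK = 4           # one 4-wide MAD per clock
FETCH_FLOATS_PER_CLOCK = 1   # texture path delivers ~1 float/clock

def floats_needed_per_clock(b):
    """Input floats required per clock with b x b register blocking:
    each of the two operands per scalar MAD is reused b times."""
    return MADS_PER_CLOCK * 2.0 / b
```

Without blocking (b = 1) the shader wants 8 floats/clock, the 8x deficit from slide 19; b = 8 brings demand down to the 1 float/clock the hardware supplies, while the 4 available outputs cap real shaders at b = 4.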
23. Improvement 1: Widen Datapath
- Fragment processor receives cached data more quickly
- Expect performance to improve linearly with increase in bandwidth
  - Need 4x improvement to achieve peak performance
  - But L2 may no longer be able to fill L1
24. Improvement 2: Larger Scratch Space
- Requires a large number of registers
- Needs a large number of output values
- Reduces texture bandwidth requirements
  - Performance increases linearly with the dimension of the submatrices
- Increases amount of per-pixel state
  - Storage increases as the square of the submatrix dimension
  - Requires 16x the space of the SP method for peak performance
25. Summary
- GPU algorithms for matrix-matrix multiplication run inefficiently
  - Best algorithms achieve below 20% of peak performance
  - Saturate the data path between texture and FP units
- Cache-aware software blocking strategies do not improve performance
  - Cannot exploit data reuse
  - Hardware limits algorithm efficiency
26. Summary
- Hardware changes required to improve efficiency
  - Widen path between texture and register file
  - Output a large number of values from shaders
- Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms
27. Acknowledgements
- Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, Steve Morein
- Support from ATI, NVIDIA, DARPA, IBM, SONY
- Rambus Stanford Graduate Fellowship
- Stanford School of Engineering Fellowship
28. Questions?