CSE 690: GPGPU Lecture 7: Matrix Multiplications

1
CSE 690: GPGPU Lecture 7: Matrix Multiplications

2
Basic Concept

3
GPU Algorithms

4
GPU Algorithms

5
GPU Algorithms

6
GPU Algorithms

7
GPU Algorithms

8
GPU Algorithms

Instead of a 2x2 submatrix, pack 4x1 column vectors
6 fetches are needed for 4 mads (multiply-adds) -> 1.5 times as many as before
but fewer rows and columns are accessed per pass -> improves the cache hit rate
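
The 4x1 packing can be illustrated with a minimal CPU-side sketch in Python. This is not the shader from the lecture: the layout, the requirement that n be a multiple of 4, and the helper names (pack_columns, mad, multiply_packed, unpack_columns) are all assumptions made for illustration, and the sketch does not try to reproduce the slide's fetch count. It only shows how a 4x1 column of the output can be accumulated with 4-component multiply-adds when each texel holds four consecutive rows of one column.

```python
# Sketch: matrix multiply with operands packed as 4x1 column vectors ("texels").
# Assumptions: square matrices, n a multiple of 4; all names are illustrative.

def pack_columns(M):
    """texels[bi][j] holds rows 4*bi .. 4*bi+3 of column j as a 4-tuple."""
    n = len(M)
    return [[tuple(M[4 * bi + r][j] for r in range(4)) for j in range(n)]
            for bi in range(n // 4)]

def unpack_columns(T):
    """Inverse of pack_columns: rebuild the plain row-major matrix."""
    nb, n = len(T), len(T[0])
    return [[T[i // 4][j][i % 4] for j in range(n)] for i in range(4 * nb)]

def mad(a, s, acc):
    """4-component multiply-add: acc + a * s, scalar s applied to every component."""
    return tuple(acc[c] + a[c] * s for c in range(4))

def multiply_packed(A, B):
    """Return the packed product; each output texel is a 4x1 column of C."""
    n = len(A)
    At, Bt = pack_columns(A), pack_columns(B)
    Ct = [[None] * n for _ in range(n // 4)]
    for bi in range(n // 4):          # block of 4 output rows
        for j in range(n):            # output column
            acc = (0.0, 0.0, 0.0, 0.0)
            for kb in range(n // 4):
                b = Bt[kb][j]         # B[4*kb .. 4*kb+3][j]
                for t in range(4):
                    a = At[bi][4 * kb + t]   # column 4*kb+t of A, rows 4*bi .. 4*bi+3
                    acc = mad(a, b[t], acc)  # one 4-component mad
            Ct[bi][j] = acc
    return Ct
```

Calling unpack_columns(multiply_packed(A, B)) gives the ordinary product for small test matrices, which is a convenient way to check the packing logic before moving it to a shader.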

9
GPU Algorithms

10
Reality Check

11
Platforms

12
Performance

13
Bandwidth vs. Peak FLOPS

14
Analysis

Currently
GPUs can fetch 16 floats and perform 16 4-component mads per clock
our app fetches 8 floats to perform one 4-component mad -> not enough computation per fetch
need more math ops per float fetched (> 8)
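
The figures on this slide can be worked through as a small calculation. The sketch below is a back-of-the-envelope model, not a measurement: it assumes a 4-component mad counts as 8 math ops (4 multiplies plus 4 adds), and the names hardware_balance, app_intensity and utilization_bound are made up for illustration.

```python
# Sketch: compare the math-per-fetch ratio of the hardware with that of the app.
# Assumption: one 4-component mad = 8 math ops (4 multiplies + 4 adds).
OPS_PER_MAD4 = 8

def hardware_balance(floats_per_clock=16, mads_per_clock=16):
    """Math ops the GPU can issue per float it can fetch."""
    return mads_per_clock * OPS_PER_MAD4 / floats_per_clock

def app_intensity(floats_per_mad=8):
    """Math ops the app performs per float it fetches."""
    return OPS_PER_MAD4 / floats_per_mad

def utilization_bound():
    """Upper bound on the fraction of peak math rate when fetch-limited."""
    return min(1.0, app_intensity() / hardware_balance())

print(hardware_balance())   # 8.0 ops per float fetched
print(app_intensity())      # 1.0 op per float fetched
print(utilization_bound())  # 0.125
```

Under these assumptions the app is fetch-bound at roughly one eighth of the peak math rate, which is one way to read the slide's threshold of 8 math ops per float fetched.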

15
Analysis

16
Analysis

17
Analysis

18
Analysis

What do GPUs need?
bigger caches to enable larger blocks
currently there are enough registers to store a 6x6 submatrix
but currently shaders can only produce a small number of outputs -> limits the amount of blocking
Provide full floating-point accumulator registers
Widen the path between texture and register files
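
Why larger blocks help can be seen in a simple counting model. The sketch below is a CPU-side illustration in Python, not the lecture's shader: it accumulates a b x b output block in a local array standing in for registers, and counts scalar fetches and scalar multiply-adds. The function name blocked_multiply and the counting scheme are assumptions for illustration, and n is assumed to be a multiple of b.

```python
# Sketch: register-style blocking with fetch and multiply-add counters.
# Assumptions: square n x n inputs, n a multiple of b, scalar (not 4-wide) counts.

def blocked_multiply(A, B, b):
    """Return (C, fetches, mads) for C = A * B computed in b x b output blocks."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    fetches = 0
    mads = 0
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            acc = [[0.0] * b for _ in range(b)]   # stands in for the register block
            for k in range(n):
                a_col = [A[i0 + i][k] for i in range(b)]   # b fetches from A
                b_row = [B[k][j0 + j] for j in range(b)]   # b fetches from B
                fetches += 2 * b
                for i in range(b):
                    for j in range(b):
                        acc[i][j] += a_col[i] * b_row[j]   # reuse each fetched value b times
                        mads += 1
            for i in range(b):                    # write the finished block once
                for j in range(b):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C, fetches, mads
```

In this model each step fetches 2b values and performs b*b multiply-adds, so the ratio of multiply-adds to fetches is b/2: 0.5 for b = 1 and 3 for the 6x6 block mentioned above. The number of values a shader can write per pass caps b in practice, which is the output limit the slide points to.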

19
References

E. Larsen and D. McAllister, "Fast matrix multiplies using graphics hardware," Supercomputing 2001.
J. Hall, N. Carr, and J. Hart, "Cache and bandwidth aware matrix multiplication on the GPU," Tech Report UIUCDCS-R-2003-2328-1.
K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the efficiency of GPU algorithms for matrix-matrix multiplication," Graphics Hardware Workshop 2004.
