# CSE 690: GPGPU Lecture 7: Matrix Multiplications


## CSE 690: GPGPU Lecture 7: Matrix Multiplications

Description:

### CSE 690: GPGPU Lecture 7: Matrix Multiplications Klaus Mueller Computer Science, Stony Brook University Basic Concept Triple loop GPU Algorithms First algorithm ... – PowerPoint PPT presentation

Slides: 20
Provided by: Klau51
Transcript and Presenter's Notes

1
CSE 690: GPGPU Lecture 7: Matrix Multiplications
• Klaus Mueller
• Computer Science, Stony Brook University

2
Basic Concept
• Triple loop
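
The triple loop named on this slide can be written out as follows. This is a minimal Python sketch for reference only; the function name `matmul_triple_loop` and the list-of-rows matrix representation are assumptions made for illustration, not part of the lecture.

```python
def matmul_triple_loop(a, b):
    """Multiply an n x m matrix by an m x p matrix, both given as lists of rows."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):          # row of the result
        for j in range(p):      # column of the result
            for k in range(m):  # inner (dot-product) index
                c[i][j] += a[i][k] * b[k][j]
    return c
```

Every element of the result is an independent dot product, which is what lets the GPU algorithms on the following slides assign one output element (or a small group of them) to each fragment.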

3
GPU Algorithms
• First algorithm
• render a rectangle of size NxN
• represent the matrices as NxN textures
• each (i,j) is then a fragment
• each fragment program is a loop or an unrolled loop -> may get too long
• must pull in the same data many times -> poor data reuse, needs bandwidth
• makes no use of 4-way RGBA parallelism -> wastes speedup
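
To make the mapping concrete, the first algorithm can be simulated on the CPU. The sketch below is an assumption-laden stand-in, not shader code: `fragment_program` and `render_pass` are hypothetical names, and the "textures" are plain Python lists of rows.

```python
def fragment_program(tex_a, tex_b, i, j):
    """Stand-in for the fragment program executed at output pixel (i, j)."""
    n = len(tex_a)
    acc = 0.0
    for k in range(n):
        # Two texture fetches per multiply-add; nothing is shared between fragments.
        acc += tex_a[i][k] * tex_b[k][j]
    return acc


def render_pass(tex_a, tex_b):
    """'Render' an N x N rectangle: one fragment_program call per pixel."""
    n = len(tex_a)
    return [[fragment_program(tex_a, tex_b, i, j) for j in range(n)]
            for i in range(n)]
```

Each fragment re-reads a full row of A and a full column of B, which illustrates the poor data reuse noted above, and each texel here carries a single value, so the four RGBA channels go unused.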

4
GPU Algorithms
• Better algorithm
• use RGBA channels, pack a 2x2 submatrix
• use swizzling to facilitate data reuse
• swizzling shortens the fragment code by a factor of 2
• may need multiple passes for larger matrices
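
One way the 2x2 packing and swizzling could fit together is sketched below. The slide does not give the formula, so two things are assumptions here: the texel layout (row-major, `r g` on the top row and `b a` on the bottom row) and the swizzle patterns `rrbb`, `rgrg`, `ggaa`, `baba`. The helper names are hypothetical.

```python
def swizzle(texel, pattern):
    """Return the components of an (r, g, b, a) tuple in the order named by pattern."""
    index = {"r": 0, "g": 1, "b": 2, "a": 3}
    return tuple(texel[index[ch]] for ch in pattern)


def mul2x2_rgba(ta, tb):
    """Product of two 2x2 blocks, each stored row-major as an (r, g, b, a) texel."""
    # First 4-wide multiply: A.rrbb * B.rgrg
    first = [x * y for x, y in zip(swizzle(ta, "rrbb"), swizzle(tb, "rgrg"))]
    # Second 4-wide multiply: A.ggaa * B.baba
    second = [x * y for x, y in zip(swizzle(ta, "ggaa"), swizzle(tb, "baba"))]
    # Summing the two gives all four entries of the 2x2 product at once.
    return tuple(x + y for x, y in zip(first, second))
```

Under these assumptions, two fetched texels feed two 4-wide operations, so all four channels do useful work and no value has to be fetched a second time within the block product.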

5
GPU Algorithms
• Using multi-texturing
• requires l passes

6
GPU Algorithms
• Can use RGBA parallelism as well
• each texel represents a 2x2 submatrix
• use swizzling as usual
• needs l/2 passes

7
GPU Algorithms
• Instead of a 2x2 submatrix, pack 4x1 column vectors
• reuses each texel read from B four times, but uses each texel from A only once
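
The reuse pattern of the 4x1 packing can be checked with a small CPU simulation. This is a sketch under stated assumptions: the indexing scheme (texel `[bi][j]` holds rows `4*bi` to `4*bi+3` of column `j`) and the names `pack_columns` and `matmul_4x1` are chosen for illustration and are not taken from the lecture.

```python
def pack_columns(m):
    """Pack a matrix (list of rows, row count divisible by 4) into 4x1 column texels.

    texels[bi][j] holds rows 4*bi .. 4*bi+3 of column j.
    """
    rows, cols = len(m), len(m[0])
    return [[tuple(m[4 * bi + t][j] for t in range(4)) for j in range(cols)]
            for bi in range(rows // 4)]


def matmul_4x1(a, b):
    """Multiply square matrices (size divisible by 4) using 4x1 column texels."""
    n = len(a)
    ta, tb = pack_columns(a), pack_columns(b)
    tc = [[(0.0,) * 4 for _ in range(n)] for _ in range(n // 4)]
    for bi in range(n // 4):
        for j in range(n):
            acc = [0.0] * 4
            for kb in range(n // 4):
                b_texel = tb[kb][j]                # fetched once ...
                for t in range(4):                 # ... and used in 4 multiply-adds
                    a_texel = ta[bi][4 * kb + t]   # each A texel is used once
                    acc = [c + x * b_texel[t] for c, x in zip(acc, a_texel)]
            tc[bi][j] = tuple(acc)
    return tc
```

In the inner loop, one B texel supplies the scalar for four successive 4-wide multiply-adds, while a fresh A texel is fetched for each of them.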

8
GPU Algorithms
• Instead of a 2x2 submatrix, pack 4x1 column vectors
1.5 times more than before
• but fewer rows and columns are accessed per pass -> improves cache hit frequency

9
GPU Algorithms
• Originally only compute one product per shader
• in practice, can unroll the loop 3-6 times (compute 3-6 products)
• maximal fragment program length is the limit
• reduces the number of passes required
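
The trade-off between unrolling and pass count can be illustrated with a CPU simulation of multi-pass accumulation. This is only a sketch: `multipass_matmul` and its parameter `products_per_pass` are hypothetical names, and the accumulation target is modeled as a plain list of rows rather than a render target.

```python
def multipass_matmul(a, b, products_per_pass):
    """Accumulate C = A * B over several passes.

    Each pass adds products_per_pass terms of every dot product.
    Returns the result and the number of passes used.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]   # accumulation target, cleared once
    passes = 0
    for k0 in range(0, n, products_per_pass):
        k1 = min(k0 + products_per_pass, n)
        # One pass: every fragment (i, j) adds its slice of the dot product.
        c = [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(k0, k1))
              for j in range(n)] for i in range(n)]
        passes += 1
    return c, passes
```

In this sketch a 12x12 product takes 12 passes with one product per pass and 3 passes with four products per pass, so the unroll factor that fits within the maximal fragment program length directly sets the pass count.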

10
Reality Check
• Would like to compare CPU and GPU efficiencies
• Matrix multiplication is an instructive test case here
• features much data reuse
• graphics programs are generally more stream-like and have less data reuse
• this may lead to some limitations

11
Platforms
• Pentium 4 3 GHz CPU, 512 KB L2 cache
• 12 GFLOPS peak compute
• 44.1 GB/sec cache BW
• Using sgemm routine from ATLAS package
• NVIDIA
• GeForce 5900 Ultra
• GeForce 6800 Ultra
• ATI

12
Performance

13
Bandwidth vs. Peak FLOPS

14
Analysis
• Currently
• GPUs can fetch 16 floats and perform 16
• our app fetches 8 floats to perform one 4-component mad -> not enough computations
• need more math ops per float fetched (> 8)

15
Analysis
• Pentium processors have large L1 caches to boost memory bandwidth (bw)
• bw / compute ratio better
• main reason for only small performance gain achieved with GPUs

16
Analysis
• Pentium processors have large L1 caches to boost memory bandwidth (bw)
• bw / compute ratio better
• main reason for only small performance gain achieved with GPUs
• for matrix multiplications
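
The bandwidth-to-compute ratio for the CPU can be worked out from the figures on the Platforms slide. The short calculation below is a sketch; the only assumption beyond the slide's numbers is 4 bytes per float, which matches the single-precision sgemm routine named there.

```python
peak_gflops = 12.0      # Pentium 4 peak compute, from the Platforms slide
cache_bw_gb_s = 44.1    # Pentium 4 cache bandwidth, from the Platforms slide
bytes_per_float = 4     # assumption: single-precision floats (sgemm)

# How much data the cache can deliver per floating-point operation at peak rate.
bytes_per_flop = cache_bw_gb_s / peak_gflops
floats_per_flop = bytes_per_flop / bytes_per_float

print(f"{bytes_per_flop:.3f} bytes/flop, {floats_per_flop:.3f} floats/flop")
```

At peak rate the cache can therefore deliver about 3.7 bytes, a little under one float, per floating-point operation. The transcript does not preserve the corresponding GPU numbers, so no GPU ratio is computed here.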

17
Analysis
• Expectations
• make sure that there is enough arithmetic per data item fetched
• lots of data reuse in the algorithm / task will make the CPU look better
• streaming data OK -> they don't suffer from reuse
• matrix multiplication is an excellent reality-check example

18
Analysis
• What do GPUs need?
• bigger caches to enable larger blocks
• currently there are enough registers to store a 6x6 submatrix
• but currently shaders can only produce a small number of outputs -> limits the amount of blocking
• Provide full floating-point accumulator registers
• Widen path between texture and register files
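
Why larger blocks help can be seen by counting fetches in a blocked multiply. The sketch below is a CPU illustration with scalar values, not a model of any particular GPU; `blocked_matmul` is a hypothetical name, and the matrix size is assumed to be divisible by the block size.

```python
def blocked_matmul(a, b, block):
    """Multiply n x n matrices in block x block output tiles, counting fetches.

    Returns (c, fetches, mads). n must be divisible by block.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    fetches = mads = 0
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k in range(n):
                # Fetch one column sliver of A and one row sliver of B ...
                a_col = [a[i0 + di][k] for di in range(block)]
                b_row = [b[k][j0 + dj] for dj in range(block)]
                fetches += 2 * block
                # ... and reuse them for block * block multiply-adds,
                # with the output tile playing the role of the registers.
                for di in range(block):
                    for dj in range(block):
                        c[i0 + di][j0 + dj] += a_col[di] * b_row[dj]
                        mads += 1
    return c, fetches, mads
```

In this sketch the number of multiply-adds per value fetched is `block / 2`: a 6x6 tile gives 3, while a 1x1 tile gives 0.5. Holding the tile requires `block * block` outputs per fragment, which is where the limit on shader outputs noted above comes in.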

19
References
• E. Larsen and D. McAllister, Fast matrix multiplies using graphics hardware, Supercomputing 2001.
• J. Hall, N. Carr and J. Hart, Cache and bandwidth aware matrix multiplication on the GPU, Tech Report UIUCDCS-R-2003-2328-1.
• K. Fatahalian, J. Sugerman, and P. Hanrahan, Understanding the efficiency of GPU algorithms for matrix-matrix multiplication, Graphics Hardware Workshop 2004.