CSE 690: GPGPU Lecture 7: Matrix Multiplications



1
CSE 690: GPGPU Lecture 7: Matrix Multiplications
  • Klaus Mueller
  • Computer Science, Stony Brook University

2
Basic Concept
  • Triple loop
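
As a reference point for the GPU variants that follow, the triple loop can be sketched as below. This is a minimal CPU-side illustration; the function name and the list-of-rows matrix layout are assumptions made for this sketch and do not come from the slides.

```python
def matmul_triple_loop(A, B):
    """Return C = A * B for matrices stored as lists of rows."""
    n, l, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):            # row of C
        for j in range(m):        # column of C
            for k in range(l):    # inner (dot-product) dimension
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Every output element C[i][j] is an independent dot product, which is what lets a GPU assign one fragment to each (i,j).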

3
GPU Algorithms
  • First algorithm
  • render a rectangle of size NxN
  • represent the matrices as NxN textures
  • each (i,j) is then a fragment
  • each fragment program is a loop or an unrolled
    loop -> may get too long
  • must pull in the same data many times -> poor
    data reuse, needs bandwidth
  • makes no use of 4-way RGBA parallelism -> wastes
    speedup
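
The data-reuse problem of the first algorithm can be made concrete with a small CPU emulation. This is only a sketch: `fragment_program`, `render_pass`, and the fetch counter are hypothetical names introduced here, and plain Python lists stand in for the NxN textures.

```python
def fragment_program(A, B, i, j, counter):
    """Compute one output element C[i][j], as a single fragment would."""
    acc = 0.0
    for k in range(len(B)):
        acc += A[i][k] * B[k][j]   # two "texture fetches" per multiply-add
        counter["fetches"] += 2
    return acc

def render_pass(A, B):
    """Emulate rendering an NxN rectangle: one fragment per (i, j)."""
    n = len(A)
    counter = {"fetches": 0}
    C = [[fragment_program(A, B, i, j, counter) for j in range(n)]
         for i in range(n)]
    return C, counter["fetches"]
```

In this emulation the number of fetches is 2·N³ although the two inputs hold only 2·N² values, so each value is fetched N times.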

4
GPU Algorithms
  • Better algorithm
  • use RGBA channels, pack a 2x2 submatrix
  • use swizzling to facilitate data reuse
  • swizzling shortens the fragment code by a
    factor of 2
  • may need multiple passes for larger matrices
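
One way to see how a 2x2 block product maps onto RGBA channels is the sketch below. The texel layout (x, y, z, w) = (m00, m01, m10, m11) and the swizzle patterns are assumptions chosen for illustration; they follow from the 2x2 product formula but are not confirmed by the slides. All helper names are hypothetical.

```python
def pack_2x2(M):
    """Pack a matrix with even dimensions into texels, one 2x2 block each."""
    return [[(M[2*r][2*c], M[2*r][2*c + 1], M[2*r + 1][2*c], M[2*r + 1][2*c + 1])
             for c in range(len(M[0]) // 2)]
            for r in range(len(M) // 2)]

def swizzle(t, pattern):
    """Emulate a shader swizzle such as t.xxzz on a 4-tuple."""
    return tuple(t["xyzw".index(ch)] for ch in pattern)

def mad(a, b, c):
    """4-component multiply-add: a * b + c, component-wise."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))

def block_product(a, b, acc):
    """acc += a * b for 2x2 blocks stored as texels, using two mads."""
    acc = mad(swizzle(a, "xxzz"), swizzle(b, "xyxy"), acc)
    acc = mad(swizzle(a, "yyww"), swizzle(b, "zwzw"), acc)
    return acc

def matmul_packed(TA, TB):
    """Multiply two matrices given as grids of 2x2-block texels."""
    rows, inner, cols = len(TA), len(TB), len(TB[0])
    TC = [[(0.0,) * 4 for _ in range(cols)] for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                TC[i][j] = block_product(TA[i][k], TB[k][j], TC[i][j])
    return TC
```

Each pair of fetched texels feeds two 4-component mads, that is, eight scalar multiply-adds, instead of one multiply-add per pair of scalar fetches.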

5
GPU Algorithms
  • Using multi-texturing
  • requires l passes

6
GPU Algorithms
  • Can use RGBA parallelism as well
  • each texel represents a 2x2 submatrix
  • use swizzling as usual
  • needs l/2 passes
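
The pass counts can be illustrated with a CPU emulation of multi-pass accumulation. This sketch assumes that l denotes the inner dimension of the product, which the slides do not state explicitly; `multipass_matmul` and `k_per_pass` are hypothetical names.

```python
def multipass_matmul(A, B, k_per_pass=1):
    """Accumulate C = A * B over several passes; return (C, passes)."""
    n, l, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]       # accumulation target, cleared once
    passes = 0
    for k0 in range(0, l, k_per_pass):      # one rendering pass per slice of k
        for i in range(n):
            for j in range(m):              # every fragment (i, j) adds its share
                for k in range(k0, min(k0 + k_per_pass, l)):
                    C[i][j] += A[i][k] * B[k][j]
        passes += 1
    return C, passes
```

With `k_per_pass=1` the emulation takes l passes; with `k_per_pass=2`, matching a 2x2 texel that covers two values of k, it takes l/2.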

7
GPU Algorithms
  • Instead of a 2x2 submatrix, pack 4x1 column
    vectors
  • makes 4-times reuse of texels read from B, but
    uses texels from A only once
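
The reuse pattern of the 4x1 packing can be sketched as follows. The layout (each texel holds four consecutive elements of one matrix column) is an assumption, all names are hypothetical, and the fetch count of a real shader may differ from this simplified version.

```python
def mad(a, b, c):
    """4-component multiply-add: a * b + c, component-wise."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))

def column_texel(M, row0, col):
    """Texel holding the 4x1 column segment M[row0..row0+3][col]."""
    return tuple(M[row0 + r][col] for r in range(4))

def fragment_4x1(A, B, i0, j, k0, acc):
    """Add the inner-product terms k0..k0+3 to C[i0..i0+3][j]."""
    b = column_texel(B, k0, j)              # texel from B, fetched once
    for t in range(4):
        a = column_texel(A, i0, k0 + t)     # each texel from A is used once
        acc = mad(a, (b[t],) * 4, acc)      # b reused 4 times (b.xxxx, b.yyyy, ...)
    return acc
```

The single texel from B contributes to all four mads through a broadcast of one component at a time, while every texel from A takes part in exactly one mad.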

8
GPU Algorithms
  • Instead of a 2x2 submatrix, pack 4x1 column
    vectors
  • 6 fetches are needed for 4 mads (mult-adds) ->
    1.5 times more than before
  • but fewer rows and columns are accessed per pass
    -> improves cache hit frequency

9
GPU Algorithms
  • Originally only compute one product per shader
  • practically can unroll the loop 3-6 times
    (compute 3-6 products)
  • maximal fragment program length is the limit
  • reduces the number of passes required

10
Reality Check
  • Would like to compare CPU and GPU efficiencies
    for GPGPU tasks
  • Matrix multiplication is an instructive test
    case here
  • features much data reuse
  • graphics programs are generally more stream-like
    and have less data reuse
  • this may lead to some limitations

11
Platforms
  • Pentium 4 3 GHz CPU, 512 KB L2 cache
  • 12 GFLOPS peak compute
  • 44.1 GB/sec cache BW
  • Using sgemm routine from ATLAS package
  • NVIDIA
  • GeForce 5900 Ultra
  • GeForce 6800 Ultra
  • ATI
  • Radeon 9800 XT
  • Radeon X800 XT PE

12
Performance

13
Bandwidth vs. Peak FLOPS

14
Analysis
  • Currently
  • GPUs can fetch 16 floats and perform 16
    4-component mads per clock
  • our app fetches 8 floats to perform one
    4-component mad -> not enough computations
  • need more math ops per float fetched (> 8)
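
The arithmetic behind these numbers can be worked through in a few lines. The variable names are illustrative, and the utilization figure is only an upper bound under the assumption that the fetch rate is the sole limit.

```python
gpu_floats_per_clock = 16    # floats the GPU can fetch per clock
gpu_mads_per_clock = 16      # 4-component mads the GPU can perform per clock
app_floats_per_mad = 8       # floats the matrix-multiply shader fetches per mad

# Floats the hardware can supply for each mad it is able to execute.
hw_floats_per_mad = gpu_floats_per_clock / gpu_mads_per_clock   # 1.0

# How much more data per mad the application needs than the hardware supplies.
shortfall = app_floats_per_mad / hw_floats_per_mad              # 8.0

# Upper bound on the fraction of the mad units that can be kept busy.
mad_utilization = 1 / shortfall                                 # 0.125
```

Under these assumptions the arithmetic units sit idle about seven eighths of the time, which is why the algorithm would need to do more than 8 times as many math operations per float fetched.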

16
Analysis
  • Pentium processors have large L1 caches to boost
    memory bandwidth (bw)
  • their bw / compute ratio is better
  • this is the main reason why GPUs achieve only a
    small performance gain for matrix
    multiplications

17
Analysis
  • Expectations
  • make sure that there is enough arithmetic per
    data item fetched
  • lots of data reuse in the algorithm / task will
    make the CPU look better
  • streaming tasks are OK -> they don't suffer from
    the reuse problem
  • matrix multiplication is an excellent
    reality-check example

18
Analysis
  • What do GPUs need?
  • bigger caches to enable larger blocks
  • currently there are enough registers to store a
    6x6 submatrix
  • but currently shaders can only produce a small
    number of outputs -> limits the amount of
    blocking
  • Provide full floating-point accumulator registers
  • Widen path between texture and register files

19
References
  • E. Larsen and D. McAllister, Fast matrix
    multiplies using graphics hardware,
    Supercomputing 2001.
  • J. Hall, N. Carr and J. Hart, Cache and
    bandwidth aware matrix multiplication on the
    GPU, Tech Report UIUCDCS-R-2003-2328-1
  • K. Fatahalian, J. Sugerman, and P. Hanrahan,
    Understanding the efficiency of GPU algorithms
    for matrix-matrix multiplication, Graphics
    Hardware Workshop 2004.