1
EECE571R -- Harnessing Massively Parallel Processors
http://www.ece.ubc.ca/~matei/EECE571/
Lecture 1: Introduction to GPU Programming
By Samer Al-Kiswany
Acknowledgement: some slides borrowed from presentations by Kayvon Fatahalian and Mark Harris
2
Outline
  • Hardware
  • Software
  • Programming Model
  • Optimizations

3-10
GPU Architecture Intuition
(figure-only slides)
11
GPU Architecture
(diagram: a Host machine connected to the GPU; the GPU contains Multiprocessors 1..N alongside Constant Memory, Texture Memory, and Global Memory)
12
GPU Architecture
  • SIMD architecture.
  • Four memory spaces (a CUDA sketch follows this slide):
  • Device (a.k.a. global): slow (400-600 cycles access latency), large (256 MB - 1 GB)
  • Shared: fast (4 cycles access latency), small (16 KB)
  • Texture: read only
  • Constant: read only

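In CUDA C these four memory spaces correspond to distinct qualifiers. A minimal sketch, assuming a hypothetical kernel and variable names (not from the slides):

__constant__ float scale_factor;           // constant memory: small, read only, cached, set from the host
texture<float, 1> lookup_tex;              // texture memory: read only, cached (legacy texture reference API)

__global__ void scale(float *data, int n)  // data points into global (device) memory
{
    __shared__ float tile[256];            // shared memory: ~4 cycle latency, 16 KB per multiprocessor
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i];       // stage the value in fast shared memory
        data[i] = tile[threadIdx.x] * scale_factor;
    }
}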
13
GPU Architecture Program Flow
  1. Preprocessing
  2. Data transfer in
  3. GPU Processing
  4. Data transfer out
  5. Postprocessing

(figure: timeline of the five stages; T_Total denotes the end-to-end execution time)
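On the host side, the five stages map roughly onto the following CUDA runtime calls (a sketch; the kernel `process` and the launch configuration are assumed for illustration):

#include <cuda_runtime.h>

__global__ void process(float *d, int n);            // assumed kernel

void run(float *h_data, int n)
{
    float *d_data;
    size_t bytes = n * sizeof(float);
    // 1. Preprocessing happens on the host (not shown).
    cudaMalloc(&d_data, bytes);
    // 2. Data transfer in (host -> GPU).
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    // 3. GPU processing.
    process<<<(n + 255) / 256, 256>>>(d_data, n);
    // 4. Data transfer out (GPU -> host).
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    // 5. Postprocessing happens on the host (not shown).
    cudaFree(d_data);
}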
14
Outline
  • Hardware
  • Software
  • Programming Model
  • Optimizations

15
GPU Programming Model
  • Programming Model: a software representation of the hardware

16
GPU Programming Model
(diagram: the grid is made up of Blocks of threads)
Kernel: a function executed over the grid (a kernel sketch follows)
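A minimal example kernel, assuming a hypothetical element-wise operation (names and sizes are illustrative):

// Each thread handles one element of the grid-wide index space.
__global__ void add_one(float *a, int n)
{
    // blockIdx selects the block within the grid; threadIdx selects the thread within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += 1.0f;
}

// Launch over a grid of blocks, each with 256 threads:
// add_one<<<(n + 255) / 256, 256>>>(d_a, n);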
17-18
GPU Programming Model (figure-only slides)
19
GPU Programming Model
In reality, the scheduling granularity is a warp (32 threads), so a warp takes 4 cycles to complete a single instruction
20
GPU Programming Model
  • In reality, the scheduling granularity is a warp (32 threads), so a warp takes 4 cycles to complete a single instruction
  • Threads in a block can share state through shared memory
  • Threads in a block can synchronize
  • Global atomic operations are available (a sketch combining these follows)

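A sketch of a per-block sum that uses all three mechanisms: shared state, an intra-block barrier, and a global atomic (the kernel name and block size are illustrative; atomicAdd on float needs compute capability 2.0 or later):

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float partial[256];                 // shared state within the block (blockDim.x <= 256)
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // barrier: all threads in the block

    // Tree reduction inside the block (blockDim.x assumed to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(out, partial[0]);                // global atomic: combine block results
}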
21
Outline
  • Hardware
  • Software
  • Programming Model
  • Optimizations

22
Optimizations
  • Can be roughly grouped into the following categories:
  • Memory Related
  • Computation Related
  • Data Transfer Related

23
Optimizations - Memory
  • Use shared memory
  • Use texture (1D, 2D, or 3D) and constant memory
  • Avoid shared memory bank conflicts
  • Coalesce global memory accesses (one approach: padding)

24
Optimizations - Memory
Shared Memory Complications
Shared memory is organized into 16 banks of 1 KB each.
Complication I: concurrent accesses to the same bank are serialized (bank conflict), which slows the kernel down.
Tip: assign different threads to different banks (a padding sketch follows).
Complication II: banks are interleaved.
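A common way to keep threads on different banks is to pad a shared-memory tile by one element per row. A sketch of a tiled matrix transpose, assuming a square matrix whose width is a multiple of the tile size:

#define TILE 16

__global__ void transpose_tile(const float *in, float *out, int width)
{
    // With 16 banks, a 16x16 tile read by columns would hit the same bank 16 times;
    // the +1 padding shifts each row so column reads land on different banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced row-wise load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;              // blocks swap roles for the transpose
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column read, conflict-free thanks to padding
}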
25
Optimizations - Memory
Global Memory Coalesced Access
26
Optimizations - Memory
Global Memory Non-Coalesced Access
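Side by side, the two access patterns look like this (a sketch; the stride and kernel names are illustrative):

// Coalesced: consecutive threads in a warp touch consecutive addresses,
// so the hardware merges them into a few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Non-coalesced: consecutive threads touch addresses 'stride' elements apart,
// forcing many separate transactions and wasting bandwidth.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];    // scattered reads break coalescing
}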
27
Optimizations
  • Can be roughly grouped into the following categories:
  • Memory Related
  • Computation Related
  • Data Transfer Related

28
Optimizations - Computation
  • Use thousands of threads to make the best use of the GPU hardware
  • Use full warps (32 threads): make the block size a multiple of 32
  • Reduce branch divergence in the code
  • Avoid synchronization
  • Unroll loops (fewer instructions, more room for compiler optimizations; a sketch follows)

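A sketch combining two of these points: a block size that is a multiple of 32 and an unrolled per-thread loop (the kernel and the factor of 4 are illustrative):

__global__ void saxpy4(float a, const float *x, float *y, int n)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;   // each thread handles 4 elements

    #pragma unroll                      // unroll: fewer branch/index instructions per element
    for (int k = 0; k < 4; ++k) {
        if (i + k < n)
            y[i + k] = a * x[i + k] + y[i + k];
    }
}

// Launch with a block size that is a multiple of 32, e.g. 256 threads:
// saxpy4<<<(n + 4 * 256 - 1) / (4 * 256), 256>>>(a, d_x, d_y, n);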
29
Optimizations
  • Can be roughly grouped into the following categories:
  • Memory Related
  • Computation Related
  • Data Transfer Related

30
Optimizations - Data Transfer
  • Reduce the amount of data transferred between the host and the GPU
  • Hide transfer overhead by overlapping transfers with computation (asynchronous transfers; a sketch follows)

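Overlapping relies on CUDA streams and asynchronous copies, which in turn require page-locked (pinned) host buffers. A sketch, assuming a kernel `process` and a chunk count of two (both illustrative), with n divisible by the chunk count:

#include <cuda_runtime.h>

__global__ void process(float *d, int n);                  // assumed kernel

void run_overlapped(float *h_in, float *h_out, int n)      // h_in/h_out allocated with cudaMallocHost
{
    const int chunks = 2, chunk = n / chunks;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    for (int c = 0; c < chunks; ++c) {
        int off = c * chunk;
        cudaStream_t st = stream[c % 2];
        // Copy-in, compute, and copy-out of one chunk are queued on the same stream;
        // work queued on the other stream can overlap with these transfers.
        cudaMemcpyAsync(d_buf + off, h_in + off, chunk * sizeof(float), cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(d_buf + off, chunk);
        cudaMemcpyAsync(h_out + off, d_buf + off, chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) cudaStreamDestroy(stream[s]);
    cudaFree(d_buf);
}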
31
Summary
  • GPUs are highly parallel devices.
  • Easy to program for (functionality).
  • Hard to optimize for (performance).
  • Optimization:
  • There are many optimizations, but often you do not need them all (iterate between profiling and optimizing).
  • Optimizations may bring hard tradeoffs (more computation vs. less memory, more computation vs. better memory access patterns, etc.).