1
Basic CUDA Programming
  • Shin-Kai Chen
  • skchen@twins.ee.nctu.edu.tw
  • VLSI Signal Processing Laboratory
  • Department of Electronics Engineering
  • National Chiao Tung University

2
What will you learn in this lab?
  • Concept of multicore accelerator
  • Multithreaded/multicore programming
  • Memory optimization

3
Slides
  • Mostly from Prof. Wen-Mei Hwu of UIUC
  • http://courses.ece.uiuc.edu/ece498/al/Syllabus.html

4
CUDA Hardware? Software?
5
Host-Device Architecture
CPU (host)
GPU w/ local DRAM (device)
6
G80 CUDA Mode: A Device Example
7
Functional Units in G80
  • Streaming Multiprocessor (SM)
  • 1 instruction decoder (1 instruction / 4 cycles)
  • 8 streaming processors (SPs)
  • Shared memory

(Figure: blocks dispatched to SM 0 and SM 1)
8
Setup CUDA for Windows
9
CUDA Environment Setup
  • Get a GPU that supports CUDA
  • http://www.nvidia.com/object/cuda_learn_products.html
  • Download CUDA
  • http://www.nvidia.com/object/cuda_get.html
  • CUDA driver
  • CUDA toolkit
  • CUDA SDK (optional)
  • Install CUDA
  • Test CUDA
  • Device Query

10
Setup CUDA for Visual Studio
  • From scratch
  • http://forums.nvidia.com/index.php?showtopic=30273
  • CUDA VS Wizard
  • http://sourceforge.net/projects/cudavswizard/
  • Modified from existing project

11
Lab1: First CUDA Program
12
CUDA Computing Model
13
Data Manipulation between Host and Device
  • cudaError_t cudaMalloc( void **devPtr, size_t count )
  • Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr
  • cudaError_t cudaMemcpy( void *dst, const void *src, size_t count, enum cudaMemcpyKind kind )
  • Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst
  • kind indicates the type of memory transfer
  • cudaMemcpyHostToHost
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice
  • cudaError_t cudaFree( void *devPtr )
  • Frees the memory space pointed to by devPtr
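Taken together, a typical host-side sequence for this lab could look like the sketch below (the 8192-element size is from the later example; names and the omitted kernel launch are illustrative, and error checking is elided):

```cuda
// Hypothetical host-side sketch: allocate, copy in, copy out, free.
#include <cuda_runtime.h>
#include <stdlib.h>

#define SIZE 8192

void run_example(void) {
    int *h_A = (int *)malloc(SIZE * sizeof(int));
    int *h_B = (int *)malloc(SIZE * sizeof(int));
    int *d_A, *d_B;

    // Allocate device memory for input and output arrays
    cudaMalloc((void **)&d_A, SIZE * sizeof(int));
    cudaMalloc((void **)&d_B, SIZE * sizeof(int));

    // Copy input data from host to device
    cudaMemcpy(d_A, h_A, SIZE * sizeof(int), cudaMemcpyHostToDevice);

    // ... launch GPU kernel here ...

    // Copy results back from device to host
    cudaMemcpy(h_B, d_B, SIZE * sizeof(int), cudaMemcpyDeviceToHost);

    // Release device and host memory
    cudaFree(d_A);
    cudaFree(d_B);
    free(h_A);
    free(h_B);
}
```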

14
Example
  • Functionality
  • Given an integer array A holding 8192 elements
  • For each element in array A, calculate A[i] % 256 and leave the result in B[i]
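A minimal kernel for this functionality might look like the sketch below. The `GPU_kernel` name comes from the lab's device.cu; the exact operator between A[i] and 256 was garbled in the transcript, so `%` is an assumption:

```cuda
// Hypothetical kernel: one thread handles one element.
// Assumes the grid covers all 8192 elements and the operation is % 256.
__global__ void GPU_kernel(int *A, int *B) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    B[i] = A[i] % 256;   // assumed operation
}

// Example launch covering 8192 elements:
// GPU_kernel<<<16, 512>>>(d_A, d_B);
```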

15
Now, go and finish your first CUDA program !!!
16
  • Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
  • Open project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj )
  • main.cu
  • Random input generation, output validation,
    result reporting
  • device.cu
  • Launch GPU kernel, GPU kernel code
  • parameter.h
  • Fill in appropriate APIs
  • GPU_kernel() in device.cu

17
Lab2: Make the Parallel Code Faster
18
Parallel Processing in CUDA
  • Parallel code can be partitioned into blocks and threads
  • cuda_kernel<<<nBlk, nTid>>>()
  • Multiple tasks will be initialized, each with a different block id and thread id
  • The tasks are dynamically scheduled
  • Tasks within the same block will be scheduled on the same streaming multiprocessor
  • Each task takes care of a single data partition according to its block id and thread id

19
Locate Data Partition by Built-in Variables
  • Built-in Variables
  • gridDim
  • x, y
  • blockIdx
  • x, y
  • blockDim
  • x, y, z
  • threadIdx
  • x, y, z

20
Data Partition for Previous Example
When processing 64 integer data: cuda_kernel<<<2, 2>>>()

int total_task = gridDim.x * blockDim.x;
int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
int length = SIZE / total_task;
int head = task_sn * length;
21
Processing Single Data Partition
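The code from this slide did not survive transcription; a sketch consistent with the contiguous partitioning above might look as follows (SIZE is assumed to come from parameter.h, and the `% 256` operation is the assumed one from the earlier example):

```cuda
// Hypothetical kernel body: each task processes its own contiguous
// partition [head, head + length) of the input array.
__global__ void cuda_kernel(int *A, int *B) {
    int total_task = gridDim.x * blockDim.x;
    int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
    int length     = SIZE / total_task;
    int head       = task_sn * length;

    for (int i = head; i < head + length; i++)
        B[i] = A[i] % 256;   // assumed operation
}
```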
22
Parallelize Your Program !!!
23
  • Partition kernel into threads
  • Increase nTid from 1 to 512
  • Keep nBlk = 1
  • Group threads into blocks
  • Adjust nBlk and see if it helps
  • Maintain total number of threads below 512, e.g.
    nBlk * nTid <= 512

24
Lab3: Resolve Memory Contention
25
Parallel Memory Architecture
  • Memory is divided into banks to achieve high
    bandwidth
  • Each bank can service one address per cycle
  • Successive 32-bit words are assigned to
    successive banks
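As an illustration of this banking rule, the hypothetical shared-memory kernel below contrasts a stride-1 access pattern (successive threads hit successive banks, so accesses proceed in parallel) with a stride-2 pattern (pairs of threads map to the same bank, so their accesses serialize); the kernel name and array size are invented for the example:

```cuda
// Illustrative bank-access demo (G80-era hardware has 16 banks).
__global__ void bank_demo(int *out) {
    __shared__ int s[512];
    int t = threadIdx.x;

    s[t] = t;            // stride-1 write: conflict-free
    __syncthreads();

    int a = s[t];                 // stride-1 read: one bank per thread
    int c = s[(2 * t) % 512];     // stride-2 read: 2-way bank conflicts

    out[t] = a + c;
}
```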

26
Lab2 Review
When processing 64 integer data: cuda_kernel<<<1, 4>>>()
27
How about Interleave Accessing?
When processing 64 integer data: cuda_kernel<<<1, 4>>>()
28
Implementation of Interleave Accessing
cuda_kernel<<<1, 4>>>()
  • head = task_sn
  • stripe = total_task
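Applied to the earlier kernel, the interleaved version could be sketched as below: each task starts at its own serial number and strides by the total task count, so consecutive tasks touch consecutive elements (SIZE and the `% 256` operation are assumptions carried over from the earlier example):

```cuda
// Hypothetical interleaved kernel: task task_sn processes elements
// task_sn, task_sn + stripe, task_sn + 2*stripe, ...
__global__ void cuda_kernel(int *A, int *B) {
    int total_task = gridDim.x * blockDim.x;
    int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
    int head       = task_sn;
    int stripe     = total_task;

    for (int i = head; i < SIZE; i += stripe)
        B[i] = A[i] % 256;   // assumed operation
}
```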

29
Improve Your Program !!!
30
  • Modify the original kernel code to access data in an interleaved manner
  • cuda_kernel() in device.cu
  • Adjust nBlk and nTid as in Lab2 and examine the effect
  • Maintain total number of threads below 512, e.g.
    nBlk * nTid <= 512

31
Thank You
  • http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
  • Final project issue
  • Group issue