1
Basic CUDA Programming
  • Shin-Kai Chen
  • skchen@twins.ee.nctu.edu.tw
  • VLSI Signal Processing Laboratory
  • Department of Electronics Engineering
  • National Chiao Tung University

2
What will you learn in this lab?
  • Concept of multicore accelerator
  • Multithreaded/multicore programming
  • Memory optimization

3
Slides
  • Mostly from Prof. Wen-Mei Hwu of UIUC
  • http://courses.ece.uiuc.edu/ece498/al/Syllabus.html

4
CUDA Hardware? Software?
5
Host-Device Architecture
CPU (host)
GPU w/ local DRAM (device)
6
G80 CUDA Mode: A Device Example
7
Functional Units in G80
  • Streaming Multiprocessor (SM)
  • 1 instruction decoder (1 instruction / 4 cycles)
  • 8 streaming processors (SPs)
  • Shared memory

(Figure: blocks dispatched to SM 0 and SM 1)
8
Setup CUDA for Windows
9
CUDA Environment Setup
  • Get a GPU that supports CUDA
  • http://www.nvidia.com/object/cuda_learn_products.html
  • Download CUDA
  • http://www.nvidia.com/object/cuda_get.html
  • CUDA driver
  • CUDA toolkit
  • CUDA SDK (optional)
  • Install CUDA
  • Test CUDA
  • Device Query

10
Setup CUDA for Visual Studio
  • From scratch
  • http://forums.nvidia.com/index.php?showtopic=30273
  • CUDA VS Wizard
  • http://sourceforge.net/projects/cudavswizard/
  • Modified from existing project

11
Lab1: First CUDA Program
12
CUDA Computing Model
13
Data Manipulation between Host and Device
  • cudaError_t cudaMalloc( void **devPtr, size_t count )
  • Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr
  • cudaError_t cudaMemcpy( void *dst, const void *src, size_t count, enum cudaMemcpyKind kind )
  • Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst
  • kind indicates the type of memory transfer
  • cudaMemcpyHostToHost
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice
  • cudaError_t cudaFree( void *devPtr )
  • Frees the memory space pointed to by devPtr
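Taken together, a typical host-side sequence for this lab could look like the sketch below (the 8192-element size is from the later example; names and the omitted kernel launch are illustrative, and error checking is elided):

```cuda
// Hypothetical host-side sketch: allocate, copy in, copy out, free.
#include <cuda_runtime.h>
#include <stdlib.h>

#define SIZE 8192

void run_example(void) {
    int *h_A = (int *)malloc(SIZE * sizeof(int));
    int *h_B = (int *)malloc(SIZE * sizeof(int));
    int *d_A, *d_B;

    // Allocate device memory for input and output arrays
    cudaMalloc((void **)&d_A, SIZE * sizeof(int));
    cudaMalloc((void **)&d_B, SIZE * sizeof(int));

    // Copy input data from host to device
    cudaMemcpy(d_A, h_A, SIZE * sizeof(int), cudaMemcpyHostToDevice);

    // ... launch GPU kernel here ...

    // Copy results back from device to host
    cudaMemcpy(h_B, d_B, SIZE * sizeof(int), cudaMemcpyDeviceToHost);

    // Release device and host memory
    cudaFree(d_A);
    cudaFree(d_B);
    free(h_A);
    free(h_B);
}
```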

14
Example
  • Functionality
  • Given an integer array A holding 8192 elements
  • For each element in array A, calculate A[i] % 256 and leave the result in B[i]
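A minimal kernel for this functionality might look like the sketch below. The `GPU_kernel` name comes from the lab's device.cu; the exact operator between A[i] and 256 was garbled in the transcript, so `%` is an assumption:

```cuda
// Hypothetical kernel: one thread handles one element.
// Assumes the grid covers all 8192 elements and the operation is % 256.
__global__ void GPU_kernel(int *A, int *B) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    B[i] = A[i] % 256;   // assumed operation
}

// Example launch covering 8192 elements:
// GPU_kernel<<<16, 512>>>(d_A, d_B);
```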

15
Now, go and finish your first CUDA program !!!
16
  • Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
  • Open project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj )
  • main.cu
  • Random input generation, output validation,
    result reporting
  • device.cu
  • Launch GPU kernel, GPU kernel code
  • parameter.h
  • Fill in appropriate APIs
  • GPU_kernel() in device.cu

17
Lab2: Make the Parallel Code Faster
18
Parallel Processing in CUDA
  • Parallel code can be partitioned into blocks and threads
  • cuda_kernel<<<nBlk, nTid>>>()
  • Multiple tasks will be initialized, each with a different block id and thread id
  • The tasks are dynamically scheduled
  • Tasks within the same block will be scheduled on the same streaming multiprocessor
  • Each task takes care of a single data partition according to its block id and thread id

19
Locate Data Partition by Built-in Variables
  • Built-in Variables
  • gridDim
  • x, y
  • blockIdx
  • x, y
  • blockDim
  • x, y, z
  • threadIdx
  • x, y, z

20
Data Partition for Previous Example
When processing 64 integer data: cuda_kernel<<<2, 2>>>()

int total_task = gridDim.x * blockDim.x;
int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
int length = SIZE / total_task;
int head = task_sn * length;
21
Processing Single Data Partition
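The code from this slide did not survive transcription; a sketch consistent with the contiguous partitioning above might look as follows (SIZE is assumed to come from parameter.h, and the `% 256` operation is the assumed one from the earlier example):

```cuda
// Hypothetical kernel body: each task processes its own contiguous
// partition [head, head + length) of the input array.
__global__ void cuda_kernel(int *A, int *B) {
    int total_task = gridDim.x * blockDim.x;
    int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
    int length     = SIZE / total_task;
    int head       = task_sn * length;

    for (int i = head; i < head + length; i++)
        B[i] = A[i] % 256;   // assumed operation
}
```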
22
Parallelize Your Program !!!
23
  • Partition kernel into threads
  • Increase nTid from 1 to 512
  • Keep nBlk = 1
  • Group threads into blocks
  • Adjust nBlk and see if it helps
  • Maintain total number of threads below 512, e.g.
    nBlk * nTid <= 512

24
Lab3: Resolve Memory Contention
25
Parallel Memory Architecture
  • Memory is divided into banks to achieve high
    bandwidth
  • Each bank can service one address per cycle
  • Successive 32-bit words are assigned to
    successive banks
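As an illustration of this banking rule, the hypothetical shared-memory kernel below contrasts a stride-1 access pattern (successive threads hit successive banks, so accesses proceed in parallel) with a stride-2 pattern (pairs of threads map to the same bank, so their accesses serialize); the kernel name and array size are invented for the example:

```cuda
// Illustrative bank-access demo (G80-era hardware has 16 banks).
__global__ void bank_demo(int *out) {
    __shared__ int s[512];
    int t = threadIdx.x;

    s[t] = t;            // stride-1 write: conflict-free
    __syncthreads();

    int a = s[t];                 // stride-1 read: one bank per thread
    int c = s[(2 * t) % 512];     // stride-2 read: 2-way bank conflicts

    out[t] = a + c;
}
```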

26
Lab2 Review
When processing 64 integer data: cuda_kernel<<<1, 4>>>()
27
How about Interleave Accessing?
When processing 64 integer data: cuda_kernel<<<1, 4>>>()
28
Implementation of Interleave Accessing
cuda_kernel<<<1, 4>>>()
  • head = task_sn
  • stripe = total_task
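Applied to the earlier kernel, the interleaved version could be sketched as below: each task starts at its own serial number and strides by the total task count, so consecutive tasks touch consecutive elements (SIZE and the `% 256` operation are assumptions carried over from the earlier example):

```cuda
// Hypothetical interleaved kernel: task task_sn processes elements
// task_sn, task_sn + stripe, task_sn + 2*stripe, ...
__global__ void cuda_kernel(int *A, int *B) {
    int total_task = gridDim.x * blockDim.x;
    int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
    int head       = task_sn;
    int stripe     = total_task;

    for (int i = head; i < SIZE; i += stripe)
        B[i] = A[i] % 256;   // assumed operation
}
```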

29
Improve Your Program !!!
30
  • Modify the original kernel code to access data in an interleaved manner
  • cuda_kernel() in device.cu
  • Adjust nBlk and nTid as in Lab2 and examine the effect
  • Maintain total number of threads below 512, e.g.
    nBlk * nTid <= 512

31
Thank You
  • http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
  • Final project issue
  • Group issue