1
Introduction to CUDA Programming
  • Profiler, Assembly, and Floating-Point
  • Andreas Moshovos
  • Winter 2009
  • Some material from
  • Wen-Mei Hwu and David Kirk
  • NVIDIA
  • Robert Strzodka, Dominik Göddeke, NVISION08
    presentation
  • http://www.mathematik.uni-dortmund.de/goeddeke/pubs/NVISION08-long.pdf

2
The CUDA Profiler
  • Both GUI and command-line versions exist
  • Non-GUI control is through environment variables:
  • CUDA_PROFILE
  • set to 1 or 0 to enable or disable the profiler
  • CUDA_PROFILE_LOG
  • set to the name of the log file (defaults to
    ./cuda_profile.log)
  • CUDA_PROFILE_CSV
  • set to 1 or 0 to enable or disable a
    comma-separated version of the log
  • CUDA_PROFILE_CONFIG
  • specify a config file with up to 4 signals (see
    the sketch below)
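A hedged illustration (the counter names below come from the early CUDA profiler documentation and vary by toolkit version and GPU): a config file selecting four memory-access signals could look like

    gld_incoherent
    gld_coherent
    gst_incoherent
    gst_coherent

With CUDA_PROFILE=1 and CUDA_PROFILE_CONFIG pointing at this file, the profiler logs these counters for each kernel launch.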

3
Profiler Signals
4
Profiler Counters
  • Grid size X, Y
  • Block size X, Y, Z
  • Dyn smem per block
  • Dynamic shared memory
  • Sta smem per block
  • static shared memory
  • Reg per thread
  • Mem transfer dir
  • Direction: 0 = host to device, 1 = device to host
  • Mem transfer size
  • In bytes

5
Interpreting Profiler Counters
  • Values represent events within a thread warp
  • Counters track only one multiprocessor
  • Values will not correspond to the total number of
    warps launched for a particular kernel.
  • Launch enough thread blocks to ensure that the
    target multiprocessor is given a consistent
    percentage of the total work.
  • Values are best used to identify relative
    performance differences between non-optimized and
    optimized code
  • e.g., make the number of non-coalesced loads go
    from some non-zero value to zero

6
CUDA Visual Profiler
  • Helps measure and find potential performance
    problems
  • GPU and CPU timing for all kernel invocations and
    memcpys
  • Time stamps
  • Access to hardware performance counters

7
Assembly
8
PTX Assembly for NVIDIA GPUs
  • Parallel Thread eXecution
  • Virtual Assembly
  • Translated to actual machine code at runtime
  • Allows for different hardware implementations
  • Might enable additional optimizations
  • E.g., clock register to time blocks of code

9
Code Generation Flow
  • Parallel Thread eXecution (PTX)
  • Virtual Machine and ISA
  • Programming model
  • Execution resources and state
  • ISA = Instruction Set Architecture
  • Variable declarations
  • Instructions and operands
  • The translator is an optimizing compiler
  • It translates PTX to target machine code
  • at program install time
  • The driver implements the VM runtime
  • and is coupled with the translator

[Diagram: a C/C++ application goes through the C/C++ compiler to produce PTX code; an ASM-level library programmer can also write PTX code directly; the PTX-to-target translator then produces target code]
10
How to See the PTX code
  • nvcc -keep
  • Produces .ptx and .cubin files (see the example
    below)
  • nvcc --opencc-options -LIST:source=on
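A hedged example (file and kernel names hypothetical): compiling the kernel below with "nvcc -keep scale.cu" leaves scale.ptx (the virtual assembly) and scale.cubin (the translated binary) in the working directory.

    // scale.cu - minimal kernel to inspect in PTX form
    __global__ void scale(float *data, float k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= k;
    }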

11
PTX Example
CUDA:

    float4 me = gx[gtid];
    me.x += me.y * me.z;

PTX:

    ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];  # 174  me.x += me.y * me.z
    mad.f32           $f1, $f5, $f3, $f1;

Registers are virtual: the actual hardware
registers are hidden from PTX.
12
PTX Syntax Example
13
Another Example CUDA Function
  • CUDA:

    __device__ void interaction(float4 b0, float4 b1, float3 *accel)
    {
        float3 r;
        r.x = b1.x - b0.x;
        r.y = b1.y - b0.y;
        r.z = b1.z - b0.z;
        float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
        float s = 1.0f / sqrt(distSqr);
        accel->x += r.x * s;
        accel->y += r.y * s;
        accel->z += r.z * s;
    }

  • PTX:

    sub.f32   $f18, $f1, $f15;
    sub.f32   $f19, $f3, $f16;
    sub.f32   $f20, $f5, $f17;
    mul.f32   $f21, $f18, $f18;
    mul.f32   $f22, $f19, $f19;
    mul.f32   $f23, $f20, $f20;
    add.f32   $f24, $f21, $f22;
    add.f32   $f25, $f23, $f24;
    rsqrt.f32 $f26, $f25;
    mad.f32   $f13, $f18, $f26, $f13;
    mov.f32   $f14, $f13;
    mad.f32   $f11, $f19, $f26, $f11;
    mov.f32   $f12, $f11;
    mad.f32   $f9, $f20, $f26, $f9;
    mov.f32   $f10, $f9;
14
PTX Data types
15
Predicated Execution
  • With branching:
  •   p = evaluate cond
  •   branch to After if p is not true
  •   Then code
  •   After: After code
  • With predication:
  •   p = evaluate cond
  •   (p) Then code
  •   After code
  • (a CUDA-level sketch follows this list)
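A minimal sketch (hypothetical kernel; whether the compiler predicates rather than branches is up to the toolchain), showing the kind of short guarded statement that is the typical candidate:

    __global__ void clamp_negatives(float *x)
    {
        // sketch only: bounds checking omitted
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (x[i] < 0.0f)   // "evaluate cond" sets a predicate
            x[i] = 0.0f;   // "Then code" runs under the predicate
    }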

16
PTX Predicated Execution
17
Variable Declaration
18
Parameterized Variable Names
  • How to create 100 register variables:
  • .reg .b32 %r<100>
  • Declares %r0 through %r99

19
Addresses as Operands
The value of x
The value of tbl[12]
The base address of tbl
20
Compiling a loop that calls a function - again
  • CUDA (sx is shared; mx and accel are local):

    for (i = 0; i < K; i++)
        if (i != threadIdx.x)
            interaction(sx[i], mx, &accel);

  • PTX:

    mov.s32       $r12, 0;
    $Lt_0_26:
    setp.eq.u32   $p1, $r12, $r5;
    @$p1 bra      $Lt_0_27;
    mul.lo.u32    $r13, $r12, 16;
    add.u32       $r14, $r13, $r1;
    ld.shared.f32 $f15, [$r14+0];
    ld.shared.f32 $f16, [$r14+4];
    ld.shared.f32 $f17, [$r14+8];
    // function body from the previous slide inlined here
    $Lt_0_27:
    add.s32       $r12, $r12, 1;
    mov.s32       $r15, 128;
    setp.ne.s32   $p2, $r12, $r15;
    @$p2 bra      $Lt_0_26;
21
Yet Another Example: SAXPY code

    cvt.u32.u16   blockid, %ctaid.x;     // Calculate i from thread/block IDs
    cvt.u32.u16   blocksize, %ntid.x;
    cvt.u32.u16   tid, %tid.x;
    mad24.lo.u32  i, blockid, blocksize, tid;
    ld.param.u32  n, [N];                // Nothing to do if n <= i
    setp.le.u32   p1, n, i;
    @p1 bra       L_finish;
    mul.lo.u32    offset, i, 4;          // Load y[i]
    ld.param.u32  yaddr, [Y];
    add.u32       yaddr, yaddr, offset;
    ld.global.f32 y_i, [yaddr+0];
    ld.param.u32  xaddr, [X];            // Load x[i]
    add.u32       xaddr, xaddr, offset;
    ld.global.f32 x_i, [xaddr+0];
    ld.param.f32  alpha, [ALPHA];        // Compute and store alpha*x[i] + y[i]
    mad.f32       y_i, alpha, x_i, y_i;
    st.global.f32 [yaddr+0], y_i;
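For reference, a CUDA-level SAXPY kernel that such PTX plausibly corresponds to (a sketch; the names follow the PTX above):

    __global__ void saxpy(int n, float alpha, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // the mad24 above
        if (i < n)                                      // the setp/bra guard
            y[i] = alpha * x[i] + y[i];                 // the mad.f32
    }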

22
The clock register
  • Real-time clock cycle counter
  • How to read it in PTX:
  •   mov.u32 %r1, %clock;
  • Can be used to time code
  • It measures real time, not just time spent
    executing this thread
  • If a thread is blocked, time still elapses
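A minimal sketch using the clock() device function (the C-level way to read the same counter); differencing two reads brackets the timed region:

    __global__ void timed(float *data, clock_t *cycles)
    {
        clock_t start = clock();      // read the cycle counter
        data[threadIdx.x] *= 2.0f;    // code being timed
        clock_t stop = clock();
        if (threadIdx.x == 0)
            *cycles = stop - start;   // includes time while the warp is stalled
    }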

23
PTX Reference
  • Please Read the PTX ISA specification
  • Posted under the handouts section

24
Occupancy Calculator
  • http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
  • GPU occupancy = active warps / max warps
  • Inputs:
  • Threads/block
  • Registers/thread
  • Shared memory/block
  • The last two come from the nvcc -cubin output, e.g.:

    code {
        name = my_kernel
        lmem = 0
        smem = 24
        reg = 5
        bar = 0
        bincode { ... }
        const { ... }
    }
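A worked reading of these numbers (a sketch: the 256 threads/block and the G80 per-multiprocessor limits of 768 threads = 24 warps and 8192 registers are assumptions, not part of the cubin output):

    warps per block  = 256 / 32 = 8
    register limit   = 8192 / (5 regs × 256 threads) = 6 blocks
    thread limit     = 768 / 256 = 3 blocks   (the binding constraint)
    active warps     = 3 blocks × 8 warps = 24
    occupancy        = 24 / 24 = 100%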

25
Occupancy Calculator Example
26
Floating Point Considerations
27
Comparison of FP Capabilities
                                   | G80                           | SSE                                   | IBM Altivec                 | Cell SPE
Precision                          | IEEE 754                      | IEEE 754                              | IEEE 754                    | IEEE 754
Rounding modes for FADD and FMUL   | Round to nearest and to zero  | All 4 IEEE: nearest, zero, +inf, -inf | Round to nearest only       | Round to zero/truncate only
Denormal handling                  | Flush to zero                 | Supported (1000s of cycles)           | Supported (1000s of cycles) | Flush to zero
NaN support                        | Yes                           | Yes                                   | Yes                         | No
Overflow and infinity support      | Yes, only clamps to max norm  | Yes                                   | Yes                         | No, infinity
Flags                              | No                            | Yes                                   | Yes                         | Some
Square root                        | Software only                 | Hardware                              | Software only               | Software only
Division                           | Software only                 | Hardware                              | Software only               | Software only
Reciprocal estimate accuracy       | 24 bit                        | 12 bit                                | 12 bit                      | 12 bit
Reciprocal sqrt estimate accuracy  | 23 bit                        | 12 bit                                | 12 bit                      | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit                        | No                                    | 12 bit                      | No
28
IEEE Floating Point Representation
  • A floating point binary number consists of three
    parts:
  • sign (S), exponent (E), and mantissa (M).
  • Each (S, E, M) pattern uniquely identifies a
    floating point number.
  • For each bit pattern, its IEEE floating-point
    value is derived as
  • value = (-1)^S × M × 2^E, where 1.0B ≤ M < 10.0B
  • The interpretation of S is simple: S = 0 gives a
    positive number and S = 1 a negative number.

29
Normalized Representation
  • Specifying that 1.0B ≤ M < 10.0B makes the
    mantissa value for each floating point number
    unique.
  • For example, the only mantissa value allowed
    for 0.5D is M = 1.0:
  • 0.5D = 1.0B × 2^-1
  • Neither 10.0B × 2^-2 nor 0.1B × 2^0 qualifies
  • Because all mantissa values are of the form
    1.XX, one can omit the "1." part in the
    representation.
  • The mantissa value of 0.5D with a 2-bit mantissa
    is 00, derived by omitting "1." from 1.00.

30
Exponent Representation
  • In an n-bit exponent representation, 2^(n-1) - 1 is
    added to the exponent's 2's complement representation
    to form its excess representation.
  • See the table for a 3-bit exponent representation
  • A simple unsigned integer comparator can then be used
    to compare the magnitude of two FP numbers
  • Symmetric range for +/- exponents (111 reserved)

2's complement | Actual decimal     | Excess-3
000            | 0                  | 011
001            | 1                  | 100
010            | 2                  | 101
011            | 3                  | 110
100            | (reserved pattern) | 111
101            | -3                 | 000
110            | -2                 | 001
111            | -1                 | 010

Actual exponent = represented E - BIAS
31
A Hypothetical 5-bit Floating Point Representation
  • Assume 1-bit S, 2-bit E, and 2-bit M
  • 0.5D = 1.00B × 2^-1
  • 0.5D = 0 00 00,
  • where
  • S = 0,
  • E = 00
  • M = (1.)00

2's complement | Actual decimal     | Excess-1
00             | 0                  | 01
01             | 1                  | 10
10             | (reserved pattern) | 11
11             | -1                 | 00
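As a worked decode under this format's conventions (the bit pattern is chosen for illustration), consider 0 10 01, i.e., S = 0, represented E = 10, M = (1.)01:

    represented E = 10B = 2;  actual exponent = 2 - 1 = 1   (excess-1)
    value = (-1)^0 × 1.01B × 2^1 = 1.25D × 2 = 2.5D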
32
Representable Numbers
  • The representable numbers of a given format is
    the set of all numbers that can be exactly
    represented in the format.
  • See Table for representable numbers of an
    unsigned 3-bit integer format

000 0
001 1
010 2
011 3
100 4
101 5
110 6
111 7
[Figure: number line with the representable values 0 through 7 marked; numbers such as -1, 8, and 9 fall outside the representable range]
33
Hypothetical 5-bit FP Representable Numbers
E M   | No-zero (S=0) | No-zero (S=1)    | Abrupt underflow (S=0) | Abrupt underflow (S=1) | Gradual underflow (S=0) | Gradual underflow (S=1)
00 00 | 2^-1          | -(2^-1)          | 0                      | 0                      | 0                       | 0
00 01 | 2^-1 + 1×2^-3 | -(2^-1 + 1×2^-3) | 0                      | 0                      | 1×2^-2                  | -1×2^-2
00 10 | 2^-1 + 2×2^-3 | -(2^-1 + 2×2^-3) | 0                      | 0                      | 2×2^-2                  | -2×2^-2
00 11 | 2^-1 + 3×2^-3 | -(2^-1 + 3×2^-3) | 0                      | 0                      | 3×2^-2                  | -3×2^-2
01 00 | 2^0           | -(2^0)           | 2^0                    | -(2^0)                 | 2^0                     | -(2^0)
01 01 | 2^0 + 1×2^-2  | -(2^0 + 1×2^-2)  | 2^0 + 1×2^-2           | -(2^0 + 1×2^-2)        | 2^0 + 1×2^-2            | -(2^0 + 1×2^-2)
01 10 | 2^0 + 2×2^-2  | -(2^0 + 2×2^-2)  | 2^0 + 2×2^-2           | -(2^0 + 2×2^-2)        | 2^0 + 2×2^-2            | -(2^0 + 2×2^-2)
01 11 | 2^0 + 3×2^-2  | -(2^0 + 3×2^-2)  | 2^0 + 3×2^-2           | -(2^0 + 3×2^-2)        | 2^0 + 3×2^-2            | -(2^0 + 3×2^-2)
10 00 | 2^1           | -(2^1)           | 2^1                    | -(2^1)                 | 2^1                     | -(2^1)
10 01 | 2^1 + 1×2^-1  | -(2^1 + 1×2^-1)  | 2^1 + 1×2^-1           | -(2^1 + 1×2^-1)        | 2^1 + 1×2^-1            | -(2^1 + 1×2^-1)
10 10 | 2^1 + 2×2^-1  | -(2^1 + 2×2^-1)  | 2^1 + 2×2^-1           | -(2^1 + 2×2^-1)        | 2^1 + 2×2^-1            | -(2^1 + 2×2^-1)
10 11 | 2^1 + 3×2^-1  | -(2^1 + 3×2^-1)  | 2^1 + 3×2^-1           | -(2^1 + 3×2^-1)        | 2^1 + 3×2^-1            | -(2^1 + 3×2^-1)
11 xx | Reserved pattern (all columns)
34
Flush To Zero
  • Treat all bit patterns with E = 0 as 0.0
  • This takes away several representable numbers
    near zero and lumps them all into 0.0
  • For a representation with a large M field, a
    large number of representable numbers is removed.

[Figure: number line near zero illustrating flush-to-zero: the representable values with E = 0 all collapse onto 0]
35
Hypothetical 5-bit FP Representable Numbers
(Same table as slide 33. Under abrupt underflow, every E = 00 pattern collapses to 0 in the two "Abrupt underflow" columns.)
36
Denormalized Numbers
  • The actual method adopted by the IEEE standard is
    called denormalized numbers or gradual underflow.
  • The method relaxes the normalization requirement
    for numbers very close to 0.
  • Whenever E = 0, the mantissa is no longer assumed
    to be of the form 1.XX. Rather, it is assumed to
    be 0.XX. In general, if the n-bit exponent is 0,
    the value is
  • 0.M × 2^(-2^(n-1) + 2)

[Figure: number line near zero illustrating gradual underflow: denormal values are evenly spaced between 0 and the smallest normalized number]
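Checking the formula against the hypothetical 5-bit format (n = 2 exponent bits), the pattern 0 00 01 decodes as:

    value = 0.M × 2^(-2^(n-1) + 2)
          = 0.01B × 2^(-2^1 + 2)
          = 0.01B × 2^0
          = 1×2^-2

which matches the "Gradual underflow" column of the table.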
37
Hypothetical 5-bit FP Representable Numbers
(Same table as slide 33. Under gradual underflow, the E = 00 patterns become the evenly spaced denormal values 1×2^-2, 2×2^-2, and 3×2^-2 in the two "Gradual underflow" columns.)
38
Floating Point Numbers
  • As the exponent gets larger
  • the distance between consecutive representable
    numbers increases

39
Arithmetic Instruction Throughput
  • int and float add, shift, min, max and float mul,
    mad: 4 cycles per warp
  • int multiply (*) is by default 32-bit
  • requires multiple cycles / warp
  • Use the __mul24() / __umul24() intrinsics for
    4-cycle 24-bit int multiply
  • Needed on G80; on G20 32-bit multiply should be OK
  • Integer divide and modulo are expensive
  • The compiler will convert literal power-of-2
    divides to shifts
  • Be explicit in cases where the compiler can't tell
    that the divisor is a power of 2
  • Useful trick: foo % n == foo & (n-1) if n is a
    power of 2 (see the sketch after this list)
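A minimal sketch of both tricks (kernel and helper names hypothetical; __mul24() is a real CUDA intrinsic):

    __device__ unsigned int mod_pow2(unsigned int x, unsigned int n)
    {
        return x & (n - 1);   // equals x % n only when n is a power of 2
    }

    __global__ void index_demo(unsigned int *out)
    {
        // 24-bit multiply suffices for block/thread index arithmetic on G80
        unsigned int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
        out[i] = mod_pow2(i, 32u);   // sketch only: bounds checking omitted
    }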

40
Arithmetic Instruction Throughput
  • Reciprocal, reciprocal square root, sin/cos, log,
    exp: 16 cycles per warp
  • These are the versions prefixed with __
  • Examples: __rcp(), __sin(), __exp()
  • Other functions are combinations of the above
  • y / x == rcp(x) * y: 20 cycles per warp
  • sqrt(x) == rcp(rsqrt(x)): 32 cycles per warp

41
Runtime Math Library
  • There are two types of runtime math operations
  • __func(): direct mapping to the hardware ISA
  • Fast but lower accuracy (see the programming
    guide for details)
  • Examples: __sin(x), __exp(x), __pow(x,y)
  • func(): compiles to multiple instructions
  • Slower but higher accuracy (5 ulp, units in the
    last place, or less)
  • Examples: sin(x), exp(x), pow(x,y)
  • The -use_fast_math compiler option forces every
    func() to compile to __func()
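A minimal sketch contrasting the two flavors (in current CUDA the single-precision intrinsic spellings are __sinf(), __expf(), etc.; the slides abbreviate them as __sin(), __exp()):

    __global__ void math_flavors(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = sinf(in[i]) - __sinf(in[i]);  // accurate version minus fast intrinsic
    }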

42
Make your program float-safe!
  • G20 has double precision support
  • G80 is single-precision only
  • Double precision has an additional performance cost
  • Only one double-precision unit per multiprocessor
  • Careless use of double or undeclared types may
    run more slowly on G80
  • It is important to be float-safe (be explicit
    whenever you want single precision) to avoid
    using double precision where it is not needed
  • Add the f suffix to float literals:
  • foo = bar * 0.123;   // double assumed
  • foo = bar * 0.123f;  // float explicit
  • Use the float versions of standard library functions:
  • foo = sin(bar);   // double assumed
  • foo = sinf(bar);  // single precision explicit

43
Deviations from IEEE-754
  • Addition and multiplication are IEEE 754
    compliant
  • Maximum 0.5 ulp (units in the last place) error
  • However, they are often combined into a
    multiply-add (FMAD)
  • The intermediate result is truncated
  • Division is non-compliant (2 ulp)
  • Not all rounding modes are supported
  • Denormalized numbers are not supported
  • No mechanism to detect floating-point exceptions

44
Units in the Last Place Error
  • If the result of an FP computation is
  • 3.12 × 10^-2 = 0.0312
  • but the answer when computed to infinite
    precision is
  • 0.0312159
  • then the error is
  • (0.0312159 - 0.0312) / 0.0001 = 0.159 ulp
  • (one ulp here is 0.0001, the place value of the
    last digit)
  • For binary representations the maximum ulp error
    is 0.5
  • when rounding to the nearest representable number

45
Mixed Precision Methods
  • From slides by
  • Robert Strzodka
  • Dominik Göddeke
  • http://www.mathematik.uni-dortmund.de/goeddeke/pubs/NVISION08-long.pdf

46
What is a Mixed Precision Method?
47
Mixed Precision Performance Gains
48
Single vs Double Precision FP
Float s23e8 Double s53e11
49
Round off and Cancellation
50
Double Precision != Better Accuracy
51
The Dominant Data Error
52
Understanding Floating Point Operations
53
Commutative Summation
54
Commutative Summation Example
Say we want to calculate 1 + 0.0000004 + 0.00000003
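A minimal sketch of why the order matters in single precision (host-side C, using the values from the slide):

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0f, b = 0.0000004f, c = 0.00000003f;
        // adding the large term first loses most of c
        printf("%.9f\n", (a + b) + c);
        // accumulating the small terms first preserves them
        printf("%.9f\n", a + (b + c));
        return 0;
    }

The two printed sums differ in the last digits because float addition is not associative: each intermediate result is rounded to the nearest representable value.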
55
High Precision Emulation
56
Example: Addition c = a + b
57
Please read the following