Automatic Transformation and Optimization of Applications on GPUs and GPU Clusters
1
Automatic Transformation and Optimization of
Applications on GPUs and GPU Clusters
PhD Oral Defence: Wenjing Ma
Advisor: Dr. Gagan Agrawal, The Ohio State University
2
Outline of Contents
  • Motivation
  • Accelerators, GPGPU and GPU cluster
  • Difficulty of GPU programming
  • Framework and Approaches
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

3
Introduction
  • Accelerators, GPGPU and GPU cluster
  • Multi-core architectures are increasingly
    popular in high performance computing
  • GPU, Cell Processor, FPGA
  • GPUs offer a good performance/price ratio
  • Difficulty of Programming
  • How to program a cluster with accelerators on
    each node?

4
Our Approach
  • Provide high-level support for programming
    emerging high-end configurations
  • Effective and simple optimization strategies
  • Focus on specific application classes
  • Data mining applications
  • Tensor contraction expressions

5
Outline of Contents
  • Motivation
  • Accelerators, GPGPU and GPU cluster
  • Difficulty of GPU programming
  • Framework and Approaches
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

6
Shared memory on GPU
  • Features of shared memory on GPU
  • Small in size
  • Software controllable
  • Much faster than device memory
  • Need a strategy to arrange data on shared memory
  • Arranging by hand: time-consuming and not optimal
  • Previous work: intuitive solutions

7
An Example to Show Shared Memory Usage

void Kernel_function(float *A, float *C, ...)
{
    __shared__ float s_C[r * NUM_THREADS];
    __shared__ float s_A[r * NUM_THREADS];
    for (int i = 0; i < n; i += NUM_THREADS) {
        for (int j = 0; j < r; j++) {
            /* load A in device memory into s_A */
        }
        for (int j = 0; j < m; j++) {
            for (int k = 0; k < r; k++) {
                /* load C in device memory into s_C */
            }
        }
        ......
        /* load B in device memory into s_A */
    }
}
8
Problem Formulation for Shared Memory Arrangement
  • What to Consider
  • A kernel function (with a number of basic blocks)
  • Array, section of array, element of array
  • Live ranges of each variable
  • Determine in which basic block a variable is
    allocated to shared memory
  • assign_point[i][k]: variable i, basic block k

9
Integer Linear Programming
  • Linear Programming
  • Objective function
  • Maximize z = cᵀx
  • Constraints
  • Ax ≤ b
  • Solution
  • Values of vector x
  • Special case of linear programming
  • All the unknown variables are integers (within
    {0, 1} in our case)
  • Solvable for reasonable problem sizes (a minimal
    sketch follows)
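
To make the 0/1 special case concrete, here is a minimal sketch in C of what solving such a problem means; the slides feed the real instances to an integer programming solver, and the matrix A, vector b, and objective c below are hypothetical. It enumerates all binary vectors x and keeps the best feasible one, which is practical exactly because the per-kernel problems are small.

#include <stdio.h>

/* Minimal 0/1 ILP by exhaustive search: maximize c^T x s.t. Ax <= b. */
#define NVARS 4
#define NCONS 2

int main(void)
{
    double c[NVARS] = {3, 2, 4, 1};              /* hypothetical objective */
    double A[NCONS][NVARS] = {{2, 1, 3, 1},      /* hypothetical constraints */
                              {1, 2, 1, 3}};
    double b[NCONS] = {4, 4};
    int best = -1;
    double best_z = -1e30;
    for (int mask = 0; mask < (1 << NVARS); mask++) {
        int ok = 1;
        for (int j = 0; j < NCONS && ok; j++) {  /* check Ax <= b */
            double lhs = 0;
            for (int i = 0; i < NVARS; i++)
                lhs += A[j][i] * ((mask >> i) & 1);
            if (lhs > b[j]) ok = 0;
        }
        if (!ok) continue;
        double z = 0;                            /* evaluate c^T x */
        for (int i = 0; i < NVARS; i++)
            z += c[i] * ((mask >> i) & 1);
        if (z > best_z) { best_z = z; best = mask; }
    }
    printf("best z = %.1f, x =", best_z);
    for (int i = 0; i < NVARS; i++)
        printf(" %d", (best >> i) & 1);
    printf("\n");
    return 0;
}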

10
Integer Programming for Shared Memory Arrangement
(cntd)
  • Objective Function
  • Maximize shared memory usage
  • Minimize data transfer between memory hierarchies

Maximize  z = Σ_{i ∈ 1..nVar, k ∈ 1..nLive_i} Agg_SMref_{ik}
            − Σ_{i ∈ 1..nVar, k ∈ 1..nLive_i} Total_memcopy_{ik}
11
Integer Programming for Shared Memory Arrangement
  • Objective Function

Agg_SMref_{ik} = Σ_{j ∈ live_blocks_{ij}} Is_assigned_{ij} · Refs_{ij} · iters_j

Total_memcopy_{ik} = Σ_j Data_trans_{ij} · iters_j

Data_trans_{ij} = 2 · size_alloc_{ij},  if Access_{ik} = read/write
                  0,                    if Access_{ik} = temp
                  size_alloc_{ij},      otherwise
12
An Example to Show size_alloc
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......

size_alloc = r · m
size_alloc = r · m
size_alloc = r
size_alloc = 1
13
Integer Programming for Shared Memory Arrangement
  • Constraints
  • Total allocation does not exceed the limit of
    shared memory at any time
  • At most one assign_point is 1 in each live
    range

Σ_{i ∈ live_list_j} Is_assigned_{ij} · size_alloc_{ij} ≤ limit
Σ_{j ∈ live_blocks_{ik}} assign_point_{ij} ≤ 1
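
As an illustration of how these two constraint families could be handed to an off-the-shelf solver, the following hedged sketch writes a hypothetical 3-variable, 2-block instance in lp_solve's text LP format; the sizes, weights, and the 16 KB limit are made up for illustration, and the dissertation's actual solver interface may differ.

#include <stdio.h>

int main(void)
{
    enum { NV = 3, NB = 2 };                  /* variables A, B, C; two blocks */
    int limit = 16384 / 4;                    /* 16 KB shared memory, in floats */
    int size_alloc[NV][NB] = {{9, 9}, {9, 9}, {3, 3}};   /* hypothetical */
    int weight[NV][NB]     = {{5, 6}, {4, 4}, {7, 7}};   /* hypothetical net benefit */
    FILE *f = fopen("sm_arrange.lp", "w");
    fprintf(f, "max:");                       /* objective: Agg_SMref - Total_memcopy */
    for (int i = 0; i < NV; i++)
        for (int j = 0; j < NB; j++)
            fprintf(f, " +%d a%d_%d", weight[i][j], i, j);
    fprintf(f, ";\n");
    for (int j = 0; j < NB; j++) {            /* capacity per basic block */
        fprintf(f, "cap%d:", j);
        for (int i = 0; i < NV; i++)
            fprintf(f, " +%d a%d_%d", size_alloc[i][j], i, j);
        fprintf(f, " <= %d;\n", limit);
    }
    for (int i = 0; i < NV; i++) {            /* one assign point per live range */
        fprintf(f, "live%d:", i);
        for (int j = 0; j < NB; j++)
            fprintf(f, " +a%d_%d", i, j);
        fprintf(f, " <= 1;\n");
    }
    for (int i = 0; i < NV; i++)              /* declare 0/1 variables */
        for (int j = 0; j < NB; j++)
            fprintf(f, "bin a%d_%d;\n", i, j);
    fclose(f);
    return 0;
}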
14
An Example
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......

(fed to the Integer Programming Solver)

A: n × r,  B: m × r,  C: r
n = 2048, m = 3, r = 3, NUM_THREADS = 256

assign_point[i][j]: i denotes variable i, j
denotes basic block j. Variables 0, 1, 2
correspond to A, B, C in the code.

assign_point[0][1] = 1
assign_point[1][0] = 1
assign_point[2][0] = 1
/* all other elements of assign_point are 0 */
15
An Example (cntd)?
Generated Code:

__shared__ float s_B[m * r];
__shared__ float s_C[r * NUM_THREADS];
__shared__ float s_A[r * NUM_THREADS];
/* load B to s_B */
for (int i = 0; i < n; i += NUM_THREADS) {
    for (int j = 0; j < r; j++)
        s_A[tid * r + j] = A[tid + i][j];
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            s_C[k * NUM_THREADS + tid] += s_A[tid * r + k] - s_B[j * r + k];
    ......
}
/* Synchronize and combination of C */

Original code:

for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......
16
Suggesting Loop Transformation
Before transformation:

for (int rc = 0; rc < nRowCl; rc++) {
    tempDis = 0;
    for (int c = 0; c < numCol; c++)
        tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
}

After transformation:

for (int rc = 0; rc < nRowCl; rc++)
    tempDis[rc] = 0;
for (int c = 0; c < numCol; c++) {
    /* load into shared memory */
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
}
17
Experiment Results
(Charts: results for k-means and EM)

18
Experiment Results
(Charts: results for PCA and co-clustering)
19
Effect of Loop Transformation
(Charts: effect of loop transformation on PCA and co-clustering)
20
Outline of Contents
  • Motivation
  • Accelerators, GPGPU and GPU cluster
  • Difficulty of GPU programming
  • Framework and Approaches
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

21
Tensor Contraction on GPU and Auto-tuning
  • Tensor contraction expressions
  • Motivated by the CCSD(T) part of NWChem
  • In the form of high-dimensional matrix
    multiplication
  • Example (written as plain loops in the sketch
    below):
  • r[h1,h2,p3,p4] += t[h6,h7,h1,h2] · v[p3,p4,h6,h7]
  • Auto-tuning
  • Compile-time and Run-time optimization
  • Selecting best implementation with given input
    problem
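
Written out as plain loops, the example contraction is a six-deep loop nest; a minimal sketch, assuming hypothetical extents H (hole indices h*) and P (particle indices p*) and row-major layout. Viewing (h1,h2) as a row index, (p3,p4) as a column index, and (h6,h7) as the summation index makes it an ordinary matrix multiplication, which is what the flattening and index combination on the next slide exploit.

/* r[h1,h2,p3,p4] += t[h6,h7,h1,h2] * v[p3,p4,h6,h7] as plain loops */
void contract(int H, int P, const double *t, const double *v, double *r)
{
    for (int h1 = 0; h1 < H; h1++)
      for (int h2 = 0; h2 < H; h2++)
        for (int p3 = 0; p3 < P; p3++)
          for (int p4 = 0; p4 < P; p4++) {
            double s = 0.0;
            for (int h6 = 0; h6 < H; h6++)
              for (int h7 = 0; h7 < H; h7++)
                s += t[((h6 * H + h7) * H + h1) * H + h2]   /* t[h6,h7,h1,h2] */
                   * v[((p3 * P + p4) * H + h6) * H + h7];  /* v[p3,p4,h6,h7] */
            r[((h1 * H + h2) * P + p3) * P + p4] += s;      /* r[h1,h2,p3,p4] */
          }
}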

22
Original Algorithm and Optimization
  • Original Algorithm on T10 GPU
  • Loading input matrices to shared memory
  • Index Calculation
  • Flattening and index combination
  • Optimization for Fermi
  • Register tiling (sketched below)
  • Registers serve as a second level of cache
  • Larger shared memory and register file on Fermi
  • Modified index calculation order
  • Different output/input access ratio for each
    thread
  • r[h1,h2,p4,p3] += t[h6,h7,h1,h2] · v[p3,p4,h6,h7]
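
As an illustration of register tiling on the flattened form C = A · B, here is a sketch (not the dissertation's kernel): each thread of a TILE × TILE block keeps a 2 × 2 block of C in registers, on top of a shared-memory tile per K-step. TILE, the 2 × 2 register block, and the assumption that M and N are multiples of 2·TILE and K of TILE are all choices made for this sketch.

#define TILE 16   /* thread block: TILE x TILE; each block computes 2*TILE x 2*TILE of C */

__global__ void mm_regtile(const float *A, const float *B, float *C,
                           int M, int N, int K)
{
    __shared__ float sA[2 * TILE][TILE];
    __shared__ float sB[TILE][2 * TILE];
    int row = 2 * (blockIdx.y * TILE + threadIdx.y);  /* 2 output rows per thread */
    int col = 2 * (blockIdx.x * TILE + threadIdx.x);  /* 2 output cols per thread */
    float acc[2][2] = {{0.f, 0.f}, {0.f, 0.f}};       /* register tile of C */
    for (int k0 = 0; k0 < K; k0 += TILE) {
        for (int i = 0; i < 2; i++) {                 /* collaborative loading */
            sA[2 * threadIdx.y + i][threadIdx.x] = A[(row + i) * K + k0 + threadIdx.x];
            sB[threadIdx.y][2 * threadIdx.x + i] = B[(k0 + threadIdx.y) * N + col + i];
        }
        __syncthreads();
        for (int k = 0; k < TILE; k++)                /* registers as 2nd-level cache */
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++)
                    acc[i][j] += sA[2 * threadIdx.y + i][k] * sB[k][2 * threadIdx.x + j];
        __syncthreads();
    }
    for (int i = 0; i < 2; i++)                       /* write back the register tile */
        for (int j = 0; j < 2; j++)
            C[(row + i) * N + col + j] = acc[i][j];
}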

23
Motivation of auto-tuning for tensor contractions
on GPU
Running time (ms) of two functions on Fermi with
different index orders:

            Favor input   Favor output
Ex 1 (a)    0.425         0.504
Ex 1 (b)    0.487         0.584
Ex 1 (c)    0.51          0.671
Ex 1 (d)    0.681         0.881
Ex 2 (A)    13.6          11
Ex 2 (B)    105.5         41.5
Ex 2 (C)    199.7         149.9
Ex 2 (D)    27.1          22.6
  • Algorithm modification for different
    architectures
  • Different algorithm choices for different inputs

24
Approaches to Auto-tuning
  • Existing approaches
  • Analytical cost model
  • Hard to capture complex architecture features
  • Empirical search
  • Not practical when the search space is large
  • Our approach
  • Parametrizable micro-benchmarks
  • Focusing on the main features that affect
    performance

25
Auto-tuning with Parametrizable Micro-benchmarks
(Framework diagram: Different Implementations, Target
Expressions, and Architecture Features feed the
Micro-benchmark and its Parameter Space; running the
micro-benchmarks yields Models and Thresholds, which,
together with the expression and problem size in the
application, determine the Implementation Choice at
execution time.)
26
Auto-tuning Approach for Tensor Contractions on
Different GPUs
  • Auto-tuning tool
  • Parametrizable micro-benchmarks
  • Auto-tuning parameters
  • Memory access pattern
  • Kernel Consolidation

27
Micro-benchmark Evaluation for Memory Access
  • Access stride on device memory makes a big
    difference
  • Coalesced accesses
  • adjacent threads access contiguous words in
    device memory
  • Cache
  • L1 and L2
  • Mapping to tensor contractions
  • Index calculation order
  • For uncommon indices: in the order of input/output
  • For common indices: in the order of each input

28
Mapping to tensor contractions
r[h1,h2,p4,p3] += t[h6,h7,h1,h2] · v[p3,p4,h6,h7]
maps to C[a,b] += A[a,c] · B[c,b]

Collaborative loading of the input: threadIdx.x
corresponds to index c of B, i.e., index p3 of v
  • Calculating with input order: p3 is the inner loop
  • Accessing v
  • Stride between two threads with adjacent x index: 1
  • Calculating with output order: p4 is the inner loop
  • Accessing v
  • Stride between two threads with adjacent x index:
    range(p3)
29
Micro-benchmark Evaluation for Memory Access
  • A simple micro-benchmark (a fuller sketch follows)
  • Three types of stride: stride_x, stride_y,
    stride_iter

A[tid.x * stride_x + tid.y * stride_y + i * stride_iter]
/* i is the index of the loop */

(Plots: measured access times on Fermi and T10)
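
One possible shape of that micro-benchmark as a complete kernel; a sketch only, where the out array and the trick of writing the sum back (so the compiler cannot eliminate the loads) are choices made here, not taken from the slides:

__global__ void stride_bench(const float *A, float *out, int iters,
                             int stride_x, int stride_y, int stride_iter)
{
    float sum = 0.0f;
    /* sweep stride_x / stride_y / stride_iter to build the access-time model */
    for (int i = 0; i < iters; i++)
        sum += A[threadIdx.x * stride_x + threadIdx.y * stride_y
                 + i * stride_iter];
    /* keep the loads live */
    out[threadIdx.y * blockDim.x + threadIdx.x] = sum;
}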
30
Experiments
  • Memory access for a single expression
    (values are running times in ms)

Tile size   Predicted choice   Actual (in order)   Actual (out order)
12          in order           0.241               0.295
13          in order           0.312               0.302
14          in order           0.425               0.504
15          in order           0.487               0.584
16          in order           0.51                0.671
17          in order           0.681               0.881
18          in order           1.078               1.471

Tile size   Predicted choice   Actual (in order)   Actual (out order)
12          out order          0.222               0.214
13          out order          0.28                0.27
14          out order          0.364               0.354
15          out order          0.511               0.482
16          out order          0.854               0.644
17          Equal              0.943               0.92
18          Equal              1.193               1.124
31
Micro-benchmark Evaluation for Kernel
Consolidation
  • Launching multiple kernels at the same time
  • With data copy
  • Overlapping of computing and data transfer
  • Without data copy
  • Better utilization of the computing resources
  • Using a matrix-matrix multiplication kernel as
    the micro-benchmark

32
Choice of kernel consolidation
  • Tightly coupled consolidation
  • For functions with large data movement cost
  • Loosely coupled consolidation
  • For functions with comparable computation and
    data movement
    (a CUDA-streams sketch of the per-task pattern
    follows)

foreach (task i)
    data copy (host to device)
foreach (task i)
    launch the kernels
foreach (task i)
    data copy (device to host)

foreach (task i)
    data copy for task i (host to device)
    launch kernel(i)
    data copy for task i (device to host)
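
A minimal realization of the second (per-task) pattern with CUDA streams; all names here are placeholders rather than the dissertation's interface, the kernel is a stand-in, and pinned host buffers are assumed so the async copies can actually overlap:

#include <cuda_runtime.h>

__global__ void task_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];   /* stand-in computation */
}

/* Copy-in, kernel, and copy-out of each task go on that task's own stream,
   so transfers of one task can overlap computation of another. */
void consolidate(int ntasks, float **h_in, float **h_out,
                 float **d_in, float **d_out, const int *n,
                 cudaStream_t *streams)
{
    for (int i = 0; i < ntasks; i++) {
        size_t bytes = (size_t)n[i] * sizeof(float);
        cudaMemcpyAsync(d_in[i], h_in[i], bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        task_kernel<<<(n[i] + 255) / 256, 256, 0, streams[i]>>>(d_in[i], d_out[i], n[i]);
        cudaMemcpyAsync(h_out[i], d_out[i], bytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();   /* wait for all tasks */
}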
33
Experiments
  • Kernel consolidation for a single expression

(Charts: micro-benchmark and real contraction)
34
Experiment
  • Running on collections of tensor contractions

(Charts: Fermi without data copy, T10 without data
copy, Fermi with data copy)
35
Outline of Contents
  • Motivation
  • Accelerators, GPGPU and GPU cluster
  • Difficulty of GPU programming
  • Framework and Approaches
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

36
Motivation of Loop Fusion for Sequences of Tensor
Contractions
  • Tensor contraction sequence:

T3(a,q,r,s) = Σ_p C4(p,a) · A(p,q,r,s)
T2(a,b,r,s) = Σ_q C3(q,b) · T3(a,q,r,s)
T1(a,b,c,s) = Σ_r C2(r,c) · T2(a,b,r,s)
B(a,b,c,d)  = Σ_s C1(s,d) · T1(a,b,c,s)

  • Need to find the fusion chains
  • Memory limits at different levels
  • With a GPU, the memory limitation is stricter

37
Tensor contractions in multi-level memory
hierarchy
  • Memory hierarchy in GPU clusters
  • α: disk
  • β: global memory
  • γ: local memory / GPU memory
  • None of the levels can be bypassed
  • A higher level is smaller and faster than a lower
    level

38
Loop transformation for tensor contraction
sequences on multi-level memory architecture
  • Single tensor contraction
  • Memory and data movement cost on multi-level
    memory
  • Tensor contractions represented as Z = X · Y
  • Loop fusion for sequences of tensor contractions
  • Conditions for fusion
  • Fusion on a multi-level memory hierarchy

39
Single Tensor Contraction on Multi-level memory
Hierarchy
  • One array fits in memory
  • X: x × y, Y: y × z, Z: x × z; assume X fits in
    memory
  • Memory cost: Nx·Ny + min(Nx, Ny) + 1 ≤ Mβ
  • No redundant data movement
  • No array fits in memory
  • To minimize data movement, a preferred solution
    is Ti = Tj = T
  • Multi-level memory hierarchy
  • Tile size determined by the particular system
    parameters and problem sizes (see the cost sketch
    below)
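
A hedged sketch of the kind of reasoning involved, assuming the standard tiled-matmul traffic analysis; the cost expression and footprint bound below are illustrative, not the dissertation's exact model:

#include <math.h>

/* Data moved to a level of capacity M (words) for Z = X * Y,
   X: Nx x Ny, Y: Ny x Nz, Z: Nx x Nz, with square tiles Ti = Tj = T
   of Z and the common dimension streamed through. */
double movement_cost(double Nx, double Ny, double Nz, double T)
{
    return Nx * Ny * (Nz / T)    /* X re-read once per column-tile of Z */
         + Ny * Nz * (Nx / T)    /* Y re-read once per row-tile of Z */
         + 2.0 * Nx * Nz;        /* each Z tile read and written once */
}

/* Largest feasible square tile: the T x T tile of Z plus one strip each of
   X and Y must fit, so roughly T*T + 2*T <= M; equal tiles minimize the
   1/Ti + 1/Tj terms above under that footprint. */
double preferred_tile(double M)
{
    return floor(sqrt(M + 1.0) - 1.0);
}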

40
Fusion Conditions
  • A sequence:

I1(d, c2, ..., cn) = I0(d, c1, ..., cn) · B0(d, c1, ..., cn)
I2(d, c3, ..., cn) = I1(d, c2, ..., cn) · B1(d, c2, ..., cn)
...
In(d) = In-1(d, cn) · Bn-1(d, cn)

  • Fuse only when data movement dominates
  • Factors determining the ratio:
  • The common index of the first contraction
  • The uncommon index of the smaller matrix in the
    second contraction
41
Fusion Conditions
  • The B matrices in the middle of the chain
    should be very small
  • Each such Bi resides in memory
  • The first B and the last B could be large
  • Their tile sizes are determined as in the single-
    contraction case

42
Memory requirement and data movement cost of
fused loops
S1 = Ii ∩ Ii+1 ∩ Ii+2,  S2 = Ii ∩ Ii+1,  S3 = Ii+2,  S4 = Ii

for sx ∈ S1 do
    Allocate Ii+1[sx]
    for sy ∈ S2 − S1 do
        Allocate Ii[sy]
        for sz ∈ S4 − S2 do
            Produce Ii[sz]
        end for
        Update Ii+1[sy]
    end for
    for sw ∈ S3 − S1 do
        Allocate Ii+2[sw]
        Produce Ii+2[sw]
    end for
end for

I1(d, c2, ..., cn) = I0(d, c1, ..., cn) · B0(d, c1, ..., cn)
I2(d, c3, ..., cn) = I1(d, c2, ..., cn) · B1(d, c2, ..., cn)
In(d) = In-1(d, cn) · Bn-1(d, cn)
43
Algorithm to determine fusion chains
  • For a fusable contraction list
  • With one matrix fitting in memory in each
    contraction
  • Memory cost
  • When the memory cost exceeds the memory limit, a
    split is made to break the fusion chain

f(i, j) = 0,   if j < i
        = ..., otherwise
44
Fusion in multi-level memory hierarchy
  • With the given chains at the lower level,
    determine subchains at the higher level
  • Reduced memory requirement for the β level
  • Same procedure to select fusion chains

f(i, j) = 0,   if j < i
        = ..., if memory_γ(i, j) ≤ M_γ
        = ..., otherwise
45
Evaluation
(Charts: fusion at the global memory level and at the
disk level)
46
Outline
  • Motivation
  • Accelerators, GPGPU and GPU cluster
  • Difficulty of GPU programming
  • Framework and Approaches
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

47
GREENRIDE: A Translation System for Enabling Data
Mining Applications on GPUs
  • User input
  • Code analyzer
  • Analysis of variables (variable type and size)
  • Analysis of reduction functions (sequential code
    from the user; a hypothetical example follows)
  • Code generator (generating CUDA code and C
    code invoking the kernel function)
  • Optimization
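
For concreteness, the sequential reduction function a user might supply could look like the following k-means style sketch; this is hypothetical, since the slides do not show GREENRIDE's actual input format:

/* Hypothetical user-supplied sequential reduction function (k-means style).
   GREENRIDE would analyze the variables and access patterns in code of
   this shape and generate the CUDA kernel plus the invoking host code. */
void kmeans_reduction(const float *data, int n, int dim,
                      const float *centers, int k, float *reduction_object)
{
    for (int i = 0; i < n; i++) {
        int best = 0;
        float best_dis = 1e30f;
        for (int c = 0; c < k; c++) {            /* find the nearest center */
            float dis = 0.0f;
            for (int d = 0; d < dim; d++) {
                float diff = data[i * dim + d] - centers[c * dim + d];
                dis += diff * diff;
            }
            if (dis < best_dis) { best_dis = dis; best = c; }
        }
        for (int d = 0; d < dim; d++)            /* accumulate coordinates */
            reduction_object[best * (dim + 1) + d] += data[i * dim + d];
        reduction_object[best * (dim + 1) + dim] += 1.0f;   /* point count */
    }
}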

48
GREENRIDE: A Translation System for Enabling Data
Mining Applications on GPUs

(System diagram: the user input (variable information,
reduction functions, optional functions) feeds the
Code Analyzer (in LLVM), which extracts the variable
access pattern and combination operations; the
Variable Analyzer supplies variable information; the
Code Generator then emits the kernel functions, the
data copy and thread grid configuration, and the host
program, which are built into the executable.)
49
GMAT-DM: Automatic Transformation from MATLAB for
GPUs

Pipeline: MATLAB code → OCTAVE parser → GMAT-DM →
C code → GREENRIDE → CUDA code

  • Transforms MATLAB code for the GPU
  • Converts MATLAB code to C
  • Uses GREENRIDE to convert to CUDA
  • Matrix manipulation
  • Modified metric for matrix multiplication chains
  • Function combination
50
AUTO-GC: Automatic Code Generation for FREERIDE
with GPU Support

Adds support for GPU clusters.

(System diagram: user input (variable information,
reduction functions, optional functions) goes to the
Code Analyzer, which derives the access pattern,
reduction objects, and combination operation; the
Variable Analyzer supplies variable information and
the parallel loop; the Code Generator emits FREERIDE
code for the cluster of CPUs and CUDA code for the
GPU on each node.)
51
Future Work
  • Extend the code generation system for data mining
    applications to more structures
  • Improve the ILP approach for shared memory
    arrangement and apply it to other architectures
  • Include more parameters in the auto-tuning
    framework
  • Extend the loop transformation work to
    heterogeneous structures

52
Conclusion
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

53
Thank you!