Automatic Transformation and Optimization of Applications on GPUs and GPU Clusters
1
Automatic Transformation and Optimization of
Applications on GPUs and GPU Clusters
PhD Oral Defence: Wenjing Ma
Advisor: Dr. Gagan Agrawal, The Ohio State University
2
Outline of Contents
  • Motivation
  • Accelerators, GPGPU and GPU cluster
  • Difficulty of GPU programming
  • Framework and Approaches
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

3
Introduction
  • Accelerators, GPGPU and GPU cluster
  • Multi-core architectures are increasingly
    popular in high performance computing
  • GPU, Cell Processor, FPGA
  • GPUs offer a good performance/price ratio
  • Difficulty of Programming
  • How to program a cluster with accelerators on
    each node?

4
Our Approach
  • Provide high-level support for programming
    emerging high-end configurations
  • Effective and simple optimization strategies
  • Focus on specific application classes
  • Data mining applications
  • Tensor contraction expressions

5
Outline of Contents
  • Motivation
  • Accelerators, GPGPU and GPU cluster
  • Difficulty of GPU programming
  • Framework and Approaches
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

6
Shared memory on GPU
  • Features of shared memory on GPU
  • Small in size
  • Software controllable
  • Much faster than device memory
  • Need a strategy to arrange data on shared memory
  • Arranging by hand: time-consuming and not optimal
  • Previous work: intuitive solutions

7
An Example to Show Shared Memory Usage

void Kernel_function(float *A, float *C, ...)
{
    __shared__ float s_C[r * NUM_THREADS];
    __shared__ float s_A[r * NUM_THREADS];
    for (int i = 0; i < n; i += NUM_THREADS) {
        for (int j = 0; j < r; j++) {
            /* load A in device memory into s_A */
        }
        for (int j = 0; j < m; j++) {
            for (int k = 0; k < r; k++) {
                /* load C in device memory into s_C */
            }
        }
        ......
        /* load B in device memory into s_A */
    }
}
8
Problem Formulation for Shared Memory Arrangement
  • What to Consider
  • A kernel function (with a number of basic blocks)
  • Array, section of array, element of array
  • Live ranges of each variable
  • Determine in which basic block a variable is
    allocated to shared memory
  • assign_point[i][k]: variable i, basic block k

9
Integer Linear Programming
  • Linear Programming
  • Objective function
  • Maximize z = cᵀx
  • Constraints
  • Ax ≤ b
  • Solution
  • Values of vector x
  • Special case of linear programming
  • All the unknown variables are integers (within
    {0, 1} in our case)
  • Solvable for reasonable problem sizes (a minimal
    sketch follows)
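
To make the 0/1 special case concrete, here is a minimal sketch in C of what solving such a problem means; the slides feed the real instances to an integer programming solver, and the matrix A, vector b, and objective c below are hypothetical. It enumerates all binary vectors x and keeps the best feasible one, which is practical exactly because the per-kernel problems are small.

#include <stdio.h>

/* Minimal 0/1 ILP by exhaustive search: maximize c^T x s.t. Ax <= b. */
#define NVARS 4
#define NCONS 2

int main(void)
{
    double c[NVARS] = {3, 2, 4, 1};              /* hypothetical objective */
    double A[NCONS][NVARS] = {{2, 1, 3, 1},      /* hypothetical constraints */
                              {1, 2, 1, 3}};
    double b[NCONS] = {4, 4};
    int best = -1;
    double best_z = -1e30;
    for (int mask = 0; mask < (1 << NVARS); mask++) {
        int ok = 1;
        for (int j = 0; j < NCONS && ok; j++) {  /* check Ax <= b */
            double lhs = 0;
            for (int i = 0; i < NVARS; i++)
                lhs += A[j][i] * ((mask >> i) & 1);
            if (lhs > b[j]) ok = 0;
        }
        if (!ok) continue;
        double z = 0;                            /* evaluate c^T x */
        for (int i = 0; i < NVARS; i++)
            z += c[i] * ((mask >> i) & 1);
        if (z > best_z) { best_z = z; best = mask; }
    }
    printf("best z = %.1f, x =", best_z);
    for (int i = 0; i < NVARS; i++)
        printf(" %d", (best >> i) & 1);
    printf("\n");
    return 0;
}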

10
Integer Programming for Shared Memory Arrangement
(cntd)
  • Objective Function
  • Maximize shared memory usage
  • Minimize data transfer between memory hierarchies

Maximize  z = Σ_{i ∈ 1..nVar, k ∈ 1..nLive_i} Agg_SMref_{ik}
            − Σ_{i ∈ 1..nVar, k ∈ 1..nLive_i} Total_memcopy_{ik}
11
Integer Programming for Shared Memory Arrangement
  • Objective Function

Agg_SMref_{ik} = Σ_{j ∈ live_blocks_{ij}} Is_assigned_{ij} · Refs_{ij} · iters_j

Total_memcopy_{ik} = Σ_j Data_trans_{ij} · iters_j

Data_trans_{ij} = 2 · size_alloc_{ij},  if Access_{ik} = read/write
                  0,                    if Access_{ik} = temp
                  size_alloc_{ij},      otherwise
12
An Example to Show size_alloc
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......

size_alloc = r · m
size_alloc = r · m
size_alloc = r
size_alloc = 1
13
Integer Programming for Shared Memory Arrangement
  • Constraints
  • Total allocation does not exceed the limit of
    shared memory at any time
  • At most one assign_point is 1 in each live
    range

Σ_{i ∈ live_list_j} Is_assigned_{ij} · size_alloc_{ij} ≤ limit
Σ_{j ∈ live_blocks_{ik}} assign_point_{ij} ≤ 1
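
As an illustration of how these two constraint families could be handed to an off-the-shelf solver, the following hedged sketch writes a hypothetical 3-variable, 2-block instance in lp_solve's text LP format; the sizes, weights, and the 16 KB limit are made up for illustration, and the dissertation's actual solver interface may differ.

#include <stdio.h>

int main(void)
{
    enum { NV = 3, NB = 2 };                  /* variables A, B, C; two blocks */
    int limit = 16384 / 4;                    /* 16 KB shared memory, in floats */
    int size_alloc[NV][NB] = {{9, 9}, {9, 9}, {3, 3}};   /* hypothetical */
    int weight[NV][NB]     = {{5, 6}, {4, 4}, {7, 7}};   /* hypothetical net benefit */
    FILE *f = fopen("sm_arrange.lp", "w");
    fprintf(f, "max:");                       /* objective: Agg_SMref - Total_memcopy */
    for (int i = 0; i < NV; i++)
        for (int j = 0; j < NB; j++)
            fprintf(f, " +%d a%d_%d", weight[i][j], i, j);
    fprintf(f, ";\n");
    for (int j = 0; j < NB; j++) {            /* capacity per basic block */
        fprintf(f, "cap%d:", j);
        for (int i = 0; i < NV; i++)
            fprintf(f, " +%d a%d_%d", size_alloc[i][j], i, j);
        fprintf(f, " <= %d;\n", limit);
    }
    for (int i = 0; i < NV; i++) {            /* one assign point per live range */
        fprintf(f, "live%d:", i);
        for (int j = 0; j < NB; j++)
            fprintf(f, " +a%d_%d", i, j);
        fprintf(f, " <= 1;\n");
    }
    for (int i = 0; i < NV; i++)              /* declare 0/1 variables */
        for (int j = 0; j < NB; j++)
            fprintf(f, "bin a%d_%d;\n", i, j);
    fclose(f);
    return 0;
}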
14
An Example
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......

(fed to the Integer Programming Solver)

A: n × r,  B: m × r,  C: r
n = 2048, m = 3, r = 3, NUM_THREADS = 256

assign_point[i][j]: i denotes variable i, j
denotes basic block j. Variables 0, 1, 2
correspond to A, B, C in the code.

assign_point[0][1] = 1
assign_point[1][0] = 1
assign_point[2][0] = 1
/* all other elements of assign_point are 0 */
15
An Example (cntd)?
Generated Code:

__shared__ float s_B[m * r];
__shared__ float s_C[r * NUM_THREADS];
__shared__ float s_A[r * NUM_THREADS];
/* load B to s_B */
for (int i = 0; i < n; i += NUM_THREADS) {
    for (int j = 0; j < r; j++)
        s_A[tid * r + j] = A[tid + i][j];
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            s_C[k * NUM_THREADS + tid] += s_A[tid * r + k] - s_B[j * r + k];
    ......
}
/* Synchronize and combination of C */

Original code:

for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......
16
Suggesting Loop Transformation
Before transformation:

for (int rc = 0; rc < nRowCl; rc++) {
    tempDis = 0;
    for (int c = 0; c < numCol; c++)
        tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
}

After transformation:

for (int rc = 0; rc < nRowCl; rc++)
    tempDis[rc] = 0;
for (int c = 0; c < numCol; c++) {
    /* load into shared memory */
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
}
17
Experiment Results
(Charts: results for k-means and EM)

18
Experiment Results
(Charts: results for PCA and co-clustering)
19
Effect of Loop Transformation
(Charts: effect of loop transformation on PCA and co-clustering)
20
Outline of Contents
  • Motivation
  • Accelerators, GPGPU and GPU cluster
  • Difficulty of GPU programming
  • Framework and Approaches
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

21
Tensor Contraction on GPU and Auto-tuning
  • Tensor contraction expressions
  • Motivated by the CCSD(T) part of NWChem
  • In the form of high-dimensional matrix
    multiplication
  • Example (written as plain loops in the sketch
    below):
  • r[h1,h2,p3,p4] += t[h6,h7,h1,h2] · v[p3,p4,h6,h7]
  • Auto-tuning
  • Compile-time and Run-time optimization
  • Selecting best implementation with given input
    problem
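
Written out as plain loops, the example contraction is a six-deep loop nest; a minimal sketch, assuming hypothetical extents H (hole indices h*) and P (particle indices p*) and row-major layout. Viewing (h1,h2) as a row index, (p3,p4) as a column index, and (h6,h7) as the summation index makes it an ordinary matrix multiplication, which is what the flattening and index combination on the next slide exploit.

/* r[h1,h2,p3,p4] += t[h6,h7,h1,h2] * v[p3,p4,h6,h7] as plain loops */
void contract(int H, int P, const double *t, const double *v, double *r)
{
    for (int h1 = 0; h1 < H; h1++)
      for (int h2 = 0; h2 < H; h2++)
        for (int p3 = 0; p3 < P; p3++)
          for (int p4 = 0; p4 < P; p4++) {
            double s = 0.0;
            for (int h6 = 0; h6 < H; h6++)
              for (int h7 = 0; h7 < H; h7++)
                s += t[((h6 * H + h7) * H + h1) * H + h2]   /* t[h6,h7,h1,h2] */
                   * v[((p3 * P + p4) * H + h6) * H + h7];  /* v[p3,p4,h6,h7] */
            r[((h1 * H + h2) * P + p3) * P + p4] += s;      /* r[h1,h2,p3,p4] */
          }
}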

22
Original Algorithm and Optimization
  • Original Algorithm on T10 GPU
  • Loading input matrices to shared memory
  • Index Calculation
  • Flattening and index combination
  • Optimization for Fermi
  • Register tiling (sketched below)
  • Registers serve as a second level of cache
  • Larger shared memory and register file on Fermi
  • Modified index calculation order
  • Different output/input access ratio for each
    thread
  • r[h1,h2,p4,p3] += t[h6,h7,h1,h2] · v[p3,p4,h6,h7]
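
As an illustration of register tiling on the flattened form C = A · B, here is a sketch (not the dissertation's kernel): each thread of a TILE × TILE block keeps a 2 × 2 block of C in registers, on top of a shared-memory tile per K-step. TILE, the 2 × 2 register block, and the assumption that M and N are multiples of 2·TILE and K of TILE are all choices made for this sketch.

#define TILE 16   /* thread block: TILE x TILE; each block computes 2*TILE x 2*TILE of C */

__global__ void mm_regtile(const float *A, const float *B, float *C,
                           int M, int N, int K)
{
    __shared__ float sA[2 * TILE][TILE];
    __shared__ float sB[TILE][2 * TILE];
    int row = 2 * (blockIdx.y * TILE + threadIdx.y);  /* 2 output rows per thread */
    int col = 2 * (blockIdx.x * TILE + threadIdx.x);  /* 2 output cols per thread */
    float acc[2][2] = {{0.f, 0.f}, {0.f, 0.f}};       /* register tile of C */
    for (int k0 = 0; k0 < K; k0 += TILE) {
        for (int i = 0; i < 2; i++) {                 /* collaborative loading */
            sA[2 * threadIdx.y + i][threadIdx.x] = A[(row + i) * K + k0 + threadIdx.x];
            sB[threadIdx.y][2 * threadIdx.x + i] = B[(k0 + threadIdx.y) * N + col + i];
        }
        __syncthreads();
        for (int k = 0; k < TILE; k++)                /* registers as 2nd-level cache */
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++)
                    acc[i][j] += sA[2 * threadIdx.y + i][k] * sB[k][2 * threadIdx.x + j];
        __syncthreads();
    }
    for (int i = 0; i < 2; i++)                       /* write back the register tile */
        for (int j = 0; j < 2; j++)
            C[(row + i) * N + col + j] = acc[i][j];
}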

23
Motivation of auto-tuning for tensor contractions
on GPU
Running time (ms) of two functions on Fermi with
different index orders:

            Favor input   Favor output
Ex 1 (a)    0.425         0.504
Ex 1 (b)    0.487         0.584
Ex 1 (c)    0.51          0.671
Ex 1 (d)    0.681         0.881
Ex 2 (A)    13.6          11
Ex 2 (B)    105.5         41.5
Ex 2 (C)    199.7         149.9
Ex 2 (D)    27.1          22.6
  • Algorithm modification for different
    architectures
  • Different algorithm choices for different inputs

24
Approaches to Auto-tuning
  • Existing approaches
  • Analytical cost model
  • Hard to capture complex architecture features
  • Empirical search
  • Not practical when the search space is large
  • Our approach
  • Parametrizable micro-benchmarks
  • Focusing on the main features that affect
    performance

25
Auto-tuning with Parametrizable Micro-benchmarks
(Framework diagram: Different Implementations, Target
Expressions, and Architecture Features feed the
Micro-benchmark and its Parameter Space; running the
micro-benchmarks yields Models and Thresholds, which,
together with the expression and problem size in the
application, determine the Implementation Choice at
execution time.)
26
Auto-tuning Approach for Tensor Contractions on
Different GPUs
  • Auto-tuning tool
  • Parametrizable micro-benchmarks
  • Auto-tuning parameters
  • Memory access pattern
  • Kernel Consolidation

27
Micro-benchmark Evaluation for Memory Access
  • Access stride on device memory makes a big
    difference
  • Coalesced accesses
  • adjacent threads access contiguous words in
    device memory
  • Cache
  • L1 and L2
  • Mapping to tensor contractions
  • Index calculation order
  • For uncommon indices: in the order of input/output
  • For common indices: in the order of each input

28
Mapping to tensor contractions
r[h1,h2,p4,p3] += t[h6,h7,h1,h2] · v[p3,p4,h6,h7]
maps to C[a,b] += A[a,c] · B[c,b]

Collaborative loading of the input: threadIdx.x
corresponds to index c of B, i.e., index p3 of v
  • Calculating with input order: p3 is the inner loop
  • Accessing v
  • Stride between two threads with adjacent x index: 1
  • Calculating with output order: p4 is the inner loop
  • Accessing v
  • Stride between two threads with adjacent x index:
    range(p3)
29
Micro-benchmark Evaluation for Memory Access
  • A simple micro-benchmark (a fuller sketch follows)
  • Three types of stride: stride_x, stride_y,
    stride_iter

A[tid.x * stride_x + tid.y * stride_y + i * stride_iter]
/* i is the index of the loop */

(Plots: measured access times on Fermi and T10)
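
One possible shape of that micro-benchmark as a complete kernel; a sketch only, where the out array and the trick of writing the sum back (so the compiler cannot eliminate the loads) are choices made here, not taken from the slides:

__global__ void stride_bench(const float *A, float *out, int iters,
                             int stride_x, int stride_y, int stride_iter)
{
    float sum = 0.0f;
    /* sweep stride_x / stride_y / stride_iter to build the access-time model */
    for (int i = 0; i < iters; i++)
        sum += A[threadIdx.x * stride_x + threadIdx.y * stride_y
                 + i * stride_iter];
    /* keep the loads live */
    out[threadIdx.y * blockDim.x + threadIdx.x] = sum;
}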
30
Experiments
  • Memory access for a single expression
    (values are running times in ms)

Tile size   Predicted choice   Actual (in order)   Actual (out order)
12          in order           0.241               0.295
13          in order           0.312               0.302
14          in order           0.425               0.504
15          in order           0.487               0.584
16          in order           0.51                0.671
17          in order           0.681               0.881
18          in order           1.078               1.471

Tile size   Predicted choice   Actual (in order)   Actual (out order)
12          out order          0.222               0.214
13          out order          0.28                0.27
14          out order          0.364               0.354
15          out order          0.511               0.482
16          out order          0.854               0.644
17          Equal              0.943               0.92
18          Equal              1.193               1.124
31
Micro-benchmark Evaluation for Kernel
Consolidation
  • Launching multiple kernels at the same time
  • With data copy
  • Overlapping of computing and data transfer
  • Without data copy
  • Better utilization of the computing resources
  • Using a matrix-matrix multiplication kernel as
    the micro-benchmark

32
Choice of kernel consolidation
  • Tightly coupled consolidation
  • For functions with large data movement cost
  • Loosely coupled consolidation
  • For functions with comparable computation and
    data movement
    (a CUDA-streams sketch of the per-task pattern
    follows)

foreach (task i)
    data copy (host to device)
foreach (task i)
    launch the kernels
foreach (task i)
    data copy (device to host)

foreach (task i)
    data copy for task i (host to device)
    launch kernel(i)
    data copy for task i (device to host)
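
A minimal realization of the second (per-task) pattern with CUDA streams; all names here are placeholders rather than the dissertation's interface, the kernel is a stand-in, and pinned host buffers are assumed so the async copies can actually overlap:

#include <cuda_runtime.h>

__global__ void task_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];   /* stand-in computation */
}

/* Copy-in, kernel, and copy-out of each task go on that task's own stream,
   so transfers of one task can overlap computation of another. */
void consolidate(int ntasks, float **h_in, float **h_out,
                 float **d_in, float **d_out, const int *n,
                 cudaStream_t *streams)
{
    for (int i = 0; i < ntasks; i++) {
        size_t bytes = (size_t)n[i] * sizeof(float);
        cudaMemcpyAsync(d_in[i], h_in[i], bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        task_kernel<<<(n[i] + 255) / 256, 256, 0, streams[i]>>>(d_in[i], d_out[i], n[i]);
        cudaMemcpyAsync(h_out[i], d_out[i], bytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();   /* wait for all tasks */
}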
33
Experiments
  • Kernel consolidation for a single expression

(Charts: micro-benchmark and real contraction)
34
Experiment
  • Running on collections of tensor contractions

(Charts: Fermi without data copy, T10 without data
copy, Fermi with data copy)
35
Outline of Contents
  • Motivation
  • Accelerators, GPGPU and GPU cluster
  • Difficulty of GPU programming
  • Framework and Approaches
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

36
Motivation of Loop Fusion for Sequences of Tensor
Contractions
  • Tensor contraction sequence:

T3(a,q,r,s) = Σ_p C4(p,a) · A(p,q,r,s)
T2(a,b,r,s) = Σ_q C3(q,b) · T3(a,q,r,s)
T1(a,b,c,s) = Σ_r C2(r,c) · T2(a,b,r,s)
B(a,b,c,d)  = Σ_s C1(s,d) · T1(a,b,c,s)

  • Need to find the fusion chains
  • Memory limits at different levels
  • With a GPU, the memory limitation is stricter

37
Tensor contractions in multi-level memory
hierarchy
  • Memory hierarchy in GPU clusters
  • α: disk
  • β: global memory
  • γ: local memory / GPU memory
  • None of the levels can be bypassed
  • A higher level is smaller and faster than a lower
    level

38
Loop transformation for tensor contraction
sequences on multi-level memory architecture
  • Single tensor contraction
  • Memory and data movement cost on multi-level
    memory
  • Tensor contractions represented as Z = X · Y
  • Loop fusion for sequences of tensor contractions
  • Conditions for fusion
  • Fusion on a multi-level memory hierarchy

39
Single Tensor Contraction on Multi-level memory
Hierarchy
  • One array fits in memory
  • X: x × y, Y: y × z, Z: x × z; assume X fits in
    memory
  • Memory cost: Nx·Ny + min(Nx, Ny) + 1 ≤ Mβ
  • No redundant data movement
  • No array fits in memory
  • To minimize data movement, a preferred solution
    is Ti = Tj = T
  • Multi-level memory hierarchy
  • Tile size determined by the particular system
    parameters and problem sizes (see the cost sketch
    below)
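
A hedged sketch of the kind of reasoning involved, assuming the standard tiled-matmul traffic analysis; the cost expression and footprint bound below are illustrative, not the dissertation's exact model:

#include <math.h>

/* Data moved to a level of capacity M (words) for Z = X * Y,
   X: Nx x Ny, Y: Ny x Nz, Z: Nx x Nz, with square tiles Ti = Tj = T
   of Z and the common dimension streamed through. */
double movement_cost(double Nx, double Ny, double Nz, double T)
{
    return Nx * Ny * (Nz / T)    /* X re-read once per column-tile of Z */
         + Ny * Nz * (Nx / T)    /* Y re-read once per row-tile of Z */
         + 2.0 * Nx * Nz;        /* each Z tile read and written once */
}

/* Largest feasible square tile: the T x T tile of Z plus one strip each of
   X and Y must fit, so roughly T*T + 2*T <= M; equal tiles minimize the
   1/Ti + 1/Tj terms above under that footprint. */
double preferred_tile(double M)
{
    return floor(sqrt(M + 1.0) - 1.0);
}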

40
Fusion Conditions
  • A sequence:

I1(d, c2, ..., cn) = I0(d, c1, ..., cn) · B0(d, c1, ..., cn)
I2(d, c3, ..., cn) = I1(d, c2, ..., cn) · B1(d, c2, ..., cn)
...
In(d) = In-1(d, cn) · Bn-1(d, cn)

  • Fuse only when data movement dominates
  • Factors determining the ratio:
  • The common index of the first contraction
  • The uncommon index of the smaller matrix in the
    second contraction
41
Fusion Conditions
  • The B matrices in the middle of the chain
    should be very small
  • Each such Bi resides in memory
  • The first B and the last B could be large
  • Their tile sizes are determined as in the single-
    contraction case

42
Memory requirement and data movement cost of
fused loops
S1 = Ii ∩ Ii+1 ∩ Ii+2,  S2 = Ii ∩ Ii+1,  S3 = Ii+2,  S4 = Ii

for sx ∈ S1 do
    Allocate Ii+1[sx]
    for sy ∈ S2 − S1 do
        Allocate Ii[sy]
        for sz ∈ S4 − S2 do
            Produce Ii[sz]
        end for
        Update Ii+1[sy]
    end for
    for sw ∈ S3 − S1 do
        Allocate Ii+2[sw]
        Produce Ii+2[sw]
    end for
end for

I1(d, c2, ..., cn) = I0(d, c1, ..., cn) · B0(d, c1, ..., cn)
I2(d, c3, ..., cn) = I1(d, c2, ..., cn) · B1(d, c2, ..., cn)
In(d) = In-1(d, cn) · Bn-1(d, cn)
43
Algorithm to determine fusion chains
  • For a fusable contraction list
  • With one matrix fitting in memory in each
    contraction
  • Memory cost
  • When the memory cost exceeds the memory limit, a
    split is made to break the fusion chain

f(i, j) = 0,   if j < i
        = ..., otherwise
44
Fusion in multi-level memory hierarchy
  • With the given chains at the lower level,
    determine subchains at the higher level
  • Reduced memory requirement for the β level
  • Same procedure to select fusion chains

f(i, j) = 0,   if j < i
        = ..., if memory_γ(i, j) ≤ M_γ
        = ..., otherwise
45
Evaluation
(Charts: fusion at the global memory level and at the
disk level)
46
Outline
  • Motivation
  • Accelerators, GPGPU and GPU cluster
  • Difficulty of GPU programming
  • Framework and Approaches
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

47
GREENRIDE: A Translation System for Enabling Data
Mining Applications on GPUs
  • User input
  • Code analyzer
  • Analysis of variables (variable type and size)
  • Analysis of reduction functions (sequential code
    from the user; a hypothetical example follows)
  • Code generator (generating CUDA code and C
    code invoking the kernel function)
  • Optimization
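
For concreteness, the sequential reduction function a user might supply could look like the following k-means style sketch; this is hypothetical, since the slides do not show GREENRIDE's actual input format:

/* Hypothetical user-supplied sequential reduction function (k-means style).
   GREENRIDE would analyze the variables and access patterns in code of
   this shape and generate the CUDA kernel plus the invoking host code. */
void kmeans_reduction(const float *data, int n, int dim,
                      const float *centers, int k, float *reduction_object)
{
    for (int i = 0; i < n; i++) {
        int best = 0;
        float best_dis = 1e30f;
        for (int c = 0; c < k; c++) {            /* find the nearest center */
            float dis = 0.0f;
            for (int d = 0; d < dim; d++) {
                float diff = data[i * dim + d] - centers[c * dim + d];
                dis += diff * diff;
            }
            if (dis < best_dis) { best_dis = dis; best = c; }
        }
        for (int d = 0; d < dim; d++)            /* accumulate coordinates */
            reduction_object[best * (dim + 1) + d] += data[i * dim + d];
        reduction_object[best * (dim + 1) + dim] += 1.0f;   /* point count */
    }
}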

48
GREENRIDE: A Translation System for Enabling Data
Mining Applications on GPUs

(System diagram: the user input (variable information,
reduction functions, optional functions) feeds the
Code Analyzer (in LLVM), which extracts the variable
access pattern and combination operations; the
Variable Analyzer supplies variable information; the
Code Generator then emits the kernel functions, the
data copy and thread grid configuration, and the host
program, which are built into the executable.)
49
GMAT-DM: Automatic Transformation from MATLAB for
GPUs

Pipeline: MATLAB code → OCTAVE parser → GMAT-DM →
C code → GREENRIDE → CUDA code

  • Transforms MATLAB code for the GPU
  • Converts MATLAB code to C
  • Uses GREENRIDE to convert to CUDA
  • Matrix manipulation
  • Modified metric for matrix multiplication chains
  • Function combination
50
AUTO-GC: Automatic Code Generation for FREERIDE
with GPU Support

Adds support for GPU clusters.

(System diagram: user input (variable information,
reduction functions, optional functions) goes to the
Code Analyzer, which derives the access pattern,
reduction objects, and combination operation; the
Variable Analyzer supplies variable information and
the parallel loop; the Code Generator emits FREERIDE
code for the cluster of CPUs and CUDA code for the
GPU on each node.)
51
Future Work
  • Extend the code generation system for data mining
    applications to more structures
  • Improve the ILP approach for shared memory
    arrangement and apply it to other architectures
  • Include more parameters in the auto-tuning
    framework
  • Extend the loop transformation work to
    heterogeneous structures

52
Conclusion
  • Code generation for data mining applications
  • Translation system for enabling data mining
    applications on GPUs
  • Automatic translation of data mining applications
    from MATLAB to GPUs
  • Automatic code generation for data mining on
    clusters with GPU support
  • Arranging data on shared memory with ILP Solver
  • Code optimization for tensor contractions
  • Auto-tuning approach for tensor contractions on
    GPUs
  • Loop transformation for tensor contraction
    sequences on multi-level memory architecture

53
Thank you!