A Performance Optimization Framework for Compilation of Tensor Contraction Expressions Into Parallel Programs

- Gerald Baumgartner, Ohio State
- David E. Bernholdt, ORNL
- Daniel Cociorva, Ohio State
- Robert Harrison, PNNL
- Chi-Chung Lam, Ohio State
- Marcel Nooijen, Princeton
- J. Ramanujam, Louisiana State
- P. Sadayappan, Ohio State

Application Domain

- Quantum chemistry, condensed matter physics
- Example: simulate a chemical reaction
- Typical program structure:
  - quantum chemistry code
  - while (! converged)
    - tensor contractions
    - quantum chemistry code
- Bulk of the computation is in the tensor contractions

Problem: Tensor Contractions

- Formulas of the form
- As many as 100 arrays and array indices
- Index ranges between 10 and 3000
- Assumptions
  - All arrays are dense
  - No FFTs or other special functions
  - Arrays are either on disk or computed as needed

Realistic Quantum Chemistry Example

hbar[a,b,i,j] ==
    sum[f[b,c]*t[i,j,a,c],{c}]
    -sum[f[k,c]*t[k,b]*t[i,j,a,c],{k,c}]
    +sum[f[a,c]*t[i,j,c,b],{c}]
    -sum[f[k,c]*t[k,a]*t[i,j,c,b],{k,c}]
    -sum[f[k,j]*t[i,k,a,b],{k}]
    -sum[f[k,c]*t[j,c]*t[i,k,a,b],{k,c}]
    -sum[f[k,i]*t[j,k,b,a],{k}]
    -sum[f[k,c]*t[i,c]*t[j,k,b,a],{k,c}]
    +sum[t[i,c]*t[j,d]*v[a,b,c,d],{c,d}]
    +sum[t[i,j,c,d]*v[a,b,c,d],{c,d}]
    +sum[t[j,c]*v[a,b,i,c],{c}]
    -sum[t[k,b]*v[a,k,i,j],{k}]
    +sum[t[i,c]*v[b,a,j,c],{c}]
    -sum[t[k,a]*v[b,k,j,i],{k}]
    -sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d],{k,c,d}]
    -sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d],{k,c,d}]
    -sum[t[j,c]*t[k,b]*v[k,a,c,i],{k,c}]
    +2*sum[t[j,k,b,c]*v[k,a,c,i],{k,c}]
    -sum[t[j,k,c,b]*v[k,a,c,i],{k,c}]
    -sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c],{k,c,d}]
    +2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c],{k,c,d}]
    -sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c],{k,c,d}]
    -sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c],{k,c,d}]
    +2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c],{k,c,d}]
    -sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c],{k,c,d}]
    -sum[t[j,k,b,c]*v[k,a,i,c],{k,c}]
    -sum[t[i,c]*t[k,b]*v[k,a,j,c],{k,c}]
    -sum[t[i,k,c,b]*v[k,a,j,c],{k,c}]
    -sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d],{k,c,d}]
    -sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d],{k,c,d}]
    -sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d],{k,c,d}]
    +2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d],{k,c,d}]
    -sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d],{k,c,d}]
    -sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d],{k,c,d}]
    -sum[t[i,c]*t[k,a]*v[k,b,c,j],{k,c}]
    +2*sum[t[i,k,a,c]*v[k,b,c,j],{k,c}]
    -sum[t[i,k,c,a]*v[k,b,c,j],{k,c}]
    +2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c],{k,c,d}]
    -sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c],{k,c,d}]
    -sum[t[j,c]*t[k,a]*v[k,b,i,c],{k,c}]
    -sum[t[j,k,c,a]*v[k,b,i,c],{k,c}]
    -sum[t[i,k,a,c]*v[k,b,j,c],{k,c}]
    +sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d],{k,l,c,d}]
    -2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d],{k,l,c,d}]
    -2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d],{k,l,c,d}]
    +sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d],{k,l,c,d}]
    -2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d],{k,l,c,d}]
    -2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d],{k,l,c,d}]
    +sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d],{k,l,c,d}]
    -2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d],{k,l,c,d}]
    +sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d],{k,l,c,d}]
    +sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d],{k,l,c,d}]
    +sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d],{k,l,c,d}]
    +4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
    -2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
    -2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d],{k,l,c,d}]
    -2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}]
    +sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}]
    +sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
    +sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
    -2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d],{k,l,c,d}]
    -2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d],{k,l,c,d}]
    +sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i],{k,l,c}]
    +sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i],{k,l,c}]
    -2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i],{k,l,c}]
    +sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i],{k,l,c}]
    -2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i],{k,l,c}]
    +sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i],{k,l,c}]
    +sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i],{k,l,c}]
    +sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i],{k,l,c}]
    +sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j],{k,l,c}]
    +sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j],{k,l,c}]
    -2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j],{k,l,c}]
    +sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j],{k,l,c}]
    +sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j],{k,l,c}]
    +sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c],{k,l,c,d}]
    +sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c],{k,l,c,d}]
    +sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c],{k,l,c,d}]
    -2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c],{k,l,c,d}]
    -2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
    +sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
    +sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c],{k,l,c,d}]
    +sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c],{k,l,c,d}]
    +sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c],{k,l,c,d}]
    +sum[t[k,a]*t[l,b]*v[k,l,i,j],{k,l}]
    +sum[t[k,l,a,b]*v[k,l,i,j],{k,l}]
    +sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d],{k,l,c,d}]
    +sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d],{k,l,c,d}]
    +sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d],{k,l,c,d}]
    -2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d],{k,l,c,d}]
    +sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d],{k,l,c,d}]
    +sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d],{k,l,c,d}]
    +sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d],{k,l,c,d}]
    -2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j],{k,l,c}]
    +sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j],{k,l,c}]
    +sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j],{k,l,c}]
    +v[a,b,i,j]

Operation Minimization

- Requires 4 * N^10 operations if indices a through l have range N
- Using associative, commutative, and distributive laws is acceptable
- Optimal formula sequence requires only 6 * N^6 operations
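The saving can be checked numerically. The sketch below assumes the four-tensor contraction that appears on the loop fusion slide, S[a,b,i,j] = sum over c,d,e,f,k,l of A[a,c,i,k] * B[b,e,f,l] * C[d,f,j,k] * D[c,d,e,l]. It uses NumPy's `einsum` with a tiny range N = 3 (an arbitrary choice so that the direct form finishes quickly) and compares the direct ten-index evaluation with a three-step formula sequence.

```python
import numpy as np

N = 3  # illustrative range; real ranges are far larger
rng = np.random.default_rng(0)
A = rng.random((N, N, N, N))  # A[a,c,i,k]
B = rng.random((N, N, N, N))  # B[b,e,f,l]
C = rng.random((N, N, N, N))  # C[d,f,j,k]
D = rng.random((N, N, N, N))  # D[c,d,e,l]

# Direct evaluation: one loop nest over all ten indices,
# 3 multiplications and 1 addition per iteration.
S_direct = np.einsum("acik,befl,dfjk,cdel->abij", A, B, C, D, optimize=False)
direct_ops = 4 * N**10

# Formula sequence: three binary contractions, each a six-index loop nest
# with 1 multiplication and 1 addition per iteration.
T1 = np.einsum("befl,cdel->bcdf", B, D)
T2 = np.einsum("bcdf,dfjk->bcjk", T1, C)
S_factored = np.einsum("bcjk,acik->abij", T2, A)
factored_ops = 6 * N**6

print("same result:", np.allclose(S_direct, S_factored))
print("direct ops:", direct_ops, "factored ops:", factored_ops)
```

Both forms compute the same tensor; only the grouping of the multiplications and summations differs.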

Loop Fusion for Memory Reduction

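Formula sequence:

    T1[b,c,d,f] = sum over e,l of B[b,e,f,l] * D[c,d,e,l]
    T2[b,c,j,k] = sum over d,f of T1[b,c,d,f] * C[d,f,j,k]
    S[a,b,i,j]  = sum over c,k of T2[b,c,j,k] * A[a,c,i,k]

Unfused code:

    T1 = 0; T2 = 0; S = 0
    for b, c, d, e, f, l
        T1[b,c,d,f] += B[b,e,f,l] * D[c,d,e,l]
    for b, c, d, f, j, k
        T2[b,c,j,k] += T1[b,c,d,f] * C[d,f,j,k]
    for a, b, c, i, j, k
        S[a,b,i,j] += T2[b,c,j,k] * A[a,c,i,k]

Fused code:

    S = 0
    for b, c
        T2f = 0
        for d, f
            T1f = 0
            for e, l
                T1f += B[b,e,f,l] * D[c,d,e,l]
            for j, k
                T2f[j,k] += T1f * C[d,f,j,k]
        for a, i, j, k
            S[a,b,i,j] += T2f[j,k] * A[a,c,i,k]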
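As a minimal sketch of the memory effect, the two loop structures can be written out in Python. The range N = 3 and the random inputs are illustrative assumptions. In the fused version T1 shrinks from a four-dimensional array to a scalar and T2 to a two-dimensional array, and the scalar is reset for every (d, f) pair so that it holds one element of T1 at a time.

```python
import itertools
import numpy as np

N = 3  # illustrative range
R = range(N)
rng = np.random.default_rng(1)
A, B, C, D = (rng.random((N, N, N, N)) for _ in range(4))

def unfused():
    # Intermediates T1 and T2 are full four-dimensional arrays.
    T1 = np.zeros((N, N, N, N))
    T2 = np.zeros((N, N, N, N))
    S = np.zeros((N, N, N, N))
    for b, c, d, e, f, l in itertools.product(R, repeat=6):
        T1[b, c, d, f] += B[b, e, f, l] * D[c, d, e, l]
    for b, c, d, f, j, k in itertools.product(R, repeat=6):
        T2[b, c, j, k] += T1[b, c, d, f] * C[d, f, j, k]
    for a, b, c, i, j, k in itertools.product(R, repeat=6):
        S[a, b, i, j] += T2[b, c, j, k] * A[a, c, i, k]
    return S

def fused():
    # Sharing the b, c (and d, f) loops lets T1 become a scalar
    # and T2 a two-dimensional array indexed by j, k.
    S = np.zeros((N, N, N, N))
    for b, c in itertools.product(R, repeat=2):
        T2f = np.zeros((N, N))
        for d, f in itertools.product(R, repeat=2):
            T1f = 0.0
            for e, l in itertools.product(R, repeat=2):
                T1f += B[b, e, f, l] * D[c, d, e, l]
            for j, k in itertools.product(R, repeat=2):
                T2f[j, k] += T1f * C[d, f, j, k]
        for a, i, j, k in itertools.product(R, repeat=4):
            S[a, b, i, j] += T2f[j, k] * A[a, c, i, k]
    return S

print("same result:", np.allclose(unfused(), fused()))
print("intermediate elements, unfused:", 2 * N**4, "fused:", N**2 + 1)
```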

Tensor Contraction Engine

- Input: tensor contraction expressions
- Operation Minimization, producing a sequence of loop nests (expression tree)
- Memory Minimization
  - If storage requirements exceed limits: Space-Time Trade-Offs
  - If storage requirements are within limits: continue directly
  - Result: imperfectly nested loops
- Communication Minimization, producing imperfectly nested parallel loops
- Data Locality Optimization, producing partitioned, tiled, parallel Fortran loops

Domain-Specific Language

    memory limit = 2GB
    range V = 3000
    range O = 100
    index a, b, c, d, e, f : V
    index i, j, k, l : O
    function N(V,V,V,O)
    function D(V,V,V,O)
    procedure test (in T[V,V,O,O], in S[V,O,V,O], out X[V,V,O,O])
    begin
        X[a,b,i,j] == sum[T[a,c,i,k] * S[d,j,f,k] * N(c,d,e,l) * D(b,e,f,l), {c,d,e,f,k,l}]
    end

Operation Minimization

- For an individual term (no binary addition)
  - Bottom-up dynamic programming algorithm
  - NP-complete, but very effective pruning
- For arbitrary tensor contraction expressions
  - Enumerate all formulas using distributive and associative laws
  - Use pruning search algorithm for individual terms
  - Need domain knowledge for pruning
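For a single product term, the bottom-up idea can be sketched as a dynamic program over subsets of the input tensors. This is an exhaustive version without the pruning mentioned above, and it is an illustration rather than the prototype's algorithm. The function name `min_operations`, the cost of 2 operations per loop iteration, and the uniform range of 10 are assumptions; the sketch also ignores the cost of summing out an index that occurs in only one tensor.

```python
from functools import lru_cache
from math import prod

def min_operations(tensors, output, extent):
    """Cheapest sequence of binary contractions for one product term.

    tensors: dict name -> string of index letters
    output:  string of index letters kept in the result
    extent:  dict index letter -> range size
    """
    index_sets = [frozenset(ix) for ix in tensors.values()]
    n = len(index_sets)

    def live(mask):
        # Indices of the sub-product `mask` that are still needed outside it.
        inside, outside = set(), set(output)
        for t in range(n):
            (inside if mask >> t & 1 else outside).update(index_sets[t])
        return inside & outside

    @lru_cache(maxsize=None)
    def best(mask):
        if mask & (mask - 1) == 0:  # a single input tensor costs nothing
            return 0
        cheapest = None
        left = (mask - 1) & mask
        while left:
            right = mask ^ left
            if left < right:  # visit each unordered split once
                loop_indices = live(left) | live(right)
                ops = 2 * prod(extent[i] for i in loop_indices)  # multiply + add
                total = best(left) + best(right) + ops
                if cheapest is None or total < cheapest:
                    cheapest = total
            left = (left - 1) & mask
        return cheapest

    return best((1 << n) - 1)

term = {"A": "acik", "B": "befl", "C": "dfjk", "D": "cdel"}
extent = {ix: 10 for ix in "abcdefijkl"}
optimal = min_operations(term, "abij", extent)
print(optimal)  # 6 * 10**6 for this term
```

For the four-tensor term used earlier in the deck, the search contracts B with D first, then C, then A, which matches the 6 * N^6 count quoted on the first Operation Minimization slide.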

Memory Minimization

- Search all possible loop fusion structures
- Dynamic programming algorithm
- For each expression tree node
  - Construct set of solutions containing (fusion, memory cost)
  - E.g., (<ij, k, l>, 42)
  - Prune solutions with more constraining fusion, higher cost
- Solution for root is unique
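The pruning rule can be illustrated with a small helper. As a simplifying assumption, a fusion is modeled here as a flat set of loop indices shared with the parent node, rather than the nesting shown in the example above, and "more constraining" is taken to mean "a superset of". The name `prune` and the sample costs are hypothetical.

```python
def prune(solutions):
    """Drop every solution that another one beats on both criteria.

    Each solution is (fused_loops, memory_cost), where fused_loops is a
    frozenset of loop indices the node shares with its parent.
    """
    unique = set(solutions)
    kept = []
    for fusion, cost in unique:
        dominated = any(
            other != (fusion, cost)
            and other[0] <= fusion   # other fuses no more loops (less constraining)
            and other[1] <= cost     # and needs no more memory
            for other in unique
        )
        if not dominated:
            kept.append((fusion, cost))
    return sorted(kept, key=lambda s: s[1])

candidates = [
    (frozenset("ij"), 42),
    (frozenset("i"), 40),
    (frozenset("ijk"), 10),
]
survivors = prune(candidates)
print(survivors)
```

In this example the solution fusing {i, j} at cost 42 is dropped, because fusing only {i} is both less constraining and cheaper. The {i, j, k} solution survives because its lower cost may still pay off higher in the tree.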

Memory Reduction Through Recomputation

    for a, e, c, f
        for i, j
            X[a,e,c,f] += T[i,j,a,e] * T[i,j,c,f]
    for c, e, b, k
        T1[c,e,b,k] = f1(c, e, b, k)
    for a, f, b, k
        T2[a,f,b,k] = f2(a, f, b, k)
    for c, e, a, f
        for b, k
            Y[c,e,a,f] += T1[c,e,b,k] * T2[a,f,b,k]
    for c, e, a, f
        E += X[a,e,c,f] * Y[c,e,a,f]

    array   space     time
    X       V^4       V^4 * O^2
    T1      V^3 * O   Cf1 * V^3 * O
    T2      V^3 * O   Cf2 * V^3 * O
    Y       V^4       V^5 * O
    E       1         V^4

    a .. f: range V = 1000 .. 3000
    i .. k: range O = 30 .. 100

Redundant Computation to Allow Full Fusion

    for a, e, c, f
        for i, j
            X += T[i,j,a,e] * T[i,j,c,f]
        for b, k
            T1 = f1(c, e, b, k)
            T2 = f2(a, f, b, k)
            Y += T1 * T2
        E += X * Y

    array   space   time
    X       1       V^4 * O^2
    T1      1       Cf1 * V^5 * O
    T2      1       Cf2 * V^5 * O
    Y       1       V^5 * O
    E       1       V^4

Tiling for Reducing Recomputation

    for at, et, ct, ft
        for a, e, c, f
            for i, j
                X[a,e,c,f] += T[i,j,a,e] * T[i,j,c,f]
        for b, k
            for c, e
                T1[c,e] = f1(c, e, b, k)
            for a, f
                T2[a,f] = f2(a, f, b, k)
            for c, e, a, f
                Y[c,e,a,f] += T1[c,e] * T2[a,f]
        for c, e, a, f
            E += X[a,e,c,f] * Y[c,e,a,f]

    array   space   time
    X       B^4     V^4 * O^2
    T1      B^2     Cf1 * (V/B)^2 * V^3 * O
    T2      B^2     Cf2 * (V/B)^2 * V^3 * O
    Y       B^4     V^5 * O
    E       1       V^4
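The three loop structures (stored intermediates, full fusion with recomputation, and tiling) can be compared in a small Python sketch that counts calls to f1. The ranges V = 4 and O = 2, the tile size B = 2, and the bodies of `f1` and `f2` are illustrative assumptions; in the slides f1 and f2 stand for expensive functions with costs Cf1 and Cf2. The scalars X and Y are reset for each (a, e, c, f) iteration, which the pseudocode leaves implicit.

```python
import itertools
import math
import numpy as np

V, O, B = 4, 2, 2  # tiny ranges; B is the tile size and must divide V
rng = np.random.default_rng(2)
T = rng.random((O, O, V, V))  # T[i,j,a,e]
calls = {"f1": 0, "f2": 0}

def f1(c, e, b, k):  # stand-in for an expensive function
    calls["f1"] += 1
    return math.cos(c + 2 * e + 3 * b + 5 * k)

def f2(a, f, b, k):  # stand-in for an expensive function
    calls["f2"] += 1
    return math.sin(a + 2 * f + 3 * b + 5 * k)

def stored():
    # Every intermediate is stored in full: no recomputation, most memory.
    rV, rO = range(V), range(O)
    T1 = np.array([[[[f1(c, e, b, k) for k in rO] for b in rV] for e in rV] for c in rV])
    T2 = np.array([[[[f2(a, f, b, k) for k in rO] for b in rV] for f in rV] for a in rV])
    X = np.einsum("ijae,ijcf->aecf", T, T)
    Y = np.einsum("cebk,afbk->ceaf", T1, T2)
    return np.einsum("aecf,ceaf->", X, Y)

def fully_fused():
    # All intermediates are scalars; f1 and f2 are re-evaluated V^2 times more often.
    E = 0.0
    for a, e, c, f in itertools.product(range(V), repeat=4):
        X = sum(T[i, j, a, e] * T[i, j, c, f] for i in range(O) for j in range(O))
        Y = 0.0
        for b, k in itertools.product(range(V), range(O)):
            Y += f1(c, e, b, k) * f2(a, f, b, k)
        E += X * Y
    return E

def tiled():
    # Intermediates are tile-sized; recomputation drops by a factor of B^2.
    E = 0.0
    for at, et, ct, ft in itertools.product(range(0, V, B), repeat=4):
        X = np.einsum("ijae,ijcf->aecf",
                      T[:, :, at:at + B, et:et + B], T[:, :, ct:ct + B, ft:ft + B])
        Y = np.zeros((B, B, B, B))  # Y[c,e,a,f] for this tile
        for b, k in itertools.product(range(V), range(O)):
            T1 = np.array([[f1(ct + c, et + e, b, k) for e in range(B)] for c in range(B)])
            T2 = np.array([[f2(at + a, ft + f, b, k) for f in range(B)] for a in range(B)])
            Y += np.einsum("ce,af->ceaf", T1, T2)
        E += np.einsum("aecf,ceaf->", X, Y)
    return E

def run(version):
    calls["f1"] = calls["f2"] = 0
    return float(version()), calls["f1"]

for version in (stored, fully_fused, tiled):
    energy, f1_calls = run(version)
    print(f"{version.__name__:12s} E = {energy:.6f}  f1 calls = {f1_calls}")
```

The call counts follow the time columns of the three tables: V^3 * O when T1 is stored, V^5 * O under full fusion, and (V/B)^2 * V^3 * O with tiles of width B.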

Space-Time Trade-Off Algorithm

- Improve loop fusion opportunities through recomputation
  - For e1(i,j,k) * e2(i,j), consider recomputation of e2 in the k loop
- For each expression tree node
  - Construct set of solutions containing (fusion, mem cost, rec cost)
  - Prune inferior solutions
- Results in a set of space-time trade-offs for the root of the expression tree
- For all solutions for the root
  - Split recomputation loops into tiling/intra-tile loops
  - Move intra-tile loops inside tiling loops where needed
  - Increase tile sizes to minimize recomputation, use up memory
- Use the solution with minimal recomputation cost
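The last step, growing the tile size until memory is used up, can be sketched for the tiled example. The space column of the tiling table gives B^4 elements each for X and Y, B^2 each for T1 and T2, and 1 for E. The sketch assumes 8-byte array elements and reads the DSL example's 2GB limit as 2 * 2^30 bytes; both are assumptions, as are the function names.

```python
def tile_memory(B):
    # Elements held at once: X and Y (B^4 each), T1 and T2 (B^2 each), E (1).
    return 2 * B**4 + 2 * B**2 + 1

def largest_tile(V, limit_elements):
    """Largest tile size B <= V whose working set fits in memory."""
    best = None
    for B in range(1, V + 1):
        if tile_memory(B) > limit_elements:
            break  # memory grows monotonically with B
        best = B
    return best

def recomputation_factor(V, B):
    # f1 and f2 are evaluated (V/B)^2 times more often than when stored.
    return (V / B) ** 2

V = 3000                                 # range V from the DSL example
limit = 2 * 2**30 // 8                   # 2GB of 8-byte elements (assumed)
B = largest_tile(V, limit)
print("tile size:", B)
print("recomputation factor: %.0f" % recomputation_factor(V, B))
```

A larger memory limit allows a larger B, and the recomputation factor falls quadratically with B, which is why the algorithm spends all available memory on tile size.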

Communication Minimization

- View processors as a multi-dimensional processor grid
- Array dimensions are distributed over processor dimensions
  - Example: distribution of B(a,b,i,j): <a,i,*,1>
- Arrays might have to be redistributed before an operation
- Communication cost = moving cost + startup cost
- For each tree node
  - Enumerate (fusion, distrib., mem. cost, comm. cost) configurations
- Use the solution with the least communication cost that fits in memory
- Communication cost model is empirically derived
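The selection rule can be sketched as follows. Everything concrete here is hypothetical: the `Config` fields, the sample configurations, and the `startup` and `per_word` constants, which stand in for the empirically derived cost model and are not measured values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    fusion: str        # label for the fusion structure
    distribution: str  # label for the array distribution
    memory_words: int  # per-processor memory requirement
    words_moved: int   # data volume communicated
    messages: int      # number of messages sent

def comm_cost(cfg, startup=1e-5, per_word=1e-8):
    # Communication cost = startup cost + moving cost (placeholder constants).
    return cfg.messages * startup + cfg.words_moved * per_word

def choose(configs, memory_limit):
    """Least-communication configuration that fits in memory, or None."""
    feasible = [c for c in configs if c.memory_words <= memory_limit]
    return min(feasible, key=comm_cost) if feasible else None

configs = [
    Config("none", "<a,i>", 4_000_000, 1_000_000, 10),
    Config("b", "<a,b>", 1_000_000, 5_000_000, 200),
    Config("b,c", "<i,j>", 500_000, 2_000_000, 50),
]
selected = choose(configs, memory_limit=2_000_000)
print(selected)
```

The first configuration communicates least but does not fit in memory, so the cheapest of the remaining two is chosen.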

Tiling for Minimizing Memory Access Time

    FOR ii = 1, Ni, Ti
      FOR jj = 1, Nj, Tj
        FOR kk = 1, Nk, Tk
          FOR i = ii, ii + Ti - 1
            FOR j = jj, jj + Tj - 1
              FOR k = kk, kk + Tk - 1
                C(i,k) = C(i,k) + A(i,j) * B(j,k)
    END FOR k, j, i, kk, jj, ii

- Choose Ti, Tj, and Tk such that Ti*Tj + Ti*Tk + Tj*Tk < cache size
- Number of cache misses
  - A(i,j): Ni * Nj
  - B(j,k): Nj * Nk * Ni/Ti
  - C(i,k): Ni * Nk * Nj/Tj
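A direct Python transcription of the tiled loop nest, as a sketch: it uses 0-based indices instead of Fortran's 1-based ones and clamps the last tile so that the ranges need not be multiples of the tile sizes. The matrix shapes and tile sizes are arbitrary illustrative values.

```python
import numpy as np

def tiled_matmul(A, B, Ti, Tj, Tk):
    """C(i,k) = sum over j of A(i,j) * B(j,k), computed tile by tile."""
    Ni, Nj = A.shape
    Nk = B.shape[1]
    C = np.zeros((Ni, Nk))
    for ii in range(0, Ni, Ti):              # tiling loops
        for jj in range(0, Nj, Tj):
            for kk in range(0, Nk, Tk):
                for i in range(ii, min(ii + Ti, Ni)):      # intra-tile loops
                    for j in range(jj, min(jj + Tj, Nj)):
                        for k in range(kk, min(kk + Tk, Nk)):
                            C[i, k] += A[i, j] * B[j, k]
    return C

rng = np.random.default_rng(3)
A = rng.random((6, 5))
B = rng.random((5, 7))
C = tiled_matmul(A, B, Ti=4, Tj=2, Tk=3)
print("matches A @ B:", np.allclose(C, A @ B))
```

Tiling changes only the order in which the iterations run, so the result is the same as the untiled product; the benefit is that each tile's working set can stay in cache.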

Data Locality Optimization

- Problem: loop fusion and tiling conflict
- Solution: start with the fused loop structure
  - Expand scalars and vectors to tile-size width
  - Split loops into tiling loops and intra-tile loops
  - Move all intra-tile loops inside tiling loops
- In a bottom-up traversal over the parse tree, count cache misses
- Iterate over different tile sizes to find the optimal solution
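The tile-size iteration can be sketched for the matrix multiplication example, using the cache-miss counts and the capacity constraint from the tiling slide. The candidate tile sizes (powers of two), the problem size, and the cache size are illustrative assumptions, and the exhaustive search is a simple stand-in for whatever search the prototype uses.

```python
from itertools import product

def cache_misses(Ni, Nj, Nk, Ti, Tj, Tk):
    # Miss counts for A, B, and C under the ii, jj, kk tiling order.
    return Ni * Nj + Nj * Nk * Ni / Ti + Ni * Nk * Nj / Tj

def best_tiles(Ni, Nj, Nk, cache_size, candidates):
    """Exhaustively pick the tile sizes with the fewest modeled misses."""
    best = None
    for Ti, Tj, Tk in product(candidates, repeat=3):
        if Ti * Tj + Ti * Tk + Tj * Tk >= cache_size:
            continue  # the three tiles would not fit in cache together
        misses = cache_misses(Ni, Nj, Nk, Ti, Tj, Tk)
        if best is None or misses < best[0]:
            best = (misses, (Ti, Tj, Tk))
    return best

misses, tiles = best_tiles(64, 64, 64, cache_size=1024,
                           candidates=[1, 2, 4, 8, 16, 32, 64])
print("modeled misses:", misses, "tile sizes:", tiles)
```

Several tile shapes can tie under this model, so the sketch reports the first one found; a fuller cost model would break such ties.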

Conclusions and Status

- Mathematically simple computational domain
- Search algorithms for optimizing computation
- Empirically derived cost models
- Prototypes developed so far
  - Operation minimization for individual terms in C
  - Memory minimization and space-time trade-offs in ML
  - Cache optimization and tiling for space-time trade-offs in C
  - Communication minimization in C
- Currently developing integrated tool in Java
- Future extensions
  - Sparse arrays, symmetry, common subexpressions
  - Cost model for BLAS routines
  - Domain-specific optimizations