A Performance Optimization Framework for Compilation of Tensor Contraction Expressions Into Parallel Programs

1
A Performance Optimization Framework for
Compilation of Tensor Contraction Expressions
Into Parallel Programs
  • Gerald Baumgartner, Ohio State
  • David E. Bernholdt, ORNL
  • Daniel Cociorva, Ohio State
  • Robert Harrison, PNNL
  • Chi-Chung Lam, Ohio State
  • Marcel Nooijen, Princeton
  • J. Ramanujam, Louisiana State
  • P. Sadayappan, Ohio State

2
Application Domain
  • Quantum chemistry, condensed matter physics
  • Example: simulating a chemical reaction
  • Typical program structure:
        quantum chemistry code
        while (!converged)
            tensor contractions
            quantum chemistry code
  • Bulk of the computation is in tensor contractions

3
Problem: Tensor Contractions
  • Formulas of the form [formula not preserved in this transcript]
  • As many as 100 arrays and array indices
  • Index ranges between 10 and 3000
  • Assumptions:
      • All arrays are dense
      • No FFTs or other special functions
      • Arrays are either stored on disk or computed as needed

4
Realistic Quantum Chemistry Example
  • hbar[a,b,i,j] ==
        sum[f[b,c]*t[i,j,a,c],{c}]
      - sum[f[k,c]*t[k,b]*t[i,j,a,c],{k,c}]
      + sum[f[a,c]*t[i,j,c,b],{c}]
      - sum[f[k,c]*t[k,a]*t[i,j,c,b],{k,c}]
      - sum[f[k,j]*t[i,k,a,b],{k}]
      - sum[f[k,c]*t[j,c]*t[i,k,a,b],{k,c}]
      - sum[f[k,i]*t[j,k,b,a],{k}]
      - sum[f[k,c]*t[i,c]*t[j,k,b,a],{k,c}]
      + sum[t[i,c]*t[j,d]*v[a,b,c,d],{c,d}]
      + sum[t[i,j,c,d]*v[a,b,c,d],{c,d}]
      + sum[t[j,c]*v[a,b,i,c],{c}]
      - sum[t[k,b]*v[a,k,i,j],{k}]
      + sum[t[i,c]*v[b,a,j,c],{c}]
      - sum[t[k,a]*v[b,k,j,i],{k}]
      - sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d],{k,c,d}]
      - sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d],{k,c,d}]
      - sum[t[j,c]*t[k,b]*v[k,a,c,i],{k,c}]
      + 2*sum[t[j,k,b,c]*v[k,a,c,i],{k,c}]
      - sum[t[j,k,c,b]*v[k,a,c,i],{k,c}]
      - sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c],{k,c,d}]
      + 2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c],{k,c,d}]
      - sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c],{k,c,d}]
      - sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c],{k,c,d}]
      + 2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c],{k,c,d}]
      - sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c],{k,c,d}]
      - sum[t[j,k,b,c]*v[k,a,i,c],{k,c}]
      - sum[t[i,c]*t[k,b]*v[k,a,j,c],{k,c}]
      - sum[t[i,k,c,b]*v[k,a,j,c],{k,c}]
      - sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d],{k,c,d}]
      - sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d],{k,c,d}]
      - sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d],{k,c,d}]
      + 2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d],{k,c,d}]
      - sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d],{k,c,d}]
      - sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d],{k,c,d}]
      - sum[t[i,c]*t[k,a]*v[k,b,c,j],{k,c}]
      + 2*sum[t[i,k,a,c]*v[k,b,c,j],{k,c}]
      - sum[t[i,k,c,a]*v[k,b,c,j],{k,c}]
      + 2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c],{k,c,d}]
      - sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c],{k,c,d}]
      - sum[t[j,c]*t[k,a]*v[k,b,i,c],{k,c}]
      - sum[t[j,k,c,a]*v[k,b,i,c],{k,c}]
      - sum[t[i,k,a,c]*v[k,b,j,c],{k,c}]
      + sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d],{k,l,c,d}]
      - 2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d],{k,l,c,d}]
      - 2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d],{k,l,c,d}]
      + sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d],{k,l,c,d}]
      - 2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d],{k,l,c,d}]
      - 2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d],{k,l,c,d}]
      + sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d],{k,l,c,d}]
      - 2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d],{k,l,c,d}]
      + sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d],{k,l,c,d}]
      + sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d],{k,l,c,d}]
      + sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d],{k,l,c,d}]
      + 4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
      - 2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
      - 2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d],{k,l,c,d}]
      - 2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}]
      + sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}]
      + sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
      + sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
      - 2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d],{k,l,c,d}]
      - 2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d],{k,l,c,d}]
      + sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i],{k,l,c}]
      + sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i],{k,l,c}]
      - 2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i],{k,l,c}]
      + sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i],{k,l,c}]
      - 2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i],{k,l,c}]
      + sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i],{k,l,c}]
      + sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i],{k,l,c}]
      + sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i],{k,l,c}]
      + sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j],{k,l,c}]
      + sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j],{k,l,c}]
      - 2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j],{k,l,c}]
      + sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j],{k,l,c}]
      + sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j],{k,l,c}]
      + sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c],{k,l,c,d}]
      + sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c],{k,l,c,d}]
      + sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c],{k,l,c,d}]
      - 2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c],{k,l,c,d}]
      - 2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
      + sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
      + sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c],{k,l,c,d}]
      + sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c],{k,l,c,d}]
      + sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c],{k,l,c,d}]
      + sum[t[k,a]*t[l,b]*v[k,l,i,j],{k,l}]
      + sum[t[k,l,a,b]*v[k,l,i,j],{k,l}]
      + sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d],{k,l,c,d}]
      + sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d],{k,l,c,d}]
      + sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d],{k,l,c,d}]
      - 2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d],{k,l,c,d}]
      + sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d],{k,l,c,d}]
      + sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d],{k,l,c,d}]
      + sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d],{k,l,c,d}]
      - 2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j],{k,l,c}]
      + sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j],{k,l,c}]
      + sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j],{k,l,c}]
      + v[a,b,i,j]

5
Operation Minimization
  • Requires 4 × N^10 operations if indices a through l have range N
  • Using the associative, commutative, and distributive laws is acceptable
  • The optimal formula sequence requires only 6 × N^6 operations
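
The two operation counts can be reproduced by simple loop counting. The sketch below assumes the formula in question is the four-tensor contraction S[a,b,i,j] = sum over c,d,e,f,k,l of A[a,c,i,k] * B[b,e,f,l] * C[d,f,j,k] * D[c,d,e,l] that appears on the loop-fusion slide; the formula is not preserved on this slide, so that pairing is an assumption. The function names are hypothetical.

```python
def direct_ops(n):
    """One 10-deep loop nest over a..l: 3 multiplications + 1 addition per iteration."""
    return 4 * n ** 10


def factored_ops(n):
    """Three binary contractions, each a 6-deep loop nest with 1 multiply + 1 add."""
    steps = [
        ("T1[b,c,d,f] += B[b,e,f,l] * D[c,d,e,l]", 6),
        ("T2[b,c,j,k] += T1[b,c,d,f] * C[d,f,j,k]", 6),
        ("S[a,b,i,j]  += T2[b,c,j,k] * A[a,c,i,k]", 6),
    ]
    return sum(2 * n ** depth for _statement, depth in steps)


n = 10
print("direct:  ", direct_ops(n))
print("factored:", factored_ops(n))
```

The savings come entirely from reordering the multiplications so that each summation index is eliminated as early as possible.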

6
Loop Fusion for Memory Reduction
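Formula sequence:
    T1[b,c,d,f] = sum over e,l of B[b,e,f,l] * D[c,d,e,l]
    T2[b,c,j,k] = sum over d,f of T1[b,c,d,f] * C[d,f,j,k]
    S[a,b,i,j]  = sum over c,k of T2[b,c,j,k] * A[a,c,i,k]

Unfused code:
    T1 = 0; T2 = 0; S = 0
    for b, c, d, e, f, l
        T1[b,c,d,f] += B[b,e,f,l] * D[c,d,e,l]
    for b, c, d, f, j, k
        T2[b,c,j,k] += T1[b,c,d,f] * C[d,f,j,k]
    for a, b, c, i, j, k
        S[a,b,i,j] += T2[b,c,j,k] * A[a,c,i,k]

Fused code (T1 shrinks to the scalar T1f, T2 to the two-dimensional array T2f):
    S = 0
    for b, c
        T2f = 0
        for d, f
            T1f = 0
            for e, l
                T1f += B[b,e,f,l] * D[c,d,e,l]
            for j, k
                T2f[j,k] += T1f * C[d,f,j,k]
        for a, i, j, k
            S[a,b,i,j] += T2f[j,k] * A[a,c,i,k]
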
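
As a sanity check on the loop structures above, the following sketch runs both versions on small random integer tensors and compares results and temporary storage. It is plain Python with dictionaries standing in for arrays, the index range N = 3 is an arbitrary choice, and the fused version resets the scalar T1f for every (d, f) pair so that it holds exactly one element of T1 at a time.

```python
import itertools
import random

N = 3                      # tiny index range so the six-deep loops finish quickly
R = range(N)


def rand4(seed):
    """Hypothetical input: a dense 4-index tensor of small random integers."""
    rng = random.Random(seed)
    return {idx: rng.randint(-3, 3) for idx in itertools.product(R, repeat=4)}


A, B, C, D = (rand4(s) for s in range(4))


def unfused():
    T1 = dict.fromkeys(itertools.product(R, repeat=4), 0)
    T2 = dict.fromkeys(itertools.product(R, repeat=4), 0)
    S = dict.fromkeys(itertools.product(R, repeat=4), 0)
    for b, c, d, e, f, l in itertools.product(R, repeat=6):
        T1[b, c, d, f] += B[b, e, f, l] * D[c, d, e, l]
    for b, c, d, f, j, k in itertools.product(R, repeat=6):
        T2[b, c, j, k] += T1[b, c, d, f] * C[d, f, j, k]
    for a, b, c, i, j, k in itertools.product(R, repeat=6):
        S[a, b, i, j] += T2[b, c, j, k] * A[a, c, i, k]
    return S, len(T1) + len(T2)          # result, temporary elements


def fused():
    S = dict.fromkeys(itertools.product(R, repeat=4), 0)
    for b, c in itertools.product(R, repeat=2):
        T2f = dict.fromkeys(itertools.product(R, repeat=2), 0)   # N^2 elements
        for d, f in itertools.product(R, repeat=2):
            T1f = 0                                               # one scalar
            for e, l in itertools.product(R, repeat=2):
                T1f += B[b, e, f, l] * D[c, d, e, l]
            for j, k in itertools.product(R, repeat=2):
                T2f[j, k] += T1f * C[d, f, j, k]
        for a, i, j, k in itertools.product(R, repeat=4):
            S[a, b, i, j] += T2f[j, k] * A[a, c, i, k]
    return S, 1 + len(T2f)               # result, temporary elements


s_unfused, mem_unfused = unfused()
s_fused, mem_fused = fused()
print(s_unfused == s_fused, mem_unfused, mem_fused)
```

Both versions perform the same arithmetic; only the lifetime of the intermediates changes, which is why fusion reduces memory without adding operations.
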
7
Tensor Contraction Engine
Tensor contraction expressions
  → Operation Minimization
Sequence of loop nests (expression tree)
  → Memory Minimization
      • Storage requirements exceed limits → Space-Time Trade-Offs
      • Storage requirements within limits → Space-Time Trade-Offs skipped
Imperfectly nested loops
  → Communication Minimization
Imperfectly nested parallel loops
  → Data Locality Optimization
Partitioned, tiled, parallel Fortran loops

8
Domain-Specific Language
  • memory limit 2GB
  • range V = 3000
  • range O = 100
  • index a, b, c, d, e, f : V
  • index i, j, k, l : O
  • function N(V,V,V,O)
  • function D(V,V,V,O)
  • procedure test (in T[V,V,O,O], in S[V,O,V,O], out X[V,V,O,O])
  • begin
  •   X[a,b,i,j] == sum[T[a,c,i,k] * S[d,j,f,k] * N(c,d,e,l) * D(b,e,f,l), {c,d,e,f,k,l}]
  • end

9
Operation Minimization
  • For an individual term (no binary addition):
      • Bottom-up dynamic programming algorithm
      • NP-complete, but pruning is very effective
  • For arbitrary tensor contraction expressions:
      • Enumerate all formulas using the distributive and associative laws
      • Use the pruning search algorithm for individual terms
      • Need domain knowledge for pruning
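
For the single-term case, a bottom-up search over subsets of the input tensors can be sketched as follows. This is not the authors' implementation: it is an exhaustive dynamic program without any pruning, it charges 2 operations (one multiply, one add) per loop iteration, and it assumes every summation index occurs in at least two tensors so that single tensors cost nothing. The function name and data layout are hypothetical.

```python
from math import prod


def min_operation_order(tensors, output, ranges):
    """tensors: name -> tuple of index names; output: indices of the result;
    ranges: index name -> extent. Returns (operation count, parenthesized order)."""
    names = list(tensors)
    n = len(names)
    full = (1 << n) - 1

    def indices(mask):
        return set().union(*(tensors[names[i]] for i in range(n) if (mask >> i) & 1))

    def result_indices(mask):
        # An index survives a partial product if it is an output index
        # or is still needed by some tensor outside the subset.
        return indices(mask) & (set(output) | indices(full & ~mask))

    best = {1 << i: (0, names[i]) for i in range(n)}
    for mask in range(1, full + 1):          # submasks are numerically smaller
        if mask in best:
            continue
        candidates = []
        sub = (mask - 1) & mask
        while sub:
            other = mask ^ sub
            if sub < other:                  # visit each unordered split once
                loops = result_indices(sub) | result_indices(other)
                step = 2 * prod(ranges[x] for x in loops)
                cost = best[sub][0] + best[other][0] + step
                candidates.append((cost, f"({best[sub][1]}*{best[other][1]})"))
            sub = (sub - 1) & mask
        best[mask] = min(candidates)
    return best[full]


example = {
    "A": ("a", "c", "i", "k"),
    "B": ("b", "e", "f", "l"),
    "C": ("d", "f", "j", "k"),
    "D": ("c", "d", "e", "l"),
}
extent = {x: 10 for x in "abcdefijkl"}
cost, order = min_operation_order(example, {"a", "b", "i", "j"}, extent)
print(cost, order)
```

The table has one entry per subset of tensors, so the search is exponential in the number of factors; the pruning mentioned on the slide is what would keep a real implementation practical.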

10
Memory Minimization
  • Search all possible loop fusion structures
  • Dynamic programming algorithm
  • For each expression tree node:
      • Construct a set of solutions, each a pair (fusion, memory cost)
      • E.g., (<ij, k, l>, 42)
      • Prune solutions with a more constraining fusion and a higher cost
  • Solution for the root is unique

11
Memory Reduction Through Recomputation
for a, e, c, f
    for i, j
        X[a,e,c,f] += T[i,j,a,e] * T[i,j,c,f]
for c, e, b, k
    T1[c,e,b,k] = f1(c, e, b, k)
for a, f, b, k
    T2[a,f,b,k] = f2(a, f, b, k)
for c, e, a, f
    for b, k
        Y[c,e,a,f] += T1[c,e,b,k] * T2[a,f,b,k]
for c, e, a, f
    E += X[a,e,c,f] * Y[c,e,a,f]

array   space    time
X       V^4      V^4 O^2
T1      V^3 O    Cf1 × V^3 O
T2      V^3 O    Cf2 × V^3 O
Y       V^4      V^5 O
E       1        V^4

a .. f: range V = 1000 .. 3000
i .. k: range O = 30 .. 100

12
Redundant Computation to Allow Full Fusion
for a, e, c, f
    for i, j
        X += T[i,j,a,e] * T[i,j,c,f]
    for b, k
        T1 = f1(c, e, b, k)
        T2 = f2(a, f, b, k)
        Y += T1 * T2
    E += X * Y

array   space    time
X       1        V^4 O^2
T1      1        Cf1 × V^5 O
T2      1        Cf2 × V^5 O
Y       1        V^5 O
E       1        V^4

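
The effect of full fusion can be checked on a toy instance. In this sketch the input tensor T and the functions f1 and f2 are made-up integer stand-ins (the slides do not define them), V and O are tiny, and a counter records how often f1 is evaluated. The scalars X and Y are reset for each (a, e, c, f) iteration, which the slide leaves implicit.

```python
from itertools import product

V, O = 3, 2                      # tiny stand-ins for the ranges V and O
VR, OR = range(V), range(O)

# Hypothetical input data: any deterministic integers will do.
T = {(i, j, a, e): (i + 2 * j + 3 * a + 5 * e) % 7 - 3
     for i, j in product(OR, repeat=2) for a, e in product(VR, repeat=2)}

calls = {"f1": 0}


def f1(c, e, b, k):
    calls["f1"] += 1             # count evaluations to expose recomputation
    return (c + 2 * e + b * k) % 5 - 2


def f2(a, f, b, k):
    return (a * f + b + 3 * k) % 4 - 1


def unfused():
    X = {(a, e, c, f): sum(T[i, j, a, e] * T[i, j, c, f] for i, j in product(OR, repeat=2))
         for a, e, c, f in product(VR, repeat=4)}
    T1 = {(c, e, b, k): f1(c, e, b, k) for c, e, b in product(VR, repeat=3) for k in OR}
    T2 = {(a, f, b, k): f2(a, f, b, k) for a, f, b in product(VR, repeat=3) for k in OR}
    Y = {(c, e, a, f): sum(T1[c, e, b, k] * T2[a, f, b, k] for b in VR for k in OR)
         for c, e, a, f in product(VR, repeat=4)}
    return sum(X[a, e, c, f] * Y[c, e, a, f] for a, e, c, f in product(VR, repeat=4))


def fully_fused():
    E = 0
    for a, e, c, f in product(VR, repeat=4):
        X = sum(T[i, j, a, e] * T[i, j, c, f] for i, j in product(OR, repeat=2))
        Y = 0
        for b in VR:
            for k in OR:
                Y += f1(c, e, b, k) * f2(a, f, b, k)   # recomputed every time
        E += X * Y
    return E


calls["f1"] = 0
e_unfused = unfused()
unfused_calls = calls["f1"]

calls["f1"] = 0
e_fused = fully_fused()
fused_calls = calls["f1"]

print(e_unfused == e_fused, unfused_calls, fused_calls)
```

The call counts grow from V^3 O to V^5 O, matching the time column for T1 in the two tables, while every intermediate shrinks to a scalar.
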
13
Tiling for Reducing Recomputation
for at, et, ct, ft
    for a, e, c, f
        for i, j
            X[a,e,c,f] += T[i,j,a,e] * T[i,j,c,f]
    for b, k
        for c, e
            T1[c,e] = f1(c, e, b, k)
        for a, f
            T2[a,f] = f2(a, f, b, k)
        for c, e, a, f
            Y[c,e,a,f] += T1[c,e] * T2[a,f]
    for c, e, a, f
        E += X[a,e,c,f] * Y[c,e,a,f]

array   space    time
X       B^4      V^4 O^2
T1      B^2      Cf1 × (V/B)^2 V^3 O
T2      B^2      Cf2 × (V/B)^2 V^3 O
Y       B^4      V^5 O
E       1        V^4

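
The three space/time tables (unfused, fully fused, tiled with tile size B) can be compared numerically. The sketch below simply evaluates the table entries; the values chosen for Cf1 and Cf2 (the cost of one evaluation of f1 and f2) are illustrative assumptions, and B is assumed to divide V.

```python
def unfused_cost(V, O, cf1, cf2):
    space = V**4 + V**3 * O + V**3 * O + V**4 + 1
    time = V**4 * O**2 + cf1 * V**3 * O + cf2 * V**3 * O + V**5 * O + V**4
    return space, time


def fused_cost(V, O, cf1, cf2):
    space = 1 + 1 + 1 + 1 + 1                     # X, T1, T2, Y, E are scalars
    time = V**4 * O**2 + cf1 * V**5 * O + cf2 * V**5 * O + V**5 * O + V**4
    return space, time


def tiled_cost(V, O, cf1, cf2, B):
    assert V % B == 0, "sketch assumes the tile size divides V"
    tiles = V // B
    space = B**4 + B**2 + B**2 + B**4 + 1
    time = (V**4 * O**2 + cf1 * tiles**2 * V**3 * O
            + cf2 * tiles**2 * V**3 * O + V**5 * O + V**4)
    return space, time


V, O = 1000, 30
cf1 = cf2 = 1000                                  # assumed cost per f1/f2 call
for B in (1, 10, 100, 1000):
    space, time = tiled_cost(V, O, cf1, cf2, B)
    print(f"B={B:5d}  space={space:.3e}  time={time:.3e}")
```

With B = 1 the tiled version degenerates to full fusion, and with B = V its running time matches the unfused code, so the tile size interpolates between the two extremes.
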
14
Space-Time Trade-Off Algorithm
  • Improve loop fusion opportunities through recomputation
      • For e1(i,j,k) * e2(i,j), consider recomputation of e2 in the k loop
  • For each expression tree node:
      • Construct a set of solutions, each a triple (fusion, memory cost, recomputation cost)
      • Prune inferior solutions
  • Results in a set of space-time trade-offs for the root of the expression tree
  • For all solutions for the root:
      • Split recomputation loops into tiling loops and intra-tile loops
      • Move intra-tile loops inside tiling loops where needed
      • Increase tile sizes to minimize recomputation while using up the available memory
  • Use the solution with minimal recomputation cost
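
The pruning and final selection steps can be illustrated with a small dominance filter. This sketch deliberately drops the fusion component and compares only (memory cost, recomputation cost) pairs, which amounts to assuming all candidates share the same fusion configuration; the function names and the sample numbers are hypothetical.

```python
def prune_inferior(solutions):
    """Keep only (memory, recomputation) pairs that no other pair beats or ties
    in both components."""
    kept = []
    for s in solutions:
        dominated = any(o != s and o[0] <= s[0] and o[1] <= s[1] for o in solutions)
        if not dominated and s not in kept:
            kept.append(s)
    return sorted(kept)


def pick_solution(solutions, memory_limit):
    """Among the surviving trade-offs that fit in memory, take the one with the
    least recomputation; return None if nothing fits."""
    feasible = [s for s in prune_inferior(solutions) if s[0] <= memory_limit]
    return min(feasible, key=lambda s: s[1]) if feasible else None


candidates = [(100, 0), (10, 50), (1, 900), (20, 60), (100, 5)]
frontier = prune_inferior(candidates)
print(frontier)
print(pick_solution(candidates, memory_limit=50))
```

What survives pruning is the trade-off curve for the node: each remaining entry uses less memory than its neighbor at the price of more recomputation.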

15
Communication Minimization
  • View processors as a multi-dimensional processor grid
  • Array dimensions are distributed over processor dimensions
      • Example: distribution of B(a,b,i,j) as <a,i,*,1>
  • Arrays might have to be redistributed before an operation
  • Communication cost = moving cost + startup cost
  • For each tree node:
      • Enumerate (fusion, distribution, memory cost, communication cost) configurations
  • Use the solution with the least communication cost that fits in memory
  • Communication cost model is empirically derived
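
The selection rule on this slide (least communication cost among the configurations that fit in memory) can be sketched as below. The slides state that the real cost model is empirically derived; the linear model used here (a fixed startup cost per message plus a per-word moving cost) is only a placeholder assumption, as are the class, field names, and sample numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str            # label for a (fusion, distribution) choice
    memory_words: int    # per-processor memory the configuration needs
    messages: int        # number of messages it sends
    words_moved: int     # total words it transfers


def comm_cost(cfg, startup, per_word):
    """Assumed model: communication cost = startup cost + moving cost."""
    return cfg.messages * startup + cfg.words_moved * per_word


def best_config(configs, memory_limit, startup, per_word):
    """Least communication cost among configurations that fit in memory."""
    feasible = [c for c in configs if c.memory_words <= memory_limit]
    if not feasible:
        return None
    return min(feasible, key=lambda c: comm_cost(c, startup, per_word))


configs = [
    Config("replicate-everything", memory_words=10_000, messages=0, words_moved=0),
    Config("block-distributed", memory_words=2_500, messages=8, words_moved=4_000),
    Config("fully-fused", memory_words=400, messages=64, words_moved=9_000),
]
choice = best_config(configs, memory_limit=3_000, startup=100.0, per_word=1.0)
print(choice.name if choice else "nothing fits")
```

In the full algorithm this choice is made per expression-tree node and the costs of redistributing arrays between nodes enter the total; the sketch shows only the final filter-and-minimize step.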

16
Tiling for Minimizing Memory Access Time
FOR ii = 1, Ni, Ti
  FOR jj = 1, Nj, Tj
    FOR kk = 1, Nk, Tk
      FOR i = ii, ii + Ti - 1
        FOR j = jj, jj + Tj - 1
          FOR k = kk, kk + Tk - 1
            C(i,k) = C(i,k) + A(i,j) * B(j,k)
END FOR k, j, i, kk, jj, ii
  • Choose Ti, Tj, and Tk such that Ti*Tj + Ti*Tk + Tj*Tk < cache size
  • Number of cache misses:
      • A(i,j): Ni * Nj
      • B(j,k): Nj * Nk * Ni/Ti
      • C(i,k): Ni * Nk * Nj/Tj
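
A direct Python transcription of the tiled loop nest, checked against an untiled multiplication, might look like this. It uses 0-based indices, clamps the last tile with min() so tile sizes need not divide the matrix dimensions, and evaluates the miss estimates from the bullets above as plain arithmetic (no cache is simulated). The function names are hypothetical.

```python
def matmul_naive(A, B, ni, nj, nk):
    C = [[0] * nk for _ in range(ni)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                C[i][k] += A[i][j] * B[j][k]
    return C


def matmul_tiled(A, B, ni, nj, nk, ti, tj, tk):
    C = [[0] * nk for _ in range(ni)]
    for ii in range(0, ni, ti):                      # tiling loops
        for jj in range(0, nj, tj):
            for kk in range(0, nk, tk):
                for i in range(ii, min(ii + ti, ni)):        # intra-tile loops
                    for j in range(jj, min(jj + tj, nj)):
                        for k in range(kk, min(kk + tk, nk)):
                            C[i][k] += A[i][j] * B[j][k]
    return C


def estimated_misses(ni, nj, nk, ti, tj):
    """Miss counts from the slide, assuming all three tiles fit in cache."""
    return {
        "A": ni * nj,                 # each A tile is loaded once
        "B": nj * nk * ni // ti,      # B is re-read for every row of tiles
        "C": ni * nk * nj // tj,      # C is re-read for every jj step
    }


ni, nj, nk = 6, 5, 7
A = [[(3 * i + j) % 5 for j in range(nj)] for i in range(ni)]
B = [[(i - 2 * j) % 4 for j in range(nk)] for i in range(nj)]
same = matmul_tiled(A, B, ni, nj, nk, 4, 2, 3) == matmul_naive(A, B, ni, nj, nk)
print(same, estimated_misses(8, 8, 8, 4, 4))
```

Tiling changes only the order in which the iterations run, so the result is identical; the payoff is that B and C are re-read once per tile rather than once per row.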

17
Data Locality Optimization
  • Problem: loop fusion and tiling conflict
  • Solution: start with the fused loop structure
      • Expand scalars and vectors to tile-size width
      • Split loops into tiling loops and intra-tile loops
      • Move all intra-tile loops inside tiling loops
  • In a bottom-up traversal over the parse tree, count cache misses
  • Iterate over different tile sizes to find the optimal solution

18
Conclusions and Status
  • Mathematically simple computational domain
  • Search algorithms for optimizing computation
  • Empirically derived cost models
  • Prototypes developed so far:
      • Operation minimization for individual terms, in C
      • Memory minimization and space-time trade-offs, in ML
      • Cache optimization and tiling for space-time trade-offs, in C
      • Communication minimization, in C
  • Currently developing an integrated tool in Java
  • Future extensions:
      • Sparse arrays, symmetry, common subexpressions
      • Cost model for BLAS routines
      • Domain-specific optimizations