Title: Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations
1 Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations
- Daniel Cociorva, Ohio State
- Gerald Baumgartner, Ohio State
- Chi-Chung Lam, Ohio State
- P. Sadayappan, Ohio State
- J. Ramanujam, Louisiana State
- Marcel Nooijen, Princeton
- David E. Bernholdt, ORNL
- Robert Harrison, PNNL
2 Realistic Quantum Chemistry Example
hbar[a,b,i,j] ==
    sum[f[b,c]*t[i,j,a,c],{c}]
  - sum[f[k,c]*t[k,b]*t[i,j,a,c],{k,c}]
  + sum[f[a,c]*t[i,j,c,b],{c}]
  - sum[f[k,c]*t[k,a]*t[i,j,c,b],{k,c}]
  - sum[f[k,j]*t[i,k,a,b],{k}]
  - sum[f[k,c]*t[j,c]*t[i,k,a,b],{k,c}]
  - sum[f[k,i]*t[j,k,b,a],{k}]
  - sum[f[k,c]*t[i,c]*t[j,k,b,a],{k,c}]
  + sum[t[i,c]*t[j,d]*v[a,b,c,d],{c,d}]
  + sum[t[i,j,c,d]*v[a,b,c,d],{c,d}]
  + sum[t[j,c]*v[a,b,i,c],{c}]
  - sum[t[k,b]*v[a,k,i,j],{k}]
  + sum[t[i,c]*v[b,a,j,c],{c}]
  - sum[t[k,a]*v[b,k,j,i],{k}]
  - sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d],{k,c,d}]
  - sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d],{k,c,d}]
  - sum[t[j,c]*t[k,b]*v[k,a,c,i],{k,c}]
  + 2*sum[t[j,k,b,c]*v[k,a,c,i],{k,c}]
  - sum[t[j,k,c,b]*v[k,a,c,i],{k,c}]
  - sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c],{k,c,d}]
  + 2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c],{k,c,d}]
  - sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c],{k,c,d}]
  - sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c],{k,c,d}]
  + 2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c],{k,c,d}]
  - sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c],{k,c,d}]
  - sum[t[j,k,b,c]*v[k,a,i,c],{k,c}]
  - sum[t[i,c]*t[k,b]*v[k,a,j,c],{k,c}]
  - sum[t[i,k,c,b]*v[k,a,j,c],{k,c}]
  - sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d],{k,c,d}]
  - sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d],{k,c,d}]
  - sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d],{k,c,d}]
  + 2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d],{k,c,d}]
  - sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d],{k,c,d}]
  - sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d],{k,c,d}]
  - sum[t[i,c]*t[k,a]*v[k,b,c,j],{k,c}]
  + 2*sum[t[i,k,a,c]*v[k,b,c,j],{k,c}]
  - sum[t[i,k,c,a]*v[k,b,c,j],{k,c}]
  + 2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c],{k,c,d}]
  - sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c],{k,c,d}]
  - sum[t[j,c]*t[k,a]*v[k,b,i,c],{k,c}]
  - sum[t[j,k,c,a]*v[k,b,i,c],{k,c}]
  - sum[t[i,k,a,c]*v[k,b,j,c],{k,c}]
  + sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d],{k,l,c,d}]
  + 4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i],{k,l,c}]
  + sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i],{k,l,c}]
  - 2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i],{k,l,c}]
  + sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i],{k,l,c}]
  - 2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i],{k,l,c}]
  + sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i],{k,l,c}]
  + sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i],{k,l,c}]
  + sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i],{k,l,c}]
  + sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j],{k,l,c}]
  + sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j],{k,l,c}]
  - 2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j],{k,l,c}]
  + sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j],{k,l,c}]
  + sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j],{k,l,c}]
  + sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c],{k,l,c,d}]
  - 2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c],{k,l,c,d}]
  - 2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[k,a]*t[l,b]*v[k,l,i,j],{k,l}]
  + sum[t[k,l,a,b]*v[k,l,i,j],{k,l}]
  + sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d],{k,l,c,d}]
  + sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d],{k,l,c,d}]
  + sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d],{k,l,c,d}]
  - 2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d],{k,l,c,d}]
  + sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d],{k,l,c,d}]
  + sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d],{k,l,c,d}]
  + sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d],{k,l,c,d}]
  - 2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j],{k,l,c}]
  + sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j],{k,l,c}]
  + sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j],{k,l,c}]
  + v[a,b,i,j]
3 Problem: Tensor Contractions
- Formulas of the form
- Tens of arrays and array indices, hundreds of terms
- Index ranges between 10 and 3000
- And this is still a simple model
4 Application Domain
- Quantum chemistry, condensed matter physics
- Example: study of chemical properties
- Typical program structure
  - quantum chemistry code
  - while (! converged)
    - tensor contractions
    - quantum chemistry code
- Bulk of computation in tensor contractions
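5 Operation Minimization
- S[a,b,i,j] = sum over c, d, e, f, k, l of A[a,c,i,k] * B[b,e,f,l] * C[d,f,j,k] * D[c,d,e,l]
- Requires 4 N^10 operations if indices a .. l have range N
- Using associative, commutative, and distributive laws is acceptable
- Optimal formula sequence requires only 6 N^6 operations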
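The two operation counts can be checked on a toy problem. The sketch below assumes the contraction is the one used in the loop-fusion slide (arrays A, B, C, D with result S); the helper names `make`, `direct`, and `factored` and the integer test data are illustrative assumptions, not part of the original work.

```python
from itertools import product

def make(N, seed):
    """Hypothetical test data: a 4-index array stored as a dict of small integers."""
    return {idx: (seed * (1 + sum((p + 1) * x for p, x in enumerate(idx)))) % 7 - 3
            for idx in product(range(N), repeat=4)}

def direct(A, B, C, D, N):
    """Evaluate the 10-index sum as written: 3 multiplies + 1 add per iteration."""
    S, ops = {}, 0
    for a, b, i, j in product(range(N), repeat=4):
        total = 0
        for c, d, e, f, k, l in product(range(N), repeat=6):
            total += A[a, c, i, k] * B[b, e, f, l] * C[d, f, j, k] * D[c, d, e, l]
            ops += 4
        S[a, b, i, j] = total
    return S, ops

def factored(A, B, C, D, N):
    """Evaluate through two intermediates: 1 multiply + 1 add per iteration."""
    T1, T2, S, ops = {}, {}, {}, 0
    for b, c, d, f in product(range(N), repeat=4):
        T1[b, c, d, f] = sum(B[b, e, f, l] * D[c, d, e, l]
                             for e, l in product(range(N), repeat=2))
        ops += 2 * N * N
    for b, c, j, k in product(range(N), repeat=4):
        T2[b, c, j, k] = sum(T1[b, c, d, f] * C[d, f, j, k]
                             for d, f in product(range(N), repeat=2))
        ops += 2 * N * N
    for a, b, i, j in product(range(N), repeat=4):
        S[a, b, i, j] = sum(T2[b, c, j, k] * A[a, c, i, k]
                            for c, k in product(range(N), repeat=2))
        ops += 2 * N * N
    return S, ops

N = 2  # tiny range so the 10-deep loop nest stays cheap
A, B, C, D = (make(N, s) for s in (1, 2, 3, 5))
S_direct, ops_direct = direct(A, B, C, D, N)
S_fact, ops_fact = factored(A, B, C, D, N)
print(S_direct == S_fact, ops_direct, ops_fact)
```

With N = 2 the direct form performs 4 * 2^10 operations and the factored form 6 * 2^6, and both produce the same S.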
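6 Loop Fusion for Memory Reduction
Formula sequence
  T1[b,c,d,f] = sum over e, l of B[b,e,f,l] * D[c,d,e,l]
  T2[b,c,j,k] = sum over d, f of T1[b,c,d,f] * C[d,f,j,k]
  S[a,b,i,j]  = sum over c, k of T2[b,c,j,k] * A[a,c,i,k]
Unfused code
  T1 = 0; T2 = 0; S = 0
  for b, c, d, e, f, l
    T1[b,c,d,f] += B[b,e,f,l] * D[c,d,e,l]
  for b, c, d, f, j, k
    T2[b,c,j,k] += T1[b,c,d,f] * C[d,f,j,k]
  for a, b, c, i, j, k
    S[a,b,i,j] += T2[b,c,j,k] * A[a,c,i,k]
Fused code
  S = 0
  for b, c
    T2f = 0
    for d, f
      T1f = 0
      for e, l
        T1f += B[b,e,f,l] * D[c,d,e,l]
      for j, k
        T2f[j,k] += T1f * C[d,f,j,k]
    for a, i, j, k
      S[a,b,i,j] += T2f[j,k] * A[a,c,i,k]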
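The memory effect of fusion can be seen by running both loop structures on a toy problem. This is a minimal sketch, assuming dictionary-backed arrays and hypothetical integer test data (`make`); it resets the scalar T1f for every (d, f) pair so that it holds exactly one element of T1.

```python
from itertools import product

def make(N, seed):
    """Hypothetical test data: a 4-index array stored as a dict of small integers."""
    return {idx: (seed * (1 + sum((p + 1) * x for p, x in enumerate(idx)))) % 7 - 3
            for idx in product(range(N), repeat=4)}

def unfused(A, B, C, D, N):
    R = range(N)
    T1 = dict.fromkeys(product(R, repeat=4), 0)  # N^4 temporary
    T2 = dict.fromkeys(product(R, repeat=4), 0)  # N^4 temporary
    S = dict.fromkeys(product(R, repeat=4), 0)
    for b, c, d, e, f, l in product(R, repeat=6):
        T1[b, c, d, f] += B[b, e, f, l] * D[c, d, e, l]
    for b, c, d, f, j, k in product(R, repeat=6):
        T2[b, c, j, k] += T1[b, c, d, f] * C[d, f, j, k]
    for a, b, c, i, j, k in product(R, repeat=6):
        S[a, b, i, j] += T2[b, c, j, k] * A[a, c, i, k]
    return S, len(T1) + len(T2)

def fused(A, B, C, D, N):
    R = range(N)
    S = dict.fromkeys(product(R, repeat=4), 0)
    for b, c in product(R, repeat=2):
        T2f = dict.fromkeys(product(R, repeat=2), 0)  # N^2 temporary
        for d, f in product(R, repeat=2):
            T1f = 0                                   # scalar temporary
            for e, l in product(R, repeat=2):
                T1f += B[b, e, f, l] * D[c, d, e, l]
            for j, k in product(R, repeat=2):
                T2f[j, k] += T1f * C[d, f, j, k]
        for a, i, j, k in product(R, repeat=4):
            S[a, b, i, j] += T2f[j, k] * A[a, c, i, k]
    return S, 1 + len(T2f)

N = 2
A, B, C, D = (make(N, s) for s in (1, 2, 3, 5))
S_unfused, temp_unfused = unfused(A, B, C, D, N)
S_fused, temp_fused = fused(A, B, C, D, N)
print(S_unfused == S_fused, temp_unfused, temp_fused)
```

The temporaries shrink from 2 N^4 elements to N^2 + 1 while the operation count stays the same.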
7 Tensor Contraction Engine
- Input: tensor contraction expressions
- Operation Minimization -> sequence of loop nests (expression tree)
- Memory Minimization -> imperfectly nested loops
  - Space-Time Trade-Offs applied if storage requirements exceed limits
  - skipped if storage requirements are within limits
- Communication Minimization -> imperfectly nested parallel loops
- Data Locality Optimization -> partitioned, tiled, parallel Fortran loops
8 Example to Illustrate Space-Time Trade-Offs
  for a, e, c, f
    for i, j
      X[a,e,c,f] += T[i,j,a,e] * T[i,j,c,f]
  for c, e, b, k
    T1[c,e,b,k] = f1(c, e, b, k)
  for a, f, b, k
    T2[a,f,b,k] = f2(a, f, b, k)
  for c, e, a, f
    for b, k
      Y[c,e,a,f] += T1[c,e,b,k] * T2[a,f,b,k]
  for c, e, a, f
    E += X[a,e,c,f] * Y[c,e,a,f]

  array   space    time
  X       V^4      V^4 O^2
  T1      V^3 O    Cf1 V^3 O
  T2      V^3 O    Cf2 V^3 O
  Y       V^4      V^5 O
  E       1        V^4

a .. f: range V = 1000 .. 3000
i .. k: range O = 30 .. 100
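To get a feel for the magnitudes in the table, the entries can be evaluated for concrete ranges. This is a minimal sketch; the function name `unfused_costs` and the per-call costs chosen for Cf1 and Cf2 are assumptions for illustration only.

```python
def unfused_costs(V, O, Cf1, Cf2):
    """Space (elements) and time (operations) per array for the unfused loops."""
    return {
        "X":  (V**4,     V**4 * O**2),
        "T1": (V**3 * O, Cf1 * V**3 * O),
        "T2": (V**3 * O, Cf2 * V**3 * O),
        "Y":  (V**4,     V**5 * O),
        "E":  (1,        V**4),
    }

V, O = 3000, 100       # upper ends of the ranges on the slide
Cf1 = Cf2 = 1000       # hypothetical cost of one f1 / f2 evaluation
costs = unfused_costs(V, O, Cf1, Cf2)
total_space = sum(space for space, _ in costs.values())

for name, (space, time) in costs.items():
    print(f"{name:>2}  space={space:.2e}  time={time:.2e}")
print(f"total space = {total_space:.2e} elements")
```

At V = 3000, X and Y alone need 8.1e13 elements each, which is why the unfused form is not practical at these sizes.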
9 Memory-Minimal Form
  for a, f, b, k
    T2[a,f,b,k] = f2(a, f, b, k)
  for c, e
    for b, k
      T1[b,k] = f1(c, e, b, k)
    for a, f
      for i, j
        X += T[i,j,a,e] * T[i,j,c,f]
      for b, k
        Y += T1[b,k] * T2[a,f,b,k]
      E += X * Y

  array   space    time
  X       1        V^4 O^2
  T1      V O      Cf1 V^3 O
  T2      V^3 O    Cf2 V^3 O
  Y       1        V^5 O
  E       1        V^4

a .. f: range V = 3000
i .. k: range O = 100
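The memory-minimal loop structure can be compared against the unfused one on a toy problem. In this sketch the input array `T` and the functions `f1` and `f2` are hypothetical integer-valued stand-ins, and the ranges are tiny so the loops run instantly.

```python
from itertools import product

V, O = 3, 2  # tiny hypothetical ranges

def T(i, j, a, e):   # hypothetical input array T[i,j,a,e]
    return (i + 2 * j + 3 * a + 5 * e) % 4 - 1

def f1(c, e, b, k):  # hypothetical stand-in for the slide's f1
    return (c + 2 * e + 3 * b + k) % 5 - 2

def f2(a, f, b, k):  # hypothetical stand-in for the slide's f2
    return (2 * a + f + b + 3 * k) % 3 - 1

def unfused():
    rv, ro = range(V), range(O)
    X = {(a, e, c, f): sum(T(i, j, a, e) * T(i, j, c, f) for i, j in product(ro, ro))
         for a, e, c, f in product(rv, repeat=4)}
    T1 = {(c, e, b, k): f1(c, e, b, k) for c, e, b in product(rv, repeat=3) for k in ro}
    T2 = {(a, f, b, k): f2(a, f, b, k) for a, f, b in product(rv, repeat=3) for k in ro}
    Y = {(c, e, a, f): sum(T1[c, e, b, k] * T2[a, f, b, k] for b, k in product(rv, ro))
         for c, e, a, f in product(rv, repeat=4)}
    E = sum(X[a, e, c, f] * Y[c, e, a, f] for c, e, a, f in product(rv, repeat=4))
    return E, len(X) + len(T1) + len(T2) + len(Y)

def memory_minimal():
    rv, ro = range(V), range(O)
    T2 = {(a, f, b, k): f2(a, f, b, k) for a, f, b in product(rv, repeat=3) for k in ro}
    E = 0
    for c, e in product(rv, rv):
        T1 = {(b, k): f1(c, e, b, k) for b, k in product(rv, ro)}   # V*O elements
        for a, f in product(rv, rv):
            # X and Y are scalars, rebuilt from zero for each (a, f)
            X = sum(T(i, j, a, e) * T(i, j, c, f) for i, j in product(ro, ro))
            Y = sum(T1[b, k] * T2[a, f, b, k] for b, k in product(rv, ro))
            E += X * Y
    return E, len(T2) + len(T1) + 2

E_unfused, space_unfused = unfused()
E_minimal, space_minimal = memory_minimal()
print(E_unfused == E_minimal, space_unfused, space_minimal)
```

Both versions evaluate f1 and f2 the same number of times; only the storage for X, T1, and Y shrinks, while T2 stays at V^3 O.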
10 Redundant Computation to Allow Full Fusion
  for a, e, c, f
    for i, j
      X += T[i,j,a,e] * T[i,j,c,f]
    for b, k
      T1 = f1(c, e, b, k)
      T2 = f2(a, f, b, k)
      Y += T1 * T2
    E += X * Y

  array   space    time
  X       1        V^4 O^2
  T1      1        Cf1 V^5 O
  T2      1        Cf2 V^5 O
  Y       1        V^5 O
  E       1        V^4
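The price of full fusion is visible when the calls to f1 and f2 are counted. The sketch below uses hypothetical integer-valued stand-ins for `T`, `f1`, and `f2` and a `calls` counter that is not part of the original example.

```python
from itertools import product

V, O = 3, 2  # tiny hypothetical ranges
calls = {"f1": 0, "f2": 0}

def T(i, j, a, e):   # hypothetical input array T[i,j,a,e]
    return (i + 2 * j + 3 * a + 5 * e) % 4 - 1

def f1(c, e, b, k):  # hypothetical stand-in for f1, instrumented to count calls
    calls["f1"] += 1
    return (c + 2 * e + 3 * b + k) % 5 - 2

def f2(a, f, b, k):  # hypothetical stand-in for f2, instrumented to count calls
    calls["f2"] += 1
    return (2 * a + f + b + 3 * k) % 3 - 1

def fully_fused():
    rv, ro = range(V), range(O)
    E = 0
    for a, e, c, f in product(rv, repeat=4):
        X = 0
        for i, j in product(ro, ro):
            X += T(i, j, a, e) * T(i, j, c, f)
        Y = 0
        for b, k in product(rv, ro):
            T1 = f1(c, e, b, k)   # recomputed for every (a, f)
            T2 = f2(a, f, b, k)   # recomputed for every (c, e)
            Y += T1 * T2
        E += X * Y
    return E

E = fully_fused()
print(E, calls["f1"], calls["f2"], V**3 * O)
```

Every temporary is now a scalar, but each of f1 and f2 is evaluated V^5 O times instead of V^3 O times.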
11 Tiling for Reducing Recomputation
  for at, et, ct, ft
    for a, e, c, f
      for i, j
        X[a,e,c,f] += T[i,j,a,e] * T[i,j,c,f]
    for b, k
      for c, e
        T1[c,e] = f1(c, e, b, k)
      for a, f
        T2[a,f] = f2(a, f, b, k)
      for c, e, a, f
        Y[c,e,a,f] += T1[c,e] * T2[a,f]
    for c, e, a, f
      E += X[a,e,c,f] * Y[c,e,a,f]

  array   space    time
  X       B^4      V^4 O^2
  T1      B^2      Cf1 (V/B)^2 V^3 O
  T2      B^2      Cf2 (V/B)^2 V^3 O
  Y       B^4      V^5 O
  E       1        V^4
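The effect of the block size B can be checked by running the tiled loops for several values of B and counting f1 calls. This is a sketch under stated assumptions: `T`, `f1`, and `f2` are hypothetical integer-valued stand-ins, B is assumed to divide V, and the intra-tile loops a, e, c, f run over one block of size B each.

```python
from itertools import product

V, O = 4, 2  # tiny hypothetical ranges
calls = {"f1": 0, "f2": 0}

def T(i, j, a, e):   # hypothetical input array T[i,j,a,e]
    return (i + 2 * j + 3 * a + 5 * e) % 4 - 1

def f1(c, e, b, k):  # hypothetical stand-in for f1, instrumented to count calls
    calls["f1"] += 1
    return (c + 2 * e + 3 * b + k) % 5 - 2

def f2(a, f, b, k):  # hypothetical stand-in for f2, instrumented to count calls
    calls["f2"] += 1
    return (2 * a + f + b + 3 * k) % 3 - 1

def tiled(B):
    """Tiled evaluation with block size B (assumed to divide V)."""
    rv, ro = range(V), range(O)
    E = 0
    for at, et, ct, ft in product(range(0, V, B), repeat=4):
        As, Es, Cs, Fs = (range(t, t + B) for t in (at, et, ct, ft))
        X = {(a, e, c, f): sum(T(i, j, a, e) * T(i, j, c, f) for i, j in product(ro, ro))
             for a, e, c, f in product(As, Es, Cs, Fs)}            # B^4 elements
        Y = dict.fromkeys(product(Cs, Es, As, Fs), 0)               # B^4 elements
        for b, k in product(rv, ro):
            T1 = {(c, e): f1(c, e, b, k) for c, e in product(Cs, Es)}   # B^2 elements
            T2 = {(a, f): f2(a, f, b, k) for a, f in product(As, Fs)}   # B^2 elements
            for c, e, a, f in Y:
                Y[c, e, a, f] += T1[c, e] * T2[a, f]
        E += sum(X[a, e, c, f] * Y[c, e, a, f] for c, e, a, f in Y)
    return E

results = {}
for B in (1, 2, 4):
    calls["f1"] = calls["f2"] = 0
    results[B] = (tiled(B), calls["f1"], 2 * B**4 + 2 * B**2)

for B, (E, n_f1, space) in results.items():
    print(f"B={B}  E={E}  f1 calls={n_f1}  temp space={space}")
```

B = 1 reproduces the fully fused form and B = V removes the extra recomputation; values in between trade one for the other.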
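12 The Fusion Graph
[Figure: fusion graph over an expression tree with leaves A(i,j) and B(j,k); vertices are the loop indices i, j, k of each node.]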
13 Making Fused Loops Explicit
i: range 10; j: range 10; k: range 100
[Figure: the same fusion graph, leaves A(i,j) and B(j,k), with the fused loops drawn explicitly.]
14 Adding Recomputation Loops
i: range 10; j: range 10; k: range 100
[Figure: the fusion graph, leaves A(i,j) and B(j,k), with recomputation loops added, annotated with the solution set (<i j, k>, 3, 9000), . . .]
15 Tiling Recomputation Loops
i: range 10; j: range 10; k: range 100
[Figure: the fusion graph after tiling, leaves A(it,i,j) and B(j,k), with tiling loop it added next to i, j, k.]
16 Space-Time Trade-Off Algorithm
- For e1(i,j,k) e2(i,j) consider recomputation of e2 in k loop
- For each expression tree node in bottom-up traversal
  - Construct set of solutions (fusion, mem cost, rec cost), . . .
  - Prune inferior solutions
- For all solutions for the root of the expression tree
  - Split recomputation loops into tiling/intra-tile loops
  - Move intra-tile loops inside tiling loops where needed
  - Search for tile sizes that minimize recomputation cost
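Two of the steps above, pruning inferior solutions and searching for a tile size, can be sketched in a few lines. This is not the authors' implementation: `prune` treats a solution as inferior when another one is at least as good in both memory and recomputation cost, and `best_tile` borrows the cost model of the tiling example (temporaries of 2 B^4 + 2 B^2 elements, recomputation proportional to (V/B)^2 V^3 O). The solution labels other than `<i j,k>` are made up for illustration.

```python
def prune(solutions):
    """Keep only (fusion, mem, rec) triples not dominated in both mem and rec."""
    kept = []
    for fusion, mem, rec in sorted(solutions, key=lambda s: (s[1], s[2])):
        # After sorting by memory, a solution survives only if it lowers rec.
        if not kept or rec < kept[-1][2]:
            kept.append((fusion, mem, rec))
    return kept

def best_tile(V, O, mlimit):
    """Exhaustively pick the block size B (a divisor of V) that fits in mlimit
    and minimizes the assumed recomputation cost (V/B)^2 * V^3 * O."""
    best = None
    for B in range(1, V + 1):
        if V % B:
            continue
        space = 2 * B**4 + 2 * B**2        # X, Y tiles plus T1, T2 tiles
        rec = (V // B)**2 * V**3 * O       # f1 (or f2) evaluations
        if space <= mlimit and (best is None or rec < best[1]):
            best = (B, rec)
    return best

solutions = [
    ("<i j,k>", 3, 9000),          # the solution shown on the fusion-graph slide
    ("hypothetical-1", 100, 0),
    ("hypothetical-2", 100, 500),  # dominated by hypothetical-1
    ("hypothetical-3", 1000, 0),   # dominated by hypothetical-1
]
front = prune(solutions)
tile = best_tile(3000, 100, 10**9)
print(front)
print(tile)
```

Pruning keeps the candidate sets small as the traversal moves up the expression tree, since only solutions on the memory/recomputation frontier can contribute to an optimal result at the root.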
17 Demo
  mlimit = 144000010
  range V = 300
  range O = 70
  range U = 40
  index a, b, c, d : V
  index e, f : O
  index i, j, k, l : U
  procedure P (in T[V,V,U,U], in S[V,U,O,U], in N[V,V,O,U], in D[V,O,O,U],
               out X[V,V,O,O])
  begin
    X[a,b,i,j] == sum[T[a,c,i,k] * S[d,j,f,k] * N[c,d,e,l] * D[b,e,f,l], {c,d,e,f,k,l}]
  end
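The declared sizes give a sense of how tight the memory limit is. The sketch below only multiplies out the ranges from the demo input; the intermediate it examines (contracting N with D over e and l) is a hypothetical choice, not necessarily the one the tool would pick.

```python
from math import prod

mlimit = 144000010
ranges = {"V": 300, "O": 70, "U": 40}
# Array shapes as declared in procedure P
arrays = {"T": "VVUU", "S": "VUOU", "N": "VVOU", "D": "VOOU", "X": "VVOO"}

def size(shape):
    """Number of elements of an array whose dimensions have the given ranges."""
    return prod(ranges[r] for r in shape)

sizes = {name: size(shape) for name, shape in arrays.items()}

# Hypothetical intermediate: sum over e, l of N[c,d,e,l] * D[b,e,f,l]
# leaves indices b, c, d (range V) and f (range O).
intermediate = size("VVVO")

for name, n in sizes.items():
    print(f"{name}: {n:>13,d} elements  fits={n <= mlimit}")
print(f"intermediate: {intermediate:,d} elements  fits={intermediate <= mlimit}")
```

The largest input, T, has 144,000,000 elements, just under mlimit, while an unfused intermediate of this shape would exceed it by more than a factor of ten, so fusion or recomputation is required.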
18 Conclusions and Status
- Computational domain with exploitable structure
- Search algorithms for optimizing computation
- Developing compiler in Java
- Future extensions
  - Sparse arrays, symmetry, common subexpressions
  - Domain-specific optimizations