Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations

- Daniel Cociorva, Ohio State
- Gerald Baumgartner, Ohio State
- Chi-Chung Lam, Ohio State
- P. Sadayappan, Ohio State
- J. Ramanujam, Louisiana State
- Marcel Nooijen, Princeton
- David E. Bernholdt, ORNL
- Robert Harrison, PNNL

Realistic Quantum Chemistry Example

hbar[a,b,i,j] =
    sum[f[b,c]*t[i,j,a,c], {c}]
  - sum[f[k,c]*t[k,b]*t[i,j,a,c], {k,c}]
  + sum[f[a,c]*t[i,j,c,b], {c}]
  - sum[f[k,c]*t[k,a]*t[i,j,c,b], {k,c}]
  - sum[f[k,j]*t[i,k,a,b], {k}]
  - sum[f[k,c]*t[j,c]*t[i,k,a,b], {k,c}]
  - sum[f[k,i]*t[j,k,b,a], {k}]
  - sum[f[k,c]*t[i,c]*t[j,k,b,a], {k,c}]
  + sum[t[i,c]*t[j,d]*v[a,b,c,d], {c,d}]
  + sum[t[i,j,c,d]*v[a,b,c,d], {c,d}]
  + sum[t[j,c]*v[a,b,i,c], {c}]
  - sum[t[k,b]*v[a,k,i,j], {k}]
  + sum[t[i,c]*v[b,a,j,c], {c}]
  - sum[t[k,a]*v[b,k,j,i], {k}]
  - sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d], {k,c,d}]
  - sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d], {k,c,d}]
  - sum[t[j,c]*t[k,b]*v[k,a,c,i], {k,c}]
  + 2*sum[t[j,k,b,c]*v[k,a,c,i], {k,c}]
  - sum[t[j,k,c,b]*v[k,a,c,i], {k,c}]
  - sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c], {k,c,d}]
  + 2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c], {k,c,d}]
  - sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c], {k,c,d}]
  - sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c], {k,c,d}]
  + 2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c], {k,c,d}]
  - sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c], {k,c,d}]
  - sum[t[j,k,b,c]*v[k,a,i,c], {k,c}]
  - sum[t[i,c]*t[k,b]*v[k,a,j,c], {k,c}]
  - sum[t[i,k,c,b]*v[k,a,j,c], {k,c}]
  - sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d], {k,c,d}]
  - sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d], {k,c,d}]
  - sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d], {k,c,d}]
  + 2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d], {k,c,d}]
  - sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d], {k,c,d}]
  - sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d], {k,c,d}]
  - sum[t[i,c]*t[k,a]*v[k,b,c,j], {k,c}]
  + 2*sum[t[i,k,a,c]*v[k,b,c,j], {k,c}]
  - sum[t[i,k,c,a]*v[k,b,c,j], {k,c}]
  + 2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c], {k,c,d}]
  - sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c], {k,c,d}]
  - sum[t[j,c]*t[k,a]*v[k,b,i,c], {k,c}]
  - sum[t[j,k,c,a]*v[k,b,i,c], {k,c}]
  - sum[t[i,k,a,c]*v[k,b,j,c], {k,c}]
  + sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d], {k,l,c,d}]
  - 2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d], {k,l,c,d}]
  - 2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d], {k,l,c,d}]
  + sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d], {k,l,c,d}]
  - 2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d], {k,l,c,d}]
  - 2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d], {k,l,c,d}]
  + sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d], {k,l,c,d}]
  - 2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d], {k,l,c,d}]
  + sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d], {k,l,c,d}]
  + sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d], {k,l,c,d}]
  + sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d], {k,l,c,d}]
  + 4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d], {k,l,c,d}]
  - 2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d], {k,l,c,d}]
  - 2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d], {k,l,c,d}]
  - 2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d], {k,l,c,d}]
  + sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d], {k,l,c,d}]
  + sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d], {k,l,c,d}]
  + sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d], {k,l,c,d}]
  - 2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d], {k,l,c,d}]
  - 2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d], {k,l,c,d}]
  + sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i], {k,l,c}]
  + sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i], {k,l,c}]
  - 2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i], {k,l,c}]
  + sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i], {k,l,c}]
  - 2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i], {k,l,c}]
  + sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i], {k,l,c}]
  + sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i], {k,l,c}]
  + sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i], {k,l,c}]
  + sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j], {k,l,c}]
  + sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j], {k,l,c}]
  - 2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j], {k,l,c}]
  + sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j], {k,l,c}]
  + sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j], {k,l,c}]
  + sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c], {k,l,c,d}]
  + sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c], {k,l,c,d}]
  + sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c], {k,l,c,d}]
  - 2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c], {k,l,c,d}]
  - 2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c], {k,l,c,d}]
  + sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c], {k,l,c,d}]
  + sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c], {k,l,c,d}]
  + sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c], {k,l,c,d}]
  + sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c], {k,l,c,d}]
  + sum[t[k,a]*t[l,b]*v[k,l,i,j], {k,l}]
  + sum[t[k,l,a,b]*v[k,l,i,j], {k,l}]
  + sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d], {k,l,c,d}]
  + sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d], {k,l,c,d}]
  + sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d], {k,l,c,d}]
  - 2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d], {k,l,c,d}]
  + sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d], {k,l,c,d}]
  + sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d], {k,l,c,d}]
  + sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d], {k,l,c,d}]
  - 2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j], {k,l,c}]
  + sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j], {k,l,c}]
  + sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j], {k,l,c}]
  + v[a,b,i,j]

Problem: Tensor Contractions

- Formulas of the form
- Tens of arrays and array indices, hundreds of terms
- Index ranges between 10 and 3000
- And this is still a simple model

Application Domain

- Quantum chemistry, condensed matter physics
- Example: study chemical properties
- Typical program structure:
    quantum chemistry code
    while (!converged)
      tensor contractions
    quantum chemistry code
- Bulk of computation in tensor contractions

Operation Minimization

- Requires 4 N^10 operations if indices a .. l have range N
- Using associative, commutative, and distributive laws is acceptable
- Optimal formula sequence requires only 6 N^6 operations
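The operation counts above can be checked on a toy problem. The formula itself did not survive on this slide, so the sketch below assumes the contraction implied by the formula sequence on the loop-fusion slide: S[a,b,i,j] = sum over c,d,e,f,k,l of A[a,c,i,k] * B[b,e,f,l] * C[d,f,j,k] * D[c,d,e,l]. The helper names (`rand4`, `direct`, `factored`) and the tiny range N = 2 are illustrative choices, not part of the original talk.

```python
import itertools
import random

N = 2  # tiny index range so the N^10 loop nest finishes quickly
R = range(N)

def rand4(seed):
    """Dense 4-index array of small integers, stored as a dict keyed by index tuples."""
    rng = random.Random(seed)
    return {idx: rng.randint(-3, 3) for idx in itertools.product(R, repeat=4)}

A, B, C, D = (rand4(s) for s in range(4))

def zeros4():
    return dict.fromkeys(itertools.product(R, repeat=4), 0)

def direct():
    """One 10-deep loop nest: 3 multiplications + 1 addition per iteration."""
    S, ops = zeros4(), 0
    for a, b, c, d, e, f, i, j, k, l in itertools.product(R, repeat=10):
        S[a, b, i, j] += A[a, c, i, k] * B[b, e, f, l] * C[d, f, j, k] * D[c, d, e, l]
        ops += 4
    return S, ops

def factored():
    """Three 6-deep loop nests: 1 multiplication + 1 addition per iteration."""
    T1, T2, S, ops = zeros4(), zeros4(), zeros4(), 0
    for b, c, d, e, f, l in itertools.product(R, repeat=6):
        T1[b, c, d, f] += B[b, e, f, l] * D[c, d, e, l]
        ops += 2
    for b, c, d, f, j, k in itertools.product(R, repeat=6):
        T2[b, c, j, k] += T1[b, c, d, f] * C[d, f, j, k]
        ops += 2
    for a, b, c, i, j, k in itertools.product(R, repeat=6):
        S[a, b, i, j] += T2[b, c, j, k] * A[a, c, i, k]
        ops += 2
    return S, ops

S_direct, ops_direct = direct()
S_factored, ops_factored = factored()
print(ops_direct, ops_factored)
```

Both versions produce the same S; only the number of arithmetic operations differs, which is the point of the operation-minimization step.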

Loop Fusion for Memory Reduction

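Formula sequence

  T1[b,c,d,f] = sum over e,l of B[b,e,f,l] * D[c,d,e,l]
  T2[b,c,j,k] = sum over d,f of T1[b,c,d,f] * C[d,f,j,k]
  S[a,b,i,j]  = sum over c,k of T2[b,c,j,k] * A[a,c,i,k]

Unfused code

  T1 = 0; T2 = 0; S = 0
  for b, c, d, e, f, l
    T1[b,c,d,f] += B[b,e,f,l] * D[c,d,e,l]
  for b, c, d, f, j, k
    T2[b,c,j,k] += T1[b,c,d,f] * C[d,f,j,k]
  for a, b, c, i, j, k
    S[a,b,i,j] += T2[b,c,j,k] * A[a,c,i,k]

Fused code (T1 is reduced to a scalar T1f and T2 to a 2-D array T2f; T1f must be reset for every (d, f) pair)

  S = 0
  for b, c
    T2f = 0
    for d, f
      T1f = 0
      for e, l
        T1f += B[b,e,f,l] * D[c,d,e,l]
      for j, k
        T2f[j,k] += T1f * C[d,f,j,k]
    for a, i, j, k
      S[a,b,i,j] += T2f[j,k] * A[a,c,i,k]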
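To make the memory reduction concrete, here is a small sketch of the unfused and fused loop structures for the contraction on this slide, run on toy data. The range N = 3, the dict-based arrays, and the function names are assumptions made for illustration; the temporary T1 shrinks to a scalar (reset for each (d, f) pair) and T2 to an N x N array.

```python
import itertools
import random

N = 3
R = range(N)
idx4 = list(itertools.product(R, repeat=4))
rng = random.Random(0)
# Four dense 4-index input arrays with small integer entries.
A, B, C, D = ({t: rng.randint(-3, 3) for t in idx4} for _ in range(4))

def unfused():
    """Three separate loop nests; T1 and T2 are full N^4 temporaries."""
    T1 = dict.fromkeys(idx4, 0)
    T2 = dict.fromkeys(idx4, 0)
    S = dict.fromkeys(idx4, 0)
    for b, c, d, e, f, l in itertools.product(R, repeat=6):
        T1[b, c, d, f] += B[b, e, f, l] * D[c, d, e, l]
    for b, c, d, f, j, k in itertools.product(R, repeat=6):
        T2[b, c, j, k] += T1[b, c, d, f] * C[d, f, j, k]
    for a, b, c, i, j, k in itertools.product(R, repeat=6):
        S[a, b, i, j] += T2[b, c, j, k] * A[a, c, i, k]
    return S, len(T1) + len(T2)          # temporary storage in elements

def fused():
    """Loops b, c fused across all three nests; d, f fused across the first two."""
    S = dict.fromkeys(idx4, 0)
    for b, c in itertools.product(R, repeat=2):
        T2f = [[0] * N for _ in R]       # N^2 temporary, reused for each (b, c)
        for d, f in itertools.product(R, repeat=2):
            T1f = 0                      # scalar temporary, reset for each (d, f)
            for e, l in itertools.product(R, repeat=2):
                T1f += B[b, e, f, l] * D[c, d, e, l]
            for j, k in itertools.product(R, repeat=2):
                T2f[j][k] += T1f * C[d, f, j, k]
        for a, i, j, k in itertools.product(R, repeat=4):
            S[a, b, i, j] += T2f[j][k] * A[a, c, i, k]
    return S, N * N + 1

S_unfused, mem_unfused = unfused()
S_fused, mem_fused = fused()
print(mem_unfused, mem_fused)
```

The operation count is unchanged by fusion here; only the size of the intermediates drops, from 2 N^4 elements to N^2 + 1.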

Tensor Contraction Engine

- Input: tensor contraction expressions
- Operation Minimization, producing a sequence of loop nests (expression tree)
- Memory Minimization
  - Storage requirements exceed limits: apply Space-Time Trade-Offs
  - Storage requirements within limits: proceed directly
- Result: imperfectly nested loops
- Communication Minimization, producing imperfectly nested parallel loops
- Data Locality Optimization, producing partitioned, tiled, parallel Fortran loops

Example to Illustrate Space-Time Trade-Offs

  for a, e, c, f
    for i, j
      X[a,e,c,f] += T[i,j,a,e] * T[i,j,c,f]
  for c, e, b, k
    T1[c,e,b,k] = f1(c, e, b, k)
  for a, f, b, k
    T2[a,f,b,k] = f2(a, f, b, k)
  for c, e, a, f
    for b, k
      Y[c,e,a,f] += T1[c,e,b,k] * T2[a,f,b,k]
  for c, e, a, f
    E += X[a,e,c,f] * Y[c,e,a,f]

  array   space    time
  X       V^4      V^4 O^2
  T1      V^3 O    C_f1 V^3 O
  T2      V^3 O    C_f2 V^3 O
  Y       V^4      V^5 O
  E       1        V^4

  a .. f: range V = 1000 .. 3000
  i .. k: range O = 30 .. 100

Memory-Minimal Form

  for a, f, b, k
    T2[a,f,b,k] = f2(a, f, b, k)
  for c, e
    for b, k
      T1[b,k] = f1(c, e, b, k)
    for a, f
      for i, j
        X += T[i,j,a,e] * T[i,j,c,f]
      for b, k
        Y += T1[b,k] * T2[a,f,b,k]
      E += X * Y

  array   space    time
  X       1        V^4 O^2
  T1      V O      C_f1 V^3 O
  T2      V^3 O    C_f2 V^3 O
  Y       1        V^5 O
  E       1        V^4

  a .. f: range V = 3000
  i .. k: range O = 100

Redundant Computation to Allow Full Fusion

  for a, e, c, f
    for i, j
      X += T[i,j,a,e] * T[i,j,c,f]
    for b, k
      T1 = f1(c, e, b, k)
      T2 = f2(a, f, b, k)
      Y += T1 * T2
    E += X * Y

  array   space    time
  X       1        V^4 O^2
  T1      1        C_f1 V^5 O
  T2      1        C_f2 V^5 O
  Y       1        V^5 O
  E       1        V^4
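The three loop structures shown so far (unfused, memory-minimal, fully fused) can be compared on a toy instance. In the sketch below, `f1`, `f2`, and the input `T` are made-up integer functions standing in for the expensive integral evaluations, and V = 3, O = 2 are placeholder ranges. The scalars X and Y are reset for every (a, e, c, f) iteration, which the slide pseudocode leaves implicit. Counters record how often `f1` and `f2` are evaluated.

```python
import itertools

V, O = 3, 2                      # tiny stand-ins for V = 3000, O = 100
RV, RO = range(V), range(O)
calls = {"f1": 0, "f2": 0}       # how often each function is evaluated

def f1(c, e, b, k):              # hypothetical stand-in for an expensive function
    calls["f1"] += 1
    return (c + 2 * e + 3 * b + k) % 5 - 2

def f2(a, f, b, k):              # hypothetical stand-in for an expensive function
    calls["f2"] += 1
    return (2 * a + f + b + 3 * k) % 7 - 3

def T(i, j, a, e):               # input array, modeled as a pure function
    return (i + 2 * j + a * e) % 4 - 1

def x_elem(a, e, c, f):
    return sum(T(i, j, a, e) * T(i, j, c, f) for i in RO for j in RO)

def unfused():
    quad = list(itertools.product(RV, repeat=4))
    X = {(a, e, c, f): x_elem(a, e, c, f) for a, e, c, f in quad}
    T1 = {(c, e, b, k): f1(c, e, b, k) for c in RV for e in RV for b in RV for k in RO}
    T2 = {(a, f, b, k): f2(a, f, b, k) for a in RV for f in RV for b in RV for k in RO}
    Y = {(c, e, a, f): sum(T1[c, e, b, k] * T2[a, f, b, k] for b in RV for k in RO)
         for c, e, a, f in quad}
    return sum(X[a, e, c, f] * Y[c, e, a, f] for a, e, c, f in quad)

def memory_minimal():
    T2 = {(a, f, b, k): f2(a, f, b, k) for a in RV for f in RV for b in RV for k in RO}
    E = 0
    for c, e in itertools.product(RV, repeat=2):
        T1 = {(b, k): f1(c, e, b, k) for b in RV for k in RO}   # V*O elements
        for a, f in itertools.product(RV, repeat=2):
            X = x_elem(a, e, c, f)                              # scalar
            Y = sum(T1[b, k] * T2[a, f, b, k] for b in RV for k in RO)
            E += X * Y
    return E

def fully_fused():
    E = 0
    for a, e, c, f in itertools.product(RV, repeat=4):
        X = x_elem(a, e, c, f)
        Y = 0
        for b, k in itertools.product(RV, RO):
            Y += f1(c, e, b, k) * f2(a, f, b, k)   # recomputed for every (a, e, c, f)
        E += X * Y
    return E

def run(variant):
    calls["f1"] = calls["f2"] = 0
    E = variant()
    return E, calls["f1"], calls["f2"]

results = {v.__name__: run(v) for v in (unfused, memory_minimal, fully_fused)}
print(results)
```

All three variants return the same E. The fully fused version needs only scalar temporaries but evaluates `f1` and `f2` V^5 O times instead of V^3 O times, matching the time column of the table above.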

Tiling for Reducing Recomputation

  for at, et, ct, ft
    for a, e, c, f
      for i, j
        X[a,e,c,f] += T[i,j,a,e] * T[i,j,c,f]
    for b, k
      for c, e
        T1[c,e] = f1(c, e, b, k)
      for a, f
        T2[a,f] = f2(a, f, b, k)
      for c, e, a, f
        Y[c,e,a,f] += T1[c,e] * T2[a,f]
    for c, e, a, f
      E += X[a,e,c,f] * Y[c,e,a,f]

  array   space    time
  X       B^4      V^4 O^2
  T1      B^2      C_f1 (V/B)^2 V^3 O
  T2      B^2      C_f2 (V/B)^2 V^3 O
  Y       B^4      V^5 O
  E       1        V^4
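A toy version of the tiled loop structure shows how the tile size B interpolates between full recomputation (B = 1) and no recomputation (B = V). As before, `f1`, `f2`, and `T` are made-up integer functions and V = 4, O = 2 are placeholder ranges; B is assumed to divide V evenly.

```python
import itertools

V, O = 4, 2
RV, RO = range(V), range(O)
calls = {"f1": 0, "f2": 0}

def f1(c, e, b, k):              # hypothetical stand-in for an expensive function
    calls["f1"] += 1
    return (c + 2 * e + 3 * b + k) % 5 - 2

def f2(a, f, b, k):              # hypothetical stand-in for an expensive function
    calls["f2"] += 1
    return (2 * a + f + b + 3 * k) % 7 - 3

def T(i, j, a, e):               # input array, modeled as a pure function
    return (i + 2 * j + a * e) % 4 - 1

def tiled(B):
    """Tiling loops at, et, ct, ft step over tiles; a, e, c, f run inside a tile."""
    E = 0
    for at, et, ct, ft in itertools.product(range(0, V, B), repeat=4):
        ra, re_, rc, rf = (range(t, t + B) for t in (at, et, ct, ft))
        # X and Y hold one tile each: B^4 elements.
        X = {(a, e, c, f): sum(T(i, j, a, e) * T(i, j, c, f) for i in RO for j in RO)
             for a in ra for e in re_ for c in rc for f in rf}
        Y = dict.fromkeys(X, 0)
        for b, k in itertools.product(RV, RO):
            T1 = {(c, e): f1(c, e, b, k) for c in rc for e in re_}   # B^2 elements
            T2 = {(a, f): f2(a, f, b, k) for a in ra for f in rf}    # B^2 elements
            for (a, e, c, f) in Y:
                Y[a, e, c, f] += T1[c, e] * T2[a, f]
        E += sum(X[key] * Y[key] for key in X)
    return E

def run(B):
    calls["f1"] = calls["f2"] = 0
    E = tiled(B)
    return E, calls["f1"], calls["f2"]

results = {B: run(B) for B in (1, 2, 4)}
print(results)
```

Each tile size gives the same E, while the number of `f1` evaluations follows (V/B)^2 V^3 O, so doubling B cuts recomputation by a factor of four at the price of B^4-sized buffers for X and Y.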

The Fusion Graph

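[Figure: fusion graph for an expression tree whose leaves are A(i,j) and B(j,k); the vertices are the loop indices i and j for A and j and k for B, with j at the parent node.]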

Making Fused Loops Explicit

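i: range 10; j: range 10; k: range 100

[Figure: the same fusion graph over A(i,j) and B(j,k), with the fused loops drawn explicitly between the index vertices i, j and j, k.]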

Adding Recomputation Loops

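i: range 10; j: range 10; k: range 100

[Figure: fusion graph over A(i,j) and B(j,k) with an added recomputation loop; the index vertices are i, j for A and i, j, k for B. The node is annotated with the solution set {(<i j, k>, 3, 9000), . . .}.]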

Tiling Recomputation Loops

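i: range 10; j: range 10; k: range 100

[Figure: fusion graph over A(it,i,j) and B(j,k) after the recomputation loop i is split into a tiling loop it and an intra-tile loop i; the index vertices are i, j, it for A and j, k, it for B.]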

Space-Time Trade-Off Algorithm

- For e1(i,j,k) e2(i,j): consider recomputation of e2 in the k loop
- For each expression tree node in bottom-up traversal
  - Construct set of solutions {(fusion, mem cost, rec cost), . . .}
  - Prune inferior solutions
- For all solutions for the root of the expression tree
  - Split recomputation loops into tiling/intra-tile loops
  - Move intra-tile loops inside tiling loops where needed
  - Search for tile sizes that minimize recomputation cost
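Two steps of the algorithm lend themselves to a short sketch: pruning inferior (fusion, mem cost, rec cost) solutions, and searching for a tile size. The version below is a simplification under stated assumptions, not the algorithm from the talk: "inferior" is taken to mean dominated in both memory and recomputation cost, ignoring fusion compatibility with parent nodes, and the tile-size search uses the cost expressions from the tiling example (memory 2 B^4 + 2 B^2 + 1, recomputation (C_f1 + C_f2) (V/B)^2 V^3 O). Only the tuple (<i j, k>, 3, 9000) comes from the slides; the other candidates are invented.

```python
from typing import NamedTuple

class Solution(NamedTuple):
    fusion: str   # fusion configuration, kept as an opaque label here
    mem: int      # memory cost in array elements
    rec: int      # recomputation cost in operations

def prune(solutions):
    """Keep only solutions not dominated in both memory and recomputation cost."""
    def dominated(s):
        return any(o.mem <= s.mem and o.rec <= s.rec and (o.mem < s.mem or o.rec < s.rec)
                   for o in solutions)
    return [s for s in solutions if not dominated(s)]

def best_tile(V, O, mem_limit, c_f1=1, c_f2=1):
    """Exhaustive search over tile sizes B that divide V and fit in mem_limit."""
    best = None
    for B in range(1, V + 1):
        if V % B:
            continue
        mem = 2 * B**4 + 2 * B**2 + 1          # X, Y tiles + T1, T2 tiles + E
        if mem > mem_limit:
            continue
        rec = (c_f1 + c_f2) * (V // B) ** 2 * V**3 * O
        if best is None or rec < best[1]:
            best = (B, rec, mem)
    return best

candidates = [
    Solution("<i j, k>", 3, 9000),   # from the slide
    Solution("<i j>", 103, 0),       # hypothetical
    Solution("<i>", 120, 9000),      # hypothetical, dominated by the first
    Solution("<j>", 103, 500),       # hypothetical, dominated by the second
]
kept = prune(candidates)
tile = best_tile(V=12, O=2, mem_limit=200)
print(kept, tile)
```

With these inputs the search settles on B = 3, the largest divisor of V whose buffers still fit within the limit, since recomputation cost falls monotonically as B grows.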

Demo

  mlimit = 144000010
  range V = 300
  range O = 70
  range U = 40
  index a, b, c, d : V
  index e, f : O
  index i, j, k, l : U
  procedure P (in T[V,V,U,U], in S[V,U,O,U], in N[V,V,O,U], in D[V,O,O,U],
               out X[V,V,O,O])
  begin
    X[a,b,i,j] = sum[T[a,c,i,k] * S[d,j,f,k] * N[c,d,e,l] * D[b,e,f,l], {c,d,e,f,k,l}]
  end

Conclusions and Status

- Computational domain with exploitable structure
- Search algorithms for optimizing computation
- Developing compiler in Java
- Future extensions
  - Sparse arrays, symmetry, common subexpressions
  - Domain-specific optimizations