Title: Memory Constrained Data Locality Optimization for Tensor Contractions*
1. Memory Constrained Data Locality Optimization for Tensor Contractions
Alina Bibireata, Sandhya Krishnan, Gerald
Baumgartner, Daniel Cociorva, Chi-Chung Lam, P.
Sadayappan, J. Ramanujam, David E. Bernholdt,
Venkatesh Choppella
Supported by NSF and DOE
2. The Tensor Contraction Engine Addresses Programming Challenges
- User describes the computational problem (tensor contraction expressions) in a simple, high-level language
  - Similar to what might be written in papers
- Synthesis tool translates the high-level language into traditional Fortran (or C, or ...) code
- Generated code is compiled and linked to a quantum chemistry suite, e.g. NWChem or GAMESS
- Productivity
  - User writes simple, high-level code
  - Code generation tools do the tedious work
- Complexity
  - Significantly reduces complexity visible to programmer
- Performance
  - Perform optimizations prior to code generation
  - Automate many decisions humans make empirically
  - Tailor generated code to target computer
  - Tailor generated code to specific problem
3. TCE Components
Sequence of Matrix Products, Element-wise Matrix Operations, Element-wise Function Eval.
Tensor Expressions
- Algebraic Transformations
- Minimize operation count
- Memory Minimization
- Reduce intermediate storage
- Space-Time Transformation
- Trade-offs between storage and recomputation
- Storage Management and Data Locality Optimization
- Optimize use of storage hierarchy
- Data Distribution and Partitioning
- Optimize parallel layout
[Figure: TCE optimization flow. Stages, in order: Algebraic Transformations, Memory Minimization, Space-Time Trade-Offs, Storage and Data Locality Management, Data Distribution and Partitioning. Inputs: System Memory Specification, Performance Model. Edge labels between stages: "No soln fits disk", "Soln fits disk, not mem.", "Soln fits mem.". Output: Parallel Code (Fortran/C/OpenMP/MPI/Global Arrays).]
4. A High-Level Language for Tensor Contraction Expressions
  range V = 3000;
  range O = 100;
  index a,b,c,d,e,f : V;
  index i,j,k : O;
  mlimit = 1000000000000;
  function F1(V,V,V,O);
  function F2(V,V,V,O);
  procedure P(in T1[O,O,V,V], in T2[O,O,V,V], out X)=
  begin
    X == sum[ sum[F1(a,b,f,k) * F2(c,e,b,k), {b,k}]
            * sum[T1[i,j,a,e] * T2[i,j,c,f], {i,j}],
            {a,e,c,f}];
  end
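To make the meaning of the expression for X concrete, here is a minimal Python sketch that evaluates it for tiny ranges, once with the two inner sums factored out (as written on the slide) and once as a single flat summation. The ranges V = 3 and O = 2 and the bodies of F1, F2, T1, T2 are placeholders chosen only for illustration; none of them come from the slides.

```python
from itertools import product

V, O = 3, 2  # tiny illustrative ranges (the slide uses V = 3000, O = 100)

# Hypothetical stand-ins for the functions and input tensors.
def F1(a, b, f, k): return 1.0 + a + 2 * b + 3 * f + k
def F2(c, e, b, k): return 1.0 + c - e + b * k
def T1(i, j, a, e): return 0.5 * (i + 1) * (a + 1) + j - e
def T2(i, j, c, f): return 0.25 * (j + 1) * (c + 1) + i + f

def evaluate_x():
    """X with the inner sums over {b,k} and {i,j} factored out."""
    x = 0.0
    for a, e, c, f in product(range(V), repeat=4):
        y = sum(F1(a, b, f, k) * F2(c, e, b, k)
                for b in range(V) for k in range(O))
        z = sum(T1(i, j, a, e) * T2(i, j, c, f)
                for i in range(O) for j in range(O))
        x += y * z
    return x

def evaluate_x_flat():
    """Same value as one 8-deep summation with nothing factored out."""
    x = 0.0
    for a, e, c, f, b in product(range(V), repeat=5):
        for k, i, j in product(range(O), repeat=3):
            x += (F1(a, b, f, k) * F2(c, e, b, k)
                  * T1(i, j, a, e) * T2(i, j, c, f))
    return x
```

Both functions return the same number; the factored form simply does far fewer multiplications, which is the kind of saving the operation-minimization stage looks for.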
5. Problem Addressed in this Paper
- Given an operation-minimal set of tensor contractions, apply loop fusion and tiling, and insert disk I/O statements to minimize data movement cost
- Current TCE prototype uses a simpler decoupled approach; we now develop a more integrated approach
6. Decoupled Approach
- Explore the fusion structure space first to find a memory-minimal solution
- Explore the tile size space and use a greedy placement of disk I/O statements at the outermost possible point in the code
- Select the solution with minimum disk access cost
7. Integrated Approach
- Optimal Algorithm
  - The fusion structure × disk placements search space can be decoupled from the tile size space
  - Prune the search to eliminate all inferior fusion × placement solutions with respect to memory cost and disk access volume, irrespective of tile sizes
  - Explore the tile size space for un-pruned solutions
- Heuristic Algorithm
  - Use memory space and disk access costs for the case of unit tile size to prune more aggressively in the first step
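The pruning idea can be pictured as removing dominated candidates. The sketch below is only an illustration, not the paper's algorithm: it assumes each fusion × placement candidate has already been summarized by two numbers (memory cost and disk access volume, for instance at unit tile size as in the heuristic variant), whereas the optimal algorithm compares costs irrespective of tile sizes. The `Candidate` type and `prune_inferior` function are hypothetical names.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    name: str    # label for one fusion structure + I/O placement
    memory: int  # memory cost (elements), assumed already computed
    disk: int    # disk access volume (elements), assumed already computed

def prune_inferior(cands: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate beats on both metrics."""
    kept = []
    for c in cands:
        dominated = any(
            o.memory <= c.memory and o.disk <= c.disk
            and (o.memory < c.memory or o.disk < c.disk)
            for o in cands
        )
        if not dominated:
            kept.append(c)
    return kept
```

With made-up values, `prune_inferior([Candidate("A", 10, 100), Candidate("B", 20, 50), Candidate("C", 20, 120), Candidate("D", 5, 200)])` drops only "C", which uses more memory and more disk traffic than "A". Only the surviving candidates would then go on to the tile size search.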
8. Example
- A Two Contraction example
Ni = 3500, Nj = 3600, Nk = 3800, Nl = 4000
Double Precision Arrays; Memory Limit = 10 MB
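A quick calculation shows why disk I/O is unavoidable for this example. The sketch assumes 8 bytes per double and reads "MB" as 2^20 bytes; the array shapes are taken from the index usage in the loop code on the following slides: A(i,k), B(k,l), C(j,l), D(i,j), T(j,k).

```python
NI, NJ, NK, NL = 3500, 3600, 3800, 4000
BYTES_PER_DOUBLE = 8
MIB = 1024 * 1024          # assumption: "MB" on the slide means 2**20 bytes
MEMORY_LIMIT_MIB = 10

# Shapes inferred from the subscripts used in the loop code.
shapes = {
    "A": (NI, NK),
    "B": (NK, NL),
    "C": (NJ, NL),
    "D": (NI, NJ),
    "T": (NJ, NK),
}

def size_mib(shape):
    rows, cols = shape
    return rows * cols * BYTES_PER_DOUBLE / MIB

sizes = {name: size_mib(shape) for name, shape in shapes.items()}

if __name__ == "__main__":
    for name, mib in sizes.items():
        print(f"{name}: {mib:7.1f} MiB (limit {MEMORY_LIMIT_MIB} MiB)")
```

Every array, including the unfused intermediate T, is roughly ten times larger than the memory limit, so all of them must be disk-resident and processed in pieces.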
9. Loop Fusion
Unfused Code
  T(*,*) = 0.0
  FOR j, k, l
    T(j,k) += B(k,l) * C(j,l)
  D(*,*) = 0.0
  FOR i, j, k
    D(i,j) += A(i,k) * T(j,k)
Fused Code after Memory Minimization
  D(*,*) = 0.0
  FOR j, k
    T = 0.0
    FOR l
      T += C(j,l) * B(k,l)
    FOR i
      D(i,j) += A(i,k) * T
- Intermediate T is reduced to a scalar, thus reducing space requirements
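The two loop structures can be compared directly in a few lines of Python. This is a sketch with plain nested lists and hypothetical function names; it only illustrates that fusing the j and k loops leaves D unchanged while shrinking T from an Nj × Nk array to a single scalar.

```python
def unfused(A, B, C, ni, nj, nk, nl):
    # Full nj x nk intermediate, as in the unfused code.
    T = [[0.0] * nk for _ in range(nj)]
    for j in range(nj):
        for k in range(nk):
            for l in range(nl):
                T[j][k] += B[k][l] * C[j][l]
    D = [[0.0] * nj for _ in range(ni)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                D[i][j] += A[i][k] * T[j][k]
    return D

def fused(A, B, C, ni, nj, nk, nl):
    D = [[0.0] * nj for _ in range(ni)]
    for j in range(nj):
        for k in range(nk):
            t = 0.0                      # T reduced to a scalar
            for l in range(nl):
                t += C[j][l] * B[k][l]
            for i in range(ni):
                D[i][j] += A[i][k] * t
    return D
```

For the sizes of the running example, the unfused T would hold 3600 × 3800 doubles, while the fused version needs one.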
10. Disk I/O Placements
Fused Code after Memory Minimization
  D(*,*) = 0.0
  FOR j, k
    T = 0.0
    FOR l
      T += C(j,l) * B(k,l)
    FOR i
      D(i,j) += A(i,k) * T
Disk I/O Placement I (For only First Contraction)
  FOR j, k
    T = 0.0
    FOR l
      Cm = Read C(j,l)
      Bm = Read B(k,l)
      T += Cm * Bm
11. Disk I/O Placements
Fused Code after Memory Minimization
  D(*,*) = 0.0
  FOR j, k
    T = 0.0
    FOR l
      T += C(j,l) * B(k,l)
    FOR i
      D(i,j) += A(i,k) * T
Disk I/O Placement II (For only First Contraction)
  FOR j
    Cm(l) = Read C(j,l)
    FOR k
      T = 0.0
      Bm(l) = Read B(k,l)
      FOR l
        T += Cm(l) * Bm(l)
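The trade-off between the two placements can be made explicit by counting elements read from disk and elements buffered in memory. The formulas below are derived by inspection of the loop nests for Placement I and Placement II (first contraction only); the function names are hypothetical, and the counts ignore block granularity and any caching by the I/O system.

```python
def placement_one(nj, nk, nl):
    """Reads inside the l loop: one element of C and one of B per iteration."""
    reads = 2 * nj * nk * nl
    buffer_elems = 2                 # two scalars: Cm and Bm
    return reads, buffer_elems

def placement_two(nj, nk, nl):
    """C row read once per j; B row read once per (j, k) pair."""
    reads = nj * nl + nj * nk * nl
    buffer_elems = 2 * nl            # two row buffers: Cm(l) and Bm(l)
    return reads, buffer_elems

if __name__ == "__main__":
    NJ, NK, NL = 3600, 3800, 4000    # sizes from the example slide
    for label, fn in (("I", placement_one), ("II", placement_two)):
        reads, buf = fn(NJ, NK, NL)
        print(f"Placement {label}: {reads:,} elements read, {buf:,} buffered")
```

Moving the read of C outward roughly halves the read volume at the price of buffering whole rows, which is the memory versus disk-traffic tension that the placement search has to resolve.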
12. Loop Tiling
Fused Code with I/O Placements
  FOR j
    Cm(l) = Read C(j,l)
    FOR k
      T = 0.0
      Bm(l) = Read B(k,l)
      FOR l
        T += Cm(l) * Bm(l)
After Loop Tiling
  FOR jT
    Cm(jI,l) = Read C(j,l)
    FOR kT
      FOR jI, kI
        T(jI,kI) = 0.0
      Bm(kI,l) = Read B(k,l)
      FOR lT, jI, kI, lI
        T(jI,kI) += Cm(jI,lT+lI) * Bm(kI,lT+lI)
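The tiled loop nest can be mimicked in Python with a small simulated disk that counts elements read. This is a sketch under stated assumptions: the `Disk` class and `tiled_contraction` function are hypothetical, arrays are nested lists, and the finished T tile is copied into a full result array purely so the output can be checked (in the fused code the tile would be consumed by the second contraction instead).

```python
class Disk:
    """Simulated disk-resident 2-D array that counts elements read."""
    def __init__(self, data):
        self.data = data
        self.elements_read = 0

    def read_rows(self, start, stop):
        rows = [row[:] for row in self.data[start:stop]]
        self.elements_read += sum(len(r) for r in rows)
        return rows

def tiled_contraction(C, B, nj, nk, nl, tj, tk, tl):
    """T(j,k) = sum over l of C(j,l) * B(k,l), tiled in j, k and l."""
    result = [[0.0] * nk for _ in range(nj)]
    for j0 in range(0, nj, tj):                      # FOR jT
        Cm = C.read_rows(j0, min(j0 + tj, nj))       # Cm(jI,l) = Read C(j,l)
        for k0 in range(0, nk, tk):                  # FOR kT
            Bm = B.read_rows(k0, min(k0 + tk, nk))   # Bm(kI,l) = Read B(k,l)
            T = [[0.0] * len(Bm) for _ in Cm]        # T(jI,kI) = 0.0
            for l0 in range(0, nl, tl):              # FOR lT
                for jI in range(len(Cm)):
                    for kI in range(len(Bm)):
                        for l in range(l0, min(l0 + tl, nl)):  # index lT+lI
                            T[jI][kI] += Cm[jI][l] * Bm[kI][l]
            # Copy the finished tile out (added here only for checking).
            for jI in range(len(Cm)):
                for kI in range(len(Bm)):
                    result[j0 + jI][k0 + kI] = T[jI][kI]
    return result
```

With this placement C is read exactly once, while B is re-read once per j tile, so a larger j tile means less disk traffic but larger buffers.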
13. Decoupled Approach
- Step One: Memory Minimization
  - Finds a fusion structure that is memory minimal
- Step Two: Loop Tiling and Disk I/O Placement
  - Tiles the loops
  - Determines tile sizes and disk I/O placements to minimize disk access cost under memory limit constraints
14. Decoupled Approach
Step One: Memory Minimization
Fused Code (memory minimal solution)
  D(*,*) = 0.0
  FOR j, k
    T = 0.0
    FOR l
      T += C(j,l) * B(k,l)
    FOR i
      D(i,j) += A(i,k) * T
Parse Tree
  for j
    for k
      for l
        T += C(j,l) * B(k,l)
      for i
        D(i,j) += A(i,k) * T
15. Decoupled Approach
Step Two: Loop Tiling and Disk I/O Placements
Tile Sizes Found: Ti = 500, Tj = 900, Tk = 543, Tl = 500
No. of Accesses at lT: 47 MB = Tj*Tk + Tj*Nl + Tk*Nl
No. of Accesses at iT: 42 MB = Ni*Tj + Ni*Tk + Tj*Tk
Memory Limit: 10 MB
Parse Tree
  for jT
    for kT
      for lT
        for jI
          for kI
            for lI
              T(jI,kI) += C(jT+jI,lT+lI) * B(kT+kI,lT+lI)
      for iT
        for jI
          for kI
            for iI
              D(iT+iI,jT+jI) += A(iT+iI,kT+kI) * T(jI,kI)
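The 47 MB and 42 MB figures can be reproduced from the tile sizes and array extents. The check below assumes 8-byte doubles and reads "MB" as 2^20 bytes; under that reading both expressions come out well above the 10 MB limit.

```python
NI, NJ, NK, NL = 3500, 3600, 3800, 4000   # array extents from the example
TI, TJ, TK, TL = 500, 900, 543, 500       # tile sizes found in this step
MIB = 1024 * 1024                         # assumption: MB = 2**20 bytes
MEMORY_LIMIT_MIB = 10

def mib(elements):
    """Size in MiB of a buffer holding `elements` doubles."""
    return elements * 8 / MIB

at_lT = mib(TJ * TK + TJ * NL + TK * NL)  # Tj*Tk + Tj*Nl + Tk*Nl
at_iT = mib(NI * TJ + NI * TK + TJ * TK)  # Ni*Tj + Ni*Tk + Tj*Tk

if __name__ == "__main__":
    print(f"at lT: {at_lT:.2f} MiB, at iT: {at_iT:.2f} MiB, "
          f"limit: {MEMORY_LIMIT_MIB} MiB")
```

Truncated to whole units, the two values are 47 and 42, matching the slide.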
16. Decoupled Approach
Step Two: Loop Tiling and Disk I/O Placements
Disk Access Cost: 234 secs; Memory Usage: 9.23 MB
Parse Tree (annotated "Redundancy for I/O statement")
  for jT
    for kT
      for lT
        for jI
          for kI
            for lI
              T(jI,kI) += C(jI,lI) * B(kI,lI)
      for iT
        for jI
          for kI
            for iI
              D(iI,jI) += A(iI,kI) * T(jI,kI)
Disk I/O statements placed in the tree: Read C(jI,lI), Read B(kI,lI), Read A(iI,kI), Read D(iI,jI), Write D(iI,jI)
17. Integrated Approach
- Step One: Memory Minimization and Disk I/O Placement
  - Finds a set of fusion structures with disk I/O placements
- Step Two: Loop Tiling
  - Tiles the loops
  - Determines tile sizes to minimize disk access cost under memory limit constraints
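Choosing tile sizes under a memory limit can be sketched as a small constrained search. The cost model here is an assumption made for illustration, not the model used in the paper: it covers only the first contraction with the I/O placement from the loop tiling slide (row buffers Cm and Bm plus the T tile), measures disk cost in elements read rather than seconds, and searches a coarse grid of k tile sizes.

```python
NJ, NK, NL = 3600, 3800, 4000
MEMORY_LIMIT_ELEMS = 10 * 1024 * 1024 // 8   # 10 MB of doubles, MB = 2**20 bytes

def memory_elems(tj, tk):
    # Assumed footprint: Cm(tj, NL) + Bm(tk, NL) + T(tj, tk).
    return tj * NL + tk * NL + tj * tk

def disk_elems(tj, tk):
    # Assumed volume: C read once; B re-read once per j tile.
    # tk does not appear because B is read in full for every j tile.
    j_tiles = (NJ + tj - 1) // tj
    return NJ * NL + j_tiles * NK * NL

def best_tiles(tk_candidates=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Return (cost, tj, tk) with minimal disk volume within the memory limit."""
    best = None
    for tj in range(1, NJ + 1):
        for tk in tk_candidates:
            if memory_elems(tj, tk) > MEMORY_LIMIT_ELEMS:
                continue
            cost = disk_elems(tj, tk)
            if best is None or cost < best[0]:
                best = (cost, tj, tk)
    return best
```

Under this simplified model the memory limit caps the j tile at a few hundred rows, so B has to be streamed from disk about a dozen times. A fuller model would include the second contraction and the reads and writes of A and D.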
18. Integrated Approach
Step One: Memory Minimization and Disk I/O Placement
Best Loop Structure with Disk I/O Placements found by Step One
  for j
    for l
      for k
        T(k) += C * B
    for i
      for k
        D += A * T(k)
Disk I/O statements placed in the tree: Read C, Read B, Read A, Write D
Tree annotations: "Redundancy for I/O statement", "No Redundancy for I/O statement"
19. Integrated Approach
Step Two: Loop Tiling
Tile Sizes Found: Ti = 300, Tj = 254, Tk = 308, Tl = 292
Disk Access Cost: 194 secs; Memory Usage: 9.99 MB
20. Experimental Results
- AO-to-MO transform example in quantum chemistry
Ranges for p, q, r, s: N = O + V
Ranges for a, b, c, d: V
O = No. of occupied orbitals; V = No. of unoccupied orbitals
21. Disk Access Costs for Both Approaches
22. CCSD Doubles Equation
hbar[a,b,i,j] ==
    sum[f[b,c]*t[i,j,a,c],{c}]
  - sum[f[k,c]*t[k,b]*t[i,j,a,c],{k,c}]
  + sum[f[a,c]*t[i,j,c,b],{c}]
  - sum[f[k,c]*t[k,a]*t[i,j,c,b],{k,c}]
  - sum[f[k,j]*t[i,k,a,b],{k}]
  - sum[f[k,c]*t[j,c]*t[i,k,a,b],{k,c}]
  - sum[f[k,i]*t[j,k,b,a],{k}]
  - sum[f[k,c]*t[i,c]*t[j,k,b,a],{k,c}]
  + sum[t[i,c]*t[j,d]*v[a,b,c,d],{c,d}]
  + sum[t[i,j,c,d]*v[a,b,c,d],{c,d}]
  + sum[t[j,c]*v[a,b,i,c],{c}]
  - sum[t[k,b]*v[a,k,i,j],{k}]
  + sum[t[i,c]*v[b,a,j,c],{c}]
  - sum[t[k,a]*v[b,k,j,i],{k}]
  - sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d],{k,c,d}]
  - sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d],{k,c,d}]
  - sum[t[j,c]*t[k,b]*v[k,a,c,i],{k,c}]
  + 2*sum[t[j,k,b,c]*v[k,a,c,i],{k,c}]
  - sum[t[j,k,c,b]*v[k,a,c,i],{k,c}]
  - sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c],{k,c,d}]
  + 2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c],{k,c,d}]
  - sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c],{k,c,d}]
  - sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c],{k,c,d}]
  + 2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c],{k,c,d}]
  - sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c],{k,c,d}]
  - sum[t[j,k,b,c]*v[k,a,i,c],{k,c}]
  - sum[t[i,c]*t[k,b]*v[k,a,j,c],{k,c}]
  - sum[t[i,k,c,b]*v[k,a,j,c],{k,c}]
  - sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d],{k,c,d}]
  - sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d],{k,c,d}]
  - sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d],{k,c,d}]
  + 2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d],{k,c,d}]
  - sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d],{k,c,d}]
  - sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d],{k,c,d}]
  - sum[t[i,c]*t[k,a]*v[k,b,c,j],{k,c}]
  + 2*sum[t[i,k,a,c]*v[k,b,c,j],{k,c}]
  - sum[t[i,k,c,a]*v[k,b,c,j],{k,c}]
  + 2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c],{k,c,d}]
  - sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c],{k,c,d}]
  - sum[t[j,c]*t[k,a]*v[k,b,i,c],{k,c}]
  - sum[t[j,k,c,a]*v[k,b,i,c],{k,c}]
  - sum[t[i,k,a,c]*v[k,b,j,c],{k,c}]
  + sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d],{k,l,c,d}]
  + 4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d],{k,l,c,d}]
  - 2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d],{k,l,c,d}]
  + sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i],{k,l,c}]
  + sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i],{k,l,c}]
  - 2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i],{k,l,c}]
  + sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i],{k,l,c}]
  - 2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i],{k,l,c}]
  + sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i],{k,l,c}]
  + sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i],{k,l,c}]
  + sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i],{k,l,c}]
  + sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j],{k,l,c}]
  + sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j],{k,l,c}]
  - 2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j],{k,l,c}]
  + sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j],{k,l,c}]
  + sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j],{k,l,c}]
  + sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c],{k,l,c,d}]
  - 2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c],{k,l,c,d}]
  - 2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c],{k,l,c,d}]
  + sum[t[k,a]*t[l,b]*v[k,l,i,j],{k,l}]
  + sum[t[k,l,a,b]*v[k,l,i,j],{k,l}]
  + sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d],{k,l,c,d}]
  + sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d],{k,l,c,d}]
  + sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d],{k,l,c,d}]
  - 2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d],{k,l,c,d}]
  + sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d],{k,l,c,d}]
  + sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d],{k,l,c,d}]
  + sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d],{k,l,c,d}]
  - 2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j],{k,l,c}]
  + sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j],{k,l,c}]
  + sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j],{k,l,c}]
  + v[a,b,i,j]