Memory Constrained Data Locality Optimization for Tensor Contractions

Learn more at: https://parasol.tamu.edu
1
Memory Constrained Data Locality Optimization for
Tensor Contractions
Alina Bibireata, Sandhya Krishnan, Gerald
Baumgartner, Daniel Cociorva, Chi-Chung Lam, P.
Sadayappan, J. Ramanujam, David E. Bernholdt,
Venkatesh Choppella
Supported by NSF and DOE
2
The Tensor Contraction Engine Addresses
Programming Challenges
  • User describes the computational problem (tensor
    contraction expressions) in a simple, high-level
    language
      • Similar to what might be written in papers
  • Synthesis tool translates the high-level language
    into traditional Fortran (or C, or ...) code
  • Generated code is compiled and linked to a quantum
    chemistry suite, e.g. NWChem or GAMESS
  • Productivity
      • User writes simple, high-level code
      • Code generation tools do the tedious work
  • Complexity
      • Significantly reduces complexity visible to the
        programmer
  • Performance
      • Perform optimizations prior to code generation
      • Automate many decisions humans make empirically
      • Tailor generated code to target computer
      • Tailor generated code to specific problem

3
TCE Components
Tensor Expressions: Sequence of Matrix Products, Element-wise Matrix
Operations, Element-wise Function Eval.
  • Algebraic Transformations
      • Minimize operation count
  • Memory Minimization
      • Reduce intermediate storage
  • Space-Time Transformation
      • Trade-offs between storage and recomputation
  • Storage Management and Data Locality Optimization
      • Optimize use of storage hierarchy
  • Data Distribution and Partitioning
      • Optimize parallel layout

[Figure: TCE optimization flow. Tensor expressions pass through Algebraic
Transformations, Memory Minimization, Space-Time Trade-Offs, Storage and
Data Locality Management, and Data Distribution and Partitioning, guided by
a System Memory Specification and a Performance Model. Branch labels: "No
solution fits disk", "Solution fits disk, not memory", "Solution fits
memory". Output: parallel code in Fortran/C with OpenMP/MPI/Global Arrays.]
4
A High-Level Language for Tensor Contraction
Expressions
range V = 3000;
range O = 100;
index a,b,c,d,e,f : V;
index i,j,k : O;
mlimit = 1000000000000;
function F1(V,V,V,O);
function F2(V,V,V,O);
procedure P(in T1[O,O,V,V], in T2[O,O,V,V], out X) =
begin
  X == sum[ sum[F1(a,b,f,k) * F2(c,e,b,k), {b,k}]
          * sum[T1[i,j,a,e] * T2[i,j,c,f], {i,j}],
          {a,e,c,f}];
end
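To make the structure of procedure P concrete, here is a minimal Python sketch that evaluates X on tiny ranges. It assumes the expression reads as an outer sum over a, e, c, f of the product of two inner sums (over b, k and over i, j); the bracket structure is not preserved in this transcript. The bodies of F1 and F2, the random contents of T1 and T2, and the small values of V and O are stand-ins chosen only for illustration.

```python
import itertools
import random

V, O = 3, 2  # tiny stand-ins for the slide's V = 3000 and O = 100

def F1(a, b, f, k):
    # Hypothetical function body; the slide only declares F1(V,V,V,O).
    return 1.0 / (1 + a + 2 * b + 3 * f + k)

def F2(c, e, b, k):
    # Hypothetical function body; the slide only declares F2(V,V,V,O).
    return 1.0 / (2 + c + e + b * k)

def random_tensor(rng, dims):
    # Dense tensor stored as a dict keyed by index tuples.
    return {idx: rng.random() for idx in itertools.product(*map(range, dims))}

rng = random.Random(0)
T1 = random_tensor(rng, (O, O, V, V))
T2 = random_tensor(rng, (O, O, V, V))

def x_factored():
    # Two inner contractions per (a, e, c, f), then one outer sum.
    total = 0.0
    for a, e, c, f in itertools.product(range(V), repeat=4):
        inner_f = sum(F1(a, b, f, k) * F2(c, e, b, k)
                      for b in range(V) for k in range(O))
        inner_t = sum(T1[i, j, a, e] * T2[i, j, c, f]
                      for i in range(O) for j in range(O))
        total += inner_f * inner_t
    return total

def x_flat():
    # Same quantity as a single sum over all eight indices.
    total = 0.0
    for a, e, c, f, b in itertools.product(range(V), repeat=5):
        for k, i, j in itertools.product(range(O), repeat=3):
            total += (F1(a, b, f, k) * F2(c, e, b, k)
                      * T1[i, j, a, e] * T2[i, j, c, f])
    return total

print(x_factored(), x_flat())
```

The factored form does far fewer multiplications than the flat eight-index sum, which is the kind of saving the algebraic transformation stage looks for.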
5
Problem Addressed in this Paper
  • Given an operation-minimal set of tensor
    contractions, apply loop fusion and tiling, and
    insert disk I/O statements to minimize data movement
    cost
  • The current TCE prototype uses a simpler decoupled
    approach; we now develop a more integrated
    approach

6
Decoupled Approach
  • Explore the fusion structure space first to find a
    memory-minimal solution
  • Explore the tile size space and use a greedy
    placement of disk I/O statements at the outermost
    possible point in the code
  • Select the solution with minimum disk access cost

7
Integrated Approach
  • Optimal Algorithm
      • The fusion structure × disk placement search
        space can be decoupled from the tile size space
      • Prune the search to eliminate all fusion ×
        placement solutions that are inferior with respect
        to memory cost and disk access volume, irrespective
        of tile sizes
      • Explore the tile size space for the unpruned
        solutions
  • Heuristic Algorithm
      • Use memory space and disk access costs for the case
        of unit tile size to prune more aggressively in the
        first step
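The heuristic pruning step can be pictured as a dominance filter over two costs. The sketch below keeps only candidates that are not beaten on both memory cost and disk access volume at unit tile size. The candidate names and numbers are invented for illustration, and this is not the authors' implementation; the optimal algorithm compares costs irrespective of tile size, which this sketch does not attempt.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str         # hypothetical label for a fusion x placement choice
    memory: int       # memory cost at unit tile size (elements)
    disk_volume: int  # disk access volume at unit tile size (elements)

def dominates(p, q):
    # p dominates q if it is no worse on both costs and better on one.
    return (p.memory <= q.memory and p.disk_volume <= q.disk_volume
            and (p.memory < q.memory or p.disk_volume < q.disk_volume))

def prune(candidates):
    # Keep candidates that no other candidate dominates.
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

candidates = [
    Candidate("A", 2, 900),
    Candidate("B", 40, 300),
    Candidate("C", 50, 350),   # inferior to B on both costs
    Candidate("D", 400, 100),
    Candidate("E", 40, 900),   # inferior to A (memory) and B (disk volume)
]
survivors = prune(candidates)
print([c.name for c in survivors])
```

Only the surviving candidates would then go on to the tile size search.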

8
Example
  • A two-contraction example

Ni = 3500, Nj = 3600, Nk = 3800, Nl = 4000
Double Precision Arrays
Memory Limit: 10 MB
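A quick size check shows why disk I/O is unavoidable for this example. The sketch assumes the array shapes used in the loop code on the following slides (A(i,k), B(k,l), C(j,l), D(i,j), and an unfused intermediate T(j,k)), 8-byte doubles, and 1 MB = 2**20 bytes.

```python
BYTES_PER_DOUBLE = 8
MB = 2 ** 20  # assumption: the slide's "MB" means 2**20 bytes
MEMORY_LIMIT_MB = 10

N = {"i": 3500, "j": 3600, "k": 3800, "l": 4000}

# Index letters of each array, as they appear in the loop code.
arrays = {"A": "ik", "B": "kl", "C": "jl", "D": "ij", "T": "jk"}

def size_mb(indices):
    # Size of a full double-precision array over the given indices.
    elems = 1
    for idx in indices:
        elems *= N[idx]
    return elems * BYTES_PER_DOUBLE / MB

sizes = {name: size_mb(ix) for name, ix in arrays.items()}
for name, mb in sizes.items():
    print(f"{name}: {mb:.1f} MB (limit {MEMORY_LIMIT_MB} MB)")
```

Under these assumptions every array is roughly ten times the memory limit, so none of them can be held in memory whole.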
9
Loop Fusion
Unfused Code:
T(*,*) = 0.0
FOR j, k, l
  T(j,k) += B(k,l) * C(j,l)
D(*,*) = 0.0
FOR i, j, k
  D(i,j) += A(i,k) * T(j,k)

Fused Code after Memory Minimization:
D(*,*) = 0.0
FOR j, k
  T = 0.0
  FOR l
    T += C(j,l) * B(k,l)
  FOR i
    D(i,j) += A(i,k) * T
  • Intermediate T is reduced to a scalar, thus
    reducing space requirements
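The equivalence of the two loop structures can be checked on small random inputs. This is a minimal Python sketch that reads the slide's statements as accumulations (T += B*C and D += A*T); the sizes and random data are arbitrary.

```python
import random

def contract_unfused(A, B, C):
    Ni, Nk = len(A), len(A[0])
    Nj, Nl = len(C), len(C[0])
    # Full Nj x Nk intermediate.
    T = [[0.0] * Nk for _ in range(Nj)]
    for j in range(Nj):
        for k in range(Nk):
            for l in range(Nl):
                T[j][k] += B[k][l] * C[j][l]
    D = [[0.0] * Nj for _ in range(Ni)]
    for i in range(Ni):
        for j in range(Nj):
            for k in range(Nk):
                D[i][j] += A[i][k] * T[j][k]
    return D

def contract_fused(A, B, C):
    Ni, Nk = len(A), len(A[0])
    Nj, Nl = len(C), len(C[0])
    D = [[0.0] * Nj for _ in range(Ni)]
    for j in range(Nj):
        for k in range(Nk):
            t = 0.0  # the intermediate is now a single scalar
            for l in range(Nl):
                t += C[j][l] * B[k][l]
            for i in range(Ni):
                D[i][j] += A[i][k] * t
    return D

rng = random.Random(1)

def rand_matrix(rows, cols):
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]

Ni, Nj, Nk, Nl = 3, 4, 5, 6
A, B, C = rand_matrix(Ni, Nk), rand_matrix(Nk, Nl), rand_matrix(Nj, Nl)
D_unfused = contract_unfused(A, B, C)
D_fused = contract_fused(A, B, C)
max_diff = max(abs(x - y)
               for row_u, row_f in zip(D_unfused, D_fused)
               for x, y in zip(row_u, row_f))
print(max_diff)
```

Fusing the j and k loops is what lets T shrink from Nj × Nk elements to one.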

10
Disk I/O Placements
Fused Code after Memory Minimization:
D(*,*) = 0.0
FOR j, k
  T = 0.0
  FOR l
    T += C(j,l) * B(k,l)
  FOR i
    D(i,j) += A(i,k) * T

Disk I/O Placement I (for the first contraction only):
FOR j, k
  T = 0.0
  FOR l
    Cm = Read C(j,l)
    Bm = Read B(k,l)
    T += Cm * Bm
11
Disk I/O Placements
Fused Code after Memory Minimization:
D(*,*) = 0.0
FOR j, k
  T = 0.0
  FOR l
    T += C(j,l) * B(k,l)
  FOR i
    D(i,j) += A(i,k) * T

Disk I/O Placement II (for the first contraction only):
FOR j
  Cm(l) = Read C(j,l)
  FOR k
    T = 0.0
    Bm(l) = Read B(k,l)
    FOR l
      T += Cm(l) * Bm(l)
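The trade-off between Placement I and Placement II can be quantified by counting elements read from disk and elements buffered in memory. The sketch below uses the example's extents and assumes each Read moves exactly the elements its loop position implies: a scalar per innermost iteration in Placement I, a full row of length Nl in Placement II.

```python
Nj, Nk, Nl = 3600, 3800, 4000  # extents from the example slide

def placement_one():
    # Reads sit inside the l loop: one element of C and one of B
    # per (j, k, l) iteration, buffered in two scalars.
    return {"C_reads": Nj * Nk * Nl,
            "B_reads": Nj * Nk * Nl,
            "buffer_elems": 2}

def placement_two():
    # A row of C is read once per j; a row of B once per (j, k).
    # Both buffers hold Nl elements.
    return {"C_reads": Nj * Nl,
            "B_reads": Nj * Nk * Nl,
            "buffer_elems": 2 * Nl}

for name, cost in (("I", placement_one()), ("II", placement_two())):
    print(name, cost)
```

Hoisting the read of C out of the k loop cuts its disk volume by a factor of Nk, at the price of an Nl-element buffer. The read of B still sits inside the j loop, so its volume is unchanged.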
12
Loop Tiling
Fused Code with I/O Placements:
FOR j
  Cm(l) = Read C(j,l)
  FOR k
    T = 0.0
    Bm(l) = Read B(k,l)
    FOR l
      T += Cm(l) * Bm(l)

After Loop Tiling:
FOR jT
  Cm(jI,l) = Read C(j,l)
  FOR kT
    FOR jI, kI
      T(jI,kI) = 0.0
    Bm(kI,l) = Read B(k,l)
    FOR lT, jI, kI, lI
      T(jI,kI) += Cm(jI,lT+lI) * Bm(kI,lT+lI)
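A rough cost model for the tiled code shows how tile sizes trade memory for disk traffic. The sketch assumes the read of C sits directly inside the jT loop and the read of B directly inside the kT loop, as in the tiled code on this slide, and counts array elements rather than bytes. The function name and return layout are illustrative only.

```python
import math

Nj, Nk, Nl = 3600, 3800, 4000  # extents from the example slide

def tiled_cost(Tj, Tk):
    """Return (C elements read, B elements read, buffer elements)."""
    j_tiles = math.ceil(Nj / Tj)
    c_reads = Nj * Nl            # every row of C is read exactly once
    b_reads = j_tiles * Nk * Nl  # all of B is re-read for each j tile
    memory = Tj * Nl + Tk * Nl + Tj * Tk  # Cm + Bm + T buffers
    return c_reads, b_reads, memory

print(tiled_cost(1, 1))      # unit tiles: same volume as the untiled code
print(tiled_cost(100, 100))
```

In this model a larger Tj divides the re-reads of B by Tj, while the Cm buffer grows by Nl elements per unit of Tj. The memory limit therefore bounds how far the tile size can be raised.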
13
Decoupled Approach
  • Step One: Memory Minimization
      • Finds a fusion structure that is memory minimal
  • Step Two: Loop Tiling and Disk I/O Placement
      • Tiles the loops
      • Determines tile sizes and disk I/O placements to
        minimize disk access cost under memory limit
        constraints

14
Decoupled Approach
Step One: Memory Minimization

Fused Code:
D(*,*) = 0.0
FOR j, k
  T = 0.0
  FOR l
    T += C(j,l) * B(k,l)
  FOR i
    D(i,j) += A(i,k) * T

Parse Tree (memory-minimal solution):
for j
  for k
    for l
      T += C(j,l) * B(k,l)
    for i
      D(i,j) += A(i,k) * T
15
Decoupled Approach
Step Two: Loop Tiling and Disk I/O Placements
Tile Sizes Found: Ti = 500, Tj = 900, Tk = 543, Tl = 500
Memory Limit: 10 MB
No. of Accesses at lT: 47 MB (TjTk + TjNl + TkNl)
No. of Accesses at iT: 42 MB (NiTj + NiTk + TjTk)

Tiled Parse Tree:
for jT
  for kT
    for lT
      for jI
        for kI
          for lI
            T(jI,kI) += C(jT+jI,lT+lI) * B(kT+kI,lT+lI)
    for iT
      for jI
        for kI
          for iI
            D(iT+iI,jT+jI) += A(iT+iI,kT+kI) * T(jI,kI)
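The two sizes quoted for lT and iT can be reproduced from the tile sizes and extents. The sketch reads each quoted expression as a sum of three products (the operators are lost in this transcript) and assumes 8-byte doubles and 1 MB = 2**20 bytes.

```python
BYTES_PER_DOUBLE = 8
MB = 2 ** 20  # assumption: the slide's "MB" means 2**20 bytes

Ni, Nj, Nk, Nl = 3500, 3600, 3800, 4000  # extents from the example
Ti, Tj, Tk, Tl = 500, 900, 543, 500      # tile sizes found in this step

def to_mb(elements):
    return elements * BYTES_PER_DOUBLE / MB

# TjTk + TjNl + TkNl
at_lT = to_mb(Tj * Tk + Tj * Nl + Tk * Nl)
# NiTj + NiTk + TjTk
at_iT = to_mb(Ni * Tj + Ni * Tk + Tj * Tk)

print(f"at lT: {at_lT:.2f} MB, at iT: {at_iT:.2f} MB")
```

Both values are well above the 10 MB limit, which is consistent with the I/O statements having to sit further inside the loop nest.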
16
Decoupled Approach
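Step Two: Loop Tiling and Disk I/O Placements
Disk Access Cost: 234 secs
Memory Usage: 9.23 MB

[Figure: tiled parse tree (for jT, for kT, then the sibling nests
lT, jI, kI, lI and iT, jI, kI, iI) annotated with disk I/O statements;
the legend marks statements with "Redundancy for I/O statement".]

I/O statements in the first contraction: Read C(jI,lI), Read B(kI,lI)
I/O statements in the second contraction: Read D(iI,jI), Read A(iI,kI),
Write D(iI,jI)

T(jI,kI) += C(jI,lI) * B(kI,lI)
D(iI,jI) += A(iI,kI) * T(jI,kI)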
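The reported memory usage is consistent with one tile-sized buffer per array reference. The sketch below is an interpretation, not something the slide states: it assumes the T tile lives alongside the C and B tiles during the first contraction and alongside the A and D tiles during the second, with the tile sizes from the previous slide, 8-byte doubles, and 1 MB = 2**20 bytes.

```python
BYTES_PER_DOUBLE = 8
MB = 2 ** 20  # assumption: the slide's "MB" means 2**20 bytes

Ti, Tj, Tk, Tl = 500, 900, 543, 500  # tile sizes from the previous slide

def to_mb(elements):
    return elements * BYTES_PER_DOUBLE / MB

# Assumed live buffers while computing T(jI,kI) += C(jI,lI) * B(kI,lI).
first = Tj * Tk + Tj * Tl + Tk * Tl
# Assumed live buffers while computing D(iI,jI) += A(iI,kI) * T(jI,kI).
second = Tj * Tk + Ti * Tk + Ti * Tj

peak_mb = to_mb(max(first, second))
print(f"{peak_mb:.2f} MB")
```

Under these assumptions the peak works out to the 9.23 MB shown on the slide, just under the 10 MB limit.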
17
Integrated Approach
  • Step One: Memory Minimization and Disk I/O
    Placement
      • Finds a set of fusion structures with disk I/O
        placements
  • Step Two: Loop Tiling
      • Tiles the loops
      • Determines tile sizes to minimize disk access
        cost under memory limit constraints

18
Integrated Approach
Step One: Memory Minimization and Disk I/O Placement

Best Loop Structure with Disk I/O Placements found by Step One:
for j
  for l
    for k
      T(k) += C * B
  for i
    for k
      D += A * T(k)

I/O statements in the first contraction: Read C, Read B
I/O statements in the second contraction: Read A, Write D

[Figure legend: "Redundancy for I/O statement" / "No Redundancy for I/O
statement".]
19
Integrated Approach
Step Two: Loop Tiling
Tile Sizes Found: Ti = 300, Tj = 254, Tk = 308, Tl = 292
Disk Access Cost: 194 secs
Memory Usage: 9.99 MB
20
Experimental Results
  • AO-to-MO transform example in quantum chemistry

Ranges for p, q, r, s: N = O + V
Ranges for a, b, c, d: V
O = No. of occupied orbitals
V = No. of unoccupied orbitals
21
Disk Access Costs for Both Approaches
22
CCSD Doubles Equation
  • hbar[a,b,i,j] ==
      sum[f[b,c]*t[i,j,a,c],{c}]
      -sum[f[k,c]*t[k,b]*t[i,j,a,c],{k,c}]
      +sum[f[a,c]*t[i,j,c,b],{c}]
      -sum[f[k,c]*t[k,a]*t[i,j,c,b],{k,c}]
      -sum[f[k,j]*t[i,k,a,b],{k}]
      -sum[f[k,c]*t[j,c]*t[i,k,a,b],{k,c}]
      -sum[f[k,i]*t[j,k,b,a],{k}]
      -sum[f[k,c]*t[i,c]*t[j,k,b,a],{k,c}]
      +sum[t[i,c]*t[j,d]*v[a,b,c,d],{c,d}]
      +sum[t[i,j,c,d]*v[a,b,c,d],{c,d}]
      +sum[t[j,c]*v[a,b,i,c],{c}]
      -sum[t[k,b]*v[a,k,i,j],{k}]
      +sum[t[i,c]*v[b,a,j,c],{c}]
      -sum[t[k,a]*v[b,k,j,i],{k}]
      -sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d],{k,c,d}]
      -sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d],{k,c,d}]
      -sum[t[j,c]*t[k,b]*v[k,a,c,i],{k,c}]
      +2*sum[t[j,k,b,c]*v[k,a,c,i],{k,c}]
      -sum[t[j,k,c,b]*v[k,a,c,i],{k,c}]
      -sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c],{k,c,d}]
      +2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c],{k,c,d}]
      -sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c],{k,c,d}]
      -sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c],{k,c,d}]
      +2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c],{k,c,d}]
      -sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c],{k,c,d}]
      -sum[t[j,k,b,c]*v[k,a,i,c],{k,c}]
      -sum[t[i,c]*t[k,b]*v[k,a,j,c],{k,c}]
      -sum[t[i,k,c,b]*v[k,a,j,c],{k,c}]
      -sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d],{k,c,d}]
      -sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d],{k,c,d}]
      -sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d],{k,c,d}]
      +2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d],{k,c,d}]
      -sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d],{k,c,d}]
      -sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d],{k,c,d}]
      -sum[t[i,c]*t[k,a]*v[k,b,c,j],{k,c}]
      +2*sum[t[i,k,a,c]*v[k,b,c,j],{k,c}]
      -sum[t[i,k,c,a]*v[k,b,c,j],{k,c}]
      +2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c],{k,c,d}]
      -sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c],{k,c,d}]
      -sum[t[j,c]*t[k,a]*v[k,b,i,c],{k,c}]
      -sum[t[j,k,c,a]*v[k,b,i,c],{k,c}]
      -sum[t[i,k,a,c]*v[k,b,j,c],{k,c}]
      +sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d],{k,l,c,d}]
      -2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d],{k,l,c,d}]
      -2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d],{k,l,c,d}]
      +sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d],{k,l,c,d}]
      -2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d],{k,l,c,d}]
      -2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d],{k,l,c,d}]
      +sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d],{k,l,c,d}]
      -2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d],{k,l,c,d}]
      +sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d],{k,l,c,d}]
      +sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d],{k,l,c,d}]
      +sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d],{k,l,c,d}]
      +4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
      -2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}]
      -2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d],{k,l,c,d}]
      -2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}]
      +sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}]
      +sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
      +sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
      -2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d],{k,l,c,d}]
      -2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d],{k,l,c,d}]
      +sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i],{k,l,c}]
      +sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i],{k,l,c}]
      -2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i],{k,l,c}]
      +sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i],{k,l,c}]
      -2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i],{k,l,c}]
      +sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i],{k,l,c}]
      +sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i],{k,l,c}]
      +sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i],{k,l,c}]
      +sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j],{k,l,c}]
      +sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j],{k,l,c}]
      -2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j],{k,l,c}]
      +sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j],{k,l,c}]
      +sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j],{k,l,c}]
      +sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c],{k,l,c,d}]
      +sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c],{k,l,c,d}]
      +sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c],{k,l,c,d}]
      -2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c],{k,l,c,d}]
      -2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
      +sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
      +sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c],{k,l,c,d}]
      +sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c],{k,l,c,d}]
      +sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c],{k,l,c,d}]
      +sum[t[k,a]*t[l,b]*v[k,l,i,j],{k,l}]
      +sum[t[k,l,a,b]*v[k,l,i,j],{k,l}]
      +sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d],{k,l,c,d}]
      +sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d],{k,l,c,d}]
      +sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d],{k,l,c,d}]
      -2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d],{k,l,c,d}]
      +sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d],{k,l,c,d}]
      +sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d],{k,l,c,d}]
      +sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d],{k,l,c,d}]
      -2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j],{k,l,c}]
      +sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j],{k,l,c}]
      +sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j],{k,l,c}]
      +v[a,b,i,j]