Title: Applying Data Copy To Improve Memory Performance of General Array Computations

1. Applying Data Copy To Improve Memory Performance of General Array Computations
- Qing Yi
- University of Texas at San Antonio
2. Improving Memory Performance
[Figure: the memory hierarchy — registers, cache, and memory on each processor, with shared memory and a network connection between processors. Locality optimizations target the register and cache levels, parallelization targets shared memory, and communication optimizations target the network.]
3. Compiler Optimizations For Locality
- Computation optimizations
- Loop blocking, fusion, unroll-and-jam, interchange, unrolling
- Rearrange computations for better spatial and temporal locality
- Data-layout optimizations
- Static layout transformation
- A single layout for the entire application
- No additional overhead; tradeoff between different layout choices
- Dynamic layout transformation
- A different layout for each computation phase
- Flexible but could be expensive
- Combining computation and data transformations
- Static layout transformation: transform the layout first, then the computation
- Dynamic layout transformation: transform the computation first, then dynamically rearrange the layout
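To make the first bullet concrete, here is a minimal sketch of one such computation optimization, loop blocking, applied to matrix multiplication (the function name, matrix size `N`, and tile size `TS` are illustrative choices, not from the slides):

```c
#define N  64   /* matrix dimension (illustrative) */
#define TS 16   /* tile size (illustrative) */

/* Blocked matrix multiplication: the j and k loops are tiled so that
   the TS x TS sub-blocks of B and the corresponding columns of C stay
   in cache across the inner loops, improving temporal locality over
   the plain triple loop.  The result is identical to the unblocked
   version; only the iteration order changes. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
{
    for (int jj = 0; jj < N; jj += TS)
        for (int kk = 0; kk < N; kk += TS)
            for (int j = jj; j < jj + TS; ++j)
                for (int k = kk; k < kk + TS; ++k)
                    for (int i = 0; i < N; ++i)
                        C[i][j] += A[i][k] * B[k][j];
}
```

Choosing `TS` so that a tile's working set fits in cache is exactly the kind of machine-dependent tradeoff the later slides tune empirically.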
4. Array Copying --- Related Work
- Dynamic layout transformation for arrays
- Copy arrays into local buffers before the computation
- Copy modified local buffers back to the arrays
- Previous work
- Lam, Rothberg and Wolf; Temam, Granston and Jalby
- Copy arrays after loop blocking
- Assume arrays are accessed through affine expressions --- array copy always safe
- Static data layout transformations
- A single layout throughout the application --- always safe
- Optimizing irregular applications
- Data access patterns not known until runtime
- Dynamic layout transformation --- through libraries
- Scalar replacement
- Equivalent to copying single array elements into scalars
- Carr and Kennedy applied it to inner loops
- Question: how expensive is array copy? How difficult is it?
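Scalar replacement, mentioned in the last bullet, copies a repeatedly accessed array element into a scalar so the compiler can keep it in a register. A minimal sketch (the function and variable names are illustrative, not from the slides):

```c
#include <stddef.h>

/* Matrix-vector product y += A*x with scalar replacement applied to
   y[i].  Before the transformation, y[i] is read and written on every
   k iteration, so each access may go to memory.  Afterwards the
   running sum lives in the scalar s (a register candidate), and y[i]
   is touched only twice per i iteration. */
void matvec_scalar_replaced(size_t n, const double *A,
                            const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double s = y[i];              /* copy the element into a scalar */
        for (size_t k = 0; k < n; ++k)
            s += A[i * n + k] * x[k];
        y[i] = s;                     /* copy the scalar back */
    }
}
```

Carr and Kennedy's transformation generalizes this idea to all reusable references in inner loops; the sketch shows only the basic pattern.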
5. What Is New
- Array copy for arbitrary loop computations
- Stand-alone optimization, independent of blocking
- Works for computations with non-affine array accesses
- General array copy algorithm
- Works on arbitrarily shaped loops
- Automatically inserts copy operations to ensure safety
- Heuristics to reduce buffer size and copy cost
- Performs scalar replacement when specialized
- Applications
- Improve cache and register locality
- Both when combined with blocking and without blocking
- Future work
- Communication and parallelization
- Empirical tuning of optimizations
- Interface that allows different application heuristics
6. Apply Array Copy: Matrix Multiplication

    for (j = 0; j < n; ++j)
      for (k = 0; k < l; ++k)
        for (i = 0; i < m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m] * B[k+j*l];

[Figure: dependence graph over the references C[i+j*m] (read), C[i+j*m] (write), A[i+k*m], and B[k+j*l].]
- Step 1: build the dependence graph
- True, output, anti and input dependences between array accesses
- Is each dependence precise?
7. Apply Array Copy: Matrix Multiplication

    for (j = 0; j < n; ++j)
      for (k = 0; k < l; ++k)
        for (i = 0; i < m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m] * B[k+j*l];

[Figure: dependence graph over the references C[i+j*m] (read), C[i+j*m] (write), A[i+k*m], and B[k+j*l].]
- Steps 2-3: separate array references connected by imprecise dependences
- Impose an order on all array references
- C[i+j*m] (R) -> A[i+k*m] -> B[k+j*l] -> C[i+j*m] (W)
- Remove all back edges
- Apply typed fusion
8. Array Copy with Imprecise Dependences

    for (j = 0; j < n; ++j)
      for (k = 0; k < l; ++k)
        for (i = 0; i < m; ++i)
          C[f(i,j,m)] = C[i+j*m] + alpha * A[i+k*m] * B[k+j*l];

[Figure: dependence graph over C[f(i,j,m)], C[i+j*m], A[i+k*m], and B[k+j*l]; the edge between C[f(i,j,m)] and C[i+j*m] is imprecise.]
- Array references connected by imprecise dependences
- Cannot precisely determine a mapping between their subscripts
- May refer to the same location in some iterations but not in others
- Not safe to copy them into a single buffer
- Apply the typed-fusion algorithm to separate them
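A small illustration of why such dependences are imprecise: if the non-affine subscript is modeled by an indirection array `f` (an assumption for illustration; the slide's `f(i,j,m)` is left abstract), then whether the write `C[f[i]]` and the read `C[i]` of one iteration touch the same element depends on the contents of `f`, which are unknown at analysis time:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if, in some iteration i, the write C[f[i]] and the
   read C[i] touch the same element (a loop-independent dependence).
   A compiler cannot answer this statically because it depends on the
   runtime contents of f, so the two references must be kept in
   separate copy groups rather than sharing one buffer. */
bool refs_may_alias(const size_t *f, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (f[i] == i)
            return true;
    return false;
}
```

The typed-fusion step conservatively assumes aliasing whenever this question cannot be decided, which is exactly the "imprecise dependence" case above.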
9. Profitability of Applying Array Copy

    for (j = 0; j < n; ++j)
      for (k = 0; k < l; ++k)
        for (i = 0; i < m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m] * B[k+j*l];

[Figure: the loop nest annotated with the location to copy A (outside all loops), the location to copy C (inside the j loop), and the location to copy B (inside the k loop), together with the loop indices carrying reuse for each reference.]
- Each buffer should be copied at most twice (splitting groups if necessary)
- Determine the outermost location to perform each copy
- No interference with other groups
- Enforce a size limit on the buffer
- Constant size => scalar replacement
- Ensure reuse of copied elements
- Lower the copy position if no extra reuse is gained
10. Array Copy Result: Matrix Multiplication

    Copy A[0:m, 0:l] to A_buf;
    for (j = 0; j < n; ++j) {
      Copy C[0:m + j*m] to C_buf;
      for (k = 0; k < l; ++k) {
        Copy B[k+j*l] to B_buf;
        for (i = 0; i < m; ++i)
          C_buf[i] = C_buf[i] + alpha * A_buf[i+k*m] * B_buf;
      }
      Copy C_buf[0:m] back to C[0:m + j*m];
    }

- Dimensionality of buffers enforced by command-line options
- Can be applied to arbitrarily shaped loop structures
- Can be applied independently or after blocking
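The transformed code above can be written out as runnable C. A sketch assuming column-major, linearized arrays as in the earlier slides (the function name and the caller-supplied buffers are illustrative reconstructions):

```c
#include <string.h>

/* C += alpha*A*B with array copy applied as on the slide: A is copied
   once outside all loops, one column of C per j iteration, and one
   element of B per (j,k) iteration.  Arrays are column-major:
   C is m x n, A is m x l, B is l x n. */
void dgemm_copied(int m, int n, int l, double alpha,
                  const double *A, const double *B, double *C,
                  double *A_buf, double *C_buf)
{
    memcpy(A_buf, A, (size_t)m * l * sizeof(double));    /* copy A[0:m,0:l] */
    for (int j = 0; j < n; ++j) {
        memcpy(C_buf, C + (size_t)j * m,                 /* copy C[0:m + j*m] */
               (size_t)m * sizeof(double));
        for (int k = 0; k < l; ++k) {
            double B_buf = B[k + j * l];                 /* copy B[k+j*l] */
            for (int i = 0; i < m; ++i)
                C_buf[i] = C_buf[i] + alpha * A_buf[i + k * m] * B_buf;
        }
        memcpy(C + (size_t)j * m, C_buf,                 /* copy C_buf back */
               (size_t)m * sizeof(double));
    }
}
```

Note that `B_buf` degenerates to a scalar because a single element of B is reused across the entire i loop; this is the constant-size case from the previous slide that specializes to scalar replacement.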
11. Experiments
- Implementation
- Loop transformation framework by Yi, Kennedy and Adve
- ROSE compiler infrastructure at LLNL (Quinlan et al.)
- Benchmarks
- dgemm (matrix multiplication, LAPACK)
- dgetrf (LU factorization with partial pivoting, LAPACK)
- tomcatv (mesh generation with Thompson solver, SPEC95)
- Machines
- A Dell PC with two 2.2GHz Intel Xeon processors
- 512KB cache on each processor and 2GB memory
- An SGI workstation with a 195MHz R10000 processor
- 32KB 2-way L1, 1MB 4-way L2, and 256MB memory
- A single 8-way P655 node on an IBM terascale machine
- 32KB 2-way L1, 0.75MB 4-way L2, and 16MB 8-way L3 cache
- 16GB memory
12. DGEMM on Dell PC
13. DGEMM on SGI Workstation
14. DGEMM on IBM
15. Summary
- When should we apply scalar replacement?
- Profitable unless register pressure is too high
- 3-12% improvement observed
- No overhead
- When should we apply array copy?
- When regular conflict misses occur
- When prefetching of array elements is needed
- 10-40% improvement observed
- Overhead is 0.5-8% when not beneficial
- Optimizations not beneficial on the IBM node
- Neither blocking nor 2-dim copying is profitable there
- Integer operation overhead too high
- Will be further investigated