Title: Applying Data Copy To Improve Memory Performance of General Array Computations

1. Applying Data Copy To Improve Memory Performance of General Array Computations
- Qing Yi
- University of Texas at San Antonio
2. Improving Memory Performance
[Figure: the memory hierarchy — registers, cache, and memory on each processor, with shared memory and a network connection between processors. Locality optimizations target the register and cache levels, parallelization targets shared memory, and communication optimizations target the network.]
3. Compiler Optimizations For Locality
- Computation optimizations
- Loop blocking, fusion, unroll-and-jam, interchange, unrolling
- Rearrange computations for better spatial and temporal locality
- Data-layout optimizations
- Static layout transformation
- A single layout for the entire application
- No additional overhead; tradeoff between different layout choices
- Dynamic layout transformation
- A different layout for each computation phase
- Flexible but could be expensive
- Combining computation and data transformations
- Static layout transformation: transform the layout first, then the computation
- Dynamic layout transformation: transform the computation first, then dynamically rearrange the layout
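To make the first bullet concrete, here is a minimal sketch of one such computation optimization, loop blocking, applied to matrix multiplication (the function name, matrix size `N`, and tile size `TS` are illustrative choices, not from the slides):

```c
#define N  64   /* matrix dimension (illustrative) */
#define TS 16   /* tile size (illustrative) */

/* Blocked matrix multiplication: the j and k loops are tiled so that
   the TS x TS sub-blocks of B and the corresponding columns of C stay
   in cache across the inner loops, improving temporal locality over
   the plain triple loop.  The result is identical to the unblocked
   version; only the iteration order changes. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
{
    for (int jj = 0; jj < N; jj += TS)
        for (int kk = 0; kk < N; kk += TS)
            for (int j = jj; j < jj + TS; ++j)
                for (int k = kk; k < kk + TS; ++k)
                    for (int i = 0; i < N; ++i)
                        C[i][j] += A[i][k] * B[k][j];
}
```

Choosing `TS` so that a tile's working set fits in cache is exactly the kind of machine-dependent tradeoff the later slides tune empirically.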
4. Array Copying --- Related Work
- Dynamic layout transformation for arrays
- Copy arrays into local buffers before the computation
- Copy modified local buffers back to the arrays
- Previous work
- Lam, Rothberg and Wolf; Temam, Granston and Jalby
- Copy arrays after loop blocking
- Assume arrays are accessed through affine expressions --- array copy always safe
- Static data layout transformations
- A single layout throughout the application --- always safe
- Optimizing irregular applications
- Data access patterns not known until runtime
- Dynamic layout transformation --- through libraries
- Scalar replacement
- Equivalent to copying single array elements into scalars
- Carr and Kennedy applied it to inner loops
- Question: how expensive is array copy? How difficult is it?
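Scalar replacement, mentioned in the last bullet, copies a repeatedly accessed array element into a scalar so the compiler can keep it in a register. A minimal sketch (the function and variable names are illustrative, not from the slides):

```c
#include <stddef.h>

/* Matrix-vector product y += A*x with scalar replacement applied to
   y[i].  Before the transformation, y[i] is read and written on every
   k iteration, so each access may go to memory.  Afterwards the
   running sum lives in the scalar s (a register candidate), and y[i]
   is touched only twice per i iteration. */
void matvec_scalar_replaced(size_t n, const double *A,
                            const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double s = y[i];              /* copy the element into a scalar */
        for (size_t k = 0; k < n; ++k)
            s += A[i * n + k] * x[k];
        y[i] = s;                     /* copy the scalar back */
    }
}
```

Carr and Kennedy's transformation generalizes this idea to all reusable references in inner loops; the sketch shows only the basic pattern.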
5. What Is New
- Array copy for arbitrary loop computations
- Stand-alone optimization, independent of blocking
- Works for computations with non-affine array accesses
- General array copy algorithm
- Works on arbitrarily shaped loops
- Automatically inserts copy operations to ensure safety
- Heuristics to reduce buffer size and copy cost
- Performs scalar replacement when specialized
- Applications
- Improve cache and register locality
- Both when combined with blocking and without blocking
- Future work
- Communication and parallelization
- Empirical tuning of optimizations
- Interface that allows different application heuristics
6. Apply Array Copy: Matrix Multiplication

    for (j = 0; j < n; ++j)
      for (k = 0; k < l; ++k)
        for (i = 0; i < m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m] * B[k+j*l];

[Figure: dependence graph over the references C[i+j*m] (read), C[i+j*m] (write), A[i+k*m], and B[k+j*l].]
- Step 1: build the dependence graph
- True, output, anti and input dependences between array accesses
- Is each dependence precise?
7. Apply Array Copy: Matrix Multiplication

    for (j = 0; j < n; ++j)
      for (k = 0; k < l; ++k)
        for (i = 0; i < m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m] * B[k+j*l];

[Figure: dependence graph over the references C[i+j*m] (read), C[i+j*m] (write), A[i+k*m], and B[k+j*l].]
- Steps 2-3: separate array references connected by imprecise dependences
- Impose an order on all array references
- C[i+j*m] (R) -> A[i+k*m] -> B[k+j*l] -> C[i+j*m] (W)
- Remove all back edges
- Apply typed fusion
8. Array Copy with Imprecise Dependences

    for (j = 0; j < n; ++j)
      for (k = 0; k < l; ++k)
        for (i = 0; i < m; ++i)
          C[f(i,j,m)] = C[i+j*m] + alpha * A[i+k*m] * B[k+j*l];

[Figure: dependence graph over C[f(i,j,m)], C[i+j*m], A[i+k*m], and B[k+j*l]; the edge between C[f(i,j,m)] and C[i+j*m] is imprecise.]
- Array references connected by imprecise dependences
- Cannot precisely determine a mapping between their subscripts
- May refer to the same location in some iterations but not in others
- Not safe to copy them into a single buffer
- Apply the typed-fusion algorithm to separate them
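A small illustration of why such dependences are imprecise: if the non-affine subscript is modeled by an indirection array `f` (an assumption for illustration; the slide's `f(i,j,m)` is left abstract), then whether the write `C[f[i]]` and the read `C[i]` of one iteration touch the same element depends on the contents of `f`, which are unknown at analysis time:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if, in some iteration i, the write C[f[i]] and the
   read C[i] touch the same element (a loop-independent dependence).
   A compiler cannot answer this statically because it depends on the
   runtime contents of f, so the two references must be kept in
   separate copy groups rather than sharing one buffer. */
bool refs_may_alias(const size_t *f, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (f[i] == i)
            return true;
    return false;
}
```

The typed-fusion step conservatively assumes aliasing whenever this question cannot be decided, which is exactly the "imprecise dependence" case above.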
9. Profitability of Applying Array Copy

    for (j = 0; j < n; ++j)
      for (k = 0; k < l; ++k)
        for (i = 0; i < m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m] * B[k+j*l];

[Figure: the loop nest annotated with the location to copy A (outside all loops), the location to copy C (inside the j loop), and the location to copy B (inside the k loop), together with the loop indices carrying reuse for each reference.]
- Each buffer should be copied at most twice (splitting groups if necessary)
- Determine the outermost location to perform each copy
- No interference with other groups
- Enforce a size limit on the buffer
- Constant size => scalar replacement
- Ensure reuse of copied elements
- Lower the copy position if no extra reuse is gained
10. Array Copy Result: Matrix Multiplication

    Copy A[0:m, 0:l] to A_buf;
    for (j = 0; j < n; ++j) {
      Copy C[0:m + j*m] to C_buf;
      for (k = 0; k < l; ++k) {
        Copy B[k+j*l] to B_buf;
        for (i = 0; i < m; ++i)
          C_buf[i] = C_buf[i] + alpha * A_buf[i+k*m] * B_buf;
      }
      Copy C_buf[0:m] back to C[0:m + j*m];
    }

- Dimensionality of buffers enforced by command-line options
- Can be applied to arbitrarily shaped loop structures
- Can be applied independently or after blocking
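The transformed code above can be written out as runnable C. A sketch assuming column-major, linearized arrays as in the earlier slides (the function name and the caller-supplied buffers are illustrative reconstructions):

```c
#include <string.h>

/* C += alpha*A*B with array copy applied as on the slide: A is copied
   once outside all loops, one column of C per j iteration, and one
   element of B per (j,k) iteration.  Arrays are column-major:
   C is m x n, A is m x l, B is l x n. */
void dgemm_copied(int m, int n, int l, double alpha,
                  const double *A, const double *B, double *C,
                  double *A_buf, double *C_buf)
{
    memcpy(A_buf, A, (size_t)m * l * sizeof(double));    /* copy A[0:m,0:l] */
    for (int j = 0; j < n; ++j) {
        memcpy(C_buf, C + (size_t)j * m,                 /* copy C[0:m + j*m] */
               (size_t)m * sizeof(double));
        for (int k = 0; k < l; ++k) {
            double B_buf = B[k + j * l];                 /* copy B[k+j*l] */
            for (int i = 0; i < m; ++i)
                C_buf[i] = C_buf[i] + alpha * A_buf[i + k * m] * B_buf;
        }
        memcpy(C + (size_t)j * m, C_buf,                 /* copy C_buf back */
               (size_t)m * sizeof(double));
    }
}
```

Note that `B_buf` degenerates to a scalar because a single element of B is reused across the entire i loop; this is the constant-size case from the previous slide that specializes to scalar replacement.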
11. Experiments
- Implementation
- Loop transformation framework by Yi, Kennedy and Adve
- ROSE compiler infrastructure at LLNL (Quinlan et al.)
- Benchmarks
- dgemm (matrix multiplication, LAPACK)
- dgetrf (LU factorization with partial pivoting, LAPACK)
- tomcatv (mesh generation with Thompson solver, SPEC95)
- Machines
- A Dell PC with two 2.2GHz Intel Xeon processors
- 512KB cache on each processor and 2GB memory
- An SGI workstation with a 195MHz R10000 processor
- 32KB 2-way L1, 1MB 4-way L2, and 256MB memory
- A single 8-way P655 node on an IBM terascale machine
- 32KB 2-way L1, 0.75MB 4-way L2, and 16MB 8-way L3 cache
- 16GB memory
12. DGEMM on Dell PC
13. DGEMM on SGI Workstation
14. DGEMM on IBM
15. Summary
- When should we apply scalar replacement?
- Profitable unless register pressure is too high
- 3-12% improvement observed
- No overhead
- When should we apply array copy?
- When regular conflict misses occur
- When prefetching of array elements is needed
- 10-40% improvement observed
- Overhead is 0.5-8% when not beneficial
- Optimizations not beneficial on the IBM node
- Neither blocking nor 2-dim copying is profitable there
- Integer operation overhead too high
- Will be further investigated