Applying Data Copy To Improve Memory Performance of General Array Computations

1
Applying Data Copy To Improve Memory Performance
of General Array Computations
  • Qing Yi
  • University of Texas at San Antonio

2
Improving Memory Performance
[Diagram: per-processor memory hierarchy of registers, cache, and memory, connected through shared memory and a network. Locality optimizations target the cache level, parallelization targets shared memory, and communication optimizations target the network connection.]
3
Compiler Optimizations For Locality
  • Computation optimizations
    • Loop blocking, fusion, unroll-and-jam, interchange, unrolling
      (a blocking sketch follows this list)
    • Rearrange computations for better spatial and temporal locality
  • Data-layout optimizations
    • Static layout transformation
      • A single layout for the entire application
      • No additional overhead, but a tradeoff between different layout choices
    • Dynamic layout transformation
      • A different layout for each computation phase
      • Flexible, but can be expensive
  • Combining computation and data transformations
    • Static layout transformation: transform the layout first, then the computation
    • Dynamic layout transformation: transform the computation first, then dynamically rearrange the layout
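
A minimal sketch (not from the slides) of loop blocking, the first computation optimization listed above, applied to the dgemm-style loop nest used later in this talk; the tile size BS, the MIN helper, and the function name dgemm_blocked are illustrative assumptions.

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define BS 64   /* assumed tile size; tuned for the target cache in practice */

/* C is m x n, A is m x l, B is l x n, stored column-major and
   accessed through the same linearized subscripts as on the slides. */
void dgemm_blocked(int m, int n, int l, double alpha,
                   const double *A, const double *B, double *C)
{
  for (int jj = 0; jj < n; jj += BS)
    for (int kk = 0; kk < l; kk += BS)
      /* the (kk, jj) tile of B and the matching column strip of A
         stay in cache while the inner loops reuse them */
      for (int j = jj; j < MIN(jj + BS, n); ++j)
        for (int k = kk; k < MIN(kk + BS, l); ++k)
          for (int i = 0; i < m; ++i)
            C[i + j*m] += alpha * A[i + k*m] * B[k + j*l];
}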

4
Array Copying --- Related Work
  • Dynamic layout transformation for arrays
    • Copy arrays into local buffers before the computation
    • Copy modified local buffers back to the arrays
  • Previous work
    • Lam, Rothberg and Wolf; Temam, Granston and Jalby
      • Copy arrays after loop blocking
      • Assume arrays are accessed through affine expressions
      • Array copy is always safe
    • Static data layout transformations
      • A single layout throughout the application --- always safe
    • Optimizing irregular applications
      • Data access patterns are not known until runtime
      • Dynamic layout transformation --- through libraries
    • Scalar replacement
      • Equivalent to copying a single array element into a scalar (a sketch follows this list)
      • Carr and Kennedy applied it to inner loops
  • Question: how expensive is array copy? How difficult is it?
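
A minimal sketch (mine, not from the slides) of scalar replacement on the matrix-multiply loop nest used later in this talk: B[k + j*l] is invariant in the innermost i loop, so it is copied into a scalar once and reused from a register. The function name dgemm_scalar_replaced and the variable b are assumptions.

/* Scalar replacement applied to the plain dgemm loop nest. */
void dgemm_scalar_replaced(int m, int n, int l, double alpha,
                           const double *A, const double *B, double *C)
{
  for (int j = 0; j < n; ++j)
    for (int k = 0; k < l; ++k) {
      double b = B[k + j*l];    /* copy one array element into a scalar */
      for (int i = 0; i < m; ++i)
        C[i + j*m] += alpha * A[i + k*m] * b;
    }
}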

5
What is New
  • Array copy for arbitrary loop computations
    • Stand-alone optimization, independent of blocking
    • Works for computations with non-affine array accesses
  • General array copy algorithm
    • Works on arbitrarily shaped loops
    • Automatically inserts copy operations to ensure safety
    • Heuristics to reduce buffer size and copy cost
    • Performs scalar replacement as a special case (single-element buffers)
  • Applications
    • Improve cache and register locality
    • Both with and without loop blocking
  • Future work
    • Communication and parallelization
    • Empirical tuning of optimizations
    • An interface that allows different application heuristics

6
Apply Array Copy: Matrix Multiplication
for (j = 0; j < n; ++j)
  for (k = 0; k < l; ++k)
    for (i = 0; i < m; ++i)
      C[i + j*m] = C[i + j*m] + alpha * A[i + k*m] * B[k + j*l];

[Dependence graph over the array references C[i + j*m] (read), C[i + j*m] (write), A[i + k*m], and B[k + j*l]]
  • Step 1: build the dependence graph
    • True, output, anti, and input dependences between array accesses
    • Is each dependence precise?

7
Apply Array Copy: Matrix Multiplication
for (j = 0; j < n; ++j)
  for (k = 0; k < l; ++k)
    for (i = 0; i < m; ++i)
      C[i + j*m] = C[i + j*m] + alpha * A[i + k*m] * B[k + j*l];

[Dependence graph over the array references C[i + j*m] (read), C[i + j*m] (write), A[i + k*m], and B[k + j*l]]
  • Steps 2-3: Separate array references connected by imprecise dependences
    • Impose an order on all array references:
      C[i + j*m] (R) -> A[i + k*m] -> B[k + j*l] -> C[i + j*m] (W)
    • Remove all back edges
    • Apply typed fusion

8
Array Copy with Imprecise Dependences
for (j = 0; j < n; ++j)
  for (k = 0; k < l; ++k)
    for (i = 0; i < m; ++i)
      C[f(i,j,m)] = C[i + j*m] + alpha * A[i + k*m] * B[k + j*l];

[Dependence graph: the write C[f(i,j,m)] and the read C[i + j*m] are connected by an imprecise dependence; A[i + k*m] and B[k + j*l] are separate nodes]
  • Array references connected by imprecise dependences
    • Cannot precisely determine a mapping between their subscripts
    • May sometimes refer to the same location and sometimes not
    • Not safe to copy them into a single buffer (see the sketch below)
    • Apply the typed-fusion algorithm to separate them
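
A minimal sketch (not from the slides) of the kind of reference the slide's f(i,j,m) stands for: here an index array idx makes the subscript unanalyzable at compile time. The function name, idx, and the beta update are purely illustrative.

/* The compiler cannot tell whether C[idx[i]] and C[i + j*m] name the
   same element, so it is not safe to copy either reference into a
   private buffer shared with the other; typed fusion places them in
   separate copy groups (or leaves them uncopied). */
void update(int m, int n, int l, double alpha, double beta,
            const int *idx, const double *A, const double *B, double *C)
{
  for (int j = 0; j < n; ++j)
    for (int k = 0; k < l; ++k)
      for (int i = 0; i < m; ++i) {
        C[idx[i]]  += alpha * A[i + k*m] * B[k + j*l];  /* imprecise */
        C[i + j*m] += beta * A[i + k*m];                /* precise   */
      }
}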

9
Profitability of Applying Array Copy
for (j = 0; j < n; ++j)
  for (k = 0; k < l; ++k)
    for (i = 0; i < m; ++i)
      C[i + j*m] = C[i + j*m] + alpha * A[i + k*m] * B[k + j*l];

[Figure: the loop nest annotated with the chosen copy locations (A outside all loops, C inside the j loop, B inside the k loop, matching the result on the next slide), and the dependence graph over C[i + j*m] (read), C[i + j*m] (write), A[i + k*m], and B[k + j*l] with edges labeled by the loops (i, j, k) that carry each reuse.]
  • Each buffer should be copied at most twice (splitting groups otherwise)
  • Determine the outermost location to perform the copy
    • No interference with other groups
    • Enforce a size limit on the buffer
      • Constant size -> scalar replacement
  • Ensure reuse of copied elements (a copy-cost tally follows this list)
    • Lower the copy position if no extra reuse is gained
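
A rough copy-cost tally for this placement (my own arithmetic, not from the slides), using the buffers chosen on the next slide:
  • A_buf holds m*l elements and is copied once, outside all loops; each copied element is reused n times (once per j iteration).
  • C_buf holds m elements and is copied in and back once per j iteration, about 2*m*n element copies in total; each copied element is reused across the l iterations of the k loop.
  • B_buf holds a single element copied once per (j, k) pair, n*l copies in total; each is reused m times (once per i iteration).
Measured against the m*n*l multiply-adds of the loop nest, every copy is amortized over many reuses, which is what makes the copying profitable here.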

10
Array Copy Result: Matrix Multiplication
Copy A[0:m, 0:l] to A_buf
for (j = 0; j < n; ++j) {
  Copy C[0:m, j*m] to C_buf
  for (k = 0; k < l; ++k) {
    Copy B[k + j*l] to B_buf
    for (i = 0; i < m; ++i)
      C_buf[i] = C_buf[i] + alpha * A_buf[i + k*m] * B_buf;
  }
  Copy C_buf[0:m] back to C[0:m, j*m]
}
  • Dimensionality of the buffers is enforced by command-line options
  • Can be applied to arbitrarily shaped loop structures
  • Can be applied independently or after blocking (a runnable C sketch follows)
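
A runnable C rendering of the copied loop nest above (my own sketch, not the tool's actual output); the buffer names follow the slide, while the function name dgemm_copied and the malloc/memcpy details are assumptions.

#include <stdlib.h>
#include <string.h>

/* dgemm-style kernel after array copy: A_buf holds all of A,
   C_buf holds one column of C per j iteration, and B_buf is the
   scalar-replaced element of B. */
void dgemm_copied(int m, int n, int l, double alpha,
                  const double *A, const double *B, double *C)
{
  double *A_buf = malloc((size_t)m * l * sizeof *A_buf);
  double *C_buf = malloc((size_t)m * sizeof *C_buf);
  memcpy(A_buf, A, (size_t)m * l * sizeof *A_buf);          /* copy A once */
  for (int j = 0; j < n; ++j) {
    memcpy(C_buf, C + (size_t)j * m, (size_t)m * sizeof *C_buf); /* copy column j of C */
    for (int k = 0; k < l; ++k) {
      double B_buf = B[k + j*l];                            /* copy one element of B */
      for (int i = 0; i < m; ++i)
        C_buf[i] = C_buf[i] + alpha * A_buf[i + k*m] * B_buf;
    }
    memcpy(C + (size_t)j * m, C_buf, (size_t)m * sizeof *C_buf); /* copy column j back */
  }
  free(A_buf);
  free(C_buf);
}

In this 1-D form the A copy keeps A's original layout, so it mainly illustrates where the copy can legally be placed; the benefit on the slides comes from removing conflict misses on C, from the scalar-replaced B element, and from reshaping the buffers when a different dimensionality is requested.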

11
Experiments
  • Implementation
    • Loop transformation framework by Yi, Kennedy and Adve
    • ROSE compiler infrastructure at LLNL (Quinlan et al.)
  • Benchmarks
    • dgemm (matrix multiplication, LAPACK)
    • dgetrf (LU factorization with partial pivoting, LAPACK)
    • tomcatv (mesh generation with Thompson solver, SPEC95)
  • Machines
    • A Dell PC with two 2.2GHz Intel Xeon processors
      • 512KB cache on each processor and 2GB memory
    • An SGI workstation with a 195MHz R10000 processor
      • 32KB 2-way L1, 1MB 4-way L2, and 256MB memory
    • A single 8-way P655 node of an IBM terascale machine
      • 32KB 2-way L1, 0.75MB 4-way L2, and 16MB 8-way L3 caches
      • 16GB memory

12
DGEMM on Dell PC
13
DGEMM on SGI Workstation
14
DGEMM on IBM
15
Summary
  • When should we apply scalar replacement?
    • Profitable unless register pressure is too high
    • 3-12% improvement observed
    • No overhead
  • When should we apply array copy?
    • When regular conflict misses occur
    • When prefetching of array elements is needed
    • 10-40% improvement observed
    • Overhead is 0.5-8% when not beneficial
  • Optimizations not beneficial on the IBM node
    • Neither blocking nor 2-dimensional copying is profitable
    • The integer-operation overhead is too high
    • To be investigated further