1
Register Pressure Guided Unroll-and-Jam
  • Authors: Yin Ma, Steven Carr

2
Motivation
  • In a processor, registers sit at the fastest
    level of the memory hierarchy, but the number of
    physical registers is very limited.
  • Unroll-and-jam in the loop model of Open64 not
    only increases register pressure by itself but
    also creates opportunities for other loop
    optimizations to increase register pressure
    indirectly.
  • If a transformed loop demands too many registers,
    overall performance may degrade.
  • Given a loop nest, a better register pressure
    prediction and a suitably chosen unroll factor
    can eliminate this degradation and achieve
    better overall performance.

3
Research Topic
  • A register pressure prediction algorithm for
    unroll-and-jam
  • A register pressure guided loop model for
    unroll-and-jam

4
Background: Data Dependence Analysis
True dependence:    S1: L1 = ...    S2: ... = L2
Anti-dependence:    S1: ... = L1    S2: L2 = ...
Output dependence:  S1: L1 = ...    S2: L2 = ...
Input dependence:   S1: ... = L1    S2: ... = L2
  • The data dependence graph (DDG) is a directed
    graph that represents the data dependence
    relationship among instructions.
  • A true dependence exists when L1 stores into a
    memory location that is read by L2 later.
  • An anti-dependence exists if L1 is a read from a
    memory location that is written by L2 later.
  • An output dependence exists when L1 and L2 store
    into the same memory location.
  • An input dependence exists if a memory location
    is read by L1 and L2.
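
The four cases above depend only on whether each of the two accesses to the same location is a read or a write. A minimal sketch of that classification follows; the function name `classify_dependence` and its boolean parameters are illustrative and not taken from the slides.

```python
def classify_dependence(first_is_write: bool, second_is_write: bool) -> str:
    """Classify the dependence between two accesses to the same memory location.

    `first_is_write` describes the earlier access (L1 on the slide),
    `second_is_write` the later one (L2).
    """
    if first_is_write and not second_is_write:
        return "true"    # write, then read
    if not first_is_write and second_is_write:
        return "anti"    # read, then write
    if first_is_write and second_is_write:
        return "output"  # write, then write
    return "input"       # read, then read


if __name__ == "__main__":
    for first, second in [(True, False), (False, True), (True, True), (False, False)]:
        print(first, second, classify_dependence(first, second))
```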

5
Background: Scalar Replacement
  • Uses scalars, which are later allocated to
    registers, to replace array references and so
    decrease the number of memory references in loops
  • This directly increases register pressure

for (i = 2; i < n; i++)
    a[i] = a[i-1] + b[i];

Scalar replaced:
T = a[1];
for (i = 2; i < n; i++) {
    T = T + b[i];
    a[i] = T;
}
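
The effect of scalar replacement on memory traffic can be checked with a small model. The sketch below assumes the recurrence combines its operands with `+`, uses 0-based indexing instead of the slide's loop bounds, and counts array loads by hand; the function names are illustrative only.

```python
def recurrence_plain(a, b):
    """a[i] = a[i-1] + b[i], loading a[i-1] from the array on every iteration."""
    a = list(a)
    array_loads = 0
    for i in range(1, len(a)):
        prev = a[i - 1]          # load a[i-1]
        cur_b = b[i]             # load b[i]
        array_loads += 2
        a[i] = prev + cur_b
    return a, array_loads


def recurrence_scalar_replaced(a, b):
    """Same recurrence, with the running value kept in the scalar t."""
    a = list(a)
    t = a[0]                     # single load of a[0] before the loop
    array_loads = 1
    for i in range(1, len(a)):
        t = t + b[i]             # only b[i] is loaded inside the loop
        array_loads += 1
        a[i] = t
    return a, array_loads


if __name__ == "__main__":
    a0, b0 = [1, 0, 0, 0, 0], [0, 1, 2, 3, 4]
    print(recurrence_plain(a0, b0))
    print(recurrence_scalar_replaced(a0, b0))
```

Both versions produce the same array, but the scalar-replaced one performs fewer array loads, at the cost of one value (`t`) that must stay in a register for the whole loop.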
6
Background: Unroll-and-Jam
  • Creates larger loop bodies by unrolling an outer
    loop and fusing (jamming) the resulting copies of
    the inner loop
  • Larger loop bodies make other optimizations
    create more register pressure

Original loop nest:
for (I = 1; I < 10; I++)
    for (J = 1; J < 5; J++)
        A[I][J] = B[J] + C[J] + D[I][J] + E[I][J] + F[I][J];

Unroll-and-jammed and later scalar replaced:
for (I = 1; I < 10; I = I + 2)
    for (J = 1; J < 5; J++) {
        b = B[J];
        c = C[J];
        A[I][J] = b + c + D[I][J] + E[I][J] + F[I][J];
        A[I+1][J] = b + c + D[I+1][J] + E[I+1][J] + F[I+1][J];
        /* register pressure increased because b and c
           hold two registers that originally could be
           reused for E and F */
    }
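
As a sanity check on the transformation, the sketch below unroll-and-jams an outer loop by a factor of 2 and compares the result with the untransformed nest. It assumes the statement combines its operands with `+`, uses 0-based indices, and adds a cleanup loop for odd trip counts; all function names are illustrative.

```python
def original_nest(B, C, D, E, F, n_i, n_j):
    A = [[0] * n_j for _ in range(n_i)]
    for i in range(n_i):
        for j in range(n_j):
            A[i][j] = B[j] + C[j] + D[i][j] + E[i][j] + F[i][j]
    return A


def unroll_and_jam_by_2(B, C, D, E, F, n_i, n_j):
    A = [[0] * n_j for _ in range(n_i)]
    i = 0
    while i + 1 < n_i:                 # outer loop advances by 2
        for j in range(n_j):           # the two inner-loop copies are jammed
            b = B[j]                   # scalar-replaced: loaded once,
            c = C[j]                   # reused by both copies
            A[i][j] = b + c + D[i][j] + E[i][j] + F[i][j]
            A[i + 1][j] = b + c + D[i + 1][j] + E[i + 1][j] + F[i + 1][j]
        i += 2
    for rest in range(i, n_i):         # cleanup when n_i is odd
        for j in range(n_j):
            A[rest][j] = B[j] + C[j] + D[rest][j] + E[rest][j] + F[rest][j]
    return A


if __name__ == "__main__":
    n_i, n_j = 5, 3
    B, C = [1, 2, 3], [10, 20, 30]
    D = [[i + j for j in range(n_j)] for i in range(n_i)]
    E = [[i * j for j in range(n_j)] for i in range(n_i)]
    F = [[i - j for j in range(n_j)] for i in range(n_i)]
    print(original_nest(B, C, D, E, F, n_i, n_j) == unroll_and_jam_by_2(B, C, D, E, F, n_i, n_j))
```

In the jammed body, `b` and `c` stay live across both copies of the statement, which is the source of the extra register pressure.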
7
Background: Software Pipelining
  • Software pipelining is an advanced scheduling
    technique. Usually, more heavily overlapped
    instructions demand additional registers.
  • The initiation interval (II) of a loop is the
    number of cycles between the start of successive
    iterations.
  • The resource II (ResII) gives the minimum number
    of cycles needed to execute the loop based upon
    machine resources such as the number of
    functional units.
  • The recurrence II (RecII) gives the minimum
    number of cycles needed for a single iteration
    based upon the length of the cycles in the data
    dependence graph.
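
The two lower bounds can be sketched with the textbook modulo-scheduling formulas, which may differ from the heuristic the authors use. The sketch assumes each operation occupies one functional unit for one cycle and that the dependence cycles of the DDG have already been enumerated as (total latency, total iteration distance) pairs; `res_ii` and `rec_ii` are illustrative names.

```python
def res_ii(op_counts, unit_counts):
    """Resource bound: for each resource class, ceil(operations / units); take the max."""
    return max(-(-op_counts[r] // unit_counts[r]) for r in op_counts)


def rec_ii(cycles):
    """Recurrence bound: for each DDG cycle, ceil(latency sum / distance sum); take the max.

    `cycles` is a list of (latency_sum, distance_sum) pairs; 0 if the DDG has no cycles.
    """
    return max((-(-lat // dist) for lat, dist in cycles), default=0)


if __name__ == "__main__":
    # Hypothetical loop: 3 memory ops on 2 memory units, 2 FP ops on 1 FP unit.
    print(res_ii({"mem": 3, "fp": 2}, {"mem": 2, "fp": 1}))
    # Hypothetical recurrences: latency 4 over distance 1, latency 6 over distance 2.
    print(rec_ii([(4, 1), (6, 2)]))
```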

Original loop: do N times
Software pipelined, due to dependences among the operations:
  Prelude:    D(1) B(1) D(2)
  Loop body:  do N-2 times (with index i): A(i) C(i) B(i+1) D(i+2)
  Postlude:   A(N-1) C(N-1) B(N) A(N) C(N)
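
One way to check a prelude / loop body / postlude split is to expand it and confirm that every operation still runs exactly once per original iteration. The sketch below assumes the loop body issues A(i), C(i), B(i+1), D(i+2) and that the original loop runs operations A, B, C, D for iterations 1 to N; the function name is illustrative.

```python
def pipelined_schedule(n):
    """Expand prelude, N-2 loop bodies, and postlude into a flat list of (op, iteration)."""
    if n < 2:
        raise ValueError("this expansion needs at least 2 iterations")
    ops = [("D", 1), ("B", 1), ("D", 2)]                       # prelude
    for i in range(1, n - 1):                                  # loop body, N-2 times
        ops += [("A", i), ("C", i), ("B", i + 1), ("D", i + 2)]
    ops += [("A", n - 1), ("C", n - 1), ("B", n),              # postlude
            ("A", n), ("C", n)]
    return ops


if __name__ == "__main__":
    n = 5
    expected = sorted((op, i) for op in "ABCD" for i in range(1, n + 1))
    print(sorted(pipelined_schedule(n)) == expected)
```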
8
Typical approaches to preventing degradation from
register pressure
  • Predictive approach (our approach)
      • Predict the effects before applying
        optimizations and decide on the best set of
        parameters for them
      • Fastest, and fits all situations
  • Iterative approach (e.g., feedback-based)
      • Apply optimizations with one set of
        parameters, then redo them with adjusted
        parameters for better performance
  • Genetic approach
      • Prepare many sets of parameters and apply
        optimizations with each set. Use genetic
        programming to pick the best
9
Problem in Previous Work
  • All existing register pressure prediction
    methods are designed for software pipelining.
  • They do not support source-code-level loop
    optimizations at all.
  • There is no systematic research on how to predict
    register pressure for loop optimizations.
  • There is no register pressure guided loop model.

10
Key Design Detail
  • Prediction algorithms work at the source-code
    level
  • Prediction algorithms handle the effects on
    register pressure from
      • unroll-and-jam
      • scalar replacement
      • software pipelining
      • general scalar optimizations
  • The register pressure guided loop model uses the
    predicted register information to pick an unroll
    vector for the best performance

11
Register Prediction for Unroll-and-Jam (Overview)
  • Compute RecII with our heuristic method
  • Create the list of arrays that will be replaced
    by scalars by checking the original DDG
  • Construct a new DDG, D1, for the original loop
    only, using the list above
  • All unrolled copies reuse D1 as their base DDG
  • Adjust each copy of the DDG to reflect the future
    changes
  • Recompute ResII to get MinII
  • Do a pseudo schedule to get the register pressure
12
Construct the base DDG
  • Traverse the innermost loop and construct the
    base DDG

DO J = 1, N
  DO I = 1, N
    U(I,J) = V(I) + P(J,I)
  ENDDO
ENDDO
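
A base DDG for a single statement can be modelled very simply. The sketch below is a toy representation, not Open64's data structure: nodes are strings, edges are (producer, consumer) pairs, and the statement is assumed to combine V(I) and P(J,I) with one `add` operation. The helper name `build_base_ddg` is hypothetical.

```python
def build_base_ddg(target, sources, op="add"):
    """Build a tiny DDG for one statement `target = op(sources...)`.

    Each source reference becomes a load node feeding the operation,
    and the operation feeds a store to the target reference.
    """
    loads = [f"load {ref}" for ref in sources]
    store = f"store {target}"
    nodes = loads + [op, store]
    edges = [(ld, op) for ld in loads] + [(op, store)]
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = build_base_ddg("U(I,J)", ["V(I)", "P(J,I)"])
    print(nodes)
    print(edges)
```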
13
Prepare the DDG after unroll-and-jam
  • Duplicate the base DDG according to the given
    unroll factors

DO J = 1, N, 2
  DO I = 1, N
    U(I,J) = V(I) + P(J,I)
    U(I,J+1) = V(I) + P(J+1,I)
  ENDDO
ENDDO
Unroll vector is 2
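
Duplicating the base DDG for an unroll factor can be sketched as follows. This is a toy model under stated assumptions: nodes are label strings, each copy k is tagged as `(k, label)`, and the unrolled index is rewritten by plain string replacement (`J` becomes `J+k`), which only works for labels as simple as the ones used here. The function name is hypothetical.

```python
def duplicate_ddg(nodes, edges, unroll_factor, var="J"):
    """Return `unroll_factor` copies of a base DDG, one per unrolled iteration."""
    def rename(label, k):
        # Copy 0 keeps the original subscripts; copy k uses var+k.
        return label if k == 0 else label.replace(var, f"{var}+{k}")

    all_nodes, all_edges = [], []
    for k in range(unroll_factor):
        all_nodes += [(k, rename(n, k)) for n in nodes]
        all_edges += [((k, rename(s, k)), (k, rename(d, k))) for s, d in edges]
    return all_nodes, all_edges


if __name__ == "__main__":
    base_nodes = ["load V(I)", "load P(J,I)", "add", "store U(I,J)"]
    base_edges = [("load V(I)", "add"), ("load P(J,I)", "add"), ("add", "store U(I,J)")]
    nodes, edges = duplicate_ddg(base_nodes, base_edges, 2)
    for node in nodes:
        print(node)
```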
14
Finalize the DDG
  • Remove unnecessary nodes/edges and add new edges
  • Base the changes on the updated dependences
  • Reflect the effect of further optimizations
  • Consider array indexing reuse by analyzing array
    subscripts
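
One kind of adjustment this step implies is removing loads that become redundant once copies share an array reference, for example V(I), which does not depend on the unrolled index J. The sketch below models only that case on a hand-written two-copy DDG; the node format `(copy, label)` and the function name are assumptions for illustration.

```python
def merge_redundant_loads(nodes, edges):
    """Keep one node per distinct load label and redirect edges to the kept node."""
    keeper = {}   # load label -> first node seen with that label
    remap = {}    # every node -> the node that represents it afterwards
    for node in nodes:
        _, label = node
        if label.startswith("load "):
            remap[node] = keeper.setdefault(label, node)
        else:
            remap[node] = node
    new_nodes = [n for n in nodes if remap[n] == n]
    # dict.fromkeys removes duplicate edges while preserving order
    new_edges = list(dict.fromkeys((remap[s], remap[d]) for s, d in edges))
    return new_nodes, new_edges


if __name__ == "__main__":
    nodes = [(0, "load V(I)"), (0, "load P(J,I)"), (0, "add"), (0, "store U(I,J)"),
             (1, "load V(I)"), (1, "load P(J+1,I)"), (1, "add"), (1, "store U(I,J+1)")]
    edges = [((0, "load V(I)"), (0, "add")), ((0, "load P(J,I)"), (0, "add")),
             ((0, "add"), (0, "store U(I,J)")),
             ((1, "load V(I)"), (1, "add")), ((1, "load P(J+1,I)"), (1, "add")),
             ((1, "add"), (1, "store U(I,J+1)"))]
    final_nodes, final_edges = merge_redundant_loads(nodes, edges)
    print(len(final_nodes), len(final_edges))
```

After merging, the second copy's `add` is fed by the first copy's load of V(I), so that value stays live longer.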

15
Register Prediction
  • Schedule the final DDG with a depth-first scan
    starting from the first node of the first
    iteration copy
  • The RecII is the RecII of the original innermost
    loop
  • The ResII is computed on the final DDG with the
    targeted architecture information
  • MinII = MAX(RecII, ResII)
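
Once a schedule assigns each value a definition cycle and a last-use cycle, register pressure can be estimated by counting overlapping lifetimes modulo the initiation interval. The sketch below uses this standard "maximum live values" measure, which is not necessarily the authors' exact pseudo-schedule; lifetimes are assumed to be half-open intervals `[def_cycle, last_use_cycle)` supplied by hand.

```python
def min_ii(rec_ii, res_ii):
    """Lower bound on the initiation interval."""
    return max(rec_ii, res_ii)


def max_live(lifetimes, ii):
    """Estimate register pressure for a loop with initiation interval `ii`.

    In the steady state, a value live at cycle t of its own iteration
    occupies a register in slot t % ii, alongside values from the
    neighbouring iterations that are live in the same slot.
    """
    pressure = [0] * ii
    for start, end in lifetimes:
        for t in range(start, end):
            pressure[t % ii] += 1
    return max(pressure)


if __name__ == "__main__":
    ii = min_ii(rec_ii=2, res_ii=1)
    # Three hypothetical values with lifetimes [0,3), [1,2), [2,5)
    print(ii, max_live([(0, 3), (1, 2), (2, 5)], ii))
```

A lifetime longer than II overlaps with itself from the next iteration, which is why the two three-cycle values above each count twice in slot 0.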

16
Register Pressure Guided Unroll-and-Jam
  • Use unitII as the performance indicator of an
    unroll-and-jammed loop
  • R is the number of registers predicted
  • P is the number of registers available
  • D is the total outgoing degree in the final DDG
  • E is the total number of cross iteration edges
  • A is the average memory access penalty
  • N is the number of nodes in the final DDG
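
The selection step can be sketched as picking the unroll factor with the lowest predicted cost per original iteration. The unitII formula itself is not reproduced here, so `toy_unit_ii` below is a hypothetical stand-in that uses only the predicted registers R, the available registers P, and an assumed per-register spill penalty; it is not the authors' formula, and all names and numbers are illustrative.

```python
def toy_unit_ii(ii, copies, regs_predicted, regs_available, spill_penalty):
    """Hypothetical cost: cycles per original iteration, with a penalty
    for every predicted register beyond those available."""
    excess = max(0, regs_predicted - regs_available)
    return (ii + excess * spill_penalty) / copies


def pick_unroll_factor(predictions, regs_available, spill_penalty=2.0):
    """Return the unroll factor with the lowest toy cost.

    `predictions` maps unroll factor -> (predicted II, predicted registers).
    Ties go to the smaller factor because candidates are scanned in order.
    """
    return min(
        sorted(predictions),
        key=lambda u: toy_unit_ii(predictions[u][0], u, predictions[u][1],
                                  regs_available, spill_penalty),
    )


if __name__ == "__main__":
    # Hypothetical predictions for unroll factors 1, 2, and 4.
    predictions = {1: (4, 6), 2: (6, 10), 4: (10, 20)}
    print(pick_unroll_factor(predictions, regs_available=16))
    print(pick_unroll_factor(predictions, regs_available=32))
```

With 16 registers the factor-4 candidate is penalised for exceeding the register file, so a smaller factor wins; with 32 registers the larger factor becomes the cheapest.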

17
Open64 Implementation and Experiment Results
  • For register prediction, a retargetable compiler
    with an infinite number of available physical
    registers is used
  • Loop nests are extracted from SPEC2000
  • For register pressure guided unroll-and-jam, our
    model directly replaces the unroll-and-jam
    analysis used by the Open64 backend
  • A minor value computed with the information from
    Open64's cache model is added to unitII
  • Our register prediction for unroll-and-jam
    predicts the floating-point register pressure of
    a loop within 3 registers and the integer
    register pressure within 4 registers
  • Our register pressure guided unroll-and-jam also
    improves overall performance by about 2% over
    the model in the Open64 backend on both x86 and
    x86-64 architectures on the Polyhedron benchmark
18
The End
Any questions?