P3DE: Profile-Directed Predicated Partial Dead Code Elimination

1
P3DE: Profile-Directed Predicated Partial Dead Code Elimination
  • Shane Ryoo, Sain-Zee Ueng, and Wen-mei W. Hwu
  • Coordinated Science Laboratory
  • University of Illinois at Urbana-Champaign
  • EPIC-5 Workshop, March 26th, 2006

2
Motivation
  • Even with classical optimizations, a significant
    amount of executed code is still dead
    [Butts, ASPLOS 02]
  • Our experience: a large number of dead stores
  • Contemporary architectures generally can issue
    only one or two stores per cycle, increasing the
    length of a schedule in store-laden regions
  • Can we remove more of this dead code?
  • Push assignments off hot paths, towards uses

3
Partial Dead Code Elimination
  • Partial Dead Code Elimination (PDE) reduces
    execution of assignments whose results sometimes
    have no effect
  • Basic algorithm by Knoop et al. [PLDI 94] sinks
    assignments to remove partially-dead code

[Figure: (a) Before PDE, (b) After PDE; the assignment moves downward until it is blocked]
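
The sinking idea can be illustrated with a minimal Python sketch. The function names and the counter are hypothetical and stand in for a compiler's view of dynamic operation counts; this is not the Knoop et al. algorithm itself, only the before/after effect on one partially dead assignment.

```python
# Hypothetical illustration: count how often an assignment executes
# before and after it is sunk toward its only use.

def before_pde(cond, a, b, counter):
    x = a + b                  # partially dead: unused when cond is False
    counter["assign"] += 1
    if cond:
        return x * 2           # the only use of x
    return 0

def after_pde(cond, a, b, counter):
    if cond:
        x = a + b              # sunk onto the path that uses it
        counter["assign"] += 1
        return x * 2
    return 0

def count_assignments(fn, inputs):
    """Run fn over all inputs and report results plus assignment count."""
    counter = {"assign": 0}
    results = [fn(cond, a, b, counter) for cond, a, b in inputs]
    return results, counter["assign"]

inputs = [(True, 1, 2), (False, 3, 4), (False, 5, 6), (True, 7, 8)]
res_before, n_before = count_assignments(before_pde, inputs)
res_after, n_after = count_assignments(after_pde, inputs)
```

Both versions return the same results, but the sunk version executes the assignment only on the paths where its value is used.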
4
Previous Aggressive Dead Code Removal Techniques
  • Primary weakness of previous methods is their
    limited handling of cyclic code regions
  • Gupta et al. [PACT 97] present the only previous
    work on predicated PDE
      • Path profile-based method for cost-benefit
        analysis
      • Cannot sink out of loops unless the assignment is
        dead along the backedge
  • Bodik et al. [PLDI 97] use code restructuring
    to expose opportunities without introducing extra
    dynamic ops
      • Requires some method to control code growth
      • Cannot handle embedded control flow in a loop

5
Profile-Directed Predicated Partial Dead Code Elimination
  • Use edge profile information to specialize
    program paths beyond basic PDE (essentially
    speculate)
  • Reduce the number of executions, based on profile
  • Use predication support to enable aggressive
    sinking motion on assignments
  • Uniform cost-benefit model for cyclic and acyclic
    code regions that accounts for predication
    overhead
  • Other optimizations to reduce/eliminate predicate
    usage and increase the applicability of the
    optimization [Ryoo, M.S. thesis 04]

6
P3DE Example
[Figure: (a) Before P3DE, (b) After P3DE; labels: computation of interest, side entry, new location]
7
P3DE Algorithm
  • Perform dataflow analyses to determine the
    possible range of motion for all assignments
      • Dead / partially dead
      • Partially delayable
  • For each assignment:
      • Construct a motion graph representing this range
      • Compute a minimum cut of the motion graph, based
        on profile weights, to find the smallest number
        of executions
      • Insert new computations, delete old computations,
        and use predication as necessary to maintain
        correctness
  • Iterate until no profitable motions remain
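
The minimum-cut step can be sketched in Python as follows. This is a generic Edmonds-Karp max-flow/min-cut on a small, made-up motion graph; the node names, the weights, and the `min_cut` helper are assumptions for illustration and are not taken from the IMPACT implementation. The sketch assumes every source-to-sink path contains at least one finite-weight edge.

```python
from collections import deque

INF = float("inf")

def min_cut(capacity, source, sink):
    """Return (cut weight, cut edges) for a dict {(u, v): weight}."""
    # Residual capacities, with zero-capacity reverse edges.
    residual = {}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)

    def bfs():
        # Breadth-first search over edges with remaining capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if sink not in parent:
            break
        # Walk back from the sink to recover the augmenting path.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][w] for u, w in path)
        for u, w in path:
            residual[u][w] -= bottleneck
            residual[w][u] += bottleneck
        flow += bottleneck

    # Edges leaving the source-reachable set form the minimum cut.
    reachable = set(bfs())
    cut = sorted(e for e in capacity if e[0] in reachable and e[1] not in reachable)
    return flow, cut

# Hypothetical motion graph: weights are edge profile counts.
motion_graph = {
    ("s", "E"): 100,   # original location executes 100 times
    ("E", "B1"): 60,
    ("E", "B2"): 40,
    ("B1", "U"): 5,
    ("B2", "U"): 30,
    ("U", "t"): INF,   # the use itself cannot be cut
}
executions, cut_edges = min_cut(motion_graph, "s", "t")
```

In this made-up graph the cut places the assignment on the two edges entering `U`, for 35 profiled executions instead of 100 at the original location.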

8
Dead Assignment Dataflow
  • Is the assignment dead along ALL/ANY future
    paths?
  • If completely live (or blocked by aliasing
    operations), sinking the assignment cannot result
    in fewer executions
  • If completely dead, the assignment does not need
    to be executed (inserted)

[Figure label: dataflow direction]
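
A minimal Python sketch of an ALL/ANY deadness analysis is shown below. The slide does not give the equations, so this uses a conventional backward fixpoint formulation as an assumption: each block either uses the assigned value, overwrites it ("kill"), or leaves it alone, and the value is assumed dead at program exit. The CFG and all names are hypothetical.

```python
def dead_dataflow(succs, effect):
    """Backward analysis at block entry.

    succs:  {block: [successor blocks]}
    effect: {block: "use" | "kill" | "none"} for the assigned variable
    Returns (dead_all, dead_any): dead along ALL / ANY future paths.
    """
    dead_all = {b: True for b in succs}    # optimistic start for ALL paths
    dead_any = {b: False for b in succs}   # pessimistic start for ANY path
    changed = True
    while changed:
        changed = False
        for b in succs:
            if effect[b] == "use":
                new_all, new_any = False, False
            elif effect[b] == "kill" or not succs[b]:
                # Overwritten here, or program exit (assumed dead at exit).
                new_all, new_any = True, True
            else:
                new_all = all(dead_all[s] for s in succs[b])
                new_any = any(dead_any[s] for s in succs[b])
            if (new_all, new_any) != (dead_all[b], dead_any[b]):
                dead_all[b], dead_any[b] = new_all, new_any
                changed = True
    return dead_all, dead_any

# Hypothetical CFG with a back edge B -> A; only C uses the value.
succs = {"A": ["B", "C"], "B": ["D", "A"], "C": ["D"], "D": []}
effect = {"A": "none", "B": "none", "C": "use", "D": "none"}
dead_all, dead_any = dead_dataflow(succs, effect)
```

A block where `dead_any` holds but `dead_all` does not is partially dead, which is the case where sinking can reduce executions.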
9
Partial Delay Dataflow
  • Can the assignment be sunk down any path with
    potential profit?
  • Filter out assignments which are live (not
    profitable to sink) when passing through blocks

[Figure label: dataflow direction]
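
As a rough sketch, partial delayability can be modeled as a forward any-path analysis. The formulation below is an assumption, not the slide's exact equations: an assignment can reach the entry of a block if some predecessor is the original location, or if it reached that predecessor's entry and the predecessor does not block further motion. The CFG and the `blocked` set are hypothetical.

```python
def partially_delayable(preds, origin, blocked):
    """Forward any-path analysis.

    preds:   {block: [predecessor blocks]}
    origin:  block that contains the assignment
    blocked: blocks the assignment cannot pass through
             (for example, they use the value or redefine an operand)
    Returns {block: True if the assignment can be sunk to its entry}.
    """
    at_entry = {b: False for b in preds}
    changed = True
    while changed:
        changed = False
        for b in preds:
            new = any(
                p == origin or (at_entry[p] and p not in blocked)
                for p in preds[b]
            )
            if new != at_entry[b]:
                at_entry[b] = new
                changed = True
    return at_entry

# Hypothetical CFG: A branches to B and C, which join at D, then E.
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D"]}
delayable = partially_delayable(preds, origin="A", blocked={"C", "D"})
```

Here the assignment can be delayed to the entries of `B`, `C`, and `D` (through `B`), but not past `D`.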
10
Motion Graph Construction
  • One graph per assignment
  • The motion graph includes every CFG edge for
    which the assignment is
      • partially delayable at the edge's origin
      • not dead at the edge's destination
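
The edge filter described above can be sketched in Python. The dictionaries `delayable_out` and `dead_in` are assumed to come from earlier dataflow analyses; their names, the example CFG, and the profile weights are hypothetical.

```python
def build_motion_graph(cfg_edges, delayable_out, dead_in, weight):
    """Keep CFG edges the assignment could usefully move across.

    delayable_out[u]: assignment is partially delayable leaving block u
    dead_in[v]:       assignment is dead on entry to block v
    weight[(u, v)]:   edge profile count
    """
    return {
        (u, v): weight[(u, v)]
        for (u, v) in cfg_edges
        if delayable_out[u] and not dead_in[v]
    }

# Hypothetical CFG for one assignment located in block A.
cfg_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E"), ("D", "F")]
weight = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90,
          ("C", "D"): 10, ("D", "E"): 20, ("D", "F"): 80}
delayable_out = {"A": True, "B": True, "C": False, "D": True, "E": False, "F": False}
dead_in = {"A": False, "B": False, "C": False, "D": False, "E": False, "F": True}

motion_graph = build_motion_graph(cfg_edges, delayable_out, dead_in, weight)
```

The edge `C -> D` drops out because the assignment cannot be delayed past `C`, and `D -> F` drops out because the value is dead at `F`.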

11
Motion Graph Completion
  • Create a single-source, single-sink graph
  • Create cost edges to account for predication
    overhead (side entries)

12
Code Motion
  • Remove original computation, set predicate
  • Clear predicate at side entries
  • Insert new computation on the cut edge's
    control-equivalent block; execute only when
    predicate is set
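
The three steps can be mimicked in a small Python simulation. This is only a behavioral sketch: the two-entry region, the `stats` counters, and the function names are hypothetical, and a Python boolean stands in for a hardware predicate register. It assumes the operands `a` and `b` are not modified between the original and new locations.

```python
def before(entry, a, b, x_in, take_use, stats):
    x = x_in
    if entry == "main":
        x = a + b                  # original computation
        stats["assign"] += 1
    # Join point: the "side" entry arrives here with x = x_in.
    if take_use:
        return x                   # the only use of x
    return None

def after(entry, a, b, x_in, take_use, stats):
    x = x_in
    if entry == "main":
        p = True                   # original location: set predicate
    else:
        p = False                  # side entry: clear predicate
    stats["pred_write"] += 1
    if take_use:
        if p:                      # new location, guarded by predicate
            x = a + b
            stats["assign"] += 1
        return x
    return None

def run(fn):
    """Exercise both entries and both exits once each."""
    stats = {"assign": 0, "pred_write": 0}
    results = [
        fn(entry, 1, 2, 10, take_use, stats)
        for entry in ("main", "side")
        for take_use in (True, False)
    ]
    return results, stats

results_before, stats_before = run(before)
results_after, stats_after = run(after)
```

The results match on every path, while the assignment count drops and predicate writes appear as the overhead that the cost model has to weigh.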
13
Cyclic P3DE Example
14
Cyclic P3DE Motion Graph
15
Comparison: Loop Variable Migration
  • The primary benefit of P3DE comes from register
    promotion within a loop with aliasing function
    calls, when performed with a similar speculative
    PRE operation [Bodik, Ph.D. thesis 99]
  • Within IMPACT, this optimization is already
    performed to some degree by loop variable
    migration, which guards individual aliasing
    function calls with loads and stores of the
    variable
  • However, P3DE combined with speculative PRE is a
    more systematic method of performing this
    optimization, as it guards aliasing regions
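
The effect of guarding an aliasing call with a store and a load can be sketched in Python. The `Memory` class, the callee, and the loop are all hypothetical; the counters track only the loop's own memory accesses, not those of the callee, and the sketch models the per-call guarding described for loop variable migration rather than IMPACT's actual transformation.

```python
class Memory:
    """Hypothetical memory cell for a global g, counting loop-side accesses."""
    def __init__(self):
        self.g = 0
        self.loads = 0
        self.stores = 0

    def load(self):
        self.loads += 1
        return self.g

    def store(self, value):
        self.stores += 1
        self.g = value

def aliasing_call(mem):
    # The callee may read and write g; its accesses are not counted.
    mem.g = mem.g * 2 + 1

def baseline(n, call_every):
    mem = Memory()
    for i in range(n):
        mem.store(mem.load() + 1)      # g lives in memory every iteration
        if i % call_every == 0:
            aliasing_call(mem)
    return mem

def migrated(n, call_every):
    mem = Memory()
    r = mem.load()                     # promote g to a "register"
    for i in range(n):
        r += 1
        if i % call_every == 0:
            mem.store(r)               # guard: write back before the call
            aliasing_call(mem)
            r = mem.load()             # guard: reload after the call
    mem.store(r)                       # final write-back at loop exit
    return mem

base = baseline(100, 25)
promo = migrated(100, 25)
```

With the call taken on 4 of 100 iterations, the promoted version performs 5 loads and 5 stores in place of 100 each, and leaves the same final value in memory.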

16
Loop Variable Migration Example
17
Performance Evaluation
  • Test machine: HP zx6000 workstation
      • Dual 1 GHz Itanium 2 processors
      • 8 GB RAM
  • IMPACT compiler configuration
      • Andersen-style, context-sensitive,
        field-sensitive pointer analysis for memory
        disambiguation (incorporated after
        [Ryoo, M.S. thesis 04])
      • Traditional optimizations, loop optimizations
        (including loop variable migration), speculative
        PRE, and hyperblock and superblock formation
      • Baseline runs a sink-only-if-partially-delayable
        PDE prior to hyperblock and superblock formation
      • P3DE version replaces PDE in the optimization
        chain and does not run loop variable migration
  • SPEC scores taken as the median of 5 runs
  • Itanium performance counters used to measure
    stores and predicate write operations

18
Dynamic Store Operations Removed
[Chart: stores normalized to baseline input]
19
Predicate Writes Inserted Per Store Removed
[Chart: predicate writes per store removed]
  • Some benchmarks omitted due to relatively small
    number of stores removed
  • Many predicate write operations are created by
    hyperblock formation, which can affect the total
    number of predicate write operations

20
SPEC Performance Increase
[Chart: percentage increase over baseline performance]
21
Performance Analysis
  • Many profitable cases are already subsumed by loop
    variable migration, so little effect is seen in many
    benchmarks
  • 255.vortex achieves a performance benefit
  • Specific losses
      • 186.crafty
          • Micropipeline stalls: loads and stores clustered
            together may appear to have conflicts
          • Kernel cycles increase
      • 253.perlbmk: explicit spill stores due to
        register pressure
      • 254.gap: largest portion is branch misprediction

22
Conclusions
  • P3DE, in combination with speculative PRE, is a
    more systematic version of loop variable
    migration
  • P3DE improves performance when stores are moved
    out of the critical path of statically-scheduled
    regions
  • In a wide-issue architecture, instructions
    generally have significant scheduling freedom, so
    P3DE has little additional benefit on performance
    over loop variable migration, and can result in:
      • Increased register pressure
      • Micropipeline stalls when moving stores towards
        loads
      • Perturbation of hyperblock formation
  • An undue amount of predication is not introduced
    (due to cost edges)