P3DE: Profile-Directed Predicated Partial Dead Code Elimination

1
P3DE: Profile-Directed Predicated Partial Dead
Code Elimination

Shane Ryoo, Sain-Zee Ueng, and Wen-mei W. Hwu
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
EPIC-5 Workshop, March 26th, 2006

2
Motivation

Even with classical optimizations, a significant
amount of executed code is still dead
(Butts, ASPLOS 02)
Our experience: a large amount of dead stores
Contemporary architectures generally can issue
only one or two stores per cycle, increasing the
length of a schedule in store-laden regions
Can we remove more of this dead code?
Push assignments off hot paths, towards uses

3
Partial Dead Code Elimination

Partial Dead Code Elimination (PDE) reduces
execution of assignments whose results sometimes
have no effect
Basic algorithm by Knoop et al. (PLDI 94) sinks
assignments to remove partially dead code

[Figure: (a) Before PDE, (b) After PDE; the assignment moves downward until it is blocked]
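
As a minimal illustration of the sinking described on this slide (not the Knoop et al. algorithm itself), the following Python sketch shows one partially dead assignment before and after it is sunk. The function names and the `cond` parameter are hypothetical and chosen only for this example.

```python
def before_pde(a, b, cond):
    x = a + b        # executed on every call
    if cond:
        x = 0        # overwrites x, so the assignment above was dead on this path
    return x


def after_pde(a, b, cond):
    if cond:
        x = 0
    else:
        x = a + b    # sunk: executed only on the path that actually uses it
    return x
```

Both functions return the same value for every input; the second one simply skips the addition whenever `cond` holds.
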
4
Previous Aggressive Dead Code Removal Techniques

The primary weakness of previous methods is their
limited handling of cyclic code regions
Gupta et al. (PACT 97) present the only previous
work on predicated PDE
Path profile-based method for cost-benefit
analysis
Cannot sink out of loops unless the assignment is
dead along the backedge
Bodik et al. (PLDI 97) use code restructuring
to expose opportunities without introducing extra
dynamic ops
Requires some method to control code growth
Cannot handle embedded control flow in a loop

5
Profile-Directed Predicated Partial Dead Code
Elimination

Use edge profile information to specialize
program paths beyond basic PDE (essentially
speculation)
Reduce the number of executions, based on the
profile
Use predication support to enable aggressive
sinking of assignments
Uniform cost-benefit model for cyclic and acyclic
code regions that accounts for predication
overhead
Other optimizations reduce or eliminate predicate
usage and increase the applicability of the
optimization (Ryoo, M.S. thesis 04)

6
P3DE Example
[Figure: (a) Before P3DE, (b) After P3DE; labels mark the computation of interest, a side entry, and the new location]
7
P3DE Algorithm

Perform dataflow analyses to determine the
possible range of motion for all assignments
Dead / partially dead
Partially delayable
For each assignment:
Construct a motion graph representing this range
Compute a minimum cut of the motion graph, based
on profile weights, to find the smallest number
of executions
Insert new computations, delete old computations,
and use predication as necessary to maintain
correctness
Iterate until no profitable motions remain
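
The minimum-cut step in the list above can be sketched as follows. The slides do not say which min-cut algorithm is used; this sketch swaps in a plain Edmonds-Karp max-flow. The toy graph, its node names, and its weights are invented for illustration, with edge capacities standing in for profile weights.

```python
from collections import deque


def min_cut(capacity, source, sink):
    """Return (cut_weight, cut_edges) for a directed graph {(u, v): weight}."""
    residual = dict(capacity)
    for (u, v) in capacity:
        residual.setdefault((v, u), 0)      # reverse edges for the residual graph
    adj = {}
    for (u, v) in residual:
        adj.setdefault(u, []).append(v)

    def bfs_parents():
        # Breadth-first search over edges that still have residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, []):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    total = 0
    while True:
        parent = bfs_parents()
        if sink not in parent:
            break                            # no augmenting path is left
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(residual[e] for e in path)
        for (u, v) in path:
            residual[(u, v)] -= flow
            residual[(v, u)] += flow
        total += flow

    reachable = set(parent)                  # source side of the cut
    cut = sorted((u, v) for (u, v), w in capacity.items()
                 if w > 0 and u in reachable and v not in reachable)
    return total, cut


# Hypothetical motion graph: the assignment sits in block A (100 executions).
toy_graph = {
    ("SRC", "A"): 100,
    ("A", "B"): 10, ("B", "SNK"): 10,
    ("A", "C"): 30, ("C", "D"): 5, ("D", "SNK"): 5,
}
weight, edges = min_cut(toy_graph, "SRC", "SNK")
```

In this toy graph the cut consists of the edges A->B and C->D with a total weight of 15, so placing the computation on those edges would execute it 15 times instead of 100.
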

8
Dead Assignment Dataflow

Is the assignment dead along ALL/ANY future
paths?
If completely live (or blocked by aliasing
operations), sinking the assignment cannot result
in fewer executions
If completely dead, the assignment does not need
to be executed (inserted)
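
The ALL/ANY question above can be sketched as a backward fixed-point computation for a single variable. This is a simplified reading, not the analysis from the talk: each block is classified only as using the variable, redefining (killing) it, or leaving it untouched, aliasing is ignored, and the variable is assumed dead at program exit. All names are hypothetical.

```python
def dead_assignment_dataflow(succs, uses, kills):
    """Return (dead_all, dead_any), both keyed by block, at block entry."""
    live_any = {n: False for n in succs}    # value is read on some path
    dead_any = {n: False for n in succs}    # value is unused on some path
    changed = True
    while changed:
        changed = False
        for n, out in succs.items():
            if n in uses:
                live, dead = True, False
            elif n in kills:
                live, dead = False, True
            else:
                # Transparent block: depends on successors.
                # An exit block (no successors) is assumed to leave the value dead.
                live = any(live_any[s] for s in out)
                dead = (not out) or any(dead_any[s] for s in out)
            if (live, dead) != (live_any[n], dead_any[n]):
                live_any[n], dead_any[n] = live, dead
                changed = True
    dead_all = {n: not live_any[n] for n in succs}
    return dead_all, dead_any


# Diamond CFG: B reads x, C overwrites x, D is the exit.
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
dead_all, dead_any = dead_assignment_dataflow(succs, uses={"B"}, kills={"C"})
```

At the entry of A the value is dead along some path (through C) but not along all paths, which is the partially dead case that sinking targets.
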

[Figure label: dataflow direction]
9
Partial Delay Dataflow

Can the assignment be sunk down any path with
potential profit?
Filter out assignments which are live (not
profitable to sink) when passing through blocks
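
One way to picture this analysis is a forward reachability pass from the block that holds the assignment. The sketch below is an assumption-laden simplification, not the analysis from the talk: `blocked` stands for blocks the assignment cannot pass, `fully_live` for blocks where the value is needed on every future path, and both sets are supplied by hand.

```python
def partially_delayable_blocks(succs, origin, blocked, fully_live):
    """Blocks at whose exit the assignment could still be delayed with possible profit."""
    delayable = {origin}
    worklist = [origin]
    while worklist:
        u = worklist.pop()
        for v in succs.get(u, ()):
            # Do not propagate through blocks that stop the motion or where
            # sinking cannot reduce the number of executions.
            if v in delayable or v in blocked or v in fully_live:
                continue
            delayable.add(v)
            worklist.append(v)
    return delayable


succs = {"A": ["B", "C"], "B": [], "C": ["D"], "D": []}
result = partially_delayable_blocks(succs, "A", blocked={"B"}, fully_live={"D"})
```

Here the assignment can be delayed through A and C, but not through B (which blocks it) or D (where it is always needed).
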

[Figure label: dataflow direction]
10
Motion Graph Construction

One graph per assignment
A CFG edge is included in the motion graph if the
assignment is
partially delayable at the edge's origin, and
not dead at the edge's destination
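
The inclusion rule above amounts to a filter over profiled CFG edges. In this sketch the two input sets are assumed to come from the earlier dataflow analyses; the edge weights and block names are invented.

```python
def build_motion_graph(cfg_edges, partially_delayable, dead):
    """Keep edge (u, v) when the assignment is delayable at u and not dead at v."""
    return {
        (u, v): weight
        for (u, v), weight in cfg_edges.items()
        if u in partially_delayable and v not in dead
    }


# Hypothetical edge profile for one assignment located in block A.
cfg_edges = {
    ("A", "B"): 10, ("A", "C"): 30, ("A", "H"): 60,
    ("C", "D"): 5, ("C", "E"): 25,
}
motion = build_motion_graph(cfg_edges, partially_delayable={"A", "C"}, dead={"H", "E"})
```

The hot edge A->H and the edge C->E drop out because the value is dead at their destinations, so no computation ever needs to be placed there.
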

11
Motion Graph Completion

Create a single-source, single-sink graph
Create cost edges to account for predication
overhead (side entries)
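
The slide does not spell out how the source, the sink, or the cost edges are wired, so the following is only one hypothetical encoding. It adds a source edge weighted by the original execution count, ties motion-graph leaves to the sink with unlimited capacity, and charges each side entry by adding an edge from the entered block to the sink. The names `SRC` and `SNK` and all weights are invented.

```python
INF = float("inf")


def complete_motion_graph(motion_edges, origin, origin_weight, side_entry_weight):
    """Return a single-source, single-sink capacity map {(u, v): weight}."""
    graph = dict(motion_edges)
    # Cutting this edge means leaving the assignment where it is.
    graph[("SRC", origin)] = origin_weight
    tails = {u for u, _ in motion_edges}
    heads = {v for _, v in motion_edges}
    for node in sorted(heads - tails):
        # Leaves must end up on the sink side, so this edge can never be cut.
        graph[(node, "SNK")] = INF
    for node, weight in side_entry_weight.items():
        # Assumed cost edge: paid whenever the assignment is sunk past `node`.
        graph[(node, "SNK")] = graph.get((node, "SNK"), 0) + weight
    return graph


motion_edges = {("A", "B"): 10, ("A", "C"): 30, ("C", "D"): 5}
completed = complete_motion_graph(motion_edges, "A", 100, side_entry_weight={"C": 8})
```

Under this encoding, sinking past C costs 5 + 8 = 13 rather than 5, which is still cheaper than stopping on the edge A->C at a cost of 30.
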

12
Code Motion
Remove the original computation and set the
predicate in its place
Clear the predicate at side entries
Insert the new computation on the cut edges (or
in a control-equivalent block), executing it only
when the predicate is set
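
The three steps above can be mimicked in ordinary Python, with a boolean `p` standing in for a hardware predicate. The control flow, the parameter names, and the computation `a * b` are all hypothetical; the point is only that the guarded computation preserves the original result.

```python
def before_p3de(a, b, via_side_entry, x_from_side, take_cold_path):
    if via_side_entry:
        x = x_from_side      # the side entry arrives with its own value of x
    else:
        x = a * b            # original computation, executed on the hot path
    if take_cold_path:
        return x             # the only use of x
    return 0


def after_p3de(a, b, via_side_entry, x_from_side, take_cold_path):
    if via_side_entry:
        x = x_from_side
        p = False            # clear the predicate at the side entry
    else:
        x = None             # original computation removed
        p = True             # set the predicate in its place
    if take_cold_path:
        if p:
            x = a * b        # new computation, executed only when p is set
        return x
    return 0
```

On the hot path that never takes the cold branch, the multiplication is no longer executed; the cost is the predicate write.
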
13
Cyclic P3DE Example
14
Cyclic P3DE Motion Graph
15
Comparison: Loop Variable Migration

The primary benefit of P3DE comes from register
promotion within a loop with aliasing function
calls, when performed with a similar speculative
PRE operation
(Bodik, Ph.D. thesis 99)
Within IMPACT, this optimization is already
performed to some degree by loop variable
migration, which guards individual aliasing
function calls with loads and stores of the
variable
However, P3DE with speculative PRE is a more
systematic method of performing this
optimization, as it guards aliasing regions
rather than individual calls
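
Loop variable migration as described above can be sketched in Python by treating a dictionary as memory and a local variable as the register. This illustrates only the guarding of an individual aliasing call with a store and a load; it is not IMPACT's implementation, and `memory`, `aliasing_call`, and the loop body are invented.

```python
memory = {"g": 0}


def aliasing_call():
    memory["g"] += 100            # may read and write g through memory


def loop_unoptimized(n):
    memory["g"] = 0
    for i in range(n):
        memory["g"] += i          # load and store on every iteration
        if i % 4 == 3:
            aliasing_call()
    return memory["g"]


def loop_migrated(n):
    memory["g"] = 0
    g = memory["g"]               # promote g to a "register" before the loop
    for i in range(n):
        g += i                    # no memory traffic on the common path
        if i % 4 == 3:
            memory["g"] = g       # store guards the aliasing call
            aliasing_call()
            g = memory["g"]       # reload after the call
    memory["g"] = g               # final store after the loop
    return memory["g"]
```

Both loops produce the same final value; the migrated version touches memory only around the aliasing call and at loop exit.
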

16
Loop Variable Migration Example
17
Performance Evaluation

Test machine: HP zx6000 workstation
dual 1 GHz Itanium 2 processors
8 GB RAM
IMPACT compiler configuration:
Andersen-style, context-sensitive,
field-sensitive pointer analysis for memory
disambiguation (incorporated after Ryoo, M.S.
thesis 04)
Traditional optimizations, loop optimizations
(including loop variable migration), speculative
PRE, and hyperblock and superblock formation
Baseline runs a sink-only-if-partially-delayable
PDE prior to hyperblock and superblock formation
P3DE version replaces PDE in the optimization
chain and does not run loop variable migration
SPEC scores taken as the median of 5 runs
Itanium performance counters used to measure
stores and predicate write operations

18
Dynamic Store Operations Removed
[Chart: stores normalized to baseline input]
19
Predicate Writes Inserted Per Store Removed
Performed on an HP zx6000 with two 1 GHz Itanium 2
processors and 8 GB RAM
[Chart: predicate writes per store removed]

Some benchmarks omitted due to relatively small
number of stores removed
Many predicate write operations are created by
hyperblock formation, which can affect the total
number of predicate write operations

20
SPEC Performance Increase
[Chart: percentage increase over baseline performance]
21
Performance Analysis

Many profitable cases are already subsumed by loop
variable migration, so little effect is seen in
many benchmarks
255.vortex achieves a performance benefit
Specific losses:
186.crafty:
micropipeline stalls: loads and stores clustered
together may appear to have conflicts
kernel cycles increase
253.perlbmk: explicit spill stores due to
register pressure
254.gap: the largest portion is branch
misprediction

22
Conclusions

P3DE, in combination with speculative PRE, is a
more systematic version of loop variable
migration
P3DE improves performance when stores are moved
out of the critical path of statically scheduled
regions
In a wide-issue architecture, instructions
generally have significant scheduling freedom, so
P3DE has little additional performance benefit
over loop variable migration, and can result in:
increased register pressure
micropipeline stalls when moving stores towards
loads
perturbation of hyperblock formation
Cost edges keep P3DE from introducing an undue
amount of predication
