EECS 583 Lecture 6: Hyperblocks, Control CPR

1
EECS 583 Lecture 6: Hyperblocks, Control CPR
  • University of Michigan
  • January 27, 2003

2
Homeworks
  • HW 1 due today at 11:59pm
  • No 4th testcase; I didn't get around to it
  • scp tar file to lloth.eecs.umich.edu
  • Please don't email it to me
  • user: eecs583
  • password is the same
  • Create tar file, uniquename.tgz
  • Put in /y/eecs583/hw1
  • scp mahlke.tgz eecs583@lloth.eecs.umich.edu:/y/eecs583/hw1/.
  • HW 2 is available; due in 2 weeks

3
Class Problem from Last Time
if (a > 0) {
    r = t + s
    if (b > 0 || c > 0)
        u = v + 1
    else if (d > 0)
        x = y + 1
    else
        z = z + 1
}
  • Draw the CFG
  • Compute CD
  • If-convert the code

4
Region Formation: If-conversion
  • Control flow representation
  • branches
  • predicated operations
  • If-conversion is not an all-or-nothing deal
  • Often bad to apply in blanket mode
  • Selectively apply
  • Regions
  • Extend a superblock to contain if-converted code
  • Convert off-trace transitions to on-trace
  • A hyperblock is born
  • A superblock is a special case of a HB where all
    guarding predicates are True

[Figure: CFG of BB1 through BB6 annotated with profile weights, with BB4 and BB6 tail-duplicated]
5
When to Apply If-conversion
  • Positives
  • Remove branch
  • No disruption to sequential fetch
  • No prediction or mispredict
  • No use of branch resource
  • Increase potential for operation overlap
  • Enable more aggressive compiler transformations
  • Software pipelining
  • Height reduction
  • Negatives
  • Max or Sum function applied when paths are overlapped
  • Resource usage
  • Dependence height
  • Hazard presence
  • Executing useless operations

[Figure: CFG of BB1 through BB6 annotated with profile weights]
6
Negative 1: Resource Usage
Resource usage is additive for all BBs that are if-converted.
Case 1: Each BB requires 3 resources. Assume the processor has 2 resources.
No IC: 1 x 3 + 0.6 x 3 + 0.4 x 3 + 1 x 3 = 9; 9 / 2 = 4.5, rounds up to 5 cycles
IC: 1 x (3 + 3 + 3 + 3) = 12; 12 / 2 = 6 cycles
Case 2: Each BB requires 3 resources. Assume the processor has 6 resources.
No IC: 1 x 3 + 0.6 x 3 + 0.4 x 3 + 1 x 3 = 9; 9 / 6 = 1.5, rounds up to 2 cycles
IC: 1 x (3 + 3 + 3 + 3) = 12; 12 / 6 = 2 cycles
[Figure: CFG in which BB1 (weight 100) branches to BB2 (60) and BB3 (40), rejoining at BB4 (100); if-converted form BB1, BB2 if p1, BB3 if p2, BB4]
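A minimal Python sketch of the resource arithmetic on this slide, assuming the diamond-shaped CFG where BB1 and BB4 execute 100 times, BB2 60 times, and BB3 40 times. The function name and the use of raw profile counts are illustrative choices, not something the lecture prescribes.

```python
import math

def resource_bound_cycles(blocks, entry_count, num_resources, if_converted):
    """Estimate resource-limited cycles per entry into the region.

    blocks: list of (execution_count, resources_needed) pairs.
    Without if-conversion a block is charged in proportion to how often it runs.
    With if-conversion every block is issued on every entry, so usage is additive.
    """
    if if_converted:
        usage = sum(res for _, res in blocks)
    else:
        usage = sum(count * res for count, res in blocks) / entry_count
    return usage, math.ceil(usage / num_resources)

blocks = [(100, 3), (60, 3), (40, 3), (100, 3)]  # BB1, BB2, BB3, BB4

for width in (2, 6):
    no_ic = resource_bound_cycles(blocks, 100, width, if_converted=False)
    ic = resource_bound_cycles(blocks, 100, width, if_converted=True)
    print(f"{width} resources: no IC {no_ic}, IC {ic}")
```

On the narrow machine the if-converted code is slower, while on the wide machine the extra operations are absorbed by otherwise idle resources.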
7
Negative 2: Dependence Height
Dependence height is the max over all BBs that are if-converted (dep height = schedule length with infinite resources).
Case 1: height(bb1) = 1, height(bb2) = 3, height(bb3) = 9, height(bb4) = 2
No IC: 1 x 1 + 0.6 x 3 + 0.4 x 9 + 1 x 2 = 8.4
IC: 1 x 1 + 1 x MAX(3, 9) + 1 x 2 = 12
Case 2: height(bb1) = 1, height(bb2) = 3, height(bb3) = 3, height(bb4) = 2
No IC: 1 x 1 + 0.6 x 3 + 0.4 x 3 + 1 x 2 = 6
IC: 1 x 1 + 1 x MAX(3, 3) + 1 x 2 = 6
[Figure: CFG in which BB1 (weight 100) branches to BB2 (60) and BB3 (40), rejoining at BB4 (100); if-converted form BB1, BB2 if p1, BB3 if p2, BB4]
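The height comparison can be sketched in a few lines of Python, assuming the same diamond CFG (an entry block, two alternative arms taken 60% and 40% of the time, and a join block). The function and parameter names are hypothetical; the model simply takes a weighted average of the arm heights without if-conversion and their maximum with it.

```python
def dependence_height(entry, arms, join, if_converted):
    """Expected dependence height through a diamond-shaped region.

    entry, join: heights of the entry and join blocks.
    arms: list of (probability, height) for the alternative blocks.
    """
    if if_converted:
        # All arms are merged into one block, so the tallest arm sets the height.
        middle = max(height for _, height in arms)
    else:
        # Only one arm executes per entry, so average by probability.
        middle = sum(prob * height for prob, height in arms)
    return entry + middle + join

case1 = [(0.6, 3), (0.4, 9)]   # mismatched arm heights
case2 = [(0.6, 3), (0.4, 3)]   # matched arm heights

for name, arms in (("case 1", case1), ("case 2", case2)):
    print(name,
          round(dependence_height(1, arms, 2, if_converted=False), 2),
          dependence_height(1, arms, 2, if_converted=True))
```

Matched heights cost nothing under this model, which is why matched or nearly matched arms are the attractive candidates.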
8
Negative 3: Hazard Presence
Hazard: an operation that forces the compiler to be conservative, so limited reordering or optimization, e.g., subroutine call, pointer store, ...
Case 1: Hazard in BB3
No IC: SB out of BB1, 2, 4; operations in BB4 are free to overlap with those in BB1 and BB2
IC: operations in BB4 cannot overlap with those in BB1 (BB2 ok)
[Figure: CFG in which BB1 (weight 100) branches to BB2 (60) and BB3 (40), rejoining at BB4 (100); if-converted form BB1, BB2 if p1, BB3 if p2, BB4]
9
When To If-convert
  • Resources
  • Small resource usage ideal for less important
    paths
  • Dependence height
  • Matched heights are ideal
  • Close to same heights is ok
  • Remember, everything is relative for resources
    and dependence height!
  • Hazards
  • Avoid hazards unless on most important path
  • Estimate of benefit
  • Branches/Mispredicts removed
  • Fudge factor

[Figure: CFG in which BB1 (weight 100) branches to BB2 (60) and BB3 (40), rejoining at BB4 (100); if-converted form BB1, BB2 if p1, BB3 if p2, BB4]
10
The Hyperblock
  • Hyperblock - Collection of basic blocks in which
    control flow may only enter at the first BB. All
    internal control flow is eliminated via
    if-conversion
  • Likely control flow paths
  • Acyclic (outer backedge ok)
  • Multiple intersecting traces with no side
    entrances
  • Side exits still exist
  • Hyperblock formation
  • 1. Block selection
  • 2. Tail duplication
  • 3. If-conversion

[Figure: CFG of BB1 through BB6 annotated with profile weights]
11
Block Selection
  • Block selection
  • Select subset of BBs for inclusion in HB
  • Difficult problem
  • Weighted cost/benefit function
  • Height overhead
  • Resource overhead
  • Hazard overhead
  • Branch elimination benefit
  • Weighted by frequency

[Figure: CFG of BB1 through BB6 annotated with profile weights]
12
Block Selection
  • Create a trace -> main path
  • Use a heuristic function to select other blocks
    that are compatible with the main path
  • Consider each BB by itself for simplicity
  • Compute priority for other BBs
  • Normalize against main path
  • BSV_i = K x (weight_bb_i / size_bb_i) x
    (size_main_path / weight_main_path) x bb_char_i
  • weight = execution frequency
  • size = number of operations
  • bb_char = characteristic value of each BB
  • Max value = 1; hazardous instructions reduce
    this to 0.5, 0.25, ...
  • K = constant to represent processor issue rate
  • Include BB when BSV_i > Threshold

13
Example - Step 1 - Block Selection
main path = 1, 2, 4, 6
num_ops = 5 + 8 + 3 + 2 = 18
weight = 80
Calculate the BSVs for BB3 and BB5, assuming no hazards, K = 4
BSV3 = 4 x (20 / 2) x (18 / 80) = 9
BSV5 = 4 x (10 / 5) x (18 / 80) = 1.8
If Threshold = 2.0, select BB3 along with the main path
[Figure: weighted CFG with operation counts BB1 = 5, BB2 = 8, BB3 = 2, BB4 = 3, BB5 = 5, BB6 = 2]
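The block selection value and this worked example can be checked with a short Python sketch. The function names and the dictionary layout are assumptions made for illustration; the weights, sizes, K, and threshold are the ones given on the slide, and bb_char is 1.0 because no hazards are assumed.

```python
def block_selection_value(weight, size, main_weight, main_size, k, bb_char=1.0):
    """BSV = K x (weight / size) x (main_size / main_weight) x bb_char."""
    return k * (weight / size) * (main_size / main_weight) * bb_char

def select_blocks(candidates, main_weight, main_size, k, threshold):
    """Return the candidate blocks whose BSV exceeds the threshold.

    candidates: dict mapping block name -> (weight, size, bb_char).
    """
    return [
        name
        for name, (weight, size, bb_char) in candidates.items()
        if block_selection_value(weight, size, main_weight, main_size, k, bb_char) > threshold
    ]

# Main path BB1, BB2, BB4, BB6: 18 operations, weight 80.
candidates = {"BB3": (20, 2, 1.0), "BB5": (10, 5, 1.0)}

for name, (weight, size, bb_char) in candidates.items():
    bsv = block_selection_value(weight, size, 80, 18, k=4, bb_char=bb_char)
    print(name, round(bsv, 2))

print(select_blocks(candidates, main_weight=80, main_size=18, k=4, threshold=2.0))
```

Lowering bb_char for a block that contains a hazard (for example to 0.5 or 0.25) scales its BSV down by the same factor, which is how hazardous blocks get filtered out.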
14
Example - Step 2 - Tail Duplication
Tail duplication same as with Superblock formation
[Figure: CFG of BB1 through BB6 before and after tail duplication; BB6 is duplicated so that the selected blocks have no side entrances]
15
Example - Step 3 - If-conversion
If-convert intra-HB branches only!
[Figure: CFG before and after if-conversion. Inside the hyperblock, BB1 computes p1, p2 with a CMPP, BB2 executes if p1, BB3 executes if p2, and BB4 follows; BB5 and the duplicated copies of BB6 remain outside]
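A small Python model of what if-conversion does to a branch diamond like BB1 through BB4. The arithmetic in each block is made up for illustration, and commit is a hypothetical helper standing in for hardware predication: it keeps an operation's result only when its guarding predicate is true.

```python
def commit(pred, new_value, old_value):
    """Model a predicated operation: the result is kept only when pred is True."""
    return new_value if pred else old_value

def branching_version(a, b):
    if a > b:            # BB1: conditional branch
        x = a - b        # BB2
    else:
        x = b - a        # BB3
    return x * 2         # BB4

def if_converted_version(a, b):
    p1 = a > b           # BB1: one compare defines both predicates (p1, p2 = CMPP)
    p2 = not p1
    x = 0
    x = commit(p1, a - b, x)   # BB2 if p1
    x = commit(p2, b - a, x)   # BB3 if p2
    return x * 2               # BB4, unguarded

for a, b in [(5, 3), (3, 5), (4, 4)]:
    print(a, b, branching_version(a, b), if_converted_version(a, b))
```

Both arms are computed on every call in the if-converted version, which is the "executing useless operations" cost; in exchange the internal branch disappears.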
16
Hyperblock Performance Evaluation (1)
  • O: BB code
  • IP: Structural if-conversion
  • All innermost loops, acyclic SEME regions
  • PP: Selective if-conversion

17
Class Problem
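Form the HB for this subgraph. Assume K = 4, BSV Threshold = 2.
[Figure: weighted CFG entered with weight 100; operation counts BB1 = 3, BB2 = 8, BB3 = 2, BB4 = 2, BB5 = 3, BB6 = 2, BB7 = 1, BB8 = 2, BB9 = 1; edge weights in extraction order: 20, 80, 80, 20, 45, 55, 10, 35, 55, 35, 10]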
18
Block Selection - Try 2
  • Problems with BSV formula
  • Ignores dependence height
  • Blocks considered independently (control flow
    ignored)
  • Enumerate all paths of execution through region
    of interest
  • Consider a path: execution from entry to some
    exit
  • Give priority to path as a whole
  • Path priority
  • dep_ratio_i = 1.0 - (dep_height_i / max_dep_height)
  • op_ratio_i = 1.0 - (num_ops_i / max_num_ops)
  • priority_i = (probability_i x hazard_i) x
    (dep_ratio_i + op_ratio_i + K)
  • Hazard multiplier was 0.25 for paths containing
    subroutine call or unresolvable memory store
  • K = base contribution for a path (0.1 used)
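A minimal sketch of the path priority computation, assuming each ratio is 1.0 minus the path's value normalized by the maximum over all paths, and that the two ratios and K are summed. The function signature and the sample numbers are hypothetical; the 0.25 hazard multiplier and K = 0.1 are the values quoted on the slide.

```python
def path_priority(probability, dep_height, num_ops, max_dep_height, max_num_ops,
                  has_hazard=False, k=0.1, hazard_multiplier=0.25):
    """Priority of one execution path through the region.

    Short, small, frequently executed, hazard-free paths score highest.
    """
    dep_ratio = 1.0 - dep_height / max_dep_height
    op_ratio = 1.0 - num_ops / max_num_ops
    hazard = hazard_multiplier if has_hazard else 1.0
    return (probability * hazard) * (dep_ratio + op_ratio + k)

# Hypothetical path: taken half the time, half the max height and half the max size.
print(round(path_priority(0.5, 5, 10, max_dep_height=10, max_num_ops=20), 4))
# The same path containing a subroutine call.
print(round(path_priority(0.5, 5, 10, 10, 20, has_hazard=True), 4))
```

The additive K keeps the tallest and largest path from scoring exactly zero, so a very likely path can still be selected even when both of its ratios are 0.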

19
Block Selection - Try 2 (continued)
  • Path selection
  • Rank paths from highest to lowest priority
  • Include paths until either
  • Estimated available resources full
  • Priority drops too low
  • Exclude any paths with excessive resource
    utilization or dep height
  • Use union of selected paths to form Hyperblock
  • Causes some lower priority paths to be included
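The selection loop above might look roughly like the following Python sketch. Everything concrete here is an assumption: the resource estimate is modeled as a total operation budget, the exclusion test only checks operation count (a dependence-height check would follow the same pattern), and the thresholds and block names are invented.

```python
def select_paths(paths, op_counts, resource_budget, min_priority, max_path_ops):
    """Greedily pick paths by priority and return the union of their blocks.

    paths: list of (priority, [block names]).
    op_counts: dict mapping block name -> number of operations.
    """
    selected = set()
    for priority, blocks in sorted(paths, key=lambda p: p[0], reverse=True):
        if priority < min_priority:
            break                      # priority dropped too low
        if sum(op_counts[b] for b in blocks) > max_path_ops:
            continue                   # exclude paths with excessive resource use
        candidate = selected | set(blocks)
        if sum(op_counts[b] for b in candidate) > resource_budget:
            break                      # estimated available resources are full
        selected = candidate
    return selected

op_counts = {"A": 2, "B": 3, "C": 3, "D": 2}
paths = [(0.9, ["A", "B", "D"]), (0.4, ["A", "C", "D"]), (0.05, ["A", "C"])]

print(sorted(select_paths(paths, op_counts, resource_budget=10,
                          min_priority=0.1, max_path_ops=8)))
```

Because the hyperblock is the union of blocks, the low-priority path A-C ends up covered once A-C-D is selected, which illustrates the last bullet.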

20
Block Selection - Try 2 - Example
Enumerate all paths, rank by priority
1. A-B-D-E-F-H-N
2. A-B-D-E-F-H-K-N
3. A-B-D-E-G-J-M-N
4. A-B-D-E-G-J-L-M-N
5. A-B-D-E-G-I-M-N
6. A-B-D-E-G-J-L-N
7. A-B-D
8. A-C-D-E-F-H-N
9. A-C-D-E-F-H-K-N
10. A-C-D-E-G-J-M-N
11. A-C-D-E-G-J-L-M-N
12. A-C-D-E-G-I-M-N
13. A-C-D-E-G-J-L-N
14. A-C-D
15. A-B-D-E-F-G-I-M-N
16. A-B-D-E-F-G-J-M-N
17. A-B-D-E-F-G-J-L-M-N
18. A-B-D-E-F-G-J-L-N
19. A-B-C-E-F-G-I-M-N
20. A-B-C-E-F-G-J-M-N
21. A-B-C-E-F-G-J-L-M-N
22. A-B-C-E-F-G-J-L-N
21
Block Selection - Try 2 - Example (continued)
22
Hyperblock Performance Using Paths
[Figure: hyperblock performance with path-based block selection, for 4-issue and 8-issue processors]
23
Control CPR: A Branch Height Reduction Optimization for EPIC Architectures
PLDI - 1999
  • Mike Schlansker
  • Scott Mahlke
  • Hewlett-Packard Laboratories
  • Richard Johnson
  • Transmeta Corporation

24
Introduction and Problem Statement
  • Dependences limit performance
  • Data
  • Control
  • Long dependence chains
  • Sequential code
  • Problem worse for next generation processors
  • High degree hardware parallelism
  • Low degree of program parallelism
  • Resources idle most of the time
  • Height reduction optimizations
  • Traditional compilers focus on reducing operation
    count
  • Future compilers need to focus on increasing
    program parallelism

25
Height Reduction Optimization
  • Goals
  • Break dependences
  • Reduce latency of edges
  • Reorganize computation
  • Common approach
  • Trade off redundant work for reduced height
  • Inverse of CSE
  • Data height reduction
  • Use of the associative property
  • Induction variable back substitution
  • Control height reduction
  • Control dependences
  • Reduce height through branch network
  • Focus of our work
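As one illustration of data height reduction through the associative property, the Python sketch below sums the same values two ways and reports the dependence height of each, counting one level per dependent addition. The function names are hypothetical, and integer addition is used so that reassociation cannot change the result.

```python
def serial_sum(values):
    """Left-to-right chain: each add depends on the previous one."""
    total, height = values[0], 0
    for v in values[1:]:
        total += v
        height += 1
    return total, height

def tree_sum(values):
    """Balanced tree: adds within one level are independent of each other."""
    level, height = list(values), 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])   # odd element is carried to the next level
        level = paired
        height += 1
    return level[0], height

values = list(range(1, 9))
print(serial_sum(values))  # chain of 7 dependent adds
print(tree_sum(values))    # same 7 adds arranged in 3 levels
```

Both versions perform the same number of additions here; only the dependence height changes, which is what a wide machine can exploit.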

26
Our Approach to Control Height Reduction
  • Goals
  • Reduce dependence height through a network of
    branches
  • Reduce number of executed branches
  • Applicable to a large fraction of the program
  • Fit into our existing compiler infrastructure
  • Difficulty
  • Reducing height while
  • Not increasing operation count
  • Irredundant Consecutive Branch Method (ICBM)
  • Use branch profile information
  • Optimize the likely, important control flow paths
  • Possibly penalize less important paths

27
Definitions
  • Superblock
  • single-entry linear sequence of operations
    containing 1 or more branches
  • Our basic compilation unit
  • Non-speculative operations
  • Exit branch
  • branch to allow early transfer out of the
    superblock
  • compare condition (a_i < b_i)
  • On-trace
  • preferred execution path (E4)
  • identified by profiling
  • Off-trace
  • non-preferred paths (E1, E2, E3)
  • taking an exit branch

28
ICBM for a Simple RISC Processor - Step 1
Input superblock
Insert bypass branch
29
ICBM for a Simple RISC Processor - Step 2
Superblock with bypass branch
Move code down through bypass branch
30
ICBM for a Simple RISC Processor - Step 3
Code after downward motion
Simplify resultant code
31
ICBM for a Simple RISC Processor - Step 4
Code after simplification
Sequential boolean expression
Height-reduced expression
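The four steps can be approximated by a rough functional model in Python. This is only a sketch of the bypass-branch idea under simplifying assumptions, not the transformation as implemented in a compiler: exits are named E1, E2, ... with the last one on-trace, the exit conditions are plain booleans, and the off-trace path simply re-runs the original branch sequence as compensation code.

```python
from itertools import product

def original_superblock(conds):
    """Sequential exit branches. Returns (exit_taken, branches_executed)."""
    for i, cond in enumerate(conds):
        if cond:
            return f"E{i + 1}", i + 1
    # Every exit branch was executed and fell through: on-trace exit.
    return f"E{len(conds) + 1}", len(conds)

def with_bypass_branch(conds):
    """One bypass branch guards the on-trace path; exits move off-trace."""
    # any() is a functional stand-in for a height-reduced OR of the conditions.
    off_trace = any(conds)
    if not off_trace:
        return f"E{len(conds) + 1}", 1          # only the bypass branch executed
    exit_taken, branches = original_superblock(conds)
    return exit_taken, branches + 1             # bypass branch plus compensation code

for conds in product([False, True], repeat=3):
    print(conds, original_superblock(conds), with_bypass_branch(conds))
```

On-trace execution drops from three branches to one, while every off-trace exit pays one extra branch, matching the goal of favoring the likely path and possibly penalizing the others.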
32
Is the ICBM Transformation Always Correct?
  • Answer is no
  • Problem with downward motion
  • S1 ops to compute c0, c1, c2
  • S2 ops dependent on branches
  • S1 ops must remain on-trace
  • S2 ops must move downward
  • No dependences permitted between S1 and S2
  • Separability violation
  • Experiments - 6% of branches failed
  • Memory dependences

33
Blocking
  • Transforming an entire superblock
  • May not be possible
  • May not be profitable
  • Solution - CPR blocks
  • Block into smaller subregions
  • Linear sequences of basic blocks
  • Apply CPR to each subregion
  • Grow CPR block incrementally
  • Terminate CPR block when
  • Correctness violation
  • Performance heuristic

34
ICBM for an EPIC Processor (HPL-PlayDoh)
  • Predicated execution
  • Boolean guard for all operations
  • a = b + c if p
  • Increases complexity of ICBM
  • Generalize the schema
  • Analyze and transform complex predicated code
  • Suitability pattern match
  • Proof of correct code generation
  • Increases efficiency of ICBM
  • Wired-AND/wired-OR compares
  • Accumulate disjunction of conditions into a
    predicate
  • See PlayDoh technical report
  • Compare network reduced to 1 level
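A small Python model of accumulating a disjunction into a predicate with OR-type compares. The helper names are hypothetical, and the model assumes the usual OR-type behavior: the predicate is cleared first, and each compare sets it when its condition is true and otherwise leaves it untouched. Because no compare ever clears the predicate, the result does not depend on the order in which the compares are applied.

```python
from itertools import permutations

def or_type_compare(pred, cond):
    """Set the predicate when cond is true; otherwise leave it unchanged."""
    return True if cond else pred

def accumulate_disjunction(conds):
    pred = False                      # predicate is cleared before the compares
    for cond in conds:
        pred = or_type_compare(pred, cond)
    return pred

conds = (False, True, False)
results = {accumulate_disjunction(order) for order in permutations(conds)}
print(results)  # a single value, whatever the order
```

Order independence is the property that lets such compares be scheduled together rather than chained, which is how the compare network collapses to a single level.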

35
Experiment Evaluation
  • ICBM implemented in Elcor research compiler
  • More information available at www.trimaran.org
  • Comparison
  • Baseline - optimized superblock code produced by
    Impact
  • Height-reduced - baseline code with ICBM
    transformation
  • Benchmarks - SPECINT95, SPECINT92, Unix utilities
    (24 total)
  • Processor models - PlayDoh instruction set
  • sequential - single issue RISC
  • narrow - (2,1,1,1) (I,F,M,B)
  • medium - (4,2,2,1)
  • wide - (8,4,4,2)
  • infinite - (75,25,25,25)
  • Cache stalls and branch mispredictions not
    measured

36
Taste of the Results
37
Performance Insights
  • When is ICBM most effective?
  • Sequences of biased branches (includes unrolled
    loops)
  • long sequences are good
  • but do not have to be overly long because it's
    better to block anyway
  • Control dependence limited
  • Branch resource limited
  • Best benchmark - cmp
  • long trace (162 ops), 25 branches, all branches
    heavily biased
  • When is ICBM less effective?
  • Few branches
  • Dominated by unbiased branches
  • Control dependences are not limiting -
    data/memory are limiting factors
  • Important branches that we cannot treat (table
    jumps)
  • Worst benchmark - 099.go
  • unbiased branches, data dependences

38
Summary and Final Thoughts
  • ICBM is an effective strategy for control height
    reduction
  • Relatively simple
  • No height versus redundancy tradeoff
  • Use profile information
  • Reduce dependence height and operation count on
    important paths
  • Penalize less important paths
  • Strong performance gains across range of
    processors
  • 13% for a sequential processor
  • 18% for a medium VLIW (4,2,2,1)
  • 33% for a wide VLIW (8,4,4,2)
  • Importance of height reduction optimizations in
    future compilers
  • Parallelism limit studies are only valid on a
    fixed code base
  • Compiler can manufacture ILP
  • Current research only scratches the surface of
    height reduction