EECS 583 Lecture 20: Multicluster Compilation

Provided by: scottm3
1
EECS 583 Lecture 20: Multicluster Compilation
  • University of Michigan
  • March 24, 2004
  • Guest speakers today: Michael Chu and Kevin Fan

2
Recap: Traditional VLIW Architectures
  • Conventional VLIW
  • Target architecture seen so far in class
  • Large, centralized register file
  • Many functional units connected
  • Problems with conventional design
  • Longer wires mean longer latencies on RF
    accesses
  • A large number of FUs connected to the register
    file requires more ports
  • Register file access time increases quadratically
    with the number of ports
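
The quadratic claim above can be made concrete with a little arithmetic. The sketch below is only illustrative: the two read ports and one write port per FU are assumptions, not figures from the slides, and the "access time" is a unitless quantity proportional to ports squared.

```python
# Assumed port requirements per FU (hypothetical, for illustration only).
READ_PORTS_PER_FU = 2
WRITE_PORTS_PER_FU = 1


def ports_needed(num_fus: int) -> int:
    """Ports a register file needs to serve num_fus functional units."""
    return num_fus * (READ_PORTS_PER_FU + WRITE_PORTS_PER_FU)


def relative_access_time(num_fus: int) -> int:
    """Unitless cost proportional to the square of the port count."""
    return ports_needed(num_fus) ** 2


# One register file serving 8 FUs versus a register file serving only 4.
ratio = relative_access_time(8) / relative_access_time(4)
print(ratio)  # 4.0
```

Under these assumptions, halving the number of FUs attached to a register file cuts the relative access time by a factor of four.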

[Figure: Conventional Architecture. A single register file (RF) connected to all FUs]
3
Multicluster VLIW Architectures
  • Multicluster VLIW
  • Solution to problems with conventional VLIW
    architecture design.
  • Decentralizes the architecture by splitting the RF
    and connecting each part to a subset of the FUs
  • Requires communication between clusters through
    an intercluster communication path
  • Problem with Multicluster VLIW
  • Compilation must now deal with disjoint FU/RFs,
    and schedule operations accordingly
  • Used in commercial processors
  • Alpha 21264, TI C6x, etc.

[Figure: Clustered Architecture. Cluster 1 and Cluster 2, each with its own register file and FUs]
4
Other Multicluster Architecture Designs
  • Clusters can be homogeneous or heterogeneous
  • Homogeneous means each cluster is identical
  • Heterogeneous means FU number/types differ per
    cluster
  • Communication paths can be intercluster buses or
    cross cluster FU inputs

[Figure: two two-cluster designs, one using cross-cluster FU inputs and one using an intercluster bus; each cluster has its own register file and FUs]
5
Multicluster Compilation Basics
  • Goal: distribute operations evenly to balance
    workload while minimizing communication
  • When two operations on separate clusters need to
    communicate, the interconnection network must be
    used
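
The two competing goals above (balanced workload, little communication) can be measured for any candidate assignment. The following is a minimal sketch; the DFG, the cluster names, and the helper names are hypothetical. It counts cut edges, which overstates the number of moves when one value feeds several consumers on the same remote cluster.

```python
from collections import Counter


def intercluster_moves(edges, cluster_of):
    """Count DFG edges whose producer and consumer are on different clusters."""
    return sum(1 for src, dst in edges if cluster_of[src] != cluster_of[dst])


def load_per_cluster(cluster_of):
    """Number of operations assigned to each cluster."""
    return Counter(cluster_of.values())


# Hypothetical 4-op DFG: ops 1 and 2 feed op 3, which feeds op 4.
edges = [(1, 3), (2, 3), (3, 4)]
assignment = {1: "C1", 2: "C2", 3: "C1", 4: "C1"}

moves = intercluster_moves(edges, assignment)
load = load_per_cluster(assignment)
print(moves)       # 1
print(dict(load))  # {'C1': 3, 'C2': 1}
```

Here only the edge from op 2 to op 3 crosses clusters, but the load is uneven; moving more ops to C2 would balance the load at the cost of more moves.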

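[Figure: Cluster 1 and Cluster 2, each with a register file and I and MEM units, joined by an interconnection network; an intercluster move carries a value between operations (such as >> and LW) on different clusters]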
6
Cluster Assignment
  • When do we want to do operation cluster
    assignment?
  • Highly intertwined with Scheduling and Register
    Allocation
  • Assignment to clusters can change how well the
    code can be scheduled, which changes how well
    registers can be allocated.
  • Elcor's model
  • Other possible models
  • Combine cluster assignment with scheduling
  • Combine all three
  • Unifying any or all of these three steps can
    greatly increase complexity

[Figure: phases shown in sequence: Cluster Assignment, Scheduling, Register Allocation]
7
Bottom-Up Greedy (BUG) Algorithm
  • First clustering algorithm introduced for the
    Multiflow Trace architecture
  • Used by Elcor
  • Basic idea
  • Recursive algorithm
  • Go from exit ops to entry ops and pass along good
    cluster candidates for each op
  • Go from entry ops back to exit ops and make final
    decisions
  • Consider ops on critical path first

8
BUG Algorithm (cont.)
  • Given an op and its immediate predecessors and
    successors, how to choose a good cluster?
  • Op must get its input operands from its
    predecessors
  • Perform some computation
  • Send its output to its successors
  • Want to pick cluster such that this process
    completes soonest (greedy)
  • A good choice depends on what clusters the op's
    predecessors and successors are assigned to

9
Definitions
  • Available time
  • When a source operand is computed
  • Arrival time
  • When source operand is moved to current cluster
  • Start time
  • When all source operands are ready (max of
    arrival times) and resources are available
  • Completion time
  • Result has been computed and moved to consumers

10
Definitions Illustrated
[Figure: timeline relative to Op 3, showing Ops 1 to 4 and a move after Op 2. Labeled points: AvailableTime(Op2); ArrivalTime(Op2, C1); AvailableTime(Op1), ArrivalTime(Op1, C1); StartTime(Op3, C1); CompletionTime(Op3, C1, C1, C2)]
  • Choose a cluster for Op 3 to minimize
    CompletionTime

11
The Main Function: Assign
  • Assign(Op, Dests)
  •   for each Predecessor Pred of Op
  •     Est-clusters = Estimate(Op, Dests)
  •     Assign(Pred, Est-clusters)
  •   Est-clusters = Estimate(Op, Dests)
  •   Cluster = first cluster in Est-clusters
  •   Assign Op to Cluster
  •   Mark Cluster's resources busy at
      StartTime(Op, Cluster)

Upward pass: the recursive call
Downward pass: the actual assignment
  • The Estimate function returns a list of Clusters for
    which CompletionTime(Op, Cluster, Dests) is
    minimum
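
The Assign and Estimate functions can be sketched in runnable form. This is a simplified illustration, not Elcor's or Multiflow's implementation: the class name, the one-slot-per-cycle resource model, the one-cycle move latency, and the example DFG are all assumptions.

```python
MOVE_LATENCY = 1  # assumed intercluster move latency, in cycles


class Bug:
    def __init__(self, preds, latency, can_run, clusters):
        self.preds = preds        # op -> list of predecessor ops
        self.latency = latency    # op -> latency in cycles
        self.can_run = can_run    # cluster -> set of ops it can execute
        self.clusters = clusters  # cluster names, in preference order
        self.cluster_of = {}      # final assignment: op -> cluster
        self.avail = {}           # op -> cycle its result is available
        self.busy = set()         # occupied (cluster, cycle) issue slots
        self._depth = {}

    def depth(self, op):
        """estart: earliest start ignoring clusters and resources."""
        if op not in self._depth:
            self._depth[op] = max(
                (self.depth(p) + self.latency[p] for p in self.preds[op]),
                default=0)
        return self._depth[op]

    def start_time(self, op, cluster):
        ready = 0
        for p in self.preds[op]:
            if p in self.cluster_of:  # already assigned: exact arrival time
                move = MOVE_LATENCY if self.cluster_of[p] != cluster else 0
                t = self.avail[p] + move
            else:                     # upward pass: depth plus latency
                t = self.depth(p) + self.latency[p]
            ready = max(ready, t)
        while (cluster, ready) in self.busy:  # wait for a free issue slot
            ready += 1
        return ready

    def completion_time(self, op, cluster, dests):
        done = self.start_time(op, cluster) + self.latency[op]
        if any(d != cluster for d in dests):
            done += MOVE_LATENCY
        return done

    def estimate(self, op, dests):
        """Clusters on which op's completion time is minimum."""
        legal = [c for c in self.clusters if op in self.can_run[c]]
        times = {c: self.completion_time(op, c, dests) for c in legal}
        best = min(times.values())
        return [c for c in legal if times[c] == best]

    def assign(self, op, dests):
        if op in self.cluster_of:
            return
        # Upward pass: visit predecessors, deepest (most critical) first.
        order = sorted(self.preds[op],
                       key=lambda p: -(self.depth(p) + self.latency[p]))
        for pred in order:
            self.assign(pred, self.estimate(op, dests))
        # Downward pass: make the final decision for this op.
        cluster = self.estimate(op, dests)[0]
        start = self.start_time(op, cluster)
        self.cluster_of[op] = cluster
        self.avail[op] = start + self.latency[op]
        self.busy.add((cluster, start))


# Hypothetical DFG: 1 -> 3 -> 5 and 2 -> 4 -> 5; C2 can run only ops 2 and 4.
preds = {1: [], 2: [], 3: [1], 4: [2], 5: [3, 4]}
latency = {op: 1 for op in preds}
can_run = {"C1": {1, 2, 3, 4, 5}, "C2": {2, 4}}
bug = Bug(preds, latency, can_run, ["C1", "C2"])
bug.assign(5, [])  # start from the exit op, which has no destinations
print(bug.cluster_of)
print(bug.avail[5])
```

In this run, op 2 lands on C2 because C1's early issue slots are already taken by ops 1 and 3, which shows how resource conflicts push work onto the second cluster.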

12
BUG
  • Traverses DFG in a reverse depth-first-search
    fashion
  • Upward pass
  • Predecessors have not been assigned yet
  • Use depth (estart) plus latency to approximate
    predecessors' AvailableTime
  • Estimate a set of good clusters for current op
  • Recursively assign predecessors with current set
    as predecessors' Dests
  • Downward pass
  • Make final cluster decisions for ops

13
Example
  • Assume all ops are 1-cycle
  • Each cluster can execute one op per cycle
  • Cluster 1 can execute any op, cluster 2 can only
    execute

[Figure: example DFG; labels C1, C2, M]

14
Example: left path, upward pass
[Figure: left path of the DFG (Ops 1, 3, 5), annotated with candidate cluster C1]
  • AvailTime(Op1) = 1
  • CompTime(Op5, C1, -) = 3
  • CompTime(Op3, C1, C1) = 2
  • CompTime(Op3, C2, C1) = 3
  • CompTime(Op1, C1, C1) = 1
  • CompTime(Op1, C2, C1) = 2
15
Example: left path, downward pass
[Figure: left path of the DFG (Ops 1, 3, 5), annotated with cluster C1]
  • ArrivTime(Op1, C1) = 1
  • ArrivTime(Op1, C2) = 2
  • StartTime(Op3, C1) = 1, CompTime(Op3, C1, C1) = 2
  • StartTime(Op3, C2) = 2, CompTime(Op3, C2, C1) = 4
16
Example: right path, upward pass
[Figure: DFG with Ops 1 to 5, annotated with candidate clusters C1 and C1,C2]
  • AvailTime(Op2) = 1
  • StartTime(Op4, C1) = 2, CompTime(Op4, C1, C1) = 3
  • StartTime(Op4, C2) = 1, CompTime(Op4, C2, C1) = 3
  • CompTime(Op2, C1, C1, C2) = 3
  • CompTime(Op2, C2, C1, C2) = 1
17
Example: right path, downward pass
[Figure: DFG with Ops 1 to 5, annotated with cluster C1]
  • ArrivTime(Op2, C1) = 2
  • ArrivTime(Op2, C2) = 1
  • StartTime(Op4, C1) = 2, CompTime(Op4, C1, C1) = 3
  • StartTime(Op4, C2) = 1, CompTime(Op4, C2, C1) = 3
  • CompTime(Op5, C1, -) = 4
18
Class problem
[Figure: class problem DFG (some ops marked M), clusters C1 and C2, and a Schedule table to fill in]
19
Problems with BUG
  • BUG does a fairly good job of partitioning the
    DFG, but it can be improved
  • Problem 1: Local scope of the DFG
  • Has a very narrow view of the DFG
  • Doesn't consider the best global clustering
  • Problem 2: Scheduler-centric
  • Using the scheduler to determine the clustering
    is slow!
  • BUG is not the only solution to cluster
    assignment
  • Many different algorithms exist, using different
    techniques and scopes and running at different
    phases of the compilation process
  • No clear-cut winner on the best algorithm for all
    situations

20
Local Scope
[Figure: a 12-op DFG scheduled two ways on a cycle axis, under local scope clustering and under global scope clustering, with intercluster moves marked]
21
Scheduler-centric Nature
  • Cluster Assignment during scheduling adds
    complexity
  • Detailed resource model/reservation table is
    slow!
  • Forces local decisions to be made

[Figure: per-cycle reservation tables for Cluster 1 and Cluster 2 (X marks an occupied resource slot) shown next to the 12-op DFG]
22
Region-based Hierarchical Operation Partitioning
  • RHOP is one of many advanced clustering
    techniques
  • Code is considered one region at a time
  • Weight calculation determines hints for how
    operations affect scheduler
  • Partitioning uses multilevel graph partitioner to
    cluster operations

[Figure: RHOP flow: Program, Region, Weight Calculation, Graph Partitioning]
23
Weight Calculation
  • Node weights are used to determine approximate
    resource usage
  • Differs depending on how many FUs of each type
    per cluster
  • Edge weights are used to determine where to best
    break the graph
  • Where is intercluster communication free or
    preferred?
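
The (estart, lstart) pairs used in weight calculation can be computed with one forward and one backward pass over the DFG. The sketch below uses a hypothetical 4-op graph and a hypothetical `edge_slack` helper; treating slack as the signal for where communication is free is one plausible reading of the slide, not a statement of RHOP's exact weight formula.

```python
def estart_lstart(ops, preds, latency):
    """Earliest and latest start times; ops must be in topological order."""
    estart = {}
    for op in ops:
        estart[op] = max(
            (estart[p] + latency[p] for p in preds[op]), default=0)
    length = max(estart[op] + latency[op] for op in ops)

    succs = {op: [] for op in ops}
    for op in ops:
        for p in preds[op]:
            succs[p].append(op)

    lstart = {}
    for op in reversed(ops):
        latest_finish = min((lstart[s] for s in succs[op]), default=length)
        lstart[op] = latest_finish - latency[op]
    return estart, lstart


def edge_slack(src, dst, estart, lstart, latency):
    """Cycles the edge can be stretched without lengthening the schedule."""
    return lstart[dst] - (estart[src] + latency[src])


# Hypothetical DFG: 1 -> 3 -> 4 and 2 -> 4, all ops with latency 1.
ops = [1, 2, 3, 4]
preds = {1: [], 2: [], 3: [1], 4: [2, 3]}
latency = {op: 1 for op in ops}

estart, lstart = estart_lstart(ops, preds, latency)
print({op: (estart[op], lstart[op]) for op in ops})
# {1: (0, 0), 2: (0, 1), 3: (1, 1), 4: (2, 2)}
print(edge_slack(2, 4, estart, lstart, latency))  # 1
print(edge_slack(1, 3, estart, lstart, latency))  # 0
```

The edge from op 2 to op 4 has one cycle of slack, so a one-cycle intercluster move on that edge would not lengthen the schedule, while the zero-slack edges on the critical path are poor places to cut the graph.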

[Figure: a cluster with a register file and I, F, M, and B units, and a 14-op DFG whose nodes are annotated with (estart, lstart) pairs ranging from (0,0) to (4,4)]
24
RHOP - Coarsening
  • Coarsening takes highly related operations and
    groups them together for later partitioning
  • Groups are based on edge weights
  • Takes snapshots of how things are coarsened so
    the groups can later be considered together
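
Coarsening with snapshots can be sketched as repeated heavy-edge matching, the generic approach used by multilevel graph partitioners. This is an illustration of that general technique, not RHOP's exact procedure; the function name, the graph, and the edge weights are hypothetical.

```python
def coarsen(nodes, edge_weight):
    """Merge groups along the heaviest edges, recording a snapshot per level.

    edge_weight maps (u, v) pairs to weights. Returns a list of snapshots,
    each a list of frozenset groups ordered by smallest member.
    """
    groups = {n: frozenset([n]) for n in nodes}
    snapshots = [sorted(set(groups.values()), key=min)]
    heaviest_first = sorted(edge_weight.items(), key=lambda kv: -kv[1])

    while True:
        matched = set()  # groups created during this level
        for (u, v), _weight in heaviest_first:
            gu, gv = groups[u], groups[v]
            # Skip edges inside a group or touching a group merged this level.
            if gu == gv or gu in matched or gv in matched:
                continue
            merged = gu | gv
            for n in merged:
                groups[n] = merged
            matched.add(merged)
        if not matched:
            break
        snapshots.append(sorted(set(groups.values()), key=min))
    return snapshots


# Hypothetical chain 1 - 2 - 3 - 4 with a weak middle edge.
nodes = [1, 2, 3, 4]
edge_weight = {(1, 2): 5, (2, 3): 1, (3, 4): 4}

snaps = coarsen(nodes, edge_weight)
for level, snap in enumerate(snaps):
    print(level, [sorted(g) for g in snap])
# 0 [[1], [2], [3], [4]]
# 1 [[1, 2], [3, 4]]
# 2 [[1, 2, 3, 4]]
```

A refinement step can then walk the snapshots from coarsest to finest, moving whole groups between clusters before considering individual ops.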

25
RHOP: Scheduling estimate
[Figure: per-cycle load estimates for the ops placed on Cluster 1 and Cluster 2. Cluster_wgt1 = 5.0, Cluster_wgt2 = 0.67]
26
RHOP: Checking proposed moves
  • Move groups of operations over, see how it
    changes the load on the schedule estimate

[Figure: schedule estimates for Cluster 1 and Cluster 2 after moving a group of ops to Cluster 2. SL(before) = 5.0, SL(after) = 4.5, Lgain = 0.5, Egain = -1.0, Mgain = 4.0]
27
RHOP - Refinement