EECS 583 Lecture 20: Multicluster Compilation

1
EECS 583 Lecture 20: Multicluster Compilation

University of Michigan
March 24, 2004
Guest speakers today: Michael Chu and Kevin Fan

2
Recap: Traditional VLIW Architectures

Conventional VLIW
Target architecture seen so far in class
Large, centralized register file
Many functional units connected
Problems with conventional design
Longer wires require longer latencies on RF accesses
A large number of FUs connected to the register file requires more ports
Register file access time increases quadratically with the number of ports

[Figure: Conventional Architecture. A single register file (RF) connected to five functional units (FUs).]
3
Multicluster VLIW Architectures

Multicluster VLIW
Solution to problems with conventional VLIW architecture design
Decentralized architecture: split the RF and connect each part to a subset of the FUs
Requires communication between clusters through an intercluster communication path
Problem with Multicluster VLIW
Compilation must now deal with disjoint FU/RFs and schedule operations accordingly
Used in commercial processors
Alpha 21264, TI C6x, etc.

[Figure: Clustered Architecture. Two clusters (Cluster 1, Cluster 2), each with its own register file and two FUs.]
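
A minimal sketch of why splitting the register file helps, taking the previous slide's quadratic access-time claim at face value. The port budget (2 read ports and 1 write port per FU) and the function names are assumptions made for illustration, and the extra ports needed for intercluster moves are ignored.

```python
def rf_ports(num_fus, reads_per_fu=2, writes_per_fu=1):
    """Ports needed on a register file that serves num_fus functional units."""
    return num_fus * (reads_per_fu + writes_per_fu)

def relative_access_time(ports, baseline_ports):
    """Access time relative to a baseline, assuming it grows with ports squared."""
    return (ports / baseline_ports) ** 2

central = rf_ports(4)      # one register file feeding all 4 FUs
clustered = rf_ports(2)    # each of 2 clusters feeds only its own 2 FUs
print(central, clustered)                        # 12 6
print(relative_access_time(central, clustered))  # 4.0
```

Under these assumptions the centralized file needs twice the ports and four times the access time of each clustered file, which is the trade the compiler pays for with intercluster moves.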
4
Other Multicluster Architecture Designs

Clusters can be homogeneous or heterogeneous
Homogeneous means each cluster is identical
Heterogeneous means FU number/types differ per cluster
Communication paths can be intercluster buses or cross-cluster FU inputs

[Figure: two panels, "Cross-cluster FU inputs" and "Intercluster Bus". Each panel shows Cluster 1 and Cluster 2, each with its own register file and two FUs.]
5
Multicluster Compilation Basics

Goal: distribute operations evenly to balance workload while minimizing communication
When two operations on separate clusters require communication, the interconnection network must be used

[Figure: Cluster 1 and Cluster 2, each with a register file and I and MEM units, joined by an Interconnection Network. Operations shown include >> and LW; a value crosses between the clusters through an intercluster move.]
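
The two competing goals can be measured for any candidate assignment. The sketch below is illustrative only: the op names, the edge list, and the choice to count one move per (producer, destination cluster) pair are assumptions, not something the slides specify.

```python
from collections import Counter

def evaluate_assignment(edges, cluster_of):
    """Return (number of intercluster moves, ops per cluster) for an assignment."""
    # A value consumed by several ops on the same remote cluster is moved once,
    # so moves are counted per (producer, destination cluster) pair.
    moves = {(src, cluster_of[dst]) for src, dst in edges
             if cluster_of[src] != cluster_of[dst]}
    load = Counter(cluster_of.values())
    return len(moves), dict(load)

edges = [("lw", "add"), ("lw", "shr"), ("add", "st"), ("shr", "st")]
balanced = {"lw": 1, "add": 1, "shr": 2, "st": 2}
all_on_one = {"lw": 1, "add": 1, "shr": 1, "st": 1}
print(evaluate_assignment(edges, balanced))    # (2, {1: 2, 2: 2})
print(evaluate_assignment(edges, all_on_one))  # (0, {1: 4})
```

The balanced assignment pays two moves for an even load, while putting everything on one cluster needs no moves but leaves the other cluster idle.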
6
Cluster Assignment

When do we want to do operation cluster assignment?
Highly intertwined with Scheduling and Register Allocation
Assignment to clusters can change how well the code can be scheduled, which changes how well registers can be allocated
Elcor's model
Other possible models
Combine cluster assignment with scheduling
Combine all three
Unifying any or all of these three steps can greatly increase complexity

[Figure: Cluster Assignment, Scheduling, and Register Allocation shown as separate phases.]
7
Bottom-Up Greedy (BUG) Algorithm

First clustering algorithm introduced for the Multiflow Trace architecture
Used by Elcor
Basic idea
Recursive algorithm
Go from exit ops to entry ops and pass along good cluster candidates for each op
Go from entry ops back to exit ops and make final decisions
Consider ops on critical path first

8
BUG Algorithm (cont.)

Given an op and its immediate predecessors and successors, how to choose a good cluster?
Op must get its input operands from its predecessors
Perform some computation
Send its output to its successors
Want to pick the cluster such that this process completes soonest (greedy)
A good choice depends on what clusters the op's predecessors and successors are assigned to

9
Definitions

Available time: when a source operand is computed
Arrival time: when a source operand is moved to the current cluster
Start time: when all source operands are ready (max of arrival times) and resources are available
Completion time: when the result has been computed and moved to consumers
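
The four definitions translate almost directly into code. This is a sketch under stated assumptions: a 1-cycle intercluster move, a default op latency of 1, and hypothetical function names; the slides give only the definitions.

```python
MOVE_LATENCY = 1  # assumed cost of one intercluster move, in cycles

def arrival_time(available, src_cluster, dst_cluster):
    """When an operand available on src_cluster reaches dst_cluster."""
    return available + (MOVE_LATENCY if src_cluster != dst_cluster else 0)

def start_time(operands, cluster, free_from=0):
    """Max of operand arrival times, but no earlier than the cluster is free.

    operands: list of (available_time, producing_cluster) pairs.
    """
    arrivals = [arrival_time(t, c, cluster) for t, c in operands]
    return max(arrivals + [free_from])

def completion_time(operands, cluster, dests, latency=1, free_from=0):
    """When the result has been computed and moved to every consumer cluster."""
    done = start_time(operands, cluster, free_from) + latency
    return max([arrival_time(done, cluster, d) for d in dests] or [done])

# One operand is ready on C1 at time 1, the other on C2 at time 2,
# and the only consumer lives on C2.
operands = [(1, "C1"), (2, "C2")]
print(completion_time(operands, "C1", ["C2"]))  # 5
print(completion_time(operands, "C2", ["C2"]))  # 3
```

Here C2 is the greedy choice: running on C1 would pay one move to bring the late operand in and another to send the result back out.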

10
Definitions Illustrated
Relative to Op 3

[Figure: timeline of Ops 1 to 4, relative to Op 3 on cluster C1. Labels: AvailableTime(Op2); move; ArrivalTime(Op2, C1); AvailableTime(Op1), ArrivalTime(Op1, C1); StartTime(Op3, C1); CompletionTime(Op3, C1, C1, C2).]

Choose a cluster for Op 3 to minimize CompletionTime

11
The Main Function: Assign

Assign (Op, Dests)
    for each Pred of Op                         (upward pass: recursive call)
        Est-clusters = Estimate (Op, Dests)
        Assign (Pred, Est-clusters)
    Est-clusters = Estimate (Op, Dests)         (downward pass: actual assignment)
    Cluster = first cluster in Est-clusters
    Assign Op to Cluster
    Mark Cluster's resources busy at StartTime (Op, Cluster)

The Estimate function returns a list of Clusters for which CompletionTime (Op, Cluster, Dests) is minimum
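
A runnable sketch of the Assign/Estimate recursion, under simplifying assumptions that are not in the slides: every op has latency 1, an intercluster move costs 1 cycle, each cluster issues one op per cycle, and predecessors are visited in list order rather than critical path first. The class name `Bug` and its fields are hypothetical.

```python
MOVE = 1      # assumed intercluster move latency
LATENCY = 1   # assumed latency of every op

class Bug:
    def __init__(self, preds, clusters, can_run):
        self.preds = preds        # op -> list of predecessor ops
        self.clusters = clusters  # list of cluster names
        self.can_run = can_run    # (op, cluster) -> bool
        self.cluster_of = {}      # final assignment
        self.done_at = {}         # op -> cycle at which its result is available
        self.busy = set()         # (cluster, cycle) issue slots already taken

    def depth(self, op):
        """Approximate AvailableTime of an unassigned op: depth plus latency."""
        return LATENCY + max((self.depth(p) for p in self.preds[op]), default=0)

    def start(self, op, cluster):
        t = 0
        for p in self.preds[op]:
            if p in self.cluster_of:   # assigned: real time, plus a move if remote
                move = MOVE if self.cluster_of[p] != cluster else 0
                t = max(t, self.done_at[p] + move)
            else:                      # not assigned yet: use the approximation
                t = max(t, self.depth(p))
        while (cluster, t) in self.busy:   # wait for a free issue slot
            t += 1
        return t

    def completion(self, op, cluster, dests):
        done = self.start(op, cluster) + LATENCY
        return max([done + (MOVE if d != cluster else 0) for d in dests] or [done])

    def estimate(self, op, dests):
        """Clusters on which op's completion time is minimum."""
        legal = [c for c in self.clusters if self.can_run(op, c)]
        cost = {c: self.completion(op, c, dests) for c in legal}
        best = min(cost.values())
        return [c for c in legal if cost[c] == best]

    def assign(self, op, dests):
        if op in self.cluster_of:
            return
        for pred in self.preds[op]:                    # upward pass
            self.assign(pred, self.estimate(op, dests))
        cluster = self.estimate(op, dests)[0]          # downward pass
        t = self.start(op, cluster)
        self.cluster_of[op] = cluster
        self.done_at[op] = t + LATENCY
        self.busy.add((cluster, t))

preds = {1: [], 2: [], 3: [1], 4: [2], 5: [3, 4]}
b = Bug(preds, ["C1", "C2"], lambda op, c: True)
b.assign(5, [])
print(b.cluster_of)   # {1: 'C1', 3: 'C1', 2: 'C2', 4: 'C1', 5: 'C1'}
print(b.done_at[5])   # 4
```

In this run Op 2 lands on C2 because C1's first two issue slots are already taken by Ops 1 and 3, which shows how resource availability feeds into the greedy choice.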

12
BUG

Traverses DFG in a reverse depth-first-search fashion
Upward pass
Predecessors have not been assigned yet
Use depth (estart) plus latency to approximate predecessors' AvailableTime
Estimate a set of good clusters for current op
Recursively assign predecessors with current set as predecessors' Dests
Downward pass
Make final cluster decisions for ops

13
Example

Assume all ops are 1-cycle
Each cluster can execute one op per cycle
Cluster 1 can execute any op, cluster 2 can only execute

[Figure: clusters C1 and C2; label M.]

14
Example: left path, upward pass
AvailTime(Op1) = 1
CompTime(Op3, C1, C1) = 2, CompTime(Op3, C2, C1) = 3
CompTime(Op5, C1, -) = 3
CompTime(Op1, C1, C1) = 1, CompTime(Op1, C2, C1) = 2
[Figure: left path of the DFG through Ops 1, 3, and 5, annotated with candidate cluster C1.]
15
Example: left path, downward pass
ArrivTime(Op1, C1) = 1, ArrivTime(Op1, C2) = 2
StartTime(Op3, C1) = 1, CompTime(Op3, C1, C1) = 2
StartTime(Op3, C2) = 2, CompTime(Op3, C2, C1) = 4
[Figure: left path of the DFG through Ops 1, 3, and 5, annotated with cluster C1.]
16
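Example: right path, upward pass
AvailTime(Op2) = 1
StartTime(Op4, C1) = 2, CompTime(Op4, C1, C1) = 3
StartTime(Op4, C2) = 1, CompTime(Op4, C2, C1) = 3
CompTime(Op2, C1, C1, C2) = 3, CompTime(Op2, C2, C1, C2) = 1
[Figure: DFG of Ops 1 to 5 with the right path through Ops 2, 4, and 5, annotated with candidate clusters C1 and C1,C2.]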
17
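Example: right path, downward pass
ArrivTime(Op2, C1) = 2, ArrivTime(Op2, C2) = 1
StartTime(Op4, C1) = 2, CompTime(Op4, C1, C1) = 3
StartTime(Op4, C2) = 1, CompTime(Op4, C2, C1) = 3
CompTime(Op5, C1, -) = 4
[Figure: DFG of Ops 1 to 5 with the right path through Ops 2, 4, and 5, annotated with cluster C1.]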
18
Class problem
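[Figure: class problem. A DFG to assign to clusters C1 and C2, with some ops labeled M, and an empty Schedule table to fill in.]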
19
Problems with BUG

BUG does a fairly good job of partitioning the DFG, but it can be improved
Problem 1: Local scope of the DFG
Has a very narrow view of the DFG
Doesn't consider the best global clustering
Problem 2: Scheduler-centric
Using the scheduler to determine the clustering is slow!
BUG is not the only solution to cluster assignment
Many different algorithms exist, using different techniques and different scopes, and running at different phases of the compilation process
No clear-cut winner on the best algorithm for all situations

20
Local Scope
[Figure: two panels, "Local scope clustering" and "Global scope clustering". Each shows a DFG of Ops 1 to 12 and its per-cycle schedule across two clusters, with intercluster moves marked.]
21
Scheduler-centric Nature

Cluster assignment during scheduling adds complexity
Detailed resource model/reservation table is slow!
Forces local decisions to be made

[Figure: per-cycle reservation tables for Cluster 1 and Cluster 2, with occupied slots marked X, next to a DFG of Ops 1 to 12.]
22
Region-based Hierarchical Operation Partitioning

RHOP is one of many advanced clustering techniques
Code is considered one region at a time
Weight calculation produces hints about how operations affect the scheduler
Partitioning uses a multilevel graph partitioner to cluster operations

[Figure: RHOP flow. Program, then Region, then Weight Calculation, then Graph Partitioning.]
23
Weight Calculation

Node weights are used to determine approximate resource usage
They differ depending on how many FUs of each type each cluster has
Edge weights are used to determine where to best break the graph
Where is intercluster communication free or preferred?

[Figure: a cluster with a register file and I, F, M, and B functional units, and a DFG of Ops 1 to 14 in which each op is labeled with its (estart, lstart) pair, ranging from (0,0) to (4,4).]
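
The (estart, lstart) labels in the figure can be computed with one forward and one backward pass over the DFG. The sketch below assumes unit latency. The edge-weight rule (zero-slack edges are expensive to cut, others are cheap) and the values 10 and 1 are illustrative assumptions, since the slides do not give the weight formula.

```python
def estart_lstart(succs, latency=1):
    """Return {op: (estart, lstart)} for a DAG given as op -> successor list."""
    # Depth-first topological order, producers before consumers.
    order, seen = [], set()
    def visit(op):
        if op not in seen:
            seen.add(op)
            for s in succs[op]:
                visit(s)
            order.append(op)
    for op in succs:
        visit(op)
    order.reverse()

    estart = {op: 0 for op in succs}
    for op in order:                      # forward pass: as early as possible
        for s in succs[op]:
            estart[s] = max(estart[s], estart[op] + latency)
    last = max(estart.values())
    lstart = {op: last for op in succs}
    for op in reversed(order):            # backward pass: as late as possible
        for s in succs[op]:
            lstart[op] = min(lstart[op], lstart[s] - latency)
    return {op: (estart[op], lstart[op]) for op in succs}

def edge_weights(succs, times, latency=1, high=10, low=1):
    """Assumed heuristic: an edge with no slack gets the high weight."""
    weights = {}
    for op in succs:
        for s in succs[op]:
            slack = times[s][1] - times[op][0] - latency
            weights[(op, s)] = high if slack == 0 else low
    return weights

succs = {1: [3], 2: [3, 4], 3: [5], 4: [], 5: []}
times = estart_lstart(succs)
weights = edge_weights(succs, times)
print(times)    # {1: (0, 0), 2: (0, 0), 3: (1, 1), 4: (1, 2), 5: (2, 2)}
print(weights)  # {(1, 3): 10, (2, 3): 10, (2, 4): 1, (3, 5): 10}
```

Only the edge into Op 4 has slack, so it is the one place where a partitioner could insert an intercluster move without stretching the critical path.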
24
RHOP - Coarsening

Coarsening takes highly related operations and groups them together so they can be partitioned as a unit later
Groups are formed based on edge weights
A snapshot is taken at each coarsening step, so the grouped operations can later be considered together
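
One common way to coarsen by edge weight in multilevel partitioners is heavy-edge matching, sketched below. The slides do not say which matching rule RHOP uses, so treat the rule, the function names, and the stopping condition (`target_groups`) as assumptions.

```python
def coarsen_once(groups, weights):
    """Merge pairs of groups joined by the heaviest edges (one coarsening level).

    groups:  list of frozensets of ops
    weights: {(op_a, op_b): weight} for edges between ops
    """
    owner = {op: g for g in groups for op in g}
    # Sum the op-level edge weights between every pair of distinct groups.
    between = {}
    for (a, b), w in weights.items():
        ga, gb = owner[a], owner[b]
        if ga != gb:
            key = frozenset((ga, gb))
            between[key] = between.get(key, 0) + w
    merged, used = [], set()
    # Heaviest connections first; each group joins at most one merge per level.
    for key, _ in sorted(between.items(), key=lambda kv: -kv[1]):
        ga, gb = tuple(key)
        if ga not in used and gb not in used:
            used.update((ga, gb))
            merged.append(ga | gb)
    return merged + [g for g in groups if g not in used]

def coarsen(ops, weights, target_groups=2):
    """Return a snapshot of every coarsening level, finest first."""
    snapshots = [[frozenset([op]) for op in ops]]
    while len(snapshots[-1]) > target_groups:
        nxt = coarsen_once(snapshots[-1], weights)
        if len(nxt) == len(snapshots[-1]):   # nothing left to merge
            break
        snapshots.append(nxt)
    return snapshots

snapshots = coarsen([1, 2, 3, 4], {(1, 2): 10, (2, 3): 1, (3, 4): 10})
for level in snapshots:
    print([sorted(g) for g in level])
# [[1], [2], [3], [4]]
# [[1, 2], [3, 4]]
```

The light edge between Ops 2 and 3 is the only one left uncontracted, so it is where the partitioner is free to cut.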

25
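RHOP - Scheduling estimate

[Figure: a DFG of Ops 1 to 14 partitioned across two clusters, with an estimated load per cycle for each cluster. Cluster 1 per-cycle weights: 2.5, 2.0, 0.5, 0.0, 0.0; Cluster_wgt1 = 5.0. Cluster 2 per-cycle weights: 0.0, 0.33, 0.33, 0.0, 0.0; Cluster_wgt2 = 0.67.]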
26
RHOP - Checking proposed moves

Move groups of operations over, see how it changes the load on the schedule estimate

[Figure: the schedule estimate after moving a group of ops from Cluster 1 to Cluster 2. Cluster 1 per-cycle weights: 1.0, 0.0, 0.0, 0.0, 0.0. Cluster 2 per-cycle weights: 1.33, 2.33, 0.83, 0.0, 0.0. SL(before) = 5.0, SL(after) = 4.5, Lgain = 0.5, Egain = -1.0, Mgain = 4.0.]
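
The slides do not state how SL and Lgain are computed, but the figure values are consistent with a simple reading: a cluster's weight is the sum of its per-cycle weights, the schedule-length estimate SL is the largest cluster weight, and Lgain is SL(before) minus SL(after). The sketch below checks that reading against the per-cycle weights shown in the scheduling-estimate and proposed-move figures. The function names are hypothetical, and Egain and Mgain are left out because their formulas cannot be recovered from the slides.

```python
def schedule_length(cluster_loads):
    """Estimated schedule length: the weight of the most heavily loaded cluster."""
    return max(sum(load) for load in cluster_loads)

def load_gain(before, after):
    """Positive when the proposed move shortens the estimated schedule."""
    return schedule_length(before) - schedule_length(after)

# Per-cycle weights for (Cluster 1, Cluster 2), read off the figures.
before = [[2.5, 2.0, 0.5, 0.0, 0.0], [0.0, 0.33, 0.33, 0.0, 0.0]]
after = [[1.0, 0.0, 0.0, 0.0, 0.0], [1.33, 2.33, 0.83, 0.0, 0.0]]
print(round(schedule_length(before), 2))    # 5.0
print(round(schedule_length(after), 2))     # 4.49
print(round(load_gain(before, after), 2))   # 0.51
```

The small differences from the figure's 4.5 and 0.5 come from the per-cycle weights being printed to two decimal places.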
27
RHOP - Refinement
