Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning - PowerPoint PPT Presentation

About This Presentation

Title:

Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning

Description:

Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning. Alex Alet ... Antonio Gonz lez. David Kaeli {aaleta, jmcodina, fran, antonio}_at_ac.upc. ... – PowerPoint PPT presentation

Number of Views:23

Avg rating:3.0/5.0

Slides: 36

Provided by: jesss9

Learn more at: https://arcb.csc.ncsu.edu

Category:

more less

Transcript and Presenter's Notes

Title: Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning

1
Exploiting Pseudo-schedules to Guide Data
Dependence Graph Partitioning
PACT 2002, Charlottesville, Virginia September
2002

Alex Aletà
Josep M. Codina
Jesús Sánchez
Antonio González
David Kaeli
aaleta, jmcodina, fran, antonio_at_ac.upc.es
kaeli_at_ece.neu.edu

2
Clustered Architectures

Current/future challenges in processor design
Delay in the transmission of signals
Power consumption
Architecture complexity
Clustering divide the system in semi-independent
units
Each unit ? Cluster
Fast interconnects intra-cluster
Slow interconnects inter-clusters
Common trend in commercial VLIW processors
TIs C6x
Analogs TigerSHARC
HPs LX
Equators MAP1000

3
Architecture Overview
4
Instruction Scheduling

For non-clustered architectures
Resources
Dependences
For clustered architectures
Cluster assignment
Minimize inter-cluster communication delays
Exploit communication locality
This work focuses on modulo scheduling for
clustered VLIW architectures
Technique to schedule loops

5
Talk Outline

Previous work
Proposed algorithm
Overview
Graph partitioning
Pseudo-scheduling
Performance evaluation
Conclusions

6
MS for Clustered Architectures

In previous work, two different approaches were
proposed

Two steps
Data Dependence Graph partitioning each
instruction is assigned to a cluster
Scheduling instructions are scheduled in a
suitable slot but only in the preassigned cluster

7
Goal of the Work

Both approaches have benefits
Two steps
Global vision of the Data Dependence Graph
Workload is better split among different clusters
Number of communications is reduced
One step
Local vision of partial scheduling
Cluster assignment is performed with information
of the partial scheduling
Goal obtain an algorithm taking advantage of the
benefits of both approaches

8
Baseline

Baseline scheme GP Aletà et al., Micro34
Cluster assignment performed with a graph
partitioning algorithm
Feed-back between the partitioning and the
scheduler
Results outperformed previous approaches
Still little information available for cluster
assignment
New algorithm better partition
Pseudo-schedules are used to guide the partition
Global vision of the Data Dependence Graph
More information to perform cluster assignment

9
Algorithm Overview
Compute initial partition
II MII
Start scheduling
Schedule Opj based on the current partition
Refine Partition
Able to schedule?
Select next operation (j)
YES
II
NO
Move Opj to another cluster
NO
Able to schedule?
YES
10
Algorithm Overview
Compute initial partition
II MII
Start scheduling
Schedule Opj based on the current partition
Refine Partition
Able to schedule?
Select next operation (j)
YES
II
NO
Move Opj to another cluster
NO
Able to schedule?
YES
11
Graph Partitioning Background

Problem statement
Split the nodes into a pre-determined number of
sets and optimizing some functions
Multilevel strategy
Coarsen the graph
Iteratively, fuse pairs of nodes into new
macro-nodes
Enhancing heuristics
Avoid excess load in any one set
Reduce execution time of the loops

12
Graph Coarsening

Previous definitions
Matching
Slack
Iterate until same number of nodes than clusters
The edges are weighted according to
Impact on execution time of adding a bus delay to
the edge
Slack of the edge
Then, select the maximum weight matching
Nodes linked by edges in the matching are fused
in a single macro-node

13
Coarsening Example
14
Example (II)
1st STEP Partition induced in the original graph
Initial graph
Induced Partition
Final graph
15
Reducing Execution Time

Estimation of execution time needed
Pseudo-schedules
Information obtained
II
SC
Lifetimes
Spills

16
Building pseudo-schedules

Dependences
Respected if possible
Else a penalty on register pressure and/or in
execution time is assessed
Cluster assignment
Partition strictly followed

17
Pseudo-schedule example

2 clusters, 1 FU/cluster, 1 bus of latency 1, II
2

Cluster 1 Cluster 2
A D
B
Cluster 1 Cluster 2
0 A
1
2
3 B
4 D
5
6 C?NO
7 C?NO
Instruction latency 3
18
Pseudo-schedule example
Cluster 1 Cluster 2
0 A
1
2
3 B
4 D
5
6
7
8 C
Cluster 1 Cluster 2
A,C D
B
Induced partition
A
D
B
C
19
Heuristic description

While improvement, iterate
Different partitions are obtained by moving nodes
among clusters
Partitions that produce overload resources in any
of the clusters are discarded
The partition minimizing execution time is chosen
In case of tie, the one that minimizes register
pressure is selected

20
Algorithm Overview
Compute initial partition
II MII
Start scheduling
Schedule Opj based on the current partition
Refine Partition
Able to schedule?
Select next operation (j)
YES
II
NO
Move Opj to another cluster
NO
Able to schedule?
YES
21
The Scheduling Step

To schedule the partition we use URACAM Codina
et al., PACT01
Figure of merit
Uses dynamic transformations to improve the
partial schedule
Register communications
Bus ? memory
Spill code on-the-fly
Register pressure ? memory
If an instruction can not be scheduled in the
cluster assigned by the partition
Try all other clusters
Select the best one according to a figure of merit

22
Algorithm Overview
Compute initial partition
II MII
Start scheduling
Schedule Opj based on the current partition
Refine Partition
Able to schedule?
Select next operation (j)
YES
II
NO
Move Opj to another cluster
NO
Able to schedule?
YES
23
Partition Refinement

II has increased
A better partition can be found for the new II
New slots have been generated in each cluster
More lifetimes are available
A larger number of bus communications allowed
Coarsening process is repeated
Only edges between nodes in the same set can
appear in the matching
After coarsening, the induced partition will be
the last partition that could not be scheduled
The reducing execution time heuristic is reapplied

24
Benchmarks and Configurations