Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures



1
Edge-centric Modulo Scheduling for
Coarse-Grained Reconfigurable Architectures
  • Hyunchul Park, Kevin Fan, Scott Mahlke,
  • Taewook Oh, Heeseok Kim, Hong-seok Kim
  • University of Michigan
  • Samsung Advanced Institute of Technology

October 28, 2008
2
Coarse-Grained Reconfigurable Architecture (CGRA)
  • Array of PEs connected in a mesh-like
    interconnect
  • High throughput with a large number of resources
  • Distributed hardware offers low cost/power
    consumption
  • High flexibility with dynamic reconfiguration

3
CGRA: Attractive Alternative to ASICs
  • Suitable for running multimedia applications for
    future embedded systems
  • High throughput, low power consumption, high
    flexibility
  • MorphoSys: 8x8 array with RISC processor
  • SiliconHive: hierarchical systolic array
  • ADRES: 4x4 array with tightly coupled VLIW

[Figure: MorphoSys, SiliconHive, and ADRES, captioned in that order: Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW]
4
Scheduling in CGRA
  • Sparse interconnect and distributed register
    files
  • No dedicated routing resources: FUs are used for
    routing
  • Need explicit routing of operands by compiler

[Figure: a conventional VLIW, with FUs sharing a central RF, next to a CGRA, with an RF attached to each FU]
5
Scheduling Difficulties
  • VLIW: routing is guaranteed by central RF
  • CGRA: multiple possible routes
  • Compiler is responsible for finding routes
  • Routing can easily be blocked by other operations

[Figure: operand routing in a VLIW compared with a CGRA]
6
Objective of This Work
  • Modulo scheduling technique for CGRAs
  • Exploit loop-level parallelism by overlapping
    execution of iterations
  • Customized approach based on characteristics of
    CGRAs
  • Achieve fast compile time and good performance
  • Huge scheduling space, distributed resources
  • Naïve approach can result in either poor solution
    or long compile time
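
As a rough illustration of what overlapping iterations implies for resources, the sketch below models a modulo reservation table, a standard modulo-scheduling bookkeeping structure. It is not taken from the presentation: the class name `ModuloReservationTable` and its methods are hypothetical. With an initiation interval `ii`, an operation placed at absolute time `t` occupies row `t % ii` of its FU in every iteration, so two operations conflict when they land on the same FU and the same row.

```python
class ModuloReservationTable:
    """Tracks which (FU, time mod II) slots a modulo schedule has claimed."""

    def __init__(self, num_fus, ii):
        self.num_fus = num_fus
        self.ii = ii            # initiation interval: cycles between iteration starts
        self.table = {}         # (fu, time % ii) -> operation name

    def is_free(self, fu, time):
        return (fu, time % self.ii) not in self.table

    def reserve(self, fu, time, op):
        if not self.is_free(fu, time):
            raise ValueError(f"FU {fu} is busy at time {time} (mod {self.ii})")
        self.table[(fu, time % self.ii)] = op


mrt = ModuloReservationTable(num_fus=4, ii=2)
mrt.reserve(fu=0, time=0, op="load")
mrt.reserve(fu=0, time=1, op="add")

# Time 2 wraps onto row 0, which "load" from the next iteration already uses.
conflict = not mrt.is_free(fu=0, time=2)
other_fu_free = mrt.is_free(fu=1, time=2)
print(conflict, other_fu_free)
```

On a CGRA the same table has to cover routing slots as well as operation slots, because FUs are also used to forward operands.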

7
Traditional Approach: Node-centric
[Figure: schedule grid (time 0 to 4, FU 0 to FU 4) with producers P1 and P2 placed and the candidate slots for consumer C numbered 0 to 10]
Operations are placed first, then routing is performed.
All candidate slots are visited to find the solution.
8
Node-centric Inefficiency 1
[Figure: schedule grid (time 0 to 4, FU 0 to FU 4) with producers P1 and P2 and candidate slots 0 to 10 for consumer C]
Routing is attempted to slots that are unreachable for edge P1 → C.
9
Node-centric Inefficiency 2
[Figure: schedule grid (time 0 to 4, FU 0 to FU 4) with producers P1 and P2 and candidate slots 0 to 10 for consumer C]
Repeat the same routing already performed
10
Our Approach: Edge-centric
[Figure: two schedule grids (time 0 to 4, FU 0 to FU 4) with producers P1 and P2. Node-centric: candidate slots for C numbered 0 to 10. Edge-centric: slots numbered 0 to 4.]
Start routing without placing the operation.
Placement occurs during routing.
11
Benefit 1: Fewer Routing Calls
[Figure: two schedule grids (time 0 to 4, FU 0 to FU 4) with producers P1 and P2 and consumer C. Node-centric: slots numbered 0 to 10. Edge-centric: slots numbered 0 to 4.]
Node-centric: 11 routing calls for P1 → C
Edge-centric: 1 routing call for P1 → C
Reduce compile time with fewer routing
calls
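
The difference in routing calls can be sketched on a toy model. Everything below is an assumption made for illustration, not the presenters' implementation: a 1-D array of five FUs, a value that either stays on its FU or hops to an adjacent one each cycle, a plain breadth-first router, and a hypothetical `capable_fus` set naming the FUs that can execute the consumer. The node-centric function picks a candidate slot and then routes to it; the edge-centric function routes outward once and places the consumer at the first legal slot the router reaches.

```python
from collections import deque

NUM_FUS, MAX_TIME = 5, 4  # hypothetical toy array: 5 FUs, time steps 0..4


def successors(slot, occupied):
    """Free slots a value can reach one cycle later (stay, or hop to an adjacent FU)."""
    time, fu = slot
    if time == MAX_TIME:
        return []
    nxt = [(time + 1, f) for f in (fu - 1, fu, fu + 1) if 0 <= f < NUM_FUS]
    return [s for s in nxt if s not in occupied]


def route(source, is_target, occupied):
    """One routing call: BFS from source, returning the first slot accepted by is_target."""
    queue, seen = deque([source]), {source}
    while queue:
        slot = queue.popleft()
        if slot != source and is_target(slot):
            return slot
        for nxt in successors(slot, occupied):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


def node_centric(producer, occupied, capable_fus):
    """Choose a candidate slot for the consumer first, then try to route to it."""
    calls = 0
    for time in range(producer[0] + 1, MAX_TIME + 1):
        for fu in sorted(capable_fus):
            candidate = (time, fu)
            if candidate in occupied:
                continue
            calls += 1
            if route(producer, lambda s: s == candidate, occupied) is not None:
                return candidate, calls
    return None, calls


def edge_centric(producer, occupied, capable_fus):
    """Route outward from the producer; place the consumer at the first legal slot reached."""
    placed = route(producer, lambda s: s[1] in capable_fus, occupied)
    return placed, 1


print(node_centric((0, 4), set(), {0, 1, 2, 3}))  # ((1, 3), 4)
print(edge_centric((0, 4), set(), {0, 1, 2, 3}))  # ((1, 3), 1)
```

In this toy case both strategies end at the same slot, but the node-centric loop spends three routing calls on slots the producer cannot reach in one cycle. The counts are specific to this example and are not the 11-versus-1 figure from the slide.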
12
Benefit 2: Global View
[Figure: two schedule grids (time 0 to 4, FU 0 to FU 4), node-centric and edge-centric, each with producer P, consumer C, and slots numbered 0 to 2]
  • Assume slot 0 is a precious resource (better to
    save it for later use)
  • Node-centric greedily picks slot 1
  • Edge-centric can avoid slot 0 by simply assigning
    a high cost
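
A minimal sketch of the cost idea in the bullets above, assuming a Dijkstra-style router over (time, FU) slots on a hypothetical three-FU array. The function `cheapest_placement` and the `slot_cost` dictionary are illustrative names, and the default cost of 1 per slot is an assumption. Marking a slot as precious only requires raising its cost; the search then settles on a different slot without any special-case logic.

```python
import heapq


def cheapest_placement(source, can_place, slot_cost, num_fus=3, max_time=3):
    """Dijkstra over (time, fu) slots; returns (cost, slot) of the cheapest legal placement."""
    heap = [(0, source)]
    best = {source: 0}
    while heap:
        cost, slot = heapq.heappop(heap)
        if cost > best.get(slot, float("inf")):
            continue  # stale heap entry
        if slot != source and can_place(slot):
            return cost, slot
        time, fu = slot
        if time == max_time:
            continue
        for nxt_fu in (fu - 1, fu, fu + 1):
            if 0 <= nxt_fu < num_fus:
                nxt = (time + 1, nxt_fu)
                new_cost = cost + slot_cost.get(nxt, 1)  # assumed default cost: 1
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    heapq.heappush(heap, (new_cost, nxt))
    return None


anywhere = lambda slot: True

# Uniform costs: the first cheapest slot found is (time 1, FU 0).
print(cheapest_placement((0, 1), anywhere, {}))             # (1, (1, 0))

# Mark (time 1, FU 0) as precious by giving it a high cost.
print(cheapest_placement((0, 1), anywhere, {(1, 0): 10}))   # (1, (1, 1))
```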

13
Edge-centric Modulo Scheduling
  • It's all about edges
  • Scheduling is constructed by routing edges
  • Placement is integrated into routing process
  • Global perspective for EMS
  • Scheduling order of edges
  • Prioritize edges to determine scheduling order
  • Routing optimization
  • Develop contention model for routing resources

14
1. Edge Prioritization
  • Focus on consumers
  • Simple edges / High fanout edges
  • Height-based priority
  • Give high priority to high fanout edges
  • Edges scheduled later will likely use extra
    resources
  • Extra resources in simple edges are just being
    wasted
  • Extra resources in high-fanout edges can be
    helpful
  • Other consumers can make use of those

15
Fanout Clustering
  • Our approach: the opposite
  • Give priority to simple edges
  • Operations connected in simple edges form a
    cluster
  • Schedule simple edges within a cluster
  • Schedule high-fanout edges when consumers are
    visited
  • 17 of 81 loops in H.264 show better throughput
  • Only 1 shows worse throughput
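
The clustering step can be sketched with a union-find pass over the data-flow graph. The presentation does not define "simple edge" precisely; the sketch assumes a simple edge is one whose producer has exactly one consumer and a high-fanout edge is one whose producer has several. The function name `fanout_clusters` is hypothetical.

```python
from collections import defaultdict


def fanout_clusters(edges):
    """Group operations joined by simple edges (producer with exactly one consumer)."""
    consumers = defaultdict(list)
    nodes = {}
    for src, dst in edges:
        consumers[src].append(dst)
        nodes[src] = nodes[dst] = True  # dict keeps first-seen order

    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for src, dsts in consumers.items():
        if len(dsts) == 1:                 # simple edge: merge producer and consumer
            parent[find(src)] = find(dsts[0])

    clusters = defaultdict(list)
    for n in nodes:
        clusters[find(n)].append(n)
    return sorted(sorted(c) for c in clusters.values())


dfg_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"), ("e", "f")]
print(fanout_clusters(dfg_edges))  # [['a', 'b', 'c'], ['d'], ['e', 'f']]
```

Here `c` feeds two consumers, so its outgoing edges are left for later, when `d` and `e` are visited.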

16
2. Routing Optimization
  • Routing is guided by cost associated with each
    routing slot
  • Intelligent routing cost metrics are important
  • Minimize routing resources for current edge
  • Static cost: fixed positive cost for each
    resource
  • Minimize routing resources for other edges to
    the producer/consumer
  • Affinity cost: use common consumer information
  • Avoid routing failures for other edges
  • Probabilistic cost: predict future resource usage

routing cost = F(static cost, affinity cost,
probabilistic cost)
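
The presentation does not say what F is. Purely as an assumed form, the sketch below treats affinity as a discount on the static cost and contention as a penalty proportional to the predicted usage probability. The function name, the parameter names, and the weight of 4.0 are all hypothetical.

```python
def routing_cost(static_cost, affinity_discount, usage_probability,
                 contention_weight=4.0):
    """Assumed combination: discounted static cost plus a contention penalty."""
    discounted = max(static_cost - affinity_discount, 0.0)   # never below zero
    return discounted + contention_weight * usage_probability


print(routing_cost(2, 0, 0.0))   # plain slot: 2.0
print(routing_cost(2, 2, 0.0))   # slot near a high-affinity operation: 0.0
print(routing_cost(2, 0, 1.0))   # slot another edge will certainly need: 6.0
```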
17
Affinity Cost Heuristic
[Figure: schedule grid (time 0 to 3, FU 0 to FU 3) with operations A, B, and C, and two routing options across FU 0 to FU 3 labeled Routing Cost 2 and Routing Cost 0]
  • Affinity cost: utilize common consumer
    information
  • Affinity value: how close the common consumer is in
    the DFG
  • Place operations with high affinity close to each
    other
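
One way to measure "how close the common consumer is" is a breadth-first search downstream from each operation. The sketch below is an assumption about how such a value could be computed, not the formula used in the talk; `common_consumer_distance` is a hypothetical name, and a smaller distance would correspond to a higher affinity.

```python
from collections import deque


def downstream_distances(dfg, start):
    """Edge count from start to every operation reachable through consumer edges."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for succ in dfg.get(node, ()):
            if succ not in dist:
                dist[succ] = dist[node] + 1
                queue.append(succ)
    return dist


def common_consumer_distance(dfg, a, b):
    """Distance to the nearest operation that consumes from both a and b, or None."""
    dist_a = downstream_distances(dfg, a)
    dist_b = downstream_distances(dfg, b)
    shared = (set(dist_a) & set(dist_b)) - {a, b}
    if not shared:
        return None
    return min(max(dist_a[n], dist_b[n]) for n in shared)


# Hypothetical DFG: A and B feed C directly; D reaches C through E.
dfg = {"A": ["C"], "B": ["C"], "D": ["E"], "E": ["C"]}
print(common_consumer_distance(dfg, "A", "B"))  # 1
print(common_consumer_distance(dfg, "A", "D"))  # 2
print(common_consumer_distance(dfg, "A", "X"))  # None
```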

18
Probabilistic Cost Heuristic
[Figure: DFG with P1, P2, C1, C2, and ST, next to a schedule grid (time 0 to 7, FU 0 to FU 4) with P1, P2, C1, and C2 placed]
Three possible routes, all using the same routing
slots
19
Probabilistic Cost Heuristic
[Figure: DFG with P1, P2, C1, C2, and ST, next to a schedule grid (time 0 to 7, FU 0 to FU 4) with P1, P2, C1, and C2 placed]
Need to consider other unplaced edges/operations
Slots that might be used for routing P2 → C2
Slots that might be used for placing ST
20
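Probabilistic Cost Heuristic
[Figure: DFG with P1, P2, C1, C2, and ST, next to a schedule grid (time 0 to 7, FU 0 to FU 4) annotated with slot usage probabilities and two X marks. Values per time row as extracted (FU columns lost): time 0: 0.33; time 1: 0.33; time 2: 1.0; time 3: 1.0, 0.33; time 4: 1.0; time 5: 0.5, 0.5; time 6: 0.5, 0.5]
Probabilities of future slot usage are
calculated and guide the routing of P1 → C1.
The route in the middle is selected.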
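
The slide shows values such as 0.33, 0.5, and 1.0, which is what a uniform model over three or two alternatives would produce, but the presentation does not state the exact formula. The sketch below assumes that model: every candidate route of an unscheduled edge is taken to be equally likely, and a slot's probability is the fraction of candidate routes that pass through it. The function name and the example routes are hypothetical.

```python
from collections import Counter


def slot_usage_probabilities(candidate_routes):
    """P(slot is used), assuming each candidate route is equally likely."""
    counts = Counter()
    for route in candidate_routes:
        counts.update(set(route))          # count each slot once per route
    total = len(candidate_routes)
    return {slot: n / total for slot, n in counts.items()}


# Three hypothetical routes for one edge, as (time, fu) slots.
# They start on different FUs and all pass through (2, 1).
routes = [
    [(1, 0), (2, 1)],
    [(1, 1), (2, 1)],
    [(1, 2), (2, 1)],
]
probs = slot_usage_probabilities(routes)
print(round(probs[(1, 0)], 2), probs[(2, 1)])  # 0.33 1.0
```

A router for another edge can then add a penalty proportional to these values, steering away from slots that the unscheduled edge is certain to need.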
21
EMS System Flow
[Figure: flow chart with inputs DFG and CGRA; boxes labeled Preprocessing, Fanout clustering, Prioritize edges, Cost calculation, Schedule, Select target edge, Perform routing, Place operations, and Route to others; output: Final schedule]
22
Experimental Setup
  • 214 loops from highly optimized media
    applications
  • H.264, 3D graphics, AAC, MP3
  • Target architecture
  • 4x4 heterogeneous CGRA (6 memory, 4 multiply)
  • Local RF for each PE
  • Mesh-plus interconnect: mesh + 2-hop connections
  • Compared to 3 other solutions
  • IMS: iterative modulo scheduling, no routing
    optimization
  • NMS: same heuristics as EMS, but in a
    node-centric way
  • DRESC: IMEC's simulated annealing

23
Results
  • Performance: normalized throughput of loops
  • Max throughput is determined by ops in a loop
    and resources
  • Compile time for all 214 loops
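
One common way to make "determined by ops in a loop and resources" concrete is the resource-constrained minimum initiation interval from the modulo scheduling literature. The sketch below assumes that definition, treats all FUs as interchangeable, and ignores recurrence constraints; the talk may normalize differently. The 16 FUs correspond to the 4x4 array in the experimental setup, while the operation count and achieved II are made-up inputs.

```python
import math


def resource_mii(num_ops, num_fus):
    """Lower bound on the initiation interval imposed by FU count alone."""
    return math.ceil(num_ops / num_fus)


def normalized_throughput(num_ops, num_fus, achieved_ii):
    """1.0 means the schedule reaches the resource-constrained bound."""
    return resource_mii(num_ops, num_fus) / achieved_ii


print(resource_mii(40, 16))                # 3
print(normalized_throughput(40, 16, 4))    # 0.75
```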

24
Conclusion
  • EMS is a good match for scheduling in CGRA
  • Routing is more important than placement
  • Edge-centric approach allows fast compile time
  • 18x speed up over simulated annealing
  • Intelligent routing cost metrics allow good
    performance
  • 24% improvement over IMS, 98% of the performance of
    the existing solution

25
Questions?