Title: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip
1Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip
- Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim
- Department of Computer Science and Engineering
- Texas A&M University
2Multi-Core Wave Networks-On-Chip
- Uniprocessors hit the power wall.
- Multi-processors provide high performance at a lower power budget.
- Shared-bus architecture has scalability limitations.
- Networks-On-Chip (NOCs) orchestrate chip-wide communications towards future many-core processors.
3Challenges in On-Chip Communication
- High performance
  - Low communication latency is critical for high system performance.
- Bandwidth efficiency
  - Well-designed routing algorithms provide high network throughput.
- Power and area constraints
  - Simple topologies and slim routers reduce communication power consumption and save chip area.
- Efficient multicast support
  - Cache coherence protocols rely heavily on multicast or broadcast communication.
We propose a bandwidth-efficient routing for multicast communication in NOCs with low latency and power consumption.
4Prior Work in Multicast Communication
- Routing Evaluation Criteria for Multicast Communication [Ni93]
  - Multicast in multicomputer systems
- Tree-based Multicast Routing for DSM Multiprocessor [Torrellas96]
  - Short message multicast in DSM systems
- Virtual Circuit Tree Multicasting for NOCs [Lipasti08]
  - Demonstrates the necessity of multicasting on-chip
  - Proposes table-based multicast routing
- Region-based Multicast for CMPs [Duato08]
  - Multicast routing for irregular topologies in CMPs
5Outline
- Motivation
- Multicast Router Design
  - State-of-art Unicast Router Architecture
  - Replication Schemes
  - Destination List Management
- Recursive Partitioning Multicast (RPM)
  - Network Partitioning
  - Routing Rules
  - Example
  - Deadlock Avoidance
- Evaluation
- Conclusion
6Different Bandwidth Usage Example
[Figure: two 4×4 meshes (nodes 0-15), each marking the source and the destinations of the same multicast; the left and right meshes show two different paths.]
- Left path requires 11 link traversals, 12 buffer writes, 15 buffer reads, and 15 crossbar traversals.
- Right path requires 5 link traversals, 6 buffer writes, 10 buffer reads, and 10 crossbar traversals.
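How much bandwidth a multicast uses depends largely on how many links its copies can share. As a rough illustration, the sketch below counts link traversals when a multicast is sent as separate unicasts and when the copies share one tree. The source and destination set, the XY routing, and the function names are assumptions made for this example; it does not reproduce the two paths drawn on the slide.

```python
# Hypothetical 4x4 mesh: node id = row * WIDTH + col, with row 0 at the top.
WIDTH = 4

def xy_path(src, dst):
    """Return the directed links (a, b) of an XY route: along the row, then the column."""
    x, y = src % WIDTH, src // WIDTH
    dx, dy = dst % WIDTH, dst // WIDTH
    links = []
    while x != dx:                       # move along the row first
        nx = x + (1 if dx > x else -1)
        links.append((y * WIDTH + x, y * WIDTH + nx))
        x = nx
    while y != dy:                       # then along the column
        ny = y + (1 if dy > y else -1)
        links.append((y * WIDTH + x, ny * WIDTH + x))
        y = ny
    return links

def separate_unicasts(src, dsts):
    """Link traversals if every destination gets its own unicast packet."""
    return sum(len(xy_path(src, d)) for d in dsts)

def shared_tree(src, dsts):
    """Link traversals if copies share links and replicate only where paths split."""
    links = set()
    for d in dsts:
        links.update(xy_path(src, d))
    return len(links)

src, dsts = 9, [3, 7, 11, 15]            # assumed example, not the slide's
print(separate_unicasts(src, dsts), shared_tree(src, dsts))
```

In this assumed example the shared tree needs fewer than half the link traversals of the separate unicasts, which is the kind of saving a bandwidth-efficient multicast routing aims for.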
7State-of-Art Wormhole Unicast Router
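[Figure: two router pipelines. In the first, RC, VA, SA and ST are router stages followed by LT on the link; in the second, RC, VA and SA share one stage, followed by ST and then LT.]
RC: Route Computation; VA: VC Allocation; SA: Switch Allocation; ST: Switch Traversal; LT: Link Traversal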
8What Do We Need in a Multicast Router?
- Packet Replication
  - Synchronous Replication
  - Asynchronous Replication
- Destination List Management
  - All-destination Encoding
  - Bit String Encoding
  - Multiple-region Broadcast Encoding
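Of the destination list options above, bit string encoding is the easiest to show in code. The slide gives no header layout, so the sketch below assumes one bit per node, with bit i standing for node i; the function names are hypothetical.

```python
def encode_destinations(dests, num_nodes):
    """Pack a set of node ids into an integer with one bit per node (assumed layout)."""
    mask = 0
    for d in dests:
        if not 0 <= d < num_nodes:
            raise ValueError(f"node {d} outside 0..{num_nodes - 1}")
        mask |= 1 << d
    return mask

def decode_destinations(mask):
    """Recover the sorted node ids from the bit string."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

def remove_destination(mask, node):
    """Clear a node's bit, e.g. once a copy has been ejected at that node."""
    return mask & ~(1 << node)

header = encode_destinations({3, 7, 11, 15}, num_nodes=16)
print(f"{header:016b}")                  # leftmost bit is node 15
print(decode_destinations(remove_destination(header, 7)))
```

With this layout the header grows by one bit per node, so a 64-node mesh would need a 64-bit destination field.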
9Synchronous Replication
[Figure: timing diagram over cycles 0-3 showing the head (H), middle (M) and tail (T) flits of a packet moving between input ports 0-3 and output ports 0-3.]
- Packet replication happens at the Switch Traversal (ST) stage.
10Asynchronous Replication
[Figure: timing diagram over cycles 0-3 showing the head (H), middle (M) and tail (T) flits of a packet moving between input ports 0-3 and output ports 0-3.]
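The slides name the two replication schemes without spelling out how they differ. Under the usual reading, which is an assumption here, synchronous replication forwards a flit to all requested output ports in the same cycle, while asynchronous replication lets each copy leave as soon as its own port is free. The toy model below follows that reading; the availability table and function names are hypothetical.

```python
def synchronous_send_cycles(free):
    """free[c][p] is True if output port p can accept the flit in cycle c.
    All copies leave together in the first cycle where every port is free."""
    for cycle, ports in enumerate(free):
        if all(ports):
            return [cycle] * len(ports)
    return None                          # not sent within the given window

def asynchronous_send_cycles(free):
    """Each copy leaves in the first cycle in which its own port is free."""
    num_ports = len(free[0])
    sent = [None] * num_ports
    for cycle, ports in enumerate(free):
        for p in range(num_ports):
            if sent[p] is None and ports[p]:
                sent[p] = cycle
    return sent

free = [
    [True, False],    # cycle 0: only output 0 is free
    [False, True],    # cycle 1: only output 1 is free
    [True, True],     # cycle 2: both outputs are free
]
print(synchronous_send_cycles(free))     # copies wait for each other
print(asynchronous_send_cycles(free))    # copies leave independently
```

In this model one blocked branch delays every copy under the synchronous scheme, whereas the asynchronous scheme only delays the copy whose port is busy.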
11Network Partitioning
[Figure: network partitioning relative to the source node, with compass directions N, W, E and S. One panel shows eight parts; four panels show three parts each: (5, 6, 7), (0, 1, 7), (3, 4, 5) and (1, 2, 3).]
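The partitioning step can be sketched as a function that groups destinations by their position relative to the source. The slide does not say which region carries which number, so the numbering below is an assumption: even parts are quadrants and odd parts are the straight lines through the source, counted clockwise from the top-left quadrant. It was chosen because a source in a corner then sees exactly the three-part groups listed on the slide.

```python
def partition(src, dests, width):
    """Group destinations into parts 0-7 by their position relative to src.

    Assumed numbering (not given in the slides):
    0=NW, 1=N, 2=NE, 3=E, 4=SE, 5=S, 6=SW, 7=W.
    Node id = row * width + col, with row 0 at the top (north).
    """
    sx, sy = src % width, src // width
    parts = {}
    for d in dests:
        dx, dy = d % width - sx, d // width - sy
        if dx == 0 and dy == 0:
            continue                     # the source itself needs no routing
        if dy < 0:                       # north of the source
            part = 0 if dx < 0 else 1 if dx == 0 else 2
        elif dy == 0:                    # same row as the source
            part = 7 if dx < 0 else 3
        else:                            # south of the source
            part = 6 if dx < 0 else 5 if dx == 0 else 4
        parts.setdefault(part, []).append(d)
    return parts

everyone = range(16)
print(sorted(partition(5, everyone, width=4)))    # interior source
print(sorted(partition(15, everyone, width=4)))   # corner source
```

An interior source of a 4×4 mesh sees all eight parts, while a corner source sees only three.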
12Basic Routing Rules
- North: top right corner.
- West: top left corner.
- South: bottom left corner.
- East: bottom right corner.
[Figure: mesh with the source and destinations marked; links are labeled with the direction taken (N, E, W, S).]
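One way to read the basic rules is that a destination in the top right corner region is reached through the North port, one in the top left through West, and so on, with the same decision repeated at every router the packet visits. The sketch below follows that reading. Sending destinations that lie straight ahead through the matching port is an assumption, and so are the function names and the example; the optimized rules on the next slide are not modeled.

```python
STEP = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}

def output_port(dx, dy):
    """Port for a destination at offset (dx, dy) from this router (dy < 0 is north)."""
    if dx == 0 and dy == 0:
        return "EJECT"
    if dy < 0 and dx >= 0:
        return "N"                       # straight north, or top right corner
    if dx < 0 and dy <= 0:
        return "W"                       # straight west, or top left corner
    if dy > 0 and dx <= 0:
        return "S"                       # straight south, or bottom left corner
    return "E"                           # straight east, or bottom right corner

def multicast(node, dests, width, links=None):
    """Recursively replicate a packet from node to dests; return the links used."""
    links = [] if links is None else links
    x, y = node % width, node // width
    groups = {}
    for d in dests:                      # partition relative to the current node
        port = output_port(d % width - x, d // width - y)
        groups.setdefault(port, []).append(d)
    for port, group in groups.items():
        if port == "EJECT":
            continue                     # this node is a destination: deliver locally
        sx, sy = STEP[port]
        nxt = (y + sy) * width + (x + sx)
        links.append((node, nxt))        # one copy per output port in use
        multicast(nxt, group, width, links)
    return links

links = multicast(9, [3, 7, 11, 15], width=4)   # assumed 4x4 example
print(len(links), links)
```

Every hop moves a copy one step closer to all destinations in its group, so the recursion ends once each destination has ejected its copy.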
13Optimized Routing Rules
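[Figure: mesh with the source and destinations marked, illustrating the optimized routing rules; one case is annotated "Deadlock!!!".]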
14RPM Example-step 1
[Figure: mesh marking the source, the destinations and the partitioning; three copies of the multicast packet (M) are in flight.]
15RPM Example-step 2
[Figure: mesh marking the source, the destinations and the partitioning; four copies of the multicast packet (M) are in flight and one ejection takes place.]
16RPM Example-step 3
[Figure: mesh marking the source, the destinations and the partitioning; four copies of the multicast packet (M) are in flight.]
17RPM Example-step 4
[Figure: mesh marking the source, the destinations and the partitioning; five copies of the multicast packet (M) are in flight and three ejections take place.]
18RPM Example-step 5
[Figure: mesh marking the source, the destinations and the partitioning; three copies of the multicast packet (M) are in flight and one ejection takes place.]
19Deadlock Avoidance
- RPM has no turn restrictions, potentially introducing deadlock.
- We use Virtual Networks (VNs) to avoid deadlock.
  - Two VNs lie in the same physical network.
  - The Virtual Channels of each port are divided equally between the two virtual networks.
  - The virtual network ID (0 or 1) of each packet is decided at the source.
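The equal split of virtual channels can be sketched as below. Giving each virtual network a contiguous block of VC indices is an assumption, as is the function name; the slides only state that the VCs of a port are divided equally and that the source picks the virtual network ID, so the ID is taken as an input here.

```python
def vcs_for_virtual_network(vn_id, vcs_per_port=4, num_vns=2):
    """Return the VC indices at a port that a packet in virtual network vn_id may use."""
    if vcs_per_port % num_vns != 0:
        raise ValueError("VCs must divide equally among the virtual networks")
    if not 0 <= vn_id < num_vns:
        raise ValueError("unknown virtual network id")
    share = vcs_per_port // num_vns
    # Assumed layout: VN 0 owns the first block of VCs, VN 1 the next one.
    return list(range(vn_id * share, (vn_id + 1) * share))

print(vcs_for_virtual_network(0), vcs_for_virtual_network(1))
```

Because a packet keeps its virtual network ID for its whole journey, packets in different virtual networks never wait on each other's virtual channels in this model.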
20Evaluation Methodology
- Performance Model: Cycle-accurate Network Simulator
  - Models all router pipeline stages in detail
  - Highly parameterized
- Power Model: Orion with both dynamic and leakage power models

Network configuration
  Parameter                      Value
  Topology                       8×8 Mesh (6×6 Mesh, 10×10 Mesh, 16×16 Mesh)
  Routing                        RPM
  VC/Port                        4
  VC Depth                       4
  Packet Length (flits)          4
  Unicast Traffic Pattern        Uniform Random (Bit Complement, Transpose)
  Multicast Packet Portion       10% (5%, 20%, 40%, 80%)
  Multicast Destination Number   0-16 (uniformly distributed)
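The traffic mix in the configuration can be sketched as a small packet generator: a fixed share of packets is multicast, and each multicast draws its destination count uniformly from 0 to 16. This is an illustrative reading of the table, not the simulator's code; the function name, the use of Python's `random` module, and drawing destinations uniformly from all other nodes are assumptions.

```python
import random

def make_packet(src, num_nodes, rng, multicast_portion=0.10, max_dests=16):
    """Return (kind, destinations) for one injected packet."""
    others = [n for n in range(num_nodes) if n != src]
    if rng.random() < multicast_portion:
        count = rng.randint(0, max_dests)        # uniform over 0..16 inclusive
        return "multicast", sorted(rng.sample(others, count))
    return "unicast", [rng.choice(others)]       # uniform random unicast

rng = random.Random(0)                           # fixed seed for repeatability
packets = [make_packet(src=0, num_nodes=64, rng=rng) for _ in range(1000)]
multicasts = [dests for kind, dests in packets if kind == "multicast"]
print(len(multicasts), "multicast packets out of", len(packets))
```

Taking the 0-16 range literally means a multicast can occasionally carry no destinations; a generator that wanted to avoid that would draw from 1 to 16 instead.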
21Uniform Random Traffic
[Figure annotations: 50%, 40%, 40%]
- Latency is improved by around 50% before network saturation.
- Network throughput is extended by 40%.
22Link Utilization
[Figure annotations: 33%, 45%]
- In low workload, RPM saves 33% link utilization.
- In high workload, RPM saves 45% link utilization.
23Dynamic Power Consumption
[Figure annotations: 50%, 40%]
24Scalability Study-Network Size
[Figure annotation: over 50%]
25Scalability Study-Multicast Traffic Portion
26Scalability Study-Destination Number
27Conclusion
- Propose a new multicast routing algorithm, Recursive Partitioning Multicast (RPM)
  - Bandwidth-efficient and scalable
- Performance Improvement
  - Up to 50% latency reduction
  - 33% link utilization reduction
- Power Savings
  - Up to 40% total dynamic power savings
  - 25% crossbar and link power savings
28Thank you!
29Backup
30Hardware Implementation of Routing logic
31Bit Complement Traffic
32Transpose Traffic