Title: Offchip Communication Architectures for High Throughput Network Processors
1 Off-chip Communication Architectures for High Throughput Network Processors
College of Engineering and Computer Science, Department of Electrical and Computer Engineering
- Short version of Final Defense
- Jacob Engel
2 Motivation
- Line card design challenges
- Rapidly growing line rates
- Additional deep packet processing operations
- Increase in memory capacity requirements
- The memory wall
- Nature of packet transmission
- Variable packet size
- Out of order transmission
- Off-chip vs. on-chip
- Off-chip interconnects lack innovative methods to improve their integration into large scalable network components
- PCB physical limitations
- Area
- I/O pins
- Switching elements
- Scalability
3 Interconnect Topologies
- Bus
- k-ary n-cubes
- Mesh
- N nodes arranged in a rectangle or square
- Each node connected to 4 neighbors
- Cost: N switches (one per node)
- Torus
- Mesh with wrap-around
- Longer cables than simple mesh
- Harder to partition
- Hypercubes
- Multi-dimensional cubes
- Nodes identified by n-bit numbers
- Each node connected to other nodes that differ in a single bit
- Cost: N switches (one per node)
Torus network (k-ary n-cube with wraparound)
Mesh (4-ary 2-cube network)
3D Hypercube (2-ary n-cube)
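The neighbor rules listed above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the (x, y) coordinate convention are assumptions, not part of the presentation.

```python
def hypercube_neighbors(node: int, n: int) -> list[int]:
    """Neighbors of `node` in an n-dimensional hypercube.

    Nodes are n-bit numbers; each neighbor differs in exactly one bit.
    """
    return [node ^ (1 << bit) for bit in range(n)]


def grid_neighbors(x: int, y: int, k: int, wraparound: bool) -> list[tuple[int, int]]:
    """Neighbors of (x, y) on a k x k grid.

    wraparound=False models a mesh (edge nodes have fewer neighbors);
    wraparound=True models a torus (every node has 4 neighbors).
    """
    result = []
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if wraparound:
            result.append((nx % k, ny % k))
        elif 0 <= nx < k and 0 <= ny < k:
            result.append((nx, ny))
    return result


print(hypercube_neighbors(0b000, 3))     # neighbors of node 000 in a 3D hypercube
print(grid_neighbors(0, 0, 4, False))    # corner node of a 4 x 4 mesh
print(grid_neighbors(0, 0, 4, True))     # same node with wrap-around links
```

The corner node shows the difference between the two grid topologies: the mesh gives it two neighbors, while the torus wrap-around links give it four.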
4 Switching Mechanism in Interconnects
- Bus-based network
- Drawback
- Does not scale up with number of processors
- Crossbar switching network
- Allows connection of any of p processors to any of b memory banks
- Number of switches: n²
- Wiring complexity: O(n²w)
- Disadvantage
- Its complexity grows as a function of p²
- Expensive for large number of processors
- Physical dimensions (switches, wiring)
Bus-based network
Crossbar network
5 Switching Mechanism in Interconnects
- Omega switching network
- More scalable in terms of cost than crossbar, more scalable in terms of performance than bus
- Number of switches: (n/2) log_k(n)
- n: channels, k: switches per box
- Wiring complexity: O(n w log_k(n))
- Drawbacks
- Limited switching flexibility
- Blocking
Omega switch
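To make the cost difference between the two switching networks concrete, the switch counts quoted on the slides can be evaluated for a few sizes. This is a sketch under one stated assumption: the omega network is built from 2x2 boxes (k = 2), so the count is (n/2) log2(n). The function names are hypothetical.

```python
def crossbar_switches(n: int) -> int:
    """Crossbar: one switch per input/output pair, n^2 in total."""
    return n * n


def omega_switches(n: int) -> int:
    """Omega network of 2x2 boxes (assumed k = 2): (n/2) * log2(n) boxes."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1  # log2(n) when n is a power of two
    return (n // 2) * stages


for n in (8, 64, 512):
    print(n, crossbar_switches(n), omega_switches(n))
```

For n = 64 the crossbar needs 4096 switches against 192 boxes for the omega network, which is the scalability gap the slides refer to.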
6 K-ary n-cube based architectures
- Packet-based, multiple paths
- Packets are shared among PEs and memory modules (M)
- PEs: TM, QoS, classification
- An oversubscribed or faulty link does not prevent connectivity to its nodes
- Uses wormhole routing
- Packets are routed adaptively based on traffic load and connectivity
8-ary 2-cube interconnect
4-ary 3-cube interconnect
Shared-bus
3D-mesh interconnect
7 Wormhole routing
- Wormhole routing operates on flits
- Typically what can be transmitted in a single cycle
- Flit size = channel width
- Packet header is typically one or two flits
- The rest of the flits do not contain routing information
- Known for its improved latency and throughput
[Figure: a message split into flits 4, 3, 2, 1 behind a message header, pipelined across Node 1, Node 2, Node 3, and Node 4]
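The flit structure described above can be sketched as follows. The encoding is an assumption made for illustration: a single header flit that carries only the destination node id, and a 4-byte flit matching a hypothetical 32-bit channel. The presentation does not specify a flit format.

```python
def packet_to_flits(dest: int, payload: bytes, flit_bytes: int = 4) -> list[tuple[str, bytes]]:
    """Split a packet into one header flit plus body flits.

    Only the header flit carries routing information (the destination id);
    body flits carry payload and follow the path the header sets up.
    """
    header = ("head", dest.to_bytes(flit_bytes, "big"))
    body = [
        ("body", payload[i:i + flit_bytes])
        for i in range(0, len(payload), flit_bytes)
    ]
    return [header] + body


for kind, data in packet_to_flits(dest=3, payload=b"abcdefghij"):
    print(kind, data)
```

Because the body flits hold no routing information, each node only has to make a routing decision once per packet, when the header flit arrives.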
8 K-ary n-cube based architectures
- Performance measures
- Latency = T_w + T_s + T_r (wire propagation, switching, and routing delays)
- Latency does not consider queuing delays at this point
- Latency is measured per flit per flow
- Throughput = ch_width / L
- Bi-directional links will increase the throughput
- Aggregate throughput is a function of the number of PEs as well
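A small worked example of the two performance measures follows. It rests on assumptions: T_w, T_s, and T_r are read as the wire propagation, switching, and routing delays, L in the throughput expression is read as the latency, and the numeric values and units (nanoseconds, bits) are illustrative rather than taken from the presentation.

```python
def flit_latency(t_wire: float, t_switch: float, t_route: float) -> float:
    """Per-flit latency without queuing delays: T_w + T_s + T_r."""
    return t_wire + t_switch + t_route


def channel_throughput(ch_width_bits: int, latency: float) -> float:
    """Throughput of one channel: channel width divided by latency."""
    if latency <= 0:
        raise ValueError("latency must be positive")
    return ch_width_bits / latency


latency_ns = flit_latency(t_wire=2.0, t_switch=1.0, t_route=1.0)
print(latency_ns)                          # 4.0
print(channel_throughput(32, latency_ns))  # 8.0 bits per ns
```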
9 Flow control mechanisms
Sub-Channel Variation Effects
- Channel/Packet Size 32 Bits, 1 channel
- Channel/Packet Size 16 Bits, 2 sub-channels
- Channel/Packet Size 8 Bits, 4 sub-channels
Virtual Channel Effects
- Virtual Channel Inactive
- Virtual Channel Active
10 The traffic controller (TC)
TC architecture
- Switch module
- Receives status from all TC modules
- Switches I/O ports
- Channel Sampler
- Ports status
- busy / not-busy
- Routing algorithm
- Shortest path
- Ports status
- Virtual channels
- VC on/off
- Number of available VCs
- VC occupancy status
- Channels partitioning
- Sub-channeling on/off
- Number of available SCs
- SC occupancy status
[Figure: TC architecture. Labels: PE, M, TC, Port A, Port B, Port C, Port D, SW module, Channel Sampler, Routing Algorithm, Virtual Channels, Channel Partition, PE/Memory connectivity]
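One way the routing algorithm and the channel sampler could interact is sketched below: only shortest-path moves are considered, and a busy port is skipped in favor of another productive one. This is not the TC's actual logic. The 2D mesh coordinates, the port names (east, west, north, south), and the function names are all assumptions; the slides label the ports A to D without giving directions.

```python
def productive_ports(cur: tuple[int, int], dst: tuple[int, int]) -> list[str]:
    """Ports that move one hop closer to dst on a 2D mesh (shortest-path moves only)."""
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx > cx:
        ports.append("east")
    if dx < cx:
        ports.append("west")
    if dy > cy:
        ports.append("north")
    if dy < cy:
        ports.append("south")
    return ports


def select_port(cur: tuple[int, int], dst: tuple[int, int], busy: set[str]):
    """Pick the first productive port the channel sampler reports as not busy.

    Returns None at the destination or when every productive port is busy,
    in which case the flit has to wait.
    """
    for port in productive_ports(cur, dst):
        if port not in busy:
            return port
    return None


print(select_port((0, 0), (2, 1), busy=set()))      # east
print(select_port((0, 0), (2, 1), busy={"east"}))   # north
```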
11 The Network simulator
3D-mesh interconnect
4-ary 3-cube interconnect
8-ary 2-cube interconnect
12 Interconnect simulation
The interconnect
Runtime performance data
Simulation speed
13 The Network simulator architecture
- The interconnect type configuration to simulate
The Interconnect
- A worm container
- Contains worms to model and worms done modeling
Worms Jar
[Figure: interconnect of PE and M nodes]
- Interconnect type, configuration properties
- Data is updated from the user interface
Traffic Sampler
Interconnect Configuration Manager
- Timing for entering worms
- Traffic load feedback
Scheduler
- Output files which contain all simulated data (worm properties and performance)
- Calculates the shortest path route for each worm
Interconnect Properties
Worm Manager
Routing Algorithm
Worms Data
Performance Results
- Orchestrates the simulation process
- Provides control signals to all other modules participating in simulation
User Interface
- Command line or GUI
- User system parameters
14 Performance results: Latency
- Latency denotes the time it takes a message to reach its destination
- Latency includes wire propagation, switching, and routing delays
- Latency of 3D-mesh for both short and long messages was the smallest of all three interconnects
- The results shown represent the average latency measured for both short and long messages
K-ary n-cube interconnects latency comparison
15 Performance results: Latency
- Offered load determines the probability that each node comprising the interconnect will generate a message within each cycle
- If offered load = 0.1, on average 10% of the nodes in the interconnect will generate a message in each cycle
- As the offered load increases, the latency increases exponentially for all the interconnects
- 3D-mesh has the lowest latency and is able to sustain higher traffic load
K-ary n-cube interconnects latency vs. offered load
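The offered-load definition above amounts to an independent random draw per node per cycle. The sketch below illustrates that reading; the function name, the 64-node size, and the fixed seed are assumptions chosen for the example, not details of the simulator.

```python
import random


def nodes_generating(num_nodes: int, offered_load: float, rng: random.Random) -> list[int]:
    """Nodes that generate a message in one cycle.

    Each node independently generates a message with probability offered_load.
    """
    return [node for node in range(num_nodes) if rng.random() < offered_load]


rng = random.Random(42)  # fixed seed so runs are repeatable
counts = [len(nodes_generating(64, 0.1, rng)) for _ in range(1000)]
print(sum(counts) / len(counts))  # close to 0.1 * 64 = 6.4 on average
```

With 64 nodes and an offered load of 0.1, about 6.4 nodes inject a message per cycle on average, although the count in any single cycle varies.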
16 Performance results: Throughput
- 3D-mesh reached the highest peak throughput for both short and long messages
- 4-ary 3-cube outperforms 8-ary 2-cube in all measurements
- Throughput is higher when long worms are modeled, as a result of the routing algorithm (wormhole routing)
K-ary n-cube interconnects throughput comparison
17 Routing accuracy comparison
8-ary 2-cube routing accuracy with VC and SC enabled
3D-mesh routing accuracy with VC and SC enabled
- VCs and SC significantly increased routing accuracy
- 3D-mesh RA = 96% vs. 85% (8-ary 2-cube) and 84% (4-ary 3-cube)
4-ary 3-cube routing accuracy with VC and SC enabled
18 Bandwidth utilization rate
- Bandwidth utilization = (# of occupied channels) / (total # of channels)
- VCs as well as SC increase bandwidth utilization rate
- Larger gap from no-VC/SC to VC2SC than from VC2SC to VC4SC
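The utilization ratio can be computed from a snapshot of channel occupancy. This is a minimal sketch: representing the snapshot as a list of booleans, one per channel, is an assumption made for illustration.

```python
def bandwidth_utilization(channel_occupied: list[bool]) -> float:
    """Occupied channels divided by the total number of channels."""
    if not channel_occupied:
        raise ValueError("at least one channel is required")
    return sum(channel_occupied) / len(channel_occupied)


# Hypothetical snapshot: 3 of 4 channels carry traffic in this cycle.
print(bandwidth_utilization([True, False, True, True]))  # 0.75
```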
19 Failure rate vs. VC size
- As VC size increases, failure rate decreases
- Tradeoff: failure rate vs. VC size (area and cost)
3D-mesh failure rate as a function of VC size
20 Throughput comparison
- Throughput of common interconnects is based on results provided by their vendors
- All k-ary n-cube interconnects utilize both VCs and SC
- 3D-mesh has the highest throughput of all the interconnects compared
Average throughput comparison among high-performance interconnects
21 Conclusion
- We presented k-ary n-cube based interconnects as off-chip communications architectures for line cards to increase the throughput of the currently used memory system
- We designed a new mixed-radix, non-symmetrical k-ary n-cube based interconnect called the 3D-mesh (a variation of 2-ary 3-cube)
- We include multiple, highly efficient techniques to route, switch and control packet flows in order to increase throughput and interconnect utilization, while minimizing traffic congestion and packet loss
- We reveal the best processor-memory configuration, out of multiple configurations, that achieves optimal performance
- We developed a custom-designed, event-driven simulator to evaluate the performance of packet-based, off-chip, k-ary n-cube interconnect architectures
22 Conclusion
- Our results show that k-ary n-cube interconnect architectures provide higher throughput and can sustain higher traffic loads
- 3D-mesh reached the highest performance results of all the interconnects tested
- 3D-mesh is a scalable, cost-effective solution which complies with all the functional as well as the physical constraints on the line card
23 Future work
- The results of this work can also be used for
- PCs
- On-chip communication architectures
- Future directions for this work include
- Board implementation of the interconnect
- Testing the interconnect with off-the-shelf components
- Expanding our simulation framework to include
- Higher-dimensional k-ary n-cube networks
- Internet traffic workloads