Offchip Communication Architectures for High Throughput Network Processors

1
Off-chip Communication Architectures for High
Throughput Network Processors
College of Engineering & Computer Science
Department of Electrical & Computer Engineering

Short version of
Final Defense
Jacob Engel

2
Motivation

Line card design challenges
Rapidly growing line rates
Additional deep packet processing operations
Increase in memory capacity requirements
The memory wall
Nature of packet transmission
Variable packet size
Out of order transmission
Off-chip vs. on-chip
Off-chip interconnects lack innovative methods
to improve their integration into large
scalable network components
PCB physical limitations
Area
I/O pins
Switching elements
Scalability

3
Interconnect Topologies

Bus
k-ary n-cubes
Mesh
N nodes arranged in a rectangle or square
Each node connected to 4 neighbors
Cost: N switches (one per node)
Torus
Mesh with wrap-around
Longer cables than simple mesh
Harder to partition
Hypercubes
Multi-dimensional cubes
Nodes identified by n-bit numbers
Each node connected to other nodes that differ in
a single bit
Cost: N switches (one per node)
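
The neighbor rules listed above can be sketched in a few lines of Python. This is an illustration only, not code from the presentation: the function names, the coordinate-tuple node representation and the `wrap` flag are assumptions made for the example.

```python
def mesh_neighbors(node, k, wrap=False):
    """Neighbors of `node`, a tuple of n coordinates each in 0..k-1.

    wrap=False models a mesh; wrap=True adds the wrap-around links of a torus.
    """
    result = set()
    for dim, coord in enumerate(node):
        for step in (-1, 1):
            c = coord + step
            if wrap:
                c %= k
            elif not 0 <= c < k:
                continue  # mesh edge: no link in this direction
            neighbor = node[:dim] + (c,) + node[dim + 1:]
            if neighbor != node:
                result.add(neighbor)
    return sorted(result)


def hypercube_neighbors(node, n):
    """Neighbors of an n-bit node id: every id that differs in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(n)]


print(mesh_neighbors((0, 0), k=4))             # corner node of a 4x4 mesh
print(mesh_neighbors((0, 0), k=4, wrap=True))  # same node with wrap-around
print(hypercube_neighbors(0b000, n=3))         # 3-bit hypercube
```

A corner node of the mesh has only two neighbors, while the same node in the torus has four, which is the effect of the wrap-around links.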

Torus network (k-ary n-cube with wraparound)
Mesh (4-ary 2-cube network)
3D Hypercube (2-ary n-cube)
4
Switching Mechanism in Interconnects

Bus-based network
Drawback
Does not scale up with number of processors
Crossbar switching network
Allows connection of any of p processors to any
of b memory banks
Number of switches: n²
Wiring complexity: O(n²w)
Disadvantage
Complexity grows as a function of p²
Expensive for large number of processors
Physical dimensions (switches and wiring)

Bus-based network
Crossbar network
5
Switching Mechanism in Interconnects

Omega switching network
More scalable in terms of cost than crossbar,
more scalable in terms of performance than bus
Number of switches: (n/2) log_k(n)
n = number of channels, k = switches per box
Wiring complexity: O(n w log_k(n))
Drawbacks
limited switching flexibility
Blocking
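
To make the cost comparison concrete, the switch counts quoted on the crossbar and Omega slides can be evaluated side by side. The sketch below simply plugs numbers into those two expressions; the function names, the default `k=2`, and the requirement that n be an exact power of k are assumptions of this example.

```python
def crossbar_switches(n):
    """Crossbar cost as stated on the slide: n squared switches."""
    return n * n


def omega_switches(n, k=2):
    """Omega cost as stated on the slide: (n/2) * log_k(n).

    Uses an integer logarithm and assumes n is an exact power of k.
    """
    stages, size = 0, 1
    while size < n:
        size *= k
        stages += 1
    if size != n:
        raise ValueError("n must be an exact power of k")
    return (n * stages) // 2


for n in (8, 64, 1024):
    print(n, crossbar_switches(n), omega_switches(n))
```

For n = 64 the crossbar needs 4096 switches against 192 for the Omega network, which is the scalability gap the slides describe.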

Omega switch
6
K-ary n-cube based architectures

Packet-based, multiple-path
Packets are shared among PEs and memories (M)
PEs: TM, QoS, Classification
An oversubscribed or faulty link does not
prevent connectivity to its nodes
Uses wormhole routing
Packets are routed adaptively based on
traffic load and connectivity

8-ary 2-cube interconnect
4-ary 3-cube interconnect
Shared-bus
3D-mesh interconnect
7
Wormhole routing

Wormhole routing operates on flits
A flit is typically what can be transmitted in a
single cycle
Flit size = channel width
Packet header is typically one or two flits
The rest of the flits do not contain routing
information
Known for its improved latency and throughput
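
The flit mechanism can be illustrated with a small, idealized model. This is not the presenter's simulator: it assumes one flit crosses one link per cycle with no contention, and the names `to_flits`, `wormhole_cycles` and `store_and_forward_cycles` are hypothetical.

```python
def to_flits(header, payload, flit_bytes):
    """Split a packet into flits; only the head flit carries routing information."""
    flits = [("HEAD", header)]
    for i in range(0, len(payload), flit_bytes):
        flits.append(("BODY", payload[i:i + flit_bytes]))
    return flits


def wormhole_cycles(num_flits, hops):
    """Cycles until the last flit arrives: body flits pipeline behind the head."""
    return hops + num_flits - 1


def store_and_forward_cycles(num_flits, hops):
    """For comparison: every hop waits for the whole packet before forwarding."""
    return hops * num_flits


flits = to_flits({"dest": (2, 1)}, b"abcdefgh", flit_bytes=4)
print(len(flits))                                   # 1 head flit + 2 body flits
print(wormhole_cycles(len(flits), hops=3))          # pipelined delivery
print(store_and_forward_cycles(len(flits), hops=3))
```

Under these assumptions the pipelined transfer needs 5 cycles where store-and-forward needs 9, which is one way to see the latency advantage the slide mentions.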

[Figure: a message divided into flits 4, 3, 2, 1 behind a message header, travelling through Node 1, Node 2, Node 3 and Node 4]
8
K-ary n-cube based architectures

Performance measures
Latency = T_w + T_s + T_r
Latency does not consider queuing delays at this
point
Latency is measured per flit per flow
Throughput = ch_width / L
Bi-directional links will increase the throughput
Aggregate throughput is a function of PE as well
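
As a worked example of the two measures, the sketch below reads the latency expression as the sum of wire-propagation, switching and routing delays, and the throughput expression as channel width divided by latency. Both readings, the units (nanoseconds and bits) and all numeric values are assumptions chosen for illustration, not measurements from the presentation.

```python
def flit_latency(t_wire, t_switch, t_route):
    """Per-flit latency as the sum of the three delay terms (no queuing delay)."""
    return t_wire + t_switch + t_route


def throughput(ch_width_bits, latency):
    """Channel width divided by latency, e.g. bits per nanosecond."""
    if latency <= 0:
        raise ValueError("latency must be positive")
    return ch_width_bits / latency


# Hypothetical delays in nanoseconds for a 32-bit channel.
lat = flit_latency(t_wire=1.0, t_switch=2.0, t_route=1.0)
print(lat)                  # 4.0
print(throughput(32, lat))  # 8.0
```

With these made-up numbers a 32-bit channel delivers 8 bits per nanosecond per direction; a bi-directional link would carry that in each direction.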

9
Flow control mechanisms
Sub-Channel Variation Effects
Channel/Packet Size 32 Bits, 1 channel
Channel/Packet Size 16 Bits, 2 sub-channels
Channel/Packet Size 8 Bits, 4 sub-channels
Virtual Channel Effects
Virtual Channel Inactive
Virtual Channel Active
10
The traffic controller (TC)
TC architecture

Switch module
Receives status from all TC modules
Switches I/O ports
Channel Sampler
Ports status
busy / not-busy
Routing algorithm
Shortest path
Ports status
Virtual channels
VC on/off
# of available VCs
VC occupancy status
Channel partitioning
Sub-channeling on/off
# of available SCs
SC occupancy status
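
One way the routing algorithm and the channel sampler could interact is sketched below: list the ports that lie on a shortest path, then pick the first one the sampler reports as not busy. This is a guess at the behaviour for illustration, not the TC design from the presentation; the `(dimension, direction)` port encoding and both function names are hypothetical, and the mesh is assumed to have no wrap-around.

```python
def productive_ports(current, dest):
    """Ports (dimension, direction) that move one hop closer to dest in a mesh."""
    ports = []
    for dim, (c, d) in enumerate(zip(current, dest)):
        if d > c:
            ports.append((dim, +1))
        elif d < c:
            ports.append((dim, -1))
    return ports


def select_port(current, dest, busy):
    """Pick the first shortest-path port that the channel sampler reports free.

    Returns None when every productive port is busy (the worm would wait or
    fall back to a virtual channel) or when current == dest.
    """
    for port in productive_ports(current, dest):
        if port not in busy:
            return port
    return None


# Port (0, +1) is busy, so the worm is steered along dimension 1 instead.
print(select_port((0, 0, 0), (1, 2, 0), busy={(0, +1)}))
```

Because every productive port shortens the remaining distance, any free choice keeps the route minimal while steering around the busy link.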

[Figure: TC architecture. PE and memory (M) nodes attached to TCs; labeled blocks: Routing Algorithm, Channel Sampler, SW module, Channel Partition, Virtual Channels, PE/Memory connectivity; Ports A, B, C and D]
11
The Network simulator
3D-mesh interconnect
4-ary 3-cube interconnect
8-ary 2-cube interconnect
12
Interconnect simulation
The interconnect
Runtime performance data
Simulation speed
13
The Network simulator architecture

The interconnect type configuration to simulate

The Interconnect

A worm container: contains worms to model and worms done modeling

Worms Jar

[Figure: the interconnect drawn as a grid of PE and M nodes]

Records performance data

Interconnect type,
configuration properties
Data is updated from the
user interface

Traffic Sampler
Interconnect Configuration Manager

Timing for
entering worms
Traffic load feedback

Scheduler

Output files which contain all simulated data (worm properties and performance)

Calculates the shortest-path route for each worm

Interconnect Properties
Worm Manager
Routing Algorithm
Worms Data
Performance Results

Orchestrates the simulation process
Provides control signals to all other
modules participating in simulation

User Interface

Command line or GUI
User system parameters

14
Performance results: Latency

Latency denotes the time it takes
a message to reach destination
Latency includes wire propagation,
switching and routing delays
Latency of 3D-mesh for both short
and long messages was the smallest
of all three interconnects
The results shown represent the
average latency measured for both
short and long messages

K-ary n-cube interconnects latency comparison
15
Performance results: Latency

Offered load determines the probability
that each node comprising the interconnect
will generate a message within each cycle
If offered load = 0.1, there is a chance that
10% of the total nodes in the interconnect
will generate a message at each cycle
As the offered load increases the latency
increases exponentially for all the
interconnects
3D-mesh has the lowest latency and is able
to sustain higher traffic load

K-ary n-cube interconnects latency vs. offered
load
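
The definition of offered load above can be modelled as an independent coin flip per node per cycle. The sketch below is a minimal illustration under that assumption; the function name, the fixed seed and the node and cycle counts are hypothetical, and it does not reproduce the presenter's simulator.

```python
import random


def generate_messages(num_nodes, offered_load, cycles, seed=0):
    """Count how many nodes inject a message in each cycle.

    Each node generates a message with probability `offered_load` per cycle.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(cycles):
        counts.append(sum(rng.random() < offered_load for _ in range(num_nodes)))
    return counts


counts = generate_messages(num_nodes=64, offered_load=0.1, cycles=2000)
average_fraction = sum(counts) / (64 * len(counts))
print(round(average_fraction, 2))  # close to the offered load
```

Averaged over many cycles, the fraction of nodes injecting a message converges on the offered load, so 0.1 corresponds to roughly 10% of nodes per cycle.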
16
Performance results: Throughput

3D-mesh reached the highest peak
throughput for both short and
long messages
4-ary 3-cube outperforms 8-ary 2-cube
in all measurements
Higher throughput when long worms
are modeled as a result of the routing
algorithm (wormhole routing)

K-ary n-cube interconnects throughput comparison
17
Routing accuracy comparison
8-ary 2-cube routing accuracy with VC & SC enabled
3D-mesh routing accuracy with VC & SC enabled

VCs and SCs significantly increased routing
accuracy
3D-mesh RA = 96% vs. 85% (8-ary 2-cube) and
84% (4-ary 3-cube)

4-ary 3-cube routing accuracy with VC & SC enabled
18
Bandwidth utilization rate

Bandwidth utilization = (# of occupied
channels) / (total # of channels)
VCs as well as SCs increase the bandwidth
utilization rate
Larger gap from no-VC/SC to VC2SC
than from VC2SC to VC4SC
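
The utilization ratio is straightforward to compute from a snapshot of channel occupancy. In the sketch below the occupancy list and the function name are made up for illustration; with sub-channeling enabled, each sub-channel would be counted as its own entry.

```python
def bandwidth_utilization(occupied):
    """Occupied channels divided by total channels.

    `occupied` holds one boolean per channel (or per sub-channel).
    """
    occupied = list(occupied)
    if not occupied:
        raise ValueError("at least one channel is required")
    return sum(occupied) / len(occupied)


# Hypothetical snapshot: three of four channels are carrying flits.
print(bandwidth_utilization([True, False, True, True]))  # 0.75
```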

19
Failure rate vs. VC size

As VC size increases, failure rate
decreases
Tradeoff: failure rate vs. VC
size (area and cost)

3D-mesh failure rate as a function of VC size
20
Throughput comparison

Throughput of common
interconnects is based on results
provided by their vendors
All k-ary n-cube interconnects
utilize both VCs and SC
3D-mesh has the highest
throughput of all the
interconnects compared

Average throughput comparison among
high-performance interconnects
21
Conclusion

We presented k-ary n-cube based interconnects as
off-chip
communications architectures for line cards to
increase the
throughput of the currently used memory system
We designed a new mixed-radix, non-symmetrical
k-ary n-cube based
interconnect called the 3D-mesh (a variation
of 2-ary 3-cube)
We include multiple, highly efficient techniques
to route, switch and
control packet flows in order to increase
throughput and interconnect
utilization, while minimizing traffic congestion
and packet loss
We reveal the best processor-memory
configuration, out of multiple
configurations, that achieves optimal
performance
We developed a custom-designed, event-driven
simulator to evaluate
the performance of packet-based, off-chip
k-ary n-cube interconnect
architectures

22
Conclusion

Our results show that k-ary n-cube interconnect
architectures
provide higher throughput and can sustain
higher traffic loads
3D-mesh reached the highest performance results
of all the
interconnects tested
3D-mesh is a scalable, cost-effective solution
which complies with
all the functional as well as the physical
constraints on the line card

23
Future work

The results of this work can also be used for:
PCs
On-chip communication architectures
Future directions for this work include:
Board implementation of the interconnect
Testing the interconnect with off-the-shelf
components
Expanding our simulation framework to include:
Higher-dimension k-ary n-cube networks
Internet traffic workloads
