Off-chip Communication Architectures for High Throughput Network Processors

Transcript and Presenter's Notes
1
Off-chip Communication Architectures for High Throughput Network Processors
College of Engineering & Computer Science, Department of Electrical & Computer Engineering
  • Short version of the Final Defense
  • Jacob Engel

2
Motivation
  • Line card design challenges
  • Rapidly growing line rates
  • Additional deep packet processing operations
  • Increasing memory capacity requirements
  • The memory wall
  • Nature of packet transmission
  • Variable packet sizes
  • Out-of-order transmission
  • Off-chip vs. on-chip
  • Off-chip interconnects lack innovative methods to improve their integration into large, scalable network components
  • PCB physical limitations
  • Area
  • I/O pins
  • Switching elements
  • Scalability

3
Interconnect Topologies
  • Bus
  • k-ary n-cubes
  • Mesh
  • N nodes arranged in a rectangle or square
  • Each node connected to 4 neighbors
  • Cost: N switches (one per node)
  • Torus
  • Mesh with wrap-around links
  • Longer cables than a simple mesh
  • Harder to partition
  • Hypercubes
  • Multi-dimensional cubes
  • Nodes identified by n-bit numbers
  • Each node connected to the nodes whose addresses differ from its own in a single bit (see the neighbor sketch below)
  • Cost: N switches (one per node)

Figures: Torus network (k-ary n-cube with wraparound), Mesh (4-ary 2-cube network), 3D Hypercube (2-ary n-cube)
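The addressing rules above translate directly into neighbor computations. The following Python sketch is my own addition (the function names are illustrative, not from the deck): it enumerates a node's neighbors in a k-ary n-dimensional mesh or torus, and in a hypercube via single-bit flips.

```python
def mesh_neighbors(coords, k, wrap=False):
    """Neighbors of a node in a k-ary n-dimensional mesh.
    With wrap=True the network is a torus (wrap-around links)."""
    neighbors = []
    for dim in range(len(coords)):
        for step in (-1, 1):
            c = list(coords)
            c[dim] += step
            if wrap:
                c[dim] %= k          # wrap-around link
            elif not 0 <= c[dim] < k:
                continue             # edge of the mesh: no link here
            neighbors.append(tuple(c))
    return neighbors

def hypercube_neighbors(node, n):
    """Neighbors of a node in an n-dimensional hypercube: all
    addresses that differ from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(n)]

# A corner node of a 4-ary 2-cube mesh has only 2 neighbors,
# but 4 in the torus version.
print(mesh_neighbors((0, 0), k=4))             # [(1, 0), (0, 1)]
print(mesh_neighbors((0, 0), k=4, wrap=True))  # adds (3, 0) and (0, 3)
print(hypercube_neighbors(0b010, n=3))         # [3, 0, 6]
```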
4
Switching Mechanism in Interconnects
  • Bus-based network
  • Drawback
  • Does not scale up with the number of processors
  • Crossbar switching network
  • Allows connection of any of p processors to any of b memory banks
  • Number of switches: n²
  • Wiring complexity: O(n²·w)
  • Disadvantages
  • Complexity grows as a function of p²
  • Expensive for a large number of processors
  • Physical dimensions (switches and wiring)

Figures: Bus-based network, Crossbar network
5
Switching Mechanism in Interconnects
  • Omega switching network (routing sketch below)
  • More scalable in terms of cost than a crossbar, more scalable in terms of performance than a bus
  • Number of switches: (n/2)·log_k(n), for n channels and k×k switch boxes
  • Wiring complexity: O(n·w·log_k(n))
  • Drawbacks
  • Limited switching flexibility
  • Blocking

Figure: Omega switch
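To illustrate how an omega network switches a message, here is a small Python sketch (my addition, not from the deck) of classic destination-tag routing through an n-input omega network built from 2×2 switches: each stage applies a perfect shuffle and then sets one address bit from the destination.

```python
def omega_route(src: int, dst: int, n: int) -> list[int]:
    """Trace a message through an n-input omega network (n a power
    of two) using destination-tag routing with 2x2 switches."""
    m = n.bit_length() - 1              # number of stages = log2(n)
    path, cur = [src], src
    for stage in range(m):
        # Perfect-shuffle wiring: rotate the m-bit address left by one.
        cur = ((cur << 1) | (cur >> (m - 1))) & (n - 1)
        # The 2x2 switch overwrites the low bit with the next
        # destination bit, MSB first.
        bit = (dst >> (m - 1 - stage)) & 1
        cur = (cur & ~1) | bit
        path.append(cur)
    return path

# An 8-input omega network has 3 stages; route input 2 to output 5.
print(omega_route(src=0b010, dst=0b101, n=8))   # [2, 5, 2, 5]
```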
6
K-ary n-cube based architectures
  • Packet-based, multiple-path interconnect
  • Packets are shared among the PEs and memories (M)
  • PEs: TM, QoS, classification
  • An oversubscribed or faulty link does not cut off connectivity to its nodes
  • Uses wormhole routing
  • Packets are routed adaptively based on traffic load and connectivity

Figures: 8-ary 2-cube interconnect, 4-ary 3-cube interconnect, shared-bus, 3D-mesh interconnect
7
Wormhole routing
  • Wormhole routing operates on flits
  • A flit is typically what can be transmitted in a single cycle
  • Flit size = channel width
  • The packet header is typically one or two flits
  • The rest of the flits do not contain routing information
  • Known for its improved latency and throughput (see the cycle-count sketch below)

Figure: a message of four flits pipelined across Nodes 1-4; the header flit leads and flits 2-4 stream behind it along the same path
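To make the pipelining in the figure concrete, here is a tiny sketch (my addition; the deck's simulator may model per-hop details differently) of the ideal wormhole cycle count: once the header has traversed all hops, the remaining flits drain one per cycle behind it.

```python
def wormhole_cycles(num_flits: int, hops: int) -> int:
    """Ideal wormhole latency in cycles: the header flit takes one
    cycle per hop, and the remaining flits drain one per cycle."""
    return hops + (num_flits - 1)

# The figure's 4-flit message crossing 3 links (Node 1 -> Node 4):
print(wormhole_cycles(num_flits=4, hops=3))   # 6 cycles
```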
8
K-ary n-cube based architectures
  • Performance measures
  • Latency: T = T_w + T_s + T_r (wire propagation, switching and routing delays)
  • Latency does not consider queuing delays at this point
  • Latency is measured per flit per flow
  • Throughput = channel_width / L (worked example below)
  • Bi-directional links increase the throughput
  • Aggregate throughput is a function of the number of PEs as well
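As a worked instance of these two formulas (the delay and width values below are assumptions for illustration, not figures from the deck):

```python
# Latency per the slide: T = T_w + T_s + T_r
t_wire, t_switch, t_route = 2.0, 1.0, 1.0   # assumed delays, in ns
latency = t_wire + t_switch + t_route        # T = 4.0 ns

# Throughput per the slide: channel width divided by latency L
channel_width = 32                           # bits per flit (assumed)
throughput = channel_width / latency         # 8 bits/ns = 8 Gb/s
print(latency, throughput)
```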

9
Flow control mechanisms
Figures: sub-channel variation effects (channel/packet size 32 bits with 1 channel, 16 bits with 2 sub-channels, 8 bits with 4 sub-channels) and virtual channel effects (virtual channels inactive vs. active)
10
The traffic controller (TC)
TC architecture
  • Switch module
  • Receives status from all TC modules
  • Switches the I/O ports
  • Channel sampler
  • Port status: busy / not busy
  • Routing algorithm
  • Shortest path
  • Port status
  • Virtual channels
  • VC on/off
  • # of available VCs
  • VC occupancy status
  • Channel partitioning
  • Sub-channeling on/off
  • # of available SCs
  • SC occupancy status

Figure: TC architecture block diagram. Each PE/memory (M) pair attaches through a TC; the switch module connects Ports A-D and is driven by the channel sampler, routing algorithm, channel partition and virtual channel modules (a hypothetical port-selection sketch follows)
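A hypothetical sketch (the names and structure are my own, not the deck's) of the decision the TC modules above cooperate on: sample port status, pick an idle shortest-path output port, and fall back to a virtual channel when every preferred port is busy.

```python
from dataclasses import dataclass

@dataclass
class Port:
    name: str
    busy: bool = False       # sampled by the channel sampler
    free_vcs: int = 2        # number of available virtual channels

def select_port(shortest_path_ports: list[Port]) -> Port | None:
    """Prefer an idle port on the shortest path; otherwise reuse a
    busy port that still has a free virtual channel; otherwise stall."""
    for port in shortest_path_ports:
        if not port.busy:
            return port                      # free physical channel
    for port in shortest_path_ports:
        if port.free_vcs > 0:
            port.free_vcs -= 1               # claim a virtual channel
            return port
    return None                              # all saturated: worm waits

# Example: Port A is busy, Port B is idle, so B wins.
ports = [Port("A", busy=True), Port("B")]
print(select_port(ports).name)               # "B"
```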
11
The Network simulator
Figures: simulator views of the 3D-mesh, 4-ary 3-cube and 8-ary 2-cube interconnects
12
Interconnect simulation
Figure: simulator screenshot showing the interconnect, runtime performance data and simulation speed
13
The Network simulator architecture
  • User Interface: command line or GUI; collects the user's system parameters
  • Interconnect Configuration Manager: holds the interconnect type and configuration properties; data is updated from the user interface
  • The Interconnect: the interconnect type/configuration to simulate, connecting the PE and memory (M) nodes
  • Worms Jar: a worm container holding both worms still to be modeled and worms done modeling
  • Worm Manager: orchestrates the simulation process and provides control signals to all other modules participating in the simulation
  • Scheduler: controls the timing for entering worms, using traffic load feedback
  • Routing Algorithm: calculates the shortest-path route for each worm
  • Traffic Sampler: records performance data
  • Worms Data / Performance Results: output files which contain all simulated data (worm properties and performance)

Figure: block diagram of the simulator, with the PE and memory nodes of the interconnect at the center surrounded by the modules listed above (a toy skeleton follows)
14
Performance results: Latency
  • Latency denotes the time it takes a message to reach its destination
  • Latency includes wire propagation, switching and routing delays
  • The latency of the 3D-mesh, for both short and long messages, was the smallest of all three interconnects
  • The results shown represent the average latency measured for both short and long messages

Figure: K-ary n-cube interconnects latency comparison
15
Performance results: Latency
  • The offered load determines the probability that each node in the interconnect generates a message within each cycle
  • If the offered load is 0.1, on average 10% of the nodes in the interconnect generate a message at each cycle (see the sketch below)
  • As the offered load increases, the latency increases exponentially for all the interconnects
  • The 3D-mesh has the lowest latency and is able to sustain a higher traffic load

Figure: K-ary n-cube interconnects latency vs. offered load
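A minimal sketch (my addition; the 64-node size is an assumption) of the offered-load model described above: each node independently flips a Bernoulli coin every cycle.

```python
import random

def nodes_injecting(num_nodes: int, offered_load: float) -> int:
    """One cycle of the traffic model: each node independently
    generates a message with probability `offered_load`."""
    return sum(random.random() < offered_load for _ in range(num_nodes))

# With 64 nodes and offered load 0.1, on average 10% of the nodes,
# i.e. ~6.4 of them, inject a message each cycle.
samples = [nodes_injecting(64, 0.1) for _ in range(10_000)]
print(sum(samples) / len(samples))   # close to 6.4
```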
16
Performance results Throughput
Performance results: Throughput
  • The 3D-mesh reached the highest peak throughput for both short and long messages
  • The 4-ary 3-cube outperforms the 8-ary 2-cube in all measurements
  • Throughput is higher when long worms are modeled, as a result of the routing algorithm (wormhole routing)

Figure: K-ary n-cube interconnects throughput comparison
17
Routing accuracy comparison
  • VCs and SCs significantly increased routing accuracy
  • 3D-mesh RA = 96% vs. 85% (8-ary 2-cube) and 84% (4-ary 3-cube)

Figures: routing accuracy with VCs & SCs enabled for the 8-ary 2-cube, 3D-mesh and 4-ary 3-cube
18
Bandwidth utilization rate
  • Bandwidth utilization = (# of occupied channels) / (total # of channels)
  • VCs as well as SCs increase the bandwidth utilization rate
  • There is a larger gap from no-VC/SC to VC2SC than from VC2SC to VC4SC

19
Failure rate vs. VC size
  • As VC size increases, the failure rate decreases
  • Trade-off: failure rate vs. VC size (area and cost)

3D-mesh failure rate as a function of VC size
20
Throughput comparison
  • The throughput of common interconnects is based on results provided by their vendors
  • All k-ary n-cube interconnects utilize both VCs and SCs
  • The 3D-mesh has the highest throughput among all the interconnects

Figure: Average throughput comparison among high-performance interconnects
21
Conclusion
  • We presented k-ary n-cube based interconnects as off-chip communication architectures for line cards, to increase the throughput of the currently used memory system
  • We designed a new mixed-radix, non-symmetrical k-ary n-cube based interconnect called the 3D-mesh (a variation of the 2-ary 3-cube)
  • We included multiple, highly efficient techniques to route, switch and control packet flows in order to increase throughput and interconnect utilization while minimizing traffic congestion and packet loss
  • We revealed the best processor-memory configuration, out of multiple configurations, that achieves optimal performance
  • We developed a custom-designed, event-driven simulator to evaluate the performance of packet-based, off-chip, k-ary n-cube interconnect architectures

22
Conclusion
  • Our results show that k-ary n-cube interconnect architectures provide higher throughput and can sustain higher traffic loads
  • The 3D-mesh reached the highest performance of all the interconnects tested
  • The 3D-mesh is a scalable, cost-effective solution which complies with all the functional as well as physical constraints on the line card

23
Future work
  • The results of this work can also be applied to
  • PCs
  • On-chip communication architectures
  • Future directions for this work include
  • Board-level implementation of the interconnect
  • Testing the interconnect with off-the-shelf components
  • Expanding our simulation framework to include
  • Higher-dimensional k-ary n-cube networks
  • Internet traffic workloads