Title: High Performance Switches and Internet Routers: Architecture and Scheduling
1 High Performance Switches and Internet Routers
Architecture and Scheduling
TU Delft, June 18, 2004
Lotfi Mhamdi, Computer Engineering Lab., HKUST, Hong Kong
http://www.cs.ust.hk/lotfi
2 Outline
- The Need for Fast Routers
- Router Architecture
- Scheduling in Input Queued (IQ) Switches
- The Buffered Crossbar Switching Architecture (BCS)
- Scheduling in BCS
- Output Queueing (OQ) Switch Emulation
- Concluding Remarks
3 Where high performance packet switches are used
- Core Router
- ATM Switch
- Frame Relay Switch
[Figure: the Internet core]
4 Recent trends
6 Basic Architectural Components: Data Path (per-packet processing)
[Figure: data-path diagram; surviving labels: Ingress, Interconnect, Egress]
7 Interconnects: Two basic techniques
- Input Queueing: usually a non-blocking switch fabric (e.g. crossbar)
- Output Queueing: usually a fast bus/shared memory
8 Output Queueing: The ideal
9 Input Queueing: The Head of Line Blocking
10 Head of Line Blocking
13 Input Queueing: Virtual Output Queues
14 IQ Switch with VOQs
It can be quite complex!
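Note on slides 9 to 14: the contrast between a single FIFO per input and virtual output queues can be illustrated with a one-slot toy model. This is a minimal sketch and not taken from the slides; the function names `fifo_slot` and `voq_slot`, the greedy input-by-input matching, and the example arrivals are all assumptions made for illustration.

```python
from collections import deque

def fifo_slot(fifos):
    """One time slot with a single FIFO per input: only head cells compete."""
    taken, served = set(), []
    for i, q in enumerate(fifos):
        if q and q[0] not in taken:        # head cell's output is still free
            taken.add(q[0])
            served.append((i, q.popleft()))
    return served

def voq_slot(fifos):
    """Same slot with VOQs: any queued destination may be served (greedy)."""
    taken, served = set(), []
    for i, q in enumerate(fifos):
        for out in list(q):                # scan cells in arrival order
            if out not in taken:
                taken.add(out)
                q.remove(out)
                served.append((i, out))
                break
    return served

# Each inner list holds the destination outputs of the cells queued at one input.
arrivals = [[0, 1], [0, 2]]
print(fifo_slot([deque(a) for a in arrivals]))  # input 1 is blocked behind its head cell
print(voq_slot([deque(a) for a in arrivals]))   # input 1 sends its cell for output 2
```

In the FIFO case the cell for output 2 waits even though output 2 is idle, which is exactly the head-of-line blocking that VOQs remove.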
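16 Matching (Scheduling)
Service matrix $S(n) = [S_{i,j}(n)]$, where
$$S_{i,j}(n) = \begin{cases} 1 & \text{if input } i \text{ sends to output } j \\ 0 & \text{otherwise} \end{cases}$$
Our objective is to maximize $S(n)$ subject to
$$\sum_i S_{i,j}(n) \le 1 \;\; \forall j, \qquad \sum_j S_{i,j}(n) \le 1 \;\; \forall i$$
Matching: Maximum Weight or Maximum Size?
[Figure: request graph and the resulting bipartite match]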
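Note on slide 16: the two constraints say that a service matrix is a matching, i.e. each input sends at most one cell and each output receives at most one cell per time slot. A minimal sketch of that check follows; the helper names `is_valid_matching` and `matching_size` are hypothetical and not part of the slides.

```python
def is_valid_matching(S):
    """True if the 0/1 matrix S has at most one 1 in every row and column."""
    n_rows, n_cols = len(S), len(S[0])
    binary = all(x in (0, 1) for row in S for x in row)
    rows_ok = all(sum(row) <= 1 for row in S)                    # sum over j of S[i][j] <= 1
    cols_ok = all(sum(S[i][j] for i in range(n_rows)) <= 1       # sum over i of S[i][j] <= 1
                  for j in range(n_cols))
    return binary and rows_ok and cols_ok

def matching_size(S):
    """Number of cells transferred in this time slot."""
    return sum(sum(row) for row in S)

good = [[0, 1, 0],
        [1, 0, 0],
        [0, 0, 0]]
bad = [[1, 0, 0],
       [1, 0, 0],      # two inputs sending to output 0
       [0, 0, 1]]
print(is_valid_matching(good), matching_size(good))
print(is_valid_matching(bad))
```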
17 Maximum Size/Weight Matching
Maximum Size Matching
- Maximizes instantaneous throughput
- Complexity O(N^2.5)
- Hard to implement in hardware
- Slow
Maximum Weight Matching
- Weight (queue length, waiting time) => 100% throughput
- Stable under any admissible input traffic (LQF, OCF)
- Complexity O(N^3 log N)
- Hard to implement in hardware and very slow
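Note on slide 17: the difference between the two objectives shows up even on a 2x2 switch. The sketch below is illustrative only. It uses a simple augmenting-path search for maximum size matching and a brute-force search over permutations for maximum weight matching, which are not the O(N^2.5) and O(N^3 log N) algorithms the slide refers to; the function names and the queue-length matrix are assumptions.

```python
from itertools import permutations

def max_size_matching(requests):
    """Maximum size matching by augmenting paths; requests[i][j] is truthy
    when input i has at least one cell for output j."""
    n = len(requests)
    match_out = [-1] * n                   # match_out[j] = input matched to output j

    def try_augment(i, seen):
        for j in range(n):
            if requests[i][j] and j not in seen:
                seen.add(j)
                if match_out[j] == -1 or try_augment(match_out[j], seen):
                    match_out[j] = i
                    return True
        return False

    for i in range(n):
        try_augment(i, set())
    return sorted((i, j) for j, i in enumerate(match_out) if i != -1)

def max_weight_matching(weights):
    """Brute force over all N! permutations; only usable for tiny N."""
    n = len(weights)
    best = max(permutations(range(n)),
               key=lambda p: sum(weights[i][p[i]] for i in range(n)))
    return [(i, best[i]) for i in range(n) if weights[i][best[i]] > 0]

# Hypothetical VOQ lengths, used as weights (LQF-style).
queue_len = [[10, 1],
             [1, 0]]
print(max_size_matching(queue_len))    # serves two cells
print(max_weight_matching(queue_len))  # serves only the long queue
```

Here the size objective moves two cells, while the weight objective prefers the single pair with the long queue, which is the trade-off behind the "Maximum Weight or Maximum Size?" question.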
18 Parallel Iterative Matching
[Figure: PIM iteration; both selection steps are labelled "Random Selection"]
- 100% throughput under uniform traffic (converges in log N iterations)
- 63% throughput with 1 iteration (pointer synchronization)
- Quite complex in hardware (random encoders)
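Note on slide 18: the request/grant/accept rounds of PIM can be sketched as below. This is a simplified software model under stated assumptions, not the hardware design: the function names, the dictionary bookkeeping and the seeded `random.Random` generator are all illustrative choices.

```python
import random

def pim_iteration(requests, matched_in, matched_out, rng):
    """One request-grant-accept round over the still-unmatched ports."""
    n = len(requests)
    # Grant: every unmatched output picks one requesting unmatched input at random.
    grants = {}                            # input -> outputs that granted to it
    for j in range(n):
        if j in matched_out:
            continue
        reqs = [i for i in range(n) if i not in matched_in and requests[i][j]]
        if reqs:
            grants.setdefault(rng.choice(reqs), []).append(j)
    # Accept: every input that received grants picks one of them at random.
    new_pairs = [(i, rng.choice(outs)) for i, outs in grants.items()]
    for i, j in new_pairs:
        matched_in[i] = j
        matched_out[j] = i
    return new_pairs

def pim(requests, iterations, seed=0):
    """Run several PIM rounds and return the matching as (input, output) pairs."""
    rng = random.Random(seed)
    matched_in, matched_out = {}, {}
    for _ in range(iterations):
        pim_iteration(requests, matched_in, matched_out, rng)
    return sorted(matched_in.items())

full_load = [[1] * 4 for _ in range(4)]
print(pim(full_load, iterations=1))   # often leaves some ports unmatched
print(pim(full_load, iterations=4))   # N rounds always reach a maximal match
```

Each round matches at least one new pair whenever an unmatched request remains, so N rounds are a safe upper bound in this model; the log N figure on the slide is the expected convergence.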
19 iSLIP
Round-Robin Selection
- Easy to implement in hardware
- Converges in log N iterations
- 100% throughput under uniform traffic
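Note on slide 19: a single-iteration sketch of round-robin grant and accept pointers is given below, assuming the usual iSLIP rule that pointers advance one position past the matched port only when a grant is accepted. The function name `islip_slot` and the list-based pointer representation are assumptions for illustration.

```python
def islip_slot(requests, grant_ptr, accept_ptr):
    """One time slot of single-iteration iSLIP; pointers are updated in place."""
    n = len(requests)
    # Grant: each output picks the first requesting input at or after its pointer.
    grants = {}                            # input -> outputs that granted to it
    for j in range(n):
        for k in range(n):
            i = (grant_ptr[j] + k) % n
            if requests[i][j]:
                grants.setdefault(i, []).append(j)
                break
    # Accept: each input picks the first granting output at or after its pointer.
    match = []
    for i, outs in grants.items():
        j = min(outs, key=lambda o: (o - accept_ptr[i]) % n)
        match.append((i, j))
        # Pointers move one past the matched port, and only on acceptance.
        grant_ptr[j] = (i + 1) % n
        accept_ptr[i] = (j + 1) % n
    return sorted(match)

requests = [[1, 1], [1, 1]]               # every VOQ backlogged
grant_ptr, accept_ptr = [0, 0], [0, 0]    # pointers start synchronized
for slot in range(3):
    print(slot, islip_slot(requests, grant_ptr, accept_ptr))
```

With all pointers initially aligned, the first slot matches a single pair; from the second slot on the grant pointers have desynchronized and both ports are matched.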
20 Performance: 16x16 Switch, Uniform Traffic
[Figure: performance curves for FIFO, PIM, RRM and iSLIP]
21 Pointer Synchronization
22 Performance: 3x3 Switch, Non-uniform Traffic
23 So Far
- Maximum size/weight matching algorithms are impractical (hardware implementation)
- Iterative matching algorithms (PIM, iSLIP, ...) are practical but unstable under non-uniform traffic
- Their centralized design is a bottleneck
- Is it possible to design distributed scheduling algorithms?
- If yes, how?
25 Buffered Crossbar Switch Architecture
26 I/O Contention Resolution
[Figure: contention-resolution diagram; the only surviving labels are the numbers 1 to 4]
28 Scheduling Process
- Scheduling is divided into three steps:
  - Input scheduling: every input i independently selects (in parallel) a HoL cell of an eligible VOQ and sends it to the corresponding internal buffer.
  - Output scheduling: every output j independently selects (in parallel) a cell amongst all non-empty XPi,j to be delivered to the output port.
  - Delivery notifying (flow control): for each delivered cell, inform the corresponding input of the internal buffer status.
Eligible VOQi,j: non-empty VOQi,j and empty XPi,j
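Note on slide 28: the three steps can be sketched for one time slot as follows, here with round-robin selection at both inputs and outputs. This is a toy model under stated assumptions: one-cell crosspoint buffers, a cell written to a crosspoint may leave in the same slot, and flow control is modelled simply by clearing the crosspoint so the VOQ becomes eligible again. The names `bcs_slot`, `voq`, `xp`, `in_ptr` and `out_ptr` are hypothetical.

```python
from collections import deque

def bcs_slot(voq, xp, in_ptr, out_ptr):
    """One time slot of a buffered crossbar; voq[i][j] is a deque of cells,
    xp[i][j] is None or the single cell held at crosspoint (i, j)."""
    n = len(voq)
    # 1. Input scheduling: each input serves one eligible VOQ, independently.
    for i in range(n):
        for k in range(n):
            j = (in_ptr[i] + k) % n
            if voq[i][j] and xp[i][j] is None:   # non-empty VOQ and empty XP
                xp[i][j] = voq[i][j].popleft()
                in_ptr[i] = (j + 1) % n
                break
    # 2. Output scheduling: each output drains one non-empty crosspoint.
    delivered = []
    for j in range(n):
        for k in range(n):
            i = (out_ptr[j] + k) % n
            if xp[i][j] is not None:
                delivered.append((i, j, xp[i][j]))
                # 3. Flow control: the freed crosspoint makes VOQ(i, j) eligible again.
                xp[i][j] = None
                out_ptr[j] = (i + 1) % n
                break
    return delivered

n = 2
voq = [[deque() for _ in range(n)] for _ in range(n)]
xp = [[None] * n for _ in range(n)]
voq[0][0].append("a")                     # both inputs hold a cell for output 0
voq[1][0].append("b")
in_ptr, out_ptr = [0] * n, [0] * n
print(bcs_slot(voq, xp, in_ptr, out_ptr))
print(bcs_slot(voq, xp, in_ptr, out_ptr))
```

Both inputs forward their cell in the first slot without any centralized matching; the contention for output 0 is absorbed by the crosspoint buffers and resolved by the output scheduler over two slots.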
29 Existing Algorithms
- Round Robin (RR-RR)
  - Round robin scheduling at the inputs and at the outputs.
- Oldest Cell First (OCF-OCF)
  - Select the oldest HoL cell at each input, and the oldest cell at each output.
- Longest Queue First - Round Robin (LQF-RR)
  - Select the HoL cell of the longest VOQ at each input, and round robin at the outputs.
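Note on slide 29: the input-side rules differ only in how an eligible VOQ is ranked. A minimal sketch of the LQF and OCF input selections follows, assuming each input knows its VOQ lengths, the arrival time of each HoL cell, and which crosspoint buffers are empty; the function names and argument layout are hypothetical, and ties are broken in favour of the lowest index.

```python
def lqf_select(voq_lengths, xp_empty):
    """LQF: index of the longest eligible VOQ, or None if none is eligible."""
    eligible = [j for j, length in enumerate(voq_lengths)
                if length > 0 and xp_empty[j]]
    return max(eligible, key=lambda j: voq_lengths[j]) if eligible else None

def ocf_select(hol_arrival_times, xp_empty):
    """OCF: eligible VOQ whose HoL cell arrived earliest.
    An arrival time of None marks an empty VOQ."""
    eligible = [j for j, t in enumerate(hol_arrival_times)
                if t is not None and xp_empty[j]]
    return min(eligible, key=lambda j: hol_arrival_times[j]) if eligible else None

# VOQ 1 is the longest but its crosspoint buffer is full, so LQF picks VOQ 2.
print(lqf_select([3, 7, 5], [True, False, True]))
# VOQ 2 holds the oldest HoL cell (arrived at time 4).
print(ocf_select([12, None, 4], [True, True, True]))
```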
30 Internal Buffer Based Scheduling
XPi,j: internal buffer
31 SBF-LBF Performance
32 Performance (VOQ Occupancies)
34 OQ Emulation: The Speedup Problem
- Output Queueing: best delay and throughput performance, but high fabric speedup (S = N)
- Input Queueing: speedup of one is sufficient, but unpredictable delay due to input contention
[Figure: memory speeds for a 32x32 ATM switch]
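Note on slide 34: the memory-speed gap can be worked out with simple arithmetic. In an OQ switch an output memory may have to absorb N writes and one read per cell time, whereas an IQ memory sees one write and one read. The sketch below uses the 53-byte ATM cell; the 2.5 Gb/s line rate is an assumed example value, not a figure recovered from the slide, and the function names are hypothetical.

```python
ATM_CELL_BITS = 53 * 8                    # 53-byte ATM cell

def access_time_ns(line_rate_bps, accesses_per_cell_time):
    """Time available per memory access, in nanoseconds."""
    cell_time_ns = ATM_CELL_BITS / line_rate_bps * 1e9
    return cell_time_ns / accesses_per_cell_time

def oq_access_time_ns(n_ports, line_rate_bps):
    """Output queueing: up to N writes plus 1 read per cell time."""
    return access_time_ns(line_rate_bps, n_ports + 1)

def iq_access_time_ns(line_rate_bps):
    """Input queueing: 1 write plus 1 read per cell time."""
    return access_time_ns(line_rate_bps, 2)

N = 32
RATE = 2.5e9                              # assumed line rate in bits per second
print(f"OQ memory access time: {oq_access_time_ns(N, RATE):.2f} ns")
print(f"IQ memory access time: {iq_access_time_ns(RATE):.2f} ns")
```

For N = 32 the OQ memory has (N + 1) / 2 = 16.5 times less time per access than the IQ memory, whatever line rate is assumed.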
35 The Ideal Solution
Find a compromise: 1 < Speedup << N
- to get the performance of an OQ switch
- close to the cost of an IQ switch
Question: Can we find
- a simple and good algorithm
- that exactly mimics output-queueing
- regardless of switch sizes and traffic patterns?
36 Proposed Algorithms
- IQ bufferless crossbar switch
  - A speedup of 4 was shown to be sufficient (MUCF algorithm).
  - A speedup of just two was also shown to be sufficient (GBVOQ algorithm).
  - The bad news: both schemes are impractical (high complexity due to the centralized scheduling).
- Buffered crossbar switch (BCS)
  - A speedup of just two was proven to be enough for the exact emulation of an OQ switch (MCAF_LTF algorithm).
  - MCAF_LTF is practical and simple to implement in hardware.
38 Concluding Remarks
- The IQ crossbar switching architecture is becoming less attractive due to its scalability and scheduling complexity challenges.
- The BCS switching architecture shows good potential for overcoming the IQ switching problems.
Open Questions
- Is 100% throughput achievable with a speedup < 2 for buffered crossbars using parallel scheduling?
- Will storing multiple cells per crosspoint further simplify scheduling?