Title: Scheduling algorithms for input-queued IP routers
1Scheduling algorithms for input-queued IP routers

Emilio Leonardi, in collaboration with P. Giaccone, M. Ajmone Marsan, A. Bianco, M. Mellia, F. Neri
 Dipartimento di Elettronica
 Telecommunication Network Group
 http://www.tlcnetworks.polito.it
 Politecnico di Torino (Italy)
Budapest, March 2006
2Outline
 IP routers
 OQ routers
 IQ routers
 Scheduling
 Optimal algorithms
 Heuristic algorithms
 Packet-mode algorithms
 Networks of routers
 CIOQ routers
 Multicast traffic
 Conclusions
3Note
 The slides marked RWP are reproduced with permission of Prof. Nick McKeown from the Electrical Engineering and Computer Science Dept. of Stanford University (CA, USA)
5The Internet is a mesh of routers
core router
access router
enterprise router
6The Internet is a mesh of routers
 Access router
 high number of ports at low speed (kbps/Mbps)
 several access protocols (modem, ADSL, cable)
 Enterprise router
 medium number of ports at high speed (Mbps)
 several services (IP classification, filtering)
 Core router
 moderate number of ports at very high speed (Mbps/Gbps)
 very high throughput
7Basic functions
 Routing
 computation of the output port of an incoming packet
 uses the routing tables computed by the routing protocols
 can be a complex procedure
 very large routing tables
 dynamic variation of routes in the Internet
8Basic functions
 Switching
 transfer of packets from input ports to output ports
 resolution of contentions for output ports
 queueing: where to store
 scheduling: what to transfer
9Faster and faster
 Need for high performance routers
 to accommodate the bandwidth demands of new users and new services
 to support QoS
 to reduce costs
10Packet processing and link speed
 Increase of electronic packet processing power
cannot accommodate the increase in link speed
[Figure: fiber capacity (Gbit/s), log scale from 0.1 to 10000, versus year, 1985 to 2000. Link speed (TDM, then DWDM) doubles every 7 months; packet processing power follows Moore's law and doubles every 18 months. Source: SPEC95Int; David Miller, Stanford.]
RWP
11Memory access time
RWP
12Moore's law
 It's hard to keep up with Moore's law
 the bottleneck is memory speed
 Moore's law is too slow
 routers need to improve faster than Moore's law
RWP
13Router capacity exceeds Moore's law
 Growth in capacity of commercial routers
 1992: 2 Gb/s
 1995: 10 Gb/s
 1998: 40 Gb/s
 2001: 160 Gb/s
 2003: 640 Gb/s
 Average growth rate: 2.2x / 18 months
RWP
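The average growth rate quoted on this slide can be checked from the first and last data points. The sketch below assumes purely exponential growth between 1992 and 2003; the variable names are illustrative.

```python
# Router capacities listed on the slide (year -> Gb/s).
capacity_gbps = {1992: 2, 1995: 10, 1998: 40, 2001: 160, 2003: 640}

first_year, last_year = min(capacity_gbps), max(capacity_gbps)
total_growth = capacity_gbps[last_year] / capacity_gbps[first_year]
months = (last_year - first_year) * 12

# Growth factor per 18-month period, assuming exponential growth.
growth_per_18_months = total_growth ** (18 / months)
print(f"{growth_per_18_months:.2f}x every 18 months")
```

Moore's law corresponds to a factor of 2 over the same period, so the measured factor is slightly above it.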
14Single packet processing
 The time to process one packet is becoming shorter and shorter
 worst case: 40-byte packets (ACKs) travelling over the Internet
 3.2 µs at 100 Mbps
 320 ns at 1 Gbps
 32 ns at 10 Gbps
 3.2 ns at 100 Gbps
 320 ps at 1 Tbps
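These figures follow directly from dividing the packet size by the line rate. A minimal sketch of the arithmetic (the function name is illustrative):

```python
PACKET_BITS = 40 * 8  # worst case: 40-byte packets (ACKs)

def packet_time_seconds(link_rate_bps):
    """Time available to process one minimum-size packet at line rate."""
    return PACKET_BITS / link_rate_bps

rates = {
    "100 Mbps": 100e6,
    "1 Gbps": 1e9,
    "10 Gbps": 10e9,
    "100 Gbps": 100e9,
    "1 Tbps": 1e12,
}
for name, bps in rates.items():
    print(f"{name}: {packet_time_seconds(bps) * 1e9:g} ns")
```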
15Hardware architecture
physical structure
logical structure
16Hardware architecture
 Main elements
 line cards
 support input/output transmissions
 store packets
 adapt packets to the internal format of the switching fabric
 support data link protocols
 classify packets
 schedule packets
 support security
 switching fabric
 transfers packets from input ports to output ports
17Hardware architecture
 Main elements
 control processor/network processor
 runs routing protocols
 computes routing tables
 manages the overall system
 forwarding engines
 compute the packet destination (lookup)
 inspect packet headers
 rewrite packet headers
18Interconnections among main elements  I
19Interconnections among main elements  II
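20Cell-based routers
[Figure: cell switch (fabric) with ports 1 to N; packets are segmented into cells at the inputs and cells are reassembled into packets at the outputs]
 ISM: Input-Segmentation Module
 ORM: Output-Reassembly Module
 packet: variable-size data unit
 cell: fixed-size data unit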
21Switching fabric
 Our assumptions
 bufferless
 to reduce internal hardware complexity
 non-blocking
 it is always possible to transfer in parallel from input to output ports any non-conflicting set of cells
22Switching fabric
 Examples
 crossbar
 rearrangeable Clos network
 Benes network
 Batcher-Banyan network (self-routing)
 Switching constraints
 at most one cell for each input and for each
output can be transferred
23Switching fabric
 We do not discuss switching fabrics with internal buffers
 e.g. crossbars with a buffer at each crosspoint
24Generic switching architecture
output queues
input queues
25Speedup
 The speedup determines the switch performance
 $S_{in}$: reading speed from input queues
 $S_{out}$: writing speed to output queues
 maximum speedup factor
 $S = \max(S_{in}, S_{out})$
26Performance comparison
 The performance of different switching systems can be studied
 with analytical models
 introducing simplifying assumptions, but obtaining general results
 with simulation models
 obtaining more detailed results
27Traffic description
 $A_{ij}(n) = 1$ if a packet arrives at time $n$ at input $i$, with destination reachable through output $j$
 $\lambda_{ij} = E[A_{ij}(n)]$
 An arrival process is admissible if
 $\sum_i \lambda_{ij} \le 1$ for every output $j$
 $\sum_j \lambda_{ij} \le 1$ for every input $i$
 that is, no input and no output are overloaded on average
 note that OQ switches exhibit finite delays only for admissible traffic
 traffic matrix $\Lambda = [\lambda_{ij}]$
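The admissibility condition is a check on the row and column sums of the traffic matrix. A minimal sketch, where the two example matrices are made up for illustration and do not come from the slides:

```python
def is_admissible(rates):
    """True if no input (row) and no output (column) is overloaded on average."""
    n = len(rates)
    rows_ok = all(sum(rates[i][j] for j in range(n)) <= 1 for i in range(n))
    cols_ok = all(sum(rates[i][j] for i in range(n)) <= 1 for j in range(n))
    return rows_ok and cols_ok

# Uniform traffic at load 0.9 on a 4x4 switch: every row and column sums to 0.9.
uniform = [[0.9 / 4] * 4 for _ in range(4)]
# Both inputs send 0.6 to output 0: that output is overloaded (column sum 1.2).
hotspot = [[0.6, 0.1], [0.6, 0.1]]

print(is_admissible(uniform), is_admissible(hotspot))
```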
28Traffic scenarios
 Uniform traffic
 Bernoulli i.i.d. arrivals
 usual testbed in the literature
 easy to schedule
 Diagonal traffic
 Bernoulli i.i.d. arrivals
 critical to schedule, since only two matchings are good
29Traffic scenarios
 Log-diagonal traffic
 Bernoulli i.i.d. arrivals
 more critical than uniform, less than diagonal traffic
31Output Queued (OQ) switches
 $S_{in} = 1$, $S_{out} = N$
 used for low bandwidth routers
 no coordination among ports
 work-conserving
 best average delays
 complete control of delays
 support of QoS scheduling
32Output Queued (OQ) switch
33OQ performance
[Figure: OQ performance under uniform traffic]
 Note: OQ is optimal from the point of view of average delay and throughput
35Simple Input Queued (IQ) switches
 $S_{in} = 1$, $S_{out} = 1$
 1 FIFO queue for each input port
 throughput limitations due to head-of-the-line (HOL) blocking
 scheduling to solve contentions for the same output
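The throughput limitation caused by HOL blocking can be observed with a small saturation experiment. The sketch below is an illustration under stated assumptions, not the simulator used for the slides: every input always has a cell at the head of its FIFO, destinations are uniform, and each contended output picks one head-of-line cell at random. Karol et al. (listed in the references) derive a limit of about 58.6% for large N under these assumptions.

```python
import random

def hol_saturation_throughput(n_ports, n_slots, seed=0):
    """Fraction of output capacity used by a saturated FIFO input-queued switch."""
    rng = random.Random(seed)
    # hol[i] is the destination output of the head-of-line cell at input i.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(n_slots):
        contenders = {}
        for i, out in enumerate(hol):
            contenders.setdefault(out, []).append(i)
        for out, inputs in contenders.items():
            winner = rng.choice(inputs)
            # The next cell in the FIFO moves to the head with a fresh destination;
            # the losing inputs keep their blocked head-of-line cell.
            hol[winner] = rng.randrange(n_ports)
            served += 1
    return served / (n_ports * n_slots)

throughput = hol_saturation_throughput(16, 2000)
print(f"saturation throughput: {throughput:.3f}")
```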
36Head of the Line (HOL) Blocking
RWP
37Simple IQ switch performance
[Figure: performance of simple IQ compared with OQ under uniform traffic]
38Improving simple IQ switches
 Window/bypass schedulers
 the first $w$ cells of each queue contend for outputs
 HOL blocking is reduced, not eliminated
 $w = 1$ means FIFO at each input
 higher complexity
 the scheduler deals with $wN$ cells
 non-FIFO queues
39Improving IQ switches
 Virtual output queueing (VOQ)
 one queue for each input/output pair
 N queues at each input
 $N^2$ queues in the whole switch
 eliminates HOL blocking
 used in high-bandwidth routers
 scheduling implemented in hardware at very high speed
40IQ switches with VOQ
 Note: from now on, we always assume VOQ at the switch inputs
42Scheduling in IQ switches
 Scheduling can be modeled as a matching problem in a bipartite graph
 the edge from node i to node j refers to packets at input i and directed to output j
 the weight of the edge can be
 binary (not empty/empty queue)
 queue length
 HOL cell waiting time, or cell age
 some other metric indicating the priority of the HOL cell to be served
43Scheduling in IQ switches
[Figure: the scheduler turns the request graph between inputs and outputs into a matching (or permutation)]
44Scheduling in IQ switches
 Request matrix
   3 5 0 0
   2 0 0 4
   4 5 0 0
   0 0 8 2
 Permutation
   0 1 0 0
   0 0 0 1
   1 0 0 0
   0 0 1 0
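For a switch this small, the heaviest matching for the request matrix of this slide can be found by exhaustive search. The sketch below tries all N! permutations, which is only viable for tiny N and is not how a hardware scheduler would operate; the function name is illustrative.

```python
from itertools import permutations

def max_weight_matching(weights):
    """Exhaustive search over all N! matchings; perm[i] is the output given to input i."""
    n = len(weights)
    best_perm, best_weight = None, -1
    for perm in permutations(range(n)):
        w = sum(weights[i][perm[i]] for i in range(n))
        if w > best_weight:
            best_perm, best_weight = perm, w
    return best_perm, best_weight

request = [
    [3, 5, 0, 0],
    [2, 0, 0, 4],
    [4, 5, 0, 0],
    [0, 0, 8, 2],
]
perm, weight = max_weight_matching(request)
print(perm, weight)  # -> (1, 3, 0, 2) 21
```

With 0-based indices, input 0 is matched to output 1, input 1 to output 3, input 2 to output 0 and input 3 to output 2, for a total weight of 5 + 4 + 4 + 8 = 21.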
45Implementing schedulers
 Scheduling is a complex task
 a scheduling algorithm can be implemented in hardware if
 it shows good performance for a wide range of traffic patterns
 it can be efficiently parallelized
 it can be efficiently pipelined
 it requires few iterations (or clock cycles)
 it requires limited control information
46Scheduling uniform traffic
 A number of algorithms give 100% throughput when traffic is uniform
 For example
 TDM and a few variants
 iSLIP (see later)
[Figure: example of TDM for a 4x4 switch]
RWP
47Birkhoff-von Neumann theorem
 Any doubly stochastic matrix $\Lambda$ can be expressed as a convex combination of permutation matrices $\pi_n$
 $\Lambda = \sum_n a_n \pi_n$
 with
 $a_n \ge 0$
 $\sum_n a_n = 1$
48Scheduling non-uniform traffic
 Thanks to the Birkhoff-von Neumann theorem
 if the traffic is known and admissible, 100% throughput can be achieved by a TDM using
 for a fraction of time $a_1$, matching $M_1$ ($\pi_1$)
 for a fraction of time $a_2$, matching $M_2$ ($\pi_2$)
 ...
 for a fraction of time $a_k$, matching $M_k$ ($\pi_k$)
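One way to obtain the coefficients and permutations used by such a TDM is the classical greedy decomposition: repeatedly find a permutation supported on the positive entries, take the smallest entry along it as coefficient, and subtract. The sketch below assumes the input is exactly doubly stochastic and uses exact fractions to avoid rounding issues; function names and the example matrix are illustrative.

```python
from fractions import Fraction

def perfect_matching_on_support(matrix):
    """Find a permutation that uses only positive entries (augmenting paths)."""
    n = len(matrix)
    match_col = [None] * n  # match_col[j] = row currently matched to column j

    def try_row(i, seen):
        for j in range(n):
            if matrix[i][j] > 0 and j not in seen:
                seen.add(j)
                if match_col[j] is None or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, set()):
            return None
    perm = [None] * n
    for j, i in enumerate(match_col):
        perm[i] = j
    return perm

def birkhoff_von_neumann(matrix):
    """Return a list of (coefficient, permutation) pairs summing to the matrix."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    terms = []
    while any(x > 0 for row in m for x in row):
        perm = perfect_matching_on_support(m)
        if perm is None:
            raise ValueError("matrix is not doubly stochastic")
        a = min(m[i][perm[i]] for i in range(n))
        for i in range(n):
            m[i][perm[i]] -= a
        terms.append((a, perm))
    return terms

F = Fraction
L = [[F(1, 2), F(1, 2), 0],
     [F(1, 4), F(1, 4), F(1, 2)],
     [F(1, 4), F(1, 4), F(1, 2)]]
terms = birkhoff_von_neumann(L)
for a, perm in terms:
    print(a, perm)
```

Each pair says: use matching `perm` for a fraction `a` of the time slots.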
50Maximum Size Matching
 Maximum Size Matching (MSM)
 among all the possible matchings, selects the one with the highest number of edges
 MSM is generally not unique
 the best MSM algorithm requires $O(N^{2.5})$ iterations, and cannot be implemented efficiently, since it is based on a flow augmentation path algorithm
51Instability of MSM
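 Assume
 $P(\text{arrival at } Q_{12}) = \lambda$
 $P(\text{arrival at } Q_{11}) = P(\text{arrival at } Q_{22}) = 1 - \lambda - \delta$
 $Q_{12} = B > 0$, $Q_{11} = Q_{22} = 0$
 in case of parity, serve $Q_{11}$ and/or $Q_{22}$ instead of $Q_{12}$
 Observe
 $Q_{12}$ is served only when $A_{11} = 0$ and $A_{22} = 0$, i.e. with probability
 $P(\text{serve } Q_{12}) = P(\text{no arrivals at both } Q_{11} \text{ and } Q_{22}) = [1 - (1 - \lambda - \delta)]^2 = (\lambda + \delta)^2$
 $P(\text{serve } Q_{12}) < P(\text{arrival at } Q_{12})$ if $\delta$ is small enough
 Example: $\lambda = 0.5$, $\delta = 0.1$, $P(\text{serve } Q_{12}) = 0.36$
 Note: this proof is due to I. Keslassy, Stanford Univ.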
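The argument can be illustrated with a short simulation of the 2x2 case. The sketch below rests on assumptions of its own: `lam` and `delta` name the two parameters of the slide, input 1 receives at most one cell per slot (to output 1 or to output 2), and ties are broken in favour of Q11 and/or Q22 as stated above.

```python
import random

def simulate_msm_2x2(lam, delta, n_slots, seed=0):
    """Return the length of Q12 after n_slots under MSM with the stated tie-break."""
    rng = random.Random(seed)
    p = 1 - lam - delta          # arrival probability at Q11 and at Q22
    q11 = q12 = q22 = 0
    for _ in range(n_slots):
        # Input 1: one cell at most, to output 1 (prob. p) or output 2 (prob. lam).
        u = rng.random()
        if u < p:
            q11 += 1
        elif u < p + lam:
            q12 += 1
        # Input 2 only sends to output 2.
        if rng.random() < p:
            q22 += 1
        # MSM compares {Q11, Q22} with {Q12}; ties favour Q11 and/or Q22.
        if q11 > 0 or q22 > 0:
            if q11 > 0:
                q11 -= 1
            if q22 > 0:
                q22 -= 1
        elif q12 > 0:
            q12 -= 1
    return q12

serve_probability = (0.5 + 0.1) ** 2
backlog = simulate_msm_2x2(0.5, 0.1, 10_000)
print(f"P(serve Q12) = {serve_probability:.2f}, Q12 after 10000 slots = {backlog}")
```

Since Q12 receives cells at rate 0.5 but is served at rate 0.36, its backlog grows roughly linearly with time even though the traffic is admissible.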
52Maximum Size Matching
 MSM maximizes the instantaneous throughput
 MSM may not yield 100% throughput
 short-term decisions can be inefficient in the long term
 non-binary edge weights allow MWM to maximize the long-term throughput
53Maximum Weight Matching
 Maximum Weight Matching (MWM)
 among all the possible $N!$ matchings, selects the one with the highest weight (sum of the edge metrics)
 MWM is generally not unique
 MWM is too complex to be implemented in hardware at high speed
 the best MWM algorithm requires $O(N^3)$ iterations, and cannot be implemented efficiently, since it is based on a flow augmentation path algorithm
 cannot be parallelized and pipelined efficiently
 MWM has never been implemented in a commercial chipset
54Maximum Weight Matching
 In case of unknown traffic, MWM is the optimal solution of the scheduling problem when the weight is either the queue length or the cell age
 achieves 100% throughput under any admissible traffic
 also under non-Bernoulli arrival processes satisfying the law of large numbers
 achieves low average delays, very close to those of OQ switches
 possible starvation for lightly loaded packet flows
56MWM with pipeline and latency
 Let $T$ and $P$ be fixed
 $D_t$ denotes the matching used at time $t$
 The following variations of MWM also achieve 100% throughput
 $D_t = \mathrm{MWM}(t - P)$: MWM with pipeline degree $P$
 $D_t = \mathrm{MWM}(\lceil t/T \rceil T)$: MWM with latency $T$
 combinations of both
 thus, it seems easy to achieve 100% throughput!
57MWM with pipeline and latency
 But
 What about throughput?
 100% throughput
 but needs the computation of an MWM
 What about delays?
 delays can be really bad!
58General consideration
 When scheduling in IQ switches, it is very difficult to achieve simultaneously
 high throughput
 low delay
 limited implementation complexity
59Uniform traffic
 MWM and MSM behave almost identically
[Figure: mean delay (log scale) versus normalized load under uniform traffic, for MWM and MSM]
60Log-diagonal traffic
 MSM is somewhat inferior to MWM
[Figure: mean delay (log scale) versus normalized load under log-diagonal traffic, for MWM and MSM]
61Diagonal traffic
 MSM yields much longer delays than MWM at
medium/high loads
[Figure: mean delay (log scale) versus normalized load under diagonal traffic, for MWM and MSM]
63Approximations of MSM and MWM
 Motivation
 strong interest in scheduling algorithms with
 very low complexity
 high performance
 Usually
 implementable schedulers (low complexity)
 → low throughput, long delays
 theoretical schedulers (high complexity)
 → high throughput, short delays
64Some implementable algorithms
 Approximate MSM
 WFA, iSLIP, 2DRR, RC, FIRM and many others
 Approximate MWM with $w_{ij} = X_{ij}$ (queue length)
 iLQF, RPA, learning algorithms
 Approximate MWM with $w_{ij}$ = cell age
 iOCF
 Approximate MWM with $w_{ij} = \sum_k X_{kj} + \sum_k X_{ik}$
 iLPF, MUCS
65APPROXIMATIONS OF MAXIMUM SIZE MATCHING
66Wave Front Arbiter
[Figure: requests and resulting match for a 4x4 switch]
RWP
67Wave Front Arbiter
 $2N - 1$ steps
[Figure: requests and resulting match]
RWP
68Wrapped Wave Front Arbiter
 $N$ steps instead of $2N - 1$
[Figure: requests and resulting match]
RWP
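The wrapped variant can be sketched in a few lines. The slides give only the step counts, so the sketch follows the usual description of the arbiter (Tamir and Chi, listed in the references): cells on the same wrapped diagonal share no row and no column, so they can be decided in the same step, and a request is granted if its row and column are still free. The `start` parameter, which rotates the first diagonal, is an assumption added here for illustration.

```python
def wrapped_wave_front_arbiter(requests, start=0):
    """requests[i][j] is truthy when input i has a cell for output j.
    Returns a dict {input: output}."""
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    match = {}
    for step in range(n):                 # N steps, one wrapped diagonal each
        d = (start + step) % n
        for i in range(n):                # cells of one diagonal are independent
            j = (d - i) % n
            if requests[i][j] and row_free[i] and col_free[j]:
                match[i] = j
                row_free[i] = col_free[j] = False
    return match

request = [
    [3, 5, 0, 0],
    [2, 0, 0, 4],
    [4, 5, 0, 0],
    [0, 0, 8, 2],
]
match = wrapped_wave_front_arbiter(request)
print(match)
```

Every requested cell is visited once, so the result is a maximal matching, though not necessarily a maximum size one.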
69iSLIP
 iSLIP means iterative SLIP
 iterates among the following 3 phases
 Request
 Grant
 Accept
70iSLIP
 3 phases
 Request (from inputs to outputs)
 each unmatched input sends a request to every output for which it has a cell
 Grant (from outputs to inputs)
 if an unmatched output receives requests, it sends a grant to one of the inputs
 contentions solved by a round-robin mechanism
 Accept (from inputs to outputs)
 if an unmatched input receives grants, it selects a single output and becomes matched to it
 contentions solved by a round-robin mechanism
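The three phases can be sketched as follows. The pointer update rule is not stated on the slides; the sketch uses the rule from McKeown's iSLIP paper (listed in the references), where the grant and accept pointers move one position beyond the matched port only for matches made in the first iteration. Names such as `grant_ptr` and `accept_ptr` are illustrative.

```python
def islip(requests, grant_ptr, accept_ptr, iterations):
    """One time slot of iSLIP. requests[i][j] is truthy if VOQ (i, j) is non-empty.
    grant_ptr[j] and accept_ptr[i] are round-robin pointers, updated in place.
    Returns in_match, where in_match[i] is the output matched to input i (or None)."""
    n = len(requests)
    in_match = [None] * n
    out_match = [None] * n
    for it in range(iterations):
        # Request + Grant: each unmatched output grants the first requesting
        # unmatched input found at or after its pointer.
        grants = {}                       # input -> list of granting outputs
        for j in range(n):
            if out_match[j] is not None:
                continue
            for k in range(n):
                i = (grant_ptr[j] + k) % n
                if in_match[i] is None and requests[i][j]:
                    grants.setdefault(i, []).append(j)
                    break
        if not grants:
            break
        # Accept: each input accepts the first granting output at or after its pointer.
        for i, outs in grants.items():
            j = min(outs, key=lambda o: (o - accept_ptr[i]) % n)
            in_match[i], out_match[j] = j, i
            if it == 0:                   # pointers move only in the first iteration
                grant_ptr[j] = (i + 1) % n
                accept_ptr[i] = (j + 1) % n
    return in_match

request = [
    [3, 5, 0, 0],
    [2, 0, 0, 4],
    [4, 5, 0, 0],
    [0, 0, 8, 2],
]
grant_ptr = [0, 0, 0, 0]
accept_ptr = [0, 0, 0, 0]
in_match = islip(request, grant_ptr, accept_ptr, iterations=4)
print(in_match, grant_ptr, accept_ptr)
```

In this example the first iteration matches three inputs and the second iteration matches the remaining one.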
71iSLIP
 The round-robin mechanism in iSLIP is designed so that, under uniform traffic, iSLIP emulates a dynamic TDM scheduler synchronized on the arrival pattern
72iSLIP
 iSLIP is maximal
 often, with log N iterations
 always, with N iterations
 iSLIP was implemented on one chip in the Cisco 12000 router
 http://www.cisco.com/warp/public/cc/pd/rt/12000/tech/fasts_wp.pdf
73iSLIP
 iSLIP demo
 from http://tinytera.stanford.edu/tinytera/demos/index.html
74APPROXIMATIONS OF MAXIMUM WEIGHT MATCHING
75iLQF
 iLQF means iterative Longest Queue First
 iterates among the following 3 phases
 Request
 Grant
 Accept
76iLQF
 3 phases
 Request (from inputs to outputs)
 each unmatched input sends all its queue lengths as requests to the corresponding outputs
 Grant (from outputs to inputs)
 if an unmatched output receives requests, it sends a grant to the input corresponding to the longest queue
 contentions solved by random choice
 Accept (from inputs to outputs)
 if an unmatched input receives grants, it selects the output with the longest queue
 contentions solved by random choice
77iLQF
 iLQF is maximal
 often, with log N iterations
 always, with N iterations
 iLQF is robust to non-uniform traffic
78iLQF
 iLQF demo
 from http://tinytera.stanford.edu/tinytera/demos/index.html
79RPA
 RPA means Reservation with Preemption and Acknowledgement
 Two phases
 Reservation (possibly preemptive)
 Acknowledgement
 Sequential accesses to a reservation vector Res
 $Urg_j$ (if set) is the urgency of the transfer from input $In_j$ to output $j$
80RPA
 Vector Res is sequentially accessed by all inputs
[Figure: inputs 1 to 4 accessing the vector Res in sequence]
81RPA
 Initially, at each round $Urg_j = 0$ for all $j$
 Reservation phase
 when input $i$ accesses Res
 it computes $W_j = X_{ij} - Urg_j$ for all $j$
 finds $j^*$ such that $W_{j^*} = \max_j W_j$
 if $W_{j^*} > 0$,
 → reserve output $j^*$ and set $Urg_{j^*} = X_{ij^*}$, possibly overwriting the previous reservation
 otherwise,
 → leave the current reservation
82RPA
 Acknowledgement phase
 if input $i$ still finds its reservation at output $j$,
 → books output $j$
 otherwise,
 → chooses an unreserved output $j$ and books output $j$
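The two phases can be sketched as a single function, with the urgency taken equal to the queue length. Two details are assumptions of this sketch rather than statements from the slides: inputs access Res in index order, and in the acknowledgement phase an input whose reservation was preempted only books an unreserved output for which it has a non-empty queue.

```python
def rpa(queue_len):
    """One RPA round. queue_len[i][j] is X_ij. Returns a dict {input: output}."""
    n = len(queue_len)
    urg = [0] * n            # Urg_j, reset at the start of each round
    res = [None] * n         # res[j] = input currently reserving output j

    # Reservation phase: inputs access Res one after the other.
    for i in range(n):
        w = [queue_len[i][j] - urg[j] for j in range(n)]
        j_star = max(range(n), key=lambda j: w[j])
        if w[j_star] > 0:
            res[j_star] = i                      # may preempt an earlier reservation
            urg[j_star] = queue_len[i][j_star]

    # Acknowledgement phase: surviving reservations are booked first.
    booked = {res[j]: j for j in range(n) if res[j] is not None}
    # Preempted inputs then look for an output that nobody reserved or booked.
    for i in range(n):
        if i in booked:
            continue
        for j in range(n):
            if res[j] is None and j not in booked.values() and queue_len[i][j] > 0:
                booked[i] = j
                break
    return booked

request = [
    [3, 5, 0, 0],
    [2, 0, 0, 4],
    [4, 5, 0, 0],
    [0, 0, 8, 2],
]
print(rpa(request))
print(rpa([[5, 0], [9, 0]]))   # input 1 preempts input 0 at output 0
```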
83Uniform traffic
 comparison between MWM, iSLIP, iLQF, and RPA
[Figure: mean delay (log scale) versus normalized load under uniform traffic, for MWM, iSLIP, iLQF and RPA]
84Log-diagonal traffic
 iSLIP saturates close to 84% throughput
[Figure: mean delay (log scale) versus normalized load under log-diagonal traffic, for MWM, iSLIP, iLQF and RPA]
85Diagonal traffic
 RPA achieves 98% throughput, iLQF 87%, iSLIP 83%
[Figure: mean delay (log scale) versus normalized load under diagonal traffic, for MWM, iSLIP, iLQF and RPA]
86LEARNING ALGORITHMS
87Learning algorithms
 Goal
 find a good compromise among
 throughput, delay and complexity
88Learning algorithms
 Key observation
 the matchings generated by MWM show limited changes from one time slot to the next
 remembering the matching from the past simplifies the computation of the new matching
 the search implemented by MWM can be enhanced
 with a randomized approach
 by observing arrivals
 by searching in parallel
 based on an extension of randomized scheduling algorithms
89Simple Randomized Schemes
 Choose a matching at random and use it as the schedule
 doesn't yield 100% throughput
 Choose 2 matchings at random and use the heavier one as the schedule
 Choose N matchings at random and use the heaviest one as the schedule
 → None of these can give 100% throughput!
90Simple randomized algorithms
[Figure: performance of simple randomized algorithms, 32x32 switch]
91Bounds on Maximum Throughput
92Tassiulas' scheme
 Consider the following policy
 $R_t$: matching picked at random (uniformly) among all the possible $N!$ matchings
 $D_t = \arg\max \{W(D_{t-1}), W(R_t)\}$
 Complexity is very low
 $O(1)$ iterations
 easy to pipeline
 Yields 100% throughput!
 note: the boost in throughput is due to the memory of the past matching $D_{t-1}$
 However, delays are very large
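The policy fits in a few lines. In the sketch below a matching is a list where `matching[i]` is the output assigned to input `i`, the weight is the sum of the corresponding queue lengths, and the queue lengths are kept fixed across steps only to make the effect of the memory visible; all names are illustrative.

```python
import random

def weight(matching, queue_len):
    """W(M): sum of the queue lengths served by the matching."""
    return sum(queue_len[i][j] for i, j in enumerate(matching))

def tassiulas_step(prev_matching, queue_len, rng):
    """D_t = heavier of D_{t-1} and a uniformly random matching R_t."""
    n = len(queue_len)
    candidate = list(range(n))
    rng.shuffle(candidate)               # uniform over the N! permutations
    if weight(candidate, queue_len) > weight(prev_matching, queue_len):
        return candidate
    return prev_matching

queue_len = [
    [3, 5, 0, 0],
    [2, 0, 0, 4],
    [4, 5, 0, 0],
    [0, 0, 8, 2],
]
rng = random.Random(1)
d = [0, 1, 2, 3]
history = [weight(d, queue_len)]
for _ in range(50):
    d = tassiulas_step(d, queue_len, rng)
    history.append(weight(d, queue_len))
print(history[0], history[-1])
```

With static weights the sequence of weights never decreases, which is the "memory" effect mentioned on the slide; in a real switch the queue lengths change at every slot.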
93Tassiulas' scheme
[Figure: performance of Tassiulas' scheme, 32x32 switch]
94Learning approach
 Properties of COMP1
 $W(D_t) \ge W(D_{t-1})$
 $W(D_t) \ge W(M_t)$
 Examples
 COMP1 is the MAX among $D_{t-1}$ and $M_t$
 COMP1 is the MERGE among $D_{t-1}$ and $M_t$
95MERGE procedure
[Figure: merging of two matchings; edge labels 3 1 2 2 2 and 2 1 2 4 1; resulting matching M with W(M) = 13]
 Emulating MWM is O(N)
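A possible reading of the MERGE procedure, following the description by Giaccone, Prabhakar and Shah (listed in the references), is sketched below: the union of two perfect matchings splits into disjoint cycles, and on each cycle the edges of the heavier of the two matchings are kept. Matchings are lists where `d[i]` is the output of input `i`; the example weights are made up for illustration.

```python
def merge(d, m, w):
    """Merge perfect matchings d and m (input -> output) using edge weights w[i][j]."""
    n = len(d)
    m_inv = [0] * n
    for i, j in enumerate(m):
        m_inv[j] = i
    merged = [None] * n
    for start in range(n):
        if merged[start] is not None:
            continue
        # Collect the inputs on the cycle of the union of d and m containing `start`.
        cycle, i = [], start
        while True:
            cycle.append(i)
            i = m_inv[d[i]]      # follow d to an output, then m back to an input
            if i == start:
                break
        wd = sum(w[k][d[k]] for k in cycle)
        wm = sum(w[k][m[k]] for k in cycle)
        chosen = d if wd >= wm else m
        for k in cycle:
            merged[k] = chosen[k]
    return merged

w = [
    [5, 1, 0, 0],
    [1, 5, 0, 0],
    [0, 0, 1, 4],
    [0, 0, 4, 1],
]
d = [0, 1, 2, 3]     # weight 12
m = [1, 0, 3, 2]     # weight 10
merged = merge(d, m, w)
merged_weight = sum(w[i][j] for i, j in enumerate(merged))
print(merged, merged_weight)  # -> [0, 1, 3, 2] 18
```

Each input is visited once, so the cost is linear in N, and the merged matching is at least as heavy as both inputs.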
96The learning approach
 Properties of $M_t$
 informally, $M_t$ should be a good sample in the space of all possible matchings
 Examples
 $M_t$ is a matching picked uniformly at random
 $M_t$ is a matching picked non-uniformly at random, with a high probability of being heavy
 $M_t$ is derived from the arrival vector $A_t$
 $M_t$ is a good neighbor of $D_{t-1}$
97Theoretical properties
 Stability
 100% throughput under any admissible Bernoulli traffic pattern
 Delay
 the larger the weight of $M_t$, the smaller the queue lengths, and hence the smaller the delays
98Example of practical implementation
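 Exploiting parallel search
[Figure: the neighbors of $D_{t-1}$ (up to the K-th), $D_{t-1}$ itself and a matching $M_t$ derived from the arrivals $A_t$ feed MAX blocks that output $D_t$]
 This scheme is called APSARA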
99What is a neighbor of a matching?
[Figure: a matching $D_{t-1}$ and its 3 neighbors N1, N2, N3]
 Each neighbor
 differs from $D_{t-1}$ in ONLY TWO edges
 can be generated very easily in hardware
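With a matching stored as a list (`matching[i]` is the output of input `i`), a neighbor is obtained by swapping the outputs of two inputs, which changes exactly two edges. The sketch below enumerates all neighbors and then picks the heaviest among the neighbors, the previous matching and a matching derived from the arrivals; how that last matching is built is not specified here, so it is simply passed in as a parameter.

```python
from itertools import combinations

def neighbors(matching):
    """All matchings obtained by swapping the outputs of two inputs."""
    result = []
    for a, b in combinations(range(len(matching)), 2):
        nb = list(matching)
        nb[a], nb[b] = nb[b], nb[a]
        result.append(nb)
    return result

def apsara_step(prev, arrival_matching, queue_len):
    """Heaviest matching among the neighbors of prev, prev itself and arrival_matching."""
    def weight(mt):
        return sum(queue_len[i][j] for i, j in enumerate(mt))
    candidates = neighbors(prev) + [prev, arrival_matching]
    return max(candidates, key=weight)

queue_len = [
    [3, 5, 0, 0],
    [2, 0, 0, 4],
    [4, 5, 0, 0],
    [0, 0, 8, 2],
]
print(neighbors([0, 1, 2]))
print(apsara_step([0, 1, 2, 3], [1, 0, 3, 2], queue_len))
```

For N = 3 there are exactly 3 neighbors, as in the figure of this slide; in general there are N(N-1)/2, and in hardware they can be evaluated in parallel.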
100MaxAPSARA
 APSARA, as described before, is not maximal
 MaxAPSARA is a modified version of APSARA where a maximal size matching algorithm runs on the remaining unmatched inputs/outputs
 e.g., if k inputs/outputs are unmatched,
 run iSLIP with k iterations
 select k random edges among the unmatched inputs/outputs
101APSARA performance
103Routers and switches
 IP routers deal with variable-size packets
 Hardware switching fabrics often deal with fixed-size cells
 Question
 how to integrate a hardware switching fabric within an IP router?
104Router based on an IQ cell switch: cell-mode
105Cell-mode scheduling
 Scheduling algorithms work at cell level
 pros
 100% throughput achievable
 cons
 interleaving of packets at the outputs of the switching fabric
106Router based on an IQ cell switch: packet-mode
 NO packet interleaving if packet-mode
[Figure: IQ cell switch: switching fabric with ports 1 to N, followed by ORMs]
107Router based on an IQ cell switch: packet-mode
 NO packet interleaving if packet-mode
 ORMs can be removed
[Figure: IQ cell switch: switching fabric with ports 1 to N, followed by ORMs]
108Packet-mode scheduling
 Rule: packets transferred as trains of cells
 when an input starts transferring the first cell of a packet comprising $k$ cells, it continues to transfer in the following $k - 1$ time slots
 Pros
 no interleaving of packets at the outputs
 easy extension of traditional schedulers
 Cons
 starvation due to long packets
 inherent in packet systems without preemption
 negligible at high line rates
109Packet-mode scheduling
 Questions
 can packet mode provide high throughput? YES!
 what about delays? It depends
110Packet-mode properties
 Main theoretical results
 MWM in packet mode yields 100% throughput
 Packet mode can provide shorter delays than cell mode, depending on the packet length distribution
111Simulation scenario
 Router with ISMs and ORMs
 Uniform packet traffic
 uniform packet load
 uniform (1,192) packet size distribution
 Spotted packet traffic
 non uniform packet load
 bimodal (3,100) packet size distribution
112Uniform packet traffic
 Packet mode and cell mode reach the same
throughput
[Figure: two panels, cell mode and packet mode: mean packet delay (log scale) versus normalized load under uniform packet traffic, for MWM, MSM, iSLIP and iLQF]
113Spotted packet traffic
 Packet mode reaches higher throughput than cell
mode
[Figure: two panels, cell mode and packet mode: mean packet delay (log scale) versus normalized load (0.5 to 1.0) under spotted packet traffic, for MWM, MSM, iSLIP and iLQF]
114Effect of packet size distribution
 iSLIP: delay_CM / delay_PM for different packet size distributions
[Figure: packet mode gain for iSLIP versus normalized load, for uniform, exponential, trimodal and bimodal packet size distributions; values above 1 mean packet mode is better, values below 1 mean cell mode is better]
115Packet mode features
 Packet mode scheduling
 is a feasible modification of schedulers
 improves throughput
 but it can generate some unfairness between long and short packets
 inherent to all variable-size packet networks without preemption
 may give better packet delays than cell mode
 depends on the packet size distribution
117Network of IQ routers
 Question
 given a network of IQ switches and an admissible
input traffic, is the network always stable?
118Networks of IQ routers
 Consider the acyclic network of IQ routers in the following slide
 derived from well-established results from adversarial queueing theory
 a very specific scenario, but it comprises only a few switches
 this situation may not be common, but cannot be excluded in real networks
119Pathological network of IQ switches
Network with 8 switches and 4 flows
120Instability of MWM
 If MWM is adopted at each IQ router, and the
traffic is admissible, the system can be unstable
under Bernoulli i.i.d. arrivals
121Instability of MWM
 MWM is too greedy, in the sense that it can create traffic bursts that are amplified by each scheduler
 A server can be idling when large bursts (directed to it) are blocked because of contentions upstream
 the problem arises when a packet flow is subject to priority changes along its path through the network
 it is dangerous to increase priority along the path
122Stability in networks of routers
 Global policies
 Oldest in the network and many others
 problem: requires global information about the network, and perfectly synchronized clocks at the ingress of the network
 Local policies
 until now, nothing really satisfying known (work in progress)
123Stability in networks of routers
 Semi-local policies
 MWM with local information about the router neighbors can achieve 100% throughput under i.i.d. Bernoulli arrivals
 Virtual network queue
 the weights used by MWM are
 $w_{ij} = \max\{0, X_{ij} - H(X_{ij})\}$
 where $H(X_{ij})$ is the size of the upstream queue that is sending packets to $X_{ij}$
125CIOQ routers
[Figure: CIOQ router with VOQ at the inputs]
126CIOQ routers
 Question
 if a low speedup S is allowed (and queues are available at both inputs and outputs), is it possible to design simple scheduling algorithms, capable of achieving high throughput and low delay?
 YES!
127CIOQ routers with S = 2
 If $S = 2$
 it is easy to obtain 100% throughput
 all maximal matchings work
 based on stable marriage algorithms
 it is less easy to obtain work conservation
 output never idling whenever a packet destined to it is present
 same average delays as OQ
 very good delay performance
 e.g. LOOFA
 it is difficult to perfectly emulate OQ
128LOOFA
 The occupancy $C_j$
 is the number of cells currently residing at the $j$-th output queue
 at each time slot, it is decremented by one because of departures
 Basic idea of LOOFA
 give priority to output channels with low occupancy, thereby attempting to maintain work conservation for all outputs
129LOOFA
 If $S = 2$, during each of the two phases
 each unmatched input selects a non-empty VOQ directed to the unmatched output with the lowest occupancy, and sends a request to that output
 each unmatched output selects one of the requests, and sends a grant to that input
 repeat until the matching is maximal
 the selection at the outputs can be round robin, random, ...
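One matching phase of LOOFA can be sketched as below. The selection rule at the outputs is left open on the slide, so the sketch simply grants the lowest-numbered requesting input; the update of the occupancies between the two phases of a time slot is not modelled, and the occupancy values in the example are made up.

```python
def loofa_phase(voq_len, occupancy):
    """One matching phase. voq_len[i][j]: cells in VOQ (i, j); occupancy[j]: C_j.
    Returns a dict {input: output}."""
    n = len(voq_len)
    in_match, out_match = {}, {}
    while True:
        # Each unmatched input requests the least occupied unmatched output
        # among those it has cells for.
        requests = {}
        for i in range(n):
            if i in in_match:
                continue
            outs = [j for j in range(n) if j not in out_match and voq_len[i][j] > 0]
            if outs:
                j = min(outs, key=lambda o: occupancy[o])
                requests.setdefault(j, []).append(i)
        if not requests:
            return in_match          # no more requests: the matching is maximal
        # Each requested output grants one input (lowest index in this sketch).
        for j, inputs in requests.items():
            i = min(inputs)
            in_match[i], out_match[j] = j, i

voq_len = [
    [3, 5, 0, 0],
    [2, 0, 0, 4],
    [4, 5, 0, 0],
    [0, 0, 8, 2],
]
occupancy = [3, 0, 1, 2]
print(loofa_phase(voq_len, occupancy))
```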
130CIOQ routers with S = 2
 If $S = 2$
 it is difficult (but possible) to perfectly emulate an OQ router in terms of packet departures
 it is impossible to distinguish, by observing arrivals and departures, if the switching architecture is CIOQ or OQ
 delays are perfectly controlled
 easy to implement scheduling algorithms born for OQ (e.g. WFQ)
131CIOQ routers
 CIOQ are very promising architectures
 many degrees of freedom in design
 how to balance input/output buffers
 how the buffers interact
 e.g., by backpressure mechanisms
 Several currently designed architectures are supposed to be CIOQ
 The speedup S is becoming closer and closer to 1 in practical implementations of new switching architectures (CIOQ → IQ)
133Multicast traffic
 Misleading (but common) idea
 observe
 OQ can achieve 100% throughput under any admissible unicast and multicast traffic
 OQ can be perfectly emulated by CIOQ with $S = 2$
 then, with $S = 2$ it is possible to achieve 100% throughput for multicast traffic
134Multicast traffic
 Question
 what is the minimum speedup required to achieve 100% throughput?
 Unknown!
135Multicast traffic
 Possible implementations
 copy network before the switching fabric
 a multicast cell with f destinations is treated as f cells
 possible bandwidth inefficiency
 dedicated queue
 multicast packets are treated in some specific way
136Multicast traffic: optimal queueing
 MC-VOQ queueing
 best throughput performance
 avoids HOL blocking
 $2^N - 1$ queues for each input, one for each fanout set
 re-enqueuing process → out-of-sequence problem
 no re-enqueuing → some throughput degradation
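The queue count follows from the number of non-empty subsets of the N outputs. A short sketch that enumerates the fanout sets for a small switch (the function name is illustrative):

```python
from itertools import combinations

def fanout_sets(n_outputs):
    """All non-empty subsets of outputs, i.e. the possible multicast fanout sets."""
    outputs = range(n_outputs)
    return [set(c) for k in range(1, n_outputs + 1) for c in combinations(outputs, k)]

n = 4
per_input = len(fanout_sets(n))
print(per_input, n * per_input)  # queues per input, queues in the whole switch
```

For N = 4 this gives 15 queues per input and 60 in the whole switch; the count doubles with every added port, which is why this queueing structure is impractical for large N.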
137Multicast traffic: optimal scheduling
 The optimal scheduling for multicast traffic can be defined similarly to unicast traffic
 it is a sort of max flow algorithm on all $N(2^N - 1)$ queues
 Many heuristics can be envisaged to approximate it
138Summary
 3 main ingredients for IQ scheduling algorithms
 Weight computation
 Matching computation
 Contention resolution
139Summary
 Weight computation
 obtains the priority of each input queue
 the metric can be related to queue length, waiting time of the cell at the HOL, ...
 Contention resolution
 whenever the selection is among situations with equal weights
 can be round robin, or random
140Summary
 Matching computation
 computes the matching, trying to maximize its total weight
 can be based on
 an iterative search, like in iSLIP, iOCF, iLQF
 a matrix greedy approach, like in MUCS, WFA
 a reservation vector, like in RPA
 a learning approach, like in APSARA
141Summary
 Good IQ scheduling algorithms exist
 100% throughput
 short delay
 limited complexity
 Performance differences are significant only
close to saturation
142Summary
 Open questions concerning IQ schedulers
 QoS guarantees
 stability of networks of switches
 multicast traffic
143References
 Router functions and architectures
 Keshav S., Sharma R., "Issues and trends in router design", IEEE Communications Magazine, vol.36, n.5, May 1998, pp.144-151
 Bux W., Denzel W.E., Engbersen T., Herkersdorf A., Luijten R.P., "Technologies and building blocks for fast packet forwarding", IEEE Communications Magazine, Jan.2001, pp.70-77
 Newman P., Minshall G., Lyon T., Huston L., "IP switching and gigabit routers", IEEE Communications Magazine, Jan.1997, pp.64-69
 Wolf T., Turner J.S., "Design issues for high-performance active routers", IEEE Journal on Selected Areas in Communications, vol.19, n.3, Mar.2001, pp.404-409
 Scheduling in IQ switches
 Karol M., Hluchyj M., Morgan S., "Input versus output queueing on a space division switch", IEEE Transactions on Communications, vol.35, n.12, Dec.1987
 McKeown N., Anantharam V., Walrand J., "Achieving 100% throughput in an input-queued switch", IEEE INFOCOM'96, vol.1, San Francisco, CA, Mar.1996, pp.296-302
 McKeown N., "iSLIP: a scheduling algorithm for input-queued switches", IEEE Transactions on Networking, vol.7, n.2, Apr.1999, pp.188-201
 McKeown N., Mekkittikul A., "A practical scheduling algorithm to achieve 100% throughput in input-queued switches", IEEE INFOCOM'98, vol.2, New York, NY, 1998, pp.792-9
 Tamir Y., Chi H.C., "Symmetric crossbar arbiters for VLSI communication switches", IEEE Transactions on Parallel and Distributed Systems, vol.4, no.1, Jan.1993, pp.13-27
 Chen H., Lambert J., Pitsilledes A., "RCBB switch. A high performance switching network for B-ISDN", GLOBECOM 95
144References
 Scheduling in IQ switches
 Anderson T., Owicki S., Saxe J., Thacker C., "High speed switch scheduling for local area networks", ACM Transactions on Computer Systems, vol.11, n.4, Nov.1993
 LaMaire R.O., Serpanos D.N., "Two dimensional round-robin schedulers for packet switches with multiple input queues", IEEE/ACM Transactions on Networking, vol.2, n.5, Oct.1994, pp.471-482
 Chen H., Lambert J., Pitsilledes A., "RCBB switch. A high performance switching network for B-ISDN", IEEE GLOBECOM 95, 1995
 Duan H., Lockwood J.W., Kang S.M., Will J.D., "A high performance OC12/OC48 queue design prototype for input buffered ATM switches", IEEE INFOCOM'97, vol.1, Los Alamitos, CA, 1997, pp.208
 Partridge C., et al., "A 50Gb/s IP router", IEEE Transactions on Networking, vol.6, n.3, June 1998, pp.237-248
 Ajmone Marsan M., Bianco A., Leonardi E., Milia L., "RPA: a flexible scheduling algorithm for input buffered switches", IEEE Transactions on Communications, vol.47, n.12, Dec.1999, pp.1921-1933
 Ajmone Marsan M., Bianco A., Filippi E., Giaccone P., Leonardi E., Neri F., "On the behavior of input queueing switch architectures", European Transactions on Telecommunications, vol.10, n.2, Mar.1999, pp.111-124
 Christensen K.J., "Design and evaluation of a parallel-polled virtual output queued switch", IEEE ICC 2001, vol.1, 2001, pp.112-116
 Serpanos D.N., Antoniadis P.I., "FIRM: a class of distributed scheduling algorithms for high-speed ATM switches with multiple input queues", IEEE INFOCOM 2000, vol.2, 2000, pp.548-555
 Ying Jiang, Hamdi M., "A 2-stage matching scheduler for a VOQ packet switch architecture", IEEE ICC 2002, vol.4, 2002, pp.2105-2110
 Tassiulas L., "Linear complexity algorithms for maximum throughput in radio networks and input queued switches", IEEE INFOCOM'98, vol.2, New York, NY, 1998, pp.533-539
 Giaccone P., Prabhakar B., Shah D., "Towards simple, high-performance schedulers for high-aggregate bandwidth switches", IEEE INFOCOM'02, New York, Jun.2002
145References
 Packet scheduling in IQ switches
 Ajmone Marsan M., Bianco A., Giaccone P., Leonardi E., Neri F., "Packet scheduling in input-queued cell-based switches", IEEE INFOCOM'01, Anchorage, Alaska, Apr.2001 (extended version to appear in IEEE Trans. on Networking, about Oct.2002)
 Moon S.H., Sung D.K., "High-performance variable-length packet scheduling algorithm for IP traffic", IEEE GLOBECOM'01, Dec.2001
 Scheduling multicast traffic in IQ switches
 Hayes J.F., Breault R., Mehmet-Ali M.K., "Performance analysis of a multicast switch", IEEE Transactions on Communications, vol.39, n.4, Apr.1991, pp.581-587
 Kim C.K., Lee T.T., "Call scheduling algorithm in multicast switching systems", IEEE Transactions on Communications, vol.40, n.3, Mar.1992, pp.625-635
 McKeown N., Prabhakar B., "Scheduling multicast cells in an input-queued switch", INFOCOM'96, vol.1, San Francisco, CA, Mar.1996, pp.261-278
 Prabhakar B., McKeown N., Ahuja R., "Multicast scheduling for input-queued switches", IEEE Journal on Selected Areas in Communications, vol.15, n.5, Jun.1997, pp.855-866
 Chen W., Chang Y., Hwang W., "A high performance cell scheduling algorithm in broadband multicast switching systems", IEEE GLOBECOM'97, vol.1, New York, NY, 1997, pp.170-174
 Guo M., Chang R., "Multicast ATM switches: survey and performance evaluation", Computer Communication Review, vol.28, n.2, Apr.1998, pp.98-131
 Andrews M., Khanna S., Kumaran K., "Integrated scheduling of unicast and multicast traffic in an input-queued switch", IEEE INFOCOM'99, vol.3, New York, NY, 1999, pp.1144-1151
 Liu Z., Righter R., "Scheduling multicast input-queued switches", Journal of Scheduling, John Wiley & Sons, May 1999
146References
 Scheduling multicast traffic in IQ switches
 Nong G., Hamdi M., "On the provision of integrated QoS guarantees of unicast and multicast traffic in input-queued switches", IEEE GLOBECOM'99, vol.3, 1999
 Ajmone Marsan M., Bianco A., Giaccone P., Leonardi E., Neri F., "On the throughput of input-queued cell-based switches with multicast traffic", IEEE INFOCOM'01, Anchorage, Alaska, Apr.2001
 Ge Nong, Hamdi M., "Providing QoS guarantees for unicast/multicast traffic with fixed/variable-length packets in multiple input-queued switches", IEEE Symposium on Computers and Communications, 2001, pp.166-171
 Smiljanic A., "Flexible bandwidth allocation in high-capacity packet switches", IEEE/ACM Transactions on Networking, vol.10, n.2, Apr.2002, pp.287-293
 QoS support in IQ switches
 Tabatabaee V., Georgiadis L., Tassiulas L., "QoS provisioning and tracking fluid policies in input queueing switches", IEEE INFOCOM'00, New York, Mar.2000
 Chang C.S., Lee D.S., Jou Y.S., "Load balanced Birkhoff-von Neumann switches", 2001 IEEE Workshop on High Performance Switching and Routing, 2001, pp.276-280
 Hung A., Kesidis G., McKeown N., "ATM input-buffered switches with guaranteed-rate property", IEEE ISCC'98, Athens, Greece, July 1998, pp.331-335
 Advanced architectures derived from pure IQ
 Iyer S., McKeown N., "Making parallel packet switches practical", IEEE INFOCOM'01, Alaska, Mar.2001
 Chang C.S., Lee D.S., Jou Y.S., "Load balanced Birkhoff-von Neumann switches", 2001 IEEE Workshop on High Performance Switching and Routing, 2001, pp.276-280
 Sivaram R., Stunkel C.B., Panda D.K., "HIPIQS: a high-performance switch architecture using input queuing", IEEE Transactions on Parallel and Distributed Systems, vol.13, n.3, Mar.2002, pp.275-289