Routing algorithms presentation

Title: Routing algorithms

1
Routing algorithms

Yaakov (J) Stein Sept 2009
Chief Scientist
RAD Data Communications

2
Outline

Control Plane
principles of IP routing
longest prefix match
RIBs and FIBs
Forwarding plane
classification and lookup mechanisms
switching fabrics

Introduction

4
Routers

A router is a combination of hardware, software,
and memory
that is responsible for forwarding packets
towards their destinations
Routers generally work at ISO layer 3 (network
layer)
but can also function at layer 2.5 (for MPLS)
and may inspect higher layers, but only for
optimization
(QoS management, load balancing, etc.)
Note that Ethernet switches technically filter
rather than forward
In order to correctly fulfill their function
(i.e. to know where to forward)
routers usually run routing protocols
to exchange information between themselves
Ethernet switches do not need such protocols as
they learn how to filter
So the router performs 2 distinct algorithms
forwarding algorithm (forwarding component)
routing algorithm (control component)

5
What does a router do ?

Control plane (routing algorithm)
run routing protocols
identify interface and next hop L2 addresses
populate RIBs (if Link State, perform SPFs)
scan all RIBs, and produce FIB (entries map FEC
to NH)
Data plane (forwarding algorithm)
deframing (CRC/checksum/defragmentation/reassembly/demapping)
parsing (pulling values from appropriate fields;
simple: IPv4 DA,
complex: finding a URL or MIB variable)
FEC classification (add metadata, based on DA,
DA+ToS, MPLS, ...)
lookup / search
packet modification and replication
framing
traffic management and queuing
compression, encryption, etc.

6
IP networks

IP networks are made up of
hosts
middleboxes (e.g., firewalls, NATs, NAPTs,
Application Layer GWs)
routers (obsolete terminology: gateways)
It will be useful to differentiate between
core routers (connect to other routers)
edge routers (connect to hosts)
We will see shortly that it is more complex than
that
To understand how a router is different from
other network elements
we need to know the basic principles of the IP
protocol architecture
We will mainly deal with IPv4 unicast forwarding

7
The basics (1)

The first principle of IP is the end-to-end (E2E)
principle
All functionality should be implemented only with
the knowledge
and help of the application at the end points
The second principle is the hourglass model
IP (L3) is the common layer
below IP (L3) is not part of IP suite, above is
Thus
most functionality and state is in the hosts
middlebox functionality is severely limited
routers are limited to forwarding packets
without extensive packet manipulation (exception
- TTL)
The third principle is that forwarding is
connectionless
on a hop-by-hop basis

8
The basics (2)

The fourth principle is that unicast IP
forwarding is performed
based on a Destination Address (DA)
Addresses must usually be unique (end-to-end
principle)
Hosts usually have a single IP address, routers
have many addresses
It is the responsibility of a service provider
(SP) to allocate addresses
IP addresses are not arbitrary, like Ethernet MAC
addresses
The fifth principle is that IP addresses are
aggregated into subnetworks
All addresses in a subnetwork share a common
prefix
Subnetworks may be further aggregated
The sixth principle is that it is the
responsibility of the router
to forward towards the host's subnet
but it is not its responsibility to deliver the
packet on the subnet
the IP suite starts above L2
the subnet's L2 (e.g., Ethernet, PPP) delivers the
packet to the host

9
IP Routing types

Distance Vector (Bellman-Ford), e.g. RIP, RIPv2,
IGRP, EIGRP
send <addr, cost> to neighbors
routers maintain cost to all destinations
need to solve the count-to-infinity problem
Path Vector, e.g. BGP
send <addr, cost, path> to neighbors
similar to distance vector, but without the count-to-infinity
problem
like distance vector, has slow convergence
doesn't require consistent topology
can support hierarchical topology => exterior
protocol (EGP)
Link State, e.g. OSPF, IS-IS
send <neighbor-addr, cost> to all routers
determine entire flat network topology (SPF -
Dijkstra's algorithm)
fast convergence, guaranteed loopless => interior
routing protocol (IGP)
convergence time is the time taken until all
routers work consistently
before convergence is complete packets may be
misforwarded, and there may be loops

10

Control Plane
(Routing)

11
FIBs

Based on the 6 principles we can understand what
a router does
The router looks at the packet's DA
It deduces to which subnet the packet belongs
If the router can directly interface that subnet
it must use the appropriate L2 to send the
packet to the host
Otherwise it must retrieve the next hop (router)
that sends the packet towards the subnet
The next router does the same
If routing has converged there will be no loops
or black holes
but there may be during transients
The information needed by the router to properly
forward packets
is stored in the Forwarding Information Base
(FIB)
The FIB associates address prefixes with Next
Hops (NHs)
(and, to save an additional lookup, usually with
L2 addresses as well)
Do not confuse the FIB with a Routing Information
Base (RIB)

12
More on FIBs

Simple (and primitive) routers have a routing
table
modern large routers have several different
databases
The FIB is designed to be fast to search
we will talk about its data structure later
RIBs are designed to be fast to update
which is quite a different structure
There may be many RIBs, one for each routing
protocol running
and static routes may be entered into any of
them
There are sometimes other databases as well
for example link state routing protocols
require a LSDB
from which the RIB is built
The basic idea is that we build the FIB from the
RIBs

13
Router interfaces

Routers connect to hosts and to other routers via
interfaces
(from 1 to many thousands of interfaces per
router)
Routers are responsible for forwarding packets
arriving at an ingress interface
to an egress interface
Interfaces have layer 3 and above properties
and also contain layer 1 and 2 properties (ports)
Each interface is assigned a unique IP address
Interfaces are grouped into subnetworks
All interfaces on a subnetwork share the same
prefix

14
Prefixes and masks

Since 1993 (RFC 1519 - CIDR) subnets can have any
length prefix
There are two ways of specifying the prefix
length
slash notation, e.g., 192.168.16.0/20
note unspecified bits are set to zero
mask notation, e.g., 192.168.16.0 with mask
255.255.240.0
Note that 192.168.16.0/20 means all addresses
from 192.168.16.0 through 192.168.31.255
Note that it contains 192.168.16.0/21,
192.168.24.0/22, etc.
since they are in the range and have longer
prefixes (larger masks)
/32 are fully qualified IP addresses
0.0.0.0/0 matches every IP address
it is the default route
route taken when there is no matching entry in
FIB
the gateway of last resort
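
The two notations and the range claim above can be checked with a short sketch. It uses Python's standard `ipaddress` module; the variable names are only illustrative.

```python
import ipaddress

# The /20 prefix used on the slide.
net = ipaddress.ip_network("192.168.16.0/20")

print(net.netmask)        # mask notation of the same prefix
print(net[0], net[-1])    # first and last address covered by the prefix

# Longer prefixes that fall inside the /20.
for sub in ("192.168.16.0/21", "192.168.24.0/22"):
    print(sub, ipaddress.ip_network(sub).subnet_of(net))
```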

15
Prefix tables
slash         mask
A.B.C.D/32    255.255.255.255
A.B.C.D/31    255.255.255.254
A.B.C.D/30    255.255.255.252
A.B.C.D/29    255.255.255.248
A.B.C.D/28    255.255.255.240
A.B.C.D/27    255.255.255.224
A.B.C.D/26    255.255.255.192
A.B.C.D/25    255.255.255.128
A.B.C.0/24    255.255.255.0
A.B.C.0/23    255.255.254.0
A.B.C.0/22    255.255.252.0
A.B.C.0/21    255.255.248.0
A.B.C.0/20    255.255.240.0
A.B.C.0/19    255.255.224.0
A.B.C.0/18    255.255.192.0
A.B.0.0/17    255.255.128.0
A.B.0.0/16    255.255.0.0
A.B.0.0/15    255.254.0.0
A.B.0.0/14    255.252.0.0
A.B.0.0/13    255.248.0.0
A.B.0.0/12    255.240.0.0
A.B.0.0/11    255.224.0.0
A.B.0.0/10    255.192.0.0
A.B.0.0/9     255.128.0.0
A.0.0.0/8     255.0.0.0
A.0.0.0/7     254.0.0.0
A.0.0.0/6     252.0.0.0
A.0.0.0/5     248.0.0.0
A.0.0.0/4     240.0.0.0
A.0.0.0/3     224.0.0.0
A.0.0.0/2     192.0.0.0
A.0.0.0/1     128.0.0.0
0.0.0.0/0     0.0.0.0
Note: for /25, D = 0 or 128; for /26, D = 0, 64,
128, or 192; etc.
16
Special IP addresses

Some IP addresses are reserved for special
purposes
they are not assigned by IANA
and may require special treatment by router

prefix            range                            purpose
0.0.0.0/8         0.0.0.0 - 0.255.255.255          defaults
10.0.0.0/8        10.0.0.0 - 10.255.255.255        private addresses
127.0.0.0/8       127.0.0.0 - 127.255.255.255      loopback addresses
169.254.0.0/16    169.254.0.0 - 169.254.255.255    zeroconf
172.16.0.0/12     172.16.0.0 - 172.31.255.255      private addresses
192.0.2.0/24      192.0.2.0 - 192.0.2.255          documentation
192.88.99.0/24    192.88.99.0 - 192.88.99.255      IPv6-IPv4 relay
192.168.0.0/16    192.168.0.0 - 192.168.255.255    private addresses
198.18.0.0/15     198.18.0.0 - 198.19.255.255      device benchmark
224.0.0.0/4       224.0.0.0 - 239.255.255.255      multicast
240.0.0.0/4       240.0.0.0 - 255.255.255.255      reserved
17
Longest prefix match and FECs

To find the subnet, we need to look at the
packet's DA
and to find the best match for the DA that is
known
i.e., find the known FIB entry that matches the longest
prefix of the DA (LPM)
All packets that are forwarded in the same way
are grouped
into a Forwarding Equivalence Class (FEC)
In the simplest case, a FEC is simply a known IP
address prefix
i.e., the packet's subnet
In more complex cases the FEC may depend on additional fields
for example, ToS field, source address, etc.
Every packet has to be looked up in the FIB and
classified to a FEC
and this forwarding has to be done fast

18
Example

Assume the following FIB
and let's look up 192.0.2.131 (last byte
10000011)
It matches the first entry with prefix length 0
(everything matches)
It matches the second with length 24 (first three
bytes)
It matches the third entry with prefix length 25
(last byte 1xxxxxxx)
It does not match the fourth entry (11xxxxxx)
So the packet is forwarded to next hop C

prefix            next hop IP address
0.0.0.0/0         A (gateway of last resort)
192.0.2.0/24      B
192.0.2.128/25    C
192.0.2.192/26    D
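
The lookup in this example can be reproduced with a minimal sketch. It scans every entry and keeps the longest matching prefix, which is not how a production FIB is searched. The function name `longest_prefix_match` is hypothetical; the FIB contents are those of the example.

```python
import ipaddress

# FIB from the example: (prefix, next hop).
FIB = [
    ("0.0.0.0/0", "A"),
    ("192.0.2.0/24", "B"),
    ("192.0.2.128/25", "C"),
    ("192.0.2.192/26", "D"),
]

def longest_prefix_match(fib, address):
    """Return the next hop of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(address)
    best_len, best_nh = -1, None
    for prefix, next_hop in fib:
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_nh = net.prefixlen, next_hop
    return best_nh

print(longest_prefix_match(FIB, "192.0.2.131"))
```

An address such as 192.0.2.200 would also match the /26 entry and so be sent to D, while an address outside 192.0.2.0/24 falls through to the default route A.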
19
Packet processing time
How much time do we have to process a packet?
(disregarding L2 overhead, IPGs, etc.)

          100 Mbps    1 Gbps        10 Gbps
64 B      5 μsec      500 nanosec   50 nanosec
256 B     20 μsec     2 μsec        200 nanosec
1500 B    120 μsec    12 μsec       1.2 μsec

So the FIB data structure has to be optimized for
fast look-up
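
The per-packet time budget is simply the packet length in bits divided by the line rate. A small sketch of that arithmetic follows; the helper name `packet_time_ns` is an assumption, and L2 overhead and inter-packet gaps are ignored, as on the slide.

```python
def packet_time_ns(size_bytes, rate_bps):
    """Time to serialize one packet, in nanoseconds."""
    return size_bytes * 8 * 1e9 / rate_bps

for size in (64, 256, 1500):
    row = [packet_time_ns(size, rate) for rate in (100e6, 1e9, 10e9)]
    print(size, "B:", row)
```

The slide rounds the results: a 64-byte packet at 1 Gbps takes 512 ns, shown as roughly 500 nanosec.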
20
Policy and autonomy

The FIB information is based on
static routes
dynamic routes determined by routing protocols
gateway of last resort (default route) 0.0.0.0/0
In general we need to apply policy, since
there are conflicting sources of information
may not want to use, or even believe, information
received from peers
it is all a matter of autonomy
an Autonomous System can request service from
another
but can not force it to provide service

21
Autonomous systems

Routers are grouped into Autonomous Systems (ASs)
ASs may be grouped into domains
An AS looks to the outside world like a single entity
(it usually has an AS ID)
Routers in the same AS obey a common policy, and
trust each other
ASs are truly autonomous
one AS can request another to forward a packet,
but can not force it to
Inside ASs we run Interior Gateway Protocols
(IGPs)
e.g. OSPF, IS-IS
Between ASs we run Exterior Gateway Protocols
(EGPs)
e.g., BGP
A router that runs both is called an AS Border
Router (ASBR)

22
More of the story

Actually, it can get a lot more complicated
In general, a router will be running multiple
routing protocols
For example
one or more IGPs (RIP, OSPF, IS-IS) between
routers in the same AS
internal BGP (iBGP) between routers in the same
AS
(usually a full mesh, but when too complex we
can use route reflectors)
external BGP (eBGP) between ASBRs in different
ASs
How does a router know if a BGP session is iBGP
or eBGP ?
by the AS number !
IGP is used to find a path to another router
(including ASBR) in the same AS
eBGP is used by ASBRs to learn / distribute
routes to other ASs
iBGP is used for ASBR to inform core routers of
external routes

23
Simplest example

Stub ASs (my home router)
single homed to outside world
single internal subnet, so don't need IGP
single homed, so don't need to run BGP to ISP
don't need to have an AS number

[Figure: home router (.1) on subnet 192.168.0.0/24 with hosts .101 to .105, and a default route 0.0.0.0/0 to the outside world]
24
More complex example

Connecting to a server connected to another ISP
with dual homing
Routers 3 and 6 learn from eBGP how to reach
A.B.C.D
Policy determines that 3 will be used (see later)
Router 1 learns from iBGP session that A.B.C.D is
reachable via router 3
Router 1 learns from IGP that router 3 is
reachable via router 2
Router 2 knows how to directly reach router 3
because of IGP adjacency
Packet from a.b.c.d is forwarded via 1-2-3 to AS
2 and to A.B.C.D

[Figure: routers 1 to 7 in AS 1 and AS 2, with host a.b.c.d and server A.B.C.D; eBGP sessions between the ASs and a full mesh of iBGP sessions]
25
Even more complex example

Three ASs, with one possibly acting as a transit
domain

AS 1 would like to hand off the traffic to AS 2
AS 2 has no economic incentive to carry this traffic
But AS 2 gets the route from AS 3
What can AS 2 do to stop this ? (remember autonomy!)
What can AS 1 then do ?
26
Rules for customer ASs

Stub AS
Single-homed AS does not need to learn routes
from provider
It only has to send all traffic via its unique
exit point (0.0.0.0/0)
Provider gets routes from static or IGP or
private-AS eBGP
Multihomed Nontransit AS
AS advertises only its own routes to both SPs
AS filters out traffic for foreign routes that
reach it via static/default routing
eBGP is not needed, but recommended for route
propagation and filtering
Multihomed Transit AS
Uses eBGP to SPs and iBGP for transit traffic

27
BGP rules

There are
internal routes
external routes
customer routes
When eBGP learns a route it is repeated via iBGP
to all others in AS
thus all routers in AS learn it
When iBGP learns a route it is repeated only to
externals via eBGP
since internals also get it directly
When there is another ASBR that can reach the
same other AS
a second route is repeated by iBGP to the ASBR
The ASBR will then make the decision as to which
to use !

28
IGP rules

IGPs are used between routers in the same AS
so IGPs do not have sophisticated policy control
routers usually blindly accept all information
received
For proper operation (no routing loops)
all routers in AS must have the same IGP RIB
for link state protocols (OSPF, IS-IS)
there is a Link State Data Base (LSDB)
from which IGP RIBs can be constructed (will be
explained shortly)
Because all routers have the same LSDB
Although the forwarding is hop-by-hop
the result is the same as if there were
coordination
IGPs
do not scale to infinity
require complete knowledge
are not suitable for interaction with
non-trusted routers
since a single misconfiguration can be fatal

29
LSDBs

We said before that LS routing protocols have
another database
LSDB contains representation of every router and
link in the AS
implicitly holding the complete topology of the
network
In addition, the LSDB associates costs (metrics)
with every link
RIP - the metric is always hop count, no
non-trivial metric
OSPF - the metric is more general, for example
link length
These costs form a matrix
The topology is symmetric, but the costs need not
be

From \ to    Router A    Router B    Router C
Router A     -           M(A,B)      M(A,C)
Router B     M(B,A)      -           M(B,C)
Router C     M(C,A)      M(C,B)      -
30
LSDBs and IGP RIBs

Each router can independently calculate the
least-cost path
to every other router in the AS
A Shortest Path First (SPF) algorithm (e.g.,
Dijkstra's algorithm)
is used to compute a tree of the shortest paths
to all destinations
Each route in the SPF tree is an entire path
but for each router we can extract the next hop
and build the RIB for that router (each router
has its own RIB)
From the RIBs we build the FIB needed for
efficient forwarding of packets

31
Graph search algorithms

There are many algorithms for search on graphs
Breadth first
Bellman-Ford
Iterative deepening
Depth first (backtracking)
Depth limited
Best first
Greedy algorithms
Dijkstra's algorithm
Beam search
A*
B*
etc. etc.
We will discuss graphs, trees, etc. later on

32
Dijkstra's algorithm

Graph search algorithm first described by Edsger
Dijkstra in 1959
It assumes additive, non-negative, costs for each
link in graph
It is a best-first greedy algorithm
Think of a city street map
We want to get from an initial intersection to a
destination intersection with the least walking
Start at the initial intersection; its distance
is zero
Measure and label the distances to all adjacent
intersections (breadth first)
Choose the closest one (this is the greedy step)
Consider all the neighbors of the chosen
intersection
If the distance (sum of the distance to the
chosen intersection and the distance from chosen
intersection to neighbor) is the shortest known
way to get to that neighbor
then remember that distance (not a tree!)
Once you have considered all neighbors of the
intersection
mark the chosen intersection as visited (its
distance is now known)
Choose the unvisited intersection with shortest
distance
Continue until all intersections have been visited

33
Dijkstra's algorithm - formal

Let's call the node we are starting with an
initial node.
The COST of node X will be the distance from the
initial node,
i.e., the sum of distances of all links along the
path from the initial node to X
Initialization
Set the initial node's COST to zero, all others to
infinity
Set the initial node as current, all others as
unvisited
Main step
For all unvisited neighbors of the current node
Calculate their distances from the initial node
as
COST(neighbor) = COST(current) + DIST(current
to neighbor)
If this is less than what is presently marked,
overwrite the marking
When all neighbors have been considered, mark
the current node as visited
(once visited, this node's COST is final)
Recurse
Select the unvisited node with the smallest COST
as current node
Go to Main step
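
The steps above can be sketched as follows. This is a straightforward version with a linear scan for the closest unvisited node. The graph, its costs, and the `first_hop` bookkeeping (the next hop a RIB would record) are illustrative assumptions, not taken from the slides.

```python
import math

def dijkstra(graph, source):
    """graph: {node: {neighbor: link_cost}}; returns (cost, first_hop)."""
    cost = {node: math.inf for node in graph}
    first_hop = {node: None for node in graph}
    cost[source] = 0
    unvisited = set(graph)
    while unvisited:
        # Select the unvisited node with the smallest COST (linear scan).
        current = min(unvisited, key=lambda n: cost[n])
        if cost[current] == math.inf:
            break  # the remaining nodes are unreachable
        unvisited.remove(current)  # COST(current) is now final
        for neighbor, dist in graph[current].items():
            if neighbor in unvisited and cost[current] + dist < cost[neighbor]:
                cost[neighbor] = cost[current] + dist
                # Remember the first hop out of the source on this path.
                first_hop[neighbor] = (
                    neighbor if current == source else first_hop[current]
                )
    return cost, first_hop

# Hypothetical four-router topology with symmetric link costs.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
cost, first_hop = dijkstra(graph, "A")
print(cost)
print(first_hop)
```

In this topology the cheapest path from A to D runs A-B-C-D, so router A would install B as the next hop towards D.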

34
Implementation issues

If our graph has N nodes and L links
In a straightforward implementation of Dijkstra's
algorithm
finding the unvisited node with lowest cost takes
O(N)
this is done N times
so the total computation for this is O(N^2)
the computation of the distances to every node
takes O(L)
since each link is followed once (to the node it
lands on)
So the total complexity is O(N^2 + L) = O(N^2)
By using more sophisticated data structures
(Fibonacci heap)
this can be reduced to O(N log N + L), which is
about O(N log N) for sparse graphs
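
As a rough illustration of the effect of a better data structure, here is a sketch using Python's `heapq`. Note the substitution: `heapq` is a binary heap, not a Fibonacci heap, so this version runs in roughly O(L log N) rather than the O(N log N + L) quoted above. The example graph is hypothetical.

```python
import heapq
import math

def dijkstra_heap(graph, source):
    """graph: {node: {neighbor: link_cost}}; returns {node: cost}."""
    cost = {node: math.inf for node in graph}
    cost[source] = 0
    heap = [(0, source)]
    visited = set()
    while heap:
        c, node = heapq.heappop(heap)  # cheapest candidate so far
        if node in visited:
            continue  # stale entry left over from an earlier, worse cost
        visited.add(node)
        for neighbor, dist in graph[node].items():
            if c + dist < cost[neighbor]:
                cost[neighbor] = c + dist
                heapq.heappush(heap, (c + dist, neighbor))
    return cost

# Hypothetical four-router topology with symmetric link costs.
example = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
print(dijkstra_heap(example, "A"))
```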

35
RIBs to FIB

So we are finally ready to see how the FIB is
populated
First rejection rules are applied, for example
do not accept routes from ASs without agreements
do not accept routes that loop
(e.g., BGP advertisements with AS number in the
AS-PATH)
Then install FIB entries according to policy, for
example
first install Static routes
then routes from IGP RIB
choose eBGP before iBGP
(hot potato rule - get it out of my network - let someone else
handle)
if there are different routes from BGP
choose the route with highest local preference
if routes have equal local preference
choose the route with the shortest AS-PATH
if routes have equal AS-PATHs
choose the route with the lowest origin number
if still equal choose highest BGP peer address
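
The BGP tie-breaking part of this example policy can be expressed as a sort key. The sketch below covers only the four BGP steps listed (local preference, AS-PATH length, origin, peer address). The class and field names are hypothetical and it is not a full BGP decision process.

```python
from dataclasses import dataclass
from ipaddress import ip_address

@dataclass
class BgpRoute:
    local_pref: int
    as_path: tuple   # sequence of AS numbers
    origin: int      # lower is preferred
    peer: str        # address of the BGP peer that sent the route

def preference_key(route):
    # Smaller key = more preferred; fields where "highest wins" are negated.
    return (
        -route.local_pref,
        len(route.as_path),
        route.origin,
        -int(ip_address(route.peer)),
    )

def best_route(routes):
    return min(routes, key=preference_key)

r1 = BgpRoute(100, (65001, 65002), 0, "10.0.0.1")
r2 = BgpRoute(200, (65001, 65002, 65003), 0, "10.0.0.2")
r3 = BgpRoute(200, (65004,), 1, "10.0.0.3")
print(best_route([r1, r2, r3]))
```

Here r1 loses on local preference, and r3 beats r2 because its AS-PATH is shorter, even though its origin number is higher.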

36
Advertisement

Not everything received is accepted for inclusion
in the FIB
Not everything accepted for inclusion in the FIB
is further advertised
Never advertise information not accepted to FIB !

Data Plane (Forwarding)

38
Forwarding

Now that we have a FIB installed, let's forward
some packets !
There are three main steps to forwarding
classification
switching
misc. (scheduling, queuing, QoS, compression,
encryption, ...)
The operations
may be stateless or stateful
may change over time
may involve peeking into the packet (Deep Packet
Inspection)
may be recursive
Packet forwarding can be done by SW or HW or some
combination
We normally differentiate between
the fast path (simple forwarding)
the slow path (control protocol packets)

39
Lookup and data structures

Lookup comes in several varieties, such as
Exact match (e.g., MAC addresses, VLANs, IP
multicast)
Longest Prefix Matching (LPM) (needed for
searching FIBs)
Range matching (e.g., ports, firewalls)
In order to optimize lookup, we use appropriate
data structures
Wirth's law: Programs = algorithms +
data structures
We need to perform the following on our data
structures
Insert
Delete
Modify
Search
and to check the following metrics
Time complexity for each of the above
Size (spatial) efficiency
Scalability

40
LookUp Table

The simplest (and fastest) data structure is the
LookUp Table (LUT)
AKA indexed array, Location Addressable Memory
(LAM)
The incoming address is used as an index to
access the (NH) information
We can put in more info, e.g., L2 type and
address, to save further lookups
Example
Limitations
only for exact match, not LPM
only for a small number of possible
addresses, e.g. VLANs
We can use LUTs after other data structures that
return a key as an index

address    interface    NH             L2 type     L2 NH address
0          1            192.0.2.0      Ethernet    00-17-42-F7-14-14
1          2            192.0.2.16     PPP         -
2          3            192.0.2.128    SDH         VC ID
41
Hash tables

Hash tables enable handling a large number of
potential addresses
A hashing function is a function
from a large variable (large number of bits or
large number of bytes)
to a small variable (small number of bits or
bytes)
which is white (small changes in the input create
widely different outputs)
Hashing long addresses returns a short index
The problem is that the hashing function is not
1:1
so there will always be the probability of hash
clashes (collisions)
Solutions
perfect hashing - only when addresses are known
ahead of time
index in table returns list of all addresses
stored (chaining)
multiple hashing - linear probing, quadratic
probing
Hashing is very good for exact match (e.g. MAC
addresses)
but is not suitable for LPM

42
Hash implementation

From an efficiency point of view
hash tables are between LUTs and search tables
(to be discussed next)
To control collisions, we need a relatively large
table (birthday paradox !)
Example
99% probability of collision
when 3000 entries are put into a hash table of
size 1 million
Using multiple hashing
average computational load is O(1 +
keys/table-size)
A primitive hash function is the modulo hash
H(key) = addr mod table-size
table-size should be prime; it must not be a power
of 2 or close to one
A better hash function is (for appropriate
integer m and fraction f)
H(key) = Trunc( m Frac( f addr ) )
The quality of the hashing function depends on f
For Fibonacci hashing f = 1/φ
φ is the golden ratio ½ (√5 + 1) ≈ 1.618

Birthday paradox: 23 people - probability ½; 57 people - 99%
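
The birthday-paradox figures and the multiplicative hash can be checked numerically. This is a minimal sketch with hypothetical function names. The floating-point `fibonacci_hash` follows the Trunc/Frac formula literally, which is an assumption made for readability; fast implementations normally use integer arithmetic.

```python
import math

def collision_probability(keys, table_size):
    """Probability that at least two of `keys` uniformly random keys collide."""
    p_no_collision = 1.0
    for i in range(keys):
        p_no_collision *= (table_size - i) / table_size
    return 1.0 - p_no_collision

PHI = (math.sqrt(5) + 1) / 2  # golden ratio, about 1.618

def fibonacci_hash(addr, m):
    """H(key) = Trunc(m * Frac(f * addr)) with f = 1/phi."""
    return int(m * ((addr / PHI) % 1.0))

print(collision_probability(23, 365))
print(collision_probability(57, 365))
print(collision_probability(3000, 1_000_000))
print([fibonacci_hash(a, 16) for a in range(1, 9)])
```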
43
Search table

The most spatially efficient data structure is
the search table
It is similar to a LUT
but the addresses being looked up are not
indexes
rather, we need to sequentially search the table
for the address
We can reduce the search time by ordering the
table
Search tables are good for LPM !
For example
Order FIB from most specific (longest prefix) to
least (0.0.0.0/0)
Loop through FIB list until find the first match

row    prefix              interface    NH
0      192.168.16.0/24     1            10.10.1.1
1      192.168.196.0/20    2            10.16.54.2
2      192.168.0.0/17      2            10.16.1.16
3      0.0.0.0/0           3            10.1.1.0
44
Limitations of search tables

Search tables are not limited in size like LUTs
but this comes at a price
it is expensive to search for an address
it is expensive to modify the information for an
address
it is very expensive to insert or delete
addresses (copy!)
Search table FIBs can be rebuilt from RIBs each
time
For exact match it is possible to speed up by
binary search
order by address
guess position in middle of table range where key
should be
choose new range by comparison
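
The binary-search speed-up for exact match can be sketched with Python's standard `bisect` module. The table contents and the helper name `exact_match` are illustrative assumptions.

```python
import bisect

def exact_match(sorted_keys, values, key):
    """Binary search in a table ordered by address; None if the key is absent."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return values[i]
    return None

# Hypothetical table: keys kept in ascending order, values in the same order.
keys = [10, 20, 30, 40]
values = ["if1", "if2", "if3", "if4"]
print(exact_match(keys, values, 30))
print(exact_match(keys, values, 35))
```

Search drops to O(log N) comparisons, but inserting a new key still requires shifting the entries after it, which is the cost noted above.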

45
Linked lists

Search tables are hard to maintain
if we need to insert or delete an element we
need extensive moving
Linked lists are designed to simplify mechanics
of such updates
still need exhaustive search to find where to
insert
Linked lists can be singly or doubly linked
Skip lists increase efficiency by enabling
skipping over ranges

46
Linked list implementation

If we already know where to insert or what to
delete or whom to modify
then linked lists are very efficient
However, search is O(N) where N is the number of
entries
worst case N
average case N / 2
Properly constructed multi-level skip lists take
O(log N) on average
but are still O(N) in the worst case
But if we already need to use double pointers
then trees are better

47
Tree structures

A graph is a collection of nodes and edges
connecting them
In a directed graph the edges have direction
and so for every edge there is a father node and
a child node
Nodes without children are called leaves
A forest is a directed graph without loops (only
one path between 2 nodes)
A tree is a forest with a single root -- it thus
defines a partial order
(it is conventional to draw the tree upside down)
A binary tree has no more than 2 output edges per
node
Trees can be implemented using arrays, pointers
(like linked lists), heaps, etc.

48
Search trees

Search trees may store data
in their nodes
in leaves only
on edges
combinations of the above
For general search trees searching can be
breadth-first or depth-first
Breadth-first
start at the root
find all children of root and check for desired
data
if not found, find all children's children
recurse
Depth first
start at the root
find first child and check for desired data
if not found, find first child of first child
recurse until data found or leaf
if leaf backtrack to father and try the next
child

49
Binary trees

Binary Search Trees simplify storage and
manipulation
We call the children of a node L and R
Perfect binary trees have exactly 2 children for
each internal node
Thus there are exactly 2^H leaves, where H is the
height of the tree
Altogether there are (2^(H+1) - 1) nodes
To store keys in a binary search tree
place keys in all nodes
for every intermediate node N
key(L) < key(N)
key(R) > key(N)
To search for a key in a BST
start at root (set current node to root)
check if key is stored in current node
if not: if key < key(N) then set current to L,
else set current to R
In the worst case this takes only H comparisons
(for perfect trees O(log N))
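
The search procedure above can be sketched as a short loop. The `Node` class and the example tree (a perfect tree of height 2 holding keys 1 to 7) are illustrative assumptions.

```python
class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key = key
        self.value = value
        self.left = left    # child L: all keys smaller than this node's key
        self.right = right  # child R: all keys larger than this node's key

def bst_search(root, key):
    """Return the value stored under key, or None if the key is absent."""
    current = root
    while current is not None:
        if key == current.key:
            return current.value
        current = current.left if key < current.key else current.right
    return None

# Perfect tree of height 2: 2^2 = 4 leaves and 2^3 - 1 = 7 nodes.
root = Node(4, "d",
            Node(2, "b", Node(1, "a"), Node(3, "c")),
            Node(6, "f", Node(5, "e"), Node(7, "g")))
print(bst_search(root, 5))
```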

50
Balanced binary trees

Since the search complexity is proportional to
the BST's height H
we would like to use BSTs with minimal height
Balanced BSTs are not perfect BSTs, but as close
as possible to perfect
It is more complex to build a balanced BST, but
faster to search

51
Tries

From "retrieve", but usually pronounced "try"
A trie is an ordered tree with
subvalues on edges
values at leaves
Tries are related to prefix search
representations
Tries were originally developed for exact match
Tries can be binary or not
Tries are good for all lexicographical searches
O(log n)
but particularly efficient for LPM
Trie variants are the fastest known lookups -
O(log log n)
LC-trie used in modern Linux router
implementations
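
A minimal sketch of LPM on a plain binary trie, for IPv4 only. One bit of the address is consumed per level, and a next hop is stored at the node where a prefix ends. The class and function names and the FIB contents are assumptions, and none of the compression used by LC-tries is attempted.

```python
import ipaddress

class TrieNode:
    def __init__(self):
        self.children = [None, None]  # branch on the next address bit
        self.next_hop = None          # set if a prefix ends at this node

def insert(root, prefix, next_hop):
    net = ipaddress.ip_network(prefix)
    bits = int(net.network_address)
    node = root
    for i in range(net.prefixlen):
        bit = (bits >> (31 - i)) & 1
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.next_hop = next_hop

def lookup(root, address):
    """Walk down the trie, remembering the last next hop seen (the LPM)."""
    bits = int(ipaddress.ip_address(address))
    node, best = root, root.next_hop
    for i in range(32):
        node = node.children[(bits >> (31 - i)) & 1]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best

# Hypothetical FIB.
root = TrieNode()
for prefix, nh in [("0.0.0.0/0", "A"), ("10.0.0.0/8", "B"),
                   ("192.168.0.0/16", "C"), ("192.168.128.0/24", "D")]:
    insert(root, prefix, nh)
print(lookup(root, "192.168.128.7"))
```

The lookup needs at most 32 steps regardless of the number of prefixes, which is why trie depth, and the variants that reduce it, matter for forwarding speed.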

52
Example trie

Assume the following FIB
10.0.128.0/16
10.0.0.0/8
192.168.0.0/16
192.168.128.0/24
0.0.0.0/0
Note
in the plain trie,
all IP prefixes have the same length

53
Example compact trie

Same FIB
10.0.128.0/16
10.0.0.0/8
192.168.0.0/16
192.168.128.0/24
0.0.0.0/0
Now maximum length is 2
We could also save memory by exploiting common
suffixes

54
More trie variants

Patricia trie
Practical Algorithm to Retrieve Information Coded
in Alphanumeric
Binary trie which stores in nodes the number of
bits
to skip before next decision point
For LPM store prefixes in internal nodes
Bucket tries
Store multiple keys in leaves
Multibit tries (M-tries)
Fewer branching decisions than binary tries
Multibit strides (usually variable strides)
Level-compressed tries (LC-tries)
Replace perfect subtrees in binary trie with
single degree 2^k nodes
LPC trie: level and path compressed
And there are many more (Lulea, full-tree, ...)

55
Example LC-trie

[Figure: a binary trie and the corresponding LC-trie]

56
Content Addressable Memory

CAM (AKA associative memory)
Addressable by content, rather than by location
(LAM)
Special purpose hardware
Fastest possible lookup (essentially searches
entire table in one clock)
but limited in size
usually drives regular memory for additional
storage
Binary CAM (BCAM) stores 0 or 1 in each bit
Ternary CAM (TCAM) allows wildcards
Can be used for LPM
Can prioritize solution by
number of bits matched
order in table
CAM technology today
32 to 144 bit keys
128K to 512K memories
hundreds of millions searches per second
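
A software emulation can illustrate how a TCAM performs LPM through wildcards and table order. Each entry holds a value and a mask (mask bit 0 means "don't care"), and the first matching entry wins. Real hardware compares all entries in parallel, whereas this sketch loops over them. The 8-bit entries and the function name are assumptions.

```python
def tcam_lookup(entries, key):
    """entries: ordered list of (value, mask, result); first match wins."""
    for value, mask, result in entries:
        if (key & mask) == (value & mask):
            return result
    return None

# Hypothetical 8-bit table, ordered from longest to shortest prefix
# so that table order implements longest prefix match.
entries = [
    (0b11000000, 0b11000000, "D"),  # 11xxxxxx
    (0b10000000, 0b10000000, "C"),  # 1xxxxxxx
    (0b00000000, 0b00000000, "B"),  # xxxxxxxx (matches everything)
]
print(tcam_lookup(entries, 0b10000011))
```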

57
Example 3 bit BCAM
encode output to access LAM
58
Other uses of lookup

Deep Packet Inspection
URL lookup (often partial or with wildcards)
can use Trees and tries
XML information, patterns, etc.
multiple encapsulations
e.g. Ethernet in IP, MPLS over IP, etc.
there are special-purpose languages to describe
such cases
Firewalls, Access Control
source/destination IP addresses
source/destination ports 4-tuples
TCP/UDP ports in ranges

59
Switching

Once the packet has been classified we need to
properly forward it
The classifier result (the FEC) becomes a
metadata field
that controls the switch
A switch fabric is a combination of HW and SW
components
that enable moving the packet from input
interface to output interface
The metaphor is borrowed from woven material
At low packets per second (pps) switching can be
all software
At high speeds highly parallel hardware is needed

60
Blocking

The major design constraint in switches is
blocking
Blocking occurs when some resource is being used
and the present packet can not be processed
We differentiate three types of blocking
Output blocking (the egress port is in use)
Internal blocking (some internal resource of the
switch is in use)
Input (head of line) blocking (packet can not
enter the switch)
A switch which is designed such that there is
never internal blocking
is called a nonblocking switch
To alleviate blocking we can add buffers to store
the waiting packets
According to the type of blocking we wish to
avoid we have
output queues
internal buffers
input queues

61
Switch types

In the TDM days there were two switching
mechanism types
time domain switches (time slot interchange)
spatial domain switches
For packet networks there are no time slots
but time domain switching can still be used
shared memory switches (packets from all inputs
placed in single memory)
shared bus switches (all inputs and outputs share
a ring/hypercube, etc. )
In spatial switching
the packets destined for different output
interfaces
travel different internal paths
The simplest spatial switch is the crossbar
invented for analog telephone circuits
does not scale well - O(N^2) possible connections
Multilayer crossbars can scale better

62
Shared memory/bus

Shared memory switches are simple and cheap to
implement
The memory is the heart of the fabric
it determines the throughput and delay
The memory must run N times faster than the
ingress rates
Packets may be divided into frame buffers
Scalability can be achieved by parallelism (bit
or byte slicing)
The architecture breaks down at high speeds
Shared bus switches are similar in principle to
shared memory ones
The shared medium is the heart of the switch
and determines its characteristics
Scalability can be achieved by parallelism
(parallel buses)
The architecture breaks down at high speeds

63
Multistage crossbars

Crossbars are fast, but
are not nonblocking
do not scale
These problems are solved by multistage crossbars
The only blocking of a single stage crossbar is
output blocking
more than one packet needs to leave the egress
interface
Time-space switches solve this by sorting the
frames before switching
More complex architectures are time-space-time
switches
There are many multistage solutions to the
scaling problem

64
Clos network

The Clos network is more efficient than a single
N×N crossbar
We divide N into r groups of n: N = n r
It has 3 stages of crossbars
the first stage has r groups of n×n crossbars
the second stage has n groups of r×r crossbars
the third stage has r groups of n×n crossbars
The shuffle rule: Xth output of Yth crossbar
connects to Yth input of Xth crossbar
Instead of N^2 = n^2 r^2 connections
just r^2 n + 2 r n^2
For example
if r = n = √N
then instead of n^4 just 3 n^3
a reduction by 3/n = 3/√N
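
The connection counts above can be verified with a few lines of arithmetic. The function names and the choice n = r = 32 (so N = 1024) are illustrative.

```python
def crossbar_crosspoints(N):
    """Single N x N crossbar."""
    return N * N

def clos_crosspoints(n, r):
    """Three-stage Clos: r (n x n) + n (r x r) + r (n x n) crossbars."""
    return r * r * n + 2 * r * n * n

n = r = 32
N = n * r
single = crossbar_crosspoints(N)
clos = clos_crosspoints(n, r)
print(N, single, clos, clos / single)
```

For this choice the ratio is 3/32, in line with the 3/n reduction stated on the slide.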

65
Benes network

The Benes network contains only 2×2 crossbars
and is built recursively
For N = 2^n, the network has 2n-1 stages and 4 N
(log N - 1/2) connections
The shuffle rule is that 1st outputs go to the
1st block, 2nd to the 2nd

[Figure: recursive construction with two Benes(n-1) blocks]
66
Banyan network

The Banyan network is a binary tree of 2×2
crossbars
with exactly one path from each ingress
interface to each egress interface
At each stage the next bit of the identifier
determines the switching
Banyan switches are time and space efficient
O(log N) delay
½ N log N 2×2 crossbars
The main problem is internal blocking at any of
the crossbars
One way to relieve this is to speed up the
crossbars and to add buffers

67
Batcher's sort

To prevent internal blocking in a Banyan switch
we can first sort the frames according to egress
interface
When this is done with Batcher's parallel sorter
½ log N (log N + 1) stages of simple sorters
we get the Batcher-Banyan switch
