Routing Convergence Delay - PowerPoint PPT Presentation

Provided by: vmw
Transcript and Presenter's Notes

Title: Routing Convergence Delay


1
Routing Convergence Delay
  • Prof. Gao
  • ECE697J Spring 2005
  • Advanced Computer Networks

2
Project Proposal
  • Content
  • Motivation of project
  • Proposed approach: analytical, experimental,
    measurement, simulation, or survey
  • Team of at most 2 students
  • Please feel free to talk with me
  • Due date: Friday, March 11, 11pm
  • Send by email to lgao_at_ecs.umass.edu

3
Routing Convergence
  • Many network changes
  • Equipment failures, or new deployment
  • Router configuration changes
  • Planned maintenance on the network
  • Control plane adapts to changes
  • Detect the change
  • Propagate routing messages
  • Compute new routes
  • Update the forwarding tables

4
What could happen during routing convergence?
  • Inconsistent routing state
  • Asynchronous propagation of route changes
  • Distributed route computation and FIB update
  • Effect on data forwarding plane
  • Drop packets
  • Long packet delay
  • Forwarding loops
  • Out of order packets
  • Reduce routing convergence delay!
  • Worst-case convergence delay: infinite
  • Impact on packet forwarding
  • Key to supporting multi-service traffic, e.g., VoIP

5
Convergence Delay
  • Control plane convergence delay
  • Period of time to converged routing state
  • Forwarding plane convergence delay
  • Period of time to converged forwarding state

6
Intradomain routing protocol convergence delay
7
Reduce Intradomain Rerouting Delay
  • Replace intradomain routing with something else
  • e.g., MPLS Fast Failure Recovery
  • Figure out what's wrong with intradomain routing
    and fix it

8
Link-State Routing Protocol Convergence Delay
  • Detect local topology changes (e.g. link up/down)
  • Flood Link State Advertisements (LSAs)
  • SPF Calculation
  • each router calculates a single source shortest
    path tree
  • Update Forwarding Information Base (FIB)
  • each router uses the tree to build its FIB, which
    governs packet forwarding
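
As a rough illustration of the last two steps, the sketch below runs Dijkstra's algorithm from one router and records the first hop toward each destination, which is the information a FIB needs. The four-router topology, its link costs, and the names spf_next_hops and without_link are assumptions made for this example; they are not from the slides and do not reflect any particular OSPF implementation.

```python
import heapq

def spf_next_hops(graph, source):
    """Dijkstra from `source`; returns {destination: (cost, next_hop)}."""
    best = {}
    # Heap entries: (path cost, node, first hop taken when leaving source).
    heap = [(0, source, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in best:
            continue  # already settled with a cost that is no larger
        best[node] = (cost, first_hop)
        for neighbor, weight in graph[node].items():
            if neighbor not in best:
                hop = neighbor if node == source else first_hop
                heapq.heappush(heap, (cost + weight, neighbor, hop))
    del best[source]  # the FIB has no entry for the router itself
    return best

def without_link(graph, u, v):
    """Copy of the topology with the bidirectional link u-v removed."""
    pruned = {node: dict(nbrs) for node, nbrs in graph.items()}
    pruned[u].pop(v, None)
    pruned[v].pop(u, None)
    return pruned

# Hypothetical topology: router -> {neighbor: link cost}.
topology = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1, "D": 4},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"B": 4, "C": 1},
}

fib_before = spf_next_hops(topology, "A")
fib_after = spf_next_hops(without_link(topology, "A", "B"), "A")
print(fib_before)
print(fib_after)
```

In this toy topology every destination is reached through B before the failure and through C after it, so each FIB entry on router A has to be rewritten.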

9
Router Model
[Figure: router model. The route processor (CPU) runs the OSPF process (topology view, LSA processing, LSA flooding, SPF calculation, FIB update). Two interface cards, each with a FIB and a forwarding engine, are connected by the switching fabric.]
10
Detecting a Link is Dead
  • Periodic hello packets (hello_interval: default
    is 10 sec, at least 1 sec)
  • Timeout if not received (dead_interval: 40 sec)
  • Declare failure and flood the info to others
  • Small values lead to faster detection, but also
  • Higher bandwidth consumption for hellos
  • False detection during congestion interval
  • False detection if router CPU falls a little
    behind
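
The trade-off between the two timers can be made concrete with a small calculation. The sketch below is a simplified model, not router code: it assumes hellos are sent at exact multiples of hello_interval, arrive instantly, and restart the neighbor's dead timer on arrival. The function name detection_delay is made up for this example.

```python
def detection_delay(failure_time, hello_interval, dead_interval):
    """Seconds from link failure until the neighbor's dead timer expires.

    Simplifying assumptions: hellos leave at t = 0, hello_interval,
    2 * hello_interval, ...; a hello sent at the instant of failure still
    gets through; link delay is zero.
    """
    last_hello = (failure_time // hello_interval) * hello_interval
    return last_hello + dead_interval - failure_time

# Default timers from the slide: hello every 10 sec, dead interval 40 sec.
for failure_time in (20, 25, 29):
    print(failure_time, detection_delay(failure_time, 10, 40))
```

Under these assumptions the detection delay always lies between dead_interval minus hello_interval and dead_interval, so shrinking both timers is the only way to shorten it without lower-layer help.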

11
Knowing the Link is Dead: Interface Support
  • Smart interface hardware
  • Detects loss of connectivity at lower layer
  • Interrupts the router CPU about the failure
  • Common in Packet Over SONET technology
  • Detect in less than 100 msec
  • But
  • Some media don't support it (e.g., Ethernet, ATM)
  • So you often need hello messages anyway

12
LSA Propagation
  • A link state packet is generated at the point of
    detection then flooded, unmodified, through the
    network.
  • It should propagate at near the speed of light
    plus one store-and-forward delay per hop.
  • So in theory LSP propagation should make a
    negligible contribution to the re-route time.
  • Theory doesn't often resemble reality...
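
To see why the theoretical contribution is negligible, here is a back-of-the-envelope calculation. All inputs are assumptions chosen for illustration (a 4,000 km path, 10 hops, a 100-byte LSA, 1 Gbps links, and a signal speed in fiber of about 2 x 10^8 m/s); none of them come from the slides.

```python
SIGNAL_SPEED_M_PER_S = 2.0e8  # assumed: roughly two thirds of c, typical for fiber

def lsa_propagation_delay(distance_m, hops, lsa_bits, link_bps):
    """Propagation delay plus one store-and-forward delay per hop, in seconds."""
    propagation = distance_m / SIGNAL_SPEED_M_PER_S
    store_and_forward = hops * (lsa_bits / link_bps)
    return propagation + store_and_forward

delay = lsa_propagation_delay(
    distance_m=4_000_000,  # hypothetical 4,000 km end-to-end path
    hops=10,
    lsa_bits=100 * 8,      # hypothetical 100-byte LSA
    link_bps=1e9,
)
print(f"{delay * 1000:.3f} ms")
```

With these inputs the result is about 20 ms, almost all of it propagation delay, which is small next to timer-driven delays measured in seconds.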

13
LSA Propagation: Explanation
  • Pacing LSA propagation to combine LSAs
  • In some implementations, SPF calculation is done
    before LSAs are flooded
  • To prevent this, the spec might be amended to
    state explicitly that LSP flooding has higher
    priority than SPF calculation
  • SPF computation time becomes more important

14
Reducing the SPF Computational Overhead
  • Important if SPF and LSA flooding in series
  • Good system
  • Fast processor
  • High-speed memory
  • Good algorithms
  • Traditional approach computes from scratch
  • Improved from O(n^2) to O(n log n)
  • Incremental algorithms compute only the changes
  • O(log n) instead of O(n log n)
  • Pre-computation
  • Pre-compute effects of certain failure scenarios
  • E.g., all single-link or single-router failures

15
Updating the Forwarding Table
  • Forwarding table
  • Map destination prefix to outgoing link(s)
  • Copy of table on each interface card
  • Highly optimized for fast lookups
  • Updating the forwarding table
  • Computing the new forwarding table
  • Updating the copy on each line card
  • Important source of delay
  • Sprint end-to-end study: around 1 second
  • AT&T router-level study: 100 msec to 300 msec

16
Significance of Protocol Timers
  • Hello and dead intervals
  • Failure-detection delay vs. false diagnosis
  • Pacing the link-state flooding
  • Combining LSAs vs. longer convergence delay
  • Some routers wait till after re-running Dijkstra!
  • Delaying start of shortest-path computation
  • Reducing computations vs. convergence delay
  • Especially useful if failure affects multiple
    links

17
OSPF Task Delays (Cisco)
  • LSA Processing
  • 100-800 microseconds
  • LSA flooding
  • 30-40 milliseconds
  • pacing timer is the determining factor
  • SPF calculation
  • 1-40 milliseconds
  • O(n^2) behavior for full n x n mesh
  • FIB update time
  • 100-300 milliseconds
  • no dependence on the size of the topology
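
Adding up the ranges above gives a rough per-router budget once a failure has been detected. This is a minimal sketch that assumes the four tasks run strictly in series and exactly once, which the slides do not claim; detection time is excluded.

```python
# (low, high) in milliseconds, taken from the figures listed above.
TASK_DELAYS_MS = {
    "LSA processing": (0.1, 0.8),   # 100-800 microseconds
    "LSA flooding": (30.0, 40.0),
    "SPF calculation": (1.0, 40.0),
    "FIB update": (100.0, 300.0),
}

low_total = sum(low for low, _ in TASK_DELAYS_MS.values())
high_total = sum(high for _, high in TASK_DELAYS_MS.values())
fib_share = TASK_DELAYS_MS["FIB update"][1] / high_total

print(f"total: {low_total:.1f} to {high_total:.1f} ms")
print(f"FIB update share of the upper bound: {fib_share:.0%}")
```

Under that serial assumption the total is roughly 131 to 381 ms, and the FIB update accounts for most of it.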

18
Reduce the Effects of Convergence
  • Long convergence delay is bad
  • Transient problems with loss and delay
  • Disruptive for VoIP and online gaming
  • Solution 1: better implementation
  • Interfaces that detect failures automatically
  • Cranking down the values of the timers
  • Faster CPUs and path-computation algorithms
  • Avoid forwarding loops during convergence
  • Solution 2: network design and operation
  • Improve forwarding-plane convergence
  • Improve convergence during maintenance

19
Summary
  • Reduce intradomain routing convergence delay
  • Faster SPF computation
  • Set hello timers in milliseconds rather than seconds
  • Low-level support for detecting link up and down
  • Reduce impact of convergence delay
  • Protocol improvement to avoid forwarding loops
  • Improve network design

20
Project Ideas
  • Simulation study of data forwarding during IGP
    route convergence
  • Effective mechanisms to handle bad forwarding
    during convergence
  • Avoid forwarding loops
  • Stress test of OSPF or IS-IS on timer setting

21
Interdomain Routing Convergence Delay
22
Control Plane Convergence Time
  • BGP router detects the change
  • Propagate the update messages to neighbors
  • Announcement
  • Withdrawal
  • Until all routers reach a stable state and no
    route update is propagated

23
Link Failure Detection
  • Detect at lower layer
  • For some networks, can not rely on lower layer
  • Keepalive message timer: 1/3 of hold time
    (default hold timer: 90 sec)
  • If the hold timer expires before a keepalive
    message is received, the BGP session is down

24
Route Propagation
  • How often can a router send route updates?
  • Minimum Route Advertisement Interval (MRAI) timer
  • Batch processing updates
  • How many hops do route updates have to propagate?
  • Depends on topology
  • Depends on routing policy

25
MRAI Timers
  • Minimum Route Advertisement Interval timer
  • Minimum amount of time that must elapse between
    route updates
  • Applied to BGP announcement or withdrawal
  • Avoid router CPU overload: wait until the route
    is stable
  • Default MRAI value
  • eBGP session: 30 seconds
  • iBGP session: 5 seconds

26
Different Implementation of Timers
  • Prefix based rate limiting timer
  • Peer based rate limiting timer
  • Current Cisco implementation
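
The difference between the two timer styles is only the key the timer is stored under. The sketch below is an illustrative model, not vendor code: the class MraiLimiter, its methods, and the peer and prefix strings are all made up for the example, and it ignores jitter and any distinction between announcements and withdrawals.

```python
class MraiLimiter:
    """Delays an update until `interval` seconds after the previous one.

    per_prefix=True keeps one timer per (peer, prefix) pair;
    per_prefix=False keeps one timer per peer.
    """

    def __init__(self, interval, per_prefix):
        self.interval = interval
        self.per_prefix = per_prefix
        self.last_sent = {}  # timer key -> time the last update was sent

    def _key(self, peer, prefix):
        return (peer, prefix) if self.per_prefix else peer

    def send(self, now, peer, prefix):
        """Return the time at which an update queued at `now` actually goes out."""
        last = self.last_sent.get(self._key(peer, prefix))
        send_time = now if last is None else max(now, last + self.interval)
        self.last_sent[self._key(peer, prefix)] = send_time
        return send_time

peer_based = MraiLimiter(interval=30, per_prefix=False)
prefix_based = MraiLimiter(interval=30, per_prefix=True)
for limiter in (peer_based, prefix_based):
    first = limiter.send(0, "peer1", "10.0.0.0/8")
    second = limiter.send(1, "peer1", "192.0.2.0/24")
    print(limiter.per_prefix, first, second)
```

With a per-peer timer, an update for an unrelated prefix waits behind the earlier one; with a per-prefix timer it goes out immediately, at the cost of one timer per prefix per peer.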

27
Impact of Topology
  • Route convergence time
  • Assume BGP selects the shortest path as the best
    path
  • w = Minimum Route Advertisement Interval
  • From down to up, Tup = O(dw)
  • where d is the length of the shortest path in the
    network
  • From up to down, Tdown = O(Dw)
  • where D is the length of the longest loop-free
    path in the network
  • Failover convergence?
  • Implications
  • Good news spreads fast
  • Bad news spreads very slowly
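
To get a feel for the gap between the two bounds, the sketch below computes d and D for one router in a small graph and multiplies each by w. The topology is hypothetical, d and D are measured from a single router to the destination, and d * w and D * w are only the scale of the bounds because O() hides constant factors.

```python
from collections import deque

def shortest_path_len(adj, src, dst):
    """Hop count of the shortest path from src to dst (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return -1  # dst unreachable

def longest_simple_path_len(adj, src, dst, visited=frozenset()):
    """Hop count of the longest loop-free path from src to dst (exhaustive search)."""
    if src == dst:
        return 0
    visited = visited | {src}
    best = -1  # -1 means dst cannot be reached without revisiting a node
    for neighbor in adj[src]:
        if neighbor not in visited:
            rest = longest_simple_path_len(adj, neighbor, dst, visited)
            if rest >= 0:
                best = max(best, rest + 1)
    return best

# Hypothetical topology; node 0 originates the destination prefix.
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
w = 30  # seconds, the default eBGP MRAI value
d = shortest_path_len(graph, 3, 0)
D = longest_simple_path_len(graph, 3, 0)
print(f"d = {d}, D = {D}, d*w = {d * w} s, D*w = {D * w} s")
```

The exhaustive search is exponential in the number of nodes, so it is only suitable for toy graphs like this one.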

28
Impact of Routing Policy
  • Route convergence time
  • w = Minimum Route Advertisement Interval
  • From down to up, Tup = O(dw)
  • where d is the length of the shortest
    policy-conforming path
  • From up to down, Tdown = O(Dw)
  • where D is the length of the longest
    policy-conforming path
  • Failover convergence?

29
Failover Convergence
  • A link failure causes the route to change to
    another path
  • Can be longer than Tdown!

30
An Example of Failover
[Figure: failover example with AS0, AS1, AS2, and destination d. Message labels: W20, A10. Routing-table labels: 120 10, 10, 20, 210. Legend: packet, BGP update, BGP routing table.]
31
Worst-Case Analysis of Number of Messages
  • Assumptions
  • Fully meshed topology of n nodes
  • Export to all
  • Withdraw d from node 0
  • O((n-1)!) messages
  • O((n-1)!) distinct paths to reach d
  • (n-1) paths of length 1
  • (n-1)(n-2) paths of length 2
  • ...
  • (n-1)! paths of length n-1
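
The path counts can be checked by brute force for small n. The sketch below enumerates loop-free paths that end at node 0 in a full mesh, counting length in hops, so the longest path visits every node once and has n-1 hops. The function name paths_by_length is made up for this example.

```python
from itertools import permutations
from math import factorial

def paths_by_length(n):
    """Count loop-free paths to node 0 in a full mesh of n nodes, by hop count.

    A k-hop path is an ordered choice of k distinct nodes from 1..n-1,
    followed by node 0.
    """
    others = range(1, n)
    return {k: sum(1 for _ in permutations(others, k)) for k in range(1, n)}

for n in (3, 4, 5):
    counts = paths_by_length(n)
    print(n, counts, "longest:", counts[n - 1], "(n-1)! =", factorial(n - 1))
```

Each count equals (n-1)! / (n-1-k)!, which reproduces the sequence (n-1), (n-1)(n-2), and so on up to (n-1)!.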

32
Data Collection
  • Data collected: BGP routing messages
  • Time period: over the course of 9 months starting
    January 1996
  • Where: five of the major U.S. network exchange
    points
  • Tool: Unix-based route servers, Multi-threaded
    Routing Toolkit (MRTd)

33
Routing Updates Observed
  • For 45,000 prefixes and 1500 paths
  • 3 to 6 million updates per day
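
For scale, those daily totals can be converted into per-prefix and per-second rates. This is plain arithmetic on the two figures above and assumes updates are spread evenly over prefixes and over the day, which real BGP traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def update_rates(updates_per_day, prefixes):
    """Return (updates per prefix per day, updates per second)."""
    return updates_per_day / prefixes, updates_per_day / SECONDS_PER_DAY

low = update_rates(3_000_000, 45_000)
high = update_rates(6_000_000, 45_000)
print(f"per prefix per day: {low[0]:.1f} to {high[0]:.1f}")
print(f"per second: {low[1]:.1f} to {high[1]:.1f}")
```

That works out to roughly 67 to 133 updates per prefix per day, or about 35 to 69 updates per second on average.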

34
Transient Failures During Convergence
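[Figure: transient failure example with AS0, AS1, AS2, and destination d. Message labels: W20, A10. Routing-table labels: 120 10, 10, 20, 210. A transient failure is marked. Legend: packet, BGP update, BGP routing table.]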
35
Another Example of Transient Failure
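[Figure: another transient failure example with nodes 0, 1, 2, 3 and destination d. Message labels: w, A. Routing-table labels: 310, 320, 210 20, 2 0, 10, 120. Link types: peer-to-peer, provider-to-customer.]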
36
Failure Duration for Different Implementations of Timers
  • MRAI
  • Reset by announcements
  • MRWI
  • Reset by withdrawals

37
Prefix Based Rate Limiting Timers
  • MRAI enabled
  • Failure duration ? ?
  • ? link propagation delay router
    processing delay
  • MRAI and MRWI enabled
  • Failure duration ? ? MRWI ? MRWI

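[Figure: timeline from the viewpoint of nodes 0, 1, and 2 for path x and alternate path y. Routing-table labels: 01x 0y, 0y, 1x, 10y, 21x, 210y. Message labels: w, A. Timer events: start the timer, waiting, reset timer.]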
38
Peer Based Rate Limiting Timers
  • MRAI and MRWI
  • MRAI enabled

Failure duration ? (D??Nshort) MRAI
Router ? provides alternate path router v is
the first router detecting failure.
Nshort shortest path between ? and u
[Figure: timeline from the viewpoint of routers v and u and the router providing alternate path y, for path x. Message labels: w, A. Timer events: waiting, reset timer.]
39
What About Convergence Delay?
40
Prefix Based Rate Limiting Timers
  • MRAI enabled
  • Convergence delay ? Dvx MRAI
  • MRAI and MRWI enabled
  • Convergence delay ? ? MRWI Dvx MRAI

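[Figure: timeline from the viewpoint of nodes 0, v, and 2 for path x and alternate path y. Routing-table labels: 01x 0y, 0y, 1x, 10y, 21x, 210y. Message labels: w, A. Timer events: start the timer, waiting, reset timer.]
Dvx: length of the longest path between v and x that
goes through the failed link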
41
Peer Based Rate Limiting Timers
  • MRAI and MRWI
  • MRAI enabled

Convergence delay ? (D??DbestDvx) MRAI
Router ? provides alternate path router v is
the first router detecting failure.
Dshort shortest path length between ? and u
[Figure: timeline from the viewpoint of routers v and u and the router providing alternate path y, for path x. Message labels: w, A. Timer events: waiting, reset timer.]
Dvx: length of the longest path between v and x that
goes through the failed link
42
Minimize Convergence Delay
  • Make MRAI as small as possible
  • Number of messages processed
  • CPU load correlates with number of updates
  • http://www.pam2004.org/papers/155.pdf
  • BGP message pass-through time is small
  • http://www.pam2004.org/papers/170.pdf
  • Improving BGP
  • Find invalid paths
  • Remove invalid paths from best path selection
    process
  • Topology and routing policy design

43
Solve Problem at Higher Layer
  • Overlay routing
  • RON project at MIT
  • Skype for VoIP
  • Multipath routing for multihomed nodes
  • Smart routing service
  • Internap
  • RouteScience

44
Summary
  • Convergence delay can be as long as 30 minutes
    (15 minutes is a long time for VoIP)
  • BGP convergence delay depends on MRAI
  • Routing policy and topology impact
  • Transient failure during convergence
  • Implementation of MRAI

45
Project Ideas
  • Forwarding plane performance during BGP
    convergence
  • Sufficient condition for forwarding loops
  • Fast failover BGP
  • Timer implementation
  • Reduce transient failure duration