Part III: BGP Measurement - PowerPoint PPT Presentation

Provided by: zmorl
Transcript and Presenter's Notes


1
Part III BGP Measurement
2
BGP routing updates
  • Route updates at prefix level
  • No activity in steady state
  • Routing messages indicate changes, no refreshes

3
Internet routing instability
  • Large number of BGP updates
  • Failures
  • Policy changes
  • Redundant messages
  • Routing instability
  • Route keeps changing, e.g., routes keep going up
    and down

4
Implications
  • Router overhead
  • Transient delay and loss
  • Unreachable hosts
  • High loss rate
  • High jitter
  • Long delays
  • Significant packet reordering
  • Poor predictability of traffic flow

How do we know if the instability is due to
routing or network congestion?
5
Measure BGP stability
  • First work by Labovitz et al.
  • Methodology
  • Collect routing messages from five public
    exchange points
  • BGP information considered
  • AS path
  • Next hop: the next hop to reach a network
  • Two routes are the same if they have the same AS
    path and next hop
  • Other attributes (e.g., MED, communities) ignored
  • Focus on forwarding path stability
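
The route-equality rule above can be sketched in a few lines of Python. This is only an illustration of the comparison described on the slide; the `Route` type, its field names, and the sample values are assumptions made for the example, not part of the original study.

```python
from typing import NamedTuple, Tuple


class Route(NamedTuple):
    """Hypothetical route record; field names are illustrative."""
    as_path: Tuple[int, ...]
    next_hop: str
    med: int = 0
    communities: Tuple[str, ...] = ()


def same_route(a: Route, b: Route) -> bool:
    """Two routes are the same if AS path and next hop match.

    MED, communities, and any other attributes are deliberately ignored,
    so only changes that can affect the forwarding path count.
    """
    return a.as_path == b.as_path and a.next_hop == b.next_hop


r1 = Route((286, 209, 2914), "192.0.2.1", med=10)
r2 = Route((286, 209, 2914), "192.0.2.1", med=50, communities=("65000:1",))
r3 = Route((286, 209, 1), "192.0.2.1")
```

Here `r1` and `r2` compare as the same route even though their MED and communities differ, while `r3` differs because its AS path changed.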

6
Measurement methodology
7
BGP information exchange
  • Announcements: a router has either
  • Learned of a new route, or
  • Made a policy decision that it prefers a new
    route
  • Withdrawals: a router concludes that a network is
    no longer reachable
  • Explicit: associated with a withdrawal message
  • Implicit (in effect an announcement): when a
    route is replaced as a result of an announcement
    message
  • In steady state, BGP updates should be only the
    result of infrequent policy changes
  • BGP is stateful and requires no refreshes
  • Update rate: an indication of network stability

8
Example of delayed convergence
Assuming node 1 has a route to a destination, and
it withdraws the route.

Path selected at each node, by stage:

Stage   Node 2   Node 3   Node 4
0       1        1        1
1       41       41       31
4       431      241      --
9       --       --       --

Messages (W = withdrawal, A(path) = announcement of that path):

Stage   Msg processed     Msgs queued
0                         1->2,3,4: W
1       1->2,3,4: W       2->3,4: A(241), 3->2,4: A(341), 4->2,3: A(431)
2       2->3,4: A(241)    3->2,4: A(341), 4->2,3: A(431)
3       3->2,4: A(341)    4->2,3: A(431), 4->2,3: W
4       4->2,3: A(431)
        MinRouteAdver timer expires:
                          4->2,3: W, 3->2,4: A(3241), 2->3,4: A(2431)
(omitted)
9       3->2,4: W

Note: In response to a withdrawal from 1, node 3
sends out 3 messages: 3->2,4: A(341),
3->2,4: A(3241), 3->2,4: W
9
Types of inter-domain routing updates
  • Forwarding instability
  • may reflect topology changes
  • Policy fluctuations (routing instability)
  • may reflect changes in routing policy information
  • Pathological updates
  • redundant updates that are neither routing nor
    forwarding instability
  • Instability
  • forwarding instability and policy fluctuation ->
    change of forwarding path

10
Routing: successive events (instability)
  • WADiff
  • W: a route is explicitly withdrawn as it becomes
    unreachable
  • A: it is later replaced with an alternative route
  • Forwarding instability
  • AADiff
  • A: a route is implicitly withdrawn
  • A: it is then replaced by an alternative route as
    the original route becomes unavailable or a new
    preferred route becomes available
  • Forwarding instability

11
Routing: successive events (pathological
instability)
  • WADup
  • W: a route is explicitly withdrawn
  • A: it is then reannounced later
  • Forwarding instability or pathological behavior
  • AADup
  • A: a route is implicitly withdrawn
  • A: it is then replaced with a duplicate of the
    original route
  • Pathological behavior or policy fluctuation
  • WWDup
  • The repeated transmission of BGP withdrawals for
    a prefix that is currently unreachable
    (pathological behavior)
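
The five categories (WADiff, AADiff, WADup, AADup, WWDup) can be read as labels on pairs of successive updates for one prefix from one peer. The following Python sketch labels a stream of updates that way. It is a hedged illustration, not the classifier used in the original measurement: the function name, the `("A", route)` / `("W", None)` input encoding, and the choice to leave a first announcement or an ordinary withdrawal unlabeled are all assumptions.

```python
def classify_updates(updates):
    """Label each update for one (peer, prefix) pair.

    updates: iterable of ("A", route) or ("W", None), in arrival order.
    A route is any comparable value, e.g. (as_path, next_hop).
    Returns a list with one label (or None) per update.
    """
    labels = []
    last_type = None    # type of the previous update: "A", "W", or None
    last_route = None   # most recently announced route, kept across a W
    for kind, route in updates:
        label = None
        if kind == "W":
            if last_type == "W":
                label = "WWDup"          # repeated withdrawal
        elif kind == "A":
            if last_type is not None and last_route is not None:
                prefix = "WA" if last_type == "W" else "AA"
                suffix = "Dup" if route == last_route else "Diff"
                label = prefix + suffix
            last_route = route
        else:
            raise ValueError(f"unknown update type: {kind!r}")
        labels.append(label)
        last_type = kind
    return labels


# Hypothetical routes: (AS path, next hop)
ra = ((286, 209, 1), "192.0.2.1")
rb = ((286, 209, 2914), "192.0.2.1")
stream = [("A", ra), ("A", ra), ("A", rb), ("W", None),
          ("W", None), ("A", rb), ("W", None), ("A", ra)]
labels = classify_updates(stream)
# labels == [None, "AADup", "AADiff", None, "WWDup", "WADup", None, "WADiff"]
```

The sketch compares whole routes for equality; to follow the methodology's definition of "same route", the route value should contain only the AS path and next hop.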

12
Measurement findings: overview
  • Year 2000
  • BGP updates more than one order of magnitude
    larger than expected
  • Routing information dominated by pathological
    updates
  • Implementation problems
  • BGP self-synchronization
  • Unconstrained routing policies

13
Routing problem findings
  • Implementation problems
  • Redundant updates
  • Routers do not maintain the history of the
    announcements sent to neighbors
  • Self-synchronization
  • BGP routers exchange information simultaneously
  • may lead to periodic link/router failures
  • Unconstrained routing policies may lead to
    persistent route oscillations

14
Instability measurement
  • Instability and redundant updates exhibit strong
    correlation with load
  • (periods of 30 seconds, 24 hours, and seven days)
  • Instability usually exhibits high frequency
  • Pathological updates exhibit both high and low
    frequencies

15
Non-localized instability
  • No single AS dominates instability statistics
  • No correlation between size of AS and its impact
    on instability statistics
  • There is no small set of paths that dominate
    instability statistics

16
Measurement conclusions
  • Routing in the Internet exhibits many undesirable
    behaviors
  • Instability over a wide range of time scales
  • Asymmetric routes
  • Network outages
  • Problem seems to worsen
  • Many problems are due to software bugs or
    inefficient router architectures

17
Lessons
  • Even after decades of experience routing in the
    Internet is not a solved problem
  • This attests to the difficulty and complexity of
    building distributed algorithms in the Internet,
    i.e., in a heterogeneous environment with
    products from various vendors
  • Simple protocols have a better chance of being
  • Understood
  • Implemented right

18
Better understanding of BGP dynamics
  • Difficulties
  • Multiple administrative domains
  • Unknown information (policies, topologies)
  • Unknown operational practices
  • Ambiguous protocol specs

Proposal: a controlled active measurement
infrastructure for continuous BGP monitoring,
BGP Beacons.
19
What is a BGP Beacon?
  • An unused, globally visible prefix with known
    Announce/Withdrawal schedule
  • For long-term, public use

20
Who will benefit from BGP Beacon?
  • Researchers studying BGP dynamics
  • To calibrate and interpret BGP updates
  • To study convergence behavior
  • To analyze routing and data plane interaction
  • Network operators
  • Serve to debug reachability problems
  • Test effects of configuration changes
  • E.g., flap damping setting

21
Related work
  • Differences from Labovitz's BGP fault-injector
  • Long-term, publicly documented
  • Varying advertisement schedule
  • Beacon sequence number (AGG field)
  • Enabler for much research in routing dynamics
  • RIPE RIS Beacons
  • Set up at 9 exchange points

22
Active measurement infrastructure
Many observation points:
  1. Oregon RouteViews
  2. RIPE
  3. AT&T
  4. Verio
  5. MIT
  6. Berkeley
BGP Beacon 1 (198.133.206.0/24) sends route updates
into the Internet.
23
Deployed PSG Beacons
24
Deployed PSG Beacons
  • B1, 2, 3, 5
  • Announced and withdrawn with a fixed period
  • (2 hours) between updates
  • 1st daily ANN: 3:00 AM GMT
  • 1st daily WD: 1:00 AM GMT
  • B4: varying period
  • B5: fail-over experiments
  • Software available at http://www.psg.com/zmao

25
Beacon 5 schedule
Live host behind the beacon for data analysis
Study fail-over Behavior for multi-homed
customers
26
Beacon terminology
  • Output signal
  • 5:00:10 A1
  • 5:00:40 W
  • 5:01:10 A2
  • Signal length: number of updates in the output
    signal (3 updates)
  • Signal duration: time between the first and last
    update in the signal (5:00:10 to 5:01:10,
    60 seconds)
  • Inter-arrival time: time between consecutive
    updates
  • Input signal
  • Beacon-injected change
  • 3:00:00 GMT Announce (A0)
  • 5:00:00 GMT Withdrawal (W)

[Figure labels: Internet; RouteViews, AT&T; beacon prefix 198.133.206.0/24]
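
The three output-signal metrics defined on this slide are simple to compute once the timestamps are parsed. The sketch below assumes the slide's timestamps read as 5:00:10, 5:00:40, and 5:01:10 (which matches the stated 60-second duration); the function names and the `(time, update)` tuple format are illustrative choices, not part of the beacon software.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' timestamp to seconds since midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def signal_metrics(output_signal):
    """Compute length, duration, and inter-arrival times of one output signal.

    output_signal: non-empty list of (timestamp, update) tuples in arrival order.
    """
    times = [to_seconds(t) for t, _ in output_signal]
    length = len(times)                       # number of updates
    duration = times[-1] - times[0]           # first to last update, seconds
    inter_arrival = [b - a for a, b in zip(times, times[1:])]
    return length, duration, inter_arrival


# Output signal from the slide, observed after the 5:00:00 GMT withdrawal.
signal = [("5:00:10", "A1"), ("5:00:40", "W"), ("5:01:10", "A2")]
length, duration, inter_arrival = signal_metrics(signal)
# length == 3, duration == 60, inter_arrival == [30, 30]
```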

27
Process Beacon data
  • Identify output signals, ignore external events
  • Data cleaning
  • Anchor prefix as reference
  • Same origin AS as beacon prefix
  • Statically nailed down
  • Minimize interference between consecutive input
    signals
  • Beacon period is set to 2 hours
  • Time stamp and sequence number
  • Attach additional information in the BGP updates
  • Make use of a transitive attribute: the Aggregator
    field

28
Beacon data cleaning process
  • Goal
  • Clearly identify updates associated with injected
    routing change
  • Discard beacon events influenced by external
    routing changes

29
Cumulative Beacon statistics: significant noise
  • Current observation points
  • 111 peers: RIPE, RouteViews, Berkeley, MIT,
    MIT-RON nodes, AT&T Research, AT&T, AMS-IXP, Verio

Avg expansion: 2 × 0.2 + 1 × 0.8 = 1.2
30
Cumulative Beacon statistics: significant noise
  • Example: response to ANN-beacon at peer p
  • R1: AS path 286 209 1 3130 3927
  • R2: AS path 286 209 2914 3130 3927
  • 100 events: 20 with R1 then R2, 80 with R2 only
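
As a worked check of the example, the average expansion can be computed as the mean number of updates seen per beacon event. This reading is an interpretation, not spelled out on the slide: it assumes 20 events produced two updates (R1, then R2) and 80 produced a single update (R2), which agrees with the "Avg expansion" figures shown with these statistics. The function and variable names are hypothetical.

```python
def average_expansion(event_counts):
    """Mean number of updates per beacon event.

    event_counts: dict mapping an observed update sequence (a tuple of
    route labels) to the number of beacon events that produced it.
    """
    total_events = sum(event_counts.values())
    total_updates = sum(len(seq) * n for seq, n in event_counts.items())
    return total_updates / total_events


# 20 events show R1 followed by R2; 80 events show R2 only.
observed = {("R1", "R2"): 20, ("R2",): 80}
expansion = average_expansion(observed)
# 2 * 0.2 + 1 * 0.8 = 1.2 updates per injected change
```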

31
Cisco vs. Juniper: update rate-limiting
Known last-hop Cisco and Juniper routers from
the same AS and location
Average signal length: average number of
updates observed for a single beacon-injected
change
32
Cisco-like last-hop routers
Linear increase in signal duration with respect to
signal length
Slope: 30 seconds
Due to Cisco's default rate-limiting setting
33
Juniper-like last-hop routers
Signal duration relatively stable with respect to
increase in signal length
Shorter signal duration compared to
Cisco-like last-hops
34
Route flap damping
  • A mechanism to punish unstable routes by
    suppressing them
  • Reduce router processing load due to instability
  • Prevent sustained routing oscillations
  • Do not sacrifice convergence times for
    well-behaved routes

There is a conjecture that a single announcement can
cause route suppression.
35
RFC 2439: Route flap damping
  • Scope
  • Inbound external routes
  • Per neighbor, per destination
  • Penalty
  • Flap: a route change
  • Increases for each flap
  • Decays exponentially
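
The penalty mechanism can be sketched as a small state machine: each flap adds a fixed penalty, the penalty decays exponentially with a half-life, and the route is suppressed above one threshold and reused below another. The code below is a simplified illustration, not an implementation of RFC 2439 or of any vendor's defaults. The class name and all numeric parameters (1000 per flap, 900-second half-life, suppress at 2000, reuse at 750) are assumptions chosen for the example.

```python
class DampingState:
    """Damping state for one (neighbor, destination) pair (illustrative)."""

    def __init__(self, penalty_per_flap=1000.0, half_life=900.0,
                 suppress=2000.0, reuse=750.0):
        # All parameter values are assumptions, not taken from the slides.
        self.penalty_per_flap = penalty_per_flap
        self.half_life = half_life      # seconds for the penalty to halve
        self.suppress = suppress        # suppress the route above this
        self.reuse = reuse              # release the route below this
        self.penalty = 0.0
        self.last_time = 0.0
        self.suppressed = False

    def _decay(self, now):
        """Apply exponential decay for the time elapsed since last update."""
        elapsed = now - self.last_time
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_time = now
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False

    def flap(self, now):
        """Record a route change at time `now` (seconds); return suppressed?"""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty > self.suppress:
            self.suppressed = True
        return self.suppressed

    def is_suppressed(self, now):
        self._decay(now)
        return self.suppressed
```

With these assumed parameters, three flaps one minute apart push the penalty past the suppress threshold, and the route stays suppressed until roughly two half-lives later, when the penalty has decayed below the reuse threshold. Since damping is scoped per neighbor and per destination, a router would keep one such state object per (neighbor, prefix) pair.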

36
Route flap damping analysis
Strong evidence for withdrawal- and
announcement-triggered suppression.
37
Distinguish between announcement and withdrawal
  • Summary
  • WD-triggered suppression is more likely than
    ANN-triggered suppression
  • Cisco is overall more likely to trigger suppression
    than Juniper (AAAW pattern)
  • Juniper is more aggressive for the AWAW pattern

38
Convergence analysis
  • Summary
  • Withdrawals converge more slowly than announcements
  • Most beacon events converge within 3 minutes

39
Output signal duration
40
Beacon 1's upstream change
Single-homed (AS2914)
Multi-homed (AS1239, 2914)
Multi-homed (AS1,2914)
41
Beacon for identifying router behavior
Beacon 2 seen from RouteView data
42
Inter-arrival time analysis
43
Inter-arrival time modeling
  • Geometric distribution (body)
  • Update rate-limiting behavior every 30 sec
  • Prob(missing update train) independent of how
    many already missed
  • Mass at 1
  • Discretization of timestamps for times < 1
  • Shifted exponential distribution (tail)
  • Most likely due to route flap damping

44
Motivation
A backbone network is vulnerable to routing
changes that occur in other domains.
[Figure labels: source, destination; AS1, AS2, AS3, AS4; A, B, C, D]
45
Goal
  • Identify important routing anomalies
  • Lost reachability
  • Persistent flapping
  • Large traffic shifts
  • Contributions
  • Build a tool to identify a small number of
    important routing disruptions from a large volume
    of raw BGP updates in real time.
  • Use the tool to characterize routing disruptions
    in an operational network

46
Capturing Routing Changes
A large operational network (8/16/2004 to
10/10/2004)
[Figure labels: BGP Monitor, CPE, eBGP updates, iBGP best routes]
47
Challenges
  • Large volume of BGP updates
  • Millions daily, very bursty
  • Too much for an operator to manage
  • Different from root-cause analysis
  • Identify changes and their effects
  • Focus on actionable events rather than diagnosis
  • Diagnose causes in/near the AS

48
System Architecture
From millions of updates to a few dozen reports
49
Grouping BGP Updates into Events
  • Challenge: a single routing change
  • leads to multiple update messages
  • affects routing decisions at multiple routers
  • Approach
  • Group together all updates for a prefix with
    inter-arrival < 70 seconds
  • Flag prefixes with changes lasting > 10 minutes

[Figure: BGP Updates -> BGP Update Grouping -> Events, Persistent Flapping Prefixes]
50
Grouping Thresholds
  • Based on our understanding of BGP and data
    analysis
  • Event timeout: 70 seconds
  • 2 × MRAI timer + 10 seconds
  • 98% of inter-arrival times < 70 seconds
  • Convergence timeout: 10 minutes
  • BGP usually converges within a few minutes
  • 99.9% of events < 10 minutes
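
A minimal sketch of the grouping step, using the 70-second event timeout and the 10-minute convergence timeout from the slides. It is not the tool's actual implementation: the function names, the `(timestamp, prefix)` input format, and the choice to flag an event by its total duration are assumptions made for illustration.

```python
from collections import defaultdict

EVENT_TIMEOUT = 70          # seconds; updates closer than this share an event
CONVERGENCE_TIMEOUT = 600   # seconds; longer events suggest persistent flapping


def group_updates(updates):
    """Group updates into per-prefix events.

    updates: iterable of (timestamp_seconds, prefix) tuples.
    Returns a list of (prefix, start, end, update_count) tuples.
    """
    by_prefix = defaultdict(list)
    for ts, prefix in sorted(updates):
        by_prefix[prefix].append(ts)

    events = []
    for prefix, times in by_prefix.items():
        start = prev = times[0]
        count = 1
        for ts in times[1:]:
            if ts - prev < EVENT_TIMEOUT:
                count += 1                       # same event continues
            else:
                events.append((prefix, start, prev, count))
                start, count = ts, 1             # gap too long: new event
            prev = ts
        events.append((prefix, start, prev, count))
    return events


def persistent_flapping(events):
    """Prefixes with at least one event lasting beyond the convergence timeout."""
    return sorted({prefix for prefix, start, end, _ in events
                   if end - start > CONVERGENCE_TIMEOUT})


# p1: a burst of three updates, then an unrelated update much later.
# p2: one update every 60 seconds for 11 minutes (never quiet for 70 s).
updates = [(0, "p1"), (30, "p1"), (60, "p1"), (200, "p1")]
updates += [(i * 60, "p2") for i in range(12)]
events = group_updates(updates)
flapping = persistent_flapping(events)
# events == [("p1", 0, 60, 3), ("p1", 200, 200, 1), ("p2", 0, 660, 12)]
# flapping == ["p2"]
```

An online version would process updates as they arrive and close an event once 70 seconds pass without a new update for that prefix. The batch form here is only meant to make the thresholds concrete.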

51
Persistent Flapping Prefixes
A surprising finding: 15.2% of updates were
caused by persistent-flapping prefixes even
though flap damping is enabled.
  • Types of persistent flapping
  • Conservative damping parameters (78.6%)
  • Protocol oscillations due to MED (18.3%)
  • Unstable interfaces or BGP sessions (3.0%)

52
Example: Unstable eBGP Session
[Figure labels: Peer, ISP, p, Customer]
  • Flap damping parameters are session-based
  • Damping is not implemented for iBGP sessions

53
Event Classification
  • Challenge: major concerns in network management
  • Changes in reachability
  • Heavy load of routing messages on the routers
  • Changes in the flow of traffic through the network

Solution: classify events by the severity of their
impact
[Figure: Events -> Event Classification -> Typed Events, e.g., Loss/Gain of Reachability]
54
Event Category: No Disruption
No Disruption: no border routers have any
traffic shift (50.3%).
[Figure labels: p, AS1, AS2, ISP; No Traffic Shift]
55
Event Category: Internal Disruption
Internal Disruption: all traffic shifts are
internal (15.6%).
[Figure labels: p, AS1, AS2, ISP; Internal Traffic Shift]
56
Event Category: Single External Disruption
Single External Disruption: only one of the
traffic shifts is external (20.7%).
[Figure labels: p, AS1, AS2, ISP; External Traffic Shift]
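
The three categories shown on these slides can be expressed as a rule over the per-border-router traffic shifts of one event. The sketch below is an assumption-laden illustration: the function name and the `"none"` / `"internal"` / `"external"` encoding are hypothetical, and any event that fits none of the three categories described here is simply returned as `"other"`, since the remaining categories are not defined in this transcript.

```python
VALID_SHIFTS = {"none", "internal", "external"}


def classify_event(shifts):
    """Classify one event from the traffic shift seen at each border router.

    shifts: list with one entry per border router, each "none",
    "internal", or "external".
    """
    unknown = set(shifts) - VALID_SHIFTS
    if unknown:
        raise ValueError(f"unknown shift types: {sorted(unknown)}")

    actual = [s for s in shifts if s != "none"]
    if not actual:
        return "no disruption"               # no border router shifts traffic
    external = actual.count("external")
    if external == 0:
        return "internal disruption"         # all shifts are internal
    if external == 1:
        return "single external disruption"  # exactly one external shift
    return "other"                           # not covered by these slides


examples = {
    "quiet": classify_event(["none", "none", "none"]),
    "igp": classify_event(["internal", "none", "internal"]),
    "one_session": classify_event(["external", "internal", "none"]),
}
```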
57
Statistics on Event Classification
  • First 3 categories have significant day-to-day
    variations
  • Updates per event depend on the type of event
    and the number of affected routers

58
Event Correlation
  • Challenge: a single routing change
  • affects multiple destination prefixes

Solution: group events of the same type that occur
close together in time
[Figure: Typed Events -> Event Correlation -> Clusters]
59
eBGP Session Reset
  • Caused most of the single external disruption
    events
  • Check whether the number of prefixes using that
    session as the best route changes dramatically
  • Validation with Syslog router reports (95%)

[Figure: number of prefixes over time, with session failure and session recovery marked]
60
Hot-Potato Changes
  • Caused internal disruption events
  • Validation with OSPF measurement (95%); Teixeira
    et al., SIGMETRICS '04

Hot-potato routing: route to the closest egress
point
[Figure labels: P, ISP, 9, 10, 11]
61
Traffic Impact Prediction
  • Challenge: routing changes have different impacts
    on the network, depending on the popularity of
    the destinations

Solution: weigh each cluster by traffic volume
[Figure: Clusters + Netflow Data -> Traffic Impact Prediction -> Large Disruptions]
62
Traffic Impact Prediction
  • Traffic weight
  • Per-prefix measurement from Netflow
  • 10% of prefixes account for 90% of traffic
  • Traffic weight of a cluster
  • the sum of the traffic weights of its prefixes
  • A small number of large clusters have large
    traffic weight
  • Mostly session resets and hot-potato changes
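
The cluster weighting can be sketched as a sum over per-prefix traffic weights, followed by a ranking. The function names, the dictionary-based inputs, and the sample prefixes and weights below are made up for illustration; prefixes with no traffic measurement are assumed to contribute zero.

```python
def cluster_weight(prefixes, prefix_weight):
    """Traffic weight of a cluster: sum of the weights of its prefixes.

    prefix_weight: dict mapping prefix -> fraction of total traffic.
    Prefixes without a measurement count as 0 (an assumption).
    """
    return sum(prefix_weight.get(p, 0.0) for p in prefixes)


def rank_clusters(clusters, prefix_weight):
    """Return (cluster_name, weight) pairs, heaviest cluster first."""
    weighted = [(name, cluster_weight(pfx, prefix_weight))
                for name, pfx in clusters.items()]
    return sorted(weighted, key=lambda item: item[1], reverse=True)


# Hypothetical per-prefix weights and clusters.
weights = {"10.0.0.0/8": 0.5, "192.0.2.0/24": 0.25, "198.51.100.0/24": 0.125}
clusters = {
    "hot-potato": ["198.51.100.0/24", "203.0.113.0/24"],
    "session-reset": ["10.0.0.0/8", "192.0.2.0/24"],
}
ranked = rank_clusters(clusters, weights)
# ranked == [("session-reset", 0.75), ("hot-potato", 0.125)]
```

Reporting only the clusters at the top of this ranking is one way to reduce a large number of events to a short list of large disruptions.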

63
Performance Evaluation
  • Memory
  • Static memory: current routes, 600 MB
  • Dynamic memory: clusters, 300 MB
  • Speed
  • 99% of 1-second intervals of updates can be
    processed within 1 second
  • Occasional execution lag
  • Every 70-second interval of updates can be
    processed within 70 seconds

Measurements were based on a 900 MHz CPU
64
Conclusion of BGP Troubleshooting Tool
  • BGP troubleshooting system
  • Fast, online fashion
  • Operators' concerns (reachability, flapping,
    traffic)
  • Significant information reduction
  • millions of updates -> a few dozen large
    disruptions
  • Uncovered important network behavior
  • Hot-Potato changes
  • Session resets
  • Persistent-flapping prefixes