1
BGP Anomaly Detection in an ISP
  • Jian Wu (U. Michigan)
  • Z. Morley Mao (U. Michigan)
  • Jennifer Rexford (Princeton)
  • Jia Wang (AT&T Labs)

http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf
2
Goal
  • Identify important anomalies
    • Lost reachability
    • Persistent flapping
    • Large traffic shifts
  • Contributions
    • Build a tool to identify a small number of
      important routing disruptions from a large volume
      of raw BGP updates in real time
    • Use the tool to characterize routing disruptions
      in an operational network

3
Capturing Routing Changes
Large operational network (8/16/2004 to 10/10/2004)
[Figure: BGP Monitor in the ISP; labels: eBGP updates, iBGP best routes, CPE]
4
Challenges
  • Large volume of BGP updates
    • Millions daily, very bursty
    • Too much for an operator to manage
  • Different from root-cause analysis
    • Identify changes and their effects
    • Focus on actionable events
    • Diagnose causes only in/near the AS

5
System Architecture
6
Grouping BGP Updates into Events
  • Challenge: a single routing change
    • leads to multiple update messages
    • affects routing decisions at multiple routers
  • Solution
    • Group all updates for a prefix with inter-arrival
      time < 70 seconds
    • Flag prefixes with changes lasting > 10 minutes

[Figure label: Persistent Flapping Prefixes]
7
Grouping Thresholds
  • Based on data analysis and our understanding of
    BGP
  • Event timeout: 70 seconds
    • 2 × MRAI timer + 10 seconds
    • 98% of inter-arrival times < 70 seconds
  • Convergence timeout: 10 minutes
    • BGP usually converges within minutes
    • 99.9% of events < 10 minutes
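
The grouping rule on the last two slides can be sketched roughly as follows. This is a minimal offline illustration, not the tool itself: the function names, the (timestamp, prefix) input shape, and the batch processing are assumptions; only the 70-second and 10-minute thresholds come from the slides.

```python
from collections import defaultdict

EVENT_TIMEOUT = 70         # seconds between updates of one event (from the slides)
CONVERGENCE_TIMEOUT = 600  # 10 minutes (from the slides)


def _close(prefix, start, end, count):
    # An event lasting longer than the convergence timeout is flagged
    # as belonging to a persistently flapping prefix.
    persistent = (end - start) > CONVERGENCE_TIMEOUT
    return (prefix, start, end, count, persistent)


def group_updates(updates):
    """Group (timestamp, prefix) updates into per-prefix events.

    Returns a list of (prefix, start, end, n_updates, persistent) tuples.
    Hypothetical helper: the input format is an assumption.
    """
    by_prefix = defaultdict(list)
    for ts, prefix in updates:
        by_prefix[prefix].append(ts)

    events = []
    for prefix, stamps in by_prefix.items():
        stamps.sort()
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev < EVENT_TIMEOUT:
                count += 1  # same event: gap below the event timeout
            else:
                events.append(_close(prefix, start, prev, count))
                start, count = ts, 1  # gap too long: start a new event
            prev = ts
        events.append(_close(prefix, start, prev, count))
    return events
```

An online version would keep one open event per prefix and close it when a timer expires, instead of sorting a complete batch.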

8
Persistent Flapping Prefixes
  • Causes of persistent flapping
    • Conservative damping parameters (78.6%)
    • Protocol oscillations due to MED (18.3%)
    • Unstable interface or BGP session (3.0%)
  • Surprising finding: 15.2% of updates were caused
    by persistent flapping prefixes, even though flap
    damping was enabled!

9
Example: Unstable eBGP Session
[Figure: Peer, ISP, and Customer, with destination prefix p]
  • Flap damping parameters are session-based
  • Damping not implemented for iBGP sessions

10
Event Classification
  • Challenge: major concerns in network management
    • Changes in reachability
    • Heavy load of routing messages on the routers
    • Changes in the flow of traffic through the network

[Figure: Events → Event Classification → Typed Events]
Solution: classify events by the severity of their
impact
11
Event Category: No Disruption
[Figure: ISP, AS1, AS2, and destination prefix p; label: No Traffic Shift]
No Disruption: no border router has a traffic
shift. (50.3%)
12
Event Category: Internal Disruption
[Figure: ISP, AS1, AS2, and destination prefix p; label: Internal Traffic Shift]
Internal Disruption: all of the traffic shifts
are internal traffic shifts. (15.6%)
13
Event Category: Single External Disruption
[Figure: ISP, AS1, AS2, and destination prefix p; label: External Traffic Shift]
Single External Disruption: only one of the
traffic shifts is an external traffic shift. (20.7%)
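
The three categories above can be expressed as a small decision function. This is a sketch under stated assumptions: each border router's outcome is pre-labeled as "none", "internal", or "external" (hypothetical labels), and any combination not described on these slides falls into a catch-all "other".

```python
def classify_event(shifts):
    """Classify one event from per-border-router traffic-shift labels.

    shifts: iterable of 'none', 'internal', or 'external' (assumed labels).
    """
    changed = [s for s in shifts if s != "none"]
    external = sum(1 for s in changed if s == "external")
    if not changed:
        return "no disruption"               # no router shifted traffic
    if external == 0:
        return "internal disruption"         # every shift is internal
    if external == 1:
        return "single external disruption"  # exactly one external shift
    return "other"                           # not covered by these slides
```

For example, classify_event(["external", "internal", "none"]) falls into the single external disruption category.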
14
Statistics on Event Classification
  • The first 3 categories vary significantly
    from day to day
  • The number of updates per event depends on the
    type of event and the number of affected routers

15
Event Correlation
  • Challenge: a single routing change
    • affects multiple destination prefixes

[Figure: Typed Events → Event Correlation → Clusters]
Solution: group events of the same type that occur
close in time
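
One way to read "same type, close in time" is a per-type sliding gap, sketched below. The slides do not say how close is close enough, so the 60-second window, the function name, and the (start_time, event_type, prefix) tuple shape are all assumptions.

```python
def correlate(events, window=60):
    """Cluster events of the same type whose start times are close together.

    events: list of (start_time, event_type, prefix) tuples (assumed shape).
    window: maximum gap in seconds between consecutive events of one
            cluster; an assumed value, not taken from the slides.
    """
    clusters = []
    open_cluster = {}  # event_type -> index of its most recent cluster
    last_time = {}     # event_type -> start time of its most recent event
    for start, etype, prefix in sorted(events):
        idx = open_cluster.get(etype)
        if idx is not None and start - last_time[etype] <= window:
            clusters[idx]["prefixes"].append(prefix)
        else:
            clusters.append({"type": etype, "start": start, "prefixes": [prefix]})
            open_cluster[etype] = len(clusters) - 1
        last_time[etype] = start
    return clusters
```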
16
eBGP Session Reset
  • Caused most single external disruption events
  • Check whether the number of prefixes using that
    session as the best route changes dramatically
  • Validation with Syslog router reports (95%)

[Figure: number of prefixes over time, with session failure and session recovery marked]
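
The check described on this slide could look roughly like the following. The slide only says the count "changes dramatically"; the 50% threshold, the function name, and the sampled (time, count) input are assumptions made for illustration.

```python
def detect_resets(counts, drop_fraction=0.5):
    """Flag dramatic changes in the number of prefixes using one eBGP session.

    counts: list of (time, n_prefixes) samples for a single session.
    drop_fraction: assumed threshold; a change counts as dramatic when the
                   smaller count is below (1 - drop_fraction) of the larger.
    """
    changes = []
    for (t0, n0), (t1, n1) in zip(counts, counts[1:]):
        if n0 > 0 and n1 < n0 * (1 - drop_fraction):
            changes.append((t1, "failure"))   # count collapsed
        elif n1 > 0 and n0 < n1 * (1 - drop_fraction):
            changes.append((t1, "recovery"))  # count jumped back up
    return changes
```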
17
Hot-Potato Changes
  • Hot-potato changes
    • Caused internal disruption events
    • Validation with OSPF measurement (95%) [Teixeira
      et al., SIGMETRICS '04]

[Figure: ISP and destination P; labels: 10, 11, 9]
Hot-potato routing: route to the closest egress
point
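
A hot-potato change can be illustrated with a toy egress selection. This sketch assumes the candidate egress points carry equally good BGP routes, so only the IGP distance decides; the router names and costs in the usage note are made up.

```python
def closest_egress(igp_cost):
    """Return the egress point with the smallest IGP cost.

    igp_cost: dict mapping egress router name -> IGP distance from one
              ingress router (hypothetical input shape).
    """
    return min(igp_cost, key=igp_cost.get)


def hot_potato_change(before, after):
    """Return (old_egress, new_egress) if an IGP cost change moved the
    closest egress point, or None if the choice is unchanged."""
    old, new = closest_egress(before), closest_egress(after)
    return (old, new) if old != new else None
```

For instance, if the cost to a hypothetical egress B drops from 11 to 9 while egress A stays at 10, traffic moves from A to B with no BGP change outside the network.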
18
Traffic Impact Prediction
  • Challenge: routing changes have different impacts
    on the network, depending on the popularity of
    the destinations

[Figure: Clusters and Netflow Data → Traffic Impact Prediction → Large Disruptions]
Solution: weight each cluster by traffic volume
19
Traffic Impact Prediction
  • Traffic weight
    • Per-prefix measurement from Netflow
    • 10% of prefixes account for 90% of traffic
  • Traffic weight of a cluster
    • Sum of the traffic weights of its prefixes
  • A few clusters have large traffic weight
    • Mostly session resets and hot-potato changes
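
The cluster weight defined above is a plain sum, which the sketch below spells out. The per-prefix weights would come from Netflow measurements; here they are passed in as a dictionary, and all names are hypothetical.

```python
def cluster_weight(prefixes, traffic_weight):
    """Sum the per-prefix traffic weights of one cluster.

    traffic_weight: dict prefix -> fraction of total traffic (assumed to be
    precomputed from Netflow); prefixes without measurements count as 0.
    """
    return sum(traffic_weight.get(p, 0.0) for p in prefixes)


def rank_clusters(clusters, traffic_weight):
    """Return cluster names ordered from largest to smallest traffic weight.

    clusters: dict cluster name -> list of prefixes in that cluster.
    """
    return sorted(
        clusters,
        key=lambda name: cluster_weight(clusters[name], traffic_weight),
        reverse=True,
    )
```

Ranking this way lets an operator look at the few heavy clusters first and ignore the long tail.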

20
Performance Evaluation
  • Memory
    • Static memory: current routes, 600 MB
    • Dynamic memory: clusters, 300 MB
  • Speed
    • 99% of 1-second intervals of updates can be
      processed within 1 second
    • Occasional execution lag
    • Every 70-second interval of updates can be
      processed within 70 seconds

Measurements were taken on a 900 MHz CPU
21
Conclusion
  • BGP anomaly detection
    • Fast, online fashion
    • Operator concerns (reachability, flapping,
      traffic)
    • Significant information reduction
  • Uncovered important network behaviors
    • Persistent flapping prefixes
    • Hot-potato changes
    • Session resets and interface failures

22
Detecting Peering Violations
  • Consistent export requirement
    • Peer should advertise prefixes at all peering
      points, with the same AS path length
    • Allows the AS to do hot-potato routing
  • Detecting violations
    • Using iBGP feeds from the border routers
    • Some inference tricks to identify inconsistencies
  • Results of the study
    • http://www.nanog.org/mtg-0410/feamster.html
    • http://www.cs.princeton.edu/~jrex/papers/imc04.pdf
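
The consistent export requirement can be checked directly when every advertisement from the peer is visible, as in the sketch below. That full visibility is an assumption: iBGP feeds show only best routes, which is why the slide mentions inference tricks. The function name and input shape are hypothetical.

```python
def find_violations(adverts):
    """Check one peer's advertisements for consistent export.

    adverts: dict peering point -> dict prefix -> AS path length
             (assumed input; presumes all advertisements are observed).
    Returns a dict prefix -> description of the inconsistency.
    """
    all_prefixes = set().union(*(routes.keys() for routes in adverts.values()))
    violations = {}
    for prefix in sorted(all_prefixes):
        lengths = [routes.get(prefix) for routes in adverts.values()]
        if None in lengths:
            violations[prefix] = "missing at some peering point"
        elif len(set(lengths)) > 1:
            violations[prefix] = "unequal AS path lengths"
    return violations
```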