Title: Internet Routing COS 598A Today: Detecting Anomalies Inside an AS
1Internet Routing (COS 598A) Today: Detecting Anomalies Inside an AS
- Jennifer Rexford
- http://www.cs.princeton.edu/jrex/teaching/spring2005
- Tuesdays/Thursdays 11:00am-12:20pm
2Outline
- Traffic
- SNMP link statistics
- Packet and flow monitoring
- Network topology
- IP routers and links
- Fault data, layer-2 topology, and configuration
- Intradomain route monitoring
- Interdomain routes
- BGP route monitoring
- Analysis of BGP update data
- Conclusions
3Why is Traffic Measurement Important?
- Billing the customer
- Measure usage on links to/from customers
- Applying billing model to generate a bill
- Traffic engineering and capacity planning
- Measure the traffic matrix (i.e., offered load)
- Tune routing protocol or add new capacity
- Denial-of-service attack detection
- Identify anomalies in the traffic
- Configure routers to block the offending traffic
- Analyze application-level issues
- Evaluate benefits of deploying a Web caching proxy
- Quantify fraction of traffic that is P2P file sharing
4Collecting Traffic Data: SNMP
- Simple Network Management Protocol
- Standard Management Information Base (MIB)
- Protocol for querying the MIBs
- Advantage: ubiquitous
- Supported on all networking equipment
- Multiple products for polling and analyzing data
- Disadvantages: dumb
- Coarse granularity of the measurement data
- E.g., number of bytes/packets per interface per 5 minutes
- Cannot express complex queries on the data
- Unreliable delivery of the data using UDP
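As a rough illustration of what the coarse SNMP counters support, the sketch below turns two polls of an interface byte counter into an average link utilization. The function name, the 32-bit counter width, and the sample values are assumptions made for this example, not part of the slides.

```python
def link_utilization(prev_octets, curr_octets, interval_s, capacity_bps,
                     counter_bits=32):
    """Average utilization (0..1) between two polls of a byte counter."""
    modulus = 1 << counter_bits
    # Counters wrap around; modular subtraction is correct as long as the
    # counter wrapped at most once between the two polls.
    delta_bytes = (curr_octets - prev_octets) % modulus
    return (delta_bytes * 8) / (interval_s * capacity_bps)


# Two polls of a hypothetical 100 Mb/s interface taken 300 seconds apart;
# the counter wrapped between the polls.
util = link_utilization(4_294_000_000, 1_500_000_000, 300, 100_000_000)
print(f"average utilization: {util:.1%}")
```

On fast links a 32-bit counter can wrap more than once in five minutes, in which case this calculation silently undercounts.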
5Collecting Traffic Data: Packet Monitoring
- Packet monitoring
- Passively collecting IP packets on a link
- Recording IP, TCP/UDP, or application-layer traces
- Advantages: details
- Fine-grain timing information
- E.g., can analyze the burstiness of the traffic
- Fine-grain packet contents
- Addresses, port numbers, TCP flags, URLs, etc.
- Disadvantages: overhead
- Hard to keep up with high-speed links
- Often requires a separate monitoring device
6Collecting Traffic Data: Flow Statistics
- Flow monitoring (e.g., Cisco Netflow)
- Statistics about groups of related packets (e.g., same IP/TCP headers and close in time)
- Recording header information, counts, and time
- Advantages: detail with less overhead
- Almost as good as packet monitoring, except no fine-grain timing information or packet contents
- Often implemented directly on the interface card
- Disadvantages: trade-off between detail and overhead
- Less detail than packet monitoring
- Less ubiquitous than SNMP statistics
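A minimal sketch of the flow abstraction described on this slide: packets with the same header fields that arrive close in time are folded into one record holding counts and timestamps. The `Packet` fields and the 30-second idle timeout are assumptions for illustration; this is not how any particular router implements flow export.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "time src dst sport dport proto size")


def packets_to_flows(packets, idle_timeout=30.0):
    """Group packets with the same 5-tuple, starting a new flow after an idle gap."""
    active = {}  # 5-tuple -> most recent flow record for that key
    flows = []
    for p in sorted(packets, key=lambda pkt: pkt.time):
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        flow = active.get(key)
        if flow is None or p.time - flow["end"] > idle_timeout:
            flow = {"key": key, "start": p.time, "end": p.time,
                    "packets": 0, "bytes": 0}
            active[key] = flow
            flows.append(flow)
        # Only header-derived counts and times are kept; contents are discarded.
        flow["end"] = p.time
        flow["packets"] += 1
        flow["bytes"] += p.size
    return flows
```

Each record keeps only the first and last packet times, which is why per-packet timing and payload are not recoverable from flow data.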
7Using the Traffic Data in Network Operations
- SNMP byte/packet counts: everywhere
- Tracking link utilizations and detecting anomalies
- Generating bills for traffic on customer links
- Inference of the offered load (i.e., traffic matrix)
- Packet monitoring: selected locations
- Analyzing the small time-scale behavior of traffic
- Troubleshooting specific problems on demand
- Flow monitoring: selective, e.g., network edge
- Tracking the application mix
- Direct computation of the traffic matrix
- Input to denial-of-service attack detection
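To illustrate "direct computation of the traffic matrix" from flow data collected at the network edge, here is a small sketch. It assumes each flow record has already been annotated with its ingress and egress router (in practice the egress point would have to be derived from routing data); the record layout and router names are hypothetical.

```python
from collections import defaultdict


def traffic_matrix(flow_records):
    """Sum bytes for each (ingress, egress) pair of edge routers."""
    matrix = defaultdict(int)
    for ingress, egress, nbytes in flow_records:
        matrix[(ingress, egress)] += nbytes
    return dict(matrix)


# Hypothetical flow records: (ingress router, egress router, bytes).
records = [
    ("nyc", "sfo", 1_000),
    ("nyc", "sfo", 2_500),
    ("chi", "sfo", 700),
    ("nyc", "chi", 300),
]
print(traffic_matrix(records))
```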
8Network Topology
9IP Topology
- Topology information
- Routers
- Links, and their capacities
- Internal links inside the AS
- Edge links connecting to neighboring domains
- Ways to learn the topology
- Inventory database
- SNMP polling/traps
- Traceroute
- Route monitoring
- Router configuration data
10Below IP
- Layer-2 paths
- ATM virtual circuits
- Frame Relay virtual circuits
- Mapping to lower layers
- Specific fibers
- Shared optical amplifiers
- Shared conduits
- Physical length (propagation delay)
- Information not visible to IP
- Stored in an inventory database
- Not necessarily generated/updated automatically
11Intradomain Monitoring: OSPF Protocol
- Link-state protocol
- Routers flood Link State Advertisements (LSAs)
- Routers compute shortest paths based on weights
- Routers identify next-hop to reach other routers
[Figure: example network topology annotated with integer link weights]
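The shortest-path step that each link-state router performs can be sketched with Dijkstra's algorithm. The version below also records the next hop toward every destination. The four-node graph and its weights are made up for the example and do not reproduce the slide's figure.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra: return (distance, next_hop) dictionaries as seen from source.

    graph maps each node to {neighbor: positive link weight}.
    """
    dist = {source: 0}
    next_hop = {}
    heap = [(0, source, None)]  # (path cost, node, first hop on that path)
    done = set()
    while heap:
        cost, node, first = heapq.heappop(heap)
        if node in done:
            continue  # stale entry: a cheaper path was already settled
        done.add(node)
        if first is not None:
            next_hop[node] = first
        for nbr, weight in graph[node].items():
            new_cost = cost + weight
            if nbr not in done and new_cost < dist.get(nbr, float("inf")):
                dist[nbr] = new_cost
                # Neighbors of the source are their own first hop.
                hop = nbr if node == source else first
                heapq.heappush(heap, (new_cost, nbr, hop))
    return dist, next_hop


# Hypothetical topology with symmetric link weights.
graph = {
    "A": {"B": 1, "C": 3},
    "B": {"A": 1, "C": 1},
    "C": {"A": 3, "B": 1, "D": 2},
    "D": {"C": 2},
}
dist, next_hop = shortest_paths(graph, "A")
```

A route monitor that holds the full set of LSAs can run the same computation to reconstruct the paths the routers are using.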
12Intradomain Route Monitoring
- Construct continuous view of topology
- Detect when equipment goes up or down
- Input to traffic-engineering and planning tools
- Detect routing anomalies
- Identify failures, LSA storms, and route flaps
- Verify that LSA load matches expectations
- Flag strange weight settings as misconfigurations
- Analyze convergence delay
- Monitor LSAs in multiple locations with go
- Compare the times when LSAs arrive
- Detect router implementation mistakes
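One way to compare LSA arrival times across monitoring locations is to compute, for each LSA, the spread between its earliest and latest arrival. The sketch below assumes the monitors have synchronized clocks and that an LSA is identified by a hypothetical (advertising router, sequence number) pair; both are assumptions for the example.

```python
from collections import defaultdict


def lsa_arrival_spread(observations):
    """Map each LSA seen by two or more monitors to (latest - earliest) arrival.

    observations is an iterable of (lsa_id, monitor, arrival_time) tuples.
    """
    arrivals = defaultdict(dict)
    for lsa_id, monitor, time in observations:
        # Keep only the first time each monitor saw this LSA.
        seen = arrivals[lsa_id].get(monitor)
        if seen is None or time < seen:
            arrivals[lsa_id][monitor] = time
    return {
        lsa_id: max(times.values()) - min(times.values())
        for lsa_id, times in arrivals.items()
        if len(times) > 1
    }
```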
13Passive Collection of LSAs
- OSPF is a flooding protocol
- Every LSA sent on every participating link
- Very helpful for simplifying the monitor
- Can participate in the protocol
- Shared media (e.g., Ethernet)
- Join multicast group and listen to LSAs
- Point-to-point links
- Establish an adjacency with a router
- or passively monitor packets on a link
- Tap a link and capture the OSPF packets
14Reducing the Volume of Information
- Prioritizing the messages
- Router failure over router recovery
- Link failure or weight change over a refresh
- Informational messages about weight settings
- Grouping related messages
- Link failure: group messages for the two ends
- Router failure: group the affected links
- Common failure: group links failing close in time
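The "group messages for the two ends" rule can be sketched as follows: link-down reports from either endpoint of the same link are merged into one event if they arrive within a short window. The report format and the 5-second window are assumptions for illustration.

```python
def group_link_failures(reports, window=5.0):
    """Merge link-down reports from both ends of a link into single events.

    reports is an iterable of (time, reporting_router, neighbor) tuples.
    """
    events = []
    latest = {}  # undirected link -> most recent event for that link
    for time, reporter, neighbor in sorted(reports):
        link = frozenset((reporter, neighbor))  # same key from either end
        event = latest.get(link)
        if event is None or time - event["time"] > window:
            event = {"link": link, "time": time, "reporters": set()}
            latest[link] = event
            events.append(event)
        event["reporters"].add(reporter)
    return events
```

The same keyed-by-time idea extends to the other two rules, using the failed router or a time bucket as the grouping key.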
15Anomalies Found in the Shaikh04 paper
- Intermittent hardware problem
- Router periodically losing OSPF adjacencies
- Risk of network partition if 2nd failure occurred
- External link flaps
- Congestion on edge link causing lost messages
- Lost adjacency leading to flapping routes
- Configuration errors
- Two routers assigned the same IP address
- Inefficient config leading to duplicate LSAs
- Vendor implementation bug
- More frequent refreshing of LSAs than specified
16Interdomain Route Monitoring
17Motivation for BGP Monitoring
- Visibility into external destinations
- What neighboring ASes are telling you
- How you are reaching external destinations
- Detecting anomalies
- Increases in number of destination prefixes
- Lost reachability to some destinations
- Route hijacking
- Instability of the routes
- Input to traffic-engineering tools
- Knowing the current routes in the network
- Workload for testing routers
- Realistic message traces to play back to routers
18BGP Monitoring: A Wish List
- Ideally: knowing what the router knows
- All externally-learned routes
- Before policy has modified the attributes
- Before a single best route is picked
- How to achieve this
- Special monitoring session on routers that tells everything they have learned
- Packet monitoring on all links with BGP sessions
- If you can't do that, you could always do
- Periodic dumps of routing tables
- BGP session to learn best route from router
19Using Routers to Monitor BGP
- Establish a passive BGP session (eBGP or iBGP) from a workstation running BGP software
- (+) BGP table dumps do not burden operational routers
- (-) Receives only best routes from BGP neighbor
- (+) Update dynamics captured
- (+) Not restricted to interfaces provided by vendors
- Talk to operational routers using SNMP or telnet at the command line
- (-) BGP table dumps are expensive
- (+) Table dumps show all alternate routes
- (-) Update dynamics lost
- (-) Restricted to interfaces provided by vendors
20Collect BGP Data From Many Routers
[Figure: map of routers in Seattle, Cambridge, Chicago, Detroit, New York, Kansas City, Philadelphia, Denver, San Francisco, St. Louis, Washington, D.C., Los Angeles, Dallas, Atlanta, San Diego, Phoenix, Austin, Orlando, and Houston, with a Route Monitor]
BGP is not a flooding protocol
21Detecting Important Routing Changes
- Large volume of BGP update messages
- Around 2 million/day, and very bursty
- Too much for an operator to manage
- Identify important anomalies
- Lost reachability
- Persistent flapping
- Large traffic shifts
- Not the same as root-cause analysis
- Identify changes and their effects
- Focus on mitigation, rather than diagnosis
- Diagnose causes if they occur in/near the AS
22Challenge 1: Excess Update Messages
- A single routing change
- Leads to multiple update messages
- Affects routing decision at multiple routers
Persistent Flapping Prefixes
Group updates for a prefix with inter-arrival time < 70 seconds, and flag prefixes with changes lasting > 10 minutes.
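The grouping rule can be sketched in a few lines: updates for a prefix belong to the same event while successive updates are less than 70 seconds apart, and a prefix is flagged once one of its events lasts more than 10 minutes. The input format (timestamp in seconds, prefix string) and the function name are assumptions for this example.

```python
from collections import defaultdict


def persistent_flapping(updates, timeout=70, min_duration=600):
    """Return prefixes that have an update event lasting over min_duration seconds.

    updates is an iterable of (time_in_seconds, prefix) pairs.
    """
    by_prefix = defaultdict(list)
    for time, prefix in updates:
        by_prefix[prefix].append(time)

    flagged = set()
    for prefix, times in by_prefix.items():
        times.sort()
        start = prev = times[0]
        for t in times[1:]:
            if t - prev >= timeout:
                start = t  # quiet gap: the previous event ended, a new one begins
            elif t - start > min_duration:
                flagged.add(prefix)  # event has run longer than the threshold
            prev = t
    return flagged
```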
23Determine Event Timeout
[Figure: cumulative distribution of BGP update inter-arrival time, including a BGP beacon curve; marked point (70, 98)]
24Event Duration: Persistent Flapping
[Figure: complementary cumulative distribution of event duration; marked point (600, 0.1)]
25Detecting Persistent Flapping
- Significant persistent flapping
- 15.2% of all BGP update messages
- though a small number of destination prefixes
- Surprising, especially since flap dampening is used
- Types of persistent flapping
- Conservative flap-damping parameters (78.6%)
- Protocol oscillations, e.g., MED oscillation (18.3%)
- Unstable interface or BGP session (3.0%)
26Example: Unstable eBGP Session
[Figure: AT&T network with a peer and a customer; destination prefix p]
- Flap damping parameters are session-based
- Damping not implemented for iBGP sessions
27Challenge 2: Identify Important Events
- Major concerns of network operators
- Changes in reachability
- Heavy load of routing messages on the routers
- Flow of the traffic through the network
Classify events by the type of impact they have on the network
28Event Category: No Disruption
[Figure: AT&T network, neighboring AS1 and AS2, and destination prefix p; label: No Traffic Shift]
No Disruption: none of the border routers has a traffic shift
29Event Category: Internal Disruption
[Figure: AT&T network, neighboring AS1 and AS2, and destination prefix p; label: Internal Traffic Shift]
Internal Disruption: all of the traffic shifts are internal traffic shifts
30Event Type: Single External Disruption
[Figure: AT&T network, neighboring AS1 and AS2, and destination prefix p; label: External Traffic Shift]
Single External Disruption: traffic at one exit point shifts to other exit points
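A rough sketch of how the three categories on these slides might be told apart, under strong simplifying assumptions that are not taken from the slides: each border router is described only by the exit router it uses to reach prefix p before and after the event, and a router that names itself is treated as an exit point with its own external route. The "other" label is a placeholder for any case this sketch does not cover.

```python
def classify_event(before, after):
    """Classify a routing event for one prefix.

    before and after map every border router to the exit router it uses.
    Both dictionaries are assumed to have the same keys.
    """
    shifted = [r for r in before if before[r] != after[r]]
    if not shifted:
        return "no disruption"

    # Routers that forward the traffic out of the AS themselves.
    exits_before = {r for r in before if before[r] == r}
    exits_after = {r for r in after if after[r] == r}
    changed_exits = exits_before ^ exits_after

    if not changed_exits:
        # Exit points are unchanged; routers only moved between them.
        return "internal disruption"
    if len(changed_exits) == 1:
        return "single external disruption"
    return "other"
```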
31Statistics on Event Classification
32Challenge 3: Multiple Destinations
- A single routing change
- Affects multiple destination prefixes
Group events of same type that occur close in time
33Main Causes of Large Clusters
- External BGP session resets
- Failure/recovery of external BGP session
- E.g., session to another large tier-1 ISP
- Caused single external disruption events
- Validated by looking at syslog reports on routers
- Hot-potato routing changes
- Failure/recovery of an intradomain link
- E.g., leads to changes in IGP path costs
- Caused internal disruption events
- Validated by looking at OSPF measurements
34Challenge 4: Popularity of Destinations
- Impact of event on traffic
- Depends on the popularity of the destinations
Netflow Data
Weight the group of destinations by the traffic
volume
35Traffic Impact Prediction
- Traffic weight
- Per-prefix measurements from Netflow
- 10% of prefixes account for 90% of traffic
- Traffic weight of a cluster
- The sum of traffic weight of the prefixes
- Flag clusters with heavy traffic
- A few large clusters have large traffic weight
- Mostly session resets and hot-potato changes
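The traffic weight of a cluster, as defined above, is just the sum of the per-prefix weights. A minimal sketch, assuming per-prefix byte counts from flow measurements are available as a dictionary; the 5% flagging threshold and all names are hypothetical choices for the example.

```python
def cluster_weight(cluster_prefixes, traffic_by_prefix):
    """Fraction of total traffic destined to the prefixes in one cluster."""
    total = sum(traffic_by_prefix.values())
    in_cluster = sum(traffic_by_prefix.get(p, 0) for p in cluster_prefixes)
    return in_cluster / total


def heavy_clusters(clusters, traffic_by_prefix, threshold=0.05):
    """Names of clusters whose traffic weight reaches the threshold."""
    return [
        name
        for name, prefixes in clusters.items()
        if cluster_weight(prefixes, traffic_by_prefix) >= threshold
    ]


# Hypothetical per-prefix byte counts and event clusters.
traffic = {"10.0.0.0/8": 900, "192.0.2.0/24": 90, "198.51.100.0/24": 10}
clusters = {
    "session-reset": ["10.0.0.0/8", "192.0.2.0/24"],
    "minor": ["198.51.100.0/24"],
}
print(heavy_clusters(clusters, traffic))
```

Prefixes with no measured traffic contribute zero weight, so a large cluster of unpopular destinations is not flagged.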
36Conclusions
- Network troubleshooting from the inside
- Traffic, topology, and routing data
- Easier to understand what's going on
- though still challenging to collect/analyze data
- Traffic measurement
- SNMP, packet monitoring, and flow monitoring
- Routing monitors
- Track network state and identify anomalies
- Intradomain monitor capturing LSAs
- BGP monitor capturing BGP updates
37Next Time: BGP Routing Table Size
- Three papers
- On characterizing BGP routing table growth
- An empirical study of router response to large BGP routing table load
- A framework for interdomain route aggregation
- Review only of the first paper
- Summary
- Why accept
- Why reject
- Avenues for future work
- Optional
- Vannevar Bush, "As We May Think" (1945)