PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

1
PlanetSeer: Internet Path Failure Monitoring and
Characterization in Wide-Area Services
  • Ming Zhang, Chi Zhang
  • Vivek Pai, Larry Peterson, Randy Wang
  • Princeton University

2
Motivation
  • Routing anomalies are common on the Internet
      • Maintenance
      • Power outage
      • Fiber cut
      • Misconfiguration
  • Anomalies can affect end-to-end performance
      • Packet loss
      • Packet delay
      • Loss of connectivity

3
Background
  • Anomaly detection and diagnosis are nontrivial
      • Asymmetric paths
      • Failure information propagation
      • Highly varied durations
      • Limited coverage

4
Contributions
  • New techniques for
      • Anomaly detection
      • Anomaly isolation
      • Anomaly classification
  • Large-scale study of anomalies
      • Broad coverage
      • High detection rate, low overhead
  • Characterization of anomalies
      • End-to-end effects
      • Benefits to host service

5
Outline
  • State of the Art
  • PlanetSeer Components
      • MonD: passive monitoring
      • ProbeD: active probing
  • Anomaly Analysis
      • Loop-based anomaly
      • Non-loop anomaly
  • Bypassing Anomalies
  • Summary

6
State of the Art
  • Routing messages
      • BGP: AS-level diagnosis
      • IS-IS, OSPF: within a single ISP
  • Router/link traffic statistics
      • SNMP, NetFlow: proprietary
  • End-to-end measurement
      • Ping, traceroute

7
End-to-End Probing
  • All-pairs probes among n nodes
  • O(n^2) measurement cost
  • Not scalable as n grows
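
The quadratic cost can be made concrete with a short count. This is only an illustrative sketch: the function name is made up here, and it counts directed pairs (each node probing every other node), which is one reasonable reading of "all-pairs probes".

```python
def all_pairs_probe_count(n: int) -> int:
    """Number of directed (source, destination) pairs among n nodes."""
    return n * (n - 1)


if __name__ == "__main__":
    # 353 is the node count quoted later in the deck; 10 and 100 are arbitrary.
    for n in (10, 100, 353):
        print(n, all_pairs_probe_count(n))
```

Doubling the number of nodes roughly quadruples the number of probes per round, which is why the deck looks for a cheaper trigger than periodic all-pairs measurement.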

8
Key Observation
  • Combine passive monitoring with active probing
  • Peer-to-Peer (P2P), Content Distribution Network
    (CDN)
  • Large client population
  • Geographically distributed nodes
  • Large traffic volume
  • Highly diverse paths
  • The traffic generated by the services reveals
    information about the network.

9
Our Approach
  • Host service
      • CDN
  • Components
      • Passive monitoring
      • Active probing
  • Advantages
      • Low overhead
      • Wide coverage

[Figure labels: Client, A, B, C, R1, R2]

10
MonD Anomaly Detection
  • Anomaly indicators
      • Time-to-live (TTL) change
          • Routing change
      • n consecutive timeouts (n = 4 in current system)
          • Idling period of 3 to 16 seconds
          • Most congestion periods < 220 ms
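
The two indicators can be sketched as a small per-flow state machine. This is a minimal sketch, not the MonD implementation: `FlowState`, `observe_packet`, and `observe_timeout` are hypothetical names, and resetting the timeout count whenever a packet arrives is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_THRESHOLD = 4  # n = 4 consecutive timeouts, as stated on the slide


@dataclass
class FlowState:
    """Per-flow state for a hypothetical passive monitor."""
    last_ttl: Optional[int] = None
    consecutive_timeouts: int = 0


def observe_packet(flow: FlowState, ttl: int) -> bool:
    """Record a received packet; return True if its TTL differs from the previous one."""
    changed = flow.last_ttl is not None and ttl != flow.last_ttl
    flow.last_ttl = ttl
    flow.consecutive_timeouts = 0  # assumption: any received packet ends the timeout run
    return changed


def observe_timeout(flow: FlowState) -> bool:
    """Record one timeout; return True once the threshold of consecutive timeouts is reached."""
    flow.consecutive_timeouts += 1
    return flow.consecutive_timeouts >= TIMEOUT_THRESHOLD
```

Either function returning True would mark the flow as a possible anomaly to be checked by active probing.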

11
ProbeD Operation
  • Baseline probes
      • When a new IP appears
      • From local node
  • Forward probes
      • When a possible anomaly detected
      • From multiple nodes (including local node)
  • Reprobes
      • At 0.5, 1.5, 3.5 and 7.5 hours later
      • From local node
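
The reprobe offsets follow a doubling pattern: the gaps between successive reprobes are 0.5, 1, 2, and 4 hours. The sketch below turns the offsets into concrete timestamps; the function names and the example detection time are illustrative only.

```python
from datetime import datetime, timedelta

# Offsets after detection, in hours, as listed on the slide.
REPROBE_OFFSETS_HOURS = (0.5, 1.5, 3.5, 7.5)


def reprobe_times(detected_at: datetime) -> list:
    """Absolute times at which the local node would reprobe."""
    return [detected_at + timedelta(hours=h) for h in REPROBE_OFFSETS_HOURS]


def gaps_hours(offsets) -> list:
    """Gap between each reprobe and the previous event (detection or earlier reprobe)."""
    gaps, previous = [], 0.0
    for offset in offsets:
        gaps.append(offset - previous)
        previous = offset
    return gaps


if __name__ == "__main__":
    print(gaps_hours(REPROBE_OFFSETS_HOURS))
    for t in reprobe_times(datetime(2004, 2, 1, 12, 0)):
        print(t.isoformat())
```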

12
ProbeD Groups
  • 353 nodes, 145 sites, 30 groups
  • According to geographic location
  • One traceroute per group
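
Choosing one prober per group could look like the sketch below. The node and group names are invented, and picking a random member from each remote group is an assumption; the slides say only that nodes are grouped by location and that one traceroute is issued per group.

```python
import random
from collections import defaultdict


def group_nodes(node_to_group: dict) -> dict:
    """Invert a node -> group mapping into group -> list of nodes."""
    groups = defaultdict(list)
    for node, group in node_to_group.items():
        groups[group].append(node)
    return dict(groups)


def pick_probers(groups: dict, local_node: str, rng: random.Random) -> list:
    """Pick one prober per group, using the local node for its own group."""
    chosen = []
    for _, members in sorted(groups.items()):
        if local_node in members:
            chosen.append(local_node)
        else:
            chosen.append(rng.choice(sorted(members)))  # assumption: random pick
    return chosen


if __name__ == "__main__":
    nodes = {"a1": "us-east", "a2": "us-east", "b1": "europe", "c1": "asia"}
    print(pick_probers(group_nodes(nodes), "a2", random.Random(0)))
```

With 30 groups this caps the forward-probe cost at 30 traceroutes per suspected anomaly, regardless of how many of the 353 nodes are up.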

13
Estimating Scope
  • Which routers might be affected?
  • Routers which possibly change their next hops
  • Traceroutes from multiple locations can narrow
    the scope
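
One way to read this slide is as a two-step filter. The sketch below is an interpretation, not the method from the paper: it assumes paths are lists of router names, treats a router as a candidate when the link to its baseline next hop is missing from the new path, and clears a candidate if another vantage point still observes that same link. All names are hypothetical.

```python
def candidate_routers(baseline: list, current: list) -> dict:
    """Map each baseline router to its old next hop if that link is absent from the current path."""
    current_links = set(zip(current, current[1:]))
    return {
        hop: nxt
        for hop, nxt in zip(baseline, baseline[1:])
        if (hop, nxt) not in current_links
    }


def narrow_scope(candidates: dict, other_paths: list) -> dict:
    """Drop candidates whose old next hop is still in use on a path seen from another location."""
    still_used = set()
    for path in other_paths:
        still_used.update(zip(path, path[1:]))
    return {hop: nxt for hop, nxt in candidates.items() if (hop, nxt) not in still_used}


if __name__ == "__main__":
    baseline = ["r1", "r2", "r3", "r4", "dst"]
    current = ["r1", "r2", "r5", "r6", "dst"]
    wide = candidate_routers(baseline, current)
    print(wide)
    print(narrow_scope(wide, [["x1", "r3", "r4", "dst"]]))
```

In the example, a single traceroute leaves three routers under suspicion, and one additional vantage point narrows that to the router where the paths diverge.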

14
Path Diversity
  • Monitoring period: 02/2004 to 05/2004
  • Unique IPs: 887,521
  • Traversed ASes: 10,090

15
Confirming Anomalies
  • Reported anomalies
      • 2,259,588
  • Conditions
      • Loops
      • Route change
      • Partial unreachability
      • ICMP unreachable
  • Very conservative confirmation

[Pie chart: Non-anomaly 66%, Undecided 22%, Anomaly 12%]

16
Confirmed Anomaly Breakdown
  • Confirmed anomalies
      • 271,898
      • 2 per minute
      • 100x more
  • Temp anomalies
      • Inconsistent probes

[Pie chart: Path Change 44%, Other Outage 23%, Temp Anomalies 16%, Fwd Outage 9%, Persist Loop 7%, Temp Loop 1%]

17
Scope of Loops
  • How many routers or ASes are involved?
  • Temp loops involve more routers than persistent
    loops
  • 97% of persistent loops and 51% of temp loops contain 2
    hops

18
Distribution of Loops
  • Many persistent loops in tier-3, few in tier-1
  • Worst 10% of tier-1 ASes: implications for
    largest ISPs
      • 20% of traffic
      • 35% of persistent loops

19
Duration of Persistent Loops
  • How long do persistent loops last?
  • Either resolve quickly or last for an extended
    period

20
Scope of Forward Anomalies
  • How many routers or ASes are affected?
  • 60% of outages within 1 hop
  • 75% of outages and 68% of changes within 4 hops

21
Location of Forward Anomalies
  • How close are the anomalies to the edges of the
    network?
  • 44% of outages at the last hop
  • 72% of outages and 40% of changes within 4 hops

22
Distribution of Forward Anomalies
  • Which ASes are affected?
  • Tier-1 ASes most stable
  • Tier-3 ASes most likely to be affected

23
Overlay Routing
  • Use alternate path when default path fails

24
Bypassing Anomalies
  • How useful is overlay routing for bypassing
    failures?
  • Effective in 43% of 62,815 failures, lower than
    previous studies
  • 32% of bypass paths inflate RTTs by more than a
    factor of two
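
A one-hop detour and its RTT inflation can be sketched as follows. This is an illustration under stated assumptions, not the evaluation procedure from the paper: the function name, the relay names, and the RTT figures are invented, and inflation is taken to be the detour RTT divided by the RTT of the default path before it failed.

```python
from typing import Optional, Tuple


def best_one_hop_detour(
    direct_rtt_ms: float, relay_legs_ms: dict
) -> Optional[Tuple[str, float, float]]:
    """Return (relay, detour RTT, inflation factor) for the fastest working relay.

    relay_legs_ms maps a relay name to (source->relay RTT, relay->destination RTT),
    or to None when the relay cannot reach the destination either.
    """
    best = None
    for relay, legs in relay_legs_ms.items():
        if legs is None:
            continue  # this relay does not bypass the failure
        detour = legs[0] + legs[1]
        if best is None or detour < best[1]:
            best = (relay, detour, detour / direct_rtt_ms)
    return best


if __name__ == "__main__":
    relays = {"A": (40.0, 70.0), "B": None, "C": (30.0, 50.0)}
    print(best_one_hop_detour(50.0, relays))
```

A return value of None corresponds to a failure that no relay can bypass; an inflation factor above 2 corresponds to the slow bypass paths counted on the slide.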

25
Summary
  • Confirmed 272,000 anomalies in 3 months
  • Persistent and temporary loops
      • Persistent loops: narrower scope, either resolve
        quickly or last for a long time
  • Path outages and changes
      • Outages: closer to edge, narrower scope
  • Anomaly distribution
      • Skewed: tier-1 most stable, tier-3 most
        problematic
  • Overlay routing
      • Bypasses 43% of failures, with latency inflation
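
The headline rates can be cross-checked against the raw counts given earlier in the deck. The only assumption in this sketch is that "3 months" means 90 days; the two counts are taken from the slides.

```python
REPORTED = 2_259_588   # anomalies reported by passive monitoring
CONFIRMED = 271_898    # anomalies confirmed by active probing
ASSUMED_DAYS = 90      # assumption: "3 months" taken as 90 days

confirmed_fraction = CONFIRMED / REPORTED
confirmed_per_minute = CONFIRMED / (ASSUMED_DAYS * 24 * 60)

if __name__ == "__main__":
    print(f"confirmed fraction: {confirmed_fraction:.1%}")
    print(f"confirmed per minute: {confirmed_per_minute:.2f}")
```

Both results agree with the rounded figures on the slides: about 12% of reported anomalies are confirmed, at roughly 2 per minute.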

26
More Information
  • In the paper
      • More details about anomaly characteristics
      • End-to-end impacts
      • Classification methodology
      • Optimizations to reduce overhead and improve
        confirmation rate
  • mzhang@cs.princeton.edu
  • http://www.cs.princeton.edu/nsg/infoplane

27
Classifying Anomalies
  • Temporary vs. persistent loops
      • Whether the probe exits the loop before the maximum hop
  • Path changes vs. outages
      • Changes follow different paths to clients
      • Outages stop at intermediate hops
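
These rules can be sketched as a small classifier over traceroute output. It is a simplified reading of the slide, not the classification code from the paper: hops are assumed to be router names with None for unanswered probes, `MAX_HOPS` is a hypothetical hop limit, and a loop counts as persistent only if the probe reaches that limit without arriving at the destination.

```python
from typing import List, Optional

MAX_HOPS = 30  # hypothetical traceroute hop limit


def has_loop(hops: List[Optional[str]]) -> bool:
    """True if any responding router appears more than once in the path."""
    seen = set()
    for hop in hops:
        if hop is None:
            continue  # unanswered probe, nothing to compare
        if hop in seen:
            return True
        seen.add(hop)
    return False


def classify(baseline: list, probe: list, destination: str, max_hops: int = MAX_HOPS) -> str:
    """Classify a forward probe against the baseline path to the same destination."""
    reached = bool(probe) and probe[-1] == destination
    if has_loop(probe):
        # Persistent: still looping when the hop limit is hit. Temporary otherwise.
        if len(probe) >= max_hops and not reached:
            return "persistent loop"
        return "temporary loop"
    if reached:
        return "path change" if probe != baseline else "no anomaly"
    return "outage"  # the probe stopped at an intermediate hop
```

For example, with a baseline of r1, r2, r3, dst, a probe ending at r2 is classified as an outage, while a probe that reaches dst through r4 instead of r3 is a path change.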

[Figure labels: ProbeD, Client]

28
Non-anomalies
  • Non-anomalies
  • Ultrashort anomalies
  • Path-based TTL
  • Aggressive timeout

29
Identifying Forward Outages
  • Forward outages
  • Route change
  • ICMP dest unreachable
  • Forward timeout

30
Loop Effect on RTT
  • How do loops affect RTTs?
  • Loops can incur high latency inflation

31
Loop Effect on Loss Rate
  • How do loops affect loss rates?
  • 65% of temporary and 55% of persistent loops preceded
    by loss rates exceeding 30%

32
Forward Anomaly Effect on RTT
  • How do forward anomalies affect RTTs?
  • Outages and changes can incur latency inflation
  • Outages have more negative effect on RTTs

33
Forward Anomaly Effect on Loss Rate
  • How do forward anomalies affect loss rates?
  • 45% of outages and 40% of changes preceded by loss
    rates exceeding 30%

34
Reducing Measurement Overhead
  • Can we reduce the number of probes?
  • 15 probes can achieve the same accuracy in 80% of
    cases
  • Flow-based TTL

35
Traffic Breakdown By Tiers