Surviving Large Scale Internet Outages

1
Surviving Large Scale Internet Outages
  • Dr. Krishna Kant
  • Intel Research

Acknowledgements: Work supported by the National
Science Foundation. Collaborative work with
A. Sahoo & P. Mohapatra.
2
Outline
  • Overview
  • Routing and Name resolution infrastructures
  • Some large scale failures
  • Routing Vulnerabilities
  • Routing algorithms & their properties
  • Improving inter-domain routing
  • Dealing with Name Resolution Failures
  • Name resolution preliminaries
  • DNS vulnerabilities & solutions

3
The Problem
  • Internet has two critical elements
  • Routing (inter & intra domain)
  • Name resolution
  • How robust are they against large scale
    failures/attacks?
  • How do we improve them?

4
Internet Routing
  • Not a homogeneous network
  • A network of autonomous systems (AS)
  • Large variation in AS sizes: typically heavy tailed.
  • Inter-AS routing
  • Border Gateway Protocol (BGP)
  • Complex configuration parameters
  • Flexible, but serious stability, recoverability,
    and configurability issues
  • Intra-AS routing
  • Usually easier to manage
  • Central control, smaller network,
  • But, can suffer from similar problems

5
Internet Name Resolution
  • Domain Name System (DNS)
  • Translates names to IP addresses.
  • Critical for all networking services
  • Hierarchical structure
  • Caching of data in proxy servers & resolvers
  • DNS Vulnerabilities
  • Complex dependencies; easy to poison
  • Can lead to large scale failures
  • Inability to access sites, or diversion to
    malicious sites.

[Figure: resolving acme.com: application (ftp acme.com) → resolver → DNS proxy server → auth. DNS server → 10.7.196.31]
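The three-level lookup path above can be sketched as a toy Python mock, with dictionary lookups standing in for real DNS messages (the acme.com name and 10.7.196.31 address come from the slide's figure; everything else is invented):

```python
# Toy sketch of the three-level resolution path.
AUTHORITATIVE = {"acme.com": "10.7.196.31"}   # authoritative server's zone data

proxy_cache = {}                              # organization-level DNS proxy cache

def proxy_resolve(name):
    """DNS proxy: answer from cache if possible, else query the auth server."""
    if name in proxy_cache:
        return proxy_cache[name]
    answer = AUTHORITATIVE[name]              # lookup on cache miss
    proxy_cache[name] = answer                # cache for later queries
    return answer

def resolver(name):
    """Client-side (OS) resolver: forwards the application's query to the proxy."""
    return proxy_resolve(name)

# Application (e.g. `ftp acme.com`) asks the resolver for an address.
print(resolver("acme.com"))   # first query: proxy misses, asks auth server
print(resolver("acme.com"))   # second query: served from the proxy cache
```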
6
Large Scale Failures
  • Characteristics
  • Large service impact.
  • Usually non-uniformly distributed, e.g., an
    affected geographical area, hijacked .com domain,
    etc.
  • Why study large scale failures?
  • Several moderate sized incidents already.
  • Larger failures will happen
  • Can cause other undesirable impacts
  • Secondary failures due to large recovery traffic,
  • Substantial imbalance in load,

7
Routing Failures
  • Physical Damage
  • Earthquake, hurricane, high BW cable cuts,
  • SW bugs & configuration errors
  • Incorrect input or output filtering rules
  • Aggregation of large un-owned IP blocks
  • Incompatible policies among ASes
  • Network wide congestion (DoS attack)
  • Malicious route advertisements via worms

8
Name Resolution Failures
  • Compromising name resolution
  • Poisoning (altering/insertion) of address records
  • Doesn't even require compromising the server
  • Extensive caching → more points of entry
  • Substitution of rogue DNS server
  • Security holes due to configuration errors
  • Potential large scale effects
  • Poisoning at higher levels → large scale
    disruption
  • Example: March 2005 .com attack
  • Redirection to malicious sites to collect
    sensitive info

9
Some Significant Failure Events
10
Taiwan Earthquake Dec 2006
  • Major outage in SE Asia, 60% drop in traffic
  • Issues
  • Global traffic passes through a small number of
    seismically active choke points.
  • Luzon strait, Malacca strait, South coast of
    Japan
  • Satellite & overland cables → inadequate backup
    capacity
  • Several countries depend on 1-2 landing pts
  • Outlook Potential repeat performance
  • Economics makes change unlikely.
  • May be exploited by pirates & terrorists
  • Reference: http://master.apan.net/meetings/xian2007/publication/051_Kitamura.pdf

11
Hurricane Katrina (Aug 2005)
  • Major local outages. No major regional cable
    routes through the worst affected areas.
  • Outages persisted for weeks & months. Notable
    after-effects in FL (significant outages 4 days
    later!)
  • Reference:
  • http://www.renesys.com/tech/presentations/pdf/Renesys-Katrina-Report-9sep2005.pdf

12
NY Power Outage (Aug 2003)
  • No. of concurrent network outages vs. time
  • Large ASes suffered less than smaller ones.
  • Many ASes had all routers down for 4 hours.
  • Very similar power outage in Italy, Sept 2003.

13
Slammer Worm (Jan 2003)
  • Worm started w/ buffer overflow of MS SQL.
  • Very rapid replication, huge congestion buildup
    in 10 mins
  • Korea falls out, 5/13 DNS root servers fail,
    failed ATMs,
  • High BGP activity to find working routes.
  • Reference: http://www.cs.ucsd.edu/~savage/papers/IEEESP03.pdf

14
DNS Attack (Jan 2006)
  • Attack Type
  • Authoritative TLD DNS servers attacked using 100
    zombie clients & 51K recursive servers.
  • 55-byte zombie query → 4.2KB response.
  • Responses directed to target name server (w/
    spoofed IP address).
  • Impact
  • Failures in networks in the path including
    transit providers to authoritative TLD DNS
    servers
  • Graph
  • Unanswered queries (Y-axis) vs. Time (X-axis)
  • Red = failure, yellow = slow

Reference: http://www.oecd.org/dataoecd/34/40/38653402.pdf
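The amplification implied by the numbers on this slide is a quick calculation (assuming 1 KB = 1024 bytes, which is an interpretation; the slide gives only "4.2KB"):

```python
# Back-of-envelope amplification factor for the reflection attack above:
# a 55-byte spoofed query elicits a 4.2 KB response aimed at the victim.
query_bytes = 55
response_bytes = 4.2 * 1024          # 4.2 KB

amplification = response_bytes / query_bytes
print(round(amplification, 1))       # each zombie byte becomes ~78 victim bytes
```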
15
Infrastructure Induced Failures
  • En-masse use of backup routes by 4000 Cisco
    routers in May 2007 (Japan)
  • Routing table rewrites → 7 hr downtime in NE
    Japan
  • Ref: http://www.networkworld.com/news/2007/051607-cisco-routers-major-outage-japan.html
  • Akamai CDN failure June 2004
  • Probably widespread failures in Akamai's DNS.
  • Ref: http://www.landfield.com/isn/mail-archive/2004/Jun/0064.html
  • Worldcom router mis-configuration Oct 2002
  • Misconfigured eBGP router flooded internal
    routers with routes.
  • Ref: http://www.isoc-chicago.org/internetoutage.pdf

16
Routing Infrastructure
17
Routing Basics
  • Distance vector based (DV)
  • RIP (Routing Information Protocol).
  • IGRP (Interior Gateway Routing Protocol).
  • Link State Based (LS)
  • OSPF (Open Shortest Path First)
  • IS-IS (Intermediate system to IS)
  • Path Vector Based (PV)
  • BGP (Border Gateway Protocol)
  • Intra-domain (iBGP) & inter-domain (eBGP)
    versions.

18
Distance Vector (DV) Protocols
  • Build routing tables using successive path
    advertisements.
  • May use stale info when handling failures
  • 'Count to infinity' problem; several versions
    fix this.
  • Difficult to use policies

[Figure: example topology (nodes A-F) and routing table for A]
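The distance-vector idea can be sketched in a few lines of Python over an invented four-node topology; real protocols exchange vectors asynchronously between neighbors, but the fixed point reached is the same:

```python
# Minimal distance-vector sketch: repeat Bellman-Ford relaxations until
# no routing-table entry changes (topology and costs are invented).
INF = float("inf")
links = {("A", "B"): 1, ("A", "C"): 2, ("B", "D"): 1, ("C", "D"): 5}
nodes = {"A", "B", "C", "D"}
nbrs = {n: {} for n in nodes}
for (u, v), c in links.items():
    nbrs[u][v] = c
    nbrs[v][u] = c

# dist[n][d] = best known cost from n to d; next_hop[n][d] = neighbor used
dist = {n: {d: (0 if d == n else INF) for d in nodes} for n in nodes}
next_hop = {n: {} for n in nodes}

changed = True
while changed:                      # one round = every node processes every adv.
    changed = False
    for n in nodes:
        for m, c in nbrs[n].items():
            for d in nodes:         # relaxation via neighbor m's advertised vector
                if c + dist[m][d] < dist[n][d]:
                    dist[n][d] = c + dist[m][d]
                    next_hop[n][d] = m
                    changed = True

print(dist["A"]["D"], next_hop["A"]["D"])   # A reaches D via B at cost 2
```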
19
Link State (LS) Protocols
  • Each node keeps a complete adjacency/cost matrix
    & computes shortest paths locally
  • Any failure propagated via flooding
  • Expensive in a large network
  • Loop-free; can use policies easily.

[Figure: example topology (nodes A-E) with link costs 1-6]
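The local computation each link-state node performs is just Dijkstra's algorithm over the flooded topology database; the adjacency dict below is an assumed topology loosely following the figure's costs 1-6:

```python
import heapq

# Link-state sketch: every node holds the full topology and runs
# shortest-path locally (adjacency and costs are assumptions).
graph = {
    "A": {"B": 3, "C": 2},
    "B": {"A": 3, "D": 1},
    "C": {"A": 2, "E": 5},
    "D": {"B": 1, "E": 4},
    "E": {"C": 5, "D": 4},
}

def dijkstra(src):
    """Local shortest-path computation over the flooded topology database."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry
        for v, c in graph[u].items():
            if d + c < dist.get(v, float("inf")):
                dist[v] = d + c
                heapq.heappush(pq, (d + c, v))
    return dist

print(dijkstra("A"))   # e.g. A->E costs min(3+1+4, 2+5) = 7
```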
20
Path Vector Protocols
  • Each node initialized w/ a set of paths for each
    destination
  • Active paths updated much like in DV
  • Explicitly withdraw failed paths (& advertise
    next best)
  • Filtering on incoming/outgoing paths, path
    selection policies
  • Paths A to D
  • Via B, cost 3
  • Via C, cost 4
  • Entire path not stored here (only cost, next hop)
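The selection logic for the A-to-D example above can be sketched as follows; the loop check against the stored path and the fallback on withdrawal are the essential path-vector behaviors:

```python
# Path-vector sketch for one node (A) and one destination (D): keep
# candidate paths, reject looping ones, activate the best, and fall
# back to the next best when the active path is withdrawn.
candidates = {                      # next_hop -> (cost, full path)
    "B": (3, ["A", "B", "D"]),
    "C": (4, ["A", "C", "D"]),
}

def best_path(paths, self_id="A"):
    """Pick the lowest-cost loop-free path (self may appear only once)."""
    valid = {nh: (c, p) for nh, (c, p) in paths.items()
             if p.count(self_id) == 1}
    if not valid:
        return None
    return min(valid.items(), key=lambda kv: kv[1][0])

print(best_path(candidates))        # active route: via B, cost 3
del candidates["B"]                 # B withdraws its path
print(best_path(candidates))        # next best advertised: via C, cost 4
```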

21
Intra-domain Routing under Failures
  • Routing algorithms
  • Link state (OSPF)
  • Flooding can handle failures quickly.
  • Path vector (iBGP)
  • iBGP routers are fully meshed in small networks
    (routing not much of an issue)
  • In large network, route reflectors may be used
    for scalability
  • Can recover rather quickly
  • Single domain of control
  • High visibility, common management network, etc.
  • Easy to configure consistent values at all
    routers
  • iBGP with route reflection shown to suffer from
    oscillations, but can be remedied.
  • Reference: A. Rawat & M.A. Shayman, Preventing
    persistent oscillations and loops in IBGP
    configuration with route reflection, Computer
    Networks, Vol 50, No 18, Dec 2006, pp 3642-3665

22
Inter-domain Routing
  • BGP: Default inter-AS protocol (RFC 1771)
  • Path vector protocol, runs on TCP
  • Scalable, rich policy settings
  • But prone to long convergence delays
  • High packet loss & delay during convergence

23
Inter-domain routing: BGP specifics and
vulnerabilities
24
BGP Routing Table
  • Prefix: origin address for dest & mask
    (e.g., 207.8.128.0/17)
  • Next hop: Neighbor that announced the route
  • One active route, others kept as backup
  • Only active route can be advertised
  • Route attributes -- may be conveyed outside
  • AS_path: Used for loop avoidance.
  • MED (multi-exit discriminator): preferred
    incoming path
  • Local pref: Used for local path selection

25
BGP Messages
  • Message Types
  • Open (establish TCP conn), notification, update,
    keepalive
  • Update
  • Withdraw zero or more old routes
  • Optionally advertise exactly one new route.
  • May need to also advertise sub-prefix
  • E.g., 207.8.240.0/24 which is contained in
    207.8.128.0/17
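The sub-prefix containment in this example can be checked directly with Python's stdlib ipaddress module, which does the mask arithmetic:

```python
import ipaddress

# Check that 207.8.240.0/24 (from the update example above) is
# contained in the aggregate 207.8.128.0/17.
agg = ipaddress.ip_network("207.8.128.0/17")
sub = ipaddress.ip_network("207.8.240.0/24")

print(sub.subnet_of(agg))        # True: the /24 falls inside the /17
print(agg.num_addresses)         # 2**(32-17) = 32768 addresses
```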

26
Routing Process
  • Input policy engine
  • Filter routes by path attributes, prefix, etc.
  • Output policy engine
  • Manipulate attributes, e.g. Local pref., MED,
    etc.
  • Multiple points for possible configuration errors
    & mismatch between ASes

27
BGP Recovery
  • BGP Convergence Delay
  • Time for all routes to stabilize following an
    event
  • Four durations of interest
  • Tup, Tshort, Tlong, Tdown
  • Min. Route Advertisement Interval (MRAI)
  • Applies only to adv., not withdrawals
  • Intended per destination; implemented per
    peer
  • Damps out oscillations

[Graph: convergence delay vs. MRAI]
28
Impact of BGP Recovery
  • Long Recovery Times
  • 3 min. for 30% of isolated failures
  • 15 min. for 10% of cases
  • Longer for larger failures
  • Consequences
  • Connection attempts over invalid routes fail.
  • Big increase in pkt loss (30X) and delay (4X)
  • Compromised QoS

Graphs taken from ref [2], Labovitz et al.
29
BGP Illustration (1)
  • Best path: PSD(N, cost) = X
  • S, D: source & destination nodes
  • N: next hop
  • X: actual path (for illustration only)
  • Sample starting paths to C
  • PBC(D,3) = BDAC, PDC(A,2) = DAC, etc.
  • Paths shown using arrows (all share seg AC)
  • Failure of A
  • BGP does not attempt to diagnose problem or
    broadcast failure events.

30
BGP Illustration (2)
  • NOTE: Affected node names in blue, rest in white
  • A's neighbors install paths not using A as next
    hop →
  • PDC(B,5) = DBFEAC, PEC(F,5) = EFBDAC, PGC(H,6) =
    GHIBDAC
  • Full path unknown → passage of these paths thru A
    is not known!
  • D advertises PDC(B,5) to B
  • Current PBC is via D → B must pick a path not via
    D →
  • B installs PBC(F,4) = BFEAC & advertises it to F
    & I
  • Note: Green indicates first advertisement by B

31
BGP Illustration (3)
  • E advertises PEC = EFBDAC to F
  • Current PFC is via E →
  • F installs PFC(B,4) = FBDAC & advertises to E
    & B
  • G advertises PGC = GHIBDAC to H
  • Current PHC is via G →
  • H installs PHC(I,5) = HIBDAC & advertises to I

32
BGP Illustration (4)
  • B's adv. BFEAC reaches F & I
  • PFC(B,4) = FBDAC is thru B → F withdraws PFC &
    has no path to C!
  • PIC(H,5) = IHGAC is shorter → I retains it.
  • F's adv. FBDAC reaches B; PBC(F,4) = BFEAC is thru
    F →
  • B installs PBC(I,6) = BIHGAC and advertises to
    D, F & I
  • Note: Green text = B's first adv.; grey text = B's
    subsequent adv. (disallowed by MRAI)

33
BGP Illustration (5a)
  • H's adv. HIBDAC reaches I
  • PIC(H,5) = IHGAC is thru H → I installs PIC(B,6) =
    IBDAC & advertises to B & H.
  • B's adv. BIHGAC reaches D, F
  • D updates PDC(B,8) = DBIHGAC (just a local
    update)
  • F updates PFC(B,8) = FBIHGAC & advertises to E
  • w/ MRAI
  • D & F have wrong (lower) cost metric, but will
    still follow the same path thru B.

34
BGP Illustration (5b)
  • B's adv. BIHGAC reaches I
  • PIC(B,6) = IBDAC is thru B → I withdraws PIC &
    has no path to C!
  • w/ MRAI
  • I will continue to use the nonworking path IBDAC.
    Same as having no path.
  • I's adv. IBDAC reaches B & H
  • H changes its path to HIBDAC
  • B's path is thru I, so B installs (C,10) &
    advertises to its neighbors D, F & I

35
BGP Illustration (5c)
  • F's update reaches E
  • E updates its path locally.
  • I's withdrawal of IBDAC reaches H (& also B)
  • H withdraws the path IBDAC & has no path to C!
  • H's withdrawal of HIBDAC reaches G (& also I)
  • G withdraws the path GHIBDAC & has no path to C!
  • w/ MRAI
  • Nonworking paths stay at E, H & G

36
BGP Illustration (6) No MRAI
  • B's adv. BC reaches D, F & I (in some order)
  • D updates its path cost (B,11)
  • F updates its path cost (B,11) & advertises PFC
    to E.
  • I updates its path cost (B,13) & advertises PIC
    to H
  • Final updates
  • F's update FBC reaches E, which updates its path
    locally
  • I's adv. IBC reaches H
  • H updates its path cost (I,14) = HIBC &
    advertises PHC to G
  • G does a local update

37
BGP Illustration (5) w/ MRAI
  • H's adv. HIBDAC reaches I
  • PIC(H,5) = IHGAC is thru H → I installs PIC(B,6) =
    IBDAC & advertises to B & H.
  • I's adv. IBDAC reaches B & H
  • H changes its path to HIBDAC
  • B's path is thru I, so B installs (C,10)
  • When MRAI expires, B advertises to its neighbors
    D, F & I
  • Note: If MRAI is large, path recovery gets delayed

38
BGP Illustration (6) w/ MRAI
  • B's adv. BC reaches D, F & I (in some order)
  • D updates its path cost (B,11)
  • F updates its path cost (B,11) & advertises PFC
    to E.
  • I installs updated path IBC and advertises it
    to H
  • Final updates: Same as for (6)
  • W/ vs. w/o MRAI
  • MRAI avoids some unnecessary path updates (less
    router load)

39
BGP Convergence Delay Analysis
40
Known Analytic Results
  • Lots of work for isolated failures, none on large
    scale failures.
  • Labovitz [1]: Convergence delay bound for full
    mesh networks
  • O(n^3) for average case, O(n!) for worst case.
  • Labovitz [2], Obradovic [3], Pei [8]
  • Convergence delay scales with the length of the
    longest path involved
  • Applies only for unit cost hops
  • Griffin and Premore [4]
  • V-shaped curve of convergence delay wrt MRAI.
  • Message count decreases with MRAI at a decreasing
    rate.

41
Evaluation of LS Failures
  • Evaluation methods
  • Primarily simulation. Analysis is intractable
  • BGP Simulation Tools
  • Several available, but simulation expense is the
    key!
  • SSFNet: scalable, but max 240 nodes on a 32-bit
    machine
  • SSFNet default parameter settings
  • MRAI jittered by 25% to avoid synchronization
  • OSPFv2 used as the intra-domain protocol

42
Topology Modeling
  • Topology generation: BRITE
  • Enhanced to generate arbitrary degree
    distributions
  • Heavy tailed, based on actual measurements.
  • Approx 70% low & 30% high degree nodes.
  • Mostly used 1 router/AS → easier to see trends.
  • Failure topology: geographical placement
  • Emulated by placing all AS routers and ASes on a
    1000x1000 grid
  • The area of an AS ∝ no. of routers in AS

43
Convergence Delay vs. Failure Extent
  • Initial rapid increase, then flattens out.
  • Delays & increase rate both go up with network
    size
  • → Large failures can pose a problem!

44
Delay & Msg Traffic vs. MRAI
  • Small networks in simulation →
  • Optimal MRAI for isolated failures is small (0.375
    s).
  • Main observations
  • Larger failure → larger MRAI more effective

45
Convergence Delay vs. MRAI
  • A V-shaped curve, as expected
  • Curve flattens out as failure extent increases
  • Optimal MRAI shifts to the right with failure
    extent.

46
Impact of AS Distance
  • ASes more likely to be connected to other
    nearby ASes.
  • b indicates the preference for shorter distances
    (smaller b → higher preference)
  • Lower convergence delay for lower b.

47
Improving BGP Convergence Delay
48
Reducing Convergence Delays
  • Many schemes, mostly evaluated for isolated
    failures
  • Some popular schemes
  • Ghost Flushing
  • Consistency Assertions
  • Root Cause Notification
  • Our work (Large scale failure focused)
  • Dynamic MRAI
  • Batching
  • Speculative Invalidation

49
Ghost Flushing
  • Bremler-Barr, Afek & Schwarz, Infocom 2003
  • An adv. implicitly replaces old path
  • GF withdraws old path immediately.
  • Pros
  • Withdrawals will cascade thru ntwk
  • More likely to install new working routes
  • Cons
  • Substantial add'l load on routers
  • Flushing takes away a working route!
  • Install BC →
  • Routes at D, F, I via B will start working
  • Flushing will take them away.

50
Consistency Assertion
  • Pei, Zhao, et al., Infocom 2002
  • If S has two paths S-N1-x-D & S-N2-y-N1-x-D, and
    the first path is withdrawn, then the second path
    is not used (considered infeasible).
  • Pros
  • Avoids trying out paths that are unlikely to be
    working.
  • Cons
  • Consistency Checking can be expensive

[Figure: paths S-N1-x-D and S-N2-y-N1-x-D]
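The assertion can be sketched as a contiguous-segment check over node lists; this is a simplification of the paper's per-prefix bookkeeping, but it captures the rule:

```python
# Consistency-assertion sketch: when a path is withdrawn, any other
# known path that contains the withdrawn path's node sequence as a
# contiguous segment is marked infeasible (it presumably traverses the
# same failed portion of the network).
def contains_segment(path, segment):
    """True if `segment` appears contiguously inside `path`."""
    n, m = len(path), len(segment)
    return any(path[i:i + m] == segment for i in range(n - m + 1))

withdrawn = ["N1", "x", "D"]                    # tail of S-N1-x-D after S
alt_path  = ["S", "N2", "y", "N1", "x", "D"]    # S-N2-y-N1-x-D

if contains_segment(alt_path, withdrawn):
    print("alt path infeasible")                # do not try this path
```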
51
Root Cause Notification
  • Pei, Azuma, Massey, Zhang, Computer Networks, 2004
  • Modify BGP messages to carry root cause (e.g.,
    node/link failure).
  • Pros
  • Avoid paths w/ failed nodes/links → substantial
    reduction in conv. delay.
  • Cons
  • Change to BGP protocol. Unlikely to be adopted.
  • Applicability to large scale failures unclear
    (diagnosis difficult)

[Figure: example topology (nodes A-I, link costs 2, 3, 10) with failed node A]
  • D, E & G diagnose if A or link to A has failed.
  • Propagate this info to neighbors

52
Large Scale Failures: Our Approach
  • What we can't or wouldn't do
  • No coordination between ASes
  • Business issues, security issues, very hard to
    do,
  • No change to wire protocol (i.e., no new msg
    type).
  • No substantial router overhead
  • Solution applicable to both isolated & LS
    failures.
  • What we can do
  • Change MRAI based on network and/or load parms
  • e.g., degree dependent, backlog dependent,
  • Process messages (& generate updates) differently

53
Key Idea: Dynamic MRAI
  • Increase MRAI when the router is heavily loaded
  • Reduces load of route changes.
  • Relationship to large scale failure
  • Larger failure size → greater router loading →
    larger MRAI more appropriate.
  • Router-load-directed MRAI caters to all failure
    sizes!
  • Implementation
  • Queue length threshold based MRAI adjustment.

[Figure: queue-length thresholds th1 & th2 triggering MRAI increase/decrease]
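A minimal sketch of the threshold idea follows; the concrete threshold values, bounds, and adjustment factor are all assumptions, since the slide specifies only that queue-length thresholds drive the adjustment:

```python
# Sketch of queue-length-threshold MRAI adjustment (TH1 < TH2; all
# constants are invented for illustration).
TH1, TH2 = 50, 200            # queue-length thresholds (messages)
MRAI_MIN, MRAI_MAX = 0.375, 30.0
STEP = 2.0                    # multiplicative adjustment factor

def adjust_mrai(mrai, queue_len):
    """Raise MRAI under heavy load, lower it when the backlog drains."""
    if queue_len > TH2:                       # heavily loaded: damp updates
        return min(mrai * STEP, MRAI_MAX)
    if queue_len < TH1:                       # lightly loaded: react faster
        return max(mrai / STEP, MRAI_MIN)
    return mrai                               # between thresholds: hold

mrai = MRAI_MIN
for qlen in [10, 300, 400, 120, 20]:          # simulated backlog samples
    mrai = adjust_mrai(mrai, qlen)
    print(qlen, mrai)
```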
54
Dynamic MRAI: Effect on Delay
  • Change wrt fixed MRAI = 0.375 secs.
  • Improves convergence delay as compared to fixed
    values.

55
Key Idea: Message Batching
  • BGP default FIFO message processing →
  • Unnecessary processing, if
  • A later update (already in queue) changes route
    to dest.
  • Timer expiry before a later msg is processed.
  • Relationship to large scale failure
  • Significant batching (and hence batching
    advantage) likely for large scale failures only.
  • Algorithm
  • A separate logical queue/dest. allows
    processing of all updates to dest as a batch.
  • >1 update from same neighbor → delete older ones.

[Figure: per-destination logical queues holding updates from neighbors A, B, C]
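The per-destination logical queue with supersession can be sketched with an ordered map (the destination and update strings below are invented):

```python
from collections import OrderedDict

# Batching sketch: one logical queue per destination prefix, with at
# most one pending update per neighbor (a newer update from the same
# neighbor supersedes the older one).
queues = {}                                   # dest -> OrderedDict(neighbor -> update)

def enqueue(dest, neighbor, update):
    q = queues.setdefault(dest, OrderedDict())
    q.pop(neighbor, None)                     # drop superseded update, if any
    q[neighbor] = update

def process_dest(dest):
    """Process all pending updates for one destination as a batch."""
    return list(queues.pop(dest, {}).values())

enqueue("C", "B", "path BDAC")
enqueue("C", "F", "path FEAC")
enqueue("C", "B", "withdraw")                 # supersedes B's earlier update
print(process_dest("C"))                      # ['path FEAC', 'withdraw']
```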
56
Batching: Effect on Delay
  • Behavior similar to dynamic MRAI w/o actually
    making it dynamic
  • Combination w/ dynamic MRAI works somewhat
    better.

57
Key Idea: Speculative Invalidation
  • Large scale failure
  • A lot of route withdrawals for the failed AS, say
    X
  • No. of withdrawn paths w/ AS X in AS_path ≥ thres →
    invalidate all paths containing X
  • Implementation Issues
  • Going through the routes for invalidation is
    inefficient
  • Use output route filters at each node
  • Threshold estimation: computed (see paper)
  • Reverting routes to valid state: time-slot based
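The counting rule can be sketched as follows; the threshold value and the withdrawal paths are invented, and a real implementation would use output route filters rather than a per-path check:

```python
from collections import Counter

# Speculative-invalidation sketch: count withdrawals whose AS path
# contains each AS; once a count reaches the threshold, treat every
# path through that AS as invalid.
THRESH = 3
withdraw_count = Counter()
suspect = set()

def on_withdraw(as_path):
    for asn in as_path:
        withdraw_count[asn] += 1
        if withdraw_count[asn] >= THRESH:
            suspect.add(asn)                  # invalidate all paths via asn

def path_valid(as_path):
    return not (suspect & set(as_path))

# Three withdrawals (for different destinations) all pass through X.
for p in [["B", "X", "D1"], ["C", "X", "D2"], ["E", "X", "D3"]]:
    on_withdraw(p)

print(path_valid(["F", "X", "D4"]))           # False: X is speculatively failed
print(path_valid(["F", "G", "D4"]))           # True
```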

58
Effect of Invalidation
  • Avoids exploring unnecessary paths
  • Reduces conv. delay significantly, but
  • May affect connectivity adversely.
  • Implement only at nodes with degree 4 or higher

59
Comparison of Various Schemes
  • CA is the best scheme throughout!
  • GF is rather poor
  • Batching dynamic MRAI do pretty well
    considering their simplicity

60
Routing Recovery Metrics
61
What's the right performance metric?
  • Convergence delay
  • Network centric, not user centric
  • Instability in infrequently used routes is almost
    irrelevant
  • User Centric Metrics
  • Packet loss & packet delays
  • Convergence delay does not correlate well with
    user centric metrics

62
User Centric Metrics
  • Frac. of pkts lost or frac. increase in pkt delay
  • Pkt delay needs E2E monitoring → impractical
  • Metric computation
  • Single number: overall avg over routes & time
  • Distribution wrt routes, time dependent rate,
    etc.

63
Comparison between Schemes
  • Comparisons
  • Consistency assertion (CA), Ghost Flushing (GF),
    Speculative Invalidation (SI)
  • All 3 schemes reduce conv delay substantially,
    but only CA can really reduce the pkt losses!

64
How Schemes affect routes
  • Cumulative time for which there is no valid path
  • T_noroute: Time for which there is no route at
    all
  • T_allinval: Time for which all neighbors
    advertise an invalid route
  • T_BGPinval: Time for which BGP chooses an invalid
    route (even though some neighbor has a valid
    route).
  • GF increases T_noroute the most; CA reduces
    T_allinval the most

65
Changes to Reduce Pkt Losses
  • GF: Difficult to reduce T_noroute. Not attempted.
  • CA: Use best route even if all of them are
    infeasible, but don't advertise infeasible
    routes.
  • Improves substantially
  • SI: Mark the route invalid probabilistically
    depending on fail count (instead of
    deterministically)
  • Improves substantially

66
Routing Misconfiguration
67
BGP Configuration Faults
  • Configuration parameters
  • Which neighboring networks can send traffic?
  • Where does traffic enter and leave the network?
  • How do routers within the network learn routes to
    external destinations?
  • Potential Problems
  • Invisible path: Valid route exists, but not made
    available
  • Invalid route, e.g., routing loop

68
Configuration Checking
  • Fault checking by a tool called RCC
  • N. Feamster & H. Balakrishnan, Detecting BGP
    faults with static analysis, NSDI 05
  • Config faults in every AS, most related to lack
    of coordination
  • Some faults could have global consequences
  • Consistency checking required for each change in
    policies: hard!


69
Conclusions & Open Issues
  • Inter-domain routing does not perform very well
    for large scale failures.
  • Considered several schemes for improvement. Room
    for further work.
  • Convergence delay is not the right metric
  • Defined a pkt loss related metric & a simple
    scheme to improve it.
  • Open Issues for large scale failures
  • Analytic modeling of convergence properties.
  • What aspects affect pkt losses & can we model
    them?
  • Account for policies & AS relationships.
  • Effective & efficient methods for
    misconfiguration detection.

70
Domain Name System: Basics & Vulnerabilities
71
DNS Infrastructure
[Figure: DNS hierarchy (root; TLDs nz, au, sg; zones edu, gov, ips, sa, gb) and the resolution chain: application & O/S resolver → DNS proxy server (cache) → authoritative DNS server]
  • Three levels of name resolution
  • Client side (OS provided resolver)
  • DNS proxy server (organization level)
  • Authoritative server (serves a zone)

72
DNS Basics
  • DNS manages zones
  • A set of names that are under the same authority
  • E.g., ftp.acme.com and www.acme.com under
    acme.com
  • DNS is a "lookup service"
  • Simple queries → no search or 'best fit' answers
  • Limited data expansion capability
  • TTL (Time To Live): The time an RRSet can be
    cached/reused by a non-authoritative server
  • Best matching records
  • Iterative vs. recursive resolution

73
TTL Values Used by DNS Proxies
  • 2.7 million names on dmoz.org
  • Some values, e.g. 1 hr, 1 day, 2 days, dominate
  • Some extremely small values
  • Large TTL
  • Low overhead, but more likely to be stale/incorrect
  • Small TTL
  • High overhead, but up to date.

[Graph: CDF of TTLs]
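The TTL trade-off comes down to how long a non-authoritative server may reuse an entry before re-fetching it; a minimal cache sketch (names, addresses, and TTL values invented):

```python
import time

# Sketch of TTL-bounded caching at a non-authoritative server: an entry
# may be reused until its TTL (seconds) expires, then must be re-fetched.
cache = {}                                   # name -> (answer, expiry_time)

def cache_put(name, answer, ttl):
    cache[name] = (answer, time.monotonic() + ttl)

def cache_get(name):
    entry = cache.get(name)
    if entry is None:
        return None
    answer, expiry = entry
    if time.monotonic() >= expiry:           # TTL elapsed: entry is stale
        del cache[name]
        return None
    return answer

cache_put("www.example.com", "192.0.2.1", ttl=0.05)
print(cache_get("www.example.com"))          # fresh: 192.0.2.1
time.sleep(0.06)
print(cache_get("www.example.com"))          # expired: None
```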
74
DNS Resource Record (RR)
  • Domain name
  • (length, name) pairs, e.g., cisco.com →
    05cisco03com00
  • Record Types
  • DNS internal types
  • Authority: NS, SOA; DNSSEC: DS, DNSKEY, RRSIG,
  • Many others: TSIG, TKEY, CNAME, DNAME,
  • Terminal RR
  • Address records: A, AAAA
  • Informational: TXT, HINFO, KEY, (data carried
    to apps)
  • Non-terminal RR
  • MX, SRV, PTR, w/ domain names resulting in
    further queries.
  • Other fields
  • RL: record length; RDATA: IP address,
    referral,
  • TTL: Time To Live in a cache
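The 05cisco03com00 example above follows directly from the length-prefixed label encoding, which a few lines of Python can reproduce:

```python
# Wire-format encoding of a domain name as (length, label) pairs,
# matching the slide's example: cisco.com -> 05 cisco 03 com 00.
def encode_name(name):
    out = bytearray()
    for label in name.split("."):
        out.append(len(label))               # one length byte per label
        out += label.encode("ascii")
    out.append(0)                            # zero-length root label terminates
    return bytes(out)

def decode_name(data):
    labels, i = [], 0
    while data[i] != 0:                      # stop at the root label
        n = data[i]
        labels.append(data[i + 1:i + 1 + n].decode("ascii"))
        i += 1 + n
    return ".".join(labels)

wire = encode_name("cisco.com")
print(wire.hex())                            # 05636973636f03636f6d00
print(decode_name(wire))                     # cisco.com
```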

75
DNS Attacks
  • Inject incorrect RR into DNS proxy (poisoning)
  • Capture query & send fake response before the
    real response
  • Randomly send fake responses
  • Query interception relatively easy
  • UDP based → don't need any context!
  • DNS query uses 16-bit trans-id to connect query
    w/ response.
  • Randomized in newer implementations, but attacker
    can generate a large number of replies.
  • Response can include additional RRs
  • Intercept updates to authoritative server
  • Technically not poisoning, but a problem

76
Poisoning Consequences
  • Can be exploited in many ways
  • Disallow name resolution
  • Direct all traffic to a small set of servers
  • DDoS attack!
  • Direct to a malicious server to collect info or
    drop malware
  • Scale of attack depends on the level in the
    hierarchy!
  • Poison propagates downwards
  • Set large TTL to avoid expiry
  • Actual scenario in Mar 05 (.com entry poisoned)

77
Kaminsky DNS Attack
  • Attack target: www.abc.com
  • Poisoning of auth server response possible only on
    TTL expiry in the proxy → hard
  • Generate queries for fake x.abc.com (x = 1, 2, 3,
    ...)
  • Supply fake responses with guessed TXID (ahead of
    auth server response)
  • In fake response, delegate www.abc.com to a
    server of attacker's choosing
  • Source port guessing is necessary for this attack
  • Repeat until something works, say 83.abc.com
  • The proxy server now has a valid DNS record for
    83.abc.com.
  • Queries to www.abc.com are now directed to
    attacker's site
  • Reference: http://www.doxpara.com/DMK_BO2K8.ppt

78
DNSchanger Attack
  • Attack via DHCP
  • Doesn't exploit any security vulnerability
  • Depends on ndisprot.sys driver installed on
    infected box that generates fake DHCP server
    offers.
  • Attack Scenario
  • Infected client X connects to some network N
  • A benign client Y requests an IP address for N
  • X responds w/ DHCP offer that supplies rogue DNS
    server address.
  • Y's DNS requests can now be translated to
    arbitrary IP addresses.

79
DNS Robustness
80
Making DNS Robust
  • TSIG (symmetric key crypto)
  • Intended for secure master/slave & proxy comm.
  • Issues: not general; scalability
  • DNSSEC
  • Stops cache poisoning, but issues of overhead,
    infrastructure change, key mgmt, etc.
  • Based on PKI; a symmetric key version also
    exists.
  • Cooperative Lookup
  • Direct requests to responsive clients (CoDNS)
  • Distributed hash table (DHT) structure for DNS
    (CoDoNS)
  • Cooperative checking between clients (DoX)

81
PK-DNSSEC
  • Auth. chain starts from root
  • Parent signs child certificates (avoids lying
    about public key)
  • Encrypted exchange also supplies signed public
    keys
  • F = public key, f = private key

[Figure: PK-DNSSEC exchange: DNS proxy sends Fgov(query) to gov, receives fgov(resp, Fgb); then sends Fgb(query) to gb, receives fgb(resp)]
82
CoDoNS
  • Organize DNS using DHT (distributed hash table).
  • Enhances availability via distribution and
    replication
  • Explicit version control to keep all copies
    current
  • Issues
  • DHT issues (equal capacity nodes)
  • Explicit version control unscalable.
  • Not directed towards poisoning control (but
    DNSSEC can be used)

83
Domain Name Cross-referencing (DoX)
  • Client peer groups
  • Diversity & common interest based
  • Peers agree to cooperate on verification of
    popular records.
  • Mutual verification
  • Assumes that authoritative server is not poisoned.

[Figure: four peers mutually verifying records]
84
Choosing Peers
  • Give & get
  • Give: A peer must participate in verification
    even if it is not interested in the address →
    overhead
  • Get: Immediate poison detection, high data
    currency.
  • Selection of peers
  • Topic channel w/ subscription by peers
  • E.g. names under a Google/Yahoo directory
  • Community channel, e.g., peers within the same
    org
  • Minimizing overhead
  • Verify only popular (perhaps most vulnerable)
    names

85
DoX Verification
  • Uses verification cache for efficiency
  • Verification
  • DNS copy (Rd) = verified copy (Rv) → stop
  • Else send (Ro,Rn) = (Rv,Rd) to all other peers
  • At least m peers agree → stop; else obtain
    authoritative copy Ra; if Ra != Rd, poison
    detected.
  • Agreement procedure
  • Involves local copy Rv & remotely received
    (Ro,Rn)
  • If Rv = Rn → agree, else peer gets auth. copy Ra
  • Several cases, e.g., if Rv = Ro, Ra = Rn → agree
  • Verified copy was obsolete, got correct one now →
    forced removal of obsolete copy
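The verification and agreement steps above can be sketched as plain functions; the peer set, record values, and quorum m below are invented, and real DoX exchanges signed messages between hosts rather than function calls:

```python
# Sketch of the DoX agreement step: a peer compares its own verified
# copy Rv against the received (Ro, Rn) pair, consulting the
# authoritative copy Ra only when needed.
def peer_agrees(rv, ro, rn, fetch_auth):
    """Return True if this peer agrees that Rn is a legitimate record."""
    if rv == rn:                 # peer already holds the new record
        return True
    if rv == ro:                 # peer holds the old record: check upstream
        ra = fetch_auth()
        return ra == rn          # auth server confirms the change
    return False

def verify(rd, rv, peers, m, fetch_auth):
    """Originator side: rd = fresh DNS copy, rv = previously verified copy."""
    if rd == rv:
        return "ok"              # nothing changed, nothing to verify
    votes = sum(peer_agrees(p, rv, rd, fetch_auth) for p in peers)
    if votes >= m:
        return "ok"
    return "ok" if fetch_auth() == rd else "poison detected"

auth = lambda: "2.2.2.2"         # authoritative record (assumed reachable)
print(verify("6.6.6.6", "1.1.1.1", peers=["1.1.1.1", "2.2.2.2"], m=2,
             fetch_auth=auth))   # forged rd -> poison detected
```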

86
Handling Multiple IPs per name
  • DNS directed load distribution
  • Easily handled with set comparison
  • Multiple Views
  • Used to differentiate inside/outside clients
  • All peers should belong to same view (statically
    or by trial & error).
  • Content Distribution Networks (CDNs)
  • Same name translates to different IP addresses in
    different regions
  • Need a flowset based IP address comparison

87
Results: Normal DNS
  • Poison spreads in the cache
  • More queries are affected

88
Results: DoX
  • Poison removed immediately

89
DoX vs. DNSSEC
90
DNS Delegation
  • Delegation
  • Allows portions of name space to be managed
    separately
  • Often used for geographical diversity &
    redundancy
  • How it works
  • sales.dell.com delegated to domain pc1.com
  • Client needs to follow DNS tree to resolve
    pc1.com (unless there is a glue record)

[Figure: delegation tree: root → .com → dell.com → support.dell.com, sales.dell.com → parts.sales.dell.com; sales.dell.com delegated to pc1.com, with pc2.com as a sibling domain]
91
Delegation Related Failures
  • Delegation Characteristics
  • Any domain server may delegate its subspace
    further
  • Delegation done manually w/o any global
    visibility.
  • Delegation related problems
  • Potential for long delegation chains & long
    resolution delays
  • Lame delegation: Child fails to inform parent of
    IP address change
  • Cyclic dependencies
  • Greatly amplified opportunities for compromise,
    poisoning & delegation to rogue servers.

92
Exploiting Delegation
  • How do we exploit delegation to increase DNS
    resilience?
  • Diversity: physical, geographical and
    organizational
  • Careful selection of nodes to delegate to
  • Active monitoring of delegation related
    anomalies.
  • Delegation Sentry
  • Overlay based monitoring of delegations
  • Each DNS server monitored by a selected set of
    peers
  • Need to establish policies for selecting delegate
  • Based on reputation, availability, capacity,
    location, domain, chain length, etc.
  • Mechanisms to defeat compromised servers.
  • Automated checking of problems such as lame
    delegation, cyclic dependencies, etc.

93
Conclusions
  • DNS has numerous vulnerabilities & is easy to attack
  • Several proposed solutions, none entirely
    satisfactory
  • Large deployed base resists significant overhaul
  • Securing DNS remains a challenge
  • Combine the best of DNSSEC, CoDoNS & DoX.
  • Automated detection and correction of delegation
    problems.

94
That's all folks! Questions?
95
BGP References
  • A.L. Barabasi and R. Albert, Emergence of
    Scaling in Random Networks, Science, pp.
    509-512, Oct. 1999.
  • A. Basu, C.L. Ong, et al., Route oscillations in
    IBGP with route reflection, In Proc. ACM SIGCOMM
    (Pittsburgh, PA, Aug. 2002).
  • A. Bremler-Barr, Y. Afek, and S. Schwarz,
    Improved BGP convergence via ghost flushing, in
    Proc. IEEE INFOCOM 2003, vol. 2, San Francisco,
    CA, Mar 2003, pp. 927-937.
  • S. Deshpande and B. Sikdar, On the Impact of
    Route Processing and MRAI Timers on BGP
    Convergence Times, in Proc. GLOBECOM 2004, Vol.
    2, pp. 1147-1151.
  • L. Gao, T.G. Griffin, J. Rexford, Inherently
    safe backup routing with BGP, In Proc. IEEE
    INFOCOM (Anchorage, AK, Apr. 2001)
  • T.G. Griffin and B.J. Premore, An experimental
    analysis of BGP convergence time, in Proc. ICNP
    2001, Riverside, California, Nov 2001, pp. 53-61.
  • T.G. Griffin, F.B. Shepherd, G. Wilfong, The
    stable paths problem and inter-domain routing,
    IEEE/ACM Transactions on Networking 10, 1 (2002),
    232-243.
  • F. Hao, S. Kamat, and P. V. Koppol, "On metrics
    for evaluating BGP routing convergence," Bell
    Laboratories Tech. Rep., 2003.
  • C. Labovitz, G. R. Malan, and F. Jahanian,
    Internet Routing Instability, IEEE/ACM
    Transactions on Networking, vol. 6, no. 5, pp.
    515-528, Oct. 1998.
  • C. Labovitz, A. Ahuja, et al., Delayed internet
    routing convergence, in Proc. ACM SIGCOMM 2000,
    Stockholm, Sweden, Aug 2000, pp. 175-187.
  • C. Labovitz, A. Ahuja, et al., The Impact of
    Internet Policy and Topology on Delayed Routing
    Convergence, in Proc. IEEE INFOCOM 2001, vol. 1,
    Anchorage, Alaska, Apr 2001, pp. 537-546.

96
BGP References
  • A. Lakhina, J.W. Byers, et al., "On the
    Geographic Location of Internet Resources," IEEE
    Journal on Selected Areas in Communications, vol.
    21, no. 6, pp. 934–948, Aug. 2003.
  • R. Mahajan, D. Wetherall, and T. Anderson,
    "Understanding BGP misconfiguration," in Proc.
    ACM SIGCOMM (Pittsburgh, PA, Aug. 2002), pp.
    3–17.
  • A. Medina, A. Lakhina, et al., "BRITE: Universal
    topology generation from a user's perspective,"
    in Proc. MASCOTS 2001, Cincinnati, Ohio, Aug.
    2001, pp. 346–353.
  • N. Feamster and H. Balakrishnan, "Detecting BGP
    configuration faults with static analysis," in
    Proc. NSDI 2005.
  • D. Obradovic, "Real-time Model and Convergence
    Time of BGP," in Proc. IEEE INFOCOM 2002, vol. 2,
    New York, June 2002, pp. 893–901.
  • D. Pei, et al., "A study of packet delivery
    performance during routing convergence," in Proc.
    DSN 2003, San Francisco, CA, June 2003, pp.
    183–192.
  • D. Pei, B. Zhang, et al., "An analysis of
    convergence delay in path vector routing
    protocols," Computer Networks, vol. 30, no. 3,
    Feb. 2006, pp. 398–421.
  • D. Pei, X. Zhao, et al., "Improving BGP
    convergence through consistency assertions," in
    Proc. IEEE INFOCOM 2002, vol. 2, New York, NY,
    June 23–27, 2002, pp. 902–911.
  • Y. Rekhter, T. Li, and S. Hares, "A Border
    Gateway Protocol 4 (BGP-4)," RFC 4271, Jan. 2006.
  • J. Rexford, J. Wang, et al., "BGP routing
    stability of popular destinations," in Proc.
    Internet Measurement Workshop 2002, Marseille,
    France, Nov. 6–8, 2002, pp. 197–202.
  • A. Sahoo, K. Kant, and P. Mohapatra,
    "Characterization of BGP Recovery under
    Large-scale Failures," in Proc. ICC 2006,
    Istanbul, Turkey, June 11–15, 2006.

97
BGP References
  • A. Sahoo, K. Kant, and P. Mohapatra, "Improving
    BGP Convergence Delay for Large Scale Failures,"
    in Proc. DSN 2006, Philadelphia, Pennsylvania,
    June 25–28, 2006, pp. 323–332.
  • A. Sahoo, K. Kant, and P. Mohapatra, "Speculative
    Route Invalidation to Improve BGP Convergence
    Delay under Large-Scale Failures," in Proc. ICCCN
    2006, Arlington, VA, Oct. 2006.
  • A. Sahoo, K. Kant, and P. Mohapatra, "Improving
    Packet Delivery Performance of BGP During
    Large-Scale Failures," submitted to GLOBECOM
    2007.
  • SSFNet: Scalable Simulation Framework. [Online].
    Available: http://www.ssfnet.org/
  • W. Sun, Z. M. Mao, and K. G. Shin,
    "Differentiated BGP Update Processing for
    Improved Routing Convergence," in Proc. ICNP
    2006, Santa Barbara, CA, Nov. 12–15, 2006, pp.
    280–289.
  • H. Tangmunarunkit, J. Doyle, et al., "Does Size
    Determine Degree in AS Topology?," ACM SIGCOMM
    CCR, vol. 31, issue 5, pp. 7–10, Oct. 2001.
  • R. Teixeira, S. Agarwal, and J. Rexford, "BGP
    routing changes: Merging views from two ISPs,"
    ACM SIGCOMM CCR, vol. 35, issue 5, pp. 79–82,
    Oct. 2005.
  • B. Waxman, "Routing of Multipoint Connections,"
    IEEE Journal on Selected Areas in Communications,
    vol. 6, no. 9, pp. 1617–1622, Dec. 1988.
  • B. Zhang, R. Liu, et al., "Measuring the
    internet's vital statistics: Collecting the
    internet AS-level topology," ACM SIGCOMM CCR,
    vol. 35, issue 1, pp. 53–61, Jan. 2005.
  • B. Zhang, D. Massey, and L. Zhang, "Destination
    Reachability and BGP Convergence Time," in Proc.
    GLOBECOM 2004, vol. 3, Dallas, TX, Nov. 2004, pp.
    1383–1389.

98
DNS References
  • R. Arends, R. Austein, et al., "DNS Security
    Introduction and Requirements," RFC 4033, 2005.
  • G. Ateniese and S. Mangard, "A new approach to
    DNS security (DNSSEC)," in Proc. 8th ACM
    Conference on Computer and Communications
    Security, 2001.
  • D. Atkins and R. Austein, "Threat analysis of the
    domain name system," RFC 3833, Aug. 2004.
  • R. Curtmola, A. D. Sorbo, and G. Ateniese, "On
    the performance and analysis of DNS security
    extensions," in Proc. CANS, 2005.
  • M. Theimer and M. B. Jones, "Overlook: Scalable
    name service on an overlay network," in Proc.
    22nd ICDCS, 2002.
  • K. Park, V. Pai, et al., "CoDNS: Improving DNS
    performance and reliability via cooperative
    lookups," in Proc. 6th Symposium on Operating
    Systems Design and Implementation (OSDI), 2004.
  • V. Pappas, Z. Xu, S. Lu, D. Massey, A. Terzis,
    and L. Zhang, "Impact of configuration errors on
    DNS robustness," SIGCOMM CCR, vol. 34, no. 4,
    pp. 319–330, 2004.
  • L. Yuan, K. Kant, et al., "DoX: A peer-to-peer
    antidote for DNS cache poisoning attacks," in
    Proc. IEEE ICC, 2006.
  • L. Yuan, K. Kant, and P. Mohapatra, "A proxy view
    of quality of domain name service," in Proc. IEEE
    INFOCOM, 2007.
  • V. Ramasubramanian and E. G. Sirer, "Perils of
    transitive trust in the Domain Name System," in
    Proc. Internet Measurement Conference, 2005.
  • V. Ramasubramanian and E. G. Sirer, "The design
    and implementation of a next generation name
    service for the internet," in Proc. ACM SIGCOMM,
    2004.

99
Backup
100
Quality of DNS Service (QoDNS)
  • Availability
  • Measures whether DNS can answer the query
  • Prob. of a correct referral when the record is
    not cached
  • Accuracy
  • Prob. of hitting a stale record in the proxy
    cache
  • Poison Propagation
  • Prob(poison at leaf level at time t | level-k
    server poisoned at t = 0)
  • Latency: additional time per query
  • Overhead: additional msgs/BW per query
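A rough way to see how the accuracy metric behaves is a small Monte Carlo model. The sketch below is an assumed model, not taken from the slides: the authoritative record changes at Poisson rate `MU`, queries hit the proxy at Poisson rate `LAM`, and cached entries live for `TTL` seconds (all parameter values illustrative).

```python
import random

# Monte Carlo sketch of the accuracy metric under assumed Poisson traffic.
# A query answered from the cache is "current" iff no modification has
# happened at the authoritative server since the cached copy was fetched.

random.seed(1)
LAM, MU, TTL, N = 5.0, 0.05, 2.0, 200_000

t = 0.0                      # simulation clock
fetch_time = float("-inf")   # when the cached copy was loaded
last_mod = 0.0               # latest modification so far
next_mod = random.expovariate(MU)
current = 0

for _ in range(N):
    t += random.expovariate(LAM)         # next query arrival
    while next_mod <= t:                 # advance the modification process
        last_mod = next_mod
        next_mod += random.expovariate(MU)
    if t - fetch_time > TTL:             # TTL expired: miss, reload fresh copy
        fetch_time = t
    if last_mod <= fetch_time:           # no change since the copy was fetched
        current += 1

acc = current / N                        # estimated accuracy
print(round(acc, 2))
```

With a short TTL relative to the modification rate, most answers are current; raising `TTL` or `MU` drives the accuracy down, matching the intuition that long-lived cache entries risk staleness.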

101
Computation of Metrics
  • Modification at authoritative server
  • Copy obsolete but proxy not aware until TTL
    expires a new query forces a reload

(Timeline figure: modifications at the authoritative server interleaved
with hits and misses at the proxy cache.)
  • X_R: residual time of the query arrival process
  • M_R: residual time of the modification process
  • Y: inter-miss period = TTL + X_R
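Under a Poisson query model this relation is easy to check by simulation: after a miss the entry stays cached for TTL, so the next miss is the first query after the TTL expires, and for Poisson arrivals the residual X_R is again exponential, giving E[Y] = TTL + 1/LAM. A minimal sketch with illustrative parameters:

```python
import random

# Sketch of the renewal relation Y = TTL + X_R for a TTL cache fed by a
# Poisson(LAM) query stream. We record every inter-miss gap and compare
# its mean against TTL + 1/LAM.

random.seed(2)
LAM, TTL, N = 4.0, 3.0, 200_000

t, last_miss, gaps = 0.0, None, []
while len(gaps) < N:
    t += random.expovariate(LAM)            # next query arrival
    if last_miss is None or t - last_miss > TTL:
        if last_miss is not None:
            gaps.append(t - last_miss)      # one inter-miss period
        last_miss = t                       # miss: entry reloaded now

print(round(sum(gaps) / len(gaps), 2))      # ≈ TTL + 1/LAM = 3.25
```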

102
Dealing with Hierarchy
(Figure: DNS tree with a zone at level h-1, e.g. .gov, delegating to
its children at level h.)
  • A miss at a node ⇒ a query at its parent
  • Superposition of the children's miss processes ⇒
    the parent's query arrival process
  • Recursively derive the arrival process bottom-up
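The superposition step can be sketched in the same assumed Poisson model: each child cache thins its query stream into a miss stream of rate roughly 1/(TTL + 1/LAM), and merging the children's miss instants yields the parent's arrival process (all rates illustrative):

```python
import random

# Sketch of the bottom-up derivation: simulate the miss instants of two
# child TTL caches and superpose them into the parent's arrival stream.
# Each child's miss rate is approximately 1/(TTL + 1/LAM), so the parent
# sees roughly the sum of those rates.

random.seed(3)

def miss_times(lam, ttl, horizon):
    """Miss instants of a TTL cache fed by Poisson(lam) queries."""
    t, last, out = 0.0, None, []
    while t < horizon:
        t += random.expovariate(lam)
        if last is None or t - last > ttl:
            out.append(t)
            last = t
    return out

H = 100_000.0
parent_arrivals = sorted(miss_times(5.0, 10.0, H) + miss_times(2.0, 4.0, H))
rate = len(parent_arrivals) / H
print(round(rate, 3))   # ≈ 1/(10 + 1/5) + 1/(4 + 1/2) ≈ 0.320
```

Repeating this merge level by level gives the recursive bottom-up derivation the slide describes.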

103
Deriving QoDNS Metrics
  • Accuracy
  • Prob. leaf record is current
  • (Un)Availability
  • Prob. BMR is Obsolete referral
  • Latency
  • RTT x referrals
  • Overhead
  • Related to referrals for current RRs tries
    for obsolete RRs

(Figure: resolution outcomes for the best matching record (BMR) in the
proxy cache, illustrated on a DNS tree: the BMR is current, the BMR is
an obsolete referral, or the BMR is an obsolete record.)