CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups

Description:

Running 24/7 since June 2003. Roughly 3-4 million reqs/day aggregate ... Critical in HTTP proxy, web crawlers and busy mail servers. 9/5/09. 6 ...

Slides: 29
Provided by: kyou3
Transcript and Presenter's Notes

1
CoDNS: Improving DNS Performance and Reliability
via Cooperative Lookups
2
Background: CoDeeN
  • Academic Content Distribution Network
  • 100 proxy servers on PlanetLab
  • Improve web performance and reliability
  • Running 24/7 since June 2003
  • Roughly 3-4 million reqs/day aggregate
  • Highest-traffic project on PlanetLab

3
How CoDeeN works
CoDeeN Proxy
Each CoDeeN proxy is a forward proxy, a reverse
proxy, and a redirector
4
CoDeeN as Project Factory
  • Deployment Issues
  • Reliability/Security
  • CoDNS: reliable DNS service
  • CoDeploy: large file distribution
  • Dynamic URL rewriting
  • CoMon: monitoring infrastructure

5
Local DNS Lookup Problems
  • CoDeeN experiences DNS problems
  • Local DNS lookup failures
  • 5-second delays for cached records
  • Frequent and widely distributed
  • Local DNS lookups are important
  • LDNS cache hit rate: 80-90%
  • Unpredictable service
  • Directly affecting user-perceived latency
  • Random delay in web browsing
  • Critical in HTTP proxy, web crawlers and busy
    mail servers

6
Experiment For Local Problems
  • Local name lookup every 6 seconds
  • Look up yyy.domain on xxx.domain at PlanetLab
  • e.g., planetlab-2.cs.princeton.edu on
    planetlab-1.cs.princeton.edu
  • Lookup should be handled locally
  • Failure criteria
  • 5 seconds of latency
  • zero answer
  • Rolling average of the past 100 queries
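
The measurement above can be sketched in a few lines of Python. This is only an illustration, not the code used in the experiment: the class and function names are hypothetical, and treating "5 seconds of latency" as "at least 5 seconds" is an assumption.

```python
from collections import deque

FAILURE_LATENCY_S = 5.0   # from the slide; ">=" vs ">" is an assumption
WINDOW = 100              # rolling average over the past 100 queries


def is_failure(latency_s, num_answers):
    """A lookup fails if it is too slow or returns no answer."""
    return latency_s >= FAILURE_LATENCY_S or num_answers == 0


class RollingFailureRate:
    """Tracks the failure fraction over the most recent lookups (hypothetical helper)."""

    def __init__(self, window=WINDOW):
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def record(self, latency_s, num_answers):
        """Add one lookup result and return the current failure rate."""
        self.samples.append(is_failure(latency_s, num_answers))
        return sum(self.samples) / len(self.samples)
```

In the experiment, one such sample would be recorded every 6 seconds per node.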

7
DNS Failure on Various Nodes
  • planetlab1.cs.cornell.edu
  • planetlab2.tamu.edu
  • planetlab2.cs.uoregon.edu

8
(No Transcript)
9
Possible Causes
  • Dead nodes
  • High application-level packet losses
  • DNS
  • DNS?

10
UDP packet losses
11
Who is the murderer?
  • Yes, DNS!
  • Lookup of origin server on cache misses

12
Three kinds of failures (1)
  • Periodic failures
  • The regularity of these failures suggests that
    they are possibly caused by cron jobs running on
    the local nameserver.

13
Three kinds of failures (2)
  • Long lasting continuous failures
  • They are possibly caused by local nameserver
    malfunctioning or extended overloading.

14
Three kinds of failures (3)
  • Sporadic short failures
  • They are likely caused by temporary overloading
    of the local name server.

15
Understanding failure
  • Time
  • Correlation
  • Little

16
Insight for CoDNS
  • As long as we have a reasonable number of healthy
    nameservers, we can use them to mask
    locally-observed delays.

17
Solution? CoDNS
LDNS
CoDNS
local node
18
CoDNS
  • Cooperative name lookup scheme
  • If local server OK, use local server
  • On failure, ask a peer to perform the lookup
  • Selecting nearby peers
  • Liveness / remote nameserver's health
  • Send to improve cache locality
  • Remote request timeout
  • Dynamically adjusted to the local server's health
  • Exponentially backed off for each remote query
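
One possible reading of the timeout bullets is that the wait before each successive remote query grows exponentially. The sketch below illustrates that reading only; the function name, the doubling factor, and the exact schedule are assumptions, not details taken from the slides.

```python
def remote_query_schedule(initial_delay_ms, num_peers, factor=2.0):
    """Return the times (ms after the local query was sent) at which
    each peer would be asked, with the wait growing by `factor` each time.

    Hypothetical helper: the doubling factor is an assumption.
    """
    times = []
    elapsed = 0.0
    wait = float(initial_delay_ms)
    for _ in range(num_peers):
        elapsed += wait          # wait, then send the next remote query
        times.append(elapsed)
        wait *= factor           # back off before the following query
    return times
```

With an initial delay of 200 ms and three peers this gives queries at 200, 600 and 1400 ms. With an initial delay of 0 ms, every peer is queried immediately in this sketch.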

19
Implementation
  • An event-driven master daemon
  • Running on each node, accessible via UDP for
    remote queries, and loopback TCP for
    locally-originated name lookups
  • The slave process
  • Resolves those names by calling gethostbyname()
    and sends the result back to the master.

20
Several details 1
  • Remote query initiation and retries
  • When the past 32 name lookups are all resolved
    locally without using any remote queries, the
    initial delay is set to 200 ms.
  • If the remote query wins more than 50% of the
    last 16 requests, then the delay is set to 0 ms.
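
The two rules above can be sketched as a small state tracker. This is a minimal illustration, not the CoDNS implementation: the class name is hypothetical, and both the 200 ms starting value and the choice to keep the previous delay when neither rule applies are assumptions.

```python
from collections import deque


class InitialDelay:
    """Adjusts the delay before the first remote query (hypothetical helper)."""

    def __init__(self):
        self.delay_ms = 200                   # assumption: start at 200 ms
        self.used_remote = deque(maxlen=32)   # did the lookup need a remote query?
        self.remote_won = deque(maxlen=16)    # did a remote answer arrive first?

    def record(self, used_remote, remote_won):
        """Record one finished lookup and return the updated delay in ms."""
        self.used_remote.append(used_remote)
        self.remote_won.append(remote_won)
        if len(self.used_remote) == 32 and not any(self.used_remote):
            # Past 32 lookups all resolved locally: local server looks healthy.
            self.delay_ms = 200
        elif len(self.remote_won) == 16 and sum(self.remote_won) > 8:
            # Remote queries won more than 50% of the last 16 requests.
            self.delay_ms = 0
        # Otherwise keep the previous delay (assumption).
        return self.delay_ms
```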

21
Several details 2
  • Proximity, Locality and Availability
  • When CoDNS starts, it sends a heartbeat to each
    node in the node list every second. It selects
    the nodes with an RTT below 90 ms as neighbors.
  • After having found enough neighbors, it monitors
    the liveness of each node by sending the
    heartbeat every 30 seconds.
  • If the heartbeat does not arrive within a short
    time period, then the node is excluded from the
    next remote query selection.
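
The neighbor selection and liveness check described above might look roughly like the sketch below. The function names and data shapes are hypothetical, and because the slides do not say how long the "short time period" is, the silence limit is left as a caller-supplied parameter.

```python
RTT_LIMIT_MS = 90   # from the slide: neighbors have an RTT below 90 ms


def select_neighbors(rtts_ms, limit_ms=RTT_LIMIT_MS):
    """Pick nodes whose measured heartbeat RTT is below the limit.

    `rtts_ms` maps node name -> RTT in ms, or None if no reply was seen.
    """
    return sorted(node for node, rtt in rtts_ms.items()
                  if rtt is not None and rtt < limit_ms)


def live_neighbors(neighbors, last_reply_s, now_s, max_silence_s):
    """Keep only neighbors whose last heartbeat reply is recent enough.

    `max_silence_s` is an assumed parameter; the slides give no value.
    """
    return [node for node in neighbors
            if node in last_reply_s
            and now_s - last_reply_s[node] <= max_silence_s]
```

Only the nodes returned by `live_neighbors` would be candidates for the next remote query.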

22
Related approaches
  • Build recursive DNS query ability into every
    local node.
  • In the case of local nameserver failure, the
    local node will contact the root name servers
    directly and try to resolve the hostname lookups
    by itself.
  • Disadvantages
  • Reduces the caching effectiveness.
  • Increases the configuration effort and also
    causes extra management problems.
  • Uses more resources on each node.

23
Related approaches
  • Query the second local name server immediately
  • Aggravates the overload problem
  • Many failures observed are caused by overload
    rather than network packet loss

24
Evaluation
  • Live traffic for one week for CoDeeN (20k - 30k)

25
Finer-grained View
  • Live traffic for one day
  • Effectively flattens the spikes

26
(No Transcript)
27
Availability
  • Add one 9, from 99% to 99.9%

28
More info
  • Contact us!
  • princeton_codeen_at_slices.planet-lab.org
  • CoDeeN: USENIX '04
  • http://codeen.cs.princeton.edu/
  • CoDNS: OSDI '04
  • http://codeen.cs.princeton.edu/codns/
  • CoDeploy
  • http://codeen.cs.princeton.edu/codeploy/
  • CoMon
  • http://codeen.cs.princeton.edu/comon/