CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups

1
CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups
  • KyoungSoo Park, Vivek Pai, Larry Peterson, Zhe Wang
  • Princeton University

2
Domain Name System (DNS)
  • Human-friendly names → IP addresses
  • Operational for over 20 years
  • Essential part of the Web
  • Two components
  • Server-side: name owners
  • Client-side: contacting name owners

3
Two Kinds of DNS Problems
  • Server-side problems [Danzig92, Jung01]
  • Nameserver bugs
  • Misconfigurations
  • Hardening/replacing server infrastructure
  • Client-side problems
  • Between local nameservers (LDNS) and clients
  • Larger memories → higher LDNS hit rate
  • LDNS cache hit rate: 80% - 90%
  • Result: LDNS problems magnified

4
Contributions
  • Measure LDNS problems, causes
  • Client-side DNS helper, CoDNS
  • Communicates with other CoDNS peers
  • Incrementally deployable
  • Works with all DNS lookups (CDN, etc)
  • Benefits
  • Latency reduction: 27-82%
  • Availability: generally adds an extra 9

5
Local DNS Lookup Problems
  • Local DNS lookup failures
  • 5-second delays for cached records
  • Frequent and widely distributed
  • Unpredictable service
  • Directly affects user-perceived latency
  • Random delays in web access
  • Kills HTTP proxies, web services, and busy mail servers

6
Demonstrating Local Problems
  • Local name lookup every 6 seconds
  • Look up yyy.domain on xxx.domain at 200 sites
  • e.g., look up planetlab-2.cs.princeton.edu on planetlab-1.cs.princeton.edu
  • Lookup should be handled locally
  • LDNS is site-shared, NOT PlanetLab's
  • Failure criteria
  • 5 seconds of latency
  • zero answer
  • Rolling average of the past 100 queries
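
The failure criteria and rolling average above can be sketched as a small tracker. This is only an illustration, not the authors' measurement code: the class and parameter names are hypothetical, and "5 seconds of latency" is assumed to mean five seconds or more.

```python
from collections import deque


class FailureWindow:
    """Rolling failure rate over the most recent `size` lookups (hypothetical helper)."""

    def __init__(self, size=100, latency_limit=5.0):
        self.latency_limit = latency_limit   # seconds; assumed threshold
        self.results = deque(maxlen=size)    # True means the lookup failed

    def record(self, latency, answers):
        # A lookup fails if it was too slow or came back with zero answers.
        failed = latency >= self.latency_limit or answers == 0
        self.results.append(failed)          # oldest entry drops out automatically
        return failed

    def failure_rate(self):
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)
```

Calling `record(latency, answers)` once per probe (every 6 seconds in the experiment) and reading `failure_rate()` gives the kind of per-node curve the following slides plot.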

7
Expected DNS Behavior
  • University of Utah
  • Rice University

8
DNS Failure on Various Nodes
  • Cornell
  • Texas A&M
  • University of Oregon

9
Possible Causes
  • Packet loss
  • LDNS overloading
  • Cron jobs
  • Maintenance problems

10
Packet Loss
  • UDP inherently unreliable
  • Single loss triggers query retransmission
  • Less than 0.1% in LAN environment
  • Heavily dependent on local traffic
  • Losses last for 1 min
  • Cable modem/DSL may be worse
  • Our sites have 4 LAN hops, Cable 8

11
Nameserver Overloading
  • University of Michigan
  • University of Torino, Italy
  • Technical University Berlin, Germany

12
Nameserver Overloading
  • Many responses take 1 sec to 5 sec
  • No timeout but simply late
  • Pr(Overloading | DNS Failure): 90% for some nodes
  • Bursts cause socket buffer overflow
  • Experiment in the paper

13
Cron jobs/heavy processes
  • University of Tennessee 1
  • University of Tennessee 2
  • Moscow State University

14
Why Do We See This?
  • Large memory → large cache
  • Large cache → high hit rate
  • High hit rate → CPU load drops
  • Low CPU load → add more services
  • More services → memory pressure
  • Memory pressure → failures, delays

15
Maintenance Problems
  • /etc/resolv.conf
  • Configured to dead nameservers
  • Blocking services
  • Outside the firewall
  • Complete outage
  • Berkeley Millennium nodes, 3/17/2004
  • Blackout / natural disaster
  • Duke hit by hurricane Isabel, Fall/2003

16
Solution: CoDNS
[Diagram labels: My LAN, Client Programs, CoDNS, My Machine]

17
CoDNS: Cooperative DNS
  • Cooperative name lookup scheme
  • If local server OK, use local server
  • When failing, ask peers to do lookup
  • Insurance model
  • Share risk, share benefits
  • Aggregate name lookup service
  • Aggregate cache effect
  • Incrementally deployable, no server change
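
The "use local if OK, ask peers when failing" idea can be sketched as follows. This is a simplified illustration rather than the CoDNS implementation: the resolvers are passed in as plain callables, and the function name and the `remote_delay` parameter are assumptions made for the example.

```python
import concurrent.futures as cf


def _usable(fut):
    # A finished lookup is usable if it neither raised nor returned None.
    return fut.exception() is None and fut.result() is not None


def cooperative_lookup(name, local_resolve, peer_resolve, remote_delay=0.2):
    """Ask the local resolver first; if it has no usable answer within
    `remote_delay` seconds, also ask a peer and return the first usable answer."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        local = pool.submit(local_resolve, name)
        done, _ = cf.wait([local], timeout=remote_delay)
        if done and _usable(local):
            return local.result()            # local server is OK: use it
        peer = pool.submit(peer_resolve, name)
        pending = {local, peer}
        while pending:
            done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
            for fut in done:
                if _usable(fut):
                    return fut.result()      # whichever answers first wins
        return None                          # both local and peer failed
    finally:
        pool.shutdown(wait=False)            # do not block on a slow lookup
```

Because the peer is asked only after the local resolver is late or has failed, a healthy local server sees no extra traffic, which matches the insurance model described above.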

18
Design Issues
  • Proximity / liveness
  • Select nearby peers
  • Monitor nameservers' health as well
  • Request locality
  • Pick same peer for same names
  • Highest Random Weight (HRW)
  • Remote request timeout
  • Dynamically adjusted to local server's health
  • Exponentially backed off for each remote query
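
Highest Random Weight (rendezvous) hashing gives the request locality listed above: every node ranks the same peer first for a given name, and removing a peer only moves the names that peer owned. A minimal sketch follows. The slides do not say which hash function CoDNS uses, so SHA-256 and the function names here are assumptions.

```python
import hashlib


def hrw_weight(peer, name):
    # Hash the (peer, name) pair into a pseudo-random integer weight.
    digest = hashlib.sha256(f"{peer}|{name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_peers(name, peers, count=1):
    """Return the `count` peers with the highest weight for `name`."""
    ranked = sorted(peers, key=lambda p: hrw_weight(p, name), reverse=True)
    return ranked[:count]
```

For example, `pick_peers("www.example.com", live_peers)` always returns the same peer while the set of live peers is unchanged, so repeated lookups of a name land in that peer's cache.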

19
How many peers needed?
  • One extra peer halves avg response time!
[Figure: Average Response Time]

20
Effect of Timeout
  • 200ms - slope changes
  • 500ms - virtually flat
[Figure: Average Number of Lookups]

21
Deployment Status
  • CoDNS deployed on all PlanetLab nodes
  • Running 24/7 since August 2003
  • CoDeeN uses CoDNS as primary DNS
  • After CoDeeN's own DNS cache
  • Remote query configuration
  • One extra peer, 200ms starting timeout
  • On total LDNS failure, send immediately
  • Monitor 10 nodes as neighbors
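
The slides give a 200ms starting timeout, immediate remote queries on total LDNS failure, and exponential back-off, but not the exact rule. One possible reading, sketched under those assumptions (the function name, the doubling factor, and the interpretation of "back-off" as a growing wait before each further peer are all hypothetical):

```python
def remote_query_delays(local_totally_failed, extra_peers=1, start_ms=200, factor=2):
    """Milliseconds to wait before sending each successive remote query."""
    delays = []
    # On total local failure the first remote query goes out immediately.
    delay = 0 if local_totally_failed else start_ms
    for _ in range(extra_peers):
        delays.append(delay)
        # Back off exponentially for each further remote query.
        delay = max(delay, start_ms) * factor
    return delays
```

With the deployed setting of one extra peer, this yields a single wait of 200 ms when the local server is healthy and no wait when it is down.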

22
Evaluation
  • Live traffic for one week for CoDeeN (20k - 30k)

23
Finer-grained View
  • Live traffic for one day
  • Effectively flattens the spikes

[Figure labels: Cache miss, WAN problem, LDNS, CoDNS]

24
Availability
  • Adds one 9, from 99% to 99.9%
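
As a quick worked check of what "one extra 9" means, the number of nines is the negative base-10 logarithm of the unavailable fraction. The helper names below are illustrative only.

```python
import math


def nines(availability):
    """Number of nines for an availability given as a fraction (0.99 is two nines)."""
    return -math.log10(1.0 - availability)


def downtime_hours_per_year(availability, hours_per_year=365 * 24):
    """Expected unavailable hours in a non-leap year."""
    return (1.0 - availability) * hours_per_year
```

Going from 99% to 99.9% cuts the unavailable time by a factor of ten, from roughly 87.6 hours to roughly 8.76 hours per year.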

[Figure labels: CoDNS, LDNS]

25
What About CDNs?
  • CDN uses DNS to pick the "best" replica
  • CoDNS used only when LDNS failing
  • Pro: faster lookup time
  • Con: maybe worse/farther replica
  • In reality, the peer's answer is better 30% of the time

26
CDN Pro/Con Measurements
27
Overhead
  • Heartbeat packet: 1/sec; memory: 600KB
  • Remote queries: median 25% more lookups

28
CoDNS Alternatives
  • In the paper
  • Private Nameservers
  • Secondary Nameservers
  • TCP Queries

29
Conclusion
  • Local failures relatively frequent
  • Failure time dominates latency
  • CoDNS provides low-cost insurance service
  • Masks local failures
  • Reduces avg response time by 27-82%
  • Improves availability by an additional 9
  • Incrementally deployable, no server change

30
More Information
  • CoDNS homepage
  • http://codeen.cs.princeton.edu/codns/
  • Email
  • princeton_codeen_at_slices.planet-lab.org

31
TCP Queries
  • DNS supports TCP
  • Failure rate is better
  • Not used except for AXFR or when the answer is big
  • Simple TCP
  • 2 packets vs. 9 packets (3 + 2 + 4 = 9)
  • Persistent TCP
  • ACK overhead
  • Resource waste for idle connections
  • Vulnerable to overloading/server down

32
S-TCP, P-TCP, UDP, CoDNS
  • Replay test (10,792 names) on 107 nodes
  • CoDNS First

33
CoDNS vs. Persistent TCP
[Figure: Average Response Time (ms)]

34
Lookup Distribution
  • Live traffic on a node for one week (20,333 queries)
  • 2,043,135 ms / 5,809,265 ms = 35.1%
  • 100 ms vs. 286 ms per query
  • Great improvement on W-CDF
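
The per-query figures above follow from dividing each total by the query count. The short check below uses only the numbers on the slide. The variable names are neutral placeholders, since the transcript does not label which total belongs to which system.

```python
queries = 20333
smaller_total_ms = 2043135   # first total on the slide
larger_total_ms = 5809265    # second total on the slide

per_query_small = smaller_total_ms / queries   # about 100 ms per query
per_query_large = larger_total_ms / queries    # about 286 ms per query
ratio = smaller_total_ms / larger_total_ms     # about 0.35 of the larger total

print(f"{per_query_small:.0f} ms vs. {per_query_large:.0f} ms per query")
print(f"ratio of totals: {ratio:.1%}")
```

The computed ratio is just over 35%, in line with the 35.1 shown on the slide.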

35
Analysis on Wins
  • 80% at first query, 95% at second query
[Figure axis: Percentage]