Diagnostic Steps - PowerPoint PPT Presentation

About This Presentation
Title:

Diagnostic Steps

Description:

Diagnostic Steps Les Cottrell SLAC Presented at the Networks for Non Networkers 2nd International Workshop, 21-22 June 2005, Edinburgh, Scotland – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Diagnostic Steps


1
Diagnostic Steps
  • Les Cottrell SLAC
  • Presented at the Networks for Non Networkers 2nd
    International Workshop, 21-22 June 2005,
    Edinburgh, Scotland
  • http://www.slac.stanford.edu/grp/scs/net/talk05/nfnn2-jun05.ppt

Partially funded by DOE/MICS Field Work Proposal
on Internet End-to-end Performance Monitoring
(IEPM), also supported by IUPAP
2
Overview
  • Goal: provide a practical guide to debugging
    common problems (Brian covered high performance
    problems)
  • Why is diagnosis difficult yet important?
  • Local host
  • Ping, Traceroute, PingRoute
  • Looking at time series
  • Locating bottlenecks
  • Correlation of problems with routes
  • More tools and problems
  • Where is a node
  • Who do you tell, what do you say?
  • Case studies and More Information

3
Why is diagnosis difficult?
  • Internet's evolution as a composition of
    independently developed and deployed protocols,
    technologies, and core applications
  • Diversity, highly unpredictable, hard to find
    invariants
  • Rapid evolution & change, no equilibrium so far
  • Findings may be out of date
  • Measurement/diagnosis not high on vendors' list of
    priorities
  • Resources/skill focus on more interesting and
    profitable issues
  • Tools lacking or inadequate
  • Implementations are flaky, not fully tested with
    new releases

4
Add to that
  • Distributed systems are very hard
  • "A distributed system is one in which I can't get
    my work done because a computer I've never heard
    of has failed." (Butler Lampson)
  • Network is deliberately transparent
  • The bottlenecks can be in any of the following
    components
  • the applications
  • the OS
  • the disks, NICs, bus, memory, etc. on sender or
    receiver
  • the network switches and routers, and so on
  • Problems may not be logical
  • Most problems are operator errors,
    configurations, bugs
  • When building distributed systems, we often
    observe unexpectedly low performance
  • the reasons for which are usually not obvious
  • Just when you think you've cracked it, in steps
    security
  • Firewall, NAT boxes etc.
  • Block pings, traceroute looks like port scan,
    diagnostic tool ports are blocked
  • ISPs worried about providing access to core,
    making results public, privacy issues

5
Sources of problems
  • Host errors
  • TCP buffers, heavy utilization
  • Duplex mismatch (Ethernet)
  • Misconfigured router/switches
  • Including routing errors, especially for backup
    paths
  • Bad equipment, wiring/fiber problem
  • Congestion

6
Local Host (also see NDT later)
  • Usual Unix tools (uname -a, top, vmstat, iostat
    ..)
  • Is the host overloaded? Do you have a gateway
    (route) and a name server (nslookup)? Which
    interface are you using? (mii-tool (needs root)
    gives duplex & speed, a common error source)
  • Net: ifconfig -a (look at errors), netstat -a
  • Is server running (if you know port)?
  • >telnet localhost 2811
  • Trying 127.0.0.1
  • 220 aftpexp04.bnl.gov GridFTP Server 1.12 GSSAPI
    type Globus/GSI wu-2.6.2 (gcc32dbg,
    1069715860-42) ready.
  • telnet> quit
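
The telnet check above can also be scripted. The following is a minimal sketch, not part of the original talk: the function name `check_server` and the 3 second timeout are assumptions chosen for illustration. It opens a TCP connection to a port and reads whatever greeting the server sends, which is what the telnet session does by hand.

```python
import socket


def check_server(host, port, timeout=3.0):
    """Try a TCP connect to host:port; return (reachable, banner).

    banner is the first chunk of text the server sends, or "" if it
    sends nothing within the timeout (many servers wait for the client).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                data = sock.recv(256)
            except OSError:
                data = b""  # connected, but no greeting arrived
            return True, data.decode("ascii", errors="replace").strip()
    except OSError:
        # Connection refused, timed out, host unreachable, name not resolved...
        return False, ""
```

For the example on this slide one would call `check_server("localhost", 2811)`; a result of `(False, "")` suggests that nothing is listening on that port or that a firewall is in the way.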

7
Local Host - LISA
  • Localhost Information Service Agent (LISA) is a
    Java Web Start application which provides:
  • Integration with MonALISA
  • Complete Monitoring of the System (Load, CPU,
    Memory, Disk, Disk IO, Paging, Processes, Network
    Traffic and Connectivity...).
  • History and instantaneous
  • Filters to trigger actions when predefined
    conditions are detected.
  • A user-friendly GUI to present the monitoring
    information.
  • Optimization modules for distributed
    applications.
  • It is a lightweight application that can be
    easily deployed on any system.
  • Modules for end-to-end network measurements
    (e.g. IPERF).
  • See monalisa.caltech.edu/dev_lisa.html

8
Ping
  • Ping to localhost, ping to gateway, ping to a
    well-known host, ping to the relevant remote host
  • Use IP address to avoid nameserver problems
  • Look for connectivity, loss, RTT, jitter, dups
  • May need to run for a long time to see some
    pathologies (e.g. bursty loss due to DSL loss of
    sync)
  • Try flood pings if you suspect rate limiting
  • Use synack or sting if ICMP blocked
  • www-iepm.slac.stanford.edu/tools/synack/

9
Ping example
  • Callouts on the command: packet size (-s 64),
    repeat count (-c 6), remote host
  • syrup:/home ping -c 6 -s 64 thumper.bellcore.com
  • PING thumper.bellcore.com (128.96.41.1): 64 data
    bytes
  • 72 bytes from 128.96.41.1: icmp_seq=0 ttl=240
    time=641.8 ms
  • 72 bytes from 128.96.41.1: icmp_seq=2 ttl=240
    time=1072.7 ms
  • 72 bytes from 128.96.41.1: icmp_seq=3 ttl=240
    time=1447.4 ms
  • 72 bytes from 128.96.41.1: icmp_seq=4 ttl=240
    time=758.5 ms
  • 72 bytes from 128.96.41.1: icmp_seq=5 ttl=240
    time=482.1 ms
  • --- thumper.bellcore.com ping statistics ---
  • 6 packets transmitted, 5 packets received, 16%
    packet loss
  • round-trip min/avg/max = 482.1/880.5/1447.4 ms
  • Callouts on the output: RTT (time=), missing seq
    (icmp_seq=1), summary (last two lines)
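
The quantities to look for in ping output (loss, RTT, missing sequence numbers, jitter) can be pulled out mechanically. This is a minimal sketch, not a tool from the talk: `summarize` is a hypothetical helper, it assumes the common `icmp_seq=N ... time=T ms` reply format and a first sequence number of 0 as in the example, and it takes jitter to be the mean absolute difference between consecutive RTTs, which is only one of several definitions in use.

```python
import re

# Reply lines from the ping example on this slide (seq 1 was lost).
SAMPLE = """\
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
"""

REPLY = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")


def summarize(ping_output, sent, first_seq=0):
    """Summarize ping replies: loss, missing sequence numbers, RTT stats."""
    replies = {int(m.group(1)): float(m.group(2))
               for m in REPLY.finditer(ping_output)}
    rtts = [replies[seq] for seq in sorted(replies)]
    missing = [seq for seq in range(first_seq, first_seq + sent)
               if seq not in replies]
    # Simple jitter estimate: mean absolute change between successive RTTs.
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return {
        "loss_pct": 100.0 * (sent - len(rtts)) / sent,
        "missing": missing,
        "min": min(rtts) if rtts else None,
        "avg": sum(rtts) / len(rtts) if rtts else None,
        "max": max(rtts) if rtts else None,
        "jitter": sum(diffs) / len(diffs) if diffs else None,
    }


summary = summarize(SAMPLE, sent=6)
print(summary)
```

On the sample above this reports sequence number 1 as missing and reproduces the min/avg/max of the ping summary line.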
10
3rd party ping (via Looking Glass)
  • Find servers
  • www.caida.org/analysis/routing/reversetrace/
  • Example: http://stats.geant.net/cgi-bin/lg/lg.cgi
  • Ok for checking connectivity and RTT but not for
    losses (unless huge)

Looking Glass Results - ch1.ch.geant.net
Date: Mon May 30 21:28:39 2005 GMT
Query: Ping <IP_Addr FQDN>
Real Query: ping rapid count 5
Argument(s): www.slac.stanford.edu
PING www8.slac.stanford.edu (134.79.18.163): 56 data bytes
!!!!!
--- www8.slac.stanford.edu ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max/stddev = 167.316/172.212/191.222/9.506 ms
11
Traceroute
  • Traceroute to remote host
  • Is the route direct, or does it go over congested
    commercial nets?
  • Reverse traceroute from remote host to you or 3rd
    party
  • www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
  • www.tracert.com/
  • www.caida.org/analysis/routing/reversetrace/

CAIDA Mouse sensitive map
12
Traceroute
  • Callouts on the command: probes/hop (-q 1), max
    hops (-m 20), remote host
  • UDP/ICMP tool to show route packets take from
    local to remote host
  • 17cottrell@flora06>traceroute -q 1 -m 20
    lhr.comsats.net.pk
  • traceroute to lhr.comsats.net.pk (210.56.16.10),
    20 hops max, 40 byte packets
  • 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2)
    0.642 ms
  • 2 RTR-MSFC-DMZ.SLAC.Stanford.EDU
    (134.79.135.21) 0.616 ms
  • 3 ESNET-A-GATEWAY.SLAC.Stanford.EDU
    (192.68.191.66) 0.716 ms
  • 4 snv-slac.es.net (134.55.208.30) 1.377 ms
  • 5 nyc-snv.es.net (134.55.205.22) 75.536 ms
  • 6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
  • 7 gin-nyy-bbl.teleglobe.net (192.157.69.33)
    154.742 ms
  • 8 if-1-0-1.bb5.NewYork.Teleglobe.net
    (207.45.223.5) 137.403 ms
  • 9 if-12-0-0.bb6.NewYork.Teleglobe.net
    (207.45.221.72) 135.850 ms
  • 10 207.45.205.18 (207.45.205.18) 128.648 ms
  • 11 210.56.31.94 (210.56.31.94) 762.150 ms
  • 12 islamabad-gw2.comsats.net.pk (210.56.8.4)
    751.851 ms
  • 13 *
  • 14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms

  • Callouts on the output: location (from router
    names), long delay at hop 11 (satellite), no
    response at hop 13 (lost packet or router ignores
    probes)
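
One way to spot a long-delay link in traceroute output is to look for the biggest RTT increase from one responding hop to the next. This is a minimal sketch rather than a tool from the talk: `parse_hops` and `largest_jump` are hypothetical helpers, and the regular expression assumes one probe per hop (`-q 1`) in the `N name (address) T ms` layout shown on this slide, with `*` for a hop that did not answer.

```python
import re

SAMPLE = """\
traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
 2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
 3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
 4 snv-slac.es.net (134.55.208.30) 1.377 ms
 5 nyc-snv.es.net (134.55.205.22) 75.536 ms
 6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
 7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
 8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
 9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
"""

# Hop number, then either "name (addr) rtt ms" or "*" for no response.
HOP = re.compile(r"^\s*(\d+)\s+(?:\S+\s+\([\d.]+\)\s+([\d.]+)\s+ms|\*)", re.M)


def parse_hops(text):
    """Return [(hop_number, rtt_ms or None), ...] from traceroute output."""
    return [(int(m.group(1)), float(m.group(2)) if m.group(2) else None)
            for m in HOP.finditer(text)]


def largest_jump(hops):
    """Return (hop_number, rtt_increase_ms) for the biggest RTT increase
    relative to the previous hop that responded, or None if fewer than
    two hops responded."""
    best = None
    prev_rtt = None
    for hop, rtt in hops:
        if rtt is None:
            continue  # silent hop: nothing to compare
        if prev_rtt is not None:
            increase = rtt - prev_rtt
            if best is None or increase > best[1]:
                best = (hop, increase)
        prev_rtt = rtt
    return best


print(largest_jump(parse_hops(SAMPLE)))
```

On the sample route the largest increase is at hop 11, which matches the long-delay callout. RTTs to intermediate routers are noisy (routers answer probes at low priority), so a single run should be treated as a hint only.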
13
RTT from California to world
[Figure: RTT (ms) versus longitude (degrees), and frequency distribution of RTT (ms), measured from Palo Alto, CA (W. Coast). Labelled groups: W. Coast US, E. Coast US, Brazil / S. America, Europe. 300 ms level marked; annotation "0.3, 0.6 c". Data from CAIDA Skitter project.]
14
Traceroute server results
  • Example: www.slac.stanford.edu/cgi-bin/nph-traceroute.pl

[Screenshot of the traceroute server page. Callouts: related info, security warning, traceroute, field to enter IP address or name.]
15
Pingroute
  • Ping routers along route, e.g. a tool to install
    that helps
  • www.slac.stanford.edu/comp/net/fpingroute.pl
  • or www.slac.stanford.edu/comp/net/fpingroute.pl
    if fping N/A

15cottrell@noric04>fpingroute.pl
fpingroute.pl does a traceroute to the selected host. For each
of the hops along the route it then uses fping to ping each node
(in parallel) 'count' times. Output includes traceroute
information, RTTs, losses for 100 and 'size' byte pings.
Version=0.21, 8/24/04
Usage: fpingroute.pl Opts host
  where host is the remote host's IP address or name,
  e.g. www.slac.stanford.edu
Opts:
  -c count    default=10
  -s size     default=1400
  -i initial  default=1
Example: fpingroute.pl -i 3 -c 10 -s 1400 www.triumf.ca
16
Pingroute example
  • May help tell where losses start
  • Will need many pings if losses small

[Figure: pingroute output. Callouts: start of losses? but?; start of sustained losses; routers may not respond.]
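
How many pings "many" is can be estimated. If losses were independent with probability p per packet, the chance of seeing at least one loss in n pings is 1 - (1 - p)^n, so n must be at least ln(1 - confidence) / ln(1 - p). The sketch below is an illustration only: `pings_needed` is a hypothetical helper, and the independence assumption is optimistic because real losses are often bursty.

```python
import math


def pings_needed(loss_rate, confidence=0.95):
    """Smallest number of pings that shows at least one loss with the
    given confidence, assuming independent losses at loss_rate."""
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - loss_rate))


for p in (0.1, 0.01, 0.001):
    print(f"loss {p:.1%}: {pings_needed(p)} pings")
```

Seeing a single loss only shows that loss exists. Estimating the loss rate itself with any precision, or telling which hop it starts at, takes considerably more probes than this lower bound.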
17
Look at time series
  • Look at history plots (PingER, AMP, IEPM-BW,
    ISPs, own border router etc.), when did problem
    start, how big an effect is it?
  • Assumes you know proximity of paths for which
    there are archived active measurements to the
    path that you are interested in
  • Also that relevant measurements exist
  • www-iepm.slac.stanford.edu/pinger/
  • amp.nlanr.net/
  • ISPs' plots
  • Abilene: http://stryper.uits.iu.edu/abilene/
  • GEANT: http://stats.geant.net/usagemap/usagemap
  • RIPE: http://www.ripe.net/projects/ttm/Plots/
  • ESnet: http://measurement.es.net/ (OWAMP)
  • Collaboration between Internet2/ESnet/Geant to
    provide access to router measurements holds
    promise
  • Look at traceroute histories (see later)

18
Example time series
  • Look for change in measured value
  • Note time
  • Correlate

Italy disconnected
19
Find location of a bottleneck
  • Look at hops along the path
  • Pingroute (see earlier)
  • If possible look at utilizations or active probes
    launched from there
  • Pipechar (son of pathchar, pchar)
  • Send packets of varying sizes to each router
    along path
  • Look at RTT as a function of packet size
  • From slope deduce bandwidth
  • Differentiate to find capacity at each hop
  • However pchar is no longer supported, pathchar is
    very slow, pipechar has uncertain support (ask
    Brian)
  • Packet size variation limited to 1-MTU (1500)
    Bytes, so on fast links timing is difficult, with
    the result that estimates may not be reliable
  • Find pipechar at http://www.dsd.lbl.gov/OldProjects/NCS/
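
The variable-packet-size idea behind pathchar-style tools can be sketched in a few lines. This is not the pipechar algorithm itself, only an illustration under simplifying assumptions: the replies are small and of fixed size, so only the forward path adds size-dependent delay; the RTTs used are the minimum over many probes, so queueing is filtered out; and `slope` and `hop_capacities` are hypothetical names. Under those assumptions the slope of RTT against packet size at hop k is the sum of 1/bandwidth over links 1..k, so differencing successive slopes gives each link's capacity.

```python
def slope(sizes, rtts):
    """Least-squares slope of RTT (seconds) against packet size (bytes)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(rtts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, rtts))
    den = sum((x - mean_x) ** 2 for x in sizes)
    return num / den


def hop_capacities(sizes, rtts_per_hop):
    """Estimate each link's capacity in bits/s from per-hop RTT-vs-size data.

    rtts_per_hop[k][i] is the minimum RTT to hop k for packet size sizes[i].
    Returns None for a hop whose slope does not increase (noise dominates).
    """
    capacities = []
    prev_slope = 0.0
    for rtts in rtts_per_hop:
        s = slope(sizes, rtts)
        delta = s - prev_slope          # seconds per byte added by this link
        capacities.append(8.0 / delta if delta > 0 else None)
        prev_slope = s
    return capacities


# Synthetic data: a 100 Mbit/s link followed by a 10 Mbit/s link.
SIZES = [100, 500, 1000, 1500]
HOP1 = [0.001 + 8 * b / 100e6 for b in SIZES]
HOP2 = [0.005 + 8 * b / 100e6 + 8 * b / 10e6 for b in SIZES]
print(hop_capacities(SIZES, [HOP1, HOP2]))
```

On a 1 Gbit/s link a 1500 byte packet adds only 12 microseconds, which is comparable to timer and queueing noise; this is the limitation on fast links noted above.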

20
Divide & Conquer
  • Abilene has hosts at major PoPs running bwctl
  • So make measurements from end to middle to ID
    loss of performance
  • http://e2epi.internet2.edu/pipes/ami/bwctl/

21
Correlate with routes (traceanal)
22
Visualizing traceroutes
  • One compact page per day
  • One row per host, one column per hour
  • One character per traceroute to indicate
    pathology or change (usually a period (.),
    meaning no change)
  • Identify unique routes with a number
  • Be able to inspect the route associated with a
    route number
  • Provide for analysis of long term route
    evolutions

Route at start of day, gives idea of route
stability
Multiple route changes (due to GEANT), later
restored to original route
Period (.) means no change
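
The compact encoding described on this slide can be sketched as follows. This is an illustration, not the traceanal code: `encode_routes` is a hypothetical helper that numbers each distinct route the first time it is seen, prints the route number at the start of the row and at every change, and prints a period when the route is the same as in the previous traceroute. The pathology characters of the real display are not modelled.

```python
def encode_routes(routes, known=None):
    """Encode a time-ordered list of routes as one compact row.

    routes: iterable of hop-address sequences, one per traceroute.
    known:  optional dict mapping route tuples to route numbers, so that
            numbering stays stable across rows or days.
    Returns (row_text, known).
    """
    known = {} if known is None else known
    cells = []
    previous = None
    for route in routes:
        route = tuple(route)
        if route not in known:
            known[route] = len(known)       # next unused route number
        cells.append("." if route == previous else str(known[route]))
        previous = route
    return " ".join(cells), known


# Hypothetical hop lists, for illustration only.
via_a = ["192.0.2.1", "192.0.2.9", "198.51.100.7"]
via_b = ["192.0.2.1", "203.0.113.5", "198.51.100.7"]
row, table = encode_routes([via_a, via_a, via_b, via_a, via_a])
print(row)
```

Keeping the `known` table lets a reader look up the full route behind any number, as the slide asks for.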
23
Changes in network topology (BGP) can result in
dramatic changes in performance
[Figure: snapshot of traceroute summary table (rows: remote host, columns: hour), with samples of traceroute trees generated from the table; Los-Nettos (100Mbps) labelled.]
Notes:
1. Caltech misrouted via Los-Nettos 100Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto-detect and notify
[Figure: ABwE measurements, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am. Y axis: Mbits/s. Curves: dynamic BW capacity (DBC), cross-traffic (XT), available BW (= DBC-XT). Callouts: drop in performance (from original path SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100Mbps)-Caltech); ESnet-LosNettos segment in the path (100 Mbits/s); back to original path.]
Changes detected by IEPM-Iperf and ABwE
24
Moving towards application
  • See Brian's talk
  • Try user application (mem to mem & disk to disk)
  • GridFTP, bbcp, bbftp
  • Iperf or thrulay (also provides RTT) to test TCP
    or UDP throughput
  • dast.nlanr.net/Projects/Iperf/
  • www.internet2.edu/shalunov/thrulay/
  • NDT
  • What are the interface speeds?
  • What is the bottleneck?
  • Is there a duplex mismatch?
  • Are buffers set right (both ends)?

25
NDT example (Rich Carlson)
26
Other tools
  • Ntop
  • Summarizes libpcap (sniffer) info
  • Internet2 Detective
  • Tests connectivity to I2, bandwidth, multicast,
    IPv6
  • Can run as Java applet
  • http://detective.internet2.edu/
  • NLANR Internet Advisor
  • Ethereal, tcpdump, snoop for masochists
  • Passive tools
  • Netflow for characterizing network, spotting
    abnormalities, e.g.
  • www.itec.oar.net/abilene-netflow
  • www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
  • SNMP based tools

27
And then
  • Wireless
  • Avoid peer-to-peer/ad-hoc connections
  • Disable connecting to ad-hoc (set infrastructure
    only)
  • Disable bridging
  • How to do it varies by OS (XP, OSX, Linux)
  • Ad hoc can still interfere if on same channel
  • Tools to locate an access point (e.g.
    Yellow-Jacket)
  • See
  • www2.slac.stanford.edu/comp/net/wireless/Wireless-Meeting-Handout.mht
  • NAT boxes may block or not support application
  • Private addresses
  • 10.0.0.0 - 10.255.255.255: a single class A net
  • 172.16.0.0 - 172.31.255.255: 16 contiguous class
    Bs
  • 192.168.0.0 - 192.168.255.255: 256 contiguous
    class Cs
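
When a traceroute or a log shows an address, it is useful to know at once whether it falls in one of the three private ranges above, since such an address is not globally routable and usually means a NAT box is in the path. A minimal sketch using Python's standard `ipaddress` module; the helper name `is_private_range` is an assumption for illustration, and the CIDR prefixes are the usual notation for the three ranges listed on this slide.

```python
import ipaddress

# The three private ranges from the slide, written as CIDR prefixes.
PRIVATE_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),      # 10.0.0.0 - 10.255.255.255
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0 - 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.0.0 - 192.168.255.255
]


def is_private_range(address):
    """True if address lies in one of the three private IPv4 ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in PRIVATE_NETS)


for addr in ("10.1.2.3", "172.31.255.255", "172.32.0.0", "134.79.18.163"):
    print(addr, is_private_range(addr))
```

The explicit list is used instead of the module's broader `is_private` attribute, which also covers loopback, link-local and other reserved blocks.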

28
Where is a host?
  • Beware: some of the following information is
    ephemeral; in general use heuristics with Google
  • Google Internet country codes for TLDs
  • Host may not be in TLD country, especially
    developing regions often use proxies elsewhere
  • Location may be encoded in router name
  • ipls=Indianapolis, snv=Sunnyvale
  • Name server lookup to find hostname given IP
    address
  • 47cottrell@netflow>nslookup 210.56.16.10
  • Server: localhost
  • Address: 127.0.0.1
  • Name: lhr.comsats.net.pk
  • Address: 210.56.16.10
  • Use a whois server, e.g.
  • www.networksolutions.com/cgi-bin/whois/whois
    (Americas & Africa)
  • www.ripe.net/cgi-bin/whois (Europe)
  • www.apnic.net/ (Asia)
  • May identify site name, address, contact, etc,
    not all domains are in databases (e.g. will not
    find comsats.net.pk)
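
The reverse lookup and the router-name heuristic can be combined in a short script. A minimal sketch under stated assumptions: `reverse_lookup` and `location_hint` are hypothetical helpers, and the `CITY_CODES` table contains only the two codes given on this slide, so a real table would have to be built up by hand for the networks of interest.

```python
import re
import socket

# Only the two codes mentioned on the slide; extend by hand as needed.
CITY_CODES = {"ipls": "Indianapolis", "snv": "Sunnyvale"}


def reverse_lookup(ip):
    """Return the hostname registered for ip, or None if there is none."""
    try:
        name, _aliases, _addresses = socket.gethostbyaddr(ip)
        return name
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def location_hint(hostname):
    """Return the first city whose code appears as a token in hostname."""
    for token in re.split(r"[.\-]", hostname.lower()):
        if token in CITY_CODES:
            return CITY_CODES[token]
    return None


print(location_hint("snv-slac.es.net"))
```

A typical use would be `location_hint(reverse_lookup(address) or "")` for each hop of a traceroute. A name such as `nyc-snv.es.net` names both ends of a link, so the hint says only that the router is at one end of it.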

29
Where is a host cont.
  • Find the Autonomous System (AS) administering
  • Form giving AS for domain name
  • http://www.fixedorbit.com/search.htm
  • Gives AS number, name, adjacent ASs, web page for
    AS
  • Given an AS find out more about it
  • Use http://bgp.potaroo.net/cidr/, go to bottom and
    enter AS into form
  • Gives ISP name, web page, phone number, email,
    hours etc.
  • Review list of AS's ordered by Upstream AS
    Adjacency
  • www.telstra.net/ops/bgp/bgp-as-upsstm.txt
  • Tells what AS is upstream of an ISP

30
Where is a host - cont.
  • May be able to get latitude & longitude
  • http://www.hostip.info/index.html
  • http://www.ip2location.com/
  • But it is a subscriber service; however it is
    probably best for developing regions
  • Triangulate pings from landmarks (in development)
  • planetlab-01.ipv6.lip6.fr:10000/cbg.php

31
Who you gonna tell?
  • Local network support people
  • Internet Service Provider (ISP): usually done by
    local networker
  • Usually will know immediate one, e.g.
    trouble@es.net
  • Use puck.nether.net/netops/nocs.cgi to find ISP
  • Use www.telstra.net/ops/bgp/bgp-as-upsstm.txt to
    find upstream ISPs
  • Well managed sites and ISPs maintain a list of
    email addresses such as abuse@ or postmaster@
    that one can send email to, for example to
    complain about spam etc.
  • This follows an Internet recommendation (RFC
    2142).
  • Some less helpful sites do not provide such
    services, for more on these, see RFC-ignorant.org

32
What ya gonna tell 'em?
  • Describe problem with details
  • What is affected?
  • Application, host OS (uname -a), NIC (ifconfig,
    route)
  • How is it affected?
  • Non responsiveness, unable to contact remote host
  • Slow performance (see Brian's talk), packet loss
  • When did it start?
  • Send ping output between hosts
  • Send traceroute forward & reverse if possible
  • Maybe use -I (ICMP option)
  • NDT
  • Identify when it started
  • If complex think about creating web page with
    details
  • Top, vmstat, pingroute, pipechar, application
    output (GridFTP, iperf)

33
Web page examples & Case studies
  • http://www.slac.stanford.edu/grp/scs/net/case/html/
  • http://e2epi.internet2.edu/case-studies/

34
More Information
  • Tutorial on monitoring
  • www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
  • RFC 2151 on Internet tools
  • www.freesoft.org/CIE/RFC/Orig/rfc2151.txt
  • Network monitoring tools
  • www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
  • www.caida.org/tools/taxonomy/
  • Network Performance Tools: an I2 Cookbook
  • e2epi.internet2.edu/network-perf-wk/tools-cookbook.pdf
  • Network Monitoring sites
  • www.slac.stanford.edu/comp/net/wan-mon/netmon.html

35
Pathology Encodings
Change but same AS
No change
Probe type
End host not pingable
Change in only 4th octet
Hop does not respond
Stutter
ICMP checksum
Multihomed
! Annotation (!X)
36
Navigation
traceroute to CCSVSN04.IN2P3.FR (134.158.104.199), 30 hops max, 38 byte packets
1 rtr-gsr-test (134.79.243.1) 0.102 ms
13 in2p3-lyon.cssi.renater.fr (193.51.181.6) 154.063 ms !X
  • rt  firstseen   lastseen    route
  • 0   1086844945  1089705757  ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
  • 1   1087467754  1089702792  ...,192.68.191.83,171.64.1.132,137,...,131.215.xxx.xxx
  • 2   1087472550  1087473162  ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
  • 3   1087529551  1087954977  ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
  • 4   1087875771  1087955566  ...,192.68.191.83,137.164.23.41,137.164.22.37,...,(n/a),131.215.xxx.xxx
  • 5   1087957378  1087957378  ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
  • 6   1088221368  1088221368  ...,192.68.191.146,134.55.209.1,134.55.209.6,...,131.215.xxx.xxx
  • 7   1089217384  1089615761  ...,192.68.191.83,137.164.23.41,(n/a),...,131.215.xxx.xxx
  • 8   1089294790  1089432163  ...,192.68.191.83,137.164.23.41,137.164.22.37,(n/a),...,131.215.xxx.xxx
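
The firstseen and lastseen columns look like Unix epoch timestamps (seconds since 1 January 1970 UTC); that reading is an assumption, though the values do fall in mid-2004, shortly before the talk. Under that assumption they can be turned into readable dates with the standard library; `to_utc` and `route_lifetime` are hypothetical helper names.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def route_lifetime(first_seen, last_seen):
    """Time between the first and last sighting of a route."""
    return to_utc(last_seen) - to_utc(first_seen)


# Route 0 from the table above.
first_seen, last_seen = 1086844945, 1089705757
print(to_utc(first_seen).isoformat())
print(to_utc(last_seen).isoformat())
print(route_lifetime(first_seen, last_seen))
```

A route whose firstseen and lastseen are equal, such as routes 5 and 6, was observed in a single traceroute only.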

37
History Channel
38
AS information