Transcript and Presenter's Notes

Title: Internet Monitoring


1
Internet Monitoring
  • Les Cottrell SLAC
  • Presented at NUST Institute of Information
    Technology (NIIT) Rawalpindi, Pakistan, March 15,
    2005

Partially funded by DOE/MICS Field Work Proposal
on Internet End-to-end Performance Monitoring
(IEPM), also supported by IUPAP
2
Overview
  • Why is measurement difficult yet important?
  • LAN vs WAN
  • SNMP
  • Effects of measurement interval
  • Passive
  • Active
  • Tools including some results on Digital Divide
  • Trouble shooting
  • Tools, how to find things and who to tell
  • New challenges

3
Why is measurement difficult?
  • Internet's evolution as a composition of
    independently developed and deployed protocols,
    technologies, and core applications
  • Diversity, highly unpredictable, hard to find
    invariants
  • Rapid evolution and change, no equilibrium so far
  • Findings may be out of date
  • Measurement not high on vendors' list of
    priorities
  • Resources/skill focus on more interesting and
    profitable issues
  • Tools lacking or inadequate
  • Implementations poor and not fully tested with new
    releases
  • ISPs worried about providing access to core,
    making results public, privacy issues
  • The phone connection-oriented model (Poisson
    distributions of session length etc.) does not
    work for Internet traffic (heavy tails,
    self-similar behavior, multi-fractals etc.)

4
Add to that
  • Distributed systems are very hard
  • "A distributed system is one in which I can't get
    my work done because a computer I've never heard
    of has failed." (Butler Lampson)
  • Network is deliberately transparent
  • The bottlenecks can be in any of the following
    components
  • the applications
  • the OS
  • the disks, NICs, bus, memory, etc. on sender or
    receiver
  • the network switches and routers, and so on
  • Problems may not be logical
  • Most problems are operator errors,
    configurations, bugs
  • When building distributed systems, we often
    observe unexpectedly low performance
  • the reasons for which are usually not obvious
  • Just when you think you've cracked it, in steps
    security

5
Why is measurement important?
  • End users and network managers need to be able to
    identify and track problems
  • Choosing an ISP, setting a realistic service
    level agreement, and verifying it is being met
  • Choosing routes when more than one is available
  • Setting expectations
  • Deciding which links need upgrading
  • Deciding where to place collaboration components
    such as a regional computing center, software
    development
  • How well will an application work (e.g. VoIP)
  • Application steering (e.g. forecasting)
  • Grid middleware, e.g. replication manager

6
LAN vs WAN
  • Measuring the LAN
  • Network admin has control, so:
  • Can read MIBs from devices
  • Can, within limits, passively sniff traffic
  • Know the routes between devices
  • Manually for small networks
  • Automated for large networks
  • Measuring the WAN
  • No admin control, unless you are an ISP
  • Can't read information out of routers
  • May not be able to sniff/trace traffic due to
    privacy/security concerns
  • Don't know route details between points; they may
    change and are not under your control, though you
    may be able to deduce some of them
  • So typically have to make do with what can be
    measured from end to end with very limited
    information from intermediate equipment hops.

7
SNMP (Simple Network Management Protocol)
  • Example of an Application, usually built on UDP
  • De facto standard for network management
  • Created by IETF to address short term needs of
    TCP/IP
  • Consists of
  • Management Information Bases (MIBs)
  • Store information about managed objects (host,
    router, switch etc.): system status info,
    performance and configuration data
  • Remote Network Monitoring (RMON) is a management
    tool for passively watching line traffic
  • SNMP communication protocol to read out data and
    set parameters
  • Polling protocol: manager asks questions, agent
    responds

8
SNMP Model
  • NMS contains manager software to send and receive
    SNMP messages to and from Agents
  • Agent is a software component residing on a
    managed node; it responds to SNMP queries, performs
    updates and reports problems
  • The MIB resides on nodes and at the NMS and is a
    logical description of all network management data.

[Figure: a Network Management Station (NMS) connected over a TCP/IP net to several managed nodes, each with an Agent and MIB]
9
SNMP version 1 limitations
  • Authentication is inadequate
  • Password (community string) placed in clear in
    SNMP messages
  • MIB variables must be polled separately, i.e.
    entire MIB cannot be fetched with single command
  • SNMPv2 and v3 attempt to address these and other
    limitations
  • Despite limitations, SNMP has been a huge success
  • Provides device and link utilization (byte,
    packets) and errors
  • Lots of facilities/tools built around SNMP to
    provide reports for sites
  • Security concerns typically limit access to a very
    limited set of owners/admins
  • E.g. ISPs won't let you poll their devices

10
SNMP Examples
  • Using MRTG to display Router bits/s MIB variable

[Figure: CERN trans-Atlantic traffic]
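
A bits/s figure of this kind is typically derived from two successive polls of an octet counter. The sketch below is only an illustration: it assumes a 32-bit counter such as MIB-II ifInOctets, at most one wrap between polls, and made-up sample values. It is not MRTG's actual code.

```python
def counter_rate_bits_per_sec(prev_octets, curr_octets, interval_secs, counter_bits=32):
    """Average bits/s between two polls of an octet counter.

    Assumes the counter wrapped at most once between the two polls.
    """
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    modulus = 2 ** counter_bits
    # Modular subtraction gives the right delta even if the counter wrapped once.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_secs


def utilization_percent(rate_bits_per_sec, link_bits_per_sec):
    """Rate expressed as a percentage of the link capacity."""
    return 100.0 * rate_bits_per_sec / link_bits_per_sec


# Hypothetical polls 300 s (5 minutes) apart on an OC3 (155 Mbits/s) link.
rate = counter_rate_bits_per_sec(1_000, 37_501_000, 300)
print(rate, utilization_percent(rate, 155e6))
```

At high link speeds a 32-bit octet counter can wrap more than once within a 5 minute poll, which this sketch cannot detect.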
11
Averaging/Sampling intervals
  • Typical measurements of utilization are made for
    5 minute intervals or longer in order not to
    create much impact.
  • Interactive human interactions require second or
    sub-second response
  • So it is interesting to see the difference
    between measurements made over different time
    frames.

12
Utilization with different averaging times
  • Same data, measured Mbits/s every 5 secs
  • Average over different time intervals
  • Does not get a lot smoother
  • May indicate multi-fractal behavior

[Figure: three panels showing the same data averaged over 5 secs, 5 mins and 1 hour]
13
Averages vs maxima
  • Maximum of all 5 sec samples can be a factor of 2
    or more greater than the average over 5 minute
    intervals
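
The effect of the averaging interval can be sketched with a few lines of Python. The samples below are synthetic (a steady 10 Mbits/s with a 50 second burst at 90 Mbits/s) and are not the data behind the plots; the function name is an illustrative choice.

```python
def block_averages(samples, block_len):
    """Average consecutive, non-overlapping blocks of block_len samples."""
    return [
        sum(samples[i:i + block_len]) / block_len
        for i in range(0, len(samples) - block_len + 1, block_len)
    ]


# Synthetic 5-second samples covering 5 minutes (60 samples), in Mbits/s.
samples = [10.0] * 50 + [90.0] * 10

one_min = block_averages(samples, 12)    # 12 x 5 s = 1 minute
five_min = block_averages(samples, 60)   # 60 x 5 s = 5 minutes
peak_to_average = max(samples) / five_min[0]

print(one_min, five_min, peak_to_average)
```

With these made-up numbers the 5 second peak is several times the 5 minute average, which is the kind of gap the slide describes.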

14
Lot of heavy FTP activity
  • The difference depends on traffic type
  • Only 20% difference between max and average

15
Passive vs. Active Monitoring
  • Active injects traffic on demand
  • Passive watches things as they happen
  • Network device records information
  • Packets, bytes, errors kept in MIBs, retrieved
    by SNMP
  • Devices (e.g. probe) capture/watch packets as
    they pass
  • Router, switch, sniffer, host in promiscuous
    mode (tcpdump)
  • Complementary to one another
  • Passive
  • does not inject extra traffic, measures real
    traffic
  • Polling to gather data generates traffic, also
    gathers large amounts of data
  • Active
  • provides explicit control on the generation of
    packets for measurement scenarios
  • testing what you want, when you need it.
  • Injects extra artificial traffic
  • Can do both, e.g. start an active measurement and
    look at it passively

16
Passive tools
  • SNMP
  • Hardware probes, e.g. Sniffer, NetScout; can be
    stand-alone or remotely accessed from a central
    management station
  • Software probes: snoop, tcpdump; require
    promiscuous access to the NIC, i.e. root/sudo
    access
  • Flow measurement: netramet, OCxMon/CoralReef,
    Netflow

17
Example Passive site border monitoring
  • Use Cisco Netflow in Catalyst 6509 with MSFC, on
    SLAC border
  • Gather about 200MBytes/day of flow data
  • The raw data records include source and
    destination addresses and ports, the protocol,
    packet, octet and flow counts, and start and end
    times of the flows
  • Much less detailed than saving headers of all
    packets, but good compromise
  • Top talkers history and daily (from and to), TLDs,
    VLANs, protocol and application utilization
  • Use for network security
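
A top-talkers report can be sketched as a simple aggregation over flow records. The record layout below (dicts with src, dst, proto and octets keys) and the addresses are hypothetical, chosen only for illustration; real Netflow exports carry more fields and need a collector to parse them.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Sum octets per source address and return the n largest (address, octets) pairs."""
    totals = Counter()
    for flow in flows:
        totals[flow["src"]] += flow["octets"]
    return totals.most_common(n)


# Hypothetical flow records using documentation-only addresses.
flows = [
    {"src": "192.0.2.10", "dst": "198.51.100.5", "proto": "tcp", "octets": 5_000_000},
    {"src": "192.0.2.20", "dst": "198.51.100.5", "proto": "udp", "octets": 40_000},
    {"src": "192.0.2.10", "dst": "203.0.113.7", "proto": "tcp", "octets": 2_500_000},
    {"src": "192.0.2.30", "dst": "203.0.113.7", "proto": "tcp", "octets": 900_000},
]

print(top_talkers(flows, 2))
```

Keying the counter on flow["dst"], flow["proto"] or a port number instead gives the other breakdowns listed above.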

18
SLAC Traffic profile
  • SLAC offsite links: OC3 to ESnet, 1Gbps to
    Stanford U thence OC12 to I2, OC48 to NTON
  • Profile: bulk-data xfer dominates

[Figure: Mbps in and Mbps out by application (HTTP, iperf, SSH, FTP, bbftp), shown for 2 days and for the last 6 months]
19
Top talkers by protocol
  • Volume dominated by single application: bbcp

[Figure: MBytes/day per hostname, log scale from 1 to 10000]
20
Flow sizes
  • Heavy tailed, in and out, packets and bytes; UDP
    flows shorter than TCP
  • 75% TCP-in < 5kBytes, 75% TCP-out < 1.5kBytes
    (< 10 pkts)
  • UDP: 80% < 600Bytes (75% < 3 pkts), 10 more TCP
    than UDP
  • Top UDP: AFS (> 55%), Real (25%), SNMP (1.4%)

[Figure: flow size distributions, with features labelled SNMP, Real A/V and AFS file server]
21
Flow lengths
  • 60% of TCP flows last less than 1 second
  • Would expect TCP streams to be longer lived
  • But 60% of UDP flows last over 10 seconds, maybe
    due to heavy use of AFS

22
Some Active Measurement Tools
  • Ping: connectivity, RTT and loss
  • flavors of ping, fping, Linux vs Solaris ping
  • but blocking and rate limiting
  • Alternative: synack, but can look like DoS attack
  • Sting: measures one-way loss
  • Traceroute
  • How it works, what it provides
  • Reverse traceroute servers
  • Traceroute archives
  • Combining ping and traceroute:
  • traceping, pingroute
  • Pathchar, pchar, pipechar, bprobe, abing etc.
  • Iperf, netperf, ttcp, FTP

23
Ping
  • ICMP client/server application built on IP
  • Client sends ICMP echo request, server sends reply
  • Server usually in kernel, so reliable and fast
  • User can specify number of data bytes. Client
    puts timestamp in data bytes. Compares timestamp
    with time when echo comes back to get RTT
  • Many flavors (e.g. fping) and options
  • packet length, number of tries, timeout,
    separation
  • Ping localhost (127.0.0.1) first, then gateway IP
    address etc.

24
Ping example
  • syrup/home ping -c 6 -s 64 thumper.bellcore.com
  • PING thumper.bellcore.com (128.96.41.1) 64 data
    bytes
  • 72 bytes from 128.96.41.1 icmp_seq=0 ttl=240
    time=641.8 ms
  • 72 bytes from 128.96.41.1 icmp_seq=2 ttl=240
    time=1072.7 ms
  • 72 bytes from 128.96.41.1 icmp_seq=3 ttl=240
    time=1447.4 ms
  • 72 bytes from 128.96.41.1 icmp_seq=4 ttl=240
    time=758.5 ms
  • 72 bytes from 128.96.41.1 icmp_seq=5 ttl=240
    time=482.1 ms
  • --- thumper.bellcore.com ping statistics ---
  • 6 packets transmitted, 5 packets received, 16%
    packet loss
  • round-trip min/avg/max = 482.1/880.5/1447.4 ms

[Callouts: -c is the repeat count, -s the packet size, followed by the remote host; time is the RTT; icmp_seq=1 is the missing sequence number; the last lines are the summary]
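
The summary line can be reproduced from the individual replies. This is a small sketch using the five RTTs shown above; the function name and dictionary keys are illustrative choices, not part of any ping implementation.

```python
def ping_summary(sent, rtts_ms):
    """Loss percentage and min/avg/max RTT from the replies that came back."""
    received = len(rtts_ms)
    summary = {
        "sent": sent,
        "received": received,
        "loss_percent": 100.0 * (sent - received) / sent,
    }
    if rtts_ms:  # RTT statistics are undefined if nothing was received
        summary["min_ms"] = min(rtts_ms)
        summary["avg_ms"] = sum(rtts_ms) / received
        summary["max_ms"] = max(rtts_ms)
    return summary


# RTTs from the example: 6 packets sent, the reply to icmp_seq=1 never arrived.
rtts_ms = [641.8, 1072.7, 1447.4, 758.5, 482.1]
summary = ping_summary(6, rtts_ms)
print(summary)
```

The loss works out to 1 in 6, about 16.7%, which ping reports truncated to 16%.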
25
Traceroute
  • UDP/ICMP tool to show route packets take from
    local to remote host
  • 17cottrell@flora06> traceroute -q 1 -m 20
    lhr.comsats.net.pk
  • traceroute to lhr.comsats.net.pk (210.56.16.10),
    20 hops max, 40 byte packets
  • 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
  • 2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
  • 3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
  • 4 snv-slac.es.net (134.55.208.30) 1.377 ms
  • 5 nyc-snv.es.net (134.55.205.22) 75.536 ms
  • 6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
  • 7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
  • 8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
  • 9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
  • 10 207.45.205.18 (207.45.205.18) 128.648 ms
  • 11 210.56.31.94 (210.56.31.94) 762.150 ms
  • 12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
  • 13 *
  • 14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms

[Callouts: -q is the number of probes per hop, -m the max hops, followed by the remote host; hop 13 gave no response (lost packet, or router ignores the probe)]
26
Reverse traceroute servers
  • Reverse traceroute server runs as CGI script in
    web server
  • Allow measurement of route from other end.
    Important for asymmetric routes. See e.g.
  • www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
  • CAIDA map of reverse traceroute servers
  • www.caida.org/analysis/routing/reversetrace/

27
Pingroute
  • Run traceroute, then ping each router n times
  • helps identify where in route the problems start
    to occur
  • Routers may not respond to pings, or may treat
    pings directed at them differently from other
    packets

28
Path characterization
  • Pathchar
  • sends multiple packets of varying sizes to each
    router along route
  • measures minimum response time
  • plot min RTT vs packet size to get bandwidth
  • calculate differences to get individual hop
    characteristics
  • measures, for each hop, BW, queuing and delay
  • can take a long time
  • Pipechar/abing
  • Also sends back-to-back packets and measures
    separation on return
  • Much faster
  • Finds bottleneck

[Figure: back-to-back packets crossing a bottleneck; the minimum spacing is set at the bottleneck and the spacing is preserved on higher speed links]
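
Both ideas reduce to short calculations. The sketch below is a simplification under stated assumptions: no cross traffic, and replies whose size does not depend on the probe size, so only the forward path contributes to the slope. All function names and sample values are illustrative and not taken from pathchar, pipechar or abing.

```python
def bottleneck_bits_per_sec(packet_bytes, arrival_gaps_secs):
    """Packet-pair estimate: packet size divided by the minimum arrival spacing."""
    return packet_bytes * 8 / min(arrival_gaps_secs)


def fit_slope(sizes_bytes, min_rtts_secs):
    """Least-squares slope of min RTT versus packet size, in seconds per byte."""
    n = len(sizes_bytes)
    mean_x = sum(sizes_bytes) / n
    mean_y = sum(min_rtts_secs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_bytes, min_rtts_secs))
    sxx = sum((x - mean_x) ** 2 for x in sizes_bytes)
    return sxy / sxx


def hop_bandwidth_bits_per_sec(slope_prev, slope_this):
    """The slope added by a hop is that link's serialization time per byte."""
    return 8 / (slope_this - slope_prev)


# Synthetic minimum RTTs: hop 1 over a 10 Mbits/s link, hop 2 adds a 1 Mbits/s link.
sizes = [100, 500, 1000, 1500]
hop1 = [0.001 + s * 8 / 10e6 for s in sizes]
hop2 = [0.005 + s * (8 / 10e6 + 8 / 1e6) for s in sizes]

s1, s2 = fit_slope(sizes, hop1), fit_slope(sizes, hop2)
print(hop_bandwidth_bits_per_sec(0.0, s1), hop_bandwidth_bits_per_sec(s1, s2))
print(bottleneck_bits_per_sec(1500, [0.0012, 0.0015, 0.0020]))
```

Taking the minimum spacing is the simplest choice; real tools filter the samples more carefully because cross traffic can both stretch and compress the spacing.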
29
Network throughput
  • Iperf
  • Client generates and sends UDP or TCP packets
  • Server receives packets
  • Can select port, maximum window size,
    duration, Mbytes to send etc.
  • Client/server communicate packets seen etc.
  • Reports on throughput
  • Requires server to be installed at remote site,
    i.e. friendly administrators or logon account and
    password

30
Iperf example
  • 25cottrell@flora06> iperf -p 5008 -w 512K -P 3
    -c sunstats.cern.ch
  • ------------------------------------------------------------
  • Client connecting to sunstats.cern.ch, TCP port
    5008
  • TCP window size: 512 KByte
  • ------------------------------------------------------------
  • [6] local 134.79.16.101 port 57582 connected
    with 192.65.185.20 port 5008
  • [5] local 134.79.16.101 port 57581 connected
    with 192.65.185.20 port 5008
  • [4] local 134.79.16.101 port 57580 connected
    with 192.65.185.20 port 5008
  • [ID] Interval Transfer Bandwidth
  • [4] 0.0-10.3 sec 19.6 MBytes 15.3 Mbits/sec
  • [5] 0.0-10.3 sec 19.6 MBytes 15.3 Mbits/sec
  • [6] 0.0-10.3 sec 19.7 MBytes 15.3 Mbits/sec
  • Total throughput 3 * 15.3Mbits/s = 45.9Mbits/s

[Callouts: -p sets the TCP port, -w the max window size, -P 3 requests 3 parallel streams, -c names the remote host]
31
Active Measurement Projects
  • PingER running at NIIT
  • AMP coming soon to NIIT
  • One way delay
  • Surveyor (now defunct), RIPE (mainly Europe),
    owamp
  • IEPM-BW running at NIIT
  • NIMI (mainly a design infrastructure)
  • NWS (mainly for forecasting)
  • Skitter
  • All projects measure routes
  • For a detailed comparison see
  • www.slac.stanford.edu/comp/net/wan-mon/iepm-cf.html
  • www.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html

32
AMP
  • http://amp.nlanr.net/AMP/
  • AMP uses dedicated PCs as monitors, 150 (June,
    2005)
  • Today mainly does pings
  • Oriented to Internet 2, 10 countries
  • Does mainly full mesh pinging
  • Being re-written to provide support for more
    probes

33
PingER
  • Measure the network performance for developing
    regions
  • From developed to developing and vice versa
  • Between developing regions and within developing
    regions
  • Use simple tool (PingER/ping)
  • Ping installed on all modern hosts, low traffic
    interference,
  • 21 pings each 30 mins to remote hosts (< 100bits/s
    average)
  • Provides very useful measures
  • Originated in High Energy Physics, now focused on
    Digital Divide (DD)
  • Persistent (data goes back to 1995), interesting
    history

[Figure: PingER coverage, Feb 2005, showing monitoring sites and remote sites]
34
Examples: World View
  • C. Asia, Russia, S.E. Europe, L. America, M.
    East, China: 4-5 yrs behind
  • India, Africa: 7 yrs behind
  • S.E. Europe, Russia: catching up
  • Latin Am., Mid East, China: keeping up
  • India, Africa: falling behind
  • Important for policy makers
  • Many institutes in developing world have less
    performance than a household in N. America or
    Europe
35
Losses
  • US residential Broadband users have better access
    than sites in many regions

36
Loss to Africa (example of variability)
From PingER project
37
Compare with TAI
  • UN Technology Achievement Index (TAI)
  • Measures creation and diffusion of technology and
    building human skills

Note how bad Africa is
38
E2E Troubleshooting
  • Solving the E2E performance problem is the
    critical problem for the user
  • Improve e2e throughput for data intensive apps in
    high-speed WANs
  • Provide ability to do performance analysis and
    fault detection in a Grid computing environment
  • Provide accurate, detailed, adaptive monitoring
    of all distributed components including the
    network

39
Anatomy of a Problem
[Figure: two figures labelled Applications Developer, with speech bubbles: "Hey, this is not working right!", "Others are getting in ok", "Not our problem", "The computer is working OK", "Looks fine", "All the lights are green", "We don't see anything wrong", "The network is lightly loaded"]
How do you solve a problem along a path?
From an Internet2 E2E presentation by Russ Hobby
40
Needs
  • Measurement tools to quickly, accurately and
    automatically identify problems
  • Automatically take action to investigate and
    gather information, on-demand measurements
  • Standard ways to discover, request and report
    results of measurements, for applications
  • GGF/NMWG schemas
  • Share information with people and apps across a
    federation of measurement infrastructures

41
Trouble shooting
  • Ping to localhost, ping to gateway and to remote
    host
  • Use IP address to avoid nameserver problems
  • Look for connectivity, loss and RTT
  • May need to run for a long time to see some
    pathologies (e.g. bursty loss due to DSL loss of
    sync)
  • Use synack or sting if ICMP blocked
  • Traceroute to remote host
  • Reverse traceroute from remote host to you
  • Ping routers along route
  • Look at history plots (PingER, AMP), when did
    problem start, how big an effect is it?

42
Trouble shooting
  • Try user application
  • Iperf to test throughput

43
Where is a host?
  • Name server lookup to find hostname given IP
    address
  • 47cottrell@netflow> nslookup 210.56.16.10
  • Server: localhost
  • Address: 127.0.0.1
  • Name: lhr.comsats.net.pk
  • Address: 210.56.16.10
  • Triangulate position based on RTT measurements
    made to unknown host from several hosts at known
    locations.
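
A first step toward RTT triangulation is to turn each RTT into an upper bound on distance. The sketch below assumes signals travel at roughly 200 km per millisecond in fiber (about two thirds of the speed of light in vacuum) and ignores queuing and indirect routes, so the bound is usually loose. The function names and example numbers are illustrative.

```python
SPEED_IN_FIBER_KM_PER_MS = 200.0  # assumed propagation speed, roughly 2/3 of c


def max_distance_km(rtt_ms, km_per_ms=SPEED_IN_FIBER_KM_PER_MS):
    """Upper bound on the distance to a host: one-way time is at most RTT / 2."""
    return (rtt_ms / 2.0) * km_per_ms


def consistent_with_rtts(distances_km, rtts_ms):
    """True if a candidate location lies within the RTT bound of every landmark.

    distances_km[i] is the distance from landmark i to the candidate location,
    rtts_ms[i] is the RTT measured from landmark i to the unknown host.
    """
    return all(d <= max_distance_km(r) for d, r in zip(distances_km, rtts_ms))


# Hypothetical example: two landmarks measure RTTs of 10 ms and 20 ms.
print(max_distance_km(10.0))                               # bound from the first landmark
print(consistent_with_rtts([500.0, 1500.0], [10.0, 20.0]))
```

Each landmark contributes a circle of possible locations; the unknown host has to lie in the intersection of all of them.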

44
Where is a host
  • Do a Google search on IP address to location,
    e.g.
  • http://www.geobytes.com/IpLocator.htm

45
Hi-perf Challenges
  • Packet loss hard to measure by ping
  • For 10% accuracy on BER 1/10^8, 1 day at 1/sec
  • Ping loss is not the same as TCP loss
  • Iperf/GridFTP throughput at 10Gbits/s
  • To measure stable (congestion avoidance) state
    for 90% of test takes 60 secs, about 75GBytes
  • Requires scheduling, which implies authentication
    etc.
  • Using packet pair dispersion can use only few
    tens or hundreds of packets, however
  • Timing granularity in host is hard (sub µsec)
  • NICs may buffer (e.g. coalesce interrupts, or TCP
    offload) so need info from NIC or before
  • Security: blocked ports, firewalls, keys vs. one
    time passwords, varying policies etc.

46
Dedicated Optical Circuits
  • Could be whole new playing field, today's tools
    no longer applicable
  • No jitter (so packet pair dispersion no use)
  • Instrumented TCP stacks a la Web100 may not be
    relevant
  • Layer 1 and 2 switches make traceroute less useful
  • Losses so low, ping not viable to measure
  • High speeds make some current techniques fail or
    more difficult (timing, amounts of data etc.)

47
More Information
  • Tutorial on monitoring
  • www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
  • RFC 2151 on Internet tools
  • www.freesoft.org/CIE/RFC/Orig/rfc2151.txt
  • Network monitoring tools
  • www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
  • Ping
  • http://www.ping127001.com/pingpage.htm
  • IEPM/PingER home site
  • www-iepm.slac.stanford.edu/
  • IEEE Communications, May 2000, Vol 38, No 5, pp
    130-136

48
Simplified SLAC DMZ Network, 2001

[Figure: network diagram. Devices: rtr-msfc-dmz, swh-dmz, swh-root, slac-rt1.es.net, SLAC Internal Network, dial up and ISDN. External networks: NTON, Stanford, ESnet, Internet2. Links: 2.4Gbps OC48 link, 155Mbps OC3 link, OC12 link 622Mbps, Etherchannel 4 Gbps. Legend: 1Gbps Ethernet, 100Mbps Ethernet, 10Mbps Ethernet]
  • Note on the 155Mbps OC3 link: upgrade to OC12 has
    been requested
  • Note on the 2.4Gbps OC48 link: this link will be
    replaced with an OC48 POS card for the 6500 when
    available
49
Flow lengths
  • Distribution of netflow lengths for SLAC border
  • Log-log plots, linear trendline = power law
  • Netflow ties off flows after 30 minutes
  • TCP, UDP and ICMP flows are log-log linear for
    longer (hundreds to 1500 seconds) flows
    (heavy-tails)
  • There are some peaks in TCP distributions,
    timeouts?
  • Web server CGI script timeouts (300s), TCP
    connection establishment (default 75s), TIME_WAIT
    (default 240s), tcp_fin_wait (default 675s)

[Figure: three panels of flow length distributions, for ICMP, TCP and UDP]
50
Traceroute technical details
  • Rough traceroute algorithm
  • ttl = 1 (to 1st router)
  • port = 33434 (starting UDP port)
  • while we haven't got UDP port unreachable:
  •   send UDP packet to host:port with ttl
  •   get response
  •   if time exceeded: note round-trip time
  •   else if UDP port unreachable: quit
  •   print output
  •   ttl++, port++
  • Can appear as a port scan
  • SLAC gets about one complaint every 2 weeks.
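
The loop above can be sketched in Python with the packet handling factored out. A real traceroute needs raw sockets and usually root privileges, so here send_probe is a hypothetical callable that returns what the network would have answered, and make_simulated_path builds a fake network for demonstration. Neither name comes from an existing library.

```python
def trace(send_probe, max_hops=30, start_port=33434):
    """Run the TTL-increment loop; returns a list of (ttl, router, rtt_ms).

    send_probe(ttl, port) must return (kind, router, rtt_ms), where kind is
    "time_exceeded", "port_unreachable", or None when nothing came back.
    """
    hops = []
    ttl, port = 1, start_port
    while ttl <= max_hops:
        kind, router, rtt_ms = send_probe(ttl, port)
        hops.append((ttl, router, rtt_ms))
        if kind == "port_unreachable":   # the destination itself answered
            break
        ttl += 1                         # reach one router further next time
        port += 1                        # a new port for every probe
    return hops


def make_simulated_path(routers, silent_ttls=()):
    """Fake network: routers[-1] is the destination, silent hops never reply."""
    def send_probe(ttl, port):
        if ttl in silent_ttls:
            return None, None, None
        if ttl < len(routers):
            return "time_exceeded", routers[ttl - 1], 10.0 * ttl
        return "port_unreachable", routers[-1], 10.0 * len(routers)
    return send_probe


hops = trace(make_simulated_path(["r1", "r2", "dest"], silent_ttls={2}))
print(hops)
```

The incrementing destination port is what makes the probes resemble a port scan, as noted above.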

51
Time series
[Figure: time series of incoming and outgoing UDP and TCP traffic, annotated "Cat 4000 802.1q vs. ISL"]
52
Power law fit parameters by time
Just 2 parameters provide a reasonable
description of the flow size distributions
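
A two-parameter power law y = a * x^b is a straight line on a log-log plot, so it can be fitted with ordinary least squares on the logarithms. The sketch below uses synthetic points generated from a known power law, not SLAC flow data, and the function name is an illustrative choice.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on log10(x) and log10(y); returns (a, b)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    b = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
         / sum((u - mean_x) ** 2 for u in lx))
    a = 10 ** (mean_y - b * mean_x)
    return a, b


# Synthetic distribution: count = 5000 * size**-1.5
sizes = [1, 10, 100, 1000]
counts = [5000 * s ** -1.5 for s in sizes]
a, b = fit_power_law(sizes, counts)
print(a, b)
```

Tracking a and b per time interval gives the kind of "fit parameters by time" view the slide title refers to.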
53
Not your normal Internet site
  • Ames IXP: approximately 60-65% was HTTP, about
    13% was NNTP
  • Uwisc: 34% HTTP, 24% FTP, 13% Napster
54
PingER cont.
  • Monitor timestamps and sends ping to remote site
    at regular intervals (typically about every 30
    minutes)
  • Remote site echoes the ping back
  • Monitor notes the current time and the send time
    and gets RTT
  • Discussing installing monitor site in Pakistan
  • provide real experience of using techniques
  • get real measurements to set expectations,
    identify problem areas, make recommendations
  • provide access to data for developing new
    analysis techniques, for statisticians etc.

55
PingER
  • Measurements from
  • 38 monitors in 14 countries
  • Over 600 remote hosts
  • Over 120 countries
  • Over 3300 monitor-remote site pairs
  • Measurements go back to Jan-95
  • Reports on RTT, loss, reachability, jitter,
    reorders, duplicates
  • Uses ubiquitous ping facility of TCP/IP
  • Countries monitored
  • Contain over 80% of world population
  • 99% of online users of Internet

56
Surveyor, RIPE, NIMI
  • Surveyor and RIPE use dedicated PCs with GPS
    clocks for synchronization
  • Measure 1 way delays and losses
  • Surveyor mainly for Internet 2
  • RIPE mainly for European ISPs
  • NIMI (National Internet Measurement
    Infrastructure) more of an infrastructure for
    measurements and some tools (i.e. currently does
    not have publicly available, regularly updated
    data)
  • Mainly full mesh measurements on demand

57
Skitter
  • Makes ping and route measurements to tens of
    thousands of sites around the world. Site
    selection varies based on web site hits.
  • Provides loss and RTTs
  • Skitter and PingER are the main 2 sites to monitor
    developing world.

58
Where is a host cont.
  • Find the Autonomous System (AS) administering
  • Use reverse traceroute server with AS
    identification, e.g.
  • www.slac.stanford.edu/cgi-bin/nph-traceroute.pl
  • 14 lhr.comsats.net.pk (210.56.16.10) AS7590 -
    COMSATS 711 ms (ttl=242)
  • Get contacts for ISPs (if know ISP or AS)
  • http://puck.nether.net/netops/nocs.cgi
  • Gives ISP name, web page, phone number, email,
    hours etc.
  • Review list of AS's ordered by Upstream AS
    Adjacency
  • www.telstra.net/ops/bgp/bgp-as-upsstm.txt
  • Tells what AS is upstream of an ISP
  • Look at real-time information about the global
    routing system from the perspectives of several
    different locations around the Internet
  • Use route views at www.antc.uoregon.edu/route-views/
  • Triangulate RTT measurements to unknown host from
    multiple places

59
Who do you tell
  • Local network support people
  • Internet Service Provider (ISP) usually done by
    local networker
  • Use puck.nether.net/netops/nocs.cgi to find ISP
  • Use www.telstra.net/ops/bgp/bgp-as-upsstm.txt to
    find upstream ISPs
  • Give them the ping and traceroute results

60
Achieving throughput
  • User can't achieve throughput available (Wizard
    gap)
  • Big step just to know what is achievable