Campus-Wide Network Performance Monitoring and Recovery (CPR)

1
Campus-Wide Network Performance Monitoring and
Recovery (CPR)
  • Warren Matthews
  • Chris Kelly
  • Russ Clark
  • Terry Turner

2
Outline
  • Problems facing network operators
  • Our Solution: CPR
  • Hardware and Software
  • Deployments
  • Measurements
  • Analysis
  • Visualization

3
The Problem
  • The network is down!
  • Finger pointing
  • No single point of contact
  • No idea who to contact
  • Lack of factual information
  • Multifaceted problems
  • Really strange problems
  • The network isn't down!

4
More Problems
  • Measurement Infrastructure typically means WAN
    monitoring
  • But problems are LAN and host based
  • Network Operations
  • Single point of view
  • Catastrophic failure is easy to detect, small
    problems aren't
  • Little quantitative data to troubleshoot
    performance problems

5
Network Support
  • Campus
  • Backbone group maintains 196 buildings, 2011
    switches, 61980 ports.
  • Southern Crossroads gigapop (SOX)
  • Provides connectivity for many Universities
    throughout the South East (including Peachnet)
  • 10Gbps link to Abilene backbone (Internet2)

6
The Solution: CPR
  • Campus-wide Network Performance Monitoring and
    Recovery
  • Emphasis on Recovery
  • Regular tests across campus network
  • Active and passive monitoring
  • Shared access to test results
  • Comprehensive analysis and visualization

7
Key Enablers
  • Control of Network
  • Firewalls
  • Department and host-based
  • Switches/Routers
  • DNS/Reverse DNS
  • Physical access
  • Lots of available hardware
  • Lots of time

8
Hardware
  • CPR machines
  • Originally donated hardware
  • Underpowered (P2-3/128MB/10GB) and unreliable
  • New donation of much better machines
  • Dell Optiplex GX260s (P4/512MB/30GB)
  • A few 1U Dell servers
  • Analysis servers
  • Sun Fire X2100s and Sun Fire V210s

9
Software
  • Red Hat Enterprise Linux 4
  • State-wide license
  • Georgia Tech Satellite Server for updates
  • Kickstart system for easy installation
  • Custom management tools
  • Setup script for new machines
  • Centralized configuration management and software
    installation
  • Centralized Nagios monitoring

10
Centralized Nagios
11
Deployments
  • Campus - entire campus network
  • Cluster - PACE cluster
  • InfiniBand latency/throughput
  • GAMMON
  • ISP
  • Major Providers (Level3, Qwest, Cogent)
  • Residential (SpeedFactory, BellSouth, Charter,
    Cox, Earthlink)
  • International
  • Georgia Tech has adopted an international focus

12
GAMMON Deployment
  • Georgia Measurement and Monitoring
  • Valdosta State University
  • Armstrong Atlantic State University
  • Barrow County School System
  • Distance Learning and Professional Education
    (DLPE)

13
GAMMON Deployment
[Diagram: GAMMON network map. Node labels: Barrow, BellSouth, Level3,
UUNET, Qwest, SOX, GLC, GT, Peachnet, Savannah, Armstrong, Valdosta]
14
Global Deployment
  • International focus in strategic plan
  • CPR in Metz, France and Shanghai, China
  • Leverage global monitoring infrastructure and
    communicate using emerging standards (GGF-NMWG,
    perfSONAR)
  • Routing challenges

15
Campus Deployment
  • 87 Hosts on Campus
  • Several are virtual machines in VMware
  • Allows multiple subnets in the same building to
    each be monitored by a CPR machine, all on one
    piece of hardware
  • Additional benchmarking of performance needed
  • Host machines have 2GB RAM and full hard drives
  • Collocated with switches in data closets
  • Multiple views of the network (especially that of
    the end user)

16
Campus Deployment
[Diagram: campus deployment map. Node labels: ET, EDI, GLC, SOX, Savannah,
GTL, Yamacraw, Class, 505, TechSq Classroom, Servernet, Gateway Routers,
Savant, Rich133, Savant44, Rich, Rich2, Cherry-Emerson, Daniel, GTRI, MARC,
Arch, Admin, EST, Core Routers, 811, Couch, 845, MiRC, Skiles, SEB, SSC,
Neely, NI, SI, MRDC, DMSmith, IBB, Habersham, Ajax, OHR, Sc-class, Lyman,
Mason, French-class, FAB, OKeefe, King, GCATT, LAWN, Howey, French, OHR,
Lib-class, Lyman]
17
Measurements
  • No in-house development of measurement tools
  • Original plan didn't include visualization
  • Inconvenient to click through numerous graphs
  • CPR as a testbed for tools

18
Measurements
  • Smokeping - roundtrip time and graphs
  • Nagios - services
  • Nessus and Nmap - security scanning
  • Arpwatch - layer 2 monitoring
  • Iptables logs (distributed darknet)
  • Iperf (bwctl) - TCP throughput only.
  • Pathrate, Pathload, Traceroute
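
Smokeping sends a burst of probes per measurement round and reports the median round-trip time together with packet loss. A minimal sketch of that per-round summary follows; the function name and the use of `None` for a lost probe are assumptions made for illustration, not part of Smokeping or CPR.

```python
from statistics import median

def summarize_burst(rtts_ms):
    """Summarize one round of probes.

    rtts_ms holds one entry per probe sent (must be non-empty);
    None marks a probe that got no reply.
    Returns (median RTT in ms or None, loss percentage).
    """
    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    med = median(answered) if answered else None
    return med, loss_pct

# Illustrative values only, not CPR data.
print(summarize_burst([10.0, 12.0, None, 11.0]))
```

Reporting the median rather than the mean keeps a single delayed probe from dominating the round.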

19
More Measurements
  • Wish list
  • OWAMP - NTP
  • GOAT/Netflow
  • Syslog
  • SPAM/SWARM integration
  • Coming soon
  • NDT/NPAD (central, distributed)
  • Test bed for tools under development

20
Analysis Goals
  • Create base-lines for historical comparison
  • Use multiple views to detect location
  • Middleware
  • Plateau detector (AMP), RIPE-TT
  • How should we react to alarms?
  • Troubleshooting guide
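
One way to turn a historical baseline into an alarm is to compare the median of recent samples against the baseline median, scaled by the baseline's spread. The sketch below is an assumption for illustration only; it is not the AMP plateau detector or the RIPE-TT method, and the names and the threshold `k` are hypothetical.

```python
from statistics import median

def iqr(values):
    """Interquartile range using simple index-based quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[(3 * n) // 4] - ordered[n // 4]

def plateau_alarm(history, recent, k=3.0):
    """True if the recent median sits more than k IQRs above the baseline.

    history: past measurements (e.g. RTT in ms) forming the baseline.
    recent:  the latest window of measurements.
    """
    baseline = median(history)
    # Fall back to a tiny spread so a perfectly flat history
    # still alarms on any sustained increase.
    spread = iqr(history) or 1e-9
    return median(recent) - baseline > k * spread

# Illustrative values only, not CPR data.
history = [20, 21, 19, 22, 20, 21, 20, 23, 19, 20]
print(plateau_alarm(history, [30, 31, 29, 30]))  # sustained shift
print(plateau_alarm(history, [20, 90, 21]))      # single spike
```

Using medians means a lone spike does not raise an alarm, while a sustained shift (a plateau) does.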

21
Analysis Areas
  • Statistics
  • Binary Tomography
  • Reducing data required to detect problems
  • Spatial and Temporal correlations
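
The idea behind binary tomography can be sketched as follows: each end-to-end path is marked simply good or bad, and any link that appears on a good path is assumed healthy, which leaves a short list of suspects. This is a minimal sketch under that assumption; the path and link names are hypothetical and it is not the analysis code used by CPR.

```python
def suspect_links(paths, path_is_bad):
    """Rank links that could explain the bad paths.

    paths:       dict mapping path name -> list of links it traverses
    path_is_bad: dict mapping path name -> True if the path tested bad
    Returns suspect links, those explaining the most bad paths first.
    """
    # Any link carried by a good path is assumed to be healthy.
    good_links = set()
    for name, links in paths.items():
        if not path_is_bad[name]:
            good_links.update(links)

    # Count how many bad paths each remaining link appears on.
    counts = {}
    for name, links in paths.items():
        if path_is_bad[name]:
            for link in set(links) - good_links:
                counts[link] = counts.get(link, 0) + 1

    return sorted(counts, key=lambda link: (-counts[link], link))

# Hypothetical topology for illustration.
paths = {
    "A->B": ["A-sw1", "sw1-core", "core-sw2", "sw2-B"],
    "A->C": ["A-sw1", "sw1-core", "core-sw3", "sw3-C"],
    "D->B": ["D-sw4", "sw4-core", "core-sw2", "sw2-B"],
}
status = {"A->B": True, "A->C": False, "D->B": True}
print(suspect_links(paths, status))
```

Here the two links shared by both bad paths rank first, which is how multiple points of view narrow down a fault's location.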

22
Visualization: Smokeping
23
Visualization: Nagios
24
Visualization: MyCPR
25
Results
  • CPR has helped solve numerous issues
  • Firewall
  • Network slowness from file sharing
  • Dropped sessions
  • Not everything is a network issue

26
International Latency
  • Data since Jan 21 2006
  • N = 25,750
  • Mode = 291 ms
  • Median = 297 ms
  • Mean = 318.4 ms
  • Max = 1127 ms
  • IQR = 27 ms
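
Summary statistics of this kind can be computed with Python's standard `statistics` module. The sketch below uses a handful of made-up samples, not the CPR data set, and the function name is hypothetical.

```python
from statistics import mean, median, multimode, quantiles

def latency_summary(samples_ms):
    """Return N, mode(s), median, mean, max and IQR for RTT samples in ms."""
    q1, _, q3 = quantiles(samples_ms, n=4)  # quartile cut points
    return {
        "N": len(samples_ms),
        "mode": multimode(samples_ms),  # list, since ties are possible
        "median": median(samples_ms),
        "mean": mean(samples_ms),
        "max": max(samples_ms),
        "IQR": q3 - q1,
    }

# Illustrative samples only.
samples = [290, 291, 291, 297, 300, 310, 1127]
for name, value in latency_summary(samples).items():
    print(name, value)
```

With a long tail such as the 1127 ms maximum, the mean is pulled above the median, so the median and IQR give the steadier picture of typical latency.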

27
Global Routing
[Diagram: global routing map. Node labels: Transpac, SJTU, JP, CERNet,
PacificWave, Abilene, GLORIAD, KRnet, SOX, GT]
28
Campus Throughput
  • Campus network throughput, grouped by parent
    router
  • Red lines: that machine is behind a 10baseT
    switch or hub somewhere
  • Messy convergence: all of those machines share a
    router, and something is limiting throughput
    (mostly incoming)
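
The grouping described above can be sketched in a few lines: bucket each host's measured throughput by its parent router and flag hosts whose throughput never exceeds 10 Mbps, the ceiling of a 10baseT link. Host names, router names and figures here are hypothetical, and this is not the code behind the CPR plot.

```python
from collections import defaultdict

def group_by_router(results):
    """results: list of (host, parent_router, throughput_mbps) tuples."""
    groups = defaultdict(list)
    for host, router, mbps in results:
        groups[router].append((host, mbps))
    return dict(groups)

def likely_10baset(results, ceiling_mbps=10.0):
    """Hosts whose throughput is at or below the 10baseT ceiling."""
    return sorted(host for host, _, mbps in results if mbps <= ceiling_mbps)

# Hypothetical measurements for illustration.
results = [
    ("host-a", "router-1", 94.1),
    ("host-b", "router-1", 9.4),
    ("host-c", "router-2", 41.0),
    ("host-d", "router-2", 38.5),
]
print(group_by_router(results))
print(likely_10baset(results))
```

Hosts that share a router and all fall well short of line rate, like the two under `router-2` here, point at a shared bottleneck rather than at an individual machine.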
29
GAMMON Throughput
30
Cluster Performance
31
Data Access
  • Archival Files
  • MySQL database
  • Looking to the future
  • PerfSONAR - international measurement
    infrastructure
  • Datapository - standard for data sharing
  • NetRadar

32
Lots of Data
33
The End
  • CPR is a testbed for your tools!
  • We're always looking for people to work with
    (especially in analysis).
  • Questions?
  • Contact
  • Warren.Matthews_at_oit.gatech.edu
  • Chris.Kelly_at_oit.gatech.edu
  • Project website: http://www.rnoc.gatech.edu/cpr/