Uncovering Functional Networks in Internet Traffic


Provided by: edward118
Learn more at: http://vw.indiana.edu
1
Uncovering Functional Networks in Internet Traffic
  • Mark Meiss
  • September 25, 2006

2
Who am I?
  • Mark Meiss
  • Ph.D. candidate in Computer Science
  • Committee: Filippo Menczer, Alessandro
    Vespignani, Katy Börner, Minaxi Gupta, Kay
    Connelly
  • Researcher at the Advanced Network Management
    Laboratory (ANML)
  • http://anml.iu.edu/

4
What's the agenda?
  • The subject of today's story:
  • Finding a way to improve security without
    compromising user privacy
  • A case study in applied network science
  • This work is done with Filippo Menczer and
    Alessandro Vespignani.

5
What do people do online?
There's what we imagine.
6
What do people do online?
And there's what is actually happening.
7
Not just a value judgment
  • These applications all affect the health of a
    data network.
  • There are legal problems, yes, but also:
  • Crowding out other applications.
  • (Napster was once over 70% of all IUB traffic)
  • Compromised computers are used to launch further
    attacks.
  • Common nuisances are on the Net as well.

8
The bottom line
  • Network administrators need to be able to identify
    what applications are being used on the network,
    but this can be very difficult.
9
A crash course in data networks
  • We'll use a running example:
  • Buddy Bradley wants to read a web page about his
    favorite band at Vulgar Entertainment, Inc.

20
Quick summary
  • Each network conversation is identified by four
    pieces of information:
  • Client address and port number
  • Server address and port number
  • The server uses a well-known port number
  • The client uses an ephemeral port number
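
The four identifying fields can be sketched as a small record type. This is only an illustration: the `Conversation` name, its field names, and the example addresses are assumptions, not part of the talk.

```python
from typing import NamedTuple


class Conversation(NamedTuple):
    """The four pieces of information that identify a network conversation."""
    client_addr: str
    client_port: int   # ephemeral port, picked for this conversation only
    server_addr: str
    server_port: int   # well-known port, e.g. 80 for the Web


# Hypothetical documentation-range addresses, for illustration only.
first = Conversation("192.0.2.10", 50123, "198.51.100.7", 80)
second = Conversation("192.0.2.10", 50124, "198.51.100.7", 80)

# Same client and same server, but different ephemeral ports:
# these count as two distinct conversations.
are_distinct = first != second
```

Because the ephemeral port changes from one conversation to the next, two page loads by the same client from the same server are still told apart.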

21
So why is it hard to identify applications?
  • Well-known ports are a convention, not a rule
  • Web, e-mail, etc. do have ports assigned by the
    IANA
  • BitTorrent, Gnutella, Napster, etc. do not
  • Client and server ports share the same namespace
  • In practice:
  • Any application can use any pair of port numbers
  • Our focus: discovering what application is
    running on a port with no assigned use.

22
The conventional solution
  • Let's look inside
  • all of those packets!

25
Another problem
  • Packet inspection doesn't scale
  • Modern high-speed networks run at 10 gigabits per
    second or faster
  • (that's one full DVD every few seconds)
  • General-purpose computers can't even copy that
    data in real time
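
The "DVD every few seconds" figure can be checked with a quick calculation. The DVD capacities below (4.7 GB single-layer, 8.5 GB dual-layer, decimal units) are assumptions; the talk does not say which kind of disc is meant.

```python
LINK_BITS_PER_SECOND = 10e9  # 10 gigabits per second, from the slide

# Assumed nominal DVD capacities in bytes (not stated in the talk).
DVD_SIZES_BYTES = {"single-layer": 4.7e9, "dual-layer": 8.5e9}

bytes_per_second = LINK_BITS_PER_SECOND / 8

# Seconds needed for a fully loaded link to carry one DVD's worth of data.
seconds_per_dvd = {
    name: size / bytes_per_second for name, size in DVD_SIZES_BYTES.items()
}
```

Under these assumptions a saturated link moves one DVD in roughly 4 to 7 seconds, which matches the slide's "every few seconds".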

28
Introducing the flow
  • We can summarize Buddy's Web surfing as two
    flows:
  • 192.168.65.33:13029 to 10.99.205.122:80 (456
    bytes)
  • 10.99.205.122:80 to 192.168.65.33:13029 (63,211
    bytes)
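
As a rough sketch, the two flows can be written as records. It assumes the client is 192.168.65.33 on port 13029 and the server is 10.99.205.122 on port 80; the `Flow` type and the `is_reverse` helper are hypothetical names introduced here.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """One direction of a conversation, summarized by its byte count."""
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int
    byte_count: int


# Buddy's request to the web server, and the server's reply.
request = Flow("192.168.65.33", 13029, "10.99.205.122", 80, 456)
response = Flow("10.99.205.122", 80, "192.168.65.33", 13029, 63211)


def is_reverse(f: Flow, g: Flow) -> bool:
    """True when g runs between the same endpoints as f, in the other direction."""
    return (f.src_addr, f.src_port, f.dst_addr, f.dst_port) == (
        g.dst_addr, g.dst_port, g.src_addr, g.src_port)


total_bytes = request.byte_count + response.byte_count
```

The asymmetry is typical of Web browsing: a small request goes out and a much larger page comes back.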

29
Where do flows come from?
  • Architectural features of Internet routers allow
    them to export flow data
  • Routers can't summarize all the data
  • Packets are sampled to construct the flows
  • Typical sampling rate is around 1:100
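
A consequence of sampling is that the recorded counts are only estimates. A minimal sketch, assuming one packet in every 100 is sampled and that totals are recovered by simple scaling (the function name and sample numbers are made up for illustration):

```python
SAMPLING_RATE = 100  # assumed: one packet in every 100 is examined


def estimate_totals(sampled_packets, sampled_bytes, rate=SAMPLING_RATE):
    """Scale sampled counts up to an estimate of the true traffic."""
    return sampled_packets * rate, sampled_bytes * rate


# A flow for which the router happened to sample 3 packets totalling 4,200 bytes.
est_packets, est_bytes = estimate_totals(3, 4200)
```

Scaling like this works reasonably for large flows, but a short conversation of only a few packets will often be missed entirely, so the flow data under-represents small exchanges.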

30
What can you do with a flow?
  • Usual answer:
  • Treat a flow as a record in a relational database
  • Who talked to port 1337?
  • What proportion of our traffic is on port 80?
  • Who is scanning for vulnerable systems?
  • Which hosts are infected with this worm?
  • These are useful and valid questions.
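
The relational view can be illustrated with an in-memory SQLite table. The schema, column names, and rows below are assumptions for the sake of the example, not the format used in the study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flows ("
    "src_addr TEXT, src_port INTEGER, "
    "dst_addr TEXT, dst_port INTEGER, bytes INTEGER)"
)

# Made-up flow records.
rows = [
    ("10.0.0.1", 40000, "10.0.0.9", 80, 600),
    ("10.0.0.9", 80, "10.0.0.1", 40000, 2400),
    ("10.0.0.2", 40001, "10.0.0.8", 1337, 1000),
]
conn.executemany("INSERT INTO flows VALUES (?, ?, ?, ?, ?)", rows)

# "Who talked to port 1337?"
talkers = [r[0] for r in conn.execute(
    "SELECT DISTINCT src_addr FROM flows WHERE dst_port = 1337 ORDER BY src_addr"
)]

# "What proportion of our traffic is on port 80?"
port80_bytes, all_bytes = conn.execute(
    "SELECT SUM(CASE WHEN src_port = 80 OR dst_port = 80 THEN bytes ELSE 0 END), "
    "SUM(bytes) FROM flows"
).fetchone()
web_share = port80_bytes / all_bytes
```

Each question becomes a filter or an aggregate over individual records, which is exactly what this view is good at.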

31
What can you do with a flow?
  • Our approach:
  • Treat a flow as a directed, weighted edge
  • The resulting network describes user behavior
  • Hold that thought for now
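
A minimal sketch of the edge view, assuming the edge weight is the number of bytes sent and that repeated flows between the same pair of hosts are summed (the host names and byte counts are made up):

```python
from collections import defaultdict

# (source host, destination host, bytes); illustrative values only.
flows = [
    ("A", "B", 456),
    ("B", "A", 63211),
    ("A", "B", 544),
    ("A", "C", 100),
]

# Directed, weighted edges: flows between the same ordered pair accumulate.
edges = defaultdict(int)
for src, dst, nbytes in flows:
    edges[(src, dst)] += nbytes

# Out-strength of a host: total weight on the edges leaving it.
out_strength = defaultdict(int)
for (src, _dst), weight in edges.items():
    out_strength[src] += weight
```

Direction matters here: the edge from A to B and the edge from B to A are separate and can carry very different weights.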

32
The Internet2/Abilene network
  • TCP/IP network connecting research and
    educational institutions in the U.S.
  • Over 200 universities and corporate research labs
  • Also provides transit service between Pacific Rim
    and European networks

33
Why study Abilene?
  • Wide-area network that includes both domestic and
    international traffic
  • Heterogeneous user base including hundreds of
    thousands of undergraduates
  • High capacity network (10-Gbps fiber-optic links)
    that has never been congested
  • Research partnership gives access to (anonymized)
    traffic data unavailable from commercial networks

34
Flow collection
Flows are exported in Cisco's netflow-v5
format and anonymized before being written to
disk.
35
Data dimensions
  • Observed Abilene on April 14, 2005
  • About 200 terabytes of data exchanged
  • This is roughly 25,000 DVDs of information
  • 600 million flow records
  • Almost 28 gigabytes on disk
  • 15 million unique hosts involved
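
Some back-of-the-envelope ratios follow from these figures. The sketch assumes decimal units (1 terabyte = 10^12 bytes, 1 gigabyte = 10^9 bytes) and treats the rounded slide numbers as exact, so the results are only approximate.

```python
TRAFFIC_BYTES = 200e12   # about 200 terabytes exchanged
DVD_COUNT = 25_000       # roughly 25,000 DVDs
FLOW_RECORDS = 600e6     # 600 million flow records
DISK_BYTES = 28e9        # almost 28 gigabytes on disk
HOSTS = 15e6             # 15 million unique hosts

bytes_per_dvd = TRAFFIC_BYTES / DVD_COUNT              # DVD size implied by the slide
disk_bytes_per_record = DISK_BYTES / FLOW_RECORDS      # storage cost of one record
traffic_bytes_per_flow = TRAFFIC_BYTES / FLOW_RECORDS  # average traffic per flow
flows_per_host = FLOW_RECORDS / HOSTS                  # average records per host
```

Under these assumptions the slide implies about 8 GB per DVD, under 50 bytes of storage per flow record, and about 40 flow records per host. The summary is several thousand times smaller than the traffic it describes.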

36
Forming a bipartite network
  • Motivation:
  • Clients and servers perform different functions
  • A web browser is very different from a web server
  • Most hosts are one or the other
  • Identifying clients and servers:
  • Recall that there is a single namespace for ports
  • Heuristic: the more common port is the server
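
One way to read the heuristic is sketched below: count how often each port number appears across all flows, then treat the endpoint with the more frequent port as the server. The function name, the tuple layout, and the tie-breaking rule are assumptions; the talk does not spell out these details.

```python
from collections import Counter


def split_client_server(flows):
    """Label endpoints of (addr_a, port_a, addr_b, port_b) flows.

    Returns a list of ((client_addr, client_port), (server_addr, server_port)).
    """
    port_counts = Counter()
    for _a, port_a, _b, port_b in flows:
        port_counts[port_a] += 1
        port_counts[port_b] += 1

    labelled = []
    for a, port_a, b, port_b in flows:
        # The more common port is taken to be the server's.
        # Ties go to the first endpoint, an arbitrary choice.
        if port_counts[port_a] >= port_counts[port_b]:
            labelled.append(((b, port_b), (a, port_a)))
        else:
            labelled.append(((a, port_a), (b, port_b)))
    return labelled


# Made-up flows: one web server on port 80 and three clients.
flows = [
    ("10.0.0.1", 40000, "10.0.0.9", 80),
    ("10.0.0.9", 80, "10.0.0.2", 40001),
    ("10.0.0.3", 40002, "10.0.0.9", 80),
]
pairs = split_client_server(flows)
```

Ephemeral ports are spread thinly over many values, while a server port recurs in flow after flow, so frequency is a reasonable proxy for the server role.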

37
Weighted bipartite digraph
39
Multiple digraphs
Port 80 (Web)
Port 6346 (Gnutella)
Port 19101 (???)
Port 25 (Mail)
40
Application correlation
  • Consider the out-strength of a client in the
    networks for ports p and q

41
Application correlation
  • Build a pair of vectors from the distribution of
    strength values

42
Application correlation
  • Examine the cosine similarity of the vectors
  • When s = 0, applications p and q are never used
    together.
  • When s = 1, applications p and q are always used
    together, and to the same extent.
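
A minimal sketch of the similarity computation. The transcript does not show how the vectors are built, so this assumes one component per client, holding that client's out-strength in the network for a given port; the client names and values are made up.

```python
import math


def cosine_similarity(strength_p, strength_q):
    """Cosine similarity of two {client: out-strength} mappings."""
    clients = set(strength_p) | set(strength_q)
    dot = sum(strength_p.get(c, 0) * strength_q.get(c, 0) for c in clients)
    norm_p = math.sqrt(sum(v * v for v in strength_p.values()))
    norm_q = math.sqrt(sum(v * v for v in strength_q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an unused port is similar to nothing
    return dot / (norm_p * norm_q)


# Illustrative out-strengths for three ports.
port_a = {"alice": 300, "bob": 400}
port_b = {"alice": 300, "bob": 400}   # same clients, same extent
port_c = {"carol": 500}               # a disjoint set of clients

s_same = cosine_similarity(port_a, port_b)
s_disjoint = cosine_similarity(port_a, port_c)
```

Since strengths are never negative, the similarity always falls between 0 and 1, which is why those two values mark the extremes on the slide.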

43
Clustering applications
  • We now have s(p, q) for every pair of ports
  • Convert these similarities into distances
  • If s = 0, then d is large; if s = 1, then d = 0
  • Now apply Ward's hierarchical clustering algorithm
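
The two steps can be sketched as follows. The conversion d = 1/s - 1 is an assumption that merely satisfies the stated endpoints (the talk does not give the formula), and the clustering is a small pure-Python version of Ward's method using the Lance-Williams update on squared distances, not the implementation used in the study. The similarity values are made up.

```python
def similarity_to_distance(s, eps=1e-9):
    """Assumed mapping: s = 1 gives d = 0, and s near 0 gives a very large d."""
    return 1.0 / max(s, eps) - 1.0


def ward_cluster(labels, distance):
    """Agglomerative clustering with Ward's criterion.

    distance maps frozenset({a, b}) to the distance between labels a and b.
    Returns the merged clusters in the order they were formed.
    """
    clusters = {i: (label,) for i, label in enumerate(labels)}
    sizes = {i: 1 for i in clusters}
    d2 = {(i, j): distance[frozenset((labels[i], labels[j]))] ** 2
          for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        i, j = min(d2, key=d2.get)          # closest pair of clusters
        dij = d2.pop((i, j))
        new = next_id
        next_id += 1
        for k in clusters:
            if k in (i, j):
                continue
            dik = d2.pop((min(i, k), max(i, k)))
            djk = d2.pop((min(j, k), max(j, k)))
            total = sizes[i] + sizes[j] + sizes[k]
            # Lance-Williams update for Ward's method (squared distances).
            d2[(k, new)] = ((sizes[i] + sizes[k]) * dik
                            + (sizes[j] + sizes[k]) * djk
                            - sizes[k] * dij) / total
        clusters[new] = clusters.pop(i) + clusters.pop(j)
        sizes[new] = sizes.pop(i) + sizes.pop(j)
        merges.append(clusters[new])
    return merges


# Hypothetical similarities between four ports.
ports = [25, 21, 6346, 6881]
similarity = {frozenset((25, 21)): 0.8, frozenset((6346, 6881)): 0.9}
distance = {}
for a in ports:
    for b in ports:
        if a < b:
            s = similarity.get(frozenset((a, b)), 0.1)
            distance[frozenset((a, b))] = similarity_to_distance(s)

merges = ward_cluster(ports, distance)
```

With these made-up values the two most similar ports merge first, the other similar pair second, and the two groups join only at the end. Ward's method strictly assumes Euclidean distances, so applying it to converted similarities is a heuristic.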

45
Natural clusters
  • The Web
  • Correlated with almost every application
  • Use is nearly universal
  • Traditional applications
  • Includes mail, FTP, news, remote access, etc.
  • Characterized by dedicated servers
  • Peer-to-peer applications
  • Includes file sharing: Gnutella, BitTorrent, etc.
  • Users often use several of these

46
Classifying unknown applications
  • To classify an unknown application, see what
    known applications it clusters with
  • Our classification experiment:
  • Take 16 unknown ports
  • Guess function based on similarity data
  • Validate or invalidate guesses based on external
    evidence
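
The "guess" step might be approximated as below. This is a simplified stand-in, not the procedure from the talk: instead of reading neighbors off a cluster tree, it takes the few known ports most similar to the unknown one and lets them vote. The similarity values, category labels, and the choice of k are all made up for illustration.

```python
from collections import Counter


def guess_category(unknown, similarity, known_category, k=3):
    """Guess a category from the k known ports most similar to `unknown`.

    similarity maps frozenset({port_a, port_b}) to a score in [0, 1].
    Returns (guessed category, the neighbors that voted).
    """
    neighbours = sorted(
        known_category,
        key=lambda port: similarity.get(frozenset((unknown, port)), 0.0),
        reverse=True,
    )[:k]
    votes = Counter(known_category[port] for port in neighbours)
    return votes.most_common(1)[0][0], neighbours


# Hypothetical labels for known ports and hypothetical similarity scores.
known_category = {
    6346: "p2p",          # Gnutella
    6881: "p2p",          # commonly used by BitTorrent
    5190: "messaging",    # commonly used by AIM
    25: "traditional",    # mail
    21: "traditional",    # FTP
}
similarity = {
    frozenset((19101, 6346)): 0.70,
    frozenset((19101, 5190)): 0.65,
    frozenset((19101, 6881)): 0.60,
    frozenset((19101, 25)): 0.10,
    frozenset((19101, 21)): 0.05,
}

category, neighbours = guess_category(19101, similarity, known_category)
```

A guess produced this way is only a hypothesis about function; as the slide says, it still has to be validated or invalidated against external evidence.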

47
Example 1
  • Port 388 is coupled with FTP and Hotline
  • FTP is a file transfer application
  • Hotline is an early file-sharing application
  • Our guess: traditional file transfer application
  • Actual identity: Unidata/LDM
  • Used for moving large meteorological data sets

48
Example 2
  • Port 19101 is coupled with instant messaging and
    P2P applications
  • Our guess: a P2P application that relies on
    individual contact for file transfers
  • Actual identity: Clubbox
  • Korean file-sharing program
  • Users trade large files on virtual hard drives

50
Overall results
  • For our 16 guesses:
  • 8 were unambiguously correct
  • 6 were partially correct
  • These turned out to be trojans and malware
  • We learned that IRC + P2P = evil afoot
  • 2 could not be confirmed or disproven
  • Ports were in transient use during data collection

51
Implications
  • We can identify the type of an application
    without examining a single packet!
  • Scalable
  • Preserves user privacy
  • Difficult to do with relational view of flow data

57
Broader application
  • Generic view of the situation:
  • Weighted network of entities derived from
    activity with labeled classes of interaction
  • Find the sub-network for each labeled class
  • Use the network distributions to calculate
    similarity scores for the classes
  • Use the similarity scores to cluster the classes
  • Classify unknown classes using these clusters

58
Thank you!
  • Questions and comments