Content-based Detection System using Clustering

1
Content-based Detection System using Clustering
  • Bracha Shapira
  • Elovici Y., Last M., Kandell A., Zaafrany O., Fridman M.

E-mail: bshapira_at_bgu.ac.il
NATO ARW Ben-Gurion University, June 2007
2
Agenda
  • Overview
  • Performance Measures
  • Research questions
  • Experimental Environment
  • Initial Results
  • Challenges

3
Goals
  • Develop a model for detecting terrorist
    activities in non-terror environments based on
    the network traffic content
  • On-line detection
  • Detection should be based on passive eavesdropping
    on the network
  • True-positive and false-positive rates should be
    similar to those of IDS that are based on anomaly
    detection

4
Assumption
  • A group of users in some defined environment
    would have typical interests
  • Terror groups that share ideology would have
    typical interests

5
Basic Idea
  • A terrorist (or a new supporter) would be abnormal
    in the typical environment
  • His/her interests would resemble typical terror
    behavior

6
Detection environment (example)
[Diagram labels: Terrorist-Related Site, ATDS, Web, University Campus]

7
Content-Based Methodology: Learning Phase
[Diagram labels: Network, Sniffer, OHT (Online-HTML Tracer), Configuration Data, Filter, Vector Generator, Representation and Storage, Vectors of Users' Transactions]
8
Sniffer collects group-related sites
9
Filter
  • Take out pages containing non-textual information
    and pages in unsupported languages

10
Convert each page to a representative vector of
term weights
  • Count all occurrences of meaningful terms on the
    page using search-engine techniques
  • Normalize the weight of each term (vector space
    model)
  • The vector should represent the page as
    accurately as possible
  • Example: Bomb - 8, Suicide - 4, War - 2, Food - 5, ...
    (0.2, 0.1, 0.05, 0.25, 0.12)
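
The counting and normalization steps on this slide can be illustrated with a minimal sketch. The slides do not specify the tokenizer, the stop-word list, or the exact normalization, so the names `STOP_WORDS` and `term_weights`, and the choice to divide each count by the total number of meaningful terms, are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}

def term_weights(text, vocabulary):
    """Return normalized term-frequency weights for the terms in `vocabulary`.

    Assumption: each weight is the term count divided by the total number
    of meaningful (non-stop-word) terms on the page.
    """
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[term] / total for term in vocabulary]

vocab = ["bomb", "suicide", "war", "food"]
page = "Food and war: the food of war is food"
weights = term_weights(page, vocab)
print(dict(zip(vocab, weights)))
```

Here the page contributes five meaningful terms (three "food", two "war"), so the weights are the corresponding fractions of five. Other weighting schemes from the vector space model, such as TF-IDF, would fit the same interface.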
11
Content-Based Methodology: Learning Phase (Continued)
[Diagram labels: Vectors of Users' Transactions, Clustering, Cluster 1 (Vectors) ... Cluster n (Vectors), Normal User Behavior Computation, Group-Representor, Normal User Behavior]
12
Apply clustering to the vectors to find the common
interests of the group
  • Clustering is a statistical method for finding groups
    of similar objects according to their properties
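
The slides do not name the clustering algorithm that was used. As one possible illustration, the sketch below applies a plain k-means procedure to page vectors; the choice of k-means, the naive initialization from the first k vectors, and all function names are assumptions rather than the authors' method.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def k_means(vectors, k, iterations=20):
    """Cluster `vectors` into k groups; returns (centroids, assignments)."""
    centroids = [list(v) for v in vectors[:k]]  # naive init: first k vectors
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == j]
            if members:
                centroids[j] = mean_vector(members)
    return centroids, assignments

pages = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
centroids, labels = k_means(pages, k=2)
print(labels)
```

In this toy input the first and third pages share one dominant term and the second and fourth share another, so they end up in two separate clusters.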

13
Cluster Centroid Computation
  • - Vector representation of cluster j
  • Dj - number of vectors in cluster j
  • - one vector representation
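
The symbols on this slide were lost in the transcript. Assuming the centroid is the usual component-wise mean of the cluster's vectors, the computation could be written as follows, where D_j is the count defined on the slide and the names C_j and v_i are placeholders introduced here, not the authors' notation.

```latex
% C_j : vector representation (centroid) of cluster j   (placeholder name)
% v_i : the i-th vector in cluster j                    (placeholder name)
% D_j : number of vectors in cluster j                  (from the slide)
C_j = \frac{1}{D_j} \sum_{i=1}^{D_j} v_i
```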

14
Content-Based Methodology: Detection Phase
[Diagram labels: Network, Sniffer, OHT (Online-HTML Tracer), Configuration Data, Filtering, Vector Generator, Representation, Normal User Behavior, Detector, Threshold (tr), Detection, Alarm]
15
OHT- Online-HTML Tracer
16
OHT- Online-HTML Tracer (cont.)
  • IP Packets
  • Sniffer: recording stream of packets
  • HTTP Filter: disregard non-textual sequences
  • HTML Reconstruction: reconstruct HTML pages
  • HTML Page
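
The stages listed on this slide can be sketched in a heavily simplified form. The slides give no implementation details, so the packet representation (a hypothetical `(flow_id, sequence_number, payload)` tuple) and the function name `reconstruct_pages` are assumptions; real TCP reassembly must also handle retransmissions, chunked transfer encoding, and compressed bodies, which this sketch ignores.

```python
from collections import defaultdict

def reconstruct_pages(packets):
    """Group packets by flow, order them by sequence number,
    and keep only the bodies of responses declared as HTML."""
    flows = defaultdict(list)
    for flow_id, seq, payload in packets:
        flows[flow_id].append((seq, payload))

    pages = {}
    for flow_id, segments in flows.items():
        ordered = sorted(segments, key=lambda segment: segment[0])
        stream = b"".join(payload for _, payload in ordered)
        # HTTP headers end at the first blank line.
        headers, _, body = stream.partition(b"\r\n\r\n")
        # Filter step: disregard non-textual responses.
        if b"content-type: text/html" in headers.lower():
            pages[flow_id] = body.decode("utf-8", errors="replace")
    return pages

captured = [
    ("A", 2, b"<html>hi</html>"),
    ("A", 1, b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"),
    ("B", 1, b"HTTP/1.1 200 OK\r\nContent-Type: image/png\r\n\r\n\x89PNG"),
]
pages = reconstruct_pages(captured)
print(pages)
```

Flow "A" arrives out of order and is reassembled into one HTML page, while flow "B" is dropped because it carries an image.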
17
Detection
  • Representation
  • Group vectors by IP
  • Normal User Profile
  • Each queue holds the accesses of one user
18
Detection Algorithm Parameters
  • The size of the sub-queue for each IP
  • Alarm threshold values
    • The ratio between the number of accesses
      detected as abnormal and the size of the sub-queue
    • Alarm threshold (similarity)
  • Number of clusters representing the typical
    profile of the monitored users
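
How these parameters might interact can be shown with a minimal sketch. The slides do not give the similarity measure or the exact decision rule, so the use of cosine similarity, the rule of waiting until the queue is full, and the names `Detector`, `similarity_threshold`, `queue_size`, and `alarm_ratio` are all assumptions made for illustration.

```python
import math
from collections import defaultdict, deque

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class Detector:
    def __init__(self, centroids, similarity_threshold, queue_size, alarm_ratio):
        self.centroids = centroids              # normal-behavior cluster centroids
        self.similarity_threshold = similarity_threshold
        self.alarm_ratio = alarm_ratio
        # One bounded queue of "abnormal?" flags per IP address.
        self.queues = defaultdict(lambda: deque(maxlen=queue_size))

    def observe(self, ip, vector):
        """Record one page access; return True if an alarm should be raised."""
        best = max(cosine_similarity(vector, c) for c in self.centroids)
        queue = self.queues[ip]
        queue.append(best < self.similarity_threshold)
        if len(queue) < queue.maxlen:
            return False  # assumption: decide only once the queue is full
        return sum(queue) / len(queue) >= self.alarm_ratio

detector = Detector(centroids=[[1.0, 0.0]], similarity_threshold=0.5,
                    queue_size=2, alarm_ratio=1.0)
results = [
    detector.observe("10.0.0.1", [1.0, 0.0]),  # similar to the normal profile
    detector.observe("10.0.0.1", [0.0, 1.0]),  # abnormal access
    detector.observe("10.0.0.1", [0.0, 1.0]),  # second abnormal access
]
print(results)
```

With a queue size of 2 and an alarm ratio of 1.0, the alarm fires only after two consecutive abnormal accesses from the same IP. Larger queues and lower ratios trade detection delay against sensitivity.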

19
Performance Measures
  • Message loss rate (pages correctly captured)
  • Detection (True Positive) Rate
    • Positives correctly classified /
      total number of positives
  • False Alarm (False Positive) Rate
    • Negatives incorrectly classified /
      total number of negatives
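
The two rate definitions above translate directly into code. This is a minimal sketch in which the function name `detection_rates` and the boolean-list representation of labels are assumptions for illustration.

```python
def detection_rates(actual, predicted):
    """Return (true_positive_rate, false_positive_rate).

    `actual` and `predicted` are equal-length lists of booleans,
    where True means "abnormal".
    """
    true_positives = sum(1 for a, p in zip(actual, predicted) if a and p)
    false_positives = sum(1 for a, p in zip(actual, predicted) if not a and p)
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    tp_rate = true_positives / positives if positives else 0.0
    fp_rate = false_positives / negatives if negatives else 0.0
    return tp_rate, fp_rate

actual = [True, True, False, False, False, False]
predicted = [True, False, True, False, False, False]
tp_rate, fp_rate = detection_rates(actual, predicted)
print(tp_rate, fp_rate)
```

In this example one of the two abnormal cases is detected and one of the four normal cases triggers a false alarm.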

20
OHT Performance Evaluation
  • Access to a given list of 100 URLs that include
    textual HTML.
  • Simulated page requests sent from 38 PCs
  • Each run, if performed ideally, would result in
    3,800 reconstructed HTML pages (100 URLs x 38 PCs).

21
Experimental Environment
  • A small network of 38 computers having constant
    IP addresses
  • All computers access the web through the same
    switch
  • The switch was programmed to send all the packets
    to one port
  • About 170,000 web transactions (page views) of the
    normal group were recorded over 24 days
  • We simulated suspicious behavior by accessing
    terror-related web sites (582).

22
TP and FP as a function of the number of clusters
23
Conclusions
  • No. of clusters affects results
  • Cannot be generalized
  • Must be calibrated

24
TP and FP for an alarm threshold of 100
25
TP and FP for an alarm threshold of 50
26
TP and FP for queue size 2
27
TP and FP for queue size 32
28
Conclusions
  • Queue size and abnormal ratio affect results
  • Cannot be generalized
  • Need to be calibrated

29
Improvements and challenges
  • Large scale evaluations
  • Multilingual (Arabic..)
  • More effective representation of pages
  • Graphs
  • Context-aware representations
  • Optimal number of clusters
  • Analyzing views of non-textual information
  • Example

30
Problems
  • False positive
  • Hiding behind a NAT
  • Deployment or De-NAT
  • Privacy..