Title: Data Mining

1
Data Mining Intrusion Detection

Shan Bai
Instructor: Dr. Yingshu Li
CSC 8712, Spring 08

2
Outline

Intrusion Detection
Data Mining
Data Mining in Intrusion Detection
Reference

3
What is an intrusion?

An intrusion can be defined as any set of
actions that attempt to compromise the
integrity,
confidentiality, or
availability
of a resource.

[Figure: Incidents reported to the Computer
Emergency Response Team/Coordination Center]
[Figure: Spread of the SQL Slammer worm 10 minutes
after its deployment]
4
Intrusion Examples

Trojan horse / worm
Address spoofing: a malicious user uses a fake IP
address to send malicious packets to a target.
Many others

DoS: denial of service
R2L: unauthorized access from a remote machine,
e.g., password guessing
U2R: unauthorized access to local superuser (root)
privileges, e.g., various "buffer overflow"
attacks
Probing: surveillance and other probing, e.g.,
port scanning

5
Intrusion Detection System (IDS)

Intrusion Detection System: a combination of
software and hardware that attempts to perform
intrusion detection and raises an alarm when a
possible intrusion happens.

6
IDS Categories

Intrusion detection systems are split into two
groups:
Anomaly detection systems
Identify malicious traffic based on deviations
from an established baseline of normal network
behavior.
Misuse detection systems
Identify intrusions based on known patterns
(signatures) of malicious activity.

7
Anomaly Detection
[Figure: activity measures; probable intrusion]
Baseline the normal traffic and then look for
things that are out of the norm
Relatively high false positive rate:
anomalies can just be new normal activities.
8
Misuse Detection
Example: if (src_ip == dst_ip) then "land attack"
Look for known indicators: ICMP scans, port scans,
connection attempts, CPU, RAM, and I/O utilization,
file system activity, modification of system
files, permission modifications
Cannot detect new attacks
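
A minimal sketch of the misuse rule on this slide, written in Python. The Packet class, its field names, and the match_signatures helper are hypothetical names chosen for illustration; a real signature-based IDS uses a much richer rule language. The only rule encoded here is the land attack check, where the source and destination addresses are identical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    # Hypothetical, simplified packet record (not a real capture format).
    src_ip: str
    dst_ip: str
    dst_port: int


def is_land_attack(pkt: Packet) -> bool:
    """Signature from the slide: source address equals destination address."""
    return pkt.src_ip == pkt.dst_ip


def match_signatures(pkt: Packet) -> list:
    """Return the names of all known signatures that the packet matches."""
    alerts = []
    if is_land_attack(pkt):
        alerts.append("land attack")
    return alerts


spoofed = Packet(src_ip="10.0.0.5", dst_ip="10.0.0.5", dst_port=80)
benign = Packet(src_ip="10.0.0.5", dst_ip="10.0.0.9", dst_port=80)
print(match_signatures(spoofed))
print(match_signatures(benign))
```

Because the rule list is fixed, a packet from an attack with no matching rule produces no alert, which is the limitation the slide points out.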
9

Goal of Intrusion Detection Systems (IDS)
To detect an intrusion as it happens and be able
to respond to it.
False positives
A false positive is a situation where something
abnormal (as defined by the IDS) happens, but it
is not an intrusion.
Too many false positives
The user will quit monitoring the IDS because of
the noise.
False negatives
A false negative is a situation where an
intrusion is really happening, but the IDS
doesn't catch it.
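
The two error types can be counted directly once alarms are compared against ground truth. The sketch below is illustrative only: the function name and the two boolean lists are made-up inputs, not data from any IDS.

```python
def alarm_errors(alarms, truth):
    """Count false positives and false negatives.

    alarms[i] is True when the IDS raised an alarm for event i;
    truth[i] is True when event i really was an intrusion.
    """
    false_positives = sum(1 for a, t in zip(alarms, truth) if a and not t)
    false_negatives = sum(1 for a, t in zip(alarms, truth) if not a and t)
    return false_positives, false_negatives


# Made-up example: four events, one wrongly flagged, one missed.
alarms = [True, True, False, False]
truth = [True, False, True, False]
fp, fn = alarm_errors(alarms, truth)
print(fp, fn)
```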

10
Outline

Intrusion Detection
Data Mining
Data Mining in Intrusion Detection
Reference

11
Why do we need Data Mining?

Despite the enormous amount of data, particular
events of interest are still quite rare:
their frequency ranges from 0.1% to less than 10%
We are drowning in data, but starving for
knowledge!

12
Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): the whole
process of finding useful information and
patterns in data
Data Mining: the use of algorithms to extract the
information and patterns derived by the KDD
process
Data mining is the core of the knowledge
discovery process

13
KDD Process

Selection: obtain data from various sources.
Preprocessing: cleanse data.
Transformation: convert to a common format;
transform to a new format.
Data Mining: obtain desired results.
Interpretation/Evaluation: present results to
the user in a meaningful manner.

14
Data Mining: A KDD Process

Data mining: core of the knowledge discovery
process

[Figure: the KDD process. Labels, from bottom to
top: databases; data cleaning and data
integration; data warehouse; selection;
task-relevant data; data mining; pattern
evaluation; knowledge]
15
Typical Data Mining Architecture
[Figure: architecture layers, from top to bottom:
graphical user interface; pattern evaluation;
data mining engine (with knowledge base);
database or data warehouse server; data cleaning,
data integration, and filtering; databases and
data warehouse]
16
Outline

Intrusion Detection
Data Mining
Data Mining in Intrusion Detection
Reference

17
Network intrusion detection
The number of intrusions on the network is
typically a very small fraction of the total
network traffic

18
Why Can Data Mining Help?

Learn from traffic data
Supervised learning: learn precise models from
past intrusions
Unsupervised learning: identify suspicious
activities
Maintain models on dynamic data
Correlation of suspicious events across network
sites
Helps detect sophisticated attacks not
identifiable by single site analyses
Analysis of long term data (months/years)
Uncover suspicious stealth activities (e.g.
insiders leaking/modifying information)

19
Intrusion Detection

Traditional intrusion detection system (IDS)
tools (e.g., SNORT) are based on signatures of
known attacks
Limitations:
Signature database has to be manually revised
for each new type of discovered intrusion
They cannot detect emerging cyber threats
Substantial latency in deployment of newly
created signatures across the computer system

20
Data Mining for Intrusion Detection: Techniques
and Applications

Frequent pattern mining
Classification
Clustering
Mining data streams

21
Frequent pattern mining

Patterns that occur frequently in a database
Mining frequent patterns: finding regularities
Process of mining frequent patterns for intrusion
detection:
Phase I: mine a repository of normal frequent
itemsets from attack-free data
Phase II: find frequent itemsets in the last n
connections and compare the patterns to the
normal profile
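
The two phases can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular system: each connection is modeled as a set of made-up "attribute=value" strings, itemsets are limited to size two for brevity, and the function names and the 0.5 support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(connections, min_support, max_size=2):
    """Return itemsets (up to max_size items) present in at least
    min_support (a fraction) of the given connections."""
    counts = Counter()
    for conn in connections:
        items = sorted(set(conn))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    threshold = min_support * len(connections)
    return {itemset for itemset, n in counts.items() if n >= threshold}


def suspicious_itemsets(normal_profile, recent_connections, min_support):
    """Phase II: frequent itemsets in the recent window that are
    absent from the normal profile."""
    return frequent_itemsets(recent_connections, min_support) - normal_profile


# Phase I: attack-free training connections (made-up data).
attack_free = [
    {"proto=tcp", "port=80"},
    {"proto=tcp", "port=80"},
    {"proto=udp", "port=53"},
    {"proto=tcp", "port=80"},
]
profile = frequent_itemsets(attack_free, min_support=0.5)

# Phase II: the last n connections (made-up data).
recent = [{"proto=tcp", "port=8888"}] * 3 + [{"proto=tcp", "port=80"}]
suspicious = suspicious_itemsets(profile, recent, min_support=0.5)
print(suspicious)
```

Only the itemsets involving port 8888 are reported, because plain TCP traffic is already part of the normal profile.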

22
Frequent pattern mining

Apriori
Any subset of a frequent itemset must also be
frequent (an anti-monotone property)
A transaction containing {beer, diaper, nuts}
also contains {beer, diaper}
{beer, diaper, nuts} is frequent -> {beer,
diaper} must also be frequent
No superset of any infrequent itemset should be
generated or tested
Many item combinations can be pruned
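
A compact sketch of Apriori-style level-wise search that uses the pruning rule above. It is a simplified illustration, not an optimized implementation; the function name, the absolute min_count threshold, and the toy transactions are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return a dict mapping each frequent itemset to its support count."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: combine frequent itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop any candidate that has an infrequent subset,
        # so no superset of an infrequent itemset is ever counted.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


baskets = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "diaper", "nuts"},
    {"milk"},
]
result = apriori(baskets, min_count=2)
print(result[frozenset({"beer", "diaper", "nuts"})])
print(result[frozenset({"beer", "diaper"})])
```

Since {milk} is infrequent, no itemset containing milk is ever generated as a candidate.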

23
Sequential Pattern Analysis

Models sequence patterns
(Temporal) order is important in many situations
Time-series databases and sequence databases
Frequent patterns -> (frequent) sequential
patterns
Sequential patterns for intrusion detection
Capture the signatures for attacks in a series of
packets

24
Sequential Pattern Mining

Given a set of sequences, find the complete set
of frequent subsequences

25
Apriori Property in Sequences
26
Classification: A Two-Step Process

Model construction: describe a set of
predetermined classes
Training dataset: tuples for model construction
Each tuple/sample belongs to a predefined class
Classification rules, decision trees, or math
formulae
Model application: classify unseen objects
Estimate accuracy of the model using an
independent test set
Acceptable accuracy -> apply the model to classify
data tuples with unknown class labels

27
Classification
28
Classification: Decision Tree

A node in the tree: a test of some attribute
A branch: a possible value of the attribute
Classification
Start at the root
Test the attribute
Move down the tree branch
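
The traversal described above can be sketched with a small hand-built tree. The tree, its attributes ("protocol", "flag"), and its labels are invented for illustration; a real tree would be learned from labeled training data.

```python
# Hypothetical tree: internal nodes test an attribute, leaves carry a label.
TREE = {
    "attribute": "protocol",
    "branches": {
        "icmp": {"label": "probe"},
        "tcp": {
            "attribute": "flag",
            "branches": {
                "SYN": {"label": "suspicious"},
                "ACK": {"label": "normal"},
            },
        },
    },
}


def classify(node, record, default="unknown"):
    """Start at the root, test the node's attribute, and follow the
    branch that matches the record's value until a leaf is reached."""
    while "label" not in node:
        value = record.get(node["attribute"])
        child = node["branches"].get(value)
        if child is None:
            # No branch for this value: fall back to a default label.
            return default
        node = child
    return node["label"]


print(classify(TREE, {"protocol": "tcp", "flag": "SYN"}))
print(classify(TREE, {"protocol": "icmp"}))
```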

29
Neural classification: HIDE

A hierarchical network intrusion detection
system using statistical processing and neural
network classification, by Zheng et al.
Five major components:
Probes: collect traffic data
Event preprocessor: preprocesses traffic data and
feeds the statistical model
Statistical processor: maintains a model for
normal activities and generates vectors for new
events
Neural network: classifies the vectors of new
events
Post processor: generates reports

30
Clustering

What Is Clustering?
Group data into clusters
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Unsupervised learning: no predefined classes

31
Clustering

What Is A Good Clustering?
High intra-class similarity and low
inter-class similarity
Depending on the similarity measure
The ability to discover some or all of the hidden
patterns

32
Clustering

Clustering Approaches
Partitioning algorithms
Partition the objects into k clusters
Iteratively reallocate objects to improve the
clustering
Hierarchical algorithms
Agglomerative: each object starts as its own
cluster; merge clusters to form larger ones
Divisive: all objects start in one cluster; split
it up into smaller clusters

33
Clustering

K-Means Example
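
K-means is a partitioning algorithm of the kind described on the previous slide. The sketch below is a plain-Python illustration with made-up 2-D points and hand-picked starting centroids; it is not the example shown in the original figure.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each
    centroid to the mean of its members; repeat until stable."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The result depends on the starting centroids, so practical use typically tries several initializations.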

34
Mining Data Streams for Intrusion Detection

Maintaining profiles of normal activities
The profiles of normal activities may drift
Identifying novel attacks
Identifying clusters and outliers in traffic data
streams
Reduce the future alarm load by writing filtering
rules that automatically discard well-understood
false positives

35
Data Mining for Intrusion Detection

Misuse detection
Predictive models are built from labeled data
sets (instances are labeled as normal or
intrusive)
These models can be more sophisticated and
precise than manually created signatures
Recent research: e.g., JAM (Java Agents for
Metalearning)

36
Misuse Detection
Example: if (src_ip == dst_ip) then "land attack"
Look for known indicators: ICMP scans, port scans,
connection attempts, CPU, RAM, and I/O utilization,
file system activity, modification of system
files, permission modifications
Cannot detect new attacks
37
JAM (Java Agents for Metalearning)

JAM (developed at Columbia University) uses data
mining techniques to discover patterns of
intrusions. It then applies a meta-learning
classifier to learn the signature of attacks.
The association rules algorithm determines
relationships between fields in the audit trail
records, and the frequent episodes algorithm
models sequential patterns of audit events.
Features are then extracted from both algorithms
and used to compute models of intrusion behavior.
The classifiers build the signature of attacks.
Thus, data mining in JAM builds a misuse
detection model.
Classifiers in JAM are generated by running a
rule-learning program on training data of system
usage. After training, the resulting
classification rules are used to recognize
anomalies and detect known intrusions.
The system has been tested with data from
Sendmail-based attacks, and with network attacks
using tcpdump data.

38
Data Mining for Intrusion Detection

Anomaly detection
Identifies anomalies as deviations from normal
behavior
E.g., ADAM (Audit Data Analysis and Mining) and
MINDS (MINnesota INtrusion Detection System)

39
Anomaly Detection
[Figure: activity measures; probable intrusion]
Baseline the normal traffic and then look for
things that are out of the norm
Relatively high false positive rate:
anomalies can just be new normal activities.
40
ADAM: Audit Data Analysis and Mining

Detecting Intrusion by Data Mining
Combination of association rules and
classification rules
First, ADAM collects known frequent datasets
using an off-line algorithm
Second, ADAM runs an online algorithm that:
Finds the latest frequent connection records
Compares them with the known mined data
Discards those that seem to be normal
Forwards the suspicious ones to the classifier
The trained classifier then classifies the
suspicious data as one of the following:
Known type of attack
Unknown type of attack
False alarm

41
ADAM: Detecting Intrusion by Data Mining
42
ADAM: Audit Data Analysis and Mining

ADAM's model has two phases
Phase 1: train the classifier
Offline process
Takes place only once
Before the main experiment
Phase 2: use the trained classifier
The trained classifier is then used to detect
anomalies
Online process

43
The MINDS Project

MINDS: MINnesota INtrusion Detection System
Learning from rare class: building rare class
prediction models
Anomaly/outlier detection
Summarization of attacks using association
pattern analysis

Rules discovered: {Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
44
MINDS - Learning from Rare Class

Problem: building models for rare network attacks
(mining a needle in a haystack)
Standard data mining models are not suitable for
rare classes
Models must be able to handle skewed class
distributions
Learning from data streams - intrusions are
sequences of events

45
MINDS - Anomaly Detection

Detect novel attacks/intrusions by identifying
them as deviations from normal, i.e. anomalous
behavior
Identify normal behavior
Construct useful set of features
Define similarity function
Use outlier detection algorithm
Nearest neighbor approach
Density based schemes
Unsupervised Support Vector Machines (SVM)
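
As an illustration of the nearest neighbor approach listed above, the sketch below scores each point by its distance to its k-th nearest neighbor, so isolated points get high scores. This is a generic distance-based outlier score with Euclidean distance as the similarity function; it is an assumption for illustration and not necessarily the scoring used by MINDS. The feature vectors are made up.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores


# Made-up feature vectors: four similar connections and one outlier.
features = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(features, k=2)
most_anomalous = scores.index(max(scores))
print(most_anomalous)
```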

46
Experimental Evaluation

Publicly available data set:
DARPA 1998 Intrusion Detection Evaluation Data
Set, prepared and managed by MIT Lincoln Lab;
includes a wide variety of intrusions simulated
in a military network environment
Real network data from the University of
Minnesota:
Anomaly detection is applied 4 times a day,
using a 10-minute time window

[Figure: MINDS deployment. Labels: network;
net-flow data using Cisco routers (10-minute
cycle, 2 million connections); data
preprocessing; MINDS anomaly detection; anomaly
scores; association pattern analysis; Snort
(www.snort.org), an open-source signature-based
network IDS]
47
MINDS - Framework for Mining Associations
[Figure: MINDS association analysis module.
Labels: anomaly detection system; ranked
connections (attack / normal); discriminating
association pattern generator; knowledge base;
update]

Build normal profile
Study changes in normal behavior
Create attack summary
Detect misuse behavior
Understand nature of the attack

R1: TCP, DstPort=1863 -> Attack
R100: TCP, DstPort=80 -> Normal
48
Discovered Real-life Association Patterns

Rule 1: SrcIP=XXXX, DstPort=80, Protocol=TCP,
Flag=SYN, NoPackets=3, NoBytes=120...180
(c1=256, c2=1)
Rule 2: SrcIP=XXXX, DstIP=YYYY, DstPort=80,
Protocol=TCP, Flag=SYN, NoPackets=3,
NoBytes=120...180 (c1=177, c2=0)

At first glance, Rule 1 appears to describe a Web
scan
Rule 2 indicates an attack on a specific machine
Both rules together indicate that a scan is
performed first, followed by an attack on a
specific machine identified as vulnerable by the
attacker

49
Discovered Real-life Association Patterns
DstIP=ZZZZ, DstPort=8888, Protocol=TCP (c1=369,
c2=0)
DstIP=ZZZZ, DstPort=8888, Protocol=TCP,
Flag=SYN (c1=291, c2=0)

This pattern indicates an anomalously high number
of TCP connections on port 8888 involving machine
ZZZZ
Follow-up analysis of connections covered by the
pattern indicates that this could be a machine
running a variation of the Kazaa file-sharing
protocol
Having an unauthorized application increases the
vulnerability of the system

50
Discovered Real-life Association Patterns (ctd.)
SrcIP=XXXX, DstPort=27374, Protocol=TCP,
Flag=SYN, NoPackets=4, NoBytes=189...200
(c1=582, c2=2)
SrcIP=XXXX, DstPort=12345, NoPackets=4,
NoBytes=189...200 (c1=580, c2=3)
SrcIP=YYYY, DstPort=27374, Protocol=TCP,
Flag=SYN, NoPackets=3, NoBytes=144
(c1=694, c2=3)

This pattern indicates a large number of scans on
ports 27374 (which is a signature for the
SubSeven worm) and 12345 (which is a signature
for the NetBus worm)
Further analysis showed that no fewer than five
machines were scanning for one or both of these
ports in any time window

51
Discovered Real-life Association Patterns (ctd.)
DstPort=6667, Protocol=TCP (c1=254, c2=1)

This pattern indicates an unusually large number
of connections on port 6667 detected by the
anomaly detector
Port 6667 is where IRC (Internet Relay Chat) is
typically run
Further analysis reveals that there are many
small packets from/to various IRC servers around
the world
Although IRC traffic is not unusual, the fact
that it is flagged as anomalous is interesting
This might indicate that the IRC server has been
taken down (by a DOS attack for example) or it is
a rogue IRC server (it could be involved in some
hacking activity)

52
Discovered Real-life Association Patterns (ctd.)
DstPort=1863, Protocol=TCP, Flag=0, NoPackets=1,
NoBytes<139 (c1=498, c2=6)
DstPort=1863, Protocol=TCP, Flag=0 (c1=587, c2=6)
DstPort=1863, Protocol=TCP (c1=606, c2=8)

This pattern indicates a large number of
anomalous TCP connections on port 1863
Further analysis reveals that the remote IP block
is owned by Hotmail
Flag=0 is unusual for TCP traffic

53
MINDS Conclusion

Data mining based algorithms are capable of
detecting intrusions that cannot be detected by
state-of-the-art signature based methods
SNORT has static knowledge manually updated by
human analysts
MINDS anomaly detection algorithms are adaptive
in nature
MINDS anomaly detection algorithms can also be
effective in detecting anomalous behavior
originating from a compromised or infected machine

54
IDS Using Both Misuse and Anomaly Detection:
RIDS-100

RIDS (Rising Intrusion Detection System) is
provided by Rising Tech, a leader in
antivirus and content security software and
services in China.
The company is a leading provider of client,
gateway, and server security solutions for virus
protection, firewall, and intrusion detection
technologies, and of security services to
enterprises and service providers around China.
RIDS makes use of both intrusion detection
techniques: misuse and anomaly detection.
A distance-based outlier detection algorithm is
used to detect deviating behavior among the
collected network data.
For misuse detection, it has a very large set of
collected data patterns that can be matched
against scanned network data.
This large set of data patterns is scanned
using a decision tree classification algorithm.
http://www.rising-global.com/

55
A cooperative anomaly and intrusion detection
system (CAIDS)

CAIDS is built with a network-based intrusion
detection system (NIDS) and an anomaly detection
system (ADS) operating interactively through a
signature generator.

56
A cooperative anomaly and intrusion detection
system (CAIDS)

A frequent episode rule (FER) is generated out of
a collection of frequent episodes. The FER is
defined over episode sequences with multiple
connection events.
For example, consider a window in which we
observe a 3-event sequence E, D, and F. An FER is
generated as $E \rightarrow D, F$ with confidence
level $\mathrm{freq}(a \cup b)/\mathrm{freq}(b) = 0.8$,
where a represents the event E on the LHS and b
corresponds to the two events D and F on the RHS
of the rule.
If b occurs with 5% frequency and the joint event
a and b occurs with 4% frequency, there is a
(0.04/0.05) = 80% chance that D and F will follow
in the same window.
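
The arithmetic on this slide can be checked with a few lines of Python. The sketch follows the ratio exactly as the slide states it (joint frequency divided by the frequency of b); note that textbook association-rule confidence divides by the frequency of the left-hand side instead. The function name and the choice of 100 observation windows are assumptions made for illustration.

```python
def fer_confidence(joint_count, b_count):
    """Ratio from the slide: freq(a and b together) / freq(b)."""
    if b_count <= 0:
        raise ValueError("b must occur at least once")
    return joint_count / b_count


# Out of 100 hypothetical windows: b occurs in 5, a and b together in 4.
total_windows = 100
b_windows = 5
joint_windows = 4

confidence = fer_confidence(joint_windows, b_windows)
print(f"{confidence:.0%}")
```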

57
A cooperative anomaly and intrusion detection
system (CAIDS)

In practice, the event E could be an
authentication service characterized by two
attributes
(service = authentication, flag = SF).
The events D, F may be two sequential smtp
requests denoted by (service = smtp).
Thus we can derive an FER with a confidence level
of c = 80% that two smtp services will follow
the authentication service within a window w = 2
sec. The three joint traffic events account for
a support level s = 10% of all the network
connections being evaluated. This FER is formally
stated as follows:
$(\text{service} = \text{authentication}) \rightarrow (\text{service} = \text{smtp})(\text{service} = \text{smtp}) \quad (0.8,\ 0.1,\ 2\ \text{sec})$
(1)

58
A cooperative anomaly and intrusion detection
system (CAIDS)

An association rule aims to find interesting
intra-relationships inside a single connection
record
In general, an FER is specified by the following
expression:
$L_1, L_2, \ldots, L_n \rightarrow R_1, \ldots, R_m \quad (c, s, \text{window})$
(2)
$L_i$ ($1 \le i \le n$) and $R_j$ ($1 \le j \le m$)
are ordered traffic connection events.
We call $L_1, L_2, \ldots, L_n$ the LHS episode
and $R_1, \ldots, R_m$ the RHS of the episode rule.

59
A cooperative anomaly and intrusion detection
system (CAIDS)
Architecture of the CAIDS simulator built with a
2,000-signature Snort and an anomaly detection
subsystem (ADS) with 60 FERs after 2 weeks of
rule training over the Lincoln Lab IDS evaluation
dataset
60
Conclusion

In this report we have studied the basic concepts
and some classic system models in this area, such
as ADAM and MINDS.
We summarized those system models, their
technologies, and their validation methods.
We hope this gives an overview of current
developments in this area and of how data mining
is evolving into the field of network intrusion
detection.

61
Reference

DARPA 1998 data set
A cleansed set in KDDCup99
DARPA 1991 data set is also available
http://www.ll.mit.edu/IST/ideval/data/data_index.html
Daniel Barbara, Julia Couto, Sushil Jajodia,
Leonard Popyack, Ningning Wu. ADAM: Detecting
Intrusions by Data Mining. Proceedings of the
2001 IEEE Workshop on Information Assurance and
Security, United States Military Academy, West
Point, NY, 5-6 June 2001.
Zhang, J. and Zulkernine, M. 2006. A Hybrid
Network Intrusion Detection Technique Using
Random Forests. In Proceedings of the First
International Conference on Availability,
Reliability and Security (April 20-22, 2006).
W. Lee et al. A data mining framework for
building intrusion detection models. In
Information and System Security, Vol. 3, No. 4,
2000.
Ertoz, L. et al. "MINDS - Minnesota Intrusion
Detection System". Next Generation Data Mining,
Chapter 3, 2004.
Lu, C.-T., Boedihardjo, A.P., Manalwar, P.
Exploiting efficient data mining techniques to
enhance intrusion detection systems. IEEE
International Conference on Information Reuse
and Integration (IRI-2005), 15-17 Aug. 2005,
pp. 512-517.
Sal Stolfo, Andreas Prodromidis, Shelley
Tselepis, Wenke Lee, Dave Fan, and Phil Chan.
(Honorable mention (runner-up) for Best Paper
Award in Applied Research Category.) In
Proceedings of the Third International Conference
on Knowledge Discovery and Data Mining (KDD '97),
Newport Beach, CA, August 1997.

62
Questions? Comments?
