1
A Data Mining Approach for Building Cost-Sensitive
and Light Intrusion Detection Models
Quarterly Review, November 2000
North Carolina State University, Columbia
University, Florida Institute of Technology
2
Outline
  • Project description
  • Progress report
  • Cost-sensitive modeling (NCSU/Columbia/FIT).
  • Automated feature and model construction (NCSU).
  • Anomaly detection (NCSU/Columbia/FIT).
  • Attack clustering and light modeling (FIT).
  • Real-time architecture and systems
    (NCSU/Columbia).
  • Correlation (NCSU).
  • Collaboration with industry (NCSU/Columbia).
  • Publications and software distribution.
  • Effort and budget.
  • Plan of work for next quarter

3
New Ideas and Hypotheses (1/2)
  • High-volume automated attacks can overwhelm a
    real-time IDS and its staff
  • IDS needs to consider cost factors
  • Damage cost, response cost, operational cost,
    etc.
  • Pure statistical accuracy is not an ideal metric
  • Base-rate fallacy of anomaly detection.
  • Alternative: the cost (saving) of an IDS.

4
New Ideas and Hypotheses (2/2)
  • Thorough analysis cannot always be done in
    real-time by one sensor
  • Correlation of multiple sensor outputs.
  • Trend or scenario analysis.
  • Need better theories and tools for building
    misuse and anomaly detection models
  • Characteristics of normal data and attack
    signatures can be measured and utilized.

5
Main Approaches (1/2)
  • Cost-sensitive models and architecture
  • Optimized for the cost metrics defined by users.
  • Cost-sensitive machine learning algorithms.
  • Multiple specialized and light sensors
    dynamically activated/configured in run-time.
  • Load balancing of models and data
  • Aggregation and correlation.
  • Cost-effectiveness as the guiding principle and
    multi-model correlation as the architectural
    approach.

6
Main Approaches (2/2)
  • Theories and tools for more effective anomaly and
    misuse detection
  • Information-theoretic measures for anomaly
    detection
  • Regularity of normal data is used to build the
    model.
  • New algorithms, e.g.
  • Unsupervised learning using noisy data.
  • Using artificial anomalies
  • An automated system that integrates all of these
    algorithms/tools.

7
Project Impacts (1/2)
  • A better understanding of the cost factors, cost
    models, and cost metrics related to intrusion
    detection.
  • Modeling techniques and deployment strategies for
    cost-effective IDSs
  • Provide the best-valued protection.
  • Clustering techniques for grouping intrusions
    and building specialized and light sensors.
  • An architecture for dynamically activating,
    configuring, and correlating sensors.

8
Project Impacts (2/2)
  • More effective misuse and anomaly detection
    models
  • With sound theoretical foundations and automation
    tools.
  • Analysis/correlation techniques for
    understanding/recognizing and predicting complex
    attack scenarios.

9
Cost-Sensitive Modeling
  • In previous quarters
  • Cost factors and metrics definition and analysis.
  • Cost model definition.
  • Cost-sensitive modeling with machine learning.
  • Evaluation using DARPA off-line data.
  • Current quarter
  • Real-time architecture.
  • Dynamic cost-sensitive deployment and correlation
    of sensors.

10
A Multi Layer/Component Architecture
(Diagram: components include a Remote IDS/Sensor, a
firewall (FW), a Real-time IDS, a Backend IDS, an ID
Model Builder, and Dynamic Cost-Sensitive Decision
Making, exchanging models.)
11
Next Steps
  • Study realistic cost metrics in the real world.
  • Implement a prototype system
  • Demonstrate the advantage of cost-sensitive
    modeling and dynamic cost-effective deployment
  • Use representative scenarios for evaluation.

12
An Automated System for Feature and Model
Construction
13
The Data Mining Process of Building ID Models
(Diagram: raw audit data → packets/events (ASCII) →
connection/session records → patterns → features →
models.)
14
Feature Construction From Patterns
(Diagram: normal and historical intrusion records
and new intrusion records are each mined for
patterns; the two sets of patterns are compared to
isolate intrusion patterns, from which features are
constructed; the features are applied to training
data to learn detection models.)
15
Status and Next Steps
  • The effectiveness of the algorithms/tools
    (process steps) has been validated
  • 1998 DARPA Evaluation.
  • Automating the process
  • Process steps chained together.
  • Process iteration under development.
  • Field test
  • Advanced Technology Systems, General Dynamics.
  • Planned public release 2Q-2001.
  • Dealing with unlabeled data
  • Integrate the anomaly detection over noisy data
    algorithms (Columbia).

16
Information-Theoretic Measures for Anomaly
Detection
  • Motivations
  • Need formal understandings.
  • Hypothesis
  • Anomaly detection is based on regularity of
    normal data.
  • Approach
  • Entropy and conditional entropy measure regularity
  • Determine how to build a model.
  • Relative (conditional) entropy measures how the
    regularities of the training and test datasets
    relate
  • Determine the performance of a model on test data.

17
Case Studies
  • Anomaly detection for Unix processes
  • Short sequences as normal profile.
  • A classification approach
  • Given the first k system calls, predict the
    (k+1)st system call
  • How to determine the sequence length k? Will
    including other information help?
  • UNM sendmail system call traces.
  • MIT Lincoln Lab BSM data.
  • Anomaly detection for network
  • How to partition the data to refine this complex
    subject.
  • MIT Lincoln Lab tcpdump data.

18
Entropy and Conditional Entropy
  • Entropy: impurity of the dataset
  • the smaller (the more regular), the better.
  • Conditional entropy: irregularity of sequential
    dependencies
  • uncertainty of the next event of a sequence after
    seeing its prefix (subsequences)
  • the smaller (the more regular), the better.

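As an illustration (a sketch, not the project's tooling), both quantities can be estimated from empirical counts over a trace of events; the toy trace and window length below are made up:

```python
from collections import Counter
from math import log2

def entropy(items):
    """Empirical entropy H(X): impurity of a dataset (smaller = more regular)."""
    counts = Counter(items)
    n = len(items)
    return -sum(c / n * log2(c / n) for c in counts.values())

def conditional_entropy(seq, k):
    """H(X | Y): uncertainty of the next event given the k preceding events,
    computed as H(prefix, next) - H(prefix) over all length-(k+1) windows."""
    windows = [tuple(seq[i:i + k + 1]) for i in range(len(seq) - k)]
    prefixes = [w[:-1] for w in windows]
    return entropy(windows) - entropy(prefixes)

# a highly regular toy system-call trace: conditional entropy is ~0
calls = ["open", "read", "read", "close", "open", "read", "read", "close"]
print(conditional_entropy(calls, 2))
```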
19
Relative (Conditional) Entropy
  • How different is p from q?
  • how different is the regularity of the test data
    from that of the training data
  • the smaller, the better.

20
Information Gain and Classification
  • How much can attribute/feature A contribute to
    the classification process?
  • the reduction of entropy when the dataset is
    partitioned according to the values of A.
  • the larger, the better.
  • if A = the first k events in a sequence (i.e.,
    Y) and the class label is the (k+1)st event
  • conditional entropy H(X|Y) is just the second
    term of Gain(X, A)
  • the smaller the conditional entropy, the better
    the performance of the classifier.

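The relationship above can be sketched in code (illustrative only): Gain(X, A) = H(X) − H(X | A), the drop in label entropy after partitioning the dataset by the values of A.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values):
    """Gain(X, A): reduction of entropy when the dataset is partitioned
    according to the values of attribute A (larger = better feature)."""
    n = len(labels)
    parts = defaultdict(list)
    for label, value in zip(labels, attr_values):
        parts[value].append(label)
    h_cond = sum(len(p) / n * entropy(p) for p in parts.values())  # H(X | A)
    return entropy(labels) - h_cond

labels = ["normal", "normal", "attack", "attack"]
attr = ["tcp", "tcp", "icmp", "icmp"]   # attribute perfectly predicts the label
print(information_gain(labels, attr))   # → 1.0
```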
21
Conditional Entropy of Training Data (UNM)
22
Misclassification Rate of Training Data
23
Conditional Entropy vs. Misclassification Rate
24
Misclassification Rate of Testing Data and
Intrusion Data
25
Relative Conditional Entropy btw. Training and
Testing Normal Data
26
(Real and Estimated) Accuracy/Cost (Time)
Trade-off
27
Conditional Entropy of In- and Out-bound Email
(MIT/LL BSM)
28
Relative Conditional Entropy
29
Misclassification Rate of in-bound Email
30
Misclassification Rate of out-bound Email
31
Accuracy/cost Trade-off
32
Estimated Accuracy/cost Trade-off
33
Key Findings
  • Regularity of data can guide how to build a
    model
  • For sequential data, conditional entropy directly
    influences the detection performance
  • Determines the (best) sequence length and whether
    to include more information, before building a
    model.
  • When cost is also considered, this determines the
    optimal model.
  • Detection performance on test data can be
    attained only if its regularity is similar to
    that of the training data.

34
Next Steps
  • Study how to measure more complex environments
  • Network topology/configuration/traffic, etc.
  • Extend the principle/approach for misuse
    detection
  • Measure normal, attack, and their relationship
  • Parameter adjustment, performance prediction.

35
New Anomaly Detection Approaches
  • Unsupervised training methods
  • Build models over noisy (not clean) data
  • Artificial anomalies
  • Improves performance of misuse and anomaly
    detection methods.
  • Network traffic anomaly detection

36
AD over Noisy Data
  • Builds normal models over data containing some
    anomalies.
  • Motivating assumptions
  • Intrusions are extremely rare compared to
    normal data.
  • Intrusions are quantitatively different.

37
Approach Overview
  • Mixture model
  • Normal component
  • Anomalous component
  • Build probabilistic model of data
  • Max likelihood test for detection.

38
Mixture Model of Anomalies
  • Assume a generative model
  • The data is generated with a probability
    distribution D.
  • Each element originates from one of two
    components
  • M, the Majority Distribution (x ∈ M).
  • A, the Anomalous Distribution (x ∈ A).
  • Thus D = (1 − λ)M + λA.

39
Modeling Probability Distributions
  • Train Probability Distributions over current sets
    of M and A.
  • PM(X): probability distribution for the Majority.
  • PA(X): probability distribution for the Anomaly.
  • Any probability modeling method can be used
  • Naïve Bayes, Max Entropy, etc.

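A simplified sketch of the max-likelihood test (assumptions: categorical data, smoothed frequency models for PM and PA, one tentative move per distinct value, and made-up λ and threshold): an element is kept in the anomalous set A only when moving it there raises the log-likelihood of the whole partition.

```python
from collections import Counter
from math import log

def partition_loglik(M, A, lam, vocab, alpha=1.0):
    """Log-likelihood of a partition under D = (1 - lam)*PM + lam*PA,
    where PM and PA are smoothed frequency models of the sets M and A."""
    def component(counts, log_weight):
        total = sum(counts.values())
        ll = 0.0
        for c in counts.values():
            if c > 0:
                p = (c + alpha) / (total + alpha * vocab)
                ll += c * (log_weight + log(p))
        return ll
    return component(M, log(1 - lam)) + component(A, log(lam))

def detect_anomalies(data, lam=0.1, threshold=0.0):
    """Greedy max-likelihood test: tentatively move one occurrence of each
    distinct value from the majority set M to the anomaly set A, and keep
    the move only if the partition's log-likelihood improves."""
    vocab = len(set(data))
    M, A = Counter(data), Counter()
    anomalies = set()
    for x in sorted(set(data)):
        before = partition_loglik(M, A, lam, vocab)
        M[x] -= 1
        A[x] += 1
        if partition_loglik(M, A, lam, vocab) - before > threshold:
            anomalies.add(x)          # likelihood rose: treat x as anomalous
        else:
            M[x] += 1
            A[x] -= 1                 # revert the move
    return anomalies

data = ["GET"] * 50 + ["POST"] * 45 + ["0xdeadbeef"]  # one rare, odd element
print(detect_anomalies(data))  # → {'0xdeadbeef'}
```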
40
Experiments
  • Two sets of experiments
  • Measured performance against comparison methods
    over noisy data.
  • Measured performance trained over noisy data
    against comparison methods trained over clean
    data.
  • The method was robust in both comparisons.

41
AD Using Artificial Anomalies
  • Generate abnormal behavior artificially
  • Assume the given normal data are representative.
  • "Near misses" of normal behavior are considered
    abnormal.
  • Change the value of only one feature in an
    instance of normal behavior.
  • Sparsely represented values are sampled more
    frequently.
  • "Near misses" help define a tight boundary
    enclosing the normal behavior.

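A sketch of the generation step (data and names are illustrative; the inverse-frequency weighting is one plausible reading of "sparsely represented values are sampled more frequently"):

```python
import random
from collections import Counter

def artificial_anomalies(normal, n_samples, seed=0):
    """Generate "near misses": copy a normal record and change exactly one
    feature to a different observed value, preferring rare values."""
    rng = random.Random(seed)
    n_features = len(normal[0])
    # per-feature frequency of each observed value
    freq = [Counter(rec[i] for rec in normal) for i in range(n_features)]
    out = []
    while len(out) < n_samples:
        rec = list(rng.choice(normal))
        i = rng.randrange(n_features)
        candidates = [v for v in freq[i] if v != rec[i]]
        if not candidates:
            continue                  # this feature has only one value
        # weight inversely to frequency: sparse values are sampled more
        # often, tightening the boundary around normal behavior
        weights = [1.0 / freq[i][v] for v in candidates]
        rec[i] = rng.choices(candidates, weights=weights)[0]
        out.append(tuple(rec))
    return out

normal = [("tcp", "http", "SF"), ("tcp", "smtp", "SF"), ("udp", "dns", "SF")]
print(artificial_anomalies(normal, 3))
```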
42
Experimental Results
  • Learning algorithm: RIPPER
  • Data: 1998 DARPA evaluation
  • U2R, R2L, DOS, PRB: 22 clusters
  • Training data: normal and artificial anomalies
  • Results
  • Overall detection rate: 94.26%
  • Overall false alarm rate: 2.02%
  • 100% detection: buffer_overflow, guess_passwd,
    phf, back
  • 0% detection: perl, spy, teardrop, ipsweep, nmap
  • 50% detection for 13 out of 22 intrusion
    subclasses

43
Combining Anomaly and Misuse Detection
  • Training data normal data, artificially
    generated anomalies, known intrusion data
  • The learned model can predict normal, anomaly, or
    known intrusion subclass
  • Experiments were performed on increasing subsets
    of known intrusion subclasses in the training
    data (simulates identified intrusions over time).

44
Combining Anomaly and Misuse Detection (continued)
  • Consider phf, pod, teardrop, spy, and smurf as
    unknown (absent from the training data)
  • Anomaly detection rate: phf 25%, pod 100%,
    teardrop 93.91%, spy 50%, smurf 100%
  • Overall false alarm rate: 0.20%
  • The false alarm rate dropped from 2.02% to
    0.20% when some known attacks are included for
    training

45
Adaptive Combined Anomaly and Misuse Detection
  • Completely re-training the model whenever a new
    intrusion is found is an expensive and slow
    process.
  • An effective and fast remedy is very important to
    thwart these attacks.
  • Re-training is still necessary when time and
    resources allow.

46
Multiple Model Adaptive Approach
  • Generate an additional detection module that is
    only good at detecting the newly discovered
    intrusion.
  • Method 1: trained from normal and new intrusion
    data
  • Method 2: trained from new intrusion data and
    artificial anomalies
  • When the old classifier predicts anomaly, the
    record is further examined by the new classifier
    to check whether it is the new intrusion.

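The two-stage prediction can be sketched as follows (the toy stand-in models and field names are invented for illustration):

```python
def ensemble_predict(old_model, new_model, record):
    """The old model handles known classes; any record it labels
    'anomaly' is re-examined by the lightweight new-intrusion model."""
    label = old_model(record)
    if label == "anomaly":
        return new_model(record)   # 'new_intrusion' or still 'anomaly'
    return label

# toy stand-in models (illustrative only)
def old_model(r):
    return "normal" if r["service"] in {"http", "smtp"} else "anomaly"

def new_model(r):
    return "new_intrusion" if r["flag"] == "S0" else "anomaly"

print(ensemble_predict(old_model, new_model,
                       {"service": "finger", "flag": "S0"}))  # → new_intrusion
```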
47
Multiple Model Adaptive Experiment
  • The old model is trained from n intrusions.
  • A lightweight model is trained from one new
    intrusion type.
  • They are combined as an ensemble.
  • The accuracy and training time are compared with
    those of one model trained from n + 1 intrusions.

48
Multiple Model Adaptive Experiment Result
  • The accuracy difference is very small
  • recall: +3.4%
  • precision: −16%
  • In other words, the ensemble approach detects more
    of the new intrusion, but also misidentifies more
    anomalies as the new intrusion.
  • Training time: a 150× difference, or a cup of
    coffee versus one or two days.

49
Detecting Anomalies in Network Traffic (1/2)
  • Can we detect intrusions by identifying novel
    values in network packets?
  • Anomaly detection is potentially useful in
    detecting novel attacks.
  • Our model is trained on attack-free tcpdump data.
  • Fields in the Transport layer or below are
    considered.

50
Detecting Anomalies in Network Traffic (2/2)
  • Normal field values are learned.
  • During evaluation, a function scores a packet
    based on the likelihood of encountering novel
    field values.
  • Initial results indicate our learned model
    compares favorably with other systems on the 1999
    DARPA evaluation data.

51
Packet Fields
  • Fields in Data-link, Network, and Transport
    layers.
  • (Application layer will be considered later)
  • Ethernet source, destination, protocol.
  • IP header length, TOS, fragment ID, TTL,
    transport protocol
  • TCP header length, UAPRSF flags, URG pointer
  • UDP length
  • ICMP type, code

52
Anomaly Scoring Function (1/2)
  • N1: number of unique values of a field in the
    training data
  • N: number of packets in the training data
  • Likelihood of observing a novel value in a field:
  • N1 / N
  • (escape probability, Witten and Bell, 1991)

53
Anomaly Scoring Function (2/2)
  • Non-stationary model: consider the last
    occurrence of novel values
  • t: number of seconds since the last novel value
    in the same field
  • Likelihood of observing an anomaly:
  • P = (N1 / N) × (1 / t)
  • Field anomaly score: Sf = 1 / P
  • Packet anomaly score: Σf Sf

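A minimal sketch of the scoring function (the training and timing details here are simplifying assumptions, not the actual system):

```python
class FieldModel:
    """Per-field novelty model: P = (N1 / N) * (1 / t), score Sf = 1 / P."""

    def __init__(self):
        self.values = set()       # unique values seen in training (N1)
        self.n = 0                # packets seen in training (N)
        self.last_novel = 0.0     # time of the last novel value

    def train(self, value):
        self.n += 1
        self.values.add(value)

    def score(self, value, now):
        if value in self.values:
            return 0.0                       # known value: not anomalous
        t = max(now - self.last_novel, 1.0)  # seconds since last novelty
        self.last_novel = now
        p = (len(self.values) / self.n) * (1.0 / t)
        return 1.0 / p                       # rarer novelty -> higher score

def packet_score(models, packet, now):
    """Packet anomaly score: sum of the per-field scores."""
    return sum(models[f].score(v, now) for f, v in packet.items())

ttl = FieldModel()
for v in [64, 64, 128, 64]:      # training: N = 4 packets, N1 = 2 values
    ttl.train(v)
print(ttl.score(1, now=10.0))    # novel TTL, t = 10 -> score 1 / (0.5 * 0.1)
```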
54
Experiments
  • 1999 DARPA evaluation data (from Lincoln Lab).
  • Same mechanism as DARPA in determining detection
    (correct IP address of the victim, 60 seconds
    before and after an attack).
  • Score thresholds of our system and others are
    lowered to produce no more than 100 false alarms.
  • Some of the other systems use binary scoring.

55
Initial Results
IDS          TP/FP (All)   TP/FP (Network)   IDS Type
Oracle       200/0         72/0              ideal
FIT          64/100        51/100            anomaly
GMU          51/22         27/22             anomaly + signature
NYU          20/80         14/80             signature
SUNY         24/9          19/9              signature
NetSTAT      70/995        35/995            signature
Emerald-TCP  83/23         35/23             signature
56
Discussion
  • All attacks: more detections with 100 or fewer
    false alarms than most systems, except Emerald
    and NetSTAT.
  • Our initial experiments did not look at fields in
    the Application protocol layer.
  • Network attacks: more detections with 100 or
    fewer false alarms than the other systems.
  • 57 out of 72 attacks were detected with 100 false
    alarms.

57
Summary of Progress
  • Florida Tech's official start date: August 30,
    2000.
  • Near-term objective: use learning techniques to
    build anomaly detection models that can identify
    intrusions.
  • Progress: initial experimental results on the
    1999 DARPA evaluation data indicate that our
    techniques compare favorably with the other
    systems in detecting network attacks.

58
Plans for the Next Quarter
  • Investigate an entropy approach to detecting
    anomalies.
  • Study methods that incorporate more information
    from packets prior to the current packet.
  • Examine how effective our techniques are with
    respect to individual attack types.
  • Devise techniques to catch attack types that are
    undetected.
  • Incorporate fields in the Application protocol
    layer into our model.

59
Anomaly Detection Summary and Plans
  • Anomaly detection is a main focus.
  • Both theories and new approaches.
  • Will integrate
  • Theories applied to develop new AD sensors.
  • Incorporate cost-sensitive measures.
  • Study real-time architecture/performance.
  • Automated feature and model construction system.

60
Correlation Analysis of Attack Scenario
  • Motivations
  • Detecting individual attack actions not adequate
  • Damage assessment, trend prediction, etc.
  • Hypothesis
  • Attacks are related and such correlation can be
    learned.
  • Approach
  • Start with crude knowledge models.
  • Use data mining to validate/refine the models.
  • An IETF/IDWG architecture/system.

61
Objectives (1/2)
  • Local/low layer correlations in an IDS
  • Multiple sources of raw (audit) data
  • Raw information: tcpdump data, BSM records
  • Based on specific attack signatures and
    system/user normal profiles
  • Benefits
  • Better accuracy: higher TP, lower FP
  • More alarm information for higher-level and
    global analysis

62
Objectives (2/2)
  • Global / High Layer Correlations
  • Multiple sources of alarms by IDSs
  • The bigger picture
  • What really happened in our networks?
  • What can we learn from these cases?
  • Benefits
  • What is the intention of the attacks?
  • What will happen next? When? Where?
  • What can we do to prevent it from happening?

63
Architecture of Global Correlation System
(Diagram: IDSs, an Alarm Collection Center, a Report
Center, and a Knowledge Controller.)
64
Correlation Techniques from Network Management
System (1/2)
  • Rule-Based Reasoning (RBR)
  • If-then rules based on domain knowledge and
    expertise.
  • Sufficient for small, non-changing, and
    well-understood systems.
  • Model-Based Reasoning (MBR)
  • Models both physical and logical entities, such
    as hubs and routers
  • Correlation is a result of the collaboration
    among models.

65
Correlation Techniques from Network Management
Systems (2/2)
  • State-Transition Graph (STG)
  • Logical connections via state-transition.
  • May lead to unexpected behavior if the
    collaborating STGs are not carefully defined.
  • Case-Based Reasoning (CBR)
  • Learns from experience and offers solutions to
    novel problems based on past cases.
  • Need to develop a similarity metric to retrieve
    useful cases from the library.

66
Correlation Techniques for IDS
  • Combination of different correlation techniques
  • Network complexity.
  • Wide variety of attack motives and tools.
  • Adaptation of different correlation techniques
  • Different perspectives between NMS and IDS.

67
Challenges of Correlation (1/2)
  • Knowledge representation
  • How to represent the objects such as alarms, log
    files, network entities?
  • How to model the knowledge such as network
    topology, network history, intrusion library,
    previous cases?

68
Challenges of Correlation (2/2)
  • Knowledge base construction
  • What kind of knowledge base do we need?
  • How to construct the knowledge base?
  • Case library
  • Network Knowledge
  • Intrusion Knowledge
  • Pattern discovery (domain knowledge/expert
    systems, data mining ...)

69
A Case Study DDoS
  • An attack scenario from MIT/LL
  • Phase 1: IP sweep of the AFB from a remote site.
  • Phase 2: Probe of live IPs to look for the
    sadmind daemon running on Solaris hosts.
  • Phase 3: Break-ins via the sadmind
    vulnerability.
  • Phase 4: Installation of the trojan mstream DDoS
    software on three hosts at the AFB.
  • Phase 5: Launching the DDoS.

70
Alarm Model
  • Object-oriented
  • Alarm = (feature1, feature2, ...)
  • Features of an alarm
  • Attack type
  • Time stamp
  • Service
  • Source IP / domain
  • Target IP / domain
  • Target number
  • Source type (router, host, server)
  • Target type (router, host, server)
  • Duration
  • Frequency within time window

71
Alarm Model
  • Example
  • IP sweep, 09:51:51, ICMP, ppp5-23.iawhk.com,
    172.16.115.x, 20, hosts and servers, 9, 1
  • Attack type: IP sweep
  • Time stamp: 09:51:51
  • Service: ICMP
  • Source IP: ppp5-23.iawhk.com
  • Target IP: 172.16.115.x
  • Target number: 20
  • Source type: n/a
  • Target type: hosts and servers
  • Duration: 9 seconds
  • Frequency: 1

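The object-oriented alarm above could be represented as follows (an illustrative sketch; the field names paraphrase the feature list):

```python
from dataclasses import dataclass

@dataclass
class Alarm:
    """Alarm = (feature1, feature2, ...): the features of the alarm model."""
    attack_type: str
    timestamp: str
    service: str
    source_ip: str
    target_ip: str
    target_number: int
    source_type: str      # router, host, or server ("n/a" if unknown)
    target_type: str
    duration_s: int
    frequency: int        # within time window

# the IP-sweep example above
ip_sweep = Alarm("IP sweep", "09:51:51", "ICMP", "ppp5-23.iawhk.com",
                 "172.16.115.x", 20, "n/a", "hosts and servers", 9, 1)
print(ip_sweep.attack_type, ip_sweep.duration_s)
```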
72
Scenario Representation (1/2)
  • Attack scenario graph
  • Constructed by domain knowledge
  • Can be validated/augmented via data mining.
  • Describing attack scenarios via state transition.
  • Each transition with probability P.
  • Modifiable by experts.
  • Adaptive to new cases.

73
Scenario Representation (2/2)
  • Example of attack scenario graph
  • (Diagram: a state-transition graph over IP Sweep,
    Trojan Installation, and TFN2K, Trinoo, and
    Mstream DDoS.)
74
Correlation Rule Sets
  • Based on
  • Attack scenario graph.
  • Domain knowledge and expertise.
  • Case library.
  • Two Layers of Rule Sets
  • Lower layer for matching/correlating specific
    alarms.
  • Higher layer for trend prediction.
  • Probability assigned.

75
Correlation Rule Sets
  • Example of low layer rule sets
  • If (A1.type IP Sweep A2.type Port Scan
    ) (A1.time lt A2.time) (A1.domain A2.domain)
    ( A2.target gt 10 ), then A1A2
  • .
  • If (A2.type Port Scan A3.type Buffer
    Overflow) (A2.time lt A3.time) (A3.DestIP
    belongs to A2.domain) (A3.target gt2), then A2
    A3

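A low-layer rule like the first one above can be sketched as a predicate over alarm pairs (the `AlarmRec` tuple is a minimal stand-in for the full alarm model, invented for illustration):

```python
from collections import namedtuple

# minimal stand-in for the alarm model: only the fields this rule needs
AlarmRec = namedtuple("AlarmRec", "attack_type timestamp domain target_number")

def rule_sweep_then_scan(a1, a2):
    """If (A1.type = IP Sweep and A2.type = Port Scan) and (A1.time < A2.time)
    and (A1.domain = A2.domain) and (A2.target > 10), then A1 -> A2."""
    return (a1.attack_type == "IP Sweep"
            and a2.attack_type == "Port Scan"
            and a1.timestamp < a2.timestamp   # HH:MM:SS strings sort in time order
            and a1.domain == a2.domain
            and a2.target_number > 10)

a1 = AlarmRec("IP Sweep", "09:51:51", "172.16.115.x", 20)
a2 = AlarmRec("Port Scan", "09:58:02", "172.16.115.x", 14)
print(rule_sweep_then_scan(a1, a2))  # → True
```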
76
Correlation Rule Set
  • Example of high-layer rule sets
  • If (A1 → A2, A2 → A3), then A1 → A2 → A3
  • If (A1 → A2 → A3), then the attack scenario
  • is A1 → A2 → A3 → A4 with probability P1
  • A1 → A2 → A3 → A4 → A5 with probability P2
  • E.g.,
  • If (IP Sweep → Port Scan → Buffer Overflow)
  • Then next1 = Trojan Installation with P1
  • next2 = DDoS with P2

77
Status and Next Steps
  • At the very beginning of this research.
  • Attack Scenario Graph
  • How to construct it automatically?
  • How to model the statistical properties of attack
    scenario state transition?
  • How to automatically generate the correlation
    rule sets?
  • Collaboration with other groups
  • Alarm formats, architecture, IETF/IDWG.

78
Real-time System Implementation
  • Motivations
  • Validate our algorithms and models in the real
    world.
  • Faster technology transfer and greater impact.
  • Approach
  • Collaboration with industry.
  • Reuse available building blocks as much as
    possible.

79
Conceptual Architecture
(Diagram: Sensors send data to a Data Warehouse; an
Adaptive Model Generator builds models from the
warehouse data and distributes them to Detectors.)
80
System Architecture
(Diagram: an Adaptive Model Generation subsystem
(Unsupervised Machine Learning, Supervised Machine
Learning, Model Generation, Real-Time Data Mining)
connects through a Data Warehouse and Meta IDS to
the sensors: NT, Linux, and Solaris host-based IDSs,
an NFR network-based IDS, a Malicious Email Filter,
File System Wrappers, and Software Wrappers.)
81
Sensor Host Based IDS System
  • Generic interface to sensors
  • BAM (Basic Auditing Module)
  • Sends data to the data warehouse
  • Receives models from the data warehouse
  • NT system
  • Fully operational
  • Linux system / BSM (Solaris) system
  • Sensor operational
  • Under construction
  • Plan to finish construction by end of semester

82
(No Transcript)
83
Sensor Network IDS System
  • NFR Based Sensor
  • Data Mining based
  • Efficient Evaluation Architecture
  • Multiple Models
  • System operational and integrated with larger
    system

84
Sensor Malicious Email Filter
  • Monitors Email (sendmail)
  • Detects malicious emails entering domain
  • Key Features
  • Model Based
  • Generalizes to unknown malicious attachments
  • Models distributed automatically to filters
  • Status
  • Prototype operational
  • Open source release by end of semester

85
Sensor Advanced IDS Sensors
  • File Wrappers
  • Software Wrappers
  • Monitor other aspects of system
  • Status
  • File Wrappers almost finished
  • Software Wrappers under development

86
Data Warehouse
  • Stores data collected from sensors
  • Generic IDS data format
  • Data can be manipulated in database
  • Cross reference data from attacks
  • Stores generated models
  • Status
  • Currently Operational
  • Refining Interface and Data Transfer Protocol
  • Completed by end of Semester

87
Adaptive Model Generator
  • Builds models from data in data warehouse
  • Uses both supervised and unsupervised data
  • Can build models based on data collected
  • XML Based Data Exchange Format
  • Status
  • Exchange Formats defined
  • Prototype developed
  • Completion by end of semester

88
Collaboration with Industries
  • NFR.
  • Cigital (RST).
  • SAS.
  • General Dynamics.
  • Aprisma/Cabletron.
  • HRL.

89
Publications and Software, etc.
  • 4 journal and 10 conference papers
  • One best paper and two runners-up.
  • JAM.
  • MADAM ID.
  • PhDs: two graduated, one graduating, five in the
    pipeline
  • More to come

90
Efforts Current Tasks
  • Cost-sensitive modeling (NCSU/Columbia/FIT).
  • Automated feature and model construction
    (NCSU/Columbia/FIT)
  • Integration of all algorithms and tools.
  • Anomaly detection (NCSU/Columbia/FIT).
  • Attack clustering and light modeling (FIT).
  • Real-time architecture and systems
    (NCSU/Columbia).
  • Correlation (NCSU).
  • Collaboration with industry (NCSU/Columbia/FIT).