Title: A Data Mining Approach for Building Cost-Sensitive and Light Intrusion Detection Models Quarterly Review
1A Data Mining Approach for Building
Cost-Sensitive and Light Intrusion Detection
Models Quarterly Review November 2000
North Carolina State University
Columbia University
Florida Institute of Technology
2Outline
- Project description
- Progress report
- Cost-sensitive modeling (NCSU/Columbia/FIT).
- Automated feature and model construction (NCSU).
- Anomaly detection (NCSU/Columbia/FIT).
- Attack clustering and light modeling (FIT).
- Real-time architecture and systems (NCSU/Columbia).
- Correlation (NCSU).
- Collaboration with industry (NCSU/Columbia).
- Publications and software distribution.
- Effort and budget.
- Plan of work for next quarter
3New Ideas and Hypotheses (1/2)
- High-volume automated attacks can overwhelm a real-time IDS and its staff.
- IDS needs to consider cost factors.
- Damage cost, response cost, operational cost, etc.
- Pure statistical accuracy is not ideal.
- Base-rate fallacy of anomaly detection.
- Alternative: the cost (saving) of an IDS.
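The base-rate fallacy mentioned above can be made concrete with Bayes' rule. The sketch below uses hypothetical rates chosen only for illustration (they are not figures from this project); `alarm_precision` is an illustrative helper name.

```python
def alarm_precision(base_rate, detection_rate, false_alarm_rate):
    """Return P(intrusion | alarm) using Bayes' rule."""
    true_alarms = base_rate * detection_rate
    false_alarms = (1.0 - base_rate) * false_alarm_rate
    return true_alarms / (true_alarms + false_alarms)


# Hypothetical figures: 1 in 10,000 events is an intrusion,
# the detector catches 99% of intrusions and has a 1% false alarm rate.
p = alarm_precision(base_rate=1e-4, detection_rate=0.99, false_alarm_rate=0.01)
print(f"P(intrusion | alarm) = {p:.4f}")
```

Even with a detector that looks accurate on paper, most alarms are false when intrusions are rare, which is one reason to evaluate an IDS by cost rather than by accuracy alone.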
4New Ideas and Hypotheses (2/2)
- Thorough analysis cannot always be done in real time by one sensor.
- Correlation of multiple sensor outputs.
- Trend or scenario analysis.
- Need better theories and tools for building misuse and anomaly detection models.
- Characteristics of normal data and attack signatures can be measured and utilized.
5Main Approaches (1/2)
- Cost-sensitive models and architecture
- Optimized for the cost metrics defined by users.
- Cost-sensitive machine learning algorithms.
- Multiple specialized and light sensors dynamically activated/configured at run time.
- Load balancing of models and data.
- Aggregation and correlation.
- Cost-effectiveness as the guiding principle and
multi-model correlation as the architectural
approach.
6Main Approaches (2/2)
- Theories and tools for more effective anomaly and misuse detection.
- Information-theoretic measures for anomaly detection.
- Regularity of normal data is used to build the model.
- New algorithms, e.g.:
- Unsupervised learning using noisy data.
- Using artificial anomalies.
- An automated system that integrates all these algorithms/tools.
7Project Impacts (1/2)
- A better understanding of the cost factors, cost models, and cost metrics related to intrusion detection.
- Modeling techniques and deployment strategies for cost-effective IDSs.
- Provide the best-valued protection.
- Clustering techniques for grouping intrusions and building specialized and light sensors.
- An architecture for dynamically activating, configuring, and correlating sensors.
8Project Impacts (2/2)
- More effective misuse and anomaly detection models.
- With sound theoretical foundations and automation tools.
- Analysis/correlation techniques for understanding/recognizing and predicting complex attack scenarios.
9Cost-Sensitive Modeling
- In previous quarters
- Cost factors and metrics definition and analysis.
- Cost model definition.
- Cost-sensitive modeling with machine learning.
- Evaluation using DARPA off-line data.
- Current quarter
- Real-time architecture.
- Dynamic cost-sensitive deployment and correlation
of sensors.
10A Multi Layer/Component Architecture
[Figure: multi-layer architecture. Components: Remote IDS/Sensor, Dynamic Cost-sensitive Decision Making, FW, Real-time IDS, Backend IDS, ID Model Builder (supplies models).]
11Next Steps
- Study realistic cost metrics in the real world.
- Implement a prototype system.
- Demonstrate the advantage of cost-sensitive modeling and dynamic cost-effective deployment.
- Use representative scenarios for evaluation.
12An Automated System for Feature and Model
Construction
13The Data Mining Process of Building ID Models
[Figure: the data mining pipeline, from raw audit data to packets/events (ASCII), connection/session records, patterns, features, and models.]
14Feature Construction From Patterns
[Figure: feature construction flow. Labels: new intrusion records, normal and historical intrusion records, mining, patterns, compare, intrusion patterns, features, training data, learning, detection models.]
15Status and Next Steps
- The effectiveness of the algorithms/tools (process steps) has been validated.
- 1998 DARPA Evaluation.
- Automating the process
- Process steps chained together.
- Process iteration under development.
- Field test
- Advanced Technology Systems, General Dynamics.
- Planned public release 2Q-2001.
- Dealing with unlabeled data
- Integrate the algorithms for anomaly detection over noisy data (Columbia).
16Information-Theoretic Measures for Anomaly
Detection
- Motivations
- Need formal understandings.
- Hypothesis
- Anomaly detection is based on the regularity of normal data.
- Approach
- Entropy and conditional entropy: regularity.
- Determine how to build a model.
- Relative (conditional) entropy: how the regularities of the training and test datasets relate.
- Determine the performance of a model on test data.
17Case Studies
- Anomaly detection for Unix processes
- Short sequences as normal profile.
- A classification approach
- Given the first k system calls, predict the (k+1)st system call.
- How to determine the sequence length, k? Will including other information help?
- UNM sendmail system call traces.
- MIT Lincoln Lab BSM data.
- Anomaly detection for network
- How to partition the data: refine the complex subject.
- MIT Lincoln Lab tcpdump data.
18Entropy and Conditional Entropy
- Impurity of the dataset
- the smaller (the more regular) the better.
- Irregularity of sequential dependencies
- uncertainty of a sequence after seeing its prefix (subsequences)
- the smaller (the more regular) the better.
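A minimal sketch of the conditional entropy of the (k+1)st event given the previous k events, estimated from sliding windows over a trace. The function name and the toy trace are hypothetical; real traces would be system call sequences such as the UNM sendmail data.

```python
import math
from collections import Counter


def conditional_entropy(trace, k):
    """Estimate H(X | Y), where Y is a length-k prefix and X is the next event."""
    joint = Counter()   # counts of (prefix, next_event) pairs
    prefix = Counter()  # counts of each prefix
    for i in range(len(trace) - k):
        y = tuple(trace[i:i + k])
        x = trace[i + k]
        joint[(y, x)] += 1
        prefix[y] += 1
    total = sum(joint.values())
    h = 0.0
    for (y, x), n in joint.items():
        p_xy = n / total            # P(x, y)
        p_x_given_y = n / prefix[y]  # P(x | y)
        h -= p_xy * math.log2(p_x_given_y)
    return h


# Hypothetical toy trace: after "a" the next event is ambiguous for k = 1,
# but a prefix of length 2 determines the next event completely.
toy_trace = ["a", "b", "a", "c"] * 25
for k in (1, 2):
    print(k, conditional_entropy(toy_trace, k))
```

On this toy trace the estimate drops to zero once k is large enough to disambiguate the next event, which is the kind of signal the slides propose for choosing the sequence length.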
19Relative (Conditional) Entropy
- How different is p from q?
- how different is the regularity of test data from that of training data
- the smaller the better.
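As an illustration, relative entropy between the event distributions of a training set and a test set can be estimated as below. This is a sketch under stated assumptions: p is taken from the training data and q from the test data, and additive smoothing (`alpha`) is added here so that unseen events do not produce a division by zero; neither choice is specified in the slides.

```python
import math
from collections import Counter


def relative_entropy(train, test, alpha=1.0):
    """Estimate sum_x p(x) * log2(p(x) / q(x)) with additive smoothing."""
    vocab = sorted(set(train) | set(test))

    def distribution(events):
        counts = Counter(events)
        total = len(events) + alpha * len(vocab)
        return {x: (counts[x] + alpha) / total for x in vocab}

    p = distribution(train)
    q = distribution(test)
    return sum(p[x] * math.log2(p[x] / q[x]) for x in vocab)


# Hypothetical event lists.
train_events = ["open"] * 9 + ["exec"]
similar_test = ["open"] * 9 + ["exec"]
shifted_test = ["open"] * 5 + ["exec"] * 5
print(relative_entropy(train_events, similar_test))
print(relative_entropy(train_events, shifted_test))
```

A value near zero suggests the test data has the same regularity as the training data; a larger value suggests a model trained on one may not perform as well on the other.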
20Information Gain and Classification
- How much can attribute/feature A contribute to the classification process?
- the reduction of entropy when the dataset is partitioned according to values of A.
- the larger the better.
- if A = the first k events in a sequence (i.e., Y) and the class label is the (k+1)st event
- conditional entropy H(X|Y) is just the second term of Gain(X, A)
- the smaller the conditional entropy, the better the performance of the classifier.
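For reference, the standard textbook definitions behind this slide can be written as follows. The notation is assumed here rather than taken from the slides: X is the dataset, C_X its set of class labels, and X_v the subset of X in which A takes the value v.

```latex
H(X) = -\sum_{x \in C_X} P(x) \log P(x)

\mathrm{Gain}(X, A) = H(X) - \sum_{v \in \mathrm{Values}(A)} \frac{|X_v|}{|X|} \, H(X_v)

H(X \mid Y) = \sum_{v \in \mathrm{Values}(A)} \frac{|X_v|}{|X|} \, H(X_v)
\quad \text{when } A = Y
```

With A set to the length-k prefix Y, maximizing the gain is therefore the same as minimizing the conditional entropy H(X|Y), since H(X) does not depend on A.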
21 Conditional Entropy of Training Data (UNM)
22 Misclassification Rate Training Data
23Conditional Entropy vs. Misclassification Rate
24Misclassification Rate of Testing Data and
Intrusion Data
25Relative Conditional Entropy between Training and
Testing Normal Data
26(Real and Estimated) Accuracy/Cost (Time)
Trade-off
27Conditional Entropy of In- and Out- bound Email
(MIT/LL BSM)
28Relative Conditional Entropy
29Misclassification Rate of in-bound Email
30Misclassification Rate of out-bound Email
31Accuracy/cost Trade-off
32Estimated Accuracy/cost Trade-off
33Key Findings
- Regularity of data can guide how to build a model.
- For sequential data, conditional entropy directly influences the detection performance.
- Determines the (best) sequence length and whether to include more information, before building a model.
- When cost is also considered, determines the optimal model.
- Detection performance on test data can be attained only if its regularity is similar to that of the training data.
34Next Steps
- Study how to measure more complex environments.
- Network topology/configuration/traffic, etc.
- Extend the principle/approach to misuse detection.
- Measure normal, attack, and their relationship.
- Parameter adjustment, performance prediction.
35New Anomaly Detection Approaches
- Unsupervised training methods
- Build models over noisy (not clean) data
- Artificial anomalies
- Improves performance of misuse and anomaly detection methods.
- Network traffic anomaly detection
36AD over Noisy Data
- Builds normal models over data containing some anomalies.
- Motivating assumptions
- Intrusions are extremely rare compared to normal data.
- Intrusions are quantitatively different.
37Approach Overview
- Mixture model
- Normal component
- Anomalous component
- Build probabilistic model of data
- Max likelihood test for detection.
38Mixture Model of Anomalies
- Assume a generative model
- The data is generated with a probability distribution D.
- Each element originates from one of two components:
- M, the Majority Distribution (x ∈ M).
- A, the Anomalous Distribution (x ∈ A).
- Thus D = (1 - λ)M + λA.
39Modeling Probability Distributions
- Train probability distributions over the current sets of M and A.
- P_M(X) = probability distribution for Majority.
- P_A(X) = probability distribution for Anomaly.
- Any probability modeling method can be used
- Naïve Bayes, Max Entropy, etc.
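One way to read the mixture-model approach is as a likelihood test: an element is moved from M to A when doing so raises the log-likelihood of the whole dataset by more than a threshold. The sketch below is an illustration under assumptions that are not in the slides: data are single categorical values, P_M is a smoothed frequency estimate, P_A is uniform over the observed values, and `lam`, `threshold`, and `alpha` are hypothetical settings.

```python
import math
from collections import Counter


def log_likelihood(m_counts, n_anomalies, vocab_size, lam, alpha):
    """Log-likelihood of the data under D = (1 - lam) * M + lam * A."""
    n_majority = sum(m_counts.values())
    denom = n_majority + alpha * vocab_size
    ll = n_majority * math.log(1.0 - lam)
    # P_M: smoothed frequency estimate over the current majority set.
    ll += sum(n * math.log((n + alpha) / denom) for n in m_counts.values())
    # P_A: uniform over the observed vocabulary (an assumption).
    ll += n_anomalies * (math.log(lam) - math.log(vocab_size))
    return ll


def detect_anomalies(data, lam=0.01, threshold=1.0, alpha=1.0):
    """Move an element to the anomaly set if that raises the likelihood enough."""
    vocab_size = len(set(data))
    m_counts = Counter(data)  # start with every element in the majority set
    anomalies = []
    for x in data:
        before = log_likelihood(m_counts, len(anomalies), vocab_size, lam, alpha)
        m_counts[x] -= 1
        after = log_likelihood(m_counts, len(anomalies) + 1, vocab_size, lam, alpha)
        if after - before > threshold:
            anomalies.append(x)   # keep x in the anomaly set
        else:
            m_counts[x] += 1      # put x back in the majority set
    return anomalies


# Hypothetical noisy data: two common commands and one rare one.
commands = ["ls"] * 2000 + ["cat"] * 1000 + ["nc"]
print(detect_anomalies(commands))
```

Any probability model (for example Naive Bayes) could replace the frequency estimate used for P_M here; the structure of the test stays the same.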
40Experiments
- Two Sets of experiments
- Measured performance against comparison methods over noisy data.
- Measured performance trained over noisy data against comparison methods trained over clean data.
- Method robust in both comparisons.
41AD Using Artificial Anomalies
- Generate abnormal behavior artificially
- Assume the given normal data are representative.
- "Near misses" of normal behavior are considered abnormal.
- Change the value of only one feature in an instance of normal behavior.
- Sparsely represented values are sampled more frequently.
- "Near misses" help define a tight boundary enclosing the normal behavior.
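The generation procedure described above can be sketched as follows. This is an illustration, not the project's implementation: records are tuples of categorical features, the number of samples drawn for a value is `max_count - count + 1` (an assumed weighting that favors sparse values), and candidates that coincide with a known normal record are discarded.

```python
import random
from collections import Counter


def artificial_anomalies(normal, seed=0):
    """Generate near misses by changing one feature of a normal record."""
    rng = random.Random(seed)
    known = set(normal)
    anomalies = []
    for f in range(len(normal[0])):
        counts = Counter(row[f] for row in normal)
        values = list(counts)
        if len(values) < 2:
            continue  # no alternative value to substitute for this feature
        max_count = max(counts.values())
        for v, c in counts.items():
            rows_with_v = [row for row in normal if row[f] == v]
            # Sparse values get more samples than frequent ones.
            for _ in range(max_count - c + 1):
                row = rng.choice(rows_with_v)
                new_v = rng.choice([u for u in values if u != v])
                candidate = row[:f] + (new_v,) + row[f + 1:]
                if candidate not in known:
                    anomalies.append(candidate)
    return anomalies


# Hypothetical normal connection records: (protocol, service, flag).
normal_records = [("tcp", "http", "SF")] * 8 + [("udp", "dns", "SF")] * 2
generated = artificial_anomalies(normal_records)
print(len(generated), set(generated))
```

The generated records, labeled as anomalies, would then be added to the normal data to train a classifier such as RIPPER.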
42Experimental Results
- Learning algorithm: RIPPER
- Data: 1998 DARPA evaluation
- U2R, R2L, DOS, PRB: 22 clusters
- Training data: normal and artificial anomalies
- Results
- Overall detection rate: 94.26%
- Overall false alarm rate: 2.02%
- 100% detection: buffer_overflow, guess_passwd, phf, back
- 0% detection: perl, spy, teardrop, ipsweep, nmap
- 50% detection: 13 out of 22 intrusion subclasses
43Combining Anomaly and Misuse Detection
- Training data: normal data, artificially generated anomalies, known intrusion data.
- The learned model can predict normal, anomaly, or known intrusion subclass.
- Experiments were performed on increasing subsets of known intrusion subclasses in the training data (simulates identified intrusions over time).
44Combining Anomaly and Misuse Detection (continued)
- Consider phf, pod, teardrop, spy, and smurf as unknown (absent from the training data).
- Anomaly detection rate: phf = 25%, pod = 100%, teardrop = 93.91%, spy = 50%, smurf = 100%.
- Overall false alarm rate: 0.20%.
- The false alarm rate has dropped from 2.02% to 0.20% when some known attacks are included for training.
45Adaptive Combined Anomaly and Misuse Detection
- Completely re-training the model whenever a new intrusion is found is a very expensive and slow process.
- An effective and fast remedy is very important to thwart these attacks.
- Re-training is still necessary when time and resources permit.
46Multiple Model Adaptive Approach
- Generate an additional detection module that is only good at detecting the newly discovered intrusion.
- Method 1: trained from normal and new intrusion data.
- Method 2: trained from new intrusion and artificial anomaly data.
- When the old classifier predicts anomaly, the record is passed to the new classifier to examine whether it is the new intrusion.
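The cascade described above can be sketched as a small wrapper around two classifiers. Both classifiers here are hypothetical stand-ins written as plain functions over dictionary records; the rules inside them are invented for illustration only.

```python
def ensemble_predict(record, old_model, new_model):
    """Ask the new lightweight model only when the old model says 'anomaly'."""
    label = old_model(record)
    if label != "anomaly":
        return label  # 'normal' or a known intrusion subclass
    return new_model(record)  # the new intrusion name, or 'anomaly'


# Hypothetical stand-ins for the trained classifiers.
def old_model(record):
    if record["failed_logins"] > 5:
        return "guess_passwd"
    if record["bytes"] > 10_000:
        return "anomaly"
    return "normal"


def new_model(record):
    return "new_intrusion" if record["service"] == "http" else "anomaly"


print(ensemble_predict({"failed_logins": 0, "bytes": 50_000, "service": "http"},
                       old_model, new_model))
```

Because the old model is untouched, only the small second model has to be trained when a new intrusion appears.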
47Multiple Model Adaptive Experiment
- The old model is trained from n intrusions.
- A lightweight model is trained from one new intrusion type.
- They are combined as an ensemble.
- The accuracy and training time are compared with one model trained from n + 1 intrusions.
48Multiple Model Adaptive Experiment Result
- The accuracy difference is very small
- recall: +3.4%
- precision: -16%
- In other words, the ensemble approach detects more new intrusions, but also misidentifies more anomalies as new intrusions.
- Training time difference: a 150-fold difference, or a cup of coffee versus one or two days.
49Detecting Anomalies in Network Traffic (1/2)
- Can we detect intrusions by identifying novel values in network packets?
- Anomaly detection is potentially useful in detecting novel attacks.
- Our model is trained on attack-free tcpdump data.
- Fields in the Transport layer or below are considered.
50Detecting Anomalies in Network Traffic (2/2)
- Normal field values are learned.
- During evaluation, a function scores a packet based on the likelihood of encountering novel field values.
- Initial results indicate our learned model compares favorably with other systems on the 1999 DARPA evaluation data.
51Packet Fields
- Fields in Data-link, Network, and Transport layers.
- (Application layer will be considered later)
- Ethernet: source, destination, protocol.
- IP: header length, TOS, fragment ID, TTL, transport protocol
- TCP: header length, UAPRSF flags, URG pointer
- UDP: length
- ICMP: type, code
52Anomaly Scoring Function (1/2)
- N1 = number of unique values in a field in the training data
- N = number of packets in the training data
- Likelihood of observing a novel value in a field is
- N1 / N
- (escape probability, Witten and Bell, 1991)
53Anomaly Scoring Function (2/2)
- Non-stationary model: consider the last occurrence of novel values
- t = number of seconds since the last novel value in the same field
- Likelihood of observing an anomaly
- P = (N1 / N) × (1 / t)
- Field anomaly score: S_f = 1 / P
- Packet anomaly score: S = Σ_f S_f
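The scoring function can be sketched as below. Several details are assumptions made for this illustration: packets are dictionaries of field values, only fields whose value was never seen in training contribute to the packet score, the elapsed time t is floored at one second to avoid division by zero, and fields absent from training are skipped. The class name is hypothetical.

```python
from collections import defaultdict


class PacketAnomalyScorer:
    """Score packets by how unlikely their novel field values are."""

    def __init__(self):
        self.seen = defaultdict(set)  # field -> values seen in training
        self.n_packets = 0            # N
        self.last_novel = {}          # field -> time of last novel value

    def train(self, timestamp, packet):
        self.n_packets += 1
        for field, value in packet.items():
            if value not in self.seen[field]:
                self.seen[field].add(value)
                self.last_novel[field] = timestamp

    def score(self, timestamp, packet):
        total = 0.0
        for field, value in packet.items():
            known = self.seen.get(field)
            if not known or value in known:
                continue  # untrained field, or a value seen before
            t = max(timestamp - self.last_novel[field], 1.0)
            # S_f = 1 / P with P = (N1 / N) * (1 / t), i.e. t * N / N1.
            total += t * self.n_packets / len(known)
            self.last_novel[field] = timestamp
        return total


# Hypothetical attack-free training traffic, one packet per second.
scorer = PacketAnomalyScorer()
for second in range(100):
    scorer.train(second, {"ttl": 64, "proto": "tcp"})

first = scorer.score(200, {"ttl": 1, "proto": "tcp"})
second_hit = scorer.score(201, {"ttl": 1, "proto": "tcp"})
normal = scorer.score(202, {"ttl": 64, "proto": "tcp"})
print(first, second_hit, normal)
```

The 1/t factor makes the first novel value after a long quiet period score much higher than a burst of novel values that follows immediately.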
54Experiments
- 1999 DARPA evaluation data (from Lincoln Lab).
- Same mechanism as DARPA in determining detection (correct IP address of the victim, 60 seconds before and after an attack).
- Score thresholds of our system and others are lowered to produce no more than 100 false alarms.
- Some of the other systems use binary scoring.
55Initial Results
IDS          TP/FP (All)  TP/FP (Network)  IDS Type
Oracle       200/0        72/0             ideal
FIT          64/100       51/100           anomaly
GMU          51/22        27/22            anomaly + signature
NYU          20/80        14/80            signature
SUNY         24/9         19/9             signature
NetSTAT      70/995       35/995           signature
Emerald-TCP  83/23        35/23            signature
56Discussion
- All attacks: more detections with 100 or fewer false alarms than most systems except Emerald and NetSTAT.
- Our initial experiments did not look at fields in the Application protocol layer.
- Network attacks: more detections with 100 or fewer false alarms than the other systems.
- 57 out of 72 attacks were detected with 100 false alarms.
57Summary of Progress
- Florida Tech's official start date: August 30, 2000.
- Near-term objective: using learning techniques to build anomaly detection models that can identify intrusions.
- Progress: initial experimental results on the 1999 DARPA evaluation data indicate that our techniques compare favorably with the other systems in detecting network attacks.
58Plans for the Next Quarter
- Investigate an entropy approach to detecting anomalies.
- Study methods that incorporate more information from packets prior to the current packet.
- Examine how effective our techniques are with respect to individual attack types.
- Devise techniques to catch attack types that are undetected.
- Incorporate fields in the Application protocol layer into our model.
59Anomaly Detection Summary and Plans
- Anomaly detection is a main focus.
- Both theories and new approaches.
- Will integrate
- Theories applied to develop new AD sensors.
- Incorporate cost-sensitive measures.
- Study real-time architecture/performance.
- Automated feature and model construction system.
60Correlation Analysis of Attack Scenario
- Motivations
- Detecting individual attack actions not adequate
- Damage assessment, trend prediction, etc.
- Hypothesis
- Attacks are related and such correlation can be learned.
- Approach
- Start with crude knowledge models.
- Use data mining to validate/refine the models.
- An IETF/IDWG architecture/system.
61Objectives (1/2)
- Local/low layer correlations in an IDS
- Multiple sources of raw (audit) data
- Raw information: tcpdump data, BSM records
- Based on specific attack signatures, system/user normal profiles
- Benefits
- Better accuracy: higher TP, lower FP
- More alarm information for higher level and
global analysis
62Objectives (2/2)
- Global / High Layer Correlations
- Multiple sources of alarms by IDSs
- The bigger picture
- What really happened in our networks?
- What can we learn from these cases?
- Benefits
- What is the intention of the attacks?
- What will happen next? When? Where?
- What can we do to prevent it from happening?
63Architecture of Global Correlation System
[Figure: global correlation system. Components: IDSs, Alarm Collection Center, Knowledge Controller, Report Center.]
64Correlation Techniques from Network Management
System (1/2)
- Rule-Based Reasoning (RBR)
- If-then rules based on the domain knowledge and expertise.
- Sufficient for small, non-changing, and well-understood systems.
- Model-Based Reasoning (MBR)
- Model both physical and logical entities, such as hubs and routers.
- Correlation is a result of the collaboration among models.
65Correlation Techniques from Network Management
Systems (2/2)
- State-Transition Graph (STG)
- Logical connections via state-transition.
- May lead to unexpected behavior if the collaborating STGs are not carefully defined.
- Case-Based Reasoning (CBR)
- Learn from experience and offer solutions to novel problems based on that experience.
- Need to develop a similarity metric to retrieve useful cases from the library.
66Correlation Techniques for IDS
- Combination of different correlation techniques
- Network complexity.
- Wide variety of attack motives and tools.
- Adaptation of different correlation techniques
- Different perspectives between NMS and IDS.
67Challenges of Correlation (1/2)
- Knowledge representation
- How to represent objects such as alarms, log files, and network entities?
- How to model knowledge such as network topology, network history, intrusion library, and previous cases?
68Challenges of Correlation (2/2)
- Knowledge base construction
- What kind of knowledge base do we need?
- How to construct the knowledge base?
- Case library
- Network Knowledge
- Intrusion Knowledge
- Pattern discovery (domain knowledge/expert system, data mining)
69A Case Study: DDoS
- An attack scenario from MIT/LL
- Phase 1: IPSweep of the AFB from a remote site.
- Phase 2: Probe of live IPs to look for the sadmind daemon running on Solaris hosts.
- Phase 3: Break-ins via the sadmind vulnerability.
- Phase 4: Installation of the trojan program (mstream DDoS software) on three hosts at the AFB.
- Phase 5: Launching the DDoS.
70Alarm Model
- Object-Oriented
- Alarm A = (feature1, feature2, ...)
- Features of Alarm
- Attack type
- Time stamp
- Service
- Source IP / domain
- Target IP/ domain
- Target number
- Source type (router , host , server)
- Target type (router, host, server )
- Duration
- Frequency within time window
71Alarm Model
- Example
- IP sweep, 09:51:51, ICMP, ppp5-23.iawhk.com, 172.16.115.x, 20, hosts and servers, 9, 1
- Attack type: IP sweep
- Time stamp: 09:51:51
- Service: ICMP
- Source IP: ppp5-23.iawhk.com
- Target IP: 172.16.115.x
- Target number: 20
- Source type: n/a
- Target type: hosts and servers
- Duration: 9 seconds
- Frequency: 1
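The alarm model lends itself to a simple record type. The sketch below is one possible encoding; the class and attribute names are hypothetical, and only the feature list and the example values come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alarm:
    attack_type: str
    timestamp: str               # HH:MM:SS
    service: str
    source: str                  # source IP or domain
    target: str                  # target IP or domain
    target_number: int
    source_type: Optional[str]   # router, host, or server; None if n/a
    target_type: str
    duration_s: float
    frequency: int               # occurrences within the time window


# The IP sweep example from the slide, encoded with the assumed field names.
ip_sweep = Alarm(
    attack_type="IP sweep",
    timestamp="09:51:51",
    service="ICMP",
    source="ppp5-23.iawhk.com",
    target="172.16.115.x",
    target_number=20,
    source_type=None,
    target_type="hosts and servers",
    duration_s=9,
    frequency=1,
)
print(ip_sweep)
```

A fixed record like this is what the correlation rules would match against, regardless of which IDS produced the alarm.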
72Scenario Representation (1/2)
- Attack scenario graph
- Constructed by domain knowledge
- Can be validated/augmented via data mining.
- Describing attack scenarios via state transition.
- Each transition with probability P.
- Modifiable by experts.
- Adaptive to new cases.
73Scenario Representation (2/2)
- Example of attack scenario graph
[Figure: attack scenario graph with nodes IP Sweep, Trojan Installation, TFN2K DDoS, Trinoo DDoS, Mstream DDoS.]
74Correlation Rule Sets
- Based on
- Attack scenario graph.
- Domain knowledge and expertise.
- Case library.
- Two Layers of Rule Sets
- Lower layer for matching/correlating specific alarms.
- Higher layer for trend prediction.
- Probability assigned.
75Correlation Rule Sets
- Example of low layer rule sets
- If (A1.type = IP Sweep & A2.type = Port Scan) & (A1.time < A2.time) & (A1.domain = A2.domain) & (A2.target > 10), then A1 & A2
- ...
- If (A2.type = Port Scan & A3.type = Buffer Overflow) & (A2.time < A3.time) & (A3.DestIP belongs to A2.domain) & (A3.target > 2), then A2 & A3
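The first low-layer rule can be sketched as a predicate over two alarms. This reading treats the conditions of the rule as a logical conjunction, which is an interpretation of the slide; the `Alarm` type below is a hypothetical minimal record with only the attributes the rule mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    type: str
    time: float    # seconds since some reference point
    domain: str
    target: int    # number of targets


def sweep_then_scan(a1, a2, min_targets=10):
    """True if an IP sweep is followed by a port scan in the same domain."""
    return (
        a1.type == "IP Sweep"
        and a2.type == "Port Scan"
        and a1.time < a2.time
        and a1.domain == a2.domain
        and a2.target > min_targets
    )


# Hypothetical alarms.
sweep = Alarm("IP Sweep", time=100.0, domain="172.16.115.x", target=20)
scan = Alarm("Port Scan", time=160.0, domain="172.16.115.x", target=12)
print(sweep_then_scan(sweep, scan))
```

When the predicate holds, the two alarms would be linked as one correlated pair and passed to the higher layer.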
76Correlation Rule Set
- Example of high layer rule sets
- If (A1 & A2, A2 & A3), then A1 & A2 & A3
- If (A1 & A2 & A3), then the attack scenario
- is A1 -> A2 -> A3 -> A4 w/ probability P1
- A1 -> A2 -> A3 -> A4 -> A5 w/ probability P2
- E.g.,
- If (IP Sweep & Port Scan & Buffer Overflow)
- Then next1 = Trojan Installation with P1
- next2 = DDoS with P2
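A high-layer rule of this kind can be sketched as a lookup from a correlated alarm chain to its likely next steps. The probabilities below are placeholders standing in for P1 and P2 (the slides give no values), and the table and function names are hypothetical.

```python
# Correlated chain -> candidate next steps with placeholder probabilities.
SCENARIO_RULES = {
    ("IP Sweep", "Port Scan", "Buffer Overflow"): [
        ("Trojan Installation", 0.7),  # stands in for P1
        ("DDoS", 0.3),                 # stands in for P2
    ],
}


def predict_next(chain, rules=SCENARIO_RULES):
    """Return candidate next attack steps, most probable first."""
    candidates = rules.get(tuple(chain), [])
    return sorted(candidates, key=lambda step: step[1], reverse=True)


print(predict_next(["IP Sweep", "Port Scan", "Buffer Overflow"]))
print(predict_next(["Port Scan"]))
```

In a fuller system the table would be derived from the attack scenario graph, with probabilities estimated from the case library rather than set by hand.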
77Status and Next Steps
- At the very beginning of this research.
- Attack Scenario Graph
- How to construct it automatically?
- How to model the statistical properties of attack scenario state transitions?
- How to automatically generate the correlation rule sets?
- Collaboration with other groups
- Alarm formats, architecture, IETF/IDWG.
78Real-time System Implementation
- Motivations
- Validate our algorithms and models in the real world.
- Faster technology transfer and greater impact.
- Approach
- Collaboration with industries
- Reuse available building blocks as much as
possible.
79Conceptual Architecture
[Figure: conceptual architecture. Components: Sensor, Detector, Data Warehouse, Adaptive Model Generator; data and models flow between them.]
80System Architecture
[Figure: system architecture. Labels: Unsupervised Machine Learning, Supervised Machine Learning, Model Generation, Data Warehouse, Meta IDS, Adaptive Model Generation, Real Time Data Mining, Sensors, NT Host Based IDS, Linux Host Based IDS, Solaris Host Based IDS, NFR Network Based IDS, Malicious Email Filter, File System Wrappers, Software Wrappers.]
81Sensor Host Based IDS System
- Generic Interface to Sensors
- BAM (Basic Auditing Module)
- Sends data to data warehouse
- Receives models from data warehouse
- NT System
- Fully Operational
- Linux System / BSM (Solaris) System
- Sensor Operational
- Under Construction
- Plan to finish construction by end of semester
83Sensor Network IDS System
- NFR Based Sensor
- Data Mining based
- Efficient Evaluation Architecture
- Multiple Models
- System operational and integrated with larger
system
84Sensor Malicious Email Filter
- Monitors Email (sendmail)
- Detects malicious emails entering domain
- Key Features
- Model Based
- Generalizes to unknown malicious attachments
- Models distributed automatically to filters
- Status
- Prototype operational
- Open source release by end of semester
85Sensor Advanced IDS Sensors
- File Wrappers
- Software Wrappers
- Monitor other aspects of system
- Status
- File Wrappers almost finished
- Software Wrappers under development
86Data Warehouse
- Stores data collected from sensors
- Generic IDS data format
- Data can be manipulated in database
- Cross reference data from attacks
- Stores generated models
- Status
- Currently Operational
- Refining Interface and Data Transfer Protocol
- Completed by end of Semester
87Adaptive Model Generator
- Builds models from data in data warehouse
- Uses both supervised and unsupervised data
- Can build models based on data collected
- XML Based Data Exchange Format
- Status
- Exchange Formats defined
- Prototype developed
- Completion by end of semester
88Collaboration with Industries
- NFR.
- Cigital (RST).
- SAS.
- General Dynamics.
- Aprisma/Cabletron.
- HRL.
89Publications and Software, etc.
- 4 Journal and 10 Conference papers
- One best paper and two runner-ups.
- JAM.
- MADAMID.
- PhDs: two graduated, one graduating, five in the pipeline
- More to come
90Efforts Current Tasks
- Cost-sensitive modeling (NCSU/Columbia/FIT).
- Automated feature and model construction (NCSU/Columbia/FIT).
- Integration of all algorithms and tools.
- Anomaly detection (NCSU/Columbia/FIT).
- Attack clustering and light modeling (FIT).
- Real-time architecture and systems (NCSU/Columbia).
- Correlation (NCSU).
- Collaboration with industry (NCSU/Columbia/FIT).