1. Learning Rules and Clusters for Network Anomaly Detection
- Philip Chan, Matt Mahoney, Muhammad Arshad
- Florida Institute of Technology
2. Outline
- Related work in anomaly detection
- Rule Learning algorithm LERAD
- Cluster learning algorithm CLAD
- Summary and ongoing work
3. Related Work in Anomaly Detection
- Host-based
  - STIDE (Forrest et al., 96): system calls, instance-based
  - (Lane & Brodley, 99): user commands, instance-based
  - ADMIT (Sequeira & Zaki, 02): user commands, clustering
- Network-based
  - NIDES (SRI, 95): addresses and ports, probabilistic
  - SPADE (Silicon Defense, 01): addresses and ports, probabilistic
  - ADAM (Barbara et al., 01): hybrid anomaly-misuse detection
4. LERAD: Learning Rules for Anomaly Detection
5. Probabilistic Models
- Anomaly detection: P(x | D = NoAttacks)
- Given training data with no attacks, estimate the probability of seeing event x
- Easier if event x was observed during training
  - actually, since x is normal, we aren't interested in its likelihood
- Harder if event x was not observed (the zero-frequency problem)
  - we are interested in the likelihood of anomalies
6. Estimating Probability with Zero Frequency
- r = number of unique values of an attribute in the training data
- n = number of instances with the attribute in the training data
- Likelihood of observing a novel value in the attribute is estimated by p = r / n (sketched below)
- (Witten and Bell, 1991)
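A minimal Python sketch of this estimate (the function name and interface are ours, not from the slides):

    def novel_value_probability(values):
        """Witten-Bell style estimate of the chance of a novel value:
        p = r / n, where r = number of unique values seen in training
        and n = number of training observations."""
        if not values:
            raise ValueError("need at least one training observation")
        n = len(values)
        r = len(set(values))
        return r / n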
7. Anomaly Score
- Likelihood of a novel event = p
- During detection, if a novel event (unobserved during training) actually occurs:
  - anomaly score = 1/p, a "surprise" factor
8. Example
- Training Sequence 1: a, b, c, d, e, b, f, g, c, h
  - P(NovelLetter) = 8/10
  - Z is observed during detection; anomaly score = 10/8
- Training Sequence 2: a, a, b, b, b, a, b, b, a, a
  - P(NovelLetter) = 2/10
  - Z is observed during detection; anomaly score = 10/2
- (both cases are reproduced in the snippet below)
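Using the sketch above, the slide's two sequences give:

    seq1 = list("abcdebfgch")            # 8 unique letters, n = 10
    seq2 = list("aabbbabbaa")            # 2 unique letters, n = 10

    p1 = novel_value_probability(seq1)   # 8/10
    p2 = novel_value_probability(seq2)   # 2/10
    print(1 / p1)                        # anomaly score 10/8 = 1.25
    print(1 / p2)                        # anomaly score 10/2 = 5.0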
9. Nonstationary Model
- More likely to see a novel value if novel values were seen recently (e.g., during an attack)
- During detection, record when the last novel value was observed
- t_i = number of seconds since the last novel value in attribute A_i
- Anomaly score for A_i: Score_i = t_i / p_i
- Anomaly score for an instance: S = Σ_i Score_i (sketched below)
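A sketch of a per-attribute model with this nonstationary score (our simplification, not the authors' code):

    import time

    class AttributeModel:
        """Novel-value model for one attribute with score t / p."""

        def __init__(self):
            self.values = set()   # unique values seen in training (r)
            self.n = 0            # number of training observations
            self.t0 = None        # detection time of the last novel value

        def train(self, value):
            self.values.add(value)
            self.n += 1

        def score(self, value, now=None):
            """Return t / p for a novel value, 0 for a known one."""
            if value in self.values:
                return 0.0
            now = time.time() if now is None else now
            # t = seconds since the last novel value; taken as 0 for the
            # first novel value (a simplifying assumption in this sketch)
            t = 0.0 if self.t0 is None else now - self.t0
            self.t0 = now
            p = len(self.values) / self.n   # p = r / n
            return t / p

The instance score S is then the sum of score() over all attribute models.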
10. LEarning Rules for Anomaly Detection (LERAD)
- PHAD uses prior probabilities: P(Z)
- ALAD uses conditional probabilities: P(Z|A)
- More accurate to learn probabilities conditioned on multiple attributes: P(Z|A,B,C)
  - but this leads to a combinatorial explosion
- Fast algorithm based on sampling
11. Rules in LERAD
- Rule form: if A = a and B = b, then Z is in {z1, z2, z3}
- If the antecedent is satisfied, the Z attribute has one of the values z1, z2, z3
- Unlike association rules, our rules allow a set of values in the consequent
- Unlike classification rules, our rules don't require a fixed attribute as the consequent
12. Semantics of a Rule
- If the antecedent is satisfied but none of the values in the Z attribute is matched, the anomaly score is n/r (similar to PHAD/ALAD)
- r = size of the consequent set (# of unique values of Z)
- n = # of tuples that satisfy the antecedent and have the Z attribute (support)
13. Overview of the Algorithm
- Randomly select pairs of tuples (packets, connections, ...) from a sample of the training data
- Create candidate rules based on each pair
- Estimate the score of each candidate rule based on a sample of the training data
- Prune the candidate rules
- Update the consequent and calculate the score for each rule using the entire training set
- (the whole pipeline is sketched below)
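A sketch of the pipeline, assuming the helpers candidate_rules, fit_rule, and coverage_check that are sketched on the next slides:

    import random

    def lerad_train(training_set, n_pairs=1000, sample_size=100):
        """LERAD training pipeline (sketch, not the authors' code)."""
        sample = random.sample(training_set,
                               min(sample_size, len(training_set)))

        # 1-2. candidate rules from randomly selected pairs of tuples
        candidates = []
        for _ in range(n_pairs):
            t1, t2 = random.sample(sample, 2)
            candidates.extend(candidate_rules(t1, t2))

        # 3. estimate each candidate's n/r score on the small sample
        sample_score = lambda rule: fit_rule(rule, sample)[0]

        # 4. prune: keep a minimal, high-scoring set of rules
        rules = coverage_check(candidates, sample, sample_score)

        # 5. final pass: update consequent and score on all training data
        return [(rule, fit_rule(rule, training_set)) for rule in rules]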
14. Creating Candidate Rules
- Find the matching attributes; for example, given this randomly selected pair of tuples:
  - <A=1, B=2, C=3, D=4> and <A=1, B=2, C=3, D=6>
- Attributes A, B, and C match
- Create these rules (sketched below):
  - A=1, B=2 => C = ?
  - B=2, C=3 => A = ?
  - A=1, C=3 => B = ?
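A simplified sketch that produces exactly the slide's three rules (real LERAD also varies the antecedent size; the (antecedent, consequent) tuple representation is ours):

    def candidate_rules(t1, t2):
        """Each matching attribute in turn becomes the consequent;
        the remaining matching attributes form the antecedent."""
        matching = [a for a in t1 if a in t2 and t1[a] == t2[a]]
        rules = []
        for consequent in matching:
            antecedent = {a: t1[a] for a in matching if a != consequent}
            rules.append((antecedent, consequent))
        return rules

    # The slide's pair of tuples:
    t1 = {"A": 1, "B": 2, "C": 3, "D": 4}
    t2 = {"A": 1, "B": 2, "C": 3, "D": 6}
    for antecedent, consequent in candidate_rules(t1, t2):
        print(antecedent, "=>", consequent, "= ?")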
15. Estimating Rule Scores
- Randomly pick a sample from the training set to estimate the score (n/r) for each rule
- The consequent of each rule is now estimated (sketched below)
- Example scores: n/r = 100/3, n/r = 10/2, n/r = 200/100
- The larger the score (n/r), the higher the confidence that the rule captures normal behavior
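Continuing the hypothetical rule representation above, a sketch of fitting a rule's consequent and n/r score on a set of tuples:

    def fit_rule(rule, tuples):
        """Return (n/r, allowed values) for a rule on `tuples`:
        n = tuples that satisfy the antecedent and have the consequent
        attribute, r = number of distinct consequent values among them."""
        antecedent, consequent = rule
        values = [t[consequent] for t in tuples
                  if consequent in t
                  and all(t.get(a) == v for a, v in antecedent.items())]
        n, r = len(values), len(set(values))
        return (n / r if r else 0.0), set(values)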
16. Pruning Candidate Rules
- To reduce the time spent learning from the entire training set and during detection
- High-scoring rules: more confidence in the top rules
- Redundancy check: some rules are not necessary
- Coverage check: a minimum set of rules that describes the data
17. Redundancy Check
- Rule 1 vs. Rule 2: Rule 2 is more general than Rule 1, which is redundant and can be removed
- Rule 3: Rule 2 and Rule 3 don't overlap
- Rule 4: Rule 4 is more general than Rule 3, so remove Rule 3
18. Coverage Check
- A rule can cover multiple tuples, but a tuple can only be covered by one rule (the highest-scoring rule)
- Rules are checked in descending order of score
- For each rule in the candidate rule set:
  - mark the tuples that are covered by the rule
- Rules that don't cover any tuples are removed
- Our coverage check subsumes the redundancy check (sketched below)
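A sketch of the coverage check over the sampled tuples (rule format as in the earlier sketches):

    def coverage_check(rules, tuples, score):
        """Visit rules in descending score order; a tuple is covered
        only by the first (highest-scoring) rule that matches it, and
        rules that cover no new tuples are dropped."""
        def matches(rule, t):
            antecedent, consequent = rule
            return (consequent in t
                    and all(t.get(a) == v for a, v in antecedent.items()))

        covered = [False] * len(tuples)
        kept = []
        for rule in sorted(rules, key=score, reverse=True):
            hits = [i for i, t in enumerate(tuples)
                    if not covered[i] and matches(rule, t)]
            if hits:
                kept.append(rule)
                for i in hits:
                    covered[i] = True
        return kept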
19. Final Training
- The selected rules are trained on the entire training set; consequent and score are updated
- Example scores: n/r = 100000/5, n/r = 4000/2
- 90% of the data is used for training the rules
- 10% for validating the rules: rules that cause false alarms are removed (being conservative; the remaining rules are highly predictive)
20. Scoring during Detection
- Score for a matched rule that is violated: S_i = t * n/r
  - where t is the duration since the last time the rule was violated (an anomaly occurred w.r.t. the rule)
- Anomaly score for the tuple: S = Σ_i S_i (sketched below)
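A sketch of the detection-time scoring, using the (rule, (n/r, allowed values)) pairs returned by the training sketch above; the last_violation bookkeeping and taking t = 0 for a rule's first violation are our assumptions:

    def detection_score(tup, fitted_rules, last_violation, now):
        """S = sum of S_i = t * n/r over matched-but-violated rules."""
        total = 0.0
        for i, (rule, (nr, allowed)) in enumerate(fitted_rules):
            antecedent, consequent = rule
            if consequent not in tup:
                continue
            if not all(tup.get(a) == v for a, v in antecedent.items()):
                continue
            if tup[consequent] in allowed:
                continue                    # rule satisfied: not anomalous
            t = now - last_violation.get(i, now)
            last_violation[i] = now         # remember this violation
            total += t * nr                 # S_i = t * n/r
        return total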
21. Attributes Used in LERAD-tcp
- TCP connections are reassembled (similar to ALAD)
- Last 2 bytes of the destination IP address
- 4 bytes of the source IP address
- Source and destination ports
- Duration (from the first packet to the last)
- Length, TCP flags
- First 8 strings in the payload (delimited by space/newline)
22. Attributes Used in LERAD-all
- Attributes used in LERAD-tcp
- UDP and ICMP header fields
23. Experimental Data and Parameters
- DARPA 99 data set
- Training: Week 3; Testing: Weeks 4 and 5
- Training: 35K tuples (LERAD-tcp), 69K tuples (LERAD-all)
- Testing: 178K tuples (LERAD-tcp), 941K tuples (LERAD-all)
- 1,000 pairs of tuples were sampled to form candidate rules (more didn't help much)
- 100 tuples were sampled to estimate scores for candidate rules (more didn't help much)
24. Experimental Results
- Average of 5 runs
- 10 false alarms per day
- 201 attacks, of which 74 are hard-to-detect (Lippmann, 2000)
- LERAD-tcp: 117 detections (58%); 45 hard-to-detect (60%)
- LERAD-all: 112 detections (56%); 41 hard-to-detect (55%)
25. LERAD-all vs. LERAD-tcp
26. Experimental Time Statistics
- Preprocessing: 7.5 minutes (2.9GB, training set), 20 minutes (4GB, test set)
- LERAD-tcp: 6 seconds (4MB, training), 17 seconds (17MB, testing)
- LERAD-all: 12 seconds (8MB, training), 95 seconds (91MB, testing)
- 50-75 final rules learned
27. Results from Mixed Data (RAID 03)
- DARPA 99 data set: attacks are real, background is simulated
- Compared with collected real data
- Artifacts: smaller range of values, little "crud", values stop growing
- Modified LERAD: 87 detections, 49 (56%) legitimate
- Mixed data: 30 detections, 25 (83%) legitimate
28. CLAD: Clustering for Anomaly Detection
- (In Data Mining against Cyber Threats, Kumar et al., 03)
29. Finding Outliers
- Cluster the data points
- Outliers: points in far-away and sparse clusters
- Inter-cluster distance: average distance from the rest of the clusters
- Density: number of data points in a fixed-volume cluster
30. CLAD
- Simple, efficient clustering algorithm (for large amounts of data)
- Clusters have a fixed radius (sketched below):
  - If a point is within the radius of an existing cluster, add the point to the cluster
  - Else, the point becomes the centroid of a new cluster
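A sketch of this single-pass, fixed-radius clustering loop (the distance function is passed in; the next slide covers how the radius can be chosen):

    def clad_cluster(points, radius, distance):
        """A point joins the first cluster whose centroid is within
        `radius`; otherwise it becomes the centroid of a new cluster."""
        centroids, clusters = [], []
        for p in points:
            for i, c in enumerate(centroids):
                if distance(p, c) <= radius:
                    clusters[i].append(p)
                    break
            else:                  # no existing cluster is close enough
                centroids.append(p)
                clusters.append([p])
        return centroids, clusters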
31. CLAD Issues
- Distance for discrete attributes
  - Values that are more frequent are likely to be more normal and are considered closer
  - Based on the difference in frequency of the discrete values
  - Power-law distributions: take the logarithm of the frequencies
- Radius of clusters (estimated as sketched below):
  - Select a small random sample
  - Calculate the distances of all pairs
  - Average the smallest 1% of the pairwise distances
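A sketch of the radius heuristic (the sample size of 100 and fraction of 0.01 are illustrative defaults, not from the slides):

    import random
    from itertools import combinations

    def estimate_radius(points, distance, sample_size=100, fraction=0.01):
        """Average the smallest `fraction` of pairwise distances
        within a small random sample of the data."""
        sample = random.sample(points, min(sample_size, len(points)))
        dists = sorted(distance(a, b) for a, b in combinations(sample, 2))
        k = max(1, int(len(dists) * fraction))
        return sum(dists[:k]) / k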
32. Sparse and Dense Regions
- Outliers are in distant and sparse regions
- However, an attack might generate many connections and can make its neighborhood not sparse
- So flag clusters that are (distant and sparse) or (distant and dense), as sketched below
- Distant: distance > avg(distance) + sd(distance)
- Sparse: density < avg(density) - sd(density)
- Dense: density > avg(density) + sd(density)
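A small sketch of these one-standard-deviation thresholds:

    from statistics import mean, stdev

    def flag_outlier_clusters(distances, densities):
        """Flag clusters that are (distant and sparse) or (distant
        and dense), given per-cluster distance and density lists."""
        d_hi = mean(distances) + stdev(distances)
        rho_lo = mean(densities) - stdev(densities)
        rho_hi = mean(densities) + stdev(densities)
        return [d > d_hi and (rho < rho_lo or rho > rho_hi)
                for d, rho in zip(distances, densities)]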
33. Experiments
- Weeks 1, 2, 4, and 5
- No explicit training/testing split; looking for outliers
- A model for each port
- Ports with less than 1% of the traffic are lumped into the "Others" model
- Anomaly scores are normalized in SDs; the "Combined" model simply merges the scores from the different models
34. (No transcript)
35. (No transcript)
36. Results
37. LERAD vs. CLAD
38. Ongoing Work
- On-line, noise-tolerant LERAD
- Applying LERAD to system calls, including arguments
- Tokenizing the payload to create features
39. Data Mining for Computer Security Workshop at ICDM'03, Melbourne, FL, Nov 19, 2003
- www.cs.fit.edu/pkc/dmsec03/
40. http://www.cs.fit.edu/pkc/id/