1. Learning Rules and Clusters for Network Anomaly Detection
- Philip Chan, Matt Mahoney, Muhammad Arshad
- Florida Institute of Technology
2. Outline
- Related work in anomaly detection
- Rule Learning algorithm LERAD
- Cluster learning algorithm CLAD
- Summary and ongoing work
3. Related Work in Anomaly Detection
- Host-based
  - STIDE (Forrest et al., 96): system calls, instance-based
  - (Lane & Brodley, 99): user commands, instance-based
  - ADMIT (Sequeira & Zaki, 02): user commands, clustering
- Network-based
  - NIDES (SRI, 95): addresses and ports, probabilistic
  - SPADE (Silicon Defense, 01): addresses and ports, probabilistic
  - ADAM (Barbara et al., 01): hybrid anomaly-misuse detection
4. LERAD: Learning Rules for Anomaly Detection
5. Probabilistic Models
- Anomaly detection: P(x | D = NoAttacks)
- Given training data with no attacks, estimate the probability of seeing event x
- Easier if event x was observed during training
  - actually, since x is normal, we aren't interested in its likelihood
- Harder if event x was not observed (the zero-frequency problem)
  - we are interested in the likelihood of anomalies
6. Estimating Probability with Zero Frequency
- r = number of unique values of an attribute in the training data
- n = number of instances with the attribute in the training data
- Likelihood of observing a novel value in the attribute is estimated by p = r / n (sketched below)
- (Witten and Bell, 1991)
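A minimal Python sketch of this estimate (the function name and interface are ours, not from the slides):

    def novel_value_probability(values):
        """Witten-Bell style estimate of the chance of a novel value:
        p = r / n, where r = number of unique values seen in training
        and n = number of training observations."""
        if not values:
            raise ValueError("need at least one training observation")
        n = len(values)
        r = len(set(values))
        return r / n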
7. Anomaly Score
- Likelihood of a novel event = p
- During detection, if a novel event (unobserved during training) actually occurs:
  - anomaly score = 1/p, a "surprise" factor
8. Example
- Training Sequence 1: a, b, c, d, e, b, f, g, c, h
  - P(NovelLetter) = 8/10
  - Z is observed during detection; anomaly score = 10/8
- Training Sequence 2: a, a, b, b, b, a, b, b, a, a
  - P(NovelLetter) = 2/10
  - Z is observed during detection; anomaly score = 10/2
- (both cases are reproduced in the snippet below)
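Using the sketch above, the slide's two sequences give:

    seq1 = list("abcdebfgch")            # 8 unique letters, n = 10
    seq2 = list("aabbbabbaa")            # 2 unique letters, n = 10

    p1 = novel_value_probability(seq1)   # 8/10
    p2 = novel_value_probability(seq2)   # 2/10
    print(1 / p1)                        # anomaly score 10/8 = 1.25
    print(1 / p2)                        # anomaly score 10/2 = 5.0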
9. Nonstationary Model
- More likely to see a novel value if novel values were seen recently (e.g., during an attack)
- During detection, record when the last novel value was observed
- t_i = number of seconds since the last novel value in attribute A_i
- Anomaly score for A_i: Score_i = t_i / p_i
- Anomaly score for an instance: S = Σ_i Score_i (sketched below)
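A sketch of a per-attribute model with this nonstationary score (our simplification, not the authors' code):

    import time

    class AttributeModel:
        """Novel-value model for one attribute with score t / p."""

        def __init__(self):
            self.values = set()   # unique values seen in training (r)
            self.n = 0            # number of training observations
            self.t0 = None        # detection time of the last novel value

        def train(self, value):
            self.values.add(value)
            self.n += 1

        def score(self, value, now=None):
            """Return t / p for a novel value, 0 for a known one."""
            if value in self.values:
                return 0.0
            now = time.time() if now is None else now
            # t = seconds since the last novel value; taken as 0 for the
            # first novel value (a simplifying assumption in this sketch)
            t = 0.0 if self.t0 is None else now - self.t0
            self.t0 = now
            p = len(self.values) / self.n   # p = r / n
            return t / p

The instance score S is then the sum of score() over all attribute models.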
10. LEarning Rules for Anomaly Detection (LERAD)
- PHAD uses prior probabilities: P(Z)
- ALAD uses conditional probabilities: P(Z|A)
- More accurate to learn probabilities conditioned on multiple attributes: P(Z|A,B,C)
  - but this leads to a combinatorial explosion
- Fast algorithm based on sampling
11. Rules in LERAD
- Rule form: if A = a and B = b, then Z is in {z1, z2, z3}
- If the antecedent is satisfied, the Z attribute has one of the values z1, z2, z3
- Unlike association rules, our rules allow a set of values in the consequent
- Unlike classification rules, our rules don't require a fixed attribute as the consequent
12. Semantics of a Rule
- If the antecedent is satisfied but none of the values in the Z attribute is matched, the anomaly score is n/r (similar to PHAD/ALAD)
- r = size of the consequent set (# of unique values of Z)
- n = # of tuples that satisfy the antecedent and have the Z attribute (support)
13. Overview of the Algorithm
- Randomly select pairs of tuples (packets, connections, ...) from a sample of the training data
- Create candidate rules based on each pair
- Estimate the score of each candidate rule based on a sample of the training data
- Prune the candidate rules
- Update the consequent and calculate the score for each rule using the entire training set
- (the whole pipeline is sketched below)
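A sketch of the pipeline, assuming the helpers candidate_rules, fit_rule, and coverage_check that are sketched on the next slides:

    import random

    def lerad_train(training_set, n_pairs=1000, sample_size=100):
        """LERAD training pipeline (sketch, not the authors' code)."""
        sample = random.sample(training_set,
                               min(sample_size, len(training_set)))

        # 1-2. candidate rules from randomly selected pairs of tuples
        candidates = []
        for _ in range(n_pairs):
            t1, t2 = random.sample(sample, 2)
            candidates.extend(candidate_rules(t1, t2))

        # 3. estimate each candidate's n/r score on the small sample
        sample_score = lambda rule: fit_rule(rule, sample)[0]

        # 4. prune: keep a minimal, high-scoring set of rules
        rules = coverage_check(candidates, sample, sample_score)

        # 5. final pass: update consequent and score on all training data
        return [(rule, fit_rule(rule, training_set)) for rule in rules]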
14. Creating Candidate Rules
- Find the matching attributes; for example, given this randomly selected pair of tuples:
  - <A=1, B=2, C=3, D=4> and <A=1, B=2, C=3, D=6>
- Attributes A, B, and C match
- Create these rules (sketched below):
  - A=1, B=2 => C = ?
  - B=2, C=3 => A = ?
  - A=1, C=3 => B = ?
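A simplified sketch that produces exactly the slide's three rules (real LERAD also varies the antecedent size; the (antecedent, consequent) tuple representation is ours):

    def candidate_rules(t1, t2):
        """Each matching attribute in turn becomes the consequent;
        the remaining matching attributes form the antecedent."""
        matching = [a for a in t1 if a in t2 and t1[a] == t2[a]]
        rules = []
        for consequent in matching:
            antecedent = {a: t1[a] for a in matching if a != consequent}
            rules.append((antecedent, consequent))
        return rules

    # The slide's pair of tuples:
    t1 = {"A": 1, "B": 2, "C": 3, "D": 4}
    t2 = {"A": 1, "B": 2, "C": 3, "D": 6}
    for antecedent, consequent in candidate_rules(t1, t2):
        print(antecedent, "=>", consequent, "= ?")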
15. Estimating Rule Scores
- Randomly pick a sample from the training set to estimate the score (n/r) for each rule
- The consequent of each rule is now estimated (sketched below)
- Example scores: n/r = 100/3, n/r = 10/2, n/r = 200/100
- The larger the score (n/r), the higher the confidence that the rule captures normal behavior
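Continuing the hypothetical rule representation above, a sketch of fitting a rule's consequent and n/r score on a set of tuples:

    def fit_rule(rule, tuples):
        """Return (n/r, allowed values) for a rule on `tuples`:
        n = tuples that satisfy the antecedent and have the consequent
        attribute, r = number of distinct consequent values among them."""
        antecedent, consequent = rule
        values = [t[consequent] for t in tuples
                  if consequent in t
                  and all(t.get(a) == v for a, v in antecedent.items())]
        n, r = len(values), len(set(values))
        return (n / r if r else 0.0), set(values)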
16. Pruning Candidate Rules
- To reduce the time spent learning from the entire training set and during detection
- High-scoring rules: more confidence in the top rules
- Redundancy check: some rules are not necessary
- Coverage check: a minimum set of rules that describes the data
17. Redundancy Check
- Rule 1 vs. Rule 2: Rule 2 is more general than Rule 1, which is redundant and can be removed
- Rule 3: Rule 2 and Rule 3 don't overlap
- Rule 4: Rule 4 is more general than Rule 3, so remove Rule 3
18. Coverage Check
- A rule can cover multiple tuples, but a tuple can only be covered by one rule (the highest-scoring rule)
- Rules are checked in descending order of score
- For each rule in the candidate rule set:
  - mark the tuples that are covered by the rule
- Rules that don't cover any tuples are removed
- Our coverage check subsumes the redundancy check (sketched below)
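A sketch of the coverage check over the sampled tuples (rule format as in the earlier sketches):

    def coverage_check(rules, tuples, score):
        """Visit rules in descending score order; a tuple is covered
        only by the first (highest-scoring) rule that matches it, and
        rules that cover no new tuples are dropped."""
        def matches(rule, t):
            antecedent, consequent = rule
            return (consequent in t
                    and all(t.get(a) == v for a, v in antecedent.items()))

        covered = [False] * len(tuples)
        kept = []
        for rule in sorted(rules, key=score, reverse=True):
            hits = [i for i, t in enumerate(tuples)
                    if not covered[i] and matches(rule, t)]
            if hits:
                kept.append(rule)
                for i in hits:
                    covered[i] = True
        return kept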
19. Final Training
- The selected rules are trained on the entire training set; consequent and score are updated
- Example scores: n/r = 100000/5, n/r = 4000/2
- 90% of the data is used for training the rules
- 10% for validating the rules: rules that cause false alarms are removed (being conservative; the remaining rules are highly predictive)
20. Scoring during Detection
- Score for a matched rule that is violated: S_i = t * n/r
  - where t is the duration since the last time the rule was violated (an anomaly occurred w.r.t. the rule)
- Anomaly score for the tuple: S = Σ_i S_i (sketched below)
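A sketch of the detection-time scoring, using the (rule, (n/r, allowed values)) pairs returned by the training sketch above; the last_violation bookkeeping and taking t = 0 for a rule's first violation are our assumptions:

    def detection_score(tup, fitted_rules, last_violation, now):
        """S = sum of S_i = t * n/r over matched-but-violated rules."""
        total = 0.0
        for i, (rule, (nr, allowed)) in enumerate(fitted_rules):
            antecedent, consequent = rule
            if consequent not in tup:
                continue
            if not all(tup.get(a) == v for a, v in antecedent.items()):
                continue
            if tup[consequent] in allowed:
                continue                    # rule satisfied: not anomalous
            t = now - last_violation.get(i, now)
            last_violation[i] = now         # remember this violation
            total += t * nr                 # S_i = t * n/r
        return total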
21. Attributes Used in LERAD-tcp
- TCP connections are reassembled (similar to ALAD)
- Last 2 bytes of the destination IP address
- 4 bytes of the source IP address
- Source and destination ports
- Duration (from the first packet to the last)
- Length, TCP flags
- First 8 strings in the payload (delimited by space/newline)
22. Attributes Used in LERAD-all
- Attributes used in LERAD-tcp
- UDP and ICMP header fields
23. Experimental Data and Parameters
- DARPA 99 data set
- Training: Week 3; Testing: Weeks 4 and 5
- Training: 35K tuples (LERAD-tcp), 69K tuples (LERAD-all)
- Testing: 178K tuples (LERAD-tcp), 941K tuples (LERAD-all)
- 1,000 pairs of tuples were sampled to form candidate rules (more didn't help much)
- 100 tuples were sampled to estimate scores for candidate rules (more didn't help much)
24. Experimental Results
- Average of 5 runs
- 10 false alarms per day
- 201 attacks, of which 74 are hard-to-detect (Lippmann, 2000)
- LERAD-tcp: 117 detections (58%); 45 hard-to-detect (60%)
- LERAD-all: 112 detections (56%); 41 hard-to-detect (55%)
25. LERAD-all vs. LERAD-tcp
26. Experimental Time Statistics
- Preprocessing: 7.5 minutes (2.9GB, training set), 20 minutes (4GB, test set)
- LERAD-tcp: 6 seconds (4MB, training), 17 seconds (17MB, testing)
- LERAD-all: 12 seconds (8MB, training), 95 seconds (91MB, testing)
- 50-75 final rules learned
27. Results from Mixed Data (RAID 03)
- DARPA 99 data set: attacks are real, background is simulated
- Compared with collected real data
- Artifacts: smaller range of values, little "crud", values stop growing
- Modified LERAD: 87 detections, 49 (56%) legitimate
- Mixed data: 30 detections, 25 (83%) legitimate
28. CLAD: Clustering for Anomaly Detection
- (In Data Mining against Cyber Threats, Kumar et al., 03)
29. Finding Outliers
- Cluster the data points
- Outliers: points in far-away and sparse clusters
- Inter-cluster distance: average distance from the rest of the clusters
- Density: number of data points in a fixed-volume cluster
30. CLAD
- Simple, efficient clustering algorithm (for large amounts of data)
- Clusters have a fixed radius (sketched below):
  - If a point is within the radius of an existing cluster, add the point to the cluster
  - Else, the point becomes the centroid of a new cluster
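A sketch of this single-pass, fixed-radius clustering loop (the distance function is passed in; the next slide covers how the radius can be chosen):

    def clad_cluster(points, radius, distance):
        """A point joins the first cluster whose centroid is within
        `radius`; otherwise it becomes the centroid of a new cluster."""
        centroids, clusters = [], []
        for p in points:
            for i, c in enumerate(centroids):
                if distance(p, c) <= radius:
                    clusters[i].append(p)
                    break
            else:                  # no existing cluster is close enough
                centroids.append(p)
                clusters.append([p])
        return centroids, clusters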
31. CLAD Issues
- Distance for discrete attributes
  - Values that are more frequent are likely to be more normal and are considered closer
  - Based on the difference in frequency of the discrete values
  - Power-law distributions: take the logarithm of the frequencies
- Radius of clusters (estimated as sketched below):
  - Select a small random sample
  - Calculate the distances of all pairs
  - Average the smallest 1% of the pairwise distances
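A sketch of the radius heuristic (the sample size of 100 and fraction of 0.01 are illustrative defaults, not from the slides):

    import random
    from itertools import combinations

    def estimate_radius(points, distance, sample_size=100, fraction=0.01):
        """Average the smallest `fraction` of pairwise distances
        within a small random sample of the data."""
        sample = random.sample(points, min(sample_size, len(points)))
        dists = sorted(distance(a, b) for a, b in combinations(sample, 2))
        k = max(1, int(len(dists) * fraction))
        return sum(dists[:k]) / k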
32. Sparse and Dense Regions
- Outliers are in distant and sparse regions
- However, an attack might generate many connections and can make its neighborhood not sparse
- So flag clusters that are (distant and sparse) or (distant and dense), as sketched below
- Distant: distance > avg(distance) + sd(distance)
- Sparse: density < avg(density) - sd(density)
- Dense: density > avg(density) + sd(density)
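A small sketch of these one-standard-deviation thresholds:

    from statistics import mean, stdev

    def flag_outlier_clusters(distances, densities):
        """Flag clusters that are (distant and sparse) or (distant
        and dense), given per-cluster distance and density lists."""
        d_hi = mean(distances) + stdev(distances)
        rho_lo = mean(densities) - stdev(densities)
        rho_hi = mean(densities) + stdev(densities)
        return [d > d_hi and (rho < rho_lo or rho > rho_hi)
                for d, rho in zip(distances, densities)]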
33. Experiments
- Weeks 1, 2, 4, and 5
- No explicit training/testing split; looking for outliers
- A model for each port
- Ports with less than 1% of the traffic are lumped into the "Others" model
- Anomaly scores are normalized in SDs; the "Combined" model simply merges the scores from the different models
34. (No transcript)
35. (No transcript)
36. Results
37. LERAD vs. CLAD
38. Ongoing Work
- On-line, noise-tolerant LERAD
- Applying LERAD to system calls, including arguments
- Tokenizing the payload to create features
39. Data Mining for Computer Security Workshop at ICDM'03, Melbourne, FL, Nov 19, 2003
- www.cs.fit.edu/pkc/dmsec03/
40. http://www.cs.fit.edu/pkc/id/