1
Learning Rules and Clusters for Network Anomaly
Detection
  • Philip Chan, Matt Mahoney, Muhammad Arshad
  • Florida Institute of Technology

2
Outline
  • Related work in anomaly detection
  • Rule learning algorithm: LERAD
  • Cluster learning algorithm: CLAD
  • Summary and ongoing work

3
Related Work in Anomaly Detection
  • Host-based
  • STIDE (Forrest et al., 96): system calls,
    instance-based
  • (Lane & Brodley, 99): user commands,
    instance-based
  • ADMIT (Sequeira & Zaki, 02): user commands,
    clustering
  • Network-based
  • NIDES (SRI, 95): addresses and ports,
    probabilistic
  • SPADE (Silicon Defense, 01): addresses and ports,
    probabilistic
  • ADAM (Barbara et al., 01): hybrid anomaly-misuse
    detection

4
LERAD: Learning Rules for Anomaly Detection
  • (ICDM '03)

5
Probabilistic Models
  • Anomaly detection
  • P(x | D_NoAttacks)
  • Given training data D with no attacks, estimate
    the probability of seeing event x
  • Easier if event x was observed during training
  • actually, since such an x is normal, we aren't
    interested in its likelihood
  • Harder if event x was not observed (the
    zero-frequency problem)
  • we are interested in the likelihood of these
    anomalies

6
Estimating Probability with Zero Frequency
  • r = number of unique values in an attribute in
    the training data
  • n = number of instances with the attribute in the
    training data
  • Likelihood of observing a novel value in an
    attribute is estimated by
  • p = r / n (sketch below)
  • (Witten and Bell, 1991)
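
A minimal sketch of this estimator in Python (the function and
variable names are ours, not from the LERAD implementation):

    def novel_value_probability(values):
        # Witten-Bell: p = r / n, where r = number of unique values
        # and n = number of observed instances of the attribute
        n = len(values)
        r = len(set(values))
        return r / n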

7
Anomaly Score
  • Likelihood of a novel event = p
  • During detection, if a novel event (unobserved
    during training) actually occurs:
  • anomaly score = 1/p (the "surprise" factor)

8
Example
  • Training Sequence 1: a, b, c, d, e, b, f, g, c, h
  • P(NovelLetter) = 8/10
  • Z is observed during detection; anomaly score =
    10/8
  • Training Sequence 2: a, a, b, b, b, a, b, b, a, a
  • P(NovelLetter) = 2/10
  • Z is observed during detection; anomaly score =
    10/2

9
Nonstationary Model
  • More likely to see a novel value if novel values
    were seen recently (e.g., during an attack)
  • During detection, record when the last novel
    value was observed
  • t_i = number of seconds since the last novel value
    in attribute A_i
  • Anomaly score for A_i: Score_i = t_i / p_i
  • Anomaly score for an instance: S = Σ_i Score_i
    (sketch below)
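
A sketch of the nonstationary score, assuming hypothetical
bookkeeping: p[i] is the novelty probability r/n for attribute A_i,
and last[i] is the time A_i last took a novel value:

    def instance_score(novel_attrs, p, last, now):
        # S = sum of t_i / p_i over attributes with a novel value,
        # where t_i = seconds since A_i last saw a novel value
        score = 0.0
        for i in novel_attrs:
            score += (now - last[i]) / p[i]
            last[i] = now  # this novel value is now the most recent
        return score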

10
LEarning Rules for Anomaly Detection (LERAD)
  • PHAD uses prior probabilities P(Z)
  • ALAD uses conditional probabilities P(Z|A)
  • More accurate to learn probabilities that are
    conditioned on multiple attributes: P(Z|A,B,C)
  • Combinatorial explosion
  • Fast algorithm based on sampling

11
Rules in LERAD
  • If the antecedent is satisfied, the Z attribute
    has one of the values z1, z2, z3
  • Unlike association rules, our rules allow a set
    of values in the consequent
  • Unlike classification rules, our rules don't
    require a fixed attribute as the consequent

12
Semantics of a Rule
  • If the antecedent is satisfied but none of the
    values in the Z attribute is matched, the anomaly
    score is n/r (similar to PHAD/ALAD; sketch below)
  • r = size of Z (# of unique values in Z)
  • n = # of tuples that satisfy the antecedent and
    have the Z attribute (support)
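
One way such a rule might be represented (a Python sketch; the class
and field names are ours, not LERAD's):

    from dataclasses import dataclass, field

    @dataclass
    class Rule:  # e.g. "if port = 80 then word1 in {GET, HEAD}"
        antecedent: dict   # attribute -> required value
        z_attr: str        # the consequent (Z) attribute
        allowed: set = field(default_factory=set)  # z1, z2, ...
        n: int = 0         # support: tuples satisfying the antecedent

        def matches(self, tup):
            return all(tup.get(a) == v
                       for a, v in self.antecedent.items())

        def violated_by(self, tup):
            return (self.matches(tup)
                    and tup[self.z_attr] not in self.allowed)

        def score(self):
            return self.n / len(self.allowed)  # n/r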

13
Overview of the Algorithm
  • Randomly select pairs of tuples (packets,
    connections, ...) from a sample of the training
    data
  • Create candidate rules based on each pair
  • Estimate the score of each candidate rule based
    on a sample of the training data
  • Prune the candidate rules
  • Update the consequent and calculate the score for
    each rule using the entire training set

14
Creating Candidate Rules
  • Find the matching attributes; for example, given
    this randomly selected pair of tuples:
  • <A=1, B=2, C=3, D=4> and <A=1, B=2, C=3, D=6>
  • Attributes A, B, and C match
  • Create these rules (sketch below):
  • A=1, B=2 => C = ?
  • B=2, C=3 => A = ?
  • A=1, C=3 => B = ?
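
A sketch of this step, reusing the Rule class from the earlier
sketch:

    def candidate_rules(t1, t2):
        # Attributes whose values agree in the two sampled tuples
        # each take a turn as the consequent; the rest form the
        # antecedent.
        matching = [a for a in t1 if t1[a] == t2.get(a)]
        return [Rule({a: t1[a] for a in matching if a != z}, z,
                     {t1[z]})
                for z in matching]

    t1 = {"A": 1, "B": 2, "C": 3, "D": 4}
    t2 = {"A": 1, "B": 2, "C": 3, "D": 6}
    # candidate_rules(t1, t2) yields the three rules above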

15
Estimating Rule Scores
  • Randomly pick a sample from the training set to
    estimate the score (n/r) for each rule
  • The consequent of each rule is now estimated

  • n/r = 100/3
  • n/r = 10/2

  • n/r = 200/100
  • The larger the score (n/r), the higher the
    confidence that the rule captures normal behavior

16
Pruning Candidate Rules
  • To reduce the time needed for learning from the
    entire training set and during detection
  • High-scoring rules: more confidence in the top
    rules
  • Redundancy check: some rules are not necessary
  • Coverage check: a minimum set of rules that
    describes the data

17
Redundancy Check
  • Rule 1
  • Rule 2
  • Rule 2 is more general than Rule 1, which is
    redundant and can be removed
  • Rule 3
  • Rule 2 and Rule 3 don't overlap
  • Rule 4
  • Rule 4 is more general than Rule 3, remove Rule 3

18
Coverage Check
  • A rule can cover multiple tuples, but a tuple can
    only be covered by one rule (the highest-scoring
    rule)
  • Rules are checked in descending order of score
  • For each rule in the candidate rule set:
  • mark the tuples that are covered by the rule
  • Rules that don't cover any tuples are removed
  • Our coverage check subsumes the redundancy check
    (sketch below)
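
A sketch of the coverage check, assuming the Rule class from the
earlier sketch and a list of sample tuples:

    def coverage_check(rules, sample):
        # Keep a rule only if it is the highest-scoring rule to
        # cover at least one not-yet-covered tuple in the sample.
        kept, covered = [], set()
        for rule in sorted(rules, key=lambda r: r.score(),
                           reverse=True):
            newly = {i for i, tup in enumerate(sample)
                     if i not in covered and rule.matches(tup)}
            if newly:
                covered |= newly
                kept.append(rule)
        return kept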

19
Final Training
  • The selected rules are trained on the entire
    training set; the consequent and score are updated

  • n/r = 100000/5

  • n/r = 4000/2
  • 90% of the data for training the rules
  • 10% for validating the rules; rules that cause
    false alarms are removed (being conservative: the
    remaining rules are highly predictive)

20
Scoring during Detection
  • Score for a matched rule that is violated:
  • S_i = t * n/r
  • where t is the duration since the last time the
    rule was violated (an anomaly occurred w.r.t. the
    rule)
  • Anomaly score for the tuple: S = Σ_i S_i (sketch
    below)
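
A sketch of detection-time scoring, where last_violation[i] holds
the time rule i was last violated (assumed initialized when
detection starts):

    def tuple_score(tup, rules, now, last_violation):
        # S = sum of S_i = t * n/r over matched-but-violated rules
        total = 0.0
        for i, rule in enumerate(rules):
            if rule.violated_by(tup):
                total += (now - last_violation[i]) * rule.score()
                last_violation[i] = now
        return total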

21
Attributes Used in LERAD-tcp
  • TCP connections are reassembled (similar to ALAD)
  • Last 2 bytes of the destination IP address
  • 4 bytes of the source IP address
  • Source and destination ports
  • Duration (from the first packet to the last)
  • Length, TCP flags
  • First 8 strings in the payload (delimited by
    space/newline)

22
Attributes used in LERAD-all
  • Attributes used in LERAD-tcp
  • UDP and ICMP header fields

23
Experimental Data and Parameters
  • DARPA 99 data set
  • Training: Week 3; Testing: Weeks 4 and 5
  • Training: 35K tuples (LERAD-tcp), 69K tuples
    (LERAD-all)
  • Testing: 178K tuples (LERAD-tcp), 941K tuples
    (LERAD-all)
  • 1,000 pairs of tuples were sampled to form
    candidate rules (more didn't help much)
  • 100 tuples were sampled to estimate scores for
    candidate rules (more didn't help much)

24
Experimental Results
  • Average of 5 runs
  • At 10 false alarms per day
  • 201 attacks; 74 hard-to-detect attacks
    (Lippmann, 2000)
  • LERAD-tcp: 117 detections (58%); 45
    hard-to-detect (60%)
  • LERAD-all: 112 detections (56%); 41
    hard-to-detect (55%)

25
LERAD-all vs. LERAD-tcp
26
Experimental Time Statistics
  • Preprocessing: 7.5 minutes (2.9GB, training
    set), 20 minutes (4GB, test set)
  • LERAD-tcp: 6 seconds (4MB, training), 17
    seconds (17MB, testing)
  • LERAD-all: 12 seconds (8MB, training), 95
    seconds (91MB, testing)
  • 50-75 final rules learned

27
Results from Mixed Data (RAID 03)
  • DARPA 99 data set: attacks are real, background
    is simulated
  • Compared with collected real data
  • Artifacts: smaller range of values, little
    "crud", values stop growing
  • Modified LERAD: 87 detections, 49 (56%)
    legitimate
  • Mixed data: 30 detections, 25 (83%) legitimate

28
CLAD: Clustering for Anomaly Detection
  • (In Data Mining against Cyber Threats, Kumar
    et al., 03)

29
Finding Outliers
  • Cluster the data points
  • Outliers: points in distant and sparse clusters
  • Inter-cluster distance: average distance from the
    rest of the clusters
  • Density: number of data points in a fixed-volume
    cluster

30
CLAD
  • Simple, efficient clustering algorithm (handles a
    large amount of data)
  • Clusters have a fixed radius
  • If a point is within the radius of an existing
    cluster:
  • add the point to the cluster
  • Else:
  • the point becomes the centroid of a new cluster
    (sketch below)
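
A sketch of the single-pass, fixed-radius clustering (the distance
function and names are ours):

    def clad_cluster(points, radius, distance):
        centroids, members = [], []
        for p in points:
            for i, c in enumerate(centroids):
                if distance(p, c) <= radius:  # joins this cluster
                    members[i].append(p)
                    break
            else:                        # no cluster close enough:
                centroids.append(p)      # p seeds a new cluster
                members.append([p])
        return centroids, members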

31
CLAD Issues
  • Distance for discrete attributes:
  • values that are more frequent are likely to be
    more normal and are considered closer
  • Differences in frequency of discrete values
    follow power-law distributions: use the logarithm
  • Radius of clusters (sketch below):
  • select a small random sample
  • calculate the distances of all pairs
  • take the average of the smallest 1%
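
The radius selection might look like this (a sketch; only the
"average of the smallest 1% of pairwise distances" rule is from the
slide):

    from itertools import combinations

    def estimate_radius(sample, distance, fraction=0.01):
        dists = sorted(distance(a, b)
                       for a, b in combinations(sample, 2))
        k = max(1, int(len(dists) * fraction))
        return sum(dists[:k]) / k  # mean of the smallest 1%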

32
Sparse and Dense Regions
  • Outliers are in distant and sparse regions
  • However, an attack might generate many
    connections and make its own neighborhood not
    sparse
  • Outlier: (distant and sparse) or (distant and
    dense)
  • Distant: distance > avg(distance) + sd(distance)
  • Sparse: density < avg(density) - sd(density)
  • Dense: density > avg(density) + sd(density)
    (sketch below)
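
A sketch of the outlier test over per-cluster inter-cluster
distances (dist) and densities (dens):

    from statistics import mean, stdev

    def outlier_clusters(dist, dens):
        # Flag clusters that are distant AND (sparse OR dense)
        d_thr = mean(dist) + stdev(dist)
        lo = mean(dens) - stdev(dens)
        hi = mean(dens) + stdev(dens)
        return [i for i in range(len(dist))
                if dist[i] > d_thr and (dens[i] < lo
                                        or dens[i] > hi)]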

33
Experiments
  • Weeks 1, 2, 4, and 5
  • No explicit training/testing split; looking for
    outliers
  • A model for each port
  • Ports with less than 1% of the traffic are lumped
    into an "Others" model
  • Anomaly scores are normalized in SDs; the
    "Combined" model simply merges the scores from the
    different models

34
(figure slide, no transcript)
35
(figure slide, no transcript)
36
Results

37
LERAD vs. CLAD
38
Ongoing Work
  • On-line, noise-tolerant LERAD
  • Applying LERAD to system calls, including
    arguments
  • Tokenizing payload to create features

39
Data Mining for Computer Security Workshop at
ICDM '03, Melbourne, FL, Nov 19, 2003
  • www.cs.fit.edu/pkc/dmsec03/

40
http://www.cs.fit.edu/pkc/id/
  • Thank you
  • Questions?