Title: Data Mining

Data Mining & Intrusion Detection
  • Shan Bai
  • Instructor: Dr. Yingshu Li
  • CSC 8712, Spring 08

Outline
  • Intrusion Detection
  • Data Mining
  • Data Mining in Intrusion Detection
  • Reference

What is an intrusion?
  • An intrusion can be defined as any set of
    actions that attempt to compromise the
    integrity, confidentiality, or availability
    of a resource.

[Figure: Incidents reported to the Computer Emergency
Response Team/Coordination Center]
[Figure: Spread of the SQL Slammer worm 10 minutes after
its deployment]

Intrusion Examples
  • Trojan horse / worm
  • Address spoofing: a malicious user uses a fake
    IP address to send malicious packets to a target.
  • Many others
  • DoS: denial of service
  • R2L: unauthorized access from a remote machine,
    e.g. guessing a password
  • U2R: unauthorized access to local superuser
    (root) privileges, e.g. various buffer overflow
    attacks
  • Probing: surveillance and other probing, e.g.
    port scanning

Intrusion Detection System (IDS)
  • Intrusion Detection System
  • A combination of software and hardware that
    attempts to perform intrusion detection and
    raises an alarm when a possible intrusion happens.

IDS Categories
  • Intrusion detection systems are split into two
    categories:
  • Anomaly detection systems
  • Identify malicious traffic based on deviations
    from established normal network behavior.
  • Misuse detection systems
  • Identify intrusions based on a known pattern
    (signature) for the malicious activity.

Anomaly Detection
  • [Figure labels: activity measures, probable intrusion]
  • Baseline the normal traffic and then look for
    things that are out of the norm
  • Relatively high false positive rate:
    anomalies can just be new normal activities.

Misuse Detection
  • Example: if (src_ip == dst_ip) then "land attack"
  • Look for known indicators: ICMP scans, port scans,
    connection attempts; CPU, RAM, and I/O utilization;
    file system activity, modification of system
    files, permission modifications
  • Can't detect new attacks
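
The land-attack rule on this slide can be sketched as a tiny signature matcher. This is only an illustration: the `Packet` class, its fields, and the signature table are hypothetical names invented for the sketch, not the rule format of any real IDS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified packet header used only for this sketch."""
    src_ip: str
    dst_ip: str
    dst_port: int


# Illustrative signature table: name -> predicate over a packet.
SIGNATURES = {
    # Land attack: the packet claims the victim's own address as its source.
    "land attack": lambda p: p.src_ip == p.dst_ip,
}


def match_signatures(packet):
    """Return the names of all signatures the packet matches."""
    return [name for name, rule in SIGNATURES.items() if rule(packet)]


print(match_signatures(Packet("10.0.0.5", "10.0.0.5", 80)))
print(match_signatures(Packet("10.0.0.7", "10.0.0.5", 80)))
```

A packet that matches no entry in the table passes silently, which is exactly why a misuse detector cannot flag an attack it has no signature for.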

Goal of Intrusion Detection Systems (IDS)
  • To detect an intrusion as it happens and be able
    to respond to it.
  • False positives
  • A false positive is a situation where something
    abnormal (as defined by the IDS) happens, but it
    is not an intrusion.
  • Too many false positives
  • Users will quit monitoring the IDS because of
    the noise.
  • False negatives
  • A false negative is a situation where an
    intrusion is really happening, but the IDS
    doesn't catch it.
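
To make the two error types concrete, the sketch below counts them for a list of IDS decisions. The function name and the sample data are made up for illustration; the rates use the usual definitions (false positives over all non-intrusions, false negatives over all intrusions).

```python
def alarm_statistics(alarms, truths):
    """Compute error rates for an IDS.

    alarms[i] is True if the IDS raised an alarm for event i;
    truths[i] is True if event i really was an intrusion.
    """
    pairs = list(zip(alarms, truths))
    tp = sum(1 for a, t in pairs if a and t)          # intrusion, alarm raised
    fp = sum(1 for a, t in pairs if a and not t)      # no intrusion, alarm raised
    fn = sum(1 for a, t in pairs if not a and t)      # intrusion, no alarm
    tn = sum(1 for a, t in pairs if not a and not t)  # no intrusion, no alarm
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Made-up decisions for five events.
alarms = [True, True, False, False, True]
truths = [True, False, False, True, False]
print(alarm_statistics(alarms, truths))
```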

Why do we need Data Mining?
  • Despite the enormous amount of data, particular
    events of interest are still quite rare; their
    frequency ranges from 0.1% to less than 10%
  • We are drowning in data, but starving for
    knowledge

Data Mining vs. KDD
  • Knowledge Discovery in Databases (KDD): the whole
    process of finding useful information and
    patterns in data
  • Data Mining: use of algorithms to extract the
    information and patterns derived by the KDD
    process
  • Data mining is the core of the knowledge
    discovery process

KDD Process
  • Selection: obtain data from various sources.
  • Preprocessing: cleanse data.
  • Transformation: convert to common format;
    transform to new format.
  • Data Mining: obtain desired results.
  • Interpretation/Evaluation: present results to
    user in a meaningful manner

Data Mining: A KDD Process
  • Data mining: core of the knowledge discovery
    process
  • [Figure labels: Data Cleaning, Data Integration,
    Data Warehouse, Task-relevant Data, Data Mining,
    Pattern Evaluation]

Typical Data Mining Architecture
  • [Figure layers, top to bottom: Graphical user
    interface; Pattern evaluation; Data mining engine;
    Database or data warehouse server; Data cleaning
    and data integration; Data Warehouse]

  • Network intrusion detection
  • Number of intrusions on the network is
    typically a very small fraction of the total
    network traffic

Why Can Data Mining Help?
  • Learn from traffic data
  • Supervised learning: learn precise models from
    past intrusions
  • Unsupervised learning: identify suspicious
    activities
  • Maintain models on dynamic data
  • Correlation of suspicious events across the
    network
  • Helps detect sophisticated attacks not
    identifiable by single-site analyses
  • Analysis of long-term data (months/years)
  • Uncover suspicious stealth activities (e.g.
    insiders leaking/modifying information)

Intrusion Detection
  • Traditional intrusion detection system (IDS)
    tools (e.g. SNORT) are based on signatures of
    known attacks
  • Limitations
  • Signature database has to be manually revised
    for each new type of discovered intrusion
  • They cannot detect emerging cyber threats
  • Substantial latency in deployment of newly
    created signatures across the computer system

Data Mining for Intrusion Detection: Techniques
and Applications
  • Frequent pattern mining
  • Classification
  • Clustering
  • Mining data streams

Frequent pattern mining
  • Frequent patterns: patterns that occur
    frequently in a database
  • Mining frequent patterns: finding regularities
  • Process of mining frequent patterns for
    intrusion detection
  • Phase I: mine a repository of normal frequent
    itemsets from attack-free data
  • Phase II: find frequent itemsets in the last n
    connections and compare the patterns to the
    normal profile
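
The two phases can be sketched as follows. This is a minimal illustration, assuming each connection is represented as a set of `attribute=value` strings; the item names, the support threshold, and the brute-force counting (itemsets of at most two items) are choices made for the sketch, not details from the slides.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(connections, min_support, max_size=2):
    """Brute-force mining: itemsets (up to max_size items) that appear in
    at least min_support (a fraction) of the connections."""
    counts = Counter()
    for items in connections:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[frozenset(combo)] += 1
    threshold = min_support * len(connections)
    return {itemset for itemset, n in counts.items() if n >= threshold}


def suspicious_itemsets(normal_profile, recent_connections, min_support):
    """Phase II: frequent itemsets of the recent window missing from the profile."""
    return frequent_itemsets(recent_connections, min_support) - normal_profile


# Phase I: build the normal profile from (made-up) attack-free connections.
attack_free = [
    {"proto=tcp", "port=80"}, {"proto=tcp", "port=80"},
    {"proto=tcp", "port=443"}, {"proto=udp", "port=53"},
]
profile = frequent_itemsets(attack_free, min_support=0.5)

# Phase II: compare the last n connections against the profile.
recent = [{"proto=tcp", "port=8888"}] * 3 + [{"proto=tcp", "port=80"}]
for itemset in suspicious_itemsets(profile, recent, min_support=0.5):
    print(sorted(itemset))
```

Here `proto=tcp` alone is frequent in both windows and is therefore not reported; only the itemsets involving port 8888 are new relative to the profile.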

Frequent pattern mining
  • Apriori
  • Any subset of a frequent itemset must also be
    frequent (an anti-monotone property)
  • A transaction containing {beer, diaper, nuts}
    also contains {beer, diaper}
  • {beer, diaper, nuts} is frequent → {beer, diaper}
    must also be frequent
  • No superset of any infrequent itemset should be
    generated or tested
  • Many item combinations can be pruned
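
A compact sketch of the level-wise Apriori idea, written for readability rather than speed. The transactions and the minimum count are invented for the example; the prune step is the anti-monotone property from the slide.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset contained in at least
    min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


baskets = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "diaper", "milk"},
    {"nuts", "milk"},
]
for itemset, count in apriori(baskets, min_count=2).items():
    print(sorted(itemset), count)
```

With these baskets, {beer, nuts} is infrequent, so {beer, diaper, nuts} is never generated as a candidate at all.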

Sequential Pattern Analysis
  • Models sequence patterns
  • (Temporal) order is important in many situations
  • Time-series databases and sequence databases
  • Frequent patterns → (frequent) sequential
    patterns
  • Sequential patterns for intrusion detection
  • Capture the signatures for attacks in a series of
    packets

Sequential Pattern Mining
  • Given a set of sequences, find the complete set
    of frequent subsequences
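
The problem statement can be illustrated with a brute-force sketch that enumerates every subsequence up to a small length and counts how many input sequences contain it. The event names and the `max_len` cap are assumptions for the example; this enumeration is exponential in sequence length, so it only suits toy inputs.

```python
from itertools import combinations


def frequent_subsequences(sequences, min_count, max_len=3):
    """Return {subsequence: support} for subsequences (not necessarily
    contiguous) contained in at least min_count of the sequences."""
    support = {}
    for seq in sequences:
        seen = set()
        for length in range(1, min(max_len, len(seq)) + 1):
            # Index tuples from combinations() are increasing, so order is kept.
            for idx in combinations(range(len(seq)), length):
                seen.add(tuple(seq[i] for i in idx))
        # Each sequence contributes at most once to a subsequence's support.
        for sub in seen:
            support[sub] = support.get(sub, 0) + 1
    return {sub: n for sub, n in support.items() if n >= min_count}


sessions = [
    ["syn", "ack", "fin"],
    ["syn", "fin"],
    ["ack", "syn", "fin"],
]
print(frequent_subsequences(sessions, min_count=3))
```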

Apriori Property in Sequences

Classification: A Two-Step Process
  • Model construction: describe a set of
    predetermined classes
  • Training dataset: tuples for model construction
  • Each tuple/sample belongs to a predefined class
  • The model is represented as classification
    rules, decision trees, or mathematical formulae
  • Model application: classify unseen objects
  • Estimate accuracy of the model using an
    independent test set
  • Acceptable accuracy → apply the model to classify
    data tuples with unknown class labels
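
The two steps can be sketched with a deliberately simple classifier. A nearest-centroid model is used here only because it fits in a few lines; it is a stand-in chosen for this sketch, not a method from the slides, and the two features and all sample values are made up.

```python
def train_centroids(samples):
    """Step 1 (model construction): average the feature vectors of each class."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(model, features):
    """Step 2 (model application): pick the class with the closest centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def accuracy(model, test_set):
    """Fraction of an independent, labeled test set classified correctly."""
    hits = sum(1 for features, label in test_set if classify(model, features) == label)
    return hits / len(test_set)


# Hypothetical features: (connections per second, fraction of SYN-only packets).
training = [((2, 0.1), "normal"), ((3, 0.2), "normal"),
            ((40, 0.9), "attack"), ((50, 0.8), "attack")]
testing = [((4, 0.1), "normal"), ((45, 0.95), "attack")]

model = train_centroids(training)
if accuracy(model, testing) >= 0.9:     # accept the model only if accurate enough
    print(classify(model, (30, 0.7)))   # then label a tuple of unknown class
```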

Classification: Decision Tree
  • A node in the tree: a test of some attribute
  • A branch: a possible value of the attribute
  • Classification
  • Start at the root
  • Test the attribute
  • Move down the tree branch
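
The traversal described above can be sketched with a tree stored as nested dictionaries. The tree itself (attributes, values, and labels) is a hypothetical example; a real tree would be learned from training data.

```python
def classify_with_tree(node, record):
    """Start at the root, test the node's attribute, and follow the matching
    branch until a leaf (a plain class label) is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]   # test the attribute
        node = node["branches"][value]      # move down the tree branch
    return node


# Hypothetical hand-written tree, for illustration only.
tree = {
    "attribute": "protocol",
    "branches": {
        "icmp": "normal",
        "tcp": {
            "attribute": "flag",
            "branches": {"SYN": "attack", "SF": "normal"},
        },
    },
}

print(classify_with_tree(tree, {"protocol": "tcp", "flag": "SYN"}))
print(classify_with_tree(tree, {"protocol": "icmp", "flag": "SF"}))
```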

Neural classification: HIDE
  • A hierarchical network intrusion detection
    system using statistical processing and neural
    network classification, by Zheng et al.
  • Five major components
  • Probes: collect traffic data
  • Event preprocessor: preprocesses traffic data and
    feeds the statistical model
  • Statistical processor: maintains a model for
    normal activities and generates vectors for new
    events
  • Neural network: classifies the vectors of new
    events
  • Post processor: generates reports

What Is Clustering?
  • Group data into clusters
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
  • Unsupervised learning: no predefined classes

What Is a Good Clustering?
  • High intra-class similarity and low
    inter-class similarity
  • Depending on the similarity measure
  • The ability to discover some or all of the hidden
    patterns

Clustering Approaches
  • Partitioning algorithms
  • Partition the objects into k clusters
  • Iteratively reallocate objects to improve the
    clustering
  • Hierarchy algorithms
  • Agglomerative: each object starts as its own
    cluster; merge clusters to form larger ones
  • Divisive: all objects start in one cluster;
    split it up into smaller clusters

K-Means Example
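
The slide's figure did not survive in the transcript, so here is a minimal k-means sketch in its place. The points are invented, and seeding the centroids with the first k points is a simplification made to keep the run deterministic (practical implementations usually pick seeds more carefully).

```python
def k_means(points, k, iterations=10):
    """Plain k-means on tuples of numbers; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:   # converged: no centroid moved
            break
        centroids = updated
    return centroids, clusters


points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]
centroids, clusters = k_means(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

On this data the poor initial seeds, (1, 1) and (1, 2), are corrected after one reallocation, and the two groups around (1.3, 1.3) and (8.3, 8.3) emerge.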

Mining Data Streams for Intrusion Detection
  • Maintaining profiles of normal activities
  • The profiles of normal activities may drift
  • Identifying novel attacks
  • Identifying clusters and outliers in traffic data
  • Reduce the future alarm load by writing filtering
    rules that automatically discard well-understood
    false positives

Data Mining for Intrusion Detection
  • Misuse detection
  • Predictive models are built from labeled data
    sets (instances are labeled as normal or
    intrusive)
  • These models can be more sophisticated and
    precise than manually created signatures
  • Recent research: e.g. JAM (Java Agents for
    Metalearning)

JAM (Java Agents for Metalearning)
  • JAM (developed at Columbia University) uses data
    mining techniques to discover patterns of
    intrusions. It then applies a meta-learning
    classifier to learn the signatures of attacks.
  • The association rules algorithm determines
    relationships between fields in the audit trail
    records, and the frequent episodes algorithm
    models sequential patterns of audit events.
    Features are then extracted from both algorithms
    and used to compute models of intrusion behavior.
  • The classifiers build the signatures of attacks.
    Thus, data mining in JAM builds a misuse
    detection model.
  • Classifiers in JAM are generated by running a
    rule-learning program on training data of system
    usage. After training, the resulting
    classification rules are used to recognize
    anomalies and detect known intrusions.
  • The system has been tested with data from
    Sendmail-based attacks, and with network attacks
    using tcpdump data.

Data Mining for Intrusion Detection
  • Anomaly detection
  • Identifies anomalies as deviations from normal
    behavior
  • E.g. ADAM (Audit Data Analysis and Mining) and
    MINDS (MINnesota INtrusion Detection System)

ADAM: Audit Data Analysis and Mining
  • Detecting Intrusion by Data Mining
  • Combination of association rules and
    classification rules
  • First, ADAM collects known frequent datasets
    with an off-line algorithm
  • Second, ADAM runs an online algorithm
  • Finds the latest frequent connection records
  • Compares them with the known mined data
  • Discards those that seem to be normal
  • Suspicious ones are forwarded to the classifier
  • The trained classifier then classifies the
    suspicious data as one of the following
  • Known type of attack
  • Unknown type of attack
  • False alarm

ADAM: Detecting Intrusion by Data Mining

ADAM: Audit Data Analysis and Mining
  • ADAM has two phases in its model
  • 1st phase: train the classifier
  • Offline process
  • Takes place only once
  • Before the main experiment
  • 2nd phase: use the trained classifier
  • The trained classifier is then used to detect
    intrusions
  • Online process

The MINDS Project
  • MINDS: MINnesota INtrusion Detection System
  • Learning from rare class: building rare class
    prediction models
  • Anomaly/outlier detection
  • Summarization of attacks using association
    pattern analysis

[Figure: association rule example. Rules discovered:
{Milk} --> {Coke}; {Diaper, Milk} --> {Beer}]

MINDS - Learning from Rare Class
  • Problem: building models for rare network attacks
    (mining a needle in a haystack)
  • Standard data mining models are not suitable for
    rare classes
  • Models must be able to handle skewed class
    distributions
  • Learning from data streams: intrusions are
    sequences of events

MINDS - Anomaly Detection
  • Detect novel attacks/intrusions by identifying
    them as deviations from normal, i.e. anomalous
    behavior
  • Identify normal behavior
  • Construct useful set of features
  • Define similarity function
  • Use outlier detection algorithm
  • Nearest neighbor approach
  • Density based schemes
  • Unsupervised Support Vector Machines (SVM)
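
Of the outlier detection options listed, the nearest neighbor approach is the easiest to sketch: score each connection by its distance to its k-th nearest neighbor, so isolated points score high. This is a generic illustration, not the MINDS implementation; the two-dimensional features, Euclidean distance as the similarity function, and k = 2 are all assumptions.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor.
    Larger scores mean the point lies farther from normal behavior."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


# Hypothetical feature vectors (e.g. scaled packet and byte counts).
connections = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
scores = knn_outlier_scores(connections, k=2)
most_anomalous = connections[scores.index(max(scores))]
print(most_anomalous)
```

The four connections near the origin have close neighbors and receive small scores, while the isolated one at (10, 10) receives the largest score and would be ranked first for an analyst.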

Experimental Evaluation
  • Publicly available data set
  • DARPA 1998 Intrusion Detection Evaluation Data
    Set, prepared and managed by MIT Lincoln Lab;
    includes a wide variety of intrusions simulated
    in a military network environment
  • Real network data from the University of
    Minnesota
  • Anomaly detection is applied 4 times a day
    with a 10-minute time window

[Figure: MINDS system. Labels: net-flow data collected
using Cisco routers (10-minute cycle, 2 million
connections); data preprocessing; MINDS anomaly
detection; anomaly scores; association pattern
analysis; open-source signature-based network IDS]

MINDS - Framework for Mining Associations
[Figure labels: Anomaly Detection System; ranked
connections; Discriminating Association Pattern
Generator; Knowledge Base; MINDS association analysis
module]
  1. Build normal profile
  2. Study changes in normal behavior
  3. Create attack summary
  4. Detect misuse behavior
  5. Understand nature of the attack
  • Example rules: R1: TCP, DstPort=1863 → Attack;
    R100: TCP, DstPort=80 → Normal

Discovered Real-life Association Patterns
  • Rule 1: SrcIP=XXXX, DstPort=80, Protocol=TCP,
    Flag=SYN, NoPackets=3, NoBytes=120...180
    (c1=256, c2=1)
  • Rule 2: SrcIP=XXXX, DstIP=YYYY, DstPort=80,
    Protocol=TCP, Flag=SYN, NoPackets=3,
    NoBytes=120...180 (c1=177, c2=0)
  • At first glance, Rule 1 appears to describe a Web
    scan
  • Rule 2 indicates an attack on a specific machine
  • Both rules together indicate that a scan is
    performed first, followed by an attack on a
    specific machine identified as vulnerable by the
    attacker
Discovered Real-life Association Patterns
  • DstIP=ZZZZ, DstPort=8888, Protocol=TCP
    (c1=369, c2=0)
  • DstIP=ZZZZ, DstPort=8888, Protocol=TCP,
    Flag=SYN (c1=291, c2=0)
  • This pattern indicates an anomalously high number
    of TCP connections on port 8888 involving machine
    ZZZZ
  • Follow-up analysis of connections covered by the
    pattern indicates that this could be a machine
    running a variation of the Kazaa file-sharing
    protocol
  • Having an unauthorized application increases the
    vulnerability of the system

Discovered Real-life Association Patterns (ctd)
  • SrcIP=XXXX, DstPort=27374, Protocol=TCP,
    Flag=SYN, NoPackets=4, NoBytes=189...200
    (c1=582, c2=2)
  • SrcIP=XXXX, DstPort=12345, NoPackets=4,
    NoBytes=189...200 (c1=580, c2=3)
  • SrcIP=YYYY, DstPort=27374, Protocol=TCP,
    Flag=SYN, NoPackets=3, NoBytes=144
    (c1=694, c2=3)
  • This pattern indicates a large number of scans on
    ports 27374 (which is a signature for the
    SubSeven worm) and 12345 (which is a signature
    for the NetBus worm)
  • Further analysis showed that there were no fewer
    than five machines scanning for one or both of
    these ports in any time window

Discovered Real-life Association Patterns (ctd)
  • DstPort=6667, Protocol=TCP (c1=254, c2=1)
  • This pattern indicates an unusually large number
    of connections on port 6667 detected by the
    anomaly detector
  • Port 6667 is where IRC (Internet Relay Chat) is
    typically run
  • Further analysis reveals that there are many
    small packets from/to various IRC servers around
    the world
  • Although IRC traffic is not unusual, the fact
    that it is flagged as anomalous is interesting
  • This might indicate that the IRC server has been
    taken down (by a DOS attack for example) or it is
    a rogue IRC server (it could be involved in some
    hacking activity)

Discovered Real-life Association Patterns (ctd)
  • DstPort=1863, Protocol=TCP, Flag=0, NoPackets=1,
    NoBytes<139 (c1=498, c2=6)
  • DstPort=1863, Protocol=TCP, Flag=0
    (c1=587, c2=6)
  • DstPort=1863, Protocol=TCP (c1=606, c2=8)
  • This pattern indicates a large number of
    anomalous TCP connections on port 1863
  • Further analysis reveals that the remote IP block
    is owned by Hotmail
  • Flag=0 is unusual for TCP traffic

MINDS Conclusion
  • Data mining based algorithms are capable of
    detecting intrusions that cannot be detected by
    state-of-the-art signature based methods
  • SNORT has static knowledge manually updated by
    human analysts
  • MINDS anomaly detection algorithms are adaptive
    in nature
  • MINDS anomaly detection algorithms can also be
    effective in detecting anomalous behavior
    originating from a compromised or infected machine

IDS Using both Misuse and Anomaly
  • RIDS (Rising Intrusion Detection System) is
    provided by Rising Tech, a leader in antivirus
    and content security software and services in
    China.
  • The company is a leading provider of client,
    gateway and server security solutions for virus
    protection, firewall and intrusion detection
    technologies and security services to enterprises
    and service providers around China.
  • RIDS makes use of both intrusion detection
    techniques: misuse and anomaly detection.
  • A distance-based outlier detection algorithm is
    used to detect deviant behavior in the collected
    network data.
  • For misuse detection, it has a vast set of
    collected data patterns that can be matched
    against scanned network data.
  • This large set of data patterns is scanned
    using decision tree classification, a data
    mining technique.
  • http//

A Cooperative Anomaly and Intrusion Detection
System (CAIDS)
  • built with a network-based intrusion detection
    system (NIDS) and an anomaly detection system
    (ADS) operating interactively through a signature

A Cooperative Anomaly and Intrusion Detection
System (CAIDS)
  • A frequent episode rule (FER) is generated out of
    a collection of frequent episodes. The FER is
    defined over episode sequences with multiple
    connection events.
  • For example, we envision a window where we
    observe a 3-event sequence E, D, and F. An FER
    is generated as E → D, F with
  • confidence level freq(a ∪ b)/freq(a) = 0.8,
  • where a represents the event E on the LHS and b
    corresponds to the two events D and F on the RHS
    of the rule.
  • If a occurs 5% of the time and the joint event
    of a and b occurs 4% of the time, there is a
    (0.04/0.05) = 80% chance that D and F will follow
    in the same window.
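
The arithmetic on this slide can be checked with a two-line function. The sketch assumes the conventional definition of rule confidence, in which the joint frequency is divided by the frequency of the triggering (LHS) episode, and it reads the slide's 5% figure as that LHS frequency; the function and parameter names are made up.

```python
def rule_confidence(joint_frequency, lhs_frequency):
    """Confidence of LHS -> RHS: of the windows in which the LHS episode
    occurs, the fraction in which the RHS events follow as well."""
    if lhs_frequency <= 0:
        raise ValueError("the LHS episode must occur at least once")
    return joint_frequency / lhs_frequency


# Slide example: LHS seen in 5% of windows, LHS and RHS together in 4%.
confidence = rule_confidence(joint_frequency=0.04, lhs_frequency=0.05)
print(f"{confidence:.0%}")
```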

A Cooperative Anomaly and Intrusion Detection
System (CAIDS)
  • In practice, the event E could be an
    authentication service characterized by two
    attributes:
  • (service = authentication, flag = SF).
  • The events D, F may be two sequential smtp
    requests denoted by (service = smtp).
  • Thus we can derive an FER with a confidence level
    of c = 80% that two smtp services will follow
    the authentication service within a window
    w = 2 sec. The three joint traffic events account
    for a support level s = 10% of all the network
    connections being evaluated. This FER is formally
    stated as follows:
  • (service = authentication) → (service = smtp),
    (service = smtp) (0.8, 0.1, 2 sec)

A Cooperative Anomaly and Intrusion Detection
System (CAIDS)
  • An association rule is aimed at finding
    interesting intra-relationships inside a single
    connection record
  • In general, an FER is specified by the following
    expression:
  • L1, L2, ..., Ln → R1, ..., Rm (c, s, window)
  • Li (1 ≤ i ≤ n) and Rj (1 ≤ j ≤ m) are ordered
    traffic connection events.
  • We call L1, L2, ..., Ln the LHS episode and
    R1, ..., Rm the RHS of the episode rule.

A Cooperative Anomaly and Intrusion Detection
System (CAIDS)
[Figure: Architecture of the CAIDS simulator built with
a 2,000-signature Snort and an anomaly detection
subsystem (ADS) with 60 FERs after 2 weeks of rule
training over the Lincoln Lab IDS evaluation data set]

Conclusion
  • In this report we have studied the basic concepts
    and some classic system models in this area, such
    as ADAM and MINDS.
  • We summarized those system models, their
    technologies, and their validation methods.
  • We hope to give an overview of current
    developments in this area and of how data mining
    is evolving into the field of network intrusion
    detection.

Reference
  • DARPA 1998 data set
  • A cleansed set in KDDCup99
  • DARPA 1999 data set is also available
  • http//
  • Daniel Barbara, Julia Couto, Sushil Jajodia,
    Leonard Popyack, Ningning Wu. ADAM: Detecting
    Intrusions by Data Mining. Proceedings of the
    2001 IEEE Workshop on Information Assurance and
    Security, United States Military Academy, West
    Point, NY, 5-6 June 2001.
  • J. Zhang and M. Zulkernine. A Hybrid Network
    Intrusion Detection Technique Using Random
    Forests. Proceedings of the First International
    Conference on Availability, Reliability and
    Security, April 20-22, 2006.
  • W. Lee et al. A data mining framework for
    building intrusion detection models. In
    Information and System Security, Vol. 3, No. 4,
  • L. Ertoz et al. MINDS - Minnesota Intrusion
    Detection System. Next Generation Data Mining,
    Chapter 3, 2004.
  • C.-T. Lu, A.P. Boedihardjo, P. Manalwar.
    Exploiting efficient data mining techniques to
    enhance intrusion detection systems. IEEE
    International Conference on Information Reuse
    and Integration (IRI-2005), 15-17 Aug. 2005,
    pp. 512-517.
  • Sal Stolfo, Andreas Prodromidis, Shelley
    Tselepis, Wenke Lee, Dave Fan, and Phil Chan.
    (Honorable mention (runner-up) for Best Paper
    Award in Applied Research Category.) In
    Proceedings of the Third International Conference
    on Knowledge Discovery and Data Mining (KDD '97),
    Newport Beach, CA, August 1997.

Questions / Comments