Title: Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory
Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory
- By John McHugh
- Presented by Hongyu Gao
- Feb. 5, 2009
Outline
- Lincoln Labs evaluation in 1998
- Critique of data generation
- Critique of taxonomy
- Critique of evaluation process
- Brief discussion of the 1999 evaluation
- Conclusion
The 1998 evaluation
- "The most comprehensive evaluation of research on intrusion detection systems that has been performed to date"
The 1998 evaluation, cont'd
- Objective
  - To provide an unbiased measurement of current performance levels
  - To provide a common, shared corpus of experimental data that is available to a wide range of researchers
The 1998 evaluation, cont'd
- Simulated a typical Air Force base network
The 1998 evaluation, cont'd
- Collected synthetic traffic data
The 1998 evaluation, cont'd
- Researchers tested their systems using the traffic
- Receiver Operating Characteristic (ROC) curves were used to present the results (see the sketch below)
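As a hedged illustration of how such a ROC presentation can be produced (the scores and labels below are invented, not taken from the evaluation), here is a minimal Python sketch that sweeps a detector's alert threshold and reports detection rate against false alarm rate:

    # Hypothetical illustration: ROC points for an IDS from per-session
    # detector scores and ground-truth labels (1 = attack session).
    import numpy as np

    def roc_points(scores, labels):
        """Return (false_alarm_rate, detection_rate) at every threshold."""
        order = np.argsort(-scores)          # rank sessions by score, high to low
        labels = labels[order]
        tps = np.cumsum(labels)              # attacks flagged at each cutoff
        fps = np.cumsum(1 - labels)          # normal sessions flagged at each cutoff
        return fps / (1 - labels).sum(), tps / labels.sum()

    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])  # made-up detector output
    labels = np.array([1,   1,   0,   1,   0,   0,   1,   0])    # made-up ground truth

    for far, dr in zip(*roc_points(scores, labels)):
        print(f"false alarm rate {far:.2f} -> detection rate {dr:.2f}")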
1. Critique of data generation
- Both background (normal) and attack data are synthesized
- Said to represent traffic to and from a typical Air Force base
- Such synthesized data is required to reflect system performance in realistic scenarios
Critique of background data
- Counter point 1
  - Real traffic is not well-behaved
  - E.g., spontaneous packet storms that are indistinguishable from malicious flooding attempts
  - Not considered in the background traffic
Critique of background data, cont'd
- Counter point 2
  - Low average data rate
Critique of background data, cont'd
- Possible negative consequences
  - The system may produce a larger number of false positives in realistic scenarios
  - The system may drop packets in realistic scenarios
Critique of attack data
- The distribution of attacks is not realistic
- The numbers of U2R, R2L, DoS, and Probing attacks are all of the same order

  U2R   R2L   DoS   Probing
  114    34    99        64
Critique of attack data, cont'd
- Possible negative consequence
  - The aggregate detection rate does not reflect the detection rate on real traffic (see the sketch below)
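A small worked example of that concern (the per-category detection rates and the "realistic" attack mix are assumptions; only the evaluation counts come from the previous slide): the same detector scores very differently once the attack mix is weighted toward the categories that dominate real traffic.

    # Hypothetical illustration: aggregate detection rate under two attack mixes.
    per_category_rate = {"U2R": 0.9, "R2L": 0.8, "DoS": 0.3, "Probing": 0.4}  # assumed

    evaluation_mix = {"U2R": 114, "R2L": 34, "DoS": 99, "Probing": 64}   # counts from the slide
    realistic_mix  = {"U2R": 2,   "R2L": 10, "DoS": 500, "Probing": 300} # assumed skew

    def aggregate(mix, rates):
        total = sum(mix.values())
        return sum(mix[c] * rates[c] for c in mix) / total

    print("aggregate rate, evaluation mix:", round(aggregate(evaluation_mix, per_category_rate), 2))
    print("aggregate rate, realistic mix: ", round(aggregate(realistic_mix, per_category_rate), 2))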
Critique of the simulated AFB network
- Not likely to be realistic
  - 4 real machines
  - 3 fixed attack targets
  - Flat architecture
- Possible negative consequences
  - The IDS can be tuned to only look at traffic targeting certain hosts
  - Precludes the execution of smurf or ICMP echo attacks
2. Critique of taxonomy
- Based on the attacker's point of view
  - Denial of service
  - Remote to user
  - User to root
  - Probing
- Not useful for describing what an IDS might see
Critique of taxonomy, cont'd
- Alternative taxonomies (see the sketch below)
  - Classify by protocol layer
  - Classify by whether a completed protocol handshake is necessary
  - Classify by severity of attack
  - Many others
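To make the defender-centric view concrete, here is a tiny illustrative sketch (the attack names and the categories assigned to them are examples chosen for illustration, not the paper's classification) that tags attacks by the dimensions listed above instead of by attacker goal:

    # Hypothetical illustration of a defender-centric taxonomy: attacks tagged
    # by what an IDS can observe. Category assignments are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class AttackProfile:
        name: str
        protocol_layer: str      # e.g. "IP/ICMP", "TCP", "application"
        needs_handshake: bool    # does the attack require a completed TCP handshake?
        severity: str            # e.g. "nuisance", "service loss", "remote access"

    profiles = [
        AttackProfile("smurf",     "IP/ICMP",     False, "service loss"),
        AttackProfile("portsweep", "TCP",         False, "nuisance"),
        AttackProfile("phf",       "application", True,  "remote access"),
    ]

    # One possible IDS-oriented grouping: by protocol layer.
    by_layer = {}
    for p in profiles:
        by_layer.setdefault(p.protocol_layer, []).append(p.name)
    print(by_layer)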
3. Critique of the evaluation process
- The unit of evaluation
  - The session is used
  - Some traffic (e.g., messages originating from Ethernet hubs) is not part of any session
  - Is the session an appropriate unit? (see the sketch below)
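A minimal sketch of why the session is an awkward unit (the packet records below are invented): grouping traffic into sessions by the usual address/port 5-tuple leaves traffic without a transport header, such as broadcast frames, outside every session, so it can never be scored.

    # Hypothetical illustration: some packets fall outside any session.
    packets = [
        {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": "TCP", "sport": 1025, "dport": 80},
        {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": "TCP", "sport": 1025, "dport": 80},
        {"src": "10.0.0.3", "dst": "broadcast", "proto": "ARP"},  # no ports: session-less
    ]

    sessions, orphans = {}, []
    for p in packets:
        if "sport" in p and "dport" in p:
            key = (p["src"], p["dst"], p["proto"], p["sport"], p["dport"])
            sessions.setdefault(key, []).append(p)
        else:
            orphans.append(p)

    print(f"{len(sessions)} session(s); {len(orphans)} packet(s) outside any session")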
3. Critique of the evaluation process
- Scoring and ROC
  - What is the appropriate denominator for the false alarm rate?
Critique of the evaluation process, cont'd
- A non-standard variation of the ROC curve
  - Substitutes the x-axis with false alarms per day
- Possible problem
  - The number of false alarms per unit time may increase significantly as the data rate increases (see the sketch below)
- Suggested alternatives
  - Use the total number of alerts (both TP and FP)
  - Use the standard ROC curve
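A short sketch of the scaling problem (the traffic volumes are assumptions): a fixed per-session false positive rate translates into a false-alarm count per day that grows linearly with the number of sessions seen per day, so the "false alarms per day" axis is only meaningful at the evaluation's low data rate.

    # Hypothetical illustration: the same per-session false positive rate
    # yields very different "false alarms per day" as traffic volume grows.
    false_positive_rate = 0.001          # assumed: 0.1% of normal sessions flagged

    for sessions_per_day in (10_000, 100_000, 1_000_000):
        alarms_per_day = false_positive_rate * sessions_per_day
        print(f"{sessions_per_day:>9,} sessions/day -> {alarms_per_day:7,.0f} false alarms/day")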
Evaluation of Snort
Evaluation of Snort, cont'd
- Poor performance on DoS and Probing
- Good performance on R2L and U2R
- Conclusion on Snort
  - Not sufficient to draw any conclusion
Critique of the evaluation process, cont'd
- False alarm rate
  - A crucial concern
  - The designated maximum value (0.1) is inconsistent with the maximum operator load set by Lincoln Lab (100/day)
Critique of the evaluation process, cont'd
- Does the evaluation result really mean something?
  - The ROC curve reflects the ability to detect attacks against normal traffic
- What does a good IDS consist of?
  - Algorithm
  - Reliability
  - Good signatures
  - ...
Brief discussion of the 1999 evaluation
- Made some superficial improvements
  - Additional hosts and host types were added
  - New attacks were added
- None of these addresses the flaws listed above
Brief discussion of the 1999 evaluation, cont'd
- The security policy is not clear
  - What is an attack, and what is not?
  - Scans, probes
Conclusion
- The Lincoln Lab evaluation is a major and impressive effort
- This paper criticizes the evaluation from different aspects
Follow-up Work
- DETER: a testbed for network security technology
  - A public facility for medium-scale, repeatable experiments in computer security
  - Located at USC ISI and UC Berkeley
  - 300 PC systems running Utah's Emulab software
  - Experimenters can access DETER remotely to develop, configure, and manipulate collections of nodes and links with arbitrary network topologies
  - The current problem is that there is no realistic attack module or background noise generator plugin for the framework; attack distribution remains a problem
- PREDICT: a huge trace repository
  - It is not public, and there are several legal issues in working with it
Follow-up Work
- KDD Cup: its goal is to provide data sets from real-world problems to demonstrate the applicability of different knowledge discovery and machine learning techniques
  - The 1999 KDD intrusion detection contest uses a labelled version of the 1998 DARPA dataset, annotated with connection features
  - There are several problems with the KDD Cup data: recently, people have found that average TCP packet size is among the features best correlated with attacks, which clearly points out the dataset's inadequacy (see the sketch below)
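A minimal sketch of the kind of check behind that observation (feature values and labels are invented): computing how strongly a single bookkeeping feature such as average TCP packet size correlates with the attack label; if one such feature separates the classes almost perfectly, the dataset is too easy to be a credible benchmark.

    # Hypothetical illustration: does one feature almost perfectly separate
    # attacks from normal traffic? Values are invented for illustration.
    import numpy as np

    avg_pkt_size = np.array([60, 64, 70, 66, 1400, 1450, 1500, 1480], dtype=float)
    is_attack    = np.array([ 0,  0,  0,  0,    1,    1,    1,    1], dtype=float)

    corr = np.corrcoef(avg_pkt_size, is_attack)[0, 1]
    print(f"correlation between avg packet size and attack label: {corr:.2f}")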
Discussion
- Can the aforementioned problems be addressed?
  - Dataset
  - Taxonomy
  - Unit of analysis
  - Approach to comparing IDSes
The End