1
Classification and Novel Class Detection in Data
Streams
  • Mehedy Masud1, Latifur Khan1, Jing Gao2,
  • Jiawei Han2, and Bhavani Thuraisingham1
  • 1Department of Computer Science, University of
    Texas at Dallas
  • 2Department of Computer Science, University of Illinois at Urbana-Champaign

This work was funded in part by
2
Presentation Overview
  • Stream Mining Background
  • Novel Class Detection and Concept-Evolution

3
Data Streams
  • Data streams are continuous flows of data
  • Examples:
    • Network traffic
    • Sensor data
    • Call center records
4
Data Stream Classification
  • Uses past labeled data to build a classification model
  • Predicts the labels of future instances using the model
  • Helps decision making (see the sketch below)
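As a reading aid (not part of the original slides), here is a minimal Python sketch of this train-on-past, predict-the-future workflow; the decision tree and the synthetic data are placeholders chosen for illustration, not the deck's setup.

# Minimal sketch: build a model from past labeled data, then predict
# labels of newly arriving (unlabeled) instances. Base learner and
# data are illustrative placeholders only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Past (labeled) data
X_past = rng.normal(size=(500, 2))
y_past = (X_past[:, 0] + X_past[:, 1] > 0).astype(int)

model = DecisionTreeClassifier(max_depth=5).fit(X_past, y_past)

# Future instances arriving on the stream are labeled by the model,
# and the predictions can feed downstream decision making.
X_future = rng.normal(size=(5, 2))
print(model.predict(X_future))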

5
Challenges
Introduction
  • Infinite length
  • Concept-drift
  • Concept-evolution (emergence of novel class)
  • Recurrence (seasonal) class

6
Infinite Length
  • Impractical to store and use all historical data
  • Would require infinite storage and running time

7
Concept-Drift
[Figure: a data chunk containing positive and negative instances; some instances are victims of concept-drift]
8
Concept-Evolution
[Figure: a two-class (+/-) feature space with axes x and y, partitioned by thresholds x1, y1, y2 into regions A, B, C, D; instances of a novel class appear among the existing ones]

Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class -

Existing classification models misclassify novel class instances (a small code rendering of R1/R2 follows).
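To make the point concrete, here is a small Python rendering of rules R1/R2 (not from the slides; the threshold values are arbitrary illustrative choices). Whatever region a novel-class instance falls into, the rules can only answer + or -, which is exactly the misclassification problem described above.

# Rules R1/R2 from the figure, with arbitrary illustrative thresholds.
x1, y1, y2 = 0.5, 0.3, 0.7

def classify(x, y):
    if (x > x1 and y < y2) or (x < x1 and y < y1):
        return '+'            # rule R1
    return '-'                # rule R2 covers the remaining regions

# An instance of a class that did not exist at training time still
# receives '+' or '-', because no other label is available.
print(classify(0.9, 0.1))     # -> '+'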
9
Background: Ensemble of Classifiers

[Figure: an unlabeled input (x, ?) is given to each classifier C1, C2, C3; the individual outputs are combined by voting to produce the ensemble output]
10
Background: Ensemble Classification of Data Streams
  • Divide the data stream into equal-sized chunks
  • Train a classifier from each data chunk
  • Keep the best L such classifiers in the ensemble
  • Example for L = 3 (a code sketch follows the figure note below)

[Figure: data chunks D4, D5, D6 arrive on the stream; labeled chunks train classifiers C4, C5, and the ensemble retains the best L of C1..C5 to classify the latest unlabeled chunk]
Note: Di may contain data points from different classes.
Addresses infinite length and concept-drift.
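The sketch referenced above (not from the slides): a chunk-based ensemble that trains one classifier per labeled chunk, keeps the best L classifiers as judged on the latest chunk, and predicts by majority vote. The scikit-learn decision tree and the accuracy-based pruning rule are assumptions for illustration; the exact selection and weighting scheme varies across papers.

from collections import Counter
from sklearn.tree import DecisionTreeClassifier

L = 3                        # ensemble size (the slide's example uses L = 3)
ensemble = []                # list of (classifier, score) pairs

def train_on_chunk(X_chunk, y_chunk):
    """Train a classifier on the latest labeled chunk, keep the best L."""
    new_clf = DecisionTreeClassifier(max_depth=5).fit(X_chunk, y_chunk)
    candidates = [clf for clf, _ in ensemble] + [new_clf]
    # Score old and new classifiers on the latest labeled chunk and prune.
    scored = sorted(((clf, clf.score(X_chunk, y_chunk)) for clf in candidates),
                    key=lambda cs: cs[1], reverse=True)
    ensemble[:] = scored[:L]

def predict(x):
    """Majority vote of the ensemble on one unlabeled instance (1-D feature list)."""
    votes = [clf.predict([x])[0] for clf, _ in ensemble]
    return Counter(votes).most_common(1)[0][0]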
11
Examples of Recurrence and Novel Classes
Introduction
  • Twitter Stream: a stream of messages
  • Each message may be given a category or class based on its topic
  • Examples: Election 2012, London Olympics, Halloween, Christmas, Hurricane Sandy, etc.
  • Among these, Election 2012 and Hurricane Sandy are novel classes because they are new events.
  • Halloween is a recurrence class because it recurs every year.

12
Concept-Evolution and Feature Space
Introduction
[Figure, repeated from slide 8: the two-class (+/-) feature space partitioned by thresholds x1, y1, y2 into regions A, B, C, D, with novel class instances appearing among the existing ones]

Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class -

Existing classification models misclassify novel class instances.
13
Novel Class Detection: Prior Work
Prior work
  • Three steps
  • Training and building decision boundary
  • Outlier detection and filtering
  • Computing cohesion and separation

14
Training: Creating Decision Boundary
Prior work
  • Training is done chunk-by-chunk (one classifier per chunk)
  • An ensemble of classifiers is used for classification (a clustering sketch follows the figure note below)

[Figure: raw training data in the (x, y) feature space is grouped into clusters that summarize the regions A, B, C, D occupied by the existing classes]
Addresses the infinite length problem.
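The clustering sketch referenced above (an illustration, not the paper's exact procedure): each training chunk is summarized by K clusters, and each cluster is kept as a centroid plus a radius; the union of these hyperspheres acts as the chunk's decision boundary. k-means and the max-distance radius are assumptions made for this sketch.

import numpy as np
from sklearn.cluster import KMeans

def build_decision_boundary(X_chunk, K=10):
    """Summarize one training chunk as K (centroid, radius) hyperspheres."""
    km = KMeans(n_clusters=K, n_init=10).fit(X_chunk)
    boundary = []
    for k in range(K):
        members = X_chunk[km.labels_ == k]
        centroid = km.cluster_centers_[k]
        radius = np.max(np.linalg.norm(members - centroid, axis=1))
        boundary.append((centroid, radius))
    return boundary            # the union of spheres is the decision boundary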
15
Outlier Detection and Filtering
Prior work
  • A test instance inside the decision boundary is not an outlier
  • A test instance outside the decision boundary is a raw outlier, or Routlier
  • If x is an Routlier with respect to all classifiers in the ensemble (logical AND), then x is a filtered outlier (Foutlier), a potential novel class instance; otherwise x is treated as an existing class instance

[Figure: Routliers lying outside the cluster boundaries in the (x, y) feature space; an AND over the ensemble's outlier decisions separates Foutliers from existing class instances]
Routliers may appear as a result of novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible (a minimal sketch of this filtering follows).
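A minimal sketch of the filtering step (reusing the hypothetical (centroid, radius) boundaries from the previous sketch): an instance outside every sphere of one model is a raw outlier, and an instance that is a raw outlier for every model in the ensemble is treated as a filtered outlier.

import numpy as np

def is_raw_outlier(x, boundary):
    """Routlier: x lies outside every (centroid, radius) sphere of one model."""
    return all(np.linalg.norm(x - centroid) > radius
               for centroid, radius in boundary)

def is_filtered_outlier(x, ensemble_boundaries):
    """Foutlier: x is a raw outlier for all models in the ensemble,
    which filters out most effects of noise and concept-drift."""
    return all(is_raw_outlier(x, b) for b in ensemble_boundaries)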
16
Computing Cohesion and Separation
Prior work

[Figure: an Foutlier x with its q-nearest neighborhoods: λo,q(x) among the Foutliers and λ+,q(x), λ-,q(x) among the existing classes, with the corresponding distances a(x), b+(x), b-(x)]
  • a(x): mean distance from an Foutlier x to the instances in λo,q(x)
  • bc(x): mean distance from x to its q nearest instances of existing class c
  • bmin(x): minimum among all bc(x) (e.g., b+(x) in the figure)
  • q-Neighborhood Silhouette Coefficient: q-NSC(x) = (bmin(x) - a(x)) / max(bmin(x), a(x))
  • If q-NSC(x) is positive, x is closer to the Foutliers than to any other class (a computation sketch follows).
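A sketch of the q-NSC computation (simplified; the exact neighborhood bookkeeping in the prior work may differ). It takes an Foutlier x, the other buffered Foutliers, and per-class training data, and returns a positive value when x is closer to the Foutliers than to any existing class.

import numpy as np

def q_nsc(x, foutliers, class_data, q=50):
    """x: 1-D point; foutliers: array of other Foutliers;
    class_data: dict mapping class label -> array of training points."""
    def mean_qnn_dist(points):
        d = np.sort(np.linalg.norm(points - x, axis=1))
        return d[:q].mean()                      # mean distance to q nearest

    a = mean_qnn_dist(foutliers)                 # cohesion a(x)
    b_min = min(mean_qnn_dist(pts) for pts in class_data.values())  # separation
    return (b_min - a) / max(b_min, a)           # positive => novel-class-like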

17
Limitation: Recurrence Class
Prior work
18
Why Are Recurrence Classes Forgotten?
Prior work
  • Divide the data stream into equal-sized chunks
  • Train a classifier from the whole data chunk
  • Keep the best L such classifiers in the ensemble
  • Example for L = 3
  • Therefore, old models are discarded
  • Old classes are forgotten after a while

[Figure, repeated from slide 10: chunks D4, D5, D6 and classifiers C1..C5; only the best L classifiers stay in the ensemble, which addresses infinite length and concept-drift but discards the models of old classes]
19
CLAM: The Proposed Approach
Proposed method
CLAss-based Micro-classifier ensemble (CLAM)

[Flow diagram: the latest labeled chunk of the stream is used for training a new model, which updates the ensemble M (M keeps micro-classifiers for all classes). Each latest unlabeled instance goes through outlier detection against M: if it is not an outlier, it belongs to an existing class and is classified using M; if it is an outlier, it is buffered for novel class detection.]
20
Training and Updating
Proposed method
  • Each chunk is first separated into its different classes
  • A micro-classifier is trained from each class's data
  • Each new micro-classifier replaces one existing micro-classifier of that class
  • A total of L micro-classifiers make a Micro-Classifier Ensemble (MCE)
  • C such MCEs, one per class, constitute the whole ensemble E (a code sketch follows)
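The sketch referenced above (an approximation of the described structure, not the authors' code): the ensemble E maps each class label to its own micro-classifier ensemble (MCE) of at most L models. Here `train_micro_classifier` is a hypothetical callback, and replacing the oldest model is an assumption; the paper's replacement rule may differ.

from collections import defaultdict, deque

L = 3                                       # micro-classifiers per class (MCE size)
E = defaultdict(lambda: deque(maxlen=L))    # class label -> MCE; E keeps all classes

def update_ensemble(X_chunk, y_chunk, train_micro_classifier):
    """Split the labeled chunk (NumPy arrays) by class, train one
    micro-classifier per class, and let it replace the oldest model
    in that class's MCE (deque with maxlen=L)."""
    for label in set(y_chunk):
        X_class = X_chunk[y_chunk == label]
        E[label].append(train_micro_classifier(X_class))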

21
CLAM: The Proposed Approach
Proposed method
CLAss-based Micro-classifier ensemble (CLAM)

[Flow diagram, repeated from slide 19: labeled chunks train new models that update the ensemble M (which keeps all classes); each unlabeled instance is either classified by M as an existing class (not an outlier) or buffered for novel class detection (outlier).]
22
Outlier Detection and Classification
Proposed method
  • A test instance x is first classified with each micro-classifier ensemble
  • Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean)
  • If all ensembles flag x as an outlier, it is buffered and sent to the novel class detector
  • Otherwise, the partial outputs are combined and a class label is predicted (see the sketch below)
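The sketch referenced above (illustrative only): `is_outlier` and `score` are hypothetical per-micro-classifier helpers standing in for the boundary test and the partial output Yr, and the way votes are aggregated inside one MCE is an assumption of this sketch.

buffer = []     # potential novel class instances awaiting detection

def classify_or_buffer(x, E):
    """E maps class label -> MCE (list of micro-classifiers)."""
    partial = {}
    for label, mce in E.items():
        if all(mc.is_outlier(x) for mc in mce):
            continue                              # x is an outlier to this class's MCE
        partial[label] = sum(mc.score(x) for mc in mce) / len(mce)
    if not partial:                               # outlier to every MCE
        buffer.append(x)                          # send to novel class detection
        return None
    return max(partial, key=partial.get)          # predicted existing class label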

23
Evaluation
Evaluation
  • Competitors:
  • CLAM (CL): proposed work
  • SCANR (SC) [1]: prior work
  • ECSMiner (EM) [2]: prior work
  • OLINDDA [3]-WCE [4] (OW): another baseline
  • Datasets: Synthetic, KDD Cup 1999, Forest covertype

1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham. Detecting recurring and novel classes in concept-drifting data streams. In Proc. ICDM '11, Dec. 2011, pp. 1176-1181.
2. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6):859-874, 2011.
3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In Proc. 2008 ACM Symposium on Applied Computing, pages 976-980, 2008.
4. H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226-235, Washington, DC, USA, Aug. 2003. ACM.
24
Overall Error
Evaluation
[Figure: error rates on (a) SynC20, (b) SynC40, (c) Forest, and (d) KDD]
25
Number of Recurring Classes vs Error
Evaluation
26
Error vs Drift and Chunk Size
Evaluation
27
Summary Table
Evaluation
28
Conclusion
  • Detect Recurrence
  • Improved Accuracy
  • Running Time
  • Reduced Human Interaction
  • Future work: use other base learners

29
  • Questions?

30
  • Thanks