On Appropriate Assumptions to Mine Data Streams: Analyses and Solutions

Slides: 23

Provided by: jing3

Transcript and Presenter's Notes

1
On Appropriate Assumptions to Mine Data Streams
Analyses and Solutions

Jing Gao, Wei Fan, Jiawei Han
University of Illinois at Urbana-Champaign
IBM T. J. Watson Research Center

2
Introduction (1)

Data Stream
Continuously arriving data flow
Applications: network traffic, credit card
transaction flows, phone call records, etc.

3
Introduction (2)

Stream Classification
Construct a classification model based on past
records
Use the model to predict labels for new data
Help decision making

[Figure: a classification model labels a record
of unknown status (Fraud?) as Fraud]
4
Framework
[Diagram: the classification model predicts
labels for unlabeled (?) records]
5
Existing Stream Mining Methods

How to use old examples?
Throw away or fade out old examples
Select old examples or models which match the
current concepts
How to update the model?
Real Time Update
Batch Update

Match the training distribution!
6
Existing Stream Mining Methods

Shared distribution assumption
Training and test data are from the same
distribution P(x,y), where x is the feature
vector and y is the class label
Validity of existing work relies on the shared
distribution assumption
Difference from traditional learning
Both distributions evolve

[Figure: training and test distributions]
7
Appropriateness of Shared Distribution

An example of stream data
KDDCUP99 Intrusion Detection Data
P(y) evolves

Shift or delay is inevitable
Future data could differ from the current
data
Fitting the current distribution in order to
match the future one is the wrong approach
The shared distribution assumption is
inappropriate

8
Appropriateness of Shared Distribution

Changes in P(y)
P(y) = \sum_x P(x,y), with P(x,y) = P(y|x) P(x)
The change in P(y) is attributed to changes in
P(y|x) and P(x)

[Figure panels: Time Stamp 1, Time Stamp 11,
Time Stamp 21]
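
The dependence of P(y) on both P(y|x) and P(x) can be checked numerically. The sketch below assumes a discrete x and uses made-up numbers (the traffic types "web" and "ftp" and all probabilities are hypothetical, not taken from the KDDCUP99 data). It holds P(y|x) fixed and changes only P(x), which is already enough to move P(y).

```python
def marginal_p_y(p_x, p_y_given_x):
    """Return P(y) = sum over x of P(y|x) * P(x) for discrete x and y."""
    p_y = {}
    for x, px in p_x.items():
        for y, pyx in p_y_given_x[x].items():
            p_y[y] = p_y.get(y, 0.0) + pyx * px
    return p_y

# Hypothetical conditional distribution P(y|x), kept fixed over time.
p_y_given_x = {
    "web": {"normal": 0.9, "attack": 0.1},
    "ftp": {"normal": 0.5, "attack": 0.5},
}

# Hypothetical feature distributions P(x) at two time stamps.
p_x_early = {"web": 0.8, "ftp": 0.2}
p_x_late = {"web": 0.4, "ftp": 0.6}

early = marginal_p_y(p_x_early, p_y_given_x)
late = marginal_p_y(p_x_late, p_y_given_x)
print(early["attack"], late["attack"])  # the attack rate rises although P(y|x) is unchanged
```

A change in P(y|x) with P(x) fixed would shift P(y) in the same way, so an observed change in P(y) alone does not say which factor moved.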
9
Realistic and relaxed assumption
The training and test distributions are similar
to the degree that the model trained from the
training set D has higher accuracy on the test
set T than both random guessing and predicting
the same class label.
[Figure: on the test set, the model learned from
the training set is compared with Random Guessing
and Fixed Guessing]
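
The relaxed assumption can be read as a concrete check on a labeled test chunk. The sketch below is one possible reading, not the authors' code: it takes "random guessing" to mean uniform guessing over the classes (expected accuracy 1/number of classes) and "predicting the same class label" to mean the best single constant prediction. All function names are hypothetical.

```python
def accuracy(preds, labels):
    """Fraction of predictions that equal the true labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def fixed_guess_accuracy(labels, classes):
    """Best accuracy achievable by always predicting one class (assumed reading)."""
    return max(accuracy([c] * len(labels), labels) for c in classes)

def satisfies_relaxed_assumption(model_preds, test_labels, classes):
    """True if the model beats both uniform random guessing and fixed guessing."""
    model_acc = accuracy(model_preds, test_labels)
    random_acc = 1.0 / len(classes)  # expected accuracy of uniform guessing
    fixed_acc = fixed_guess_accuracy(test_labels, classes)
    return model_acc > random_acc and model_acc > fixed_acc

# Hypothetical test chunk: 5 examples of class 0 and 3 of class 1.
test_labels = [0, 0, 0, 1, 1, 0, 1, 0]
useful_preds = [0, 0, 0, 1, 1, 0, 0, 0]  # 7 of 8 correct
constant_preds = [0] * 8                 # no better than fixed guessing

useful = satisfies_relaxed_assumption(useful_preds, test_labels, [0, 1])
constant = satisfies_relaxed_assumption(constant_preds, test_labels, [0, 1])
print(useful, constant)
```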
10
Realistic and relaxed assumption

Strengths of this assumption
Does not assume any exact relationship between
training and test distribution
Simply assume that learning is useful
Develop algorithms based on this assumption
Maximize the chance for models to succeed on
future data instead of matching the current data

11
A Robust and Extensible Stream Mining Framework
[Diagram: classifiers C1, C2, ..., Ck learned
from the training set are combined by Simple
Voting (SV) or Averaging Probability (AP) to
classify the test set]
12
Why ensemble?

Ensemble
Reduce variance caused by single models
Is more robust than single models when the
distribution is evolving
Expected error analysis
Single model
Ensemble
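
One standard way to see the variance reduction is the textbook argument below. It is a sketch under simplifying assumptions that are not stated in the slides (each model's error has zero mean and equal variance, and errors of different models are uncorrelated), and it is not necessarily the analysis the authors present. The symbols f, f_i, \epsilon_i and \sigma^2 are introduced here for illustration.

```latex
% Assumed setup: model i outputs the true value plus a zero-mean error.
f_i(x) = f(x) + \epsilon_i(x), \qquad
\mathbb{E}[\epsilon_i] = 0, \quad
\mathbb{E}[\epsilon_i^2] = \sigma^2, \quad
\mathbb{E}[\epsilon_i \epsilon_j] = 0 \ (i \neq j)

% Single model: expected squared error
\mathbb{E}\big[(f_i(x) - f(x))^2\big] = \sigma^2

% Ensemble with uniform weights 1/k: expected squared error
\mathbb{E}\Big[\Big(\tfrac{1}{k}\sum_{i=1}^{k} f_i(x) - f(x)\Big)^2\Big]
  = \frac{1}{k^2}\sum_{i=1}^{k}\mathbb{E}[\epsilon_i^2]
  = \frac{\sigma^2}{k}
```

Under these assumptions the averaged ensemble's expected squared error is smaller by a factor of k; with correlated errors the reduction is smaller.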

13
Why simple averaging?

Combining outputs
Simple averaging: uniform weights w_i = 1/k
Weighted ensemble: non-uniform weights
w_i is inversely proportional to the training
errors
wi should reflect P(M), the probability of model
M after observing the data
Uniform weights are the best
P(M) keeps changing, and we can never estimate
the true P(M) or when and how it changes
Uniform weights minimize the expected
distance between P(M) and the weight vector
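
The combination rules discussed above can be sketched in a few lines. This is an illustrative implementation, not the authors' code. It assumes each model outputs a vector of class probabilities. The function names, the example probabilities, and the small `eps` guard against a zero training error are all assumptions made for this sketch.

```python
def average_probability(prob_lists, weights=None):
    """Weighted average of per-model class probabilities; uniform 1/k by default (AP)."""
    k = len(prob_lists)
    if weights is None:
        weights = [1.0 / k] * k
    n_classes = len(prob_lists[0])
    return [sum(w * p[c] for w, p in zip(weights, prob_lists))
            for c in range(n_classes)]

def simple_voting(prob_lists):
    """Each model votes for its most probable class; return vote fractions (SV)."""
    n_classes = len(prob_lists[0])
    votes = [0] * n_classes
    for p in prob_lists:
        votes[max(range(n_classes), key=lambda c: p[c])] += 1
    return [v / len(prob_lists) for v in votes]

def inverse_error_weights(training_errors, eps=1e-9):
    """Weights inversely proportional to training error, normalized to sum to 1 (WE)."""
    inv = [1.0 / (e + eps) for e in training_errors]
    total = sum(inv)
    return [v / total for v in inv]

# Hypothetical outputs of k = 3 models for one test example with two classes.
probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]

ap = average_probability(probs)              # uniform weights 1/3
sv = simple_voting(probs)                    # two of three models vote for class 1
w = inverse_error_weights([0.1, 0.2, 0.4])   # hypothetical training errors
we = average_probability(probs, weights=w)
print(ap, sv, we)
```

In this toy example AP favors class 0 while SV favors class 1, which shows that the two uniform-weight schemes can disagree when one model is very confident.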

14
An illustration

Single models (M1, M2, M3) have huge variance.
Simple averaging ensemble (AP) is more stable and
accurate.
Weighted ensemble (WE) is not as good as AP since
training errors and test errors may have
different distributions.

[Figure panels: Single Models, Weighted Ensemble,
Average Probability]
15
Experiments

Set up
Data streams with chunks T_1, T_2, ..., T_N
Use T_i as the training set to classify T_{i+1}
Measures
Mean Squared Error, Accuracy
Number of Wins, Number of Losses
Normalized Accuracy, MSE
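
The chunk-by-chunk protocol can be sketched as a loop over consecutive chunks. This is a minimal illustration, not the authors' experimental code. The `train` and `predict_proba` callables are hypothetical placeholders for any learner, the toy learner below only estimates class frequencies, and the MSE is assumed here to be the mean of (1 - estimated probability of the true class)^2.

```python
from collections import Counter

def evaluate_stream(chunks, train, predict_proba):
    """Train on chunk T_i, classify T_{i+1}; return a list of (accuracy, mse) pairs."""
    results = []
    for (x_train, y_train), (x_test, y_test) in zip(chunks, chunks[1:]):
        model = train(x_train, y_train)
        probs = [predict_proba(model, x) for x in x_test]  # dict: label -> probability
        preds = [max(p, key=p.get) for p in probs]
        acc = sum(p == y for p, y in zip(preds, y_test)) / len(y_test)
        # Assumed MSE definition: squared gap between 1 and P(true class).
        mse = sum((1.0 - p.get(y, 0.0)) ** 2
                  for p, y in zip(probs, y_test)) / len(y_test)
        results.append((acc, mse))
    return results

# Toy learner (hypothetical): ignores x and predicts the training class frequencies.
def train_prior(x_train, y_train):
    counts = Counter(y_train)
    return {label: n / len(y_train) for label, n in counts.items()}

def predict_prior(model, x):
    return model

# Three hypothetical chunks whose class balance drifts over time.
dummy_x = [[0.0]] * 4
chunks = [
    (dummy_x, [0, 0, 0, 1]),
    (dummy_x, [0, 1, 1, 1]),
    (dummy_x, [1, 1, 1, 0]),
]

results = evaluate_stream(chunks, train_prior, predict_prior)
print(results)
```

With N chunks the loop yields N - 1 evaluations, one per training and test pair.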

16
Experiments

Methods
Single models: Decision Tree (DT), SVM, Logistic
Regression (LR)
Weighted ensemble (WE): weights reflect the
accuracy on the training set
Simple ensemble: voting (SV) or probability
averaging (AP)

17
Experimental Results (1)
[Figure: Comparison on Synthetic Data, panels at
Time 40 and Time 100]
18
Experimental Results (2)
Comparison on Intrusion Data Set
19
Experimental Results (3)
Classification Accuracy Comparison
20
Experimental Results (4)
Mean Squared Error Comparison
21
Conclusions

Realistic assumption
Take into account the difference between training
and test distributions
Overly matching the training distribution is thus
unsatisfactory
Model averaging
Robust and accurate
Effectiveness is proved theoretically
Gives the best predictions on average
