Mining Unusual Patterns in Data Streams: Methodologies and Research Problems

1
Mining Unusual Patterns in Data Streams
Methodologies and Research Problems

Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/hanj

2
Outline

Characteristics of stream data
Architecture and models for SDMS and stream query
processing
Why mining unusual patterns in stream data?
Essentials for mining unusual patterns in stream
data
Stream cubing and stream OLAP methods
Stream mining methods
Research problems
Conclusions

3
Characteristics of Data Streams

Data Streams
Data streams: continuous, ordered, changing, fast,
huge amount
Traditional DBMS: data stored in finite,
persistent data sets
Characteristics
Huge volumes of continuous data, possibly
infinite
Fast changing and requires fast, real-time
response
Data stream captures nicely our data processing
needs of today
Random access is expensive: single linear scan
algorithm (can only have one look)
Store only the summary of the data seen thus far
Most stream data are low-level or
multi-dimensional in nature and need multi-level
and multi-dimensional processing

4
Stream Data Applications

Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply &
manufacturing
Sensor, monitoring & surveillance: video streams
Security monitoring
Web logs and Web page click streams
Massive data sets (even saved but random access
is too expensive)

5
DBMS versus DSMS

DBMS | DSMS
Persistent relations | Transient streams
One-time queries | Continuous queries
Random access | Sequential access
Unbounded disk store | Bounded main memory
Only current state matters | Historical data is important
No real-time services | Real-time requirements
Relatively low update rate | Possibly multi-GB arrival rate
Data at any granularity | Data at fine granularity
Assume precise data | Data stale/imprecise
Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics

Ack.: From Motwani's PODS tutorial slides
6
Architecture: Stream Query Processing
[Figure: multiple streams enter the SDMS (Stream Data Management System), where the Stream Query Processor works with a scratch space (main memory and/or disk) and returns results to the user/application]
7
Challenges of Stream Data Processing

Multiple, continuous, rapid, time-varying,
ordered streams
Main memory computations
Queries are often continuous
Evaluated continuously as stream data arrives
Answer updated over time
Queries are often complex
Beyond element-at-a-time processing
Beyond stream-at-a-time processing
Beyond relational queries (scientific, data
mining, OLAP)
Multi-level/multi-dimensional processing and data
mining
Most stream data are at pretty low-level or
multi-dimensional in nature

8
Processing Stream Queries

Query types
One-time query vs. continuous query (being
evaluated continuously as stream continues to
arrive)
Predefined query vs. ad-hoc query (issued
on-line)
Unbounded memory requirements
For real-time response, main memory algorithm
should be used
Memory requirement is unbounded if one will join
future tuples
Approximate query answering
With bounded memory, it is not always possible to
produce exact answers
High-quality approximate answers are desired
Data reduction and synopsis construction methods
Sketches, random sampling, histograms, wavelets,
etc.
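
Random sampling, one of the synopsis methods listed above, can be sketched in a few lines. The example below uses reservoir sampling, which is one common way to keep a uniform sample under bounded memory; the slides do not name a specific sampling scheme, so the choice of algorithm and the function name `reservoir_sample` are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Memory use is O(k) no matter how many items arrive, and each item is
    looked at exactly once (a single linear scan).
    """
    rng = rng or random.Random()
    reservoir = []
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir
```

An approximate aggregate (for example an average) can then be computed over the sample instead of the full stream, which trades exactness for bounded memory.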

9
Methods for Approximate Query Answering

Sliding windows
Only over sliding windows of recent stream data
Approximation but often more desirable in
applications
Batched processing, sampling and synopses
Batched if update is fast but computing is slow
Compute periodically, not very timely
Sampling if update is slow but computing is fast
Compute using sample data, but not good for
joins, etc.
Synopsis data structures
Maintain a small synopsis or sketch of data
Good for querying historical data
Blocking operators, e.g., sorting, avg, min, etc.
Blocking if unable to produce the first output
until seeing the entire input
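
A sliding-window aggregate avoids the blocking problem because it only ever needs the most recent items. The following is a minimal sketch of a count-based window that maintains a running mean; the class name `SlidingWindowMean` and the count-based (rather than time-based) window are assumptions for illustration.

```python
from collections import deque


class SlidingWindowMean:
    """Mean over the most recent `size` stream values, updated in O(1) per item."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.window = deque()
        self.total = 0.0

    def add(self, value):
        """Insert a new value, evict the oldest if needed, return the current mean."""
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)


w = SlidingWindowMean(3)
print([w.add(x) for x in [1, 2, 3, 4, 5]])  # [1.0, 1.5, 2.0, 3.0, 4.0]
```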

10
Projects on DSMS (Data Stream Management System)

Research projects and system prototypes
STREAM (Stanford): A general-purpose DSMS
Cougar (Cornell): sensors
Aurora (Brown/MIT): sensor monitoring, dataflow
Hancock (AT&T): telecom streams
Niagara (OGI/Wisconsin): Internet XML databases
OpenCQ (Georgia Tech): triggers, incr. view
maintenance
Tapestry (Xerox): pub/sub content-based filtering
Telegraph (Berkeley): adaptive engine for sensors
Tradebot (www.tradebot.com): stock tickers &
streams
Tribeca (Bellcore): network monitoring
Streaminer (UIUC): new project for stream data
mining

11
Stream Data Mining vs. Stream Querying

Stream mining: A more challenging task
It shares most of the difficulties with stream
querying
Patterns are hidden and more general than
querying
It may require exploratory analysis
Not necessarily continuous queries
Stream data mining tasks
Multi-dimensional on-line analysis of streams
Mining outliers and unusual patterns in stream
data
Clustering data streams
Classification of stream data

12
Challenges for Mining Unusual Patterns in Data
Streams

Most stream data are low-level or
multi-dimensional in nature: needs ML/MD
(multi-level/multi-dimensional) processing
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multiple
dimensions/levels
Fast, real-time detection and response
Comparing with data cube: similarity and
differences
Stream (data) cube or stream OLAP: Is this
feasible?
Can we implement it efficiently?

13
Multi-Dimensional Stream Analysis Examples

Analysis of Web click streams
Raw data at low levels: seconds, web page
addresses, user IP addresses, ...
Analysts want changes, trends, unusual patterns,
at reasonable levels of detail
E.g., average clicking traffic in North America
on sports in the last 15 minutes is 40% higher
than that in the last 24 hours.
Analysis of power consumption streams
Raw data: power consumption flow for every
household, every minute
Patterns one may find: average hourly power
consumption for manufacturing companies in
Chicago in the last 2 hours today is up 30%
over that of the same day a week ago

14
A Key Step: Stream Data Reduction

Challenges of OLAPing stream data
Raw data cannot be stored
Simple aggregates are not powerful enough
History shape and patterns at different levels
are desirable: multi-dimensional regression
analysis
Proposal
A scalable multi-dimensional stream data cube
that can aggregate regression model of stream
data efficiently without accessing the raw data
Stream data compression
Compress the stream data to support memory- and
time-efficient multi-dimensional regression
analysis

15
Basics of General Linear Regression

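n tuples in one cell: $(x_i, y_i)$, $i = 1..n$, where
$y_i$ is the measure attribute to be analyzed
For sample i, a vector of k user-defined
predictors $u_i = (u_{i0}, u_{i1}, \ldots, u_{i,k-1})^T$
The linear regression model:
$E(y_i \mid u_i) = \eta^T u_i = \eta_0 u_{i0} + \eta_1 u_{i1} + \cdots + \eta_{k-1} u_{i,k-1}$
where $\eta = (\eta_0, \eta_1, \ldots, \eta_{k-1})^T$ is a $k \times 1$ vector of regression
parameters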

16
Theory of General Linear Regression

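Collect the predictor vectors $u_i$ into the $n \times k$ model matrix $U$ (row i is $u_i^T$)
The ordinary least squares (OLS) estimate $\hat{\eta}$ of $\eta$
is the argument that minimizes the residual sum of
squares function $RSS(\eta) = (y - U\eta)^T (y - U\eta)$, where $y = (y_1, \ldots, y_n)^T$
Main theorem to determine the OLS regression
parameters: $\hat{\eta} = (U^T U)^{-1} U^T y$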

17
Linearly Compressed Representation (LCR)

Stream data compression for multi-dimensional
regression analysis
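Define, for $i, j = 0, \ldots, k-1$: $\theta_{ij} = \sum_{h=1}^{n} u_{hi} u_{hj}$
The linearly compressed representation (LCR) of
one cell: $LCR = \{\hat{\eta}_i \mid i = 0, \ldots, k-1\} \cup \{\theta_{ij} \mid i, j = 0, \ldots, k-1,\ i \le j\}$
Size of LCR of one cell: $S(k) = (k^2 + 3k)/2$,
quadratic in k, independent of the number of
tuples n in one cell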

18
Matrix Form of LCR

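LCR consists of $\hat{\eta}$ and $\Theta$, where
$\hat{\eta} = (\hat{\eta}_0, \hat{\eta}_1, \ldots, \hat{\eta}_{k-1})^T$
and $\Theta = [\theta_{ij}]$ is a $k \times k$ matrix,
where $\Theta = U^T U$
$\hat{\eta}$ provides OLS regression parameters essential for
regression analysis
$\Theta$ is an auxiliary matrix that facilitates
aggregations of LCR in standard and regression
dimensions in a data cube environment
LCR only stores
the upper triangle of $\Theta$, since $\Theta$ is symmetric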

19
Aggregation in Standard Dimensions

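Given LCR of m cells that differ in one standard
dimension, what is the LCR of the cell aggregated
in that dimension?
$LCR_i = (\hat{\eta}_i, \Theta_i)$, $i = 1..m$, for m base cells
$LCR_a = (\hat{\eta}_a, \Theta_a)$ for an aggregated cell
The lossless aggregation formula:
$\hat{\eta}_a = \sum_{i=1}^{m} \hat{\eta}_i$, $\Theta_a = \Theta_i$ (the same for every base cell)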
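
To make aggregation in a standard dimension concrete, here is a minimal sketch for simple linear regression on time (k = 2, predictors 1 and t). It assumes, as in the stock example on the next slide, that the base cells are observed at the same time points and that the aggregated measure is the sum of their measures; under that assumption the regression parameters add and the auxiliary matrix is unchanged. The helper names `solve2`, `fit_lcr`, and `aggregate_standard` are hypothetical.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a @ x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def fit_lcr(ts, ys):
    """Return (eta, theta) for y = eta[0] + eta[1] * t.

    theta is U^T U for the model matrix U with rows (1, t);
    eta is the OLS estimate (U^T U)^{-1} U^T y.
    """
    theta = [[len(ts), sum(ts)],
             [sum(ts), sum(t * t for t in ts)]]
    uty = [sum(ys), sum(t * y for t, y in zip(ts, ys))]
    return solve2(theta, uty), theta


def aggregate_standard(lcrs):
    """Aggregate cells sharing the same time points: etas add, theta is kept."""
    eta = [sum(e[i] for e, _ in lcrs) for i in range(2)]
    return eta, lcrs[0][1]


ts = [0, 1, 2, 3]
company_a = fit_lcr(ts, [1.0, 3.0, 2.0, 5.0])
company_b = fit_lcr(ts, [2.0, 2.0, 4.0, 3.0])
print(aggregate_standard([company_a, company_b]))
```

The aggregated parameters match a fit on the summed series, without touching the raw tuples.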

20
Stock Price Example: Aggregation in Standard
Dimensions

Simple linear regression on time series data
Cells of two companies
After aggregation

21
Aggregation in Regression Dimensions

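Given LCR of m cells that differ in one
regression dimension, what is the LCR of the cell
aggregated in that dimension?
$LCR_i = (\hat{\eta}_i, \Theta_i)$, $i = 1..m$, for m base cells
$LCR_a = (\hat{\eta}_a, \Theta_a)$ for the aggregated
cell
The lossless aggregation formula:
$\hat{\eta}_a = \left(\sum_{i=1}^{m} \Theta_i\right)^{-1} \sum_{i=1}^{m} \Theta_i \hat{\eta}_i$, $\Theta_a = \sum_{i=1}^{m} \Theta_i$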
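
Aggregation in a regression dimension (for example, merging adjacent time intervals) can be sketched the same way. The example again assumes simple linear regression on time (k = 2) and relies on the identity that the product of a cell's auxiliary matrix and its parameter vector equals U^T y for that cell, so summing these products and the auxiliary matrices recovers the fit over the combined interval. The helper names are hypothetical.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a @ x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def fit_lcr(ts, ys):
    """Return (eta, theta) for y = eta[0] + eta[1] * t, with theta = U^T U."""
    theta = [[len(ts), sum(ts)],
             [sum(ts), sum(t * t for t in ts)]]
    uty = [sum(ys), sum(t * y for t, y in zip(ts, ys))]
    return solve2(theta, uty), theta


def aggregate_regression(lcrs):
    """Merge cells covering disjoint time intervals using only their LCRs."""
    theta_a = [[sum(th[i][j] for _, th in lcrs) for j in range(2)]
               for i in range(2)]
    rhs = [0.0, 0.0]
    for eta, th in lcrs:
        # theta_i @ eta_i reconstructs U_i^T y_i for that cell.
        rhs[0] += th[0][0] * eta[0] + th[0][1] * eta[1]
        rhs[1] += th[1][0] * eta[0] + th[1][1] * eta[1]
    return solve2(theta_a, rhs), theta_a


first = fit_lcr([0, 1, 2], [1.0, 2.5, 2.0])
second = fit_lcr([3, 4, 5], [4.0, 3.5, 6.0])
print(aggregate_regression([first, second]))
```

Each interval needs at least two distinct time points so that its 2 x 2 system is solvable.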

22
Stock Price Example: Aggregation in Time Dimension

Cells of two adjacent
time intervals
After aggregation

23
Feasibility of Stream Regression Analysis

Efficient storage and scalable (independent of
the number of tuples in data cells)
Lossless aggregation without accessing the raw
data
Fast aggregation: computationally efficient
Regression models of data cells at all levels
General results cover a large and the most
popular class of regression
Including quadratic, polynomial, and nonlinear
models

24
A Stream Cube Architecture

A tilt time frame
Different time granularities
second, minute, quarter, hour, day, week, ...
Critical layers
Minimum interest layer (m-layer)
Observation layer (o-layer)
User watches at o-layer and occasionally needs
to drill down to m-layer
Partial materialization of stream cubes
Full materialization: too space- and
time-consuming
No materialization: slow response at query time
Partial materialization: what do we mean by
"partial"?

25
A Tilt Time-Frame Model
[Figure: two tilt time-frame diagrams, labeled "Up to 7 days" and "Up to a year"]
26
Benefits of Tilt Time-Frame Model

Each cell stores the measures according to
tilt-time-frame
Limited memory space: impossible to store the
history in full scale
Emphasis more on recent data
Most applications emphasize recent data (sliding
window)
Natural partition on different time granularities
Putting different weights on remote data
Useful even for uniform weight
Tilt time-frame forms a new time dimension
for mining changes and evolutions
Essential for mining unusual patterns or outliers
Finding those with dramatic changes
E.g., exceptional stocks: not following the trends
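
The space saving of a tilt time frame can be illustrated with a small sketch. It assumes a summable measure (such as a count) and three hypothetical levels (4 quarters per hour, 24 hours per day, 31 days); the class name `TiltTimeFrame`, the level sizes, and the roll-up rule are illustrative choices, not the exact design from the slides.

```python
from collections import deque


class TiltTimeFrame:
    """Recent data at fine granularity, older data at coarser granularity.

    levels: list of (name, fanout), finest first. Each level keeps at most
    `fanout` slots; every `fanout` insertions the level's slots are summed
    and pushed one level up.
    """

    def __init__(self, levels):
        self.names = [name for name, _ in levels]
        self.fanout = [fanout for _, fanout in levels]
        self.slots = [deque(maxlen=f) for f in self.fanout]
        self.count = [0] * len(levels)

    def add(self, value, level=0):
        self.slots[level].append(value)  # oldest slot drops out automatically
        self.count[level] += 1
        completed = self.count[level] % self.fanout[level] == 0
        if completed and level + 1 < len(self.slots):
            self.add(sum(self.slots[level]), level + 1)

    def snapshot(self):
        return {n: list(s) for n, s in zip(self.names, self.slots)}


frame = TiltTimeFrame([("quarter", 4), ("hour", 24), ("day", 31)])
for _ in range(100):          # 100 quarter-hour measurements of 1 unit each
    frame.add(1)
print(frame.snapshot())
```

After 100 quarter-hour insertions the frame holds 4 + 24 + 1 = 29 slots instead of 100 raw values, and the total never exceeds 4 + 24 + 31 = 59 slots.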

27
Two Critical Layers in the Stream Cube
o-layer (observation): (*, theme, quarter)
m-layer (minimal interest): (user-group, URL-group, minute)
(primitive) stream data layer: (individual-user, URL, second)
28
On-Line Materialization vs. On-Line Computation

On-line materialization
Materialization takes precious resources and time
Only incremental materialization (with sliding
window)
Only materialize cuboids of the critical
layers?
Some intermediate cells should also be
materialized
Popular path approach vs. exception cell approach
Materialize intermediate cells along the popular
paths
Exception cells: how to set up exception
thresholds?
Notice: exceptions do not have monotonic behaviour
Computation problem
How to compute and store stream cubes
efficiently?
How to discover unusual cells between the
critical layers?

29
Stream Cube Structure from m-layer to o-layer
[Figure: lattice of cuboids between the two critical layers. Nodes recoverable from the slide: (A1, *, C1), (A1, *, C2), (A2, *, C2), (A2, B1, C1), (A1, B1, C2), (A1, B2, C1), (A2, B1, C2), (A2, B2, C1), (A1, B2, C2), (A2, B2, C2)]
30
Stream Cube Computation

Cube structure from m-layer to o-layer
Three approaches
All cuboids approach
Materializing all cells (too much in both space
and time)
Exceptional cells approach
Materializing only exceptional cells (saves space
but not computation time, and the definition of
exception is not flexible)
Popular path approach
Computing and materializing cells only along a
popular path
Using H-tree structure to store computed cells
(which form the stream cube: a selectively
materialized cube)
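
The popular path approach can be sketched as a chain of roll-ups starting from the m-layer. In this illustration a plain dictionary stands in for the H-tree, the measure is a simple sum, and the dimension hierarchy (a URL group rolled up to a theme, a user group rolled up to `*`) is a made-up example; the function names `roll_up` and `materialize_path` are hypothetical.

```python
from collections import defaultdict


def roll_up(cells, dim, parent):
    """Aggregate a cuboid one level up along dimension `dim`.

    cells maps a key tuple to a summable measure; parent maps a value of
    that dimension to its ancestor at the next level.
    """
    out = defaultdict(float)
    for key, measure in cells.items():
        new_key = list(key)
        new_key[dim] = parent(key[dim])
        out[tuple(new_key)] += measure
    return dict(out)


def materialize_path(m_layer, path):
    """Materialize only the cuboids on one path from the m-layer upward.

    path is a list of (dimension index, parent function) steps.
    """
    cuboids = [m_layer]
    for dim, parent in path:
        cuboids.append(roll_up(cuboids[-1], dim, parent))
    return cuboids


# Hypothetical m-layer cells: (user-group, URL-group, minute) -> click count
m_layer = {("g1", "sports/a", 0): 3, ("g1", "sports/b", 0): 2,
           ("g2", "news/a", 1): 5}
path = [(1, lambda url: url.split("/")[0]),   # URL-group -> theme
        (0, lambda group: "*")]               # user-group -> all users
for cuboid in materialize_path(m_layer, path):
    print(cuboid)
```

Cells off the path are not stored; in the approach described above they would be computed on demand from the nearest materialized cuboid.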

31
An H-Tree Cubing Structure
[Figure: an H-tree rooted at "root", with nodes sports, politics, entertainment; uiuc, uic; and jeff, jim, mary, each leaf carrying a Q.I. entry. Two levels are marked "Observation layer" and "Minimal int. layer"]
32
Benefits of H-Tree and H-Cubing

H-tree and H-cubing
Developed for computing data cubes and iceberg
cubes
J. Han, J. Pei, G. Dong, and K. Wang, "Efficient
Computation of Iceberg Cubes with Complex
Measures", SIGMOD'01
Compressed database
Fast cubing
Space preserving in cube computation
Using H-tree for stream cubing
Space preserving
Intermediate aggregates can be computed
incrementally and saved in tree nodes
Facilitate computing other cells and
multi-dimensional analysis
H-tree with computed cells can be viewed as
stream cube

33
Time and Space vs. Number of Tuples at the
m-Layer (Dataset D3L3C10T400K)
a) Time vs. m-layer size
b) Space vs. m-layer size
34
Time and Space vs. the Number of Levels
a) Time vs. levels
b) Space vs. levels
35
Other Approaches for Mining Unusual Patterns in
Stream Data

Beyond multi-dimensional regression analysis
Other approaches can be effective for mining
unusual patterns
Multi-dimensional gradient analysis of multiple
streams
Gradient analysis: finding substantial changes
(notable gradients) in relevance to other
dimensions
E.g., those stocks that increase over 10% in the
last hour
Clustering and outlier analysis for stream mining
Clustering data streams (Guha, Motwani et al.
2000-2002)
History-sensitive, high-quality incremental
clustering
Decision tree analysis of stream data
Evolution of decision trees: Domingos et al.
(2000, 2001)
Incremental integration of new streams in
decision-tree induction

36
What Is Gradient Analysis?

Gradient analysis: Analysis of notable changes
(gradients) of sophisticated measures in
multi-dimensional space
Changes in dimensions → changes in measures
Drill-down (descendants), roll-up (ancestors),
and mutation (siblings)
Query: Notable changes of average house price in
Champaign in '02 compared against '01
Answer: Townhouses in Southwest Champaign West
went down 5%, houses in Urbana went up 10%
Originated from CubeGrade problem
First proposed by Imielinski et al. (DAMI 2002)
as Cubegrade
Efficient pushing of constraints for complex
measures (such as avg) in constrained gradient
analysis by Dong et al. (VLDB 2001)

37
Multi-Dimensional Gradient Analysis of Multiple
Streams

Stream gradient analysis
Analysis of notable changes of sophisticated
measures in multi-dimensional space in relevance
to time for stream data
Changes in time → changes in measures (possibly
comparing with sibling streams)
Drill-down (descendants), roll-up (ancestors),
and mutation (siblings)
Query: Find exceptionally promising stocks in the
last hour
E.g., the tech sector goes down sharply but IBM
goes down only slightly
How to solve it in a stream environment?
Find surrounding average gradients, and then find
stocks whose gradients are substantially
different from average
Analysis should be performed in multi-dimensional
space
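
The "surrounding average" idea can be sketched in a single dimension as follows. The input is assumed to be the relative change of each stream over the window of interest (for example the last hour), grouped with its siblings; the function name `exceptional_streams`, the threshold, and the sample numbers are all hypothetical.

```python
def exceptional_streams(changes, threshold):
    """Return streams whose gradient deviates from the group average.

    changes: mapping from stream name to its relative change over the window.
    threshold: minimum absolute deviation from the average to be reported.
    The result maps each exceptional stream to its deviation.
    """
    average = sum(changes.values()) / len(changes)
    return {name: change - average
            for name, change in changes.items()
            if abs(change - average) > threshold}


# Made-up hourly changes for stocks in one sector.
tech_sector = {"IBM": -0.01, "A": -0.10, "B": -0.12, "C": -0.09}
print(exceptional_streams(tech_sector, threshold=0.05))
```

In a multi-dimensional setting the same comparison would be repeated for each group of sibling cells at each level of interest.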

38
Clustering for Stream Data Mining

What is cluster analysis?
Grouping a set of data objects into a set of
classes (clusters)
The intra-class similarity is high and the
inter-class similarity is low
Applications: Pattern recognition, spatial data
analysis, image processing, market research, Web
document and click stream analysis
Clustering: Another data reduction technique in
stream analysis
New requirements in stream data clustering
Generate overall high-quality clusters without
seeing the old data
High quality, efficient incremental clustering
algorithms
Analysis should take care of multi-dimensional
space

39
Major Clustering Approaches in Traditional
Cluster Analysis

Partitioning algorithms: Construct various
partitions and then evaluate them by some
criterion
E.g., k-means, k-medoids, etc.
Hierarchical algorithms: Create a hierarchical
decomposition of the set of data (or objects)
using some criterion
Often need to be integrated with other clustering
methods, e.g., BIRCH
Density-based: based on connectivity and density
functions
Finding clusters of arbitrary shapes, e.g.,
DBSCAN, OPTICS, etc.
Grid-based: based on a multiple-level granularity
structure
View space as grid structures, e.g., STING,
CLIQUE
Model-based: find the best fit of the model to
all the clusters
Good for conceptual clustering, e.g., COBWEB, SOM

40
The K-Means Clustering Process

Example
[Figure: k-means with K = 2 on points plotted over axes from 0 to 10. Steps shown: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign]
41
Clustering Data Streams

Only the most popular clustering algorithm,
k-means, is examined in stream clustering (Guha,
Motwani, et al. 2000-2002)
The K-Means Clustering Method (MacQueen'67): Each
cluster is represented by the center of the
cluster
Data stream with points from metric space
Find k centers in the stream such that the sum of
distances from data points to their closest
center is minimized.
Clustering data streams
Only the k centroids (representing the clustering
results) are retained when new data arrive
Only the new data set is used to perform
incremental clustering
Each retained centroid carries the weight of the
many previous points it represents
The error is bounded by continuous incremental
updates
The simple algorithm yields constant factor
approximation

42
An Incremental Clustering Method

Only the k centroids (representing the clustering
results) are retained when new data arrive
Only the new data set is used to perform
incremental clustering
Each retained centroid carries the weight of the
many previous points it represents
The Incremental algorithm (GMM'01)
Assign each object to the cluster with the
nearest seed point
Compute new seed points as the centroids of the
clusters of the current partition
Repeat the two steps above until no change; the
clustering is then given by the set of k new
centroids
The error is bounded by continuous incremental
updates
The simple algorithm yields constant factor
approximation
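
The idea of keeping only k weighted centroids between batches can be sketched as below. This is a simplified illustration of the slide's description, not the published algorithm with its approximation guarantee: it runs a weighted k-means in which previous centroids re-enter each round as single points whose weight equals the number of points they summarize. The names `nearest`, `weighted_kmeans`, and `cluster_stream` are hypothetical.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2
                                 for a, b in zip(point, centers[j])))


def weighted_kmeans(points, weights, centers, max_iter=50):
    """Weighted k-means; returns (centers, total weight per center)."""
    centers = [list(c) for c in centers]
    mass = [0.0] * len(centers)
    for _ in range(max_iter):
        sums = [[0.0] * len(c) for c in centers]
        mass = [0.0] * len(centers)
        for p, w in zip(points, weights):
            j = nearest(p, centers)
            mass[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # An empty cluster keeps its previous center.
        new = [[s / m for s in row] if m else c
               for row, m, c in zip(sums, mass, centers)]
        if new == centers:
            break
        centers = new
    return centers, mass


def cluster_stream(chunks, init_centers):
    """Process the stream chunk by chunk, retaining only weighted centroids."""
    centers = [list(c) for c in init_centers]
    mass = [0.0] * len(centers)
    for chunk in chunks:
        old = [(c, m) for c, m in zip(centers, mass) if m > 0]
        points = [c for c, _ in old] + [list(p) for p in chunk]
        weights = [m for _, m in old] + [1.0] * len(chunk)
        centers, mass = weighted_kmeans(points, weights, centers)
    return centers, mass


chunks = [[(0, 0), (0, 1), (10, 10)], [(1, 0), (10, 11), (11, 10)]]
print(cluster_stream(chunks, init_centers=[(0, 0), (10, 10)]))
```

Because old points are represented only by their centroid, later reassignments cannot split a previous cluster, which is one source of the information loss raised on the next slide.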

43
Research Problems in Stream Clustering

Better quality but still efficient clustering
algorithms?
Simple k-means clustering by preserving only k
centroids may lose too much information
Keeping additional information may lead to better
clustering quality
Multi-dimensional clustering analysis?
Cluster not confined to 2-D metric space, how to
incorporate other features, especially
non-numerical properties
Finding outliers as a by-product of cluster
analysis?
Efficient detection of outliers (far away from
majority) in data streams
Weighted by history of the data?
Mining evolutions and changes of clusters?
Stream clustering with other clustering
approaches?
Constraint-based cluster analysis with data
streams?

44
Major Classification Methods

Popular classification methods
Decision tree induction
ID3, C4.5, Regression trees, decision lists, etc.
Bayesian classification
Neural networks
Support Vector Machines (SVM)
Associative classification
k-nearest neighbor classifier and case-based
reasoning
Genetic algorithms
Rough set and fuzzy set approaches
Most of these methods have not been re-examined in
the context of stream data

45
Decision Tree Analysis in Stream Data

What is decision-tree analysis?
Building a compact tree from data to guide
decision making
One of the most popular classification methods in
data mining
Applications: market analysis, Web document
classification, etc.
Decision-tree: Another data reduction technique
in stream analysis
New requirements in stream data decision-tree
analysis
Generate high-quality up-to-date decision-trees
without seeing the old data
High quality, efficient incremental decision-tree
induction
Analysis should take care of multi-dimensional
space

46
Classical Example: Play Tennis?

Training data set from Quinlans

47
Decision Tree Obtained with ID3 (Quinlan 86)
48
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive
divide-and-conquer manner
At start, all the training examples are at the
root
Attributes are categorical (if continuous-valued,
they are discretized in advance; C4.5 handles
continuous value splitting)
Examples are partitioned recursively based on
selected attributes
Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain, Gini index)
Conditions for stopping partitioning
All samples for a given node belong to the same
class
There are no remaining attributes for further
partitioning: majority voting is employed for
classifying the leaf
There are no samples left
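
The attribute-selection step of the greedy algorithm can be sketched with information gain, one of the measures named above. The rows used in the example are a tiny made-up data set, not Quinlan's training table, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attr, target):
    """Reduction in class entropy obtained by splitting rows on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attrs, target):
    """Pick the test attribute with the highest information gain."""
    return max(attrs, key=lambda a: information_gain(rows, a, target))


# Made-up categorical training examples.
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
]
print(best_attribute(rows, ["outlook", "windy"], "play"))  # outlook
```

A full induction routine would apply `best_attribute` recursively to each partition until one of the stopping conditions listed above holds.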

49
Decision Tree Induction with Stream Data

VFDT/CVFDT
P. Domingos and G. Hulten, "Mining high-speed
data streams", KDD'00
G. Hulten, L. Spencer, and P. Domingos, "Mining
time-changing data streams", KDD'01
VFDT (Very Fast Decision Tree) (Domingos and
Hulten'00)
With high probability, constructs a model
identical to the one a traditional (greedy)
method would learn
If new data cannot be inserted into the same
branch, construct shadow branches as preparation
for changes
If the shadow becomes dominant, a switch of tree
branches occurs
CVFDT: Extension to time-changing data

50
Decision Tree Induction with Stream Data

For each record in the stream
Traverse T to determine the appropriate leaf L for
the record
Update (attribute, class) counts in L and compute
the best split function $\phi(s_i, X_i, L)$ for each
attribute $X_i$
If there exists $i$ such that $\phi(s_i, X_i, L) - \phi(s_j, X_j, L) > \epsilon$
for all $X_j \neq X_i$ --- (1)
split L using attribute $X_i$
Compute the value of $\epsilon$ using the Hoeffding bound
Hoeffding bound: if $\phi(s, X, L)$ takes values in a
range of size R, and L contains m records, then
with probability $1-\delta$, the computed value of
$\phi(s, X, L)$ (using the m records in L) differs
from the true value by at most $\epsilon$
The Hoeffding bound guarantees that if (1) holds,
then $X_i$ is the correct choice for the split with
probability $1-\delta$
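
The split test can be sketched numerically. The standard form of the Hoeffding bound gives epsilon = sqrt(R^2 ln(1/delta) / (2m)) for a quantity with range R estimated from m records; the slide does not spell the formula out, so its use here follows the VFDT paper cited on the previous slide. The function names and the sample merit values are hypothetical.

```python
from math import log, sqrt


def hoeffding_epsilon(value_range, delta, m):
    """Error bound that holds with probability 1 - delta after m records."""
    return sqrt(value_range ** 2 * log(1.0 / delta) / (2.0 * m))


def attribute_to_split(merits, value_range, delta, m):
    """Return the best attribute if it beats the runner-up by more than epsilon.

    merits: mapping from attribute name to its observed split merit
    (for example information gain) at the leaf. Returns None when the
    leaf has not yet seen enough records to decide.
    """
    ranked = sorted(merits, key=merits.get, reverse=True)
    if len(ranked) < 2:
        return None
    best, second = ranked[0], ranked[1]
    epsilon = hoeffding_epsilon(value_range, delta, m)
    return best if merits[best] - merits[second] > epsilon else None


merits = {"Packets": 0.30, "Bytes": 0.15}   # made-up observed merits
print(attribute_to_split(merits, value_range=1.0, delta=1e-7, m=100))
print(attribute_to_split(merits, value_range=1.0, delta=1e-7, m=1000))
```

With these illustrative numbers the leaf defers the decision at 100 records and commits to a split at 1000 records, because epsilon shrinks as more records arrive.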

51
Single-Pass Algorithm (An Example)
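[Figure: a decision tree grown in a single pass over the data stream, shown in two stages. Stage 1: a root test on Packets against 10 with yes/no branches and a leaf Protocol http. The step between stages is annotated with the split test SP(Bytes) - SP(Packets). Stage 2: the same root, with a further test on Bytes against 60K and leaves Protocol http and Protocol ftp]
Ack.: From Gehrke's SIGMOD tutorial slides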
52
Research Problems in Stream Classification

What if the decision tree needs dramatic
restructuring?
Especially when new data is rather different from
the existing model
Efficient detection of outliers (far away from
majority) using constructed models
Weighted by history of the data: pay more
attention to new data?
Mining evolutions and changes of models?
Multi-dimensional decision tree analysis?
Stream classification with other classification
approaches?
Constraint-based classification with data streams?

53
Other Research Problems in Stream Data Mining

Stream data mining: should it be a general
approach or application-specific ones?
Do stream mining applications share common core
requirements and features?
Killer applications in stream data mining
General architectures and mining language
Multi-dimensional, multi-level stream data mining
Algorithms and applications
How will stream mining make good use of
user-specified constraints?
Stream association and correlation analysis
Measures approximation? Without seeing the
global picture?
How to mine changes of associations?

54
Conclusions

Stream data analysis: A rich and largely
unexplored field
Current research focus in database community
DSMS system architecture, continuous query
processing, supporting mechanisms
Stream data mining and stream OLAP analysis
Powerful tools for finding general and unusual
patterns
Largely unexplored: current studies only touched
the surface
Our recent study: A multi-dimensional stream
analysis framework
Tilt time frame
Critical layers
Popular path approach (how to do quick but high
quality partial materialization and computation)
Lots of exciting issues in further study
A promising one: Multi-level, multi-dimensional
analysis and mining of stream data

55
References

B. Babcock, S. Babu, M. Datar, R. Motwani, and
J. Widom, "Models and issues in data stream
systems", PODS'02 (tutorial).
S. Babu and J. Widom, "Continuous queries over
data streams", SIGMOD Record, 30(3):109-120, 2001.
Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and
J. Wang, "Online analytical processing stream
data: Is it feasible?", DMKD'02.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang,
"Multi-dimensional regression analysis of
time-series data streams", VLDB'02.
P. Domingos and G. Hulten, "Mining high-speed
data streams", KDD'00.
M. Garofalakis, J. Gehrke, and R. Rastogi,
"Querying and mining data streams: You only get
one look", SIGMOD'02 (tutorial).
J. Gehrke, F. Korn, and D. Srivastava, "On
computing correlated aggregates over continuous
data streams", SIGMOD'01.
S. Guha, N. Mishra, R. Motwani, and L.
O'Callaghan, "Clustering data streams", FOCS'00.
G. Hulten, L. Spencer, and P. Domingos, "Mining
time-changing data streams", KDD'01.