Title: Mining Unusual Patterns in Data Streams: Methodologies and Research Problems
1Mining Unusual Patterns in Data Streams
Methodologies and Research Problems
- Jiawei Han
- Department of Computer Science
- University of Illinois at Urbana-Champaign
- www.cs.uiuc.edu/hanj
2Outline
- Characteristics of stream data
- Architecture and models for SDMS and stream query processing
- Why mining unusual patterns in stream data?
- Essentials for mining unusual patterns in stream data
- Stream cubing and stream OLAP methods
- Stream mining methods
- Research problems
- Conclusions
3Characteristics of Data Streams
- Data streams
- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics
- Huge volumes of continuous data, possibly infinite
- Fast changing and requires fast, real-time response
- Data streams capture nicely our data processing needs of today
- Random access is expensive: single linear scan algorithm (can only have one look)
- Store only the summary of the data seen thus far
- Most stream data are at a pretty low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing
4Stream Data Applications
- Telecommunication: calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering and industrial processes: power supply and manufacturing
- Sensor, monitoring and surveillance: video streams
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even saved, but random access is too expensive)
5DBMS versus DSMS
DBMS
- Persistent relations
- One-time queries
- Random access
- Unbounded disk store
- Only current state matters
- No real-time services
- Relatively low update rate
- Data at any granularity
- Assume precise data
- Access plan determined by query processor, physical DB design
DSMS
- Transient streams
- Continuous queries
- Sequential access
- Bounded main memory
- Historical data is important
- Real-time requirements
- Possibly multi-GB arrival rate
- Data at fine granularity
- Data stale/imprecise
- Unpredictable/variable data arrival and characteristics
Ack.: from Motwani's PODS tutorial slides
6Architecture: Stream Query Processing
[Figure: multiple streams enter the SDMS (Stream Data Management System), whose Stream Query Processor works with a scratch space (main memory and/or disk) and delivers results to the user/application]
7Challenges of Stream Data Processing
- Multiple, continuous, rapid, time-varying, ordered streams
- Main memory computations
- Queries are often continuous
- Evaluated continuously as stream data arrives
- Answer updated over time
- Queries are often complex
- Beyond element-at-a-time processing
- Beyond stream-at-a-time processing
- Beyond relational queries (scientific, data mining, OLAP)
- Multi-level/multi-dimensional processing and data mining
- Most stream data are at a pretty low level or multi-dimensional in nature
8Processing Stream Queries
- Query types
- One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
- Predefined query vs. ad-hoc query (issued on-line)
- Unbounded memory requirements
- For real-time response, main memory algorithms should be used
- Memory requirement is unbounded if one will join future tuples
- Approximate query answering
- With bounded memory, it is not always possible to produce exact answers
- High-quality approximate answers are desired
- Data reduction and synopsis construction methods
- Sketches, random sampling, histograms, wavelets, etc.
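Random sampling is one of the synopsis methods listed above. A minimal sketch of one well-known way to do it under bounded memory is reservoir sampling; the slides do not name a specific sampling algorithm, so the choice of technique and the function name `reservoir_sample` are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of at most k items from a stream.

    The stream is read once, in order, and only k items are held in
    memory, which matches the single-scan, bounded-memory setting.
    (Hypothetical helper; not taken from the slides.)
    """
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the n-th item with probability k / n.
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5, seed=42))
```

The sample can then stand in for the full stream when answering approximate aggregate queries, with the usual caveat from the next slide that sampling is not well suited to joins.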
9Methods for Approximate Query Answering
- Sliding windows
- Only over sliding windows of recent stream data
- Approximation but often more desirable in applications
- Batched processing, sampling and synopses
- Batched if update is fast but computing is slow
- Compute periodically, not very timely
- Sampling if update is slow but computing is fast
- Compute using sample data, but not good for joins, etc.
- Synopsis data structures
- Maintain a small synopsis or sketch of data
- Good for querying historical data
- Blocking operators, e.g., sorting, avg, min, etc.
- Blocking if unable to produce the first output until seeing the entire input
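The sliding-window idea can be sketched for a single aggregate. The example below keeps a running average over the most recent `size` items, so `avg`, which is blocking over an unbounded stream, produces an answer after every arrival. The class name and its interface are hypothetical.

```python
from collections import deque


class SlidingWindowMean:
    """Running mean over the most recent `size` stream items."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def add(self, x):
        # Add the new item and evict the oldest one once the window is full.
        self.window.append(x)
        self.total += x
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)


if __name__ == "__main__":
    w = SlidingWindowMean(3)
    print([w.add(x) for x in [1, 2, 3, 4, 5]])
```

Memory use is bounded by the window size regardless of how long the stream runs.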
10Projects on DSMS (Data Stream Management System)
- Research projects and system prototypes
- STREAM (Stanford): a general-purpose DSMS
- Cougar (Cornell): sensors
- Aurora (Brown/MIT): sensor monitoring, dataflow
- Hancock (AT&T): telecom streams
- Niagara (OGI/Wisconsin): Internet XML databases
- OpenCQ (Georgia Tech): triggers, incremental view maintenance
- Tapestry (Xerox): pub/sub content-based filtering
- Telegraph (Berkeley): adaptive engine for sensors
- Tradebot (www.tradebot.com): stock tickers and streams
- Tribeca (Bellcore): network monitoring
- Streaminer (UIUC): new project for stream data mining
11Stream Data Mining vs. Stream Querying
- Stream mining: a more challenging task
- It shares most of the difficulties with stream querying
- Patterns are hidden and more general than querying
- It may require exploratory analysis
- Not necessarily continuous queries
- Stream data mining tasks
- Multi-dimensional on-line analysis of streams
- Mining outliers and unusual patterns in stream data
- Clustering data streams
- Classification of stream data
12Challenges for Mining Unusual Patterns in Data
Streams
- Most stream data are at a pretty low level or multi-dimensional in nature: needs ML/MD processing
- Analysis requirements
- Multi-dimensional trends and unusual patterns
- Capturing important changes at multi-dimensions/levels
- Fast, real-time detection and response
- Comparing with data cube: similarity and differences
- Stream (data) cube or stream OLAP: is this feasible?
- Can we implement it efficiently?
13Multi-Dimensional Stream Analysis Examples
- Analysis of Web click streams
- Raw data at low levels: seconds, web page addresses, user IP addresses, ...
- Analysts want changes, trends, unusual patterns, at reasonable levels of detail
- E.g., "Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours."
- Analysis of power consumption streams
- Raw data: power consumption flow for every household, every minute
- Patterns one may find: average hourly power consumption of manufacturing companies in Chicago in the last 2 hours today is up 30% over that of the same day a week ago
14A Key Step: Stream Data Reduction
- Challenges of OLAPing stream data
- Raw data cannot be stored
- Simple aggregates are not powerful enough
- History shape and patterns at different levels are desirable: multi-dimensional regression analysis
- Proposal
- A scalable multi-dimensional stream data cube that can aggregate regression models of stream data efficiently without accessing the raw data
- Stream data compression
- Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis
15Basics of General Linear Regression
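- n tuples in one cell: $(\mathbf{x}_i, y_i)$, $i = 1..n$, where $y_i$ is the measure attribute to be analyzed
- For sample i, a vector of k user-defined predictors $\mathbf{u}_i$
- The linear regression model
- $E(y_i \mid \mathbf{u}_i) = \boldsymbol{\eta}^{T}\mathbf{u}_i$
- where $\boldsymbol{\eta}$ is a $k \times 1$ vector of regression parameters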
16Theory of General Linear Regression
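- Collect the n predictor vectors $\mathbf{u}_i^{T}$ as the rows of the $n \times k$ model matrix $U$
- The ordinary least squares (OLS) estimate $\hat{\boldsymbol{\eta}}$ of the parameter vector $\boldsymbol{\eta}$ is the argument that minimizes the residual sum of squares function $RSS(\boldsymbol{\eta}) = (\mathbf{y} - U\boldsymbol{\eta})^{T}(\mathbf{y} - U\boldsymbol{\eta})$
- Main theorem to determine the OLS regression parameters: $\hat{\boldsymbol{\eta}} = (U^{T}U)^{-1}U^{T}\mathbf{y}$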
17Linearly Compressed Representation (LCR)
- Stream data compression for multi-dimensional regression analysis
- Define, for i, j = 0, ..., k-1
- The linearly compressed representation (LCR) of one cell
- Size of LCR of one cell
- quadratic in k, independent of the number of tuples n in one cell
18Matrix Form of LCR
- LCR consists of and , where
-
- and
-
- where
-
- provides OLS regression parameters essential for regression analysis
- is an auxiliary matrix that facilitates aggregations of LCR in standard and regression dimensions in a data cube environment
-
- LCR only stores the upper triangle of
- Given LCR of m cells that differ in one standard dimension, what is the LCR of the cell aggregated in that dimension?
- for m base cells
- for an aggregated cell
- The lossless aggregation formula
20Stock Price Example: Aggregation in Standard Dimensions
- Simple linear regression on time series data
- Cells of two companies
- After aggregation
21Aggregation in Regression Dimensions
- Given LCR of m cells that differ in one regression dimension, what is the LCR of the cell aggregated in that dimension?
- for m base cells
- for the aggregated cell
- The lossless aggregation formula
22Stock Price Example: Aggregation in Time Dimension
- Cells of two adjacent time intervals
- After aggregation
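The aggregation formulas themselves are missing from the extracted slides. The sketch below shows how lossless aggregation can be derived from the OLS normal equations, assuming each cell is summarized by an estimate `eta` and the upper triangle `theta` of U^T U for the model y = eta0 + eta1 * t. In a standard dimension the cells share the same time ticks and their measures are summed, so the estimates add and `theta` is unchanged. In the time dimension the cells cover disjoint intervals, so the `theta` matrices add and the estimate solves the combined normal equations. All function names are hypothetical.

```python
def fit(ts, ys):
    """OLS summary (eta, theta) of one cell for y = eta0 + eta1 * t."""
    theta = (float(len(ts)), float(sum(ts)), float(sum(t * t for t in ts)))
    rhs = (float(sum(ys)), float(sum(t * y for t, y in zip(ts, ys))))
    return solve2(theta, rhs), theta


def solve2(theta, b):
    """Solve [[t00, t01], [t01, t11]] x = b in closed form."""
    t00, t01, t11 = theta
    det = t00 * t11 - t01 * t01
    return ((t11 * b[0] - t01 * b[1]) / det, (t00 * b[1] - t01 * b[0]) / det)


def matvec(theta, eta):
    """Multiply the symmetric matrix stored as theta by the vector eta."""
    t00, t01, t11 = theta
    return (t00 * eta[0] + t01 * eta[1], t01 * eta[0] + t11 * eta[1])


def aggregate_standard(cells):
    """Cells share the same time ticks; the measure is summed across cells."""
    theta = cells[0][1]
    eta = tuple(sum(c[0][i] for c in cells) for i in range(2))
    return eta, theta


def aggregate_time(cells):
    """Cells cover disjoint time intervals of the same series."""
    theta = tuple(sum(c[1][i] for c in cells) for i in range(3))
    rhs = [0.0, 0.0]
    for eta, th in cells:
        # theta * eta recovers U^T y of the cell without the raw data.
        v = matvec(th, eta)
        rhs[0] += v[0]
        rhs[1] += v[1]
    return solve2(theta, rhs), theta


if __name__ == "__main__":
    a = fit([0, 1, 2, 3], [1, 3, 5, 7])
    b = fit([0, 1, 2, 3], [2, 3, 4, 5])
    print(aggregate_standard([a, b]))
    first = fit([0, 1, 2, 3], [1, 2, 2, 4])
    second = fit([4, 5, 6, 7], [3, 5, 4, 6])
    print(aggregate_time([first, second]))
```

In both cases the aggregated summary equals the one obtained by fitting the combined raw data directly, which is what "lossless" means here.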
23Feasibility of Stream Regression Analysis
- Efficient storage and scalable (independent of the number of tuples in data cells)
- Lossless aggregation without accessing the raw data
- Fast aggregation: computationally efficient
- Regression models of data cells at all levels
- General results cover a large and the most popular class of regression models
- Including quadratic, polynomial, and nonlinear models
24A Stream Cube Architecture
- A tilt time frame
- Different time granularities
- second, minute, quarter, hour, day, week, ...
- Critical layers
- Minimum interest layer (m-layer)
- Observation layer (o-layer)
- User watches at o-layer and occasionally needs to drill down to m-layer
- Partial materialization of stream cubes
- Full materialization: too space and time consuming
- No materialization: slow response at query time
- Partial materialization: what do we mean by "partial"?
25A Tilt Time-Frame Model
[Figure: tilt time frame, with time spans labeled "Up to 7 days" and "Up to a year"]
26Benefits of Tilt Time-Frame Model
- Each cell stores the measures according to the tilt time frame
- Limited memory space: impossible to store the history in full scale
- Emphasis more on recent data
- Most applications emphasize recent data (sliding window)
- Natural partition on different time granularities
- Putting different weights on remote data
- Useful even for uniform weight
- Tilt time frame forms a new time dimension
- for mining changes and evolutions
- Essential for mining unusual patterns or outliers
- Finding those with dramatic changes
- E.g., exceptional stocks: not following the trends
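One way to picture a tilt time frame is as a chain of levels, each holding a bounded number of slots: recent data is kept at fine granularity, and a full level is folded into a single slot of the next coarser level. The sketch below assumes a summable measure and uses illustrative level names and capacities (4 quarters per hour, 24 hours per day); neither the capacities nor the class `TiltTimeFrame` come from the slides.

```python
class TiltTimeFrame:
    """Bounded-memory history with finer granularity for recent data."""

    def __init__(self, levels):
        # levels: list of (name, capacity) pairs, finest granularity first.
        self.levels = levels
        self.slots = [[] for _ in levels]

    def add(self, value, level=0):
        """Record one unit of the given level (by default the finest)."""
        self.slots[level].append(value)
        capacity = self.levels[level][1]
        has_coarser = level + 1 < len(self.levels)
        if has_coarser and len(self.slots[level]) == capacity:
            # The level is full: fold it into one slot of the coarser level.
            total = sum(self.slots[level])
            self.slots[level].clear()
            self.add(total, level + 1)
        elif not has_coarser and len(self.slots[level]) > capacity:
            # The coarsest level simply forgets its oldest slot.
            self.slots[level].pop(0)


if __name__ == "__main__":
    frame = TiltTimeFrame([("quarter", 4), ("hour", 24), ("day", 7)])
    for _ in range(10):
        frame.add(1)  # one click count per quarter hour
    print(frame.slots)
```

With these capacities at most 3 + 23 + 7 slots are kept per cell, however long the stream runs, while the most recent quarters stay individually visible.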
27Two Critical Layers in the Stream Cube
o-layer (observation): (*, theme, quarter)
m-layer (minimal interest): (user-group, URL-group, minute)
(primitive) stream data layer: (individual-user, URL, second)
28On-Line Materialization vs. On-Line Computation
- On-line materialization
- Materialization takes precious resources and time
- Only incremental materialization (with sliding window)
- Only materialize cuboids of the critical layers?
- Some intermediate cells that should be materialized
- Popular path approach vs. exception cell approach
- Materialize intermediate cells along the popular paths
- Exception cells: how to set up exception thresholds?
- Notice exceptions do not have monotonic behaviour
- Computation problem
- How to compute and store stream cubes efficiently?
- How to discover unusual cells between the critical layers?
29Stream Cube Structure from m-layer to o-layer
(A1, *, C1)
(A1, *, C2)
(A1, *, C2)
(A1, *, C2)
(A2, B1, C1)
(A1, B1, C2)
(A1, B2, C1)
(A2, *, C2)
(A2, B1, C2)
(A2, B2, C1)
(A1, B2, C2)
(A2, B2, C2)
30Stream Cube Computation
- Cube structure from m-layer to o-layer
- Three approaches
- All cuboids approach
- Materializing all cells (too much in both space and time)
- Exceptional cells approach
- Materializing only exceptional cells (saves space but not time to compute, and the definition of exception is not flexible)
- Popular path approach
- Computing and materializing cells only along a popular path
- Using H-tree structure to store computed cells (which form the stream cube: a selectively materialized cube)
31An H-Tree Cubing Structure
[Figure: H-tree with a root node; observation-layer nodes sports, politics, entertainment; intermediate nodes uiuc and uic; minimal-interest-layer nodes jeff, jim, mary, each carrying Q.I.]
32Benefits of H-Tree and H-Cubing
- H-tree and H-cubing
- Developed for computing data cubes and iceberg cubes
- J. Han, J. Pei, G. Dong, and K. Wang, "Efficient Computation of Iceberg Cubes with Complex Measures", SIGMOD'01
- Compressed database
- Fast cubing
- Space preserving in cube computation
- Using H-tree for stream cubing
- Space preserving
- Intermediate aggregates can be computed incrementally and saved in tree nodes
- Facilitate computing other cells and multi-dimensional analysis
- H-tree with computed cells can be viewed as stream cube
33Time and Space vs. Number of Tuples at the
m-Layer (Dataset D3L3C10T400K)
a) Time vs. m-layer size
b) Space vs. m-layer size
34Time and Space vs. the Number of Levels
a) Time vs. levels
b) Space vs. levels
35Other Approaches for Mining Unusual Patterns in
Stream Data
- Beyond multi-dimensional regression analysis
- Other approaches can be effective for mining unusual patterns
- Multi-dimensional gradient analysis of multiple streams
- Gradient analysis: finding substantial changes (notable gradients) in relevance to other dimensions
- E.g., those stocks that increase over 10% in the last hour
- Clustering and outlier analysis for stream mining
- Clustering data streams (Guha, Motwani et al. 2000-2002)
- History-sensitive, high-quality incremental clustering
- Decision tree analysis of stream data
- Evolution of decision trees: Domingos et al. (2000, 2001)
- Incremental integration of new streams in decision-tree induction
36What Is Gradient Analysis?
- Gradient analysis: analysis of notable changes (gradients) of sophisticated measures in multi-dimensional space
- Changes in dimensions → changes in measures
- Drill-down (descendants), roll-up (ancestors), and mutation (siblings)
- Query: notable changes of average house price in Champaign in '02 compared against '01
- Answer: townhouses in Southwest Champaign West went down 5%, houses in Urbana went up 10%
- Originated from the CubeGrade problem
- First proposed by Imielinski et al. (DAMI 2002) as Cubegrade
- Efficient pushing of constraints for complex measures (such as avg) in constrained gradient analysis by Dong et al. (VLDB 2001)
37Multi-Dimensional Gradient Analysis of Multiple
Streams
- Stream gradient analysis
- Analysis of notable changes of sophisticated measures in multi-dimensional space in relevance to time for stream data
- Changes in time → changes in measures (possibly comparing with sibling streams)
- Drill-down (descendants), roll-up (ancestors), and mutation (siblings)
- Query: find exceptionally promising stocks in the last hour
- E.g., Tech sector goes down sharply but IBM goes down only slightly
- How to solve it in a stream environment?
- Find surrounding average gradients, and then find stocks whose gradients are substantially different from average
- Analysis should be performed in multi-dimensional space
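The last step on this slide, comparing each stream's gradient with the surrounding average, can be sketched in a few lines. The example assumes that a gradient is the relative change of a measure between two time points and that the "surrounding" group is simply all sibling streams passed in; the function name, the threshold and the prices are made up for illustration.

```python
def notable_gradients(series, threshold):
    """Find streams whose relative change deviates from the group average.

    series maps a stream name to (old_value, new_value). Returns a pair
    (outliers, average), where outliers maps each flagged name to its
    gradient. (Hypothetical helper with a deliberately simple average.)
    """
    grads = {name: (new - old) / old for name, (old, new) in series.items()}
    average = sum(grads.values()) / len(grads)
    outliers = {
        name: g for name, g in grads.items() if abs(g - average) > threshold
    }
    return outliers, average


if __name__ == "__main__":
    tech_sector = {
        "A": (100.0, 90.0),
        "B": (50.0, 45.0),
        "C": (200.0, 180.0),
        "IBM": (100.0, 99.0),
    }
    print(notable_gradients(tech_sector, threshold=0.05))
```

Here the sector drops about 10% while IBM drops 1%, so only IBM is flagged. In a stream cube the same comparison would be repeated per cell and per time granularity rather than once over flat data.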
38Clustering for Stream Data Mining
- What is cluster analysis?
- Grouping a set of data objects into a set of classes (clusters)
- The intra-class similarity is high and the inter-class similarity is low
- Applications: pattern recognition, spatial data analysis, image processing, market research, Web document and click stream analysis
- Clustering: another data reduction technique in stream analysis
- New requirements in stream data clustering
- Generate overall high-quality clusters without seeing the old data
- High quality, efficient incremental clustering algorithms
- Analysis should take care of multi-dimensional space
39Major Clustering Approaches in Traditional
Cluster Analysis
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- E.g., k-means, k-medoids, etc.
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Often needs to integrate with other clustering methods, e.g., BIRCH
- Density-based: based on connectivity and density functions
- Finding clusters of arbitrary shapes, e.g., DBSCAN, OPTICS, etc.
- Grid-based: based on a multiple-level granularity structure
- View space as grid structures, e.g., STING, CLIQUE
- Model-based: find the best fit of the model to all the clusters
- Good for conceptual clustering, e.g., COBWEB, SOM
40The K-Means Clustering Process
[Figure: the k-means process with K = 2 on a 10 x 10 plot: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign]
41Clustering Data Streams
- Only the most popular clustering algorithm, k-means, is examined in stream clustering (Guha, Motwani, et al. 2000-2002)
- The k-means clustering method (MacQueen'67): each cluster is represented by the center of the cluster
- Data stream with points from a metric space
- Find k centers in the stream such that the sum of distances from data points to their closest center is minimized
- Clustering data streams
- Only the k centroids (representing the clustering results) are retained when new data comes
- Only use the new data set to perform incremental clustering
- The previous centroids carry the weights of the many previous points they represent
- The error is bounded by continuous incremental updates
- The simple algorithm yields a constant-factor approximation
42An Incremental Clustering Method
- Only the k centroids (representing the clustering results) are retained when new data comes
- Only use the new data set to perform incremental clustering
- The previous centroids carry the weights of the many previous points they represent
- The incremental algorithm (GMM01)
- Assign each object to the cluster with the nearest seed point
- Compute new seed points as the centroids of the clusters of the current partition
- Repeat the two steps above until no change; the clustering is formed by a set of k new centroids
- The error is bounded by continuous incremental updates
- The simple algorithm yields a constant-factor approximation
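The steps above can be sketched as a weighted k-means update in which the old data is represented only by its k centroids and their point counts. This is an illustrative reading of the slide, not the algorithm of the cited paper: the function names, the squared Euclidean distance, and the choice to start each update from the previous centroids are all assumptions.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def incremental_kmeans(centroids, weights, new_points, max_iter=100):
    """Update k centroids with a new batch without revisiting old data.

    centroids: list of tuples; weights[i] is the number of earlier points
    summarized by centroids[i]. Returns (new_centroids, new_weights).
    """
    # Old centroids join the new batch as weighted points.
    points = list(zip(centroids, weights)) + [(p, 1) for p in new_points]
    centers = list(centroids)
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        # Assign each weighted point to the nearest current center.
        groups = [[] for _ in centers]
        for p, w in points:
            j = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            groups[j].append((p, w))
        # Recompute each center as the weighted mean of its group.
        updated = []
        for j, group in enumerate(groups):
            if not group:
                updated.append(centers[j])
                continue
            total = sum(w for _, w in group)
            dim = len(centers[j])
            updated.append(
                tuple(sum(p[d] * w for p, w in group) / total for d in range(dim))
            )
        if updated == centers:
            break
        centers = updated
    return centers, [sum(w for _, w in group) for group in groups]


if __name__ == "__main__":
    old_centroids = [(0.0, 0.0), (10.0, 10.0)]
    old_weights = [4, 4]
    batch = [(1.0, 1.0), (9.0, 9.0)]
    print(incremental_kmeans(old_centroids, old_weights, batch))
```

Because each old centroid counts as many points as it summarizes, a small new batch shifts the centers only slightly, which is how history enters the result without being stored.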
43Research Problems in Stream Clustering
- Better quality but still efficient clustering algorithms?
- Simple k-means clustering by preserving only k centroids may lose too much information
- Keeping additional information may lead to better clustering quality
- Multi-dimensional clustering analysis?
- Clusters are not confined to 2-D metric space: how to incorporate other features, especially non-numerical properties?
- Finding outliers as a by-product of cluster analysis?
- Efficient detection of outliers (far away from the majority) in data streams
- Weighted by history of the data?
- Mining evolutions and changes of clusters?
- Stream clustering with other clustering approaches?
- Constraint-based cluster analysis with data streams?
44Major Classification Methods
- Popular classification methods
- Decision tree induction
- ID3, C4.5, Regression trees, decision lists, etc.
- Bayesian classification
- Neural networks
- Support Vector Machines (SVM)
- Associative classification
- k-nearest neighbor classifier and case-based reasoning
- Genetic algorithms
- Rough set and fuzzy set approaches
- Most of these methods have not been re-examined in the context of stream data
45Decision Tree Analysis in Stream Data
- What is decision-tree analysis?
- Building a compact tree from data to guide decision making
- One of the most popular classification methods in data mining
- Applications: market analysis, Web document classification, etc.
- Decision tree: another data reduction technique in stream analysis
- New requirements in stream data decision-tree analysis
- Generate high-quality, up-to-date decision trees without seeing the old data
- High quality, efficient incremental decision-tree induction
- Analysis should take care of multi-dimensional space
46Classical Example: Play Tennis?
- Training data set from Quinlan's
47Decision Tree Obtained with ID3 (Quinlan'86)
48Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance; C4.5 handles continuous-value splitting)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain, Gini index)
- Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning: majority voting is employed for classifying the leaf
- There are no samples left
49Decision Tree Induction with Stream Data
- VFDT/CVFDT
- P. Domingos and G. Hulten, "Mining high-speed data streams", KDD'00
- G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams", KDD'01
- VFDT (Very Fast Decision Tree) (Domingos and Hulten'00)
- With high probability, constructs a model identical to the one a traditional (greedy) method would learn
- If it cannot be inserted into the same branch, construct shadow branches as preparation for changes
- If the shadow becomes dominant, a switch of tree branches occurs
- CVFDT: extension to time-changing data
50Decision Tree Induction with Stream Data
- For each record in the stream
- Traverse T to determine the appropriate leaf L for the record
- Update (attribute, class) counts in L and compute the best split function $\Phi(s_i, X_i, L)$ for each attribute $X_i$
- If there exists an attribute X with split s such that $\Phi(s, X, L) - \Phi(s_i, X_i, L) > \epsilon$ for all $X_i \neq X$ --- (1)
- split L using attribute X
- Compute the value for $\epsilon$ using the Hoeffding bound
- Hoeffding bound: if $\Phi(s, X, L)$ takes values in a range R, and L contains m records, then with probability $1 - \delta$, the computed value of $\Phi(s, X, L)$ (using m records in L) differs from the true value by at most $\epsilon$
- The Hoeffding bound guarantees that if (1) holds, then X is the correct choice for the split with probability $1 - \delta$
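The split test can be made concrete with the usual form of the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2m)), which is the form used in the VFDT paper cited on the previous slide. The sketch below only covers the decision of whether a leaf has seen enough records to split; the function names and the example numbers are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, m):
    """Epsilon such that, with probability 1 - delta, the mean of m
    observations in a range of width value_range is within epsilon of
    the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * m))


def should_split(best_score, second_score, value_range, delta, m):
    """Condition (1): the best attribute leads the runner-up by more
    than epsilon, so it is safe to split on it now."""
    return (best_score - second_score) > hoeffding_bound(value_range, delta, m)


if __name__ == "__main__":
    for m in (100, 1000, 10000):
        print(m, round(hoeffding_bound(1.0, 1e-7, m), 4))
    print(should_split(0.30, 0.15, 1.0, 1e-7, 1000))
    print(should_split(0.30, 0.25, 1.0, 1e-7, 1000))
```

Epsilon shrinks as the leaf accumulates records, so a leaf whose top two attributes are close simply waits for more data before committing to a split.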
51Single-Pass Algorithm (An Example)
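[Figure: a decision tree grown in a single pass over the data stream, shown before and after a split; node labels "Packets 10", "Bytes 60K", "Protocol http", "Protocol ftp" with yes/no branches; split condition label "SP(Bytes) - SP(Packets)"]
Ack.: from Gehrke's SIGMOD tutorial slides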
52Research Problems in Stream Classification
- What if the decision tree needs dramatic restructuring?
- Especially when new data is rather different from the existing model
- Efficient detection of outliers (far away from the majority) using constructed models
- Weighted by history of the data: pay more attention to new data?
- Mining evolutions and changes of models?
- Multi-dimensional decision tree analysis?
- Stream classification with other classification approaches?
- Constraint-based classification with data streams?
53Other Research Problems in Stream Data Mining
- Stream data mining: should it be a general approach or application-specific ones?
- Do stream mining applications share common core requirements and features?
- Killer applications in stream data mining
- General architectures and mining language
- Multi-dimensional, multi-level stream data mining
- Algorithms and applications
- How will stream mining make good use of user-specified constraints?
- Stream association and correlation analysis
- Measures: approximation? Without seeing the global picture?
- How to mine changes of associations?
54Conclusions
- Stream data analysis: a rich and largely unexplored field
- Current research focus in database community: DSMS system architecture, continuous query processing, supporting mechanisms
- Stream data mining and stream OLAP analysis
- Powerful tools for finding general and unusual patterns
- Largely unexplored: current studies only touched the surface
- Our recent study: a multi-dimensional stream analysis framework
- Tilt time frame
- Critical layers
- Popular path approach (how to do quick but high quality partial materialization and computation)
- Lots of exciting issues in further study
- A promising one: multi-level, multi-dimensional analysis and mining of stream data
55References
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and issues in data stream systems", PODS'02 (tutorial).
- S. Babu and J. Widom, "Continuous queries over data streams", SIGMOD Record, 30:109-120, 2001.
- Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang, "Online analytical processing stream data: Is it feasible?", DMKD'02.
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, "Multi-dimensional regression analysis of time-series data streams", VLDB'02.
- P. Domingos and G. Hulten, "Mining high-speed data streams", KDD'00.
- M. Garofalakis, J. Gehrke, and R. Rastogi, "Querying and mining data streams: You only get one look", SIGMOD'02 (tutorial).
- J. Gehrke, F. Korn, and D. Srivastava, "On computing correlated aggregates over continuous data streams", SIGMOD'01.
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams", FOCS'00.
- G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams", KDD'01.
56www.cs.uiuc.edu/hanj