Mining Unusual Patterns in Data Streams in Multi-Dimensional Space

1
Mining Unusual Patterns in Data Streams in
Multi-Dimensional Space
  • Jiawei Han
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign
  • www.cs.uiuc.edu/hanj

2
Outline
  • Characteristics of data streams
  • Mining unusual patterns in data streams
  • Multi-dimensional regression analysis of data
    streams
  • Stream cubing and stream OLAP methods
  • Mining other kinds of patterns in data streams
  • Research problems

3
Data Streams
  • Data Streams
  • Data streams: continuous, ordered, changing, fast,
    huge amount
  • Traditional DBMS: data stored in finite,
    persistent data sets
  • Characteristics
  • Huge volumes of continuous data, possibly
    infinite
  • Fast changing and requires fast, real-time
    response
  • Data stream captures nicely our data processing
    needs of today
  • Random access is expensive: single linear scan
    algorithm (can only have one look)
  • Store only the summary of the data seen thus far
  • Most stream data are at a fairly low level or
    multi-dimensional in nature, and need multi-level
    and multi-dimensional processing
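The "one look, keep only a summary" constraint can be illustrated with a small Python sketch. It assumes a numeric stream and uses Welford's running mean and variance update; the class name `RunningSummary` and the sample readings are made up for illustration and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Constant-size summary of a numeric stream (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        # Each value is seen exactly once and is not stored.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self.m2 / self.count if self.count else 0.0


summary = RunningSummary()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up readings
    summary.update(reading)
```

The summary holds three numbers regardless of how many values have passed, which is the property the later slides rely on when they replace raw series with regression parameters.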

4
Stream Data Applications
  • Telecommunication calling records
  • Business: credit card transaction flows
  • Network monitoring and traffic engineering
  • Financial market: stock exchange
  • Engineering and industrial processes: power supply and
    manufacturing
  • Sensor, monitoring and surveillance: video streams
  • Security monitoring
  • Web logs and Web page click streams
  • Massive data sets (even if saved, random access
    is too expensive)

5
Challenges of Stream Data Mining
  • Multiple, continuous, rapid, time-varying,
    ordered streams
  • Main memory computation
  • Mining queries are either continuous or ad-hoc
  • Mining queries are often complex
  • Involving multiple streams, large amount of data,
    and history
  • Finding patterns, models, anomalies, differences, ...
  • Mining dynamics (changes, trends and evolutions)
    of data streams
  • Multi-level/multi-dimensional processing and data
    mining
  • Most stream data are at a fairly low level or
    multi-dimensional in nature

6
Stream Data Mining Tasks
  • Multi-dimensional (on-line) analysis of streams
  • Clustering data streams
  • Classification of data streams
  • Mining frequent patterns in data streams
  • Mining sequential patterns in data streams
  • Mining partial periodicity in data streams
  • Mining notable gradients in data streams
  • Mining outliers and unusual patterns in data
    streams

7
Multi-Dimensional Stream Analysis Examples
  • Analysis of Web click streams
  • Raw data at low levels: seconds, web page
    addresses, user IP addresses, ...
  • Analysts want: changes, trends, unusual patterns,
    at reasonable levels of detail
  • E.g., average clicking traffic in North America
    on sports in the last 15 minutes is 40% higher
    than that in the last 24 hours.
  • Analysis of power consumption streams
  • Raw data: power consumption flow for every
    household, every minute
  • Patterns one may find: average hourly power
    consumption for manufacturing companies in
    Chicago over the last 2 hours today is 30% higher
    than on the same day a week ago
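A comparison such as "the last 15 minutes versus the last 24 hours" can be sketched in Python with two sliding windows over per-minute counts. This is only an illustration: `WindowAverage`, `relative_change`, and the traffic numbers are hypothetical, and the sketch keeps every per-minute value inside the window in memory, which the later slides on the tilted time frame are designed to avoid.

```python
from collections import deque


class WindowAverage:
    """Average of the values whose timestamp lies in the last `span` units."""

    def __init__(self, span: float) -> None:
        self.span = span
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, ts: float, value: float) -> None:
        self.events.append((ts, value))
        self.total += value
        # Evict everything that has fallen out of the window.
        while self.events and self.events[0][0] <= ts - self.span:
            _, old = self.events.popleft()
            self.total -= old

    def average(self) -> float:
        return self.total / len(self.events) if self.events else 0.0


def relative_change(recent: WindowAverage, baseline: WindowAverage) -> float:
    """Fractional difference of the recent average over the baseline average."""
    return recent.average() / baseline.average() - 1.0


short, long_ = WindowAverage(15), WindowAverage(24 * 60)  # spans in minutes
for minute in range(24 * 60):
    clicks = 140.0 if minute >= 24 * 60 - 15 else 100.0  # made-up traffic
    short.add(minute, clicks)
    long_.add(minute, clicks)
change = relative_change(short, long_)  # a little under 0.40 for this data
```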

8
A Key Step: Stream Data Reduction
  • Challenges of OLAPing stream data
  • Raw data cannot be stored
  • Simple aggregates are not powerful enough
  • History shape and patterns at different levels
    are desirable: multi-dimensional regression
    analysis
  • Proposal
  • A scalable multi-dimensional stream data cube
    that can aggregate regression models of stream
    data efficiently without accessing the raw data
  • Stream data compression
  • Compress the stream data to support memory- and
    time-efficient multi-dimensional regression
    analysis

9
Regression Cube for Time-Series
  • Initially, one time-series per base cell
  • Too costly to store all these time-series
  • Too costly to compute regression in
    multi-dimensional space
  • Regression cube
  • Base cube: only store regression parameters of
    base cells (e.g., 2 points vs. 1000 points)
  • All the upper level cuboids can be computed
    precisely for linear regression on both standard
    dimensions and time dimensions
  • For quadratic regression, we need 5 points
  • In general, we need
  • where k = 2 for quadratic.

10
Basics of General Linear Regression
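  • n tuples in one cell: $(x_i, y_i)$, $i = 1..n$, where
    $y_i$ is the measure attribute to be analyzed
  • For sample $i$, a vector of $k$ user-defined
    predictors $u_i = (u_{i0}, u_{i1}, \ldots, u_{i,k-1})^T$
  • The linear regression model:
    $E(y_i \mid u_i) = \eta^T u_i = \eta_0 u_{i0} + \cdots + \eta_{k-1} u_{i,k-1}$
  • where $\eta$ is a $k \times 1$ vector of regression
    parameters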

11
Theory of General Linear Regression
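  • Collect the predictor vectors $u_i^T$ as rows into the
    $n \times k$ model matrix $U$, with $y = (y_1, \ldots, y_n)^T$
  • The ordinary least squares (OLS) estimate $\hat\eta$ of
    the parameter vector $\eta$ is the argument that minimizes
    the residual sum of squares function
    $RSS(\eta) = (y - U\eta)^T (y - U\eta)$
  • Main theorem to determine the OLS regression
    parameters: $\hat\eta = (U^T U)^{-1} U^T y$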
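The OLS estimate can be computed by solving the normal equations (U^T U) eta = U^T y. The following pure-Python sketch does so with a small Gaussian elimination; the helper names `solve` and `ols` and the sample data are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(U, y):
    """Return the eta minimizing ||y - U eta||^2 via (U^T U) eta = U^T y."""
    k = len(U[0])
    utu = [[sum(row[i] * row[j] for row in U) for j in range(k)] for i in range(k)]
    uty = [sum(row[i] * yi for row, yi in zip(U, y)) for i in range(k)]
    return solve(utu, uty)


# Predictors u_i = (1, t_i): an intercept plus a linear time trend.
U = [[1.0, float(t)] for t in range(5)]
y = [1.0 + 2.0 * t for t in range(5)]  # made-up data lying exactly on y = 1 + 2t
eta = ols(U, y)
```

Because the sample data lie exactly on a line, the recovered parameters are the intercept 1 and slope 2.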

12
Linearly Compressed Representation (LCR)
  • Stream data compression for multi-dimensional
    regression analysis
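  • Define, for $i, j = 0, \ldots, k-1$:
    $\theta_{ij} = \sum_{h=1}^{n} u_{hi} u_{hj}$
  • The linearly compressed representation (LCR) of
    one cell:
    $\{\hat\eta_i \mid i = 0, \ldots, k-1\} \cup \{\theta_{ij} \mid i, j = 0, \ldots, k-1,\ i \le j\}$
  • Size of LCR of one cell: $(k^2 + 3k)/2$,
    quadratic in $k$, independent of the number of
    tuples $n$ in one cell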

13
Matrix Form of LCR
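  • LCR consists of $\hat\eta$ and $\Theta$, where
    $\hat\eta = (\hat\eta_0, \ldots, \hat\eta_{k-1})^T$
  • and $\Theta = (\theta_{ij})$ is the $k \times k$ matrix $U^T U$,
  • where $\theta_{ij} = \theta_{ji}$
  • $\hat\eta$ provides OLS regression parameters essential for
    regression analysis
  • $\Theta$ is an auxiliary matrix that facilitates
    aggregations of LCR in standard and regression
    dimensions in a data cube environment
  • LCR only stores the upper triangle of $\Theta$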
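A minimal Python sketch of how such a compressed representation could be maintained in one pass, assuming simple linear regression on time with predictors u = (1, t), so k = 2. The class name `CellLCR` and the choice to accumulate U^T y as running sums, from which the parameters are solved on demand, are illustrative assumptions rather than the authors' implementation.

```python
class CellLCR:
    """Running LCR-style summary of one cell for predictors u = (1, t)."""

    def __init__(self) -> None:
        self.n = 0
        # Upper triangle of Theta = U^T U (Theta is symmetric).
        self.t00 = self.t01 = self.t11 = 0.0
        # Running U^T y, from which eta_hat is recovered on demand.
        self.z0 = self.z1 = 0.0

    def add(self, t: float, y: float) -> None:
        # Constant work per tuple; the tuple itself is not stored.
        self.n += 1
        self.t00 += 1.0
        self.t01 += t
        self.t11 += t * t
        self.z0 += y
        self.z1 += t * y

    def eta_hat(self):
        """Solve Theta eta = U^T y for the 2x2 case (Cramer's rule)."""
        det = self.t00 * self.t11 - self.t01 * self.t01
        b0 = (self.z0 * self.t11 - self.t01 * self.z1) / det
        b1 = (self.t00 * self.z1 - self.t01 * self.z0) / det
        return b0, b1

    def lcr(self):
        """The stored representation: eta_hat plus the upper triangle of Theta."""
        return self.eta_hat(), (self.t00, self.t01, self.t11)


cell = CellLCR()
for t in range(1000):
    cell.add(float(t), 3.0 + 0.5 * t)  # made-up series on the line y = 3 + 0.5t
eta, theta_upper = cell.lcr()
```

For k = 2 the representation holds 2 + 3 = 5 numbers, however many of the 1000 tuples were fed in.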

14
Aggregation in Standard Dimensions
  • Given LCR of m cells that differ in one standard
    dimension, what is the LCR of the cell aggregated
    in that dimension?
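  • $LCR_i = (\hat\eta_i, \Theta_i)$ of cell $i$, $i = 1..m$, for m base cells
  • $LCR_a = (\hat\eta_a, \Theta_a)$ for an aggregated cell
  • The lossless aggregation formula:
    $\hat\eta_a = \sum_{i=1}^{m} \hat\eta_i$, $\Theta_a = \Theta_i$ ($i = 1..m$)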
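For cells that share the same time points and whose measures are summed on roll-up, the parameter vectors add and the auxiliary matrix is unchanged. The Python sketch below checks this numerically for simple linear regression, under those assumptions; `fit`, `aggregate_standard`, and the two price series are made up for illustration.

```python
def fit(ts, ys):
    """Simple linear regression y ~ b0 + b1*t; returns (eta_hat, Theta)."""
    n = float(len(ts))
    st = sum(ts)
    stt = sum(t * t for t in ts)
    sy = sum(ys)
    sty = sum(t * y for t, y in zip(ts, ys))
    det = n * stt - st * st
    eta = ((sy * stt - st * sty) / det, (n * sty - st * sy) / det)
    theta = ((n, st), (st, stt))
    return eta, theta


def aggregate_standard(lcrs):
    """Merge cells that share the same time points: sum eta_hat, keep Theta."""
    etas = [eta for eta, _ in lcrs]
    theta = lcrs[0][1]
    eta_a = tuple(sum(component) for component in zip(*etas))
    return eta_a, theta


ts = [0.0, 1.0, 2.0, 3.0]
company_a = [10.0, 12.0, 13.0, 15.0]  # made-up prices
company_b = [20.0, 19.0, 21.0, 22.0]
merged = aggregate_standard([fit(ts, company_a), fit(ts, company_b)])
# Reference: fit the summed series directly from the raw data.
direct = fit(ts, [a + b for a, b in zip(company_a, company_b)])
```

`merged` is computed from the two compressed cells only, while `direct` touches the raw series; the two agree up to floating-point rounding.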

15
Stock Price Example: Aggregation in Standard
Dimensions
  • Simple linear regression on time series data
  • Cells of two companies
  • After aggregation

16
Aggregation in Regression Dimensions
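  • Given LCR of m cells that differ in one
    regression dimension, what is the LCR of the cell
    aggregated in that dimension?
  • $LCR_i = (\hat\eta_i, \Theta_i)$ of cell $i$, $i = 1..m$, for m base cells
  • $LCR_a = (\hat\eta_a, \Theta_a)$ for the
    aggregated cell
  • The lossless aggregation formula:
    $\hat\eta_a = \left(\sum_{i=1}^{m} \Theta_i\right)^{-1} \sum_{i=1}^{m} \Theta_i \hat\eta_i$,
    $\Theta_a = \sum_{i=1}^{m} \Theta_i$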
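When cells cover disjoint ranges of the regression dimension (for example adjacent time intervals), the auxiliary matrices add, and the aggregated parameters solve (sum of Theta_i) eta_a = sum of Theta_i eta_i, because Theta_i eta_i equals U_i^T y_i for each cell. The Python sketch below verifies this for simple linear regression; `fit`, `aggregate_regression`, and the data are hypothetical.

```python
def fit(ts, ys):
    """Simple linear regression y ~ b0 + b1*t; returns (eta_hat, Theta)."""
    n = float(len(ts))
    st = sum(ts)
    stt = sum(t * t for t in ts)
    sy = sum(ys)
    sty = sum(t * y for t, y in zip(ts, ys))
    det = n * stt - st * st
    eta = ((sy * stt - st * sty) / det, (n * sty - st * sy) / det)
    theta = ((n, st), (st, stt))
    return eta, theta


def aggregate_regression(lcrs):
    """Merge cells over disjoint time intervals without the raw data."""
    # Theta_a is the sum of the Theta_i.
    a = sum(th[0][0] for _, th in lcrs)
    b = sum(th[0][1] for _, th in lcrs)
    d = sum(th[1][1] for _, th in lcrs)
    # Right-hand side: sum of Theta_i * eta_i, which equals U_a^T y_a.
    r0 = sum(th[0][0] * e[0] + th[0][1] * e[1] for e, th in lcrs)
    r1 = sum(th[1][0] * e[0] + th[1][1] * e[1] for e, th in lcrs)
    det = a * d - b * b
    eta_a = ((r0 * d - b * r1) / det, (a * r1 - b * r0) / det)
    return eta_a, ((a, b), (b, d))


first = fit([0.0, 1.0, 2.0], [5.0, 6.0, 8.0])    # made-up prices, interval 1
second = fit([3.0, 4.0, 5.0], [9.0, 9.0, 12.0])  # adjacent interval 2
merged = aggregate_regression([first, second])
# Reference: fit the concatenated series directly from the raw data.
direct = fit([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], [5.0, 6.0, 8.0, 9.0, 9.0, 12.0])
```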

17
Stock Price Example: Aggregation in Time Dimension
  • Cells of two adjacent time intervals
  • After aggregation

18
Feasibility of Stream Regression Analysis
  • Efficient storage and scalable (independent of
    the number of tuples in data cells)
  • Lossless aggregation without accessing the raw
    data
  • Fast aggregation: computationally efficient
  • Regression models of data cells at all levels
  • General results: cover a large and the most
    popular class of regression
  • Including quadratic, polynomial, and nonlinear
    models

19
A Stream Cube Architecture
  • A tilted time frame
  • Different time granularities
  • second, minute, quarter, hour, day, week, ...
  • Critical layers
  • Minimum interest layer (m-layer)
  • Observation layer (o-layer)
  • User watches at o-layer and occasionally needs
    to drill down to m-layer
  • Partial materialization of stream cubes
  • Full materialization: too space and time
    consuming
  • No materialization: slow response at query time
  • Partial materialization: what do we mean by
    partial?

20
A Tilted Time-Frame Model
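[Figure: tilted time-frame models. Labels: "Up to 7 days", "Up to a year", and "Logarithmic (exponential) scale"; the time axis ends at "Now", with slots labelled in multiples of t (1t, 2t, 4t, 8t, 16t).]
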
21
Benefits of Tilted Time-Frame Model
  • Each cell stores the measures according to
    the tilted time frame
  • Limited memory space: impossible to store the
    history in full scale
  • Emphasis more on recent data
  • Most applications emphasize recent data (sliding
    window)
  • Natural partition on different time granularities
  • Putting different weights on remote data
  • Useful even for uniform weight
  • Tilted time-frame forms a new time dimension
  • for mining changes and evolutions
  • Essential for mining unusual patterns or outliers
  • Finding those with dramatic changes
  • E.g., exceptional stocks: not following the trends
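One common way to realize a logarithmic tilted time frame is to keep a few slots per granularity and merge the two oldest slots into the next coarser level whenever a level overflows. The Python sketch below does this for a summed measure. The class name, the `per_level` parameter, and the merge policy are assumptions made for illustration; the slides do not spell out an implementation.

```python
class LogTiltedTimeFrame:
    """Keeps sums over windows of 1, 2, 4, ... base time units, newest first."""

    def __init__(self, per_level: int = 2) -> None:
        self.per_level = per_level  # slots kept at each granularity (assumed)
        self.levels = []            # levels[i]: sums over 2**i units, newest first

    def add(self, value: float) -> None:
        carry = value
        level = 0
        while carry is not None:
            if level == len(self.levels):
                self.levels.append([])
            slots = self.levels[level]
            slots.insert(0, carry)
            carry = None
            if len(slots) > self.per_level:
                # Merge the two oldest slots into one coarser slot.
                older = slots.pop()
                newer = slots.pop()
                carry = older + newer
            level += 1

    def snapshot(self):
        """List of (window width in base units, sum), most recent first."""
        return [(2 ** i, s) for i, slots in enumerate(self.levels) for s in slots]


frame = LogTiltedTimeFrame()
for _ in range(1000):
    frame.add(1.0)  # one unit of the measure per base time unit
slots = frame.snapshot()
```

The number of slots grows with the logarithm of the stream length, and recent history is kept at the finest granularity while older history is progressively coarsened.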

22
Two Critical Layers in the Stream Cube
  • o-layer (observation): (*, theme, quarter)
  • m-layer (minimal interest): (user-group, URL-group, minute)
  • (primitive) stream data layer: (individual-user, URL, second)

23
On-Line Materialization vs. On-Line Computation
  • On-line materialization
  • Materialization takes precious resources and time
  • Only incremental materialization (with sliding
    window)
  • Only materialize cuboids of the critical
    layers?
  • Some intermediate cells that should be
    materialized
  • Popular path approach vs. exception cell approach
  • Materialize intermediate cells along the popular
    paths
  • Exception cells: how to set up exception
    thresholds?
  • Notice: exceptions do not have monotonic behaviour
  • Computation problem
  • How to compute and store stream cubes
    efficiently?
  • How to discover unusual cells between the
    critical layers?

24
Stream Cube Structure from m-layer to o-layer
[Figure: lattice of cuboids between the m-layer and the o-layer. Node labels: (A1, *, C1), (A1, *, C2), (A2, *, C2), (A2, B1, C1), (A1, B1, C2), (A1, B2, C1), (A2, B1, C2), (A2, B2, C1), (A1, B2, C2), (A2, B2, C2).]

25
Stream Cube Computation
  • Cube structure from m-layer to o-layer
  • Three approaches
  • All cuboids approach
  • Materializing all cells (too much in both space
    and time)
  • Exceptional cells approach
  • Materializing only exceptional cells (saves space
    but not computation time, and the definition of
    exception is not flexible)
  • Popular path approach
  • Computing and materializing cells only along a
    popular path
  • Using H-tree structure to store computed cells
    (which form the stream cube: a selectively
    materialized cube)

26
An H-Tree Cubing Structure
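[Figure: an H-tree. Below "root", the observation layer has nodes sports, politics, and entertainment; under these are nodes uiuc and uic; the minimal interest layer has nodes jeff, Jim, and mary, with Q.I. entries attached.]
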
27
Partial Materialization Using H-Tree
  • H-tree
  • Introduced for computing data cubes and iceberg
    cubes
  • J. Han, J. Pei, G. Dong, and K. Wang, Efficient
    Computation of Iceberg Cubes with Complex
    Measures, SIGMOD'01
  • Compressed database, fast cubing, and space
    preserving in cube computation
  • Using H-tree for partial stream cubing
  • Space preserving
  • Intermediate aggregates can be computed
    incrementally and saved in tree nodes
  • Facilitates computing other cells and
    multi-dimensional analysis
  • H-tree with computed cells can be viewed as
    a stream cube
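The idea of saving incrementally computed aggregates in tree nodes can be sketched in Python with a plain prefix tree over one fixed attribute order, standing in for a popular path. This is a simplification: the H-tree of the cited SIGMOD'01 paper has further machinery such as header tables, which is omitted here, and the names `Node`, `PathTree`, and the click counts are hypothetical.

```python
class Node:
    def __init__(self) -> None:
        self.children = {}
        self.total = 0.0  # aggregate of all tuples below this node


class PathTree:
    """Prefix tree over one fixed attribute order (the assumed popular path)."""

    def __init__(self) -> None:
        self.root = Node()

    def insert(self, path, measure: float) -> None:
        node = self.root
        node.total += measure
        for key in path:
            node = node.children.setdefault(key, Node())
            node.total += measure  # update intermediate aggregates incrementally

    def query(self, prefix) -> float:
        """Aggregate for the cell identified by a prefix of the path."""
        node = self.root
        for key in prefix:
            node = node.children.get(key)
            if node is None:
                return 0.0
        return node.total


tree = PathTree()
# (theme, user group, user) -> clicks; the values are made up for illustration.
tree.insert(("sports", "uiuc", "jeff"), 3)
tree.insert(("sports", "uic", "jim"), 2)
tree.insert(("politics", "uiuc", "mary"), 4)
tree.insert(("sports", "uiuc", "jeff"), 1)
```

A query for a coarser cell such as `tree.query(("sports",))` reads one stored node instead of rescanning the tuples, which is the benefit of materializing along the path.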

28
Time and Space vs. Number of Tuples at the
m-Layer (Dataset D3L3C10T400K)
a) Time vs. m-layer size
b) Space vs. m-layer size
29
Time and Space vs. the Number of Levels
a) Time vs. levels
b) Space vs. levels
30
Mining Other Unusual Patterns in Stream Data
  • Clustering and outlier analysis for stream mining
  • Clustering data streams (Guha, Motwani et al.
    2000-2002)
  • History-sensitive, high-quality incremental
    clustering
  • Classification of stream data
  • Evolution of decision trees: Domingos et al.
    (2000, 2001)
  • Incremental integration of new streams in
    decision-tree induction
  • Frequent pattern analysis
  • Approximate frequent patterns (Manku and Motwani,
    VLDB'02)
  • Evolution and dramatic changes of frequent
    patterns
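As an illustration of approximate frequent-pattern counting over a stream, here is a minimal Python sketch of lossy counting for single items, the bucket-based scheme associated with the Manku and Motwani reference. The class name `LossyCounter`, the parameter values, and the sample stream are assumptions; the slides themselves give no algorithmic detail.

```python
import math


class LossyCounter:
    """Approximate item counts with error at most epsilon * n."""

    def __init__(self, epsilon: float) -> None:
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max undercount]

    def add(self, item) -> None:
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # End of bucket: drop entries that cannot be frequent.
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] > bucket
            }

    def frequent(self, support: float):
        """Items whose stored count reaches (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k for k, (count, _) in self.entries.items() if count >= threshold}


counter = LossyCounter(epsilon=0.1)
for i in range(50):
    counter.add("a")      # one item that recurs throughout the stream
    counter.add(f"x{i}")  # fifty items that each occur once
result = counter.frequent(0.3)
```

Items that occur only once are pruned at each bucket boundary, so the table stays small while the recurring item keeps an exact count in this example.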

31
Conclusions
  • Stream data mining: a rich and largely unexplored
    field
  • Current research focus in database community
  • DSMS system architecture, continuous query
    processing, supporting mechanisms
  • Stream data mining and stream OLAP analysis
  • Powerful tools for finding general and unusual
    patterns
  • Effectiveness, efficiency and scalability: lots
    of open problems
  • Our philosophy
  • A multi-dimensional stream analysis framework
  • Time is a special dimension: tilted time frame
  • What to compute and what to save? Critical layers
  • Very partial materialization/precomputation:
    popular path approach
  • Mining dynamics of stream data

32
References
  • B. Babcock, S. Babu, M. Datar, R. Motwani, and
    J. Widom, Models and issues in data stream
    systems, PODS'02 (tutorial).
  • S. Babu and J. Widom, Continuous queries over
    data streams, SIGMOD Record, 30:109-120, 2001.
  • Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and
    J. Wang, Online analytical processing stream
    data: Is it feasible?, DMKD'02.
  • Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang,
    Multi-dimensional regression analysis of
    time-series data streams, VLDB'02.
  • P. Domingos and G. Hulten, Mining high-speed
    data streams, KDD'00.
  • M. Garofalakis, J. Gehrke, and R. Rastogi,
    Querying and mining data streams: You only get
    one look, SIGMOD'02 (tutorial).
  • J. Gehrke, F. Korn, and D. Srivastava, On
    computing correlated aggregates over continuous
    data streams, SIGMOD'01.
  • S. Guha, N. Mishra, R. Motwani, and L.
    O'Callaghan, Clustering data streams, FOCS'00.
  • G. Hulten, L. Spencer, and P. Domingos, Mining
    time-changing data streams, KDD'01.

33
www.cs.uiuc.edu/hanj
  • Thank you !!!