Extending DSMS for Data Stream Mining presentation

About This Presentation

Transcript and Presenter's Notes

Title: Extending DSMS for Data Stream Mining

1
Extending DSMS for Data Stream Mining

CS240B Notes
by
Carlo Zaniolo
UCLA CSD

2
Data Streams

Continuous, unbounded, rapid, time-varying
streams of data elements
Occur in a variety of modern applications
Network monitoring and traffic engineering
Sensor networks, RFID tags
Telecom call records
Financial applications
Web logs and click-streams
Manufacturing processes
DSMS Data Stream Management System

3
Many Research Projects

Amazon/Cougar (Cornell) sensors
Aurora (Brown/MIT) sensor monitoring, dataflow
Hancock (ATT) Telecom streams
Niagara (OGI/Wisconsin) Internet DBs XML
OpenCQ (Georgia) triggers, view maintenance
Stream (Stanford) general-purpose DSMS
Tapestry (Xerox) pubish/subscribe filtering
Telegraph (Berkeley) adaptive engine for
sensors
Gigascope ATT Labs Network Monitoring
Stream Mill (UCLA) - power extensibility

4
Technology Challenges

Data Models
Relational Streams--but XML streams important
too
Tuple Time-Stamping
Order is important
Windows
Query Languages Extensions of SQL or XQUERY
To support continuous (i.e., persistent) queries
on transient datareversal of roles.
Blocking operators excluded
Query Plans
New execution models (main memory oriented)
Optimized scheduling for response time or memory
Quality of Services (QoS) Approximation
Synopses
Sampling
Load shedding.

5
Commercial Developments

Several Startups
Streambase,
Coral8,
Apama, and
Truviso.
Oracle and DBMS companies
Publish/subscribe
Complex Event Processing (CEP)
Limitations only simple applicationse.g.
continuous queries expressed in SQL
No Support for Data Stream Mining queries.

6
Data Stream Mining

Many applications click stream analysis,
intrusion detection,...
Many fast light algorithms developed for stream
mining.
Ensembles, Moment, SWIM, etc.
Analyst should be able to focus on high-level
mining tasks.
Leaving QoS and lower-level issues to the
system.
Integration of mining methods into Data Stream
Management Systems (DSMS) is required
Many research challenges.
Stream Mill Miner (SMM) is the first DSMS
designed for that.

7
Data Stream Management Systems (DSMS)

Data stream mining applications so far ignored by
DSMS although
A. DSMS technology is required for data stream
mining
QoS, query scheduling, synopses, sampling,
windows, ...
B. But supporting DM applications is difficult
since current DSMS only support simple query
languages based on SQL.
Conclusion either a shotgun wedding ... or a
research breakthrough is needed here!

8
A Difficult Problem the Inductive DBMS Experience

Initial attempts to support mining queries in
relational DBMS Unsuccessful
OR-DBMS do not fare much better Sarawagi 98.
In 1996 the high-road approach by Imielinski
Mannila who called for a quantum leap in
functionality
High-level declarative languages for DM .
Extensions for query processing and
optimization.
The research area of Inductive DBMS was thus born
Inspired DMQL, Mine Rule, MSQL, etc.
Suffer from limited generality and performance
issues.

9
DBMS Vendors

Vendors have taken a low-road approach.
A library of mining functions using a
cache-mining approach
IBM DB2 Intelligent Miner
Oracle Data Miner
MS OLE DB for DM mining models
Closed systems,
Lacking in coverage and user-extensibility.
Not as popular as dedicated, stand-alone mining
systems, such as Weka

10
Weka

A comprehensive set of mining algorithms, and
tools.
Generic algorithms over arbitrary data sets.
Independent on the number of columns in tables.
Open and extensible system based on Java.
These are the features that we want in our
SMMstarting from SQL rather than Java!
Not an easy task ...why?

11
SMM Contributions

Build on Stream Mill DSMS and its SQL-based
continuous query language and enabling
technology.
Language and System Extensions
Genericity,
Extensibility, and
Performance
A suite of stream mining algorithms.
Existing ones and
Newly developed in this projecte.g., SWIM.
High level mining model for better
Usability
Control of mining process.

12
From SQL to Online Mining in SMMstep by step

Naïve Bayesian Classifier (NBC).
Important and frequently used.
Schema-specific NBC. Simple to express in SQL by
count, sum aggregates. But a generci NBC is still
preferable.
Genericity one function independent of number
columns involved.
Schema independence in SQL?

13
Genericity

Weka
Arrays of type real.
SMM
Verticalization.
Similar arrays, but in tables.
Built-in table function to reduce any table to
this form.
Thus, generic UDAs work with this schema.
And further improvements are also supported in
SMM

14
Extensibility?

Most mining tasks cannot be implemented in SQL.
Solution Define complex functions by User
Defined Aggregates (UDAs)
Complex mining tasks can be viewed as aggregates
UDAs Natively defined in SQL make the language
computationally complete Wang 04
Turing-complete over static data
Non-blocking complete over data streams
Natural extensions to support windows and delta
computations for data streams Bai 06
UDAs can be defined in a PL, for better
performance

15
Windowed UDA Example Continuous Count

WINDOW AGGREGATE sum(val REAL)REAL
TABLE state (tot real)
INITIALIZE
INSERT INTO state VALUES(val)
ITERATE
UPDATE state SET tot tot val
EXPIRE
UPDATE state SET tot tot oldest().val
/ No TERMINATE state /

16
Online Mining in SMM

UDAs Invoked with standard SQL2003 syntax of
OLAP functions.
SELECT learn(ts.Column, ts.Value, t.dec)
OVER (ROWS 1000 PRECEDING)
FROM trainingstream AS t,
TABLE (verticalize(Outlook, Temp, Humidity,
Wind)) AS ts
Powerful framework
Concept drifts-shifts
Association rule mining

17
The Slide Construct

A window can be divided into panes (called a
slide)
Tumbling windows when the size of the slide is
equal or larger than that of the window
The slide/window combination is great for data
stream mining.
Simple construct added to support slides in UDAs
Allowed us to build a flexible and efficient
library of data stream mining UDAs

18
SMM Contributions

Build on Stream Mill DSMS and its SQL-based
continuous query language and enabling
technology.
Language and System Extensions
Genericity,
Extensibility, and
Performance
A suite of stream mining algorithms.
Existing ones and
Newly developed in this projecte.g., SWIM.
High level mining model for better
Usability
Control of mining process.

19
Association Rule Mining

SWIM Mozafari 08 Maintaining frequent
patterns over large windows with slides.
Differentially computes frequent patterns as
slides enter (expire out of) the window.
Uses efficient Verifiers based on conditional
counting.
Trade-off between Delay and Performance
Performance gain over existing algorithms.

20
SWIM (Sliding Window Incremental Miner)

If pattern p is freq in a window, it must be freq
in at least one of its slides -- keep a union of
freq patterns of all slides (PT)

Expired
New

.
S4
S5
S6
S7
W4
W5
Mine
Mining Alg.
PT
Prune PT
PT F5 U F6 U F7
PT F4 U F5 U F6
21
Concept Drifts/ShiftsComplex Processes

Ensemble based methods.
Weighted bagging Wang 03, adaptive boosting
Chu 04, inductive transfer Forman 06.
Generic support, e.g. adaptive boosting (below).

22
Built-in Online Mining Algorithms In SMM

Online classifiers
Naïve Bayesian
Decision Tree
K-nearest Neighbor
Online clustering
DBScan Ester 96
IncDBScan
Windowed K-means
DenStream Cao 06
CluStream

Association rule mining
Approximate frequent items
SWIM Mozafari 08
Moment Chi 04
AFPIM
Time series/sequence queries
SQL-TS Sadri 01
Many more

Already supported

To be supported

23
SMM Contributions

Build on Stream Mill DSMS and its SQL-based
continuous query language and enabling
technology.
Language and System Extensions
Genericity,
Extensibility, and
Performance
A suite of stream mining algorithms.
Existing ones and
Newly developed in this projecte.g., SWIM.
High level mining model for better
Usability
Control of mining process.

24
Usability?

Complex SQL queries to invoke built-in and
user-defined mining algorithms.
An open and extensible system
Most analysts would prefer using high-level
mining language that
supports uniform invocation of built-in and
user-defined mining algorithms (no SQL required)
describes the workflow of the mining process
Is also open and extensible to incorporate newly
defined mining algorithms.

25
Example Defining a Mining Model

CREATE MODEL TYPE NaiveBayesianClassifier
SHAREDTABLES (DescriptorTbl),
Learn (UDA LearnNaiveBayesian,
WINDOW TRUE,
PARTABLES(), names of param tables required
by the method
PARAMETERS() additional parameters to be
specified for input
),
Classify (UDA ClassifyNaiveBayesian,
WINDOW TRUE,
PARTABLES(),
PARAMETERS()
)

26
Example Using a Mining Model

Creating an instance
CREATE MODEL INSTANCE NaiveBayesianInstance
AS NaiveBayesianClassifier
Uniform invocation of mining tasks
RUN NaiveBayesianInstance.Learn WITH TrainingSet

27
Performance

SMM Vs. Weka
NBC and decision tree classifier
Datasets UCI
Iris 5 attributes
Heart disease 13 attributes
Overhead of integrating algorithms into SMM
The SWIM algorithm standalone vs. integrated
Dataset IBM Quest
Trans len 20, Pattern len 5, Tuples 50K

28
Comparison with Weka NBC-Iris
29
Comparison with Weka NBC-HD
30
Comparison with Weka Decision Tree - Iris
31
Integration Overhead Integrated SWIM vs.
Standalone SWIM
32
The Stream Mill System

One server, multiple clients
Server (on Linux) hosts the ESL language and
manages storage and continuous queries
Client (Java based GUI) allows the user to
specify streams, queries, etc.

33
Conclusion

SMM integrates new solutions for several
difficult problems
Usability by high-level mining models
Extensibility by user-defined mining models that
call on UDAs with windows
Suite of built-in data stream mining UDAs
Generic mining UDAs by Verticalization other
techniques
Performance
SMM is the first of its kind more and better
systems will follow in its footsteps.

34
Future Work

Faster lighter mining algorithms
E.g. online algorithms for clustering
Integration of other mining algorithms
Data flow in mining models
Similar solution for databases

Thank you!

36
References

Arasu 04 Arvind Arasu and Jennifer Widom.
Resource sharing in continuous sliding-window
aggregates. In VLDB, pages 336347, 2004.
Babcock 02 B. Babcock, S. Babu, M. Datar, R.
Motawani, and J. Widom. Models and issues in data
stream systems. In PODS, 2002.
Bai 06 Yijian Bai, Hetal Thakkar, Chang Luo,
Haixun Wang, and Carlo Zaniolo. A data stream
language and system designed for power and
extensibility. In CIKM, pages 337346, 2006.
Cao 06 F Cao, M Ester, W Qian, and A Zhou,
Density-based Clustering over an Evolving Data
Stream with Noise, To appear in Proceedings of
SIAM 2006.
Chi 04 Y. Chi, H. Wang, P. S. Yu, and R. R.
Muntz. Moment Maintaining closed frequent
itemsets over a stream sliding window. In
Proceedings of the 2004 IEEE International
Conference on Data Mining (ICDM04), November
2004.
Chu 04 F. Chu and C. Zaniolo. Fast and light
boosting for adaptive mining of data streams. In
PAKDD, volume 3056, 2004.
Ester 96 Martin Ester, Hans-Peter Kriegel,
Jorg Sander, and Xiaowei Xu. A density-based
algorithm for discovering clusters in large
spatial databases with noise. In Second
International Conference on Knowledge Discovery
and Data Mining, pages 226231, 1996.
Forman 06 George Forman. Tackling concept
drift by temporal inductive transfer. In SIGIR,
pages 252259, 2006.

37
References

Imielinski 96 Tomasz Imielinski and Heikki
Mannila. A database perspective on knowledge
discovery. Commun. ACM, 39(11)5864, 1996.
Law 04 Yan-Nei Law, Haixun Wang, and Carlo
Zaniolo. Data models and query language for data
streams. In VLDB, pages 492503, 2004.
Mozafari 08 Barzan Mozafari, Hetal Thakkar,
and Carlo Zaniolo. Verifying and mining frequent
patterns from large windows over data streams. In
International Conference on Data Engineering
(ICDE), 2008.
Sadri 01 Reza Sadri, Carlo Zaniolo, Amir
Zarkesh, and Jafar Adibi. Optimization of
sequence queries in database systems. In PODS,
Santa Barbara, CA, May 2001.
Sarawagi 98 S. Sarawagi, S. Thomas, and R.
Agrawal. Integrating association rule mining with
relational database systems Alternatives and
implications. In SIGMOD, 1998.
UCI-MLR http//archive.ics.uci.edu/ml/datasets.h
tml
Wang 03 H. Wang, W. Fan, P. S. Yu, and J. Han.
Mining concept-drifting data streams using
ensemble classifiers. In SIGKDD, 2003.

Write a Comment

User Comments (0)

About PowerShow.com

Extending DSMS for Data Stream Mining PowerPoint PPT Presentation