Extending DSMS for Data Stream Mining

Slides: 38
Provided by: mir135
Learn more at: http://web.cs.ucla.edu

1
Extending DSMS for Data Stream Mining
  • CS240B Notes
  • by
  • Carlo Zaniolo
  • UCLA CSD

2
Data Streams
  • Continuous, unbounded, rapid, time-varying
    streams of data elements
  • Occur in a variety of modern applications
  • Network monitoring and traffic engineering
  • Sensor networks, RFID tags
  • Telecom call records
  • Financial applications
  • Web logs and click-streams
  • Manufacturing processes
  • DSMS: Data Stream Management System

3
Many Research Projects
  • Amazon/Cougar (Cornell): sensors
  • Aurora (Brown/MIT): sensor monitoring, dataflow
  • Hancock (AT&T): telecom streams
  • Niagara (OGI/Wisconsin): Internet DBs, XML
  • OpenCQ (Georgia): triggers, view maintenance
  • Stream (Stanford): general-purpose DSMS
  • Tapestry (Xerox): publish/subscribe filtering
  • Telegraph (Berkeley): adaptive engine for
    sensors
  • Gigascope (AT&T Labs): network monitoring
  • Stream Mill (UCLA): power and extensibility

4
Technology Challenges
  • Data Models
  • Relational streams, but XML streams are important
    too
  • Tuple time-stamping
  • Order is important
  • Windows
  • Query Languages: extensions of SQL or XQuery
  • To support continuous (i.e., persistent) queries
    on transient data: a reversal of roles.
  • Blocking operators excluded
  • Query Plans
  • New execution models (main-memory oriented)
  • Optimized scheduling for response time or memory
  • Quality of Service (QoS): approximation
  • Synopses
  • Sampling
  • Load shedding

5
Commercial Developments
  • Several Startups
  • Streambase,
  • Coral8,
  • Apama, and
  • Truviso.
  • Oracle and DBMS companies
  • Publish/subscribe
  • Complex Event Processing (CEP)
  • Limitations: only simple applications, e.g.,
    continuous queries expressed in SQL
  • No support for data stream mining queries.

6
Data Stream Mining
  • Many applications: click-stream analysis,
    intrusion detection, ...
  • Many fast and light algorithms have been developed
    for stream mining.
  • Ensembles, Moment, SWIM, etc.
  • Analyst should be able to focus on high-level
    mining tasks.
  • Leaving QoS and lower-level issues to the
    system.
  • Integration of mining methods into Data Stream
    Management Systems (DSMS) is required
  • Many research challenges.
  • Stream Mill Miner (SMM) is the first DSMS
    designed for that.

7
Data Stream Management Systems (DSMS)
  • Data stream mining applications have so far been
    ignored by DSMSs, although:
  • A. DSMS technology is required for data stream
    mining:
  • QoS, query scheduling, synopses, sampling,
    windows, ...
  • B. But supporting DM applications is difficult,
    since current DSMSs only support simple query
    languages based on SQL.
  • Conclusion: either a shotgun wedding ... or a
    research breakthrough is needed here!

8
A Difficult Problem: the Inductive DBMS Experience
  • Initial attempts to support mining queries in
    relational DBMSs were unsuccessful
  • OR-DBMSs do not fare much better [Sarawagi 98].
  • In 1996, the high-road approach was proposed by
    Imielinski and Mannila, who called for a quantum
    leap in functionality:
  • High-level declarative languages for DM.
  • Extensions for query processing and
    optimization.
  • The research area of Inductive DBMS was thus born
  • It inspired DMQL, Mine Rule, MSQL, etc.
  • These suffer from limited generality and
    performance issues.

9
DBMS Vendors
  • Vendors have taken a low-road approach.
  • A library of mining functions using a
    cache-mining approach
  • IBM DB2 Intelligent Miner
  • Oracle Data Miner
  • MS OLE DB for DM mining models
  • Closed systems,
  • Lacking in coverage and user-extensibility.
  • Not as popular as dedicated, stand-alone mining
    systems, such as Weka

10
Weka
  • A comprehensive set of mining algorithms, and
    tools.
  • Generic algorithms over arbitrary data sets.
  • Independent of the number of columns in tables.
  • Open and extensible system based on Java.
  • These are the features that we want in our
    SMM, starting from SQL rather than Java!
  • Not an easy task ...why?

11
SMM Contributions
  • Build on Stream Mill DSMS and its SQL-based
    continuous query language and enabling
    technology.
  • Language and System Extensions
  • Genericity,
  • Extensibility, and
  • Performance
  • A suite of stream mining algorithms.
  • Existing ones and
  • Newly developed in this project, e.g., SWIM.
  • High level mining model for better
  • Usability
  • Control of mining process.

12
From SQL to Online Mining in SMM: step by step
  • Naïve Bayesian Classifier (NBC).
  • Important and frequently used.
  • Schema-specific NBC: simple to express in SQL with
    count and sum aggregates. But a generic NBC is
    still preferable.
  • Genericity: one function, independent of the
    number of columns involved.
  • Schema independence in SQL?
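
The remark that a schema-specific NBC needs only count aggregates can be illustrated with a minimal Python sketch. It is not SMM code: the two-column schema (outlook, wind), the function names, and the add-one smoothing are assumptions made for illustration. Each counter plays the role of a COUNT(*) ... GROUP BY query, and the column names are hard-coded, which is exactly the limitation that genericity is meant to remove.

```python
from collections import Counter

def train_nbc(rows):
    """rows: (outlook, wind, label) tuples; the two columns are hard-coded."""
    class_counts = Counter()     # COUNT(*) GROUP BY label
    outlook_counts = Counter()   # COUNT(*) GROUP BY outlook, label
    wind_counts = Counter()      # COUNT(*) GROUP BY wind, label
    for outlook, wind, label in rows:
        class_counts[label] += 1
        outlook_counts[(outlook, label)] += 1
        wind_counts[(wind, label)] += 1
    return class_counts, outlook_counts, wind_counts

def classify(model, outlook, wind):
    """Return the label with the highest naive Bayes score."""
    class_counts, outlook_counts, wind_counts = model
    total = sum(class_counts.values())

    def score(label):
        n = class_counts[label]
        # Crude add-one smoothing (an assumption of this sketch) so that an
        # unseen value does not force the whole product to zero.
        p_outlook = (outlook_counts[(outlook, label)] + 1) / (n + 2)
        p_wind = (wind_counts[(wind, label)] + 1) / (n + 2)
        return (n / total) * p_outlook * p_wind

    return max(class_counts, key=score)

rows = [
    ("sunny", "weak", "no"),
    ("sunny", "strong", "no"),
    ("rain", "weak", "yes"),
    ("overcast", "weak", "yes"),
    ("rain", "weak", "yes"),
]
model = train_nbc(rows)
prediction = classify(model, "rain", "weak")
```

Adding a third attribute would require a new counter and a new factor in `score`, so the code has to change whenever the schema changes.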

13
Genericity
  • Weka
  • Arrays of type real.
  • SMM
  • Verticalization.
  • Similar arrays, but in tables.
  • Built-in table function to reduce any table to
    this form.
  • Thus, generic UDAs work with this schema.
  • And further improvements are also supported in
    SMM
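
A minimal Python sketch of the verticalization idea follows. The function names and the (column number, value) layout are assumptions for illustration; the slides do not specify the exact output schema of SMM's built-in table function. The point is that once every row is reduced to the same narrow form, one training routine serves tables of any width.

```python
from collections import Counter

def verticalize(row, columns):
    """Reduce one wide row (a dict) to (column_number, value) pairs."""
    return [(i, row[name]) for i, name in enumerate(columns, start=1)]

def learn(counts, row, columns, label):
    """Generic NBC training step: one count per (column, value, label)."""
    counts[("class", label)] += 1
    for col, value in verticalize(row, columns):
        counts[(col, value, label)] += 1
    return counts

# The same function handles a two-column and a four-column table.
narrow = learn(Counter(), {"Outlook": "sunny", "Wind": "weak"},
               ["Outlook", "Wind"], "no")
wide = learn(Counter(),
             {"Outlook": "sunny", "Temp": "hot",
              "Humidity": "high", "Wind": "weak"},
             ["Outlook", "Temp", "Humidity", "Wind"], "no")
```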

14
Extensibility?
  • Most mining tasks cannot be implemented in SQL.
  • Solution: define complex functions as
    User-Defined Aggregates (UDAs)
  • Complex mining tasks can be viewed as aggregates
  • UDAs natively defined in SQL make the language
    computationally complete [Wang 04]:
  • Turing-complete over static data
  • Non-blocking complete over data streams
  • Natural extensions to support windows and delta
    computations for data streams [Bai 06]
  • UDAs can also be defined in a PL, for better
    performance
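
To make the non-blocking notion concrete, here is a small Python sketch of an aggregate that returns a result after every input tuple instead of waiting for the end of the input. It mirrors the INITIALIZE/ITERATE structure of a UDA only loosely; the generator form and the name `online_avg` are assumptions of this sketch, not SMM syntax.

```python
def online_avg(stream):
    """Non-blocking aggregate: yields the running average after each tuple."""
    total, count = 0.0, 0        # INITIALIZE: empty state
    for value in stream:         # ITERATE: update the state, emit a result
        total += value
        count += 1
        yield total / count
    # No TERMINATE step: nothing waits for the end of an unbounded stream.

running = list(online_avg([2, 4, 6]))
```

A blocking version would return a single value only after the loop finishes, which never happens on an unbounded stream.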

15
Windowed UDA Example: Continuous Count
  • WINDOW AGGREGATE sum(val REAL): REAL
  • TABLE state (tot REAL)
  • INITIALIZE:
  • INSERT INTO state VALUES (val)
  • ITERATE:
  • UPDATE state SET tot = tot + val
  • EXPIRE:
  • UPDATE state SET tot = tot - oldest().val
  • /* No TERMINATE state */
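
The same delta computation can be sketched in Python, assuming a count-based window of fixed size. The class name and the explicit deque are illustrative; in the UDA the system manages the window and supplies `oldest()`.

```python
from collections import deque

class WindowedSum:
    """Sum over the last `size` values, maintained incrementally."""

    def __init__(self, size):
        self.size = size
        self.window = deque()   # tuples currently inside the window
        self.tot = 0.0          # the single-row `state` table

    def iterate(self, val):
        """ITERATE: add the new value, expire the oldest one if needed."""
        self.window.append(val)
        self.tot += val
        if len(self.window) > self.size:
            self.expire()
        return self.tot

    def expire(self):
        """EXPIRE: subtract the value that just left the window."""
        self.tot -= self.window.popleft()

w = WindowedSum(3)
sums = [w.iterate(v) for v in [1, 2, 3, 4, 5]]
```

Each new tuple costs one addition and at most one subtraction, regardless of the window size.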

16
Online Mining in SMM
  • UDAs are invoked with the standard SQL:2003
    syntax of OLAP functions.
  • SELECT learn(ts.Column, ts.Value, t.dec)
  • OVER (ROWS 1000 PRECEDING)
  • FROM trainingstream AS t,
  • TABLE (verticalize(Outlook, Temp, Humidity,
    Wind)) AS ts
  • Powerful framework
  • Concept drifts/shifts
  • Association rule mining

17
The Slide Construct
  • A window can be divided into panes (called
    slides)
  • Tumbling windows: when the size of the slide is
    equal to or larger than that of the window
  • The slide/window combination is great for data
    stream mining.
  • Simple construct added to support slides in UDAs
  • Allowed us to build a flexible and efficient
    library of data stream mining UDAs
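
A minimal Python sketch of the window/slide combination, assuming count-based windows whose size is a multiple of the slide size (the function name and that restriction are assumptions of this sketch). One partial sum is kept per slide, and a result is emitted once per slide rather than once per tuple.

```python
from collections import deque

def sliding_sums(stream, window, slide):
    """Emit the sum of the last `window` tuples once every `slide` tuples."""
    if window % slide != 0:
        raise ValueError("this sketch assumes window is a multiple of slide")
    panes = deque(maxlen=window // slide)   # one partial sum per slide
    current, n = 0, 0
    for value in stream:
        current += value
        n += 1
        if n == slide:
            panes.append(current)   # the oldest pane drops out automatically
            current, n = 0, 0
            # Early results cover fewer tuples until the window has filled.
            yield sum(panes)

overlapping = list(sliding_sums(range(1, 9), window=4, slide=2))
tumbling = list(sliding_sums(range(1, 9), window=2, slide=2))
```

With `slide` equal to `window`, only one pane is kept and the windows do not overlap, which is the tumbling case.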

18
SMM Contributions
  • Build on Stream Mill DSMS and its SQL-based
    continuous query language and enabling
    technology.
  • Language and System Extensions
  • Genericity,
  • Extensibility, and
  • Performance
  • A suite of stream mining algorithms.
  • Existing ones and
  • Newly developed in this project, e.g., SWIM.
  • High level mining model for better
  • Usability
  • Control of mining process.

19
Association Rule Mining
  • SWIM [Mozafari 08]: maintaining frequent
    patterns over large windows with slides.
  • Differentially computes frequent patterns as
    slides enter (expire out of) the window.
  • Uses efficient Verifiers based on conditional
    counting.
  • Trade-off between Delay and Performance
  • Performance gain over existing algorithms.

20
SWIM (Sliding Window Incremental Miner)
  • If pattern p is frequent in a window, it must be
    frequent in at least one of its slides, so keep
    the union of the frequent patterns of all slides
    (PT)

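[Figure: window W4 covers slides S4-S6 and window W5 covers slides S5-S7. When the window advances, S4 expires and the new slide S7 is mined by the mining algorithm; PT = F4 U F5 U F6 becomes PT = F5 U F6 U F7 after pruning.]
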
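
The property stated on this slide can be checked with a small Python sketch. It is not the SWIM implementation: the brute-force miner, the limit of two items per pattern, and all names are assumptions for illustration, and the incremental verifiers of [Mozafari 08] are replaced by a plain recount over the window. The sketch only shows that the union of the per-slide frequent patterns (PT) is a complete candidate set for the window.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(transactions, min_support, max_len=2):
    """Brute-force miner: itemsets of up to max_len items whose relative
    support in `transactions` is at least min_support."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for pattern in combinations(items, k):
                counts[pattern] += 1
    threshold = min_support * len(transactions)
    return {p for p, c in counts.items() if c >= threshold}

def window_frequent(slides, min_support, max_len=2):
    """slides: the slides of one window, each a list of transactions."""
    # PT: union of the frequent patterns of every slide (candidates only).
    pt = set()
    for s in slides:
        pt |= frequent_patterns(s, min_support, max_len)
    # Verify every candidate against the whole window.
    window = [t for s in slides for t in s]
    threshold = min_support * len(window)
    return {p for p in pt
            if sum(1 for t in window if set(p) <= set(t)) >= threshold}

slides = [
    [["a", "b"], ["a", "c"]],
    [["a", "b"], ["b", "c"]],
    [["a"], ["c"]],
]
from_slides = window_frequent(slides, 0.5)
direct = frequent_patterns([t for s in slides for t in s], 0.5)
```

Both routes give the same answer because a pattern that reaches the threshold over the whole window must reach it in at least one slide.
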
21
Concept Drifts/Shifts: Complex Processes
  • Ensemble-based methods.
  • Weighted bagging [Wang 03], adaptive boosting
    [Chu 04], inductive transfer [Forman 06].
  • Generic support, e.g., adaptive boosting.
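
As a rough illustration of why ensembles help under concept drift, the following Python sketch weights each classifier's vote by its accuracy on the most recent chunk. This is a simplification in the spirit of weighted ensembles, not the algorithm of [Wang 03] or [Chu 04]; the lookup-table base learner and all names are assumptions of this sketch.

```python
from collections import Counter, defaultdict

def train_chunk(chunk):
    """Base learner: map each feature value to its majority label."""
    by_value = defaultdict(Counter)
    for x, y in chunk:
        by_value[x][y] += 1
    table = {x: c.most_common(1)[0][0] for x, c in by_value.items()}
    default = Counter(y for _, y in chunk).most_common(1)[0][0]
    return lambda x: table.get(x, default)

def accuracy(clf, chunk):
    """Fraction of the chunk that the classifier labels correctly."""
    return sum(clf(x) == y for x, y in chunk) / len(chunk)

def ensemble_predict(classifiers, latest_chunk, x):
    """Vote with weights equal to accuracy on the latest labeled chunk."""
    votes = Counter()
    for clf in classifiers:
        votes[clf(x)] += accuracy(clf, latest_chunk)
    return votes.most_common(1)[0][0]

old_chunk = [("a", 0), ("b", 1), ("a", 0), ("b", 1)]
new_chunk = [("a", 1), ("b", 0), ("a", 1), ("b", 0)]   # the concept has flipped
classifiers = [train_chunk(old_chunk), train_chunk(old_chunk),
               train_chunk(new_chunk)]
prediction = ensemble_predict(classifiers, new_chunk, "a")
```

An unweighted majority vote would follow the two outdated classifiers here; the accuracy weights reduce their influence to zero.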

22
Built-in Online Mining Algorithms In SMM
  • Online classifiers:
  • Naïve Bayesian
  • Decision Tree
  • K-nearest Neighbor
  • Online clustering:
  • DBSCAN [Ester 96]
  • IncDBScan
  • Windowed K-means
  • DenStream [Cao 06]
  • CluStream
  • Association rule mining:
  • Approximate frequent items
  • SWIM [Mozafari 08]
  • Moment [Chi 04]
  • AFPIM
  • Time series/sequence queries:
  • SQL-TS [Sadri 01]
  • Many more
  • Already supported
  • To be supported

23
SMM Contributions
  • Build on Stream Mill DSMS and its SQL-based
    continuous query language and enabling
    technology.
  • Language and System Extensions
  • Genericity,
  • Extensibility, and
  • Performance
  • A suite of stream mining algorithms.
  • Existing ones and
  • Newly developed in this project, e.g., SWIM.
  • High level mining model for better
  • Usability
  • Control of mining process.

24
Usability?
  • Complex SQL queries to invoke built-in and
    user-defined mining algorithms.
  • An open and extensible system
  • Most analysts would prefer using a high-level
    mining language that:
  • supports uniform invocation of built-in and
    user-defined mining algorithms (no SQL required)
  • describes the workflow of the mining process
  • Is also open and extensible to incorporate newly
    defined mining algorithms.

25
Example: Defining a Mining Model
  • CREATE MODEL TYPE NaiveBayesianClassifier
  • SHAREDTABLES (DescriptorTbl),
  • Learn (UDA LearnNaiveBayesian,
  • WINDOW TRUE,
  • PARTABLES(), /* names of parameter tables
    required by the method */
  • PARAMETERS() /* additional parameters to be
    specified for input */
  • ),
  • Classify (UDA ClassifyNaiveBayesian,
  • WINDOW TRUE,
  • PARTABLES(),
  • PARAMETERS()
  • )

26
Example: Using a Mining Model
  • Creating an instance:
  • CREATE MODEL INSTANCE NaiveBayesianInstance
  • AS NaiveBayesianClassifier
  • Uniform invocation of mining tasks:
  • RUN NaiveBayesianInstance.Learn WITH TrainingSet
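
The idea of uniform invocation can be sketched in Python as a model type that maps task names to functions and an instance that holds the shared state. Everything here is hypothetical: the class names, the `run` method, and the trivial majority-label learner standing in for the Naive Bayesian UDAs are assumptions of this sketch, not the SMM implementation.

```python
from collections import Counter

class ModelType:
    """A named set of tasks, each backed by a function (the 'UDA')."""
    def __init__(self, name, tasks):
        self.name = name
        self.tasks = tasks          # task name -> function(state, rows)

class ModelInstance:
    """An instance with its own state, playing the role of shared tables."""
    def __init__(self, model_type):
        self.model_type = model_type
        self.state = {}

    def run(self, task, rows):
        """Uniform entry point, loosely analogous to RUN Instance.Task WITH data."""
        return self.model_type.tasks[task](self.state, rows)

def learn_majority(state, rows):
    """Stand-in learner: count labels in (features, label) rows."""
    counts = state.setdefault("label_counts", Counter())
    for _, label in rows:
        counts[label] += 1

def classify_majority(state, rows):
    """Stand-in classifier: predict the most frequent label seen so far."""
    label = state["label_counts"].most_common(1)[0][0]
    return [label for _ in rows]

majority_type = ModelType("MajorityClassifier",
                          {"Learn": learn_majority,
                           "Classify": classify_majority})
instance = ModelInstance(majority_type)
instance.run("Learn", [(("sunny",), "yes"), (("rain",), "yes"),
                       (("sunny",), "no")])
labels = instance.run("Classify", [(("overcast",), None)])
```

Registering a new algorithm only means adding another `ModelType` with its own task functions; the calling code stays the same.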

27
Performance
  • SMM vs. Weka
  • NBC and decision tree classifier
  • Datasets: UCI
  • Iris: 5 attributes
  • Heart disease: 13 attributes
  • Overhead of integrating algorithms into SMM
  • The SWIM algorithm: standalone vs. integrated
  • Dataset: IBM Quest
  • Transaction length 20, pattern length 5, 50K
    tuples

28
Comparison with Weka: NBC, Iris

29
Comparison with Weka: NBC, Heart Disease

30
Comparison with Weka: Decision Tree, Iris

31
Integration Overhead: Integrated SWIM vs. Standalone SWIM

32
The Stream Mill System
  • One server, multiple clients
  • Server (on Linux) hosts the ESL language and
    manages storage and continuous queries
  • Client (Java based GUI) allows the user to
    specify streams, queries, etc.

33
Conclusion
  • SMM integrates new solutions for several
    difficult problems
  • Usability by high-level mining models
  • Extensibility by user-defined mining models that
    call on UDAs with windows
  • Suite of built-in data stream mining UDAs
  • Generic mining UDAs by verticalization and other
    techniques
  • Performance
  • SMM is the first of its kind; more and better
    systems will follow in its footsteps.

34
Future Work
  • Faster lighter mining algorithms
  • E.g. online algorithms for clustering
  • Integration of other mining algorithms
  • Data flow in mining models
  • Similar solution for databases

35
  • Thank you!

36
References
  • [Arasu 04] Arvind Arasu and Jennifer Widom.
    Resource sharing in continuous sliding-window
    aggregates. In VLDB, pages 336-347, 2004.
  • [Babcock 02] B. Babcock, S. Babu, M. Datar, R.
    Motwani, and J. Widom. Models and issues in data
    stream systems. In PODS, 2002.
  • [Bai 06] Yijian Bai, Hetal Thakkar, Chang Luo,
    Haixun Wang, and Carlo Zaniolo. A data stream
    language and system designed for power and
    extensibility. In CIKM, pages 337-346, 2006.
  • [Cao 06] F. Cao, M. Ester, W. Qian, and A. Zhou.
    Density-based clustering over an evolving data
    stream with noise. To appear in Proceedings of
    SIAM 2006.
  • [Chi 04] Y. Chi, H. Wang, P. S. Yu, and R. R.
    Muntz. Moment: Maintaining closed frequent
    itemsets over a stream sliding window. In
    Proceedings of the 2004 IEEE International
    Conference on Data Mining (ICDM '04), November
    2004.
  • [Chu 04] F. Chu and C. Zaniolo. Fast and light
    boosting for adaptive mining of data streams. In
    PAKDD, volume 3056, 2004.
  • [Ester 96] Martin Ester, Hans-Peter Kriegel,
    Jörg Sander, and Xiaowei Xu. A density-based
    algorithm for discovering clusters in large
    spatial databases with noise. In Second
    International Conference on Knowledge Discovery
    and Data Mining, pages 226-231, 1996.
  • [Forman 06] George Forman. Tackling concept
    drift by temporal inductive transfer. In SIGIR,
    pages 252-259, 2006.

37
References
  • [Imielinski 96] Tomasz Imielinski and Heikki
    Mannila. A database perspective on knowledge
    discovery. Commun. ACM, 39(11):58-64, 1996.
  • [Law 04] Yan-Nei Law, Haixun Wang, and Carlo
    Zaniolo. Data models and query language for data
    streams. In VLDB, pages 492-503, 2004.
  • [Mozafari 08] Barzan Mozafari, Hetal Thakkar,
    and Carlo Zaniolo. Verifying and mining frequent
    patterns from large windows over data streams. In
    International Conference on Data Engineering
    (ICDE), 2008.
  • [Sadri 01] Reza Sadri, Carlo Zaniolo, Amir
    Zarkesh, and Jafar Adibi. Optimization of
    sequence queries in database systems. In PODS,
    Santa Barbara, CA, May 2001.
  • [Sarawagi 98] S. Sarawagi, S. Thomas, and R.
    Agrawal. Integrating association rule mining with
    relational database systems: Alternatives and
    implications. In SIGMOD, 1998.
  • [UCI-MLR] http://archive.ics.uci.edu/ml/datasets.html
  • [Wang 03] H. Wang, W. Fan, P. S. Yu, and J. Han.
    Mining concept-drifting data streams using
    ensemble classifiers. In SIGKDD, 2003.