DEMON: Mining and Monitoring Evolving Data presentation

1
DEMON: Mining and Monitoring Evolving Data

Venkatesh Ganti
UW-Madison
Johannes Gehrke
Cornell University
Raghu Ramakrishnan
UW-Madison
Presented by
Navneet Panda

2
Overview

Extends data mining to a dynamic environment in
which the data evolves through the systematic
addition or deletion of blocks of data.
Introduces a new dimension, the data span
dimension, which allows the selection of a
temporal subset of the database.
Presents efficient model maintenance algorithms
for frequent itemsets and clusters.
Presents a generic algorithm that turns a
traditional model maintenance algorithm into one
that allows restrictions on the data span
dimension.
Examines the validity and performance of these
ideas.

3
Notation

Tuple: basic unit of information in the data,
e.g. a customer transaction, a database record or
an n-dimensional point.
Block: a set of tuples.
Database: a sequence of blocks D1, D2, ..., Dk, ...
where each block Dk is associated with an
identifier k.
Dt: the latest block.
D1, ..., Dt: the current database snapshot.

4
Objectives ( 1 )

Mining systematically evolving data.
Systematic: sets of records are added together,
as opposed to
Arbitrary: an individual record can be updated
at any time.
Reasoning: data warehouses that collect large
amounts of data from multiple sources do not
update records arbitrarily. Rather, they update
batches of records at regular time intervals.
Block evolution, however, does not necessarily
follow a regular time period.

5
Objectives ( 2 )

Introduction of a new dimension called the data
span dimension.
Takes the temporal aspect of the data evolution
into account.
Options:
Unrestricted window: all the data collected so
far. Notation: D[1, t].
Most recent window: a specified number w of the
most recently collected blocks of data.
Notation: D[t-w+1, t].
If t > w-1, it consists of blocks
D_{t-w+1}, ..., Dt; otherwise it consists of
blocks D1, ..., Dt.
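
The two window options can be sketched in a few lines of Python. This is only an illustration of the definitions above: the function names are hypothetical, and the database is assumed to be a plain list holding blocks D1, ..., Dt in arrival order.

```python
from typing import List, Sequence, TypeVar

Block = TypeVar("Block")


def unrestricted_window(blocks: Sequence[Block]) -> List[Block]:
    """D[1, t]: every block collected so far."""
    return list(blocks)


def most_recent_window(blocks: Sequence[Block], w: int) -> List[Block]:
    """D[t-w+1, t] if t > w-1, otherwise D[1, t]."""
    if w <= 0:
        raise ValueError("window size w must be positive")
    t = len(blocks)
    # 0-based position of block D_{t-w+1}; fall back to the first block
    # when fewer than w blocks have been collected.
    start = t - w if t > w - 1 else 0
    return list(blocks[start:])


blocks = ["D1", "D2", "D3", "D4", "D5"]
print(most_recent_window(blocks, 3))  # ['D3', 'D4', 'D5']
print(most_recent_window(blocks, 8))  # all five blocks, since t <= w-1
```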

6
Additional Constraints

Block selection predicate: a bit sequence of 0's
and 1's in which a 1 in a particular position
indicates that the corresponding block is
selected for mining, and a 0 that it is not.
Motive: to enable the analyst to perform actions
of the following kind.
Model the data collected on all Mondays to
analyse the sales immediately after the weekend.
The required blocks are selected from the
unrestricted window by a predicate that marks all
the blocks added to the database on Mondays.
Model the data collected on all Mondays in the
last 28 days.
Model all data collected on the same day as today
in the last month.

7
Formal definitions 1

D[1, t] = {D1, ..., Dt}: database snapshot.
D[t-w+1, t]: most recent window of size w.
A window-independent block selection sequence
(BSS) is a sequence <b1, ..., bt, ...> of 0/1 bits.
A window-relative block selection sequence is a
sequence <b1, ..., bw> of bits, one per block in
the most recent window.
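
The difference between the two kinds of sequence shows up once a new block arrives. The sketch below is illustrative only: the function names are hypothetical, blocks are held in a list in arrival order, and for the window-relative case the window size w is taken to be the length of the bit sequence.

```python
def select_window_independent(blocks, bits):
    """bits[k-1] is the bit for block Dk, fixed once and for all."""
    return [block for block, bit in zip(blocks, bits) if bit]


def select_window_relative(blocks, bits):
    """bits[j-1] is the bit for the j-th block of the most recent window."""
    w = len(bits)
    window = blocks[-w:]  # the whole list if fewer than w blocks exist
    return [block for block, bit in zip(window, bits) if bit]


blocks = ["D1", "D2", "D3", "D4", "D5"]

# Window-independent: the same blocks stay selected as the database grows.
print(select_window_independent(blocks, [1, 0, 1, 0, 1, 0]))  # ['D1', 'D3', 'D5']

# Window-relative: the selection slides along with the window.
print(select_window_relative(blocks, [1, 0, 1]))              # ['D3', 'D5']
print(select_window_relative(blocks + ["D6"], [1, 0, 1]))     # ['D4', 'D6']
```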

8
Formal definitions 2

I = {i1, ..., in}: set of items.
Transactions and itemsets are subsets of I. Each
transaction is associated with a unique
transaction identifier (TID).
A transaction T contains itemset X if X is a
subset of T.
Support σ_D(X): fraction of the total number of
transactions in D that contain X, where D is a
set of transactions.
Minimum support k ( 0 < k < 1 ): itemset X is
frequent on D if σ_D(X) ≥ k.
Frequent itemsets L(D, k): all itemsets
frequent on D.
Negative border NB(D, k): set of all infrequent
itemsets whose proper subsets are all frequent.
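
These definitions can be checked on a toy database. The following brute-force sketch enumerates every itemset level by level, which is only feasible for a handful of items; it is meant to illustrate the definitions, not any algorithm from the talk. The function names and the sample transactions are made up for the example.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of the transactions that contain the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_and_negative_border(transactions, items, k):
    """Return (L(D, k), NB(D, k)) by exhaustive level-wise enumeration."""
    frequent, border = set(), set()
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            # All proper subsets are frequent exactly when all subsets of
            # size-1 are; for single items there is nothing to check.
            parents = [frozenset(s) for s in combinations(combo, size - 1) if s]
            if all(p in frequent for p in parents):
                x = frozenset(combo)
                if support(transactions, x) >= k:
                    frequent.add(x)
                else:
                    border.add(x)
    return frequent, border


transactions = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("b"),
]
L, NB = frequent_and_negative_border(transactions, {"a", "b", "c"}, k=0.5)
print(sorted("".join(sorted(x)) for x in L))   # ['a', 'ab', 'ac', 'b', 'c']
print(sorted("".join(sorted(x)) for x in NB))  # ['bc']
```

Here {b, c} is infrequent while both of its proper subsets are frequent, so it lies on the negative border; {a, b, c} is infrequent too but is excluded because its subset {b, c} is not frequent.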

9
Unrestricted Window Algorithm

New algorithms: ECUT, ECUT+
Previous best algorithm: Borders
Detection phase: recognize the change in the set
of frequent itemsets.
Update phase: count the set of new itemsets
required for dynamic maintenance. Relies on the
maintenance of the negative border along with the
set of frequent itemsets.
Addition of D_{t+1} to D[1, t] causes an update of
the frequent itemsets L(D[1, t], k) and of
NB(D[1, t], k).
If X ∈ NB(D[1, t], k) and the support σ(X) of X
becomes at least k, then X becomes frequent in
D[1, t+1].
New candidate itemsets are generated recursively
after adding X to L(D[1, t], k). The support of
the new itemsets is counted by organizing them as
a prefix tree (PT-Scan).
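
A minimal sketch of the detection phase, under stated assumptions: absolute support counts are kept for every itemset in the frequent set and the negative border, and only the new block is scanned. The function name and data layout are hypothetical, and the recursive candidate generation of the update phase is not shown.

```python
def detection_phase(counts, n_old, new_block, k):
    """Update the tracked counts with the new block and report changes.

    counts    : itemset -> absolute support count on D[1, t],
                for every itemset in L(D[1, t], k) and NB(D[1, t], k)
    n_old     : number of transactions in D[1, t]
    new_block : transactions (frozensets) of block D_{t+1}
    """
    updated = dict(counts)
    for transaction in new_block:
        for itemset in updated:
            if itemset <= transaction:
                updated[itemset] += 1
    n_new = n_old + len(new_block)
    was_frequent = {x for x, c in counts.items() if c / n_old >= k}
    now_frequent = {x for x, c in updated.items() if c / n_new >= k}
    promoted = now_frequent - was_frequent   # border itemsets turned frequent
    demoted = was_frequent - now_frequent    # frequent itemsets turned infrequent
    return updated, promoted, demoted


counts = {
    frozenset("a"): 3, frozenset("b"): 3, frozenset("c"): 2,
    frozenset("ab"): 2, frozenset("ac"): 2,
    frozenset("bc"): 1,  # on the negative border at k = 0.5
}
new_block = [frozenset("bc"), frozenset("bc")]
updated, promoted, demoted = detection_phase(counts, 4, new_block, k=0.5)
print(promoted)  # {frozenset({'b', 'c'})}
print(demoted)   # the itemsets {a, b} and {a, c}
```

Every promoted itemset is a trigger for the update phase, since its supersets may now be candidates whose supports were never counted.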

10
ECUT

Exploits systematic data evolution and the fact
that very few new candidate itemsets need to be
counted.
Retrieves only the relevant portion of the dataset
to count the support of an itemset X.
The relevant information is stored as TID
(transaction identifier) lists of the items in X.
TID-lists are sorted.
X = {i1, ..., ik}
TID-lists: θ(i1), ..., θ(ik).
The intersection of these TID-lists gives θ(X).
The intersection is similar to the merge phase of
a sort-merge join.
The size of the TID-lists is 1 or 2 orders of
magnitude smaller than D.
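
The merge-style intersection can be sketched as follows. The function names and the sample TID-lists are made up for illustration; the only assumption carried over from the slide is that each TID-list is sorted.

```python
def intersect_sorted(a, b):
    """Intersect two sorted TID-lists in one pass, as in a merge join."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def tid_list(item_tid_lists, itemset):
    """TID-list of a non-empty itemset X from the TID-lists of its items."""
    # Starting from the shortest list keeps the intermediate results small.
    lists = sorted((item_tid_lists[item] for item in itemset), key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
    return result


item_tid_lists = {"a": [1, 2, 3, 5], "b": [1, 2, 4, 5], "c": [1, 3, 5]}
theta_x = tid_list(item_tid_lists, ["a", "b", "c"])
print(theta_x)       # [1, 5]
print(len(theta_x))  # support count of {a, b, c}: 2
```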

11
ECUT ( 2 )

Uses
(1) Additivity property: the support of an
itemset X on D[1, t] (as a transaction count) is
the sum of its supports on D1, ..., Dt.
(2) 0/1 property: a block is either completely
selected by the BSS (block selection sequence) or
not selected at all.
Implication: the TID-lists of block Di are
constructed when Di is added to the database and
never need to be modified afterwards.
Block Di is scanned, and the identifier of each
transaction T ∈ Di is appended to the TID-list of
X if T contains X.
Any information that can be obtained from the
transactional format can also be obtained from
the set of TID-lists, which obviates the need to
store the database in transactional format.
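
The two properties combine naturally: TID-lists are built once per block, and the support count under any block selection sequence is the sum of the per-block counts of the selected blocks. The sketch below assumes a hypothetical layout in which each block is a list of (TID, items) pairs in increasing TID order; for brevity it intersects Python sets rather than merging sorted lists.

```python
from collections import defaultdict


def build_block_tid_lists(block):
    """Scan one block once and return item -> sorted TID-list."""
    lists = defaultdict(list)
    for tid, items in block:
        for item in items:
            lists[item].append(tid)
    return dict(lists)


def block_count(tid_lists, itemset):
    """Support count of the itemset within a single block."""
    sets = [set(tid_lists.get(item, ())) for item in itemset]
    return len(set.intersection(*sets)) if sets else 0


def support_count(per_block_lists, bss, itemset):
    """Additivity: sum the counts of the blocks selected by the BSS."""
    return sum(
        block_count(lists, itemset)
        for lists, bit in zip(per_block_lists, bss)
        if bit
    )


d1 = [(1, {"a", "b"}), (2, {"a"}), (3, {"a", "b", "c"})]
d2 = [(4, {"b"}), (5, {"a", "b"})]
d3 = [(6, {"a", "b"}), (7, {"a", "b"})]
per_block = [build_block_tid_lists(d) for d in (d1, d2, d3)]

print(support_count(per_block, [1, 1, 1], {"a", "b"}))  # 5
print(support_count(per_block, [1, 0, 1], {"a", "b"}))  # 4
```

Because a block is either fully in or fully out, changing the BSS never requires touching the stored lists; only the set of blocks summed over changes.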

12
ECUT+

Improvement upon ECUT when additional space is
available.
Intuition: the support of X can be counted by
joining the TID-lists of itemsets X1, ..., Xk
where X1 ∪ ... ∪ Xk = X.
The greater the sizes of the Xi's, the faster the
calculation of the support of X.
The optimal choice of the Xi's is NP-hard,
therefore a heuristic is applied: a significant
reduction in the time to count the support of an
itemset results from the use of 2-itemsets
instead of 1-itemsets. If memory is limited, as
many 2-itemsets as possible are chosen. An
itemset with higher overall support is chosen
before one with lower support.
Advantages: speed; updates which change the
threshold k to k' are possible by augmenting
Borders with ECUT.
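
One way to read the heuristic is sketched below. Both functions are hypothetical: the first picks 2-itemsets to materialize in order of decreasing support until an assumed space budget (measured in TID entries) is used up, and the second covers an itemset X greedily with the materialized 2-itemsets, falling back to single items. The greedy cover is a stand-in chosen for this example, not necessarily the exact rule used by ECUT+.

```python
def choose_materialized(pair_counts, budget):
    """Pick 2-itemsets by decreasing support while they fit the budget.

    pair_counts : 2-itemset -> support count (length of its TID-list)
    budget      : maximum total number of TID entries that may be stored
    """
    chosen, used = [], 0
    ranked = sorted(pair_counts.items(), key=lambda kv: kv[1], reverse=True)
    for pair, count in ranked:
        if used + count <= budget:
            chosen.append(pair)
            used += count
    return chosen


def greedy_cover(x, materialized):
    """Cover X with materialized subsets of X, then with single items."""
    x = frozenset(x)
    uncovered = set(x)
    candidates = [m for m in materialized if m <= x]
    cover = []
    while uncovered:
        best = max(candidates, key=lambda m: len(m & uncovered), default=None)
        if best is None or not (best & uncovered):
            best = frozenset([min(uncovered)])  # fall back to a 1-itemset
        cover.append(best)
        uncovered -= best
    return cover


pair_counts = {frozenset("ab"): 40, frozenset("cd"): 30, frozenset("bc"): 25}
materialized = choose_materialized(pair_counts, budget=70)
print(materialized)                        # the pairs {a, b} and {c, d}
print(greedy_cover("abcd", materialized))  # two lists to join instead of four
print(greedy_cover("abe", materialized))   # {a, b} plus the single item e
```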

13
Clustering

Existing algorithm: BIRCH.
Preclustering phase: the dataset is scanned to
identify a small set of sub-clusters C. C fits
easily into memory. This phase dominates the
overall resource requirements.
Analysis phase: some sub-clusters of C are merged
to form a user-defined number of clusters. The
second phase works on in-memory data and is hence
very fast.
The improved algorithm presented is BIRCH+.

14
BIRCH+

Incrementally clusters D[1, t+1] in two steps.
Inductive description:
Base case ( t = 1 ): run BIRCH on D[1, 1].
Time t ≥ 1: the output of the first phase of
BIRCH is held in memory as the set of
sub-clusters Ct.
When D_{t+1} is added, update Ct by scanning
D_{t+1} as if the first phase of BIRCH had merely
been suspended.
After obtaining C_{t+1}, run the second phase of
BIRCH.
Observation: the input order of the data does not
have a perceptible impact on the quality of the
clusters produced by BIRCH.
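
The "suspend and resume" idea can be illustrated with a deliberately simplified stand-in for the first phase of BIRCH: a flat list of sub-cluster summaries (point count and linear sum) with a distance threshold, instead of a real CF-tree. The class name, the threshold rule and the sample points are all assumptions made for this sketch; it requires Python 3.8 or later for `math.dist`.

```python
import math


class SubClusters:
    """Simplified preclustering state: a list of [count, linear_sum]."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.summaries = []

    def absorb(self, block):
        """Scan one block, resuming from whatever state is in memory."""
        for point in block:
            best, best_dist = None, math.inf
            for summary in self.summaries:
                centroid = [v / summary[0] for v in summary[1]]
                dist = math.dist(centroid, point)
                if dist < best_dist:
                    best, best_dist = summary, dist
            if best is not None and best_dist <= self.threshold:
                best[0] += 1
                best[1] = [a + b for a, b in zip(best[1], point)]
            else:
                self.summaries.append([1, list(point)])


d1 = [(0, 0), (0.5, 0), (10, 10)]
d2 = [(10, 10.5), (0, 0.5)]

incremental = SubClusters(threshold=1.0)
incremental.absorb(d1)   # state C1 stays in memory
incremental.absorb(d2)   # update C1 to C2 by scanning only the new block

from_scratch = SubClusters(threshold=1.0)
from_scratch.absorb(d1 + d2)

print(incremental.summaries)  # [[3, [0.5, 0.5]], [2, [20, 20.5]]]
print(incremental.summaries == from_scratch.summaries)  # True
```

Only the new block is scanned on each update; the second, in-memory phase would then be run on the resulting summaries.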

15
GEMM

Generic Model Maintenance for the most recent
window option.
Basic idea: starting with block D_{t-w+1}, the
window D[t-w+1, t] evolves in w steps.
Build the required model incrementally using
algorithm Am in w steps.
Suppose the current window is D[t-w+1, t]. There
are w-1 future windows which overlap with it.
Incrementally evolve models for all such future
windows.
Implication: it is necessary to maintain models
for all future windows that overlap with the
current one.
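
A minimal sketch of this bookkeeping, assuming no block selection sequence (every block in the window is selected). The class and parameter names are hypothetical; `add_block` plays the role of Am, and the demo model is simply a counter of items. For simplicity all models are updated eagerly when a block arrives.

```python
from collections import Counter, deque


class GEMM:
    """Keep one model per window that overlaps the current one."""

    def __init__(self, w, new_model, add_block):
        self.w = w
        self.new_model = new_model  # () -> empty model
        self.add_block = add_block  # (model, block) -> updated model
        self.models = deque()       # models[0] is the current window's model

    def on_new_block(self, block):
        if len(self.models) == self.w:
            self.models.popleft()   # the oldest block leaves the window
        self.models.append(self.new_model())
        for i in range(len(self.models)):
            self.models[i] = self.add_block(self.models[i], block)
        return self.models[0]


gemm = GEMM(w=3, new_model=Counter, add_block=lambda m, b: m + Counter(b))
for block in [["a", "b"], ["a"], ["b", "b"], ["c"]]:
    current = gemm.on_new_block(block)

# The current window holds the last three blocks: ["a"], ["b", "b"], ["c"].
print(current)  # Counter({'b': 2, 'a': 1, 'c': 1})
```

No deletion is ever performed on a model: when the window slides, the model that was built without the oldest block simply becomes the current one.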

16
Example

17
Example

18
GEMM

The choice of block selection sequence (BSS) for
the future windows depends upon the type of BSS:
window-independent or window-relative.

19
Window-independent BSS

20
Window-relative BSS

21
Analysis

Time between the addition of a block and the
availability of the updated model = time taken by
algorithm Am to update the model with a single
new block.
Except for the model for the new window, the
models do not need to be constructed immediately;
they can be updated offline.
These models can be swapped out to disk and
retrieved when required, so main memory is not a
limitation as long as one model fits into memory.
Since the space occupied by a model is negligible
compared to the data stored on disk, the
additional disk space required is negligible.

22
Optimizations

Maintenance under deletion of transactions is
possible for certain classes of models, for
example the set of frequent itemsets.
The algorithm proceeds exactly as for the addition
of transactions, except that the supports of all
itemsets contained in a deleted transaction are
decremented.
Options:
(a) GEMM instantiated with the model maintenance
algorithm Am for the addition of blocks.
(b) An alternative algorithm that directly updates
the model to reflect the addition of the new
block and the deletion of the oldest block.
(b) maintains a single model whereas GEMM
maintains w-1 models. The response time of GEMM
is the time required to add a new block, whereas
(b) has to reflect both the addition and the
deletion. GEMM therefore takes half the time.
The set of sub-clusters cannot be maintained under
deletion in BIRCH.
It is inefficient to use (b) with a
window-relative BSS. For example, the BSS
<10101010> would cause the whole model to be
reconstructed in (b).
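
Option (b) can be sketched for support counts as follows. The function name and data are made up for the example, and only the counts of an already tracked set of itemsets are maintained; detecting newly frequent itemsets is not shown.

```python
def slide_window(counts, oldest_block, new_block):
    """Option (b): one model, updated for a deletion and an addition."""
    updated = dict(counts)
    for transaction in oldest_block:       # deletion: decrement supports
        for itemset in updated:
            if itemset <= transaction:
                updated[itemset] -= 1
    for transaction in new_block:          # addition: increment supports
        for itemset in updated:
            if itemset <= transaction:
                updated[itemset] += 1
    return updated


d1 = [frozenset("ab"), frozenset("a")]
d2 = [frozenset("b")]
d3 = [frozenset("ab"), frozenset("ab")]

# Counts over the window D1, D2.
counts = {frozenset("a"): 2, frozenset("b"): 2, frozenset("ab"): 1}

# Slide to the window D2, D3.
print(slide_window(counts, oldest_block=d1, new_block=d3))
# a -> 2, b -> 3, ab -> 2
```

Both the departing and the arriving block are scanned here, which is the extra work that GEMM avoids at response time by keeping models for the future windows.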

23
Performance

Measurements on a 200 MHz Pentium Pro PC with 128
MB RAM running Solaris 2.6.
Data generator used by Agrawal et al.
Format: NM.T1L.II.Nppats.Pplen, where
N: number of transactions (in millions)
T1: average transaction length
I: number of items (in multiples of 1000)
Np: number of patterns (in multiples of 1000)
P: average pattern length
Observation: the additional space required for
materialization in ECUT+ with frequent itemsets
of size 2 is < 25% of the overall data size.

Comparison of the update phase of ECUT and ECUT+
with that of BORDERS. The set of frequent itemsets
is computed at 1% minimum support. A set S is
selected at random from the negative border and
the support of all X ∈ S is counted. The size of S
is varied from 5 to 180.

Comparison of the total time taken by the
algorithms, broken down into detection and update
phases. A first set of frequent itemsets is
computed at a certain k. Then the overall
maintenance time required to update the frequent
itemsets when a second block is added is
measured.

29

Clusters distributed over all dimensions were
generated.
Two blocks of data are considered. The number of
tuples in the second block is varied between 100K
and 800K, and 2% uniformly distributed noise
points are added to perturb the clusters.

30
Results

ECUT and ECUT+ scale linearly with the number of
itemsets in S.
ECUT outperforms PT-Scan when |S| < 75.
ECUT+ outperforms PT-Scan over the entire range.
When |S| < 40, ECUT is twice as fast as PT-Scan
and ECUT+ is around 8 times as fast as PT-Scan.
Typical values of |S| are < 30.
The update phase of BORDERS dominates the overall
maintenance time.
When the new block size is < 5% of the original
dataset size, the algorithms are between 2 and 10
times faster than PT-Scan.
When ECUT is used in the update phase, the
detection phase dominates the total maintenance
time.
BIRCH+ significantly outperforms BIRCH.

31
Conclusion

Problem space of systematic data evolution
explored and efficient model-maintenance
algorithms presented.
All the algorithms presented are actually very
simple modifications of existing algorithms and
seem to be quite effective.
