Efficient Closed Pattern Mining in the Presence of Tough Block Constraints

Transcript and Presenter's Notes

1
Efficient Closed Pattern Mining in the Presence
of Tough Block Constraints
  • Krishna Gade
  • Computer Science and Engineering
  • gade@cs.umn.edu

2
Outline
  • Introduction
  • Problem Definition and Motivation
  • Contributions
  • Block Constraints
  • Matrix Projection based approach
  • Search Space Pruning Techniques
  • Experimental Evaluation
  • Conclusions

3
Introduction to Pattern Mining
  • What is a frequent pattern ?
  • Why is frequent pattern mining a fundamental task
    in data mining ?
  • Closed, Maximal and Constrained Extensions
  • State-of-the-art algorithms
  • Limitations with the current solutions

4
What is a frequent pattern ?
  • A frequent pattern can be a set of items,
    sequence, graph that occurs frequently in a
    database
  • It can also be a spatial, geometric or
    topological pattern depending on the database
    chosen
  • For a given transaction database and a support
    threshold min_sup, an itemset X is frequent if
    Sup(X) > min_sup.
  • Sup(X) is the support of X, defined as the
    fraction of the transactions in the database
    which contain X.
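The support definition above can be sketched in a few lines of Python (the database below is a toy example, not from the slides):

```python
def support(itemset, transactions):
    """Sup(X): fraction of transactions containing every item of X."""
    x = set(itemset)
    return sum(1 for t in transactions if x <= t) / len(transactions)

db = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"c", "d"}]
support({"b"}, db)        # 0.75
support({"a", "b"}, db)   # 0.5
```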

5
Why is frequent pattern mining so fundamental to
data mining ?
  • Foundation for several essential data mining
    tasks
  • Association, Correlation and Causality Analysis
  • Classification based on Association Rules
  • Pattern-based and Pattern-preserving Clustering.
  • Support is a simple, yet effective (in many
    cases) measure to determine the significance of a
    pattern, which also correlates with most of the
    other statistical measures.

6
Closed, Maximal Constrained Extensions to
Frequent Patterns
  • A frequent pattern X is said to be closed, if
  • no superset of X has the same supporting set.
  • It is said to be maximal, if
  • no superset of X is frequent.
  • It is said to be constrained, if
  • it satisfies some constraint defined on the items
    present or on the transactions that support it.
  • E.g., length(X) > min_l
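The closed and maximal definitions above can be checked by brute force on a toy database (illustrative Python only, not the mining algorithm presented later):

```python
from itertools import combinations

def supporting_set(itemset, transactions):
    """Indices of the transactions that contain every item of `itemset`."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)

def frequent_itemsets(transactions, min_sup):
    """Enumerate all frequent itemsets with their supporting sets (brute force)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            sup = supporting_set(x, transactions)
            if len(sup) / n >= min_sup:
                freq[x] = sup
    return freq

def closed_and_maximal(freq):
    """Closed: no frequent superset with the same supporting set.
    Maximal: no frequent superset at all."""
    closed, maximal = set(), set()
    for x, sup in freq.items():
        supers = [y for y in freq if x < y]
        if not any(freq[y] == sup for y in supers):
            closed.add(x)
        if not supers:
            maximal.add(x)
    return closed, maximal

db = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
closed, maximal = closed_and_maximal(frequent_itemsets(db, 0.6))
# closed: {a} and {a, b}; maximal: {a, b} only
```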

7
State-of-the-art algorithms for Pattern Discovery
  • Itemsets
  • Frequent: FP-Growth, Apriori, OP, Inverted Matrix
  • Closed: Closet, Charm, FPClose, LCM
  • Maximal: Mafia, FPMax
  • Constrained: LPMiner, Bamboo
  • Sequences
  • SPAM, SLPMiner, BIDE, etc.
  • Graphs
  • FSG, gFSG, gSpan, etc.

8
Limitations
  • Limitations of Support
  • May not capture the semantics of user interest.
  • Too many frequent patterns if the support
    threshold is too low.
  • Closed and Maximal frequent patterns fix this but
    there may be loss of information (in case of
    maximal).
  • Support is one measure of the interestingness of
    a pattern; there can be others, such as length.
  • E.g., One may be interested in finding patterns
    whose length decreases with the support.

9
Definitions
  • Block is a 2-tuple B (I,T), consisting of
    itemset I and its supporting set T.
  • Weighted block is a block with a weight function
    w, where w I x T -gtR
  • B is a Closed block iff there exists no block B
    (I,T) where I is the superset of I. (If
    such B exists then it is the super-block of B
    and B its sub-block.)
  • The size of the block B is defined as
  • The sum of the corresponding weighted block, is
    defined as
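Assuming the natural readings of the two definitions above, BSize(B) = |I| × |T| and BSum(B) as the sum of the weights over the block's cells, a minimal sketch (the weight table is made up):

```python
def block_size(I, T):
    """Assumed definition: number of matrix cells the block covers."""
    return len(I) * len(T)

def block_sum(I, T, w):
    """Sum of the weights over all (item, transaction) cells of the block."""
    return sum(w[(i, t)] for i in I for t in T)

w = {("a", "T1"): 2.0, ("b", "T1"): 1.0,
     ("a", "T2"): 0.5, ("b", "T2"): 1.5}
block_size({"a", "b"}, {"T1", "T2"})      # 4
block_sum({"a", "b"}, {"T1", "T2"}, w)    # 5.0
```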

10
Example of a Block
Example Database
Matrix Representation of the Database
B1 = ({a, b}, {T1}) and B2 = ({c, d}, {T2, T4}) are
examples of blocks. The red sub-matrix is not a
block.
11
Block Constraints
  • Let t be the set of all transactions in the
    database and m be the set of all items.
  • A block constraint C is a predicate,
    C: 2^t × 2^m → {true, false}.
  • A block B is a valid block for C if B satisfies C,
    i.e., C(B) = true.
  • C is a tough block constraint if there is no
    dependency between the satisfaction (violation)
    of C by a block B and the satisfaction (violation)
    of C by its super- or sub-blocks.
  • In this thesis, we explore 3 different tough block
    constraints:
  • Block-size
  • Block-sum
  • Block-similarity

12
Monotonicity and Anti-monotonicity of Constraints
  • Monotone Constraint
  • C is monotone iff whenever C(X) = true, then for
    every Y such that Y is a superset of X,
    C(Y) = true.
  • E.g., Sup(X) < v is monotone.
  • Benefit: prune all subsets Y of X if Sup(X) > v.
  • Anti-monotone Constraint
  • C is anti-monotone iff whenever C(X) = true, then
    for every Y such that Y is a subset of X,
    C(Y) = true.
  • E.g., Sup(X) > v is anti-monotone.
  • Benefit: prune all supersets Y of X if Sup(X) < v.
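The pruning benefit of anti-monotonicity can be illustrated with a levelwise (Apriori-style) search that never extends an infrequent itemset; this is a generic sketch, not the CBMiner algorithm:

```python
def apriori_frequent(transactions, min_count):
    """Levelwise search exploiting the anti-monotonicity of Sup(X) >= v:
    supersets of an infrequent itemset are never even generated."""
    items = sorted(set().union(*transactions))

    def count(x):
        return sum(1 for t in transactions if x <= t)

    # level 1: frequent single items
    level = [frozenset([i]) for i in items if count(frozenset([i])) >= min_count]
    frequent = set(level)
    while level:
        nxt = set()
        for x in level:              # only extend itemsets known to be frequent
            for i in items:
                y = x | {i}
                if len(y) == len(x) + 1 and count(y) >= min_count:
                    nxt.add(y)
        level = list(nxt)
        frequent |= nxt
    return frequent

apriori_frequent([{"a", "b"}, {"a", "b"}, {"a", "c"}], 2)
# {a}, {b}, {a, b}; {c} is infrequent, so no superset of it is ever counted
```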

13
Why Block-size is a Tough Constraint: An
Illustration
  • For the constraint BSize ≥ 4,
  • ({b, c}, {T1, T2}) is a valid block, but
    ({b, c, d}, {T2}) is invalid.
  • Block-size constraint is not monotone.
  • Neither of ({b}, {T1, T2}), ({c}, {T1, T2, T4}) is
    valid.
  • Block-size constraint is not anti-monotone.

14
Block-size, Block-sum Constraints
  • Block-size Constraint
  • Motivation: Find a set of itemsets, each of which
    accounts for a certain fraction of the overall
    number of transactions performed in a period of
    time.
  • Block-sum Constraint
  • Motivation: Identify product groups that account
    for a certain fraction of the overall sales,
    profits, etc.

15
Block-similarity Definition
  • Motivation: Finding groups of thematically
    related words in large document datasets.
  • The importance of a group of words can be measured
    by its contribution to the overall similarity
    between the documents in the collection.
  • Here t is the set of tf-idf scaled and normalized
    unit-length document vectors and m is the set of
    distinct terms in the collection.
  • The block-similarity of a weighted block B is
    defined as the
  • loss in the aggregate pairwise similarity of the
    documents in t, resulting from zeroing out the
    entries corresponding to B.
  • BSim(B) = S − S′, where S and S′ are the aggregate
    pairwise similarities before and after removing B.

16
Block Similarity - Illustration
({b, c}, {D1, D2}) is removed here to calculate its
block-similarity, by measuring the loss in the
aggregate similarity.
17
Block-similarity contd.
  • The similarity of any two documents is measured as
    the dot product of their unit-length vectors
    (cosine similarity).
  • For the given collection t, we define a composite
    vector D to be the sum of all document vectors in
    t.
  • We define the composite vector B_I for a weighted
    block B = (I, T) to be the vector formed by adding
    all the vectors in T only along the dimensions in
    I. Then
  • BSim(B) = 2 (D · B_I) − ||B_I||².
  • The block-similarity constraint is now defined as
    requiring BSim(B) to exceed a user-specified
    threshold.
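The composite-vector shortcut can be sanity-checked numerically: zeroing out a block changes the collection composite from D to D − B_I, so, assuming the aggregate similarity counts all document pairs (including self-pairs, i.e., S = D · D), the loss equals 2 (D · B_I) − ||B_I||². A self-contained check on made-up toy vectors:

```python
# four unit-length document vectors over five terms (made-up values)
docs = [
    [0.5, 0.5, 0.5, 0.5, 0.0],
    [0.0, 0.6, 0.8, 0.0, 0.0],
    [0.6, 0.0, 0.0, 0.8, 0.0],
    [0.0, 0.5, 0.5, 0.5, 0.5],
]
I, T = [1, 2], [0, 3]        # block: terms 1, 2 in documents 0, 3

def composite(vs):
    """Composite vector: component-wise sum of a list of vectors."""
    return [sum(col) for col in zip(*vs)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

D = composite(docs)                                   # collection composite
B = [sum(docs[t][j] for t in T) if j in I else 0.0    # block composite B_I
     for j in range(5)]

# direct loss: zero out the block and compare aggregate similarities S = D . D
pruned = [[0.0 if (t in T and j in I) else x for j, x in enumerate(row)]
          for t, row in enumerate(docs)]
loss = dot(D, D) - dot(composite(pruned), composite(pruned))

# shortcut agrees with the direct computation
assert abs(loss - (2 * dot(D, B) - dot(B, B))) < 1e-9
```

The shortcut matters because D can be computed once for the whole collection, so each candidate block needs only its own composite vector.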

18
Key Features of the Algorithm
  • Follows widely used projection-based pattern
    mining paradigm.
  • Adopts a depth-first search traversal on the
    lattice of complete set of itemsets, with the
    items ordered non-decreasingly on their
    frequency.
  • Represents the transaction/document database as a
    matrix, transactions (documents) as rows and
    items (terms) as columns.
  • Employs efficient compressed sparse matrix
    storage and access schemes (such as CSR) to
    achieve high computational efficiency.
  • Matrix-projection based pattern enumeration
    shares ideas from the recently developed
    array-projection based method H-Mine.
  • Prunes potentially invalid rows and columns at
    each node during the traversal of the lattice
    (shown in the next page) as determined by our
    row-pruning and column-pruning and matrix-pruning
    tests.
  • Adapts various closed itemset mining optimization
    techniques, like column fusing and redundant
    pattern pruning from CHARM and Closet, to the
    block constraints.
  • The hash-table consists of only closed patterns
    hashed by the sum of the transaction-ids of the
    transactions in their supporting sets.

19
Pattern Enumeration
  • Visits each node in the lattice in a depth-first
    order. Each node represents a distinct pattern p.
  • At a certain node labeled p in the lattice, we
    report and store p in the hash table, as a closed
    pattern if p is closed and valid under the given
    block constraint.
  • We build a p-projected matrix by pruning any
    potentially invalid columns and rows determined
    by our pruning tests.

[Figure: the itemset lattice over {a, b, c, d}.
Level 1: a, b, c, d; Level 2: ab, ac, ad, bc, bd, cd;
Level 3: abc, abd, acd, bcd; Level 4: abcd]
20
Matrix-Projection
given matrix
  • A p-projected matrix is the matrix containing
    only the rows that contain p, and the columns
    that appear after p, in the predefined order.
  • Projecting the matrix is linear on the number of
    non-zeroes in the projected matrix.

b-projected matrix
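The projection step can be sketched as follows (the example database here is hypothetical, not the slides' matrix):

```python
def project(matrix, p, order):
    """Build the p-projected matrix: keep only rows containing item p,
    and only the columns that appear after p in the predefined order."""
    later = set(order[order.index(p) + 1:])
    return {r: items & later for r, items in matrix.items() if p in items}

# hypothetical transaction database: row-id -> set of items
m = {"T1": {"a", "b", "c"}, "T2": {"b", "c", "d"},
     "T3": {"a", "d"}, "T4": {"c", "d"}}
project(m, "b", ["a", "b", "c", "d"])
# -> {"T1": {"c"}, "T2": {"c", "d"}}
```

The work is linear in the entries kept, matching the claim above.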
21
Compressed Sparse Representation
  • CSR format utilizes two one-dimensional arrays
  • First stores the actual non-zero elements of the
    matrix in a row (or column).
  • Second stores the indices corresponding to the
    beginning of each row (or column).
  • We maintain both row- and column-based
    representation for efficient projection and
    frequency counting.
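A minimal row-based builder for the CSR scheme described above (a sketch; the thesis implementation details may differ):

```python
def to_csr(rows):
    """Row-based CSR: `indices` lists the column ids row by row;
    `ptr[i]:ptr[i+1]` delimits the columns of row i."""
    ptr, indices = [0], []
    for row in rows:
        indices.extend(sorted(row))
        ptr.append(len(indices))
    return ptr, indices

ptr, idx = to_csr([{0, 1}, {1, 2, 3}, {3}])
# ptr = [0, 2, 5, 6], idx = [0, 1, 1, 2, 3, 3]
```

A column-based version is built the same way with rows and columns swapped.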

22
CSR format for the example matrix
[Figure: row-based CSR of the example matrix, shown
as a pointer array and an index array]
23
Search Space Pruning
b-projected matrix
  • Column Pruning
  • Given a pattern p and its p-projected matrix.
  • Necessary condition for the columns which can
    form a valid block with p.
  • Eliminate all columns in the p-projected matrix
    that do not satisfy it.
  • Block-size
  • T_x: the local supporting set of a column x.
  • rlen(t): the local row-length of t.

Let BSize > 5 be the constraint. d will get
pruned, as it can never form a block of size > 5
with its prefix b, since the maximum block-size
possible with d is 4.
24
Search Space Pruning contd.,
  • Column Pruning
  • Block-sum
  • rsum(t): the local row-sum of t.
  • Block-similarity
  • e: the maximum value of the composite vector D.
  • g: the local maximum row-sum.
  • freq: frequency.
  • a = 2 (D · B_p)

25
Search Space Pruning contd.,
  • Row Pruning
  • Smallest Valid Extension (SVE)
  • SVE(p) is the length of the smallest possible
    extension q to p such that the resulting block
    formed by p and q is valid.
  • Prune rows whose length is smaller than SVE in
    the p-projected matrix.
  • The SVE for a generic block constraint BSxxx is
    given below.
  • Block-size
  • z = the size of the supporting set of p.
  • Block-sum
  • z = the maximum column sum in the p-projected
    matrix.
  • Block-similarity
  • z = the maximum column similarity in the
    p-projected matrix.
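For the block-size case, the SVE computation and the row-pruning test can be sketched as follows (assuming BSize(B) = |I| × |T|; the helper names are ours, not from the thesis):

```python
import math

def sve_block_size(s, p_len, z):
    """Smallest number of extra columns q so that a block whose itemset has
    p_len + q items and whose supporting set has z rows can reach size >= s.
    Assumes BSize(B) = |I| * |T|."""
    return max(0, math.ceil(s / z) - p_len)

def prune_rows(projected, sve):
    """Rows shorter than the smallest valid extension can never contribute
    to a valid block extending p, so drop them."""
    return {r: items for r, items in projected.items() if len(items) >= sve}

proj = {"T1": {"c"}, "T2": {"c", "d"}}       # a b-projected matrix
prune_rows(proj, sve_block_size(6, 1, 2))    # SVE = 2, so T1 is pruned
```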

26
Search Space Pruning contd.,
b-projected matrix
  • Row Pruning Example
  • Matrix Pruning
  • Prune the p-projected matrix if it is insufficient
    to form a valid block with p, i.e., if
  • Block-size: the sum of the row-lengths in the
    projected matrix is insufficient.
  • Block-sum: the sum of the row-sums is
    insufficient.
  • Block-similarity: the sum of the column
    similarities is insufficient.

Let BSum > 7 be the constraint. Since SVE > 3, T1
gets pruned.
27
Pattern Closure Check and Optimizations
  • Closure Check
  • The hash-table consists of closed patterns.
  • Hash-keys are the sums of transaction-ids.
  • At a certain node p in the lattice (shown before),
    p is checked against the stored patterns that
    share its hash-key.
  • Column Fusing
  • Fuse the fully dense columns of the p-projected
    matrix to p.
  • Also fuse columns to one another that have
    identical supporting sets.
  • Redundant Pattern Pruning
  • If p is a proper subset of an already mined
    closed pattern with the same support, it can be
    safely pruned. Also any pattern extending it need
    not be explored as it has already been done.
    Hence p is a redundant pattern.
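The hash-table closure/redundancy check described above can be sketched as follows (the class and method names are ours; the actual implementation details may differ):

```python
class ClosedPatternTable:
    """Closed patterns hashed by the sum of the transaction-ids in their
    supporting sets (the hashing scheme described above)."""

    def __init__(self):
        self.table = {}

    @staticmethod
    def key(tids):
        return sum(tids)

    def add(self, pattern, tids):
        self.table.setdefault(self.key(tids), []).append((pattern, tids))

    def is_redundant(self, pattern, tids):
        """True if a stored closed pattern with the same supporting set is a
        proper superset of `pattern` (redundant pattern pruning)."""
        for q, q_tids in self.table.get(self.key(tids), []):
            if q_tids == tids and pattern < q:
                return True
        return False

tbl = ClosedPatternTable()
tbl.add(frozenset({"a", "b"}), frozenset({1, 2}))
tbl.is_redundant(frozenset({"a"}), frozenset({1, 2}))   # True
tbl.is_redundant(frozenset({"c"}), frozenset({1, 2}))   # False
```

Hashing on the transaction-id sum only narrows the candidates; the supporting sets must still be compared exactly, since different sets can share a sum.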

28
Experimental Setup
Notation: CBMiner = Closed Block Miner. CLOSET =
state-of-the-art closed frequent itemset mining
algorithm. CP = Column Pruning, RP = Row Pruning,
MP = Matrix Pruning.
29
Experimental Results
Comparisons with Closet on Gazelle
30
Experimental Results contd.,
Comparisons with Closet on Sports
31
Experimental Results contd.,
Comparisons of Pruning Techniques on Gazelle
(left) and Pumsb(right)
No Pruning: Gazelle 1578.48 s (BSize > 0.1);
Pumsb 1330.03 s (BSum > 6.0)
32
Experimental Results contd.,
Comparisons Closed All Valid Block Mining
Big-Market
33
Experimental Results contd.,
Comparison of Pruning Techniques on Big-Market
Scalability Test on T10I4Dx
Time for No Pruning 3560 seconds
34
Micro Concept Discovery
  • Scaled the document vectors using tf-idf.
  • Normalized using L2-norm.
  • Applied the CBMiner algorithm for each of the
    three constraints.
  • Chose the top-1000 patterns ranked on the
    constraint function value.
  • Computed the entropies of the documents that form
    the supporting set of each block.
  • Also ran CLOSET to get the top-1000 patterns
    ranked on frequency.

35
Micro Concept Discovery contd.,
  • Average entropies of the four schemes are pretty
    low.
  • Block-similarity outperforms the rest, as it
    leads to the lowest entropies, i.e., the purest
    clusters.
  • Block-size and itemset frequency constraints do
    not account for the weights associated with the
    terms and hence are inconsistent.
  • But, Block-sum performs reasonably well as it
    accounts for the term weights provided by tf-idf
    and L2-norm.

36
Micro Concept Discovery contd.,
37
Conclusions
  • Proposed a new class of constraints called
    tough block constraints.
  • And a matrix-projection based framework CBMiner
    for mining closed block patterns.
  • Block constraints discussed: Block-size,
    Block-sum, Block-similarity.
  • 3 novel pruning techniques: column pruning, row
    pruning and matrix pruning.
  • Order(s) of magnitude faster than traditional
    closed frequent itemset mining algorithms.
  • Finds far fewer patterns.

38
  • Thank You !!