Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Learn more at: https://www.cs.kent.edu
1
Cartesian Contour: A Concise Representation for a
Collection of Frequent Sets
  • Ruoming Jin
  • Kent State University

Joint work with Yang Xiang and Lin Liu (KSU)
2
Frequent Pattern Mining
  • Summarizing the underlying datasets, providing
    key insights
  • Key building block for data mining toolbox
      • Association rule mining
      • Classification
      • Clustering
      • Change Detection
      • etc.
  • Application Domains
      • Business, biology, chemistry, WWW,
        computer/networking security, software
        engineering, etc.

3
The Problem
  • The number of patterns is too large
  • Attempt
      • Maximal Frequent Itemsets
      • Closed Frequent Itemsets
      • Non-Derivable Itemsets
      • Compressed or Top-k Patterns
  • Tradeoff
      • Significant Information Loss
      • Large Size

4
Pattern Summarization
  • Using a small number of itemsets to best
    represent the entire collection of frequent
    itemsets
  • The Spanning Set Approach [Afrati-Gionis-Mannila,
    KDD04]
  • Exact Description: Maximal Frequent Itemsets
  • Our problem
  • Can we find a concise representation which can
    allow both exact and approximate summarization of
    a collection of frequent itemsets?

5
Basic Idea
{A,B,G,H}, {A,B,I,J}, {A,B,K,L}, {C,D,G,H},
{C,D,I,J}, {C,D,K,L}, {E,F,G,H}, {E,F,I,J},
{E,F,K,L}
9 itemsets, 36 items.
Covering
Picturing
{A,B}, {C,D}, {E,F}
Cartesian Product
{G,H}, {I,J}, {K,L}
1 biclique, 6 itemsets, 12 items
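The saving on this slide can be checked with a short sketch. The helper names `cartesian_product` and `cost` are illustrative and do not come from the talk; the cost is counted the way the slide counts it (bicliques, itemsets, items).

```python
from itertools import product

def cartesian_product(xs, ys):
    """All unions x ∪ y with x taken from xs and y taken from ys."""
    return {x | y for x, y in product(xs, ys)}

def cost(xs, ys):
    """Cost of one biclique as counted on the slide: (bicliques, itemsets, items)."""
    sides = list(xs) + list(ys)
    return 1, len(sides), sum(len(s) for s in sides)

# The two sides of the biclique from the slide.
X = [frozenset("AB"), frozenset("CD"), frozenset("EF")]
Y = [frozenset("GH"), frozenset("IJ"), frozenset("KL")]

covered = cartesian_product(X, Y)
print(len(covered), sum(len(s) for s in covered))  # 9 36
print(cost(X, Y))                                   # (1, 6, 12)
```

Listing the nine itemsets directly takes 36 items, while the two sides of the product take 12.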
6
Cartesian Covering
Non-frequent itemsets
7
Problem Formulation
  • Cartesian product
  • e.g.
  • Cost of a Cartesian product
  • e.g. 1 biclique, 3 itemsets, and 5 items
  • Covering
  • e.g.

How can we use Cartesian products to concisely
represent a collection of frequent itemsets?
8
Exact and Approximate Covering
Exact Representation
Cost: 2 bicliques, 4 itemsets, 6 items
False positives: none
Approximate Representation
Cost: 1 biclique, 3 itemsets, 5 items
False positives: {G,C}, {G,D}, {G,C,D}
9
Covering Maximal Frequent Itemsets
MNOVWX
CDEJKL
CDEVWX
MNOGHI
CDEGHI
PQRJKL
CDESTU
GHI, JKL
ABCGHI
ABCSTU
STU, VWX
ABC, CDE
MNO, PQR
10
Problem Reformulation
  • Given Maximal Frequent Itemsets

Exact representation
Approximate representation
Frequent Itemsets
C1 C2
C1 C2
11
Minimal Biclique Set Cover Problem
Ground Set 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
1, 2,3,4,6,7,8,9
5,10,11
12
NP-hardness
  • By reducing the Minimal Biclique Set Cover problem
    to our problems, we can show that Problem 1
    (exact) and Problem 2 (approximate) are NP-hard.
  • Minimal Biclique Set Cover is a Variant of the
    Classical Set Cover Problem

Can we use the standard set-cover greedy
algorithm?
13
Naïve greedy algorithm
  • Greedy algorithm
  • Each time, choose the biclique with the lowest
    price, i.e. its cost divided by the number of
    newly covered elements.
  • This method has a logarithmic approximation
    bound.
  • The problem?
  • The number of candidate bicliques is exponential
    in |X| and |Y|: 2^(|X|+|Y|)!
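For reference, the standard set-cover greedy rule can be sketched as follows. This is the generic textbook procedure, not the talk's biclique-specific algorithm. The ground set 1..11 and the first two covering sets are taken from the Minimal Biclique Set Cover slide; the costs and the third candidate are hypothetical values added for illustration.

```python
from fractions import Fraction

def greedy_set_cover(universe, candidates):
    """candidates: list of (cost, elements). Returns the indices picked, in order."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_index, best_price = None, None
        for i, (cost, elements) in enumerate(candidates):
            gain = len(uncovered & elements)
            if gain == 0:
                continue  # this candidate covers nothing new
            price = Fraction(cost, gain)  # cost per newly covered element
            if best_price is None or price < best_price:
                best_index, best_price = i, price
        if best_index is None:
            raise ValueError("candidates cannot cover the universe")
        chosen.append(best_index)
        uncovered -= candidates[best_index][1]
    return chosen

universe = range(1, 12)  # ground set 1..11
candidates = [
    (4, {1, 2, 3, 4, 6, 7, 8, 9}),  # cost 4 is hypothetical
    (3, {5, 10, 11}),               # cost 3 is hypothetical
    (5, {1, 2, 3, 4, 5}),           # hypothetical extra candidate
]
print(greedy_set_cover(universe, candidates))  # [0, 1]
```

The loop itself is cheap; the difficulty raised on the slide is that the candidate list is exponentially long if every biclique is enumerated.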

14
Candidate Reduction
  • Assume one side of the biclique candidate is
    known, how to choose the other side?

15
Greedy Algorithm
Biclique Candidate
Split and sort
Covering 4
Covering 3
Covering 3
Add 1st single Y-vertex Biclique
Add 2nd single Y-vertex Biclique
Add 3rd single Y-vertex Biclique
Fixed!
Cheapest sub-biclique!
Cost 1
Cost 5/7
Cost 6/8 > 5/7
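A minimal sketch of the prefix scan suggested by this slide: with one side of the biclique fixed, single Y-vertex bicliques are added one at a time and the cheapest running price is kept. The function name `cheapest_prefix` is hypothetical, the cost is assumed to be the number of vertices on both sides, and the newly covered counts (4, 3, 1) are hypothetical values chosen so that the running prices come out as 1, 5/7 and 6/8; the original figure is not recoverable from the transcript.

```python
from fractions import Fraction

def cheapest_prefix(x_side_size, new_coverage):
    """Scan Y-vertices in the given order.

    new_coverage[k] is the number of elements newly covered by the
    (k+1)-th Y-vertex. Returns (best price, number of Y-vertices kept).
    """
    best_price, best_k = None, 0
    covered = 0
    for k, gain in enumerate(new_coverage, start=1):
        covered += gain
        if covered == 0:
            continue  # price is undefined until something is covered
        price = Fraction(x_side_size + k, covered)
        if best_price is None or price < best_price:
            best_price, best_k = price, k
    return best_price, best_k

# Fixed side with 3 vertices; Y-vertices sorted by coverage.
print(cheapest_prefix(3, [4, 3, 1]))  # (Fraction(5, 7), 2)
```

Under these assumptions the scan stops improving after the second Y-vertex, since 6/8 is larger than 5/7.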
16
Approximation Bound of the Greedy Algorithm
  • The greedy SubBiclique procedure can find a
    sub-biclique whose price is less than or equal to
    e/(e-1) times the price of the optimal
    sub-biclique (cheapest price)!

17
Further Reduction
  • Using Idea 1 alone, the time complexity is
    still exponential.
  • How to reduce this further?
  • Are all the combinations equally
    important?
  • No, because some are more likely to connect to
    the Y side.
  • Our solution: frequent itemset mining!

18
Using Frequent Itemset Mining
19
Overall Algorithm
  • Step 1: Use a frequent itemset mining tool to
    find all the (one-side maximal) biclique
    candidates
  • Step 2: Calculate the cheapest sub-biclique for
    each candidate using the greedy procedure
  • Step 3: Compare all the sub-bicliques and choose
    the cheapest one
  • Step 4: If the MFIs are completely covered, stop;
    otherwise go to Step 2.
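The loop formed by Steps 2 to 4 can be sketched as below. This is a simplified illustration under several assumptions, not the authors' implementation: Step 1 is replaced by a hand-written candidate list, a pair (x, y) is taken to cover the MFI equal to x ∪ y, the cost of a biclique is the number of itemsets on both sides, and no check for false positives is made. The names `cheapest_sub_biclique` and `cover_mfis` are hypothetical.

```python
from fractions import Fraction

def cheapest_sub_biclique(x_side, y_side, uncovered):
    """Step 2: keep x_side fixed, add Y-vertices by decreasing gain, keep the cheapest prefix."""
    gains = {y: {x | y for x in x_side} & uncovered for y in y_side}
    order = sorted(y_side, key=lambda y: len(gains[y]), reverse=True)
    best, chosen, covered = None, [], set()
    for y in order:
        if not gains[y]:
            break  # the remaining Y-vertices cover nothing that is still uncovered
        chosen.append(y)
        covered |= gains[y]
        price = Fraction(len(x_side) + len(chosen), len(covered))
        if best is None or price < best[0]:
            best = (price, tuple(chosen), frozenset(covered))
    return best

def cover_mfis(mfis, candidates):
    """Steps 2-4: pick the cheapest sub-biclique until every MFI is covered."""
    uncovered = set(mfis)
    picked = []
    while uncovered:
        options = []
        for x_side, y_side in candidates:
            best = cheapest_sub_biclique(x_side, y_side, uncovered)
            if best is not None:
                options.append((best, x_side))
        if not options:
            raise ValueError("candidates cannot cover all MFIs")
        (price, ys, covered), xs = min(options, key=lambda o: o[0][0])  # Step 3
        picked.append((tuple(xs), ys))
        uncovered -= covered  # Step 4: loop again if anything is left
    return picked

# Data from the "Basic Idea" slide; the candidate list stands in for Step 1.
X = [frozenset("AB"), frozenset("CD"), frozenset("EF")]
Y = [frozenset("GH"), frozenset("IJ"), frozenset("KL")]
mfis = {x | y for x in X for y in Y}
candidates = [(X, Y), ([frozenset("AB")], [frozenset("GH")])]

picked = cover_mfis(mfis, candidates)
print(len(picked))  # 1
```

On this input the full 3-by-3 biclique has price 6/9 and covers all nine MFIs in a single round.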

20
Approximation Bound
  • Our algorithm has an e/(e-1) · (ln(n) + 1)
    approximation ratio with respect to the candidate
    set (all the sub-bicliques with one side coming
    from the frequent itemset mining).
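To get a feel for the size of this bound, the sketch below evaluates it numerically, reading the ratio as e/(e-1) · (ln(n) + 1). The transcript does not define n; it is assumed here to be the number of elements to be covered.

```python
import math

def approximation_ratio(n):
    """Evaluate e/(e-1) * (ln(n) + 1) for a positive integer n."""
    return math.e / (math.e - 1) * (math.log(n) + 1)

for n in (10, 100, 1000):
    print(n, round(approximation_ratio(n), 2))
```

The constant factor e/(e-1) is about 1.58, so the bound grows only logarithmically in n.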

21
Speed-up techniques (1)
  • Using closed itemsets for X and Y
  • Initially, X and Y each contain all the frequent
    itemsets (FI).
  • Using Cartesian products to cover the MFIs is
    similar to factorizing the MFIs
  • The MFIs' maximal factor itemsets are closed
    itemsets, whose number is much smaller!

22
Speed-up techniques (2)
Dense Graph
Sparse Graph
TRADEOFF
Frequent Itemset
Supporting Transaction
When the set of frequent itemsets is small: valuable
biclique candidates are not fully used!
When the set of frequent itemsets is large: handling
those candidates is too slow!
23
Speed-up techniques (3)
  • Iterative procedure
  • A large number of closed itemsets
  • Covering the MFIs in one pass can produce a huge
    number of biclique candidates
  • So cover the MFIs in several passes
  • The support level is reduced gradually!

24
Experiments
  • Data sets

27
Conclusion
  • We propose an interesting summarization problem
    which considers the interaction between frequent
    patterns
  • We transform this problem into a generalized
    minimal biclique covering problem and design an
    approximation algorithm with a provable bound
  • The experimental results demonstrate the
    effectiveness and efficiency of our approach

28
  • Thank you !!!

29
Reference
  • [Bayardo98] Roberto J. Bayardo Jr. Efficiently
    mining long patterns from databases. SIGMOD98.
  • [Pasquier99] Nicolas Pasquier, Yves Bastide,
    Rafik Taouil, and Lotfi Lakhal. Discovering
    frequent closed itemsets for association rules.
    ICDT99.
  • [Calders07] Toon Calders and Bart Goethals.
    Non-derivable itemset mining. Data Min. Knowl.
    Discov. 07.
  • [Han02] Jiawei Han, Jianyong Wang, Ying Lu, and
    Petre Tzvetkov. Mining top-k frequent closed
    patterns without minimum support. ICDM02.
  • [Xin06] Dong Xin, Hong Cheng, Xifeng Yan, and
    Jiawei Han. Extracting redundancy-aware top-k
    patterns. KDD06.
  • [Xin05] Dong Xin, Jiawei Han, Xifeng Yan, and
    Hong Cheng. Mining compressed frequent-pattern
    sets. VLDB05.
  • [Afrati04] Foto Afrati, Aristides Gionis, and
    Heikki Mannila. Approximating a collection of
    frequent sets. KDD04.
  • [Yan05] Xifeng Yan, Hong Cheng, Jiawei Han, and
    Dong Xin. Summarizing itemset patterns: a
    profile-based approach. KDD05.
  • [Wang06] Chao Wang and Srinivasan Parthasarathy.
    Summarizing itemset patterns using probabilistic
    models. KDD06.
  • [Jin08] Ruoming Jin, Muad Abu-Ata, Yang Xiang,
    and Ning Ruan. Effective and efficient itemset
    pattern summarization: regression-based
    approaches. KDD08.
  • [Xiang08] Yang Xiang, Ruoming Jin, David Fuhry,
    and Feodor F. Dragan. Succinct summarization of
    transactional databases: an overlapped
    hyperrectangle scheme. KDD08.

30
Related Work
  • K-itemset approximation [Afrati04].
  • Difference
  • Their work is a special case of our work
  • Their work is expensive for exact description
  • Our work uses set cover and max-k cover methods.
  • Restoring the frequency of frequent itemsets
    [Yan05, Wang06, Jin08].
  • Hyperrectangle covering problem [Xiang08].