Title: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets
1Cartesian Contour: A Concise Representation for a Collection of Frequent Sets
- Ruoming Jin
- Kent State University
Joint work with Yang Xiang and Lin Liu (KSU)
2Frequent Pattern Mining
- Summarizing the underlying datasets, providing key insights
- Key building block for data mining toolbox
- Association rule mining
- Classification
- Clustering
- Change Detection
- etc
- Application Domains
- Business, biology, chemistry, WWW, computer/networking security, software engineering, ...
3The Problem
- The number of patterns is too large
- Attempt
- Maximal Frequent Itemsets
- Closed Frequent Itemsets
- Non-Derivable Itemsets
- Compressed or Top-k Patterns
-
- Tradeoff
- Significant Information Loss
- Large Size
4Pattern Summarization
- Using a small number of itemsets to best represent the entire collection of frequent itemsets
- The Spanning Set Approach [Afrati-Gionis-Mannila, KDD04]
- Exact description: Maximal Frequent Itemsets
- Our problem
- Can we find a concise representation which can
allow both exact and approximate summarization of
a collection of frequent itemsets?
5Basic Idea
{A,B,G,H}, {A,B,I,J}, {A,B,K,L}, {C,D,G,H},
{C,D,I,J}, {C,D,K,L}, {E,F,G,H}, {E,F,I,J},
{E,F,K,L}
9 itemsets, 36 items.
Covering
Picturing
{A,B}, {C,D}, {E,F}
Cartesian Product
{G,H}, {I,J}, {K,L}
1 biclique, 6 itemsets, 12 items
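The counts on this slide can be checked with a minimal Python sketch. The helper names `expand` and `size` are illustrative assumptions, not part of the original work; the sketch only shows that the Cartesian product of the two small collections regenerates the nine itemsets.

```python
from itertools import product

def expand(x_side, y_side):
    """Return every itemset x | y generated by the Cartesian product X x Y."""
    return {frozenset(x) | frozenset(y) for x, y in product(x_side, y_side)}

def size(itemsets):
    """Return (number of itemsets, total number of items)."""
    return len(itemsets), sum(len(s) for s in itemsets)

x_side = [{"A", "B"}, {"C", "D"}, {"E", "F"}]
y_side = [{"G", "H"}, {"I", "J"}, {"K", "L"}]

covered = expand(x_side, y_side)
print(size(covered))          # (9, 36): the original collection
print(size(x_side + y_side))  # (6, 12): the two sides of the single biclique
```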
6Cartesian Covering
Non-frequent itemsets
7Problem Formulation
- Cartesian product
- e.g.
- Cost of a Cartesian product
- e.g. 1 biclique, 3 itemsets, and 5 items
- Covering
- e.g.
How can we use Cartesian products to concisely
represent a collection of frequent itemsets?
8Exact and Approximate Covering
Exact Representation
Cost: 2 bicliques, 4 itemsets, 6 items
False positives: none
Approximate Representation
Cost: 1 biclique, 3 itemsets, 5 items
False positives: {G,C}, {G,D}, {G,C,D}
9Covering Maximal Frequent Itemsets
Maximal frequent itemsets (figure): MNOVWX, CDEJKL, CDEVWX, MNOGHI, CDEGHI, PQRJKL, CDESTU, ABCGHI, ABCSTU
Itemset groups (figure): {GHI, JKL}, {STU, VWX}, {ABC, CDE}, {MNO, PQR}
10Problem Reformulation
- Given Maximal Frequent Itemsets
Exact representation
Approximate representation
Frequent Itemsets
C1 C2
C1 C2
11Minimal Biclique Set Cover Problem
Ground Set: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
1, 2,3,4,6,7,8,9
5,10,11
12NP-hardness
- By reducing Minimal Biclique Set Cover to our problem, we can prove that both Problem 1 (exact) and Problem 2 (approximate) are NP-hard.
- Minimal Biclique Set Cover is a variant of the classical Set Cover problem
Can we use the standard set-cover greedy
algorithm?
13Naïve greedy algorithm
- Greedy algorithm
- Each time, choose the biclique with the lowest price (as in the standard set-cover greedy algorithm, its cost divided by the number of newly covered elements).
- This method has a logarithmic approximation bound.
- The problem?
- The number of candidate bicliques is 2^(|X|+|Y|)!
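For reference, the standard set-cover greedy rule mentioned above can be sketched as follows. This is the textbook algorithm, not the authors' implementation; the candidate names and costs are hypothetical, and price is assumed to be cost divided by the number of newly covered elements.

```python
def greedy_set_cover(ground, candidates):
    """Textbook greedy set cover.

    candidates: dict name -> (cost, set of covered elements), all hypothetical.
    Returns (names chosen in order, elements left uncovered).
    """
    uncovered = set(ground)
    chosen = []
    while uncovered:
        best_name, best_price = None, None
        for name in sorted(candidates):
            cost, elements = candidates[name]
            gain = len(elements & uncovered)
            if gain == 0:
                continue  # covers nothing new
            price = cost / gain  # cost per newly covered element
            if best_price is None or price < best_price:
                best_name, best_price = name, price
        if best_name is None:
            break  # no candidate covers what is left
        chosen.append(best_name)
        uncovered -= candidates[best_name][1]
    return chosen, uncovered

candidates = {
    "b1": (3, {1, 2, 3, 4, 6, 7, 8, 9}),
    "b2": (2, {5, 10, 11}),
    "b3": (4, {1, 2, 5}),
}
print(greedy_set_cover(range(1, 12), candidates))  # (['b1', 'b2'], set())
```

The loop scans every candidate in each round, which is exactly what becomes infeasible when the candidate set is exponential in size.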
14Candidate Reduction
- Assume one side of the biclique candidate is
known, how to choose the other side?
15Greedy Algorithm
Biclique Candidate
Split and sort
Covering 4
Covering 3
Covering 3
Add 1st single Y-vertex Biclique
Add 2nd single Y-vertex Biclique
Add 3rd single Y-vertex Biclique
Fixed!
Cheapest sub-biclique!
Cost 1
Cost 5/7
Cost 6/8 > 5/7
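The steps pictured on this slide can be sketched roughly as below. This is an illustrative reading of the figure, not the authors' code: the function name, the input layout, and the cost model (number of vertices, |X| + |Y|) are all assumptions. The toy data is chosen so that the prices come out as 1, 5/7 and 6/8, matching the numbers on the slide.

```python
from fractions import Fraction

def cheapest_sub_biclique(x_side, covers, uncovered):
    """Greedy sketch: the X side is fixed, the Y side grows one vertex at a time.

    x_side    -- the fixed X vertices (hypothetical input)
    covers    -- dict: Y vertex -> ground elements covered by X x {y}
    uncovered -- ground elements that still need covering
    Assumed cost model: number of vertices, |X| + |Y|.
    """
    # Split into single-Y-vertex bicliques and sort by coverage, largest first.
    gains = {y: c & uncovered for y, c in covers.items() if c & uncovered}
    order = sorted(gains, key=lambda y: (-len(gains[y]), y))

    best_y, best_price = [], None
    chosen, covered = [], set()
    for y in order:
        chosen.append(y)
        covered |= gains[y]
        price = Fraction(len(x_side) + len(chosen), len(covered))
        if best_price is not None and price >= best_price:
            break  # the price stopped improving, keep the previous prefix
        best_y, best_price = list(chosen), price
    return best_y, best_price

covers = {"y1": {1, 2, 3, 4}, "y2": {5, 6, 7}, "y3": {8}}
best_y, best_price = cheapest_sub_biclique(["x1", "x2", "x3"], covers, set(range(1, 9)))
print(best_y, best_price)  # ['y1', 'y2'] 5/7
```

Adding the third vertex would raise the price to 6/8, so the sketch stops at 5/7.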
16Approximation Bound of the Greedy Algorithm
- The greedy SubBiclique procedure finds a sub-biclique whose price is at most e/(e-1) times the price of the optimal (cheapest) sub-biclique!
17Further Reduction
- Using only Idea 1, the time complexity is still exponential.
- How can we reduce this further?
- Are all the combinations equally important?
- No, because some are more likely to connect to the Y side.
- Our solution: frequent itemset mining!
18Using Frequent Itemset Mining
19Overall Algorithm
- Step 1: Use the frequent itemset mining tool to find all the (one-side maximal) biclique candidates
- Step 2: Calculate the cheapest sub-biclique for each candidate using the greedy procedure
- Step 3: Compare all the sub-bicliques and choose the cheapest one
- Step 4: If MFI is totally covered, done; else go to Step 2.
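Steps 2 to 4 can be sketched as a loop, under stated assumptions: Step 1 is skipped and the candidates are passed in directly, each as a fixed X side plus a map from Y vertices to the ground elements (MFIs) they would cover. The names `sub_biclique` and `cover_all`, the data layout, and the |X| + |Y| cost model are hypothetical.

```python
from fractions import Fraction

def sub_biclique(x_side, covers, uncovered):
    """Step 2 (sketch): cheapest Y prefix for a fixed X side, or None."""
    gains = {y: c & uncovered for y, c in covers.items() if c & uncovered}
    order = sorted(gains, key=lambda y: (-len(gains[y]), y))
    best, ys, covered = None, [], set()
    for y in order:
        ys.append(y)
        covered |= gains[y]
        price = Fraction(len(x_side) + len(ys), len(covered))  # assumed cost model
        if best is not None and price >= best[0]:
            break
        best = (price, list(ys), set(covered))
    return best

def cover_all(candidates, ground):
    """Steps 2-4 (sketch): repeat until every ground element is covered."""
    uncovered, chosen = set(ground), []
    while uncovered:
        options = []
        for x_side, covers in candidates:
            found = sub_biclique(x_side, covers, uncovered)
            if found is not None:
                options.append((found[0], x_side, found[1], found[2]))
        if not options:
            break  # the candidates cannot cover what is left
        # Step 3: keep the cheapest sub-biclique of this round.
        price, x_side, ys, covered = min(options, key=lambda o: o[0])
        chosen.append((x_side, ys))
        uncovered -= covered
    return chosen, uncovered

candidates = [
    (("a", "b"), {"g": {1, 2}, "h": {3}}),
    (("c",), {"g": {4}}),
]
chosen, left = cover_all(candidates, {1, 2, 3, 4})
print(chosen, left)
```

In this toy run the first round picks the candidate with X side ("a", "b") at price 4/3, and the second round covers the remaining element with ("c",).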
20Approximation Bound
- Our algorithm has an e/(e-1) (ln(n) + 1) approximation ratio with respect to the candidate set (all the sub-bicliques with one side coming from the frequent itemset mining).
21Speed-up techniques (1)
- Using closed itemsets for X and Y
- Initially X and Y each contain all the FI.
- Using a Cartesian product to cover MFI is similar to factorizing MFI
- The maximal factor itemsets of MFIs are closed itemsets, whose number is much smaller!
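To illustrate why closed itemsets shrink the candidate pool, here is a brute-force sketch on a toy transaction database. It uses the standard definition (a frequent itemset is closed if no proper superset has the same support); the function names and the data are hypothetical, and the enumeration is exponential, so it is only suitable for tiny examples.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute force: map each frequent itemset to its support count."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            support = sum(1 for t in transactions if itemset <= t)
            if support >= min_support:
                result[itemset] = support
    return result

def closed_itemsets(frequent):
    """Keep itemsets with no proper superset of equal support."""
    return {
        s: sup for s, sup in frequent.items()
        if not any(s < t and sup == tsup for t, tsup in frequent.items())
    }

transactions = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"C"}]
freq = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(freq)
print(len(freq), sorted("".join(sorted(s)) for s in closed))  # 7 ['AB', 'ABC', 'C']
```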
22Speed-up techniques (2)
Dense Graph
Sparse Graph
TRADEOFF
Frequent Itemset
Supporting Transaction
Few frequent itemsets: valuable biclique candidates are not fully used!
Many frequent itemsets: handling those candidates is too slow!
23Speed-up techniques (3)
- Iterative procedure
- A large number of closed itemsets
- Covering MFI in a single pass can produce a huge number of biclique candidates
- So cover MFI over several passes
- Support level is reduced gradually!
24Experiments
27Conclusion
- We propose an interesting summarization problem which considers the interaction between frequent patterns
- We transform this problem into a generalized minimal biclique covering problem and design an approximation algorithm with a bound
- The experimental results demonstrate the effectiveness and efficiency of our approach
29Reference
- [Bayardo98] Roberto J. Bayardo Jr. Efficiently mining long patterns from databases. SIGMOD98.
- [Pasquier99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. ICDT99.
- [Calder07] Toon Calders and Bart Goethals. Non-derivable itemset mining. Data Min. Knowl. Discov. 07.
- [Han02] Jiawei Han, Jianyong Wang, Ying Lu, and Petre Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM02.
- [Xin06] Dong Xin, Hong Cheng, Xifeng Yan, and Jiawei Han. Extracting redundancy-aware top-k patterns. KDD06.
- [Xin05] Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. Mining compressed frequent-pattern sets. VLDB05.
- [Afrati04] Foto Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. KDD04.
- [Yan05] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based approach. KDD05.
- [Wang06] Chao Wang and Srinivasan Parthasarathy. Summarizing itemset patterns using probabilistic models. KDD06.
- [Jin08] Ruoming Jin, Muad Abu-Ata, Yang Xiang, and Ning Ruan. Effective and efficient itemset pattern summarization: regression-based approaches. KDD08.
- [Xiang08] Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. KDD08.
30Related Work
- K-itemset approximation [Afrati04].
- Difference
- their work is a special case of our work
- their work is expensive for exact description
- Our work uses set cover and max-k cover methods.
- Restoring the frequency of frequent itemsets [Yan05, Wang06, Jin08].
- Hyperrectangle covering problem [Xiang08].