Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Slides: 31

Provided by: Lin172

Learn more at: https://www.cs.kent.edu

1
Cartesian Contour: A Concise Representation for a
Collection of Frequent Sets

Ruoming Jin
Kent State University

Joint work with Yang Xiang and Lin Liu (KSU)
2
Frequent Pattern Mining

Summarizing the underlying datasets, providing
key insights
Key building block for data mining toolbox
Association rule mining
Classification
Clustering
Change Detection
etc
Application Domains
Business, biology, chemistry, WWW,
computer/networking security, software
engineering, etc.

3
The Problem

The number of patterns is too large
Attempts
Maximal Frequent Itemsets
Closed Frequent Itemsets
Non-Derivable Itemsets
Compressed or Top-k Patterns
Tradeoff
Significant Information Loss
Large Size

4
Pattern Summarization

Using a small number of itemsets to best
represent the entire collection of frequent
itemsets
The Spanning Set Approach [Afrati-Gionis-Mannila,
KDD04]
Exact Description: Maximal Frequent Itemsets
Our problem
Can we find a concise representation that
allows both exact and approximate summarization of
a collection of frequent itemsets?

5
Basic Idea
{A,B,G,H}, {A,B,I,J}, {A,B,K,L}, {C,D,G,H},
{C,D,I,J}, {C,D,K,L}, {E,F,G,H}, {E,F,I,J},
{E,F,K,L}
9 itemsets, 36 items.
Covering
Picturing
{A,B}, {C,D}, {E,F}
Cartesian Product
{G,H}, {I,J}, {K,L}
1 biclique, 6 itemsets, 12 items
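
The counts on this slide can be checked with a short Python sketch. The function name `expand` and the two cost tuples are illustrative choices, not part of the original talk; the sketch only unions every itemset on one side with every itemset on the other and counts itemsets and items.

```python
from itertools import product


def expand(left, right):
    """Union every itemset on the left with every itemset on the right."""
    return {frozenset(a) | frozenset(b) for a, b in product(left, right)}


# The two sides of the Cartesian product from the slide.
left = [{"A", "B"}, {"C", "D"}, {"E", "F"}]
right = [{"G", "H"}, {"I", "J"}, {"K", "L"}]

covered = expand(left, right)

# Cost of listing every itemset explicitly: (number of itemsets, number of items).
explicit_cost = (len(covered), sum(len(s) for s in covered))

# Cost of storing only the two sides of the product.
compact_cost = (len(left) + len(right), sum(len(s) for s in left + right))

print(explicit_cost)  # (9, 36)
print(compact_cost)   # (6, 12)
```

Storing the two sides needs 6 itemsets and 12 items, while the expanded collection needs 9 itemsets and 36 items, which is the saving the slide illustrates.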
6
Cartesian Covering
Non-frequent itemsets
7
Problem Formulation

Cartesian product
e.g.
Cost of a Cartesian product
e.g. 1 biclique, 3 itemsets, and 5 items
Covering
e.g.

How can we use Cartesian products to concisely
represent a collection of frequent itemsets?
8
Exact and Approximate Covering
Exact Representation
Cost: 2 bicliques, 4 itemsets, 6 items
False positives: none
Approximate Representation
Cost: 1 biclique, 3 itemsets, 5 items
False positives: {G,C}, {G,D}, {G,C,D}
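
The slide's own example did not survive in the transcript, so the following Python sketch uses hypothetical data chosen only so that the approximate cost and the false positives match the numbers above. It assumes the frequent collection is every non-empty subset of {A,B,C,D} plus {G}, and that a Cartesian product covers every non-empty subset of each union it generates. The names `subsets` and `covered_by` are illustrative.

```python
from itertools import combinations, product


def subsets(itemset):
    """All non-empty subsets of an itemset, as frozensets."""
    items = sorted(itemset)
    return {
        frozenset(c)
        for r in range(1, len(items) + 1)
        for c in combinations(items, r)
    }


def covered_by(left, right):
    """Itemsets covered by the Cartesian product left x right."""
    out = set()
    for a, b in product(left, right):
        out |= subsets(set(a) | set(b))
    return out


# Hypothetical frequent collection (an assumption, not from the slides).
frequent = subsets("ABCD") | {frozenset("G")}

# Approximate representation: 1 biclique, 3 itemsets, 5 items.
left, right = [{"A", "B"}, {"G"}], [{"C", "D"}]
cover = covered_by(left, right)

false_positives = cover - frequent  # covered but not frequent
missed = frequent - cover           # frequent but not covered

print(sorted("".join(sorted(s)) for s in false_positives))  # ['CDG', 'CG', 'DG']
print(len(missed))  # 0
```

Under these assumptions a single product covers the whole collection at the price of three false positives, which is the tradeoff between the exact and approximate representations.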
9
Covering Maximal Frequent Itemsets
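[Figure: maximal frequent itemsets and the itemset groups that cover them]
Maximal frequent itemsets: MNOVWX, CDEJKL, CDEVWX, MNOGHI, CDEGHI, PQRJKL, CDESTU, ABCGHI, ABCSTU
Itemset groups (figure labels): {GHI, JKL}, {STU, VWX}, {ABC, CDE}, {MNO, PQR}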
10
Problem Reformulation

Given Maximal Frequent Itemsets

Exact representation
Approximate representation
Frequent Itemsets
C1 C2
C1 C2
11
Minimal Biclique Set Cover Problem
Ground Set: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
1, 2,3,4,6,7,8,9
5,10,11
12
NP-hardness

By reducing Minimal Biclique Set Cover to
our problem, we can easily prove that Problem 1
(exact) and Problem 2 (approximate) are NP-hard.
Minimal Biclique Set Cover is a Variant of the
Classical Set Cover Problem

Can we use the standard set-cover greedy
algorithm?
13
Naïve greedy algorithm

Greedy algorithm
Each time, choose the biclique with the lowest price,
i.e., its cost divided by the number of newly covered elements.
This method has a logarithmic approximation
bound.
The problem?
The number of candidate bicliques is 2^(|X|+|Y|)!
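
As a rough illustration of the standard set-cover greedy rule, the Python sketch below repeatedly picks the candidate with the lowest price, taken here as cost divided by the number of newly covered elements. The candidate list, its names, and its costs are hypothetical; the slides do not give concrete values, and enumerating all candidates this way is exactly what becomes infeasible when their number is exponential.

```python
def greedy_cover(ground, candidates):
    """Greedy set cover.

    candidates: list of (name, cost, covered_set) tuples.
    Returns the names of the chosen candidates in selection order.
    """
    uncovered = set(ground)
    chosen = []
    while uncovered:
        # Price = cost per newly covered element; skip useless candidates.
        best = min(
            (c for c in candidates if c[2] & uncovered),
            key=lambda c: c[1] / len(c[2] & uncovered),
            default=None,
        )
        if best is None:
            raise ValueError("candidates cannot cover the ground set")
        chosen.append(best[0])
        uncovered -= best[2]
    return chosen


# Hypothetical candidates over the ground set {1, ..., 11}.
ground = range(1, 12)
candidates = [
    ("B1", 4, {1, 2, 3, 4, 6, 7, 8, 9}),  # price 4/8 at the start
    ("B2", 3, {5, 10, 11}),               # price 3/3 at the start
    ("B3", 8, set(range(1, 12))),         # price 8/11 at the start
]

result = greedy_cover(ground, candidates)
print(result)  # ['B1', 'B2']
```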

14
Candidate Reduction

Assume one side of the biclique candidate is
known, how to choose the other side?

15
Greedy Algorithm
Biclique Candidate
Split and sort
Covering 4
Covering 3
Covering 3
Add 1st single Y-vertex Biclique
Add 2nd single Y-vertex Biclique
Add 3rd single Y-vertex Biclique
Fixed!
Cheapest sub-biclique!
Cost 1
Cost 5/7
Cost 6/8 > 5/7
16
Approximation Bound of the Greedy Algorithm

The greedy SubBiclique procedure can find a
sub-biclique whose price is less than or equal to
e/(e-1) times the price of the optimal sub-biclique
(cheapest price)!
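
The following Python sketch is a simplified stand-in for the greedy sub-biclique step, not the authors' exact SubBiclique procedure. It assumes the X side is fixed with cost `x_cost`, that each Y vertex adds cost 1, and that each Y vertex, paired with the fixed X side, covers a known set of ground elements. Y vertices are added one at a time by largest new coverage, and the prefix with the lowest price is kept. The data is hypothetical, chosen so that the prices come out as 1, 5/7, and 6/8 as on the previous slide.

```python
from fractions import Fraction


def cheapest_sub_biclique(x_cost, y_cover, uncovered):
    """Greedily grow the Y side and return (best price, chosen Y vertices).

    x_cost:    cost of the fixed X side (an assumption of this sketch)
    y_cover:   dict mapping each Y vertex to the elements it covers
    uncovered: elements that still need to be covered
    """
    chosen, covered = [], set()
    best = None
    remaining = dict(y_cover)
    while remaining:
        # Pick the Y vertex that covers the most new elements.
        y = max(remaining, key=lambda v: len((remaining[v] & uncovered) - covered))
        gain = (remaining.pop(y) & uncovered) - covered
        if not gain:
            break
        chosen.append(y)
        covered |= gain
        # Price of the current sub-biclique: total cost / elements covered.
        price = Fraction(x_cost + len(chosen), len(covered))
        if best is None or price < best[0]:
            best = (price, list(chosen))
    return best


# Hypothetical instance: X side of cost 3, three Y vertices.
y_cover = {"y1": {1, 2, 3, 4}, "y2": {5, 6, 7}, "y3": {4, 7, 8}}
result = cheapest_sub_biclique(3, y_cover, set(range(1, 9)))
print(result)  # (Fraction(5, 7), ['y1', 'y2'])
```

Adding the third Y vertex would raise the price to 6/8, so the two-vertex sub-biclique is kept.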

17
Further Reduction

Using only Idea 1, the time complexity is
still exponential.
How can we reduce this further?
Are all the combinations equally
important?
No, because some are more likely to connect to
the Y side.
Our solution: frequent itemset mining!

18
Using Frequent Itemset Mining
19
Overall Algorithm

Step 1: Use a frequent itemset mining tool to
find all the (one-side maximal) biclique
candidates.
Step 2: Calculate the cheapest sub-biclique for
each candidate using the greedy procedure.
Step 3: Compare all the sub-bicliques and choose the
cheapest one.
Step 4: If the MFI are totally covered, stop; otherwise go to
Step 2.

20
Approximation Bound

Our algorithm has an e/(e-1) (ln(n) + 1)
approximation ratio with respect to the candidate
set (all the sub-bicliques with one side coming
from frequent itemset mining).

21
Speed-up techniques (1)

Using closed itemsets for X and Y
Initially, X and Y each contain all the FI.
Using Cartesian products to cover MFI is similar to factorizing
MFI
MFI's maximal factor itemsets are closed
itemsets, whose number is much smaller!

22
Speed-up techniques (2)
Dense Graph
Sparse Graph
TRADEOFF
Frequent Itemset
Supporting Transaction
Few frequent itemsets: valuable biclique
candidates are not fully used!
Many frequent itemsets: handling those
candidates is too slow!
23
Speed-up techniques (3)

Iterative procedure
There is a large number of closed itemsets
Covering MFI in a single pass can produce a huge
number of biclique candidates
So MFI is covered in several passes
The support level is reduced gradually!

24
Experiments

Data sets

27
Conclusion

We propose an interesting summarization problem
that considers the interaction between frequent
patterns
We transform this problem into a generalized
minimal biclique covering problem and design an
approximation algorithm with a bound
The experimental results demonstrate the
effectiveness and efficiency of our approach

Thank you !!!

29
Reference

[Bayardo98] Roberto J. Bayardo Jr. Efficiently mining long patterns from databases. SIGMOD98.
[Pasquier99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. ICDT99.
[Calders07] Toon Calders and Bart Goethals. Non-derivable itemset mining. Data Min. Knowl. Discov. 07.
[Han02] Jiawei Han, Jianyong Wang, Ying Lu, and Petre Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM02.
[Xin06] Dong Xin, Hong Cheng, Xifeng Yan, and Jiawei Han. Extracting redundancy-aware top-k patterns. KDD06.
[Xin05] Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. Mining compressed frequent-pattern sets. VLDB05.
[Afrati04] Foto Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. KDD04.
[Yan05] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based approach. KDD05.
[Wang06] Chao Wang and Srinivasan Parthasarathy. Summarizing itemset patterns using probabilistic models. KDD06.
[Jin08] Ruoming Jin, Muad Abu-Ata, Yang Xiang, and Ning Ruan. Effective and efficient itemset pattern summarization: regression-based approaches. KDD08.
[Xiang08] Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. KDD08.