1
Pattern Decomposition Algorithm for Data Mining Frequent Patterns
Qinghua Zou
Advisor: Dr. Wesley Chu
  • Department of Computer Science
  • University of California, Los Angeles

2
Outline
  • 1. The problem
  • 2. Importance of mining frequent sets
  • 3. Related work
  • 4. PDA, an efficient approach
  • 5. Performance analysis
  • 6. Conclusion

3
1. The Problem
D is a transaction database: 5 transactions over 9 items {a, b, c, ..., h, k}
  • Frequent Itemsets
  • a, b, c, d, e
  • ab, ac, ad, bc, bd, be , cd, ce, de
  • abc, abd, bcd, bce, bde, cde
  • bcde

D:
  1: a b c d e f
  2: a b c g
  3: a b d h
  4: b c d e k
  5: a b c
The problem: given a transaction dataset D and a minimal support, find all frequent itemsets.
Minimal support = 2
4
1.1 More terms for the problem
  • Basic terms
  • I0 = {1, 2, ..., n}: the set of all items
  • e.g., items in a supermarket, words in a sentence, etc.
  • ti, transaction: a set of items
  • e.g., the items I bought yesterday in a supermarket, the sentences in a document
  • D, data set: a set of transactions
  • I, itemset: any subset of I0
  • sup(I), support of I: the number of transactions containing I
  • frequent set: sup(I) >= minsup
  • conf(r), confidence of a rule r: {1,2} => {3}
  • conf(r) = sup({1,2,3}) / sup({1,2})  (see the sketch after this list)
  • The problem: given a minsup, how do we find all frequent sets quickly?
  • E.g. 1-item, 2-item, ..., k-item frequent sets
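To make the definitions concrete, here is a minimal Python sketch (the function and variable names are ours, not part of the slides) that computes sup(I) and conf(r) over the example dataset D from the previous slide, with minsup = 2.

```python
# Minimal sketch of sup(I) and conf(r) over the slide-3 dataset; the
# names sup and conf are illustrative, not from the presentation.
D = [
    {"a", "b", "c", "d", "e", "f"},   # transaction 1
    {"a", "b", "c", "g"},             # transaction 2
    {"a", "b", "d", "h"},             # transaction 3
    {"b", "c", "d", "e", "k"},        # transaction 4
    {"a", "b", "c"},                  # transaction 5
]

def sup(itemset, transactions):
    """sup(I): number of transactions containing every item of I."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def conf(antecedent, consequent, transactions):
    """conf(r) for the rule antecedent => consequent."""
    return sup(set(antecedent) | set(consequent), transactions) / sup(antecedent, transactions)

print(sup({"b", "c", "d", "e"}, D))   # 2 -> frequent for minsup = 2
print(conf({"a", "b"}, {"c"}, D))     # sup(abc) / sup(ab) = 3/4 = 0.75
```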

5
2. Why Mine Frequent Sets?
  • Frequent pattern mining: the foundation for several essential data mining tasks
  • association, correlation, causality
  • sequential patterns
  • partial periodicity, cyclic/temporal associations
  • Applications
  • basket data analysis, cross-marketing, catalog design, loss-leader analysis
  • clustering, classification, Web log sequences, DNA analysis, etc.
  • text mining: finding multi-word combinations

6
3. Related Work
  • 1994, Apriori: Rakesh Agrawal, IBM SJ
  • Bottom-up search using L(k) => C(k+1)
  • 1995, DHP: Jong Soo Park et al., IBM TJ
  • Direct Hashing and Pruning
  • 1997, DIC: Sergey Brin, Stanford Univ.
  • Dynamic Itemset Counting
  • 1997, MaxClique: Mohammed Zaki et al., Univ. of Rochester
  • Using cliques: L(2) => C(k), k = 3, ..., m
  • 1998, Max-Miner: Roberto Bayardo et al., IBM SJ
  • Top-down pruning
  • 1998, Pincer-Search: Lin et al., New York Univ.
  • Both bottom-up and top-down search
  • 2000, FP-tree: Jiawei Han
  • Building a frequent pattern tree

7
3.1 Apriori Algorithm Example
  • L1 = {a, b, c, d, e}
  • C2 = {ab, ac, ad, ae, bc, bd, be, cd, ce, de}
  • L2 = {ab, ac, ad, bc, bd, be, cd, ce, de}
  • C3 = {abc, abd, acd, bcd, bce, bde, cde}
  • L3 = {abc, abd, bcd, bce, bde, cde}
  • C4 = {abcd, bcde}
  • L4 = {bcde}
  • Answer: L1, L2, L3, L4

D:
  1: a b c d e f
  2: a b c g
  3: a b d h
  4: b c d e k
  5: a b c
8
Apriori Algorithm
  1. L(1) = {large 1-itemsets}
  2. for ( k = 2; L(k-1) != null; k++ )
  3.   C(k) = apriori-gen( L(k-1) )   // new candidates
  4.   forall transactions t in D
  5.     Ct = subset( C(k), t )        // candidates contained in t
  6.     forall candidates c in Ct
  7.       c.count++
  8.   L(k) = { c in C(k) | c.count >= minsup }
  9. Answer = U L(k)
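For concreteness, the following Python sketch is one straightforward reading of the pseudocode above (apriori-gen as a join step plus a prune step, then one counting pass per level); it is an illustration, not the authors' implementation.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets of `transactions` with support >= minsup."""
    items = {i for t in transactions for i in t}
    # L(1): large 1-itemsets
    L = [{frozenset([i]) for i in items
          if sum(1 for t in transactions if i in t) >= minsup}]
    k = 2
    while L[-1]:
        prev = L[-1]
        # apriori-gen, join step: unions of two (k-1)-itemsets that form a k-itemset
        C = {a | b for a in prev for b in prev if len(a | b) == k}
        # apriori-gen, prune step: every (k-1)-subset must itself be frequent
        C = {c for c in C
             if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # one pass over D, counting candidates contained in each transaction
        count = {c: 0 for c in C}
        for t in transactions:
            for c in C:
                if c <= t:
                    count[c] += 1
        L.append({c for c, n in count.items() if n >= minsup})
        k += 1
    return set().union(*L)

# The slide-3 dataset with minsup = 2 reproduces L1..L4 from the example above.
D = [set("abcdef"), set("abcg"), set("abdh"), set("bcdek"), set("abc")]
print(sorted("".join(sorted(s)) for s in apriori(D, 2)))
```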

9
3.2 Pincer-Search Algorithm
  • 01. L0 = ∅; k = 1; C1 = { {i} | i ∈ I0 }
  • 02. MFCS = { I0 }; MFS = ∅
  • 03. while Ck ≠ ∅
  • 04.   read the database and count supports for Ck and MFCS
  • 05.   remove frequent itemsets from MFCS and add them to MFS
  • 06.   determine the frequent set Lk and the infrequent set Sk
  • 07.   use Sk to update MFCS (see the sketch after this list)
  • 08.   generate the new candidate set Ck+1 (join, recover, and prune)
  • 09.   k = k + 1
  • 10. return MFS
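Line 07 is the top-down half of the search: every infrequent itemset found in this pass is used to shrink the maximum frequent candidate set. Below is a minimal Python sketch of one way to implement that update (names are ours; this is an illustration, not the authors' code).

```python
def update_mfcs(mfcs, infrequent):
    """Pincer-Search step 07 (one reading): for every candidate m in MFCS that
    contains an infrequent itemset s, replace m by the sets m - {e} for each
    item e of s, then keep only the maximal candidates."""
    for s in infrequent:
        updated = []
        for m in mfcs:
            if s <= m:
                updated.extend(m - {e} for e in s)   # drop one item of s at a time
            else:
                updated.append(m)
        mfcs = [m for m in updated if not any(m < other for other in updated)]
    return mfcs

# With S2 = {ae}, MFCS = {abcde} shrinks to {abcd, bcde}, matching the
# example on the next slide.
print(update_mfcs([frozenset("abcde")], [frozenset("ae")]))
```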

10
Pincer Search Example
  • L0 = ∅, MFCS = {abcdefghk}, MFS = ∅
  • C1 = {a, b, c, d, e, f, g, h, k}
  • L1 = {a, b, c, d, e}, MFCS = {abcde}, MFS = ∅
  • C2 = {ab, ac, ad, ae, bc, bd, be, cd, ce, de}
  • L2 = {ab, ac, ad, bc, bd, be, cd, ce, de}, MFCS = {abcd, bcde}, MFS = ∅
  • C3 = {abc, abd, acd, bcd, bce, bde, cde}
  • L3 = {abc, abd, bcd, bce, bde, cde}, MFCS = ∅, MFS = {bcde}
  • C4 = {abcd, bcde}
  • C4 = {abcd}   (bcde is removed because it is already in MFS)
  • L4 = ∅
  • Answer: L1, L2, L3, L4, MFS

D:
  1: a b c d e f
  2: a b c g
  3: a b d h
  4: b c d e k
  5: a b c
11
3.3 FP-Tree
TID   Items bought                (Ordered) frequent items
100   f, a, c, d, g, i, m, p      f, c, a, m, p
200   a, b, c, f, l, m, o         f, c, a, b, m
300   b, f, h, j, o               f, b
400   b, c, k, s, p               c, b, p
500   a, f, c, e, l, p, m, n      f, c, a, m, p

min_support = 3
  • Steps
  • Scan the DB once to find the frequent 1-itemsets (single-item patterns)
  • Order the frequent items in frequency-descending order
  • Scan the DB again and construct the FP-tree (a construction sketch follows below)
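A compact Python sketch of the two scans just listed (count item frequencies, then insert each transaction's frequent items, in frequency-descending order, into a prefix tree with a header table). The class and function names are ours, and this only illustrates tree construction, not FP-growth mining.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 1, parent, {}

def build_fp_tree(transactions, minsup):
    # Scan 1: frequent single items, ranked by descending frequency.
    freq = Counter(i for t in transactions for i in t)
    rank = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= minsup}
    root, header = Node(None, None), {}
    # Scan 2: insert each transaction's ordered frequent items into the tree.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)  # node-link list
            node = node.children[item]
    return root, header

# The table above (TIDs 100-500, min_support = 3).
D = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(D, 3)
print({i: sum(n.count for n in nodes) for i, nodes in header.items()})
# per-item totals: f 4, c 4, a 3, b 3, m 3, p 3
```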

12
FP-tree Example
Ordered frequent items (per transaction): a b c d e; a b c; a b d; b c d e; a b c

D:
  1: a b c d e f
  2: a b c g
  3: a b d h
  4: b c d e k
  5: a b c

Header table (item, frequency, head of node-links): a 4, b 4, c 4, d 3, e 2

FP-tree (item:count):
root
├─ a:4
│  └─ b:4
│     ├─ c:3
│     │  └─ d:1
│     │     └─ e:1
│     └─ d:1
└─ b:1
   └─ c:1
      └─ d:1
         └─ e:1

Frequent itemsets are found by recursively searching the tree; this is not easy.
13
4. PDA Basic Idea
[Figure: at each step, Lk and ~Lk are calculated from Dk, and Dk is decomposed with them into Dk+1; e.g. D1 (transactions 1-5) is decomposed into D2 (4 transactions).]
14
4.1 PDA terms
  • Definitions
  • Ii, itemset: a set of items, e.g. {1, 2, 3}
  • Pi, pattern: a set of itemsets, e.g. { {1,2,3}, {2,3,4} }
  • occ(Pi): the number of occurrences of pattern Pi
  • t, transaction: a pair (Pi, occ(Pi)), e.g. ( { {1,2,3}, {2,3,4} }, 2 )
  • D, data set: a set of transactions
  • D(k): the data set used for generating the k-item frequent sets
  • k-item independent: itemset I1 is k-item independent of I2 if the number of their common items is less than k (see the sketch after this list). E.g. {1,2,3} and {2,3,4} have common set {2,3}, so they are 3-item independent but not 2-item independent
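The k-item-independent test is just a bound on the overlap of two itemsets; a one-line Python sketch (the function name is ours):

```python
def k_item_independent(i1, i2, k):
    """True if itemsets i1 and i2 have fewer than k items in common."""
    return len(set(i1) & set(i2)) < k

print(k_item_independent({1, 2, 3}, {2, 3, 4}, 3))  # True:  common set {2, 3}, 2 < 3
print(k_item_independent({1, 2, 3}, {2, 3, 4}, 2))  # False: 2 common items is not < 2
```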

15
4.2 Decomposing Example
  • 1). Suppose we are given a pattern p = <abcdef, 1> in D1, where L1 = {a, b, c, d, e} and f is in ~L1. To decompose p with ~L1, we simply delete f from p, leaving us with a new pattern <abcde, 1> in D2.
  • 2). Suppose a pattern p = <abcde, 1> in D2 and ae is in ~L2. Since ae is infrequent and cannot occur in a future frequent set, we decompose p = <abcde, 1> into a composite pattern q = <{abcd, bcde}, 1> by removing a and e in turn from p.
  • 3). Suppose a pattern p = <{abcd, bcde}, 1> in D3 and acd is in ~L3. Since acd is a subset of abcd, abcd is decomposed into abc, abd, bcd. Their sizes are less than 4, so they are not qualified for D4. Itemset bcde does not contain acd, so it remains the same and is included in D4. The result is <bcde, 1> (see the decomposition sketch after this list).
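All three cases above apply the same operation: if an infrequent itemset s is contained in an itemset of the pattern, that itemset is replaced by the sets obtained by dropping one item of s at a time, and only the results that are maximal and long enough for D(k+1) are kept. Below is a minimal Python sketch of that reading (names are ours; this is not the authors' PD-decompose code).

```python
def drop_by(itemset, infrequent):
    """Decompose one itemset by one infrequent itemset it contains."""
    if not infrequent <= itemset:
        return [itemset]                        # untouched if s is not a subset
    return [itemset - {e} for e in infrequent]  # drop each item of s in turn

def pd_decompose(pattern, infrequent_sets, k):
    """Decompose a (possibly composite) pattern by infrequent k-itemsets,
    keeping only maximal results with more than k items (qualified for D(k+1))."""
    result = list(pattern)
    for s in infrequent_sets:
        result = [q for p in result for q in drop_by(p, s)]
    result = [p for p in result if len(p) > k]                    # long enough
    return [p for p in result if not any(p < q for q in result)]  # maximal only

# Case 3 above: p = <{abcd, bcde}, 1> in D3 with acd infrequent -> bcde for D4.
p = [frozenset("abcd"), frozenset("bcde")]
print(pd_decompose(p, [frozenset("acd")], 3))   # [frozenset({'b', 'c', 'd', 'e'})]
```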

16
4.2 Continued
  • Split example: t.P = {1,2,3,4,5,6,7,8}; we find 156 to be an infrequent 3-itemset. We split 156 into 15, 16, 56. Result:
  • {1,2,3,4,5,7,8}, {1,2,3,4,6,7,8}, {2,3,4,5,6,7,8}
  • Quick-split example: t.P = {1,2,3,4,5,6,7,8}; we find the infrequent 3-itemsets {156, 157, 158, 167, 168, 178, 125, 126, 127, 128, 135, 136, 137, 138, 145, 146, 147, 148}. Build the max-common tree:

[Max-common tree figure. Every listed infrequent 3-itemset contains item 1, and quick-split yields the two patterns {2,3,4,5,6,7,8} and {1,2,3,4}, which are 4-item independent.]
17
4.3 PDA Algorithm
PD ( transaction-set T )
1   D1 = { <t, 1> | t ∈ T };  k = 1
2   while ( Dk ≠ Φ ) do begin
3     forall p in Dk do                  // counting
4       forall k-itemsets s of p.IS do
5         Sup(s, Dk) += p.Occ
6     decide Lk and ~Lk
7     Dk+1 = PD-rebuild(Dk, Lk, ~Lk)     // build Dk+1
8     k++
9   end
10  Answer = ∪ Lk
18
4.4 PDA rebuilding
PD-rebuild ( Dk, Lk, ~Lk )
1   Dk+1 = Φ;  ht = an empty hash table
2   forall p in Dk do begin
3     qk = { s | s ∈ p.IS ∩ Lk };  ~qk = { t | t ∈ p.IS ∩ ~Lk }
      // qk, ~qk can be taken from the previous counting
4     u = PD-decompose(p.IS, ~qk)
5     v = { s ∈ u | s is k-item independent in u }
6     add <u - v, p.Occ> to Dk+1
7     forall s in v do
8       if s in ht then ht.s.Occ += p.Occ
9       else put <s, p.Occ> into ht
10  end
11  Dk+1 = Dk+1 ∪ { p | p in ht }
19
4.5 PDA Example
D1:  1: <a b c d e f, 1>   2: <a b c g, 1>   3: <a b d h, 1>   4: <b c d e k, 1>   5: <a b c, 1>
  ↓
D2:  1: <a b c d e, 1>   2: <a b c, 2>   3: <a b d, 1>   4: <b c d e, 1>
  ↓
D3:  1: <{abcd, bcde}, 1>   2: <a b c, 2>   3: <a b d, 1>   4: <b c d e, 1>
  ↓
D4:  1: <b c d e, 2>
  ↓
D5 = Φ
20
5. Experiments on Synthetic Databases
  • The benchmark databases are generated by a
    popular synthetic data generation program from
    IBM Quest project
  • Parameters
  • n is the number of different items (set to 1000)
  • T is the average transaction size
  • I is the average size of the maximal frequent
    itemsets,
  • D is the number of transactions
  • L is the number of the maximal frequent
    itemsets
  • T20-I6-1K: T = 20, I = 6, D = 1k
  • T20-I6-10K: T = 20, I = 6, D = 10k
  • T20-I6-100K: T = 20, I = 6, D = 100k

21
Comparison With Apriori
22
Time Distribution
23
Scale Up Experiment
24
Comparison with FP-tree
25
6. Conclusion
  • In PDA, the number of transactions shrinks quickly to 0
  • PDA shrinks both the number of transactions and the itemset lengths
  • by summing identical transactions
  • by decomposing itemsets
  • Only one scan of the database
  • No candidate set generation
  • Long patterns can be found at any iteration

26
Reference
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499.
[2] R. J. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98, pp. 85-93.
[3] Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. 1997. New Algorithms for Fast Discovery of Association Rules. In Proc. of the Third Int'l Conf. on Knowledge Discovery in Databases and Data Mining, pp. 283-286.
[4] Lin, D.-I and Kedem, Z. M. 1998. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set. In Proc. of the Sixth European Conf. on Extending Database Technology.
[5] Park, J. S., Chen, M.-S., and Yu, P. S. 1996. An Effective Hash Based Algorithm for Mining Association Rules. In Proc. of the 1995 ACM-SIGMOD Conf. on Management of Data, pp. 175-186.
[6] Brin, S., Motwani, R., Ullman, J., and Tsur, S. 1997. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proc. of the 1997 ACM-SIGMOD Conf. on Management of Data, pp. 255-264.
[7] J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
[8] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, May 2000.
[9] Bomze, I. M., Budinich, M., Pardalos, P. M., and Pelillo, M. The Maximum Clique Problem. Handbook of Combinatorial Optimization (Supplement Volume A), D.-Z. Du and P. M. Pardalos (eds.), Kluwer Academic Publishers, Boston, MA, 1999.
[10] C. Bron and J. Kerbosch. Finding all cliques of an undirected graph. Communications of the ACM, 16(9):575-577, Sept. 1973.
[11] Johnson D.B., Chu W.W., Dionisio J.D.N., Taira R.K., and Kangarloo H. Creating and Indexing Teaching Files from Free-text Patient Reports. Proc. AMIA Symp. 1999, pp. 814-818.
[12] Johnson D.B. and Chu W.W. Using n-word combinations for domain specific information retrieval. Proceedings of the Second International Conference on Information Fusion (FUSION99), San Jose, CA, July 6-9, 1999.
[13] A. Savasere, E. Omiecinski, and S. Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. In Proceedings of the 21st VLDB Conference, 1995.
[14] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, Proc. of the AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, Seattle, Washington, July 1994.
[15] H. Toivonen. Sampling Large Databases for Association Rules. In Proceedings of the 22nd International Conference on Very Large Data Bases, Bombay, India, September 1996.