1
Pattern Decomposition Algorithm for Data Mining Frequent Patterns
Qinghua Zou
Advisor: Dr. Wesley Chu
  • Department of Computer Science
  • University of California, Los Angeles

2
Outline
  • 1. The problem
  • 2. Importance of mining frequent sets
  • 3. Related work
  • 4. PDA, an efficient approach
  • 5. Performance analysis
  • 6. Conclusion

3
1. The Problem
D is a transaction database: 5 transactions over 9 items {a, b, c, ..., h, k}
  • Frequent Itemsets
  • a, b, c, d, e
  • ab, ac, ad, bc, bd, be , cd, ce, de
  • abc, abd, bcd, bce, bde, cde
  • bcde

D:
  1: a b c d e f
  2: a b c g
  3: a b d h
  4: b c d e k
  5: a b c
The problem: given a transaction dataset D and a minimal support, find all frequent itemsets.
Minimal support = 2
4
1.1 More terms for the problem
  • Basic terms
  • I0 = {1, 2, ..., n}: the set of all items
  • e.g., items in a supermarket, words in a sentence, etc.
  • ti, transaction: a set of items
  • e.g., the items I bought yesterday in a supermarket, the sentences in a document
  • D, data set: a set of transactions
  • I, itemset: any subset of I0
  • sup(I), support of I: the number of transactions containing I
  • frequent set: sup(I) >= minsup
  • conf(r), confidence of a rule r: {1,2} => {3}
  • conf(r) = sup({1,2,3}) / sup({1,2})  (see the sketch after this list)
  • The problem: given a minsup, how do we find all frequent sets quickly?
  • E.g. 1-item, 2-item, ..., k-item frequent sets
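To make the definitions concrete, here is a minimal Python sketch (the function and variable names are ours, not part of the slides) that computes sup(I) and conf(r) over the example dataset D from the previous slide, with minsup = 2.

```python
# Minimal sketch of sup(I) and conf(r) over the slide-3 dataset; the
# names sup and conf are illustrative, not from the presentation.
D = [
    {"a", "b", "c", "d", "e", "f"},   # transaction 1
    {"a", "b", "c", "g"},             # transaction 2
    {"a", "b", "d", "h"},             # transaction 3
    {"b", "c", "d", "e", "k"},        # transaction 4
    {"a", "b", "c"},                  # transaction 5
]

def sup(itemset, transactions):
    """sup(I): number of transactions containing every item of I."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def conf(antecedent, consequent, transactions):
    """conf(r) for the rule antecedent => consequent."""
    return sup(set(antecedent) | set(consequent), transactions) / sup(antecedent, transactions)

print(sup({"b", "c", "d", "e"}, D))   # 2 -> frequent for minsup = 2
print(conf({"a", "b"}, {"c"}, D))     # sup(abc) / sup(ab) = 3/4 = 0.75
```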

5
2. Why Mine Frequent Sets?
  • Frequent pattern mining: the foundation for several essential data mining tasks
  • association, correlation, causality
  • sequential patterns
  • partial periodicity, cyclic/temporal associations
  • Applications
  • basket data analysis, cross-marketing, catalog design, loss-leader analysis
  • clustering, classification, Web log sequences, DNA analysis, etc.
  • text mining: finding multi-word combinations

6
3. Related Work
  • 1994, Apriori: Rakesh Agrawal, IBM SJ
  • Bottom-up search using L(k) => C(k+1)
  • 1995, DHP: Jong Soo Park et al., IBM TJ
  • Direct Hashing and Pruning
  • 1997, DIC: Sergey Brin, Stanford Univ.
  • Dynamic Itemset Counting
  • 1997, MaxClique: Mohammed Zaki et al., Univ. of Rochester
  • Using cliques: L(2) => C(k), k = 3, ..., m
  • 1998, Max-Miner: Roberto Bayardo et al., IBM SJ
  • Top-down pruning
  • 1998, Pincer-Search: Lin et al., New York Univ.
  • Both bottom-up and top-down search
  • 2000, FP-tree: Jiawei Han
  • Building a frequent pattern tree

7
3.1 Apriori Algorithm Example
  • L1 = {a, b, c, d, e}
  • C2 = {ab, ac, ad, ae, bc, bd, be, cd, ce, de}
  • L2 = {ab, ac, ad, bc, bd, be, cd, ce, de}
  • C3 = {abc, abd, acd, bcd, bce, bde, cde}
  • L3 = {abc, abd, bcd, bce, bde, cde}
  • C4 = {abcd, bcde}
  • L4 = {bcde}
  • Answer: L1, L2, L3, L4

D:
  1: a b c d e f
  2: a b c g
  3: a b d h
  4: b c d e k
  5: a b c
8
Apriori Algorithm
  1. L(1) = {large 1-itemsets}
  2. for ( k = 2; L(k-1) != null; k++ )
  3.   C(k) = apriori-gen( L(k-1) )   // new candidates
  4.   forall transactions t in D
  5.     Ct = subset( C(k), t )        // candidates contained in t
  6.     forall candidates c in Ct
  7.       c.count++
  8.   L(k) = { c in C(k) | c.count >= minsup }
  9. Answer = U L(k)
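For concreteness, the following Python sketch is one straightforward reading of the pseudocode above (apriori-gen as a join step plus a prune step, then one counting pass per level); it is an illustration, not the authors' implementation.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets of `transactions` with support >= minsup."""
    items = {i for t in transactions for i in t}
    # L(1): large 1-itemsets
    L = [{frozenset([i]) for i in items
          if sum(1 for t in transactions if i in t) >= minsup}]
    k = 2
    while L[-1]:
        prev = L[-1]
        # apriori-gen, join step: unions of two (k-1)-itemsets that form a k-itemset
        C = {a | b for a in prev for b in prev if len(a | b) == k}
        # apriori-gen, prune step: every (k-1)-subset must itself be frequent
        C = {c for c in C
             if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # one pass over D, counting candidates contained in each transaction
        count = {c: 0 for c in C}
        for t in transactions:
            for c in C:
                if c <= t:
                    count[c] += 1
        L.append({c for c, n in count.items() if n >= minsup})
        k += 1
    return set().union(*L)

# The slide-3 dataset with minsup = 2 reproduces L1..L4 from the example above.
D = [set("abcdef"), set("abcg"), set("abdh"), set("bcdek"), set("abc")]
print(sorted("".join(sorted(s)) for s in apriori(D, 2)))
```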

9
3.2 Pincer-Search Algorithm
  • 01. L0 = ∅; k = 1; C1 = { {i} | i ∈ I0 }
  • 02. MFCS = { I0 }; MFS = ∅
  • 03. while Ck ≠ ∅
  • 04.   read the database and count supports for Ck and MFCS
  • 05.   remove frequent itemsets from MFCS and add them to MFS
  • 06.   determine the frequent set Lk and the infrequent set Sk
  • 07.   use Sk to update MFCS (see the sketch after this list)
  • 08.   generate the new candidate set Ck+1 (join, recover, and prune)
  • 09.   k = k + 1
  • 10. return MFS
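Line 07 is the top-down half of the search: every infrequent itemset found in this pass is used to shrink the maximum frequent candidate set. Below is a minimal Python sketch of one way to implement that update (names are ours; this is an illustration, not the authors' code).

```python
def update_mfcs(mfcs, infrequent):
    """Pincer-Search step 07 (one reading): for every candidate m in MFCS that
    contains an infrequent itemset s, replace m by the sets m - {e} for each
    item e of s, then keep only the maximal candidates."""
    for s in infrequent:
        updated = []
        for m in mfcs:
            if s <= m:
                updated.extend(m - {e} for e in s)   # drop one item of s at a time
            else:
                updated.append(m)
        mfcs = [m for m in updated if not any(m < other for other in updated)]
    return mfcs

# With S2 = {ae}, MFCS = {abcde} shrinks to {abcd, bcde}, matching the
# example on the next slide.
print(update_mfcs([frozenset("abcde")], [frozenset("ae")]))
```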

10
Pincer Search Example
  • L0 = ∅, MFCS = {abcdefghk}, MFS = ∅
  • C1 = {a, b, c, d, e, f, g, h, k}
  • L1 = {a, b, c, d, e}, MFCS = {abcde}, MFS = ∅
  • C2 = {ab, ac, ad, ae, bc, bd, be, cd, ce, de}
  • L2 = {ab, ac, ad, bc, bd, be, cd, ce, de}, MFCS = {abcd, bcde}, MFS = ∅
  • C3 = {abc, abd, acd, bcd, bce, bde, cde}
  • L3 = {abc, abd, bcd, bce, bde, cde}, MFCS = ∅, MFS = {bcde}
  • C4 = {abcd, bcde}
  • C4 = {abcd}   (bcde is removed because it is already in MFS)
  • L4 = ∅
  • Answer: L1, L2, L3, L4, MFS

D:
  1: a b c d e f
  2: a b c g
  3: a b d h
  4: b c d e k
  5: a b c
11
3.3 FP-Tree
TID   Items bought                (Ordered) frequent items
100   f, a, c, d, g, i, m, p      f, c, a, m, p
200   a, b, c, f, l, m, o         f, c, a, b, m
300   b, f, h, j, o               f, b
400   b, c, k, s, p               c, b, p
500   a, f, c, e, l, p, m, n      f, c, a, m, p

min_support = 3
  • Steps
  • Scan the DB once to find the frequent 1-itemsets (single-item patterns)
  • Order the frequent items in frequency-descending order
  • Scan the DB again and construct the FP-tree (a construction sketch follows below)
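A compact Python sketch of the two scans just listed (count item frequencies, then insert each transaction's frequent items, in frequency-descending order, into a prefix tree with a header table). The class and function names are ours, and this only illustrates tree construction, not FP-growth mining.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 1, parent, {}

def build_fp_tree(transactions, minsup):
    # Scan 1: frequent single items, ranked by descending frequency.
    freq = Counter(i for t in transactions for i in t)
    rank = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= minsup}
    root, header = Node(None, None), {}
    # Scan 2: insert each transaction's ordered frequent items into the tree.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)  # node-link list
            node = node.children[item]
    return root, header

# The table above (TIDs 100-500, min_support = 3).
D = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(D, 3)
print({i: sum(n.count for n in nodes) for i, nodes in header.items()})
# per-item totals: f 4, c 4, a 3, b 3, m 3, p 3
```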

12
FP-tree Example
Ordered frequent items (per transaction): a b c d e; a b c; a b d; b c d e; a b c

D:
  1: a b c d e f
  2: a b c g
  3: a b d h
  4: b c d e k
  5: a b c

Header table (item, frequency, head of node-links): a 4, b 4, c 4, d 3, e 2

FP-tree (item:count):
root
├─ a:4
│  └─ b:4
│     ├─ c:3
│     │  └─ d:1
│     │     └─ e:1
│     └─ d:1
└─ b:1
   └─ c:1
      └─ d:1
         └─ e:1

Frequent itemsets are found by recursively searching the tree; this is not easy.
13
4. PDA Basic Idea
[Figure: at each step, Lk and ~Lk are calculated from Dk, and Dk is decomposed with them into Dk+1; e.g. D1 (transactions 1-5) is decomposed into D2 (4 transactions).]
14
4.1 PDA terms
  • Definitions
  • Ii, itemset: a set of items, e.g. {1, 2, 3}
  • Pi, pattern: a set of itemsets, e.g. { {1,2,3}, {2,3,4} }
  • occ(Pi): the number of occurrences of pattern Pi
  • t, transaction: a pair (Pi, occ(Pi)), e.g. ( { {1,2,3}, {2,3,4} }, 2 )
  • D, data set: a set of transactions
  • D(k): the data set used for generating the k-item frequent sets
  • k-item independent: itemset I1 is k-item independent of I2 if the number of their common items is less than k (see the sketch after this list). E.g. {1,2,3} and {2,3,4} have common set {2,3}, so they are 3-item independent but not 2-item independent
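The k-item-independent test is just a bound on the overlap of two itemsets; a one-line Python sketch (the function name is ours):

```python
def k_item_independent(i1, i2, k):
    """True if itemsets i1 and i2 have fewer than k items in common."""
    return len(set(i1) & set(i2)) < k

print(k_item_independent({1, 2, 3}, {2, 3, 4}, 3))  # True:  common set {2, 3}, 2 < 3
print(k_item_independent({1, 2, 3}, {2, 3, 4}, 2))  # False: 2 common items is not < 2
```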

15
4.2 Decomposing Example
  • 1). Suppose we are given a pattern p = <abcdef, 1> in D1, where L1 = {a, b, c, d, e} and f is in ~L1. To decompose p with ~L1, we simply delete f from p, leaving us with a new pattern <abcde, 1> in D2.
  • 2). Suppose a pattern p = <abcde, 1> in D2 and ae is in ~L2. Since ae is infrequent and cannot occur in a future frequent set, we decompose p = <abcde, 1> into a composite pattern q = <{abcd, bcde}, 1> by removing a and e in turn from p.
  • 3). Suppose a pattern p = <{abcd, bcde}, 1> in D3 and acd is in ~L3. Since acd is a subset of abcd, abcd is decomposed into abc, abd, bcd. Their sizes are less than 4, so they are not qualified for D4. Itemset bcde does not contain acd, so it remains the same and is included in D4. The result is <bcde, 1> (see the decomposition sketch after this list).
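All three cases above apply the same operation: if an infrequent itemset s is contained in an itemset of the pattern, that itemset is replaced by the sets obtained by dropping one item of s at a time, and only the results that are maximal and long enough for D(k+1) are kept. Below is a minimal Python sketch of that reading (names are ours; this is not the authors' PD-decompose code).

```python
def drop_by(itemset, infrequent):
    """Decompose one itemset by one infrequent itemset it contains."""
    if not infrequent <= itemset:
        return [itemset]                        # untouched if s is not a subset
    return [itemset - {e} for e in infrequent]  # drop each item of s in turn

def pd_decompose(pattern, infrequent_sets, k):
    """Decompose a (possibly composite) pattern by infrequent k-itemsets,
    keeping only maximal results with more than k items (qualified for D(k+1))."""
    result = list(pattern)
    for s in infrequent_sets:
        result = [q for p in result for q in drop_by(p, s)]
    result = [p for p in result if len(p) > k]                    # long enough
    return [p for p in result if not any(p < q for q in result)]  # maximal only

# Case 3 above: p = <{abcd, bcde}, 1> in D3 with acd infrequent -> bcde for D4.
p = [frozenset("abcd"), frozenset("bcde")]
print(pd_decompose(p, [frozenset("acd")], 3))   # [frozenset({'b', 'c', 'd', 'e'})]
```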

16
4.2 Continued
  • Split example: t.P = {1,2,3,4,5,6,7,8}; we find 156 to be an infrequent 3-itemset. We split 156 into 15, 16, 56. Result:
  • {1,2,3,4,5,7,8}, {1,2,3,4,6,7,8}, {2,3,4,5,6,7,8}
  • Quick-split example: t.P = {1,2,3,4,5,6,7,8}; we find the infrequent 3-itemsets {156, 157, 158, 167, 168, 178, 125, 126, 127, 128, 135, 136, 137, 138, 145, 146, 147, 148}. Build the max-common tree:

[Max-common tree figure. Every listed infrequent 3-itemset contains item 1, and quick-split yields the two patterns {2,3,4,5,6,7,8} and {1,2,3,4}, which are 4-item independent.]
17
4.3 PDA Algorithm
PD ( transaction-set T )
1   D1 = { <t, 1> | t ∈ T };  k = 1
2   while ( Dk ≠ Φ ) do begin
3     forall p in Dk do                  // counting
4       forall k-itemsets s of p.IS do
5         Sup(s, Dk) += p.Occ
6     decide Lk and ~Lk
7     Dk+1 = PD-rebuild(Dk, Lk, ~Lk)     // build Dk+1
8     k++
9   end
10  Answer = ∪ Lk
18
4.4 PDA rebuilding
PD-rebuild ( Dk, Lk, ~Lk )
1   Dk+1 = Φ;  ht = an empty hash table
2   forall p in Dk do begin
3     qk = { s | s ∈ p.IS ∩ Lk };  ~qk = { t | t ∈ p.IS ∩ ~Lk }
      // qk, ~qk can be taken from the previous counting
4     u = PD-decompose(p.IS, ~qk)
5     v = { s ∈ u | s is k-item independent in u }
6     add <u - v, p.Occ> to Dk+1
7     forall s in v do
8       if s in ht then ht.s.Occ += p.Occ
9       else put <s, p.Occ> into ht
10  end
11  Dk+1 = Dk+1 ∪ { p | p in ht }
19
4.5 PDA Example
D1:  1: <a b c d e f, 1>   2: <a b c g, 1>   3: <a b d h, 1>   4: <b c d e k, 1>   5: <a b c, 1>
  ↓
D2:  1: <a b c d e, 1>   2: <a b c, 2>   3: <a b d, 1>   4: <b c d e, 1>
  ↓
D3:  1: <{abcd, bcde}, 1>   2: <a b c, 2>   3: <a b d, 1>   4: <b c d e, 1>
  ↓
D4:  1: <b c d e, 2>
  ↓
D5 = Φ
20
5. Experiments on Synthetic Databases
  • The benchmark databases are generated by a
    popular synthetic data generation program from
    IBM Quest project
  • Parameters
  • n is the number of different items (set to 1000)
  • T is the average transaction size
  • I is the average size of the maximal frequent
    itemsets,
  • D is the number of transactions
  • L is the number of the maximal frequent
    itemsets
  • T20-I6-1K: T = 20, I = 6, D = 1k
  • T20-I6-10K: T = 20, I = 6, D = 10k
  • T20-I6-100K: T = 20, I = 6, D = 100k

21
Comparison With Apriori
22
Time Distribution
23
Scale Up Experiment
24
Comparison with FP-tree
25
6. Conclusion
  • In PDA, the number of transactions shrinks quickly to 0
  • PDA shrinks both the number of transactions and the itemset lengths
  • by summing identical transactions
  • by decomposing itemsets
  • Only one scan of the database
  • No candidate set generation
  • Long patterns can be found at any iteration

26
Reference
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499.
[2] R. J. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98, pp. 85-93.
[3] Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. 1997. New Algorithms for Fast Discovery of Association Rules. In Proc. of the Third Int'l Conf. on Knowledge Discovery in Databases and Data Mining, pp. 283-286.
[4] Lin, D.-I and Kedem, Z. M. 1998. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set. In Proc. of the Sixth European Conf. on Extending Database Technology.
[5] Park, J. S., Chen, M.-S., and Yu, P. S. 1996. An Effective Hash Based Algorithm for Mining Association Rules. In Proc. of the 1995 ACM-SIGMOD Conf. on Management of Data, pp. 175-186.
[6] Brin, S., Motwani, R., Ullman, J., and Tsur, S. 1997. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proc. of the 1997 ACM-SIGMOD Conf. on Management of Data, pp. 255-264.
[7] J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
[8] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, May 2000.
[9] Bomze, I. M., Budinich, M., Pardalos, P. M., and Pelillo, M. The Maximum Clique Problem. Handbook of Combinatorial Optimization (Supplement Volume A), D.-Z. Du and P. M. Pardalos (eds.), Kluwer Academic Publishers, Boston, MA, 1999.
[10] C. Bron and J. Kerbosch. Finding all cliques of an undirected graph. Communications of the ACM, 16(9):575-577, Sept. 1973.
[11] Johnson D.B., Chu W.W., Dionisio J.D.N., Taira R.K., and Kangarloo H. Creating and Indexing Teaching Files from Free-text Patient Reports. Proc. AMIA Symp. 1999, pp. 814-818.
[12] Johnson D.B. and Chu W.W. Using n-word combinations for domain specific information retrieval. Proceedings of the Second International Conference on Information Fusion (FUSION99), San Jose, CA, July 6-9, 1999.
[13] A. Savasere, E. Omiecinski, and S. Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. In Proceedings of the 21st VLDB Conference, 1995.
[14] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, Proc. of the AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, Seattle, Washington, July 1994.
[15] H. Toivonen. Sampling Large Databases for Association Rules. In Proceedings of the 22nd International Conference on Very Large Data Bases, Bombay, India, September 1996.