Frequent Item Mining - PowerPoint PPT Presentation

Provided by: csKentEd8

Learn more at: https://www.cs.kent.edu

1
Frequent Item Mining
2
What is data mining?

Pattern Mining?
What patterns?
Why are they useful?

3
Definition: Frequent Itemset

Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2
Support
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or equal
to a minsup threshold

4
Frequent Itemsets Mining
TID Transactions
100 A, B, E
200 B, D
300 A, B, E
400 A, C
500 B, C
600 A, C
700 A, B
800 A, B, C, E
900 A, B, C
1000 A, C, E

Minimum support level: 50%
Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}
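
The example above can be checked with a minimal brute-force sketch in Python. The names DB, support_count and frequent_itemsets are illustrative and not from the slides, and the minimum support level is read as 50% of the 10 transactions.

```python
from itertools import combinations

# Transactions from the table above, keyed by TID.
DB = {
    100: {"A", "B", "E"}, 200: {"B", "D"}, 300: {"A", "B", "E"},
    400: {"A", "C"}, 500: {"B", "C"}, 600: {"A", "C"},
    700: {"A", "B"}, 800: {"A", "B", "C", "E"},
    900: {"A", "B", "C"}, 1000: {"A", "C", "E"},
}

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db.values() if set(itemset) <= t)

def frequent_itemsets(db, minsup):
    """Try every possible itemset and keep those with support >= minsup."""
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = support_count(cand, db) / len(db)
            if s >= minsup:
                result[cand] = s
    return result

for itemset, s in frequent_itemsets(DB, 0.5).items():
    print(itemset, s)
```

With these ten transactions the sketch reports {A}, {B}, {C}, {A,B} and {A,C}. It enumerates all 2^d - 1 non-empty itemsets, so it is only practical for toy data.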

5
Three Different Views of FIM

Transactional Database
How do we store a transactional database?
Horizontal, Vertical, Transaction-Item Pair
Binary Matrix
Bipartite Graph
How is FIM formulated in these different
settings?

6
Frequent Itemset Generation
Given d items, there are 2^d possible candidate
itemsets
7
Frequent Itemset Generation

Brute-force approach
Each itemset in the lattice is a candidate
frequent itemset
Count the support of each candidate by scanning
the database
Match each transaction against every candidate
Complexity ~ O(NMw) => expensive, since M = 2^d
(N transactions, M candidates, maximum transaction width w)

8
Reducing Number of Candidates

Apriori principle
If an itemset is frequent, then all of its
subsets must also be frequent
Apriori principle holds due to the following
property of the support measure
Support of an itemset never exceeds the support
of its subsets
This is known as the anti-monotone property of
support

9
Illustrating Apriori Principle
10
Illustrating Apriori Principle
Items (1-itemsets)
Pairs (2-itemsets) (no need to generate
candidates involving Coke or Eggs)
Minimum support = 3
Triplets (3-itemsets)
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning: 6 + 6 + 1 = 13
11
Apriori
R. Agrawal and R. Srikant. Fast algorithms for
mining association rules. VLDB, 487-499, 1994
13
How to Generate Candidates?

Suppose the items in L_{k-1} are listed in an order
Step 1: self-joining L_{k-1}
  insert into C_k
  select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2},
        p.item_{k-1} < q.item_{k-1}
Step 2: pruning
  forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
      if (s is not in L_{k-1}) then delete c from C_k
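
The join and prune steps can be sketched in Python as follows. This is a minimal illustration, assuming itemsets are stored as sorted tuples; the function name apriori_gen and the sample level L3 are hypothetical and not taken from the slides.

```python
from itertools import combinations

def apriori_gen(prev_level):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets.

    Itemsets are sorted tuples, which gives the item ordering the
    self-join relies on.
    """
    prev = sorted(prev_level)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Step 1 (self-join): same first k-2 items, p's last item < q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Step 2 (prune): every (k-1)-subset of c must itself be frequent.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates

# Hypothetical frequent 3-itemsets, used only to exercise the function.
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))
```

For this sample input the join produces abcd and acde; acde is pruned because its subset ade is not in L3, so only abcd survives.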

14
Challenges of Frequent Itemset Mining

Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for
candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates

15
Alternative Methods for Frequent Itemset
Generation

Representation of Database
horizontal vs vertical data layout

16
ECLAT

For each item, store a list of transaction ids
(tids)

TID-list
17
ECLAT

Determine support of any k-itemset by
intersecting tid-lists of two of its
(k-1)-subsets.
3 traversal approaches:
top-down, bottom-up and hybrid
Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become
too large for memory
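
A small Python sketch of the vertical layout and tid-list intersection, reusing the ten transactions from the Frequent Itemsets Mining slide. The helper name vertical_layout is an assumption made for illustration.

```python
from collections import defaultdict

# Horizontal layout: TID -> items (from the transaction table earlier in the deck).
DB = {
    100: {"A", "B", "E"}, 200: {"B", "D"}, 300: {"A", "B", "E"},
    400: {"A", "C"}, 500: {"B", "C"}, 600: {"A", "C"},
    700: {"A", "B"}, 800: {"A", "B", "C", "E"},
    900: {"A", "B", "C"}, 1000: {"A", "C", "E"},
}

def vertical_layout(db):
    """Convert TID -> items into item -> set of TIDs (the tid-lists)."""
    tidlists = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)

tid = vertical_layout(DB)
ab = tid["A"] & tid["B"]   # tid-list of {A,B} from two 1-itemsets
ac = tid["A"] & tid["C"]   # tid-list of {A,C}
abc = ab & ac              # tid-list of {A,B,C} from two of its 2-subsets
print(sorted(ab), len(ab))
print(sorted(abc), len(abc))
```

The support count of an itemset is simply the length of its tid-list, so no further database scan is needed once the 1-itemset lists exist.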

20
FP-growth Algorithm

Use a compressed representation of the database
using an FP-tree
Once an FP-tree has been constructed, it uses a
recursive divide-and-conquer approach to mine the
frequent itemsets
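
A minimal sketch of FP-tree construction in Python, assuming the usual convention of dropping infrequent items and inserting each transaction in descending item-frequency order. The class and function names are illustrative, and the data is the ten-transaction table from the Frequent Itemsets Mining slide rather than the database used in the FP-tree figures.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}   # item -> FPNode

def build_fp_tree(transactions, min_count):
    """Return (root, header), where header maps item -> list of its tree nodes."""
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = FPNode(None)
    header = {i: [] for i in freq}
    for t in transactions:
        # Keep frequent items only, most frequent first (ties broken by name).
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # plays the role of the node-link pointers
            child.count += 1
            node = child
    return root, header

DB = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
      {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]
root, header = build_fp_tree(DB, min_count=5)
print({item: node.count for item, node in root.children.items()})
```

Transactions that share a prefix share a path, which is where the compression comes from: here eight transactions pass through a single A node.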

21
FP-tree construction
[Figure: FP-tree after reading TID 1 (null root with nodes A:1, B:1) and after reading TID 2 (null root with nodes A:1, B:1 and B:1, C:1, D:1)]
22
FP-Tree Construction
[Figure: transaction database, the resulting FP-tree (node counts include A:7, B:5, B:3, C:3, C:1, D:1, E:1) and its header table]
Pointers are used to assist frequent itemset
generation
23
FP-growth
Conditional pattern base for D:
P = {(A:1, B:1, C:1), (A:1, B:1),
(A:1, C:1), (A:1),
(B:1, C:1)}
Recursively apply FP-growth on P
Frequent itemsets found (with sup > 1): AD,
BD, CD, ABD, ACD, BCD
[Figure: FP-tree with the paths that end in D]
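
The supports of the itemsets ending in D can be checked directly from the conditional pattern base. The sketch below is a brute-force count over the five prefix paths, not the recursive FP-growth procedure itself, and its variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Prefix paths that end in D, each with count 1, as listed in the pattern base.
pattern_base = [("A", "B", "C"), ("A", "B"), ("A", "C"), ("A",), ("B", "C")]

counts = Counter()
for path in pattern_base:
    # Every non-empty item combination on a path co-occurs with D once.
    for k in range(1, len(path) + 1):
        for combo in combinations(path, k):
            counts[combo + ("D",)] += 1

frequent = {"".join(s): c for s, c in counts.items() if c > 1}
print(frequent)
```

Only ABCD, which appears on a single path, falls below the sup > 1 threshold.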
25
Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have
identical support as their supersets
Number of frequent itemsets
Need a compact representation

26
Maximal Frequent Itemset
An itemset is maximal frequent if none of its
immediate supersets is frequent
[Figure: itemset lattice showing the maximal itemsets, the border, and the infrequent itemsets]
27
Closed Itemset

An itemset is closed if none of its immediate
supersets has the same support as the itemset
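
The two definitions can be compared on a small example. The sketch below is a hedged illustration in Python: it reuses the ten transactions from the Frequent Itemsets Mining slide with an assumed minimum support count of 2, and the function names are not from the slides.

```python
from itertools import combinations

DB = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
      {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]

def frequent_itemsets(db, min_count):
    """Brute-force support counting; returns frozenset -> support count."""
    items = sorted(set().union(*db))
    freq = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            n = sum(1 for t in db if t.issuperset(cand))
            if n >= min_count:
                freq[frozenset(cand)] = n
    return freq

def closed_and_maximal(freq):
    closed, maximal = [], []
    for x, n in freq.items():
        # Frequent immediate supersets. A superset with the same support as a
        # frequent itemset is itself frequent, so checking `freq` is enough.
        supersets = [y for y in freq if len(y) == len(x) + 1 and x < y]
        if all(freq[y] != n for y in supersets):
            closed.append(x)
        if not supersets:
            maximal.append(x)
    return closed, maximal

freq = frequent_itemsets(DB, min_count=2)
closed, maximal = closed_and_maximal(freq)
print(len(freq), len(closed), len(maximal))
print(sorted("".join(sorted(m)) for m in maximal))
```

Every maximal frequent itemset is also closed, so the maximal set is the more compact of the two, while the closed set additionally preserves the support of every frequent itemset.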

28
Maximal vs Closed Itemsets
[Figure: itemset lattice annotated with transaction IDs; some itemsets are not supported by any transaction]
29
Maximal vs Closed Frequent Itemsets
[Figure: itemset lattice with minimum support = 2, marking itemsets that are closed but not maximal and itemsets that are closed and maximal]
# Closed = 9
# Maximal = 4
30
Maximal vs Closed Itemsets
31
Association Rule Mining and FIM
32
Research Questions

How to efficiently enumerate Maximal Frequent
Itemsets?
How about Closed Frequent Itemsets?

33
Association Rule Mining

Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction

Example of Association Rules
Market-Basket transactions
{Diaper} → {Beer}, {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
34
Definition: Association Rule

Association Rule
An implication expression of the form X → Y,
where X and Y are itemsets
Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
Support (s)
Fraction of transactions that contain both X and
Y
Confidence (c)
Measures how often items in Y appear in
transactions that contain X

35
Association Rule Mining Task

Given a set of transactions T, the goal of
association rule mining is to find all rules
having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!

36
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations
All the above rules are binary partitions of the
same itemset: {Milk, Diaper, Beer}
Rules originating from the same itemset have
identical support but can have different
confidence
Thus, we may decouple the support and confidence
requirements
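
The transaction table behind these numbers is not part of the transcript. The Python sketch below therefore uses a hypothetical five-transaction market-basket table, chosen only because it is consistent with the s and c values quoted above; the function names are likewise illustrative.

```python
# Hypothetical market-basket table (an assumption, not from the transcript),
# chosen so that it is consistent with the quoted support and confidence values.
T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Support count: transactions containing every item of `itemset`."""
    return sum(1 for t in T if itemset <= t)

def rule_metrics(x, y):
    """Return (support, confidence) of the rule X -> Y."""
    both = count(x | y)
    return both / len(T), both / count(x)

rules = [
    ({"Milk", "Diaper"}, {"Beer"}),
    ({"Milk", "Beer"}, {"Diaper"}),
    ({"Diaper", "Beer"}, {"Milk"}),
    ({"Beer"}, {"Milk", "Diaper"}),
    ({"Diaper"}, {"Milk", "Beer"}),
    ({"Milk"}, {"Diaper", "Beer"}),
]
for x, y in rules:
    s, c = rule_metrics(x, y)
    print(sorted(x), "->", sorted(y), f"s={s:.1f}", f"c={c:.2f}")
```

All six rules share the numerator count({Milk, Diaper, Beer}), which is why their support is identical and only the confidence varies with the antecedent.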

37
Mining Association Rules

Two-step approach
Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup
Rule Generation
Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning
of a frequent itemset
Frequent itemset generation is still
computationally expensive

38
Computational Complexity

Given d unique items
Total number of itemsets = 2^d
Total number of possible association rules:
R = 3^d - 2^(d+1) + 1
If d = 6, R = 602 rules
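
The figure of 602 rules for d = 6 can be reproduced two ways. The sketch below assumes the standard closed form R = 3^d - 2^(d+1) + 1 and cross-checks it by summing, over every itemset of size k ≥ 2, the 2^k - 2 ways to split it into a non-empty antecedent and consequent. Function names are illustrative.

```python
from math import comb

def rule_count_closed_form(d):
    """Closed form for the number of rules X -> Y over d items."""
    return 3**d - 2**(d + 1) + 1

def rule_count_enumerated(d):
    """Each k-itemset (k >= 2) yields 2^k - 2 rules with non-empty sides."""
    return sum(comb(d, k) * (2**k - 2) for k in range(2, d + 1))

print(rule_count_closed_form(6), rule_count_enumerated(6))
```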
39
Rule Generation

Given a frequent itemset L, find all non-empty
subsets f ⊂ L such that f → L − f satisfies the
minimum confidence requirement
If {A,B,C,D} is a frequent itemset, candidate
rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate
association rules (ignoring L → ∅ and ∅ → L)

40
Rule Generation

How to efficiently generate rules from frequent
itemsets?
In general, confidence does not have an
anti-monotone property
c(ABC → D) can be larger or smaller than c(AB → D)
But confidence of rules generated from the same
itemset has an anti-monotone property
e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD)
≥ c(A → BCD)
Confidence is anti-monotone w.r.t. number of
items on the RHS of the rule

41
Rule Generation for Apriori Algorithm
Lattice of rules
Low Confidence Rule
42
Rule Generation for Apriori Algorithm

Candidate rule is generated by merging two rules
that share the same prefix in the rule consequent
join(CD → AB, BD → AC) would produce the
candidate rule D → ABC
Prune rule D → ABC if its subset AD → BC does not
have high confidence
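
A sketch of level-wise rule generation with confidence-based pruning, in Python. It grows the consequent one item at a time and only merges consequents whose rules passed minconf. The transaction table is a hypothetical market-basket example (not in the transcript), and gen_rules is an illustrative name rather than the slides' procedure.

```python
from itertools import combinations

# Hypothetical market-basket transactions, assumed for illustration only.
T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    return sum(1 for t in T if itemset <= t)

def gen_rules(itemset, minconf):
    """Rules X -> Y from one frequent itemset (assumed to occur in T)."""
    itemset = frozenset(itemset)
    full = support_count(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]   # start with 1-item consequents
    while consequents and len(consequents[0]) < len(itemset):
        kept = []
        for y in consequents:
            x = itemset - y
            conf = full / support_count(x)
            if conf >= minconf:
                rules.append((x, y, conf))
                kept.append(y)
        # Merge surviving consequents into ones that are one item larger; keep a
        # merged consequent only if all of its smaller sub-consequents survived.
        size = len(consequents[0]) + 1
        kept_set = set(kept)
        merged = {a | b for a in kept for b in kept if len(a | b) == size}
        consequents = [c for c in merged
                       if all(frozenset(s) in kept_set
                              for s in combinations(c, size - 1))]
    return rules

for x, y, conf in gen_rules({"Milk", "Diaper", "Beer"}, minconf=0.6):
    print(sorted(x), "->", sorted(y), round(conf, 2))
```

Because confidence can only drop as items move from the antecedent to the consequent, a rule that fails minconf never needs to have its consequent extended.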

43
Beyond Itemsets

Sequence Mining
Finding frequent subsequences from a collection
of sequences
Graph Mining
Finding frequent (connected) subgraphs from a
collection of graphs
Tree Mining
Finding frequent (embedded) subtrees from a set
of trees/graphs
Geometric Structure Mining
Finding frequent substructures from 3-D or 2-D
geometric graphs
Among others

44
Frequent Pattern Mining
[Figure: example graphs with node labels A to F, illustrating frequent pattern mining on structured data]
45
Why Is Frequent Pattern Mining So Important?

Application Domains
Business, biology, chemistry, WWW,
computer/networking security, ...
Summarizing the underlying datasets, providing
key insights
Basic tools for other data mining tasks
Association rule mining
Classification
Clustering
Change Detection
etc

Network motifs: recurring patterns that occur
significantly more often than in randomized networks
Do motifs have specific roles in the network?
Many possible distinct subgraphs

47
The 13 three-node connected subgraphs
48
199 4-node directed connected subgraphs
And it grows fast for larger subgraphs: 9364
5-node subgraphs, 1,530,843 6-node subgraphs
49
Finding network motifs an overview

Generation of a suitable random ensemble
(reference networks)
Network motifs detection process

Count how many times each subgraph appears
Compute statistical significance for each
subgraph: the probability of it appearing in random
networks as often as in the real network
(P-value or Z-score)

50
Ensemble of networks
Real = 5, Rand = 0.5 ± 0.6
Z-score (in standard deviations) = 7.5
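
The Z-score on this slide follows from the usual definition, (count in the real network minus mean count in the random ensemble) divided by the ensemble's standard deviation. A small Python check, with an illustrative function name:

```python
def z_score(n_real, rand_mean, rand_std):
    """How many standard deviations the real count lies above the ensemble mean."""
    return (n_real - rand_mean) / rand_std

# Values from the slide: 5 occurrences in the real network, 0.5 +/- 0.6 in random ones.
print(round(z_score(5, 0.5, 0.6), 2))
```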
51
Performance and Scalability Apriori
Implementation
54
Reducing Number of Comparisons

Candidate counting
Scan the database of transactions to determine
the support of each candidate itemset
To reduce the number of comparisons, store the
candidates in a hash structure
Instead of matching each transaction against
every candidate, match it against candidates
contained in the hashed buckets

55
Generate Hash Tree

Suppose you have 15 candidate itemsets of length
3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9},
{1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6},
{3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
Hash function
Max leaf size: max number of itemsets stored in
a leaf node (if number of candidate itemsets
exceeds max leaf size, split the node)

56
Association Rule Discovery: Hash Tree
Hash function: items 1, 4, 7 hash to one branch; 2, 5, 8 to a second; 3, 6, 9 to a third
[Figure: candidate hash tree, highlighting the branch reached by hashing on 1, 4 or 7]
57
Association Rule Discovery: Hash Tree
[Figure: candidate hash tree, highlighting the branch reached by hashing on 2, 5 or 8]
58
Association Rule Discovery: Hash Tree
[Figure: candidate hash tree, highlighting the branch reached by hashing on 3, 6 or 9]
59
Subset Operation
Given a transaction t, what are the possible
subsets of size 3?
60
Subset Operation Using Hash Tree
transaction
61
Subset Operation Using Hash Tree
transaction
1 3 6
3 4 5
1 5 9
62
Subset Operation Using Hash Tree
transaction
1 3 6
3 4 5
1 5 9
Match transaction against 11 out of 15 candidates
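
A compact hash-tree sketch in Python for the 15 candidates listed earlier. It assumes the hash function h(item) = item mod 3, which groups 1,4,7 / 2,5,8 / 3,6,9 as on the slides, and a max leaf size of 3. The transaction {1, 2, 3, 5, 6} is an assumed example, and because the tree shape depends on insertion order, the number of candidates compared need not equal the 11 quoted above.

```python
class Node:
    """Hash-tree node: a leaf stores itemsets, an interior node stores children."""
    def __init__(self, depth):
        self.depth = depth
        self.itemsets = []
        self.children = None   # becomes {bucket: Node} once the leaf is split

def h(item):
    return item % 3            # assumed hash function

def insert(node, itemset, max_leaf_size=3):
    while node.children is not None:          # walk down to a leaf
        bucket = h(itemset[node.depth])
        if bucket not in node.children:
            node.children[bucket] = Node(node.depth + 1)
        node = node.children[bucket]
    node.itemsets.append(itemset)
    # Split an overfull leaf, unless every item has already been hashed.
    if len(node.itemsets) > max_leaf_size and node.depth < len(itemset):
        pending, node.itemsets, node.children = node.itemsets, [], {}
        for c in pending:
            insert(node, c, max_leaf_size)

def reachable_leaves(node, t, start, leaves):
    if node.children is None:
        leaves[id(node)] = node               # record each leaf once
        return
    for i in range(start, len(t)):            # hash each remaining transaction item
        child = node.children.get(h(t[i]))
        if child is not None:
            reachable_leaves(child, t, i + 1, leaves)

def match(root, transaction):
    """Return (candidates contained in the transaction, candidates compared)."""
    t = tuple(sorted(transaction))
    leaves = {}
    reachable_leaves(root, t, 0, leaves)
    compared = sum(len(leaf.itemsets) for leaf in leaves.values())
    hits = sorted(c for leaf in leaves.values()
                  for c in leaf.itemsets if set(c) <= set(t))
    return hits, compared

CANDIDATES = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node(0)
for c in CANDIDATES:
    insert(root, c)

hits, compared = match(root, {1, 2, 3, 5, 6})
print(hits, compared)
```

Only the leaves reachable by hashing the transaction's items are inspected, so candidates in other buckets are never compared against the transaction.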
63
Prefix Tree Representation
Christian Borgelt, Efficient Implementations of
Apriori and Eclat, FIMI'03
64
Prefix Tree
65
Prefix Tree Structure for Counting
66
Other key optimization

Recording the items
Why is this relevant?
Transaction Tree
Organize transaction into trees
Count through two trees

67
Scalability

How to handle very large datasets?
The dataset cannot be stored in main memory
Performance on out-of-core datasets vs. performance
on in-core datasets

68
Partition: Scan Database Only Twice

Any itemset that is potentially frequent in DB
must be frequent in at least one of the
partitions of DB
Scan 1: partition database and find local
frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An
efficient algorithm for mining association rules in
large databases. In VLDB'95
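
The two scans can be sketched in Python as below. This is a simplified illustration: local mining is done by brute-force subset counting rather than an in-memory Apriori, the database is split into equal-sized chunks, and the data is the ten-transaction table from the Frequent Itemsets Mining slide. Function names are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import ceil

def count_itemsets(transactions):
    """Count every non-empty itemset occurring in the given transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            counts.update(combinations(items, k))
    return counts

def partition_mine(transactions, minsup, n_parts):
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Scan 1: itemsets that are locally frequent in at least one partition.
    candidates = set()
    for part in parts:
        local_min = minsup * len(part)
        candidates |= {s for s, c in count_itemsets(part).items() if c >= local_min}
    # Scan 2: count only those candidates over the whole database.
    global_counts = Counter()
    for t in transactions:
        for s in candidates:
            if t.issuperset(s):
                global_counts[s] += 1
    global_min = minsup * len(transactions)
    return {s: c for s, c in global_counts.items() if c >= global_min}

DB = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
      {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]
print(partition_mine(DB, minsup=0.5, n_parts=2))
```

The same relative threshold is applied inside each partition, which is what guarantees that no globally frequent itemset is missed in the first scan.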

69
DHP: Reduce the Number of Candidates

A k-itemset whose corresponding hashing bucket
count is below the threshold cannot be frequent
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae}, {bd, be, de}, ...
Frequent 1-itemsets: a, b, d, e
ab is not a candidate 2-itemset if the sum of the
counts of {ab, ad, ae} is below the support threshold
J. Park, M. Chen, and P. Yu. An effective
hash-based algorithm for mining association
rules. In SIGMOD'95
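
A minimal Python sketch of the hash-based filter for candidate 2-itemsets. While counting single items in the first pass it also hashes every pair in each transaction into a small bucket table. The toy hash function, the bucket count of 7 and the function names are assumptions made for illustration, and the data is the ten-transaction table from the Frequent Itemsets Mining slide.

```python
from collections import Counter
from itertools import combinations

DB = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
      {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]

def bucket(pair, n_buckets):
    # Deterministic toy hash (an assumption; any hash function could be used).
    return sum(ord(ch) for item in pair for ch in item) % n_buckets

def dhp_candidate_pairs(db, min_count, n_buckets=7):
    item_counts, bucket_counts = Counter(), Counter()
    for t in db:                                  # single pass over the database
        item_counts.update(t)
        for pair in combinations(sorted(t), 2):
            bucket_counts[bucket(pair, n_buckets)] += 1
    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
    # A pair stays a candidate only if its bucket count reaches the threshold.
    return [p for p in combinations(frequent_items, 2)
            if bucket_counts[bucket(p, n_buckets)] >= min_count]

print(dhp_candidate_pairs(DB, min_count=5))
```

A bucket count is an upper bound on the support of every pair hashed into it, so the filter can discard candidates but never drops a truly frequent pair. Here it removes {B, C} before the second pass.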

70
Sampling for Frequent Patterns

Select a sample of the original database, mine
frequent patterns within the sample using Apriori
Scan database once to verify frequent itemsets
found in sample; only borders of closure of
frequent patterns are checked
Example: check abcd instead of ab, ac, ..., etc.
Scan database again to find missed frequent
patterns
H. Toivonen. Sampling large databases for
association rules. In VLDB'96

71
DIC: Reduce Number of Scans

Once both A and D are determined frequent, the
counting of AD begins
Once all length-2 subsets of BCD are determined
frequent, the counting of BCD begins

[Figure: itemset lattice over A, B, C, D (from the 1-itemsets up to ABCD), with a timeline over the transactions comparing when Apriori and DIC start counting 1-itemsets, 2-itemsets and 3-itemsets]
S. Brin, R. Motwani, J. Ullman, and S. Tsur.
Dynamic itemset counting and implication rules
for market basket data. In SIGMOD'97
72
References

R. Agrawal, T. Imielinski, and A. Swami. Mining
association rules between sets of items in large
databases. SIGMOD, 207-216, 1993.
R. Agrawal and R. Srikant. Fast algorithms for
mining association rules. VLDB, 487-499, 1994.
R. J. Bayardo. Efficiently mining long patterns
from databases. SIGMOD, 85-93, 1998.

73
References

Christian Borgelt, Efficient Implementations of
Apriori and Eclat, FIMI'03
Ferenc Bodon, A fast APRIORI implementation,
FIMI'03
Ferenc Bodon, A Survey on Frequent Itemset
Mining, Technical Report, Budapest University of
Technology and Economics, 2006

74
Important websites