Title: Chapter 5: Mining Frequent Patterns, Associations, and Correlations
1. Chapter 5: Mining Frequent Patterns, Associations, and Correlations
2. What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects
- Applications: basket data analysis, cross-marketing, catalog design, clustering, classification, etc.
- Examples (rule form: Body ⇒ Head [support, confidence]):
  - buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
3. Basic Concepts
- Given:
  - A database of transactions
  - Each transaction is a set of items (e.g., items purchased by a customer)
- Find:
  - Rules that correlate one set of items with another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Mining steps:
  - Find all frequent itemsets
  - Generate strong association rules (those meeting minimum support and minimum confidence)
4. Support and Confidence
- For the rule X ∧ Y ⇒ Z:
  - Support = P(X ∪ Y ∪ Z)
  - Confidence = P(Z | X ∪ Y)
[Venn diagram: transactions where the customer buys beer (15%), buys diapers, or buys both (10%). For the rule B ⇒ D: support = 10%, confidence = 10/15 ≈ 67%.]
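As a concrete illustration of these definitions, here is a minimal Python sketch. The 20-transaction toy database is invented so that the rule beer ⇒ diaper comes out at the support (10%) and confidence (~67%) shown in the diagram.

```python
# Assumed toy database: 2 baskets with beer and diapers, 1 with beer
# only, 7 with diapers only, 10 unrelated -- 20 transactions in total.
transactions = (
    [{"beer", "diaper"}] * 2
    + [{"beer"}]
    + [{"diaper"}] * 7
    + [{"milk"}, {"bread"}] * 5
)

def support(itemset, db):
    """P(itemset): fraction of transactions containing all its items."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"beer", "diaper"}, transactions))       # 0.1 (10%)
print(confidence({"beer"}, {"diaper"}, transactions))  # 0.666... (~67%)
```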
5. Mining Example
- Min. support = 50%, min. confidence = 70%
- Rule A ⇒ C:
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.7% (fails min. confidence)
- Rule C ⇒ A:
  - support = support({C, A}) = 50%
  - confidence = support({C, A}) / support({C}) = 100%
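The slide does not show the underlying transactions; the four-transaction database below is an assumption chosen to be consistent with the stated numbers, reusing support() and confidence() from the sketch above.

```python
# Assumed database: support({A,C}) = 2/4 = 50%, support({A}) = 3/4,
# support({C}) = 2/4, matching the slide's figures.
D = [{"A", "C", "D"}, {"A", "B", "C"}, {"A", "B", "E"}, {"B", "D", "E"}]

print(confidence({"A"}, {"C"}, D))  # 0.666... -> A ⇒ C fails min_conf = 70%
print(confidence({"C"}, {"A"}, D))  # 1.0      -> C ⇒ A is a strong rule
```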
6. Kinds of Rules
- buys(x, "SQLServer") ⇒ buys(x, "DBMiner")
- age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "computer")
- age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "notebook")
- Boolean vs. quantitative
- Single-dimensional vs. multi-dimensional
- Single-level vs. multiple-level
7. The Apriori Algorithm
- Mines single-dimensional Boolean association rules
- Finding the frequent itemsets:
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
  - Join step: Ck is generated by joining Lk-1 with itself
  - Prune step: remove any candidate that has an infrequent (k-1)-subset
- The Apriori principle:
  - Any subset of a frequent itemset must itself be frequent
  - I.e., if {A, B, C} is a frequent itemset, then {A, B}, {A, C}, and {B, C} must all be frequent itemsets
- Level-wise flow: C1 → L1 → C2 → L2 → ... → Ck → Lk
8. The Apriori Algorithm
- Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t;
        Lk+1 = candidates in Ck+1 with count ≥ min_support;
    end
    return ∪k Lk;
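A runnable Python sketch of this pseudo-code. The frozenset-union join below is a simplification that leans on the prune step, and the example database and min_count are assumptions for illustration.

```python
from itertools import combinations

def apriori_gen(Lk, k):
    """Join: union pairs of frequent k-itemsets into (k+1)-item candidates;
    prune: drop any candidate that has an infrequent k-subset."""
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

def apriori(db, min_count):
    """Level-wise search, one pass over db per candidate level."""
    items = {i for t in db for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in db) >= min_count}
    frequent, k = set(Lk), 1
    while Lk:
        Ck1 = apriori_gen(Lk, k)
        counts = dict.fromkeys(Ck1, 0)
        for t in db:                       # one scan of db per level
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_count}
        frequent |= Lk
        k += 1
    return frequent

# Assumed 4-transaction database; min support 50% means count >= 2.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s in sorted(apriori(D, 2), key=lambda s: (len(s), sorted(s))):
    print(set(s))   # {A} {B} {C} {E} {A,C} {B,C} {B,E} {C,E} {B,C,E}
```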
9. The Apriori Algorithm: Example
[Figure: a worked trace on an example database D. Scan D to count C1 and keep L1; join L1 with itself and prune to form C2; scan D to count C2 and keep L2; join and prune to form C3; scan D to count C3 and keep L3.]
10. Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Joining L3 ⋈ L3:
  - abcd, from abc and abd
  - acde, from acd and ace
- Pruning:
  - acde is removed because its subset ade is not in L3
- C4 = {abcd}
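Feeding this L3 to the apriori_gen sketch above reproduces the result. Note that the union-style join also produces abce, which the prune step removes because bce is not in L3; the classic join, which merges only itemsets agreeing on their first k-1 items, would never generate it.

```python
L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = apriori_gen(L3, 3)
print(["".join(sorted(c)) for c in C4])   # ['abcd']: acde and abce are pruned
```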
11. Generating Rules
- From the frequent itemset {2, 3, 5} (support count 2):
  - 2 ∧ 3 ⇒ 5, confidence = 2/2 = 100%
  - 2 ∧ 5 ⇒ 3, confidence = 2/3 = 67%
  - 3 ∧ 5 ⇒ 2, confidence = 2/2 = 100%
  - 2 ⇒ 3 ∧ 5, confidence = 2/3 = 67%
  - 3 ⇒ 2 ∧ 5, confidence = 2/3 = 67%
  - 5 ⇒ 2 ∧ 3, confidence = 2/3 = 67%
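A sketch of rule generation from one frequent itemset. The four-transaction database is an assumption chosen to be consistent with the counts above.

```python
from itertools import combinations

def rules_from(itemset, supp_count, db, min_conf):
    """Emit lhs ⇒ rhs for every non-empty proper subset lhs of itemset,
    keeping rules whose confidence meets min_conf."""
    items = frozenset(itemset)
    out = []
    for r in range(1, len(items)):
        for lhs in map(frozenset, combinations(items, r)):
            conf = supp_count / sum(lhs <= t for t in db)
            if conf >= min_conf:
                out.append((set(lhs), set(items - lhs), conf))
    return out

# Assumed database in which {2, 3, 5} has support count 2:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for lhs, rhs, conf in rules_from({2, 3, 5}, 2, D, min_conf=0.7):
    print(lhs, "⇒", rhs, f"({conf:.0%})")  # {2,3} ⇒ {5} and {3,5} ⇒ {2} at 100%
```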
12. Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent (see the sketch after this list)
- Transaction reduction: a transaction that contains no frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine a subset of the given data with a lower support threshold; less accurate but more efficient
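A minimal sketch of the first idea, in the spirit of the hash-based (DHP) approach of Park, Chen, and Yu, restricted to 2-itemsets; the bucket count and hash choice here are arbitrary assumptions.

```python
from itertools import combinations

def build_pair_filter(db, min_count, n_buckets=101):
    """During a scan, hash every 2-itemset of every transaction into a
    bucket. A pair whose bucket total is below min_count cannot itself
    reach min_count, so it can be skipped as a candidate."""
    buckets = [0] * n_buckets
    for t in db:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return lambda pair: buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_count

may_be_frequent = build_pair_filter(
    [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}], 2)
# A False answer definitively rules a pair out; True may be a hash collision.
print(may_be_frequent({"A", "D"}))
```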
13. Performance Bottleneck
- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate k-itemsets
  - Use database scans and pattern matching to collect counts
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets
    - 10^4 frequent 1-itemsets generate on the order of 10^7 candidate 2-itemsets
    - To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
  - Multiple scans of the database
    - Needs n+1 scans, where n is the length of the longest pattern
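Both magnitudes are easy to verify:

```python
import math

print(math.comb(10**4, 2))   # 49,995,000: on the order of 10^7 candidate pairs
print(f"{2**100:.3e}")       # 1.268e+30 subsets of a 100-item pattern
```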
14. Mining Frequent Patterns Without Candidate Generation
- FP-tree structure
  - Compresses a large database into a compact Frequent-Pattern tree (FP-tree)
  - Avoids costly repeated database scans
- Constructing the FP-tree
  - Scan the DB once to find the frequent 1-itemsets (single-item patterns)
  - Order frequent items in descending frequency order
  - Scan the DB again, inserting each transaction into the FP-tree while sharing common prefixes
15. Construct FP-tree

    TID   Items bought               (Ordered) frequent items
    100   f, a, c, d, g, i, m, p     f, c, a, m, p
    200   a, b, c, f, l, m, o        f, c, a, b, m
    300   b, f, h, j, o              f, b
    400   b, c, k, s, p              c, b, p
    500   a, f, c, e, l, p, m, n     f, c, a, m, p
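A compact sketch of the two-scan construction, run on this exact database. One caveat: f and c both have count 4; the slide's header table lists f first, while the sketch breaks ties alphabetically, so c ends up above f in the tree. The set of mined patterns is unaffected.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, min_count):
    # Scan 1: count single items and keep the frequent ones.
    freq = defaultdict(int)
    for t in db:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Scan 2: insert each transaction in descending-frequency order,
    # sharing prefixes; node_links threads all nodes of the same item.
    root, node_links = FPNode(None, None), defaultdict(list)
    for t in db:
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                node_links[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, node_links, freq

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
      set("bcksp"), set("afcelpmn")]
root, node_links, freq = build_fp_tree(db, min_count=3)
print(freq)   # counts f:4, c:4, a:3, b:3, m:3, p:3 (key order may vary)
```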
16. Mining Frequent Patterns (1)
- Method
  - For each frequent item, construct its conditional pattern base
  - Then construct its conditional FP-tree
  - Repeat the process until the resulting FP-tree is empty or contains only a single path
  - A single path generates all the combinations of its sub-paths, each of which is a frequent pattern
17. Mining Frequent Patterns (2)
- 1. Starting from the frequent-item header table, traverse the FP-tree by following the node-links of each frequent item
- 2. Accumulate all of the transformed prefix paths of that item to form its conditional pattern base

    Conditional pattern bases:
    item   conditional pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1
    Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

    [Figure: the FP-tree built from the table on the previous slide, with
    node-links from the header table threading the nodes of each item]
    root
    ├── f:4
    │   ├── c:3 ── a:3
    │   │          ├── m:2 ── p:2
    │   │          └── b:1 ── m:1
    │   └── b:1
    └── c:1 ── b:1 ── p:1
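Given the root and node_links returned by the construction sketch earlier, each conditional pattern base is a walk from the item's nodes up toward the root. Path item order may differ from the slide because of the alphabetical tie-break noted there.

```python
def conditional_pattern_base(item, node_links):
    """For every node of `item`, collect the prefix path from the root
    down to (but excluding) the node, weighted by the node's count."""
    base = []
    for node in node_links[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

print(conditional_pattern_base("m", node_links))
# -> [(['c', 'f', 'a'], 2), (['c', 'f', 'a', 'b'], 1)]  (i.e., fca:2, fcab:1)
```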
18. Mining Frequent Patterns (3)
- 3. Accumulate the count for each item in the base
- 4. Construct the FP-tree for the frequent items of the pattern base

    m-conditional pattern base: fca:2, fcab:1
    m-conditional FP-tree (min_support = 50%, i.e., count ≥ 3): f:3 → c:3 → a:3 (b is dropped with count 1)
19. Mining Frequent Patterns (4)
[Figure only.]
20. Mining Frequent Patterns (5)
- 5. Repeat the process until the FP-tree contains a single path P
- 6. The complete set of frequent patterns can then be generated by enumerating all the combinations of the sub-paths of P

    m-conditional FP-tree: f:3 → c:3 → a:3
    All frequent patterns containing m: m, fm, cm, am, fcm, fam, cam, fcam
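Step 6 is a short enumeration for the single path of the m-conditional FP-tree (a sketch):

```python
from itertools import combinations

path = ["f", "c", "a"]   # the single path of the m-conditional FP-tree
patterns = ["".join(sub) + "m"
            for r in range(len(path) + 1)
            for sub in combinations(path, r)]
print(patterns)   # ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']
```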
21-23. Presentation of Association Rules
[Figures only: three slides of screenshots showing how discovered rules can be presented.]
24. Mining Multiple-Level Association Rules
- Items often form a hierarchy
- Rules on itemsets at the appropriate levels can be quite useful:

    milk ⇒ bread [20%, 60%]
    2% milk ⇒ wheat bread [6%, 50%]
    2% milk ⇒ white bread [4%, 70%]
25. Mining Multiple-Level Association Rules
- A top-down, progressive-deepening approach
  - First find high-level strong rules, then find their lower-level rules
    - milk ⇒ bread [20%, 60%]
    - 2% milk ⇒ wheat bread [6%, 50%]
  - Items at a lower level are expected to have lower support
- Uniform support vs. reduced support
  - Uniform support: the same minimum support for all levels
  - Reduced support: a reduced minimum support at lower levels
26. Different Strategies
- Uniform support:
  - Level 1 (min_sup = 5%): Milk [support = 10%] passes
  - Level 2 (min_sup = 5%): 2% milk [6%] passes; skim milk [4%] fails
- Reduced independent support:
  - Level 1 (min_sup = 15%): Milk [support = 10%] fails
  - Level 2 (min_sup = 3%): skim milk [4%] and 2% milk [6%] are still examined and pass
- Reduced level-cross support:
  - Level 1 (min_sup = 15%): Milk [support = 10%] fails
  - Level 2 (min_sup = 5%): skim milk and 2% milk are not examined because their ancestor failed
27. Redundancy Filtering
- A rule is redundant if its support is close to the expected value based on the rule's ancestor
- Example
  - milk ⇒ wheat bread [support = 8%, confidence = 70%]
  - 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
  - The first rule is an ancestor of the second rule
  - If 2% milk accounts for 1/4 of all milk sales, the expected support of the second rule is 8% × 1/4 = 2%, so the second rule is redundant
28. Mining Multi-Dimensional Association Rules
- Single-dimensional rules:
  - buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: two or more dimensions (or predicates)
  - age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes
  - Finite number of possible values, no ordering among values
- Quantitative attributes
  - Numeric, implicit ordering among values
29. Techniques for Mining MD Associations
- Search for frequent k-predicate sets
  - Example: {age, occupation, buys} is a 3-predicate set
  - How should the attribute age be treated?
- 1. Static discretization
  - Quantitative attributes are statically discretized using predefined concept hierarchies
- 2. Quantitative association rules
  - Quantitative attributes are dynamically discretized into bins based on the distribution of the data
- 3. Distance-based association rules
  - A dynamic discretization process that considers the distance between data points
30. Static Discretization
- Attributes are discretized prior to mining using concept hierarchies
- The data cube is well suited for mining
  - The cells of an n-dimensional cuboid correspond to the n-predicate sets
  - Mining from data cubes can be much faster
31. Quantitative Assoc. Rules
- Quantitative attributes are dynamically discretized so that the confidence or compactness of the mined rules is maximized
- 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
- Binning: partition the range of each attribute; each cell of the 2-D array holds the corresponding count distribution
- Finding frequent itemsets: scan the 2-D array to find predicate sets satisfying minimum support
- Clustering: cluster adjacent association rules to form more general rules
32. Quantitative Assoc. Rules
- Example
  - age(X, 34) ∧ income(X, "31K-40K") ⇒ buys(X, "HDTV")
  - age(X, 35) ∧ income(X, "31K-40K") ⇒ buys(X, "HDTV")
  - age(X, 34) ∧ income(X, "41K-50K") ⇒ buys(X, "HDTV")
  - age(X, 35) ∧ income(X, "41K-50K") ⇒ buys(X, "HDTV")
  - These cluster into: age(X, "34-35") ∧ income(X, "31K-50K") ⇒ buys(X, "HDTV")
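A toy sketch of that clustering step: the four specific rules become cells of an age × income grid, and a full rectangle of cells merges into the general rule. This simple rectangle test is an illustrative simplification, not the published ARCS algorithm.

```python
from itertools import product

# The four specific rules, encoded as (age, income-bin) grid cells:
cells = {(34, "31K-40K"), (35, "31K-40K"), (34, "41K-50K"), (35, "41K-50K")}

ages = sorted({a for a, _ in cells})
bins = sorted({b for _, b in cells})
if cells == set(product(ages, bins)):   # cells fill a complete rectangle
    # The merged income label is hard-coded for this toy example.
    print(f'age(X, "{ages[0]}-{ages[-1]}") ∧ income(X, "31K-50K") ⇒ buys(X, "HDTV")')
```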
33. Distance-based Assoc. Rules
- Binning methods do not capture the semantics of interval data
- Distance-based partitioning is more meaningful
- Method
  - Find clusters in the data
  - Search for groups of clusters that occur together
34. Problem of Support and Confidence
- Example
  - Among 10,000 transactions:
    - 6,000 include computer games
    - 7,500 include videos
    - 4,000 include both
  - Min. support = 30%, min. confidence = 60%
- The rule
  - buys(X, "game") ⇒ buys(X, "video") [40%, 66.7%]
  - is misleading, because the overall percentage of customers buying videos is 75%, which is higher than 66.7%!
35. Correlation
- Correlation measures the dependency between itemsets
  - corr(A, B) = P(A ∪ B) / (P(A) P(B))
  - corr(A, B) > 1 ⇒ positively correlated
  - Also called the lift of the rule A ⇒ B
- Example
  - corr(game, video) = P(game, video) / (P(game) P(video))
    = 0.4 / (0.6 × 0.75) = 0.89 < 1.0
  - ⇒ Negative correlation!
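Plugging in the numbers from the game/video example:

```python
n, games, videos, both = 10_000, 6_000, 7_500, 4_000

support = both / n                                   # 0.40
confidence = both / games                            # 0.666...
lift = (both / n) / ((games / n) * (videos / n))     # corr(game, video)

print(f"{support:.2f} {confidence:.3f} {lift:.2f}")  # 0.40 0.667 0.89
```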
36. References
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
- J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
- H. Toivonen. Sampling large databases for association rules. VLDB'96.
- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.
- R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing'02.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
- J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00.
- J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. KDD'02.
- J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM'02.
- J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. KDD'03.
- G. Liu, H. Lu, W. Lou, and J. X. Yu. On computing, storing and querying frequent patterns. KDD'03.
37. References
- R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
- J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95.
- R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.
- T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96.
- K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
- R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
- Y. Aumann and Y. Lindell. A statistical theory for quantitative association rules. KDD'99.
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
- S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. SIGMOD'97.
- C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
- P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. KDD'02.
- E. Omiecinski. Alternative interest measures for mining associations. TKDE'03.
- Y. K. Lee, W. Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient mining of correlated patterns. ICDM'03.