Title: Chapter 5: Mining Frequent Patterns, Associations, and Correlations
1. Chapter 5: Mining Frequent Patterns, Associations, and Correlations
2. What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects
- Applications: basket data analysis, cross-marketing, catalog design, clustering, classification, etc.
- Examples (rule form: Body ⇒ Head [support, confidence]):
  - buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
3. Basic Concepts
- Given:
  - A database of transactions
  - Each transaction is a set of items (e.g., items purchased by a customer)
- Find:
  - Rules that correlate one set of items with another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Mining steps:
  - Find all frequent itemsets
  - Generate strong association rules (those meeting minimum support and minimum confidence)
4. Support and Confidence
- For the rule X ∧ Y ⇒ Z:
  - Support = P(X ∪ Y ∪ Z)
  - Confidence = P(Z | X ∪ Y)
[Venn diagram: transactions where the customer buys beer (15%), buys diapers, or buys both (10%). For the rule B ⇒ D: support = 10%, confidence = 10/15 ≈ 67%.]
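As a concrete illustration of these definitions, here is a minimal Python sketch. The 20-transaction toy database is invented so that the rule beer ⇒ diaper comes out at the support (10%) and confidence (~67%) shown in the diagram.

```python
# Assumed toy database: 2 baskets with beer and diapers, 1 with beer
# only, 7 with diapers only, 10 unrelated -- 20 transactions in total.
transactions = (
    [{"beer", "diaper"}] * 2
    + [{"beer"}]
    + [{"diaper"}] * 7
    + [{"milk"}, {"bread"}] * 5
)

def support(itemset, db):
    """P(itemset): fraction of transactions containing all its items."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"beer", "diaper"}, transactions))       # 0.1 (10%)
print(confidence({"beer"}, {"diaper"}, transactions))  # 0.666... (~67%)
```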
5. Mining Example
- Min. support = 50%, min. confidence = 70%
- Rule A ⇒ C:
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.7% (fails min. confidence)
- Rule C ⇒ A:
  - support = support({C, A}) = 50%
  - confidence = support({C, A}) / support({C}) = 100%
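The slide does not show the underlying transactions; the four-transaction database below is an assumption chosen to be consistent with the stated numbers, reusing support() and confidence() from the sketch above.

```python
# Assumed database: support({A,C}) = 2/4 = 50%, support({A}) = 3/4,
# support({C}) = 2/4, matching the slide's figures.
D = [{"A", "C", "D"}, {"A", "B", "C"}, {"A", "B", "E"}, {"B", "D", "E"}]

print(confidence({"A"}, {"C"}, D))  # 0.666... -> A ⇒ C fails min_conf = 70%
print(confidence({"C"}, {"A"}, D))  # 1.0      -> C ⇒ A is a strong rule
```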
6. Kinds of Rules
- buys(x, "SQLServer") ⇒ buys(x, "DBMiner")
- age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "computer")
- age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "notebook")
- Boolean vs. quantitative
- Single-dimensional vs. multi-dimensional
- Single-level vs. multiple-level
7. The Apriori Algorithm
- Mines single-dimensional Boolean association rules
- Finding the frequent itemsets:
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
  - Join step: Ck is generated by joining Lk-1 with itself
  - Prune step: remove any candidate that has an infrequent (k-1)-subset
- The Apriori principle:
  - Any subset of a frequent itemset must itself be frequent
  - I.e., if {A, B, C} is a frequent itemset, then {A, B}, {A, C}, and {B, C} must all be frequent itemsets
- Level-wise flow: C1 → L1 → C2 → L2 → ... → Ck → Lk
8. The Apriori Algorithm
- Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t;
        Lk+1 = candidates in Ck+1 with count ≥ min_support;
    end
    return ∪k Lk;
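A runnable Python sketch of this pseudo-code. The frozenset-union join below is a simplification that leans on the prune step, and the example database and min_count are assumptions for illustration.

```python
from itertools import combinations

def apriori_gen(Lk, k):
    """Join: union pairs of frequent k-itemsets into (k+1)-item candidates;
    prune: drop any candidate that has an infrequent k-subset."""
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

def apriori(db, min_count):
    """Level-wise search, one pass over db per candidate level."""
    items = {i for t in db for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in db) >= min_count}
    frequent, k = set(Lk), 1
    while Lk:
        Ck1 = apriori_gen(Lk, k)
        counts = dict.fromkeys(Ck1, 0)
        for t in db:                       # one scan of db per level
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_count}
        frequent |= Lk
        k += 1
    return frequent

# Assumed 4-transaction database; min support 50% means count >= 2.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s in sorted(apriori(D, 2), key=lambda s: (len(s), sorted(s))):
    print(set(s))   # {A} {B} {C} {E} {A,C} {B,C} {B,E} {C,E} {B,C,E}
```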
9. The Apriori Algorithm: Example
[Figure: a worked trace on an example database D. Scan D to count C1 and keep L1; join L1 with itself and prune to form C2; scan D to count C2 and keep L2; join and prune to form C3; scan D to count C3 and keep L3.]
10. Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Joining L3 ⋈ L3:
  - abcd, from abc and abd
  - acde, from acd and ace
- Pruning:
  - acde is removed because its subset ade is not in L3
- C4 = {abcd}
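Feeding this L3 to the apriori_gen sketch above reproduces the result. Note that the union-style join also produces abce, which the prune step removes because bce is not in L3; the classic join, which merges only itemsets agreeing on their first k-1 items, would never generate it.

```python
L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = apriori_gen(L3, 3)
print(["".join(sorted(c)) for c in C4])   # ['abcd']: acde and abce are pruned
```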
11. Generating Rules
- From the frequent itemset {2, 3, 5} (support count 2):
  - 2 ∧ 3 ⇒ 5, confidence = 2/2 = 100%
  - 2 ∧ 5 ⇒ 3, confidence = 2/3 = 67%
  - 3 ∧ 5 ⇒ 2, confidence = 2/2 = 100%
  - 2 ⇒ 3 ∧ 5, confidence = 2/3 = 67%
  - 3 ⇒ 2 ∧ 5, confidence = 2/3 = 67%
  - 5 ⇒ 2 ∧ 3, confidence = 2/3 = 67%
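A sketch of rule generation from one frequent itemset. The four-transaction database is an assumption chosen to be consistent with the counts above.

```python
from itertools import combinations

def rules_from(itemset, supp_count, db, min_conf):
    """Emit lhs ⇒ rhs for every non-empty proper subset lhs of itemset,
    keeping rules whose confidence meets min_conf."""
    items = frozenset(itemset)
    out = []
    for r in range(1, len(items)):
        for lhs in map(frozenset, combinations(items, r)):
            conf = supp_count / sum(lhs <= t for t in db)
            if conf >= min_conf:
                out.append((set(lhs), set(items - lhs), conf))
    return out

# Assumed database in which {2, 3, 5} has support count 2:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for lhs, rhs, conf in rules_from({2, 3, 5}, 2, D, min_conf=0.7):
    print(lhs, "⇒", rhs, f"({conf:.0%})")  # {2,3} ⇒ {5} and {3,5} ⇒ {2} at 100%
```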
12. Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent (see the sketch after this list)
- Transaction reduction: a transaction that contains no frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine a subset of the given data with a lower support threshold; less accurate but more efficient
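A minimal sketch of the first idea, in the spirit of the hash-based (DHP) approach of Park, Chen, and Yu, restricted to 2-itemsets; the bucket count and hash choice here are arbitrary assumptions.

```python
from itertools import combinations

def build_pair_filter(db, min_count, n_buckets=101):
    """During a scan, hash every 2-itemset of every transaction into a
    bucket. A pair whose bucket total is below min_count cannot itself
    reach min_count, so it can be skipped as a candidate."""
    buckets = [0] * n_buckets
    for t in db:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return lambda pair: buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_count

may_be_frequent = build_pair_filter(
    [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}], 2)
# A False answer definitively rules a pair out; True may be a hash collision.
print(may_be_frequent({"A", "D"}))
```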
13. Performance Bottleneck
- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate k-itemsets
  - Use database scans and pattern matching to collect counts
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets
    - 10^4 frequent 1-itemsets generate on the order of 10^7 candidate 2-itemsets
    - To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
  - Multiple scans of the database
    - Needs n+1 scans, where n is the length of the longest pattern
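Both magnitudes are easy to verify:

```python
import math

print(math.comb(10**4, 2))   # 49,995,000: on the order of 10^7 candidate pairs
print(f"{2**100:.3e}")       # 1.268e+30 subsets of a 100-item pattern
```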
14. Mining Frequent Patterns Without Candidate Generation
- FP-tree structure
  - Compresses a large database into a compact Frequent-Pattern tree (FP-tree)
  - Avoids costly repeated database scans
- Constructing the FP-tree
  - Scan the DB once to find the frequent 1-itemsets (single-item patterns)
  - Order frequent items in descending frequency order
  - Scan the DB again, inserting each transaction into the FP-tree while sharing common prefixes
15. Construct FP-tree

    TID   Items bought               (Ordered) frequent items
    100   f, a, c, d, g, i, m, p     f, c, a, m, p
    200   a, b, c, f, l, m, o        f, c, a, b, m
    300   b, f, h, j, o              f, b
    400   b, c, k, s, p              c, b, p
    500   a, f, c, e, l, p, m, n     f, c, a, m, p
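A compact sketch of the two-scan construction, run on this exact database. One caveat: f and c both have count 4; the slide's header table lists f first, while the sketch breaks ties alphabetically, so c ends up above f in the tree. The set of mined patterns is unaffected.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, min_count):
    # Scan 1: count single items and keep the frequent ones.
    freq = defaultdict(int)
    for t in db:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Scan 2: insert each transaction in descending-frequency order,
    # sharing prefixes; node_links threads all nodes of the same item.
    root, node_links = FPNode(None, None), defaultdict(list)
    for t in db:
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                node_links[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, node_links, freq

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
      set("bcksp"), set("afcelpmn")]
root, node_links, freq = build_fp_tree(db, min_count=3)
print(freq)   # counts f:4, c:4, a:3, b:3, m:3, p:3 (key order may vary)
```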
16. Mining Frequent Patterns (1)
- Method
  - For each frequent item, construct its conditional pattern base
  - Then construct its conditional FP-tree
  - Repeat the process until the resulting FP-tree is empty or contains only a single path
  - A single path generates all the combinations of its sub-paths, each of which is a frequent pattern
17. Mining Frequent Patterns (2)
- 1. Starting from the frequent-item header table, traverse the FP-tree by following the node-links of each frequent item
- 2. Accumulate all of the transformed prefix paths of that item to form its conditional pattern base

    Conditional pattern bases:
    item   conditional pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1
    Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

    [Figure: the FP-tree built from the table on the previous slide, with
    node-links from the header table threading the nodes of each item]
    root
    ├── f:4
    │   ├── c:3 ── a:3
    │   │          ├── m:2 ── p:2
    │   │          └── b:1 ── m:1
    │   └── b:1
    └── c:1 ── b:1 ── p:1
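Given the root and node_links returned by the construction sketch earlier, each conditional pattern base is a walk from the item's nodes up toward the root. Path item order may differ from the slide because of the alphabetical tie-break noted there.

```python
def conditional_pattern_base(item, node_links):
    """For every node of `item`, collect the prefix path from the root
    down to (but excluding) the node, weighted by the node's count."""
    base = []
    for node in node_links[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

print(conditional_pattern_base("m", node_links))
# -> [(['c', 'f', 'a'], 2), (['c', 'f', 'a', 'b'], 1)]  (i.e., fca:2, fcab:1)
```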
18. Mining Frequent Patterns (3)
- 3. Accumulate the count for each item in the base
- 4. Construct the FP-tree for the frequent items of the pattern base

    m-conditional pattern base: fca:2, fcab:1
    m-conditional FP-tree (min_support = 50%, i.e., count ≥ 3): f:3 → c:3 → a:3 (b is dropped with count 1)
19. Mining Frequent Patterns (4)
[Figure only.]
20. Mining Frequent Patterns (5)
- 5. Repeat the process until the FP-tree contains a single path P
- 6. The complete set of frequent patterns can then be generated by enumerating all the combinations of the sub-paths of P

    m-conditional FP-tree: f:3 → c:3 → a:3
    All frequent patterns containing m: m, fm, cm, am, fcm, fam, cam, fcam
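Step 6 is a short enumeration for the single path of the m-conditional FP-tree (a sketch):

```python
from itertools import combinations

path = ["f", "c", "a"]   # the single path of the m-conditional FP-tree
patterns = ["".join(sub) + "m"
            for r in range(len(path) + 1)
            for sub in combinations(path, r)]
print(patterns)   # ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']
```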
21-23. Presentation of Association Rules
[Figures only: three slides of screenshots showing how discovered rules can be presented.]
24. Mining Multiple-Level Association Rules
- Items often form a hierarchy
- Rules on itemsets at the appropriate levels can be quite useful:

    milk ⇒ bread [20%, 60%]
    2% milk ⇒ wheat bread [6%, 50%]
    2% milk ⇒ white bread [4%, 70%]
25. Mining Multiple-Level Association Rules
- A top-down, progressive-deepening approach
  - First find high-level strong rules, then find their lower-level rules
    - milk ⇒ bread [20%, 60%]
    - 2% milk ⇒ wheat bread [6%, 50%]
  - Items at a lower level are expected to have lower support
- Uniform support vs. reduced support
  - Uniform support: the same minimum support for all levels
  - Reduced support: a reduced minimum support at lower levels
26. Different Strategies
- Uniform support:
  - Level 1 (min_sup = 5%): Milk [support = 10%] passes
  - Level 2 (min_sup = 5%): 2% milk [6%] passes; skim milk [4%] fails
- Reduced independent support:
  - Level 1 (min_sup = 15%): Milk [support = 10%] fails
  - Level 2 (min_sup = 3%): skim milk [4%] and 2% milk [6%] are still examined and pass
- Reduced level-cross support:
  - Level 1 (min_sup = 15%): Milk [support = 10%] fails
  - Level 2 (min_sup = 5%): skim milk and 2% milk are not examined because their ancestor failed
27. Redundancy Filtering
- A rule is redundant if its support is close to the expected value based on the rule's ancestor
- Example
  - milk ⇒ wheat bread [support = 8%, confidence = 70%]
  - 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
  - The first rule is an ancestor of the second rule
  - If 2% milk accounts for 1/4 of all milk sales, the expected support of the second rule is 8% × 1/4 = 2%, so the second rule is redundant
28. Mining Multi-Dimensional Association Rules
- Single-dimensional rules:
  - buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: two or more dimensions (or predicates)
  - age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes
  - Finite number of possible values, no ordering among values
- Quantitative attributes
  - Numeric, implicit ordering among values
29. Techniques for Mining MD Associations
- Search for frequent k-predicate sets
  - Example: {age, occupation, buys} is a 3-predicate set
  - How should the attribute age be treated?
- 1. Static discretization
  - Quantitative attributes are statically discretized using predefined concept hierarchies
- 2. Quantitative association rules
  - Quantitative attributes are dynamically discretized into bins based on the distribution of the data
- 3. Distance-based association rules
  - A dynamic discretization process that considers the distance between data points
30. Static Discretization
- Attributes are discretized prior to mining using concept hierarchies
- The data cube is well suited for mining
  - The cells of an n-dimensional cuboid correspond to the n-predicate sets
  - Mining from data cubes can be much faster
31. Quantitative Assoc. Rules
- Quantitative attributes are dynamically discretized so that the confidence or compactness of the mined rules is maximized
- 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
- Binning: partition the range of each attribute; each cell of the 2-D array holds the corresponding count distribution
- Finding frequent itemsets: scan the 2-D array to find predicate sets satisfying minimum support
- Clustering: cluster adjacent association rules to form more general rules
32. Quantitative Assoc. Rules
- Example
  - age(X, 34) ∧ income(X, "31K-40K") ⇒ buys(X, "HDTV")
  - age(X, 35) ∧ income(X, "31K-40K") ⇒ buys(X, "HDTV")
  - age(X, 34) ∧ income(X, "41K-50K") ⇒ buys(X, "HDTV")
  - age(X, 35) ∧ income(X, "41K-50K") ⇒ buys(X, "HDTV")
  - These cluster into: age(X, "34-35") ∧ income(X, "31K-50K") ⇒ buys(X, "HDTV")
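A toy sketch of that clustering step: the four specific rules become cells of an age × income grid, and a full rectangle of cells merges into the general rule. This simple rectangle test is an illustrative simplification, not the published ARCS algorithm.

```python
from itertools import product

# The four specific rules, encoded as (age, income-bin) grid cells:
cells = {(34, "31K-40K"), (35, "31K-40K"), (34, "41K-50K"), (35, "41K-50K")}

ages = sorted({a for a, _ in cells})
bins = sorted({b for _, b in cells})
if cells == set(product(ages, bins)):   # cells fill a complete rectangle
    # The merged income label is hard-coded for this toy example.
    print(f'age(X, "{ages[0]}-{ages[-1]}") ∧ income(X, "31K-50K") ⇒ buys(X, "HDTV")')
```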
33. Distance-based Assoc. Rules
- Binning methods do not capture the semantics of interval data
- Distance-based partitioning is more meaningful
- Method
  - Find clusters in the data
  - Search for groups of clusters that occur together
34. Problem of Support and Confidence
- Example
  - Among 10,000 transactions:
    - 6,000 include computer games
    - 7,500 include videos
    - 4,000 include both
  - Min. support = 30%, min. confidence = 60%
- The rule
  - buys(X, "game") ⇒ buys(X, "video") [40%, 66.7%]
  - is misleading, because the overall percentage of customers buying videos is 75%, which is higher than 66.7%!
35. Correlation
- Correlation measures the dependency between itemsets
  - corr(A, B) = P(A ∪ B) / (P(A) P(B))
  - corr(A, B) > 1 ⇒ positively correlated
  - Also called the lift of the rule A ⇒ B
- Example
  - corr(game, video) = P(game, video) / (P(game) P(video))
    = 0.4 / (0.6 × 0.75) = 0.89 < 1.0
  - ⇒ Negative correlation!
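Plugging in the numbers from the game/video example:

```python
n, games, videos, both = 10_000, 6_000, 7_500, 4_000

support = both / n                                   # 0.40
confidence = both / games                            # 0.666...
lift = (both / n) / ((games / n) * (videos / n))     # corr(game, video)

print(f"{support:.2f} {confidence:.3f} {lift:.2f}")  # 0.40 0.667 0.89
```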
36. References
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
- J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
- H. Toivonen. Sampling large databases for association rules. VLDB'96.
- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.
- R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing'02.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
- J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00.
- J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. KDD'02.
- J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM'02.
- J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. KDD'03.
- G. Liu, H. Lu, W. Lou, and J. X. Yu. On computing, storing and querying frequent patterns. KDD'03.
37. References
- R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
- J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95.
- R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.
- T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96.
- K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
- R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
- Y. Aumann and Y. Lindell. A statistical theory for quantitative association rules. KDD'99.
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
- S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. SIGMOD'97.
- C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
- P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. KDD'02.
- E. Omiecinski. Alternative interest measures for mining associations. TKDE'03.
- Y. K. Lee, W. Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient mining of correlated patterns. ICDM'03.