1
Chap. 5 Mining Frequent Patterns, Association,
and Correlations
  • Data Mining

2
What Is Association Mining?
  • Association rule mining
  • Finding frequent patterns, associations,
    correlations, or causal structures among sets of
    items or objects
  • Applications
  • Basket data analysis, cross-marketing, catalog
    design, clustering, classification, etc.
  • Examples
  • Rule form: Body ⇒ Head [support, confidence]
  • buys(x, "diapers") ⇒ buys(x, "beers")
    [0.5%, 60%]
  • major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A")
    [1%, 75%]

3
Basic Concepts
  • Given
  • Database of transactions
  • Transaction: a set of items (e.g., purchased by a
    customer)
  • Find
  • Rules that correlate one set of items with
    another set of items
  • E.g., 98% of people who purchase tires and auto
    accessories also get automotive services done
  • Mining steps
  • Find all frequent itemsets
  • Generate strong association rules, i.e., rules
    meeting both minimum support and minimum confidence

4
Support and Confidence
  • For the rule X ∧ Y ⇒ Z
  • Support = P(X ∪ Y ∪ Z)
  • Confidence = P(Z | X ∪ Y)

[Venn diagram of customers buying beer, diaper, or both.]
For B ⇒ D: support = 10%, confidence = 67%
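
To make the definitions concrete, a minimal Python sketch of these two measures (the toy transactions are illustrative, not from the slides):

    # Support and confidence computed over a list of transactions (sets)
    def support(itemset, transactions):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(lhs, rhs, transactions):
        # P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)
        return support(lhs | rhs, transactions) / support(lhs, transactions)

    transactions = [{"beer", "diaper", "milk"}, {"beer", "diaper"},
                    {"beer", "bread"}, {"diaper", "bread"}]
    print(support({"beer", "diaper"}, transactions))       # 0.5
    print(confidence({"beer"}, {"diaper"}, transactions))  # 0.666...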
5
Mining Example
Min. support = 50%, min. confidence = 70%
  • Rule A ⇒ C
  • support = support({A, C}) = 50%
  • confidence = support({A, C}) / support({A}) = 66.7%
  • Rule C ⇒ A
  • support = support({C, A}) = 50%
  • confidence = support({C, A}) / support({C}) = 100%

6
Kinds of Rules
  • buys(x, "SQLServer") ⇒ buys(x, "DBMiner")
  • age(x, "30..39") ∧ income(x, "42..48K") ⇒
    buys(x, "computer")
  • age(x, "30..39") ∧ income(x, "42..48K") ⇒
    buys(x, "notebook")
  • Boolean vs. quantitative
  • Single-dimensional vs. multi-dimensional
  • Single-level vs. multi-level

7
The Apriori Algorithm
  • Mining single-dimensional Boolean association
    rules
  • Find the frequent itemsets
  • Iteratively find frequent itemsets with
    cardinality from 1 to k (k-itemset)
  • Join step: Ck is generated by joining Lk-1 with
    itself
  • Prune step: remove any candidate k-itemset that
    has a (k-1)-subset that is not frequent
  • The Apriori principle
  • Any subset of a frequent itemset must itself be
    frequent
  • i.e., if {A, B, C} is a frequent itemset, then
    {A, B}, {A, C}, and {B, C} must all be frequent
    itemsets

C1 → L1 → C2 → L2 → … → Ck → Lk
8
The Apriori Algorithm
  • Pseudo-code:
      Ck: candidate itemsets of size k
      Lk: frequent itemsets of size k
      L1 = {frequent items};
      for (k = 1; Lk != ∅; k++) do begin
          Ck+1 = candidates generated from Lk;
          for each transaction t in database do
              increment the count of all candidates in Ck+1
              that are contained in t
          Lk+1 = candidates in Ck+1 with min_support
      end
      return ∪k Lk;
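
A runnable Python sketch of the pseudo-code above (min_sup is an absolute count here, and the join is simplified to pairwise unions, with the prune step removing any invalid candidates):

    from itertools import combinations

    def apriori(transactions, min_sup):
        transactions = [frozenset(t) for t in transactions]
        # L1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s for s, c in counts.items() if c >= min_sup}
        frequent, k = set(Lk), 1
        while Lk:
            # Join: unite pairs of k-itemsets into (k+1)-candidates,
            # then prune candidates having an infrequent k-subset
            Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            Ck = {c for c in Ck
                  if all(frozenset(s) in Lk for s in combinations(c, k))}
            # Scan the database to count the surviving candidates
            Lk = {c for c in Ck
                  if sum(1 for t in transactions if c <= t) >= min_sup}
            frequent |= Lk
            k += 1
        return frequent

    # Toy database; with min_sup = 2, {2, 3, 5} is among the results
    print(apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 2))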

9
The Apriori Algorithm Example
[Worked example: scan D to count C1 and derive L1; join L1 with
itself and prune to form C2; scan D to derive L2; join and prune
to form C3; scan D to derive L3.]
10
Generating Candidates
  • L3 = {abc, abd, acd, ace, bcd}
  • Joining L3 ⋈ L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning
  • acde is removed because ade is not in L3
  • C4 = {abcd}
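
A sketch of the classic join and prune steps reproducing this example (letters stand for single items):

    from itertools import combinations

    def apriori_join(Lk, k):
        # Merge pairs of sorted k-itemsets whose first k-1 items agree
        sorted_sets = sorted(tuple(sorted(s)) for s in Lk)
        return {frozenset(a + b) for i, a in enumerate(sorted_sets)
                for b in sorted_sets[i + 1:] if a[:k - 1] == b[:k - 1]}

    def apriori_prune(Ck, Lk, k):
        # Keep candidates whose every k-subset is frequent
        return {c for c in Ck
                if all(frozenset(s) in Lk for s in combinations(c, k))}

    L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
    C4 = apriori_join(L3, 3)          # {abcd, acde}
    print(apriori_prune(C4, L3, 3))   # {abcd}: acde dropped, ade not in L3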

11
Generating Rules
  • Rules generated from the frequent itemset {2, 3, 5}:
  • 2 ∧ 3 ⇒ 5, confidence = 2/2 = 100%
  • 2 ∧ 5 ⇒ 3, confidence = 2/3 = 67%
  • 3 ∧ 5 ⇒ 2, confidence = 2/2 = 100%
  • 2 ⇒ 3 ∧ 5, confidence = 2/3 = 67%
  • 3 ⇒ 2 ∧ 5, confidence = 2/3 = 67%
  • 5 ⇒ 2 ∧ 3, confidence = 2/3 = 67%
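
A sketch of the enumeration, assuming the support counts of the toy database in the Apriori sketch above (which yield exactly these confidences):

    from itertools import combinations

    # Assumed support counts: sup({2,3,5}) = 2, sup({2,3}) = sup({3,5}) = 2,
    # sup({2,5}) = 3, and sup of each single item = 3
    sup = {frozenset(s): c for s, c in [((2, 3, 5), 2), ((2, 3), 2),
           ((2, 5), 3), ((3, 5), 2), ((2,), 3), ((3,), 3), ((5,), 3)]}

    itemset = frozenset({2, 3, 5})
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(sorted(itemset), r)):
            conf = sup[itemset] / sup[lhs]
            print(f"{set(lhs)} => {set(itemset - lhs)}: {conf:.0%}")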

12
Methods to Improve Apriori's Efficiency
  • Hash-based itemset counting (see the sketch after
    this list)
  • A k-itemset whose corresponding hash bucket count
    is below the threshold cannot be frequent
  • Transaction reduction
  • A transaction that does not contain any frequent
    k-itemset is useless in subsequent scans
  • Partitioning
  • Any itemset that is potentially frequent in DB
    must be frequent in at least one of the
    partitions of DB
  • Sampling
  • Mining on a subset of given data with lower
    support threshold. Less accurate but more
    efficient
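
A minimal sketch of the hash-based idea above (in the spirit of the DHP technique; bucket count and hash function are illustrative assumptions): every pair in every transaction is hashed during an early scan, and a candidate pair whose bucket total is below min_sup cannot be frequent.

    from itertools import combinations

    def pair_bucket_counts(transactions, n_buckets=101):
        # Extra bookkeeping during an early scan: hash every 2-subset
        # of every transaction into a small table of buckets
        buckets = [0] * n_buckets
        for t in transactions:
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % n_buckets] += 1
        return buckets

    def may_be_frequent(pair, buckets, min_sup):
        # A bucket total is an upper bound on the pair's true count,
        # so a pair in a light bucket is discarded without exact counting
        return buckets[hash(tuple(sorted(pair))) % len(buckets)] >= min_sup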

13
Performance Bottleneck
  • The core of the Apriori algorithm
  • Use frequent (k - 1)-itemsets to generate
    candidate k-itemsets
  • Use database scans and pattern matching to collect
    counts
  • The bottleneck of Apriori: candidate generation
  • Huge candidate sets
  • 10^4 frequent 1-itemsets will generate ~10^7
    candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g.,
    {a1, a2, …, a100}, one needs to generate 2^100 ≈
    10^30 candidates
  • Multiple scans of the database
  • Needs (n + 1) scans, where n is the length of the
    longest pattern

14
Mining Frequent Patterns Without Candidate
Generation
  • FP-tree structure
  • Compress a large database into a compact
    Frequent-Pattern tree (FP-tree) structure
  • Avoid costly database scans
  • Constructing FP-tree
  • Scan DB once, find frequent 1-itemset (single
    item pattern)
  • Order frequent items in frequency descending
    order
  • Scan DB again, construct FP-tree while sharing
    prefix

15
Construct FP-tree
TID   Items bought               (ordered) frequent items
100   f, a, c, d, g, i, m, p     f, c, a, m, p
200   a, b, c, f, l, m, o        f, c, a, b, m
300   b, f, h, j, o              f, b
400   b, c, k, s, p              c, b, p
500   a, f, c, e, l, p, m, n     f, c, a, m, p
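
A compact Python construction sketch (the header table and node-links needed by the mining step are omitted for brevity; the class layout is an illustrative assumption):

    from collections import Counter

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fptree(transactions, min_sup):
        # Pass 1: find frequent items; fix a global order by descending
        # frequency (ties broken alphabetically here -- any fixed order
        # works; the slide's table happens to put f before c)
        freq = Counter(i for t in transactions for i in t)
        rank = {i: r for r, i in enumerate(sorted(
            (i for i in freq if freq[i] >= min_sup),
            key=lambda i: (-freq[i], i)))}
        # Pass 2: insert each transaction's ordered frequent items,
        # sharing common prefixes
        root = Node(None, None)
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                node = node.children.setdefault(item, Node(item, node))
                node.count += 1
        return root

    db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
          list("bcksp"), list("afcelpmn")]
    tree = build_fptree(db, min_sup=3)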
16
Mining Frequent Patterns (1)
  • Method
  • For each item, construct its conditional
    pattern-base
  • Construct its conditional FP-tree
  • Repeat the process until the resulting FP-tree is
    empty, or it contains only one path
  • A single path generates all the combinations of
    its sub-paths, each of which is a frequent pattern

17
Mining Frequent Patterns (2)
  • 1. Starting at the frequent-item header table,
    traverse the FP-tree by following the links of
    each frequent item
  • 2. Accumulate all the transformed prefix paths of
    that item to form its conditional pattern base

Conditional pattern bases:
Item   Conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

Header table (item : frequency):
f:4, c:4, a:3, b:3, m:3, p:3

[FP-tree: root → f:4 → c:3 → a:3 → m:2 → p:2, with side branches
a:3 → b:1 → m:1 and f:4 → b:1, and a second path root → c:1 → b:1 → p:1.]
18
Mining Frequent Patterns (3)
  • 3. Accumulate the count for each item in the base
  • 4. Construct the FP-tree for the frequent items
    of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: f:3 → c:3 → a:3 (b:1 is dropped,
below min_support)
min_support = 50% (count 3)
19
Mining Frequent Patterns (4)
20
Mining Frequent Patterns (5)
  • 5. Repeat the process until the FP-tree contains a
    single path P
  • 6. The complete set of frequent patterns of T can
    be generated by enumerating all the combinations
    of the sub-paths of P

m-conditional FP-tree: the single path f:3 → c:3 → a:3
All frequent patterns concerning m:
m, fm, cm, am, fcm, fam, cam, fcam
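
Because the m-conditional FP-tree is a single path, these patterns are exactly the subsets of {f, c, a}, each suffixed with m. A short sketch:

    from itertools import combinations

    path = ["f", "c", "a"]   # the single path of the m-conditional FP-tree
    patterns = ["".join(sub) + "m" for r in range(len(path) + 1)
                for sub in combinations(path, r)]
    print(patterns)   # ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']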
21
Presentation of Association Rules
22
Presentation of Association Rules
23
Presentation of Association Rules
24
Mining Multiple-Level Association Rules
  • Items often form a hierarchy
  • Rules on itemsets at the appropriate levels could
    be quite useful

milk ⇒ bread [20%, 60%]
2% milk ⇒ wheat bread [6%, 50%]
2% milk ⇒ white bread [4%, 70%]
25
Mining Multiple-Level Association Rules
  • A top-down, progressive deepening approach
  • First find high-level strong rules, then find
    their lower-level rules
  • milk ⇒ bread [20%, 60%]
  • 2% milk ⇒ wheat bread [6%, 50%]
  • Items at a lower level are expected to have lower
    support
  • Uniform support vs. reduced support
  • Uniform support: the same minimum support for all
    levels
  • Reduced support: reduced minimum support at lower
    levels

26
Different Strategies
  • Uniform support
  • Reduced independent support
  • Reduced level-cross support

Uniform support:
  Level 1 (min_sup = 5%):   Milk [support = 10%]
  Level 2 (min_sup = 5%):   Skim milk [4%], 2% milk [6%]
Reduced, independent support:
  Level 1 (min_sup = 15%):  Milk [10%]
  Level 2 (min_sup = 3%):   Skim milk [4%], 2% milk [6%]
Reduced, level-cross support:
  Level 1 (min_sup = 15%):  Milk [10%]
  Level 2 (min_sup = 5%):   Skim milk, 2% milk (not
  examined: Milk already fails the level-1 threshold)
27
Redundancy Filtering
  • Redundant rule
  • Its support is close to the expected value based
    on the rule's ancestor
  • Example
  • milk ⇒ wheat bread [support = 8%, confidence = 70%]
  • 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
  • (The first rule is an ancestor of the second)
  • If 2% milk makes up about ¼ of all milk, the
    expected support of the second rule is 8% × ¼ = 2%,
    matching its actual support, so the second rule is
    redundant
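
A sketch of this redundancy test (the 25% closeness tolerance is an illustrative assumption, not from the slides):

    def is_redundant(rule_sup, ancestor_sup, item_share, tol=0.25):
        # Redundant if actual support is close to the value the
        # ancestor rule predicts: ancestor_sup * item_share
        expected = ancestor_sup * item_share
        return abs(rule_sup - expected) <= tol * expected

    # milk => wheat bread: 8% support; 2% milk is 1/4 of all milk,
    # so the expected support is 8% * 1/4 = 2%, matching the actual 2%
    print(is_redundant(0.02, 0.08, 0.25))   # True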

28
Mining Multi-Dimensional Association Rules
  • Single-dimensional rules
  • buys(X, "milk") ⇒ buys(X, "bread")
  • Multi-dimensional rules
  • Two or more dimensions (predicates)
  • age(X, "19-25") ∧ occupation(X, "student") ⇒
    buys(X, "coke")
  • age(X, "19-25") ∧ buys(X, "popcorn") ⇒
    buys(X, "coke")
  • Categorical attributes
  • Finite number of possible values, no ordering
    among values
  • Quantitative attributes
  • Numeric, implicit ordering among values

29
Techniques for Mining MD Associations
  • Search for frequent k-predicate sets
  • Example: {age, occupation, buys} is a 3-predicate
    set
  • How to treat the quantitative attribute age?
  • 1. Using static discretization
  • Quantitative attributes are statically
    discretized by using predefined concept
    hierarchies
  • 2. Quantitative association rules
  • Quantitative attributes are dynamically
    discretized into bins based on the distribution
    of the data
  • 3. Distance-based association rules
  • This is a dynamic discretization process that
    considers the distance between data points

30
Static Discretization
  • Discretized prior to mining using concept
    hierarchy
  • Data cube is well suited for mining
  • The cells of an n-dimensional cuboid corresponds
    to the n-predicate sets
  • Mining from data cubes can be much faster

31
Quantitative Assoc. Rules
  • Quantitative attributes are dynamically
    discretized
  • The confidence or compactness of the rules is
    maximized
  • 2-D quantitative association rules:
    A_quan1 ∧ A_quan2 ⇒ A_cat
  • Binning: partition the range; each array cell
    holds the corresponding count distribution
  • Finding frequent itemsets: scan the 2-D array to
    find predicate sets satisfying minimum support
  • Clustering: cluster adjacent association rules to
    form more general rules

32
Quantitative Assoc. Rules
  • Example
  • age(X, 34) ∧ income(X, "31K-40K") ⇒ buys(X, "HDTV")
  • age(X, 35) ∧ income(X, "31K-40K") ⇒ buys(X, "HDTV")
  • age(X, 34) ∧ income(X, "41K-50K") ⇒ buys(X, "HDTV")
  • age(X, 35) ∧ income(X, "41K-50K") ⇒ buys(X, "HDTV")
  • Clustered into the more general rule:
    age(X, "34-35") ∧ income(X, "31K-50K") ⇒
    buys(X, "HDTV")

33
Distance-based Assoc. Rules
  • Binning methods do not capture the semantics of
    the data
  • Distance-based partitioning is more meaningful
  • Method
  • Find clusters
  • Search for groups of clusters that occur together

34
Problem of Support and Confidence
  • Example
  • Among 10,000 transactions:
  • 6,000 include computer games
  • 7,500 include videos
  • 4,000 include both
  • Min. support 30%, min. confidence 60%
  • The rule
  • buys(X, "game") ⇒ buys(X, "video") [40%, 66.7%]
  • is misleading, because the overall percentage of
    customers buying videos is 75%, which is higher
    than 66.7%!

35
Correlation
  • Correlation
  • Measures the dependency between itemsets
  • corr(A, B) > 1 ⇒ positively correlated
  • Also called the lift of the rule A ⇒ B
  • Example
  • corr(game, video) = P(game, video) /
    (P(game) × P(video))
  • = 0.4 / (0.6 × 0.75)
  • = 0.89 < 1.0
  • ⇒ Negative correlation!
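
The same computation from the raw counts on the previous slide, as a quick check:

    # Lift/correlation of game => video from the counts on slide 34
    n, games, videos, both = 10_000, 6_000, 7_500, 4_000
    lift = (both / n) / ((games / n) * (videos / n))
    print(lift)   # 0.888... < 1, so game and video are negatively correlated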

36
References
  • R. Agrawal and R. Srikant. Fast algorithms for
    mining association rules. VLDB'94.
  • H. Mannila, H. Toivonen, and A. I. Verkamo.
    Efficient algorithms for discovering association
    rules. KDD'94.
  • A. Savasere, E. Omiecinski, and S. Navathe. An
    efficient algorithm for mining association rules
    in large databases. VLDB'95.
  • J. S. Park, M. S. Chen, and P. S. Yu. An
    effective hash-based algorithm for mining
    association rules. SIGMOD'95.
  • H. Toivonen. Sampling large databases for
    association rules. VLDB'96.
  • S. Brin, R. Motwani, J. D. Ullman, and S. Tsur.
    Dynamic itemset counting and implication rules
    for market basket analysis. SIGMOD'97.
  • S. Sarawagi, S. Thomas, and R. Agrawal.
    Integrating association rule mining with
    relational database systems: Alternatives and
    implications. SIGMOD'98.
  • R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A
    tree projection algorithm for generation of
    frequent itemsets. J. Parallel and Distributed
    Computing, 2002.
  • J. Han, J. Pei, and Y. Yin. Mining frequent
    patterns without candidate generation. SIGMOD'00.
  • J. Pei, J. Han, and R. Mao. CLOSET: An Efficient
    Algorithm for Mining Frequent Closed Itemsets.
    DMKD'00.
  • J. Liu, Y. Pan, K. Wang, and J. Han. Mining
    Frequent Item Sets by Opportunistic Projection.
    KDD'02.
  • J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining
    Top-K Frequent Closed Patterns without Minimum
    Support. ICDM'02.
  • J. Wang, J. Han, and J. Pei. CLOSET+: Searching
    for the Best Strategies for Mining Frequent
    Closed Itemsets. KDD'03.
  • G. Liu, H. Lu, W. Lou, J. X. Yu. On Computing,
    Storing and Querying Frequent Patterns. KDD'03.

37
References
  • R. Srikant and R. Agrawal. Mining generalized
    association rules. VLDB'95.
  • J. Han and Y. Fu. Discovery of multiple-level
    association rules from large databases. VLDB'95.
  • R. Srikant and R. Agrawal. Mining quantitative
    association rules in large relational tables.
    SIGMOD'96.
  • T. Fukuda, Y. Morimoto, S. Morishita, and T.
    Tokuyama. Data mining using two-dimensional
    optimized association rules: Scheme, algorithms,
    and visualization. SIGMOD'96.
  • K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita,
    and T. Tokuyama. Computing optimized rectilinear
    regions for association rules. KDD'97.
  • R.J. Miller and Y. Yang. Association rules over
    interval data. SIGMOD'97.
  • Y. Aumann and Y. Lindell. A Statistical Theory
    for Quantitative Association Rules. KDD'99.
  • M. Klemettinen, H. Mannila, P. Ronkainen, H.
    Toivonen, and A. I. Verkamo. Finding
    interesting rules from large sets of discovered
    association rules. CIKM'94.
  • S. Brin, R. Motwani, and C. Silverstein. Beyond
    market basket: Generalizing association rules to
    correlations. SIGMOD'97.
  • C. Silverstein, S. Brin, R. Motwani, and J.
    Ullman. Scalable techniques for mining causal
    structures. VLDB'98.
  • P.-N. Tan, V. Kumar, and J. Srivastava.
    Selecting the Right Interestingness Measure for
    Association Patterns. KDD'02.
  • E. Omiecinski. Alternative Interest Measures
    for Mining Associations. TKDE'03.
  • Y. K. Lee, W. Y. Kim, Y. D. Cai, and J. Han.
    CoMine: Efficient Mining of Correlated Patterns.
    ICDM'03.