Data Mining: Concepts and Techniques (2nd ed.)

1
Data Mining: Concepts and Techniques (2nd ed.)
Chapter 5: Frequent Pattern Mining
2
Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods
  • Basic Concepts
  • Frequent Itemset Mining: The Apriori Algorithm
  • Improving the Efficiency of the Apriori Algorithm
  • Summary

3
What Is Frequent Pattern Analysis?
  • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently (or whose elements are strongly correlated) in a data set
  • First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
  • Motivation: finding inherent regularities in data
  • What products were often purchased together? Beer and diapers?!
  • What are the subsequent purchases after buying a PC?
  • What kinds of DNA are sensitive to this new drug?
  • Can we automatically classify web documents?
  • Applications
  • Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

4
Why Is Freq. Pattern Mining Important?
  • Freq. pattern: an intrinsic and important property of datasets
  • Foundation for many essential data mining tasks:
  • Association, correlation, and causality analysis
  • Mining sequential and structural (e.g., sub-graph) patterns
  • Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  • Classification: discriminative, frequent-pattern-based analysis
  • Cluster analysis: frequent-pattern-based subspace clustering
  • Data warehousing: iceberg cube and cube-gradient
  • Semantic data compression: fascicles
  • Broad applications

5
Basic Concepts: Frequent Patterns and Association Rules
  • Itemset: a set of one or more items
  • k-itemset: X = {x1, ..., xk}
  • (Absolute) support, or support count, of X: the frequency or number of occurrences of the itemset X
  • (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
  • An itemset X is frequent if X's support is no less than a minsup threshold

Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
  • Let minsup = 50%
  • Freq. 1-itemsets:
  • {Beer}: 3 (60%), {Nuts}: 3 (60%), {Diaper}: 4 (80%), {Eggs}: 3 (60%)
  • Freq. 2-itemsets:
  • {Beer, Diaper}: 3 (60%)
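As a concrete aside (not part of the original slides), the following minimal Python sketch computes these support values over the five example transactions; the function and variable names are my own.

from itertools import combinations

# Transactions from the slide (TID -> items bought)
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support_count(itemset):
    # Absolute support: number of transactions containing every item of the itemset
    return sum(1 for items in transactions.values() if itemset <= items)

n = len(transactions)
minsup = 0.5  # relative minsup = 50%
all_items = sorted(set().union(*transactions.values()))

# Print the frequent 1-itemsets and 2-itemsets with absolute and relative support
for k in (1, 2):
    for combo in combinations(all_items, k):
        cnt = support_count(set(combo))
        if cnt / n >= minsup:
            print(set(combo), cnt, f"{100 * cnt / n:.0f}%")

Running it prints exactly the frequent 1-itemsets and the single frequent 2-itemset {Beer, Diaper} listed above.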

6
Basic Concepts: Association Rules
  • Find all the rules X → Y with minimum support and confidence
  • support, s: the probability that a transaction contains X ∪ Y
  • confidence, c: the conditional probability that a transaction containing X also contains Y
  • Let minsup = 50%, minconf = 50%
  • Freq. Pat.: {Beer}: 3, {Nuts}: 3, {Diaper}: 4, {Eggs}: 3, {Beer, Diaper}: 3

Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
(Venn diagram: customer buys beer, customer buys diaper, customer buys both)
  • Association rules (many more!):
  • Beer → Diaper (60%, 100%)
  • Diaper → Beer (60%, 75%)

Note: the itemset notation here is subtle!
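To make the (support, confidence) numbers above concrete, here is a small Python sketch of my own that uses the support counts from the slide; rule_stats is a hypothetical helper name.

# Support counts from the slide: {Beer}: 3, {Diaper}: 4, {Beer, Diaper}: 3; 5 transactions in total
n = 5
count = {
    frozenset({"Beer"}): 3,
    frozenset({"Diaper"}): 4,
    frozenset({"Beer", "Diaper"}): 3,
}

def rule_stats(lhs, rhs):
    # support(X -> Y) = P(X u Y); confidence(X -> Y) = P(X u Y) / P(X)
    both = count[frozenset(lhs | rhs)]
    return both / n, both / count[frozenset(lhs)]

print(rule_stats({"Beer"}, {"Diaper"}))   # (0.6, 1.0)  i.e. Beer -> Diaper (60%, 100%)
print(rule_stats({"Diaper"}, {"Beer"}))   # (0.6, 0.75) i.e. Diaper -> Beer (60%, 75%)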
7
Closed Patterns and Max-Patterns
  • A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains (100 choose 1) + (100 choose 2) + ... + (100 choose 100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
  • Solution: mine closed patterns and max-patterns instead
  • An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ ICDT'99)
  • An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
  • Closed patterns are a lossless compression of frequent patterns
  • They reduce the number of patterns and rules

8
Closed Itemset
  • An itemset is closed if none of its immediate
    supersets has the same support as the itemset
  • Closed pattern is a lossless compression of
    frequent patterns.
  • It reduces the number of patterns but does not lose the support information.

9
Max-patterns
Min_sup = 2
  • Difference from closed patterns?
  • We do not care about the real support of the sub-patterns of a max-pattern
  • Max-pattern: a frequent pattern with no proper frequent super-pattern
  • {B,C,D,E} and {A,C,D} are max-patterns
  • {B,C,D} is not a max-pattern

Tid Items
10 A, B, C, D, E
20 B, C, D, E
30 A, C, D, F
10
Maximal vs. Closed Frequent Itemsets
(Itemset-lattice figure annotated with transaction ids; with minsup = 2 there are 9 closed frequent itemsets and 4 maximal frequent itemsets.)
11
Maximal vs. Closed Itemsets
Closed frequent itemsets are lossless: the support of any frequent itemset can be deduced from the closed frequent itemsets.
A max-pattern is a lossy compression: we only know that all of its subsets are frequent, but not their real support.
Thus, in many applications, mining closed patterns is more desirable than mining max-patterns. (A small sketch of the support lookup follows.)
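One way to make the lossless claim concrete: the support of any frequent itemset equals the maximum support among the closed frequent itemsets that contain it. Below is a minimal Python sketch of that lookup; the closed itemsets and their supports are ones I derived by hand from the three-transaction example two slides back (min_sup = 2), so treat them as illustrative.

def support_from_closed(itemset, closed):
    # closed: dict mapping closed frequent itemsets (frozensets) to their support counts.
    # The support of any frequent itemset is the largest support among its closed supersets.
    sups = [sup for c, sup in closed.items() if itemset <= c]
    return max(sups) if sups else 0   # 0: no closed superset, so the itemset is not frequent

# Closed frequent itemsets computed by hand from the three-transaction example above (min_sup = 2)
closed = {frozenset("BCDE"): 2, frozenset("ACD"): 2, frozenset("CD"): 3}

print(support_from_closed(frozenset("BC"), closed))   # 2, deduced from the closed superset {B,C,D,E}
print(support_from_closed(frozenset("CD"), closed))   # 3, {C,D} is itself closed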
12
Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods
  • Basic Concepts
  • Frequent Itemset Mining: The Apriori Algorithm
  • Improving the Efficiency of the Apriori Algorithm
  • Summary

13
Key Observation (Monotonicity)
  • Any subset of a frequent itemset must also be frequent: the downward closure property (also called the Apriori property)
  • If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  • Efficient mining methodology: the Apriori pruning principle
  • Any superset of an infrequent itemset must also be infrequent
  • If any subset of an itemset S is infrequent, then there is no chance for S to be frequent, so we do not even have to consider S. Prune it!

14
The Downward Closure Property and Scalable Mining Methods
  • Scalable mining methods: three major approaches
  • Level-wise, join-based approach: Apriori (Agrawal & Srikant @ VLDB'94)
  • Frequent pattern projection and growth: FPgrowth (Han, Pei & Yin @ SIGMOD'00)
  • Vertical data format approach: Eclat (Zaki, Parthasarathy, Ogihara & Li @ KDD'97)

15
Apriori: A Candidate Generation-and-Test Approach
  • Outline of Apriori (level-wise candidate generation and testing)
  • Method:
  • Initially, scan the DB once to get the frequent 1-itemsets
  • Repeat:
  • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  • Test the candidates against the DB to find the frequent (k+1)-itemsets
  • Set k = k + 1
  • Terminate when no frequent or candidate set can be generated
  • Return all the frequent itemsets derived

16
The Apriori Algorithm (Pseudo-Code)
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  k = 1
  L1 = {frequent items}                     // the frequent 1-itemsets
  while (Lk != ∅) do                        // while Lk is not empty
      Ck+1 = candidates generated from Lk   // candidate generation
      derive Lk+1 by counting every candidate in Ck+1 against the TDB
          and keeping those that satisfy minsup
      k = k + 1
  return ∪k Lk
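A compact Python rendering of this pseudo-code follows, as a sketch rather than the book's reference implementation; the join step here simply unions any two frequent k-itemsets whose union has k+1 items (the prefix-based self-join shown two slides later is the classic, faster variant).

from itertools import combinations

def apriori(transactions, minsup_count):
    # Level-wise Apriori: transactions is a list of item sets, minsup_count an absolute threshold
    transactions = [frozenset(t) for t in transactions]

    def frequent(candidates):
        # Count each candidate against the DB (one scan) and keep those meeting minsup
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= minsup_count}

    items = {i for t in transactions for i in t}
    L = frequent({frozenset([i]) for i in items})   # L1 = frequent 1-itemsets
    all_frequent = dict(L)
    k = 1
    while L:                                        # while Lk is not empty
        # Ck+1: union pairs of Lk members, keep only (k+1)-sets whose k-subsets are all in Lk
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = frequent(candidates)                    # Lk+1 via one more DB scan
        all_frequent.update(L)
        k += 1
    return all_frequent                             # union of all Lk

# Example DB from the next slide (minsup = 2)
tdb = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), sup)

On this four-transaction database the output reproduces L1, L2 and L3 from the worked example on the next slide.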

17
The Apriori Algorithm: An Example (Supmin = 2)

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 (candidates in C1 meeting Supmin):
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1):
{A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

2nd scan → C2 with counts:
Itemset  sup
{A,B}    1
{A,C}    2
{A,E}    1
{B,C}    2
{B,E}    3
{C,E}    2

L2:
Itemset  sup
{A,C}    2
{B,C}    2
{B,E}    3
{C,E}    2

C3 (generated from L2):
{B,C,E}

3rd scan → L3:
Itemset   sup
{B,C,E}   2

Self-join: members of Lk-1 are joinable if their first (k-2) items are in common.
18
Apriori: Implementation Trick
  • How to generate candidates?
  • Step 1: self-join Lk
  • Step 2: prune
  • Example of candidate generation:
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 ⋈ L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning:
  • acde is removed because ade is not in L3
  • C4 = {abcd}

Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. (A small sketch of the join-and-prune step follows.)
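The self-join and prune steps can be written directly. The sketch below is my own (itemsets are kept as sorted strings for brevity) and reproduces the abc/abd example above.

from itertools import combinations

def gen_candidates(Lk):
    # Generate C(k+1) from Lk: self-join on the first k-1 items, then prune any
    # candidate that has an infrequent k-subset
    Lk = sorted(Lk)                          # itemsets as sorted strings, e.g. "abc"
    k = len(Lk[0])
    joined = set()
    for a, b in combinations(Lk, 2):
        if a[:k-1] == b[:k-1]:               # first k-1 items in common
            joined.add("".join(sorted(set(a) | set(b))))
    frequent = set(Lk)
    return {c for c in joined
            if all("".join(s) in frequent for s in combinations(c, k))}

L3 = {"abc", "abd", "acd", "ace", "bcd"}
print(gen_candidates(L3))    # {'abcd'}; 'acde' is pruned because 'ade' is not in L3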
19
Challenges of Frequent Pattern Mining
  • Challenges:
  • Multiple scans of the transaction database
  • Huge number of candidates
  • Tedious workload of support counting for candidates
  • Improving Apriori: general ideas
  • Reduce the number of passes over the transaction database
  • Shrink the number of candidates
  • Facilitate the support counting of candidates

20
Apriori Improvements and Alternatives
  • Reduce passes over the transaction database:
  • Partitioning (e.g., Savasere et al., 1995)
  • Dynamic itemset counting (Brin et al., 1997)
  • Shrink the number of candidates:
  • Hash-based technique (e.g., DHP; Park et al., 1995)
  • Transaction reduction (e.g., Bayardo, 1998)
  • Sampling (e.g., Toivonen, 1996)

21
Partitioning: Scan the Database Only Twice
  • Theorem: any itemset that is potentially frequent in TDB must be frequent in at least one of the partitions of TDB
  • Method:
  • Scan 1: partition the database (how?) and find the local frequent patterns
  • Scan 2: consolidate the global frequent patterns (how?), as sketched below
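A rough sketch of the two-scan idea, under my own simplifications: each partition is mined with the same relative minsup (here with a brute-force stand-in for Apriori), the union of the local frequent itemsets becomes the global candidate set (justified by the theorem above), and one final scan counts those candidates over the whole database.

from itertools import combinations

def frequent_itemsets(transactions, minsup_frac):
    # Brute-force local miner (a stand-in for Apriori) used on one small partition
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            c = frozenset(combo)
            cnt = sum(1 for t in transactions if c <= t)
            if cnt / n >= minsup_frac:
                result[c] = cnt
    return result

def partition_mine(transactions, minsup_frac, num_partitions=2):
    # Scan 1: mine each partition with the same relative minsup; the union of the
    # local frequent itemsets is the global candidate set
    size = -(-len(transactions) // num_partitions)   # ceiling division
    candidates = set()
    for start in range(0, len(transactions), size):
        candidates |= set(frequent_itemsets(transactions[start:start + size], minsup_frac))
    # Scan 2: count the candidates once over the whole database, keep the global winners
    full = [frozenset(t) for t in transactions]
    counts = {c: sum(1 for t in full if c <= t) for c in candidates}
    return {c: cnt for c, cnt in counts.items() if cnt / len(full) >= minsup_frac}

tdb = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(partition_mine(tdb, minsup_frac=0.5))   # same frequent itemsets as the earlier Apriori example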

22
Direct Hashing and Pruning (DHP)
  • While generating L1, the algorithm also generates all the 2-itemsets of each transaction, hashes them into a hash table, and keeps a count for each bucket.

23
Hash Function Used
  • For each pair, a numeric value is obtained by first representing B by 1, C by 2, E by 3, J by 4, M by 5, and Y by 6. Each pair can then be represented by a two-digit number, for example (B, E) by 13 and (C, M) by 25.
  • The two-digit number is then reduced modulo 8 (divide by 8 and take the remainder). This is the bucket address.
  • A count of the number of pairs hashed to each bucket is kept. Buckets whose count is above the support threshold have their bit in a bit vector set to 1, otherwise 0.
  • All pairs that hash to a bucket whose bit is 0 are removed (see the sketch below).
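The bucket-address computation is easy to state in code. A minimal sketch of my own, using the item codes from the slide:

code = {"B": 1, "C": 2, "E": 3, "J": 4, "M": 5, "Y": 6}

def bucket(pair):
    # Two-digit number from the item codes, e.g. (B, E) -> 13 and (C, M) -> 25,
    # then modulo 8 gives the bucket address in the hash table
    a, b = sorted(pair, key=code.get)
    return (10 * code[a] + code[b]) % 8

print(bucket(("B", "E")))   # 13 % 8 = 5
print(bucket(("C", "M")))   # 25 % 8 = 1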

24
Transaction Reduction
As discussed earlier, any transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset, so such a transaction may be marked or removed.
TID Items bought
001 B, M, T, Y
002 B, M
003 T, S, P
004 A, B, C, D
005 A, B
006 T, Y, E
007 A, B, M
008 B, C, D, T, P
009 D, T, S
010 A, B, M
The frequent items (L1) are A, B, D, M, and T. We cannot use these to eliminate any transactions, since every transaction contains at least one item in L1. The frequent pairs (C2) are {A, B} and {B, M}. How can we reduce transactions using these? (One possible answer is sketched below.)
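One possible answer, sketched in Python as my own illustration: drop every transaction that contains no frequent pair, since such a transaction cannot contribute to any frequent 3-itemset in later scans.

transactions = {
    "001": {"B","M","T","Y"}, "002": {"B","M"}, "003": {"T","S","P"},
    "004": {"A","B","C","D"}, "005": {"A","B"}, "006": {"T","Y","E"},
    "007": {"A","B","M"},     "008": {"B","C","D","T","P"},
    "009": {"D","T","S"},     "010": {"A","B","M"},
}
frequent_pairs = [{"A","B"}, {"B","M"}]

# Keep only transactions containing at least one frequent 2-itemset; the rest
# cannot contain any frequent 3-itemset, so they can be skipped in later scans
reduced = {tid: items for tid, items in transactions.items()
           if any(p <= items for p in frequent_pairs)}
print(sorted(transactions.keys() - reduced.keys()))   # ['003', '006', '008', '009'] can be removed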
25
Sampling (Toivonen, 1996)
  • A random sample (one that fits in main memory) is drawn from the overall set of transactions, and the sample is searched for frequent itemsets. These are called the sample frequent itemsets.
  • The result is not guaranteed to be accurate; we sacrifice accuracy for efficiency. A lower support threshold may be used on the sample to reduce the chance of missing any frequent itemsets.
  • The sample size is chosen so that the search for frequent itemsets over the sample can be done in main memory.

26
Dynamic Itemset Counting
  • The scan is interrupted after every M transactions.
  • Itemsets that are already known to be frequent are combined in pairs to generate higher-order candidate itemsets.
  • The technique is dynamic in that it starts counting the support of an itemset as soon as all of its subsets have been found frequent.
  • The resulting algorithm requires fewer database scans than Apriori.

27
DIC: Reduce the Number of Scans
28
Summary
  • Frequent patterns
  • Closed patterns and Max-patterns
  • The Apriori algorithm for mining frequent patterns
  • Improving the efficiency of Apriori: Partitioning, DHP, DIC