1
Frequent Patterns I
  • EECS435
  • Fall 2005

2
Outline
  • Introduction
  • What is frequent pattern mining?
  • What is association rule mining?
  • Methods for association rule mining
  • Extensions of frequent patterns

3
Introduction
  • Topics
  • Association Rule
  • Sequential Patterns
  • Graph Mining
  • Clustering and Outlier Detection
  • Classification and Prediction
  • Regression
  • Pattern Interestingness
  • Dimensionality Reduction

4
Introduction
  • Applications
  • Bioinformatics
  • Web mining
  • Text mining
  • Visualization
  • Financial data analysis
  • Intrusion detection

5
Introduction
  • Data mining and KDD (SIGKDD CD-ROM)
  • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM,
    PKDD, PAKDD, etc.
  • Journals: Data Mining and Knowledge Discovery, KDD
    Explorations
  • Database systems (SIGMOD CD-ROM)
  • Conferences: ACM-SIGMOD, ACM-PODS, VLDB,
    IEEE-ICDE, EDBT, ICDT, DASFAA
  • Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
  • AI & Machine Learning
  • Conferences: Machine Learning (ICML), AAAI,
    IJCAI, COLT (Learning Theory), etc.
  • Journals: Machine Learning, Artificial
    Intelligence, etc.

6
Introduction
  • Statistics
  • Conferences: Joint Stat. Meeting, etc.
  • Journals: Annals of Statistics, etc.
  • Bioinformatics
  • Conferences: ISMB, RECOMB, PSB, CSB, BIBE, etc.
  • Journals: J. of Computational Biology,
    Bioinformatics, etc.
  • Visualization
  • Conference proceedings: CHI, ACM-SIGGraph, etc.
  • Journals: IEEE Trans. Visualization and Computer
    Graphics, etc.

7
What Is Frequent Pattern Mining?
  • Frequent patterns: patterns (sets of items,
    sequences, etc.) that occur frequently in a
    database [AIS93]
  • Frequent pattern mining: finding regularities in
    data
  • What products were often purchased together?
  • Beer and diapers?!
  • What are the subsequent purchases after buying a
    car?
  • Can we automatically profile customers?

8
Basics
  • Itemset: a set of items
  • E.g., acm = {a, c, m}
  • Support of itemsets
  • Sup(acm) = 3
  • Given min_sup = 3, acm is a frequent pattern
  • Frequent pattern mining: find all frequent
    patterns in a database

[Table: Transaction database TDB]
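
As a minimal sketch of the definitions above, the following Python counts support as the number of transactions that contain an itemset. The transaction table from the slide is not reproduced in this transcript, so the five-transaction `tdb` below is a hypothetical stand-in, chosen so that Sup(acm) = 3; the helper name `support` is also illustrative.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


# Hypothetical transaction database (not the table from the slide).
# Each string is one transaction; each character is one item.
tdb = [
    "facdgimp",
    "abcflmo",
    "bfhjo",
    "bcksp",
    "afcelpmn",
]

min_sup = 3
print(support("acm", tdb))             # 3
print(support("acm", tdb) >= min_sup)  # True, so acm is a frequent pattern
```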
9
Association Rule Mining: A Road Map
  • Boolean vs. quantitative associations
  • age(x, 30..39) ∧ income(x, 42..48K) → buys(x,
    car) [1%, 75%]
  • Single-dimensional vs. multi-dimensional
    associations
  • Single-level vs. multiple-level analysis
  • What brands of beers are associated with what
    brands of diapers?
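
A rule is usually annotated with two numbers, and the sketch below assumes the conventional reading of such a pair as support and confidence: support is the fraction of transactions containing both sides of the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The function name `rule_stats` and the four baskets are hypothetical.

```python
def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence


# Hypothetical market baskets.
baskets = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "milk"},
    {"diaper", "milk"},
]

s, c = rule_stats({"beer"}, {"diaper"}, baskets)
print(f"support={s:.2f} confidence={c:.2f}")  # support=0.50 confidence=0.67
```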

10
Extensions & Applications
  • Correlation, causality analysis & mining
    interesting rules
  • Max-patterns and frequent closed itemsets
  • Sequential patterns
  • Periodic patterns
  • Structural Patterns

11
Frequent Pattern Mining Methods
  • Apriori and its variations/improvements
  • Mining frequent patterns without candidate
    generation
  • Mining max-patterns and closed itemsets
  • Mining multi-dimensional, multi-level frequent
    patterns with flexible support constraints
  • Interestingness: correlation and causality

12
Apriori: Candidate Generation-and-Test
  • Any subset of a frequent itemset must also be
    frequent: an anti-monotone property
  • A transaction containing {beer, diaper, nuts}
    also contains {beer, diaper}
  • {beer, diaper, nuts} is frequent → {beer, diaper}
    must also be frequent
  • No superset of any infrequent itemset should be
    generated or tested
  • Many item combinations can be pruned
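
The anti-monotone property can be checked directly on a small example. The sketch below uses four hypothetical baskets and verifies that no proper subset of {beer, diaper, nuts} has lower support than the full itemset.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


# Hypothetical market baskets.
baskets = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "milk"},
]

full = ("beer", "diaper", "nuts")
for size in range(1, len(full)):
    for subset in combinations(full, size):
        # Every transaction containing `full` also contains `subset`,
        # so the subset's support can never be smaller.
        assert support(subset, baskets) >= support(full, baskets)

print(support(full, baskets), support(("beer", "diaper"), baskets))  # 2 3
```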

13
Apriori-based Mining
  • Generate length (k+1) candidate itemsets from
    length k frequent itemsets, and
  • Test the candidates against DB

14
Apriori Algorithm
  • A level-wise, candidate-generation-and-test
    approach (Agrawal & Srikant, 1994)

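[Figure: Apriori on database D with min_sup = 2. Scan D to count the 1-candidates and obtain the frequent 1-itemsets; generate the 2-candidates, scan D, and obtain the frequent 2-itemsets; generate the 3-candidates, scan D, and obtain the frequent 3-itemsets.]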
15
The Apriori Algorithm
  • Ck: candidate itemsets of size k
  • Lk: frequent itemsets of size k
  • L1 = {frequent 1-itemsets}
  • for (k = 1; Lk ≠ ∅; k++) do
  • Ck+1 = candidates generated from Lk
  • for each transaction t in database do increment
    the count of all candidates in Ck+1 that are
    contained in t
  • Lk+1 = candidates in Ck+1 with min_support
  • return ∪k Lk
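
The loop above can be sketched in Python as follows. This is a minimal illustration and not an optimized implementation: candidates are formed as unions of two frequent k-itemsets (a simplification of the ordered self-join) and counted by a plain scan. The function name `apriori` and the four-transaction database are assumptions for this example, since the database shown in the slide figure is not preserved here.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)

    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have k+1 items
        # and whose k-subsets are all frequent.
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)

        # Scan the database: count the candidates contained in each transaction.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L(k+1): candidates that reach min_sup.
        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical database; each string is a transaction of one-letter items.
db = ["ACD", "BCE", "ABCE", "BE"]
result = apriori(db, min_sup=2)
print(result[frozenset("BCE")])  # 2
```

With min_sup = 2 this toy database yields nine frequent itemsets: A, B, C, E, AC, BC, BE, CE, and BCE.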

16
Important Details of Apriori
  • How to generate candidates?
  • Step 1: self-joining Lk
  • Step 2: pruning
  • How to count supports of candidates?

17
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-join Lk-1
  • INSERT INTO Ck
  • SELECT p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  • FROM Lk-1 p, Lk-1 q
  • WHERE p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
    p.itemk-1 < q.itemk-1
  • Step 2: pruning
  • For each itemset c in Ck do
  • For each (k-1)-subset s of c do if (s is not in
    Lk-1) then delete c from Ck

18
Example of Candidate-generation
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning
  • acde is removed because ade is not in L3
  • C4 = {abcd}
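
The self-join and pruning steps can be sketched in Python and checked against the example above. Itemsets are represented as tuples sorted in a fixed item order; the function name `apriori_gen` is an assumption for this sketch.

```python
from itertools import combinations


def apriori_gen(prev_level):
    """Build Ck from Lk-1; itemsets are tuples sorted in a fixed item order."""
    prev = sorted(set(prev_level))
    prev_set = set(prev)

    # Step 1: self-join. Two itemsets join when they share their first k-2
    # items and the last item of p precedes the last item of q.
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))

    # Step 2: prune candidates that have a (k-1)-subset outside Lk-1.
    return [
        c for c in candidates
        if all(s in prev_set for s in combinations(c, len(c) - 1))
    ]


L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
print(apriori_gen(L3))  # [('a', 'b', 'c', 'd')]
```

The join produces abcd and acde; acde is then dropped because its subset ade is missing from L3.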

19
How to Count Supports of Candidates?
  • Why is counting supports of candidates a problem?
  • The total number of candidates can be huge
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • Subset function finds all candidates contained
    in a transaction
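
A minimal hash-tree sketch along these lines is given below. It is an illustration under stated assumptions, not the structure from the original paper: the class names, the `fanout` and `leaf_capacity` parameters, and the use of Python's built-in `hash` are all choices made for this example. Interior nodes hash on the item at their depth, leaves hold itemsets with their counts, and `count_transaction` plays the role of the subset function.

```python
class Node:
    def __init__(self):
        self.is_leaf = True
        self.counts = {}     # leaf: itemset (sorted tuple) -> support count
        self.children = {}   # interior: hash bucket -> child Node


class HashTree:
    """Stores candidate k-itemsets given as sorted tuples."""

    def __init__(self, k, fanout=3, leaf_capacity=2):
        self.k, self.fanout, self.leaf_capacity = k, fanout, leaf_capacity
        self.root = Node()

    def _bucket(self, item):
        return hash(item) % self.fanout

    def insert(self, itemset):
        node, depth = self.root, 0
        while not node.is_leaf:
            node = node.children.setdefault(self._bucket(itemset[depth]), Node())
            depth += 1
        node.counts[itemset] = 0
        # Split an overfull leaf by hashing on the item at position `depth`.
        if len(node.counts) > self.leaf_capacity and depth < self.k:
            node.is_leaf = False
            for s, c in node.counts.items():
                child = node.children.setdefault(self._bucket(s[depth]), Node())
                child.counts[s] = c
            node.counts = {}

    def count_transaction(self, transaction):
        """Subset function: increment every candidate contained in the transaction."""
        items = tuple(sorted(set(transaction)))
        self._visit(self.root, items, 0, set(items), set())

    def _visit(self, node, items, start, item_set, seen_leaves):
        if node.is_leaf:
            # A leaf can be reached along several paths; count it once.
            if id(node) not in seen_leaves:
                seen_leaves.add(id(node))
                for s in node.counts:
                    if item_set.issuperset(s):
                        node.counts[s] += 1
            return
        for i in range(start, len(items)):
            child = node.children.get(self._bucket(items[i]))
            if child is not None:
                self._visit(child, items, i + 1, item_set, seen_leaves)

    def supports(self):
        """Collect {itemset: count} from all leaves."""
        out, stack = {}, [self.root]
        while stack:
            node = stack.pop()
            out.update(node.counts)
            stack.extend(node.children.values())
        return out


tree = HashTree(k=2)
for cand in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e"), ("c", "e")]:
    tree.insert(cand)
for t in ["acd", "bce", "abce", "be"]:   # hypothetical transactions
    tree.count_transaction(t)
print(tree.supports()[("b", "e")])  # 3
```

Only the leaves reachable by hashing items of the transaction are inspected, so a transaction is compared against a fraction of the candidates rather than all of them.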

20
Challenges of Frequent Pattern Mining
  • Challenges
  • Multiple scans of transaction database
  • Huge number of candidates
  • Tedious work of support counting for candidates
  • Improving Apriori: general ideas
  • Reduce number of transaction database scans
  • Shrink number of candidates
  • Facilitate support counting of candidates

21
DIC: Reduce Number of Scans
  • Once both A and D are determined frequent, the
    counting of AD can begin
  • Once all length-2 subsets of BCD are determined
    frequent, the counting of BCD can begin
  • S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997

[Figure: itemset lattice over items A, B, C, D (A, B, C, D; AB, AC, BC, AD, BD, CD; ABC, ABD, ACD, BCD; ABCD), with a comparison of when Apriori and DIC count 2-itemsets and 3-itemsets as the transactions are scanned.]
22
DHP: Reduce the Number of Candidates
  • A hashing bucket count < min_sup → every candidate
    in the bucket is infrequent
  • Candidates: a, b, c, d, e
  • Hash entries: {ab, ad, ae}, {bd, be, de}
  • Large 1-itemsets: a, b, d, e
  • The sum of counts of ab, ad, ae < min_sup → ab
    should not be a candidate 2-itemset
  • J. Park, M. Chen, and P. Yu, 1995
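
The bucket filter can be sketched for 2-itemsets as follows. This is an illustration of the idea rather than the published algorithm: the function name `dhp_pair_candidates`, the number of buckets, the hash function over item positions, and the transactions are all assumptions for this example.

```python
from collections import Counter
from itertools import combinations


def dhp_pair_candidates(transactions, min_sup, n_buckets=7):
    """One scan: count items and hash every 2-itemset of each transaction
    into a bucket; return the candidate pairs that pass the bucket filter."""
    transactions = [sorted(set(t)) for t in transactions]
    all_items = sorted({x for t in transactions for x in t})
    order = {item: i for i, item in enumerate(all_items)}

    def bucket(pair):
        # Hypothetical hash function over the items' positions in sorted order.
        return (order[pair[0]] * 10 + order[pair[1]]) % n_buckets

    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for t in transactions:
        item_counts.update(t)
        for pair in combinations(t, 2):
            bucket_counts[bucket(pair)] += 1

    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_sup)
    # A bucket count is an upper bound on the support of every pair hashed
    # into it, so a pair in a bucket below min_sup cannot be frequent.
    return [
        p for p in combinations(frequent_items, 2)
        if bucket_counts[bucket(p)] >= min_sup
    ]


db = ["acd", "bce", "abce", "be"]   # hypothetical transactions
print(dhp_pair_candidates(db, min_sup=2))
# [('a', 'c'), ('b', 'c'), ('b', 'e'), ('c', 'e')]
```

Plain Apriori would generate six candidate pairs from the frequent items a, b, c, e. Here ab and ae fall into buckets whose counts are below min_sup, so they are dropped before the counting scan.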

23
Partition: Scan Database Only Twice (Distributed
Computing)
  • Partition the database into n partitions
  • Itemset X is frequent → X is frequent in at least
    one partition
  • Scan 1: partition the database and find local
    frequent patterns
  • Scan 2: consolidate global frequent patterns
  • A. Savasere, E. Omiecinski, and S. Navathe, 1995
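
The two scans can be sketched as below, assuming the minimum support is given as a fraction so that it can be scaled to each partition. The local miner here is a brute-force stand-in that enumerates every subset of every transaction, which is only practical for toy data; the function names, the transactions, and the threshold are assumptions for this example.

```python
from itertools import combinations
from math import ceil


def local_frequent(transactions, min_count):
    """Toy miner: enumerate every subset of every transaction (exponential)."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                key = frozenset(subset)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}


def partition_mine(transactions, min_frac, n_parts):
    """Return {frozenset: global support count} using two database scans."""
    transactions = [frozenset(t) for t in transactions]
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Scan 1: collect the itemsets that are frequent in at least one partition.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, ceil(min_frac * len(part)))

    # Scan 2: count the candidates over the whole database.
    global_min = ceil(min_frac * len(transactions))
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}


# Hypothetical transactions, split into two partitions of four.
db = ["acd", "bce", "abce", "be", "ab", "ace", "bc", "de"]
result = partition_mine(db, min_frac=0.5, n_parts=2)
print(sorted("".join(sorted(s)) for s in result))
```

Because any globally frequent itemset must be locally frequent somewhere, the second scan only has to count the union of the local results.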

24
Sampling for Frequent Patterns
  • Select a sample of the original database and mine
    frequent patterns within the sample using Apriori
  • Scan the database once to verify the frequent
    itemsets found in the sample; only the borders of
    the closure of frequent patterns are checked
  • Example: check abcd instead of ab, ac, ..., etc.
  • Scan the database again to find missed frequent
    patterns
  • H. Toivonen, 1996