Chapter 5: Mining Frequent Patterns, Associations, and Correlations - PowerPoint PPT Presentation

1
Chapter 5: Mining Frequent Patterns,
Associations, and Correlations
  • Per Kristian Helland
  • Joacim Christiansen
  • Øystein Rose

2
Plan
  • Introduction
  • Definitions
  • Road Map
  • Finding Frequent Itemsets (Apriori)
  • Frequent Itemsets → Association Rules
  • Improving Apriori

3
Introduction
  • Early 1990s: an executive at Marks & Spencer
    explained his database to Agrawal
  • Agrawal began devising algorithms for asking
    open-ended queries → published Mining Association
    Rules between Sets of Items in Large Databases,
    1993 (991 citations)
  • Results are represented in the form of association rules:
    computer ⇒ antivirus_software [support = 2%,
    confidence = 60%]

4
Introduction (2)
  • Makes previously unknown information available
  • Wal-Mart: diapers and beer on Fridays (strategic
    marketing)
  • Medical information: how patients react to
    medicine (social, economic, and domain)

5
Definitions Overview
  • We use three levels of abstraction
  • Patterns
  • Itemsets
  • Association rules

6
Definitions Patterns
  • Patterns can be
  • Itemsets (x and y)
  • Sequential (x before y)
  • Temporal (x 3 hours before y)
  • Structured (x is a sub-tree)
  • Etc
  • Frequent patterns are those that appear in a
    data set frequently (a measure that is given by a
    user or an expert)

Historical remark: Agrawal (1993) used the term
"large itemset" to describe itemsets that satisfy
a specified minimum support threshold
7
Itemset
  • Given: a set of items, I = {x1, ..., xn}
  • A set of items, X ⊆ I, is an itemset
  • X is a k-itemset, where k = |X|

8
Itemset properties
  • Proper sub-itemset
  • Every item of X is contained in Y, but at least one item
    of Y is not in X (X ⊂ Y)
  • Proper super-itemset
  • Closed itemset
  • The itemset X is closed in a data set S if there
    exists no proper super-itemset Y such that Y has
    the same support count as X
  • Closed frequent itemset
  • The set C of closed frequent itemsets for a data
    set S contains complete information regarding its
    corresponding frequent itemsets
  • Maximal frequent itemset
  • X is a maximal frequent itemset in S if X is
    frequent and there exists no super-itemset Y such
    that X ⊂ Y and Y is frequent in S

9
Road map- Market basket analysis is just one
form of frequent pattern mining
Frequent pattern mining techniques can be
classified based on
  • Completeness
  • Levels of abstraction
  • Number of data dimensions
  • Types of values
  • Kinds of rules mined
  • Kinds of patterns mined

10
Road map- Market basket analysis is just one
form of frequent pattern mining
  • Completeness
  • Complete set of frequent itemsets, closed
    frequent itemsets, constrained itemsets, top-k
    itemsets
  • Different applications may have different
    requirements
  • Levels of abstraction
  • Multilevel vs. Single-level
  • buys(X, computer) ⇒ buys(X, HP_printer)
  • buys(X, laptop_computer) ⇒ buys(X,
    HP_printer)
  • Number of data dimensions
  • Single-dimensional vs. Multidimensional
  • buys(X, computer) ⇒ buys(X, antivirus_software)
  • age(X, "30...39") ∧ income(X, "42K...48K") ⇒
    buys(X, high_resolution_TV)

11
Road map- Market basket analysis is just one
form of frequent pattern mining
  • Types of values
  • Boolean vs. quantitative
  • age(X, "30...39") ∧ income(X, "42K...48K") ⇒
    buys(X, high_resolution_TV)
  • Kinds of rules mined
  • Association rules
  • Correlation rules (further statistical analysis)
  • Kinds of patterns mined
  • Frequent itemset mining
  • Sequential pattern mining (ordering of events)
  • Structured pattern mining (any structure, more
    general)

12
Finding Frequent Itemsets- Apriori
  • Apriori (Agrawal & Srikant, 1994): uses prior
    knowledge of frequent itemset properties.
  • k-itemsets are used to explore (k+1)-itemsets: L1 →
    L2, L2 → L3, ..., Lk-1 → Lk
  • Apriori property: all nonempty subsets of a
    frequent itemset must also be frequent. If P(I)
    < min_sup, then P(I ∪ A) < min_sup
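The Apriori property gives a cheap pruning test: a candidate k-itemset can be discarded as soon as one of its (k-1)-subsets is missing from the previous level. A minimal sketch in Python, assuming itemsets are represented as frozensets (the data below is illustrative):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori property: if any (k-1)-subset of a candidate k-itemset
    is not among the frequent (k-1)-itemsets, the candidate itself
    cannot be frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(sorted(candidate), k - 1))

# Illustrative L2: {I1, I3} is missing, so {I1, I2, I3} can be pruned.
L2 = {frozenset({"I1", "I2"}), frozenset({"I2", "I3"})}
print(has_infrequent_subset({"I1", "I2", "I3"}, L2))  # True
```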

14
Finding Frequent Itemsets (2)- Apriori
  • Two basic steps
  • Join: finding candidates, Ck = Lk-1 ⋈ Lk-1
  • li: the ith itemset in Lk-1
  • li[j]: the jth item in li
  • Assumed: items within a transaction or itemset
    are sorted in lexicographic order, so for a
    (k-1)-itemset li: li[1] < li[2] < ... < li[k-1]
  • l1 and l2 are joinable if their first (k-2) items
    are in common: (l1[1] = l2[1]) ∧ (l1[2] = l2[2])
    ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] <
    l2[k-1])
  • Resulting itemset from joining l1 and l2: {l1[1],
    l1[2], ..., l1[k-2], l1[k-1], l2[k-1]}

15
Finding Frequent Itemsets (3) - Apriori
  • Prune: removing infrequent itemsets
  • Note: any (k-1)-subset of a candidate k-itemset
    that is not in Lk-1 cannot be frequent
  • Ck ⊇ Lk
  • DB scan: each candidate in Ck is counted; if the
    support count of li < the minimum support count, then
    li is removed
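The join and prune steps can be combined into one candidate-generation routine. A sketch, assuming itemsets are frozensets; the function name apriori_gen follows the paper, the rest of the names are illustrative:

```python
from itertools import combinations

def apriori_gen(Lk_prev):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.
    Join: merge two (k-1)-itemsets that share their first k-2 items.
    Prune: drop candidates that have an infrequent (k-1)-subset."""
    prev = sorted(sorted(s) for s in Lk_prev)
    k = len(prev[0]) + 1
    frequent = {frozenset(s) for s in Lk_prev}
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            l1, l2 = prev[i], prev[j]
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:   # joinable
                cand = frozenset(l1 + [l2[-1]])
                # prune: every (k-1)-subset must be frequent
                if all(frozenset(sub) in frequent
                       for sub in combinations(sorted(cand), k - 1)):
                    candidates.add(cand)
    return candidates

# L2 from the worked example that follows; only two candidates survive.
L2 = [{"I1", "I2"}, {"I1", "I3"}, {"I1", "I5"},
      {"I2", "I3"}, {"I2", "I4"}, {"I2", "I5"}]
C3 = apriori_gen(L2)
print(sorted(sorted(c) for c in C3))  # [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]
```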

16
Finding Frequent Itemsets (4)- Apriori
  • Example: transactional data
    {I1, I2, I5}, {I2, I4}, {I2, I3}, {I1, I2, I4},
    {I1, I3}, {I2, I3}, {I1, I3}, {I1, I2, I3, I5},
    {I1, I2, I3}

17
Finding Frequent Itemsets (5)- Apriori
  • 1. Each item in the transactions is a member of the
    set of candidate 1-itemsets, C1
  • 2. Keep those with (absolute) support ≥ minimum (absolute)
    support

18
Finding Frequent Itemsets (6)- Apriori
  • 3. L1 ⋈ L1; no candidates are removed (each
    subset is frequent)
  • 4. Finding support counts for C2
  • 5. (absolute) support ≥ minimum (absolute)
    support

19
Finding Frequent Itemsets (7)- Apriori
  • 6. L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5},
    {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2,
    I4, I5}}. All itemsets whose subsets are not
    frequent are removed
  • 7. Finding support counts for C3
  • (absolute) support ≥ minimum (absolute) support
  • L3 ⋈ L3 = {{I1, I2, I3, I5}}, but subset {I2,
    I3, I5} is not frequent ⇒ C4 = Ø
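The whole walkthrough (steps 1 to 7) can be reproduced end to end on the nine example transactions with an absolute minimum support count of 2. A compact sketch; all names are illustrative:

```python
from itertools import combinations
from collections import Counter

def apriori(transactions, min_count):
    """Level-wise search: L1 from a first scan, then join, prune, and
    count candidates once per level until no candidates survive."""
    counts = Counter(item for t in transactions for item in t)
    Lk = {frozenset([i]) for i, c in counts.items() if c >= min_count}
    frequent = set(Lk)
    k = 2
    while Lk:
        # join: merge (k-1)-itemsets that agree on their first k-2 items
        cands = {a | b for a in Lk for b in Lk
                 if len(a | b) == k and sorted(a)[:-1] == sorted(b)[:-1]}
        # prune: every (k-1)-subset of a candidate must be in Lk-1
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        support = Counter()          # one DB scan per level
        for t in transactions:
            for c in cands:
                if c <= t:
                    support[c] += 1
        Lk = {c for c in cands if support[c] >= min_count}
        frequent |= Lk
        k += 1
    return frequent

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
L = apriori(D, min_count=2)
print(len(L))  # 13: five 1-itemsets, six 2-itemsets, two 3-itemsets
```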

20
Association rules
  • Given a database D, a multiset of subsets of
    the set of items I, we call each T in D a
    transaction
  • An association rule is of the form X ⇒ Y, where X
    and Y are itemsets and X ∩ Y = Ø

21
Association rules properties
  • The rule support is defined as
  • support(X ⇒ Y) = P(X ∪ Y), the percentage of
    transactions in D that contain X ∪ Y
  • The rule confidence is defined as
  • confidence(X ⇒ Y) = P(Y|X), the percentage of
    transactions in D containing X that
  • also contain Y
  • Rules that satisfy both a minimum support
    threshold and a minimum confidence threshold are
    called strong
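The two measures can be computed directly from their definitions. A minimal sketch over an illustrative mini-database (the item names echo the earlier computer/antivirus example):

```python
def support(D, itemset):
    """support(X) = fraction of transactions in D that contain X."""
    return sum(itemset <= t for t in D) / len(D)

def confidence(D, X, Y):
    """confidence(X => Y) = P(Y|X) = support(X u Y) / support(X)."""
    return support(D, X | Y) / support(D, X)

# Illustrative database of five transactions.
D = [{"computer", "antivirus_software"}, {"computer"}, {"printer"},
     {"computer", "antivirus_software"}, {"scanner"}]
print(round(support(D, {"computer", "antivirus_software"}), 2))      # 0.4
print(round(confidence(D, {"computer"}, {"antivirus_software"}), 2))  # 0.67
```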

22
Association rule howto?
  • 1. Find all frequent itemsets
  • Apriori
  • 2. Generate strong association rules from the
    frequent itemsets
  • Unsupervised
  • For each frequent itemset l, generate all
    nonempty subsets of l
  • For every subset s of l, output the rule s ⇒ (l \ s)
    if support_count(l) / support_count(s) ≥ min_conf
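The generation procedure above can be sketched directly from support counts. The counts below are taken from the earlier worked example; the 70% minimum confidence is illustrative:

```python
from itertools import combinations

def generate_rules(frequent_counts, min_conf):
    """For each frequent itemset l, emit s => (l - s) for every nonempty
    proper subset s whose confidence count(l)/count(s) meets min_conf."""
    rules = []
    for l, l_count in frequent_counts.items():
        for r in range(1, len(l)):
            for s in combinations(sorted(l), r):
                s = frozenset(s)
                conf = l_count / frequent_counts[s]
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

# Support counts from the running example (9 transactions, min_sup = 2).
counts = {frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
          frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
          frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2}
rules = generate_rules(counts, 0.7)
for s, rest, conf in rules:
    print(sorted(s), "=>", sorted(rest), round(conf, 2))
```

With min_conf = 0.7 this yields five strong rules, all with confidence 1.0 (e.g. {I1, I5} ⇒ {I2}); rules such as {I1} ⇒ {I2, I5} fall below the threshold.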

23
Association rule howto? (2)
  • Supervised generation of rules
  • What is a truly interesting rule?
  • Chapter 1
  • Easily understood by humans
  • Valid on new or test data with some degree of
    certainty
  • Potentially useful
  • Novel
  • Others: simplicity, generality, actionability,
    unexpectedness (Mannila et al., 1999)
  • User/expert to guide the discovery process

24
Improving apriori- Reducing number of database
scans
Variations may be summarized as follows
  • Hash-based technique
  • Transaction reduction
  • Partitioning
  • Sampling
  • Dynamic itemset counting

25
Improving apriori- Reducing number of database
scans
Hash-based
  • When scanning each transaction in the DB to
    generate 1-itemsets
  • Generate all of the 2-itemsets for each
    transaction
  • Hash them into different buckets
  • Buckets with a count below the minimum support
    count can be removed
  • Reduces the number of candidate k-itemsets
    examined
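The bucket idea can be sketched as follows; the number of buckets and all names are illustrative. Because several pairs may collide in one bucket, a high bucket count does not prove a pair frequent, but a low bucket count safely prunes it:

```python
from itertools import combinations
from collections import Counter

def hash_bucket_counts(transactions, n_buckets):
    """While scanning the DB for 1-itemset counts, hash every 2-itemset
    of each transaction into a bucket and increment the bucket count."""
    buckets = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, n_buckets, min_count):
    """A 2-itemset whose bucket count is below min_count cannot be
    frequent, so it is pruned before candidate counting."""
    return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_count

# The nine example transactions from the Apriori walkthrough.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
buckets = hash_bucket_counts(D, 16)
# {I1, I2} occurs 4 times, so its bucket count is at least 4.
print(may_be_frequent({"I1", "I2"}, buckets, 16, 4))  # True
```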

26
Improving apriori- Reducing number of database
scans
Transaction reduction
  • Reduces number of transactions scanned in future
    iterations
  • A transaction that does not contain any frequent
    k-itemsets cannot contain any frequent
    (k+1)-itemsets
  • Such a transaction may be removed from subsequent
    scans

27
Improving apriori- Reducing number of database
scans
Partitioning (2 scans)
  • Divide the transactions into nonoverlapping
    partitions
  • The support threshold for each partition is min_sup ×
    the number of transactions in that partition
  • Scan each partition to find all locally frequent
    itemsets
  • Any itemset that is potentially frequent must occur
    as a frequent itemset in at least one of the
    partitions
  • A second scan determines the globally frequent itemsets

Phase I: divide the transactions in D into n partitions; find the frequent itemsets local to each partition (1 scan); combine the local frequent itemsets to form the candidate itemsets.
Phase II: find the global frequent itemsets in D among the candidates (1 scan).
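The two phases can be sketched as follows. The brute-force local miner stands in for running Apriori on each partition (workable only because the example partitions are tiny), and all names are illustrative:

```python
from itertools import combinations
from collections import Counter

def local_frequent(part, min_count):
    """Brute-force local miner (a stand-in for Apriori on one partition)."""
    counts = Counter()
    for t in part:
        for r in range(1, len(t) + 1):
            for sub in combinations(sorted(t), r):
                counts[frozenset(sub)] += 1
    return {s for s, c in counts.items() if c >= min_count}

def partitioned_frequent(D, n_parts, min_count):
    # Phase I (1 scan): mine each partition with a proportionally scaled threshold
    size = -(-len(D) // n_parts)  # ceiling division
    candidates = set()
    for i in range(0, len(D), size):
        part = D[i:i + size]
        local_min = -(-min_count * len(part) // len(D))
        candidates |= local_frequent(part, max(1, local_min))
    # Phase II (1 scan): count every candidate against the whole database
    global_counts = Counter()
    for t in D:
        for c in candidates:
            if c <= t:
                global_counts[c] += 1
    return {c for c in candidates if global_counts[c] >= min_count}

# The nine example transactions; same result as plain Apriori.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
result = partitioned_frequent(D, 3, 2)
print(len(result))  # 13
```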
28
Improving apriori- Reducing number of database
scans
Sampling (1.5 to 2.5 scans)
  • Randomly sample a subset of transactions from D
  • Search for frequent itemsets in this sample
  • Lower the support threshold to lessen the possibility
    of losing global frequent itemsets
  • The whole of D is used to compute the actual
    frequencies of these candidates
  • Use the concept of the negative border to check whether
    all global frequent itemsets are found [Toi96]

29
Improving apriori- Reducing number of database
scans
Dynamic itemset counting
  • Partition the DB into blocks of size M
  • Before scanning a block, update the candidate
    itemsets
  • If all subsets of a candidate itemset are
    frequent, start counting it
  • Requires fewer scans than Apriori

(Diagram legend, itemset states: finished & frequent; finished & non-frequent; counting & frequent; counting & non-frequent)