1
Association Analysis
2
Association Rule Mining Definition
  • Given a set of records, each of which contains some
    number of items from a given collection
  • Produce dependency rules which will predict
    occurrence of an item based on occurrences of
    other items.

Rules Discovered: {Milk} → {Coke}
{Diaper, Milk} → {Beer}
3
Association Rules
  • Marketing and Sales Promotion
  • Let the rule discovered be
  • {Bagels} → {Potato Chips}
  • Potato Chips as consequent ⇒
  • Can be used to determine what should be done to
    boost its sales.
  • Bagels in the antecedent ⇒
  • Can be used to see which products would be
    affected if the store discontinues selling bagels.

4
Two key issues
  • First
  • discovering patterns from a large transaction
    data set can be computationally expensive.
  • Second
  • some of the discovered patterns are potentially
    spurious
  • because they may happen simply by chance.

5
Items and transactions
  • Let
  • I = {i1, i2, …, id} be the set of all items in
    the market basket data, and
  • T = {t1, t2, …, tN} be the set of all
    transactions.
  • Each transaction ti contains a subset of items
    chosen from I.
  • Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
  • k-itemset
  • An itemset that contains k items
  • Transaction width
  • The number of items present in a transaction.
  • A transaction tj is said to contain an itemset X,
    if X is a subset of tj.
  • E.g., the second transaction contains the itemset
    {Bread, Diapers} but not {Bread, Milk}.

6
Definition Frequent Itemset
  • Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
  • Support
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5 = σ/N
  • Frequent Itemset
  • An itemset whose support is greater than or equal
    to a minsup threshold

7
Definition Association Rule
  • Association Rule
  • An implication expression of the form X → Y,
    where X and Y are itemsets
  • Example: {Milk, Diaper} → {Beer}
  • Rule Evaluation Metrics (X → Y)
  • Support (s)
  • Fraction of transactions that contain both X and
    Y
  • Confidence (c)
  • Measures how often items in Y appear in
    transactions that contain X

8
Why Use Support and Confidence?
  • Support
  • A very low support rule → may occur simply by
    chance.
  • A very low support rule → uninteresting rules.
  • Confidence for a given rule X → Y
  • the higher the confidence ? the more likely it is
    for Y to be present in transactions that contain
    X
  • Measures the reliability of the inference made by
    a rule.
  • is an estimate of the conditional probability of
    Y given X (a small numeric example follows below).
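To make these two metrics concrete, here is a minimal Python sketch that computes support and confidence for one rule; the five toy transactions are assumed for illustration and are not part of the slides.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions that contain the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y):
    # Estimate of P(Y | X): support of X union Y divided by support of X
    return support(X | Y) / support(X)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X | Y))    # 0.4
print(confidence(X, Y))  # 0.666...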

9
Association Rule Mining Task
  • Given a set of transactions T, the goal of
    association rule mining is to find all rules
    having
  • support ≥ minsup threshold
  • confidence ≥ minconf threshold
  • Brute-force approach
  • List all possible association rules
  • Compute the support and confidence for each rule
  • Prune rules that fail the minsup and minconf
    thresholds
  • ⇒ Computationally prohibitive!

10
Brute-force approach
  • Suppose there are d items. We first choose k of
    the items to form the left-hand side of the rule.
    There are C(d, k) ways of doing this.
  • Now, there are C(d−k, i) ways to choose the
    remaining items to form the right-hand side of
    the rule, where 1 ≤ i ≤ d−k.

11
Brute-force approach
  • R = 3^d − 2^(d+1) + 1
  • For d = 6,
  • 3^6 − 2^7 + 1 = 602 possible rules (verified
    numerically in the sketch below)
  • However, 80% of the rules are discarded after
    applying minsup = 20% and minconf = 50%, making
    most of the computation wasted.
  • So, it would be useful to prune the rules early,
    without having to compute their support and
    confidence values.
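As a quick sanity check of the rule-count formula above, the following Python sketch (the function name is mine, not from the slides) enumerates every rule X → Y by brute force and compares the count with 3^d − 2^(d+1) + 1.

from itertools import combinations

def count_rules(d):
    # Count all rules X -> Y with X, Y non-empty, disjoint subsets of d items
    items = range(d)
    total = 0
    for k in range(1, d):                       # size of the antecedent X: C(d, k) choices
        for X in combinations(items, k):
            rest = [i for i in items if i not in X]
            for i in range(1, len(rest) + 1):   # size of the consequent Y: C(d-k, i) choices
                total += sum(1 for _ in combinations(rest, i))
    return total

d = 6
print(count_rules(d), 3**d - 2**(d + 1) + 1)    # both print 602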

An initial step toward improving the performance:
decouple the support and confidence requirements.
12
Basic Observations
Example of Rules:
{Milk, Diaper} → {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk} (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer} (s = 0.4, c = 0.5)
  • Observations
  • All the rules are binary partitions of the
    itemset {Milk, Diaper, Beer}
  • Rules originating from the same itemset have
    identical support
  • but can have different confidence
  • We may decouple the support and confidence
    requirements
  • If the itemset is infrequent, then all six
    candidate rules can be pruned immediately without
    our having to compute their confidence values.

13
Mining Association Rules
  • Two-step approach
  • Frequent Itemset Generation
  • Generate all itemsets whose support ≥ minsup
  • these itemsets are called frequent itemsets
  • Rule Generation
  • Generate high confidence rules from each frequent
    itemset
  • where each rule is a binary partitioning of a
    frequent itemset (these rules are called strong
    rules)
  • The computational requirements for frequent
    itemset generation are more expensive than those
    of rule generation.
  • We focus first on frequent itemset generation.

14
Frequent Itemset Generation
Given d items, there are 2^d possible candidate
itemsets
15
Frequent Itemset Generation
  • Brute-force approach
  • Each itemset in the lattice is a candidate
    frequent itemset
  • Count the support of each candidate by scanning
    the database
  • Match each transaction against every candidate
  • Complexity O(NMw) ⇒ expensive since M = 2^d !!!
  • w is max transaction width.

16
Frequent Itemset Generation Strategies
  • Reduce the number of candidates (M)
  • Complete search: M = 2^d
  • Use pruning techniques to reduce M
  • Apriori principle is an effective way to
    eliminate candidate itemsets without counting
    their support values.
  • Apriori principle
  • If an itemset is frequent, then all of its
    subsets must also be frequent

17
Reducing Number of Candidates
  • Apriori principle
  • If an itemset is frequent, then all of its
    subsets must also be frequent
  • conversely
  • If an itemset such as {a, b} is infrequent, then
    all of its supersets must be infrequent too.
  • Apriori principle holds due to the following
    property of the support measure
  • Support of an itemset never exceeds the support
    of its subsets
  • This is known as the anti-monotone property of
    support

18
Illustrating Apriori Principle
19
Illustrating Apriori Principle
Items (1-itemsets)
Pairs (2-itemsets)
(No need to generate candidates involving Coke or
Eggs)
Triplets (3-itemsets)
With the Apriori principle we need to keep only
this triplet, because it's the only one whose
subsets are all frequent.
Minimum Support = 3
If every subset is considered: 6C1 + 6C2 + 6C3 =
41. With support-based pruning: 6 + 6 + 1 = 13.
20
Apriori Algorithm
  • Method
  • Let k = 1
  • Generate frequent itemsets of length 1
  • Repeat until no new frequent itemsets are
    identified
  • k = k + 1
  • Generate length-k candidate itemsets
  • from length-(k−1) frequent itemsets
  • Prune candidate itemsets
  • that contain infrequent subsets of length k−1
  • Count the support of each candidate
  • by scanning the DB, and eliminate candidates that
    are infrequent (a compact sketch of this loop
    appears below)
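A compact Python sketch of this loop is shown below, under the assumption of a small toy database (modeled on the market-basket example used in these slides) and candidate generation by merging pairs of frequent (k−1)-itemsets; it is an illustration, not the slides' exact pseudocode.

from itertools import combinations
from collections import defaultdict

def apriori(transactions, minsup_count):
    # k = 1: count single items and keep the frequent ones
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s for s, c in counts.items() if c >= minsup_count}
    all_frequent = set(frequent)
    k = 1
    while frequent:
        k += 1
        # Generate length-k candidates from length-(k-1) frequent itemsets
        candidates = set()
        for a in frequent:
            for b in frequent:
                union = a | b
                if len(union) == k and all(
                        frozenset(sub) in frequent for sub in combinations(union, k - 1)):
                    candidates.add(union)   # prune: every (k-1)-subset must be frequent
        # Count the support of each surviving candidate by scanning the DB
        counts = defaultdict(int)
        for t in transactions:
            for cand in candidates:
                if cand <= t:
                    counts[cand] += 1
        frequent = {s for s, c in counts.items() if c >= minsup_count}
        all_frequent |= frequent
    return all_frequent

db = [{"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
print(apriori(db, minsup_count=3))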

21
Important Details of Apriori
  • How to generate length-k candidates?
  • Step 1: self-joining Lk-1
  • Step 2: pruning
  • How to count supports of candidates?
  • Example of Candidate-generation
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 × L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning
  • acde is removed because ade is not in L3
  • C4 = {abcd} (see the sketch below)
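The example above can be reproduced in a few lines of Python; the single-letter items and the assumption that itemsets are stored as sorted tuples are for illustration only.

from itertools import combinations

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]

# Step 1 -- self-join: merge two 3-itemsets that agree on their first k-2 = 2 items
joined = [x[:2] + (x[2], y[2]) for x in L3 for y in L3 if x[:2] == y[:2] and x[2] < y[2]]
# joined == [('a','b','c','d'), ('a','c','d','e')]

# Step 2 -- prune: drop any candidate having a 3-subset that is not in L3
L3_set = set(L3)
C4 = [c for c in joined if all(s in L3_set for s in combinations(c, 3))]
print(C4)   # [('a', 'b', 'c', 'd')] -- acde removed because ('a','d','e') is not in L3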

22
Challenges of Frequent Pattern Mining
  • Challenges
  • Multiple scans of transaction database
  • Huge number of candidate itemsets
  • Tedious workload of support counting for
    candidates
  • Improving Apriori general ideas
  • Reduce passes of transaction database scans
  • Shrink number of candidates
  • Facilitate support counting of candidates

23
Candidate generation and pruning
  • An effective candidate generation procedure
    should
  • avoid generating too many unnecessary candidates.
  • an unnecessary candidate itemset is one for which
    at least one of its subsets is infrequent.
  • ensure that the candidate set is complete,
  • no frequent itemsets are left out by the
    candidate generation procedure.
  • not generate the same candidate itemset more than
    once.
  • E.g., the candidate itemset {a, b, c, d} can be
    generated in many ways:
  • by merging {a, b, c} with {d},
  • {c} with {a, b, d}, etc.

24
Brute force
  • A brute-force method considers every k-itemset as
    a potential candidate and then applies the
    candidate pruning step to remove any unnecessary
    candidates.

25
Fk-1 × F1 Method
  • Extend each frequent (k−1)-itemset with a
    frequent 1-itemset.
  • Complete?
  • Yes, because every frequent k-itemset is composed
    of a frequent (k−1)-itemset and a frequent
    1-itemset.
  • Problem
  • doesn't prevent the same candidate itemset from
    being generated more than once.
  • E.g., {Bread, Diapers, Milk} can be generated by
    merging
  • {Bread, Diapers} with {Milk},
  • {Bread, Milk} with {Diapers}, or
  • {Diapers, Milk} with {Bread}.

26
Lexicographic Order
  • To avoid generating duplicate candidates
  • ensure that the items in each frequent itemset
    are kept sorted in their lexicographic order.
  • Each frequent (k−1)-itemset X is then extended
    with frequent items that are lexicographically
    larger than the items in X.
  • For example
  • the itemset {Bread, Diapers} can be augmented
    with {Milk} since Milk is lexicographically
    larger than Bread and Diapers.
  • we don't augment {Diapers, Milk} with {Bread} nor
    {Bread, Milk} with {Diapers}, because they violate
    the lexicographic ordering condition (a tiny
    sketch of this method follows after this list).
  • Is it complete?
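A tiny Python sketch of the Fk-1 × F1 method with this lexicographic restriction; the F1 and F2 values below are assumed toy data, not the slides' running example.

# Extend each sorted frequent 2-itemset only with frequent items that are
# lexicographically larger than its last (largest) item
F1 = ["Beer", "Bread", "Diapers", "Milk"]
F2 = [("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]

C3 = [x + (item,) for x in F2 for item in F1 if item > x[-1]]
print(C3)   # [('Bread', 'Diapers', 'Milk')] -- each candidate is generated exactly once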

27
Lexicographic Order - Completeness
  • Complete? Yes
  • Let (i1, …, ik-1, ik) be a frequent k-itemset
    sorted in lexicographic order.
  • Since it is frequent, by the Apriori principle,
    (i1, …, ik-1) and (ik) are frequent as well.
  • I.e. (i1, …, ik-1) ∈ Fk-1 and (ik) ∈ F1.
  • Since (ik) is lexicographically larger than
    i1, …, ik-1,
  • (i1, …, ik-1) would be joined with (ik)
  • and give (i1, …, ik-1, ik) as a candidate
    k-itemset.

28
Still too many candidates
  • E.g. merging {Beer, Diapers} with {Milk} is
    unnecessary because one of its subsets, {Beer,
    Milk}, is infrequent.
  • Heuristics available to reduce (prune) the number
    of unnecessary candidates.
  • E.g., for a candidate k-itemset to be worthy,
  • every item in the candidate must be contained in
    at least k−1 of the frequent (k−1)-itemsets.
  • {Beer, Diapers, Milk} is a viable candidate
    3-itemset only if
  • every item in the candidate, including Beer, is
    contained in at least 2 frequent 2-itemsets.
  • Since there is only one frequent 2-itemset
    containing Beer, all candidate itemsets involving
    Beer must be infrequent.
  • Why?
  • Because each item of a candidate k-itemset appears
    in k−1 of its (k−1)-subsets, and all of these
    subsets must be frequent.

29
Fk-1 × F1
30
Fk-1 × Fk-1 Method
  • Merge a pair of frequent (k−1)-itemsets only if
    their first k−2 items are identical.
  • E.g. {Bread, Diapers} and {Bread, Milk} → candidate
    3-itemset {Bread, Diapers, Milk}.
  • {Beer, Diapers} and {Diapers, Milk} are not merged.
  • If {Beer, Diapers, Milk} is a viable candidate,
    it would have been obtained by merging {Beer,
    Diapers} with {Beer, Milk} instead.
  • An additional candidate pruning step is needed to
    ensure
  • the remaining k−2 subsets of size k−1 are
    frequent.

31
Fk-1 × Fk-1
32
Example
Min_sup_count = 2
33
Generate C2 from F1 × F1
Min_sup_count = 2
(Figure: F1)
34
Generate C3 from F2 × F2
Min_sup_count = 2
(Figure: F2, the pruned C3, and F3)
35
Generate C4 from F3 × F3
Min_sup_count = 2
(Figure: F3 and C4)
{I1, I2, I3, I5} is pruned because {I2, I3, I5} is
infrequent
36
Support counting for Candidate
  • Scan the database of transactions to determine
    the support of each candidate itemset
  • Brute force Match each transaction against every
    candidate.
  • Too many comparisons!
  • Better method Store the candidate itemsets in a
    hash structure
  • A transaction will be tested for match only
    against candidates contained in a few buckets

37
Hash Tree for Storing Candidate Itemsets
  • Store the candidate itemsets in a hash structure
  • A transaction will be tested for match only
    against candidates contained in a few buckets
  • Hash tree can also be used for candidate
    generation

38
Hash Tree For candidate itemsets
  • Suppose you have 15 candidate itemsets of length
    3 to be stored
  • {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8},
    {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
    {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
  • You need
  • A hash function (e.g. p mod 3)
  • Max leaf size: max number of itemsets stored in
    a leaf node (if the number of candidate itemsets
    exceeds the max leaf size, split the node); a
    small construction sketch follows below.
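Below is a minimal sketch of how such a hash tree could be built, assuming hash(p) = p mod 3 and a max leaf size of 3 as stated above; the representation (dicts for interior nodes, lists for leaves) is my own choice, not prescribed by the slides.

MAX_LEAF = 3                    # max itemsets stored in a leaf node

def bucket(item):
    return item % 3             # hash function p mod 3: {1,4,7}, {2,5,8}, {3,6,9}

def insert(node, itemset, depth=0):
    # Interior nodes are dicts keyed by bucket; leaves are lists of itemsets
    if isinstance(node, dict):
        key = bucket(itemset[depth])
        node[key] = insert(node.get(key, []), itemset, depth + 1)
        return node
    node.append(itemset)
    if len(node) > MAX_LEAF and depth < len(itemset):
        split = {}              # overfull leaf: re-hash its contents on the item at this depth
        for it in node:
            split = insert(split, it, depth)
        return split
    return node

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
              (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
tree = []
for c in candidates:
    tree = insert(tree, c)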

39
Hash Tree For candidate itemsets
Suppose you have 15 candidate itemsets of length
3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8},
{1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
(Figure: the candidates hashed on their first item
into three buckets)
Split nodes with more than 3 candidates using
the second item
40
Hash Tree For candidate itemsets
(Same 15 candidate itemsets of length 3 as above.)
(Figure: the bucket of candidates whose first item
hashes to 1 is split on the second item)
Now split nodes using the third item
41
Hash Tree For candidate itemsets
(Same 15 candidate itemsets of length 3 as above.)
(Figure: the overfull node is split on the third item)
Now, split this similarly.
42
Hash Tree For candidate itemsets
Now, split this similarly.
43
Enumerate all Subsets of a transaction
Given a (lexicographically ordered) transaction
t, say {1, 2, 3, 5, 6}, how do we enumerate all
subsets of size 3? (A snippet follows below.)
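With the items kept in sorted order, the enumeration is straightforward; for example, in Python:

from itertools import combinations

t = (1, 2, 3, 5, 6)                  # a lexicographically ordered transaction
for subset in combinations(t, 3):    # all C(5, 3) = 10 subsets of size 3, in order
    print(subset)
# (1,2,3) (1,2,5) (1,2,6) (1,3,5) (1,3,6) (1,5,6) (2,3,5) (2,3,6) (2,5,6) (3,5,6)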
44
Matching transaction against candidates
(Figure: a transaction being matched against the
hash tree of candidate itemsets)
45
Matching transaction against candidates
(Figure: the hash function, with branches for items
{1, 4, 7}, {2, 5, 8}, and {3, 6, 9}, applied while
matching the transaction against the hash tree)
46
Matching transaction against candidates
(Figure: the leaves of the hash tree visited while
matching the transaction)
Match transaction against 7 out of 15 candidates
47
Trie for Storing Candidate Itemsets
Suppose you have 5 candidate itemsets of length
3: {A, C, D}, {A, E, G}, {A, E, L}, {A, E, M}, {K, M, N}.
  • To match transaction t against candidates
  • we take all ordered k-subsets X of t
  • search for them in the trie structure
  • If X is found (as a candidate), then the support
    count of this candidate is incremented by 1 (a
    small sketch follows below).
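A small Python sketch of this lookup-based support counting, using nested dicts as the trie; the transaction at the end is an assumed example, not from the slides.

from itertools import combinations

candidates = [("A","C","D"), ("A","E","G"), ("A","E","L"), ("A","E","M"), ("K","M","N")]

# Build the trie: one path per candidate, with a counter stored at the leaf
trie = {}
for cand in candidates:
    node = trie
    for item in cand:
        node = node.setdefault(item, {})
    node["count"] = 0

def count_transaction(trie, t, k=3):
    # Look up every ordered k-subset of t; bump the counter of each candidate found
    for X in combinations(sorted(t), k):
        node = trie
        for item in X:
            if item not in node:
                break
            node = node[item]
        else:
            if "count" in node:
                node["count"] += 1

count_transaction(trie, {"A", "E", "K", "M", "N"})   # increments {A,E,M} and {K,M,N}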

48
Tries can store frequent itemsets too
  • Candidate generation becomes easy and fast
  • We can generate candidates from pairs of nodes
    that have the same parent
  • except for the last item, the two itemsets are
    the same
  • Association rules are produced much faster
  • retrieving the support of an itemset is quicker
  • Remember the trie was originally developed to
    quickly decide if a word is included in a
    dictionary