1
Mining Generalized Association Rules
Ramakrishnan Srikant, Rakesh Agrawal
  • Data Mining Seminar, spring semester, 2003
  • Prof. Amos Fiat
  • Student Idit Haran

2
Outline
  • Motivation
  • Terms & definitions
  • Interest Measure
  • Algorithms for mining generalized association
    rules
  • Comparison
  • Conclusions

3
Motivation
  • Find association rules of the form: Diapers ⇒ Beer
  • There are different kinds of diapers: Huggies/Pampers, S/M/L, etc.
  • There are different kinds of beers: Heineken/Maccabi, in a bottle/in a can, etc.
  • The information on the bar-code is of the type: Huggies Diapers, M ⇒ Heineken Beer in a bottle
  • Such a specific rule is not interesting, and probably will not have minimum support.

4
Taxonomy
  • is-a hierarchies

5
Taxonomy - Example
  • Say we found the rule Outerwear ⇒ Hiking Boots with minimum support and confidence.
  • The rule Jackets ⇒ Hiking Boots may not have minimum support.
  • The rule Clothes ⇒ Hiking Boots may not have minimum confidence.

6
Taxonomy
  • Users are interested in generating rules that span different levels of the taxonomy.
  • Rules at lower levels may not have minimum support.
  • The taxonomy can be used to prune uninteresting or redundant rules.
  • Multiple taxonomies may be present, for example: category, price (cheap/expensive), items-on-sale, etc.
  • Multiple taxonomies may be modeled as a forest, or as a DAG.

7
Notations
8
Notations
  • I = {i1, i2, …, im} — the set of items.
  • T — a transaction, a set of items T ⊆ I (we expect the items in T to be leaves of the taxonomy).
  • D — the set of transactions.
  • T supports an item x if x is in T or x is an ancestor of some item in T.
  • T supports X ⊆ I if it supports every item in X.

9
Notations
  • A generalized association rule is X ⇒ Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X.
  • The rule X ⇒ Y has confidence c in D if c% of the transactions in D that support X also support Y.
  • The rule X ⇒ Y has support s in D if s% of the transactions in D support X ∪ Y. (A sketch of both definitions follows.)

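A minimal sketch of these two definitions (Python; the supports argument is assumed to be the taxonomy-aware test defined above, passed in as a function):

    def rule_stats(D, X, Y, supports):
        """Return (support, confidence) of the rule X => Y over D."""
        n_x = sum(1 for t in D if supports(t, X))       # transactions supporting X
        n_xy = sum(1 for t in D if supports(t, X | Y))  # ... supporting X u Y
        support = n_xy / len(D)
        confidence = n_xy / n_x if n_x else 0.0
        return support, confidence

    # Example with a naive, taxonomy-free supports test:
    # rule_stats(D, {"Outerwear"}, {"Hiking Boots"}, lambda t, Z: Z <= set(t))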
10
Problem Statement
  • To find all generalized association rules that
    have support and confidence greater than the
    user-specified minimum support (called minsup)
    and minimum confidence (called minconf)
    respectively.

11
Example
  • Recall the taxonomy

12
Example
minsup = 30%, minconf = 60%
13
Observation 1
  • If the set {x, y} has minimum support, so do {x̂, y}, {x, ŷ} and {x̂, ŷ} (where x̂ and ŷ denote ancestors of x and y).
  • For example: if {Jacket, Shoes} has minsup, so will {Outerwear, Shoes}, {Jacket, Footwear}, and {Outerwear, Footwear}.

14
Observation 2
  • If the rule x ⇒ y has minimum support and confidence, only x ⇒ ŷ is guaranteed to have both minsup and minconf.
  • The rule Outerwear ⇒ Hiking Boots has minsup and minconf.
  • Hence, the rule Outerwear ⇒ Footwear also has both minsup and minconf.

15
Observation 2 cont.
  • However, while the rules x̂ ⇒ y and x̂ ⇒ ŷ will have minsup, they may not have minconf.
  • For example: the rules Clothes ⇒ Hiking Boots and Clothes ⇒ Footwear have minsup, but not minconf.

16
Interesting Rules: Previous Work
  • A rule X ⇒ Y is not interesting if support(X ⇒ Y) ≈ support(X) × support(Y).
  • Previous work does not consider the taxonomy.
  • This interest measure pruned less than 1% of the rules on a real database.

17
Interesting Rules: Using the Taxonomy
  • Milk ⇒ Cereal (8% support, 70% confidence)
  • Milk is the parent of Skim Milk, and 25% of the sales of Milk are Skim Milk.
  • We therefore expect Skim Milk ⇒ Cereal to have 2% support (8% × 25%) and 70% confidence.

18
R-Interesting Rules
  • A rule X ⇒ Y is R-interesting w.r.t. an ancestor X̂ ⇒ Ŷ if

      real support(X ⇒ Y) / expected support of X ⇒ Y based on X̂ ⇒ Ŷ  >  R

    or

      real confidence(X ⇒ Y) / expected confidence of X ⇒ Y based on X̂ ⇒ Ŷ  >  R

  • With R = 1.1, about 40-55% of the rules were pruned. (A sketch of the support half of the test follows.)
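A minimal sketch of the support half of this test (Python; the names are illustrative, and the expected support follows the Milk/Skim-Milk scaling above: the ancestor rule's support times the share each specialized item has of its ancestor):

    def expected_support(ancestor_rule_support, item_shares):
        """Expected support of X => Y based on the ancestor X^ => Y^:
        the ancestor rule's support scaled, for every item that was
        specialized, by that item's share of its ancestor's support."""
        exp = ancestor_rule_support
        for share in item_shares:      # e.g. support(Skim Milk)/support(Milk)
            exp *= share
        return exp

    def is_r_interesting(real_support, exp_support, R=1.1):
        return real_support > R * exp_support

    # Milk => Cereal has 8% support; Skim Milk is 25% of Milk sales,
    # so Skim Milk => Cereal is expected at 0.08 * 0.25 = 2% support.
    exp = expected_support(0.08, [0.25])        # 0.02
    print(is_r_interesting(0.05, exp))          # True: 5% >> 1.1 * 2%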
19
Problem Statement (new)
  • To find all generalized R-interesting association
    rules (R is a user-specified minimum interest
    called min-interest) that have support and
    confidence greater than minsup and minconf
    respectively.

20
Algorithms 3 steps
  • 1. Find all itemsets whose support is greater
    than minsup. These itemsets are called frequent
    itemsets.
  • 2. Use the frequent itemsets to generate the desired rules: if ABCD and AB are frequent, then conf(AB ⇒ CD) = support(ABCD) / support(AB). (See the sketch below.)
  • 3. Prune all uninteresting rules from this set.
  • All the algorithms presented here implement only step 1.

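A hedged sketch of step 2 (Python; gen_rules and freq are illustrative names, and freq is assumed to map each frequent itemset, as a frozenset, to its support — every subset of a frequent itemset is itself frequent, so the lookup is safe):

    from itertools import combinations

    def gen_rules(freq, minconf):
        """For each frequent itemset, try every split LHS => RHS and keep
        rules with conf = support(whole) / support(LHS) >= minconf."""
        rules = []
        for itemset, sup in freq.items():
            for r in range(1, len(itemset)):
                for lhs in map(frozenset, combinations(itemset, r)):
                    conf = sup / freq[lhs]       # e.g. conf(AB => CD)
                    if conf >= minconf:
                        rules.append((lhs, itemset - lhs, conf))
        return rules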
22
Algorithms (step 1)
  • Input: the database and the taxonomy
  • Output: all frequent itemsets
  • 3 algorithms (same output, different run-time): Basic, Cumulate, EstMerge

23
Algorithm Basic: Main Idea
  • Is itemset X frequent?
  • Does transaction T support X? (X contains items from different levels of the taxonomy; T contains only leaves.)
  • T′ = T ∪ ancestors(T)
  • Answer: T supports X ⟺ X ⊆ T′ (see the sketch below)

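A minimal sketch of this membership test (Python; parents is an assumed mapping from an item to its direct parents in the taxonomy DAG):

    def supports(transaction, itemset, parents):
        """T supports X iff X is a subset of T' = T plus all ancestors of T."""
        t_prime = set(transaction)
        stack = list(transaction)
        while stack:                          # climb the taxonomy DAG
            for p in parents.get(stack.pop(), []):
                if p not in t_prime:
                    t_prime.add(p)
                    stack.append(p)
        return set(itemset) <= t_prime

    parents = {"Jacket": ["Outerwear"], "Outerwear": ["Clothes"]}
    print(supports({"Jacket"}, {"Clothes"}, parents))   # True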
24
Algorithm Basic
(Annotations on the pseudocode figure; a runnable sketch follows.)
  • Count item occurrences
  • Generate new k-itemset candidates
  • Add all ancestors of each item in t to t, removing any duplicates
  • Find the support of all the candidates
  • Take only those with support over minsup
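The figure itself is not preserved; below is a compact, hedged sketch of Basic under the same assumptions as above (parents maps an item to its direct parents; the candidate generation here is the naive join, with the proper join/prune step shown after the next slide). The slide's annotations appear as comments:

    def basic(transactions, parents, minsup_count):
        """Apriori over transactions extended with their ancestors."""
        def extend(t):  # add all ancestors of each item, removing duplicates
            out, stack = set(t), list(t)
            while stack:
                for p in parents.get(stack.pop(), []):
                    if p not in out:
                        out.add(p)
                        stack.append(p)
            return out

        ext = [extend(t) for t in transactions]
        counts = {}                          # count item occurrences
        for t in ext:
            for i in t:
                counts[i] = counts.get(i, 0) + 1
        Lk = {frozenset([i]) for i, c in counts.items() if c >= minsup_count}
        frequent, k = set(Lk), 2
        while Lk:
            # generate new k-itemset candidates (naive join; prune omitted)
            Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
            # find the support of all the candidates
            cnt = {c: sum(c <= t for t in ext) for c in Ck}
            # take only those with support over minsup
            Lk = {c for c in cnt if cnt[c] >= minsup_count}
            frequent |= Lk
            k += 1
        return frequent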
25
Candidate generation
  • Join step: p and q are two frequent (k−1)-itemsets that are identical in their first k−2 items; join them by adding the last item of q to p.
  • Prune step: check all the (k−1)-subsets of each candidate, and remove any candidate that has an infrequent subset. (A sketch of both steps follows.)
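A minimal sketch of both steps (Python; apriori_gen is an illustrative name, and L_prev is the collection of frequent (k−1)-itemsets):

    from itertools import combinations

    def apriori_gen(L_prev, k):
        """Join + prune candidate generation from frequent (k-1)-itemsets."""
        prev = sorted(tuple(sorted(s)) for s in L_prev)
        Ck = set()
        for i, p in enumerate(prev):
            for q in prev[i + 1:]:
                if p[:k - 2] == q[:k - 2]:          # join: same first k-2 items
                    Ck.add(frozenset(p) | {q[-1]})  # add the last item of q to p
        L_set = {frozenset(s) for s in prev}
        # prune: remove a candidate if any (k-1)-subset is not frequent
        return {c for c in Ck
                if all(frozenset(s) in L_set for s in combinations(c, k - 1))}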
26
Optimization 1
  • Filtering the ancestors added to transactions:
  • We only need to add to transaction t the ancestors that appear in one of the candidates.
  • If the original item is not in any candidate, it can be dropped from the transaction.
  • Example: the candidates are {Clothes, Shoes}. The transaction {Jacket, …} can be replaced with {Clothes, …}.

27
Optimization 2
  • Pre-computing ancestors:
  • Rather than finding the ancestors of each item by traversing the taxonomy graph, we can pre-compute the ancestors of each item once.
  • At the same time, we can drop the ancestors that are not contained in any of the candidates. (See the sketch below.)

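A hedged sketch of the pre-computed table (Python; names are illustrative, and passing candidate_items also applies the filtering of Optimization 1):

    def precompute_ancestors(parents, candidate_items=None):
        """Build one lookup table item -> all ancestors, computed once."""
        table = {}
        def climb(item):                    # memoized DFS up the DAG
            if item not in table:
                anc = set()
                for p in parents.get(item, []):
                    anc.add(p)
                    anc |= climb(p)
                table[item] = anc
            return table[item]
        for item in list(parents):
            climb(item)
        if candidate_items is not None:     # drop ancestors outside candidates
            table = {i: a & candidate_items for i, a in table.items()}
        return table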
28
Optimization 3
  • Pruning itemsets containing an item and its ancestor:
  • If we have Jacket and Outerwear, we will get the candidate {Jacket, Outerwear}, which is not interesting, because
  • support({Jacket}) = support({Jacket, Outerwear}).
  • Deleting {Jacket, Outerwear} at k = 2 ensures it will not appear for k > 2 (because of the prune step of the candidate generation method).
  • Therefore, we need to prune candidates containing an item and its ancestor only at k = 2; in all later passes no candidate will include an item and its ancestor. (A sketch of the test follows.)

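A minimal sketch of the k = 2 test (Python; ancestor_table is the pre-computed item → ancestors mapping from Optimization 2):

    def contains_item_and_ancestor(itemset, ancestor_table):
        """True if some item's ancestor is also in the itemset; such a
        candidate can be pruned at k = 2."""
        return any(ancestor_table.get(i, set()) & itemset for i in itemset)

    # contains_item_and_ancestor({"Jacket", "Outerwear"},
    #     {"Jacket": {"Outerwear", "Clothes"}})   # True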
29
Algorithm Cumulate
30
Stratification
  • Candidates: {Clothes, Shoes}, {Outerwear, Shoes}, {Jacket, Shoes}
  • If {Clothes, Shoes} does not have minimum support, we don't need to count either {Outerwear, Shoes} or {Jacket, Shoes}.
  • So we count in steps. Step 1: count {Clothes, Shoes}; if it has minsup, step 2: count {Outerwear, Shoes}; if that has minsup, step 3: count {Jacket, Shoes}.

31
Version 1: Stratify
  • Depth of an itemset:
  • itemsets with no parents are at depth 0;
  • otherwise, depth(X) = max({depth(X̂) : X̂ is a parent of X}) + 1. (A sketch follows this slide.)
  • The algorithm:
  • Count all itemsets C0 at depth 0.
  • Delete the candidates that are descendants of the itemsets in C0 that did not have minsup.
  • Count the remaining itemsets at depth 1 (C1).
  • Delete the candidates that are descendants of the itemsets in C1 that did not have minsup.
  • Count the remaining itemsets at depth 2 (C2), etc.

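A hedged sketch of the depth computation (Python; following the paper, depth is taken relative to the current candidate set, and a parent itemset is formed by replacing one item with one of its direct parents — an assumption about what the slide's figure showed):

    def itemset_depth(X, candidates, parents, memo):
        """0 if no parent itemset of X is itself a candidate,
        else 1 + the maximum depth over those candidate parents."""
        X = frozenset(X)
        if X not in memo:
            ds = []
            for item in X:
                for p in parents.get(item, []):
                    parent_set = (X - {item}) | {p}
                    if parent_set in candidates and parent_set != X:
                        ds.append(itemset_depth(parent_set, candidates,
                                                parents, memo))
            memo[X] = 1 + max(ds) if ds else 0
        return memo[X]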
32
Tradeoff Optimizations
  • The tradeoff: the number of candidates counted vs. the number of passes over the DB.
  • Cumulate counts all the candidates in a single pass; counting each depth in a different pass is the other extreme.
  • Optimization 1: count together multiple depths, from a certain level down.
  • Optimization 2: count more than 20% of the candidates per pass.
33
Version 2: Estimate
  • Estimate candidates' support using a sample.
  • 1st pass (C′k):
  • count the candidates that are expected to have minsup (we take these to be the candidates with at least 0.9 × minsup in the sample),
  • and count the candidates whose parents are expected to have minsup.
  • 2nd pass (C″k):
  • count the children of the candidates in C′k that were not expected to have minsup.

34
Example for Estimate
  • minsup = 5%

35
Version 3: EstMerge
  • Motivation: eliminate the 2nd pass of algorithm Estimate.
  • Implementation: count these candidates of C″k together with the candidates in Ck+1.
  • Restriction: to create Ck+1, we must assume that all the candidates in C″k have minsup.
  • The tradeoff: the extra candidates counted by EstMerge vs. the extra pass made by Estimate.

36
Algorithm EstMerge
37
Stratify - Variants
38
Size of Sample
  • Pr[support in sample < a]

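The derivation on this slide is not preserved. As a hedged reconstruction (this is the standard Chernoff-bound argument, not necessarily the slide's exact formula): for a sample of n transactions and an itemset with true support p,

    Pr[support in sample < (1 − γ)·p] ≤ exp(−γ²·n·p / 2)

so choosing n ≥ (2 / (γ²·p)) · ln(1/a) drives this probability below a.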
39
Size of Sample
40
Performance Evaluation
  • Compare the running times of the 3 algorithms: Basic, Cumulate and EstMerge
  • On synthetic data:
  • the effect of each parameter on performance
  • On real data:
  • supermarket data
  • department store data

41
Synthetic Data Generation
42
Minimum Support
43
Number of Transactions
44
Fanout
45
Number of Items
46
Reality Check
  • Supermarket data:
  • 548,000 items
  • Taxonomy: 4 levels, 118 roots
  • 1.5 million transactions
  • Average of 9.6 items per transaction
  • Department store data:
  • 228,000 items
  • Taxonomy: 7 levels, 89 roots
  • 570,000 transactions
  • Average of 4.4 items per transaction

47
Results
48
Conclusions
  • Cumulate and EstMerge were 2 to 5 times faster than Basic on all synthetic datasets; on the supermarket database they were 100 times faster!
  • EstMerge was 25-30% faster than Cumulate.
  • Both EstMerge and Cumulate exhibit linear scale-up with the number of transactions.

49
Summary
  • The use of a taxonomy is necessary for finding association rules between items at any level of the hierarchy.
  • The obvious solution (algorithm Basic) is not very fast.
  • New algorithms that exploit the taxonomy are much faster.
  • We can use the taxonomy to prune uninteresting rules.

50
  • THE END