Advanced Topics in Data Mining

- Association Rules
- Sequential Patterns
- Web Mining

Where to Find References?

- Proceedings
- Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
- Proceedings of the IEEE International Conference on Data Mining (ICDM)
- Proceedings of the IEEE International Conference on Data Engineering (ICDE)
- Proceedings of the International Conference on Very Large Data Bases (VLDB)
- ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery
- Proceedings of the International Conference on Data Warehousing and Knowledge Discovery

Where to Find References?

- Proceedings
- Proceedings of the ACM SIGMOD International Conference on Management of Data
- Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
- European Conference on Principles of Data Mining and Knowledge Discovery (PKDD)
- Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA)
- Proceedings of the International Conference on Database and Expert Systems Applications (DEXA)

Where to Find References?

- Journals
- IEEE Transactions on Knowledge and Data Engineering (TKDE)
- Data Mining and Knowledge Discovery
- Journal of Intelligent Information Systems
- ACM SIGMOD Record
- The VLDB Journal (The International Journal on Very Large Data Bases)
- Knowledge and Information Systems
- Data & Knowledge Engineering
- International Journal of Cooperative Information Systems

Advanced Topics in Data Mining: Association Rules

Association Analysis

What Is Association Mining?

- Association Rule Mining
- Finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases, relational databases, and other information repositories
- Applications
- Market basket analysis (marketing strategy: items to put on sale at reduced prices), cross-marketing, catalog design, shelf space layout design, etc.
- Examples
- Rule form: Body ⇒ Head [Support, Confidence]
- buys(x, "Computer") ⇒ buys(x, "Software") [2%, 60%]
- major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]

Market Basket Analysis

Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.

Rule Measures: Support and Confidence

- Let minimum support = 50% and minimum confidence = 50%; we have
- A ⇒ C [50%, 66.6%]
- C ⇒ A [50%, 100%]


Association Rule: Basic Concepts

- Given
- (1) a database of transactions,
- (2) each transaction is a list of items (purchased by a customer in a visit)
- Find all rules that correlate the presence of one set of items with that of another set of items
- Find all the rules A ⇒ B with minimum confidence and support
- support, s = P(A ∪ B)
- confidence, c = P(B|A) (both measures are computed in the sketch below)
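
To make the two measures concrete, here is a minimal Python sketch that computes them directly from their definitions; the transactions and item names are hypothetical illustrations, not data from these slides.

```python
# Minimal sketch: support and confidence of a rule A => B over transactions.

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(B|A) = support(A u B) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Hypothetical transaction database
transactions = [{"A", "C"}, {"A", "B", "C"}, {"B"}, {"A", "C"}]
print(support({"A", "C"}, transactions))       # 0.75
print(confidence({"A"}, {"C"}, transactions))  # 1.0
print(confidence({"C"}, {"A"}, transactions))  # 1.0
```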

Association Rule Mining: A Road Map

- Boolean vs. quantitative associations (based on the types of values handled in the rule set)
- buys(x, "SQLServer") ∧ buys(x, "DM Book") ⇒ buys(x, "DBMiner") [0.2%, 60%]
- age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
- Single-dimension vs. multiple-dimensional associations
- Single-level vs. multiple-level analysis (based on the levels of abstraction involved in the rule set)

Terminologies

- Item
- I1, I2, I3, ...
- A, B, C, ...
- Itemset
- {I1}, {I1, I7}, {I2, I3, I5}, ...
- {A}, {A, G}, {B, C, E}, ...
- 1-Itemset
- {I1}, {I2}, {A}, ...
- 2-Itemset
- {I1, I7}, {I3, I5}, {A, G}, ...

Terminologies

- K-Itemset
- An itemset of length K
- Frequent K-Itemset
- An itemset of length K that satisfies a minimum support threshold
- Association Rule
- A rule that satisfies both a minimum support threshold and a minimum confidence threshold

Analysis

- The number of itemsets of a given cardinality tends to grow exponentially

Mining Association Rules: The Apriori Principle

Min. support = 50%; Min. confidence = 50%

- For rule A ⇒ C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle
- Any subset of a frequent itemset must be frequent

Mining Frequent Itemsets: The Key Step

- Find the frequent itemsets: the sets of items that have minimum support
- A subset of a frequent itemset must also be a frequent itemset
- i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules

Example

Apriori Algorithm

Apriori Algorithm

Apriori Algorithm

Example of Generating Candidates

- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning
- acde is removed because ade is not in L3
- C4 = {abcd} (see the sketch below)
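
A sketch of the join-and-prune step on this exact L3; the function name apriori_gen follows the usual textbook naming and is an illustration, not the authors' code.

```python
# Apriori candidate generation: self-join L(k-1), then prune any candidate
# that has an infrequent (k-1)-subset.
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    frequent = set(prev)
    candidates = set()
    for a in prev:
        for b in prev:
            # Join: merge two itemsets agreeing on the first k-2 items
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    # Prune: every (k-1)-subset of a candidate must itself be frequent
    return sorted(c for c in candidates
                  if all(s in frequent for s in combinations(c, k - 1)))

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3, 4))  # [('a','b','c','d')] -- acde pruned: ade not in L3
```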

Another Example 1

Another Example 2

Is Apriori Fast Enough? Performance Bottlenecks

- The core of the Apriori algorithm
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Use database scans to collect counts for the candidate itemsets
- The bottlenecks of Apriori
- Huge candidate sets
- 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
- To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
- Multiple scans of the database
- Needs (n + 1) scans, where n is the length of the longest pattern

Demo: IBM Intelligent Miner

Demo Database


Methods to Improve Apriori's Efficiency

- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent (see the sketch after this list)
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mining on a subset of the given data with a lower support threshold, plus a method to determine the completeness
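
A minimal sketch of the hash-based counting idea (DHP-style). The hash function and bucket count here are arbitrary illustrations, not the ones from the DHP paper; the point is only that a bucket count is an upper bound on the count of every pair hashed into it.

```python
# Hash-based 2-itemset counting: one scan hashes every pair into a bucket;
# a pair whose bucket count is below min_count cannot be frequent.
from itertools import combinations
from collections import defaultdict

NUM_BUCKETS = 7  # arbitrary illustration

def bucket(pair):
    # Toy hash function (assumption, not from the DHP paper)
    return hash(pair) % NUM_BUCKETS

def bucket_counts(transactions):
    counts = defaultdict(int)
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[bucket(pair)] += 1
    return counts

def may_be_frequent(pair, counts, min_count):
    # Bucket count >= actual count, so a low bucket count rules the pair out
    return counts[bucket(tuple(sorted(pair)))] >= min_count

db = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}]
counts = bucket_counts(db)
print(may_be_frequent(("A", "B"), counts, min_count=2))  # True: (A,B) occurs twice
```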

Partitioning

Hash-Based Itemset Counting

Comparing Apriori and DHP (Direct Hashing and Pruning)

Apriori

Comparing Apriori and DHP

DHP

DHP Database Trimming

DHP (Direct Hashing and Pruning)

- A database has four transactions
- Let min_sup = 50%

Example: Apriori

Example: DHP

Example: DHP

Example: DHP

Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
- highly condensed, but complete for frequent pattern mining
- avoids costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method
- A divide-and-conquer methodology: decompose mining tasks into smaller ones
- Avoid candidate generation: sub-database test only!

Construct FP-tree from a Transaction DB

Construction Steps

- Scan DB once, find frequent 1-itemsets (single-item patterns)
- Order frequent items in frequency-descending order
- Sort each transaction in DB according to this order
- Scan DB again, construct the FP-tree (see the sketch below)
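
A compact sketch of these construction steps. The header table is simplified to a dict from item to its node-links, and the example transactions are the standard textbook I1..I5 database, which is consistent with the conditional pattern bases shown on the later slides; treat both as assumptions for illustration.

```python
# FP-tree construction: pass 1 finds frequent items and their order,
# pass 2 inserts each (filtered, sorted) transaction into the tree.
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: frequent single items, fixed frequency-descending order
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {i: r for r, i in enumerate(order)}
    # Pass 2: insert each transaction, keeping only frequent items
    root = Node(None, None)
    header = defaultdict(list)  # item -> node-links (all nodes for that item)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, order

db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"],
      ["I1","I3"], ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"],
      ["I1","I2","I3"]]
root, header, order = build_fp_tree(db, min_count=2)
print(order)  # ['I2', 'I1', 'I3', 'I4', 'I5']
```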

Benefits of the FP-Tree Structure

- Completeness
- never breaks a long pattern of any transaction
- preserves complete information for frequent pattern mining
- Compactness
- reduces irrelevant information: infrequent items are gone
- frequency-descending ordering: more frequent items are more likely to be shared
- never larger than the original database (not counting node-links and counts)
- Compression ratio could be over 100

Frequent Pattern Growth

Order frequent items in frequency-descending order

- For I5
- {I1, I5}, {I2, I5}, {I2, I1, I5}
- For I4
- {I2, I4}
- For I3
- {I1, I3}, {I2, I3}, {I2, I1, I3}
- For I1
- {I2, I1}


Sub DB

Trimming Databases

Sub DB

Sub DB

FP-tree

Sub DB

Conditional FP-tree from Conditional Pattern-Base

Conditional FP-tree

Conditional FP-tree from Conditional Pattern-Base for I3

Mining Results Using FP-tree

- For I5
- Conditional Pattern Base
- (I2 I1: 1), (I2 I1 I3: 1)
- Conditional FP-tree

- Generate Frequent Itemsets
- I2: 2 → Rule: {I2, I5}: 2
- I1: 2 → Rule: {I1, I5}: 2
- I2 I1: 2 → Rule: {I2, I1, I5}: 2

[Figure: conditional FP-tree for I5 — NULL → I2:2 → I1:2; header table: I2 (count 2), I1 (count 2)]

Mining Results Using FP-tree

- For I4
- Conditional Pattern Base
- (I2 I1: 1), (I2: 1)
- Conditional FP-tree

- Generate Frequent Itemsets
- I2: 2 → Rule: {I2, I4}: 2

[Figure: conditional FP-tree for I4 — NULL → I2:2; header table: I2 (count 2)]
Mining Results Using FP-tree

- For I3
- Conditional Pattern Base
- (I2 I1: 2), (I2: 2), (I1: 2)
- Conditional FP-tree

[Figure: conditional FP-tree for I3 — NULL:4 → I2:4 → I1:2, plus NULL → I1:2; header table: I2 (count 4), I1 (count 4)]

Mining Results Using FP-tree

- For I1/I3
- Conditional Pattern Base
- (NULL: 2), (I2: 2)
- Conditional FP-tree

- Generate Frequent Itemsets
- NULL: 4 → Rule: {I1, I3}: 4
- I2: 2 → Rule: {I2, I1, I3}: 2

[Figure: conditional FP-tree for I1/I3 — NULL:4 → I2:2; header table: I2 (count 2)]

Mining Results Using FP-tree

- For I2/I3
- Conditional Pattern Base
- (NULL: 4)
- Conditional FP-tree

- Generate Frequent Itemsets
- NULL → Rule: {I2, I3}: 4

[Figure: conditional FP-tree for I2/I3 — NULL:4]

Mining Results Using FP-tree

- For I1
- Conditional Pattern Base
- (NULL: 2), (I2: 4)
- Conditional FP-tree

- Generate Frequent Itemsets
- I2: 4 → Rule: {I2, I1}: 4

[Figure: conditional FP-tree for I1 — NULL:6 → I2:4; header table: I2 (count 4)]

Mining Results Using FP-tree

Mining Frequent Patterns Using FP-tree

- General idea (divide-and-conquer)
- Recursively grow frequent patterns using the FP-tree
- Method
- For each item, construct its conditional pattern-base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

Major Steps to Mine the FP-tree

- Construct the conditional pattern base for each item in the FP-tree
- Construct the conditional FP-tree from each conditional pattern base
- Recursively mine conditional FP-trees and grow the frequent patterns obtained so far
- If the conditional FP-tree contains a single path, simply enumerate all the patterns (a sketch of these steps follows)
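
A sketch of these steps, reusing Node and build_fp_tree from the construction sketch above. The recursion and naming are our illustration of the method, not the authors' code, and the single-path shortcut from this slide is omitted for brevity (plain recursion yields the same result, just less efficiently).

```python
# FP-growth: for each item, collect its conditional pattern base (prefix
# paths), build a conditional FP-tree, and recurse with the item as suffix.

def fp_growth(root, header, order, min_count, suffix=frozenset()):
    patterns = {}
    for item in reversed(order):                 # least frequent first
        support = sum(n.count for n in header[item])
        new_suffix = suffix | {item}
        patterns[new_suffix] = support
        # Conditional pattern base: prefix paths, replicated by node count
        cond_db = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * node.count)
        if any(cond_db):                         # at least one non-empty path
            sub_root, sub_header, sub_order = build_fp_tree(cond_db, min_count)
            patterns.update(fp_growth(sub_root, sub_header, sub_order,
                                      min_count, new_suffix))
    return patterns

# Continuing from the construction sketch's db/root/header/order:
patterns = fp_growth(root, header, order, min_count=2)
print(patterns[frozenset({"I2", "I1", "I5"})])   # 2, as in the I5 slide above
```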

Virtual Items in Association Mining

- Different regions exhibit different selling patterns. Including as a virtual item the information on the location or the type of store (existing or new) where the purchase was made enables comparisons between locations or store types within a single chain.
- A virtual item may record whether the purchase was made with cash, a credit card, or a check. Including such a virtual item allows analysis of the association between the payment method and the items purchased.
- A virtual item may record the day of the week or the time of day the transaction occurred. Including such a virtual item allows analysis of the association between the transaction time and the items purchased.

Virtual Items: An Example

Dissociation Rules

- A dissociation rule is similar to an association rule except that it can have "not item-name" in the condition or the result of the rule
- A and not B ⇒ C
- A and D ⇒ not E
- Dissociation rules can be generated by a simple adaptation of association rule analysis, as sketched below
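
A sketch of that adaptation (function and item names are ours): add an inverted "not X" item to every transaction that lacks X for a chosen set of items, then run ordinary association-rule mining on the augmented transactions.

```python
# Prepare transactions for dissociation-rule mining by adding inverted items.

def add_inverted_items(transactions, items_to_invert):
    """For each chosen item absent from a transaction, add a 'not item'."""
    out = []
    for t in transactions:
        inverted = {f"not {i}" for i in items_to_invert if i not in t}
        out.append(set(t) | inverted)
    return out

db = [{"A", "B"}, {"A", "C"}, {"B", "C"}]
print(add_inverted_items(db, {"B"}))
# e.g. [{'A', 'B'}, {'A', 'C', 'not B'}, {'B', 'C'}] (set order may vary)
```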

Discussions

- The size of a typical transaction grows because it now includes inverted items
- The total number of items used in the analysis doubles
- Since the amount of computation grows exponentially with the number of items, doubling the number of items seriously degrades performance
- The frequency of the inverted items tends to be much larger than the frequency of the original items, so the analysis tends to produce rules in which all items are inverted. These rules are less likely to be actionable.
- not A and not B ⇒ not C
- It is useful to invert only the most frequent items in the set used for analysis. It is also useful to invert some items whose inverses are of interest.

Interestingness Measurements

- Subjective Measures
- A rule (pattern) is interesting if it is
- unexpected (surprising to the user), and/or
- actionable (the user can do something with it)
- Objective Measures
- Two popular measurements
- Support
- Confidence

Criticism of Support and Confidence

- Example 1
- Among 5000 students
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basketball and eat cereal
- play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%

Criticism of Support and Confidence

- Example 2
- X and Y are positively correlated
- X and Z are negatively correlated
- yet the support and confidence of X ⇒ Z dominate
- We need a measure of dependent or correlated events

Criticism of Support and Confidence

- Improvement (Correlation)
- Takes both P(A) and P(B) into consideration
- improvement = P(A ∧ B) / (P(A) × P(B)) (see the sketch below)
- P(A ∧ B) = P(A) × P(B), if A and B are independent events
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
- When the improvement is less than 1, negating the result produces a better rule
- X ⇒ NOT Z
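
A one-line check of this measure on Example 1's numbers (the helper function is our illustration):

```python
# Improvement (lift): P(A and B) / (P(A) * P(B)); < 1 means negative correlation.

def improvement(p_a, p_b, p_ab):
    return p_ab / (p_a * p_b)

# 5000 students: 3000 basketball, 3750 cereal, 2000 both
print(improvement(3000/5000, 3750/5000, 2000/5000))  # ~0.89 < 1: negatively correlated
```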

Multiple-Level Association Rules

- Items often form a hierarchy
- Items at the lower levels are expected to have lower support
- Rules regarding itemsets at appropriate levels could be quite useful
- The transaction database can be encoded based on dimensions and levels
- We can explore multi-level mining

Transaction Database

Concept Hierarchy

Mining Multi-Level Associations

- A top-down, progressive deepening approach
- First find high-level strong rules: milk ⇒ bread [20%, 60%]
- Then find their lower-level, weaker rules: 2% milk ⇒ wheat bread [6%, 50%]
- Variations of mining multiple-level association rules
- Cross-level association rules (generalized association rules): 2% milk ⇒ Wonder wheat bread
- Association rules with multiple, alternative hierarchies: 2% milk ⇒ Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support

- Uniform Support: the same minimum support for all levels
- Only one minimum support threshold is needed
- Lower-level items do not occur as frequently; if the support threshold is
- too high ⇒ miss low-level associations
- too low ⇒ generate too many high-level associations
- Reduced Support: reduced minimum support at lower levels
- There are 4 search strategies
- Level-by-level independent
- Level-cross filtering by k-itemset
- Level-cross filtering by single item
- Controlled level-cross filtering by single item

Uniform Support

- Optimization Technique
- The search avoids examining itemsets containing any item whose ancestors do not have minimum support.

Uniform Support: An Example

- Higher level, min_sup = 50%: L1 = {2,3,4,5,6}; L2 = {23,24,25,26,34,45,46,56}; L3 = {234,245,246,256,456}; L4 = {2456}
- Higher level, min_sup = 60%: L1 = {2,3,4}; L2 = {23,24,34}; L3 = {234}
- Lower level, min_sup = 50%: L1 = {2,3,6,8,9}; L2 = {23,68,69,89}; L3 = {689}
- Lower level, min_sup = 60%: L1 = {9}

Uniform Support: An Example

[Figure: concept hierarchy — All at the root; category-level items Cheese, Crab, Milk, Bread, Apple, Pie; numbered brand-level items underneath: Kings Crab, Sunset Milk, Dairyland Milk, Dairyland Cheese, Best Cheese, Best Bread, Wonder Bread, Goldenfarm Apple, Westcoast Bread, Tasty Pie]

Uniform Support: An Example

- (1) Higher level, min_sup = 50%, Apriori/DHP or FP-Growth: L1 = {2,3,4,5,6}; L2 = {23,24,25,26,34,45,46,56}; L3 = {234,245,246,256,456}; L4 = {2456}
- (2) Lower level, min_sup = 50%, scan DB: L1 = {2,3,6,8,9}
- (3) Lower level, min_sup = 50%, scan DB: C2 = {23,36,29,69,28,68,39,89}; C3 = {239,369,289,689}; L2 = {23,68,69,89}; L3 = {689}

Uniform Support: An Example

[Figure: the same concept hierarchy as above]

Reduced Support

- Each level of abstraction has its own minimum support threshold.

Search Strategies for Reduced Support

- There are 4 search strategies
- Level-by-level independent
- Full-breadth search
- No pruning: no background knowledge of frequent itemsets is used for pruning
- Level-cross filtering by single item
- An item at the ith level is examined if and only if its parent node at the (i-1)th level is frequent
- Level-cross filtering by k-itemset
- A k-itemset at the ith level is examined if and only if its corresponding parent k-itemset at the (i-1)th level is frequent
- Controlled level-cross filtering by single item

Level-Cross Filtering by Single Item

Reduced Support: An Example

- (1) Higher level, min_sup = 60%, Apriori/DHP or FP-Growth: L1 = {2,3,4}; L2 = {23,24,34}; L3 = {234}
- (2) Lower level, min_sup = 50%, scan DB: L1 = {2,3,6,9} (only children of frequent items are examined)
- (3) Lower level, min_sup = 50%, Apriori/DHP or FP-Growth: L2 = {23,69}

Reduced Support: An Example

[Figure: the same concept hierarchy as above]

Level-Cross Filtering by K-Itemset


Reduced Support: An Example

- (1) Higher level, min_sup = 60%, Apriori/DHP or FP-Growth: L1 = {2,3,4}; L2 = {23,24,34}; L3 = {234}
- (2) Lower level, min_sup = 50%, scan DB: L1 = {2,3,6,9}
- (3) Lower level, min_sup = 50%, scan DB: C2 = {23,36,29,69,39}; C3 = {239,369}; L2 = {23,69}

Reduced Support: An Example

[Figure: the same concept hierarchy as above]

Reduced Support

- Level-by-level independent
- Very relaxed: it may lead to examining numerous infrequent items at low levels, finding associations between items of little importance
- Level-cross filtering by k-itemset
- Allows the mining system to examine only the children of frequent k-itemsets
- This restriction is very strong, since there usually are not many frequent k-itemsets
- Many valuable patterns may be filtered out
- Level-cross filtering by single item
- A compromise between the above two approaches
- May miss associations between low-level items that are frequent based on a reduced minimum support, but whose ancestors do not satisfy minimum support

Controlled Level-Cross Filtering by Single Item

Reduced Support: An Example

- (1) Higher level, min_sup = 60%, level_passage_sup = 50%, Apriori/DHP or FP-Growth: L1 = {2,3,4}; L2 = {23,24,34}; L3 = {234}; items passing level_passage_sup: {2,3,4,5,6}
- (2) Lower level, min_sup = 50%, scan DB: L1 = {2,3,6,8,9}
- (3) Lower level, min_sup = 50%, Apriori/DHP or FP-Growth: L2 = {23,68,69,89}; L3 = {689}

Reduced Support: An Example

[Figure: the same concept hierarchy as above]

Multi-Dimensional Association

- Single-Dimensional (Intra-Dimension) Rules: a single dimension (predicate) with multiple occurrences
- buys(X, "milk") ⇒ buys(X, "bread")
- Multi-Dimensional Rules: ≥ 2 dimensions
- Inter-dimension association rules (no repeated predicates)
- age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
- Hybrid-dimension association rules (repeated predicates)
- age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")

Summary

- Association rule mining
- Probably the most significant contribution from the database community to KDD
- A large number of papers have been published
- Some Important Issues
- Generalized Association Rules
- Multiple-Level Association Rules
- Association Analysis in Other Types of Data
- Spatial data, multimedia data, time-series data, etc.
- Weighted Association Rules
- Quantitative Association Rules

Weighted Association Rules

- Why Weighted Association Analysis?
- In previous work, all items in a transactional database are treated uniformly
- Items are given weights to reflect their importance to the user
- The weights may correspond to special promotions on some products, or to the profitability of different items
- Some products may be under promotion and hence more interesting; or some products are more profitable, and hence rules concerning them are of greater value

Weighted Association Rules

- A simple attempt to solve this problem is to eliminate the items with small weights
- However, a rule for a heavily weighted item may also contain low-weighted items
- Is the Apriori algorithm feasible?
- The Apriori algorithm depends on the downward closure property, which guarantees that all subsets of a frequent itemset are also frequent
- However, this does not hold in the weighted case

Weighted Association Rules: An Example

- Total Benefits = 500
- Benefits for the first transaction: (40+30+30+20+20+10+10) = 160
- Benefits for the second transaction: (40+30+20+20+10+10) = 130
- Benefits for the third transaction: (40+30+20+10+10) = 110
- Benefits for the fourth transaction: (30+30+20+10+10) = 100
- Suppose Weighted_Min_Sup = 40%
- Minimum Benefits = 500 × 40% = 200

An Example

- Minimum Benefits = 500 × 40% = 200
- Itemset {3,5,6,7}
- Benefits = 70
- Support Count (Frequency) = 3
- 70 × 3 = 210 > 200 ⇒ {3,5,6,7} is a frequent itemset
- Itemset {3,5,6}
- Benefits = 60
- Support Count (Frequency) = 3
- 60 × 3 = 180 < 200 ⇒ {3,5,6} is not a frequent itemset

The Apriori principle cannot be applied! (see the check below)
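
A minimal sketch of this weighted frequency test using the numbers above (the function name and signature are ours):

```python
# Weighted-support check: an itemset X is frequent iff
# Benefits(X) * Support_Count(X) >= Weighted_Min_Sup * Total_Benefits.

def is_weighted_frequent(benefits, support_count, w_min_sup, total_benefits):
    return benefits * support_count >= w_min_sup * total_benefits

TOTAL = 500
W_MIN_SUP = 0.40  # minimum benefits = 500 * 40% = 200
print(is_weighted_frequent(70, 3, W_MIN_SUP, TOTAL))  # True:  {3,5,6,7}, 210 >= 200
print(is_weighted_frequent(60, 3, W_MIN_SUP, TOTAL))  # False: {3,5,6},   180 <  200
```

Note that {3,5,6} is a subset of the frequent itemset {3,5,6,7} yet is itself infrequent, which is exactly why downward closure fails here.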

K-Support Bound

- If Y is a frequent q-itemset:
- Support_Count(Y) ≥ (Weighted_Min_Sup × Total_Benefits) / Benefits(Y)
- Example
- {3,5,6,7} is a frequent 4-itemset
- Support_Count({3,5,6,7}) = 3 ≥ (40% × 500) / Benefits({3,5,6,7}) = (40% × 500) / 70 = 2.857
- If X is a frequent k-itemset containing the q-itemset Y:
- Minimum_Support_Count(X) ≥ (Weighted_Min_Sup × Total_Benefits) / (Benefits(Y) + (k-q) × Maximum Remaining Weights)
- Example
- X is a frequent 5-itemset containing {3,5,6,7}
- Minimum_Support_Count(X) ≥ (40% × 500) / (70 + 40) = 1.81 (both bounds are sketched below)

K-Support Bound

K-Support Bound

- Itemset {1,2}
- Benefits = 70
- Support_Count({1,2}) = 1 < (40% × 500) / Benefits({1,2}) = (40% × 500) / 70 = 2.857
- {1,2} is not a frequent itemset
- If X is a frequent 3-itemset containing {1,2}:
- Minimum_Support_Count(X) ≥ (40% × 500) / (70 + 30) = 2
- But Maximum_Support_Count(X) = 1
- No frequent 3-itemsets contain {1,2}
- If X is a frequent 4-itemset containing {1,2}:
- Minimum_Support_Count(X) ≥ (40% × 500) / (70 + 30 + 20) = 1.667
- But Maximum_Support_Count(X) = 1
- No frequent 4-itemsets contain {1,2}
- Similarly, no frequent 5-, 6-, or 7-itemsets contain {1,2}
- The algorithm is designed based on this k-support bound

MINWAL Algorithm

Step by Step

- Input: product transactional database

Weighted_Min_Sup = 50%

Total Profits = 1380

Step 2, 7

- Search(D)
- This subroutine finds the maximum transaction size in the transactional database D
- Size = 4 in this case
- Counting(D, w)
- This subroutine accumulates the support counts of the 1-itemsets
- The k-support bounds of each 1-itemset are calculated, and the 1-itemsets with support counts greater than at least one of the k-support bounds are kept in C1

Step 7

K-Support Bound = (50% × 1380) / (10 + 90) = 6.9

Step 11

- Join(Ck-1)
- The Join step generates Ck from Ck-1, as in the Apriori algorithm
- If we have {1,2,3} and {1,2,4} in Ck-1, then {1,2,3,4} will be generated in Ck
- In this case:
- C1 = {1 (4), 2 (5), 4 (6), 5 (7)} (support counts in parentheses)
- C2 = Join(C1) = {12, 14, 15, 24, 25, 45}

Step 12

- Prune(Ck)
- An itemset will be pruned in either of the following cases:
- A subset of the candidate itemset in Ck does not exist in Ck-1
- Estimate an upper bound on the support count (SC) of the joined itemset X, namely the minimum support count among the k different (k-1)-subsets of X in Ck-1. If this estimated upper bound shows that X cannot be a subset of any large itemset in the coming passes (from the calculation of k-support bounds for all itemsets), that itemset is pruned.
- In this case:
- C2 = Prune(C2) = {12 (4), 14 (4), 15 (4), 24 (5), 25 (5), 45 (6)} (estimated support-count upper bounds in parentheses)

Step 12

- Prune(Ck)
- Using the k-support bound: no itemset is pruned

Step 13

- Checking(Ck, D)
- Scan the DB and generate Lk from Ck
- Here, scanning the DB turns C2 into L2

Step 11, 12

- Join(C2)
- C2 = {15 (4), 24 (5), 25 (5), 45 (6)}
- C3 = Join(C2) = {245}
- Prune(C3)
- C3 = Prune(C3) = {245 (5)}
- Using the k-support bound: no itemset is pruned

Step 13

- Checking(C3, D): Scan the DB
- C3 → L3 = {245}
- Finally, L = {45, 245}

Step 15

- Generate rules from L = {45, 245}, with Min_Conf = 90% (only the 100%-confidence rules are kept)
- 4 ⇒ 5 (confidence = 100%)
- 5 ⇒ 4 (confidence = 85.7%)
- 24 ⇒ 5 (confidence = 100%)
- 25 ⇒ 4 (confidence = 100%)
- 45 ⇒ 2 (confidence = 83.3%)
- 2 ⇒ 45 (confidence = 100%)
- 4 ⇒ 25 (confidence = 83.3%)
- 5 ⇒ 24 (confidence = 71.4%)

Generalized Association Rules

Quantitative Association Rules

- Let min_sup = 50%; we have
- {A, B}: 60%
- {B, D}: 70%
- {A, B, D}: 50%

- {A(1..2), B(3)}: 50%
- {A(1..2), B(3..5)}: 60%
- {A(1..2), B(1..5)}: 60%
- {B(1..5), D(1..2)}: 60%
- {B(3..5), D(1..3)}: 60%
- {B(1..5), D(1..3)}: 70%
- {B(1..3), D(1..2)}: 50%
- {B(3..5), D(1..2)}: 50%
- {A(1..2), B(3..5), D(1..2)}: 50%