Data Mining and Knowledge Acquisition Chapter 5

Transcript and Presenter's Notes

1
Data Mining and Knowledge Acquisition
Chapter 5
  • BIS 541
  • Summer 2005

2
Chapter 5 Mining Association Rules in Large
Databases
  • Association rule mining
  • Mining single-dimensional Boolean association
    rules from transactional databases
  • Mining multilevel association rules from
    transactional databases
  • Mining multidimensional association rules from
    transactional databases and data warehouses
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

3
What Is Association Mining?
  • Association rule mining
  • Finding frequent patterns, associations,
    correlations, or causal structures among sets of
    items or objects in transaction databases,
    relational databases, and other information
    repositories.
  • Applications
  • Market basket analysis, cross-marketing, catalog
    design, etc.
  • Examples:
  • Rule form: Body ⇒ Head [support, confidence]
  • buys(x, "diapers") ⇒ buys(x, "beers") [0.5%,
    60%]
  • major(x, "MIS") ∧ takes(x, "DM") ⇒ grade(x,
    "AA") [1%, 75%]

4
Association Rule Basic Concepts
  • Given
  • (1) database of transactions,
  • (2) each transaction is a list of items
    (purchased by a customer in a visit)
  • Find all rules that correlate the presence of
    one set of items with that of another set of
    items
  • E.g., 98% of people who purchase tires and auto
    accessories also get automotive services done
  • The user specifies
  • Minimum support level
  • Minimum confidence level
  • Rules exceeding the two thresholds are listed as
    interesting

5
Basic Concepts cont.
  • I = {i1, ..., im}: the set of all items; T: any transaction
  • A ⊆ T: T contains the itemset A
  • A ⊆ T, B ⊆ T: A, B are itemsets
  • Examine rules of the form
  • A ⇒ B where
  • A ∩ B = ∅
  • support s = P(A ∪ B)
  • the frequency of transactions containing both A and B
  • confidence c = P(B|A) = P(A ∪ B)/P(A)
  • the conditional probability that a transaction
    containing A also contains B
  • (both measures are computed in the sketch below)
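A minimal Python sketch of these two measures over a toy transaction list (the four transactions here are hypothetical, chosen only for illustration):

```python
# Support and confidence over a list of transactions (sets of items).
transactions = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "bread"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(a, b, db):
    """c = P(B|A) = support(A u B) / support(A)."""
    return support(set(a) | set(b), db) / support(a, db)

print(support({"beer", "diapers"}, transactions))       # 0.75
print(confidence({"diapers"}, {"beer"}, transactions))  # 1.0
```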

6
Rule Measures Support and Confidence
[Venn diagram: transactions where the customer buys diaper, buys beer, or buys both]
  • Find all rules X ∧ Y ⇒ Z with minimum
    confidence and support
  • support, s: probability that a transaction
    contains X ∪ Y ∪ Z
  • confidence, c: conditional probability that a
    transaction containing X ∪ Y also contains Z
  • Let minimum support = 50% and minimum confidence
    = 50%; we have
  • A ⇒ C (50%, 66.6%)
  • C ⇒ A (50%, 100%)

7
Frequent itemsets
  • Strong association rules:
  • support(rule) > min_support
  • confidence(rule) > min_confidence
  • k-itemset: an itemset containing k items
  • occurrence frequency = count = support count
  • Minimum support count =
  • min_sup × #transactions in the database
  • Frequent itemsets:
  • itemsets satisfying the minimum support count
  • The Apriori Algorithm has two steps:
  • (1) - Find all frequent itemsets
  • (2) - Generate strong association rules from the
    frequent itemsets

8
Mining Association Rules: An Example (1)
Min_support = 50%, Min_confidence = 50%,
Min_count = 0.5 × 4 = 2
  • {A}, {B}, {C}, {D} are 1-itemsets
  • {A}, {B}, {C} are frequent 1-itemsets as
  • Count{A} = 3 ≥ 2 (minimum count) or
  • Support{A} = 75% ≥ 50% (minimum support)
  • {D} is not a frequent 1-itemset as
  • Count{D} = 1 < 2 (minimum count) or
  • Support{D} = 25% < 50% (minimum support)

9
Mining Association Rules: An Example (2)
Min_support = 50%, Min_confidence = 50%,
Min_count = 0.5 × 4 = 2
  • {A,B}, {A,C}, {A,D}, {B,C} are 2-itemsets
  • {A,C} is a frequent 2-itemset as
  • Count{A,C} = 2 ≥ 2 (minimum count) or
  • Support{A,C} = 50% ≥ 50% (minimum support)
  • {A,B}, {A,D} are not frequent 2-itemsets as
  • Count{A,D} = 1 < 2 (minimum count) or
  • Support{A,D} = 25% < 50% (minimum support)

10
Mining Association Rules: An Example (3)
Min. support = 50%, Min. confidence = 50%
  • For the rule A ⇒ C:
  • support = support(A ∪ C) = 50%
  • confidence = support(A ∪ C)/support(A) = 66.6%
  • A strong rule, as support ≥ min_support and
    confidence ≥ min_confidence

11
The Apriori Principle
Min. support = 50%, Min. confidence = 50%
  • The Apriori principle:
  • Any subset of a frequent itemset must be frequent
  • {A,C} is a frequent 2-itemset
  • {A} and {C}, the subsets of {A,C}, must be frequent
    1-itemsets

12
The Apriori Algorithm has two steps
  • (1) - Find the frequent itemsets: the sets of items
    that have minimum support (the key step)
  • A subset of a frequent itemset must also be a
    frequent itemset
  • i.e., if {A,B} is a frequent itemset, both {A} and
    {B} must be frequent itemsets
  • Iteratively find frequent itemsets with
    cardinality from 1 to k (k-itemsets)
  • until Lk is empty
  • (2) - Use the frequent itemsets to generate
    association rules.

13
Generation of frequent itemsets from candidate
itemsets (Step 1)
  • C1 → L1 → C2 → L2 → C3 → L3 → C4 → L4 → ...
  • From Ck (candidate k-itemsets) generate Lk: Ck → Lk
  • From candidate k-itemsets generate frequent
    k-itemsets
  • (a) - Using the Apriori principle:
  • eliminate an itemset sk in Ck if
  • at least one (k-1)-subset of sk is not in Lk-1
  • (b) - For the candidate k-itemsets in Ck,
  • make a database scan to eliminate those itemsets
    whose support counts are below the minimum
    support count
  • From the frequent k-itemsets Lk generate candidate
    (k+1)-itemsets Ck+1: Lk → Ck+1
  • by self-joining Lk with Lk

14
Self Join operation
  • Sort the items in any li ∈ Lk in some
    lexicographic order:
  • li[1] < li[2] < ... < li[k-1] < li[k]
  • For li and lj, elements of Lk, join li with lj
  • if li[1] = lj[1] and li[2] = lj[2] and ...
    and li[k-1] = lj[k-1]
  • and li[k] < lj[k]
  • The first k-1 elements are the same
  • Only the last elements are different
  • For each li, lj satisfying the above condition,
  • construct the (k+1)-itemset
  • {li[1], li[2], ..., li[k-1], li[k], lj[k]}
  • the first k-1 items (common) are taken from li or lj
  • the k-th item is taken from li
  • the (k+1)-th item is taken from lj

15
Example of Self Join operation
  • Lexicographic order: alphabetic, a < b < c < d ...
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining L3 ⋈ L3 (Step 2):
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning by the Apriori principle (Step 1a):
  • acde is removed because ade is not in L3
  • C4 = {abcd}
  • (a code sketch of the join and prune steps follows)
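The join and prune steps can be sketched in a few lines of Python; itemsets are represented as sorted tuples, and the L3 below is the slide's example:

```python
from itertools import combinations

def apriori_gen(L_k):
    """Generate C_{k+1} from the frequent k-itemsets L_k
    (a set of sorted tuples)."""
    k = len(next(iter(L_k)))
    candidates = set()
    for li in L_k:
        for lj in L_k:
            # Join: first k-1 items equal, last item of li < last item of lj
            if li[:k - 1] == lj[:k - 1] and li[k - 1] < lj[k - 1]:
                candidates.add(li + (lj[k - 1],))
    # Prune (Apriori principle): every k-subset must itself be frequent
    return {c for c in candidates
            if all(s in L_k for s in combinations(c, k))}

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3))  # {('a','b','c','d')}; acde is pruned (ade not in L3)
```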

16
The Apriori Algorithm: Example (min support count = 2)
[Figure: level-wise flow over database D. Scan D to count C1, keep L1;
join to form C2, scan D, keep L2; join to form C3, scan D, keep L3]
17
Example 6.1 (Han)
  • TID | list of item IDs
  • T100 | 1, 2, 5
  • T200 | 2, 4
  • T300 | 2, 3
  • T400 | 1, 2, 4
  • T500 | 1, 3
  • T600 | 2, 3
  • T700 | 1, 3
  • T800 | 1, 2, 3, 5
  • T900 | 1, 2, 3
  • |D| = 9 transactions; minimum transaction
    support count = 2, min_sup = 2/9 = 22%; min conf = 70%
  • Find strong association rules having min sup
    count 2 and min confidence 70%

18
Data Dictionary
  • 1 = milk
  • 2 = apple
  • 3 = butter
  • 4 = bread
  • 5 = orange

19
1st iteration of the algorithm
  • C1 (itemset: sup_count):
  • {1}: 6, {2}: 7, {3}: 6, {4}: 2, {5}: 2
  • all meet the minimum support count, so L1 = C1:
  • {1}: 6, {2}: 7, {3}: 6, {4}: 2, {5}: 2
  • C2 = L1 ⋈ L1 (itemset: sup_count):
  • {1,2}: 4, {1,3}: 4, {1,4}: 1 ✗, {1,5}: 2,
    {2,3}: 4, {2,4}: 2, {2,5}: 2, {3,4}: 0 ✗,
    {3,5}: 1 ✗, {4,5}: 0 ✗
  • L2 = those itemsets in C2 having minimum support,
    Step (1b):
  • {1,2}: 4, {1,3}: 4, {1,5}: 2, {2,3}: 4,
    {2,4}: 2, {2,5}: 2

20
3rd iteration
  • Self-join to get C3: Step (2)
  • C3 = L2 ⋈ L2 = {1 2 3}, {1 2 5}, {1 3 5},
    {2 3 4}, {2 3 5}, {2 4 5}
  • Now Step (1a): apply the Apriori principle to every
    itemset in C3
  • 2-item subsets of {1 2 3}: {1 2}, {1 3}, {2 3}
  • all 2-item subsets are members of L2
  • keep {1 2 3} in C3
  • 2-item subsets of {1 2 5}: {1 2}, {1 5}, {2 5}
  • all 2-item subsets are members of L2
  • keep {1 2 5} in C3
  • 2-item subsets of {1 3 5}: {1 3}, {1 5}, {3 5}
  • {3 5} is not a member of L2, so it is not
    frequent
  • remove {1 3 5} from C3

21
3rd iteration cont.
  • 2-item subsets of {2 3 4}: {2 3}, {2 4}, {3 4}
  • {3 4} is not a member of L2, so it is not
    frequent
  • remove {2 3 4} from C3
  • 2-item subsets of {2 3 5}: {2 3}, {2 5}, {3 5}
  • {3 5} is not a member of L2, so it is not
    frequent
  • remove {2 3 5} from C3
  • 2-item subsets of {2 4 5}: {2 4}, {2 5}, {4 5}
  • {4 5} is not a member of L2, so it is not
    frequent
  • remove {2 4 5} from C3
  • C3 = {1 2 3}, {1 2 5} after pruning

22
4th iteration
  • C3 → L3: check min support, Step (1b)
  • L3 = those itemsets having minimum support
  • L3 (itemset: min sup count):
  • {1 2 3}: 2
  • {1 2 5}: 2
  • L3 ⋈ L3 to generate C4: Step (2)
  • L3 ⋈ L3 = {1 2 3 5}
  • pruned, since its subset {2 3 5} is not frequent
  • C4 = ∅
  • the algorithm terminates
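Putting the pieces together, a compact level-wise loop (reusing the apriori_gen sketch after slide 15) reproduces this run on the Example 6.1 transactions, ending with L3 = {1 2 3}, {1 2 5}:

```python
from collections import Counter

# Example 6.1 transactions, items encoded per the data dictionary
D = [{1,2,5}, {2,4}, {2,3}, {1,2,4}, {1,3}, {2,3}, {1,3}, {1,2,3,5}, {1,2,3}]
MIN_COUNT = 2

def frequent(candidates, db, min_count):
    """Scan the database once and keep candidates meeting min_count."""
    counts = Counter()
    for t in db:
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    return {c for c, n in counts.items() if n >= min_count}

items = sorted({i for t in D for i in t})
L, k = frequent([(i,) for i in items], D, MIN_COUNT), 1
while L:
    print(f"L{k}:", sorted(L))
    L = frequent(apriori_gen(L), D, MIN_COUNT)  # join + prune, then scan
    k += 1
# L1: five singletons; L2: six pairs; L3: [(1, 2, 3), (1, 2, 5)]; C4 is empty
```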

23
Generating Association Rules from frequent
itemsets
  • Strong rules satisfy
  • min support and min confidence
  • confidence(A ⇒ B) = P(B|A) =
    sup_count(A ∪ B) / sup_count(A)
  • for each frequent itemset l,
  • generate all non-empty subsets s of l
  • For each s ⊂ l,
  • construct the rule s ⇒ (l - s)
  • Rules satisfying the condition
  • sup_count(l)/sup_count(s) ≥ min_conf
  • are listed as interesting
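A sketch of this rule-generation step; the `sup` table below copies the support counts that appear on the surrounding Example 6.1/6.2 slides:

```python
from itertools import combinations

def gen_rules(l, sup_count, min_conf):
    """Emit strong rules s => (l - s) from a frequent itemset l.
    `sup_count` maps sorted item tuples to support counts."""
    rules = []
    for r in range(1, len(l)):
        for s in combinations(l, r):
            conf = sup_count[l] / sup_count[s]
            if conf >= min_conf:
                rest = tuple(i for i in l if i not in s)
                rules.append((s, rest, conf))
    return rules

sup = {(1,): 6, (2,): 7, (5,): 2, (1,2): 4, (1,5): 2, (2,5): 2, (1,2,5): 2}
for s, rest, conf in gen_rules((1, 2, 5), sup, min_conf=0.70):
    print(s, "=>", rest, f"conf={conf:.0%}")
# (1, 5) => (2,) 100%, (2, 5) => (1,) 100%, (5,) => (1, 2) 100%
```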

24
Example 6.2 (Han) cont.
  • the frequent 3-itemset l = {1 2 5}: transactions
    containing milk, apple and orange together are frequent
  • the non-empty subsets of l are
  • {1 2}, {1 5}, {2 5}, {1}, {2}, {5}
  • the resulting association rules are
  • 1 ∧ 2 ⇒ 5, conf = 2/4 = 50%
  • 1 ∧ 5 ⇒ 2, conf = 2/2 = 100%
  • 2 ∧ 5 ⇒ 1, conf = 2/2 = 100%
  • 1 ⇒ 2 ∧ 5, conf = 2/6 = 33%
  • 2 ⇒ 1 ∧ 5, conf = 2/7 = 29%
  • 5 ⇒ 1 ∧ 2, conf = 2/2 = 100%
  • with min conf = 70%, the 2nd, 3rd, and last rules
    are strong

25
Example 6.2 cont.: Detail on confidence for two
rules
  • For the rule
  • 1 ∧ 5 ⇒ 2: conf = s(1,2,5)/s(1,5)
  • conf = 2/2 = 100% > 70%
  • A strong rule
  • For the rule
  • 2 ⇒ 1 ∧ 5: conf = s(1,2,5)/s(2)
  • conf = 2/7 = 29% < 70%
  • Not a strong rule

26
Exercise
  • Find all strong association rules in Example 6.2
  • Check minimum confidence
  • for the frequent 2-itemsets
  • {1,2}, {1,3}, {1,5}, {2,3}, {2,4}, {2,5}
  • 1 ⇒ 2, 2 ⇒ 1
  • 2 ⇒ 5, 5 ⇒ 2, etc.
  • for the frequent 3-itemset
  • {1,2,5}
  • 1 ∧ 2 ⇒ 5
  • 5 ⇒ 1 ∧ 2, etc.

27
Exercise
  • a) Suppose A ⇒ B and B ⇒ C are strong rules.
  • Does this imply that A ⇒ C is also a strong rule?
  • b) Suppose A ⇒ B and A ⇒ C are strong rules.
  • Does this imply that B ⇒ C is also a strong rule?
  • c) Suppose A ⇒ C and B ⇒ C are strong rules.
  • Does this imply that A ∧ B ⇒ C is also a strong
    rule?

28
Bottleneck of Frequent-pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of
    scanning and generates lots of candidates
  • To find the frequent itemset i1 i2 ... i100:
  • # of scans: 100
  • # of candidates: C(100,1) + C(100,2) + ... + C(100,100)
    = 2^100 - 1 ≈ 1.27 × 10^30 !
  • Bottleneck: candidate generation and test
  • Can we avoid candidate generation?

29
Is Apriori Fast Enough? Performance Bottlenecks
  • The core of the Apriori algorithm:
  • Use frequent (k-1)-itemsets to generate
    candidate frequent k-itemsets
  • Use database scans and pattern matching to collect
    counts for the candidate itemsets
  • The bottleneck of Apriori: candidate generation
  • Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate 10^7
    candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g.,
    {a1, a2, ..., a100}, one needs to generate 2^100 ≈
    10^30 candidates.
  • Multiple scans of the database:
  • Needs (n+1) scans, where n is the length of the
    longest pattern

30
Mining Frequent Patterns Without Candidate
Generation
  • Compress a large database into a compact,
    Frequent-Pattern tree (FP-tree) structure
  • highly condensed, but complete for frequent
    pattern mining
  • avoids costly database scans
  • Develop an efficient, FP-tree-based frequent
    pattern mining method
  • A divide-and-conquer methodology: decompose
    mining tasks into smaller ones
  • Avoid candidate generation: sub-database test
    only!

31
Construct FP-tree from a Transaction DB
TID | items bought | (ordered) frequent items
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o | f, c, a, b, m
300 | b, f, h, j, o | f, b
400 | b, c, k, s, p | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p
min_support = 50%
  • Steps:
  • Scan the DB once, find frequent 1-itemsets (single
    item patterns)
  • Order frequent items in frequency-descending
    order
  • Scan the DB again, construct the FP-tree
    (a construction sketch follows below)
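A minimal construction sketch on the table above. The header table and node-links, which the mining phase needs, are omitted, and ties among equally frequent items are broken alphabetically here, so the item order may differ slightly from the slide's f, c, a, b, m, p:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, min_count):
    """Scan 1: count items. Scan 2: insert each transaction's frequent
    items, ordered by descending global frequency, into the tree."""
    freq = Counter(i for t in db for i in t)
    freq = {i: n for i, n in freq.items() if n >= min_count}
    root = Node(None, None)
    for t in db:
        ordered = sorted((i for i in t if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for i in ordered:                      # shared prefixes share nodes
            node = node.children.setdefault(i, Node(i, node))
            node.count += 1
    return root

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
      set("bcksp"), set("afcelpmn")]
tree = build_fp_tree(db, min_count=3)          # 50% of 5 transactions
```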

32
Benefits of the FP-tree Structure
  • Completeness
  • never breaks a long pattern of any transaction
  • preserves complete information for frequent
    pattern mining
  • Compactness
  • reduces irrelevant information: infrequent items
    are gone
  • frequency-descending ordering: more frequent
    items are more likely to be shared
  • never larger than the original database (not
    counting node-links and counts)
  • Example: for the Connect-4 DB, the compression
    ratio could be over 100

33
Chapter 5 Mining Association Rules in Large
Databases
  • Association rule mining
  • Mining single-dimensional Boolean association
    rules from transactional databases
  • Mining multilevel association rules from
    transactional databases
  • Mining multidimensional association rules from
    transactional databases and data warehouses
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

34
Multiple-Level Association Rules
  • Items often form hierarchies.
  • Items at the lower level are expected to have
    lower support.
  • Rules regarding itemsets at
  • appropriate levels could be quite useful.
  • Transaction database can be encoded based on
    dimensions and levels
  • We can explore shared multi-level mining

35
Mining Multi-Level Associations
  • A top-down, progressive deepening approach:
  • First find high-level strong rules:
  • milk ⇒ bread [20%, 60%]
  • Then find their lower-level weaker rules:
  • 2% milk ⇒ wheat bread [6%, 50%]
  • Variations of mining multiple-level association
    rules:
  • Level-crossed association rules:
  • 2% milk ⇒ Wonder wheat bread
  • Association rules with multiple, alternative
    hierarchies:
  • 2% milk ⇒ Wonder bread

36
Multi-level Association Uniform Support vs.
Reduced Support
  • Uniform Support: the same minimum support for all
    levels
  • One minimum support threshold. No need to
    examine itemsets containing any item whose
    ancestors do not have minimum support.
  • Lower-level items do not occur as frequently.
    If the support threshold is
  • too high → miss low-level associations
  • too low → generate too many high-level
    associations
  • Reduced Support: reduced minimum support at lower
    levels
  • There are 4 search strategies:
  • Level-by-level independent
  • Level-cross filtering by k-itemset
  • Level-cross filtering by single item
  • Controlled level-cross filtering by single item

37
Uniform Support
Multi-level mining with uniform support
[Figure: Level 1 (min_sup = 5%): Milk, support 10%;
Level 2 (min_sup = 5%): 2% Milk, support 6%; Skim Milk, support 4%]
38
Reduced Support
Multi-level mining with reduced support
[Figure: Level 1 (min_sup = 5%): Milk, support 10%;
Level 2 (min_sup = 3%): 2% Milk, support 6%; Skim Milk, support 4%]
39
  • Controlled level-cross filtering by single item:
  • Specify a level passage threshold (LPT) for each
    level k:
  • min_sup(level k+1) < LPT(k) < min_sup(level k)
  • Example (a filtering sketch follows below):
  • High level: milk,
  • min sup = 5%
  • Low level: 2% milk, skim milk,
  • min sup = 3%
  • Level passage threshold = 4%
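A small sketch of this filtering rule, using the hypothetical two-level supports from the preceding slides:

```python
# Hypothetical supports (in %) for the milk hierarchy of slides 37-39.
level1_sup = {"milk": 10}
level2_sup = {"2% milk": 6, "skim milk": 4}
parent = {"2% milk": "milk", "skim milk": "milk"}

MIN_SUP = {1: 5, 2: 3}   # reduced minimum support at the lower level
LPT = 4                  # level passage threshold for level 1

# A level-2 item is examined only if its parent passes the level-1
# passage threshold; it is frequent if it meets the level-2 min_sup.
frequent_l2 = [i for i, s in level2_sup.items()
               if level1_sup[parent[i]] >= LPT and s >= MIN_SUP[2]]
print(frequent_l2)  # ['2% milk', 'skim milk']
```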

40
Multi-level Association Redundancy Filtering
  • Some rules may be redundant due to ancestor
    relationships between items.
  • Example:
  • milk ⇒ wheat bread [support 8%, confidence 70%]
  • 2% milk ⇒ wheat bread [support 2%, confidence 72%]
  • We say the first rule is an ancestor of the
    second rule.
  • A rule is redundant if its support is close to
    the expected value based on the rule's ancestor
    (e.g., if about a quarter of all milk sold is
    2% milk, the second rule's support and confidence
    are roughly what the first rule already predicts).

41
Multi-Level Mining: Progressive Deepening
  • A top-down, progressive deepening approach:
  • First mine high-level frequent items:
  • milk (15%), bread (10%)
  • Then mine their lower-level weaker frequent
    itemsets:
  • 2% milk (5%), wheat bread (4%)
  • Different min_support thresholds across
    multi-levels lead to different algorithms:
  • If adopting the same min_support across
    multi-levels,
  • then toss t if any of t's ancestors is
    infrequent.
  • If adopting reduced min_support at lower levels,
  • then examine only those descendants whose
    ancestors' support is frequent/non-negligible.

42
Progressive Refinement of Data Mining Quality
  • Why progressive refinement?
  • Mining operators can be expensive or cheap, fine
    or rough
  • Trade speed for quality: step-by-step
    refinement.
  • Superset coverage property:
  • Preserve all the positive answers: allow a
    false positive test but not a false negative
    test.
  • Two- or multi-step mining:
  • First apply a rough/cheap operator (superset
    coverage)
  • Then apply an expensive algorithm on a substantially
    reduced candidate set (Koperski & Han, SSD'95).

43
Chapter 5 Mining Association Rules in Large
Databases
  • Association rule mining
  • Mining single-dimensional Boolean association
    rules from transactional databases
  • Mining multilevel association rules from
    transactional databases
  • Mining multidimensional association rules from
    transactional databases and data warehouses
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

44
Interestingness Measurements
  • Objective measures:
  • Two popular measurements:
  • support and
  • confidence
  • Subjective measures (Silberschatz & Tuzhilin,
    KDD'95):
  • A rule (pattern) is interesting if
  • it is unexpected (surprising to the user) and/or
  • actionable (the user can do something with it)

45
Criticism to Support and Confidence
  • Example 1 (Aggarwal & Yu, PODS'98):
  • Among 5000 students,
  • 3000 play basketball
  • 3750 eat cereal
  • 2000 both play basketball and eat cereal
  • play basketball ⇒ eat cereal [40%, 66.7%] is
    misleading, because the overall percentage of
    students eating cereal is 75%, which is higher
    than 66.7%.
  • play basketball ⇒ not eat cereal [20%, 33.3%] is
    far more accurate, although with lower support
    and confidence

46
Criticism to Support and Confidence (Cont.)
  • Example 2:
  • X and Y: positively correlated
  • X and Z: negatively correlated
  • yet the support and confidence of
  • X ⇒ Z dominate
  • We need a measure of dependent or correlated
    events
  • P(B|A)/P(B) is also called the lift of the rule A ⇒ B

47
Other Interestingness Measures Interest
  • Interest (correlation, lift):
  • interest(A, B) = P(A ∪ B) / (P(A) × P(B))
  • takes both P(A) and P(B) into consideration
  • P(A ∪ B) = P(A) × P(B) if A and B are independent
    events
  • A and B are negatively correlated if the value is
    less than 1; otherwise A and B are positively
    correlated

48
Example
  • Total transactions: 10,000
  • Items: C = computers, V = videos
  • V: 7,500; C: 6,000; C and V: 4,000
  • min_support = 0.30, min_conf = 0.50
  • Consider the rule
  • buys(X, computer) ⇒ buys(X, video)
  • Support = 4000/10000 = 0.4
  • Confidence = P(C and V)/P(C) = 4000/6000 = 66%
  • Strong, but:
  • the probability of buying a video is 0.75; buying a
    computer reduces the probability of buying a video
  • from 0.75 to 0.66
  • Computer and video are negatively correlated

49
  • Lift of A ⇒ B:
  • Lift = P(A and B) / (P(A) × P(B))
  • Since P(A and B) = P(B|A) × P(A),
  • Lift = P(B|A) / P(B)
  • the ratio of the probability of buying A and B
    together to the probability of buying A and B
    independently
  • Or it can be interpreted as
  • the conditional probability of buying B given that A
    is purchased, divided by the unconditional
    probability of buying B

50
            C     not C   total
V         4000    3500    7500
not V     2000     500    2500
total     6000    4000   10000

Lift(C ⇒ V) = P(C and V)/(P(C) × P(V)) = P(V|C)/P(V)
= 0.4/(0.6 × 0.75) = 0.89 < 1: there is a negative
correlation between video and computer
51
Are All the Rules Found Interesting?
  • buy walnuts ⇒ buy milk [1%, 80%] is
    misleading
  • if 85% of customers buy milk
  • Support and confidence are not good measures of
    correlation
  • So many interestingness measures... (Tan, Kumar,
    Srivastava @KDD'02)

52
All Confidence
  • All confidence:
  • all_conf(X) = sup(X) / max{sup(Xi)} over all i
  • X = {X1, X2, ..., Xk}
  • For k = 2:
  • the rules are X1 ⇒ X2 and X2 ⇒ X1
  • all_conf = sup({X1,X2}) / max{sup(X1), sup(X2)}
  • Here sup({X1,X2})/sup(X1) is the confidence of the rule
  • X1 ⇒ X2
  • Ex: all_conf = 0.4/max(0.6, 0.75) = 0.4/0.75 = 0.53

53
Cosine
  • cosine(A, B) = P(A, B) / sqrt(P(A) × P(B))
  • Similar to lift, but takes the square root of the
    denominator
  • Both cosine and all_conf are null-invariant:
  • not affected by null transactions
  • Ex:
  • cosine = 0.4/sqrt(0.6 × 0.75) = 0.4/0.67 ≈ 0.60
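The sketch below recomputes lift, all-confidence, and cosine from the computer/video counts of slide 48:

```python
import math

N, C, V, CV = 10_000, 6_000, 7_500, 4_000   # counts from slide 48

sup  = CV / N                                 # 0.40
conf = CV / C                                 # 0.667
lift = sup / ((C / N) * (V / N))              # 0.40/0.45 = 0.89 < 1
all_conf = sup / max(C / N, V / N)            # 0.40/0.75 = 0.53
cosine = sup / math.sqrt((C / N) * (V / N))   # 0.40/0.67 = 0.60

print(f"lift={lift:.2f}, all_conf={all_conf:.2f}, cosine={cosine:.2f}")
```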

54
Mining Highly Correlated Patterns
  • Lift and χ² are not good measures for
    correlations in transactional DBs
  • all_conf or cosine could be good measures
    (Omiecinski @TKDE'03)
  • Both all_conf and coherence have the downward
    closure property

55
(No Transcript)
56
Chapter 5 Mining Association Rules in Large
Databases
  • Association rule mining
  • Mining single-dimensional Boolean association
    rules from transactional databases
  • Mining multilevel association rules from
    transactional databases
  • Mining multidimensional association rules from
    transactional databases and data warehouse
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

57
Constraint-based (Query-Directed) Mining
  • Finding all the patterns in a database
    autonomously? Unrealistic!
  • The patterns could be too many, but not focused!
  • Data mining should be an interactive process:
  • the user directs what is to be mined using a data mining
    query language (or a graphical user interface)
  • Constraint-based mining:
  • User flexibility: provides constraints on what is to
    be mined
  • System optimization: explores such constraints
    for efficient mining (constraint-based mining)

58
Constraints in Data Mining
  • Knowledge type constraint:
  • classification, association, etc.
  • Data constraint, using SQL-like queries:
  • find product pairs sold together in stores in
    Chicago in Dec. '02
  • Dimension/level constraint:
  • in relevance to region, price, brand, customer
    category
  • Rule (or pattern) constraint:
  • small sales (price < 10) triggers big sales
    (sum > 200)
  • Interestingness constraint:
  • strong rules: min_support ≥ 3%, min_confidence
    ≥ 60%

59
Example
  • bread ⇒ milk
  • milk ⇒ butter
  • Strong rules, but the items are not that valuable
  • TV ⇒ VCD player
  • Support may be lower than for the previous rules, but
    the value of the items is much higher
  • This rule may be more valuable

60
Constrained Mining vs. Constraint-Based Search
  • Constrained mining vs. constraint-based
    search/reasoning
  • Both are aimed at reducing search space
  • Finding all patterns satisfying constraints vs.
    finding some (or one) answer in constraint-based
    search in AI
  • Constraint-pushing vs. heuristic search
  • How to integrate them is an interesting research
    problem
  • Constrained mining vs. query processing in DBMS:
  • Database query processing requires finding all
    answers
  • Constrained pattern mining shares a similar
    philosophy as pushing selections deeply into query
    processing

61
Rule Constraints in Association Mining
  • Two kinds of rule constraints:
  • Rule form constraints: meta-rule guided mining.
  • P(x, y) ∧ Q(x, w) ⇒ takes(x, "database
    systems").
  • Rule (content) constraints: constraint-based query
    optimization (Ng, et al., SIGMOD'98).
  • sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3
    ∧ sum(RHS) > 1000
  • 1-variable vs. 2-variable constraints
    (Lakshmanan, et al., SIGMOD'99):
  • 1-var: a constraint confining only one side (L/R)
    of the rule, e.g., as shown above.
  • 2-var: a constraint confining both sides (L and
    R).
  • sum(LHS) < min(RHS) ∧ max(RHS) < 5 × sum(LHS)

62
  • The Apriori principle states that
  • all non-empty subsets of a frequent itemset must
    also be frequent
  • Note that
  • if a given itemset does not satisfy minimum
    support,
  • none of its supersets can
  • Other examples of anti-monotone constraints:
  • min(I.price) > 500
  • count(I) < 10
  • avg(I.price) < 10 is not anti-monotone

63
Anti-Monotonicity in Constraint Pushing
TDB (min_sup = 2)
  • Anti-monotonicity:
  • when an itemset S violates the constraint, so
    does any of its supersets
  • sum(S.price) ≤ v is anti-monotone
  • sum(S.price) ≥ v is not anti-monotone
  • Example: C: range(S.profit) ≤ 15 is anti-monotone
  • Itemset ab violates C
  • So does every superset of ab

64
Monotonicity for Constraint Pushing
TDB (min_sup = 2)
  • Monotonicity:
  • when an itemset S satisfies the constraint, so
    does any of its supersets
  • sum(S.price) ≥ v is monotone
  • min(S.price) ≤ v is monotone
  • Example: C: range(S.profit) ≥ 15
  • Itemset ab satisfies C
  • So does every superset of ab
  • (a constraint-pushing sketch follows below)
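A sketch of pushing an anti-monotone constraint into the Apriori loop; the item prices and the bound are hypothetical:

```python
price = {"a": 1, "b": 2, "c": 3, "d": 4}      # hypothetical price table

def satisfies(itemset):
    """Anti-monotone constraint: sum(S.price) <= 5."""
    return sum(price[i] for i in itemset) <= 5

def prune_by_constraint(candidates):
    # Violators are discarded before the costly database scan; by
    # anti-monotonicity none of their supersets can qualify either.
    return {c for c in candidates if satisfies(c)}

print(prune_by_constraint({("a", "b"), ("a", "d"), ("c", "d")}))
# {('a','b'), ('a','d')}: c,d sums to 7, so it and all its supersets go
```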

65
The Apriori Algorithm: Example
[Figure: the unconstrained Apriori flow over database D: scan D for
C1 → L1, join for C2, scan D → L2, join for C3, scan D → L3]
66
Naïve Algorithm: Apriori + Constraint
[Figure: the same Apriori flow over database D, with the constraint
sum(S.price) < 5 checked only on the final itemsets]
67
The Constrained Apriori Algorithm: Push an
Anti-monotone Constraint Deep
[Figure: the Apriori flow with the constraint sum(S.price) < 5
pushed into candidate generation, pruning violating itemsets early]
68
The Constrained Apriori Algorithm: Push Another
Constraint Deep
[Figure: the Apriori flow with the constraint min(S.price) < 1
pushed into the mining process]
69
Chapter 5 Mining Association Rules in Large
Databases
  • Association rule mining
  • Algorithms for scalable mining of
    (single-dimensional Boolean) association rules in
    transactional databases
  • Mining various kinds of association/correlation
    rules
  • Constraint-based association mining
  • Sequential pattern mining
  • Applications/extensions of frequent pattern
    mining
  • Summary

70
Sequence Databases and Sequential Pattern Analysis
  • Transaction databases, time-series databases vs.
    sequence databases
  • Frequent patterns vs. (frequent) sequential
    patterns
  • Applications of sequential pattern mining
  • Customer shopping sequences:
  • first buy a computer, then a CD-ROM, and then a
    digital camera, within 3 months
  • Medical treatment, natural disasters (e.g.,
    earthquakes), science and engineering processes,
    stocks and markets, etc.
  • Telephone calling patterns, Weblog click streams
  • DNA sequences and gene structures

71
What Is Sequential Pattern Mining?
  • Given a set of sequences, find the complete set
    of frequent subsequences

A sequence: <(ef) (ab) (df) c b>
A sequence database
An element may contain a set of items. Items
within an element are unordered, and we list them
alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
(a containment check is sketched below)
Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern
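A greedy containment check, sketched in Python with sequences represented as lists of item sets (the example is the slide's):

```python
def is_subsequence(sub, seq):
    """True if each element (item set) of `sub` is a subset of some
    element of `seq`, matched in left-to-right order."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:   # subset test
            i += 1
    return i == len(sub)

sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]                        # <a(bc)dc>
seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]  # <a(abc)(ac)d(cf)>
print(is_subsequence(sub, seq))  # True
```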
72
Challenges on Sequential Pattern Mining
  • A huge number of possible sequential patterns are
    hidden in databases
  • A mining algorithm should
  • find the complete set of patterns, when possible,
    satisfying the minimum support (frequency)
    threshold
  • be highly efficient, scalable, involving only a
    small number of database scans
  • be able to incorporate various kinds of
    user-specific constraints

73
Studies on Sequential Pattern Mining
  • Concept introduction and an initial Apriori-like
    algorithm:
  • R. Agrawal & R. Srikant. Mining sequential
    patterns, ICDE'95
  • GSP: an Apriori-based, influential mining method
    (developed at IBM Almaden):
  • R. Srikant & R. Agrawal. Mining sequential
    patterns: generalizations and performance
    improvements, EDBT'96
  • From sequential patterns to episodes
    (Apriori-like + constraints):
  • H. Mannila, H. Toivonen & A.I. Verkamo.
    Discovery of frequent episodes in event
    sequences, Data Mining and Knowledge Discovery,
    1997
  • Mining sequential patterns with constraints:
  • M.N. Garofalakis, R. Rastogi, K. Shim. SPIRIT:
    Sequential Pattern Mining with Regular Expression
    Constraints. VLDB 1999

74
Sequential pattern mining Cases and Parameters
  • Duration of a time sequence T:
  • Sequential pattern mining can then be confined to
    the data within a specified duration
  • Ex. the subsequence corresponding to the year 1999
  • Ex. partitioned sequences, such as every year, or
    every week after a stock crash, or every two
    weeks before and after a volcano eruption
  • Event folding window w:
  • If w = T, time-insensitive frequent patterns are
    found
  • If w = 0 (no event sequence folding), sequential
    patterns are found where each event occurs at a
    distinct time instant
  • If 0 < w < T, sequences occurring within the same
    period w are folded in the analysis

75
Example
  • When the event folding window is 5 minutes,
  • purchases within 5 minutes are considered to be
    taken together

76
Sequential pattern mining Cases and Parameters
(2)
  • Time interval, int, between events in the
    discovered pattern:
  • int = 0: no interval gap is allowed, i.e., only
    strictly consecutive sequences are found
  • Ex. find frequent patterns occurring in
    consecutive weeks
  • min_int ≤ int ≤ max_int: find patterns that are
    separated by at least min_int but at most max_int
  • Ex. if a person rents movie A, it is likely she
    will rent movie B within 30 days (int ≤ 30)
  • int = c ≠ 0: find patterns carrying an exact
    interval
  • Ex. every time the Dow Jones drops more than
    5%, what will happen exactly two days later?
    (int = 2)

77
A Basic Property of Sequential Patterns Apriori
  • A basic property: Apriori (Agrawal & Srikant '94)
  • If a sequence S is not frequent,
  • then none of the super-sequences of S is frequent
  • E.g., <hb> is infrequent, so are <hab> and <(ah)b>

Given support threshold min_sup = 2
78
GSP: A Generalized Sequential Pattern Mining
Algorithm
  • GSP (Generalized Sequential Pattern) mining
    algorithm
  • proposed by Srikant and Agrawal, EDBT'96
  • Outline of the method:
  • Initially, every item in the DB is a candidate of
    length 1
  • for each level (i.e., sequences of length k) do:
  • scan the database to collect the support count for
    each candidate sequence
  • generate candidate length-(k+1) sequences from
    length-k frequent sequences using Apriori
  • repeat until no frequent sequence or no candidate
    can be found
  • Major strength: candidate pruning by Apriori
79
Finding Length-1 Sequential Patterns
  • Examine GSP using an example
  • Initial candidates: all singleton sequences
  • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan the database once, count support for the
    candidates

80
Generating Length-2 Candidates
51 length-2 candidates
Without the Apriori property: 8×8 + 8×7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
81
Finding Length-2 Sequential Patterns
  • Scan database one more time, collect support
    count for each length-2 candidate
  • There are 19 length-2 candidates which pass the
    minimum support threshold
  • They are length-2 sequential patterns

82
Generating Length-3 Candidates and Finding
Length-3 Patterns
  • Generate length-3 candidates:
  • Self-join length-2 sequential patterns,
  • based on the Apriori property:
  • <ab>, <aa> and <ba> are all length-2 sequential
    patterns → <aba> is a length-3 candidate
  • <(bd)>, <bb> and <db> are all length-2 sequential
    patterns → <(bd)b> is a length-3 candidate
  • 46 candidates are generated
  • Find length-3 sequential patterns:
  • Scan the database once more, collect support counts
    for the candidates
  • 19 out of 46 candidates pass the support threshold

83
The GSP Mining Process
[Figure: the level-wise GSP process, min_sup = 2]
84
  • Definition: c is a contiguous subsequence of a
    sequence s = <s1, s2, ..., sn> if
  • c is derived by dropping an item from s1 or sn, or
  • c is derived by dropping an item from an element si
    which has at least 2 items, or
  • c is a contiguous subsequence of c', and c' is a
    contiguous subsequence of s
  • Ex. s = <(1,2), (3,4), 5, 6>
  • <2, (3,4), 5>, <(1,2), 3, 5, 6>, <3, 5> are, but
  • <(1,2), (3,4), 6>, <1, 5, 6> are not

85
Candidate generation
  • Step 1. Join step: L(k-1) joined with L(k-1) gives Ck
  • s1 and s2 are joined if dropping the first item of s1
    and the last item of s2 gives the same sequence
  • s1 is extended by adding the last item of s2
  • Step 2. Prune step: delete candidate sequences
    having a contiguous (k-1)-subsequence whose
    support count is less than min_support count

86
  • L3 = { <(1,2),3>, <(1,2),4>, <1,(3,4)>, <(1,3),5>,
    <2,(3,4)>, <2,3,5> }
  • C4 = { <(1,2),(3,4)>, <(1,2),3,5> }
  • L4 = { <(1,2),(3,4)> }
  • <(1,2),3> joined with <2,(3,4)> gives
    <(1,2),(3,4)>
  • <(1,2),3> joined with <2,3,5> gives <(1,2),3,5>
  • <(1,2),3,5> is dropped since its contiguous
    3-subsequence
  • <1,3,5> is not in L3

87
The GSP Algorithm
  • Take sequences of the form <x> as length-1
    candidates
  • Scan the database once, find F1, the set of length-1
    sequential patterns
  • Let k = 1; while Fk is not empty do:
  • form C(k+1), the set of length-(k+1) candidates
    from Fk
  • if C(k+1) is not empty, scan the database once, find
    F(k+1), the set of length-(k+1) sequential patterns
  • let k = k+1
  • (a runnable skeleton follows below)
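A runnable skeleton of this loop, under stated assumptions: sequences are lists of item sets, `is_subsequence` is the sketch added after slide 71, and `gen_candidates` is a placeholder for the join-and-prune of slide 85:

```python
def gsp(db, min_sup, gen_candidates):
    items = {i for seq in db for element in seq for i in element}
    C = [[{i}] for i in sorted(items)]        # length-1 candidates
    k, patterns = 1, []
    while C:
        # One database scan per level: count each candidate's support.
        F = [c for c in C
             if sum(is_subsequence(c, s) for s in db) >= min_sup]
        patterns.extend(F)
        C = gen_candidates(F, k) if F else [] # join + prune (slide 85)
        k += 1
    return patterns
```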

88
Bottlenecks of GSP
  • A huge set of candidates could be generated:
  • 1,000 frequent length-1 sequences generate
    1000×1000 + 1000×999/2 = 1,499,500 length-2
    candidates!
  • Multiple scans of the database in mining
  • Real challenge: mining long sequential patterns
  • An exponential number of short candidates
  • A length-100 sequential pattern needs 10^30
    candidate sequences!