Introduction to Association Analysis

Transcript and Presenter's Notes
1
Introduction to Association Analysis
  • Zhangxi Lin
  • ISQS 3358
  • Texas Tech University

2
Outline
  • Basic concepts
  • Itemset generation - Apriori principle
  • Association rule discovery and generation
  • Evaluation of association patterns
  • Sequential pattern analysis

3
Basic Concepts
4
Association Rule Mining
  • Given a set of transactions, find rules that will
    predict the occurrence of an item based on the
    occurrences of other items in the transaction

Market-Basket transactions
Example of Association Rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
5
Definition: Frequent Itemset
  • Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
  • k-itemset
  • An itemset that contains k items
  • Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
  • Support (s)
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
  • Frequent Itemset
  • An itemset whose support is greater than or equal
    to a minsup threshold

6
Definition: Association Rule
  • Association Rule
  • An implication expression of the form X → Y,
    where X and Y are itemsets
  • Example: {Milk, Diaper} → {Beer}
  • Rule Evaluation Metrics
  • Support (s)
  • Fraction of transactions that contain both X and Y
  • Confidence (c)
  • Measures how often items in Y appear in
    transactions that contain X

7
  • Count(CKG, SVG) = 1
  • Support = 1/5 = 20%
  • Count(CKG) = 3
  • Confidence = 1/3 = 0.33
  • Count(CKG) = 2
  • Count(CKG, SVG) = 2
  • Confidence(CKG, SVG) = 2/2 = 100%

8
Formal Definitions
  • Support: s(X → Y) = σ(X ∪ Y) / N, where N is the total number of transactions
  • Confidence: c(X → Y) = σ(X ∪ Y) / σ(X)
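The slides' running numbers (2/5, 0.4, 0.67) match the classic five-transaction market-basket table, so a minimal Python sketch of both measures, assuming that table, could look like this:

    # A minimal sketch (not from the slides): support and confidence computed
    # over the five-transaction market-basket table assumed above.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        # sigma(itemset): number of transactions containing every item
        return sum(1 for t in transactions if set(itemset) <= t)

    def support(X, Y, transactions):
        # s(X -> Y) = sigma(X u Y) / N
        return support_count(set(X) | set(Y), transactions) / len(transactions)

    def confidence(X, Y, transactions):
        # c(X -> Y) = sigma(X u Y) / sigma(X)
        return support_count(set(X) | set(Y), transactions) / support_count(X, transactions)

    print(support({"Milk", "Diaper"}, {"Beer"}, transactions))     # 0.4
    print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...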

9
Itemset generation - Apriori principle
10
Association Rule Mining Task
  • Given a set of transactions T, the goal of
    association rule mining is to find all rules
    having
  • support ≥ minsup threshold
  • confidence ≥ minconf threshold
  • Brute-force approach
  • List all possible association rules
  • Compute the support and confidence for each rule
  • Prune rules that fail the minsup and minconf
    thresholds
  • ⇒ Computationally prohibitive!
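To make the cost concrete, here is a minimal brute-force sketch (not the slides' code), reusing the transactions, support, and confidence helpers from the earlier sketch:

    # Brute force: enumerate every rule X -> Y over every itemset, then prune.
    from itertools import combinations

    def brute_force_rules(transactions, minsup, minconf):
        items = sorted(set().union(*transactions))
        rules = []
        for k in range(2, len(items) + 1):
            for Z in combinations(items, k):            # every candidate itemset
                Z = frozenset(Z)
                for r in range(1, len(Z)):
                    for X in combinations(Z, r):        # every binary partition of Z
                        X = frozenset(X)
                        Y = Z - X
                        s = support(X, Y, transactions)
                        # confidence is only computed when s >= minsup > 0
                        if s >= minsup and confidence(X, Y, transactions) >= minconf:
                            rules.append((X, Y, s))
        return rules   # exponential in the number of items

    print(len(brute_force_rules(transactions, minsup=0.4, minconf=0.6)))

With d items this enumerates on the order of 3^d rules, which is why the pruning strategies on the following slides matter.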

11
Mining Association Rules
Example of Rules:
  {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper} (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer} (s=0.4, c=0.5)
  • Observations
  • All the above rules are binary partitions of the
    same itemset Milk, Diaper, Beer
  • Rules originating from the same itemset have
    identical support but can have different
    confidence
  • Thus, we may decouple the support and confidence
    requirements

12
Mining Association Rules
  • Two-step approach
  • Frequent Itemset Generation
  • Generate all itemsets whose support ≥ minsup
  • Rule Generation
  • Generate high confidence rules from each frequent
    itemset, where each rule is a binary partitioning
    of a frequent itemset
  • Frequent itemset generation is still
    computationally expensive

13
Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
14
Frequent Itemset Generation
  • Brute-force approach
  • Each itemset in the lattice is a candidate
    frequent itemset
  • Count the support of each candidate by scanning
    the database
  • Match each transaction against every candidate
  • Complexity ~ O(NMw) ⇒ Expensive since M = 2^d !!!

15
Frequent Itemset Generation Strategies
  • Reduce the number of candidates (M)
  • Complete search: M = 2^d
  • Use pruning techniques to reduce M
  • Reduce the number of transactions (N)
  • Reduce size of N as the size of itemset increases
  • Used by DHP and vertical-based mining algorithms
  • Reduce the number of comparisons (NM)
  • Use efficient data structures to store the
    candidates or transactions
  • No need to match every candidate against every
    transaction

16
Reducing Number of Candidates
  • Apriori principle
  • If an itemset is frequent, then all of its
    subsets must also be frequent
  • Apriori principle holds due to the following
    property of the support measure
  • Support of an itemset never exceeds the support
    of its subsets
  • This is known as the anti-monotone property of
    support

17
Illustrating Apriori Principle
(Itemset lattice figure; label: "Found to be Frequent")
18
Illustrating Apriori Principle
19
Apriori Algorithm
  • Method
  • Let k = 1
  • Generate frequent itemsets of length 1
  • Repeat until no new frequent itemsets are
    identified
  • Generate length (k+1) candidate itemsets from
    length k frequent itemsets
  • Prune candidate itemsets containing subsets of
    length k that are infrequent
  • Count the support of each candidate by scanning
    the DB
  • Eliminate candidates that are infrequent, leaving
    only those that are frequent
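A minimal sketch of this level-wise loop (a simplification for illustration, not the slides' or SAS Enterprise Miner's implementation), reusing transactions and support_count from the earlier sketch:

    from itertools import combinations

    def apriori_frequent_itemsets(transactions, minsup):
        N = len(transactions)
        items = sorted(set().union(*transactions))
        # k = 1: frequent 1-itemsets
        frequent = [{frozenset([i]) for i in items
                     if support_count({i}, transactions) / N >= minsup}]
        while frequent[-1]:
            prev = frequent[-1]
            # join: union pairs of frequent k-itemsets that differ in one item
            candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
            # prune: drop candidates with an infrequent k-subset (Apriori principle)
            candidates = {c for c in candidates
                          if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
            # count support by scanning the DB, keep the frequent candidates
            frequent.append({c for c in candidates
                             if support_count(c, transactions) / N >= minsup})
        return [level for level in frequent if level]

    for level in apriori_frequent_itemsets(transactions, minsup=0.6):
        print(sorted(tuple(sorted(f)) for f in level))

The prune step is exactly the Apriori principle from the earlier slide: a candidate with an infrequent subset cannot itself be frequent.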

20
Association rule discovery and generation
21
Reducing Number of Comparisons
  • Candidate counting
  • Scan the database of transactions to determine
    the support of each candidate itemset
  • To reduce the number of comparisons, store the
    candidates in a hash structure
  • Instead of matching each transaction against
    every candidate, match it against candidates
    contained in the hashed buckets
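The slides refer to a hash tree; as a simplified stand-in (an assumption, not the slides' data structure), a hash map keyed by candidate itemset gives the same effect of matching a transaction only against candidates in the buckets it hashes to:

    from itertools import combinations

    def count_candidates(candidates, transactions, k):
        # hashed buckets: one counter per candidate k-itemset
        counts = {frozenset(c): 0 for c in candidates}
        for t in transactions:
            # enumerate the transaction's k-subsets and look each one up
            for subset in combinations(sorted(t), k):
                key = frozenset(subset)
                if key in counts:          # only candidates in matching buckets are touched
                    counts[key] += 1
        return counts

    print(count_candidates([{"Milk", "Diaper", "Beer"}, {"Bread", "Milk", "Diaper"}],
                           transactions, k=3))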

22
Factors Affecting Complexity
  • Choice of minimum support threshold
  • lowering support threshold results in more
    frequent itemsets
  • this may increase number of candidates and max
    length of frequent itemsets
  • Dimensionality (number of items) of the data set
  • more space is needed to store support count of
    each item
  • if number of frequent items also increases, both
    computation and I/O costs may also increase
  • Size of database
  • since Apriori makes multiple passes, run time of
    algorithm may increase with number of
    transactions
  • Average transaction width
  • transaction width increases with denser data
    sets
  • This may increase max length of frequent itemsets
    and traversals of hash tree (number of subsets in
    a transaction increases with its width)

23
Rule Generation
  • Given a frequent itemset L, find all non-empty
    subsets f ⊂ L such that f → L - f satisfies the
    minimum confidence requirement
  • If {A,B,C,D} is a frequent itemset, candidate rules:
  • ABC → D, ABD → C, ACD → B, BCD → A, A → BCD,
    B → ACD, C → ABD, D → ABC, AB → CD, AC → BD,
    AD → BC, BC → AD, BD → AC, CD → AB
  • If |L| = k, then there are 2^k - 2 candidate
    association rules (ignoring L → ∅ and ∅ → L)
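A minimal sketch (not the slides' code) that enumerates the 2^k - 2 candidate rules of one frequent itemset and keeps those meeting minconf, reusing support_count and transactions from the earlier sketches:

    from itertools import combinations

    def rules_from_itemset(L, transactions, minconf):
        L = frozenset(L)
        rules = []
        for r in range(1, len(L)):                 # every non-empty proper subset f of L
            for f in combinations(L, r):
                f = frozenset(f)
                # confidence of f -> L - f is sigma(L) / sigma(f)
                c = support_count(L, transactions) / support_count(f, transactions)
                if c >= minconf:
                    rules.append((f, L - f, c))
        return rules

    for X, Y, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, minconf=0.6):
        print(sorted(X), "->", sorted(Y), round(c, 2))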

24
Rule Generation
  • How to efficiently generate rules from frequent
    itemsets?
  • In general, confidence does not have an
    anti-monotone property
  • c(ABC → D) can be larger or smaller than c(AB → D)
  • But confidence of rules generated from the same
    itemset has an anti-monotone property
  • e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
  • Confidence is anti-monotone w.r.t. the number of
    items on the RHS of the rule

25
Rule Generation for Apriori Algorithm
(Figure: lattice of rules; the rules below a low-confidence rule are pruned)
26
Rule Generation for Apriori Algorithm
  • Candidate rule is generated by merging two rules
    that share the same prefix in the rule consequent
  • join(CD → AB, BD → AC) would produce the candidate
    rule D → ABC
  • Prune rule D → ABC if its subset AD → BC does not
    have high confidence
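As a rough illustration of the same idea (a simplification, not the slides' join/prune routine), consequents can be grown level-wise so that a low-confidence rule blocks all rules with larger consequents built from it:

    def apriori_rules(L, transactions, minconf):
        L = frozenset(L)
        rules = []
        consequents = [frozenset([i]) for i in L]      # start with 1-item consequents
        while consequents:
            survivors = []
            for Y in consequents:
                X = L - Y
                if not X:
                    continue
                c = support_count(L, transactions) / support_count(X, transactions)
                if c >= minconf:
                    rules.append((X, Y, c))
                    survivors.append(Y)                # only high-confidence consequents grow
            # join step: merge surviving consequents that differ in one item
            consequents = list({a | b for a in survivors for b in survivors
                                if len(a | b) == len(a) + 1})
        return rules

    for X, Y, c in apriori_rules({"Milk", "Diaper", "Beer"}, transactions, minconf=0.6):
        print(sorted(X), "->", sorted(Y), round(c, 2))

For {Milk, Diaper, Beer} with minconf = 0.6 this reproduces the four high-confidence rules listed on slide 11.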

27
Demonstration
  • A bank wants to examine its customer base and
    understand which of its products individual
    customers own in combination with one another. It
    has chosen to conduct a market-basket analysis of
    a sample of its customer base. The bank has a
    data set that lists the banking products/services
    used by 7,991 customers.
  • Data set BANK
  • Variables
  • ACCT ID, Nominal, Account Number
  • SERVICE Target, Nominal, Type of Service
  • VISIT Sequence, Ordinal, Order of Product
    Purchase

28
Evaluation of association patterns
29
Contingency Table
30
Statistical Independence
  • Population of 1000 students
  • 600 students know how to swim (S)
  • 700 students know how to bike (B)
  • 420 students know how to swim and bike (S,B)
  • P(S∧B) = 420/1000 = 0.42
  • P(S) × P(B) = 0.6 × 0.7 = 0.42
  • P(S∧B) = P(S) × P(B) ⇒ Statistical independence
  • P(S∧B) > P(S) × P(B) ⇒ Positively correlated
  • P(S∧B) < P(S) × P(B) ⇒ Negatively correlated

31
Statistical-based Measures
  • Measures that take into account statistical dependence
  • E.g. Lift = c(X → Y) / s(Y), Interest = s(X, Y) / (s(X) × s(Y))

32
Example: Lift/Interest Contingency Table

            Coffee   No Coffee   Total
  Tea          15          5        20
  No Tea       75          5        80
  Total        90         10       100

  • Association Rule: Tea → Coffee
  • Confidence = P(Coffee | Tea) = 15/20 = 0.75
  • but P(Coffee) = 90/100 = 0.9
  • Lift = 0.75/0.9 = 0.8333 (< 1, therefore negatively associated)
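A one-function sketch (not the slides' code) of the lift computation from this 2x2 table:

    def lift(n_xy, n_x, n_y, n_total):
        # lift = P(Y|X) / P(Y) = confidence(X -> Y) / support(Y)
        return (n_xy / n_x) / (n_y / n_total)

    # counts from the table: 15 Tea-and-Coffee, 20 Tea, 90 Coffee, 100 total
    print(lift(15, 20, 90, 100))   # ~0.833 < 1: negatively associated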

33
Compared to Confusion Matrix

              Computed Yes   Computed No   Total
  Actual Yes       15              5         20
  Actual No        75              5         80
  Total            90             10        100

In classification, we are interested in P(Actual Yes | Computed Yes),
i.e. P(Row | Column). In association analysis, we are interested in
P(Column | Row).
34
Sequential pattern analysis
35
Examples of Sequence Data
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C
(Figure: a data sequence shown as a timeline of elements (transactions), e.g. < (E1 E2) (E1 E3) (E2) (E3 E4) (E2) >, where each element is a set of events (items))
36
Examples of Sequence
  • Web sequence:
  • < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera}
    {Shopping Cart} {Order Confirmation} {Return to Shopping} >
  • Sequence of initiating events causing the nuclear accident at
    Three Mile Island (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm):
  • < {clogged resin} {outlet valve closure} {loss of feedwater}
    {condenser polisher outlet valve shut} {booster pumps trip}
    {main waterpump trips} {main turbine trips} {reactor pressure increases} >
  • Sequence of books checked out at a library:
  • < {Fellowship of the Ring} {The Two Towers} {Return of the King} >

37
Sequential Pattern Mining: Definition
  • Given
  • a database of sequences
  • a user-specified minimum support threshold,
    minsup
  • Task
  • Find all subsequences with support ≥ minsup

38
Sequential Pattern Mining: Example
Minsup = 50%
Examples of Frequent Subsequences:
  < {1,2} >          s = 60%
  < {2,3} >          s = 60%
  < {2,4} >          s = 80%
  < {3} {5} >        s = 80%
  < {1} {2} >        s = 80%
  < {2} {2} >        s = 60%
  < {1} {2,3} >      s = 60%
  < {2} {2,3} >      s = 60%
  < {1,2} {2,3} >    s = 60%
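A minimal sketch (not the slides' code) of what support of a subsequence means here: a data sequence contains a subsequence if the subsequence's elements match, in order, distinct elements of the data sequence. The small database below is hypothetical, not the one pictured on the slide.

    def contains(data_seq, sub_seq):
        # True if each element of sub_seq is contained, in order, in a distinct
        # element of data_seq (the standard subsequence definition)
        i = 0
        for element in data_seq:
            if i < len(sub_seq) and sub_seq[i] <= element:
                i += 1
        return i == len(sub_seq)

    def seq_support(database, sub_seq):
        return sum(contains(s, sub_seq) for s in database) / len(database)

    # hypothetical sequence database: each data sequence is a list of event sets
    database = [
        [{1, 2}, {3}, {2, 3}],
        [{1}, {2, 3}, {5}],
        [{1, 2}, {2, 3}, {4}],
        [{2}, {3, 5}],
    ]
    print(seq_support(database, [{1}, {2, 3}]))   # 0.75: <{1} {2,3}> appears in 3 of 4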