1
Association Rule Mining: Apriori Algorithm
  • CIT365: Data Mining & Data Warehousing
  • Bajuna Salehe
  • The Institute of Finance Management, Computing
    and IT Dept.

2
Brief About Association Rule Mining
  • The results of Market Basket Analysis allowed
    companies to understand purchasing behaviour more
    fully and, as a result, to target market audiences
    better.
  • Association mining is user-centric as the
    objective is the elicitation of useful (or
    interesting) rules from which new knowledge can
    be derived.

3
Brief About Association Rule Mining
  • Association mining has been applied to many
    different domains, including market basket and
    risk analysis in commercial environments,
    epidemiology, clinical medicine, fluid dynamics,
    astrophysics, crime prevention, and
    counter-terrorism: all areas in which the
    relationships between objects can provide useful
    knowledge.

4
Example of Association Rule
  • For example, an insurance company that finds a
    strong correlation between two policies A and B,
    of the form A → B, indicating that customers who
    held policy A were also likely to hold policy B,
    could market policy B more efficiently by
    targeting those clients who held policy A but
    not B.

5
Brief About Association Rule Mining
  • Association mining analysis is a two-part
    process.
  • First, the identification of sets of items, or
    itemsets, within the dataset.
  • Second, the subsequent derivation of inferences
    from these itemsets.

6
Why Use Support and Confidence?
  • Support reflects the statistical significance of
    a rule. Rules that have very low support are
    rarely observed and are therefore more likely to
    occur by chance. For example, the rule A → B may
    not be significant if the two items appear
    together in only one transaction of the table
    from last week's lecture.

7
Why Use Support and Confidence?
  • Additionally, low support rules may not be
    actionable from a marketing perspective because
    it is not profitable to promote items that are
    seldom bought together by customers.
  • For these reasons, support is often used as a
    filter to eliminate uninteresting rules.

8
Why Use Support and Confidence?
  • Confidence is another useful metric because it
    measures how reliable the inference made by a
    rule is.
  • For a given rule A → B, the higher the
    confidence, the more likely it is for itemset B
    to be present in transactions that contain A. In
    a sense, confidence provides an estimate of the
    conditional probability of B given A (both
    measures are written out below).
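For reference, the two measures can be written out as follows. This is the standard formulation rather than a slide from the deck; σ(X) denotes the number of transactions containing itemset X and N the total number of transactions:

    \mathrm{support}(A \Rightarrow B) = \frac{\sigma(A \cup B)}{N},
    \qquad
    \mathrm{confidence}(A \Rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A)}

The confidence expression is exactly the estimate of the conditional probability of B given A mentioned above.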

9
Causality and Association Rules
  • Finally, it is worth noting that the inference
    made by an association rule does not necessarily
    imply causality.
  • Instead, the implication indicates a strong
    co-occurrence relationship between items in the
    antecedent and consequent of the rule.

10
Causality and Association Rules
  • Causality, on the other hand, requires a
    distinction between the causal and effect
    attributes of the data and typically involves
    relationships occurring over time (e.g., ozone
    depletion leads to global warming).

11
More About Support and Confidence
  • The support of the following candidate rules is
    identical, since they all correspond to the same
    itemset, {Bread, Cheese, Milk}:
  • {Bread, Cheese} → {Milk}
  • {Bread, Milk} → {Cheese}
  • {Cheese, Milk} → {Bread}
  • {Bread} → {Cheese, Milk}
  • {Milk} → {Bread, Cheese}
  • {Cheese} → {Bread, Milk}
  • If the itemset is infrequent, then all six
    candidate rules can be immediately pruned without
    having to compute their confidence values.

12
More About Support and Confidence
  • Therefore, a common strategy adopted by many
    association rule mining algorithms is to
    decompose the problem into two major subtasks:
  • Frequent itemset generation: find all itemsets
    that satisfy the minsup threshold. These itemsets
    are called frequent itemsets.
  • Rule generation: extract high-confidence
    association rules from the frequent itemsets
    found in the previous step. These rules are
    called strong rules.

13
Frequent Itemset Generation
  • A lattice structure can be used to enumerate the
    list of possible itemsets.
  • For example, the figure below illustrates all
    itemsets derivable from the set {A, B, C, D}.

14
Frequent Itemset Generation
15
Frequent Itemset Generation
  • In general, a data set that contains d items may
    generate up to 2^d − 1 possible itemsets,
    excluding the empty set (see the identity below).
  • Because d can be very large in many commercial
    databases, frequent itemset generation is an
    exponentially expensive task.
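The count comes from summing the number of possible k-itemsets over every size k, a standard binomial identity added here for completeness rather than taken from the slide:

    \sum_{k=1}^{d} \binom{d}{k} = 2^{d} - 1

For the four-item set {A, B, C, D} this gives 2^4 − 1 = 15 itemsets in the lattice.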

16
Frequent Itemset Generation
  • A naive approach for finding frequent itemsets is
    to determine the support count for every
    candidate itemset in the lattice structure.
  • To do this, we need to match each candidate
    against every transaction, as the sketch below
    illustrates.
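The following is a minimal Python sketch of that brute-force approach; the function and variable names are illustrative, not from the slides:

    from itertools import combinations

    def naive_support_counts(transactions, k):
        """Count the support of every candidate k-itemset by scanning all transactions."""
        items = sorted({item for t in transactions for item in t})
        counts = {}
        for candidate in combinations(items, k):   # every k-itemset in the lattice
            cset = set(candidate)
            # match the candidate against every transaction
            counts[candidate] = sum(1 for t in transactions if cset <= set(t))
        return counts

    transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(naive_support_counts(transactions, 2))

Counting every size k this way touches every transaction for every candidate, which is what makes the naive approach so expensive.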

17
Apriori Algorithm
  • Apriori belongs to the family of candidate
    generation algorithms, which work by generating
    and testing candidate itemsets.
  • The data structures commonly used in the Apriori
    algorithm are trees.
  • Two common types of tree data structure used in
    Apriori are:
  • Enumeration Set Tree
  • Prefix Tree

18
Data Structure for Apriori Algorithm
19
Apriori Algorithm
  • Frequent itemsets (also called large itemsets)
    are those itemsets whose support is at least
    minSupp (the minimum support threshold).
  • The apriori property (downward closure property)
    says that all subsets of a frequent itemset are
    also frequent.
  • The use of support for pruning candidate itemsets
    is guided by the following principle (the Apriori
    principle).
  • If an itemset is frequent, then all of its
    subsets must also be frequent; equivalently, if
    an itemset is infrequent, none of its supersets
    can be frequent (formalised below).
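Formally, the principle follows from the anti-monotonicity of the support count, stated here in standard notation rather than quoted from the slide:

    X \subseteq Y \;\Longrightarrow\; \sigma(X) \ge \sigma(Y)

so every subset of a frequent itemset is at least as frequent as the itemset itself, and, contrapositively, no superset of an infrequent itemset can be frequent.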

20
Reminder Steps of Association Rule Mining
  • The major steps in association rule mining are:
  • Frequent itemset generation
  • Rule derivation

21
Apriori Algorithm
  • Any subset of a frequent itemset must be
    frequent.
  • If {beer, nappy, nuts} is frequent, so is
    {beer, nappy}.
  • Every transaction containing {beer, nappy, nuts}
    also contains {beer, nappy}.
  • Apriori pruning principle: if an itemset is
    infrequent, its supersets should not be
    generated or tested!

22
Apriori Algorithm
  • The Apriori algorithm uses the downward closure
    property to prune unnecessary branches from
    further consideration. It needs two parameters,
    minSupp and minConf: minSupp is used for
    generating frequent itemsets and minConf is used
    for rule derivation.

23
The Apriori Algorithm: An Example
  • Database TDB (minimum support count = 2):
      Tid   Items
      10    A, C, D
      20    B, C, E
      30    A, B, C, E
      40    B, E
  • 1st scan, C1 (candidate 1-itemsets with support counts):
      {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
  • L1 (frequent 1-itemsets; {D} is pruned):
      {A}: 2, {B}: 3, {C}: 3, {E}: 3
  • C2 (candidates generated from L1):
      {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
  • 2nd scan, C2 with support counts:
      {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2
  • L2 (frequent 2-itemsets):
      {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2
  • C3 (candidates generated from L2):
      {B,C,E}
  • 3rd scan, L3 (frequent 3-itemsets):
      {B,C,E}: 2
24
Important Details Of The Apriori Algorithm
  • There are two crucial questions in implementing
    the Apriori algorithm
  • How to generate candidates?
  • How to count supports of candidates?

25
Generating Candidates
  • There are two steps to generating candidates:
  • Step 1: self-joining Lk
  • Step 2: pruning
  • Example of candidate generation (a Python sketch
    follows this list):
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining L3 with L3:
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning:
  • acde is removed because ade is not in L3
  • C4 = {abcd}
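A minimal Python sketch of these two steps, with itemsets represented as sorted tuples; the function name apriori_gen and this representation are assumptions of the sketch, not part of the slides:

    from itertools import combinations

    def apriori_gen(frequent_k):
        """Generate candidate (k+1)-itemsets from a set of frequent k-itemsets."""
        k = len(next(iter(frequent_k)))
        candidates = set()
        # Step 1: self-join -- merge itemsets that share their first k-1 items
        for a in frequent_k:
            for b in frequent_k:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    candidates.add(a + (b[-1],))
        # Step 2: prune -- drop candidates that have an infrequent k-subset
        return {c for c in candidates
                if all(sub in frequent_k for sub in combinations(c, k))}

    L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
          ("a", "c", "e"), ("b", "c", "d")}
    print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3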

26
Apriori Algorithm
  k = 1
  Fk = { i | i ∈ I and σ(i)/N ≥ minsup }          // find all frequent 1-itemsets
  repeat
      k = k + 1
      Ck = apriori-gen(F(k-1))                    // generate candidate itemsets
      for each transaction t ∈ T do
          Ct = subset(Ck, t)                      // identify all candidates contained in t
          for each candidate itemset c ∈ Ct do
              σ(c) = σ(c) + 1                     // increment the support count
          end for
      end for
      Fk = { c | c ∈ Ck and σ(c)/N ≥ minsup }     // extract the frequent k-itemsets
  until Fk = ∅
  Result = ⋃k Fk
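The same loop can be written as a short, runnable Python sketch; here minsup is taken as a fraction of the transactions, and the names are illustrative rather than prescribed by the slides:

    from collections import defaultdict
    from itertools import combinations

    def apriori(transactions, minsup):
        """Return every frequent itemset (as a sorted tuple) mapped to its support count."""
        n = len(transactions)
        transactions = [frozenset(t) for t in transactions]

        # Frequent 1-itemsets
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[(item,)] += 1
        frequent = {c: s for c, s in counts.items() if s / n >= minsup}
        result = dict(frequent)

        k = 1
        while frequent:
            k += 1
            prev = set(frequent)
            # Candidate generation: self-join F(k-1) with itself, then prune
            candidates = {a + (b[-1],) for a in prev for b in prev
                          if a[:-1] == b[:-1] and a[-1] < b[-1]}
            candidates = {c for c in candidates
                          if all(sub in prev for sub in combinations(c, k - 1))}
            # Support counting: one pass over the transaction database per level
            counts = defaultdict(int)
            for t in transactions:
                for c in candidates:
                    if t.issuperset(c):
                        counts[c] += 1
            frequent = {c: s for c, s in counts.items() if s / n >= minsup}
            result.update(frequent)
        return result

    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(apriori(tdb, minsup=0.5))   # reproduces L1, L2 and {B, C, E} from the worked example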

27
How to Count Supports Of Candidates?
  • Why is counting the supports of candidates a
    problem?
  • The total number of candidates can be huge
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • The subset function finds all the candidates
    contained in a transaction (a simplified sketch
    follows below).
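A full hash-tree is not reproduced here; the sketch below shows the same subset function in a simplified form that keeps the candidates in an ordinary Python set and enumerates the k-subsets of each transaction (this representation is an assumption of the sketch, not the hash-tree described above):

    from itertools import combinations

    def subset(candidates_k, transaction, k):
        """Return the candidate k-itemsets contained in the given transaction."""
        items = sorted(transaction)
        return [c for c in combinations(items, k) if c in candidates_k]

    C2 = {("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
    print(subset(C2, {"A", "B", "C", "E"}, 2))
    # [('A', 'C'), ('B', 'C'), ('B', 'E'), ('C', 'E')]

The hash-tree serves the same purpose but, during traversal, prunes subsets that cannot reach any stored candidate.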

28
Generating Association Rules
  • Once all frequent itemsets have been found,
    association rules can be generated.
  • Strong association rules from a frequent itemset
    are generated by calculating the confidence of
    each possible rule arising from that itemset and
    testing it against a minimum confidence threshold
    (a sketch follows below).
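A minimal sketch of that test, assuming the support counts come from a frequent-itemset pass such as the apriori sketch above; the function name and the example counts are illustrative:

    from itertools import combinations

    def generate_rules(support_counts, minconf):
        """Yield (antecedent, consequent, confidence) for every rule meeting minconf."""
        for itemset, sup in support_counts.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for antecedent in combinations(itemset, r):
                    confidence = sup / support_counts[antecedent]
                    if confidence >= minconf:
                        consequent = tuple(i for i in itemset if i not in antecedent)
                        yield antecedent, consequent, confidence

    counts = {("B",): 3, ("C",): 3, ("E",): 3, ("B", "C"): 2,
              ("B", "E"): 3, ("C", "E"): 2, ("B", "C", "E"): 2}
    for rule in generate_rules(counts, minconf=0.7):
        print(rule)   # e.g. (('B', 'C'), ('E',), 1.0)

By the apriori property, every antecedent of a frequent itemset is itself frequent, so its support count is guaranteed to be present in support_counts.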

29
Example
TID List of item_IDs
T100 Beer, Crisps, Milk
T200 Crisps, Bread
T300 Crisps, Nappies
T400 Beer, Crisps, Bread
T500 Beer, Nappies
T600 Crisps, Nappies
T700 Beer, Nappies
T800 Beer, Crisps, Nappies, Milk
T900 Beer, Crisps, Nappies
ID Item
I1 Beer
I2 Crisps
I3 Nappies
I4 Bread
I5 Milk
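As a usage illustration, the transactions above can be fed to the apriori and generate_rules sketches from the earlier slides; the thresholds chosen here (a support count of 2 out of 9 and a confidence of 0.7) are assumptions for the illustration, not values given on the slide:

    transactions = [
        {"Beer", "Crisps", "Milk"},             # T100
        {"Crisps", "Bread"},                    # T200
        {"Crisps", "Nappies"},                  # T300
        {"Beer", "Crisps", "Bread"},            # T400
        {"Beer", "Nappies"},                    # T500
        {"Crisps", "Nappies"},                  # T600
        {"Beer", "Nappies"},                    # T700
        {"Beer", "Crisps", "Nappies", "Milk"},  # T800
        {"Beer", "Crisps", "Nappies"},          # T900
    ]
    frequent = apriori(transactions, minsup=2 / 9)              # sketch from slide 26
    strong_rules = list(generate_rules(frequent, minconf=0.7))  # sketch from slide 28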
30
Example
31
Challenges Of Frequent Pattern Mining
  • Challenges
  • Multiple scans of transaction database
  • Huge number of candidates
  • Tedious workload of support counting for
    candidates
  • Improving Apriori: general ideas
  • Reduce passes of transaction database scans
  • Shrink number of candidates
  • Facilitate support counting of candidates

32
Bottleneck Of Frequent-Pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of
    scanning and generates lots of candidates
  • To find the frequent itemset i1, i2, …, i100:
  • Number of scans: 100
  • Number of candidates: 2^100 − 1 ≈ 1.27 × 10^30
  • Bottleneck: candidate generation and test

33
Mining Frequent Patterns Without Candidate
Generation
  • Techniques for mining frequent itemsets which
    avoid candidate generation include
  • FP-growth
  • Grow long patterns from short ones using local
    frequent items
  • ECLAT (Equivalence CLAss Transformation)
    algorithm
  • Uses a data representation in which transactions
    are associated with items, rather than the other
    way around (vertical data format)
  • These methods can be much faster than the Apriori
    algorithm