1
Association Rule Mining: Apriori Algorithm
  • CIT365: Data Mining & Data Warehousing
  • Bajuna Salehe
  • The Institute of Finance Management, Computing
    and IT Dept.

2
Brief About Association Rule Mining
  • The results of Market Basket Analysis allowed
    companies to understand purchasing behaviour more
    fully and, as a result, to target market audiences
    better.
  • Association mining is user-centric as the
    objective is the elicitation of useful (or
    interesting) rules from which new knowledge can
    be derived.

3
Brief About Association Rule Mining
  • Association mining has been applied to many
    different domains, including market basket and
    risk analysis in commercial environments,
    epidemiology, clinical medicine, fluid dynamics,
    astrophysics, crime prevention, and
    counter-terrorism: all areas in which the
    relationships between objects can provide useful
    knowledge.

4
Example of Association Rule
  • For example, an insurance company that finds a
    strong correlation between two policies A and B,
    of the form A → B, indicating that customers who
    held policy A were also likely to hold policy B,
    could market policy B more efficiently by
    targeting those clients who held policy A but
    not B.

5
Brief About Association Rule Mining
  • Association mining analysis is a two-part
    process.
  • First, the identification of sets of items, or
    itemsets, within the dataset.
  • Second, the subsequent derivation of inferences
    from these itemsets.

6
Why Use Support and Confidence?
  • Support reflects the statistical significance of
    a rule. Rules that have very low support are
    rarely observed and are therefore more likely to
    occur by chance. For example, the rule A → B may
    not be significant if the two items appear
    together in only one transaction of the table
    from last week's lecture.

7
Why Use Support and Confidence?
  • Additionally, low support rules may not be
    actionable from a marketing perspective because
    it is not profitable to promote items that are
    seldom bought together by customers.
  • For these reasons, support is often used as a
    filter to eliminate uninteresting rules.

8
Why Use Support and Confidence?
  • Confidence is another useful metric because it
    measures how reliable the inference made by a
    rule is.
  • For a given rule A → B, the higher the
    confidence, the more likely it is for itemset B
    to be present in transactions that contain A. In
    a sense, confidence provides an estimate of the
    conditional probability of B given A (both
    measures are written out below).
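For reference, the two measures can be written out as follows. This is the standard formulation rather than a slide from the deck; σ(X) denotes the number of transactions containing itemset X and N the total number of transactions:

    \mathrm{support}(A \Rightarrow B) = \frac{\sigma(A \cup B)}{N},
    \qquad
    \mathrm{confidence}(A \Rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A)}

The confidence expression is exactly the estimate of the conditional probability of B given A mentioned above.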

9
Causality and Association Rules
  • Finally, it is worth noting that the inference
    made by an association rule does not necessarily
    imply causality.
  • Instead, the implication indicates a strong
    co-occurrence relationship between items in the
    antecedent and consequent of the rule.

10
Causality and Association Rules
  • Causality, on the other hand, requires a
    distinction between the causal and effect
    attributes of the data and typically involves
    relationships occurring over time (e.g., ozone
    depletion leads to global warming).

11
More About Support and Confidence
  • The support of the following candidate rules is
    identical, since they all correspond to the same
    itemset, {Bread, Cheese, Milk}:
  • {Bread, Cheese} → {Milk}
  • {Bread, Milk} → {Cheese}
  • {Cheese, Milk} → {Bread}
  • {Bread} → {Cheese, Milk}
  • {Milk} → {Bread, Cheese}
  • {Cheese} → {Bread, Milk}
  • If the itemset is infrequent, then all six
    candidate rules can be immediately pruned without
    having to compute their confidence values.

12
More About Support and Confidence
  • Therefore, a common strategy adopted by many
    association rule mining algorithms is to
    decompose the problem into two major subtasks:
  • Frequent itemset generation: find all itemsets
    that satisfy the minsup threshold. These itemsets
    are called frequent itemsets.
  • Rule generation: extract high-confidence
    association rules from the frequent itemsets
    found in the previous step. These rules are
    called strong rules.

13
Frequent Itemset Generation
  • A lattice structure can be used to enumerate the
    list of possible itemsets.
  • For example, the figure below illustrates all
    itemsets derivable from the set {A, B, C, D}.

14
Frequent Itemset Generation
15
Frequent Itemset Generation
  • In general, a data set that contains d items may
    generate up to 2^d − 1 possible itemsets,
    excluding the empty set (see the identity below).
  • Because d can be very large in many commercial
    databases, frequent itemset generation is an
    exponentially expensive task.
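The count comes from summing the number of possible k-itemsets over every size k, a standard binomial identity added here for completeness rather than taken from the slide:

    \sum_{k=1}^{d} \binom{d}{k} = 2^{d} - 1

For the four-item set {A, B, C, D} this gives 2^4 − 1 = 15 itemsets in the lattice.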

16
Frequent Itemset Generation
  • A naive approach for finding frequent itemsets is
    to determine the support count for every
    candidate itemset in the lattice structure.
  • To do this, we need to match each candidate
    against every transaction, as the sketch below
    illustrates.
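The following is a minimal Python sketch of that brute-force approach; the function and variable names are illustrative, not from the slides:

    from itertools import combinations

    def naive_support_counts(transactions, k):
        """Count the support of every candidate k-itemset by scanning all transactions."""
        items = sorted({item for t in transactions for item in t})
        counts = {}
        for candidate in combinations(items, k):   # every k-itemset in the lattice
            cset = set(candidate)
            # match the candidate against every transaction
            counts[candidate] = sum(1 for t in transactions if cset <= set(t))
        return counts

    transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(naive_support_counts(transactions, 2))

Counting every size k this way touches every transaction for every candidate, which is what makes the naive approach so expensive.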

17
Apriori Algorithm
  • Apriori belongs to the family of candidate
    generation algorithms, which work by generating
    and testing candidate itemsets.
  • The data structures commonly used in the Apriori
    algorithm are trees.
  • Two common types of tree data structure used in
    Apriori are:
  • Enumeration Set Tree
  • Prefix Tree

18
Data Structure for Apriori Algorithm
19
Apriori Algorithm
  • Frequent itemsets (also called large itemsets)
    are those itemsets whose support is at least
    minSupp (the minimum support threshold).
  • The apriori property (downward closure property)
    says that all subsets of a frequent itemset are
    also frequent.
  • The use of support for pruning candidate itemsets
    is guided by the following principle (the Apriori
    principle).
  • If an itemset is frequent, then all of its
    subsets must also be frequent; equivalently, if
    an itemset is infrequent, none of its supersets
    can be frequent (formalised below).
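Formally, the principle follows from the anti-monotonicity of the support count, stated here in standard notation rather than quoted from the slide:

    X \subseteq Y \;\Longrightarrow\; \sigma(X) \ge \sigma(Y)

so every subset of a frequent itemset is at least as frequent as the itemset itself, and, contrapositively, no superset of an infrequent itemset can be frequent.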

20
Reminder Steps of Association Rule Mining
  • The major steps in association rule mining are:
  • Frequent itemset generation
  • Rule derivation

21
Apriori Algorithm
  • Any subset of a frequent itemset must be
    frequent.
  • If {beer, nappy, nuts} is frequent, so is
    {beer, nappy}.
  • Every transaction containing {beer, nappy, nuts}
    also contains {beer, nappy}.
  • Apriori pruning principle: if an itemset is
    infrequent, its supersets should not be
    generated or tested!

22
Apriori Algorithm
  • The Apriori algorithm uses the downward closure
    property to prune unnecessary branches from
    further consideration. It needs two parameters,
    minSupp and minConf: minSupp is used for
    generating frequent itemsets and minConf is used
    for rule derivation.

23
The Apriori Algorithm: An Example
  • Database TDB (minimum support count = 2):
      Tid   Items
      10    A, C, D
      20    B, C, E
      30    A, B, C, E
      40    B, E
  • 1st scan, C1 (candidate 1-itemsets with support counts):
      {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
  • L1 (frequent 1-itemsets; {D} is pruned):
      {A}: 2, {B}: 3, {C}: 3, {E}: 3
  • C2 (candidates generated from L1):
      {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
  • 2nd scan, C2 with support counts:
      {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2
  • L2 (frequent 2-itemsets):
      {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2
  • C3 (candidates generated from L2):
      {B,C,E}
  • 3rd scan, L3 (frequent 3-itemsets):
      {B,C,E}: 2
24
Important Details Of The Apriori Algorithm
  • There are two crucial questions in implementing
    the Apriori algorithm
  • How to generate candidates?
  • How to count supports of candidates?

25
Generating Candidates
  • There are two steps to generating candidates:
  • Step 1: self-joining Lk
  • Step 2: pruning
  • Example of candidate generation (a Python sketch
    follows this list):
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining L3 with L3:
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning:
  • acde is removed because ade is not in L3
  • C4 = {abcd}
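A minimal Python sketch of these two steps, with itemsets represented as sorted tuples; the function name apriori_gen and this representation are assumptions of the sketch, not part of the slides:

    from itertools import combinations

    def apriori_gen(frequent_k):
        """Generate candidate (k+1)-itemsets from a set of frequent k-itemsets."""
        k = len(next(iter(frequent_k)))
        candidates = set()
        # Step 1: self-join -- merge itemsets that share their first k-1 items
        for a in frequent_k:
            for b in frequent_k:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    candidates.add(a + (b[-1],))
        # Step 2: prune -- drop candidates that have an infrequent k-subset
        return {c for c in candidates
                if all(sub in frequent_k for sub in combinations(c, k))}

    L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
          ("a", "c", "e"), ("b", "c", "d")}
    print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3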

26
Apriori Algorithm
  k = 1
  Fk = { i | i ∈ I and σ(i)/N ≥ minsup }          // find all frequent 1-itemsets
  repeat
      k = k + 1
      Ck = apriori-gen(F(k-1))                    // generate candidate itemsets
      for each transaction t ∈ T do
          Ct = subset(Ck, t)                      // identify all candidates contained in t
          for each candidate itemset c ∈ Ct do
              σ(c) = σ(c) + 1                     // increment the support count
          end for
      end for
      Fk = { c | c ∈ Ck and σ(c)/N ≥ minsup }     // extract the frequent k-itemsets
  until Fk = ∅
  Result = ⋃k Fk
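The same loop can be written as a short, runnable Python sketch; here minsup is taken as a fraction of the transactions, and the names are illustrative rather than prescribed by the slides:

    from collections import defaultdict
    from itertools import combinations

    def apriori(transactions, minsup):
        """Return every frequent itemset (as a sorted tuple) mapped to its support count."""
        n = len(transactions)
        transactions = [frozenset(t) for t in transactions]

        # Frequent 1-itemsets
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[(item,)] += 1
        frequent = {c: s for c, s in counts.items() if s / n >= minsup}
        result = dict(frequent)

        k = 1
        while frequent:
            k += 1
            prev = set(frequent)
            # Candidate generation: self-join F(k-1) with itself, then prune
            candidates = {a + (b[-1],) for a in prev for b in prev
                          if a[:-1] == b[:-1] and a[-1] < b[-1]}
            candidates = {c for c in candidates
                          if all(sub in prev for sub in combinations(c, k - 1))}
            # Support counting: one pass over the transaction database per level
            counts = defaultdict(int)
            for t in transactions:
                for c in candidates:
                    if t.issuperset(c):
                        counts[c] += 1
            frequent = {c: s for c, s in counts.items() if s / n >= minsup}
            result.update(frequent)
        return result

    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(apriori(tdb, minsup=0.5))   # reproduces L1, L2 and {B, C, E} from the worked example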

27
How to Count Supports Of Candidates?
  • Why is counting the supports of candidates a
    problem?
  • The total number of candidates can be huge
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • The subset function finds all the candidates
    contained in a transaction (a simplified sketch
    follows below).
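A full hash-tree is not reproduced here; the sketch below shows the same subset function in a simplified form that keeps the candidates in an ordinary Python set and enumerates the k-subsets of each transaction (this representation is an assumption of the sketch, not the hash-tree described above):

    from itertools import combinations

    def subset(candidates_k, transaction, k):
        """Return the candidate k-itemsets contained in the given transaction."""
        items = sorted(transaction)
        return [c for c in combinations(items, k) if c in candidates_k]

    C2 = {("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
    print(subset(C2, {"A", "B", "C", "E"}, 2))
    # [('A', 'C'), ('B', 'C'), ('B', 'E'), ('C', 'E')]

The hash-tree serves the same purpose but, during traversal, prunes subsets that cannot reach any stored candidate.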

28
Generating Association Rules
  • Once all frequent itemsets have been found,
    association rules can be generated.
  • Strong association rules from a frequent itemset
    are generated by calculating the confidence of
    each possible rule arising from that itemset and
    testing it against a minimum confidence threshold
    (a sketch follows below).
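A minimal sketch of that test, assuming the support counts come from a frequent-itemset pass such as the apriori sketch above; the function name and the example counts are illustrative:

    from itertools import combinations

    def generate_rules(support_counts, minconf):
        """Yield (antecedent, consequent, confidence) for every rule meeting minconf."""
        for itemset, sup in support_counts.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for antecedent in combinations(itemset, r):
                    confidence = sup / support_counts[antecedent]
                    if confidence >= minconf:
                        consequent = tuple(i for i in itemset if i not in antecedent)
                        yield antecedent, consequent, confidence

    counts = {("B",): 3, ("C",): 3, ("E",): 3, ("B", "C"): 2,
              ("B", "E"): 3, ("C", "E"): 2, ("B", "C", "E"): 2}
    for rule in generate_rules(counts, minconf=0.7):
        print(rule)   # e.g. (('B', 'C'), ('E',), 1.0)

By the apriori property, every antecedent of a frequent itemset is itself frequent, so its support count is guaranteed to be present in support_counts.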

29
Example
TID List of item_IDs
T100 Beer, Crisps, Milk
T200 Crisps, Bread
T300 Crisps, Nappies
T400 Beer, Crisps, Bread
T500 Beer, Nappies
T600 Crisps, Nappies
T700 Beer, Nappies
T800 Beer, Crisps, Nappies, Milk
T900 Beer, Crisps, Nappies
ID Item
I1 Beer
I2 Crisps
I3 Nappies
I4 Bread
I5 Milk
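As a usage illustration, the transactions above can be fed to the apriori and generate_rules sketches from the earlier slides; the thresholds chosen here (a support count of 2 out of 9 and a confidence of 0.7) are assumptions for the illustration, not values given on the slide:

    transactions = [
        {"Beer", "Crisps", "Milk"},             # T100
        {"Crisps", "Bread"},                    # T200
        {"Crisps", "Nappies"},                  # T300
        {"Beer", "Crisps", "Bread"},            # T400
        {"Beer", "Nappies"},                    # T500
        {"Crisps", "Nappies"},                  # T600
        {"Beer", "Nappies"},                    # T700
        {"Beer", "Crisps", "Nappies", "Milk"},  # T800
        {"Beer", "Crisps", "Nappies"},          # T900
    ]
    frequent = apriori(transactions, minsup=2 / 9)              # sketch from slide 26
    strong_rules = list(generate_rules(frequent, minconf=0.7))  # sketch from slide 28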
30
Example
31
Challenges Of Frequent Pattern Mining
  • Challenges
  • Multiple scans of transaction database
  • Huge number of candidates
  • Tedious workload of support counting for
    candidates
  • Improving Apriori: general ideas
  • Reduce passes of transaction database scans
  • Shrink number of candidates
  • Facilitate support counting of candidates

32
Bottleneck Of Frequent-Pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of
    scanning and generates lots of candidates
  • To find the frequent itemset i1, i2, …, i100:
  • Number of scans: 100
  • Number of candidates: 2^100 − 1 ≈ 1.27 × 10^30
  • Bottleneck: candidate generation and test

33
Mining Frequent Patterns Without Candidate
Generation
  • Techniques for mining frequent itemsets which
    avoid candidate generation include
  • FP-growth
  • Grow long patterns from short ones using local
    frequent items
  • ECLAT (Equivalence CLAss Transformation)
    algorithm
  • Uses a data representation in which transactions
    are associated with items, rather than the other
    way around (vertical data format)
  • These methods can be much faster than the Apriori
    algorithm