Title: Chapter 5: Mining Frequent Patterns, Associations, and Correlations
1. Chapter 5: Mining Frequent Patterns, Associations, and Correlations
- Per Kristian Helland
- Joacim Christiansen
- Øystein Rose
2. Plan
- Introduction
- Definitions
- Road Map
- Finding Frequent Itemsets (Apriori)
- Frequent Itemsets → Association Rules
- Improving Apriori
3. Introduction
- Early 1990s: an executive at Marks & Spencer explained his database to Agrawal
- Agrawal began devising algorithms for asking open-ended queries, and published "Mining Association Rules between Sets of Items in Large Databases" in 1993 (991 citations)
- The results are represented in the form of association rules, e.g. computer ⇒ antivirus_software [support = 2%, confidence = 60%]
4. Introduction (2)
- Makes previously unknown information available
- Wal-Mart: diapers and beer on Fridays (strategic marketing)
- Medical information: how patients react to medicine (social, economic, and domain knowledge)
5. Definitions: Overview
- We use three levels of abstraction
- Patterns
- Itemsets
- Association rules
6. Definitions: Patterns
- Patterns can be
- Itemsets (x and y)
- Sequential (x before y)
- Temporal (x 3 hours before y)
- Structured (x is a sub-tree)
- Etc.
- Frequent patterns are those that appear in a data set frequently ("frequently" is a threshold given by a user or an expert)
- Historical remark: Agrawal (1993) used the term "large itemset" to describe itemsets that satisfy a specified minimum support threshold
7. Itemset
- Given: a set of items, I = {x1, ..., xn}
- A set of items X ⊆ I is an itemset
- X is a k-itemset where k = |X|
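These definitions translate directly into Python; a minimal illustrative sketch (the variable names are mine, not from the slides), using `frozenset` so that itemsets are hashable:

```python
# Items as strings, itemsets as frozensets (hashable, so they can later
# serve as dictionary keys for support counts).
I = frozenset({"I1", "I2", "I3", "I4", "I5"})   # the set of all items

X = frozenset({"I1", "I2", "I5"})               # an itemset: X is a subset of I
assert X <= I                                   # subset test (X ⊆ I)
k = len(X)                                      # X is a k-itemset with k = |X|
print(k)  # 3
```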
8. Itemset properties
- Proper sub-itemset: every item of X is contained in Y, but at least one item of Y is not in X (X ⊂ Y)
- Proper super-itemset: defined symmetrically
- Closed itemset: the itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same support count as X
- Closed frequent itemset: the set C of closed frequent itemsets for a data set S contains complete information regarding its corresponding frequent itemsets
- Maximal frequent itemset: X is a maximal frequent itemset in S if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in S
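As a sketch of the closed and maximal definitions, the helper below (the name `closed_and_maximal` is my own) classifies frequent itemsets given a table of support counts; it assumes the table contains every itemset occurring in S:

```python
def closed_and_maximal(supports, min_sup):
    """supports: dict mapping frozenset -> support count in S."""
    frequent = {X: c for X, c in supports.items() if c >= min_sup}
    # Closed: no proper super-itemset with the same support count
    closed = {X for X, c in frequent.items()
              if not any(X < Y and c2 == c for Y, c2 in supports.items())}
    # Maximal frequent: no proper super-itemset that is frequent
    maximal = {X for X in frequent
               if not any(X < Y for Y in frequent)}
    return closed, maximal

supports = {frozenset({"a"}): 3, frozenset({"b"}): 3,
            frozenset({"a", "b"}): 3}
closed, maximal = closed_and_maximal(supports, min_sup=2)
print(closed)   # {a} and {b} are not closed: {a, b} has the same count
print(maximal)  # only {a, b} is maximal
```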
9. Road map: market basket analysis is just one form of frequent pattern mining
- Frequent pattern mining techniques can be classified based on:
- Completeness
- Levels of abstraction
- Number of data dimensions
- Types of values
- Kinds of rules mined
- Kinds of patterns mined
10. Road map: market basket analysis is just one form of frequent pattern mining
- Completeness
- Complete set of frequent itemsets, closed frequent itemsets, constrained itemsets, top-k itemsets
- Different applications may have different requirements
- Levels of abstraction
- Multilevel vs. single-level
- buys(X, "computer") ⇒ buys(X, "HP_printer")
- buys(X, "laptop_computer") ⇒ buys(X, "HP_printer")
- Number of data dimensions
- Single-dimensional vs. multidimensional
- buys(X, "computer") ⇒ buys(X, "antivirus_software")
- age(X, "30...39") ∧ income(X, "42K...48K") ⇒ buys(X, "high resolution TV")
11. Road map: market basket analysis is just one form of frequent pattern mining
- Types of values
- Boolean vs. quantitative
- age(X, "30...39") ∧ income(X, "42K...48K") ⇒ buys(X, "high resolution TV")
- Kinds of rules mined
- Association rules
- Correlation rules (association rules refined by further statistical analysis)
- Kinds of patterns mined
- Frequent itemset mining
- Sequential pattern mining (ordering of events)
- Structured pattern mining (any structure; more general)
12. Finding Frequent Itemsets: Apriori
- Apriori (Agrawal and Srikant, 1994) uses prior knowledge of frequent itemset properties
- k-itemsets are used to explore (k+1)-itemsets: L1 → L2, L2 → L3, ..., Lk-1 → Lk
- Apriori property: all nonempty subsets of a frequent itemset must also be frequent. If P(I) < min_sup, then P(I ∪ A) < min_sup
14. Finding Frequent Itemsets (2): Apriori
- Two basic steps
- Join: finding candidates, Ck = Lk-1 ⋈ Lk-1
- li: itemset i in Lk-1
- li[j]: the jth item in li
- Assumed: items within a transaction or itemset are sorted in lexicographic order, so for a (k-1)-itemset, li[1] < li[2] < ... < li[k-1]
- l1 and l2 are joinable if their first (k-2) items are in common: (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1])
- The resulting itemset from joining l1 and l2 is {l1[1], l1[2], ..., l1[k-2], l1[k-1], l2[k-1]}
15. Finding Frequent Itemsets (3): Apriori
- Prune: removing infrequent itemsets
- Note: any (k-1)-subset of a candidate k-itemset that is not in Lk-1 cannot be frequent
- Ck ⊇ Lk
- DB scan: each candidate in Ck is counted; if the support count of li < the minimum support count, then li is removed
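The join and prune steps above can be sketched as a single candidate-generation function (a hedged sketch; the name `apriori_gen` follows Agrawal and Srikant's paper, and itemsets are represented as sorted tuples per the lexicographic-order assumption):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Candidate k-itemsets Ck from the frequent (k-1)-itemsets L_prev.
    Itemsets are sorted tuples (lexicographic order)."""
    prev = set(L_prev)
    candidates = []
    for l1 in L_prev:
        for l2 in L_prev:
            # Join: first k-2 items equal, and l1's last item < l2's last item
            if l1[:k - 2] == l2[:k - 2] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune: every (k-1)-subset of c must itself be frequent
                if all(s in prev for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(L2, 3))  # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

Joining this L2 produces six candidates, but pruning removes the four whose 2-subsets (e.g. {I3, I5}) are not frequent, matching the worked example on the next slides.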
16. Finding Frequent Itemsets (4): Apriori
- Example transactional data:
- {I1, I2, I5}, {I2, I4}, {I2, I3}, {I1, I2, I4}, {I1, I3}, {I2, I3}, {I1, I3}, {I1, I2, I3, I5}, {I1, I2, I3}
17. Finding Frequent Itemsets (5): Apriori
- 1. Each item is a member of the set of candidate 1-itemsets, C1
- 2. Keep the itemsets with (absolute) support ≥ the minimum (absolute) support, giving L1
18. Finding Frequent Itemsets (6): Apriori
- 3. L1 join L1; no candidates are removed (each subset is frequent)
- 4. Find the support count of each candidate in C2
- 5. Keep the itemsets with (absolute) support ≥ the minimum (absolute) support, giving L2
19. Finding Frequent Itemsets (7): Apriori
- 6. L2 join L2 = {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}; all itemsets whose subsets are not frequent are removed
- 7. Find the support count of each candidate in C3
- (absolute) support ≥ the minimum (absolute) support gives L3
- L3 join L3 = {I1, I2, I3, I5}, but its subset {I2, I3, I5} is not frequent → C4 = Ø
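The whole walkthrough can be reproduced end to end with a compact sketch (function and variable names are my own, not from the slides):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return a dict mapping each frequent itemset (sorted tuple) to its
    support count, using the join / prune / DB-scan loop described above."""
    counts = {}
    for t in transactions:                      # C1: count single items
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    L = {c: n for c, n in counts.items() if n >= min_count}   # L1
    frequent = dict(L)
    k = 2
    while L:
        prev = set(L)
        cands = []                              # join + prune -> Ck
        for l1 in sorted(L):
            for l2 in sorted(L):
                if l1[:k - 2] == l2[:k - 2] and l1[-1] < l2[-1]:
                    c = l1 + (l2[-1],)
                    if all(s in prev for s in combinations(c, k - 1)):
                        cands.append(c)
        counts = {c: 0 for c in cands}          # DB scan: count candidates
        for t in transactions:
            ts = set(t)
            for c in cands:
                if ts >= set(c):
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        frequent.update(L)
        k += 1
    return frequent

D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
result = apriori(D, min_count=2)
print(result[("I1", "I2", "I3")])          # 2
print(("I1", "I2", "I3", "I5") in result)  # False: C4 is empty
```

With the slide's nine transactions and a minimum support count of 2 this yields the same L1 (5 itemsets), L2 (6 itemsets), and L3 (2 itemsets) as the worked example.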
20. Association rules
- Given a database D, a multi-set of subsets of the set of items I; we call each T in D a transaction
- An association rule is of the form X ⇒ Y, where X and Y are itemsets and X ∩ Y = Ø
21. Association rules: properties
- The rule support is defined as support(X ⇒ Y) = P(X ∪ Y), the percentage of transactions in D that contain X ∪ Y
- The rule confidence is defined as confidence(X ⇒ Y) = P(Y|X), the percentage of transactions in D containing X that also contain Y
- Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong
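These definitions translate directly into code; a small sketch using the transactional data from the Apriori example (helper names are mine):

```python
def support(D, itemset):
    """Fraction of transactions in D that contain the itemset: P(itemset)."""
    return sum(1 for t in D if set(t) >= set(itemset)) / len(D)

def confidence(D, X, Y):
    """P(Y|X) = support(X ∪ Y) / support(X)."""
    return support(D, set(X) | set(Y)) / support(D, X)

D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
# Rule {I1} => {I5}: support 2/9, confidence (2/9)/(6/9) = 1/3
print(round(support(D, {"I1", "I5"}), 3))       # 0.222
print(round(confidence(D, {"I1"}, {"I5"}), 3))  # 0.333
```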
22. Association rules: how to?
- 1. Find all frequent itemsets
- Apriori
- 2. Generate strong association rules from the frequent itemsets
- Unsupervised
- For each frequent itemset l, generate all nonempty subsets of l
- For every subset s of l, output the rule s ⇒ (l \ s) if support_count(l) / support_count(s) ≥ min_conf
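Step 2 can be sketched as follows (names are mine), assuming the support counts of all frequent itemsets are available, which the Apriori property guarantees, since every nonempty subset of a frequent itemset is itself frequent:

```python
from itertools import combinations

def gen_rules(frequent, min_conf):
    """frequent: dict mapping frozenset -> support count.
    Returns strong rules as (antecedent, consequent, confidence)."""
    rules = []
    for l, count in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):             # all nonempty proper subsets s
            for s in combinations(sorted(l), r):
                s = frozenset(s)
                conf = count / frequent[s]     # support_count(l)/support_count(s)
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

frequent = {frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
            frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
            frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2}
strong = gen_rules(frequent, min_conf=0.7)
print(len(strong))  # 5 strong rules, e.g. {I5} => {I1, I2} with confidence 1.0
```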
23. Association rules: how to? (2)
- Supervised generation of rules
- What is a truly interesting rule? (Chapter 1)
- Easily understood by humans
- Valid on new or test data with some degree of certainty
- Potentially useful
- Novel
- Others: simplicity, generality, actionability, unexpectedness (Mannila et al., 1999)
- A user/expert can guide the discovery process
24. Improving Apriori: reducing the number of database scans
- Variations may be summarized as follows:
- Hash-based technique
- Transaction reduction
- Partitioning
- Sampling
- Dynamic itemset counting
25. Improving Apriori: reducing the number of database scans
- Hash-based technique
- When scanning each transaction in the DB to generate 1-itemsets, also generate all of the 2-itemsets for each transaction and hash them into buckets
- Buckets with a count below the minimum support count can be removed
- This reduces the number of candidate k-itemsets examined
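A minimal sketch of the hash-based idea (the bucket count of 7 is an arbitrary illustrative choice; Python's built-in `hash` of string tuples varies between runs, but the pruning guarantee, that a bucket count is an upper bound on every pair hashed into it, holds regardless):

```python
from itertools import combinations

def hash_bucket_counts(transactions, n_buckets=7):
    """While scanning for 1-itemsets, also hash every 2-itemset of each
    transaction into a bucket and accumulate the bucket totals."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_count):
    """A 2-itemset whose bucket count is below min_count cannot be
    frequent, so it can be removed from C2 without counting it."""
    return buckets[hash(tuple(sorted(pair))) % len(buckets)] >= min_count

D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
buckets = hash_bucket_counts(D)
# A truly frequent pair (e.g. {I1, I2}, count 4) always survives:
assert may_be_frequent(("I1", "I2"), buckets, min_count=4)
```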
26. Improving Apriori: reducing the number of database scans
- Transaction reduction
- Reduces the number of transactions scanned in future iterations
- A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets
- Such a transaction may be removed from subsequent scans
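A sketch of this idea (names are mine), applied to the chapter's example data after L3 has been found:

```python
from itertools import combinations

def reduce_transactions(transactions, L_k, k):
    """Keep only transactions containing at least one frequent k-itemset;
    the rest cannot contribute to any frequent (k+1)-itemset count."""
    return [t for t in transactions
            if any(c in L_k for c in combinations(sorted(t), k))]

D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
L3 = {("I1", "I2", "I3"), ("I1", "I2", "I5")}
kept = reduce_transactions(D, L3, 3)
print(len(kept))  # 3: only the 1st, 8th, and 9th transactions remain
```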
27. Improving Apriori: reducing the number of database scans
- Partitioning (2 scans)
- Divide the transactions into nonoverlapping partitions
- The minimum support count for each partition is min_sup x the number of transactions in that partition
- Scan each partition to find all local frequent itemsets
- Any itemset that is potentially frequent must occur as a frequent itemset in at least one of the partitions
- A second scan determines the global frequent itemsets
- Phase I: divide D into n partitions; find the frequent itemsets local to each partition (1 scan); combine the local frequent itemsets to form candidate itemsets
- Phase II: find the global frequent itemsets among the candidates (1 scan)
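The two phases can be sketched as below; `mine_1itemsets` is a deliberately trivial stand-in for a full local Apriori pass, and all names are mine:

```python
from collections import Counter

def mine_1itemsets(partition, min_count):
    """Trivial local miner: frequent 1-itemsets within one partition."""
    c = Counter(item for t in partition for item in t)
    return {(item,) for item, n in c.items() if n >= min_count}

def partition_mine(transactions, n_parts, min_sup_frac, mine):
    size = -(-len(transactions) // n_parts)          # ceil(len / n_parts)
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]
    # Phase I (1 scan): local frequent itemsets form the global candidates
    candidates = set()
    for p in parts:
        candidates |= mine(p, min_sup_frac * len(p))  # min_sup x |partition|
    # Phase II (1 scan): count the candidates over the whole of D
    min_count = min_sup_frac * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if set(t) >= set(c)) >= min_count}

D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
result = partition_mine(D, 3, 2 / 9, mine_1itemsets)
print(sorted(result))
```

No global frequent itemset can be missed: if it were infrequent in every partition, it would be infrequent overall.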
28. Improving Apriori: reducing the number of database scans
- Sampling (1.5 to 2.5 scans)
- Randomly sample a subset of the transactions in D
- Search for frequent itemsets in this sample
- Lower the support threshold to lessen the possibility of losing global frequent itemsets
- The whole of D is then used to compute the actual frequencies of these candidates
- Use the concept of the negative border to check whether all global frequent itemsets have been found [Toi96]
29. Improving Apriori: reducing the number of database scans
- Dynamic itemset counting
- Partition the database into blocks of size M
- Before scanning a block, update the candidate itemsets
- If all subsets of a candidate itemset are frequent, start counting it
- Requires fewer scans than Apriori
- (Figure legend: itemset states are "finished, frequent"; "finished, non-frequent"; "counting, frequent"; "counting, non-frequent")