Title: Chapter 5: Mining Frequent Patterns, Associations, and Correlations
1. Chapter 5: Mining Frequent Patterns, Associations, and Correlations
- Per Kristian Helland
- Joacim Christiansen
- Øystein Rose
2. Plan
- Introduction
- Definitions
- Road Map
- Finding Frequent Itemsets (Apriori)
- Frequent Itemsets → Association Rules
- Improving Apriori
3. Introduction
- Early 1990s: an executive at Marks & Spencer explained his database to Agrawal
- Agrawal began devising algorithms for asking open-ended queries, and published "Mining Association Rules between Sets of Items in Large Databases" in 1993 (991 citations)
- The results are represented in the form of association rules, e.g. computer ⇒ antivirus_software [support = 2%, confidence = 60%]
4. Introduction (2)
- Makes previously unknown information available
- Wal-Mart: diapers and beer on Fridays (strategic marketing)
- Medical information: how patients react to medicine (social, economic, and domain knowledge)
5. Definitions: Overview
- We use three levels of abstraction
- Patterns
- Itemsets
- Association rules
6. Definitions: Patterns
- Patterns can be
- Itemsets (x and y)
- Sequential (x before y)
- Temporal (x 3 hours before y)
- Structured (x is a sub-tree)
- Etc.
- Frequent patterns are those that appear in a data set frequently ("frequently" is a threshold given by a user or an expert)
- Historical remark: Agrawal (1993) used the term "large itemset" to describe itemsets that satisfy a specified minimum support threshold
7. Itemset
- Given: a set of items, I = {x1, ..., xn}
- A set of items X ⊆ I is an itemset
- X is a k-itemset where k = |X|
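These definitions translate directly into Python; a minimal illustrative sketch (the variable names are mine, not from the slides), using `frozenset` so that itemsets are hashable:

```python
# Items as strings, itemsets as frozensets (hashable, so they can later
# serve as dictionary keys for support counts).
I = frozenset({"I1", "I2", "I3", "I4", "I5"})   # the set of all items

X = frozenset({"I1", "I2", "I5"})               # an itemset: X is a subset of I
assert X <= I                                   # subset test (X ⊆ I)
k = len(X)                                      # X is a k-itemset with k = |X|
print(k)  # 3
```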
8. Itemset properties
- Proper sub-itemset: every item of X is contained in Y, but at least one item of Y is not in X (X ⊂ Y)
- Proper super-itemset: defined symmetrically
- Closed itemset: the itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same support count as X
- Closed frequent itemset: the set C of closed frequent itemsets for a data set S contains complete information regarding its corresponding frequent itemsets
- Maximal frequent itemset: X is a maximal frequent itemset in S if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in S
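As a sketch of the closed and maximal definitions, the helper below (the name `closed_and_maximal` is my own) classifies frequent itemsets given a table of support counts; it assumes the table contains every itemset occurring in S:

```python
def closed_and_maximal(supports, min_sup):
    """supports: dict mapping frozenset -> support count in S."""
    frequent = {X: c for X, c in supports.items() if c >= min_sup}
    # Closed: no proper super-itemset with the same support count
    closed = {X for X, c in frequent.items()
              if not any(X < Y and c2 == c for Y, c2 in supports.items())}
    # Maximal frequent: no proper super-itemset that is frequent
    maximal = {X for X in frequent
               if not any(X < Y for Y in frequent)}
    return closed, maximal

supports = {frozenset({"a"}): 3, frozenset({"b"}): 3,
            frozenset({"a", "b"}): 3}
closed, maximal = closed_and_maximal(supports, min_sup=2)
print(closed)   # {a} and {b} are not closed: {a, b} has the same count
print(maximal)  # only {a, b} is maximal
```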
9. Road map: market basket analysis is just one form of frequent pattern mining
- Frequent pattern mining techniques can be classified based on:
- Completeness
- Levels of abstraction
- Number of data dimensions
- Types of values
- Kinds of rules mined
- Kinds of patterns mined
10. Road map: market basket analysis is just one form of frequent pattern mining
- Completeness
- Complete set of frequent itemsets, closed frequent itemsets, constrained itemsets, top-k itemsets
- Different applications may have different requirements
- Levels of abstraction
- Multilevel vs. single-level
- buys(X, "computer") ⇒ buys(X, "HP_printer")
- buys(X, "laptop_computer") ⇒ buys(X, "HP_printer")
- Number of data dimensions
- Single-dimensional vs. multidimensional
- buys(X, "computer") ⇒ buys(X, "antivirus_software")
- age(X, "30...39") ∧ income(X, "42K...48K") ⇒ buys(X, "high resolution TV")
11. Road map: market basket analysis is just one form of frequent pattern mining
- Types of values
- Boolean vs. quantitative
- age(X, "30...39") ∧ income(X, "42K...48K") ⇒ buys(X, "high resolution TV")
- Kinds of rules mined
- Association rules
- Correlation rules (association rules refined by further statistical analysis)
- Kinds of patterns mined
- Frequent itemset mining
- Sequential pattern mining (ordering of events)
- Structured pattern mining (any structure; more general)
12. Finding Frequent Itemsets: Apriori
- Apriori (Agrawal and Srikant, 1994) uses prior knowledge of frequent itemset properties
- k-itemsets are used to explore (k+1)-itemsets: L1 → L2, L2 → L3, ..., Lk-1 → Lk
- Apriori property: all nonempty subsets of a frequent itemset must also be frequent. If P(I) < min_sup, then P(I ∪ A) < min_sup
14. Finding Frequent Itemsets (2): Apriori
- Two basic steps
- Join: finding candidates, Ck = Lk-1 ⋈ Lk-1
- li: itemset i in Lk-1
- li[j]: the jth item in li
- Assumed: items within a transaction or itemset are sorted in lexicographic order, so for a (k-1)-itemset, li[1] < li[2] < ... < li[k-1]
- l1 and l2 are joinable if their first (k-2) items are in common: (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1])
- The resulting itemset from joining l1 and l2 is {l1[1], l1[2], ..., l1[k-2], l1[k-1], l2[k-1]}
15. Finding Frequent Itemsets (3): Apriori
- Prune: removing infrequent itemsets
- Note: any (k-1)-subset of a candidate k-itemset that is not in Lk-1 cannot be frequent
- Ck ⊇ Lk
- DB scan: each candidate in Ck is counted; if the support count of li < the minimum support count, then li is removed
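The join and prune steps above can be sketched as a single candidate-generation function (a hedged sketch; the name `apriori_gen` follows Agrawal and Srikant's paper, and itemsets are represented as sorted tuples per the lexicographic-order assumption):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Candidate k-itemsets Ck from the frequent (k-1)-itemsets L_prev.
    Itemsets are sorted tuples (lexicographic order)."""
    prev = set(L_prev)
    candidates = []
    for l1 in L_prev:
        for l2 in L_prev:
            # Join: first k-2 items equal, and l1's last item < l2's last item
            if l1[:k - 2] == l2[:k - 2] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune: every (k-1)-subset of c must itself be frequent
                if all(s in prev for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(L2, 3))  # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

Joining this L2 produces six candidates, but pruning removes the four whose 2-subsets (e.g. {I3, I5}) are not frequent, matching the worked example on the next slides.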
16. Finding Frequent Itemsets (4): Apriori
- Example transactional data:
- {I1, I2, I5}, {I2, I4}, {I2, I3}, {I1, I2, I4}, {I1, I3}, {I2, I3}, {I1, I3}, {I1, I2, I3, I5}, {I1, I2, I3}
17. Finding Frequent Itemsets (5): Apriori
- 1. Each item is a member of the set of candidate 1-itemsets, C1
- 2. Keep the itemsets with (absolute) support ≥ the minimum (absolute) support, giving L1
18. Finding Frequent Itemsets (6): Apriori
- 3. L1 join L1; no candidates are removed (each subset is frequent)
- 4. Find the support count of each candidate in C2
- 5. Keep the itemsets with (absolute) support ≥ the minimum (absolute) support, giving L2
19. Finding Frequent Itemsets (7): Apriori
- 6. L2 join L2 = {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}; all itemsets whose subsets are not frequent are removed
- 7. Find the support count of each candidate in C3
- (absolute) support ≥ the minimum (absolute) support gives L3
- L3 join L3 = {I1, I2, I3, I5}, but its subset {I2, I3, I5} is not frequent → C4 = Ø
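The whole walkthrough can be reproduced end to end with a compact sketch (function and variable names are my own, not from the slides):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return a dict mapping each frequent itemset (sorted tuple) to its
    support count, using the join / prune / DB-scan loop described above."""
    counts = {}
    for t in transactions:                      # C1: count single items
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    L = {c: n for c, n in counts.items() if n >= min_count}   # L1
    frequent = dict(L)
    k = 2
    while L:
        prev = set(L)
        cands = []                              # join + prune -> Ck
        for l1 in sorted(L):
            for l2 in sorted(L):
                if l1[:k - 2] == l2[:k - 2] and l1[-1] < l2[-1]:
                    c = l1 + (l2[-1],)
                    if all(s in prev for s in combinations(c, k - 1)):
                        cands.append(c)
        counts = {c: 0 for c in cands}          # DB scan: count candidates
        for t in transactions:
            ts = set(t)
            for c in cands:
                if ts >= set(c):
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        frequent.update(L)
        k += 1
    return frequent

D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
result = apriori(D, min_count=2)
print(result[("I1", "I2", "I3")])          # 2
print(("I1", "I2", "I3", "I5") in result)  # False: C4 is empty
```

With the slide's nine transactions and a minimum support count of 2 this yields the same L1 (5 itemsets), L2 (6 itemsets), and L3 (2 itemsets) as the worked example.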
20. Association rules
- Given a database D, a multi-set of subsets of the set of items I; we call each T in D a transaction
- An association rule is of the form X ⇒ Y, where X and Y are itemsets and X ∩ Y = Ø
21. Association rules: properties
- The rule support is defined as support(X ⇒ Y) = P(X ∪ Y), the percentage of transactions in D that contain X ∪ Y
- The rule confidence is defined as confidence(X ⇒ Y) = P(Y|X), the percentage of transactions in D containing X that also contain Y
- Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong
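These definitions translate directly into code; a small sketch using the transactional data from the Apriori example (helper names are mine):

```python
def support(D, itemset):
    """Fraction of transactions in D that contain the itemset: P(itemset)."""
    return sum(1 for t in D if set(t) >= set(itemset)) / len(D)

def confidence(D, X, Y):
    """P(Y|X) = support(X ∪ Y) / support(X)."""
    return support(D, set(X) | set(Y)) / support(D, X)

D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
# Rule {I1} => {I5}: support 2/9, confidence (2/9)/(6/9) = 1/3
print(round(support(D, {"I1", "I5"}), 3))       # 0.222
print(round(confidence(D, {"I1"}, {"I5"}), 3))  # 0.333
```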
22. Association rules: how to?
- 1. Find all frequent itemsets
- Apriori
- 2. Generate strong association rules from the frequent itemsets
- Unsupervised
- For each frequent itemset l, generate all nonempty subsets of l
- For every subset s of l, output the rule s ⇒ (l \ s) if support_count(l) / support_count(s) ≥ min_conf
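Step 2 can be sketched as follows (names are mine), assuming the support counts of all frequent itemsets are available, which the Apriori property guarantees, since every nonempty subset of a frequent itemset is itself frequent:

```python
from itertools import combinations

def gen_rules(frequent, min_conf):
    """frequent: dict mapping frozenset -> support count.
    Returns strong rules as (antecedent, consequent, confidence)."""
    rules = []
    for l, count in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):             # all nonempty proper subsets s
            for s in combinations(sorted(l), r):
                s = frozenset(s)
                conf = count / frequent[s]     # support_count(l)/support_count(s)
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

frequent = {frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
            frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
            frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2}
strong = gen_rules(frequent, min_conf=0.7)
print(len(strong))  # 5 strong rules, e.g. {I5} => {I1, I2} with confidence 1.0
```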
23. Association rules: how to? (2)
- Supervised generation of rules
- What is a truly interesting rule? (Chapter 1)
- Easily understood by humans
- Valid on new or test data with some degree of certainty
- Potentially useful
- Novel
- Others: simplicity, generality, actionability, unexpectedness (Mannila et al., 1999)
- A user/expert can guide the discovery process
24. Improving Apriori: reducing the number of database scans
- Variations may be summarized as follows:
- Hash-based technique
- Transaction reduction
- Partitioning
- Sampling
- Dynamic itemset counting
25. Improving Apriori: reducing the number of database scans
- Hash-based technique
- When scanning each transaction in the DB to generate 1-itemsets, also generate all of the 2-itemsets for each transaction and hash them into buckets
- Buckets with a count below the minimum support count can be removed
- This reduces the number of candidate k-itemsets examined
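A minimal sketch of the hash-based idea (the bucket count of 7 is an arbitrary illustrative choice; Python's built-in `hash` of string tuples varies between runs, but the pruning guarantee, that a bucket count is an upper bound on every pair hashed into it, holds regardless):

```python
from itertools import combinations

def hash_bucket_counts(transactions, n_buckets=7):
    """While scanning for 1-itemsets, also hash every 2-itemset of each
    transaction into a bucket and accumulate the bucket totals."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_count):
    """A 2-itemset whose bucket count is below min_count cannot be
    frequent, so it can be removed from C2 without counting it."""
    return buckets[hash(tuple(sorted(pair))) % len(buckets)] >= min_count

D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
buckets = hash_bucket_counts(D)
# A truly frequent pair (e.g. {I1, I2}, count 4) always survives:
assert may_be_frequent(("I1", "I2"), buckets, min_count=4)
```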
26. Improving Apriori: reducing the number of database scans
- Transaction reduction
- Reduces the number of transactions scanned in future iterations
- A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets
- Such a transaction may be removed from subsequent scans
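A sketch of this idea (names are mine), applied to the chapter's example data after L3 has been found:

```python
from itertools import combinations

def reduce_transactions(transactions, L_k, k):
    """Keep only transactions containing at least one frequent k-itemset;
    the rest cannot contribute to any frequent (k+1)-itemset count."""
    return [t for t in transactions
            if any(c in L_k for c in combinations(sorted(t), k))]

D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
L3 = {("I1", "I2", "I3"), ("I1", "I2", "I5")}
kept = reduce_transactions(D, L3, 3)
print(len(kept))  # 3: only the 1st, 8th, and 9th transactions remain
```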
27. Improving Apriori: reducing the number of database scans
- Partitioning (2 scans)
- Divide the transactions into nonoverlapping partitions
- The minimum support count for each partition is min_sup x the number of transactions in that partition
- Scan each partition to find all local frequent itemsets
- Any itemset that is potentially frequent must occur as a frequent itemset in at least one of the partitions
- A second scan determines the global frequent itemsets
- Phase I: divide D into n partitions; find the frequent itemsets local to each partition (1 scan); combine the local frequent itemsets to form candidate itemsets
- Phase II: find the global frequent itemsets among the candidates (1 scan)
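The two phases can be sketched as below; `mine_1itemsets` is a deliberately trivial stand-in for a full local Apriori pass, and all names are mine:

```python
from collections import Counter

def mine_1itemsets(partition, min_count):
    """Trivial local miner: frequent 1-itemsets within one partition."""
    c = Counter(item for t in partition for item in t)
    return {(item,) for item, n in c.items() if n >= min_count}

def partition_mine(transactions, n_parts, min_sup_frac, mine):
    size = -(-len(transactions) // n_parts)          # ceil(len / n_parts)
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]
    # Phase I (1 scan): local frequent itemsets form the global candidates
    candidates = set()
    for p in parts:
        candidates |= mine(p, min_sup_frac * len(p))  # min_sup x |partition|
    # Phase II (1 scan): count the candidates over the whole of D
    min_count = min_sup_frac * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if set(t) >= set(c)) >= min_count}

D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
result = partition_mine(D, 3, 2 / 9, mine_1itemsets)
print(sorted(result))
```

No global frequent itemset can be missed: if it were infrequent in every partition, it would be infrequent overall.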
28. Improving Apriori: reducing the number of database scans
- Sampling (1.5 to 2.5 scans)
- Randomly sample a subset of the transactions in D
- Search for frequent itemsets in this sample
- Lower the support threshold to lessen the possibility of losing global frequent itemsets
- The whole of D is then used to compute the actual frequencies of these candidates
- Use the concept of the negative border to check whether all global frequent itemsets have been found [Toi96]
29. Improving Apriori: reducing the number of database scans
- Dynamic itemset counting
- Partition the database into blocks of size M
- Before scanning a block, update the candidate itemsets
- If all subsets of a candidate itemset are frequent, start counting it
- Requires fewer scans than Apriori
- (Figure legend: itemset states are "finished, frequent"; "finished, non-frequent"; "counting, frequent"; "counting, non-frequent")