An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

Collaborators: Adam Kirsch (Harvard), Michael Mitzenmacher (Harvard ...

Slides: 35

Provided by: Eli1165

Learn more at: http://www.eecs.harvard.edu

1
An Efficient Rigorous Approach for Identifying
Statistically Significant Frequent Itemsets
2
Data Mining

Discover hidden patterns, correlations,
association rules, etc., in large data sets
When is the discovery interesting, important,
significant?
We develop a rigorous mathematical/statistical
approach

3
Frequent Itemsets

Dataset D of transactions $t_j$, each a subset of a base
set of items I ($t_j \in 2^I$).
Support of an itemset X = number of transactions
that contain X.

support({Beer, Diaper}) = 3
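A minimal sketch of the support computation in Python. The transactions below are hypothetical (the slide's example table did not survive in this transcript); they are chosen only so that the pair {Beer, Diaper} has support 3.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

# Hypothetical dataset D: each transaction is a set of items.
D = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]

print(support({"Beer", "Diaper"}, D))  # 3
```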
4
Frequent Itemsets

Discover all itemsets with significant support.
Fundamental primitive in data mining applications

support({Beer, Diaper}) = 3
5
Significance

What support level makes an itemset significantly
frequent?
Minimize false positive and false negative
discoveries
Improve quality of subsequent analyses
How to narrow the search to focus only on
significant itemsets?
Reduce the possibly exponential time search

6
Statistical Model

Input:
D = a dataset of t transactions over I, |I| = n
For $i \in I$, let n(i) be the support of i in D.
$f_i = n(i)/t$ = frequency of i in D
H0 model:
$\hat{D}$ = a random dataset of t transactions over I, |I| = n
Item i is included in transaction j with
probability $f_i$, independently of all other events.
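The null model can be sketched in a few lines of Python. This is an illustration, not the authors' code; the function names `item_frequencies` and `sample_null_dataset` are hypothetical.

```python
import random

def item_frequencies(dataset, items):
    """f_i = n(i)/t: the fraction of transactions that contain item i."""
    t = len(dataset)
    return {i: sum(1 for tr in dataset if i in tr) / t for i in items}

def sample_null_dataset(freqs, t, rng):
    """Draw t transactions; item i enters each transaction
    independently with probability freqs[i]."""
    return [{i for i, f in freqs.items() if rng.random() < f}
            for _ in range(t)]

# Usage on a tiny hypothetical dataset.
D = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]
freqs = item_frequencies(D, ["a", "b", "c"])
D_hat = sample_null_dataset(freqs, len(D), random.Random(0))
```

The random dataset keeps the number of transactions and, in expectation, the item frequencies of D, while removing any correlation between items.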

7
Statistical Tests

H0: null hypothesis, the support of no itemset
is significant with respect to the random dataset $\hat{D}$
H1: alternative hypothesis, the support of
itemsets X1, X2, X3, ... is significant. It is
unlikely that their support comes from the
distribution of $\hat{D}$
Significance level:
$\alpha$ = Prob(rejecting H0 when it is true)

8
Naïve Approach

Let $X = \{x_1, x_2, \ldots, x_r\}$,
$f_X = \prod_j f_{x_j}$, the probability that a given itemset is
in a given transaction
$s_X$ = support of X, distributed $s_X \sim B(t, f_X)$
Reject H0 if
p-value = $\mathrm{Prob}(B(t, f_X) \ge s_X) \le \alpha$
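A sketch of the naive test's p-value, computed as the upper tail of a binomial distribution. The function name `binomial_upper_tail` is hypothetical, and the values of t, f_X and the observed support in the usage line are illustrative. Terms are evaluated in log space so that a large t does not overflow.

```python
import math

def binomial_upper_tail(t, p, s):
    """Prob(B(t, p) >= s) for 0 < p < 1, summing the pmf from k = s upward."""
    if s <= 0:
        return 1.0
    total = 0.0
    for k in range(s, t + 1):
        log_term = (math.lgamma(t + 1) - math.lgamma(k + 1)
                    - math.lgamma(t - k + 1)
                    + k * math.log(p) + (t - k) * math.log1p(-p))
        term = math.exp(log_term)
        total += term
        # Past the mean the terms only shrink; stop once they are negligible.
        if k > t * p and term < 1e-17 * total:
            break
    return min(1.0, total)

# Illustrative values: t = 1,000,000 transactions, f_X = 1e-6, observed support 7.
p_value = binomial_upper_tail(1_000_000, 1e-6, 7)
print(p_value)  # compare against the chosen significance level alpha
```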

9
Naïve Approach

Variations:
R = support / E[support in $\hat{D}$]
R = support - E[support in $\hat{D}$]
Z-value = $(s - E[s])/\sigma[s]$
many more

10
What's wrong? Example

D has 1,000,000 transactions over 1000 items;
each item has frequency 1/1000.
We observe that a pair {i,j} appears 7 times. Is
this pair statistically significant?
In $\hat{D}$ (random dataset):
E[support({i,j})] = 1
Prob({i,j} has support $\ge$ 7) $\approx$ 0.0001
p-value $\approx$ 0.0001: must be significant!

11
What's wrong? Example

There are 499,500 pairs; each has probability
0.0001 of appearing in at least 7 transactions in $\hat{D}$ (random dataset).
The expected number of pairs with support $\ge$ 7 in
$\hat{D}$ is $\approx$ 50:
not such a rare event!
Many false positive discoveries (flagging
itemsets that are not significant)
Need to correct for the multiplicity of hypotheses.
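The arithmetic of this example can be checked with a short Python sketch. The variable names are hypothetical. The first estimate uses the slide's rounded probability 0.0001; the second uses the unrounded binomial tail, which gives a somewhat smaller count (roughly 40) but the same conclusion.

```python
import math

n_items = 1000
t = 1_000_000
p_pair = (1 / n_items) ** 2           # a given pair is in a given transaction
n_pairs = math.comb(n_items, 2)       # 499,500 pairs

# Estimate with the rounded per-pair probability from the slide.
expected_rounded = n_pairs * 0.0001   # about 50

# Unrounded tail: Prob(B(t, p_pair) >= 7) = 1 - Prob(B(t, p_pair) <= 6).
lower = sum(math.comb(t, k) * p_pair**k * (1 - p_pair) ** (t - k)
            for k in range(7))
expected_exact = n_pairs * (1 - lower)

print(n_pairs, expected_rounded, expected_exact)
```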

12
Multi-Hypothesis test

Testing for significant itemsets of size k
involves testing simultaneously for
m = $\binom{n}{k}$ null hypotheses.
H0(X): the support of X conforms with the random dataset $\hat{D}$
$s_X$ = support of X, distributed $s_X \sim B(t, f_X)$
How to combine m tests while minimizing false
positive and false negative discoveries?

13
Family Wise Error Rate (FWER)

Family Wise Error Rate (FWER) = probability of at
least one false positive
(flagging a non-significant itemset as
significant)
Bonferroni method (union bound): test each null
hypothesis with significance level $\alpha/m$
Too conservative: many false negatives (does not
flag many significant itemsets).
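A minimal sketch of the Bonferroni correction. The function name and the choice alpha = 0.05 are hypothetical. With the 499,500 pairs of the earlier example, the per-test threshold drops to about 1e-7, so a pair with p-value 0.0001 is no longer flagged.

```python
def bonferroni_reject(p_values, alpha):
    """Reject each null hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

alpha = 0.05                      # hypothetical significance level
m = 499_500                       # number of pairs in the earlier example
per_test_level = alpha / m        # about 1e-7
print(0.0001 <= per_test_level)   # False: the pair is not flagged
```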

14
False Discovery Rate (FDR)

Less conservative approach
V = number of false positive discoveries
R = total number of rejected null hypotheses
= number of itemsets flagged as significant
FDR = E[V/R] (FDR = 0 when R = 0)
Test with level of significance $\alpha$: reject the
maximum number of null hypotheses such that FDR $\le \alpha$
15
Standard Multi-Hypothesis test
16
Standard Multi-Hypothesis test

Less conservative than Bonferroni method:
$i\alpha/m$ vs. $\alpha/m$
For large m, still needs a very small individual
p-value to reject a hypothesis
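The body of the slide describing the standard test is missing from this transcript. The sketch below assumes it refers to the Benjamini-Hochberg step-up procedure, which matches the $i\alpha/m$ threshold: sort the p-values and reject the hypotheses with the i smallest ones, where i is the largest rank whose p-value is at most $i\alpha/m$. Its FDR guarantee holds for independent (or positively dependent) p-values; the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank            # largest rank that passes so far
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

p = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(p, 0.05))   # all four are rejected
print([x <= 0.05 / len(p) for x in p])  # Bonferroni rejects only two
```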

17
Alternative Approach

Q(k, $s_i$) = observed number of itemsets of size k
and support at least $s_i$
p-value =
the probability of Q(k, $s_i$) in $\hat{D}$ (random dataset)
Fewer hypotheses
How to compute the p-value? What is the
distribution of the number of itemsets of size k
and support at least $s_i$ in $\hat{D}$?

18
Permutation Test

Simulations to estimate the probabilities:
choose a data set at random and count
Main problem: m is large
small probabilities are needed to reject a hypothesis
a lot of simulations are needed to estimate those probabilities
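A simulation of this kind might look as follows in Python. This is a sketch under stated assumptions: datasets are drawn with independent items at given frequencies, the count uses support at least s, and all function names are hypothetical. It enumerates every k-subset of every transaction, so it is only practical for small inputs.

```python
import random
from collections import Counter
from itertools import combinations

def count_itemsets(dataset, k, s):
    """Number of itemsets of size k whose support in `dataset` is at least s."""
    support = Counter()
    for transaction in dataset:
        for itemset in combinations(sorted(transaction), k):
            support[itemset] += 1
    return sum(1 for c in support.values() if c >= s)

def sample_random_dataset(freqs, t, rng):
    """t transactions; item i is included independently with prob freqs[i]."""
    return [{i for i, f in freqs.items() if rng.random() < f}
            for _ in range(t)]

def empirical_p_value(observed_q, freqs, t, k, s, trials, rng):
    """Fraction of random datasets whose count is at least observed_q."""
    hits = 0
    for _ in range(trials):
        sample = sample_random_dataset(freqs, t, rng)
        if count_itemsets(sample, k, s) >= observed_q:
            hits += 1
    return hits / trials
```

An estimate built from N trials cannot resolve probabilities much below 1/N, which is why very small rejection thresholds require a lot of simulations.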

19
Main Contributions

Poisson approximation: let $Q_{k,s}$ = number of
itemsets of size k and support at least s in $\hat{D}$ (random
dataset); for $s \ge s_{min}$,
$Q_{k,s}$ is well approximated by a Poisson
distribution.
Based on the Poisson approximation: a powerful
FDR multi-hypothesis test for significant
frequent itemsets.

20
Chen-Stein Method

A powerful technique for approximating the sum of
dependent Bernoulli variables.
For an itemset X of k items, let $Z_X = 1$ if X has
support at least s, else $Z_X = 0$
$Q_{k,s} = \sum_X Z_X$ (X of k items)
$U \sim \mathrm{Poisson}(\lambda)$
$I(X) = \{Y : |Y| = k,\ Y \cap X \ne \emptyset\}$,

21
Chen-Stein Method (2)
22
Approximation Result

$Q_{k,s}$ is well approximated by a Poisson
distribution for $s \ge s_{min}$

23
Monte-Carlo Estimate

To determine $s_{min}$ for a given set of parameters
(n, t, $f_i$):
Choose m random datasets with the required
parameters.
For each dataset extract all itemsets with
support at least s ( smin)
Find the minimum s such that
$\mathrm{Prob}(b_1(s) + b_2(s) \le \epsilon) \ge 1 - \delta$

24
New Statistical Test

Instead of testing the significance of the
support of individual itemsets, we test the
significance of the number of itemsets with a
given support
The null hypothesis distribution is specified by
the Poisson approximation result
Reduces the number of simultaneous tests
More powerful test: fewer false negatives
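Under the Poisson approximation, the p-value of an observed count is a Poisson upper tail. The sketch below assumes the mean `lam` of the null distribution is already known (for example, estimated from random datasets); the function name and the numbers in the usage line are hypothetical.

```python
import math

def poisson_upper_tail(lam, q):
    """Prob(Poisson(lam) >= q) = 1 - sum_{j < q} exp(-lam) * lam**j / j!"""
    if q <= 0:
        return 1.0
    term = math.exp(-lam)   # Prob(Poisson(lam) = 0)
    cdf = 0.0
    for j in range(q):
        cdf += term
        term *= lam / (j + 1)
    return max(0.0, 1.0 - cdf)

# Hypothetical values: expected count 0.5 under the null, observed count 4.
p_value = poisson_upper_tail(0.5, 4)
print(p_value)
```

Because it subtracts from 1, this form loses precision for extremely small tails; summing the tail terms directly would be the safer choice there.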

25
Test I