An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

1
An Efficient Rigorous Approach for Identifying
Statistically Significant Frequent Itemsets
2
Data Mining
  • Discover hidden patterns, correlations,
    association rules, etc., in large data sets
  • When is the discovery interesting, important,
    significant?
  • We develop a rigorous mathematical/statistical
    approach

3
Frequent Itemsets
  • Dataset D of transactions $t_j$, each a subset of a
    base set of items I ($t_j \in 2^I$).
  • Support of an itemset X: the number of transactions
    that contain X.

support({Beer, Diaper}) = 3
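
The definition of support can be sketched in a few lines of Python. The transactions below are a hypothetical toy dataset (the table from the original slide is not in this transcript), chosen so that the pair {Beer, Diaper} has support 3.

```python
def support(dataset, itemset):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for transaction in dataset if items <= transaction)


# Hypothetical toy dataset, not the one shown on the slide.
transactions = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diaper", "Beer", "Eggs"}),
    frozenset({"Milk", "Diaper", "Beer", "Coke"}),
    frozenset({"Bread", "Milk", "Diaper", "Beer"}),
    frozenset({"Bread", "Milk", "Diaper", "Coke"}),
]

print(support(transactions, {"Beer", "Diaper"}))  # 3
```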
4
Frequent Itemsets
  • Discover all itemsets with significant support.
  • Fundamental primitive in data mining applications

support({Beer, Diaper}) = 3
5
Significance
  • What support level makes an itemset significantly
    frequent?
  • Minimize false positive and false negative
    discoveries
  • Improve quality of subsequent analyses
  • How to narrow the search to focus only on
    significant itemsets?
  • Reduce the possibly exponential time search

6
Statistical Model
  • Input
  • D: a dataset of t transactions over I, $|I| = n$
  • For $i \in I$, let n(i) be the support of i in D.
  • $f_i = n(i)/t$: frequency of i in D
  • $H_0$ model
  • D (random dataset): t transactions over I, $|I| = n$
  • Item i is included in transaction j with
    probability $f_i$, independent of all other events.
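
A minimal sketch of this null model, assuming transactions are represented as Python sets: item frequencies are measured on the observed dataset, then each item is placed in each random transaction independently with its own frequency. The function names and the tiny example dataset are illustrative and not from the presentation.

```python
import random


def item_frequencies(dataset, items):
    """f_i = n(i) / t for every item i."""
    t = len(dataset)
    return {i: sum(1 for tr in dataset if i in tr) / t for i in items}


def sample_null_dataset(freqs, t, rng):
    """Draw t transactions; item i joins each one independently with probability f_i."""
    return [
        frozenset(i for i, f in freqs.items() if rng.random() < f)
        for _ in range(t)
    ]


# Hypothetical observed dataset over the items a, b, c.
observed = [
    frozenset({"a", "b"}),
    frozenset({"a"}),
    frozenset({"b", "c"}),
    frozenset({"a", "b", "c"}),
]
freqs = item_frequencies(observed, {"a", "b", "c"})
random_dataset = sample_null_dataset(freqs, t=len(observed), rng=random.Random(0))
```

The random dataset keeps the expected item frequencies of the observed one while destroying any correlation between items.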

7
Statistical Tests
  • $H_0$ (null hypothesis): the support of no itemset
    is significant with respect to D (random dataset)
  • $H_1$ (alternative hypothesis): the support of
    itemsets $X_1, X_2, X_3, \ldots$ is significant. It is
    unlikely that their support comes from the
    distribution of D (random dataset)
  • Significance level
  • $\alpha$ = Prob(rejecting $H_0$ when it is true)

8
Naïve Approach
  • Let $X = \{x_1, x_2, \ldots, x_r\}$
  • $f_X = \prod_{j=1}^{r} f_{x_j}$: probability that a given itemset is
    in a given transaction
  • $s_X$: support of X, distributed $s_X \sim B(t, f_X)$
  • Reject $H_0$ if
  • Prob($B(t, f_X) \ge s_X$) = p-value $\le \alpha$
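
The naive test can be sketched with the standard library only. This is an illustrative implementation, not the authors' code: the binomial upper tail is computed as one minus the lower partial sum, which is adequate for moderate tail probabilities but loses precision for tails near machine epsilon. It assumes 0 < f_X < 1.

```python
import math


def binomial_tail(t, p, s):
    """Prob(B(t, p) >= s), computed as 1 - sum_{k < s} pmf(k) in log space."""
    if s <= 0:
        return 1.0
    below = 0.0
    for k in range(min(s, t + 1)):
        log_pmf = (
            math.lgamma(t + 1) - math.lgamma(k + 1) - math.lgamma(t - k + 1)
            + k * math.log(p) + (t - k) * math.log1p(-p)
        )
        below += math.exp(log_pmf)
    return max(0.0, 1.0 - below)


def naive_p_value(t, item_freqs, observed_support):
    """p-value of an itemset whose items have the given individual frequencies."""
    f_x = math.prod(item_freqs)  # probability that X is in a given transaction
    return binomial_tail(t, f_x, observed_support)


def naive_reject(t, item_freqs, observed_support, alpha):
    return naive_p_value(t, item_freqs, observed_support) <= alpha
```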

9
Naïve Approach
  • Variations
  • R = support / E[support in D]
  • R = support - E[support in D]
  • Z-value = $(s - E[s]) / \sigma[s]$
  • many more
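
The three variations listed above differ only in how the observed support is compared with its expectation. A small sketch, assuming the support follows B(t, f_X) in the random dataset so that its mean is t·f_X and its standard deviation is sqrt(t·f_X·(1 - f_X)); the function name and dictionary keys are illustrative.

```python
import math


def deviation_measures(t, f_x, s):
    """Ratio, difference and Z-value of an observed support s against B(t, f_x)."""
    expected = t * f_x                       # E[s] in the random dataset
    sigma = math.sqrt(t * f_x * (1 - f_x))   # standard deviation of B(t, f_x)
    return {
        "ratio": s / expected,
        "difference": s - expected,
        "z_value": (s - expected) / sigma,
    }


measures = deviation_measures(t=100, f_x=0.5, s=60)
```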

10
What's wrong? Example
  • D has 1,000,000 transactions over 1000 items;
    each item has frequency 1/1000.
  • We observe that a pair {i, j} appears 7 times. Is
    this pair statistically significant?
  • In D (random dataset)
  • E[support({i, j})] = 1
  • Prob({i, j} has support $\ge 7$) $\approx$ 0.0001
  • p-value 0.0001 - must be significant!

11
What's wrong? Example
  • There are 499,500 pairs; each has probability
    0.0001 of appearing in at least 7 transactions in D
  • The expected number of pairs with support $\ge 7$ in
    D is $\approx$ 50,
  • not such a rare event!
  • Many false positive discoveries (flagging
    itemsets that are not significant)
  • Need to correct for the multiplicity of hypotheses.
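
The arithmetic of this example can be checked directly. The sketch below assumes a Poisson(1) approximation to the binomial support distribution (reasonable here because t is large and the pair probability is tiny). The unrounded tail probability comes out a little below the rounded 0.0001 used on the slide, so the expected count is a few tens of pairs, the same order of magnitude as the slide's figure.

```python
import math

t = 1_000_000          # transactions
n_items = 1000         # items, each with frequency 1/1000
f = 1 / n_items

expected_support = t * f * f                 # E[support({i, j})], equal to 1

# Prob(support >= 7) under the Poisson(1) approximation.
lam = expected_support
p_at_least_7 = 1 - sum(
    math.exp(-lam) * lam**k / math.factorial(k) for k in range(7)
)

num_pairs = math.comb(n_items, 2)            # 499,500 pairs
expected_pairs = num_pairs * p_at_least_7    # expected number of "significant" pairs

print(num_pairs, p_at_least_7, expected_pairs)
```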

12
Multi-Hypothesis test
  • Testing for significant itemsets of size k
    involves testing simultaneously for
    m null hypotheses.
  • $H_0(X)$: support of X conforms with D (random dataset)
  • $s_X$: support of X, distributed $s_X \sim B(t, f_X)$
  • How to combine m tests while minimizing false
    positive and false negative discoveries?

13
Family Wise Error Rate (FWER)
  • Family Wise Error Rate (FWER): probability of at
    least one false positive
    (flagging a non-significant itemset as
    significant)
  • Bonferroni method (union bound): test each null
    hypothesis with significance level $\alpha/m$
  • Too conservative: many false negatives; does not
    flag many significant itemsets.
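
A minimal sketch of the Bonferroni method as described above: every hypothesis is tested at level alpha/m. The function name and example p-values are illustrative.

```python
def bonferroni_reject(p_values, alpha):
    """Reject hypothesis i iff its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# With m = 3 and alpha = 0.05 the per-test threshold is about 0.0167.
decisions = bonferroni_reject([0.001, 0.02, 0.04], alpha=0.05)
print(decisions)  # [True, False, False]
```

The threshold shrinks linearly with m, which is why the method flags so few itemsets when m is the number of all candidate itemsets.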

14
False Discovery Rate (FDR)
  • Less conservative approach
  • V: number of false positive discoveries
  • R: total number of rejected null hypotheses
    = number of itemsets flagged as significant
  • Test with level of significance $\alpha$: reject the
    maximum number of null hypotheses such that
    FDR $\le \alpha$

FDR = E[V/R] (FDR = 0 when R = 0)
15
Standard Multi-Hypothesis test
16
Standard Multi-Hypothesis test
  • Less conservative than Bonferroni method
  • $i\alpha/m$ vs. $\alpha/m$
  • For large m, one still needs a very small individual
    p-value to reject a hypothesis
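
The slide that stated the standard procedure is empty in this transcript. The comparison of thresholds growing with the rank i against the fixed Bonferroni threshold suggests a step-up procedure of the Benjamini-Hochberg type, so that is what the sketch below assumes; it is not necessarily the exact variant used in the talk.

```python
def step_up_reject(p_values, alpha):
    """Step-up FDR control: find the largest rank i whose sorted p-value is
    at most i * alpha / m, and reject the i hypotheses with smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


decisions = step_up_reject([0.001, 0.02, 0.04, 0.5], alpha=0.05)
print(decisions)  # [True, True, False, False]
```

On the same input, a fixed threshold of alpha/m = 0.0125 would reject only the first hypothesis. The smallest p-value still has to pass alpha/m, though, which is the limitation noted above.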

17
Alternative Approach
  • $Q(k, s_i)$: observed number of itemsets of size k
    and support $\ge s_i$
  • p-value:
    the probability of $Q(k, s_i)$ in D (random dataset)
  • Fewer hypotheses
  • How to compute the p-value? What is the
    distribution of the number of itemsets of size k
    and support $\ge s_i$ in D (random dataset)?

18
Permutation Test
  • Simulations to estimate the probabilities
  • Choose a data set at random and count
  • Main problem: the number of hypotheses m
  • small probabilities are needed to reject a hypothesis
  • a lot of simulations are needed to estimate such probabilities
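
A simulation-based estimate of this kind can be sketched as follows, assuming the independent-items null model from the earlier slide and brute-force counting of k-itemsets (only practical for small inputs). All names and parameters are illustrative. The resolution of the estimate is 1/runs, which is why very small p-values require a very large number of simulated datasets.

```python
import random
from collections import Counter
from itertools import combinations


def count_itemsets(dataset, k, s):
    """Number of k-itemsets with support at least s (brute force)."""
    counts = Counter()
    for transaction in dataset:
        for subset in combinations(sorted(transaction), k):
            counts[subset] += 1
    return sum(1 for c in counts.values() if c >= s)


def simulate_counts(freqs, t, k, s, runs, seed=0):
    """Counts of k-itemsets with support >= s in `runs` random datasets."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        dataset = [
            [i for i, f in enumerate(freqs) if rng.random() < f]
            for _ in range(t)
        ]
        results.append(count_itemsets(dataset, k, s))
    return results


def empirical_p_value(simulated, observed):
    """Fraction of random datasets with at least `observed` such itemsets."""
    return sum(1 for q in simulated if q >= observed) / len(simulated)


simulated = simulate_counts([0.1] * 10, t=200, k=2, s=3, runs=20)
```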

19
Main Contributions
  • Poisson approximation: let $Q_{k,s}$ be the number of
    itemsets of size k and support $\ge s$ in D (random
    dataset). For $s \ge s_{\min}$,
  • $Q_{k,s}$ is well approximated by a Poisson
    distribution.
  • Based on the Poisson approximation: a powerful
    FDR multi-hypothesis test for significant
    frequent itemsets.

20
Chen-Stein Method
  • A powerful technique for approximating the sum of
    dependent Bernoulli variables.
  • For an itemset X of k items let $Z_X = 1$ if X has
    support at least s, else $Z_X = 0$
  • $Q_{k,s} = \sum_X Z_X$ (X of k items)
  • $U \sim \text{Poisson}(\lambda)$
  • $I(X) = \{Y : |Y| = k,\ Y \cap X \ne \emptyset\}$
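
The bound itself did not survive in this transcript. For reference, the standard form of the Chen-Stein bound, written in the notation of this slide, is given below. The symbol p_X and the total variation distance d_TV are notation introduced here. The general theorem has a third term b_3, which vanishes in this setting on the assumption that Z_X is independent of all Z_Y with Y disjoint from X, as holds when items are placed independently.

```latex
\lambda = \mathbb{E}[Q_{k,s}] = \sum_{X} p_X, \qquad p_X = \Pr(Z_X = 1)

b_1 = \sum_{X} \sum_{Y \in I(X)} p_X \, p_Y, \qquad
b_2 = \sum_{X} \sum_{\substack{Y \in I(X) \\ Y \neq X}} \mathbb{E}[Z_X Z_Y]

d_{TV}(Q_{k,s}, U)
  = \sup_{A} \bigl| \Pr(Q_{k,s} \in A) - \Pr(U \in A) \bigr|
  \le b_1 + b_2
```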

21
Chen-Stein Method (2)
22
Approximation Result
  • $Q_{k,s}$ is well approximated by a Poisson
    distribution for $s \ge s_{\min}$

23
Monte-Carlo Estimate
  • To determine $s_{\min}$ for a given set of parameters
    (n, t, $f_i$)
  • Choose m random datasets with the required
    parameters.
  • For each dataset extract all itemsets with
    support at least s ( smin)
  • Find the minimum s such that
  • Prob($b_1(s) + b_2(s) \le \epsilon$) $\ge 1 - \delta$

24
New Statistical Test
  • Instead of testing the significance of the
    support of individual itemsets we test the
    significance of the number of itemsets with a
    given support
  • The null hypothesis distribution is specified by
    the Poisson approximation result
  • Reduces the number of simultaneous tests
  • More powerful test: fewer false negatives

25
Test I
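  • Define $\alpha_1, \alpha_2, \alpha_3, \ldots$ such that $\sum_i \alpha_i \le \alpha$
  • For $i = 0, \ldots, \log_2(s_{\max} - s_{\min}) + 1$
  • $s_i = s_{\min} + 2^i$
  • $Q(k, s_i)$: observed number of itemsets of size k
    and support $\ge s_i$
  • $H_0(k, s_i)$: $Q(k, s_i)$ conforms with Poisson($\lambda_i$)
  • Reject $H_0(k, s_i)$ if p-value $< \alpha_i$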
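
Test I can be sketched as below. Several details are assumptions made for illustration: the logarithm is taken base 2 and rounded down, the support levels are s_min + 2^i, the significance budget is split evenly across levels, and the Poisson means for each level are supplied by the caller (for example from simulation). The Poisson tail is computed as one minus the lower partial sum, which is fine for moderate means.

```python
import math


def poisson_tail(lam, q):
    """Prob(Poisson(lam) >= q)."""
    if q <= 0:
        return 1.0
    term = math.exp(-lam)  # pmf at 0
    below = term
    for j in range(1, q):
        term *= lam / j    # pmf at j from pmf at j - 1
        below += term
    return max(0.0, 1.0 - below)


def support_levels(s_min, s_max):
    """Levels s_i = s_min + 2^i for i = 0, ..., floor(log2(s_max - s_min)) + 1."""
    h = int(math.log2(s_max - s_min)) + 1
    return [s_min + 2**i for i in range(h + 1)]


def run_test_one(observed_q, lambdas, alpha):
    """For each level i, reject H0(k, s_i) iff its Poisson p-value is below alpha_i."""
    alpha_i = alpha / len(observed_q)  # even split, one possible choice
    return [
        poisson_tail(lam, q) < alpha_i
        for q, lam in zip(observed_q, lambdas)
    ]


print(support_levels(10, 18))                        # [11, 12, 14, 18, 26]
print(run_test_one([30, 2], [5.0, 1.5], alpha=0.05))  # [True, False]
```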

26
Test I
  • Let $s^*$ be the smallest s such that
    $H_0(k, s)$ is rejected by Test I
  • With confidence level $1 - \alpha$, the number of itemsets
    with support $\ge s^*$ is significant
  • Some itemsets with support $\ge s^*$ could still be
    false positives

27
Test II
  • Define $\beta_1, \beta_2, \beta_3, \ldots$ such that $\sum_i \beta_i \le \beta$
  • Reject $H_0(k, s_i)$ if
  • p-value $< \alpha_i$ and $Q(k, s_i) \ge \lambda_i / \beta_i$
  • Let $s^*$ be the minimum s such that $H_0(k, s)$ was
    rejected
  • If we flag all itemsets with support $\ge s^*$ as
    significant, FDR $\le \beta$
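
A sketch of Test II under the same illustrative assumptions as before: both budgets are split evenly across the support levels, the levels are given in increasing order, and the Poisson means are supplied by the caller. The function returns the smallest rejected support level, or None if nothing is rejected.

```python
import math


def poisson_tail(lam, q):
    """Prob(Poisson(lam) >= q)."""
    if q <= 0:
        return 1.0
    term = math.exp(-lam)
    below = term
    for j in range(1, q):
        term *= lam / j
        below += term
    return max(0.0, 1.0 - below)


def run_test_two(levels, observed_q, lambdas, alpha, beta):
    """Smallest support level whose null hypothesis is rejected by Test II.

    A level is rejected when its Poisson p-value is below alpha_i AND the
    observed count is at least lambda_i / beta_i.
    """
    h = len(levels)
    alpha_i, beta_i = alpha / h, beta / h  # even split, one possible choice
    for s, q, lam in zip(levels, observed_q, lambdas):
        if poisson_tail(lam, q) < alpha_i and q >= lam / beta_i:
            return s
    return None


s_star = run_test_two(
    levels=[11, 12, 14],
    observed_q=[40, 30, 25],
    lambdas=[20.0, 2.0, 0.1],
    alpha=0.05,
    beta=0.05,
)
print(s_star)  # 14
```

In this made-up example the first two levels have small p-values but fail the second condition, since the observed counts are not large enough relative to the expected counts. Only at the third level do both conditions hold.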

28
Proof
  • $V_i$: number of false discoveries if $H_0(k, s_i)$ is the first rejected
  • $E_i$: the event that $H_0(k, s_i)$ is rejected

29
Real Datasets
  • FIMI repository
  • http://fimi.cs.helsinki.fi/data/
  • standard benchmarks
  • m: avg. transaction length

30
Experimental Results
  • Poisson approximation
  • Poisson regime ? no itemsets expected

31
Experimental Results
  • Poisson approximation
  • not approximating the p-values of itemsets as
    hypothesis (small!)
  • finding the minimum s such that
  • Prob($b_1(s) + b_2(s) \le \epsilon$) $\ge 1 - \delta$
  • fewer simulations
  • less time per simulation (few itemsets)

32
Experimental Results
  • Test II: $\alpha = 0.05$, $\beta = 0.05$
  • $R_{k,s}$: number of itemsets of size k with support $\ge s$

Itemset of size 154 with support 7
33
Experimental Results
  • Standard Multi-Hypothesis test: $\beta = 0.05$
  • R: size of the output of the Standard Multi-Hypothesis test
  • $R_{k,s}$: size of the output of Test II

34
Collaborators
  • Adam Kirsch (Harvard)
  • Michael Mitzenmacher (Harvard)
  • Andrea Pietracaprina (U. of Padova)
  • Geppino Pucci (U. of Padova)
  • Eli Upfal (Brown U.)
  • Fabio Vandin (U. of Padova)