Title: From Association Rules To Causality


1
From Association Rules To Causality
Presenters: Amol Shukla, University of Waterloo
Claude-Guy Quimper, University of Waterloo
2
From Association Rules To Causality
Presentation Outline
  • Limitations of Association Rules and the
    Support-Confidence Framework
  • Generalizing Association Rules to Correlations
  • Scalable Techniques for Mining Causal Structures
  • Applications of Correlation and Causality
  • Summary

3
Review: Association Rules Mining
  • Itemset I = {i1, ..., ik}
  • Find all the rules X => Y with min confidence and
    support
  • support, s: probability that a transaction
    contains X ∪ Y
  • confidence, c: conditional probability that a
    transaction having X also contains Y, i.e., P(Y|X)

Let min_support = 50%, min_conf = 50%. Two
example association rules are A => C (50%,
66.7%) and C => A (50%, 100%)
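
The two measures can be checked with a few lines of Python. This is only a sketch: the four-transaction database is a hypothetical stand-in (the slide's own table is not part of this transcript), chosen so that it reproduces the figures quoted above, and the helper names `support` and `confidence` are assumptions.

```python
# Hypothetical transaction database (not from the slides), chosen so that
# A => C and C => A reproduce the support and confidence quoted above.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]


def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)


def confidence(lhs, rhs, db):
    """Estimate of P(rhs | lhs): support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)


for lhs, rhs in [("A", "C"), ("C", "A")]:
    s = support({lhs, rhs}, transactions)
    c = confidence({lhs}, {rhs}, transactions)
    print(f"{lhs} => {rhs}: support = {s:.1%}, confidence = {c:.1%}")
```

Both rules share the same support because support depends only on the union of the two sides; only the confidence changes with the direction of the rule.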
4
Limitations of Association Rules using
Support-Confidence Framework
  • Negative implications or dependencies are ignored
  • Consider the adjoining database.
  • X and Y are positively related,
  • X and Z are negatively related,
  • yet the support and confidence of
    X => Z dominate
  • Only the presence of items is taken into account

5
Limitations of Association Rules using
Support-Confidence Framework
  • Another market basket data example
  • Buys Tea => Buys Coffee
  • (support = 20%, confidence = 80%)
  • Is this rule really valid?
  • Pr(Buys Coffee) = 90%
  • Pr(Buys Coffee | Buys Tea) = 80%
  • Negative correlation between buying tea and
    buying coffee is ignored
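
One way to see the problem is to compare the rule's confidence with the baseline probability of the consequent. The sketch below uses only the two figures quoted on this slide; the variable names are assumptions.

```python
# Figures quoted on the slide, written as fractions.
p_coffee = 0.90            # Pr(Buys Coffee)
p_coffee_given_tea = 0.80  # Pr(Buys Coffee | Buys Tea), the rule's confidence

# Ratio of the rule's confidence to the baseline probability of the consequent.
# A value below 1 means tea buyers are less likely than average to buy coffee.
ratio = p_coffee_given_tea / p_coffee
negatively_related = ratio < 1

print(f"confidence / baseline = {ratio:.2f}")
print("negative relationship" if negatively_related else "no negative relationship")
```

The rule passes an 80% confidence threshold even though knowing that a customer bought tea lowers the chance that the customer bought coffee.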

6
From Association Rules To Causality
  • Limitations of Association Rules and the
    Support-Confidence Framework
  • Generalizing Association Rules to Correlations
  • Scalable Techniques for Mining Causal Structures
  • Applications of Correlation and Causality
  • Summary

7
What is Correlation?
  • P(A): probability that event A occurs
  • P(¬A): probability that event A does not
    occur
  • P(AB): probability that events A and B occur
    together
  • Events A and B are said to be independent if
  • P(AB) = P(A) × P(B)
  • Otherwise A and B are dependent
  • Events A and B are said to be correlated if any
    of
  • AB, A¬B, ¬AB, ¬A¬B are dependent
  • A correlation rule is a set of items that are
    correlated

8
Computing Correlation Rules: Chi-squared Test
for Independence
  • For an itemset I = {i1, ..., ik}, construct a
    k-dimensional contingency table R = {i1, ¬i1} ×
    ... × {ik, ¬ik}
  • We need to test whether each cell r = (r1, ..., rk)
    in this table is dependent
  • Let O(r) denote the observed value of cell r in
    this table, and E(r) be its expected value.
  • The chi-squared statistic is then computed as
    χ² = Σ_{r ∈ R} (O(r) - E(r))² / E(r)
  • If χ² = 0, the cells are independent. If χ² >
    cut-off value, reject the independence assumption

9
Example: Computing the Chi-squared Statistic
E(Coffee, Tea)       = (90 × 25)/100 = 22.5
E(No Coffee, Tea)    = (10 × 25)/100 = 2.5
E(Coffee, No Tea)    = (90 × 75)/100 = 67.5
E(No Coffee, No Tea) = (10 × 75)/100 = 7.5
χ² = (20 - 22.5)²/22.5 + (5 - 2.5)²/2.5
   + (70 - 67.5)²/67.5 + (5 - 7.5)²/7.5
   = 0.28 + 2.5 + 0.09 + 0.83 = 3.7
Since this value is greater than the cut-off
value (2.71 at the 90% significance level), we reject
the independence assumption
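
The arithmetic of this example can be reproduced with a short function. This is a sketch for the 2×2 case only; the function name `chi_squared_2x2` and the table layout are assumptions, while the observed counts and the 2.71 cut-off are the ones used on this slide.

```python
def chi_squared_2x2(observed):
    """Chi-squared statistic for a 2x2 table given as [[a, b], [c, d]]."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count of the cell if rows and columns were independent.
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    return chi2


# Rows: Tea, No Tea. Columns: Coffee, No Coffee.
observed = [[20, 5], [70, 5]]
chi2 = chi_squared_2x2(observed)

print(f"chi-squared = {chi2:.2f}")
print("reject independence" if chi2 > 2.71 else "cannot reject independence")
```

The expected counts are derived from the row and column totals, so the function needs only the four observed cells.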
10
Determining the Cause of Correlation
  • Define a measure of interest for each cell: I(r) =
    O(r) / E(r)
  • I(r) > 1 indicates positive dependence and I(r) < 1
    indicates negative dependence
  • The farther I(r) is from 1, the more a cell
    contributes to the χ² value, and the correlation.

Cell Counts
               Coffee   No Coffee
  Tea            20         5
  No Tea         70         5

Measures of Interest
               Coffee            No Coffee
  Tea          20/22.5 = 0.89    5/2.5 = 2.00
  No Tea       70/67.5 = 1.04    5/7.5 = 0.67

  • Thus, (No Coffee, Tea) contributes the most to the
    correlation, indicating that buying tea might
    inhibit buying coffee
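
The interest of every cell in the tea and coffee example can be computed from the observed and expected counts given in the chi-squared example. The sketch below assumes a dictionary layout keyed by cell; the names are illustrative.

```python
# Observed and expected counts from the tea/coffee chi-squared example.
observed = {
    ("Coffee", "Tea"): 20,
    ("No Coffee", "Tea"): 5,
    ("Coffee", "No Tea"): 70,
    ("No Coffee", "No Tea"): 5,
}
expected = {
    ("Coffee", "Tea"): 22.5,
    ("No Coffee", "Tea"): 2.5,
    ("Coffee", "No Tea"): 67.5,
    ("No Coffee", "No Tea"): 7.5,
}

# Interest I(r) = O(r) / E(r) and each cell's share of the chi-squared value.
interest = {cell: observed[cell] / expected[cell] for cell in observed}
contribution = {
    cell: (observed[cell] - expected[cell]) ** 2 / expected[cell]
    for cell in observed
}

for cell in observed:
    print(f"{cell}: interest = {interest[cell]:.2f}, "
          f"chi-squared share = {contribution[cell]:.2f}")

top_cell = max(contribution, key=contribution.get)
print("largest contribution:", top_cell)
```

The cell with the largest chi-squared share is also the one whose interest is farthest from 1.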
11
Properties of Correlation
  • If a set of items is correlated, all its
    supersets are also correlated. Thus, correlation
    is upward-closed
  • We can focus on minimal correlated itemsets to
    reduce our search space
  • Support is downward-closed. A set has minimum
    support only if all its subsets have minimum
    support
  • We can combine correlation with support for an
    effective pruning strategy

12
Combining Correlation with Support
  • Support-confidence framework looks at only the
    top-left cell in the contingency table. To
    incorporate negative dependence, we must consider
    all the cells in the table
  • Combine correlation with support by defining
    CT-support
  • Let s be a user-specified min-support threshold.
    Let p be a user-specified cut-off percentage
    value
  • An itemset I is CT-supported if at least p% of
    the cells in its contingency table have support
    not less than s
  • An itemset is significant if it is CT-supported
    and minimally correlated
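
The CT-support definition translates almost directly into code. In this sketch the function name `is_ct_supported` is an assumption, and so is the convention that both s and p are passed as fractions (s of the transactions, p of the cells).

```python
def is_ct_supported(cell_counts, n, s, p):
    """True if at least a fraction `p` of the cells have support >= `s`.

    cell_counts: observed counts, one per contingency-table cell
    n: total number of transactions
    s: minimum support as a fraction of n (assumed convention)
    p: cut-off as a fraction of the cells (assumed convention)
    """
    supported_cells = sum(count / n >= s for count in cell_counts)
    return supported_cells / len(cell_counts) >= p


# Cell counts of the tea/coffee table used earlier in the presentation.
cells = [20, 5, 70, 5]

# With s = 5%, every cell qualifies; with s = 10%, only two of the four do.
print(is_ct_supported(cells, n=100, s=0.05, p=0.75))
print(is_ct_supported(cells, n=100, s=0.10, p=0.75))
```

Unlike plain support, the check looks at every cell of the table, so an itemset can qualify because of cells in which items are absent.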

13
A level-wise algorithm for finding correlation
rules
14
Steps performed by the algorithm at level k
1) Start: construct the contingency table for the
next itemset at the level
2) Is the itemset CT-supported? If no, skip the
itemset and go to step 4
3) Is χ² greater than the cut-off value? If yes,
mark the itemset as significant. If no, add it to
the set NOTSIG
4) Done processing all itemsets at level k? If no,
go back to step 1
5) If yes, generate the itemset(s) of size k+1 such
that all of their subsets are in NOTSIG
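
The level-wise steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the presenters' implementation: it starts at pairs, uses one fixed chi-squared cut-off at every level, treats s and p as fractions, and runs on a hypothetical database built by `make_db`. All function names are assumptions.

```python
from itertools import combinations, product


def contingency_counts(itemset, db):
    """Observed count for every presence/absence cell of `itemset`."""
    counts = {cell: 0 for cell in product((True, False), repeat=len(itemset))}
    for t in db:
        counts[tuple(item in t for item in itemset)] += 1
    return counts


def chi_squared(itemset, counts, db):
    """Chi-squared statistic against the full-independence expectation."""
    n = len(db)
    probs = [sum(item in t for t in db) / n for item in itemset]
    chi2 = 0.0
    for cell, o in counts.items():
        e = n
        for present, prob in zip(cell, probs):
            e *= prob if present else 1 - prob
        if e > 0:
            chi2 += (o - e) ** 2 / e
    return chi2


def significant_itemsets(db, s, p, cutoff, max_size=3):
    """Level-wise search for CT-supported, minimally correlated itemsets."""
    items = sorted(set().union(*db))
    n = len(db)
    size = 2
    candidates = list(combinations(items, size))
    significant = []
    while candidates and size <= max_size:
        notsig = set()
        for itemset in candidates:
            counts = contingency_counts(itemset, db)
            supported = sum(c / n >= s for c in counts.values())
            if supported / len(counts) < p:
                continue  # not CT-supported: drop the itemset
            if chi_squared(itemset, counts, db) > cutoff:
                significant.append(itemset)
            else:
                notsig.add(itemset)
        # Next level: keep only itemsets whose subsets are all in NOTSIG.
        size += 1
        candidates = [
            c for c in combinations(items, size)
            if all(sub in notsig for sub in combinations(c, size - 1))
        ]
    return significant


def make_db():
    """Hypothetical data: tea and coffee are related, milk is independent."""
    db = []
    for tea, coffee, count in [(1, 1, 40), (1, 0, 10), (0, 1, 140), (0, 0, 10)]:
        for i in range(count):
            t = {"tea"} if tea else set()
            if coffee:
                t.add("coffee")
            if i % 2 == 0:
                t.add("milk")
            db.append(t)
    return db


result = significant_itemsets(make_db(), s=0.05, p=0.75, cutoff=2.71)
print(result)
```

Because correlation is upward-closed, a significant itemset is never extended: only itemsets whose subsets all sit in NOTSIG are generated at the next level.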
15
Limitations of Correlation
  • Correlation might not be valid for sparse
    itemsets. At least 80% of the cells in the
    contingency table must have expected value
    greater than 5.
  • Finding correlation rules is computationally more
    expensive than finding association rules.
  • Correlation only indicates the existence of a
    relationship. It does not specify the nature of
    the relationship, i.e., the cause-and-effect
    phenomenon is ignored.
  • Identifying causality is important for
    decision-making.

16
From Association Rules to Causality
  • Limitations of Association Rules and the
    Support-Confidence Framework
  • Generalizing Association Rules to Correlations
  • Scalable Techniques for Mining Causal Structures
  • Applications of Correlation and Causality
  • Summary

17
Causality
[Figure with three labels of 33% each]
Association Rule: Hot-Dogs => BBQ Sauce [33%, 50%]
Causality Rule: Hamburgers => BBQ Sauce
18
Bayesian Networks
  • What is the best topology of a Bayesian network
    that describes the observed data?
  • Problem: very expensive to compute

19
Simplifying Causal Relationships
  • Knowing the existence of a causal relationship is
    as good as knowing the relationship

20
Causality vs Correlation
  • Two correlated variables can have either
  • a causal relationship, or
  • a common ancestor

21
First Rule of Causality
1) Suppose we have three pairwise
dependent variables
2) And two of the variables become independent when
conditioned on the third one
22
First Rule of Causality
Then we have one of the following configurations
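
The two preconditions of the first rule can be checked with chi-squared tests, once on the whole data set and once within each value of the conditioning variable. The sketch below uses hypothetical data in which A and C are independent within each value of B; the function names and the 3.84 cut-off (95%, one degree of freedom) are assumptions.

```python
def chi_squared(rows, x, y):
    """2x2 chi-squared statistic for binary fields x and y of `rows`."""
    n = len(rows)
    px = sum(r[x] for r in rows) / n
    py = sum(r[y] for r in rows) / n
    chi2 = 0.0
    for vx in (1, 0):
        for vy in (1, 0):
            o = sum(r[x] == vx and r[y] == vy for r in rows)
            e = n * (px if vx else 1 - px) * (py if vy else 1 - py)
            if e > 0:
                chi2 += (o - e) ** 2 / e
    return chi2


def make_rows():
    """Hypothetical data: within each value of B, A and C are independent."""
    strata = (
        (1, {(1, 1): 64, (1, 0): 16, (0, 1): 16, (0, 0): 4}),
        (0, {(1, 1): 4, (1, 0): 16, (0, 1): 16, (0, 0): 64}),
    )
    rows = []
    for b, counts in strata:
        for (a, c), count in counts.items():
            rows += [{"A": a, "B": b, "C": c}] * count
    return rows


CUTOFF = 3.84  # assumed cut-off: 95% with one degree of freedom
rows = make_rows()

# Precondition 1: all three pairs are dependent on the full data set.
pairwise = {
    pair: chi_squared(rows, *pair)
    for pair in (("A", "B"), ("B", "C"), ("A", "C"))
}
# Precondition 2: A and C are independent once B is fixed.
given_b = [
    chi_squared([r for r in rows if r["B"] == b], "A", "C") for b in (1, 0)
]

print({pair: round(v, 2) for pair, v in pairwise.items()})
print([round(v, 2) for v in given_b])
```

Conditioning is done here by splitting the rows on the value of B and testing A against C inside each part.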
23
Second Rule of Causality
1) Suppose we have three variables with these
relationships: two of them are independent of each
other, and each is dependent on the third
2) And the two independent variables become
dependent when conditioned on the third variable
24
Second Rule of Causality
  • Then the two independent variables cause the
    third variable.

25
Finding Causality
1) Construct a graph where each variable is a
vertex
2) Perform a chi-squared test to determine
correlation
3) Add an edge labeled C for each correlated
test
4) Add an edge labeled U for each uncorrelated
test
5) For each triplet, check if a causality rule
can be applied
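
The five steps can be sketched as follows for the second rule. This is a simplified illustration: it uses a single cut-off to label every edge either C or U, it checks only the edge pattern of the second rule (the conditioning step is omitted), and the data, where Y is true whenever X or Z is true, are hypothetical. All names are assumptions.

```python
from itertools import combinations, permutations


def chi2_pair(x, y, rows):
    """2x2 chi-squared statistic for boolean fields x and y of `rows`."""
    n = len(rows)
    px = sum(r[x] for r in rows) / n
    py = sum(r[y] for r in rows) / n
    chi2 = 0.0
    for vx in (True, False):
        for vy in (True, False):
            o = sum(r[x] == vx and r[y] == vy for r in rows)
            e = n * (px if vx else 1 - px) * (py if vy else 1 - py)
            if e > 0:
                chi2 += (o - e) ** 2 / e
    return chi2


def label_edges(variables, rows, cutoff):
    """Steps 1 to 4: one edge per pair, labeled 'C' or 'U'."""
    return {
        frozenset(pair): "C" if chi2_pair(*pair, rows) > cutoff else "U"
        for pair in combinations(variables, 2)
    }


def second_rule(variables, edges):
    """Step 5: a-b and b-c labeled 'C', a-c labeled 'U' => a and c cause b."""
    found = []
    for a, b, c in permutations(variables, 3):
        if (a < c
                and edges[frozenset((a, b))] == "C"
                and edges[frozenset((b, c))] == "C"
                and edges[frozenset((a, c))] == "U"):
            found.append((a, b, c))
    return found


# Hypothetical data: X and Z are independent, and Y is true if either is.
rows = [
    {"X": x, "Z": z, "Y": x or z}
    for x in (True, False)
    for z in (True, False)
    for _ in range(25)
]
variables = ["X", "Y", "Z"]

edges = label_edges(variables, rows, cutoff=3.84)
causes = second_rule(variables, edges)
print(causes)  # each triple (a, b, c) reads: a and c cause b
```

The condition `a < c` only prevents the same triplet from being reported twice with its two causes swapped.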
26
Weaknesses of the Algorithm
  • Causality rules do not cover all possible
    causality relationships
  • The χ² test with confidence set to 95% is
    expected to fail 5 times in every 100 tests
  • Some pairs of variables might be reported as
    neither correlated nor uncorrelated
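
The effect of repeated testing can be made concrete with a small calculation. The sketch assumes that the 100 tests are independent of each other and that every tested pair is truly independent; the variable names are illustrative.

```python
alpha = 0.05   # chance that a single test wrongly rejects independence
n_tests = 100

# Expected number of wrong rejections over all tests.
expected_false_rejections = alpha * n_tests

# Probability that at least one of the tests rejects wrongly,
# assuming the tests are independent of each other.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"expected false rejections: {expected_false_rejections:.1f}")
print(f"P(at least one false rejection): {p_at_least_one:.3f}")
```

Since the algorithm tests every pair of variables, a few spurious C edges are to be expected even when the data contain no real dependence.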

27
From Association Rules to Causality
  • Limitations of Association Rules and the
    Support-Confidence Framework
  • Generalizing Association Rules to Correlations
  • Scalable Techniques for Mining Causal Structures
  • Applications of Correlation and Causality
  • Summary

28
Experiments (Census)
  • Correlation rules
  • Not a native English speaker <=> Not born in the
    U.S.
  • Served in the military <=> Male
  • Married <=> More than 40 years old
  • Causality Rules
  • Male => Moved Last 5 years, Support-Job
  • Native-Amer. => 20-40K => House Holder
  • Asian, Laborer => < 20K

29
Experiments (Text Data)
  • 416 distinct frequent words
  • 86,320 pairs of words, of which 10% are correlated
  Correlation                Causality Rules
  Nelson, Mandela            upi, not reuter
  area, province             Iraqi, Iraq
  area, secretary, war       united, states
  area, secretary, they      prime, minister
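
The number of word pairs on this slide follows from the number of distinct frequent words, since every unordered pair of words is tested once. A one-line check, using only the standard library:

```python
from math import comb

n_words = 416
n_pairs = comb(n_words, 2)  # unordered pairs: 416 * 415 / 2

print(n_pairs)
```

The quadratic growth of this count is what makes pairwise testing costly as the vocabulary grows.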

30
Beyond Correlation and Causality
  • Correlation and causality seem to be stronger
    mathematical models than confidence and support
  • It is possible to apply these concepts where
    confidence and support were previously applied

31
Association Rules with Constraints
  • Correlation can be seen as a monotone constraint
  • Algorithm obtained by modifying algorithms for
    mining constrained association rules

32
From Association Rules to Causality
  • Limitations of Association Rules and the
    Support-Confidence Framework
  • Generalizing Association Rules to Correlations
  • Scalable Techniques for Mining Causal Structures
  • Applications of Correlation and Causality
  • Summary

33
Conclusion (Good news)
  • Correlation and causality are stronger
    mathematical models to retrieve interesting
    association rules
  • They allow negative implications to be detected
  • Causality explains why there is a correlation

34
Conclusion (Bad news)
  • Difficult to precisely detect correlation
    (especially in sparse data cubes)
  • Not all causality relationships can be found
  • Are the results really better than with support
    and confidence?

35
Open Problems
  • How to discover hidden variables in causality
  • How to resolve bi-directional causality for
    disambiguation, e.g., prime => minister vs.
    minister => prime
  • How do we find causal patterns for more than 3
    variables

36
References
  • Papers
  • Beyond Market Baskets: Generalizing Association
    Rules to Correlations. Brin, Motwani,
    Silverstein. SIGMOD 97
  • Scalable Techniques for Mining Causal
    Structures. Silverstein, Brin, Motwani, Ullman.
    VLDB 98
  • Efficient Mining of Constrained Correlated Sets.
    Grahne, Lakshmanan, Wang. ICDE 2000
  • A Simple Constraint-Based Algorithm for
    Efficiently Mining Observational Databases for
    Causal Relationships. Cooper. Data Mining and
    Knowledge Discovery, vol. 1, 1997
  • Textbook
  • Causality: Models, Reasoning, and Inference.
    Judea Pearl. Cambridge University Press, 2000

37
From Association Rules To Causality
  • Questions