Relative Linkage Disequilibrium: An intersection between evolution, algebraic statistics, text mining and contingency tables



1
Relative Linkage Disequilibrium: An intersection
between evolution, algebraic statistics, text
mining and contingency tables
  • Ron S. Kenett
  • KPA Ltd., Raanana, Israel and Department of
    Statistics and Applied Mathematics "Diego de
    Castro", University of Torino, Italy
  • in collaboration with
  • Silvia Salini
  • Department of Economics, Business and Statistics,
    University of Milan, Italy

2
Outline
Graphical Displays
Evolution
Algebraic Statistics
Text Mining
Contingency Tables
3
Background 1930 - 1975
Evolution
Linkage Disequilibrium
The Fundamental Theorem of Natural Selection: the
rate of increase of fitness of any organism at any
time is equal to its genetic variance at that time
R.A. Fisher
Sam Karlin
  • FISHER, R. A., 1930, The Genetical Theory of
    Natural Selection. Clarendon, Oxford, U.K.
  • LEWONTIN, R., and KOJIMA, K., 1960, The
    evolutionary dynamics of complex polymorphisms.
    Evolution 14: 458-472.
  • KARLIN, S. and FELDMAN, M., 1970, Linkage and
    selection: Two locus symmetric viability model.
    Theoretical Population Biology 1: 39-71.
  • KARLIN, S., 1975, General two-locus models: some
    objectives, results and interpretations.
    Theoretical Population Biology 7: 364-398.
  • KARLIN, S. and KENETT, R., 1977, Variable
    Spatial Selection with Two Stages of Migration
    and Comparisons Between Different Timings,
    Theoretical Population Biology, 11, pp. 386-409.
4
Background - 1975
Contingency Tables
Discrete Multivariate Analysis: Theory and
Practice. The first comprehensive treatise on the
analysis of categorical data using loglinear and
related statistical models.
5
Background - 1983
Graphical Display
6
Background - 2008
Algebraic Statistics
7
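254 itemsets: the "e" relationship
Text Mining
  • 57 itemsets contain both "female" and "infection"
  • 40 itemsets contain "female" but not "infection"
  • 109 itemsets contain "infection" but not "female"
  • 48 itemsets contain neither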
8
The Data
Contingency Tables
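Table e (counts)
             RHS         not RHS
LHS          57          40
not LHS      109         48
Table e (proportions)
             RHS         not RHS
LHS          x1 = .224   x2 = .157
not LHS      x3 = .429   x4 = .189
Table d (proportions)
             RHS         not RHS
LHS          x1 = .389   x2 = .056
not LHS      x3 = .50    x4 = .056
Table h (proportions)
             RHS         not RHS
LHS          x1 = .057   x2 = .321
not LHS      x3 = .189   x4 = .434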
9
Contingency Tables
a man who has carefully investigated a printed
table, finds, when done, that he has only a very
faint and partial idea of what he has read and
that like a figure imprinted on sand, is soon
totally erased and defaced.
William Playfair (1786), The Commercial and
Political Atlas, from Edward R. Tufte (1983), The
Visual Display of Quantitative Information.
10
The Simplex
Graphical Display
[Figure: the simplex, with vertices labeled x1, x2, x3 and x4]
11
The Simplex
Graphical Display
12
Linkage Disequilibrium
Algebraic Statistics
Two loci, two alleles each, four genotypes: AB,
Ab, aB, ab
             RHS         not RHS
LHS          x1          x2
not LHS      x3          x4
D can be extended to more dimensions
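The formula for D did not survive in the slide text. As a minimal sketch, assuming the standard two-locus disequilibrium coefficient D = x1*x4 - x2*x3, the value can be computed for the three tables shown on the data slide. The function name `linkage_disequilibrium` is hypothetical.

```python
def linkage_disequilibrium(x1, x2, x3, x4):
    """Assumed standard definition: D = x1*x4 - x2*x3 for a 2x2 table
    of proportions laid out as [[x1, x2], [x3, x4]]."""
    return x1 * x4 - x2 * x3


# Cell proportions (x1, x2, x3, x4) as listed on the data slide.
tables = {
    "e": (0.224, 0.157, 0.429, 0.189),
    "d": (0.389, 0.056, 0.50, 0.056),
    "h": (0.057, 0.321, 0.189, 0.434),
}

for name, cells in tables.items():
    print(name, round(linkage_disequilibrium(*cells), 4))
```

D is zero exactly when rows and columns are independent, and under this definition all three example tables give a negative D.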
13
Linkage Disequilibrium
Algebraic Statistics
An algebraic observation
14
Relative Linkage Disequilibrium
Algebraic Statistics
D is the distance from the point corresponding
to the contingency table in the simplex to the
surface D = 0, in the e⊗e direction.
D_M is the distance from the point corresponding
to the contingency table on the surface D = 0 to
the surface of the simplex, in the same e⊗e
direction.
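The slide defines the two distances but the formula combining them is not in the surviving text. The sketch below is one reading of the geometry, under three labeled assumptions: D = x1*x4 - x2*x3, the direction is (1, -1, -1, 1) measured in units of D, and RLD is the ratio |D| / D_M. The function name is hypothetical.

```python
def relative_linkage_disequilibrium(x1, x2, x3, x4):
    """Sketch of RLD for a 2x2 table of proportions summing to 1.

    Assumptions (not confirmed by the slide text):
      - D = x1*x4 - x2*x3
      - the table equals the independence table plus D * (1, -1, -1, 1)
      - RLD = |D| / D_M, with D_M the distance from the surface D = 0
        to the boundary of the simplex along that direction
    """
    d = x1 * x4 - x2 * x3
    if d == 0:
        return 0.0
    # Moving further along +(1, -1, -1, 1) shrinks x2 and x3;
    # moving along -(1, -1, -1, 1) shrinks x1 and x4.
    # The boundary is reached when the smaller of the shrinking cells hits 0.
    slack = min(x2, x3) if d > 0 else min(x1, x4)
    d_max = abs(d) + slack
    return abs(d) / d_max


print(round(relative_linkage_disequilibrium(0.224, 0.157, 0.429, 0.189), 3))
```

Under these assumptions RLD lies between 0 (independence) and 1 (the table sits on the boundary of the simplex, for example a table with x2 = x3 = 0).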
15
Graphical Display
16
RLD Example
Contingency Tables
RLD
Table h
             RHS         not RHS
LHS          x1 = .057   x2 = .321
not LHS      x3 = .189   x4 = .434
Table e
             RHS         not RHS
LHS          x1 = .224   x2 = .157
not LHS      x3 = .429   x4 = .189
Table d
             RHS         not RHS
LHS          x1 = .389   x2 = .056
not LHS      x3 = .50    x4 = .056
17
Association Rules
Text Mining
  • Association rules are one of the most popular
    unsupervised data mining methods used in
    applications such as Market Basket Analysis, to
    measure the associations between products
    purchased by each consumer, or in web clickstream
    analysis, to measure the association between the
    pages seen (sequentially) by a visitor of a site.
  • Mining frequent itemsets and association rules is
    a popular and well researched method for
    discovering interesting relations between
    variables in large databases. The structure of
    the data to be analyzed is typically referred to
    as transactional.
  • Once obtained, the association rules extracted
    from a given dataset are compared in order to
    evaluate their importance. The measures commonly
    used to assess the strength of an association
    rule are the indexes of support, confidence,
    and lift.
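To connect transactional data with the 2x2 tables used elsewhere in the deck, here is a minimal sketch that cross-tabulates a rule's LHS and RHS over a list of transactions. The function `cross_tabulate` and the basket data are invented for illustration and do not come from the presentation.

```python
def cross_tabulate(transactions, lhs, rhs):
    """Count transactions for the rule lhs => rhs.

    Returns (n1, n2, n3, n4): both LHS and RHS, LHS only, RHS only, neither.
    """
    lhs, rhs = set(lhs), set(rhs)
    n1 = n2 = n3 = n4 = 0
    for transaction in transactions:
        items = set(transaction)
        has_lhs = lhs <= items  # every LHS item is in the transaction
        has_rhs = rhs <= items  # every RHS item is in the transaction
        if has_lhs and has_rhs:
            n1 += 1
        elif has_lhs:
            n2 += 1
        elif has_rhs:
            n3 += 1
        else:
            n4 += 1
    return n1, n2, n3, n4


# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
    {"butter"},
]

print(cross_tabulate(baskets, {"bread"}, {"milk"}))
```

Dividing the four counts by the number of transactions gives the proportions x1 to x4 of the corresponding contingency table.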

18
Support
Text Mining
  • The support for a rule A => B is obtained by
    dividing the number of transactions which satisfy
    the rule, N(A => B), by the total number of
    transactions, N.
  • support(A => B) = N(A => B) / N

19
Support
Text Mining
The higher the support, the stronger the
information that both types of events occur
together.
support(A => B) = N(A => B) / N = x1
             RHS         not RHS
LHS          x1          x2          g
not LHS      x3          x4          1-g
             f           1-f         1
20
Confidence
Text Mining
  • The confidence of the rule A => B is obtained by
    dividing the number of transactions which satisfy
    the rule, N(A => B), by the number of transactions
    which contain the body of the rule, N(A).
  • confidence(A => B) = N(A => B) / N(A)

21
Confidence
Text Mining
A high confidence that the LHS event leads to the
RHS event suggests causation or statistical
dependence.
confidence(A => B) = N(A => B) / N(A) = x1/g
             RHS         not RHS
LHS          x1          x2          g
not LHS      x3          x4          1-g
             f           1-f         1
22
Lift
Text Mining
  • The lift of the rule A => B is the deviation of
    the support of the whole rule from the support
    expected under independence given the supports of
    the LHS (A) and the RHS (B).
  • lift(A => B) = confidence(A => B) / support(B)
  • = support(A => B) / (support(A) support(B))
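In the table notation of the Support and Confidence slides (x1 for the joint proportion, g and f for the LHS and RHS margins), lift can be written directly in terms of the cells. The second line is a sketch that assumes the standard disequilibrium coefficient D = x1 x4 - x2 x3, which is not spelled out in the surviving slide text; under that assumption x1 = g f + D whenever the four proportions sum to 1.

```latex
\begin{align*}
\mathrm{lift}(A \Rightarrow B)
  &= \frac{\mathrm{support}(A \Rightarrow B)}{\mathrm{support}(A)\,\mathrm{support}(B)}
   = \frac{x_1}{g f},\\
x_1 = g f + D
  \quad&\Longrightarrow\quad
  \mathrm{lift}(A \Rightarrow B) = 1 + \frac{D}{g f}.
\end{align*}
```

Read this way, lift exceeds 1 exactly when D is positive and falls below 1 when D is negative.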

23
Relative Linkage Disequilibrium and other measures
Contingency Tables
RLD
Table e (counts)
             RHS         not RHS
LHS          57          40
not LHS      109         48
Sup = 57/254 = .224
Conf = 57/97 = .588
Sup (RHS) = 166/254 = .654
lift = .588/.654 = .90
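The arithmetic on this slide can be reproduced from the four cell counts. The following is a small sketch; the function name `rule_measures` is hypothetical, and it does not guard against empty rows or columns.

```python
def rule_measures(n1, n2, n3, n4):
    """Support, confidence, RHS support and lift from 2x2 counts.

    n1: LHS and RHS, n2: LHS only, n3: RHS only, n4: neither.
    """
    n = n1 + n2 + n3 + n4
    support = n1 / n                # N(A => B) / N
    confidence = n1 / (n1 + n2)     # N(A => B) / N(A)
    support_rhs = (n1 + n3) / n     # N(B) / N
    lift = confidence / support_rhs
    return support, confidence, support_rhs, lift


sup, conf, sup_rhs, lift = rule_measures(57, 40, 109, 48)
print(round(sup, 3), round(conf, 3), round(sup_rhs, 3), round(lift, 2))
```

A lift below 1, as here, means the LHS and RHS occur together less often than independence would predict.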
24
The groceries example
Association Rules

First 20 rules for groceries data, sorted by Lift
25
The groceries example
Association Rules
For Lift > 2.5, RLD varies between 1-40
[Figure: Plot of Relative Disequilibrium versus Lift for the 430 rules of groceries data set; axes: RLD, Lift]
26
The groceries example
Association Rules
27
The groceries example
Association Rules
RLD shows more variability than support
[Figure: Plot of Relative Disequilibrium versus Support for the 430 rules of groceries data set; axes: RLD, Support]
28
The groceries example
Association Rules
For RLD of 20, Confidence varies between 1-40
[Figure: Plot of Relative Disequilibrium versus Confidence for the 430 rules of groceries data set; axes: RLD, Confidence]
29
The groceries example
Association Rules

First 20 rules for groceries data, sorted by Lift
30
The groceries example
Association Rules
For Lift > 2.5, RLD varies between 1-40
[Figure: Plot of Relative Disequilibrium versus Lift for the 430 rules of groceries data set; axes: RLD, Lift]
31
The groceries example
Association Rules
[Figure: Plot of RLD versus Chi-squared for the top 20 rules of groceries data set, sorted by RLD; axes: RLD, Chi-squared]
32
The groceries example
Association Rules
[Figure: Plot of RLD versus Odds Ratio for the top 20 rules of groceries data set, sorted by RLD; axes: RLD, Odds Ratio]
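For reference, the two comparison measures in these plots can be computed from a 2x2 table of counts. The sketch below uses the textbook Pearson chi-squared statistic (without continuity correction) and the cross-product odds ratio. It is applied to the "e" table from earlier in the deck, not to the groceries rules, whose counts are not in the transcript. Function names are hypothetical.

```python
def chi_squared(n1, n2, n3, n4):
    """Pearson chi-squared statistic for a 2x2 table of counts,
    without continuity correction."""
    n = n1 + n2 + n3 + n4
    numerator = n * (n1 * n4 - n2 * n3) ** 2
    denominator = (n1 + n2) * (n3 + n4) * (n1 + n3) * (n2 + n4)
    return numerator / denominator


def odds_ratio(n1, n2, n3, n4):
    """Cross-product odds ratio; undefined when n2 or n3 is zero."""
    return (n1 * n4) / (n2 * n3)


counts_e = (57, 40, 109, 48)
print(round(chi_squared(*counts_e), 3), round(odds_ratio(*counts_e), 3))
```

Both measures depend on the same cross-product difference n1*n4 - n2*n3 that, after dividing by N squared, gives D.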
33
Summary
  • RLD is intuitive
  • RLD yields different answers from usual
    measures
  • RLD can be extended to higher dimensions
  • There are opportunities in considering the
    relationship between Association Rules and
    Contingency Tables

34
Graphical Displays
Evolution
Algebraic Statistics
Text Mining
Contingency Tables
35
References
  • Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W.
    (1975). Discrete Multivariate Analysis: Theory
    and Practice. M.I.T. Press, Cambridge, MA.
    Paperback edition (1977). Reprinted by
    Springer-Verlag, New York (2007).
  • Fisher, R.A. (1930). The Genetical Theory of
    Natural Selection. Clarendon, Oxford, U.K.
  • Hahsler, M., Grün, B., and Hornik, K. (2005).
    arules: A computational environment for mining
    association rules and frequent item sets. Journal
    of Statistical Software, 14(15): 1-25. ISSN
    1548-7660. URL http://www.jstatsoft.org/v14/i15/.
  • Karlin, S. and Feldman, M. (1970). Linkage and
    selection: Two locus symmetric viability model.
    Theoretical Population Biology, 1: 39-71.
  • Karlin, S. and Kenett, R.S. (1977). Variable
    Spatial Selection with Two Stages of Migration
    and Comparisons Between Different Timings.
    Theoretical Population Biology, 11, pp. 386-409.
  • Kenett, R. (1983). On an Exploratory Analysis of
    Contingency Tables. The Statistician, 32, pp.
    395-403.
  • Kenett, R. and Salini, S. (2008). Relative
    Linkage Disequilibrium: A New measure for
    association rules. UNIMI - Research Papers in
    Economics, Business, and Statistics.
    http://services.bepress.com/unimi/statistics/art32/
  • Lewontin, R.C., and Kojima, K. (1960). The
    evolutionary dynamics of complex polymorphisms.
    Evolution, 14: 458-472.
  • Omiecinski, E. (2003). Alternative interest
    measures for mining associations in databases.
    IEEE Transactions on Knowledge and Data
    Engineering, 15(1): 57-69.
  • Piatetsky-Shapiro, G. (1991). Discovery,
    analysis, and presentation of strong rules. In
    Knowledge Discovery in Databases, pages 229-248.
  • Shimada, K., Hirasawa, K., and Hu, J. (2006).
    Association Rule Mining with Chi-Squared Test
    Using Alternate Genetic Network Programming.
    ICDM 2006.
  • Tan, P.-N., Kumar, V., and Srivastava, J. (2004).
    Selecting the right objective measure for
    association analysis. Information Systems,
    29(4): 293-313.

36
Backup slides
37
The telecom systems example
Association Rules
38
The telecom systems example
Association Rules
Item Frequency Plot (Support > 0.1) of telecom
data set
39
The telecom systems example
Association Rules
3D Simplex Representation for 200 rules of
telecom data set and for the top 10 rules sorted
by RLD
40
The telecom systems example
Association Rules
2D Simplex Representation for the top 10 rules
sorted by RLD of telecom data set
41
The telecom systems example
Association Rules
Top 10 rules sorted by RLD of telecom data set
42
RLD Statistical Properties
Contingency Tables

43
RLD Statistical Properties
Contingency Tables