Title: Relative Linkage Disequilibrium: An intersection between evolution, algebraic statistics, text mining and contingency tables
1Relative Linkage Disequilibrium: An intersection
between evolution, algebraic statistics, text
mining and contingency tables
- Ron S. Kenett
- KPA Ltd., Raanana, Israel and Department of
Statistics and Applied Mathematics "Diego de
Castro", University of Torino, Italy
- in collaboration with
- Silvia Salini
- Department of Economics, Business and Statistics,
University of Milan, Italy
2Outline
Graphical Displays
Evolution
Algebraic Statistics
Text Mining
Contingency Tables
3Background 1930 - 1975
Evolution
Linkage Disequilibrium
The Fundamental Theorem of Natural Selection:
the rate of increase of fitness of any organism
at any time is equal to its genetic variance at
that time
R.A. Fisher
Sam Karlin
- FISHER, R. A., 1930, The Genetical Theory of
Natural Selection. Clarendon, Oxford, U.K.
- LEWONTIN, R., and KOJIMA, K., 1960, The
evolutionary dynamics of complex polymorphisms.
Evolution 14: 458-472.
- KARLIN, S. and FELDMAN, M., 1970, Linkage and
selection: Two locus symmetric viability model.
Theoretical Population Biology 1: 39-71.
- KARLIN, S., 1975, General two-locus models: some
objectives, results and interpretations.
Theoretical Population Biology 7: 364-398.
- KARLIN, S. and KENETT, R., 1977, Variable Spatial
Selection with Two Stages of Migration and
Comparisons Between Different Timings,
Theoretical Population Biology, 11, pp. 386-409.
4Background - 1975
Contingency Tables
Discrete Multivariate Analysis: Theory and
Practice. The first comprehensive treatise on the
analysis of categorical data using loglinear and
related statistical models
5Background - 1983
Graphical Display
6Background - 2008
Algebraic Statistics
7 254 itemsets: The e relationship
Text Mining
57    .. female..infection
40    .. female..
109   .. ...infection
48    .....
8The Data
Contingency Tables
e (counts)   RHS         ¬RHS
LHS          57          40
¬LHS         109         48

e            RHS         ¬RHS
LHS          x1 = .224   x2 = .157
¬LHS         x3 = .429   x4 = .189

d            RHS         ¬RHS
LHS          x1 = .389   x2 = .056
¬LHS         x3 = .50    x4 = .056

h            RHS         ¬RHS
LHS          x1 = .057   x2 = .321
¬LHS         x3 = .189   x4 = .434
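The relative frequencies of the e table follow from dividing each cell count by the table total, 254. A minimal Python sketch of that step; the function name `relative_frequencies` is illustrative and does not come from the slides:

```python
def relative_frequencies(counts):
    """Convert the four cell counts of a 2x2 table into proportions.

    counts is (n1, n2, n3, n4) in the order
    (LHS and RHS, LHS and not RHS, not LHS and RHS, not LHS and not RHS).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("the table must contain at least one observation")
    return tuple(n / total for n in counts)


# Cell counts of the e table (female / infection itemsets) from the slides.
e_counts = (57, 40, 109, 48)
x1, x2, x3, x4 = relative_frequencies(e_counts)
print([round(x, 3) for x in (x1, x2, x3, x4)])  # [0.224, 0.157, 0.429, 0.189]
```

The four proportions sum to 1, so each table corresponds to a point in the simplex.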
9Contingency Tables
a man who has carefully investigated a printed
table, finds, when done, that he has only a very
faint and partial idea of what he has read and
that like a figure imprinted on sand, is soon
totally erased and defaced.
William Playfair (1786), The Commercial and
Political Atlas, from Edward R. Tufte (1983), The
Visual Display of Quantitative Information.
10The Simplex
Graphical Display
Figure: simplex display with labels x1, x2, x3, x4
11The Simplex
Graphical Display
12Linkage Disequilibrium
Two loci, two alleles each, four genotypes: AB,
Ab, aB, ab
Algebraic Statistics
             RHS   ¬RHS
LHS          x1    x2
¬LHS         x3    x4
D can be extended to more dimensions
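The formula for D did not survive in the slide text. The sketch below assumes the standard two-locus definition, D = x1*x4 - x2*x3, and checks the equivalent form D = x1 - f*g, where g = x1 + x2 and f = x1 + x3 are the LHS and RHS margins. Both the definition and the function name are assumptions made for illustration:

```python
def linkage_disequilibrium(x1, x2, x3, x4):
    """Return D = x1*x4 - x2*x3 for a 2x2 table of proportions.

    Assumption: the standard two-locus definition of D.
    """
    return x1 * x4 - x2 * x3


# Proportions of the e table (counts 57, 40, 109, 48 out of 254).
n = 254
x = tuple(c / n for c in (57, 40, 109, 48))
d = linkage_disequilibrium(*x)

g = x[0] + x[1]  # LHS margin
f = x[0] + x[2]  # RHS margin

# D also equals the observed x1 minus its value under independence, f*g.
print(round(d, 4), round(x[0] - f * g, 4))
```

For the e table D is negative: the LHS and RHS events co-occur slightly less often than independence would predict.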
13 Linkage Disequilibrium
Algebraic Statistics
An algebraic observation
14 Relative Linkage Disequilibrium
Algebraic Statistics
D is the distance from the point corresponding
to the contingency table in the simplex, to the
surface D = 0 in the e⊗e direction.
DM is the distance from the point corresponding
to the contingency table on the surface D = 0 in
the e⊗e direction, to the surface of the simplex,
in that direction.
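The two distances above can be turned into a ratio. The sketch below is one reading of the slide, not the authors' code: it assumes D = x1*x4 - x2*x3, takes the direction to be (1, -1, -1, 1), and reports RLD = |D| / DM as a value between 0 and 1. The function name and the unsigned convention are assumptions:

```python
def relative_linkage_disequilibrium(x1, x2, x3, x4):
    """Return (D, RLD) for a 2x2 table of proportions.

    Assumptions: D = x1*x4 - x2*x3, direction (1, -1, -1, 1),
    and RLD reported as the unsigned ratio |D| / DM.
    """
    d = x1 * x4 - x2 * x3
    if d == 0:
        # The table lies on the independence surface D = 0.
        return 0.0, 0.0
    if d > 0:
        # Moving further along (1, -1, -1, 1) shrinks x2 and x3,
        # so the simplex boundary is reached when the smaller one hits 0.
        d_m = d + min(x2, x3)
    else:
        # Moving along (-1, 1, 1, -1) shrinks x1 and x4 instead.
        d_m = -d + min(x1, x4)
    return d, abs(d) / d_m


# e table from the slides: counts 57, 40, 109, 48 out of 254.
n = 254
d, rld = relative_linkage_disequilibrium(57 / n, 40 / n, 109 / n, 48 / n)
print(round(d, 4), round(rld, 3))
```

Under this reading RLD is 0 for an independent table and 1 when the table sits on the boundary of the simplex in the direction of its disequilibrium.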
15Graphical Display
16RLD Example
Contingency Tables
RLD
h            RHS         ¬RHS
LHS          x1 = .057   x2 = .321
¬LHS         x3 = .189   x4 = .434

e            RHS         ¬RHS
LHS          x1 = .224   x2 = .157
¬LHS         x3 = .429   x4 = .189

d            RHS         ¬RHS
LHS          x1 = .389   x2 = .056
¬LHS         x3 = .50    x4 = .056
17Association Rules
Text Mining
- Association rules are one of the most popular
unsupervised data mining methods, used in
applications such as Market Basket Analysis, to
measure the associations between products
purchased by each consumer, or in web clickstream
analysis, to measure the association between the
pages seen (sequentially) by a visitor to a site.
- Mining frequent itemsets and association rules is
a popular and well researched method for
discovering interesting relations between
variables in large databases. The structure of
the data to be analyzed is typically referred to
as transactional.
- Once obtained, the association rules extracted
from a given dataset are compared in order to
evaluate their importance. The measures commonly
used to assess the strength of an association
rule are the indexes of support, confidence, and
lift.
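To connect transactional data with contingency tables, each transaction can be cross-classified by whether it contains the rule's left-hand side and its right-hand side. A minimal sketch with made-up baskets; the function name `cross_classify` and the toy data are hypothetical and not taken from the slides:

```python
def cross_classify(transactions, lhs, rhs):
    """Count transactions in the four cells of the 2x2 table for a rule.

    Returns (n1, n2, n3, n4) in the order
    (LHS and RHS, LHS and not RHS, not LHS and RHS, not LHS and not RHS).
    """
    lhs, rhs = set(lhs), set(rhs)
    counts = [0, 0, 0, 0]
    for transaction in transactions:
        items = set(transaction)
        has_lhs = lhs <= items  # every LHS item is in the transaction
        has_rhs = rhs <= items  # every RHS item is in the transaction
        index = (0 if has_lhs else 2) + (0 if has_rhs else 1)
        counts[index] += 1
    return tuple(counts)


# Hypothetical market baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter"},
]
print(cross_classify(baskets, {"bread"}, {"milk"}))  # (2, 1, 1, 1)
```

Support, confidence and lift are all functions of these four counts.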
18Support
Text Mining
- The support for a rule A => B is obtained by
dividing the number of transactions which satisfy
the rule, N_{A=>B}, by the total number of
transactions, N.
- support(A => B) = N_{A=>B} / N
19Support
Text Mining
The higher the support, the stronger the
information that both types of events occur
together.
support(A => B) = N_{A=>B} / N = x1
             RHS   ¬RHS
LHS          x1    x2     g
¬LHS         x3    x4     1-g
             f     1-f    1
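As a quick check of the definition, support can be computed from the counts of the female / infection table (57 itemsets satisfy the rule, out of 254). The function name is illustrative:

```python
def support(n_rule, n_total):
    """Support of a rule: share of all transactions that satisfy it."""
    return n_rule / n_total


# 57 of the 254 itemsets contain both the LHS and the RHS.
sup = support(57, 254)
print(round(sup, 3))  # 0.224
```

The result is the cell x1 of the table of relative frequencies.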
20Confidence
Text Mining
- The confidence of the rule A => B is obtained by
dividing the number of transactions which satisfy
the rule, N_{A=>B}, by the number of transactions
which contain the body of the rule, A.
- confidence(A => B) = N_{A=>B} / N_A
21Confidence
Text Mining
A high confidence that the LHS event leads to the
RHS event suggests causation or statistical
dependence.
confidence(A => B) = N_{A=>B} / N_A = x1/g
             RHS   ¬RHS
LHS          x1    x2     g
¬LHS         x3    x4     1-g
             f     1-f    1
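Confidence conditions on the LHS row of the table. A short sketch using the female / infection counts (57 itemsets with both events, 40 with the LHS only); the function name is illustrative:

```python
def confidence(n_rule, n_lhs):
    """Confidence of a rule: share of LHS transactions that also contain the RHS."""
    return n_rule / n_lhs


n_rule = 57           # LHS and RHS
n_lhs = 57 + 40       # all transactions containing the LHS
conf = confidence(n_rule, n_lhs)

# Same quantity from proportions: x1 / g, with g the LHS margin.
x1, g = 57 / 254, (57 + 40) / 254
print(round(conf, 3), round(x1 / g, 3))  # 0.588 0.588
```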
22Lift
Text Mining
- The lift of the rule A => B is the deviation of
the support of the whole rule from the support
expected under independence given the supports of
the LHS (A) and the RHS (B).
- lift(A => B) = confidence(A => B) / support(B)
  = support(A => B) / (support(A) support(B))
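The two expressions for lift can be checked against each other on the female / infection counts. A minimal sketch; the function name and argument order are illustrative:

```python
def lift(n_rule, n_lhs, n_rhs, n_total):
    """Lift of a rule: confidence divided by the support of the RHS."""
    confidence = n_rule / n_lhs
    support_rhs = n_rhs / n_total
    return confidence / support_rhs


# Counts from the slides: 57 with both, 97 with LHS, 166 with RHS, 254 in all.
value = lift(57, 57 + 40, 57 + 109, 254)

# Equivalent form: support(rule) / (support(LHS) * support(RHS)).
x1, g, f = 57 / 254, 97 / 254, 166 / 254
print(round(value, 2), round(x1 / (g * f), 2))  # 0.9 0.9
```

A lift below 1 means the rule holds less often than independence of the LHS and RHS would predict.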
23Relative Linkage Disequilibrium and other measures
Contingency Tables
RLD
e (counts)   RHS   ¬RHS
LHS          57    40
¬LHS         109   48
Sup = 57/254 = .224
Conf = 57/97 = .588
Sup(RHS) = 166/254 = .654
Lift = .588/.654 = .90
24The groceries example
Association Rules
First 20 rules for groceries data, sorted by Lift
25The groceries example
Association Rules
For Lift > 2.5, RLD varies between 1-40
Figure: Plot of Relative Disequilibrium versus Lift for
the 430 rules of groceries data set (axes: RLD, Lift)
26The groceries example
Association Rules
27The groceries example
Association Rules
RLD shows more variability than support
Figure: Plot of Relative Disequilibrium versus Support
for the 430 rules of groceries data set (axes: RLD,
Support)
28The groceries example
Association Rules
For RLD of 20, Confidence varies between 1-40
Figure: Plot of Relative Disequilibrium versus Confidence
for the 430 rules of groceries data set (axes: RLD,
Confidence)
31The groceries example
Association Rules
Figure: Plot of RLD versus Chi-square for the top 20
rules of groceries data set, sorted by RLD (axes: RLD,
ChiSquared)
32The groceries example
Association Rules
Figure: Plot of RLD versus Odds Ratio for the top 20
rules of groceries data set, sorted by RLD (axes: RLD,
Odds Ratio)
33Summary
- RLD is intuitive
- RLD yields different answers from usual
measures
- RLD can be extended to higher dimensions
- There are opportunities in considering the
relationship between Association Rules and
Contingency Tables
34Graphical Displays
Evolution
Algebraic Statistics
Text Mining
Contingency Tables
35References
- Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W.
(1975). Discrete Multivariate Analysis: Theory
and Practice. M.I.T. Press, Cambridge, MA.
Paperback edition (1977). Reprinted by
Springer-Verlag, New York (2007).
- Fisher, R.A. (1930). The Genetical Theory of
Natural Selection. Clarendon, Oxford, U.K.
- Hahsler, M., Grün, B., and Hornik, K. (2005).
arules: A computational environment for mining
association rules and frequent item sets. Journal
of Statistical Software, 14(15):1-25. ISSN
1548-7660. URL http://www.jstatsoft.org/v14/i15/.
- Karlin, S. and Feldman, M. (1970). Linkage and
selection: Two locus symmetric viability model.
Theoretical Population Biology 1: 39-71.
- Karlin, S. and Kenett, R.S. (1977). Variable
Spatial Selection with Two Stages of Migration
and Comparisons Between Different Timings.
Theoretical Population Biology, 11, pp. 386-409.
- Kenett, R. (1983). On an Exploratory Analysis of
Contingency Tables. The Statistician, 32, pp.
395-403.
- Kenett, R. and Salini, S. (2008). Relative
Linkage Disequilibrium: A New measure for
association rules. UNIMI - Research Papers in
Economics, Business, and Statistics.
http://services.bepress.com/unimi/statistics/art32/
- Lewontin, R.C., and Kojima, K. (1960). The
evolutionary dynamics of complex polymorphisms.
Evolution 14: 458-472.
- Omiecinski, E. (2003). Alternative interest
measures for mining associations in databases.
IEEE Transactions on Knowledge and Data
Engineering, 15(1):57-69.
- Piatetsky-Shapiro, G. (1991). Discovery,
analysis, and presentation of strong rules. In
Knowledge Discovery in Databases, pages 229-248.
- Shimada, K., Hirasawa, K., and Hu, J. (2006).
Association Rule Mining with Chi-Squared Test
Using Alternate Genetic Network Programming,
ICDM2006.
- Tan, P-N., Kumar, V., and Srivastava, J. (2004).
Selecting the right objective measure for
association analysis. Information Systems,
29(4):293-313.
36Backup slides
37The telecom systems example
Association Rules
38The telecom systems example
Association Rules
Item Frequency Plot (Support > 0.1) of telecom
data set
39The telecom systems example
Association Rules
3D Simplex Representation for 200 rules of
telecom data set and for the top 10 rules sorted
by RLD
40The telecom systems example
Association Rules
2D Simplex Representation for the top 10 rules
sorted by RLD of telecom data set
41The telecom systems example
Association Rules
Top 10 rules sorted by RLD of telecom data set
42RLD Statistical Properties
Contingency Tables
43RLD Statistical Properties
Contingency Tables