Title: Testing statistical hypotheses: The Chi-square test and how Sir R.A. Fisher caught Mendel cheating
1 Testing statistical hypotheses: The Chi-square test and how Sir R.A. Fisher caught Mendel cheating
Algorithmic Foundations of Computational
Biology Professor Istrail
2 The Chi-Square Test
- "How well does it fit the facts?" In many cases this question can be answered by the $\chi^2$-test.
- The test was invented in 1900 by Karl Pearson.
- The test is used when there are more than two categories of data.
- For example: the probabilities of A, C, G, T in two DNA sequences, to check whether these categories are equally likely.
3 A gambler is accused of using a loaded die
- But he pleads innocent. A record has been kept for the last 60 throws:
    4 3 3 1 2 3 4 6 5 6
    2 4 1 3 3 5 3 4 3 4
    3 3 4 5 4 5 6 4 5 1
    6 4 4 2 3 3 2 4 4 5
    6 3 6 2 4 6 4 6 3 2
    5 4 6 3 3 3 5 3 1 4
- If the gambler is innocent, the numbers in the table should be like 60 random drawings with replacement from a box containing 1, 2, 3, 4, 5, 6. Each number should show up about 10 times.
- The expected frequency is 10.
4 Observed Frequencies
Value   Observed Freq   Expected Freq
1       4               10
2       6               10
3       17              10
4       16              10
5       8               10
6       9               10
Sum     60              60
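The observed column can be reproduced by tallying the 60 recorded throws. The following is a minimal sketch; the variable names (`throws`, `observed`, `expected`) are illustrative and not part of the original slides.

```python
from collections import Counter

# The 60 recorded throws, row by row, copied from the gambler's record.
throws = [
    4, 3, 3, 1, 2, 3, 4, 6, 5, 6,
    2, 4, 1, 3, 3, 5, 3, 4, 3, 4,
    3, 3, 4, 5, 4, 5, 6, 4, 5, 1,
    6, 4, 4, 2, 3, 3, 2, 4, 4, 5,
    6, 3, 6, 2, 4, 6, 4, 6, 3, 2,
    5, 4, 6, 3, 3, 3, 5, 3, 1, 4,
]

counts = Counter(throws)
observed = [counts[face] for face in range(1, 7)]
# For a fair die every face is expected 60 / 6 = 10 times.
expected = [len(throws) / 6] * 6

print("Value  Observed  Expected")
for face, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"{face:>5}  {obs:>8}  {exp:>8.0f}")
print(f"{'Sum':>5}  {sum(observed):>8}  {sum(expected):>8.0f}")
```

The tally gives 4, 6, 17, 16, 8, 9 for faces 1 through 6, matching the table.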
5 The $\chi^2$ statistic
$$\chi^2 = \text{sum of } \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}$$
When the observed frequency is far from the expected frequency, the corresponding term in the sum is large; when the two are close, the term is small. Large values of $\chi^2$ indicate that the observed and expected frequencies are far apart. Small values of $\chi^2$ mean the opposite: observed are close to expected. So $\chi^2$ is a measure of the distance between observed and expected.
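As a worked instance of the statistic, here is a small sketch applied to the gambler's frequency table. The function name `chi_square` is an illustrative choice, not something from the slides.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [4, 6, 17, 16, 8, 9]   # observed frequencies for faces 1..6
expected = [10] * 6               # a fair die: 10 of each face in 60 throws

# One term per category: large where observed is far from expected.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
stat = chi_square(observed, expected)

print(terms)            # [3.6, 1.6, 4.9, 3.6, 0.4, 0.1]
print(round(stat, 1))   # 14.2
```

The faces that are furthest from 10 (face 3 with 17 throws, then faces 1 and 4) contribute most of the total.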
6 The P-value: the observed significance level
- We need to know the chance that, when a fair die is rolled 60 times and $\chi^2$ is computed from the observed frequencies, its value turns out to be 14.2 (the value computed from the gambler's table) or more.
- The answer: P ≈ 1.4%. That is, if the die is fair, there is a 1.4% chance for the $\chi^2$ statistic to be as big as or bigger than the observed one.
- Conclusion: The gambler is in trouble!!!
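The chance described here can also be estimated directly by simulation, without any curve: roll a fair die 60 times, many times over, and count how often the statistic reaches 14.2. This is only a sketch under stated assumptions; the function name, the number of trials and the seed are illustrative, and the estimate fluctuates a little from run to run if the seed is changed.

```python
import random

def simulate_p_value(trials=10_000, n_throws=60, threshold=142, seed=0):
    """Fraction of simulated fair-die experiments at least as extreme as the record."""
    rng = random.Random(seed)      # fixed seed so the sketch is repeatable
    expected = n_throws // 6       # 10 throws expected per face
    hits = 0
    for _trial in range(trials):
        counts = [0] * 6
        for _throw in range(n_throws):
            counts[rng.randrange(6)] += 1
        # chi-square >= 14.2 is the same as "sum of squared deviations >= 142",
        # because every expected frequency is 10; integers avoid rounding issues.
        if sum((c - expected) ** 2 for c in counts) >= threshold:
            hits += 1
    return hits / trials

p_estimate = simulate_p_value()
print(f"estimated P-value: {p_estimate:.3f}")
```

The estimate should land in the neighborhood of the 1.4% quoted above; it will not match exactly, since the simulation is random and the curve is itself an approximation.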
7 Degrees of freedom
- Pearson invented the $\chi^2$ curves: one curve for each number of degrees of freedom.
- In our case, the model is fully specified, i.e., there is no parameter to estimate from data, so
    degrees of freedom = number of terms in $\chi^2$ - 1
- In the gambler problem, degrees of freedom = 6 - 1 = 5.
8 The $\chi^2$-test P-value
- For the $\chi^2$-test, the P-value is approximately equal to the area to the right of the observed value of the $\chi^2$ statistic, under the $\chi^2$-curve with the appropriate number of degrees of freedom.
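To make "area to the right" concrete, the sketch below integrates the standard chi-square density numerically with the midpoint rule. The function names and the step count are illustrative assumptions; the midpoint rule is used only for transparency and is adequate for 2 or more degrees of freedom, not a recommendation over a statistics library.

```python
import math

def chi2_density(x, df):
    """Height of the chi-square curve with `df` degrees of freedom at x > 0."""
    k = df / 2
    return x ** (k - 1) * math.exp(-x / 2) / (2 ** k * math.gamma(k))

def area_to_the_right(value, df, steps=20_000):
    """1 minus the area between 0 and `value`, computed with the midpoint rule."""
    width = value / steps
    left = sum(chi2_density((i + 0.5) * width, df) for i in range(steps)) * width
    return 1 - left

p = area_to_the_right(14.2, 5)
print(f"{p:.3f}")   # 0.014
```

For the gambler's statistic of 14.2 with 5 degrees of freedom, the area is about 1.4%.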
9 P = area under curve
[Figure: $\chi^2$ curve with 5 degrees of freedom; P = brown area under the curve to the right of 14.2]
10 Rule of thumb
- The approximation given by the $\chi^2$ curve can be trusted when the expected frequency in each line of the table is 5 or more.
Summary for the $\chi^2$-test
- The basic data: N observations of a random process
- The frequency table is computed
- The $\chi^2$-statistic: the formula is used to sum things up
- The degrees of freedom
- The observed significance level, using the $\chi^2$ curve
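The steps of the summary can be strung together as in the following sketch, which also checks the rule of thumb. All names here are illustrative. The right-tail area uses the closed-form expression that exists for whole-number degrees of freedom; that is a substitution made for this example, not the table lookup the slides have in mind.

```python
import math

def chi2_right_tail(x, df):
    """Area to the right of x under the chi-square curve (integer df >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
        term = math.exp(-half)
        total = term
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of correction terms.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2 * x / math.pi) * math.exp(-half)
    for j in range((df - 1) // 2):
        total += term
        term *= x / (2 * j + 3)
    return total

def chi_square_test(observed, probabilities):
    """Frequency table -> statistic, degrees of freedom, P-value, rule-of-thumb check."""
    n = sum(observed)                               # N observations
    expected = [n * p for p in probabilities]       # expected frequencies
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                          # fully specified model
    return {
        "statistic": stat,
        "df": df,
        "p_value": chi2_right_tail(stat, df),
        "rule_of_thumb_ok": all(e >= 5 for e in expected),
    }

result = chi_square_test([4, 6, 17, 16, 8, 9], [1 / 6] * 6)
print(result)
```

For the gambler's table this reports a statistic of about 14.2 with 5 degrees of freedom, a P-value of about 1.4%, and expected frequencies that all satisfy the rule of thumb.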
11 Is Mendel's experimental data too good to be true? Yes!
- In 1865 Gregor Mendel published an article in which he provided a scientific explanation for heredity, and eventually caused a revolution in biology.
- Mendel's experiments were all performed on garden peas. Pea seeds are either yellow or green. Color is a property of the seed.
- Mendel bred a pure yellow strain, that is, a strain in which every plant in every generation had only yellow seeds; separately, he bred a pure green strain.
12 Yellow and Green peas
- He then crossed plants of the pure yellow strain with plants of the pure green strain.
- The seeds resulting from a yellow-green cross, and the plants they grow into, are called first-generation hybrids.
- First-generation hybrid seeds are all yellow, indistinguishable from seeds of the pure yellow strain. The green seems to have disappeared completely.
- These first-generation hybrid seeds grew into first-generation hybrid plants, which Mendel crossed with themselves, producing second-generation hybrid seeds. Some of these second-generation seeds were yellow, but some were green.
- So the green disappeared for one generation but reappeared in the second. Even more surprising, the green reappeared in a simple proportion.
- Of the second-generation hybrids, about 75% were yellow and 25% were green.
13 Factors, aka genes
- To explain it, Mendel postulated the existence of factors, later called genes.
- According to Mendel's theory, there were two different variants of a gene, denoted Y and G, which paired up to control seed color. It is the gene-pair in the seed, not the parent, which determines what color the seed will be; all the cells making up a seed contain the same gene-pair.
14 Y is dominant
- There are four different gene-pairs:
    Y/Y, Y/G, G/Y, G/G
- Gene-pairs control seed color by the rule:
    Y/Y, Y/G, G/Y make yellow
    G/G makes green
- As geneticists say,
    Y is dominant and
    G is recessive
15 Randomness
- A seed grows and becomes a plant. All cells in this plant also carry the seed's color gene-pair, with one exception: sex cells, either sperm or eggs, contain only one gene of the pair.
- For example, a plant whose ordinary gene-pair is Y/Y will produce sperm cells each containing a gene Y; similarly, it will produce egg cells each containing a gene Y.
- A plant whose gene-pair is Y/G will produce half of its sperm cells containing Y and half containing G. The same is true of the egg cells.
16 First generation: model explanation
- Plants of the pure yellow strain have the color gene-pair Y/Y.
- Plants of the pure green strain have the color gene-pair G/G.
- Crossing a pure yellow with a pure green produces a fertilized egg with gene-pair Y/G; this cell reproduces itself and eventually becomes a seed, in which all the cells have the gene-pair Y/G and are yellow in color.
17 Second generation: model explanation
- A first-generation hybrid seed grows into a first-generation hybrid plant with gene-pair Y/G. This plant produces sperm cells, of which half will contain the gene Y and the other half will contain the gene G; it also produces eggs, of which half will be Y and half G.
- When two first-generation hybrids are crossed, each resulting second-generation hybrid seed gets one gene at random from each parent, because it is formed by the random combination of a sperm cell and an egg.
18 Mendel's chance model: He was right!
[Figure: two first-generation hybrid plants, each Y/G, are crossed. Second-generation hybrid gene-pairs and their chances: Y/Y 25%, Y/G 25%, G/Y 25%, G/G 25%. Overall: green 25%, yellow 75%.]
The combinations Y/G and G/Y are not distinguishable after fertilization.
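The chance model can be checked by listing every sperm-egg combination and adding up the chances. The sketch below does this with exact fractions; the function names and the string encoding of gene-pairs ("YG" for Y/G) are illustrative choices.

```python
from fractions import Fraction
from itertools import product

DOMINANT = "Y"

def seed_color(gene_pair):
    """A seed is yellow if at least one gene of the pair is Y, green otherwise."""
    return "yellow" if DOMINANT in gene_pair else "green"

def cross(sperm_parent, egg_parent):
    """Chance of each (sperm gene, egg gene) pair.

    Each parent passes on one of its two genes with chance 1/2, so each of
    the 2 x 2 combinations has chance 1/4.
    """
    chances = {}
    for sperm_gene, egg_gene in product(sperm_parent, egg_parent):
        pair = (sperm_gene, egg_gene)
        chances[pair] = chances.get(pair, 0) + Fraction(1, 4)
    return chances

# First generation: pure yellow crossed with pure green.
first_generation = cross("YY", "GG")

# Second generation: two first-generation hybrids crossed.
second_generation = cross("YG", "YG")
color_chances = {}
for pair, chance in second_generation.items():
    color = seed_color(pair)
    color_chances[color] = color_chances.get(color, 0) + chance

print(first_generation)    # every seed gets Y from one parent, G from the other
print(color_chances)       # yellow 3/4, green 1/4
```

The first cross yields only Y/G seeds (all yellow), and the second yields yellow with chance 3/4 and green with chance 1/4, as in the model.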
19 Did Mendel's facts fit his model?
"... the general level of agreement between Mendel's expectations and his reported results shows that it is closer than would be expected in the best of several thousand repetitions. The data have evidently been sophisticated systematically, and after examining various possibilities, I have no doubt that Mendel was deceived by a gardening assistant, who knew only too well what his principal expected from each trial made."
- "Only too well," answered R. A. Fisher
20 How Fisher used the $\chi^2$ test to show that Mendel was cheating
- For each of Mendel's experiments, Fisher computed the $\chi^2$ statistic. These experiments were all independent, for they involved different sets of plants. And Fisher pooled the results.
- With independent experiments, the results can be pooled by adding up the separate $\chi^2$ statistics; the degrees of freedom add up too.
21 Too good to be true
- For example, if one experiment gives $\chi^2 = 5.8$ with 5 degrees of freedom, and another independent experiment gives $\chi^2 = 3.1$ with 2 degrees of freedom, the two together have a pooled $\chi^2 = 8.9$ with 7 degrees of freedom.
- For Mendel's data, Fisher got a pooled $\chi^2$ value under 42, with 84 degrees of freedom. The area to the left of 42 under the $\chi^2$ curve with 84 degrees of freedom is about 4 in 100,000. The agreement between the observed and expected is too good to be true.
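The pooling rule and the left-tail area can be sketched as follows. The function names are illustrative. The left-tail computation relies on the closed-form sum that holds for even degrees of freedom, which is an assumption of this example (84 is even) and not the method Fisher would have used.

```python
import math

def pool(results):
    """Add up the statistics and the degrees of freedom of independent experiments.

    `results` is a list of (chi_square_statistic, degrees_of_freedom) pairs.
    """
    stat = sum(s for s, _ in results)
    df = sum(d for _, d in results)
    return stat, df

def area_to_the_left_even_df(x, df):
    """Area to the left of x under the chi-square curve, for even df only."""
    if df % 2 != 0:
        raise ValueError("this sketch only handles an even number of degrees of freedom")
    half = x / 2
    # Right-tail area for even df: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
    term = math.exp(-half)
    right = term
    for j in range(1, df // 2):
        term *= half / j
        right += term
    return 1 - right

pooled_stat, pooled_df = pool([(5.8, 5), (3.1, 2)])
print(round(pooled_stat, 1), pooled_df)   # 8.9 7

left = area_to_the_left_even_df(42, 84)
print(f"{left * 100_000:.1f} in 100,000")
```

For 42 with 84 degrees of freedom the left-tail area comes out at a few in 100,000, in line with the figure quoted above.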
22 What does it mean?
- Suppose millions of scientists were repeating Mendel's experiments. For each scientist, imagine measuring the discrepancy between his observed frequencies and the expected frequencies by the $\chi^2$ statistic. Then by the laws of chance, about 99,996 out of every 100,000 of these scientists would report a discrepancy between observations and expectations greater than the one reported by Mendel. That leaves two possibilities:
    (1) Either Mendel's data were massaged,
    (2) Or he was pretty lucky.
- The first is easier to believe.
23 Using the chi-square test
- To test the null hypothesis that the prescribed probabilities for the nucleotides of a sequence are $p_i$ for $i = 1, 2, 3, 4$ (aka A, C, G, T), we apply the $\chi^2$ test.
- If the observed values are such that
    $$\chi^2 = \sum_{i=1}^{4} \frac{(Y_i - n p_i)^2}{n p_i}$$
  is large, then we reject the null hypothesis.
- $Y_i$ is the number of nucleotides in category $i$ in our sequence of length $n$.
- The formula for $\chi^2$ is a measure of discrepancy between the observed values $Y_i$ and the respective null hypothesis means $n p_i$.
- When the null hypothesis is true and $n$ is large, $\chi^2$ has approximately a chi-square distribution with 4 - 1 = 3 degrees of freedom.
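A minimal sketch of this test for a nucleotide sequence follows. The example sequence is made up for illustration, the function name is hypothetical, and the input is assumed to contain only the letters A, C, G, T. The P-value line uses the closed-form right-tail area of the chi-square curve with 3 degrees of freedom.

```python
import math
from collections import Counter

def nucleotide_chi_square(sequence, probabilities):
    """Chi-square statistic and approximate P-value (3 df) for A, C, G, T counts.

    `probabilities` maps each nucleotide to its null-hypothesis probability.
    """
    n = len(sequence)
    counts = Counter(sequence)
    stat = 0.0
    for base, p in probabilities.items():
        mean = n * p                          # null-hypothesis mean count
        stat += (counts[base] - mean) ** 2 / mean
    # Right-tail area under the chi-square curve with 4 - 1 = 3 degrees of freedom.
    p_value = (math.erfc(math.sqrt(stat / 2))
               + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))
    return stat, p_value

equally_likely = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical 100-base sequence with an excess of A (made-up data).
sequence = "A" * 40 + "C" * 20 + "G" * 20 + "T" * 20
stat, p_value = nucleotide_chi_square(sequence, equally_likely)
print(f"chi-square = {stat:.1f}, P = {p_value:.4f}")
```

With 40 A's against an expected 25, the statistic is 12.0 and the P-value is well under 1%, so the hypothesis of equally likely nucleotides would be rejected for this made-up sequence.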