Testing statistical hypotheses: The Chi-square test and how Sir R.A. Fisher caught Mendel cheating

1
Testing statistical hypotheses: The Chi-square
test and how Sir R.A. Fisher caught Mendel
cheating
Algorithmic Foundations of Computational
Biology Professor Istrail
2
The Chi-Square Test
  • "How well does it fit the facts?" In many cases
    this question can be answered by the $\chi^2$-test.
  • The test was invented in 1900 by Karl
    Pearson
  • The test is used when there are more than two
    categories of data
  • Like the probabilities of A, C, G, T in two DNA
    sequences, to check whether these categories are
    equally likely.

3
A gambler is accused of using a loaded die
  • but he pleads innocent. A record has been kept
    for the last 60 throws.
  • 4 3 3 1 2 3 4 6 5 6
  • 2 4 1 3 3 5 3 4 3 4
  • 3 3 4 5 4 5 6 4 5 1
  • 6 4 4 2 3 3 2 4 4 5
  • 6 3 6 2 4 6 4 6 3 2
  • 5 4 6 3 3 3 5 3 1 4
  • If the gambler is innocent, the numbers from the
    table should be like 60 random drawings with
    replacement from a box with 1,2,3,4,5,6. Each
    number should show up about 10 times.
  • The expected frequency is 10.
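
The tally behind this slide can be reproduced in a few lines. The sketch below is only an illustration: the variable names `throws`, `observed` and `expected` are not from the slides, and the 60 values are copied from the record above.

```python
from collections import Counter

# The 60 recorded throws, row by row as shown on the slide.
throws = [
    4, 3, 3, 1, 2, 3, 4, 6, 5, 6,
    2, 4, 1, 3, 3, 5, 3, 4, 3, 4,
    3, 3, 4, 5, 4, 5, 6, 4, 5, 1,
    6, 4, 4, 2, 3, 3, 2, 4, 4, 5,
    6, 3, 6, 2, 4, 6, 4, 6, 3, 2,
    5, 4, 6, 3, 3, 3, 5, 3, 1, 4,
]

observed = Counter(throws)   # how often each face actually came up
expected = len(throws) / 6   # a fair die: each face about 60 / 6 = 10 times

for face in range(1, 7):
    print(face, observed[face], expected)
```

Faces 3 and 4 come up far more often than 10 times, which is what makes the record look suspicious.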

4
Observed Frequencies
  Value   Observed frequency   Expected frequency
  1        4                   10
  2        6                   10
  3       17                   10
  4       16                   10
  5        8                   10
  6        9                   10
  Sum     60                   60
5
The $\chi^2$ statistic
$$\chi^2 = \text{sum of } \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}$$
When the observed frequency is far from the
expected frequency, the corresponding term in
the sum is large; when the two are close, the
term is small. Large values of $\chi^2$ indicate
that the observed and expected frequencies are
far apart. Small values of $\chi^2$ mean the
opposite: observed are close to expected. So
chi-square is a measure of the distance between
observed and expected.
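
As a worked check, the statistic for the gambler's table can be computed directly. This is a minimal sketch; the names `observed`, `expected` and `chi_square` are chosen here for illustration.

```python
# Observed and expected frequencies for faces 1..6, from the table on slide 4.
observed = [4, 6, 17, 16, 8, 9]
expected = [10] * 6

# One term per face: (observed - expected)^2 / expected, then add them up.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(terms)

print(round(chi_square, 1))  # 14.2
```

The two largest terms come from faces 3 and 4 (4.9 and 3.6), the faces that showed up too often.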
6
The P-value: the observed significance level
  • We need to know the chance that, when a fair die
    is rolled 60 times and $\chi^2$ is computed from the
    observed frequencies, its value turns out to be
    14.2 or more.
  • The answer: P ≈ 1.4%. That is, if the die is fair,
    there is a 1.4% chance for the $\chi^2$ statistic to be
    as big as or bigger than the observed one.
  • Conclusion: the gambler is in trouble!
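
One way to make this chance concrete is to simulate it: roll a fair die 60 times, compute the chi-square statistic, and repeat many times. The sketch below does that with Python's standard `random` module. The function name, the number of trials and the seed are arbitrary choices made here, not part of the slides, and the result is an estimate rather than an exact value.

```python
import random

def chance_of_at_least(threshold, rolls=60, trials=20_000, seed=0):
    """Estimate the chance that chi-square for a fair die is `threshold` or more."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    expected = rolls / 6
    hits = 0
    for _ in range(trials):
        counts = [0] * 6
        for _ in range(rolls):
            counts[rng.randrange(6)] += 1  # one roll of a fair die
        chi_square = sum((c - expected) ** 2 / expected for c in counts)
        if chi_square >= threshold - 1e-9:  # small tolerance for rounding error
            hits += 1
    return hits / trials

estimate = chance_of_at_least(14.2)
print(f"Estimated chance: {estimate:.1%}")
```

The estimate should land in the neighborhood of the 1.4% quoted above; it will not match exactly, because of simulation noise and because the curve itself is only an approximation.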

7
Degrees of freedom
  • Pearson invented the $\chi^2$-curves: one curve
    for each number of degrees of freedom.
  • In our case, the model is fully specified, i.e.,
    there is no parameter to estimate from the data, so
  • degrees of freedom = number of terms in
    $\chi^2$ - 1

In the gambler problem, degrees of freedom = 6 - 1 = 5.
8
The $\chi^2$-test P-value
  • For the $\chi^2$-test, the P-value is approximately
    equal to the area to the right of the observed
    value for the $\chi^2$ statistic, under the $\chi^2$-curve
    with the appropriate number of degrees of freedom.
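
The area to the right can be approximated numerically. The sketch below is one possible approach, not necessarily how the slide's figure was obtained: it uses the standard formula for the height of the chi-square curve and Simpson's rule, both written with the standard library only. The function names and the step count are assumptions made here, and the density helper is only meant for more than 2 degrees of freedom.

```python
import math

def chi2_density(x, df):
    """Height of the chi-square curve with `df` degrees of freedom at x (df > 2)."""
    if x <= 0:
        return 0.0
    log_height = ((df / 2 - 1) * math.log(x) - x / 2
                  - (df / 2) * math.log(2) - math.lgamma(df / 2))
    return math.exp(log_height)

def area_to_right(observed, df, steps=2000):
    """1 minus the area under the curve between 0 and `observed` (Simpson's rule)."""
    h = observed / steps  # `steps` must be even for Simpson's rule
    total = chi2_density(0, df) + chi2_density(observed, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_density(i * h, df)
    return 1 - total * h / 3

# The gambler's die: observed statistic 14.2 with 5 degrees of freedom.
p_value = area_to_right(14.2, 5)
print(f"P = {p_value:.1%}")  # P = 1.4%
```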

9
P = area under the curve
[Figure: the $\chi^2$-curve with 5 degrees of freedom. The shaded (brown) area to the right of 14.2 is P.]
10
Rule of thumb
  • The approximation given by the $\chi^2$-curve can
    be trusted when the expected frequency in each
    line of the table is 5 or more.
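
The rule of thumb is easy to check before running the test. A minimal sketch follows; the helper name `approximation_trustworthy` is hypothetical.

```python
def approximation_trustworthy(expected_frequencies, minimum=5):
    """Rule of thumb: every expected frequency should be `minimum` or more."""
    return all(e >= minimum for e in expected_frequencies)

# 60 rolls of a die: each face is expected 10 times, so the curve can be trusted.
print(approximation_trustworthy([10] * 6))  # True

# Only 12 rolls: each face is expected 2 times, so the approximation is doubtful.
print(approximation_trustworthy([2] * 6))   # False
```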

Summary for the $\chi^2$-test
  • The basic data: N observations of a random
    process
  • The frequency table is computed
  • The $\chi^2$-statistic formula is used to sum
    things up
  • The degrees of freedom
  • The observed significance level, using the
    $\chi^2$-curve

11
Is Mendel's experimental data too good to be true? Yes!
  • In 1865 Gregor Mendel published an article in
    which he provided a scientific explanation for
    heredity, and eventually caused a revolution in
    biology.
  • Mendel's experiments were all performed on garden
    peas. Pea seeds are either yellow or green.
    Color is a property of the seed.
  • Mendel bred a pure yellow strain, that is, a
    strain in which every plant in every generation
    had only yellow seeds; separately, he bred a
    pure green strain.

12
Yellow and Green peas
  • He then crossed plants of the pure yellow strain with
    plants of the pure green strain.
  • The seeds resulting from a yellow-green cross, and
    the plants they grow into, are called first-generation
    hybrids.
  • First-generation hybrid seeds are all yellow,
    indistinguishable from seeds of the pure yellow
    strain. The green seems to have disappeared
    completely.
  • These first-generation hybrid seeds grew into
    first-generation hybrid plants, which Mendel
    crossed with themselves, producing
    second-generation hybrid seeds. Some of these
    second-generation seeds were yellow, but some
    were green.
  • So the green disappeared for one generation but
    reappeared in the second. Even more surprising,
    the green reappeared in a simple proportion:
  • Of the second-generation hybrids, about 75% were yellow
    and about 25% were green.

13
Factors, aka genes
  • To explain it, Mendel postulated the existence of
    factors, later called genes.
  • According to Mendel's theory, there were two
    different variants of a gene which paired up to
    control seed color, denoted Y and G. It is the
    gene pair in the seed, not the parent, which
    determines what color the seed will be; all the
    cells making up a seed contain the same gene pair.

14
Y is dominant
  • There are four different gene pairs:
  • Y/Y, Y/G, G/Y, G/G
  • Gene pairs control seed color by the rule:
  • Y/Y, Y/G, G/Y make yellow
  • G/G makes green
  • As geneticists say,
  • Y is dominant and
  • G is recessive

15
Randomness
  • A seed grows and becomes a plant. All cells in this
    plant also carry the seed's color gene pair, with
    one exception: sex cells, either sperm or eggs,
    contain only one gene of the pair.
  • For example, a plant whose ordinary gene pair is
    Y/Y will produce sperm cells each containing a
    gene Y; similarly, it will produce egg cells each
    containing gene Y.
  • A plant whose pair is Y/G will produce sperm cells
    of which half contain Y and half contain
    G. The same is true of the egg cells.

16
First generation: model explanation
  • Plants of pure yellow have the color pair Y/Y
  • Plants of pure green have the color pair G/G
  • Crossing a pure yellow with a pure green
    produces a fertilized egg with gene pair Y/G; this
    cell reproduces itself and eventually becomes a
    seed, in which all the cells have the gene pair
    Y/G and are yellow in color.

17
Second generation: model explanation
  • A first-generation hybrid seed grows into a
    first-generation hybrid plant with gene pair Y/G. This
    plant produces sperm cells of which half will
    contain the gene Y and the other half will
    contain the gene G; it also produces eggs of
    which half will be Y and half G.
  • When two first-generation hybrids are crossed,
    each resulting second-generation hybrid seed gets
    one gene at random from each parent, because it
    is formed by the random combination of a sperm
    cell and an egg.
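
The random combination described here can be simulated. The sketch below crosses two Y/G hybrids many times and applies the dominance rule (any Y makes the seed yellow). The function name, the number of seeds and the random seed are illustrative choices, not part of the slides.

```python
import random

def cross_hybrids(n_seeds, seed=0):
    """Cross Y/G with Y/G `n_seeds` times; return the fraction of yellow seeds."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    yellow = 0
    for _ in range(n_seeds):
        sperm = rng.choice("YG")  # half the sperm cells carry Y, half carry G
        egg = rng.choice("YG")    # the same holds for the eggs
        if "Y" in (sperm, egg):   # Y is dominant: only G/G is green
            yellow += 1
    return yellow / n_seeds

fraction_yellow = cross_hybrids(10_000)
print(f"yellow: {fraction_yellow:.1%}, green: {1 - fraction_yellow:.1%}")
```

With many seeds, the yellow fraction should settle close to three quarters.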

18
Mendel's chance model: he was right!
First generation hybrid plants: Y/G crossed with Y/G
Second generation hybrid plants and their chances:
  • Y/Y: 25%
  • Y/G: 25%
  • G/Y: 25%
  • G/G: 25%
Chances by seed color: green 25%, yellow 75%
The combinations Y/G and G/Y are not
distinguishable after fertilization.
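
The chances in this diagram can also be derived exactly by listing the four equally likely combinations of sperm and egg. A small sketch with exact fractions follows; the variable names are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# Each parent is Y/G, so sperm and egg are each Y or G with chance 1/2.
pairs = list(product("YG", repeat=2))  # (Y,Y), (Y,G), (G,Y), (G,G)
chance_per_pair = Fraction(1, len(pairs))

# Dominance rule: a pair containing at least one Y gives a yellow seed.
yellow = sum(chance_per_pair for pair in pairs if "Y" in pair)
green = 1 - yellow

print(yellow, green)  # 3/4 1/4
```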
19
Did Mendel's facts fit his model?
"... the general level of agreement between
Mendel's expectations and his reported results
shows that it is closer than would be expected in
the best of several thousand repetitions. The
data have evidently been sophisticated
systematically, and after examining various possibilities, I
have no doubt that Mendel was deceived by a
gardening assistant, who knew only too well what
his principal expected from each trial made."
  • "Only too well," answered R. A. Fisher.

20
How Fisher used the $\chi^2$ test to show
that Mendel was cheating
  • For each of Mendel's experiments, Fisher
    computed the $\chi^2$ statistic. These experiments
    were all independent, for they involved different
    sets of plants. And Fisher pooled the results.

With independent experiments, the results can be
pooled by adding up the separate $\chi^2$
statistics; the degrees of freedom add up too.
21
Too good to be true
  • For example, if one experiment gives $\chi^2 = 5.8$ with 5
    degrees of freedom, and another independent experiment
    gives $\chi^2 = 3.1$ with 2 degrees of freedom, the two
    together have a pooled $\chi^2 = 5.8 + 3.1 = 8.9$ with
    5 + 2 = 7 degrees of freedom.
  • For Mendel's data, Fisher got a pooled $\chi^2$ under
    42, with 84 degrees of freedom. The area to the left of
    42 under the $\chi^2$-curve with 84 degrees of freedom is
    about 4 in 100,000.
  • The agreement between the observed
    and expected is too good to be true.
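
Both steps, pooling and the left-hand area, can be sketched with the standard library. The left-hand area below uses the usual power series for the regularized lower incomplete gamma function; this is one possible numerical route, not necessarily the one Fisher used. The function name and the number of series terms are assumptions made here.

```python
import math

def chi2_left_area(x, df, terms=200):
    """Area to the left of x under the chi-square curve with `df` degrees of freedom.

    Power series for the regularized lower incomplete gamma function P(a, z),
    with a = df / 2 and z = x / 2.
    """
    a, z = df / 2, x / 2
    term = 1.0
    total = 1.0
    for n in range(1, terms):
        term *= z / (a + n)
        total += term
    log_prefactor = a * math.log(z) - z - math.lgamma(a + 1)
    return math.exp(log_prefactor) * total

# Pooling the two example experiments: statistics and degrees of freedom both add.
experiments = [(5.8, 5), (3.1, 2)]
pooled_stat = sum(stat for stat, _ in experiments)
pooled_df = sum(df for _, df in experiments)
print(round(pooled_stat, 1), pooled_df)  # 8.9 7

# Mendel's pooled result as quoted on the slide: under 42, with 84 degrees of freedom.
left_area = chi2_left_area(42, 84)
print(f"about {left_area * 100_000:.1f} in 100,000")
```

Note that this is the area on the left: a small value here means the data agree with the model more closely than chance would normally allow.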

22
What does it mean?
  • Suppose millions of scientists were repeating
    Mendel's experiments. For each scientist, imagine
    measuring the discrepancy between his observed
    frequencies and the expected frequencies by the
    $\chi^2$ statistic. Then by the laws of chance, about
    99,996 out of every 100,000 of these scientists
    would report a discrepancy between observations
    and expectations greater than the one reported by
    Mendel. That leaves two possibilities:
  • (1) Either Mendel's data were massaged,
  • (2) or he was pretty lucky.
  • The first is easier to believe.

23
Using chi-square test
  • To test the null hypothesis that the
    prescribed probabilities for the
    nucleotides of a sequence are $p_i$
    for $i = 1, 2, 3, 4$ (aka A, C, G, T),
    we apply the $\chi^2$ test:
    $$\chi^2 = \sum_{i=1}^{4} \frac{(Y_i - n p_i)^2}{n p_i}$$
  • If the observed values are such that $\chi^2$
    is large, then we reject the null hypothesis.
  • $Y_i$ is the number of nucleotides in category
    $i$ in our sequence, and $n$ is the length of the sequence.
  • The formula for $\chi^2$ is a measure of
    discrepancy between the observed values $Y_i$ and
    the respective null hypothesis means $n p_i$.
  • When the null hypothesis is true and $n$ is
    large, $\chi^2$ has approximately a chi-square distribution with
    4 - 1 = 3 degrees of freedom.
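
As an illustration of this slide, the sketch below compares the nucleotide counts of a sequence with the counts expected under the null hypothesis. The sequence is made up for the example, the function names are hypothetical, and the P-value helper uses the closed form of the right-hand area that holds for exactly 3 degrees of freedom.

```python
import math
from collections import Counter

def nucleotide_chi_square(sequence, probabilities):
    """Sum over A, C, G, T of (observed count - expected count)^2 / expected count."""
    n = len(sequence)
    counts = Counter(sequence)
    return sum(
        (counts[base] - n * p) ** 2 / (n * p)
        for base, p in probabilities.items()
    )

def p_value_3_df(chi_square):
    """Area to the right of `chi_square` under the curve with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(chi_square / 2))
            + math.sqrt(2 * chi_square / math.pi) * math.exp(-chi_square / 2))

# Made-up sequence of length 100: 30 A, 20 C, 20 G, 30 T.
sequence = "A" * 30 + "C" * 20 + "G" * 20 + "T" * 30
# Null hypothesis: the four nucleotides are equally likely.
equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

stat = nucleotide_chi_square(sequence, equal)
p_value = p_value_3_df(stat)
print(stat)  # 4.0
print(f"P = {p_value:.0%}")
```

Here the P-value is large, so this made-up sequence gives no reason to reject the hypothesis of equally likely nucleotides.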