Title: Figures of Merits (1) Assessing the quality of a solution
1TREES
2Trees
3Same thing
4Evaluation of the tree topology
The maximum parsimony principle
5Genes: 0 = absent, 1 = present
species g1 g2 g3 g4 g5 g6
s1 1 0 0 1 1 0
s2 0 0 1 0 0 0
s3 1 1 0 0 0 0
s4 1 1 0 1 1 1
s5 0 0 1 1 1 0
6Evaluate this tree
(Figure: a candidate tree with leaves s2, s1, s4, s3, s5.)
7Gene number 1
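(Figure: the tree with the g1 states at the leaves: s1 = 1, s4 = 1, s3 = 1, s2 = 0, s5 = 0.)
8Gene number 1, Option number 1.
(Figure: the same tree with internal node labels 1, 1 and leaf states s1 = 1, s4 = 1, s3 = 1, s2 = 0, s5 = 0.)
9Gene number 1, Option number 2.
(Figure: tree with leaves s1, s4, s3, s2, s5.)
Number of changes for g1: 1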
10Gene number 2, Option number 1.
(Figure: tree with leaves s2, s1, s4, s3, s5.)
11Gene number 2, Option number 2.
(Figure: tree with leaves s2, s1, s4, s3, s5.)
12Gene number 2, Option number 3.
(Figure: tree with leaves s2, s1, s4, s3, s5.)
Number of changes for g2: 2
13Gene number 3, Option number 1.
(Figure: tree with leaves s2, s1, s4, s3, s5.)
14Gene number 3, Option number 2.
(Figure: tree with leaves s2, s1, s4, s3, s5.)
Number of changes for g3: 1
15Gene number 4, Option number 1.
(Figure: tree with leaves s2, s1, s4, s3, s5.)
16Gene number 4, Option number 2.
(Figure: tree with leaves s2, s1, s4, s3, s5.)
Number of changes for g4: 2
17Gene number 5 is the same as Gene number 4
Number of changes for g5: 2
18Gene number 6, 1 option only
(Figure: tree with leaves s2, s1, s4, s3, s5.)
Number of changes for g6: 1
19Sum of changes
Number of changes for g1: 1
Number of changes for g2: 2
Number of changes for g3: 1
Number of changes for g4: 2
Number of changes for g5: 2
Number of changes for g6: 1
Sum of changes for this tree topology: 9
Can we do better?
20The MP (most parsimonious) tree
(Figure: the most parsimonious tree on the leaves s2, s1, s4, s3, s5.)
Sum of changes for this tree topology: 8
21TR = TREE ROOTED
How many rooted trees?
N = 2, TR(2) = 1
N = 3, TR(3) = 3
N = 4, TR(4) = 15
22How many rooted trees
2 sequences: 1 tree. 3 sequences: 3 trees. 4 sequences: 3 x 5 = 15 trees. 5 sequences: 3 x 5 x 7 = 105 trees.
TR(n) = 1 x 3 x 5 x 7 x ... x (2n-3)
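The product formula TR(n) = 1 x 3 x 5 x ... x (2n-3) can be checked with a few lines of Python. This is only a minimal sketch; the function name num_rooted_trees is made up for illustration.

```python
def num_rooted_trees(n):
    """Number of rooted binary trees on n labelled sequences: 1*3*5*...*(2n-3)."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (the factor 1 is implicit).
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count

for n in range(2, 6):
    print(n, num_rooted_trees(n))
```

The count grows very quickly with n, which is why evaluating every possible tree is only feasible for a handful of sequences.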
24Rooting...
25Rooting the tree
26Rooted vs. unrooted trees
(Figure: a rooted and an unrooted tree on the leaves 1, 2, 3.)
27Rooted vs. Unrooted
The position of the root does not affect the MP
score.
28Intuition why rooting doesn't change the score
Gene number 1, Option number 1.
(Figure: the tree with internal node labels 1, 1 and leaf states s1 = 1, s4 = 1, s3 = 1, s2 = 0, s5 = 0.)
The change will always be on the same branch, no matter where the root is positioned.
29How can we root the tree? We want rooted trees!
32Gorilla gorilla (Gorilla)
Pan troglodytes (Chimpanzee)
Homo sapiens (human)
Gallus gallus (chicken)
33Evaluate all 3 possible UNROOTED trees
MP tree
34Rooting based on a priori knowledge
(Figure: two trees on the leaves Human, Chimp, Gorilla and Chicken.)
35Ingroup / Outgroup
(Figure: tree on Chicken, Human, Chimp and Gorilla, with the parts labelled INGROUP and OUTGROUP.)
36Monophyletic groups
(Figure: tree on Chicken, Human, Chimp and Gorilla.)
Gorilla, Human and Chimp form a monophyletic group.
37How to efficiently compute the MP score of a tree
38The Fitch algorithm (1971)
Scan the tree in post-order. At each internal node, take the intersection of the state sets of its two child nodes. If the intersection is empty, assign the union of the two sets instead; otherwise, assign the intersection.
39Number of changes
Total number of changes = number of union operations.
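The rule above can be sketched in Python on the gene presence/absence table from the earlier slide. The two tree topologies below are assumptions: the tree figures did not survive in this transcript, so tree_a and tree_b were picked only because they reproduce the per-gene counts and the totals of 9 and 8 quoted on the slides. The function names are illustrative as well.

```python
# Gene presence/absence profiles (g1..g6) from the table.
profiles = {
    "s1": (1, 0, 0, 1, 1, 0),
    "s2": (0, 0, 1, 0, 0, 0),
    "s3": (1, 1, 0, 0, 0, 0),
    "s4": (1, 1, 0, 1, 1, 1),
    "s5": (0, 0, 1, 1, 1, 0),
}

def fitch_site(tree, states):
    """Post-order Fitch pass for one character.

    A tree is a leaf name (str) or a pair (left, right).
    Returns (state set at this node, number of union operations below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:  # non-empty intersection: no change is counted here
        return common, left_changes + right_changes
    # empty intersection: take the union and count one change
    return left_set | right_set, left_changes + right_changes + 1

def changes_per_gene(tree, profiles):
    """Number of changes for each gene (column) on the given tree."""
    n_genes = len(next(iter(profiles.values())))
    result = []
    for g in range(n_genes):
        states = {name: profile[g] for name, profile in profiles.items()}
        result.append(fitch_site(tree, states)[1])
    return result

# Assumed topologies (the original figures are not available).
tree_a = ((("s1", "s4"), "s3"), ("s2", "s5"))
tree_b = ((("s3", "s4"), "s1"), ("s2", "s5"))

for name, tree in (("tree_a", tree_a), ("tree_b", tree_b)):
    per_gene = changes_per_gene(tree, profiles)
    print(name, per_gene, "sum =", sum(per_gene))
```

Each gene needs a single pass over the tree, so no enumeration of internal-state options is required.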
40Likelihood
41- Parsimony has many shortcomings. To name a few:
- All changes are counted the same, which is not true for biological systems (Leu->Ile is much more likely than Leu->His).
- Cannot take biological context into account (secondary structures, dependencies among sites, evolutionary distances between the analyzed organisms, etc.).
- Statistical basis questionable.
42Alternative: MAXIMUM-LIKELIHOOD METHOD
43Maximum likelihood uses a probabilistic model of evolution. Each amino acid has a certain probability to change, and this probability depends on the evolutionary distance. Evolutionary distances are inferred from the entire set of sequences.
44Evolutionary distances
Positions in an alignment can be conserved for two reasons: either because of functional constraints, or because only a short evolutionary time has elapsed since the divergence of the organisms. 5 replacements in 10 positions between 2 chimps is considered very variable. 5 replacements in 10 positions between human and cucumber is not considered too variable. Maximum likelihood takes this information into account.
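A small numerical sketch can make the chimp versus cucumber comparison concrete. It assumes an equal-rates model with 20 amino-acid states and independent sites, which is a deliberate simplification and not the model used in the lecture. The distances 0.01 and 1.0 (expected substitutions per site) are hypothetical stand-ins for a very close and a very distant pair.

```python
import math

def p_change(d, k=20):
    """Probability that a site differs after evolutionary distance d.

    Assumes an equal-rates model with k states; d is the expected number
    of substitutions per site.
    """
    return (k - 1) / k * (1.0 - math.exp(-k * d / (k - 1)))

def likelihood(n_diff, n_sites, d, k=20):
    """Binomial likelihood of seeing n_diff differing sites out of n_sites."""
    p = p_change(d, k)
    return math.comb(n_sites, n_diff) * p ** n_diff * (1 - p) ** (n_sites - n_diff)

short_d = 0.01  # hypothetical: two very closely related organisms
long_d = 1.0    # hypothetical: two very distantly related organisms

print("P(5 of 10 differ | short distance) =", likelihood(5, 10, short_d))
print("P(5 of 10 differ | long distance)  =", likelihood(5, 10, long_d))
```

Under these assumptions the same observation, 5 replacements in 10 positions, is extremely unlikely at the short distance and quite ordinary at the long one.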
45Maximum Parsimony vs. Maximum Likelihood
Maximum Parsimony | Maximum Likelihood
All changes are considered the same | Different probabilities for different types of substitutions
Statistically questionable | Statistically robust
Ignores biological context | Accounts for biological context
46The likelihood computations
With likelihood models we can:
1. Infer the most likely phylogenetic tree
2. Compute conservation for each site
47Maximum likelihood tree reconstruction
This is very hard from the computational point of view, but efficient algorithms that find approximate solutions have been developed.
48Tree reconstruction using distance based methods
Two steps:
- Compute a distance D(i,j) between any two sequences i and j.
- Find the tree that agrees most with the distance table.
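The first step can be sketched with the gene presence/absence table from the parsimony example. The distance used here, the number of genes at which two species differ, is just one simple choice made for illustration; the slides do not say which distance measure to use.

```python
# Gene presence/absence profiles (g1..g6) from the parsimony example.
profiles = {
    "s1": (1, 0, 0, 1, 1, 0),
    "s2": (0, 0, 1, 0, 0, 0),
    "s3": (1, 1, 0, 0, 0, 0),
    "s4": (1, 1, 0, 1, 1, 1),
    "s5": (0, 0, 1, 1, 1, 0),
}

def distance(a, b):
    """Number of positions at which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))

names = sorted(profiles)
# D[i][j] is the distance between species i and species j.
D = {i: {j: distance(profiles[i], profiles[j]) for j in names} for i in names}

print("    " + "  ".join(names))
for i in names:
    print(i, "  ".join(f"{D[i][j]:>2}" for j in names))
```

The resulting table is symmetric with zeros on the diagonal, and it is the only input the second step needs.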
49Neighbor-joining is based on Star decomposition
(Figure: star decomposition of the leaves A, B, C, D, E. Red = best pair to group together. The groupings shown are (C,B), then ((C,B),E).)
In each step we cluster a pair so that the sum of branches is minimal.
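One clustering step can be sketched with the standard neighbor-joining selection criterion, Q(i,j) = (n-2) D(i,j) - sum_k D(i,k) - sum_k D(j,k), where the pair with the smallest Q is the one to group. The distance matrix below is invented for illustration and does not come from the slides, and nj_pick_pair is a hypothetical helper name.

```python
def nj_pick_pair(D):
    """Return (Q, i, j) for the pair of rows with the smallest Q value."""
    n = len(D)
    row_sums = [sum(row) for row in D]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best

labels = ["B", "C", "A", "D", "E"]
# Hypothetical symmetric distance matrix, rows and columns ordered as in labels.
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

q, i, j = nj_pick_pair(D)
print("group", labels[i], "and", labels[j], "with Q =", q)
```

A full implementation would then replace the chosen pair with a new node, recompute the distances to it, and repeat until the tree is resolved. Note that the closest pair in the matrix (distance 3) is not the one selected, because Q also accounts for how far each member of the pair is from everything else.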
51A few words on Human Immunodeficiency Virus (HIV)
The virus: HIV. The disease/syndrome: Acquired Immunodeficiency Syndrome (AIDS).
First recognized clinically in 1981. By 1992, it had become the major cause of death in individuals 25-44 years of age in the U.S.
52HIV
Until December 2002: 20 million people had died of AIDS. Newly infected in 2002: 5 million. Number currently infected: 42 million.
1 out of every 100 adults aged 15-49 in the world population.
53HIV
HIV is the leading cause of death in sub-Saharan Africa. In some parts of this region 25-30% of the population is infected.
1 out of 3 children in these areas has lost at least one parent.
54Sub-Saharan Africa refers to the territories south of the Sahara. In the past the term Black Africa was also used for the same region; today it is obsolete because it is considered politically incorrect. Tropical Africa might be taken as an alternative label for the same region; however, it excludes South Africa, which lies outside the tropics.
55HIV is a lentivirus
Species: HIV. Genus: Lentiviruses. Family: Retroviridae.
Lentiviruses have a long incubation time, and are thus called slow viruses.
56HIV-1 and HIV-2
In 1986, a distinct type of HIV prevalent in certain regions of West Africa was discovered and was termed HIV type 2. Individuals infected with type 2 also had AIDS, but had a longer incubation time and lower morbidity.
57Morbidity vs. Mortality
- Morbidity: the prevalence of a disease.
The probability that a randomly selected person out of the entire population is ill at time t.
58Morbidity vs. Mortality
Mortality: deaths from a disease or in general.
- Mortality rate = death rate
59Origin of HIV-1 in the chimpanzee Pan troglodytes
troglodytes
Nature Vol. 397. Pages 436-441. 1999.
60Five lines of evidence have been used to substantiate zoonotic transmission of primate lentiviruses:
1. Similarities in viral genome organization
2. Phylogenetic relatedness
3. Geographic coincidence
4. Plausible routes of transmission
5. Prevalence in the natural host
61For HIV-2, a virus (SIVsm) that is genomically
indistinguishable and phylogenetically closely
related was found in substantial numbers of
wild-living sooty mangabeys whose natural habitat
coincides with the epicenter of the HIV-2 epidemic
63Close contact between sooty mangabeys and humans is common because these monkeys are hunted for food and kept as pets. No fewer than six independent transmissions of SIVsm to humans have been proposed.
The origin of HIV-1 is much less certain.
64HIV-1 is most similar in sequence and genomic
organization to viruses found in chimpanzees
(SIVcpz).
65- BUT, several observations cast doubt on the theory that chimpanzees are the natural host and reservoir for HIV-1:
- There is a wide spectrum of diversity between HIV-1 and SIVcpz.
- An apparent low prevalence of SIVcpz infection in wild-living animals.
- The presence of chimpanzees in geographic regions of Africa where AIDS was not initially recognized.
66Rather, it has been suggested that another, yet
unidentified, primate species could be the
natural host for SIVcpz and HIV-1.
67Marilyn
We recently identified a fourth chimpanzee with natural SIVcpz infection. This animal (Marilyn) was wild-caught in Africa (country of origin unknown), exported to the United States as an infant, and used as a breeding female in a primate facility until her death at age 26.
68How was the SIV found?
During a serosurvey in 1985, Marilyn was the only
chimpanzee of 98 tested who had antibodies
strongly reactive against HIV-1 by enzyme-linked
immunosorbent assay (ELISA) and western
immunoblot.
69Maybe Marilyn was infected with HIV during her stay in the U.S.?
She had never been used in AIDS research and had not received human blood products after 1969. She died in 1985 after giving birth to still-born twins.
70Evidence that she did not have AIDS
An autopsy revealed endometritis, retained placental elements and sepsis as the final cause of death. Depletion of lymphoid tissues was not noted.
Endometritis: inflammation of the lining of the uterus. Sepsis: blood poisoning.
71PCR was used to amplify HIV- or SIV-related DNA sequences directly from uncultured (frozen) spleen and lymph-node tissue obtained at the autopsy in order to characterize the infection responsible for Marilyn's HIV-1 seropositivity.
72Amplification and sequence analysis of subgenomic
gag (508 base pairs (bp)) and pol (766 bp)
fragments revealed the presence of a virus
related to, but distinct from, known SIVcpz and
HIV-1 strains.
73PCR was used to amplify and sequence four
overlapping subgenomic fragments that together
comprised a complete proviral genome. The genome
was termed SIVcpzUS.
74Provirus
The "provirus" is the form of the virus which is capable of being integrated into the host genome. In the case of HIV it means the DNA "copy" of the HIV genome (HIV normally carries its genes around in RNA form).
75Provirus
As far as the host cell's machinery is concerned, this extra DNA is no different from the cell's own DNA.
76Only three other SIVcpz strains have been reported: two from animals wild-caught in Gabon (SIVcpzGAB1 and SIVcpzGAB2), and one from a chimpanzee exported to Belgium from Zaire (SIVcpzANT).
77SIVcpzGAB1 and SIVcpzANT have been sequenced completely, but only 280 bp of the pol sequence are available for SIVcpzGAB2.
78- To determine the evolutionary relationships of SIVcpzUS to these and other HIV and SIV sequences:
- Sequences from the HIV sequence database (http://hiv-web.lanl.gov/HTML/compendium.html) were downloaded.
- Neighbour-joining was used to construct the tree, based on the full-length Pol sequences.
- Maximum likelihood was also used and yielded very similar topologies.
79The neighbour-joining method was applied to protein-sequence distances calculated by the method of Kimura. Clade support values were computed with 1,000 bootstrap replicates. NJ computations were performed using the CLUSTAL_X program.
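The resampling behind bootstrap replicates can be sketched as follows: each replicate is a new alignment of the same length, built by drawing columns of the original alignment with replacement. The toy alignment and the function name are hypothetical, and this is not how CLUSTAL_X is invoked. In a full analysis a tree would be built from every replicate, and the support for a clade would be the fraction of replicate trees that contain it.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the length fixed."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same sampled columns are used for every sequence, so columns stay intact.
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}

# Hypothetical toy protein alignment (not real data).
alignment = {
    "seqA": "MKVLAT",
    "seqB": "MKILAT",
    "seqC": "MRVLGT",
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
print(replicates[0])
```

Because whole columns are resampled, sites that support a grouping are over- or under-represented at random in each replicate, which is what lets the procedure measure how robust a clade is to sampling of sites.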
80These analyses identified SIVcpzUS unambiguously
as a new member of the HIV-1/SIVcpz group of
viruses.