Bioinformatica e Analisi Funzionale del Genoma

Provided by: grazian
1
Bioinformatica e Analisi Funzionale del Genoma
Molecular Phylogenetics and Evolutionary Analysis
2
Molecular Phylogenetics and Evolutionary Analysis
Tutorial Outline
  • Introduction
  • Molecular mechanisms underlying genetic evolution
  • Homology, Orthology and Paralogy
  • DNA vs protein sequences in evolutionary analyses
  • Multiple Alignment
  • Genetic Distances
  • Site-Specific rates
  • Molecular Phylogeny
  • Software packages for evolutionary analysis

3
Molecular Phylogenetics and Evolutionary Analysis
"Nothing in Biology makes sense except in the light of Evolution."
Theodosius Dobzhansky (1900-1975)
Rosetta Stone
4
Mechanism of Evolution
Mutational changes (errors) in genetic
transmission are at the basis of the evolutionary
processes that, starting from an ancestral life
form, have produced the amazing diversification of
today's life forms. Basic types of DNA changes
include:
  • point mutations (nucleotide substitutions)
  • insertions
  • deletions
  • inversions and other kinds of rearrangements

5
Mechanism of Evolution
Mutagenesis, the process that gives rise to
mutations, can be:
  • Spontaneous, e.g. through errors in the normal
    process of DNA replication, mostly due to the
    peculiar properties of purine and pyrimidine
    bases
  • Induced, e.g. through radiation or chemical
    damage to DNA

6
Mechanism of Evolution
Spontaneous mutagenesis and errors in DNA
replication
  • In humans, DNA replication must accurately copy
    6 x 10^9 base pairs every time a cell divides
  • On average, one new mutation occurs per 10^10 bp
    incorporated

7
Mechanism of Evolution
DNA polymerases ensure accuracy by two main
mechanisms:
(a) Base selection
Only A-T and G-C base pairs fit properly in the
active site of the polymerase.
(b) Proofreading and repair
If a wrong base is inserted, it is removed
and replaced with the correct one.
8
Mechanism of Evolution
Point Mutations
[Figure: transitions exchange a purine for a purine or a pyrimidine for a pyrimidine; transversions exchange a purine for a pyrimidine or vice versa]
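The transition/transversion distinction can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `classify_substitution` is hypothetical.

```python
# Purines (A, G) and pyrimidines (C, T) define the two classes of point mutation.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Return 'identical', 'transition' or 'transversion' for two DNA bases."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("expected one of A, C, G, T")
    if a == b:
        return "identical"
    # Same chemical class on both sides -> transition, otherwise transversion.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "T"))  # transversion
```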
9
Mechanism of Evolution
Spontaneous mutations result from a basic
property of purine and pyrimidine bases: each
may assume two structures through keto-enol or
amino-imino tautomerism.
[Figure: guanine in its keto and enol forms, G (keto) and G (enol)]
10
Mechanism of Evolution
Keto-enol and amino-imino tautomerism
Rare isomeric forms (tautomers) of the 4 bases
exist in equilibrium with the major forms (1:1000).
The rare forms reverse the base pairing rules.
11
Mechanism of Evolution
Base pairing of the rare forms produces transitions:
  • A (amino) - C (imino)
  • A (imino) - C (amino)
  • G (keto) - T (enol)
  • G (enol) - T (keto)
These minor base pairs may escape proofreading
and generate transition mutations in further
rounds of DNA replication.
12
Mechanism of Evolution
Base pairing of the rare forms also produces
transversions
Tautomer concentration is 10^-4 (enol) and 10^-5
(imino). If the rare tautomer simultaneously
pairs with a nucleotide in the syn conformation
(0.05-0.1) in the template, then transversion
substitutions (purine <-> pyrimidine) may occur.
13
Mechanism of Evolution
Transitions vs Transversions
As expected, transitions are more frequent than
transversions. Observed frequencies of point
mutations (10^-9 to 10^-10 per base incorporated)
are much lower than expected (about 10^-6) because
of repair systems. Point mutations are also
produced by other types of non-canonical
pairing, depurination, oxidative
deamination, etc.
14
Mechanism of Evolution
Slippage generates small insertions and deletions
1st replication round
2nd replication round
15
Mechanism of Evolution
Unequal crossing-over may generate larger
duplications and insertions
Insertion

Deletion
16
Mechanism of Evolution
Mutation and Fixation
To be genetically relevant, a mutation must
be heritable, i.e. occur in germline cells,
and spread to a substantial fraction of the
population, i.e. reach fixation.
17
Mechanism of Evolution
Sequence and length changes in evolution
18
Homology, Orthology and Paralogy
  • Similarity
    - resemblance between two biosequences
    - local or global
    - can be measured
  • Homology
    - common evolutionary descent
    - established through an evolutionary analysis
    - yes or no

19
Orthology vs Paralogy
Both imply homology.
  • Orthology: sequences originated from a common
    ancestor following a speciation event
  • Paralogy: sequences originated from a common
    ancestor following a gene duplication event
  • Xenology: sequences originated from a lateral
    gene transfer event

20
Orthology vs Paralogy
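[Figure: an ancestral gene undergoes gene duplication into gene A and gene B, followed by speciation; genes separated by speciation are orthologous, genes separated by duplication are paralogous]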
21
DNA vs Proteins
Sequence conservation is in the order DNA <
protein sequence < protein secondary structure
< protein 3D structure
Evolutionary Information
22
DNA vs Proteins
Ser  Gly  Arg  His  Lys
UCU  GGU  CGU  CAU  AAA
UCC  GGC  CGC  CAC  AAG
UCG  GGG  CGG
UCA  GGA  CGA
AGU
AGC
Many different coding sequences encode the
same protein.
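To make the degeneracy concrete, the sketch below counts how many distinct coding sequences encode the peptide Ser-Gly-Arg-His-Lys, using only the codons listed on the slide. The standard genetic code has two further Arg codons (AGA, AGG) that the slide does not list, so the true count is larger. The names `CODONS`, `count_coding_sequences` and `coding_sequences` are illustrative.

```python
from itertools import product
from math import prod

# Synonymous codons exactly as listed on the slide (Arg is incomplete there).
CODONS = {
    "Ser": ["UCU", "UCC", "UCG", "UCA", "AGU", "AGC"],
    "Gly": ["GGU", "GGC", "GGG", "GGA"],
    "Arg": ["CGU", "CGC", "CGG", "CGA"],
    "His": ["CAU", "CAC"],
    "Lys": ["AAA", "AAG"],
}
PEPTIDE = ["Ser", "Gly", "Arg", "His", "Lys"]


def count_coding_sequences(peptide, codons=CODONS):
    """Number of distinct mRNA sequences encoding the peptide."""
    return prod(len(codons[aa]) for aa in peptide)


def coding_sequences(peptide, codons=CODONS):
    """Lazily enumerate every mRNA sequence encoding the peptide."""
    for combo in product(*(codons[aa] for aa in peptide)):
        yield "".join(combo)


print(count_coding_sequences(PEPTIDE))   # 384
print(next(coding_sequences(PEPTIDE)))   # UCUGGUCGUCAUAAA
```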
23
DNA vs Proteins
Protein: 2 changes
DNA: 52 changes
24
Protein sequence vs structure
Spinach (1A70) and Azotobacter (7FD1) ferredoxins
25
Steps in phylogenetic analysis
Multiple Alignment
Compute genetic distance or other analyses
Get a tree or other evolutionary inference
26
Multiple Alignment
The multiple alignment is critical: a bad
alignment produces wrong results. Carefully
evaluate the alignment (usually
computer-generated) before the analysis and,
where needed, introduce manual adjustments based on
structural or functional information and remove
low-quality regions.
27
Genetic Distance
Only 2 of the 7 changes that occurred are observed in the alignment
28
Genetic Distance
The observed proportion of changes is a poor
estimator of the actual number of evolutionary
changes at increasing divergence
[Figure: expected vs observed difference at increasing divergence; the observed difference levels off (saturation)]
29
Stochastic Models
Molecular evolution is modeled as a
time-dependent probabilistic process. Several
models have been proposed, differing in the
assumptions they accept:
  • All nucleotide sites change independently.
  • The substitution rate is constant over time and
    in different lineages.
  • The base composition is at equilibrium.
  • The nucleotide substitution rate is the same for all
    kinds of changes and for all sites.

30
Base Composition
Stochastic models assume that base composition is
at equilibrium, i.e. that base composition is
roughly the same over the collection of sequences
being studied. This condition is also known as
Stationarity. Violation of stationarity may lead
to incorrect inferences.
31
Base Composition
The stationarity check is mandatory before using
stochastic models.
Puzzle check
32
Substitution Model
[Figure: substitution scheme among the four bases A, C, G and T, with transitions and transversions marked]
33
Jukes-Cantor Model
Equilibrium base composition: 1/4, 1/4, 1/4, 1/4
One-parameter model
[Figure: substitution scheme among A, C, G and T]
34
Jukes-Cantor Model
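$q_t$: fraction of identical sites at time $t$; $\lambda$: substitution rate per unit time
[Figure: ancestral sequence A diverging into sequences B and C, shown at times T = t and T = t+1]
Sites with B = C at time $t$ (fraction $q_t$) that remain identical at $t+1$: $(1-\lambda)^2 q_t$
Sites with B $\neq$ C at time $t$ (fraction $1-q_t$) that become identical at $t+1$: $2(\lambda/3)(1-\lambda)(1-q_t)$
Neglecting terms in $\lambda^2$:
$q_{t+1} = (1-2\lambda)\,q_t + \frac{2}{3}\lambda(1-q_t)$
$q_{t+1} - q_t \approx \frac{dq}{dt} = \frac{2}{3}\lambda - \frac{8}{3}\lambda q$
With $p = 1 - q$:
$d = 2\lambda T = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$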
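The final Jukes-Cantor formula can be applied directly to a pair of aligned sequences. The sketch below is a minimal illustration with hypothetical function names and toy sequences; it skips gapped columns, which is one common convention and an assumption here.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Observed proportion p of differing sites, ignoring columns with a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p: float) -> float:
    """d = -3/4 ln(1 - 4p/3), in substitutions per site."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75); beyond that the data are saturated")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTACGTTT")   # toy data: 2 differences in 10 sites
print(p)                                     # 0.2
print(round(jukes_cantor_distance(p), 3))    # 0.233
```

The corrected distance exceeds the observed proportion, and the gap widens as p approaches the saturation value of 0.75.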
35
Kimura Model
Equilibrium base composition: 1/4, 1/4, 1/4, 1/4
Two-parameter model
[Figure: substitution scheme among A, C, G and T, with separate rates for transitions and transversions]
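The slide does not give the distance formula for this model. Assuming the standard Kimura two-parameter correction, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], with P and Q the observed proportions of transitions and transversions, a minimal sketch looks like this (the function name and the handling of non-ACGT characters are illustrative choices):

```python
import math

# Unordered base pairs that count as transitions (purine<->purine, pyrimidine<->pyrimidine).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous characters (an assumption)
        n += 1
        if a == b:
            continue
        if frozenset((a, b)) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if n == 0:
        raise ValueError("no comparable sites")
    P, Q = ts / n, tv / n
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


# Toy data: one transition (A->G) and one transversion (A->T) in 10 sites.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAT"), 4))
```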
36
General Time Reversible Model
Equilibrium base composition: (qA, qC, qG, qT)
9-parameter model
[Figure: substitution scheme among A, C, G and T]
37
Outline of substitution models
38
More Realistic Models
Models can be made more parameter rich to
increase their realism.
The most common additional parameters are:
  • A correction for the proportion of sites which
    are unable to change
  • A correction for variable site rates at those
    sites which can change

39
Site Rate Heterogeneity
Different sites in DNA sequences may have quite
different probabilities of change. For this
reason it is advisable to analyze separately
sites presumably subjected to different
evolutionary dynamics (e.g. first and second vs
third codon positions). Furthermore rate
variation among sites can be modeled by a Gamma
distribution whose shape parameter alpha
specifies the range of rate variation among sites.
40
Gamma Distribution
Higher alpha means lower rate heterogeneity among sites.
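The effect of alpha can be checked by simulation. The sketch below assumes the usual parameterization in which site rates follow a Gamma distribution with shape alpha and mean 1 (scale 1/alpha), so the variance of the rates is 1/alpha. The function name and the sample size are illustrative.

```python
import random
import statistics


def sample_site_rates(alpha: float, n_sites: int, seed: int = 0) -> list:
    """Draw relative site rates from Gamma(shape=alpha, scale=1/alpha), mean 1."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


for alpha in (0.5, 1.0, 5.0):
    rates = sample_site_rates(alpha, 20000)
    # The sample variance should be close to the theoretical value 1/alpha.
    print(f"alpha={alpha}: mean={statistics.mean(rates):.2f}, "
          f"variance={statistics.variance(rates):.2f}, expected={1 / alpha:.2f}")
```

With a small alpha most sites evolve very slowly while a few evolve very fast; with a large alpha the rates cluster around the mean.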
41
More vs Fewer Parameter Models
Models can be made more parameter rich to
increase their realism, but the more parameters
you estimate from the data, the more time is needed
for an analysis and the more sampling error
accumulates.
One might have a realistic model but
large sampling errors.
Realism comes at a cost in
time and precision!
Fewer parameters may give an
inaccurate estimate, but more parameters decrease
the precision of the estimate.
In general, use the
simplest model which fits the data.
42
Parameter Estimation
Use PAUP tree scores to estimate model
parameters by maximum likelihood (ML).
43
Distance measure for protein sequences
When comparing protein sequences, a parameter-rich
model is inapplicable because proteins use a 20-letter
alphabet. A suitable approximation is given by
the Kimura formula,
with p the observed proportion of different amino
acids.
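The formula itself is not reproduced in this transcript. Assuming it is Kimura's empirical correction for protein distances, d = -ln(1 - p - 0.2 p^2), a minimal sketch is given below; the function name is hypothetical.

```python
import math


def kimura_protein_distance(p: float) -> float:
    """Kimura's empirical protein distance, d = -ln(1 - p - 0.2 p^2).

    p is the observed proportion of differing amino acids. The correction is
    undefined once 1 - p - 0.2 p^2 reaches zero (p of roughly 0.85).
    """
    arg = 1.0 - p - 0.2 * p * p
    if p < 0 or arg <= 0:
        raise ValueError("p is outside the range where the correction is defined")
    return -math.log(arg)


for p in (0.05, 0.10, 0.30, 0.60):
    print(p, round(kimura_protein_distance(p), 4))
```

As with nucleotide corrections, the estimated distance grows faster than p, reflecting multiple substitutions at the same site.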
44
Genetic distances
The unit of measure of a genetic distance is
substitutions/site.
45
Estimating site-specific rate variability
Rationale
  • Substitutions between closely related sequences
    are likely to have occurred at fast evolving
    sites
  • Between closely related sequences, substitutions
    involving biochemically distinct amino acids will
    tend to occur at the least constrained (faster
    evolving) sites.

46
Estimating site-specific rate variability
The variability of the i-th site in a multiple
alignment of N sequences, L sites long is given
by
where dij is:
  • Nucleotide sequences: 0 or 1, depending on whether
    a nt substitution is absent or observed in the
    j-th pairwise comparison
  • Protein sequences: a measure of amino acid pair
    distance, ranging from 0 (identity) to 1 (least
    common substitution)
and Kj is the overall genetic distance for the
j-th comparison.
A relative variability index is given by gi = ni
/ nmax.

47
Human mtDNA D-loop site variability
[Figure: relative variability (y-axis, 0 to 0.8) by site position (x-axis) along the human mtDNA D-loop, computed with the SiteVar software; labeled features include the HVR1 region, positions 146, 152 and 195, CSB1, CSB2, CSB3 and OH]
48
Molecular Phylogeny
Evolutionary relationships between organisms, or
more generally between homologous genes, can be
suitably represented by phylogenetic trees. A
phylogenetic tree is a graph composed of nodes
and branches in which only one branch connects
two adjacent nodes. The nodes represent taxonomic
units, and the branches define the relationships
among the units in terms of descent and ancestry.
49
Molecular Phylogeny: Tree topology
50
Molecular Phylogeny: Root of the Tree
Rooted tree
Unrooted tree
51
Molecular Phylogeny: Types of Trees
  • Cladogram: all branch lengths equal
  • Phylogram: branch lengths proportional to distance
[Figure: the same tree drawn as a cladogram and as a phylogram with branch lengths between 0.001 and 0.098]
52
Molecular Phylogeny: Tree Styles
[Figure: one tree of HUMAN, CHIMP, GORILLA, ORANG, MACAQUE and OWL MONKEY drawn in two styles, together with its Newick format]
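The Newick format writes a tree as nested parentheses terminated by a semicolon. The sketch below serializes a nested-tuple tree; the topology used is the conventional primate phylogeny and is an assumption, since the slide's figure is not available in the transcript. Underscores stand in for spaces in unquoted Newick labels.

```python
def to_newick(node) -> str:
    """Serialize a nested-tuple tree (leaves are strings) without the final ';'."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


# Assumed topology, for illustration only.
tree = ((((("HUMAN", "CHIMP"), "GORILLA"), "ORANG"), "MACAQUE"), "OWL_MONKEY")

print(to_newick(tree) + ";")
# (((((HUMAN,CHIMP),GORILLA),ORANG),MACAQUE),OWL_MONKEY);
```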
53
Molecular Phylogeny: Tree building methods
54
Molecular Phylogeny: UPGMA
  • Unweighted Pair Group Method with Arithmetic Mean
  • uses a distance matrix
  • a sequential algorithm identifies local
    topological relationships
  • tree built in a stepwise manner
  • arithmetic mean defines the distances between
    taxa (initial or composite)

[Figure: stepwise UPGMA clustering of five taxa (1 to 5)]
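The stepwise procedure can be sketched as follows. This is a minimal, unoptimized illustration on a hypothetical four-taxon distance matrix, not the implementation of any particular package; the function name and the three-decimal Newick output are illustrative choices.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA and return a Newick string with branch lengths.

    names: list of taxon labels; dist: symmetric matrix as a list of lists.
    """
    # cluster id -> (newick text, number of taxa, height of the cluster node)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # pairwise distances keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # 1. pick the closest pair of clusters
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        text_i, size_i, height_i = clusters.pop(i)
        text_j, size_j, height_j = clusters.pop(j)
        # 2. place the new node halfway between them
        height = dij / 2
        text = (f"({text_i}:{height - height_i:.3f},"
                f"{text_j}:{height - height_j:.3f})")
        # 3. distance to every other cluster: size-weighted arithmetic mean
        for k in clusters:
            dik = d.pop((min(i, k), max(i, k)))
            djk = d.pop((min(j, k), max(j, k)))
            d[(k, next_id)] = (size_i * dik + size_j * djk) / (size_i + size_j)
        del d[(i, j)]
        clusters[next_id] = (text, size_i + size_j, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


# Hypothetical ultrametric distances among four taxa.
names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, dist))
# (D:5.000,(C:3.000,(A:1.000,B:1.000):2.000):2.000);
```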
55
Molecular Phylogeny: Molecular Clock
UPGMA assumes a homogeneous substitution rate along
all lineages, i.e. the Molecular Clock.
$V = K / (2T)$, $T = K / (2V)$
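Reading V as the substitution rate per site per year, K as the genetic distance between two lineages and T as the time since their divergence, the two relations support a simple calibration workflow: estimate V from a pair with a known divergence time, then date another pair. All numbers below are hypothetical.

```python
def substitution_rate(k_calibration: float, t_calibration: float) -> float:
    """V = K / (2T): rate per site per year from a dated divergence."""
    return k_calibration / (2 * t_calibration)


def divergence_time(k: float, rate: float) -> float:
    """T = K / (2V): time since divergence, in years."""
    return k / (2 * rate)


# Hypothetical calibration: K = 0.10 substitutions/site for a split dated at 5 My.
rate = substitution_rate(0.10, 5e6)
# Hypothetical query pair with K = 0.04 substitutions/site.
print(rate)                         # about 1e-8 substitutions/site/year
print(divergence_time(0.04, rate))  # about 2e6 years
```

The factor 2 appears because substitutions accumulate independently along both lineages after the split.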
56
Molecular Phylogeny: Molecular Clock - Divergence Times
calibration
57
Molecular Phylogeny: Molecular Clock - Divergence Times
58
Molecular Phylogeny: Neighbor Joining
Determines the N-3 internal branches that give
the smallest tree length (sum of all branches)
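In the usual formulation of Neighbor Joining, the pair of taxa to join at each step is the one minimizing Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). The sketch below shows only this selection step, on a hypothetical five-taxon matrix; a full implementation would also compute branch lengths and update the matrix after each join.

```python
def nj_q_matrix(dist):
    """Return {(i, j): Q(i, j)} for all pairs i < j of a symmetric distance matrix."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    return {
        (i, j): (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
        for i in range(n) for j in range(i + 1, n)
    }


def pair_to_join(dist):
    """Indices of the pair with the smallest Q value (first one found on ties)."""
    q = nj_q_matrix(dist)
    return min(q, key=q.get)


# Hypothetical distances among five taxa.
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(pair_to_join(dist))        # (0, 1)
print(nj_q_matrix(dist)[(0, 1)]) # -50
```

Note that the joined pair (0, 1) is not the pair with the smallest raw distance, which is (3, 4): Q corrects each distance for how far the two taxa are from everything else.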
59
Molecular Phylogeny: Drawbacks of clustering methods
Loss of Information
  • Uninterpretable branch lengths
  • dtree < dobs: biologically impossible
  • occasionally even dtree < 0

The method does not optimize an objective
function. Clustering methods merely produce a
tree, but do not allow us to evaluate (i) the
quality of the tree or (ii) competing hypotheses.
60
Maximum Parsimony
  • Uses the principle of minimum evolution to
    identify the tree that requires the minimum number of
    evolutionary changes to explain the divergence
    between sequences
  • Often, no unique tree can be inferred
  • Exhaustive search is infeasible for large datasets
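Counting the minimum number of changes on a given topology is commonly done with Fitch's algorithm, which the slides do not name; it is used here as one standard way to score a tree. The sketch compares the three possible groupings of four hypothetical taxa on a toy alignment.

```python
def fitch(node, states):
    """Return (possible ancestral states, minimum changes) for one site.

    node: nested tuples of taxon names (binary tree); states: taxon -> character.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    changes = left_changes + right_changes
    common = left_set & right_set
    if common:
        return common, changes           # children agree: no extra change
    return left_set | right_set, changes + 1  # children disagree: one change


def parsimony_score(tree, alignment):
    """Sum the Fitch changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Toy alignment (hypothetical data).
alignment = {"t1": "AAG", "t2": "AAG", "t3": "CAT", "t4": "CGT"}
topologies = [
    (("t1", "t2"), ("t3", "t4")),
    (("t1", "t3"), ("t2", "t4")),
    (("t1", "t4"), ("t2", "t3")),
]
for tree in topologies:
    print(tree, parsimony_score(tree, alignment))
# The first topology needs 3 changes, the other two need 5 each.
```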

61
Maximum Parsimony
4 changes
5 changes
6 changes
62
Maximum Likelihood
Given a dataset D (i.e. a multiple alignment)
  • Maximize the likelihood function $L(S, w, R) = \ln P(D \mid S, w, R)$ with
    - Tree topology S
    - Branch lengths w = (w1, w2, w3, w4, w5)
    - Rate matrix R (and other evolutionary parameters,
      e.g. Gamma, Inv)

[Figure: the three possible topologies for four taxa (1 to 4), labeled State 1, State 2 and State 3]
63
Maximum Likelihood
  • Very reliable method
  • Provides tree topology, branch lengths and parameter
    estimates; may account for site variability and
    the fraction of invariant sites
  • May compare alternative hypotheses
  • However, it is computationally very intensive and
    inapplicable to large datasets; some
    approximation methods are available, such as
    Quartet Puzzling and Bayesian Inference.

64
Maximum Likelihood: Testing alternative hypotheses
H0: no Clock (L0)
H1: Clock (L1)
Likelihood Ratio Test (LRT): $2(L_1 - L_0) \sim \chi^2_{k-2}$
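The test can be sketched with the standard library alone. Two assumptions are made here: L0 and L1 are taken to be negative log-likelihood scores (the form in which many programs report them), so that the statistic is non-negative, and k is taken to be the number of taxa. The likelihood values in the example are hypothetical.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1),
    starting from Q(1/2, y) = erfc(sqrt(y)) or Q(1, y) = e^(-y), with y = x/2.
    """
    if x < 0 or df < 1:
        raise ValueError("need x >= 0 and df >= 1")
    y = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q


def clock_lrt(neg_lnl_clock: float, neg_lnl_no_clock: float, n_taxa: int):
    """Return (statistic, degrees of freedom, p-value) for the clock test."""
    statistic = 2 * (neg_lnl_clock - neg_lnl_no_clock)
    df = n_taxa - 2
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical scores for a 7-taxon dataset.
stat, df, p = clock_lrt(neg_lnl_clock=1010.0, neg_lnl_no_clock=1000.0, n_taxa=7)
print(stat, df, p)
```

A small p-value means the clock constraint makes the fit significantly worse, so the molecular clock is rejected for that dataset.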
65
Assessing Tree Reliability: Bootstrap
Bootstrap: resampling with replacement
Jackknife: resampling without replacement
Consensus
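Both schemes resample alignment columns, not sequences. The sketch below builds one pseudo-replicate of each kind from a toy alignment; in practice a tree is inferred from every replicate and the replicate trees are summarized in a consensus tree. Function names, the 50% jackknife fraction and the data are illustrative.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw as many columns as the original, with replacement."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def jackknife_alignment(alignment, rng, keep=0.5):
    """Keep a random fraction of the columns, without replacement."""
    n_cols = len(next(iter(alignment.values())))
    cols = sorted(rng.sample(range(n_cols), int(n_cols * keep)))
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


# Toy alignment (hypothetical data).
alignment = {"t1": "ACGTACGTAC", "t2": "ACGTTCGTAA", "t3": "ACCTACGGAC"}
rng = random.Random(42)
print(bootstrap_alignment(alignment, rng))
print(jackknife_alignment(alignment, rng))
```

The same column indices are applied to every sequence, so each replicate is still a valid alignment.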
66
Assessing Tree Reliability: Consensus Tree
67
Molecular Phylogeny: Programs and Packages
http://evolution.genetics.washington.edu/phylip/software.html
68
Molecular Phylogeny: Major Software