Title: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications I, and Orthology
1 Bioinformatics and Evolutionary Genomics: Gene Trees, Gene Duplications (I), and Orthology
2 Gene Trees, Gene Duplications and Orthology
3 Phylogenetic gene trees: how to make them
- Homology: are two pieces of sequence related?
- Trees: when did they diverge (how are they related)?
- Start from a multiple sequence alignment
- All multiple sequence alignment programs make a global alignment, so feed them regions that you know are homologous: domains!
- MUSCLE / Clustal / T-Coffee
- Visual inspection of alignments (gaps, fragments vs complete sequences, weird things, e.g. A)
4 Put homologs in the alignment
- Even if they are not homologous, MUSCLE will align them (MUSCLE/ClustalW implicitly assume that the sequences you feed them are homologous)
- And in a phylogeny program, non-homologous sequences will be clustered
5 Visual inspection of alignments?!
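6 An additive tree which is wrongly reconstructed by UPGMA

     A    B    C    D
A    x   12    9    9
B   12    x    9    7
C    9    9    x    6
D    9    7    6    x

[Figure: the additive tree for leaves A, B, C, D, with branch lengths A 6, B 5, C 3, D 2 and an internal branch of length 1 separating (A, C) from (B, D)]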
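To see why UPGMA goes wrong on this matrix, the clustering can be sketched in a few lines. This is a minimal average-linkage sketch, not a full UPGMA implementation (it returns only the topology, no branch lengths); the names `DIST`, `dist`, `average_distance` and `upgma` are illustrative and do not come from the slides.

```python
from itertools import combinations

# Distance matrix from the slide (upper triangle only).
DIST = {
    ("A", "B"): 12, ("A", "C"): 9, ("A", "D"): 9,
    ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 6,
}

def dist(x, y):
    """Look up a leaf-to-leaf distance regardless of key order."""
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]

def average_distance(c1, c2):
    """UPGMA cluster distance: mean of all leaf-to-leaf distances."""
    members1, members2 = c1[0], c2[0]
    total = sum(dist(x, y) for x in members1 for y in members2)
    return total / (len(members1) * len(members2))

def upgma(leaves):
    """Return the UPGMA topology as nested tuples."""
    # Each cluster is (tuple of member leaves, subtree built so far).
    clusters = [((leaf,), leaf) for leaf in leaves]
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append((c1[0] + c2[0], (c1[1], c2[1])))
    return clusters[0][1]

upgma_tree = upgma(["A", "B", "C", "D"])
print(upgma_tree)  # ('A', ('B', ('C', 'D')))
```

UPGMA joins C and D first because they have the smallest distance (6), whereas in the additive tree C is the neighbour of A and D is the neighbour of B. C and D only look close because their branches are short, which is the unequal-rates problem.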
7 Neighbour-Joining (Saitou and Nei, 1987)
- Global measure: keeps total branch length minimal
- At each step, join two nodes such that distances are minimal (criterion of minimal evolution)
- Leads to an unrooted tree
8 Neighbour-Joining
- At each step all possible neighbour joinings are checked and the one corresponding to the minimal total tree length (calculated by adding all branch lengths) is taken.
9 Neighbour-Joining
r = net divergence

     A    B    C    D    r
A    x   12    9    9   30
B   12    x    9    7   28
C    9    9    x    6   24
D    9    7    6    x   22

$M_{ab} = d_{ab} - (r_a + r_b)/(N-2)$
$M_{ab} = 12 - (30 + 28)/(4-2) = -17$

     A    B    C    D
A    x  -17  -18  -17
B         x  -17  -18
C              x  -17
D                   x

Join A and C into a new node U:
$d_{au} = d_{ac}/2 + (r_a - r_c)/(2(N-2)) = 9/2 + (30 - 24)/(2 \cdot 2) = 6$
$d_{cu} = d_{ac} - d_{au} = 9 - 6 = 3$
$d_{bu} = (d_{ab} + d_{bc} - d_{ac})/2 = (12 + 9 - 9)/2 = 6$
$d_{du} = (d_{ad} + d_{cd} - d_{ac})/2 = (9 + 6 - 9)/2 = 3$

10

     U    B    D    r
U    x    6    3    9
B    6    x    7   13
D    3    7    x   10

     U    B    D
U    x  -16  -16
B         x  -16
D              x

e.g. join U and B into a new node V:
$d_{vu} = d_{ub}/2 + (r_u - r_b)/(2(N-2)) = 6/2 + (9 - 13)/(2 \cdot 1) = 3 - 2 = 1$
$d_{vb} = d_{ub} - d_{uv} = 6 - 1 = 5$
$d_{dv} = (d_{ud} + d_{bd} - d_{ub})/2 = (3 + 7 - 6)/2 = 2$
[Figure: the resulting unrooted tree with internal nodes U and V. Branch lengths: A-U 6, C-U 3, U-V 1, V-B 5, V-D 2]
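The worked example can be reproduced with a short script. This is a minimal sketch of neighbour-joining using the M criterion and branch-length formulas from the slides. The function name `neighbour_joining`, the nested-dict input format and the automatic node labels U, V, ... are assumptions made for illustration. Ties in M are broken by taking the first pair, which happens to match the choices made on the slides (A with C, then U with B).

```python
from itertools import combinations

def neighbour_joining(names, dist, new_names="UVWXYZ"):
    """Return {(node, neighbour): branch_length} for the NJ tree."""
    d = {a: dict(dist[a]) for a in names}   # working copy of the matrix
    nodes = list(names)
    edges = {}
    labels = iter(new_names)
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence r: sum of distances to all other current nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair with minimal M_ij = d_ij - (r_i + r_j)/(N-2).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[p[0]][p[1]] - (r[p[0]] + r[p[1]]) / (n - 2))
        u = next(labels)
        d_iu = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges[(i, u)] = d_iu
        edges[(j, u)] = d[i][j] - d_iu
        # Distances from the new node to every remaining node.
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = (d[i][k] + d[j][k] - d[i][j]) / 2
        nodes = [u] + [k for k in nodes if k not in (i, j)]
    a, b = nodes
    edges[(a, b)] = d[a][b]   # last remaining branch
    return edges

dist = {
    "A": {"B": 12, "C": 9, "D": 9},
    "B": {"A": 12, "C": 9, "D": 7},
    "C": {"A": 9, "B": 9, "D": 6},
    "D": {"A": 9, "B": 7, "C": 6},
}
nj_edges = neighbour_joining(["A", "B", "C", "D"], dist)
for (x, y), length in nj_edges.items():
    print(x, y, length)
```

Unlike UPGMA, the branch lengths recovered here add up to the original distances, for example A to B is 6 + 1 + 5 = 12.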
11 Unequal rates between species are a very real phenomenon
12 Character-based methods: parsimony and maximum likelihood
- Two-way classification in phylogeny: distance-based vs character-based
- Character-state methods search directly (i.e. without defining distances) for the tree that fits best to the data (the alignment)
13 Maximum likelihood
- Search for the tree with the highest likelihood
- For each possible tree, one computes the maximum likelihood (ML) value of the character state configurations among the sequences under study, and chooses the tree with the largest ML value as the preferred tree.
14 Maximum likelihood
- You have to specify a model of sequence evolution
- The likelihood for all sites is the product of the likelihoods for individual sites, assuming all the nucleotide sites evolve independently.
- The maximum likelihood method computes the probabilities for all possible combinations of ancestral states!
- ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree (hypothesis) would give rise to the observed data (the alignment). The tree found to have the highest (log) likelihood value is considered to be the preferred tree.
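The "product over independent sites" idea can be illustrated with the smallest possible case: two sequences joined by a single branch. This sketch assumes the Jukes-Cantor (JC69) substitution model, which is a choice made here for illustration and is not named on the slides; the example sequences and all function names are hypothetical. A real ML program additionally sums over ancestral states at internal nodes and searches over tree topologies.

```python
import math

def jc69_site_likelihood(x, y, t):
    """Likelihood of one aligned site (x, y) at branch length t under JC69."""
    e = math.exp(-4.0 * t / 3.0)
    p = 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e
    return 0.25 * p   # equilibrium frequency of x times P(x -> y | t)

def log_likelihood(seq1, seq2, t):
    """Sites are independent, so the log-likelihoods of the sites add up."""
    return sum(math.log(jc69_site_likelihood(x, y, t))
               for x, y in zip(seq1, seq2))

# Hypothetical toy alignment: 10 sites, 1 difference.
seq1 = "ACGTACGTAC"
seq2 = "ACGTACGTAT"

# Crude grid search for the branch length with the highest likelihood.
grid = [i / 1000 for i in range(1, 2001)]
t_ml = max(grid, key=lambda t: log_likelihood(seq1, seq2, t))

# Closed-form JC69 distance for comparison.
p_diff = sum(x != y for x, y in zip(seq1, seq2)) / len(seq1)
t_closed = -0.75 * math.log(1 - 4 * p_diff / 3)
print(t_ml, round(t_closed, 4))
```

The grid maximum lands next to the closed-form JC69 distance, which is a useful sanity check that the per-site likelihoods are multiplied (summed in log space) correctly.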
15 Interpreting trees
16 Interpreting the tree
- Taxonomic findings
- Paraphyly
- Monophyly
17 Interpreting the tree
- Outgroup: place the root between a distant homologous sequence and the rest of the group (b)
- Midpoint: place the root at the midpoint of the longest path (sum of branches between any two leaves). NB njplot
- Gene duplication: place the root between paralogous gene copies (b)
- NB: all are affected by rates!
18 Simple example (kinase)
19 Two genes per species: how to differentiate between one ancient or two recent duplications?
- Two genes in human chromosomes (Human A, Human B), two genes in mouse chromosomes (Mouse A, Mouse B)
20 Duplications, Speciations
[Figure: tree with nodes labelled 1, 2, 3 and ?]
21 Interpreting the tree: duplications vs speciations, going pseudo-3D
[Figure legend: Gene Duplication, Speciation]
22 Interpreting the tree: gene trees vs species trees
23 Interpreting the tree: example, vertebrate duplications
24 Interpreting the tree: Horizontal Gene Transfer (HGT)
[Figure labels: Bacteria, Eukarya, Archaea]
25 Jargon for interpretation: orthology (and paralogy) as a specification of homology when discussing two species
[Figure: gene tree with leaves mouse1, human2, human1; annotation: "Gene duplication by cell division"]
- Fitch 1970: two genes in two species are orthologous if they derive from one gene in their last common ancestor
- Informally: "the corresponding gene", implied to have the same function
- Genes can diverge by speciation, or by duplication
26 Orthology: annotating internal nodes as duplications or speciations
Because of the definition, how does that translate to a tree? With or without a species phylogeny?
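One way to annotate internal nodes without a species phylogeny is the species-overlap rule: a node is called a duplication if its two subtrees share at least one species, and a speciation otherwise. The sketch below assumes a rooted binary gene tree written as nested tuples and leaf names of the hypothetical form `Species_gene`; it illustrates the idea and is not the specific procedure used in the lecture.

```python
def species(tree):
    """Set of species found below a node; leaves look like 'Human_A'."""
    if isinstance(tree, str):
        return {tree.split("_")[0]}
    left, right = tree
    return species(left) | species(right)

def annotate(tree, out=None):
    """Label every internal node (pre-order) as duplication or speciation."""
    if out is None:
        out = []
    if isinstance(tree, str):
        return out
    left, right = tree
    shared = species(left) & species(right)
    out.append((tree, "duplication" if shared else "speciation"))
    annotate(left, out)
    annotate(right, out)
    return out

# One ancient duplication before the human-mouse split:
ancient = (("Human_A", "Mouse_A"), ("Human_B", "Mouse_B"))
# Two recent, lineage-specific duplications after the split:
recent = (("Human_A", "Human_B"), ("Mouse_A", "Mouse_B"))

labels_ancient = [label for _, label in annotate(ancient)]
labels_recent = [label for _, label in annotate(recent)]
print(labels_ancient)  # ['duplication', 'speciation', 'speciation']
print(labels_recent)   # ['speciation', 'duplication', 'duplication']
```

Pairs of genes whose last common node is a speciation are orthologs; pairs whose last common node is a duplication are paralogs. In the first tree Human_A and Mouse_A are orthologs, while in the second tree both human genes are co-orthologous to both mouse genes.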
27 Terminology: inparalogs, outparalogs, co-orthologs
28 Importance of orthology for comparative genomics: more resolution
[Figure: gene tree with leaves Af, Af, Bs, Bs, Ec, Ec, Hi, Mg]
- Gene family: present in Ec, Hi, Bs, Mg, Af
- Orthologs 1: present in Ec, Hi, Bs, Af
- Orthologs 2: present in Ec, Bs, Mg, Af
- Phenotype-gene correlation
- Function prediction if Hi is the only biochemically characterized enzyme
- Function prediction by co-occurrence
- Evolution of gene content: loss vs duplication
29 Heuristics for orthology definition
- Needed because:
- Speed (MSA plus reliable tree building is slow)
- Difficulty in deciding which things you should make a tree of in the first place (PFAM?)
- Difficulty in operationalizing nuanced tree orthology into group orthology
- Historically: bidirectional best BLAST hits (BBH)
30 BBH
[Figure: gene tree with leaves Af, Af, Bs2, Bs1, Ec2, Ec1, Hi, Mg]
Extracting tree-like information from pairwise similarities:

        Bs1   Bs2
Ec1      50    35
Ec2      33    48
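The bidirectional best hit rule can be sketched directly on the four similarity scores above. The sketch assumes one symmetric score per gene pair (real BLAST scores differ slightly by search direction); the function names are illustrative.

```python
# Pairwise similarity scores from the slide: (Ec gene, Bs gene) -> score.
scores = {
    ("Ec1", "Bs1"): 50, ("Ec1", "Bs2"): 35,
    ("Ec2", "Bs1"): 33, ("Ec2", "Bs2"): 48,
}

def best_hits(scores):
    """Best hit of every Ec gene in Bs, and of every Bs gene in Ec."""
    forward, reverse = {}, {}
    for (a, b), s in scores.items():
        if a not in forward or s > scores[(a, forward[a])]:
            forward[a] = b
        if b not in reverse or s > scores[(reverse[b], b)]:
            reverse[b] = a
    return forward, reverse

def bidirectional_best_hits(scores):
    """Pairs that are each other's best hit in the other genome."""
    forward, reverse = best_hits(scores)
    return sorted((a, b) for a, b in forward.items() if reverse[b] == a)

bbh = bidirectional_best_hits(scores)
print(bbh)  # [('Ec1', 'Bs1'), ('Ec2', 'Bs2')]
```

Here the pairwise scores alone recover the two orthologous pairs that the tree shows, without building an alignment or a tree.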
31 BBH issues 1: unequal rates
[Figure: gene tree; legend: Duplication, Speciation; bracket labels: 1:1 orthologs, Outparalogs. Leaves in figure order:]
prpC N. meningitidis
prpC E. coli
prpC P. aeruginosa
...
VCh1337 V. cholerae
...
mmgD B. subtilis
mmgD B. halodurans
citZ B. subtilis
citZ B. halodurans
VCh2092 V. cholerae
...
gltA P. multocida
gltA E. coli
gltA P. aeruginosa
gltA N. meningitidis
32 BBH issues 2: ignores inparalogs
[Figure: gene tree with leaves Af, Af, Bs2, Bs1, Ec2, Ec1, Hi, Bs3]
Prevalence? Depends on e.g. evolutionary distance, group vs pairwise orthology. At least 16 prokaryotes. INPARANOID
Pairwise similarities:
Ec2-Bs2 48
Ec2-Bs3 51 (Bs2-Bs3 70)
Ec1-Hi 70
Ec2-Hi 38
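The scores above show how a strict best-hit rule drops an inparalog, and how a simple extra rule can rescue it. The rule sketched here is a simplified version of the idea behind INPARANOID: a gene from the same genome is added as an inparalog when it is more similar to the best-hit gene than that gene is to its partner in the other genome. Only the Ec2 side is computed, since the slide gives no Ec1-Bs scores; all function names are illustrative.

```python
# Scores from the slide.
cross = {("Ec2", "Bs2"): 48, ("Ec2", "Bs3"): 51}   # between genomes
within_bs = {("Bs2", "Bs3"): 70}                    # within the Bs genome

def best_hit(query, cross):
    """Highest-scoring partner of `query` in the other genome."""
    hits = {b: s for (a, b), s in cross.items() if a == query}
    return max(hits, key=hits.get)

def inparalogs_of(seed, seed_score, within):
    """Same-genome genes closer to `seed` than `seed` is to its ortholog."""
    found = []
    for (x, y), s in within.items():
        if s > seed_score:
            if x == seed:
                found.append(y)
            elif y == seed:
                found.append(x)
    return sorted(found)

seed = best_hit("Ec2", cross)                       # the best hit only
co_orthologs = [seed] + inparalogs_of(seed, cross[("Ec2", seed)], within_bs)
print(seed, co_orthologs)  # Bs3 ['Bs3', 'Bs2']
```

A plain best-hit rule reports only Bs3 as the ortholog of Ec2, although Bs2 and Bs3 arose from a duplication after the split from Ec and are both co-orthologs of Ec2.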
33 BBH issues 3: differential gene loss
[Figure: gene tree with leaves Af, Af, Bs2, Bs1, Ec2, Ec1, Hi, Mg]
Pairwise similarity: Mg-Hi 35
34 Other large-scale orthology schemes: Inparanoid
35 Orthologous groups
- The solution to the non-transitivity of the concept of orthology sensu stricto is group orthology
- Conceptually: all proteins that are directly descended from one protein in the last common ancestor are considered orthologous to each other
- Operationally: combine all connected best triangular hits into Clusters of Orthologous Groups (COGs, Tatusov et al., 1997). WWW.NCBI.NLM.GOV (Watch out for fusion/fission though!)
36 Large-scale orthology schemes: COG
- 1. Perform the all-against-all protein sequence comparison.
- 2. Detect and collapse obvious paralogs, that is, proteins from the same genome that are more similar to each other than to any proteins from other species.
- 3. Detect triangles of mutually consistent, genome-specific best hits (BeTs), taking into account the paralogous groups detected at step 2.
- 4. Merge triangles with a common side to form COGs.
- 5. A case-by-case analysis of each COG. This analysis serves to eliminate false positives and to identify groups that contain multidomain proteins by examining the pictorial representation of the BLAST search outputs. The sequences of detected multidomain proteins are split into single-domain segments and steps 1-4 are repeated with these sequences, which results in the assignment of individual domains to COGs in accordance with their distinct evolutionary affinities.
- 6. Examination of large COGs that include multiple members from all or several of the genomes using phylogenetic trees, cluster analysis and visual inspection of alignments; as a result, some of these groups are split into two or more smaller ones that are included in the final set of COGs.
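Steps 3 and 4 of the procedure above can be sketched as a small graph operation. The sketch assumes the best hits have already been computed and are treated as undirected pairs, which is a simplification of the genome-specific BeTs in the original method; the gene names, the `genome_of` convention (first two letters of the gene name) and the function names are hypothetical.

```python
from itertools import combinations

def find_triangles(best_hits, genome_of):
    """Triples of genes from three genomes that are all pairwise best hits."""
    genes = sorted({g for pair in best_hits for g in pair})
    triangles = []
    for trio in combinations(genes, 3):
        if len({genome_of(g) for g in trio}) < 3:
            continue
        if all(frozenset(p) in best_hits for p in combinations(trio, 2)):
            triangles.append(trio)
    return triangles

def merge_triangles(triangles):
    """Merge triangles that share a side (two genes) into clusters."""
    # Each group is stored as the set of its sides.
    groups = [{frozenset(p) for p in combinations(t, 2)} for t in triangles]
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(groups)), 2):
            if groups[i] & groups[j]:          # common side found
                other = groups.pop(j)
                groups[i] |= other
                changed = True
                break
    return [sorted({g for side in grp for g in side}) for grp in groups]

# Hypothetical best-hit pairs between genes of five genomes.
pairs = [("Ec1", "Bs1"), ("Ec1", "Hi1"), ("Bs1", "Hi1"),
         ("Ec1", "Af1"), ("Bs1", "Af1"),
         ("Ec2", "Bs2"), ("Ec2", "Mg1"), ("Bs2", "Mg1")]
best_hits = {frozenset(p) for p in pairs}

triangles = find_triangles(best_hits, genome_of=lambda g: g[:2])
cogs = merge_triangles(triangles)
print(cogs)  # [['Af1', 'Bs1', 'Ec1', 'Hi1'], ['Bs2', 'Ec2', 'Mg1']]
```

Af1 and Hi1 end up in the same cluster even though they are not each other's best hit, because both form a triangle with the Ec1-Bs1 side. This is how group orthology sidesteps the non-transitivity of pairwise best hits.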
38 Other large-scale orthology schemes: OrthoMCL
39 The too-ambitious comparative genomics dilemma: duplication/speciation vs domains
[Figure: schematic with axis labels TIME (very distant past to present) and Sequence divergence. Labels: domain composition, accretion; single structural elements?; gene fusion; domains; gene; domain cassettes; trivial orthologs; orthologs; homologs; distant homologs]
i.e. genome comparison between close species: no domain considerations, sub-sub-orthologs. Between distant homologs: loads of domain considerations.
40 Implication of coupling between duplication and domain accretion for evolution and function prediction
- For some genes life is easy: 1:1:1 orthologs, no fusion / domains, a couple of losses. But a minority of families, containing a large proportion of proteins, is a formidable challenge: domain permutations and duplications make life complicated.
41 Orthology and function prediction: BLAST with a newly sequenced globin from frog
What kind of globin is it?
42 Globins
[Figure label: BLAST query]
43 Function prediction from orthologs vs from homologs that are not orthologs
- Orthologs tend to have the exact same molecular function, mere HTANOs (homologs that are not orthologs) do not
- and they operate in the same pathway.
- Orthologs mostly have the same domain composition
44 ... but inparalogs: fate after duplication, neofunctionalization or subfunctionalization
- Even evolutionarily true orthologs can have different functions
- Subfunctionalization: both co-orthologs have taken over some aspect of the ancestral function and have lost other aspects
- Neofunctionalization (acquiring a new function) or loss of function: one of the co-orthologs does something different now.
45 Does retaining the ancestral role correlate with speed of sequence evolution? Yes, but a substantial minority is inconsistent
[Figure values: 386, 220]
46 rfbB / rffG
RfbB and RffG catalyze the same reaction, but are involved in two different biological processes.
- rfb gene cluster: biosynthesis of O-specific polysaccharides (inner membrane).
- rff gene cluster complex: biosynthesis of enterobacterial common antigen (outer membrane).
47 Why do we observe inconsistencies?
[Figure: histogram of consistent vs inconsistent cases. x-axis: sequence identity between inparalogs (%); y-axis: frequency (% cases)]
Not because of chance due to lack of divergence time
48 Why do we observe inconsistencies?
Similar sequence divergence of inparalogs relative to their single ortholog: molecular function similar? Any inconsistencies are then a chance outcome: both duplicates have diverged, but at (roughly) the same evolutionary speed (most amino acid substitutions have only been subject to purifying selection and not to adaptive selection)
49
- In certain orthology schemes gene order is given precedence over highest similarity
- The gene at the conserved position is considered the original and the other duplicate the copy