Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications I, and Orthology PowerPoint PPT Presentation


Title: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications I, and Orthology


1
Bioinformatics and Evolutionary Genomics: Gene
Trees, Gene Duplications (I), and Orthology
2
Gene Trees, Gene Duplications and Orthology
3
Phylogenetic gene trees: how to make them
  • Homology: are two pieces of sequence related?
    Trees: when did they diverge (how are they
    related)?
  • Start from a multiple sequence alignment
  • All multiple sequence alignment programs make a
    global alignment, so feed them only regions that
    you know are homologous: domains!
  • MUSCLE / Clustal / T-Coffee
  • Visual inspection of alignments (gaps,
    fragments/complete sequences, weird things e.g. A)

4
Put homologs in the alignment
  • Even if they are not homologous, MUSCLE will align
    them (MUSCLE/ClustalW implicitly assume that
    the sequences you feed them are homologous)
  • And in a phylogeny program, non-homologous
    sequences will still be clustered

5
Visual inspection of alignments ?!
6
An additive tree which is wrongly reconstructed
by UPGMA
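Distance matrix:
       A    B    C    D
  A    x   12    9    9
  B   12    x    9    7
  C    9    9    x    6
  D    9    7    6    x
[Figure: the additive tree for this matrix. Branch
lengths: A 6, B 5, C 3, D 2; the internal branch
separating (A, C) from (B, D) has length 1.]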
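A minimal sketch of why UPGMA gets this matrix wrong, assuming plain average-linkage clustering on the distances above. The function name `upgma` and the nested-tuple output are illustrative choices, not part of the slides.

```python
from itertools import combinations

def upgma(names, dist):
    """Return the UPGMA topology as nested tuples.

    dist maps frozenset({a, b}) -> distance between leaves a and b.
    """
    # Each cluster is (topology, list_of_leaves).
    clusters = [(n, [n]) for n in names]

    def avg(c1, c2):
        # Average pairwise leaf distance between two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Always merge the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Distance matrix from the slide.
D = {frozenset(k): v for k, v in {
    ("A", "B"): 12, ("A", "C"): 9, ("A", "D"): 9,
    ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 6}.items()}

tree = upgma(["A", "B", "C", "D"], D)
print(tree)
```

UPGMA first joins C and D because 6 is the smallest distance, whereas the additive tree that fits the matrix groups A with C and B with D. The mismatch comes from the unequal branch lengths (A 6 versus C 3, B 5 versus D 2), which violate the constant-rate assumption behind UPGMA.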
7
Neighbour-Joining (Saitou and Nei, 1987)
  • Global measure: keeps the total branch length minimal
  • At each step, join the two nodes for which the
    resulting total tree length is minimal (criterion
    of minimum evolution)
  • Leads to unrooted tree

8
Neighbour-Joining
  • At each step all possible neighbour joinings
    are checked and the one corresponding to the
    minimal total tree length (calculated by adding
    all branch lengths) is taken.

9
Neighbour-Joining
r = net divergence
       A    B    C    D     r
  A    x   12    9    9    30
  B   12    x    9    7    28
  C    9    9    x    6    24
  D    9    7    6    x    22
$M_{ab} = d_{ab} - (r_a + r_b)/(N-2)$
$M_{ab} = 12 - (30 + 28)/(4-2) = -17$
       A    B    C    D
  A    x  -17  -18  -17
  B         x  -17  -18
  C              x  -17
  D                   x
Join A and C → U
$d_{au} = d_{ac}/2 + (r_a - r_c)/(2(N-2)) = 9/2 + (30 - 24)/(2 \cdot 2) = 6$
$d_{cu} = d_{ac} - d_{au} = 9 - 6 = 3$
$d_{bu} = (d_{ab} + d_{bc} - d_{ac})/2 = (12 + 9 - 9)/2 = 6$
$d_{du} = (d_{ad} + d_{cd} - d_{ac})/2 = (9 + 6 - 9)/2 = 3$
10
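       U    B    D     r
  U    x    6    3     9
  B    6    x    7    13
  D    3    7    x    10
       U    B    D
  U    x  -16  -16
  B         x  -16
  D              x
e.g. join U and B → V
$d_{vu} = d_{ub}/2 + (r_u - r_b)/(2(N-2)) = 6/2 + (9 - 13)/(2 \cdot 1) = 3 - 2 = 1$
$d_{vb} = d_{ub} - d_{uv} = 6 - 1 = 5$
$d_{dv} = (d_{ud} + d_{bd} - d_{ub})/2 = (3 + 7 - 6)/2 = 2$
[Figure: the resulting unrooted tree. Branch lengths:
A-U 6, C-U 3, U-V 1, B-V 5, D-V 2.]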
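The worked example can be replayed with a short neighbour-joining sketch. It assumes the formulas used on the slides (net divergence r, corrected matrix M, branch lengths from d and r); the function name, the `node1`/`node2` labels, and the edge-list output are illustrative. Ties in M are broken by input order, so the second join pairs B with D rather than U with B, which yields the same unrooted tree.

```python
from itertools import combinations

def neighbour_joining(names, d):
    """Neighbour-joining on a distance dict keyed by frozenset({a, b}).

    Returns a list of (node, other_node, branch_length) edges.
    """
    nodes = list(names)
    d = dict(d)  # work on a copy
    edges = []
    new_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence r_i: sum of distances from i to all other nodes.
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i)
             for i in nodes}

        def m(pair):
            # Corrected distance M_ij = d_ij - (r_i + r_j) / (N - 2).
            i, j = pair
            return d[frozenset(pair)] - (r[i] + r[j]) / (n - 2)

        # min() keeps the first minimal pair, so ties follow input order.
        a, b = min(combinations(nodes, 2), key=m)
        dab = d[frozenset((a, b))]
        new_id += 1
        u = f"node{new_id}"
        dau = dab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        edges += [(a, u, dau), (b, u, dab - dau)]
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((u, k))] = (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - dab) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

# Distance matrix from the slides.
D = {frozenset(k): v for k, v in {
    ("A", "B"): 12, ("A", "C"): 9, ("A", "D"): 9,
    ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 6}.items()}

edges = neighbour_joining(["A", "B", "C", "D"], D)
for edge in edges:
    print(edge)
```

Unlike UPGMA, the correction by net divergence lets neighbour-joining recover the branch lengths of the additive tree even though the rates differ between lineages.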
11
Unequal rates between species are a very real
phenomenon
12
Character based: parsimony and maximum likelihood
  • Two-way classification in phylogeny: distance
    based vs character based
  • Character-state methods search directly (i.e.
    without defining distances) for the tree that
    best fits the data (the alignment)

13
Maximum likelihood
  • Search for the tree with the highest
    likelihood
  • One searches for the maximum likelihood (ML)
    value for the character state configurations
    among the sequences under study for each possible
    tree, and chooses the tree with the largest ML
    value as the preferred tree.

14
Maximum likelihood
  • have to specify a model of sequence evolution
  • likelihood for all sites is the product of the
    likelihoods for individual sites assuming all the
    nucleotide sites evolve independently.
  • maximum likelihood method computes the
    probabilities for all possible combinations of
    ancestral states!
  • ML methods evaluate phylogenetic hypotheses in
    terms of the probability that a proposed model of
    the evolutionary process and the proposed
    unrooted tree (hypothesis) would give rise to the
    observed data (the alignment). The tree found to
    have the highest (log)ML value is considered to
    be the preferred tree.
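As a hedged illustration of these points, the sketch below computes the likelihood of the smallest possible tree (two sequences joined through one ancestor). It assumes the Jukes-Cantor (JC69) substitution model, which the slides do not name, and a made-up toy alignment; all function names are illustrative. Each site likelihood sums over the four possible ancestral states, and the site log-likelihoods are added because sites are assumed independent.

```python
import math

BASES = "ACGT"

def jc69_prob(x, y, t):
    """JC69 probability of base x becoming y over branch length t
    (t in expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(a, b, t1, t2):
    # Sum over all four possible ancestral states, each with prior 1/4.
    return sum(0.25 * jc69_prob(x, a, t1) * jc69_prob(x, b, t2)
               for x in BASES)

def log_likelihood(seq1, seq2, t1, t2):
    # Independent sites: the product of site likelihoods becomes a sum of logs.
    return sum(math.log(site_likelihood(a, b, t1, t2))
               for a, b in zip(seq1, seq2))

# Hypothetical toy alignment: 2 differences in 10 sites.
seq1 = "ACGTACGTAC"
seq2 = "ACGTTCGTAA"

# Crude grid search over the total branch length t (split evenly).
grid = [i / 1000 for i in range(1, 2000)]
best_t = max(grid, key=lambda t: log_likelihood(seq1, seq2, t / 2, t / 2))
print(best_t)
```

For real trees the same sum over ancestral states is done at every internal node, and the search runs over topologies as well as branch lengths, which is what makes ML tree building slow.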

15
Interpreting trees
  • (recurring theme)

16
Interpreting the tree
  • Taxonomic findings
  • Paraphyly
  • Monophyly

17
Interpreting the tree
  • Outgroup: place the root between a distant
    homologous sequence and the rest of the group (b)
  • Midpoint: place the root at the midpoint of the
    longest path (sum of branches between any two
    leaves). NB njplot
  • Gene duplication: place the root between
    paralogous gene copies (b)
  • NB: all are affected by rates!

b
18
Simple example (kinase)
19
Two genes per species: how to differentiate
between one ancient or two recent duplications?
  • Two genes on human chromosomes (Human A, Human
    B), two genes on mouse chromosomes (Mouse A,
    Mouse B)

20
Duplications, Speciations
1
2
3
?
21
Interpreting the tree: duplications vs
speciations, going pseudo 3D
Gene Duplication
Speciation
22
Interpreting the tree: gene trees vs species trees
23
Interpreting the tree: example, vertebrate
duplications
  • Tetraploidy?

24
Interpreting the tree: Horizontal Gene Transfer
(HGT)
Bacteria
Eukarya
Archaea
25
Jargon for interpretation: orthology (and
paralogy) as a specification of homology when
discussing two species
[Figure: gene tree with leaves mouse1, human2, human1]
Fitch (1970): two genes in two species are
orthologous if they derive from one gene in
their last common ancestor
  • Orthologs are "the corresponding gene", implied
    to have the same function
  • Genes can diverge by speciation, or by
    duplication
  • Gene duplication by cell division
26
Orthology: annotating internal nodes as
duplications or speciations
Given the definition, how does that translate
to a tree? With or without a species
phylogeny?
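One common way to annotate nodes without a species phylogeny is the species-overlap rule: a node is called a duplication when the same species occurs on both sides of the split, and a speciation otherwise. The sketch below is an assumption-laden illustration of that rule, not the method of the slides. It expects a rooted binary tree written as nested tuples and leaf names of the hypothetical form `species_gene`.

```python
def species_of(leaf):
    # Assumed naming convention: "<species>_<gene>", e.g. "human_A".
    return leaf.split("_")[0]

def annotate(node, events):
    """Return the set of species below `node`, labelling internal nodes.

    `events` collects (node, label) pairs, children before parents.
    """
    if isinstance(node, str):
        return {species_of(node)}
    left, right = (annotate(child, events) for child in node)
    # Shared species on both sides: the split cannot be a speciation.
    label = "duplication" if left & right else "speciation"
    events.append((node, label))
    return left | right

# One ancient duplication, followed by speciation in both copies.
ancient = (("human_A", "mouse_A"), ("human_B", "mouse_B"))
ancient_events = []
annotate(ancient, ancient_events)

# Speciation first, then an independent recent duplication in each species.
recent = (("human_A", "human_B"), ("mouse_A", "mouse_B"))
recent_events = []
annotate(recent, recent_events)

for node, label in ancient_events + recent_events:
    print(label, node)
```

The rule depends on a correctly rooted gene tree, and it cannot see duplications whose traces were erased by later gene loss; reconciling the gene tree with a species phylogeny is the alternative when such a phylogeny is available.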
27
Terminology: inparalogs, outparalogs, co-orthologs
28
Importance of orthology for comparative genomics:
more resolution
[Figure: gene tree with leaves Af, Af, Bs, Bs, Ec,
Ec, Hi, Mg]
  • Gene family: present in Ec, Hi, Bs, Mg, Af
  • Orthologs 1: present in Ec, Hi, Bs, Af
  • Orthologs 2: present in Ec, Bs, Mg, Af
  • Phenotype-gene correlation
  • Function prediction, if Hi is the only
    biochemically characterized enzyme
  • Function prediction by co-occurrence
  • Evolution of gene content: loss vs duplication
29
Heuristics for orthology definition
  • Needed because of:
  • Speed (MSA plus reliable tree building is slow)
  • Difficulty in deciding what you should make a
    tree of in the first place (PFAM?)
  • Difficulty in operationalizing nuanced tree
    orthology into group orthology
  • Historically: bidirectional best BLAST hits (BBH)

30
BBH
[Figure: gene tree with leaves Af, Af, Bs2, Bs1,
Ec2, Ec1, Hi, Mg]
Extracting tree-like information from pairwise
similarities
         Bs1   Bs2
  Ec1     50    35
  Ec2     33    48
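A minimal sketch of bidirectional best hits on the four scores above. It assumes one symmetric score per gene pair, which is a simplification (real BLAST scores depend on search direction); the function names are illustrative.

```python
# Pairwise similarity scores from the slide (Ec gene, Bs gene) -> score.
scores = {
    ("Ec1", "Bs1"): 50, ("Ec1", "Bs2"): 35,
    ("Ec2", "Bs1"): 33, ("Ec2", "Bs2"): 48,
}

def best_hits(scores, query_index):
    """Best-scoring partner for each gene on one side of the comparison."""
    best = {}
    for pair, score in scores.items():
        query, target = pair[query_index], pair[1 - query_index]
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {q: t for q, (t, _) in best.items()}

def bidirectional_best_hits(scores):
    forward = best_hits(scores, 0)   # Ec gene -> best Bs gene
    backward = best_hits(scores, 1)  # Bs gene -> best Ec gene
    # Keep only pairs that are each other's best hit.
    return sorted((a, b) for a, b in forward.items()
                  if backward.get(b) == a)

bbh = bidirectional_best_hits(scores)
print(bbh)
```

Here the two reciprocal pairs recover the same grouping the tree would give, without building an alignment or a tree.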
31
BBH issues 1: unequal rates
[Figure: gene tree with duplication and speciation
nodes marked; the labels "11 orthologs" and
"Outparalogs" annotate groups of leaves. Leaves, in
order:]
  prpC N. meningitidis
  prpC E. coli
  prpC P. aeruginosa
  VCh1337 V. cholerae
  mmgD B. subtilis
  mmgD B. halodurans
  citZ B. subtilis
  citZ B. halodurans
  VCh2092 V. cholerae
  gltA P. multocida
  gltA E. coli
  gltA P. aeruginosa
  gltA N. meningitidis
32
BBH issues 2: ignores inparalogs
[Figure: gene tree with leaves Af, Af, Bs2, Bs1,
Ec2, Ec1, Hi, Bs3]
Prevalence? Depends on e.g. evolutionary distance,
group vs pairwise orthology. At least 16
prokaryotes INPARANOID
  Ec2-Bs2 48   Ec2-Bs3 51   (Bs2-Bs3 70)
  Ec1-Hi 70    Ec2-Hi 38
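The Ec2/Bs2/Bs3 scores show the problem: a plain best hit pairs Ec2 only with Bs3 (51) and drops Bs2 (48), although Bs2 and Bs3 are more similar to each other (70) than either is to Ec2. The sketch below applies a simplified, InParanoid-like rule as an assumption: a same-species gene counts as an inparalog of the seed ortholog if it scores higher against the seed than the seed scores against the query. It is not the actual InParanoid implementation, and the function name is hypothetical.

```python
# Scores from the slide.
between = {("Ec2", "Bs2"): 48, ("Ec2", "Bs3"): 51}   # Ec vs Bs
within_bs = {frozenset(("Bs2", "Bs3")): 70}          # Bs vs Bs

def seed_and_inparalogs(query, between, within):
    # Seed ortholog: the best between-species hit of the query.
    hits = {t: s for (q, t), s in between.items() if q == query}
    seed = max(hits, key=hits.get)
    # Inparalogs: same-species genes that are closer to the seed
    # than the seed is to the query.
    inparalogs = sorted(
        gene
        for pair, s in within.items() if seed in pair and s > hits[seed]
        for gene in pair - {seed}
    )
    return seed, inparalogs

seed, inparalogs = seed_and_inparalogs("Ec2", between, within_bs)
print(seed, inparalogs)
```

With this rule Bs2 and Bs3 are both reported as co-orthologs of Ec2, which matches a duplication in the Bs lineage after the split from Ec.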
33
BBH issues 3: differential gene loss
[Figure: gene tree with leaves Af, Af, Bs2, Bs1,
Ec2, Ec1, Hi, Mg]
Mg-Hi 35
34
Other Large Scale orthology schemes: Inparanoid
  • Erik Sonnhammer

35
Orthologous groups
  • The solution to the non-transitivity of
    orthology sensu stricto is group orthology
  • Conceptually: all proteins that are directly
    descended from one protein in the last common
    ancestor are considered orthologous to each other
  • Operationally: combine all connected best
    triangular hits into Clusters of Orthologous
    Groups (COGs, Tatusov et al., 1997).
    WWW.NCBI.NLM.GOV (watch out for fusion/fission
    though!)

36
Large Scale orthology schemes: COG
  • 1. Perform the all-against-all protein sequence
    comparison.
  • 2. Detect and collapse obvious paralogs, that is,
    proteins from the same genome that are more
    similar to each other than to any proteins from
    other species.
  • 3. Detect triangles of mutually consistent,
    genome-specific best hits (BeTs), taking into
    account the paralogous groups detected at step 2.
  • 4. Merge triangles with a common side to form
    COGs.
  • 5. A case-by-case analysis of each COG. This
    analysis serves to eliminate false positives and
    to identify groups that contain multidomain
    proteins by examining the pictorial
    representation of the BLAST search outputs. The
    sequences of detected multidomain proteins are
    split into single-domain segments and steps 1-4
    are repeated with these sequences, which results
    in the assignment of individual domains to COGs
    in accordance with their distinct evolutionary
    affinities.
  • 6. Examination of large COGs that include
    multiple members from all or several of the
    genomes using phylogenetic trees, cluster
    analysis and visual inspection of alignments; as
    a result, some of these groups are split into two
    or more smaller ones that are included in the
    final set of COGs.
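Steps 3 and 4 can be sketched as follows. This is a rough illustration under stated assumptions, not the COG pipeline itself: "mutually consistent" is taken to mean symmetric best hits, the paralog collapsing of step 2 is skipped, gene names follow a hypothetical `genome:id` pattern, and the input best hits are invented for the example.

```python
from itertools import combinations

def genome_of(gene):
    # Assumed naming convention: "<genome>:<id>".
    return gene.split(":")[0]

def cog_clusters(best_hit):
    """best_hit maps (gene, other_genome) -> best-hit gene in that genome."""
    genes = sorted({g for g, _ in best_hit})

    def is_bet_pair(a, b):
        # Assumption: a side of a triangle is a symmetric best hit.
        return (best_hit.get((a, genome_of(b))) == b
                and best_hit.get((b, genome_of(a))) == a)

    # Step 3: triangles of best hits across three different genomes.
    triangles = [
        t for t in combinations(genes, 3)
        if len({genome_of(g) for g in t}) == 3
        and all(is_bet_pair(a, b) for a, b in combinations(t, 2))
    ]

    # Step 4: merge triangles that share a side (a pair of genes).
    clusters = []  # each cluster is a set of sides
    for tri in triangles:
        sides = {frozenset(p) for p in combinations(tri, 2)}
        touching = [c for c in clusters if c & sides]
        clusters = [c for c in clusters if not (c & sides)]
        clusters.append(sides.union(*touching))
    return [sorted(set().union(*c)) for c in clusters]

# Hypothetical symmetric best hits between four genomes.
pairs = [("Ec:g1", "Bs:g1"), ("Ec:g1", "Hi:g1"), ("Bs:g1", "Hi:g1"),
         ("Ec:g1", "Mg:g1"), ("Bs:g1", "Mg:g1")]
best_hit = {}
for a, b in pairs:
    best_hit[(a, genome_of(b))] = b
    best_hit[(b, genome_of(a))] = a

cogs = cog_clusters(best_hit)
print(cogs)
```

In this toy input Mg:g1 and Hi:g1 are not each other's best hit, yet they end up in the same cluster because their triangles share the Ec-Bs side. That is how triangle merging tolerates missing or weak pairwise hits.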

37
Large Scale orthology schemes: COG

38
Other Large Scale orthology schemes: OrthoMCL
39
The too ambitious comparative genomics dilemma:
duplication/speciation vs domains
[Figure: timeline from the very distant past to the
present (axes: TIME, sequence divergence). Labels:
single structural elements?, domains, domain
cassettes, gene fusion, domain composition /
accretion, gene; distant homologs, homologs,
orthologs, trivial orthologs.]
I.e. genome comparison between close species: no
domain considerations, sub-sub-orthologs. Between
distant homologs: loads of domain
considerations
40
Implication of coupling between duplication and
domain accretion for evolution and function
prediction
  • For some genes life is easy: 1:1:1 orthologs, no
    fusion / domains, a couple of losses. But a
    minority of families, covering a large proportion
    of proteins, is a formidable challenge: domain
    permutations and duplications make life
    complicated

41
Orthology and function prediction: BLAST with a
newly sequenced globin from frog
What kind of globin is it?
42
Globins
Blast query
43
Function prediction: orthologs vs homologs
that are not orthologs (HTANOs)
  • Orthologs tend to have the exact same molecular
    function; mere HTANOs do not
  • Orthologs tend to operate in the same pathway.
  • Orthologs mostly have the same domain
    composition

44
... but inparalogs: fate after duplication is
neofunctionalization or subfunctionalization
  • Even evolutionarily true orthologs can have
    different functions
  • Both co-orthologs have taken over some aspect of
    the ancestral function and have lost other
    aspects
  • Acquisition of a new function, or loss of
    function: one of the co-orthologs does something
    different now.

45
Does retaining the ancestral role correlate
with speed of sequence evolution? Yes, but a
substantial minority is inconsistent
386
220
46
rfbB / rffG
RfbB and RffG catalyze the same reaction, but are
involved in two different biological processes.
rfb gene cluster: biosynthesis of O-specific
polysaccharides (inner membrane). rff gene
cluster: complex biosynthesis of enterobacterial
common antigen (outer membrane).
47
Why do we observe inconsistencies?
[Figure: histogram of consistent vs inconsistent
cases. x-axis: sequence identity between inparalogs
(%), 0 to 100; y-axis: frequency (cases), 0 to 70.]
Not because of chance due to lack of divergence
time
48
Why do we observe inconsistencies?
Similar sequence divergence of inparalogs
relative to their single ortholog: molecular
function similar? Any inconsistencies are then a
chance outcome: both duplicates have diverged,
but at (roughly) the same evolutionary speed
(most amino acid substitutions have only been
subject to purifying selection and not to
adaptive selection)
49
  • In certain orthology schemes, gene order is given
    precedence over highest similarity
  • The gene at the conserved position is considered
    the original and the other duplicate the copy