Whole-Genome Prokaryote Phylogeny without Sequence Alignment

About This Presentation
Title:

Whole-Genome Prokaryote Phylogeny without Sequence Alignment

Description:

Whole-Genome Prokaryote Phylogeny without Sequence Alignment. Bailin HAO and Ji QI, T-Life Research Center, Fudan University, Shanghai 200433, China

Slides: 64
Transcript and Presenter's Notes

Title: Whole-Genome Prokaryote Phylogeny without Sequence Alignment


1
Whole-Genome Prokaryote Phylogeny
without Sequence Alignment
Bailin HAO and Ji QI
T-Life Research Center, Fudan University, Shanghai 200433, China
Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China
http://www.itp.ac.cn/~hao/
2
Classification of Prokaryotes: A Long-Standing
Problem
  • Traditional taxonomy: too few features
  • Morphology: spherical, helical, rod-shaped
  • Metabolism: photosynthesis, N-fixing,
    desulfurization
  • Gram staining: positive and negative
  • SSU rRNA Tree (Carl Woese et al., 1977)
  • 16S rRNA: ancient conserved sequences of about
    1500 bases
  • Discovery of the three domains of life: Archaea,
    Bacteria and Eucarya
  • Endosymbiont origin of mitochondria and
    chloroplasts

3
The SSU rRNA Tree of Life: major progress in the
molecular phylogeny of prokaryotes, as evidenced
by the history of Bergey's Manual
4
Bergey's Manual Trust: Bergey's Manual
  • 1st Ed. Determinative Bacteriology 1923
  • 8th Ed. Determinative Bacteriology 1974
  • 1st Ed. Systematic Bacteriology 1984-1989, 4
    volumes
  • 9th Ed. Determinative Bacteriology 1994
  • 2nd Ed. Systematic Bacteriology 2001-200?, 5
    volumes planned
  • On-Line Taxonomic Outline of Procaryotes by
    Garrity et al. (October 2003)
  • (26 phyla: A1-A2, B1-B24)

5
Our Final Result
  • 132 organisms (16 Archaea, 110 Bacteria, 6 Eukaryotes)
  • Input: genome data
  • Output: phylogenetic tree
  • No selection of genes, no alignment of sequences,
    no fine adjustment whatsoever
  • See the tree first. Story follows.

6
(No Transcript)
7
Protein Tree for 145 Organisms from 82
Genera (K = 5)
  • 16 Archaea (11 genera, 16 species)
  • 123 Bacteria (65 genera, 98 species)
  • 6 Eukaryotes

8
(No Transcript)
9
Complete Bacterial Genomes Have Appeared since
1995: Early Expectations
  • More support to the SSU rRNA Tree of Life
  • Add details to the classification (branchings and
    groupings)
  • More hints on taxonomic revisions

10
  • Confusion brought by the hyperthermophiles
  • Aquifex aeolicus (Aquae), 1998, 1 551 335 bp
  • Thermotoga maritima (Thema), 1999, 1 860 725 bp
  • "Genome Data Shake Tree of Life",
    Science 280 (1 May 1998) 672
  • "Is It Time to Uproot the Tree of Life?",
    Science 284 (21 May 1999) 130
  • W. Ford Doolittle, "Uprooting the Tree of Life",
    Scientific American (February 2000) 90

11
Debate on Lateral Gene Transfer
  • Extreme estimate: 17% in E. coli
  • Limitations of the above approach
  • B. Wang, J. Mol. Evol. 53 (2001) 244
  • "Phase transition" and "crystallization" of
    species (C. Woese 1998)
  • Lateral transfer within smaller gene pools as an
    innovative agent
  • Composition vector may incorporate LGT within
    small gene pools

12
  • Alignment-Based Molecular Phylogeny
  • TCAGACGC
  • TCGGAGT
  • T C A G A C G C
  • T C G G A - G T
  • Scoring scheme
  • Gap penalty
  • 16S rRNA tree was based on sequence alignment
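
To make the scoring scheme and gap penalty concrete, here is a minimal Python sketch that scores the gapped alignment shown on this slide. The function name `score_alignment` and the score values (+1 match, -1 mismatch, -2 gap) are illustrative assumptions, not values taken from the talk.

```python
def score_alignment(top: str, bottom: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Sum the column scores of two already-aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # gap in either row (assumed penalty)
        elif x == y:
            total += match      # identical letters
        else:
            total += mismatch   # substitution
    return total

# The aligned pair from the slide: 5 matches, 2 mismatches, 1 gap.
aligned_score = score_alignment("TCAGACGC", "TCGGA-GT")
print(aligned_score)
```

Different choices of these three numbers can favour different alignments of the same pair, which is one reason an alignment-free method is attractive.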

13
  • Problem: sequence alignment cannot be readily
    applied to complete genomes
  • Homology -> alignment
  • Different genome size, gene content and gene order

14
Our Motivations
  • Develop a molecular phylogeny method that makes
    use of complete genomes: no selection of
    particular genes
  • Avoid sequence alignment
  • Try to reach higher resolution to provide an
    independent comparison with other approaches such
    as SSU rRNA trees
  • Make comparison with bacteriologists' systematics
    as reflected in Bergey's Manual (2001, 2002)
  • Our paper accepted by the Journal of Molecular
    Evolution

15
Other Whole-Genome Approaches
  • Gene content
  • Presence or absence of COGs
  • Conserved Gene Pairs
  • Information distances
  • Domain order in proteins (Ken Nishikawa's talk at
    InCoB2003)

16
Comparison of Complete Genomes/Proteomes
  • Compositional vectors
  • Nucleotides: a, t, c, g
  • Example sequence: aatcgcgcttaagtc
  • Di-nucleotide (K = 2) distribution:
    aa at ac ag ta tt tc tg ca ct cc cg ga gt gc gg
     2  1  0  1  1  1  2  0  0  1  0  2  0  1  2  0
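
The counting step can be sketched in a few lines of Python. This is an illustration only: the helper name `kstring_counts` is hypothetical, and the alphabet order "atcg" is chosen to match the column order used on the slide.

```python
from collections import Counter
from itertools import product

def kstring_counts(seq: str, k: int, alphabet: str) -> dict:
    """Count overlapping K-strings; strings that never occur get 0."""
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all |alphabet|^k possible strings in a fixed order.
    return {"".join(w): observed.get("".join(w), 0)
            for w in product(alphabet, repeat=k)}

counts = kstring_counts("aatcgcgcttaagtc", 2, "atcg")
print(list(counts.keys()))
print(list(counts.values()))
```

A sequence of length 15 contains 14 overlapping dinucleotides, so the components sum to 14.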

17
  • K-strings make a composition vector
  • DNA sequence -> vector of dimension 4^K
  • Protein sequence -> vector of dimension 20^K
  • Given a genomic or protein sequence -> a unique
    composition vector
  • The converse: a vector -> one or more sequences?
  • K big enough -> uniqueness
  • Connection with the number of Eulerian loops in a
    graph (a separate study available as a preprint
    at arXiv:physics/0103028 and from Hao's webpage)

18
A Key Improvement: Subtraction of Random
Background
  • Mutations took place randomly at the molecular level
  • Selection shaped the direction of evolution
  • Many neutral mutations remain as random
    background
  • At the single amino acid level, protein sequences
    are quite close to random
  • Highlighting the role of selection by subtracting
    a random background

19
Frequency and Probability
  • A sequence of length L
  • A K-string
  • Frequency of appearance
  • Probability

20
Predicting the Probability of K-strings from Those of
(K-1)- and (K-2)-strings
  • Joint probability vs. conditional probability
  • Making the weakest Markov assumption
  • Another joint probability

21
(K-2)-th Order Markov Model
  • Change to frequencies
  • Normalization factor may be ignored when L >> K

22
  • Construct compositional vectors using these
    modified string counts
  • For the i-th string type of species A we use
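
The formulas on slides 19 to 22 did not survive in this transcript. The block below is a hedged reconstruction for a single sequence of length $L$. The symbols $f$ (count), $p$ (probability), $p^0$ and $f^0$ (Markov predictions) and $a_i$ (vector component) are assumed names. The structure is chosen to agree with the GKSTL and HAMSC worked examples later in the talk, where the prediction is the product of the two (K-1)-string counts divided by the (K-2)-string count.

```latex
\begin{align}
p(a_1 a_2 \cdots a_K) &= \frac{f(a_1 a_2 \cdots a_K)}{L-K+1} \\
p^0(a_1 a_2 \cdots a_K) &=
  \frac{p(a_1 \cdots a_{K-1})\, p(a_2 \cdots a_K)}{p(a_2 \cdots a_{K-1})} \\
f^0(a_1 a_2 \cdots a_K) &=
  \frac{f(a_1 \cdots a_{K-1})\, f(a_2 \cdots a_K)}{f(a_2 \cdots a_{K-1})}
  \cdot \frac{(L-K+1)(L-K+3)}{(L-K+2)^2} \\
a_i &=
  \begin{cases}
    \dfrac{p - p^0}{p^0}, & p^0 \neq 0 \\[1ex]
    0, & p^0 = 0
  \end{cases}
\end{align}
```

The last factor in the third line is the normalization factor; it tends to 1 when $L \gg K$, which is why it may be ignored.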

23
Composition Distance
  • Define correlation between two compositional
    vectors by the cosine of the angle between them
  • From two complete proteomes:
  • A = (a1, a2, ..., an), n = 20^5 = 3 200 000
  • B = (b1, b2, ..., bn)
  • C(A,B) ∈ [-1, 1]
  • Distance
  • D(A,B) ∈ [0, 1]
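
A small Python sketch of the correlation and distance follows. The distance formula itself is missing from the transcript, so the mapping D = (1 - C) / 2 is an assumption, chosen because it takes C in [-1, 1] to D in [0, 1] as stated on the slide. The function names are hypothetical, and short lists stand in for the 3 200 000-component vectors.

```python
import math

def correlation(a, b):
    """Cosine of the angle between two composition vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def distance(a, b):
    """Assumed mapping from correlation in [-1, 1] to distance in [0, 1]."""
    return (1.0 - correlation(a, b)) / 2.0

same = distance([1.0, 0.0], [1.0, 0.0])          # identical direction
orthogonal = distance([1.0, 0.0], [0.0, 2.0])    # uncorrelated
opposite = distance([1.0, 0.0], [-3.0, 0.0])     # anti-correlated
print(same, orthogonal, opposite)
```

In practice most of the 20^K components are zero, so a sparse representation (for example a dictionary keyed by K-string) would be the natural choice.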

24
Materials: Genomes from NCBI (ftp.ncbi.nih.gov/genomes/Bacteria/)
Not the original GenBank files
6 Eucaryote genomes were included for reference
Tree construction: Neighbor-Joining in Phylip
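
As a sketch of how pairwise distances might be handed to Phylip, the Python snippet below writes a square distance matrix in the plain-text layout that Phylip's distance programs read: the number of taxa on the first line, then one row per taxon with the name padded to 10 characters. The helper name `to_phylip` and the distance values are hypothetical; only the abbreviations Aquae and Thema come from the talk.

```python
def to_phylip(names, matrix):
    """Format a square distance matrix as PHYLIP-style text."""
    if len(names) != len(matrix):
        raise ValueError("one row per taxon is required")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)   # names occupy exactly 10 columns
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"

# Made-up distances, for illustration only.
names = ["Aquae", "Thema", "Ecoli"]
matrix = [
    [0.00, 0.30, 0.45],
    [0.30, 0.00, 0.40],
    [0.45, 0.40, 0.00],
]
phylip_text = to_phylip(names, matrix)
print(phylip_text)
```

The resulting text can be saved to a file and used as input for a neighbor-joining run.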
25
Protein Tree for 132 species (K = 5)
  • 16 Archaea (11 genera, 16 species)
  • 110 Bacteria (57 genera, 88 species)
  • 6 Eukaryotes

26
(No Transcript)
27
Protein Tree for 132 species (K = 6)
  • 16 Archaea (11 genera, 16 species)
  • 110 Bacteria (57 genera, 88 species)
  • 6 Eukaryotes

28
(No Transcript)
29
Protein Class vs. Whole Proteome
  • Trees based on the collection of ribosomal proteins
    (SSU + LSU): ribosomal proteins are interwoven
    with rRNA to form a functioning complex; results
    are consistent with SSU rRNA trees
  • Trees based on the collection of aminoacyl-tRNA
    synthetases (AARS): trees based on a single AARS
    were not good. Trees based on all 20 AARSs were
    much better, but not as good as those based on
    rProteins.

30
Genus Tree based on Ribosomal Proteins
31
A Genus Tree based on Aminoacyl tRNA synthetases
32
Chloroplast Tree
  • Sequences of about 100 000 bp
  • Tree of the endosymbiont partners
  • Paper accepted by Molecular Biology and Evolution
    on 12 August 2003

33
Chloroplast tree
34
Coronaviruses including Human SARS-CoV
  • Sequences of tens of kilobases
  • SARS sequence: about 29 730 bases
  • Paper published in Chinese Science Bulletin on 26
    June 2003

35
Coronavirus tree
36
Understanding the Subtraction Procedure: Analysis
of Extreme Cases in E. coli
  • There are 1 343 887 5-strings belonging to
    841 832 different types.
  • Maximal count before subtraction: 58 for the
    5-peptide GKSTL. 58 reduces to 0.646 after
    subtraction.
  • Maximal component after subtraction: 197 for the
    5-peptide HAMSC. The number 197 came from a
    single count of 1 before the subtraction.

37
GKSTL: how does 58 reduce to 0.646?
  • f(GKST) = 113
  • f(KSTL) = 77
  • f(KST) = 247
  • Markov prediction: 113 × 77 / 247 = 35.23
  • Final result: (58 - 35.23) / 35.23 = 0.646

38
HAMSC: how does 1 grow to 197?
  • f(HAMS) = 1
  • f(AMSC) = 1
  • f(AMS) = 198
  • Markov prediction: 1 × 1 / 198 = 1/198
  • Final result: (1 - 1/198) / (1/198) = 197
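
The two worked examples can be checked with a short Python sketch. The function name `relative_excess` and its parameter names are hypothetical. The normalization factor is ignored here, as the talk allows for long sequences.

```python
def relative_excess(f_k, f_prefix, f_suffix, f_middle):
    """(observed - predicted) / predicted for one K-string.

    f_k      : count of the K-string itself
    f_prefix : count of its first (K-1) letters
    f_suffix : count of its last (K-1) letters
    f_middle : count of the (K-2) letters shared by both
    """
    predicted = f_prefix * f_suffix / f_middle   # Markov prediction
    return (f_k - predicted) / predicted

gkstl = relative_excess(58, 113, 77, 247)   # GKST, KSTL, KST
hamsc = relative_excess(1, 1, 1, 198)       # HAMS, AMSC, AMS
print(round(gkstl, 3), round(hamsc, 1))
```

A frequent string whose count is well predicted by its substrings is suppressed, while a rare string that its substrings do not predict is amplified.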

39
6121 Exact Matches of GKSTL in PIR Rel. 1.26 with
>1.2 Mil Proteins
  • These 6121 matches came from a diverse taxonomic
    assortment, from viruses to bacteria to fungi to
    plants and animals, including humans
  • In the parlance of classic cladistics, GKSTL
    contributes to plesiomorphic characters that
    should be eliminated in a strict phylogeny
  • The subtraction procedure did the job.

40
15 Exact Matches of HAMSC in PIR Rel. 1.26 with
>1.2 Mil Proteins
  • 1 match from a eukaryotic protein
  • 4 matches (the same protein) from viruses
  • 10 matches from prokaryotes, among which
  • 3 from Shigella and E. coli (HAMSCAPDKE)
  • 3 from Salmonella
    (HAMSCAPERD)
  • HAMSC is characteristic of prokaryotes
  • HAMSCA is specific to enterobacteria

41
Stable Topology of the Tree
  • K = 1 makes some sense!
  • K = 2, 3, 4: topology gradually converges
  • K = 5 and K = 6: present calculation
  • K = 7 and more: too high resolution; star-tree or
    bush expected

42
Statistical Test of the Tree
  • Bootstrap versus jackknife
  • Bootstrap in sequence alignments
  • Bootstrap by random selections
    from the AA-sequence pool
  • A time consuming job
  • 180 bootstraps for 72 species

43
About 70 genes for every species were selected
in one bootstrap
44
K-string Picture of Evolution
  • K = 5 -> 3 200 000 points in the space of
    5-strings
  • K = 6 -> 64 000 000 points
  • In the primordial soup: short polypeptides of a
    limited assortment
  • Evolution by growth, fusion, mutation leads to
    diffusion in the string space
  • String space not saturated yet
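
The sizes quoted above follow from 20 amino acids raised to the power K. The Python sketch below reproduces them and, as an illustration, uses the E. coli figure given earlier in the talk (841 832 distinct 5-string types) to estimate how much of the 5-string space one proteome occupies. The variable and function names are hypothetical.

```python
ALPHABET_SIZE = 20  # amino acids

def string_space_size(k: int) -> int:
    """Number of possible K-strings over the amino-acid alphabet."""
    return ALPHABET_SIZE ** k

ecoli_5string_types = 841_832   # figure quoted for E. coli in the talk
occupied = ecoli_5string_types / string_space_size(5)

print(string_space_size(5), string_space_size(6))
print(f"{occupied:.1%} of the 5-string space")
```

A single proteome fills only about a quarter of the K = 5 space, in line with the remark that the string space is not yet saturated.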

45
The Problem of Higher Taxa
  • 1974: Bacteria as a separate kingdom
  • 1994: Archaea and Bacteria as two domains
  • The relation of higher taxa?

46
  • Summary
  • As composition vectors do not depend on genome
    size and gene content, the use of whole-genome
    data is straightforward
  • Data independent of 16S rRNA
  • Method different from that based on SSU rRNA
  • Results agree with SSU rRNA trees and
    Bergey's Manual
  • Hint on groupings of higher taxa
  • A method without free parameters: data in, tree
    out
  • Possibility of an automatic and objective
    classification tool for prokaryotes

47
Conclusion: The Tree of Life is saved! There is
phylogenetic information in the prokaryotic
proteomes. Time to work on a molecular definition
of taxa. Thank you!
48
(No Transcript)
49
(No Transcript)
50
Protein Tree for 132 species (K = 5)
  • 16 Archaea (11 genera, 16 species)
  • 110 Bacteria (57 genera, 88 species)
  • 6 Eukaryotes

51
(No Transcript)
52
(No Transcript)
53
A Failed Attempt: Using Avoidance Signatures
54
(No Transcript)
55
Comparison with Bergey's Manual
56
  • Tree Construction
  • PHYLIP package of J. Felsenstein
    (Neighbor-Joining)
  • The Fitch method is not feasible here, nor are
    non-distance-matrix methods (MP, ML, etc.)
  • Material
  • ftp://ncbi.nlm.nih.gov/genomes/Bacteria/

57
Early expectation from genome data
  • Was there intensive lateral gene transfer?
  • Gene tree cannot be equated to the real tree of
    life
  • Genome data: 10^6 to 10^7
  • Difficult to align whole genome data

58
  • Prokaryote and Eukaryote
  • Three Kingdoms (Carl Woese, 16S rRNA)
  • Archaea
  • Eubacteria
  • Eukarya
  • Five Kingdoms (Lynn Margulis)
  • Bacteria (Archaea, Eubacteria)
  • Protoctista
  • Animalia
  • Fungi
  • Plantae

59
  • Common features of Archaea and Eubacteria:
    small cells, no nuclear membrane, circular DNA,
    no cap at the 5' end of mRNA, presence of S-D
    segments
  • Many proteins associated with replication,
    transcription, and translation are common to
    Archaea and Eukaryotes
  • Features of Archaea: lack of some enzymes,
    insensitivity to some antibiotics

60
  • Compositional Representation of Protein
    Sequences and the Number of Eulerian Loops
  • by Bailin Hao, Huimin Xie, Shuyu Zhang
  • K = 5: 76.7% of proteins have a unique reconstruction
  • K = 6: 94.0%
  • K = 10: >99%
  • Checked 2820 AA-seqs from pdb.seq, a special
    selection of SWISS-PROT
  • See Los Alamos National Lab E-Archive
    physics/0103028

61
Subtraction of Random Background
  • Using a (K-2)-order Markov Model
  • K = 2: genomic signature by Karlin and Burge
  • May be justified by using the Maximum Entropy
    Principle with appropriate constraints (Hu and
    Wang, 2001)

62
What to do next
  • Detailed comparison with traditional taxonomy
  • Add more eukaryotes
  • Elucidation of the foundation and limitations of
    the compositional approach
  • Software and web interface
  • Problem of lateral gene transfer
  • Viruses?

63
  • Confusion brought by the hyperthermophiles
  • Aquifex aeolicus (Aqua), 1998, 1 551 335 bp
  • Thermotoga maritima (Tmar), 1999, 1 860 725 bp
  • "Genome Data Shake Tree of Life",
    Science 280 (1 May 1998) 672
  • "Is It Time to Uproot the Tree of Life?",
    Science 284 (21 May 1999) 130
  • "Uprooting the Tree of Life",
    Sci. Amer. (February 2000) 90
  • Problem of Lateral Gene Transfer (LGT): tree or
    network
  • Problem of higher taxa