Comparative genomics and proteomics in Ensembl - PowerPoint PPT Presentation

Provided by: xos49
1
Comparative genomics and proteomics in Ensembl
November 2004
2
Overview
  • Rationale
  • Species available
  • Comparative proteomics
      • Orthologues prediction
      • Protein clustering into families
  • Comparative genomics
      • Genome-wide DNA alignments
      • Conserved synteny blocks
  • Future and perspectives

3
Compara
  • The Compara database is a single multispecies
    database
  • Gene orthology/paralogy prediction
  • Protein clustering
  • Whole genome alignments
  • Synteny regions

4
Comparing different species
[Figure: species tree drawn against a "Million years" axis (ticks at 100, 200, 300, 400, 500 and 1000). Branch points carry divergence times in million years (values from 5 to 990, two of them marked "?"). Legend: red = whole genome assembly available; green = whole genome assembly due within the next 2 years. A further label reads "Currently in Ensembl".]
Species shown, with genome size and assembly name where given:
  • H. sapiens (Human) 3Gb NCBI 34
  • P. troglodytes (Common chimpanzee) CHIMP1
  • M. mulatta (Rhesus macaque)
  • M. musculus (House mouse) 2.6Gb NCBIm33
  • R. norvegicus (Norway rat) 2.6Gb RGSC3.1
  • C. familiaris (Domestic dog) BROAD1
  • F. catus (Domestic cat)
  • E. caballus (Horse)
  • S. scrofa (Domestic pig)
  • B. taurus (Cow) Btau 1.0
  • O. aries (Domestic sheep)
  • M. domestica (opossum)
  • G. gallus (Domestic fowl) 1.2Gb WASHUC1
  • X. laevis (African clawed frog) JGI3 3.1Gb
  • X. tropicalis (Tropical clawed frog) 1.7Gb
  • D. rerio (Zebrafish) 1.7Gb WTSI Zv4
  • O. latipes (Japanese medaka) 800Mb
  • T. nigroviridis Tetraodon7 400Mb
  • T. rubripes (Tiger pufferfish) 400Mb Fugu v2.0
  • C. savignyi (sea squirt) 180Mb
  • C. intestinalis (sea squirt) 180Mb
  • A. aegypti (yellow fever mosquito)
  • A. gambiae (African malaria mosquito) 230Mb MOZ2
  • D. melanogaster (fruitfly) 125Mb BDGP3.1
  • A. mellifera (honeybee) 270Mb Amel1.1
  • C. elegans (nematode) 100Mb WS116
  • C. briggsae (nematode) 100Mb cb25.agp8
5
Comparing different species
  • From the Ensembl perspective, species are joined
    through
      • orthologous/paralogous gene links
      • chromosome synteny links
      • protein family links
  • From a broader perspective
      • Where are syntenic regions located?
      • How many genes are conserved?
      • Where are orthologous/paralogous genes?
      • Is gene order conserved?
      • Where are potential regulatory regions?
      • What is missing in one species, present only in
        another?

6
Orthologues prediction
  • Use in model organism
  • Gathering of information
  • Identify potential species-specific
    proteins/genes

7
Identifying orthologous genes
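[Figure: gene tree drawn along a time axis. A speciation event gives rise to orthologous genes and a duplication event to paralogous genes. Gene labels: Gene 1, Gene 2, Gene 3. Function labels: "Original function maintained" (twice), "Novel function". The pair keeping the original function is marked "Functional Orthologous".]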
8
Orthologues prediction
  • Find orthologous genes by comparing the protein
    sets of two species (only the longest peptide of
    each gene is considered).
  • blastp+sw all versus all (on a paired species
    basis)
  • Best Reciprocal Hits (BRH) are taken as putative
    orthologues

[Figure panels: UBRH; MBRH subtype DUP1.3; MBRH subtype COMPLEX]
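
The Best Reciprocal Hit rule on this slide can be sketched in a few lines. This is a minimal illustration, not the Ensembl pipeline: the score dictionaries, the gene identifiers (`h1`, `m1`, ...) and the function names are hypothetical, ties are broken arbitrarily, and the UBRH/MBRH subtypes are not modelled.

```python
def best_hits(scores):
    """Map each query to its single best-scoring target.

    `scores` maps (query, target) pairs to an alignment score.
    On a tie the first pair encountered wins (a simplification).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def best_reciprocal_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical all-versus-all scores between two small protein sets.
human_vs_mouse = {("h1", "m1"): 950, ("h1", "m2"): 400,
                  ("h2", "m2"): 700, ("h3", "m2"): 650}
mouse_vs_human = {("m1", "h1"): 940, ("m2", "h2"): 690,
                  ("m2", "h3"): 640, ("m3", "h3"): 300}

print(best_reciprocal_hits(human_vs_mouse, mouse_vs_human))
```

Here `h3` hits `m2` best, but `m2` prefers `h2`, so only `h1`/`m1` and `h2`/`m2` are reported as putative orthologues.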
9
RHS, Orphans and Others
  • Based on the genomic coordinates of UBRH and
    MBRH-DUPs in both species compared, and on gene
    order conservation, we identify additional
    orthologues, named RHS (Reciprocal Hit supported
    by Synteny).

[Figure labels: MBRH COMPLEX; Human; Orphan; Mouse]
For chimp, due to the special nature of the gene
build process, we also have DWGA (Derived from
Whole Genome Alignment)
10
(No Transcript)
11
For each orthologous gene pair
  • We store
      • identity, positivity, coverage, cigar lines,
        description (UBRH, MRHS), subtype
        (DUP1.2, SYN, COMPLEX), dN, dS
  • All the blastp+sw results are provided
  • Using the Compara Perl API
      • Protein or cDNA protein-based alignment
      • 4D, 2D sites can also be easily retrieved
  • Future developments
      • UBRHSYN or UBRHNON-SYN
      • Consider all isoforms for each gene
      • Build clusters of orthologues
      • Bringing in paralogues?
      • Multiple alignments and phylogenies
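
As an illustration of two of the stored values, identity and coverage can be derived from a gapped pairwise alignment as follows. This is a sketch under stated assumptions: both values are expressed as a percentage of the ungapped length of the first sequence, which may differ from the exact definitions used in the Compara database, and the function name is hypothetical.

```python
def pairwise_stats(aligned_a, aligned_b, gap="-"):
    """Percent identity and coverage of sequence A in a gapped pairwise alignment.

    Assumed definitions (not taken from the slides):
      identity = residues of A aligned to an identical residue of B / length of A
      coverage = residues of A aligned to any residue of B / length of A
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    length_a = sum(1 for residue in aligned_a if residue != gap)
    if length_a == 0:
        raise ValueError("sequence A contains no residues")
    aligned = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x != gap and y != gap:
            aligned += 1
            if x == y:
                identical += 1
    return {"identity": 100 * identical / length_a,
            "coverage": 100 * aligned / length_a}


print(pairwise_stats("MKTL", "MRTL"))  # 3 of 4 residues identical, all 4 aligned
print(pairwise_stats("MKTL", "MK--"))  # only half of A is aligned
```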

12
Protein clustering into families
  • Cluster proteins from different organisms that
    may share the same function
  • Obtain some kind of description for novel
    genes/proteins
  • Locate family members over the whole genome
  • Identify possible orthologues and paralogues in
    other species

13
Dataset used and comparisons
  • Half a million proteins clustered
      • All Ensembl proteins from all species in Ensembl
          • 233,000 predicted proteins
      • All metazoan (animal) proteins in
        SwissProt/SPTrEMBL
          • 40,000 UniProt/Swiss-Prot
          • 230,000 UniProt/SPTrEMBL
  • Blastp all versus all, then clustering with MCL

14
Clustering with MCL
  • MCL stands for Markov CLustering algorithm, based
    on flow simulation in graphs (http://micans.org/mcl/)
  • Keeps only very well interconnected nodes
    (proteins) in the same graph/cluster
  • Allows rapid and accurate detection of protein
    families on a large scale.
  • Automatic description and clustalw multiple
    alignment applied to each cluster

MCL
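
The flow simulation behind MCL alternates two steps on a column-stochastic matrix: expansion (matrix squaring, which spreads flow along paths) and inflation (element-wise powering followed by renormalisation, which strengthens strong flows and weakens weak ones). The toy version below is a sketch of that idea in pure Python; it is not the `mcl` program from micans.org, and the inflation value, iteration cap and tolerance are assumptions chosen for illustration.

```python
def normalize_columns(matrix):
    """Scale every column of a square matrix in place so that it sums to 1."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                matrix[i][j] /= total
    return matrix


def mcl(adjacency, inflation=2.0, iterations=50, tol=1e-6):
    """Toy Markov clustering of a graph given as a square weight matrix."""
    n = len(adjacency)
    # Self-loops are added so that every node keeps some flow on itself.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: one step of matrix squaring.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then column renormalisation.
        inflated = normalize_columns([[v ** inflation for v in row]
                                      for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each non-empty row of the converged matrix lists the members of a cluster.
    clusters = []
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > tol)
        if members and members not in clusters:
            clusters.append(members)
    return sorted(sorted(c) for c in clusters)


# Two separate triangles: proteins 0-2 and proteins 3-5.
graph = [[0, 1, 1, 0, 0, 0],
         [1, 0, 1, 0, 0, 0],
         [1, 1, 0, 0, 0, 0],
         [0, 0, 0, 0, 1, 1],
         [0, 0, 0, 1, 0, 1],
         [0, 0, 0, 1, 1, 0]]
print(mcl(graph))
```

In the protein-family setting the matrix entries would be derived from the all-versus-all Blastp scores rather than 0/1 edges.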
15
For each cluster
  • We store
      • Description and score
      • Multiple alignment
  • Future extensions
      • Improving descriptions
      • Multiple alignment assessment
          • t-coffee
          • Protein domain information consistency
      • Build phylogeny on each cluster
          • Using the multiple alignment
          • Using dS values (mainly inside mammals)
      • Identify intra/inter-species orthologues/paralogues

16
(No Transcript)
17
Addition of protein domain information
  • Introduction of protein domains
  • Help for internal data QC by checking consistency
    between orthologues, protein clusters and domain
    information.
  • Provide this kind of cross-check data to the user

18
Aligning complete genomes
19
Aligning genomes, why?
  • Understand how evolution has acted on the species
    compared since their speciation
  • Define syntenic regions, those long regions of
    DNA sequence where order and orientation are
    highly conserved
  • Finding conserved non-coding regions
      • Good guides to find and test putative regulatory
        regions
  • What is missing in one species, present only in
    another?
  • Differences between closely related species
    (human/chimpanzee, human/macaque) may help in
    understanding speciation

20
Basic ideas
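[Figure: an ancestor sequence splits at a speciation event; mutations and selection then act on each lineage, and the resulting sequences are compared by alignment. Legend: mutation, regulatory region, exon.]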
21
Basic ideas
  • Functional sequences (coding exons, regulatory
    regions) are generally highly conserved
  • Conserved sequences can be functionally important
  • Conservation Function
  • Comparing DNA sequences from different species
    can help to find biological functions

22
Using a local aligner
  • Local alignment
      • Finds all highly similar regions between 2
        sequences
      • Finds the orthologous as well as all the
        paralogous sequences
      • Aligned regions are separated by segments
        without alignment
      • Can handle rearranged sequences
      • Needs post-filtering to limit excessively
        overlapping alignments
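
The principle of local alignment can be illustrated with the classic Smith-Waterman recurrence. This is a teaching sketch, not one of the genome aligners discussed in these slides: the scoring values are arbitrary assumptions, gaps are scored linearly, and only the single best local alignment is returned, whereas a genome-scale local aligner reports many regions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: that is what makes the alignment local.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if h[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# The shared core ACGT is found even though the flanks do not align.
print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```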

23
Global vs. Local Alignments
[Figure: local versus global alignment of two sequences (1 and 2), shown for an inversion (-) and a duplication]
24
Aligning large genomic sequences
  • Independent from protein/gene predictions
  • Issues
      • Heavy process
          • Computes are run by only a few dedicated groups
      • Scalability (more and more species available)
      • Time constraint
      • As the true alignment is not known, it is
        difficult to measure alignment accuracy and to
        choose the right method

25
Ensembl compute 0.25
26
The rest of it
27
Trying to avoid the all versus all comparison
  • Combination of the Phusion shotgun assembler and
    gapped BLASTN
  • (Jim Mullikin and Zemin Ning, Sanger Institute)

The Phusion Assembler, Genome Res. (2003) 13: 81-90
28
Phusion - gapped BLASTN
[Flow diagram: Human 60Kb fragments and Mouse 60Kb fragments -> Phusion clustering -> 16000 clusters containing no more than 50 fragments -> gapped BLASTN]
29
The compute
The clusters
The farm
320x Compaq DS10 1Gb memory 60Gb local disk
10Tb
768 RLX blades 1Gb memory 80Gb local disk
8x Compaq ES40 32 CPUs
6x Compaq ES45 24 CPUs
30
Phusion - gapped BLASTN
  • Fast, but speed comes at a cost
  • Only 22% human genome coverage
  • Good enough for generating orthologous links
    between the 2 species aligned, which can be used
    either
      • in the web site, for moving from one species to
        another
      • to calculate synteny regions
  • Not good enough for serious genome-wide
    post-analysis because not comprehensive enough

31
All versus all approach using BLASTZ
(collaboration with UCSC)
  • Can handle large sequences
  • Uses a 2-weighted spaced seeding strategy
  • Dynamic masking
  • Makes a distinction between repeat and
    non-repeat sequences (soft masking)
  • Tries aligning inside repeats
  • One iterative step with a lower threshold
    to expand alignments

32
Blastz strategy
  • 10Mb Human fragments (3000)
  • 30Mb Mouse fragments (100)
  • Lineage-specific repeats removed
  • 48 hours on 1024 CPUs
  • Generates 9Gb of output
  • When filtered for Best hit on Human,
    reduced to 2.5Gb

33
Blastz human genome coverage
  • 40% of the human genome is covered by an
    alignment of mouse sequences
  • By rescoring the alignment over a tight matrix
    that is very stringent and looks for high
    conservation (>70% identity), the coverage goes
    down to 6%

34
Genome alignment summary
  • cons track
      • blastz: human/mouse, human/rat,
        human/chimpanzee, human/chicken,
        mouse/chicken, mouse/rat, rat/chicken,
        fugu/Tetraodon
      • phusion-blastn: elegans/briggsae
  • high cons track
      • Obtained by rescoring the raw alignments over
        a tight matrix
  • trans BLAT track
      • translated BLAT: human/fugu, human/zebrafish,
        human/chicken, human/Tetraodon,
        fly/anopheles, fly/bee,
        elegans/briggsae, chicken/mouse, rat/zebrafish,
        mouse/Tetraodon, mouse/zebrafish,
        mouse/fly, mouse/fugu, rat/Tetraodon,
        fugu/zebrafish, Tetraodon/chicken,
        Tetraodon/zebrafish, Tetraodon/fugu,
        Anopheles/bee

35
(No Transcript)
36
DNA/DNA matches web display
37
DotterView
38
Defining large syntenic regions
  • Genome alignments are refined into large syntenic
    regions.
  • Alignments are clustered together when the
    relative distance between them is less than 100kb
    and order and orientation are consistent.
  • Any cluster shorter than 100kb is discarded.
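
The clustering rule above can be sketched as a greedy chaining pass. This is only an approximation of the idea under stated assumptions: alignments are visited in reference order and compared with the previous one only, the distance limit is applied on both genomes, block size is measured on the reference genome, and the `Aln` record and function names are hypothetical.

```python
from typing import NamedTuple


class Aln(NamedTuple):
    """One pairwise alignment between a reference and another genome."""
    ref_chr: str
    ref_start: int
    ref_end: int
    other_chr: str
    other_start: int
    other_end: int
    strand: int  # +1 or -1, orientation of the other genome


def can_extend(prev, cur, max_gap):
    """True if `cur` continues the block ending with `prev`."""
    if (prev.ref_chr, prev.other_chr, prev.strand) != \
            (cur.ref_chr, cur.other_chr, cur.strand):
        return False
    if cur.ref_start - prev.ref_end >= max_gap:
        return False
    # Order must be preserved: ascending on +1, descending on -1.
    if cur.strand == 1:
        other_gap = cur.other_start - prev.other_end
    else:
        other_gap = prev.other_start - cur.other_end
    return 0 <= other_gap < max_gap


def synteny_blocks(alignments, max_gap=100_000, min_size=100_000):
    """Chain alignments into blocks and drop blocks shorter than `min_size`."""
    blocks, current = [], []
    for aln in sorted(alignments, key=lambda a: (a.ref_chr, a.ref_start)):
        if current and can_extend(current[-1], aln, max_gap):
            current.append(aln)
        else:
            if current:
                blocks.append(current)
            current = [aln]
    if current:
        blocks.append(current)
    kept = []
    for block in blocks:
        start = block[0].ref_start
        end = max(a.ref_end for a in block)
        if end - start >= min_size:
            kept.append((block[0].ref_chr, start, end,
                         block[0].other_chr, block[0].strand, len(block)))
    return kept


example = [
    Aln("1", 0, 40_000, "4", 1_000_000, 1_040_000, 1),
    Aln("1", 60_000, 120_000, "4", 1_050_000, 1_110_000, 1),
    Aln("1", 500_000, 520_000, "4", 1_150_000, 1_170_000, 1),  # too far away
    Aln("1", 530_000, 560_000, "7", 10, 30_010, 1),            # other chromosome
]
print(synteny_blocks(example))
```

The first two alignments are chained into one 120kb block; the other two form blocks shorter than 100kb and are discarded.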

39
Synteny web display
  • 347 syntenic regions
  • Coverage
      • 87.5% human
      • 92.4% mouse
  • Size range
      • human: 104.4Kb - 57.3Mb
      • mouse: 100.2Kb - 51.4Mb

40
MultiContigView
41
Synteny blocks in ContigView/CytoView
42
Integrated multigenome browser
[Figure: links between the Human genome browser and the Mouse genome browser. Labels: orthologous/paralogous genes (direct, via families); whole genome alignment; syntenic regions]
43
Species used in genome alignments
  • Vertebrata Compara: Human, Mouse, Rat, Chimp, Chicken
  • Nematoda Compara: C. briggsae, C. elegans
44
Species used in orthologues prediction
  • Vertebrata Compara: Human, Mouse, Rat, Chimp,
    Chicken, Fugu, Tetraodon, Zebrafish
  • Arthropoda Compara: Fruit fly, Mosquito, Honeybee
  • Nematoda Compara: C. briggsae, C. elegans
45
Species included in protein clustering
  • Vertebrata Compara: Human, Mouse, Rat, Chimp,
    Chicken, Fugu, Tetraodon, Zebrafish
  • Arthropoda Compara: Fruit fly, Mosquito, Honeybee
  • Nematoda Compara: C. briggsae, C. elegans
46
Outlook
  • OrthoView
  • Displaying alignments both from whole genome
    alignments and on orthologues
  • Projected transcripts

47
Acknowledgements
  • Abel Ureta-Vidal
  • Cara Woodwark
  • Jessica Severin
  • Javier Herrero
  • Ensembl team

48
AlignSlice concept
  • Slice having alignment information attached to
    it.
  • Being able to project a transcript from one
    species to another through the alignment data
    (pairwise or multiple)
  • Give gene context information across species
  • Needed as a significant number of genomes are
    going to be 2X/3X. No sensible gene building
    possible. Cow will be used as test run.
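
The projection idea can be illustrated independently of the Ensembl API. The sketch below maps the exons of a transcript from a source genome onto a target genome through a list of ungapped, forward-strand alignment blocks. The block layout, the 0-based half-open coordinates and both function names are assumptions made for the illustration; the real AlignSlice would also have to deal with gaps and reverse-strand alignments.

```python
def project_interval(start, end, blocks):
    """Project [start, end) from source to target coordinates.

    `blocks` is a list of (src_start, src_end, dst_start) ungapped
    forward-strand alignment blocks. Parts of the interval that fall
    outside every block are dropped.
    """
    pieces = []
    for src_start, src_end, dst_start in blocks:
        lo = max(start, src_start)
        hi = min(end, src_end)
        if lo < hi:
            offset = dst_start - src_start
            pieces.append((lo + offset, hi + offset))
    return pieces


def project_transcript(exons, blocks):
    """Project every exon; an empty list means the exon is not aligned."""
    return [project_interval(start, end, blocks) for start, end in exons]


# Hypothetical mouse -> human alignment blocks and mouse exons.
blocks = [(100, 200, 1000), (250, 400, 1105)]
exons = [(150, 300), (420, 450)]
print(project_transcript(exons, blocks))
```

The first exon spans an unaligned stretch and is projected as two pieces; the second exon lies outside the alignment and cannot be projected.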

49
AlignSlice concept
  • Getting a human AlignSlice
      my $HumanAlignSlice =
        $AlignSliceAdaptor->fetch_by_Slice_method_link_species_set(
          $human_slice,
          $method_link_species_set);
  • Getting mouse genes projected on the human slice
    coordinates as much as possible
      my $mouse_genes =
        $HumanAlignSlice->get_all_genes_by_species("Mus musculus");
  • Changing the reference species
      my $MouseAlignSlice =
        $HumanAlignSlice->change_reference_species_to("Mus musculus");