Comparative genomics and proteomics in Ensembl

1
Comparative genomics and proteomics in Ensembl
November 2004
2
Overview

Rationale
Species available
Comparative proteomics: orthologues prediction, protein clustering into families
Comparative genomics: genome-wide DNA alignments, conserved synteny blocks
Future and perspectives

3
Compara

The Compara database is a single multispecies database containing:
Gene orthology/paralogy prediction
Protein clustering
Whole genome alignments
Synteny regions

4
Comparing different species
[Figure: phylogenetic tree of the species listed below, drawn against a time axis in million years (ticks at 100, 200, 300, 400, 500 and 1000). The divergence times printed at the tree nodes cannot be matched to their nodes in this transcript and are omitted.]
Colour key: red = whole genome assembly available; green = whole genome assembly due within the next 2 years. The figure also carries the label "Currently in Ensembl".
H. sapiens (Human) 3Gb NCBI 34
P. troglodytes (Common chimpanzee) CHIMP1
M. mulatta (Rhesus macaque)
M. musculus (House mouse) 2.6Gb NCBIm33
R. norvegicus (Norway rat) 2.6Gb RGSC3.1
C. familiaris (Domestic dog) BROAD1
F. catus (Domestic cat)
E. caballus (Horse)
S. scrofa (Domestic pig)
B. taurus (Cow) Btau 1.0
O. aries (Domestic sheep)
M. domestica (opossum)
G. gallus (Domestic fowl) 1.2Gb WASHUC1
X. laevis (African clawed frog) JGI3 3.1Gb
X. tropicalis (Tropical clawed frog) 1.7Gb
D. rerio (Zebrafish) 1.7Gb WTSI Zv4
O. latipes (Japanese medaka) 800Mb
T. nigroviridis Tetraodon7 400Mb
T. rubripes (Tiger pufferfish) 400Mb Fugu v2.0
C. savignyi (sea squirt) 180Mb
C. intestinalis (sea squirt) 180Mb
A. aegypti (yellow fever mosquito)
A. gambiae (African malaria mosquito) 230Mb MOZ2
D. melanogaster (fruitfly) 125Mb BDGP3.1
A. mellifera (honeybee) 270Mb Amel1.1
C. elegans (nematode) 100Mb WS116
C. briggsae (nematode) 100Mb cb25.agp8
5
Comparing different species

From the Ensembl perspective, species are joined through:
orthologous/paralogous gene links
chromosome synteny links
protein family links
From a broader perspective:
Where are syntenic regions located?
How many genes are conserved?
Where are orthologous/paralogous genes?
Is gene order conserved?
Where are potential regulatory regions?
What is missing in one species and present only in another?

6
Orthologues prediction

Use in model organism
Gathering of information
Identify potential species-specific
proteins/genes

7
Identifying orthologous genes
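[Figure: gene tree drawn along a time axis. A speciation event gives rise to orthologous genes and a duplication event gives rise to paralogous genes. Labels on the resulting genes (Gene 1, Gene 2, Gene 3): "Original function maintained" (twice), "Novel function", "Paralogous", "Functional Orthologous".]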
8
Orthologues prediction

Find orthologous genes by comparing the protein sets of two species (only the longest peptide of each gene is considered).
blastp+sw all versus all (on a paired-species basis)
Best Reciprocal Hits (BRH) are taken as putative orthologues

[Figure panels: UBRH (Unique Best Reciprocal Hit); MBRH (Multiple Best Reciprocal Hit) subtype DUP1.3; MBRH subtype COMPLEX]
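
The best-reciprocal-hit idea on this slide can be sketched in a few lines. This is a minimal illustration, not the Ensembl pipeline: the `(query, target, score)` input format, the function names and the toy gene identifiers are all assumptions made for the example. Ties are kept, so one gene can end up with several reciprocal partners.

```python
from collections import defaultdict


def best_hits(hits):
    """Map each query to the set of its top-scoring targets.

    `hits` is an iterable of (query, target, score) tuples, for example
    parsed from an all-versus-all protein search (hypothetical format).
    """
    best_score = {}
    best = defaultdict(set)
    for query, target, score in hits:
        if query not in best_score or score > best_score[query]:
            best_score[query] = score
            best[query] = {target}
        elif score == best_score[query]:
            best[query].add(target)
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which each gene is a best hit of the other."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = set()
    for a, targets in best_ab.items():
        for b in targets:
            if a in best_ba.get(b, ()):
                pairs.add((a, b))
    return sorted(pairs)


# Toy data: hC's best hit is mA, but mA's best hit is hA, so hC has no BRH.
human_vs_mouse = [("hA", "mA", 950.0), ("hA", "mB", 400.0),
                  ("hB", "mB", 700.0), ("hC", "mA", 300.0)]
mouse_vs_human = [("mA", "hA", 940.0), ("mB", "hB", 710.0),
                  ("mB", "hA", 390.0)]
brh = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(brh)
```

In this toy data set `hC` is left without a partner, which is the kind of gene that a synteny-based second pass could still try to place.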
9
RHS, Orphans and Others

Using the genomic coordinates of the UBRH and MBRH-DUP pairs in both species, together with gene order conservation, we identify additional orthologues, named RHS (Reciprocal Hit supported by Synteny).

[Figure labels: MBRH COMPLEX, Human, Orphan, Mouse]
For chimp, due to the special nature of the gene build process, we also have DWGA (Derived from Whole Genome Alignment) orthologues.
11
For each orthologous gene pair

We store:
identity, positivity, coverage, cigar lines, description (UBRH, MRHS), subtype (DUP1.2, SYN, COMPLEX), dN, dS
All the blastp+sw results are provided
Using the Compara Perl API:
Protein alignments, or cDNA alignments based on the protein alignment
4D and 2D sites (fourfold and twofold degenerate sites) can also be easily retrieved
Future developments
UBRHSYN or UBRHNON-SYN
Consider all isoforms for each gene
Build clusters of orthologues
Bringing in paralogues?
Multiple alignments and phylogenies
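
To make the stored "cigar lines" and "identity" more concrete, here is a small sketch of how a compact cigar line can be expanded into a gapped sequence and how a percent identity can be derived from two gapped sequences. The cigar format assumed here (runs of `<count><op>`, count omitted when it is 1, `M` for aligned residues and `D` for gaps) and the identity definition (identical residues over non-gap columns) are assumptions for illustration; the exact definitions used in the database may differ.

```python
import re


def apply_cigar(sequence, cigar_line):
    """Insert gaps into `sequence` according to a compact cigar line.

    Assumed format: runs of <count><op>, with the count omitted when it
    is 1; M consumes residues, D emits gap characters.
    """
    aligned = []
    pos = 0
    for count, op in re.findall(r"(\d*)([MD])", cigar_line):
        n = int(count) if count else 1
        if op == "M":
            aligned.append(sequence[pos:pos + n])
            pos += n
        else:
            aligned.append("-" * n)
    if pos != len(sequence):
        raise ValueError("cigar line does not consume the whole sequence")
    return "".join(aligned)


def percent_identity(aligned_a, aligned_b):
    """Identical residues over columns where neither sequence has a gap."""
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    same = sum(1 for a, b in columns if a == b)
    return 100.0 * same / len(columns)


# Made-up peptides: the first one has a 2-residue gap after position 3.
gapped_a = apply_cigar("MKTAYIAK", "3M2D5M")
gapped_b = apply_cigar("MKTLLAYVAK", "10M")
print(gapped_a)
print(gapped_b)
print(percent_identity(gapped_a, gapped_b))
```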

12
Protein clustering into families

Cluster proteins from different organisms that
may share the same function
Obtain some kind of description for novel
genes/proteins
Locate family members over the whole genome
Identify possible orthologues and paralogues in
other species

13
Dataset used and comparisons

Half a million proteins clustered:
All Ensembl proteins from all species in Ensembl: 233,000 predicted proteins
All metazoan (animal) proteins in SwissProt/SPTrEMBL: 40,000 UniProt/Swiss-Prot and 230,000 UniProt/SPTrEMBL
Blastp all versus all, then clustering with MCL

14
Clustering with MCL

MCL stands for Markov CLustering algorithm, based on flow simulation in graphs (http://micans.org/mcl/)
Keeps in the same graph/cluster only very well interconnected nodes/proteins
Allows rapid and accurate detection of protein families on a large scale
Automatic description and ClustalW multiple alignment applied to each cluster

[Figure: MCL]
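
The flow-simulation idea behind MCL can be illustrated with a deliberately small, dense-matrix sketch: alternate "expansion" (squaring the column-stochastic matrix) with "inflation" (raising entries to a power and renormalising columns) until flow is concentrated inside clusters. This is a textbook simplification, not the `mcl` program itself, and the edge weights below are made up; in a protein-family setting they would be derived from the all-versus-all similarity scores.

```python
def normalize_columns(m):
    """Scale every column of a square matrix in place so that it sums to 1."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def mcl(adjacency, inflation=2.0, iterations=30):
    """Cluster a weighted undirected graph given as a dense adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, then turn the weights into column-wise transition probabilities.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: let flow spread along paths of length two.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong flows and weaken weak ones.
        m = [[value ** inflation for value in row] for row in m]
        normalize_columns(m)
    # Rows that still carry flow are attractors; the columns they attract form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two tightly connected triangles (0-1-2 and 3-4-5) joined by one weak edge (2-3).
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(graph)
print(clusters)
```

The weak bridge between the two triangles is squeezed out by inflation, so the two groups come out as separate clusters. Higher inflation values generally give finer clusters.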
15
For each cluster

We store
Description and score
Multiple alignment
Future extensions
Improving descriptions
Multiple alignment assessment
T-Coffee
Protein domain information consistency
Build phylogeny on each cluster
Using the multiple alignment
Using dS values (mainly inside mammals)
Identify intra/inter-species orthologues/paralogues

17
Addition of protein domain information

Introduction of protein domain information
Helps internal data QC by checking consistency between orthologues, protein clusters and domain information
Provides this kind of cross-check data to the user

18
Aligning complete genomes
19
Aligning genomes, why?

Understand what evolution has done to the compared species since their speciation
Define syntenic regions: long regions of DNA sequence where order and orientation are highly conserved
Find conserved non-coding regions
Good guides to find and test putative regulatory regions
What is missing in one species and present only in another?
Differences between closely related species (human/chimpanzee, human/macaque) may help in understanding speciation

20
Basic ideas
[Figure: an ancestor sequence splits at a speciation event; mutations and selection act on the descendant sequences, which are then aligned. Legend: Mutation, Regulatory region, Exon.]
21
Basic ideas

Functional sequences (coding exons, regulatory regions) are generally highly conserved
Conserved sequences can be functionally important
Conservation Function
Comparing DNA sequences from different species can help to find biological functions

22
Using a local aligner

Local alignment
Finds all highly similar regions between 2 sequences
Finds the orthologous as well as all the paralogous sequences
Aligned regions are separated by segments without alignment
Can handle rearranged sequences
Needs post-filtering to limit heavily overlapping alignments
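
As a reminder of what "local" means here, the sketch below computes the classic Smith-Waterman local alignment score with a linear gap penalty. It is a textbook illustration only: genome-scale aligners use seeding heuristics rather than a full dynamic programme, and the scoring values are arbitrary assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # extend along the diagonal
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# The flanks differ completely, but the shared core ACGTACG is still found.
score = smith_waterman_score("TTTACGTACGTTT", "GGGACGTACGGGG")
print(score)
```

Because every cell is floored at zero, unrelated flanking sequence does not drag the score down, which is why local alignment copes with rearranged sequences where a global alignment would not.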

23
Global vs. Local Alignments
[Figure: global and local alignments of two sequences (1 and 2), with an inversion, a duplication and a reverse-strand (-) match marked.]
24
Aligning large genomic sequences

Independent of protein/gene predictions
Issues:
Heavy process
Computes run by only a few dedicated groups
Scalability (more and more species available)
Time constraints
As the true alignment is not known, it is difficult to measure alignment accuracy and to choose the right method

25
Ensembl compute 0.25
26
The rest of it
27
Trying to avoid the all versus all comparison

Combination of the Phusion shotgun assembler and gapped BLASTN
(Jim Mullikin and Zemin Ning, Sanger Institute)

The Phusion Assembler, Genome Res. (2003) 13: 81-90
28
Phusion - gapped BLASTN
Human 60Kb fragments
Mouse 60Kb fragments
Phusion clustering
16000 clusters containing no more than 50
fragments
gapped BLASTN
29
The compute
The clusters
The farm
320x Compaq DS10 1Gb memory 60Gb local disk
10Tb
768 RLX blades 1Gb memory 80Gb local disk
8x Compaq ES40 32 CPUs
6x Compaq ES45 24 CPUs
30
Phusion - gapped BLASTN

Fast, but speed comes at a cost
Only 22% of the human genome is covered
Good enough for generating orthologous links between the 2 species aligned, which can be used either
- in the web site for moving from one species to another
- to calculate synteny regions
Not good enough for serious genome-wide post-analysis because not comprehensive enough

31
All versus all approach using BLASTZ
(collaboration with UCSC)

Can handle large sequences
Uses a 2-weighted spaced seeding strategy
Dynamic masking
Distinguishes between repeat and non-repeat sequences (soft masking)
Tries aligning inside repeats
One iterative step with a lower threshold to expand alignments

32
Blastz strategy

10Mb Human fragments (3000)
30Mb Mouse fragments (100)
Lineage-specific repeats removed
48 hours on 1024 CPUs
Generates 9Gb of output
When filtered for Best hit on Human,
reduced to 2.5Gb

33
Blastz human genome coverage

40% of the human genome is covered by an alignment of mouse sequences
When the alignments are rescored with a tight, very stringent matrix that looks for high conservation (>70% identity), the coverage goes down to 6%
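
Coverage figures of this kind come down to merging overlapping alignment intervals on the reference genome and dividing the covered bases by the genome length. The sketch below shows that calculation on made-up coordinates; the 1-based inclusive interval convention and the function names are assumptions for the example.

```python
def covered_bases(intervals):
    """Count bases covered by (start, end) intervals (1-based, inclusive).

    Overlapping or adjacent intervals are merged so no base is counted twice.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def coverage_percent(intervals, genome_length):
    """Percentage of a genome of `genome_length` bases covered by the intervals."""
    return 100.0 * covered_bases(intervals) / genome_length


# Toy example: two overlapping alignments and one separate alignment.
alignments = [(1, 100), (50, 150), (300, 400)]
print(covered_bases(alignments))
print(coverage_percent(alignments, 1000))
```

Running the same calculation on the full alignment set and on the subset that passes a stricter rescoring would give two coverage figures like the pair quoted on this slide.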

34
Genome alignment summary

cons track
blastz: human/mouse, human/rat, human/chimpanzee, human/chicken, mouse/chicken, mouse/rat, rat/chicken, fugu/Tetraodon
phusion-blastn: elegans/briggsae
high cons track
Obtained by rescoring the raw alignments over a tight matrix
trans BLAT track
translated BLAT: human/fugu, human/zebrafish, human/chicken, human/Tetraodon, fly/Anopheles, fly/bee, elegans/briggsae, chicken/mouse, rat/zebrafish, mouse/Tetraodon, mouse/zebrafish, mouse/fly, mouse/fugu, rat/Tetraodon, fugu/zebrafish, Tetraodon/chicken, Tetraodon/zebrafish, Tetraodon/fugu, Anopheles/bee

36
DNA/DNA matches web display
37
DotterView
38
Defining large syntenic regions

Genome alignments are refined into large syntenic regions.
Alignments are clustered together when the relative distance between them is less than 100kb and their order and orientation are consistent.
Any clusters smaller than 100kb are discarded.
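
The clustering rule above can be sketched as a single pass over alignments sorted by reference position. This is a simplified greedy illustration, not the production method: the dictionary layout, the helper names and the toy coordinates are assumptions, and a stray alignment in the middle of a region would split a block here.

```python
MAX_GAP = 100_000   # maximum distance between neighbouring alignments
MIN_SIZE = 100_000  # minimum size of a reported syntenic region


def aln(chr_a, start_a, end_a, chr_b, start_b, end_b, strand):
    """Build one pairwise alignment record (hypothetical layout)."""
    return {"chr_a": chr_a, "start_a": start_a, "end_a": end_a,
            "chr_b": chr_b, "start_b": start_b, "end_b": end_b,
            "strand": strand}


def can_extend(block, nxt):
    """True if `nxt` continues `block` with consistent order and orientation."""
    last = block[-1]
    same_context = (nxt["chr_a"], nxt["chr_b"], nxt["strand"]) == \
                   (last["chr_a"], last["chr_b"], last["strand"])
    if not same_context:
        return False
    gap_a = nxt["start_a"] - last["end_a"]
    if nxt["strand"] == 1:
        gap_b = nxt["start_b"] - last["end_b"]   # forward: positions increase
    else:
        gap_b = last["start_b"] - nxt["end_b"]   # reverse: positions decrease
    return gap_a < MAX_GAP and 0 <= gap_b < MAX_GAP


def build_synteny_blocks(alignments):
    """Return (chr_a, start_a, end_a, chr_b, strand, n_alignments) per region."""
    blocks = []
    for a in sorted(alignments, key=lambda x: (x["chr_a"], x["start_a"])):
        if blocks and can_extend(blocks[-1], a):
            blocks[-1].append(a)
        else:
            blocks.append([a])
    regions = []
    for block in blocks:
        start_a = block[0]["start_a"]
        end_a = max(a["end_a"] for a in block)
        if end_a - start_a + 1 >= MIN_SIZE:  # discard small clusters
            regions.append((block[0]["chr_a"], start_a, end_a,
                            block[0]["chr_b"], block[0]["strand"], len(block)))
    return regions


toy = [
    aln("1", 1_000, 40_000, "4", 5_000, 44_000, 1),
    aln("1", 60_000, 120_000, "4", 70_000, 130_000, 1),
    aln("1", 150_000, 200_000, "4", 160_000, 210_000, 1),
    aln("1", 900_000, 930_000, "4", 2_000_000, 2_030_000, 1),  # isolated, too small
    aln("2", 10_000, 80_000, "7", 500_000, 570_000, -1),
    aln("2", 100_000, 190_000, "7", 400_000, 480_000, -1),
]
regions = build_synteny_blocks(toy)
print(regions)
```

The isolated 30kb alignment forms its own cluster and is dropped by the size filter, while the reverse-strand pair on chromosome 2 is kept because its order is consistent with an inverted region.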

39
Synteny web display

347 syntenic regions
Coverage: 87.5% human, 92.4% mouse
Size range, human: 104.4Kb - 57.3Mb
Size range, mouse: 100.2Kb - 51.4Mb

40
MultiContigView
41
Synteny blocks in ContigView/CytoView
42
Integrated multigenome browser
[Diagram: the Human genome browser and the Mouse genome browser are linked by orthologous/paralogous genes (direct), via families, by whole genome alignment and by syntenic regions.]
43
Species used in genome alignments
Vertebrata Compara: Human, Mouse, Rat, Chimp, Chicken
Nematoda Compara: C. briggsae, C. elegans
44
Species used in orthologues prediction
Vertebrata Compara: Human, Mouse, Rat, Chimp, Chicken, Fugu, Zebrafish, Tetraodon
Arthropoda Compara: Fruit fly, Mosquito, Honeybee
Nematoda Compara: C. briggsae, C. elegans
45
Species included in protein clustering
Vertebrata Compara: Human, Mouse, Rat, Chimp, Chicken, Fugu, Zebrafish, Tetraodon
Arthropoda Compara: Fruit fly, Mosquito, Honeybee
Nematoda Compara: C. briggsae, C. elegans
46
Outlook

OrthoView
Displaying alignments both from whole genome
alignments and on orthologues
Projected transcripts

47
Acknowledgements

Abel Ureta-Vidal
Cara Woodwark
Jessica Severin
Javier Herrero
Ensembl team

48
AlignSlice concept

A Slice with alignment information attached to it.
Makes it possible to project a transcript from one species to another through the alignment data (pairwise or multiple)
Gives gene context information across species
Needed because a significant number of genomes are going to be sequenced at only 2X/3X coverage, where no sensible gene building is possible. Cow will be used as a test run.

49
AlignSlice concept

# Getting a human AlignSlice
my $HumanAlignSlice =
    $AlignSliceAdaptor->fetch_by_Slice_method_link_species_set(
        $human_slice,
        $method_link_species_set);
# Getting mouse genes projected onto the human slice coordinates as much as possible
my $mouse_genes =
    $HumanAlignSlice->get_all_genes_by_species("Mus musculus");
# Changing the reference species
my $MouseAlignSlice =
    $HumanAlignSlice->change_reference_species_to("Mus musculus");