Title: Prediction of protein function and pathways in the genome era
1Prediction of protein function and pathways in
the genome era
Toni Gabaldón Estevan
Computational Genomics, CMBI-NCMLS, U. Nijmegen (The Netherlands)
T.Gabaldon_at_cmbi.kun.nl
www.cmbi.kun.nl/jagabald
2Prediction of protein function and pathways in
the genome era
1- Analysis of non-coding regions in DNA
2- Coding regions: introduction to genomic context function-prediction techniques
3- Case studies: mitochondrial proteome
Availability: www.cmbi.kun.nl/jagabald/summercourse.html
31995: Haemophilus influenzae genome, 1.8 Mb
5Gene prediction
ATGATGCAGTTCTCGAAAATGCATGGCCTTGGCAACGATTTTATGGTCGT
CGACGCGGTAACGCAGAATGTCTTTTTTTCACCGGAGCTGATTCGTCGCC
TGGCTGATCGGCACCTGGGGGTAGGGTTTGACCAACTGCTGGTGGTTGAG
CCGCCGTATGATCCTGAACTGGATTTTCACTATCGCATTTTCAATGCTGA
TGGCAGTGAAGTGGCGCAGTGCGGCAACGGTGCGCGCTGCTTTGCCCGTT
TTGTGCGTCTGAAAGGACTGACCAATAAGCGTGATATCCGCGTCAGCACC
GCCAACGGGCGGATGGTTCTGACCGTCACCGATGATGATCTGGTCCGCGT
AAATATGGGCGAACCCAACTTCGAACCTTCCGCCGTGCCGTTTCGCGCTA
ACAAAGCGGAAAAGACCTATATTATGCGCGCCGCCGAGCAGACAATCTTA
TGCGGCGTGGTGTCGATGGGAAATCCGCATTGCGTGATTCAGGTCGATGA
TGTCGATACCGCGGCGGTAGAAACGCTTGGTCCTGTTCTGGAAAGCCACG
AGCGTTTTCCGGAGCGCGCCAATATCGGTTTTATGCAAGTGGTTAAGCGC
GAGCATATTCGTTTACGCGTTTATGAGCGTGGGGCAGGAGAAACCCAGGC
CTGCGGCAGCGGCGCGTGTGCGGCGGTTGCAGTAGGGATTCAGCAAGGTT
TGCTGGCCGAAGAAGTACGCGTGGAACTCCCCGGCGGTCGTCTTGATATC
GCCTGGAAAGGTCCGGGTCACCCGTTATATATGACTGGCCCGGCGGTACA
TGTCTACGACGGATTTATTCATCTATGA
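As a very simplified illustration of what gene prediction looks for in a sequence like the one above, the Python sketch below scans the forward strand for open reading frames: an ATG, followed in the same frame by a stop codon. This is an assumption made for illustration, not the method used in the course; `find_orfs` and `min_codons` are hypothetical names, and real gene finders also use both strands, codon statistics and signal models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs of forward-strand ORFs (ATG ... stop codon).

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons is a hypothetical length cut-off, counted including the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk through the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG after this stop
    return orfs
```

Called on a nucleotide string, the function returns candidate coding regions that would then need further evidence (for example homology) before being called genes.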
6Non-coding regions
Alignment of the mouse and human genomes:
- 5% under purifying selection (1.5% coding, 3.5% non-coding)
- More subtle signals -> evolutionarily conserved (comparative genomics)
7Detecting subtle signals in DNA sequences
Typical Structure of a Eukaryotic Gene
8Control of Transcription Initiation
9Representing motifs: Sequence Logo
Height is the information content per position.
Height of the individual nucleotides is determined by their frequency, the most frequent on top.
Information content: $I = 2 + \sum_i p_i \log_2 p_i$ (in bits, where $p_i$ is the frequency of nucleotide $i$ at that position)
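The information content formula for a sequence logo can be sketched in a few lines of Python. The function name `information_content` is hypothetical, the input is assumed to be a list of equal-length aligned binding sites, and no small-sample correction is applied (logo software often adds one).

```python
import math
from collections import Counter


def information_content(sites):
    """Per-position information (bits) for equal-length aligned DNA sites.

    Uses I = 2 + sum_i p_i * log2(p_i); nucleotides that do not occur at a
    position contribute nothing (0 * log2(0) is taken as 0).
    """
    n = len(sites)
    result = []
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        entropy_term = sum((c / n) * math.log2(c / n) for c in counts.values())
        result.append(2 + entropy_term)
    return result
```

A fully conserved position reaches the maximum of 2 bits, while a position where all four nucleotides are equally frequent carries 0 bits.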
10Genes regulated by the same factor -> find the motif for the binding site
(Figure: upstream sequences of the co-regulated genes, with candidate site positions a1, a2, a3, a4, ..., ak.)
11Gibbs Sampling
- Goal: find the best a_k to maximize the difference between motif and background base distribution.
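The goal stated above can be illustrated with a minimal site-sampler sketch in Python. It is a simplified illustration under stated assumptions, not the implementation used in the course: it assumes exactly one site of fixed width `w` per sequence, at least two sequences, uppercase A/C/G/T only, and pseudocounts of 1. The names `gibbs_motif_search`, `iterations` and `seed` are hypothetical.

```python
import random


def gibbs_motif_search(seqs, w, iterations=200, seed=0):
    """Sample one motif start position a_k per sequence (Gibbs site sampler)."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Background base frequencies estimated from all sequences (pseudocount 1).
    joined = "".join(seqs)
    background = {b: (joined.count(b) + 1) / (len(joined) + 4) for b in alphabet}
    # Random initial start position in every sequence.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        for k, seq in enumerate(seqs):
            # Build the motif model from the current sites of all other sequences.
            others = [s[a:a + w]
                      for j, (s, a) in enumerate(zip(seqs, starts)) if j != k]
            profile = [
                {b: (col.count(b) + 1) / (len(others) + 4) for b in alphabet}
                for col in zip(*others)
            ]
            # Score each window of sequence k: motif vs background likelihood.
            weights = []
            for a in range(len(seq) - w + 1):
                ratio = 1.0
                for pos, base in enumerate(seq[a:a + w]):
                    ratio *= profile[pos][base] / background[base]
                weights.append(ratio)
            # Draw the new a_k in proportion to the likelihood ratio.
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because the procedure is stochastic, it is usually run from several random starts (here, several values of `seed`) and the alignment with the best motif-versus-background score is kept.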
12- E. coli, the most intensively studied organism
- Only 1924 genes (43%) have been (partially) experimentally characterized.
13Classic method: function prediction by homology
14Classic method: function prediction by homology
15- No homolog (orphans)
- Homologs of unknown function
- 60% poorly annotated
16What is protein function? Fuzzy term
Homology
17 A genome is more than the sum of its genes
18Turning data into knowledge
Comparative genomics
biology
19Genomic context
Homology
20(Some) types of genomic context
- Gene fusion/fission
- Chromosomal location
- Co-evolution
- Co-expression
21Gene Fusion (fission)
22Gene fusion/fission
3 genomes -> 88 gene fusions; 30 genomes -> 10,075 gene fusions
Example: trpA and trpB, separate genes in E. coli, encode tryptophan synthase subunits A and B, which are fused in yeast.
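One way to picture fusion-based prediction is the Python sketch below: a composite protein that matches two different proteins over non-overlapping regions suggests a functional link between those two. This is an illustrative assumption, not the pipeline behind the numbers above; `find_fusion_links`, `max_overlap`, the hit tuple layout and all coordinates and protein labels in the usage example are hypothetical.

```python
def find_fusion_links(hits, max_overlap=10):
    """Link two proteins that match separate regions of one composite protein.

    hits: iterable of (composite, component, start, end), where start/end are
    the matched region on the composite protein (1-based, inclusive).
    Returns a set of (component_1, component_2, composite) tuples.
    """
    by_composite = {}
    for composite, component, start, end in hits:
        by_composite.setdefault(composite, []).append((component, start, end))

    links = set()
    for composite, regions in by_composite.items():
        for i, (p1, s1, e1) in enumerate(regions):
            for p2, s2, e2 in regions[i + 1:]:
                # Length of the shared region (negative if there is a gap).
                overlap = min(e1, e2) - max(s1, s2) + 1
                if p1 != p2 and overlap <= max_overlap:
                    links.add((min(p1, p2), max(p1, p2), composite))
    return links


# Made-up similarity hits; "yeast_fused" stands for the fused yeast protein.
example_hits = [
    ("yeast_fused", "trpA", 1, 250),
    ("yeast_fused", "trpB", 280, 700),
    ("protX", "trpA", 1, 200),
    ("protX", "trpC", 50, 220),
]
links = find_fusion_links(example_hits)
```

Requiring non-overlapping regions is what separates a likely fusion from two homologs that simply share the same domain.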
24GENE ORDER
Genomes are shuffled in the course of evolution. But...
25Gene order/neighborhood. Extreme case: bacterial operons.
27Gene content -> co-evolution (the easy case: few genomes)
Differences in gene content reflect differences in phenotypic potential.
Genomes share genes for the phenotypes they have in common.
28L. innocua (non-pathogen)
L. monocytogenes (pathogen)
29Genes involved in pathogenicity
L. innocua (non-pathogenic)
L. monocytogenes (pathogenic)
30Generalization: phylogenetic profiles
(Table layout: one row per gene (Gene 1, Gene 2, Gene 3, ...), one column per species (species 1, species 2, species 3, species 4, species 5, ...).)
31Generalization: phylogenetic profiles
        species 1  species 2  species 3  species 4  species 5  ...
Gene 1      1          0          1          1          0       1
Gene 2      1          1          0          0          1       0
Gene 3      0          1          0          0          1       0
....
32Generalization: phylogenetic profiles
Genes with similar phylogenetic profiles tend to be involved in the same biological process.
(Figure: genes A, B and C.)
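Profile similarity can be made concrete with a short Python sketch that ranks gene pairs by the number of species in which their presence/absence patterns differ. The Hamming distance is one possible choice of measure and is an assumption here; the function names are hypothetical, and the three example profiles are the ones shown on the previous slide.

```python
def hamming_distance(p, q):
    """Number of species in which two presence/absence profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same species")
    return sum(a != b for a, b in zip(p, q))


def most_similar_pairs(profiles):
    """Return (distance, gene, gene) tuples, most similar pair first."""
    genes = sorted(profiles)
    pairs = [
        (hamming_distance(profiles[g], profiles[h]), g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    ]
    return sorted(pairs)


# 1 = gene present in that species, 0 = absent.
profiles = {
    "Gene 1": [1, 0, 1, 1, 0, 1],
    "Gene 2": [1, 1, 0, 0, 1, 0],
    "Gene 3": [0, 1, 0, 0, 1, 0],
}
ranked = most_similar_pairs(profiles)
```

In this example Gene 2 and Gene 3 differ in a single species, so they would be the first candidates for a shared biological process.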
33Generalization: phylogenetic profiles
Genes with complementary phylogenetic profiles tend to have a similar biochemical function.
(Figure: genes A and B.)
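Complementarity can be scored, under the simplifying assumption that it means "exactly one of the two genes is present in each species", with the Python sketch below. The function name `complementarity` is hypothetical, and the example profiles are those of Gene 1 and Gene 3 from the profile table shown earlier.

```python
def complementarity(p, q):
    """Fraction of species in which exactly one of the two genes is present."""
    if len(p) != len(q) or not p:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(a != b for a, b in zip(p, q)) / len(p)


gene_1 = [1, 0, 1, 1, 0, 1]
gene_3 = [0, 1, 0, 0, 1, 0]
score = complementarity(gene_1, gene_3)
```

A score close to 1 marks a pair in which one gene may substitute for the other, which is why such genes are candidates for having a similar biochemical function.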
34Co-evolution, correlation of mutations, physical
interaction
Receptor-ligand Complexes ..
36Predicting gene function by conserved
co-expression after gene duplication or
speciation
- Co-expression in one species is too weak a signal
- Use evolutionary conservation to improve function prediction?
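The idea of conserved co-expression can be sketched in Python as follows: a gene pair is kept only if it is co-expressed in one species and its orthologs are also co-expressed in a second species. Pearson correlation, the `threshold` value, the function names and the toy expression data are all assumptions made for illustration, not the procedure used in the study.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sx * sy)


def conserved_coexpression(expr_a, expr_b, orthologs, threshold=0.6):
    """Pairs co-expressed in species A whose orthologs are co-expressed in B."""
    genes = sorted(g for g in orthologs
                   if g in expr_a and orthologs[g] in expr_b)
    hits = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r_a = pearson(expr_a[g], expr_a[h])
            r_b = pearson(expr_b[orthologs[g]], expr_b[orthologs[h]])
            if r_a >= threshold and r_b >= threshold:
                hits.append((g, h))
    return hits


# Toy data: expression over four conditions per species (made up).
expr_a = {"a1": [1, 2, 3, 4], "a2": [2, 4, 6, 8], "a3": [4, 3, 2, 1]}
expr_b = {"b1": [1, 3, 2, 5], "b2": [2, 6, 4, 10], "b3": [1, 3, 2, 5]}
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}
pairs = conserved_coexpression(expr_a, expr_b, orthologs)
```

Requiring the correlation in both species filters out pairs that are co-expressed in only one of them, which is the sense in which conservation is used to strengthen a weak single-species signal.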
37Benchmarking high-throughput interaction data
(Figure: log-log plot, both axes ranging from 0.1 to 100.)
Coverage: fraction of reference set covered by data (log scale)
Accuracy: fraction of data confirmed by reference set (log scale)
Snel B. et al. Nat. Gen. (2003)
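The two benchmark quantities, coverage and accuracy, can be computed from two sets of interactions as in the Python sketch below. The function name and the example pairs are hypothetical; interactions are treated as unordered pairs, which is an assumption about the data.

```python
def coverage_and_accuracy(predicted, reference):
    """Return (coverage, accuracy) of predicted pairs against a reference set.

    coverage = fraction of the reference set covered by the data
    accuracy = fraction of the data confirmed by the reference set
    Pairs are unordered: (a, b) and (b, a) count as the same interaction.
    """
    pred = {frozenset(pair) for pair in predicted}
    ref = {frozenset(pair) for pair in reference}
    confirmed = pred & ref
    coverage = len(confirmed) / len(ref) if ref else 0.0
    accuracy = len(confirmed) / len(pred) if pred else 0.0
    return coverage, accuracy


# Made-up example: 4 predicted interactions, 5 reference interactions.
predicted = [("A", "B"), ("C", "B"), ("D", "E"), ("F", "G")]
reference = [("B", "A"), ("B", "C"), ("X", "Y"), ("Z", "W"), ("Q", "R")]
cov, acc = coverage_and_accuracy(predicted, reference)
```

Multiplying both values by 100 gives percentages that could be placed on a coverage-versus-accuracy plot for each data set.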
38http://string.embl.de
392. Case studies: mitochondrial proteome
40Calcium signaling
Coenzyme synthesis
Citric acid cycle
Urea cycle
Heme synthesis
Electrical signaling
Apoptosis
ATP production
Fatty acid oxidation
Heat generation
41Mitochondria originated from the endosymbiosis of an alpha-proteobacterium
42Our method
Identify eukaryotic proteins with an alpha-proteobacterial origin based on their phylogeny.
Eukaryotes
Common origin endosymbiosis
Alpha-proteobacteria
43Reconstruction of an ancestral metabolism
Gabaldón T. and Huynen M. Science (2003)
44Eukaryotes underwent extensive lineage-specific loss of genes from the proto-mitochondrion-derived set
45We used this property to predict biological interactions among our set:
- Identifying proteins with a similar evolutionary history
46Proteins that have a similar evolution tend to
function in the same biological process
(Figure: fraction of proteins functioning in the same biochemical pathway, compared with the average.)
47- Complex I deficiency.
- Inherited
- Severe (patients < 5 years old)
- No mutation in the 46 known Complex I genes.
- ???
50Recommended
- Read: Prediction of protein function and pathways in the genome era. Gabaldón T. and Huynen M. (2004) Cell. Mol. Life Sci. 61(7-8):930-44
  - www.cmbi.ru.nl/jagabald/summercourse.html
- Try these methods with your favourite protein
  - http://string.embl.de/