Title: Predicting interactions between genes based on genome sequence comparisons
1. Predicting interactions between genes based on genome sequence comparisons
The genomic context component of STRING
Bioinformatics seminar series, 5-10-2004
Berend Snel
2. To do
- Seminar (today): please ask questions
- Article: "A gene co-expression network for global discovery of conserved genetic modules"
- Make schedule for article discussion (today)
- Read article (next couple of days)
- 5-minute discussion per person of the article (preferably Monday 11 October)
3. http://string.embl.de
4. Contents
- Predicting functional interactions between proteins
- Genomic context methods
- General
- Gene fusion
- Gene order
- Presence / absence of genes across genomes
- Integration and benchmarking of predictions
- Interaction networks
- In addition to genomic context: functional genomics data
5. Complete genomes, now what?
- Post-genomic era: we have the parts list (complete genomes)
- To understand the cell we need to know the functions of the genes
6. For most genes in any genome we need function prediction
- E. coli, the most intensively studied organism
- Only 1924 genes (43%) have been (partially) experimentally characterized.
7. Predicting protein function
What is function? Various levels of description.
Sequence similarity/homology has the largest relevance for molecular function. This aspect of protein function is best conserved.
Molecular function can often be predicted from similarities between protein sequences (BLAST), or structures.
8. BLAST
9. Beyond homology and molecular function
- Homology-based function prediction works very well, but
- a large fraction of genes are poorly described (no homologs or uncharacterized homologs; this holds for 60% of the human genes)
- There are other aspects of function: functional associations, e.g. the target of a protein kinase or a transcriptional regulator
- Thus: predicting these associations
10. Genome sequences
- Allowing us to interpret the function of proteins within the context in which they occur
- Reverse this process: predict the function of a protein from the context in which it tends to occur → prediction of protein function/pathways from genome sequences
- Use the genome sequences (through comparative genome analysis) for interaction prediction: genomic context methods
- Genomic context methods have been shown to be reliable indicators for functional associations
11. There are many types of functional associations (AKA functional interactions, interactions, functional links, functional relations) in molecular biology
Cellular process
12. Types of functional associations
Metabolic pathways: filling gaps
13. Types of functional associations
Transcription regulation
Signalling pathways
14. Types of functional associations
Cellular process
Protein complexes
16. Genomic context is a tool to predict functional associations between genes
- Use the genome sequences (through comparative genome analysis) for interaction prediction: genomic context methods
- Genomic context methods have been shown to be reliable indicators for functional interaction
- Genomic context is also known as in silico interaction prediction, or genomic associations
17. Genomic context methods detect evolutionary traces in genomes of functionally associated proteins
trpA
trpB
19. Three different genomic context methods in STRING
- Gene fusion (Rosetta stone method)
- Conserved gene order between divergent genomes
- Co-occurrence of genes across genomes (phylogenetic profiles)
20. All genomic context methods use orthologs: corresponding genes between genomes
- Orthologs are not just homologs: they are related by speciation
- Orthologs are very likely to have the same function
- orthologs genomes alignment sequence
Gene Duplication
Speciation
22. Gene fusion
- i.e. the orthologs of two genes in another organism are fused into one polypeptide
- A very reliable indicator for functional interaction, partly because it is a relatively infrequent evolutionary event: 3470 distinct fusions when surveying 179 genomes
Fusion
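The fusion idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the input format (each protein annotated with the orthologous groups it covers), the genome and protein names, and the group labels OG1 to OG3 are all hypothetical, and the slides do not describe how STRING itself detects or scores fusions.

```python
from itertools import combinations

# Hypothetical toy input: for every genome, each protein is listed with the
# orthologous groups (OGs) that its sequence was assigned to.
genomes = {
    "genome_A": {"protA1": ["OG1"], "protA2": ["OG2"], "protA3": ["OG3"]},
    "genome_B": {"protB1": ["OG1", "OG2"], "protB2": ["OG3"]},
}


def fusion_links(genomes):
    """Map each OG pair to the genomes where both OGs sit in one protein."""
    links = {}
    for genome, proteins in genomes.items():
        for groups in proteins.values():
            for pair in combinations(sorted(set(groups)), 2):
                links.setdefault(pair, set()).add(genome)
    return links


def separate_somewhere(pair, genomes):
    """True if some genome encodes both OGs as separate, unfused proteins."""
    for proteins in genomes.values():
        unfused = {g[0] for g in proteins.values() if len(set(g)) == 1}
        if pair[0] in unfused and pair[1] in unfused:
            return True
    return False


# A fusion is informative when the same two genes are separate elsewhere.
predicted = {
    pair: where
    for pair, where in fusion_links(genomes).items()
    if separate_somewhere(pair, genomes)
}
print(predicted)
```

In this toy data OG1 and OG2 are fused in genome_B and separate in genome_A, so that pair is the only predicted association.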
23. Gene fusion: an example
25. Gene order evolves rapidly
But
26. Differential retention of divergent / convergent gene pairs suggests that conservation implies a functional association
27. Comparison to pathways: conservation implies a functional association
28. Conserved gene order
- i.e. genes that are present over sufficiently large evolutionary distances in the same gene cluster
- Contributes by far the most predictions
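A minimal sketch of the conserved gene order idea follows. It makes strong simplifying assumptions that are not from the slides: genomes are given as ordered lists of (orthologous group, strand), only directly adjacent genes on the same strand count as being in the same cluster, and "sufficiently large evolutionary distance" is replaced by a plain count of supporting genomes (`min_genomes`). All names and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy genomes: genes in chromosomal order as (OG, strand).
genomes = {
    "sp1": [("OG1", "+"), ("OG2", "+"), ("OG3", "-"), ("OG4", "+")],
    "sp2": [("OG4", "-"), ("OG2", "-"), ("OG1", "-"), ("OG3", "+")],
    "sp3": [("OG1", "+"), ("OG2", "+"), ("OG4", "+"), ("OG3", "+")],
}


def neighbour_pairs(genes):
    """Yield unordered pairs of adjacent genes that lie on the same strand."""
    for (og_a, strand_a), (og_b, strand_b) in zip(genes, genes[1:]):
        if strand_a == strand_b and og_a != og_b:
            yield frozenset((og_a, og_b))


def conserved_pairs(genomes, min_genomes=2):
    """Keep neighbour pairs that are found in at least min_genomes genomes."""
    support = defaultdict(set)
    for name, genes in genomes.items():
        for pair in neighbour_pairs(genes):
            support[pair].add(name)
    return {p: names for p, names in support.items() if len(names) >= min_genomes}


result = conserved_pairs(genomes)
for pair, names in result.items():
    print(sorted(pair), "conserved in", sorted(names))
```

A realistic implementation would weight the supporting genomes by how divergent they are, since neighbours shared by closely related genomes carry little information.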
29. Conserved gene order
NB1: predicting operons is not trivial; in fact conserved gene order or functional association is a major clue.
NB2: using only operons without requiring conservation results in much less reliable function prediction.
30. Conserved gene order: an example from metabolism of propionyl-CoA
target
query
31. Conserved gene order: an example from metabolism of propionyl-CoA
Biochemical assays confirm the function of members of COG0346 as a DL-methylmalonyl-CoA racemase
33. Presence / absence of genes
Gene content → co-evolution. (The easy case: few genomes.)
Differences between gene content reflect differences in phenotypic potentialities.
Genomes share genes for phenotypes they have in common.
34. Presence / absence of genes
L. innocua (non-pathogen)
L. monocytogenes (pathogen)
35. Presence / absence of genes
Genes involved in pathogenicity
L. monocytogenes (pathogenic)
L. innocua (non-pathogenic)
36. Generalization: phylogenetic profiles / co-occurrence
         species 1  species 2  species 3  species 4  species 5  ...
Gene 1       1          0          1          1          0       1
Gene 2       1          1          0          0          1       0
Gene 3       0          1          0          0          1       0
...
37. ... but phylogenetic signal in gene content!
Escherichia coli
Haemophilus influenzae
       sp1   sp2   sp3   sp4
sp1    1     0.2   0.4   0.2
sp2          1     0.9   0.1
sp3                1     0.3
sp4                      1
38. Co-occurrence of genes across genomes
- i.e. two genes have the same presence / absence pattern over multiple genomes: they have co-evolved
- AKA phylogenetic profiles
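As a small illustration, the three example profiles from slide 36 can be compared directly. The similarity measure used here (fraction of genomes in which two genes agree) is an assumption chosen for simplicity; the slides do not say which measure STRING uses, and a realistic score would also have to correct for the phylogenetic signal mentioned on slide 37.

```python
from itertools import combinations

# Presence (1) / absence (0) profiles copied from the table on slide 36.
profiles = {
    "Gene 1": (1, 0, 1, 1, 0, 1),
    "Gene 2": (1, 1, 0, 0, 1, 0),
    "Gene 3": (0, 1, 0, 0, 1, 0),
}


def agreement(p, q):
    """Fraction of genomes in which both genes are present or both absent."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a == b for a, b in zip(p, q)) / len(p)


scores = {
    (g, h): agreement(profiles[g], profiles[h])
    for g, h in combinations(profiles, 2)
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Gene 2 and Gene 3 agree in five of the six genomes, so they are the best candidates for a functional association in this toy table.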
39. Predicting function of a disease gene protein with unknown function, frataxin, using co-occurrence of genes across genomes
- Friedreich's ataxia
- No (homolog with) known function
40. Frataxin has co-evolved with hscA and hscB, indicating that it plays a role in iron-sulfur cluster assembly
[Figure: presence / absence of genes across species; species labels not recoverable apart from "coli" and "D.melan."; gene labels cyaY, Yfh1]
41. Iron-sulfur (2Fe-2S) cluster in the Rieske protein
42. Prediction
Confirmation
43. The opposite of co-occurrence: anti-correlation / complementary patterns predicting analogous enzymes
Genes with complementary phylogenetic profiles tend to have a similar biochemical function.
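Complementarity can be scored much like co-occurrence, only counting the genomes where exactly one of the two genes is present. The sketch below is an assumption-laden illustration, not the method used in the talk. It reuses the profiles of Gene 1, Gene 2 and Gene 3 from slide 36, where Gene 1 and Gene 3 happen to be perfectly complementary.

```python
def complementarity(p, q):
    """Fraction of genomes in which exactly one of the two genes is present."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q)) / len(p)


# Presence (1) / absence (0) profiles copied from the table on slide 36.
gene_1 = (1, 0, 1, 1, 0, 1)
gene_2 = (1, 1, 0, 0, 1, 0)
gene_3 = (0, 1, 0, 0, 1, 0)

print(complementarity(gene_1, gene_3))  # every genome has exactly one of them
print(complementarity(gene_2, gene_3))  # mostly co-occurring, so a low value
```

A perfectly complementary pair is only a candidate for analogous enzymes; as the following slides show for thiamin biosynthesis, the prediction still needs experimental confirmation.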
44. Complementary patterns in thiamin biosynthesis predict analogous enzymes
45. Prediction of analogous enzymes is confirmed
47. Benchmark and integration: KEGG maps
48. Integrating genomic context scores into one single score
- Compare each individual method against an independent benchmark (KEGG), and find equivalency
- Multiply the chances that two proteins are not interacting and subtract from 1 (naive Bayesian, i.e. assuming independence)
[Figure: fraction same KEGG map (0 to 1) plotted against score (0 to 1) for Fusion, Gene Order and Co-occurrence]
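The integration rule on this slide is short enough to write out. The sketch assumes each per-method score has already been calibrated against the benchmark, so that it can be read as the probability that the two proteins are associated; the example values 0.6, 0.5 and 0.2 are hypothetical.

```python
from math import prod


def integrate(scores):
    """Combine calibrated scores assuming independence: 1 - prod(1 - s_i)."""
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must be probabilities between 0 and 1")
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical calibrated scores for one protein pair:
# fusion = 0.6, gene order = 0.5, co-occurrence = 0.2
combined = integrate([0.6, 0.5, 0.2])
print(round(combined, 2))  # 1 - 0.4 * 0.5 * 0.8 = 0.84
```

Because the methods are treated as independent, the combined score is never lower than the best single score, which overstates confidence whenever two methods pick up the same underlying signal.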
49. Benchmark
[Figure: coverage (number of predicted links between orthologous groups, logarithmic axis from 10 to 100000) versus accuracy (fraction of confirmed predictions, i.e. same KEGG map, from 0.5 to 1.0) for Integrated, Gene Order (norm.), Gene Order (abs.), Cooccurrence, Fusion (norm.) and Fusion (abs.)]
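The two benchmark quantities can be computed as sketched below. The data are hypothetical (group names, map labels and scores are made up), and one detail is an assumption rather than something stated on the slide: accuracy is evaluated only on predicted pairs where both groups have a KEGG map annotation.

```python
# Hypothetical benchmark annotation: KEGG maps per orthologous group.
kegg_maps = {
    "OG1": {"map_A"},
    "OG2": {"map_A", "map_B"},
    "OG3": {"map_B"},
    "OG4": {"map_C"},
}

# Hypothetical predicted links: (group, group, score).
predictions = [
    ("OG1", "OG2", 0.9),
    ("OG2", "OG3", 0.8),
    ("OG1", "OG4", 0.6),
    ("OG3", "OG4", 0.3),
]


def benchmark(predictions, kegg_maps, threshold):
    """Return (coverage, accuracy) for links scoring at least threshold."""
    kept = [(a, b) for a, b, score in predictions if score >= threshold]
    # Only pairs with annotation on both sides can be confirmed or refuted.
    testable = [(a, b) for a, b in kept if a in kegg_maps and b in kegg_maps]
    confirmed = sum(1 for a, b in testable if kegg_maps[a] & kegg_maps[b])
    accuracy = confirmed / len(testable) if testable else None
    return len(kept), accuracy


for threshold in (0.7, 0.5, 0.0):
    print(threshold, benchmark(predictions, kegg_maps, threshold))
```

Lowering the threshold raises coverage and usually lowers accuracy, which is the trade-off the plot on this slide displays.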
50. Performance of genomic context compared to high-throughput interaction data
[Figure: coverage (fraction of reference set covered by data) versus accuracy (fraction of data confirmed by reference set) for purified complexes (TAP), purified complexes (HMS-PCI), genomic context, mRNA co-expression, synthetic lethality, yeast two-hybrid, and combined evidence (two methods, three methods); legend: raw data, filtered data, parameter choices]
51. Genomic context: biochemistry by other means
Despite the high performance of genomic context methods, as a tool for function prediction it is not a button-press method. It is more like biochemistry by other means. Often quite a lot of manual input and expert knowledge from the researcher is needed to distill associations into a concrete function prediction. Small-scale bioinformatics?
53. STRING allows a network view
e.g. see not only to which genes the query gene has an association, but also what the relations are among these other genes
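The network view, with the depth setting that appears on the next slides, can be thought of as a breadth-first walk from the query gene followed by collecting every association among the genes reached. The sketch below illustrates that idea only; the graph and node names are hypothetical, and this is not STRING's implementation.

```python
from collections import deque

# Hypothetical association network as an adjacency mapping.
associations = {
    "query": {"A", "B"},
    "A": {"query", "B", "C"},
    "B": {"query", "A"},
    "C": {"A", "D"},
    "D": {"C"},
}


def neighbourhood(graph, start, depth):
    """Return (nodes, edges) of the network within `depth` steps of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    nodes = set(distance)
    # Keep all associations among the collected genes, not only those
    # that involve the query gene itself.
    edges = {
        frozenset((a, b))
        for a in nodes
        for b in graph.get(a, ())
        if b in nodes
    }
    return nodes, edges


nodes, edges = neighbourhood(associations, "query", 1)
print(sorted(nodes), sorted(sorted(e) for e in edges))
```

At depth 1 the result already contains the A-B link, which is the extra information the slide refers to: relations among the partners of the query gene.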
54. STRING
Network output (depth 1)
Assigning uncharacterized archeal proteins to a network around
Archeal flagellins
Archeal flagellin biosynth. ATPase
55. STRING
Type IV secretion pathway
Network (depth 2)
Connecting associated cellular processes
Archeal flagellins
Archeal flagella components
Chemotaxis-related
56. STRING
Network (depth 3)
Zooming out to other cellular processes
57. Using the local network to detect multi-functional proteins
59. STRING currently in addition includes
- Functional association data from large scale / high-throughput biochemical experiments (functional genomics data):
- protein complex purification
- yeast-2-hybrid
- ChIP-on-chip
- micro-array gene expression
- Known functional relations, so-called legacy data, as present in PubMed abstracts and databases like MIPS or KEGG.