Title: Integrating Ontological Prior Knowledge into Relational Learning for Protein Function Prediction
1 Integrating Ontological Prior Knowledge into Relational Learning for Protein Function Prediction
Stefan Reckow, Max Planck Institute of Psychiatry
Volker Tresp, Siemens, Corporate Technology
2 Proteins and Protein Ontologies
3 Proteins and Protein Functions
- motivation
  - proteins: molecular machines in any organism
  - understanding protein function is essential for all areas of bio-sciences
  - diverse sources of knowledge about proteins
- challenges
  - experimental determination of functions is difficult and expensive
  - homologies can be misleading
  - most proteins have several functions
4 Protein function prediction
What function does this protein have?
[Figure: function annotations in order of increasing specificity]
- catalytic activity (catalyzes a reaction)
- isomerase activity
- intramolecular oxidoreductase activity
- intramolecular oxidoreductase activity, interconverting aldoses and ketoses
- triose-phosphate isomerase activity (catalyzes a very specific reaction)
5 Function Ontologies
- ontologies are a way of bringing order to the functions of proteins
- an ontology is a description of the concepts of a domain and their relationships
- hierarchical representation (subclass relationship)
  - tree
  - directed acyclic graph
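
To make the tree versus directed-acyclic-graph distinction concrete, the following is a minimal Python sketch of a subclass hierarchy stored as a child-to-parents map. The term names come from the function example on the preceding slide (the fourth term is abbreviated); the parent links between them and the names `FUNCTION_PARENTS`, `ancestors`, `is_tree`, and `is_acyclic` are assumptions made for illustration, not something the slides specify.

```python
# Child -> list of direct parents (subclass relationship).
# A tree allows at most one parent per term; a DAG allows several.
# The links below are assumed for illustration.
FUNCTION_PARENTS = {
    "catalytic activity": [],
    "isomerase activity": ["catalytic activity"],
    "intramolecular oxidoreductase activity": ["isomerase activity"],
    "interconverting aldoses and ketoses": [
        "intramolecular oxidoreductase activity"
    ],
    "triose-phosphate isomerase activity": [
        "interconverting aldoses and ketoses"
    ],
}


def ancestors(term, parents=FUNCTION_PARENTS):
    """Return every more general term reachable via subclass links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def is_tree(parents=FUNCTION_PARENTS):
    """True if every term has at most one direct parent."""
    return all(len(p) <= 1 for p in parents.values())


def is_acyclic(parents=FUNCTION_PARENTS):
    """True if no term is its own ancestor."""
    return all(term not in ancestors(term, parents) for term in parents)
```

Storing a list of parents per term means the same structure covers both cases: the fragment above happens to be a tree, and giving a term a second parent would turn it into a general DAG without changing the lookup code.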
6 Complex Ontology
- complex: a structure formed by a group of two or more proteins to perform certain functions in concert
7 Ontologies as a Great Source of Prior Knowledge in Machine Learning
- A considerable amount of community effort is invested in designing ontologies
- Typically this prior knowledge is deterministic (logical constraints)
- Machine learning should be able to exploit this knowledge
- Interactions between proteins are important information for predicting function; this motivates statistical relational learning
8 Statistical Relational Learning with the IHRM
9 Statistical Relational Learning (SRL)
- SRL generalizes standard machine learning to domains where relations between entities (and not just entity attributes) play a significant role
- Examples: PRM, DAPER, MLN, RMN, RDN
- The IHRM is an easily applicable general model; it performs a cluster analysis of relational domains and requires no structural learning
- Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proc. 22nd UAI, 2006
- C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proc. AAAI, 2006
10 Standard Latent Model for Protein Mixture Models
[Figure: latent mixture model for a protein (surviving label: Protein2)]
- In a Bayesian approach, we can permit an infinite number of states in the latent variables and obtain a Dirichlet Process Mixture Model (DPM)
- Advantage: the model only uses a finite number of those states, so no time-consuming structural optimization is required
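
The claim that only a finite number of states is ever used can be illustrated with the Chinese restaurant process, a standard constructive view of the cluster assignments under a Dirichlet process. The sketch below is only an illustration of that prior, not the inference procedure used for the model in these slides; the function name `crp_assignments` and the concentration value `alpha=1.0` are assumptions.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process."""
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items already in cluster k
    labels = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k],
        # a brand-new cluster with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels


labels = crp_assignments(n_items=1000, alpha=1.0)
print(len(set(labels)), "clusters used for", len(labels), "proteins")
```

Although the number of potential clusters is unbounded, the number actually occupied is at most the number of items and is typically much smaller, which is why no separate search over the number of states is needed.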
11 Infinite Hidden Relational Model (IHRM)
- Permits us to include protein-protein interactions in the model
[Figure: proteins (labels: Protein2, Protein3) linked by "interact" relations]
12 Ground Network
[Figure: ground network with latent variables Z1, Z2, Z3, each with attribute nodes function, motif, and complex, connected by interact relations]
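
The structure of the ground network can be sketched generatively: each protein has one latent variable, its attributes depend only on that latent state, and each interact relation depends on the latent states of both proteins involved. The Python sketch below is a simplified illustration under stated assumptions: it uses a fixed, finite number of clusters rather than an infinite one, models a single attribute (function), and the name `sample_ground_network` and the parameter values are hypothetical.

```python
import random


def sample_ground_network(n_proteins=6, n_clusters=2, seed=0):
    """Sample latent states, one attribute, and interactions for a toy network."""
    rng = random.Random(seed)
    functions = ["cell growth", "transport"]  # example attribute values only

    # Cluster-level parameters (drawn at random here; learned in practice).
    preferred = [rng.randrange(len(functions)) for _ in range(n_clusters)]
    phi = [[rng.random() for _ in range(n_clusters)]
           for _ in range(n_clusters)]

    # One latent variable Z_i per protein.
    z = [rng.randrange(n_clusters) for _ in range(n_proteins)]

    # The attribute depends only on the protein's own latent state.
    function = []
    for zi in z:
        if rng.random() < 0.9:
            function.append(functions[preferred[zi]])
        else:
            function.append(rng.choice(functions))

    # Each interact relation depends on the latent states of both proteins.
    interact = {
        (i, j): rng.random() < phi[z[i]][z[j]]
        for i in range(n_proteins)
        for j in range(i + 1, n_proteins)
    }
    return z, function, interact
```

Because the latent states are shared between attributes and relations, information about a protein's interaction partners can influence the prediction of its own attributes.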
13 Experimental Results
- KDD Cup 2001
  - Yeast genome data
  - 1243 genes/proteins: 862 (training) / 381 (test)
- Attributes
  - Chromosome
  - Motif (351), 1-6: a gene might contain one or more characteristic motifs (information about the amino acid sequence of the protein)
  - Essential
  - Structural class (24), 1-2: the protein coded by the gene might belong to one or more structural categories
  - Phenotype (11), 1-6: observed phenotypes in the organism
  - Interaction
  - Complex (56), 1-3: the expression of the gene can complex with others to form a larger protein
  - Function (14), 1-4: (cell growth, cell organization, transport, ...)
- genes were anonymous
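
Several of the attributes above are multi-valued (a gene can have more than one motif, complex, or function). One common way to hand such attributes to a learner is a multi-hot encoding, sketched below. The gene identifiers and records are invented for illustration, and `multi_hot` is a hypothetical helper, not part of the original experiments.

```python
def multi_hot(records, attribute):
    """Encode a multi-valued attribute as one 0/1 vector per record."""
    vocab = sorted({v for r in records for v in r.get(attribute, [])})
    index = {value: k for k, value in enumerate(vocab)}
    rows = []
    for r in records:
        row = [0] * len(vocab)
        for value in r.get(attribute, []):
            row[index[value]] = 1
        rows.append(row)
    return vocab, rows


# Invented example records; the real genes were anonymous.
genes = [
    {"id": "G1", "function": ["transport"]},
    {"id": "G2", "function": ["cell growth", "transport"]},
]
vocab, rows = multi_hot(genes, "function")
print(vocab)  # prints ['cell growth', 'transport']
print(rows)   # prints [[0, 1], [1, 1]]
```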
14 Results
Comparison with Supervised Models
[Figure: ROC curve]
[Table: accuracy by model]
15 IHRM Result
Node: gene. Link: interaction. Color: cluster.
16 Integrating Ontological Prior Knowledge into the IHRM
17 Integration of ontologies
Deductive closure
18 Integration of ontologies
[Figure: latent variable Zi with attributes function, motif, and complex; complex concepts shown as independent or dependent concepts: cytoskeleton, translocon, actin filaments, microtubules, signal peptidase]
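
Deductive closure over a subclass hierarchy means that every annotation also implies all of its more general concepts. A minimal Python sketch follows. The concept names are those shown on this slide, but the parent links between them are an assumption made for illustration, and `COMPLEX_PARENTS` and `deductive_closure` are hypothetical names.

```python
# Assumed fragment of a complex ontology: child -> direct parents.
COMPLEX_PARENTS = {
    "cytoskeleton": [],
    "actin filaments": ["cytoskeleton"],
    "microtubules": ["cytoskeleton"],
    "translocon": [],
    "signal peptidase": ["translocon"],
}


def deductive_closure(annotations, parents=COMPLEX_PARENTS):
    """Add every ancestor concept implied by the given annotations."""
    closed = set()
    stack = list(annotations)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, []))
    return closed


print(sorted(deductive_closure({"microtubules"})))
# prints ['cytoskeleton', 'microtubules']
```

Applying the closure before learning turns the logical constraints of the ontology into additional observed annotations, so the learner itself needs no special handling of the hierarchy.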
19 Experiments Including Complex Ontology
- Data collected from CYGD of MIPS
  - 1000 genes/proteins: 800 (training) / 200 (test)
- Attributes
  - chromosome, motif, essential, structural class, phenotype, interaction, complex, function
  - interactions from DIP
- usage of ontological knowledge on complex
  - five hierarchical levels
  - in our model: 258 nodes (concepts) using 66 top-level categories
  - every protein has at least one complex annotation
  - after including ontological constraints: about three annotations per protein on average
20 Results
AUC
- 800 (training) / 200 (test): w/o ontology 0.895, with ontology 0.928
- 200 (training) / 200 (test): w/o ontology 0.832, with ontology 0.894
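
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. The scores and labels are made up, and the slides do not state how the reported AUC values were computed or aggregated, so this is only an illustration of the quantity itself.

```python
def auc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up scores for four test proteins and one function label.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # prints 0.75
```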
21 Results
[Figure: explicit modeling of dependencies]
22 Results
- proteins concerned with secretion and transportation
  - The "Golgi apparatus" works together with the "endoplasmic reticulum (ER)" as the transport and delivery system of the cell
  - "SNARE" proteins help to direct material to the correct destination
  - Test proteins also have the function "cellular transport"
- proteins acting in cell division
  - control proteins
  - "Septins": septins have several roles throughout the cell cycle and carry out essential functions in cytokinesis
  - The three highlighted proteins fit into this cluster ("cell fate" and "cell type differentiation")
23 Results
[Figure: sampling convergence]
24 Results
[Figure: distribution of proteins in the clusters]
25 Results
- Cellular transport cluster
  - The former singleton "Clathrin light chain", a major constituent of coated vesicles (a component for transport), fits into this cluster quite well
- Tasks occurring during DNA replication
  - The former singleton "DNA polymerase", a main actor in replication, is assigned to the correct cluster here
26 Conclusion
- application of the IHRM to function prediction
  - competitive with supervised learning methods
  - insights into the solution
- advantages of integrating ontological knowledge
  - improvement of the clustering structure
  - robustness: stable results with varying parameterization
  - deductive closure prior to learning is a general, powerful principle
- future challenges
  - usage of several or more complex ontologies
  - further analysis of dependent vs. independent concepts
- Acknowledgements
  - Karsten Borgwardt (MPIs Tübingen)
  - Hans-Peter Kriegel (LMU)