Title: A knowledge based approach for representing, reasoning and hypothesizing about biochemical networks
1A knowledge based approach for representing,
reasoning and hypothesizing about biochemical
networks
- Chitta Baral
- Arizona State University
2Three parts to the talk
- Prediction, Explanation and Planning with respect to biochemical networks
- Hypothesis Generation with respect to biochemical networks
- Collaborative BioCuration (CBioC)
3Motivation: purpose of interaction databases?
- Suppose we have an almost exhaustive database of the intracellular interactions (protein-protein, metabolic, etc.) of particular cells.
- What next?
- How will we use this database?
- What if our knowledge is incomplete?
4Motivation: uses of networks and pathways
- Visualize the pathways
- Analyze the graphs of the networks
- Compare graphs of the networks
- Use pathway data in conjunction with micro-array data analysis
- Do system level simulation
- Is that all?
5Motivation: ultimate uses!
- Prediction/System Simulation (Systems Biology?)
- Impact of particular perturbations (say caused by a drug that introduces certain proteins to the cell membrane or into the cell)
- Do the perturbations have the desired impact?
- Do they mess up something else? (side effects!)
- But that's not all!
6Motivation: explaining observations
- A phenotypical observation (leading to) OR
- an observation that a particular protein or chemical has abnormally high concentration
- What is wrong? What is out of the ordinary?
- The cause/explanation will give us approaches to fix the problem.
- How deep should the explanations go?
- How do we compare explanations?
7Motivation: designing drugs and therapies
- What perturbations (when and where) need to be made so as to make the cell behave in a particular way?
- In case of cancer: prevent proliferation, induce apoptosis, prevent migration, etc.
8What if knowledge is incomplete?
- What kind of useful reasoning can we do with incomplete knowledge?
- Drug makers don't wait till full knowledge is available.
- Answer: hypothesis formation
9Motivation: use summary
- The ultimate uses of signaling (metabolic, etc.) interaction databases are to do
- Prediction: therapy verification, determining side effects.
- Explanation: diagnosing what is wrong.
- Planning: therapy and drug design.
- Intermediate or immediate use
- Generate Hypothesis
10Initial goal of our research
- Use knowledge representation and reasoning techniques to
- Represent interactions
- Reason about these interactions: prediction, explanation, planning and hypothesis formation.
11Some questions
- Isn't it a little premature?
- We know very little about the networks
- New knowledge is being constantly added
- Why knowledge representation and reasoning?
- Why not simulation?
- Why not use Petri nets, π-calculus?
- Why a knowledge-based approach? Why not a database approach? What's the difference?
12Our approach: present and future
- Yes, prediction is kind of the same as simulation
- Incompleteness of information is an issue though!
- But it is hard to do explanation generation, or design of therapies (planning), using simulation; guesses can be verified using simulation though
- The core database query languages cannot express explanation or planning queries.
- Dealing with incompleteness!
13Dealing with incompleteness: ongoing and future work
- Is one of the key criteria behind a good knowledge representation language when building AI systems.
- Need to be non-monotonic.
- Need to be elaboration tolerant.
- Proper analysis leads to hypothesizing
- If certain observations cannot be satisfactorily explained by the existing knowledge about the network, then use general biological knowledge to hypothesize
14Motivation -- summary
- Goal: to emulate the abstract reasoning done by biologists, medical researchers, and pharmacology researchers.
- Types of reasoning: prediction, explanation, planning and hypothesis formation.
- Current systems biology approaches: mostly prediction.
- Ongoing issues: dealing with incomplete knowledge and elaboration tolerance.
15Related Works
- Quantitative approaches. (hybrid systems, use of differential equations)
- Graphical representations.
- Other qualitative approaches.
- Petri Nets
- π-calculus
- Pathway Logic
- Model Checking
16Overview of our approach
- Represent signal network as a knowledge base that describes
- actions/events (biological interactions, processes).
- effect of these actions/events.
- triggering conditions of the actions/events.
- To query using the knowledge base
- Prediction, explanation, planning, hypothesis generation
- BioSigNet-RR (Biological Signal Network - Representation and Reasoning) and BioSigNet-RRH systems.
17Foundation behind our approach
- Research on representing and reasoning about dynamic systems (space shuttles, mobile robots, software agents)
- causal relations between properties of the world
- effects of actions (when can they be executed)
- goal specification
- action-plans
- Research on knowledge representation, reasoning and declarative problem solving: the AnsProlog language.
18An NFkB signaling pathway
19An NFkB signaling pathway
20Syntax by example
- bind(TNF-a,TNFR1) causes trimerized(TNFR1)
- trimerized(TNFR1) triggers bind(TNFR1,TRADD)
21General syntax to represent networks
- e causes f if f1, ..., fk
- g1, ..., gk causes g
- h1, ..., hm n_triggers e
- k1, ..., kl triggers e
- r1, ..., rl inhibits e
- e is an event (also referred to as an action) and the rest are fluents (properties of the cell)
- For metabolic interactions
- e converts g1, ..., gk to f1, ..., fk if h1, ..., hm
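The statement forms above can be pictured as plain data plus a transition function. The following is a minimal Python sketch, not the BioSigNet-RR implementation (which uses AnsProlog): the class names `Causes`, `Triggers`, `Inhibits` and the function `step` are hypothetical, fluents are treated as positive facts only, and it assumes that an inhibited event does not fire even when it is supplied from outside. The `n_triggers` and `converts` forms are left out.

```python
from dataclasses import dataclass

# Hypothetical encodings of three statement forms; names are illustrative only.
@dataclass(frozen=True)
class Causes:
    event: str
    effect: str
    conditions: frozenset = frozenset()   # "e causes f if f1, ..., fk"

@dataclass(frozen=True)
class Triggers:
    conditions: frozenset                 # "k1, ..., kl triggers e"
    event: str

@dataclass(frozen=True)
class Inhibits:
    conditions: frozenset                 # "r1, ..., rl inhibits e"
    event: str

def step(state, rules, occurring=frozenset()):
    """One transition: triggered or externally occurring events fire
    unless inhibited, and their effects are added to the state."""
    inhibited = {r.event for r in rules
                 if isinstance(r, Inhibits) and r.conditions <= state}
    triggered = {r.event for r in rules
                 if isinstance(r, Triggers) and r.conditions <= state}
    events = (set(occurring) | triggered) - inhibited
    effects = {r.effect for r in rules
               if isinstance(r, Causes) and r.event in events
               and r.conditions <= state}
    return frozenset(state | effects), frozenset(events)

# The two statements from the "Syntax by example" slide.
rules = [
    Causes("bind(TNF-a,TNFR1)", "trimerized(TNFR1)"),
    Triggers(frozenset({"trimerized(TNFR1)"}), "bind(TNFR1,TRADD)"),
]

s0 = frozenset()
s1, ev1 = step(s0, rules, occurring={"bind(TNF-a,TNFR1)"})
s2, ev2 = step(s1, rules)
print(sorted(s1), sorted(ev2))
```

In this sketch the first step makes `trimerized(TNFR1)` true, and the second step fires `bind(TNFR1,TRADD)` because its triggering condition now holds.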
22Semantics: queries and entailment
- Observation part of queries
- f at t
- a occurs_at t
- Given the Network N and observation O
- Predict if a temporal expression holds.
- Explain a set of observations.
- Plan to achieve a goal.
23Importance of a formal semantics
- Besides defining prediction, explanation and planning, it is also useful in identifying
- Under what restrictions the answer given by a given (graph based) algorithm will be correct. (soundness!)
- Under what restrictions a given (graph based) algorithm will find a correct answer if one exists. (completeness!)
24Utility of declarative programming languages
(such as AnsProlog)
- Allows for quick implementation of the semantics
- The specification or the definition of what is an
explanation, or what is a plan becomes a program
that finds explanations and plans respectively.
25Prediction
- Given some initial conditions and observations,
to predict how the world would evolve or predict
the outcome of (hypothetical) interventions.
26Back to the example
- Binding of TNF-a with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP.
- TRADD binding with TRAF2 leads to over-expression of FLIP provided NIK is phosphorylated on the way.
- TRADD binding with RIP inhibits phosphorylation of NIK.
- TRADD binding with FADD in the absence of FLIP leads to cell death.
27Prediction 1.
- Initial Condition
- bind(TNF-a,TNF-R1) occurs at t0
- Query
- predict eventually apoptosis
- Answer
- Unknown!
- Incomplete knowledge about TRADD's bindings.
- Depends on whether bind(TRADD, RIP) happened or not!
28Prediction 2
- Initial Condition
- bind(TNF-a,TNF-R1) occurs at t0
- Observation
- TRADD's binding with TRAF2, FADD, RIP
- Query
- predict eventually apoptosis
- Answer: Yes!
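The two prediction answers can be reproduced by enumerating the possible worlds that the incomplete knowledge allows. The sketch below is in Python rather than AnsProlog and is only an illustration: the function names are hypothetical, and it assumes that NIK is phosphorylated unless TRADD binding with RIP inhibits it, which the slides do not state explicitly.

```python
from itertools import combinations

PARTNERS = ("TRAF2", "FADD", "RIP")

def apoptosis(bound):
    """Outcome for one possible set of TRADD bindings."""
    # Assumption: NIK gets phosphorylated unless RIP binding inhibits it.
    nik_phosphorylated = "RIP" not in bound
    # TRAF2 binding leads to FLIP over-expression, provided NIK is phosphorylated.
    flip = "TRAF2" in bound and nik_phosphorylated
    # FADD binding in the absence of FLIP leads to cell death.
    return "FADD" in bound and not flip

def possible_bindings(observed=frozenset()):
    """All non-empty binding sets ("one or more") consistent with the observation."""
    for k in range(1, len(PARTNERS) + 1):
        for combo in combinations(PARTNERS, k):
            if observed <= set(combo):
                yield frozenset(combo)

def predict(observed=frozenset()):
    """'yes' if apoptosis holds in every possible world, 'no' if in none."""
    outcomes = {apoptosis(b) for b in possible_bindings(observed)}
    if outcomes == {True}:
        return "yes"
    if outcomes == {False}:
        return "no"
    return "unknown"

print(predict())                      # Prediction 1: no observation about the bindings
print(predict(frozenset(PARTNERS)))   # Prediction 2: all three bindings observed
```

With no observation, some binding sets lead to apoptosis and others do not, so the answer is "unknown"; once all three bindings are observed, only one world remains and the answer is "yes".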
29Explanation
- Given initial condition and observations, to
explain why final outcome does not match
expectation.
30Explanation 1
- Initial condition
- bound(TNF-a,TNFR1) at t0
- Observation
- bound(TRADD, TRAF2) at t1
- Query: explain apoptosis
- One explanation
- Binding of TRADD with RIP
- Binding of TRADD with FADD
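Explanation can be read as a search for unobserved events that, added to what was observed, make the outcome follow. A minimal Python sketch of that idea, under the same hypothetical simplification of the pathway (NIK is assumed phosphorylated unless RIP binding inhibits it) and with an illustrative function name `explain`:

```python
from itertools import combinations

PARTNERS = ("TRAF2", "FADD", "RIP")

def apoptosis(bound):
    """Outcome for one set of TRADD bindings (simplified, assumed model)."""
    nik_phosphorylated = "RIP" not in bound          # RIP binding inhibits NIK phosphorylation
    flip = "TRAF2" in bound and nik_phosphorylated   # FLIP needs TRAF2 binding and phosphorylated NIK
    return "FADD" in bound and not flip              # FADD binding without FLIP leads to cell death

def explain(observed):
    """Smallest sets of extra binding events that, together with the
    observed bindings, yield apoptosis."""
    candidates = [p for p in PARTNERS if p not in observed]
    for k in range(len(candidates) + 1):
        found = [set(extra) for extra in combinations(candidates, k)
                 if apoptosis(set(observed) | set(extra))]
        if found:
            return found
    return []

print(explain({"TRAF2"}))
```

Given the observation that TRADD is bound to TRAF2, the only minimal explanation this sketch finds is that TRADD also bound both RIP and FADD, which agrees with the explanation on the slide.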
31Planning
- Given initial conditions, to plan interventions to achieve a goal.
- Application in drug and therapy design.
32Planning requirements
- In addition to the knowledge about the pathway we need additional information about possible interventions such as
- What proteins can be introduced
- What mutations can be forced.
33Planning example
- Defining possible interventions
- intervention intro(DN-TRAF2)
- intro(DN-TRAF2) causes present(DN-TRAF2)
- present(DN-TRAF2) inhibits bind(TRAF2,TRADD)
- present(DN-TRAF2) inhibits interact(TRAF2,NIK)
- Initial condition
- bound(NFkB,IkB) at 0
- bind(TNF-a,TNF-R1) at 0
- Goal: keep NFkB inactive.
- Query
- plan always bound(NFkB,IkB) from 0
34Conclusion of part 1
- From paper in ISMB 2004
- Our goal in this paper was to make progress towards developing a system (and the necessary representation language and reasoning algorithms) that can be used to represent signal networks and pathways associated with cells and reason with them.
- A start was made.
- Defined a simple language (syntax and semantics)
- Defined prediction, planning and explanation
- A prototype implementation using AnsProlog
- Illustration of its applicability with respect to
an NFkB pathway.
35Issues with incomplete knowledge
- Often one may not be able to do much prediction, explanation or planning.
- What then?
- Can reasoning help in obtaining new knowledge?
- Yes, through hypothesis generation!
- In fact, hypothesis generation needs reasoning!
36Part II: Hypothesis Generation
37 Hypothesis generation
- Our observations cannot be explained by our existing knowledge, OR the explanations given by our existing knowledge are invalidated by experiments.
- Conclusion: our knowledge needs to be augmented or revised.
- How?
- Can we use a reasoning system to predict some hypothesis that one can verify through experimentation?
- Automate the reasoning in the mind of a biologist; especially helpful when the background knowledge is humongous.
38Knowledge base
[Diagram. Recoverable labels: UV leads_to cancer; High UV; Hypothesis space; (K,I); O; p53; Cancer; No cancer]
39Issues in this tiny example
- Hypothesis formation
- Theory: UV leads to cancer.
- Observation: wild-type p53 resists the UV effect.
- Hypothesis: p53 is a tumor-suppressor.
- Elaboration tolerance
- How do we update/revise "UV leads to cancer"?
- Default (non-monotonic) reasoning
- Normally UV leads to cancer.
- UV does not lead to cancer if p53 is present.
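The default and its exception can be illustrated in a few lines of Python. This is only a sketch of the non-monotonic behaviour, not how AnsProlog encodes defaults; the function name and the fluent strings `high(UV)` and `present(p53)` are illustrative choices.

```python
def leads_to_cancer(facts):
    """Default: UV leads to cancer, unless p53 is known to be present."""
    return "high(UV)" in facts and "present(p53)" not in facts

facts = {"high(UV)"}
before = leads_to_cancer(facts)   # conclusion drawn from the default alone

facts.add("present(p53)")         # new knowledge arrives
after = leads_to_cancer(facts)    # the earlier conclusion is withdrawn

print(before, after)
```

Adding a fact removes a conclusion that held before, which is exactly what a monotonic logic cannot do, and the original rule did not have to be rewritten to accommodate the exception.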
40Related Works: some prior mention of hypothesis formation
- HYPGENE (Karp, 1991)
- TRANSGENE (Darden, 1997)
- GenePath (Zupan et al., 2003)
- Robot Scientist (King et al., 2004)
- Database (Doherty et al., 2004)
- BIOCHAM (Calzone et al., 2005)
- PathLogic (Karp et al., 2002)
- Cytoscape (Shannon et al., 2003)
- Integrative Scheme (Su et al., 2003)
- Pathway Analysis (Ingenuity?)
- These do not use the latest advances in knowledge representation and reasoning (e.g., they lack ways to express defaults, non-monotonicity, elaboration tolerance, problem solving rules, etc.).
41Hypothesis formation
- Knowledge base K
- Set of initial conditions I
- Set of (experimental) observations O
- (K,I) does not entail O
- To expand (K,I) to (K', I') such that (K', I') entails O
- How to expand (hypothesis space)
- Explanation: expand only I
- Diagnosis: normality assumptions about I, minimally abandon the normality assumptions
- Hypothesis formation: expand K
42Construction of hypothesis space
- Present: manual construction, using research literature
- Future: integration of multiple data sources
- Protein interactions
- Pathway databases
- Biological ontologies
- ...
- Provide cues, hunches such as
- A may interact with B: action interact(A,B)
- A-B interaction may have effect C
- interact(A,B) causes C
43Generation of hypotheses
- Enumeration of hypotheses
- Search: computing with Smodels (an implementation of AnsProlog)
- Heuristics
- A trigger statement is selected only if it is the only cause of some action occurrence that is needed to explain the novel observations.
- An inhibition statement is selected only if it is the only blocker of some triggered action at some time.
- Maximizing preferences of selected statements
44Generation (cont.): heuristics
- Knowledge base K
- a causes g
- b causes g
- Initial condition I: initially f
- Observation O: eventually g
- (K,I) does not entail O
- Hypothesis space: to expand K with rules among
- f triggers a
- f triggers b
- Hypotheses: "f triggers a", or "f triggers b"
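This small example can be worked through by brute force: try subsets of the candidate rules in order of size and keep the smallest ones that make the observation follow. The Python sketch below stands in for the Smodels search and is not the BioSigNet-RRH implementation; the names `eventually` and `hypotheses` are hypothetical, and entailment is approximated by simple forward chaining over trigger and causes rules.

```python
from itertools import combinations

CAUSES = {"a": "g", "b": "g"}            # K: a causes g, b causes g
INITIAL = {"f"}                          # I: initially f
CANDIDATES = [("f", "a"), ("f", "b")]    # (fluent, event), read "fluent triggers event"

def eventually(goal, triggers, initial):
    """Forward-chain triggers and their effects to a fixpoint;
    report whether the goal fluent is reached."""
    state = set(initial)
    changed = True
    while changed:
        changed = False
        for fluent, event in triggers:
            effect = CAUSES.get(event)
            if fluent in state and effect is not None and effect not in state:
                state.add(effect)
                changed = True
    return goal in state

def hypotheses(goal):
    """Smallest sets of candidate trigger rules whose addition to K
    makes the goal follow from the initial condition."""
    for k in range(len(CANDIDATES) + 1):
        found = [list(h) for h in combinations(CANDIDATES, k)
                 if eventually(goal, h, INITIAL)]
        if found:
            return found
    return []

print(hypotheses("g"))
```

Without any added rule the goal `g` is never reached, so (K,I) does not entail O; adding either single trigger rule suffices, giving the two alternative hypotheses listed on the slide.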
45Case study: p53 network
46Tumor suppression by p53
- p53 has 3 main functional domains
- N terminal transactivator domain
- Central DNA-binding domain
- C terminal domain that recognizes DNA damage
- Appropriate binding of the N terminal activates pathways that lead to protection of the cell from cancer.
- Inappropriate binding (say to Mdm2) inhibits p53 induced tumor suppression.
47p53 knowledge base
- Stress
- high(UV) triggers upregulate(mRNA(p53))
- Upregulation of p53
- upregulate(mRNA(p53)) causes high(mRNA(p53))
- high(mRNA(p53)) triggers translate(p53)
- translate(p53) causes high(p53)
48p53 knowledge base (cont.)
- Tumor suppression by p53
- high(p53) inhibits growth(tumor)
49p53 knowledge base (cont)
- Interaction between Mdm2 and p53
- high(p53), high(mdm2) triggers bind(p53,mdm2)
- bind(p53,mdm2) causes bound(dom(p53,N))
- bind(p53,mdm2) causes high(p53 mdm2),
- bind(p53,mdm2) causes high(p53),high(mdm2)
50Hypothesis formation
- Experimental observation
- I: initially high(UV), high(mdm2), high(ARF)
- O: eventually tumorous
- (K,I) does not entail O
- Need to hypothesize the role of ARF.
51Constructing hypothesis space
- Levels of ARF and p53 correlate
- high(ARF) triggers upregulate(mRNA(p53))
- high(p53) triggers upregulate(mRNA(ARF))
52Constructing (cont)
- Interactions of ARF with the known proteins
- bind(p53,ARF) causes bound(dom(p53,N))
53Constructing (cont)
- Influence of X (ARF) on other interactions
- high(ARF) triggers upreg(mRNA(p53))
- high(ARF) triggers translate(p53)
- high(ARF) triggers bind(p53,mdm2)
54Twelve generated hypotheses, such as
- high(UV) triggers upregulate(mRNA(ARF))
- high(ARF), high(mdm2) triggers bind(ARF,mdm2)
55Conclusion of part 2
- Goal: automation of hypothesis formation (with respect to interactions and pathways)
- Approach: viewed known qualitative aspects of cell activities as a knowledge base
- Used a knowledge representation language that
- Can express defaults
- Allows reasoning with incomplete knowledge
- Can express reasoning as well as problem solving rules
- Developed a system: BioSigNet-RRH
- Formalizing and reasoning about hypotheses
- Illustration: hypothesizing the role of ARF protein in the p53 network.
56Future Work on Reasoning about Biochemical
Networks (Part I and II)
- Further development of the language
- Validation with respect to larger networks
- Kohn's map
- Networks in Reactome and other repositories
- Going from prototype to deployable systems
- Scaling up challenges
- Recent advances in automatic planning
- Integration with Biopax
57Part III: CBioC
58Do we have enough knowledge in the various databases?
- Some have been curated into databases.
- But there is much more in the literature.
- So what do we do?
59Current status of curation from text
- About 15 million abstracts in PubMed
- 3 million published by US and EU researchers during 1994-2004 (800 articles per day)
- 300K articles published so far reporting protein-protein interactions in human, yeast and mouse.
- BIND (in 7 yrs): 23K; DIP: 3K; MINT: 2.4K.
60Premise: high cost of human curation
- Overwhelming cost of large curation efforts may be unsustainable for long periods
- BIND: Nov 2005 bad news.
- Operated for 7 years
- Listed over 100 curators and programmers
- CND $29 million received in 2003, plus other funding
- Curation efforts of AFCS have recently stopped.
- Lack of funding for some genome annotation projects.
61Premise: summary
- Human curation of text is expensive.
- Human curation of text is not scalable.
- Human curation of text is not sustainable.
62Why not resort to computers? (do automatic extraction)
- Lessons from DARPA-funded MUCs (Message Understanding Conferences) in the 90s, run for a decade at the cost of tens of millions of dollars.
- Getting to 60% recall and precision is quick
- Then every 5% improvement is about a year's work.
- Even when we get to 90% for an individual entity extraction, for recognizing 4 related entities: (0.9)^4 ≈ 0.66
- Lessons from biomedical text extraction
- No proper evaluation.
- Recognized that recall and precision are not very good even in the best systems.
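The arithmetic behind these bullets is easy to check. The short Python sketch below assumes that the four entity extractions succeed or fail independently (an assumption the slide leaves implicit) and reads "every 5% improvement" as five percentage points per year; the variable names are illustrative.

```python
# Joint accuracy when a fact needs several entities, each extracted
# correctly with the same probability (independence is assumed).
per_entity = 0.9
related_entities = 4
joint = per_entity ** related_entities

# Years of work to move from the quick 60% to 90%,
# at roughly 5 percentage points per year.
start, target, gain_per_year = 60, 90, 5
years = (target - start) / gain_per_year

print(f"joint accuracy for {related_entities} entities: {joint:.4f}")
print(f"years from {start}% to {target}%: {years:.0f}")
```

Under these assumptions a 90% per-entity extractor gets a four-entity fact fully right only about two times in three, and reaching 90% in the first place takes on the order of six years beyond the quick 60%.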
63What do we do?
- How do we curate not only the existing articles, but also the future articles?
- Too important to give up!
- Need to think of a new way to do it.
- Faster computers, better sequencing technology and better algorithms came to the rescue of the Human Genome Project.
- Hmm. What resources are we overlooking?
64Key Idea
- If lots of articles are being written, then lots of people are writing them and lots of people are reading them.
- If only we could make these people (the authors and the readers) contribute to the curation effort
- Especially the readers: the ones who need the curated data!
65Mass collaboration has worked in
- Wikipedia
- Project Gutenberg
- Netflix rating
- Amazon rating
- Etc.
66Mass collaborative curation: initial hurdles
- An average reader
- (S)he is not normally interested in filling a blank curation form.
- We cannot make an average reader go through curation training.
- So it has to be very different from just making the existing curation tools available to the masses and expecting them to contribute.
67Mass collaborative curation: key initial ideas
- Make it very easy
- The user need not remember where (which database, which web page) to put the curated knowledge.
- Curation opportunity should present itself seamlessly.
- Curation should not be a burden to an average user
- Make the curated knowledge thin.
- There should be immediate rewards
- Do not start with a blank slate.
68Realization of the key ideas: a biologist with a gene name
- Goes to PubMed, types the gene name, clicks on one of the abstracts
- Curation panel presents itself automatically
- Our approach calls for researchers to contribute to the curation of facts as they read and research over the web
- But not with a blank slate
- No one wants to be the first one!
- Automatic extraction jump-starts the process, and then researchers improve upon the extracted data, ironing out inconsistencies by subsequent edits on a massive scale.
- Thin schemas
- Average users are turned off by traditional wide schemas
- Wide schemas need to be broken down.
79Summary
- Information/curation window pops up automatically.
- Automatic extraction is used as a bootstrap so that no user is working on a blank slate.
- Users vote on correctness, make corrections, add facts.
- Suppose the automatic extraction system has 60% precision and recall
- A person will have an easier time discarding the 40% of wrongly extracted text than identifying the 60% of correct entries and entering them!
80Very useful byproducts
- Avoids some problems with the existing human curation approach
- Curator bias
- Curators miss things
- Curators have disagreements
- Slow access to newest findings
- Researchers at large have little or no control over what gets curated and when
- A large curated corpus of text gets created
- Very useful to evaluate and improve automated extraction systems.
81Current status of CBioC and future plans
- Basic system, as described, is ready
- Being populated with
- Facts from existing databases (BIND etc.)
- Facts extracted using our extraction system
- Querying mechanism
- Answer display
- Future work
- Voter confidence issues
82Conclusion
- Collecting what is known
- Reasoning with what is known
- Hypothesizing what is unknown (based
on observations)
83Open Invitation
- We are building, and eager to help other groups build, knowledge bases in particular domains to
- Predict impact of interventions
- Plan (therapy design) to make a pathway behave in a desired way
- Explain observations
- Hypothesize new knowledge
- Further improvements to and adaptation of CBioC
84Acknowledgements
- BioSigNet
- Nam Tran, Ph.D. thesis on this, postdoc at Yale
- Karen Chancellor, Ph.D. student
- Michael Berens and his group (Ana Joy, Nhan Tran)
- Lokesh Joshi and his group (Vinay Nagraj)
- CBioC: Graciela Gonzalez, Lian Yu, Luis Tari, Tony Gitter, Amanda Ziegler, Ryan Wendt, Prabhdeep Singh.
- Other projects
- BioQA
- Biogenenet
85Thank you!