A knowledge based approach for representing, reasoning and hypothesizing about biochemical networks

1
A knowledge based approach for representing,
reasoning and hypothesizing about biochemical
networks
  • Chitta Baral
  • Arizona State University

2
Three parts to the talk
  • Prediction, Explanation and Planning with respect
    to biochemical networks
  • Hypothesis Generation with respect to biochemical
    networks
  • Collaborative BioCuration CBioC

3
Motivation: purpose of interaction databases?
  • Suppose we have an almost exhaustive database of
    the intracellular interactions (protein-protein,
    metabolic, etc.) of particular cells.
  • What next?
  • How will we use this database?
  • What if our knowledge is incomplete?

4
Motivation: Uses of networks and pathways
  • Visualize the pathways
  • Analyze the graphs of the networks
  • Compare graphs of the networks
  • Use pathway data in conjunction with micro-array
    data analysis
  • Do system level simulation
  • Is that all?

5
Motivation: ultimate uses!
  • Prediction/System Simulation (Systems Biology?)
  • Impact of particular perturbations (say caused by
    a drug that introduces certain proteins to the
    cell membrane or into the cell)
  • Do the perturbations have the desired impact?
  • Do they mess up something else? (side effects!)
  • But that's not all!

6
Motivation: Explaining observations
  • A phenotypical observation (leading to) OR
  • an observation that a particular protein or
    chemical has abnormally high concentration
  • What is wrong? What is out of the ordinary?
  • The cause/explanation will give us approaches to
    fix the problem.
  • How deep should the explanations go?
  • How do we compare explanations?

7
Motivation: Designing drugs and therapies
  • What perturbations (when and where) need to be
    made so as to make the cell behave in a
    particular way?
  • In case of cancer: prevent proliferation, induce
    apoptosis, prevent migration, etc.

8
What if knowledge is incomplete?
  • What kind of useful reasoning can we do with
    incomplete knowledge?
  • Drug makers don't wait till full knowledge is
    available.
  • Answer: hypothesis formation

9
Motivation: Use summary
  • The ultimate uses of signaling (metabolic, etc.)
    interaction databases are to do
  • Prediction: therapy verification, determining
    side effects.
  • Explanation: diagnosing what is wrong.
  • Planning: therapy and drug design.
  • Intermediate or immediate use
  • Generate hypotheses

10
Initial goal of our research
  • Use knowledge representation and reasoning
    techniques to
  • Represent interactions
  • Reason about these interactions: prediction,
    explanation, planning and hypothesis formation.

11
Some questions
  • Isn't it a little premature?
  • We know very little about the networks
  • New knowledge is being constantly added
  • Why knowledge representation and reasoning?
  • Why not simulation?
  • Why not use Petri nets, π-calculus?
  • Why a knowledge-based approach? Why not a
    database approach? What's the difference?

12
Our approach: present and future
  • Yes, prediction is kind of the same as simulation
  • Incompleteness of information is an issue though!
  • But it is hard to do explanation generation, or
    design of therapies (planning), using simulation;
    guesses can be verified using simulation though
  • The core database query languages can not express
    explanation or planning queries.
  • Dealing with incompleteness!

13
Dealing with incompleteness: ongoing and future
work
  • Is one of the key criteria behind a good
    knowledge representation language when building
    AI systems.
  • Need to be non-monotonic.
  • Need to be elaboration tolerant.
  • Proper analysis leads to hypothesizing
  • If certain observations can not be satisfactorily
    explained by the existing knowledge about the
    network then use general biological knowledge to
    hypothesize

14
Motivation -- summary
  • Goal: To emulate the abstract reasoning done by
    biologists, medical researchers, and pharmacology
    researchers.
  • Types of reasoning: prediction, explanation,
    planning and hypothesis formation.
  • Current systems biology approaches: mostly
    prediction.
  • Ongoing issues: Dealing with incomplete knowledge
    and elaboration tolerance.

15
Related Works
  • Quantitative approaches. (hybrid systems, use of
    differential equations)
  • Graphical representations.
  • Other qualitative approaches.
  • Petri Nets
  • π-calculus
  • Pathway Logic
  • Model Checking

16
Overview of our approach
  • Represent signal network as a knowledge base that
    describes
  • actions/events (biological interactions,
    processes).
  • effect of these actions/events.
  • triggering conditions of the actions/events.
  • To query using the knowledge base
  • Prediction, explanation, planning, hypothesis
    generation
  • BioSigNet-RR (Biological Signal Network -
    Representation and Reasoning) and BioSigNet-RRH
    systems.

17
Foundation behind our approach
  • Research on representing and reasoning about
    dynamic systems (space shuttles, mobile robots,
    software agents)
  • causal relations between properties of the world
  • effects of actions (when can they be executed)
  • goal specification
  • action-plans
  • Research on knowledge representation, reasoning
    and declarative problem solving: the AnsProlog
    language.

18
An NFkB signaling pathway
19
An NFkB signaling pathway
20
Syntax by example
  • bind(TNF-a,TNFR1) causes trimerized(TNFR1)
  • trimerized(TNFR1) triggers bind(TNFR1,TRADD)

21
General syntax to represent networks
  • e causes f if f1, ..., fk
  • g1, ..., gk causes g
  • h1, ..., hm n_triggers e
  • k1, ..., kl triggers e
  • r1, ..., rl inhibits e
  • e is an event (also referred to as an action) and
    the rest are fluents (properties of the cell)
  • For metabolic interactions
  • e converts g1, ..., gk to f1, ..., fk if h1, ..., hm
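
The statement forms above can be held in a small data structure. The following is a minimal sketch, not the BioSigNet-RR implementation: the class names, the `parse` helper and the choice to cover only the "e causes f if ...", "triggers" and "inhibits" forms with a single effect are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Causes:
    event: str
    effect: str
    conditions: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Triggers:
    conditions: FrozenSet[str]
    event: str


@dataclass(frozen=True)
class Inhibits:
    conditions: FrozenSet[str]
    event: str


def split_fluents(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current.strip())
            current = ""
        else:
            current += ch
    if current.strip():
        parts.append(current.strip())
    return frozenset(parts)


def parse(statement):
    """Parse one statement (hypothetical helper; other forms are rejected)."""
    if " causes " in statement:
        event, rest = statement.split(" causes ", 1)
        effect, _, conditions = rest.partition(" if ")
        return Causes(event.strip(), effect.strip(), split_fluents(conditions))
    for keyword, cls in ((" triggers ", Triggers), (" inhibits ", Inhibits)):
        if keyword in statement:
            conditions, event = statement.split(keyword, 1)
            return cls(split_fluents(conditions), event.strip())
    raise ValueError("unsupported statement: " + statement)


print(parse("bind(TNF-a,TNFR1) causes trimerized(TNFR1)"))
print(parse("trimerized(TNFR1) triggers bind(TNFR1,TRADD)"))
```

Keeping events and fluents as plain strings keeps the sketch close to the slide notation; a fuller system would need typed terms and the remaining statement forms.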

22
Semantics: queries and entailment
  • Observation part of queries
  • f at t
  • a occurs_at t
  • Given the Network N and observation O
  • Predict if a temporal expression holds.
  • Explain a set of observations.
  • Plan to achieve a goal.

23
Importance of a formal semantics
  • Besides defining prediction, explanation and
    planning, it is also useful in identifying
  • Under what restrictions the answer given by a
    given (graph based) algorithm will be correct.
    (soundness!)
  • Under what restrictions a given (graph based)
    algorithm will find a correct answer if one
    exists. (completeness!)

24
Utility of declarative programming languages
(such as AnsProlog)
  • Allows for quick implementation of the semantics
  • The specification or the definition of what is an
    explanation, or what is a plan becomes a program
    that finds explanations and plans respectively.

25
Prediction
  • Given some initial conditions and observations,
    to predict how the world would evolve or predict
    the outcome of (hypothetical) interventions.

26
Back to the example
  • Binding of TNF-a with TNFR1 leads to TRADD
    binding with one or more of TRAF2, FADD, RIP.
  • TRADD binding with TRAF2 leads to over-expression
    of FLIP provided NIK is phosphorylated on the
    way.
  • TRADD binding with RIP inhibits phosphorylation
    of NIK.
  • TRADD binding with FADD in the absence of FLIP
    leads to cell death.

27
Prediction 1.
  • Binding of TNF-a with TNFR1 leads to TRADD
    binding with one or more of TRAF2, FADD, RIP.
  • TRADD binding with TRAF2 leads to over-expression
    of FLIP provided NIK is phosphorylated on the
    way.
  • TRADD binding with RIP inhibits phosphorylation
    of NIK.
  • TRADD binding with FADD in the absence of FLIP
    leads to cell death.
  • Initial Condition
  • bind(TNF-a,TNF-R1) occurs at t0
  • Query
  • predict eventually apoptosis
  • Answer
  • Unknown!
  • Incomplete knowledge about TRADD's bindings.
  • Depends on if bind(TRADD, RIP) happened or not!

28
Prediction 2
  • Binding of TNF-a with TNFR1 leads to TRADD
    binding with one or more of TRAF2, FADD, RIP.
  • TRADD binding with TRAF2 leads to over-expression
    of FLIP provided NIK is phosphorylated on the
    way.
  • TRADD binding with RIP inhibits phosphorylation
    of NIK.
  • TRADD binding with FADD in the absence of FLIP
    leads to cell death.
  • Initial Condition
  • bind(TNF-a,TNF-R1) occurs at t0
  • Observation
  • TRADD's binding with TRAF2, FADD, RIP
  • Query
  • predict eventually apoptosis
  • Answer: Yes!
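
The two prediction queries can be reproduced with a small possible-worlds sketch. It encodes the four pathway statements as Boolean rules and answers "yes", "no" or "unknown" depending on whether apoptosis holds in all, none or some of the worlds consistent with what is known. The function names are hypothetical, and one assumption goes beyond the slides: NIK is taken to be phosphorylated whenever TRADD-RIP binding does not inhibit it.

```python
from itertools import combinations

PARTNERS = ("TRAF2", "FADD", "RIP")


def apoptosis(bound):
    """Outcome for one world, given the set of partners TRADD binds."""
    # ASSUMPTION (not on the slides): NIK is phosphorylated unless RIP binding inhibits it.
    nik_phosphorylated = "RIP" not in bound
    # TRADD-TRAF2 binding over-expresses FLIP provided NIK is phosphorylated.
    flip = "TRAF2" in bound and nik_phosphorylated
    # TRADD-FADD binding in the absence of FLIP leads to cell death.
    return "FADD" in bound and not flip


def possible_worlds(observed=None):
    """All non-empty binding sets, or just the observed one if it is known."""
    if observed is not None:
        return [frozenset(observed)]
    return [frozenset(c)
            for r in range(1, len(PARTNERS) + 1)
            for c in combinations(PARTNERS, r)]


def predict_eventually_apoptosis(observed=None):
    outcomes = {apoptosis(world) for world in possible_worlds(observed)}
    if outcomes == {True}:
        return "yes"
    if outcomes == {False}:
        return "no"
    return "unknown"


print(predict_eventually_apoptosis())                         # Prediction 1
print(predict_eventually_apoptosis({"TRAF2", "FADD", "RIP"}))  # Prediction 2
```

Under these assumptions the first call cannot decide, because the outcome differs between worlds with and without RIP binding, while the second call has a single world to check.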

29
Explanation
  • Given initial condition and observations, to
    explain why final outcome does not match
    expectation.

30
Explanation 1
  • Binding of TNF-a with TNFR1 leads to TRADD
    binding with one or more of TRAF2, FADD, RIP.
  • TRADD binding with TRAF2 leads to over-expression
    of FLIP provided NIK is phosphorylated on the
    way.
  • TRADD binding with RIP inhibits phosphorylation
    of NIK.
  • TRADD binding with FADD in the absence of FLIP
    leads to cell death.
  • Initial condition
  • bound(TNF-a,TNFR1) at t0
  • Observation
  • bound(TRADD, TRAF2) at t1
  • Query: Explain apoptosis
  • One explanation
  • Binding of TRADD with RIP
  • Binding of TRADD with FADD

31
Planning
  • Given initial conditions, to plan interventions
    to achieve a goal.
  • Application in drug and therapy design.

32
Planning requirements
  • In addition to the knowledge about the pathway we
    need additional information about possible
    interventions such as
  • What proteins can be introduced
  • What mutations can be forced.

33
Planning example
  • Defining possible interventions
  • intervention intro(DN-TRAF2)
  • intro(DN-TRAF2) causes present(DN-TRAF2)
  • present(DN-TRAF2) inhibits bind(TRAF2,TRADD)
  • present(DN-TRAF2) inhibits interact(TRAF2,NIK)
  • Initial condition
  • bound(NFκB,IκB) at 0
  • bind(TNF-a,TNF-R1) at 0
  • Goal: to keep NFκB inactive.
  • Query
  • plan always bound(NFκB,IκB) from 0
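
A planning query of this kind can be sketched as a search over subsets of the available interventions, smallest first. The intervention and its two inhibition statements come from the slide. Everything about how NFkB is released is a toy assumption introduced only to make the sketch runnable, since the pathway itself is not in the transcript: here NFkB is assumed to stay bound to IkB exactly when both inhibited events are blocked.

```python
from itertools import combinations

# From the slide: intro(DN-TRAF2) causes present(DN-TRAF2).
INTERVENTIONS = {"intro(DN-TRAF2)": "present(DN-TRAF2)"}

# From the slide: present(DN-TRAF2) inhibits these two events.
INHIBITS = {
    "present(DN-TRAF2)": {"bind(TRAF2,TRADD)", "interact(TRAF2,NIK)"},
}

# TOY ASSUMPTION (not on the slides): if either event occurs, NFkB is released from IkB.
RELEASING_EVENTS = {"bind(TRAF2,TRADD)", "interact(TRAF2,NIK)"}


def stays_bound(plan):
    """True if the plan blocks every event assumed to release NFkB."""
    present = {INTERVENTIONS[action] for action in plan}
    blocked = set()
    for fluent in present:
        blocked |= INHIBITS.get(fluent, set())
    return RELEASING_EVENTS <= blocked


def find_plans():
    """Return the smallest intervention sets that keep bound(NFkB,IkB) true."""
    actions = sorted(INTERVENTIONS)
    for size in range(len(actions) + 1):
        plans = [set(c) for c in combinations(actions, size) if stays_bound(c)]
        if plans:
            return plans
    return []


print(find_plans())
```

Searching by increasing size returns minimal plans first, which is one reasonable preference when each intervention stands for a drug or mutation.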

34
Conclusion of part 1
  • From paper in ISMB 2004
  • Our goal in this paper was to make progress
    towards developing a system (and the necessary
    representation language and reasoning algorithms)
    that can be used to represent signal networks and
    pathways associated with cells and reason with
    them.
  • A start was made.
  • Defined a simple language (syntax and semantics)
  • Defined prediction, planning and explanation
  • A prototype implementation using AnsProlog
  • Illustration of its applicability with respect to
    an NFkB pathway.

35
Issues with incomplete knowledge
  • Often one may not be able to do much prediction,
    explanation or planning.
  • What then?
  • Can reasoning help in obtaining new knowledge?
  • Yes, through hypothesis generation!
  • In fact, hypothesis generation needs reasoning!

36
Part II: Hypothesis Generation
37
Hypothesis generation
  • Our observations can not be explained by our
    existing knowledge OR the explanations given by
    our existing knowledge are invalidated by
    experiments?
  • Conclusion: Our knowledge needs to be augmented
    or revised?
  • How?
  • Can we use a reasoning system to predict some
    hypothesis that one can verify through
    experimentation?
  • Automate the reasoning in the minds of a
    biologist, especially helpful when the background
    knowledge is humongous.

38
[Figure. Labels: Knowledge base ("UV leads_to cancer"), High UV, p53, Cancer, No cancer, Hypothesis space, (K,I) does not entail O]

39
Issues in this tiny example
  • Hypothesis formation
  • Theory: UV leads to cancer.
  • Observation: wild-type p53 resists the UV
    effect.
  • Hypothesis: p53 is a tumor-suppressor.
  • Elaboration tolerance
  • How do we update/revise "UV leads to cancer"?
  • Default non-monotonic (NM) reasoning
  • Normally UV leads to cancer.
  • UV does not lead to cancer if p53 is present.
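
The default rule and its exception can be illustrated in a few lines. This is only a sketch of the idea of non-monotonicity, with a hypothetical function name and fluent names; it is not how AnsProlog evaluates defaults.

```python
def leads_to_cancer(facts):
    """Normally UV leads to cancer, unless p53 is known to be present."""
    exception_applies = "present(p53)" in facts
    return "high(UV)" in facts and not exception_applies


before = {"high(UV)"}
after = before | {"present(p53)"}   # new knowledge is added, nothing is removed

print(leads_to_cancer(before))
print(leads_to_cancer(after))
```

The conclusion drawn from the smaller set of facts is withdrawn once a fact is added, which is the behaviour a monotonic logic cannot express without rewriting the original rule.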

40
Related Works: some prior mention of hypothesis
formation
  • HYPGENE (Karp, 1991)
  • TRANSGENE (Darden, 1997)
  • GenePath (Zupan et al., 2003)
  • Robot Scientist (King et al., 2004)
  • Database (Doherty et al., 2004)
  • BIOCHAM (Calzone et al., 2005)
  • PathLogic (Karp et al. 2002)
  • Cytoscape (Shannon et al., 2003)
  • Integrative Scheme (Su et al., 2003)
  • Pathway Analysis (Ingenuity?)
  • These do not use the latest advances in knowledge
    representation and reasoning (e.g. lack of ways
    to express defaults, non-monotonicity,
    elaboration tolerance, problem solving rules,
    etc.)

41
Hypothesis formation
  • Knowledge base K
  • Set of initial conditions I
  • Set of (experimental) observations O
  • (K,I) does not entail O
  • To expand (K,I) to (K', I') such that (K', I')
    entails O
  • How to expand (hypothesis space)
  • Explanation: expand only I
  • Diagnosis: normality assumptions about I,
    minimally abandon the normality assumptions
  • Hypothesis formation: expand K

42
Construction of hypothesis space
  • Present: manual construction, using research
    literature
  • Future: integration of multiple data sources
  • Protein interactions
  • Pathway databases
  • Biological ontologies
  • ..
  • Provide cues, hunches such as
  • A may interact with B: action interact(A,B)
  • A-B interaction may have effect C:
  • interact(A,B) causes C

43
Generation of hypotheses
  • Enumeration of hypotheses
  • Search: computing with Smodels (an implementation
    of AnsProlog)
  • Heuristics
  • A trigger statement is selected only if it is the
    only cause of some action occurrence that is
    needed to explain the novel observations.
  • An inhibition statement is selected only if it is
    the only blocker of some triggered action at some
    time.
  • Maximizing preferences of selected statements

44
Generation (cont.): heuristics
  • Knowledge base K
  • a causes g
  • b causes g
  • Initial condition I = { initially f }
  • Observation O = { eventually g }
  • (K,I) does not entail O
  • Hypothesis space: to expand K with rules among
  • f triggers a
  • f triggers b
  • Hypotheses: { f triggers a }, or { f
    triggers b }
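
This small example can be run end to end. The sketch below enumerates subsets of the hypothesis space, adds each to K, and keeps the minimal subsets under which the observation becomes entailed. The forward-chaining check and all function names are assumptions for illustration; the real system searches with Smodels rather than by enumeration.

```python
from itertools import combinations

CAUSES = {"a": "g", "b": "g"}                  # K: a causes g, b causes g
HYPOTHESIS_SPACE = [("f", "a"), ("f", "b")]    # candidates: f triggers a, f triggers b


def entails_eventually(triggers, initial, goal, max_steps=10):
    """Forward-chain from the initial fluents; True if the goal fluent is reached."""
    state = set(initial)
    for _ in range(max_steps):
        if goal in state:
            return True
        fired = {event for condition, event in triggers if condition in state}
        new_state = state | {CAUSES[e] for e in fired if e in CAUSES}
        if new_state == state:      # fixed point reached without the goal
            return False
        state = new_state
    return goal in state


def minimal_hypotheses(initial, goal):
    """Smallest sets of candidate rules whose addition to K explains the observation."""
    found = []
    for size in range(len(HYPOTHESIS_SPACE) + 1):
        for hypothesis in combinations(HYPOTHESIS_SPACE, size):
            if any(set(prev) <= set(hypothesis) for prev in found):
                continue            # a smaller hypothesis already suffices
            if entails_eventually(hypothesis, initial, goal):
                found.append(hypothesis)
    return found


print(entails_eventually((), {"f"}, "g"))   # (K,I) alone does not entail O
print(minimal_hypotheses({"f"}, "g"))
```

Skipping supersets of an already successful hypothesis mirrors the heuristic of selecting a trigger statement only when it is needed to explain the observation.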

45
Case study: p53 network
46
Tumor suppression by p53
  • p53 has 3 main functional domains
  • N terminal transactivator domain
  • Central DNA-binding domain
  • C terminal domain that recognizes DNA damage
  • Appropriate binding of N terminal activates
    pathways that lead to protection of cell from
    cancer.
  • Inappropriate binding (say to Mdm2) inhibits p53
    induced tumor suppression.

47
p53 knowledge base
  • Stress
  • high(UV) triggers upregulate(mRNA(p53))
  • Upregulation of p53
  • upregulate(mRNA(p53)) causes high(mRNA(p53))
  • high(mRNA(p53)) triggers translate(p53)
  • translate(p53) causes high(p53)
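
The four statements on this slide form a chain from UV stress to a high p53 level, which a step-wise simulation makes visible. The sketch below assumes a simple discrete-time reading that is not spelled out on the slide: an event occurs in any step where its trigger holds, its effect holds from the next step on, and fluents persist once true.

```python
# Statements from the slide, as (trigger conditions, event) and event -> effect.
TRIGGERS = [
    ({"high(UV)"}, "upregulate(mRNA(p53))"),
    ({"high(mRNA(p53))"}, "translate(p53)"),
]
CAUSES = {
    "upregulate(mRNA(p53))": "high(mRNA(p53))",
    "translate(p53)": "high(p53)",
}


def trajectory(initial, steps):
    """Return the list of states visited, starting from the initial fluents."""
    state = set(initial)
    history = [set(state)]
    for _ in range(steps):
        fired = [event for conditions, event in TRIGGERS if conditions <= state]
        # ASSUMPTION: effects are added and nothing is retracted (fluents persist).
        state = state | {CAUSES[event] for event in fired}
        history.append(set(state))
    return history


for t, state in enumerate(trajectory({"high(UV)"}, 2)):
    print(t, sorted(state))
```

Two steps are needed before high(p53) appears: one for upregulation of the mRNA and one for translation.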

48
p53 knowledge base (cont.)
  • Tumor suppression by p53
  • high(p53) inhibits growth(tumor)

49
p53 knowledge base (cont)
  • Interaction between Mdm2 and p53
  • high(p53), high(mdm2) triggers bind(p53,mdm2)
  • bind(p53,mdm2) causes bound(dom(p53,N))
  • bind(p53,mdm2) causes high(p53:mdm2),
  • bind(p53,mdm2) causes ¬high(p53), ¬high(mdm2)

50
Hypothesis formation
  • Experimental observation
  • I = { initially high(UV), high(mdm2), high(ARF) }
  • O = { eventually tumorous }
  • (K,I) does not entail O
  • Need to hypothesize the role of ARF.

51
Constructing hypothesis space
  • Levels of ARF and p53 correlate
  • high(ARF) triggers upregulate(mRNA(p53))
  • high(p53) triggers upregulate(mRNA(ARF))

52
Constructing (cont)
  • Interactions of ARF with the known proteins
  • bind(p53,ARF) causes bound(dom(p53,N))

53
Constructing (cont)
  • Influence of X (ARF) on other interactions
  • high(ARF) triggers upreg(mRNA(p53))
  • high(ARF) triggers translate(p53)
  • high(ARF) triggers bind(p53,mdm2)

54
Twelve generated hypotheses, such as
  • high(UV) triggers upregulate(mRNA(ARF))
  • high(ARF), high(mdm2) triggers bind(ARF,mdm2)

55
Conclusion of part 2
  • Goal: Automation of hypothesis formation (with
    respect to interactions and pathways)
  • Approach: Viewed known qualitative aspects of
    cell activities as a knowledge base
  • Used knowledge representation language that
  • Can express defaults
  • Allows reasoning with incomplete knowledge
  • Can express reasoning as well as problem solving
    rules
  • Developed a system: BioSigNet-RRH
  • Formalizing and reasoning about hypotheses
  • Illustration: Hypothesizing the role of ARF
    protein in the p53 network.

56
Future Work on Reasoning about Biochemical
Networks (Part I and II)
  • Further development of the language
  • Validation with respect to larger networks
  • Kohn's map
  • Networks in Reactome and other repositories
  • Going from prototype to deployable systems
  • Scaling up challenges
  • Recent advances in automatic planning
  • Integration with BioPAX

57
Part III: CBioC
  • http://cbioc.org

58
Do we have enough knowledge in the various
databases?
  • Some have been curated into databases.
  • But there is much more in the literature.
  • So what do we do?

59
Current status of curation from text
  • About 15 million abstracts in PubMed
  • 3 million published by US and EU researchers
    during 1994-2004 (800 articles per day)
  • 300 K articles published so far reporting
    protein-protein interactions in human, yeast and
    mouse.
  • BIND (in 7 yrs): 23K; DIP: 3K; MINT: 2.4K.

60
Premise: High cost of human curation
  • Overwhelming cost of large curation efforts may
    be unsustainable for long periods
  • BIND: Nov 2005 bad news.
  • Operated for 7 years
  • Listed over 100 curators and programmers
  • CAD 29 million received in 2003, plus other
    funding
  • Curation efforts of AFCS have recently stopped.
  • Lack of funding for some genome annotation
    projects.

61
Premise: summary
  • Human curation of text is expensive.
  • Human curation of text is not scalable.
  • Human curation of text is not sustainable.

62
Why not resort to computers and do automatic
extraction?
  • Lessons from DARPA funded MUCs (message
    understanding conferences) in the 90s, for a decade
    and at the cost of tens of millions of dollars.
  • Getting to 60% recall and precision is quick
  • Then every 5% improvement is about a year's work.
  • Even when we get to 90% for an individual entity
    extraction
  • for recognizing 4 related entities: (0.9)^4 ≈ 0.66
  • Lessons from Biomedical text extraction
  • No proper evaluation.
  • Recognized that recall and precision is not very
    good even in the best systems.
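
The compounding argument is simple arithmetic and can be checked directly. The sketch assumes the entities are recognized independently, which is the assumption implicit in multiplying the per-entity rates; the function name is hypothetical.

```python
def joint_accuracy(per_entity_accuracy, n_entities):
    """Chance that all n entities are right, assuming independent recognition."""
    return per_entity_accuracy ** n_entities


for n in (1, 2, 3, 4):
    print(n, round(joint_accuracy(0.9, n), 4))
```

For four related entities at 90% each the joint figure is about 0.656, so roughly one in three extracted relations would contain at least one wrong entity.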

63
What do we do?
  • How do we curate not only the existing articles,
    but also the future articles?
  • Too important to give up!
  • Need to think of a new way to do it.
  • Faster computers, better sequencing technology
    and better algorithms came to the rescue of the
    Human Genome project.
  • Hmm. What resources are we overlooking?

64
Key Idea
  • If lots of articles are being written then a lot of
    people are writing them and a lot of people are
    reading them.
  • If only we could make these people (the authors
    and the readers) contribute to the curation
    effort
  • Especially the readers the ones who need the
    curated data!

65
Mass collaboration has worked in
  • Wikipedia
  • Project Gutenberg
  • Netflix rating
  • Amazon rating
  • Etc.

66
Mass collaborative curation: initial hurdles
  • An average reader
  • (S)he is not normally interested in filling a
    blank curation form.
  • We can not make an average reader go through
    curation training.
  • So it has to be very different from just making
    the existing curation tools available to the masses
    and expecting them to contribute.

67
Mass collaborative curation: key initial ideas
  • Make it very easy
  • The user need not remember where (which database,
    which web page) to put the curated knowledge.
  • Curation opportunity should present itself
    seamlessly.
  • Curation should not be a burden to an average
    user
  • Make the curated knowledge thin.
  • There should be immediate rewards
  • Do not start with a blank slate.

68
Realization of the key ideas: a biologist with a
gene name
  • Goes to PubMed, types the gene name, clicks on
    one of the abstracts
  • Curation panel presents itself automatically
  • Our approach calls for researchers to contribute
    to the curation of facts as they read and
    research over the web
  • But not with a blank slate
  • No one wants to be the first one!
  • Automatic extraction jump-starts the process, and
    then researchers improve upon the extracted data,
    ironing out inconsistencies by subsequent edits
    on a massive scale.
  • Thin Schemas
  • Average users turned off by traditional wide
    schemas
  • Wide schemas need to be broken down.

69-78
(No Transcript)

79
Summary
  • Information/curation window pops up
    automatically.
  • Automatic extraction is used as a bootstrap so
    that no user is working on a blank slate.
  • Users vote on correctness, make corrections, add
    facts.
  • Suppose 60% precision and recall of automatic
    extraction system
  • A person will have an easier time discarding the 40%
    of wrongly extracted text than identifying the 60% of
    correct entries and entering them!
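
The workload behind this comparison can be worked out from the definitions of precision and recall. The sketch below uses a hypothetical abstract with 100 true facts; the function name and the returned field names are assumptions for illustration.

```python
def curation_workload(true_facts, precision, recall):
    """Counts a reader faces when starting from automatic extraction."""
    correct = round(recall * true_facts)      # true facts the extractor found
    extracted = round(correct / precision)    # everything the extractor proposed
    return {
        "extracted": extracted,
        "correct": correct,
        "wrong_to_discard": extracted - correct,
        "missed_to_add": true_facts - correct,
    }


print(curation_workload(100, 0.6, 0.6))
```

With 60% precision and recall, the reader votes down 40 wrong entries and adds 40 missed ones, instead of finding and typing in all 100 facts from a blank form.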

80
Very useful byproducts
  • Avoids some problems with existing human curation
    approach
  • Curators' bias
  • Curators miss things
  • Curators have disagreements
  • Slow access to newest findings
  • Researchers at large have little or no control
    over what gets curated and when
  • A large curated corpus of text gets created
  • Very useful to evaluate and improve automated
    extraction systems.

81
Current status of CBioC and future plans
  • Basic system, as described, is ready
  • Being populated with
  • Facts from existing databases (BIND etc.)
  • Facts extracted using our extraction system
  • Querying mechanism
  • Answer display
  • Future work
  • Voter confidence issues

82
Conclusion
  • Collecting what is known
  • Reasoning with what is known
  • Hypothesizing what is unknown (based
    on observations)

83
Open Invitation
  • We are building and eager to help other groups
    build knowledge bases in particular domains to
  • Predict impact of interventions
  • Plan (therapy design) to make a pathway behave in
    a desired way
  • Explain observation
  • Hypothesize new knowledge
  • Further improvements to and adaptation of CBioC

84
Acknowledgements
  • BioSigNet
  • Nam Tran, Ph.D thesis on this, Postdoc at Yale
  • Karen Chancellor, Ph.D student
  • Michael Berens and his group (Ana Joy, Nhan Tran)
  • Lokesh Joshi and his group (Vinay Nagraj)
  • CBioC: Graciela Gonzalez, Lian Yu, Luis Tari,
    Tony Gitter, Amanda Ziegler, Ryan Wendt,
    Prabhdeep Singh.
  • Other projects
  • BioQA
  • Biogenenet

85
Thank you!