1
Re-Conceptualizing Literature-Based Discovery
Neil R. Smalheiser, March 29, 2008
2
What is LBD? A strategy for uncovering novel hypotheses

Advocated by Don Swanson
Examples: magnesium-migraine, fish oil-Raynaud's
The key idea is putting together explicit assertions from different papers to form new implicit assertions
Regardless of how this is done, how the implicit assertions are assessed, or whether the implicit assertions are correct!

3
What is LBD? A routine way of life for scientists

Greatly under-recognized! Not just background reading, not just identifying anomalies or critical incidents that appear (explicitly) in a paper
Since 1996: 8 papers with Swanson, 40 without (i.e. non-one-node search)
24 are biological (i.e. non-informatics modeling); 9/24 = 3/8 > 1/3
Proteins in unexpected locations (Molec. Biol. Cell, 1996)
Expression of reelin in the blood (PNAS, 2000)
Reelin and schizophrenia (PNAS, 2000)
Fluoxetine and neurogenesis (Eur. J. Pharmacol., 2001)
RNAi and memory (Trends in Neurosci., 2001)
Bath toys (New Engl. J. Med., 2003)
Dicer and calpain (J. Neurochem., 2005)
Exosomal transfer of proteins and RNAs at synapses (Biol Direct, 2007)
microRNA machinery and regulation by phosphorylation (BBA, 2008)

4
What is LBD? A body of research articles, software and websites

Mostly by information scientists and computer scientists
Mostly concerned with open discovery or one-node searches: begin with a set of articles A that represents a problem
Mostly use B-terms present in A to expand the search and find disparate literatures Ci that share B-terms with A
Try to find the Ci that is disparate yet most similar to A

5
What is LBD? Other researchers employ implicit information too

Bioinformatics
gene-gene interactions
protein-protein interactions
web search
author disambiguation
text mining
Yet these are not viewed as examples of LBD for
some reason!

6
Has the LBD field stagnated and not fulfilled its promise?

Kostoff critique(s)
What is a discovery vs. an innovation
Argues against frequency-based ranking
Uses very high recall, hundreds of discoveries claimed per question
Swanson's legacy: Sw refs ended 2001!
Bork review: refs to Sw ended 1996!
Few gold standards are available (Mg, fish oil worn out)
Combinatorial explosion: A-B-C search method
Impossible standards for what counts as an LBD prediction (never considered, never tested, must shatter a paradigm but must be proven experimentally??)
Excluding active approaches other than one-node search from being counted as LBD

7
Well, what DO we know about progress in LBD?

The two-node search
http://arrowsmith.psych.uic.edu
Begin with two literatures A and C that represent a known finding or a hypothesis (e.g. estrogen-AD)
Look for meaningful links (whether or not A and C are disparate)
We use B-terms extracted from titles
Could use abstracts, MeSH, triples
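
The title-based step can be illustrated with a minimal sketch. It assumes B-terms are simply single non-stopword words that occur in at least one title from each literature; the stopword list, function names, and toy titles are all invented for illustration, and the actual Arrowsmith tool does considerably more filtering.

```python
import re

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "with", "for", "to"}

def title_terms(title):
    """Lowercase a title and return its set of non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return {t for t in tokens if t not in STOPWORDS}

def shared_b_terms(titles_a, titles_c):
    """Return terms that occur in at least one A title and one C title."""
    terms_a = set().union(*(title_terms(t) for t in titles_a))
    terms_c = set().union(*(title_terms(t) for t in titles_c))
    return terms_a & terms_c

# Toy titles, made up for this sketch (not real articles).
titles_a = ["Estrogen and oxidative stress in neurons", "Estrogen receptor signaling"]
titles_c = ["Oxidative stress in Alzheimer disease", "Amyloid and receptor binding"]
b_terms = shared_b_terms(titles_a, titles_c)
```

Swapping titles for abstracts or MeSH headings would only change what is passed to `title_terms`; the intersection step stays the same.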

8
(No Transcript)
9
Modeling the Two-Node Search-1

Field testers, free-form use of the tool
Chose 6 two-node searches as gold standards: not too big or small, disparate, topically coherent, clean questions
E.g. for A = retinal detachment, C = aortic aneurysm: a) find diseases in which both features appear (not necessarily in the same person) or b) find surgical procedures that have been applied to both conditions.
Manually marked relevant B-terms for a given query (sometimes several queries for the same two-node search)
Details in Bioinformatics (2007) paper

10
Modeling the Two-Node Search-2

Used 8 complementary features to score each B-term (e.g. recency, frequency, semantic categories)
Created a single combined and weighted score for each B-term
Used a logistic regression model to optimally weight each feature so as to separate marked relevant B-terms from all others (mixed set)
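
As a rough sketch of the weighting step, the code below fits a logistic regression by plain gradient descent. It is not the model from the Bioinformatics paper: it uses only two made-up features (recency and frequency, scaled to 0..1) instead of eight, and the learning rate, epoch count, and toy labels are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on log loss."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def relevance_probability(weights, x):
    """Combined, weighted score for one B-term, expressed as a probability."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

# Toy data: [recency, frequency] per B-term; 1 = marked relevant (invented values).
rows = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
labels = [1, 1, 1, 0, 0, 0]
weights = fit_logistic(rows, labels)
```

Calling `relevance_probability(weights, features)` on an unseen B-term then gives the single combined score the slide describes.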

11
Modeling the Two-Node Search-3
12
Two End-Points of this Research

For any two-node search, we can now rank the list of B-terms in order of estimated probability that they will be marked as relevant (meaningful) by SOME user for SOME query.
For any pair of literatures A and C, we can now estimate the OVERALL shared implicit information between A and C (# of B-terms that are predicted to be relevant)
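
Both end-points can be sketched in a few lines, assuming each B-term already has an estimated relevance probability. The terms and probabilities below are made up, and the 0.5 cutoff for "predicted to be relevant" is an assumption; the sum of probabilities is shown as one alternative summary.

```python
# Hypothetical B-terms with made-up relevance probabilities, standing in for model output.
b_term_probs = {
    "oxidative stress": 0.82,
    "receptor": 0.35,
    "apoptosis": 0.67,
    "patients": 0.04,
}

def rank_b_terms(probs):
    """Sort B-terms from most to least likely to be marked relevant."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)

def predicted_relevant_count(probs, threshold=0.5):
    """Count B-terms whose probability meets a cutoff (the cutoff is an assumption)."""
    return sum(1 for p in probs.values() if p >= threshold)

def expected_relevant_count(probs):
    """Alternative summary: the sum of probabilities, i.e. the expected count."""
    return sum(probs.values())

ranked_terms = rank_b_terms(b_term_probs)
shared_information = predicted_relevant_count(b_term_probs)
```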

13
Relevance to the One Node Search

We can re-conceptualize the one-node search as a series of two-node searches
Choose A, then choose category C
Divide category C densely into many small coherent Ci
For each Ci, score multi-dimensional features
Including, but not limited to, features that relate A to Ci (e.g. number of B-terms in common or predicted relevant B-terms)
Rank the Ci to identify the most promising literatures (which are presumed to point to novel hypotheses or implicit information helpful when applied to A)
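
The steps above can be sketched as a loop of pairwise comparisons. This is a simplified illustration that scores each Ci by a single feature, the number of B-terms shared with A; the term sets and names are invented, and a fuller version would score several features per Ci.

```python
def rank_candidate_literatures(terms_a, candidate_terms):
    """Score each Ci by the number of B-terms it shares with A, highest first."""
    scored = []
    for name, terms_ci in candidate_terms.items():
        shared = terms_a & terms_ci
        scored.append((name, len(shared), sorted(shared)))
    # Sort by descending count, then by name so ties are ordered deterministically.
    scored.sort(key=lambda row: (-row[1], row[0]))
    return scored

# Toy term sets (invented for illustration).
terms_a = {"aggregation", "mitochondria", "oxidative stress", "striatum"}
candidate_terms = {
    "C1": {"aggregation", "lysosome", "mitochondria"},
    "C2": {"exercise", "diet"},
    "C3": {"oxidative stress", "diet"},
}
ranking = rank_candidate_literatures(terms_a, candidate_terms)
```

The work grows linearly with the number of Ci, since each one is compared with A exactly once rather than expanded through every B-term.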

14
A is evaluated pairwise against each Ci in C
C1 might involve B-terms; C2 might not! C3, C4, ...
e.g. A = Huntington Disease; C = lifestyle factors, autophagy, or therapeutic agents
15
Interestingness Measures

A concept from the field of data mining.
This allows us to encode real-life priorities and strategies of working scientists
Existing one-node search looks for novelty, relevance, non-triviality, likelihood of being true ... gets the low-hanging fruit
What about actionability, feasibility of follow-up, surprisingness, cross-discipline, presence of high experimental support, generalizability to other problems, or high potential impact?
A candidate Ci could be interesting because it is recently discovered and rapidly growing (e.g. microRNAs), is well characterized, or (for a disease) has an animal model, (for a protein) is connected to many other proteins, or (for a drug) has FDA approval.
This not only re-conceptualizes the one-node search (e.g., no combinatorial explosion) but also generalizes the ranking methods.
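
One way to picture how interestingness measures could supplement a B-term measure is a weighted sum over per-candidate features. The feature names, weights, and candidates below are hypothetical, chosen only to show the mechanics; the slides do not specify a formula.

```python
# Hypothetical weights; in practice they would reflect a scientist's priorities.
WEIGHTS = {
    "shared_b_terms": 0.5,     # B-term measure relating A to Ci, scaled to 0..1
    "rapidly_growing": 0.2,
    "has_animal_model": 0.15,
    "fda_approved": 0.15,
}

def interestingness(features, weights=WEIGHTS):
    """Weighted sum over whichever features are present; missing ones count as 0."""
    return sum(w * float(features.get(name, 0.0)) for name, w in weights.items())

# Toy candidates with invented feature values.
candidates = {
    "microRNAs": {"shared_b_terms": 0.4, "rapidly_growing": 1},
    "approved drug X": {"shared_b_terms": 0.4, "fda_approved": 1},
    "generic topic": {"shared_b_terms": 0.5},
}
ranked = sorted(candidates, key=lambda c: interestingness(candidates[c]), reverse=True)
```

Here "generic topic" has the strongest B-term overlap yet ranks last, which is the point of adding measures beyond B-terms.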

16
Gold Standards for One-Node Searches

Also, we can now envision preparing a series of gold standard searches, even automatically (cf. TREC 2006, 2007).
Use implicit assertions to reconstruct explicit knowledge.
Use review articles
Use lists (e.g. in the virus study, the gold standard was a list of viruses that were thought to be at risk of being exploited for biological warfare).
Use time slices
Avoids the paradox that one-node searches must predict things that have no experimental support!
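
A time-slice gold standard can be sketched as follows, under one common reading of the idea: make predictions from articles published before a cutoff year, then treat links that first appear explicitly after the cutoff as the gold standard. The record format, function name, and toy data are assumptions made for this illustration.

```python
def time_slice_eval(records, term_a, cutoff_year, predicted):
    """Check predictions made from pre-cutoff data against later co-occurrences.

    records: iterable of (year, set_of_terms) pairs, one per article.
    predicted: terms proposed as linked to term_a using only pre-cutoff articles.
    """
    before, after = set(), set()
    for year, terms in records:
        if term_a in terms:
            (before if year < cutoff_year else after).update(terms - {term_a})
    gold = after - before            # links that first appear after the cutoff
    hits = set(predicted) & gold
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return gold, precision, recall

# Toy records (invented years and terms).
records = [
    (1995, {"disease X", "gene 1"}),
    (2001, {"disease X", "gene 2"}),
    (2003, {"disease X", "gene 1", "drug 3"}),
]
gold, precision, recall = time_slice_eval(records, "disease X", 2000, ["gene 2", "gene 4"])
```

Because the gold standard is built from knowledge that later became explicit, a prediction can be scored as correct even though it had experimental support by the time of evaluation.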

17
Conclusions

LBD is (can be, will be) alive and well!
Need to incorporate the types of real-life priorities and strategies of working scientists
Re-conceptualize the one-node search as a series of two-node searches
Use interestingness measures to supplement B-term measures.

18
(No Transcript)
19
Journal of Biomedical Discovery and Collaboration

Unique multi-disciplinary audience
People who engage in scientific discovery and collaboration
People who make tools that enhance scientific discovery and collaboration
People who study scientific discovery and collaboration
Hosted by BioMed Central
Fully peer-reviewed
RAPID review (<3 weeks is routine)
Open-access, indexed in PubMed Central et al.
Readership goes up 10-100-fold
Impact goes up too
Article fee reduced or zeroed depending on institution

20
Acknowledgements

Don Swanson
Vetle Torvik
Wei Zhou (Clement Yu)
Marc Weeber

21
Ruminations

Should LBD analyses be user-friendly? Popular?? Don't they overlook true divergent discoveries?
Should LBD be run automatically as a program in the background, with alerts of possible discoveries?
Does LBD bypass, or reinforce, good old-fashioned hypothesis-driven science?
