

Re-Conceptualizing Literature-Based Discovery
Neil R. Smalheiser March 29, 2008
What is LBD? A strategy for uncovering novel
  • advocated by Don Swanson
  • Magnesium-migraine, fish oil-Raynaud's
  • The key idea is putting together
  • explicit assertions from different papers to form
    new implicit assertions
  • Regardless of how this is done, how the implicit
    assertions are assessed, or whether the implicit
    assertions are correct!

What is LBD? A routine way of life for scientists
  • greatly under-recognized! Not just background
    reading, not just identifying anomalies or
    critical incidents that appear (explicitly) in
    a paper
  • Since 1996: 8 papers with Swanson, 40 without
    (i.e. non-one-node search)
  • 24 are biological (i.e. non-informatics
    modeling); 9/24 = 3/8 > 1/3
  • Proteins in unexpected locations (Molec. Biol.
    Cell, 1996)
  • Expression of reelin in the blood (PNAS, 2000)
  • Reelin and schizophrenia (PNAS 2000)
  • Fluoxetine and neurogenesis (Eur. J. Pharmacol.
  • RNAi and memory (Trends in Neurosci. 2001)
  • Bath toys (New Engl. J. Med. 2003)
  • Dicer and calpain (J. Neurochem. 2005)
  • Exosomal transfer of proteins and RNAs at synapses
    (Biol Direct, 2007)
  • microRNA machinery and regulation by
    phosphorylation (BBA, 2008)

What is LBD? A body of research articles,
software and websites
  • mostly by information scientists and computer
    scientists
  • Mostly concerned with open discovery, or one-node
    searches, which begin with a set of articles A
    that represents a problem
  • Mostly use B-terms present in A to expand the
    search, find disparate lits Ci that share B-terms
    with A
  • Try to find the Ci that is disparate yet most
    similar to A
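
The B-term step described above can be sketched roughly as follows. This is only an illustration, not the method of any particular LBD tool: the tokenizer, the tiny stopword list, the function names and the toy titles are all assumptions made up for the example.

```python
import re

# Hypothetical, very small stopword list; a real system would use a larger one.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "with", "for", "on"}

def title_terms(titles):
    """Return the set of lowercase non-stopword words found in a list of titles."""
    terms = set()
    for title in titles:
        for word in re.findall(r"[a-z][a-z0-9-]*", title.lower()):
            if word not in STOPWORDS:
                terms.add(word)
    return terms

def shared_b_terms(lit_a, lit_c):
    """B-terms are taken here to be the title words that occur in both literatures."""
    return title_terms(lit_a) & title_terms(lit_c)

def rank_candidates(lit_a, candidates):
    """Rank candidate literatures Ci by how many B-terms they share with A."""
    scored = [(name, sorted(shared_b_terms(lit_a, titles)))
              for name, titles in candidates.items()]
    return sorted(scored, key=lambda item: (-len(item[1]), item[0]))

# Toy titles invented for illustration only (not real articles).
lit_a = ["Spreading depression in migraine", "Vascular reactivity and migraine"]
candidates = {
    "magnesium": ["Magnesium inhibits spreading depression",
                  "Magnesium and vascular reactivity"],
    "zinc": ["Zinc in wound healing"],
}
ranking = rank_candidates(lit_a, candidates)
for name, b_terms in ranking:
    print(name, len(b_terms), b_terms)
```

Counting shared terms is the crudest possible similarity; it is used here only to make the A, B, Ci roles concrete.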

What is LBD? Other researchers employ implicit
information too
  • Bioinformatics
  • gene-gene interactions
  • protein-protein interactions
  • web search
  • author disambiguation
  • text mining
  • Yet these are not viewed as examples of LBD for
    some reason!

Has the LBD field stagnated and not fulfilled
its promise?
  • Kostoff critique(s)
  • what is a discovery vs. an innovation
  • argues against frequency based ranking,
  • Uses very high recall, hundreds of discoveries
    claimed per question
  • Swanson's legacy: Sw refs ended 2001!
  • Bork review: Sw refs ended 1996!
  • Few gold standards are available (Mg, fish oil
    worn out)
  • Combinatorial explosion: A-B-C search method
  • Impossible standards for what counts as a LBD
    prediction (never considered, never tested, must
    shatter a paradigm but must be proven)
  • Excluding active approaches other than one-node
    search from counting as LBD

Well, what DO we know about progress in LBD?
  • The two-node search
  • http//
  • Begin with two lits A and C that represent a
    known finding or a hypothesis (estrogen-AD)
  • look for meaningful links
  • (whether or not A and C are disparate)
  • We use B-terms extracted from titles
  • Could use abstracts, MeSH, triples

Modeling the Two-Node Search-1
  • Field testers, free-form use of the tool
  • Chose 6 two-node searches as gold standards: not
    too big or small, disparate, topically coherent,
    clean questions
  • E.g. for A = retinal detachment, C = aortic
    aneurysm: a) find diseases in which both features
    appear (not necessarily in the same person), or b)
    find surgical procedures that have been applied
    to both conditions.
  • Manually marked relevant B-terms for a given
    query (sometimes several queries for the same two
    node search)
  • Details in Bioinformatics (2007) paper

Modeling the Two-Node Search-2
  • Used 8 complementary features to score each
    B-term (e.g. recency, frequency, semantic
  • created a single combined and weighted score for
    each B-term
  • Used logistic regression model to optimally give
    weights to each feature so as to separate marked
    relevant B-terms from all others (mixed set)
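
To make the weighting step concrete, here is a minimal sketch of fitting a logistic regression by plain gradient descent. It is not the model from the Bioinformatics (2007) paper: the two feature columns (labelled recency and frequency), the toy values, the learning rate and the function names are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(features, labels, learning_rate=0.5, epochs=2000):
    """Fit feature weights and a bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict_relevance(weights, bias, x):
    """Estimated probability that a B-term with features x is marked relevant."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Made-up training set: each row is [recency, frequency] scaled to 0..1;
# label 1 means the B-term was manually marked relevant, 0 means it was not.
features = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = fit_logistic(features, labels)
print(weights, bias, predict_relevance(weights, bias, [0.85, 0.7]))
```

The fitted weights play the role of the single combined and weighted score: each B-term's features are collapsed into one probability of being marked relevant.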

Modeling the Two-Node Search-3
Two End-Points of this Research
  • For any two-node search, we can now rank the list
    of B-terms in order of estimated probability that
    they will be marked as relevant (meaningful) by
    SOME user for SOME query.
  • For any pair of lits A and C, we can now estimate
    the OVERALL shared implicit information between A
    and C (# of B-terms that are predicted to be
    relevant)
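
Both end-points can be sketched in a few lines once every B-term has an estimated relevance probability. The sketch reads the overall measure as a count of B-terms predicted to be relevant, which is an interpretation of the slide; the 0.5 threshold, the function names and the probabilities are assumptions invented for the example.

```python
def rank_b_terms(probabilities):
    """Sort B-terms from most to least likely to be marked relevant."""
    return sorted(probabilities.items(), key=lambda item: (-item[1], item[0]))

def shared_implicit_information(probabilities, threshold=0.5):
    """Count the B-terms whose estimated relevance reaches the (assumed) threshold."""
    return sum(1 for p in probabilities.values() if p >= threshold)

# Made-up probabilities for a hypothetical two-node search.
probabilities = {
    "estradiol": 0.91,
    "apolipoprotein": 0.74,
    "patients": 0.08,
    "study": 0.03,
}
ranked = rank_b_terms(probabilities)
overall = shared_implicit_information(probabilities)
print(ranked)
print(overall)
```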

Relevance to the One Node Search
  • We can re-conceptualize the one-node search as a
    series of two-node searches
  • Choose A, then choose category C
  • Divide category C into many small coherent Ci
  • For each Ci, score multi-dimensional features
    Including, but not limited to, features that
    relate A to Ci (e.g. number of B-terms in common
    or predicted relevant B-terms)
  • Rank the Ci to identify the most promising lits
    (which are presumed to point to novel hyps or
    implicit information helpful when applied to A)
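
The decomposition above can be sketched as a loop that runs one two-node evaluation per Ci, so the work grows linearly with the number of sub-literatures. The scoring function is deliberately pluggable; the placeholder scorer, the type alias and the toy titles are assumptions for illustration and not the scoring used in the actual tool.

```python
from typing import Callable, Dict, List, Tuple

Literature = List[str]  # a literature is represented here as a list of article titles

def one_node_search(
    lit_a: Literature,
    category_c: Dict[str, Literature],
    two_node_score: Callable[[Literature, Literature], float],
) -> List[Tuple[str, float]]:
    """Run one two-node evaluation of A against each Ci, then rank the Ci."""
    scores = [(name, two_node_score(lit_a, lit_ci))
              for name, lit_ci in category_c.items()]
    return sorted(scores, key=lambda item: (-item[1], item[0]))

def count_shared_words(lit_a: Literature, lit_c: Literature) -> float:
    """Placeholder two-node score: number of distinct title words shared by A and Ci."""
    words_a = {w for title in lit_a for w in title.lower().split()}
    words_c = {w for title in lit_c for w in title.lower().split()}
    return float(len(words_a & words_c))

# Toy titles invented for illustration only.
lit_a = ["huntingtin aggregates impair autophagy"]
category_c = {
    "autophagy": ["autophagy clears protein aggregates"],
    "exercise": ["exercise improves mood"],
}
ranking = one_node_search(lit_a, category_c, count_shared_words)
print(ranking)
```

Any richer two-node score, for example a count of B-terms predicted to be relevant, could be passed in place of `count_shared_words` without changing the loop.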

A is evaluated pairwise against C
  • C1 might involve B-terms; C2 might not! C3, C4, ...
  • e.g. A = Huntington Disease; C = lifestyle
    factors, autophagy, or therapeutic agents

Interestingness Measures
  • Field of data mining.
  • This allows us to encode real-life priorities and
    strategies of working scientists
  • Existing one-node search looks for novelty,
    relevance, non-triviality, likelihood of being
    true: gets the low-hanging fruit
  • What about actionability, feasibility of
    follow-up, surprisingness, cross-discipline,
    presence of high experimental support,
    generalizability to other problems, or high
    potential impact?
  • A candidate Ci could be interesting because it is
    recently discovered and rapidly growing (e.g.
    microRNAs); well characterized; for a disease,
    has an animal model; for a protein, is connected
    to many other proteins; for a drug, has FDA
  • This not only re-conceptualizes the one-node
    search (e.g., no combinatorial explosion) but
    also generalizes the ranking methods.
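
One simple way to fold such measures into the ranking is a weighted sum over named features. This is only a sketch of the idea: the feature names, the weights and the candidate values below are hypothetical, and a weighted sum is just one of many possible ways to combine interestingness measures.

```python
def interestingness(features, weights):
    """Weighted sum of feature values; features missing from a candidate count as 0."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)

def rank_by_interestingness(candidates, weights):
    """Rank candidate literatures Ci by their combined interestingness score."""
    scored = [(name, interestingness(features, weights))
              for name, features in candidates.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# Hypothetical weights and feature values, all scaled to 0..1.
weights = {"b_term_score": 0.5, "rapid_growth": 0.3, "experimental_support": 0.2}
candidates = {
    "microRNAs": {"b_term_score": 0.4, "rapid_growth": 1.0, "experimental_support": 0.5},
    "older topic": {"b_term_score": 0.6, "rapid_growth": 0.0, "experimental_support": 0.5},
}
ranking = rank_by_interestingness(candidates, weights)
print(ranking)
```

In this toy case the rapidly growing candidate outranks the one with the higher B-term score, which is the kind of priority a pure B-term ranking cannot express.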

Gold Standards for One-Node Searches
  • Also, we can now envision preparing a series of
    gold standard searches, even automatically (cf.
    TREC 2006, 2007).
  • Use implicit assertions to reconstruct explicit
  • Use review articles
  • lists (e.g. in virus study, gold standard was a
    list of viruses that were thought to be at risk
    of being exploited for biological warfare).
  • time slices
  • Avoids the paradox that one node searches must
    predict things that have no experimental support!
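
The time-slice idea can be sketched as follows, under one common reading: predictions are generated from the literature up to a cutoff year, and the gold standard is the set of A-C pairs that first co-occur only after that year. The record format, the function names and the placeholder terms are assumptions for illustration.

```python
def time_slice_gold_standard(cooccurrences, cutoff_year):
    """Return the (a, c) pairs whose first recorded co-occurrence is after the cutoff.

    cooccurrences: iterable of (term_a, term_c, year) records (assumed format).
    """
    first_seen = {}
    for term_a, term_c, year in cooccurrences:
        pair = (term_a, term_c)
        if pair not in first_seen or year < first_seen[pair]:
            first_seen[pair] = year
    return {pair for pair, year in first_seen.items() if year > cutoff_year}

def recall(predicted_pairs, gold_pairs):
    """Fraction of gold-standard pairs that appear among the predictions."""
    if not gold_pairs:
        return 0.0
    return len(set(predicted_pairs) & gold_pairs) / len(gold_pairs)

# Placeholder records, not real data.
records = [
    ("disease_x", "agent_1", 1995),
    ("disease_x", "agent_2", 2003),
    ("disease_x", "agent_2", 2005),
    ("disease_x", "agent_3", 2006),
]
gold = time_slice_gold_standard(records, cutoff_year=2000)
predicted = [("disease_x", "agent_2"), ("disease_x", "agent_4")]
print(sorted(gold), recall(predicted, gold))
```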

  • LBD is (can be, will be) alive and well!
  • Need to incorporate the types of real-life
    priorities and strategies of working scientists
  • Re-conceptualize the one node search as a series
    of two-node searches
  • Use interestingness measures to supplement
    B-term measures.

Journal of Biomedical Discovery and Collaboration
  • Unique multi-disciplinary audience
  • People who engage in scientific discovery and
    collaboration
  • People who make tools that enhance scientific
    discovery and collaboration
  • People who study scientific discovery and
    collaboration
  • Hosted by Biomed Central
  • Fully peer-reviewed
  • RAPID review (<3 weeks is routine)
  • Open-access, indexed in PubMed Central et al.
  • Readership goes up 10-100-fold
  • Impact goes up too
  • Article fee reduced or zeroed depending on

  • Don Swanson
  • Vetle Torvik
  • Wei Zhou (Clement Yu)
  • Marc Weeber

  • Should LBD analyses be user-friendly? Popular??
  • Don't they overlook true divergent discoveries?
  • Should LBD be run automatically as a program in
    the background, with alerts of possible
  • Does LBD bypass, or reinforce, good old-fashioned
    hypothesis-driven science?