GeneTUC: Natural Language Understanding in Medical Text

1
GeneTUC: Natural Language Understanding in
Medical Text
  • PhD Defense
  • Rune Sætre
  • June 27th 2006

2
Overview
  • Motivation
  • Thesis Work
  • Overview (Diploma Thesis)
  • Idea (Paper 1 and 2)
  • Bioogle (Paper 3, 4 and 5)
  • GeneTUC (Paper 6)
  • Results, Related Work and Discussion
  • Comments and Questions by Jong C. Park and Eivind
    Hovig

3
Motivation
http://www.ncbi.nlm.nih.gov/PubMed/
4
Motivation
  • Biomedical Researchers publish almost 2000
    abstracts per day in MEDLINE
  • Computers are needed to automatically find all
    (recall), and only (precision), the relevant
    information
  • Future Solution: GeneTUC
  • TUC: The Understanding Computer
  • BusTUC works for Natural Language queries about
    buses in Trondheim
  • GeneTUC uses full-parsing to extract knowledge
    from MEDLINE
  • After parsing the input, GeneTUC can answer
    simple questions about protein and gene
    interactions and other facts from the text

5
Challenge: Medical language
  • Example Input Sentences:
  • Subsequently, activated CREB activates
    transcription of genes essential for proper germ
    cell differentiation.
  • Indeed, Ca2+/calmodulin binds a complex of RGS4
    and a transition state analog of
    Galpha(i1)-GDP-AlF(4)(-).
  • Medical language is not always natural language
  • Complex grammar
  • Invention of new words/names every day

BioCreative1 Example, PMID: 11988318
6
GeneTUC Research Overview
7
Thesis Work
  • GeneTUC: Diploma Work
  • Literature Review: NLU in Medicine
  • GeneTUC: Full-parsing of MEDLINE Abstracts
  • PhD Papers:
  • 1. Unitex: Local Grammars
  • 2. ProtChew: Automatic Protein Name Recognition
  • 3. Alchymoogle: Automatic Entity Annotation
  • 4. gProt: Automatic Protein Interaction Annotation
  • 5. WebProt: Online gProt Experiments
  • 6. GeneTUC: GENIA corpus experiments

8
TUC Introduction
  • Chat-80, Prat-89, HSQL
  • 1991: The Understanding Computer
  • 1996: BusTUC (www.team-trafikk.no)
  • 2000: GeneTUC, diploma project
  • 2001-2006: GeneTUC has been my PhD project

9
GeneTUC System Architecture
  • MEDLINE Abstracts
  • GO: Gene Ontology
  • TUC: The Understanding Computer
  • DB: TQL Database
  • HGNC: HUGO Gene Nomenclature Committee
  • WordNet Ontology

[Diagram: GeneTUC architecture with the components Query, Answer, MEDLINE, TUC, DB, GO, HGNC and WordNet]
10
WordNet 2.0
  • Online lexical reference system
  • Nouns, verbs, adjectives and adverbs
  • Inspired by psycholinguistic theories of human
    lexical memory
  • Organized into synonym sets, each representing
    one underlying lexical concept
  • Different relations link the synonym sets
  • E.g. hypernyms, hyponyms, holonyms, synonyms,
    coordinate terms, domain,

11
Nomenclature, HUGO
  • HUGO Gene Nomenclature Committee
  • Approve a gene name and symbol for each known
    human gene
  • Stored in the Human Gene Nomenclature Database
  • Approved 13,000 symbols (out of an estimated
    20,000-30,000 human genes)
  • Each symbol is unique
  • Each gene is only given one approved gene symbol
  • Similar names used, e.g. in mouse gene research
  • Efforts are made to use a symbol acceptable to
    workers in the field
  • Facilitates electronic data retrieval from
    publications

12
Gene Ontology
  • Heterarchy
  • Molecular Terms
  • Controlled Vocabulary
  • Function, Process and Location

13
GeneTUC Parser
  • Top-Down, left to right
  • Greedy Heuristics
  • Semantic Constraints
  • Interact(Agent RGS4)
  • The rock grows

[Parse tree for "Rgs4 interacts with calmodulin": S, NP (N: Rgs4), VP (V: interacts), PP (P: with, N: calmodulin)]
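To make the parser bullets more concrete, here is a minimal sketch of a top-down, left-to-right parse with one semantic constraint. It is not GeneTUC's actual grammar or code: the tiny lexicon, the `AGENT_CLASSES` table and the function name `parse_sentence` are all hypothetical, and the grammar covers only the pattern N V P N used in the slide's tree.

```python
# Hypothetical toy lexicon: word -> (part of speech, semantic class).
LEXICON = {
    "rgs4": ("N", "protein"),
    "calmodulin": ("N", "protein"),
    "rock": ("N", "thing"),
    "interacts": ("V", None),
    "with": ("P", None),
}

# Assumed semantic constraint: classes allowed as the agent of each verb.
AGENT_CLASSES = {"interacts": {"protein", "gene"}}


def parse_sentence(text):
    """Parse 'N V P N' top-down, left to right. Return a tree or None."""
    tokens = text.lower().rstrip(".").split()
    pos = 0

    def expect(tag):
        # Consume the next token if its part of speech matches `tag`.
        nonlocal pos
        if pos < len(tokens) and LEXICON.get(tokens[pos], (None, None))[0] == tag:
            word = tokens[pos]
            pos += 1
            return (tag, word)
        return None

    # S -> NP VP, NP -> N
    subject = expect("N")
    if subject is None:
        return None
    # VP -> V PP
    verb = expect("V")
    if verb is None:
        return None
    # Semantic constraint: reject agents of the wrong class early.
    if LEXICON[subject[1]][1] not in AGENT_CLASSES.get(verb[1], set()):
        return None
    # PP -> P N
    prep = expect("P")
    obj = expect("N") if prep else None
    if prep is None or obj is None or pos != len(tokens):
        return None
    return ("S", ("NP", subject), ("VP", verb, ("PP", prep, obj)))
```

With this toy lexicon, `parse_sentence("Rgs4 interacts with calmodulin.")` returns a nested tuple mirroring the S/NP/VP/PP tree, while `parse_sentence("rock interacts with calmodulin")` returns `None` because a "thing" is not an allowed agent.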
14
Screenshot Example
  • E: rgs4 interacts with calmodulin.
  • TQL:
  • rgs4 isa protein
  • calmodulin isa protein
  • interact/rgs4/sk(1)
  • srel/with/thing/calmodulin/sk(1)
  • event/real/sk(1)
  • E: calmodulin interacts with cck.
  • TQL:
  • cck isa gene
  • interact/calmodulin/sk(3)
  • srel/with/thing/cck/sk(3)
  • event/real/sk(3)

[Diagram: nodes RGS4, Calmodulin and CCK]
15
Screenshot Example ctd.
  • E: does rgs4 interact with cck?
  • TQL:
  • test(rgs4 isa protein, cck isa gene,
    interact/rgs4/A, srel/with/thing/cck/A,
    event/real/A)
  • Yes
  • A transitive rule:
  • ProteinA interacts with ProteinB and ProteinB
    interacts with ProteinC => ProteinA interacts
    with ProteinC

[Diagram: nodes RGS4, Calmodulin and CCK]
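The question-answering step in the two screenshot slides can be sketched as follows. This is only an illustration under stated assumptions: the tuple encoding of the TQL facts and the helper names `interaction_pairs`, `transitive_closure` and `interacts` are invented here, and the real TUC engine is not reproduced.

```python
# TQL-like facts from the screenshots, encoded as tuples (an assumed encoding).
FACTS = [
    ("interact", "rgs4", "sk(1)"),
    ("srel", "with", "thing", "calmodulin", "sk(1)"),
    ("event", "real", "sk(1)"),
    ("interact", "calmodulin", "sk(3)"),
    ("srel", "with", "thing", "cck", "sk(3)"),
    ("event", "real", "sk(3)"),
]


def interaction_pairs(facts):
    """Join interact/... and srel/with/... facts on their shared event id."""
    agents, partners = {}, {}
    for fact in facts:
        if fact[0] == "interact":
            _, agent, event = fact
            agents[event] = agent
        elif fact[0] == "srel" and fact[1] == "with":
            _, _, _, partner, event = fact
            partners[event] = partner
    return {(agents[e], partners[e]) for e in agents if e in partners}


def transitive_closure(pairs):
    """Apply the rule: A-B and B-C => A-C, until nothing new is added."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def interacts(facts, a, b):
    """Answer 'does a interact with b?' using the transitive rule."""
    return (a, b) in transitive_closure(interaction_pairs(facts))
```

Here `interacts(FACTS, "rgs4", "cck")` is true only because of the transitive rule. The sketch treats interaction as directed; whether the original rule also treats it as symmetric is not stated on the slide.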
16
Dictionary
  • GeneTUC does not perform very well without a
    complete dictionary
  • Current Solution: Bioogle can build a dictionary

17
Bioogle (Paper III)
  • Current ontology: 275 medical terms
  • Connect Unknowns to these Concepts
  • Query syntax:
  • "Unknown is (a|an)"
  • Parse results until a hit is found (or not)
  • Pentagastrin is a synthetic peptide containing
    the five terminal amino acids of gastrin.
  • Result: 104 of 200 terms (52%) were correctly
    classified
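The Bioogle idea above (search for "Unknown is a/an", then scan the hits for a known ontology term) can be sketched like this. The function name `classify_unknown`, the three-term ontology and the plain word lookup are assumptions for illustration; the real system parses the results and uses the 275-term ontology.

```python
import re


def classify_unknown(unknown, snippets, ontology_terms):
    """Return the first ontology term found after '<unknown> is a/an', else None."""
    pattern = re.compile(
        re.escape(unknown) + r"\s+is\s+an?\s+([^.]*)", re.IGNORECASE
    )
    for snippet in snippets:
        match = pattern.search(snippet)
        if not match:
            continue
        # Scan the words following "is a/an" for a known concept.
        for word in re.findall(r"[a-z]+", match.group(1).lower()):
            if word in ontology_terms:
                return word
    return None


# Hypothetical miniature ontology and the slide's example sentence.
ONTOLOGY = {"peptide", "protein", "gene"}
SNIPPETS = [
    "Pentagastrin is a synthetic peptide containing "
    "the five terminal amino acids of gastrin."
]
```

With these inputs, `classify_unknown("Pentagastrin", SNIPPETS, ONTOLOGY)` skips "synthetic" and returns "peptide". A term with no matching snippet yields `None`, which corresponds to the "(or not)" case on the slide.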

18
GeneTUC Ontology
Relations: AKO, Is-A, Has_A
19
Google API Search
  • 1000 queries per user per day
  • Free to use for everybody
  • Can be programmed with SOAP in most languages
  • Simple Object Access Protocol
  • Results are handled automatically
  • Alexa (Amazon) has implemented a similar service
  • $1 per processor hour
  • $1 per gigabyte/year of user storage
  • $1 per 50 gigabytes of data processed
  • $1 per gigabyte uploaded/downloaded

http://news.bbc.co.uk/1/hi/technology/4530978.stm
20
Paper IV: gProt
  • What about protein interactions?
  • Protein Interaction:
  • Protein → Protein
  • BioCreAtIvE1: Protein → Set of GeneOntology Terms
  • Find publicly known interactions for a given
    protein, using Google as the main source for new
    knowledge
  • Query: "proteinX VerbY"
  • Example: "Gastrin activates"
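A rough sketch of the gProt query idea: build quoted "proteinX VerbY" phrases, then read off what follows the phrase in each returned snippet. The helper names `build_queries` and `extract_continuations` and the three-word cutoff are assumptions; the search call itself is left out, so the snippets are supplied by hand.

```python
def build_queries(proteins, verbs):
    """Build one quoted phrase query per (protein, verb) combination."""
    return [f'"{protein} {verb}"' for protein in proteins for verb in verbs]


def extract_continuations(phrase, snippets, max_words=3):
    """Return the first few words following `phrase` in each matching snippet."""
    results = []
    lowered = phrase.lower()
    for snippet in snippets:
        start = snippet.lower().find(lowered)
        if start == -1:
            continue  # snippet does not contain the query phrase
        rest = snippet[start + len(phrase):].split()
        if rest:
            results.append(" ".join(rest[:max_words]))
    return results


# Snippet text taken from the search-result slide; the second is a non-match.
SNIPPETS = [
    "Gastrin activates nuclear factor kappaB (NFkappaB) through a ...",
    "Unrelated text about something else.",
]
```

For example, `extract_continuations("Gastrin activates", SNIPPETS)` gives `["nuclear factor kappaB"]`. Deciding whether such a continuation names a protein would still need a dictionary or name recognizer, which this sketch does not attempt.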

21
Paper IV: gProt
22
  • Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
    Conclusions: Gastrin activates NF kappa B via a PKC dependent
    pathway which involves I kappa B kinase, NF kappa B inducing
    kinase, and TRAF6. ...
    gut.bmjjournals.com/cgi/content/abstract/52/6/813 - Similar pages
  • Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
    gut.bmjjournals.com/cgi/reprint/52/6/813 - Similar pages
  • Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
    BACKGROUND: We previously reported that gastrin induces
    expression of CXC chemokines through activat...
    www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12740336&dopt=Abstract - Similar pages
  • Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
    CONCLUSIONS: Gastrin activates NFkappaB via a PKC dependent
    pathway which involves IkappaB kinase, NFkappaB inducing kinase,
    and TRAF6. MeSH Terms ...
    www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12740336&dopt=Citation - Similar pages
    More results from www.ncbi.nlm.nih.gov
  • Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
    iHOP - Information Hyperlinked over Proteins. Gastrin activates
    nuclear factor kappaB (NFkappaB) through a protein kinase C
    dependent pathway involving ...
    www.pdg.cnb.uam.es/UniPub/iHOP/gp/9705030.html - 7k - Cached - Similar pages
  • Gast - Gastrin precursor
    Gastrin activates rat stomach histidine decarboxylase via
    cholecystokinin-B/gastrin receptors. Abstract-863492. Gastrin
    activated transcription through a ...
    www.pdg.cnb.uam.es/UniPub/iHOP/gg/121191.html - 105k - Cached - Similar pages
    More results from www.pdg.cnb.uam.es
  • Anatomy Physiology Lecture Outlines
    aa. gastrin activates gastric juice secretion gastric smooth
    muscle churning bb. gastrin activates gastroileal reflex which
    moves chyme from ileum to ...
    www.gwc.maricopa.edu/class/bio202/digestlc.htm - 20k - Cached - Similar pages
23
Paper IV gProt
  • Results: 2000 facts

24
Paper V: WebProt
  • Online Implementation, bigger experiment
  • Can Annotate Protein Interactions with 70%
    precision
  • Tested the effect of source filtering:
  • 90% precision, but recall drops to 70%

25
Google as a source
4660 facts total from WebProt
1480
26
WebProt
27
Screenshot
WebProt
28
Paper VI: GeneTUC Results
  • Can parse 60% of test input sentences in the
    GENIA corpus (500 abstracts),
  • With 86% accuracy on the POS-tagging
  • Bracketing Precision and Recall scores of 70.6%
    and 53.9%
  • And answer simple questions about the parsed
    sentences

29
Evalb scores
Paper VI
30
Summary
  • 6 papers describing the steps needed to show that
    GeneTUC can handle medical text
  • A 60% parsing success rate may not be enough for a
    commercial application,
  • But the fact that it improved from just 10% in
    2001 is very promising
  • Once the parsing success-rate is good enough,
    GeneTUC can be tested on Question-Answering
  • There is a need for a good public dataset that
    allows measuring and comparing between different
    QA systems (Future Work)

31
Acknowledgements
  • Biologists
  • Astrid Lægreid, Kamilla Stunes, Kristine Misund,
    Liv Thommesen, Tonje Strømmen Steigedal
  • Computer Scientists
  • Tore Amble, Arne Halaas, Amund Tveit, Martin
    Ranang, Harald Søvik, Yoshimasa Tsuruoka, Anders
    Andenæs, Tor-Kristian Jenssen, Franz Günthner,
    Junichi Tsujii, Jörg Cassens, Waclaw
    Kusnierczyk, Tore Bruland, Peep Küngas, Magnus
    Lie Hetland, Morten Hartman, Hallgeir Bergum, Jo
    Kristian Bergum, Frode Jünge, Heri Ramampiaro,
    Rolv Inge Seehuus, Per Kristian Lehre, Clemens
    Marschner, Petra Maier, Holger Bosk, Sebastian
    Nagel, Mariya Vitusevych, Yoshimasa Tsuruoka,
    Jin-Dong Kim, Hong-Woo Chun, Takashi Ninomiya,
    Yusuke Miyao, Frode Høyvik, Henrik Tveit, Jian Su
    and others

32
Questions and Comments
  • Associate professor Jong C. Park
  • Computer Science Division,
  • Korea Advanced Institute of Science and
    Technology (KAIST),
  • Daejeon, South Korea
  • Professor Eivind Hovig
  • Department of Tumor Biology,
  • Institute for Cancer Research,
  • The Norwegian Radium Hospital

33
Thesis Work
  • GeneTUC Project
  • Use TUC in the Medical Text Domain
  • Use Google (Bioogle) to Recognize Unknown
    Entities
  • Galpha(i1)-GDP-AlF(4)(-), Ca2+, Gastrin
  • Use Google (WebProt) to do Automatic Annotation
  • Mapping (BioCreative)
  • From Gene/Protein → Set of GeneOntology Terms

34
Motivation
  • Natural language is natural
  • Talking computers
  • Voice as input
  • Repetitive tasks should be automated!
  • Information Extraction is trivial, if you know
    what to look for

35
0 GeneTUC Diploma Work
  • NLU Review 2002
  • GENIA HPSG
  • Park et al. CCG-parsing
  • Numbers?

36
Paper I: Local Grammars
  • Maurice Gross
  • There are more than 10^50 ways to build a sentence
    with at most twenty words

Gross (1997). Construction of Local Grammars
37
Paper II: ProtChew
  • Protein Names
  • Galpha(i1)-GDP-AlF(4)(-)
  • Gastrin
  • Idea: Automatic Extraction
  • Based on existing dictionaries and machine
    learning
  • Results?

38
evalb
  • 4 OUTPUT FORMAT FROM THE SCORER
  • The scorer gives individual scores for each
    sentence, for example:

      Sent.                        Matched  Bracket   Cross        Correct Tag
     ID  Len.  Stat. Recal  Prec.  Bracket gold test Bracket Words  Tags Accracy
      1    8    0  100.00 100.00     5      5    5      0      6     5    83.33

  • At the end of the output the Summary section
    gives statistics for all sentences, and for
    sentences <=40 words in length. The summary
    contains the following information:
  • i) Number of sentences -- total number of
    sentences.
  • ii) Number of Error/Skip sentences -- should
    both be 0 if there is no problem with the
    parsed/gold files.
  • iii) Number of valid sentences = Number of
    sentences - Number of Error/Skip sentences
  • iv) Bracketing recall = (number of correct
    constituents) / (number of constituents in the
    goldfile)
  • v) Bracketing precision = (number of correct
    constituents) / (number of constituents in the
    parsed file)
  • vi) Complete match = percentage of sentences
    where recall and precision are both 100%.
  • vii) Average crossing = (number of constituents
    crossing a goldfile constituent) / (number of
    sentences)
  • viii) No crossing = percentage of sentences which
    have 0 crossing brackets.
  • ix) 2 or less crossing = percentage of
    sentences which have <=2 crossing brackets.
  • x) Tagging accuracy = percentage of correct
    POS tags (but see 5.3 for exact
    details of what is counted).
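The bracketing recall and precision definitions above amount to a small computation. The sketch below is not the evalb program itself: it assumes each constituent is given as a (label, start, end) tuple, and the name `bracket_scores` is made up for this example. Details such as punctuation handling and label equivalences are ignored.

```python
from collections import Counter


def bracket_scores(gold, test):
    """Return (matched, recall %, precision %) for two lists of constituents.

    Each constituent is a (label, start, end) tuple; this encoding is an
    assumption made for the sketch.
    """
    gold_counts, test_counts = Counter(gold), Counter(test)
    # Multiset intersection counts each matched bracket at most once.
    matched = sum((gold_counts & test_counts).values())
    recall = 100.0 * matched / len(gold) if gold else 0.0
    precision = 100.0 * matched / len(test) if test else 0.0
    return matched, recall, precision


# Illustrative brackets for "Rgs4 interacts with calmodulin" (invented data).
GOLD = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("PP", 2, 4)]
TEST = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("PP", 1, 4), ("NP", 3, 4)]
```

For this invented pair, three brackets match, so recall is 3/4 and precision is 3/5. A parser that proposes extra brackets loses precision, and one that misses gold brackets loses recall.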

39
Remember
  • Present one paper at a time
  • Summarize results and related work at the end
    as well

Include tables for parsing, comparison with
others, etc. An example of a complex sentence
with the GTB tree. Reference the table. Compare
brackets. Include a WebProt screenshot. Related
work!! PhD presentation: related work. LexiQuest,
40 verbs, what is the F-score? From Tore: Why only
50%? Is it semantics or grammar that makes
50% fail?
40
Dr. Carl-Fredrik Sørensen (50 min; my time is half of that)
  • 5 min: intro, state of the art
  • 5 min: definitions, NLU
  • 10 min: thesis/papers overview and Research
    Questions
  • 15 min: three themes and contributions.
    Evaluation of the work
  • 10 min: future work
  • Proof of concept: it can be implemented. Next
    step?
  • Industry: results are trusted. Academia: results
    are validated through understanding the research
    process.
  • Dennings: proof of concept. Research
    question... soon... Moore's law. Proof of
    performance
  • Shift the work to biologists
  • MEDLINE growth graph. Figure... Everything is
    published.
  • Background:
    http://www.coli.uni-saarland.de/hansu/what_is_cl.html
  • Schopenhauer: imagine how clever a wise man
    would be, if he knew everything in his books.
  • Inter-annotator agreement in gProt: maybe 80
    percent precision is enough?!