1
Using Weakly Labeled Data to Learn Models for Extracting Information from Biomedical Text
Mark Craven
Department of Biostatistics & Medical Informatics
Department of Computer Sciences
University of Wisconsin, U.S.A.
craven@biostat.wisc.edu
www.biostat.wisc.edu/craven
2
The Information Extraction Task
Analysis of Yeast PRP20 Mutations and Functional Complementation by the Human Homologue RCC1, a Protein Involved in the Control of Chromosome Condensation
Fleischmann M, Clark M, Forrester W, Wickens M, Nishimoto T, Aebi M
Mutations in the PRP20 gene of yeast show a pleiotropic phenotype, in which both mRNA metabolism and nuclear structure are affected. SRM1 mutants, defective in the same gene, influence the signal transduction pathway for the pheromone response . . . By immunofluorescence microscopy the PRP20 protein was localized in the nucleus. Expression of the RCC1 protein can complement the temperature-sensitive phenotype of PRP20 mutants, demonstrating the functional similarity of the yeast and mammalian proteins.
protein(PRP20)
subcellular-localization(PRP20, nucleus)
3
Motivation
  • assisting in the construction and updating of
    databases
  • providing structured summaries for queries
  • What is known about protein X
    (subcellular/tissue localization, associations
    with diseases, interactions with drugs, ...)?
  • assisting scientific discovery by detecting
    previously unknown relationships, annotating
    experimental data

4
Three Themes in Our IE Research
  • Using weakly labeled training data
  • Representing sentence structure in learned models
  • Combining evidence when making predictions

5
1. Using Weakly Labeled Data
  • why use machine learning methods in building
    information-extraction systems?
  • hand-coding IE systems is expensive,
    time-consuming
  • there is a lot of data that can be leveraged
  • where do we get a training set?
  • by having someone hand-label data (expensive)
  • by coupling tuples in an existing database with
    relevant documents (cheap)

6
Weakly Labeled Training Data
  • to get positive examples, match DB tuples to
    passages of text referencing constants in tuples

[Figure: tuples (P1, L1), (P2, L2), (P3, L3) from the YPD database matched to passages in MEDLINE abstracts that mention the corresponding protein/location pair]
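The matching step can be sketched in a few lines of Python; the tuple and abstract formats here are hypothetical, and a real system would also need synonym dictionaries and name normalization:

  # Minimal weak-labeling sketch: pair database tuples with sentences
  # that mention both constants of the tuple (formats are hypothetical).
  def weakly_label(tuples, abstracts):
      positives = []
      for protein, location in tuples:          # e.g. from YPD
          for sentences in abstracts:           # e.g. MEDLINE abstracts
              for sentence in sentences:
                  text = sentence.lower()
                  # crude matching: both constants in the same sentence
                  if protein.lower() in text and location.lower() in text:
                      positives.append((protein, location, sentence))
      return positives

  # weakly_label([("PRP20", "nucleus")],
  #              [["... the PRP20 protein was localized in the nucleus."]])
  # -> one weakly labeled positive sentence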
7
Weakly Labeled Training Data
  • the labeling is weak in that many sentences with
    co-occurrences wouldn't be considered
    positive examples if we were hand-labeling them
  • consider the sentences associated with the
    relation subcellular-localization(VAC8p, vacuole)
    after weak labeling

8
Learning Context Patterns for Recognizing Protein Names
  • we use AutoSlog [Riloff, 96] to find triggers
    that commonly occur before and after tagged
    proteins in a training corpus
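As a rough illustration of the idea (not AutoSlog itself, which applies syntactic heuristics to parsed text), one can count the words that most often border tagged protein names; the (token, is_protein) input format is an assumption for this sketch:

  from collections import Counter

  # Count words that frequently appear just before/after tagged proteins.
  def context_triggers(sentences, min_count=5):
      """sentences: lists of (token, is_protein) pairs -- hypothetical format"""
      before, after = Counter(), Counter()
      for tokens in sentences:
          for i, (_tok, is_prot) in enumerate(tokens):
              if not is_prot:
                  continue
              if i > 0 and not tokens[i - 1][1]:
                  before[tokens[i - 1][0].lower()] += 1
              if i + 1 < len(tokens) and not tokens[i + 1][1]:
                  after[tokens[i + 1][0].lower()] += 1
      # keep triggers seen at least min_count times
      return ({w for w, c in before.items() if c >= min_count},
              {w for w, c in after.items() if c >= min_count})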

9
Weak Labeling Example
SwissProt dictionary:
... D-AKAP-2, D-amino acid oxidase, D-aspartate oxidase, D-dopachrome tautomerase, DAG kinase zeta, DAMOX, DASOX, DAT, DB83 protein ...
PubMed abstract (protein names tagged with <p>...</p>):
Two distinct forms of oxidases catalysing the oxidative deamination of D-alpha-amino acids have been identified in human tissues: <p>D-amino acid oxidase</p> and <p>D-aspartate oxidase</p>. The enzymes differ in their electrophoretic properties, tissue distribution, binding with flavine adenine dinucleotide, heat stability, molecular size and possibly in subunit structure. Neither enzyme exhibits genetic polymorphism in European populations, but a rare electrophoretic variant phenotype (<p>DASOX</p> 2-1) was identified which suggests that the <p>DASOX</p> locus is autosomal and independent of the <p>DAMOX</p> locus.
10
Protein Name Extraction Approach
11
Experimental Evaluation
  • hypothesis: we get more accurate models by using
    weakly labeled data in addition to manually
    labeled data
  • models use AutoSlog-induced context patterns +
    naïve Bayes on morphological/syntactic features of
    candidate names
  • compare predictive accuracy resulting from
  • a fixed amount of hand-labeled data
  • varying amounts of weakly labeled data +
    hand-labeled data

12
Extraction Accuracy: Yapex Data Set
13
Extraction Accuracy: Texas Data Set
14
2. Representing Sentence Structure in Learned Models
  • hidden Markov models (HMMs) have proven to be
    perhaps the best family of methods for learning
    IE models
  • typically these HMMs have a flat structure, and
    are able to represent relatively little about
    grammatical structure
  • how can we provide HMMs with more information
    about sentence structure?

15
Hidden Markov Models: Example
Pr(emitting "... the Bed1 protein ..." along the state path q1, q4, q2, ...)
16
Hidden Markov Models for Information Extraction
  • there are efficient algorithms for doing the
    following with HMMs
  • determining the likelihood of a sentence given a
    model
  • determining the most likely path through a model
    for a sentence
  • setting the parameters of the model to maximize
    the likelihood of a set of sentences
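For reference, the second of these computations, finding the most likely state path, is the standard Viterbi algorithm; a compact log-space sketch (the dict-based parameterization is an assumption):

  # Viterbi: most likely state path for an observation sequence.
  # log_start[s], log_trans[p][s], log_emit[s][o] are log-probabilities.
  def viterbi(obs, states, log_start, log_trans, log_emit):
      V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
      back = [{}]
      for t in range(1, len(obs)):
          V.append({})
          back.append({})
          for s in states:
              prev, score = max(((p, V[t - 1][p] + log_trans[p][s])
                                 for p in states), key=lambda x: x[1])
              V[t][s] = score + log_emit[s][obs[t]]
              back[t][s] = prev
      last = max(V[-1], key=V[-1].get)
      path = [last]
      for t in range(len(obs) - 1, 0, -1):   # trace back the best path
          path.append(back[t][path[-1]])
      return list(reversed(path)), V[-1][last]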

17
Representing Sentences
  • we first process sentences by analyzing them with
    a shallow parser (Sundance, Riloff et al., 98)

18
Hierarchical HMMs for IE (Part 1)
  • [Ray & Craven, IJCAI 01; Skounakis et al., IJCAI 03]
  • states have types, emit phrases
  • some states have labels (PROTEIN, LOCATION)
  • our models have ≈ 25 states at this level
19
Hierarchical HMMs for IE (Part 2)
[Figure: the positive model and the null model]
20
Hierarchical HMMs for IE (Part 3)
21
Hierarchical HMMs
  • consider emitting ". . . is found in the ER"
[Figure: a hierarchical HMM emitting this sentence; phrase-level states (VP-SEGMENT, PP-SEGMENT, PROTEIN NP-SEGMENT, LOCATION NP-SEGMENT) between START and END emit the phrases "is found", "in", and "the ER", while word-level states (START, BEFORE, LOCATION, BETWEEN, AFTER, ALL, END) emit the individual words]
22
Extraction with our HMMs
  • extract a relation instance if
  • sentence is more probable under positive model
  • Viterbi (most probable) path goes through special
    extraction states

[Figure: a path through the positive model that visits the PROTEIN NP-SEGMENT and LOCATION NP-SEGMENT extraction states]
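A minimal sketch of this decision rule, assuming hypothetical loglik and viterbi_path interfaces on the trained models:

  # Extract a relation instance only if both conditions above hold.
  def extract(phrases, pos_model, null_model, extraction_states):
      # 1. sentence must be more probable under the positive model
      if pos_model.loglik(phrases) <= null_model.loglik(phrases):
          return None
      # 2. the Viterbi path must pass through the extraction states
      path = pos_model.viterbi_path(phrases)   # one state per phrase
      if not extraction_states.issubset(set(path)):
          return None
      # return the phrases emitted from the extraction states
      return [(state, phrase) for state, phrase in zip(path, phrases)
              if state in extraction_states]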
23
Representing More Local Context
  • we can have the word-level states represent more
    about the local context of each emission
  • partition sentence into overlapping trigrams
  • ... the/ART Bed1/UNK protein/N is/COP
    located/V ...

24
Representing More Local Context
  • states emit trigrams, treating each element of a
    trigram as independent given the state
  • we compensate for this naïve independence
    assumption by using a discriminative training
    method [Krogh, 94] to learn parameters
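A sketch of what such a factored emission model could look like; splitting the trigram into left/middle/right components is our reading of the independence assumption, not necessarily the paper's exact parameterization:

  import math

  # Emission log-probability of a trigram, with each element treated
  # as independent given the state (the naive assumption noted above).
  def log_emit_trigram(state, trigram, p_left, p_mid, p_right):
      """trigram = (w_prev, w, w_next); each p_* maps (state, word) -> prob"""
      w_prev, w, w_next = trigram
      return (math.log(p_left[state, w_prev])
              + math.log(p_mid[state, w])
              + math.log(p_right[state, w_next]))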

25
Experimental Evaluation
  • hypothesis: we get more accurate models by using
    a richer representation of sentence structure in
    HMMs
  • compare predictive accuracy of various types of
    representations
  • hierarchical w/context features
  • hierarchical
  • phrases
  • tokens w/part of speech
  • tokens
  • 5-fold cross validation on 3 data sets

26
Weakly Labeled Data Sets for Learning to Extract Relations
  • subcellular_localization(PROTEIN, LOCATION)
  • YPD database
  • 769 positive, 6193 negative sentences
  • 939 tuples (402 distinct)
  • disorder_association(GENE, DISEASE)
  • OMIM database
  • 829 positive, 11685 negative sentences
  • 852 tuples (143 distinct)
  • protein_protein_interaction(PROTEIN, PROTEIN)
  • MIPS database
  • 5446 positive, 41377 negative sentences
  • 8088 tuples (819 distinct)

27
Extraction Accuracy (YPD)
28
Extraction Accuracy (MIPS)
29
Extraction Accuracy (OMIM)
30
3. Combining Evidence when Making Predictions
  • in processing a large corpus, we are likely to
    see the same entities and relations in multiple
    places
  • in making extractions, we should combine evidence
    across the different occurrences/contexts in which
    we see an entity/relation

31
Combining Evidence: Organizing Predictions into Bags
[Figure: a bag of occurrences of the name "CAT", each with an actual and a predicted label: "CAT is a 64-kD protein ...", "the cat activated the mouse ...", "CAT was established to be ...", "... were removed from cat brains."]
32
Combining Evidence when Making Predictions
  • given a bag of predictions, estimate the
    probability that the bag contains at least one
    actual positive example
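One standard estimator for this is a noisy-OR combination, which assumes the per-occurrence predictions are independent; a sketch (the talk's exact method may differ):

  # P(bag contains at least one true positive) under independence:
  # 1 - product over occurrences of (1 - P(occurrence is positive)).
  def bag_positive_prob(occurrence_probs):
      p_none = 1.0
      for p in occurrence_probs:
          p_none *= (1.0 - p)
      return 1.0 - p_none

  # e.g. bag_positive_prob([0.6, 0.3, 0.2]) -> 1 - 0.4*0.7*0.8 = 0.776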

33
Combining Evidence: Estimating Relevant Probabilities
34
Evidence Combination: Protein-Protein Interactions
35
Evidence Combination: Protein Names
36
Conclusions
  • machine learning methods provide a means for
    learning/refining models for information
    extraction
  • learning is inexpensive when unlabeled/weakly
    labeled sources can be exploited
  • learning context patterns for protein names
  • learning HMMs for relation extraction
  • we can learn more accurate models by giving HMMs
    more information about syntactic structure of
    sentences
  • hierarchical HMMs
  • we can improve the precision of our predictions
    by carefully combining evidence across extractions

37
Acknowledgments
  • my graduate students
  • Soumya Ray
  • Burr Settles
  • Marios Skounakis
  • NIH/NLM grant 1R01 LM07050-01
  • NSF CAREER grant IIS-0093016