Title: Using Weakly Labeled Data to Learn Models for Extracting Information from Biomedical Text
Slide 1: Using Weakly Labeled Data to Learn Models for Extracting Information from Biomedical Text

Mark Craven
Department of Biostatistics and Medical Informatics / Department of Computer Sciences
University of Wisconsin, U.S.A.
craven@biostat.wisc.edu
www.biostat.wisc.edu/craven
Slide 2: The Information Extraction Task

Analysis of Yeast PRP20 Mutations and Functional Complementation by the Human Homologue RCC1, a Protein Involved in the Control of Chromosome Condensation
Fleischmann M, Clark M, Forrester W, Wickens M, Nishimoto T, Aebi M

Mutations in the PRP20 gene of yeast show a pleiotropic phenotype, in which both mRNA metabolism and nuclear structure are affected. SRM1 mutants, defective in the same gene, influence the signal transduction pathway for the pheromone response... By immunofluorescence microscopy the PRP20 protein was localized in the nucleus. Expression of the RCC1 protein can complement the temperature-sensitive phenotype of PRP20 mutants, demonstrating the functional similarity of the yeast and mammalian proteins.

Extracted facts:
protein(PRP20)
subcellular-localization(PRP20, nucleus)
Slide 3: Motivation

- assisting in the construction and updating of databases
- providing structured summaries for queries: what is known about protein X (subcellular/tissue localization, associations with diseases, interactions with drugs, ...)?
- assisting scientific discovery by detecting previously unknown relationships, annotating experimental data
Slide 4: Three Themes in Our IE Research
- Using weakly labeled training data
- Representing sentence structure in learned models
- Combining evidence when making predictions
Slide 5: 1. Using Weakly Labeled Data

- why use machine learning methods in building information-extraction systems?
  - hand-coding IE systems is expensive and time-consuming
  - there is a lot of data that can be leveraged
- where do we get a training set?
  - by having someone hand-label data (expensive)
  - by coupling tuples in an existing database with relevant documents (cheap)
Slide 6: Weakly Labeled Training Data

- to get positive examples, match DB tuples to passages of text referencing the constants in the tuples

[Figure: tuples (P1, L1), (P2, L2), (P3, L3) from the YPD database are matched to MEDLINE abstracts containing the corresponding protein/location pairs]
Slide 7: Weakly Labeled Training Data

- the labeling is weak in that many sentences with co-occurrences wouldn't be considered positive examples if we were hand-labeling them
- consider the sentences associated with the relation subcellular-localization(VAC8p, vacuole) after weak labeling
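The tuple-matching step on the previous slide can be sketched as simple co-occurrence labeling. The tuples and sentences below are illustrative, not from the actual YPD/MEDLINE corpora:

```python
# Weak labeling sketch: a sentence becomes a positive example for
# subcellular-localization(P, L) if it mentions both constants of some
# database tuple; everything else is (weakly) labeled negative.
def weak_label(sentences, tuples):
    """Split sentences into (positives, negatives) by tuple co-occurrence."""
    positives, negatives = [], []
    for sent in sentences:
        text = sent.lower()
        if any(p.lower() in text and l.lower() in text for p, l in tuples):
            positives.append(sent)
        else:
            negatives.append(sent)
    return positives, negatives

tuples = [("VAC8p", "vacuole")]          # one DB tuple, as on the slide
sentences = [
    "Vac8p is localized to the vacuole membrane.",
    "Deletion of VAC8 has no obvious phenotype.",
]
pos, neg = weak_label(sentences, tuples)
```

As the slide notes, such labels are noisy: a sentence can mention both constants without actually asserting the relation.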
Slide 8: Learning Context Patterns for Recognizing Protein Names

- we use AutoSlog [Riloff 96] to find triggers that commonly occur before and after tagged proteins in a training corpus
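The core counting step behind trigger discovery can be sketched as below. Real AutoSlog patterns are richer syntactic frames; this toy version, with an invented `PROT` placeholder tag, only tallies the tokens immediately adjacent to tagged protein mentions:

```python
from collections import Counter

# Count the tokens that immediately precede and follow tagged protein
# names in a training corpus; frequent neighbours are trigger candidates.
def context_triggers(tagged_sentences, tag="PROT"):
    before, after = Counter(), Counter()
    for tokens in tagged_sentences:
        for i, tok in enumerate(tokens):
            if tok == tag:
                if i > 0:
                    before[tokens[i - 1]] += 1
                if i + 1 < len(tokens):
                    after[tokens[i + 1]] += 1
    return before, after

corpus = [
    "expression of PROT was induced".split(),
    "the PROT protein localizes to the nucleus".split(),
]
before, after = context_triggers(corpus)
```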
Slide 9: Weak Labeling Example

SwissProt dictionary:
... D-AKAP-2, D-amino acid oxidase, D-aspartate oxidase, D-dopachrome tautomerase, DAG kinase zeta, DAMOX, DASOX, DAT, DB83 protein ...

PubMed abstract:
Two distinct forms of oxidases catalysing the oxidative deamination of D-alpha-amino acids have been identified in human tissues: <p>D-amino acid oxidase</p> and <p>D-aspartate oxidase</p>. The enzymes differ in their electrophoretic properties, tissue distribution, binding with flavin adenine dinucleotide, heat stability, molecular size and possibly in subunit structure. Neither enzyme exhibits genetic polymorphism in European populations, but a rare electrophoretic variant phenotype (<p>DASOX</p> 2-1) was identified which suggests that the <p>DASOX</p> locus is autosomal and independent of the <p>DAMOX</p> locus.
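The dictionary-based tagging illustrated above can be sketched as a longest-match substring tagger. The dictionary and text are abbreviated from the slide, and the `<p>...</p>` markup mirrors the slide's tags; the matching strategy is an assumption, not necessarily the system's exact procedure:

```python
import re

# Tag every abstract substring that exactly matches a dictionary entry.
def tag_names(text, dictionary):
    # try longer names first so "D-amino acid oxidase" beats shorter entries
    names = sorted(dictionary, key=len, reverse=True)
    pattern = "|".join(re.escape(n) for n in names)
    return re.sub(f"({pattern})", r"<p>\1</p>", text)

dictionary = ["D-amino acid oxidase", "D-aspartate oxidase", "DASOX", "DAMOX"]
text = "the enzymes D-amino acid oxidase and DASOX differ in tissue distribution"
tagged = tag_names(text, dictionary)
```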
Slide 10: Protein Name Extraction Approach
Slide 11: Experimental Evaluation

- hypothesis: we get more accurate models by using weakly labeled data in addition to manually labeled data
- models use AutoSlog-induced context patterns + naïve Bayes on morphological/syntactic features of candidate names
- compare predictive accuracy resulting from
  - a fixed amount of hand-labeled data
  - varying amounts of weakly labeled data + hand-labeled data
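A minimal sketch of the naïve Bayes component, assuming a small invented set of binary morphological features and toy training tokens (the paper's actual feature set is richer):

```python
import math
from collections import defaultdict

# Binary morphological features of a candidate name (illustrative set).
def features(token):
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_upper": any(c.isupper() for c in token),
        "has_hyphen": "-" in token,
    }

class NaiveBayes:
    """Naive Bayes with Laplace smoothing over independent binary features."""
    def __init__(self):
        # counts[label][feature] = [count(False)+1, count(True)+1]
        self.counts = {True: defaultdict(lambda: [1, 1]),
                       False: defaultdict(lambda: [1, 1])}
        self.n = {True: 0, False: 0}

    def fit(self, examples):
        for token, label in examples:
            self.n[label] += 1
            for f, v in features(token).items():
                self.counts[label][f][v] += 1

    def log_odds(self, token):
        # log P(protein | x) / P(not-protein | x) under feature independence
        score = math.log(self.n[True] / self.n[False])
        for f, v in features(token).items():
            p_t = self.counts[True][f][v] / (self.n[True] + 2)
            p_f = self.counts[False][f][v] / (self.n[False] + 2)
            score += math.log(p_t / p_f)
        return score

nb = NaiveBayes()
nb.fit([("Bed1", True), ("PRP20", True), ("RCC1", True),
        ("nucleus", False), ("yeast", False), ("protein", False)])
```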
Slide 12: Extraction Accuracy (Yapex Data Set)

Slide 13: Extraction Accuracy (Texas Data Set)
Slide 14: 2. Representing Sentence Structure in Learned Models

- hidden Markov models (HMMs) have proven to be perhaps the best family of methods for learning IE models
- typically these HMMs have a flat structure, and are able to represent relatively little about grammatical structure
- how can we provide HMMs with more information about sentence structure?
Slide 15: Hidden Markov Models Example

Pr("... the Bed1 protein ...", q1, q4, q2, ...): the joint probability that the model emits the words while following the state path q1, q4, q2, ...
Slide 16: Hidden Markov Models for Information Extraction

- there are efficient algorithms for doing the following with HMMs:
  - determining the likelihood of a sentence given a model (the forward algorithm)
  - determining the most likely path through a model for a sentence (the Viterbi algorithm)
  - setting the parameters of the model to maximize the likelihood of a set of sentences (Baum-Welch/EM)
Slide 17: Representing Sentences

- we first process sentences by analyzing them with a shallow parser (Sundance; Riloff et al., 98)
Slide 18: Hierarchical HMMs for IE (Part 1)

- [Ray and Craven, IJCAI 01; Skounakis et al., IJCAI 03]
- states have types and emit phrases
- some states have labels (PROTEIN, LOCATION)
- our models have ~25 states at this level
Slide 19: Hierarchical HMMs for IE (Part 2)

[Figure: the positive model and the null model]

Slide 20: Hierarchical HMMs for IE (Part 3)
Slide 21: Hierarchical HMMs

[Figure: a hierarchical HMM processing the sentence "... is found in the ER". At the phrase level, a path runs from START through a VP-SEGMENT ("is found"), a PP-SEGMENT ("in"), and a LOCATION NP-SEGMENT ("the ER") to END; PROTEIN NP-SEGMENT states appear in the same topology. A positional level distinguishes BEFORE, BETWEEN, and AFTER contexts around the labeled states, with an ALL state covering the rest]
Slide 22: Extraction with our HMMs

- extract a relation instance if
  - the sentence is more probable under the positive model than under the null model
  - the Viterbi (most probable) path goes through the special extraction states

[Figure: a path through the positive model visiting the PROTEIN NP-SEGMENT and LOCATION NP-SEGMENT extraction states]
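The two-part decision rule above can be sketched as follows. `StubModel` is a hypothetical stand-in for the trained positive and null HMMs, with toy scoring so the example runs end to end:

```python
# Toy stand-in for a trained HMM: it "recognizes" a few words and maps
# them to extraction states; everything else is OTHER.
class StubModel:
    def __init__(self, state_of):            # word -> extraction state
        self.state_of = state_of
    def loglik(self, tokens):
        # toy score: count words the model recognizes
        return sum(1 for w in tokens if w in self.state_of)
    def viterbi(self, tokens):
        # toy Viterbi path: tag recognized words, everything else OTHER
        return [self.state_of.get(w, "OTHER") for w in tokens]

def extract(tokens, pos_model, null_model):
    """Extract a (protein, location) pair only if both tests pass."""
    if pos_model.loglik(tokens) <= null_model.loglik(tokens):
        return None                          # null model wins: no extraction
    path = pos_model.viterbi(tokens)
    if "PROTEIN" in path and "LOCATION" in path:
        return (tokens[path.index("PROTEIN")], tokens[path.index("LOCATION")])
    return None

pos_model = StubModel({"PRP20": "PROTEIN", "nucleus": "LOCATION"})
null_model = StubModel({})
result = extract("PRP20 localizes to the nucleus".split(), pos_model, null_model)
```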
Slide 23: Representing More Local Context

- we can have the word-level states represent more about the local context of each emission
- partition the sentence into overlapping trigrams:
  ... the/ART Bed1/UNK protein/N is/COP located/V ...
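The overlapping-trigram partition can be sketched as below: each token is emitted together with its left and right neighbours, padded at the sentence boundaries (the padding symbol is an assumption):

```python
# Build overlapping trigrams of (token, tag) pairs for a tagged sentence.
def trigrams(tagged_tokens, pad=("<s>", "PAD")):
    padded = [pad] + tagged_tokens + [pad]
    # one (left, current, right) window per original token
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

tokens = [("the", "ART"), ("Bed1", "UNK"), ("protein", "N")]
tris = trigrams(tokens)
```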
Slide 24: Representing More Local Context

- states emit trigrams; the emission probability factors as a product over the trigram's components (an independence assumption)
- we compensate for this naïve assumption by using a discriminative training method [Krogh 94] to learn the parameters
Slide 25: Experimental Evaluation

- hypothesis: we get more accurate models by using a richer representation of sentence structure in HMMs
- compare predictive accuracy of various types of representations:
  - hierarchical w/ context features
  - hierarchical
  - phrases
  - tokens w/ part of speech
  - tokens
- 5-fold cross-validation on 3 data sets
Slide 26: Weakly Labeled Data Sets for Learning to Extract Relations

- subcellular_localization(PROTEIN, LOCATION)
  - YPD database
  - 769 positive, 6193 negative sentences
  - 939 tuples (402 distinct)
- disorder_association(GENE, DISEASE)
  - OMIM database
  - 829 positive, 11685 negative sentences
  - 852 tuples (143 distinct)
- protein_protein_interaction(PROTEIN, PROTEIN)
  - MIPS database
  - 5446 positive, 41377 negative sentences
  - 8088 tuples (819 distinct)
Slide 27: Extraction Accuracy (YPD)

Slide 28: Extraction Accuracy (MIPS)

Slide 29: Extraction Accuracy (OMIM)
Slide 30: 3. Combining Evidence when Making Predictions

- in processing a large corpus, we are likely to see the same entities and relations in multiple places
- in making extractions, we should combine evidence across the different occurrences/contexts in which we see an entity/relation
Slide 31: Combining Evidence: Organizing Predictions into Bags

[Table: occurrences of "CAT" grouped into one bag, each with an actual and a predicted label]
- "CAT is a 64-kD protein ..."
- "the cat activated the mouse ..."
- "CAT was established to be ..."
- "... were removed from cat brains."
Slide 32: Combining Evidence when Making Predictions

- given a bag of predictions, estimate the probability that the bag contains at least one actual positive example
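One natural estimator for this bag-level probability is a noisy-or combination: if occurrence i is a true positive with probability p_i, and occurrences are treated as independent, then the bag contains at least one true positive with probability 1 - prod(1 - p_i). The independence assumption, and whether this matches the talk's exact estimator, are simplifications of this sketch:

```python
# Noisy-or combination over a bag of per-occurrence positive probabilities.
def bag_positive_prob(probs):
    """P(at least one true positive) = 1 - prod(1 - p_i)."""
    prod = 1.0
    for p in probs:
        prod *= 1.0 - p
    return 1.0 - prod

score = bag_positive_prob([0.6, 0.3, 0.1])   # approximately 0.748
```

Note how the bag score exceeds the best single-occurrence probability: repeated weak evidence accumulates.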
Slide 33: Combining Evidence: Estimating Relevant Probabilities

Slide 34: Evidence Combination: Protein-Protein Interactions

Slide 35: Evidence Combination: Protein Names
Slide 36: Conclusions

- machine learning methods provide a means for learning/refining models for information extraction
- learning is inexpensive when unlabeled/weakly labeled sources can be exploited
  - learning context patterns for protein names
  - learning HMMs for relation extraction
- we can learn more accurate models by giving HMMs more information about the syntactic structure of sentences (hierarchical HMMs)
- we can improve the precision of our predictions by carefully combining evidence across extractions
Slide 37: Acknowledgments

- my graduate students
  - Soumya Ray
  - Burr Settles
  - Marios Skounakis
- NIH/NLM grant 1R01 LM07050-01
- NSF CAREER grant IIS-0093016