Title: Using Weakly Labeled Data to Learn Models for Extracting Information from Biomedical Text
Slide 1: Using Weakly Labeled Data to Learn Models for Extracting Information from Biomedical Text

Mark Craven
Department of Biostatistics and Medical Informatics / Department of Computer Sciences
University of Wisconsin, U.S.A.
craven@biostat.wisc.edu
www.biostat.wisc.edu/craven
Slide 2: The Information Extraction Task

Analysis of Yeast PRP20 Mutations and Functional Complementation by the Human Homologue RCC1, a Protein Involved in the Control of Chromosome Condensation
Fleischmann M, Clark M, Forrester W, Wickens M, Nishimoto T, Aebi M

Mutations in the PRP20 gene of yeast show a pleiotropic phenotype, in which both mRNA metabolism and nuclear structure are affected. SRM1 mutants, defective in the same gene, influence the signal transduction pathway for the pheromone response... By immunofluorescence microscopy the PRP20 protein was localized in the nucleus. Expression of the RCC1 protein can complement the temperature-sensitive phenotype of PRP20 mutants, demonstrating the functional similarity of the yeast and mammalian proteins.

Extracted facts:
protein(PRP20)
subcellular-localization(PRP20, nucleus)
Slide 3: Motivation

- assisting in the construction and updating of databases
- providing structured summaries for queries: what is known about protein X (subcellular/tissue localization, associations with diseases, interactions with drugs, ...)?
- assisting scientific discovery by detecting previously unknown relationships, annotating experimental data
Slide 4: Three Themes in Our IE Research
- Using weakly labeled training data
- Representing sentence structure in learned models
- Combining evidence when making predictions
Slide 5: 1. Using Weakly Labeled Data

- why use machine learning methods in building information-extraction systems?
  - hand-coding IE systems is expensive and time-consuming
  - there is a lot of data that can be leveraged
- where do we get a training set?
  - by having someone hand-label data (expensive)
  - by coupling tuples in an existing database with relevant documents (cheap)
Slide 6: Weakly Labeled Training Data

- to get positive examples, match DB tuples to passages of text referencing the constants in the tuples

[Figure: tuples (P1, L1), (P2, L2), (P3, L3) from the YPD database are matched to MEDLINE abstracts containing the corresponding protein/location pairs]
Slide 7: Weakly Labeled Training Data

- the labeling is weak in that many sentences with co-occurrences wouldn't be considered positive examples if we were hand-labeling them
- consider the sentences associated with the relation subcellular-localization(VAC8p, vacuole) after weak labeling
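The tuple-matching step on the previous slide can be sketched as simple co-occurrence labeling. The tuples and sentences below are illustrative, not from the actual YPD/MEDLINE corpora:

```python
# Weak labeling sketch: a sentence becomes a positive example for
# subcellular-localization(P, L) if it mentions both constants of some
# database tuple; everything else is (weakly) labeled negative.
def weak_label(sentences, tuples):
    """Split sentences into (positives, negatives) by tuple co-occurrence."""
    positives, negatives = [], []
    for sent in sentences:
        text = sent.lower()
        if any(p.lower() in text and l.lower() in text for p, l in tuples):
            positives.append(sent)
        else:
            negatives.append(sent)
    return positives, negatives

tuples = [("VAC8p", "vacuole")]          # one DB tuple, as on the slide
sentences = [
    "Vac8p is localized to the vacuole membrane.",
    "Deletion of VAC8 has no obvious phenotype.",
]
pos, neg = weak_label(sentences, tuples)
```

As the slide notes, such labels are noisy: a sentence can mention both constants without actually asserting the relation.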
Slide 8: Learning Context Patterns for Recognizing Protein Names

- we use AutoSlog [Riloff 96] to find triggers that commonly occur before and after tagged proteins in a training corpus
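The core counting step behind trigger discovery can be sketched as below. Real AutoSlog patterns are richer syntactic frames; this toy version, with an invented `PROT` placeholder tag, only tallies the tokens immediately adjacent to tagged protein mentions:

```python
from collections import Counter

# Count the tokens that immediately precede and follow tagged protein
# names in a training corpus; frequent neighbours are trigger candidates.
def context_triggers(tagged_sentences, tag="PROT"):
    before, after = Counter(), Counter()
    for tokens in tagged_sentences:
        for i, tok in enumerate(tokens):
            if tok == tag:
                if i > 0:
                    before[tokens[i - 1]] += 1
                if i + 1 < len(tokens):
                    after[tokens[i + 1]] += 1
    return before, after

corpus = [
    "expression of PROT was induced".split(),
    "the PROT protein localizes to the nucleus".split(),
]
before, after = context_triggers(corpus)
```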
Slide 9: Weak Labeling Example

SwissProt dictionary:
... D-AKAP-2, D-amino acid oxidase, D-aspartate oxidase, D-dopachrome tautomerase, DAG kinase zeta, DAMOX, DASOX, DAT, DB83 protein ...

PubMed abstract:
Two distinct forms of oxidases catalysing the oxidative deamination of D-alpha-amino acids have been identified in human tissues: <p>D-amino acid oxidase</p> and <p>D-aspartate oxidase</p>. The enzymes differ in their electrophoretic properties, tissue distribution, binding with flavin adenine dinucleotide, heat stability, molecular size and possibly in subunit structure. Neither enzyme exhibits genetic polymorphism in European populations, but a rare electrophoretic variant phenotype (<p>DASOX</p> 2-1) was identified which suggests that the <p>DASOX</p> locus is autosomal and independent of the <p>DAMOX</p> locus.
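The dictionary-based tagging illustrated above can be sketched as a longest-match substring tagger. The dictionary and text are abbreviated from the slide, and the `<p>...</p>` markup mirrors the slide's tags; the matching strategy is an assumption, not necessarily the system's exact procedure:

```python
import re

# Tag every abstract substring that exactly matches a dictionary entry.
def tag_names(text, dictionary):
    # try longer names first so "D-amino acid oxidase" beats shorter entries
    names = sorted(dictionary, key=len, reverse=True)
    pattern = "|".join(re.escape(n) for n in names)
    return re.sub(f"({pattern})", r"<p>\1</p>", text)

dictionary = ["D-amino acid oxidase", "D-aspartate oxidase", "DASOX", "DAMOX"]
text = "the enzymes D-amino acid oxidase and DASOX differ in tissue distribution"
tagged = tag_names(text, dictionary)
```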
Slide 10: Protein Name Extraction Approach
Slide 11: Experimental Evaluation

- hypothesis: we get more accurate models by using weakly labeled data in addition to manually labeled data
- models use AutoSlog-induced context patterns + naïve Bayes on morphological/syntactic features of candidate names
- compare predictive accuracy resulting from
  - a fixed amount of hand-labeled data
  - varying amounts of weakly labeled data + hand-labeled data
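A minimal sketch of the naïve Bayes component, assuming a small invented set of binary morphological features and toy training tokens (the paper's actual feature set is richer):

```python
import math
from collections import defaultdict

# Binary morphological features of a candidate name (illustrative set).
def features(token):
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_upper": any(c.isupper() for c in token),
        "has_hyphen": "-" in token,
    }

class NaiveBayes:
    """Naive Bayes with Laplace smoothing over independent binary features."""
    def __init__(self):
        # counts[label][feature] = [count(False)+1, count(True)+1]
        self.counts = {True: defaultdict(lambda: [1, 1]),
                       False: defaultdict(lambda: [1, 1])}
        self.n = {True: 0, False: 0}

    def fit(self, examples):
        for token, label in examples:
            self.n[label] += 1
            for f, v in features(token).items():
                self.counts[label][f][v] += 1

    def log_odds(self, token):
        # log P(protein | x) / P(not-protein | x) under feature independence
        score = math.log(self.n[True] / self.n[False])
        for f, v in features(token).items():
            p_t = self.counts[True][f][v] / (self.n[True] + 2)
            p_f = self.counts[False][f][v] / (self.n[False] + 2)
            score += math.log(p_t / p_f)
        return score

nb = NaiveBayes()
nb.fit([("Bed1", True), ("PRP20", True), ("RCC1", True),
        ("nucleus", False), ("yeast", False), ("protein", False)])
```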
Slide 12: Extraction Accuracy (Yapex Data Set)

Slide 13: Extraction Accuracy (Texas Data Set)
Slide 14: 2. Representing Sentence Structure in Learned Models

- hidden Markov models (HMMs) have proven to be perhaps the best family of methods for learning IE models
- typically these HMMs have a flat structure, and are able to represent relatively little about grammatical structure
- how can we provide HMMs with more information about sentence structure?
Slide 15: Hidden Markov Models Example

Pr("... the Bed1 protein ...", q1, q4, q2, ...): the joint probability that the model emits the words while following the state path q1, q4, q2, ...
Slide 16: Hidden Markov Models for Information Extraction

- there are efficient algorithms for doing the following with HMMs:
  - determining the likelihood of a sentence given a model (the forward algorithm)
  - determining the most likely path through a model for a sentence (the Viterbi algorithm)
  - setting the parameters of the model to maximize the likelihood of a set of sentences (Baum-Welch/EM)
Slide 17: Representing Sentences

- we first process sentences by analyzing them with a shallow parser (Sundance; Riloff et al., 98)
Slide 18: Hierarchical HMMs for IE (Part 1)

- [Ray and Craven, IJCAI 01; Skounakis et al., IJCAI 03]
- states have types and emit phrases
- some states have labels (PROTEIN, LOCATION)
- our models have ~25 states at this level
Slide 19: Hierarchical HMMs for IE (Part 2)

[Figure: the positive model and the null model]

Slide 20: Hierarchical HMMs for IE (Part 3)
Slide 21: Hierarchical HMMs

[Figure: a hierarchical HMM processing the sentence "... is found in the ER". At the phrase level, a path runs from START through a VP-SEGMENT ("is found"), a PP-SEGMENT ("in"), and a LOCATION NP-SEGMENT ("the ER") to END; PROTEIN NP-SEGMENT states appear in the same topology. A positional level distinguishes BEFORE, BETWEEN, and AFTER contexts around the labeled states, with an ALL state covering the rest]
Slide 22: Extraction with our HMMs

- extract a relation instance if
  - the sentence is more probable under the positive model than under the null model
  - the Viterbi (most probable) path goes through the special extraction states

[Figure: a path through the positive model visiting the PROTEIN NP-SEGMENT and LOCATION NP-SEGMENT extraction states]
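The two-part decision rule above can be sketched as follows. `StubModel` is a hypothetical stand-in for the trained positive and null HMMs, with toy scoring so the example runs end to end:

```python
# Toy stand-in for a trained HMM: it "recognizes" a few words and maps
# them to extraction states; everything else is OTHER.
class StubModel:
    def __init__(self, state_of):            # word -> extraction state
        self.state_of = state_of
    def loglik(self, tokens):
        # toy score: count words the model recognizes
        return sum(1 for w in tokens if w in self.state_of)
    def viterbi(self, tokens):
        # toy Viterbi path: tag recognized words, everything else OTHER
        return [self.state_of.get(w, "OTHER") for w in tokens]

def extract(tokens, pos_model, null_model):
    """Extract a (protein, location) pair only if both tests pass."""
    if pos_model.loglik(tokens) <= null_model.loglik(tokens):
        return None                          # null model wins: no extraction
    path = pos_model.viterbi(tokens)
    if "PROTEIN" in path and "LOCATION" in path:
        return (tokens[path.index("PROTEIN")], tokens[path.index("LOCATION")])
    return None

pos_model = StubModel({"PRP20": "PROTEIN", "nucleus": "LOCATION"})
null_model = StubModel({})
result = extract("PRP20 localizes to the nucleus".split(), pos_model, null_model)
```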
Slide 23: Representing More Local Context

- we can have the word-level states represent more about the local context of each emission
- partition the sentence into overlapping trigrams:
  ... the/ART Bed1/UNK protein/N is/COP located/V ...
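The overlapping-trigram partition can be sketched as below: each token is emitted together with its left and right neighbours, padded at the sentence boundaries (the padding symbol is an assumption):

```python
# Build overlapping trigrams of (token, tag) pairs for a tagged sentence.
def trigrams(tagged_tokens, pad=("<s>", "PAD")):
    padded = [pad] + tagged_tokens + [pad]
    # one (left, current, right) window per original token
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

tokens = [("the", "ART"), ("Bed1", "UNK"), ("protein", "N")]
tris = trigrams(tokens)
```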
Slide 24: Representing More Local Context

- states emit trigrams; the emission probability factors as a product over the trigram's components (an independence assumption)
- we compensate for this naïve assumption by using a discriminative training method [Krogh 94] to learn the parameters
Slide 25: Experimental Evaluation

- hypothesis: we get more accurate models by using a richer representation of sentence structure in HMMs
- compare predictive accuracy of various types of representations:
  - hierarchical w/ context features
  - hierarchical
  - phrases
  - tokens w/ part of speech
  - tokens
- 5-fold cross-validation on 3 data sets
Slide 26: Weakly Labeled Data Sets for Learning to Extract Relations

- subcellular_localization(PROTEIN, LOCATION)
  - YPD database
  - 769 positive, 6193 negative sentences
  - 939 tuples (402 distinct)
- disorder_association(GENE, DISEASE)
  - OMIM database
  - 829 positive, 11685 negative sentences
  - 852 tuples (143 distinct)
- protein_protein_interaction(PROTEIN, PROTEIN)
  - MIPS database
  - 5446 positive, 41377 negative sentences
  - 8088 tuples (819 distinct)
Slide 27: Extraction Accuracy (YPD)

Slide 28: Extraction Accuracy (MIPS)

Slide 29: Extraction Accuracy (OMIM)
Slide 30: 3. Combining Evidence when Making Predictions

- in processing a large corpus, we are likely to see the same entities and relations in multiple places
- in making extractions, we should combine evidence across the different occurrences/contexts in which we see an entity/relation
Slide 31: Combining Evidence: Organizing Predictions into Bags

[Table: occurrences of "CAT" grouped into one bag, each with an actual and a predicted label]
- "CAT is a 64-kD protein ..."
- "the cat activated the mouse ..."
- "CAT was established to be ..."
- "... were removed from cat brains."
Slide 32: Combining Evidence when Making Predictions

- given a bag of predictions, estimate the probability that the bag contains at least one actual positive example
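One natural estimator for this bag-level probability is a noisy-or combination: if occurrence i is a true positive with probability p_i, and occurrences are treated as independent, then the bag contains at least one true positive with probability 1 - prod(1 - p_i). The independence assumption, and whether this matches the talk's exact estimator, are simplifications of this sketch:

```python
# Noisy-or combination over a bag of per-occurrence positive probabilities.
def bag_positive_prob(probs):
    """P(at least one true positive) = 1 - prod(1 - p_i)."""
    prod = 1.0
    for p in probs:
        prod *= 1.0 - p
    return 1.0 - prod

score = bag_positive_prob([0.6, 0.3, 0.1])   # approximately 0.748
```

Note how the bag score exceeds the best single-occurrence probability: repeated weak evidence accumulates.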
Slide 33: Combining Evidence: Estimating Relevant Probabilities

Slide 34: Evidence Combination: Protein-Protein Interactions

Slide 35: Evidence Combination: Protein Names
Slide 36: Conclusions

- machine learning methods provide a means for learning/refining models for information extraction
- learning is inexpensive when unlabeled/weakly labeled sources can be exploited
  - learning context patterns for protein names
  - learning HMMs for relation extraction
- we can learn more accurate models by giving HMMs more information about the syntactic structure of sentences (hierarchical HMMs)
- we can improve the precision of our predictions by carefully combining evidence across extractions
Slide 37: Acknowledgments

- my graduate students
  - Soumya Ray
  - Burr Settles
  - Marios Skounakis
- NIH/NLM grant 1R01 LM07050-01
- NSF CAREER grant IIS-0093016