1
Part-of-speech tagging and chunking with
log-linear models
  • University of Manchester
  • Yoshimasa Tsuruoka

2
Outline
  • POS tagging and Chunking for English
  • Conditional Markov Models (CMMs)
  • Dependency Networks
  • Bidirectional CMMs
  • Maximum entropy learning
  • Conditional Random Fields (CRFs)
  • Domain adaptation of a tagger

3
Part-of-speech tagging
The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ
immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN
in/IN monocytes/NNS
  • The tagger assigns a part-of-speech tag to each
    word in the sentence.

4
Algorithms for part-of-speech tagging
  • Tagging speed and accuracy on WSJ

Method                     Tagging speed   Accuracy (%)
Dependency Net (2003)      Slow?           97.24
SVM (2004)                 Fast            97.16
Perceptron (2002)          ?               97.11
Bidirectional CMM (2005)   Fast            97.10
HMM (2000)                 Very fast       96.7
CMM (1998)                 Fast            96.6
(* evaluated on a different portion of WSJ)
5
Chunking (shallow parsing)
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow]
[PP to] [NP only 1.8 billion] [PP in] [NP September] .
  • A chunker (shallow parser) segments a sentence
    into non-recursive phrases

6
Chunking (shallow parsing)
He/B-NP reckons/B-VP the/B-NP current/I-NP account/I-NP deficit/I-NP
will/B-VP narrow/I-VP to/B-PP only/B-NP 1.8/I-NP billion/I-NP
in/B-PP September/B-NP ./O
  • Chunking tasks can be converted into a standard tagging task via a
    B/I/O encoding (a minimal conversion sketch follows this list)
  • Different approaches
  • Sliding window
  • Semi-Markov CRF
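
As a concrete illustration of the chunk-to-tag conversion (a minimal
sketch, not the presenter's code; the chunk structure is taken from the
example above):

def chunks_to_bio(chunks):
    """chunks: list of (label_or_None, tokens) in sentence order."""
    tags = []
    for label, tokens in chunks:
        for i, _ in enumerate(tokens):
            if label is None:
                tags.append("O")          # token outside any chunk
            else:
                tags.append(("B-" if i == 0 else "I-") + label)
    return tags

sentence = [("NP", ["He"]), ("VP", ["reckons"]),
            ("NP", ["the", "current", "account", "deficit"]),
            ("VP", ["will", "narrow"]), ("PP", ["to"]),
            ("NP", ["only", "1.8", "billion"]), ("PP", ["in"]),
            ("NP", ["September"]), (None, ["."])]
print(chunks_to_bio(sentence))
# ['B-NP', 'B-VP', 'B-NP', 'I-NP', 'I-NP', 'I-NP', 'B-VP', 'I-VP',
#  'B-PP', 'B-NP', 'I-NP', 'I-NP', 'B-PP', 'B-NP', 'O']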

7
Algorithms for chunking
  • Chunking speed and accuracy on Penn Treebank

Method                     Speed    Accuracy
SVM voting (2001)          Slow?    93.91
Perceptron (2003)          ?        93.74
Bidirectional CMM (2005)   Fast     93.70
SVM (2000)                 Fast     93.48
8
Conditional Markov Models (CMMs)
(Figure: chain-structured model over tags t1, t2, t3, each conditioned
on the previous tag and the observation o)
  • Left to right decomposition (with the first-order
    Markov assumption)
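
Written out, the decomposition the slide refers to is (my
reconstruction in standard notation):

  P(t_1, \ldots, t_n \mid o) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, o)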

9
POS tagging with CMMs [Ratnaparkhi 1996, etc.]
  • Left-to-right decomposition
  • The local classifier uses the information on the
    preceding tag.

(Figure: He/PRP runs/VBZ fast/RB, tagged one word at a time from left
to right, with the remaining tags still undecided)
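
A minimal sketch of this left-to-right tagging loop (illustrative only;
the classifier interface is hypothetical, and a real CMM tagger would
keep multiple hypotheses with beam search or Viterbi rather than
committing greedily):

def tag_left_to_right(words, classify):
    # classify(words, i, prev_tag) -> predicted tag for words[i]
    # (hypothetical interface standing in for the maxent local classifier)
    tags, prev = [], "<START>"
    for i in range(len(words)):
        tag = classify(words, i, prev)   # uses the preceding tag as a feature
        tags.append(tag)
        prev = tag
    return tags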
10
Examples of the features for local classification
He runs fast
Word unigram      wi, wi-1, wi+1
Word bigram       wi-1 wi, wi wi+1
Previous tag      ti-1
Tag/word          ti-1 wi
Prefix/suffix     up to length 10
Lexical features  hyphen, number, etc.
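
The templates above could be realized by a feature extractor along
these lines (a sketch; the string encodings of the features are my own,
not taken from the tagger):

def extract_features(words, i, prev_tag):
    w = words[i]
    w_prev = words[i - 1] if i > 0 else "<BOS>"
    w_next = words[i + 1] if i + 1 < len(words) else "<EOS>"
    feats = [
        "w0=" + w, "w-1=" + w_prev, "w+1=" + w_next,                 # word unigrams
        "w-1_w0=" + w_prev + "_" + w, "w0_w+1=" + w + "_" + w_next,  # word bigrams
        "t-1=" + prev_tag,                                            # previous tag
        "t-1_w0=" + prev_tag + "_" + w,                               # tag/word
        "has_hyphen=" + str("-" in w),                                # lexical features
        "has_digit=" + str(any(c.isdigit() for c in w)),
    ]
    for k in range(1, min(10, len(w)) + 1):     # prefixes/suffixes up to length 10
        feats.append("prefix=" + w[:k])
        feats.append("suffix=" + w[-k:])
    return feats

# extract_features(["He", "runs", "fast"], 1, "PRP") gives the features
# used to classify "runs" when the previous tag is PRP.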
11
POS tagging with Dependency Networks [Toutanova et al. 2003]
(Figure: chain over tags t1, t2, t3 with dependencies in both
directions)
  • Use the information on the succeeding tag as well

The succeeding tag can be used as a feature in the local
classification model, but the resulting product of local scores is no
longer a proper probability.
12
POS tagging with a Cyclic Dependency Network [Toutanova et al. 2003]
(Figure: cyclic dependencies among the tags t1, t2, t3)
  • Training cost is small (almost equal to that of CMMs).
  • Decoding can be performed with dynamic programming, but it is
    still expensive.
  • Collusion: the model can lock onto conditionally consistent but
    jointly unlikely sequences.

13
Bidirectional CMMs [Tsuruoka and Tsujii, 2005]
  • Possible decomposition structures
  • Bidirectional CMMs
  • We can find the best structure and tag
    sequences in polynomial time

(Figure: four possible decomposition structures (a)-(d) over the tags
t1, t2, t3, differing in the direction in which each local classifier
conditions on its neighboring tags)
14
Bidirectional CMMs
  • Another way of decomposition
  • The local classifiers have the information about
    the tags on both sides when tagging the second
    word.

(Figure: He/PRP ... runs/? ... fast/RB — the middle word is tagged
last, so its classifier can condition on the tags already assigned on
both sides)
15
Outline
  • POS tagging and Chunking for English
  • Conditional Markov Models (CMMs)
  • Dependency Networks
  • Bidirectional CMMs
  • Maximum entropy learning
  • Conditional Random Fields (CRFs)
  • Domain adaptation of a tagger

16
Maximum entropy learning
  • Log-linear modeling

(The slide shows the log-linear formula as an image, with the feature
function and the feature weight labeled.)
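
The standard log-linear (maximum entropy) form it refers to is

  p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
  \qquad Z(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big),

where f_i(x, y) is a feature function and \lambda_i its weight.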
17
Maximum entropy learning
  • Maximum likelihood estimation
  • Find the parameters that maximize the (log-)
    likelihood of the training data
  • Regularization
  • Gaussian prior [Berger et al., 1996]
  • Inequality constraints [Kazama and Tsujii, 2005]
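
Concretely (my reconstruction; the slide shows this as an image), the
training objective with a Gaussian prior is the penalized
log-likelihood

  L(\lambda) = \sum_{j} \log p_\lambda(y_j \mid x_j) \;-\; \sum_i \frac{\lambda_i^2}{2\sigma^2},

maximized over the weights \lambda.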

18
Parameter estimation
  • Algorithms specific to maximum entropy
  • GIS [Darroch and Ratcliff, 1972], IIS [Della Pietra et al., 1997]
  • General-purpose algorithms for numerical optimization
  • BFGS [Nocedal and Wright, 1999], LMVM [Benson and More, 2001]
  • You only need to provide the objective function and its gradient
    (see the sketch after this list)
  • Likelihood of the training samples
  • Model expectation of each feature
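
A toy sketch of what "objective function plus gradient" means in
practice (illustrative only; it uses dense toy data and scipy's L-BFGS
rather than the optimizers named above):

import numpy as np
from scipy.optimize import minimize

def neg_objective_and_grad(lam, feats, gold, sigma2=1.0):
    # feats[j, y] is the feature vector f(x_j, y); gold[j] is the correct class.
    scores = feats @ lam                                   # (samples, classes)
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    idx = np.arange(len(gold))
    loglik = np.log(probs[idx, gold]).sum() - (lam ** 2).sum() / (2 * sigma2)
    empirical = feats[idx, gold].sum(axis=0)               # observed feature counts
    expected = np.einsum("jc,jcf->f", probs, feats)        # model expectations
    grad = empirical - expected - lam / sigma2
    return -loglik, -grad                                  # minimize the negative

feats = np.random.rand(50, 3, 8)        # 50 samples, 3 classes, 8 features (toy data)
gold = np.random.randint(0, 3, size=50)
result = minimize(neg_objective_and_grad, np.zeros(8), args=(feats, gold),
                  jac=True, method="L-BFGS-B")
print(result.x)                         # estimated feature weights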

19
Computing likelihood and model expectation
  • Example
  • Two possible tags: Noun and Verb
  • Two types of features: word and suffix

(Figure: the sentence "He opened it" with Noun/Verb tag candidates and
the per-tag word and suffix features)
20
Conditional Random Fields (CRFs)
  • A single log-linear model on the whole sentence
  • One can use exactly the same techniques as
    maximum entropy learning to estimate the
    parameters.
  • However, the number of classes (possible tag sequences) is huge,
    and naive estimation is impossible in practice.

21
Conditional Random Fields (CRFs)
  • Solution
  • Let's restrict the types of features
  • Then, you can use a dynamic programming algorithm
    that drastically reduces the amount of
    computation
  • Features you can use (in first-order CRFs)
  • Features defined on the tag
  • Features defined on the adjacent pair of tags

22
Features
  • Feature weights are associated with states and edges

(Figure: Noun/Verb lattice over the sentence "He has opened it";
example state feature: w0=He & tag=Noun; example edge feature:
tag_left=Noun & tag_right=Noun)
23
A naive way of calculating Z(x)
All 16 tag sequences for "He has opened it" and their unnormalized
scores:

He      has     opened   it        score
Noun    Noun    Noun     Noun       7.2
Verb    Noun    Noun     Noun       4.1
Noun    Noun    Noun     Verb       1.3
Verb    Noun    Noun     Verb       0.8
Noun    Noun    Verb     Noun       4.5
Verb    Noun    Verb     Noun       9.7
Noun    Noun    Verb     Verb       0.9
Verb    Noun    Verb     Verb       5.5
Noun    Verb    Noun     Noun       2.3
Verb    Verb    Noun     Noun       5.7
Noun    Verb    Noun     Verb      11.2
Verb    Verb    Noun     Verb       4.3
Noun    Verb    Verb     Noun       3.4
Verb    Verb    Verb     Noun       2.2
Noun    Verb    Verb     Verb       2.5
Verb    Verb    Verb     Verb       1.9
Sum (= Z(x))                       67.5
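
The enumeration above amounts to the following brute-force computation
(a sketch; the scoring function is a stand-in for the exponentiated
sums of state and edge weights):

from itertools import product

def brute_force_Z(words, tagset, seq_score):
    # seq_score(words, tags) -> unnormalized score of one complete tag sequence
    return sum(seq_score(words, tags)
               for tags in product(tagset, repeat=len(words)))

# For "He has opened it" with {Noun, Verb} this sums 2**4 = 16 terms
# (the 67.5 above); for 45 Penn Treebank tags and a 25-word sentence it
# would be 45**25 terms, which is why dynamic programming is needed.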
24
Dynamic programming
  • Results of intermediate computation can be reused.

(Figure: Noun/Verb lattice for "He has opened it"; the forward pass
accumulates scores from left to right)
25
Dynamic programming
  • Results of intermediate computation can be reused.

(Figure: the same lattice; the backward pass accumulates scores from
right to left)
26
Dynamic programming
  • Computing marginal distribution

(Figure: the marginal probability of a tag at a given position combines
the forward and backward scores at that node)
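
A minimal sketch of the forward-backward computation the last three
slides illustrate (my own illustration; production CRF code works in
log space to avoid overflow):

import numpy as np

def forward_backward(state_score, edge_score):
    # state_score[i, t]: exponentiated score of tag t at position i
    # edge_score[s, t]:  exponentiated score of the edge from tag s to tag t
    n, k = state_score.shape
    alpha = np.zeros((n, k))
    beta = np.zeros((n, k))
    alpha[0] = state_score[0]
    for i in range(1, n):                              # forward pass
        alpha[i] = state_score[i] * (alpha[i - 1] @ edge_score)
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):                     # backward pass
        beta[i] = edge_score @ (state_score[i + 1] * beta[i + 1])
    Z = alpha[-1].sum()                                # partition function Z(x)
    marginals = alpha * beta / Z                       # P(tag at position i | x)
    return Z, marginals

# Each pass is O(n * k^2) instead of the O(k^n) naive enumeration.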
27
Maximum entropy learning and Conditional Random
Fields
  • Maximum entropy learning
  • Log-linear modeling + MLE
  • Parameter estimation
  • Likelihood of each sample
  • Model expectation of each feature
  • Conditional Random Fields
  • Log-linear modeling on the whole sentence
  • Features are defined on states and edges
  • Dynamic programming

28
Named Entity Recognition
We have shown that [protein interleukin-1] ([protein IL-1]) and
[protein IL-2] control [DNA IL-2 receptor alpha (IL-2R alpha) gene]
transcription in [cell_line CD4-CD8-murine T lymphocyte precursors].
  • A term consists of multiple tokens
  • We want to define features on a term rather than on a token

Semi-Markov CRFs [Sarawagi 2004]
29
Algorithms for Biomedical Named Entity Recognition
  • Shared-task data from the COLING 2004 BioNLP workshop

Method                                     Recall  Precision  F-score
SVM+HMM (2004)                             76.0    69.4       72.6
Semi-Markov CRF [Okanohara et al., 2006]   72.7    70.4       71.5
Sliding window                             75.8    67.5       71.4
MEMM (2004)                                71.6    68.6       70.1
CRF (2004)                                 70.3    69.3       69.8
30
Outline
  • POS tagging and Chunking for English
  • Conditional Markov Models (CMMs)
  • Dependency Networks
  • Bidirectional CMMs
  • Maximum entropy learning
  • Conditional Random Fields (CRFs)
  • Domain adaptation of a tagger

31
Domain adaptation
  • Large training sets are available for general domains (e.g., the
    Penn Treebank WSJ)
  • NLP tools trained on general-domain data are less accurate on
    biomedical text
  • Developing domain-specific training data requires considerable
    human effort

32
Tagging errors made by a tagger trained on WSJ
and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ
two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN
B/NN enhancers/NNS
by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN
to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN
in/IN vitro/NN by/IN
  • Accuracy of the tagger on the GENIA POS corpus: 84.4%

33
Re-training of maximum entropy models
  • The taggers are trained as maximum entropy models
  • Maximum entropy models can be adapted to a target domain by
    re-training them with domain-specific data

(The slide shows the log-linear model again, labeling the feature
functions, which are given by the developer, and the model parameters,
which are re-estimated on the new data.)
34
Methods for domain adaptation
  • Combined training data: a model is trained from scratch on the
    original data plus the domain-specific data
  • Reference distribution: the original model is used as a reference
    probability distribution for the domain-specific model (a formula
    sketch follows this list)
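
A common way to write the reference-distribution model (my
reconstruction; the notation is not from the slides) is

  p(y \mid x) = \frac{q_0(y \mid x)\, \exp\big( \sum_i \lambda_i f_i(x, y) \big)}{Z(x)},
  \qquad Z(x) = \sum_{y'} q_0(y' \mid x)\, \exp\big( \sum_i \lambda_i f_i(x, y') \big),

where q_0 is the model trained on the original domain, kept fixed, and
only the weights \lambda_i are re-estimated on the domain-specific
data.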

35
Adaptation of the part-of-speech tagger
  • The relationships among training and test data are evaluated on
    the following corpora:
  • WSJ: Penn Treebank WSJ
  • GENIA: GENIA POS corpus [Kim et al., 2003]
  • 2,000 MEDLINE abstracts selected with the MeSH terms Human, Blood
    cells, and Transcription factors
  • PennBioIE: Penn BioIE corpus [Kulick et al., 2004]
  • 1,100 MEDLINE abstracts about inhibition of the cytochrome P450
    family of enzymes
  • 1,157 MEDLINE abstracts about the molecular genetics of cancer
  • Fly: 200 MEDLINE abstracts on Drosophila melanogaster

36
Training and test sets
  • Training sets

              tokens     sentences
  WSJ         912,344    38,219
  GENIA       450,492    18,508
  PennBioIE   641,838    29,422
  Fly           1,024    (only one figure is given on the slide)

  • Test sets

              tokens     sentences
  WSJ         129,654    5,462
  GENIA        50,562    2,036
  PennBioIE    70,713    3,270
  Fly           7,615    326
37
Experimental results
                        Accuracy                               Training time
  Training data         WSJ      GENIA    PennBioIE  Fly       (sec.)
  WSJ+GENIA+PennBioIE   96.68    98.10    97.65      96.35     -
  Fly only              -        -        -          93.91     -
  Combined              96.69    98.12    97.65      97.94     30,632
  Ref. dist.            95.38    98.17    96.93      98.08     21
38
Corpus size vs. accuracy (combined training data)
39
Corpus size vs. accuracy (reference distribution)
40
Summary
  • POS tagging
  • MEMM-like approaches achieve good performance
    with reasonable computational cost. CRFs seem to
    be too computationally expensive at present.
  • Chunking
  • CRFs yield good performance for NP chunking.
    Semi-Markov CRFs are promising, but we need to
    somehow reduce computational cost.
  • Domain Adaptation
  • One can easily use the information about the
    original domain as the reference distribution.

41
References
  • A. L. Berger, S. A. Della Pietra, and V. J. Della
    Pietra. (1996). A maximum entropy approach to
    natural language processing. Computational
    Linguistics.
  • Adwait Ratnaparkhi. (1996). A Maximum Entropy
    Part-Of-Speech Tagger. Proceedings of EMNLP.
  • Thorsten Brants. (2000). TnT: A Statistical
    Part-Of-Speech Tagger. Proceedings of ANLP.
  • Taku Kudo and Yuji Matsumoto. (2001). Chunking
    with Support Vector Machines. Proceedings of
    NAACL.
  • John Lafferty, Andrew McCallum, and Fernando
    Pereira. (2001). Conditional Random Fields: Probabilistic Models
    for Segmenting and Labeling Sequence Data. Proceedings of ICML.
  • Michael Collins. (2002). Discriminative Training
    Methods for Hidden Markov Models: Theory and
    Experiments with Perceptron Algorithms.
    Proceedings of EMNLP.
  • Fei Sha and Fernando Pereira. (2003). Shallow
    Parsing with Conditional Random Fields.
    Proceedings of HLT-NAACL.
  • K. Toutanova, D. Klein, C. Manning, and Y.
    Singer. (2003). Feature-Rich Part-of-Speech
    Tagging with a Cyclic Dependency Network.
    Proceedings of HLT-NAACL.

42
References
  • Xavier Carreras and Lluis Marquez. (2003). Phrase
    recognition by filtering and ranking with
    perceptrons. Proceedings of RANLP.
  • Jesús Giménez and Lluís Márquez. (2004). SVMTool:
    A general POS tagger generator based on Support
    Vector Machines. Proceedings of LREC.
  • Sunita Sarawagi and William W. Cohen. (2004).
    Semi-Markov conditional random fields for
    information extraction. Proceedings of NIPS 2004.
  • Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2005).
    Bidirectional Inference with the Easiest-First
    Strategy for Tagging Sequence Data. Proceedings
    of HLT/EMNLP.
  • Yuka Tateisi, Yoshimasa Tsuruoka, and Jun'ichi
    Tsujii. (2006). Subdomain adaptation of a POS
    tagger with a small corpus. In Proceedings of
    HLT-NAACL BioNLP Workshop.
  • Daisuke Okanohara, Yusuke Miyao, Yoshimasa
    Tsuruoka, and Jun'ichi Tsujii. (2006). Improving
    the Scalability of Semi-Markov Conditional Random
    Fields for Named Entity Recognition. Proceedings
    of COLING/ACL 2006.