Title: Part-of-speech tagging and chunking with log-linear models
1. Part-of-speech tagging and chunking with log-linear models
- Yoshimasa Tsuruoka
- National Centre for Text Mining (NaCTeM)
- University of Manchester
2. Outline
- POS tagging and chunking for English
- Conditional Markov Models (CMMs)
- Dependency Networks
- Bidirectional CMMs
- Maximum entropy learning
- Conditional Random Fields (CRFs)
- Domain adaptation of a tagger
3. Part-of-speech tagging

The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS

- The tagger assigns a part-of-speech tag to each word in the sentence.
4. Algorithms for part-of-speech tagging
- Tagging speed and accuracy on WSJ

  Method                     Speed      Accuracy
  Dependency Net (2003)      Slow       97.24
  SVM (2004)                 Fast       97.16
  Perceptron (2002)          ?          97.11
  Bidirectional CMM (2005)   Fast       97.10
  HMM (2000)                 Very fast  96.7*
  CMM (1998)                 Fast       96.6*

  (* evaluated on a different portion of WSJ)
5. Chunking (shallow parsing)

[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September] .

- A chunker (shallow parser) segments a sentence into non-recursive phrases.
6. Chunking (shallow parsing)

He/B-NP reckons/B-VP the/B-NP current/I-NP account/I-NP deficit/I-NP will/B-VP narrow/I-VP to/B-PP only/B-NP 1.8/I-NP billion/I-NP in/B-PP September/B-NP ./O

- Chunking tasks can be converted into a standard tagging task with B (begin) / I (inside) labels; a minimal conversion sketch follows below.
- Different approaches
  - Sliding window
  - Semi-Markov CRF
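A minimal sketch (my own illustration, not from the slides) of the chunk-to-tag conversion the first bullet describes; `chunks_to_bio` and the span format are hypothetical names chosen for the example:

```python
# Convert chunk spans into B/I/O tags so that chunking becomes
# a word-level tagging task.
def chunks_to_bio(tokens, chunks):
    """chunks: list of (start, end, label) spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in chunks:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["He", "reckons", "the", "current", "account", "deficit"]
chunks = [(0, 1, "NP"), (1, 2, "VP"), (2, 6, "NP")]
print(list(zip(tokens, chunks_to_bio(tokens, chunks))))
# [('He', 'B-NP'), ('reckons', 'B-VP'), ('the', 'B-NP'), ...]
```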
7. Algorithms for chunking
- Chunking speed and accuracy on Penn Treebank

  Method                     Speed   Accuracy
  SVM voting (2001)          Slow?   93.91
  Perceptron (2003)          ?       93.74
  Bidirectional CMM (2005)   Fast    93.70
  SVM (2000)                 Fast    93.48
8. Conditional Markov Models (CMMs)

[Figure: left-to-right chain over tags t1, t2, t3, all conditioned on the observation o]

- Left-to-right decomposition (with the first-order Markov assumption); written out below
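The factorization the figure depicts, in its standard first-order form:

```latex
P(t_1, \ldots, t_n \mid o) \;=\; \prod_{i=1}^{n} P(t_i \mid t_{i-1}, o)
```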
9. POS tagging with CMMs (Ratnaparkhi 1996, etc.)
- Left-to-right decomposition; a greedy decoding sketch follows below
- The local classifier uses the information on the preceding tag.

[Figure: tagging "He runs fast" left to right; PRP, VBZ, RB are assigned one position at a time, with positions to the right still undecided (?)]
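A minimal sketch (not the author's code) of greedy left-to-right CMM decoding; `toy_local_prob` is a stand-in for a trained local classifier, with scores invented for illustration. A real tagger would use beam search or Viterbi over the local distributions:

```python
TAGSET = ["PRP", "VBZ", "RB"]

def toy_local_prob(tag, word, prev_tag):
    """Stand-in for P(tag | word, prev_tag); hypothetical scores."""
    table = {("He", "<BOS>"): "PRP", ("runs", "PRP"): "VBZ",
             ("fast", "VBZ"): "RB"}
    return 1.0 if table.get((word, prev_tag)) == tag else 0.1

def greedy_cmm_tag(words):
    tags = []
    for word in words:
        prev = tags[-1] if tags else "<BOS>"
        # pick the locally most probable tag given the preceding tag
        tags.append(max(TAGSET, key=lambda t: toy_local_prob(t, word, prev)))
    return tags

print(greedy_cmm_tag(["He", "runs", "fast"]))  # ['PRP', 'VBZ', 'RB']
```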
10. Examples of the features for local classification

  Word unigram      w_i, w_{i-1}, w_{i+1}
  Word bigram       w_{i-1} w_i, w_i w_{i+1}
  Previous tag      t_{i-1}
  Tag/word          t_{i-1} w_i
  Prefix/suffix     up to length 10
  Lexical features  hyphen, number, etc.

[Figure: classifying the tag of "runs" in "He runs fast", with "He" already tagged PRP and the remaining tags undecided]
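A sketch (my own, assuming a string encoding of features; the slide only lists the templates) of how these feature types could be extracted for position i:

```python
def extract_features(words, tags, i):
    w = lambda j: words[j] if 0 <= j < len(words) else "<PAD>"
    prev_tag = tags[i - 1] if i > 0 else "<BOS>"
    feats = {
        f"w0={w(i)}", f"w-1={w(i-1)}", f"w+1={w(i+1)}",      # word unigrams
        f"w-1w0={w(i-1)}_{w(i)}", f"w0w+1={w(i)}_{w(i+1)}",  # word bigrams
        f"t-1={prev_tag}",                                    # previous tag
        f"t-1w0={prev_tag}_{w(i)}",                           # tag/word
    }
    for k in range(1, min(10, len(w(i))) + 1):                # prefix/suffix up to length 10
        feats.add(f"pre={w(i)[:k]}")
        feats.add(f"suf={w(i)[-k:]}")
    if "-" in w(i):                                           # lexical features
        feats.add("has_hyphen")
    if any(c.isdigit() for c in w(i)):
        feats.add("has_digit")
    return feats

print(sorted(extract_features(["He", "runs", "fast"], ["PRP"], 1)))
```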
11. POS tagging with Dependency Networks (Toutanova et al. 2003)

[Figure: each tag ti depends on both the preceding and the following tag]

- Uses the information on the following tag as well
  - You can use the following tag as a feature in the local classification model
  - The product of these local models is no longer a probability
12. POS tagging with a Cyclic Dependency Network (Toutanova et al. 2003)

[Figure: cyclic dependency network over tags t1, t2, t3]

- Training cost is small (almost equal to CMMs).
- Decoding can be performed with dynamic programming, but it is still expensive.
- Collusion: the model can lock onto conditionally consistent but jointly unlikely sequences.
13. Bidirectional CMMs (Tsuruoka and Tsujii, 2005)
- Possible decomposition structures

[Figure: four decomposition structures (a)-(d) over tags t1, t2, t3, combining left-to-right and right-to-left conditioning]

- Bidirectional CMMs
  - We can find the best structure and tag sequence in polynomial time
14. Maximum entropy learning

[Equation: the log-linear model, built from feature functions and feature weights]
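The equation itself did not survive extraction; the standard log-linear form it refers to, with feature functions f_i and feature weights λ_i, is:

```latex
p(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\left(\sum_i \lambda_i f_i(x, y)\right),
\qquad
Z(x) \;=\; \sum_{y'} \exp\!\left(\sum_i \lambda_i f_i(x, y')\right)
```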
15. Maximum entropy learning
- Maximum likelihood estimation
  - Find the parameters that maximize the (log-)likelihood of the training data
- Smoothing
  - Gaussian prior (Berger et al., 1996); the penalized objective is shown below
  - Inequality constraints (Kazama and Tsujii, 2005)
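The Gaussian-prior objective in its standard form (stated here for completeness, not transcribed from the slide), where D is the training data and σ² the prior variance:

```latex
\mathcal{L}(\lambda) \;=\; \sum_{(x,\,y) \in D} \log p(y \mid x)
\;-\; \sum_i \frac{\lambda_i^2}{2\sigma^2}
```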
16. Parameter estimation
- Algorithms for maximum entropy
  - GIS (Darroch and Ratcliff, 1972), IIS (Della Pietra et al., 1997)
- General-purpose algorithms for numerical optimization
  - BFGS (Nocedal and Wright, 1999), LMVM (Benson and Moré, 2001)
  - You need to provide the objective function and its gradient (written out below)
    - Likelihood of training samples
    - Model expectation of each feature
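For these optimizers the gradient has the familiar closed form, empirical feature counts minus model expectations (a standard result):

```latex
\frac{\partial \mathcal{L}}{\partial \lambda_i}
\;=\; \underbrace{\sum_{(x,\,y) \in D} f_i(x, y)}_{\text{empirical count}}
\;-\; \underbrace{\sum_{(x,\,\cdot) \in D} \sum_{y'} p(y' \mid x)\, f_i(x, y')}_{\text{model expectation}}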
17. Computing likelihood and model expectation
- Example
  - Two possible tags: Noun and Verb
  - Two types of features: word and suffix

[Figure: the sentence "He opened it" with candidate tags at each position and the features that fire for tag=noun and tag=verb]
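A toy sketch (my own numbers and feature encoding, not the slide's) of both quantities for a two-tag model with word and suffix features:

```python
import math

def features(word, tag):
    return [f"w={word},t={tag}", f"suf={word[-2:]},t={tag}"]

def prob(word, tag, weights, tags=("Noun", "Verb")):
    score = {t: math.exp(sum(weights.get(f, 0.0) for f in features(word, t)))
             for t in tags}
    return score[tag] / sum(score.values())

weights = {"w=opened,t=Verb": 1.0, "suf=ed,t=Verb": 0.5}  # invented weights
data = [("He", "Noun"), ("opened", "Verb"), ("it", "Noun")]

# log-likelihood of the training samples
log_lik = sum(math.log(prob(w, t, weights)) for w, t in data)

# model expectation of one (binary) feature:
# sum over samples of p(t | w) where the feature fires
feat = "suf=ed,t=Verb"
expectation = sum(prob(w, t, weights)
                  for w, _ in data for t in ("Noun", "Verb")
                  if feat in features(w, t))
print(log_lik, expectation)
```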
18. Conditional Random Fields (CRFs)
- A single log-linear model over the whole sentence (written out below)
- One can use exactly the same techniques as maximum entropy learning to estimate the parameters.
- However, the number of classes, here all possible tag sequences, is HUGE, and it is impossible in practice to do it in a naive way.
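The whole-sentence model, in the same log-linear form as before but with the output ranging over entire tag sequences:

```latex
p(t_1, \ldots, t_n \mid w) \;=\; \frac{1}{Z(w)}
\exp\!\left(\sum_i \lambda_i f_i(w, t_1, \ldots, t_n)\right)
```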
19. Conditional Random Fields (CRFs)
- Solution
  - Let's restrict the types of features (see the decomposition below)
  - Then, you can use a dynamic programming algorithm that drastically reduces the amount of computation
- Features you can use (in first-order CRFs)
  - Features defined on a single tag
  - Features defined on an adjacent pair of tags
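With that restriction the global feature sum decomposes over positions; g and h below are just my names for the state and transition features, not the slides' notation:

```latex
\sum_i \lambda_i f_i(w, t_1, \ldots, t_n)
\;=\; \sum_{j=1}^{n} \left(
    \sum_k \lambda_k \, g_k(w, t_j, j)
    \;+\; \sum_l \mu_l \, h_l(t_{j-1}, t_j)
  \right)
```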
20. Features
- Feature weights are associated with states and edges
  - State feature example: W0=He & Tag=Noun
  - Edge feature example: Tag_left=Noun & Tag_right=Noun

[Figure: lattice over "He has opened it" with Noun/Verb nodes at each position and edges between adjacent positions]
21. A naive way of calculating Z(x)
- Enumerate all 2^4 = 16 tag sequences for "He has opened it" and sum their unnormalized scores:

  Noun Noun Noun Noun   7.2      Verb Noun Noun Noun   4.1
  Noun Noun Noun Verb   1.3      Verb Noun Noun Verb   0.8
  Noun Noun Verb Noun   4.5      Verb Noun Verb Noun   9.7
  Noun Noun Verb Verb   0.9      Verb Noun Verb Verb   5.5
  Noun Verb Noun Noun   2.3      Verb Verb Noun Noun   5.7
  Noun Verb Noun Verb  11.2      Verb Verb Noun Verb   4.3
  Noun Verb Verb Noun   3.4      Verb Verb Verb Noun   2.2
  Noun Verb Verb Verb   2.5      Verb Verb Verb Verb   1.9

  Sum: Z(x) = 67.5
22. Dynamic programming
- Results of intermediate computation can be reused (a sketch of the forward algorithm follows below).

[Figure: lattice over "He has opened it" with Noun/Verb nodes; forward scores are accumulated left to right]
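A minimal sketch (toy scores invented for illustration, not the author's code) of the forward algorithm this slide refers to; `state_score` and `trans_score` are hypothetical exponentiated feature weights. It computes the same Z(x) as the 16-sequence enumeration on the previous slide, in time linear in sentence length:

```python
TAGS = ["Noun", "Verb"]

def forward_z(state_score, trans_score, n):
    """state_score[j][t] and trans_score[s][t] are exp'd feature weights."""
    alpha = {t: state_score[0][t] for t in TAGS}
    for j in range(1, n):
        # reuse the summed scores of all prefixes ending in each tag
        alpha = {t: state_score[j][t] *
                    sum(alpha[s] * trans_score[s][t] for s in TAGS)
                 for t in TAGS}
    return sum(alpha.values())

# hypothetical scores for a 4-word sentence such as "He has opened it"
state = [{"Noun": 2.0, "Verb": 0.5}, {"Noun": 0.7, "Verb": 1.5},
         {"Noun": 0.4, "Verb": 2.2}, {"Noun": 1.8, "Verb": 0.6}]
trans = {"Noun": {"Noun": 1.1, "Verb": 0.9},
         "Verb": {"Noun": 1.3, "Verb": 0.4}}
print(forward_z(state, trans, 4))  # equals the sum over all 16 sequences
```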
23. Maximum entropy learning and Conditional Random Fields
- Maximum entropy learning
  - Log-linear modeling + MLE
  - Parameter estimation requires
    - Likelihood of each sample
    - Model expectation of each feature
- Conditional Random Fields
  - Log-linear modeling on the whole sentence
  - Features are defined on states and edges
  - Dynamic programming
24. Named Entity Recognition

We have shown that interleukin-1 (IL-1) and IL-2 control IL-2 receptor alpha (IL-2R alpha) gene transcription in CD4-CD8- murine T lymphocyte precursors.

- protein: interleukin-1, IL-1, IL-2
- DNA: IL-2 receptor alpha (IL-2R alpha) gene
- cell_line: CD4-CD8- murine T lymphocyte precursors
25. Algorithms for Biomedical Named Entity Recognition
- Shared task data for the COLING 2004 BioNLP workshop

  Method                                     Recall  Precision  F-score
  SVM+HMM (2004)                             76.0    69.4       72.6
  Semi-Markov CRF (Okanohara et al., 2006)   72.7    70.4       71.5
  Sliding window                             75.8    67.5       70.8
  MEMM (2004)                                71.6    68.6       70.1
  CRF (2004)                                 70.3    69.3       69.8
26. Domain adaptation
- Large training data is available for general domains (e.g., Penn Treebank WSJ)
- NLP tools trained on general-domain data are less accurate on biomedical text
- Developing domain-specific annotated data requires considerable human effort
27. Tagging errors made by a tagger trained on WSJ

  and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
  two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS
  by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
  to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN
  Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN

- Accuracy of the tagger on the GENIA POS corpus: 84.4%
28. Re-training of maximum entropy models
- Taggers trained as maximum entropy models
- Adapting maximum entropy models to target domains by re-training with domain-specific data

[Equation: the log-linear model; the feature functions are given by the developer, while the model parameters are re-estimated]
29. Methods for domain adaptation
- Combined training data: a model is trained from scratch with the original and domain-specific data
- Reference distribution: the original model is used as a reference probability distribution for the domain-specific model (formulated below)
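A standard way to write the reference-distribution model (formulation reconstructed, not transcribed from the slides): the original model q(y|x) is kept fixed and only the new weights λ are estimated on the domain-specific data:

```latex
p(y \mid x) \;=\;
\frac{q(y \mid x)\, \exp\!\left(\sum_i \lambda_i f_i(x, y)\right)}
     {\sum_{y'} q(y' \mid x)\, \exp\!\left(\sum_i \lambda_i f_i(x, y')\right)}
```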
30. Adaptation of the part-of-speech tagger
- Relationships among training and test data are evaluated for the following corpora
  - WSJ: Penn Treebank WSJ
  - GENIA: GENIA POS corpus (Kim et al., 2003)
    - 2,000 MEDLINE abstracts selected by the MeSH terms Human, Blood cells, and Transcription factors
  - PennBioIE: Penn BioIE corpus (Kulick et al., 2004)
    - 1,100 MEDLINE abstracts about inhibition of the cytochrome P450 family of enzymes
    - 1,157 MEDLINE abstracts about molecular genetics of cancer
  - Fly: 200 MEDLINE abstracts on Drosophila melanogaster
31. Training and test sets

  Training     tokens    sentences
  WSJ          912,344   38,219
  GENIA        450,492   18,508
  PennBioIE    641,838   29,422
  Fly          1,024

  Test         tokens    sentences
  WSJ          129,654   5,462
  GENIA        50,562    2,036
  PennBioIE    70,713    3,270
  Fly          7,615     326
32. Experimental results
- Accuracy on each test set; training time in seconds

  Training data          WSJ     GENIA   PennBioIE   Fly     Training time (sec.)
  WSJ+GENIA+PennBioIE    96.68   98.10   97.65       96.35
  Fly only                                           93.91
  Combined               96.69   98.12   97.65       97.94   30,632
  Ref. dist              95.38   98.17   96.93       98.08   21
33. Corpus size vs. accuracy (combined training data)

[Figure: tagging accuracy as a function of domain-specific corpus size]

34. Corpus size vs. accuracy (reference distribution)

[Figure: tagging accuracy as a function of domain-specific corpus size]
35. Summary
- POS tagging
  - MEMM-like approaches achieve good performance with reasonable computational cost. CRFs seem to be too computationally expensive at present.
- Chunking
  - CRFs yield good performance for NP chunking. Semi-Markov CRFs are promising, but we need to somehow reduce their computational cost.
- Domain adaptation
  - One can easily use the information about the original domain as the reference distribution.
36. References
- A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics.
- Adwait Ratnaparkhi. (1996). A Maximum Entropy Part-Of-Speech Tagger. Proceedings of EMNLP.
- Thorsten Brants. (2000). TnT: A Statistical Part-Of-Speech Tagger. Proceedings of ANLP.
- Taku Kudo and Yuji Matsumoto. (2001). Chunking with Support Vector Machines. Proceedings of NAACL.
- John Lafferty, Andrew McCallum, and Fernando Pereira. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML.
- Michael Collins. (2002). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Proceedings of EMNLP.
- Fei Sha and Fernando Pereira. (2003). Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL.
- K. Toutanova, D. Klein, C. Manning, and Y. Singer. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Proceedings of HLT-NAACL.
37. References
- Xavier Carreras and Lluís Màrquez. (2003). Phrase Recognition by Filtering and Ranking with Perceptrons. Proceedings of RANLP.
- Jesús Giménez and Lluís Màrquez. (2004). SVMTool: A General POS Tagger Generator Based on Support Vector Machines. Proceedings of LREC.
- Sunita Sarawagi and William W. Cohen. (2004). Semi-Markov Conditional Random Fields for Information Extraction. Proceedings of NIPS 2004.
- Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2005). Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. Proceedings of HLT/EMNLP.
- Yuka Tateisi, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Subdomain Adaptation of a POS Tagger with a Small Corpus. Proceedings of the HLT-NAACL BioNLP Workshop.
- Daisuke Okanohara, Yusuke Miyao, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. Proceedings of COLING/ACL 2006.