1
Part-of-speech tagging and chunking with
log-linear models
  • University of Manchester
  • Yoshimasa Tsuruoka

2
Outline
  • POS tagging and Chunking for English
  • Conditional Markov Models (CMMs)
  • Dependency Networks
  • Bidirectional CMMs
  • Maximum entropy learning
  • Conditional Random Fields (CRFs)
  • Domain adaptation of a tagger

3
Part-of-speech tagging
The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ
immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN
in/IN monocytes/NNS
  • The tagger assigns a part-of-speech tag to each
    word in the sentence.

4
Algorithms for part-of-speech tagging
  • Tagging speed and accuracy on WSJ

Method                     Tagging speed   Accuracy (%)
Dependency Net (2003)      Slow?           97.24
SVM (2004)                 Fast            97.16
Perceptron (2002)          ?               97.11
Bidirectional CMM (2005)   Fast            97.10
HMM (2000)                 Very fast       96.7
CMM (1998)                 Fast            96.6
(* evaluated on a different portion of WSJ)
5
Chunking (shallow parsing)
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow]
[PP to] [NP only 1.8 billion] [PP in] [NP September] .
  • A chunker (shallow parser) segments a sentence
    into non-recursive phrases

6
Chunking (shallow parsing)
He/B-NP reckons/B-VP the/B-NP current/I-NP account/I-NP deficit/I-NP
will/B-VP narrow/I-VP to/B-PP only/B-NP 1.8/I-NP billion/I-NP
in/B-PP September/B-NP ./O
  • Chunking tasks can be converted into a standard tagging task via a
    B/I/O encoding (a minimal conversion sketch follows this list)
  • Different approaches
  • Sliding window
  • Semi-Markov CRF
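
As a concrete illustration of the chunk-to-tag conversion (a minimal
sketch, not the presenter's code; the chunk structure is taken from the
example above):

def chunks_to_bio(chunks):
    """chunks: list of (label_or_None, tokens) in sentence order."""
    tags = []
    for label, tokens in chunks:
        for i, _ in enumerate(tokens):
            if label is None:
                tags.append("O")          # token outside any chunk
            else:
                tags.append(("B-" if i == 0 else "I-") + label)
    return tags

sentence = [("NP", ["He"]), ("VP", ["reckons"]),
            ("NP", ["the", "current", "account", "deficit"]),
            ("VP", ["will", "narrow"]), ("PP", ["to"]),
            ("NP", ["only", "1.8", "billion"]), ("PP", ["in"]),
            ("NP", ["September"]), (None, ["."])]
print(chunks_to_bio(sentence))
# ['B-NP', 'B-VP', 'B-NP', 'I-NP', 'I-NP', 'I-NP', 'B-VP', 'I-VP',
#  'B-PP', 'B-NP', 'I-NP', 'I-NP', 'B-PP', 'B-NP', 'O']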

7
Algorithms for chunking
  • Chunking speed and accuracy on Penn Treebank

Method                     Speed    Accuracy
SVM voting (2001)          Slow?    93.91
Perceptron (2003)          ?        93.74
Bidirectional CMM (2005)   Fast     93.70
SVM (2000)                 Fast     93.48
8
Conditional Markov Models (CMMs)
(Figure: chain-structured model over tags t1, t2, t3, each conditioned
on the previous tag and the observation o)
  • Left to right decomposition (with the first-order
    Markov assumption)
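
Written out, the decomposition the slide refers to is (my
reconstruction in standard notation):

  P(t_1, \ldots, t_n \mid o) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, o)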

9
POS tagging with CMMs [Ratnaparkhi 1996, etc.]
  • Left-to-right decomposition
  • The local classifier uses the information on the
    preceding tag.

(Figure: He/PRP runs/VBZ fast/RB, tagged one word at a time from left
to right, with the remaining tags still undecided)
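
A minimal sketch of this left-to-right tagging loop (illustrative only;
the classifier interface is hypothetical, and a real CMM tagger would
keep multiple hypotheses with beam search or Viterbi rather than
committing greedily):

def tag_left_to_right(words, classify):
    # classify(words, i, prev_tag) -> predicted tag for words[i]
    # (hypothetical interface standing in for the maxent local classifier)
    tags, prev = [], "<START>"
    for i in range(len(words)):
        tag = classify(words, i, prev)   # uses the preceding tag as a feature
        tags.append(tag)
        prev = tag
    return tags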
10
Examples of the features for local classification
He runs fast
Word unigram      wi, wi-1, wi+1
Word bigram       wi-1 wi, wi wi+1
Previous tag      ti-1
Tag/word          ti-1 wi
Prefix/suffix     up to length 10
Lexical features  hyphen, number, etc.
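
The templates above could be realized by a feature extractor along
these lines (a sketch; the string encodings of the features are my own,
not taken from the tagger):

def extract_features(words, i, prev_tag):
    w = words[i]
    w_prev = words[i - 1] if i > 0 else "<BOS>"
    w_next = words[i + 1] if i + 1 < len(words) else "<EOS>"
    feats = [
        "w0=" + w, "w-1=" + w_prev, "w+1=" + w_next,                 # word unigrams
        "w-1_w0=" + w_prev + "_" + w, "w0_w+1=" + w + "_" + w_next,  # word bigrams
        "t-1=" + prev_tag,                                            # previous tag
        "t-1_w0=" + prev_tag + "_" + w,                               # tag/word
        "has_hyphen=" + str("-" in w),                                # lexical features
        "has_digit=" + str(any(c.isdigit() for c in w)),
    ]
    for k in range(1, min(10, len(w)) + 1):     # prefixes/suffixes up to length 10
        feats.append("prefix=" + w[:k])
        feats.append("suffix=" + w[-k:])
    return feats

# extract_features(["He", "runs", "fast"], 1, "PRP") gives the features
# used to classify "runs" when the previous tag is PRP.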
11
POS tagging with Dependency Networks [Toutanova et al. 2003]
(Figure: chain over tags t1, t2, t3 with dependencies in both
directions)
  • Use the information on the succeeding tag as well

The succeeding tag can be used as a feature in the local
classification model, but the resulting product of local scores is no
longer a proper probability.
12
POS tagging with a Cyclic Dependency Network [Toutanova et al. 2003]
(Figure: cyclic dependencies among the tags t1, t2, t3)
  • Training cost is small (almost equal to that of CMMs).
  • Decoding can be performed with dynamic programming, but it is
    still expensive.
  • Collusion: the model can lock onto conditionally consistent but
    jointly unlikely sequences.

13
Bidirectional CMMs [Tsuruoka and Tsujii, 2005]
  • Possible decomposition structures
  • Bidirectional CMMs
  • We can find the best structure and tag
    sequences in polynomial time

(Figure: four possible decomposition structures (a)-(d) over the tags
t1, t2, t3, differing in the direction in which each local classifier
conditions on its neighboring tags)
14
Bidirectional CMMs
  • Another way of decomposition
  • The local classifiers have the information about
    the tags on both sides when tagging the second
    word.

(Figure: He/PRP ... runs/? ... fast/RB — the middle word is tagged
last, so its classifier can condition on the tags already assigned on
both sides)
15
Outline
  • POS tagging and Chunking for English
  • Conditional Markov Models (CMMs)
  • Dependency Networks
  • Bidirectional CMMs
  • Maximum entropy learning
  • Conditional Random Fields (CRFs)
  • Domain adaptation of a tagger

16
Maximum entropy learning
  • Log-linear modeling

(The slide shows the log-linear formula as an image, with the feature
function and the feature weight labeled.)
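
The standard log-linear (maximum entropy) form it refers to is

  p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
  \qquad Z(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big),

where f_i(x, y) is a feature function and \lambda_i its weight.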
17
Maximum entropy learning
  • Maximum likelihood estimation
  • Find the parameters that maximize the (log-)
    likelihood of the training data
  • Regularization
  • Gaussian prior [Berger et al., 1996]
  • Inequality constraints [Kazama and Tsujii, 2005]
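
Concretely (my reconstruction; the slide shows this as an image), the
training objective with a Gaussian prior is the penalized
log-likelihood

  L(\lambda) = \sum_{j} \log p_\lambda(y_j \mid x_j) \;-\; \sum_i \frac{\lambda_i^2}{2\sigma^2},

maximized over the weights \lambda.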

18
Parameter estimation
  • Algorithms specific to maximum entropy
  • GIS [Darroch and Ratcliff, 1972], IIS [Della Pietra et al., 1997]
  • General-purpose algorithms for numerical optimization
  • BFGS [Nocedal and Wright, 1999], LMVM [Benson and More, 2001]
  • You only need to provide the objective function and its gradient
    (see the sketch after this list)
  • Likelihood of the training samples
  • Model expectation of each feature
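
A toy sketch of what "objective function plus gradient" means in
practice (illustrative only; it uses dense toy data and scipy's L-BFGS
rather than the optimizers named above):

import numpy as np
from scipy.optimize import minimize

def neg_objective_and_grad(lam, feats, gold, sigma2=1.0):
    # feats[j, y] is the feature vector f(x_j, y); gold[j] is the correct class.
    scores = feats @ lam                                   # (samples, classes)
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    idx = np.arange(len(gold))
    loglik = np.log(probs[idx, gold]).sum() - (lam ** 2).sum() / (2 * sigma2)
    empirical = feats[idx, gold].sum(axis=0)               # observed feature counts
    expected = np.einsum("jc,jcf->f", probs, feats)        # model expectations
    grad = empirical - expected - lam / sigma2
    return -loglik, -grad                                  # minimize the negative

feats = np.random.rand(50, 3, 8)        # 50 samples, 3 classes, 8 features (toy data)
gold = np.random.randint(0, 3, size=50)
result = minimize(neg_objective_and_grad, np.zeros(8), args=(feats, gold),
                  jac=True, method="L-BFGS-B")
print(result.x)                         # estimated feature weights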

19
Computing likelihood and model expectation
  • Example
  • Two possible tags: Noun and Verb
  • Two types of features: word and suffix

(Figure: the sentence "He opened it" with Noun/Verb tag candidates and
the per-tag word and suffix features)
20
Conditional Random Fields (CRFs)
  • A single log-linear model on the whole sentence
  • One can use exactly the same techniques as
    maximum entropy learning to estimate the
    parameters.
  • However, the number of classes (possible tag sequences) is huge,
    and naive estimation is impossible in practice.

21
Conditional Random Fields (CRFs)
  • Solution
  • Let's restrict the types of features
  • Then, you can use a dynamic programming algorithm
    that drastically reduces the amount of
    computation
  • Features you can use (in first-order CRFs)
  • Features defined on the tag
  • Features defined on the adjacent pair of tags

22
Features
  • Feature weights are associated with states and edges

(Figure: Noun/Verb lattice over the sentence "He has opened it";
example state feature: w0=He & tag=Noun; example edge feature:
tag_left=Noun & tag_right=Noun)
23
A naive way of calculating Z(x)
All 16 tag sequences for "He has opened it" and their unnormalized
scores:

He      has     opened   it        score
Noun    Noun    Noun     Noun       7.2
Verb    Noun    Noun     Noun       4.1
Noun    Noun    Noun     Verb       1.3
Verb    Noun    Noun     Verb       0.8
Noun    Noun    Verb     Noun       4.5
Verb    Noun    Verb     Noun       9.7
Noun    Noun    Verb     Verb       0.9
Verb    Noun    Verb     Verb       5.5
Noun    Verb    Noun     Noun       2.3
Verb    Verb    Noun     Noun       5.7
Noun    Verb    Noun     Verb      11.2
Verb    Verb    Noun     Verb       4.3
Noun    Verb    Verb     Noun       3.4
Verb    Verb    Verb     Noun       2.2
Noun    Verb    Verb     Verb       2.5
Verb    Verb    Verb     Verb       1.9
Sum (= Z(x))                       67.5
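
The enumeration above amounts to the following brute-force computation
(a sketch; the scoring function is a stand-in for the exponentiated
sums of state and edge weights):

from itertools import product

def brute_force_Z(words, tagset, seq_score):
    # seq_score(words, tags) -> unnormalized score of one complete tag sequence
    return sum(seq_score(words, tags)
               for tags in product(tagset, repeat=len(words)))

# For "He has opened it" with {Noun, Verb} this sums 2**4 = 16 terms
# (the 67.5 above); for 45 Penn Treebank tags and a 25-word sentence it
# would be 45**25 terms, which is why dynamic programming is needed.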
24
Dynamic programming
  • Results of intermediate computation can be reused.

(Figure: Noun/Verb lattice for "He has opened it"; the forward pass
accumulates scores from left to right)
25
Dynamic programming
  • Results of intermediate computation can be reused.

(Figure: the same lattice; the backward pass accumulates scores from
right to left)
26
Dynamic programming
  • Computing marginal distribution

(Figure: the marginal probability of a tag at a given position combines
the forward and backward scores at that node)
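
A minimal sketch of the forward-backward computation the last three
slides illustrate (my own illustration; production CRF code works in
log space to avoid overflow):

import numpy as np

def forward_backward(state_score, edge_score):
    # state_score[i, t]: exponentiated score of tag t at position i
    # edge_score[s, t]:  exponentiated score of the edge from tag s to tag t
    n, k = state_score.shape
    alpha = np.zeros((n, k))
    beta = np.zeros((n, k))
    alpha[0] = state_score[0]
    for i in range(1, n):                              # forward pass
        alpha[i] = state_score[i] * (alpha[i - 1] @ edge_score)
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):                     # backward pass
        beta[i] = edge_score @ (state_score[i + 1] * beta[i + 1])
    Z = alpha[-1].sum()                                # partition function Z(x)
    marginals = alpha * beta / Z                       # P(tag at position i | x)
    return Z, marginals

# Each pass is O(n * k^2) instead of the O(k^n) naive enumeration.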
27
Maximum entropy learning and Conditional Random
Fields
  • Maximum entropy learning
  • Log-linear modeling + MLE
  • Parameter estimation
  • Likelihood of each sample
  • Model expectation of each feature
  • Conditional Random Fields
  • Log-linear modeling on the whole sentence
  • Features are defined on states and edges
  • Dynamic programming

28
Named Entity Recognition
We have shown that [protein interleukin-1] ([protein IL-1]) and
[protein IL-2] control [DNA IL-2 receptor alpha (IL-2R alpha) gene]
transcription in [cell_line CD4-CD8-murine T lymphocyte precursors].
  • A term consists of multiple tokens
  • We want to define features on a term rather than on a token

Semi-Markov CRFs [Sarawagi 2004]
29
Algorithms for Biomedical Named Entity Recognition
  • Shared-task data from the COLING 2004 BioNLP workshop

Method                                     Recall  Precision  F-score
SVM+HMM (2004)                             76.0    69.4       72.6
Semi-Markov CRF [Okanohara et al., 2006]   72.7    70.4       71.5
Sliding window                             75.8    67.5       71.4
MEMM (2004)                                71.6    68.6       70.1
CRF (2004)                                 70.3    69.3       69.8
30
Outline
  • POS tagging and Chunking for English
  • Conditional Markov Models (CMMs)
  • Dependency Networks
  • Bidirectional CMMs
  • Maximum entropy learning
  • Conditional Random Fields (CRFs)
  • Domain adaptation of a tagger

31
Domain adaptation
  • Large training sets are available for general domains (e.g., the
    Penn Treebank WSJ)
  • NLP tools trained on general-domain data are less accurate on
    biomedical text
  • Developing domain-specific training data requires considerable
    human effort

32
Tagging errors made by a tagger trained on WSJ
and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ
two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN
B/NN enhancers/NNS
by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN
to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN
in/IN vitro/NN by/IN
  • Accuracy of the tagger on the GENIA POS corpus: 84.4%

33
Re-training of maximum entropy models
  • The taggers are trained as maximum entropy models
  • Maximum entropy models can be adapted to a target domain by
    re-training them with domain-specific data

(The slide shows the log-linear model again, labeling the feature
functions, which are given by the developer, and the model parameters,
which are re-estimated on the new data.)
34
Methods for domain adaptation
  • Combined training data: a model is trained from scratch on the
    original data plus the domain-specific data
  • Reference distribution: the original model is used as a reference
    probability distribution for the domain-specific model (a formula
    sketch follows this list)
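
A common way to write the reference-distribution model (my
reconstruction; the notation is not from the slides) is

  p(y \mid x) = \frac{q_0(y \mid x)\, \exp\big( \sum_i \lambda_i f_i(x, y) \big)}{Z(x)},
  \qquad Z(x) = \sum_{y'} q_0(y' \mid x)\, \exp\big( \sum_i \lambda_i f_i(x, y') \big),

where q_0 is the model trained on the original domain, kept fixed, and
only the weights \lambda_i are re-estimated on the domain-specific
data.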

35
Adaptation of the part-of-speech tagger
  • The relationships among training and test data are evaluated on
    the following corpora:
  • WSJ: Penn Treebank WSJ
  • GENIA: GENIA POS corpus [Kim et al., 2003]
  • 2,000 MEDLINE abstracts selected with the MeSH terms Human, Blood
    cells, and Transcription factors
  • PennBioIE: Penn BioIE corpus [Kulick et al., 2004]
  • 1,100 MEDLINE abstracts about inhibition of the cytochrome P450
    family of enzymes
  • 1,157 MEDLINE abstracts about the molecular genetics of cancer
  • Fly: 200 MEDLINE abstracts on Drosophila melanogaster

36
Training and test sets
  • Training sets

              tokens     sentences
  WSJ         912,344    38,219
  GENIA       450,492    18,508
  PennBioIE   641,838    29,422
  Fly           1,024    (only one figure is given on the slide)

  • Test sets

              tokens     sentences
  WSJ         129,654    5,462
  GENIA        50,562    2,036
  PennBioIE    70,713    3,270
  Fly           7,615    326
37
Experimental results
                        Accuracy                               Training time
  Training data         WSJ      GENIA    PennBioIE  Fly       (sec.)
  WSJ+GENIA+PennBioIE   96.68    98.10    97.65      96.35     -
  Fly only              -        -        -          93.91     -
  Combined              96.69    98.12    97.65      97.94     30,632
  Ref. dist.            95.38    98.17    96.93      98.08     21
38
Corpus size vs. accuracy (combined training data)
39
Corpus size vs. accuracy (reference distribution)
40
Summary
  • POS tagging
  • MEMM-like approaches achieve good performance
    with reasonable computational cost. CRFs seem to
    be too computationally expensive at present.
  • Chunking
  • CRFs yield good performance for NP chunking.
    Semi-Markov CRFs are promising, but we need to
    somehow reduce computational cost.
  • Domain Adaptation
  • One can easily use the information about the
    original domain as the reference distribution.

41
References
  • A. L. Berger, S. A. Della Pietra, and V. J. Della
    Pietra. (1996). A maximum entropy approach to
    natural language processing. Computational
    Linguistics.
  • Adwait Ratnaparkhi. (1996). A Maximum Entropy
    Part-Of-Speech Tagger. Proceedings of EMNLP.
  • Thorsten Brants. (2000). TnT: A Statistical
    Part-Of-Speech Tagger. Proceedings of ANLP.
  • Taku Kudo and Yuji Matsumoto. (2001). Chunking
    with Support Vector Machines. Proceedings of
    NAACL.
  • John Lafferty, Andrew McCallum, and Fernando
    Pereira. (2001). Conditional Random Fields: Probabilistic Models
    for Segmenting and Labeling Sequence Data. Proceedings of ICML.
  • Michael Collins. (2002). Discriminative Training
    Methods for Hidden Markov Models: Theory and
    Experiments with Perceptron Algorithms.
    Proceedings of EMNLP.
  • Fei Sha and Fernando Pereira. (2003). Shallow
    Parsing with Conditional Random Fields.
    Proceedings of HLT-NAACL.
  • K. Toutanova, D. Klein, C. Manning, and Y.
    Singer. (2003). Feature-Rich Part-of-Speech
    Tagging with a Cyclic Dependency Network.
    Proceedings of HLT-NAACL.

42
References
  • Xavier Carreras and Lluis Marquez. (2003). Phrase
    recognition by filtering and ranking with
    perceptrons. Proceedings of RANLP.
  • Jesús Giménez and Lluís Márquez. (2004). SVMTool:
    A general POS tagger generator based on Support
    Vector Machines. Proceedings of LREC.
  • Sunita Sarawagi and William W. Cohen. (2004).
    Semi-Markov conditional random fields for
    information extraction. Proceedings of NIPS 2004.
  • Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2005).
    Bidirectional Inference with the Easiest-First
    Strategy for Tagging Sequence Data. Proceedings
    of HLT/EMNLP.
  • Yuka Tateisi, Yoshimasa Tsuruoka, and Jun'ichi
    Tsujii. (2006). Subdomain adaptation of a POS
    tagger with a small corpus. In Proceedings of
    HLT-NAACL BioNLP Workshop.
  • Daisuke Okanohara, Yusuke Miyao, Yoshimasa
    Tsuruoka, and Jun'ichi Tsujii. (2006). Improving
    the Scalability of Semi-Markov Conditional Random
    Fields for Named Entity Recognition. Proceedings
    of COLING/ACL 2006.