Title: POS tagging and Chunking for Indian Languages
1. POS tagging and Chunking for Indian Languages
- Rajeev Sangal and V. Sriram
- International Institute of Information Technology, Hyderabad
2. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
3. Language
- A unique ability of humans
- Animals have signs (e.g., a sign for danger)
  - But they cannot combine the signs
- Higher animals (apes)
  - Can combine symbols (noun + verb)
  - But can talk only about the here and now
4. Language: Means of communication
[Diagram: speaker's CONCEPT → coding → Language → decoding → listener's CONCEPT]
The concept gets transferred through language.
5. Language: Means of thinking
- "What should I wear today?"
- Can we think without language?
6. What is NLP?
- The process of computer analysis of input provided in a human language is known as Natural Language Processing.
[Diagram: Concept → Language → Intermediate representation, used for processing by a computer]
7. Applications
- Machine translation
- Document Clustering
- Information Extraction / Retrieval
- Text classification
8. MT system: Shakti
- Machine translation system being developed at IIIT Hyderabad.
- A hybrid translation system which uses the combined strengths of linguistic, statistical and machine-learning techniques.
- Integrates the best available NLP technologies.
9. Shakti architecture
English sentence
  → English sentence analysis (Morphology, POS tagging, Chunking, Parsing)
  → Transfer from English to Hindi (Word reordering, Hindi word substitution)
  → Hindi sentence generation (Agreement, Word generation)
  → Hindi sentence
10. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
11. Levels of Language Analysis
- Morphological analysis
- Lexical analysis (POS tagging)
- Syntactic analysis (Chunking, Parsing)
- Semantic analysis (Word sense disambiguation)
- Discourse processing (Anaphora resolution)
Let's take an example sentence: "Children are watching some programmes on television in the house."
12. Chunking
- What are chunks?
- [ Children ] (( are watching )) [ some programmes ] [ on television ] [ in the house ]
- Chunks
  - Noun chunks (NP, PP) in square brackets
  - Verb chunks (VG) in parentheses
- Chunks represent objects
  - Noun chunks represent objects/concepts
  - Verb chunks represent actions
13. Chunking
14. Part-of-Speech tagging
15. Morphological analysis
- Deals with the word form and its analysis.
- The analysis consists of characteristic properties like:
  - Root/Stem
  - Lexical category
  - Gender, number, person
  - etc.
- Example: watching
  - Root: watch
  - Lexical category: verb
  - etc.
16. Morphological analysis
17. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
18. POS Tags in Hindi
- Broad categories are noun, verb, adjective and adverb.
- Words are classified depending on their role, both individually as well as in the sentence.
- Example:
  - vaha aama khaa rahaa hei
  - pron noun verb verb  verb
19. POS Tagging
- Simplest method of POS tagging: looking the word up in a dictionary.
- khaanaa → dictionary lookup → verb (a small lookup sketch follows below)
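A minimal sketch of tagging by dictionary lookup, as described on this slide. The `lexicon` dictionary below is a hypothetical toy example, not part of the original material.

```python
# POS tagging by dictionary lookup: return whatever tag the dictionary lists.
# The lexicon below is a hypothetical toy dictionary, only for illustration.
lexicon = {
    "khaanaa": "verb",   # using only the verb reading for now
    "ladakaa": "noun",
    "vaha": "pron",
}

def lookup_tag(word, lexicon):
    """Return the dictionary tag for a word, or 'UNK' if it is not listed."""
    return lexicon.get(word, "UNK")

print(lookup_tag("khaanaa", lexicon))   # verb
print(lookup_tag("jilebii", lexicon))   # UNK -- the dictionary limits coverage
```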
20. Problems with POS Tagging
- The size of the dictionary limits the scope of the POS tagger.
- Ambiguity
  - The same word can be used both as a noun and as a verb.
  - khaanaa → noun? verb?
21. Problems with POS Tagging
- Ambiguity
- Sentences in which the word khaanaa occurs:
  - tum bahuta achhaa khaanaa banatii ho.
  - mein jilebii khaanaa chaahataa hun.
- Hence, the complete sentence has to be looked at before determining a word's role and thus its POS tag.
22. Problems with POS Tagging
- Many applications need more specific POS tags.
- For example:
  - seba khaa rahaa        → Verb, Finite, Main
  - khaate huE             → Verb, Non-Finite, Adjective
  - khaakara               → Verb, Non-Finite, Adverb
  - sharaaba piinaa sehata → Verb, Non-Finite, Nominal
- Hence, the need for defining a tagset.
23. Defining the tagset for Hindi (IIIT Tagset)
- Issues!
  - Fineness vs. coarseness in linguistic analysis
  - Syntactic function vs. lexical category
  - New tags vs. tags from a standard tagset
24. Fineness vs. Coarseness
- A decision has to be taken whether tags will account for finer distinctions of various features of the parts of speech.
- Need to strike a balance
  - Not too fine, so as not to hamper machine learning
  - Not too coarse, so as not to lose information
25. Fineness vs. Coarseness
- Nouns
  - Plurality information not taken into account (noun singular and noun plural are marked with the same tag).
  - Case information not marked (noun direct and noun oblique are marked with the same tag).
- Adjectives and Adverbs
  - No distinction between comparative and superlative forms.
- Verbs
  - Finer distinctions are made (e.g., VJJ, VRB, VNN).
  - Helps us understand the arguments that a verb form can take.
26. Fineness in Verb tags
- Useful for tasks like dependency parsing, as we have better information about the arguments of a verb form.
- Non-finite forms of verbs which are used as nouns, adjectives or adverbs still retain their verbal property.
  - (VNN → noun formed from a verb)
- Example
  - aasamaana/NN mein/PREP udhane/VNN vaalaa/PREP ghodhaa/NN niiche/NLOC utara/VFM aayaa/VAUX
  - (gloss: sky in flying -- horse down climb came)
27. Syntactic vs. Lexical
- Whether to tag the word based on its lexical or syntactic category.
- Should uttar in "uttar bhaarata" be tagged as a noun or as an adjective?
- The lexical category is given more importance than the syntactic category while marking text manually.
  - Leads to consistency in tagging.
28. New tags vs. tags from a standard tagset
- An entirely new tagset for Indian languages is not desirable, as people are familiar with standard tagsets like the Penn tags.
- The Penn tagset has been used as a benchmark while deciding the tags for Hindi.
- Wherever the Penn tagset has been found inadequate, new tags are introduced.
  - NVB → new tag for kriyamuls or light verbs
  - QW → modified tag for question words
29. IIIT Tagset
- Tags are grouped into three types.
  - Group 1: Adopted from the Penn tagset with minor changes.
  - Group 2: Modifications over the Penn tagset.
  - Group 3: Tags not present in the Penn tagset.
- Examples of tags in Group 3
  - INTF (Intensifier): words like baHuta, kama, etc.
  - NVB, JVB, RBVB: light verbs.
- Detailed guidelines will be put online.
30. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
31. Corpus-based approach
[Diagram: POS-tagged corpus → Learn → POS tagger; Untagged new corpus → POS tagger → Tagged new corpus]
32. POS tagging: A simple method
- Pick the most likely tag for each word.
- Probabilities can be estimated from a tagged corpus.
- Assumes independence between tags.
- Accuracy < 90%
33. POS tagging: A simple method
- Data: Brown corpus, 182159 tagged words (training section), 26 tags
- Example: mujhe xo kitabein xijiye
  - The word xo occurs 267 times:
    - 227 times tagged as QFN
    - 29 times tagged as VAUX
  - P(QFN | W = xo) = 227/267 = 0.8502
  - P(VAUX | W = xo) = 29/267 = 0.1086
  - (a small sketch of this baseline follows below)
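A minimal sketch of the "most likely tag per word" baseline just described, estimating P(tag | word) by relative frequency. The tiny tagged corpus is hypothetical, only to make the code self-contained.

```python
# Most-frequent-tag baseline: for each word, pick argmax_tag count(word, tag).
# The toy corpus below is hypothetical illustration data.
from collections import Counter, defaultdict

tagged_corpus = [
    [("mujhe", "PRP"), ("xo", "QFN"), ("kitabein", "NN"), ("xijiye", "VFM")],
    [("xo", "VAUX"), ("xo", "QFN")],
]

counts = defaultdict(Counter)
for sentence in tagged_corpus:
    for word, tag in sentence:
        counts[word][tag] += 1

def most_frequent_tag(word):
    """Return the most frequent tag seen for the word; fall back to 'NN'."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NN"

# xo was seen twice as QFN and once as VAUX here, so QFN wins.
print(most_frequent_tag("xo"))
```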
34. Corpus-based approaches
- Learning rules
  - Transformation-based error-driven learning (Brill, 1995)
  - Inductive Logic Programming (Cussens, 1997)
- Statistical
  - Hidden Markov models (TnT; Brants, 2000)
  - Maximum entropy (Ratnaparkhi, 1996)
35. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
36. POS tagging using HMMs
Let W be a sequence of words:            W = w1, w2, ..., wn
Let T be the corresponding tag sequence: T = t1, t2, ..., tn
Task: find the T' which maximizes P(T | W)
      T' = argmax_T P(T | W)
37. POS tagging using HMM
By Bayes' rule,
  P(T | W) = P(W | T) P(T) / P(W)
  T' = argmax_T P(W | T) P(T)

  P(T) = P(t1) P(t2 | t1) P(t3 | t1 t2) ... P(tn | t1 ... tn-1)

Applying the bigram approximation,
  P(T) = P(t1) P(t2 | t1) P(t3 | t2) ... P(tn | tn-1)
38. POS tagging using HMM
  P(W | T) = P(w1 | T) P(w2 | w1, T) P(w3 | w1 w2, T) ... P(wn | w1 ... wn-1, T)
           = Π_{i=1..n} P(wi | w1 ... wi-1, T)

Assume P(wi | w1 ... wi-1, T) = P(wi | ti).

Now, T' is the one which maximizes
  P(t1) P(t2 | t1) ... P(tn | tn-1) × P(w1 | t1) P(w2 | t2) ... P(wn | tn)
39. POS tagging using HMM
- If we use a trigram model instead for the tag sequence,
  P(T) = P(t1) P(t2 | t1) P(t3 | t1 t2) ... P(tn | tn-2 tn-1)
- Which model to choose?
  - Depends on the amount of data available!
  - Richer models (trigrams, 4-grams) require lots of data.
40. Chain rule with approximations
  P(W = "vaha ladakaa gayaa", T = "det noun verb")
    = P(det) × P(vaha | det) × P(noun | det) × P(ladakaa | noun) × P(verb | noun) × P(gayaa | verb)
[Diagram: tag sequence det → noun → verb emitting the words vaha, ladakaa, gayaa]
41. Chain rule with approximations: Example
  P(vaha | det)  = (number of times vaha appeared as det in the corpus)
                   / (total number of occurrences of det in the corpus)
  P(verb | noun) = (number of times verb followed noun in the corpus)
                   / (total number of occurrences of noun in the corpus)
If we obtained the following estimates from the corpus:
[Diagram: trellis det → noun → verb emitting vaha, ladakaa, gayaa, with estimated probabilities 0.5, 0.4, 0.99, 0.5, 0.4, 0.02 on the start, transition and emission edges]
  P(W, T) = 0.5 × 0.4 × 0.99 × 0.5 × 0.4 × 0.02 = 0.000792
(a small sketch of this product follows below)
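A short sketch of the chain-rule product on this slide, written as a function over start, transition and emission probability tables. The toy tables below are one assignment of the slide's estimates that reproduces the 0.000792 product; they are illustrative values, not corpus-derived numbers.

```python
# Joint probability under the bigram approximation:
# P(W, T) = Pstart(t1) * B(w1|t1) * Π_i A(t_i|t_{i-1}) * B(w_i|t_i).
# The tables below are toy values chosen to match the slide's worked example.
pstart = {"det": 0.5}
trans = {("det", "noun"): 0.99, ("noun", "verb"): 0.4}
emit = {("det", "vaha"): 0.4, ("noun", "ladakaa"): 0.5, ("verb", "gayaa"): 0.02}

def joint_prob(words, tags, pstart, trans, emit):
    """Multiply start, transition and emission probabilities along the pair."""
    p = pstart.get(tags[0], 0.0) * emit.get((tags[0], words[0]), 0.0)
    for i in range(1, len(words)):
        p *= trans.get((tags[i - 1], tags[i]), 0.0)
        p *= emit.get((tags[i], words[i]), 0.0)
    return p

print(joint_prob(["vaha", "ladakaa", "gayaa"], ["det", "noun", "verb"],
                 pstart, trans, emit))   # 0.5*0.4*0.99*0.5*0.4*0.02 = 0.000792
```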
42. POS tagging using HMM
We need to estimate three types of parameters from the corpus:
  Pstart(ti)   = (no. of sentences which begin with ti) / (no. of sentences)
  P(ti | ti-1) = count(ti-1 ti) / count(ti-1)
  P(wi | ti)   = count(wi with ti) / count(ti)
These parameters can be directly represented using Hidden Markov Models (HMMs), and the best tag sequence can be computed by applying the Viterbi algorithm to the HMM (a counting sketch follows below).
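A minimal sketch of estimating these three parameter types by counting over a tagged corpus, exactly as the formulas above suggest. The two-sentence corpus is a hypothetical stand-in for a real training file.

```python
# Estimate Pstart, transition and emission probabilities by relative frequency.
from collections import Counter

corpus = [
    [("vaha", "det"), ("ladakaa", "noun"), ("gayaa", "verb")],
    [("vaha", "det"), ("ladakaa", "noun"), ("hansaa", "verb")],
]

start_counts, tag_counts = Counter(), Counter()
bigram_counts, emit_counts = Counter(), Counter()

for sent in corpus:
    start_counts[sent[0][1]] += 1          # tag that begins the sentence
    prev = None
    for word, tag in sent:
        tag_counts[tag] += 1
        emit_counts[(tag, word)] += 1
        if prev is not None:
            bigram_counts[(prev, tag)] += 1
        prev = tag

def p_start(t):        # Pstart(t) = #sentences starting with t / #sentences
    return start_counts[t] / len(corpus)

def p_trans(prev, t):  # P(t | prev) = count(prev t) / count(prev)
    return bigram_counts[(prev, t)] / tag_counts[prev]

def p_emit(w, t):      # P(w | t) = count(w tagged t) / count(t)
    return emit_counts[(t, w)] / tag_counts[t]

print(p_start("det"), p_trans("det", "noun"), p_emit("ladakaa", "noun"))
```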
43. Markov models
- Markov chain
  - An event is dependent on the previous events.
- Consider the word sequence: usane → kahaa → ki → ...
Here, each word is dependent on the previous one word. Hence, it is said to form a Markov chain of order 1.
44. Hidden Markov models
[Diagram: observation sequence O = o1, o2, o3, o4 emitted by hidden state sequence X = x1, x2, x3, x4 at positions t = 1, 2, 3, 4]
The hidden states follow the Markov property. Hence, this model is known as a Hidden Markov Model.
45. Hidden Markov models
- Representation of parameters in HMMs
  - Define O(t) = t-th observation
  - Define X(t) = hidden state value at the t-th position

  A  = {a_ab},  a_ab = P(X(t+1) = X_b | X(t) = X_a)              → Transition matrix
  B  = {b_ak},  b_ak = P(O(t) = O_k | X(t) = X_a)                → Emission matrix
  PI = {pi_a},  pi_a = probability of starting with hidden state X_a → PI vector

The model is µ = (A, B, PI)
46. HMM for POS tagging
  Observation sequence  = word sequence
  Hidden state sequence = tag sequence

  Model:
    A  = P(current tag | previous tag)
    B  = P(current word | current tag)
    PI = Pstart(tag)

Tag sequences are mapped to hidden state sequences because they are not observable in the natural language text.
47. Example

  A (transition):      det    noun   verb
            det        .01    .99    .00
            noun       .30    .30    .40
            verb       .40    .40    .20

  PI (start):  det 0.5   noun 0.4   verb .01

  B (emission):        vaha   ladakaa  gayaa
            det        .40    .00      .00
            noun       .00    .015     .0031
            verb       .00    .0004    .020
48. POS tagging using HMM
The problem can be formulated as: given the observation sequence O and the model µ = (A, B, PI), how do we choose the best state sequence X which explains the observations?
- Consider all the possible tag sequences and choose the tag sequence having the maximum joint probability with the observation sequence:
  X_max = argmax_X P(O, X)
- The complexity of this brute-force search is high: of the order N^T (N states, T positions); a brute-force sketch follows below.
- The Viterbi algorithm is used for computational efficiency.
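A brute-force sketch of argmax_X P(O, X): enumerate all N^T tag sequences (3^3 = 27 here) and keep the best. It reuses the example A, PI, B tables from the previous slide; the point is only to show the exponential search that the Viterbi algorithm avoids.

```python
# Enumerate every candidate tag sequence and score it with PI, A and B.
from itertools import product

states = ["det", "noun", "verb"]
PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}
A = {"det":  {"det": 0.01, "noun": 0.99, "verb": 0.00},
     "noun": {"det": 0.30, "noun": 0.30, "verb": 0.40},
     "verb": {"det": 0.40, "noun": 0.40, "verb": 0.20}}
B = {"det":  {"vaha": 0.40, "ladakaa": 0.0,    "gayaa": 0.0},
     "noun": {"vaha": 0.0,  "ladakaa": 0.015,  "gayaa": 0.0031},
     "verb": {"vaha": 0.0,  "ladakaa": 0.0004, "gayaa": 0.020}}

O = ["vaha", "ladakaa", "gayaa"]
best_seq, best_p = None, 0.0
for X in product(states, repeat=len(O)):          # 3^3 = 27 candidate paths
    p = PI[X[0]] * B[X[0]][O[0]]
    for i in range(1, len(O)):
        p *= A[X[i - 1]][X[i]] * B[X[i]][O[i]]
    if p > best_p:
        best_seq, best_p = X, p

print(best_seq, best_p)   # expected: ('det', 'noun', 'verb')
```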
49. POS tagging using HMM
[Trellis: observations O = vaha, ladakaa, hansaa; candidate hidden states {det, noun, verb} at each position t = 1, 2, 3]
27 tag sequences possible! (3^3 = 27 paths)
50. Viterbi algorithm
[Trellis: observations O = vaha, ladakaa, hansaa; candidate hidden states {det, noun, verb} at each position t = 1, 2, 3]
Let α_noun(ladakaa) represent the probability of reaching the state noun along the best possible path while generating the observation ladakaa.
51. Viterbi algorithm
[Trellis diagram as above]
Best probability of reaching a state associated with the first word:
  α_det(vaha) = PI(det) × B[det, vaha]
52. Viterbi algorithm
[Trellis diagram as above]
Probability of reaching a state elsewhere in the best possible way: α_noun(ladakaa)
53. Viterbi algorithm
[Trellis diagram as above]
Probability of reaching a state in the best possible way:
  α_noun(ladakaa) = MAX { α_det(vaha) × A[det, noun] × B[noun, ladakaa], ...
54. Viterbi algorithm
[Trellis diagram as above]
Probability of reaching a state in the best possible way:
  α_noun(ladakaa) = MAX { α_det(vaha)  × A[det, noun]  × B[noun, ladakaa],
                          α_noun(vaha) × A[noun, noun] × B[noun, ladakaa], ...
55. Viterbi algorithm
[Trellis diagram as above]
Probability of reaching a state in the best possible way:
  α_noun(ladakaa) = MAX { α_det(vaha)  × A[det, noun]  × B[noun, ladakaa],
                          α_noun(vaha) × A[noun, noun] × B[noun, ladakaa],
                          α_verb(vaha) × A[verb, noun] × B[noun, ladakaa] }
56. Viterbi algorithm
[Trellis diagram as above]
What is the best way to come to a particular state?
  ψ_noun(ladakaa) = ARGMAX { α_det(vaha)  × A[det, noun]  × B[noun, ladakaa],
                             α_noun(vaha) × A[noun, noun] × B[noun, ladakaa],
                             α_verb(vaha) × A[verb, noun] × B[noun, ladakaa] }
57. Viterbi algorithm
[Trellis diagram as above]
The last tag of the most likely sequence:
  t_last = ARGMAX { α_det(hansaa), α_noun(hansaa), α_verb(hansaa) }
58. Viterbi algorithm
[Trellis diagram as above]
The most likely sequence is obtained by backtracking (a compact Viterbi sketch follows below).
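A compact sketch of the α/ψ recursion described on the preceding slides, with backtracking to recover the best path. It reuses the example PI, A, B tables from the earlier slide (which list gayaa rather than hansaa as the third word); this is an illustration, not the TnT implementation.

```python
# Viterbi: alpha[t][s] is the best-path probability of reaching state s at
# position t; psi[t][s] records which previous state achieved it.
def viterbi(observations, states, PI, A, B):
    alpha = [{s: PI[s] * B[s].get(observations[0], 0.0) for s in states}]
    psi = [{}]
    for t in range(1, len(observations)):
        alpha.append({})
        psi.append({})
        for s in states:
            # best way of reaching state s at position t
            prev_best = max(states, key=lambda p: alpha[t - 1][p] * A[p][s])
            alpha[t][s] = (alpha[t - 1][prev_best] * A[prev_best][s]
                           * B[s].get(observations[t], 0.0))
            psi[t][s] = prev_best
    # last tag of the most likely sequence, then backtrack
    last = max(states, key=lambda s: alpha[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path))

states = ["det", "noun", "verb"]
PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}
A = {"det":  {"det": 0.01, "noun": 0.99, "verb": 0.00},
     "noun": {"det": 0.30, "noun": 0.30, "verb": 0.40},
     "verb": {"det": 0.40, "noun": 0.40, "verb": 0.20}}
B = {"det":  {"vaha": 0.40},
     "noun": {"ladakaa": 0.015, "gayaa": 0.0031},
     "verb": {"ladakaa": 0.0004, "gayaa": 0.020}}

print(viterbi(["vaha", "ladakaa", "gayaa"], states, PI, A, B))
# expected: ['det', 'noun', 'verb']
```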
59. Preliminary Results
- POS tagging for Indian languages
  - Training set: 182159 tokens; test set: 14277 tokens
  - Tags: 26
  - Most-frequent-tag labelling: 78.85%
  - Hidden Markov Models: 86.75%
- Needs improvement!
  - By experimenting with a variety of tags and tokens (some experiments on the chunking task are shown in the following slides).
60. Preliminary Results
- Most common errors seen:
  - NNP, NNC → NN
  - <see the output of the system>
- Opportunity to carry out experiments to eliminate such errors as part of the NLPAI shared task, 2006 (will be introduced at the end).
61. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
62. Introduction to TnT
- Efficient implementation of the Viterbi algorithm for 2nd-order Markov chains (trigram approximation).
- Language independent: can be trained on any corpus.
- Easy to use.
63. Introduction to TnT
- 4 main programs
  - tnt-para: trains the model (parameter generation)
    - tnt-para options <corpus_file>
  - tnt: tagging
    - tnt options <model> <corpus>
  - tnt-diff: compares two files to get precision/recall figures
    - tnt-diff options <original file> <new output file>
  - tnt-wc: counts tokens (words) and types (POS tags/chunk tags) in different files
    - tnt-wc options <corpusfile>
64. Introduction to TnT
- Training file format
  - Token and tag separated by white space.
- Example (a sketch for writing this format follows below):
    <comment>
    nirAlA    NNP
    kI        PREP
    sAhiwya   NN
                      (blank line = new sentence)
    yahAz     PRP
    yaha      PRP
    aXikAMRa  JJ
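A small sketch of writing training and test data in the column format shown above: one token and its tag per line, blank line between sentences, and tokens only for the test file. The sentences and file names are hypothetical.

```python
# Write TnT-style training data (token + tag per line, blank line between
# sentences) and a matching test file that keeps only the tokens.
sentences = [
    [("nirAlA", "NNP"), ("kI", "PREP"), ("sAhiwya", "NN")],
    [("yahAz", "PRP"), ("yaha", "PRP"), ("aXikAMRa", "JJ")],
]

with open("train.tnt", "w", encoding="utf-8") as f:
    for sent in sentences:
        for word, tag in sent:
            f.write(f"{word}\t{tag}\n")
        f.write("\n")          # blank line marks a new sentence

with open("test.tnt", "w", encoding="utf-8") as f:
    for sent in sentences:
        for word, _ in sent:
            f.write(f"{word}\n")
        f.write("\n")
```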
65. Introduction to TnT
- The test file consists of only the first column (the tokens).
- Other files: used to store the model
  - .lex file
  - .123 file
  - .map file
- Demo 1.
66. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
67. An Example (Chunk boundary identification)
68. Chunking with TnT
- Chunk tags
  - STRT: a chunk starts at this token
  - CNT: this token lies in the middle of a chunk
  - STP: this token lies at the end of a chunk
  - STRT_STP: this token forms a chunk of its own
- Chunk tag schemes (a conversion sketch follows below)
  - 2-tag scheme: STRT, CNT
  - 3-tag scheme: STRT, CNT, STP
  - 4-tag scheme: STRT, CNT, STP, STRT_STP
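A sketch of deriving the 4-tag scheme from chunk boundaries, assuming chunks are given as lists of tokens. The example chunks correspond to the bracketed sentence from the earlier chunking slide.

```python
# Convert bracketed chunks into per-token STRT / CNT / STP / STRT_STP tags.
def chunk_tags(chunks):
    tags = []
    for chunk in chunks:
        if len(chunk) == 1:
            tags.append((chunk[0], "STRT_STP"))   # one-token chunk
        else:
            tags.append((chunk[0], "STRT"))       # chunk starts here
            for token in chunk[1:-1]:
                tags.append((token, "CNT"))       # middle of the chunk
            tags.append((chunk[-1], "STP"))       # chunk ends here
    return tags

chunks = [["Children"], ["are", "watching"], ["some", "programmes"],
          ["on", "television"], ["in", "the", "house"]]
for token, tag in chunk_tags(chunks):
    print(token, tag)
```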
69. Input Tokens
- What kinds of input tokens can we use?
  - Word only: simplest
  - POS tag only: use only the part-of-speech tag of the word
  - Combinations of the above
    - Word_POStag: word followed by POS tag
    - POStag_Word: POS tag followed by word
70. Chunking with TnT: Experiments
- Training corpus: 150000 tokens
- Test corpus: 20000 tokens
- A trick to improve learning: train on the larger tagset, then reduce it to the smaller tagset (a reduction sketch follows below).
  - NO LOSS of information, as all the tag schemes convey the same boundary information.
- Best results (precision 85.6%) obtained for:
  - Input tokens of the form Word_POS
  - Learning trick: 4 tags reduced to 2
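A sketch of one plausible reading of the "reduce the larger tagset to the smaller one" trick: after tagging with the 4-tag scheme, every chunk-opening tag maps to STRT and everything else to CNT, giving the 2-tag scheme while preserving chunk boundaries. The mapping is an assumption for illustration, not taken from the slides.

```python
# Map 4-tag chunk output (STRT, CNT, STP, STRT_STP) down to the 2-tag scheme
# (STRT, CNT). Chunk-start information is kept, so boundaries are unchanged.
def reduce_to_two_tags(tagged):
    mapping = {"STRT": "STRT", "STRT_STP": "STRT", "CNT": "CNT", "STP": "CNT"}
    return [(token, mapping[tag]) for token, tag in tagged]

four_tag_output = [("Children", "STRT_STP"), ("are", "STRT"),
                   ("watching", "STP"), ("some", "STRT"), ("programmes", "STP")]
print(reduce_to_two_tags(four_tag_output))
# [('Children', 'STRT'), ('are', 'STRT'), ('watching', 'CNT'), ...]
```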
71. Chunking with TnT: Improvement
- 85.6% is not good enough.
- The model is improved (precision 88.63%) by adding contextual information (POS tags) to the input tokens. Example:
72. Chunking with TnT: Improvements
- For experiments which lead to further improvements in chunk boundary identification, see:
  - Akshay Singh, Sushama Bendre and Rajeev Sangal, "HMM based Chunker for Hindi", in the Second International Joint Conference on Natural Language Processing, Companion Volume including Posters/Demos and Tutorial Abstracts.
73. Chunk labelling: Results
- Chunk labelling
  - Chunks which have been identified have to be labelled as noun chunks, verb chunks, etc.
  - Rule-based chunk labelling performed best.
- Results
  - Final chunk boundary identification accuracy: 92.6%
  - Chunk boundary identification + chunk labelling: 91.5%
74. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
75. Shared task
- For information on the shared task, refer to the flyer on the NLPAI shared task 2006.
76. Thank you