Title: POS tagging and Chunking for Indian Languages
1. POS tagging and Chunking for Indian Languages
- Rajeev Sangal and V. Sriram
- International Institute of Information Technology, Hyderabad
2. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
3. Language
- A unique ability of humans
- Animals have signs (e.g., a sign for danger)
  - But they cannot combine the signs
- Higher animals (apes)
  - Can combine symbols (noun + verb)
  - But can talk only about the here and now
4. Language: Means of communication
[Diagram: speaker's CONCEPT → coding → Language → decoding → listener's CONCEPT]
The concept gets transferred through language.
5. Language: Means of thinking
- "What should I wear today?"
- Can we think without language?
6. What is NLP?
- The process of computer analysis of input provided in a human language is known as Natural Language Processing.
[Diagram: Concept → Language → Intermediate representation, used for processing by a computer]
7. Applications
- Machine translation
- Document Clustering
- Information Extraction / Retrieval
- Text classification
8. MT system: Shakti
- Machine translation system being developed at IIIT Hyderabad.
- A hybrid translation system which uses the combined strengths of linguistic, statistical and machine-learning techniques.
- Integrates the best available NLP technologies.
9. Shakti architecture
English sentence
  → English sentence analysis (Morphology, POS tagging, Chunking, Parsing)
  → Transfer from English to Hindi (Word reordering, Hindi word substitution)
  → Hindi sentence generation (Agreement, Word generation)
  → Hindi sentence
10. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
11. Levels of Language Analysis
- Morphological analysis
- Lexical analysis (POS tagging)
- Syntactic analysis (Chunking, Parsing)
- Semantic analysis (Word sense disambiguation)
- Discourse processing (Anaphora resolution)
Let's take an example sentence: "Children are watching some programmes on television in the house."
12. Chunking
- What are chunks?
- [ Children ] (( are watching )) [ some programmes ] [ on television ] [ in the house ]
- Chunks
  - Noun chunks (NP, PP) in square brackets
  - Verb chunks (VG) in parentheses
- Chunks represent objects
  - Noun chunks represent objects/concepts
  - Verb chunks represent actions
13. Chunking
14. Part-of-Speech tagging
15. Morphological analysis
- Deals with the word form and its analysis.
- The analysis consists of characteristic properties like:
  - Root/Stem
  - Lexical category
  - Gender, number, person
  - etc.
- Example: watching
  - Root: watch
  - Lexical category: verb
  - etc.
16. Morphological analysis
17. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
18. POS Tags in Hindi
- Broad categories are noun, verb, adjective and adverb.
- Words are classified depending on their role, both individually as well as in the sentence.
- Example:
  - vaha aama khaa rahaa hei
  - pron noun verb verb  verb
19. POS Tagging
- Simplest method of POS tagging: looking the word up in a dictionary.
- khaanaa → dictionary lookup → verb (a small lookup sketch follows below)
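A minimal sketch of tagging by dictionary lookup, as described on this slide. The `lexicon` dictionary below is a hypothetical toy example, not part of the original material.

```python
# POS tagging by dictionary lookup: return whatever tag the dictionary lists.
# The lexicon below is a hypothetical toy dictionary, only for illustration.
lexicon = {
    "khaanaa": "verb",   # using only the verb reading for now
    "ladakaa": "noun",
    "vaha": "pron",
}

def lookup_tag(word, lexicon):
    """Return the dictionary tag for a word, or 'UNK' if it is not listed."""
    return lexicon.get(word, "UNK")

print(lookup_tag("khaanaa", lexicon))   # verb
print(lookup_tag("jilebii", lexicon))   # UNK -- the dictionary limits coverage
```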
20. Problems with POS Tagging
- The size of the dictionary limits the scope of the POS tagger.
- Ambiguity
  - The same word can be used both as a noun and as a verb.
  - khaanaa → noun? verb?
21. Problems with POS Tagging
- Ambiguity
- Sentences in which the word khaanaa occurs:
  - tum bahuta achhaa khaanaa banatii ho.
  - mein jilebii khaanaa chaahataa hun.
- Hence, the complete sentence has to be looked at before determining a word's role and thus its POS tag.
22. Problems with POS Tagging
- Many applications need more specific POS tags.
- For example:
  - seba khaa rahaa        → Verb, Finite, Main
  - khaate huE             → Verb, Non-Finite, Adjective
  - khaakara               → Verb, Non-Finite, Adverb
  - sharaaba piinaa sehata → Verb, Non-Finite, Nominal
- Hence, the need for defining a tagset.
23. Defining the tagset for Hindi (IIIT Tagset)
- Issues!
  - Fineness vs. coarseness in linguistic analysis
  - Syntactic function vs. lexical category
  - New tags vs. tags from a standard tagset
24. Fineness vs. Coarseness
- A decision has to be taken whether tags will account for finer distinctions of various features of the parts of speech.
- Need to strike a balance
  - Not too fine, so as not to hamper machine learning
  - Not too coarse, so as not to lose information
25. Fineness vs. Coarseness
- Nouns
  - Plurality information not taken into account (noun singular and noun plural are marked with the same tag).
  - Case information not marked (noun direct and noun oblique are marked with the same tag).
- Adjectives and Adverbs
  - No distinction between comparative and superlative forms.
- Verbs
  - Finer distinctions are made (e.g., VJJ, VRB, VNN).
  - Helps us understand the arguments that a verb form can take.
26. Fineness in Verb tags
- Useful for tasks like dependency parsing, as we have better information about the arguments of a verb form.
- Non-finite forms of verbs which are used as nouns, adjectives or adverbs still retain their verbal property.
  - (VNN → noun formed from a verb)
- Example
  - aasamaana/NN mein/PREP udhane/VNN vaalaa/PREP ghodhaa/NN niiche/NLOC utara/VFM aayaa/VAUX
  - (gloss: sky in flying -- horse down climb came)
27. Syntactic vs. Lexical
- Whether to tag the word based on its lexical or syntactic category.
- Should uttar in "uttar bhaarata" be tagged as a noun or as an adjective?
- The lexical category is given more importance than the syntactic category while marking text manually.
  - Leads to consistency in tagging.
28. New tags vs. tags from a standard tagset
- An entirely new tagset for Indian languages is not desirable, as people are familiar with standard tagsets like the Penn tags.
- The Penn tagset has been used as a benchmark while deciding the tags for Hindi.
- Wherever the Penn tagset has been found inadequate, new tags are introduced.
  - NVB → new tag for kriyamuls or light verbs
  - QW → modified tag for question words
29. IIIT Tagset
- Tags are grouped into three types.
  - Group 1: Adopted from the Penn tagset with minor changes.
  - Group 2: Modifications over the Penn tagset.
  - Group 3: Tags not present in the Penn tagset.
- Examples of tags in Group 3
  - INTF (Intensifier): words like baHuta, kama, etc.
  - NVB, JVB, RBVB: light verbs.
- Detailed guidelines will be put online.
30. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
31. Corpus-based approach
[Diagram: POS-tagged corpus → Learn → POS tagger; Untagged new corpus → POS tagger → Tagged new corpus]
32. POS tagging: A simple method
- Pick the most likely tag for each word.
- Probabilities can be estimated from a tagged corpus.
- Assumes independence between tags.
- Accuracy < 90%
33. POS tagging: A simple method
- Data: Brown corpus, 182159 tagged words (training section), 26 tags
- Example: mujhe xo kitabein xijiye
  - The word xo occurs 267 times:
    - 227 times tagged as QFN
    - 29 times tagged as VAUX
  - P(QFN | W = xo) = 227/267 = 0.8502
  - P(VAUX | W = xo) = 29/267 = 0.1086
  - (a small sketch of this baseline follows below)
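A minimal sketch of the "most likely tag per word" baseline just described, estimating P(tag | word) by relative frequency. The tiny tagged corpus is hypothetical, only to make the code self-contained.

```python
# Most-frequent-tag baseline: for each word, pick argmax_tag count(word, tag).
# The toy corpus below is hypothetical illustration data.
from collections import Counter, defaultdict

tagged_corpus = [
    [("mujhe", "PRP"), ("xo", "QFN"), ("kitabein", "NN"), ("xijiye", "VFM")],
    [("xo", "VAUX"), ("xo", "QFN")],
]

counts = defaultdict(Counter)
for sentence in tagged_corpus:
    for word, tag in sentence:
        counts[word][tag] += 1

def most_frequent_tag(word):
    """Return the most frequent tag seen for the word; fall back to 'NN'."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NN"

# xo was seen twice as QFN and once as VAUX here, so QFN wins.
print(most_frequent_tag("xo"))
```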
34. Corpus-based approaches
- Learning rules
  - Transformation-based error-driven learning (Brill, 1995)
  - Inductive Logic Programming (Cussens, 1997)
- Statistical
  - Hidden Markov models (TnT; Brants, 2000)
  - Maximum entropy (Ratnaparkhi, 1996)
35. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
36. POS tagging using HMMs
Let W be a sequence of words:            W = w1, w2, ..., wn
Let T be the corresponding tag sequence: T = t1, t2, ..., tn
Task: find the T' which maximizes P(T | W)
      T' = argmax_T P(T | W)
37. POS tagging using HMM
By Bayes' rule,
  P(T | W) = P(W | T) P(T) / P(W)
  T' = argmax_T P(W | T) P(T)

  P(T) = P(t1) P(t2 | t1) P(t3 | t1 t2) ... P(tn | t1 ... tn-1)

Applying the bigram approximation,
  P(T) = P(t1) P(t2 | t1) P(t3 | t2) ... P(tn | tn-1)
38. POS tagging using HMM
  P(W | T) = P(w1 | T) P(w2 | w1, T) P(w3 | w1 w2, T) ... P(wn | w1 ... wn-1, T)
           = Π_{i=1..n} P(wi | w1 ... wi-1, T)

Assume P(wi | w1 ... wi-1, T) = P(wi | ti).

Now, T' is the one which maximizes
  P(t1) P(t2 | t1) ... P(tn | tn-1) × P(w1 | t1) P(w2 | t2) ... P(wn | tn)
39. POS tagging using HMM
- If we use a trigram model instead for the tag sequence,
  P(T) = P(t1) P(t2 | t1) P(t3 | t1 t2) ... P(tn | tn-2 tn-1)
- Which model to choose?
  - Depends on the amount of data available!
  - Richer models (trigrams, 4-grams) require lots of data.
40. Chain rule with approximations
  P(W = "vaha ladakaa gayaa", T = "det noun verb")
    = P(det) × P(vaha | det) × P(noun | det) × P(ladakaa | noun) × P(verb | noun) × P(gayaa | verb)
[Diagram: tag sequence det → noun → verb emitting the words vaha, ladakaa, gayaa]
41. Chain rule with approximations: Example
  P(vaha | det)  = (number of times vaha appeared as det in the corpus)
                   / (total number of occurrences of det in the corpus)
  P(verb | noun) = (number of times verb followed noun in the corpus)
                   / (total number of occurrences of noun in the corpus)
If we obtained the following estimates from the corpus:
[Diagram: trellis det → noun → verb emitting vaha, ladakaa, gayaa, with estimated probabilities 0.5, 0.4, 0.99, 0.5, 0.4, 0.02 on the start, transition and emission edges]
  P(W, T) = 0.5 × 0.4 × 0.99 × 0.5 × 0.4 × 0.02 = 0.000792
(a small sketch of this product follows below)
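A short sketch of the chain-rule product on this slide, written as a function over start, transition and emission probability tables. The toy tables below are one assignment of the slide's estimates that reproduces the 0.000792 product; they are illustrative values, not corpus-derived numbers.

```python
# Joint probability under the bigram approximation:
# P(W, T) = Pstart(t1) * B(w1|t1) * Π_i A(t_i|t_{i-1}) * B(w_i|t_i).
# The tables below are toy values chosen to match the slide's worked example.
pstart = {"det": 0.5}
trans = {("det", "noun"): 0.99, ("noun", "verb"): 0.4}
emit = {("det", "vaha"): 0.4, ("noun", "ladakaa"): 0.5, ("verb", "gayaa"): 0.02}

def joint_prob(words, tags, pstart, trans, emit):
    """Multiply start, transition and emission probabilities along the pair."""
    p = pstart.get(tags[0], 0.0) * emit.get((tags[0], words[0]), 0.0)
    for i in range(1, len(words)):
        p *= trans.get((tags[i - 1], tags[i]), 0.0)
        p *= emit.get((tags[i], words[i]), 0.0)
    return p

print(joint_prob(["vaha", "ladakaa", "gayaa"], ["det", "noun", "verb"],
                 pstart, trans, emit))   # 0.5*0.4*0.99*0.5*0.4*0.02 = 0.000792
```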
42. POS tagging using HMM
We need to estimate three types of parameters from the corpus:
  Pstart(ti)   = (no. of sentences which begin with ti) / (no. of sentences)
  P(ti | ti-1) = count(ti-1 ti) / count(ti-1)
  P(wi | ti)   = count(wi with ti) / count(ti)
These parameters can be directly represented using Hidden Markov Models (HMMs), and the best tag sequence can be computed by applying the Viterbi algorithm to the HMM (a counting sketch follows below).
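A minimal sketch of estimating these three parameter types by counting over a tagged corpus, exactly as the formulas above suggest. The two-sentence corpus is a hypothetical stand-in for a real training file.

```python
# Estimate Pstart, transition and emission probabilities by relative frequency.
from collections import Counter

corpus = [
    [("vaha", "det"), ("ladakaa", "noun"), ("gayaa", "verb")],
    [("vaha", "det"), ("ladakaa", "noun"), ("hansaa", "verb")],
]

start_counts, tag_counts = Counter(), Counter()
bigram_counts, emit_counts = Counter(), Counter()

for sent in corpus:
    start_counts[sent[0][1]] += 1          # tag that begins the sentence
    prev = None
    for word, tag in sent:
        tag_counts[tag] += 1
        emit_counts[(tag, word)] += 1
        if prev is not None:
            bigram_counts[(prev, tag)] += 1
        prev = tag

def p_start(t):        # Pstart(t) = #sentences starting with t / #sentences
    return start_counts[t] / len(corpus)

def p_trans(prev, t):  # P(t | prev) = count(prev t) / count(prev)
    return bigram_counts[(prev, t)] / tag_counts[prev]

def p_emit(w, t):      # P(w | t) = count(w tagged t) / count(t)
    return emit_counts[(t, w)] / tag_counts[t]

print(p_start("det"), p_trans("det", "noun"), p_emit("ladakaa", "noun"))
```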
43. Markov models
- Markov chain
  - An event is dependent on the previous events.
- Consider the word sequence: usane → kahaa → ki → ...
Here, each word is dependent on the previous one word. Hence, it is said to form a Markov chain of order 1.
44. Hidden Markov models
[Diagram: observation sequence O = o1, o2, o3, o4 emitted by hidden state sequence X = x1, x2, x3, x4 at positions t = 1, 2, 3, 4]
The hidden states follow the Markov property. Hence, this model is known as a Hidden Markov Model.
45. Hidden Markov models
- Representation of parameters in HMMs
  - Define O(t) = t-th observation
  - Define X(t) = hidden state value at the t-th position

  A  = {a_ab},  a_ab = P(X(t+1) = X_b | X(t) = X_a)              → Transition matrix
  B  = {b_ak},  b_ak = P(O(t) = O_k | X(t) = X_a)                → Emission matrix
  PI = {pi_a},  pi_a = probability of starting with hidden state X_a → PI vector

The model is µ = (A, B, PI)
46. HMM for POS tagging
  Observation sequence  = word sequence
  Hidden state sequence = tag sequence

  Model:
    A  = P(current tag | previous tag)
    B  = P(current word | current tag)
    PI = Pstart(tag)

Tag sequences are mapped to hidden state sequences because they are not observable in the natural language text.
47. Example

  A (transition):      det    noun   verb
            det        .01    .99    .00
            noun       .30    .30    .40
            verb       .40    .40    .20

  PI (start):  det 0.5   noun 0.4   verb .01

  B (emission):        vaha   ladakaa  gayaa
            det        .40    .00      .00
            noun       .00    .015     .0031
            verb       .00    .0004    .020
48. POS tagging using HMM
The problem can be formulated as: given the observation sequence O and the model µ = (A, B, PI), how do we choose the best state sequence X which explains the observations?
- Consider all the possible tag sequences and choose the tag sequence having the maximum joint probability with the observation sequence:
  X_max = argmax_X P(O, X)
- The complexity of this brute-force search is high: of the order N^T (N states, T positions); a brute-force sketch follows below.
- The Viterbi algorithm is used for computational efficiency.
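A brute-force sketch of argmax_X P(O, X): enumerate all N^T tag sequences (3^3 = 27 here) and keep the best. It reuses the example A, PI, B tables from the previous slide; the point is only to show the exponential search that the Viterbi algorithm avoids.

```python
# Enumerate every candidate tag sequence and score it with PI, A and B.
from itertools import product

states = ["det", "noun", "verb"]
PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}
A = {"det":  {"det": 0.01, "noun": 0.99, "verb": 0.00},
     "noun": {"det": 0.30, "noun": 0.30, "verb": 0.40},
     "verb": {"det": 0.40, "noun": 0.40, "verb": 0.20}}
B = {"det":  {"vaha": 0.40, "ladakaa": 0.0,    "gayaa": 0.0},
     "noun": {"vaha": 0.0,  "ladakaa": 0.015,  "gayaa": 0.0031},
     "verb": {"vaha": 0.0,  "ladakaa": 0.0004, "gayaa": 0.020}}

O = ["vaha", "ladakaa", "gayaa"]
best_seq, best_p = None, 0.0
for X in product(states, repeat=len(O)):          # 3^3 = 27 candidate paths
    p = PI[X[0]] * B[X[0]][O[0]]
    for i in range(1, len(O)):
        p *= A[X[i - 1]][X[i]] * B[X[i]][O[i]]
    if p > best_p:
        best_seq, best_p = X, p

print(best_seq, best_p)   # expected: ('det', 'noun', 'verb')
```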
49. POS tagging using HMM
[Trellis: observations O = vaha, ladakaa, hansaa; candidate hidden states {det, noun, verb} at each position t = 1, 2, 3]
27 tag sequences possible! (3^3 = 27 paths)
50. Viterbi algorithm
[Trellis: observations O = vaha, ladakaa, hansaa; candidate hidden states {det, noun, verb} at each position t = 1, 2, 3]
Let α_noun(ladakaa) represent the probability of reaching the state noun along the best possible path while generating the observation ladakaa.
51. Viterbi algorithm
[Trellis diagram as above]
Best probability of reaching a state associated with the first word:
  α_det(vaha) = PI(det) × B[det, vaha]
52. Viterbi algorithm
[Trellis diagram as above]
Probability of reaching a state elsewhere in the best possible way: α_noun(ladakaa)
53. Viterbi algorithm
[Trellis diagram as above]
Probability of reaching a state in the best possible way:
  α_noun(ladakaa) = MAX { α_det(vaha) × A[det, noun] × B[noun, ladakaa], ...
54. Viterbi algorithm
[Trellis diagram as above]
Probability of reaching a state in the best possible way:
  α_noun(ladakaa) = MAX { α_det(vaha)  × A[det, noun]  × B[noun, ladakaa],
                          α_noun(vaha) × A[noun, noun] × B[noun, ladakaa], ...
55. Viterbi algorithm
[Trellis diagram as above]
Probability of reaching a state in the best possible way:
  α_noun(ladakaa) = MAX { α_det(vaha)  × A[det, noun]  × B[noun, ladakaa],
                          α_noun(vaha) × A[noun, noun] × B[noun, ladakaa],
                          α_verb(vaha) × A[verb, noun] × B[noun, ladakaa] }
56. Viterbi algorithm
[Trellis diagram as above]
What is the best way to come to a particular state?
  ψ_noun(ladakaa) = ARGMAX { α_det(vaha)  × A[det, noun]  × B[noun, ladakaa],
                             α_noun(vaha) × A[noun, noun] × B[noun, ladakaa],
                             α_verb(vaha) × A[verb, noun] × B[noun, ladakaa] }
57. Viterbi algorithm
[Trellis diagram as above]
The last tag of the most likely sequence:
  t_last = ARGMAX { α_det(hansaa), α_noun(hansaa), α_verb(hansaa) }
58. Viterbi algorithm
[Trellis diagram as above]
The most likely sequence is obtained by backtracking (a compact Viterbi sketch follows below).
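A compact sketch of the α/ψ recursion described on the preceding slides, with backtracking to recover the best path. It reuses the example PI, A, B tables from the earlier slide (which list gayaa rather than hansaa as the third word); this is an illustration, not the TnT implementation.

```python
# Viterbi: alpha[t][s] is the best-path probability of reaching state s at
# position t; psi[t][s] records which previous state achieved it.
def viterbi(observations, states, PI, A, B):
    alpha = [{s: PI[s] * B[s].get(observations[0], 0.0) for s in states}]
    psi = [{}]
    for t in range(1, len(observations)):
        alpha.append({})
        psi.append({})
        for s in states:
            # best way of reaching state s at position t
            prev_best = max(states, key=lambda p: alpha[t - 1][p] * A[p][s])
            alpha[t][s] = (alpha[t - 1][prev_best] * A[prev_best][s]
                           * B[s].get(observations[t], 0.0))
            psi[t][s] = prev_best
    # last tag of the most likely sequence, then backtrack
    last = max(states, key=lambda s: alpha[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path))

states = ["det", "noun", "verb"]
PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}
A = {"det":  {"det": 0.01, "noun": 0.99, "verb": 0.00},
     "noun": {"det": 0.30, "noun": 0.30, "verb": 0.40},
     "verb": {"det": 0.40, "noun": 0.40, "verb": 0.20}}
B = {"det":  {"vaha": 0.40},
     "noun": {"ladakaa": 0.015, "gayaa": 0.0031},
     "verb": {"ladakaa": 0.0004, "gayaa": 0.020}}

print(viterbi(["vaha", "ladakaa", "gayaa"], states, PI, A, B))
# expected: ['det', 'noun', 'verb']
```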
59. Preliminary Results
- POS tagging for Indian languages
  - Training set: 182159 tokens; test set: 14277 tokens
  - Tags: 26
  - Most-frequent-tag labelling: 78.85%
  - Hidden Markov Models: 86.75%
- Needs improvement!
  - By experimenting with a variety of tags and tokens (some experiments on the chunking task are shown in the following slides).
60. Preliminary Results
- Most common errors seen:
  - NNP, NNC → NN
  - <see the output of the system>
- Opportunity to carry out experiments to eliminate such errors as part of the NLPAI shared task, 2006 (will be introduced at the end).
61. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
62. Introduction to TnT
- Efficient implementation of the Viterbi algorithm for 2nd-order Markov chains (trigram approximation).
- Language independent: can be trained on any corpus.
- Easy to use.
63. Introduction to TnT
- 4 main programs
  - tnt-para: trains the model (parameter generation)
    - tnt-para options <corpus_file>
  - tnt: tagging
    - tnt options <model> <corpus>
  - tnt-diff: compares two files to get precision/recall figures
    - tnt-diff options <original file> <new output file>
  - tnt-wc: counts tokens (words) and types (POS tags/chunk tags) in different files
    - tnt-wc options <corpusfile>
64. Introduction to TnT
- Training file format
  - Token and tag separated by white space.
- Example (a sketch for writing this format follows below):
    <comment>
    nirAlA    NNP
    kI        PREP
    sAhiwya   NN
                      (blank line = new sentence)
    yahAz     PRP
    yaha      PRP
    aXikAMRa  JJ
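A small sketch of writing training and test data in the column format shown above: one token and its tag per line, blank line between sentences, and tokens only for the test file. The sentences and file names are hypothetical.

```python
# Write TnT-style training data (token + tag per line, blank line between
# sentences) and a matching test file that keeps only the tokens.
sentences = [
    [("nirAlA", "NNP"), ("kI", "PREP"), ("sAhiwya", "NN")],
    [("yahAz", "PRP"), ("yaha", "PRP"), ("aXikAMRa", "JJ")],
]

with open("train.tnt", "w", encoding="utf-8") as f:
    for sent in sentences:
        for word, tag in sent:
            f.write(f"{word}\t{tag}\n")
        f.write("\n")          # blank line marks a new sentence

with open("test.tnt", "w", encoding="utf-8") as f:
    for sent in sentences:
        for word, _ in sent:
            f.write(f"{word}\n")
        f.write("\n")
```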
65. Introduction to TnT
- The test file consists of only the first column (the tokens).
- Other files: used to store the model
  - .lex file
  - .123 file
  - .map file
- Demo 1.
66. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
67. An Example (Chunk boundary identification)
68. Chunking with TnT
- Chunk tags
  - STRT: a chunk starts at this token
  - CNT: this token lies in the middle of a chunk
  - STP: this token lies at the end of a chunk
  - STRT_STP: this token forms a chunk of its own
- Chunk tag schemes (a conversion sketch follows below)
  - 2-tag scheme: STRT, CNT
  - 3-tag scheme: STRT, CNT, STP
  - 4-tag scheme: STRT, CNT, STP, STRT_STP
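A sketch of deriving the 4-tag scheme from chunk boundaries, assuming chunks are given as lists of tokens. The example chunks correspond to the bracketed sentence from the earlier chunking slide.

```python
# Convert bracketed chunks into per-token STRT / CNT / STP / STRT_STP tags.
def chunk_tags(chunks):
    tags = []
    for chunk in chunks:
        if len(chunk) == 1:
            tags.append((chunk[0], "STRT_STP"))   # one-token chunk
        else:
            tags.append((chunk[0], "STRT"))       # chunk starts here
            for token in chunk[1:-1]:
                tags.append((token, "CNT"))       # middle of the chunk
            tags.append((chunk[-1], "STP"))       # chunk ends here
    return tags

chunks = [["Children"], ["are", "watching"], ["some", "programmes"],
          ["on", "television"], ["in", "the", "house"]]
for token, tag in chunk_tags(chunks):
    print(token, tag)
```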
69. Input Tokens
- What kinds of input tokens can we use?
  - Word only: simplest
  - POS tag only: use only the part-of-speech tag of the word
  - Combinations of the above
    - Word_POStag: word followed by POS tag
    - POStag_Word: POS tag followed by word
70. Chunking with TnT: Experiments
- Training corpus: 150000 tokens
- Test corpus: 20000 tokens
- A trick to improve learning: train on the larger tagset, then reduce it to the smaller tagset (a reduction sketch follows below).
  - NO LOSS of information, as all the tag schemes convey the same boundary information.
- Best results (precision 85.6%) obtained for:
  - Input tokens of the form Word_POS
  - Learning trick: 4 tags reduced to 2
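A sketch of one plausible reading of the "reduce the larger tagset to the smaller one" trick: after tagging with the 4-tag scheme, every chunk-opening tag maps to STRT and everything else to CNT, giving the 2-tag scheme while preserving chunk boundaries. The mapping is an assumption for illustration, not taken from the slides.

```python
# Map 4-tag chunk output (STRT, CNT, STP, STRT_STP) down to the 2-tag scheme
# (STRT, CNT). Chunk-start information is kept, so boundaries are unchanged.
def reduce_to_two_tags(tagged):
    mapping = {"STRT": "STRT", "STRT_STP": "STRT", "CNT": "CNT", "STP": "CNT"}
    return [(token, mapping[tag]) for token, tag in tagged]

four_tag_output = [("Children", "STRT_STP"), ("are", "STRT"),
                   ("watching", "STP"), ("some", "STRT"), ("programmes", "STP")]
print(reduce_to_two_tags(four_tag_output))
# [('Children', 'STRT'), ('are', 'STRT'), ('watching', 'CNT'), ...]
```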
71. Chunking with TnT: Improvement
- 85.6% is not good enough.
- The model is improved (precision 88.63%) by adding contextual information (POS tags) to the input tokens. Example:
72. Chunking with TnT: Improvements
- For experiments which lead to further improvements in chunk boundary identification, see:
  - Akshay Singh, Sushama Bendre and Rajeev Sangal, "HMM based Chunker for Hindi", in the Second International Joint Conference on Natural Language Processing, Companion Volume including Posters/Demos and Tutorial Abstracts.
73. Chunk labelling: Results
- Chunk labelling
  - Chunks which have been identified have to be labelled as noun chunks, verb chunks, etc.
  - Rule-based chunk labelling performed best.
- Results
  - Final chunk boundary identification accuracy: 92.6%
  - Chunk boundary identification + chunk labelling: 91.5%
74. Contents
- NLP: Introduction
- Language Analysis and Representation
- Part-of-speech tags in Indian languages (e.g., Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages: A few experiments
- Shared task: Introduction
75. Shared task
- For information on the shared task, refer to the flyer on the NLPAI shared task 2006.
76. Thank you