POS tagging and Chunking for Indian Languages - PowerPoint PPT Presentation
(76 slides; created 11/20/2003; last modified by Sriram)

Transcript and Presenter's Notes
1
POS tagging and Chunking for Indian Languages
  • Rajeev Sangal and V. Sriram,
  • International Institute of Information
    Technology,
  • Hyderabad

2
Contents
  • NLP: Introduction
  • Language Analysis: Representation
  • Part-of-speech tags in Indian languages (Ex.: Hindi)
  • Corpus-based methods: An introduction
  • POS tagging using HMMs
  • Introduction to TnT
  • Chunking for Indian languages: Few experiments
  • Shared task: Introduction

3
Language
  • A unique ability of humans
  • Animals have signs (e.g., a sign for danger)
  • But they cannot combine the signs
  • Higher animals (apes)
  • Can combine symbols (noun + verb)
  • But can talk only about the here and now

4
Language: Means of Communication
[Diagram: CONCEPT --(coding)--> Language --(decoding)--> CONCEPT]
The concept gets transferred through language.
5
Language: Means of thinking
What should I wear today?
Can we think without language?
6
What is NLP?
  • The process of computer analysis of input
    provided in a human language is known as Natural
    Language Processing.

[Diagram: Concept --> Language --> Intermediate representation,
used for processing by computer]
7
Applications
  • Machine translation
  • Document Clustering
  • Information Extraction / Retrieval
  • Text classification

8
MT system: Shakti
  • Machine translation system being developed at
    IIIT Hyderabad.
  • A hybrid translation system which combines the
    strengths of linguistic, statistical and
    machine-learning techniques.
  • Integrates the best available NLP technologies.

9
Shakti architecture
[Pipeline diagram:]
English sentence
  -> English sentence analysis (Morphology, POS tagging, Chunking, Parsing)
  -> Transfer from English to Hindi (Word reordering, Hindi word substitution)
  -> Hindi sentence generation (Agreement, Word generation)
  -> Hindi sentence
10
Contents
  • NLP: Introduction
  • Language Analysis: Representation
  • Part-of-speech tags in Indian languages (Ex.: Hindi)
  • Corpus-based methods: An introduction
  • POS tagging using HMMs
  • Introduction to TnT
  • Chunking for Indian languages: Few experiments
  • Shared task: Introduction

11
Levels of Language Analysis
  • Morphological analysis
  • Lexical analysis (POS tagging)
  • Syntactic analysis (Chunking, Parsing)
  • Semantic analysis (Word sense disambiguation)
  • Discourse processing (Anaphora resolution)

Let's take an example sentence: "Children are
watching some programmes on television in the
house"
12
Chunking
  • What are chunks?
  • [[ Children ]] (( are watching )) [[ some programmes ]]
    [[ on television ]] [[ in the house ]]
  • Chunks:
  • Noun chunks (NP, PP) in square brackets
  • Verb chunks (VG) in parentheses
  • Chunks represent objects:
  • Noun chunks represent objects/concepts
  • Verb chunks represent actions

13
Chunking
  • Representation in SSF (Shakti Standard Format)

14
Part-of-Speech tagging
15
Morphological analysis
  • Deals with the word form and its analysis.
  • Analysis consists of characteristic properties
    like:
  • Root/Stem
  • Lexical category
  • Gender, number, person
  • etc.
  • Ex.: watching
  • Root: watch
  • Lexical category: verb
  • etc.

16
Morphological analysis
17
Contents
  • NLP: Introduction
  • Language Analysis: Representation
  • Part-of-speech tags in Indian languages (Ex.: Hindi)
  • Corpus-based methods: An introduction
  • POS tagging using HMMs
  • Introduction to TnT
  • Chunking for Indian languages: Few experiments
  • Shared task: Introduction

18
POS Tags in Hindi
  • Broad categories are noun, verb, adjective and
    adverb.
  • Words are classified depending on their role, both
    individually as well as in the sentence.
  • Example:
  • vaha aama khaa rahaa hei
  • pron noun verb verb  verb

19
POS Tagging
  • Simplest method of POS tagging:
  • Looking in the dictionary

khaanaa --(dictionary lookup)--> verb
20
Problems with POS Tagging
  • The size of the dictionary limits the scope of the
    POS tagger.
  • Ambiguity:
  • The same word can be used both as a noun as well
    as a verb.

khaanaa --(dictionary lookup)--> noun, verb
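The lookup idea above can be sketched in a few lines of Python; the toy dictionary here is hypothetical, not a real lexicon:

```python
# A minimal dictionary-lookup tagger (illustrative sketch; the
# entries below are invented for the example, not from a real lexicon).
pos_dictionary = {
    "khaanaa": {"noun", "verb"},   # ambiguous: food (noun) / to eat (verb)
    "seba":    {"noun"},           # apple
    "vaha":    {"pron"},           # he/she/that
}

def lookup_tags(word):
    """Return the set of possible POS tags, or an empty set if unknown."""
    return pos_dictionary.get(word, set())

print(lookup_tags("khaanaa"))  # ambiguous: {'noun', 'verb'}
```

The empty set for unknown words makes the dictionary-size limitation from the slide visible: any word outside the lexicon simply cannot be tagged.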
21
Problems with POS Tagging
  • Ambiguity:
  • Sentences in which the word khaanaa occurs:
  • tum bahuta achhaa khaanaa banatii ho.
  • mein jilebii khaanaa chaahataa hun.
  • Hence, the complete sentence has to be looked at
    before determining the word's role and thus its POS tag.

22
Problems with POS Tagging
  • Many applications need more specific POS tags.
  • For example:

    seba khaa rahaa          Verb Finite Main
    khaate huE               Verb Non-Finite Adjective
    khaakara                 Verb Non-Finite Adverb
    sharaaba piinaa sehata   Verb Non-Finite Nominal

  • Hence, the need for defining a tagset.
23
Defining the tagset for Hindi (IIIT Tagset)
  • Issues!
  • Fineness v/s coarseness in linguistic analysis
  • Syntactic function v/s lexical category
  • New tags v/s tags from a standard tagger

24
Fineness v/s Coarseness
  • A decision has to be taken whether tags will
    account for finer distinctions of various
    features of the parts of speech.
  • Need to strike a balance:
  • Not too fine, to avoid hampering machine learning
  • Not too coarse, to avoid losing information

25
Fineness v/s Coarseness
  • Nouns:
  • Plurality information not taken into account
    (noun singular and noun plural are marked with
    the same tags).
  • Case information not marked
    (noun direct and noun oblique are marked with
    the same tags).
  • Adjectives and Adverbs:
  • No distinction between comparative and
    superlative forms
  • Verbs:
  • Finer distinctions are made (e.g., VJJ, VRB, VNN)
  • Helps us understand the arguments that a verb
    form can take.

26
Fineness in Verb tags
  • Useful for tasks like dependency parsing, as we
    have better information about the arguments of the
    verb form.
  • Non-finite forms of verbs which are used as nouns
    or adjectives or adverbs still retain their
    verbal property.
  • (VNN -> noun formed from a verb)
  • Example:
    aasamaana/NN  mein/PREP  udhane/VNN  vaalaa/PREP  ghodhaa/NN
    sky           in         flying                   horse
    niiche/NLOC  utara/VFM  aayaa/VAUX
    down         climb      came

27
Syntactic v/s Lexical
  • Whether to tag the word based on its lexical or
    syntactic category.
  • Should "uttar" in "uttar bhaarata" be tagged as a
    noun or an adjective?
  • The lexical category is given more importance than
    the syntactic category while marking text manually.
  • Leads to consistency in tagging.

28
New tags v/s tags from standard tagset
  • An entirely new tagset for Indian languages is not
    desirable, as people are familiar with standard
    tagsets like the Penn tags.
  • The Penn tagset has been used as a benchmark while
    deciding tags for Hindi.
  • Wherever the Penn tagset has been found inadequate,
    new tags were introduced:
  • NVB -> New tag for kriyamuls or light verbs
  • QW -> Modified tag for question words

29
IIIT Tagset
  • Tags are grouped into three types:
  • Group 1: Adopted from the Penn tagset with minor
    changes.
  • Group 2: Modifications over the Penn tagset.
  • Group 3: Tags not present in the Penn tagset.
  • Examples of tags in Group 3:
  • INTF (Intensifier): Words like baHuta, kama, etc.
  • NVB, JVB, RBVB: Light verbs.
  • Detailed guidelines will be put online.

30
Contents
  • NLP: Introduction
  • Language Analysis: Representation
  • Part-of-speech tags in Indian languages (Ex.: Hindi)
  • Corpus-based methods: An introduction
  • POS tagging using HMMs
  • Introduction to TnT
  • Chunking for Indian languages: Few experiments
  • Shared task: Introduction

31
Corpus based approach
[Diagram:]
POS tagged corpus --(learn)--> POS tagger
Untagged new corpus --(POS tagger)--> Tagged new corpus
32
POS tagging: A simple method
  • Pick the most likely tag for each word.
  • Probabilities can be estimated from a
    tagged corpus.
  • Assumes independence between tags.
  • Accuracy < 90%
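This most-frequent-tag baseline can be sketched as follows; the tiny tagged corpus and its tags are invented for illustration:

```python
from collections import Counter, defaultdict

# Most-frequent-tag baseline: for each word, pick the tag it carried
# most often in the training corpus. The corpus below is a made-up toy.
tagged_corpus = [
    [("vaha", "PRP"), ("khaanaa", "NN"), ("banaataa", "VFM")],
    [("vaha", "PRP"), ("khaanaa", "VNN"), ("chaahataa", "VFM")],
    [("mein", "PRP"), ("khaanaa", "NN"), ("laayaa", "VFM")],
]

counts = defaultdict(Counter)
for sentence in tagged_corpus:
    for word, tag in sentence:
        counts[word][tag] += 1

def most_likely_tag(word):
    # Each word is tagged in isolation: the independence assumption
    # from the slide. Unknown words fall back to a default tag.
    return counts[word].most_common(1)[0][0] if word in counts else "NN"

print([most_likely_tag(w) for w in ["vaha", "khaanaa"]])  # ['PRP', 'NN']
```

Note that "khaanaa" is always tagged NN here even in the sentence where it was VNN; this is exactly the kind of error the HMM approach on the following slides addresses.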

33
POS tagging: A simple method
  • Example:
  • Brown corpus, 182159 tagged words (training
    section), 26 tags
  • Example:
  • mujhe xo kitabein xijiye
  • The word "xo" occurs 267 times:
  • 227 times tagged as QFN
  • 29 times as VAUX
  • P(QFN | W = xo) = 227/267 = 0.8502
  • P(VAUX | W = xo) = 29/267 = 0.1086

34
Corpus-based approaches

  Rule learning                                    Statistical
  ---------------------------------------------    ------------------------------------
  Transformation-based error-driven learning       Hidden Markov models
  (Brill, 1995)                                    (TnT; Brants, 2000)
  Inductive Logic Programming (Cussens, 1997)      Maximum Entropy (Ratnaparkhi, 1996)
35
Contents
  • NLP: Introduction
  • Language Analysis: Representation
  • Part-of-speech tags in Indian languages (Ex.: Hindi)
  • Corpus-based methods: An introduction
  • POS tagging using HMMs
  • Introduction to TnT
  • Chunking for Indian languages: Few experiments
  • Shared task: Introduction

36
POS tagging using HMMs
Let W be a sequence of words: W = w1, w2, ..., wn
Let T be the corresponding tag sequence: T = t1, t2, ..., tn
Task: Find the T which maximizes P(T | W):
    T* = argmax_T P(T | W)
37
POS tagging using HMM
By Bayes' rule,
    P(T | W) = P(W | T) * P(T) / P(W)
    T* = argmax_T P(W | T) * P(T)
    P(T) = P(t1) * P(t2 | t1) * P(t3 | t1 t2) * ... * P(tn | t1 ... tn-1)
Applying the bi-gram approximation,
    P(T) = P(t1) * P(t2 | t1) * P(t3 | t2) * ... * P(tn | tn-1)
38
POS tagging using HMM
    P(W | T) = P(w1 | T) * P(w2 | w1, T) * P(w3 | w1 w2, T) * ... * P(wn | w1 ... wn-1, T)
             = PROD_{i=1..n} P(wi | w1 ... wi-1, T)
Assume
    P(wi | w1 ... wi-1, T) = P(wi | ti)
Now, T* is the one which maximizes
    P(t1) * P(t2 | t1) * ... * P(tn | tn-1) * P(w1 | t1) * P(w2 | t2) * ... * P(wn | tn)
39
POS tagging using HMM
  • If we use a tri-gram model instead for the tag
    sequence:
  • P(T) = P(t1) * P(t2 | t1) * P(t3 | t1 t2) * ... * P(tn | tn-2 tn-1)
  • Which model to choose?
  • Depends on the amount of data available!
  • Richer models (tri-grams, 4-grams)
    require lots of data.
40
Chain rule with approximations
  • P(W = "vaha ladakaa gayaa", T = "det noun verb")
    = P(det) * P(vaha | det) * P(noun | det)
      * P(ladakaa | noun) * P(verb | noun) * P(gayaa | verb)

[Diagram: tag chain det -> noun -> verb, emitting vaha, ladakaa, gayaa]
41
Chain rule with approximations: Example
  • P(vaha | det) = (Number of times vaha appeared as det in the corpus)
                   / (Total number of occurrences of det in the corpus)
  • P(verb | noun) = (Number of times verb followed noun in the corpus)
                    / (Total number of occurrences of noun in the corpus)
  • If we obtained the following estimates from the corpus:
    P(det) = 0.5, P(vaha | det) = 0.4, P(noun | det) = 0.99,
    P(ladakaa | noun) = 0.5, P(verb | noun) = 0.4, P(gayaa | verb) = 0.02

P(W, T) = 0.5 * 0.4 * 0.99 * 0.5 * 0.4 * 0.02 = 0.000792
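The product above can be checked directly; the probability estimates are the illustrative values from the slide, not real corpus counts:

```python
# Recomputing the worked example: joint probability of the sentence
# "vaha ladakaa gayaa" with the tag sequence det-noun-verb, under the
# bigram approximation, using the slide's illustrative estimates.
p_start_det  = 0.5    # P(det)
p_vaha_det   = 0.4    # P(vaha | det)
p_noun_det   = 0.99   # P(noun | det)
p_ladakaa_nn = 0.5    # P(ladakaa | noun)
p_verb_noun  = 0.4    # P(verb | noun)
p_gayaa_verb = 0.02   # P(gayaa | verb)

joint = (p_start_det * p_vaha_det * p_noun_det *
         p_ladakaa_nn * p_verb_noun * p_gayaa_verb)
print(round(joint, 6))  # 0.000792
```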
42
POS tagging using HMM
We need to estimate three types of parameters from the corpus:
    Pstart(ti) = (no. of sentences which begin with ti) / (no. of sentences)
    P(ti | ti-1) = count(ti-1 ti) / count(ti-1)
    P(wi | ti) = count(wi with ti) / count(ti)
These parameters can be directly represented using Hidden Markov Models
(HMMs), and the best tag sequence can be computed by applying the Viterbi
algorithm to the HMM.
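The three maximum-likelihood estimates above come down to simple counting over a tagged corpus; the two-sentence corpus below is a made-up toy, not from the slides' data:

```python
from collections import Counter

# MLE estimation of the three parameter types from a tagged corpus,
# following the formulas on the slide. Toy corpus for illustration.
corpus = [
    [("vaha", "det"), ("ladakaa", "noun"), ("gayaa", "verb")],
    [("vaha", "pron"), ("hansaa", "verb")],
]

start_counts  = Counter(sent[0][1] for sent in corpus)
tag_counts    = Counter(tag for sent in corpus for _, tag in sent)
bigram_counts = Counter(
    (s[i][1], s[i + 1][1]) for s in corpus for i in range(len(s) - 1))
emit_counts   = Counter((w, t) for sent in corpus for w, t in sent)

def p_start(tag):             # Pstart(ti)
    return start_counts[tag] / len(corpus)

def p_trans(prev_tag, tag):   # P(ti | ti-1) = count(ti-1 ti) / count(ti-1)
    return bigram_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def p_emit(word, tag):        # P(wi | ti) = count(wi with ti) / count(ti)
    return emit_counts[(word, tag)] / tag_counts[tag]

print(p_start("det"), p_trans("det", "noun"), p_emit("gayaa", "verb"))
```

In practice some smoothing is needed for unseen words and tag bigrams; the raw ratios here would assign them probability zero.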
43
Markov models
  • Markov Chain:
  • An event is dependent on the previous events.
  • Consider the word sequence:

    usane -> kahaa -> ki

Here, each word is dependent on the previous one
word. Hence, it is said to form a Markov chain of
order 1.
44
Hidden Markov models
[Diagram:
  Observation sequence O:    o1  o2  o3  o4
  Hidden state sequence X:   x1  x2  x3  x4
  Index of sequence t:        1   2   3   4]
The hidden states follow the Markov property. Hence, this
model is known as a Hidden Markov Model.
45
Hidden Markov models
  • Representation of parameters in HMMs:
  • Define O(t) = t-th observation
  • Define X(t) = hidden state value at the t-th position

    A  = {a_ab}  = P(X(t+1) = X_b | X(t) = X_a)   -> Transition matrix
    B  = {b_ak}  = P(O(t) = O_k | X(t) = X_a)     -> Emission matrix
    PI = {pi_a}  = probability of starting with
                   hidden state X_a               -> PI matrix

The model is µ = (A, PI, B)
46
HMM for POS tagging
    Observation sequence  = Word sequence
    Hidden state sequence = Tag sequence
Model:
    A  = P(current tag | previous tag)
    B  = P(current word | current tag)
    PI = Pstart(tag)
Tag sequences are mapped to hidden state sequences because they are not
observable in the natural language text.
47
Example

A:        det   noun  verb
  det     .01   .99   .00
  noun    .30   .30   .40
  verb    .40   .40   .20

PI:  det 0.5   noun 0.4   verb .01

B:        vaha  ladakaa  gayaa
  det     .40   .00      .00
  noun    .00   .015     .0031
  verb    .00   .0004    .020
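These example parameters can be held in nested dictionaries; the numbers are the illustrative values from the tables above, not corpus estimates:

```python
# The example HMM parameters from the slide as nested dictionaries.
TAGS = ["det", "noun", "verb"]

A = {  # transition: A[prev][cur] = P(current tag | previous tag)
    "det":  {"det": .01, "noun": .99, "verb": .00},
    "noun": {"det": .30, "noun": .30, "verb": .40},
    "verb": {"det": .40, "noun": .40, "verb": .20},
}
PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}  # start probabilities
B = {   # emission: B[tag][word] = P(word | tag)
    "det":  {"vaha": .40, "ladakaa": .00,   "gayaa": .00},
    "noun": {"vaha": .00, "ladakaa": .015,  "gayaa": .0031},
    "verb": {"vaha": .00, "ladakaa": .0004, "gayaa": .020},
}

# Sanity check: each row of the transition matrix sums to 1.
for tag in TAGS:
    assert abs(sum(A[tag].values()) - 1.0) < 1e-9
```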
48
POS tagging using HMM
The problem can be formulated as: Given the observation sequence O and the
model µ = (A, B, PI), how do we choose the best state sequence X which
explains the observations?
  • Consider all the possible tag sequences and choose the
    tag sequence having the maximum joint probability with
    the observation sequence:
  • X_max = argmax_X P(O, X)
  • The complexity of the above is high: order N^T.
  • The Viterbi algorithm is used for computational efficiency.
49
POS tagging using HMM
[Trellis: observations O = vaha, ladakaa, hansaa at t = 1, 2, 3;
 possible hidden states Xs = {det, noun, verb} at each position]
27 tag sequences possible! (3^3 = 27 paths)
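The brute-force search over all N^T sequences can be sketched directly; the parameters are the illustrative values from the earlier example slide, and "gayaa" is used as the third word since the example emission table covers it:

```python
from itertools import product

# Brute-force decoding: score every possible tag sequence (N^T of them;
# 3^3 = 27 here) and keep the most probable. Feasible only for toy input.
TAGS = ["det", "noun", "verb"]
PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}
A = {"det":  {"det": .01, "noun": .99, "verb": .00},
     "noun": {"det": .30, "noun": .30, "verb": .40},
     "verb": {"det": .40, "noun": .40, "verb": .20}}
B = {"det":  {"vaha": .40, "ladakaa": .00,   "gayaa": .00},
     "noun": {"vaha": .00, "ladakaa": .015,  "gayaa": .0031},
     "verb": {"vaha": .00, "ladakaa": .0004, "gayaa": .020}}

def joint(words, tags):
    """P(O, X): start * emission, then transition * emission per step."""
    p = PI[tags[0]] * B[tags[0]][words[0]]
    for i in range(1, len(words)):
        p *= A[tags[i - 1]][tags[i]] * B[tags[i]][words[i]]
    return p

words = ["vaha", "ladakaa", "gayaa"]
best = max(product(TAGS, repeat=len(words)), key=lambda t: joint(words, t))
print(best)  # ('det', 'noun', 'verb')
```

With N tags and T words this enumerates N^T paths, which is exactly the blow-up the Viterbi algorithm avoids.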
50
Viterbi algorithm
[Trellis: observations O = vaha, ladakaa, hansaa at t = 1, 2, 3;
 states {det, noun, verb} at each position]
Let alpha_noun(ladakaa) represent the probability of
reaching the state "noun" taking the best
possible path and generating the observation
"ladakaa".
51
Viterbi algorithm
[Trellis as on the previous slide]
Best probability of reaching a state associated
with the first word:
    alpha_det(vaha) = PI(det) * B[det, vaha]
52
Viterbi algorithm
[Trellis as on the previous slide]
Probability of reaching a state elsewhere in the
best possible way: alpha_noun(ladakaa)
53
Viterbi algorithm
[Trellis as on the previous slide]
Probability of reaching a state in the best
possible way:
    alpha_noun(ladakaa) = MAX { alpha_det(vaha) * A[det, noun] * B[noun, ladakaa],
                                ... }
54
Viterbi algorithm
[Trellis as on the previous slide]
Probability of reaching a state in the best
possible way:
    alpha_noun(ladakaa) = MAX { alpha_det(vaha) * A[det, noun] * B[noun, ladakaa],
                                alpha_noun(vaha) * A[noun, noun] * B[noun, ladakaa],
                                ... }
55
Viterbi algorithm
[Trellis as on the previous slide]
Probability of reaching a state in the best
possible way:
    alpha_noun(ladakaa) = MAX { alpha_det(vaha) * A[det, noun] * B[noun, ladakaa],
                                alpha_noun(vaha) * A[noun, noun] * B[noun, ladakaa],
                                alpha_verb(vaha) * A[verb, noun] * B[noun, ladakaa] }
56
Viterbi algorithm
[Trellis as on the previous slide]
What is the best way to come to a particular
state?
    phi_noun(ladakaa) = ARGMAX { alpha_det(vaha) * A[det, noun] * B[noun, ladakaa],
                                 alpha_noun(vaha) * A[noun, noun] * B[noun, ladakaa],
                                 alpha_verb(vaha) * A[verb, noun] * B[noun, ladakaa] }
57
Viterbi algorithm
[Trellis as on the previous slide]
The last tag of the most likely sequence:
    phi(T_last) = ARGMAX { alpha_det(hansaa), alpha_noun(hansaa), alpha_verb(hansaa) }
58
Viterbi algorithm
[Trellis as on the previous slide]
The most likely sequence is obtained by backtracking.
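The full procedure (alpha values, phi backpointers, backtracking) can be sketched in Python; the parameters reuse the earlier illustrative example, with "gayaa" as the third word since the example emission table covers it:

```python
# Compact Viterbi decoder matching the recurrences in the slides:
# alpha[i][t] is the best-path probability of state t at position i,
# phi[i][t] the best predecessor, recovered at the end by backtracking.
def viterbi(words, tags, PI, A, B):
    alpha = [{t: PI[t] * B[t].get(words[0], 0.0) for t in tags}]
    phi = [{}]
    for i in range(1, len(words)):
        alpha.append({})
        phi.append({})
        for t in tags:
            # MAX/ARGMAX step: best way to reach state t at position i.
            prev = max(tags, key=lambda p: alpha[i - 1][p] * A[p][t])
            phi[i][t] = prev
            alpha[i][t] = alpha[i - 1][prev] * A[prev][t] * B[t].get(words[i], 0.0)
    # Last tag of the most likely sequence, then backtrack.
    last = max(tags, key=lambda t: alpha[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(phi[i][path[-1]])
    return list(reversed(path))

TAGS = ["det", "noun", "verb"]
PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}
A = {"det":  {"det": .01, "noun": .99, "verb": .00},
     "noun": {"det": .30, "noun": .30, "verb": .40},
     "verb": {"det": .40, "noun": .40, "verb": .20}}
B = {"det":  {"vaha": .40},
     "noun": {"ladakaa": .015, "gayaa": .0031},
     "verb": {"ladakaa": .0004, "gayaa": .020}}

print(viterbi(["vaha", "ladakaa", "gayaa"], TAGS, PI, A, B))
```

Each position keeps only one best path per state, so the cost is O(T * N^2) instead of the N^T enumeration of the brute-force approach; a production implementation would also work in log space to avoid underflow on long sentences.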
59
Preliminary Results
  • POS tagging for Indian languages:
  • Training set: 182159 tokens; testing set: 14277
    tokens
  • Tags: 26
  • Most frequent tag labelling: 78.85%
  • Hidden Markov Models: 86.75%
  • Needs improvement!
  • By experimenting with a variety of tags and
    tokens (some experiments on the chunking task
    are shown in the following slides).

60
Preliminary Results
  • Most common error seen:
  • NNP, NNC -> NN
  • (see the output of the system)
  • Opportunity to carry out experiments to eliminate
    such errors as part of the NLPAI shared task, 2006
    (will be introduced at the end).

61
Contents
  • NLP: Introduction
  • Language Analysis: Representation
  • Part-of-speech tags in Indian languages (Ex.: Hindi)
  • Corpus-based methods: An introduction
  • POS tagging using HMMs
  • Introduction to TnT
  • Chunking for Indian languages: Few experiments
  • Shared task: Introduction

62
Introduction to TnT
  • Efficient implementation of Viterbi's algorithm
    for 2nd-order Markov chains (trigram approximation).
  • Language independent: can be trained on any
    corpus.
  • Easy to use.

63
Introduction to TnT
  • 4 main programs:
  • tnt-para: trains the model (parameter generation)
      tnt-para [options] <corpus_file>
  • tnt: tagging
      tnt [options] <model> <corpus>
  • tnt-diff: compares two files to get precision/recall
    figures
      tnt-diff [options] <original file 1> <new output file>
  • tnt-wc: counts tokens (words) and types
    (pos-tag/chunk-tag) in different files
      tnt-wc [options] <corpusfile>

64
Introduction to TnT
  • Training file format:
  • Tokens and tags separated by white space.
  • Example:

    <comment>
    nirAlA    NNP
    kI        PREP
    sAhiwya   NN
                        (blank line = new sentence)
    yahAz     PRP
    yaha      PRP
    aXikAMRa  JJ

65
Introduction to TnT
  • The testing file consists of only the first column.
  • Other files, used to store the model:
  • .lex file
  • .123 file
  • .map file
  • Demo 1.

66
Contents
  • NLP: Introduction
  • Language Analysis: Representation
  • Part-of-speech tags in Indian languages (Ex.: Hindi)
  • Corpus-based methods: An introduction
  • POS tagging using HMMs
  • Introduction to TnT
  • Chunking for Indian languages: Few experiments
  • Shared task: Introduction

67
An Example (Chunk boundary identification)
68
Chunking with TnT
  • Chunk tags:
  • STRT: A chunk starts at this token
  • CNT: This token lies in the middle of a chunk
  • STP: This token lies at the end of a chunk
  • STRT_STP: This token lies in a chunk of its own
  • Chunk tag schemes:
  • 2-tag scheme: STRT, CNT
  • 3-tag scheme: STRT, CNT, STP
  • 4-tag scheme: STRT, CNT, STP, STRT_STP
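Deriving 4-tag-scheme labels from a chunked sentence can be sketched as follows; the chunking of the example phrase is assumed for illustration:

```python
# Convert chunks (given as lists of tokens) into the 4-tag scheme:
# STRT, CNT, STP for multi-token chunks, STRT_STP for one-token chunks.
def chunk_tags(chunks):
    tags = []
    for chunk in chunks:
        if len(chunk) == 1:
            tags.append("STRT_STP")                  # chunk of its own
        else:
            tags.append("STRT")                      # chunk starts here
            tags.extend("CNT" for _ in chunk[1:-1])  # middle tokens
            tags.append("STP")                       # chunk ends here
    return tags

chunks = [["Children"], ["are", "watching"], ["some", "programmes"]]
print(chunk_tags(chunks))
# ['STRT_STP', 'STRT', 'STP', 'STRT', 'STP']
```

The 2-tag and 3-tag schemes are coarsenings of this output (e.g. mapping STP and STRT_STP away), which is what makes the "train on the larger tagset, then reduce" trick on the later slide lossless.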

69
Input Tokens
  • What kinds of input tokens can we use?
  • Word only: simplest
  • POS tag only: use only the part-of-speech tag of
    the word
  • Combinations of the above:
  • Word_POStag: word followed by POS tag
  • POStag_Word: POS tag followed by word

70
Chunking with TnT: Experiments
  • Training corpus: 150000 tokens
  • Testing corpus: 20000 tokens
  • A trick to improve learning is to train on the
    larger tagset and then reduce it to the smaller tagset.
  • NO LOSS of INFO, as all the tagsets convey the same
    information.
  • Best results (precision 85.6%) obtained for:
  • Input tokens of the form Word_POS
  • Learning trick: 4 tags reduced to 2

71
Chunking with TnT: Improvement
  • 85.6% is not good enough.
  • Improvement of the model (precision 88.63%) by
    adding contextual information (POS tags).
    Example:

72
Chunking with TnT: Improvements
  • For experiments which lead to further
    improvements in chunk boundary identification,
    see:
  • Akshay Singh, Sushama Bendre and Rajeev Sangal,
    "HMM based Chunker for Hindi", in Second International
    Joint Conference on Natural Language Processing:
    Companion Volume including Posters/Demos and
    Tutorial Abstracts.

73
Chunk labelling: Results
  • Chunk labelling:
  • Chunks which have been identified have to be
    labelled as noun chunks, verb chunks, etc.
  • Rule-based chunk labelling performed best.
  • RESULTS:
  • Final chunk boundary identification accuracy:
    92.6%
  • Chunk boundary identification + chunk labelling:
    91.5%

74
Contents
  • NLP: Introduction
  • Language Analysis: Representation
  • Part-of-speech tags in Indian languages (Ex.: Hindi)
  • Corpus-based methods: An introduction
  • POS tagging using HMMs
  • Introduction to TnT
  • Chunking for Indian languages: Few experiments
  • Shared task: Introduction

75
Shared task
  • For information on the shared task, refer to the
    flyer on the NLPAI shared task 2006.

76
Thank you