Building feature rich POS tagger for morphologically rich languages: Experiences in Hindi

1
Building feature rich POS tagger for
morphologically rich languages: Experiences in
Hindi
  • Aniket Dalal
  • Kumar Nagaraj
  • Uma Sawant
  • Sandeep Shelke
  • Pushpak Bhattacharyya

2
Motivation
  • POS tagging: preparation for higher-level NLP
    tasks
  • Parsing
  • Named Entity Recognition
  • Translation
  • Challenges in Hindi POS tagging
  • Morphologically rich
  • Free word order language
  • Long distance dependencies

3
Outline
  • Maximum Entropy Markov Model (MEMM)
  • System Architecture
  • Feature Functions
  • Experimental Setup
  • Results and Performance Analysis
  • Conclusion and Future Work

4
Maximum Entropy Markov Model
  • MEMM: a feature-based exponential probabilistic
    model.
  • Feature function: captures a relevant aspect of
    the language.
  • Fix the best feature set.
  • Training: assign the feature weights that
    maximize the entropy of the model.
  • Deployment: choose the most probable tag
    sequence for a sentence.
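The per-position exponential model can be sketched as follows; the feature names, weights, and tagset are illustrative assumptions, not values from the presentation:

```python
import math

def memm_tag_probs(features, weights, tagset):
    """Compute p(tag | history) as a normalized exponential of
    weighted feature functions (a log-linear model)."""
    scores = {}
    for tag in tagset:
        # Sum the weights of all (feature, tag) pairs that fire.
        scores[tag] = sum(weights.get((f, tag), 0.0) for f in features)
    z = sum(math.exp(s) for s in scores.values())  # partition function
    return {tag: math.exp(s) / z for tag, s in scores.items()}

# Illustrative weights: a previous-tag feature and a suffix feature.
weights = {("prev_tag=ADJ", "N"): 2.0, ("suffix=on", "N"): 1.0,
           ("prev_tag=ADJ", "VM"): -0.5}
probs = memm_tag_probs({"prev_tag=ADJ", "suffix=on"}, weights, {"N", "VM"})
```

Deployment would then chain these local distributions and search (e.g. with Viterbi-style decoding) for the most probable tag sequence.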

5
System Architecture
6
Feature Functions
  • Contextual
  • Morphological
  • Categorical
  • Compound
  • Lexical

7
Contextual Features
  • Sense disambiguation
  • Trade-off between large and small context
    windows
  • Example
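A contextual feature can be sketched as the words in a fixed window around the current position; the window size of 2 here is an assumption for illustration:

```python
def context_features(words, i, window=2):
    """Extract surrounding-word features for position i,
    padding beyond sentence boundaries with a marker."""
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        word = words[j] if 0 <= j < len(words) else "<PAD>"
        feats.append(f"w[{offset}]={word}")
    return feats

feats = context_features(["bright", "students", "read"], 1)
```

A larger window captures longer-distance cues but makes features sparser, which is the trade-off the slide mentions.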

8
Morphological Features
  • Suffix list
  • Useful for unseen word tagging
  • Example
  • (suffix)
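A suffix feature for unseen words can be sketched as taking character endings up to a fixed length; the length cap is an assumption:

```python
def suffix_features(word, max_len=4):
    """Return the last 1..max_len characters of the word as features,
    useful for guessing the tag of unseen words."""
    return [f"suffix={word[-k:]}" for k in range(1, min(max_len, len(word)) + 1)]

feats = suffix_features("running")
```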

9
Categorical Features
  • List of POS tags associated with a word
  • Exactly one POS tag
  • Example: aam (आम)
  • - noun (mango)
  • - adj (common)
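A categorical feature can be sketched as the set of POS tags a dictionary lists for the word (its ambiguity class); the toy dictionary below is an assumption:

```python
# Toy dictionary: word -> possible POS tags (illustrative entries).
dictionary = {"aam": {"noun", "adj"}, "isro": {"noun"}}

def categorical_feature(word):
    """Encode the word's ambiguity class (its set of possible tags)
    as a single feature string; unknown words get their own class."""
    tags = dictionary.get(word.lower())
    if not tags:
        return "tags=UNKNOWN"
    return "tags=" + "|".join(sorted(tags))

f = categorical_feature("aam")
```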

10
Compound Features
  • Combine information from lexicon and dictionary
  • Condition-based features
  • Example
  • If the word is present in the lexicon as PPN:
  • - Is the word a PPN according to the
    dictionary, OR
  • - Is the word unknown
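The condition-based feature above can be sketched as a boolean combination of lexicon and dictionary lookups; the lookup tables here are illustrative assumptions:

```python
lexicon_tag = {"Mumbai": "PPN"}          # tags seen in the training lexicon
dictionary_tags = {"Mumbai": {"PPN"}}    # tags from an external dictionary

def compound_feature(word):
    """Fire a compound feature when the lexicon says PPN and the
    dictionary either agrees or does not know the word at all."""
    if lexicon_tag.get(word) == "PPN":
        dict_tags = dictionary_tags.get(word)
        if dict_tags is None or "PPN" in dict_tags:
            return "lexicon_ppn_confirmed"
    return None

f = compound_feature("Mumbai")
```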

11
Lexical Features
  • English letters
  • Numerals
  • Special characters
  • Example
  • ISRO, IIT, IIIT
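Lexical features for English letters, numerals, and special characters can be sketched with simple character-class checks:

```python
import re

def lexical_features(word):
    """Surface-form features: Latin script, digits, special characters."""
    feats = []
    if re.fullmatch(r"[A-Za-z]+", word):
        feats.append("latin_letters")       # e.g. acronyms like ISRO, IIT
    if re.search(r"\d", word):
        feats.append("has_digit")
    if re.search(r"[^\w\s]", word):
        feats.append("has_special_char")
    return feats

f = lexical_features("ISRO")
```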

12
Experimental Setup
  • Maxent package
  • Hindi news corpus of BBC
  • 4 data sets, manually tagged at IIT Bombay
  • 15562 words
  • 27 POS tags

13
Results: Different Context Windows
14
Results: Introduction of Features
15
Results: Cross Validation
19% of the test data consisted of unseen words.
16
Results: Per-Tag Accuracy
17
Good Performance: CM, CONJ, PNG, ORD and NEG
Each of these tags forms a closed list
18
Good Performance: Number
19
Good Performance: PPN and N
Compound features
20
Poor Performance: ADV, QUAN and INTEN
Sparse occurrence


21
Poor Performance: VM and VCOP
Semantic-level ambiguity

22
Performance Analysis
  • Good performance
  • Closed lists: CM, NEG, PNG, CONJ and ORD
  • Numbers
  • Compound features: N and PPN
  • Poor performance
  • Sparse occurrence: ADV, QUAN and INTEN
  • Semantic-level ambiguity: VCOP and VM

23
Conclusion and Future Work
  • Contextual, morphological, categorical and
    lexical features together deliver high
    performance.
  • Avg. accuracy: 94.38%, best accuracy: 94.89%
  • Can be extended to other Indo-Aryan languages by
    building language specific resources like stemmer
    and dictionary.
  • Enriching the dictionary.

24
References
  • Adwait Ratnaparkhi. 1996. A maximum entropy
    model for part-of-speech tagging. In Erich Brill
    and Kenneth Church, editors, Proceedings of the
    Conference on Empirical Methods in NLP, pages
    133-142. ACL. Somerset, New Jersey.
  • Adwait Ratnaparkhi. 1997. A simple introduction
    to maximum entropy models for natural language
    processing. Technical report 97-08, Institute for
    Research in Cognitive Science, University of
    Pennsylvania.
  • Adam L. Berger, Vincent J. Della Pietra,
    Stephen A. Della Pietra. 1996. A maximum entropy
    approach to natural language processing.
    Computational Linguistics, 22(1):39-71.

25
References
  • Jan Hajič. 2000. Morphological tagging: Data vs.
    dictionaries. In Proceedings of the 6th Applied
    Natural Language Processing Conference and the
    1st NAACL Conference, pages 94-101.
  • Manish Shrivastava, N. Agrawal, S. Singh, and P.
    Bhattacharyya. 2005. Harnessing morphological
    analysis in POS tagging task. In Proceedings of
    ICON 2005, December.
  • Smriti Singh, Kuhoo Gupta, Manish Shrivastava,
    and Pushpak Bhattacharyya. 2006. Morphological
    richness offsets resource poverty: an experience
    in building a POS tagger for Hindi. In
    Proceedings of COLING/ACL 2006, Sydney,
    Australia, July.

26
References
  • P. R. Ray, V. Harish, A. Basu, S. Sarkar. 2003.
    Part of speech tagging and local word grouping
    techniques for natural language parsing in Hindi.
    In Proceedings of ICON 2003, Mysore.
  • http://maxent.sourceforge.net

27
Thank you! Questions?
28
Maximum Entropy Markov Model
  • Maximum entropy principle: the least biased
    model that considers all known information is
    the one which maximizes entropy.

29
Maximum Entropy Markov Model
  • Maximize the entropy H(p)
  • Under constraints on the expected feature values
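In standard maximum-entropy notation this objective can be written out as follows; this is a reconstruction following the MaxEnt literature (e.g. Berger et al. 1996, cited above), not the slide's original rendering:

```latex
\max_{p}\; H(p) = -\sum_{h,t} \tilde{p}(h)\, p(t \mid h)\, \log p(t \mid h)
\quad \text{s.t.} \quad
\sum_{h,t} \tilde{p}(h)\, p(t \mid h)\, f_i(h,t)
  = \sum_{h,t} \tilde{p}(h,t)\, f_i(h,t) \;\;\; \forall i
```

where $h$ is the tagging history (context), $t$ is the tag, $\tilde{p}$ is the empirical distribution over the training data, and each $f_i$ is a feature function.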

30
Maximum Entropy Markov Model
The distribution with maximum entropy H(p) is
equivalent to a log-linear (exponential) model.
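The maximizing distribution has the familiar log-linear form; again, this is a reconstruction following the standard MaxEnt derivation (Berger et al. 1996):

```latex
p(t \mid h) = \frac{1}{Z(h)} \exp\!\Big(\sum_i \lambda_i f_i(h,t)\Big),
\qquad
Z(h) = \sum_{t'} \exp\!\Big(\sum_i \lambda_i f_i(h,t')\Big)
```

where $\lambda_i$ is the learned weight for feature $f_i$ and $Z(h)$ normalizes the distribution over the tagset.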