Title: A New Approach for HMM-Based Chunking for Hindi

Slide 1: A New Approach for HMM-Based Chunking for Hindi
Ashish Tiwari, Arnab Sinha
Under the guidance of Dr. Sudeshna Sarkar
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
Slide 2: TnT
Tool pipeline:
- .utf file (IIIT corpus) → script → .tt file (tagged training data) + .t file (untagged data)
- .tt file → tnt_para → model files
- .t file + model files → tnt → .tts file (tagged by TnT)
- .tts file + .tt file (tagged) → tnt_diff → accuracy
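The tnt_diff step compares TnT's output (.tts) against the gold-tagged file (.tt). A minimal sketch of that comparison, assuming one word-and-tag pair per line separated by a tab (the function names are mine, not TnT's):

```python
def read_tagged(path):
    """Read a TnT-style tagged file: one 'word<TAB>tag' pair per non-empty line."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                word, tag = line.split("\t")
                pairs.append((word, tag))
    return pairs

def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    assert len(gold) == len(predicted), "files must align token for token"
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)
```

The same token-level accuracy applies whether the "tags" are POS tags or chunk-boundary tags, since TnT treats both as opaque labels.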
Slide 3: Tool Flow
1. Parse the corpus → results
2. Apply 4 types of token schemes → chunk-boundary results
3. Apply 3 different tag schemes → results
4. Add POS context to chunk-tags → results
5. Do chunk labeling → chunk-labeling results
6. Compare the accuracies → recommendations
Slide 4: Token schemes
Example sentence: (( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
Gloss: ashish arnab of behind market in went .
POS tags: NN NN PREP PREP NN PREP VB SYM

1. (word-token, chunk-tag): the token is the word itself, e.g. ashish arnab ke pIche bajar meM gaya .
2. (POS-tag, chunk-tag): the token is the POS tag, e.g. NN NN PREP PREP NN PREP VB SYM
3. (word_POS-tag, chunk-tag): the token is word_POS, e.g. ashish_NN arnab_NN ke_PREP pIche_PREP bajar_NN meM_PREP gaya_VB ._SYM
4. (POS-tag_word, chunk-tag): the token is POS_word, e.g. NN_ashish NN_arnab PREP_ke PREP_pIche NN_bajar PREP_meM VB_gaya SYM_.
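All four schemes can be derived from one POS-tagged sentence. A sketch of the token construction (the scheme numbering follows the list above; the helper itself is mine):

```python
def make_tokens(words, pos_tags, scheme):
    """Build the HMM observation tokens for one sentence under a given scheme.

    scheme 1: word            scheme 2: POS-tag
    scheme 3: word_POS-tag    scheme 4: POS-tag_word
    """
    if scheme == 1:
        return list(words)
    if scheme == 2:
        return list(pos_tags)
    if scheme == 3:
        return [f"{w}_{p}" for w, p in zip(words, pos_tags)]
    if scheme == 4:
        return [f"{p}_{w}" for w, p in zip(words, pos_tags)]
    raise ValueError("scheme must be 1-4")
```

Schemes 3 and 4 carry the same information; they differ only in which part of the token the HMM's suffix-based unknown-word handling sees first.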
Slide 5: Chunk-tag schemes
- 2-tag scheme: STRT, CNT
- 3-tag scheme: STRT, CNT, END
- 4-tag scheme: STRT, CNT, END, STRT_END
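Given the chunk boundaries, each token's tag under the three schemes can be generated mechanically. A sketch, assuming a single-token chunk is tagged STRT in the 2- and 3-tag schemes (the slides do not spell this case out):

```python
def chunk_tags(chunk_lengths, scheme):
    """Tag each token of a sentence with its chunk-boundary tag.

    chunk_lengths: number of tokens in each chunk, in order.
    2-tag: STRT for the first token of a chunk, CNT otherwise.
    3-tag: additionally END for the last token of a multi-token chunk.
    4-tag: additionally STRT_END for a single-token chunk.
    """
    tags = []
    for n in chunk_lengths:
        for i in range(n):
            if scheme == 4 and n == 1:
                tags.append("STRT_END")
            elif i == 0:
                tags.append("STRT")
            elif scheme in (3, 4) and i == n - 1:
                tags.append("END")
            else:
                tags.append("CNT")
    return tags
```

For the example sentence the chunks have lengths 1, 3, 2, 2, so the 2-tag sequence is STRT, STRT CNT CNT, STRT CNT, STRT CNT.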
Slide 6: Adding POS-tag to Chunk-tag
Sentence: (( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
Gloss: ashish arnab of behind market in went .
POS tags: NN NN PREP PREP NN PREP VB SYM
POS-augmented chunk tags: NN_STRT NN_STRT PREP_CNT PREP_CNT NN_STRT PREP_CNT VB_STRT SYM_CNT
(Example: word as token, POS_2-tag chunking.)
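Augmenting the chunk tags with POS information is a one-line transform over aligned sequences; a sketch reproducing the example above (the joining underscore matches the slide's NN_STRT notation):

```python
def pos_chunk_tags(pos_tags, boundary_tags):
    """Append the token's POS tag to its chunk-boundary tag, e.g. NN + STRT -> NN_STRT."""
    return [f"{p}_{b}" for p, b in zip(pos_tags, boundary_tags)]

# Example sentence: (( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
pos = ["NN", "NN", "PREP", "PREP", "NN", "PREP", "VB", "SYM"]
boundary = ["STRT", "STRT", "CNT", "CNT", "STRT", "CNT", "STRT", "CNT"]
augmented = pos_chunk_tags(pos, boundary)
```

The augmented tags enlarge the HMM state space, so transitions can condition on the POS of the previous token as well as the chunk boundary.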
Slide 7: Colon vs Non-Colon
Marginal improvement.
- Corpus size: 20,000 words
- On a larger data set, the <word_POS-tag> token scheme might perform better.
Slide 8: Chunk Boundary Identification
Results are improved. The 4-tag → 2-tag conversion gives the highest precision and recall.
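Precision and recall here are chunk-level: a predicted chunk counts as correct only if its span exactly matches a gold chunk. A sketch of that evaluation over 2-tag STRT/CNT sequences (the helper names are mine):

```python
def chunk_spans(tags):
    """Recover (start, end) chunk spans from a 2-tag STRT/CNT sequence."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == "STRT":
            if start is not None:
                spans.append((start, i))
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def precision_recall(gold_tags, pred_tags):
    """Chunk-level precision and recall against exact span matches."""
    gold = set(chunk_spans(gold_tags))
    pred = set(chunk_spans(pred_tags))
    correct = len(gold & pred)
    return correct / len(pred), correct / len(gold)
```

Exact-match scoring explains why boundary mistakes are costly: shifting one STRT tag invalidates two chunks at once.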
Slide 9: Adding POS-tag Information to Chunk-tags
A significant increase in precision and recall is observed. The 4-tag → 2-tag scheme with <word_POS, chunk_POS> scores highest.
Slide 10: Labeling the Chunks
All three schemes use the token <word>_<POS-tag>; they differ in where the chunk label is attached:
- First scheme: the tag is <2-tag chunk boundary><POS-tag><chunk label> if this is the first token of the chunk, and <2-tag chunk boundary><POS-tag> otherwise.
- Second scheme: the tag is <2-tag chunk boundary><POS-tag><chunk label> for all tokens.
- Third scheme: the tag is <2-tag chunk boundary><POS-tag><chunk label> if this is the last token of the chunk, and <2-tag chunk boundary><POS-tag> otherwise.
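The three labeling schemes can be generated from the same aligned sequences. A sketch, assuming 2-tag STRT/CNT boundaries and underscore-joined tags (the separator and function name are my assumptions; the slide leaves the joining unspecified):

```python
def label_tags(boundary_tags, pos_tags, chunk_labels, scheme):
    """Attach the chunk label to the combined tag under one of the three schemes.

    chunk_labels holds, for every token, the label of the chunk it belongs to.
    The chunk label is kept on:
      scheme 1: the first token of each chunk (boundary tag STRT)
      scheme 2: every token
      scheme 3: the last token of each chunk
    """
    n = len(boundary_tags)
    tags = []
    for i, (b, p, lbl) in enumerate(zip(boundary_tags, pos_tags, chunk_labels)):
        first = b == "STRT"
        last = i + 1 == n or boundary_tags[i + 1] == "STRT"
        keep = scheme == 2 or (scheme == 1 and first) or (scheme == 3 and last)
        tags.append(f"{b}_{p}_{lbl}" if keep else f"{b}_{p}")
    return tags
```

Schemes 1 and 3 shrink the tag set (the label appears once per chunk), trading a smaller HMM state space against having to propagate the label to the remaining tokens afterwards.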
Slide 11: Results: Labelling of Chunks
- The first scheme gives the highest precision (89.02%), but note that the word_POS approach is not far behind, with 85.58% precision and the highest recall (98.48%).
- The recall of the word_POS and POS_word approaches is the same in all schemes; the ordering appears to add no new knowledge to the model.
Slide 12: Recommendations
For identification of chunk boundaries:
- Best option: <POS, word_POS> token with <chunk_tag><POS_tag>
- Subsequent conversion to the 2-tag set gives better results.
For chunk labeling:
- Scheme 1 is best.
- Adding POS-tag information improves the precision and recall of chunk labeling.
This approach can be used for other Indian languages as well.
Slide 13: References
- Daniel Jurafsky and James H. Martin. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
- Miles Osborne. 2000. Shallow Parsing as Part-of-Speech Tagging. Proceedings of CoNLL-2000.
- Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text Chunking Using Transformation-Based Learning. Proceedings of the 3rd Workshop on Very Large Corpora, pp. 88-94.
- W. Skut and T. Brants. 1998. Chunk Tagger: Statistical Recognition of Noun Phrases. ESSLLI-1998.
- Thorsten Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Conference on Applied Natural Language Processing, pp. 224-231.