A New Approach for HMM Based Chunking for Hindi - PowerPoint PPT Presentation

1
A New Approach for HMM Based Chunking for Hindi
Ashish Tiwari, Arnab Sinha
Under the guidance of Dr. Sudeshna Sarkar
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
2
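TnT
  • .utf file (IIIT corpus) → script → .tt file (tagged training data) and
    .t file (untagged data)
  • .tt file (tagged training data) → tnt_para → model files
  • model files and .t file (untagged data) → tnt → .tts file (tagged by TnT)
  • .tts file (tagged by TnT) and .tt file (tagged) → tnt_diff → accuracy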
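
The last step of this flow, comparing the TnT output with the gold tagging, can be sketched in Python. This is only an illustration of per-token accuracy, not the tnt_diff implementation. It assumes one whitespace-separated "token tag" pair per line and "%%" comment lines; the function names are hypothetical.

```python
def read_tagged(lines):
    """Parse 'token tag' lines into (token, tag) pairs.

    Assumption: blank lines and lines starting with '%%' carry no token.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%%"):
            continue
        token, tag = line.split()[:2]
        pairs.append((token, tag))
    return pairs


def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Toy data standing in for the contents of a .tt file and a .tts file.
gold = read_tagged(["%% toy example", "ashish STRT", "arnab STRT", "ke CNT", "pIche CNT"])
pred = read_tagged(["ashish STRT", "arnab CNT", "ke CNT", "pIche CNT"])
accuracy = tagging_accuracy(gold, pred)
```

Here one of the four toy tokens is tagged differently, so `accuracy` is 0.75.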
3
Tool Flow
Chunk boundary
  • Parse the corpus
  • Apply 4 types of token schemes → Results
  • Apply 3 different tag schemes → Results
  • Add POS context to chunk-tags → Results
Chunk labeling
  • Do chunk-labeling → Results
Compare the accuracies → Recommendations
4
Token schemes
  • 1. (word-token, Chunk-tag)
  • (( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
  • ashish arnab of behind market in went .
  • NN NN PREP PREP NN PREP VB SYM
  • 2. (POS-tag, Chunk-tag)
  • (( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
  • NN NN PREP PREP NN PREP VB SYM
  • 3. (word_POS-tag, Chunk-tag)
  • (( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
  • ashish arnab of behind market in went .
  • ashish_NN arnab_NN of_PREP behind_PREP market_NN in_PREP went_VB ._SYM
  • 4. (POS-tag_word, Chunk-tag)
  • (( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
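
The four token schemes can be sketched as a small Python helper. This is a minimal illustration, not the authors' script. The scheme names ("word", "pos", "word_pos", "pos_word") and the function name are hypothetical, and the transliterated words are used where the slide shows English glosses.

```python
# (word, POS-tag) pairs for the example sentence; chunk-tags are assigned separately.
sentence = [
    ("ashish", "NN"), ("arnab", "NN"), ("ke", "PREP"), ("pIche", "PREP"),
    ("bajar", "NN"), ("meM", "PREP"), ("gaya", "VB"), (".", "SYM"),
]


def make_tokens(pairs, scheme):
    """Build the token sequence for one of the four token schemes."""
    builders = {
        "word": lambda w, p: w,                 # 1. word as token
        "pos": lambda w, p: p,                  # 2. POS-tag as token
        "word_pos": lambda w, p: f"{w}_{p}",    # 3. word_POS-tag as token
        "pos_word": lambda w, p: f"{p}_{w}",    # 4. POS-tag_word as token
    }
    if scheme not in builders:
        raise ValueError(f"unknown token scheme: {scheme}")
    build = builders[scheme]
    return [build(w, p) for w, p in pairs]


word_pos_tokens = make_tokens(sentence, "word_pos")
pos_word_tokens = make_tokens(sentence, "pos_word")
```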

5
Chunk Tag schemes
  • 2-Tag Scheme: STRT, CNT
  • 3-Tag Scheme: STRT, CNT, END
  • 4-Tag Scheme: STRT, CNT, END, STRT_END
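
A possible reading of the three schemes, sketched in Python: STRT marks the first token of a chunk, CNT a continuing token, END the last token, and STRT_END a single-token chunk. The slides do not say how a single-token chunk is tagged in the 2-tag and 3-tag schemes; tagging it STRT is an assumption here, and the function name is hypothetical.

```python
def chunk_tags(chunks, scheme):
    """Assign chunk-boundary tags to every token.

    chunks: list of chunks, each a list of tokens.
    scheme: 2, 3 or 4 (number of tags in the tag set).
    """
    if scheme not in (2, 3, 4):
        raise ValueError("scheme must be 2, 3 or 4")
    tags = []
    for chunk in chunks:
        n = len(chunk)
        for i in range(n):
            if n == 1:
                # Assumption: without STRT_END, a one-token chunk is tagged STRT.
                tags.append("STRT_END" if scheme == 4 else "STRT")
            elif i == 0:
                tags.append("STRT")
            elif i == n - 1 and scheme >= 3:
                tags.append("END")
            else:
                tags.append("CNT")
    return tags


chunks = [["ashish"], ["arnab", "ke", "pIche"], ["bajar", "meM"], ["gaya", "."]]
two_tag = chunk_tags(chunks, 2)
three_tag = chunk_tags(chunks, 3)
four_tag = chunk_tags(chunks, 4)
```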
6
Adding POS-tag to Chunk-tag
  • (( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
  • ashish arnab of behind market in went .
  • NN NN PREP PREP NN PREP VB SYM
  • NN:STRT NN:STRT PREP:CNT PREP:CNT NN:STRT PREP:CNT VB:STRT SYM:CNT
  • Ex: word as token and POS:2-tag chunking
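
Combining the POS-tag with the chunk-tag can be sketched as follows. The separator is a parameter because the transcript does not preserve it reliably; the default ":" is an assumption, and both function names are hypothetical.

```python
# Longest tag first, so that STRT_END is matched before END.
CHUNK_TAGS = ("STRT_END", "STRT", "CNT", "END")


def add_pos_to_chunk_tags(pos_tags, boundary_tags, sep=":"):
    """Prefix each chunk-boundary tag with the POS-tag of its token."""
    if len(pos_tags) != len(boundary_tags):
        raise ValueError("need one POS-tag per chunk-tag")
    return [f"{p}{sep}{c}" for p, c in zip(pos_tags, boundary_tags)]


def chunk_part(combined_tag):
    """Recover the chunk-boundary tag from a combined tag, with or without separator."""
    for tag in CHUNK_TAGS:
        if combined_tag.endswith(tag):
            return tag
    raise ValueError(f"no chunk-tag found in {combined_tag!r}")


pos_tags = ["NN", "NN", "PREP", "PREP", "NN", "PREP", "VB", "SYM"]
boundary_tags = ["STRT", "STRT", "CNT", "CNT", "STRT", "CNT", "STRT", "CNT"]
with_colon = add_pos_to_chunk_tags(pos_tags, boundary_tags)
without_colon = add_pos_to_chunk_tags(pos_tags, boundary_tags, sep="")
```

Stripping the POS part with `chunk_part` brings the output back to the plain tag set for evaluation.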

7
Colon vs Non-Colon
Marginal Improvement
  • Corpus size: 20000 words
  • On a larger data set, the <word_POS-tag> token might
    perform better

8
Chunk Boundary identification
Results are improved!
Converting 4-tag output to the 2-tag set (4-tag → 2-tag) gives the
highest precision and recall.
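
The 4-tag to 2-tag conversion can be sketched as a simple mapping. The slides do not spell the mapping out; the one below (STRT_END becomes STRT, END becomes CNT) is the natural reading and is an assumption.

```python
# Assumed mapping from the 4-tag set to the 2-tag set.
FOUR_TO_TWO = {
    "STRT": "STRT",
    "STRT_END": "STRT",  # a single-token chunk still starts a chunk
    "CNT": "CNT",
    "END": "CNT",        # the last token just continues the chunk
}


def to_two_tag(tags):
    """Convert a 4-tag chunk-tag sequence to the 2-tag set."""
    return [FOUR_TO_TWO[t] for t in tags]


four_tag = ["STRT_END", "STRT", "CNT", "END", "STRT", "END", "STRT", "END"]
two_tag = to_two_tag(four_tag)
```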
9
Addition of POS-tag Information to Chunk-tags
A significant increase in precision and recall is observed.
The 4-tag → 2-tag scheme for <word_POS, chunkPOS> scores highest.
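
Precision and recall over chunk boundaries could be computed as in the sketch below. The slides do not state how the scores were obtained; exact matching of whole chunks on 2-tag sequences is an assumption, and the function names are hypothetical.

```python
def chunk_spans(tags):
    """Turn a 2-tag sequence into a set of (start, end) spans, end exclusive."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        # A STRT tag opens a new chunk; a leading CNT is treated as a chunk start too.
        if tag == "STRT" or start is None:
            if start is not None:
                spans.add((start, i))
            start = i
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def precision_recall(gold_tags, pred_tags):
    """Chunk-level precision and recall with exact span matching."""
    gold, pred = chunk_spans(gold_tags), chunk_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = ["STRT", "STRT", "CNT", "CNT", "STRT", "CNT", "STRT", "CNT"]
pred = ["STRT", "STRT", "CNT", "STRT", "STRT", "CNT", "STRT", "CNT"]
precision, recall = precision_recall(gold, pred)
```

In this toy case the prediction splits one gold chunk in two, so 3 of 5 predicted chunks and 3 of 4 gold chunks match.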
10
Labeling the Chunks
  • First Scheme
    token: <word>_<POS-tag>
    label: <2-tag chunk boundary>POS-tag<chunk label> (if this is the first
    token of the chunk), <2-tag chunk boundary>POS-tag (otherwise)
  • Second Scheme
    token: <word>_<POS-tag>
    label: <2-tag chunk boundary>POS-tag<chunk label> (for all tokens)
  • Third Scheme
    token: <word>_<POS-tag>
    label: <2-tag chunk boundary>POS-tag<chunk label> (if this is the last
    token of the chunk), <2-tag chunk boundary>POS-tag (otherwise)
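
The three labeling schemes differ only in which tokens carry the chunk label, which the Python sketch below makes explicit. The ":" separator, the chunk labels NP and VG, and the function name are hypothetical; they are not taken from the slides.

```python
def label_tokens(chunks, scheme, sep=":"):
    """Build (token, label) pairs for one of the three labeling schemes.

    chunks: list of (chunk_label, [(word, pos), ...]).
    scheme: 1 = chunk label on the first token, 2 = on all tokens,
            3 = on the last token of each chunk.
    """
    if scheme not in (1, 2, 3):
        raise ValueError("scheme must be 1, 2 or 3")
    out = []
    for chunk_label, members in chunks:
        last = len(members) - 1
        for i, (word, pos) in enumerate(members):
            boundary = "STRT" if i == 0 else "CNT"  # 2-tag chunk boundary
            carries_label = (
                (scheme == 1 and i == 0)
                or scheme == 2
                or (scheme == 3 and i == last)
            )
            parts = [boundary, pos] + ([chunk_label] if carries_label else [])
            out.append((f"{word}_{pos}", sep.join(parts)))
    return out


# Hypothetical chunk labels for two chunks of the example sentence.
chunks = [
    ("NP", [("arnab", "NN"), ("ke", "PREP"), ("pIche", "PREP")]),
    ("VG", [("gaya", "VB"), (".", "SYM")]),
]
first_scheme = label_tokens(chunks, 1)
third_scheme = label_tokens(chunks, 3)
```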
11
Results: Labelling of Chunks
  • The first scheme gives the highest precision (89.02), but note
    that the word_pos tag approach is not far behind, with 85.58
    precision and the highest recall (98.48).
  • The recall of the word_pos and pos_word approaches is the same
    in all schemes; the ordering seems to add no new knowledge to
    the existing model.

12
Recommendations
For identification of chunk boundary
  • Best option: <POS word_POS> <chunk_tag><POS_tag>
  • Subsequent conversion to the 2-tag set gives better
    results

For chunk labeling
  • Scheme 1 is best
  • Adding POS-tag information improves the precision and
    recall of chunk labeling.

This approach can be used for other Indian languages as well.
13
References
  • Daniel Jurafsky and James H. Martin. Speech and Language
    Processing: An Introduction to Natural Language Processing,
    Computational Linguistics, and Speech Recognition.
  • Miles Osborne. 2000. Shallow Parsing as Part-of-Speech
    Tagging. Proceedings of CoNLL-2000.
  • Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text Chunking
    Using Transformation-Based Learning. Proceedings of the 3rd
    Workshop on Very Large Corpora, 88-94.
  • W. Skut and T. Brants. 1998. Chunk Tagger: Statistical
    Recognition of Noun Phrases. ESSLLI-1998.
  • Thorsten Brants. 2000. TnT: A Statistical Part-of-Speech
    Tagger. Proceedings of the Sixth Conference on Applied
    Natural Language Processing, 224-231.