1
POS Tagging Experiments on PERSIAN Text
  • Fahimeh Raja
  • University of Tehran

2
Outline
  • Introduction
  • Corpus
  • Markov Model (TnT)
  • Memory Based POS Tagging (MBT)
  • Maximum Likelihood Estimation
  • Heuristic Post Processing
  • Comparisons
  • Conclusions and Future Work

3
Part-Of-Speech (POS) Tagging
  • marking up the words in a text with their parts of speech
  • a part of text and natural language processing
  • used for corpus annotation
  • many models and software tools exist
  • debate about the best approach
  • little work on Persian
  • our interest:
  • the performance of methods used for other
    languages when applied to a different language (Persian)

4
Outline
  • Introduction
  • Corpus
  • Markov Model (TnT)
  • Memory Based POS Tagging (MBT)
  • Maximum Likelihood Estimation
  • Heuristic Post Processing
  • Comparisons
  • Conclusions and Future Work

5
Corpus
  • part of BijanKhan's tagged corpus
  • developed at the linguistics lab of the Univ. of Tehran
  • drawn from daily news and common texts
  • 550 different tags
  • tags have a tree structure
  • a tag name is the most general tag followed by its
    subcategories
  • e.g., N_PL_LOC
  • N: noun
  • PL: plurality
  • LOC: location

6
Tag set reduction
  • tags with more than 2 levels → 2-level tags
  • e.g., N_PL_Loc, N_PL_Date → N_PL
  • tags describing numerical entities → DEFAULT
  • some of the 2-level tags → 1-level tags
  • rare in the corpus and unnecessarily specific
  • removing tags that appear rarely in the corpus
  • e.g., the generic noun tag (N) → (N-SING)
  • short infinitive verbs (V_SNFL)

Number of tags at each stage of reduction: 550 → 81 → 72 → 42 → 40
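This reduction amounts to a mapping over tag strings. A minimal sketch, assuming underscore-delimited tags; the sets of numerical and rare tags are placeholders, not the actual lists used in the experiments:

def reduce_tag(tag, numeric_tags=frozenset(), rare_two_level=frozenset()):
    """Collapse a hierarchical tag such as 'N_PL_LOC' to at most two levels."""
    if tag in numeric_tags:          # tags describing numerical entities
        return "DEFAULT"
    parts = tag.split("_")
    reduced = "_".join(parts[:2])    # keep the two most general levels
    if reduced in rare_two_level:    # rare, overly specific 2-level tags
        reduced = parts[0]           # fall back to the 1-level tag
    return reduced

# e.g., reduce_tag("N_PL_LOC") -> "N_PL"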
7
Outline
  • Introduction
  • Corpus
  • Markov Model (TnT)
  • Memory Based POS Tagging (MBT)
  • Maximum Likelihood Estimation
  • Heuristic Post Processing
  • Comparisons
  • Conclusions and Future Work

8
Markov Model (MM)
  • models patterns over a space of time
  • a Markov Model (MM) is a probabilistic process over a finite
    set of states
  • used to solve classification problems involving a state
    sequence
  • when the process's state depends only on the preceding N
    states, it is an order-N Markov Model

[Diagram: two states, State1 and State2, linked by a transition probability]
9
MM in POS Tagging
  • patterns in sequences of words in sentences
  • the most likely sequence of tags (states) for
    words in a sentence
  • transition probability p(tj | ti): the probability that tag tj
    follows tag ti
  • estimated from a training corpus
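A minimal sketch of estimating these transition probabilities from a tagged training corpus; the function name and data layout are illustrative, not TnT's internals:

from collections import Counter

def transition_probs(tag_sequences):
    """MLE estimate of p(tj | ti) from sentences given as lists of tags."""
    bigrams = Counter()
    contexts = Counter()
    for tags in tag_sequences:
        for prev, cur in zip(tags, tags[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return {(ti, tj): n / contexts[ti] for (ti, tj), n in bigrams.items()}

# e.g., transition_probs([["N_SING", "V", "N_SING"], ["N_SING", "V"]])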

10
Smoothing
  • sparse data problem
  • not enough instances to reliably estimate the
    probability
  • probability = 0 → the probability of the whole sequence = 0
  • smoothing adjusts the probability estimates to avoid zeros
  • several methods for a variety of applications
  • linear interpolation
  • Good-Turing method
  • Katz method
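A minimal sketch of linear interpolation, the scheme TnT uses: the smoothed trigram probability is a weighted mix of unigram, bigram, and trigram estimates. The lambda values below are placeholders; TnT derives its own weights from the training data:

def interpolated_prob(t1, t2, t3, p_uni, p_bi, p_tri,
                      lambdas=(0.1, 0.3, 0.6)):
    """Smoothed p(t3 | t1, t2) as a weighted sum of n-gram estimates."""
    l1, l2, l3 = lambdas             # weights must sum to 1
    return (l1 * p_uni.get(t3, 0.0)
            + l2 * p_bi.get((t2, t3), 0.0)
            + l3 * p_tri.get((t1, t2, t3), 0.0))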

11
TnT (Trigrams'n'Tags)
  • by Thorsten Brants
  • statistical POS tagger
  • an implementation of a 2nd-order Markov Model
  • Markov chains to estimate probabilities of
    assigning tags to words based on surrounding
    words
  • linear interpolation for smoothing
  • suffix trie for handling unknown words

12
TnT (Cont.)
  • trainable on different languages
  • one of the best and fastest taggers
  • Avg. accuracy for various languages: 96-97%
  • high accuracy for known tokens even with very
    small amounts of training data

13
Evaluation
  • evaluation: comparing the tagger's output with the original
    manually tagged test file
  • tagging accuracy: the percentage of correctly assigned tags
  • known-word accuracy
  • unknown-word accuracy
  • overall accuracy
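A minimal sketch of how these three figures can be computed, assuming the gold and predicted tags are aligned token by token and the training vocabulary is available to decide which words are known:

def accuracies(words, gold, predicted, train_vocab):
    """Overall, known-word, and unknown-word tagging accuracy."""
    known = unknown = known_ok = unknown_ok = 0
    for word, g, p in zip(words, gold, predicted):
        if word in train_vocab:
            known += 1
            known_ok += (g == p)
        else:
            unknown += 1
            unknown_ok += (g == p)
    total = known + unknown
    return {
        "overall": (known_ok + unknown_ok) / total,
        "known": known_ok / known if known else 0.0,
        "unknown": unknown_ok / unknown if unknown else 0.0,
    }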

14
TnT Accuracy
  • overall accuracy: 96.47-96.94%

15
TnT Accuracy (Cont.)
  • significantly higher accuracy for known words
  • the larger the training set, the higher the accuracy
  • very high accuracy even with a small training sample

16
TnT Accuracy (Cont.)
  • randomly dividing the corpus
  • 85% for training
  • 15% for test

17
Outline
  • Introduction
  • Corpus
  • Markov Model (TnT)
  • Memory Based POS Tagging (MBT)
  • Maximum Likelihood Estimation
  • Heuristic Post Processing
  • Comparisons
  • Conclusions and Future Work

18
Memory Based POS Tagging
  • each case is a specification of a word's features
  • examples: possible tags, a fixed-width context
  • memory-based learning algorithms
  • learn from a training set
  • the learned instances are kept in memory, usually in a
    tree-like data structure
  • a new instance is handled using a similarity metric over its
    features
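A minimal sketch of the memory-based idea: keep the training instances and tag a new one by the majority tag among its most similar stored neighbours. The brute-force search and simple feature-overlap similarity are for illustration; MBT itself uses an optimized tree-based index:

from collections import Counter

def overlap(a, b):
    """Number of positions where two feature tuples agree."""
    return sum(x == y for x, y in zip(a, b))

def classify(instance, memory, k=3):
    """memory: list of (feature_tuple, tag) pairs stored during training."""
    neighbours = sorted(memory, key=lambda m: overlap(instance, m[0]),
                        reverse=True)[:k]
    return Counter(tag for _, tag in neighbours).most_common(1)[0][0]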

19
Memory Based POS Tagging (Cont.)
  • tool: MBT, by the ILK Research Group at Tilburg Univ.
  • data structures:
  • a lexicon associating words with their tags as seen in the
    training set
  • a case base for known words
  • a case base for unknown words
  • feature patterns chosen for accuracy:
  • ddfa for known words
  • dd: disambiguated tags of the two previous words
  • f: the focus word
  • a: the ambiguous tag of the word after the current word
  • dfass for unknown words
  • ss: the last two letters (suffix) of the current word
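A minimal sketch of building the ddfa-style context features for a known word at position i; the placeholder value for out-of-range positions and the parallel-list data layout are assumptions, a simplification of what MBT actually stores:

def ddfa_features(i, words, left_tags, ambiguous_tags):
    """Feature tuple (d, d, f, a) for the word at position i."""
    d2 = left_tags[i - 2] if i >= 2 else "_"   # disambiguated tag two words back
    d1 = left_tags[i - 1] if i >= 1 else "_"   # disambiguated tag of previous word
    f = words[i]                               # the focus word itself
    a = ambiguous_tags[i + 1] if i + 1 < len(words) else "_"  # next word's ambiguous tag
    return (d2, d1, f, a)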

20
MBT Accuracy
21
Outline
  • Introduction
  • Corpus
  • Markov Model (TnT)
  • Memory Based POS Tagging (MBT)
  • Maximum Likelihood Estimation
  • Heuristic Post Processing
  • Comparisons
  • Conclusions and Future Work

22
Maximum Likelihood Estimation (MLE)
  • a benchmark: simple and easy to implement
  • for every word in the training set:
  • calculate the maximum likelihood probability of each tag
    assigned to that word
  • pick the tag with the greatest maximum likelihood probability
  • make it the only tag assignable to that word (its designated
    tag)
  • designated tag: the tag assigned to the word more often than
    any other tag
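A minimal sketch of this baseline: learn the designated tag per word from the training set, then tag test words with it, falling back to a fixed tag (DEFAULT or N-SING in the experiments) for unknown words:

from collections import Counter, defaultdict

def train_designated_tags(tagged_words):
    """tagged_words: iterable of (word, tag) pairs from the training set."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_mle(words, designated, unknown_tag="N_SING"):
    """Assign each word its designated tag; unknown words get a fixed tag."""
    return [designated.get(w, unknown_tag) for w in words]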

23
MLE Accuracy
DEFAULT as designated tag for unknown words
24
MLE Accuracy
N-SING as designated tag for unknown words
25
Outline
  • Introduction
  • Corpus
  • Markov Model (TnT)
  • Memory Based POS Tagging (MBT)
  • Maximum Likelihood Estimation
  • Heuristic Post Processing
  • Comparisons
  • Conclusions and Future Work

26
Heuristic Post Processing
  • the correct tag for most of the unknown words: N_SING
  • some of the plural unknown words were incorrectly tagged as
    DEFAULT or N_SING
  • plural nouns end in suffix substrings such as ها, های, ان, ات
  • e.g., نیمکت (bench) → نیمکت‌ها (benches)
  • post-processing the output with a simple heuristic based on
    these suffixes
  • doesn't work for all words, e.g., مدرسه‌ات (your school)
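A minimal sketch of this heuristic: retag an unknown word as a plural noun when it was labelled DEFAULT or N_SING but ends with a plural-marking suffix. The suffix tuple below is illustrative rather than the exact list used in the experiments:

# Illustrative Persian plural suffixes; the experiments used their own list.
PLURAL_SUFFIXES = ("ها", "های", "ان", "ات")

def postprocess(words, tags, train_vocab):
    """Retag unknown DEFAULT/N_SING words with a plural-looking ending as N_PL."""
    fixed = []
    for word, tag in zip(words, tags):
        if (word not in train_vocab and tag in ("DEFAULT", "N_SING")
                and word.endswith(PLURAL_SUFFIXES)):
            tag = "N_PL"
        fixed.append(tag)
    return fixed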

27
Heuristic Post Processing (Cont.)
Unknown Words Features (Tail)
Unknown Words Features (Head)
28
Result of Post Processing on MBT
Accuracy of MBT for Unknown Words
29
Result of Post Processing on MLE
Accuracy of MLE with "DEFAULT" as Designated Tag
for Unknown Words
30
Result of Post Processing on MLE (Cont.)
Accuracy of MLE with "N-SING" as Designated Tag
for Unknown Words
31
Outline
  • Introduction
  • Corpus
  • Markov Model (TnT)
  • Memory Based POS Tagging (MBT)
  • Maximum Likelihood Estimation
  • Heuristic Post Processing
  • Comparisons
  • Conclusions and Future Work

32
Comparison
Weaker POS taggers benefited more from post-processing
33
Comparison (Cont.)
[Charts: maximum accuracy of each tagger]
34
Comparison with other Languages
results in line with those reported for other languages
35
Outline
  • Introduction
  • Corpus
  • Markov Model (TnT)
  • Memory Based POS Tagging (MBT)
  • Maximum Likelihood Estimation
  • Heuristic Post Processing
  • Comparisons
  • Conclusions and Future Work

36
Summary
  • preparing a Persian POS tagged corpus
  • POS tagging of Persian text with
  • Markov Model
  • Memory based approach
  • Maximum Likelihood approach
  • post-processing of the results by modifying the tags of
    unknown words

37
Conclusion
  • reasonable POS taggers for Persian:
  • the statistical POS tagger (TnT)
  • the memory-based tagger with post-processing
  • heuristic post-processing improves unknown-word accuracy,
    especially for the weaker models
  • encouraging results, comparable with those for other
    languages such as English and German

38
Future Work
  • experiments with more POS tagging models
  • further post-processing
  • building other test collections

39
Acknowledgement
Dr. Bijankhan
Dr. Thorsten Brants
Dr. Farhad Oroumchian