Learning Morphological Disambiguation Rules for Turkish - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Learning Morphological Disambiguation Rules for
Turkish
  • Deniz Yuret
  • Ferhan Türe
  • Koç University, Istanbul

2
Overview
  • Turkish morphology
  • The morphological disambiguation task
  • The Greedy Prepend Algorithm
  • Training
  • Evaluation

3
Turkish Morphology
  • Turkish is an agglutinative language: many
    syntactic phenomena that are expressed by
    function words and word order in English are
    expressed by morphology in Turkish.
  • I will be able to go.
  • (go) (able to) (will) (I)
  • git + ebil + ecek + im
  • Gidebileceğim.

4
Fun with Turkish Morphology
Avrupalılaştıramadıklarımızdanmışsınız
  • Avrupa: Europe
  • lı: European
  • laş: become
  • tır: make
  • ama: not able to
  • dık: we were
  • larımız: those that
  • dan: from
  • mış: were
  • sınız: you

5
So how long can words be?
  • uyu: sleep
  • uyut: make X sleep
  • uyuttur: have Y make X sleep
  • uyutturt: have Z have Y make X sleep
  • uyutturttur: have W have Z have Y make X sleep
  • uyutturtturt: have Q have W have Z have Y make X
    sleep

6
Morphological Analyzer for Turkish
  • masalı
  • masal+Noun+A3sg+Pnon+Acc ("the story", accusative)
  • masal+Noun+A3sg+P3sg+Nom ("his story")
  • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With ("with
    tables")
  • Oflazer, K. (1994). Two-level description of
    Turkish morphology. Literary and Linguistic
    Computing.
  • Oflazer, K., Hakkani-Tür, D. Z., and Tür, G.
    (1999). Design for a Turkish treebank. EACL'99.
  • Beesley, K. R. and Karttunen, L. (2003). Finite
    State Morphology. CSLI Publications.

7
Features, IGs and Tags
masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
  • 8 unique tags
  • 11,084 distinct tags observed in the 1M-word
    training corpus
  • 126 unique features
  • 9,129 unique IGs

8
Why not just do POS tagging?
(figure from Oflazer, 1999)
9
Why not just do POS tagging?
  • Inflectional groups can independently act as
    heads or modifiers in syntactic dependencies.
  • Full morphological analysis is essential for
    further syntactic analysis.

10
Morphological disambiguation
  • Ambiguity is rare in English
  • lives: live+s or life+s
  • More serious in Turkish:
  • 42.1% of the tokens are ambiguous
  • 1.8 parses per token on average
  • 3.8 parses per ambiguous token

11
Morphological disambiguation
  • Task: pick the correct parse given the context
  • masal+Noun+A3sg+Pnon+Acc
  • masal+Noun+A3sg+P3sg+Nom
  • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
  • Uzun masalı anlat ("Tell the long story")
  • Uzun masalı bitti ("His long story ended")
  • Uzun masalı oda ("Room with a long table")

12
Morphological disambiguation
  • Task: pick the correct parse given the context
  • masal+Noun+A3sg+Pnon+Acc
  • masal+Noun+A3sg+P3sg+Nom
  • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
  • Key idea:
  • Build a separate classifier for each feature.

13
Decision Lists
  • If (W = çok) and (R1 = +DA) then W has +Det
  • If (L1 = pek) then W has +Det
  • If (W = AzI) then W does not have +Det
  • If (W = çok) then W does not have +Det
  • If TRUE then W has +Det
  • pek çok alanda (matches rule 1)
  • pek çok insan (matches rule 2)
  • insan çok daha (matches rule 4)
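A decision list is applied top-down: the first rule whose condition matches decides the answer. A minimal sketch of the +Det list above (the encoding, function name, and argument convention are my own illustration, not the authors' code):

```python
# Hypothetical encoding of the slide's five-rule decision list for +Det.
# W = the word itself, L1 = left neighbor, R1 = right-neighbor pattern.
DET_RULES = [
    (lambda w, l1, r1: w == "çok" and r1 == "+DA", True),   # rule 1
    (lambda w, l1, r1: l1 == "pek",                True),   # rule 2
    (lambda w, l1, r1: w == "AzI",                 False),  # rule 3
    (lambda w, l1, r1: w == "çok",                 False),  # rule 4
    (lambda w, l1, r1: True,                       True),   # default rule
]

def has_det(w, l1=None, r1=None):
    """Return the answer of the first matching rule."""
    for cond, answer in DET_RULES:
        if cond(w, l1, r1):
            return answer
```

For example, in "insan çok daha" rules 1-3 fail and rule 4 answers no, matching the slide's annotation.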

14
Greedy Prepend Algorithm
GPA(data)
1  dlist ← NIL
2  default-class ← Most-Common-Class(data)
3  rule ← If TRUE Then default-class
4  while Gain(rule, dlist, data) > 0
5    do dlist ← Prepend(rule, dlist)
6       rule ← Max-Gain-Rule(dlist, data)
7  return dlist
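The pseudocode above can be sketched in Python. The candidate rule space (single attribute = value tests), the gain function, and the toy data are my own minimal choices for illustration, not the authors' implementation:

```python
# Minimal sketch of the Greedy Prepend Algorithm (GPA) for decision lists.
from collections import Counter

def predict(dlist, x):
    """Return the class of the first matching (most recently prepended) rule."""
    for cond, cls in dlist:
        if all(x.get(k) == v for k, v in cond.items()):
            return cls
    return None

def errors(dlist, data):
    return sum(1 for x, y in data if predict(dlist, x) != y)

def gain(rule, dlist, data):
    """Reduction in training errors obtained by prepending `rule` to `dlist`."""
    return errors(dlist, data) - errors([rule] + dlist, data)

def candidate_rules(data):
    """All rules of the form `If attr = value Then class` seen in the data."""
    classes = sorted({y for _, y in data})
    for x, _ in data:
        for k, v in x.items():
            for c in classes:
                yield ({k: v}, c)

def gpa(data):
    dlist = []                                   # 1: dlist <- NIL
    default = Counter(y for _, y in data).most_common(1)[0][0]
    rule = ({}, default)                         # 3: If TRUE Then default-class
    while gain(rule, dlist, data) > 0:           # 4
        dlist = [rule] + dlist                   # 5: prepend
        rule = max(candidate_rules(data),
                   key=lambda r: gain(r, dlist, data))  # 6: max-gain rule
    return dlist                                 # 7

# Toy training set (hypothetical): does the word carry the target feature?
data = [({"W": "pek"}, 1), ({"W": "pek"}, 1),
        ({"W": "az"}, 0), ({"W": "az"}, 0), ({"W": "az"}, 1)]
dlist = gpa(data)
```

On the toy data GPA first prepends the default rule (class 1), then the single rule `W = az → 0`, after which no candidate has positive gain.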
15
Training Data
  • 1M words of news material
  • Semi-automatically disambiguated
  • Created 126 separate training sets, one for each
    feature
  • Each training set contains only those instances
    that have the corresponding feature in at least
    one of their parses

16
Input attributes
  • For a five-word window:
  • The exact word string (e.g. W=Ali'nin)
  • The lowercase version (e.g. W=ali'nin)
  • All suffixes (e.g. W=n, W=In, W=nIn, W='nIn,
    etc.)
  • Character types (e.g. Ali'nin would be described
    with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID,
    W=LOWER-LAST)
  • 40 attributes per instance on average.
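The attribute types above can be sketched as a small extractor for one window position. The character-type names mirror the slide; the exact encoding is an assumption:

```python
# Illustrative attribute extractor for a single word in the window.
def attributes(word):
    feats = [word, word.lower()]                      # exact + lowercase strings
    feats += [word[i:] for i in range(1, len(word))]  # all suffixes
    # character-type attributes (names follow the slide)
    if word[:1].isupper():
        feats.append("UPPER-FIRST")
    if any(c.islower() for c in word[1:-1]):
        feats.append("LOWER-MID")
    if "'" in word[1:-1]:
        feats.append("APOS-MID")
    if word[-1:].islower():
        feats.append("LOWER-LAST")
    return feats
```

For Ali'nin this yields the word, its lowercase form, suffixes such as 'nin and nin, and the four character-type attributes listed on the slide.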

17
Sample decision lists
Acc (default class 0; 672 rules), excerpt:
  1 W=InI    1 W=yI      1 W=UPPER0  1 W=IzI
  1 L1=bu    1 W=onu     1 R1=mAK    1 W=beni
  0 W=günü   1 W=InlArI  1 W=onları  0 W=olAyI
  0 W=sorunu
Prop (default class 1; 3476 rules), excerpt:
  0 W=STFIRST   0 W=Türk    1 W=STFIRST R1=UCFIRST
  0 L1=.        0 W=AnAl    1 R1=,
  0 W=yAD       1 W=UPPER0  0 W=lAD
  0 W=AK        1 R1=UPPER  0 W=Milli
  1 W=STFIRST R1=UPPER0
18
Models for individual features
19
Combining models
  • masal+Noun+A3sg+P3sg+Nom
  • masal+Noun+A3sg+Pnon+Acc
  • Decision list results and confidences (only the
    distinguishing features are necessary):
  • P3sg: yes (89.53%)
  • Nom: no (93.92%)
  • Pnon: no (95.03%)
  • Acc: yes (89.24%)
  • score(P3sg+Nom) = 0.8953 × (1 − 0.9392)
  • score(Pnon+Acc) = (1 − 0.9503) × 0.8924
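Carrying out this arithmetic (confidences taken from the slide): each candidate parse multiplies, over the distinguishing features, the confidence that the feature is present, or one minus it when the classifier answered "no".

```python
# Per-feature classifier confidences from the slide.
p_p3sg, p_nom = 0.8953, 0.9392   # P3sg: yes,  Nom: no
p_pnon, p_acc = 0.9503, 0.8924   # Pnon: no,   Acc: yes

score_p3sg_nom = p_p3sg * (1 - p_nom)    # ≈ 0.0544
score_pnon_acc = (1 - p_pnon) * p_acc    # ≈ 0.0444
```

The P3sg+Nom parse scores higher and is selected.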

20
Evaluation
  • Test corpus: 1000 words, hand-tagged
  • Accuracy: 95.87% (conf. int. 94.57–97.08%)
  • Better than the training data!?

21
Other Experiments
  • Retraining on its own output: 96.03%
  • Training on unambiguous data only: 82.57%
  • Forget disambiguation, let's do tagging with a
    single decision list: 91.23%, 10,000 rules

22
Contributions
  • Learning morphological disambiguation rules with
    the GPA decision list learner.
  • Reducing data sparseness and increasing noise
    tolerance by using separate models for individual
    output features.
  • ECOC, WSD, etc.