Chapter 3: Morphology and Finite State Transducer - PowerPoint PPT Presentation

1
Chapter 3: Morphology and Finite State Transducer
  • Heshaam Faili
  • hfaili@ece.ut.ac.ir
  • University of Tehran

2
Morphology
  • Morphology is the study of the internal structure
    of words
  • Morphemes: (roughly) the minimal meaning-bearing units in a language, the smallest building blocks of words
  • Morphological parsing is the task of breaking a
    word down into its component morphemes, i.e.,
    assigning structure
  • going → go + ing
  • running → run + ing
  • spelling rules are different from morphological
    rules
  • Parsing can also provide us with an analysis
  • going → go:VERB ing:GERUND
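The idea on this slide can be sketched as a toy suffix-stripping parser. The stem list and tag names below are illustrative, not a real lexicon; note how "running" fails, which is exactly the slide's point that spelling rules are separate from morphological rules.

```python
# A minimal sketch of morphological parsing as suffix stripping plus a
# stem lexicon. Stems and tags here are illustrative only.
SUFFIXES = [("ing", "GERUND"), ("s", "3SG")]
STEMS = {"go", "run", "walk"}

def parse(word):
    """Return analyses like 'go:VERB ing:GERUND' for recognizable words."""
    analyses = []
    for suffix, tag in SUFFIXES:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and stem in STEMS:
            analyses.append(f"{stem}:VERB {suffix}:{tag}")
    return analyses

print(parse("going"))    # ['go:VERB ing:GERUND']
print(parse("running"))  # [] -- 'runn' is not a stem: spelling rules are separate
```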

3
Kinds of morphology
  • Inflectional morphology: grammatical morphemes that are required for words in certain syntactic situations
  • I run
  • John runs
  • -s is an inflectional morpheme marking the third
    person singular verb
  • Derivational morphology: morphemes that are used to produce new words, providing new meanings and/or new parts of speech
  • establish
  • establishment
  • -ment is a derivational morpheme that turns verbs
    into nouns

4
More on morphology
  • We will refer to the stem of a word (main part)
    and its affixes (additions), which include
    prefixes, suffixes, infixes, and circumfixes
  • Most inflectional morphological endings (and some derivational) are productive: they apply to every word in a given class
  • -ing can attach to any verb (running, hurting)
  • re- can attach to any verb (rerun, rehurt)
  • Morphology is highly complex in more
    agglutinative languages like Persian and Turkish
  • Some of the work of syntax in English is in the
    morphology in Turkish
  • Shows that we can't simply list all possible words

5
Overview
  • Morphological recognition with finite-state
    automata (FSAs)
  • Morphological parsing with finite-state
    transducers (FSTs)
  • Combining FSTs
  • More applications of FSTs

6
A. Morphological recognition with FSA
  • Before we talk about assigning a full structure
    to a word, we can talk about recognizing
    legitimate words
  • We have the technology to do this: finite-state automata (FSAs)

7
Overview of English verbal morphology
  • 4 English regular verb forms: base, -s, -ing, -ed
  • walk/walks/walking/walked
  • merge/merges/merging/merged
  • try/tries/trying/tried
  • map/maps/mapping/mapped
  • Generally productive forms
  • English irregular verbs (~250)
  • eat/eats/eating/ate/eaten
  • catch/catches/catching/caught/caught
  • cut/cuts/cutting/cut/cut
  • etc.

8
Analyzing English verbs
  • For the -s and -ing forms, both regular and irregular verbs use their base forms
  • Irregulars differ in how they treat the past and
    the past participle forms

9
FSA for English verbal morphology (morphotactics)
  • initial: 0; final: 1, 2, 3
  • 0 -verb-past-irreg-> 3
  • 0 -vstem-reg-> 1
  • 1 -past-> 3
  • 1 -past-part-> 3
  • 0 -vstem-reg-> 2
  • 0 -vstem-irreg-> 2
  • 2 -prog-> 3
  • 2 -3sing-> 3
  • N.B. This covers morphotactics, but not spelling rules (the latter require a separate FSA)
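The morphotactics above can be sketched as a small nondeterministic FSA over morpheme-class symbols. The transition table transcribes the arcs on this slide; the class names (vstem-reg, prog, etc.) are the slide's, the code itself is only an illustration.

```python
# Nondeterministic FSA over morpheme classes (morphotactics only;
# spelling rules would need a separate machine).
TRANSITIONS = {
    (0, "verb-past-irreg"): {3},
    (0, "vstem-reg"): {1, 2},
    (0, "vstem-irreg"): {2},
    (1, "past"): {3},
    (1, "past-part"): {3},
    (2, "prog"): {3},
    (2, "3sing"): {3},
}
FINAL = {1, 2, 3}

def accepts(symbols):
    """Track the set of reachable states; accept if any final state survives."""
    states = {0}
    for sym in symbols:
        states = set().union(*(TRANSITIONS.get((q, sym), set()) for q in states))
        if not states:
            return False
    return bool(states & FINAL)

print(accepts(["vstem-reg", "past"]))        # True  (e.g., walk + -ed)
print(accepts(["vstem-irreg", "prog"]))      # True  (e.g., eat + -ing)
print(accepts(["verb-past-irreg", "prog"]))  # False (e.g., *ate + -ing)
```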

10
A Fun FSA Exercise: Isleta Morphology
  • Consider the following data from Isleta, a
    dialect of Southern Tiwa, a Native American
    language spoken in New Mexico
  • temiban I went
  • amiban you went
  • temiwe I am going
  • mimiay he was going
  • tewanban I came
  • tewanhi I will come

11
Practising Isleta
  • List the morphemes corresponding to the following
    English translations
  • I
  • you
  • he
  • go
  • come
  • past
  • present_progressive
  • past_progressive
  • future
  • What is the order of morphemes in Isleta?
  • How would you say each of the following in
    Isleta?
  • He went
  • I will go
  • You were coming

12
An FSA for Isleta Verbal Inflection
  • initial: 0; final: 3
  • 0 -{mi, te, a}-> 1
  • 1 -{mi, wan}-> 2
  • 2 -{ban, we, ay, hi}-> 3

13
B. Morphological Parsing with FSTs
  • Using a finite-state automaton (FSA) to recognize a morphological realization of a word is useful
  • But what if we also want to analyze that word?
  • e.g., given cats, tell us that it's cat +N +PL
  • A finite-state transducer (FST) can give us the
    necessary technology to do this
  • Two-level morphology
  • Lexical level: stem plus affixes
  • Surface level: actual spelling/realization of the word
  • Roughly, we'll have the following for cats
  • c:c a:a t:t +N:ε +PL:s

14
Finite-State Transducers
  • While an FSA recognizes (accepts/rejects) an input expression, it doesn't produce any other output
  • An FST, on the other hand, additionally produces an output expression → we define this in terms of relations
  • So, an FSA is a recognizer, whereas an FST
    translates from one expression to another
  • So, it reads from one tape, and writes to another
    tape (see Figure 3.8, p. 71)
  • Actually, it can also read from the output tape
    and write to the input tape
  • So, FSTs can be used for both analysis and
    generation (they are bidirectional)

15
Transducers and Relations
  • Let's pretend we want to translate from the Cyrillic alphabet to the Roman alphabet
  • We can use a mapping table, such as
  • А → A
  • Б → B
  • Г → G
  • Д → D
  • etc.
  • We define R = {<А, A>, <Б, B>, <Г, G>, <Д, D>, …}
  • We can think of this as a relation R ⊆ Cyrillic × Roman
  • To understand FSTs, we need to understand
    relations

16
The Cyrillic Transducer
  • initial: 0; final: 0
  • 0 -А:A-> 0
  • 0 -Б:B-> 0
  • 0 -Г:G-> 0
  • 0 -Д:D-> 0
  • …
  • Transducers implement a mapping defined by a relation
  • R = {<А, A>, <Б, B>, <Г, G>, <Д, D>, …}
  • These relations are called regular relations
    (since each side expresses a regular expression)
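The single-state transducer above can be sketched as a loop that reads one Cyrillic symbol and writes its Roman counterpart; only the four arcs shown on the slide are included, and unlisted symbols are rejected.

```python
# Single-state FST (state 0 is both initial and final): each arc reads one
# Cyrillic letter and writes the paired Roman letter. Mapping abbreviated.
R = {"А": "A", "Б": "B", "Г": "G", "Д": "D"}

def transduce(s):
    out = []
    for ch in s:
        if ch not in R:
            return None  # reject: no arc for this symbol
        out.append(R[ch])
    return "".join(out)

print(transduce("БАГ"))  # BAG
```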

17
FSAs and FSTs
  • FSTs, then, are almost identical to FSAs. Both have
  • Q: a finite set of states
  • q0: a designated start state
  • F: a set of final states
  • δ: a transition function
  • The difference: the alphabet (Σ) for an FST is now comprised of complex symbols (e.g., X:Y)
  • FSA: Σ is a finite alphabet of symbols
  • FST: Σ is a finite alphabet of complex symbols, or pairs
  • As a shorthand, if we have X:X, we can write this as X

18
FSTs for morphology
  • For morphology, using FSTs means that we can
  • set up pairs between the surface level (actual realization) and the lexical level (stem/affixes)
  • c:c a:a t:t +N:ε +PL:s
  • set up pairs to go from one form to another, i.e., the underlying base form maps to the plural
  • g:g o:e o:e s:s e:e
  • Can combine both kinds of information into the same FST
  • g:g o:o o:o s:s e:e +N:ε +SG:ε
  • g:g o:e o:e s:s e:e +N:ε +PL:ε

19
Isleta Verbal Inflection
  • Surface: te ε ε mi hi ε
  • Lexical: te +PRO +1P mi hi +FUT
  • I will go
  • Surface: temihi
  • Lexical: te+PRO+1P+mi+hi+FUT
  • Note that the cells have to line up across tapes.
  • So, if an input symbol gives rise to more or fewer output symbols, epsilons have to be added to the input or output tape in the appropriate positions.

20
An FST for Isleta Verbal Inflection
  • initial: 0; final: 3
  • 0 -{mi:mi ε:+PRO ε:+3P, te:te ε:+PRO ε:+1P, a:a ε:+PRO ε:+2P}-> 1
  • 1 -{mi, wan}-> 2
  • 2 -{ban:ban ε:+PAST, we:we ε:+PRESPROG, ay:ay ε:+PASTPROG, hi:hi ε:+FUT}-> 3
  • Interpret te:te ε:+PRO ε:+1P as shorthand for 3 separate arcs
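Run in the analysis direction, the Isleta transducer can be sketched as segmentation against the morpheme inventory recovered from the earlier data (te/a/mi subject prefixes, stems mi "go" and wan "come", tense suffixes ban/we/ay/hi). This is a sketch, not a full FST implementation.

```python
# Surface -> lexical analysis for the toy Isleta verb grammar.
PREFIXES = {"te": "te+PRO+1P", "a": "a+PRO+2P", "mi": "mi+PRO+3P"}
STEMS = ["mi", "wan"]
SUFFIXES = {"ban": "ban+PAST", "we": "we+PRESPROG",
            "ay": "ay+PASTPROG", "hi": "hi+FUT"}

def analyze(word):
    """Return all prefix+stem+suffix segmentations as lexical strings."""
    analyses = []
    for p, p_lex in PREFIXES.items():
        if not word.startswith(p):
            continue
        rest = word[len(p):]
        for s in STEMS:
            if rest.startswith(s) and rest[len(s):] in SUFFIXES:
                analyses.append(f"{p_lex}+{s}+{SUFFIXES[rest[len(s):]]}")
    return analyses

print(analyze("temihi"))  # ['te+PRO+1P+mi+hi+FUT']  "I will go"
```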

21
A Lexical Transducer (Xerox)
  • Remember that FSTs can be used in either
    direction
  • leave+VBZ : leaves
  • leave+VB : leave
  • leave+VBG : leaving
  • leave+VBD : left
  • leave+NN : leave
  • leave+NNS : leaves
  • leaf+NNS : leaves
  • left+JJ : left
  • Left-to-right: input leave+VBD (upper language), output left (lower language)
  • Right-to-left: input leaves (lower language), output leave+NNS, leave+VBZ, leaf+NNS (upper language)

22
Transducer Example (Xerox)
  • L1 = [a-z]*
  • Consider the language L2 that results from replacing any instances of "ab" in L1 by "x".
  • So, to define the mapping, we define a relation R ⊆ L1 × L2
  • e.g., <"abacab", "xacx"> is in R.
  • Note: "xacx" in the lower language is paired with 4 strings in the upper language: "abacab", "abacx", "xacab", "xacx"

N.B. ? stands for [a-z] \ {a, b, x}
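The many-to-one pairing on this slide can be checked directly: going upper-to-lower replaces every "ab" with "x", while going lower-to-upper each "x" may have come from either "x" or "ab". The sketch below enumerates the preimages of "xacx".

```python
import itertools

def to_lower(upper):
    """Upper -> lower: replace every instance of 'ab' by 'x'."""
    return upper.replace("ab", "x")

def upper_preimages(lower):
    """Lower -> upper: each 'x' expands to either 'x' or 'ab'."""
    parts = lower.split("x")
    n = lower.count("x")
    results = set()
    for choice in itertools.product(["x", "ab"], repeat=n):
        s = parts[0]
        for i in range(n):
            s += choice[i] + parts[i + 1]
        results.add(s)
    return results

print(sorted(upper_preimages("xacx")))
# ['abacab', 'abacx', 'xacab', 'xacx']
```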
23
C. Combining FSTs: Spelling Rules
  • So far, we have gone from a lexical level (e.g., cat+N+PL) to a surface level (e.g., cats)
  • But this surface level is actually an intermediate level: it doesn't take spelling into account
  • So, the lexical level of fox+N+PL corresponds to fox^s
  • We will use ^ to refer to a morpheme boundary
  • We need another level to account for spelling rules

24
Lexicon FST
  • The lexicon FST will convert a lexical level to
    an intermediate form
  • dog+N+PL → dog^s
  • fox+N+PL → fox^s
  • mouse+N+PL → mouse^s
  • dog+V+SG → dog^s
  • This will be of the form
  • 0 -f-> 1    3 -+N:^-> 4
  • 1 -o-> 2    4 -+PL:s-> 5
  • 2 -x-> 3    4 -+SG:ε-> 6
  • And so on
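The lexicon FST above can be sketched as a string function under simplified assumptions: the +N/+V feature realizes as the boundary ^, and +PL (or verbal 3rd-singular +SG) realizes as s. As on the slide, irregulars like mouse still get the regular affix at this level.

```python
# Lexical -> intermediate mapping, a simplified sketch of the lexicon FST.
def lexicon_fst(lexical):
    stem, *feats = lexical.split("+")
    out, pos = stem, None
    for f in feats:
        if f in ("N", "V"):
            pos = f
            out += "^"          # part-of-speech feature realizes as the boundary
        elif f == "PL" or (f == "SG" and pos == "V"):
            out += "s"          # plural -s, or verbal 3rd-singular -s
    return out

print(lexicon_fst("fox+N+PL"))  # fox^s
print(lexicon_fst("dog+V+SG"))  # dog^s
```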

25
English noun lexicon as an FST
JM Fig 3.9
Expanding the aliases: LEX-FST
JM Fig 3.11
26
LEX-FST
  • Let's allow ε to pad the tape
  • Then, we won't force both tapes to have the same length
  • Also, let's pretend we're generating

(Figure: lexical tape vs. intermediate tape, with the morpheme boundary ^ and the word-final boundary #)
27
Rule FST
  • The rule FST will convert the intermediate form
    into the surface form
  • dog^s → dogs (covers both N and V forms)
  • fox^s → foxes
  • mouse^s → mice
  • Assuming we include other arcs for every other character, this will be of the form
  • 0 -f-> 0    1 -^:ε-> 2
  • 0 -o-> 0    2 -ε:e-> 3
  • 0 -x-> 1    3 -s-> 4
  • This FST is too impoverished

28
Some English Spelling Rules
29
E-insertion FST (JM Fig 3.14, p. 78)
30
E-insertion FST
(Tapes: intermediate vs. surface)
  • Trace
  • generating foxes from fox^s#
  • q0 -f-> q0 -o-> q0 -x-> q1 -^:ε-> q2 -ε:e-> q3 -s-> q4 -#-> q0
  • generating foxs from fox^s#
  • q0 -f-> q0 -o-> q0 -x-> q1 -^:ε-> q2 -s-> q5 -#-> FAIL
  • generating salt from salt#
  • q0 -s-> q1 -a-> q0 -l-> q0 -t-> q0 -#-> q0
  • parsing assess
  • q0 -a-> q0 -s-> q1 -s-> q1 -^:ε-> q2 -ε:e-> q3 -s-> q4 -s-> FAIL
  • q0 -a-> q0 -s-> q1 -s-> q1 -e-> q0 -s-> q1 -s-> q1 -#-> q0
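The e-insertion behavior traced above can be sketched as a function from intermediate forms to surface forms. This covers only the one spelling rule (insert e when ^ sits between x/s/z and s); stem changes like mouse^s → mice would need separate machinery.

```python
import re

# E-insertion: '^' becomes 'e' between {x, s, z} and 's'; any other '^'
# (and a trailing '#', if present) simply deletes.
def rule_fst(intermediate):
    s = re.sub(r"([xsz])\^s", r"\1es", intermediate)
    return s.replace("^", "").replace("#", "")

print(rule_fst("fox^s"))  # foxes
print(rule_fst("dog^s"))  # dogs
print(rule_fst("salt"))   # salt
```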

31
Combining Lexicon and Rule FSTs
  • We would like to combine these two FSTs, so that
    we can go from the lexical level to the surface
    level.
  • How do we integrate the intermediate level?
  • Cascade the FSTs: run one after the other
  • Compose the FSTs: combine the rules at each state

32
Cascading FSTs
  • The idea of cascading FSTs is simple
  • Input1 → FST1 → Output1
  • Output1 → FST2 → Output2
  • The output of the first FST is run as the input
    of the second
  • Since both FSTs are reversible, the cascaded FSTs
    are still reversible/bi-directional.

33
Composing FSTs
  • We can compose each transition in one FST with a
    transition in another
  • FST1: p0 -a:b-> p1, p0 -d:e-> p1
  • FST2: q0 -b:c-> q1, q0 -e:f-> q0
  • Composed FST:
  • (p0,q0) -a:c-> (p1,q1)
  • (p0,q0) -d:f-> (p1,q0)
  • The new state names (e.g., (p0,q0)) seem somewhat arbitrary, but this ensures that two FSTs with different structures can still be composed
  • e.g., a:b and d:e originally went to the same state, but now we have to distinguish those states
  • Why doesn't e:f loop anymore?
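The pairing of transitions described above can be sketched as a toy composition over explicit arc lists: FST1's lower symbol must match FST2's upper symbol, and composed states are pairs. A real composition algorithm also handles ε-transitions and prunes unreachable pair states, which this sketch ignores.

```python
# Arcs are (src, upper, lower, dst) tuples.
def compose(arcs1, arcs2):
    out = []
    for (p, a, b, p2) in arcs1:
        for (q, c, d, q2) in arcs2:
            if b == c:  # FST1's output symbol feeds FST2's input symbol
                out.append(((p, q), a, d, (p2, q2)))
    return out

fst1 = [("p0", "a", "b", "p1"), ("p0", "d", "e", "p1")]
fst2 = [("q0", "b", "c", "q1"), ("q0", "e", "f", "q0")]
for arc in compose(fst1, fst2):
    print(arc)
# (('p0', 'q0'), 'a', 'c', ('p1', 'q1'))
# (('p0', 'q0'), 'd', 'f', ('p1', 'q0'))
```

Note that e:f no longer loops in the result: even though the q-side stays at q0, the p-side advances to p1, so the composed arc cannot return to its source state.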

34
Composing FSTs for morphology
  • With our lexical, intermediate, and surface levels, this means that we'll compose
  • p2 -x-> p3    p4 -+PL:s-> p5
  • p3 -+N:^-> p4    p4 -ε:ε-> p4
  • q0 -x-> q1    q2 -ε:e-> q3
  • q1 -^:ε-> q2    q3 -s-> q4
  • into
  • (p2,q0) -x-> (p3,q1)
  • (p3,q1) -+N:ε-> (p4,q2)
  • (p4,q2) -ε:e-> (p4,q3)
  • (p4,q3) -+PL:s-> (p5,q4)

35
Generating or Parsing with FST lexicon and rules
36
Lexicon-Free FST: the Porter Stemmer
  • Used for IR and search engines
  • e.g., a search for foxes should also match fox
  • Stemming
  • Lexicon-free: the Porter algorithm
  • ATIONAL → ATE (relational → relate)
  • ING → ε if the stem contains a vowel (motoring → motor)
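The two rules quoted on this slide can be hand-coded as a sketch; the full Porter algorithm has many more rules and measure-based conditions.

```python
# Two representative Porter rules, hand-coded for illustration only.
def stem(word):
    w = word.lower()
    if w.endswith("ational"):
        return w[:-7] + "ate"      # ATIONAL -> ATE
    if w.endswith("ing") and any(c in "aeiou" for c in w[:-3]):
        return w[:-3]              # ING -> nothing, if the stem has a vowel
    return w

print(stem("relational"))  # relate
print(stem("motoring"))    # motor
print(stem("sing"))        # sing  (no vowel before -ing, so it is kept)
```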

37
D. More applications of FSTs
  • Syntactic parsing using FSTs
  • approximate the actual structure
  • (it won't work in general for syntax)
  • Noun Phrase (NP) parsing using FSTs
  • also referred to as NP chunking, or partial
    parsing
  • often used for prepositional phrases (PPs), too

38
Syntactic parsing using FSTs
  • Parsing: more than recognition, it returns a structure
  • For syntactic recognition, an FSA could be used
  • How does syntax work?
  • S → NP VP    D → the
  • NP → (D) N    N → girl | zebras
  • VP → V NP    V → saw
  • How do we go about encoding this?

39
Syntactic Parsing using FSTs
(Figure: a cascade of FSTs parsing The girl saw zebras)
  • Input: The girl saw zebras, tagged D N V N
  • FST1 (NPs): initial 0, final 2; 0 -N:NP-> 2, 0 -D:ε-> 1, 1 -N:NP-> 2
  • FST1: D N V N → NP V NP
  • FST2 (VPs): NP V NP → NP VP
  • FST3 (Ss): NP VP → S
40
Syntactic structure with FSTs
  • Note that the previous FSTs only output labels
    after the phrase has been completed.
  • Where did the phrase start?
  • To fully capture the structure of a sentence, we
    need an FST which delineates the beginning and
    the end of a phrase
  • 0 -Det:NP-Start-> 1
  • 1 -N:NP-Finish-> 2
  • Another FST can group the pieces into complete
    phrases

41
Why FSTs can't always be used for syntax
  • Syntax is infinite, but we have set up a finite
    number of levels (depth) to a tree with a finite
    number of FSTs
  • Can still use FSTs, but (arguably) not as elegant
  • The girl saw that zebras saw the giraffes.
  • We have a VP over a VP and will have to run FST2
    twice at different times.
  • Furthermore, we begin to get very complex FST abbreviations, e.g., /Det? Adj N PP/, which don't match linguistic structure
  • Center-embedding constructions
  • Allowed in languages like English
  • Mathematically impossible to capture with
    finite-state methods

42
Center embedding
  • Example
  • The man (that) the woman saw laughed.
  • The man Harry said the woman saw laughed.
  • An S in the middle of another S
  • A problem for FSA/FST technology
  • There's no way for finite-state grammars to make sure that the number of NPs matches the number of verbs
  • These are a^n c b^n constructions → not regular
  • We have to use context-free grammars: a topic we'll return to later in the course

43
Noun Phrase (NP) parsing using FSTs
  • If we make the task more narrow, we can have more success: e.g., only parse NPs
  • The man on the floor likes the woman who is a trapeze artist
  • [The man]NP on [the floor]NP likes [the woman]NP who is [a trapeze artist]NP
  • Taking the NP chunker output as input, a PP chunker gives
  • [The man]NP [on [the floor]NP]PP likes [the woman]NP who is [a trapeze artist]NP

44
Exercises
  • 3.1, 3.4, 3.6 (3.3, 3.5 in the new 2005 edition)
  • Write a Persian morphology analyzer (covering both nouns and verbs) in Perl
  • See the document on morphological analysis from the Shiraz Project