Morphology and Finite-State Transducers

Provided by: mathias

1
Morphology and Finite-State Transducers
  • by Mathias Creutz
  • 31 October 2001
  • Chapter 3, Jurafsky & Martin

2
Contents
  • Morphology
  • morphemes, inflection and derivation, allomorphs
  • Morphological Parsing
  • finite-state automata, two-level morphology
  • Finite-State Transducers
  • rules, combination of FSTs, lexicon-free FSTs
  • Human Morphological Processing
  • Exercise

3
Morphology
  • Morphology is the study of the way words are
    built up from smaller meaning-bearing units,
    morphemes.
  • e.g. talo + ssa + ni + kin
  • Two broad classes of morphemes: stems and
    affixes
  • the stem is the main morpheme of the word,
    supplying the main meaning, e.g. talo in
    talossanikin

4
Affixes
  • Affixes add additional meanings.
  • Concatenative morphology uses the following types
    of affixes:
  • prefixes, e.g. epä- in epäolennainen
  • suffixes, e.g. -ssa in talossa
  • circumfixes, e.g. German ge- -t in gesagt
    (said)

5
Non-concatenative Morphology
  • In non-concatenative morphology the stem morpheme
    is split up. The following types of affixes are
    used:
  • infixes, e.g. Californian Yurok, sepolah (field),
    segepolah (fields)
  • transfixes, e.g. Hebrew, lamad (he studied),
    limed (he taught), lumad (he was taught)
  • This type of non-concatenative morphology is
    called templatic or root-and-pattern morphology.

6
Inflection and Derivation
  • There are two broad classes of ways to form words
    from morphemes: inflection and derivation.

7
Inflection
  • Inflection is the combination of a word stem with
    a grammatical morpheme, usually resulting in a
    word of the same class as the original stem, and
    usually filling some syntactic function, e.g.
    plural of nouns.
  • talo (singular), talot (plural)
  • Inflection is productive.
  • talo, talot vs. auto, autot vs. metsä, metsät
  • The meaning of the resulting word is easily
    predictable.

8
Derivation
  • Derivation is the combination of a word stem with
    a grammatical morpheme, usually resulting in a
    word of a different class, often with a meaning
    hard to predict exactly.
  • e.g. järki, järjestää, järjestö,
    järjestellä, järjestelmä,
    järjestelmällinen, järjestelmällisyys
  • Not always productive.
  • järki, järjestää vs. metsä, metsästää vs.
    talo, talostaa?

9
Allomorphs
  • A group of allomorphs makes up one morpheme class.
    An allomorph is a special variant of a morpheme.
  • e.g. Finnish illative ending: <vowel_lengthening>n,
    h<vowel>n, seen, siin → taloon, metsään,
    taloihin, huoneeseen, huoneisiin
  • e.g. Finnish stem variation: käsi, käden,
    kättä, käteen

10
Why Allomorphs?
  • Phonological constraints
  • e.g. vowel harmony, talossa vs. metsässä
  • Morphological paradigms
  • e.g. käsi, käden vs. kasi, kasin,
    Swedish leta, letade vs. heta, hette
  • Irregularities
  • e.g. cat, cats vs. goose, geese
  • Orthographic constraints, i.e. spelling rules
  • e.g. cat, cats vs. city, cities

11
Morphological Parsing
  • Parsing means taking an input and producing some
    sort of structure for it.
  • Morphological parsing means breaking down a word
    form into its constituent morphemes.
  • e.g. talossa → talo + ssa
  • Mapping a word form to its base form is called
    stemming.
  • e.g. talossa → talo

12
Finite-State Morphological Parsing
  • In order to build a parser we need the following:
  • a lexicon containing the stems and affixes,
  • morphotactics, i.e. the model of morpheme
    ordering, e.g. talossani instead of
    talonissa,
  • a set of rules (orthographic, etc.), i.e. the
    model of changes that occur in a word, usually
    when two morphemes combine, e.g. city + s →
    cities.

13
Finite-State Automaton for Inflection of English Verbs
[Figure: FSA with start state q0. Arc labels: reg-verb-stem, irreg-verb-stem, irreg-past-verb-form, preterite (-ed), past-participle (-ed), progressive (-ing), 3-singular (-s)]

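
The automaton over morpheme classes can be sketched in a few lines of Python. This is only an illustration: the state names q1 to q3, the arc layout and the three-verb lexicon are assumptions, since only the arc labels and q0 are recoverable from the slide.

```python
# Hypothetical toy lexicon: morpheme class -> members.
STEMS = {
    "reg-verb-stem": {"talk", "test"},
    "irreg-verb-stem": {"sing"},
    "irreg-past-verb-form": {"sang", "sung"},
}
SUFFIXES = {"ed", "ing", "s"}

# Assumed arc layout: (state, label) -> next state.
TRANSITIONS = {
    ("q0", "reg-verb-stem"): "q1",
    ("q0", "irreg-verb-stem"): "q2",
    ("q0", "irreg-past-verb-form"): "q3",
    ("q1", "-ed"): "q3",   # preterite and past participle
    ("q1", "-ing"): "q3",  # progressive
    ("q1", "-s"): "q3",    # 3-singular
    ("q2", "-ing"): "q3",
    ("q2", "-s"): "q3",
}
FINAL = {"q1", "q2", "q3"}


def labels_of(morph):
    """Return every arc label that the given morph can realise."""
    labels = [cls for cls, members in STEMS.items() if morph in members]
    if morph in SUFFIXES:
        labels.append("-" + morph)
    return labels


def accepts(word):
    """Try every split of word into stem + optional suffix."""
    for cut in range(1, len(word) + 1):
        stem, suffix = word[:cut], word[cut:]
        for label in labels_of(stem):
            state = TRANSITIONS.get(("q0", label))
            if state is None:
                continue
            if not suffix:
                if state in FINAL:
                    return True
                continue
            for suffix_label in labels_of(suffix):
                if TRANSITIONS.get((state, suffix_label)) in FINAL:
                    return True
    return False
```

With this layout `accepts("talked")` and `accepts("sang")` hold, while `accepts("singed")` does not, because no -ed arc leaves the state reached by an irregular stem.
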
14
Finite-State Automaton for Inflection of the Verbs talk, test and sing
[Figure: letter-by-letter FSA with start state q0; each arc is labelled with a single letter of the inflected forms of talk, test and sing]

15
Two-Level Morphology
  • Two-level morphology represents a word as a
    correspondence between a lexical level, which
    represents a simple concatenation of morphemes
    making up a word, and the surface level, which
    represents the actual spelling of the final word.

Lexical:  s i n g +V +PROG
Surface:  s i n g i n g

16
Finite-State Transducer
  • A transducer maps between one set of symbols and
    another; a finite-state transducer does this via
    a finite automaton.
  • Where an FSA accepts a language stated over a
    finite alphabet of single symbols, e.g. Σ = {a, b,
    c, ...}, an FST accepts a language stated over
    pairs of symbols, e.g. Σ = {a:a, b:b, a:c, a:ε,
    ...}.
  • In two-level morphology, we call pairs like a:a
    default pairs, and refer to them by a single
    symbol a.
  • An FST can be seen as a recognizer, generator,
    translator or a set relator.
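
A minimal sketch of an FST used as a translator, assuming arcs are stored as (source state, lexical symbol, surface symbol, target state) tuples with the empty string standing for ε. The names `EPS`, `ARCS`, `FINAL` and `transduce`, and the tiny talk-only machine, are illustrative assumptions and not part of the slides.

```python
EPS = ""  # stands for the empty symbol (epsilon)

# Hypothetical toy FST for the verb "talk".
ARCS = [
    (0, "t", "t", 1), (1, "a", "a", 2), (2, "l", "l", 3), (3, "k", "k", 4),
    (4, "+V", EPS, 5),      # +V:epsilon
    (5, "+PRET", "e", 6),   # +PRET:e
    (6, EPS, "d", 7),       # epsilon:d
    (5, "+3SG", "s", 7),    # +3SG:s
]
FINAL = {5, 7}


def transduce(symbols, arcs=ARCS, start=0, final=FINAL):
    """Return all surface strings the FST pairs with the lexical symbols.

    Depth-first search over (state, input position, output so far).
    Assumes the machine has no cycle of epsilon-input arcs.
    """
    results = set()
    stack = [(start, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(symbols) and state in final:
            results.add(out)
        for src, lex, surf, dst in arcs:
            if src != state:
                continue
            if lex == EPS:
                stack.append((dst, pos, out + surf))
            elif pos < len(symbols) and symbols[pos] == lex:
                stack.append((dst, pos + 1, out + surf))
    return results
```

Calling `transduce(["t", "a", "l", "k", "+V", "+PRET"])` follows the lexical tape and collects the surface tape, giving the single form talked. Returning a set rather than one string keeps room for ambiguous machines.
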

17
Finite-State Transducer for Inflection of the Verbs talk, test and sing
[Figure: FST with start state q0. Arcs are labelled with single letters (default pairs) and with the pairs i:u, i:a, +V:ε, +PRET:ε, +PRET:e, +PSTPCP:ε, +PSTPCP:e, +PROG:i, +3SG:s, ε:d, ε:g, ε:n]

18
Examples
Lexical form        Surface form
talk +V             talk
sing +V +3SG        sings
test +V +PROG       testing
talk +V +PRET       talked
sing +V +PRET       sang
talk +V +PSTPCP     talked
sing +V +PSTPCP     sung

19
Useful FST Operations
  • Inversion: switch input and output labels.
  • e.g. Σ(T) = {a:b, c:d} → Σ(inv(T)) = {b:a, d:c}
  • Intersection: only sequences of pairs accepted by
    both transducer T1 and transducer T2 are accepted
    by transducer T1 ∩ T2.
  • Composition: the output of transducer T1 serves
    as input to T2. This is marked as T1 ∘ T2 or
    T2(T1).
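
Inversion is the simplest of these operations to write down. The sketch below assumes arcs are (source, input, output, target) tuples; `invert` and the two-arc machine `T` are hypothetical names mirroring the a:b, c:d example.

```python
def invert(arcs):
    """Swap the input and output label on every arc."""
    return [(src, out, inp, dst) for (src, inp, out, dst) in arcs]


# Two arcs carrying the pairs a:b and c:d.
T = [(0, "a", "b", 1), (1, "c", "d", 2)]
print(invert(T))
```

Inverting a generator (lexical to surface) in this way yields an analyzer (surface to lexical) without changing states or start and final sets.
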

20
Spelling Rules and FSTs
Name                Description of rule                                Example
Consonant doubling  1-letter consonant doubled before -ing/-ed         beg/begging
E deletion          silent e dropped before -ing and -ed               make/making
E insertion         e added after -s, -z, -x, -ch, -sh before -s       watch/watches
Y replacement       -y changes to -ie before -s, and to -i before -ed  try/tries
K insertion         verbs ending with vowel + -c add -k                panic/panicked

21
Three levels
  • Add an intermediate level between the lexical and
    surface levels

Lexical:       k i s s +V +3SG
Intermediate:  k i s s ^ s #
Surface:       k i s s e s

22
FST for the E-insertion Rule
[Figure: transducer with states q0 to q5; arc labels include "z, s, x", "s", "other" and ε:e]
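
The effect of the E-insertion rule can be imitated with a regular-expression rewrite. This is a stand-in for the transducer, not the transducer itself, and it assumes the Jurafsky & Martin convention of marking a morpheme boundary with ^ and a word boundary with # on the intermediate level. Only the contexts s, x and z from the figure are handled; `e_insertion` is a hypothetical name.

```python
import re


def e_insertion(intermediate):
    """Map an intermediate form such as 'kiss^s#' to its surface form."""
    # Insert e where a morpheme boundary follows s, x or z
    # and is itself followed by the suffix s at the end of the word.
    with_e = re.sub(r"([sxz])\^(?=s#)", r"\1e", intermediate)
    # Boundary symbols never appear on the surface level.
    return with_e.replace("^", "").replace("#", "")


print(e_insertion("kiss^s#"))
print(e_insertion("talk^s#"))
```

The lookahead `(?=s#)` checks the right-hand context without consuming it, which matches the way a rule transducer inspects the following symbols before committing to ε:e.
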

23
Combination of FSTs (1)
[Figure: the Lexicon-FST and the rule transducers Rule1-FST ... RuleN-FST]

24
Combination of FSTs (2)
[Figure: Lexicon-FST; intermediate tape k i s s ^ s #; rule transducers Rule1-FST ... RuleN-FST, marked "Intersect"]

25
Combination of FSTs (3)
[Figure: Lexicon-FST; intermediate tape k i s s ^ s #; rule transducers Rule1-FST ... RuleN-FST, marked "Intersect"; the whole cascade marked "Compose"]

26
Intersection and Composition
  • For each state qi in transducer T1 and state qj
    in transducer T2, create a new state qij.
  • Intersection: for any pair a:b, if T1 transitions
    from qi to qn and T2 transitions from qj to qm
    on that pair, then T1 ∩ T2 transitions from qij
    to qnm on a:b.
  • Composition: if T1 transitions from qi to qn with
    the pair a:b, and T2 transitions from qj to qm
    with the pair b:c, then T1 ∘ T2 transitions from
    qij to qnm with the pair a:c.
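
Both constructions can be sketched as a product over arcs. The code assumes arcs are (source, input, output, target) tuples and, as a simplification, ignores ε symbols, which need extra care in a full implementation. The function names are hypothetical.

```python
def intersect(t1, t2):
    """Keep an arc only if both machines carry the same pair a:b."""
    return [((p, q), a, b, (p2, q2))
            for (p, a, b, p2) in t1
            for (q, a2, b2, q2) in t2
            if (a, b) == (a2, b2)]


def compose(t1, t2):
    """Join an a:b arc of t1 with a b:c arc of t2 into an a:c arc."""
    return [((p, q), a, c, (p2, q2))
            for (p, a, b, p2) in t1
            for (q, b2, c, q2) in t2
            if b == b2]


T1 = [(0, "a", "b", 1), (0, "c", "d", 1)]
T2 = [(0, "b", "c", 1)]
print(compose(T1, T2))
print(intersect(T1, [(0, "a", "b", 1)]))
```

Each new state is the pair (qi, qj). The start state of the result is the pair of start states, and a pair is final when both of its components are final.
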

27
Lexicon-Free FSTs
  • Used in information retrieval
  • E.g. the Porter algorithm, which is based on a
    series of simple cascaded rewrite rules
  • ATIONAL → ATE (relational → relate)
  • ING → ε if stem contains vowel (motoring → motor)
  • Errors occur
  • organization → organ, doing → doe, university →
    universe
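
The two rewrite rules quoted above can be sketched directly. This covers only those two rules and is not the Porter algorithm; treating just a, e, i, o, u as vowels is a simplifying assumption, and `toy_stem` is a hypothetical name.

```python
import re

VOWEL = re.compile(r"[aeiou]")  # simplification: y is never a vowel here


def toy_stem(word):
    """Apply the ATIONAL -> ATE and ING -> epsilon rules, in that order."""
    if word.endswith("ational"):
        return word[:-len("ational")] + "ate"
    if word.endswith("ing") and VOWEL.search(word[:-len("ing")]):
        return word[:-len("ing")]
    return word


for w in ("relational", "motoring", "sing"):
    print(w, "->", toy_stem(w))
```

The vowel condition is what keeps sing intact: stripping -ing would leave s, which contains no vowel, so the rule does not fire.
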

28
Human Morphological Processing (1)
  • How are multi-morphemic words represented in the
    minds of human speakers?
  • full-listing hypothesis vs. minimum redundancy
    hypothesis
  • Experiments
  • Stanners et al. (1979): a word is recognized
    faster if it has been seen before (priming):
    lifting → lift, burned → burn, but selective ↛
    select, i.e. different representations for
    inflection and derivation.
  • Marslen-Wilson et al. (1994): spoken derived words
    can prime their stems, but only if their meaning
    is close: government → govern, but department ↛
    depart

29
Human Morphological Processing (2)
  • Speech errors: speakers mix up the order of
    words...
  • e.g. if you break it, it'll drop
  • ... and also attach affixes to the wrong stems
  • e.g. it's not only we who have screw looses (for
    screws loose)
  • e.g. easy enoughly (for easily enough)

30
Exercise (1/3)
  • Your task is to create a finite-state transducer
    that can analyze the following Finnish word
    forms:

Surface form    Lexical form
talo            talo +NOM
taloon          talo +ILL
talomme         talo +NOM +POS1PL
taloomme        talo +ILL +POS1PL
metsä           metsä +NOM
metsään         metsä +ILL
metsämme        metsä +NOM +POS1PL
metsäämme       metsä +ILL +POS1PL

31
Exercise (2/3)
  • The morphological tags have the following
    meanings: NOM = nominative, ILL = illative,
    POS1PL = possessive, 1st person plural.
  • Take a look at Fig. 3.16, 3.17 and 3.18 in
    Jurafsky & Martin. Create three separate
    finite-state transducers that you finally combine
    into one:
  • a) Create a transducer that operates between the
    intermediate and surface level. This transducer
    handles the vowel lengthening that is necessary
    for the illative form: talo +ILL → taloon vs.
    metsä +ILL → metsään.

32
Exercise (3/3)
  • b) Create a transducer that operates between the
    intermediate and surface level. This transducer
    handles the deletion of n in front of a
    possessive ending: talo + mme → talomme
    vs. taloon + mme → taloomme.
  • c) Create a transducer that operates between the
    lexical and the intermediate level. This
    transducer maps morphological tags onto endings.
  • d) Combine all the transducers into one.
  • Present your transducers as graphs or tables (cf.
    Fig. 3.15 in Jurafsky & Martin)