Morphology: Parsing Words - PowerPoint PPT Presentation

1
Lecture 3
  • Morphology: Parsing Words

2
What is morphology?
  • The study of how words are composed from
    smaller, meaning-bearing units (morphemes)
  • Stems: children, undoubtedly
  • Affixes (prefixes, suffixes, circumfixes,
    infixes)
  • Immaterial (prefix)
  • Trying (suffix)
  • Gesagt (circumfix, German)
  • Absobloodylutely (infix)
  • Concatenative vs. non-concatenative (e.g. Arabic
    root-and-pattern) morphological systems

3
Morphology Helps Define Word Classes
  • AKA morphological classes, parts of speech
  • Closed (function) vs. open (content) class words
  • Closed: pronoun, preposition, conjunction, determiner, ...
  • Open: noun, verb, adverb, adjective, ...

4
(English) Inflectional Morphology
  • Word = stem + grammatical morpheme
  • Usually produces a word of the same class
  • Usually serves a syntactic function (e.g.
    agreement)
  • like → likes or liked
  • bird → birds
  • Nominal morphology
  • Plural forms
  • -s or -es
  • Irregular forms (goose/geese)
  • Mass vs. count nouns (fish/fish; email or emails?)
  • Possessives (cat's, cats')
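The plural rules above (regular -s/-es plus irregulars) can be sketched as a toy generator; the word lists and the spelling rule below are illustrative assumptions, not a complete model:

```python
# Toy generator for English nominal plural inflection: regular -s/-es
# plus a small irregular table. Word lists are illustrative, not complete.
IRREGULAR_PLURALS = {"goose": "geese", "child": "children", "fish": "fish"}

def pluralize(noun):
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"           # watch -> watches, fox -> foxes
    return noun + "s"                # bird -> birds
```

A real lexicon would also block regular affixation of irregular stems (no "gooses" as a noun).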

5
  • Verbal inflection
  • Main verbs (sleep, like, fear) are relatively
    regular
  • -s, -ing, -ed
  • And productive: emailed, instant-messaged, faxed,
    homered
  • But some are not regular: eat/ate/eaten,
    catch/caught/caught
  • Primary (be, have, do) and modal verbs (can,
    will, must) are often irregular and not productive
  • Be: am/is/are/were/was/been/being
  • Irregular verbs: few (~250) but frequently
    occurring
  • So... English inflectional morphology is fairly
    easy to model... with some special cases

6
(English) Derivational Morphology
  • Word = stem + grammatical morpheme
  • Usually produces a word of a different class
  • More complicated than inflectional morphology
  • E.g. verbs → nouns
  • -ize verbs → -ation nouns
  • generalize, realize → generalization, realization
  • E.g. verbs, nouns → adjectives
  • embrace, pity → embraceable, pitiable
  • care, wit → careless, witless

7
  • E.g. adjective → adverb
  • happy → happily
  • But the rules have many exceptions
  • Less productive: evidence-less, concern-less,
    go-able, sleep-able
  • Meanings of derived terms are harder to predict by
    rule
  • clueless, careless, nerveless

8
Parsing
  • Taking a surface input and identifying its
    components and underlying structure
  • Morphological parsing: parsing a word into stem
    and affixes, identifying its parts and their
    relationships
  • Stem and features:
  • goose → goose +N +SG or goose +V
  • geese → goose +N +PL
  • gooses → goose +V +3SG
  • Bracketing: indecipherable → [in [[de [cipher]] able]]
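A minimal sketch of such a parser, assuming a hand-listed toy lexicon (the entries and feature strings below are illustrative):

```python
# Toy morphological parser: maps a surface form to (stem, features) pairs.
# The lexicon and irregular table are illustrative assumptions.
LEXICON = {"goose": ["N", "V"], "cat": ["N"]}
IRREGULAR = {"geese": [("goose", "+N +PL")]}

def parse(surface):
    """Return all (stem, features) analyses of a surface form."""
    analyses = list(IRREGULAR.get(surface, []))
    for pos in LEXICON.get(surface, []):
        # a bare noun is singular; a bare verb is unmarked
        analyses.append((surface, "+N +SG" if pos == "N" else "+V"))
    if surface.endswith("s"):
        stem = surface[:-1]
        for pos in LEXICON.get(stem, []):
            # -s marks plural on nouns, 3rd-singular on verbs
            analyses.append((stem, "+N +PL" if pos == "N" else "+V +3SG"))
    return analyses
```

Note this sketch overgenerates (it also offers goose +N +PL for "gooses"); a fuller lexicon would block the regular plural of irregular nouns.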

9
Why parse words?
  • For spell-checking
  • Is muncheble a legal word?
  • To identify a word's part-of-speech (POS)
  • For sentence parsing, for machine translation, ...
  • To identify a word's stem
  • For information retrieval
  • Why not just list all word forms in a lexicon?

10
How do people represent words?
  • Hypotheses:
  • Full listing hypothesis: whole words are listed
  • Minimum redundancy hypothesis: morphemes are listed
  • Experimental evidence:
  • Priming experiments (Does seeing/hearing one word
    facilitate recognition of another?) suggest
    neither
  • Regularly inflected forms prime their stem, but
    derived forms do not
  • But spoken derived words can prime stems if they
    are semantically close (e.g. government/govern
    but not department/depart)

11
  • Speech errors suggest affixes must be represented
    separately in the mental lexicon
  • "easy enoughly" (for "easily enough")

12
What do we need to build a morphological parser?
  • Lexicon: a list of stems and affixes (with
    corresponding POS)
  • Morphotactics of the language: a model of how and
    which morphemes can be affixed to a stem
  • Orthographic rules: spelling modifications that
    may occur when affixation occurs
  • in- → il- in the context of l (in- + legal → illegal)
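The three components can be shown in miniature (the stem list and rules below are illustrative; real English assimilation also needs ir- before r, etc.):

```python
# Three components in miniature: a lexicon of stems, a morphotactic
# constraint (the negative prefix attaches to adjectives), and an
# orthographic/assimilation rule for in-. Stems chosen for illustration.
STEMS = {"legal": "ADJ", "material": "ADJ", "decent": "ADJ"}

def negate(stem):
    assert STEMS.get(stem) == "ADJ", "morphotactics: in- attaches to adjectives"
    if stem.startswith("l"):
        prefix = "il"        # assimilation: in- -> il- before l (illegal)
    elif stem[0] in "mpb":
        prefix = "im"        # in- -> im- before bilabials (immaterial)
    else:
        prefix = "in"
    return prefix + stem
```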

13
Using FSAs to Represent English Plural Nouns
  • English nominal inflection

[FSA diagram, reconstructed from the slide's labels: q0 --reg-n--> q1 --plural (-s)--> q2; q0 --irreg-sg-n--> q1; q0 --irreg-pl-n--> q2]
  • Inputs: cats, geese, goose
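One way to run such an FSA in code, with the reg-n / irreg-sg-n / irreg-pl-n arc labels turned into word-class checks (the word lists are illustrative stand-ins):

```python
# Dict-encoded plural-noun recognizer in the spirit of the FSA:
# arc labels reg-n / irreg-sg-n / irreg-pl-n become membership tests.
REG_N = {"cat", "dog", "bird"}
IRREG_SG_N = {"goose", "mouse"}
IRREG_PL_N = {"geese", "mice"}

def accepts(word):
    if word in IRREG_SG_N or word in IRREG_PL_N:
        return True                  # irregular arcs reach a final state directly
    if word in REG_N:
        return True                  # bare regular noun (singular)
    return word.endswith("s") and word[:-1] in REG_N   # the plural -s arc
```

This accepts cats, geese, and goose, and rejects gooses, matching the inputs above.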

14
  • Derivational morphology: adjective fragment

[FSA fragment, reconstructed from the slide's labels: q3 --un- (or ε)--> q4 --adj-root1--> q5 --(-er, -ly, -est)--> final; a parallel path through adj-root2 allows only -er, -est]
  • Adj-root1: clear, happy, real (clearly)
  • Adj-root2: big, red (bigly)

15
FSAs can also represent the Lexicon
  • Expand each non-terminal arc in the previous FSA
    into a sub-lexicon FSA (e.g. adj-root2 = {big,
    red}) and then expand each of these stems into
    its letters (e.g. red → r e d) to get a
    recognizer for adjectives

[Letter-level FSA: the diagram spelled out each stem letter by letter; only the state names (q0–q7), the letters of big and red, and the un- and -er/-est arcs survive from the slide]
16
But...
  • Covering the whole lexicon this way will require
    very large FSAs, with consequent search and
    maintenance problems
  • Adding new items to the lexicon means recomputing
    the whole FSA
  • Non-determinism
  • FSAs tell us whether a word is in the language or
    not, but usually we want to know more
  • What is the stem?
  • What are the affixes, and what sort are they?
  • We used this information to recognize the word;
    can we get it back?

17
Parsing with Finite State Transducers
  • cats → cat +N +PL (a plural noun)
  • Koskenniemi's two-level morphology
  • Idea: a word is a relationship between a lexical
    level (its morphemes) and a surface level (its
    orthography)
  • Morphological parsing: find the mapping
    (transduction) between lexical and surface levels

Lexical:  c a t +N +PL
Surface:  c a t s
18
Finite State Transducers can represent this mapping
  • FSTs map between one set of symbols and another
    using an FSA whose alphabet Σ is composed of
    pairs of symbols from the input and output alphabets
  • In general, FSTs can be used as
  • Translators (Hello → Ciao)
  • Parsers/generators (Hello → How may I help you?)
  • As well as for Kimmo-style morphological parsing

19
  • An FST is a 5-tuple consisting of
  • Q: a set of states, e.g. {q0, q1, q2, q3, q4}
  • Σ: an alphabet of complex symbols, each an i/o
    pair i:o such that i ∈ I (an input alphabet) and
    o ∈ O (an output alphabet); Σ ⊆ I × O
  • q0: a start state
  • F: a set of final states in Q, e.g. {q4}
  • δ(q, i:o): a transition function mapping Q × Σ to Q
  • Emphatic sheep → quizzical cow

[FST diagram, reconstructed from the slide's labels: q0 --b:m--> q1 --a:o--> q2 --a:o--> q3 --a:o (self-loop)--> q3 --!:?--> q4]
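A transition-table rendering of the sheep-to-cow transducer; the state names follow the slide, and an a:o self-loop on q3 is assumed to absorb extra a's:

```python
# The emphatic-sheep -> quizzical-cow FST as a transition table:
# (state, input) -> (next_state, output).
DELTA = {
    ("q0", "b"): ("q1", "m"),
    ("q1", "a"): ("q2", "o"),
    ("q2", "a"): ("q3", "o"),
    ("q3", "a"): ("q3", "o"),   # self-loop: baaaa... of any length
    ("q3", "!"): ("q4", "?"),
}

def transduce(s):
    """Map sheep talk to cow talk; None if the input is rejected."""
    state, out = "q0", []
    for ch in s:
        if (state, ch) not in DELTA:
            return None
        state, o = DELTA[(state, ch)]
        out.append(o)
    return "".join(out) if state == "q4" else None
```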
20
FST for a 2-level Lexicon
  • E.g.:

[FST diagram, reconstructed from the slide's labels: cat is spelled out with identity arcs c:c a:a t:t (Reg-n); the irregular plural uses o:e o:e so that lexical g o o s e surfaces as g e e s e (Irreg-pl-n); the irregular singular goose uses identity arcs (Irreg-sg-n); states q0–q7]
21
FST for English Nominal Inflection
[FST diagram, reconstructed from the slide's labels: q0 --reg-n--> q1, q0 --irreg-n-sg--> q2, q0 --irreg-n-pl--> q3; each path then takes N:ε (q1→q4, q2→q5, q3→q6) and realizes the number feature (SG:- or PL:-s) to reach the final state q7]
22
Useful Operations on Transducers
  • Cascade: running 2 FSTs in sequence
  • Intersection: represent the common transitions in
    FST1 and FST2 (ASR: finding pronunciations)
  • Composition: apply FST2's transition function to
    the result of FST1's transition function
  • Inversion: exchanging the input and output
    alphabets (recognize and generate with the same FST)
  • cf. the AT&T FSM Toolkit and papers by Mohri,
    Pereira, and Riley
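Inversion, for instance, is a one-liner on a transition-table representation; the toy parsing fragment below is an illustrative assumption:

```python
# Inversion: swap each transition's input and output symbols, turning a
# parser FST into a generator. Table maps (state, in_sym) -> (next, out_sym).
def invert(delta):
    return {(q, o): (q_next, i) for (q, i), (q_next, o) in delta.items()}

# Toy parsing fragment: surface "cats" -> lexical "cat +PL"
PARSE = {("q0", "c"): ("q1", "c"), ("q1", "a"): ("q2", "a"),
         ("q2", "t"): ("q3", "t"), ("q3", "s"): ("q4", "+PL")}
GENERATE = invert(PARSE)   # lexical "+PL" now maps back to surface "s"
```

Inverting twice recovers the original machine, so one table serves both recognition and generation.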

23
Orthographic Rules and FSTs
  • Define additional FSTs to implement rules such as
    consonant doubling (beg → begging), e deletion
    (make → making), e insertion (watch → watches),
    etc.

Lexical:      f o x +N +PL
Intermediate: f o x s
Surface:      f o x e s
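The e-insertion step from the intermediate to the surface tape might be sketched as follows; the ^ (morpheme boundary) and # (word end) markers are the usual two-level convention, assumed here rather than taken from the slide:

```python
# Sketch of the e-insertion orthographic rule: on the intermediate tape,
# insert e before a morpheme-final s when the stem ends in x/s/z/ch/sh.
# The ^ and # boundary markers are assumed notation.
def e_insertion(intermediate):
    stem, _, rest = intermediate.partition("^")
    suffix = rest.rstrip("#")
    if suffix == "s" and stem.endswith(("x", "s", "z", "ch", "sh")):
        return stem + "e" + suffix
    return stem + suffix
```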
24
Porter Stemmer
  • Used for tasks in which you only care about the
    stem
  • IR, modeling the given/new distinction, topic
    detection, document similarity
  • Rewrite rules (e.g. misunderstanding →
    misunderstand → understand → ...)
  • Not perfect... but sometimes it doesn't matter
    too much
  • Fast and easy
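A naive suffix-stripping stemmer in this spirit (much cruder than the real Porter algorithm, which applies ordered rule phases with measure conditions; the suffix list is illustrative):

```python
# Naive suffix-stripping stemmer: remove the first matching suffix if
# enough of the word remains. Far simpler than the real Porter stemmer.
SUFFIXES = ["ization", "ational", "ness", "ing", "ed", "ly", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word
```

Note the imperfection the slide mentions: "happiness" stems to "happi", which is harmless for IR as long as it is consistent.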

25
Summing Up
  • FSTs provide a useful tool for implementing a
    standard model of morphological analysis, Kimmo's
    two-level morphology
  • But for many tasks (e.g. IR) much simpler
    approaches are still widely used, e.g. the
    rule-based Porter stemmer
  • Next time:
  • Read Ch. 4
  • Read over HW1 and ask questions now