Morphology-2 - PowerPoint PPT Presentation

1
Morphology-2
  • Sudeshna Sarkar
  • Professor
  • Computer Science & Engineering Department
  • Indian Institute of Technology Kharagpur

2
Morphology in NLP
  • Analysis vs. synthesis
  • What does "dogs" mean? vs. What is the plural of "dog"?
  • Analysis
  • Need to identify the lexeme
  • Tokenization
  • To access lexical information
  • Inflections (etc.) carry information that will be needed by other
    processes (e.g., agreement is useful in parsing; inflections can
    carry meaning, e.g., tense and number)
  • Morphology can be ambiguous
  • May need another process to disambiguate (e.g., German -en)
  • Synthesis
  • Need to generate appropriate inflections from the underlying
    representation

3
Morphological processing
  • Stemming
  • String-handling approaches
  • Regular expressions
  • Mapping onto finite-state automata
  • 2-level morphology
  • Mapping between surface form and lexical
    representation

4
Stemming
  • Stemming is the particular case of tokenization that reduces
    inflected forms to a single base form or stem
  • Stemming algorithms are basic string-handling algorithms that
    depend on rules identifying affixes that can be stripped (a
    minimal sketch follows below)
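
A minimal Python sketch of such a rule-based stripper (the suffix
list and the stem-length condition are illustrative assumptions, not
rules from the slides):

SUFFIXES = ["ies", "ing", "es", "ed", "s"]   # hypothetical toy rules

def strip_suffix(word):
    # Try longer suffixes first; require a plausibly long stem.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 1:
            stem = word[:-len(suf)]
            return stem + "y" if suf == "ies" else stem   # flies -> fly
    return word

print(strip_suffix("flies"))    # fly
print(strip_suffix("chased"))   # chas  (naive rules over-strip;
                                # real stemmers add more conditions)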

5
Surface and Lexical Forms
  • The surface level of a word represents the actual spelling of
    that word.
  • geliyorum    eats    cats    kitabim
  • The lexical level of a word represents a simple concatenation of
    the morphemes making up that word.
  • gel +PROG +1SG
  • eat +AOR
  • cat +PLU
  • kitap +P1SG
  • Morphological processors try to find correspondences between the
    lexical and surface forms of words.
  • Morphological recognition/analysis: surface to lexical
  • Morphological generation/synthesis: lexical to surface

6
Morphological Parsing
  • Morphological parsing is finding the lexical form of a word from
    its surface form.
  • cats   → cat +N +PLU
  • cat    → cat +N +SG
  • goose  → goose +N +SG or goose +V
  • geese  → goose +N +PLU
  • gooses → goose +V +3SG
  • catch  → catch +V
  • caught → catch +V +PAST or catch +V +PP
  • AsachhilAma → AsA +PROG +PAST +1st ("I/We was/were coming")
  • There can be more than one lexical-level representation for a
    given word (ambiguity):
  • flies → fly +VERB +PROG
  •      or fly +NOUN +PLU
  • mAtAla
  • kare

7
  • The history of morphological analysis dates back to the ancient
    Indian linguist Pāṇini, who formulated the 3,959 rules of
    Sanskrit morphology in the text Aṣṭādhyāyī by using a
    constituency grammar.

8
Formal definition of the problem
  • Surface form: the word (ws) as it occurs in the text, e.g., sings
  • ws ∈ L ⊆ Σ*
  • Lexical form: the root word(s) (r1, r2, ...) and other grammatical
    features (F), e.g., sing +v +sg +3rd
  • wl is a string over the root alphabet Σ and the feature set F
  • wl ∈ Δ*, where Δ is the lexical-level alphabet

9
Analysis & Synthesis
  • Morphological Analysis: maps a string from surface form to the
    corresponding lexical form.
  • fMA : Σ* → Δ*
  • Morphological Synthesis: maps a string from lexical form to
    surface form.
  • fMS : Δ* → Σ*

10
Relationship between MA & MS
  • fMS(fMA(ws)) = ws
  • fMA(fMS(wl)) = wl
  • fMS = fMA⁻¹, fMA = fMS⁻¹
  • But is that really the case?

11
  • Fly + s → flys → flies (y → i rule)
  • Duckling
  • Go-getter → get + er
  • Doer → do + er
  • Beer → ?
  • What knowledge do we need?
  • How do we represent it?
  • How do we compute with it?

12
Knowledge needed
  • Knowledge of stems or roots
  • Duck is a possible root, not duckl
  • We need a dictionary (lexicon)
  • Only some endings go on some words
  • Do + er: ok
  • Be + er: not ok
  • In addition, spelling change rules that adjust the surface form
    (sketched in the code below)
  • Get + er → double the t → getter
  • Fox + s → insert e → foxes
  • Fly + s → y to i, insert e → flies
  • Chase + ed → drop e → chased
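
A small Python sketch of these spelling-change rules in the
generation direction (a toy subset; the consonant-doubling rule here
is deliberately simplistic and over-applies):

import re

def attach(stem, suffix):
    if suffix == "s":
        if re.search(r"(s|z|x|ch|sh)$", stem):
            return stem + "es"               # fox + s -> foxes
        if re.search(r"[^aeiou]y$", stem):
            return stem[:-1] + "ies"         # fly + s -> flies
        return stem + "s"
    if suffix in ("ed", "er", "ing"):
        if stem.endswith("e"):
            return stem[:-1] + suffix        # chase + ed -> chased
        if re.search(r"[aeiou][bdgmnpt]$", stem):
            return stem + stem[-1] + suffix  # get + er -> getter
        return stem + suffix
    return stem + suffix

print(attach("fox", "s"), attach("fly", "s"),
      attach("chase", "ed"), attach("get", "er"))
# foxes flies chased getter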

13
Put all this in a big dictionary (lexicon)
  • Turkish: approx. 600 × 10^6 forms
  • Finnish: 10^7
  • Hindi, Bengali, Telugu, Tamil?
  • Besides, novel forms can always be constructed:
  • Anti-missile
  • Anti-anti-missile
  • Anti-anti-anti-missile
  • ...
  • Compounding of words: Sanskrit, German

14
Dictionary
  • Lemma: lexical unit, a pointer into the lexicon
  • typically represented as the base form, or dictionary headword
  • possibly indexed when ambiguous/polysemous:
  • state1 (verb), state2 (state-of-the-art), state3 (government)
  • formed from one or more morphemes (root, stem, root+derivation, ...)
  • Categories: non-lexical
  • small number of possible values (< 100, often < 5-10)

15
Morphological Analyzer
  • Relatively simple for English.
  • But for many Indian languages, it may be more difficult.
  • Examples: inflectional and derivational morphology.
  • Common tools: finite-state transducers
  • A transducer maps a set/string of symbols to another set/string
    of symbols

16
A simpler problem
  • Linear concatenation of morphemes, with possible spelling changes
    at the boundary and a few irregular cases.
  • Quite practical assumptions:
  • English, Hindi, Bengali, Telugu, Tamil, French, Turkish
  • Exceptions: Semitic languages, Sanskrit

17
Computational Morphology
  • Approaches
  • Lexicon only
  • Rules only
  • Lexicon and Rules
  • Finite-state Automata
  • Finite-state Transducers

18
Computational Morphology
  • Systems
  • WordNet's morphy
  • PC-KIMMO
  • Named after Kimmo Koskenniemi; much work done by Lauri Karttunen,
    Ron Kaplan, and Martin Kay
  • Accurate but complex
  • http://www.sil.org/pckimmo/
  • Two-level morphology
  • Commercial version available from InXight Corp.
  • Background:
  • Chapter 3 of Jurafsky and Martin
  • A short history of Two-Level Morphology:
  • http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/

19
Morphological Analyser
  • To build a morphological analyser we need:
  • lexicon: the list of stems and affixes, together with basic
    information about them
  • morphotactics: the model of morpheme ordering (e.g., the English
    plural morpheme follows the noun rather than a verb)
  • orthographic rules: these spelling rules are used to model the
    changes that occur in a word, usually when two morphemes combine
    (e.g., fly+s → flies); a toy sketch of the first two components
    follows below
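
A minimal Python sketch of a lexicon split into sublexicons plus the
noun-pluralization morphotactics of the following slides (the word
lists are illustrative, and no orthographic rules are applied yet):

REG_NOUN = {"cat", "fox", "dog"}
IRREG_SG = {"goose", "mouse", "sheep"}
IRREG_PL = {"geese", "mice", "sheep"}

def accepts(word):
    """Recognise: reg-noun (-s)? | irreg-sg-noun | irreg-pl-noun."""
    if word in IRREG_SG or word in IRREG_PL or word in REG_NOUN:
        return True
    # regular plural; note fox+s really needs the e-insertion rule
    return word.endswith("s") and word[:-1] in REG_NOUN

for w in ["cats", "geese", "gooses", "duckl"]:
    print(w, accepts(w))
# cats True, geese True, gooses False, duckl False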

20
Finite State Machines
  • FSAs are equivalent to regular languages
  • FSTs are equivalent to regular relations (over
    pairs of regular languages)
  • FSTs are like FSAs but with complex labels.
  • We can use FSTs to transduce between surface and
    lexical levels.

21
Can FSAs help?
[FSA diagram: Q0 → Q1 on reg-noun; Q1 → Q2 on plural (-s);
Q0 → Q1 on irreg-sg-noun; Q0 → Q2 on irreg-pl-noun]
22
What's this for?
[FSA diagram: Q0 → Q1 on "un" (or ε); Q1 → Q2 on Adj-root;
Q2 → Q3 on -er, -est, -ly (or ε); i.e., the pattern
(un)? ADJ-ROOT (er|est|ly)?]
23
Morphotactics
  • The last two examples basically model some parts of English
    morphotactics
  • But where is the information about regular and irregular roots?
    The LEXICON
  • Can we include the lexicon in the FSA?

24
The English Pluralization FSA
25
After adding a mini-lexicon
[FSA diagram: the pluralization FSA with stems spelled out letter by
letter between Q0 and Q1 (e.g., d-o-g, and m-a-n with an e branch for
m-e-n), followed by a plural -s arc into the accepting state Q2]
26
Elegance & Power
  • FSAs are elegant because:
  • NFA ≡ DFA
  • Closed under union, intersection, concatenation, complementation
  • Traversal is always linear in input size
  • Well-known algorithms for minimization, determinization,
    compilation, etc.
  • They are powerful because they can capture:
  • Linear morphology
  • Irregularities

27
But
  • FSAs are language recognizers/generators.
  • We need transducers to build:
  • Morphological Analyzers (fMA)
  • Morphological Synthesizers (fMS)

28
Finite State Transducers
[FST diagram: a finite-state machine transducing between the surface
form "s i n g s" and the lexical form "s i n g +v +sg"]
29
Formal Definition
  • A 6-tuple (Σ, Δ, Q, δ, q0, F) (see the sketch below)
  • Σ is the (finite) set of input symbols
  • Δ is the (finite) set of output symbols
  • Q is the (finite) set of states
  • δ is the transition function: Q × Σ → Q × Δ
  • q0 ∈ Q is the start state
  • F ⊆ Q is the set of accepting states
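
A small executable rendering of this definition, with δ as a Python
dictionary (the toy transitions, state names, and the sings example
are assumptions for illustration; real FSTs are generally
non-deterministic):

# delta: (state, input symbol) -> (next state, output symbol)
delta = {
    ("q0", "s"): ("q1", "s"), ("q1", "i"): ("q2", "i"),
    ("q2", "n"): ("q3", "n"), ("q3", "g"): ("q4", "g"),
    ("q4", "s"): ("q5", " +v +sg"),   # final -s realises the features
}
START, FINAL = "q0", {"q5"}

def transduce(word):
    state, out = START, []
    for ch in word:
        if (state, ch) not in delta:
            return None                # reject
        state, sym = delta[(state, ch)]
        out.append(sym)
    return "".join(out) if state in FINAL else None

print(transduce("sings"))   # sing +v +sg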

30
An example FST
[FST diagram: an example FST with mostly identity arcs (a:a, g:g,
b:b, u:u, s:s, d:d, o:o, m:m, n:n) spelling out the mini-lexicon,
plus a vowel-change arc for the man/men alternation]
31
The Lexicon FST
[FST diagram: the lexicon FST; identity arcs spell out the stems,
while arcs such as s:+Pl and ε:+Sg emit the number features, with the
vowel-change arc again handling man/men]
32
Ways to look at FSTs
  • Recognizer of a pair of strings
  • Generator of a pair of strings
  • Translator from one regular language to another
  • Computer of a relation: a regular relation

33
Invertibility
  • Given T = (Σ, Δ, Q, δ, q0, F)
  • construct T⁻¹ = (Δ, Σ, Q, δ⁻¹, q0, F)
  • such that if δ(x, q) → (y, q′)
  • then δ⁻¹(y, q) → (x, q′)
  • where x ∈ Σ and y ∈ Δ

34
Compositionality
  • T1 = (Σ, X, Q1, δ1, q1, F1), T2 = (X, Δ, Q2, δ2, q2, F2)
  • Define T3 = (Σ, Δ, Q3, δ3, q3, F3)
  • such that Q3 = Q1 × Q2
  • q3 = (q1, q2)
  • δ3((q, s), i) = ((q′, s′), o) iff
  • ∃c s.t. δ1(q, i) = (q′, c) and δ2(s, c) = (s′, o)
  • (a dictionary-based sketch of inversion and composition follows)
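
A compact Python sketch of inversion and composition over the same
dictionary encoding used above (toy transducers; the dict encoding
assumes the results stay deterministic):

def invert(delta):
    """T^-1: swap input and output symbols on every arc."""
    return {(q, y): (q2, x) for (q, x), (q2, y) in delta.items()}

def compose(d1, d2):
    """T3: run T2 on T1's output, over product states (q, s)."""
    d3 = {}
    for (q, i), (q2, c) in d1.items():        # T1: i -> c
        for (s, c2), (s2, o) in d2.items():   # T2: c -> o
            if c == c2:
                d3[((q, s), i)] = ((q2, s2), o)
    return d3

d1 = {("p0", "a"): ("p0", "A")}   # toy T1: a -> A
d2 = {("r0", "A"): ("r0", "@")}   # toy T2: A -> @
print(compose(d1, d2))   # {(('p0', 'r0'), 'a'): (('p0', 'r0'), '@')}
print(invert(d1))        # {('p0', 'A'): ('p0', 'a')}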

35
Modeling Orthographic Rules
  • Spelling changes at morpheme boundaries:
  • bus+s → buses, watch+s → watches
  • fly+s → flies
  • make+ing → making
  • Rules:
  • E-insertion takes place if the stem ends in s, z, ch, sh, etc.
  • y maps to ie when the pluralization marker s is added (both rules
    are sketched in the code below)
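
A Python sketch of these two rules applied to intermediate forms with
an explicit morpheme boundary ^ (the boundary notation anticipates
the two-level slides below; the regexes are illustrative):

import re

def spell(intermediate):
    """Apply e-insertion and y -> ie at the ^ boundary, then drop ^."""
    s = re.sub(r"(s|z|x|ch|sh)\^s$", r"\1^es", intermediate)  # bus^s -> bus^es
    s = re.sub(r"y\^s$", "i^es", s)                           # fly^s -> fli^es
    return s.replace("^", "")

for w in ["bus^s", "watch^s", "fly^s", "dog^s"]:
    print(w, "->", spell(w))
# buses, watches, flies, dogs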

36
Incorporating Spelling Rules
  • Spelling rules, each corresponding to an FST, can be run in
    parallel provided that they are "aligned".
  • The set of spelling rules is positioned between the surface level
    and the intermediate level.
  • Parallel execution of FSTs can be carried out:
  • by simulation: in this case the FSTs must first be aligned.
  • by first constructing a single FST corresponding to their
    intersection.

37
Rewrite Rules
  • Chomsky and Halle (1968)
  • General form:
  • a → b / α __ β
  • E-insertion:
  • ε → e / {x, s, z, ch, sh} __ s
  • Kaplan and Kay (1994) showed that FSTs can be compiled from
    general rewrite rules (a regex rendering of the schema follows)
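
A quick illustration of the a → b / α __ β schema with Python
lookaround regexes (illustrative only; note that Python's re needs a
fixed-width left context, which is why the sketch uses the
single-character context {x, s, z}):

import re

def rewrite(a, b, alpha, beta, s):
    """Apply a -> b only between left context alpha and right context beta."""
    return re.sub(f"(?<={alpha}){a}(?={beta})", b, s)

# E-insertion rendered as rewriting the boundary ^ to ^e before s:
print(rewrite(r"\^", "^e", r"[xsz]", r"s", "fox^s"))   # fox^es
# dropping ^ afterwards yields the surface form "foxes"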

38
Two-level Morphology (Koskenniemi, 1983)
lexical:       b u s +N +Pl
                   | LEXICON FST
intermediate:  b u s ^ s
                   | FST1 ... FSTn (orthographic rules)
surface:       b u s e s
39
A Single FST for MA and MS
[Diagram: composing the LEXICON FST with the orthographic-rule FSTs
(FST1 ... FSTn) yields a single Morphology FST that maps the lexical
form "b u s +N +Pl" directly to the surface form "b u s e s", and can
be run in either direction for MA and MS]
40
Can we do without the lexicon?
  • Not really!
  • But for some applications we might need to know the stem only
  • Surface form → Stem: Stemming
  • The Porter stemming algorithm (1980) is a very popular technique
    that does not use a lexicon (see the usage sketch below).
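
For instance, via NLTK's implementation (assuming the nltk package is
installed):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for w in ["flies", "denied", "agreed", "plastered", "sized"]:
    print(w, "->", stemmer.stem(w))
# flies -> fli, denied -> deni, agreed -> agree,
# plastered -> plaster, sized -> size
# Note: Porter yields stems, not dictionary lemmas (e.g. "fli").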

41
Derivational Rules
42
Lexicon Morphotactics
  • Typically list of word parts (lexicon) and the
    models of ordering can be combined together into
    an FSA which will recognise the all the valid
    word forms.
  • For this to be possible the word parts must first
    be classified into sublexicons.
  • The FSA defines the morphotactics (ordering
    constraints).

43
Sublexicons to classify the list of word parts
reg-noun    irreg-pl-noun    irreg-sg-noun    plural
cat         mice             mouse            -s
fox         sheep            sheep
            geese            goose
44
Towards the Analyser
  • We can use lexc or xfst to build such an FSA (see
    lex1.lexc)
  • To augment this to produce an analysis we must
    create a transducer Tnum which maps between the
    lexical level and an "intermediate" level that is
    needed to handle the spelling rules of English.

45
Ambiguity
  • Recall that in non-deterministic recognition multiple paths
    through a machine may lead to an accept state.
  • It didn't matter which path was actually traversed.
  • In FSTs the path to an accept state does matter, since different
    paths represent different parses and different outputs will
    result.

46
Ambiguity
  • What's the right parse for:
  • Unionizable
  • Union-ize-able
  • Un-ion-ize-able
  • Each represents a valid path through the
    derivational morphology machine.

47
Ambiguity
  • There are a number of ways to deal with this
    problem
  • Simply take the first output found
  • Find all the possible outputs (all paths) and
    return them all (without choosing)
  • Bias the search so that only one or a few likely
    paths are explored

48
Generativity
  • Nothing is really privileged about the directions.
  • We can write from one tape and read from the other, or
    vice-versa.
  • One way is generation, the other way is analysis.

49
Multi-Level Tape Machines
  • We use one machine to transduce between the
    lexical and the intermediate level, and another
    to handle the spelling changes to the surface
    tape

50
Note
  • A key feature of this machine is that it doesn't do anything to
    inputs to which it doesn't apply.
  • Meaning that they are written out unchanged to the output tape.
  • It turns out the multiple tapes aren't really needed; they can be
    compiled away.

51
Overall Scheme
  • We now have one FST that has explicit information
    about the lexicon (actual words, their spelling,
    facts about word classes and regularity).
  • Lexical level to intermediate forms
  • We have a larger set of machines that capture
    orthographic/spelling rules.
  • Intermediate forms to surface forms

52
Other Issues
  • How to formulate the rewrite rules?
  • How to ensure coverage?
  • What to do for unknown roots?
  • Is it possible to learn the morphology of a language in a
    supervised/unsupervised manner?
  • What about non-linear morphology?

53
References
  • Chapter 3, pp. 57-89, Speech and Language Processing by
    D. Jurafsky & J. H. Martin, Pearson Education Asia, 2002 (2000)
  • Slides based on the chapter
  • Chapter 2, p. 70, Natural Language Understanding by J. Allen,
    Pearson Education, 2003 (1995)
  • Slides by Monojit Choudhury

54
Spelling errors
55
Non-word error detection
  • Any word not in a dictionary
  • Assume it's a spelling error
  • Need a big dictionary!
  • What to use?
  • FST dictionary!!

56
Isolated word error correction
  • How do I fix "graffe"?
  • Search through all words:
  • graf
  • craft
  • grail
  • giraffe
  • Pick the one that's closest to "graffe"
  • What does "closest" mean?
  • We need a distance metric.
  • The simplest one: edit distance (a candidate-ranking sketch
    follows below).
  • (More sophisticated probabilistic ones: noisy channel)
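
A quick candidate-ranking sketch using Python's standard library;
difflib ranks by a similarity ratio rather than raw edit distance, so
it only approximates the metric discussed next:

import difflib

vocabulary = ["graf", "craft", "grail", "giraffe"]
print(difflib.get_close_matches("graffe", vocabulary, n=3))
# ['giraffe', 'graf']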

57
Edit Distance
  • The minimum edit distance between two strings
  • is the minimum number of editing operations
  • Insertion
  • Deletion
  • Substitution
  • needed to transform one into the other

58
Minimum Edit Distance
  • If each operation has a cost of 1:
  • distance between intention and execution is 5
  • If substitutions cost 2 (Levenshtein):
  • distance between the same pair is 8
  • (a dynamic-programming sketch follows below)
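
A standard dynamic-programming sketch with a configurable
substitution cost (the intention/execution pair is the textbook
example from Jurafsky & Martin):

def min_edit_distance(src, tgt, sub_cost=1):
    """Fill a (len(src)+1) x (len(tgt)+1) table of prefix distances."""
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                               # i deletions
    for j in range(1, m + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # substitute / copy
    return d[n][m]

print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8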

59
Part of Speech Tagging
  • Task:
  • assign the right part-of-speech tag, e.g. noun, verb,
    conjunction, to a word in context
  • POS taggers:
  • need to be fast in order to process large corpora
  • should take no more than time linear in the size of the corpora
  • full parsing is slow
  • e.g. context-free grammar parsing is O(n^3), n = length of the
    sentence
  • POS taggers try to assign the correct tag without actually
    parsing the sentence
60
Part-of-Speech (POS)
  • Categories to which words are assigned according
    to their function.
  • Noun, verb, adjective, preposition, adverb,
    article, pronoun, conjunction, etc.

61
POS Tagging
  • The process of assigning a part-of-speech to each word in a
    sentence:
  • Keep the book on the top shelf .
  • [Tag lattice from the slide: candidate tags per word, e.g.
    Keep {N, V}, the {DET}, book {N, V}, on {ADV, ADJ, P},
    top {ADJ, N}, shelf {N}, . {.}; the tagger must choose one tag
    per word (see the tagger sketch below)]
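
For example, with NLTK's off-the-shelf tagger (assuming nltk and its
pretrained tagger model are installed; it uses the Penn Treebank
tagset rather than the coarse tags above):

import nltk

tokens = "Keep the book on the top shelf .".split()
print(nltk.pos_tag(tokens))
# e.g. [('Keep', 'VB'), ('the', 'DT'), ('book', 'NN'), ('on', 'IN'),
#       ('the', 'DT'), ('top', 'JJ'), ('shelf', 'NN'), ('.', '.')]
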
62
Techniques for POS tagging
  • Linguistic approaches
  • Statistical approaches
  • Hidden Markov Model
  • Maximum Entropy
  • CRF

63
Named Entity Recognition
  • Named Entity Recognition (NER): locate and classify the names in
    text
  • Example:
  • Jawaharlal/Per-beg Nehru/Per-end was the first prime/Title-beg
    minister/Title-end of India/Loc .
  • Importance:
  • Information Extraction, Question-Answering
  • Can help Summarization, ASR and MT
  • Intelligent document access
  • etc.

64
Syntax
  • Order and group words together in a sentence
  • "The dog barked at the visitor"
  • vs.
  • "Barked dog the at visitor the"

65
Semantics
  • Understand word meanings and combine meanings in larger units
  • Lexical semantics
  • Compositional semantics

66
Discourse & Pragmatics
  • Interpret utterances in context
  • Resolve references
  • "I'm afraid I can't do that"
  • that = ?
  • Speech act interpretation
  • "Open the pod bay doors"
  • Command

67
Phonology
  • The study of the sound patterns of languages.

68
Computational phonology
  • Automatic Speech Recognition (ASR):
  • takes an acoustic waveform as input and produces as output a
    string of words.
  • Text-To-Speech (TTS):
  • takes a sequence of text words and produces as output an acoustic
    waveform.
  • Both need to know how words are pronounced in terms of individual
    speech units called phones.

69
Speech sounds and phonetic transcription
  • A phone: a speech sound, represented by IPA or ARPAbet symbols.
  • IPA: an evolving standard with the goal of transcribing the
    sounds of all human languages.
  • ARPAbet: a phonetic alphabet designed for American English using
    only ASCII symbols.

70
Why phonology?
  • Text-to-speech (TTS) applications include a component which
    converts spelled words to sequences of phonemes (sound
    representations): G2P, grapheme-to-phoneme conversion
  • E.g., sight → S AY1 T
  • John → JH AA1 N
  • Phoneme-to-grapheme conversion is the reverse task, used in
    speech recognition

71
Varieties of sounds in people's speech
  • Most phonemes have several different
    pronunciations (called their allophones),
    determined by nearby sounds, most usually by the
    following sound.
  • A striking instance of such variation is in the
    realization of the phoneme /T/ in American
    English.

72
Grapheme phoneme relationships
  • LTS: letter-to-sound, or G2P, relationships.
  • In some languages this is simple, e.g., Sanskrit.
  • But in English and in French, it's very messy.
  • Why? Because the spelling system is based on how the language
    used to be pronounced, and the pronunciation has since changed.
  • Schwa deletion in Hindi

73
References
  • Chapter 3, pp. 57-89, Speech and Language Processing by
    D. Jurafsky & J. H. Martin, Pearson Education Asia, 2002 (2000)
  • Slides based on the chapter
  • Chapter 2, p. 70, Natural Language Understanding by J. Allen,
    Pearson Education, 2003 (1995)
  • Slides by Monojit Choudhury