Modelling Natural Language with Finite Automata

Provided by: karinh
1
Formal Languages: FSAs in NLP
Hinrich Schütze, IMS, Uni Stuttgart, WS 2006/07
Most slides borrowed from K. Haenelt and E. Gurari
2
Formal Language Theory
  • Two different goals in computational linguistics
  • Theoretical interest
  • What is the correct formalization of natural
    language?
  • What does this formalization tell us about the
    properties of natural language?
  • What are the limits of NLP algorithms in
    principle?
  • E.g., "natural language is context-free" would
    imply that syntactic analysis of natural language
    takes at most cubic time in the sentence length.
  • Practical interest
  • Well-understood mathematical and computational
    framework for solving NLP problems
  • ... even if the formalization is not cognitively
    sound
  • Today: finite-state methods for practical
    applications

3
Language Technology Based on Finite-State Devices
4
Advantages of Finite-State Devices
  • efficiency
  • time
  • very fast
  • if deterministic or low-degree non-determinism
  • space
  • compressed representations of data
  • search structure (hash function)
  • system development and maintenance
  • modular design and automatic compilation of
    system components
  • high-level specifications
  • language modelling
  • uniform framework for modelling dictionaries and
    rules

5
Modelling Goal
  • specification of a formal language that
    corresponds as closely as possible to a natural
    language
  • ideally the formal system should
  • never undergenerate (i.e. accept or generate all
    the strings that characterise a natural language)
  • never overgenerate (i.e. not accept or generate
    any string which is not acceptable in a real
    language)
  • realistically
  • natural languages are moving targets
    (productivity, variations)
  • approximations are achievable
  • Finite-state is a crude, but useful approximation

(Beesley/Karttunen, 2003)
6
Layers of Linguistic Modelling
(Diagram: one sentence analysed on four layers; only the labels survive in the transcript.)
Text: The Commission promotes information on
products in third countries
Lex: The/dete Commission/noun promote/verb information/noun on/prpo product/noun in/prpo third country/noun
Syn: S, NP, VP, PP
Sem: promote, Commission, information, on, product, in, third country
7
Main Types of Transducers for Natural Language
Processing
  • general case: non-deterministic transducers
    with ε-transitions
  • optimisation: determinisation and minimisation
(Diagram: several transducer variants over the arc labels s, a, w, aw, e, ee; layout not recoverable from the transcript.)
8
Lexical Analysis of Natural Language Texts
Lexical Analysis
  • Recognition of Input Language
  • Mapping to Output Language
  • Tokenisation
  • Morphological Analysis
9
Tokenization
  • Just use white space?
  • Tokenization rules?
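
A minimal sketch of why white space alone is not enough, and of what a rule-based tokenizer might look like. The token pattern below is an assumption made for illustration (abbreviation followed by a lower-case word, numbers, words, single symbols); it is not taken from the slides.

```python
import re

# Hypothetical tokenization rules, tried in order at each position.
TOKEN = re.compile(r"""
    [A-Za-z]+\.(?=\s+[a-z])   # abbreviation such as "Co." before a lower-case word
  | \d+(?:[.,]\d+)*           # numbers, optionally with decimal part
  | \w+(?:[-']\w+)*           # words, possibly with internal hyphen or apostrophe
  | \S                        # any other single symbol (punctuation)
""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens according to the rules above."""
    return TOKEN.findall(text)

SENTENCE = "Sports Co. said it rose 3.5 percent."
WHITESPACE_TOKENS = SENTENCE.split()   # keeps "percent." as one token
RULE_TOKENS = tokenize(SENTENCE)       # separates the final full stop
```

Splitting on white space leaves the sentence-final full stop glued to "percent.", while the rules keep "Co." and "3.5" intact and split off the final full stop.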

10
Properties of Natural Language Words
  • very large set (how many?)
  • word formation: concatenation with constraints
  • simple words: word, *dorw
  • compound words
  • productive compounding and derivation:
    Drosselklappenpotentiometer; organise →
    organisation → organisational, re-organise →
    re-organisation → re-organisational; be-wald-en,
    *be-feld-en
  • contiguous dependencies: go-es, *walk-es;
    un-expect-ed-ly, *un-elephant-ed-ly
  • discontiguous (long-distance) dependencies:
    expect-s, *un-expect-s; mach-st, *ge-mach-st

12
Modelling of Natural Language Words
  • Concatenation rules expressed in terms of
  • meaningful word components (including simple
    words) (morphs)
  • and their concatenation (morphotactics)
  • Modelling approach
  • lexicalisation of sets of meaningful components
    (morph classes)
  • representation of these dictionaries with
    finite-state transducers
  • specification of concatenation

(Diagram: letter networks for the stems book and work, connected to suffix arcs; only the arc letters survive in the transcript.)
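
The modelling approach above (morph classes plus a specification of how classes concatenate) can be sketched as follows. The class names, the suffix inventory and the "#" end marker are assumptions chosen for illustration; a real system would compile such a lexicon into a finite-state transducer rather than search it recursively.

```python
# Hypothetical miniature lexicon: each morph class maps a surface morph to
# (analysis string, list of continuation classes). "#" marks a legal word end.
LEXICON = {
    "stem": {
        "book": ("book", ["suffix"]),
        "work": ("work", ["suffix"]),
    },
    "suffix": {
        "": ("", ["#"]),
        "s": ("+s", ["#"]),
        "ing": ("+ing", ["#"]),
        "ed": ("+ed", ["#"]),
    },
}

def analyse(word, cls="stem"):
    """Return every analysis of `word` licensed by the morphotactics."""
    results = []
    for morph, (analysis, continuations) in LEXICON[cls].items():
        if not word.startswith(morph):
            continue
        rest = word[len(morph):]
        for nxt in continuations:
            if nxt == "#":
                if rest == "":            # word must be fully consumed
                    results.append(analysis)
            else:
                results.extend(analysis + tail for tail in analyse(rest, nxt))
    return results
```

With this lexicon, `analyse("booking")` yields `["book+ing"]`, while a string that violates the concatenation constraints, such as `"bookss"`, yields no analysis.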
13
Modelling of Natural Language Words: System
Overview
morpheme classes:
  affix: (ε:noun sg) <>, (s:noun pl) <>
  stem: (book:book) <affix>, (box:box) <affix>
morphotactics: concatenation of morpheme classes
phon./orthograph. alternation rules
special expressions: ([0-9]{2}.){2}19[0-9]{2} ...
All of these are input to the lexicon compiler, which produces the lexical transducer.
14
Non-Regular Phenomena
  • ?

15
Modelling of Words: Standard Case
noun-stem: book <noun-suffix>, work <noun-suffix>
noun-suffix: (ε:N Sg), (s:N Pl)
(Diagram: lexical transducer for book and work with suffix arcs ε and s and outputs N Sg / N Pl; layout not recoverable from the transcript.)
  • Lexical Transducer
  • deterministic
  • minimal
  • additional output at final states

16
Modelling of Words: Mapping Ambiguities between
Input and Output Language
(Diagram: transducer over the surface forms leave, leaves and left; only the arc letters and the labels leave, leaves, left survive in the transcript.)
  • Lexical Transducer
  • deterministic
  • delayed emission at final state
  • minimal
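
A minimal sketch of delayed emission: the automaton reads the input deterministically, letter by letter, and only emits output once a final state is reached, so that forms sharing a prefix (leave, leaves, left) need no early commitment. The analysis tags and the second reading of "leaves" are illustrative assumptions, and the network built here is a plain prefix tree, not a minimised transducer.

```python
# (surface form, analysis) pairs; the tags are hypothetical.
ENTRIES = [
    ("leave", "leave+V"),
    ("leaves", "leave+V+3sg"),
    ("leaves", "leaf+N+Pl"),
    ("left", "leave+V+past"),
]

def build(entries):
    """Build deterministic letter transitions with outputs on final states."""
    delta, final = {}, {}
    next_state = 1                      # state 0 is the start state
    for word, analysis in entries:
        state = 0
        for ch in word:
            if (state, ch) not in delta:
                delta[(state, ch)] = next_state
                next_state += 1
            state = delta[(state, ch)]
        final.setdefault(state, []).append(analysis)
    return delta, final

def lookup(word, delta, final):
    """Follow one path per word; emit the stored outputs only at the end."""
    state = 0
    for ch in word:
        if (state, ch) not in delta:
            return []
        state = delta[(state, ch)]
    return final.get(state, [])

DELTA, FINAL = build(ENTRIES)
```

`lookup("leaves", DELTA, FINAL)` returns both stored readings, whereas the non-final prefix `"lea"` returns an empty list.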

17
Modelling of Words: Overlapping of Matching
Patterns
(Diagram: network for the two segmentations Wach-stube and Wachs-tube, with ε-transitions between the morphs.)
  • Lexical Transducer
  • non-deterministic with ε-transitions
  • determinisation not possible
  • infinite delay of output due to cycle
  • Ambiguities
  • cannot be resolved at lexical level
  • must be preserved for later analysis steps
  • require non-deterministic traversal
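
The non-deterministic traversal can be illustrated with a small segmenter that keeps every segmentation instead of choosing one. The morph inventory is a hypothetical four-entry set taken from the slide's example; the recursion stands in for following all paths of the transducer in parallel.

```python
# Hypothetical morph inventory for the example on this slide.
MORPHS = {"wach", "wachs", "stube", "tube"}

def segmentations(word):
    """Return all ways of splitting `word` into known morphs."""
    if word == "":
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in MORPHS:
            # Keep every continuation: the ambiguity is preserved.
            results.extend([head] + rest for rest in segmentations(word[i:]))
    return results
```

For `"wachstube"` both readings survive, which is what a later analysis step would have to choose between.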

18
Modelling of Words: Non-Disjoint Morpheme Classes
  • problem
  • multiplication of start sections
  • high degree of non-determinism
  • solutions
  • pure finite-state devices
  • transducer with high degree of non-determinism
  • heavy backtracking / parallel search
  • slow processing
  • determinisation
  • explosion of network in size
  • extended device: feature propagation along paths
  • merging of morpheme classes
  • reduction of the degree of non-determinism
  • minimisation
  • additional checking of bit-vectors

19
Modelling of Words: Non-Disjoint Morpheme
Classes: high degree of non-determinism
noun-stem: book <noun-stem> <noun-suffix>, work <noun-stem> <noun-suffix>
verb-stem: book <verb-suffix>, work <verb-suffix>
noun-suffix: (ε:N Sg), (s:N Pl)
verb-suffix: (ε:V), (ed:V past), (s:V 3rd)
(Diagram: separate noun (N) and verb (V) sub-networks, each with its own arcs for book and work, entered via ε-transitions.)
  • problem
  • each of the subdivisions must be searched
    separately
  • determinisation is not feasible: it leads to an
    explosion of the network, or is not possible
20
Modelling of Words: Non-Disjoint Morpheme
Classes: feature propagation
noun-stem: book <noun-stem> <noun-suffix>, work <noun-stem> <noun-suffix>
verb-stem: book <verb-suffix>, work <verb-suffix>
noun-suffix: (ε:N Sg), (s:N Pl)
verb-suffix: (ε:V), (ed:V past), (s:V 3rd)
(Diagram: one merged network whose arcs carry features: ε [N], book [N,V], ε [N,V], work [N,V], ed [V], s [N,V].)
  • solution
  • interpreting continuation class as bit-feature
  • added to the arcs
  • checked during traversal (feature intersection)
  • merging the dictionaries
  • searching only one dictionary (degree of
    non-determinism reduced considerably)
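
A minimal sketch of feature propagation with bit vectors: noun and verb stems live in one merged dictionary, each entry carries its class membership as bits, and a suffix is accepted only if the intersection of stem bits and suffix bits is non-empty. The bit assignment and the noun-only stem "dog" are assumptions added for illustration.

```python
N, V = 0b01, 0b10   # one bit per morpheme class (hypothetical encoding)

# Merged stem dictionary: "book" and "work" are both noun and verb stems.
STEMS = {"book": N | V, "work": N | V, "dog": N}

# Each suffix reading carries the class it belongs to and its output tag.
SUFFIXES = {
    "": [(N, "N Sg"), (V, "V")],
    "s": [(N, "N Pl"), (V, "V 3rd")],
    "ed": [(V, "V past")],
}

def analyse(word):
    """Search the single merged dictionary, checking features on the way."""
    readings = []
    for stem, stem_bits in STEMS.items():
        if not word.startswith(stem):
            continue
        suffix = word[len(stem):]
        for suffix_bits, tag in SUFFIXES.get(suffix, []):
            if stem_bits & suffix_bits:        # feature intersection
                readings.append(f"{stem}+{tag}")
    return readings
```

The stem is looked up once; `analyse("books")` gives the noun and the verb reading, while `analyse("doged")` is rejected because the bit vectors of "dog" and "ed" do not intersect.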
21
Modelling of Words: Long-Distance Dependencies
Constraints on the co-occurrence of morphs within
words
  • Contiguous constraints: mach-e, mach-st, mach-t, mach-en, ...
  • Discontiguous constraints: ge-mach-t
  • invalid sequences: ge-mach-e, ge-mach-st
  • further examples: ge-wachs-t, wachs-e, ge-wach-st, wach-e
22
Modelling of Words: Long-Distance Dependencies
  • modelling alternatives
  • pure finite-state transducer
  • copies of the network
  • can cause explosion in the size of the resulting
    transducer
  • extended finite-state transducer
  • context-free extension: simple memory flags
  • special treatment by analysis and generation
    routine
  • keeps transducers small

23
Modelling of Words: Long-Distance Dependencies:
network copies
(Diagram: the verb-stems network (> 15,000 entries) appears twice, once reached via ε and once via the prefix ge, each copy followed by its inflection suffixes.)
problem: can cause explosion in the size of the
resulting transducer
24
Modelling of Words: Long-Distance Dependencies:
simple memory flags
(Diagram: a single verb-stems network; the prefix arc ge carries @set(ge), the alternative arc is ε, and the inflection suffixes carry @require(ge) or @require(-ge).)
solution: state flags, procedural
interpretation (bit vector operations)
Beesley/Karttunen, 2003
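
A minimal sketch of how the analysis routine might interpret such flags procedurally: the prefix arc sets the flag, and each suffix states whether it requires the flag to be set or unset. The stem list, the suffix inventory and the tags are assumptions for illustration; the operations mirror the slide's @set(ge), @require(ge) and @require(-ge), not the syntax of any particular toolkit.

```python
STEMS = ("mach", "wach")                 # single copy of the verb stems

# (suffix, flag value required, tag): True = @require(ge), False = @require(-ge)
SUFFIXES = [
    ("e", False, "1sg"),
    ("st", False, "2sg"),
    ("t", False, "3sg"),
    ("t", True, "past participle"),
]

def analyse(word):
    readings = []
    # Either take the ε arc (flag stays unset) or the "ge" arc with @set(ge).
    for prefix, ge_flag in (("", False), ("ge", True)):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for stem in STEMS:
            if not rest.startswith(stem):
                continue
            ending = rest[len(stem):]
            for suffix, required, tag in SUFFIXES:
                if ending == suffix and ge_flag == required:   # flag check
                    readings.append(f"{prefix}{stem}-{suffix} ({tag})")
    return readings
```

The stem network exists only once; `analyse("gemacht")` succeeds, while the invalid sequence `"gemachst"` is blocked by the flag check rather than by a second copy of the network.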
25
Modelling of Words: Phon./Orth. Alternation Rules
  • phenomena
  • pity → pitiless
  • fly → flies
  • swim → swimming
  • delete → deleting
  • fox → foxes
  • dictionary
  • rules: high-level specification of regular
    expressions

noun-stem: dog <noun-suffix>, fox <noun-suffix>
noun-suffix: (ε:N Sg), (s:N Pl)
Rule notation: a → b _ c (Beesley/Karttunen, 2003: 61)
(Diagram: network compiled from the rule, with arcs over a, b, c and "other" symbols; layout not recoverable from the transcript.)
26
Modelling of Words: Phon./Orth. Alternation Rules
  • compilation of lexical transducer
  • construction of dictionary transducer
  • construction of rule transducer
  • composition of lexical transducer and rule
    transducer
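
The three compilation steps can be sketched with explicit transducers. Everything concrete here is an assumption for illustration: a transducer is a triple (start state, final states, arcs), symbols are short strings, "#" and "^s" are intermediate symbols, and the composition routine only handles transducers whose first component never outputs the empty string.

```python
from collections import deque

def dictionary(stems):
    """Dictionary transducer: stem letters map to themselves, +Sg/+Pl to markers."""
    arcs = {"start": [], "stem_end": [("+Sg", "#", "end"), ("+Pl", "^s", "end")]}
    for stem in stems:
        state = "start"
        for i, ch in enumerate(stem):
            nxt = "stem_end" if i == len(stem) - 1 else (stem, i)
            arcs.setdefault(state, []).append((ch, ch, nxt))
            state = nxt
    return "start", {"end"}, arcs

def e_insertion(alphabet):
    """Rule transducer: state 1 remembers that the last letter was x, s or z."""
    arcs = {0: [("^s", "s", 0)], 1: [("^s", "es", 0)]}
    for state in (0, 1):
        arcs[state].append(("#", "", 0))
        for ch in alphabet:
            arcs[state].append((ch, ch, 1 if ch in "xsz" else 0))
    return 0, {0}, arcs

def compose(t1, t2):
    """Pair up states; an arc exists where t1's output equals t2's input."""
    start = (t1[0], t2[0])
    arcs, finals, seen, queue = {}, set(), {start}, deque([start])
    while queue:
        q1, q2 = state = queue.popleft()
        if q1 in t1[1] and q2 in t2[1]:
            finals.add(state)
        arcs[state] = []
        for a, b, n1 in t1[2].get(q1, []):
            for b2, c, n2 in t2[2].get(q2, []):
                if b == b2:
                    arcs[state].append((a, c, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        queue.append((n1, n2))
    return start, finals, arcs

def transduce(t, symbols):
    """Return all output strings for a list of input symbols."""
    start, finals, arcs = t
    results, stack = set(), [(start, 0, "")]
    while stack:
        state, i, out = stack.pop()
        if i == len(symbols):
            if state in finals:
                results.add(out)
            continue
        for a, b, nxt in arcs.get(state, []):
            if a == symbols[i]:
                stack.append((nxt, i + 1, out + b))
    return results

LEXICAL = compose(dictionary(["fox", "dog"]), e_insertion("dfgox"))
```

After composition the intermediate symbols no longer appear: `transduce(LEXICAL, ["f", "o", "x", "+Pl"])` maps the lexical string directly to the surface form.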

27
Modelling of Words: Phon./Orth. Alternation Rules
e-insertion rule for English plural nouns ending
in x, s, z (foxes)
(Diagram: rule transducer with states r0 to r5; the surviving arc labels are "z, s, x", "z, x", "s" and "other", the remaining labels are lost in the transcript.)
Jurafsky/Martin, 2000, p. 78
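
The effect of the e-insertion rule can be sketched on an intermediate string in which "^" marks a morpheme boundary and "#" the word end. This sketch swaps the slide's state diagram for a regular-expression substitution, so it shows the mapping the rule computes, not the transducer itself; the function name is hypothetical.

```python
import re

def apply_e_insertion(intermediate):
    """Insert e between a stem ending in x, s or z and the plural s."""
    # ε → e in the context: x, s or z before the boundary, "s#" after it.
    with_e = re.sub(r"(?<=[xsz])\^(?=s#)", "e", intermediate)
    # Remaining boundary symbols are not realised on the surface.
    return with_e.replace("^", "").replace("#", "")
```

`apply_e_insertion("fox^s#")` gives `"foxes"`, while `"dog^s#"` is left without an inserted e and surfaces as `"dogs"`.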
30
Diagram?
32
  • Compute output for aabba

34
Tisch/Tische, Rat/Räte
(Diagram: the lexical transducer for book and work with outputs N Sg / N Pl, repeated from the standard case.)
35
  • Numbers to Numerals for English (1-99)

36
  • DMOR/SMOR

37
Modelling of Words: Phon./Orth. Alternation Rules
(Diagram in three parts: the dictionary transducer for fox and dog with the noun suffixes, the e-insertion rule transducer, and the composed transducer dictionary ∘ rule. Only state numbers and arc labels survive in the transcript; the layout is not recoverable.)
38
Time complexity of transducer?
39
Layers of Linguistic Modelling
(Diagram: one sentence analysed on four layers; only the labels survive in the transcript.)
Text: The Commission promotes information on
products in third countries
Lex: The/dete Commission/noun promote/verb information/noun on/prpo product/noun in/prpo third country/noun
Syn: S, NP, VP, PP
Sem: promote, Commission, information, on, product, in, third country
40
Syntactic Analysis of Natural Language Texts
Syntactic Analysis
  • Input Language
  • Output Language
  • Assignment of Syntactic Categories
  • Assignment of Syntactic Structure
(Diagram: the word sequence "the good example" is assigned the category NP.)
41
Sequences of Words: Properties and
Well-Formedness Conditions: syntactic conditions,
type-3
  • regular language concatenations
  • local word ordering principles
  • the good example, *example good the
  • could have been done, *been could done have
  • global word ordering principles
  • (we) (gave) (him) (the book), *(gave) (him) (the
    book) (we)

42
Sequences of Words: Properties and
Well-Formedness Conditions: syntactic conditions
beyond type-3
  • concatenations beyond regular languages
  • centre embedding (S → a S b)
  • obligatorily paired correspondences
  • either ... or, if ... then
  • can be nested inside each other
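
A minimal sketch of why centre embedding goes beyond regular languages: recognising the strings of S → a S b needs a counter that can grow without bound, which a finite set of states cannot provide. The empty string as base case (S → ε) is an assumption, since the slide gives only the recursive rule.

```python
def accepts(s):
    """Recognise a^n b^n (n >= 0) with a single unbounded counter."""
    depth = 0            # number of a's still waiting for their b
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:   # an a after a b breaks the nesting
                return False
            depth += 1
        elif ch == "b":
            seen_b = True
            depth -= 1
            if depth < 0:
                return False
        else:
            return False
    return depth == 0
```

Reading "a" as "either" or "if" and "b" as "or" or "then" gives the paired correspondences of the slide: every opener must be closed, in nested order.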

43
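Sequences of Words: Properties and
Well-Formedness Conditions: syntactic conditions
beyond type-2
  • concatenations beyond context-free languages
  • cross-serial dependencies

Jan säit das mer d'chind em Hans es huus lönd
hälfe aastriiche
Gloss: John said that we the children-acc Hans-dat
the house let help paint
(Diagram: links labelled x1, x2, x3 and y1, y2, y3 pair each noun phrase with its verb in crossing order: d'chind with lönd, em Hans with hälfe, es huus with aastriiche.)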
44
Syntactic Grammars
  • complete parsing
  • goal: recover complete, exact parses of sentences
  • closed-world assumption
  • lexicon and grammar are complete
  • place all types of conditions into one grammar
  • seeking the globally best parse of the entire
    search space
  • problems
  • not robust
  • too slow for mass data processing
  • partial parsing
  • goal: recover syntactic information efficiently
    and reliably from unrestricted text
  • sacrificing completeness and depth of analysis
  • open-world assumption
  • lexicon and grammar are incomplete
  • local decisions

Abney, 1996
45
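Syntactic Grammars: Complete Sentence Structure?
(Diagram: phrase labels NP, AP, PP and NP-c over the words bottle, closed, with, stopper, made of, cork, or, material, with, fastening; the legend distinguishes syntactic link, semantic link and text structure. The tree itself is not recoverable from the transcript.)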
46
Syntactic Grammars: Complete Sentence
Parsing: computational problem
  • combinatorial explosion of readings

Bod, 1998: 2
47
All Grammars Leak
Edward Sapir, 1921
  • Not possible to provide an exact and complete
    characterization of all well-formed utterances
    that cleanly divides them from all other
    sequences of words which are regarded as
    ill-formed utterances
  • Rules are not completely ill-founded
  • Somehow we need to make things looser to account
    for the creativity of language use

48
All Grammars Leak
  • Example for leaking rule?

49
All Grammars Leak
  • Agreement in English
  • Why do some teachers, parents and religious
    leaders feel that celebrating their religious
    observances in home and church are inadequate and
    deem it necessary to bring those practices into
    the public schools?

50
Syntactic Structure: Partial Parsing Approaches
  • finite-state approximation of sentence structures
    (Abney 1995)
  • finite-state cascades: sequences of levels of
    regular expressions
  • recognition approximation: tail-recursion
    replaced by iteration
  • interpretation approximation: embedding replaced
    by fixed levels

51
Syntactic Structure: Finite-State Cascades
  • functionally equivalent to composition of
    transducers,
  • but without intermediate structure output
  • the individual transducers are considerably
    smaller than a composed transducer

52
Syntactic Structure: Finite-State Cascades (Abney)
Finite-State Cascade
L3: S S
    (T3)
L2: NP PP VP NP VP
    (T2)
L1: NP P NP VP NP VP
    (T1)
L0: D N P D N N V-tns Pron Aux V-ing
    the woman in the lab coat thought you were sleeping
(The accompanying regular-expression grammar is not included in the transcript.)
53
Syntactic Structure: Finite-State Cascades (Abney)
  • cascade consists of a sequence of levels
  • phrases at one level are built on phrases at the
    previous level
  • no recursion: phrases never contain same-level or
    higher-level phrases
  • two levels of special importance
  • chunks: non-recursive cores (NX, VX) of major
    phrases (NP, VP)
  • simplex clauses: embedded clauses as siblings
  • patterns: reliable indicators of gist of
    syntactic structure

54
Syntactic Structure: Finite-State Cascades (Abney)
  • each transduction is defined by a set of patterns
  • category
  • regular expression
  • regular expression is translated into a
    finite-state automaton
  • level transducer
  • union of pattern automata
  • deterministic recognizer
  • each final state is associated with a unique
    pattern
  • heuristics
  • longest match (resolution of ambiguities)
  • external control process
  • if the recognizer blocks without reaching a final
    state, a single input element is punted to the
    output and recognition resumes at the following
    word
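
The control process described above (patterns per level, longest match, punting one element when no pattern applies) can be sketched as follows. The three-level grammar is an assumption chosen so that it reproduces the levels of the cascade figure; the sketch also uses Python's `re` module on a space-delimited tag string instead of a compiled union automaton.

```python
import re

# Hypothetical regular-expression grammar; every tag is followed by one space.
LEVELS = [
    [("NP", r"(?:D )?(?:N )+|Pron "), ("VP", r"V-tns |Aux V-ing ")],   # T1
    [("PP", r"P NP ")],                                                # T2
    [("S", r"(?:PP )*NP (?:PP )*VP (?:PP )*")],                        # T3
]

def apply_level(tags, patterns):
    """One transduction: rewrite the longest matching spans as categories."""
    text = "".join(tag + " " for tag in tags)
    compiled = [(cat, re.compile(rx)) for cat, rx in patterns]
    output, pos = [], 0
    while pos < len(text):
        best_cat, best_end = None, pos
        for cat, rx in compiled:
            match = rx.match(text, pos)
            if match and match.end() > best_end:     # longest match wins
                best_cat, best_end = cat, match.end()
        if best_cat is None:
            # No pattern applies: punt one input element to the output.
            best_end = text.index(" ", pos) + 1
            best_cat = text[pos:best_end - 1]
        output.append(best_cat)
        pos = best_end
    return output

def cascade(tags):
    """Run all levels in sequence; each level reads the previous output."""
    for patterns in LEVELS:
        tags = apply_level(tags, patterns)
    return tags

L0 = ["D", "N", "P", "D", "N", "N", "V-tns", "Pron", "Aux", "V-ing"]
```

Applied to the tag sequence of "the woman in the lab coat thought you were sleeping", the first level yields `NP P NP VP NP VP` and the full cascade yields `S S`.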

55
Syntactic Structure: Finite-State Cascades (Abney)
  • patterns: reliable indicators of bits of
    syntactic structure
  • parsing
  • easy-first parsing (easy calls first)
  • proceeds by growing islands of certainty into
    larger and larger phrases
  • no systematic parse tree from bottom to top
  • recognition of recognizable structures
  • containment of ambiguity
  • prepositional phrases and the like are left
    unattached
  • noun-noun modifications not resolved

56
Syntactic Structure: Bounding of Centre Embedding
  • Sproat, 2002
  • observation: unbounded centre embedding
  • does not occur in language use
  • seems to be too complex for human mental
    capacities
  • finite-state modelling of bounded centre embedding

S → the (man|dog) S1 (bites|walks)
S1 → the (man|dog) S2 (bites|walks)
S2 → the (man|dog) (bites|walks)
S1 → ε
S2 → ε
57
Modelling of Natural Language Word Sequences:
Approaches
58
Modelling of Natural Language Word Sequences:
Cases (1)
59
Modelling of Natural Language Word Sequences:
Cases (2)
60
Semantic Analysis: An example
  • message understanding
  • filling in relational database templates from
    newswire texts
  • approach of FASTUS [1]: cascade of five
    transducers
  • recognition of names,
  • fixed form expressions,
  • basic noun and verb groups
  • patterns of events
  • <company> <form> <joint venture> with <company>
  • "Bridgestone Sports Co. said Friday it has set up
    a joint venture in Taiwan with a local concern
    and a Japanese trading house to produce golf
    clubs to be shipped to Japan."
  • identification of event structures that describe
    the same event

[1] Hobbs/Appelt/Bear/Israel/Kehler/Martin/Meyers/
Kameyama/Stickel/Tyson (1997)
61
Summary: Linguistic Adequacy
  • word formation
  • essentially regular language
  • sentence formation
  • reduced recognition capacity (approximations)
  • corresponds to language use rather than natural
    language system
  • flat interpretation structures
  • clearly separate syntactic constraints from other
    (semantic, textual) constraints
  • partial interpretation structures
  • clearly identify the contribution of syntactic
    structure in the interplay with other structuring
    principles
  • content
  • suitable for restricted fact extraction
  • deep text understanding generally still poorly
    understood

62
Summary: Practical Usefulness
  • not all natural language phenomena can be
    described with finite-state devices
  • many actually occurring phenomena can be
    described with regular devices
  • not all practical applications require a complete
    and deep processing of natural language
  • partial solutions allow for the development of
    many useful applications

63
Summary: Complexity of Finite-State Transducers
for NLP
  • theoretically: computationally intractable
    (Barton/Berwick/Ristad, 1987)
  • SAT-problem is unnatural
  • natural language problems are bounded in size
  • input and output alphabets,
  • word length of linguistic words,
  • partiality of functions and relations
  • combinatorial possibilities are locally
    restricted.
  • practically, natural language finite-state
    systems
  • do not involve complex search
  • are remarkably fast
  • can in many relevant cases be determinised and
    minimised

64
Summary: Large-Scale Processing
  • Context-free devices
  • run-time complexity |G| · n³, |G| >> n³
  • too slow for mass data processing
  • Finite-state devices
  • run-time complexity: best case linear (with low
    degree of non-determinism)
  • best suited for mass data processing

65
References
  • Abney, Steven (1996). Tagging and Partial
    Parsing. In Ken Church, Steve Young, and Gerrit
    Bloothooft (eds.), Corpus-Based Methods in
    Language and Speech. Kluwer Academic Publishers,
    Dordrecht. http://www.vinartus.net/spa/95a.pdf
  • Abney, Steven (1996a). Cascaded Finite-State
    Parsing. Viewgraphs for a talk given at Xerox
    Research Centre, Grenoble, France.
    http://www.vinartus.net/spa/96a.pdf
  • Abney, Steven (1995). Partial Parsing via
    Finite-State Cascades. In Journal of Natural
    Language Engineering, 2(4): 337-344.
    http://www.vinartus.net/spa/97a.pdf
  • Barton Jr., G. Edward; Berwick, Robert C. and
    Eric Sven Ristad (1987). Computational Complexity
    and Natural Language. MIT Press.
  • Beesley, Kenneth R. and Lauri Karttunen (2003).
    Finite-State Morphology. Distributed for the
    Center for the Study of Language and Information.
    (CSLI Studies in Computational Linguistics)
  • Bod, Rens (1998). Beyond Grammar. An
    Experience-Based Theory of Language. CSLI
    Lecture Notes, 88, Stanford, California: Center
    for the Study of Language and Information.
  • Grefenstette, Gregory (1999). Light Parsing as
    Finite State Filtering. In Kornai 1999, pp.
    86-94. Earlier version in Workshop on Extended
    Finite State Models of Language, Budapest,
    Hungary, Aug 11-12, 1996. ECAI'96.
    http://citeseer.nj.nec.com/grefenstette96light.html
  • Hobbs, Jerry; Doug Appelt, John Bear, David
    Israel, Andy Kehler, David Martin, Karen Meyers,
    Megumi Kameyama, Mark Stickel, Mabry Tyson
    (1997). Breaking the Text Barrier. FASTUS
    Presentation slides. SRI International.
    http://www.ai.sri.com/israel/Generic-FASTUS-talk.pdf
  • Jurafsky, Daniel and James H. Martin (2000).
    Speech and Language Processing. An Introduction
    to Natural Language Processing, Computational
    Linguistics and Speech Recognition. New Jersey:
    Prentice Hall.
  • Kornai, András (ed.) (1999). Extended Finite
    State Models of Language. (Studies in Natural
    Language Processing). Cambridge: Cambridge
    University Press.
  • Koskenniemi, Kimmo (1983). Two-level morphology:
    a general computational model for word-form
    recognition and production. Publication 11,
    University of Helsinki. Helsinki: Department of
    General Linguistics.

66
References
  • Kunze, Jürgen (2001). Computerlinguistik.
    Voraussetzungen, Grundlagen, Werkzeuge.
    Lecture notes. Humboldt Universität zu Berlin.
    http://www2.rz.hu-berlin.de/compling/Lehrstuhl/Skripte/Computerlinguistik_1/index.html
  • Manning, Christopher D.; Schütze, Hinrich (1999).
    Foundations of Statistical Natural Language
    Processing. Cambridge, Mass., London: The MIT
    Press. http://www.sultry.arts.usyd.edu.au/fsnlp
  • Mohri, Mehryar (1997). Finite State Transducers
    in Language and Speech Processing. In
    Computational Linguistics, 23, 2, 1997, pp.
    269-311. http://citeseer.nj.nec.com/mohri97finitestate.html
  • Mohri, Mehryar (1996). On some Applications of
    finite-state automata theory to natural language
    processing. In Journal of Natural Language
    Engineering, 2, pp. 1-20.
  • Mohri, Mehryar and Michael Riley (2002). Weighted
    Finite-State Transducers in Speech Recognition
    (Tutorial). Part 1:
    http://www.research.att.com/mohri/postscript/icslp.ps,
    Part 2:
    http://www.research.att.com/mohri/postscript/icslp-tut2.ps
  • Partee, Barbara; ter Meulen, Alice and Robert E.
    Wall (1993). Mathematical Methods in Linguistics.
    Dordrecht: Kluwer Academic Publishers.
  • Pereira, Fernando C. N. and Rebecca N. Wright
    (1997). Finite-State Approximation of
    Phrase-Structure Grammars. In Roche/Schabes
    1997.
  • Roche, Emmanuel and Yves Schabes (eds.) (1997).
    Finite-State Language Processing. Cambridge
    (Mass.) and London: MIT Press.
  • Sproat, Richard (2002). The Linguistic
    Significance of Finite-State Techniques. February
    18, 2002. http://www.research.att.com/rws
  • Strzalkowski, Tomek; Lin, Fang; Ge, Jin Wang;
    Perez-Carballo, Jose (1999). Evaluating Natural
    Language Processing Techniques in Information
    Retrieval. In Strzalkowski, Tomek (ed.), Natural
    Language Information Retrieval, Kluwer Academic
    Publishers, Holland: 113-145.
  • Woods, W.A. (1970). Transition Network Grammar
    for Natural Language Analysis. In Communications
    of the ACM 13: 591-602.