# LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing

### LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing, Dan Jurafsky

1
LING 138/238 SYMBSYS 138: Intro to Computer Speech
and Language Processing
• Dan Jurafsky

2
Today: 9/30, Week 1
• Finite State Automata
• Deterministic Recognition of FSAs
• Non-Determinism (NFSAs)
• Recognition of NFSAs
• Proof that regular expressions = FSAs
• Very brief sketch: morphology, FSAs, FSTs
• Why finite-state machines are so great.

3
Three Views
• Three equivalent formal ways to look at what
we're up to (thanks to Martin Kay)

Regular Expressions
Finite State Automata
Regular Languages
4
Finite State Automata
• Terminology: Finite State Automata, Finite State
Machines, FSA, Finite Automata
• Regular expressions are one way of specifying the
structure of finite-state automata.
• FSAs and their close relatives are at the core of
most algorithms for speech and language
processing.

5
Intuition: FSAs as Graphs
• /baa+!/

6
Sheep FSA machine
• It has 5 states
• At least b, a, and ! are in its alphabet
• q0 is the start state
• q4 is an accept state
• It has 5 transitions

7
But note
• There are other machines that correspond to this
language
• More on this one later

8
More Formally Defining an FSA
• You can specify an FSA by enumerating the
following things.
• The set of states Q
• A finite alphabet Σ
• A start state q0
• A set F of accepting/final states, F ⊆ Q
• A transition function δ(q, i) that maps Q × Σ to Q
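
The five components listed above can be written down directly as data. The following is a minimal sketch for the sheep machine, assuming it accepts b, then two or more a's, then ! (which matches the 5 states and 5 transitions described earlier). The variable names and the `is_well_formed` helper are illustrative, not from the slides, and the transition function is stored as a partial mapping: a missing entry means there is no legal move.

```python
# Sheep-language FSA written out as the five components of the definition.
Q = {"q0", "q1", "q2", "q3", "q4"}   # the set of states
SIGMA = {"b", "a", "!"}              # the finite alphabet
START = "q0"                         # the start state
F = {"q4"}                           # accepting states, a subset of Q
DELTA = {                            # transition function: (state, symbol) -> state
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",               # self-loop: any number of extra a's
    ("q3", "!"): "q4",
}


def is_well_formed(states, alphabet, start, accepting, delta):
    """Check that the start state is in Q, F is a subset of Q,
    and every transition maps a (state, symbol) pair in Q x Sigma to a state in Q."""
    return (
        start in states
        and accepting <= states
        and all(
            q in states and symbol in alphabet and target in states
            for (q, symbol), target in delta.items()
        )
    )


print(is_well_formed(Q, SIGMA, START, F, DELTA))
```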

9
Yet Another View
• State-transition table

10
Recognition
• Recognition is the process of determining if a
string should be accepted by a machine
• Or it's the process of determining if a string
is in the language we're defining with the
machine
• Or it's the process of determining if a regular
expression matches a string

11
Recognition
• Traditionally (Turing's idea), this process is
depicted with a tape.

12
Recognition
• Start in the start state
• Examine the current input
• Consult the table
• Go to a new state and update the tape pointer.
• Until you run out of tape.
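
The steps above can be sketched as a small table-driven loop. This is an illustrative version of deterministic recognition, not the exact pseudocode from the D-Recognize slide; the function name `d_recognize` and the dictionary encoding of the table are assumptions made for the example.

```python
def d_recognize(tape, table, start, accept):
    """Deterministic recognition: walk the tape once, consulting the table."""
    state = start                              # start in the start state
    for symbol in tape:                        # examine the current input
        state = table.get((state, symbol))     # consult the table
        if state is None:                      # no legal move: reject
            return False
    return state in accept                     # out of tape: accept only in a final state


# Transition table for the sheep machine (assumed to accept baa+!).
SHEEP = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

print(d_recognize("baaa!", SHEEP, "q0", {"q4"}))
print(d_recognize("ba!", SHEEP, "q0", {"q4"}))
```

The `for` loop plays the role of the tape pointer: each iteration advances one cell, so the machine never backs up.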

13
D-Recognize
14
Tracing D-Recognize
15
Key Points
• Deterministic means that at each point in
processing there is always one unique thing to do
(no choices).
• D-recognize is a simple table-driven interpreter
• The algorithm is universal for all unambiguous
regular languages.
• To change the machine, you change the table.

16
Key Points
• Crudely, therefore, matching strings with regular
expressions (à la Perl) is a matter of
• translating the expression into a machine (table)
and
• passing the table to an interpreter
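
As a rough illustration of the regex/table correspondence, the sketch below checks that a hand-built table and a regular expression agree on a few strings. It assumes the sheep language is `baa+!`. Note that Python's `re` module is a backtracking matcher rather than a table-driven interpreter, so this only demonstrates that the two descriptions pick out the same strings, not how `re` works internally. All names here are illustrative.

```python
import re

# Assumed regular expression for the sheep language: b, two or more a's, then !
SHEEP_RE = re.compile(r"baa+!")

# The same language written as a transition table: (state, symbol) -> next state
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}


def table_match(s, table=SHEEP_TABLE, start="q0", accept=frozenset({"q4"})):
    """Interpret the table over the whole string."""
    state = start
    for ch in s:
        state = table.get((state, ch))
        if state is None:
            return False
    return state in accept


def regex_match(s):
    """Match the whole string against the regular expression."""
    return SHEEP_RE.fullmatch(s) is not None


SAMPLES = ["baa!", "baaaa!", "ba!", "baa", "abaa!", ""]
AGREE = all(table_match(s) == regex_match(s) for s in SAMPLES)
print(AGREE)
```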

17
Recognition as Search
• You can view this algorithm as state-space
search.
• States are pairings of tape positions and state
numbers.
• Operators are compiled into the table
• Goal state is a pairing with the end of tape
position and a final accept state

18
Generative Formalisms
• Formal Languages are sets of strings composed of
symbols from a finite set of symbols.
• Finite-state automata define formal languages
(without having to enumerate all the strings in
the language)
• The term Generative is based on the view that you
can run the machine as a generator to get strings
from the language.

19
Generative Formalisms
• FSAs can be viewed from two perspectives
• Acceptors that can tell you if a string is in the
language
• Generators to produce all and only the strings in
the language
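
To make the generator view concrete, here is a minimal sketch that runs a deterministic machine "forwards" and collects every accepted string up to a length bound (the bound is needed because the language is infinite). The function name `generate` and the breadth-first strategy are choices made for this example, and the table assumes the sheep language baa+!.

```python
from collections import deque


def generate(table, start, accept, max_len):
    """Enumerate all accepted strings of length <= max_len, shortest first."""
    results = []
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            results.append(prefix)
        if len(prefix) == max_len:
            continue                      # do not grow strings past the bound
        for (src, symbol), dst in sorted(table.items()):
            if src == state:
                queue.append((dst, prefix + symbol))
    return results


SHEEP = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

print(generate(SHEEP, "q0", {"q4"}, 6))  # ['baa!', 'baaa!', 'baaaa!']
```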

20
Dollars and Cents
21
Summary
• Regular expressions are just a compact textual
representation of FSAs
• Recognition is the process of determining if a
string/input is in the language defined by some
machine.
• Recognition is straightforward with deterministic
machines.

22
Three Views
• Three equivalent formal ways to look at what
we're up to (thanks to Martin Kay)

Regular Expressions
Finite State Automata
Regular Languages
23
Regular Languages
• More on these in a couple of weeks
• S → b a a A
• A → a A
• A → !
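
The three rules above (reading the garbled symbol as a rewrite arrow) form a small right-linear grammar. As a hedged sketch, the code below expands the leftmost nonterminal breadth-first and lists every string the grammar derives up to a length bound. The dictionary encoding and the function name `derive` are assumptions made for the example.

```python
# Grammar rules: nonterminal -> list of right-hand sides (each a list of symbols).
GRAMMAR = {
    "S": [["b", "a", "a", "A"]],
    "A": [["a", "A"], ["!"]],
}


def derive(grammar, start="S", max_len=6):
    """List all terminal strings of length <= max_len derivable from `start`."""
    done = []
    agenda = [[start]]
    while agenda:
        form = agenda.pop(0)
        # Find the leftmost nonterminal, if any.
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            done.append("".join(form))    # only terminals left: a finished string
            continue
        for rhs in grammar[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            terminals = sum(1 for sym in new_form if sym not in grammar)
            if terminals <= max_len:      # prune forms that are already too long
                agenda.append(new_form)
    return done


print(derive(GRAMMAR))  # ['baa!', 'baaa!', 'baaaa!']
```

The output is the same set of strings the sheep automaton accepts at those lengths, which is the point of the "three views" picture.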

24
Non-Determinism
25
Non-Determinism cont.
• Yet another technique
• Epsilon transitions
• Key point: these transitions do not examine or
advance the tape during recognition

ε
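
One common way to handle epsilon transitions in code is to compute, for a set of states, everything reachable without reading any input. The sketch below is illustrative: the `epsilon_closure` name and the example machine (a sheep automaton with an epsilon arc from q3 back to q2) are assumptions, not taken from the slide's figure.

```python
def epsilon_closure(states, eps):
    """Return all states reachable from `states` using only epsilon arcs."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in closure:      # each state is added once, so cycles are safe
                closure.add(r)
                stack.append(r)
    return closure


# Hypothetical epsilon arcs: from q3 the machine may jump back to q2 for free.
EPS_ARCS = {"q3": {"q2"}}

print(sorted(epsilon_closure({"q3"}, EPS_ARCS)))  # ['q2', 'q3']
```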
26
Equivalence
• Non-deterministic machines can be converted to
deterministic ones with a fairly simple
construction
• That means that they have the same power:
non-deterministic machines are not more powerful
than deterministic ones
• It also means that one way to do recognition with
a non-deterministic machine is to turn it into a
deterministic one.
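
The slide does not spell out the construction; the usual one is the subset construction, in which each deterministic state stands for a set of non-deterministic states. The following is a minimal sketch for machines without epsilon arcs. The function names and the example NFA (a sheep machine where q2 has two arcs labelled a) are assumptions made for illustration.

```python
def determinize(nfa, start, accept, alphabet):
    """Subset construction: build a DFA whose states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa = {}
    todo = [start_set]
    seen = {start_set}
    while todo:
        current = todo.pop()
        for symbol in sorted(alphabet):
            target = frozenset(
                r for q in current for r in nfa.get((q, symbol), ())
            )
            if not target:
                continue                      # no move on this symbol
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_accept = {s for s in seen if s & accept}
    return dfa, start_set, dfa_accept


def run(dfa, start_set, dfa_accept, tape):
    """Deterministic recognition over the constructed machine."""
    state = start_set
    for symbol in tape:
        state = dfa.get((state, symbol))
        if state is None:
            return False
    return state in dfa_accept


# Non-deterministic sheep machine: on 'a', q2 may stay in q2 or move to q3.
SHEEP_NFA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

DFA, D_START, D_ACCEPT = determinize(SHEEP_NFA, "q0", {"q4"}, {"a", "b", "!"})
print(run(DFA, D_START, D_ACCEPT, "baaa!"))
```

In this small case the resulting machine has the same number of states as the original; in general the construction can produce exponentially many.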

27
Non-Deterministic Recognition
• In an ND FSA there exists at least one path
through the machine for a string that is in the
language defined by the machine.
• But not all paths through the machine
for an accepted string lead to an accept state.
• No paths through the machine lead to an accept
state for a string not in the language.

28
Non-Deterministic Recognition
• So success in a non-deterministic recognition
occurs when a path is found through the machine
that ends in an accept.
• Failure occurs when none of the possible paths lead to an accept state.

29
Example
[Figure: input tape with cells b, a, a, a, !, \ and the machine states q0, q1, q2, q3, q4]
30
Example
31
Example
32
Example
33
Example
34
Example
35
Example
36
Example
37
Example
38
Key Points
• States in the search space are pairings of tape
positions and states in the machine.
• By keeping track of as yet unexplored states, a
recognizer can systematically explore all the
paths through the machine given an input.
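
The two points above can be sketched as an agenda-based search. This is an illustrative version, not the code from the ND-Recognize slide: search states are (machine state, tape position) pairs, the agenda holds the unexplored ones, and a `seen` set (an addition in this sketch) prevents the same pairing from being explored twice. The empty string is used as a stand-in label for epsilon arcs.

```python
EPS = ""  # label used in this sketch for epsilon arcs


def nd_recognize(tape, nfa, start, accept):
    """Search over (state, position) pairs; succeed if any path ends in accept."""
    agenda = [(start, 0)]
    seen = {(start, 0)}
    while agenda:
        state, pos = agenda.pop()             # LIFO agenda: depth-first search
        if pos == len(tape) and state in accept:
            return True
        # Epsilon arcs change state without advancing the tape.
        successors = [(r, pos) for r in nfa.get((state, EPS), ())]
        if pos < len(tape):
            successors += [(r, pos + 1) for r in nfa.get((state, tape[pos]), ())]
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                agenda.append(nxt)
    return False                              # every path has been tried


# Non-deterministic sheep machine: on 'a', q2 may stay in q2 or move to q3.
SHEEP_NFA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

print(nd_recognize("baaa!", SHEEP_NFA, "q0", {"q4"}))
print(nd_recognize("ba!", SHEEP_NFA, "q0", {"q4"}))
```

Swapping `agenda.pop()` for `agenda.pop(0)` would turn the depth-first search into a breadth-first one without changing the result.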

39
ND-Recognize Code
40
Infinite Search
• If you're not careful such searches can go into
an infinite loop.
• How?

41
Why Bother?
• Non-determinism doesn't get us more formal power
and it causes headaches, so why bother?
• More natural solutions
• Machines based on construction are too big

42
Regular languages
• The class of languages characterizable by regular
expressions
• Given alphabet Σ, the regular languages over Σ are:
• The empty set ∅ is a regular language
• ∀a ∈ Σ ∪ ε, {a} is a regular language
• If L1 and L2 are regular languages, then so are:
• L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 and L2
• L1 ∪ L2, the union of L1 and L2
• L1*, the Kleene closure of L1
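
The three closure operations can be tried out on small finite languages represented as Python sets of strings. This is only a sketch: the Kleene closure is infinite, so `kleene` here takes a length bound, and all function names are illustrative.

```python
def concat(l1, l2):
    """Concatenation: every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}


def union(l1, l2):
    """Union: strings in either language."""
    return l1 | l2


def kleene(lang, max_len):
    """Kleene closure of `lang`, truncated to strings of length <= max_len."""
    result = {""}                     # zero repetitions gives the empty string
    frontier = {""}
    while frontier:
        frontier = {
            x + y for x in frontier for y in lang if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


# Build a slice of the sheep language from single-symbol languages.
sheep = concat(concat({"b"}, concat({"a"}, {"a"})), concat(kleene({"a"}, 2), {"!"}))
print(sorted(sheep))  # ['baa!', 'baaa!', 'baaaa!']
```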

43
Going from regexp to FSA
• Since all regular languages meet the above properties
• And regular languages are the languages characterizable by
regular expressions
• All regular expression operators can be
implemented by combinations of concatenation,
disjunction (union), and closure
• Counters (*, +) are repetition plus closure
• Anchors are individual symbols
• [] and () and . are kinds of disjunction

44
Going from regexp to FSA
• So if we could just show how to turn
closure/union/concat from regexps to FSAs, this
would give an idea of how FSA compilation works.
• The actual proof that regular languages = FSAs has 2 parts
• An FSA can be built for each regular language
• A regular language can be built for each automaton
• So I'll give the intuition of the first part
• Take any regular expression and build an
automaton
• Intuition: induction
• Base case: build an automaton for a single symbol
(say a)
• Inductive step: show how to imitate the 3 regexp
operations in automata

45
Union
• Accept a string in either of two languages

46
Concatenation
• Accept a string consisting of a string from
language L1 followed by a string from language L2.
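
One standard way to realize union and concatenation on automata is to glue two machines together with epsilon arcs: a new start state with epsilon arcs to both machines for union, and epsilon arcs from the first machine's accept states to the second machine's start state for concatenation. The sketch below follows that idea; the dictionary representation, the state-tagging scheme, and every function name are assumptions made for the example rather than anything shown on the slides.

```python
EPS = ""  # label used in this sketch for epsilon arcs


def literal(word, tag):
    """Automaton accepting exactly `word`; states are (tag, i) pairs to stay unique."""
    trans = {((tag, i), ch): {(tag, i + 1)} for i, ch in enumerate(word)}
    return {"start": (tag, 0), "accept": {(tag, len(word))}, "trans": trans}


def _merged(m1, m2):
    """Copy and combine the transition tables of two machines."""
    trans = {key: set(targets) for key, targets in m1["trans"].items()}
    for key, targets in m2["trans"].items():
        trans.setdefault(key, set()).update(targets)
    return trans


def union(m1, m2, tag):
    """New start state with epsilon arcs into both machines."""
    trans = _merged(m1, m2)
    start = (tag, "start")
    trans[(start, EPS)] = {m1["start"], m2["start"]}
    return {"start": start, "accept": m1["accept"] | m2["accept"], "trans": trans}


def concat(m1, m2):
    """Epsilon arcs from m1's accept states to m2's start state."""
    trans = _merged(m1, m2)
    for q in m1["accept"]:
        trans.setdefault((q, EPS), set()).add(m2["start"])
    return {"start": m1["start"], "accept": set(m2["accept"]), "trans": trans}


def accepts(machine, s):
    """Non-deterministic recognition over (state, position) pairs."""
    first = (machine["start"], 0)
    agenda, seen = [first], {first}
    while agenda:
        q, i = agenda.pop()
        if i == len(s) and q in machine["accept"]:
            return True
        nxt = [(r, i) for r in machine["trans"].get((q, EPS), ())]
        if i < len(s):
            nxt += [(r, i + 1) for r in machine["trans"].get((q, s[i]), ())]
        for pair in nxt:
            if pair not in seen:
                seen.add(pair)
                agenda.append(pair)
    return False


# (cat | dog) followed by s
PLURALS = concat(union(literal("cat", "A"), literal("dog", "B"), "U"), literal("s", "C"))
print(accepts(PLURALS, "cats"), accepts(PLURALS, "dogs"), accepts(PLURALS, "cat"))
```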

47
FSAs and Computational Morphology
• An important use of FSAs is for morphology, the
study of word parts
• We'll just have time for a quick overview today
• This is the exact topic of LING 239F, being
offered this quarter!

48
English Morphology
• Morphology is the study of the ways that words
are built up from smaller meaningful units called
morphemes
• We can usefully divide morphemes into two classes
• Stems: the core meaning-bearing units
• Affixes: bits and pieces that adhere to stems to
change their meanings and grammatical functions

49
Nouns and Verbs (English)
• Nouns are simple (not really)
• Markers for plural and possessive
• Verbs are only slightly more complex
• Markers appropriate to the tense of the verb

50
Regulars and Irregulars
• OK, so it gets a little complicated by the fact
that some words misbehave (refuse to follow the
rules)
• Mouse/mice, goose/geese, ox/oxen
• Go/went, fly/flew
• The terms regular and irregular will be used to
refer to words that follow the rules and those
that don't.

51
Regular and Irregular Nouns and Verbs
• Regulars
• Walk, walks, walking, walked, walked
• Table, tables
• Irregulars
• Eat, eats, eating, ate, eaten
• Catch, catches, catching, caught, caught
• Cut, cuts, cutting, cut, cut
• Goose, geese

52
Compute
• Many paths are possible
• Computer -> computerize -> computerization
• Computation -> computational
• Computer -> computerize -> computerizable
• Compute -> computee

53
• Stemming in information retrieval
• Might want to search for aardvark and find
pages with both aardvark and aardvarks
• Morphology in machine translation
• Need to know that the Spanish words quiero and
quieres are both related to querer 'want'
• Morphology in spell checking
• Need to know that misclam and antiundoggingly are
not words despite being made up of word parts

54
Can't just list all words
• Turkish
• Uygarlastiramadiklarimizdanmissinizcasina
• (behaving) as if you are among those whom we
could not civilize
• uygar 'civilized' + las 'become' + tir 'cause' +
ama 'not able' + dik 'past' + lar 'plural' + imiz
'p1pl' + dan 'abl' + mis 'past' + siniz '2pl' +
casina 'as if'

55
What we want
• Something to automatically do the following kinds
of mappings
• cats → cat +N +PL
• cat → cat +N +SG
• cities → city +N +PL
• merging → merge +V +Present-participle
• caught → catch +V +Past-participle

56
FSAs and the Lexicon
• This will actually require a kind of FSA we won't
be studying: the Finite State Transducer (FST)
• But we'll give a quick overview anyhow
• First we'll capture the morphotactics
• The rules governing the ordering of affixes in a
language.
• Then we'll add in the actual words

57
Simple Rules
58
59
Derivational Rules
60
Parsing/Generation vs. Recognition
• Recognition is usually not quite what we need.
• Usually if we find some string in the language we
need to find the structure in it (parsing)
• Or we have some structure and we want to produce
a surface form (production/generation)
• Example
• From cats to cat +N +PL

61
Finite State Transducers
• The simple story
• Add extra symbols to the transitions
• On one tape we read cats, on the other we write
cat +N +PL

62
Transitions
• c:c means read a c on one tape and write a c on
the other
• +N:ε means read a +N symbol on one tape and write
nothing on the other
• +PL:s means read +PL and write an s
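
A toy transducer along these lines can be written as a table whose entries carry an output symbol as well as a next state. The sketch reads the three transition labels above as c:c, +N:ε, and +PL:s (an interpretation of the garbled slide text) and adds a hypothetical +SG:ε arc so the singular also works. It handles only the single stem "cat" and no spelling changes such as city/cities.

```python
EPS = ""  # stands in for epsilon: write nothing

# (state, input symbol) -> (next state, output symbol)
CAT_FST = {
    ("q0", "c"): ("q1", "c"),       # c:c
    ("q1", "a"): ("q2", "a"),       # a:a
    ("q2", "t"): ("q3", "t"),       # t:t
    ("q3", "+N"): ("q4", EPS),      # +N:ε  read the noun tag, write nothing
    ("q4", "+PL"): ("q5", "s"),     # +PL:s read the plural tag, write an s
    ("q4", "+SG"): ("q5", EPS),     # hypothetical +SG:ε arc for the singular
}
ACCEPT = {"q5"}


def transduce(symbols, fst=CAT_FST, start="q0", accept=ACCEPT):
    """Read the lexical tape symbol by symbol and build the surface string.
    Returns None if the input is not accepted."""
    state, output = start, []
    for symbol in symbols:
        step = fst.get((state, symbol))
        if step is None:
            return None
        state, emitted = step
        output.append(emitted)
    return "".join(output) if state in accept else None


print(transduce(["c", "a", "t", "+N", "+PL"]))  # cats
print(transduce(["c", "a", "t", "+N", "+SG"]))  # cat
```

This runs in the generation direction (lexical form in, surface form out); parsing would use the same arcs with the two tapes swapped.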

63
Lexical to Intermediate Level
64
Some on-line demos
• Finite state automata demos
• http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html
• Finite state morphology
• http://www.xrce.xerox.com/competencies/content-analysis/demos/english