LING 138/238 SYMBSYS 138: Intro to Computer Speech and Language Processing

- Dan Jurafsky

Today 9/30 Week 1

- Finite State Automata
- Deterministic Recognition of FSAs
- Non-Determinism (NFSAs)
- Recognition of NFSAs
- Proof that regular expressions = FSAs
- Very brief sketch: Morphology, FSAs, FSTs
- Why finite-state machines are so great.

Three Views

- Three equivalent formal ways to look at what we're up to (thanks to Martin Kay)

Regular Expressions

Finite State Automata

Regular Languages

Finite State Automata

- Terminology: Finite State Automata, Finite State Machines, FSA, Finite Automata
- Regular expressions are one way of specifying the structure of finite-state automata.
- FSAs and their close relatives are at the core of most algorithms for speech and language processing.

Intuition: FSAs as Graphs

- Let's start with the sheep language from the text
- /baa+!/

Sheep FSA

- We can say the following things about this machine
- It has 5 states
- At least b, a, and ! are in its alphabet
- q0 is the start state
- q4 is an accept state
- It has 5 transitions

But note

- There are other machines that correspond to this language
- More on this one later

More Formally: Defining an FSA

- You can specify an FSA by enumerating the following things.
- The set of states Q
- A finite alphabet Σ
- A start state q0
- A set F of accepting/final states, F ⊆ Q
- A transition function δ(q,i) that maps Q × Σ to Q

Yet Another View

- State-transition table
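
As a minimal sketch, the five components of the formal definition can be written down directly in Python for the five-state sheep machine (b, then two or more a's, then !). The names STATES, ALPHABET, START, ACCEPT, DELTA and format_table are illustrative assumptions, not from the slides, and a missing table entry is assumed to mean "no legal transition".

```python
# Sheep-language FSA as the five components of the formal definition.
# All names are illustrative choices, not taken from the slides.
STATES = {0, 1, 2, 3, 4}      # Q
ALPHABET = {"b", "a", "!"}    # Sigma
START = 0                     # q0
ACCEPT = {4}                  # F, a subset of Q

# delta as a state-transition table: (state, symbol) -> next state.
# A missing entry means there is no legal transition.
DELTA = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any number of extra a's
    (3, "!"): 4,
}

def format_table():
    """Render DELTA as a text table, one row per state."""
    symbols = ["b", "a", "!"]
    lines = ["state  " + "  ".join(symbols)]
    for q in sorted(STATES):
        cells = [str(DELTA.get((q, s), "-")) for s in symbols]
        # Mark accepting states with a trailing colon.
        label = f"q{q}:" if q in ACCEPT else f"q{q} "
        lines.append(f"{label:<7}" + "  ".join(cells))
    return "\n".join(lines)

print(format_table())
```

Changing the machine means changing only DELTA, START and ACCEPT; nothing else in the sketch depends on the sheep language.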

Recognition

- Recognition is the process of determining if a string should be accepted by a machine
- Or it's the process of determining if a string is in the language we're defining with the machine
- Or it's the process of determining if a regular expression matches a string

Recognition

- Traditionally (Turing's idea), this process is depicted with a tape.

Recognition

- Start in the start state
- Examine the current input
- Consult the table
- Go to a new state and update the tape pointer.
- Until you run out of tape.

D-Recognize
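
The steps listed for recognition can be sketched as a short table-driven loop. This is a hedged reconstruction rather than the slide's own code: the function name d_recognize, its argument order, and the dict-based table for the sheep machine are assumptions made for illustration.

```python
# Transition table for the sheep machine: (state, symbol) -> next state.
# Missing entries mean "no legal transition" (an assumption of this sketch).
SHEEP_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

def d_recognize(tape, table, start, accept):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start                      # start in the start state
    for symbol in tape:                # examine the current input
        nxt = table.get((state, symbol))   # consult the table
        if nxt is None:                # no transition: reject
            return False
        state = nxt                    # go to the new state, advance the tape
    return state in accept             # out of tape: accept only in a final state

print(d_recognize("baaa!", SHEEP_TABLE, 0, {4}))
print(d_recognize("ba!", SHEEP_TABLE, 0, {4}))
```

The loop never backtracks, which is what makes it deterministic: each (state, symbol) pair has at most one entry in the table.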

Tracing D-Recognize

Key Points

- Deterministic means that at each point in processing there is always one unique thing to do (no choices).
- D-recognize is a simple table-driven interpreter
- The algorithm is universal for all unambiguous languages.
- To change the machine, you change the table.

Key Points

- Crudely, therefore, matching strings with regular expressions (à la Perl) is a matter of
- translating the expression into a machine (table) and
- passing the table to an interpreter

Recognition as Search

- You can view this algorithm as state-space search.
- States are pairings of tape positions and state numbers.
- Operators are compiled into the table
- Goal state is a pairing with the end of tape position and a final accept state

Generative Formalisms

- Formal Languages are sets of strings composed of symbols from a finite set of symbols.
- Finite-state automata define formal languages (without having to enumerate all the strings in the language)
- The term Generative is based on the view that you can run the machine as a generator to get strings from the language.

Generative Formalisms

- FSAs can be viewed from two perspectives
- Acceptors that can tell you if a string is in the language
- Generators to produce all and only the strings in the language
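
To make the generator view concrete, here is a small sketch that runs the sheep machine "forwards" and enumerates its strings, shortest first. The length bound max_len is an assumption added so the enumeration terminates (the language itself is infinite), and the names generate and SHEEP_TABLE are illustrative.

```python
from collections import deque

# Sheep machine: (state, symbol) -> next state.
SHEEP_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

def generate(table, start, accept, max_len):
    """Yield every accepted string of length <= max_len, shortest first."""
    queue = deque([(start, "")])       # breadth-first over (state, prefix)
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            yield prefix
        if len(prefix) == max_len:     # do not extend past the bound
            continue
        for (src, symbol), dst in sorted(table.items()):
            if src == state:
                queue.append((dst, prefix + symbol))

print(list(generate(SHEEP_TABLE, 0, {4}, 6)))
# prints ['baa!', 'baaa!', 'baaaa!']
```

The same table serves both roles: an acceptor follows the transitions dictated by the input, while a generator follows every transition and records the symbols it passes.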

Dollars and Cents

Summary

- Regular expressions are just a compact textual representation of FSAs
- Recognition is the process of determining if a string/input is in the language defined by some machine.
- Recognition is straightforward with deterministic machines.

Three Views

- Three equivalent formal ways to look at what we're up to (thanks to Martin Kay)

Regular Expressions

Finite State Automata

Regular Languages

Regular Languages

- More on these in a couple of weeks
- S → b a a A
- A → a A
- A → !
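
As an illustrative sketch, the three rules can be stored as data and expanded mechanically, always rewriting the leftmost nonterminal. The names GRAMMAR and derive, and the cap on the number of rule applications, are assumptions of this example and not part of the slides.

```python
# The regular grammar from the slide: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": [["b", "a", "a", "A"]],
    "A": [["a", "A"], ["!"]],
}

def derive(max_steps):
    """Return terminal strings derivable from S in at most max_steps rule applications."""
    results = []
    frontier = [(["S"], 0)]            # (sentential form, rules applied so far)
    while frontier:
        form, steps = frontier.pop(0)
        # Position of the leftmost nonterminal, or None if the form is all terminals.
        idx = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
        if idx is None:
            results.append("".join(form))
            continue
        if steps == max_steps:         # out of budget for this derivation
            continue
        for rhs in GRAMMAR[form[idx]]:
            frontier.append((form[:idx] + rhs + form[idx + 1:], steps + 1))
    return results

print(derive(4))
# prints ['baa!', 'baaa!', 'baaaa!']
```

Each rule has at most one nonterminal, at the right edge, which is why the grammar describes the same strings as the sheep machine.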

Non-Determinism

Non-Determinism cont.

- Yet another technique
- Epsilon transitions
- Key point: these transitions do not examine or advance the tape during recognition


Equivalence

- Non-deterministic machines can be converted to deterministic ones with a fairly simple construction
- That means that they have the same power: non-deterministic machines are not more powerful than deterministic ones
- It also means that one way to do recognition with a non-deterministic machine is to turn it into a deterministic one.
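
One common way to do the conversion is the subset construction, in which each deterministic state is a set of non-deterministic states. The sketch below assumes that technique (the slides do not show which construction they mean). It also assumes a dict-of-sets encoding with "" standing for an epsilon transition; the function names are illustrative.

```python
def epsilon_closure(states, nfa):
    """All states reachable from `states` using only epsilon ("") transitions."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in nfa.get((q, ""), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def determinize(nfa, start, accept, alphabet):
    """Subset construction: return (dfa_table, dfa_start, dfa_accept)."""
    start_set = epsilon_closure({start}, nfa)
    dfa, todo, seen = {}, [start_set], {start_set}
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            moved = set()
            for q in current:
                moved |= set(nfa.get((q, symbol), ()))
            if not moved:              # no transition on this symbol
                continue
            target = epsilon_closure(moved, nfa)
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A subset is accepting if it contains any accepting NFA state.
    dfa_accept = {s for s in seen if s & accept}
    return dfa, start_set, dfa_accept

def run_dfa(tape, dfa, start, accept):
    """Deterministic recognition over the constructed table."""
    state = start
    for symbol in tape:
        state = dfa.get((state, symbol))
        if state is None:
            return False
    return state in accept

# A non-deterministic sheep machine: state 2 has two choices on "a".
SHEEP_NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
dfa, dfa_start, dfa_accept = determinize(SHEEP_NFA, 0, {4}, "ba!")
print(run_dfa("baaa!", dfa, dfa_start, dfa_accept))
```

In the worst case the number of subsets grows exponentially with the number of non-deterministic states, although for this machine the result has five states, the same as the original.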

Non-Deterministic Recognition

- In a ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine.
- But not all paths directed through the machine for an accept string lead to an accept state.
- No paths through the machine lead to an accept state for a string not in the language.

Non-Deterministic Recognition

- So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept.
- Failure occurs when none of the possible paths lead to an accept state.

Example

[Figure: input tape b a a a ! with the machine's states q0, q1, q2, q3, q4]


Key Points

- States in the search space are pairings of tape positions and states in the machine.
- By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.

ND-Recognize Code
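
The slide's code is not preserved here, so what follows is a hedged sketch of an agenda-based recognizer in the same spirit, not a copy of the original. Search states are (machine state, tape position) pairs. The visited set, the "" encoding for epsilon, and all names are assumptions of this sketch; the visited set is one way to keep epsilon cycles from causing endless search.

```python
def nd_recognize(tape, nfa, start, accept):
    """Return True if some path through the machine accepts the whole tape."""
    agenda = [(start, 0)]              # unexplored search states
    visited = set(agenda)              # search states already put on the agenda
    while agenda:
        state, pos = agenda.pop()      # stack discipline: depth-first search
        if pos == len(tape) and state in accept:
            return True
        # Epsilon moves change the machine state but not the tape position.
        successors = [(r, pos) for r in nfa.get((state, ""), ())]
        if pos < len(tape):
            successors += [(r, pos + 1) for r in nfa.get((state, tape[pos]), ())]
        for node in successors:
            if node not in visited:
                visited.add(node)
                agenda.append(node)
    return False                       # every path has been tried and failed

# Non-deterministic sheep machine: two choices from state 2 on "a".
SHEEP_NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
print(nd_recognize("baaa!", SHEEP_NFA, 0, {4}))

# A machine with an epsilon cycle between states 0 and 1 still terminates.
CYCLE_NFA = {(0, ""): {1}, (1, ""): {0}, (1, "a"): {2}}
print(nd_recognize("b", CYCLE_NFA, 0, {2}))
```

Popping from the front of the agenda instead of the back would turn the same code into breadth-first search.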

Infinite Search

- If you're not careful such searches can go into an infinite loop.
- How?

Why Bother?

- Non-determinism doesn't get us more formal power and it causes headaches, so why bother?
- More natural solutions
- Deterministic machines produced by the construction can be too big

Regular languages

- The class of languages characterizable by regular expressions
- Given alphabet Σ, the regular languages over Σ are:
- The empty set ∅ is a regular language
- ∀a ∈ Σ ∪ ε, {a} is a regular language
- If L1 and L2 are regular languages, then so are
- L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 and L2
- L1 ∪ L2, the union of L1 and L2
- L1*, the Kleene closure of L1

Going from regexp to FSA

- Since all regular lgs meet above properties
- And reg lgs are the lgs characterizable by regular expressions
- All regular expression operators can be implemented by combinations of concatenation, union (disjunction), and closure
- Counters (*, +) are repetition plus closure
- Anchors are individual symbols
- [] and () and . are kinds of disjunction

Going from regexp to FSA

- So if we could just show how to turn closure/union/concat from regexps to FSAs, this would give an idea of how FSA compilation works.
- The actual proof that reg lgs = FSAs has 2 parts
- An FSA can be built for each regular lg
- A regular lg can be built for each automaton
- So I'll give the intuition of the first part
- Take any regular expression and build an automaton
- Intuition: induction
- Base case: build an automaton for single symbol (say a)
- Inductive step: Show how to imitate the 3 regexp operations in automata

Union

- Accept a string in either of two languages

Concatenation

- Accept a string consisting of a string from language L1 followed by a string from language L2.
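
The base case and the three inductive steps can be sketched with epsilon transitions gluing smaller machines together, in the style usually called Thompson's construction. That choice of construction is an assumption, since the slides' diagrams are not preserved. A machine here is a (start, accept, transitions) triple with "" standing for epsilon, and every name is illustrative. Each call to symbol() makes fresh states, so a sub-machine should not be reused in two places.

```python
import itertools

_ids = itertools.count()               # supplies fresh state numbers

def _merge(*tables):
    """Combine transition tables without sharing any sets between them."""
    merged = {}
    for table in tables:
        for key, targets in table.items():
            merged.setdefault(key, set()).update(targets)
    return merged

def symbol(a):
    """Base case: a two-state machine accepting exactly the string a."""
    s, f = next(_ids), next(_ids)
    return s, f, {(s, a): {f}}

def concat(m1, m2):
    """Accept a string from m1 followed by a string from m2."""
    s1, f1, t1 = m1
    s2, f2, t2 = m2
    return s1, f2, _merge(t1, t2, {(f1, ""): {s2}})

def union(m1, m2):
    """Accept a string in either language."""
    s1, f1, t1 = m1
    s2, f2, t2 = m2
    s, f = next(_ids), next(_ids)
    glue = {(s, ""): {s1, s2}, (f1, ""): {f}, (f2, ""): {f}}
    return s, f, _merge(t1, t2, glue)

def closure(m):
    """Kleene closure: zero or more repetitions of m."""
    s1, f1, t1 = m
    s, f = next(_ids), next(_ids)
    glue = {(s, ""): {s1, f}, (f1, ""): {s1, f}}
    return s, f, _merge(t1, glue)

def accepts(machine, text):
    """Simulate the machine by tracking the set of states it could be in."""
    start, accept, table = machine

    def eclose(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for r in table.get((q, ""), ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    current = eclose({start})
    for ch in text:
        moved = set()
        for q in current:
            moved |= table.get((q, ch), set())
        current = eclose(moved)
    return accept in current

# b a a a* !  built only from the base case and the three operations.
sheep = concat(symbol("b"), concat(symbol("a"), concat(symbol("a"),
        concat(closure(symbol("a")), symbol("!")))))
print(accepts(sheep, "baaa!"))
```

Because every operation returns a machine of the same shape, the operations compose freely, which mirrors the inductive argument.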

FSAs and Computational Morphology

- An important use of FSAs is for morphology, the study of word parts
- We'll just have time for a quick overview today
- This is the exact topic of LING 239F, being offered this quarter!

English Morphology

- Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes
- We can usefully divide morphemes into two classes
- Stems: The core meaning-bearing units
- Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions

Nouns and Verbs (English)

- Nouns are simple (not really)
- Markers for plural and possessive
- Verbs are only slightly more complex
- Markers appropriate to the tense of the verb

Regulars and Irregulars

- OK, so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules)
- Mouse/mice, goose/geese, ox/oxen
- Go/went, fly/flew
- The terms regular and irregular will be used to refer to words that follow the rules and those that don't.

Regular and Irregular Nouns and Verbs

- Regulars
- Walk, walks, walking, walked, walked
- Table, tables
- Irregulars
- Eat, eats, eating, ate, eaten
- Catch, catches, catching, caught, caught
- Cut, cuts, cutting, cut, cut
- Goose, geese

Compute

- Many paths are possible
- Start with compute
- Computer -> computerize -> computerization
- Computation -> computational
- Computer -> computerize -> computerizable
- Compute -> computee

Why care about morphology?

- Stemming in information retrieval
- Might want to search for "aardvark" and find pages with both "aardvark" and "aardvarks"
- Morphology in machine translation
- Need to know that the Spanish words quiero and quieres are both related to querer 'want'
- Morphology in spell checking
- Need to know that misclam and antiundoggingly are not words despite being made up of word parts

Can't just list all words

- Turkish
- Uygarlastiramadiklarimizdanmissinizcasina
- '(behaving) as if you are among those whom we could not civilize'
- Uygar 'civilized' + las 'become' + tir 'cause' + ama 'not able' + dik 'past' + lar 'plural' + imiz 'p1pl' + dan 'abl' + mis 'past' + siniz '2pl' + casina 'as if'

What we want

- Something to automatically do the following kinds of mappings
- Cats -> cat +N +PL
- Cat -> cat +N +SG
- Cities -> city +N +PL
- Merging -> merge +V +Present-participle
- Caught -> catch +V +past-participle

FSAs and the Lexicon

- This will actually require a kind of FSA we won't be studying: the Finite State Transducer (FST)
- But we'll give a quick overview anyhow
- First we'll capture the morphotactics
- The rules governing the ordering of affixes in a language.
- Then we'll add in the actual words

Simple Rules

Adding the Words

Derivational Rules

Parsing/Generation vs. Recognition

- Recognition is usually not quite what we need.
- Usually if we find some string in the language we need to find the structure in it (parsing)
- Or we have some structure and we want to produce a surface form (production/generation)
- Example
- From "cats" to "cat +N +PL"

Finite State Transducers

- The simple story
- Add another tape
- Add extra symbols to the transitions
- On one tape we read "cats", on the other we write "cat +N +PL"

Transitions

- c:c means read a c on one tape and write a c on the other
- +N:ε means read a +N symbol on one tape and write nothing on the other
- +PL:s means read +PL and write an s
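
A toy transducer for the single stem "cat" shows how paired transitions work in both directions. It is a minimal sketch under stated assumptions: the table layout, the tag spellings +N, +PL and +SG, the "" encoding for writing nothing, and the function names generate and parse are all illustrative, and a real lexicon would be far larger.

```python
# Each entry: (state, lexical symbol) -> (next state, surface output).
# An output of "" means "write nothing" (the epsilon side of the pair).
FST = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "+N"): (4, ""),
    (4, "+PL"): (5, "s"),
    (4, "+SG"): (5, ""),
}
FINAL = {5}

def generate(lexical_symbols):
    """Lexical to surface: read lexical symbols, write the surface string."""
    state, out = 0, []
    for sym in lexical_symbols:
        step = FST.get((state, sym))
        if step is None:
            return None
        state, emitted = step
        out.append(emitted)
    return "".join(out) if state in FINAL else None

def parse(surface):
    """Surface to lexical: search for paths whose outputs spell the surface string."""
    results = []

    def walk(state, pos, lexical):
        if state in FINAL and pos == len(surface):
            # Letters are joined directly; tags get a leading space.
            results.append("".join(s if len(s) == 1 else " " + s for s in lexical))
        for (src, lex), (dst, out) in FST.items():
            if src == state and surface.startswith(out, pos):
                walk(dst, pos + len(out), lexical + [lex])

    walk(0, 0, [])
    return results

print(generate(["c", "a", "t", "+N", "+PL"]))   # prints cats
print(parse("cats"))                            # prints ['cat +N +PL']
```

Parsing needs a search because transitions that write nothing can always be taken; the recursion ends here only because this toy table has no cycles.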

Lexical to Intermediate Level

Some on-line demos

- Finite state automata demos
- http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html
- Finite state morphology
- http://www.xrce.xerox.com/competencies/content-analysis/demos/english
- Some other downloadable FSA tools
- http://www.research.att.com/sw/tools/fsm/