Title: CSC 9010 Natural Language Processing Lecture 2: Regular Expressions, Finite State Automata Paula Matuszek Mary-Angela Papalaskari
1CSC 9010 Natural Language Processing, Lecture 2
Regular Expressions, Finite State Automata
Paula Matuszek, Mary-Angela Papalaskari
- Presentation slides adapted from Jim Martin's course http://www.cs.colorado.edu/martin/csci5832.html
2Regular Expressions and Text Searching
- Everybody does it
- Emacs, vi, Perl, grep, etc.
3Example
- Find me all instances of the word "the" in a text.
- /the/
- /[tT]he/
- /\b[tT]he\b/
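The three attempts can be tried out directly. The following is a minimal sketch using Python's `re` module; the sample sentence is made up for illustration, and the bracketed character class `[tT]` is assumed to be what the slide's second and third patterns intend.

```python
import re

# Hypothetical sample text, chosen to contain "the" inside other words.
text = "The other day the cat sat there; then the dog left."

# Successive refinements of the search for the word "the".
patterns = [r"the", r"[tT]he", r"\b[tT]he\b"]

# Map each pattern to the list of substrings it matches.
matches = {p: re.findall(p, text) for p in patterns}

for p, found in matches.items():
    print(f"/{p}/ matched {len(found)} times: {found}")
```

The first pattern misses the capitalized "The" and also fires inside "other", "there", and "then"; only the word-boundary version isolates the three real occurrences in this sample.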
4Two kinds of Errors
- Matching strings that we should not have matched (there, then, other)
- False positives
- Not matching things that we should have matched (The)
- False negatives
5Two Antagonistic Goals
- Accuracy
- (minimize false positives)
- Coverage
- (minimize false negatives).
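To make the two error types concrete, here is a small sketch that counts false positives and false negatives for each pattern. The sample sentence and the gold standard (whole tokens equal to "the", ignoring case) are assumptions made for illustration, as is the helper name `count_errors`.

```python
import re

text = "The other day the cat sat there; then the dog left."

# Assumed gold standard: start offsets of whole tokens equal to "the" (any case).
gold = {
    m.start()
    for m in re.finditer(r"[A-Za-z]+", text)
    if m.group().lower() == "the"
}

def count_errors(pattern):
    """Return (false positives, false negatives) for a pattern on the sample text."""
    found = {m.start() for m in re.finditer(pattern, text)}
    false_positives = found - gold   # matched, but should not have been
    false_negatives = gold - found   # should have matched, but was missed
    return len(false_positives), len(false_negatives)

for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    print(pattern, count_errors(pattern))
```

On this sample, adding `[tT]` removes the false negative, and adding `\b` removes the false positives.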
6Finite State Automata
- Idealized machines for processing regular expressions
- Example /baa+!/
7Finite State Automata
- Idealized machines for processing regular expressions
- Example /baa+!/
- 5 states
- 5 transitions
- alphabet?
[Figure: FSA diagram with the initial state and accept state labeled]
8More examples
9Another FSA for the same language
10Formally Specifying an FSA
- The set of states Q
- A finite alphabet Σ
- A start state
- A set of accept/final states
- A transition function that maps Q × Σ to Q
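The five components map naturally onto a small data structure. The sketch below is one possible encoding, not the course's own; the class name `FSA`, its field names, and the state names `q0` to `q4` are hypothetical. The example machine is the sheep-talk automaton, taken here to be /baa+!/ (an assumption consistent with the earlier count of 5 states and 5 transitions). The transition function is stored as a partial dictionary, so a missing entry stands for "no move".

```python
from dataclasses import dataclass

@dataclass
class FSA:
    states: set      # Q
    alphabet: set    # the finite alphabet (Sigma)
    start: str       # start state, must be in Q
    accept: set      # accept/final states, a subset of Q
    delta: dict      # transition function: (state, symbol) -> state

    def is_well_formed(self):
        """Check that every component refers only to known states and symbols."""
        return (
            self.start in self.states
            and self.accept <= self.states
            and all(
                src in self.states and sym in self.alphabet and dst in self.states
                for (src, sym), dst in self.delta.items()
            )
        )

# Assumed sheep-talk machine for /baa+!/.
sheep = FSA(
    states={"q0", "q1", "q2", "q3", "q4"},
    alphabet={"b", "a", "!"},
    start="q0",
    accept={"q4"},
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",
        ("q3", "!"): "q4",
    },
)
```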
11Dollars and Cents
12Recognition
- Recognition is the process of determining if a string should be accepted by a machine
- Or it's the process of determining if a string is in the language we're defining with the machine
- Or it's the process of determining if a regular expression matches a string
13Turing's way of Visualizing Recognition
14Recognition
- Begin in the start state
- Examine current input
- Consult the table
- Go to a new state and update the tape pointer.
- When you run out of tape
- if in accepting state, accept input
- else reject input
15D-Recognize
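The slide's original listing was not recovered, so what follows is a minimal sketch of a table-driven deterministic recognizer that follows the steps on the previous slide. The function name `d_recognize`, the dictionary layout of the table, and the sheep-talk table for /baa+!/ are assumptions made for illustration.

```python
def d_recognize(tape, table, start, accept):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start                          # begin in the start state
    for symbol in tape:                    # examine the current input
        nxt = table.get((state, symbol))   # consult the table
        if nxt is None:                    # no entry: the machine has no move
            return False
        state = nxt                        # go to the new state; the loop advances the tape
    return state in accept                 # out of tape: accept only in an accept state

# Assumed transition table for /baa+!/.
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

print(d_recognize("baaa!", SHEEP_TABLE, "q0", {"q4"}))
print(d_recognize("ba!", SHEEP_TABLE, "q0", {"q4"}))
```

Swapping in a different table changes the language recognized without touching the function.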
16Key Points
- Deterministic means that at each point in processing there is always one unique thing to do (no choices).
- D-recognize is a simple table-driven interpreter
- The algorithm is universal for all unambiguous languages.
- To change the machine, you change the table.
17Key Points
- Crudely, therefore, matching strings with regular expressions is a matter of
- translating the expression into a machine (table) and
- passing the table to an interpreter
18Recognition as Search
- You can view this algorithm as a degenerate kind of state-space search.
- States are pairings of tape positions and state numbers.
- Operators are compiled into the table
- Goal state is a pairing with the end of tape position and a final accept state
- It's degenerate because?
19Generative Formalisms
- Formal Languages are sets of strings composed of symbols from a finite set of symbols.
- Finite-state automata define formal languages (without having to enumerate all the strings in the language)
- The term Generative is based on the view that you can run the machine as a generator to get strings from the language.
20Generative Formalisms
- FSAs can be viewed from two perspectives
- Acceptors that can tell you if a string is in the language
- Generators to produce all and only the strings in the language
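The generator view can be illustrated by walking the same transition table outward from the start state and collecting every string that ends in an accept state. The sketch below is one way to do this, breadth-first with a length bound so it terminates on an infinite language; the function name `generate`, the bound `max_len`, and the /baa+!/ table are assumptions.

```python
from collections import deque

def generate(table, start, accept, max_len):
    """List the strings of length <= max_len that the machine accepts, shortest first."""
    results = []
    queue = deque([(start, "")])           # pairs of (machine state, string so far)
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            results.append(prefix)
        if len(prefix) == max_len:         # length bound keeps the walk finite
            continue
        for (src, symbol), dst in sorted(table.items()):
            if src == state:
                queue.append((dst, prefix + symbol))
    return results

# Assumed transition table for /baa+!/.
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

print(generate(SHEEP_TABLE, "q0", {"q4"}, max_len=6))
```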
21Review
- Regular expressions are just a compact textual representation of FSAs
- Recognition is the process of determining if a string/input is in the language defined by some machine.
- Recognition is straightforward with deterministic machines.
22Three Views
- Three equivalent formal ways to look at what we're up to (not including tables)
Regular Expressions
Finite State Automata
Regular Languages
23Defining Languages with Productions
- S → b a a A
- A → a A
- A → !
S → NP VP
NP → PrNoun
NP → Det Noun
Det → a | the
Noun → cat | dog | book
PrNoun → samantha | elmer | fido
VP → IVerb | TVerb NP
IVerb → ran | slept | ate
TVerb → hit | kissed | ate
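Productions can be run as a generator too. The sketch below encodes the second grammar as a dictionary and expands S exhaustively. It reads the space-separated right-hand sides on the slide as alternatives (for example, Det rewrites to either "a" or "the"), which is an assumption about formatting lost in extraction; the names `GRAMMAR` and `expand` are hypothetical.

```python
from itertools import product

# Each nonterminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["PrNoun"], ["Det", "Noun"]],
    "Det": [["a"], ["the"]],
    "Noun": [["cat"], ["dog"], ["book"]],
    "PrNoun": [["samantha"], ["elmer"], ["fido"]],
    "VP": [["IVerb"], ["TVerb", "NP"]],
    "IVerb": [["ran"], ["slept"], ["ate"]],
    "TVerb": [["hit"], ["kissed"], ["ate"]],
}

def expand(symbol):
    """Return every word sequence derivable from symbol (the grammar is finite)."""
    if symbol not in GRAMMAR:              # terminal: a single word
        return [[symbol]]
    derivations = []
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol, in order.
        for parts in product(*(expand(s) for s in rhs)):
            derivations.append([word for part in parts for word in part])
    return derivations

sentences = [" ".join(words) for words in expand("S")]
print(len(sentences), sentences[:3])
```

This works only because the grammar has no recursion; the first grammar (with A → a A) would need a length bound.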
24Non-Determinism
Compare
25Non-Determinism cont.
- Epsilon transitions
- Note: these transitions do not examine or advance the tape during recognition
[Figure: an arc labeled ε]
26Are Non-deterministic FSA more powerful?
- NO
- Non-deterministic machines can be converted to deterministic ones with a fairly simple construction
- One way to do recognition with a non-deterministic machine is to turn it into a deterministic one.
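The slides do not spell out the construction; the standard one is the subset construction, where each deterministic state stands for a set of non-deterministic states. Below is a minimal sketch of it for machines without epsilon transitions (a simplifying assumption). The names `determinize` and `dfa_accepts`, and the non-deterministic sheep-talk table, are made up for illustration.

```python
def determinize(nfa, start, accept, alphabet):
    """Subset construction for an NFA given as (state, symbol) -> set of states."""
    start_set = frozenset([start])
    dfa = {}
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            # Union of everything reachable from any member state on this symbol.
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            if not target:
                continue
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A subset accepts if it contains at least one NFA accept state.
    dfa_accept = {s for s in seen if s & accept}
    return dfa, start_set, dfa_accept

def dfa_accepts(tape, dfa, start_set, dfa_accept):
    state = start_set
    for symbol in tape:
        state = dfa.get((state, symbol))
        if state is None:
            return False
    return state in dfa_accept

# Assumed non-deterministic sheep-talk machine: in q2, "a" may stay or move on.
NFA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}
DFA, START, ACCEPT = determinize(NFA, "q0", {"q4"}, ["b", "a", "!"])
print(dfa_accepts("baaa!", DFA, START, ACCEPT))
```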
27Non-Deterministic Recognition
- In an ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine.
- But not all paths directed through the machine for an accept string lead to an accept state.
- No paths through the machine lead to an accept state for a string not in the language.
28Non-Deterministic Recognition
- So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept.
- Failure occurs when none of the possible paths lead to an accept state.
29Example
[Figure: input tape with cells b a a a ! and machine state labels q0 through q4]
38Key Points
- States in the search space are pairings of tape positions and states in the machine.
- By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.
39ND-Recognize Code
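The slide's listing was not recovered. The following is a minimal sketch of an agenda-based recognizer in the spirit of the key points above: search states are (machine state, tape position) pairs, and unexplored ones wait on an agenda. It assumes a machine without epsilon transitions; the function name `nd_recognize` and the non-deterministic sheep-talk table are hypothetical.

```python
def nd_recognize(tape, nfa, start, accept):
    """Depth-first search over (machine state, tape position) pairs."""
    agenda = [(start, 0)]                  # as yet unexplored search states
    while agenda:
        state, pos = agenda.pop()          # LIFO agenda gives depth-first order
        if pos == len(tape):
            if state in accept:
                return True                # found one path that ends in an accept
            continue                       # dead end: try another alternative
        for nxt in nfa.get((state, tape[pos]), ()):
            agenda.append((nxt, pos + 1))
    return False                           # every possible path failed

# Assumed non-deterministic sheep-talk machine: in q2, "a" may stay or move on.
NFA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

print(nd_recognize("baaa!", NFA, "q0", {"q4"}))
```

Popping from the front of the agenda instead of the back would turn the same code into a breadth-first search.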
40Infinite Search
- If you're not careful such searches can go into an infinite loop.
- How?
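One way the loop can arise is a cycle of epsilon transitions: the search keeps revisiting the same (state, tape position) pair without ever advancing the tape. The sketch below guards against this by remembering which search states have already been put on the agenda. The machine, including its epsilon self-loop on q1, is invented purely to trigger the problem; `EPS` and `nd_recognize_eps` are hypothetical names.

```python
EPS = None  # marker for an epsilon transition (assumed encoding)

def nd_recognize_eps(tape, nfa, start, accept):
    """Agenda search with epsilon moves and a visited set to prevent looping."""
    agenda = [(start, 0)]
    visited = {(start, 0)}
    while agenda:
        state, pos = agenda.pop()
        if pos == len(tape) and state in accept:
            return True
        # Epsilon moves change the machine state but not the tape position.
        successors = [(n, pos) for n in nfa.get((state, EPS), ())]
        if pos < len(tape):
            successors += [(n, pos + 1) for n in nfa.get((state, tape[pos]), ())]
        for search_state in successors:
            if search_state not in visited:    # skip anything already scheduled
                visited.add(search_state)
                agenda.append(search_state)
    return False

# Hypothetical machine: sheep talk with an epsilon self-loop on q1
# and an epsilon arc from q3 back to q2.
NFA_EPS = {
    ("q0", "b"): {"q1"},
    ("q1", EPS): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q3"},
    ("q3", EPS): {"q2"},
    ("q3", "!"): {"q4"},
}

print(nd_recognize_eps("baaa!", NFA_EPS, "q0", {"q4"}))
print(nd_recognize_eps("b!", NFA_EPS, "q0", {"q4"}))
```

Without the `visited` check, the second call would keep re-adding ("q1", 1) to the agenda and never return.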
41Why Bother?
- Non-determinism doesn't get us more formal power and it causes headaches, so why bother?
- More natural solutions
- Machines based on construction are too big
42Compositional Machines
- Formal languages are just sets of strings
- Therefore, we can talk about various set operations (intersection, union, concatenation)
- This turns out to be a useful exercise
43Union
- Accept a string in either of two languages
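A simple way to realize union for two deterministic machines is to run them side by side on the same tape and accept if either ends in an accept state. The sketch below does this on the fly rather than building a combined table; it is one possible approach, not necessarily the construction shown on the original slide. The machine dictionaries, the /mo+/ example, and the name `union_accepts` are assumptions.

```python
def step(table, state, symbol):
    """Advance one machine; None means it has already failed."""
    return None if state is None else table.get((state, symbol))

def union_accepts(tape, m1, m2):
    """Accept if the tape is in the language of m1 or the language of m2."""
    s1, s2 = m1["start"], m2["start"]
    for symbol in tape:
        s1 = step(m1["table"], s1, symbol)
        s2 = step(m2["table"], s2, symbol)
    return s1 in m1["accept"] or s2 in m2["accept"]

# Assumed machine for /baa+!/.
SHEEP = {
    "start": "q0",
    "accept": {"q4"},
    "table": {("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q3",
              ("q3", "a"): "q3", ("q3", "!"): "q4"},
}
# Hypothetical machine for /mo+/.
COW = {
    "start": "p0",
    "accept": {"p2"},
    "table": {("p0", "m"): "p1", ("p1", "o"): "p2", ("p2", "o"): "p2"},
}

print(union_accepts("baa!", SHEEP, COW), union_accepts("moo", SHEEP, COW))
```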
44Concatenation
- Accept a string consisting of a string from
language L1 followed by a string from language L2.
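Concatenation is naturally non-deterministic: whenever the first machine is in an accept state, the second machine may start. The sketch below simulates that by tracking a set of active states tagged with the machine they belong to, which behaves like an epsilon arc from the accept states of the first machine to the start state of the second. The names `concat_accepts`, `SHEEP`, and `COW` and the two example machines are assumptions.

```python
def concat_accepts(tape, m1, m2):
    """Accept strings made of a string from m1 followed by a string from m2."""
    def close(states):
        # Epsilon arc: any accept state of m1 also activates the start of m2.
        if any(tag == 1 and s in m1["accept"] for tag, s in states):
            states = states | {(2, m2["start"])}
        return states

    current = close({(1, m1["start"])})
    for symbol in tape:
        following = set()
        for tag, state in current:
            table = m1["table"] if tag == 1 else m2["table"]
            target = table.get((state, symbol))
            if target is not None:
                following.add((tag, target))
        current = close(following)
    return any(tag == 2 and s in m2["accept"] for tag, s in current)

# Assumed machine for /baa+!/ (language L1).
SHEEP = {
    "start": "q0",
    "accept": {"q4"},
    "table": {("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q3",
              ("q3", "a"): "q3", ("q3", "!"): "q4"},
}
# Hypothetical machine for /mo+/ (language L2).
COW = {
    "start": "p0",
    "accept": {"p2"},
    "table": {("p0", "m"): "p1", ("p1", "o"): "p2", ("p2", "o"): "p2"},
}

print(concat_accepts("baa!moo", SHEEP, COW))
```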
45Negation
- Construct a machine M2 to accept all strings not accepted by machine M1 and reject all the strings accepted by M1
- Invert all the accept and not accept states in M1
- Does that work for non-deterministic machines?
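For a deterministic machine, inverting accept and non-accept states works once the transition table is total, so that strings which "fall off" the machine land in an explicit dead state that becomes accepting. The sketch below makes that step explicit; the dead-state name, the function names, and the /baa+!/ table are assumptions, and the tape is assumed to use only symbols from the stated alphabet.

```python
def complete(table, states, alphabet, dead="dead"):
    """Return a total table by routing every missing move to a dead state."""
    full = dict(table)
    for state in list(states) + [dead]:
        for symbol in alphabet:
            full.setdefault((state, symbol), dead)
    return full, set(states) | {dead}

def complement(table, states, alphabet, accept):
    """Complete the machine, then swap accept and non-accept states."""
    full, all_states = complete(table, states, alphabet)
    return full, all_states - set(accept)

def accepts(tape, table, start, accept):
    state = start
    for symbol in tape:            # total table: every lookup succeeds
        state = table[(state, symbol)]
    return state in accept

# Assumed machine M1 for /baa+!/.
M1_TABLE = {("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q3",
            ("q3", "a"): "q3", ("q3", "!"): "q4"}
M1_STATES = ["q0", "q1", "q2", "q3", "q4"]
ALPHABET = ["b", "a", "!"]

M2_TABLE, M2_ACCEPT = complement(M1_TABLE, M1_STATES, ALPHABET, {"q4"})
print(accepts("baa!", M2_TABLE, "q0", M2_ACCEPT), accepts("ba", M2_TABLE, "q0", M2_ACCEPT))
```

The same flip does not give the complement of a non-deterministic machine in general, because a string can have both accepting and non-accepting paths.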
46Intersection
- Accept a string that is in both of two specified languages
- An indirect construction
- A ∧ B = ¬(¬A or ¬B)
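The indirect route can be sketched by composing negation and union: complement both machines, take their union, and complement the result (De Morgan's law). The code below assumes complete deterministic machines over a shared alphabet and uses a product construction for the union; the two example languages (an even number of a's, and strings ending in b) and all names are made up for illustration.

```python
from itertools import product

def complement(m):
    """Flip accept and non-accept states of a complete DFA."""
    return {**m, "accept": m["states"] - m["accept"]}

def union(m1, m2, alphabet):
    """Product construction: accept when either component accepts."""
    states = set(product(m1["states"], m2["states"]))
    table = {
        ((p, q), a): (m1["table"][(p, a)], m2["table"][(q, a)])
        for (p, q) in states
        for a in alphabet
    }
    accept = {(p, q) for (p, q) in states if p in m1["accept"] or q in m2["accept"]}
    return {"states": states, "table": table,
            "start": (m1["start"], m2["start"]), "accept": accept}

def intersection(m1, m2, alphabet):
    """A and B = not (not A or not B)."""
    return complement(union(complement(m1), complement(m2), alphabet))

def accepts(m, tape):
    state = m["start"]
    for symbol in tape:
        state = m["table"][(state, symbol)]
    return state in m["accept"]

# Hypothetical language A: strings over {a, b} with an even number of a's.
EVEN_A = {"states": {"even", "odd"}, "start": "even", "accept": {"even"},
          "table": {("even", "a"): "odd", ("even", "b"): "even",
                    ("odd", "a"): "even", ("odd", "b"): "odd"}}
# Hypothetical language B: strings over {a, b} that end in b.
ENDS_B = {"states": {"no", "yes"}, "start": "no", "accept": {"yes"},
          "table": {("no", "a"): "no", ("no", "b"): "yes",
                    ("yes", "a"): "no", ("yes", "b"): "yes"}}

BOTH = intersection(EVEN_A, ENDS_B, ["a", "b"])
print(accepts(BOTH, "aab"), accepts(BOTH, "ab"))
```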