CSC 9010 Natural Language Processing Lecture 2: Regular Expressions, Finite State Automata Paula Matuszek Mary-Angela Papalaskari - PowerPoint PPT Presentation

1
CSC 9010 Natural Language Processing, Lecture 2
Regular Expressions, Finite State Automata
Paula Matuszek, Mary-Angela Papalaskari
  • Presentation slides adapted from Jim Martin's
    course: http://www.cs.colorado.edu/~martin/csci5832.html

2
Regular Expressions and Text Searching
  • Everybody does it
  • Emacs, vi, Perl, grep, etc.

3
Example
  • Find me all instances of the word "the" in a
    text.
  • /the/
  • /[tT]he/
  • /\b[tT]he\b/
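A minimal Python sketch of the three patterns, assuming the bracketed forms /[tT]he/ and /\b[tT]he\b/ and an invented sample sentence. Python's `re` module stands in here for whichever search tool you prefer.

```python
import re

# Invented sample text containing "the" alone and inside longer words.
text = "The other day, the cat ran there and then the dog slept."

# /the/ : matches the letters anywhere, including inside longer words.
plain = re.findall(r"the", text)

# /[tT]he/ : also catches the capitalized form, still inside longer words.
either_case = re.findall(r"[tT]he", text)

# /\b[tT]he\b/ : word boundaries restrict matches to the whole word.
whole_word = re.findall(r"\b[tT]he\b", text)

print(len(plain), len(either_case), len(whole_word))  # 5 6 3
```

The first pattern misses "The" and matches inside "other", "there", and "then"; only the last pattern returns exactly the three whole-word occurrences.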

4
Two kinds of Errors
  • Matching strings that we should not have matched
    (there, then, other)
  • False positives
  • Not matching things that we should have matched
    (The)
  • False negatives

5
Two Antagonistic Goals
  • Accuracy
  • (minimize false positives)
  • Coverage
  • (minimize false negatives).
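One way to make the two error types concrete is to count them against a hand-labeled answer. The sketch below assumes an invented token list and treats every token equal to "the" (in either case) as the gold standard; `errors` is a hypothetical helper, not part of any library.

```python
import re

tokens = "The other day the cat ran there and then the dog slept".split()

# Assumed gold standard: positions of tokens that are the word "the".
gold = {i for i, tok in enumerate(tokens) if tok.lower() == "the"}

def errors(pattern):
    """Return (false_positives, false_negatives) for a pattern searched in each token."""
    hits = {i for i, tok in enumerate(tokens) if re.search(pattern, tok)}
    return len(hits - gold), len(gold - hits)

print(errors(r"the"))          # (3, 1)
print(errors(r"[tT]he"))       # (3, 0)
print(errors(r"\b[tT]he\b"))   # (0, 0)
```

Widening the pattern removes the false negative without touching the false positives; adding word boundaries then removes the false positives.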

6
Finite State Automata
  • Idealized machines for processing regular
    expressions
  • Example /baa+!/

7
Finite State Automata
  • Idealized machines for processing regular
    expressions
  • Example /baa+!/
  • 5 states
  • 5 transitions
  • alphabet?

[Figure: state diagram with the initial state and the accept state labeled]
8
More examples
9
Another FSA for the same language
10
Formally Specifying a FSA
  • The set of states Q
  • A finite alphabet Σ
  • A start state
  • A set of accept/final states
  • A transition function that maps Q × Σ to Q
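The five components can be written down directly as a data structure. This is a sketch only: the class name `FSA`, the method `is_well_formed`, and the state names q0 to q4 are assumptions for illustration, and the example machine assumes the sheep language is "b, then two or more a's, then !", which is consistent with 5 states and 5 transitions.

```python
from dataclasses import dataclass

@dataclass
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # the finite alphabet
    start: str           # start state, a member of Q
    accept: frozenset    # accept/final states, a subset of Q
    delta: dict          # transition function: (state, symbol) -> state

    def is_well_formed(self):
        """Check that every component refers only to declared states and symbols."""
        return (
            self.start in self.states
            and self.accept <= self.states
            and all(
                q in self.states and a in self.alphabet and r in self.states
                for (q, a), r in self.delta.items()
            )
        )

# Assumed sheep-language machine: b, then two or more a's, then !
sheep = FSA(
    states=frozenset({"q0", "q1", "q2", "q3", "q4"}),
    alphabet=frozenset({"b", "a", "!"}),
    start="q0",
    accept=frozenset({"q4"}),
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",
        ("q3", "!"): "q4",
    },
)

print(sheep.is_well_formed(), len(sheep.states), len(sheep.delta))  # True 5 5
```

The transition function here is partial: pairs missing from `delta` are treated as "no move", which is a common shorthand for sending the machine to a dead state.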

11
Dollars and Cents
12
Recognition
  • Recognition is the process of determining if a
    string should be accepted by a machine
  • Or it's the process of determining if a string
    is in the language we're defining with the
    machine
  • Or it's the process of determining if a regular
    expression matches a string

13
Turing's way of Visualizing Recognition
14
Recognition
  • Begin in the start state
  • Examine current input
  • Consult the table
  • Go to a new state and update the tape pointer.
  • When you run out of tape
  • if in accepting state, accept input
  • else reject input

15
D-Recognize
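A minimal Python sketch of a table-driven deterministic recognizer, following the steps listed on the Recognition slide. It is not the original D-Recognize listing: the table layout (a dict keyed by state and symbol), the state names, and the sheep-language table are assumptions for illustration.

```python
def d_recognize(tape, table, start, accept):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start                       # begin in the start state
    for symbol in tape:                 # examine current input
        nxt = table.get((state, symbol))  # consult the table
        if nxt is None:                 # no entry: the machine is stuck
            return False
        state = nxt                     # go to a new state, advance the tape
    return state in accept              # out of tape: accept only in an accept state

# Assumed table for b, two or more a's, then !
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

for s in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(s, d_recognize(s, SHEEP_TABLE, "q0", {"q4"}))
```

Only the first two strings are accepted; the others either get stuck on a missing table entry or run out of tape in a non-accept state.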
16
Key Points
  • Deterministic means that at each point in
    processing there is always one unique thing to do
    (no choices).
  • D-recognize is a simple table-driven interpreter
  • The algorithm is universal for all unambiguous
    languages.
  • To change the machine, you change the table.

17
Key Points
  • Crudely, therefore, matching strings with regular
    expressions is a matter of
  • translating the expression into a machine (table)
    and
  • passing the table to an interpreter
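To illustrate the split between table and interpreter, the sketch below hand-translates two expressions into tables and passes both to one unchanged interpreter. The function `run`, the tuple layout of a machine, and the second expression /a*b/ are assumptions chosen for the example; no automatic regex compiler is implied.

```python
def run(machine, tape):
    """One interpreter for every table: follow transitions, then check the final state."""
    table, start, accept = machine
    state = start
    for symbol in tape:
        if (state, symbol) not in table:
            return False
        state = table[(state, symbol)]
    return state in accept

# Hand translation of /baa+!/ (assumed sheep language).
SHEEP = (
    {("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q3",
     ("q3", "a"): "q3", ("q3", "!"): "q4"},
    "q0",
    {"q4"},
)

# Hand translation of /a*b/: loop on a, then a single b.
A_STAR_B = (
    {("q0", "a"): "q0", ("q0", "b"): "q1"},
    "q0",
    {"q1"},
)

print(run(SHEEP, "baaa!"), run(A_STAR_B, "aaab"), run(A_STAR_B, "aba"))
# True True False
```

Changing the language means swapping the table; `run` itself never changes.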

18
Recognition as Search
  • You can view this algorithm as a degenerate kind
    of state-space search.
  • States are pairings of tape positions and state
    numbers.
  • Operators are compiled into the table
  • Goal state is a pairing with the end of tape
    position and a final accept state
  • It's degenerate because?

19
Generative Formalisms
  • Formal Languages are sets of strings composed of
    symbols from a finite set of symbols.
  • Finite-state automata define formal languages
    (without having to enumerate all the strings in
    the language)
  • The term Generative is based on the view that you
    can run the machine as a generator to get strings
    from the language.

20
Generative Formalisms
  • FSAs can be viewed from two perspectives
  • Acceptors that can tell you if a string is in the
    language
  • Generators to produce all and only the strings in
    the language
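Both perspectives can be driven from the same table. The sketch below is an assumption-laden illustration: `accepts` and `generate` are hypothetical names, the table is the assumed sheep-language machine, and because the language is infinite the generator is capped at a maximum string length.

```python
from collections import deque

TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}
START, ACCEPT = "q0", {"q4"}

def accepts(tape):
    """Acceptor view: is this string in the language?"""
    state = START
    for symbol in tape:
        state = TABLE.get((state, symbol))
        if state is None:
            return False
    return state in ACCEPT

def generate(max_len):
    """Generator view: yield every string of the language up to max_len, shortest first."""
    queue = deque([(START, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in ACCEPT:
            yield prefix
        if len(prefix) < max_len:
            for (src, symbol), dst in sorted(TABLE.items()):
                if src == state:
                    queue.append((dst, prefix + symbol))

produced = list(generate(6))
print(produced)                        # ['baa!', 'baaa!', 'baaaa!']
print(all(accepts(s) for s in produced))  # True
```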

21
Review
  • Regular expressions are just a compact textual
    representation of FSAs
  • Recognition is the process of determining if a
    string/input is in the language defined by some
    machine.
  • Recognition is straightforward with deterministic
    machines.

22
Three Views
  • Three equivalent formal ways to look at what
    we're up to (not including tables)

Regular Expressions
Finite State Automata
Regular Languages
23
Defining Languages with Productions
  • S → b a a A
  • A → a A
  • A → !

S → NP VP
NP → PrNoun
NP → Det Noun
Det → a | the
Noun → cat | dog | book
PrNoun → samantha | elmer | fido
VP → IVerb | TVerb NP
IVerb → ran | slept | ate
TVerb → hit | kissed | ate
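Productions like these can be run as a generator by repeatedly rewriting the leftmost nonterminal. The sketch below is illustrative only: `expand` is a hypothetical helper, and the dictionaries encode one reading of the rules above in which each listed word is a separate alternative and VP is either IVerb or TVerb NP.

```python
def expand(grammar, symbols, max_len):
    """Yield every terminal sequence derivable from `symbols` by leftmost
    derivation, pruning any form with more than max_len terminals."""
    for i, sym in enumerate(symbols):
        if sym in grammar:  # leftmost nonterminal
            for rhs in grammar[sym]:
                form = symbols[:i] + rhs + symbols[i + 1:]
                if sum(s not in grammar for s in form) <= max_len:
                    yield from expand(grammar, form, max_len)
            return
    yield symbols  # no nonterminals left: a string of the language

SHEEP = {"S": [["b", "a", "a", "A"]], "A": [["a", "A"], ["!"]]}

# Assumed reading of the second grammar (alternatives split per word).
ENGLISH = {
    "S": [["NP", "VP"]],
    "NP": [["PrNoun"], ["Det", "Noun"]],
    "Det": [["a"], ["the"]],
    "Noun": [["cat"], ["dog"], ["book"]],
    "PrNoun": [["samantha"], ["elmer"], ["fido"]],
    "VP": [["IVerb"], ["TVerb", "NP"]],
    "IVerb": [["ran"], ["slept"], ["ate"]],
    "TVerb": [["hit"], ["kissed"], ["ate"]],
}

sheep_strings = sorted(("".join(s) for s in expand(SHEEP, ["S"], 6)), key=len)
sentences = {" ".join(s) for s in expand(ENGLISH, ["S"], 5)}

print(sheep_strings)    # ['baa!', 'baaa!', 'baaaa!']
print(len(sentences))   # 270
```

The first grammar is recursive, so the length cap is what makes the enumeration stop; the second has no recursion and yields a finite set of sentences such as "the cat kissed fido".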
24
Non-Determinism
Compare
25
Non-Determinism cont.
  • Epsilon transitions
  • Note these transitions do not examine or advance
    the tape during recognition


[Figure: machine with a transition labeled ε]
26
Are Non-deterministic FSAs more powerful?
  • NO
  • Non-deterministic machines can be converted to
    deterministic ones with a fairly simple
    construction
  • One way to do recognition with a
    non-deterministic machine is to turn it into a
    deterministic one.
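One standard conversion is the subset construction, in which each state of the deterministic machine is a set of states of the non-deterministic one. The sketch below assumes a machine without epsilon transitions and an invented non-deterministic version of the sheep machine; `determinize` and `accepts` are hypothetical names.

```python
from collections import deque

def determinize(nfa, start, accept):
    """Subset construction for an NFA without epsilon transitions.
    `nfa` maps (state, symbol) to a set of possible next states."""
    symbols = sorted({sym for (_state, sym) in nfa})
    start_set = frozenset({start})
    dfa, seen, queue = {}, {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        for sym in symbols:
            # Union of everywhere any member of the current set can go.
            target = frozenset(r for q in current for r in nfa.get((q, sym), ()))
            if target:
                dfa[(current, sym)] = target
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    dfa_accept = {s for s in seen if s & accept}
    return dfa, start_set, dfa_accept

def accepts(tape, dfa, start, accept):
    state = start
    for sym in tape:
        state = dfa.get((state, sym))
        if state is None:
            return False
    return state in accept

# Assumed NFA: in q2, an "a" may either loop or move on to q3.
NFA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

dfa, dfa_start, dfa_accept = determinize(NFA, "q0", {"q4"})
dfa_states = {dfa_start} | set(dfa.values())
print(len(dfa_states), len(dfa))                      # 5 5
print(accepts("baaa!", dfa, dfa_start, dfa_accept))   # True
```

For this small machine the result has the same number of states as the original, but in general the construction can produce up to 2^n subset states for an n-state machine.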

27
Non-Deterministic Recognition
  • In an ND FSA there exists at least one path
    through the machine for a string that is in the
    language defined by the machine.
  • But not all paths through the machine for an
    accepted string lead to an accept state.
  • No paths through the machine lead to an accept
    state for a string not in the language.

28
Non-Deterministic Recognition
  • So success in a non-deterministic recognition
    occurs when a path is found through the machine
    that ends in an accept state.
  • Failure occurs when none of the possible paths
    lead to an accept state.

29
Example
[Figure: input tape b a a a ! with state labels q0, q1, q2, q3, q4]
30
Example
31
Example
32
Example
33
Example
34
Example
35
Example
36
Example
37
Example
38
Key Points
  • States in the search space are pairings of tape
    positions and states in the machine.
  • By keeping track of as yet unexplored states, a
    recognizer can systematically explore all the
    paths through the machine given an input.

39
ND-Recognize Code
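A minimal Python sketch of an agenda-based non-deterministic recognizer. It is not the original ND-Recognize listing: it assumes a machine without epsilon transitions, a table mapping (state, symbol) to a list of next states, and a last-in-first-out agenda (depth-first search).

```python
def nd_recognize(tape, nfa, start, accept):
    """Search over (machine state, tape position) pairs until one path accepts."""
    agenda = [(start, 0)]                 # as yet unexplored search states
    while agenda:
        state, pos = agenda.pop()         # choose the next search state
        if pos == len(tape) and state in accept:
            return True                   # one accepting path is enough
        if pos < len(tape):
            # Every possible move on the current symbol becomes a new search state.
            for nxt in nfa.get((state, tape[pos]), ()):
                agenda.append((nxt, pos + 1))
    return False                          # all paths explored, none accepted

# Assumed NFA: in q2, an "a" may either loop or move on to q3.
NFA = {
    ("q0", "b"): ["q1"],
    ("q1", "a"): ["q2"],
    ("q2", "a"): ["q2", "q3"],
    ("q3", "!"): ["q4"],
}

for s in ["baa!", "baaa!", "ba!", "baa"]:
    print(s, nd_recognize(s, NFA, "q0", {"q4"}))
```

Swapping `agenda.pop()` for removal from the front of the list would turn the same code into a breadth-first search.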
40
Infinite Search
  • If you're not careful, such searches can go into
    an infinite loop.
  • How?
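One way such a search can loop: a cycle of epsilon transitions lets the recognizer revisit the same (state, tape position) pair forever. The sketch below assumes an invented machine with such a cycle and a hypothetical recognizer `nd_recognize_safe` that remembers visited pairs; the `max_steps` cap exists only so the unguarded run can be demonstrated without hanging.

```python
EPSILON = ""  # assumed marker for transitions that consume no input

def nd_recognize_safe(tape, nfa, start, accept, track_seen=True, max_steps=1000):
    agenda, seen, steps = [(start, 0)], set(), 0
    while agenda:
        steps += 1
        if steps > max_steps:
            raise RuntimeError("search did not terminate within max_steps")
        node = agenda.pop()
        if track_seen:
            if node in seen:          # already explored this search state
                continue
            seen.add(node)
        state, pos = node
        if pos == len(tape) and state in accept:
            return True
        for nxt in nfa.get((state, EPSILON), ()):   # epsilon: tape stays put
            agenda.append((nxt, pos))
        if pos < len(tape):
            for nxt in nfa.get((state, tape[pos]), ()):
                agenda.append((nxt, pos + 1))
    return False

# Invented machine with an epsilon cycle between q0 and q1.
LOOPY = {
    ("q0", EPSILON): ["q1"],
    ("q1", EPSILON): ["q0"],
    ("q1", "a"): ["q2"],
}

print(nd_recognize_safe("a", LOOPY, "q0", {"q2"}))   # True
print(nd_recognize_safe("b", LOOPY, "q0", {"q2"}))   # False

try:
    nd_recognize_safe("b", LOOPY, "q0", {"q2"}, track_seen=False)
    looped = False
except RuntimeError:
    looped = True   # without the visited set, the search keeps cycling
print(looped)       # True
```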

41
Why Bother?
  • Non-determinism doesn't get us more formal power
    and it causes headaches, so why bother?
  • More natural solutions
  • Deterministic machines built by the conversion
    construction can be too big

42
Compositional Machines
  • Formal languages are just sets of strings
  • Therefore, we can talk about various set
    operations (intersection, union, concatenation)
  • This turns out to be a useful exercise

43
Union
  • Accept a string in either of two languages

44
Concatenation
  • Accept a string consisting of a string from
    language L1 followed by a string from language L2.
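Union and concatenation can both be sketched by gluing two machines together with epsilon transitions. Everything named below is an assumption for illustration: the helpers `tag`, `union`, `concat`, and `accepts`, the sheep machine for L1, and an invented "moo" language (m followed by two or more o's) for L2.

```python
EPS = ""  # assumed marker for epsilon transitions

def tag(nfa, start, accept, label):
    """Prefix every state with a label so the two machines cannot share names."""
    table = {((label, q), a): {(label, r) for r in rs} for (q, a), rs in nfa.items()}
    return table, (label, start), {(label, q) for q in accept}

def union(m1, m2):
    (t1, s1, f1), (t2, s2, f2) = tag(*m1, "L1"), tag(*m2, "L2")
    # New start state with epsilon moves into both machines.
    return {**t1, **t2, ("start", EPS): {s1, s2}}, "start", f1 | f2

def concat(m1, m2):
    (t1, s1, f1), (t2, s2, f2) = tag(*m1, "L1"), tag(*m2, "L2")
    table = {**t1, **t2}
    for q in f1:  # epsilon from each accept state of L1 to the start of L2
        table.setdefault((q, EPS), set()).add(s2)
    return table, s1, f2

def accepts(tape, machine):
    table, start, accept = machine
    agenda, seen = [(start, 0)], set()
    while agenda:
        node = agenda.pop()
        if node in seen:
            continue
        seen.add(node)
        state, pos = node
        if pos == len(tape) and state in accept:
            return True
        for nxt in table.get((state, EPS), ()):
            agenda.append((nxt, pos))
        if pos < len(tape):
            for nxt in table.get((state, tape[pos]), ()):
                agenda.append((nxt, pos + 1))
    return False

SHEEP = ({("q0", "b"): {"q1"}, ("q1", "a"): {"q2"}, ("q2", "a"): {"q3"},
          ("q3", "a"): {"q3"}, ("q3", "!"): {"q4"}}, "q0", {"q4"})
MOO = ({("q0", "m"): {"q1"}, ("q1", "o"): {"q2"}, ("q2", "o"): {"q3"},
        ("q3", "o"): {"q3"}}, "q0", {"q3"})

either = union(SHEEP, MOO)
both_in_order = concat(SHEEP, MOO)
print(accepts("baa!", either), accepts("mooo", either), accepts("baa!moo", either))
# True True False
print(accepts("baa!moo", both_in_order), accepts("baa!", both_in_order))
# True False
```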

45
Negation
  • Construct a machine M2 to accept all strings not
    accepted by machine M1 and reject all the strings
    accepted by M1
  • Invert all the accept and non-accept states in M1
  • Does that work for non-deterministic machines?
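For a deterministic machine, the inversion can be sketched as follows. The sketch assumes M1 is deterministic and first makes its table total by adding a sink state for missing entries, so that strings on which M1 gets stuck end up accepted by M2. The names `complement`, `accepts`, and `"sink"` are hypothetical, and input strings are assumed to use only symbols from the alphabet.

```python
def complement(table, states, alphabet, start, accept):
    """Build M2 from a deterministic M1: complete the table, then flip accept states."""
    sink = "sink"  # assumed not to clash with an existing state name
    all_states = set(states) | {sink}
    full = dict(table)
    for q in all_states:
        for a in alphabet:
            full.setdefault((q, a), sink)   # missing moves go to the sink
    return full, all_states, start, all_states - set(accept)

def accepts(tape, table, start, accept):
    state = start
    for sym in tape:
        state = table[(state, sym)]         # total table: every move is defined
    return state in accept

# Assumed deterministic sheep machine M1.
M1_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}
M1_STATES = {"q0", "q1", "q2", "q3", "q4"}
ALPHABET = {"b", "a", "!"}

m2_table, m2_states, m2_start, m2_accept = complement(
    M1_TABLE, M1_STATES, ALPHABET, "q0", {"q4"}
)

for s in ["baa!", "ba!", "", "baa!a"]:
    print(repr(s), accepts(s, m2_table, m2_start, m2_accept))
```

M2 rejects "baa!" and accepts the other three strings, including the empty string and "baa!a", which M1 could only reject by getting stuck.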

46
Intersection
  • Accept a string that is in both of two specified
    languages
  • An indirect construction
  • A ∩ B = ¬(¬A ∪ ¬B), i.e. not (not A or not B)
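Besides the indirect route through negation and union, intersection can also be built directly for deterministic machines by running both in lockstep. The sketch below uses that direct product construction, a swapped-in technique rather than the one on the slide. The function names, the sheep machine, and the second machine (strings with an even number of a's) are assumptions for illustration.

```python
def intersect(t1, s1, f1, t2, s2, f2):
    """Product construction: each state is a pair (state of M1, state of M2)."""
    start = (s1, s2)
    table, accept = {}, set()
    seen, stack = {start}, [start]
    symbols = {a for (_q, a) in t1} & {a for (_q, a) in t2}
    while stack:
        p, q = stack.pop()
        if p in f1 and q in f2:             # accept only if both machines accept
            accept.add((p, q))
        for a in symbols:
            if (p, a) in t1 and (q, a) in t2:
                nxt = (t1[(p, a)], t2[(q, a)])
                table[((p, q), a)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return table, start, accept

def accepts(tape, table, start, accept):
    state = start
    for sym in tape:
        state = table.get((state, sym))
        if state is None:
            return False
    return state in accept

# Assumed M1: sheep language (b, two or more a's, then !).
SHEEP = {("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q3",
         ("q3", "a"): "q3", ("q3", "!"): "q4"}
# Assumed M2: strings over {b, a, !} with an even number of a's.
EVEN_A = {("e", "a"): "o", ("o", "a"): "e", ("e", "b"): "e",
          ("o", "b"): "o", ("e", "!"): "e", ("o", "!"): "o"}

table, start, accept = intersect(SHEEP, "q0", {"q4"}, EVEN_A, "e", {"e"})
print(accept)                                  # {('q4', 'e')}
for s in ["baa!", "baaa!", "baaaa!"]:
    print(s, accepts(s, table, start, accept))  # True, False, True
```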