LING 388: Language and Computers

1
LING 388 Language and Computers
  • Sandiway Fong
  • Lecture 5: 9/8

2
Administrivia
  • Reminder
  • LING 388 Homework 1
  • due today
  • Email submissions by midnight to
  • sandiway_at_email.arizona.edu
  • Need help?
  • Office hour after this class

3
Today's Topic
  • Regular Expressions
  • Finite State Automata (FSA) in Prolog

4
Regular Expressions
  • Formally equivalent to
  • Finite state automata (FSA), and
  • Regular grammars
  • Practical use
  • string matching
  • typically for a single word form
  • search text: Unix (e)grep, Perl, Microsoft Word
  • caution: differences in notation and
    implementation

5
Regular Expressions: Microsoft Word
  • Terminology
  • wildcard search

6
Regular Expressions: Microsoft Word

7
Regular Expressions: GNU grep
  • Terminology
  • metacharacter
  • character with special meaning, not interpreted
    literally, e.g. ^ vs. a
  • must be quoted or escaped using the backslash \
    to get its literal meaning, e.g. \^
  • Excerpts from the manpage
  • A list of characters enclosed by [ and ] matches
    any single character in that list; if the first
    character of the list is the caret ^ then it
    matches any character not in the list.
  • For example, the regular expression
    [0123456789] matches any single digit. A range
    of characters may be specified by giving the
    first and last characters, separated by a
    hyphen.
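
The bracket-list behaviour quoted above can be tried out in a few lines. This is a minimal sketch that uses Python's re module instead of grep (an assumption made for convenience; the two agree on these basic bracket expressions). The sample strings are made up for illustration.

```python
import re

text = "Room 42b, floor 7"  # made-up sample string

# [0123456789] matches any single digit; [0-9] is the same list written as a range.
digits_list = re.findall(r"[0123456789]", text)
digits_range = re.findall(r"[0-9]", text)

# A leading caret negates the list: any single character that is not a digit.
non_digits = re.findall(r"[^0-9]", text)

# A backslash gives a metacharacter its literal meaning.
literal_caret = re.findall(r"\^", "2^10")

print(digits_list, digits_range, "".join(non_digits), literal_caret)
```

Note that the caret is only special as the first character inside the brackets; escaped, or outside that position, it is an ordinary character.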

8
Regular Expressions: GNU grep
  • Excerpts from the manpage
  • Finally, certain named classes of characters are
    predefined.
  • Their names are self explanatory, and they are
    [:alnum:], [:alpha:], [:cntrl:], [:digit:],
    [:graph:], [:lower:], [:print:], [:punct:],
    [:space:], [:upper:], and [:xdigit:].
  • For example, [[:alnum:]] means [0-9A-Za-z]
  • The period . matches any single character.
  • The symbol \w is a synonym for [_[:alnum:]] and \W
    is a synonym for [^_[:alnum:]].
  • The caret ^ and the dollar sign $ are
    metacharacters that respectively match the empty
    string at the beginning and end of a line.
  • The symbols \< and \> respectively match the
    empty string at the beginning and end of a
    word.
  • The symbol \b matches the empty string at the
    edge of a word
  • Terminology
  • word
  • unbroken sequence of digits, underscores and
    letters
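
The anchors and word-boundary symbols can be illustrated with a short sketch. It assumes Python's re module as a stand-in for grep: Python supports ^, $ and \b, but it has neither \< and \> nor named classes such as [[:alnum:]], so the explicit list [0-9A-Za-z] is used instead. The sample strings are made up.

```python
import re

line = "cat concat cats"  # made-up sample line

# ^ and $ match the empty string at the beginning and end of the line.
starts_with_cat = re.search(r"^cat", line) is not None
ends_with_cat = re.search(r"cat$", line) is not None

# \b matches the empty string at either edge of a word.
whole_word_hits = re.findall(r"\bcat\b", line)
word_initial_hits = re.findall(r"\bcat", line)

# Explicit equivalent of [[:alnum:]]: the underscore is not in the list.
alnum_runs = re.findall(r"[0-9A-Za-z]+", "top_10 hits!")

print(starts_with_cat, ends_with_cat, whole_word_hits, word_initial_hits, alnum_runs)
```

Only the first cat counts as a whole word: the one inside concat has no boundary before it, and the one in cats has no boundary after it.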

9
Regular Expressions: GNU grep
  • Excerpts from the manpage
  • A regular expression may be followed by one of
    several repetition operators
  • ? The preceding item is optional and matched
    at most once.
  • * The preceding item will be matched zero or
    more times.
  • + The preceding item will be matched one or
    more times.
  • {n} The preceding item is matched exactly n
    times
  • {n,} The preceding item is matched n or more
    times.
  • {n,m} The preceding item is matched at least n
    times, but not more than m times.
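
Each repetition operator can be checked against a few strings. This is a minimal sketch using Python's re module, which shares this operator syntax with egrep; the helper name matches and the test strings are illustrative assumptions.

```python
import re

def matches(pattern, s):
    """Return True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

optional     = [matches(r"colou?r", s) for s in ("color", "colour", "colouur")]
zero_or_more = [matches(r"ab*", s) for s in ("a", "ab", "abbb")]
one_or_more  = [matches(r"ab+", s) for s in ("a", "ab", "abbb")]
exactly_3    = [matches(r"a{3}", s) for s in ("aa", "aaa", "aaaa")]
at_least_2   = [matches(r"a{2,}", s) for s in ("a", "aa", "aaaaa")]
from_2_to_3  = [matches(r"a{2,3}", s) for s in ("a", "aa", "aaa", "aaaa")]

print(optional, zero_or_more, one_or_more, exactly_3, at_least_2, from_2_to_3)
```

Whole-string matching is used here so that each operator's bounds are visible; grep itself reports a line as soon as any substring matches.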

10
Regular Expressions: GNU grep
  • Excerpts from the manpage
  • Two regular expressions may be concatenated;
    the resulting regular expression matches any
    string formed by concatenating two substrings
    that respectively match the concatenated
    subexpressions.
  • Two regular expressions may be joined by
    the infix operator |; the resulting regular
    expression matches any string matching either
    subexpression.

11
Regular Expressions: Examples
  • Example
  • \b99 matches 99 in "there are 99 bottles" but
    not in "there are 299 bottles"
  • Note: in $99 the $ is not a word character, so a word
    boundary precedes 99 and \b99 will match 99 here
  • Example (sheeptalk)
  • baa!
  • baaa!
  • baaaa!
  • Regular Expression
  • baaa*!
  • baa+!
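
Both examples on this slide can be checked mechanically. The sketch below assumes Python's re module and takes the two sheeptalk patterns to be baaa*! and baa+!. The string $99 is added as an extra test case, and ba! is included as a string that should be rejected.

```python
import re

# \b99 needs a word boundary immediately before 99.
hit_99 = re.search(r"\b99", "there are 99 bottles") is not None
hit_299 = re.search(r"\b99", "there are 299 bottles") is not None
hit_dollar = re.search(r"\b99", "$99") is not None

# Sheeptalk: b, then at least two a's, then !
sheep = ["ba!", "baa!", "baaa!", "baaaa!"]
star_form = [s for s in sheep if re.fullmatch(r"baaa*!", s)]
plus_form = [s for s in sheep if re.fullmatch(r"baa+!", s)]

print(hit_99, hit_299, hit_dollar, star_form, plus_form)
```

The two sheeptalk patterns accept the same strings, since aa* and a+ describe the same set.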

12
Regular Expressions: Examples
  • Example
  • guppy
  • guppies
  • Regular Expression
  • gupp(y|ies)
  • Example
  • the (whole word, case insensitive)
  • the25
  • Regular Expression
  • (^|[^a-zA-Z])[tT]he[^a-zA-Z]
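
A quick check of both patterns, assuming they read gupp(y|ies) and (^|[^a-zA-Z])[tT]he[^a-zA-Z] and using Python's re module as a stand-in for grep. The test strings are made up for illustration.

```python
import re

guppy = re.compile(r"gupp(y|ies)")
the = re.compile(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]")

# Alternation: either ending is accepted, nothing else.
guppy_hits = [w for w in ("guppy", "guppies", "guppys") if guppy.fullmatch(w)]

# "the" must not be flanked by letters; digits such as 25 are allowed after it.
samples = ["The cat", "in the25 box", "other things", "bathe now", "the"]
the_hits = [s for s in samples if the.search(s)]

print(guppy_hits, the_hits)
```

The last sample, a bare "the", is not matched: the pattern requires one non-letter character after the word, so it misses "the" at the very end of a line.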

13
Regular Expressions
  • Regular Expressions
  • shorthand for describing sets of strings
  • Examples
  • string+ - set of one or more occurrences of
    string
  • a+ = {a, aa, aaa, aaaa, aaaaa, ...}
  • (abc)+ = {abc, abcabc, abcabcabc, ...}
  • string* - set of zero or more occurrences of
    string
  • a* = {ε, a, aa, aaa, aaaa, ...}
  • (abc)* = {ε, abc, abcabc, ...}
  • Note
  • a+ = a a*
  • a* = {ε, a, aa, aaa, aaaa, ...}
  • a a* = {a, aa, aaa, aaaa, aaaaa, ...}
  • stringⁿ - exactly n occurrences of string
  • a⁴b³ = aaaabbb
  • Language = a set of strings
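
Because a regular expression stands for a set of strings, the first few members of each set can be enumerated. This is a minimal sketch; the generator names plus and star are hypothetical, and the empty Python string "" plays the role of the empty string.

```python
from itertools import count, islice

def plus(s):
    """Yield s, ss, sss, ...: one or more occurrences of s."""
    for n in count(1):
        yield s * n

def star(s):
    """Yield the empty string, then s, ss, ...: zero or more occurrences of s."""
    yield ""
    yield from plus(s)

first_plus = list(islice(plus("a"), 4))
first_star = list(islice(star("abc"), 3))

# a followed by a* yields the same strings as a+.
a_then_star = list(islice(("a" + t for t in star("a")), 4))

# Exactly four a's followed by exactly three b's.
fixed_power = "a" * 4 + "b" * 3

print(first_plus, first_star, a_then_star, fixed_power)
```

Both sets are infinite, which is why the generators are only sampled with islice.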

14
Regular Expressions
  • Regular Expressions
  • formally equivalent to regular grammars/finite
    state automata
  • How to show this?
  • Proof by construction
  • can't describe all possible languages
  • Examples
  • {aⁿbⁿ | n > 0} is not regular
  • {wwᴿ | w ∈ {a,b}*} is not regular
  • R = reverse, e.g. (abc)ᴿ = cba
  • How to show this?
  • Proof by Pumping Lemma
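
The pumping-lemma argument for the first example can be made concrete. The sketch below assumes the usual statement of the lemma: for a regular language there is a length p such that every long enough string splits as xyz with |xy| <= p and y non-empty, and xyyz stays in the language. It tries every such split of a^p b^p for one chosen p. The function names are hypothetical, and checking a single p illustrates the proof without replacing it.

```python
def in_anbn(s):
    """True when s is n a's followed by n b's, for some n > 0."""
    n = len(s) // 2
    return n > 0 and s == "a" * n + "b" * n

def splits_that_fail(p):
    """Return every split (x, y, z) of a^p b^p whose pumped string xyyz leaves the language."""
    s = "a" * p + "b" * p
    failures = []
    for i in range(p + 1):              # x = s[:i]
        for j in range(i + 1, p + 1):   # y = s[i:j] is non-empty and |xy| <= p
            x, y, z = s[:i], s[i:j], s[j:]
            if not in_anbn(x + y * 2 + z):
                failures.append((x, y, z))
    return failures

p = 5
failures = splits_that_fail(p)
total_splits = p * (p + 1) // 2   # number of (i, j) pairs with 0 <= i < j <= p
print(len(failures), total_splits)
```

Every legal split fails because y can only contain a's, so pumping it adds a's without adding b's.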

15
Finite State Automata (FSA)
  • Example
  • Language L = a+b+
  • one or more a's followed by one or more b's
  • regular language
  • described by a regular expression
  • Note
  • infinite set of strings belonging to language L
  • e.g. abbb, aaaab, aabb, *abab, *ε
  • Notation
  • ε is the empty string (or string with zero
    length)
  • * means string is not in the language

16
Finite State Automata (FSA)
L = a+b+
L = aa*bb*

17
Finite State Automata (FSA)
  • FSA shown on previous slide
  • acceptor - no output
  • cf. transducer - input/output pairs
  • deterministic - no ambiguity
  • i.e. at any given state and input character,
    the next state is uniquely determined
  • cf. non-deterministic FSA (NDFSA)
  • Note
  • NDFSAs are not more powerful than FSAs
  • Proof by construction
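
The construction behind this claim is usually the subset construction: each state of the deterministic machine is a set of NDFSA states. The sketch below is one way to write it. The NDFSA for a+b+ used here is a made-up example, not taken from the slides, and the function names are hypothetical.

```python
from itertools import chain

def determinize(start, finals, delta, alphabet):
    """Subset construction: build a deterministic table whose states are frozensets."""
    start_set = frozenset([start])
    table, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in table:
            continue
        table[current] = {}
        for ch in alphabet:
            # Union of every state reachable from any member of current on ch.
            target = frozenset(chain.from_iterable(delta.get((q, ch), ()) for q in current))
            table[current][ch] = target
            todo.append(target)
    final_sets = {S for S in table if S & finals}
    return start_set, final_sets, table

def accepts(dfa, string):
    """Run the deterministic machine over string (characters must be in the alphabet)."""
    state, final_sets, table = dfa
    for ch in string:
        state = table[state][ch]
    return state in final_sets

# Made-up NDFSA for a+b+: on a, state s may stay in s or move to x;
# on b, state x may stay in x or move to the end state y.
NDELTA = {("s", "a"): {"s", "x"}, ("x", "b"): {"x", "y"}}
dfa = determinize("s", {"y"}, NDELTA, "ab")
print(accepts(dfa, "aabb"), accepts(dfa, "abab"), len(dfa[2]))
```

The empty frozenset acts as a dead state: once reached, no input leads out of it.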

18
Finite State Automata (FSA)
  • Practical Applications
  • Encode regular expressions
  • Morphological analyzers
  • Different word forms, e.g. want, wanted, unwanted
    (suffixation/prefixation)
  • Speech Recognizers
  • Markov models = FSA + probabilities
  • and many more
  • T9 text entry (tegic.com)
  • Probably built in to your cellphone
  • Predictive text entry for mobile messaging
  • Reduces the number of keystrokes for inputting
    words on a telephone keypad (8 keys)
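
The keystroke saving can be illustrated with the standard letter layout on keys 2 to 9. This is a simplified dictionary-lookup sketch, not a description of how the T9 product is actually implemented; the word list and function names are made up.

```python
# Standard letter assignment on the eight keys 2-9.
KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
        "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {ch: key for key, letters in KEYS.items() for ch in letters}

def key_sequence(word):
    """One key press per letter, as in predictive entry."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())

def multitap_presses(word):
    """Presses needed when each letter is selected by repeated taps on its key."""
    return sum(KEYS[LETTER_TO_KEY[ch]].index(ch) + 1 for ch in word.lower())

def candidates(keys, lexicon):
    """Words in the lexicon that share the given key sequence."""
    return [w for w in lexicon if key_sequence(w) == keys]

lexicon = ["good", "home", "gone", "hood", "cat"]  # made-up word list
print(key_sequence("good"), multitap_presses("good"), candidates("4663", lexicon))
```

Typing good takes four presses this way instead of eight with multi-tap, at the cost of ambiguity: several words share the sequence 4663, so the user may still have to pick one.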

19
Finite State Automata (FSA)
  • More formally, FSA can be specified as a tuple
    containing
  • set of states {s,x,y}
  • start state s
  • end state y
  • alphabet {a, b}
  • transition function δ
  • signature: character × state -> state
  • δ(a,s) = x
  • δ(a,x) = x
  • δ(b,x) = y
  • δ(b,y) = y
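
The tuple can be written down almost verbatim as data. This is a minimal sketch in Python; the constant names and the accepts function are hypothetical, and a missing table entry is treated as rejection.

```python
STATES = {"s", "x", "y"}
START = "s"
END_STATES = {"y"}
ALPHABET = {"a", "b"}

# Transition function, keyed by (character, state) as in the signature above.
DELTA = {
    ("a", "s"): "x",
    ("a", "x"): "x",
    ("b", "x"): "y",
    ("b", "y"): "y",
}

def accepts(string):
    """Follow DELTA from the start state; accept if the input ends in an end state."""
    state = START
    for ch in string:
        state = DELTA.get((ch, state))
        if state is None:   # no transition defined: reject
            return False
    return state in END_STATES

print(accepts("aab"), accepts("aba"), accepts(""))
```

Keeping the machine as a table means a different FSA needs only different data, not different code.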

20
Finite State Automata (FSA)
  • In Prolog
  • Define one predicate for each state
  • taking one argument (the input string L)
  • consume input character
  • call next state with remaining input string
  • Initially
  • fsa(L) :- s(L).
  • i.e. call start state s

21
Finite State Automata (FSA)
  • State s (start state)
  • s([a|L]) :- x(L).
  • match input string beginning with a and
  • call state x with remainder of input
  • State x
  • x([a|L]) :- x(L).
  • x([b|L]) :- y(L).
  • State y (end state)
  • y([]).
  • y([b|L]) :- y(L).
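
For readers who do not know Prolog, the same strategy of one procedure per state can be mirrored in Python. This is a loose translation for comparison only: each function takes the remaining input as a list, consumes one character, and calls the next state. Prolog's pattern matching on the head of the list is replaced by explicit tests.

```python
def fsa(chars):
    """Entry point: call the start state s."""
    return s(chars)

def s(chars):
    # Start state: the input must begin with a.
    return bool(chars) and chars[0] == "a" and x(chars[1:])

def x(chars):
    if not chars:
        return False          # x is not an end state
    if chars[0] == "a":
        return x(chars[1:])
    if chars[0] == "b":
        return y(chars[1:])
    return False

def y(chars):
    if not chars:
        return True           # end state: accept when the input is used up
    return chars[0] == "b" and y(chars[1:])

print(fsa(list("aab")), fsa(list("aba")))
```

The two calls follow the same sequence of states as the Prolog queries for [a,a,b] and [a,b,a].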

22
Finite State Automata (FSA)
s([a|L]) :- x(L).
x([a|L]) :- x(L).
x([b|L]) :- y(L).
y([]).
y([b|L]) :- y(L).
  • Example
  • ?- fsa([a,a,b]).
  • ?- s([a,a,b]).
  • ?- x([a,b]).
  • ?- x([b]).
  • ?- y([]). Yes

23
Finite State Automata (FSA)
s([a|L]) :- x(L).
x([a|L]) :- x(L).
x([b|L]) :- y(L).
y([]).
y([b|L]) :- y(L).
  • Example
  • ?- fsa([a,b,a]).
  • ?- s([a,b,a]).
  • ?- x([b,a]).
  • ?- y([a]). No

24
Finite State Automata (FSA)
  • Another possible Prolog encoding strategy
  • fsa(S,L) :-
  •     L = [C|M],
  •     transition(S,C,T),
  •     fsa(T,M).
  • fsa(y,[]).            % End state
  • transition(s,a,x).    % Encodes transition
  • transition(x,a,x).    % function δ
  • transition(x,b,y).
  • transition(y,b,y).

25
Finite State Automata (FSA)
  • Example
  • ?- fsa(s,[a,a,b]).
  • ?- transition(s,a,T). T = x
  • ?- fsa(x,[a,b]).
  • ?- transition(x,a,T). T = x
  • ?- fsa(x,[b]).
  • ?- transition(x,b,T). T = y
  • ?- fsa(y,[]). Yes

fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
fsa(y,[]).
transition(s,a,x).
transition(x,a,x).
transition(x,b,y).
transition(y,b,y).

26
Next Time
  • More on FSA including
  • NDFSA: non-deterministic
  • Regular Grammars