Title: LING 388: Language and Computers
1. LING 388: Language and Computers
- Sandiway Fong
- Lecture 5, 9/8
2. Administrivia
- Reminder
- LING 388 Homework 1
- due today
- Email submissions by midnight to
- sandiway_at_email.arizona.edu
- Need help?
- Office hour after this class
3. Today's Topic
- Regular Expressions
- Finite State Automata (FSA) in Prolog
4. Regular Expressions
- Formally equivalent to
- Finite state automata (FSA), and
- Regular grammars
- Practical use
- string matching
- typically for a single word form
- text search: Unix (e)grep, Perl, Microsoft Word
- caution: notation and implementation differ between tools
5. Regular Expressions: Microsoft Word
- Terminology
- wildcard search
6. Regular Expressions: Microsoft Word
7. Regular Expressions: GNU grep
- Terminology
- metacharacter
- character with special meaning, not interpreted literally, e.g. * vs. a
- must be quoted or escaped using the backslash \ to get its literal meaning, e.g. \*
- Excerpts from the manpage
- A list of characters enclosed by [ and ] matches any single character in that list; if the first character of the list is the caret ^ then it matches any character not in the list.
- For example, the regular expression [0123456789] matches any single digit. A range of characters may be specified by giving the first and last characters, separated by a hyphen.
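The bracket-expression and escaping rules above can be tried out with a short sketch. It uses Python's `re` module rather than grep, which is an assumption made for convenience: the constructs shown here behave the same way in both, but other notation differs between tools. The sample string `a*7` is made up for illustration.

```python
import re

# A list of characters in brackets matches any single character in the list.
digit_list = re.compile(r"[0123456789]")
# A range gives the first and last characters, separated by a hyphen.
digit_range = re.compile(r"[0-9]")
# A caret as the first character inside the brackets negates the list.
non_digit = re.compile(r"[^0-9]")
# A backslash gives a metacharacter its literal meaning.
literal_star = re.compile(r"\*")

text = "a*7"  # illustrative input
print(digit_list.findall(text))    # ['7']
print(digit_range.findall(text))   # ['7']
print(non_digit.findall(text))     # ['a', '*']
print(literal_star.findall(text))  # ['*']
```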
8. Regular Expressions: GNU grep
- Excerpts from the manpage
- Finally, certain named classes of characters are predefined.
- Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].
- For example, [[:alnum:]] means [0-9A-Za-z]
- The period . matches any single character.
- The symbol \w is a synonym for [[:alnum:]] and \W is a synonym for [^[:alnum:]].
- The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line.
- The symbols \< and \> respectively match the empty string at the beginning and end of a word.
- The symbol \b matches the empty string at the edge of a word
- Terminology
- word
- unbroken sequence of digits, underscores and letters
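A minimal sketch of the period, the line anchors and the word edge, again using Python's `re` module as a stand-in for grep. Two differences are worth flagging: Python does not support the POSIX class names or the separate start-of-word and end-of-word symbols, and its `\w` also matches the underscore. The sample line is invented for illustration.

```python
import re

line = "cat_1 sat"  # illustrative input

# The period matches any single character.
print(re.findall(r"s.t", line))                       # ['sat']

# ^ and $ match the empty string at the beginning and end of the line.
print(re.search(r"^cat", line) is not None)           # True
print(re.search(r"cat$", line) is not None)           # False

# \b matches the empty string at the edge of a word, where a word is an
# unbroken sequence of digits, underscores and letters.
print(re.findall(r"\b\w+\b", line, flags=re.ASCII))   # ['cat_1', 'sat']

# \W matches a single non-word character (here, the space).
print(re.findall(r"\W", line, flags=re.ASCII))        # [' ']
```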
9. Regular Expressions: GNU grep
- Excerpts from the manpage
- A regular expression may be followed by one of several repetition operators:
- ? The preceding item is optional and matched at most once.
- * The preceding item will be matched zero or more times.
- + The preceding item will be matched one or more times.
- {n} The preceding item is matched exactly n times.
- {n,} The preceding item is matched n or more times.
- {n,m} The preceding item is matched at least n times, but not more than m times.
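The repetition operators can be checked one at a time. The sketch below assumes Python's `re` module, whose `?`, `*`, `+` and brace operators behave like the ones in the excerpt; the helper name `matches` and the test strings are illustrative choices, not part of the lecture.

```python
import re

def matches(pattern, s):
    """Return True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# ?      optional, matched at most once
print(matches(r"colou?r", "color"), matches(r"colou?r", "colour"))  # True True
# *      zero or more times
print(matches(r"ab*", "a"), matches(r"ab*", "abbb"))                # True True
# +      one or more times
print(matches(r"ab+", "a"), matches(r"ab+", "abbb"))                # False True
# {n}    exactly n times
print(matches(r"a{3}", "aaa"), matches(r"a{3}", "aaaa"))            # True False
# {n,}   n or more times
print(matches(r"a{2,}", "a"), matches(r"a{2,}", "aaaaa"))           # False True
# {n,m}  at least n times, but not more than m times
print(matches(r"a{2,3}", "aaa"), matches(r"a{2,3}", "aaaa"))        # True False
```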
10. Regular Expressions: GNU grep
- Excerpts from the manpage
- Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
- Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.
11. Regular Expressions: Examples
- Example
- \b99 matches 99 in "there are 99 bottles" but not in "there are 299 bottles"
- Note: in $99 the $ is not a word character, so there is a word edge before 99 and \b99 will match 99 here
- Example (sheeptalk)
- baa!
- baaa!
- baaaa!
- Regular Expression
- baaa*!
- baa+!
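Both examples on this slide can be run as written. The sketch assumes Python's `re` module, where `\b`, `*` and `+` behave as described for grep; the `$99` sentence and the non-matching string `ba!` are added here for illustration.

```python
import re

# \b99: the 99 must begin at the edge of a word.
print(re.search(r"\b99", "there are 99 bottles") is not None)   # True
print(re.search(r"\b99", "there are 299 bottles") is not None)  # False
# $ is not a word character, so 99 still starts at a word edge.
print(re.search(r"\b99", "it costs $99") is not None)           # True

# Sheeptalk: b, then two or more a's, then !
sheep_plus = re.compile(r"baa+!")
sheep_star = re.compile(r"baaa*!")   # same language, written with *

for s in ["ba!", "baa!", "baaa!", "baaaa!"]:
    print(s, sheep_plus.fullmatch(s) is not None, sheep_star.fullmatch(s) is not None)
# ba! False False
# baa! True True
# baaa! True True
# baaaa! True True
```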
12. Regular Expressions: Examples
- Example
- guppy
- guppies
- Regular Expression
- gupp(y|ies)
- Example
- the (whole word, case insensitive)
- the25
- Regular Expression
- (^|[^a-zA-Z])[tT]he[^a-zA-Z]
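A short sketch of the two patterns, assuming they read `gupp(y|ies)` and `(^|[^a-zA-Z])[tT]he[^a-zA-Z]` and using Python's `re` module in place of grep. The test strings other than guppy, guppies and the25 are made up for illustration.

```python
import re

# Alternation inside parentheses: gupp followed by either y or ies.
guppy = re.compile(r"gupp(y|ies)")
print([w for w in ["guppy", "guppies", "guppys"] if guppy.fullmatch(w)])
# ['guppy', 'guppies']

# "the" as a whole word: preceded by the start of the line or a non-letter,
# and followed by a non-letter.
the = re.compile(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]")
for s in ["The cat", "see the25 now", "other things", "bathe daily"]:
    print(repr(s), the.search(s) is not None)
# 'The cat' True
# 'see the25 now' True
# 'other things' False
# 'bathe daily' False
```

Because this pattern requires a non-letter after "the", it misses a "the" that ends the line; allowing the end of line as an alternative on the right-hand side would cover that case.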
13. Regular Expressions
- Regular Expressions
- shorthand for describing sets of strings
- Examples
- string+ = set of one or more occurrences of string
- a+ = {a, aa, aaa, aaaa, aaaaa, ...}
- (abc)+ = {abc, abcabc, abcabcabc, ...}
- string* = set of zero or more occurrences of string
- a* = {ε, a, aa, aaa, aaaa, ...}
- (abc)* = {ε, abc, abcabc, ...}
- Note
- a+ = a a*
- a* = {ε, a, aa, aaa, aaaa, ...}
- a a* = {a, aa, aaa, aaaa, aaaaa, ...} (a prefixed to each string of a*; aε = a)
- string^n = exactly n occurrences of string
- a^4 b^3 = aaaabbb
- Language = a set of strings
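To make "a regular expression describes a set of strings" concrete, the sketch below enumerates every string up to a length bound and keeps those that match. The function name `language` and the length bounds are illustrative assumptions; the real sets are infinite, so only a finite prefix is listed. Python's `{n}` stands in for the exponent notation.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """List all strings over alphabet, up to max_len long, that match pattern."""
    result = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                result.append(s)
    return result

print(language(r"a+", "a", 4))          # ['a', 'aa', 'aaa', 'aaaa']
print(language(r"a*", "a", 4))          # ['', 'a', 'aa', 'aaa', 'aaaa']
print(language(r"(abc)+", "abc", 6))    # ['abc', 'abcabc']
print(language(r"a{4}b{3}", "ab", 7))   # ['aaaabbb']

# a a* describes the same strings as a+ (checked here up to length 4).
print(language(r"aa*", "a", 4) == language(r"a+", "a", 4))  # True
```

The empty string `''` in the second result plays the role of ε.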
14. Regular Expressions
- Regular Expressions
- formally equivalent to regular grammars/finite state automata
- How to show this?
- Proof by construction
- can't describe all possible languages
- Examples
- {a^n b^n | n > 0} is not regular
- {ww^R | w a string over {a, b}} is not regular
- R = reverse, e.g. (abc)^R = cba
- How to show this?
- Proof by Pumping Lemma
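The sketch below is not a proof, only an illustration of what goes wrong: the regular expression `a+b+` checks the order of the a's and b's but cannot check that their counts are equal, whereas a recognizer for a^n b^n has to compare two unbounded counts. The function name `is_anbn` and the test strings are assumptions made for this example.

```python
import re

def is_anbn(s):
    """True if s is n a's followed by n b's for some n > 0."""
    n = len(s) // 2
    return n > 0 and s == "a" * n + "b" * n

for s in ["ab", "aabb", "aaabbb", "aab", "abab"]:
    regular = re.fullmatch(r"a+b+", s) is not None
    print(s, regular, is_anbn(s))
# ab True True
# aabb True True
# aaabbb True True
# aab True False
# abab False False
```

The string aab is accepted by `a+b+` but is not of the form a^n b^n, which is the kind of overgeneration a finite-state description cannot avoid here.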
15. Finite State Automata (FSA)
- Example
- Language L = a+b+
- one or more a's followed by one or more b's
- regular language
- described by a regular expression
- Note
- infinite set of strings belonging to language L
- e.g. abbb, aaaab, aabb, *abab, *ε
- Notation
- ε is the empty string (or string with zero length)
- * means string is not in the language
16. Finite State Automata (FSA)
[Figure: state diagram of the FSA for L]
L = a+b+
L = aa*bb*
17. Finite State Automata (FSA)
- FSA shown on previous slide
- acceptor - no output
- cf. transducer - input/output pairs
- deterministic - no ambiguity
- i.e. at any given state and input character,
- the next state is uniquely determined
- cf. non-deterministic FSA (NDFSA)
- Note
- NDFSAs are not more powerful than deterministic FSAs
- Proof by construction
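The idea behind the construction can be sketched by simulating a non-deterministic machine with a set of current states: each reachable set then behaves like one state of a deterministic machine (the standard subset construction, which is presumably the construction meant here). The NDFSA below is a hypothetical one for a+b+, invented for this example, in which a and b each have two possible next states.

```python
# Hypothetical NDFSA for a+b+: a (state, symbol) pair maps to a SET of states.
NFA = {
    ("s", "a"): {"s", "x"},   # stay in s, or guess this is the last a
    ("x", "b"): {"x", "y"},   # stay in x, or guess this is the last b
}
START = "s"
FINAL = {"y"}

def accepts(string):
    """Simulate the NDFSA deterministically by tracking all possible states."""
    current = {START}
    for ch in string:
        following = set()
        for state in current:
            following |= NFA.get((state, ch), set())
        current = following
    return bool(current & FINAL)

for s in ["ab", "aabbb", "abab", "ba", ""]:
    print(repr(s), accepts(s))
# 'ab' True
# 'aabbb' True
# 'abab' False
# 'ba' False
# '' False
```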
18. Finite State Automata (FSA)
- Practical Applications
- Encode regular expressions
- Morphological analyzers
- Different word forms, e.g. want, wanted, unwanted (suffixation/prefixation)
- Speech Recognizers
- Markov models = FSA + probabilities
- and many more
- T9 text entry (tegic.com)
- Probably built in to your cellphone
- Predictive text entry for mobile messaging
- Reduces the number of keystrokes for inputting
words on a telephone keypad (8 keys)
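A toy sketch of the keypad idea, not a description of how T9 itself is implemented: each word is reduced to the sequence of keys that spell it, and words sharing a sequence become candidates for the same keystrokes. The keypad layout is the standard one with letters on keys 2 to 9; the word list and function names are assumptions made for illustration.

```python
from collections import defaultdict

# Standard telephone keypad: the 8 keys 2-9 carry the letters.
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}

def key_sequence(word):
    """One keystroke per letter: the digits that spell the word."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())

def build_index(words):
    """Map each key sequence to the words it could stand for."""
    index = defaultdict(list)
    for w in words:
        index[key_sequence(w)].append(w)
    return index

# Toy word list (illustrative only).
index = build_index(["good", "home", "gone", "want", "the"])
print(key_sequence("want"))   # 9268
print(index["4663"])          # ['good', 'home', 'gone']
```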
19. Finite State Automata (FSA)
- More formally, an FSA can be specified as a tuple containing
- set of states: {s, x, y}
- start state: s
- end state: y
- alphabet: {a, b}
- transition function δ
- signature: character × state -> state
- δ(a,s) = x
- δ(a,x) = x
- δ(b,x) = y
- δ(b,y) = y
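The tuple can be written down almost literally as data. The sketch below uses Python with a dictionary for the transition function, keyed by (character, state) to follow the signature on the slide; the constant and function names are illustrative, and treating a missing transition as rejection is an assumption consistent with the machine being an acceptor.

```python
# The five components of the FSA for a+b+.
STATES = {"s", "x", "y"}
START = "s"
END_STATES = {"y"}
ALPHABET = {"a", "b"}
DELTA = {                 # transition function: (character, state) -> state
    ("a", "s"): "x",
    ("a", "x"): "x",
    ("b", "x"): "y",
    ("b", "y"): "y",
}

def accepts(string):
    """Run the FSA over string; accept if it ends in an end state."""
    state = START
    for ch in string:
        if (ch, state) not in DELTA:
            return False          # no transition defined: reject
        state = DELTA[(ch, state)]
    return state in END_STATES

print(accepts("aab"))   # True
print(accepts("aba"))   # False
print(accepts(""))      # False
```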
20. Finite State Automata (FSA)
- In Prolog
- Define one predicate for each state
- taking one argument (the input string L)
- consume input character
- call next state with remaining input string
- Initially
- fsa(L) :- s(L).
- i.e. call start state s
21. Finite State Automata (FSA)
- State s (start state)
- s([a|L]) :- x(L).
- match input string beginning with a and
- call state x with remainder of input
- State x
- x([a|L]) :- x(L).
- x([b|L]) :- y(L).
- State y (end state)
- y([]).
- y([b|L]) :- y(L).
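For readers without a Prolog system at hand, the one-predicate-per-state encoding can be mirrored in Python with one function per state. This is only a sketch of the same idea, not a translation the lecture provides: each function consumes the first character and calls the next state on the remainder, and the Prolog clause it corresponds to is quoted in a comment.

```python
def fsa(chars):
    return s(chars)                 # fsa(L) :- s(L).

def s(chars):
    # s([a|L]) :- x(L).
    return bool(chars) and chars[0] == "a" and x(chars[1:])

def x(chars):
    if not chars:
        return False                # x is not an end state
    if chars[0] == "a":
        return x(chars[1:])         # x([a|L]) :- x(L).
    if chars[0] == "b":
        return y(chars[1:])         # x([b|L]) :- y(L).
    return False

def y(chars):
    if not chars:
        return True                 # y([]).
    return chars[0] == "b" and y(chars[1:])   # y([b|L]) :- y(L).

print(fsa(["a", "a", "b"]))   # True
print(fsa(["a", "b", "a"]))   # False
```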
22. Finite State Automata (FSA)
s([a|L]) :- x(L).
x([a|L]) :- x(L).
x([b|L]) :- y(L).
y([]).
y([b|L]) :- y(L).
- Example
- ?- fsa([a,a,b]).
- ?- s([a,a,b]).
- ?- x([a,b]).
- ?- x([b]).
- ?- y([]). Yes
23. Finite State Automata (FSA)
s([a|L]) :- x(L).
x([a|L]) :- x(L).
x([b|L]) :- y(L).
y([]).
y([b|L]) :- y(L).
- Example
- ?- fsa([a,b,a]).
- ?- s([a,b,a]).
- ?- x([b,a]).
- ?- y([a]). No
24. Finite State Automata (FSA)
- Another possible Prolog encoding strategy
- fsa(S,L) :-
- L = [C|M],
- transition(S,C,T),
- fsa(T,M).
- fsa(y,[]). % End state
- transition(s,a,x). % Encodes transition function δ
- transition(x,a,x).
- transition(x,b,y).
- transition(y,b,y).
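The second strategy keeps the transitions as data and uses a single recursive predicate that carries the current state. A Python sketch of the same shape follows; the optional `trace` argument is an addition for illustration that records each call, so the sequence of states can be inspected, and the names are assumptions rather than part of the lecture.

```python
TRANSITIONS = {           # transition(S, C, T) facts as a lookup table
    ("s", "a"): "x",
    ("x", "a"): "x",
    ("x", "b"): "y",
    ("y", "b"): "y",
}
END_STATE = "y"

def fsa(state, chars, trace=None):
    """Accept if chars can be consumed from state, ending in END_STATE."""
    if trace is not None:
        trace.append((state, "".join(chars)))
    if not chars:
        return state == END_STATE           # fsa(y, []).
    c, rest = chars[0], chars[1:]           # L = [C|M]
    target = TRANSITIONS.get((state, c))    # transition(S, C, T)
    if target is None:
        return False                        # no matching transition
    return fsa(target, rest, trace)         # fsa(T, M)

trace = []
print(fsa("s", list("aab"), trace))   # True
print(trace)   # [('s', 'aab'), ('x', 'ab'), ('x', 'b'), ('y', '')]
print(fsa("s", list("aba")))          # False
```

With this encoding, describing a different machine only means changing the table, not the recognizer.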
25. Finite State Automata (FSA)
- Example
- ?- fsa(s,[a,a,b]).
- ?- transition(s,a,T). T = x
- ?- fsa(x,[a,b]).
- ?- transition(x,a,T). T = x
- ?- fsa(x,[b]).
- ?- transition(x,b,T). T = y
- ?- fsa(y,[]). Yes
fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
fsa(y,[]).
transition(s,a,x).
transition(x,a,x).
transition(x,b,y).
transition(y,b,y).
26. Next Time
- More on FSA including
- NDFSA (non-deterministic FSA)
- Regular Grammars