CC384 Natural Language Engineering - PowerPoint PPT Presentation

Provided by: courses2

1
CC384 - Natural Language Engineering
  • Regular Expressions

2
The basic tasks in text processing
  • TOKENIZATION: identify tokens in text
  • WORD COUNTING: count words and their frequencies
  • SEARCHING FOR WORDS
  • NORMALIZATION
  • MASSIMO POESIO, massimo poesio, masimo peosio → Massimo Poesio
  • Oct 20, 20th of October, ... → 20/10/2009
  • STEMMING
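The first two tasks above, tokenization and word counting, can be sketched in a few lines of Java. This is only an illustration: the class name `WordCounter`, the method `countWords`, and the decision to treat a token as a maximal run of letters or digits are assumptions, not something the slides prescribe.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCounter {
    // Assumption: a token is a maximal run of letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Returns a map from lower-cased token to its frequency, sorted by token.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Lower-casing is a very crude form of normalization.
            String token = m.group().toLowerCase(Locale.ROOT);
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {cat=2, other=1, saw=1, the=2}
        System.out.println(countWords("The cat saw the other cat."));
    }
}
```

Lower-casing merges "The" and "the" but does nothing for misspellings such as "masimo peosio"; that kind of normalization needs more than a regular expression.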

3
Regular Expressions and Finite State Automata
  • A central language technology
  • Regular expressions: a way to express powerful
    SEARCH PATTERNS that can be implemented
    efficiently
  • Implemented in Perl, Java 1.4, Emacs, search
    engines
  • Finite state automata: the computational model
    underlying regular expressions
  • The regular expression model can be extended to
    specify SUBSTITUTIONS as well, implementable as
    FINITE STATE TRANSDUCERS
  • In fact, Finite State Transducers are powerful
    enough to be usable for PARSING
  • Simpler cases of parsing: tokenization,
    normalization
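As a small illustration of substitution, the sketch below uses capture groups in `java.util.regex` to rewrite dates. The input format (dd/mm/yyyy), the output format, and the names `DateNormalizer` and `normalize` are assumptions made for this example only.

```java
import java.util.regex.Pattern;

class DateNormalizer {
    // Assumption: dates appear as dd/mm/yyyy.
    // Group 1 = day, group 2 = month, group 3 = year.
    private static final Pattern DATE =
        Pattern.compile("(\\d{2})/(\\d{2})/(\\d{4})");

    // Rewrites every dd/mm/yyyy date in the text as yyyy-mm-dd.
    static String normalize(String text) {
        // $1, $2, $3 in the replacement refer to the captured groups.
        return DATE.matcher(text).replaceAll("$3-$2-$1");
    }

    public static void main(String[] args) {
        // prints: Lecture on 2009-10-20.
        System.out.println(normalize("Lecture on 20/10/2009."));
    }
}
```

The search pattern finds the date and the replacement string says how to rewrite it, which is the search-plus-output behaviour that a finite state transducer models.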

4
Searching text for words
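    char[] text; ...
    int leftMargin;

    boolean matchWord(String word) {
        // The word cannot match if it runs past the end of the text.
        if (leftMargin + word.length() > text.length) return false;
        boolean retval = true;
        for (int i = 0; i < word.length() && retval; i++)
            if (word.charAt(i) != text[leftMargin + i]) retval = false;
        return retval;
    }

5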
Searching text for patterns
  • Most common case: searching using Google or
    similar
  • Simpler case: just looking for web pages
    containing a word (accommodation)
  • More complex cases:
  • Different spellings
  • accomodation OR accommodation
  • Centre OR Center Cognitive Science
  • Patterns only occurring in certain contexts
  • But also to validate a string entered by the user
  • E.g., checking whether the string entered is a
    phone number
  • (+44)(0)20-12341234, 02012341234, +44 (0)
    1234-1234
  • But not (44+)020-12341234, 12341234(+020)
  • A regular email address
  • asmith@mactec.com, foo12@foo.edu,
    bob.smith@foo.tv
  • But not asmith, @mactech.com, a@a
  • A post code
  • G1 1AA, EH10 2QQ, SW1 1ZZ
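The post code check can be sketched in Java as follows. The pattern is a deliberately simplified assumption that covers only the shape of the three examples on this slide (one or two letters, one or two digits, a space, a digit, two letters). Real UK post codes have more variants. The names `PostCodeCheck` and `isPostCode` are hypothetical.

```java
import java.util.regex.Pattern;

class PostCodeCheck {
    // Simplified assumption: letters, digits, space, digit, two letters.
    private static final Pattern POST_CODE =
        Pattern.compile("[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}");

    // True only if the WHOLE input has the assumed post code shape.
    static boolean isPostCode(String input) {
        return POST_CODE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPostCode("G1 1AA"));    // true
        System.out.println(isPostCode("EH10 2QQ"));  // true
        System.out.println(isPostCode("G1 AA1"));    // false
    }
}
```

For validation the whole string must match, so the sketch uses `matches()` and not `find()`.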

6
Regular Expressions: a formalism for expressing
search patterns
  • Because matching is a very common problem, over
    the years computer scientists have identified a
    set of patterns that
  • Are very common
  • Can be searched for efficiently
  • The language of REGULAR EXPRESSIONS has been
    developed to characterize these patterns
  • Many programming languages (Perl, Java 1.4, TCL,
    Python, ...), web search tools, and software systems
    (awk, sed, emacs) allow users to use regular
    expressions to specify what they are searching
    for; these REs are then compiled into efficient code
  • You do not need to write the code yourselves!

7
Regular Expressions: the basic case
  • The simplest form of regular expression: a
    SEQUENCE OF SYMBOLS
  • /can/
  • Matches any string which contains "can": can,
    canterbury, scanning
  • Whitespace can be included: /top ten/
  • Also matches "how to stop tension"
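The "contains" behaviour described on this slide corresponds to `Matcher.find()` in Java, which succeeds if the pattern occurs anywhere in the input. A minimal sketch, with the hypothetical names `BasicMatch` and `contains`:

```java
import java.util.regex.Pattern;

class BasicMatch {
    // True if the regular expression matches some substring of the text.
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("can", "canterbury"));               // true
        System.out.println(contains("can", "scanning"));                 // true
        System.out.println(contains("top ten", "how to stop tension"));  // true
        System.out.println(contains("can", "cat"));                      // false
    }
}
```

The third call shows why a plain sequence of symbols can match more than intended: "top ten" is found inside "stop tension".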

8
More complex types of regular expressions
  • Disjunction
  • /centre|center/
  • /accomodation|accommodation/
  • Also
  • /[Cc]entre/
  • /acco(m|mm)odation/
  • Repetitions
  • Any number greater than 0: +
  • /YES+!/
  • Matches YES!, YESS!, YESSS!
  • E.g., any binary number: [01]+
  • 0 or more: *
  • /ab*/
  • Matches a, ab, abb, abbb
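Disjunction and repetition can be tried out directly in Java. The sketch below assumes the standard operators `|`, `[...]`, `+`, and `*`, and uses `Pattern.matches`, which requires the whole input to match. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class BasicOperators {
    // Pattern.matches succeeds only if the WHOLE input matches the expression.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("centre|center", "center"));  // true: disjunction
        System.out.println(wholeMatch("[Cc]entre", "Centre"));      // true: character class
        System.out.println(wholeMatch("YES+!", "YESSS!"));          // true: one or more S
        System.out.println(wholeMatch("[01]+", "0110"));            // true: binary number
        System.out.println(wholeMatch("[01]+", "012"));             // false: 2 is not binary
        System.out.println(wholeMatch("ab*", "abbb"));              // true: zero or more b
        System.out.println(wholeMatch("ab*", "ba"));                // false
    }
}
```

With `find()` in place of `matches()`, `ab*` would succeed on any input containing an "a", since zero repetitions of "b" are allowed.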

9
Software that includes an implementation of REs
  • Pure REs: awk, egrep, lex
  • Extended REs: perl, Java

10
Regular expressions in Java (from 1.4)
  • Standard library: java.util.regex
  • Tutorial (very good): http://java.sun.com/docs/books/tutorial/extra/regex/index.html
  • Alternative:
  • http://www.javaworld.com/javaworld/jw-07-2001/jw-0713-regex-p2.html
  • Main classes
  • PATTERN (compiled form of a RE)
  • Pattern rePattern = Pattern.compile("ab*");
  • MATCHER (analyzes a string using a pattern)
  • Matcher pm = rePattern.matcher(string);
  • pm.find(): find the next substring that matches
  • pm.group(): the substring found by find()
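Putting the two classes together, a typical loop calls `find()` repeatedly and reads each match with `group()`. The sketch below collects all matches into a list. The class name `FindAll`, the method `findAll`, and the example pattern `ab*` are chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAll {
    // Returns every non-overlapping substring of the input that matches the regex.
    static List<String> findAll(String regex, String input) {
        Pattern rePattern = Pattern.compile(regex);   // compiled form of the RE
        Matcher pm = rePattern.matcher(input);        // matcher bound to this input
        List<String> matches = new ArrayList<>();
        while (pm.find()) {                           // advance to the next match
            matches.add(pm.group());                  // the substring just found
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [a, ab, abbb]
        System.out.println(findAll("ab*", "a ab abbb c"));
    }
}
```

Compiling the pattern once and reusing it is the usual choice when the same expression is applied to many strings.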

11
Grep in Java 1.4 (cc384/code/java)
    ...
    import java.util.regex.*;

    public class Grep {
        ...
        // Pattern used to parse lines
        private static Pattern linePattern = Pattern.compile(".*\r?\n");

        // The input pattern that we're looking for
        private static Pattern pattern;

        // Compile the pattern from the command line
        private static void compile(String pat) {
            try {
                pattern = Pattern.compile(pat);
            } catch (PatternSyntaxException x) {
                System.err.println(x.getMessage());
                System.exit(1);
            }
        }

        // Use the linePattern to break the given CharBuffer into lines, applying
        // the input pattern to each line to see if we have a match
        private static void grep(File f, CharBuffer cb) {
            Matcher lm = linePattern.matcher(cb);   // Line matcher
            Matcher pm = null;                      // Pattern matcher
            int lines = 0;
            while (lm.find()) {
                lines++;
                CharSequence cs = lm.group();       // The current line
                if (pm == null)
                    pm = pattern.matcher(cs);
                else
                    pm.reset(cs);
                if (pm.find())
                    System.out.print(f + ":" + lines + ":" + cs);
                if (lm.end() == cb.limit())
                    break;
            }
        }
        ...
    }
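The Grep excerpt depends on file and buffer handling that is not shown on the slide. The following self-contained sketch applies the same two-pattern idea (one pattern to split lines, one to search each line) to text held in memory. The class `MiniGrep` and its `grep` method are hypothetical and are not part of the course code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniGrep {
    // Same line pattern as in the excerpt: any characters up to a line end.
    private static final Pattern LINE = Pattern.compile(".*\r?\n");

    // Returns "lineNumber:line" for every newline-terminated line that
    // contains a match for the regex.
    static List<String> grep(String regex, CharSequence text) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        Matcher lm = LINE.matcher(text);          // line matcher
        int lines = 0;
        while (lm.find()) {
            lines++;
            String line = lm.group();             // current line, with its line end
            if (pattern.matcher(line).find()) {
                // trim() drops the line end (and any surrounding whitespace)
                hits.add(lines + ":" + line.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [1:I can, 3:scanning]
        System.out.println(grep("can", "I can\nno match\nscanning\n"));
    }
}
```

As in the excerpt, a final line that does not end in a newline is not seen by the line pattern, so input should be newline-terminated.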

12
Regular expressions in Perl
  • Example: print lines containing the string "can"
    (a simple version of the grep program)

    while (<STDIN>) { if (/can/) { print $_; } }

13
Even more complex cases and more metacharacters
(Perl- and Java-specific)
  • Other forms of disjunction
  • Range: /textfile0[2-4]/
  • Will match textfile02, textfile03, textfile04
  • Metacharacters (in Perl / Java)
  • \d (any digit): /a\d+z/ matches a0z, a123z, a456z
  • \w (letter, digit, or underscore _)
  • \s (any whitespace)
  • Any character: . (period)
  • /cyclo.*ane/ matches
  • cyclodecane, cyclohexane, cyclones drive me
    insane
  • Zero or one times: ?
  • /accomm?odation/ matches accomodation and
    accommodation
  • Negation: [^abc]
  • /textfile0[^268]/ matches textfile01,
    textfile03, ...
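The metacharacters on this slide can be checked with a short Java program. The patterns below are written with the operators (`[2-4]`, `\d+`, `.*`, `?`, `[^268]`) that the examples appear to intend, which is an assumption about the original slide. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class Metacharacters {
    // True if the whole input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches somewhere inside the input.
    static boolean contains(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("textfile0[2-4]", "textfile03"));          // true: range
        System.out.println(wholeMatch("textfile0[2-4]", "textfile05"));          // false
        System.out.println(wholeMatch("a\\d+z", "a123z"));                       // true: digits
        System.out.println(contains("cyclo.*ane", "cyclones drive me insane"));  // true
        System.out.println(wholeMatch("accomm?odation", "accomodation"));        // true: optional m
        System.out.println(wholeMatch("accomm?odation", "accommodation"));       // true
        System.out.println(wholeMatch("textfile0[^268]", "textfile01"));         // true: negation
        System.out.println(wholeMatch("textfile0[^268]", "textfile02"));         // false
    }
}
```

In Java source each backslash in a regular expression is doubled, so `\d` is written `"\\d"`.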

14
Applications of more complex REs
  • Web pages about Centres and Centers
  • /[Cc]entre|[Cc]enter/
  • Regular expression to validate phone numbers
  • (+44)(0)20-12341234, 02012341234, +44 (0)
    1234-1234
  • But not (44+)020-12341234, 12341234(+020)
  • ^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
  • Validating email addresses
  • asmith@mactec.com, foo12@foo.edu,
    bob.smith@foo.tv
  • But not asmith, @mactech.com, a@a
  • ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
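The two validation patterns on this slide can be tried in Java roughly as follows. The expressions are one plausible reading of the slide's patterns, with the operators and the "+" and "@" signs that appear to have been lost restored. The class `InputValidator` and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

class InputValidator {
    // Assumed reconstruction of the slide's phone number pattern.
    private static final Pattern PHONE =
        Pattern.compile("^(\\(?\\+?[0-9]*\\)?)?[0-9_\\- \\(\\)]*$");

    // Assumed reconstruction of the slide's email pattern.
    private static final Pattern EMAIL = Pattern.compile(
        "^([a-zA-Z0-9_\\-\\.]+)@"
        + "((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))"
        + "([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)$");

    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();   // whole input must match
    }

    static boolean isEmail(String s) {
        return EMAIL.matcher(s).matches();   // whole input must match
    }

    public static void main(String[] args) {
        System.out.println(isPhone("(+44)(0)20-12341234"));  // true
        System.out.println(isPhone("+44 (0) 1234-1234"));    // true
        System.out.println(isPhone("12341234(+020)"));       // false
        System.out.println(isEmail("bob.smith@foo.tv"));     // true
        System.out.println(isEmail("a@a"));                  // false
    }
}
```

The phone pattern is permissive: every part of it is optional, so it also accepts the empty string. A stricter validator would need an extra length check.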

15
Notational Variants
  • Different programming languages tend to use
    different notations for expressing REs.
  • In FSA,
  • Sequence: [d,o,g]
  • Disjunction: {[c,a,t],[d,o,g]} (instead of
    cat|dog)
  • Range: a..z (instead of [a-z])
  • Any symbol whatsoever: ? (instead of .)
  • Optional character: E^ (instead of E?)

16
Notational variants: advanced search in Google
  • CAPITALIZATION, etc
  • Google search is not case-sensitive
  • OR search
  • vacation london OR paris
  • NUMRANGE search
  • DVD player 250..350
  • WILDCARD search
  • "Sony Vaio * laptop"
  • For more tips: http://www.google.com/help/refinesearch.html

17
Readings
  • Jurafsky and Martin, chapter 2
  • The regular expressions library
  • http://www.regxlib.com/
  • The Java tutorial at Sun, section on regular
    expressions
  • http://java.sun.com/docs/books/tutorial/extra/regex/index.html
  • The sections of the Perl manual on regular
    expressions (perlre)
  • Jeffrey Friedl, Understanding Regular
    Expressions, The Perl Journal

18
Acknowledgments
  • Some material borrowed from Gosse Bouma