CSA2050: Natural Language Processing

Slides: 38
Provided by: michael307

1
CSA2050 Natural Language Processing
  • Tagging 1
  • Tagging
  • POS and Tagsets
  • Ambiguities
  • NLTK

2
Tagging 1 Lecture
  • Slides based on Mike Rosner's and Marti Hearst's
    notes
  • Diane Litman's version of Steven Bird's notes
  • Additions from NLTK tutorials

3
Tagging
  • Mr. Sherlock Holmes, who was usually very X,
  • What is the part of speech of X ?

4
Tagging
  • Mr. Sherlock Holmes, who was usually very
    late/ADJ in the mornings, save upon those not
    infrequent occasions when he was up all night,
    was Y
  • What is the part of speech of Y ?

5
Tagging
  • Mr. Sherlock Holmes, who was usually very late in
    the mornings, save upon those not infrequent
    occasions when he was up all night, was
    seated/VBN at the breakfast table

6
Tagging Terminology
  • Tagging
  • The process of associating labels with each token
    in a text
  • Tags
  • The labels
  • Tag Set
  • The collection of tags used for a particular task

7
Tagging Example
  • Typically a tagged text is a sequence of
    white-space separated base/tag tokens
  • The/at Pantheon's/np interior/nn ,/, still/rb
    in/in its/pp original/jj form/nn ,/, is/bez
    truly/ql majestic/jj and/cc an/at
    architectural/jj triumph/nn ./. Its/pp rotunda/nn
    forms/vbz a/at perfect/jj circle/nn whose/wp
    diameter/nn is/bez equal/jj to/in the/at
    height/nn from/in the/at floor/nn to/in the/at
    ceiling/nn ./.
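
The word/tag format above is easy to read back into (word, tag) pairs. The following is a minimal sketch in plain Python, not an NLTK call; the function name parse_tagged is hypothetical and the sample is a shortened excerpt of the slide's text.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        if "/" not in token:
            # Untagged token: keep the word and record no tag.
            pairs.append((token, None))
            continue
        # Split on the LAST slash so a word that itself contains '/' stays whole.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sample = "Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn ./."
tagged = parse_tagged(sample)
print(tagged[:3])  # [('Its', 'pp'), ('rotunda', 'nn'), ('forms', 'vbz')]
```

Splitting on the last slash also handles punctuation tokens such as ./. and ,/, correctly.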

8
What does tagging do?
  • Collapses Some Distinctions
  • Lexical identity may be discarded
  • e.g. all personal pronouns tagged with PRP
  • But Introduces Others
  • Ambiguities may be removed
  • e.g. deal tagged with NN or VB
  • e.g. deal tagged with DEAL1 or DEAL2
  • Helps classification and prediction

9
Parts of Speech (POS)
  • A word's POS tells us a lot about the word and
    its neighbors
  • Limits the range of meanings (deal),
    pronunciation (OBject the noun vs obJECT the
    verb) or both (wind)
  • Helps in stemming
  • Limits the range of following words for Speech
    Recognition
  • Can help select nouns from a document for IR
  • Basis for partial parsing (chunked parsing)
  • Parsers can build trees directly on the POS tags
    instead of maintaining a lexicon
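
As a rough illustration of selecting nouns for IR, the sketch below filters a tagged sentence by tag prefix. It assumes Penn Treebank tags, where the noun tags (NN, NNS, NNP, NNPS) all start with "NN"; the function name select_nouns is hypothetical.

```python
def select_nouns(tagged_tokens, noun_prefix="NN"):
    """Return the words whose tag starts with the noun prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(noun_prefix)]


# A Penn Treebank tagged sentence, written out as (word, tag) pairs.
sentence = [
    ("The", "DT"), ("grand", "JJ"), ("jury", "NN"), ("commented", "VBD"),
    ("on", "IN"), ("a", "DT"), ("number", "NN"), ("of", "IN"),
    ("other", "JJ"), ("topics", "NNS"), (".", "."),
]

print(select_nouns(sentence))  # ['jury', 'number', 'topics']
```

A different tagset would need a different prefix or an explicit set of noun tags.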

10
POS and Tagsets
  • The choice of tagset greatly affects the
    difficulty of the problem
  • Need to strike a balance between
  • Getting better information about context (best
    to introduce more distinctions)
  • Making it possible for classifiers to do their
    job (need to minimize distinctions)

11
Common Tagsets
  • Brown corpus: 87 tags
  • Penn Treebank: 45 tags
  • Lancaster UCREL C5 (used to tag the British
    National Corpus, BNC): 61 tags
  • Lancaster C7: 145 tags

12
Brown Corpus
  • The first digital corpus (1961)
  • Francis and Kucera, Brown University
  • Contents: 500 texts, each roughly 2,000 words long
  • From American books, newspapers, magazines
  • Representing genres such as
  • Science fiction, romance fiction, press reportage,
    scientific writing, popular lore

13
Penn Treebank
  • First syntactically annotated corpus
  • 1 million words from Wall Street Journal
  • Part of speech tags and syntax trees

14
Penn Treebank
  • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
    number/NN of/IN other/JJ topics/NNS ./.
  • VB    DT    NN      .
    Book  that  flight  .
  • VBZ   DT    NN      VB     NN      ?
    Does  that  flight  serve  dinner  ?

15
Penn Treebank
16
Penn Treebank Important Tags
17
Penn Treebank Verb Tags
18
Penn Treebank Example
    (S (NP-SBJ-1 (DT The)
                 (NNP Senate))
       (VP (VBZ plans)
           (S (NP-SBJ (-NONE- *-1))
              (VP (TO to)
                  (VP (VB take)
                      (PRT (RP up))
                      (NP (DT the)
                          (NN measure))
                      (ADV-TMP (RB quickly))))))
       (. .))
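
To show how the bracketed notation encodes both the tree and the POS tags, here is a small sketch of a reader for it in plain Python. It is not the Treebank's or NLTK's own reader; the names read_tree and leaves are hypothetical, and the input is the slide's tree written on one line with balanced brackets.

```python
import re


def read_tree(text):
    """Parse a bracketed string into nested lists of the form [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "(", "each node must start with '('"
        pos += 1
        node = [tokens[pos]]          # the node label, e.g. 'NP-SBJ-1'
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())  # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # consume the closing ')'
        return node

    return parse()


def leaves(node):
    """Yield (tag, word) for every node that directly dominates a word."""
    if len(node) == 2 and isinstance(node[1], str):
        yield (node[0], node[1])
    else:
        for child in node[1:]:
            if isinstance(child, list):
                yield from leaves(child)


bracketed = (
    "(S (NP-SBJ-1 (DT The) (NNP Senate)) "
    "(VP (VBZ plans) (S (NP-SBJ (-NONE- *-1)) "
    "(VP (TO to) (VP (VB take) (PRT (RP up)) "
    "(NP (DT the) (NN measure)) (ADV-TMP (RB quickly)))))) (. .))"
)
tree = read_tree(bracketed)

# Drop the empty element (-NONE-) to recover the surface sentence.
words = [word for tag, word in leaves(tree) if tag != "-NONE-"]
print(" ".join(words))
```

The (tag, word) pairs returned by leaves are exactly the POS tagging of the sentence; the remaining labels carry the syntactic structure.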

19
Tagging
  • Typically the set of tags is larger than basic
    parts of speech
  • Tags often contain some morphological information
  • Often referred to as morphosyntactic labels

20
Tagging Ambiguities
  • N      N-V    V-IN  DT  N
    FRUIT  FLIES  LIKE  A   BANANA
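
The candidate tags above multiply out into several possible tag sequences. The sketch below simply enumerates them with itertools.product; it only lists combinations and does not decide which ones are grammatical.

```python
from itertools import product

words = ["FRUIT", "FLIES", "LIKE", "A", "BANANA"]
# Candidate tags per word, as given on the slide.
candidates = [["N"], ["N", "V"], ["V", "IN"], ["DT"], ["N"]]

sequences = list(product(*candidates))
for seq in sequences:
    print(" ".join(f"{word}/{tag}" for word, tag in zip(words, seq)))

print(len(sequences), "candidate sequences")
```

Two ambiguous words with two tags each give 2 x 2 = 4 sequences; a tagger has to pick one of them.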

21
Interpretation 1
    [S [NP [N FRUIT] [N FLIES]]
       [VP [V LIKE]
           [NP [DT A] [N BANANA]]]]
22
Interpretation 2
    [S [NP [N FRUIT]]
       [VP [V FLIES]
           [PP [IN LIKE]
               [NP [DT A] [N BANANA]]]]]
23
Lots of ambiguities
  1. He can can a can.
  2. I can light a fire and you can open a can of
    beans. Now the can is open, and we can eat in the
    light of the fire.

24
Lots of ambiguities
  • In the Brown Corpus
  • 11.5% of word types are ambiguous
  • 40% of word tokens are ambiguous
  • Most words in English are unambiguous.
  • Many of the most common words are ambiguous.
  • Typically ambiguous tags are not equally
    probable.
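
The gap between type ambiguity and token ambiguity can be seen on a toy example. The sketch below uses the sentence "He can can a can" with hand-assigned Penn-style tags (the tags are an assumption made for illustration, and the function name ambiguity_rates is hypothetical); it does not reproduce the Brown Corpus figures.

```python
from collections import defaultdict


def ambiguity_rates(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous word tokens)."""
    tags_for = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_for[word.lower()].add(tag)

    # A word type is ambiguous if it was seen with more than one tag.
    ambiguous = {word for word, tags in tags_for.items() if len(tags) > 1}

    type_rate = len(ambiguous) / len(tags_for)
    token_rate = sum(
        1 for word, _ in tagged_tokens if word.lower() in ambiguous
    ) / len(tagged_tokens)
    return type_rate, token_rate


# Hand-tagged toy sentence; the tags are illustrative assumptions.
toy = [("He", "PRP"), ("can", "MD"), ("can", "VB"), ("a", "DT"), ("can", "NN")]

type_rate, token_rate = ambiguity_rates(toy)
print(f"ambiguous types:  {type_rate:.0%}")
print(f"ambiguous tokens: {token_rate:.0%}")
```

Only one of the three word types is ambiguous, but it accounts for three of the five tokens, which mirrors the pattern that a few frequent words carry most of the ambiguity.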

25
Lots of ambiguities
  • Brown Corpus (table: DeRose, 1988)
  • Unambiguous (1 tag): 35,340 types
  • Ambiguous (2-7 tags): 4,100 types
    2 tags    3,760
    3 tags      264
    4 tags       61
    5 tags       12
    6 tags        2
    7 tags        1
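
A quick arithmetic check of the table, using only the counts listed above: the per-row counts add up to the 4,100 ambiguous types, and most ambiguous types have exactly two tags.

```python
# Number of word types by how many tags they take (counts from the table).
types_by_tag_count = {2: 3760, 3: 264, 4: 61, 5: 12, 6: 2, 7: 1}

total_ambiguous = sum(types_by_tag_count.values())
share_two_tags = types_by_tag_count[2] / total_ambiguous

print(total_ambiguous)          # 4100
print(f"{share_two_tags:.1%}")  # 91.7%
```
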
26
Approaches to Tagging
  1. Rule-Based Tagger: ENGTWOL Tagger (Voutilainen
    1995)
  2. Stochastic Tagger: HMM-based Tagger
  3. Transformation-Based Tagger: Brill Tagger (Brill
    1995)

27
NLTK
  • Natural Language Toolkit (NLTK)
  • http://nltk.sourceforge.net/
  • Please download and install!
  • Runs on Python

28
NLTK Introduction
  • The Natural Language Toolkit (NLTK) provides:
  • Basic classes for representing data relevant to
    natural language processing.
  • Standard interfaces for performing tasks, such as
    tokenization, tagging, and parsing.
  • Standard implementations of each task, which can
    be combined to solve complex problems.
  • Two versions: NLTK and NLTK-Lite

29
NLTK Modules
  • nltk.token: processing individual elements of
    text, such as words or sentences.
  • nltk.probability: modeling frequency
    distributions and probabilistic systems.
  • nltk.tagger: tagging tokens with supplemental
    information, such as parts of speech or WordNet
    sense tags.
  • nltk.parser: high-level interface for parsing
    texts.
  • nltk.chartparser: a chart-based implementation of
    the parser interface.
  • nltk.chunkparser: a regular-expression based
    surface parser.

30
Python for NLP
  • Python is a great language for NLP
  • Simple
  • Easy to debug
      • Exceptions
      • Interpreted language
  • Easy to structure
      • Modules
      • Object oriented programming
  • Powerful string manipulation

31
Python Modules and Packages
  • Python modules package program code and data for
    reuse. (Lutz)
  • Similar to a library in C or a package in Java.
  • Python packages are hierarchical modules (i.e.,
    modules that contain other modules).
  • Three commands for accessing modules:
  • import
  • from...import
  • reload

32
Import Command
  • The import command loads a module
  • Load the regular expression module
  • >>> import re
  • To access the contents of a module, use dotted
    names
  • Use the search function from the re module
    (here text is any string)
  • >>> re.search(r"\w", text)
  • To list the contents of a module, use dir
  • >>> dir(re)
  • ['DOTALL', 'I', 'IGNORECASE', ...]
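
The same steps as a short script, for readers who prefer a file to the interactive prompt. The sample string and variable names are assumptions made for illustration.

```python
import re

text = "Fruit flies like a banana"

# Dotted name: the search function lives inside the re module.
match = re.search(r"\w+", text)
print(match.group())  # Fruit

# dir() lists the names a module defines.
names = dir(re)
print("IGNORECASE" in names)  # True
```
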

33
from...import
  • The from...import command loads individual
    functions and objects from a module
  • Load the search function from the re module
  • >>> from re import search
  • Once an individual function or object is loaded
    with from...import, it can be used directly
  • Use the search function from the re module
    (here text is any string)
  • >>> search(r"\w", text)

34
Import vs. from...import
  • import
      • Keeps module functions separate from user
        functions.
      • Requires the use of dotted names.
      • Works with reload.
  • from...import
      • Puts module functions and user functions
        together.
      • More convenient names.
      • Does not work with reload.
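
The namespace trade-off can be seen in a few lines. This sketch uses the standard math module purely as an example; the user-defined sqrt is a deliberately bad function written to show the name clash.

```python
import math
from math import sqrt

print(sqrt(16))  # 4.0, the function imported from math


def sqrt(x):
    """A user function that happens to reuse the imported name."""
    return "user-defined sqrt"


# from...import put sqrt in the same namespace as user functions,
# so the definition above silently replaced it.
print(sqrt(16))       # user-defined sqrt

# The dotted name still reaches the module's own function.
print(math.sqrt(16))  # 4.0
```
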

35
Reload
  • If you edit a module, you must use the reload
    command before the changes become visible in
    Python
  • >>> import mymodule
  • ...
  • >>> reload(mymodule)
  • In Python 3, reload is no longer built in; use
    importlib.reload(mymodule)
  • The reload command only affects modules that have
    been loaded with import; it does not update
    individual functions and objects loaded with
    from...import.
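
A self-contained sketch of this behaviour, written for Python 3 with importlib.reload. The module name mymodule_demo, the temporary directory, and the helper write_module are all hypothetical; the script writes a tiny module, imports it, edits it, and reloads it.

```python
import importlib
import os
import sys
import tempfile

# Skip writing .pyc files so this quick edit-and-reload demo
# cannot pick up stale cached bytecode.
sys.dont_write_bytecode = True

workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)
module_path = os.path.join(workdir, "mymodule_demo.py")


def write_module(value):
    """(Re)write the demo module so that it defines VALUE."""
    with open(module_path, "w") as handle:
        handle.write(f"VALUE = {value!r}\n")


write_module("first version")
importlib.invalidate_caches()   # make sure the new file is seen

import mymodule_demo            # loaded with import
from mymodule_demo import VALUE # loaded with from...import

write_module("second version, edited")
importlib.reload(mymodule_demo)

print(mymodule_demo.VALUE)  # second version, edited
print(VALUE)                # first version (not updated by reload)
```

The name bound by from...import still points at the old object, which is why the dotted form is the safer choice while a module is being edited.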

37
Next Sessions
  • Rule-Based Tagging
  • Stochastic Tagging
  • Hidden Markov Models (HMMs)
  • N-Grams
  • Read Jurafsky and Martin Chapter 4 (PDF)
  • Install NLTK