NLTK - PowerPoint PPT Presentation

Provided by: HarryH5
Learn more at: http://www2.tulane.edu

Transcript and Presenter's Notes

Title: NLTK


1
NLTK Python Day 7
  • LING 681.02
  • Computational Linguistics
  • Harry Howard
  • Tulane University

2
Course organization
  • I have requested that NLTK be installed on the
    computers in this room.

3
NLPP 2 Accessing text corpora and lexical
resources
  • 2.1 Accessing text corpora

4
What's that word?
  • What is a corpus/corpora?
  • "large bodies of linguistic data"

5
Some corpora in NLTK
  • The Project Gutenberg electronic text archive
  • 25k free electronic books at
    http://www.gutenberg.org/
  • Web and chat text
  • The Brown corpus
  • First 1M-word electronic corpus, from 500 sources
  • The Reuters corpus
  • The Inaugural Address corpus
  • Annotated text corpora
  • Corpora in other languages

6
Using corpora in NLTK
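  • Only the texts in nltk.book are preloaded as
    nltk.Text objects, so only they can be used
    directly with Text methods such as collocations().
  • To wrap the words of another corpus in a Text, use
  • your_text_name = nltk.Text(corpus_words)
  • where corpus_words is the word list of that corpus,
    e.g. the result of words().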

7
Basic corpus functions (Table 2.3)
Example                      Description
fileids()                    the files of the corpus
categories()                 the categories of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
8
Basic corpus functions (Table 2.3, continued)
Example                      Description
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories
9
Code to get started
  • >>> import nltk
  • >>> from nltk.corpus import gutenberg
  • >>> emma = gutenberg.words('austen-emma.txt')
  • >>> emma = nltk.Text(emma)
  • >>> emma.collocations()
  • Frank Churchill; Miss Woodhouse; Miss Bates; Jane
    Fairfax; Miss Fairfax; young man; great deal; John
    Knightley; Maple Grove; Miss Smith; Miss Taylor;
    Robert Martin; Colonel Campbell; Box Hill; Harriet
    Smith; William Larkins; Brunswick Square; young
    lady; young woman; Miss Hawkins

10
Loading your own corpus (Table 2.3)
Example                      Description
abspath(fileid)              the location of the file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root()                       the path to the root of locally installed corpus
readme()                     the contents of the README file of the corpus
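
A minimal sketch of loading your own files as a corpus, assuming NLTK is installed and using its PlaintextCorpusReader class. The file name sample.txt, its contents, and the variable names are made up for illustration; the temporary directory only keeps the example self-contained.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Build a throwaway one-file corpus so the example is self-contained.
with tempfile.TemporaryDirectory() as corpus_root:
    path = os.path.join(corpus_root, "sample.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write("The cat sat. The dog ran.")

    # Treat every .txt file under corpus_root as a corpus file.
    my_corpus = PlaintextCorpusReader(corpus_root, r".*\.txt")

    file_ids = my_corpus.fileids()                   # files of the corpus
    raw_text = my_corpus.raw("sample.txt")           # raw content of one file
    tokens = list(my_corpus.words("sample.txt"))     # words of one file
    location = str(my_corpus.abspath("sample.txt"))  # location on disk

print(file_ids)
print(tokens)
print(location)
```

In practice the root would be a directory of your own text files, and the resulting word list can be wrapped with nltk.Text in the same way as the Gutenberg example.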
11
NLPP 2 Accessing text corpora and lexical
resources
  • 2.2 Conditional frequency distributions

12
Back to frequency
  • FreqDist(mylist) calculates the number of
    occurrences of each item in 'mylist'.
  • ConditionalFreqDist(mypairs) calculates the
    number of occurrences of each pair of items in
    'mypairs',
  • where the pairing might be (author, word),
    (genre, word), (topic, word), etc., that is,
    (condition, text).
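
A minimal sketch of the difference between the two kinds of count, using only the Python standard library as a stand-in for the NLTK classes: collections.Counter plays the role of FreqDist, and a dictionary of Counters plays the role of ConditionalFreqDist. The toy data in mylist and mypairs is invented for illustration.

```python
from collections import Counter, defaultdict

# Toy data, invented for illustration.
mylist = ["the", "cat", "saw", "the", "dog"]
mypairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("romance", "the"), ("romance", "heart"),
]

# Plain frequency distribution: one count per item.
fdist = Counter(mylist)

# Conditional frequency distribution: a separate count for each condition.
cfdist = defaultdict(Counter)
for condition, word in mypairs:
    cfdist[condition][word] += 1

print(fdist["the"])              # count of "the" in the whole list
print(cfdist["news"]["the"])     # count of "the" under condition "news"
print(cfdist["romance"]["the"])  # count of "the" under condition "romance"
```

The same word gets one count overall in the first case and a separate count per condition in the second.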

13
An example
  • >>> import nltk
  • >>> from nltk.corpus import brown
  • >>> cfd = nltk.ConditionalFreqDist(
  • ...     (genre, word)
  • ...     for genre in brown.categories()
  • ...     for word in brown.words(categories=genre))
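
To see what the nested loop in this example produces without downloading the Brown corpus, here is a rough sketch in plain Python. The dictionary toy_corpus is a hypothetical stand-in for brown (genre mapped to a word list), and a dictionary of Counters stands in for nltk.ConditionalFreqDist.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the Brown corpus: genre -> list of words.
toy_corpus = {
    "news": ["the", "jury", "said", "the", "election"],
    "romance": ["she", "said", "she", "could", "not"],
}

# Same shape as the slide's expression: one (genre, word) pair per token.
pairs = [
    (genre, word)
    for genre in toy_corpus
    for word in toy_corpus[genre]
]

# Count each word separately within each genre.
cfd = defaultdict(Counter)
for genre, word in pairs:
    cfd[genre][word] += 1

conditions = sorted(cfd)                 # the genres seen
top_news = cfd["news"].most_common(1)    # most frequent word in "news"

print(conditions)
print(top_news)
print(cfd["romance"]["said"])
```

With NLTK itself the result is queried in the same way: cfd.conditions() lists the genres, and cfd[genre] is a FreqDist for that genre.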

14
Next time
  • NLPP 2.3ff
  • Do "Your Turn" up to p. 55
  • Exercises 2.8.2-4, 2.8.8