CS224N Section 3: Corpora, etc.

Slides: 17

Provided by: pichua

1
CS224N Section 3: Corpora, etc.

(Thanks to Bill MacCartney and Pi-Chuan Chang for these materials)
2
Final Project

3
Corpora

Corpora at Stanford
http://www.stanford.edu/dept/linguistics/corpora/
Some are on AFS (/afs/ir/data/linguistic-data/)
Some are available on DVDs/CDs in the Linguistics department
LDC (Linguistic Data Consortium)
http://www.ldc.upenn.edu/Catalog/
Links to many resources
http://nlp.stanford.edu/links/statnlp.html
Previous years' notes
http://www.stanford.edu/class/cs224n/handouts/cs224n-section3-corpora.txt
http://www.stanford.edu/class/cs224n/handouts/section3-2008.pdf

4
Treebanks

5
Parsed corpora in other languages

6
Part-of-speech tagged corpora

7
Named Entity Recognition (NER)

Message Understanding Conference (MUC)
We have MUC-6 and MUC-7
Example: /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys.980205
CoNLL shared tasks: Language-Independent Named Entity Recognition (I), (II)
2002: http://www.cnts.ua.ac.be/conll2002/ner/
2003: http://www.cnts.ua.ac.be/conll2003/ner/
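The CoNLL shared-task files are plain text with one token per line, whitespace-separated columns, the named-entity tag in the last column, and a blank line between sentences. A minimal sketch of reading that layout and collecting entity spans might look like the following. The function names `read_conll` and `entity_spans` are hypothetical, and `SAMPLE` is a small illustrative fragment in the style of the CoNLL-2003 English data rather than a file read from the corpus.

```python
def read_conll(lines):
    """Yield sentences as lists of (token, tag) pairs.

    Assumes one token per line, whitespace-separated columns, the
    entity tag in the last column, and blank lines between sentences.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        # Blank lines and document-start markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


def entity_spans(sentence):
    """Group consecutive tagged tokens into (type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in sentence:
        prefix, _, etype = tag.partition("-")
        # A new span starts at a B- tag or when the entity type changes.
        starts_new = tag != "O" and (prefix == "B" or etype != current_type)
        if tag == "O" or starts_new:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            current_type = etype
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Illustrative sample only: word, POS tag, chunk tag, entity tag.
SAMPLE = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""

sentences = list(read_conll(SAMPLE.splitlines()))
print(entity_spans(sentences[0]))
```

On the sample this prints three spans: an ORG ("U.N."), a PER ("Ekeus"), and a LOC ("Baghdad"). Taking only the first and last columns keeps the reader usable when the number of intermediate columns differs between files.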

8
Anaphora resolution

Data: MUC-6 and MUC-7
Example: Pam went home because she felt sick.
Demo: http://lingpipe-demos.com:8080/lingpipe-demos/coref_en_news_muc6/textInput.html
Unsolved problem
Harder example:
We gave the bananas to the monkeys because they were hungry.
We gave the bananas to the monkeys because they were ripe.
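To see why the bananas/monkeys pair is hard, it may help to run a deliberately naive baseline on it. The sketch below resolves each plural pronoun to the closest preceding candidate noun. It is a toy recency heuristic, not the method used by the demo above, and the name `resolve_by_recency` and the hand-supplied candidate set are assumptions made for illustration.

```python
PLURAL_PRONOUNS = {"they", "them", "their"}


def resolve_by_recency(tokens, candidates):
    """Map each plural pronoun's index to the closest preceding candidate.

    `candidates` is a hand-supplied set of plural nouns; a real system
    would obtain them from a tagger or parser.
    """
    resolved = {}
    last_candidate = None
    for index, token in enumerate(tokens):
        word = token.lower().strip(".,")
        if word in candidates:
            last_candidate = word
        elif word in PLURAL_PRONOUNS and last_candidate is not None:
            resolved[index] = last_candidate
    return resolved


hungry = "We gave the bananas to the monkeys because they were hungry".split()
ripe = "We gave the bananas to the monkeys because they were ripe.".split()
nouns = {"bananas", "monkeys"}

print(resolve_by_recency(hungry, nouns))
print(resolve_by_recency(ripe, nouns))
```

Both calls link "they" to "monkeys", which is right for the first sentence and wrong for the second. The two sentences differ only in the final adjective, so surface position alone cannot separate them; choosing correctly needs knowledge about what can be hungry and what can be ripe.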

9
Semantics

WordNet
Website: http://wordnet.princeton.edu/
Browse online: http://wordnetweb.princeton.edu/perl/webwn
150,000 nouns, verbs, adjectives, adverbs
Groups words into synsets with short, general definitions, and records various relations between synsets, e.g. the hypernym (kind-of) hierarchy.
Good tutorial: http://www.brians.org/Projects/Technology/Papers/Wordnet/
Neat visual interface: http://www.visualthesaurus.com/?vt
Problems with WordNet:
fine-grained senses
sense ordering is sometimes odd (see "airline")
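The hypernym (kind-of) hierarchy mentioned above can be pictured as links from each synset to a more general one. The sketch below walks such links on a tiny hand-written table. The entries are illustrative only: they are not read from WordNet and they simplify its real hierarchy, where a synset can have more than one hypernym. The names `HYPERNYM`, `hypernym_path`, and `lowest_common_hypernym` are hypothetical.

```python
# Toy hypernym links (child -> parent). Hand-written for illustration;
# these are NOT read from WordNet and simplify its real hierarchy.
HYPERNYM = {
    "dog": "canine",
    "cat": "feline",
    "canine": "carnivore",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def hypernym_path(word):
    """Return the chain from `word` up to the most general ancestor."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def lowest_common_hypernym(a, b):
    """Return the most specific node shared by both chains, or None."""
    ancestors_of_a = set(hypernym_path(a))
    for node in hypernym_path(b):
        if node in ancestors_of_a:
            return node
    return None


print(hypernym_path("dog"))
print(lowest_common_hypernym("dog", "cat"))
```

In this toy table "dog" climbs through "canine", "carnivore", and "mammal" to "animal", and the lowest node shared by "dog" and "cat" is "carnivore". Path-based similarity measures over WordNet build on this kind of walk.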

10
Semantic Role Labeling

11
More corpora for specific tasks

12
More corpora for specific tasks

13
Speech and Dialogue

Speech
BNC: 10m words
Dialogue
Switchboard corpus
Conversations of two speakers recorded over the phone
Transcriptions of their speech, with speakers labeled
Example: http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.readme.htmltxt

14
Email/Spam

15
Tools

16
Machine learning tools

Stanford classifier
Conditional log-linear (aka maximum entropy) model
http://nlp.stanford.edu/software/classifier.shtml
Weka
Java library containing (nearly) every machine learning algorithm: Naive Bayes, perceptron, decision tree, MaxEnt, SVM, etc.
http://www.cs.waikato.ac.nz/ml/weka/
Mallet
Java library useful for statistical NLP, document classification, clustering, topic modeling, information extraction
http://mallet.cs.umass.edu/
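A conditional log-linear (maximum entropy) model scores each label by summing the weights of the features that fire, then normalizes the exponentiated scores into probabilities. The sketch below shows only that scoring step, with hand-set weights. It does not use the API of any toolkit listed above, and the function name `loglinear_probs`, the feature names, and the weight values are all assumptions made for illustration.

```python
import math


def loglinear_probs(features, weights, labels):
    """Return P(label | features) under a conditional log-linear model.

    score(label) is the sum of weights[(feature, label)] over the active
    features; the probabilities are the softmax of those scores.
    """
    scores = {
        label: sum(weights.get((f, label), 0.0) for f in features)
        for label in labels
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical hand-set weights for a toy spam/ham problem.
WEIGHTS = {
    ("free", "spam"): 1.5,
    ("meeting", "ham"): 2.0,
    ("money", "spam"): 1.0,
}

probs = loglinear_probs({"free", "money"}, WEIGHTS, ["spam", "ham"])
print(max(probs, key=probs.get), probs)
```

With these weights the message containing "free" and "money" is labeled spam. In practice the weights are fit from labeled training data, which is the part a classifier toolkit handles.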