CS224N Section 3: Corpora, etc.

Provided by: pichua

1
CS224N Section 3: Corpora, etc.
  • Helen Kwong
  • April 24, 2009

(Thanks to Bill MacCartney and Pi-Chuan Chang for
these materials)
2
Final Project
  • Proposal due in 2 weeks - Wed. 5/6
  • Project ideas
  • Final project guide
  • Go through the syllabus
  • Projects from previous years:
    http://nlp.stanford.edu/courses/cs224n/

3
Corpora
  • Corpora at Stanford:
  • http://www.stanford.edu/dept/linguistics/corpora/
  • Some are on AFS (/afs/ir/data/linguistic-data/);
    some are available on DVD/CD in the linguistics
    department
  • LDC (Linguistic Data Consortium):
  • http://www.ldc.upenn.edu/Catalog/
  • Links to many resources:
  • http://nlp.stanford.edu/links/statnlp.html
  • Previous years' notes:
  • http://www.stanford.edu/class/cs224n/handouts/cs224n-section3-corpora.txt
  • http://www.stanford.edu/class/cs224n/handouts/section3-2008.pdf

4
Treebanks
  • Most widely used: Penn Treebank
  • There are PTB2 and PTB3. Use PTB3, i.e. Treebank-3
  • Contains:
  • 50,000 sentences (1,000,000 words) of WSJ text
    from 1989
  • 30,000 sentences (400,000 words) of Brown corpus
  • Parsed WSJ trees:
  • /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/
  • See Bill's notes for more details
  • BLLIP: like PTB (WSJ text), but 30m words, parsed
    automatically by the Charniak parser
  • Switchboard: telephone conversations
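
The parsed trees in the Treebank are stored as bracketed text, so a small reader is enough to get started. The following is a minimal sketch, assuming the usual `.mrg` layout in which each tree is wrapped in one extra unlabeled pair of parentheses. The sample sentence is made up for illustration and is not taken from the WSJ data, and `parse_tree` and `tagged_words` are hypothetical helper names.

```python
import re


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves (words) are plain strings; everything else is a tuple.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        # The outermost wrapper bracket has no label of its own.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def tagged_words(tree):
    """Yield (word, tag) pairs from preterminal nodes, skipping -NONE- traces."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        if label != "-NONE-":
            yield (children[0], label)
        return
    for child in children:
        if not isinstance(child, str):
            yield from tagged_words(child)


# Made-up example in Treebank bracketing style (an assumption, not corpus data).
SAMPLE = (
    "( (S (NP-SBJ (DT The) (NN cat)) "
    "(VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))) (. .)) )"
)

tree = parse_tree(SAMPLE)
pairs = list(tagged_words(tree))
print(pairs)
```

The same `tagged_words` output is one way to get POS-tagged text out of a treebank, since the preterminal labels are the POS tags.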

5
Parsed corpora in other languages
  • Penn Arabic Treebank Corpus
  • 734 stories (140,000 words)
  • Penn Chinese Treebank Corpus
  • 50,000 sentences
  • German (newspaper text):
  • NEGRA
  • http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/
  • TIGER
  • http://www.ims.uni-stuttgart.de/projekte/TIGER/
  • Tueba-D/Z
  • http://www.sfs.uni-tuebingen.de/en_tuebadz.shtml

6
Part-of-speech tagged corpora
  • POS tags from treebanks
  • British National Corpus (BNC)
  • 100m words
  • wide sample of British English: newspapers,
    books, letters
  • http://www.natcorp.ox.ac.uk/

7
Named Entity Recognition (NER)
  • Message Understanding Conference (MUC)
  • We have MUC-6 and MUC-7
  • Example:
    /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys.980205
  • CoNLL shared tasks: Language-Independent Named
    Entity Recognition (I), (II)
  • 2002: http://www.cnts.ua.ac.be/conll2002/ner/
  • 2003: http://www.cnts.ua.ac.be/conll2003/ner/
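
The CoNLL NER data is distributed as one token per line with the entity tag in the last column and a blank line between sentences. The sketch below reads that layout and groups tagged tokens into entity spans. It assumes whitespace-separated columns and I-/B-/O style tags; the two sample sentences are only an illustration of the layout, and `read_sentences` and `entity_spans` are hypothetical helper names.

```python
def read_sentences(text):
    """Split column-formatted text into sentences of (word, ne_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))  # first column: word, last: NE tag
    if current:
        sentences.append(current)
    return sentences


def entity_spans(sentence):
    """Group consecutive tagged tokens into (type, text) entity spans."""
    spans, words, etype = [], [], None
    for word, tag in sentence:
        prefix, _, label = tag.partition("-")
        # A new entity starts on B-X, or on I-X when the type changes.
        starts_new = tag != "O" and (prefix == "B" or label != etype)
        if (tag == "O" or starts_new) and words:
            spans.append((etype, " ".join(words)))
            words, etype = [], None
        if tag != "O":
            words.append(word)
            etype = label
    if words:
        spans.append((etype, " ".join(words)))
    return spans


# Illustrative sample in the four-column layout (word, POS, chunk, NE tag).
SAMPLE = """\
EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""

sentences = read_sentences(SAMPLE)
all_spans = [entity_spans(s) for s in sentences]
print(all_spans)
```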

8
Anaphora resolution
  • Data: MUC-6 and MUC-7
  • Example: Pam went home because she felt sick.
  • Demo:
    http://lingpipe-demos.com:8080/lingpipe-demos/coref_en_news_muc6/textInput.html
  • Unsolved problem
  • Harder example:
  • We gave the bananas to the monkeys because they
    were hungry.
  • We gave the bananas to the monkeys because they
    were ripe.
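
To see why the bananas/monkeys pair is hard, here is a naive baseline that always picks the nearest preceding candidate noun. It is only a sketch: the candidate set is hand-picked (an assumption, since a real system would need a parser or tagger to find candidates), and `nearest_antecedent` is a hypothetical helper name.

```python
def nearest_antecedent(tokens, candidates, pronoun):
    """Return the candidate noun closest before the pronoun (naive baseline)."""
    pronoun_index = tokens.index(pronoun)
    preceding = [
        i for i, tok in enumerate(tokens[:pronoun_index]) if tok in candidates
    ]
    return tokens[preceding[-1]] if preceding else None


# Hand-picked plural nouns that "they" could refer to (an assumption).
CANDIDATES = {"bananas", "monkeys"}

hungry = "We gave the bananas to the monkeys because they were hungry".split()
ripe = "We gave the bananas to the monkeys because they were ripe".split()

guess_hungry = nearest_antecedent(hungry, CANDIDATES, "they")
guess_ripe = nearest_antecedent(ripe, CANDIDATES, "they")
print(guess_hungry)  # monkeys (correct)
print(guess_ripe)    # monkeys (wrong: the bananas were ripe)
```

The two sentences differ only in the final adjective, so any heuristic based on position alone gives the same answer for both; choosing correctly requires knowing that monkeys get hungry and bananas get ripe.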

9
Semantics
  • WordNet
  • Website: http://wordnet.princeton.edu/
  • Browse online:
    http://wordnetweb.princeton.edu/perl/webwn
  • 150,000 nouns, verbs, adjectives, adverbs
  • Groups words into synsets with short, general
    definitions, and records various relations
    between synsets, e.g. hypernym (kind-of)
    hierarchy.
  • Good tutorial:
    http://www.brians.org/Projects/Technology/Papers/Wordnet/
  • Neat visual interface:
    http://www.visualthesaurus.com/?vt
  • Problems with WordNet:
  • fine-grained senses
  • sense ordering sometimes funny (see "airline")
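
The hypernym (kind-of) hierarchy can be pictured as a mapping from each entry to its parent. The toy sketch below walks such a mapping and finds the lowest shared ancestor of two words. The entries are hand-written for illustration and are not WordNet's actual synsets, and the single-parent dictionary is a simplification, since a WordNet synset can have more than one hypernym.

```python
# Toy kind-of hierarchy (hypothetical entries, not real WordNet data).
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Follow kind-of links from a word up to the root of the toy hierarchy."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def lowest_common_hypernym(a, b):
    """Return the first ancestor of b that is also an ancestor of a."""
    ancestors = set(hypernym_chain(a))
    for node in hypernym_chain(b):
        if node in ancestors:
            return node
    return None


dog_chain = hypernym_chain("dog")
shared = lowest_common_hypernym("dog", "cat")
print(dog_chain)
print(shared)
```

The length of the path through the shared ancestor is a common starting point for simple word-similarity measures.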

10
Semantic Role Labeling
  • Detection of semantic arguments associated with
    each verb in a sentence
  • Example: [I](agent) sold [you](patient) [a book](theme)
  • CoNLL shared task: 2004, 2005
  • http://www.lsi.upc.es/srlconll/
  • PropBank
  • Adds predicate-argument relations to PTB syntax
    trees
  • FrameNet: http://framenet.icsi.berkeley.edu/
  • Demo from UIUC:
    http://l2r.cs.uiuc.edu/cogcomp/srl-demo.php

11
More corpora for specific tasks
  • Word Sense Disambiguation (WSD)
  • Senseval: http://www.senseval.org/
  • Question Answering
  • e.g. "What film introduced Jar Jar Binks?"
  • TREC competition, Question Answering track
  • http://trec.nist.gov/data/qamain.html
  • Textual Entailment
  • Recognizing Textual Entailment (RTE) challenges
  • http://www.nist.gov/tac/tracks/2008/rte/
  • Events, temporal relations
  • TimeBank corpus:
    http://corpora.dutchboy.net/timebank/

12
More corpora for specific tasks
  • Topic Detection and Tracking
  • Given documents, separate into different topics
  • http://projects.ldc.upenn.edu/TDT/

13
Speech and Dialogue
  • Speech
  • BNC: 10m words
  • Dialogue
  • Switchboard corpus
  • Conversations of two speakers recorded over the
    phone
  • Transcriptions of their speech, with speakers
    labeled
  • Example:
    http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.readme.htmltxt

14
Email/Spam
  • Enron corpus
  • /afs/ir/data/linguistic-data/Enron-Email-Corpus/maildir/skilling-j/
  • Annotated subsets:
    http://www.cs.cmu.edu/einat/datasets.html
  • TREC Spam track
  • http://trec.nist.gov/data/spam.html
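
For the Enron maildir, Python's standard `email` package is usually enough to pull out headers and bodies. The sketch below assumes each file under a user's directory holds one plain-text message with RFC 822 style headers; the sample message is invented, and `read_message` and `sender_counts` are hypothetical helper names.

```python
import email
import os
from collections import Counter


def read_message(path):
    """Parse one message file into an email.message.Message object."""
    # latin-1 never fails to decode, which is convenient for messy mail data.
    with open(path, "r", encoding="latin-1") as handle:
        return email.message_from_file(handle)


def sender_counts(root):
    """Count messages per From: header under a maildir-style directory tree."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            msg = read_message(os.path.join(dirpath, name))
            counts[msg.get("From", "(unknown)")] += 1
    return counts


# Invented message used to show the parsing step without touching AFS.
SAMPLE = """\
From: alice@example.com
To: bob@example.com
Subject: Meeting

See you at 3pm.
"""

msg = email.message_from_string(SAMPLE)
body = msg.get_payload().strip()
print(msg["From"], "|", msg["Subject"], "|", body)
```

On a machine with the corpus mounted, `sender_counts` could be pointed at a user directory such as the `skilling-j` path listed above.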

15
Tools
  • Many links to tools on the StatNLP page:
  • http://nlp.stanford.edu/links/statnlp.html
  • Parsers
  • Stanford Parser (English, Chinese, German and
    Arabic)
  • http://nlp.stanford.edu/software/lex-parser.shtml
  • Online parser:
    http://josie.stanford.edu:8080/parser/
  • Collins parser, Charniak's parser, MiniPar, etc.
  • http://nlp.stanford.edu/fsnlp/probparse/
  • POS taggers
  • Named entity recognizers
  • Language modeling toolkits

16
Machine learning tools
  • Stanford classifier
  • conditional log-linear (aka maximum entropy) model
  • http://nlp.stanford.edu/software/classifier.shtml
  • Weka
  • Java library containing (nearly) every machine
    learning algorithm: Naive Bayes, perceptron,
    decision tree, MaxEnt, SVM, etc.
  • http://www.cs.waikato.ac.nz/ml/weka/
  • Mallet
  • Java toolkit useful for statistical NLP, document
    classification, clustering, topic modeling,
    information extraction
  • http://mallet.cs.umass.edu/
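
To give a feel for what these toolkits do for document classification, here is a minimal multinomial Naive Bayes classifier with add-one smoothing. It is a standalone Python sketch, not the API of the Stanford classifier, Weka, or Mallet (all of which are Java); the four training documents are invented, and `train` and `classify` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Collect counts from a list of (tokens, label) training documents."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing over the vocabulary.
            count = word_counts[label][tok] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set (an assumption, not from any corpus).
TRAINING = [
    ("win money now".split(), "spam"),
    ("cheap money offer".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]

model = train(TRAINING)
first = classify(model, "money offer now".split())
second = classify(model, "meeting tomorrow".split())
print(first, second)
```

For real experiments, the toolkits above add feature extraction, better smoothing, and evaluation utilities that this sketch leaves out.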