I256 Applied Natural Language Processing Fall 2009

1
I256 Applied Natural Language Processing Fall 2009
  • Lecture 1
  • Introduction

Barbara Rosario
2
Introductions
  • Barbara Rosario
  • iSchool alumna (class of 2005)
  • Intel Labs
  • Gopal Vaswani
  • iSchool master's student (class of 2010)
  • You?

3
Today
  • Introductions
  • Administrivia
  • What is NLP
  • NLP Applications
  • Why is NLP difficult
  • Corpus-based statistical approaches
  • Course goals
  • What we'll do in this course

4
Administrivia
  • http://courses.ischool.berkeley.edu/i256/f09/index.html
  • Books
  • Foundations of Statistical NLP, Manning and
    Schuetze, MIT Press
  • Natural Language Processing with Python, Bird,
    Klein & Loper, O'Reilly (also online)
  • See Web site for additional resources
  • Work
  • Individual coding assignments (Python and NLTK,
    the Natural Language Toolkit) (4 or 5)
  • Final group project
  • Participation
  • Office hours
  • Barbara: Thursday 2:00-3:00 in Room 6
  • Gopal: Tuesday 2:00-3:00 in Room 6 (to be
    confirmed)

5
Administrivia
  • Communication
  • My email: barbara.rosario_at_intel.com
  • Gopal: gopal.vaswani_at_gmail.com
  • Mailing list: i256_at_ischool.berkeley.edu
  • Send an email to majordomo_at_ischool.berkeley.edu
    with "subscribe i256" in the body
  • Through intranet
  • Announcements: webpage and/or mailing list and/or
    Bspace (TBA)
  • Public discussion: Bspace(?)
  • Related course: Statistical Natural Language
    Processing, Spring 2009, CS 288
  • http://www.cs.berkeley.edu/klein/cs288/sp09/
  • Instructor: Dan Klein
  • Much more emphasis on statistical algorithms
  • Questions?

6
Natural Language Processing
  • Fundamental goal: deep understanding of broad
    language
  • Not just string processing or keyword matching!
  • End systems that we want to build
  • Ambitious: speech recognition, machine
    translation, question answering
  • Modest: spelling correction, text categorization

Slide taken from Klein's course, UCB CS 288,
spring 09
7
Example: Machine Translation
8
NLP applications
  • Text Categorization
  • Classify documents by topics, language, author,
    spam filtering, information retrieval (relevant,
    not relevant), sentiment classification
    (positive, negative)
  • Spelling & Grammar Correction
  • Information Extraction
  • Speech Recognition
  • Information Retrieval
  • Synonym Generation
  • Summarization
  • Machine Translation
  • Question Answering
  • Dialog Systems
  • Language generation

9
Why NLP is difficult
  • An NLP system needs to answer the question "who
    did what to whom"
  • Language is ambiguous
  • At all levels: lexical, phrase, semantic
  • Iraqi Head Seeks Arms
  • Word sense is ambiguous (head, arms)
  • Stolen Painting Found by Tree
  • Thematic role is ambiguous: is the tree the agent
    or the location?
  • Ban on Nude Dancing on Governor's Desk
  • Syntactic structure (attachment) is ambiguous: is
    the ban or the dancing on the desk?
  • Hospitals Are Sued by 7 Foot Doctors
  • Semantics is ambiguous: what is 7 foot?

10
Why NLP is difficult
  • Language is flexible
  • New words, new meanings
  • Different meanings in different contexts
  • Language is subtle
  • He arrived at the lecture
  • He chuckled at the lecture
  • He chuckled his way through the lecture
  • *He arrived his way through the lecture (not
    acceptable, unlike the previous sentence)
  • Language is complex!

11
Why NLP is difficult
  • MANY hidden variables
  • Knowledge about the world
  • Knowledge about the context
  • Knowledge about human communication techniques
  • Can you tell me the time?
  • Problem of scale
  • Many (infinite?) possible words, meanings,
    contexts
  • Problem of sparsity
  • Very difficult to do statistical analysis, most
    things (words, concepts) are never seen before
  • Long range correlations

12
Why NLP is difficult
  • Key problems
  • Representation of meaning
  • Language presupposes knowledge about the world
  • Language only reflects the surface of meaning
  • Language presupposes communication between people

13
Meaning
  • What is meaning?
  • Physical referent in the real world
  • Semantic concepts, characterized also by
    relations.
  • How do we represent and use meaning
  • I am Italian
  • From lexical database (WordNet)
  • Italian: a native or inhabitant of Italy; Italy:
    republic in southern Europe ...
  • I am Italian
  • Who is "I"?
  • I know she is Italian / I think she is Italian
  • How do we represent "I know" and "I think"?
  • Does this mean that "I" is Italian? What does it
    say about the "I" and about the person speaking?
  • I thought she was Italian
  • How do we represent tenses?

14
Today
  • Introductions
  • Administrivia
  • What is NLP
  • NLP Applications
  • Why is NLP difficult
  • Corpus-based statistical approaches
  • Course goals
  • What we'll do in this course

15
Corpus-based statistical approaches to tackle NLP
problem
  • How can a machine understand these
    differences?
  • Decorate the cake with the frosting
  • Decorate the cake with the kids
  • Rule-based approaches, i.e. hand-coded syntactic
    constraints and preference rules
  • The verb decorate requires an animate being as
    agent
  • The object cake is formed by any of the
    following inanimate entities (cream, dough,
    frosting, ...)
  • Such approaches have been shown to be
    time-consuming to build; they do not scale up
    well and are very brittle when faced with new,
    unusual, or metaphorical uses of language
  • To swallow requires an animate being as
    agent/subject and a physical object as object
  • I swallowed his story
  • The supernova swallowed the planet
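
A minimal sketch of why hand-coded rules are brittle, assuming a hypothetical mini-lexicon (the sets ANIMATE and PHYSICAL, the RULES table, and the function accepts are all invented for this illustration). The rule for swallow encodes exactly the constraint stated above, and it rejects both of the perfectly good sentences that follow it.

```python
# Hand-coded lexicon: every entry is a hypothetical rule written for this sketch.
ANIMATE = {"I", "kids", "dog"}
PHYSICAL = {"cake", "planet", "pill"}

# verb -> (required class of the agent, required class of the object)
RULES = {"swallow": (ANIMATE, PHYSICAL)}

def accepts(agent, verb, obj):
    """Return True if the hand-written constraints allow this triple."""
    agent_class, object_class = RULES[verb]
    return agent in agent_class and obj in object_class

print(accepts("dog", "swallow", "pill"))          # literal use
print(accepts("I", "swallow", "story"))           # metaphorical object
print(accepts("supernova", "swallow", "planet"))  # inanimate agent
```

Covering the last two cases would mean adding more rules and exceptions by hand, which is the scaling problem the slide describes.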

16
Corpus-based statistical approaches to tackle NLP
problem
  • A Statistical NLP approach seeks to solve these
    problems by automatically learning lexical and
    structural preferences from text collections
    (corpora)
  • Statistical models are robust, generalize well
    and behave gracefully in the presence of errors
    and new data.
  • So
  • Get large text collections
  • Compute statistics over those collections
  • (The bigger the collections, the better the
    statistics)

17
Corpus-based statistical approaches to tackle NLP
problem
  • Decorate the cake with the frosting
  • Decorate the cake with the kids
  • From (labeled) corpora we can learn that
  • (kids are subject/agent of decorate) >
    (frosting is subject/agent of decorate)
  • From (UN-labeled) corpora we can learn that
  • (the kids decorate the cake) >> (the
    frosting decorates the cake)
  • (cake with frosting) >> (cake with kids)
  • etc..
  • Given these facts we then need a statistical
    model for the attachment decision
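
One deliberately crude way to turn such facts into an attachment decision is sketched below. The counts are invented for illustration (they stand in for statistics harvested from a corpus), and the function attach and the two count tables are hypothetical names; a real model would estimate proper probabilities rather than compare smoothed counts.

```python
from collections import Counter

# Hypothetical counts, as if harvested from a corpus.
# The numbers are invented for illustration only.
noun_with = Counter({("cake", "frosting"): 40, ("cake", "kids"): 1})
subject_of = Counter({("kids", "decorate"): 25, ("frosting", "decorate"): 0})

def attach(verb, noun, pp_noun):
    """Decide whether 'with <pp_noun>' goes with the noun or the verb.

    Compares add-one smoothed counts: evidence that pp_noun occurs with
    the noun versus evidence that pp_noun can act as the verb's agent.
    """
    noun_score = noun_with[(noun, pp_noun)] + 1
    verb_score = subject_of[(pp_noun, verb)] + 1
    return "noun" if noun_score > verb_score else "verb"

print(attach("decorate", "cake", "frosting"))
print(attach("decorate", "cake", "kids"))
```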

18
Corpus-based statistical approaches to tackle NLP
problem
  • Topic categorization: classify the document into
    semantic topics

Document 1: The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.
Topic: sport
Document 2: One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.
Topic: disaster
19
Corpus-based statistical approaches to tackle NLP
problem
  • Topic categorization: classify the document into
    semantic topics

Document 1 (sport): The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan ...
Document 2 (disasters): One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as ...
  • From (labeled) corpora we can learn that
  • (sport documents containing word Cup) >
    (disaster documents containing word Cup) --
    feature
  • We then need a statistical model for the topic
    assignment
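
As a rough illustration of word features plus a statistical model, here is a minimal sketch of a naive Bayes classifier with add-one smoothing. Naive Bayes is an assumption of this sketch (the slide does not name a model), the training set is just four fragments of the two example documents, and the function names tokenize, train and classify are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus built from fragments of the two example documents.
TRAIN = [
    ("sport", "the U.S. swept into the Davis Cup final"),
    ("sport", "twins Bob and Mike Bryan defeated Belarus's Max Mirnyi"),
    ("disaster", "relentless hurricane seasons on record"),
    ("disaster", "Hurricane Jeanne prompted evacuation orders"),
]

def tokenize(text):
    """Lowercase and split on whitespace (a crude stand-in for a tokenizer)."""
    return text.lower().split()

def train(examples):
    """Count documents per topic and word occurrences per topic."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for topic, text in examples:
        doc_counts[topic] += 1
        word_counts[topic].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab

def classify(text, model):
    """Return the topic with the highest add-one smoothed log probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in doc_counts:
        score = math.log(doc_counts[topic] / total_docs)  # log prior
        total_words = sum(word_counts[topic].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[topic][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

model = train(TRAIN)
print(classify("the Davis Cup final", model))
print(classify("hurricane evacuation orders", model))
```

With so little training data the result depends entirely on a handful of words such as Cup and hurricane, which is the slide's point about features learned from labeled corpora.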

20
Corpus-based statistical approaches to tackle NLP
problem
  • Feature extraction (usually linguistically
    motivated)
  • Statistical models
  • Data (corpora, labels, linguistic resources)

21
Goals of this Course
  • Learn about the problems and possibilities of
    natural language analysis
  • What are the major issues?
  • What are the major solutions?
  • At the end you should
  • Agree that language is difficult, interesting and
    important
  • Be able to assess language problems
  • Know which solutions to apply when, and how
  • Feel some ownership over the algorithms
  • Be able to use software to tackle some NLP
    language tasks
  • Know language resources
  • Be able to read papers in the field

22
What We'll Do in this Course
  • Linguistic Issues
  • What is the range of language phenomena?
  • What are the knowledge sources that let us
    disambiguate?
  • What representations are appropriate?
  • Applications
  • Software (Python and NLTK)
  • Statistical Modeling Methods

23
What We'll Do in this Course
  • Read books, research papers and tutorials
  • Final project
  • Your own ideas or choose from some suggestions I
    will provide
  • We'll talk later during the course about
    ideas/methods etc. but come talk to me if you
    already have some ideas
  • Learn Python
  • Learn/use NLTK (Natural Language ToolKit) to try
    out various algorithms

24
Python
  • Python: simple yet powerful
  • The Zen of Python:
    http://www.python.org/dev/peps/pep-0020/
  • Very clear, readable syntax
  • Strong introspection capabilities
  • http://www.ibm.com/developerworks/linux/library/l-pyint.html
    (recommended)
  • Intuitive object orientation
  • Natural expression of procedural code
  • Full modularity, supporting hierarchical packages
  • Exception-based error handling
  • Very high level dynamic data types
  • Extensive standard libraries and third party
    modules for virtually every task
  • Excellent functionality for processing linguistic
    data.
  • NLTK is one such extensive third party module. 

Source: python.org
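
To give a feel for what "introspection" means in practice, the short sketch below inspects objects at runtime using only Python built-ins. The variable word and the function tokenize are made up for this illustration.

```python
# Inspect a built-in string object at runtime.
word = "language"

print(type(word).__name__)                # name of the object's class
print(hasattr(word, "upper"))             # does it have an 'upper' attribute?
print(callable(getattr(word, "upper")))   # is that attribute callable?

# List the string methods whose names start with "is".
is_methods = [name for name in dir(word) if name.startswith("is")]
print(is_methods)

# Functions are objects too: read a function's name and docstring.
def tokenize(text):
    """Split text on whitespace."""
    return text.split()

print(tokenize.__name__, "->", tokenize.__doc__)
```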
25
NLTK
  • NLTK defines an infrastructure that can be used
    to build NLP programs in Python.
  • It provides basic classes for representing data
    relevant to natural language processing.
  • Standard interfaces for performing tasks such as
    part-of-speech tagging, syntactic parsing, and
    text classification.
  • Standard implementations for each task which can
    be combined to solve complex problems.
  • Resources
  • Download at http://www.nltk.org/download
  • Getting started with NLTK: Chapter 1
  • NLP and NLTK talk at Google:
    http://www.youtube.com/watch?v=keXW_5-llD0

Language processing task  NLTK modules                 Functionality
Accessing corpora         nltk.corpus                  standardized interfaces to corpora and lexicons
String processing         nltk.tokenize, nltk.stem     tokenizers, sentence tokenizers, stemmers
Collocation discovery     nltk.collocations            t-test, chi-squared, point-wise mutual information
Part-of-speech tagging    nltk.tag                     n-gram, backoff, Brill, HMM, TnT
Classification            nltk.classify, nltk.cluster  decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking                  nltk.chunk                   regular expression, n-gram, named-entity
Parsing                   nltk.parse                   chart, feature-based, unification, probabilistic, dependency
This is not the complete list
Source: nltk.org
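
To show what one of the listed measures computes, here is a minimal pure-Python sketch of point-wise mutual information (PMI) for adjacent word pairs. It does not use NLTK; the function pmi_scores and the toy text are assumptions made for this illustration, and the probabilities are plain maximum-likelihood estimates.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return {bigram: PMI} using maximum-likelihood estimates.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy text invented for illustration.
tokens = "new york is big and new york is old and the sea is big".split()
scores = pmi_scores(tokens)

# Show the three highest-scoring bigrams.
for bigram, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(bigram, round(score, 2))
```

PMI rewards pairs that co-occur more often than their individual frequencies predict, but on small samples it also tends to favor pairs of rare words, so in practice it is usually combined with a minimum frequency cutoff.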
26
Topics
  • Text corpora & other resources
  • Words (Morphology, tokenization, stemming,
    part-of-speech, WSD, collocations, lexical
    acquisition, language models)
  • Syntax: chunking, PCFG parsing
  • Statistical models (esp. for classification)
  • Applications
  • Text classification
  • Information extraction
  • Machine translation
  • Semantic Interpretation
  • Sentiment Analysis
  • QA / Summarization
  • Information retrieval

27
Next Assignment
  • Due before next class: Tue Sep 1
  • No turn-in
  • Download and install Python and NLTK
  • Download the NLTK Book Collection, as described
    at the beginning of chapter 1 of the book Natural
    Language Processing with Python
  • Readings
  • Chapter 1 of the book Natural Language Processing
    with Python
  • Chapter 3 of Foundations of Statistical NLP
  • Next class
  • Linguistic Essentials
  • Python Introduction