Welcome to Introduction to Natural Language Processing CIS 530

1
Welcome to Introduction to Natural Language Processing
CIS 530
  • Mitch Marcus

2
CIS 530 General Information, Spring 2008
  • Instructor: Mitch Marcus
  • Email: mitch_at_cis.upenn.edu
  • Office: Levine 503
  • Office Hours: To be arranged
  • TA: TBA
  • Meeting Time and Location
  • TR 4:30pm-6:00pm
  • Towne 307
  • Prerequisites
  • An intro to AI OR an intro to natural language
    syntax.
  • Course Work
  • Several problem sets in the earlier part of the
    course; a course project during the second half
    of the course.

3
Textbooks
  • Required
  • Jurafsky & Martin, SPEECH and LANGUAGE
    PROCESSING: An Introduction to Natural Language
    Processing, Computational Linguistics, and Speech
    Recognition, Draft of 2nd edition (will be
    provided)
  • Chris Manning and Hinrich Schütze, Foundations of
    Statistical Natural Language Processing, MIT
    Press, 1999. (available online from the Penn
    campus at
    http://cognet.mit.edu/library/books/view?isbn=0262133601 )
  • Various supplementary readings.
  • Other Useful Books
  • An introductory Python book (such as Lutz &
    Ascher, Learning Python, O'Reilly, 1999) if you
    don't already know Python (available online
    through Safari)
  • David Beazley, Python Essential Reference, 2nd
    Edition, New Riders, 2001.

4
NLP Grand Challenges
  • The Ultimate Goal: For computers to use NL as
    effectively as humans do.
  • Reading and writing text
  • Abstracting
  • Monitoring
  • Extraction into Databases
  • Interactive Dialogue: Natural, effective access
    to computer systems
  • Informal Speech Input and Output
  • Translation: Input and Output in Multiple
    Languages

5
The Past: Human Machine Interfaces
  • SHRDLU (Winograd, 1969)
  • A fragile demonstration of the fundamental vision
  • PUNDIT (Hirschman, Palmer, 1989)
  • Information Extraction from Real World texts

6
SHRDLU: A demonstration proof
  • Person: PICK UP A BIG RED BLOCK.
  • Computer: OK. (does it)
  • Person: GRASP THE PYRAMID.
  • Computer: I DON'T UNDERSTAND WHICH PYRAMID YOU
    MEAN.
  • Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE
    YOU ARE HOLDING AND PUT IT INTO THE BOX.
  • Computer: BY "IT", I ASSUME YOU MEAN THE BLOCK
    WHICH IS TALLER THAN THE ONE I AM HOLDING.
  • Computer: OK. (does it)
  • Person: WHAT DOES THE BOX CONTAIN?
  • Computer: THE BLUE PYRAMID AND THE BLUE BLOCK.
  • Person: WHAT IS THE PYRAMID SUPPORTED BY?
  • Computer: THE BOX.

8
The Past: Information Extraction
  • PUNDIT (Palmer et al. 1987)
  • Sample CASREP automatic summary
  • FAILURE OF ONE OF TWO SACS. UNIT HAD LOW OUTPUT
    AIR PRESSURE. RESULTED IN SLOW GAS TURBINE
    START. TROUBLESHOOTING REVEALED NORMAL SAC LUBE
    OIL PRESSURE AND TEMPERATURE. EROSION OF IMPELLOR
    BLADE TIP EVIDENT. CAUSE OF EROSION OF IMPELLOR
    BLADE UNDETERMINED. NEW SAC RECEIVED.

10
The Past: Crucial flaws in the paradigm
  • These systems worked well, BUT
  • Usually, only for a small set of examples
  • Person-years of work to port to new applications
    and, often, to extend coverage on a single
    application
  • Very limited and inconsistent coverage of English

11
Interactive systems often worked well
  • because of a magical fact: People automatically
    adapt and limit their language, given a small set
    of exemplars, if the underlying linguistic
    generalizations are HABITABLE
  • This won't handle non-interactive language

12
The State of NLP
  • NLP Past
  • Rich Representations
  • NLP Present
  • Powerful Statistical Disambiguation

13
An Early Robust Statistical NLP Application
  • A Statistical Model For Etymology (Church 85)
  • Determining etymology is crucial for
    text-to-speech

14
An Early Robust Statistical NLP Application
  • Etymology can be determined reasonably accurately
    from statistics computed over letter sequences:
    trigrams!
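
To make the letter-trigram idea concrete, here is a minimal sketch in Python. It is not Church's actual model: the add-one smoothing, the boundary padding, the function names, and the tiny word lists are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def letter_trigrams(word):
    """Return the letter trigrams of a word, padded with '#' boundary marks."""
    padded = "##" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(words_by_origin):
    """Count letter trigrams separately for each origin label."""
    return {origin: Counter(t for w in words for t in letter_trigrams(w))
            for origin, words in words_by_origin.items()}

def guess_origin(word, counts):
    """Pick the origin whose add-one-smoothed trigram counts score the word highest."""
    vocab = set().union(*counts.values())  # all trigrams seen in training
    def log_score(origin):
        c = counts[origin]
        total = sum(c.values()) + len(vocab) + 1  # +1 leaves room for unseen trigrams
        return sum(math.log((c[t] + 1) / total) for t in letter_trigrams(word))
    return max(counts, key=log_score)

# Tiny made-up training lists, for illustration only (hypothetical data).
toy_data = {
    "italian": ["spaghetti", "zucchini", "cappuccino", "graffiti", "paparazzi"],
    "german": ["schnitzel", "kindergarten", "zeitgeist", "poltergeist", "rucksack"],
}
model = train(toy_data)
print(guess_origin("confetti", model))
print(guess_origin("weltgeist", model))
```

A real system would train on large pronunciation dictionaries labeled by language of origin; the point here is only that short letter sequences already carry a strong signal.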

15
A Central Challenge: Extracting Meaning
[Diagram: Text or speech -> ??Meaning Extractor?? -> Meaning]
16
Meaning representations should capture
  • Entities
  • Of some type: Nation, Know-how
  • Events and relations
  • Predicates with arguments
  • Recursively
  • And More
  • Quantifiers

The founder of Pakistan's nuclear program, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya and North Korea.
17
Literal vs. Implicit Meaning
  • Cognitive beings automatically
  • combine literal meaning
  • with world knowledge
  • to see implicit meaning
  • Q: Whose greed? Q: Whose ambition?
  • Understanding this involves inferring implicit
    meaning
  • Recent NLP has focused on robust extraction of
    shallow, literal meaning

The founder of Pakistan's nuclear program, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya and North Korea, a
Pakistani government official said Monday. The
transfers were made during the late 1980s and in
the early and mid 1990s, and were motivated by
"personal greed and ambition," an official said.
18
Levels of Representation
Full Semantics
Explicit Semantics
Syntax
Words
Morphology
19
Word Unigram Representation
  • Unigrams
  • The founder of Pakistan's nuclear program, Abdul
    Qadeer Khan, has admitted he transferred nuclear
    technology to Iran, Libya and North Korea, a
    Pakistani government official said Monday. Khan
    made the confession in a written statement
    submitted "a couple of days ago" to investigators
    probing allegations of nuclear proliferation by
    Pakistan, the official told The Associated Press
    on condition of anonymity. The transfers were
    made during the late 1980s and in the early and
    mid 1990s, and were motivated by "personal greed
    and ambition," the official said. The official
    said the transfers were not authorized by the
    government.
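
As a rough illustration of what a unigram representation keeps and discards, the sketch below counts word tokens in the first sentence of the passage. The tokenizer is a simplifying assumption (lowercasing, a small regular expression), not the course's method.

```python
import re
from collections import Counter

text = ("The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has "
        "admitted he transferred nuclear technology to Iran, Libya and North Korea.")

def tokenize(s):
    """Lowercase and split into word tokens, keeping a trailing 's or similar."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", s.lower())

# A unigram representation is just a bag of word counts: order is discarded.
unigrams = Counter(tokenize(text))
print(unigrams.most_common(3))
print(unigrams["nuclear"], unigrams["north"], unigrams["korea"])
```

Note that the counts no longer record that "north" and "korea" were adjacent, so the name 'North Korea' is lost as a unit.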

20
Word Bigram Representation
  • Bigrams
  • The founder of Pakistan's nuclear program, Abdul
    Qadeer Khan, has admitted he transferred nuclear
    technology to Iran, Libya and North Korea, a
    Pakistani government official said Monday. Khan
    made the confession in a written statement
    submitted "a couple of days ago" to investigators
    probing allegations of nuclear proliferation by
    Pakistan, the official told The Associated Press
    on condition of anonymity. The transfers were
    made during the late 1980s and in the early and
    mid 1990s, and were motivated by "personal greed
    and ambition," the official said. The official
    said the transfers were not authorized by the
    government.
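
A bigram representation counts adjacent word pairs instead of single words. The sketch below is a minimal example on the first sentence of the passage; the tokenizer is again a simplifying assumption.

```python
import re
from collections import Counter

text = ("The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has "
        "admitted he transferred nuclear technology to Iran, Libya and North Korea.")

# Assumed tokenizer: lowercase word tokens, keeping a trailing 's or similar.
tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

# Pair each token with its right neighbor to get the bigrams.
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams[("north", "korea")])
print(bigrams[("nuclear", "program")], bigrams[("nuclear", "technology")])
```

Unlike the unigram counts, the pairs keep some local context: "nuclear program" and "nuclear technology" are now distinct items, and "north korea" survives as a unit.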

21
Levels of Representation
Full Semantics
Explicit Semantics
Syntax
Words
Morphology
Also, higher representations require lower
22
Syntax Representation: Treebank
  • TreeBank includes
  • Part of speech (not shown here)
  • Syntactic structure

23
1995: A breakthrough in parsing
  • 10^6 words of Treebank Annotation + Machine
    Learning = Robust Parsers

[Diagram: training sentences and answers go into a
Training Program, which produces Models; the Parser
uses the Models to map the sentence below to Trees]
The founder of Pakistan's nuclear program, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya and North Korea
  • 1990: Best hand-built parsers: 40-60% accuracy
    (guess)
  • 1995: Statistical parsers: 90% accuracy

24
Rich Linguistic Representations + Powerful
Machine Learning = Robust, Effective NLP
  • 1970s, 80s: Focus on Linguistic Representations
  • 1990s, early 2000s: Focus on Machine Learning
  • Recently: New work combining the two

25
Levels of Representation
Full Semantics
Explicit Semantics
Syntax
Words
Morphology
Also, higher representations require lower
26
Shallow Verb Semantics: Propbank
The founder of Pakistan's nuclear department, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya, and North Korea
[Parse tree over the sentence; node labels: S, SBAR,
VP, NP, PP]
  • PropBank adds
  • Lexical semantics of verbs
27
A Very First Experiment: Recovering Semantic
Structure Automatically
[Diagram labels: training sentences + Treebank ->
Training Program -> Models -> Parser; training
sentences + Propbank -> Training Program -> Models ->
Semantic Analyzer; Sentences -> Parser -> Trees ->
Semantic Analyzer -> Semantic Relations]
Semantic Relations Retrieved (Hacioglu et al.,
Univ of Colorado)
28
Explicit Semantic Representations
A1: act: Acknowledge
    agent: E1 (object: Founder,
      name: "Abdul Qadeer Khan",
      description: A3: act: Establish, agent: E1,
        organization: E2 (object: Agency,
          description: "Pakistan's nuclear department"))
    proposition: A2: act: Transfer, agent: E1,
      theme: E4 (object: Know-How,
        description: "nuclear technology"),
      destination: (and
        E5 (object: Nation, name: "Iran")
        E6 (object: Nation, name: "Libya")
        E7 (object: Nation, name: "North Korea"))

Entities
  • E1: Founder. Names: "Abdul Qadeer Khan".
    Descriptions: "The founder of Pakistan's nuclear
    department", "he"
  • E2: Agency. Descriptions: "Pakistan's nuclear
    department"
  • E3: Nation. Names: "Pakistan"
  • E4: Know-How. Descriptions: "nuclear technology"
  • E5: Nation. Names: "Iran"
  • E6: Nation. Names: "Libya"
  • E7: Nation. Names: "North Korea"

Relations
  • Establish (Agent, Org)
  • Subsidiary (SubOrg, SuperOrg)
  • Acknowledge (Person, Fact)
  • Transfer (Agent, Item, Dest)
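
One way to hold such a representation in a program is as typed entities plus events whose arguments can themselves be events. The sketch below is only an assumption about how this might be encoded in Python: the class names, field names, and the `render` helper are hypothetical, while the entity IDs, types, and role names follow the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    id: str
    type: str
    names: list = field(default_factory=list)
    descriptions: list = field(default_factory=list)

@dataclass
class Event:
    id: str
    act: str
    args: dict  # role name -> Entity, Event, or list of Entities

e1 = Entity("E1", "Founder", names=["Abdul Qadeer Khan"])
e4 = Entity("E4", "Know-How", descriptions=["nuclear technology"])
e5 = Entity("E5", "Nation", names=["Iran"])
e6 = Entity("E6", "Nation", names=["Libya"])
e7 = Entity("E7", "Nation", names=["North Korea"])

# Transfer(Agent, Item, Dest); the destination is a conjunction of nations.
a2 = Event("A2", "Transfer", {"Agent": e1, "Item": e4, "Dest": [e5, e6, e7]})
# Acknowledge(Person, Fact); the fact is itself an event (recursion).
a1 = Event("A1", "Acknowledge", {"Person": e1, "Fact": a2})

def render(x):
    """Print an entity, a conjunction of entities, or an event recursively."""
    if isinstance(x, Entity):
        return x.names[0] if x.names else x.descriptions[0]
    if isinstance(x, list):
        return " and ".join(render(item) for item in x)
    args = ", ".join(f"{role}={render(value)}" for role, value in x.args.items())
    return f"{x.act}({args})"

print(render(a1))
```

Because `Fact` points at another event, the structure nests predicates inside predicates, which is the recursion that a flat list of words or bigrams cannot express.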
29
Vision: Building decoders for literal meaning
[Diagram labels: Training sentences, Correct parse
trees, Training Program, Models, Parser; marked
"Current State of Art"]
The founder of Pakistan's nuclear program, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya and North Korea
30
Levels of Representation
Full Semantics
Explicit Semantics
Syntax
Words
Morphology
Also, higher representations require lower
31
PASCAL: Recognizing Textual Entailment
  • Pattern Analysis, Statistical Modelling and
    Computational Learning: a Network of Excellence
    sponsored by the EU as part of its IST program.

32
Approximate Syllabus
33
Unit I: Intro & Word-Based Methods
  • Introduction to Python
  • N-Gram Word-Based Models of Syntax
  • Word Distributions
  • Smoothing & Backoff
  • Entropy and Relative Entropy
  • Word Classes and Part of Speech Tagging
  • Tag Set Design
  • Hidden Markov Models
  • Transformation-based Learning
  • Speech Recognition
  • Why is Speech Recognition Hard?
  • HMMs for speech

34
Unit II - Parsing
  • Introduction to Syntactic Analysis
  • Context Free Models CF Parsing for NL Syntax
  • Statistical Parsing of CFGs
  • Probabilistic CFGs
  • Generative Statistical Models
  • Discriminative Models for Parsing
  • Enriched Models for NL Syntax
  • The Inadequacy of CF Models
  • Feature Structures and Unification
  • Tree Adjoining Grammars

35
Syllabus III: Meaning
  • Lexical Semantics
  • Word Sense Disambiguation: Decision Lists, SVMs
  • Logical Form and Semantics
  • Introduction to Logical Form
  • Mapping from Syntactic Structures to LF
  • Quantifier scope and Cooper Storage
  • Entailment & Logical Inference
  • Discourse & Pragmatics
  • Reference and Anaphora
  • Text Coherence & Discourse Structure

36
Syllabus IV: Putting the Pieces Together
  • Machine Translation
  • Synchronous TAGs
  • Statistical Translation
  • Generation & Summarization
  • Text Planning, Content Determination and
    Realization
  • Statistical Techniques for Summarization