CSA2050: Natural Language Processing - PowerPoint PPT Presentation

Slides: 50
Provided by: michael307
1
CSA2050: Natural Language Processing
  • Information Extraction
  • Named Entities
  • IE Systems
  • MUC
  • Finite State Machines
  • Pattern Recognition

2
Classification at different granularities
  • Text Categorization
  • Classify an entire document
  • Information Extraction (IE)
  • Identify and classify small units within
    documents
  • Named Entity Extraction (NE)
  • A subset of IE
  • Identify and classify proper names
  • People, locations, organizations

3
Martin Baker, a person
Genomics job
Employer's job posting form
4
Aggregator Websites
6
Aggregator Websites
  • Read in many web pages from different sites
  • Extract information into a database
  • Screen Scraping
  • Can then return data matching particular queries
  • Data mining can extract meaningful insight that
    might not have been obvious

8
Data Mining
9
IE from Research Papers
10
IE from Commercial Websites
11
What is Information Extraction?
As a task:
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying ...
12
What is Information Extraction?
As a task:
Filling slots in a database from sub-segments of
text.
[Same news passage as on the previous slide]
IE output:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
13
What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation +
classification + association
[Same news passage as on the previous slides]
Segments extracted from the passage:
  Microsoft Corporation
  CEO
  Bill Gates
  Microsoft
  Gates
  Microsoft
  Bill Veghte
  Microsoft
  VP
  Richard Stallman
  founder
  Free Software Foundation
aka named entity extraction
16
IE in Context
Create ontology
Spider
Filter by relevance
IE
Segment Classify Associate Cluster
Database
Load DB
Query, Search
Document collection
Train extraction models
Data mine
Label training data
17
IE in Context: Formatting
18
IE in Context: Formatting
19
IE in Context: Coverage
Web site specific (formatting): Amazon.com book pages
Genre specific (layout): Resumes
20
IE in Context: Coverage
Wide, non-specific (language): University names
21
IE in Context: Complexity
22
IE in Context: Single Field/Record
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
Single entity (named entity extraction)
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut
Binary relationship
  Relation: Person-Title; Person: Jack Welch; Title: CEO
  Relation: Company-Location; Company: General Electric;
  Location: Connecticut
N-ary record
  Relation: Succession; Company: General Electric;
  Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
23
State of the Art
  • Named entity recognition from newswire text
  • Person, Location, Organization,
  • F1 in high 80s or low- to mid-90s
  • Binary relation extraction
  • Contained-in (Location1, Location2), Member-of
    (Person1, Organization1)
  • F1 in 60s or 70s or 80s
  • Web site structure recognition
  • Extremely accurate performance obtainable
  • Human effort (10min?) required on each site

24
IE Generations
  • Hand-Built Systems: Knowledge Engineering
    (1980s)
  • Rules written by hand
  • Require experts who understand both the systems
    and the domain
  • Iterative guess-test-tweak-repeat cycle
  • Automatic, Trainable Rule-Extraction Systems
    (1990s)
  • Rules discovered automatically using predefined
    templates, using automated rule learners
  • Require huge, labeled corpora (effort is just
    moved!)
  • Statistical Models (1997)
  • Use machine learning to learn which features
    indicate boundaries and types of entities.
  • Learning is usually supervised; may be partially
    unsupervised

25
IE Techniques
Lexicons
Abraham Lincoln was born in Kentucky.
member?
Alabama, Alaska, ..., Wisconsin, Wyoming
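
The lexicon test on this slide can be sketched as a simple set-membership check. This is only an illustrative sketch: the name `tag_locations` and the abbreviated `US_STATES` set are assumptions made here, and a real lexicon would list every state.

```python
# Hypothetical, abbreviated lexicon of US states. The slide shows only the
# first and last entries; "Kentucky" is included so the example has a match.
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}


def tag_locations(sentence, lexicon=US_STATES):
    """Return the tokens of `sentence` that are members of `lexicon`."""
    # Crude tokenization: drop full stops, then split on whitespace.
    tokens = sentence.replace(".", " ").split()
    return [token for token in tokens if token in lexicon]


matches = tag_locations("Abraham Lincoln was born in Kentucky.")
print(matches)
```

This sketch only handles single-token names; multi-word entries such as "New York" would need a longest-match lookup over token sequences.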
26
Trainable IE Systems
  • Pros
  • Annotating text is simpler and faster than writing
    rules.
  • Domain independent
  • Domain experts don't need to be linguists or
    programmers.
  • Learning algorithms ensure full coverage of
    examples.
  • Cons
  • Hand-crafted systems perform better, especially
    at hard tasks. (but this is changing)
  • Training data might be expensive to acquire
  • May need huge amount of training data
  • Hand-writing rules isn't that hard!!

27
MUC: Genesis of IE
  • DARPA funded significant efforts in IE in the
    early to mid 1990s.
  • Message Understanding Conference (MUC) was an
    annual event/competition where results were
    presented.
  • Focused on extracting information from news
    articles
  • Terrorist events
  • Industrial joint ventures
  • Company management changes
  • Information extraction was of particular interest
    to the intelligence community (CIA, NSA). (Note:
    early 90s)

28
MUC
  • Named entity
  • Person, Organization, Location
  • Co-reference
  • Clinton ↔ President Bill Clinton
  • Template element
  • Perpetrator, Target
  • Template relation
  • Incident
  • Multilingual

29
MUC: Typical Text
  • Bridgestone Sports Co. said Friday it has set up
    a joint venture in Taiwan with a local concern
    and a Japanese trading house to produce golf
    clubs to be shipped to Japan. The joint venture,
    Bridgestone Sports Taiwan Co., capitalized at 20
    million new Taiwan dollars, will start production
    of 20,000 iron and metal wood clubs a month

31
MUC Templates
  • Relationship: tie-up
  • Entities: Bridgestone Sports Co, a local concern, a
    Japanese trading house
  • Joint venture company: Bridgestone Sports Taiwan Co
  • Activity: ACTIVITY 1
  • Amount: NT$20,000,000

32
MUC Templates
  • ACTIVITY 1
  • Activity: Production
  • Company: Bridgestone Sports Taiwan Co
  • Product: Iron and metal wood clubs
  • Start Date: January 1990
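
One way to picture the two templates on these slides is as linked records. The sketch below is only an illustration: the class and field names (`TieUp`, `Activity`, `joint_venture`, and so on) are assumptions made here, not the official MUC template format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Activity:
    """Hypothetical record for the ACTIVITY template."""
    kind: str
    company: str
    product: str
    start_date: Optional[str] = None


@dataclass
class TieUp:
    """Hypothetical record for the tie-up (joint venture) template."""
    relationship: str
    entities: List[str]
    joint_venture: str
    amount: str
    activities: List[Activity] = field(default_factory=list)


activity_1 = Activity(
    kind="Production",
    company="Bridgestone Sports Taiwan Co",
    product="Iron and metal wood clubs",
    start_date="January 1990",
)

tie_up = TieUp(
    relationship="tie-up",
    entities=[
        "Bridgestone Sports Co",
        "a local concern",
        "a Japanese trading house",
    ],
    joint_venture="Bridgestone Sports Taiwan Co",
    # The passage gives the capitalization as 20 million new Taiwan dollars.
    amount="NT$20,000,000",
    activities=[activity_1],
)

print(tie_up.activities[0].product)
```

The Activity slot of the tie-up template points at a second template, which is why the sketch stores activities as a list of separate records rather than as plain strings.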

33
Example from FASTUS (1993)
36
1. Complex Words: Recognition of multi-words and
proper names
   e.g. "set up", "new Taiwan dollars"
2. Basic Phrases: Simple noun groups, verb groups
and particles
   e.g. "a Japanese trading house", "had set up"
3. Complex Phrases: Complex noun groups and verb
groups
4. Domain Events: Patterns for events of interest
to the application. Basic templates are to be
built.
5. Merging Structures: Templates from different
parts of the texts are merged if they provide
information about the same entity or event.
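
As a rough illustration of the first stage only (complex words), the sketch below merges known multi-word expressions into single tokens using a longest-match lookup. It is not FASTUS itself, which uses cascaded finite-state transducers; the function name `merge_multiwords` and the two-entry `MULTIWORDS` list (taken from the slide's examples) are assumptions.

```python
# Multi-word expressions taken from the examples on the slide.
MULTIWORDS = [("set", "up"), ("new", "Taiwan", "dollars")]


def merge_multiwords(tokens, multiwords=MULTIWORDS):
    """Merge known multi-word expressions into single tokens."""
    # Try longer expressions first so the longest match wins.
    by_length = sorted(multiwords, key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for expression in by_length:
            if tuple(tokens[i:i + len(expression)]) == expression:
                merged.append(" ".join(expression))
                i += len(expression)
                break
        else:
            # No expression starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "capitalized at 20 million new Taiwan dollars".split()
merged = merge_multiwords(tokens)
print(merged)
```

Later stages would then operate on the merged tokens, grouping them into noun and verb groups before event patterns are applied.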
37
Evaluating IE Accuracy
  • Always evaluate performance on independent,
    manually-annotated test data not used during
    system development.
  • Measure for each test document
  • Total number of correct extractions in the
    solution template: N
  • Total number of slot/value pairs extracted by the
    system: E
  • Number of extracted slot/value pairs that are
    correct (i.e. in the solution template): C
  • Compute average value of metrics adapted from IR
  • Recall = C/N
  • Precision = C/E
  • F-Measure = harmonic mean of recall and precision
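
The three metrics can be computed for one document as in the minimal sketch below, which treats the solution template and the system output as sets of slot/value pairs. The function name `evaluate` and the sample data (including the deliberately wrong "February 1990" value) are illustrative assumptions.

```python
def evaluate(solution, extracted):
    """Return (precision, recall, f_measure) for one test document."""
    solution, extracted = set(solution), set(extracted)
    n = len(solution)              # N: correct pairs in the solution template
    e = len(extracted)             # E: pairs extracted by the system
    c = len(solution & extracted)  # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


solution = {
    ("Activity", "Production"),
    ("Company", "Bridgestone Sports Taiwan Co"),
    ("Product", "Iron and metal wood clubs"),
    ("Start Date", "January 1990"),
}
# Hypothetical system output: two correct pairs and one wrong start date.
extracted = {
    ("Activity", "Production"),
    ("Company", "Bridgestone Sports Taiwan Co"),
    ("Start Date", "February 1990"),
}

precision, recall, f_measure = evaluate(solution, extracted)
print(precision, recall, f_measure)
```

Here N = 4, E = 3 and C = 2, so precision is 2/3, recall is 1/2 and the F-measure is 4/7. Per-document values would then be averaged over the test set.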

38
Named Entities
  • Named Entities
  • Person Name: Colin Powell, Frodo
  • Location Name: Middle East, Aiur
  • Organization: UN, DARPA
  • Domain Specific vs. Open Domain

39
Nymble (BBN Corporation)
  • State of the art system
  • Near-human performance: 90% accuracy
  • Statistical system
  • Approach Hidden Markov Model (HMM)

40
Nymble (BBN Corporation)
  • Noisy channel paradigm
  • Originally, entities were marked in the raw text
  • Post noisy channel, annotation is lost
  • Probability of most likely sequence of name
    classes (NC) given a sequence of words (W)
  • Pr(NC|W) = Pr(W,NC) / Pr(W)
  • since the a priori probability of the word
    sequence can be considered constant for any given
    sentence, maximize just the numerator
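
Maximizing the numerator Pr(W,NC) over all name-class sequences is usually done with the Viterbi algorithm. The sketch below uses a plain first-order HMM with three classes and hand-picked toy numbers; every probability, the `UNSEEN` floor and the function name `viterbi` are assumptions for illustration. Nymble's actual model is richer than this and is not reproduced here.

```python
import math

STATES = ["PERSON", "ORGANIZATION", "NOT-A-NAME"]

# Toy, hand-picked parameters (assumptions, not trained values).
START = {"PERSON": 0.3, "ORGANIZATION": 0.3, "NOT-A-NAME": 0.4}
TRANS = {
    "PERSON": {"PERSON": 0.5, "ORGANIZATION": 0.1, "NOT-A-NAME": 0.4},
    "ORGANIZATION": {"PERSON": 0.1, "ORGANIZATION": 0.5, "NOT-A-NAME": 0.4},
    "NOT-A-NAME": {"PERSON": 0.2, "ORGANIZATION": 0.2, "NOT-A-NAME": 0.6},
}
EMIT = {
    "PERSON": {"Bill": 0.4, "Gates": 0.4},
    "ORGANIZATION": {"Microsoft": 0.8},
    "NOT-A-NAME": {"said": 0.3, "of": 0.3},
}
UNSEEN = 0.01  # crude floor for words a class was never seen to emit


def emit_logp(state, word):
    return math.log(EMIT[state].get(word, UNSEEN))


def viterbi(words):
    """Return the name-class sequence NC that maximizes Pr(W, NC)."""
    # Each column maps state -> (best log score ending there, previous state).
    columns = [{s: (math.log(START[s]) + emit_logp(s, words[0]), None)
                for s in STATES}]
    for word in words[1:]:
        previous = columns[-1]
        column = {}
        for s in STATES:
            best_prev = max(
                STATES, key=lambda p: previous[p][0] + math.log(TRANS[p][s])
            )
            score = previous[best_prev][0] + math.log(TRANS[best_prev][s])
            column[s] = (score + emit_logp(s, word), best_prev)
        columns.append(column)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


tags = viterbi(["Bill", "Gates", "of", "Microsoft"])
print(tags)
```

Working in log space avoids numerical underflow on long sentences; Pr(W) never appears because it is the same for every candidate sequence.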

41
Nymble (BBN Corporation)
HMM states (from the diagram): Start of Sentence,
Person, Organization, five other name classes,
Not-A-Name, End of Sentence
42
Automatic Content Extraction
  • DARPA ACE Program
  • Identify Entities
  • Named: Bilbo, San Diego, UNICEF
  • Nominal: the president, the hobbit
  • Pronominal: she
  • Reference resolution
  • Clinton ↔ the president ↔ he

43
Question Answering
  • The over-used pipeline paradigm

Question → Question Analysis → Information Retrieval
→ Answer Extraction → Answer Merging → Answer
44
Question Answering
  • Feedback loops can be present for constraint
    relaxation purposes
  • Not all QA systems adhere to the pipeline
    architecture
  • Question answering flavors
  • Factoid vs. complex
  • "Who invented paper?" vs. "Which of Mr. Bush's
    friends are Black Sabbath fans?"
  • Closed vs. open domain

45
Answer Extraction
  • The over-used pipeline paradigm
  • Focus on open domain, factoid question answering

Question → Question Analysis → Information Retrieval
→ Answer Extraction → Answer Merging → Answer
46
Practical Issues
  • Web Spell Checking
  • Mispling
  • nucular
  • Infrequent forms
  • Niagra vs. Niagara, Filenes vs. Filene's
  • Google QA
  • Genome, video, games

47
Practical Issues
  • Traditional Information Extraction
  • Either expert built or statistical
  • Specific strategies for specific question types
  • Person Bio vs. Location question types
  • Ability to generalize to new questions and new
    question types

48
Practical Issues
  • Who invented Blah?
  • Blah was invented by PersonName
  • Blah was Verb by PersonName
  • where Verb is synonym to invented
  • Blah VerbPhrase by PersonName
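
The first two surface patterns above can be sketched with a regular expression. Everything named here is an assumption for illustration: the synonym list `INVENT_VERBS`, the capitalized-words approximation of PersonName, and the function `who_invented`. A real system would use a thesaurus such as WordNet for the verbs and an NE tagger for the name.

```python
import re

# Hypothetical list of verbs treated as synonyms of "invented".
INVENT_VERBS = ["invented", "created", "devised", "developed"]

# Crude stand-in for PersonName: one or more capitalized words.
PERSON_NAME = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"


def who_invented(thing, text):
    """Answer "Who invented <thing>?" with a "<thing> was Verb by PersonName" match."""
    verbs = "|".join(INVENT_VERBS)
    pattern = re.compile(
        rf"{re.escape(thing)} was (?:{verbs}) by {PERSON_NAME}"
    )
    match = pattern.search(text)
    return match.group("person") if match else None


answer = who_invented("Blah", "Blah was devised by Jane Doe last year.")
print(answer)
```

The more general "Blah VerbPhrase by PersonName" pattern is not covered by this sketch, since recognizing arbitrary verb phrases needs a part-of-speech tagger or parser.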

49
Popular Resources
  • Experts and/or Learning Algorithms
  • Gazetteers
  • NE taggers
  • Part-of-speech taggers
  • Parsers
  • WordNet
  • Stopword list
  • Stemmer