Information Extraction - PowerPoint PPT Presentation

Slides: 91
Provided by: raym112
1
Information Extraction
Make-up Class Tomorrow (Wed) 10:30-11:45 AM
BY 210 (next to the advising office)
  • (Several slides based on those by Ray Mooney,
    Cohen/McCallum (via Dan Weld's class))

2
Intended Use of Semantic Web?
  • Pages should be annotated with RDF triples, with
    links to RDF-S (or OWL) background ontology.
  • E.g. See Jim Hendler's page

3
Database vs. Semantic Web Inference (and the
Magellan Story)
  • Also templated extraction as undoing XML→HTML
    conversion. Templated extraction is by
    DOM patterns; unstructured extraction is (sort
    of) by grammar parse tree patterns. Grammar
    learning is mostly from positive examples.

To be added
Rinku Patel
4
Who will annotate the data?
  • Semantic web works if the users annotate their
    pages using some existing ontology (or their own
    ontology, but with mapping to other ontologies)
  • But users typically do not conform to standards..
  • and are not patient enough for delayed
    gratification
  • Three Solutions
  • 1. Intercede in the way pages are created (act as
    if you are helping them write web-pages)
  • What if we change the MS Frontpage/Claris
    Homepage so that they (slyly) add annotations?
  • E.g. The Mangrove project at U. Wash.
  • Help user in tagging their data (allow graphical
    editing)
  • Provide instant gratification by running services
    that use the tags.
  • 2. Collaborative tagging!
  • Folksonomies (look at Wikipedia article)
  • Flickr, Technorati, del.icio.us, etc.
  • CBIOC, ESP game etc.
  • Need to incentivize users to do the annotations..
  • 3. Automated information extraction (next topic)

5
Folksonomies: The good
  • Bottom-up approach to taxonomies/ontologies
  • In systems like Furl, Flickr and Del.icio.us...
    people classify their pictures/bookmarks/web
    pages with tags (e.g. wedding), and then the most
    popular tags float to the top (e.g. Flickr's tags
    or Del.icio.us on the right)....
  • Folksonomies can work well for certain kinds of
    information because they offer a small reward for
    using one of the popular categories (such as your
    photo appearing on a popular page). People who
    enjoy the social aspects of the system will
    gravitate to popular categories while still
    having the freedom to keep their own lists of
    tags.

6
Works best when many people tag the same info
7
Folksonomies: The bad
  • On the other hand, not hard to see a few reasons
    why a folksonomy would be less than ideal in a
    lot of cases
  • None of the current implementations have synonym
    control (e.g. "selfportrait" and "me" are
    distinct Flickr tags, as are "mac" and
    "macintosh" on Del.icio.us).
  • Also, there's a certain lack of precision
    involved in using simple one-word tags--like
    which Lance are we talking about?
  • And, of course, there's no hierarchy and the
    content types (bookmarks, photos) are fairly
    simple.
  • For indexing and library people, folksonomies are
    about as appealing as Wikipedia is to
    encyclopedia editors.
  • But.. there's some interesting stuff happening
    around them.

8
Mass Collaboration (Mice running the Earth)
  • The quality of the tags generated through
    folksonomies is notoriously hard to control
  • So, design mechanisms that ensure correctness of
    tags..
  • ESP game makes it fun to
  • CBIOC and Google Co-op restrict annotation
    privileges to trusted users..
  • It is hard to get people to tag things in which
    they don't have personal interest..
  • Find incentive structures..
  • ESP makes it a game with points
  • CBIOC and Google Co-op try to promise delayed
    gratification in terms of improved search later..

9
Who will annotate the data?
  • 3. Automated information extraction

Next Topic
10
Information Extraction (IE)
  • Identify specific pieces of information (data) in
    an unstructured or semi-structured textual
    document.
  • Transform unstructured information in a corpus of
    documents or web pages into a structured
    database.
  • Applied to different types of text
  • Newspaper articles
  • Web pages
  • Scientific articles
  • Newsgroup messages
  • Classified ads
  • Medical notes
  • Wikipedia (info boxes)..

11
Information Extraction vs. NLP?
  • Information extraction is attempting to find some
    of the structure and meaning in the hopefully
    template driven web pages.
  • As IE becomes more ambitious and text becomes
    more free form, then ultimately we have IE
    becoming equal to NLP.
  • Web does give one particular boost to NLP
  • Massive corpora..

12
MUC
  • DARPA funded significant efforts in IE in the
    early to mid 1990s.
  • Message Understanding Conference (MUC) was an
    annual event/competition where results were
    presented.
  • Focused on extracting information from news
    articles
  • Terrorist events
  • Industrial joint ventures
  • Company management changes
  • Information extraction of particular interest to
    the intelligence community (CIA, NSA).

13
Other Applications
  • Job postings
  • Newsgroups: Rapier from austin.jobs
  • Web pages: Flipdog
  • Job resumes
  • BurningGlass
  • Mohomine
  • Seminar announcements
  • Company information from the web
  • Continuing education course info from the web
  • University information from the web
  • Apartment rental ads
  • Molecular biology information from MEDLINE

14
Wikipedia Infoboxes..
  • Wikipedia has both unstructured text and
    structured info boxes..

Infobox
15
Sample Job Posting
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigpmrs@bilbo.reference.com>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in
generating software for PC-Based Voice Mail
systems. Experienced in C Programming. Must be
familiar with communicating with and controlling
voice cards preferable Dialogic, however,
experience with others such as Rhetorix and
Natural Microsystems is okay. Prefer 5 years or
more experience with PC Based Voice Mail, but
will consider as little as 2 years. Need to find
a Senior level person who can come on board and
pick up code with very little training. Present
Operating System is DOS. May go to OS-2 or UNIX
in future.

Please reply to:
Kim Anderson
AdNET
(901) 458-2888 fax
kimander@memphisonline.com
16
Extracted Job Template
computer_science_job
id: 56nigpmrs@bilbo.reference.com
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996
17
Amazon Book Description
. </td></tr> </table> <b class="sans">The Age of
Spiritual Machines: When Computers Exceed Human
Intelligence</b><br> <font face=verdana,arial,helvetica
size=-1> by <a href="/exec/obidos/search-handle-url/index=books&field-author=Kurzweil%2C%20Ray/002-6235079-4593641">
Ray Kurzweil</a><br> </font> <br>
<a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg">
<img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif"
width=90 height=140 align=left border=0></a>
<font face=verdana,arial,helvetica size=-1>
<span class="small"> <span class="small">
<b>List Price:</b> <span class=listprice>$14.95</span><br>
<b>Our Price: <font color=#990000>$11.96</font></b><br>
<b>You Save:</b> <font color=#990000><b>$2.99 </b> (20%)</font><br>
</span> <p> <br>
18
Extracted Book Template
Title: The Age of Spiritual Machines: When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
19
Extraction from Templated Text
  • Many web pages are generated automatically from
    an underlying database.
  • Therefore, the HTML structure of pages is fairly
    specific and regular (semi-structured).
  • However, output is intended for human
    consumption, not machine interpretation.
  • An IE system for such generated pages allows the
    web site to be viewed as a structured database.
  • An extractor for a semi-structured web site is
    sometimes referred to as a wrapper.
  • Process of extracting from such pages is
    sometimes referred to as screen scraping.

20
Templated Extraction using DOM Trees
  • Web extraction may be aided by first parsing web
    pages into DOM trees.
  • Extraction patterns can then be specified as
    paths from the root of the DOM tree to the node
    containing the text to extract.
  • May still need regex patterns to identify proper
    portion of the final CharacterData node.
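
A minimal sketch of path-based extraction, assuming a simplified page and using only Python's standard `html.parser`. The class name `PathExtractor`, the helper `extract`, and the sample `PAGE` are hypothetical; a real wrapper would also need positional steps (e.g. "second B under BODY") and a regex over the final text node.

```python
from html.parser import HTMLParser


class PathExtractor(HTMLParser):
    """Collect the text found at one exact root-to-node tag path."""

    def __init__(self, path):
        super().__init__()
        self.path = [tag.lower() for tag in path]
        self.stack = []     # tags currently open, root first
        self.matches = []   # text of every character-data node at `path`

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags like <br>.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack == self.path and data.strip():
            self.matches.append(data.strip())


def extract(html, path):
    parser = PathExtractor(path)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical, simplified version of a generated book page.
PAGE = (
    "<html><body>"
    "<b>Age of Spiritual Machines</b>"
    "<font>by <a>Ray Kurzweil</a></font>"
    "</body></html>"
)

title = extract(PAGE, ["html", "body", "b"])
author = extract(PAGE, ["html", "body", "font", "a"])
print(title, author)
```

The path is matched exactly, so the text "by" under FONT is not returned for the author path; only the text under A is.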

21
Sample DOM Tree Extraction
HTML (Element)
  HEADER
  BODY
    B
      CharacterData: Age of Spiritual Machines
    FONT
      CharacterData: by
      A
        CharacterData: Ray Kurzweil
Title: HTML→BODY→B→CharacterData
Author: HTML→BODY→FONT→A→CharacterData
22
Template Types
  • Slots in template typically filled by a substring
    from the document.
  • Some slots may have a fixed set of pre-specified
    possible fillers that may not occur in the text
    itself.
  • Terrorist act: threatened, attempted,
    accomplished.
  • Job type: clerical, service, custodial, etc.
  • Company type: SEC code
  • Some slots may allow multiple fillers.
  • Programming language
  • Some domains may allow multiple extracted
    templates per document.
  • Multiple apartment listings in one ad
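
One way to picture these slot kinds is as fields of a small record type. This is only an illustrative sketch; the names `JobType` and `JobTemplate` and the chosen fields are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class JobType(Enum):
    """A slot with a fixed set of pre-specified fillers."""
    CLERICAL = "clerical"
    SERVICE = "service"
    CUSTODIAL = "custodial"


@dataclass
class JobTemplate:
    title: Optional[str] = None          # filled by a substring of the document
    job_type: Optional[JobType] = None   # filler need not occur in the text
    languages: List[str] = field(default_factory=list)  # multiple fillers allowed


# A document may yield several templates, so an extractor returns a list.
templates = [JobTemplate(title="SOFTWARE PROGRAMMER", languages=["C"])]
print(templates[0])
```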

23
Simple Extraction Patterns
  • Specify an item to extract for a slot using a
    regular expression pattern.
  • Price pattern: \b\$\d+(\.\d{2})?\b
  • May require preceding (pre-filler) pattern to
    identify proper context.
  • Amazon list price:
  • Pre-filler pattern: <b>List Price:</b> <span
    class=listprice>
  • Filler pattern: \$\d+(\.\d{2})?\b
  • May require succeeding (post-filler) pattern to
    identify the end of the filler.
  • Amazon list price:
  • Pre-filler pattern: <b>List Price:</b> <span
    class=listprice>
  • Filler pattern: .+
  • Post-filler pattern: </span>
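
The pre-filler / filler / post-filler idea can be tried out with Python's `re` module. This is a sketch under assumptions: `SNIPPET` is a shortened stand-in for the Amazon page, `extract_filler` is a hypothetical helper, and the permissive filler is written as the lazy `.+?` so it stops at the first post-filler match.

```python
import re

# Shortened, hypothetical stand-in for the book page markup.
SNIPPET = (
    "<b>List Price:</b> <span class=listprice>$14.95</span><br> "
    "<b>Our Price: <font color=#990000>$11.96</font></b><br>"
)

# No leading \b here: "$" is not a word character, so \b before it would
# only match when the price directly follows a letter or digit.
PRICE = r"\$\d+(?:\.\d{2})?\b"


def extract_filler(text, pre, filler, post=""):
    """Return the first filler that follows `pre` and precedes `post`."""
    match = re.search(pre + "(" + filler + ")" + post, text)
    return match.group(1) if match else None


PRE = re.escape("<b>List Price:</b> <span class=listprice>")
POST = re.escape("</span>")

any_price = re.search(PRICE, SNIPPET).group(0)               # context-free match
list_price = extract_filler(SNIPPET, PRE, PRICE)             # pre-filler + filler
list_price_lazy = extract_filler(SNIPPET, PRE, r".+?", POST)  # filler ended by post-filler
our_price = extract_filler(
    SNIPPET, re.escape("<b>Our Price: <font color=#990000>"), PRICE
)
print(any_price, list_price, list_price_lazy, our_price)
```

Without the pre-filler, the bare price pattern simply returns whichever price comes first on the page, which is why context patterns are needed to tell the list price from the sale price.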

24
Simple Template Extraction
  • Extract slots in order, starting the search for
    the filler of the (n+1)th slot where the filler
    for the nth slot ended. Assumes slots always in
    a fixed order.
  • Title
  • Author
  • List price
  • Make patterns specific enough to identify each
    filler always starting from the beginning of the
    document.
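
A minimal sketch of the in-order strategy, assuming each slot pattern has one capture group for the filler. The function `fill_template`, the patterns in `SLOTS`, and the sample `PAGE` are hypothetical.

```python
import re


def fill_template(text, slot_patterns):
    """Fill slots in order; each search starts where the previous filler ended."""
    template = {}
    position = 0
    for slot, pattern in slot_patterns:
        match = re.compile(pattern).search(text, position)
        if match is None:
            template[slot] = None   # leave the slot empty, keep the position
            continue
        template[slot] = match.group(1).strip()
        position = match.end(1)
    return template


# Hypothetical, simplified page with the slots in a fixed order.
PAGE = (
    '<b class="sans">The Age of Spiritual Machines</b><br> by '
    '<a href="/author">Ray Kurzweil</a><br> '
    "<b>List Price:</b> <span class=listprice>$14.95</span>"
)

SLOTS = [
    ("title", r"<b\b[^>]*>(.+?)</b>"),
    ("author", r"<a\b[^>]*>(.+?)</a>"),
    ("list_price", r"<span class=listprice>(\$\d+(?:\.\d{2})?)</span>"),
]

book = fill_template(PAGE, SLOTS)
print(book)
```

Because the title pattern is only tried from the start of the page, it can stay very loose (any bold text); the same pattern run later in the page would pick up "List Price:" instead.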

25
Pre-Specified Filler Extraction
  • If a slot has a fixed set of pre-specified
    possible fillers, text categorization can be used
    to fill the slot.
  • Job category
  • Company type
  • Treat each of the possible values of the slot as
    a category, and classify the entire document to
    determine the correct filler.
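
As a rough illustration, the sketch below fills a job-category slot with a tiny multinomial naive Bayes text classifier. The choice of naive Bayes, the class `SlotClassifier`, and the toy training documents are all assumptions made for the example; any text categorization method could be substituted.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class SlotClassifier:
    """Multinomial naive Bayes over whole documents; one class per filler."""

    def fit(self, documents, fillers):
        self.doc_counts = Counter(fillers)
        self.word_counts = defaultdict(Counter)
        for doc, filler in zip(documents, fillers):
            self.word_counts[filler].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, document):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for filler, n_docs in self.doc_counts.items():
            counts = self.word_counts[filler]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            score = math.log(n_docs / total_docs)
            for word in tokenize(document):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best, best_score = filler, score
        return best


# Hypothetical training documents paired with the correct slot filler.
TRAIN = [
    ("answer phones and file office paperwork", "clerical"),
    ("type letters and schedule office meetings", "clerical"),
    ("serve customers at the restaurant counter", "service"),
    ("greet customers and serve food", "service"),
    ("clean floors and empty trash in the building", "custodial"),
    ("mop floors and clean restrooms nightly", "custodial"),
]

classifier = SlotClassifier().fit(*zip(*TRAIN))
job_type = classifier.predict("clean the building floors")
print(job_type)
```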

26
Learning for IE
  • Writing accurate patterns for each slot for each
    domain (e.g. each web site) requires laborious
    software engineering.
  • Alternative is to use machine learning
  • Build a training set of documents paired with
    human-produced filled extraction templates.
  • Learn extraction patterns for each slot using an
    appropriate machine learning algorithm.

34
Finding "Sweet Spots" in computer-mediated
cooperative work
Big Idea 1
  • It is possible to get by with techniques blithely
    ignorant of semantics, when you have humans in
    the loop
  • All you need is to find the right sweet spot,
    where the computer plays a pre-processing role
    and presents potential solutions
  • and the human very gratefully does the in-depth
    analysis on those few potential solutions
  • Examples
  • The incredible success of Bag of Words model!
  • Bag of letters would be a disaster :-)
  • Bag of sentences and/or NLP would be good
  • ..but only to your discriminating and irascible
    searchers :-)

35
Collaborative Computing, AKA Brain Cycle
Stealing, AKA Computizing Eyeballs
Big Idea 2
  • A lot of exciting research related to web
    currently involves co-opting the masses to help
    with large-scale tasks
  • It is like cycle stealing, except we are
    stealing human brain cycles (the most idle of
    the computers if there is ever one :-)
  • Remember the mice in the Hitchhiker's Guide to
    the Galaxy? (..who were running a mass-scale
    experiment on the humans to figure out the
    question..)
  • Collaborative knowledge compilation (Wikipedia!)
  • Collaborative Curation
  • Collaborative tagging
  • Paid collaboration/contracting
  • Many big open issues
  • How do you pose the problem such that it can be
    solved using collaborative computing?
  • How do you incentivize people into letting you
    steal their brain cycles?
  • Pay them! (Amazon mturk.com)

36
Tapping into the Collective Unconscious
Big Idea 3
  • Another thread of exciting research is driven by
    the realization that the Web is not random at all!
  • It is written by humans
  • so analyzing its structure and content allows us
    to tap into the collective unconscious ..
  • Meaning can emerge from syntactic notions such as
    co-occurrences and connectedness
  • Examples
  • Analyzing term co-occurrences in web-scale
    corpora to capture semantic information (today's
    paper)
  • Analyzing the link-structure of the web graph to
    discover communities
  • DoD and NSA are very much into this as a way of
    breaking terrorist cells
  • Analyzing the transaction patterns of customers
    (collaborative filtering)

37
Information Extraction from unstructured text
39
Information Extraction from Unstructured Text
  • Semantic web needs
  • Tagged data
  • Background knowledge
  • (blue sky approaches to) automate both
  • Knowledge Extraction
  • Extract base level knowledge (facts) directly
    from the web
  • Automated tagging
  • Start with a background ontology and tag other
    web pages
  • Semtag/Seeker

40
Fielded IE Systems: Citeseer, Google Scholar,
Libra. How do they do it? Why do they fail?
41
What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
NAME TITLE ORGANIZATION
Slides from Cohen & McCallum
42
What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
43
What is Information Extraction
As a family of techniques
Information Extraction = segmentation +
classification + clustering + association
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
47
IE in Context
Create ontology
Spider
Filter by relevance
Segment Classify
Associate Cluster
IE
Database
Load DB
Query, Search
Document collection
Train extraction models
Data mine
Label training data
48
IE History
  • Pre-Web
  • Mostly news articles
  • DeJong's FRUMP [1982]
  • Hand-built system to fill Schank-style scripts
    from news wire
  • Message Understanding Conference (MUC) DARPA
    ['87-'95], TIPSTER ['92-'96]
  • Most early work dominated by hand-built models
  • E.g. SRI's FASTUS, hand-built FSMs.
  • But by 1990s, some machine learning: Lehnert,
    Cardie, Grishman and then HMMs: Elkan [Leek '97],
    BBN [Bikel et al '98]
  • Web
  • AAAI '94 Spring Symposium on Software Agents
  • Much discussion of ML applied to Web. Maes,
    Mitchell, Etzioni.
  • Tom Mitchell's WebKB, '96
  • Build KBs from the Web.
  • Wrapper Induction
  • First by hand, then ML: [Doorenbos '96],
    [Soderland '96], [Kushmerick '97]

49
What makes IE from the Web Different?
Less grammar, but more formatting & linking
Newswire
Web
www.apple.com/retail
Apple to Open Its First Retail Store in New York
City MACWORLD EXPO, NEW YORK--July 17,
2002--Apple's first retail store in New York City
will open in Manhattan's SoHo district on
Thursday, July 18 at 8:00 a.m. EDT. The SoHo
store will be Apple's largest retail store to
date and is a stunning example of Apple's
commitment to offering customers the world's best
computer shopping experience. "Fourteen months
after opening our first retail store, our 31
stores are attracting over 100,000 visitors each
week," said Steve Jobs, Apple's CEO. "We hope our
SoHo store will surprise and delight both Mac and
PC users who want to see everything the Mac can
do to enhance their digital lifestyles."
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
The directory structure, link structure,
formatting & layout of the Web is its own new
grammar.
50
Landscape of IE Tasks (1/4): Pattern Feature
Domain
Text paragraphs without formatting
Grammatical sentences and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets, rich formatting & links
Tables
51
Landscape of IE Tasks (2/4): Pattern Scope
Web site specific: Formatting (e.g. Amazon.com Book Pages)
Genre specific: Layout (e.g. Resumes)
Wide, non-specific: Language (e.g. University Names)
52
Landscape of IE Tasks (3/4): Pattern Complexity
E.g. word patterns
Closed set: U.S. states
  He was born in Alabama
  The big Wyoming sky
Regular set: U.S. phone numbers
  Phone (413) 545-1323
  The CALD main office can be reached at 412-268-1299
Complex pattern: U.S. postal addresses
  University of Arkansas P.O. Box 140 Hope, AR 71802
  Headquarters 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
Ambiguous patterns, needing context and many sources of evidence: Person names
  was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
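
The two simplest cases, a closed set and a regular set, can be sketched with a lexicon lookup and a regular expression. The state lexicon below is deliberately truncated to the names used in this deck, and the phone pattern covers only the two formats shown in the examples; both are assumptions for illustration.

```python
import re

# Closed set: membership test against a lexicon (truncated for the example).
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Regular set: "(413) 545-1323" or "412-268-1299" style numbers only.
PHONE = re.compile(r"(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}")


def find_states(text):
    return [word for word in re.findall(r"[A-Za-z]+", text) if word in US_STATES]


def find_phones(text):
    return PHONE.findall(text)


states = find_states("He was born in Alabama") + find_states("The big Wyoming sky")
phones = find_phones("Phone (413) 545-1323") + find_phones(
    "The CALD main office can be reached at 412-268-1299"
)
print(states, phones)
```

Neither technique helps with the ambiguous case: a lexicon of first names would accept "Hope" in both "Hope, AR" and "Hope Feldman", which is why person names need context.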
53
Landscape of IE Tasks (4/4): Pattern Combinations
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
Single entity ("named entity" extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut
Binary relationship:
  Relation: Person-Title; Person: Jack Welch; Title: CEO
  Relation: Company-Location; Company: General Electric; Location: Connecticut
N-ary record:
  Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
54
Evaluation of Single Entity Extraction
TRUTH:
Michael Kearns and Sebastian Seung will start
Monday's tutorial, followed by Richard M. Karpe
and Martin Cooke.
PRED:
Michael Kearns and Sebastian Seung will start
Monday's tutorial, followed by Richard M. Karpe
and Martin Cooke.
\[ \text{Precision} = \frac{\#\ \text{correctly predicted segments}}{\#\ \text{predicted segments}} = \frac{2}{6} \]
\[ \text{Recall} = \frac{\#\ \text{correctly predicted segments}}{\#\ \text{true segments}} = \frac{2}{4} \]
\[ F_1 = \text{harmonic mean of Precision and Recall} = \frac{1}{\left(\frac{1}{P} + \frac{1}{R}\right) / 2} \]
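
The counts on this slide (2 correct out of 6 predicted and 4 true segments) can be checked with a short function. Segments are represented here as (start, end) token offsets, and the specific offsets are made up, since the highlighting that marked the segments on the slide is not in the transcript.

```python
def prf1(true_segments, predicted_segments):
    """Exact-match precision, recall and F1 over sets of (start, end) segments."""
    true_set, pred_set = set(true_segments), set(predicted_segments)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical offsets: 4 true segments, 6 predicted, 2 of them exactly right.
TRUE = [(0, 2), (3, 5), (11, 14), (15, 17)]
PRED = [(0, 2), (3, 4), (4, 5), (11, 13), (13, 14), (15, 17)]

precision, recall, f1 = prf1(TRUE, PRED)
print(precision, recall, f1)
```

With P = 2/6 and R = 2/4 the harmonic mean works out to 0.4. A predicted segment that only partly overlaps a true one earns no credit under this exact-match scoring.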
55
State of the Art Performance
  • Named entity recognition
  • Person, Location, Organization,
  • F1 in high 80s or low- to mid-90s
  • Binary relation extraction
  • Contained-in (Location1, Location2),
    Member-of (Person1, Organization1)
  • F1 in 60s or 70s or 80s
  • Wrapper induction
  • Extremely accurate performance obtainable
  • Human effort (30 min) required on each site

56
Landscape of IE Techniques (1/1): Models
Lexicons
Abraham Lincoln was born in Kentucky.
member? Alabama, Alaska, Wisconsin, Wyoming
and beyond
Any of these models can be used to capture words,
formatting or both.
57
Landscape: Focus of this Tutorial
Pattern complexity: closed set, regular, complex, ambiguous
Pattern feature domain: words, words + formatting, formatting
Pattern scope: site-specific, genre-specific, general
Pattern combinations: entity, binary, n-ary
Models: lexicon, regex, window, boundary, FSM, CFG
58
Sliding Windows
59
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines: inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
63
A Naïve Bayes Sliding Window Model
[Freitag 1997]
... 00 : pm  Place : Wean Hall Rm 5409  Speaker : Sebastian Thrun ...
w(t-m) ... w(t-1) | w(t) ... w(t+n) | w(t+n+1) ... w(t+n+m)
      prefix      |    contents     |        suffix
  • Estimate Pr(LOCATION | window) using Bayes
    rule
  • Try all reasonable windows (vary length,
    position)
  • Assume independence for length, prefix
    words, suffix words, content words
  • Estimate from data quantities like
    Pr("Place" in prefix | LOCATION)
  • If P("Wean Hall Rm 5409" = LOCATION) is above
    some threshold, extract it.
Other examples of sliding window: [Baluja et al
2000] (decision tree over individual words and
their context)
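The scoring step of the naïve Bayes window model can be sketched roughly as follows. This is a minimal illustration, not Freitag's implementation: the function names, the Laplace smoothing, the fixed prefix/suffix width `k`, and the `vocab_size` constant are all assumptions, and the sketch only computes the window likelihood (the class prior and the normalizing term of Bayes rule are omitted).

```python
import math
from collections import Counter

def train(examples, k=2):
    """Count words seen in the prefix, contents and suffix of labelled fields.

    examples: list of (tokens, start, end); the field is tokens[start:end].
    k: number of prefix/suffix tokens inspected (an illustrative choice).
    """
    counts = {name: Counter() for name in ("prefix", "contents", "suffix", "length")}
    for tokens, start, end in examples:
        counts["prefix"].update(tokens[max(0, start - k):start])
        counts["contents"].update(tokens[start:end])
        counts["suffix"].update(tokens[end:end + k])
        counts["length"][end - start] += 1
    return counts

def log_prob(counter, item, vocab_size=1000):
    """Laplace-smoothed log-probability of one item (vocab_size is assumed)."""
    return math.log((counter[item] + 1) / (sum(counter.values()) + vocab_size))

def score(counts, tokens, start, end, k=2):
    """Log-likelihood of window tokens[start:end], treating length,
    prefix words, content words and suffix words as independent."""
    total = log_prob(counts["length"], end - start)
    for w in tokens[max(0, start - k):start]:
        total += log_prob(counts["prefix"], w)
    for w in tokens[start:end]:
        total += log_prob(counts["contents"], w)
    for w in tokens[end:end + k]:
        total += log_prob(counts["suffix"], w)
    return total
```

In use, one would call `score` on every candidate window and extract those whose score exceeds a threshold tuned on held-out data.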
Slides from Cohen McCallum
64
Naïve Bayes Sliding Window Results
Domain: CMU UseNet Seminar Announcements
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines: inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
Field          F1
Person Name    30%
Location       61%
Start Time     98%
Slides from Cohen McCallum
65
Realistic sliding-window-classifier IE
  • What windows to consider?
  • all windows containing as many tokens as the
    shortest example, but no more tokens than the
    longest example
  • How to represent a classifier? It might:
  • Restrict the length of window
  • Restrict the vocabulary or formatting used
    before/after/inside window
  • Restrict the relative order of tokens, etc.
  • Learning Method
  • SRV: Top-Down Rule Learning [Freitag AAAI
    98]
  • Rapier: Bottom-Up [Califf & Mooney, AAAI
    99]
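The window-selection rule on this slide (every window at least as long as the shortest training example and no longer than the longest) can be sketched as a simple generator. The function name and the half-open `(start, end)` convention are illustrative assumptions.

```python
def candidate_windows(tokens, min_len, max_len):
    """Yield (start, end) for every window with min_len <= length <= max_len.

    The window covers tokens[start:end]; min_len and max_len would be the
    lengths of the shortest and longest fields seen in training.
    """
    for start in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(tokens):
                break  # longer windows from this start would also overflow
            yield start, end
```

Each yielded window is then handed to the classifier, which decides whether it is an instance of the target field.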

Slides from Cohen McCallum
66
Rapier results precision/recall
Slides from Cohen McCallum
67
Rapier results vs. SRV
Slides from Cohen McCallum
68
Rule-learning approaches to sliding-window
classification: Summary
  • SRV, Rapier, and WHISK [Soderland KDD 97]
  • Representations for classifiers allow restriction
    of the relationships between tokens, etc
  • Representations are carefully chosen subsets of
    even more powerful representations based on logic
    programming (ILP and Prolog)
  • Use of these heavyweight representations is
    complicated, but seems to pay off in results
  • Can simpler representations for classifiers work?

Slides from Cohen McCallum
69
BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]
  • Another formulation: learn three probabilistic
    classifiers
  • START(i) = Prob(position i starts a field)
  • END(j) = Prob(position j ends a field)
  • LEN(k) = Prob(an extracted field has length k)
  • Then score a possible extraction (i,j) by
  • START(i) * END(j) * LEN(j-i)
  • LEN(k) is estimated from a histogram
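A minimal sketch of the scoring rule START(i) * END(j) * LEN(j-i), assuming the START and END probabilities have already been produced by some boundary detectors and are passed in as plain lists. The function names are hypothetical, and the length convention j-i simply follows the slide.

```python
from collections import Counter

def length_histogram(field_lengths):
    """Estimate LEN(k) as the relative frequency of each training field length."""
    counts = Counter(field_lengths)
    total = len(field_lengths)
    return {k: c / total for k, c in counts.items()}

def span_score(start_prob, end_prob, len_hist, i, j):
    """START(i) * END(j) * LEN(j - i); unseen lengths get probability 0."""
    return start_prob[i] * end_prob[j] * len_hist.get(j - i, 0.0)

def best_extraction(start_prob, end_prob, len_hist):
    """Return the (i, j) pair with the highest score over all i <= j."""
    n = len(start_prob)
    candidates = [(i, j) for i in range(n) for j in range(i, n)]
    return max(candidates, key=lambda ij: span_score(start_prob, end_prob, len_hist, *ij))
```

Because the three factors are multiplied, a confident start and end boundary are still rejected when the implied length was never seen in training.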

Slides from Cohen McCallum
70
BWI: Learning to detect boundaries
  • BWI uses boosting to find detectors for START
    and END
  • Each weak detector has a BEFORE and AFTER pattern
    (on tokens before/after position i).
  • Each pattern is a sequence of
  • tokens and/or
  • wildcards like anyAlphabeticToken, anyNumber, ...
  • Weak learner for patterns uses greedy search
    (plus lookahead) to repeatedly extend a pair of
    empty BEFORE, AFTER patterns

Slides from Cohen McCallum
71
BWI: Learning to detect boundaries
Field          F1
Person Name    30%
Location       61%
Start Time     98%
Slides from Cohen McCallum
72
Problems with Sliding Windows and Boundary
Finders
  • Decisions in neighboring parts of the input are
    made independently from each other.
  • Naïve Bayes Sliding Window may predict a seminar
    end time before the seminar start time.
  • It is possible for two overlapping windows to
    both be above threshold.
  • In a Boundary-Finding system, left boundaries are
    laid down independently from right boundaries,
    and their pairing happens as a separate step.

Solution? Joint inference
Slides from Cohen McCallum
73
Extraction: Named Entity → Binary Relations
  • How to extend a sliding-window approach?

74
Snowball
75
Pattern Representation
  • Brittle candidate generation?
  • Can't extract if location mentioned before
    organization?
  • <Pat_left, Tag_1, Pat_mid, Tag_2, Pat_rt>
  • Tag_ is a named entity tag
  • Pat_ is a vector (in term space)
  • Degree of Match
  • Dependence on Alembic Tagger
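One way to picture the "degree of match" between two 5-tuples <Pat_left, Tag_1, Pat_mid, Tag_2, Pat_rt> is sketched below. It assumes the match is zero when the entity tags differ and otherwise the sum of the inner products of the left, middle and right term vectors; that definition, the dict-based sparse vectors, and the function names are assumptions made for illustration rather than details stated on the slide.

```python
def dot(u, v):
    """Inner product of two sparse term vectors stored as {term: weight}."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

def degree_of_match(p, q):
    """p and q are (left, tag1, mid, tag2, right) tuples.

    Returns 0.0 when the named-entity tags disagree; otherwise sums the
    similarity of the left, middle and right context vectors.
    """
    left_p, tag1_p, mid_p, tag2_p, right_p = p
    left_q, tag1_q, mid_q, tag2_q, right_q = q
    if tag1_p != tag1_q or tag2_p != tag2_q:
        return 0.0
    return dot(left_p, left_q) + dot(mid_p, mid_q) + dot(right_p, right_q)
```

The tag check is what makes the representation order-sensitive: a sentence that mentions the location before the organization cannot match a pattern learned from the opposite order.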

76
Generating & Evaluating Patterns
  • Generation of Candidate Patterns
  • Evaluation of Candidate Patterns
  • Selectivity vs Coverage vs Confidence (Precision)
  • Riloff's: Conf * log(Positive)
  • 2/2, 4/12

77
Evaluating Tuples
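  • $\mathrm{Conf}(T) = 1 - \prod_{i=0}^{|P|} \bigl(1 - \mathrm{Conf}(P_i) \cdot \mathrm{Match}(T, P_i)\bigr)$
  • $\mathrm{Conf}(P) = \mathrm{Conf}_n(P) \cdot W + \mathrm{Conf}_o(P) \cdot (1 - W)$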
  • Comments?
  • Simulated Annealing?
  • Discard poor tuples?
  • (vs not count as seeds)
  • Lower confidence of old tuples?
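The tuple confidence on this slide is a noisy-or over the patterns that generated the tuple: Conf(T) = 1 minus the product of (1 - Conf(P_i) * Match(T, P_i)), and the pattern confidence is a weighted blend of the new and old estimates. A small sketch follows; the function and argument names are illustrative.

```python
def tuple_confidence(pattern_confs, matches):
    """Conf(T) = 1 - prod_i (1 - Conf(P_i) * Match(T, P_i)).

    pattern_confs[i] is Conf(P_i); matches[i] is Match(T, P_i) in [0, 1].
    """
    remaining_doubt = 1.0
    for conf, match in zip(pattern_confs, matches):
        remaining_doubt *= 1.0 - conf * match
    return 1.0 - remaining_doubt

def updated_pattern_confidence(conf_new, conf_old, w):
    """Conf(P) = Conf_n(P) * W + Conf_o(P) * (1 - W)."""
    return conf_new * w + conf_old * (1.0 - w)
```

With two patterns of confidence 0.5 that both match perfectly, the tuple confidence is 0.75, so independent supporting patterns raise confidence without ever exceeding 1.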

78
Overall Algorithm
  • Relation to EM?
  • Relation to KnowItAll
  • Will it work for the long tail?
  • Tagging vs Full NLP
  • Synonyms
  • Negative Examples
  • General Relations vs Functions (Keys)

79
Evaluation
  • Effect of Seed Quality
  • Effect of Seed Quantity
  • Other Domains
  • Shouldn't this experiment be easy?
  • Ease of Use
  • Training Examples vs Parameter Tweaking

80
Contributions
  • Techniques for Pattern Generation
  • Strategies for Evaluating Patterns & Tuples
  • Evaluation Methodology & Metrics

81
References
  • [Bikel et al 1997] Bikel, D.; Miller, S.;
    Schwartz, R.; and Weischedel, R. Nymble: a
    high-performance learning name-finder. In
    Proceedings of ANLP'97, p194-201.
  • [Califf & Mooney 1999] Califf, M.E.; Mooney, R.
    Relational Learning of Pattern-Match Rules for
    Information Extraction. In Proceedings of the
    Sixteenth National Conference on Artificial
    Intelligence (AAAI-99).
  • [Cohen, Hurst & Jensen 2002] Cohen, W.; Hurst,
    M.; Jensen, L. A flexible learning system for
    wrapping tables and lists in HTML documents.
    Proceedings of The Eleventh International World
    Wide Web Conference (WWW-2002).
  • [Cohen, Kautz & McAllester 2000] Cohen, W.; Kautz,
    H.; McAllester, D. Hardening soft information
    sources. Proceedings of the Sixth International
    Conference on Knowledge Discovery and Data Mining
    (KDD-2000).
  • [Cohen 1998] Cohen, W. Integration of
    Heterogeneous Databases Without Common Domains
    Using Queries Based on Textual Similarity. In
    Proceedings of ACM SIGMOD-98.
  • [Cohen 2000a] Cohen, W. Data Integration using
    Similarity Joins and a Word-based Information
    Representation Language. ACM Transactions on
    Information Systems, 18(3).
  • [Cohen 2000b] Cohen, W. Automatically Extracting
    Features for Concept Learning from the Web.
    Machine Learning: Proceedings of the Seventeenth
    International Conference (ML-2000).
  • [Collins & Singer 1999] Collins, M.; and Singer,
    Y. Unsupervised models for named entity
    classification. In Proceedings of the Joint
    SIGDAT Conference on Empirical Methods in Natural
    Language Processing and Very Large Corpora, 1999.
  • [De Jong 1982] De Jong, G. An Overview of the
    FRUMP System. In Lehnert, W. & Ringle, M. H.
    (eds), Strategies for Natural Language
    Processing. Lawrence Erlbaum, 1982, 149-176.
  • [Freitag 1998] Freitag, D. Information extraction
    from HTML: application of a general machine
    learning approach. Proceedings of the Fifteenth
    National Conference on Artificial Intelligence
    (AAAI-98).
  • [Freitag 1999] Freitag, D. Machine Learning
    for Information Extraction in Informal Domains.
    Ph.D. dissertation, Carnegie Mellon University.
  • [Freitag 2000] Freitag, D. Machine Learning for
    Information Extraction in Informal Domains.
    Machine Learning 39(2/3): 99-101 (2000).
  • [Freitag & Kushmerick 2000] Freitag, D.;
    Kushmerick, N. Boosted Wrapper Induction.
    Proceedings of the Seventeenth National Conference
    on Artificial Intelligence (AAAI-2000).
  • [Freitag & McCallum 1999] Freitag, D. and
    McCallum, A. Information extraction using HMMs
    and shrinkage. In Proceedings AAAI-99 Workshop
    on Machine Learning for Information Extraction.
    AAAI Technical Report WS-99-11.
  • [Kushmerick 2000] Kushmerick, N. Wrapper
    Induction: efficiency and expressiveness.
    Artificial Intelligence, 118 (pp 15-68).
  • [Lafferty, McCallum & Pereira 2001] Lafferty,
    J.; McCallum, A.; and Pereira, F. Conditional
    Random Fields: Probabilistic Models for
    Segmenting and Labeling Sequence Data. In
    Proceedings of ICML-2001.
  • [Leek 1997] Leek, T. R. Information extraction
    using hidden Markov models. Master's thesis, UC
    San Diego.
  • [McCallum, Freitag & Pereira 2000] McCallum, A.;
    Freitag, D.; and Pereira, F. Maximum entropy
    Markov models for information extraction and
    segmentation. In Proceedings of ICML-2000.
  • [Miller et al 2000] Miller, S.; Fox, H.;
    Ramshaw, L.; Weischedel, R. A Novel Use of
    Statistical Parsing to Extract Information from
    Text. Proceedings of the 1st Annual Meeting of
    the North American Chapter of the ACL (NAACL), p.
    226-233.

Slides from Cohen McCallum
82
More Ambitious (Blue Sky) Approaches
  • Semantic web needs
  • Tagged data
  • Background knowledge
  • (blue sky approaches to) automate both
  • Knowledge Extraction
  • Extract base level knowledge (facts) directly
    from the web
  • Automated tagging
  • Start with a background ontology and tag other
    web pages
  • Semtag/Seeker
  • The information extraction tasks in fielded
    applications like Citeseer/Libra are narrowly
    focused
  • We assume that we are learning specific relations
    (e.g. author/title etc)
  • We assume that the extracted relations will be
    put in a database for db-style look-up

Let's look at the state of the feasible art before
going to blue sky..
83
Extraction from Free Text involves Natural
Language Processing
Analogy to regex patterns on DOM trees for
structured text
  • If extracting from automatically generated web
    pages, simple regex patterns usually work.
  • If extracting from more natural, unstructured,
    human-written text, some NLP may help.
  • Part-of-speech (POS) tagging
  • Mark each word as a noun, verb, preposition, etc.
  • Syntactic parsing
  • Identify phrases: NP, VP, PP
  • Semantic word categories (e.g. from WordNet)
  • KILL: kill, murder, assassinate, strangle,
    suffocate
  • Off-the-shelf software available to do this!
  • The Brill tagger
  • Extraction patterns can use POS or phrase tags.

84
I. Generate-n-Test Architecture
  • Generic extraction patterns (Hearst 92)
  • "...Cities such as Boston, Los Angeles, and
    Seattle..."

("C such as NP1, NP2, and NP3") =>
IS-A(each(head(NP)), C), ...
Template-driven extraction (where the template is
in terms of the syntax tree)
  • "Detailed information for several countries such
    as maps, ..." (hence the constraint
    ProperNoun(head(NP)))
  • "I listen to pretty much all music but prefer
    country such as Garth Brooks"
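The "C such as NP1, NP2, and NP3" pattern can be approximated with a regular expression. The sketch below is deliberately crude: it stands in for real noun-phrase chunking by splitting on commas and "and", and it approximates the ProperNoun(head(NP)) constraint by requiring every word of an item to be capitalized. The names and both approximations are assumptions for illustration.

```python
import re

# "<class word> such as <list of items>" up to the end of the clause.
SUCH_AS = re.compile(r"\b(\w+) such as ([^.;]+)")
# Stand-in for ProperNoun(head(NP)): every word starts with a capital letter.
PROPER = re.compile(r"^[A-Z][\w'-]*(?: [A-Z][\w'-]*)*$")

def hearst_is_a(sentence):
    """Return candidate (instance, class) pairs found in the sentence."""
    m = SUCH_AS.search(sentence)
    if not m:
        return []
    cls = m.group(1)
    items = re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group(2).strip())
    return [(item.strip(), cls) for item in items if PROPER.match(item.strip())]
```

The capitalization filter rejects "maps" in the countries sentence, but the Garth Brooks sentence still produces a wrong candidate (Garth Brooks IS-A country), which is why the generate step is followed by a test step.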

85
Test
Assess candidate extractions using Mutual
Information (PMI-IR) (Turney 01).
Many variations are possible
86
..but many things indicate "cityness"
Discriminator phrases f_i: "x is a city", "x has
a population of", "x is the capital of y", "x's
baseball team", ...
  • PMI = frequency of I & D co-occurrence
  • 5-50 discriminators D_i
  • Each PMI for D_i is a feature f_i
  • Naïve Bayes evidence combination

Keep the probabilities with the extracted facts.
PMI is used for feature selection. NBC is used
for learning. Hits are used for assessing PMI as
well as conditional probabilities.
87
Assessment In Action
  1. I = Yakima (1,340,000 hits)
  2. D = <class name>
  3. I+D = Yakima city (2,760 hits)
  4. PMI = (2760 / 1.34M) ≈ 0.002
  • I = Avocado (1,000,000 hits)
  • I+D = Avocado city (10 hits)
  • PMI = 0.00001 << 0.002
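The arithmetic on this slide is just a ratio of search-engine hit counts: hits for the instance together with the discriminator, divided by hits for the instance alone. A minimal sketch, with a hypothetical function name and the hit counts hard-coded rather than fetched from a search engine:

```python
def pmi_score(hits_instance_and_discriminator, hits_instance):
    """Hit-count ratio used as the PMI score for one discriminator."""
    if hits_instance == 0:
        return 0.0  # no evidence at all for this instance
    return hits_instance_and_discriminator / hits_instance

yakima = pmi_score(2760, 1_340_000)    # "Yakima city" vs. "Yakima"
avocado = pmi_score(10, 1_000_000)     # "Avocado city" vs. "Avocado"
```

Here `yakima` comes out at roughly 0.002 and `avocado` at 0.00001, about two hundred times smaller, which is the gap the assessor relies on.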

88
Some Sources of ambiguity
  • Time: "Clinton is the president" (in 1996).
  • Context: common misconceptions..
  • Opinion: Elvis
  • Multiple word senses: Amazon, Chicago, Chevy
    Chase, etc.
  • Dominant senses can mask recessive ones!
  • Approach: unmasking. Chicago City

89
Chicago
City
Movie
90
Chicago Unmasked
City sense
Movie sense
91
Impact of Unmasking on PMI
Name          Recessive   Original   Unmask   Boost
Washington    city        0.50       0.99     96%
Casablanca    city        0.41       0.93     127%
Chevy Chase   actor       0.09       0.58     512%
Chicago       movie       0.02       0.21     972%
92
CBioC Collaborative Bio-Curation
  • Motivation
  • To help get information nuggets of articles and
    abstracts and store in a database.
  • The challenge is that the number of articles is
    huge and keeps growing, and that processing them
    requires handling natural language.
  • The two existing approaches
  • human curation and use of automatic information
    extraction systems
  • They are not able to meet the challenge, as the
    first is expensive, while the second is
    error-prone.

93
CBioC (contd)
  • Approach We propose a solution that is
    inexpensive, and that scales up.
  • Our approach takes advantage of automatic
    information extraction methods as a starting
    point,
  • Based on the premise that if there are a lot of
    articles, then there must be a lot of readers and
    authors of these articles.
  • We provide a mechanism by which the readers of
    the articles can participate and collaborate in
    the curation of information.
  • We refer to our approach as "Collaborative
    Curation".

94
Using the C-BioCurator System (contd)
95
What is the main difference between KnowItAll and
CBioC?
Assessment: KnowItAll does it by search-engine hit
counts; CBioC does it by voting.
96
Annotation
  • The Chicago Bulls announced yesterday that
    Michael Jordan will...
  • The <resource ref="http://tap.stanford.edu/
    BasketballTeam_Bulls">Chicago Bulls</resource>
    announced yesterday that <resource ref=
    "http://tap.stanford.edu/AthleteJordan,_Michael">
    Michael Jordan</resource> will...

97
Semantic Annotation
Named Entity Identification
The simplest task of metadata extraction on NL
is to establish a type relation between entities
in the NL resources and concepts in ontologies.
Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt
98
Semantics
  • Semantic Annotation
  • - The content of annotation consists of some
    rich semantic information
  • - Targeted not only at human readers of
    resources but also at software agents
  • - Formal: metadata following structural
    standards
  • - Informal: personal notes written in the
    margin while reading an article
  • - Explicit: carry sufficient information for
    interpretation
  • - Tacit: many personal annotations
    (telegraphic and incomplete)

http://www-scf.usc.edu/csci586/slides/6
99
Uses of Annotation
http://www-scf.usc.edu/csci586/slides/8
100
Objectives of Annotation
  • Generate Metadata for existing information
  • e.g., author-tag in HTML
  • RDF descriptions to HTML
  • Content description to Multimedia files
  • Employ metadata for
  • Improved search
  • Navigation
  • Presentation
  • Summarization of contents

http://www.aifb.uni-karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf
101
Annotation
  • Current practice of annotation for knowledge
    identification and extraction

Reduce burden of text annotation for Knowledge
Management
www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt
102
SemTag & Seeker
  • WWW-03 Best Paper Prize
  • Seeded with TAP ontology (72k concepts)
  • And 700 human judgments
  • Crawled 264 million web pages
  • Extracted 434 million semantic tags
  • Automatically disambiguated

103
SemTag
  • Research project: IBM
  • Very large scale: largest to date
  • 264 million web pages
  • Goal to provide early set of widespread semantic
    tags through automated generation

104
SemTag
  • Uses broad, shallow knowledge base
  • TAP lexical and taxonomic information about
    popular objects
  • Music
  • Movies
  • Sports
  • Etc.

105
SemTag
  • Problem
  • No write access to original document, so how do
    you annotate?
  • Solution
  • Store annotations in a web-available database

106
SemTag
  • Semantic Label Bureau
  • Separate store of semantic annotation information
  • HTTP server that can be queried for annotation
    information
  • Example
  • Find all semantic tags for a given document
  • Find all semantic tags for a particular object

107
SemTag
  • Methodology

108
SemTag
  • Three phases
  • Spotting Pass
  • Tokenize the document
  • All instances plus 20 word window
  • Learning Pass
  • Find corpus-wide distribution of terms at each
    internal node of taxonomy
  • Based on a representative sample
  • Tagging Pass
  • Scan windows to disambiguate each reference
  • Finally determined to be a TAP object

109
SemTag
  • Another problem magnified by the scale
  • Ambiguity Resolution
  • Two fundamental categories of ambiguities
  • Some labels appear at multiple locations
  • Some entities have labels that occur in contexts
    that have no representative in the taxonomy

110
SemTag
  • Solution
  • Taxonomy Based Disambiguation (TBD)
  • TBD expectation
  • Human tuned parameters used in small, critical
    sections
  • Automated approaches deal with bulk of
    information

111
SemTag
  • TBD methodology
  • Each node in the taxonomy is associated with a
    set of labels
  • Cats, Football, Cars all contain "jaguar"
  • Each label in the text is stored with a window of
    20 words: the context
  • Each node has an associated similarity function
    mapping a context to a similarity
  • Higher similarity → more likely to contain a
    reference

112
SemTag
  • Similarity
  • Built a 200,000 word lexicon (200,100 most common
    minus the 100 most common)
  • 200,000 dimensional vector space
  • Training: spots (label, context) with the correct
    node
  • Estimated the distribution of terms for nodes
  • Standard cosine similarity
  • TFIDF vectors (context vs. node)
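The similarity step (cosine between the TF-IDF vector of a context window and the TF-IDF vector of a taxonomy node) can be sketched as below. This is a generic textbook formulation, not SemTag's actual code: the function names, the log(N/df) weighting and the toy document frequencies are assumptions.

```python
import math
from collections import Counter

def tfidf(tokens, doc_freq, n_docs):
    """Sparse TF-IDF vector {term: tf * log(n_docs / df)}; terms missing
    from doc_freq (outside the lexicon) are dropped."""
    tf = Counter(tokens)
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items() if doc_freq.get(t)}

def cosine(u, v):
    """Standard cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

For a spot such as "jaguar", the context vector would be compared against each candidate node (Cats, Football, Cars) and the node with the highest cosine is the most likely referent.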

113
SemTag
Is a context c appropriate for a node v?
References inside the taxonomy vs. references
outside the taxonomy
Multiple nodes b r ? b !
p(v)
114
SemTag
  • Some internal nodes very popular
  • Associate a measurement of how accurate Sim is
    likely to be at a node
  • Also, how ambiguous the node is overall
    (consistency of human judgment)
  • TBD Algorithm returns 1 or 0 to indicate
    whether a particular context c is on topic for a
    node v
  • 82% accuracy on 434 million spots

115
SemTag
116
Summary
  • Information extraction can be motivated either as
    explicating more structure from the data or as an
    automated route to the Semantic Web
  • Extraction complexity depends on whether the text
    you have is templated or free-form
  • Extraction from templated text can be done by
    regular expressions
  • Extraction from free form text requires NLP
  • Can be done in terms of parts-of-speech-tagging
  • Annotation involves connecting terms in a free
    form text to items in the background knowledge
  • It too can be automated