Information Extraction - PowerPoint PPT Presentation

Provided by: raym112
1
Information Extraction
  • (Slides based on those by Ray Mooney, Craig
    Knoblock, Dan Weld and Perry)

2
Information Extraction (IE)
  • Identify specific pieces of information (data) in
    an unstructured or semi-structured textual
    document.
  • Transform unstructured information in a corpus of
    documents or web pages into a structured
    database.
  • Applied to different types of text
  • Newspaper articles
  • Web pages
  • Scientific articles
  • Newsgroup messages
  • Classified ads
  • Medical notes

3
MUC
  • DARPA funded significant efforts in IE in the
    early to mid 1990s.
  • Message Understanding Conference (MUC) was an
    annual event/competition where results were
    presented.
  • Focused on extracting information from news
    articles
  • Terrorist events
  • Industrial joint ventures
  • Company management changes
  • Information extraction of particular interest to
    the intelligence community (CIA, NSA).

4
Other Applications
  • Job postings
  • Newsgroups: Rapier from austin.jobs
  • Web pages: Flipdog
  • Job resumes
  • BurningGlass
  • Mohomine
  • Seminar announcements
  • Company information from the web
  • Continuing education course info from the web
  • University information from the web
  • Apartment rental ads
  • Molecular biology information from MEDLINE

5
Sample Job Posting
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigpmrs@bilbo.reference.com>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in
generating software for PC-Based Voice Mail systems.
Experienced in C Programming. Must be familiar with
communicating with and controlling voice cards preferable
Dialogic, however, experience with others such as Rhetorix and
Natural Microsystems is okay. Prefer 5 years or more experience
with PC Based Voice Mail, but will consider as little as 2 years.
Need to find a Senior level person who can come on board and
pick up code with very little training. Present Operating System
is DOS. May go to OS-2 or UNIX in future.

Please reply to:
Kim Anderson
AdNET
(901) 458-2888 fax
kimander@memphisonline.com
6
Extracted Job Template
computer_science_job
id: 56nigpmrs@bilbo.reference.com
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996
7
Amazon Book Description
. </td></tr> </table> <b class="sans">The Age of
Spiritual Machines : When Computers Exceed Human
Intelligence</b><br> <font face=verdana,arial,helvetica size=-1>
by <a href="/exec/obidos/search-handle-url/index=books&field-author=Kurzweil%2C%20Ray/002-6235079-4593641">
Ray Kurzweil</a><br> </font> <br>
<a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg">
<img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif"
width=90 height=140 align=left border=0></a>
<font face=verdana,arial,helvetica size=-1>
<span class="small"> <span class="small">
<b>List Price:</b> <span class=listprice>$14.95</span><br>
<b>Our Price: <font color=#990000>$11.96</font></b><br>
<b>You Save:</b> <font color=#990000><b>$2.99 </b> (20%)</font><br>
</span> <p> <br>
8
Extracted Book Template
Title: The Age of Spiritual Machines :
       When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
9
Web Extraction
  • Many web pages are generated automatically from
    an underlying database.
  • Therefore, the HTML structure of pages is fairly
    specific and regular (semi-structured).
  • However, output is intended for human
    consumption, not machine interpretation.
  • An IE system for such generated pages allows the
    web site to be viewed as a structured database.
  • An extractor for a semi-structured web site is
    sometimes referred to as a wrapper.
  • Process of extracting from such pages is
    sometimes referred to as screen scraping.

10
Web Extraction using DOM Trees
  • Web extraction may be aided by first parsing web
    pages into DOM trees.
  • Extraction patterns can then be specified as
    paths from the root of the DOM tree to the node
    containing the text to extract.
  • May still need regex patterns to identify proper
    portion of the final CharacterData node.

11
Sample DOM Tree Extraction
[Figure: DOM tree with Element nodes HTML, HEADER, BODY, B, FONT and A, and Character-Data leaves "Age of Spiritual Machines", "by" and "Ray Kurzweil"]
Title: HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData
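
A minimal sketch of path-based extraction, assuming Python's standard `html.parser` module rather than a full DOM library. The class `PathExtractor`, the helper `extract` and the sample `PAGE` are hypothetical names made up for illustration; the two paths are the Title and Author paths from the slide.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathExtractor(HTMLParser):
    """Collect text whose chain of enclosing tags equals a given root-to-node path."""

    def __init__(self, path):
        super().__init__()
        self.path = [tag.lower() for tag in path]
        self.stack = []    # currently open elements, outermost first
        self.matches = []  # text found at the target path

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack == self.path:
            self.matches.append(text)


def extract(html, path):
    parser = PathExtractor(path)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical page shaped like the sample tree.
PAGE = """<html><head><title>Book</title></head><body>
<b>Age of Spiritual Machines</b>
<font>by <a href="#">Ray Kurzweil</a></font>
</body></html>"""

print(extract(PAGE, ["html", "body", "b"]))
print(extract(PAGE, ["html", "body", "font", "a"]))
```

On real pages the text node at the end of a path often contains more than the filler, which is where a regex over the final CharacterData node would still be needed.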
12
Template Types
  • Slots in template typically filled by a substring
    from the document.
  • Some slots may have a fixed set of pre-specified
    possible fillers that may not occur in the text
    itself.
  • Terrorist act: threatened, attempted,
    accomplished.
  • Job type: clerical, service, custodial, etc.
  • Company type: SEC code
  • Some slots may allow multiple fillers.
  • Programming language
  • Some domains may allow multiple extracted
    templates per document.
  • Multiple apartment listings in one ad

13
Simple Extraction Patterns
  • Specify an item to extract for a slot using a
    regular expression pattern.
  • Price pattern: \$\d+(\.\d{2})?\b
  • May require preceding (pre-filler) pattern to
    identify proper context.
  • Amazon list price:
  • Pre-filler pattern: <b>List Price:</b> <span
    class=listprice>
  • Filler pattern: \$\d+(\.\d{2})?\b
  • May require succeeding (post-filler) pattern to
    identify the end of the filler.
  • Amazon list price:
  • Pre-filler pattern: <b>List Price:</b> <span
    class=listprice>
  • Filler pattern: .+
  • Post-filler pattern: </span>
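
The pre-filler / filler / post-filler idea can be sketched with Python's `re` module. This is an illustration only: the constant names, the function `extract_list_price` and the `SNIPPET` string are assumptions, and `\s*` is added between the two pre-filler tags to tolerate whitespace differences.

```python
import re

# The three parts of the pattern, following the Amazon list price example.
PRE_FILLER = r"<b>List Price:</b>\s*<span class=listprice>"
FILLER = r"\$\d+(?:\.\d{2})?"
POST_FILLER = r"</span>"

# Only the filler is captured; the pre- and post-filler just anchor the context.
LIST_PRICE = re.compile(f"{PRE_FILLER}({FILLER}){POST_FILLER}")


def extract_list_price(html):
    """Return the list price string, or None if the context is not found."""
    match = LIST_PRICE.search(html)
    return match.group(1) if match else None


SNIPPET = ('<b>List Price:</b> <span class=listprice>$14.95</span><br> '
           '<b>Our Price: <font color=#990000>$11.96</font></b><br>')

print(extract_list_price(SNIPPET))  # $14.95
```

Without the pre-filler, the bare price pattern would match "$11.96" and "$14.95" alike, which is why the context is needed.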

14
Simple Template Extraction
  • Extract slots in order, starting the search for
    the filler of slot n+1 where the filler for
    slot n ended. Assumes slots always appear in a
    fixed order.
  • Title
  • Author
  • List price
  • Make patterns specific enough to identify each
    filler always starting from the beginning of the
    document.
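
A rough sketch of the first strategy (extract slots in order, resuming the search where the previous filler ended). The slot patterns, `extract_template` and the sample `PAGE` are hypothetical and only loosely modeled on the Amazon page shown earlier.

```python
import re

# Hypothetical slot patterns, listed in the order the slots appear on the page.
SLOT_PATTERNS = [
    ("title", re.compile(r'<b class="sans">(.+?)</b>')),
    ("author", re.compile(r'by <a href="[^"]*">\s*(.+?)</a>')),
    ("list_price", re.compile(r"<span class=listprice>(\$\d+(?:\.\d{2})?)</span>")),
]


def extract_template(document):
    """Fill slots left to right, starting each search where the last match ended."""
    template, position = {}, 0
    for slot, pattern in SLOT_PATTERNS:
        match = pattern.search(document, position)
        if match is None:
            template[slot] = None  # leave the slot empty and keep the cursor
            continue
        template[slot] = match.group(1).strip()
        position = match.end()
    return template


PAGE = ('<b class="sans">The Age of Spiritual Machines</b><br> '
        'by <a href="/author/kurzweil"> Ray Kurzweil</a><br> '
        '<b>List Price:</b> <span class=listprice>$14.95</span><br>')

print(extract_template(PAGE))
```

Because the cursor only moves forward, a slot that appears out of order is simply missed, which is the fixed-order assumption stated above.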

15
Pre-Specified Filler Extraction
  • If a slot has a fixed set of pre-specified
    possible fillers, text categorization can be used
    to fill the slot.
  • Job category
  • Company type
  • Treat each of the possible values of the slot as
    a category, and classify the entire document to
    determine the correct filler.
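
One way to picture this is a small bag-of-words Naive Bayes classifier whose categories are the possible slot values. The slides do not say which classifier to use, so Naive Bayes is an assumption here, and the class name and the tiny training set are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesSlotFiller:
    """Multinomial Naive Bayes over whole documents; each category is a slot value."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of documents
        self.vocabulary = set()

    def train(self, document, category):
        words = tokenize(document)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, document):
        total_docs = sum(self.doc_counts.values())
        words = tokenize(document)
        best_category, best_score = None, -math.inf
        for category, counts in self.word_counts.items():
            # Log prior plus log likelihoods with add-one smoothing.
            score = math.log(self.doc_counts[category] / total_docs)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Invented training documents for the "job type" slot.
classifier = NaiveBayesSlotFiller()
classifier.train("filing typing data entry office assistant", "clerical")
classifier.train("receptionist answering phones and filing paperwork", "clerical")
classifier.train("waiter serving customers in busy restaurant", "service")
classifier.train("customer service representative for retail store", "service")
classifier.train("janitor cleaning floors and emptying trash", "custodial")
classifier.train("night cleaning crew for office building", "custodial")

print(classifier.classify("Seeking office assistant for typing and filing"))
```

The filler ("clerical") never has to appear in the posting itself, which is the point of treating the slot as a classification problem.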

16
Learning for IE
  • Writing accurate patterns for each slot for each
    domain (e.g. each web site) requires laborious
    software engineering.
  • Alternative is to use machine learning
  • Build a training set of documents paired with
    human-produced filled extraction templates.
  • Learn extraction patterns for each slot using an
    appropriate machine learning algorithm.

25
Happy Deepawali! Halloween
10/31
4th Nov, 2002.
26
October 31st
27
Finding Sweet Spots in computer-mediated
cooperative work
Big Idea 1
  • It is possible to get by with techniques blithely
    ignorant of semantics, when you have humans in
    the loop
  • All you need is to find the right sweet spot,
    where the computer plays a pre-processing role
    and presents potential solutions
  • and the human very gratefully does the in-depth
    analysis on those few potential solutions
  • Examples
  • The incredible success of the Bag of Words model!
  • Bag of letters would be a disaster :-)
  • Bag of sentences and/or NLP would be good
  • ...but only to your discriminating and irascible
    searchers :-)

28
Collaborative Computing AKA Brain Cycle
Stealing AKA Computizing Eyeballs
Big Idea 2
  • A lot of exciting research related to the web
    currently involves co-opting the masses to help
    with large-scale tasks
  • It is like cycle stealing, except we are
    stealing human brain cycles (the most idle of
    computers, if ever there was one :-)
  • Remember the mice in The Hitchhiker's Guide to
    the Galaxy? (...who were running a mass-scale
    experiment on the humans to figure out the
    question...)
  • Collaborative knowledge compilation (Wikipedia!)
  • Collaborative curation
  • Collaborative tagging
  • Paid collaboration/contracting
  • Many big open issues
  • How do you pose the problem such that it can be
    solved using collaborative computing?
  • How do you incentivize people into letting you
    steal their brain cycles?
  • Pay them! (Amazon mturk.com)

29
Tapping into the Collective Unconscious
Big Idea 3
  • Another thread of exciting research is driven by
    the realization that the Web is not random at all!
  • It is written by humans
  • so analyzing its structure and content allows us
    to tap into the collective unconscious...
  • Meaning can emerge from syntactic notions such as
    co-occurrences and connectedness
  • Examples
  • Analyzing term co-occurrences in web-scale
    corpora to capture semantic information (today's
    paper)
  • Analyzing the link-structure of the web graph to
    discover communities
  • DoD and NSA are very much into this as a way of
    breaking terrorist cells
  • Analyzing the transaction patterns of customers
    (collaborative filtering)

30
Automated Support for Semantic Web
  • Semantic web needs
  • Tagged data
  • Background knowledge
  • (blue sky approaches to) automate both
  • Automated tagging
  • Start with a background ontology and tag other
    web pages
  • Semtag/Seeker
  • Knowledge Extraction
  • Extract base level knowledge (facts) directly
    from the web

31
Extraction from Free Text involves Natural
Language Processing
  • If extracting from automatically generated web
    pages, simple regex patterns usually work.
  • If extracting from more natural, unstructured,
    human-written text, some NLP may help.
  • Part-of-speech (POS) tagging
  • Mark each word as a noun, verb, preposition, etc.
  • Syntactic parsing
  • Identify phrases: NP, VP, PP
  • Semantic word categories (e.g. from WordNet)
  • KILL: kill, murder, assassinate, strangle,
    suffocate
  • Off-the-shelf software available to do this!
  • The Brill tagger
  • Extraction patterns can use POS or phrase tags.

32
I. Generate-n-Test Architecture
  • Generic extraction patterns (Hearst 92)
  • "Cities such as Boston, Los Angeles, and
    Seattle"

("C such as NP1, NP2, and NP3") =>
IS-A(each(head(NP)), C),
  • "Detailed information for several countries such
    as maps," ProperNoun(head(NP))
  • "I listen to pretty much all music but prefer
    country such as Garth Brooks"
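
A rough sketch of the generate step for the "C such as NP1, NP2, and NP3" pattern. It assumes a crude stand-in for a noun-phrase chunker (a run of capitalized words), which also plays the role of the ProperNoun constraint, and it keeps the whole phrase rather than computing head(NP). All names below are hypothetical.

```python
import re

# Crude stand-in for a proper-noun phrase: one or more capitalized words.
PROPER_NP = r"[A-Z][\w-]*(?: [A-Z][\w-]*)*"
NP_LIST = rf"{PROPER_NP}(?:, {PROPER_NP})*(?:,? and {PROPER_NP})?"
SUCH_AS = re.compile(rf"\b(?P<cls>[A-Za-z]+) such as (?P<nps>{NP_LIST})")


def hearst_such_as(sentence):
    """Return (instance, class) candidate pairs for 'C such as NP1, NP2, and NP3'."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        instances = re.split(r",? and |, ", match.group("nps"))
        pairs.extend((instance, match.group("cls")) for instance in instances)
    return pairs


print(hearst_such_as("Cities such as Boston, Los Angeles, and Seattle are on the coast."))
print(hearst_such_as("Detailed information for several countries such as maps"))
print(hearst_such_as("I listen to pretty much all music but prefer country such as Garth Brooks"))
```

The second sentence yields nothing because "maps" is not capitalized, while the third still produces the wrong candidate (Garth Brooks, country); candidates like that are what the test step has to filter out.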

33
Test
Assess candidate extractions using Mutual
Information (PMI-IR) (Turney 01).
Many variations are possible
34
Assessment
  • PMI = frequency of I & D co-occurrence
  • 5-50 discriminators Di
  • Each PMI for Di is a feature fi
  • Naïve Bayes evidence combination

PMI is used for feature selection. A Naïve Bayes classifier
(NBC) is used for learning. Search-engine hit counts are used
for assessing PMI as well as the conditional probabilities.
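
A sketch of how Naive Bayes evidence combination over discriminator features might look. It assumes each feature fi is a boolean ("the PMI for Di exceeded some threshold"), which the slide does not spell out, and every probability below is invented for illustration.

```python
def naive_bayes_posterior(features, prior, p_true_given_class, p_true_given_not_class):
    """Combine boolean features into P(instance belongs to class | features).

    features[i]               -- True if feature fi fired (assumed thresholded PMI)
    prior                     -- P(class) before seeing any evidence
    p_true_given_class[i]     -- P(fi fires | instance is in the class)
    p_true_given_not_class[i] -- P(fi fires | instance is not in the class)
    """
    score_class, score_not = prior, 1.0 - prior
    for fired, p_c, p_n in zip(features, p_true_given_class, p_true_given_not_class):
        score_class *= p_c if fired else 1.0 - p_c
        score_not *= p_n if fired else 1.0 - p_n
    return score_class / (score_class + score_not)


# Three hypothetical discriminators; two of them fire for this candidate.
posterior = naive_bayes_posterior(
    features=[True, True, False],
    prior=0.5,
    p_true_given_class=[0.9, 0.8, 0.7],
    p_true_given_not_class=[0.2, 0.3, 0.4],
)
print(round(posterior, 3))  # 0.857
```

The conditional probabilities would have to be estimated from seed instances; here they are simply made up.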
35
Assessment In Action
  • I = Yakima (1,340,000 hits)
  • D = <class name>
  • I+D = Yakima city (2,760 hits)
  • PMI = 2,760 / 1.34M ≈ 0.002
  • I = Avocado (1,000,000 hits)
  • I+D = Avocado city (10 hits)
  • PMI = 0.00001 << 0.002
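
The ratio used in this example can be written out directly. The sketch below assumes the simplified score on these slides (hits for instance plus discriminator, divided by hits for the instance alone); the function name `pmi` is hypothetical and the hit counts are the ones quoted above.

```python
def pmi(hits_instance, hits_instance_with_discriminator):
    """Hit-count ratio: how often the instance co-occurs with the discriminator."""
    if hits_instance == 0:
        return 0.0
    return hits_instance_with_discriminator / hits_instance


yakima = pmi(1_340_000, 2_760)   # "Yakima" vs. "Yakima city"
avocado = pmi(1_000_000, 10)     # "Avocado" vs. "Avocado city"

print(f"Yakima:  {yakima:.5f}")
print(f"Avocado: {avocado:.5f}")
print(yakima > avocado)
```

With these counts the Yakima score is roughly two hundred times the Avocado score, which is the signal that Yakima is a plausible city and Avocado is not.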

36
Some Sources of ambiguity
  • Time: Clinton is the president (in 1996).
  • Context: common misconceptions...
  • Opinion: Elvis
  • Multiple word senses: Amazon, Chicago, Chevy
    Chase, etc.
  • Dominant senses can mask recessive ones!
  • Approach: unmasking. Chicago City

37
Chicago
City
Movie
38
Chicago Unmasked
City sense
Movie sense
39
Impact of Unmasking on PMI
Name          Recessive   Original   Unmask   Boost
Washington    city        0.50       0.99      96%
Casablanca    city        0.41       0.93     127%
Chevy Chase   actor       0.09       0.58     512%
Chicago       movie       0.02       0.21     972%
40
CBioC Collaborative Bio-Curation
  • Motivation
  • To help extract information nuggets from articles
    and abstracts and store them in a database.
  • The challenge is that the number of articles is
    huge and keeps growing, and extraction requires
    processing natural language.
  • The two existing approaches
  • human curation and the use of automatic
    information extraction systems
  • They are not able to meet the challenge, as the
    first is expensive, while the second is
    error-prone.

41
CBioC (contd)
  • Approach: We propose a solution that is
    inexpensive and that scales up.
  • Our approach takes advantage of automatic
    information extraction methods as a starting
    point,
  • Based on the premise that if there are a lot of
    articles, then there must be a lot of readers and
    authors of these articles.
  • We provide a mechanism by which the readers of
    the articles can participate and collaborate in
    the curation of information.
  • We refer to our approach as "Collaborative
    Curation".

42
Using the C-BioCurator System (contd)
43
What is the main difference between KnowItAll and
CBioC?
Assessment: KnowItAll does it by search-engine hit counts,
CBioC by voting.
44
Annotation
  • The Chicago Bulls announced yesterday that
    Michael Jordan will...
  • The <resource ref="http://tap.stanford.edu/
    BasketballTeam_Bulls">Chicago Bulls</resource>
    announced yesterday that <resource ref=
    "http://tap.stanford.edu/AthleteJordan,_Michael">
    Michael Jordan</resource> will...
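
To make the annotated form concrete, here is a minimal sketch that wraps known labels in `<resource>` tags. The dictionary `TAP_REFS` and the function `annotate` are hypothetical; the two reference URLs are the ones from the example above. Plain string matching is assumed, with no disambiguation.

```python
import re

# Label -> ontology reference, taken from the slide's example.
TAP_REFS = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan,_Michael",
}


def annotate(text, refs):
    """Wrap every occurrence of a known label in a <resource ref="..."> element."""
    if not refs:
        return text
    # Try longer labels first so a short label cannot split a longer one.
    labels = sorted(refs, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))
    return pattern.sub(
        lambda m: f'<resource ref="{refs[m.group(0)]}">{m.group(0)}</resource>',
        text,
    )


print(annotate("The Chicago Bulls announced yesterday that Michael Jordan will...", TAP_REFS))
```

A real annotator would also have to decide whether "Michael Jordan" refers to the athlete at all, which is the disambiguation problem taken up later in the SemTag slides.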

45
Semantic Annotation
Named Entity Identification
The simplest task of metadata extraction on NL
is to establish a type relation between entities
in the NL resources and concepts in ontologies.
Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt
46
Semantics
  • Semantic Annotation
  • The content of annotation consists of some rich
    semantic information
  • Targeted not only at human readers of resources
    but also at software agents
  • Formal: metadata following structural standards
  • Informal: personal notes written in the margin
    while reading an article
  • Explicit: carries sufficient information for
    interpretation
  • Tacit: many personal annotations (telegraphic
    and incomplete)

http://www-scf.usc.edu/csci586/slides/6
47
Uses of Annotation
http://www-scf.usc.edu/csci586/slides/8
48
Objectives of Annotation
  • Generate Metadata for existing information
  • e.g., author-tag in HTML
  • RDF descriptions to HTML
  • Content description to Multimedia files
  • Employ metadata for
  • Improved search
  • Navigation
  • Presentation
  • Summarization of contents

http://www.aifb.uni-karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf
49
Annotation
  • Current practice of annotation for knowledge
    identification and extraction

Reduce burden of text annotation for Knowledge
Management
www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt
50
SemTag & Seeker
  • WWW-03 Best Paper Prize
  • Seeded with TAP ontology (72k concepts)
  • And 700 human judgments
  • Crawled 264 million web pages
  • Extracted 434 million semantic tags
  • Automatically disambiguated

51
SemTag
  • Research project at IBM
  • Very large scale: largest to date
  • 264 million web pages
  • Goal: to provide an early set of widespread
    semantic tags through automated generation

52
SemTag
  • Uses broad, shallow knowledge base
  • TAP: lexical and taxonomic information about
    popular objects
  • Music
  • Movies
  • Sports
  • Etc.

53
SemTag
  • Problem
  • No write access to original document, so how do
    you annotate?
  • Solution
  • Store annotations in a web-available database

54
SemTag
  • Semantic Label Bureau
  • Separate store of semantic annotation information
  • HTTP server that can be queried for annotation
    information
  • Example
  • Find all semantic tags for a given document
  • Find all semantic tags for a particular object

55
SemTag
  • Methodology

56
SemTag
  • Three phases
  • Spotting Pass
  • Tokenize the document
  • All instances plus 20 word window
  • Learning Pass
  • Find corpus-wide distribution of terms at each
    internal node of taxonomy
  • Based on a representative sample
  • Tagging Pass
  • Scan windows to disambiguate each reference
  • Finally determined to be a TAP object
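
The spotting pass can be sketched as follows. This is a simplification under stated assumptions: tokens are plain `\w+` runs, labels are matched case-insensitively, and the 20-word window is taken as 10 words on each side of the label. The function name `spot` is hypothetical.

```python
import re


def spot(document, labels, window=10):
    """Return (label, context) pairs: each label occurrence with nearby words.

    `window` is the number of words kept on each side of the label.
    """
    tokens = re.findall(r"\w+", document)
    lowered = [token.lower() for token in tokens]
    spots = []
    for label in labels:
        label_tokens = label.lower().split()
        n = len(label_tokens)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == label_tokens:
                before = tokens[max(0, i - window):i]
                after = tokens[i + n:i + n + window]
                spots.append((label, before + after))
    return spots


TEXT = "The Chicago Bulls announced yesterday that Michael Jordan will retire"
for label, context in spot(TEXT, ["Chicago Bulls", "Michael Jordan"], window=2):
    print(label, context)
```

The learning and tagging passes would then consume these (label, context) pairs; they are not sketched here.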

57
SemTag
  • Another problem magnified by the scale
  • Ambiguity Resolution
  • Two fundamental categories of ambiguities
  • Some labels appear at multiple locations
  • Some entities have labels that occur in contexts
    that have no representative in the taxonomy

58
SemTag
  • Solution
  • Taxonomy Based Disambiguation (TBD)
  • TBD expectation
  • Human tuned parameters used in small, critical
    sections
  • Automated approaches deal with bulk of
    information

59
SemTag
  • TBD methodology
  • Each node in the taxonomy is associated with a
    set of labels
  • Cats, Football, Cars all contain "jaguar"
  • Each label in the text is stored with a window of
    20 words: the context
  • Each node has an associated similarity function
    mapping a context to a similarity
  • Higher similarity → more likely to contain a
    reference

60
SemTag
  • Similarity
  • Built a 200,000-word lexicon (the 200,100 most
    common words minus the 100 most common)
  • 200,000-dimensional vector space
  • Training: spots (label, context) with the correct node
  • Estimated the distribution of terms for nodes
  • Standard cosine similarity
  • TF-IDF vectors (context vs. node)
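
A toy version of the similarity computation, assuming sparse TF-IDF vectors stored as dictionaries. The document frequencies, the node term lists and the helper names are all invented for illustration and are far smaller than the 200,000-word lexicon described above.

```python
import math
from collections import Counter


def tfidf_vector(tokens, document_frequency, num_documents):
    """Sparse TF-IDF vector; terms outside the lexicon are dropped."""
    counts = Counter(tokens)
    return {
        term: count * math.log(num_documents / document_frequency[term])
        for term, count in counts.items()
        if term in document_frequency
    }


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented lexicon statistics and per-node term samples.
NUM_DOCUMENTS = 1000
DOCUMENT_FREQUENCY = {"jaguar": 10, "engine": 50, "speed": 100,
                      "jungle": 20, "prey": 25, "dealer": 40}
NODE_TERMS = {
    "Cars": ["engine", "engine", "speed", "dealer"],
    "Cats": ["jungle", "prey", "speed"],
}

context = tfidf_vector("the jaguar dealer praised the engine".split(),
                       DOCUMENT_FREQUENCY, NUM_DOCUMENTS)
similarities = {
    node: cosine(context, tfidf_vector(terms, DOCUMENT_FREQUENCY, NUM_DOCUMENTS))
    for node, terms in NODE_TERMS.items()
}
best_node = max(similarities, key=similarities.get)
print(best_node, similarities)
```

Here the context shares "dealer" and "engine" with the Cars node and nothing with Cats, so the Cars node gets the higher similarity.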

61
SemTag
Is a context c appropriate for a node v?
References inside the taxonomy vs. references
outside the taxonomy
[Figure: multiple nodes b, r and p(v)]
62
SemTag
  • Some internal nodes very popular
  • Associate a measurement of how accurate Sim is
    likely to be at a node
  • Also, how ambiguous the node is overall
    (consistency of human judgment)
  • TBD Algorithm returns 1 or 0 to indicate
    whether a particular context c is on topic for a
    node v
  • 82% accuracy on 434 million spots

63
SemTag
64
Summary
  • Information extraction can be motivated either as
    explicating more structure from the data or as an
    automated route to the Semantic Web
  • Extraction complexity depends on whether the text
    you have is templated or free-form
  • Extraction from templated text can be done by
    regular expressions
  • Extraction from free form text requires NLP
  • Can be done in terms of parts-of-speech-tagging
  • Annotation involves connecting terms in a free
    form text to items in the background knowledge
  • It too can be automated