Information Extraction from the World Wide Web - PowerPoint PPT Presentation

About This Presentation
Title:

Information Extraction from the World Wide Web

Description:

Tutorial Outline. IE History. Landscape of problems and solutions ... Focus of this Tutorial. Pattern complexity. Pattern feature domain. Pattern scope ... – PowerPoint PPT presentation

Slides: 144
Provided by: AndrewM163
Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes

Title: Information Extraction from the World Wide Web


1
Information Extraction from the World Wide Web
  • Andrew McCallum
  • University of Massachusetts Amherst
  • William Cohen
  • Carnegie Mellon University

2
Example: The Problem
Martin Baker, a person
Genomics job
Employer's job posting form
3
Example: A Solution
4
Extracting Job Openings from the Web
5
Job Openings: Category = Food Services, Keyword =
Baker, Location = Continental U.S.
6
Data Mining the Extracted Job Information
7
What is Information Extraction
As a task:
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying…
NAME TITLE ORGANIZATION
8
What is Information Extraction
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
9
What is Information Extraction
As a family of techniques:
Information Extraction = segmentation +
classification + clustering + association
Extracted segments:
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
13
IE in Context
Create ontology
Spider
Filter by relevance
IE
Segment Classify Associate Cluster
Database
Load DB
Query, Search
Document collection
Train extraction models
Data mine
Label training data
14
Why IE from the Web?
  • Science
  • Grand old dream of AI: build a large KB and reason
    with it. IE from the Web enables the creation
    of this KB.
  • IE from the Web is a complex problem that
    inspires new advances in machine learning.
  • Profit
  • Many companies interested in leveraging data
    currently locked in unstructured text on the
    Web.
  • Not yet a monopolistic winner in this space.
  • Fun!
  • Build tools that we researchers like to use
    ourselves: Cora & CiteSeer, MRQE.com, FAQFinder,…
  • See our work get used by the general public.

KB = Knowledge Base
15
Tutorial Outline
  • IE History
  • Landscape of problems and solutions
  • Parade of models for segmenting/classifying
  • Sliding window
  • Boundary finding
  • Finite state machines
  • Trees
  • Overview of related problems and solutions
  • Where to go from here

16
IE History
  • Pre-Web
  • Mostly news articles
  • DeJong's FRUMP, 1982
  • Hand-built system to fill Schank-style scripts
    from news wire
  • Message Understanding Conference (MUC), DARPA
    87-95; TIPSTER 92-96
  • Most early work dominated by hand-built models
  • E.g. SRI's FASTUS, hand-built FSMs.
  • But by the 1990s, some machine learning: Lehnert,
    Cardie, Grishman, and then HMMs: Elkan, Leek 97;
    BBN, Bikel et al 98
  • Web
  • AAAI 94 Spring Symposium on Software Agents
  • Much discussion of ML applied to Web. Maes,
    Mitchell, Etzioni.
  • Tom Mitchell's WebKB, 96
  • Build KBs from the Web.
  • Wrapper Induction
  • Initially hand-built, then ML: Soderland 96,
    Kushmerick 97,…

17
What makes IE from the Web Different?
Less grammar, but more formatting & linking
Newswire
Web
www.apple.com/retail
Apple to Open Its First Retail Store in New York
City MACWORLD EXPO, NEW YORK--July 17,
2002--Apple's first retail store in New York City
will open in Manhattan's SoHo district on
Thursday, July 18 at 8:00 a.m. EDT. The SoHo
store will be Apple's largest retail store to
date and is a stunning example of Apple's
commitment to offering customers the world's best
computer shopping experience. "Fourteen months
after opening our first retail store, our 31
stores are attracting over 100,000 visitors each
week," said Steve Jobs, Apple's CEO. "We hope our
SoHo store will surprise and delight both Mac and
PC users who want to see everything the Mac can
do to enhance their digital lifestyles."
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
The directory structure, link structure,
formatting & layout of the Web is its own new
grammar.
18
Landscape of IE Tasks (1/4): Pattern Feature
Domain
Text paragraphs without formatting
Grammatical sentences and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets, rich formatting & links
Tables
19
Landscape of IE Tasks (2/4): Pattern Scope
Web site specific: Formatting (e.g. Amazon.com Book Pages)
Genre specific: Layout (e.g. Resumes)
Wide, non-specific: Language (e.g. University Names)
20
Landscape of IE Tasks (3/4): Pattern Complexity
E.g. word patterns:
Closed set: U.S. states
  He was born in Alabama…
  The big Wyoming sky…
Regular set: U.S. phone numbers
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299
Complex pattern: U.S. postal addresses
  University of Arkansas P.O. Box 140 Hope, AR 71802
  Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
Ambiguous patterns, needing context and many sources of evidence: Person names
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
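The two simplest cases on this slide, a closed set and a regular set, can be sketched in a few lines. This is only an illustration using Python's standard `re` module: the phone pattern `PHONE_RE`, the abbreviated `US_STATES` lexicon, and both function names are assumptions made for the example, not part of the tutorial.

```python
import re

# Hypothetical regular pattern covering the two phone formats shown on the slide:
# "(413) 545-1323" and "412-268-1299".
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}-)\d{3}-\d{4}\b")

# Closed set: membership in a lexicon (only a few states listed for brevity).
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

def find_phones(text):
    """Return every substring that matches the phone-number pattern."""
    return PHONE_RE.findall(text)

def find_states(text):
    """Return every single-word token that appears in the state lexicon."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if tok in US_STATES]

phones = find_phones("The CALD main office can be reached at 412-268-1299")
states = find_states("The big Wyoming sky")
```

The lexicon lookup here handles only one-word names; multi-word states, postal addresses and person names need the context-sensitive models discussed later.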
21
Landscape of IE Tasks (4/4): Pattern Combinations
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
Single entity (named entity extraction)
Person: Jack Welch
Person: Jeffrey Immelt
Location: Connecticut
Binary relationship
Relation: Person-Title; Person: Jack Welch; Title: CEO
Relation: Company-Location; Company: General Electric; Location: Connecticut
N-ary record
Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
22
Evaluation of Single Entity Extraction
TRUTH:
Michael Kearns and Sebastian Seung will start
Monday's tutorial, followed by Richard M. Karpe
and Martin Cooke.
PRED:
Michael Kearns and Sebastian Seung will start
Monday's tutorial, followed by Richard M. Karpe
and Martin Cooke.
Precision = (# correctly predicted segments) / (# predicted segments) = 2/6
Recall = (# correctly predicted segments) / (# true segments) = 2/4
F1 = harmonic mean of Precision and Recall = 1 / (((1/P) + (1/R)) / 2)
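As a worked check of the definitions on this slide, the following sketch computes precision, recall and F1 from the segment counts given (2 correct, 6 predicted, 4 true). The function name `prf1` and the zero-division handling are assumptions made for the example.

```python
def prf1(num_correct, num_predicted, num_true):
    """Segment-level precision, recall and F1 (harmonic mean of the two)."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_true if num_true else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    # F1 = 1 / (((1/P) + (1/R)) / 2)
    f1 = 1.0 / (((1.0 / precision) + (1.0 / recall)) / 2.0)
    return precision, recall, f1

# Counts from the slide: 2 correct, 6 predicted, 4 true segments.
p, r, f1 = prf1(num_correct=2, num_predicted=6, num_true=4)
```

With these counts, precision is 1/3, recall is 1/2, and F1 works out to 0.4.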
23
State of the Art Performance
  • Named entity recognition
  • Person, Location, Organization, …
  • F1 in high 80s or low- to mid-90s
  • Binary relation extraction
  • Contained-in (Location1, Location2) Member-of
    (Person1, Organization1)
  • F1 in 60s or 70s or 80s
  • Wrapper induction
  • Extremely accurate performance obtainable
  • Human effort (30min) required on each site

24
Landscape of IE Techniques (1/1): Models
Each panel uses the example sentence: Abraham Lincoln was born in Kentucky.
Lexicons: is the candidate a member of a list? (Alabama, Alaska, …, Wisconsin, Wyoming)
Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to.
Sliding Window: a classifier decides which class each window belongs to. Try alternate window sizes.
Boundary Models: a classifier marks BEGIN and END positions.
Finite State Machines: most likely state sequence?
Context Free Grammars: most likely parse? (NNP, V, P, NP, PP, VP, S)
…and beyond
Any of these models can be used to capture words,
formatting or both.
25
Landscape: Focus of this Tutorial
Pattern complexity: closed set, regular, complex, ambiguous
Pattern feature domain: words, words + formatting, formatting
Pattern scope: site-specific, genre-specific, general
Pattern combinations: entity, binary, n-ary
Models: lexicon, regex, window, boundary, FSM, CFG
26
Sliding Windows
27
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines: inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
31
A Naïve Bayes Sliding Window Model
[Freitag 1997]
… 00 : pm  Place : Wean Hall Rm 5409  Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1}
contents: w_t … w_{t+n}
suffix: w_{t+n+1} … w_{t+n+m}
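The prefix / contents / suffix decomposition can be sketched as a generator over candidate windows. This is a minimal illustration only: the function name `sliding_windows`, its parameters, and the whitespace tokenization are assumptions, and the classifier that would score each window is not shown.

```python
def sliding_windows(tokens, min_len, max_len, context=3):
    """Yield (prefix, contents, suffix) for every candidate window.

    prefix and suffix hold up to `context` tokens on either side of the window.
    """
    for start in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            prefix = tokens[max(0, start - context):start]
            contents = tokens[start:end]
            suffix = tokens[end:end + context]
            yield prefix, contents, suffix

tokens = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
windows = list(sliding_windows(tokens, min_len=1, max_len=4, context=2))
```

Each yielded triple is what a window classifier such as the Naïve Bayes model would score; the window with contents "Wean Hall Rm 5409" is one of the candidates produced here.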
32
Naïve Bayes Sliding Window Results
Domain: CMU UseNet Seminar Announcements
Field        F1
Person Name  30
Location     61
Start Time   98
33
SRV: a realistic sliding-window-classifier IE
system
[Freitag, AAAI 98]
  • What windows to consider?
  • all windows containing as many tokens as the
    shortest example, but no more tokens than the
    longest example
  • How to represent a classifier? It might:
  • Restrict the length of window
  • Restrict the vocabulary or formatting used
    before/after/inside window
  • Restrict the relative order of tokens
  • Etc…

A token followed by a 3-char numeric token just
after the title
<title>Course Information for CS213</title> <h1>CS
213 C Programming</h1>
34
SRV: a rule-learner for sliding-window
classification
  • Top-down rule learning:
      let RULES = {}
      while (there are uncovered positive examples)
          // construct a rule R to add to RULES
          let R be a rule covering all examples
          while (R covers too many negative examples)
              let C = argmax_C VALUE(R, R ∧ C, uncoveredExamples)
                  over some set of candidate conditions C
              let R = R ∧ C
          let RULES = RULES + {R}
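The top-down covering loop above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not SRV itself: examples are dictionaries of boolean features, a rule is a set of feature names that must all be true, and `value` is a plain precision score standing in for SRV's information-gain metric. All names are hypothetical.

```python
def covers(rule, example):
    """A rule (a set of feature names) covers an example if all are true."""
    return all(example[c] for c in rule)

def learn_rules(positives, negatives, conditions, max_neg=0):
    """Greedy top-down covering: specialize a rule, add it, repeat."""
    rules = []
    uncovered = list(positives)
    while uncovered:
        rule = set()  # the empty rule covers all examples
        while sum(covers(rule, n) for n in negatives) > max_neg:
            candidates = [c for c in conditions if c not in rule]
            if not candidates:
                break

            def value(c):
                # Assumed stand-in metric: precision of the specialized rule.
                r = rule | {c}
                pos = sum(covers(r, p) for p in uncovered)
                neg = sum(covers(r, n) for n in negatives)
                return pos / (pos + neg) if pos else -1.0

            best = max(candidates, key=value)
            if value(best) < 0:
                break  # every remaining condition loses all positives
            rule.add(best)
        if not any(covers(rule, p) for p in uncovered):
            break
        rules.append(frozenset(rule))
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return rules

positives = [
    {"inTitle": True, "numeric": True, "cap": False},
    {"inTitle": True, "numeric": True, "cap": True},
]
negatives = [
    {"inTitle": True, "numeric": False, "cap": True},
    {"inTitle": False, "numeric": True, "cap": False},
]
rules = learn_rules(positives, negatives, ["inTitle", "numeric", "cap"])
```

On this toy data the learner returns a single rule requiring both `inTitle` and `numeric`, which covers both positives and neither negative.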

35
SRV: a rule-learner for sliding-window
classification
  • Search metric: SRV algorithm greedily adds
    conditions to maximize information gain of R
  • VALUE(R,R,Data) IDatap ( p log p p log
    p)
  • where p (p') is the fraction of data covered by R
    (R')
  • To prevent overfitting:
  • rules are built on 2/3 of data, then their false
    positive rate is estimated with a Dirichlet on
    the 1/3 holdout set.
  • Candidate conditions: …

36
Learning first-order rules
  • A sample zero-th order rule set:
  • (tok1InTitle & tok1StartsPara & tok2triple)
  • or (prevtok2EqCourse & prevtok1EqNumber) or …
  • First-order rules can be learned the same
    way, with additional search to find the best
    condition
  • phrase(X) :- firstToken(X,A), not startPara(A),
    nextToken(A,B), triple(B)
  • phrase(X) :- firstToken(X,A), prevToken(A,C),
    eq(C,number), prevToken(C,D), eq(D,course)
  • Semantics:
  • p(X) :- q(X),r(X,Y),s(Y) = {X : exists Y :
    q(X) and r(X,Y) and s(Y)}

37
SRV: a rule-learner for sliding-window
classification
  • Primitive predicates used by SRV
  • token(X,W), allLowerCase(W), numerical(W), …
  • nextToken(W,U), previousToken(W,V)
  • HTML-specific predicates
  • inTitleTag(W), inH1Tag(W), inEmTag(W),…
  • emphasized(W) = inEmTag(W) or inBTag(W) or …
  • tableNextCol(W,U) = U is some token in the
    column after the column W is in
  • tablePreviousCol(W,V), tableRowHeader(W,T),…

38
SRV: a rule-learner for sliding-window
classification
  • Non-primitive conditions used by SRV
  • every(X, f, c) = for all W in X : f(W) = c
  • variables tagged must be used in earlier
    conditions
  • underlined values will be replaced by constants,
    e.g., every(X, isCapitalized, true)
  • some(X, W, <f1,…,fk>, g, c) = exists W :
    g(fk(…(f1(W))…)) = c
  • e.g., some(X, W, <prevTok,prevTok>, inTitle, false)
  • set of paths <f1,…,fk> considered grows over
    time.
  • tokenLength(X, relop, c)
  • position(W, direction, relop, c)
  • e.g., tokenLength(X,>,4), position(W,fromEnd,<,2)

39
Utility of non-primitive conditions in greedy
rule search
  • Greedy search for first-order rules is hard
    because useful conditions can give no immediate
    benefit
  • phrase(X) :- token(X,A), prevToken(A,B), inTitle(B),
    nextToken(A,C), tripleton(C)

40
Rapier: an alternative approach
[Califf & Mooney, AAAI 99]
  • A bottom-up rule learner:
      initialize RULES to be one rule per example
      repeat
          randomly pick N pairs of rules (Ri,Rj)
          let {G1,…,GN} be the consistent pairwise generalizations
          let G* = argmin_G COST(G,RULES)
          let RULES = RULES + {G*} - {R : covers(G*,R)}
  • where COST(G,RULES) = size of RULES - {R :
    covers(G,R)}, and covers(G,R) means every
    example matching G matches R

41
<title>Course Information for CS213</title> <h1>CS
213 C Programming</h1> …
courseNum(window1) :- token(window1,CS),
doubleton(CS), prevToken(CS,CS213),
inTitle(CS213), nextTok(CS,213),
numeric(213), tripleton(213),
nextTok(213,C), tripleton(C), ….
<title>Syllabus and meeting times for Eng
214</title> <h1>Eng 214 Software Engineering for
Non-programmers </h1>…
courseNum(window2) :- token(window2,Eng),
tripleton(Eng), prevToken(Eng,214),
inTitle(214), nextTok(Eng,214),
numeric(214), tripleton(214),
nextTok(214,Software), …
Pairwise generalization of the two rules:
courseNum(X) :- token(X,A),
prevToken(A,B), inTitle(B),
nextTok(A,C), numeric(C),
tripleton(C), nextTok(C,D), …
42
Rapier an alternative approach
  • Combines top-down and bottom-up learning
  • Bottom-up to find common restrictions on content
  • Top-down greedy addition of restrictions on
    context
  • Use of part-of-speech and semantic features (from
    WORDNET).
  • Special pattern-language based on sequences of
    tokens, each of which satisfies one of a set of
    given constraints
  • < <tok ∈ {ate, hit}, POS ∈ {vb}>, <tok ∈ {the}>,
    <POS ∈ {nn}> >

43
Rapier results precision/recall
44
Rapier results vs. SRV
45
Rule-learning approaches to sliding-window
classification: Summary
  • SRV, Rapier, and WHISK [Soderland, KDD 97]
  • Representations for classifiers allow restriction
    of the relationships between tokens, etc
  • Representations are carefully chosen subsets of
    even more powerful representations based on logic
    programming (ILP and Prolog)
  • Use of these heavyweight representations is
    complicated, but seems to pay off in results
  • Can simpler representations for classifiers work?

46
BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]
  • Another formulation: learn three probabilistic
    classifiers
  • START(i) = Prob(position i starts a field)
  • END(j) = Prob(position j ends a field)
  • LEN(k) = Prob(an extracted field has length k)
  • Then score a possible extraction (i,j) by
  • START(i) * END(j) * LEN(j-i)
  • LEN(k) is estimated from a histogram
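The scoring rule START(i) * END(j) * LEN(j-i) can be sketched as an exhaustive search over candidate pairs (i, j). The probabilities below are made-up toy values, and the function name `best_extraction` and the dictionary form of the length histogram are assumptions made for the example; the boosted boundary detectors that would produce these scores are not shown.

```python
def best_extraction(start_prob, end_prob, len_prob):
    """Return the (i, j) pair maximizing START(i) * END(j) * LEN(j - i)."""
    best, best_score = None, 0.0
    n = len(start_prob)
    for i in range(n):
        for j in range(i, n):
            k = j - i  # length term as written on the slide: LEN(j - i)
            score = start_prob[i] * end_prob[j] * len_prob.get(k, 0.0)
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Toy per-position boundary probabilities and a toy length histogram.
start_prob = [0.1, 0.8, 0.1, 0.0]
end_prob = [0.0, 0.1, 0.7, 0.2]
len_prob = {0: 0.2, 1: 0.6, 2: 0.2}
span, score = best_extraction(start_prob, end_prob, len_prob)
```

Here the pair (1, 2) wins with score 0.8 * 0.7 * 0.6 = 0.336, showing how the length histogram breaks ties between otherwise plausible start/end pairings.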

47
BWI Learning to detect boundaries
  • BWI uses boosting to find detectors for START
    and END
  • Each weak detector has a BEFORE and AFTER pattern
    (on tokens before/after position i).
  • Each pattern is a sequence of tokens and/or
    wildcards like anyAlphabeticToken, anyToken,
    anyUpperCaseLetter, anyNumber, …
  • Weak learner for patterns uses greedy search (
    lookahead) to repeatedly extend a pair of empty
    BEFORE,AFTER patterns

48
BWI: Learning to detect boundaries
Field        F1
Person Name  30
Location     61
Start Time   98
49
Problems with Sliding Windows and Boundary
Finders
  • Decisions in neighboring parts of the input are
    made independently from each other.
  • Naïve Bayes Sliding Window may predict a seminar
    end time before the seminar start time.
  • It is possible for two overlapping windows to
    both be above threshold.
  • In a Boundary-Finding system, left boundaries are
    laid down independently from right boundaries,
    and their pairing happens as a separate step.

50
Finite State Machines
51
Hidden Markov Models
HMMs are the standard sequence modeling tool in
genomics, music, speech, NLP, …
[Figure: a finite state model (states linked by transitions) beside the equivalent graphical model, with states S_{t-1}, S_t, S_{t+1}, … emitting observations O_{t-1}, O_t, O_{t+1}, …]
Generates: a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8
Parameters, for all states S = {s1, s2, …}:
Start state probabilities P(s_t)
Transition probabilities P(s_t | s_{t-1})
Observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize probability of training observations (w/
prior)
52
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Lawrence Saul spoke this example
sentence.
and a trained HMM,
find the most likely state sequence (Viterbi):
Yesterday Lawrence Saul spoke this example
sentence.
Any words said to be generated by the designated
person name state are extracted as a person name:
Person name: Lawrence Saul
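The extraction procedure on this slide can be sketched with a small Viterbi decoder. Everything numeric here is a toy assumption: the two states, all probabilities, and the floor applied to unseen words were invented for illustration and are not taken from any trained model in the tutorial.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely state sequence, computed in log space.

    Unseen words get the small `floor` probability (an assumption).
    """
    def lp(x):
        return math.log(max(x, floor))

    first = observations[0]
    V = [{s: lp(start_p[s]) + lp(emit_p[s].get(first, 0.0)) for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] + lp(trans_p[r][s]))
            scores[s] = V[-1][prev] + lp(trans_p[prev][s]) + lp(emit_p[s].get(obs, 0.0))
            pointers[s] = prev
        V.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

states = ["other", "person"]
start_p = {"other": 0.9, "person": 0.1}
trans_p = {"other": {"other": 0.8, "person": 0.2},
           "person": {"other": 0.4, "person": 0.6}}
emit_p = {"other": {"Yesterday": 0.3, "spoke": 0.3, "this": 0.2,
                    "example": 0.1, "sentence": 0.1},
          "person": {"Lawrence": 0.5, "Saul": 0.5}}

words = ["Yesterday", "Lawrence", "Saul", "spoke"]
path = viterbi(words, states, start_p, trans_p, emit_p)
names = [w for w, s in zip(words, path) if s == "person"]
```

Words aligned with the `person` state are collected as the extracted name, mirroring the last step on the slide.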
53
HMM Example: Nymble
[Bikel, et al 1998], [BBN IdentiFinder]
Task: Named Entity Extraction
States: start-of-sentence, end-of-sentence, Person, Org, (five other name classes), Other
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}); back-off to P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}); back-off to P(o_t | s_t), then P(o_t)
Train on 450k words of news wire text.
Results:
Case   Language  F1
Mixed  English   93
Upper  English   91
Mixed  Spanish   90
Other examples of shrinkage for HMMs in IE:
[Freitag and McCallum 99]
54
Regrets from Atomic View of Tokens
Would like richer representation of text:
multiple overlapping features, whole chunks of
text.
  • line, sentence, or paragraph features
  • length
  • is centered in page
  • percent of non-alphabetics
  • white-space aligns with next line
  • containing sentence has two verbs
  • grammatically contains a question
  • contains links to authoritative pages
  • emissions that are uncountable
  • features at multiple levels of granularity
  • Example word features
  • identity of word
  • is in all caps
  • ends in -ski
  • is part of a noun phrase
  • is in a list of city names
  • is under node X in WordNet or Cyc
  • is in bold font
  • is in hyperlink anchor
  • features of past & future
  • last person name was female
  • next two words are "and Associates"

55
Problems with Richer Representation and a
Generative Model
  • These arbitrary features are not independent
  • Overlapping and long-distance dependences
  • Multiple levels of granularity (words,
    characters)
  • Multiple modalities (words, formatting, layout)
  • Observations from past and future
  • HMMs are generative models of the text
  • Generative models do not easily handle these
    non-independent features. Two choices
  • Model the dependencies. Each state would have
    its own Bayes Net. But we are already starved
    for training data!
  • Ignore the dependencies. This causes
    over-counting of evidence (ala naïve Bayes).
    Big problem when combining evidence, as in
    Viterbi!

56
Conditional Sequence Models
  • We would prefer a conditional model: P(s|o)
    instead of P(s,o)
  • Can examine features, but not responsible for
    generating them.
  • Don't have to explicitly model their
    dependencies.
  • Don't waste modeling effort trying to generate
    what we are given at test time anyway.
  • If successful, this answers the challenge of
    integrating the ability to handle many arbitrary
    features with the full power of finite state
    automata.

57
Locally Normalized Conditional Sequence Model
Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]
MaxEnt POS Tagger [Ratnaparkhi, 1996]
SNoW-based Markov Model [Punyakanok & Roth, 2000]
[Figure: two graphical models over states S_{t-1}, S_t, S_{t+1}, … and observations O_{t-1}, O_t, O_{t+1}, …, labeled "Generative (traditional HMM)" and "Conditional"]
Standard belief propagation: forward-backward
procedure. Viterbi and Baum-Welch follow
naturally.
58
Locally Normalized Conditional Sequence Model
Or, more generally:
[Figure: the same two graphical models, "Generative (traditional HMM)" with per-step observations O_{t-1}, O_t, O_{t+1}, and "Conditional" with the states linked to the entire observation sequence]
59
Exponential Form for Next State Function
[Figure labels: previous state s_{t-1}; black-box classifier; weight; feature]
Overall Recipe:
- Labeled data is assigned to transitions.
- Train each state's exponential model by maximum
likelihood (iterative scaling or conjugate gradient).
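The formula itself did not survive in this transcript. For reference, the exponential next-state function in the MEMM paper cited above is usually written as below; the symbols $\lambda_k$ (weight) and $f_k$ (feature) are notation assumed here to match the "weight" and "feature" labels on the slide.

```latex
P(s_t \mid s_{t-1}, o_t)
  = \frac{1}{Z(o_t, s_{t-1})}
    \exp\Big( \sum_k \lambda_k \, f_k(o_t, s_t) \Big),
\qquad
Z(o_t, s_{t-1}) = \sum_{s'} \exp\Big( \sum_k \lambda_k \, f_k(o_t, s') \Big)
```

There is one such model per previous state $s_{t-1}$, which is why the recipe assigns labeled data to transitions and trains each state's model separately.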
60
Feature Functions
o = Yesterday Lawrence Saul spoke this example
sentence.
Observations: o1 o2 o3 o4 o5 o6 o7
States: s1 s2 s3 s4
61
Experimental Data
38 files belonging to 7 UseNet FAQs
Example:
<head> X-NNTP-Poster: NewsHound v1.33
<head> Archive-name: acorn/faq/part2
<head> Frequency: monthly
<head>
<question> 2.6) What configuration of serial cable should I use?
<answer>
<answer> Here follows a diagram of the necessary connection
<answer> programs to work properly. They are as far as I know
<answer> agreed upon by commercial comms software developers fo
<answer>
<answer> Pins 1, 4, and 8 must be connected together inside
<answer> is to avoid the well known serial port chip bugs. The
Procedure: For each FAQ, train on one file, test
on other; average.
62
Features in Experiments
  • begins-with-number
  • begins-with-ordinal
  • begins-with-punctuation
  • begins-with-question-word
  • begins-with-subject
  • blank
  • contains-alphanum
  • contains-bracketed-number
  • contains-http
  • contains-non-space
  • contains-number
  • contains-pipe
  • contains-question-mark
  • contains-question-word
  • ends-with-question-mark
  • first-alpha-is-capitalized
  • indented
  • indented-1-to-4
  • indented-5-to-10
  • more-than-one-third-space
  • only-punctuation
  • prev-is-blank
  • prev-begins-with-ordinal
  • shorter-than-30
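A few of the line-level features listed above can be sketched as simple predicates over a line of text. The exact definitions used in the experiments are not given in the slides, so each test below is an assumed, plausible reading of the feature name, and the function name `line_features` is hypothetical.

```python
import re

def line_features(line, prev_line=""):
    """Return the set of (assumed) binary features that fire on one line."""
    feats = set()
    stripped = line.strip()
    feats.add("contains-non-space" if stripped else "blank")
    if re.match(r"\s*\d", line):
        feats.add("begins-with-number")
    if re.search(r"\d", line):
        feats.add("contains-number")
    if "?" in line:
        feats.add("contains-question-mark")
    if stripped.endswith("?"):
        feats.add("ends-with-question-mark")
    if "http" in line:
        feats.add("contains-http")
    if "|" in line:
        feats.add("contains-pipe")
    if stripped and line[:1].isspace():
        feats.add("indented")
    if len(line) < 30:
        feats.add("shorter-than-30")
    if not prev_line.strip():
        feats.add("prev-is-blank")
    return feats

question = "2.6) What configuration of serial cable should I use?"
feats = line_features(question, prev_line="")
```

Because the features overlap freely (a line can both begin with a number and end with a question mark), they suit the conditional models on these slides better than a generative HMM.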

63
Models Tested
  • ME-Stateless: A single maximum entropy classifier
    applied to each line independently.
  • TokenHMM: A fully-connected HMM with four states,
    one for each of the line categories, each of
    which generates individual tokens (groups of
    alphanumeric characters and individual
    punctuation characters).
  • FeatureHMM: Identical to TokenHMM, only the lines
    in a document are first converted to sequences of
    features.
  • MEMM: The Maximum Entropy Markov Model described
    in this talk.

64
Results
65
From HMMs to MEMMs to CRFs
Conditional Random Fields (CRFs)
[Lafferty, McCallum, Pereira 2001]
[Figure: graphical models of an HMM, an MEMM, and a CRF, each over states S_{t-1}, S_t, S_{t+1}, … and observations O_{t-1}, O_t, O_{t+1}, …]
(A special case of MEMMs and CRFs.)
66
Conditional Random Fields (CRFs)
[Figure: linear chain of states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}]
O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}
Markov on s, conditional dependency on o.
Hammersley-Clifford-Besag theorem stipulates that
the CRF has this form: an exponential function of
the cliques in the graph.
Assuming that the dependency structure of the
states is tree-shaped (linear chain is a trivial
tree), inference can be done by dynamic
programming in time O(|o| |S|^2), just like HMMs.
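The formula referred to as "this form" is missing from the transcript. The standard linear-chain CRF form from the literature is given below as a reference sketch; the symbols $\lambda_k$, $f_k$ and $Z_o$ are assumed notation, with $Z_o$ being the partition function (normalizer).

```latex
P(s \mid o)
  = \frac{1}{Z_o} \prod_{t=1}^{|o|}
    \exp\Big( \sum_k \lambda_k \, f_k(s_{t-1}, s_t, o, t) \Big),
\qquad
Z_o = \sum_{s'} \prod_{t=1}^{|o|}
    \exp\Big( \sum_k \lambda_k \, f_k(s'_{t-1}, s'_t, o, t) \Big)
```

Each factor touches only the adjacent pair $(s_{t-1}, s_t)$, which is what lets dynamic programming run in time $O(|o|\,|S|^2)$.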
67
General CRFs vs. HMMs
  • More general and expressive modeling technique
  • Comparable computational efficiency
  • Features may be arbitrary functions of any or all
    observations
  • Parameters need not fully specify generation of
    observations; require less training data
  • Easy to incorporate domain knowledge
  • State means only state of process, vs state of
    process and observational history I'm keeping

68
Efficient Inference
69
Training CRFs
  • Methods
  • iterative scaling (quite slow)
  • conjugate gradient (much faster)
  • conjugate gradient with preconditioning (super
    fast)
  • limited-memory quasi-Newton methods (also super
    fast)
  • Complexity comparable to standard Baum-Welch

Sha & Pereira 2002; Malouf 2002
70
Voted Perceptron Sequence Models
Collins 2002
Like CRFs with stochastic gradient ascent and a
Viterbi approximation.
The update is analogous to the gradient for this one
training instance.
Avoids calculating the partition function
(normalizer), Z_o, but uses gradient ascent, not a
2nd-order or conjugate gradient method.
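
A minimal sketch of one perceptron update for a sequence model follows. It assumes the predicted tags come from Viterbi decoding under the current weights; the feature templates in `path_features` are hypothetical, and the voting or averaging step that gives the method its name is omitted.

```python
from collections import Counter

def path_features(words, tags):
    """Count emission and transition features along one tagged sequence."""
    feats = Counter()
    prev = "<s>"
    for word, tag in zip(words, tags):
        feats[("emit", tag, word[0].isupper())] += 1
        feats[("trans", prev, tag)] += 1
        prev = tag
    return feats

def perceptron_update(weights, words, gold_tags, predicted_tags):
    """Add the gold path's feature counts and subtract the predicted path's.

    No partition function is needed: only the single best path is compared
    with the gold path.
    """
    if predicted_tags != gold_tags:
        weights.update(path_features(words, gold_tags))
        weights.subtract(path_features(words, predicted_tags))
    return weights

weights = perceptron_update(Counter(), ["Sue", "left"],
                            ["name", "other"], ["other", "other"])
```

Features shared by both paths cancel, so only the places where the prediction disagrees with the gold labeling change the weights.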
71
MEMM & CRF Related Work
  • Maximum entropy for language tasks
  • Language modeling: Rosenfeld 94; Chen &
    Rosenfeld 99
  • Part-of-speech tagging: Ratnaparkhi 98
  • Segmentation: Beeferman, Berger & Lafferty 99
  • Named entity recognition: MENE, Borthwick,
    Grishman,…98
  • HMMs for similar language tasks
  • Part of speech tagging: Kupiec 92
  • Named entity recognition: Bikel et al 99
  • Other Information Extraction: Leek 97; Freitag &
    McCallum 99
  • Serial Generative/Discriminative Approaches
  • Speech recognition: Schwartz & Austin 93
  • Reranking Parses: Collins 00
  • Other conditional Markov models
  • Non-probabilistic local decision models: Brill
    95; Roth 98
  • Gradient-descent on state path: LeCun et al 98
  • Markov Processes on Curves (MPCs): Saul & Rahim
    99
  • Voted Perceptron-trained FSMs: Collins 02

72
Part-of-speech Tagging
Pereira 2001 personal comm.
45 tags, 1M words training data, Penn Treebank
DT NN NN , NN , VBZ
RB JJ IN PRP VBZ DT NNS , IN
RB JJ NNS TO PRP VBG NNS
WDT VBP RP NNS JJ , NNS
VBD .
The asbestos fiber , crocidolite, is unusually
resilient once it enters the lungs , with even
brief exposures to it causing symptoms that show
up decades later , researchers said .
Using spelling features:
use words, plus overlapping features:
capitalized, begins with a number, contains hyphen,
ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion,
-ity, -ies.
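
The spelling features can be sketched as a small function that returns every feature firing on a word. The feature names are invented for illustration, and the begins-with-a-digit test is an assumption about the item whose symbol is unclear in the list above.

```python
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")

def spelling_features(word):
    """Return the word identity plus overlapping spelling features."""
    feats = {"word=" + word.lower()}
    if word[:1].isupper():
        feats.add("capitalized")
    if word[:1].isdigit():          # assumed reading of "begins with ..."
        feats.add("begins-with-number")
    if "-" in word:
        feats.add("contains-hyphen")
    for suffix in SUFFIXES:
        if word.lower().endswith(suffix):
            feats.add("ends-in-" + suffix)
    return feats

example = spelling_features("exposures")
```

Several features can fire on the same word (for example both -ion and -tion), which a generative model would have trouble with but a conditional model handles without special treatment.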
73
Person name Extraction
McCallum 2001, unpublished
74
Person name Extraction
75
Features in Experiment
  • Capitalized Xxxxx
  • Mixed Caps XxXxxx
  • All Caps XXXXX
  • Initial Cap X….
  • Contains Digit xxx5
  • All lowercase xxxx
  • Initial X
  • Punctuation .,!(), etc
  • Period .
  • Comma ,
  • Apostrophe
  • Dash -
  • Preceded by HTML tag
  • Character n-gram classifier says string is a
    person name (80% accurate)
  • In stopword list (the, of, their, etc)
  • In honorific list (Mr, Mrs, Dr, Sen, etc)
  • In person suffix list (Jr, Sr, PhD, etc)
  • In name particle list (de, la, van, der, etc)
  • In Census lastname list segmented by P(name)
  • In Census firstname list segmented by P(name)
  • In locations lists (states, cities, countries)
  • In company name list (J. C. Penny)
  • In list of company suffixes (Inc, Associates,
    Foundation)

  • Hand-built FSM person-name extractor says yes
    (prec/recall 30%/95%)
  • Conjunctions of all previous feature pairs,
    evaluated at the current time step.
  • Conjunctions of all previous feature pairs,
    evaluated at current step and one step ahead.
  • All previous features, evaluated two steps ahead.
  • All previous features, evaluated one step behind.
Total number of features: 200k
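
A few of the word-shape features, and their evaluation at neighboring time steps, can be sketched with regular expressions. The patterns and the `name@offset` naming scheme are assumptions for illustration; the lexicon features and the feature conjunctions are not shown.

```python
import re

SHAPE_PATTERNS = {
    "Capitalized": r"[A-Z][a-z]+",
    "MixedCaps": r"[A-Z][a-z]+[A-Z][A-Za-z]*",
    "AllCaps": r"[A-Z]{2,}",
    "InitialCap": r"[A-Z].*",
    "ContainsDigit": r".*\d.*",
    "AllLowercase": r"[a-z]+",
    "Initial": r"[A-Z]",
    "Punctuation": r"[^\w\s]+",
}

def shape_features(token):
    """Names of the shape patterns that match the whole token."""
    return {name for name, pattern in SHAPE_PATTERNS.items()
            if re.fullmatch(pattern, token)}

def windowed_features(tokens, t, offsets=(-1, 0, 1, 2)):
    """Shape features of the tokens around position t, tagged with the offset."""
    feats = set()
    for offset in offsets:
        if 0 <= t + offset < len(tokens):
            feats |= {f"{name}@{offset:+d}"
                      for name in shape_features(tokens[t + offset])}
    return feats

around_w = windowed_features(["Dr", "W", "Cohen"], 1)
```

Evaluating every feature at several offsets, and then conjoining pairs, is how a few dozen base tests can grow into a very large feature set.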
76
Training and Testing
  • Trained on 65469 words from 85 pages, 30
    different companies' web sites.
  • Training takes 4 hours on a 1 GHz Pentium.
  • Training precision/recall is 96% / 96%.
  • Tested on a different set of web pages with similar
    size characteristics.
  • Testing precision is 92-95%,
    recall is 89-91%.

77
Chinese Word Segmentation
McCallum & Feng, to appear
  • Trained on 800 segmented sentences from UPenn
    Chinese Treebank.
  • Training time: 2 hours with L-BFGS.
  • Training F1: 99.4
  • Testing F1: 99.3
  • Previous top contenders: F1 85-95

78
Inducing State-Transition Structure
Chidlovskii, 2000
K-reversible grammars
79
Limitations of HMM/CRF models
  • HMM/CRF models have a linear structure
  • Web documents have a hierarchical structure
  • Are we suffering by not modeling this structure
    more explicitly?
  • How can one learn a hierarchical extraction
    model?
  • Coming up: STALKER, a hierarchical
    wrapper-learner
  • But first: how do we train wrapper-learners?

80
Tree-based Models
81
  • Extracting from one web site
  • Use site-specific formatting information e.g.,
    the JobTitle is a bold-faced paragraph in column
    2
  • For large well-structured sites, like parsing a
    formal language
  • Extracting from many web sites
  • Need general solutions to entity extraction,
    grouping into records, etc.
  • Primarily use content information
  • Must deal with a wide range of ways that users
    present data.
  • Analogous to parsing natural language
  • Problems are complementary
  • Site-dependent learning can collect training data
    for a site-independent learner
  • Site-dependent learning can boost accuracy of a
    site-independent learner on selected key sites

85
STALKER Hierarchical boundary finding
Muslea, Minton & Knoblock 99
  • Main idea
  • To train a hierarchical extractor, pose a series
    of learning problems, one for each node in the
    hierarchy
  • At each stage, extraction is simplified by
    knowing about the context.

89
(BEFORE(), AFTERnull)
90
(BEFORE(), AFTERnull)
91
(BEFORE(), AFTERnull)
92
Stalker hierarchical decomposition of two web
sites
93
Stalker summary and results
  • Rule format
  • Landmark automata format for rules, which
    extended BWI's format
  • E.g. <a>W. Cohen</a> CMU Web IE </li>
  • BWI: BEFORE(<, /, a, >, ANY, )
  • STALKER: BEGIN SkipTo(<, /, a, >), SkipTo()
  • Top-down rule learning algorithm
  • Carefully chosen ordering between types of rule
    specializations
  • Very fast learning: e.g. 8 examples vs. 274
  • A lesson: we often control the IE training
    data!
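
A SkipTo rule can be read as a sequence of landmarks that are consumed one after another; the field begins right after the last one. The sketch below assumes a pre-tokenized page, and both the token list and the final `":"` landmark are hypothetical, since the last landmark of the rule is not legible above.

```python
def skip_to(tokens, start, landmark):
    """Index just past the first occurrence of `landmark` at or after `start`."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None

def apply_rule(tokens, landmarks):
    """Apply SkipTo(l1), SkipTo(l2), ... in order; None if any landmark is missing."""
    position = 0
    for landmark in landmarks:
        position = skip_to(tokens, position, landmark)
        if position is None:
            return None
    return position

# Hypothetical tokenization of one list item.
tokens = ["<", "a", ">", "W", ".", "Cohen", "<", "/", "a", ">",
          "CMU", ":", "Web", "IE", "<", "/", "li", ">"]
begin = apply_rule(tokens, [["<", "/", "a", ">"], [":"]])
```

Because each SkipTo ignores everything before its landmark, a rule stays valid when the amount of text before the landmark varies from page to page.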

94
Why low sample complexity is important in
wrapper learning
At training time, only four examples are
available, but one would like to generalize to
future pages as well…
95
Wrapster: a hybrid approach to representing
wrappers
Cohen, Jensen & Hurst WWW02
  • Common representations for web pages include
  • a rendered image
  • a DOM tree (tree of HTML markup text)
  • gives some of the power of hierarchical
    decomposition
  • a sequence of tokens
  • a bag of words, a sequence of characters, a node
    in a directed graph, . . .
  • Questions
  • How can we engineer a system to generalize
    quickly?
  • How can we explore representational choices
    easily?

96
Wrapster architecture
  • Bias is an ordered set of builders.
  • Builders are simple micro-learners.
  • A single master algorithm co-ordinates learning.
  • Hybrid top-down/bottom-up rule learning
  • Terminology
  • Span: substring of page, created by a predicate
  • Predicate: subset of span × span, created by a
    builder
  • Builder: a micro-learner, created by hand

97
Wrapster predicates
  • A predicate is a binary relation on spans
  • p(s,t) means that t is extracted from s.
  • Membership in a predicate can be tested
  • Given (s,t), is p(s,t) true?
  • Predicates can be executed
  • EXECUTE(p,s) = { t : p(s,t) }

98
Example Wrapster predicate
Rendered page:
  • http://wasBang.org/aboutus.html
  • WasBang.com contact info
  • Currently we have offices in two locations
  • Pittsburgh, PA
  • Provo, UT

DOM tree:
html
  head
    …
  body
    p: WasBang.com .. info
    p: Currently..
    ul
      li > a: Pittsburgh, PA
      li > a: Provo, UT
99
Example Wrapster predicate
  • Example
  • p(s1,s2) iff s2 are the tokens below an li node
    inside a ul node inside s1.
  • EXECUTE(p,s1) extracts
  • Pittsburgh, PA
  • Provo, UT

100
Wrapster builders
  • Builders are based on simple, restricted
    languages, for example
  • L_tagpath: p is defined by tag_1,…,tag_k, and
    p_{tag_1,…,tag_k}(s1,s2) is true iff s1 and s2
    correspond to DOM nodes and s2 is reached from s1
    by following a path ending in tag_1,…,tag_k
  • EXECUTE(p_{ul,li}, s1) = { "Pittsburgh, PA",
    "Provo, UT" }
  • L_bracket: p is defined by a pair of strings
    (l,r), and p_{l,r}(s1,s2) is true iff s2 is preceded
    by l and followed by r.
  • EXECUTE(p_{in,locations}, s1) = { "two" }
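
A bracket predicate can be sketched on character offsets, with a span represented as a `(start, end)` pair into the page string. The function names are hypothetical, and one simplifying assumption is made: for each occurrence of the left string, only the nearest following right string is paired with it.

```python
def execute_bracket(page, outer, left, right):
    """All spans inside `outer` that are preceded by `left` and followed by `right`."""
    start, end = outer
    results = []
    i = page.find(left, start, end)
    while i != -1:
        inner_start = i + len(left)
        j = page.find(right, inner_start, end)
        if j == -1:
            break
        results.append((inner_start, j))
        i = page.find(left, i + 1, end)
    return results

def holds_bracket(page, outer, inner, left, right):
    """Membership test: is p_{left,right}(outer, inner) true?"""
    s, e = inner
    return (outer[0] <= s - len(left)
            and e + len(right) <= outer[1]
            and page[s - len(left):s] == left
            and page[e:e + len(right)] == right)

page = "Currently we have offices in two locations"
whole_page = (0, len(page))
spans = execute_bracket(page, whole_page, " in ", " locations")
extracted = [page[a:b] for a, b in spans]
```

The two functions correspond to the two operations on predicates: executing one on a span, and testing whether a given pair of spans belongs to it.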

101
Wrapster builders
  • For each language L there is a builder B which
    implements
  • LGG(positive examples of p(s1,s2)): the least
    general p in L that covers all the positive
    examples (like pairwise generalization)
  • For L_bracket, longest common prefix and suffix of
    the examples.
  • REFINE(p, examples): a set of p's that cover
    some but not all of the examples.
  • For L_tagpath, extend the path with one additional
    tag that appears in the examples.
  • Builders/languages can be combined
  • E.g. to construct a builder for (L1 and L2) or
  • (L1 composeWith L2)
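
For the bracket language, the LGG step can be sketched directly: the left string is the longest common suffix of the text before each example, and the right string is the longest common prefix of the text after it. The example representation `(page, start, end)` and the sample page are assumptions; `os.path.commonprefix` is used only because it compares strings character by character.

```python
import os

def lgg_bracket(examples):
    """Least general (left, right) bracket pair covering all example spans.

    Each example is (page, start, end), the extracted span being page[start:end].
    """
    befores = [page[:start] for page, start, _ in examples]
    afters = [page[end:] for page, _, end in examples]
    right = os.path.commonprefix(afters)
    # Longest common suffix = reversed longest common prefix of reversed strings.
    left = os.path.commonprefix([text[::-1] for text in befores])[::-1]
    return left, right

page = "Webmaster (New York). Perl. Librarian (Pittsburgh). MLS."
examples = []
for city in ("New York", "Pittsburgh"):
    start = page.index(city)
    examples.append((page, start, start + len(city)))
left, right = lgg_bracket(examples)
```

Taking the longest common context makes the learned predicate as specific as the examples allow, which is what "least general" means here.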

102
Wrapster builders - examples
  • Compose tagpaths and brackets
  • E.g., extract strings between ( and ) inside
    a list item inside an unordered list
  • Compose tagpaths and language-based extractors
  • E.g., extract city names inside the first
    paragraph
  • Extract items based on position inside a rendered
    table, or properties of the rendered text
  • E.g., extract items inside any column headed by
    text containing the words Job and Title
  • E.g. extract items in boldfaced italics

103
Composing builders
  • Composing builders for L_tagpath and L_bracket.
  • LGG of the locations would be
  • (p_tags composeWith p_{L,R})
  • where
  • tags = ul,li
  • L = (
  • R = )
  • Jobs at WasBang.com
  • Call (888)-555-1212 now to apply!
  • Webmaster (New York). Perl, servlets essential.
  • Librarian (Pittsburgh). MLS required.
  • Ski Instructor (Vancouver). Snowboarding skills
    also useful.

104
Composing builders structural/global
  • Jobs at WasBang.com
  • Call Alberta Hill at 1-888-555-1212 now to apply!
  • Webmaster (New York). Perl, servlets essential.
  • Librarian (Pittsburgh). MLS required.
  • Ski Instructor (Vancouver). Snowboarding skills
    also useful.
  • Composing builders for L_tagpath and L_city
  • L_city = {p_city}, where p_city(s1,s2) iff s2 is a
    city name inside of s1.
  • LGG of the locations would be
  • p_tags composeWith p_city

105
Table-based builders
How to represent links to pages about
singers? Builders can be based on a geometric
view of a page.
106
Wrapster results
[Figure: F1 as a function of the number of examples.]
107
Wrapster results
Examples needed for 100% accuracy
108
Site-dependent vs. site-independent IE
  • When is formatting information useful?
  • On a single site, format is extremely consistent.
  • Across many sites, format can vary widely.
  • Can we improve a site-independent classifier
    using site-dependent format features? For
    instance:
  • Smooth predictions toward ones that are
    locally consistent with formatting.
  • Learn a wrapper from noisy labels given by a
    site-independent IE system.
  • First step: obtaining features from the builders

109
Feature construction using builders
  • Let D be the set of all positive examples.
    Generate many small training sets D_i from D, by
    sliding small windows over D.
  • Let P be the set of all predicates found by
    any builder from any subset D_i.
  • For each predicate p, add a new feature f_p that
    is true for exactly those x ∈ D that are extracted
    from their containing page by p.
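
The three steps above can be sketched with a toy builder. Here an example is a hypothetical `(left_context, text)` pair, the only builder proposes "preceded by the longest common suffix of the left contexts", and `covers` is tied to that one predicate type; a real system would have one such pair per builder.

```python
import os

def common_left_context(examples):
    """Toy builder: predicate over the longest common suffix of the left contexts."""
    reversed_lefts = [left[::-1] for left, _ in examples]
    suffix = os.path.commonprefix(reversed_lefts)[::-1]
    return ("preceded-by", suffix) if suffix else None

def covers(predicate, example):
    """True if the example would be extracted by the toy predicate."""
    _, suffix = predicate
    left, _ = example
    return left.endswith(suffix)

def builder_features(examples, builders, window=2):
    """Slide a window over D, collect predicates P, and emit one feature per predicate."""
    predicates = set()
    for i in range(len(examples) - window + 1):
        subset = examples[i:i + window]
        for build in builders:
            predicate = build(subset)
            if predicate is not None:
                predicates.add(predicate)
    return {example: {p for p in predicates if covers(p, example)}
            for example in examples}

D = [("<li><b>", "Pittsburgh"), ("<li><b>", "Provo"), ("<p><b>", "Boston")]
features = builder_features(D, [common_left_context])
```

Small windows matter because a builder run on all of D at once would only return the single most general predicate, while subsets yield more specific ones that can serve as additional features.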

110
List1
111
List2
112
List3
114
Learning Formatting Patterns On the
Fly: Scoped Learning
Bagnell, Blei, McCallum, 2002
Formatting is regular on each site, but there are
too many different sites to wrap. Can we get the
best of both worlds?
115
Scoped Learning Generative Model
[Figure: plate diagram with hyperparameter α, global parameters θ, per-document formatting parameters φ, and per-word category c, word w, and formatting feature f; plates over the N words and the D documents.]
  • For each of the D documents:
  • Generate the multinomial formatting feature
    parameters φ from p(φ|α)
  • For each of the N words in the document:
  • Generate the nth category c_n from p(c_n).
  • Generate the nth word (global feature) from
    p(w_n|c_n,θ)
  • Generate the nth formatting feature (local
    feature) from p(f_n|c_n,φ)

116
Inference
Given a new web page, we would like to classify
each word, resulting in c = c_1, c_2, …, c_n.
This is not feasible to compute because of the
integral and sum in the denominator. We
experimented with two approximations:
  • MAP point estimate of φ
  • Variational inference
117
MAP Point Estimate
If we approximate φ with a point estimate, φ̂,
then the integral disappears and c decouples. We
can then label each word with
c_n = argmax_c p(c) p(w_n|c,θ) p(f_n|c,φ̂)
A natural point estimate is the posterior mode: a
maximum likelihood estimate for the local
parameters given the document in question.
E-step
M-step
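
One way the E-step/M-step alternation could look is sketched below. The function name, the add-`smooth` pseudo-counts (standing in for the prior on the formatting parameters), the fixed iteration count, and all the toy probabilities are assumptions made for illustration; the slide's own update equations are not reproduced here.

```python
def scoped_map_labels(doc, categories, prior, word_lik, formats,
                      iterations=5, smooth=1.0):
    """Estimate per-document formatting parameters by EM, then return posteriors.

    doc is a list of (word, formatting_feature) pairs; word_lik[c][w] is the
    global word model; phi[c][f] is the local, per-document formatting model.
    """
    phi = {c: {f: 1.0 / len(formats) for f in formats} for c in categories}

    def e_step():
        posteriors = []
        for word, fmt in doc:
            scores = {c: prior[c] * word_lik[c].get(word, 1e-6) * phi[c][fmt]
                      for c in categories}
            z = sum(scores.values())
            posteriors.append({c: s / z for c, s in scores.items()})
        return posteriors

    for _step in range(iterations):
        q = e_step()
        # M-step: re-estimate phi from expected counts plus pseudo-counts.
        for c in categories:
            total = sum(p[c] for p in q) + smooth * len(formats)
            for f in formats:
                count = sum(p[c] for p, (_word, fmt) in zip(q, doc) if fmt == f)
                phi[c][f] = (count + smooth) / total
    return e_step(), phi

doc = [("Engineer", "bold"), ("xyzzy", "bold"), ("the", "plain"), ("apply", "plain")]
word_lik = {
    "title": {"Engineer": 0.4, "xyzzy": 0.05, "the": 0.01, "apply": 0.01},
    "other": {"Engineer": 0.05, "xyzzy": 0.05, "the": 0.4, "apply": 0.3},
}
posteriors, phi = scoped_map_labels(
    doc, ["title", "other"], {"title": 0.5, "other": 0.5}, word_lik, ["bold", "plain"])
```

In this toy document the word "xyzzy" is uninformative under the global model, but it shares bold formatting with a word that is confidently a title, so the locally estimated formatting model pulls its label toward "title".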
118
Global Extractor: Precision 46%, Recall 75%
119
Scoped Learning Extractor: Precision 58%,
Recall 75%, Δ Error -22%
120
Broader View
Up to now we have been focused on segmentation
and classification.
[Figure: IE pipeline. Create ontology; Spider; Filter by relevance; IE (Segment, Classify, Associate, Cluster); Load DB into Database; Query, Search; Data mine. Training side: Document collection; Label training data; Train extraction models.]
121
Broader View
Now touch on some other issues:
[Figure: the same pipeline, with a Tokenize step added to IE and five stages numbered for the discussion that follows: (1) Associate, (2) Cluster, (3) Create ontology, (4) Train extraction models, (5) Data mine.]
122
(1) Association as Binary Classification
Sebastian Thrun conferred with Sue Becker, the
NIPS2002 General Chair.
[Entity annotations: Sebastian Thrun = Person, Sue Becker = Person, NIPS2002 General Chair = Role]
Person-Role (Sebastian Thrun, NIPS2002 General
Chair) ? NO
Person-Role (Sue Becker, NIPS2002
General Chair) ? YES
Do this with SVMs and tree kernels over parse
trees.
Zelenko et al, 2002
123
(1) Association with Finite State Machines
Ray & Craven, 2001
… This enzyme, UBC6, localizes to the endoplasmic
reticulum, with the catalytic domain facing the
cytosol. …
DET this N enzyme N ubc6 V localizes PREP to ART the
ADJ endoplasmic N reticulum PREP with ART the
ADJ catalytic N domain V facing ART the N cytosol
Subcellular-localization (UBC6, endoplasmic
reticulum)
124
(1) Association using Parse Tree
Miller et al 2000
Simultaneously POS tag, parse, extract &
associate!
Increase space of parse constituents to
include entity and relation tags.
Notation   Description
c_h        head constituent category
c_m        modifier constituent category
X_p        X of parent node
t          POS tag
w          word
Parameters                        e.g.
P(c_h | c_p)                      P(vp | s)
P(c_m | c_p, c_hp, c_{m-1}, w_p)  P(per/np | s, vp, null, said)
P(t_m | c_m, t_h, w_h)            P(per/nnp | per/np, vbd, said)
P(w_m | c_m, t_m, t_h, w_h)       P(nance | per/np, per/nnp, vbd, said)
(This is also a great example of extraction using
a tree model.)
125
(1) Association with Graphical Models
Roth & Yih 2002
Capture arbitrary-distance dependencies among
predictions.
126
(1) Association with Graphical Models
Roth & Yih 2002
Also capture long-distance dependencies among
predictions.
[Figure: graphical model with two entity nodes, labeled person and person?, joined by a relation node labeled lives-in.]
  • Random variable over the class of entity 1, e.g.
    over person, location,…
  • Random variable over the class of relation
    between entity 2 and 1, e.g. over lives-in,
    is-boss-of,…
  • Local language models contribute evidence to
    entity classification.
  • Local language models contribute evidence to
    relation classification.
  • Dependencies between classes of entities and
    relations!
  • Inference with loopy belief propagation.
127
(1) Association with Graphical Models
Roth & Yih 2002
[Figure: the same model as on the previous slide, with the nodes now labeled person, lives-in, and location.]
128
(1) Association of records from the web
5 label types sufficient for modeling 500 sites
Jensen & Cohen, 2001
Site pages:
  Toys.com: Company Info, Kites, Bicycles, …
  Company Info: Location Oregon
  Order Info: Call 1-800-FLY-KITE
  Kites: Box Kite 100, Stunt Kite 300
  Box Kite: Great for kids, Detailed specs
  Stunt Kite: Lots of fun, Detailed specs
  Specs: Color blue, Size small
  Specs: Color red, Size big
Extracted records:
  Name: Box Kite; Company: Toys.com; Location: Oregon; Order: 1-800-FLY-KITE; Cost: 100; Description: Great for kids; Color: blue; Size: small
  Name: Stunt Kite; Company: Toys.com; Location: Oregon; Order: 1-800-FLY-KITE; Cost: 300; Description: Lots of fun; Color: red; Size: big
130
(2) Clustering for Reference Matching and
De-duplication
Borthwick, 2000
Learn Pr({duplicate, not-duplicate} | record1,
record2) with a Maximum Entropy classifier.
Do greedy agglomerative clustering using this
probability as a distance metric.
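
The clustering step can be sketched as follows. The pairwise scorer `dup_prob` is passed in as a function; here a token-overlap (Jaccard) score is a made-up stand-in for the learned Maximum Entropy probability, and the average-link merge criterion and the threshold are assumptions.

```python
def jaccard(a, b):
    """Stand-in for a learned Pr(duplicate | record1, record2)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def greedy_agglomerative(records, dup_prob, threshold=0.5):
    """Repeatedly merge the pair of clusters with the highest average score."""
    clusters = [[r] for r in records]
    while len(clusters) > 1:
        best_score, best_pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                probs = [dup_prob(a, b) for a in clusters[i] for b in clusters[j]]
                score = sum(probs) / len(probs)
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_pair is None:      # nothing scores above the threshold
            break
        i, j = best_pair
        merged = clusters.pop(j)   # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

records = ["A. McCallum", "Andrew McCallum", "W. Cohen", "William Cohen"]
clusters = greedy_agglomerative(records, jaccard, threshold=0.3)
```

Each merge is final, which is what makes the procedure greedy: an early mistake cannot be undone by later evidence.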
131
(2) Clustering for Reference Matching and
De-duplication
  • Efficiently clustering large data sets by
    pre-clustering with a cheap distance metric.
  • McCallum, Nigam & Ungar, 2000
  • Learn a better distance metric.
  • Cohen & Richman, 2002
  • Don't simply merge greedily: capture dependencies
    among multiple merges.
  • Pasula, Marthi, Milch, Russell & Shpitser, NIPS
    2002

133
(3) Automatically Inducing an Ontology
Riloff, 95
Two inputs
(1)
(2)
Heuristic interesting meta-patterns.
134
(3) Automatically Inducing an Ontology
Riloff, 95
Subject/Verb/Object patterns that occur more
often in the relevant documents than the
irrelevant ones.
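
The idea of preferring patterns that are more frequent in relevant documents can be sketched with a plain relevance rate. This is a simplification for illustration, not the exact ranking function of the cited work; it assumes each document has already been reduced to a list of subject/verb/object tuples, and the `min_count` cutoff is an assumption.

```python
from collections import Counter

def rank_patterns(relevant_docs, irrelevant_docs, min_count=2):
    """Rank patterns by the fraction of their documents that are relevant."""
    relevant = Counter(p for doc in relevant_docs for p in set(doc))
    irrelevant = Counter(p for doc in irrelevant_docs for p in set(doc))
    scored = []
    for pattern, in_relevant in relevant.items():
        total = in_relevant + irrelevant[pattern]
        if total >= min_count:
            scored.append((in_relevant / total, in_relevant, pattern))
    return [pattern for _rate, _count, pattern in sorted(scored, reverse=True)]

bombed = ("terrorists", "bombed", "embassy")
said = ("police", "said", "<x>")
ranking = rank_patterns(
    relevant_docs=[[bombed, said], [bombed]],
    irrelevant_docs=[[said], [said]],
)
```

Patterns such as reporting verbs occur in both document sets and fall to the bottom, while domain-specific patterns rise to the top.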
136
(4) Training IE Models using Unlabeled Data
Collins & Singer, 1999
…says Mr. Cooper, a vice president of …
NNP NNP appositive phrase, head=president
Use two independent sets of features:
Contents: full-string=Mr._Cooper, contains(Mr.),
contains(Cooper)
Context: context-type=appositive,
appositive-head=president
1. Start with just seven rules and 1M
sentences of NYTimes:
full-string=New_York -> Location
full-string=California -> Location
full-string=U.S. -> Location
contains(Mr.) -> Person
contains(Incorporated) -> Organization
full-string=Microsoft -> Organization
full-string=I.B.M. -> Organization
2. Alternately train & label using each
feature set.
3. Obtain 83% accuracy at finding person,
location, organization & other in appositives
and prepositional phrases!
See also Brin 1998, Riloff & Jones 1999
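
The alternation between the two feature sets can be sketched as follows. This is a much simplified version for illustration: rules are plain feature-to-label mappings, a rule is kept when its label agrees with at least `min_precision` of the labeled examples carrying that feature, and existing rules are never overwritten. The function names, the threshold, and the four toy examples are assumptions.

```python
from collections import Counter, defaultdict

def label_with(rules, feature_sets):
    """Label each example by majority vote of the rules that fire (None if none)."""
    labels = []
    for feats in feature_sets:
        votes = Counter(rules[f] for f in feats if f in rules)
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels

def induce_rules(feature_sets, labels, min_precision=0.95):
    """Keep feature -> label rules that are nearly always right on labeled data."""
    counts = defaultdict(Counter)
    for feats, label in zip(feature_sets, labels):
        if label is not None:
            for f in feats:
                counts[f][label] += 1
    rules = {}
    for f, by_label in counts.items():
        label, n = by_label.most_common(1)[0]
        if n / sum(by_label.values()) >= min_precision:
            rules[f] = label
    return rules

def cotrain(examples, seed_rules, rounds=3):
    """Alternate: contents rules label data for context rules, and vice versa."""
    contents = [c for c, _ in examples]
    context = [x for _, x in examples]
    contents_rules, context_rules = dict(seed_rules), {}
    for _round in range(rounds):
        new_context = induce_rules(context, label_with(contents_rules, contents))
        context_rules = {**new_context, **context_rules}
        new_contents = induce_rules(contents, label_with(context_rules, context))
        contents_rules = {**new_contents, **contents_rules}
    return contents_rules, context_rules

examples = [
    ({"contains(Mr.)", "full-string=Mr._Cooper"}, {"appositive-head=president"}),
    ({"full-string=Ms._Lee"}, {"appositive-head=president"}),
    ({"full-string=New_York"}, {"context=in"}),
    ({"full-string=Boston"}, {"context=in"}),
]
seeds = {"contains(Mr.)": "Person", "full-string=New_York": "Location"}
contents_rules, context_rules = cotrain(examples, seeds)
```

The seed rules never see "Ms. Lee" or "Boston", yet both end up labeled because the context rules learned from the seeded examples carry the labels across.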