1
Information Extraction with Conditional Random Fields
  • Andrew McCallum
  • University of Massachusetts Amherst
  • Joint work with
  • Aron Culotta, Wei Li and Ben Wellner
  • Bruce Croft, James Allan, David Pinto, Xing Wei
  • John Lafferty (CMU), Fernando Pereira (UPenn)

2
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k documents, several centuries old: Qing Dynasty Archives - memos - newspaper articles - diaries
3
IE from Research Papers
[McCallum et al 99]
4
IE from Research Papers
5
Extracting Job Openings from the Web
6
A Portal for Job Openings
7
Data Mining the Extracted Job Information
8
What is Information Extraction?
As a task:
Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

NAME              TITLE     ORGANIZATION
9
What is Information Extraction?
As a task:
Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

IE

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..
10
What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

Extracted segments:
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
14
IE in Context
[Pipeline: Document collection -> Spider -> Filter by relevance -> IE (Segment, Classify, Associate, Cluster) -> Load DB -> Database -> Query, Search -> Data mining (Prediction, Outlier detection, Decision support)]
[Supporting steps: Create ontology; Label training data; Train extraction models]
15
Issues that arise
  • Application issues
    • Directed spidering
    • Page filtering
    • Sequence segmentation
    • Segment classification
    • Record association
    • Co-reference
  • Scientific issues
    • Learning more than 100k parameters from limited and noisy training data
    • Taking advantage of rich, interdependent features
    • Deciding which input feature representation to use
    • Efficient inference in models with many interrelated predictions
    • Clustering massive data sets

17
Outline
  • The need. The task.
  • Review of Random Field Models for IE.
  • Recent work
    • New results for Table Extraction
    • New training procedure using Feature Induction
    • New extended model for Co-reference Resolution
    • New model for Factorial Sequence Labeling
  • Conclusions

18
Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
[Figure: finite state model and graphical model over states S_{t-1}, S_t, S_{t+1} (transitions) and observations O_{t-1}, O_t, O_{t+1}; generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8]

Parameters: for all states S = {s1, s2, ...}
  Start state probabilities: P(s_t)
  Transition probabilities: P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize probability of training observations (w/ prior)
19
IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Ben Rosenberg spoke this example sentence.
and a trained HMM with states: person name, location name, background,
find the most likely state sequence (Viterbi):
  Yesterday [Ben Rosenberg] spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Ben Rosenberg
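To make the decoding step concrete, here is a minimal Viterbi sketch for a discrete HMM. The toy states and probability tables are illustrative assumptions, not the talk's model:

# Minimal Viterbi decoding for a discrete HMM (illustrative sketch).
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely state sequence. start_p[s], trans_p[prev][s], emit_p[s][o]
    are probabilities; unseen observations get a tiny floor probability."""
    states = list(start_p)
    logp = lambda x: np.log(max(x, 1e-12))
    V = [{s: logp(start_p[s]) + logp(emit_p[s].get(obs[0], 0)) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + logp(trans_p[p][s]))
            col[s] = V[-1][prev] + logp(trans_p[prev][s]) + logp(emit_p[s].get(o, 0))
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    s = max(states, key=lambda q: V[-1][q])   # best final state
    path = [s]
    for ptr in reversed(back):                # follow back-pointers
        s = ptr[s]
        path.append(s)
    return path[::-1]

# Toy usage: tag "Ben" and "Rosenberg" as person-name, the rest background.
states = ["person", "background"]
start = {"person": 0.3, "background": 0.7}
trans = {"person": {"person": 0.6, "background": 0.4},
         "background": {"person": 0.2, "background": 0.8}}
emit = {"person": {"Ben": 0.3, "Rosenberg": 0.3},
        "background": {"Yesterday": 0.2, "spoke": 0.2, "sentence.": 0.2}}
print(viterbi("Yesterday Ben Rosenberg spoke".split(), start, trans, emit))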
20
We want More than an Atomic View of Words
Would like a richer representation of text: many arbitrary, overlapping features of the words, e.g.:
  • identity of word
  • ends in "-ski"
  • is capitalized
  • is part of a noun phrase
  • is in a list of city names
  • is under node X in WordNet
  • is in bold font
  • is indented
  • is in hyperlink anchor
  • last person name was female
  • next two words are "and Associates"
[Figure: the same S/O chain as before; at O_t the word "Wisniewski" fires features such as "is Wisniewski", "ends in -ski", "part of noun phrase"]
21
Problems with Richer Representation and a Joint Model
  • These arbitrary features are not independent:
    • Multiple levels of granularity (chars, words, phrases)
    • Multiple dependent modalities (words, formatting, layout)
    • Past and future
  • Two choices:

1. Ignore the dependencies. This causes over-counting of evidence (a la naive Bayes). Big problem when combining evidence, as in Viterbi!
2. Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
[Figure: two HMM-style graphical models, one treating the observation features as independent, one with a per-state Bayes net over the observations]
22
Conditional Sequence Models
  • We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).
  • Can examine features, but is not responsible for generating them.
  • Don't have to explicitly model their dependencies.
  • Don't waste modeling effort trying to generate what we are given at test time anyway.

23
From HMMs to MEMMs to CRFs
Conditional Finite State Sequence Models
[McCallum, Freitag & Pereira, 2000] [Lafferty, McCallum & Pereira, 2001]
[Figure: three linear chains over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}: a generative HMM, a conditional MEMM, and a CRF]
(A special case of MEMMs and CRFs.)
24
From HMMs to CRFs
Conditional Finite State Sequence Models
[McCallum, Freitag & Pereira, 2000] [Lafferty, McCallum & Pereira, 2001]

Joint:        P(s, o) = prod_t P(s_t | s_{t-1}) P(o_t | s_t)

Conditional:  P(s | o) = (1/Z(o)) prod_t Phi_s(s_t, s_{t-1}) Phi_o(o_t, s_t)

where each potential is an exponential function of weighted features, Phi(x) = exp( sum_k lambda_k f_k(x) ).

(A super-special case of Conditional Random Fields.)
25
Conditional Random Fields (CRFs)
[Lafferty, McCallum & Pereira, 2001]
[Figure: linear-chain CRF over states S_t ... S_{t+4}, all conditioned on the observation sequence o = (O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4})]

P(s | o) = (1/Z_o) prod_t exp( sum_k lambda_k f_k(s_t, s_{t-1}, o, t) )

Markov on s, conditional dependency on o. The Hammersley-Clifford theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.
Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs. Set parameters by maximum likelihood, using an optimization method on dL/dlambda_k.
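As a concrete companion to the formula above, a minimal linear-chain CRF sketch. The feature function is a toy assumption; the point is the forward recursion, which is the O(|o| |S|^2) dynamic programming the slide refers to:

import numpy as np

def features(obs, t, s_prev, s):
    # Toy, illustrative features: arbitrary, overlapping functions of the
    # whole observation sequence (here just the current word and transition).
    w = obs[t]
    return [f"word={w}|s={s}", f"cap={w[0].isupper()}|s={s}", f"trans={s_prev}->{s}"]

def log_potentials(obs, weights, S):
    # M[t, sp, s] = sum_k lambda_k f_k(s_{t-1}=sp, s_t=s, o, t)
    T = len(obs)
    M = np.zeros((T, S, S))
    for t in range(T):
        for sp in range(S):
            for s in range(S):
                M[t, sp, s] = sum(weights.get(f, 0.0) for f in features(obs, t, sp, s))
    return M

def log_Z(M):
    # Forward pass: O(|o| * |S|^2), just like an HMM.
    alpha = M[0, 0, :]                 # assume a designated start state 0
    for t in range(1, len(M)):
        alpha = np.logaddexp.reduce(alpha[:, None] + M[t], axis=0)
    return np.logaddexp.reduce(alpha)

def log_prob(states, obs, weights, S):
    # log P(s|o) = sum_t M[t, s_{t-1}, s_t] - log Z(o)
    M = log_potentials(obs, weights, S)
    score = M[0, 0, states[0]]
    score += sum(M[t, states[t - 1], states[t]] for t in range(1, len(obs)))
    return score - log_Z(M)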
26
Training CRFs
  • Methods:
    • iterative scaling (quite slow)
    • conjugate gradient (much faster)
    • limited-memory quasi-Newton methods, BFGS (super fast)
  • Each iteration has complexity comparable to standard Baum-Welch.

[Sha & Pereira 2002] [Malouf 2002]
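A sketch of what such training could look like, reusing the hypothetical log_prob from the previous sketch. scipy's L-BFGS-B stands in for the limited-memory quasi-Newton methods named above; the gradient is left to finite differences for brevity, whereas a real trainer computes it from forward-backward expectations:

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w_vec, data, feat_index, S):
    # Pack the flat weight vector back into the feature -> weight dict.
    weights = {f: w_vec[i] for f, i in feat_index.items()}
    return -sum(log_prob(states, obs, weights, S) for states, obs in data)

# Toy usage, assuming `data` is a list of (state_sequence, word_sequence)
# pairs and `feat_index` maps every feature string to a vector index:
# result = minimize(neg_log_likelihood, np.zeros(len(feat_index)),
#                   args=(data, feat_index, S), method="L-BFGS-B")
# trained_weights = result.x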
27
General CRFs vs. HMMs
  • More general and expressive modeling technique
  • Comparable computational efficiency for inference
  • Features may be arbitrary functions of any or all observations
  • Parameters need not fully specify generation of observations, so they require less training data
  • Easy to incorporate domain knowledge

28
Main Point 1
Conditional probability sequence models give
great flexibility regarding features used, and
have efficient dynamic-programming-based
algorithms for inference.
29
Outline
  • The need. The task.
  • Review of Random Field Models for IE.
  • Recent work
    • New results for Table Extraction
    • New training procedure using Feature Induction
    • New extended model for Co-reference Resolution
    • New model for Factorial Sequence Labeling
  • Conclusions

30
Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at 19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

               Milk Cows and Production of Milk and Milkfat:
                        United States, 1993-95
--------------------------------------------------------------------------
                             Production of Milk and Milkfat 2/
        Number of    -----------------------------------------------------
 Year   Milk Cows 1/   Per Milk Cow      Percentage of     Total
                       Milk    Milkfat   Fat in All Milk   Milk      Milkfat
--------------------------------------------------------------------------
        1,000 Head     --- Pounds ---    Percent           Million Pounds

 1993    9,589         15,704    575       3.66            150,582   5,514.4
 1994    9,500         16,175    592       3.66            153,664   5,623.7
 1995    9,461         16,451    602       3.66            155,644   5,694.3
--------------------------------------------------------------------------
 1/ Average number during year, excluding heifers not yet fresh.
 2/ Excludes milk sucked by calves.

31
Table Extraction from Government Reports
[Pinto, McCallum, Wei & Croft, 2003 SIGIR]
100 documents from www.fedstats.gov

Labels (12 in all), assigned per line by the CRF:
  • Non-Table
  • Table Title
  • Table Header
  • Table Data Row
  • Table Section Data Row
  • Table Footnote
  • ...

[Same government report excerpt and table as on the previous slide.]
Features, computed per line (see the sketch below):
  • Percentage of digit chars
  • Percentage of alpha chars
  • Indented
  • Contains 5 consecutive spaces
  • Whitespace in this line aligns with prev.
  • ...
  • Conjunctions of all previous features, at time offsets (0,0), (-1,0), (0,1), (1,2).

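A minimal sketch of how such per-line features might be computed; the names and the alignment heuristic are illustrative assumptions, and the exact definitions in the paper may differ:

def line_features(line, prev_line=""):
    """Toy per-line layout features of the kind listed above."""
    n = max(len(line), 1)
    return {
        "pct_digit_chars": sum(c.isdigit() for c in line) / n,
        "pct_alpha_chars": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith(" "),
        "five_consecutive_spaces": "     " in line,
        # crude check that whitespace columns line up with the previous line
        "aligns_with_prev": any(a == " " == b for a, b in zip(line, prev_line)),
    }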
32
Table Extraction Experimental Results
[Pinto, McCallum, Wei & Croft, 2003 SIGIR]

Method                    Line labels,      Table segments,
                          percent correct   F1
HMM                       65                64
Stateless MaxEnt          85                -
CRF w/out conjunctions    52                68
CRF                       95                92
                          (Δerror 85%)      (Δerror 77%)

(Δerror: the CRF's reduction in error relative to the HMM.)
33
Main Point 2
Conditional Random Fields were more accurate in practice than a generative model on a table extraction task (and others)...
(but still not on some others, such as Hindi NER)
34
Outline
  • The need. The task.
  • Review of Random Field Models for IE.
  • Recent work
    • New results for Table Extraction
    • New training procedure using Feature Induction
    • New extended model for Co-reference Resolution
    • New model for Factorial Sequence Labeling
  • Conclusions

35
A key remaining question for CRFs, given their
freedom to include the kitchen sink's worth of
features, is what features to include?
36
Feature Induction for CRFs
[McCallum, 2003, UAI]
  1. Begin with knowledge of atomic features, but no features yet in the model.
  2. Consider many candidate features, including atomic ones and conjunctions.
  3. Evaluate each candidate feature.
  4. Add to the model some of those that are ranked highest.
  5. Train the model, and iterate from step 2 (see the sketch below).

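A sketch of that loop; `model` is a hypothetical object whose estimated_gain / add_features / train methods stand in for the gain computation described on the next slide:

def induce_features(candidates, model, data, rounds=10, per_round=100):
    """candidates: a set of candidate feature ids (atomic and conjunctions)."""
    for _ in range(rounds):
        # rank candidates by (approximate) gain in training-data likelihood
        gains = {f: model.estimated_gain(f, data) for f in candidates}
        best = sorted(gains, key=gains.get, reverse=True)[:per_round]
        model.add_features(best)        # step 4: add the highest-ranked
        candidates -= set(best)
        model.train(data)               # step 5: retrain, then repeat
    return model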
37
Candidate Feature Evaluation
[McCallum, 2003, UAI]
Common method: Information Gain
True optimization criterion: likelihood of training data
  • Technical meat is in how to calculate this efficiently for CRFs:
    • Mean field approximation
    • Emphasize error instances (related to Boosting)
    • Newton's method to set λ

38
Named Entity Recognition
[McCallum & Li, 2003, CoNLL]

CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels   Examples
PER      Yayuk Basuki, Innocent Butare
ORG      3M, KDP, Leicestershire
LOC      Leicestershire, Nirmal Hriday, The Oval
MISC     Java, Basque, 1,000 Lakes Rally
39
Automatically Induced Features
[McCallum & Li, 2003, CoNLL]

Index  Feature
0      inside-noun-phrase (o_{t-1})
5      stopword (o_t)
20     capitalized (o_{t+1})
75     word=the (o_t)
100    in-person-lexicon (o_{t-1})
200    word=in (o_{t+2})
500    word=Republic (o_{t+1})
711    word=RBI (o_t) & header=BASEBALL
1027   header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298   company-suffix-word (firstmention_{t+2})
4040   location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945   moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474   word=the (o_{t-2}) & word=of (o_t)
40
Named Entity Extraction Results
[McCallum & Li, 2003, CoNLL]

Method                                                   F1    # features
CRFs w/out Feature Induction                             75    1M
CRFs with Feature Induction (based on Likelihood Gain)   90    60k
41
Named Entity Extraction Results
[McCallum & Li, 2003, CoNLL]

Method                                                   F1
BBN's Identifinder                                       73
CRFs w/out Feature Induction                             75
CRFs with Feature Induction (based on Info Gain)         84
CRFs with Feature Induction (based on Likelihood Gain)   90
42
Noun Phrase Segmentation

B        I             I     B  I     I    O    O  O
Rockwell International Corp. 's Tulsa unit said it signed

B I         I         O         B   I        O    B      I
a tentative agreement extending its contract with Boeing Co.

O  O       B          I     O   B      B  I   I
to provide structural parts for Boeing 's 747 jetliners.

Method                                                    Number of features
Combination of 8 SVMs [Kudo & Matsumoto 2001]             several million? (3.8 million)
CRF with hand-engineered features [Sha & Pereira 2002]    800 thousand
CRF with Feature Induction                                25 thousand

Statistical tie for best performance in the world.
43
Optical Character Recognition (OCR)
16x8 images of handwritten chars from 150 human subjects; char sequences forming words, average length 8. Train on 600, test on 5500.
[Taskar, Guestrin & Koller, 2003 ICML] Maximum-margin Markov Networks (M3)
[Bar chart of classification error (scale roughly 10-30) comparing LogReg, SVM (linear, quadratic, cubic), CRF, CRF with Feature Induction, and M3 (linear, quadratic, cubic)]
44
Main Point 3
One of CRFs' chief challenges---choosing the feature set---can also be addressed in a formal way, based on maximum likelihood. For CRFs (and many other methods) Information Gain is not the appropriate metric for selecting features.
45
Outline
  • The need. The task.
  • Review of Random Field Models for IE.
  • Recent work
    • New results for Table Extraction
    • New training procedure using Feature Induction
    • New extended model for Co-reference Resolution
    • New model for Factorial Sequence Labeling
  • Conclusions

46
IE in Context
[Pipeline: Document collection -> Spider -> Filter by relevance -> IE (Segment, Classify, Associate, Cluster) -> Load DB -> Database -> Query, Search -> Data mining (Prediction, Outlier detection, Decision support)]
[Supporting steps: Create ontology; Label training data; Train extraction models]
48
Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"

Input: news article, with named-entity "mentions" tagged:
Today Secretary of State Colin Powell met with ... he ... Condoleezza Rice ... Mr Powell ... she ... Powell ... President Bush ... Rice ... Bush ...

Output: number of entities, N = 3
1. Secretary of State Colin Powell; he; Mr. Powell; Powell
2. Condoleezza Rice; she; Rice
3. President Bush; Bush
49
Inside the Traditional Solution
Pair-wise Affinity Metric: Mention (3) "... Mr Powell ..." vs. Mention (4) "... Powell ..." -> Y/N?

Y/N  Feature                                          Weight
N    Two words in common                               29
Y    One word in common                                13
Y    "Normalized" mentions are string identical        39
Y    Capitalized word in common                        17
Y    > 50% character tri-gram overlap                  19
N    < 25% character tri-gram overlap                 -34
Y    In same sentence                                   9
Y    Within two sentences                               8
N    Further than 3 sentences apart                    -1
Y    "Hobbs Distance" < 3                              11
N    Number of entities in between two mentions = 0    12
N    Number of entities in between two mentions > 4    -3
Y    Font matches                                       1
Y    Default                                          -19

OVERALL SCORE: 98  >  threshold = 0
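The decision above is just a weighted sum of fired features compared against a threshold. A minimal sketch, with feature functions and weights as illustrative stand-ins for the table:

def affinity(m1, m2, weights, feature_fns):
    """Weighted sum of fired pair-wise features, starting from the default."""
    score = weights.get("default", 0.0)
    for name, fires in feature_fns.items():
        if fires(m1, m2):
            score += weights[name]
    return score

# Toy usage with two of the features from the table:
feature_fns = {
    "one_word_in_common": lambda a, b: bool(set(a.split()) & set(b.split())),
    "capitalized_word_in_common": lambda a, b: any(
        w[0].isupper() and w in b.split() for w in a.split()),
}
weights = {"default": -19, "one_word_in_common": 13,
           "capitalized_word_in_common": 17}
coreferent = affinity("Mr Powell", "Powell", weights, feature_fns) > 0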
50
The Problem
Pair-wise merging decisions are being made independently from each other.
[Figure: mentions "Mr Powell", "Powell", "she" with independent pairwise decisions: Mr Powell/Powell affinity 98 (Y), Powell/she affinity -104 (N), Mr Powell/she affinity 11 (Y): an inconsistent triangle]
They should be made in relational dependence with each other.
Affinity measures are noisy and imperfect.
51
A UC Berkeley Solution
[Russell 2001, Pasula et al 2002]
(Applied to citation matching, and object correspondence in vision)
[Figure: generative model with per-entity attributes (id, surname, age, gender, ...) and per-mention evidence (id, words, context, distance, fonts, ...), over N entities]
Number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify model structure during inference---MCMC.
52
A Markov Random Field for Co-reference (MRF)
[McCallum & Wellner, 2003, ICML submitted]
[Figure: mentions "Mr Powell", "Powell", "she" with a Y/N decision variable on each pair and edge weights 45 (Mr Powell/Powell), -30 (Powell/she), 11 (Mr Powell/she)]
Make pair-wise merging decisions in dependent relation to each other by:
  - calculating a joint prob.
  - including all edge weights
  - adding dependence on consistent triangles.
53
A Markov Random Field for Co-reference (MRF)
[McCallum & Wellner, 2003, ICML submitted]
[Figure: labeling Mr Powell/Powell = N, Powell/she = N, Mr Powell/she = Y scores -(45) - (-30) + (11) = -4]
54
A Markov Random Field for Co-reference (MRF)
[McCallum & Wellner, 2003, ICML submitted]
[Figure: labeling Mr Powell/Powell = Y, Powell/she = N, Mr Powell/she = Y is an inconsistent triangle: (45) - (-30) + (11), score -infinity]
55
A Markov Random Field for Co-reference (MRF)
[McCallum & Wellner, 2003, ICML submitted]
[Figure: labeling Mr Powell/Powell = Y, Powell/she = N, Mr Powell/she = N scores (45) - (-30) - (11) = 64]
56
Inference in these MRFs = Graph Partitioning
[Boykov, Veksler & Zabih, 1999] [Kolmogorov & Zabih, 2002] [Yu, Cross & Shi, 2002]
[Figure: graph over mentions "Mr Powell", "Powell", "Condoleezza Rice", "she" with edge weights 45, -106, -30, -134, 11, 10]
58
Inference in these MRFs = Graph Partitioning
[Boykov, Veksler & Zabih, 1999] [Kolmogorov & Zabih, 2002] [Yu, Cross & Shi, 2002]
[Figure: same graph; one candidate partitioning of the mentions scores -22]
59
Inference in these MRFs = Graph Partitioning
[Boykov, Veksler & Zabih, 1999] [Kolmogorov & Zabih, 2002] [Yu, Cross & Shi, 2002]
[Figure: same graph; the partitioning {Mr Powell, Powell} / {Condoleezza Rice, she} scores 314]
60
Markov Random Fields for Co-reference
  • Train by maximum likelihood
    • Can calculate the gradient by Gibbs sampling, or approximate by stochastic gradient ascent (e.g. voted perceptron).
    • Given labeled training data in which partitions are given, learn an affinity measure for which graph partitioning will reproduce those partitions.
  • In need of better algorithms for graph partitioning
    • Standard algorithms (e.g. Fiduccia & Mattheyses) do not apply with negative edge weights.
    • Currently using a randomized version of "Correlation Clustering" [Bansal, Blum & Chawla, 2002]---a greedy algorithm (see the sketch below).

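A sketch of greedy agglomerative partitioning with possibly negative edge weights, in the spirit of the randomized correlation-clustering heuristic mentioned above; the details are assumptions, not the authors' exact algorithm:

import random

def greedy_partition(mentions, weight, seed=0):
    """weight(a, b) -> learned affinity (possibly negative). Repeatedly merge
    the pair of clusters with the largest positive inter-cluster affinity."""
    rng = random.Random(seed)
    clusters = [{m} for m in mentions]
    while True:
        rng.shuffle(clusters)   # randomized order breaks ties across runs
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                g = sum(weight(a, b) for a in clusters[i] for b in clusters[j])
                if g > 0 and (best is None or g > best[0]):
                    best = (g, i, j)
        if best is None:        # no merge improves the objective
            return clusters
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]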
61
Co-reference Experimental Results
[McCallum & Wellner, 2003]
Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories
Method                     Partition F1   Pair F1
Single-link threshold      16             18
Best prev match [Morton]   83             89
MRFs                       88             92
                           (Δerror 30%)   (Δerror 28%)

DARPA MUC-6 newswire article corpus, 30 stories
Method                     Partition F1   Pair F1
Single-link threshold      11             7
Best prev match [Morton]   70             76
MRFs                       74             80
                           (Δerror 13%)   (Δerror 17%)
62
Outline
  • The need. The task.
  • Review of Random Field Models for IE.
  • Recent work
    • New results for Table Extraction
    • New training procedure using Feature Induction
    • New extended model for Co-reference Resolution
    • New model for Factorial Sequence Labeling
  • Conclusions

63
Multiple Nested Predictions on the Same Sequence
[Figure: stacked label chains over one sentence: named-entity tag, part-of-speech, and segmentation (output prediction), fully-connected to the Chinese characters (input observation)]
64
Multiple Nested Predictions on the Same Sequence
[Figure: same stack; now part-of-speech is the output prediction, with segmentation and the Chinese characters as input observations]
65
Multiple Nested Predictions on the Same Sequence
[Figure: same stack; now the named-entity tag is the output prediction, with part-of-speech, segmentation, and the Chinese characters as input observations]
But errors in each stage are compounding. Uncertainty from one stage to the next is not preserved.
66
Predict Cross-Product of all Labels on each Position
3 x 45 x 11 = 1485 possible states, e.g. state label = Word-beg/Noun/Person
O(V x 1485^2) parameters
O(|o| x 1485^2) running time
[Figure: a single chain over Segmentation x POS x NE cross-product labels (output prediction), fully-connected to the Chinese characters (input observation)]
67
Factorial CRFs
[Rohanimanesh & McCallum 2003]
Undirected analogue to Dynamic Bayes Nets (DBNs)
3 + 45 + 11 = 59 labels per position, O(V x 59) parameters
[Figure: three coupled output chains (named-entity tag, part-of-speech, segmentation), fully-connected to the Chinese characters (input observation)]
Perform inference in this cyclic graph with Tree Re-parameterization [Jaakkola, Wainwright, Willsky, 2001]
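The parameter-count contrast between the cross-product and factorial designs is easy to check:

# State-space arithmetic from the two slides above.
cross_product_states = 3 * 45 * 11              # 1485 joint labels per position
cross_product_pairs = cross_product_states ** 2 # 2,205,225 transition pairs
factorial_labels = 3 + 45 + 11                  # 59 labels spread over three chains
print(cross_product_states, cross_product_pairs, factorial_labels)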
68
POS+NounPhrase Experimental Results
[Rohanimanesh & McCallum, 2003]

Method              All labels, % correct
Cascaded CRF        72
FCRF                77
CRF w/ oracle POS   92
69
POS+NounPhrase Experimental Results
[Rohanimanesh & McCallum, 2003]

Method              All labels,   POS labels,   NP labels,
                    % correct     % correct     % correct
Cascaded CRF        72            77            90
FCRF                77            82            86
CRF w/ oracle POS   92            100           92
70
Outline
  • The need. The task.
  • Review of Random Field Models for IE.
  • Recent work
    • New results for Table Extraction
    • New training procedure using Feature Induction
    • New extended model for Co-reference Resolution
    • New model for Factorial Sequence Labeling
  • Conclusions

71
CRF Related Work
  • Maximum entropy for language tasks:
    • Language modeling [Rosenfeld 94, Chen & Rosenfeld 99]
    • POS tagging, conditioning on previous state [Ratnaparkhi 98]
    • Segmentation [Beeferman, Berger & Lafferty 99]
  • Other conditional Markov models:
    • Sequence of Winnow classifiers [Roth 98]
    • Gradient descent on state path [LeCun et al 98]
    • Maximum Entropy Markov Models [McCallum, Freitag & Pereira 2000], used by [Klein, Smarr, Nguyen & Manning 2003], ...
    • Maximum-margin sequence models [Taskar, Guestrin & Koller 2003, Altun & Hofmann 2003, Joachims 2003]
  • Training methods:
    • Limited-memory quasi-Newton [Malouf 2002, Sha & Pereira 2002]
    • Voted Perceptron [Collins 2002]
    • Adaptive Over-relaxed Bound Optimization [Roweis 2003]

72
Conclusions
  • Conditional Markov Random Fields combine the benefits of:
    • Conditional probability models (arbitrary features)
    • Markov models (for sequences or other relations)
  • Success in:
    • Feature induction
    • Coreference analysis
    • Factorial finite state models
  • Future work:
    • Structure learning, semi-supervised learning
    • Tight integration of IE and Data Mining

73
End of talk
74
Voted Perceptron Sequence Models
[Collins 2002]
Like CRFs with stochastic gradient ascent and a Viterbi approximation:

  lambda_k <- lambda_k + f_k(s_true, o) - f_k(s_Viterbi, o)

(analogous to the gradient for this one training instance)
Avoids calculating the partition function (normalizer) Z_o, but is gradient ascent, not a 2nd-order or conjugate-gradient method.
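A sketch of the update loop; viterbi_decode and feature_count are assumed helpers in the spirit of the earlier sketches, not Collins's exact implementation:

from collections import Counter

def perceptron_train(data, feature_count, viterbi_decode, epochs=5):
    """data: list of (true_states, obs); feature_count(states, obs) -> Counter."""
    weights = Counter()
    summed = Counter()   # summing snapshots approximates the "voted" variant
    for _ in range(epochs):
        for true_states, obs in data:
            guess = viterbi_decode(obs, weights)
            if guess != true_states:
                weights.update(feature_count(true_states, obs))   # + f(s_true, o)
                weights.subtract(feature_count(guess, obs))       # - f(s_Viterbi, o)
            summed.update(weights)
    return summed   # divide by (epochs * len(data)) for averaged weights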
75
Part-of-speech Tagging
[Pereira 2001 personal comm.]
45 tags, 1M words training data, Penn Treebank

The/DT asbestos/NN fiber/NN ,/, crocidolite/NN ,/, is/VBZ unusually/RB resilient/JJ once/IN it/PRP enters/VBZ the/DT lungs/NNS ,/, with/IN even/RB brief/JJ exposures/NNS to/TO it/PRP causing/VBG symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ ,/, researchers/NNS said/VBD ./.

                using words only        + spelling features
Method          error    oov error      error    Δerr    oov error    Δerr
HMM             5.69     45.99
CRF             5.55     48.05          4.27     -24%    23.76        -50%

Spelling features: capitalized, begins with a number, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
76
Big-Picture Future Directions
  • Unification of IE, data mining, decision making
  • Build large, working systems to test these ideas
    • Bring CiteSeer to UMass; build a super-replacement
  • Configuring & training complex systems with less human effort
    • Resource-bounded human/machine interaction
    • Learning from labeled and unlabeled data, semi-supervised learning
  • Combinations of text with other modalities
    • images, video, sound
    • robotics, state estimation, model-building
    • bioinformatics, security, ...
  • Use of IE-inspired models for applications in
    • networking, compilers, security, bio-sequences, ...

77
IE from ...
  • SEC Filings
  • Dale Kirkland (NASD) charged with detecting
    securities trading fraud
  • Open Source software
  • Lee Giles (PennState) building system that
    supports code sharing, interoperability and
    assessment of reliability.
  • Biology and Bioinformatics Research Pubs
  • Karen Duca (Virginia Bioinformatics Institute)
    part of large team trying to assemble facts that
    will help guide future experiments.
  • Web pages about researchers, groups, pubs,
    conferences and grants
  • Steve Mahaney (NSF) charged with
    Congressionally-mandated effort by NSF to model
    research impact of its grants.

78
IE from Cargo Container Ship Manifests
Cargo Tracking Div., US Navy