Title: CRFs and Joint Inference in NLP
1 CRFs and Joint Inference in NLP
- Andrew McCallum
- Computer Science Department
- University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta,
Xuerui Wang, Ben Wellner, Fuchun Peng, Michael
Hay.
2 From Text to Actionable Knowledge
[Diagram: Document collection -> Spider -> Filter -> IE -> Database -> Data Mining -> Actionable knowledge]
IE: Segment, Classify, Associate, Cluster
Data Mining: Discover patterns (entity types, links / relations, events)
Actionable knowledge: Prediction, Outlier detection, Decision support
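3 Joint Inference
[Diagram: Document collection -> Spider -> Filter -> IE -> Database -> Data Mining -> Actionable knowledge; IE passes Uncertainty Info forward to Data Mining, and Data Mining passes Emerging Patterns back to IE]
IE: Segment, Classify, Associate, Cluster
Data Mining: Discover patterns (entity types, links / relations, events)
Actionable knowledge: Prediction, Outlier detection, Decision support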
4 An HLT Pipeline
SNA, KDD, Events
TDT, Summarization
Coreference
Relations
NER
Parsing
MT
ASR
5 An HLT Pipeline
SNA, KDD
TDT, Summarization
Coreference
Relations
NER
Parsing
MT
ASR
Unified, joint inference.
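6 Joint Inference
[Diagram: Document collection -> Spider -> Filter -> IE -> Database -> Data Mining -> Actionable knowledge; IE passes Uncertainty Info forward to Data Mining, and Data Mining passes Emerging Patterns back to IE]
IE: Segment, Classify, Associate, Cluster
Data Mining: Discover patterns (entity types, links / relations, events)
Actionable knowledge: Prediction, Outlier detection, Decision support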
7 Solution
Unified Model
[Diagram: Document collection -> Spider -> Filter -> Unified Model (IE, Probabilistic Model, Data Mining) -> Actionable knowledge]
IE: Segment, Classify, Associate, Cluster
Data Mining: Discover patterns (entity types, links / relations, events)
Actionable knowledge: Prediction, Outlier detection, Decision support
8 (Linear Chain) Conditional Random Fields
Lafferty, McCallum, Pereira 2001
Undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence.
[Figure: a finite state model drawn as a graphical model.]
FSM states (output seq): y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, . . . = OTHER PERSON OTHER ORG TITLE
observations (input seq): x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, . . . = said Jones a Microsoft VP
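The conditional probability of a label sequence in a linear-chain CRF can be sketched in a few lines. The feature function `score` and its weights below are invented for illustration (they are not from the talk); the forward recursion for the normalizer is the standard one.

```python
import math
from itertools import product

LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]

def score(prev_label, label, words, t):
    """Hypothetical log-potential: a weighted sum of hand-written features."""
    w = words[t]
    s = 0.0
    if label == "PERSON" and w[0].isupper() and prev_label == "OTHER":
        s += 1.5
    if label == "ORG" and w == "Microsoft":
        s += 2.0
    if label == "TITLE" and w == "VP":
        s += 2.0
    if label == "OTHER" and w.islower():
        s += 1.0
    return s

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(words):
    """Forward algorithm: log Z(x) in O(T * |LABELS|^2) time."""
    alpha = {y: score(None, y, words, 0) for y in LABELS}
    for t in range(1, len(words)):
        alpha = {
            y: logsumexp([alpha[p] + score(p, y, words, t) for p in LABELS])
            for y in LABELS
        }
    return logsumexp(list(alpha.values()))

def log_prob(labels, words):
    """log p(y | x): sequence score minus the log normalizer."""
    total = score(None, labels[0], words, 0)
    for t in range(1, len(words)):
        total += score(labels[t - 1], labels[t], words, t)
    return total - log_partition(words)

words = "said Jones a Microsoft VP".split()
gold = ["OTHER", "PERSON", "OTHER", "ORG", "TITLE"]
print(math.exp(log_prob(gold, words)))
```

Training would adjust the feature weights to maximize `log_prob` on labeled sequences; here they are fixed by hand.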
9 Outline
- Motivating Joint Inference for NLP.
- Brief introduction of Conditional Random Fields
- Joint inference: Motivation and examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Segmentation and Co-ref (Sparse BP)
  - Joint Extraction and Data Mining (Iterative)
- Topical N-gram models
10 Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Rohanimanesh, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
12 Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Rohanimanesh, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
But errors cascade--must be perfect at every
stage to do well.
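A small worked example of why cascaded errors hurt. The per-stage accuracies are made-up numbers, and the calculation assumes that stage errors are independent and that any upstream mistake ruins the final output.

```python
import math

def pipeline_accuracy(stage_accuracies):
    """Chance that every stage is right, assuming independent errors."""
    return math.prod(stage_accuracies)

# Hypothetical accuracies for part-of-speech, noun-phrase, and named-entity stages.
stages = [0.97, 0.94, 0.90]
print(round(pipeline_accuracy(stages), 4))
```

Under these assumptions the end-to-end figure is lower than that of the weakest single stage, which is the motivation for predicting all layers jointly.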
13 Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Rohanimanesh, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
Joint prediction of part-of-speech and
noun-phrase in newswire, matching accuracy with
only 50% of the training data.
Inference: Loopy Belief Propagation
14 2. Jointly labeling distant mentions: Skip-chain CRFs
Sutton, McCallum, SRL 2004
Senator Joe Green said today . . . Green ran for . . .
Dependency among similar, distant mentions ignored.
15 2. Jointly labeling distant mentions: Skip-chain CRFs
Sutton, McCallum, SRL 2004
Senator Joe Green said today . . . Green ran for . . .
14% reduction in error on most repeated field in
email seminar announcements.
Inference: Tree reparameterization BP
See also Finkel, et al, 2005
Wainwright et al, 2002
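A minimal sketch of how the extra "skip" edges of such a model might be chosen. The rule used here (connect positions holding the same capitalized token) is an assumption made for illustration, and `skip_edges` is a hypothetical helper name.

```python
from itertools import combinations

def skip_edges(tokens):
    """Return (i, j) position pairs that hold the same capitalized token."""
    positions = {}
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            positions.setdefault(tok, []).append(i)
    edges = []
    for idxs in positions.values():
        edges.extend(combinations(idxs, 2))
    return sorted(edges)

tokens = "Senator Joe Green said today . Green ran for".split()
# Ordinary linear-chain edges between neighbouring positions.
linear_edges = [(i, i + 1) for i in range(len(tokens) - 1)]
print(skip_edges(tokens))
```

The skip edge lets evidence around the first "Green" (preceded by "Senator Joe") influence the label of the second one, at the cost of introducing loops into the graph.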
16 3. Joint co-reference among all pairs: Affinity Matrix CRF
"Entity resolution", "Object correspondence"
[Figure: the mentions ". . . Mr Powell . . .", ". . . Powell . . ." and ". . . she . . ." form a fully connected graph. Each pair is joined by a binary Y/N co-reference variable; the example edge affinities are 45, -99, and 11.]
25% reduction in error on co-reference of
proper nouns in newswire.
Inference: Correlational clustering graph partitioning
McCallum, Wellner, IJCAI WS 2003, NIPS 2004
Bansal, Blum, Chawla, 2002
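The partitioning objective can be illustrated by brute force on the three-mention example. Which affinity belongs to which pair is an assumption here (the figure layout did not survive), and exhaustive search is only feasible for a handful of mentions; it is not the approximate partitioning method used in the cited work.

```python
def partitions(items):
    """Yield every partition of a list into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing cluster in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or start a new cluster with it.
        yield [[first]] + part

def partition_score(part, affinity):
    """Sum of pairwise affinities inside clusters."""
    total = 0
    for cluster in part:
        for i, a in enumerate(cluster):
            for b in cluster[i + 1:]:
                total += affinity[frozenset((a, b))]
    return total

# Assumed assignment of the slide's edge weights to mention pairs.
affinity = {
    frozenset(("Mr Powell", "Powell")): 45,
    frozenset(("Mr Powell", "she")): -99,
    frozenset(("Powell", "she")): 11,
}
mentions = ["Mr Powell", "Powell", "she"]
best = max(partitions(mentions), key=lambda p: partition_score(p, affinity))
print(best, partition_score(best, affinity))
```

Note that "Powell" and "she" have positive affinity on their own, yet the best partition separates them because merging would also force "Mr Powell" and "she" together. Independent pairwise decisions would miss this.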
17 Transfer Learning with Factorial CRFs
Sutton, McCallum, 2005
Emailed seminar entities
Email English words
60k words training.
From: Terri Stankus <stankus_at_cs.cmu.edu>
To: seminars_at_cs.cmu.edu
Date: 26 Feb 1992
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a
vibrant and popular discipline in artificial
intelligence during the 1980s and 1990s. As a
result of its success and growth, machine
learning is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning), learning
theory (e.g. PAC learning), genetic algorithms,
connectionist learning, hybrid systems, and so on.
Too little labeled training data.
18 Transfer Learning with Factorial CRFs
Sutton, McCallum, 2005
Train on related task with more data.
Newswire named entities
Newswire English words
200k words training.
CRICKET - MILLNS SIGNS FOR BOLAND CAPE TOWN
(1996-08-22) South African provincial side Boland
said on Thursday they had signed Leicestershire
fast bowler David Millns on a one year contract.
Millns, who toured Australia with England A in
1992, replaces former England all-rounder Phillip
DeFreitas as Boland's overseas professional.
19 Transfer Learning with Factorial CRFs
Sutton, McCallum, 2005
At test time, label email with newswire NEs...
Newswire named entities
Email English words
20 Transfer Learning with Factorial CRFs
Sutton, McCallum, 2005
. . . then use these labels as features for the final task
Emailed seminar announcement entities
Newswire named entities
Email English words
21 Transfer Learning with Factorial CRFs
Sutton, McCallum, 2005
Use joint inference at test time.
Seminar Announcement entities
Newswire named entities
English words
An alternative to hierarchical Bayes. Needn't
know anything about parameterization of subtask.
Accuracy: No transfer < Cascaded Transfer < Joint Inference Transfer
11% Reduction in Error
22 4. Joint segmentation and co-reference
Extraction from and matching of research paper
citations.
[Figure: graphical model over the two citations below, with node layers o, s (Segmentation), c (Citation attributes), y (Co-reference decisions), and p (Database field values), plus World Knowledge.]
Laurel, B. Interface Agents Metaphors with
Character, in The Art of Human-Computer
Interface Design, B. Laurel (ed), Addison-Wesley,
1990.
Brenda Laurel. Interface Agents Metaphors with
Character, in Laurel, The Art of Human-Computer
Interface Design, 355-366, 1990.
35% reduction in co-reference error by using
segmentation uncertainty.
6-14% reduction in segmentation error by using
co-reference.
Inference: Sparse Generalized Belief Propagation
Wellner, McCallum, Peng, Hay, UAI 2004
see also Marthi, Milch, Russell, 2003
Pal, Sutton, McCallum, 2005
23 Joint IE and Coreference from Research Paper Citations
4. Joint segmentation and co-reference
Textual citation mentions (noisy, with duplicates)
Paper database, with fields, clean, duplicates collapsed
AUTHORS              TITLE       VENUE
Cowell, Dawid        Probab      Springer
Montemerlo, Thrun    FastSLAM    AAAI
Kjaerulff            Approxi     Technic
24 Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
25 Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
- Segment citation fields
26 Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Y ? N
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
- Segment citation fields
- Resolve coreferent citations
27 Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Y ? N
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
Segmentation Quality    Citation Co-reference (F1)
No Segmentation         78
CRF Segmentation        91
True Segmentation       93
- Segment citation fields
- Resolve coreferent citations
28 Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Y ? N
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
AUTHOR: Brenda Laurel
TITLE: Interface Agents Metaphors with Character
PAGES: 355-366
BOOKTITLE: The Art of Human-Computer Interface Design
EDITOR: T. Smith
PUBLISHER: Addison-Wesley
YEAR: 1990
- Segment citation fields
- Resolve coreferent citations
- Form canonical database record
Resolving conflicts
29 Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Y ? N
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
AUTHOR: Brenda Laurel
TITLE: Interface Agents Metaphors with Character
PAGES: 355-366
BOOKTITLE: The Art of Human-Computer Interface Design
EDITOR: T. Smith
PUBLISHER: Addison-Wesley
YEAR: 1990
Perform jointly:
- Segment citation fields
- Resolve coreferent citations
- Form canonical database record
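One way to picture the "form canonical database record" step is a per-field vote over the coreferent mentions. The conflict rule below (most frequent value wins, ties go to the longest string) is an assumption for illustration, not the method from the talk, and the segmented field values are hand-written.

```python
from collections import Counter

def canonical_record(mentions):
    """Merge field dicts: most frequent value wins, ties go to the longest."""
    fields = {}
    for mention in mentions:
        for name, value in mention.items():
            fields.setdefault(name, []).append(value)
    record = {}
    for name, values in fields.items():
        counts = Counter(values)
        record[name] = max(counts, key=lambda v: (counts[v], len(v)))
    return record

# Hypothetical segmentations of the two citation strings on the slide.
mention_1 = {
    "AUTHOR": "Laurel, B.",
    "TITLE": "Interface Agents Metaphors with Character",
    "BOOKTITLE": "The Art of Human-Computer Interface Design",
    "EDITOR": "T. Smith",
    "PUBLISHER": "Addison-Wesley",
    "YEAR": "1990",
}
mention_2 = {
    "AUTHOR": "Brenda Laurel",
    "TITLE": "Interface Agents Metaphors with Character",
    "BOOKTITLE": "The Art of Human-Computr Interface Design",
    "EDITOR": "Smith",
    "PAGES": "355-366",
    "YEAR": "1990",
}
record = canonical_record([mention_1, mention_2])
print(record)
```

With only two mentions the tie-break does most of the work; a joint model would instead let coreference, segmentation, and record formation inform one another.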
30 IE Coreference Model
CRF Segmentation s: AUT AUT YR TITL TITL
Observed citation x: J Besag 1986 On the
31 IE Coreference Model
Citation mention attributes c: AUTHOR J Besag, YEAR 1986, TITLE On the
CRF Segmentation s
Observed citation x: J Besag 1986 On the
32 IE Coreference Model
Structure (c, s, x) for each citation mention:
Smyth , P Data mining
J Besag 1986 On the
Smyth . 2001 Data Mining
33 IE Coreference Model
Binary coreference variables for each pair of mentions
Citation mentions, each with its own c, s, x:
Smyth , P Data mining
J Besag 1986 On the
Smyth . 2001 Data Mining
34 IE Coreference Model
Binary coreference variables for each pair of mentions (values shown: y, n, n)
Citation mentions, each with its own c, s, x:
Smyth , P Data mining
J Besag 1986 On the
Smyth . 2001 Data Mining
35 IE Coreference Model
Research paper entity attribute nodes: AUTHOR P Smyth, YEAR 2001, TITLE Data Mining ...
Coreference variable values shown: y, n, n
Citation mentions, each with its own c, s, x:
Smyth , P Data mining
J Besag 1986 On the
Smyth . 2001 Data Mining
36 IE Coreference Model
Research paper entity attribute node
Coreference variable values shown: y, y, y
Citation mentions, each with its own c, s, x:
Smyth , P Data mining
J Besag 1986 On the
Smyth . 2001 Data Mining
37 IE Coreference Model
Coreference variable values shown: y, n, n
Citation mentions, each with its own c, s, x:
Smyth , P Data mining
J Besag 1986 On the
Smyth . 2001 Data Mining
38 Inference by Sparse Generalized BP
Pal, Sutton, McCallum 2005
Exact inference on these linear-chain regions
From each chain pass an N-best List into coreference
Smyth , P Data mining
J Besag 1986 On the
Smyth . 2001 Data Mining
39 Inference by Sparse Generalized BP
Pal, Sutton, McCallum 2005
Approximate inference by graph partitioning, integrating out uncertainty in samples of extraction
Make scale to 1M citations with Canopies (McCallum, Nigam, Ungar 2000)
Smyth , P Data mining
J Besag 1986 On the
Smyth . 2001 Data Mining
40 Inference: Sample N-best List from CRF Segmentation
When calculating similarity with another
citation, have more opportunity to find correct,
matching fields.
Name Title Book Title Year
Laurel, B. Interface Agents Metaphors with Character The Art of Human Computer Interface Design 1990
Laurel, B. Interface Agents Metaphors with Character The Art of Human Computer Interface Design 1990
Laurel, B. Interface Agents Metaphors with Character The Art of Human Computer Interface Design 1990
Name Title
Laurel, B Interface Agents Metaphors with Character The
Laurel, B. Interface Agents Metaphors with Character
Laurel, B. Interface Agents Metaphors with Character
y ? n
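The benefit of passing an N-best list can be sketched as follows. Scoring a citation pair by the best-matching pair of candidate segmentations is an assumption chosen for simplicity, and the field dictionaries are hand-written stand-ins for CRF output.

```python
def field_overlap(seg_a, seg_b):
    """Count fields whose extracted strings match exactly."""
    return sum(1 for name, value in seg_a.items() if seg_b.get(name) == value)

def nbest_similarity(nbest_a, nbest_b):
    """Best field overlap over all pairs of candidate segmentations."""
    return max(field_overlap(a, b) for a in nbest_a for b in nbest_b)

# Hypothetical N-best list for one citation: the top candidate has a
# title boundary error, the second candidate is correct.
nbest_a = [
    {"name": "Laurel, B.", "title": "Interface Agents Metaphors with Character The"},
    {"name": "Laurel, B.", "title": "Interface Agents Metaphors with Character"},
]
nbest_b = [
    {"name": "Laurel, B.", "title": "Interface Agents Metaphors with Character"},
]
print(field_overlap(nbest_a[0], nbest_b[0]), nbest_similarity(nbest_a, nbest_b))
```

Using only the single best segmentation, the title fields fail to match; with the list, the lower-ranked but correct candidate supplies the match.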
41 Inference by Sparse Generalized BP
Pal, Sutton, McCallum 2005
Exact (exhaustive) inference over entity attributes
Coreference variable values shown: y, n, n
Smyth , P Data mining
J Besag 1986 On the
Smyth . 2001 Data Mining
42 Inference by Sparse Generalized BP
Pal, Sutton, McCallum 2005
Revisit exact inference on IE linear chain, now conditioned on entity attributes
Coreference variable values shown: y, n, n
Smyth , P Data mining
J Besag 1986 On the
Smyth . 2001 Data Mining
43 Parameter Estimation: Piecewise Training
Sutton & McCallum 2005
Divide-and-conquer parameter estimation
IE Linear-chain: Exact MAP
Coref graph edge weights: MAP on individual edges
Entity attribute potentials: MAP, pseudo-likelihood
In all cases: climb MAP gradient with quasi-Newton method
44 4. Joint segmentation and co-reference
Wellner, McCallum, Peng, Hay, UAI 2004
Extraction from and matching of research paper
citations.
[Figure: graphical model over the two citations below, with node layers o, s (Segmentation), c (Citation attributes), y (Co-reference decisions), and p (Database field values), plus World Knowledge.]
Laurel, B. Interface Agents Metaphors with
Character, in The Art of Human-Computer
Interface Design, B. Laurel (ed), Addison-Wesley,
1990.
Brenda Laurel. Interface Agents Metaphors with
Character, in Laurel, The Art of Human-Computer
Interface Design, 355-366, 1990.
35% reduction in co-reference error by using
segmentation uncertainty.
6-14% reduction in segmentation error by using
co-reference.
Inference: Variant of Iterated Conditional Modes
Besag, 1986
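Plain Iterated Conditional Modes is easy to state in code: repeatedly set each variable to its best value given all the others until nothing changes. The sketch below is the textbook procedure on a toy two-variable score table, not the variant used in the cited work; the names `seg` and `coref` and the scores are invented.

```python
def icm(variables, domains, score, init, max_sweeps=20):
    """Coordinate-wise maximization of `score` over discrete variables."""
    assign = dict(init)
    for _ in range(max_sweeps):
        changed = False
        for var in variables:
            best = max(domains[var], key=lambda val: score({**assign, var: val}))
            if score({**assign, var: best}) > score(assign):
                assign[var] = best
                changed = True
        if not changed:
            break
    return assign

# Hypothetical joint scores for a segmentation choice and a coreference decision.
TABLE = {("A", "Y"): 3.0, ("A", "N"): 1.0, ("B", "Y"): 0.0, ("B", "N"): 2.0}

def toy_score(assign):
    return TABLE[(assign["seg"], assign["coref"])]

domains = {"seg": ["A", "B"], "coref": ["Y", "N"]}
from_by = icm(["seg", "coref"], domains, toy_score, {"seg": "B", "coref": "Y"})
from_bn = icm(["seg", "coref"], domains, toy_score, {"seg": "B", "coref": "N"})
print(from_by, from_bn)
```

The second run stays at a local maximum because no single-variable change improves the score, which is the usual caveat with ICM-style inference.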
45 Outline
- Motivating Joint Inference for NLP.
- Brief introduction of Conditional Random Fields
- Joint inference: Motivation and examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Segmentation and Co-ref (Sparse BP)
  - Joint Extraction and Data Mining (Iterative)
- Topical N-gram models
46 George W. Bush's father is George H. W. Bush
(son of Prescott Bush).
50 Relation Extraction as Sequence Labeling
- George W. Bush
- George H. W. Bush (son of Prescott Bush)
51 Learning Relational Database Features
- George W. Bush
- George H. W. Bush (son of Prescott Bush)
Name                 Son
Prescott Bush        George H. W. Bush
George H. W. Bush    George W. Bush
52 Highly weighted relational paths
- Many Family equivalences
  - Sibling = Parent_Offspring
  - Cousin = Parent_Sibling_Offspring
- College = Parent_College
- Religion = Parent_Religion
- Ally = Opponent_Opponent
- Friend = Person_Same_School
- Preliminary results: nice performance boost using
relational features (8% absolute F1)
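A relational path such as Parent_Offspring can be read as a composition of database relations. The sketch below stores each relation as a set of pairs; `compose` is a hypothetical helper, and the Jeb Bush row is added only so that the path has something to find (it is not in the slides' table).

```python
def compose(*relations):
    """Follow a path of binary relations, each a set of (x, y) pairs."""
    result = relations[0]
    for rel in relations[1:]:
        result = {(a, c) for a, b in result for b2, c in rel if b == b2}
    return result

# (parent, offspring) rows; the first two come from the Name / Son table.
offspring = {
    ("Prescott Bush", "George H. W. Bush"),
    ("George H. W. Bush", "George W. Bush"),
    ("George H. W. Bush", "Jeb Bush"),  # extra row, added for illustration
}
parent = {(child, par) for par, child in offspring}

# Sibling = Parent_Offspring, excluding each person's path back to themselves.
sibling = {(a, c) for a, c in compose(parent, offspring) if a != c}
print(sorted(sibling))
```

In an extraction model, the existence of such a path between two entities in the database would be turned into a feature for predicting the corresponding relation.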
53 Testing on Unknown Entities
- John F. Kennedy
- son of Joseph P. Kennedy, Sr. and Rose Fitzgerald
Name                 Son
Joseph P. Kennedy    John F. Kennedy
Rose Fitzgerald      John F. Kennedy
Use relational features with second-pass CRF
54 Next Steps
- Feature induction to discover complex rules
- Measure relational features' sensitivity to noise in DB
- Collective inference among related relations
55 Outline
- Motivating Joint Inference for NLP.
- Brief introduction of Conditional Random Fields
- Joint inference: Motivation and examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Segmentation and Co-ref (Sparse BP)
  - Joint Extraction and Data Mining (Iterative)
- Topical N-gram models
56 Topical N-gram Model - Our first attempt
Wang & McCallum
0, 1, 12, 22, 13, 23, 33
[Figure: graphical model with topics z1 . . . z4, indicator variables y1 . . . y4, and words w1 . . . w4, repeated over D documents, with parameter plates of size W and T.]
57 Beyond bag-of-words
Wallach
[Figure: graphical model with topics z1 . . . z4 and words w1 . . . w4, repeated over D documents, with a parameter plate of size TW.]
58 LDA-COL (Collocation) Model
Griffiths & Steyvers
[Figure: graphical model with topics z1 . . . z4, indicator variables y1 . . . y4, and words w1 . . . w4, repeated over D documents, with parameter plates of size W, T, and W.]
59 Topical N-gram Model
Wang & McCallum
[Figure: graphical model with topics z1 . . . z4, indicator variables y1 . . . y4, and words w1 . . . w4, repeated over D documents, with parameter plates of size W and T.]
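A rough generative sketch may help read the diagram: each position has a topic z, an indicator y saying whether the word continues an n-gram begun by the previous word, and a word w drawn either from a topic's unigram distribution or from a topic-specific distribution conditioned on the previous word. All probabilities below are invented, and letting the indicator depend only on the previous word is a simplification assumed here.

```python
import random

def generate(n_words, theta, unigram, bigram, p_bigram, rng):
    """Sample topics z, n-gram indicators y, and words w for one document."""
    z, y, w = [], [], []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        prev = w[-1] if w else None
        # y = 1 means this word continues an n-gram started by `prev`.
        use_bigram = (
            prev is not None
            and (topic, prev) in bigram
            and rng.random() < p_bigram.get(prev, 0.0)
        )
        dist = bigram[(topic, prev)] if use_bigram else unigram[topic]
        words, weights = zip(*dist.items())
        z.append(topic)
        y.append(int(use_bigram))
        w.append(rng.choices(words, weights=weights)[0])
    return z, y, w

# Hypothetical parameters for two topics.
theta = [0.7, 0.3]
unigram = {
    0: {"speech": 0.5, "neural": 0.3, "training": 0.2},
    1: {"motion": 0.6, "visual": 0.4},
}
bigram = {
    (0, "speech"): {"recognition": 1.0},
    (0, "neural"): {"network": 0.7, "net": 0.3},
    (1, "visual"): {"cortex": 1.0},
}
p_bigram = {"speech": 0.8, "neural": 0.9, "visual": 0.5}

z, y, w = generate(50, theta, unigram, bigram, p_bigram, random.Random(0))
print(list(zip(w, y))[:10])
```

Runs of words with y = 1 are what get reported as topical phrases, in contrast to a plain bag-of-words topic model that can only list single words.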
61 Topic Comparison
LDA
learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning
62 Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)
motion, response, direction, cells, stimulus, figure, contrast, velocity, model, responses, stimuli, moving, cell, intensity, population, image, center, tuning, complex, directions
motion, visual, field, position, figure, direction, fields, eye, location, retina, receptive, velocity, vision, moving, system, flow, edge, center, light, local
receptive field, spatial frequency, temporal frequency, visual motion, motion energy, tuning curves, horizontal cells, motion detection, preferred direction, visual processing, area mt, visual cortex, light intensity, directional selectivity, high contrast, motion detectors, spatial phase, moving stimuli, decision strategy, visual stimuli
63 Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)
speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid
word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models
speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent
64 Summary
- Joint inference can avoid accumulating errors in a pipeline from extraction to data mining.
- Examples
  - Factorial finite state models
  - Jointly labeling distant entities
  - Coreference analysis
  - Segmentation uncertainty aiding coreference, and vice versa
  - Joint Extraction and Data Mining
- Many examples of sequential topic models.