Title: A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance
- Andrew McCallum
- Kedar Bellare
- Fernando Pereira
Thanks to Charles Sutton, Xuerui Wang and Mikhail
Bilenko for helpful discussions.
2. String Edit Distance
- Distance between sequences x and y:
  the cost of the lowest-cost sequence of edit operations that transforms string x into y.
3. String Edit Distance
- Distance between sequences x and y:
  the cost of the lowest-cost sequence of edit operations that transforms string x into y.
- Applications
  - Database Record Deduplication
      Apex International Hotel Grassmarket Street
      Apex Internatl Grasmarket Street
    Are these records duplicates for the same hotel?
4. String Edit Distance
- Distance between sequences x and y:
  the cost of the lowest-cost sequence of edit operations that transforms string x into y.
- Applications
  - Database Record Deduplication
  - Biological Sequences
      AGCTCTTACGATAGAGGACTCCAGA
      AGGTCTTACCAAAGAGGACTTCAGA
5. String Edit Distance
- Distance between sequences x and y:
  the cost of the lowest-cost sequence of edit operations that transforms string x into y.
- Applications
  - Database Record Deduplication
  - Biological Sequences
  - Machine Translation
      Il a acheté une pomme
      He bought an apple
6. String Edit Distance
- Distance between sequences x and y:
  the cost of the lowest-cost sequence of edit operations that transforms string x into y.
- Applications
  - Database Record Deduplication
  - Biological Sequences
  - Machine Translation
  - Textual Entailment
      He bought a new car last night
      He purchased a brand new automobile yesterday evening
7. Levenshtein Distance (1966)
Edit operations:
- copy:   copy a character from x to y (cost 0)
- insert: insert a character into y (cost 1)
- delete: delete a character from x (cost 1)
- subst:  substitute one character for another (cost 1)
8. Levenshtein Distance
Edit operations:
- copy:   copy a character from x to y (cost 0)
- insert: insert a character into y (cost 1)
- delete: delete a character from x (cost 1)
- subst:  substitute one character for another (cost 1)

Dynamic program: D(i,j) = score of the best alignment of x1...xi with y1...yj.

  D(i,j) = min( D(i-1,j-1) + δ(xi ≠ yj),   [copy / subst]
                D(i-1,j) + 1,              [delete]
                D(i,j-1) + 1 )             [insert]

DP table for x = "William" (rows) and y = "Willleam" (columns); the bottom-right cell is the total cost, i.e. the distance:

         W  i  l  l  l  e  a  m
      0  1  2  3  4  5  6  7  8
   W  1  0  1  2  3  4  5  6  7
   i  2  1  0  1  2  3  4  5  6
   l  3  2  1  0  1  2  3  4  5
   l  4  3  2  1  0  1  2  3  4
   i  5  4  3  2  1  1  2  3  4
   a  6  5  4  3  2  2  2  2  3
   m  7  6  5  4  3  3  3  3  2
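The recurrence above translates directly into code; a minimal sketch in Python (the function name is mine, not from the talk):

```python
def levenshtein(x, y):
    # D[i][j] = cost of the best alignment of x[:i] with y[:j]
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        D[i][0] = i                    # i deletes
    for j in range(1, len(y) + 1):
        D[0][j] = j                    # j inserts
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy (0) / subst (1)
                D[i - 1][j] + 1,                           # delete
                D[i][j - 1] + 1,                           # insert
            )
    return D[len(x)][len(y)]
```

For example, `levenshtein("William", "Willleam")` returns 2, the bottom-right cell of the table above.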
9. Levenshtein Distance with Markov Dependencies
Now the cost of an operation depends on the previous operation; in particular, a repeated delete (a delete following a delete) is cheaper:

  Edit operation                                cost after:  c    i    d    s
  copy    copy a character from x to y                       0    0    0    0
  insert  insert a character into y                          1    1    1    1
  delete  delete a character from x                          1    1   <1    1
  subst   substitute one character for another               1    1    1    1

[Figure: the edit operations form a finite-state machine over {copy, subst, insert, delete}; the DP table becomes 3D, with an extra dimension for the previous operation.]
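One way to realize the Markov-dependent costs is to add the previous operation as the third DP dimension; a rough sketch under assumed costs (the table values, including the 0.5 discount for a repeated delete, are illustrative, not the talk's):

```python
import math

OPS = ("copy", "insert", "delete", "subst")

# Hypothetical Markov cost table: COST[prev_op][op].
COST = {p: {"copy": 0.0, "insert": 1.0, "delete": 1.0, "subst": 1.0}
        for p in OPS + ("start",)}
COST["delete"]["delete"] = 0.5  # repeated delete is cheaper (value is illustrative)

def markov_edit_distance(x, y):
    INF = math.inf
    states = ("start",) + OPS
    # best[(i, j, s)]: cheapest cost of aligning x[:i] with y[:j],
    # having just performed edit operation s.
    best = {(0, 0, "start"): 0.0}
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for s in states:
                c = best.get((i, j, s), INF)
                if c == INF:
                    continue
                if i < len(x) and j < len(y):      # copy or subst
                    op = "copy" if x[i] == y[j] else "subst"
                    k = (i + 1, j + 1, op)
                    best[k] = min(best.get(k, INF), c + COST[s][op])
                if i < len(x):                     # delete x[i]
                    k = (i + 1, j, "delete")
                    best[k] = min(best.get(k, INF), c + COST[s]["delete"])
                if j < len(y):                     # insert y[j]
                    k = (i, j + 1, "insert")
                    best[k] = min(best.get(k, INF), c + COST[s]["insert"])
    return min(best.get((len(x), len(y), s), INF) for s in states)
```

With these costs, deleting a run of four characters costs 1.0 for the first delete and 0.5 for each repeat, rather than 4.0.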
10. Ristad & Yianilos (1997)
- Essentially a pair-HMM, generating an edit/state/alignment sequence and two strings.
- Learn via EM:
  - Expectation step: calculate the likelihood of alignment paths.
  - Maximization step: make those paths more likely.
11. Ristad & Yianilos: Regrets
- Limited features of the input strings
  - Examines only a single character pair at a time
  - Difficult to use upcoming string context, lexicons, ...
  - Example: "Senator John Green" vs "John Green"
- Limited edit operations
  - Difficult to generate arbitrary jumps in both strings
  - Example: "UMass" vs "University of Massachusetts"
- Trained only on positive match data
  - Doesn't include information-rich near misses
  - Example: "ACM SIGIR" vs "ACM SIGCHI"

So, consider a model trained by conditional probability.
12. Conditional Probability (Sequence) Models
- We prefer a model trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x).
- It can examine features without being responsible for generating them.
- We don't have to explicitly model the features' dependencies.
13. From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
[Figure: the joint (HMM) vs conditional (linear-chain CRF) graphical models over states y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}.]
14. (Linear-Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
An undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence:

  P(y|x) = (1/Z(x)) ∏_t exp( Σ_k λ_k f_k(y_{t-1}, y_t, x, t) )

where Z(x) is the input-dependent normalizer (partition function).

[Figure: finite-state and graphical-model views, with FSM states over the output labels OTHER, PERSON, ORG, TITLE as the output sequence y_{t-1} ... y_{t+3}, and observations x_{t-1} ... x_{t+3} over the input sequence "... said Jones a Microsoft VP ...".]
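The conditional probability above can be rendered as a toy linear-chain CRF; a minimal sketch with Z(x) computed by brute-force enumeration (the states, feature names, and weights here are illustrative, not from the talk):

```python
import itertools
import math

STATES = ("OTHER", "PERSON")

def score(ys, xs, w):
    # Sum over positions of weighted transition and emission indicator features.
    s, prev = 0.0, "START"
    for y, x in zip(ys, xs):
        s += w.get(("trans", prev, y), 0.0) + w.get(("emit", y, x), 0.0)
        prev = y
    return s

def prob(ys, xs, w):
    # P(y|x) = exp(score(y, x)) / Z(x); Z(x) by brute force here
    # (fine at toy sizes; a real CRF uses forward-backward).
    Z = sum(math.exp(score(c, xs, w))
            for c in itertools.product(STATES, repeat=len(xs)))
    return math.exp(score(ys, xs, w)) / Z
```

With, say, `w = {("emit", "PERSON", "Jones"): 2.0}` and `xs = ("said", "Jones")`, labelings that tag "Jones" as PERSON receive higher conditional probability, and the probabilities over all labelings sum to 1.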
15. CRF String Edit Distance
An alignment a between string 1 (x1) and string 2 (x2) is a sequence of edit operations, each a triple a_t = (a_t.i1, a_t.e, a_t.i2): a position in x1, an edit operation, and a position in x2.

Example: x1 = "William W. Cohon", x2 = "Willleam Cohen":
copy W, i, l, l; insert l; subst i→e; copy a, m; delete " W."; copy the space, C, o, h; subst o→e; copy n
(11 copies, 2 substitutions, 1 insert, 3 deletes).

[Equations: the joint complete-data likelihood vs the conditional complete-data likelihood.]
16. CRF String Edit Distance FSM
[Figure: a single finite-state machine over the four edit operations: copy, subst, insert, delete.]
17. CRF String Edit Distance FSM
[Figure: a Start state branches into two submodels, each a fully-connected FSM over {copy, subst, insert, delete}: a match submodel (m = 1) and a non-match submodel (m = 0).]
18. CRF String Edit Distance FSM
x1 = "Tommi Jaakkola", x2 = "Tommi Jakola"
- Probability summed over all alignments in the match (m = 1) states: 0.8
- Probability summed over all alignments in the non-match (m = 0) states: 0.2
19. CRF String Edit Distance FSM
x1 = "Tom Dietterich", x2 = "Tom Dean"
- Probability summed over all alignments in the match (m = 1) states: 0.1
- Probability summed over all alignments in the non-match (m = 0) states: 0.9
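These match/non-match probabilities come from summing over every alignment path in each submodel; a minimal sketch of that sum as a forward-style DP (the per-operation log-weights are stand-ins, not learned parameters):

```python
import math

def logsumexp(vals):
    vals = [v for v in vals if v != -math.inf]
    if not vals:
        return -math.inf
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def total_alignment_score(x, y, logw):
    # F[i][j]: log of the summed exp-scores of ALL alignments of x[:i] with y[:j].
    n, m = len(x), len(y)
    F = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:                     # copy or subst on the diagonal
                op = "copy" if x[i - 1] == y[j - 1] else "subst"
                cands.append(F[i - 1][j - 1] + logw[op])
            if i > 0:                               # delete x[i-1]
                cands.append(F[i - 1][j] + logw["delete"])
            if j > 0:                               # insert y[j-1]
                cands.append(F[i][j - 1] + logw["insert"])
            F[i][j] = logsumexp(cands)
    return F[n][m]
```

Running one such sum with the match submodel's weights and one with the non-match submodel's gives P(match) = exp(Z_match) / (exp(Z_match) + exp(Z_nonmatch)). As a sanity check, with all log-weights 0 the sum simply counts alignment paths.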
20. Parameter Estimation
Given a training set of string pairs and match/non-match labels, the objective function is the incomplete log-likelihood.
- Expectation Maximization
  - E-step: estimate the distribution over alignments using the current parameters.
  - M-step: change the parameters to maximize the complete (penalized) log-likelihood, with an iterative quasi-Newton method (BFGS).
This is conditional EM, but it avoids the complexities of [Jebara 1998] because there is no need to solve the M-step in closed form.
21. Efficient Training
- The dynamic-programming table is 3D: |x1|, |x2| ≈ 100, |S| = 12, so roughly 120,000 entries.
- Use beam search during the E-step [Pal, Sutton, McCallum 2005].
- Unlike completely-observed CRFs, the objective function is not convex.
- Initialize the parameters not at zero, but so as to yield a reasonable initial edit distance.
22. What Alignments are Learned?
x1 = "Tommi Jaakkola", x2 = "Tommi Jakola"
[Figure: the learned alignment of the two strings, routed through the match (m = 1) submodel.]
23. What Alignments are Learned?
x1 = "Bruce Croft", x2 = "Tom Dean"
[Figure: the learned alignment of the two strings, routed through the non-match (m = 0) submodel.]
24. What Alignments are Learned?
x1 = "Jaime Carbonell", x2 = "Jamie Callan"
[Figure: the learned alignment of the two strings.]
25. Example Learned Alignment
[Figure: an example learned alignment.]
26. Summary of Advantages
- Arbitrary features of the input strings
  - Examine past and future context
  - Use lexicons, WordNet
- Extremely flexible edit operations
  - A single operation may make arbitrary jumps in both strings, of a size determined by input features
- Discriminative training
  - Maximizes the ability to predict match vs non-match
27. Experimental Results: Data Sets
- Restaurant name, Restaurant address
  - 864 records, 112 matches
  - E.g. "Abe's Bar & Grill, E. Main St" vs "Abe's Grill, East Main Street"
- People names, from the UIS DB generator
  - Synthetic noise
  - E.g. "John Smith" vs "Snith, John"
- CiteSeer citations
  - In four sections: Reason, Face, Reinforce, Constraint
  - E.g. "Rusell & Norvig, Artificial Intelligence: A Modern..." vs "Russell & Norvig, Artificial Intelligence: An Intro..."
28. Experimental Results: Features
- same, different
- same-alphabetic, different-alphabetic
- same-numeric, different-numeric
- punctuation1, punctuation2
- alphabet-mismatch, numeric-mismatch
- end-of-1, end-of-2
- same-next-character, different-next-character
29. Experimental Results: Edit Operations
- insert, delete, substitute/copy
- swap-two-characters
- skip-word-if-in-lexicon
- skip-parenthesized-words
- skip-any-word
- substitute-word-pairs-in-translation-lexicon
- skip-word-if-present-in-other-string
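To make the flexibility of these operations concrete, here is a rough word-level sketch of one of them, skip-word-if-present-in-other-string, grafted onto a plain edit DP over tokens (the cost values are illustrative, not the model's learned weights):

```python
def token_edit_distance(xs, ys, skip_cost=0.1):
    # xs, ys: lists of tokens. Besides copy/subst/insert/delete on whole tokens,
    # a token may be skipped cheaply if it occurs ANYWHERE in the other string:
    # an illustrative version of skip-word-if-present-in-other-string.
    INF = float("inf")
    n, m = len(xs), len(ys)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if D[i][j] == INF:
                continue
            if i < n and j < m:                       # copy (0) or subst (1)
                cost = 0.0 if xs[i] == ys[j] else 1.0
                D[i + 1][j + 1] = min(D[i + 1][j + 1], D[i][j] + cost)
            if i < n:                                 # cheap skip vs full delete
                c = skip_cost if xs[i] in ys else 1.0
                D[i + 1][j] = min(D[i + 1][j], D[i][j] + c)
            if j < m:                                 # cheap skip vs full insert
                c = skip_cost if ys[j] in xs else 1.0
                D[i][j + 1] = min(D[i][j + 1], D[i][j] + c)
    return D[n][m]
```

Under word-order noise, ["John", "Smith"] vs ["Smith", "John"] costs only two cheap skips (0.2 here) instead of two full substitutions, which is the intuition behind the gain on slide 32.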
30. Experimental Results [comparison data from Bilenko & Mooney 2003]
F1 (harmonic mean of precision and recall); Reason, Face, Reinf, and Constraint are the four CiteSeer sections:

  Distance metric    Rest. name  Rest. address  Reason  Face   Reinf  Constraint
  Levenshtein           0.290       0.686       0.927   0.952  0.893  0.924
  Learned Leven.        0.354       0.712       0.938   0.966  0.907  0.941
  Vector                0.365       0.380       0.897   0.922  0.903  0.923
  Learned Vector        0.433       0.532       0.924   0.875  0.808  0.913
31. Experimental Results [comparison data from Bilenko & Mooney 2003]
F1 (harmonic mean of precision and recall); Reason, Face, Reinf, and Constraint are the four CiteSeer sections:

  Distance metric     Rest. name  Rest. address  Reason  Face   Reinf  Constraint
  Levenshtein            0.290       0.686       0.927   0.952  0.893  0.924
  Learned Leven.         0.354       0.712       0.938   0.966  0.907  0.941
  Vector                 0.365       0.380       0.897   0.922  0.903  0.923
  Learned Vector         0.433       0.532       0.924   0.875  0.808  0.913
  CRF Edit Distance      0.448       0.783       0.964   0.918  0.917  0.976
32. Experimental Results
Data set: person names, with word-order noise added.

  Without skip-word-if-present-in-other-string:  F1 = 0.856
  With skip-word-if-present-in-other-string:     F1 = 0.981
33. Related Work
- Learned edit distance
  - [Bilenko & Mooney 2003], [Cohen et al 2003], ...
  - [Joachims 2003]: max-margin, trained on alignments
- Conditionally-trained models with latent variables
  - [Jebara 1999]: Conditional Expectation Maximization
  - [Quattoni, Collins, Darrell 2005]: a CRF for visual object recognition, with latent classes for object sub-patches
  - [Zettlemoyer & Collins 2005]: a CRF for mapping sentences to logical form, with latent parses
34. Predictive Random Fields: Latent-Variable Models fit by Multi-way Conditional Probability [McCallum, Wang, Pal, 2005]
- For clustering structured data, à la Latent Dirichlet Allocation and its successors
- But an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]
- But trained by a multi-conditional objective: O = P(A|B,C) P(B|A,C) P(C|A,B), where e.g. A, B, C are different modalities (c.f. predictive likelihood)
35. Predictive Random Fields: mixture of Gaussians on synthetic data [McCallum, Wang, Pal, 2005]
[Figure: clusterings of the data (classified by color) under a generatively-trained model, a Predictive Random Field, and a conditionally-trained model [Jebara 1998].]
36. Predictive Random Fields vs. Harmonium on a document retrieval task [McCallum, Wang, Pal, 2005]
[Figure: retrieval results for four models: a Predictive Random Field (multi-way conditionally trained); a model conditionally trained to predict class labels; a Harmonium trained jointly with class labels and words; and a Harmonium trained jointly with words.]
37. Summary
- String edit distance
  - Widely used in many fields
  - As in CRF sequence labeling, it benefits from:
    - conditional-probability training, and
    - the ability to use arbitrary, non-independent input features
- An example of a conditionally-trained model with latent variables
  - Finds the alignments that most help distinguish match from non-match
  - We may ultimately want the alignments, but at training time we only have the relatively-easier-to-label +/- labels: distantly-labeled data, semi-supervised learning
- Future work: edit distance on trees
- See also Predictive Random Fields: http://www.cs.umass.edu/pal/PRFTR.pdf
38. End of talk