A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance
1
A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance
  • Andrew McCallum
  • Kedar Bellare
  • Fernando Pereira

Thanks to Charles Sutton, Xuerui Wang and Mikhail
Bilenko for helpful discussions.
2
String Edit Distance
  • Distance between sequences x and y
  • cost of lowest-cost sequence of edit operations
    that transform string x into y.

3
String Edit Distance
  • Distance between sequences x and y
  • cost of lowest-cost sequence of edit operations
    that transform string x into y.
  • Applications
  • Database Record Deduplication

Apex International Hotel Grassmarket Street
Apex Internatl Grasmarket Street
Are these records duplicates of the same hotel?
4
String Edit Distance
  • Distance between sequences x and y
  • cost of lowest-cost sequence of edit operations
    that transform string x into y.
  • Applications
  • Database Record Deduplication
  • Biological Sequences

AGCTCTTACGATAGAGGACTCCAGA
AGGTCTTACCAAAGAGGACTTCAGA
5
String Edit Distance
  • Distance between sequences x and y
  • cost of lowest-cost sequence of edit operations
    that transform string x into y.
  • Applications
  • Database Record Deduplication
  • Biological Sequences
  • Machine Translation

Il a acheté une pomme
He bought an apple
6
String Edit Distance
  • Distance between sequences x and y
  • cost of lowest-cost sequence of edit operations
    that transform string x into y.
  • Applications
  • Database Record Deduplication
  • Biological Sequences
  • Machine Translation
  • Textual Entailment

He bought a new car last night
He purchased a brand new automobile yesterday
evening
7
Levenshtein Distance
[Levenshtein, 1966]

Edit operations:
  copy    Copy a character from x to y          (cost 0)
  insert  Insert a character into y             (cost 1)
  delete  Delete a character from x             (cost 1)
  subst   Substitute one character for another  (cost 1)
8
Levenshtein Distance

Edit operations:
  copy    Copy a character from x to y          (cost 0)
  insert  Insert a character into y             (cost 1)
  delete  Delete a character from x             (cost 1)
  subst   Substitute one character for another  (cost 1)

Dynamic program:

         W  i  l  l  l  e  a  m
      0  1  2  3  4  5  6  7  8
   W  1  0  1  2  3  4  5  6  7
   i  2  1  0  1  2  3  4  5  6
   l  3  2  1  0  1  2  3  4  5
   l  4  3  2  1  0  1  2  3  4
   i  5  4  3  2  1  1  2  3  4
   a  6  5  4  3  2  2  2  2  4
   m  7  6  5  4  3  3  3  3  2

D(i,j) = score of the best alignment from x1...xi to y1...yj:

  D(i,j) = min ( D(i-1,j-1) + d(xi,yj),   [copy/subst]
                 D(i-1,j) + 1,            [delete]
                 D(i,j-1) + 1 )           [insert]

  where d(xi,yj) = 0 if xi = yj, else 1.  Total cost = distance.
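The recurrence above can be sketched directly; a minimal Python version of the classic dynamic program, with x down the rows and y across the columns as in the slide's table:

```python
def levenshtein(x: str, y: str) -> int:
    """D[i][j] = lowest cost of transforming x[:i] into y[:j],
    with copy cost 0 and insert/delete/subst cost 1."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                                   # i deletes
    for j in range(1, n + 1):
        D[0][j] = j                                   # j inserts
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if x[i - 1] == y[j - 1] else 1      # copy vs. subst
            D[i][j] = min(D[i - 1][j - 1] + d,        # copy/subst
                          D[i - 1][j] + 1,            # delete
                          D[i][j - 1] + 1)            # insert
    return D[m][n]
```

For the slide's example, `levenshtein("William", "Willleam")` returns 2 (insert an "l", substitute "i" with "e"), matching the bottom-right cell of the table.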
9
Levenshtein Distance with Markov Dependencies

Repeated delete is cheaper: the cost of an edit operation now depends on
the previous operation.

                                             Cost after:
  Edit operation                          copy  insert  delete  subst
  copy    Copy a character from x to y      0      0       0      0
  insert  Insert a character into y         1      1       1      1
  delete  Delete a character from x         1      1      <1      1
  subst   Substitute one for another        1      1       1      1

With a finite-state machine over the operations {copy, insert, delete,
subst}, the dynamic programming table becomes 3D: one layer per state.
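A sketch of the 3D dynamic program, assuming an illustrative cost table in which a delete immediately after a delete costs 0.5 (the slide says only that the repeated delete is cheaper; the exact discount is an assumption here):

```python
import math

OPS = ("start", "copy", "insert", "delete", "subst")

def op_cost(prev: str, op: str) -> float:
    """Illustrative Markov costs: copy is free; a delete right
    after another delete is discounted to 0.5."""
    if op == "copy":
        return 0.0
    if op == "delete" and prev == "delete":
        return 0.5
    return 1.0

def markov_edit_distance(x: str, y: str) -> float:
    m, n = len(x), len(y)
    INF = math.inf
    # D[i][j][k]: best cost aligning x[:i] with y[:j], last op = OPS[k]
    D = [[[INF] * len(OPS) for _ in range(n + 1)] for _ in range(m + 1)]
    D[0][0][OPS.index("start")] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            for k, prev in enumerate(OPS):
                c = D[i][j][k]
                if c == INF:
                    continue
                if i < m and j < n:          # consume one char of each
                    op = "copy" if x[i] == y[j] else "subst"
                    kk = OPS.index(op)
                    new = c + op_cost(prev, op)
                    D[i + 1][j + 1][kk] = min(D[i + 1][j + 1][kk], new)
                if i < m:                    # delete x[i]
                    kk = OPS.index("delete")
                    new = c + op_cost(prev, "delete")
                    D[i + 1][j][kk] = min(D[i + 1][j][kk], new)
                if j < n:                    # insert y[j]
                    kk = OPS.index("insert")
                    new = c + op_cost(prev, "insert")
                    D[i][j + 1][kk] = min(D[i][j + 1][kk], new)
    return min(D[m][n])
```

With plain Levenshtein costs, `"abcde"` to `"a"` would cost 4; under the discounted run of deletes it costs 1 + 0.5 + 0.5 + 0.5 = 2.5.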
10
Ristad & Yianilos (1997)

Essentially a Pair-HMM, generating an edit/state/alignment-sequence and
the two strings.

Learn via EM:
  Expectation step: calculate the likelihood of alignment paths.
  Maximization step: make those paths more likely.
11
Ristad & Yianilos: Regrets
  • Limited features of input strings
  • Examine only a single character pair at a time
  • Difficult to use upcoming string context, lexicons, ...
  • Example: "Senator John Green" vs. "John Green"
  • Limited edit operations
  • Difficult to generate arbitrary jumps in both strings
  • Example: "UMass" vs. "University of Massachusetts"
  • Trained only on positive match data
  • Doesn't include information-rich near misses
  • Example: "ACM SIGIR" vs. "ACM SIGCHI"

So, consider a model trained by conditional probability.
12
Conditional Probability (Sequence) Models
  • We prefer a model that is trained to maximize a conditional
    probability rather than a joint probability: P(y|x) instead of P(y,x)
  • Can examine features, but is not responsible for generating them.
  • Don't have to explicitly model their dependencies.

13
From HMMs to Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]

[Figure: a linear chain of states y(t-1), y(t), y(t+1) over observations
x(t-1), x(t), x(t+1); the jointly-trained HMM becomes a
conditionally-trained linear-chain CRF.]
14
(Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]

Undirected graphical model, trained to maximize the conditional
probability of the output sequence given the input sequence:

  P(y|x) = (1/Z(x)) ∏_t exp( Σ_k λ_k f_k(y_t, y_t-1, x, t) )

where Z(x) is the per-input normalizer (partition function).

[Figure: finite-state model / graphical model with FSM states
y(t-2) ... y(t+1) (output labels such as OTHER, PERSON, ORG, TITLE) over
the observation sequence x(t-2) ... x(t+1); input sequence
"... said Jones a Microsoft VP ...".]
15
CRF String Edit Distance

  string 1:  x1 = "William W. Cohon"   (positions 1-16)
  string 2:  x2 = "Willleam Cohen"     (positions 1-14)

An alignment a is a sequence of edits, each a triple (a.i1, a.e, a.i2):
a position in string 1, an edit operation, and a position in string 2.
Here: copy W,i,l,l; insert "l"; subst i→e; copy a,m; delete "_W.";
copy _,C,o,h; subst o→e; copy n.

  a.i1: 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
  a.e : copy copy copy copy insert subst copy copy delete delete delete
        copy copy copy copy subst copy
  a.i2: 1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14

joint complete-data likelihood
conditional complete-data likelihood
16
CRF String Edit Distance FSM

[Figure: a single state with self-loop edit operations
copy, subst, insert, delete.]
17
CRF String Edit Distance FSM

[Figure: from a Start state, two copies of the edit FSM, each with
copy/subst/insert/delete self-loops: a "match" sub-machine (m = 1) and a
"non-match" sub-machine (m = 0).]
18
CRF String Edit Distance FSM

x1 = "Tommi Jaakkola"    x2 = "Tommi Jakola"

  Probability summed over all alignments in match states:     0.8
  Probability summed over all alignments in non-match states: 0.2
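The "probability summed over all alignments" is the same dynamic program run with sum and product in place of min and plus. A toy sketch of this idea (the weights `w_copy`, `w_edit` and the two-weight match/non-match split are illustrative stand-ins for learned CRF parameters, not the paper's actual model):

```python
import math

def alignment_mass(x: str, y: str, w_copy: float, w_edit: float) -> float:
    """Sum of exp(total score) over ALL edit alignments of x and y:
    the sum-product analogue of the min-cost dynamic program."""
    m, n = len(x), len(y)
    Z = [[0.0] * (n + 1) for _ in range(m + 1)]
    Z[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0:                              # delete x[i-1]
                total += Z[i - 1][j] * math.exp(w_edit)
            if j > 0:                              # insert y[j-1]
                total += Z[i][j - 1] * math.exp(w_edit)
            if i > 0 and j > 0:                    # copy or substitute
                w = w_copy if x[i - 1] == y[j - 1] else w_edit
                total += Z[i - 1][j - 1] * math.exp(w)
            Z[i][j] = total
    return Z[m][n]

def match_probability(x: str, y: str) -> float:
    """Toy decision: alignment mass under 'match' weights (copies
    rewarded) vs. mass under flat 'non-match' weights."""
    z_match = alignment_mass(x, y, w_copy=2.0, w_edit=-1.0)
    z_non = alignment_mass(x, y, w_copy=0.0, w_edit=0.0)
    return z_match / (z_match + z_non)
```

Qualitatively this behaves like the slide's examples: similar string pairs place most of their alignment mass under the match weights, dissimilar pairs under the non-match weights.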
19
CRF String Edit Distance FSM

x1 = "Tom Dietterich"    x2 = "Tom Dean"

  Probability summed over all alignments in match states:     0.1
  Probability summed over all alignments in non-match states: 0.9
20
Parameter Estimation

Given a training set of string pairs and match/non-match labels, the
objective function is the incomplete log likelihood.

  • Expectation Maximization
  • E-step: estimate the distribution over alignments using current
    parameters
  • M-step: change parameters to maximize the complete (penalized) log
    likelihood, with an iterative quasi-Newton method (BFGS)

This is conditional EM, but it avoids the complexities of [Jebara 1998]
because there is no need to solve the M-step in closed form.
21
Efficient Training
  • The dynamic programming table is 3D: with |x1|, |x2| ≈ 100 and
    |S| = 12 states, that is ~120,000 entries
  • Use beam search during the E-step [Pal, Sutton, McCallum 2005]
  • Unlike completely-observed CRFs, the objective function is not
    convex.
  • Initialize parameters not at zero, but so as to yield a reasonable
    initial edit distance.
22
What Alignments are Learned?

x1 = "Tommi Jaakkola"    x2 = "Tommi Jakola"

[Figure: the learned alignment of "T o m m i  J a a k k o l a" with
"T o m m i  J a k o l a" in the match/non-match FSM.]
23
What Alignments are Learned?

x1 = "Bruce Croft"    x2 = "Tom Dean"

[Figure: the learned alignment of "B r u c e  C r o f t" with
"T o m  D e a n" in the match/non-match FSM.]
24
What Alignments are Learned?

x1 = "Jaime Carbonell"    x2 = "Jamie Callan"

[Figure: the learned alignment of "J a i m e  C a r b o n e l l" with
"J a m i e  C a l l a n" in the match/non-match FSM.]
25
Example Learned Alignment

[Figure: an example learned alignment.]
26
Summary of Advantages
  • Arbitrary features of the input strings
  • Examine past and future context
  • Use lexicons, WordNet
  • Extremely flexible edit operations
  • A single operation may make arbitrary jumps in both strings, of size
    determined by input features
  • Discriminative training
  • Maximize the ability to predict match vs. non-match
27
Experimental Results: Data Sets
  • Restaurant name, Restaurant address
  • 864 records, 112 matches
  • E.g. "Abe's Bar & Grill, E. Main St" vs.
    "Abe's Grill, East Main Street"
  • People names, from the UIS DB generator
  • Synthetic noise
  • E.g. "John Smith" vs. "Snith, John"
  • CiteSeer citations
  • In four sections: Reason, Face, Reinforce, Constraint
  • E.g. "Rusell & Norvig, Artificial Intelligence: A Modern..." vs.
    "Russell & Norvig, Artificial Intelligence: An Intro..."
28
Experimental Results: Features
  • same, different
  • same-alphabetic, different-alphabetic
  • same-numeric, different-numeric
  • punctuation1, punctuation2
  • alphabet-mismatch, numeric-mismatch
  • end-of-1, end-of-2
  • same-next-character, different-next-character
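A hypothetical sketch of how a few of the features above might be computed at an alignment position (i, j); the slide gives only the feature names, so these definitions are assumptions:

```python
def features(x: str, y: str, i: int, j: int) -> dict:
    """Active binary features at alignment position (i, j).
    Hypothetical definitions, named after the slide's feature list."""
    xc = x[i] if i < len(x) else None
    yc = y[j] if j < len(y) else None
    punct = set(",.;:-()'\"")
    f = {
        "same": xc is not None and xc == yc,
        "different": xc != yc,
        "same-alphabetic": xc is not None and xc == yc and xc.isalpha(),
        "same-numeric": xc is not None and xc == yc and xc.isdigit(),
        "punctuation1": xc in punct if xc else False,
        "punctuation2": yc in punct if yc else False,
        "end-of-1": i == len(x) - 1,
        "end-of-2": j == len(y) - 1,
        "same-next-character": (i + 1 < len(x) and j + 1 < len(y)
                                and x[i + 1] == y[j + 1]),
        "different-next-character": (i + 1 < len(x) and j + 1 < len(y)
                                     and x[i + 1] != y[j + 1]),
    }
    return {name: 1.0 for name, on in f.items() if on}
```

Each active feature would be multiplied by a learned weight in the CRF's potential for that edit operation.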

29
Experimental Results: Edit Operations
  • insert, delete, substitute/copy
  • swap-two-characters
  • skip-word-if-in-lexicon
  • skip-parenthesized-words
  • skip-any-word
  • substitute-word-pairs-in-translation-lexicon
  • skip-word-if-present-in-other-string

30
Experimental Results
[Bilenko & Mooney 2003]

F1 (harmonic mean of precision and recall):

  Data set             Levenshtein  Learned Leven.  Vector  Learned Vector
  Restaurant name         0.290         0.354       0.365       0.433
  Restaurant address      0.686         0.712       0.380       0.532
  CiteSeer Reason         0.927         0.938       0.897       0.924
  CiteSeer Face           0.952         0.966       0.922       0.875
  CiteSeer Reinf          0.893         0.907       0.903       0.808
  CiteSeer Constraint     0.924         0.941       0.923       0.913
31
Experimental Results
[Bilenko & Mooney 2003]

F1 (harmonic mean of precision and recall):

  Data set             Levenshtein  Learned Leven.  Vector  Learned Vector  CRF Edit Distance
  Restaurant name         0.290         0.354       0.365       0.433            0.448
  Restaurant address      0.686         0.712       0.380       0.532            0.783
  CiteSeer Reason         0.927         0.938       0.897       0.924            0.964
  CiteSeer Face           0.952         0.966       0.922       0.875            0.918
  CiteSeer Reinf          0.893         0.907       0.903       0.808            0.917
  CiteSeer Constraint     0.924         0.941       0.923       0.913            0.976
32
Experimental Results

Data set: person names, with word-order noise added

  Without skip-if-present-in-other-string:  F1 = 0.856
  With skip-if-present-in-other-string:     F1 = 0.981
33
Related Work
  • Learned edit distance
  • [Bilenko & Mooney 2003], [Cohen et al 2003], ...
  • [Joachims 2003]: max-margin, trained on alignments
  • Conditionally-trained models with latent variables
  • [Jebara 1999]: Conditional Expectation Maximization
  • [Quattoni, Collins, Darrell 2005]: CRF for visual object
    recognition, with latent classes for object sub-patches
  • [Zettlemoyer & Collins 2005]: CRF for mapping sentences to logical
    form, with latent parses.
34
Predictive Random Fields: Latent Variable Models fit by Multi-way Conditional Probability
[McCallum, Wang, Pal, 2005]
  • For clustering structured data, à la Latent Dirichlet Allocation and
    its successors
  • But an undirected model, like the Harmonium
    [Welling, Rosen-Zvi, Hinton, 2005]
  • But trained by a multi-conditional objective:
    O = P(A|B,C) P(B|A,C) P(C|A,B),
    e.g. where A, B, C are different modalities (c.f. predictive
    likelihood)

35
Predictive Random Fieldsmixture of Gaussians on
synthetic data
McCallum, Wang, Pal, 2005
Data, classify by color
Generatively trained
Predictive Random Field
Conditionally-trained Jebara 1998
36
Predictive Random Fields vs. Harmonium on a document retrieval task
[McCallum, Wang, Pal, 2005]

[Figure: retrieval results comparing a Predictive Random Field
(multi-way conditionally trained); a conditionally-trained model that
predicts class labels; a Harmonium trained jointly with class labels and
words; and a Harmonium trained jointly with words.]
37
Summary
  • String edit distance
  • Widely used in many fields
  • As in CRF sequence labeling, benefits from
  • conditional-probability training, and
  • the ability to use arbitrary, non-independent input features
  • An example of a conditionally-trained model with latent variables.
  • Find the alignments that most help distinguish match from non-match.
  • May ultimately want the alignments, but at training time only have
    the relatively-easier-to-label +/- match labels: distantly-labeled
    data, semi-supervised learning
  • Future work: edit distance on trees.
  • See also Predictive Random Fields:
    http://www.cs.umass.edu/pal/PRFTR.pdf
38
End of talk