Global Inference in Learning for Natural Language Processing

Transcript and Presenter's Notes

1
Global Inference in Learning for Natural Language Processing
  • Vasin Punyakanok
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign
  • Joint work with Dan Roth, Wen-tau Yih, and Dav
    Zimak

2
Story Comprehension
(ENGLAND, June, 1989) - Christopher Robin is
alive and well. He lives in England. He is the
same person that you read about in the book,
Winnie the Pooh. As a boy, Chris lived in a
pretty home called Cotchfield Farm. When Chris
was three years old, his father wrote a poem
about him. The poem was printed in a magazine
for others to read. Mr. Robin then wrote a book.
He made up a fairy tale land where Chris lived.
His friends were animals. There was a bear
called Winnie the Pooh. There was also an owl
and a young pig, called a piglet. All the
animals were stuffed toys that Chris owned. Mr.
Robin made them come to life with his words. The
places in the story were all near Cotchfield
Farm. Winnie the Pooh was written in 1925.
Children still love to read about Christopher
Robin and his animal friends. Most people don't
know he is a real person who is grown now. He
has written two books of his own. They tell what
it is like to be famous.
  • Who is Christopher Robin?
  • What did Mr. Robin do when Chris was three years
    old?
  • When was Winnie the Pooh written?
  • Why did Chris write two books of his own?

3
Stand Alone Ambiguity Resolution
  • Context Sensitive Spelling Correction
  • Illinois bored of education → board
  • Word Sense Disambiguation
  • ...Nissan Car and truck plant is
  • divide life into plant and animal kingdom
  • Part of Speech Tagging
  • (This DT) (can N) (will MD) (rust V)
    DT,MD,V,N
  • Coreference Resolution
  • The dog bit the kid. He was taken to a hospital.
  • The dog bit the kid. He was taken to a
    veterinarian.

4
Textual Entailment
  • Eyeing the huge market potential, currently led
    by Google, Yahoo took over search company
    Overture Services Inc. last year.
  • Yahoo acquired Overture.
  • Question Answering
  • Who acquired Overture?

5
Inference and Learning
  • Global decisions in which several local decisions
    play a role but there are mutual dependencies on
    their outcome.
  • Learned classifiers for different sub-problems
  • Incorporate the classifiers' information, along with
    constraints, to make coherent decisions: decisions that
    respect the local classifiers as well as domain- and
    context-specific constraints.
  • Global inference for the best assignment to all
    variables of interest.

6
Text Chunking
[Figure: shallow parse (chunking) of the input x, "The guy standing there is so tall", into output y with NP, VP, ADVP, and ADJP chunks]
7
Full Parsing
[Figure: full parse tree (output y) of the input x, "The guy standing there is so tall", with S, NP, VP, ADJP, and ADVP nodes]
8
Outline
  • Semantic Role Labeling Problem
  • Global Inference with Integer Linear Programming
  • Some Issues with Learning and Inference
  • Global vs Local Training
  • Utility of Constraints in the Inference
  • Conclusion

9
Semantic Role Labeling
  • I left my pearls to my daughter in my will .
  • [I]A0 left [my pearls]A1 to [my daughter]A2 in [my will]AM-LOC .
  • A0: Leaver
  • A1: Things left
  • A2: Benefactor
  • AM-LOC: Location

10
Semantic Role Labeling
  • PropBank [Palmer et al., 05] provides a large
    human-annotated corpus of semantic verb-argument
    relations.
  • It adds a layer of generic semantic labels to
    Penn Tree Bank II.
  • (Almost) all the labels are on the constituents
    of the parse trees.
  • Core arguments A0-A5 and AA
  • different semantics for each verb
  • specified in the PropBank Frame files
  • 13 types of adjuncts labeled as AM-arg
  • where arg specifies the adjunct type

11
Semantic Role Labeling
12
The Approach
  • Pruning
  • Use heuristics to reduce the number of candidates
    (modified from [Xue & Palmer, 04])
  • Argument Identification
  • Use a binary classifier to identify arguments
  • Argument Classification
  • Use a multiclass classifier to classify arguments
  • Inference
  • Infer the final output satisfying linguistic and
    structure constraints

13
Learning
  • Both argument identifier and argument classifier
    are trained phrase-based classifiers.
  • Features (some examples)
  • voice, phrase type, head word, path, chunk, chunk
    pattern, etc. some make use of a full syntactic
    parse
  • Learning Algorithm: SNoW
  • Sparse network of linear functions
  • weights learned by regularized Winnow
    multiplicative update rule with averaged weight
    vectors
  • Probability conversion is done via softmax
  • pi = exp(acti) / Σj exp(actj)
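The softmax conversion above can be sketched in a few lines (the activation values are made up for illustration):

```python
import math

def softmax(activations):
    # p_i = exp(act_i) / sum_j exp(act_j)
    # Subtracting the max activation keeps exp() numerically stable.
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    z = sum(exps)
    return [e / z for e in exps]

# Made-up raw activations for three candidate labels.
probs = softmax([2.0, 1.0, 0.1])
```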

14
Inference
  • The output of the argument classifier often
    violates some constraints, especially when the
    sentence is long.
  • Finding the best legitimate output is formalized
    as an optimization problem and solved via Integer
    Linear Programming.
  • Input
  • The probability estimation (by the argument
    classifier)
  • Structural and linguistic constraints
  • Allows incorporating expressive (non-sequential)
    constraints on the variables (the arguments
    types).

15
Integer Linear Programming Inference
  • For each argument ai and type t (including null)
  • Set up a Boolean variable ai,t indicating if ai
    is classified as t
  • Goal is to maximize
  • Σi,t score(ai = t) · ai,t
  • Subject to the (linear) constraints
  • Any Boolean constraint can be encoded this way
  • If score(ai = t) = P(ai = t), the objective is to find
    the assignment that maximizes the expected number of
    correct arguments while satisfying the constraints

16
Linear Constraints
  • No overlapping or embedding arguments
  • For all ai, aj that overlap or embed: ai,null + aj,null ≥ 1

17
Constraints
  • Constraints
  • No overlapping or embedding arguments
  • No duplicate argument classes for A0-A5
  • Exactly one V argument per predicate
  • If there is a C-V, there must be a V-A1-C-V pattern
  • If there is an R-arg, there must be arg somewhere
  • If there is a C-arg, there must be arg somewhere
    before
  • Each predicate can take only core arguments that
    appear in its frame file.
  • More specifically, we check for only the minimum
    and maximum ids
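As an illustration of the inference step, here is a brute-force stand-in for the ILP solver that enforces one of the constraints above (no duplicate core argument classes). The candidate arguments, scores, and helper names are made up:

```python
from itertools import product

TYPES = ['A0', 'A1', 'null']

# Hypothetical classifier scores P(a_i = t) for three candidate arguments.
scores = [
    {'A0': 0.6, 'A1': 0.3, 'null': 0.1},
    {'A0': 0.5, 'A1': 0.4, 'null': 0.1},
    {'A0': 0.2, 'A1': 0.3, 'null': 0.5},
]

def no_duplicate_core(assignment):
    # Each core class (A0, A1) may be used at most once.
    return all(assignment.count(t) <= 1 for t in ('A0', 'A1'))

def infer(scores, feasible):
    # Exhaustive search standing in for the ILP solver: maximize the
    # total score over all assignments that satisfy the constraints.
    candidates = [y for y in product(TYPES, repeat=len(scores)) if feasible(y)]
    return max(candidates,
               key=lambda y: sum(sc[t] for sc, t in zip(scores, y)))

unconstrained = infer(scores, lambda y: True)   # ('A0', 'A0', 'null')
constrained = infer(scores, no_duplicate_core)  # ('A0', 'A1', 'null')
```

A real system would hand the Boolean variables ai,t and the linear constraints to an ILP solver; exhaustive search is only feasible for tiny candidate sets.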

18
SRL Results (CoNLL-2005)
  • Training: sections 02-21
  • Development: section 24
  • Test WSJ: section 23
  • Test Brown: from the Brown corpus (very small)

19
Inference with Multiple SRL systems
  • Goal is to maximize
  • Σi,t score(ai = t) · ai,t
  • Subject to the (linear) constraints
  • Any Boolean constraint can be encoded this way
  • score(ai = t) = Σk fk(ai = t)
  • If system k has no opinion on ai, use a prior instead

20
Results with Multiple Systems (CoNLL-2005)
21
Outline
  • Semantic Role Labeling Problem
  • Global Inference with Integer Linear Programming
  • Some Issues with Learning and Inference
  • Global vs Local Training
  • Utility of Constraints in the Inference
  • Conclusion

22
Learning and Inference
L+I (Learning plus Inference): train the classifiers without constraints; apply inference with constraints only at testing time
IBT (Inference-based Training): the inference with constraints is part of training
[Figure: local classifiers f1(x)-f5(x) mapping input X to output Y]
Which one is better? When and Why?
23
Comparisons of Learning Approaches
  • Coupling (IBT)
  • Optimize the true global objective function (This
    should be better in the limit)
  • Decoupling (L+I)
  • More efficient
  • Reusability of classifiers
  • Modularity in training
  • No global examples required

24
Claims
  • When the local classification problems are easy, L+I
    outperforms IBT.
  • Only when the local problems become difficult to solve
    in isolation does IBT outperform L+I, and then it needs
    a large enough number of training examples.
  • Will show experimental results and theoretical
    intuition to support our claims.

25
Perceptron-based Global Learning
[Figure: perceptron-based global learning with local classifiers f1(x)-f5(x) mapping input X to output Y]
26
Simulation
  • There are 5 local binary linear classifiers
  • Global classifier is also linear
  • h(x) = argmaxy∈C(Y) Σi fi(x, yi)
  • Constraints are randomly generated
  • The hypothesis is linearly separable at the
    global level given that the constraints are known
  • The separability level at the local level is
    varied

27
Bound Prediction
L+I vs. IBT: the more identifiable the individual
problems are, the better the overall performance is
with L+I
  • Local: ε ≤ εopt + O( ( (d log m + log 1/δ) / m )^1/2 )
  • Global: ε ≤ 0 + O( ( (cd log m + c²d log 1/δ) / m )^1/2 )

28
Relative Merits SRL
[Figure: relative performance of L+I vs. IBT in SRL as the difficulty of the learning problem (number of features) varies from easy to hard]
29
Summary
  • When the local classification problems are easy, L+I
    outperforms IBT.
  • Only when the local problems become difficult to solve
    in isolation does IBT outperform L+I, and then it needs
    a large enough number of training examples.
  • Why does inference help at all?

30
About Constraints
  • We always assume that global coherency is good
  • Constraints do help in real-world applications
  • Performance is usually measured on the local predictions
  • Depending on the performance metric, constraints can hurt

31
Results: Contribution of Expressive Constraints
[Roth & Yih, 05]
  • Basic: learning with statistical constraints only
  • Additional constraints added at evaluation time (for
    efficiency)

                    CRF-D           CRF-ML
                    F1      diff    F1      diff
  basic (Viterbi)   69.14   -       66.46   -
  no dup            69.74   0.60    67.10   0.64
  cand              73.64   3.90    71.78   4.68
  argument          73.71   0.07    71.71   -0.07
  verb pos          73.78   0.07    71.72   0.01
  disallow          73.91   0.13    71.94   0.22
32
Assumptions
  • y = ⟨y1, …, yl⟩
  • Non-interactive classifiers fi(x, yi): each classifier
    does not use the outputs of other classifiers as inputs
  • Inference is linear summation
  • hun(x) = argmaxy∈Y Σi fi(x, yi)
  • hcon(x) = argmaxy∈C(Y) Σi fi(x, yi)
  • C(Y) always contains the correct outputs
  • No assumption on the structure of the constraints
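A minimal sketch of hun versus hcon under these assumptions, with made-up local score tables over l = 3 binary outputs:

```python
from itertools import product

# Made-up local scores f_i(x, y_i) for one fixed input x.
f = [
    {0: 0.2, 1: 0.7},  # f_1 prefers y_1 = 1
    {0: 0.6, 1: 0.4},  # f_2 prefers y_2 = 0
    {0: 0.1, 1: 0.8},  # f_3 prefers y_3 = 1
]

def h(constraint=None):
    # argmax over y of sum_i f_i(x, y_i), restricted to C(Y) if given.
    space = [y for y in product((0, 1), repeat=len(f))
             if constraint is None or constraint(y)]
    return max(space, key=lambda y: sum(fi[yi] for fi, yi in zip(f, y)))

h_un = h()                        # unconstrained inference, hun
h_con = h(lambda y: sum(y) <= 1)  # hcon with C(Y): at most one output is 1
```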

33
Performance Metrics
  • Zero-one loss
  • Mistakes are calculated in terms of global
    mistakes
  • y is wrong if any of yi is wrong
  • Hamming loss
  • Mistakes are calculated in terms of local mistakes
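The two metrics can be stated directly (gold and pred here are made-up 4-bit outputs):

```python
def zero_one_loss(gold, pred):
    # Global mistake: y is wrong if any y_i is wrong.
    return int(any(g != p for g, p in zip(gold, pred)))

def hamming_loss(gold, pred):
    # Local mistakes: count the positions where y_i is wrong.
    return sum(g != p for g, p in zip(gold, pred))

gold = [0, 0, 1, 1]
pred = [0, 1, 1, 0]
```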

34
Zero-One Loss
  • Constraints cannot hurt
  • Constraints never fix correct global outputs
  • This is not true for Hamming Loss

35
Boolean Cube
  • 4-bit binary outputs
[Figure: Boolean cube of 4-bit outputs, layered by number of mistakes (0 to 4) relative to the correct output]
36
Hamming Loss
[Figure: Hamming loss on the Boolean cube, relative to the correct output 0011]
37
Best Classifiers
38
When Constraints Cannot Hurt
  • δi: distance between the correct label and the 2nd
    best label
  • εi: distance between the predicted label and the
    correct label
  • Fcorrect = {i : fi is correct}
  • Fwrong = {i : fi is wrong}
  • Constraints cannot hurt if
  • ∀ i ∈ Fcorrect: δi > Σi∈Fwrong εi
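A numeric sketch of this condition (the δ/ε values and the helper name are made up): constraints cannot hurt when every correct classifier's margin δi exceeds the total distance contributed by the wrong classifiers.

```python
def constraints_cannot_hurt(deltas, epsilons, correct):
    # deltas[i]: margin between correct and 2nd-best label for f_i
    # epsilons[i]: distance between predicted and correct label for f_i
    # correct[i]: whether f_i predicted correctly
    wrong_mass = sum(e for e, c in zip(epsilons, correct) if not c)
    return all(d > wrong_mass for d, c in zip(deltas, correct) if c)

# Two correct classifiers with large margins, one wrong with epsilon = 0.5:
safe = constraints_cannot_hurt([0.9, 0.8, 0.0], [0.0, 0.0, 0.5],
                               [True, True, False])
```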

39
An Empirical Investigation
  • SRL System
  • CoNLL-2005 WSJ test set

40
An Empirical Investigation
41
Good Classifiers
42
Bad Classifiers
43
Average Distance vs Gain in Hamming Loss
  • Good classifiers: High Loss → Low Score (Low Gain)

44
Good Classifiers
45
Bad Classifiers
46
Average Gain in Hamming Loss vs Distance
  • Good classifiers: High Score → Low Loss (High Gain)

47
Utility of Constraints
  • Constraints improve the performance because the
    classifiers are good
  • Good Classifiers
  • When the classifier is correct, there is a large
    margin between the correct label and the 2nd best label
  • When the classifier is wrong, the correct label is not
    far from the predicted one

48
Conclusions
  • Showed how global inference can be used
  • Semantic Role Labeling
  • Tradeoffs between Coupling vs. Decoupling
    learning and inference
  • Investigation of utility of constraints
  • The analyses are very preliminary
  • Average-case analysis for the tradeoffs between
    Coupling vs. Decoupling learning and inference
  • Better understanding for using constraints
  • More interactive classifiers
  • Different performance metrics, e.g. F1
  • Relation with margin