Title: Day 4: Reranking/Attention shift; surprisal-based sentence processing
1. Day 4: Reranking/Attention shift; surprisal-based sentence processing
- Roger Levy
- University of Edinburgh
- University of California San Diego
2. Overview for the day
- Reranking / attention shift
- Crash course in information theory
- Surprisal-based sentence processing
3. Reranking / attention shift
- Suppose an input prefix w_1..i determines a ranked set of incremental structural analyses, call it Struct(w_1..i)
- In general, adding a new word w_i+1 to the input will determine a new ranked set of analyses Struct(w_1..i+1)
- A reranking theory attributes processing difficulty to some function comparing the two sets of structural analyses
- An attention shift theory is a special case where difficulty is predicted only when the highest-ranked analysis differs between Struct(w_1..i) and Struct(w_1..i+1) (sketched in code below)
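A minimal sketch of the two linking hypotheses, assuming analyses are simply ranked by their conditional probabilities P(T | w_1..i); the analysis labels, the numbers, and the function names are invented for illustration and are not from any of the cited papers.

def top_ranked(struct):
    # Highest-ranked analysis in Struct(w_1..i), ranking by conditional probability
    return max(struct, key=struct.get)

def attention_shift_difficulty(struct_old, struct_new):
    # Attention shift: difficulty is predicted only when the top-ranked analysis changes
    return top_ranked(struct_old) != top_ranked(struct_new)

def reranking_difficulty(struct_old, struct_new, compare):
    # Generic reranking: difficulty is some function comparing the old and new ranked sets
    return compare(struct_old, struct_new)

# Invented numbers for "The warehouse fires many workers ...":
struct_after_fires = {"fires = noun (compound)": 0.7, "fires = verb": 0.3}
struct_after_many  = {"fires = noun (compound)": 0.2, "fires = verb": 0.8}
print(attention_shift_difficulty(struct_after_fires, struct_after_many))  # True: shift at "many"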
4. Conceptual issues
- Granularity: what precisely is specified in an incremental structural analysis?
- Ranking metric: how are analyses ranked?
- e.g. in terms of conditional probabilities P(T | w_1..i)
- Degree of parallelism: how many (and which) analyses are retained in Struct(w_1..i)?
5. Crocker & Brants 2000
6. Attention shift: an example
- Parallel comprehension: two or more analyses entertained simultaneously
- Disambiguation comes at the following context, "many workers"
- There is an extra cost paid (reading is slower) at the disambiguating context
- Eye-tracking (Frazier and Rayner 1987)
- Self-paced reading (MacDonald 1993)
The warehouse fires many workers each spring
7. Pruning isn't enough
- Jurafsky analyzed the NN/NV ambiguity for "warehouse fires" and concluded no pruning could happen
[Figure: estimated probability ratios between the competing analyses, 267:1 and 3.8:1]
8. Idea of attention shift
- Suppose that a change in the top-ranked candidate induces empirically-observed difficulty
- Not the same as serial parsing, which doesn't even entertain alternate parses unless the current parse breaks down
- Why would this happen?
- People could be gathering more information about the preferred parse, and need extra time to do this when the preferred parse changes
- People could simply be surprised, and this could interrupt normal reading processes
9. Crocker & Brants 2000
- Adopt an attention-shift linking hypothesis
- (p. 660; unfortunately not stated very explicitly)
- Architectural aspects of their system:
- Bottom-up, incremental parsing architecture
- Some pruning at every layer from the bottom on up
- No lexicalization in the grammar
- Skip other details
10. N/V ambiguity under attention shift
- Crocker & Brants 2000: the relative strength of each interpretation changes from word to word
11. N/V attention shift: which probabilities?
- This analysis relies on lexical syntactic probabilities
- P(fires | NN) is higher than P(fires | VBZ)
- P(NP → Det NN NN) is low, and putting "many" after a subject NP is low-probability
- Is this a satisfactory analysis? (cf. Day 1!)
- MacDonald 1993 found no disambiguating-context difficulty when the noun ("corporation") doesn't support the noun-compound analysis
- These are, at the least, bilexical affinities
The corporation fires many workers each spring
12. Results from MacDonald 1993
- Difficulty only with "warehouse fires", not "corporation fires"
- Observed difficulty delayed a bit (spillover)
- Relative difficulty in the ambiguous case
13. How to estimate parse probs
- In an attention-shift model, conditional probabilities are of primary interest
- "warehouse fires" vs. "corporation fires" creates a practical problem
- The model should include P(fires | warehouse, NN vs. NV) and P(fires | corporation, NN vs. NV)
- But no parsed corpus even contains "fires" in the same sentence with either of these words
- What do we do here?
14. How to estimate parse probs (2)
- MacDonald 1993's approach: collect relevant quantitative norm data and correlate with RTs
- "warehouse" head vs. modifying noun frequency
- corresponds to P(NN | warehouse)
- "fires" noun/verb ambiguous word usage
- corresponds (indirectly) to P(fires | NN)
- "warehouse fires" modifier-head cooccurrence rate
- corresponds to P(fires | warehouse, NN)
- "warehouse fires" plausibility ratings as NV vs. as NN
- how plausible is it to have a fire in a warehouse?
- how plausible is it to have a warehouse fire someone?
15. How to estimate parse probs (omit)
- We can use MacDonald's head vs. modifying noun frequencies, plus the cooccurrence frequency, plus bigram and unigram frequencies, to determine P(NN) in each case
[Table: corpus estimates vs. MacDonald's estimates of P(NN); the values shown include 0.46 and 0.028]
16. How to estimate parse probs (3)
- In the era of gigantic corpora (e.g., the Web), another approach: the counting method (sketched below)
- To estimate P(NN | the warehouse fires), simply collect a sample of "the warehouse fires" and count how many of them are NN usages
- Many pitfalls!
- often can't hold external sentence context constant
- vulnerable to undisclosed workings of search engines
- hand-filtering the results is imperative
- assumes human probability estimates will match corpus frequencies
- BUT it gives access to huge data!
17. How to estimate parse probs (3)
- Crude method: we'll use a corpus search (Google) to estimate P(NN | warehouse, fires)
- 21 instances of "warehouse fires" found (excluding psycholinguistics hits!); all were NN
- two of these were potentially NV contexts
- At least some evidence that P(NN | warehouse, fires) is above 0.5
- Supports the attention-shift analysis
I heard an interview on NPR of a Vieux Carre (French Quarter) native who explained how the warehouse fires started...
Not all the warehouse fires were so devastating, ...
18. Attention shift in MV/RR ambiguity?
- McRae et al. 1998 also has an attention-shift interpretation (pursued by Narayanan & Jurafsky 2002)
[Figure: point at which the analysis shifts to the RR reading, for good patients ("the crook") vs. good agents ("the cop")]
19. Reranking/Attention shift summary
- Reranking attributes difficulty to changes in the ranking over interpretations caused by a given word
- Attention shift is a special form in which only changes in the highest-ranked candidate matter
20. Overview for the day
- Reranking / attention shift
- Tiny introduction to information theory
- Surprisal-based sentence processing
21. Tiny intro to information theory
- The Shannon information content, or surprisal, of an event x: h(x) = log2 (1 / P(x)) = -log2 P(x)
- Example: a bent coin with P(heads) = 0.4 has h(heads) = log2 (1 / 0.4) ≈ 1.32 bits
- A loaded die with P(1) = 0.4 also has h(1) ≈ 1.32 bits
- (sometimes called the entropy of event x)
22. Tiny intro to information theory (2)
- The entropy of a discrete probability distribution is the expected value of its Shannon information content: H = Σ_x P(x) log2 (1 / P(x))
- Example: the entropy of a fair coin is 1 bit
- Our bent P(heads) = 0.4 coin has entropy less than 1 bit (≈ 0.97 bits; see the short code sketch below)
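A short Python sketch (not from the original slides) that reproduces the numbers above:

import math

def surprisal(p):
    # Shannon information content of an event with probability p, in bits
    return -math.log2(p)

def entropy(dist):
    # Entropy of a discrete distribution (a list of probabilities), in bits
    return sum(p * surprisal(p) for p in dist if p > 0)

print(surprisal(0.4))        # ~1.32 bits: heads on the bent coin, or 1 on the loaded die
print(entropy([0.5, 0.5]))   # 1.0 bit: fair coin
print(entropy([0.4, 0.6]))   # ~0.97 bits: bent coin
print(entropy([1/6] * 6))    # ~2.58 bits: fair die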
23. (No transcript)
24. Tiny intro to information theory (3)
- Our loaded die with P(1) = 0.4 doesn't have its entropy completely determined yet. Two examples (see the sketch below):
- A fair die has entropy of 2.58 bits
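The two example distributions themselves are not recoverable from the transcript; here is a sketch with two made-up distributions, both with P(1) = 0.4 but with different entropies:

import math

def entropy(dist):
    return sum(-p * math.log2(p) for p in dist if p > 0)

spread_out   = [0.4] + [0.12] * 5        # remaining mass spread evenly over faces 2-6
concentrated = [0.4, 0.6, 0, 0, 0, 0]    # remaining mass all on one face

print(entropy(spread_out))    # ~2.36 bits
print(entropy(concentrated))  # ~0.97 bits
print(entropy([1/6] * 6))     # ~2.58 bits: the fair die, for comparison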
25. Overview for the day
- Reranking / attention shift
- Crash course in information theory
- Surprisal-based sentence processing
26. Hale 2001, Levy 2005: surprisal
- Let the difficulty of a word be its surprisal given its context: difficulty(w_i) ∝ -log2 P(w_i | w_1..i-1)
- Captures the expectation intuition: the more we expect an event, the easier it is to process
- Many probabilistic formalisms, including PCFGs (Jelinek & Lafferty 1991, Stolcke 1995), can give us word surprisals
27. Intuitions for surprisal: PCFGs
- Consider the following PCFG
- Calculate the surprisal at "destroyed" in these sentences (a brute-force sketch follows below)
P(S → NP VP) = 1.0       P(NP → DT N) = 0.4
P(NP → DT N N) = 0.3     P(NP → DT Adj N) = 0.3
P(N → warehouse) = 0.03  P(N → fires) = 0.02
P(DT → the) = 0.3        P(VP → V) = 0.3
P(VP → V NP) = 0.4       P(VP → V PP) = 0.1
P(V → fires) = 0.05      P(V → destroyed) = 0.04
the warehouse fires destroyed the neighborhood.
the fires destroyed the neighborhood.
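A brute-force sketch of the calculation (mine, not from the slides). It assumes that only the rules listed above exist, i.e. that unlisted expansions (PP, Adj, the rest of the lexicon) have probability zero. Under that assumption, destroyed comes out far more surprising after "the warehouse fires" (about 6.4 bits) than after "the fires" (about 1.2 bits), because the verb reading of fires siphons off much of the prefix probability in the first sentence.

import math

# Toy PCFG from the slide. Assumption: only the listed rules exist, so
# unlisted expansions (PP, Adj, the rest of the lexicon) get probability 0.
RULES = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "N"), 0.4), (("DT", "N", "N"), 0.3), (("DT", "Adj", "N"), 0.3)],
    "VP": [(("V",), 0.3), (("V", "NP"), 0.4), (("V", "PP"), 0.1)],
    "N":  [(("warehouse",), 0.03), (("fires",), 0.02)],
    "DT": [(("the",), 0.3)],
    "V":  [(("fires",), 0.05), (("destroyed",), 0.04)],
}
NONTERMINALS = {"S", "NP", "VP", "PP", "N", "DT", "V", "Adj"}

def derivations(symbols):
    # Yield (words, probability) for every complete derivation of the
    # given symbol sequence using only the rules above.
    if not symbols:
        yield [], 1.0
        return
    first, rest = symbols[0], list(symbols[1:])
    if first not in NONTERMINALS:               # terminal word
        for words, p in derivations(rest):
            yield [first] + words, p
        return
    for rhs, rule_p in RULES.get(first, []):    # no listed rules -> no derivations
        for left, p_left in derivations(list(rhs)):
            for right, p_right in derivations(rest):
                yield left + right, rule_p * p_left * p_right

def prefix_prob(prefix):
    # P(the sentence begins with `prefix`), summing over complete parses
    return sum(p for words, p in derivations(["S"]) if words[:len(prefix)] == prefix)

def surprisal(prefix, word):
    # Surprisal of `word` given the preceding words, in bits
    return -math.log2(prefix_prob(prefix + [word]) / prefix_prob(prefix))

print(surprisal("the warehouse fires".split(), "destroyed"))  # ~6.4 bits
print(surprisal("the fires".split(), "destroyed"))            # ~1.2 bits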
28. Connection with reranking models
- Levy 2005 shows that surprisal is a special form of reranking model
- In particular, if reranking cost is taken as the KL divergence between the old and new parse distributions,
- then reranking cost turns out to be equivalent to the surprisal of the new word w_i+1 (derivation sketched below)
- Thus representation neutrality is an interesting consequence of the surprisal theory
- (KL divergence: a measure of the penalty incurred by encoding one probability distribution with another)
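A sketch of the argument (notation mine): for any structure T consistent with the extended prefix, conditioning on the new word just rescales its probability by a constant factor, so the KL divergence collapses to that constant:

D_KL( P(T | w_1..i+1) || P(T | w_1..i) )
  = Σ_T P(T | w_1..i+1) log2 [ P(T | w_1..i+1) / P(T | w_1..i) ]
  = Σ_T P(T | w_1..i+1) log2 [ P(w_1..i) / P(w_1..i+1) ]      (since P(T | w_1..j) = P(T) / P(w_1..j) for consistent T)
  = -log2 P(w_i+1 | w_1..i)

Structures inconsistent with the new word have probability zero in the new distribution and contribute nothing to the sum.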
29. Levy 2006: syntactically constrained contexts
- In many cases, you know that you have to encounter a particular category C
- But you don't know when you'll encounter it, or which member of C will actually appear
- Call these syntactically constrained contexts
- In these contexts, the more information related to C you obtain, the sharper your expectations about C generally turn out to be
- Interesting contrast to some non-probabilistic theories, which say that holding onto the related information is hard
30. Constrained contexts: final verbs
- Konieczny 2000 looked at reading times at German final verbs

Er hat die Gruppe geführt
He has the group led
"He led the group"

Er hat die Gruppe auf den Berg geführt
He has the group to the mountain led
"He led the group to the mountain"

Er hat die Gruppe auf den SEHR SCHÖNEN Berg geführt
He has the group to the VERY BEAUTIFUL mtn. led
"He led the group to the very beautiful mountain"
31. Locality predictions and empirical results
- Locality-based models (Gibson 1998) predict difficulty at the final verb for longer clauses
- But Konieczny found that final verbs were read faster in longer clauses

Er hat die Gruppe geführt
He led the group

Er hat die Gruppe auf den Berg geführt
He led the group to the mountain

...die Gruppe auf den sehr schönen Berg geführt
He led the group to the very beautiful mountain
32. Surprisal's predictions
[Figure: predicted surprisal at geführt in the three conditions, Er hat die Gruppe (auf den (sehr schönen) Berg) geführt]
33. Deriving Konieczny's results
- Seeing more = having more information
- More information = more accurate expectations
- What can come next? NP? PP-goal? PP-loc? Verb? ADVP?
- Once we've seen a PP goal, we're unlikely to see another
- So the expectation of seeing anything else goes up (toy illustration below)
- For p_i(w), used a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA treebank)
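A toy illustration of the sharpening effect; the distribution below is invented, not derived from NEGRA. As a simplification it treats an already-seen category as unable to recur, so its mass is removed and the rest renormalized, which lowers the surprisal of the verb.

import math

# Invented distribution over the next constituent in a clause like "Er hat die Gruppe ...":
p_next = {"PP-goal": 0.35, "PP-loc": 0.15, "NP": 0.10, "ADVP": 0.10, "Verb": 0.30}

def condition_on_seen(dist, seen):
    # Remove the mass of already-seen categories and renormalize the remainder
    remaining = {c: p for c, p in dist.items() if c not in seen}
    total = sum(remaining.values())
    return {c: p / total for c, p in remaining.items()}

after_goal_pp = condition_on_seen(p_next, {"PP-goal"})

print(-math.log2(p_next["Verb"]))         # surprisal of the verb before any PP: ~1.74 bits
print(-math.log2(after_goal_pp["Verb"]))  # surprisal of the verb after the goal PP: ~1.12 bits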
34. Facilitative ambiguity and surprisal
- Review of when ambiguity facilitates processing:

The daughter_i of the colonel_j who shot himself_{*i/j}
The daughter_i of the colonel_j who shot herself_{i/*j}
The son_i of the colonel_j who shot himself_{i/j}

- (Traxler et al. 1998; Van Gompel et al. 2001, 2005)
35. Traditional account: probabilistic serial disambiguation
- Sometimes the reader attaches the RC low...
- ...and everything's OK
- But sometimes the reader attaches the RC high...
- ...and the continuation is anomalous
- So we're seeing garden-pathing some of the time
[Tree diagram: [NP [NP the daughter] [PP [P of] [NP the colonel]]], with the RC "who shot ..." able to attach to either NP]
36. Surprisal as a parallel alternative
- Surprisal marginalizes over possible syntactic structures
- Assume a generative model where the choice between herself and himself is determined only by the antecedent's gender
37. (No transcript)
38. Ambiguity reduces the surprisal
- "daughter ... who shot" can't contribute probability mass to himself
- But "son ... who shot" can (toy sketch below)
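A toy sketch of the marginalization; the attachment probabilities and the deterministic gender model are invented for illustration. Because the surprisal of himself sums over both attachment sites, the ambiguous son sentence receives probability mass from both, and himself comes out less surprising there.

import math

p_attach = {"high": 0.6, "low": 0.4}   # invented: high = first noun, low = "the colonel"

def p_reflexive_given_attachment(word, high_noun):
    # Assumed generative model: the reflexive's form is fixed by the antecedent's gender
    gender = {"high": {"daughter": "f", "son": "m"}[high_noun], "low": "m"}
    return {att: 1.0 if (word == "himself") == (g == "m") else 0.0
            for att, g in gender.items()}

def surprisal(word, high_noun):
    # Marginalize over attachments: P(word) = Σ_att P(att) * P(word | att)
    p_word = p_reflexive_given_attachment(word, high_noun)
    return -math.log2(sum(p_attach[att] * p_word[att] for att in p_attach))

print(surprisal("himself", "daughter"))  # only low attachment contributes: ~1.32 bits
print(surprisal("himself", "son"))       # both attachments contribute: 0 bits in this toy model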
39. Ambiguity/surprisal conclusion
- Cases where ambiguity reduces difficulty aren't problematic for parallel constraint satisfaction
- Although they are problematic for competition
- Attributing difficulty to surprisal rather than competition is a satisfactory revision of constraint-based theories
40. Surprisal and garden paths: theory
- Revisiting "the horse raced past the barn fell"
- After "the horse raced past the barn", assume 2 parses
- Jurafsky 1996 estimated the probability ratio of these parses as 82:1
- The surprisal differential of "fell" in reduced versus unreduced conditions should thus be log2 83 ≈ 6.4 bits (worked out below)
- (assuming independence between RC reduction and the main verb)
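The step from 82:1 to 6.4 bits, spelled out (notation mine): in the reduced condition only the reduced-relative (RR) parse, with 1 part in 83 of the probability mass, supports fell, whereas in the unreduced condition the RR parse carries essentially all of the mass, so

Δsurprisal ≈ -log2 [ P(RR) / (P(RR) + P(MV)) ] = -log2 (1 / 83) = log2 83 ≈ 6.4 bits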
41. Surprisal and garden paths: practice
- An unlexicalized PCFG (from the Brown corpus) gets the right monotonicity of surprisals at the disambiguating word fell
- But there are some unwanted results too
- (this is right, but the difference is small)
42. Surprisal and garden paths
- raced has high surprisal because the grammar is unlexicalized: no connection with horse
- Unfortunately, lexicalization in practice wouldn't help: race as a verb never co-occurs with horse in the Penn Treebank!
- The surprisal differential at fell is small for the same reason
- Failure to account for the lexical preferences of raced means that the probability of the RR alternative is likely overestimated
- Is surprisal a plausible source of explanation for the most dramatic garden-path effects? Still seems unclear.
43. Surprisal summary
- Motivation: expectations affect processing
- When people encounter something unexpected, they are surprised
- This translates into slower reading (processing difficulty?)
- This intuition can be captured and formalized using tools from probability theory, information theory, and statistical NLP
44. Tomorrow
- Other information-theoretic approaches to on-line sentence processing
- Brief look at connectionist approaches to sentence processing
- General discussion; course wrap-up