Title: Day 4: Reranking/Attention shift; surprisal-based sentence processing
1. Day 4: Reranking/Attention shift; surprisal-based sentence processing
- Roger Levy
- University of Edinburgh
- University of California San Diego
2. Overview for the day
- Reranking / attention shift
- Crash course in information theory
- Surprisal-based sentence processing
3. Reranking / attention shift
- Suppose an input prefix w_1..i determines a ranked set of incremental structural analyses, call it Struct(w_1..i)
- In general, adding a new word w_i+1 to the input will determine a new ranked set of analyses Struct(w_1..i+1)
- A reranking theory attributes processing difficulty to some function comparing the two sets of structural analyses
- An attention shift theory is a special case where difficulty is predicted only when the highest-ranked analysis differs between Struct(w_1..i) and Struct(w_1..i+1) (sketched in code below)
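A minimal sketch of the two linking hypotheses, assuming analyses are simply ranked by their conditional probabilities P(T | w_1..i); the analysis labels, the numbers, and the function names are invented for illustration and are not from any of the cited papers.

def top_ranked(struct):
    # Highest-ranked analysis in Struct(w_1..i), ranking by conditional probability
    return max(struct, key=struct.get)

def attention_shift_difficulty(struct_old, struct_new):
    # Attention shift: difficulty is predicted only when the top-ranked analysis changes
    return top_ranked(struct_old) != top_ranked(struct_new)

def reranking_difficulty(struct_old, struct_new, compare):
    # Generic reranking: difficulty is some function comparing the old and new ranked sets
    return compare(struct_old, struct_new)

# Invented numbers for "The warehouse fires many workers ...":
struct_after_fires = {"fires = noun (compound)": 0.7, "fires = verb": 0.3}
struct_after_many  = {"fires = noun (compound)": 0.2, "fires = verb": 0.8}
print(attention_shift_difficulty(struct_after_fires, struct_after_many))  # True: shift at "many"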
4. Conceptual issues
- Granularity: what precisely is specified in an incremental structural analysis?
- Ranking metric: how are analyses ranked?
- e.g. in terms of conditional probabilities P(T | w_1..i)
- Degree of parallelism: how many (and which) analyses are retained in Struct(w_1..i)?
5. Crocker & Brants 2000
6. Attention shift: an example
- Parallel comprehension: two or more analyses entertained simultaneously
- Disambiguation comes at the following context, "many workers"
- There is an extra cost paid (reading is slower) at the disambiguating context
- Eye-tracking (Frazier and Rayner 1987)
- Self-paced reading (MacDonald 1993)
The warehouse fires many workers each spring
7. Pruning isn't enough
- Jurafsky analyzed the NN/NV ambiguity for "warehouse fires" and concluded no pruning could happen
[Figure: estimated probability ratios between the competing analyses, 267:1 and 3.8:1]
8. Idea of attention shift
- Suppose that a change in the top-ranked candidate induces empirically-observed difficulty
- Not the same as serial parsing, which doesn't even entertain alternate parses unless the current parse breaks down
- Why would this happen?
- People could be gathering more information about the preferred parse, and need extra time to do this when the preferred parse changes
- People could simply be surprised, and this could interrupt normal reading processes
9. Crocker & Brants 2000
- Adopt an attention-shift linking hypothesis
- (p. 660; unfortunately not stated very explicitly)
- Architectural aspects of their system:
- Bottom-up, incremental parsing architecture
- Some pruning at every layer from the bottom on up
- No lexicalization in the grammar
- Skip other details
10. N/V ambiguity under attention shift
- Crocker & Brants 2000: the relative strength of each interpretation changes from word to word
11. N/V attention shift: which probabilities?
- This analysis relies on lexical syntactic probabilities
- P(fires | NN) is higher than P(fires | VBZ)
- P(NP → Det NN NN) is low, and putting "many" after a subject NP is low-probability
- Is this a satisfactory analysis? (cf. Day 1!)
- MacDonald 1993 found no disambiguating-context difficulty when the noun ("corporation") doesn't support the noun-compound analysis
- These are, at the least, bilexical affinities
The corporation fires many workers each spring
12. Results from MacDonald 1993
- Difficulty only with "warehouse fires", not "corporation fires"
- Observed difficulty delayed a bit (spillover)
- Relative difficulty in the ambiguous case
13. How to estimate parse probs
- In an attention-shift model, conditional probabilities are of primary interest
- "warehouse fires" vs. "corporation fires" creates a practical problem
- The model should include P(fires | warehouse, NN vs. NV) and P(fires | corporation, NN vs. NV)
- But no parsed corpus even contains "fires" in the same sentence with either of these words
- What do we do here?
14. How to estimate parse probs (2)
- MacDonald 1993's approach: collect relevant quantitative norm data and correlate with RTs
- "warehouse" head vs. modifying noun frequency
- corresponds to P(NN | warehouse)
- "fires" noun/verb ambiguous word usage
- corresponds (indirectly) to P(fires | NN)
- "warehouse fires" modifier-head cooccurrence rate
- corresponds to P(fires | warehouse, NN)
- "warehouse fires" plausibility ratings as NV vs. as NN
- how plausible is it to have a fire in a warehouse?
- how plausible is it to have a warehouse fire someone?
15. How to estimate parse probs (omit)
- We can use MacDonald's head vs. modifying noun frequencies, plus the cooccurrence frequency, plus bigram and unigram frequencies, to determine P(NN) in each case
[Table: corpus estimates vs. MacDonald's estimates of P(NN); the values shown include 0.46 and 0.028]
16. How to estimate parse probs (3)
- In the era of gigantic corpora (e.g., the Web), another approach: the counting method (sketched below)
- To estimate P(NN | the warehouse fires), simply collect a sample of "the warehouse fires" and count how many of them are NN usages
- Many pitfalls!
- often can't hold external sentence context constant
- vulnerable to undisclosed workings of search engines
- hand-filtering the results is imperative
- assumes human probability estimates will match corpus frequencies
- BUT it gives access to huge data!
17. How to estimate parse probs (3)
- Crude method: we'll use a corpus search (Google) to estimate P(NN | warehouse, fires)
- 21 instances of "warehouse fires" found (excluding psycholinguistics hits!); all were NN
- two of these were potentially NV contexts
- At least some evidence that P(NN | warehouse, fires) is above 0.5
- Supports the attention-shift analysis
I heard an interview on NPR of a Vieux Carre (French Quarter) native who explained how the warehouse fires started...
Not all the warehouse fires were so devastating, ...
18. Attention shift in MV/RR ambiguity?
- McRae et al. 1998 also has an attention-shift interpretation (pursued by Narayanan & Jurafsky 2002)
[Figure: point at which the analysis shifts to the RR reading, for good patients ("the crook") vs. good agents ("the cop")]
19. Reranking/Attention shift summary
- Reranking attributes difficulty to changes in the ranking over interpretations caused by a given word
- Attention shift is a special form in which only changes in the highest-ranked candidate matter
20. Overview for the day
- Reranking / attention shift
- Tiny introduction to information theory
- Surprisal-based sentence processing
21. Tiny intro to information theory
- The Shannon information content, or surprisal, of an event x: h(x) = log2 (1 / P(x)) = -log2 P(x)
- Example: a bent coin with P(heads) = 0.4 has h(heads) = log2 (1 / 0.4) ≈ 1.32 bits
- A loaded die with P(1) = 0.4 also has h(1) ≈ 1.32 bits
- (sometimes called the entropy of event x)
22. Tiny intro to information theory (2)
- The entropy of a discrete probability distribution is the expected value of its Shannon information content: H = Σ_x P(x) log2 (1 / P(x))
- Example: the entropy of a fair coin is 1 bit
- Our bent P(heads) = 0.4 coin has entropy less than 1 bit (≈ 0.97 bits; see the short code sketch below)
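A short Python sketch (not from the original slides) that reproduces the numbers above:

import math

def surprisal(p):
    # Shannon information content of an event with probability p, in bits
    return -math.log2(p)

def entropy(dist):
    # Entropy of a discrete distribution (a list of probabilities), in bits
    return sum(p * surprisal(p) for p in dist if p > 0)

print(surprisal(0.4))        # ~1.32 bits: heads on the bent coin, or 1 on the loaded die
print(entropy([0.5, 0.5]))   # 1.0 bit: fair coin
print(entropy([0.4, 0.6]))   # ~0.97 bits: bent coin
print(entropy([1/6] * 6))    # ~2.58 bits: fair die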
23. (No transcript)
24. Tiny intro to information theory (3)
- Our loaded die with P(1) = 0.4 doesn't have its entropy completely determined yet. Two examples (see the sketch below):
- A fair die has entropy of 2.58 bits
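The two example distributions themselves are not recoverable from the transcript; here is a sketch with two made-up distributions, both with P(1) = 0.4 but with different entropies:

import math

def entropy(dist):
    return sum(-p * math.log2(p) for p in dist if p > 0)

spread_out   = [0.4] + [0.12] * 5        # remaining mass spread evenly over faces 2-6
concentrated = [0.4, 0.6, 0, 0, 0, 0]    # remaining mass all on one face

print(entropy(spread_out))    # ~2.36 bits
print(entropy(concentrated))  # ~0.97 bits
print(entropy([1/6] * 6))     # ~2.58 bits: the fair die, for comparison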
25. Overview for the day
- Reranking / attention shift
- Crash course in information theory
- Surprisal-based sentence processing
26. Hale 2001, Levy 2005: surprisal
- Let the difficulty of a word be its surprisal given its context: difficulty(w_i) ∝ -log2 P(w_i | w_1..i-1)
- Captures the expectation intuition: the more we expect an event, the easier it is to process
- Many probabilistic formalisms, including PCFGs (Jelinek & Lafferty 1991, Stolcke 1995), can give us word surprisals
27. Intuitions for surprisal: PCFGs
- Consider the following PCFG
- Calculate the surprisal at "destroyed" in these sentences (a brute-force sketch follows below)
P(S → NP VP) = 1.0       P(NP → DT N) = 0.4
P(NP → DT N N) = 0.3     P(NP → DT Adj N) = 0.3
P(N → warehouse) = 0.03  P(N → fires) = 0.02
P(DT → the) = 0.3        P(VP → V) = 0.3
P(VP → V NP) = 0.4       P(VP → V PP) = 0.1
P(V → fires) = 0.05      P(V → destroyed) = 0.04
the warehouse fires destroyed the neighborhood.
the fires destroyed the neighborhood.
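A brute-force sketch of the calculation (mine, not from the slides). It assumes that only the rules listed above exist, i.e. that unlisted expansions (PP, Adj, the rest of the lexicon) have probability zero. Under that assumption, destroyed comes out far more surprising after "the warehouse fires" (about 6.4 bits) than after "the fires" (about 1.2 bits), because the verb reading of fires siphons off much of the prefix probability in the first sentence.

import math

# Toy PCFG from the slide. Assumption: only the listed rules exist, so
# unlisted expansions (PP, Adj, the rest of the lexicon) get probability 0.
RULES = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "N"), 0.4), (("DT", "N", "N"), 0.3), (("DT", "Adj", "N"), 0.3)],
    "VP": [(("V",), 0.3), (("V", "NP"), 0.4), (("V", "PP"), 0.1)],
    "N":  [(("warehouse",), 0.03), (("fires",), 0.02)],
    "DT": [(("the",), 0.3)],
    "V":  [(("fires",), 0.05), (("destroyed",), 0.04)],
}
NONTERMINALS = {"S", "NP", "VP", "PP", "N", "DT", "V", "Adj"}

def derivations(symbols):
    # Yield (words, probability) for every complete derivation of the
    # given symbol sequence using only the rules above.
    if not symbols:
        yield [], 1.0
        return
    first, rest = symbols[0], list(symbols[1:])
    if first not in NONTERMINALS:               # terminal word
        for words, p in derivations(rest):
            yield [first] + words, p
        return
    for rhs, rule_p in RULES.get(first, []):    # no listed rules -> no derivations
        for left, p_left in derivations(list(rhs)):
            for right, p_right in derivations(rest):
                yield left + right, rule_p * p_left * p_right

def prefix_prob(prefix):
    # P(the sentence begins with `prefix`), summing over complete parses
    return sum(p for words, p in derivations(["S"]) if words[:len(prefix)] == prefix)

def surprisal(prefix, word):
    # Surprisal of `word` given the preceding words, in bits
    return -math.log2(prefix_prob(prefix + [word]) / prefix_prob(prefix))

print(surprisal("the warehouse fires".split(), "destroyed"))  # ~6.4 bits
print(surprisal("the fires".split(), "destroyed"))            # ~1.2 bits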
28. Connection with reranking models
- Levy 2005 shows that surprisal is a special form of reranking model
- In particular, if reranking cost is taken as the KL divergence between the old and new parse distributions,
- then reranking cost turns out to be equivalent to the surprisal of the new word w_i+1 (derivation sketched below)
- Thus representation neutrality is an interesting consequence of the surprisal theory
- (KL divergence: a measure of the penalty incurred by encoding one probability distribution with another)
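A sketch of the argument (notation mine): for any structure T consistent with the extended prefix, conditioning on the new word just rescales its probability by a constant factor, so the KL divergence collapses to that constant:

D_KL( P(T | w_1..i+1) || P(T | w_1..i) )
  = Σ_T P(T | w_1..i+1) log2 [ P(T | w_1..i+1) / P(T | w_1..i) ]
  = Σ_T P(T | w_1..i+1) log2 [ P(w_1..i) / P(w_1..i+1) ]      (since P(T | w_1..j) = P(T) / P(w_1..j) for consistent T)
  = -log2 P(w_i+1 | w_1..i)

Structures inconsistent with the new word have probability zero in the new distribution and contribute nothing to the sum.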
29. Levy 2006: syntactically constrained contexts
- In many cases, you know that you have to encounter a particular category C
- But you don't know when you'll encounter it, or which member of C will actually appear
- Call these syntactically constrained contexts
- In these contexts, the more information related to C you obtain, the sharper your expectations about C generally turn out to be
- Interesting contrast to some non-probabilistic theories, which say that holding onto the related information is hard
30. Constrained contexts: final verbs
- Konieczny 2000 looked at reading times at German final verbs

Er hat die Gruppe geführt
He has the group led
"He led the group"

Er hat die Gruppe auf den Berg geführt
He has the group to the mountain led
"He led the group to the mountain"

Er hat die Gruppe auf den SEHR SCHÖNEN Berg geführt
He has the group to the VERY BEAUTIFUL mtn. led
"He led the group to the very beautiful mountain"
31. Locality predictions and empirical results
- Locality-based models (Gibson 1998) predict difficulty at the final verb for longer clauses
- But Konieczny found that final verbs were read faster in longer clauses

Er hat die Gruppe geführt
He led the group

Er hat die Gruppe auf den Berg geführt
He led the group to the mountain

...die Gruppe auf den sehr schönen Berg geführt
He led the group to the very beautiful mountain
32. Surprisal's predictions
[Figure: predicted surprisal at geführt in the three conditions, Er hat die Gruppe (auf den (sehr schönen) Berg) geführt]
33. Deriving Konieczny's results
- Seeing more = having more information
- More information = more accurate expectations
- What can come next? NP? PP-goal? PP-loc? Verb? ADVP?
- Once we've seen a PP goal, we're unlikely to see another
- So the expectation of seeing anything else goes up (toy illustration below)
- For p_i(w), used a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA treebank)
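A toy illustration of the sharpening effect; the distribution below is invented, not derived from NEGRA. As a simplification it treats an already-seen category as unable to recur, so its mass is removed and the rest renormalized, which lowers the surprisal of the verb.

import math

# Invented distribution over the next constituent in a clause like "Er hat die Gruppe ...":
p_next = {"PP-goal": 0.35, "PP-loc": 0.15, "NP": 0.10, "ADVP": 0.10, "Verb": 0.30}

def condition_on_seen(dist, seen):
    # Remove the mass of already-seen categories and renormalize the remainder
    remaining = {c: p for c, p in dist.items() if c not in seen}
    total = sum(remaining.values())
    return {c: p / total for c, p in remaining.items()}

after_goal_pp = condition_on_seen(p_next, {"PP-goal"})

print(-math.log2(p_next["Verb"]))         # surprisal of the verb before any PP: ~1.74 bits
print(-math.log2(after_goal_pp["Verb"]))  # surprisal of the verb after the goal PP: ~1.12 bits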
34. Facilitative ambiguity and surprisal
- Review of when ambiguity facilitates processing:

The daughter_i of the colonel_j who shot himself_{*i/j}
The daughter_i of the colonel_j who shot herself_{i/*j}
The son_i of the colonel_j who shot himself_{i/j}

- (Traxler et al. 1998; Van Gompel et al. 2001, 2005)
35. Traditional account: probabilistic serial disambiguation
- Sometimes the reader attaches the RC low...
- ...and everything's OK
- But sometimes the reader attaches the RC high...
- ...and the continuation is anomalous
- So we're seeing garden-pathing some of the time
[Tree diagram: [NP [NP the daughter] [PP [P of] [NP the colonel]]], with the RC "who shot ..." able to attach to either NP]
36. Surprisal as a parallel alternative
- Surprisal marginalizes over possible syntactic structures
- Assume a generative model where the choice between herself and himself is determined only by the antecedent's gender
37. (No transcript)
38. Ambiguity reduces the surprisal
- "daughter ... who shot" can't contribute probability mass to himself
- But "son ... who shot" can (toy sketch below)
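A toy sketch of the marginalization; the attachment probabilities and the deterministic gender model are invented for illustration. Because the surprisal of himself sums over both attachment sites, the ambiguous son sentence receives probability mass from both, and himself comes out less surprising there.

import math

p_attach = {"high": 0.6, "low": 0.4}   # invented: high = first noun, low = "the colonel"

def p_reflexive_given_attachment(word, high_noun):
    # Assumed generative model: the reflexive's form is fixed by the antecedent's gender
    gender = {"high": {"daughter": "f", "son": "m"}[high_noun], "low": "m"}
    return {att: 1.0 if (word == "himself") == (g == "m") else 0.0
            for att, g in gender.items()}

def surprisal(word, high_noun):
    # Marginalize over attachments: P(word) = Σ_att P(att) * P(word | att)
    p_word = p_reflexive_given_attachment(word, high_noun)
    return -math.log2(sum(p_attach[att] * p_word[att] for att in p_attach))

print(surprisal("himself", "daughter"))  # only low attachment contributes: ~1.32 bits
print(surprisal("himself", "son"))       # both attachments contribute: 0 bits in this toy model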
39. Ambiguity/surprisal conclusion
- Cases where ambiguity reduces difficulty aren't problematic for parallel constraint satisfaction
- Although they are problematic for competition
- Attributing difficulty to surprisal rather than competition is a satisfactory revision of constraint-based theories
40. Surprisal and garden paths: theory
- Revisiting "the horse raced past the barn fell"
- After "the horse raced past the barn", assume 2 parses
- Jurafsky 1996 estimated the probability ratio of these parses as 82:1
- The surprisal differential of "fell" in reduced versus unreduced conditions should thus be log2 83 ≈ 6.4 bits (worked out below)
- (assuming independence between RC reduction and the main verb)
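The step from 82:1 to 6.4 bits, spelled out (notation mine): in the reduced condition only the reduced-relative (RR) parse, with 1 part in 83 of the probability mass, supports fell, whereas in the unreduced condition the RR parse carries essentially all of the mass, so

Δsurprisal ≈ -log2 [ P(RR) / (P(RR) + P(MV)) ] = -log2 (1 / 83) = log2 83 ≈ 6.4 bits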
41. Surprisal and garden paths: practice
- An unlexicalized PCFG (from the Brown corpus) gets the right monotonicity of surprisals at the disambiguating word fell
- But there are some unwanted results too
- (this is right, but the difference is small)
42. Surprisal and garden paths
- raced has high surprisal because the grammar is unlexicalized: no connection with horse
- Unfortunately, lexicalization in practice wouldn't help: race as a verb never co-occurs with horse in the Penn Treebank!
- The surprisal differential at fell is small for the same reason
- Failure to account for the lexical preferences of raced means that the probability of the RR alternative is likely overestimated
- Is surprisal a plausible source of explanation for the most dramatic garden-path effects? Still seems unclear.
43. Surprisal summary
- Motivation: expectations affect processing
- When people encounter something unexpected, they are surprised
- This translates into slower reading (processing difficulty?)
- This intuition can be captured and formalized using tools from probability theory, information theory, and statistical NLP
44. Tomorrow
- Other information-theoretic approaches to on-line sentence processing
- Brief look at connectionist approaches to sentence processing
- General discussion; course wrap-up