Day 4: Reranking/Attention shift; surprisal-based sentence processing
1
Day 4: Reranking/Attention shift; surprisal-based
sentence processing
  • Roger Levy
  • University of Edinburgh
  • University of California San Diego

2
Overview for the day
  • Reranking / Attention shift
  • Crash course in information theory
  • Surprisal-based sentence processing

3
Reranking / Attention shift
  • Suppose an input prefix w1…i determines a ranked
    set of incremental structural analyses, call it
    Struct(w1…i)
  • In general, adding a new word wi+1 to the input
    will determine a new ranked set of analyses
    Struct(w1…i+1)
  • A reranking theory attributes processing
    difficulty to some function comparing the two
    sets of structural analyses
  • An attention shift theory is a special case where
    difficulty is predicted only when the
    highest-ranked analysis differs between
    Struct(w1…i) and Struct(w1…i+1) (see the sketch
    below)
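A minimal sketch (mine, not from the slides) of this linking hypothesis:
difficulty is flagged at the new word exactly when the top-ranked analysis
changes. The analysis labels and probabilities below are hypothetical.

    def top_ranked(struct):
        # struct maps each analysis to its conditional probability P(T | w1..i)
        return max(struct, key=struct.get)

    def attention_shift_difficulty(struct_before, struct_after):
        # difficulty is predicted iff the highest-ranked analysis changes
        return top_ranked(struct_before) != top_ranked(struct_after)

    # hypothetical numbers for "the warehouse fires" followed by "many"
    before = {"fires = noun (compound)": 0.6, "fires = verb": 0.4}
    after  = {"fires = noun (compound)": 0.2, "fires = verb": 0.8}
    print(attention_shift_difficulty(before, after))  # True, so difficulty is predicted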

4
Conceptual issues
  • Granularity: what precisely is specified in an
    incremental structural analysis?
  • Ranking metric: how are analyses ranked?
  • e.g. in terms of conditional probabilities
    P(T | w1…i)
  • Degree of parallelism: how many (and which)
    analyses are retained in Struct(w1…i)?

5
Crocker & Brants 2000
  • brainstorming session

6
Attention shift: an example
  • Parallel comprehension: two or more analyses
    entertained simultaneously
  • Disambiguation comes at the following context,
    "many workers"
  • There is an extra cost paid (reading is slower)
    at the disambiguating context
  • Eye-tracking (Frazier and Rayner 1987)
  • Self-paced reading (MacDonald 1993)

The warehouse fires many workers each spring
7
Pruning isn't enough
  • Jurafsky analyzed the NN/NV ambiguity for
    "warehouse fires" and concluded no pruning could
    happen

[Slide figure: parse probability ratios of 267:1 and 3.8:1]
8
Idea of attention shift
  • Suppose that a change in the top-ranked candidate
    induces empirically-observed difficulty
  • Not the same as serial parsing, which doesn't
    even entertain alternate parses unless the
    current parse breaks down
  • Why would this happen?
  • People could be gathering more information about
    the preferred parse, and need extra time to do
    this when the preferred parse changes
  • People could simply be surprised, and this could
    interrupt normal reading processes

9
Crocker & Brants 2000
  • Adopt an attention-shift linking hypothesis
  • (page 660; unfortunately not stated very
    explicitly)
  • Architectural aspects of their system:
  • Bottom-up, incremental parsing architecture
  • Some pruning at every layer from bottom on up
  • No lexicalization in the grammar
  • Skip other details

10
N/V ambiguity under attention shift
  • Crocker & Brants 2000: the relative strength of
    each interpretation changes from word to word

11
N/V attention shift: which probabilities?
  • This analysis relies on lexical and syntactic
    probabilities
  • P(fires | NN) is higher than P(fires | VBZ)
  • P(NP → Det NN NN) is low, and putting "many"
    after a subject NP is low-probability
  • Is this a satisfactory analysis? (cf. day 1!)
  • MacDonald 1993 found no disambiguating-context
    difficulty when the noun ("corporation") doesn't
    support the noun-compound analysis
  • These are, at the least, bilexical affinities

The corporation fires many workers each spring
12
Results from MacDonald 1993
  • Difficulty only with "warehouse fires", not
    "corporation fires"
  • Observed difficulty is delayed a bit (spillover)

[Slide figure: relative difficulty in the ambiguous case]
13
How to estimate parse probs
  • In an attention-shift model, conditional
    probabilities are of primary interest
  • "warehouse fires" vs. "corporation fires" creates
    a practical problem
  • The model should include P(fires | warehouse, NN/NV)
    and P(fires | corporation, NN/NV)
  • But no parsed corpus even contains "fires" in the
    same sentence with either of these words
  • What do we do here?

14
How to estimate parse probs (2)
  • MacDonald 1993's approach: collect relevant
    quantitative norm data and correlate with RTs
  • "warehouse": head vs. modifying noun frequency
  • corresponds to P(NN | warehouse)
  • "fires": noun/verb ambiguous word usage
  • corresponds (indirectly) to P(fires | NN)
  • "warehouse fires": modifier-head cooccurrence rate
  • corresponds to P(fires | warehouse, NN)
  • "warehouse fires": plausibility ratings as NV vs.
    as NN
  • how plausible is it to have a fire in a
    warehouse?
  • how plausible is it to have a warehouse fire
    someone?

15
How to estimate parse probs (omit)
  • We can use MacDonald's head vs. modifying noun
    frequencies, plus cooccurrence frequencies, plus
    bigram and unigram frequencies, to determine
    P(NN) in each case

[Slide table: corpus estimate vs. MacDonald's estimates of
P(NN); the values shown include 0.46 and 0.028]
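One way the norm counts and corpus frequencies could be combined (a chain-rule
sketch of my own, not necessarily the computation behind the slide's numbers):

    P(\mathrm{NN} \mid \textit{warehouse}, \textit{fires})
      = \frac{P(\mathrm{NN} \mid \textit{warehouse})\, P(\textit{fires} \mid \textit{warehouse}, \mathrm{NN})}
             {P(\textit{fires} \mid \textit{warehouse})},
    \qquad
    P(\textit{fires} \mid \textit{warehouse}) \approx
      \frac{C(\textit{warehouse fires})}{C(\textit{warehouse})}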
16
How to estimate parse probs (3)
  • In the era of gigantic corpora (e.g., the Web),
    another approach: the counting method
  • To estimate P(NN | the warehouse fires), simply
    collect a sample of "the warehouse fires" and
    count how many of them are NN usages
  • Many pitfalls!
  • often can't hold external sentence context
    constant
  • vulnerable to undisclosed workings of search
    engines
  • hand-filtering the results is imperative
  • assumes human probability estimates will match
    corpus frequencies
  • BUT it gives access to huge amounts of data!

17
How to estimate parse probs (3)
  • Crude method: we'll use a corpus search (Google)
    to estimate P(NN | warehouse, fires)
  • 21 instances of "warehouse fires" found (excluding
    psycholinguistics hits!); all were NN
  • two of these were potentially NV contexts
  • At least some evidence that P(NN | warehouse, fires)
    is above 0.5
  • Supports the attention-shift analysis

I heard an interview on NPR of a Vieux Carre
(French Quarter) native who explained how the
warehouse fires started...
Not all the warehouse fires were so devastating,
...
18
Attention shift in MV/RR ambiguity?
  • McRae et al. 1998 also has an attention-shift
    interpretation (pursued by Narayanan & Jurafsky
    2002)

[Slide figure: shift to RR for good patients; shift to RR
for good agents; "the crook/cop"]
19
Reranking/Attention shift summary
  • Reranking attributes difficulty to changes in the
    ranking over interpretations caused by a given
    word
  • Attention shift is a special form in which
    changes in the highest-ranked candidate matter

20
Overview for the day
  • Reranking / Attention shift
  • Tiny introduction to information theory
  • Surprisal-based sentence processing

21
Tiny intro to information theory
  • The Shannon information content, or surprisal, of an
    event x is h(x) = -log2 P(x)
  • Example: a bent coin with P(heads) = 0.4 has
    h(heads) ≈ 1.32 bits
  • A loaded die with P(1) = 0.4 also has h(1) ≈ 1.32 bits

(h(x) is sometimes called the entropy of event x)
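The arithmetic behind the examples (standard computation; the slide's own
working was shown in images):

    h(\text{heads}) = -\log_2 0.4 = \log_2 2.5 \approx 1.32\ \text{bits}

The loaded die's h(1) comes out the same because only P(1) = 0.4 matters for
the surprisal of that single outcome.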
22
Tiny intro to information theory (2)
  • The entropy of a discrete probability
    distribution is the expected value of its Shannon
    information content
  • Example: the entropy of a fair coin is 1 bit
  • Our bent P(heads) = 0.4 coin has entropy less than
    1 bit
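Worked out (standard definitions, filling in the values the slide computed in
images):

    H(X) = \sum_x P(x) \log_2 \frac{1}{P(x)}

    H(\text{fair coin}) = \tfrac{1}{2}\log_2 2 + \tfrac{1}{2}\log_2 2 = 1\ \text{bit}
    H(\text{bent coin}) = 0.4\log_2\tfrac{1}{0.4} + 0.6\log_2\tfrac{1}{0.6} \approx 0.97\ \text{bits}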

23
(No Transcript)
24
Tiny intro to information theory (3)
  • Our loaded die with P(1) = 0.4 doesn't have its
    entropy completely determined yet. Two examples:
  • A fair die has entropy of log2 6 ≈ 2.58 bits
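The two examples themselves were shown as images; here is one illustrative
pair of my own, both with P(1) = 0.4, showing that P(1) alone does not fix the
entropy:

    remaining 0.6 spread uniformly over the other five faces (0.12 each):
      H = 0.4\log_2\tfrac{1}{0.4} + 5 \times 0.12\log_2\tfrac{1}{0.12} \approx 2.36\ \text{bits}
    remaining 0.6 all on one other face:
      H = 0.4\log_2\tfrac{1}{0.4} + 0.6\log_2\tfrac{1}{0.6} \approx 0.97\ \text{bits}

Both are below the fair die's 2.58 bits, and they differ from each other.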

25
Overview for the day
  • Reranking / Attention shift
  • Crash course in information theory
  • Surprisal-based sentence processing

26
Hale 2001, Levy 2005: surprisal
  • Let the difficulty of a word be its surprisal
    given its context
  • Captures the expectation intuition: the more we
    expect an event, the easier it is to process
  • Many probabilistic formalisms, including PCFGs
    (Jelinek & Lafferty 1991, Stolcke 1995), can give
    us word surprisals
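As a formula (a standard statement of the linking hypothesis; notation mine):

    \text{difficulty}(w_i) \;\propto\; \text{surprisal}(w_i) \;=\; -\log_2 P(w_i \mid w_1 \ldots w_{i-1})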

27
Intuitions for surprisal PCFGs
  • Consider the following PCFG
  • Calculate the surprisal at "destroyed" in these
    sentences

P(S → NP VP) = 1.0
P(NP → DT N) = 0.4
P(NP → DT N N) = 0.3
P(NP → DT Adj N) = 0.3
P(N → warehouse) = 0.03
P(N → fires) = 0.02
P(DT → the) = 0.3
P(VP → V) = 0.3
P(VP → V NP) = 0.4
P(VP → V PP) = 0.1
P(V → fires) = 0.05
P(V → destroyed) = 0.04
the warehouse fires destroyed the neighborhood.
the fires destroyed the neighborhood.
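A back-of-the-envelope sketch (mine, not the slide's calculation) of how the
comparison comes out under this toy grammar. It enumerates only the analyses
the grammar can actually continue, ignores that PP, Adj, and "neighborhood"
have no rules, and sums the VP expansions beginning with V
(0.3 + 0.4 + 0.1 = 0.8), so the absolute numbers are only indicative; the
relative ordering is the point.

    from math import log2

    def surprisal(p_prefix_plus_word, p_prefix):
        # surprisal of the new word = -log2 of its conditional probability given the prefix
        return -log2(p_prefix_plus_word / p_prefix)

    # prefix "the fires": fires is either the head noun (NP -> DT N)
    # or the first noun of a compound (NP -> DT N N)
    p_the_fires = 0.3 * (0.4 * 0.02 + 0.3 * 0.02)                          # = 0.0042
    # "destroyed" only continues the DT N analysis, as a V starting some VP
    p_the_fires_destroyed = 0.3 * 0.4 * 0.02 * 0.8 * 0.04                  # = 7.68e-05

    # prefix "the warehouse fires": noun compound (DT N N),
    # or subject "the warehouse" (DT N) with fires as the verb
    p_wh_fires = 0.3 * 0.3 * 0.03 * 0.02 + 0.3 * 0.4 * 0.03 * 0.8 * 0.05   # = 1.98e-04
    # "destroyed" only continues the noun-compound analysis
    p_wh_fires_destroyed = 0.3 * 0.3 * 0.03 * 0.02 * 0.8 * 0.04            # = 1.728e-06

    print(surprisal(p_the_fires_destroyed, p_the_fires))   # about 5.8 bits
    print(surprisal(p_wh_fires_destroyed, p_wh_fires))     # about 6.8 bits

Surprisal at "destroyed" comes out higher after "the warehouse fires" because
part of that prefix's probability mass went to the fires-as-verb analysis,
which "destroyed" cannot continue.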
28
Connection with reranking models
  • Levy 2005 shows that surprisal is a special form
    of reranking model
  • In particular, if reranking cost is taken as the
    KL divergence between the old and new parse
    distributions
  • then reranking cost turns out equivalent to the
    surprisal of the new word wi+1
  • Thus representation neutrality is an interesting
    consequence of the surprisal theory

(KL divergence: a measure of the penalty incurred by encoding
one probability distribution with another)
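A compressed version of the argument (my paraphrase of Levy's derivation; it
assumes that every structure T consistent with w1…i+1 deterministically
contains wi+1, so that P(T | w1…i+1) = P(T | w1…i) / P(wi+1 | w1…i)):

    D_{KL}\bigl(P(T \mid w_{1\ldots i+1}) \,\|\, P(T \mid w_{1\ldots i})\bigr)
      = \sum_T P(T \mid w_{1\ldots i+1}) \log_2 \frac{P(T \mid w_{1\ldots i+1})}{P(T \mid w_{1\ldots i})}
      = \sum_T P(T \mid w_{1\ldots i+1}) \log_2 \frac{1}{P(w_{i+1} \mid w_{1\ldots i})}
      = -\log_2 P(w_{i+1} \mid w_{1\ldots i})

which is exactly the surprisal of wi+1, regardless of how the structures T are
represented.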
29
Levy 2006: syntactically constrained contexts
  • In many cases, you know that you have to
    encounter a particular category C
  • But you don't know when you'll encounter it, or
    which member of C will actually appear
  • Call these syntactically constrained contexts
  • In these contexts, the more information related
    to C you obtain, the sharper your expectations
    about C generally turn out to be
  • Interesting contrast to some non-probabilistic
    theories that say holding onto the related
    information is hard

30
Constrained contexts final verbs
  • Konieczny 2000 looked at reading times at German
    final verbs

Er hat die Gruppe geführt
  He has the group led
  'He led the group'
Er hat die Gruppe auf den Berg geführt
  He has the group to the mountain led
  'He led the group to the mountain'
Er hat die Gruppe auf den SEHR SCHÖNEN Berg geführt
  He has the group to the VERY BEAUTIFUL mtn. led
  'He led the group to the very beautiful mountain'
31
Locality predictions and empirical results
  • Locality-based models (Gibson 1998) predict
    difficulty for longer clauses
  • But Konieczny found that final verbs were read
    faster in longer clauses

Er hat die Gruppe geführt
He led the group
Er hat die Gruppe auf den Berg geführt
He led the group to the mountain
...die Gruppe auf den sehr schönen Berg geführt
He led the group to the very beautiful mountain
32
Surprisal's predictions
Er hat die Gruppe (auf den (sehr schönen) Berg) geführt
[Slide: the same sentence repeated for the three bracketed
length conditions, with predictions shown at geführt]
33
Deriving Konieczny's results
  • Seeing more means having more information
  • More information means more accurate expectations

NP? PP-goal? PP-loc? Verb? ADVP?
  • Once we've seen a PP goal, we're unlikely to see
    another
  • So the expectation of seeing anything else goes
    up
  • For p_i(w), a PCFG derived empirically from a
    syntactically annotated corpus of German (the
    NEGRA treebank) was used
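A purely illustrative instance of this expectation-sharpening (numbers
invented, not from the NEGRA-based model): before the PP-goal the final verb
competes with PP-goal, PP-loc, and ADVP continuations, say
P(verb next) = 0.4; once a PP-goal has been seen, a second one is unlikely and
mass shifts toward the verb, say P(verb next) = 0.8, so

    -\log_2 0.4 \approx 1.3\ \text{bits} \quad\longrightarrow\quad -\log_2 0.8 \approx 0.3\ \text{bits}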

34
Facilitative ambiguity and surprisal
  • Review of when ambiguity facilitates processing

The daughter_i of the colonel_j who shot himself_i/j
The daughter_i of the colonel_j who shot herself_i/j
The son_i of the colonel_j who shot himself_i/j
  • (Traxler et al. 1998; Van Gompel et al. 2001,
    2005)

35
Traditional account: probabilistic serial
disambiguation
  • Sometimes the reader attaches the RC low...
  • and everything's OK
  • But sometimes the reader attaches the RC high
  • and the continuation is anomalous
  • So we're seeing garden-pathing some of the time

[Slide tree for "the daughter of the colonel who shot ...":
an NP dominating [NP the daughter], [PP [P of] [NP the colonel]],
and the RC [who shot ...]]
36
Surprisal as a parallel alternative
  • Surprisal marginalizes over possible syntactic
    structures
  • assume a generative model where the choice between
    herself and himself is determined only by the
    antecedent's gender

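A sketch of the marginalization (notation mine): the probability of the next
word sums over all structures consistent with the context,

    P(\textit{himself} \mid w_{1\ldots i}) = \sum_T P(T \mid w_{1\ldots i})\, P(\textit{himself} \mid T)

With "son ... of the colonel", both the high- and the low-attachment
structures are compatible with himself, so both terms contribute probability
mass; with "daughter", only the low-attachment structure does, so the
surprisal of himself is higher.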
37
(No Transcript)
38
Ambiguity reduces the surprisal
"daughter ... who shot" can't contribute probability
mass to "himself"
But "son ... who shot" can
39
Ambiguity/surprisal conclusion
  • Cases where ambiguity reduces difficulty aren't
    problematic for parallel constraint satisfaction
  • Although they are problematic for competition
  • Attributing difficulty to surprisal rather than
    competition is a satisfactory revision of
    constraint-based theories

40
Surprisal and garden paths: theory
  • Revisiting "the horse raced past the barn fell"
  • After "the horse raced past the barn", assume 2
    parses
  • Jurafsky 1996 estimated the probability ratio of
    these parses as 82:1
  • The surprisal differential of "fell" in reduced
    versus unreduced conditions should thus be
    log2 83 ≈ 6.4 bits

(assuming independence between RC reduction and
main verb)
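The arithmetic behind the 6.4-bit figure (my reconstruction of the step the
slide compresses): with an 82:1 ratio, the reduced-relative parse has
probability 1/83 after the ambiguous prefix, and under the independence
assumption only that parse contributes probability mass to "fell", so

    \Delta\text{surprisal}(\textit{fell}) \approx -\log_2 \tfrac{1}{83} = \log_2 83 \approx 6.4\ \text{bits}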
41
Surprisal and garden paths: practice
  • An unlexicalized PCFG (from the Brown corpus) gets
    the right monotonicity of surprisals at the
    disambiguating word "fell"
  • But there are some unwanted results too

[Figure annotation: this is right, but the difference is small]
42
Surprisal and garden paths
  • "raced" has high surprisal because the grammar is
    unlexicalized: no connection with "horse"
  • Unfortunately, lexicalization in practice
    wouldn't help: "race" as a verb never co-occurs
    with "horse" in the Penn Treebank!
  • the surprisal differential at "fell" is small for
    the same reason
  • failure to account for the lexical preferences of
    "raced" means that the probability of the RR
    alternative is likely overestimated
  • Is surprisal a plausible source of explanation
    for the most dramatic garden-path effects? Still
    seems unclear.

43
Surprisal summary
  • Motivation: expectations affect processing
  • When people encounter something unexpected, they
    are surprised
  • Translates into slower reading (processing
    difficulty?)
  • This intuition can be captured and formalized
    using tools from probability theory, information
    theory, and statistical NLP

44
Tomorrow
  • Other information-theoretic approaches to on-line
    sentence processing
  • Brief look at connectionist approaches to
    sentence processing
  • General discussion; course wrap-up