1
Linguistics Methodology meets Language Reality:
the quest for robustness, scalability, and
portability in (spoken) language applications
  • Bob Carpenter
  • SpeechWorks International

2
The Standard Cliché(s)
  • Moore's Cliché
  • Exponential growth in computing power and memory
    will continue to open up new possibilities
  • The Internet Cliché
  • With the advent and growth of the world-wide web,
    an ever increasing amount of information must be
    managed

3
More Standard Clichés
  • The Convergence Cliché
  • Data, voice, and video networking will be
    integrated over a universal network that
  • includes land lines and wireless
  • includes broadband and narrowband
  • likely implementation is IP (internet protocol)
  • The Interface Cliché
  • The three forces above (growth in computing
    power, information online, and networking) will
    both enable and require new interfaces
  • Speech will become as common as graphics

4
Some Comp Ling Clichés
  • The Standard Linguist's Cliché
  • "But it must be recognized that the notion
    'probability of a sentence' is an entirely
    useless one, under any known interpretation of
    this term."
  • Noam Chomsky, 1969 essay on Quine
  • The Standard Engineer's Cliché
  • "Anytime a linguist leaves the group the
    recognition rate goes up."
  • Fred Jelinek, 1988 address to DARPA

5
The Theoretical Abstraction
  • mature, monolingual, native language speaker
  • idealized to complete knowledge of language
  • static, homogeneous language community
  • all speakers learn identical grammars
  • competence (vs. performance)
  • performance is a natural class
  • wetware implementation follows theory in
    divorcing knowledge of language from processing
  • assumes the existence and innateness of a
    language faculty

6
The Explicit Methodology
  • Empirical basis is binary grammaticality
    judgements
  • intuitive (to a properly trained linguist)
  • innateness and the language faculty
  • appropriate for phonetics through dialogue
  • in practice, very little agreement at boundaries
    and no standard evaluations of theories vs. data
  • Models of particular languages
  • by grammars that generate formal languages
  • low priority for transformationalists
  • high priority for monostratalists/computationalists

7
The Holy Grail of Linguistics
  • A grammar meta-formalism in which
  • all and only natural language grammars (idealized
    as above) can be expressed
  • assumed to correspond to the language faculty
  • The Grail is sought by every major camp of linguists
  • Explains why all major linguistic theories look
    alike from any perspective outside of a
    linguistics department
  • The expedient abstractions have become an end in
    themselves

8
But, Applications Require
  • Robustness
  • acoustic and linguistic variation
  • disfluencies and noise
  • Scalability
  • from embedded devices to palmtops to clients to
    servers
  • across tasks from simple to complex
  • from system-initiative form-filling to
    mixed-initiative dialogue
  • Portability
  • simple adaptation to new tasks and new domains
  • preferably automated as much as possible

9
The $64,000 Question
  • How do humans handle unrestricted language so
    effortlessly in real time?
  • Unfortunately, the classical linguistic
    assumptions and methodology completely ignore
    this issue
  • Psycholinguistics has uncovered some baselines
  • lexicon (and syntax?) highly parallel
  • time course of processing totally online
  • information integration < 200 ms for all sources
  • But is short on explanations

10
(AI) Success by Stupidity
  • Jaime Carbonell's Argument (ECAI, mid 1990s)
  • Apparent intelligence because they're too
    limited to do anything wrong: the right answer
    is hardcoded
  • Typical in Computational NL Grammars
  • lexicon limited to the demo
  • rules limited to common ones (e.g., no heavy shift)
  • Scaling up usually destroys this limited
    "success"
  • 1,000,000s of grammatical readings with large
    grammars

11
My Favorite Experiments (I)
  • Mike Tanenhaus et al. (Univ. Rochester)
  • Head-Mounted Eye Tracking

Pick up the yellow plate
Clearly shows that understanding is online
12
My Favorite Experiments (II)
  • Garden Paths are Context Sensitive
  • Crain & Steedman (U. Connecticut & U. Edinburgh)
  • if the noun is not unique in context,
    postmodification is much more likely than if the
    noun picks out a unique individual
  • Garden Paths are Frequency and Agreement
    Sensitive
  • Tanenhaus et al.
  • The horse raced past the barn fell. ("raced" is
    more likely a past tense)
  • The horses brought into the barn fell. ("brought"
    is more likely a participle, and a less likely
    activity for horses)

13
Stats: Explanation or Stopgap?
  • A Common View
  • Statistics are some kind of approximation of
    underlying factors requiring further explanation.
  • Steve Abney's Analogy (AT&T Labs)
  • Statistical Queueing Theory
  • Consider traffic flows through a toll gate on a
    highway.
  • Underlying factors are diverse, and explain the
    actions of each driver, their cars, possible
    causes of flat tires, drunk drivers, etc.
  • Statistics is more insightful and explanatory in
    this case, as it captures emergent generalizations
  • It is a reductionist error to insist on a
    low-level account

14
Competence vs. Performance
  • What is computed vs. how it is computed
  • The "what" can be traditional grammatical structure
  • Not all structures are computed, regardless of
    the "how"
  • Define the "what" probabilistically, independently
    of the "how"

15
Algebraic vs. Statistical
  • False Dichotomy
  • All statistical systems have an algebraic basis,
    even if trivial
  • The Good News
  • The best statistical systems have the best
    linguistic conditioning (most explanatory in the
    traditional sense)
  • Statistical estimators matter far less than
    the appropriate linguistic conditioning
  • Rest of the talk provides examples of this

16
Bayesian Statistical Modeling
  • Concerned with prior and posterior probabilities
  • Allows updates of reasoning
  • Bayes' Law: P(A,B) = P(A|B) P(B) = P(B|A) P(A)
  • E.g., Source/Channel Model for Speech Recognition
  • Ws = sequence of words
  • As = sequence of acoustic observations
  • Compute ArgMax_Ws P(Ws|As)
  • ArgMax_Ws P(Ws|As)
    = ArgMax_Ws P(As|Ws) P(Ws) / P(As)
    = ArgMax_Ws P(As|Ws) P(Ws)
  • P(As|Ws) = acoustic model; P(Ws) = language
    model
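A minimal Python sketch of that argmax (the candidate strings and probabilities are invented for illustration, not taken from the talk): P(As) is constant across hypotheses, so the decoder only needs the acoustic and language model scores.

# Toy source/channel decoding: choose the word string Ws maximizing
# P(As|Ws) * P(Ws); P(As) is the same for every candidate, so it is dropped.
candidates = {
    # Ws: (acoustic score P(As|Ws), language model score P(Ws))
    "flights from boston today": (0.020, 0.0010),
    "flights from austin today": (0.018, 0.0008),
    "lights for boston to pay":  (0.025, 0.00001),
}

def decode(cands):
    # argmax over P(As|Ws) * P(Ws), proportional to the posterior P(Ws|As)
    return max(cands, key=lambda ws: cands[ws][0] * cands[ws][1])

print(decode(candidates))   # -> flights from boston today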

17
Simple Bayesian Update Example
  • Monty Hall's Let's Make a Deal
  • Three curtains with prize behind one, no other
    info
  • Contestant chooses one of three
  • Monty then opens curtain of one of others that
    does not have the prize
  • e.g., if you choose curtain 2, then one of
    curtains 1 or 3 must not contain the prize
  • Monty then lets you either keep your first guess,
    or change to the remaining curtain he didn't
    open.
  • Should you switch, stay, or doesn't it matter?

18
Answer
  • Yes! You should switch.
  • Why? Consider the possibilities: your first guess
    is right with probability 1/3, so the remaining
    unopened curtain hides the prize with probability
    2/3.
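A quick Monte Carlo check of that claim (a sketch added for illustration, not part of the original slides): simulate the game many times under each strategy.

import random

# Simulate Let's Make a Deal: switching should win about 2/3 of the time,
# staying only about 1/3.
def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        pick = random.randrange(3)
        # Monty opens a curtain that is neither the pick nor the prize
        opened = next(c for c in range(3) if c != pick and c != prize)
        if switch:
            pick = next(c for c in range(3) if c != pick and c != opened)
        wins += (pick == prize)
    return wins / trials

print("stay:  ", play(switch=False))   # ~0.33
print("switch:", play(switch=True))    # ~0.67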

19
Defaults via Bayesian Inference
  • Bayesian Inference provides an explanation for
    rationality of default reasoning
  • Reason by choosing an action to maximize expected
    payoff given some knowledge
  • ArgMax_Action Payoff(Action) P(Action | Knowledge)
  • Given additional information, update Knowledge to
    Knowledge'
  • ArgMax_Action Payoff(Action) P(Action | Knowledge')
  • Chosen action may be different, as in Let's Make
    a Deal
  • Inferences are not logically sound, but are
    rational
  • Bayesian framework integrates partiality and
    uncertainty of background knowledge
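A small sketch of that decision rule, read as expected payoff over possible states of the world (the states, payoffs, and posterior below are illustrative assumptions keyed to the Let's Make a Deal example, not code from the talk):

# Choose the action maximizing expected payoff under current beliefs.
def best_action(payoff, belief, actions, states):
    def expected(a):
        return sum(payoff[a][s] * belief[s] for s in states)
    return max(actions, key=expected)

states = ["prize_behind_pick", "prize_elsewhere"]
actions = ["stay", "switch"]
payoff = {"stay":   {"prize_behind_pick": 1, "prize_elsewhere": 0},
          "switch": {"prize_behind_pick": 0, "prize_elsewhere": 1}}

# After Monty opens an empty curtain, the updated belief is 1/3 vs. 2/3,
# so the default action "stay" is rationally overridden by "switch".
posterior = {"prize_behind_pick": 1/3, "prize_elsewhere": 2/3}
print(best_action(payoff, posterior, actions, states))   # -> switch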

20
Example Allophonic Variation
  • English Pronunciation (M. Riley & A. Ljolje, AT&T)
  • Derived from TIMIT with phoneme/phone labels
  • orthographic: bottle
  • phonological: / b aa t ax l /
    (ARPAbet phonemes)
  • phonetic: 0.75  b aa dx el
    (TIMITbet phones)
  • 0.13  b aa t el
  • 0.10  b aa dx ax l
  • 0.02  b aa t ax l
  • Allophonic variation is non-deterministic

21
E.g. Allophonic Variation (cont'd)
  • Simple statistical model (simplified, without
    insertion)
  • Estimate probability of phones given phonemes
  • P(a1,...,aM | p1,...,pM)
    = P(a1 | p1,...,pM) P(a2 | p1,...,pM, a1)
      ... P(aM | p1,...,pM, a1,...,aM-1)
  • Approximate phoneme context to +/- K phones
  • Approximate phone history to 0 or 1 phones
  • 0: P(aJ | pJ-K,...,pJ,...,pJ+K)
  • 1: P(aJ | pJ-K,...,pJ,...,pJ+K, aJ-1)
  • Uses word boundary marker and stress
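A minimal sketch of that factorization (an assumption about the model's shape, not Riley & Ljolje's actual code): each surface phone is predicted from a window of +/- K phonemes plus the previous phone, with insertions ignored as on the slide.

K = 1   # phoneme context half-window

def context(phonemes, j, k=K):
    # pad with "#" as a word-boundary marker, as the slide suggests
    padded = ["#"] * k + list(phonemes) + ["#"] * k
    return tuple(padded[j:j + 2 * k + 1])

def sequence_prob(phones, phonemes, model, unseen=1e-6):
    # P(a1..aM | p1..pM) ~= product over j of P(aJ | pJ-K..pJ+K, aJ-1)
    prob = 1.0
    for j, a in enumerate(phones):
        prev = phones[j - 1] if j > 0 else "#"
        prob *= model.get((context(phonemes, j), prev), {}).get(a, unseen)
    return prob

# One hypothetical table entry: /t/ between /aa/ and /ax/, after phone [aa],
# flaps to [dx] three quarters of the time (cf. the "bottle" figures above).
# Contexts missing from the table fall back to the `unseen` floor.
model = {(("aa", "t", "ax"), "aa"): {"dx": 0.75, "t": 0.15}}
print(sequence_prob(["b", "aa", "dx", "ax", "l"],
                    ["b", "aa", "t", "ax", "l"], model))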

22
E.g. Allophonic Variation (concl'd)
  • Cluster phonological features using decision
    trees
  • Sparse data smoothed by decision trees over
    standard features (+/- stop, voicing,
    aspiration, etc.)
  • Conditional entropy: 1.5 bits without context,
    0.8 bits with context
  • Most likely allophone correct 85.5% of the time;
    correct allophone in the top 5, 99%
  • Average of 17 pronunciations/word to get 95%
  • Robust: handles multiple pronunciations
  • Scalable: to the whole of English pronunciation
  • Portable: easy to move to new dialects with
    training
  • K. Knight (ISI): similar techniques for Japanese
    pronunciation of English words!
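For reference, the entropy figures above are averages of per-context phone entropies; a small sketch, reusing the /t/-in-"bottle" distribution as a single made-up context:

import math

def entropy(dist):
    # entropy in bits of a distribution given as {outcome: prob}
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def conditional_entropy(context_probs, phone_dists):
    # H(phone | context) = sum over contexts of P(context) * H(phone | context)
    return sum(context_probs[c] * entropy(phone_dists[c]) for c in phone_dists)

phone_dists = {("aa", "t", "ax"): {"dx": 0.85, "t": 0.15}}   # P(phone | context)
context_probs = {("aa", "t", "ax"): 1.0}                      # P(context)
print(conditional_entropy(context_probs, phone_dists))        # ~0.61 bits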

23
Example Co-articulation
  • HMMs have been applied to speech since the mid-70s
  • Two major recent improvements, the first being
    simply more training data and cycles
  • The second is context-dependent triphones
  • Instead of one HMM per phoneme/phone, use one per
    context-dependent triphone
  • example: t-r+u, an "r" preceded by "t" and
    followed by "u"
  • crucially, clustered by phonological features to
    overcome sparsity
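A tiny sketch of the triphone expansion (the left-center+right naming is an assumed convention, not necessarily the talk's): each phone in context gets its own HMM label.

def to_triphones(phones):
    # surround the utterance with silence so edge phones also get a context
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["t", "r", "uw"]))
# -> ['sil-t+r', 't-r+uw', 'r-uw+sil']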

24
Exploratory Data Analysis
  • (Trendier: data mining; trendiest: information
    harvesting)
  • Specious argument: "A statistical model won't help
    explain linguistic processes."
  • Counter 1: Abney's anti-reductionist argument
  • But even if you don't believe that...
  • Counter 2: In other sciences (pace the linguistic
    tradition), statistics is used to discover
    regularities
  • Allophone example: the "had your" pronunciation
  • / d / is 51% likely to be realized as [jh], 37% as
    [d]
  • if / d / is realized as [jh], / y / deletes 84% of
    the time
  • if / d / is realized as [d], / y / deletes 10% of
    the time

25
Balancing Gricean Maxims
  • Grice gives us conflicting maxims
  • quantity (exactly as informative as required)
  • quality (try to make your contribution true)
  • manner (be perspicuous, e.g., avoid ambiguity, be
    brief)
  • Manner pulls in opposite directions
  • quality without ambiguity lengthens statements
  • quantity and (part of) manner require brevity
  • Balance by estimating a multidimensional
    goodness metric for generation

26
Gricean Balance (cont'd)
  • Consider the problem of aggregation in generation
  • Every student ran slowly or every student walked
    quickly.
  • Aggregates to
  • Every student ran slowly or walked quickly.
  • This reduces sentence length, shortens clause
    length, and increases ambiguity.
  • These tradeoffs need to be balanced

27
Collins' Head/Dependency Parser
  • Michael Collins' 1998 UPenn PhD thesis
  • Parses WSJ with 90% constituent precision/recall
  • Generative model of tree probabilities
  • Clever Linguistic Decomposition and Training
  • P(RootCat, HeadTag, HeadWord)
  • P(DaughterCat | MotherCat, HeadTag, HeadWord)
  • P(SubCat | MotherCat, DtrCat, HeadTag, HeadWord)
  • P(ModifierCat, ModifierTag, ModifierWord |
    SubCat, MotherCat, DaughterCat, HeadTag,
    HeadWord, Distance)
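A highly simplified sketch of how such a generative head-driven model scores a tree (it illustrates only the modifier term above; this is not Collins' code, and the parameter values are invented):

import math

def score_node(node, params):
    # node = (category, head_tag, head_word, [(side, modifier_subtree), ...])
    cat, tag, word, mods = node
    logp = 0.0
    for side, mod in mods:
        mcat, mtag, mword, _ = mod
        key = ("mod", mcat, mtag, mword, cat, tag, word, side)
        logp += math.log(params.get(key, 1e-8))   # smoothed table lookup
        logp += score_node(mod, params)           # recurse into the modifier
    return logp

tree = ("S", "VBD", "rose",
        [("left",  ("NP",   "NNS", "prices",  [])),
         ("right", ("ADVP", "RB",  "sharply", []))])
params = {("mod", "NP",   "NNS", "prices",  "S", "VBD", "rose", "left"):  0.02,
          ("mod", "ADVP", "RB",  "sharply", "S", "VBD", "rose", "right"): 0.05}
print(score_node(tree, params))   # log probability of the tree's modifiers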

28
E.g. Collins' Parser (cont'd)
  • Distance encodes heaviness
  • Adjunct vs. Complement modifiers distinguished
  • Head Words and Tags model lexical variation and
    word-word attachment preferences
  • Also conditions punctuation, coordination, UDCs
  • 12,000 word vocabulary plus unknown word
    attachment model (by Collins) and tag model (by
    A. Ratnaparkhi, another 1998 UPenn thesis)
  • Smoothed by backing off words to categories
  • Trivial statistical estimators; the power is in
    the conditioning

29
Computational Complexity
  • Wide-coverage linguistic grammars generate
    millions of readings
  • But Collins' parser runs faster than real time on
    a notebook on unseen sentences of length up to
    100
  • How? Pruning.
  • Collins found that tighter statistical estimates
    of tree likelihoods, from more features and more
    complex grammars, ran faster because a tighter
    beam could be used
  • (E. Charniak & S. Caraballo at Brown have really
    pushed the envelope here)
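A sketch of the beam idea (an assumed form, not Collins' implementation): keep only edges whose score is within a fixed log-probability margin of the best edge over the same span.

import math

def prune(edges, beam=math.log(1e-3)):
    # edges: list of (log_prob, edge) covering one span; keep the survivors
    if not edges:
        return []
    best = max(score for score, _ in edges)
    return [(s, e) for s, e in edges if s >= best + beam]

edges = [(-2.0, "NP"), (-2.5, "NX"), (-12.0, "FRAG")]
print(prune(edges))   # FRAG falls outside the beam and is discarded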

30
Complexity (contd)
  • Collins' parser is not complete in the usual
    sense
  • But neither are humans (e.g., garden paths)
  • Can trade speed for accuracy in statistical
    parsers
  • Syntax is not processed autonomously
  • Humans can't parse without context, semantics,
    etc.
  • Even phone or phoneme detection is very
    challenging, especially in a noisy environment
  • Top-down expectations and knowledge of likely
    bottom-up combinations prune the vast search
    space online
  • The question is how to combine it with other
    factors

31
N-best and Word Graphs
  • Speech recognizers can return n-best histories
  • flights from Boston today
  • flights from Austin today
  • flights for Boston to pay
  • lights for Boston to pay
  • Can also return a packed word graph of histories;
    the sum of log probs along a path equals the
    joint acoustic / word-string log prob
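A sketch of such a packed word graph as a set of scored arcs (the arcs and log probs below are invented to echo the n-best list above); summing arc log probs along a path gives that path's joint log prob.

import math

# (start_state, end_state, word, log_prob) arcs of a small word lattice
arcs = [
    (0, 1, "flights", -1.2), (0, 1, "lights", -2.8),
    (1, 2, "from", -0.4),    (1, 2, "for", -1.1),
    (2, 3, "boston", -0.9),  (2, 3, "austin", -1.3),
    (3, 4, "today", -0.5),   (3, 4, "to pay", -2.2),
]

def best_path(arcs, start=0, end=4):
    # brute-force path search; a real decoder would use dynamic programming
    best_score, best_words = -math.inf, None
    def extend(state, words, score):
        nonlocal best_score, best_words
        if state == end:
            if score > best_score:
                best_score, best_words = score, words
            return
        for s, t, w, lp in arcs:
            if s == state:
                extend(t, words + [w], score + lp)
    extend(start, [], 0.0)
    return best_words, best_score

print(best_path(arcs))   # -> (['flights', 'from', 'boston', 'today'], ~-3.0)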

32
Probabilistic Graph Processing
  • The architecture we're exploring in the context
    of spoken dialogue systems involves
  • Speech recognizers that produce probabilistic
    word graph output
  • A tagger that transforms a word graph into a
    word/tag graph with scores given by joint
    probabilities
  • A parser that transforms a word/tag graph into a
    graph-based chart (as in CKY or chart parsing)
  • Allows each module to rescore the output of
    previous modules' decisions
  • Apply this architecture to speech act detection,
    dialogue act selection, and generation
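A sketch of the rescoring interface between the first two modules (the tag table and log probs are hypothetical): each stage consumes a scored graph and emits a new scored graph, rather than committing to a single earlier decision.

def tag_graph(word_graph, tag_model):
    # word arc (s, t, word, logp) -> word/tag arcs with joint log probs
    return [(s, t, (w, tag), lp + tag_lp)
            for (s, t, w, lp) in word_graph
            for tag, tag_lp in tag_model(w)]

def tag_model(word):
    # hypothetical per-word tag distribution (log probs)
    table = {"flights": [("NNS", -0.1)], "from": [("IN", -0.05)],
             "boston": [("NNP", -0.1)], "today": [("NN", -0.8), ("RB", -0.7)]}
    return table.get(word, [("UNK", -2.0)])

word_graph = [(0, 1, "flights", -1.2), (1, 2, "from", -0.4),
              (2, 3, "boston", -0.9), (3, 4, "today", -0.5)]
print(tag_graph(word_graph, tag_model))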

33
"Prices rose sharply after hours": 15-best as a
word/tag graph, with minimization
34
Challenge: Beat n-grams
  • Backed-off trigram models estimated from 300M
    words of WSJ provide the best language models
  • We know there is more to language than two words
    of history
  • The challenge is to find out how to model it.
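For concreteness, a minimal backed-off trigram sketch (a crude backoff scheme for illustration, not the estimator behind the WSJ numbers above): fall back to the bigram, then the unigram, when a history is unseen.

from collections import Counter

def train(tokens):
    uni, bi, tri = Counter(tokens), Counter(), Counter()
    for i in range(1, len(tokens)):
        bi[(tokens[i - 1], tokens[i])] += 1
        if i >= 2:
            tri[(tokens[i - 2], tokens[i - 1], tokens[i])] += 1
    return uni, bi, tri, len(tokens)

def prob(w, history, model, alpha=0.4):
    # back off from trigram to bigram to unigram, discounting by alpha each step
    uni, bi, tri, n = model
    u, v = history
    if tri[(u, v, w)]:
        return tri[(u, v, w)] / bi[(u, v)]
    if bi[(v, w)]:
        return alpha * bi[(v, w)] / uni[v]
    return alpha * alpha * uni[w] / n

model = train("prices rose sharply after hours and prices fell sharply".split())
print(prob("sharply", ("prices", "rose"), model))   # seen trigram -> 1.0
print(prob("sharply", ("after", "hours"), model))   # unseen -> backs off to unigram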

35
Conclusions
  • Need ranking of hypotheses for applications
  • Beam can reduce processing time to linear
  • need good statistics to do this
  • More linguistic features are better for stat
    models
  • can induce the relevant ones and weights from
    data
  • linguistic rules emerge from these
    generalizations
  • Using acoustic / word / tag / syntax graphs
    allows the propagation of uncertainty
  • ideal is totally online (model is compatible with
    this)
  • approximation allows simpler modules to do first
    pruning

36
Plugs
  • Run, don't walk, to read
  • Steve Abney. 1996. Statistical methods and
    linguistics. In J. L. Klavans and P. Resnik,
    eds., The Balancing Act. MIT Press.
  • Mark Seidenberg and Maryellen MacDonald. 1999. A
    probabilistic constraints approach to language
    acquisition and processing. Cognitive Science.
  • Dan Jurafsky and James H. Martin. 2000. Speech
    and Language Processing. Prentice-Hall.
  • Chris Manning and Hinrich Schütze. 1999.
    Foundations of Statistical Natural Language
    Processing. MIT Press.