1
Probabilistic Graphs: Efficient Natural (Spoken) Language Processing
2
The Standard Clichés
  • Moore's Cliché
  • Exponential growth in computing power and memory
    will continue to open up new possibilities
  • The Internet Cliché
  • With the advent and growth of the World Wide Web,
    an ever-increasing amount of information must be
    managed

3
More Standard Clichés
  • The Convergence Cliché
  • Data, voice and video networking will be
    integrated over a universal network that
  • includes land lines and wireless
  • includes broadband and narrowband
  • likely implementation is IP (internet protocol)
  • The Interface Cliché
  • The three forces above (growth in computing
    power, information online, and networking) will
    both enable and require new interfaces
  • Speech will become as common as graphics

4
Application Requirements
  • Robustness
  • acoustic and linguistic variation
  • disfluencies and noise
  • Scalability
  • from embedded devices to palmtops to clients to
    servers
  • across tasks from simple to complex
  • from system-initiative form-filling to
    mixed-initiative dialogue
  • Portability
  • simple adaptation to new tasks and new domains
  • preferably automated as much as possible

5
The Big Question
  • How do humans handle unrestricted language so
    effortlessly in real time?
  • Unfortunately, the classical linguistic
    assumptions and methodology completely ignore
    this issue
  • This is a dangerous strategy for processing
    natural spoken language

6
My Favorite Experiments (I)
  • Head-Mounted Eye Tracking
  • Mike Tanenhaus et al. (Univ. Rochester)

Pick up the yellow plate
  • Clearly shows human understanding is online

7
My Favorite Experiments (II)
  • Garden Paths and Context Sensitivity
  • Crain & Steedman (U. Connecticut, U. Edinburgh)
  • if the noun denotation is not a singleton in
    context, postmodification is much more likely
  • Garden Paths are Frequency and Agreement
    Sensitive
  • Tanenhaus et al.
  • The horse raced past the barn fell. (raced
    likely past tense)
  • The horses brought into the barn fell. (brought
    likely participle, and less likely activity for
    horses)

8
Conclusion: Function and Evolution
  • Humans aggressively prune in real time
  • This is an existence proof: there must be enough
    info to do so; we just need to find it
  • All linguistic information is brought in at 200ms
  • Other pruning strategies have no such existence
    proof
  • Speakers are cooperative in their use of language
  • Especially with spoken language, which is very
    different from written language due to real-time
    requirements
  • (Co-?)Evolution of language and speakers to
    optimize these requirements

9
Stats: Explanation or Stopgap?
  • The Common View
  • Statistics are some kind of approximation of
    underlying factors requiring further explanation.
  • Steve Abney's Analogy (AT&T Labs)
  • Statistical Queueing Theory
  • Consider traffic flows through a toll gate on a
    highway.
  • Underlying factors are diverse, and explain the
    actions of each driver, their cars, possible
    causes of flat tires, drunk drivers, etc.
  • Statistics is more insightful / explanatory in
    this case, as it captures emergent generalizations
  • It is a reductionist error to insist on a
    low-level account

10
Algebraic vs. Statistical
  • False Dichotomy
  • Statistical systems have an algebraic basis, even
    if trivial
  • The best-performing statistical systems have the
    best linguistic conditioning
  • Holds for phonology/phonetics and
    morphology/syntax
  • Most explanatory in the traditional sense
  • Statistical estimators are less significant than
    the conditioning
  • In other sciences, statistics is used for
    exploratory data analysis
  • trendier: data mining; trendiest: information
    harvesting
  • Emergent statistical generalizations can be
    algebraic

11
The Speech Recognition Problem
  • The Recognition Problem
  • Find the most likely sequence w of words given
    the sequence of acoustic observation vectors a
  • Use Bayes' law to create a generative model
  • max_w P(w|a) = max_w P(a|w) P(w) / P(a)
                = max_w P(a|w) P(w)
  • Language Model P(w): usually n-grams (discrete)
  • Acoustic Model P(a|w): usually HMMs (continuous
    density)
  • Challenge 1: beat trigram language models
  • Challenge 2: extend this paradigm to NLP (a small
    decoding sketch follows)
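The decision rule above can be made concrete with a minimal sketch: given a few hypotheses with precomputed acoustic and language-model log scores (the numbers below are invented, not output of any real recognizer), pick the one maximizing log P(a|w) + log P(w). P(a) is constant across hypotheses, so it can be dropped.

```python
# Minimal sketch of the Bayes decision rule; all scores are placeholders.
hypotheses = [
    {"words": "flights from boston today", "log_acoustic": -120.3, "log_lm": -18.2},
    {"words": "lights for boston to pay",   "log_acoustic": -119.8, "log_lm": -25.9},
]

def total_log_score(h):
    # log P(a|w) + log P(w); log P(a) is the same for every hypothesis
    return h["log_acoustic"] + h["log_lm"]

best = max(hypotheses, key=total_log_score)
print(best["words"])   # -> flights from boston today
```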

12
N-best and Word Graphs
  • Speech recognizers can return n-best histories
  • 1. flights from Boston today 2. lights for
    Boston to pay
  • 3. flights from Austin today 4. flights for
    Boston to pay
  • Or a packed word graph of histories (sketched
    below)
  • sum of path log probs equals acoustic/language
    log prob
  • Path closest to utterance in dense graphs is much
    better than first-best on average (density 1: 24,
    5: 15, 180: 11)
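A hypothetical representation of such a packed word graph, to make "sum of path log probs" concrete; the nodes, words, and weights below are illustrative only, not data from the slide.

```python
# Toy packed word graph: each node maps to outgoing edges of
# (word, combined acoustic+LM log prob, next node). Values are made up.
word_graph = {
    0: [("flights", -4.1, 1), ("lights", -6.3, 1)],
    1: [("from", -2.0, 2), ("for", -3.5, 2)],
    2: [("boston", -1.2, 3), ("austin", -2.8, 3)],
    3: [("today", -1.5, 4), ("to", -4.0, 5)],
    5: [("pay", -1.0, 4)],
    4: [],  # final node
}

def path_log_prob(edges):
    # A history's log probability is the sum of its edge log probabilities.
    return sum(lp for _, lp, _ in edges)

print(path_log_prob([("flights", -4.1, 1), ("from", -2.0, 2),
                     ("boston", -1.2, 3), ("today", -1.5, 4)]))  # -8.8
```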

13
Probabilistic Graph Processing
  • The architecture we're exploring in the context
    of spoken dialogue systems involves
  • Speech recognizers that produce a probabilistic
    word graph output, with scores given by acoustic
    probabilities
  • A tagger that transforms a word graph into a
    word/tag graph with scores given by joint
    probabilities
  • A parser that transforms a word/tag graph into a
    syntactic graph (as in CKY parsing) with scores
    given by the grammar
  • Allows each module to rescore the output of
    previous modules' decisions (a pipeline sketch
    follows below)
  • Long term: apply this architecture to speech act
    detection, dialogue act selection, and generation
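One way to picture the proposed architecture is as a composition of rescoring stages. The sketch below is purely illustrative: the function names, data structures, and numbers are stand-ins, not the actual recognizer, tagger, or parser.

```python
# Illustrative pipeline composition; every body here is a placeholder.
def recognize(audio):
    # would return a word graph scored by acoustic probabilities
    return [("flights", -4.1), ("lights", -6.3)]

def tag(word_graph):
    # would rescore into a word/tag graph with joint word/tag probabilities
    return [(word, "NNS", score - 1.0) for word, score in word_graph]

def parse(word_tag_graph):
    # would build a packed syntactic graph (as in CKY) scored by the grammar,
    # rescoring the earlier modules' decisions once more
    return max(word_tag_graph, key=lambda edge: edge[2])

print(parse(tag(recognize("utterance.wav"))))
```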

14
Probabilistic Graph Tagger
  • In: probabilistic word graph
  • P(As|Ws): conditional acoustic likelihoods or
    confidences
  • Out: probabilistic word/tag graph
  • P(Ws,Ts): joint word/tag likelihoods (ignores
    acoustics)
  • P(As,Ws,Ts): joint acoustic/word/tag likelihoods,
    but
  • General history-based implementation in Java
  • next tag/word probability is a function of a
    specified history
  • operates purely left to right on the forward pass
  • backwards prune to edges within a beam / on an
    n-best path
  • able to output hypotheses online
  • optional backwards confidence rescoring (not
    P(As,Ws,Ts))
  • need a node for each active history class for a
    proper model

15
Backwards Rescore & Minimize
All Paths: 1. A,C,E = 1/64   2. A,C,D = 1/128
           3. B,C,D = 1/256  4. B,C,E = 1/512
  • Edge gets the sum of all path scores that go
    through it (see the sketch below)
  • Normalize by total (1/64 + 1/128 + 1/256 + 1/512)

Note: outputs sum to 1 after the backward pass
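The rescoring rule on this slide can be replayed directly with the four path scores above: each edge label collects the scores of every path through it, then everything is normalized by the total path mass. (In the tagger the forward/backward pass computes these sums without enumerating paths; the enumeration below is just for illustration.)

```python
# Replaying the slide's example: path scores from the figure, edges A..E.
paths = {
    ("A", "C", "E"): 1/64,
    ("A", "C", "D"): 1/128,
    ("B", "C", "D"): 1/256,
    ("B", "C", "E"): 1/512,
}
total = sum(paths.values())   # 1/64 + 1/128 + 1/256 + 1/512

edge_score = {}
for path, score in paths.items():
    for edge in path:
        edge_score[edge] = edge_score.get(edge, 0.0) + score

for edge in sorted(edge_score):
    print(edge, edge_score[edge] / total)
# A 0.8, B 0.2, C 1.0, D 0.4, E 0.6 -- outgoing scores at each node sum to 1
```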
16
Tagger Probability Model
  • Exact Probabilities
  • P(As,Ws,Ts) = P(Ws,Ts) P(As|Ws,Ts)
  • P(Ws,Ts) = P(Ts) P(Ws|Ts)   (top-down)
  • Approximations
  • Two-Tag History (tag trigram)
  • P(Ts) = PRODUCT_n P(T_n | T_n-2, T_n-1)
  • Words Depend Only on Tags (HMM)
  • P(Ws|Ts) = PRODUCT_n P(W_n | T_n)
  • Pronunciation Independent of Tag (use standard
    acoustics)
  • P(As|Ws,Ts) = P(As|Ws)   (a toy computation follows)
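A toy computation of this factored model, assuming tiny hand-made probability tables; the values below are invented, not trained estimates.

```python
import math

# log P(As,Ws,Ts) = log P(As|Ws) + sum_n [ log P(T_n|T_n-2,T_n-1) + log P(W_n|T_n) ]
p_tag  = {("<s>", "<s>"): {"NNS": 0.3}, ("<s>", "NNS"): {"VBD": 0.4}}   # tag trigram
p_word = {"NNS": {"prices": 0.01}, "VBD": {"rose": 0.02}}               # HMM emissions

def log_joint(words, tags, log_p_acoustic):
    lp, hist = log_p_acoustic, ("<s>", "<s>")
    for w, t in zip(words, tags):
        lp += math.log(p_tag[hist][t]) + math.log(p_word[t][w])
        hist = (hist[1], t)
    return lp

print(log_joint(["prices", "rose"], ["NNS", "VBD"], log_p_acoustic=-50.0))
```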

17
Prices rose sharply today (n-best tagger output)
0. -35.612683136497516  NNS/prices VBD/rose RB/sharply NN/today
   edges: (0,2,NNS/prices) (2,10,VBD/rose) (10,14,RB/sharply) (14,15,NN/today)
1. -37.035496392922575  NNS/prices VBD/rose RB/sharply NNP/today
   edges: (0,2,NNS/prices) (2,10,VBD/rose) (10,14,RB/sharply) (14,15,NNP/today)
2. -40.439580756197934  NNS/prices VBP/rose RB/sharply NN/today
   edges: (0,2,NNS/prices) (2,9,VBP/rose) (9,11,RB/sharply) (11,15,NN/today)
3. -41.86239401262299   NNS/prices VBP/rose RB/sharply NNP/today
   edges: (0,2,NNS/prices) (2,9,VBP/rose) (9,11,RB/sharply) (11,15,NNP/today)
4. -43.45450487625557   NN/prices VBD/rose RB/sharply NN/today
   edges: (0,1,NN/prices) (1,6,VBD/rose) (6,14,RB/sharply) (14,15,NN/today)
5. -44.87731813268063   NN/prices VBD/rose RB/sharply NNP/today
   edges: (0,1,NN/prices) (1,6,VBD/rose) (6,14,RB/sharply) (14,15,NNP/today)
6. -45.70597331609037   NNS/prices NN/rose RB/sharply NN/today
   edges: (0,2,NNS/prices) (2,8,NN/rose) (8,13,RB/sharply) (13,15,NN/today)
7. -45.81027979248346   NNS/prices NNP/rose RB/sharply NN/today
   edges: (0,2,NNS/prices) (2,7,NNP/rose) (7,12,RB/sharply) (12,15,NN/today)
8. ...
18
Prices rose sharply after hours: 15-best as a
word/tag graph + minimization
19
Prices rose sharply after hours: 15-best as a
word/tag graph + minimization + collapsing
[Graph figure: nodes labeled prices/NNS, prices/NN, rose/VBD, rose/VBP,
 rose/NN, rose/NNP, sharply/RB, after/RB, after/IN, hours/NNS]
20
Weighted Minimize (isn't easy)
  • Can push probabilities back through the graph
  • Ratio of scores must be equivalent for sound
    minimization (difference of log scores)
  • Assume x > y; the operation preserves the sum of
    path scores
[Figure: edges labeled B,A with weights w,x and C,A with weights z,y]

21
Weighted Minimize is Problematic
  • Can't minimize if the ratio is not the same
  • To push, must have the amount to push
  • (x1 - x2) = (y1 - y2)
  • e^x1 / e^x2 = e^y1 / e^y2  (toy check below)
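A toy check of that condition: two edges can only be merged, with their scores pushed, when the difference of log scores (equivalently the ratio of probabilities) agrees. The helper below is hypothetical, just to make the equality concrete.

```python
# Merging is sound only if x1 - x2 == y1 - y2, i.e. e^x1/e^x2 == e^y1/e^y2.
def can_merge(x1, x2, y1, y2, tol=1e-9):
    return abs((x1 - x2) - (y1 - y2)) < tol

print(can_merge(-3.0, -5.0, -1.0, -3.0))  # True: both pairs differ by 2 in log space
print(can_merge(-3.0, -5.0, -1.0, -2.0))  # False: the probability ratios differ
```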

22
How to Collect n Best in O(n k)
  • Do a forward pass through the graph, saving
  • best total path score at each node
  • backpointers to all previous nodes, with scores
  • This is done during tagging (linear in max length k)
  • Algorithm (sketched below)
  • add first-best and second-best final path to a
    priority queue
  • repeat n times
  • follow backpointer of best path on queue to the
    beginning
  • save next best (if any) at each node on the
    queue
  • Can do the same for all paths within a beam epsilon
  • Result is deterministic; minimize before parsing
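A compact sketch in the spirit of the slide's backpointer algorithm, assuming a small acyclic word graph with log-probability weights (all names and numbers are illustrative). A Viterbi-style pass records the best completion score from each node; a priority queue then pops complete paths in best-first order, which is exact because the completion scores are exact.

```python
import heapq

# Toy word graph: node -> [(next_node, word, log_prob)]; node ids are in
# topological order; all values are illustrative.
graph = {
    0: [(1, "flights", -1.0), (1, "lights", -3.0)],
    1: [(2, "from", -0.5), (2, "for", -2.0)],
    2: [(3, "boston", -0.7), (3, "austin", -1.5)],
    3: [],
}
final = 3

# Best (Viterbi) completion score from each node to the final node.
best_to_final = {final: 0.0}
for node in sorted(graph, reverse=True):
    if node != final:
        best_to_final[node] = max(lp + best_to_final[nxt] for nxt, _, lp in graph[node])

def n_best(n):
    results = []
    heap = [(-best_to_final[0], 0, 0.0, [])]   # (-(score + completion), node, score, words)
    while heap and len(results) < n:
        _, node, score, words = heapq.heappop(heap)
        if node == final:
            results.append((score, " ".join(words)))
            continue
        for nxt, word, lp in graph[node]:
            heapq.heappush(heap,
                           (-(score + lp + best_to_final[nxt]), nxt, score + lp, words + [word]))
    return results

print(n_best(3))   # best three word sequences with their log probabilities
```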

23
Collins' Head/Dependency Parser
  • Michael Collins (AT&T), 1998 UPenn PhD Thesis
  • Generative model of tree probabilities P(Tree)
  • Parses WSJ with 90% constituent precision/recall
  • Best performance for a single parser, but
    Henderson's Johns Hopkins thesis beat it by
    blending with other parsers (Charniak,
    Ratnaparkhi)
  • Formal language induced from simple smoothing
    of the treebank is trivially Word* (Charniak)
  • Collins' parser runs in real time
  • Collins' naïve C implementation
  • Parses 100% of the test set

24
Collins' Grammar Model
  • Similar to GPSG + CG (aka HPSG) model
  • Subcat frames: adjuncts / complements
    distinguished
  • Generalized Coordination
  • Unbounded Dependencies via slash
  • Punctuation
  • Distance metric codes word order (canonical or
    not)
  • Probabilities conditioned top-down
  • 12,000-word vocabulary (> 5 occurrences in the
    treebank)
  • backs off to a word's tag
  • approximates unknown words from words with < 5
    occurrences
  • Induces feature information statistically

25
Collins' Statistics (Simplified)
  • Choose Start Symbol, Head Tag, Head Word
  • P(RootCat, HeadTag, HeadWord)
  • Project Daughter and Left/Right Subcat Frames
  • P(DaughterCat | MotherCat, HeadTag, HeadWord)
  • P(SubCat | MotherCat, DtrCat, HeadTag, HeadWord)
  • Attach Modifier (Comp/Adjunct, Left/Right)
  • P(ModifierCat, ModifierTag, ModifierWord |
    SubCat, ..., MotherCat, DaughterCat,
    HeadTag, HeadWord, Distance)
    (a toy calculation follows)
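For illustration only, one derivation step under this model multiplies the factors listed above. All of the numbers below are invented, and the conditioning contexts are simplified relative to the actual parser.

```python
import math

p_root     = 0.02   # P(S, VBD, "rose"): start symbol, head tag, head word
p_daughter = 0.60   # P(VP | S, VBD, "rose"): head daughter projection
p_subcat   = 0.30   # P({NP-left} | S, VP, VBD, "rose"): left/right subcat frames
p_modifier = 0.05   # P(NP, NNS, "prices" | subcat, S, VP, VBD, "rose", distance)

log_p_step = sum(math.log(p) for p in (p_root, p_daughter, p_subcat, p_modifier))
print(log_p_step)   # log probability contributed by this derivation step
```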

26
Complexity and Efficiency
  • Collins' wide-coverage linguistic grammar
    generates millions of readings for simple strings
  • But Collins' parser runs faster than real time on
    unseen sentences of arbitrary length
  • How?
  • Punchline: Time-Synchronous Beam Search Reduces
    Time to Linear (sketch below)
  • Tighter estimates with more features and more
    complex grammars ran faster and more accurately
  • Beam allows a tradeoff of accuracy (search error)
    and speed
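A minimal sketch of time-synchronous beam pruning, with a made-up extend step standing in for the real parser actions: after each word position, only hypotheses within a fixed log-probability beam of the current best survive, so the work per position stays bounded and total time is linear in sentence length.

```python
import math

def prune(hypotheses, beam=math.log(1e-4)):
    # Keep only hypotheses within `beam` log probability of the best one.
    best = max(score for score, _ in hypotheses)
    return [(s, h) for s, h in hypotheses if s >= best + beam]

def decode(words, extend, initial):
    agenda = [initial]
    for word in words:
        agenda = prune([new for hyp in agenda for new in extend(hyp, word)])
    return max(agenda)

# Toy extend step: each hypothesis splits into a good and a bad continuation.
def toy_extend(hyp, word):
    score, history = hyp
    return [(score - 1.0, history + [word]), (score - 12.0, history + [word.upper()])]

print(decode(["prices", "rose", "sharply"], toy_extend, (0.0, [])))
```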

27
Completeness & Dialogue
  • Collins' parser is not complete in the usual
    sense
  • But neither are humans (e.g., garden paths)
  • Syntactic features alone don't determine
    structure
  • Humans can't parse without context, semantics,
    etc.
  • Even phone or phoneme detection is very
    challenging, especially in a noisy environment
  • Top-down expectations and knowledge of likely
    bottom-up combinations prune the vast search
    space online
  • The question is how to combine this with other
    factors
  • Next steps: semantics, pragmatics, dialogue

28
Conclusions
  • Need ranking of hypotheses for applications
  • Beam can reduce processing time to linear
  • need good statistics to do this
  • More linguistic features are better for stat
    models
  • can induce the relevant ones and weights from
    data
  • linguistic rules emerge from these
    generalizations
  • Using acoustic / word / tag / syntax graphs
    allows the propagation of uncertainty
  • the ideal is totally online (the model is
    compatible with this)
  • approximation allows simpler modules to do a
    first pruning