1
Industrial Language Modeling: A Pessimist's Perspective: The glass is empty
  • Joshua Goodman
  • Microsoft Research

2
Why language model research isn't useful for
industry
  • 1) Most language model research cannot be
    combined with current recognizers (efficiency
    issues)
  • 2) Solutions to theoretical problems rather than
    real problems (smoothing, caching)
  • 3) Most language model research ignores speed and
    size constraints

3
How a Speech Recognizer Works
  • In practice, it's not so simple
  • Acoustic scoring and language model scoring are
    tightly integrated for thresholding
  • Otherwise, we would need to consider ALL possible
    word sequences, not just likely ones
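A minimal sketch of the thresholding idea above, assuming an invented hypothesis format and made-up scores (not the internals of any real recognizer): partial hypotheses survive only while their combined acoustic + language model log-score stays within a beam of the best one.

```python
# Hypothetical sketch of beam thresholding; the hypothesis format and
# scores are invented for illustration, not taken from a real recognizer.

def prune(hypotheses, beam=10.0):
    """Keep hypotheses within `beam` log-units of the best combined score."""
    best = max(h["log_acoustic"] + h["log_lm"] for h in hypotheses)
    return [h for h in hypotheses
            if h["log_acoustic"] + h["log_lm"] >= best - beam]

hyps = [
    {"words": "swear to tell",   "log_acoustic": -12.0, "log_lm": -6.0},
    {"words": "swerve to smell", "log_acoustic": -11.5, "log_lm": -9.0},
    {"words": "square two bell", "log_acoustic": -25.0, "log_lm": -14.0},
]
print([h["words"] for h in prune(hyps)])  # the third hypothesis is dropped
```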

4
5-grams
  • 5-grams have lower perplexity than trigrams (see the perplexity sketch below)
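As a reminder of what is being compared, here is a minimal sketch of how perplexity is computed from per-word model probabilities; the numbers are invented purely to illustrate that assigning higher probability to the observed words means lower perplexity.

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# Invented per-word probabilities: the longer context typically assigns
# higher probability to each word, hence lower perplexity.
trigram_probs  = [0.05, 0.10, 0.02, 0.08]
fivegram_probs = [0.07, 0.12, 0.03, 0.09]
print(perplexity(trigram_probs), perplexity(fivegram_probs))
```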

5
Speech recognizer slowdowns
  • Speech recognizer uses tricks (dynamic
    programming) to merge hypotheses
  • Trigram: "swear to tell the", "swear too tell the",
    "swerve to tell the", "swerve too tell the" all merge into one
    "...tell the" state; "swerve to smell the", "swerve too smell the"
    merge into one "...smell the" state
  • Fivegram: none of these hypotheses can be merged
  • Fivegram is not worth the extra cost
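A minimal sketch of the merging behaviour, assuming a made-up list of partial hypotheses with scores: a Viterbi-style recognizer only needs to keep the best hypothesis per (n-1)-word context, so a trigram merges far more aggressively than a fivegram.

```python
def merge(hypotheses, n):
    """Keep the best-scoring hypothesis per (n-1)-word context, as dynamic
    programming would; higher-order models merge less and so cost more."""
    best = {}
    for words, score in hypotheses:
        context = tuple(words[-(n - 1):])  # only the last n-1 words matter
        if context not in best or score > best[context][1]:
            best[context] = (words, score)
    return list(best.values())

hyps = [
    (["swear", "to", "tell", "the"],   -7.0),
    (["swerve", "too", "tell", "the"], -9.0),
    (["swerve", "to", "smell", "the"], -8.0),
]
print(len(merge(hyps, n=3)))  # trigram: 2 states survive ("tell the", "smell the")
print(len(merge(hyps, n=5)))  # fivegram: all 3 hypotheses stay distinct
```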
6
Nasty details matter
  • Many nasty details about speech recognizers
    affect language modeling
  • Most speech recognizers use a phonetic tree
  • Tree representation works badly with some
    language modeling techniques, especially
    clustering

7
Phonetic Tree Representation
(Figure: phonetic prefix tree of the words that can follow "Tell the", with shared phone branches such as T-R-U-TH and S-U-P.)
8
Phonetic Tree Representation
P(X | Tell The): TRUTH .1, TRULY .05, TO .01, SOUP .01, SOON .003, SAF... .02, SAN... .002
(Figure: the same phonetic tree, with these word probabilities spread over the branch transitions.)
9
Consider clustering methods
  • IBM clustering
  • Interpolate word trigram with class trigram (see the sketch below):
  • Pibm(wi | wi-2 wi-1) =
    λ P(wi | wi-2 wi-1)
    + (1-λ) P(class(wi) | class(wi-2) class(wi-1)) × P(wi | class(wi))
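A minimal sketch of the interpolation above, with toy probability tables (all numbers and class names are invented for illustration):

```python
def p_ibm(w, ctx, lam, p_word, p_class, p_word_in_class, cls):
    """P_IBM(w | ctx) = lam * P(w | ctx)
                      + (1 - lam) * P(class(w) | class(ctx)) * P(w | class(w))"""
    cls_ctx = tuple(cls[c] for c in ctx)
    return (lam * p_word.get((ctx, w), 0.0)
            + (1 - lam) * p_class.get((cls_ctx, cls[w]), 0.0)
                        * p_word_in_class.get((cls[w], w), 0.0))

# Toy tables with made-up probabilities.
cls = {"tell": "COMMUNICATE", "the": "ART", "truth": "STATEMENT"}
p_word = {(("tell", "the"), "truth"): 0.10}
p_class = {(("COMMUNICATE", "ART"), "STATEMENT"): 0.30}
p_word_in_class = {("STATEMENT", "truth"): 0.40}

print(p_ibm("truth", ("tell", "the"), 0.7,
            p_word, p_class, p_word_in_class, cls))  # 0.7*0.10 + 0.3*0.30*0.40 = 0.106
```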

10
Clustered Phonetic Tree Representation ???
Pibm(truth | tell the) = λ P(truth | tell the)
  + (1-λ) P(STATEMENT | COMMUNICATE ART.) × P(truth | STATEMENT)
(Figure: the phonetic tree again, but with every branch probability replaced by a question mark.)
11
Count Cutoffs: Why smoothing doesn't matter
  • Most dictation systems are trained on billions of
    words of training data, which would use about a
    gigabyte
  • I don't have a gigabyte
  • Solution: count cutoffs (see the sketch after this list)
  • Smoothing only matters on small counts
  • (OK, it really does matter, read my paper, but
    that's not the point I'm making today)
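A minimal sketch of a count cutoff when building an n-gram table, with an invented cutoff value: n-grams whose count does not exceed the cutoff are simply dropped, which is why smoothing of rare events matters less in this setting.

```python
from collections import Counter

def ngram_counts(tokens, n, cutoff=0):
    """Count n-grams and drop any whose count is <= cutoff."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {ng: c for ng, c in counts.items() if c > cutoff}

tokens = "swear to tell the truth and nothing but the truth".split()
print(ngram_counts(tokens, n=2, cutoff=0))  # every bigram kept
print(ngram_counts(tokens, n=2, cutoff=1))  # only ("the", "truth"), seen twice
```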

12
Caching doesn't matter either
  • Caching gets huge wins in theory
  • In practice, errors tend to get locked in (see the sketch after this list)
  • User says "Recognize speech"
  • System hears "Wreck a nice beach"
  • User says "Speech recognition"
  • System hears "Beach wreck ignition"
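A minimal sketch of why cached errors get locked in, assuming a toy unigram cache interpolated with a flat static model (all values invented): because the cache is fed the recognizer's own output, a misrecognized word raises its own probability for the rest of the session.

```python
from collections import Counter

class CacheLM:
    """Toy static model interpolated with a unigram cache of recent output."""
    def __init__(self, static_prob, lam=0.9):
        self.static_prob = static_prob  # function: word -> P_static(word)
        self.lam = lam
        self.cache = Counter()

    def observe(self, word):
        # The cache sees the recognizer's OUTPUT, right or wrong.
        self.cache[word] += 1

    def prob(self, word):
        total = sum(self.cache.values())
        p_cache = self.cache[word] / total if total else 0.0
        return self.lam * self.static_prob(word) + (1 - self.lam) * p_cache

lm = CacheLM(static_prob=lambda w: 1e-4)   # flat toy static model
for w in "wreck a nice beach".split():     # the system's erroneous output
    lm.observe(w)
print(lm.prob("beach") > lm.prob("speech"))  # True: the error is now locked in
```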

13
Don't forget size
  • Space is also ignored
  • 5-grams, most clustering, sentence mixture
    models, skipping models all use more space
  • Example: clustering
  • Pibm(wi | wi-2 wi-1) =
    λ P(wi | wi-2 wi-1)
    + (1-λ) P(class(wi) | class(wi-2) class(wi-1)) × P(wi | class(wi))
  • Need to store two models instead of one

14
Summary: Why language model research is useless
  • Speech recognition mechanics interact badly with
    many techniques, including 5-grams and clustering
  • Cache errors get locked in
  • Smoothing doesn't matter with high count cutoffs
  • In practice, the details are important
  • In practice, trigram models are just too good

15
Sentence Mixture Models in a speech recognizer
  • Sentence mixture models are a promising new
    technique
  • Different models for each sentence type
  • With 5 sentence types, 5 times as much work
    (see the sketch after this list)
"tell the truth" s1 (.01)   "smell the soup" s1 (.02)
"tell the truth" s2 (.01)   "smell the soup" s2 (.03)
"tell the truth" s3 (.03)   "smell the soup" s3 (.03)
"tell the truth" s4 (.01)   "smell the soup" s4 (.04)
"tell the truth" s5 (.02)   "smell the soup" s5 (.01)
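A minimal sketch of the extra work, using the per-type scores from the slide and invented mixture weights: every hypothesis must be scored under every sentence-type model before the weighted sum can be taken.

```python
def sentence_mixture_prob(per_type_probs, weights):
    """P(sentence) = sum_j weight_j * P_j(sentence); one score per sentence type."""
    return sum(w * p for w, p in zip(weights, per_type_probs))

# Per-type probabilities from the slide; the uniform weights are invented.
truth = [0.01, 0.01, 0.03, 0.01, 0.02]   # "tell the truth" under types s1..s5
soup  = [0.02, 0.03, 0.03, 0.04, 0.01]   # "smell the soup" under types s1..s5
weights = [0.2] * 5

print(sentence_mixture_prob(truth, weights))  # ~0.016
print(sentence_mixture_prob(soup, weights))   # ~0.026
```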
16
Solution in theory: Make users correct mistakes
  • User says "Recognize speech"
  • System hears "Wreck a nice beach"
  • User says "Change beach to speech"
  • User says "Speech recognition"
  • System hears "Speech wreck ignition"

17
Problems
  • User doesn't notice the mistake
  • User doesn't feel like correcting the mistake now
  • User says "change beach to speech"
  • System hears "change beach to speak"
  • User says "Who do I recognize"
  • User says "Change who to whom"
  • System is confused

18
Why everything you learned is right
  • Why everything you learned is right, just not
    always, and it's harder than you thought it would
    be
  • Prof. Ostendorf isn't cruel
  • Language modeling research isn't useless
  • I spend my own time working on these problems;
    I'm just bitter and cynical

19
Why smoothing is useful
  • For large vocabulary dictation, we have billions
    of words for training
  • Most of it is newspaper text or encyclopedias
  • Not too many people are reporters or encyclopedia
    authors
  • We would kill for 1 million words of real data,
    and we would use all of it, or nearly so, with very
    low count cutoffs. Good smoothing would help.

20
Why smoothing is useful
  • For anything except dictation, the situation is even
    worse
  • Each new application needs its own language model
    training data
  • Travel, weather, stocks, news, telephone-based
    email access: each language model is different
  • Requires painful, expensive transcription
  • Every piece of data is useful; we can't afford high
    count cutoffs or bad smoothing

21
Speed solution: Multiple pass decoding
  • Speed is a major problem for 5-grams, sentence
    mixture models, and some forms of clustering
  • We can use a multiple pass approach
  • First pass uses a normal trigram
  • Recognizer outputs best 100 hypotheses
  • These are then rescored

22
N-best re-recognition
  • User says "Swear to tell the truth"
  • Recognizer outputs:
  • "Swerve to smell the soup"  (A=.0005, L=.001)
  • "Swear to tell the soup"    (A=.0003, L=.0001)
  • "Swear to tell the truth"   (A=.0002, L=.0001)
  • "Swerve to smell the truth" (A=.0004, L=.00002)
  • Language model rescores each hypothesis (see the sketch after this list)
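A minimal sketch of the rescoring step: the acoustic probabilities are the ones from the slide, while the second-pass language model probabilities are invented to show how a stronger model can promote the correct hypothesis.

```python
import math

# Acoustic probabilities from the slide; the second-pass LM probabilities
# are invented for illustration.
acoustic = {
    "Swerve to smell the soup":  0.0005,
    "Swear to tell the soup":    0.0003,
    "Swear to tell the truth":   0.0002,
    "Swerve to smell the truth": 0.0004,
}
second_pass_lm = {
    "Swerve to smell the soup":  0.0001,
    "Swear to tell the soup":    0.0001,
    "Swear to tell the truth":   0.0200,
    "Swerve to smell the truth": 0.0001,
}

def rescore(acoustic, lm, lm_weight=1.0):
    """Pick the hypothesis maximizing log P_acoustic + lm_weight * log P_lm."""
    return max(acoustic, key=lambda h: math.log(acoustic[h])
                                       + lm_weight * math.log(lm[h]))

print(rescore(acoustic, second_pass_lm))  # "Swear to tell the truth"
```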

23
N-best re-recognition problems
  • What if the right hypothesis is not present?
  • Exponentially many hypotheses
  • "Swear to tell the truth about speech
    recognition"
  • Swear/swerve  to/2/two/too  tell/smell/bell
    truth/soup/tooth
    speech recognition/beach wreck ignition

24
Lattice rescoring
  • Recognizer outputs a lattice
  • (Figure: word lattice with alternatives at each position:
    Swear/Swerve   to/too/2   smell/tell/spell   the   soup/truth/tooth)
  • First step is to expand the lattice so that it
    contains only trigram-level dependencies
25
Lattice with trigrams
  • (Figure: the lattice expanded so that each state records its two-word
    context, e.g. separate "Swear to" and "Swerve to" states leading into "tell")
  • Expand the lattice in such a way that trigram
    probabilities correspond to transition
    probabilities
  • Lattice can't be too big
  • Harder for 5-grams, but can be done
26
Lattice/n-best problems
  • All rescoring has to be done at the end of
    recognition. Leads to latency: the time between
    when the user stops speaking and when he gets a
    response. Only fast techniques can be used for
    most apps.
  • Recognizer output can change
  • "Swerve to smell the soup about beach"
    → "Swear to tell the truth about speech"
  • Right answer might not be anywhere in the lattice

27
Lattice/n-best advantages
  • Great for doing research!
  • Recognizer and language model can be completely
    separate
  • Very complex models can be used
  • Used by some products

28
Clusters in practice
  • IBM: the hard way to do clustering; interacts badly
    with the phonetic tree
  • Alternative: use a trigram that backs off to a bigram
    that backs off to P(Z|Y) (see the sketch after this list)
  • By picking the right form for clustering, we can
    integrate it with the speech recognizer
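A minimal sketch of that backoff chain, under the assumption that P(Z|Y) means the probability of a word given its cluster (toy tables and constant backoff weights, invented for illustration; real backoff weights come from discounting):

```python
def backoff_prob(w, w1, w2, trigram, bigram, cluster_model, cls,
                 alpha3=0.4, alpha2=0.4):
    """Trigram, backing off to bigram, backing off to a cluster-conditioned
    estimate. Constant alphas are a simplification of real backoff weights."""
    if (w2, w1, w) in trigram:
        return trigram[(w2, w1, w)]
    if (w1, w) in bigram:
        return alpha3 * bigram[(w1, w)]
    return alpha3 * alpha2 * cluster_model.get((cls[w], w), 0.0)

# Toy tables with invented numbers.
trigram = {("tell", "the", "truth"): 0.10}
bigram = {("the", "soup"): 0.02}
cluster_model = {("STATEMENT", "truth"): 0.40, ("FOOD", "stew"): 0.05}
cls = {"truth": "STATEMENT", "soup": "FOOD", "stew": "FOOD"}

print(backoff_prob("truth", "the", "tell", trigram, bigram, cluster_model, cls))  # 0.10
print(backoff_prob("soup",  "the", "tell", trigram, bigram, cluster_model, cls))  # 0.4*0.02
print(backoff_prob("stew",  "the", "tell", trigram, bigram, cluster_model, cls))  # 0.4*0.4*0.05
```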

29
Phonetic Tree with backoff
(Figure: three phonetic trees chained by backoff: a trigram tree conditioned on "Tell the", a bigram tree conditioned on "the", and a unigram tree, each with its own branch probabilities.)
30
Clustered phonetic tree with backoff
(Figure: the same chain of phonetic trees, but the final backoff level is P(Z|Y) instead of a unigram.)
31
Conclusion: Everything you learned is useful, just
hard
  • Everything you learned is useful
  • But it's much more work to use it in practice
    than you thought
  • Need to pay careful attention to the speech
    recognizer or other application to integrate it