Title: Industrial Language Modeling, A Pessimist's Perspective: The glass is empty
1. Industrial Language Modeling: A Pessimist's Perspective: The glass is empty
- Joshua Goodman
- Microsoft Research
2. Why language model research isn't useful for industry
- 1) Most language model research cannot be combined with current recognizers (efficiency issues)
- 2) Solutions to theoretical problems rather than
real problems (smoothing, caching)
- 3) Most language model research ignores speed and
size constraints
3. How a Speech Recognizer Works
- In practice, it's not so simple
- Acoustic scoring and language model scoring are
tightly integrated for thresholding
- Otherwise, we would need to consider ALL possible
word sequences, not just likely ones
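To make the integration concrete, here is a minimal, hedged beam-search sketch in Python. The frame-level word scores, the uniform LM, and the beam width are all invented for illustration; a real decoder works over phone and HMM states, not whole words:

```python
import math

# Toy beam search: extend hypotheses word by word, scoring each extension
# with acoustic score + language model score, and pruning (thresholding)
# against the best hypothesis so far. All scores are log-probabilities.

BEAM = 5.0  # prune hypotheses more than this many log-units behind the best

def decode(frames, lm_logprob, beam=BEAM):
    """frames: list of dicts mapping word -> acoustic log-prob for that slot."""
    hyps = {(): 0.0}  # word sequence -> combined log score
    for acoustic in frames:
        extended = {}
        for words, score in hyps.items():
            for word, a_score in acoustic.items():
                # the language model is consulted DURING search, not after
                l_score = lm_logprob(words, word)
                extended[words + (word,)] = score + a_score + l_score
        best = max(extended.values())
        # thresholding: without the LM folded in here, we could not prune,
        # and would have to carry ALL word sequences forward
        hyps = {w: s for w, s in extended.items() if s >= best - beam}
    return max(hyps, key=hyps.get)

# Hypothetical uniform LM and two frames of acoustic scores:
lm = lambda history, word: math.log(0.1)
frames = [{"swear": -1.0, "swerve": -1.2}, {"to": -0.5, "too": -0.6}]
print(decode(frames, lm))  # ('swear', 'to')
```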
4. 5-grams
- 5-grams have lower perplexity than trigrams
5. Speech recognizer slowdowns
- Speech recognizer uses tricks (dynamic
programming) to merge hypotheses
- Trigram: only the last two words matter, so swear to tell the, swear too tell the, swerve to tell the, and swerve too tell the all merge into "tell the", and swerve to smell the and swerve too smell the merge into "smell the"
- Fivegram: no merging; all six hypotheses stay distinct (see the sketch below)
- Fivegram is not worth the extra cost
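A small sketch of the dynamic-programming merge under invented scores: with an n-gram model, hypotheses that agree on their last n-1 words are interchangeable for all future scoring, so the decoder keeps only the best of each class:

```python
# With an n-gram LM, only the last n-1 words of a hypothesis affect any
# future probability, so hypotheses sharing that suffix can be merged
# (keeping the best-scoring one). Scores here are made up.

hyps = {
    ("swear", "to", "tell", "the"): -4.1,
    ("swerve", "to", "smell", "the"): -4.3,
    ("swear", "too", "tell", "the"): -5.0,
    ("swerve", "too", "smell", "the"): -5.2,
    ("swerve", "to", "tell", "the"): -5.5,
    ("swerve", "too", "tell", "the"): -5.9,
}

def merge(hyps, n):
    """Collapse hypotheses that agree on their last n-1 words."""
    states = {}
    for words, score in hyps.items():
        state = words[-(n - 1):]
        if state not in states or score > states[state]:
            states[state] = score
    return states

print(len(merge(hyps, 3)))  # trigram: 2 states ('tell the', 'smell the')
print(len(merge(hyps, 5)))  # fivegram: all 6 hypotheses stay distinct
```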
6. Nasty details matter
- Many nasty details about speech recognizers
affect language modeling
- Most speech recognizers use a phonetic tree
- Tree representation works badly with some
language modeling techniques, especially
clustering
7. Phonetic Tree Representation
[Figure: phonetic tree of the words that can follow "Tell the"; words sharing a prefix (TRUTH, TRULY, TO; SOUP, SOON; SAF, SAN) share branches]
8. Phonetic Tree Representation
P(X | Tell The): TRUTH .1, TRULY .05, TO .01, SOUP .01, SOON .003, SAF .02, SAN .002
[Figure: the same phonetic tree, now with a probability attached to each branch]
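The branch probabilities come from distributing word probabilities over shared prefixes: each edge carries its subtree's mass divided by its parent's mass, so multiplying along a path recovers the word's probability. A minimal sketch, using letter spellings as stand-ins for phone sequences:

```python
from collections import defaultdict

# Push P(word | "Tell the") down onto the edges of a prefix (phonetic) tree:
# each edge carries (mass of its subtree) / (mass of its parent's subtree),
# so multiplying along a word's path recovers that word's probability.

probs = {"truth": .1, "truly": .05, "to": .01, "soup": .01,
         "soon": .003, "saf": .02, "san": .002}

def edge_probs(probs):
    mass = defaultdict(float)
    for word, p in probs.items():
        for i in range(len(word) + 1):
            mass[word[:i]] += p  # every prefix accumulates its subtree's mass
    return {pre: mass[pre] / mass[pre[:-1]] for pre in mass if pre}

edges = edge_probs(probs)

# Multiply along t, tr, tru, trut, truth (times the tree's total mass,
# since the slide lists only a few of the vocabulary's words):
p = sum(probs.values())
for i in range(1, len("truth") + 1):
    p *= edges["truth"[:i]]
print(round(p, 6))  # 0.1, i.e. P(truth | Tell the)
```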
9. Consider clustering methods
- IBM clustering
- Interpolate word trigram with class trigram
- Pibm(wi | wi-2 wi-1) = λ P(wi | wi-2 wi-1) + (1-λ) P(class(wi) | class(wi-2) class(wi-1)) × P(wi | class(wi))
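A minimal sketch of the interpolation above; the component tables, the classes, and λ = 0.7 are stand-ins, not trained values:

```python
# IBM-style clustering: interpolate a word trigram with a class trigram
# times a class-emission probability. All numbers below are invented.

LAMBDA = 0.7  # interpolation weight (would be tuned on held-out data)

def p_word_trigram(w, w2, w1):
    return {("truth", "tell", "the"): 0.1}.get((w, w2, w1), 0.001)

def p_class_trigram(c, c2, c1):
    return {("STATEMENT", "COMMUNICATE", "ART"): 0.2}.get((c, c2, c1), 0.01)

def p_emit(w, c):
    return {("truth", "STATEMENT"): 0.05}.get((w, c), 0.0001)

classes = {"truth": "STATEMENT", "tell": "COMMUNICATE", "the": "ART"}

def p_ibm(w, w2, w1):
    return (LAMBDA * p_word_trigram(w, w2, w1)
            + (1 - LAMBDA)
              * p_class_trigram(classes[w], classes[w2], classes[w1])
              * p_emit(w, classes[w]))

print(p_ibm("truth", "tell", "the"))  # 0.7*0.1 + 0.3*0.2*0.05 = 0.073
```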
10. Clustered Phonetic Tree Representation ???
Pibm(truth | tell the) = λ P(truth | tell the) + (1-λ) P(STATEMENT | COMMUNICATE ART.) × P(truth | STATEMENT)
[Figure: the phonetic tree again, but every branch probability is now a "?": the interpolated cluster probabilities do not decompose onto the tree's edges]
11. Count Cutoffs: Why smoothing doesn't matter
- Most dictation systems are trained on billions of
words of training data, which would use about a
gigabyte
- I don't have a gigabyte
- Solution: count cutoffs
- Smoothing only matters on small counts
- (OK, it really does matter, read my paper, but that's not the point I'm making today)
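A hedged sketch of count cutoffs with invented counts: every n-gram at or below the cutoff is dropped, and its probability is later served by the backoff model:

```python
# Count cutoffs: drop n-grams whose training count is at or below the
# cutoff; their probability mass is served by the backoff model instead.
# High cutoffs shrink the model enormously, and also mean smoothing
# (which mostly affects small counts) has little left to do.

trigram_counts = {
    ("tell", "the", "truth"): 4321,
    ("smell", "the", "soup"): 7,
    ("swear", "too", "tell"): 1,
    ("wreck", "a", "nice"): 2,
}

def apply_cutoff(counts, cutoff):
    return {ng: c for ng, c in counts.items() if c > cutoff}

print(len(apply_cutoff(trigram_counts, 0)))  # 4: keep everything
print(len(apply_cutoff(trigram_counts, 2)))  # 2: low-count n-grams gone
```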
12. Caching doesn't matter either
- Caching gets huge wins in theory
- In practice, errors tend to get locked in
- User says "Recognize speech"
- System hears "Wreck a nice beach"
- User says "Speech recognition"
- System hears "Beach wreck ignition"
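For concreteness, a sketch of a unigram cache interpolated with a static model. All weights and probabilities are invented; the point is that the cache sees the recognizer's output, so a misrecognized word gets boosted and the error locks in:

```python
from collections import Counter

CACHE_WEIGHT = 0.1  # illustrative interpolation weight

class CacheLM:
    """Interpolate a static LM with a unigram cache of recent output."""

    def __init__(self, static_lm):
        self.static_lm = static_lm
        self.cache = Counter()

    def observe(self, word):
        # In a dictation system the cache sees what the RECOGNIZER output,
        # not what the user said, so errors reinforce themselves.
        self.cache[word] += 1

    def prob(self, word, history):
        p_cache = self.cache[word] / max(1, sum(self.cache.values()))
        return ((1 - CACHE_WEIGHT) * self.static_lm(word, history)
                + CACHE_WEIGHT * p_cache)

lm = CacheLM(lambda w, h: 0.001)  # stand-in static trigram
for w in "wreck a nice beach".split():  # misrecognized output enters cache
    lm.observe(w)
print(lm.prob("beach", ()))   # boosted: the error is now MORE likely
print(lm.prob("speech", ()))  # the correct word gets no cache help
```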
13. Don't forget size
- Space is also ignored
- 5-grams, most clustering, sentence mixture
models, skipping models all use more space
- Example: clustering
- Pibm(wi | wi-2 wi-1) = λ P(wi | wi-2 wi-1) + (1-λ) P(class(wi) | class(wi-2) class(wi-1)) × P(wi | class(wi))
- Need to store two models instead of one (see the rough numbers below)
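A back-of-the-envelope sketch of the storage cost; every quantity below is an assumption for illustration, not a measurement:

```python
# Rough parameter counting for the storage cost of storing two models
# instead of one. All numbers are invented for illustration.

word_trigrams = 40_000_000   # word trigram entries surviving cutoffs
class_trigrams = 5_000_000   # extra class trigram table
emissions = 60_000           # extra P(word | class) entries
bytes_per_entry = 4

plain = word_trigrams * bytes_per_entry
clustered = (word_trigrams + class_trigrams + emissions) * bytes_per_entry
print(f"word trigram only:      {plain / 1e6:.0f} MB")
print(f"plus clustering tables: {clustered / 1e6:.0f} MB")
```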
14. Summary: Why language model research is useless
- Speech recognition mechanics interact badly with many techniques, including 5-grams and clustering
- Cache errors get locked in
- Smoothing doesn't matter with high count cutoffs
- In practice, the details are important
- In practice, trigram models are just too good
15. Sentence Mixture Models in a speech recognizer
- Sentence mixture models are a promising new technique
- Different models for each sentence type
- With 5 sentence types, 5 times as much work:
  tell the truth | s1 (.01)    smell the soup | s1 (.02)
  tell the truth | s2 (.01)    smell the soup | s2 (.03)
  tell the truth | s3 (.03)    smell the soup | s3 (.03)
  tell the truth | s4 (.01)    smell the soup | s4 (.04)
  tell the truth | s5 (.02)    smell the soup | s5 (.01)
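A hedged sketch of the extra work: the mixture probability sums over sentence types, so each hypothesis is scored once per type. Priors and probabilities are invented, with the per-type table mirroring the one above:

```python
# Sentence mixture model: P(sentence) = sum over sentence types s of
# P(s) * P(sentence | s). Scoring one hypothesis costs one pass per type.

type_priors = [0.3, 0.25, 0.2, 0.15, 0.1]  # invented P(s) for 5 types

# P(hypothesis | s) for each type, mirroring the table on the slide:
p_given_type = {
    "tell the truth": [.01, .01, .03, .01, .02],
    "smell the soup": [.02, .03, .03, .04, .01],
}

def mixture_prob(sentence):
    per_type = p_given_type[sentence]  # 5 model evaluations, not 1
    return sum(prior * p for prior, p in zip(type_priors, per_type))

print(mixture_prob("tell the truth"))
print(mixture_prob("smell the soup"))
```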
16. Solution in theory: Make users correct mistakes
- User says "Recognize speech"
- System hears "Wreck a nice beach"
- User says "Change beach to speech"
- User says "Speech recognition"
- System hears "Speech wreck ignition"
17. Problems
- User doesn't notice the mistake
- User doesn't feel like correcting the mistake now
- User says "Change beach to speech"
- System hears "Change beach to speak"
- User says "Who do I recognize"
- User says "Change who to whom"
- System is confused
18. Why everything you learned is right
- Why everything you learned is right, just not always, and it's harder than you thought it would be
- Prof. Ostendorf isn't cruel
- Language modeling research isn't useless
- I spend my own time working on these problems; I'm just bitter and cynical
19. Why smoothing is useful
- For large vocabulary dictation, we have billions
of words for training
- Most of it is newspaper text, or encyclopedias
- Not too many people are reporters or encyclopedia
authors
- We would kill for 1 million words of real data, and we would use all of it, or nearly so: very low count cutoffs. Good smoothing would help.
20. Why smoothing is useful
- For anything except dictation, the situation is even worse.
- Each new application needs its own language model
training data
- Travel, weather, stocks, news, telephone-based email access: each language model is different
- Requires painful, expensive transcription
- Every piece of data is useful; can't afford high count cutoffs or bad smoothing
21. Speed solution: Multiple pass decoding
- Speed is a major problem for 5-grams, sentence mixture models, and some forms of clustering
- We can use a multiple pass approach
- First pass uses a normal trigram
- Recognizer outputs best 100 hypotheses
- These are then rescored
22. N-best re-recognition
- User says "Swear to tell the truth"
- Recognizer outputs:
  - Swerve to smell the soup (A .0005, L .001)
  - Swear to tell the soup (A .0003, L .0001)
  - Swear to tell the truth (A .0002, L .0001)
  - Swerve to smell the truth (A .0004, L .00002)
- Language model rescores each hypothesis (see the sketch below)
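A hedged sketch of the rescoring step: combine each hypothesis's acoustic score with a second-pass language score in log space and re-rank. The second-pass LM scores and the language weight are invented so that the better model promotes "Swear to tell the truth":

```python
import math

LM_WEIGHT = 1.0  # language model weight, normally tuned on held-out data

# First-pass output: (hypothesis, acoustic prob), as on the slide.
nbest = [
    ("swerve to smell the soup", .0005),
    ("swear to tell the soup", .0003),
    ("swear to tell the truth", .0002),
    ("swerve to smell the truth", .0004),
]

# Hypothetical scores from a bigger second-pass LM (e.g. a 5-gram):
better_lm = {
    "swerve to smell the soup": .0008,
    "swear to tell the soup": .0001,
    "swear to tell the truth": .02,
    "swerve to smell the truth": .0001,
}

def rescore(nbest):
    scored = [(math.log(a) + LM_WEIGHT * math.log(better_lm[h]), h)
              for h, a in nbest]
    return max(scored)[1]

print(rescore(nbest))  # swear to tell the truth
```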
23. N-best re-recognition problems
- What if right hypothesis is not present?
- Exponentially many hypotheses
- "Swear to tell the truth about speech recognition"
- Swear/swerve × to/2/two/too × tell/smell/bell × truth/soup/tooth × speech recognition/beach wreck ignition
- Already 2 × 4 × 3 × 3 × 2 = 144 hypotheses for this one short sentence
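The blow-up is just the product of the alternatives at each position; a tiny sketch using the alternative sets above:

```python
from itertools import product
from math import prod

# Each position in the sentence has several acoustically plausible words;
# the number of distinct hypotheses is the product of the choices.
slots = [
    ["swear", "swerve"],
    ["to", "2", "two", "too"],
    ["tell", "smell", "bell"],
    ["truth", "soup", "tooth"],
    ["speech recognition", "beach wreck ignition"],
]
print(prod(len(s) for s in slots))  # 144 hypotheses for one sentence
print(next(product(*slots)))        # ('swear', 'to', 'tell', ...)
```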
24. Lattice rescoring
- Recognizer outputs a lattice
[Figure: word lattice with alternatives at each position: Swear/Swerve/2, to/too, smell/tell/spell, the, soup/truth/tooth]
- First step is to expand the lattice so that it
contains only trigram level dependencies
25. Lattice with trigrams
[Figure: the expanded lattice, with duplicated nodes such as "Swear to" and "Swerve to" so that each node fixes the two-word history]
Expand the lattice in such a way that trigram probabilities correspond to transition probabilities.
Lattice can't be too big. Harder for 5-grams, but can be done.
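A hedged sketch of the expansion: duplicate each node once per predecessor word, so every node fixes the last two words and every arc can carry a full trigram probability. The toy lattice mirrors the figure above:

```python
# Expand a word lattice so that every node encodes the last TWO words,
# which lets trigram probabilities sit directly on the transitions.

# lattice: word -> list of possible next words (a toy version of slide 24)
lattice = {
    "<s>": ["swear", "swerve", "2"],
    "swear": ["to", "too"], "swerve": ["to", "too"], "2": ["to", "too"],
    "to": ["tell", "smell", "spell"], "too": ["tell", "smell", "spell"],
    "tell": ["the"], "smell": ["the"], "spell": ["the"],
    "the": ["truth", "soup", "tooth"],
    "truth": [], "soup": [], "tooth": [],
}

def expand_to_trigram(lattice, start="<s>"):
    """Nodes become (prev_word, word) pairs; each arc is a full trigram."""
    arcs = []
    frontier = [(start, w) for w in lattice[start]]
    seen = set(frontier)
    while frontier:
        prev, word = frontier.pop()
        for nxt in lattice[word]:
            arcs.append(((prev, word), nxt))  # arc needs P(nxt | prev, word)
            if (word, nxt) not in seen:
                seen.add((word, nxt))
                frontier.append((word, nxt))
    return arcs

arcs = expand_to_trigram(lattice)
print(len(arcs))  # many more arcs than the original lattice had words
```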
26. Lattice/n-best problems
- All rescoring has to be done at the end of recognition. This leads to latency: the time between when the user stops speaking and when he gets a response. Only fast techniques can be used for most apps.
- Recognizer output can change:
  Swerve to smell the soup about beach → Swear to tell the truth about speech
- Right answer might not be anywhere in the lattice
27. Lattice/n-best advantages
- Great for doing research!
- Recognizer and language model can be completely
separate
- Very complex models can be used
- Used by some products
28. Clusters in practice
- IBM clustering: the hard way to do clustering; it interacts badly with the phonetic tree
- Alternative: use a trigram that backs off to a bigram that backs off to P(Z|Y)
- By picking the right form for clustering, we can integrate with the speech recognizer
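A hedged sketch of that backoff chain with invented tables and weights: try the word trigram, fall back to the word bigram, and only then to the clustered P(Z|Y):

```python
# Backoff chain ending in a clustered distribution: word trigram ->
# word bigram -> P(Z | Y), where Z and Y are word classes. Each level is
# a standard backoff step, so it composes with a phonetic-tree decoder.
# All tables, classes, and backoff weights here are invented.

trigram = {("tell", "the", "truth"): 0.1}
bigram = {("the", "truth"): 0.02}
class_bigram = {("ART", "STATEMENT"): 0.2, ("ART", "FOOD"): 0.1}  # P(Z | Y)
classes = {"the": "ART", "truth": "STATEMENT", "soup": "FOOD"}
ALPHA_TRI, ALPHA_BI = 0.4, 0.3  # invented backoff weights

def prob(w, w2, w1):
    """P(w | w2 w1) with backoff to the bigram, then to P(Z | Y)."""
    if (w2, w1, w) in trigram:
        return trigram[(w2, w1, w)]
    if (w1, w) in bigram:
        return ALPHA_TRI * bigram[(w1, w)]
    return ALPHA_TRI * ALPHA_BI * class_bigram.get(
        (classes[w1], classes[w]), 1e-6)

print(prob("truth", "tell", "the"))   # 0.1: word trigram hit
print(prob("truth", "swear", "the"))  # 0.008: backs off to word bigram
print(prob("soup", "smell", "the"))   # 0.012: backs off to P(Z | Y)
```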
29. Phonetic Tree with backoff
[Figure: three phonetic trees with branch probabilities: one for the trigram context "Tell the", one for the bigram context "the", and one for the unigram, linked by backoff transitions]
30. Clustered phonetic tree with backoff
[Figure: the same structure, except the unigram tree is replaced by a tree for the clustered distribution P(Z|Y): the trigram backs off to the bigram, which backs off to P(Z|Y)]
31. Conclusion: Everything you learned is useful, just hard
- Everything you learned is useful
- But it's much more work to use it in practice than you thought
- Need to pay careful attention to the speech
recognizer or other application to integrate it