Title: Industrial Language Modeling, A Pessimist's Perspective: The glass is empty
1. Industrial Language Modeling: A Pessimist's Perspective: The glass is empty
- Joshua Goodman
- Microsoft Research
2. Why language model research isn't useful for industry
- 1) Most language model research cannot be combined with current recognizers (efficiency issues)
- 2) Solutions to theoretical problems rather than
real problems (smoothing, caching)
- 3) Most language model research ignores speed and
size constraints
3. How a Speech Recognizer Works
- In practice, it's not so simple
- Acoustic scoring and language model scoring are
tightly integrated for thresholding
- Otherwise, we would need to consider ALL possible
word sequences, not just likely ones
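To make the integration concrete, here is a minimal, hedged beam-search sketch in Python. The frame-level word scores, the uniform LM, and the beam width are all invented for illustration; a real decoder works over phone and HMM states, not whole words:

```python
import math

# Toy beam search: extend hypotheses word by word, scoring each extension
# with acoustic score + language model score, and pruning (thresholding)
# against the best hypothesis so far. All scores are log-probabilities.

BEAM = 5.0  # prune hypotheses more than this many log-units behind the best

def decode(frames, lm_logprob, beam=BEAM):
    """frames: list of dicts mapping word -> acoustic log-prob for that slot."""
    hyps = {(): 0.0}  # word sequence -> combined log score
    for acoustic in frames:
        extended = {}
        for words, score in hyps.items():
            for word, a_score in acoustic.items():
                # the language model is consulted DURING search, not after
                l_score = lm_logprob(words, word)
                extended[words + (word,)] = score + a_score + l_score
        best = max(extended.values())
        # thresholding: without the LM folded in here, we could not prune,
        # and would have to carry ALL word sequences forward
        hyps = {w: s for w, s in extended.items() if s >= best - beam}
    return max(hyps, key=hyps.get)

# Hypothetical uniform LM and two frames of acoustic scores:
lm = lambda history, word: math.log(0.1)
frames = [{"swear": -1.0, "swerve": -1.2}, {"to": -0.5, "too": -0.6}]
print(decode(frames, lm))  # ('swear', 'to')
```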
4. 5-grams
- 5-grams have lower perplexity than trigrams
5. Speech recognizer slowdowns
- Speech recognizer uses tricks (dynamic
programming) to merge hypotheses
- Trigram: only the last two words matter, so swear to tell the, swear too tell the, swerve to tell the, and swerve too tell the all merge into "tell the", and swerve to smell the and swerve too smell the merge into "smell the"
- Fivegram: no merging; all six hypotheses stay distinct (see the sketch below)
- Fivegram is not worth the extra cost
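A small sketch of the dynamic-programming merge under invented scores: with an n-gram model, hypotheses that agree on their last n-1 words are interchangeable for all future scoring, so the decoder keeps only the best of each class:

```python
# With an n-gram LM, only the last n-1 words of a hypothesis affect any
# future probability, so hypotheses sharing that suffix can be merged
# (keeping the best-scoring one). Scores here are made up.

hyps = {
    ("swear", "to", "tell", "the"): -4.1,
    ("swerve", "to", "smell", "the"): -4.3,
    ("swear", "too", "tell", "the"): -5.0,
    ("swerve", "too", "smell", "the"): -5.2,
    ("swerve", "to", "tell", "the"): -5.5,
    ("swerve", "too", "tell", "the"): -5.9,
}

def merge(hyps, n):
    """Collapse hypotheses that agree on their last n-1 words."""
    states = {}
    for words, score in hyps.items():
        state = words[-(n - 1):]
        if state not in states or score > states[state]:
            states[state] = score
    return states

print(len(merge(hyps, 3)))  # trigram: 2 states ('tell the', 'smell the')
print(len(merge(hyps, 5)))  # fivegram: all 6 hypotheses stay distinct
```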
6. Nasty details matter
- Many nasty details about speech recognizers
affect language modeling
- Most speech recognizers use a phonetic tree
- Tree representation works badly with some
language modeling techniques, especially
clustering
7. Phonetic Tree Representation
[Figure: phonetic tree of the words that can follow "Tell the"; words sharing a prefix (TRUTH, TRULY, TO; SOUP, SOON; SAF, SAN) share branches]
8. Phonetic Tree Representation
P(X | Tell The): TRUTH .1, TRULY .05, TO .01, SOUP .01, SOON .003, SAF .02, SAN .002
[Figure: the same phonetic tree, now with a probability attached to each branch]
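The branch probabilities come from distributing word probabilities over shared prefixes: each edge carries its subtree's mass divided by its parent's mass, so multiplying along a path recovers the word's probability. A minimal sketch, using letter spellings as stand-ins for phone sequences:

```python
from collections import defaultdict

# Push P(word | "Tell the") down onto the edges of a prefix (phonetic) tree:
# each edge carries (mass of its subtree) / (mass of its parent's subtree),
# so multiplying along a word's path recovers that word's probability.

probs = {"truth": .1, "truly": .05, "to": .01, "soup": .01,
         "soon": .003, "saf": .02, "san": .002}

def edge_probs(probs):
    mass = defaultdict(float)
    for word, p in probs.items():
        for i in range(len(word) + 1):
            mass[word[:i]] += p  # every prefix accumulates its subtree's mass
    return {pre: mass[pre] / mass[pre[:-1]] for pre in mass if pre}

edges = edge_probs(probs)

# Multiply along t, tr, tru, trut, truth (times the tree's total mass,
# since the slide lists only a few of the vocabulary's words):
p = sum(probs.values())
for i in range(1, len("truth") + 1):
    p *= edges["truth"[:i]]
print(round(p, 6))  # 0.1, i.e. P(truth | Tell the)
```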
9. Consider clustering methods
- IBM clustering
- Interpolate word trigram with class trigram
- Pibm(wi | wi-2 wi-1) = λ P(wi | wi-2 wi-1) + (1-λ) P(class(wi) | class(wi-2) class(wi-1)) × P(wi | class(wi))
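A minimal sketch of the interpolation above; the component tables, the classes, and λ = 0.7 are stand-ins, not trained values:

```python
# IBM-style clustering: interpolate a word trigram with a class trigram
# times a class-emission probability. All numbers below are invented.

LAMBDA = 0.7  # interpolation weight (would be tuned on held-out data)

def p_word_trigram(w, w2, w1):
    return {("truth", "tell", "the"): 0.1}.get((w, w2, w1), 0.001)

def p_class_trigram(c, c2, c1):
    return {("STATEMENT", "COMMUNICATE", "ART"): 0.2}.get((c, c2, c1), 0.01)

def p_emit(w, c):
    return {("truth", "STATEMENT"): 0.05}.get((w, c), 0.0001)

classes = {"truth": "STATEMENT", "tell": "COMMUNICATE", "the": "ART"}

def p_ibm(w, w2, w1):
    return (LAMBDA * p_word_trigram(w, w2, w1)
            + (1 - LAMBDA)
              * p_class_trigram(classes[w], classes[w2], classes[w1])
              * p_emit(w, classes[w]))

print(p_ibm("truth", "tell", "the"))  # 0.7*0.1 + 0.3*0.2*0.05 = 0.073
```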
10. Clustered Phonetic Tree Representation ???
Pibm(truth | tell the) = λ P(truth | tell the) + (1-λ) P(STATEMENT | COMMUNICATE ART.) × P(truth | STATEMENT)
[Figure: the phonetic tree again, but every branch probability is now a "?": the interpolated cluster probabilities do not decompose onto the tree's edges]
11. Count Cutoffs: Why smoothing doesn't matter
- Most dictation systems are trained on billions of
words of training data, which would use about a
gigabyte
- I don't have a gigabyte
- Solution: count cutoffs
- Smoothing only matters on small counts
- (OK, it really does matter, read my paper, but that's not the point I'm making today)
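A hedged sketch of count cutoffs with invented counts: every n-gram at or below the cutoff is dropped, and its probability is later served by the backoff model:

```python
# Count cutoffs: drop n-grams whose training count is at or below the
# cutoff; their probability mass is served by the backoff model instead.
# High cutoffs shrink the model enormously, and also mean smoothing
# (which mostly affects small counts) has little left to do.

trigram_counts = {
    ("tell", "the", "truth"): 4321,
    ("smell", "the", "soup"): 7,
    ("swear", "too", "tell"): 1,
    ("wreck", "a", "nice"): 2,
}

def apply_cutoff(counts, cutoff):
    return {ng: c for ng, c in counts.items() if c > cutoff}

print(len(apply_cutoff(trigram_counts, 0)))  # 4: keep everything
print(len(apply_cutoff(trigram_counts, 2)))  # 2: low-count n-grams gone
```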
12. Caching doesn't matter either
- Caching gets huge wins in theory
- In practice, errors tend to get locked in
- User says "Recognize speech"
- System hears "Wreck a nice beach"
- User says "Speech recognition"
- System hears "Beach wreck ignition"
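For concreteness, a sketch of a unigram cache interpolated with a static model. All weights and probabilities are invented; the point is that the cache sees the recognizer's output, so a misrecognized word gets boosted and the error locks in:

```python
from collections import Counter

CACHE_WEIGHT = 0.1  # illustrative interpolation weight

class CacheLM:
    """Interpolate a static LM with a unigram cache of recent output."""

    def __init__(self, static_lm):
        self.static_lm = static_lm
        self.cache = Counter()

    def observe(self, word):
        # In a dictation system the cache sees what the RECOGNIZER output,
        # not what the user said, so errors reinforce themselves.
        self.cache[word] += 1

    def prob(self, word, history):
        p_cache = self.cache[word] / max(1, sum(self.cache.values()))
        return ((1 - CACHE_WEIGHT) * self.static_lm(word, history)
                + CACHE_WEIGHT * p_cache)

lm = CacheLM(lambda w, h: 0.001)  # stand-in static trigram
for w in "wreck a nice beach".split():  # misrecognized output enters cache
    lm.observe(w)
print(lm.prob("beach", ()))   # boosted: the error is now MORE likely
print(lm.prob("speech", ()))  # the correct word gets no cache help
```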
13. Don't forget size
- Space is also ignored
- 5-grams, most clustering, sentence mixture
models, skipping models all use more space
- Example: clustering
- Pibm(wi | wi-2 wi-1) = λ P(wi | wi-2 wi-1) + (1-λ) P(class(wi) | class(wi-2) class(wi-1)) × P(wi | class(wi))
- Need to store two models instead of one (see the rough numbers below)
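A back-of-the-envelope sketch of the storage cost; every quantity below is an assumption for illustration, not a measurement:

```python
# Rough parameter counting for the storage cost of storing two models
# instead of one. All numbers are invented for illustration.

word_trigrams = 40_000_000   # word trigram entries surviving cutoffs
class_trigrams = 5_000_000   # extra class trigram table
emissions = 60_000           # extra P(word | class) entries
bytes_per_entry = 4

plain = word_trigrams * bytes_per_entry
clustered = (word_trigrams + class_trigrams + emissions) * bytes_per_entry
print(f"word trigram only:      {plain / 1e6:.0f} MB")
print(f"plus clustering tables: {clustered / 1e6:.0f} MB")
```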
14. Summary: Why language model research is useless
- Speech recognition mechanics interact badly with many techniques, including 5-grams and clustering
- Cache errors get locked in
- Smoothing doesn't matter with high count cutoffs
- In practice, the details are important
- In practice, trigram models are just too good
15. Sentence Mixture Models in a speech recognizer
- Sentence mixture models are a promising new technique
- Different models for each sentence type
- With 5 sentence types, 5 times as much work:
  tell the truth | s1 (.01)    smell the soup | s1 (.02)
  tell the truth | s2 (.01)    smell the soup | s2 (.03)
  tell the truth | s3 (.03)    smell the soup | s3 (.03)
  tell the truth | s4 (.01)    smell the soup | s4 (.04)
  tell the truth | s5 (.02)    smell the soup | s5 (.01)
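A hedged sketch of the extra work: the mixture probability sums over sentence types, so each hypothesis is scored once per type. Priors and probabilities are invented, with the per-type table mirroring the one above:

```python
# Sentence mixture model: P(sentence) = sum over sentence types s of
# P(s) * P(sentence | s). Scoring one hypothesis costs one pass per type.

type_priors = [0.3, 0.25, 0.2, 0.15, 0.1]  # invented P(s) for 5 types

# P(hypothesis | s) for each type, mirroring the table on the slide:
p_given_type = {
    "tell the truth": [.01, .01, .03, .01, .02],
    "smell the soup": [.02, .03, .03, .04, .01],
}

def mixture_prob(sentence):
    per_type = p_given_type[sentence]  # 5 model evaluations, not 1
    return sum(prior * p for prior, p in zip(type_priors, per_type))

print(mixture_prob("tell the truth"))
print(mixture_prob("smell the soup"))
```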
16. Solution in theory: Make users correct mistakes
- User says "Recognize speech"
- System hears "Wreck a nice beach"
- User says "Change beach to speech"
- User says "Speech recognition"
- System hears "Speech wreck ignition"
17. Problems
- User doesn't notice the mistake
- User doesn't feel like correcting the mistake now
- User says "Change beach to speech"
- System hears "Change beach to speak"
- User says "Who do I recognize"
- User says "Change who to whom"
- System is confused
18. Why everything you learned is right
- Why everything you learned is right, just not always, and it's harder than you thought it would be
- Prof. Ostendorf isn't cruel
- Language modeling research isn't useless
- I spend my own time working on these problems; I'm just bitter and cynical
19. Why smoothing is useful
- For large vocabulary dictation, we have billions
of words for training
- Most of it is newspaper text, or encyclopedias
- Not too many people are reporters or encyclopedia
authors
- We would kill for 1 million words of real data, and we would use all of it, or nearly so: very low count cutoffs. Good smoothing would help.
20. Why smoothing is useful
- For anything except dictation, the situation is even worse.
- Each new application needs its own language model
training data
- Travel, weather, stocks, news, telephone-based email access: each language model is different
- Requires painful, expensive transcription
- Every piece of data is useful; can't afford high count cutoffs or bad smoothing
21. Speed solution: Multiple pass decoding
- Speed is a major problem for 5-grams, sentence mixture models, and some forms of clustering
- We can use a multiple pass approach
- First pass uses a normal trigram
- Recognizer outputs best 100 hypotheses
- These are then rescored
22. N-best re-recognition
- User says "Swear to tell the truth"
- Recognizer outputs:
  - Swerve to smell the soup (A .0005, L .001)
  - Swear to tell the soup (A .0003, L .0001)
  - Swear to tell the truth (A .0002, L .0001)
  - Swerve to smell the truth (A .0004, L .00002)
- Language model rescores each hypothesis (see the sketch below)
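A hedged sketch of the rescoring step: combine each hypothesis's acoustic score with a second-pass language score in log space and re-rank. The second-pass LM scores and the language weight are invented so that the better model promotes "Swear to tell the truth":

```python
import math

LM_WEIGHT = 1.0  # language model weight, normally tuned on held-out data

# First-pass output: (hypothesis, acoustic prob), as on the slide.
nbest = [
    ("swerve to smell the soup", .0005),
    ("swear to tell the soup", .0003),
    ("swear to tell the truth", .0002),
    ("swerve to smell the truth", .0004),
]

# Hypothetical scores from a bigger second-pass LM (e.g. a 5-gram):
better_lm = {
    "swerve to smell the soup": .0008,
    "swear to tell the soup": .0001,
    "swear to tell the truth": .02,
    "swerve to smell the truth": .0001,
}

def rescore(nbest):
    scored = [(math.log(a) + LM_WEIGHT * math.log(better_lm[h]), h)
              for h, a in nbest]
    return max(scored)[1]

print(rescore(nbest))  # swear to tell the truth
```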
23. N-best re-recognition problems
- What if right hypothesis is not present?
- Exponentially many hypotheses
- "Swear to tell the truth about speech recognition"
- Swear/swerve × to/2/two/too × tell/smell/bell × truth/soup/tooth × speech recognition/beach wreck ignition
- Already 2 × 4 × 3 × 3 × 2 = 144 hypotheses for this one short sentence
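The blow-up is just the product of the alternatives at each position; a tiny sketch using the alternative sets above:

```python
from itertools import product
from math import prod

# Each position in the sentence has several acoustically plausible words;
# the number of distinct hypotheses is the product of the choices.
slots = [
    ["swear", "swerve"],
    ["to", "2", "two", "too"],
    ["tell", "smell", "bell"],
    ["truth", "soup", "tooth"],
    ["speech recognition", "beach wreck ignition"],
]
print(prod(len(s) for s in slots))  # 144 hypotheses for one sentence
print(next(product(*slots)))        # ('swear', 'to', 'tell', ...)
```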
24. Lattice rescoring
- Recognizer outputs a lattice
[Figure: word lattice with alternatives at each position: Swear/Swerve/2, to/too, smell/tell/spell, the, soup/truth/tooth]
- First step is to expand the lattice so that it
contains only trigram level dependencies
25. Lattice with trigrams
[Figure: the expanded lattice, with duplicated nodes such as "Swear to" and "Swerve to" so that each node fixes the two-word history]
Expand the lattice in such a way that trigram probabilities correspond to transition probabilities.
Lattice can't be too big. Harder for 5-grams, but can be done.
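A hedged sketch of the expansion: duplicate each node once per predecessor word, so every node fixes the last two words and every arc can carry a full trigram probability. The toy lattice mirrors the figure above:

```python
# Expand a word lattice so that every node encodes the last TWO words,
# which lets trigram probabilities sit directly on the transitions.

# lattice: word -> list of possible next words (a toy version of slide 24)
lattice = {
    "<s>": ["swear", "swerve", "2"],
    "swear": ["to", "too"], "swerve": ["to", "too"], "2": ["to", "too"],
    "to": ["tell", "smell", "spell"], "too": ["tell", "smell", "spell"],
    "tell": ["the"], "smell": ["the"], "spell": ["the"],
    "the": ["truth", "soup", "tooth"],
    "truth": [], "soup": [], "tooth": [],
}

def expand_to_trigram(lattice, start="<s>"):
    """Nodes become (prev_word, word) pairs; each arc is a full trigram."""
    arcs = []
    frontier = [(start, w) for w in lattice[start]]
    seen = set(frontier)
    while frontier:
        prev, word = frontier.pop()
        for nxt in lattice[word]:
            arcs.append(((prev, word), nxt))  # arc needs P(nxt | prev, word)
            if (word, nxt) not in seen:
                seen.add((word, nxt))
                frontier.append((word, nxt))
    return arcs

arcs = expand_to_trigram(lattice)
print(len(arcs))  # many more arcs than the original lattice had words
```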
26. Lattice/n-best problems
- All rescoring has to be done at the end of recognition. This leads to latency: the time between when the user stops speaking and when he gets a response. Only fast techniques can be used for most apps.
- Recognizer output can change:
  Swerve to smell the soup about beach → Swear to tell the truth about speech
- Right answer might not be anywhere in the lattice
27. Lattice/n-best advantages
- Great for doing research!
- Recognizer and language model can be completely
separate
- Very complex models can be used
- Used by some products
28. Clusters in practice
- IBM clustering: the hard way to do clustering; it interacts badly with the phonetic tree
- Alternative: use a trigram that backs off to a bigram that backs off to P(Z|Y)
- By picking the right form for clustering, we can integrate with the speech recognizer
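A hedged sketch of that backoff chain with invented tables and weights: try the word trigram, fall back to the word bigram, and only then to the clustered P(Z|Y):

```python
# Backoff chain ending in a clustered distribution: word trigram ->
# word bigram -> P(Z | Y), where Z and Y are word classes. Each level is
# a standard backoff step, so it composes with a phonetic-tree decoder.
# All tables, classes, and backoff weights here are invented.

trigram = {("tell", "the", "truth"): 0.1}
bigram = {("the", "truth"): 0.02}
class_bigram = {("ART", "STATEMENT"): 0.2, ("ART", "FOOD"): 0.1}  # P(Z | Y)
classes = {"the": "ART", "truth": "STATEMENT", "soup": "FOOD"}
ALPHA_TRI, ALPHA_BI = 0.4, 0.3  # invented backoff weights

def prob(w, w2, w1):
    """P(w | w2 w1) with backoff to the bigram, then to P(Z | Y)."""
    if (w2, w1, w) in trigram:
        return trigram[(w2, w1, w)]
    if (w1, w) in bigram:
        return ALPHA_TRI * bigram[(w1, w)]
    return ALPHA_TRI * ALPHA_BI * class_bigram.get(
        (classes[w1], classes[w]), 1e-6)

print(prob("truth", "tell", "the"))   # 0.1: word trigram hit
print(prob("truth", "swear", "the"))  # 0.008: backs off to word bigram
print(prob("soup", "smell", "the"))   # 0.012: backs off to P(Z | Y)
```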
29. Phonetic Tree with backoff
[Figure: three phonetic trees with branch probabilities: one for the trigram context "Tell the", one for the bigram context "the", and one for the unigram, linked by backoff transitions]
30. Clustered phonetic tree with backoff
[Figure: the same structure, except the unigram tree is replaced by a tree for the clustered distribution P(Z|Y): the trigram backs off to the bigram, which backs off to P(Z|Y)]
31. Conclusion: Everything you learned is useful, just hard
- Everything you learned is useful
- But it's much more work to use it in practice than you thought
- Need to pay careful attention to the speech
recognizer or other application to integrate it