Title: Bootstrap Estimates For Confidence Intervals In ASR Performance Evaluation
1 Bootstrap Estimates For Confidence Intervals In ASR Performance Evaluation
Presented by Patty Liu
2 Introduction (1/2)
- The most popular performance measure in automatic speech recognition is the word error rate (WER): the edit distance, or Levenshtein distance, between the recognizer output and the reference transcription, divided by the number of words in the reference sentence.
- The edit distance is the minimum number of insertion, substitution, and deletion operations necessary to transform one sentence into the other.
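The edit distance described above can be computed with standard dynamic programming; a minimal sketch (function names are my own, not from the slides):

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions
    turning the word list `hyp` into the word list `ref`."""
    d = list(range(len(hyp) + 1))          # DP row for the empty ref prefix
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,                  # deletion
                d[j - 1] + 1,              # insertion
                prev_diag + (r != h),      # substitution (free on a match)
            )
    return d[len(hyp)]

def word_error_rate(ref, hyp):
    """WER = edit distance / number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)
```

For example, `word_error_rate("a b c", "a x c")` gives 1/3: one substitution over three reference words.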
3 Introduction (2/2)
- The word error rate is an attractive metric because it is intuitive, it corresponds well with application scenarios, and (unlike sentence error rate) it is sensitive to small changes.
- On the downside, it is not very amenable to statistical analysis:
  - W is really a rate (number of errors per spoken word), not a probability (chance of misrecognizing a word).
  - Moreover, error events do not occur independently.
- Nevertheless, hardly any publication reports figures from significance tests. Instead it is common to report only the absolute and relative change in word error rate.
4 Motivation
- What error rate do we have to expect when changing to a different test set?
- How reliable is an observed improvement of a system?
5 Bootstrap (1/5)
- The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates.
- The core idea is to create replications of a statistic by random sampling from the data set with replacement (so-called Monte Carlo estimates).
- We assume that the test corpus can be divided into s segments for which the recognition results are independent, so that the number of errors can be evaluated independently per segment.
- For speaker-independent CSR it seems appropriate to choose the set of all utterances of one speaker as a segment.
6 Bootstrap (2/5)
- For each sentence i = 1 ... s we record the number of words n_i and the number of errors e_i, giving the data set X = {(n_1, e_1), ..., (n_s, e_s)}.
- The following procedure is repeated B times (typically B = 10^3 ... 10^4): for b = 1 ... B we randomly select with replacement s pairs from X to generate a bootstrap sample X*_b.
- The sample will contain several of the original sentences multiple times, while others are missing.
7 Bootstrap (3/5)
8 Bootstrap (4/5)
- Then we calculate the word error rate on this sample: W*_b = (sum of errors in X*_b) / (sum of words in X*_b).
- The W*_b are called bootstrap replications of W. They can be thought of as samples of the word error rate from an ensemble of virtual test sets.
- The uncertainty of W can be quantified by the standard error, which has the bootstrap estimate se_boot = sqrt( (1/(B-1)) * sum_b (W*_b - W_boot)^2 ), where W_boot is the mean of the replications.
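The replication loop and the bootstrap standard-error estimate can be sketched as follows (a toy illustration; `segments` holds one (word count, error count) pair per independent segment, and the function name is my own):

```python
import math
import random

def bootstrap_wer(segments, B=1000, seed=0):
    """segments: list of (n_words, n_errors) pairs, one per independent segment.
    Returns the replications W*_b, their mean W_boot, and the bootstrap
    estimate of the standard error."""
    rng = random.Random(seed)
    s = len(segments)
    reps = []
    for _ in range(B):
        sample = rng.choices(segments, k=s)      # draw s segments with replacement
        words = sum(n for n, e in sample)
        errors = sum(e for n, e in sample)
        reps.append(errors / words)              # W*_b on this virtual test set
    w_boot = sum(reps) / B
    se_boot = math.sqrt(sum((w - w_boot) ** 2 for w in reps) / (B - 1))
    return reps, w_boot, se_boot
```

Because each replication resamples whole segments, within-segment error dependence is preserved, which is the point of the segment assumption on the previous slide.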
9 Bootstrap (5/5)
- For large s the distribution of W is approximately Gaussian. In this case the true word error rate lies with 90% probability in the interval W_boot ± 1.645 · se_boot.
- Even when s is small, we can use the table of replications W*_b to determine percentiles, which in turn can serve as confidence intervals.
- For a chosen error threshold a, let W_lo be the ceil(a·B)-th smallest value in the sorted list of replications, and W_hi be the ceil(a·B)-th largest.
- The interval [W_lo, W_hi] contains the true value of W with probability 1 − 2a. This is the bootstrap-t confidence interval.
- For example, with B = 1000 and a = 0.05, we sort the list of W*_b and use the values at positions 50 and 950.
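The percentile interval described above amounts to sorting the replications and reading off the k-th smallest and k-th largest values; a minimal sketch:

```python
import math

def percentile_interval(reps, a=0.05):
    """Confidence interval with coverage 1 - 2a from bootstrap replications:
    take the ceil(a*B)-th smallest and ceil(a*B)-th largest values."""
    reps = sorted(reps)
    B = len(reps)
    k = math.ceil(a * B)          # e.g. k = 50 for B = 1000, a = 0.05
    return reps[k - 1], reps[B - k]
```

With B = 1000 and a = 0.05 this picks the 50th smallest and 50th largest replications, matching the example on the slide.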
10 EXAMPLE: NAB (North American Business News) ERROR RATES
[Figure: larger test sets yield a smaller standard error and narrower confidence intervals.]
11 EXAMPLE: NAB ERROR RATES
[Figure: distribution of the bootstrap replications around W_boot with the 90% confidence interval C_boot, 5% of the replications in each tail.]
12 Comparing Systems (1/2)
- Competing algorithms are usually tested on the same data.
- Given two recognition systems A and B with word error counts e_A and e_B, the (absolute) difference in word error rate is ΔW = W_A − W_B.
- We can apply the same bootstrap technique to the quantity ΔW as we did to W. The crucial point is that we calculate the difference in the number of errors of the two systems on identical bootstrap samples.
- The important consequence is that ΔW has much lower variance than the W of either system.
13 Comparing Systems (2/2)
- In addition to the two-tailed confidence interval for ΔW, we may be more interested in whether system B is a real improvement over system A.
- Let T(x) be the step function, which is one for x > 0 and zero otherwise. Then poi = (1/B) · sum_b T(ΔW*_b) is the relative number of bootstrap samples which favor system B. We call this measure the probability of improvement (poi).
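The paired resampling described above can be sketched as follows (a toy illustration; it assumes both systems were scored on the same segments, so the word counts of system A are used for both):

```python
import random

def paired_bootstrap(segments_a, segments_b, B=1000, seed=0):
    """segments_a / segments_b: per-segment (n_words, n_errors) for systems
    A and B, aligned on the same test segments. Returns the ΔW replications
    and the probability of improvement of B over A."""
    rng = random.Random(seed)
    s = len(segments_a)
    deltas = []
    for _ in range(B):
        idx = [rng.randrange(s) for _ in range(s)]   # same sample for BOTH systems
        words = sum(segments_a[i][0] for i in idx)
        err_a = sum(segments_a[i][1] for i in idx)
        err_b = sum(segments_b[i][1] for i in idx)
        deltas.append((err_a - err_b) / words)       # ΔW*_b > 0 favors system B
    poi = sum(d > 0 for d in deltas) / B             # probability of improvement
    return deltas, poi
```

Drawing the same indices for both systems is what makes the comparison paired and gives ΔW its low variance.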
14 Example: System Comparison
- The system used for the examples described earlier now plays the role of system B, while a second system with slightly different acoustic models is system A.
- System B is apparently better by 0.3% to 0.4% absolute in terms of word error rate. The probability of improvement, ranging between 82% and 95%, indicates that we can be moderately confident that this reflects a real superiority of system B, but we should not be too surprised if a fourth test set were favorable to system A.
- The notable advantage of this differential analysis is that the standard error of ΔW is approximately one third of the standard error of W.
15 Example: System Comparison
16 Example: System Comparison
17 Conclusion
- We would like to emphasize that what we propose
is not a new metric for performance evaluation,
but a refined analysis of an established metric
(word error rate). - The proposed method seems attractive, because it
is easy to use, it makes no assumption about the
distribution of errors, results are directly
related to word error rate, and the probability
of improvement provides an intuitive figure of
significance.
18 Open Vocabulary Speech Recognition with Flat Hybrid Models
19 Introduction (1/3)
- Large vocabulary speech recognition systems operate with a fixed, large but finite vocabulary.
- Systems operating with a fixed vocabulary are bound to encounter so-called out-of-vocabulary (OOV) words. These are problematic for a number of reasons:
  - An OOV word will never be recognized (even if the user repeats it), but will be substituted by some in-vocabulary word.
  - Neighboring words are also often misrecognized.
  - Later processing stages (e.g. translation, understanding, document retrieval) cannot recover from OOV errors.
  - OOV words are often content words.
20 Introduction (2/3)
- The decision rule and knowledge sources used by a large vocabulary speech recognition system:
  - acoustic model: relates acoustic features x to phoneme sequences, typically an HMM (vocabulary independent)
  - pronunciation lexicon: assigns one (or more) phoneme string(s) to each word
  - language model: assigns probabilities to sentences over a finite set of words
21 Introduction (3/3)
- For open vocabulary recognition, we propose to conceptually abandon words in favor of individual letters. Unlike words, the set of different letters G in a writing system is finite.
- Concerning the link to the acoustic realization, the set of phonemes can also be considered finite for a given language. These considerations suggest the following model:
  - acoustic model
  - pronunciation model: provides a pronunciation for any string of letters
  - sub-lexical language model: assigns probabilities to character strings
22 Introduction (3/3)
- Alternatively, the pronunciation model and sub-lexical language model can be combined into a joint graphonemic model.
23 Grapheme-to-Phoneme Conversion (1/3)
- Obviously this approach to open-vocabulary recognition is strongly connected to grapheme-to-phoneme conversion (G2P), where we seek the most likely pronunciation for a given orthographic form.
- The underlying assumption of this model is that, for each word, its orthographic form and its pronunciation are generated by a common sequence of graphonemic units.
- Each unit is a pair of a letter sequence and a phoneme sequence of possibly different lengths.
- We refer to such a unit as a graphone. (Various other names have been suggested: grapheme-phoneme joint multigram, graphoneme, grapheme-to-phoneme correspondence (GPC), chunk.)
24 Grapheme-to-Phoneme Conversion (2/3)
- The joint probability distribution over orthographic forms and pronunciations is thus reduced to a probability distribution over graphone sequences, which we model using a standard M-gram.
- The complexity of this model depends on two parameters: the range M of the M-gram model and the allowed size of the graphones. We allow the number of letters and phonemes in a graphone to vary between zero and an upper limit L.
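As a toy illustration of the model (the segmentation, tokens, and probabilities below are invented for this sketch, not taken from the paper), a graphone sequence and its bigram (M = 2) score might look like:

```python
# A graphone is a (letter-sequence, phoneme-sequence) pair; the joint
# probability of an orthography/pronunciation pair is modeled via the
# M-gram probability of its graphone sequence (here M = 2, made-up numbers).
graphones = [("mi", "m ih"), ("x", "k s"), ("ing", "ih ng")]  # one segmentation of "mixing"

bigram = {  # p(g_k | g_{k-1}); "<s>" marks the start of the sequence
    ("<s>", ("mi", "m ih")): 0.10,
    (("mi", "m ih"), ("x", "k s")): 0.20,
    (("x", "k s"), ("ing", "ih ng")): 0.30,
}

p = 1.0
prev = "<s>"
for g in graphones:
    p *= bigram[(prev, g)]   # chain rule over the graphone sequence
    prev = g
# p ≈ 0.006 for this single segmentation; the full model sums over
# all segmentations of the word into graphones.
```

The point of the structure is that both the spelling and the pronunciation fall out of the same hidden graphone sequence.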
25 Grapheme-to-Phoneme Conversion (3/3)
- We were able to verify that shorter units in combination with longer-range M-gram modeling yield the best results for the grapheme-to-phoneme task.
26 Models for Open Vocabulary ASR (1/2)
- We combine the lexical entries with the (sub-lexical) graphones derived from grapheme-to-phoneme conversion to form a unified set of recognition units.
- From the perspective of OOV detection, the sub-lexical units Q have been called fragments or fillers.
- By treating words and fragments uniformly, the decision rule becomes a maximization over mixed sequences of units u.
- The sequence model p(u) can be characterized as hybrid because it contains mixed M-grams containing both words and fragments.
- It can also be characterized as flat, as opposed to structured approaches that predict and model OOV words with different models.
27 Models for Open Vocabulary ASR (2/2)
- A shortcoming of this model is that it leaves undetermined where word boundaries (i.e. blanks) should be placed.
- The heuristic used in this study is to compose any consecutive sub-lexical units into a single word and to treat all lexical units as individual words.
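The boundary heuristic above can be sketched in a few lines (the function name and the fragment predicate are assumptions of this sketch; how fragments are marked in a real system may differ):

```python
def join_units(units, is_fragment):
    """Heuristic from the slide: consecutive sub-lexical units are composed
    into one word; lexical units are kept as individual words.
    `units` is the recognized unit sequence; `is_fragment(u)` tells whether
    a unit is sub-lexical."""
    words, buffer = [], []
    for u in units:
        if is_fragment(u):
            buffer.append(u)                 # keep collecting fragments
        else:
            if buffer:                       # a word boundary before a lexical unit
                words.append("".join(buffer))
                buffer = []
            words.append(u)
    if buffer:                               # trailing fragments form a final word
        words.append("".join(buffer))
    return words
```

For example, `join_units(["the", "ab", "ro", "mowitz", "case"], lambda u: u in {"ab", "ro", "mowitz"})` yields `["the", "abromowitz", "case"]`.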
28 Experiments
- We have three different established vocabularies with 5, 20 and 64 thousand words, each corresponding to the most frequent words in the language model training corpus.
- For each baseline pronunciation dictionary a grapheme-to-phoneme model was trained with different length constraints L, using EM training with an M-gram length of 3. The recognition vocabulary was then augmented with all graphones inferred by this procedure.
- For quantitative analysis, we evaluated both word error rate (WER) and letter error rate (LER). Letter error rate is more favorable with respect to almost-correct words and corresponds to the correction effort in dictation applications.
29 Experiments
30 Experiments: Bootstrap Analysis of OOV Impact
- We are particularly interested in the effect of OOV words on the recognition error rate. This effect could be studied by varying the system's vocabulary.
- However, changing the recognition system in this way might introduce secondary effects such as increased confusability between vocabulary entries.
- Alternatively, we can alter the test set. By extending the bootstrap technique, we create an ensemble of virtual test corpora with a varying number of OOV words, and the respective WERs. This distribution allows us to study the correlation between OOV rate and word error rate without changing the recognition system.
31 Experiments
- This procedure is detailed in the following: for each sentence i = 1 ... s we record the number of words n_i, the number of OOV words v_i, and the number of recognition errors e_i.
- For b = 1 ... B (typically B = 10^3 ... 10^4) we randomly select with replacement s tuples from X to generate a bootstrap sample.
- On each sample we compute the OOV rate OOV*_b = sum(v_i) / sum(n_i) and the word error rate WER*_b = sum(e_i) / sum(n_i).
32 Experiments
- The bootstrap replications OOV*_b and WER*_b can be visualized by a scatter plot. We quantify the observed linear relation between OOV rate and WER by a linear least-squares fit.
- The slope of the fitted line reflects the number of word errors per OOV word. For this reason we call this quantity the OOV impact.
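The replication loop and the least-squares slope can be sketched as follows (a toy illustration with invented names; each tuple carries the per-sentence counts described on the previous slide):

```python
import random

def oov_impact(tuples, B=1000, seed=0):
    """tuples: per-sentence (n_words, n_oov, n_errors).
    Returns the least-squares slope of WER*_b over OOV*_b across the
    bootstrap replications, i.e. word errors per OOV word."""
    rng = random.Random(seed)
    s = len(tuples)
    xs, ys = [], []
    for _ in range(B):
        sample = rng.choices(tuples, k=s)          # resample sentences with replacement
        n = sum(t[0] for t in sample)
        xs.append(sum(t[1] for t in sample) / n)   # OOV*_b
        ys.append(sum(t[2] for t in sample) / n)   # WER*_b
    mx, my = sum(xs) / B, sum(ys) / B
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var                               # slope = OOV impact
```

If, say, every OOV word contributed exactly two extra errors, the fitted slope would come out at 2, matching the interpretation of the OOV impact on the slide.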
33 Discussion (1/2)
34 Discussion (2/2)
- Obviously the improvement in error rate depends strongly on the OOV rate.
- It is interesting to compare the OOV impact factor (word errors per OOV word): the baseline systems have values between 1.7 and 2, supporting the common wisdom that each OOV word causes two word errors.
- Concerning the optimal choice of fragment size L, we note that there are two counteracting effects:
  - Larger L values increase the size of the graphone inventory, which in turn causes data sparseness problems, leading to worse grapheme-to-phoneme performance.
  - Smaller values of L cause the unit inventory to contain many very short words with high probabilities, leading to spurious insertions in the recognition result.
- The present experiments suggest that the best trade-off is at L = 4.
35 Conclusion
- We have shown that we can significantly improve a well-optimized state-of-the-art recognition system by using a simple flat hybrid sub-lexical model. The improvement was observed over a wide range of out-of-vocabulary rates.
- Even for very low OOV rates, no deterioration occurred.
- We found that using fragments of up to four letters or phonemes yielded optimal recognition results, while using non-trivial chunks is detrimental to grapheme-to-phoneme conversion.
36 OPEN-VOCABULARY SPOKEN TERM DETECTION USING GRAPHONE-BASED HYBRID RECOGNITION SYSTEMS
Murat Akbacak, Dimitra Vergyri, Andreas Stolcke
Speech Technology and Research Laboratory, SRI International, Menlo Park, CA 94025, USA
37 Introduction (1/3)
- Recently, NIST defined a new task, spoken term detection (STD), in which the goal is to locate a specified term rapidly and accurately in large heterogeneous audio archives, to be used ultimately as input to more sophisticated audio search systems.
- The evaluation metric has two important characteristics:
  - (1) Missing a term is penalized more heavily than having a false alarm for that term.
  - (2) Detection results are averaged over all query terms rather than over their occurrences, i.e., the performance metric considers the contribution of each term equally.
38 Introduction (2/3)
- Results of the NIST 2006 STD evaluation have shown that systems based on word recognition have an accuracy advantage over systems based on sub-word recognition (although they typically pay a price in run time).
- Yet, word recognition systems are usually based on a fixed vocabulary, resulting in a word-based index that does not allow text-based searching for OOV words.
- To retrieve OOVs, as well as misrecognized IV words, audio search based on sub-word units (such as syllables and phone N-grams) has been employed in many systems.
- During recognition, shorter units are more robust to errors and word variants than longer units, but longer units capture more discriminative information and are less susceptible to false matches during retrieval.
39 Introduction (3/3)
- In order to move toward solutions that address
the problem of misrecognition (both IV and OOV)
during audio search, previous studies have
employed fusion methods to recover from ASR
errors during retrieval. - Here, we propose a hybrid STD system that uses
words and sub-word units together in the
recognition vocabulary. The ASR vocabulary is
augmented by graphone units. - We extract from ASR lattices a hybrid index,
which is then converted into a regular word index
by a post-processing step that joins graphones
into words. - It is important to represent ASR lattices with
only words (with an expanded vocabulary) rather
than with words and sub-word units since the
lattices might serve as input to other
information processing algorithms, such as for
named entity tagging or information extraction,
which assume a word-based representation.
40 The STD Task: Data
- The test data consists of audio waveforms, a list
of regions to be searched, and a list of query
terms. - For expedience we focus in this study on English
and the genre with the highest OOV rate,
broadcast news (BN).
41 The STD Task: Evaluation Metric
- Basic detection performance will be characterized in the usual way via standard detection error tradeoff (DET) curves of miss probability P_miss(θ) versus false alarm probability P_FA(θ).
- θ: the detection threshold.
- N_corr(t, θ): the number of correct (true) detections of term t with a score greater than or equal to θ.
- N_true(t): the true number of occurrences of term t in the corpus.
- N_spur(t, θ): the number of spurious (incorrect) detections of term t with a score greater than or equal to θ.
- N_NT(t): the number of opportunities for incorrect detection of term t in the corpus (non-target term trials).
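From these counts, the two probabilities plotted on a DET curve follow directly; a minimal sketch (function name my own):

```python
def miss_fa(n_corr, n_true, n_spur, n_nt):
    """Per-term miss and false-alarm probabilities at one threshold θ,
    from the counts defined on this slide:
    P_miss = 1 - N_corr/N_true, P_FA = N_spur/N_NT."""
    p_miss = 1 - n_corr / n_true
    p_fa = n_spur / n_nt
    return p_miss, p_fa
```

Sweeping θ and re-evaluating the counts traces out the DET curve for a term.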
42 STD System Description (1/3)
43 STD System Description (2/3)
- I. Indexing step
- First, the audio input is run through the STT system, which produces word or word+graphone recognition hypotheses and lattices. These are converted into a candidate term index with times and detection scores (posteriors).
- When hybrid recognition (word+graphone) is employed, graphones in the resulting index are combined into words. To be able to do this, we keep word start/end information with a tag in the graphone representation (distinct tags indicate a graphone at the beginning, end, or middle of a word, respectively).
44 STD System Description (3/3)
- II. Searching step
- During the retrieval step, first the search terms
are extracted from the candidate term list, and
then a decision function is applied to accept or
reject the candidate based on its detection
score.
45 STD System Description: Speech-to-Text System
- I. Recognition Engine
- The STT system used for this task was a sped-up version of the STT systems used in the NIST evaluation for 2004 Rich Transcription (RT-04).
- It uses SRI's Decipher(TM) speaker-independent continuous speech recognition system, which is based on continuous-density, state-clustered hidden Markov models (HMMs), with a vocabulary optimized for the BN genre.
46 STD System Description: Speech-to-Text System
- II. Graphone-based Hybrid Recognition
- To compensate for OOV words during retrieval, we used an approach that uses sub-word units (graphones) to model OOV words.
- The underlying assumption of this model is that, for each word, its orthographic form and its pronunciation are generated by a common sequence of graphonemic units.
- Each graphone is a pair of a letter sequence and a phoneme sequence of possibly different lengths.
47 STD System Description: Speech-to-Text System
- In our experiments, we used 50K words (excluding the 10K most frequent ones in our vocabulary) to train the graphone module, with the maximum window length M set to 4.
- A hybrid word+graphone LM was estimated and used for recognition.
- The following is an example of an OOV word modeled by graphones: abromowitz → [abro] [mo] [witz], where graphones are represented by their grapheme strings enclosed in brackets, and begin/end tags are used to mark word boundary information that is later used to join graphones back into words for indexing.
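The tag-based joining can be sketched as follows. The slides do not show the actual tag symbols, so the `_b`/`_m`/`_e` suffixes below are invented placeholders for the begin/middle/end word-boundary tags:

```python
def rejoin_graphones(units):
    """Rebuild words from a hybrid hypothesis for indexing. Ordinary
    vocabulary words pass through; graphones carry a position tag (here
    the hypothetical suffixes _b, _m, _e for begin/middle/end of word)
    and are concatenated back into a single word."""
    words, buffer = [], []
    for u in units:
        if u.endswith(("_b", "_m", "_e")):   # a tagged graphone
            buffer.append(u[:-2])            # strip the 2-character tag
            if u.endswith("_e"):             # word-final graphone: flush the word
                words.append("".join(buffer))
                buffer = []
        else:                                # ordinary vocabulary word
            words.append(u)
    return words
```

For example, `rejoin_graphones(["the", "abro_b", "mo_m", "witz_e", "case"])` yields `["the", "abromowitz", "case"]`, turning the hybrid index into a regular word index as the slide describes.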
48 STD System Description: N-gram Indexing
- Since the lattice structure provides additional information about where the correct hypothesis could appear, several studies have used the whole hypothesized word lattice to obtain the searchable index, so as to avoid misses (which have a higher cost in the evaluation score than false alarms).
- We used the lattice-tool in SRILM (version 1.5.1) to extract the list of all word/graphone N-grams (up to N = 5 for the word-only (W) STD system, N = 8 for the hybrid (W+G) STD system).
- The term posterior for each N-gram is computed as the forward-backward combined score (acoustic, language, and prosodic scores were used) over all the lattice paths that share the N-gram nodes.
49 STD System Description: Term Retrieval
- The term retrieval was implemented using the Unix join command, which concatenates the lines of the sorted term list and the index file for the terms common to both. The computational cost of this simple retrieval mechanism depends only on the size of the index.
- Each putative retrieved term is marked with a hard decision (YES/NO). Our decision-making module relies on the posterior probabilities generated by the STT system.
- One of two techniques was employed during the decision-making process:
  - The first determines a global threshold on the posterior probability (GL-TH) by maximizing the ATWV score; for this task it was found to be 0.4 and 0.0001 for the word-based and hybrid systems, respectively.
  - An alternative strategy computes a term-specific threshold (TERM-TH), which has a simple analytical solution.
50 STD System Description: Term Retrieval
- Based on decision theory, the optimal threshold for each candidate should satisfy p · V ≥ (1 − p) · C, where p is the candidate's posterior, V is the value of a correct detection, and C is the cost of a false alarm.
- For the ATWV metric we have V = 1/N_true(t) and C = β/N_NT(t).
- Since the number of true occurrences N_true(t) of the term is unknown, we approximate it, for the calculation of the optimal threshold, by the sum of the posterior probabilities of the term in the corpus.
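Under the decision rule on this slide, accepting when p · V ≥ (1 − p) · C solves to the threshold C/(V + C). A sketch of the term-specific threshold with N_true approximated by the posterior sum, as described above (the function names are my own, and the default β value is the commonly used ATWV weight, an assumption of this sketch):

```python
def term_threshold(posteriors, n_nt, beta=999.9):
    """Term-specific ATWV decision threshold.
    Accept a candidate with posterior p when p*V >= (1-p)*C, i.e. when
    p >= C/(V + C), with V = 1/N_true and C = beta/N_NT.
    N_true is approximated by the sum of the term's posteriors."""
    n_true = sum(posteriors)        # approximate expected true occurrences
    v = 1.0 / n_true                # value of a correct detection
    c = beta / n_nt                 # cost of a false alarm
    return c / (v + c)

def decide(posteriors, n_nt, beta=999.9):
    """Hard YES/NO decisions for each candidate of one term."""
    th = term_threshold(posteriors, n_nt, beta)
    return ["YES" if p >= th else "NO" for p in posteriors]
```

Note how the threshold adapts per term: rare terms (small posterior sums) get a higher value V per correct hit, lowering the bar for acceptance relative to common terms.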
51 Experiment Results (1/2)
- In the hybrid (W+G) STD system there are approximately 15K graphones added to the recognition vocabulary.
- On average every OOV word results in two errors (one for itself, and one for a neighboring word because of incorrect context in the language model scoring). For example, going from a 60K vocabulary to a 20K vocabulary leads to a 3.2% increase in WER, and the hybrid system brings this number down to 1.3%.
52 Experiment Results (2/2)
- An interesting observation is that even for IV terms the hybrid (W+G) STD system yields better performance than the word-only (W) STD system.
- This is because hybrid recognition improves both IV-word and OOV-word recognition, resulting in better retrieval performance for IV and OOV words at the same time.