1
Bootstrap Estimates For Confidence Intervals In
ASR Performance Evaluation
Presented by Patty Liu
2
Introduction (1/2)
  • The most popular performance measure in automatic
    speech recognition is the word error rate
    W = (Σ_i e_i) / (Σ_i n_i), where
  • n_i is the number of words in sentence i, and
  • e_i is the edit distance, or Levenshtein distance,
    between the recognizer output and the reference
    transcription of sentence i.
  • The edit distance is the minimum number of
    insert, substitute and delete operations
    necessary to transform one sentence into the
    other.
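A minimal Python sketch of this definition (illustrative; the helper names edit_distance and word_error_rate are hypothetical, not from the slides):

```python
# Illustrative sketch: word error rate as total edit distance over total words.
def edit_distance(ref, hyp):
    """Minimum number of insert, substitute and delete operations
    turning the hypothesis word list into the reference word list."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # substitute or match
    return d[len(ref)][len(hyp)]

def word_error_rate(references, hypotheses):
    """W = (sum of edit distances e_i) / (sum of reference word counts n_i)."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words
```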

3
Introduction (2/2)
  • The word error rate is an attractive metric,
    because it is intuitive, it corresponds well with
    application scenarios and (unlike sentence error
    rate) it is sensitive to small changes.
  • On the downside it is not very amenable to
    statistical analysis.
  • - W is really a rate (number of errors per
    spoken word), and not a probability (chance of
    misrecognizing a word).
  • - Moreover, error events do not occur
    independently.
  • Nevertheless hardly any publication reports
    figures from such significance tests. Instead it
    is common to report only the absolute and relative
    change in word error rate.

4
Motivation
  • What error rate do we have to expect when
    changing to a different test set?
  • How reliable is an observed improvement of a
    system?

5
Bootstrap (1/5)
  • The bootstrap is a computer-based method for
    assigning measures of accuracy to statistical
    estimates.
  • The core idea is to create replications of a
    statistic by random sampling from the data set
    with replacement (so-called Monte Carlo
    estimates).
  • We assume that the test corpus can be divided
    into s segments for which the recognition result
    is independent and the number of errors can thus
    be evaluated independently.
  • For speaker-independent CSR it seems appropriate
    to choose the set of all utterances of one
    speaker as a segment.

6
Bootstrap (2/5)
  • For each sentence i = 1 ... s we record the number of
    words n_i and the number of errors e_i:
    X = { (n_1, e_1), ..., (n_s, e_s) }
  • The following procedure is repeated B times
    (typically B = 10^3 ... 10^4): for b = 1 ... B
    we randomly select with replacement s pairs from
    X, to generate a bootstrap sample
    X*_b = { (n*_1, e*_1), ..., (n*_s, e*_s) }
  • The sample will contain several of the original
    sentences multiple times, while others are
    missing.

7
Bootstrap (3/5)
8
Bootstrap (4/5)
  • Then we calculate the word error rate on this
    sample:  W*_b = (Σ_i e*_i) / (Σ_i n*_i)
  • The W*_b are called bootstrap replications of W.
    They can be thought of as samples of the word
    error rate from an ensemble of virtual test sets.
    Their mean W_boot = (1/B) Σ_b W*_b is the bootstrap
    estimate of the word error rate.
  • The uncertainty of W_boot can be quantified by the
    standard error, which has the following bootstrap
    estimate:
    se_boot = [ (1/(B-1)) Σ_b (W*_b - W_boot)² ]^(1/2)
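A minimal sketch of the sampling step and the bootstrap standard error, assuming NumPy is available and using hypothetical per-segment word counts n and error counts e:

```python
import numpy as np

def bootstrap_wer(n, e, B=10_000, seed=0):
    """Return B bootstrap replications W*_b of the word error rate."""
    rng = np.random.default_rng(seed)
    n, e = np.asarray(n), np.asarray(e)
    s = len(n)
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, s, size=s)       # draw s segments with replacement
        reps[b] = e[idx].sum() / n[idx].sum()  # word error rate on this sample
    return reps

reps = bootstrap_wer(n=[120, 80, 200, 150], e=[12, 5, 23, 9])  # made-up counts
w_boot = reps.mean()        # bootstrap estimate of the word error rate
se_boot = reps.std(ddof=1)  # bootstrap estimate of the standard error
```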

9
Bootstrap (5/5)
  • For large s the distribution of W is
    approximately Gaussian. In this case the true
    word error rate lies with 90% probability in the
    interval W_boot ± 1.64 · se_boot.
  • Even when s is small, we can use the table of
    replications W*_b to determine percentiles, which
    in turn can serve as confidence intervals.
  • For a chosen error threshold α, let W*_(α) be the
    value in the sorted list of replications below which
    a fraction α of the W*_b lies, and W*_(1-α) the value
    above which a fraction α lies.
  • The interval [ W*_(α) , W*_(1-α) ]
    contains the true value of W with probability
    1 - 2α. This is the bootstrap-t confidence
    interval.
  • For example: with B = 1000 and α = 0.05, we
    sort the list of W*_b and use the values at
    positions 50 and 950.
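Continuing the sketch above, the percentile interval can be read off the sorted replications (illustrative only; 'reps' is the array of replications computed earlier):

```python
import numpy as np

def percentile_interval(reps, alpha=0.05):
    """(1 - 2*alpha) interval, e.g. 90% for alpha = 0.05."""
    lo, hi = np.percentile(reps, [100 * alpha, 100 * (1 - alpha)])
    return lo, hi

low, high = percentile_interval(reps)  # for B = 1000 this is roughly positions 50 and 950
print(f"true WER lies in [{low:.4f}, {high:.4f}] with ~90% probability")
```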

10
EXAMPLE: NAB (North American Business News) ERROR RATES
(figure: a smaller standard error corresponds to
narrower confidence intervals)
11
EXAMPLE: NAB ERROR RATES
(figure: distribution of the bootstrap replications W*_b;
the central 90% of the replications defines the confidence
interval C_boot, with 5% of the mass in each tail around
W_boot)
12
Comparing Systems (1/2)
  • Competing algorithms are usually tested on the
    same data.
  • Given two recognition systems A and B with word
    error counts e_A,i and e_B,i on segment i, the (absolute)
    difference in word error rate is
    ΔW = W_A - W_B = (Σ_i (e_A,i - e_B,i)) / (Σ_i n_i)
  • We can apply the same bootstrap technique to the
    quantity ΔW as we did to W. The crucial point is
    that we calculate the difference in the number of
    errors of the two systems on identical bootstrap
    samples.
  • The important consequence is that ΔW has much lower
    variance than the W of either system.

13
Comparing Systems (2/2)
  • In addition to the two-tailed confidence interval
    [ ΔW*_(α) , ΔW*_(1-α) ], we may be more interested in
    whether system B is a real improvement over
    system A:
    poi = (1/B) Σ_b T(ΔW*_b)
  • T(x) is the step function, which is one for x > 0
    and zero otherwise. So poi is the relative number
    of bootstrap samples which favor system B. We call
    this measure the probability of improvement (poi).
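A minimal sketch of the probability of improvement, assuming NumPy and hypothetical per-segment word counts n and error counts e_a, e_b for systems A and B:

```python
import numpy as np

def probability_of_improvement(n, e_a, e_b, B=10_000, seed=0):
    """Fraction of bootstrap samples on which system B makes fewer errors than A."""
    rng = np.random.default_rng(seed)
    n, e_a, e_b = map(np.asarray, (n, e_a, e_b))
    s = len(n)
    wins = 0
    for _ in range(B):
        idx = rng.integers(0, s, size=s)  # identical bootstrap sample for both systems
        delta = (e_a[idx].sum() - e_b[idx].sum()) / n[idx].sum()
        wins += delta > 0                 # positive ΔW* favors system B
    return wins / B

poi = probability_of_improvement(n=[120, 80, 200], e_a=[14, 6, 25], e_b=[12, 5, 23])
```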

14
Example System Comparison
  • The system used for the examples described
    earlier now plays the role of system B, while a
    second system with slightly different acoustic
    models is system A.
  • System B is apparently better by 0.3 to 0.4 percent
    absolute in terms of word error rate. The
    probability of improvement, ranging between 82%
    and 95%, indicates that we can be moderately
    confident that this reflects a real superiority
    of system B, but we should not be too surprised
    if a fourth test set turned out to favor system A.
  • The notable advantage of this differential
    analysis is that the standard error of ΔW is
    approximately one third of the standard error of
    W.

15
Example System Comparison
16
Example System Comparison
17
Conclusion
  • We would like to emphasize that what we propose
    is not a new metric for performance evaluation,
    but a refined analysis of an established metric
    (word error rate).
  • The proposed method seems attractive, because it
    is easy to use, it makes no assumption about the
    distribution of errors, results are directly
    related to word error rate, and the probability
    of improvement provides an intuitive figure of
    significance.

18
Open Vocabulary Speech Recognition with Flat
Hybrid Models
19
Introduction (1/3)
  • Large vocabulary speech recognition systems
    operate with a fixed large but finite vocabulary.
  • Systems operating with a fixed vocabulary are
    bound to encounter so-called out-of-vocabulary
    (OOV) words. These are problematic for a number
    of reasons:
  • - An OOV word will never be recognized (even if
    the user repeats it), but will be substituted by
    some in-vocabulary word.
  • - Neighboring words are also often
    misrecognized.
  • - Later processing stages (e.g. translation,
    understanding, document retrieval) cannot recover
    from OOV errors.
  • - OOV words are often content words.

20
Introduction (2/3)
  • The decision rule and knowledge sources used by a
    large vocabulary speech recognition system: the
    recognizer searches for the word sequence w_1 ... w_N
    maximizing p(w_1 ... w_N) · p(x | w_1 ... w_N), using
  • acoustic model p(x | φ):
    relates acoustic features x to phoneme
    sequences φ, typically an HMM (vocabulary
    independent)
  • pronunciation lexicon:
    assigns one (or more) phoneme string(s) φ(w)
    to each word w
  • language model p(w_1 ... w_N):
    assigns probabilities to sentences from a
    finite set of words

21
Introduction (3/3)
  • For open vocabulary recognition, we propose to
    conceptually abandon the words in favor of
    individual letters. Unlike words, the set of
    different letters G in a writing system is
    finite.
  • Concerning the link to the acoustic realization,
    the set of phonemes can also be considered
    finite for a given language. These considerations
    suggest the following model:
  • - acoustic model p(x | φ)
  • - pronunciation model p(φ | g):
    provides a pronunciation for any
    string of letters g
  • - sub-lexical language model p(g):
    assigns probabilities to character strings

22
Introduction (3/3)
  • Alternatively, the pronunciation model and
    sub-lexical language model can be combined into a
    joint graphonemic model p(g, φ).

23
Grapheme-to-Phoneme Conversion (1/3)
  • Obviously this approach to open-vocabulary
    recognition is strongly connected to
    grapheme-to-phoneme conversion (G2P), where we
    seek the most likely pronunciation for a given
    orthographic form g: φ(g) = argmax_φ p(g, φ)
  • The underlying assumption of this model is that,
    for each word, its orthographic form and its
    pronunciation are generated by a common sequence
    of graphonemic units.
  • Each such unit q is a pair
    of a letter sequence and a phoneme sequence
    of possibly different lengths.
  • We refer to such a unit as a graphone. (Various
    other names have been suggested:
    grapheme-phoneme joint multigram, graphoneme,
    grapheme-to-phoneme correspondence (GPC), chunk.)

24
Grapheme-to-Phoneme Conversion (2/3)
  • The joint probability distribution p(g, φ)
    is thus reduced to a probability distribution
    over graphone sequences q_1 ... q_K, which we model
    using a standard M-gram (a toy scoring sketch
    follows below):
    p(q_1 ... q_K) ≈ Π_j p(q_j | q_(j-M+1), ..., q_(j-1))
  • The complexity of this model depends on two
    parameters: the range M of the M-gram model and the
    allowed size of the graphones. We allow the
    number of letters and phonemes in a graphone to vary
    between zero and an upper limit L.
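A toy bigram (M = 2) scoring sketch for a graphone segmentation; the graphone inventory, phoneme strings and probabilities below are invented for the example:

```python
from math import prod

# Hypothetical bigram table over graphones (letter string, phoneme string).
bigram = {
    (None, ("abro", "AE B R OW")): 0.01,
    (("abro", "AE B R OW"), ("mo", "M OW")): 0.2,
    (("mo", "M OW"), ("witz", "W IH T S")): 0.3,
}

def joint_probability(graphones, bigram, floor=1e-9):
    """p(g, phi) approximated as a product of p(q_j | q_(j-1)) terms."""
    prev, probs = None, []
    for q in graphones:
        probs.append(bigram.get((prev, q), floor))  # small floor for unseen bigrams
        prev = q
    return prod(probs)

segmentation = [("abro", "AE B R OW"), ("mo", "M OW"), ("witz", "W IH T S")]
p = joint_probability(segmentation, bigram)
```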

25
Grapheme-to-Phoneme Conversion (3/3)
  • We were able to verify that shorter units in
    combination with longer-range M-gram modeling
    yield the best results for the grapheme-to-phoneme
    task.

26
Models for Open Vocabulary ASR (1/2)
  • We combine the lexical entries with the
    (sub-lexical) graphones derived from
    grapheme-to-phoneme conversion to form a unified
    set of recognition units.
  • From the perspective of OOV detection the
    sub-lexical units Q have been called fragments
    or fillers.
  • By treating words and fragments uniformly, the
    decision rule becomes a maximization over mixed
    sequences of units u: û = argmax_u p(u) · p(x | u)
  • The sequence model p(u) can be characterized as
    hybrid because it contains mixed M-grams
    containing both words and fragments.
  • It can also be characterized as flat, as
    opposed to structured approaches that predict and
    model OOV words with different models.

27
Models for Open Vocabulary ASR (2/2)
  • A shortcoming of this model is that it leaves
    undetermined where word boundaries (i.e. blanks)
    should be placed.
  • The heuristic used in this study is to compose
    any consecutive sub-lexical units into a single
    word and to treat all lexical units as individual
    words, as in the sketch below.
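A minimal sketch of this heuristic; the underscore marker for fragments is an invented convention for the example:

```python
def join_fragments(units, is_fragment):
    """Merge consecutive sub-lexical fragments into one word; keep lexical units as words."""
    words, buffer = [], []
    for u in units:
        if is_fragment(u):
            buffer.append(u.lstrip("_"))      # collect consecutive fragments
        else:
            if buffer:
                words.append("".join(buffer)) # close the fragment run as a single word
                buffer = []
            words.append(u)                   # lexical unit stands on its own
    if buffer:
        words.append("".join(buffer))
    return words

hyp = ["the", "_abro", "_mo", "_witz", "case"]           # hypothetical recognizer output
print(join_fragments(hyp, lambda u: u.startswith("_")))  # -> ['the', 'abromowitz', 'case']
```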

28
Experiments
  • We have three different established vocabularies
    with 5, 20 and 64 thousand words, each
    corresponding to the most frequent words in the
    language model training corpus.
  • For each baseline pronunciation dictionary a
    grapheme-to-phoneme model was trained with
    different length constraints on the graphone size,
    using EM training with an M-gram length of 3.
    The recognition vocabulary was then
    augmented with all graphones inferred by this
    procedure.
  • For quantitative analysis, we evaluated both word
    error rate (WER) and letter error rate (LER).
    Letter error rate is more favorable with respect
    to almost-correct words and corresponds with the
    correction effort in dictation applications.

29
Experiments
30
Experiments - Bootstrap Analysis of OOV Impact
  • We are particularly interested in the effect of
    OOV words on the recognition error rate. This
    effect could be studied by varying the system's
    vocabulary.
  • However, changing the recognition system in this
    way might introduce secondary effects such as
    increased confusability between vocabulary
    entries.
  • Alternatively we can alter the test set. By
    extending the bootstrap technique, we create an
    ensemble of virtual test corpora with varying
    numbers of OOV words and the corresponding WERs.
    This distribution allows us to study the correlation
    between OOV rate and word error rate without
    changing the recognition system.

31
Experiments
  • This procedure is detailed in the following: for
    each sentence i = 1 ... s we record the number
    of words n_i, the number of OOV words o_i and
    the number of recognition errors e_i:
    X = { (n_1, o_1, e_1), ..., (n_s, o_s, e_s) }
  • For b = 1 ... B (typically B = 10^3 ... 10^4) we randomly
    select with replacement s tuples from X to
    generate a bootstrap sample X*_b
  • The OOV rate and word error rate on this sample are
    OOV*_b = (Σ_i o*_i) / (Σ_i n*_i)  and
    WER*_b = (Σ_i e*_i) / (Σ_i n*_i)
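A minimal sketch of this extended bootstrap, assuming NumPy and hypothetical per-sentence tuples (n_i, o_i, e_i):

```python
import numpy as np

def bootstrap_oov_wer(n, o, e, B=10_000, seed=0):
    """Bootstrap replications of OOV rate and WER on identical virtual test sets."""
    rng = np.random.default_rng(seed)
    n, o, e = map(np.asarray, (n, o, e))
    s = len(n)
    oov, wer = np.empty(B), np.empty(B)
    for b in range(B):
        idx = rng.integers(0, s, size=s)      # resample sentences with replacement
        oov[b] = o[idx].sum() / n[idx].sum()  # OOV rate of the virtual test set
        wer[b] = e[idx].sum() / n[idx].sum()  # WER of the virtual test set
    return oov, wer

oov_b, wer_b = bootstrap_oov_wer(n=[120, 80, 200], o=[3, 0, 5], e=[15, 6, 28])
```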

32
Experiments
  • The bootstrap replications OOV*_b and WER*_b can be
    visualized by a scatter plot. We quantify the
    observed linear relation between OOV rate and WER
    by a linear least squares fit.
  • The slope of the fitted line reflects the number
    of word errors per OOV word. For this reason we
    call this quantity the OOV impact.
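Continuing the sketch above, the OOV impact is the slope of a least squares fit of the WER replications against the OOV-rate replications (illustrative only):

```python
import numpy as np

slope, intercept = np.polyfit(oov_b, wer_b, deg=1)  # WER ≈ slope * OOV + intercept
print(f"OOV impact: about {slope:.2f} word errors per OOV word")
```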

33
Discussion (1/2)
34
Discussion (2/2)
  • Obviously the improvement in error rate depends
    strongly on the OOV rate.
  • It is interesting to compare the OOV impact
    factor (word errors per OOV word): the baseline
    systems have values between 1.7 and 2, supporting
    the common wisdom that each OOV word causes two
    word errors.
  • Concerning the optimal choice of fragment size L,
    we note that there are two counteracting effects:
  • - Larger L values increase the size of the
    graphone inventory, which in turn causes data
    sparseness problems, leading to worse
    grapheme-to-phoneme performance.
  • - Smaller values for L cause the unit
    inventory to contain many very short words with
    high probabilities, leading to spurious
    insertions in the recognition result. The present
    experiments suggest that the best trade-off is at
    L = 4.

35
Conclusion
  • We have shown that we can significantly improve a
    well optimized state-of-the-art recognition
    system by using a simple flat hybrid sub-lexical
    model. The improvement was observed on a wide
    range of out-of-vocabulary rates.
  • Even for very low OOV rates, no deterioration
    occurred.
  • We found that using fragments of up to four
    letters or phonemes yielded optimal recognition
    results, while using non-trivial chunks is
    detrimental to grapheme-to-phoneme conversion.

36
OPEN-VOCABULARY SPOKEN TERM DETECTION USING
GRAPHONE-BASED HYBRID RECOGNITION SYSTEMS
Murat Akbacak, Dimitra Vergyri, Andreas Stolcke
Speech Technology and Research Laboratory
SRI International, Menlo Park, CA 94025, USA
37
Introduction (1/3)
  • Recently, NIST defined a new task, spoken term
    detection (STD), in which the goal is to locate a
    specified term rapidly and accurately in large
    heterogeneous audio archives, to be used
    ultimately as input to more sophisticated audio
    search systems.
  • The evaluation metric has two important
    characteristics:
  • (1) Missing a term is penalized more heavily
    than having a false alarm for that term,
  • (2) Detection results are averaged over all
    query terms rather than over their occurrences,
    i.e., the performance metric considers the
    contribution of each term equally.

38
Introduction (2/3)
  • Results of the NIST 2006 STD evaluation have
    shown that systems based on word recognition have
    an accuracy advantage over systems based on
    sub-word recognition (although they typically pay
    a price in run time).
  • Yet, word recognition systems are usually based on
    a fixed vocabulary, resulting in a word-based
    index that does not allow text-based searching
    for OOV words.
  • To retrieve OOVs, as well as misrecognized IV
    words, audio search based on sub-word units (such
    as syllables and phone N-grams) has been employed
    in many systems.
  • During recognition, shorter units are more robust
    to errors and word variants than longer units,
    but longer units capture more discriminative
    information and are less susceptible to false
    matches during retrieval.

39
Introduction (3/3)
  • In order to move toward solutions that address
    the problem of misrecognition (both IV and OOV)
    during audio search, previous studies have
    employed fusion methods to recover from ASR
    errors during retrieval.
  • Here, we propose a hybrid STD system that uses
    words and sub-word units together in the
    recognition vocabulary. The ASR vocabulary is
    augmented by graphone units.
  • We extract from ASR lattices a hybrid index,
    which is then converted into a regular word index
    by a post-processing step that joins graphones
    into words.
  • It is important to represent ASR lattices with
    only words (with an expanded vocabulary) rather
    than with words and sub-word units since the
    lattices might serve as input to other
    information processing algorithms, such as for
    named entity tagging or information extraction,
    which assume a word-based representation.

40
The STD Task - Data
  • The test data consists of audio waveforms, a list
    of regions to be searched, and a list of query
    terms.
  • For expedience we focus in this study on English
    and the genre with the highest OOV rate,
    broadcast news (BN).

41
The STD Task - Evaluation Metric
  • Basic detection performance will be characterized
    in the usual way via standard detection error
    tradeoff (DET) curves of miss probability P_miss(term, θ)
    versus false alarm probability P_FA(term, θ):
    P_miss(term, θ) = 1 - N_correct(term, θ) / N_true(term)
    P_FA(term, θ) = N_spurious(term, θ) / N_NT(term)
  • θ : the detection threshold
  • N_correct(term, θ) : the number of correct (true)
    detections of the term with a score greater than or
    equal to θ
  • N_true(term) : the true number of occurrences of
    the term in the corpus
  • N_spurious(term, θ) : the number of spurious
    (incorrect) detections of the term with a score
    greater than or equal to θ
  • N_NT(term) : the number of opportunities for
    incorrect detection of the term in the corpus
    ("Non-Target" term trials)
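A minimal sketch of these two quantities for one term at threshold θ (hypothetical counts and scores; not the official NIST scoring tool):

```python
def miss_fa(scores_correct, scores_spurious, n_true, n_nontarget, theta):
    """P_miss and P_FA for one term from correct/spurious detection scores and a threshold."""
    n_corr = sum(s >= theta for s in scores_correct)
    n_spur = sum(s >= theta for s in scores_spurious)
    p_miss = 1.0 - n_corr / n_true
    p_fa = n_spur / n_nontarget
    return p_miss, p_fa

# Made-up example: 3 true occurrences, 10,000 non-target trials.
print(miss_fa([0.9, 0.7, 0.2], [0.6, 0.1], n_true=3, n_nontarget=10_000, theta=0.5))
```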

42
STD System Description (1/3)
43
STD System Description (2/3)
  • I. Indexing step
  • First, audio input is run through the STT system,
    which produces word or word+graphone recognition
    hypotheses and lattices. These are converted into
    a candidate term index with times and detection
    scores (posteriors).
  • When hybrid recognition (word+graphone) is
    employed, graphones in the resulting index are
    combined into words. To be able to do this, we
    keep word start/end information with a tag in the
    graphone representation (tags indicating whether a
    graphone lies at the beginning, in the middle, or
    at the end of a word), as sketched below.
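A minimal sketch of joining tagged graphones back into index words; the tag values B/M/E are an assumed naming, not the system's actual tag notation:

```python
def graphones_to_words(index_units):
    """index_units: (grapheme_string, position_tag) pairs; an 'E' tag ends a word."""
    words, current = [], []
    for graphemes, tag in index_units:
        current.append(graphemes)
        if tag == "E":                  # end-of-word tag closes the current word
            words.append("".join(current))
            current = []
    if current:                         # flush a trailing, unterminated word
        words.append("".join(current))
    return words

print(graphones_to_words([("abro", "B"), ("mo", "M"), ("witz", "E")]))  # -> ['abromowitz']
```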

44
STD System Description (3/3)
  • II. Searching step
  • During the retrieval step, first the search terms
    are extracted from the candidate term list, and
    then a decision function is applied to accept or
    reject the candidate based on its detection
    score.

45
STD System Description - Speech-to-Text System
  • I. Recognition Engine
  • The STT system used for this task was a sped-up
    version of the STT systems used in the NIST
    evaluation for 2004 Rich Transcription (RT-04).
  • STT uses SRI's Decipher(TM)
    speaker-independent continuous speech recognition
    system, which is based on continuous-density,
    state-clustered hidden Markov models (HMMs), with
    a vocabulary optimized for the BN genre.

46
STD System Description - Speech-to-Text System
  • II. Graphone-based Hybrid Recognition
  • To compensate for OOV words during retrieval, we
    used an approach that uses sub-word units
    (graphones) to model OOV words.
  • The underlying assumption of this model is
    that, for each word, its orthographic form and
    its pronunciation are generated by a common
    sequence of graphonemic units.
  • Each graphone is a pair of a letter sequence and
    a phoneme sequence of possibly different lengths.

47
STD System Description - Speech-to-Text System
  • In our experiments, we used 50K words (excluding
    the 10K most frequent ones in our vocabulary) to
    train the graphone module, with the maximum window
    length, M, set to 4.
  • A hybrid word + graphone LM was estimated and
    used for recognition.
  • Following is an example of an OOV word modeled by
    graphones:
    abromowitz -> [abro] [mo] [witz]
  • where graphones are represented by their
    grapheme strings enclosed in brackets, and
    begin-of-word and end-of-word tags are attached to
    mark word boundary information that is later used
    to join graphones back into words for indexing.

48
STD System Description - N-gram Indexing
  • Since the lattice structure provides additional
    information about where the correct hypothesis could
    appear, several studies have used the whole
    hypothesized word lattice to obtain the searchable
    index, so as to avoid misses (which have a higher
    cost in the evaluation score than false alarms).
  • We used the lattice-tool in SRILM (version 1.5.1)
    to extract the list of all word/graphone N-grams
    (up to N = 5 for the word-only (W) STD system and
    N = 8 for the hybrid (W+G) STD system).
  • The term posterior for each N-gram is computed as
    the forward-backward combined score (acoustic,
    language, and prosodic scores were used) through
    all the lattice paths that share the N-gram
    nodes.

49
STD System Description - Term Retrieval
  • The term retrieval was implemented using the Unix
    join command, which concatenates the lines of the
    sorted term list and the index file for the terms
    common to both. The computational cost of this
    simple retrieval mechanism depends only on the
    size of the index.
  • Each putative retrieved term is marked with a
    hard decision (YES/NO). Our decision-making
    module relies on the posterior probabilities
    generated by the STT system.
  • One of two techniques was employed during the
    decision-making process.
  • - The first one determines a global threshold
    for the posterior probability (GL-TH) by maximizing
    the ATWV score; for this task the threshold was found
    to be 0.4 for the word-based and 0.0001 for the hybrid
    system.
  • - An alternative strategy can be formulated
    that computes a term-specific threshold
    (TERM-TH), which has a simple analytical solution.

50
STD System Description - Term Retrieval
  • Based on decision theory, the optimal threshold
    for each candidate term should satisfy
    p(term) · V >= (1 - p(term)) · C,  i.e.  p(term) >= C / (V + C)
  • where V is the value of a correct detection and
    C is the cost for a false alarm.
  • For the ATWV metric we have V = 1 / N_true(term) and
    C = β / N_NT(term).
  • Since the number of true occurrences N_true(term) of the
    term is unknown, we approximate it for the
    calculation of the optimal threshold by the sum of the
    posterior probabilities of the term in the
    corpus.
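A minimal sketch of the term-specific threshold under these assumptions (V = 1/N_true, C = β/N_NT; the function name, β default and counts are illustrative assumptions, not values from the slides):

```python
def term_threshold(posteriors, n_nontarget, beta=999.9):
    """Accept a candidate when its posterior p satisfies p * V >= (1 - p) * C."""
    n_true = sum(posteriors)         # approximate the true count by the posterior mass
    value = 1.0 / max(n_true, 1e-9)  # value of one correct detection (assumed 1/N_true)
    cost = beta / n_nontarget        # cost of one false alarm (assumed beta/N_NT)
    return cost / (value + cost)     # threshold p >= C / (V + C)

posteriors = [0.8, 0.6, 0.3]         # hypothetical detection posteriors for one term
theta = term_threshold(posteriors, n_nontarget=36_000)
decisions = ["YES" if p >= theta else "NO" for p in posteriors]
```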

51
Experiment Results (1/2)
  • In the hybrid (W+G) STD system there are
    approximately 15K graphones added to the
    recognition vocabulary.
  • On average every OOV word results in two errors
    (one for the word itself, and one for a neighboring
    word because of incorrect context in the language
    model scoring). (E.g., going from a 60K vocabulary
    to a 20K vocabulary leads to a 3.2% increase in WER,
    and the hybrid system brings this number down to
    1.3%.)

52
Experiment Results (2/2)
  • An interesting observation is that even for IV
    terms the hybrid (W+G) STD system yields better
    performance than the word-only (W) STD system.
  • This is because hybrid recognition improves both
    IV-word and OOV-word recognition, resulting in
    better retrieval performance for IV and OOV words
    at the same time.