1
A Survey of Boosting HMM Acoustic Model Training
2
Introduction
  • The No Free Lunch Theorem states that
  • There is no single learning algorithm that in any
    domain always induces the most accurate learner
  • Learning is an ill-posed problem and with finite
    data, each algorithm converges to a different
    solution and fails under different circumstances
  • Though the performance of a learner may be
    fine-tuned, there are still instances on which
    even the best learner is not accurate enough
  • The idea is:
  • There may be another learner that is accurate on
    these instances
  • By suitably combining multiple learners,
    accuracy can be improved

3
Introduction
  • There is no point in combining learners that
    always make similar decisions
  • The aim is therefore to find a set of
    base-learners that differ in their decisions so
    that they complement each other
  • There are different ways the multiple
    base-learners are combined to generate the final
    outputs
  • Multiexpert combination methods
  • Voting and its variants
  • Mixture of experts
  • Stacked generalization
  • Multistage combination methods
  • Cascading

4
Voting
  • The simplest way to combine multiple classifiers
  • It corresponds to taking a linear combination
    of the learners
  • This is also known as ensembles and linear
    opinion pools
  • The name voting comes from its use in
    classification
  • If all learners are weighted equally and the
    class with the most votes wins, this is called
    plurality voting
  • If a class must collect more than half of the
    votes (the two-class case), this is called
    majority voting (see the formulation below)
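In the usual textbook notation (L base-learners, weights w_j, and d_ji the vote of learner j for class C_i; these symbols are not taken from the slide), the linear opinion pool can be written as:

% Linear combination of L base-learners for class C_i
y_i = \sum_{j=1}^{L} w_j \, d_{ji}, \qquad w_j \ge 0, \quad \sum_{j=1}^{L} w_j = 1
% Equal weights w_j = 1/L give simple (plurality) voting: the class with the
% most votes wins.  With two classes, requiring more than half of the votes
% gives majority voting.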

5
Bagging
  • Bagging is a voting method whereby base-learners
    are made different by training them over slightly
    different training sets
  • This is done by bootstrap sampling
  • Given a training set X of size N, we draw N
    instances randomly from X with replacement
    (see the sketch after the reference below)
  • In bagging, generating complementary
    base-learners is left to chance and to the
    instability of the learning method
  • A learning algorithm is unstable if small
    changes in the training set cause a large
    difference in the generated learner
  • e.g. decision trees, multilayer perceptrons,
    condensed nearest neighbor
  • Bagging is short for bootstrap aggregating

Breiman, L. 1996. Bagging Predictors. Machine
Learning 24, 123-140
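As an illustration only (the function names below are mine, not from the slide or from Breiman's paper), a minimal Python sketch of bootstrap resampling and plurality-vote aggregation:

import random
from collections import Counter

def bootstrap_sample(data):
    """Draw len(data) instances from data with replacement."""
    n = len(data)
    return [random.choice(data) for _ in range(n)]

def bagging_predict(learners, x):
    """Plurality vote over the predictions of the base-learners."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]

# Usage sketch: train one base-learner per bootstrap replicate.
# `train_base_learner` is a placeholder for any unstable learner
# (e.g. a decision tree trainer); it is not defined here.
# learners = [train_base_learner(bootstrap_sample(train_set)) for _ in range(K)]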
6
Boosting
  • In boosting, we actively try to generate
    complementary base-learners by training the next
    learner on the mistakes of the previous learners
  • The original boosting algorithm (Schapire 1990)
    combines three weak learners to generate a strong
    learner
  • Weak and strong are meant in the sense of the
    probably approximately correct (PAC) learning
    model
  • Disadvantage
  • It requires a very large training sample

Schapire, R.E. 1990. The Strength of Weak
Learnability. Machine Learning 5, 197-227
7
AdaBoost
  • AdaBoost, short for adaptive boosting, uses the
    same training set over and over (so it need not
    be large) and can combine an arbitrary number of
    base-learners, not just three
  • The idea is to modify the probabilities of
    drawing the instances as a function of the error
  • The probability of a correctly classified
    instance is decreased, and a new sample set is
    then drawn from the original sample according to
    these modified probabilities (a sketch follows
    after the reference below)
  • Training thus focuses more on instances
    misclassified by the previous learners
  • Schapire et al. explain that the success of
    AdaBoost is due to its property of increasing the
    margin
  • Schapire et al. 1998. Boosting the Margin: A
    New Explanation for the Effectiveness of Voting
    Methods. Annals of Statistics 26, 1651-1686

Freund and Schapire. 1996. Experiments with a
New Boosting Algorithm. In ICML 13, 148-156
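The resampling view described above can be sketched as follows. This is an illustrative Python sketch of AdaBoost-style reweighting and resampling, with names of my own choosing, not code from the cited papers; the factor beta follows the usual AdaBoost choice beta_t = eps_t / (1 - eps_t) for down-weighting correctly classified instances.

import random

def reweight(weights, correct, beta):
    """Decrease the weight of correctly classified instances by factor beta
    (0 < beta < 1), then renormalize so the weights form a distribution."""
    new_w = [w * (beta if c else 1.0) for w, c in zip(weights, correct)]
    total = sum(new_w)
    return [w / total for w in new_w]

def resample(data, weights):
    """Draw a new training set of the same size from the original sample
    according to the modified probabilities."""
    return random.choices(data, weights=weights, k=len(data))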
8
AdaBoost.M2 (Freund and
Schapire, 1997)
Freund and Schapire. 1997. A Decision-Theoretic
Generalization of On-line Learning and an
Application to Boosting. Journal of Computer and
System Sciences 55, 119-139
9
Evolution of Boosting Algorithms
(Timeline: boosting work in acoustic modeling, from connectionist /
neural-network systems, through GMM-based systems, to HMM acoustic models)
ICSLP 96   G. Cook, T. Robinson: Boosting the Performance of Connectionist LVSR
EuroSpeech 97   G. Cook et al.: Ensemble Methods for Connectionist Acoustic Modeling
ICASSP 99   H. Schwenk: Using Boosting to Improve a Hybrid HMM/Neural Network Speech Recognizer
ICASSP 00   G. Zweig, M. Padmanabhan: Boosting Gaussian Mixtures in An LVCSR System
ICASSP 02   C. Meyer: Utterance-Level Boosting of HMM Speech Recognition
ICASSP 02   I. Zitouni et al.: Combination of Boosting and Discriminative Training for Natural Language Call Steering Systems
ICASSP 03   R. Zhang, A. Rudnicky: Improving the Performance of An LVCSR System Through Ensembles of Acoustic Models
EuroSpeech 03   R. Zhang, A. Rudnicky: Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models
ICASSP 04   C. Dimitrakakis, S. Bengio: Boosting HMMs with An Application to Speech Recognition
ICSLP 04   R. Zhang, A. Rudnicky: A Frame Level Boosting Training Scheme for Acoustic Modeling
ICSLP 04   R. Zhang, A. Rudnicky: Apply N-Best List Re-Ranking to Acoustic Model Combinations of Boosting Training
ICSLP 04   R. Zhang, A. Rudnicky: Optimizing Boosting with Discriminative Criteria
EuroSpeech 05   R. Zhang et al.: Investigations on Ensemble Based Semi-Supervised Acoustic Model Training
ICSLP 06   R. Zhang, A. Rudnicky: Investigations of Issues for Using Multiple Acoustic Models to Improve CSR
SpeechCom 06   C. Meyer, H. Schramm: Boosting HMM Acoustic Models in LVCSR
10
Improving The Performance of An LVCSR System
Through Ensembles of Acoustic Models
  • ICASSP 2003
  • Rong Zhang and Alexander I. Rudnicky
  • Language Technologies Institute,
  • School of Computer Science
  • Carnegie Mellon University

11
Bagging vs. Boosting
  • Bagging
  • In each round, bagging randomly selects a number
    of examples from the original training set, and
    produces a new single classifier based on the
    selected subset
  • The final classifier is built by choosing the
    hypothesis best agreed on by single classifiers
  • Boosting
  • In boosting, the single classifiers are
    iteratively trained in a fashion such that
    hard-to-classify examples are given increasing
    emphasis
  • A parameter that measures each classifier's
    importance is determined according to its
    classification accuracy
  • The final hypothesis is the weighted majority
    vote from the single classifiers

12
Algorithms
  • The first algorithm is based on the intuition
    that an incorrectly recognized utterance should
    receive more attention in training
  • If the weight of an utterance is 2.6, we first
    add two copies of the utterance to the new
    training set, and then add its third copy with
    probability 0.6 (see the sketch below)
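A minimal Python sketch of the duplication rule described above (function names are illustrative, not from the paper):

import math
import random

def duplicate_by_weight(utterances, weights):
    """Build a new training set in which an utterance with weight w appears
    floor(w) times, plus one more time with probability w - floor(w)
    (e.g. weight 2.6 -> two copies, and a third copy with probability 0.6)."""
    new_set = []
    for utt, w in zip(utterances, weights):
        count = int(math.floor(w))
        if random.random() < w - count:
            count += 1
        new_set.extend([utt] * count)
    return new_set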

13
Algorithms
  • The exponential increase in the size of the
    training set is a severe problem for Algorithm 1
  • Algorithm 2 is proposed to address this problem

14
Algorithms
  • In Algorithms 1 and 2, no attempt is made to
    measure how important a model is relative to the
    others
  • A good model should play a more important role
    than a bad one

15
Experiments
  • Corpus: CMU Communicator system
  • Experimental results

16
Comparative Study of Boosting and Non-Boosting
Training for Constructing Ensembles of Acoustic
Models
  • Rong Zhang and Alexander I. Rudnicky
  • Language Technologies Institute, CMU
  • EuroSpeech 2003

17
Non-Boosting method
  • Bagging
  • A commonly used method in the machine learning
    field
  • It randomly selects a number of examples from the
    original training set and produces a new single
    classifier
  • In this paper, we call it a non-Boosting method
  • Based on the intuition that
  • A misrecognized utterance should receive more
    attention in subsequent training

18
Algorithms
A parameter in the algorithm prevents the size of
the training set from becoming too large.
19
Experiments
  • The corpus
  • Training set: 31,248 utterances; test set: 1,689
    utterances

20
A Frame Level Boosting Training Scheme for
Acoustic Modeling
  • ICSLP 2004
  • Rong Zhang and Alexander I. Rudnicky
  • Language Technologies Institute,
  • School of Computer Science
  • Carnegie Mellon University

21
Introduction
  • In the current Boosting algorithm, the utterance
    is the basic unit used for acoustic model training
  • Our analysis shows that there are two notable
    weaknesses in this setting:
  • First, the objective function of the current
    Boosting algorithm is designed to minimize
    utterance error instead of word error
  • Second, in the current algorithm, an utterance is
    treated as a unit for resampling
  • This paper proposes a frame level Boosting
    training scheme for acoustic modeling to address
    these two problems

22
Frame Level Boosting Training Scheme
  • The metric used in Boosting training is the
    frame level conditional probability (in contrast
    to the word level probability)
  • Objective function: defined in terms of the
    pseudo loss for frame t, which describes the
    degree of confusion of this frame for recognition
23
Frame Level Boosting Training Scheme
  • Training Scheme
  • How to resample the frame level training data?
  • Each frame is duplicated a given number of times
    and the copies form a new utterance for acoustic
    model training (see the sketch below)
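A minimal Python sketch of the frame-level resampling described above, assuming the per-frame duplication counts have already been derived from the frame-level pseudo-loss (names are illustrative, not from the paper):

def resample_frames(frames, counts):
    """Duplicate frame t `counts[t]` times and concatenate the copies into a
    new pseudo-utterance for acoustic model training.
    `frames` is a list of feature vectors; `counts` holds one non-negative
    integer per frame."""
    new_utterance = []
    for frame, n in zip(frames, counts):
        new_utterance.extend([frame] * n)
    return new_utterance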

24
Experiments
  • Corpus: CMU Communicator system
  • Experimental results

25
Boosting HMM acoustic models in large vocabulary
speech recognition
  • Carsten Meyer, Hauke Schramm
  • Philips Research Laboratories, Germany
  • SPEECH COMMUNICATION 2006

26
Utterance approach for boosting in ASR
  • An intuitive way of applying boosting to HMM
    speech recognition is at the utterance level
  • Thus, boosting is used to improve upon an initial
    ranking of candidate word sequences
  • The utterance approach has two advantages
  • First, it is directly related to the sentence
    error rate
  • Second, it is computationally much less expensive
    than boosting applied at the level of feature
    vectors

27
Utterance approach for boosting in ASR
  • In the utterance approach, we define the input
    pattern to be the sequence of feature vectors
    corresponding to the entire utterance
  • A candidate word sequence of the speech
    recognizer is a class label, the correct word
    sequence of the utterance being the target
  • The a posteriori confidence measure is calculated
    on the basis of the N-best list for the utterance
    (a sketch follows below)
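The transcript does not give the exact form of the a posteriori confidence; one common approximation, shown here only as an assumed illustration in Python, normalizes exponentiated (scaled) recognizer scores over the N-best list:

import math

def nbest_posteriors(scores, scale=1.0):
    """Approximate a posteriori probabilities of the N-best hypotheses by
    normalizing exponentiated, scaled log-scores over the N-best list.
    `scores` are total (acoustic + language model) log-scores."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(scale * (s - m)) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]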

28
Utterance approach for boosting in ASR
  • Based on the confidence values and the
    AdaBoost.M2 algorithm, we calculate an utterance
    weight for each training utterance
  • The weights are subsequently used in maximum
    likelihood and discriminative training of the
    Gaussian mixture models

29
Utterance approach for boosting in ASR
  • Some problems are encountered when applying the
    approach to large-scale continuous speech
    applications
  • N-best lists of reasonable length (e.g.
    N = 100) generally contain only a tiny fraction
    of the possible classification results
  • This has two consequences
  • In training, it may lead to sub-optimal utterance
    weights
  • In recognition, Eq. (1) cannot be applied
    appropriately

30
Utterance approach for CSR--Training
  • Training
  • A convenient strategy to reduce the complexity of
    the classification task and to provide more
    meaningful N-best lists consists in chopping up
    the training data
  • For long sentences, this simply means inserting
    additional sentence break symbols at silence
    intervals with a given minimum length (a sketch
    follows below)
  • This reduces the number of possible
    classifications of each sentence fragment, so
    that the resulting N-best lists should cover a
    sufficiently large fraction of hypotheses
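A minimal Python sketch of the chopping step, assuming a forced alignment with labelled silence segments is available (names and the 'sil' label are illustrative, not from the paper):

def chop_at_silences(segments, min_silence):
    """Split an utterance into fragments at silence intervals of at least
    `min_silence` seconds.  `segments` is a list of (label, start, end)
    tuples from a forced alignment; 'sil' marks silence."""
    fragments, current = [], []
    for label, start, end in segments:
        if label == "sil" and (end - start) >= min_silence:
            if current:
                fragments.append(current)
                current = []
        else:
            current.append((label, start, end))
    if current:
        fragments.append(current)
    return fragments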

31
Utterance approach for CSR--Decoding
  • Decoding: lexical approach for model combination
  • A single pass decoding setup, where the
    combination of the boosted acoustic models is
    realized at a lexical level
  • The basic idea is to add a new pronunciation
    model by replicating the set of phoneme symbols
    in each boosting iteration (e.g. by appending
    the suffix _t to the phoneme symbols)
  • The new phoneme symbols represent the underlying
    acoustic model of boosting iteration t
    (e.g. au, au_1, au_2, ...) (a sketch follows
    below)
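A minimal Python sketch of the lexical extension for one boosting iteration, applied to the base lexicon (names are illustrative, not from the paper):

def extend_lexicon(lexicon, t):
    """For boosting iteration t, add to each word a new pronunciation whose
    phonemes carry the suffix _t (e.g. 'au' -> 'au_2' for t = 2), so that the
    decoder can select the acoustic model of iteration t at a lexical level.
    `lexicon` maps words to lists of pronunciations (lists of phoneme symbols);
    call this once per iteration on the base pronunciations."""
    extended = {}
    for word, prons in lexicon.items():
        new_prons = [[f"{p}_{t}" for p in pron] for pron in prons]
        extended[word] = prons + new_prons
    return extended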
32
Utterance approach for CSR--Decoding
  • Decoding: lexical approach for model combination
    (cont.)
  • Add to each phonetic transcription in the
    decoding lexicon a new transcription using the
    corresponding phoneme set
  • Use the reweighted training data to train the
    boosted classifier
  • Decoding is then performed using the extended
    lexicon and the set of acoustic models weighted
    by their unigram prior probabilities, which are
    estimated on the training data (a weighted
    summation of the models)
33
In more detail
(Flow diagram: in each boosting iteration t, the reweighted, phonetically
transcribed training corpus is used for ML/MMI training of model Mt; the
decoding lexicon is extended with the pronunciation variants of iteration t,
so that models M1, M2, ..., Mt can be combined, either unweighted or weighted.)
34
In more detail
35
Weighted model combination
  • Word level model combination

36
Experiments
  • Isolated word recognition
  • Telephone-bandwidth large vocabulary isolated
    word recognition
  • SpeechDat(II) German material
  • Continuous speech recognition
  • Professional dictation and Switchboard

37
Isolated word recognition
  • Database
  • Training corpus: 18k utterances (4.3h) of city,
    company, first and family names
  • Evaluations
  • LILI test corpus: 10k single word utterances
    (3.5h), 10k word lexicon (matched conditions)
  • Names corpus: an in-house collection of 676
    utterances (0.5h), two different decoding lexica:
    10k lex, 190k lex (acoustic conditions are
    matched, whereas there is a lexical mismatch)
  • Office corpus: 3.2k utterances (1.5h), recorded
    over microphone in clean conditions, 20k lexicon
    (an acoustic mismatch to the training conditions)

38
Isolated word recognition
  • Boosting ML models

39
Isolated word recognition
  • Combining boosting and discriminative training
  • The experiments in isolated word recognition
    showed that boosting may improve the best test
    error rates

40
Continuous speech recognition
  • Database
  • Professional dictation
  • An in-house data collection of real-life
    recordings of medical reports
  • The acoustic training corpus consists of about
    58h of data
  • Evaluations were carried out on two test corpora
  • The development corpus consists of 5.0h of speech
  • The evaluation corpus consists of 3.3h of speech
  • Switchboard
  • Spontaneous conversations recorded over telephone
    lines; 57h (73h) of male (female) training data
  • Evaluation corpus
  • Contains about 1h (0.5h) of male (female) speech

41
Continuous speech recognition
  • Professional dictation

42
  • Switchboard

43
Conclusions
  • In this paper, a boosting approach which can be
    applied to any HMM based speech recognizer was
    presented and evaluated
  • The increased recognizer complexity, and thus
    decoding effort, of the boosted systems is a
    major drawback compared to other training
    techniques such as discriminative training

44
Probably Approximately Correct Learning
  • We would like our hypothesis to be approximately
    correct, namely, that the error probability be
    bounded by some value
  • We also would like to be confident in our
    hypothesis: we want to know that our hypothesis
    will be correct most of the time, so we want to
    be probably correct as well
  • Given a class C and examples drawn from some
    unknown but fixed probability distribution, we
    want a hypothesis such that, with probability at
    least 1 - δ, the hypothesis has error at most ε,
    for arbitrarily small δ and ε

45
Probably Approximately Correct Learning
  • How many training examples N should we have,
    such that with probability at least 1 - δ, h has
    error at most ε?

(Figure: most specific hypothesis S, most general hypothesis G; any h ∈ H
between S and G is consistent, and together they make up the version space)

  • Each strip is at most ε/4
  • Pr that we miss a strip: 1 - ε/4
  • Pr that N instances miss a strip: (1 - ε/4)^N
  • Pr that N instances miss 4 strips: 4(1 - ε/4)^N
  • We require 4(1 - ε/4)^N ≤ δ; using (1 - x) ≤ exp(-x),
  • 4 exp(-εN/4) ≤ δ, so N ≥ (4/ε) log(4/δ)
46
The Boosting Approach to Machine Learning An
Overview
  • Robert E. Schapire
  • AT&T Labs, USA
  • MSRI Workshop on Nonlinear Estimation and
    Classification, 2002

47
Abstract
  • This paper overviews some of the recent work on
    boosting, including
  • Analyses of AdaBoost's training error and
    generalization error
  • Boosting's connection to game theory and linear
    programming
  • The relationship between boosting and logistic
    regression
  • Extensions of AdaBoost for multiclass
    classification problems
  • Methods of incorporating human knowledge into
    boosting

48
References
  • Freund and Schapire. 1997. A Decision-Theoretic
    Generalization of On-line Learning and an
    Application to Boosting. Journal of Computer and
    System Sciences 55, 119-139
  • Meir and Rätsch. 2003. An Introduction to
    Boosting and Leveraging. In Advanced Lectures on
    Machine Learning (LNAI 2600), 118-183

49
Introduction
  • Boosting is based on the observation that
  • finding many rough rules of thumb can be a lot
    easier than finding a single, highly accurate
    prediction rule
  • Two fundamental questions
  • How should each distribution be chosen on each
    round?
  • How should the weak rules be combined into a
    single rule?

A method for finding rough rules of thumb is
called a weak or base learning algorithm
50
AdaBoost algorithm
51
AdaBoost algorithm cont.
  • The base learner's job is to find a base
    classifier h_t appropriate for the
    distribution D_t
  • In the binary case, the base learner's aim then
    is to minimize the error
    ε_t = Pr_{i ~ D_t}[h_t(x_i) ≠ y_i]
  • AdaBoost chooses a parameter α_t that
    intuitively measures the importance that it
    assigns to h_t (see the formulas below)
52
Analyzing the training error
  • The most basic theoretical property of AdaBoost
    concerns its ability to reduce the training error
  • Writing γ_t = 1/2 - ε_t, the training error of
    the final classifier H is bounded as follows:
    (1/m) |{i : H(x_i) ≠ y_i}| ≤ ∏_t Z_t
      = ∏_t [2 √(ε_t (1 - ε_t))]
      = ∏_t √(1 - 4 γ_t²) ≤ exp(-2 Σ_t γ_t²)
53
Detail derivation
54
Analyzing the training error cont.
  • The training error can be reduced most rapidly
    by choosing α_t and h_t on each round to
    minimize Z_t
  • In the case of binary classifiers, this gives
    Z_t = 2 √(ε_t (1 - ε_t))
55
Analyzing the training error cont.
  • Thus, if each base classifier is slightly better
    than random, so that γ_t ≥ γ for some γ > 0,
    then the training error drops exponentially fast
    in T
  • In fact, AdaBoost is a procedure for finding a
    linear combination f of base classifiers which
    attempts to minimize Σ_i exp(-y_i f(x_i))

AdaBoost is doing a kind of steepest descent
search to minimize the above quantity, where the
search is constrained at each step to follow
coordinate directions
Mason et al. 1999. Boosting Algorithms as
Gradient Descent. In Advances in Neural
Information Processing Systems 12, 2000
56
Detail derivation