1
Regularized Adaptation Theory, Algorithms and
Applications
  • Xiao Li
  • Electrical Engineering Department
  • University of Washington

2
Roadmap
  • Introduction
  • Theoretical results
  • A Bayesian fidelity prior for adaptation
  • Generalization error bounds
  • Regularized adaptation algorithms
  • SVM and MLP adaptation
  • Experiments on vowel and object classification
  • The application to the Vocal Joystick
  • Conclusions and future work

3
Inductive Learning
  • Given
  • a set of m samples (x_i, y_i) ~ p(x, y)
  • a decision function space F of functions f : X → {±1}
  • Goal
  • learn a decision function that minimizes
    the expected error
  • In practice
  • minimize the empirical error
  • while applying certain regularization strategy to
    achieve good generalization performance
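
In symbols, a minimal sketch of this setup, assuming a generic loss ℓ and regularizer Ω (notation not taken from the slide):

```latex
\hat{f} \;=\; \operatorname*{argmin}_{f \in \mathcal{F}}\;
\underbrace{\frac{1}{m} \sum_{i=1}^{m} \ell\big(f(x_i), y_i\big)}_{R_{\mathrm{emp}}(f)}
\;+\; \lambda\, \Omega(f)
```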

4
Why Is Regularization Helpful?
  • Learning theory says
  • Frequentist: Vapnik's VC bound expresses the capacity term as a
    function of the VC dimension of F
  • Bayesian: the Occam's Razor bound expresses the capacity term as a
    function of the prior probability of f
  • Accuracy-regularization
  • We want to minimize the empirical error as well
    as the capacity
  • Frequentist: support vector machines
  • Bayesian: Bayesian model selection
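
For reference, standard textbook statements of the two bounds mentioned above (not copied from the slide): with probability at least 1 − δ over the m training samples,

```latex
% Vapnik's VC bound, with h the VC dimension of F
R(f) \;\le\; R_{\mathrm{emp}}(f)
  + \sqrt{\frac{h\big(\ln\tfrac{2m}{h} + 1\big) + \ln\tfrac{4}{\delta}}{m}}

% Occam's Razor bound, for a countable F with prior p(f)
R(f) \;\le\; R_{\mathrm{emp}}(f)
  + \sqrt{\frac{\ln\tfrac{1}{p(f)} + \ln\tfrac{1}{\delta}}{2m}}
```

In both cases the second term is the capacity penalty that accuracy-regularization trades off against the empirical error.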

5
Adaptive Learning
  • Two related yet different distributions
  • Training distribution p_tr(x, y)
  • Target (test-time) distribution p_ad(x, y)
  • Given
  • An unadapted model
  • Adaptation data (labeled)
  • Goal
  • Learn an adapted model that is as close as
    possible to our desired model
  • Notes
  • Assume sufficient training data but limited
    adaptation data
  • Training data is not preserved

6
Scenarios
  • Customization
  • Speech recognition: speaker adaptation
  • Handwriting recognition: writer adaptation
  • Language processing: domain adaptation
  • Evolutionary environments
  • Spam filtering
  • Incremental/sequential learning
  • Start from a simple or rough model and refine
    incrementally

7
Practical Work on Adaptation
  • Gaussian mixture models (GMMs)
  • MAP (Gauvain 94), MLLR (Leggetter 95)
  • Support vector machines (SVMs)
  • Boosting-like approach (Matic 93)
  • Weighted combination of old support vectors and
    adaptation data (Wu 04)
  • Multi-layer perceptrons (MLPs)
  • Shared internal representation (Baxter 95,
    Caruana 97, Stadermann 05)
  • Linear input network (Neto 95)
  • Conditional maximum entropy models
  • Gaussian prior (Chelba 04)

8
This Work Seeks Answers to
  • A unified and principled approach to adaptation
  • applicable to a variety of classifiers
  • amenable to variations in the amount of
    adaptation data
  • Quantitative relationships between
  • the generalization error bound (or sample
    complexity bound) and
  • the divergence between training and target
    distributions

9
Roadmap
  • Introduction
  • Theoretical results
  • A Bayesian fidelity prior for adaptation
  • Generalization error bounds
  • Regularized adaptation algorithms
  • SVM and MLP adaptation
  • Experiments on vowel and object classification
  • The application to the Vocal Joystick
  • Conclusions and future work

10
Bayesian Fidelity Prior
  • Adaptation objective
  • R_emp(f): empirical error on the adaptation
    data
  • p_fid(f): Bayesian fidelity prior
  • Fidelity prior
  • How likely a classifier is given a training
    distribution (rather than a training set; this is the key
    difference from hierarchical Bayes approaches,
    e.g., Baxter 97)
  • Applicable to different classifiers
  • Relates to the KL-divergence
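
A plausible reading of the adaptation objective on this slide (the trade-off weight λ and the exact combination are assumptions, not the slide's formula):

```latex
\hat{f}_{\mathrm{ad}} \;=\; \operatorname*{argmin}_{f \in \mathcal{F}}\;
R_{\mathrm{emp}}(f) \;-\; \lambda \log p_{\mathrm{fid}}(f),
\qquad
p_{\mathrm{fid}}(f) \;\triangleq\; p\big(f \mid p_{\mathrm{tr}}(x, y)\big)
```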

11
Generative Models
  • Generative models p(x, y | f)
  • Classification
  • Posterior
  • Assume f_tr and f_ad are the true models
    generating the training and target distributions,
    respectively, i.e. (see the equation below)
  • Note that this assumption is justifiable if the
    function space contains the true models and if we
    use the log likelihood loss
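
The equation referenced by "i.e." above is presumably:

```latex
p_{\mathrm{tr}}(x, y) = p(x, y \mid f_{\mathrm{tr}}),
\qquad
p_{\mathrm{ad}}(x, y) = p(x, y \mid f_{\mathrm{ad}})
```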

p_0(f): standard prior, chosen before training
12
Fidelity Prior for Generative Models
  • Key result
  • where β > 0
  • Implication
  • Fidelity prior at the desired model
  • We are more likely to learn our desired model
    using the fidelity prior than using the standard
    prior
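
The key result appears as a formula on the original slide; a form consistent with the surrounding text, where p_0(f) denotes the standard prior chosen before training, would be (an assumption, not the slide's exact statement):

```latex
p_{\mathrm{fid}}(f) \;\propto\; p_0(f)\,
\exp\!\Big(-\beta\, D\big(p(x, y \mid f_{\mathrm{tr}}) \,\big\|\, p(x, y \mid f)\big)\Big),
\qquad \beta > 0
```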

13
Instantiations
  • To compute the fidelity prior
  • assuming a uniform standard prior, this prior
    is determined by the KL-divergence
  • In cases where the KL-divergence does not have a closed
    form, we use an upper bound on it instead (hence a
    lower bound on the prior)
  • Gaussian models
  • The fidelity prior is a normal-Wishart
    distribution
  • Mixture models
  • An upper bound on the KL-divergence (using the log-sum
    inequality)
  • Hidden Markov models
  • An upper bound on the KL-divergence (Silva 06)
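
Two standard facts behind these instantiations, stated in generic notation:

```latex
% KL-divergence between d-dimensional Gaussians (closed form)
D\big(\mathcal{N}(\mu_0, \Sigma_0)\,\|\,\mathcal{N}(\mu_1, \Sigma_1)\big)
= \tfrac{1}{2}\Big[\operatorname{tr}\!\big(\Sigma_1^{-1}\Sigma_0\big)
+ (\mu_1 - \mu_0)^{\top}\Sigma_1^{-1}(\mu_1 - \mu_0)
- d + \ln\tfrac{\det \Sigma_1}{\det \Sigma_0}\Big]

% Matched-component upper bound for mixtures, via the log-sum inequality
D\Big(\textstyle\sum_k w_k\, p_k \,\Big\|\, \textstyle\sum_k v_k\, q_k\Big)
\;\le\; D(w\,\|\,v) + \textstyle\sum_k w_k\, D(p_k\,\|\,q_k)
```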

14
Discriminative Models
  • A unified view of SVMs, MLPs, CRFs, etc.
  • Affine classifiers in a transformed feature space: f = (w, b)
  • Classification
  • Conditional likelihood (for binary case)

15
Discriminative Models (cont.)
  • Conditional models p(y | x, f)
  • Classification
  • Posterior
  • Assume f_tr and f_ad are the true models
    generating the training and target conditional
    distributions, respectively, i.e.
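
Presumably (the formula itself is not in the transcript):

```latex
p_{\mathrm{tr}}(y \mid x) = p(y \mid x, f_{\mathrm{tr}}),
\qquad
p_{\mathrm{ad}}(y \mid x) = p(y \mid x, f_{\mathrm{ad}})
```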

16
Fidelity Prior for Conditional Models
  • Again a divergence
  • where β > 0
  • What if we do not know p_tr(x, y)?
  • We seek an upper bound on the KL-divergence and
    hence a lower bound on the prior
  • Key result
  • where
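
By analogy with the generative case, a plausible form of the fidelity prior for conditional models, with the expectation taken over the training input distribution, would be (an assumption about the exact statement):

```latex
p_{\mathrm{fid}}(f) \;\propto\; p_0(f)\,
\exp\!\Big(-\beta\; \mathbb{E}_{x \sim p_{\mathrm{tr}}(x)}
\Big[ D\big(p(y \mid x, f_{\mathrm{tr}}) \,\big\|\, p(y \mid x, f)\big) \Big]\Big),
\qquad \beta > 0
```

The key result then replaces the quantity in the exponent, which depends on the unknown p_tr(x), with a computable upper bound, giving a lower bound on the prior.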

17
Roadmap
  • Introduction
  • Theoretical results
  • A Bayesian fidelity prior for adaptation
  • Generalization error bounds
  • Regularized adaptation algorithms
  • SVM and MLP adaptation
  • Experiments on vowel and object classification
  • The application to the Vocal Joystick
  • Conclusions and future work

18
Occam's Razor Bound for Adaptation
  • For a countable function space
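
The bound itself is a formula on the original slide; plugging the fidelity prior into the standard Occam's Razor bound gives the following sketch (not the slide's exact statement), where R_ad(f) is the expected error under the target distribution: with probability at least 1 − δ over the m adaptation samples, for every f in a countable F,

```latex
R_{\mathrm{ad}}(f) \;\le\; R_{\mathrm{emp}}(f)
+ \sqrt{\frac{\ln\tfrac{1}{p_{\mathrm{fid}}(f)} + \ln\tfrac{1}{\delta}}{2m}}
```

Since −ln p_fid(f) involves the divergence between the training and target distributions, the bound tightens as the two distributions get closer; the plot summarized on the next slide compares this behavior with the bound obtained from the standard prior as m grows.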

19
[Plot: generalization error bound as a function of the number of adaptation
samples m, comparing the bound using the standard prior with the bounds
using the divergence (fidelity) prior]
20
PAC-Bayesian Bounds for Adaptation
  • For both countable and uncountable function
    spaces
  • Choice of prior p( f ) and posterior q( f )
  • D( q(f) || p(f) ) and the stochastic error are
    easily computable
  • Use p_fid(f) or its related forms as the prior
  • Choose q( f ) to have the same parametric form
  • Examples
  • Gaussian models
  • Linear classifier
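
For reference, one standard form of the PAC-Bayesian theorem being instantiated here (Langford/Seeger form, not copied from the slide): for every posterior q over F, with probability at least 1 − δ over the m adaptation samples,

```latex
\mathrm{kl}\big(\hat{R}(q) \,\big\|\, R(q)\big)
\;\le\; \frac{D\big(q(f)\,\|\,p(f)\big) + \ln\tfrac{m+1}{\delta}}{m}
```

where kl(·||·) is the KL-divergence between Bernoulli distributions, and R̂(q) and R(q) are the stochastic empirical and true errors of the Gibbs classifier drawn from q.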

21
Roadmap
  • Introduction
  • Theoretical results
  • A Bayesian fidelity prior for adaptation
  • Generalization error bounds
  • Regularized adaptation algorithms
  • SVM and MLP adaptation
  • Experiments on vowel and object classification
  • The application to the Vocal Joystick
  • Conclusions and future work

22
Algorithms Derived from the Fidelity Prior
  • Generative Models
  • Relation to MAP adaptation
  • Conditional Models
  • Log linear models
  • We focus on SVMs and MLPs

23
Regularized SVM Adaptation
  • Optimization objective
  • Globally optimal solution
  • Regularized: fix the old support vectors and
    their coefficients
  • Extended regularized: also update the coefficients of the old
    support vectors
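
A minimal sketch of an objective in this spirit (one natural instantiation, not necessarily the exact form on the slide): penalize deviation from the unadapted weight vector w_un while fitting the adaptation data with the hinge loss,

```latex
\min_{w,\, b} \;\; \frac{\lambda}{2}\,\big\| w - w_{\mathrm{un}} \big\|^{2}
\;+\; \frac{1}{m} \sum_{i=1}^{m}
\max\big(0,\; 1 - y_i\,(\langle w, \phi(x_i)\rangle + b)\big)
```

In the kernel setting, w_un lives in the feature space induced by φ and is represented through the old support vectors and their coefficients, which is why the regularized variant keeps those fixed while the extended regularized variant also re-estimates them.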

24
Algorithms in Comparison
  • Unadapted
  • Retrained
  • Use adaptation data only
  • Boosted (Matic 93)
  • Select adaptation data misclassified by the
    unadapted model
  • Combine with old support vectors
  • Bootstrapped (proposed in thesis)
  • Train a seed classifier using adaptation data
    only
  • Select old support vectors correctly classified
    by the seed classifier
  • Combine them with the adaptation data (see the sketch
    after this list)
  • Regularized and Extended regularized
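
A hedged scikit-learn sketch of the bootstrapped heuristic described above (function and variable names are illustrative; the thesis does not specify this implementation):

```python
# Bootstrapped SVM adaptation: train a seed SVM on the adaptation data,
# keep only the old support vectors that the seed classifies correctly,
# then retrain on the combined set. (Assumed implementation, not thesis code.)
import numpy as np
from sklearn.svm import SVC

def bootstrap_adapt(old_svs, old_sv_labels, X_ad, y_ad, **svm_kwargs):
    seed = SVC(**svm_kwargs).fit(X_ad, y_ad)        # seed classifier, adaptation data only
    keep = seed.predict(old_svs) == old_sv_labels   # old SVs the seed gets right
    X = np.vstack([X_ad, old_svs[keep]])
    y = np.concatenate([y_ad, old_sv_labels[keep]])
    return SVC(**svm_kwargs).fit(X, y)              # retrain on the combined set
```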

25
Regularized MLP Adaptation
  • Optimization objective for a multi-class,
    two-layer MLP
  • W_h2o and W_i2h: the hidden-to-output and
    input-to-hidden weight matrices, respectively
  • ||W|| denotes the L2 norm
  • R_emp(f): cross-entropy, corresponding to the log
    loss
  • Locally optimal solution found using
    back-propagation (see the sketch below)
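
A minimal PyTorch sketch of this objective, assuming a generic two-layer MLP and a loader of labeled adaptation samples (names, shapes and hyperparameters are illustrative, not the thesis implementation):

```python
# Regularized MLP adaptation: cross-entropy on the adaptation data plus an
# L2 penalty pulling every weight matrix toward its unadapted value.
import copy
import torch
import torch.nn as nn

def adapt_mlp(unadapted: nn.Module, loader, lam=1.0, lr=1e-3, epochs=10):
    model = copy.deepcopy(unadapted)                      # start from the unadapted model
    anchor = {n: p.detach().clone() for n, p in unadapted.named_parameters()}
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()                            # R_emp: cross-entropy / log loss

    for _ in range(epochs):
        for x, y in loader:                               # (features, class labels)
            opt.zero_grad()
            loss = ce(model(x), y)
            for n, p in model.named_parameters():         # fidelity regularizer
                loss = loss + 0.5 * lam * (p - anchor[n]).pow(2).sum()
            loss.backward()                               # locally optimal via back-propagation
            opt.step()
    return model
```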

26
Algorithms in Comparison
  • Unadapted
  • Retrained
  • Start from randomly initialized weights and train
    with weight decay
  • Linear input network (Neto 95)
  • Add a linear transformation in the input space
  • Retrained speaker-independent (Neto 95)
  • Start from the unadapted model; train both layers
  • Retrained last layer (Baxter 95, Caruana 97,
    Stadermann 05)
  • Start from the unadapted model; train only the last
    layer
  • Retrained first layer (proposed in thesis)
  • Start from the unadapted model; train only the first
    layer
  • Regularized
  • Note that all of the above (except retrained) can be
    considered special cases of the regularized approach (see the
    sketch below)
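
To make the last bullet concrete: with separate per-layer regularization coefficients (an assumed parameterization), the baselines correspond roughly to limiting choices of the regularizer,

```latex
\begin{aligned}
\lambda_{\mathrm{i2h}} \to \infty,\;\; \lambda_{\mathrm{h2o}} = 0
  &\;\Rightarrow\; \text{retrained last layer}\\
\lambda_{\mathrm{i2h}} = 0,\;\; \lambda_{\mathrm{h2o}} \to \infty
  &\;\Rightarrow\; \text{retrained first layer}\\
\lambda_{\mathrm{i2h}} = \lambda_{\mathrm{h2o}} = 0
  \text{ (started from the unadapted weights)}
  &\;\Rightarrow\; \text{retrained speaker-independent}
\end{aligned}
```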

27
Roadmap
  • Introduction
  • Theoretical results
  • A Bayesian fidelity prior for adaptation
  • Generalization error bounds
  • Regularized adaptation algorithms
  • SVM and MLP adaptation
  • Experiments on vowel and object classification
  • The application to the Vocal Joystick
  • Conclusions and future work

28
Experimental Paradigm
  • Goal
  • To compare adaptation algorithms for a given
    classifier
  • Not to compare adaptation algorithms across
    classifiers
  • Procedure
  • Train an unadapted model on training set
  • Adapt (with supervision) and evaluate via n-fold
    CV on test set
  • Select regularization coefficients on the dev set
  • Corpora
  • VJ vowel dataset (Kilanski 06)
  • NORB image dataset (LeCun 04)

29
VJ Vowel Dataset
  • Task
  • 8 vowel classes
  • Frame-level classification error rate
  • Speaker adaptation
  • Data allocation
  • Training set: 21 speakers, 420K samples
  • For the SVM, we randomly selected 80K samples for
    training
  • Test set: 10 speakers, 200 samples
  • Dev set: 4 speakers, 80 samples
  • Features
  • 182 dimensions: 7 frames of MFCC+delta features

30
SVM Adaptation
  • RBF kernel (std = 10), optimized for training and
    fixed for adaptation
  • Mean and std. dev. over 10 speakers; results shown in red are
    significant at the p < 0.001 level

31
MLP Adaptation (I)
  • 50 hidden nodes
  • Mean and std. dev over 10 speakers

32
MLP Adaptation (II)
  • Varying number of vowel classes available in
    adaptation data

33
NORB Image Dataset
  • Task
  • 5 object classes
  • Lighting condition adaptation
  • Data allocation
  • Training set: 2700 samples
  • Test set: 2700 samples
  • Features
  • 32x32 raw images

34
SVM Adaptation
  • RBF kernel (std = 500), optimized for training and
    fixed for adaptation
  • Mean and std. dev over 6 lighting conditions

35
MLP Adaptation
  • 30 hidden nodes
  • Mean and std. dev over 6 lighting conditions

36
Roadmap
  • Introduction
  • Theoretical results
  • A Bayesian fidelity prior for adaptation
  • Generalization error bounds
  • Regularized adaptation algorithms
  • SVM and MLP adaptation
  • Experiments on vowel and object classification
  • The application to the Vocal Joystick
  • Conclusions and future work

37
Why the Vocal Joystick?
  • Computer interfaces for individuals with
    motor impairments
  • Head tracking
  • Eye-gaze tracking
  • Brain-computer interfaces
  • Expensive and error prone
  • Speech is a natural solution, but
  • Most suitable for discrete commands
  • Or, requires a more complex syntax

38
What Is the Vocal Joystick?
  • A voice-based interface
  • Produces real-time, continuous signals to control
    standard computing devices and robotic arms
  • Acoustic Parameters
  • Vowel quality
  • Loudness
  • Pitch
  • Discrete sound identity
  • VJ mouse

39
VJ Engine
[Architecture diagram of the VJ engine; labeled components: dynamic
Bayesian network, two-layer MLP, phoneme HMMs]
40
Adaptation in the VJ
  • Why adaptation is important
  • User variability, style mismatch and channel
    mismatch
  • Adaptation tools
  • Regularized MLP adaptation for vowel
    classification
  • Regularized GMM adaptation for discrete sound
    recognition