The Vocal Joystick: a voice-based human-computer interface for individuals with motor impairments - PowerPoint PPT Presentation



The Vocal Joystick: a voice-based human-computer
interface for individuals with motor impairments
  • J. Bilmes, X. Li, J. Malkin, K. Kilanski, R.
    Wright, K. Kirchhoff, A. Subramanya, S. Harada,
    J. Landay, P. Dowden, H. Chizeck
  • University of Washington, Seattle

  • Voice-based human-computer interfaces
  • What: a new solution, the Vocal Joystick
  • How it works
  • classification
  • adaptation
  • acceleration
  • Video Demonstrations
  • User Studies

  • There is a significant population of individuals
    with poor (or no) motor abilities who nonetheless
    have perfect (or good) use of their voice.
  • Many devices exist for their use (sip-and-puff
    switches (similar to Morse code), head-tracking
    mice, eye-tracking mice, etc.)

(Videos: eye-tracking mouse, head-mouse)
Issues with existing technology
  • Can be expensive, requiring special-purpose
    hardware
  • It might not be the most efficient option for the
    user
  • When voice-based, it might not use the full
    capabilities of the human voice
  • reduced communication bandwidth
  • users with even partial voice control can do more
  • Standard speech recognition is non-ideal for
    continuous control (e.g., mouse movement, robotic
    limb control). Imagine saying "move left", "move
    up", and so on for every small cursor adjustment.

The Vocal Joystick
  • The Vocal Joystick: use the voice to produce
    real-time continuous control signals to control
    standard computing devices and robotic arms.
  • The analogy of a joystick:
  • a small number of discrete commands (button
    presses) for simple tasks, modality switches, etc.
  • multiple simultaneous continuous degrees of
    freedom controlled by continuous aspects of
    your voice (pitch, amplitude, vowel quality, etc.)

Design Goals
  • easy to learn and remember (by the user)
  • keep cognitive load at a minimum
  • easy to speak (reduce vocal strain)
  • easy to recognize (as noise-robust and
    non-confusable as possible)
  • exploitive: use the full capabilities of the
    human voice
  • universal: attempt to use vocal characteristics
    that minimize the chance that regional
    languages/dialects preclude its use
  • complementary: can be used jointly with existing
    methods
  • computationally cheap: leave enough computational
    headroom for other important applications to run
  • infrastructure: like a library, easy to
    incorporate into applications

The VJ-Mouse
  • Long-term goal of the project: voice-control
    arbitrary systems with a multi-dimensional
    continuous input space
  • So far, we have mostly concentrated on a
    VJ-controlled mouse (which is still quite general)
  • Allows us to perform a variety of tasks on a
    standard WIMP desktop (mouse movement and mouse
    clicks, and thus web browsing, slider control,
    some video games, Dasher typing, etc.)
  • Recent work also shows a simple simulated
    robotic arm.

Vocal Joystick Mapping
  • Standard mice map physical space to physical
    space
  • Here, we must map vocal-tract articulatory change
    to physical space

Vocal Joystick: On Our Mapping
  • The mapping may seem arbitrary
  • certainly, rotational permutations are likely to
    be equally preferred after practice.
  • The mapping is customizable
  • Ultimately, need a user study to see, relative to
    the typical mouse-gesture workload, what, if
    any, mapping is best.

Vocal Joystick Engine
  • Adaptation and acceleration are crucial

Vocal Joystick Mouse-Control Demonstration
  • View Movie
  • Browsing a news website
  • Playing video games
  • Google Maps
  • Training/Visualization tool

How it works
  • A number of problems need to be solved:
  • classification: the VJ vowel classifier is
    discriminative; for high classification accuracy,
    apply rapid adaptation (as in speech recognition)
    in the VJ's vowel classifier
  • acceleration: use loudness to control speed, and
    build a mapping from intentional loudness to the
    rate of positional change on the 2-D screen.

Goals for VJ's vowel classifier
  • Real-time performance
  • latency should be no more than reaction time
    (about 10-50 ms)
  • Classification accuracy: high and robust!
  • A speaker-independent system is likely to perform
    poorly
  • Adaptation is key
  • The adaptation algorithm should be:
  • fast: the user shouldn't spend lots of time
    enrolling
  • accurate (e.g., discriminative or max-margin)
  • no more resources (compute/memory) than the
    speaker-independent system

Two new ideas
  • 1. Maximum-margin MLP
  • 2. Max-margin-based adaptation

Standard MLP Training
  • Forward Pass
  • Given input x, matrix-multiply by W_ih followed by
    a non-linearity (sigmoid), then matrix-multiply
    by W_ho followed by a final non-linearity
    (softmax), giving the final output y
  • Training: given target t, minimize a cost
    (cross-entropy between y and t)
  • Update rule: gradient descent
  • Not convex; finds only a local optimum. Adjusts
    both weight matrices.
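The forward pass and update rule above can be sketched in NumPy; this is a minimal illustration (matrix names W_ih and W_ho, the learning rate, and the cross-entropy loss are the standard choices implied by the slide, not the authors' exact code):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())          # shift for numerical stability
    return e / e.sum()

def forward(x, W_ih, W_ho):
    h = sigmoid(W_ih @ x)            # hidden activations, phi(x)
    y = softmax(W_ho @ h)            # class posteriors
    return h, y

def sgd_step(x, t, W_ih, W_ho, lr=0.1):
    # One gradient-descent update of BOTH weight matrices
    # (cross-entropy loss with softmax output: delta = y - t).
    h, y = forward(x, W_ih, W_ho)
    delta = y - t
    grad_W_ho = np.outer(delta, h)
    delta_h = (W_ho.T @ delta) * h * (1.0 - h)   # back through sigmoid
    grad_W_ih = np.outer(delta_h, x)
    return W_ih - lr * grad_W_ih, W_ho - lr * grad_W_ho
```

Because the loss is non-convex in the pair (W_ih, W_ho), repeated `sgd_step` calls find only a local optimum, as the slide notes.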

Standard Kernel Training
  • Kernel (support-vector) training method: linear
    kernel ⟨w, x_t⟩
  • Finds the optimal (max-margin, low VC-dimension)
    separating hyperplane
  • Convex optimization
  • Complexity lives in the kernel

Hybrid MLP-SVM Classifier
  • Approach
  • Train the MLP parameters using gradient descent
  • Use the resulting MLP's input-to-hidden layer
    (nonlinear mapping) φ(x) as the input-to-feature
    mapping for SVM training
  • Replace the MLP's hidden-to-output layer (linear
    classifiers) by the optimal-margin hyperplane
  • Advantages
  • Unique optimal solution for the last-layer
    parameters, with the same optimal-separating-
    hyperplane guarantees
  • The nonlinear feature mapping is implicitly
    optimized in the form of a kernel (kernel learning)
  • Amenable to max-margin adaptation (described a
    few slides from now); SVs provide flexibility
    for regularization
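The hybrid approach can be sketched with scikit-learn; this is an illustrative stand-in (MLPClassifier and LinearSVC, plus toy data, are assumptions — not the authors' implementation):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Toy stand-in for vowel frames: 4 well-separated classes in 10-D.
rng = np.random.default_rng(0)
means = rng.normal(size=(4, 10)) * 3.0
y = rng.integers(0, 4, size=400)
X = rng.normal(size=(400, 10)) + means[y]

# 1) Train both MLP layers by gradient descent.
mlp = MLPClassifier(hidden_layer_sizes=(50,), activation='logistic',
                    max_iter=500, random_state=0).fit(X, y)

# 2) Freeze the learned input-to-hidden mapping phi(x).
def phi(X):
    W, b = mlp.coefs_[0], mlp.intercepts_[0]
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

# 3) Replace the hidden-to-output layer with a max-margin
#    hyperplane: train a linear SVM on the phi(x) features.
svm = LinearSVC(C=1.0).fit(phi(X), y)
```

Step 3 is a convex problem, so the last-layer solution is unique given the frozen feature mapping — the key advantage claimed on the slide.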

Speaker-Independent Classifiers and Results
  • GMM: 16 mixtures
  • Two-layer MLP
  • Input: 182 MFCC features from 7 consecutive frames
  • Hidden: 50 nodes
  • 7 and 50 were empirically determined to be best
  • Output: 4 or 8 classes
  • Results (error rate, percent)
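The 182-dimensional input comes from stacking consecutive MFCC frames; a minimal sketch (assuming 26 coefficients per frame, which is consistent with 7 × 26 = 182):

```python
import numpy as np

def stack_frames(mfcc, context=7):
    """Stack `context` consecutive frames into one input vector.

    mfcc: (num_frames, num_coeffs) array. With 26 coefficients per
    frame and a 7-frame window this yields 7 * 26 = 182 features,
    matching the MLP input size on the slide. Edge frames are
    handled by repeating the first/last frame.
    """
    half = context // 2
    padded = np.pad(mfcc, ((half, half), (0, 0)), mode='edge')
    return np.stack([padded[t:t + context].ravel()
                     for t in range(len(mfcc))])

frames = np.zeros((100, 26))   # placeholder MFCC frames
X = stack_frames(frames)       # shape (100, 182)
```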

MLP-SVM Adaptation
  • Approach
  • Fix the input-to-hidden layer and adapt the last
    layer
  • Interpolation achieved by a weighted combination
    of the trained support vectors and the new
    adaptation data
  • Adaptation objective: max-margin retraining with
    weight p_t for each trained SV and weight 1 for
    the adaptation data
Kernel Adaptation Strategy
  • Remove all data points but the support vectors
  • Remove support vectors that are too close to the
    margin
  • Add the adaptation data
  • Train using the max-margin criterion (a convex
    problem), producing a speaker-dependent
    separating hyperplane
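The retraining step can be sketched with a weighted SVM; this is a simplified stand-in (a single constant `sv_weight` replaces the per-SV weights p_t, the margin-proximity pruning step is omitted, and the function and data names are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

def adapt(si_svm, X_train, y_train, X_adapt, y_adapt, sv_weight=0.5):
    """Keep only the speaker-independent support vectors,
    down-weight them, give the adaptation data weight 1, and
    retrain with the max-margin criterion (a convex problem)."""
    sv_idx = si_svm.support_                    # indices of SI SVs
    X_sv, y_sv = X_train[sv_idx], y_train[sv_idx]
    X = np.vstack([X_sv, X_adapt])
    y = np.concatenate([y_sv, y_adapt])
    w = np.concatenate([np.full(len(X_sv), sv_weight),
                        np.ones(len(X_adapt))])
    sd_svm = SVC(kernel='linear', C=si_svm.C)   # same C as training
    sd_svm.fit(X, y, sample_weight=w)
    return sd_svm                               # speaker-dependent
```

Because only the support vectors (plus a little adaptation data) are retained, the retraining problem stays small — no more resources than the speaker-independent system, as required by the design goals.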

Error rate results on VJ data
(plot: error rate vs. amount of adaptation data, in seconds)
Acceleration in Vocal Joystick
  • Both loudness and pitch were initially considered
    as candidates (and were implemented) to control
    speed
  • Pitch ended up being less reliable (even using a
    state-of-the-art pitch tracker)
  • People naturally tended to vocalize more softly
    when wishing to move smaller distances (and vice
    versa)
  • Loudness ended up working better.

Acceleration in Vocal Joystick
  • Compute the direction value d_j = Σ_i p_i ⟨v_i, e_j⟩,
  • where v_i is the directional unit vector for
    classifier output i, e_j is the unit vector in
    direction j ∈ {x, y}, and p_i is the classifier
    output probability for class i at time t.

Acceleration in Vocal Joystick
  • Next, compute the acceleration scalar s_j,
  • where E is the current vowel energy, Ē_i is the
    average vowel-i energy, and f(·) and g(·) are
    mapping functions from intentional loudness to
    the desired rate of positional change.

Acceleration in Vocal Joystick
  • Final velocity in direction j is V_j d_j, where
    V_j is derived from the acceleration scalar s_j,
  • and where β and γ are tuning parameters (so far,
    best empirically determined to be β = 1.0, γ = 0.6)
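Putting the three slides together, a minimal Python sketch of the motion computation — with the caveat that the power-law loudness scaling below is a hypothetical stand-in for the unspecified mapping functions f and g, and the vowel-to-direction assignment is illustrative only:

```python
import numpy as np

def vj_velocity(probs, directions, energy, avg_energy,
                beta=1.0, gamma=0.6):
    """probs:      classifier posteriors p_i per vowel class
    directions:    unit vector v_i per vowel class, shape (n, 2)
    energy:        current vowel energy E
    avg_energy:    average energy per vowel class (Ē_i)

    d_j = sum_i p_i <v_i, e_j> for j in {x, y}; the loudness
    scaling (beta * relative_energy ** gamma) is an ASSUMED form,
    not the formula from the talk.
    """
    d = probs @ directions                         # (d_x, d_y)
    rel = energy / np.maximum(probs @ avg_energy, 1e-8)
    s = beta * rel ** gamma                        # acceleration scalar
    return s * d                                   # (V_x, V_y)

# Example: 4 vowel classes mapped to right/up/left/down.
dirs = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]], float)
v = vj_velocity(np.array([0.9, 0.1, 0.0, 0.0]), dirs,
                energy=2.0, avg_energy=np.ones(4))
```

Softer-than-average vocalization (rel < 1) shrinks the step, matching the observation that users naturally vocalize more softly for small movements.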

User Study: VJ vs. a modern desktop mouse
  • An earlier version of the VJ engine was used.
  • Compared two tasks:
  • link: navigate through a prescribed set of web
    links
  • map: use Google Maps to navigate from a USA view
    to a University of Washington view.

User Study: VJ vs. mouse
  • Task completion time results
  • link: VJ is about 4 times slower; map: VJ is
    about 7.5 times slower.

User Study: VJ vs. a modern Eye Tracking mouse
  • Compared two tasks:
  • Target acquisition: how quickly can you move from
    a starting object to another target object and
    click on that object? Variations as a function of
    distance, target size, and relative angle
  • Web browsing: similar to the previous task
    (navigate a prescribed set of links)

Target Acquisition (VJ vs. ET)
Web Browsing Task (VJ vs. ET)
  • A new voice-based human-computer interface for
    individuals with motor impairments.
  • Use continuous aspects of the human voice to
    effect continuous movement in on-screen devices
  • New classification, rapid adaptation, and
    acceleration algorithms.
  • It appears to work!

Standard MLP Training
  • Forward Pass
  • Input x presented
  • Matrix-multiply by W_ih followed by a
    non-linearity (sigmoid)
  • Matrix-multiply by W_ho followed by the final
    non-linearity (softmax)
  • Final output y
  • Training: given target pattern t, propagate the
    delta (y − t) back through the network, using
    gradient descent to update the two weight
    matrices.

Outline: the Why, What, and Results
The Why
  • The Vocal Joystick Project at the University of
    Washington
  • continuous control with your voice (mice, robotic
    arms)

The What
  • Comparison of adaptation strategies for Gaussian
    mixture models (GMMs), multi-layer perceptrons
    (MLPs), and max-margin trained MLPs.

The Results
  • Max-margin trained and adapted MLPs outperform
    standard adaptation methods (maximum-likelihood
    linear regression (MLLR) and gradient descent
    (GD)).

Maximum Margin Learning and Adaptation of MLP
Our Initial Solution: This Paper
  • Xiao Li, Jeff Bilmes and Jonathan Malkin
  • Signal, Speech, and Language Interpretation
    Laboratory (SSLI-LAB)
  • Department of Electrical Engineering
  • University of Washington, Seattle

  • Vocal-Joystick Vowel Database (our own
    collection)
  • Constant-vowel recordings with differing:
  • duration
  • loudness
  • pitch
  • 15 speakers (out of 40) used for the training and
    test sets. Each speaker has 18 utterances for each
    vowel.
  • Training set: 10 speakers
  • Test set: 5 speakers

Adaptation Parameters
  • Weight p_t is determined by how close an SV is to
    the adaptation-data distribution.
  • Use an SI support vector if it is far enough from
    the margin
  • Use a hard threshold d > 0, which controls the
    tradeoff between the SI model and the adaptation
    data
  • In experiments so far, the coefficient C during
    adaptation is the same as that used in training
  • this need not be the best assumption, however!

Preliminary Adaptation Experiments
  • Vary the amount of adaptation data:
  • 1, 2 and 3 utterances (1.2, 1.8 and 3.6 s)
  • d determined on an eval set
  • 4-class case: choose all training support vectors
  • 8-class case: choose about half the support
    vectors
  • Comparison (using the same C as in training):
  • MLP: gradient-descent adaptation of the 2nd layer
  • MLP-SVM: SV-based adaptation

Database for adaptation
  • Training set: 10 speakers
  • Test set: 5 speakers; for each test speaker:
  • the 18 utterances for each vowel are divided into
    6 subsets
  • adapt on each subset and evaluate on the rest
  • the error rate is an average over the 6 subsets
  • The final error rate is an average over the 5
    speakers, and hence over 30 subsets for each vowel
    (averaged again to get the final number)
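This round-robin protocol can be sketched as a short loop (here `error_fn` is a hypothetical callback standing in for "adapt on one subset, evaluate on the rest"):

```python
import numpy as np

def average_error(speakers, subsets_per_speaker=6, error_fn=None):
    """Average the per-subset error over every (speaker, subset)
    pair: with 5 test speakers and 6 subsets each, this averages
    30 numbers, matching the protocol on the slide."""
    errs = [error_fn(spk, s)
            for spk in speakers
            for s in range(subsets_per_speaker)]
    return float(np.mean(errs))
```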

Previous Work
  • Explicitly model the source of variation
  • Vocal tract length normalization (Cohen 94, Lin
    95, Eide 96, Welling 02)
  • Statistical methods to adapt the classifier:
  • Gaussian mixture models (GMM): maximum-likelihood
    linear regression (MLLR) (Gales 96), MAP
    (Gauvain 94), Eigenvoice (Kuhn 00)
  • Multilayer perceptrons (MLP):
  • adding speaker-dependent units (Neto 95, Strom)
  • re-training part of the last layer (Stadermann)
  • adding an additional input layer (Abrash 95)
  • Support vector machines (SVM): incremental
    learning approach (Matic 93, Peng 02)

Vocal Joystick 3-joint Robot-Arm Control
  • A number of issues need to be resolved
  • how to do extremely accurate real-time
    classification of vowels mixed in with discrete
    commands, amplitude/energy detection, and do so
    without using up all computing resources
  • How should acceleration (as in a standard mouse)
    be generalized to the vocal-tract-to-2D-space
    mapping?

  • Discussions
  • The performance of an MLP classifier can be
    enhanced by applying maximum margin training in
    the last layer.
  • The SVs can be combined, with weights, with
    adaptation data to retrain the MLP's last layer
    for a new speaker.
  • Future work will present new ways to adapt MLPs
    in a max-margin framework that so far work even
    better.

MLP-Kernel Training
  • Kernel (support-vector) training method: linear
    kernel
  • The only difference is the use of the
    MLP-optimized input-to-hidden layer for the
    input-to-feature-space mapping in the SVM

Two new ideas used here
  • 1. Maximum-margin MLP
  • Use the input-to-hidden layer of an
    already-trained MLP as the input-to-feature-space
    mapping φ(x)
  • Given this mapping (a data-driven kernel), train
    the optimal separating hyperplane corresponding
    to the 2nd layer of the MLP
  • 2. Max-margin-based adaptation
  • Keep the support vectors from a
    speaker-independent trained system (as above),
    and add (at most) only them to the set of
    adaptation training data
  • Weight the speaker-independent support vectors
    depending on the amount of adaptation data