Provided by: shalgi. Learn more at: http://www.cs.huji.ac.il
Voice Browsers
General Magic Demo
  • Making the Web accessible to more of us, more of
    the time.

SDBI November 2001, Shani Shalgi
What is a Voice Browser?
  • Expanding access to the Web
  • Will allow any telephone to be used to access
    appropriately designed Web-based services
  • Server-based
  • Voice portals

What is a Voice Browser?
  • Interaction via key pads, spoken commands,
    listening to prerecorded speech, synthetic speech
    and music.
  • An advantage for people with visual impairments
  • Web access while keeping hands and eyes free for other things (e.g. driving).

What is a Voice Browser?
  • Mobile Web
  • Naturalistic dialogs with Web-based services.

  • Far more people today have access to a telephone than have access to a computer with an Internet connection.
  • Many of us have already or soon will have a
    mobile phone within reach wherever we go.

  • Easy to use, even for people with no computer knowledge or a fear of computers.
  • Voice interaction can escape the physical
    limitations on keypads and displays as mobile
    devices become ever smaller.

  • Many companies already offer services over the phone via menus traversed using the phone's keypad. Voice Browsers are the next generation of call centers, which will become Voice Web portals to the company's services and related websites, whether accessed via the telephone network or via the Internet.

  • Disadvantages of existing methods:
  • WAP (cellular phones, Palm Pilots)
  • Small screens
  • Access speed
  • Limited or fragmented availability
  • Awkward input
  • Price
  • Lack of user habit

Differences Between Graphical and Voice Browsing
  • Graphical browsing is more passive due to the
    persistence of the visual information
  • Voice browsing is more active since the user has
    to issue commands.
  • Graphical Browsers are client-based, whereas
    Voice Browsers are server-based.

Possible Applications
  • Accessing business information
  • The corporate "front desk" which asks callers
    who or what they want
  • Automated telephone ordering services
  • Support desks
  • Order tracking
  • Airline arrival and departure information
  • Cinema and theater booking services
  • Home banking services

Possible Applications (2)
  • Accessing public information
  • Community information such as weather, traffic conditions, school closures, and directions
  • Local, national and international news
  • National and international stock market
  • Business and e-commerce transactions

Possible Applications (3)
  • Accessing personal information
  • Voice mail
  • Calendars, address and telephone lists
  • Personal horoscope
  • Personal newsletter
  • To-do lists, shopping lists, and calorie counters

Advancing Towards Voice
  • Until now, speech recognition and synthesis technologies had to be handcrafted into each application.
  • Voice Browsers intend the voice technologies to be built directly into web servers.
  • This demands transformation of Web content into formats better suited to the needs of voice browsing, or authoring content directly for voice.

  • The World Wide Web Consortium (W3C) develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential as a forum for information, commerce, communication, and collective understanding.

W3C Speech Interface Framework
  • Pronunciation Lexicon
  • Call Control
  • Voice Browser Interoperation
  • VoiceXML
  • Speech Synthesis
  • Speech Recognition
  • DTMF Grammars
  • Speech Grammars
  • Stochastic (N-Gram) Language Models
  • Semantic Interpretation

  • VoiceXML is a dialog markup language designed for
    telephony applications, where users are
    restricted to voice and DTMF (touch tone) input.

Speech Synthesis
  • The specification defines a markup language for
    prompting users via a combination of prerecorded
    speech, synthetic speech and music. You can
    select voice characteristics (name, gender and
    age) and the speed, volume, pitch, and emphasis.
    There is also provision for overriding the
    synthesis engine's default pronunciation.

Speech Recognition
Speech Grammars
Semantic Interpretation
Stochastic Language Models
Touch Tone
DTMF Grammars
  • Touch tone input is often used as an alternative
    to speech recognition.
  • Especially useful in noisy conditions or when the
    social context makes it awkward to speak.
  • The W3C DTMF grammar format allows authors to
    specify the expected sequence of digits, and to
    bind them to the appropriate results
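The W3C DTMF grammar format itself is XML-based; purely as an illustration of binding expected digit sequences to results, a sketch (the menu entries are invented):

```python
# Illustrative sketch only (not the W3C XML grammar format): bind
# expected DTMF digit sequences to application results.
DTMF_BINDINGS = {
    "1": "sales",       # invented menu entries for the example
    "2": "support",
    "3": "billing",
}

def match_dtmf(digits):
    """Return the result bound to a digit sequence, or None (no match)."""
    return DTMF_BINDINGS.get(digits)
```

A real DTMF grammar would also describe multi-digit patterns such as account numbers.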

Speech Grammars
  • In most cases, user prompts are very carefully
    designed to encourage the user to answer in a
    form that matches context free grammar rules.
  • Speech Grammars allow authors to specify rules covering the sequences of words that users are expected to say in particular contexts. These contextual clues allow the recognition engine to focus on likely utterances, improving the chances of a correct match.

Stochastic (N-Gram) Language Models
  • In some applications it is appropriate to use open ended prompts ("How can I help?"). In these cases, context free grammars are of little use.
  • The solution is to use a stochastic language
    model. Such models specify the probability that
    one word occurs following certain others. The
    probabilities are computed from a collection of
    utterances collected from many users.

Semantic Interpretation
  • The recognition process matches an utterance to a speech grammar, building a parse tree as a result.
  • There are two approaches to harvesting semantic results from the parse tree:
  • 1. Annotating grammar rules with semantic
    interpretation tags (ECMAScript).
  • 2. Representing the result in XML.

Semantic Interpretation - Example
  • For example (1st approach), the user utterance
  • "I would like a medium coca cola and a large pizza with pepperoni and mushrooms."
  • could be converted to the following semantic result:
  •   drink:
  •     beverage: "coke"
  •     drinksize: "medium"
  •   pizza:
  •     pizzasize: "large"
  •     topping: "pepperoni", "mushrooms"
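A hypothetical sketch of what the interpretation step might return for the pizza order, as a nested structure (field names follow the slide; the second approach would serialize the same information as XML):

```python
# Hypothetical sketch: the semantic result for the pizza order as a
# nested structure built by the interpretation step.
def interpret_order():
    return {
        "drink": {"beverage": "coke", "drinksize": "medium"},
        "pizza": {"pizzasize": "large",
                  "topping": ["pepperoni", "mushrooms"]},
    }
```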

Pronunciation Lexicon
  • Application developers sometimes need the ability to tune speech engines, whether for synthesis or recognition.
  • W3C is developing a markup language for an open, portable specification of pronunciation information using a standard phonetic alphabet.
  • The most commonly needed pronunciations are for proper nouns such as surnames or business names.

Call Control
  • Fine-grained control of speech (signal
    processing) resources and telephony resources in
    a VoiceXML telephony platform.
  • Will enable application developers to use markup
    to perform call screening, whisper call waiting,
    call transfer, and more.
  • Can be used to transfer a user from one voice browser to another on a completely different platform.
Voice Browser Interoperation
  • Mechanisms to transfer application state, such as a session identifier, along with the user's audio connection.
  • The user could start with a visual interaction on
    a cell phone and follow a link to switch to a
    VoiceXML application.
  • The ability to transfer a session identifier
    makes it possible for the Voice Browser
    application to pick up user preferences and other
    data entered into the visual application.

Voice Browser Interoperation (2)
  • Finally, the user could transfer from a VoiceXML
    application to a customer service agent.
  • The agent needs the ability to use their console
    to view information about the customer, as
    collected during the preceding VoiceXML
    application. The ability to transfer a session
    identifier can be used to retrieve this
    information from the customer database.

Voice Style Sheets?
  • Some extensions are proposed to HTML 4.0 and CSS2
    to support voice browsing
  • Prerecorded content is likely to include music
    and different speakers. These effects can be
    reproduced to some extent via the aural style
    sheets features in CSS2.

Voice Style Sheets!
  • Volume
  • Rate
  • Pitch
  • Direction
  • Spelling out text letter by letter
  • Speech fonts (male/female, adult/child etc.)
  • Inserted text before and after element content
  • Sound effects and music

Authors want control over how the document is rendered. Aural style sheets (part of CSS2) provide a basis for controlling a range of aural rendering effects.
How Does It Work?
  • How do I connect?
  • Do I speak to the browser or does the browser
    speak to me?
  • What is seen on the screen?
  • How do I enter input?

  • How does the browser understand what I say?
  • How can I tell it what I want?
  • What if it doesn't understand?

Overview on Speech Technologies
  • Speech Synthesis
  • Text to Speech
  • Speech Recognition
  • Speech Grammars
  • Stochastic n-gram models
  • Semantic Interpretation

What is Speech Synthesis?
  • Generating machine voice by arranging phonemes
    (k, ch, sh, etc.) into words.
  • There are several algorithms for performing
    Speech Synthesis. The choice depends on the task
    they're used for.

How is Speech Synthesis Performed?
  • The easiest way is to just record the voice of a person speaking the desired phrases.
  • This is useful if only a restricted volume of
    phrases and sentences is used, e.g. schedule
    information of incoming flights. The quality
    depends on the way recording is done.

How is Speech Synthesis Performed?
  • Another option is to record a large database of words:
  • Requires large memory storage
  • Limited vocabulary
  • No prosodic information
  • More sophisticated, but worse in quality, are Text-To-Speech algorithms.

How is Speech Synthesis Performed? Text To Speech
  • Text-To-Speech algorithms split the speech into smaller pieces. The smaller the units, the fewer they are in number, but the quality also decreases.
  • An often used unit is the phoneme, the smallest linguistic unit. Depending on the language used, there are about 35-50 phonemes in western European languages, i.e. we need only 35-50 single recordings.
  • february twenty fifth → f eh b r ax r iy  t w eh n t iy  f ih f th
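The phoneme transcription above comes from a pronunciation dictionary. A toy sketch of dictionary-driven lookup with a naive fallback (the lexicon entries are taken from the slide's example; the fallback rule is invented):

```python
# Toy sketch of dictionary-driven phoneme lookup with a naive
# fallback for unknown words (real systems use trained
# letter-to-sound rules instead of spelling the word out).
LEXICON = {
    "february": "f eh b r ax r iy",
    "twenty":   "t w eh n t iy",
    "fifth":    "f ih f th",
}

def to_phonemes(word):
    try:
        return LEXICON[word.lower()]
    except KeyError:
        return " ".join(word.lower())  # crude letter-by-letter fallback

def transcribe(text):
    return "  ".join(to_phonemes(w) for w in text.split())
```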

Text To Speech
  • The problem is, combining them as fluent speech
    requires fluent transitions between the elements.
    The intelligibility is therefore lower, but the
    memory required is small.
  • A solution is using diphones. Instead of
    splitting at the transitions, the cut is done at
    the center of the phonemes, leaving the
    transitions themselves intact.

Text To Speech
  • This means there are now approximately 1600 recordings needed (40 × 40).
  • The longer the units become, the more elements there are, and the quality increases along with the memory required.
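The 1600 figure is just the phoneme count squared: cutting inside phonemes means every phoneme can, in principle, be followed by every other.

```python
# The diphone inventory grows quadratically: every phoneme paired
# with every possible successor.
def diphone_count(n_phonemes):
    return n_phonemes * n_phonemes
```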

Text To Speech
  • Other units which are widely used are half-syllables, syllables, words, or combinations of them, e.g. word stems and inflectional endings.
  • TTS is dictionary-driven. The larger the dictionary resident in the browser, the better the quality.
  • For unknown words, TTS falls back on rules for regular pronunciation.

Text To Speech
  • Vocabulary is unlimited!
  • But what about the prosodic information?
  • Pronunciation depends on the context in which a word occurs. Limited linguistic analysis is required:
  • "How can I help?"
  • "Help is on the way!"

Text To Speech
  • Another example
  • I have read the first chapter.
  • I will read some more after lunch.
  • For these cases, and in the cases of irregular
    words and name pronunciation, authors need a way
    to provide supplementary TTS information and to
    indicate when it applies.

Text To Speech
  • But specialized representations for phonemic and prosodic information can be off-putting for non-specialist users.
  • For this reason it is common to see simplified ways to write down pronunciation; for instance, the word "station" can be defined as
  • station → stay-shun

Text To Speech
  • This approach encourages users to add
    pronunciation information, leading to an increase
    in the quality of spoken documents, compared to
    more complex and harder to learn approaches.
  • This is where W3C comes in
  • Providing a specification to enable consistent
    control (generating, authoring, processing) of
    voice output by speech synthesizers for varying
    speech content, for use in voice browsing and in
    other contexts.

Overview on Speech Technologies
  • Speech Synthesis
  • Text to Speech
  • Speech Recognition
  • Speech Grammars
  • Stochastic n-gram models
  • Semantic Interpretation

Speech Recognition
  • Automatic speech recognition is the process by
    which a computer maps an acoustic speech signal
    to text.
  • Speech is first digitized and then matched against a dictionary of coded waveforms. The matches are converted into text.

Speech Recognition
  • Types of voice recognition applications
  • Command systems recognize a few hundred words and
    eliminate using the mouse or keyboard for
    repetitive commands.
  • Discrete voice recognition systems are used for
    dictation, but require a pause between each word.
  • Continuous voice recognition understands natural speech without pauses and is the most processing-intensive.
Speech Recognition
  • A speaker dependent system is developed to
    operate for a single speaker.
  • These systems are usually easier to develop,
    cheaper to buy and more accurate, but not as
    flexible as speaker adaptive or speaker
    independent systems.

Speech Recognition
  • A speaker independent system is developed to
    operate for any speaker of a particular type
    (e.g. American English).
  • These systems are the most difficult to develop, most expensive, and accuracy is lower than speaker dependent systems. However, they are more flexible.
Speech Recognition
  • A speaker adaptive system is developed to adapt its operation to the characteristics of new speakers. Its difficulty lies somewhere between speaker independent and speaker dependent systems.
Speech Recognition
  • Speech recognition technologies today are highly advanced.
  • There is a huge gap between the ability to recognize speech and the ability to interpret it.
How is Speech Recognition Performed?
  • Speech recognition technology involves complex
    statistical models that characterize the
    properties of sounds, taking into account factors
    such as male vs. female voices, accents, speaking
    rate, background noise, etc.
  • The process of speech recognition includes 5 steps:
  • 1. Capture and digital sampling
  • 2. Spectral representation and analysis
  • 3. Segmentation
  • 4. Phonetic modeling
  • 5. Search and match
How is Speech Recognition Performed?
  • Speech Grammars
  • HMM (Hidden Markov Modelling)
  • DTW (Dynamic Time Warping)
  • NNs (Neural Networks)
  • Expert systems
  • Combinations of techniques.
  • HMM-based systems are currently the most
    commonly used and most successful approach.
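Of the techniques listed, DTW is the easiest to sketch: it computes the best alignment cost between two sequences of different lengths. A minimal version on 1-D sequences (real recognizers align frames of spectral feature vectors, not scalars):

```python
# Minimal Dynamic Time Warping sketch: best alignment cost between
# two 1-D feature sequences, allowing stretching and compression.
def dtw_distance(a, b):
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three possible warping steps
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Identical shapes align at zero cost even when one sequence is stretched, which is exactly why DTW suits speech of varying rate.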

Speech Grammars
  • The grammar allows a speech application to
    indicate to a recognizer what it should listen
    for, specifically
  • Words that may be spoken,
  • Patterns in which those words may occur,
  • Language of the spoken words.
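As an illustration of the three points above (the vocabulary is invented), a grammar restricting input to "&lt;drink&gt;" or "&lt;size&gt; &lt;drink&gt;" could be checked like this:

```python
# Illustrative sketch of a tiny speech grammar: only utterances
# matching the patterns "<drink>" or "<size> <drink>" are accepted
# (the vocabulary is invented for the example).
SIZES = {"small", "medium", "large"}
DRINKS = {"coffee", "tea", "milk"}

def in_grammar(utterance):
    words = utterance.lower().split()
    if len(words) == 1:
        return words[0] in DRINKS
    if len(words) == 2:
        return words[0] in SIZES and words[1] in DRINKS
    return False
```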

Speech Grammars
  • In simple speech recognition/speech understanding
    systems, the expected input sentences are often
    modeled by a strict grammar (such as a CFG).
  • In this case, the user is only allowed to utter those sentences that are explicitly covered by the grammar.
  • Good for menus, form filling, ordering services, etc.

Speech Grammars
  • Experience shows that a context free grammar of reasonable complexity can never foresee all the different sentence patterns users come up with in spontaneous speech input.
  • This approach is therefore not sufficient for
    robust speech recognition/ understanding tasks or
    free text input applications such as dictation.

For Example
  • Possible answers to a question may be "Yes" or "No", but it could also be any other word used for a negative or positive response. It could be "Ya," "you betch'ya," "sure," "of course" and many other expressions. It is necessary to feed the speech recognition engine with likely utterances representing the desired response.

Speech Grammars
  • What is done?
  • Beta and Pilot versions
  • Upgrade versions

Speech Grammars - Example
  (grammar network diagram built from words such as "very", "big", "pizza", "with", and "and")

Hidden Markov Model
  • Notations:
  • T: observation sequence length
  • O = o1, o2, ..., oT: observation sequence
  • N: number of states (we either know or guess)
  • Q = {q1, ..., qN}: finite set of possible states
  • M: number of possible observations
  • V = {v1, v2, ..., vM}: finite set of possible observations
  • Xt: state at time t (state variable)

Hidden Markov Model
  • Distributional parameters:
  • A = {aij} where aij = P(Xt+1 = qj | Xt = qi) (transition probabilities)
  • B = {bi(k)} where bi(k) = P(Ot = vk | Xt = qi) (observation probabilities)
  • πi = P(X0 = qi) (initial state distribution)

Hidden Markov Model
  • Definitions:
  • A Hidden Markov Model (HMM) is a five-tuple (Q, V, A, B, π).
  • Let λ = (A, B, π) denote the parameters for a given HMM with fixed Q and V.

Hidden Markov Model
  • Problems:
  • 1. Find P(O | λ), the probability of the observations given the model.
  • 2. Find the most likely state trajectory X = x1, x2, ..., xT given the model and observations (find X so that P(O, X | λ) is maximized).
  • 3. Adjust the λ parameters to maximize P(O | λ).
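Problem 1 is solved by the forward algorithm; a sketch in the same notation (the toy two-state parameters below are invented for illustration):

```python
# Forward algorithm sketch for HMM problem 1: compute P(O | lambda).
# A = transition probs, B = observation probs, pi = initial distribution.
def forward(O, A, B, pi):
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]          # t = 0
    for t in range(1, len(O)):                              # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)                                       # termination

# Toy two-state model (invented): state 0 always emits symbol 0,
# state 1 always emits symbol 1.
A = [[0.5, 0.5], [0.5, 0.5]]
B = [[1.0, 0.0], [0.0, 1.0]]
pi = [0.5, 0.5]
```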

Language Models
  • A Language model is a probability distribution
    over word sequences
  • P(And nothing but the truth) ≈ 0.001
  • P(And nuts sing on the roof) ≈ 0

The Equation
  W' = argmaxW P(O | W) P(W)
The N-Gram (Markovian) Language Model
  • Hard to compute P(W), e.g. P(And nothing but the truth).
  • Step 1: Decompose the probability:
  • P(And nothing but the truth) =
  • P(And) × P(nothing | And) × P(but | And nothing) × P(the | And nothing but) × P(truth | And nothing but the)

The Trigram Approximation
  • Assume each word depends only on the previous two words (three words total: "tri" means three, "gram" means writing).
  • P(the | whole truth and nothing but) ≈ P(the | nothing but)
  • P(truth | whole truth and nothing but the) ≈ P(truth | but the)

N-Gram - The Markovian Model
  • The Markovian state machine is an automaton with statistical weights.
  • A state represents a phoneme, diphone or word.
  • We do not include all options, but only those which are related to the context or subject.
  • We calculate all probable paths from beginning to end of phrase/word and return the one with the maximum probability.
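Returning the maximum-probability path is exactly HMM problem 2, solved by Viterbi decoding; a sketch in the earlier notation (the toy parameters are invented):

```python
# Viterbi decoding sketch: the maximum-probability state path for an
# observation sequence O, using the HMM notation (A, B, pi).
def viterbi(O, A, B, pi):
    N = len(pi)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]
    back = []                                   # backpointers per step
    for t in range(1, len(O)):
        new, ptr = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] * A[i][j])
            ptr.append(best)
            new.append(delta[best] * A[best][j] * B[j][O[t]])
        delta, back = new, back + [ptr]
    state = max(range(N), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):                  # trace the path back
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy two-state model (invented): states tend to persist, and each
# state mostly emits its own symbol.
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.8, 0.2], [0.2, 0.8]]
pi = [0.5, 0.5]
```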

Back to Trigrams
  • How do we find the probabilities?
  • Get real text, and start counting!
  • P(the | nothing but) ≈ Count(nothing but the) / Count(nothing but)
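The counting estimate above can be sketched directly (the toy corpus is invented):

```python
# Sketch of the counting estimate above:
# P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
from collections import Counter

def trigram_prob(words, w1, w2, w3):
    tri = Counter(zip(words, words[1:], words[2:]))
    bi = Counter(zip(words, words[1:]))
    if bi[(w1, w2)] == 0:
        return 0.0
    return tri[(w1, w2, w3)] / bi[(w1, w2)]
```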

  • Why stop at 3-grams?
  • If P(z | rstuvwxy) ≈ P(z | xy) is good, then P(z | rstuvwxy) ≈ P(z | vwxy) is better!
  • 4-gram, 5-gram start to become expensive...

The N-Gram (Markovian) Language Model - Summary
  • N-Gram language models are used in large
    vocabulary speech recognition systems to provide
    the recognizer with an a-priori likelihood P(W)
    of a given word sequence W.
  • The N-Gram language model is usually derived from
    large training texts that share the same language
    characteristics as expected input.

Combining Speech Grammars and N-Gram Models
  • Using an N-Gram model in the recognizer and a CFG in a (separate) understanding component
  • Integrating special N-Gram rules at various levels in a CFG to allow for flexible input in specific contexts
  • Using a CFG to model the structure of phrases (e.g. numeric expressions) that are incorporated into a higher-level N-Gram model (class N-Grams)

Overview on Speech Technologies
  • Speech Synthesis
  • Text to Speech
  • Speech Recognition
  • Speech Grammars
  • Stochastic n-gram models
  • Semantic Interpretation

Semantic Interpretation
  • We have recognized the phrases and words. What now?
  • Problems:
  • What does the user mean?
  • We have the right keywords, but the phrase is meaningless or unclear.

Semantic Interpretation
  • As stated before, the technologies of speech recognition exceed those of interpretation.
  • Most interpreters are based on keywords.
  • Sometimes this is not good enough!

Back To Voice Browsers
  • Making the Web accessible to more of us, more
    of the time.
  • Personal Browser Demo
  • Now we'll talk about VoiceXML, navigation and various problems.

VoiceXML - Example 1
  <?xml version="1.0"?>
  <vxml version="1.0">
    <form>
      <block>Hello World!</block>
    </form>
  </vxml>
  • The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next.

VoiceXML - Example 1
  • This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a successor dialog, the conversation ends.

VoiceXML - Example 2
  • Our second example asks the user for a choice of drink and then submits it to a server script:
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar type="application/grammar+xml"/>
    </field>
    <block>
      <submit next="drink2.asp"/>
    </block>
  </form>

VoiceXML - Example 2
  • A sample interaction is:
  • C (computer): Would you like coffee, tea, milk, or nothing?
  • H (human): Orange juice.
  • C: I did not understand what you said. (a platform-specific default message)
  • C: Would you like coffee, tea, milk, or nothing?
  • H: Tea.
  • C: (continues in document drink2.asp)
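The reprompt-until-match behaviour of the field above can be mimicked with a small loop (a toy simulation, not a VoiceXML interpreter):

```python
# Toy simulation of the sample interaction: a field keeps reprompting
# until the user's utterance matches the field grammar.
GRAMMAR = {"coffee", "tea", "milk", "nothing"}

def run_drink_field(user_inputs):
    transcript = []
    for utterance in user_inputs:
        transcript.append("C: Would you like coffee, tea, milk, or nothing?")
        transcript.append("H: " + utterance)
        if utterance.lower() in GRAMMAR:
            transcript.append("C: (continues in document drink2.asp)")
            return transcript
        transcript.append("C: I did not understand what you said.")
    return transcript
```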

VoiceXML - Architectural Model
The VoiceXML interpreter context may listen for a special escape phrase that takes the user to a high-level personal assistant, or for escape phrases that alter user preferences like volume or text-to-speech characteristics.
The implementation platform generates events in
response to user actions (e.g. spoken or
character input received, disconnect) and system
events (e.g. timer expiration).
Scope of VoiceXML
  • Output of synthesized speech (TTS)
  • Output of audio files.
  • Recognition of spoken input.
  • Recognition of DTMF input.
  • Recording of spoken input.
  • Control of dialog flow.
  • Telephony features such as call transfer and disconnect.
The language provides means for collecting character and/or spoken input, assigning the input to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Uniform Resource Identifiers (URIs).
  • VoiceXML is intended to be analogous to graphical surfing.
  • There are limitations.
  • Excellent for menu applications.
  • Awkward for open dialog applications.
  • There are other languages: VoXML, omniviewXML.

  • The user might be able to speak the word "follow" when she hears a hypertext link she wishes to follow.
  • The user could also interrupt the browser to request a short list of the relevant links.

Navigation example
  • User: Links?
  • Browser: The links are:
  • 1. Company info
  • 2. Latest news
  • 3. Placing an order
  • 4. Search for product details
  • Please say the number now.
  • User: 2
  • Browser: Retrieving latest news...

Navigation through Headings
  • Another command could be used to request a list of the document's headings. This would allow users to browse an outline form of the document as a means to get to the section that interests them.
Navigation to Specific URLs
  • Graphical Browsers allow entering a wanted URL in
    the browser window
  • How is this supported in Voice Browsers?
  • Think: What problems do you anticipate?
  • Will we be able to transfer from any voice portal
    to any other?
  • How do we know where to go?

How Slow / Fast ?
  • If voice browsers are meant to replace human
    operator dialog, they must be fast in response.
  • Speech Recognition / Interpretation / Synthesis
    depend on implementation
  • When a user requests a certain document, several related documents can be downloaded in advance for faster access.
Friendly vs. Annoying
  • How friendly do you want the service to be?
  • Friendly is sometimes time consuming.
  • What percentage of the time does the user talk
    and what percentage of the time is he listening?
  • What parameters can I control?

Voice and Graphics
  • Can I access the Voice Browser through my computer?
  • Some will be for both. This leads to more
    difficulties which must be dealt with.

Inserted text
  • When a hypertext link is spoken by a speech
    synthesizer, the author may wish to insert text
    before and after the link's caption, to guide the
    user's response.
  • For example, the link
  • "Driving instructions"
  • may be offered by the voice browser using the following words:
  • "For driving instructions, press 1"

Inserted text
  • The words "For" and "press 1" were added to the text embedded in the anchor element.
  • On first glance it looks as if this 'wrapper'
    text should be left for the voice browser to
    generate, but on further examination you can
    easily find problems with this approach.

Inserted text
  • For example, the text inserted before the following element cannot be "For":
  • "Leave us a message"
  • We need to say:
  • "To leave us a message, press 5"

Inserted text
  • The CSS2 draft specification includes the means
    to provide "generated text" before and after
    element content.
  • For example:
  <a style='cue-before: "To"; cue-after: ", press 5"' href="LeaveMessage.html">Leave us a message</a>

Handling Errors and Ambiguities
  • Users might easily enter unexpected or ambiguous
    input, or just pause, providing no input at all.
  • Some examples of errors which might be generated:
  • When presented with a numbered list of links, the user enters a number that is outside the range presented.
  • The phrase uttered by the user matches more than
    one template rule.

Handling Errors and Ambiguities
  • The phrase/sound uttered doesn't match a known template.
  • The user loses track and the browser needs to time out and offer assistance.
  • Ums and Errs.
  • Authors will have control over the browser
    response to selection errors and timeouts.
  • Other errors might be dealt with by the browser
    or platform.

Some Nice Demos
  • Email assistant demo
  • Bank service demo (cough, ambiguity)
  • Financial Center Demo (ums)
  • Telectronics Demo

Who has implemented VoiceXML interpreters?
  • BeVocal Café
  • General Magic
  • HeyAnita's FreeSpeech Developer Network
  • IBM Voice Server SDK Beta Program based on
    VoiceXML Version 1.0
  • Motorola's Mobile Application Development Toolkit

Who has implemented VoiceXML interpreters?
  • Nuance Developer Network
  • Open VXI VoiceXML interpreter
  • PIPEBEACH's speechWeb
  • Telera's DeVXchange
  • Tellme Studio
  • VoiceGenie