VoiceXML: SSML (Speech Synthesis Markup Language), recorded speech and audio

Provided by: Michael2145

1
VoiceXML: SSML (Speech Synthesis Markup Language)
Recorded speech and audio
2
Acknowledgements
  • Prof. McTear, Natural Language Processing,
    http://www.infj.ulst.ac.uk/nlp/index.html,
    University of Ulster.

3
Overview
  • Speech Synthesis Markup Language (SSML)
  • Phases of Text to Speech Synthesis
  • Structure analysis
  • Text normalisation
  • Text to phoneme conversion
  • Prosody analysis
  • Waveform production
  • Recorded speech

4
SSML
  • Speech Synthesis Markup Language
  • enables developers to override default
    specifications
  • Stages
  • Structure analysis
  • Text normalisation
  • Text to phoneme conversion
  • Prosody analysis
  • Waveform production

5
Structure Analysis
  • Division of text into basic elements, e.g.
    sentence and paragraph, to support more natural
    phrasing
  • <s> - sentence
  • <p> - paragraph
  • Structure is inferred from punctuation and
    formatting, but this is not always reliable, e.g.
  • Dr. Lewis works at the clinic on Sunset Dr. in
    western Portland.
  • Dr. Smith lives at 214 Elm Dr.  He weighs 214 lb.
    He plays bass guitar.  He also likes to fish;
    last week he caught a 20 lb. bass.
  • Explicit markup removes the ambiguity:
  • <p>
    <s>Dr. Smith lives at 214 Elm Dr.</s>
    <s>He weighs 214 lb.</s>
    <s>He plays bass guitar.</s>
    <s>He also likes to fish; last week he caught a 20 lb. bass.</s>
    </p>

6
Text Normalisation
  • Annotation of text so that it is spoken correctly
  • Ambiguous examples
  • 1/2 - may be spoken as "half", "January second",
    "February first", or "one of two".
  • Dr. may be "doctor" or "drive", e.g. "Dr. John
    Dr." is rewritten as "Doctor John Drive".
  • St. may be "saint" or "street", e.g. "St. John
    St." is rewritten as "Saint John Street".
  • Acronyms: some, e.g. ACM or IEEE, should be spelled
    out; others are pronounced as words, e.g. RAM, ROM.
  • Email addresses, e.g. catazman@bee.com
  • First part: Cat Azman, C.A.Tazman, or C.
    Atazman?
  • Last part: Bee dot com or B.E.E. dot com?
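
The ambiguous cases above can be settled by the author instead of the engine. The following is a minimal sketch of a standalone SSML 1.0 document that does this with alias substitution; the spoken reading chosen for the email address is an assumption about the intended name, not something the slides specify.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- "St." occurs twice: first as a title, then as a street type -->
  <s><sub alias="Saint">St.</sub> John <sub alias="Street">St.</sub></s>
  <!-- The alias below is one assumed reading of the address -->
  <s>Write to <sub alias="cat azman at bee dot com">catazman@bee.com</sub>.</s>
</speak>
```

The written form stays in the document, so the same text can still be displayed or logged unchanged.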

7
<sub>
  • New in VoiceXML 2.0 (Speech Synthesis Markup).
  • Syntax:
  • <sub alias="substituteText"> OriginalText </sub>
  • Description: Language element whose alias
    attribute provides substitute text to be spoken
    instead of the contained text. This allows the
    document to contain both a written and a spoken
    form for a string.

8
<sub>
  • With every abbreviation and number substituted:
  • <sub alias="doctor">Dr.</sub>
  • Smith lives at
  • <sub alias="two fourteen">214</sub>
  • Elm <sub alias="drive">Dr.</sub>
  • He weighs <sub alias="two hundred and fourteen">214</sub>
  • <sub alias="pounds">lb.</sub>
  • He plays bass guitar.
  • He also likes to fish; last week he caught a
    <sub alias="twenty">20</sub> <sub alias="pound">lb.</sub> bass.
  • With only the abbreviations substituted:
  • <sub alias="doctor">Dr.</sub>
  • Smith lives at 214 Elm
  • <sub alias="drive">Dr.</sub>
  • He weighs 214 <sub alias="pounds">lb.</sub>
  • He plays bass guitar.
  • He also likes to fish; last week he caught a 20
    <sub alias="pound">lb.</sub> bass.

9
<say-as>
  • Speak enclosed text in the given style
  • Implemented (with limitations) in some platforms
  • Example: numbers
  • Contained text can be interpreted as a number.
    The allowed number formats are ordinal, cardinal,
    and digits.
  • <say-as type="number:ordinal">12</say-as> is
    spoken as "twelfth".
  • <say-as type="number:digits">12</say-as> is
    spoken as "one two".
  • Other types: acronyms, currency, time, date,
    duration, measures, telephone, spell-out, names,
    and net.
  • BeVocal provides a set of extended tags for items
    such as airline, equity, street, city, state,
    citystate, address
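
The attribute name depends on the specification version a platform follows. The type attribute shown on this slide comes from the draft-era syntax, whereas the final SSML 1.0 Recommendation uses interpret-as and leaves the permitted values to the platform. The sketch below assumes a platform that accepts the values "ordinal" and "digits"; check the vendor documentation before relying on them.

```xml
<prompt>
  <!-- Assumed value names; SSML 1.0 does not standardise them -->
  You are caller number <say-as interpret-as="ordinal">12</say-as>.
  Your code is <say-as interpret-as="digits">12</say-as>.
</prompt>
```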

10
Text to phoneme conversion
  • Specify the pronunciation of words that are
    ambiguous or difficult to pronounce, e.g.
  • read: "reed" / "red"
  • wind: "Wind the watch when you face into the wind."
  • <phoneme> - uses the standard phonetic alphabet,
    the International Phonetic Alphabet (IPA).
  • He plays <phoneme alphabet="ipa"
    ph="U0062 U0258 U0073"> bass </phoneme> guitar.
  • He also likes to fish; last week he caught a
    <sub alias="twenty">20</sub> <sub alias="pound">lb.</sub>
    <phoneme alphabet="ipa" ph="U0062 U00E6 U0073"> bass </phoneme>.
  • The ph values in these examples are Unicode numbers.

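
In an XML document, Unicode numbers are usually written as character references of the form &#xNNNN;. The following sketch shows the two readings of "bass" that way. It assumes the pronunciations "beɪs" for the instrument and "bæs" for the fish, and a platform that supports alphabet="ipa".

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- bass (instrument), assumed "beɪs": b, e, U+026A, s -->
  <s>He plays <phoneme alphabet="ipa" ph="&#x62;&#x65;&#x26A;&#x73;">bass</phoneme> guitar.</s>
  <!-- bass (fish), assumed "bæs": b, U+00E6, s -->
  <s>He caught a <phoneme alphabet="ipa" ph="&#x62;&#xE6;&#x73;">bass</phoneme>.</s>
</speak>
```
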
11
Attributes of <phoneme>
  • alphabet - The phonetic alphabet used to specify
    the pronunciation of the word contained in the
    <phoneme> element. The only valid values for
    this attribute are alphabet="ipa" and vendor-defined
    strings of the form alphabet="x-organization" or
    alphabet="x-organization-alphabet".
  • ph - The phonetic spelling of this word expressed
    using the alphabet.
  • Using the IPA requires some linguistic training.
    For an excellent tutorial on the IPA symbols and
    sounds, see
    http://www.unil.ch/ling/english/phonetique/table-eng.html.
  • For an overview of the IPA and a full chart of
    symbols, see http://www.arts.gla.ac.uk/IPA/ipa.html.
  • The sounds used in English and their IPA symbols
    are illustrated in
    http://www.antimoon.com/how/pronunc-soundsipa.htm.
    You can hear each sound by clicking the word that
    contains the sound.
  • To identify the corresponding Unicode number, go
    to http://web.uvic.ca/ling/resources/ipa/charts/unicode_intro.htm,
    move the cursor above the IPA symbol, and the
    Unicode value will appear.

12
Prosody analysis
  • Prosody covers pitch (intonation or melody),
    timing (rhythm), pauses, speech rate, emphasis on
    words, and the relative timing of segments and
    pauses.
  • Most TTS engines have a prosody analysis
    algorithm responsible for producing the prosody
    of synthesized speech, which is often based on
    the parts of speech.  For example, nouns, verbs,
    and adjectives may be accented, whereas
    auxiliary verbs and prepositions may be
    de-stressed.
  • The synthesized speech pauses at commas and is
    inflected according to whether the sentence is
    declarative, interrogative, or exclamatory.
  • Prosody rules and algorithms are not perfect and
    are a topic of ongoing research.  Prosody rules
    for different spoken national languages may be
    quite different.  For example, the prosody of
    American, British, Indian, and Jamaican
    pronunciations of English differs.

13
<prosody> pitch
  • Refers to the highness or lowness of speech
  • (currently not implemented in BeVocal Cafe)
  • Measured by the frequency (Hz, vibrations per
    second) of the sound
  • Can be specified with:
  • A number followed by Hz
  • A relative change expressed as a percentage, for
    example "+18.2%" or "-10.3%"
  • A relative change as a relative number, for
    example "+10" or "-8.7"
  • One of the following words: "x-high", "high",
    "medium", "low", "x-low", or "default"

14
<prosody> range
  • Range - specifies the variability of the pitch.
  • (currently not implemented in BeVocal Cafe)
  • Specified using the same options as pitch, e.g.
  • <prosody pitch="medium" range="x-low">
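
The ways of specifying pitch and range can be tried side by side. This is a sketch only: the sentences are placeholders, and since the slides note that some platforms do not implement these attributes, an engine may ignore them.

```xml
<prompt>
  <!-- Absolute pitch in hertz -->
  <prosody pitch="180Hz">This sentence targets an absolute pitch.</prosody>
  <!-- Relative change as a percentage -->
  <prosody pitch="+18.2%">This one is raised relative to the current pitch.</prosody>
  <!-- Named levels for both pitch and range -->
  <prosody pitch="medium" range="x-low">This one is spoken in a flat voice.</prosody>
</prompt>
```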

15
<prosody> contour
  • Describes the actual pitch contour for the text.
  • (currently not implemented in BeVocal Cafe)
  • A set of time segments with a target pitch
    specified for each time segment.
  • Each time segment is defined as a percentage of
    the total time for speaking the contained text,
    e.g. (25%, 25%, 25%, 25%) would speak the
    contained text in four equal segments.
  • An interpolation algorithm smooths the
    transitions between the time segments.  For
    example, a contour can be used to describe the
    increase in pitch at the end of a question as
    follows:
  • <prosody contour="(90%, medium) (10%, high)">
    You said what? </prosody>

16
<prosody> rate, duration
  • Rate.  The speaking rate expressed in
    words per minute (currently not implemented in
    BeVocal Cafe), specified using any of the
    following:
  • A number
  • A relative change expressed as a percentage, for
    example "+18.2%" or "-10.3%"
  • A relative change as a relative number, for
    example "+10" or "-8.7"
  • One of the following words: "x-fast", "fast",
    "medium", "slow", "x-slow", or "default"
  • The student's name is <prosody rate="-10">
    John Scott </prosody>
  • Duration.  A value in seconds or milliseconds for
    the desired time to read the element contents,
    e.g.
  • <prosody duration="10s">

17
<prosody> volume
  • Volume.  Specifies how loudly or quietly the
    words are spoken, specified by:
  • A number in the range from 0.0 to 100.0
  • A relative change expressed as a percentage, for
    example "+18.2%" or "-10.3%"
  • A relative change as a relative number, for
    example "+10" or "-8.7"
  • One of the following words: "silent", "x-soft",
    "soft", "medium", "loud", "x-loud", or "default"
  • <prosody volume="loud"> text to be spoken
    </prosody>

18
<emphasis>
  • Formerly <emph>
  • level values: "strong", "moderate", "none" and
    "reduced".
  • "none" is used to prevent the speech synthesis
    processor from emphasizing words that it might
    typically emphasize
  • <emphasis level="strong">help</emphasis>

19
<break>
  • Specifies when to insert silence (or a pause) in
    the text
  • strength - the strength of the prosodic break.
    Values are "none", "x-small", "small", "medium"
    (the default value), "large", or "x-large"
  • time, e.g. "250ms", "3s".
  • Welcome to the Student System
  • <break time="250ms"/>
  • Please say one of the following

20
Waveform Production
  • Process of converting a textual representation to
    acoustical sounds which humans hear and interpret
    as human-like speech.
  • <voice> - uses a different voice from the default
    specified for TTS
  • <voice age="3" gender="female"> text to
    speak </voice>
  • <audio> - specifies what audio to present to the user
  • <desc> - specifies text-only output describing
    the audio output (e.g. dog barking)
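
A sketch of how these three elements might be combined in one prompt follows. The file name dog.wav is a hypothetical placeholder, and which voices a platform offers is vendor-specific.

```xml
<prompt>
  <!-- Request a non-default voice for one phrase -->
  <voice gender="female">Listen carefully.</voice>
  <audio src="dog.wav">
    <!-- Text-only description of the audio content -->
    <desc>dog barking</desc>
    <!-- Spoken only if the file cannot be played -->
    A dog barks.
  </audio>
</prompt>
```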

21
Other SSML elements
  • <speak> - defines a container for a speech
    synthesis document
  • Not required when SSML tags are used in PCDATA
    within VoiceXML.
  • <lexicon> - specifies a pronunciation lexicon
    document which the speech synthesis engine uses
    to generate the pronunciation of words.
  • Format not yet defined; see the documentation of
    the VoiceXML browser vendor
  • <mark> - places a marker into the text to be
    processed by the speech synthesis engine, e.g.
    <mark name="pause"/>. When encountered, the
    speech synthesis pauses and throws an event
    referencing the marker name. A built-in event
    handler processes the event and causes the speech
    synthesis engine to resume.
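
Outside VoiceXML, these elements appear together in a standalone SSML document. The sketch below assumes SSML 1.0 syntax; the lexicon address and the marker name are made-up placeholders, and the lexicon file format remains vendor-specific as noted above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Lexicon references come before any spoken content -->
  <lexicon uri="http://www.example.com/lexicon.xml"/>
  <p>
    <s>Welcome to the Student System.</s>
    <!-- Named position that the engine reports when it reaches it -->
    <mark name="after_welcome"/>
    <s>Please say one of the following.</s>
  </p>
</speak>
```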

22
<audio>: playing prerecorded audio files
  • Output can consist of a combination of
    prerecorded files, audio streams, or synthesised
    speech, e.g.
  • <prompt>Welcome to the Student System
    <audio src="AudioSample.wav"/> How can I help you?</prompt>
  • <audio> can have alternative content in case the
    audio sample is not available, e.g.
  • <audio src="welcome.wav"> Welcome to the Student
    System </audio>

23
Recording speech input using <record>
  • <record> is a form element similar to <field>
  • It is used to collect a recording from the user
    that can be played back or submitted to a server
  • It has a <prompt> element and can have a <filled>
    element
  • It can have a grammar for a spoken command to
    terminate the recording

24
Attributes of <record>
  • name - The name of a variable that holds the
    value of the recorded item.
  • expr - The initial value of the recorded item
    variable.
  • beep - There are two possible values: beep="true"
    and beep="false". If true, a beep tone is
    presented to the user just before the recording
    begins.  The default is false.
  • maxtime - The maximum duration of the recording,
    beginning when the recording starts. For example,
    maxtime="10s", where "10s" means 10 seconds.
  • finalsilence - The interval of silence indicating
    the end of speech.  For example,
    finalsilence="3s" (not implemented in IBM Voice
    Server SDK)
  • dtmfterm - There are two possible values:
    dtmfterm="true" and dtmfterm="false". If true, then
    any DTMF key press not matched by an active grammar
    will terminate the input. The default is true.
  • type - Media format of the resulting recording.  A
    media type is a file format written in the form
    type/subtype.  For audio files, the type is
    always audio.

25
Example using <record>
  • <form>
  • <record name="msg" beep="true" maxtime="5s"
    finalsilence="5000ms" dtmfterm="true"
    type="audio/x-wav">
  • <prompt timeout="5s">
  • Record your message after the beep.
  • </prompt>
  • </record>
  • <filled>
  • <!-- when recording is completed, replay recorded
    message -->
  • <prompt> You said <audio expr="msg"/> </prompt>
  • </filled>
  • </form>

26
Submitting recording to the server
  • In this example, a recording has been stored in
    the variable msg and the system asks whether the
    user wishes to keep it
  • <field name="confirm" type="boolean">
  • <prompt> Your message is <audio expr="msg"/>.
    </prompt>
  • <prompt> To keep it, say yes. To discard it, say
    no. </prompt>
  • <filled>
  • <if cond="confirm">
  • <submit next="save_message.jsp"
    enctype="multipart/form-data" method="post"
    namelist="msg"/>
  • </if>
  • <clear/>
  • </filled>
  • </field>

27
<record> shadow variables (1)
  • NB: "name" represents the name of the form item
    variable
  • name$.duration - The duration of the recording in
    milliseconds
  • name$.size - The size of the recording in bytes
  • name$.termchar - The DTMF key used by the caller
    to terminate the recording.  This variable is
    undefined if a key was not used to terminate the
    audio.
  • name$.maxtime - true indicates the recording was
    terminated because the maxtime duration was
    reached.  false indicates the recording was not
    terminated due to maxtime.
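
Shadow variables are read with the form item name followed by a dollar sign. The following sketch assumes a VoiceXML 2.0 platform and a form item named msg, and reports the recording length back to the caller. The prompt wording and the rounding to whole seconds are illustrative choices.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <record name="msg" beep="true" maxtime="10s" dtmfterm="true" type="audio/x-wav">
      <prompt>Record your message after the beep.</prompt>
      <filled>
        <!-- msg$.maxtime is true when the time limit cut the recording off -->
        <if cond="msg$.maxtime">
          <prompt>You reached the maximum recording time.</prompt>
        </if>
        <!-- msg$.duration is in milliseconds, so convert to seconds -->
        <prompt>
          Your message lasted
          <value expr="Math.round(msg$.duration / 1000)"/> seconds.
        </prompt>
      </filled>
    </record>
  </form>
</vxml>
```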

28
<record> shadow variables (2)
  • name$.utterance - The string of words spoken by
    the user if the recording was terminated by
    speech recognition input. This shadow variable is
    undefined if the recording was not terminated by
    speech recognition input.
  • name$.confidence - The confidence level (0.0 to
    1.0) if the recording was terminated by speech.
    This shadow variable is undefined if the
    recording was not terminated by speech
    recognition input.  The confidence level refers
    to the speech recognizer's estimate of the
    accuracy of its results, in this case the
    accuracy of the contents of name$.utterance.

29
Dealing with user hang up during recording
  • When a user hangs up during recording, the
    recording terminates and a
    connection.disconnect.hangup event is thrown.
    Audio recorded up until the hangup is available
    through the <record> variable, e.g.
  • <catch event="connection.disconnect.hangup">
  • (action, such as submitting the recording to the server)
  • </catch>
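
The handler can be placed inside the recording item itself so that a partial recording is still saved. This is a sketch under the assumption that the server script save_message.jsp from the earlier submit example also handles this case; the 30-second limit is an arbitrary illustrative value.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <record name="msg" beep="true" maxtime="30s" type="audio/x-wav">
      <prompt>Record your message after the beep.</prompt>
      <!-- Runs if the caller hangs up while recording -->
      <catch event="connection.disconnect.hangup">
        <submit next="save_message.jsp" enctype="multipart/form-data"
                method="post" namelist="msg"/>
      </catch>
    </record>
  </form>
</vxml>
```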

30
Exercise: SSML markup
  • Create a file using some SSML markup for TTS.
  • Examples:
  • He drove his new car, <prosody pitch="-10"
    range="-20" volume="-20">not his ugly old
    car</prosody>, because he wanted to seem more
    <emphasis level="strong"> impressive </emphasis>
  • My user number is <say-as interpret-as="digits">
    145678 </say-as>
  • Sample file: tts.vxml

31
Exercise: recording and using audio files
  • Create a simple application that includes a field
    in which you ask the user to speak some
    information, such as name and address, that is
    recorded by the system for later playback.
  • Play back a pre-recorded file (music to be played
    as introduction)