Speech - PowerPoint PPT Presentation

About This Presentation
Title:

Speech

Description:

Speech & Language Modeling. Cindy Burklow & Jay Hatcher. CS521 ... Formant Synthesis. Recordings. Concatenative synthesis. Unit Selection. Waveform Synthesis ...

Slides: 30
Provided by: xyz198
Learn more at: http://www.mgnet.org
Transcript and Presenter's Notes

Title: Speech


1
Speech & Language Modeling
  • Cindy Burklow & Jay Hatcher
  • CS521, March 30, 2006

2
Agenda
What is Speech Recognition?
Challenges of Speech Recognition
Expresso III Case Study
IBM Superhuman Speech Tech
Speech Synthesis
3
What is Speech Recognition?
One long Rule book
Two approaches
Deductive Framework
How does it work?
Search Algorithms & Math Models
Phonemes
4
Hunting Speech
5
Phoneme Sequence
6
Phonemes & Energy
7
Challenges of Speech Recognition
Users' own preferences
Limit Speech Range
Noise
People
Software
Infinite Combinations
8
Expresso III
Project
How?
Who?
Why?
What?
9
Expresso III
  • How is it different?
  • Why try a new method?
  • Co-Articulation
  • Independencies
  • Duration
  • Linear Dynamic Model (LDM)

10
Expresso III
  • Why Linear Dynamic Model (LDM)?
  • Expresso III's Hypothesis
  • Testing Methods
  • Includes error models
  • Only linear models allowed
  • Series of tests (5 total)
  • Increase phones training data
  • Switching Iteration Data classification
  • Generated histograms of log likelihood
  • Divide & Conquer Technique
  • Results
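
The slides do not give Expresso III's actual models, so the following is only a minimal sketch of the general idea behind classifying data by LDM log likelihood. It assumes a scalar linear dynamic model with Gaussian error terms, scored with a standard Kalman filter; every name and parameter (f, h, q, r, x0, p0) is illustrative rather than taken from the project.

```python
import math

def ldm_log_likelihood(obs, f, h, q, r, x0=0.0, p0=1.0):
    """Log likelihood of a 1-D observation sequence under a scalar LDM.

    Assumed model (illustrative, not Expresso III's):
        x[t] = f * x[t-1] + w,  w ~ N(0, q)   state evolution plus error
        y[t] = h * x[t]   + v,  v ~ N(0, r)   observation plus error
    """
    x, p = x0, p0
    total = 0.0
    for y in obs:
        # Predict the next state and its variance.
        x_pred = f * x
        p_pred = f * p * f + q
        # Prediction error on the observation and its variance.
        err = y - h * x_pred
        s = h * p_pred * h + r
        total += -0.5 * (math.log(2 * math.pi * s) + err * err / s)
        # Kalman update of the state estimate.
        k = p_pred * h / s
        x = x_pred + k * err
        p = (1 - k * h) * p_pred
    return total

def classify(obs, models):
    """Return the name of the model with the highest log likelihood."""
    return max(models, key=lambda name: ldm_log_likelihood(obs, **models[name]))

# Two hypothetical phone models: one slowly varying, one rapidly decaying.
models = {
    "slow": dict(f=0.99, h=1.0, q=0.01, r=0.1),
    "fast": dict(f=0.5, h=1.0, q=1.0, r=0.1),
}
print(classify([1.0, 1.0, 1.0, 1.0, 1.0], models))
```

Collecting the per-segment log likelihoods from a function like this is one way histograms such as those mentioned above could be produced.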

11
IBM Superhuman Speech Tech
ViaVoice 4.4
Products
Free-Form Command, MASTOR, TALES, PDAs, iPods, DVRs
Comprehend languages
Translate dynamically
Create on-the-fly subtitles on TV
Speak commands
Goal
"Get performance comparable to humans in the next five years." - IBM, Jan. 2006
12
Free-Form Command
  • Commands associated with objects
  • Simplified Language
  • Partnering with Specialized Hardware
    Manufacturing
  • Finding niche markets
  • Well-chosen Algorithms

13
IBM's MASTOR
  • Multilingual Automatic Speech-to-Speech Translator

14
IBM's TALES
  • Server-based system
  • Dynamically transcribes and translates any words
    spoken into English subtitles
  • Requires long processing time
  • Real-time translations are impossible
  • 60-70% accuracy rate
  • High subscription fee for users

15
Expanding Speech Recognition Applications
  • PDAs to collect data

  • iPod: email and RSS read aloud
16
Navigate Your DVR with Speech
Voice commands
Requires a microphone (TV remote or headset)
17
Text to Speech Systems
  • Two major steps
  • Step 1: Convert the text into a pronounceable format
  • Look for domain-specific sections like times,
    dates, numbers, addresses, and abbreviations
  • Try to identify homographs and the contexts in
    which they occur
  • Use some combination of dictionary and rule-based
    approaches as a guide to pronunciation
  • Step 2: Synthesize speech from the phonetic
    representation using one of many possible
    approaches
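
The first step, converting text into a pronounceable format, can be sketched roughly as follows. This is a toy illustration only: the lookup tables and function names are made up for the example, and it handles just a few abbreviations, clock times, and numbers below 100. Note that "St." is expanded blindly to "Street" here; telling it apart from "Saint" is exactly the kind of context problem mentioned above.

```python
import re

# Hypothetical, deliberately tiny tables; a real front end would use
# large dictionaries plus context rules.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "Mr.": "Mister"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]

def time_to_words(match):
    """Spell out a clock time such as 3:30 or 9:05."""
    hours, minutes = int(match.group(1)), int(match.group(2))
    if minutes == 0:
        return number_to_words(hours) + " o'clock"
    if minutes < 10:
        return number_to_words(hours) + " oh " + number_to_words(minutes)
    return number_to_words(hours) + " " + number_to_words(minutes)

def normalize(text):
    """Expand abbreviations, clock times, and small numbers into words."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Handle times first so "3:30" is not read as two separate numbers.
    text = re.sub(r"\b(\d{1,2}):(\d{2})\b", time_to_words, text)
    text = re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)
    return text

print(normalize("Dr. Smith lives at 42 Elm St. and arrives at 3:30"))
# Doctor Smith lives at forty two Elm Street and arrives at three thirty
```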

18
Speech Synthesis
Continuum of Speech Synthesis methods
Concatenative synthesis
Waveform Synthesis
Formant Synthesis
Recordings
Diphone Synthesis
HMM-based synthesis
Unit Selection
Hybrid Approaches
Articulatory Synthesis
19
Speech Synthesis at CMU
  • Carnegie Mellon University has been doing
    extensive research in both speech recognition and
    speech synthesis
  • Research primarily uses the Festival Speech
    Synthesis System, an open-source framework
    developed by Edinburgh University

20
Speech Synthesis at CMU
  • Research has primarily focused on Diphone
    Synthesis, with some additional exploration into
    Unit Selection.

21
Speech Synthesis at CMU
  • Diphone synthesis allows greater control of pitch
    and voice inflection, but often has a more
    robotic sound to it.
  • Example: "This is a short introduction to the
    Festival Speech Synthesis System. Festival was
    developed by Alan Black and Paul Taylor, at the
    Centre for Speech Technology Research, University
    of Edinburgh."

22
Speech Synthesis at CMU
  • Improvements can be made by performing
    statistical analysis of the text as a
    preprocessing step before synthesis.
  • This helps with pacing, homographs, and other
    situations where pronunciation differs depending
    on context.
  • He wanted to go for a drive in.
  • He wanted to go for a drive in the country.
  • My cat who lives dangerously has nine lives.
  • Henry V: Part I Act II Scene XI: Mr X is I
    believe, V I Lenin, and not Charles I.
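
The Roman numeral sentence shows why context matters: the same letters are read as an ordinal after a king's name, as a cardinal after a word like "Part", and as plain letters elsewhere. The slides describe a statistical approach; the sketch below swaps in a much simpler hand-written rule based on the preceding word, just to make the problem concrete. The word lists and function names are assumptions for this example.

```python
ROMAN = {"I": 1, "V": 5, "X": 10}
CARDINALS = {1: "one", 2: "two", 5: "five", 11: "eleven"}
ORDINALS = {1: "first", 5: "fifth"}
SECTION_WORDS = {"Part", "Act", "Scene", "Chapter"}  # numeral read as cardinal
ROYAL_NAMES = {"Henry", "Charles"}                   # numeral read as ordinal

def roman_to_int(s):
    """Convert a Roman numeral built from I, V, X to an integer."""
    total = 0
    for i, ch in enumerate(s):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (e.g. IX).
        if i + 1 < len(s) and ROMAN[s[i + 1]] > value:
            total -= value
        else:
            total += value
    return total

def expand_numerals(text):
    """Expand Roman numerals whose reading is clear from the previous word."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        core = tok.rstrip(".,:;")
        tail = tok[len(core):]
        prev = tokens[i - 1] if i > 0 else ""
        if core and all(c in ROMAN for c in core):
            n = roman_to_int(core)
            if prev in SECTION_WORDS and n in CARDINALS:
                out.append(CARDINALS[n] + tail)
                continue
            if prev in ROYAL_NAMES and n in ORDINALS:
                out.append("the " + ORDINALS[n] + tail)
                continue
        # Anything else (the pronoun "I", initials, "Mr X") is left alone.
        out.append(tok)
    return " ".join(out)

print(expand_numerals("Henry V Part I and not Charles I."))
# Henry the fifth Part one and not Charles the first.
```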

23
Speech Synthesis at CMU
  • Unit selection can be used instead of diphones to
    improve how natural the voice sounds by choosing
    larger recorded units (e.g. whole phones or
    syllables) and not just diphones (sound transitions)
  • The following examples are based on the same
    speaker
  • Diphones
  • Unit Selection

24
Speech Synthesis at CMU
  • With care, unit selection can produce very
    convincing natural sound.
  • Original Sound
  • Synthesis from natural phones, pitch, and
    duration data
  • However, it is difficult to generalize Unit
    Selection for a variety of situations, and if it
    does poorly it sounds much worse than diphones.
  • Example
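
Unit selection is commonly framed as a search: pick one recorded unit per position so that the units both match what was asked for (target cost) and join smoothly (join cost). The slides do not describe the search itself, so the following is a generic dynamic-programming sketch with hypothetical cost functions, not Festival's implementation. It also hints at why a poor database hurts: if every candidate has a high cost, the cheapest path still sounds bad.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Choose one candidate per position, minimizing the total cost.

    targets     : list of target specifications, one per position
    candidates  : candidates[i] lists the database units usable at position i
    target_cost : f(target, unit) -> float, mismatch with the specification
    join_cost   : f(prev_unit, unit) -> float, discontinuity at the join
    """
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j],
    #               index of the previous unit on that path)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)
    # Trace the cheapest path backwards from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total

# Toy units are (phone, pitch in Hz); costs compare pitch only.
targets = [("a", 100), ("b", 100)]
candidates = [[("a", 100), ("a", 110)], [("b", 112), ("b", 130)]]
pitch_mismatch = lambda t, u: abs(t[1] - u[1])
pitch_jump = lambda p, u: 3 * abs(p[1] - u[1])
print(select_units(targets, candidates, pitch_mismatch, pitch_jump))
# ([('a', 110), ('b', 112)], 28)
```

In the toy data the unit that best matches the first target is not chosen, because joining it to either following unit would cost more than it saves.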

25
Speech Synthesis at CMU
  • Most commercial TTS packages use Unit Selection
    with medium to large databases of samples.
  • Example: NeoSpeech VoiceText
  • These produce higher quality sound at the expense
    of memory and processor power.
  • CMU's Festival implementation has focused more on
    Diphone Synthesis to reduce memory footprint and
    allow greater control of the synthesizer.

26
Speech Synthesis at CMU
  • Diphone Synthesis can control inflection, pitch,
    and other factors dynamically.
  • A short example with no prosody.
  • A short example with declination.
  • A short example with accents on stressed
    syllables and end tones.
  • A short example with statistically trained
    intonation and duration models.
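
The difference between the first three examples can be pictured as successive changes to a per-syllable pitch target. The sketch below is a simplified illustration: declination is modeled as a straight-line fall in pitch across the utterance, and accents as a fixed boost on stressed syllables. The function names and all Hz values are assumptions, not Festival's parameters.

```python
def declination_f0(n_syllables, start_hz=130.0, end_hz=105.0):
    """Baseline pitch per syllable, falling linearly across the utterance."""
    if n_syllables == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (n_syllables - 1)
    return [start_hz + i * step for i in range(n_syllables)]

def add_accents(f0, stressed, boost_hz=15.0):
    """Raise the pitch target on stressed syllables (given by index)."""
    return [hz + boost_hz if i in stressed else hz for i, hz in enumerate(f0)]

flat = [120.0] * 6                       # no prosody: constant pitch
declined = declination_f0(6)             # declination only
accented = add_accents(declined, {1, 4}) # declination plus accents
print(declined)
# [130.0, 125.0, 120.0, 115.0, 110.0, 105.0]
print(accented)
# [130.0, 140.0, 120.0, 115.0, 125.0, 105.0]
```

A statistically trained model, as in the last example, would replace these hand-set numbers with values predicted from data.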

27
Conclusion
  • CMU's research using Festival has led to useful
    technology for embedded systems and servers. The
    Diphone Synthesis model they have developed can
    produce generally intelligible speech with
    minimal memory and processing costs. The model
    is still being worked on and may one day reach a
    natural level of quality.

28
References and Useful Links
  • What is Speech Recognition? & Challenges
  • http://www.extremetech.com/article2/0,1697,1826664,00.asp
  • http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/mags/co/&toc=comp/mags/co/2002/04/r4toc.xml&DOI=10.1109/MC.2002.993770
  • http://en.wikipedia.org/wiki/Speech_recognition
  • http://cslu.cse.ogi.edu/HLTsurvey/ch1node7.html
  • Expresso III Case Study
  • http://www.cstr.ed.ac.uk/publications/users/s0129866_abstracts.html#Couper-02
  • http://www.cstr.ed.ac.uk/publications/users/s0129866.html
  • IBM Superhuman Speech Tech
  • http://www.ibm.com
  • http://www.pcmag.com/article2/0,1895,1915071,00.asp

29
References and Useful Links
  • The Festival Speech Synthesis System
  • NeoSpeech VoiceText Demo
  • AT&T's TTS FAQ
  • Reviews of Popular Speech Synthesizers
  • Speech Engine Listings with Samples
  • BrightSpeech.com
  • Festival at CMU
  • FestVox