1
CMU Sphinx Speech Recognition Engine
  • Reporter: Chun-Feng Liao
  • NCCU Dept. of Computer Science
  • Intelligent Media Lab

2
Purposes of this project
  • Find out how an efficient speech recognition
    engine can be implemented.
  • Examine the source code of Sphinx2 to determine
    the role and function of each component.
  • Read key chapters of Dr. Mosur K. Ravishankar's
    thesis as a reference.
  • Some demo programs will be given during the oral
    presentation.

3
Presentation Agenda
  • Project Summary/ Agenda/ Goal. (In English)
  • Introduction.
  • Basics of Speech Recognition.
  • Architecture of CMU Sphinx.
  • Acoustic Model and HMM.
  • Language Model.
  • Java Platform Issues.
  • Demo
  • Conclusion.

4
Voice Technologies
  • In the mid- to late 1990s, personal computers
    started to become powerful enough to support
    automatic speech recognition (ASR).
  • The two key underlying technologies behind these
    advances are speech recognition (SR) and
    text-to-speech synthesis (TTS).

5
Basics of Speech Recognition
6
Speech Recognition
  • Capturing speech (analog) signals.
  • Digitizing the sound waves, converting them to
    basic language units, or phonemes (see the Java
    Sound sketch after this list).
  • Constructing words from phonemes, and
    contextually analyzing the words to ensure
    correct spelling for words that sound alike (such
    as write and right).

7
Speech Recognition Process Flow
Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
8
Recognition Process Flow Summary
  • Step 1: User Input
  • The system captures the user's voice in the form
    of an analog acoustic signal.
  • Step 2: Digitization
  • Digitize the analog acoustic signal.
  • Step 3: Phonetic Breakdown
  • Break the signal into phonemes.

9
Recognition Process Flow Summary(2)
  • Step 4: Statistical Modeling
  • Map phonemes to their phonetic representation
    using a statistical model.
  • Step 5: Matching
  • Based on the grammar, the phonetic representation,
    and the dictionary, the system returns an n-best
    list (i.e., a list of words, each with a
    confidence score).
  • Grammar: the set of words or phrases that
    constrains the range of input or output in the
    voice application.
  • Dictionary: the mapping table between phonetic
    representations and words (e.g., "thu", "thee" →
    "the"); see the sample entries below.
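For illustration, a Sphinx-style dictionary lists each word with its phone sequence, and alternate pronunciations carry a numbered suffix; the entries below are a small hand-written sample in that style:

  THE      DH AH
  THE(2)   DH IY
  WRITE    R AY T
  RIGHT    R AY T

Note that WRITE and RIGHT map to the same phones, which is why the grammar and language model are needed to pick the correct spelling.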

10
Architecture of CMU Sphinx.
11
Introduction to CMU Sphinx
  • A speech recognition system developed at Carnegie
    Mellon University.
  • Consists of a set of libraries
  • core speech recognition functions
  • low-level audio capture
  • Continuous speech decoding
  • Speaker-independent

12
Brief History of CMU Sphinx
  • Sphinx-I (1987)
  • The first speaker-independent, high-performance
    ASR system in the world.
  • Written in C by Kai-Fu Lee (who later headed
    Microsoft Research Asia).
  • Sphinx-II (1992)
  • Written in C by Xuedong Huang (who later led
    Microsoft's Speech.NET effort).
  • 5-state HMM / N-gram LM.
  • (For this reason, CMU Sphinx and the Microsoft
    Speech SDK share common roots.)

13
Brief History of CMU Sphinx (2)
  • Sphinx 3 (1996)
  • Built by Eric Thayer and Mosur Ravishankar.
  • Slower than Sphinx-II but the design is more
    flexible.
  • Sphinx 4 (Originally Sphinx 3j)
  • Refactored from Sphinx 3.
  • Fully implemented in Java.
  • Not finished yet.

14
Components of CMU Sphinx
15
Front End
  • libsphinx2fe.lib / libsphinx2ad.lib
  • Low-level audio access
  • Continuous Listening and Silence Filtering
  • Front End API overview.

16
Knowledge Base
  • The data that drives the decoder.
  • Three sets of data
  • Acoustic Model.
  • Language Model.
  • Lexicon (Dictionary).

17
Acoustic Model
  • /model/hmm/6k
  • A database of statistical models.
  • Each statistical model represents a phoneme.
  • Acoustic models are trained by analyzing large
    amounts of speech data.

18
HMM in Acoustic Model
  • HMMs represent each unit of speech in the Acoustic
    Model.
  • A typical HMM uses 3-5 states to model a phoneme.
  • Each HMM state is represented by a set of Gaussian
    mixture density functions.
  • Sphinx2 default phone set.

19
Gaussian Mixtures
  • Refer to the textbook, p. 33, eq. 38 (the standard
    form is reproduced below).
  • Represent each state in an HMM.
  • Each set of Gaussian mixtures is called a senone.
  • HMMs can share senones.
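For reference, the standard Gaussian-mixture output density for an HMM state $j$ with $K$ mixture components (a generic textbook form, not copied from the cited equation) is

  b_j(\mathbf{o}_t) = \sum_{k=1}^{K} c_{jk}\,\mathcal{N}\!\left(\mathbf{o}_t;\ \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}\right), \qquad \sum_{k=1}^{K} c_{jk} = 1

where $\mathbf{o}_t$ is the feature vector at frame $t$, and $c_{jk}$, $\boldsymbol{\mu}_{jk}$, $\boldsymbol{\Sigma}_{jk}$ are the weight, mean, and covariance of the $k$-th component.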

20
(No Transcript)
21
Language Model
  • Describes what is likely to be spoken in a
    particular context
  • Word transitions are defined in terms of
    transition probabilities
  • Helps to constrain the search space
  • See the example LM fragment below.
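As an illustration, statistical LMs for Sphinx are commonly distributed in the ARPA text format; a minimal bigram fragment might look like the following (the words and log-probabilities are invented for illustration):

  \data\
  ngram 1=3
  ngram 2=2

  \1-grams:
  -0.7782 hello  -0.3010
  -0.7782 world  -0.3010
  -0.4771 </s>

  \2-grams:
  -0.3010 hello world
  -0.3010 world </s>

  \end\

Each line gives a log10 probability, the n-gram itself, and (for lower-order entries) a back-off weight.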

22
N-gram Language Model
  • The probability of word N depends on words N-1,
    N-2, ...
  • Bigrams and trigrams are most commonly used (the
    trigram factorization is shown below).
  • Used for large-vocabulary applications such as
    dictation.
  • Typically trained on a very large corpus (millions
    of words).
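In a trigram model, for example, the probability of a word sequence is approximated by conditioning each word on only its two predecessors:

  P(w_1, w_2, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-2}, w_{i-1})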

23
Decoder
  • Selects the next set of likely states.
  • Scores incoming features against these states.
  • Drops low-scoring states (beam pruning); see the
    sketch after this list.
  • Generates results.
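The fragment below is a minimal Java sketch of this score-and-prune (beam search) loop for a single frame; the HmmState class and its acoustic score are hypothetical placeholders, not the actual Sphinx decoder code:

  import java.util.ArrayList;
  import java.util.List;

  public class BeamSearchSketch {

      // Hypothetical HMM state holding the best path score that reaches it.
      static class HmmState {
          double pathScore;

          // Placeholder acoustic score; a real decoder evaluates the state's
          // Gaussian mixture (senone) against the feature vector here.
          double acousticScore(double[] frame) {
              return -Math.abs(frame[0]);
          }
      }

      // One frame of decoding: score the active states, then prune everything
      // that falls outside the beam relative to the best score.
      static List<HmmState> decodeFrame(List<HmmState> active, double[] frame, double beam) {
          double best = Double.NEGATIVE_INFINITY;
          for (HmmState s : active) {
              s.pathScore += s.acousticScore(frame);
              best = Math.max(best, s.pathScore);
          }
          List<HmmState> survivors = new ArrayList<HmmState>();
          for (HmmState s : active) {
              if (s.pathScore > best - beam) {
                  survivors.add(s);
              }
          }
          // A real decoder would now expand HMM and word transitions to select
          // the next set of likely states; omitted in this sketch.
          return survivors;
      }
  }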

24
Speech on the Java Platform
25
Sun Java Speech API
  • First released on October 26, 1998.
  • The Java Speech API allows Java applications to
    incorporate speech technology into their user
    interfaces.
  • Defines a cross-platform API to support command
    and control recognizers, dictation systems and
    speech synthesizers.
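As a small illustration of the synthesis side of this API, the following is a minimal JSAPI 1.0 sketch; it assumes a JSAPI implementation (for example the FreeTTS JSAPI wrapper) is installed on the classpath:

  import java.util.Locale;
  import javax.speech.Central;
  import javax.speech.synthesis.Synthesizer;
  import javax.speech.synthesis.SynthesizerModeDesc;

  public class HelloSpeech {
      public static void main(String[] args) throws Exception {
          // Ask the JSAPI Central class for an English synthesizer.
          Synthesizer synth = Central.createSynthesizer(
                  new SynthesizerModeDesc(Locale.ENGLISH));
          synth.allocate();                      // acquire engine resources
          synth.resume();                        // ensure the engine is not paused
          synth.speakPlainText("Hello from the Java Speech API", null);
          synth.waitEngineState(Synthesizer.QUEUE_EMPTY);  // wait until speech finishes
          synth.deallocate();                    // release engine resources
      }
  }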

26
Implementations of Java Speech API
  • Open Source
  • FreeTTS / CMU Sphinx4.
  • IBM Speech for Java.
  • Cloud Garden.
  • LH TTS for Java Speech API.
  • Conversa Web 3.0.

27
Free TTS
  • Fully implemented in Java.
  • Based upon Flite 1.1, a small run-time speech
    synthesis engine developed at CMU.
  • Partial support for JSAPI 1.0.
  • Speech Recognition functions.
  • JSML.
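A minimal usage sketch of FreeTTS itself (using the VoiceManager API of later FreeTTS releases and the bundled "kevin16" voice; both are assumptions about the installed version):

  import com.sun.speech.freetts.Voice;
  import com.sun.speech.freetts.VoiceManager;

  public class FreeTtsHello {
      public static void main(String[] args) {
          // Look up the bundled 16 kHz "kevin" voice (returns null if not found).
          Voice voice = VoiceManager.getInstance().getVoice("kevin16");
          voice.allocate();      // load the voice data
          voice.speak("FreeTTS is a speech synthesizer written in Java.");
          voice.deallocate();    // free engine resources
      }
  }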

28
Sphinx 4 (Sphinx 3j)
  • Fully implemented in Java.
  • Speed is equal to or faster than Sphinx 3.
  • Acoustic model and language model are under
    construction.
  • Source code is available via CVS (but you cannot
    run any applications without the models!).

For example, to check out Sphinx4 you can use the
following command:

  cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/cmusphinx co sphinx4
29
Java Platform Issues
  • GC makes managing data much easier.
  • Native engines typically optimize inner loops for
    the CPU → can't do that on the Java platform.
  • Native engines arrange data to optimize cache hits
    → can't really do that either.

30
DEMO
  • Sphinx-II batch mode.
  • Sphinx-II live mode.
  • Sphinx-II Client / Server mode.
  • A Simple Free TTS Application.
  • (Java-based) TTS vs. (C-based) SR.
  • Motion Planner with FreeTTS, using Java Web Start.
    (This is the GRA course final project.)

31
Summary
  • Sphinx is an open-source speech recognition engine
    developed at CMU.
  • The Front End, Knowledge Base, and Decoder form the
    core of the SR system.
  • The Front End receives and processes the speech
    signal.
  • The Knowledge Base provides data for the Decoder.
  • The Decoder searches the states and returns the
    results.
  • Speech recognition is a challenging problem for the
    Java platform.

32
Reference
  • Mosur K. Ravishankar, Efficient Algorithms for
    Speech Recognition, CMU, 1996.
  • Mosur K. Ravishankar, Kevin A. Lenzo, Sphinx-II
    User Guide, CMU, 2001.
  • Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken
    Language Processing, Prentice Hall, 2000.

33
Reference (on-line)
  • On-line documents of Java Speech API
  • http://java.sun.com/products/java-media/speech/
  • On-line documents of Free TTS
  • http://freetts.sourceforge.net/docs/
  • On-line documents of Sphinx-II
  • http://www.speech.cs.cmu.edu/sphinx/

34
Q & A