Progress of Sphinx 3.X, From X=4 to X=5 - PowerPoint PPT Presentation

Learn more at: http://www.cs.cmu.edu

1
Progress of Sphinx 3.X, From X=4 to X=5
  • By
  • Arthur Chan
  • Evandro Gouvea
  • Yitao Sun
  • David Huggins-Daines
  • Jahanzeb Sherwani

2
What is CMU Sphinx?
  • Definition 1
  • a large vocabulary speech recognizer with high
    accuracy and speed performance.
  • Definition 2
  • a collection of tools and resources that enables
    developers/researchers to build successful speech
    recognizers

3
Brief History of Sphinx
  • A more detailed version can be found at
  • www.cs.cmu.edu/archan/presentation/SphinxDev20040624.ppt

  • 1987: Sphinx I
  • 1992: Sphinx II
  • 1996: Sphinx III (S3 slow)
  • 1999: Sphinx III (S3 fast, or S3.2)
  • 2001-2002: Sphinx IV development initiated; S3.3
  • 2004 Jul: S3.4
  • 2004 Oct: S3.5 RC II
  • 2000: Sphinx becomes open source

4
What is Sphinx 3.X?
  • An extension of Sphinx 3's recognizers
  • Sphinx 3.X (X=5) means Sphinx 3.5
  • It helps to confuse people more.
  • Provides functionality such as
  • Real-time speech recognition
  • Speaker adaptation
  • Developer application interfaces (APIs)
  • 3.X (X>3) is motivated by Project CALO

5
Development History of Sphinx 3.X
  • S3: Sphinx 3 flat-lexicon recognizer (s3 slow)
  • S3.2: Sphinx 3 tree-lexicon recognizer (s3 fast)
  • S3.3: with live-mode demo
  • S3.4: fast GMM computation; support for class-based
    LM; some support for dynamic LM
  • S3.5: some support for speaker adaptation; live-mode
    APIs; Sphinx 3 and Sphinx 3.X code merge
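
The difference between a flat lexicon (S3) and a tree lexicon (S3.2) can be illustrated with a small sketch. The toy pronunciations and the function names below are assumptions made for illustration, not Sphinx data structures. A flat lexicon keeps one phone sequence per word, while a lexical tree shares common phone prefixes, so fewer phone nodes have to be evaluated.

```python
# Toy pronunciation dictionary (illustrative only; not from a Sphinx dictionary).
LEXICON = {
    "start": ["S", "T", "AA", "R", "T"],
    "star": ["S", "T", "AA", "R"],
    "stop": ["S", "T", "AA", "P"],
}


def build_lexical_tree(lexicon):
    """Build a prefix tree in which each node maps a phone to a child node."""
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(
                phone, {"children": {}, "words": []}
            )
        # The word identity is only known at the node where it ends.
        node["words"].append(word)
    return root


def count_nodes(node):
    """Count phone nodes in the tree (the root is not counted)."""
    return sum(1 + count_nodes(child) for child in node["children"].values())


# Flat lexicon: every word carries its own copy of each phone.
flat_nodes = sum(len(phones) for phones in LEXICON.values())
# Tree lexicon: shared prefixes are stored once.
tree_nodes = count_nodes(build_lexical_tree(LEXICON))
print(flat_nodes, tree_nodes)
```

In this toy case the three words share the prefix S T AA, which is why the tree needs fewer phone nodes than the flat list.
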
6
This talk
  • A general summary of what's going on.
  • Less technical than the 3.4 talk
  • Folks were so confused by the jargon in speech
    recognition's black magic.
  • More on code development, less on acoustic
    modeling
  • Reason: I do not have much time to do both
  • (Incorrect version) "We need to adopt the latest
    technology to clown 2 to 3 Arthur Chan(s) for the
    CALO project." Prof. Alex Rudnicky, in one CALO
    meeting in 2004
  • (Kindly corrected by Prof. Alan Black) "We
    need to adopt the latest technology to clone 2 to
    3 Arthur Chan(s) for the CALO project." Prof.
    Alex Rudnicky, in one CALO meeting in 2004
  • More from a project point of view
  • Speech recognition software easily shows the
    phenomena described in The Mythical Man-Month.

7
This talk (outline)
  • Sphinx 3.X, the recognizer (from X=4 to X=5) (10
    pages)
  • Accuracy and Speed (5 pages)
  • Speaker Adaptation (1 page)
  • Application Interfaces (APIs) (2 pages)
  • Architecture (2 pages)
  • Sphinx as a collection of resources (10 pages)
  • Code distribution and management (3 pages)
  • Infrastructure of Training (1 page)
  • SphinxTrain: tools for training acoustic models
    (1 page)
  • Documentation (3 pages)
  • Team and Organization (2 pages)
  • Development plan for Sphinx 3.X (X > 6) (2
    pages)
  • Relationship between speech recognition and other
    speech research (4 pages)

8
Accuracy and Speed
  • Why Sphinx 3.X? Why not Sphinx 2?
  • Due to the limits of computation in the 90s
  • S2 only supports a restricted version of
    semi-continuous HMM (SCHMM)
  • S3.X supports fully continuous HMM (FCHMM)
  • Accuracy improvement is around 30% relative
  • You will see benchmarking results two slides
    later
  • Speed
  • S3.X is still slower than S2
  • But in many tasks, it seems to have become
    reasonable to use it.
  • (You can find the results a few slides later.)

9
Speed
  • Fast Search techniques
  • Lexical tree search (s3.2)
  • Viterbi beam tuning and histogram beam
    pruning (s3.2)
  • Ravi's talk: www.cs.cmu.edu/archan/presentation/s3.2.ppt
  • Phoneme look-ahead (s3.4, by Jahanzeb)
  • Fast GMM computation techniques (s3.4)
  • Using the measurement in the literature, that
    means
  • 75-90% reduction in GMM computation with fast
    GMM computation pruning.
  • <10% relative degradation can usually be achieved
    on a clean database.
  • Further detail: "Four-Layer Categorization Scheme
    of Fast GMM Computation Techniques", A. Chan et
    al.
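
The two pruning ideas named above, Viterbi beam pruning and histogram pruning, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `prune`, its parameters, and the toy scores are hypothetical and are not taken from the Sphinx code. Beam pruning keeps states whose score lies within a fixed distance of the best score; histogram pruning then caps how many states survive.

```python
import heapq


def prune(scores, beam_width, max_active):
    """Return the state ids that survive beam and histogram pruning.

    scores:      dict mapping state id -> log-likelihood (higher is better)
    beam_width:  keep states within this log-score distance of the best
    max_active:  hard cap on the number of surviving states
    """
    if not scores:
        return []
    best = max(scores.values())
    # Beam pruning: drop states that fall too far below the best score.
    in_beam = [s for s, v in scores.items() if v >= best - beam_width]
    # Histogram pruning: keep at most max_active of the remaining states.
    # This sketch simply selects the top scores; decoders typically avoid
    # a full sort, for example by bucketing scores into a histogram.
    return heapq.nlargest(max_active, in_beam, key=lambda s: scores[s])


active = {"a": -10.0, "b": -12.5, "c": -30.0, "d": -11.0, "e": -14.0}
print(prune(active, beam_width=5.0, max_active=3))
```

A narrower beam or a smaller cap makes the search faster but increases the chance of discarding the correct path, which is the trade-off the tuning in this slide refers to.
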

10
Accuracy Benchmarking (Communicator Task)
  • Test platform: 2.2 GHz Pentium IV
  • CMU Communicator task
  • Vocabulary size (3k), perplexity 90
  • All tunings were done without sacrificing 5%
    performance.
  • Batch-mode decoder is used (decode).
  • Sphinx 2 (tuned with speed-up techniques)
  • WER 17.8 (0.34xRT)
  • Baseline results, Sphinx 3.X, 32-Gaussian FCGMM
  • WER 14.053 (2.40xRT)
  • Baseline results, Sphinx 3.X, 64-Gaussian FCGMM
  • WER 11.7 (3.67xRT)
  • Tuned Sphinx 3.X, 64-Gaussian FCGMM
  • WER 12.851 (0.87xRT), 12.152 (1.17xRT)
  • Rong can make it better: boosting training
    gives 10.5
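
The two measures used in these benchmarks, relative WER reduction and the real-time factor (xRT), are simple ratios. The helper names below are assumptions for illustration, and the timing values in the second call are hypothetical; only the two WER figures come from the slide.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in word error rate, as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


def real_time_factor(processing_seconds, audio_seconds):
    """xRT: processing time divided by audio duration.

    A value below 1.0 means decoding runs faster than real time.
    """
    return processing_seconds / audio_seconds


# WER figures from the slide: Sphinx 2 (17.8) against the
# Sphinx 3.X 64-Gaussian baseline (11.7).
reduction = relative_wer_reduction(17.8, 11.7)
print(round(100 * reduction, 1))

# Hypothetical timing: 87 seconds of processing for 100 seconds of audio.
print(real_time_factor(87.0, 100.0))
```

With these two figures the relative reduction comes out slightly above one third, in line with the "around 30% relative" claim made earlier in the talk.
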

11
Accuracy/Speed Benchmarking (WSJ Task)
  • Test platform: 2.2 GHz Pentium
  • Vocabulary size: 5k
  • Standard NVP task.
  • Trained on both WSJ0 and WSJ1
  • Sphinx 2: 14.5 (?)
  • Sphinx 3.X, 8-Gaussian FCGMM
  • Untuned: 7.3, 1.6xRT
  • Tuned: 8.29, 0.52xRT

12
Accuracy/Speed Benchmarking (Future Plan)
  • Issue 1: Large variance in GMM computation.
  • Average performance is good; the worst case can be
    disastrous.
  • Issue 2: Tuning requires a black magician
  • Automatic tuning is necessary.
  • Issue 3: Still need to work on larger databases
    (e.g. WSJ 20k, BN)
  • The training setup needs to be dug up
  • Issue 4: Speed-up on noisy corpora is tricky.
  • Results are not satisfactory (20-30% degradation
    in accuracy)

13
Speaker Adaptation
  • Starting to support MLLR-based speaker adaptation
  • y = Ax + b; estimate A, b in a maximum-likelihood
    fashion (Leggetter 94)
  • Current functionality of Sphinx 3.X and SphinxTrain
  • Allows estimation of the transformation matrix
  • Transforming means offline
  • Transforming means online
  • Decoder only supports a single regression class.
  • Code gives exactly the same results as Sam Joo's
    code.
  • Not fully benchmarked yet, still experimental
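
The mean transformation y = Ax + b can be sketched as follows. This only shows how an already estimated transform is applied to Gaussian means; the maximum-likelihood estimation of A and b is not shown. The function names and the toy values of A, b and the means are assumptions for illustration, not Sphinx or SphinxTrain code.

```python
def transform_mean(A, b, mean):
    """Apply the affine transform y = A x + b to one Gaussian mean vector."""
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, mean)) + b_i
        for row, b_i in zip(A, b)
    ]


def adapt_means(A, b, means):
    """Single regression class: the same (A, b) is applied to every mean."""
    return [transform_mean(A, b, m) for m in means]


# Toy 2-dimensional example (values are made up for illustration).
A = [[1.0, 0.5],
     [0.0, 2.0]]
b = [0.5, -1.0]
means = [[1.0, 2.0], [0.0, 0.0]]

adapted = adapt_means(A, b, means)
print(adapted)
```

With multiple regression classes, each group of Gaussians would get its own (A, b); the single-class case above matches the decoder limitation noted in this slide.
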

14
Live-mode APIs
  • Thanks to Yitao
  • A set of C APIs that provide recognition
    functionality
  • Close to Sphinx 2's style of APIs
  • Speech recognition resource
    initialization/un-initialization
  • Functions for utterance-level begin/end/process
    waveforms
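
The call order implied by these bullets (initialize, begin an utterance, process waveform chunks, end the utterance, un-initialize) can be sketched with a toy class. Everything below is hypothetical: the class `ToyLiveDecoder` and its method names are invented for illustration and are not the Sphinx 3.X C API, and no recognition is performed.

```python
class ToyLiveDecoder:
    """Hypothetical stand-in that only tracks the live-mode call order."""

    def __init__(self):
        self.ready = False
        self.in_utt = False
        self.n_samples = 0

    def initialize(self):
        # A real recognizer would load acoustic and language models here.
        self.ready = True

    def begin_utt(self):
        if not self.ready or self.in_utt:
            raise RuntimeError("begin_utt called in the wrong state")
        self.in_utt = True
        self.n_samples = 0

    def process(self, samples):
        # A real recognizer would run feature extraction and search here.
        if not self.in_utt:
            raise RuntimeError("process called outside an utterance")
        self.n_samples += len(samples)

    def end_utt(self):
        if not self.in_utt:
            raise RuntimeError("end_utt called outside an utterance")
        self.in_utt = False
        # A real recognizer would return a hypothesis; this returns a count.
        return self.n_samples

    def uninitialize(self):
        self.ready = False


decoder = ToyLiveDecoder()
decoder.initialize()
decoder.begin_utt()
for chunk in ([0] * 160, [0] * 160, [0] * 80):
    decoder.process(chunk)  # waveform arrives in small chunks
total = decoder.end_utt()
decoder.uninitialize()
print(total)
```

Feeding audio in chunks between the begin and end calls is what allows a live-mode decoder to work on a stream rather than on a complete file.
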

15
Live-mode APIs: What is missing?
  • What we lack
  • Dynamic LM addition and deletion
  • part of the plan for s3.6
  • Finite state machine implementation
  • part of the plan for s3.X, where X=8 or 9
  • End-pointer integration and APIs
  • Ziad Al Bawab's model-based classifier
  • Now available as a customized version, s3ep

16
Architecture
  • Code duplication is the root of many evils
  • Four tools of s3 are now incorporated into S3.5
  • align: an aligner
  • allphone: a phoneme recognizer
  • astar: lattice to N-best generation
  • dag: lattice best-path search
  • Many thanks to Dr. Carl Quillen of MIT Lincoln

17
Architecture: Next Step
  • decode_anytopo will be the next
  • Things we may incorporate someday
  • SphinxTrain
  • CMU-Cambridge LM Toolkit
  • lm3g2dmp and cepview

18
Code Distribution and Management
  • Distribution
  • Internal Release -> RC I -> RC II ... -> RC N
  • If no one yells during the calm-down period of RC N
  • Then, put a tarball on the SourceForge web page
  • At every release,
  • The distribution has to go through compilation on
    10 platforms
  • First announcement is usually made at the RC period.
  • Web page is maintained by
  • Evandro (<- extremely sane)

19
Digression: Other versions of Sphinx 3.X
  • Code that is
  • Not satisfying the design goals of the software
  • S3 slow w/ GMM computation
  • www.cs.cmu.edu/archan/software/s3fast.tgz
  • S3.5 with end-pointer
  • www.cs.cmu.edu/archan/software/s3ep.tgz
  • CMU researchers' code and implementations
  • E.g. according to legend, Rita has >10 versions
    of Sphinx and SphinxTrain.

20
Code Management
  • Concurrent Versions System (CVS) is used in
    Sphinx
  • Also used in other projects e.g. CALO and
    Festival
  • A very effective way to tie resource and
    knowledge together
  • Problem: there are still a lot of separate versions
    of code at CMU that are not in Sphinx's CVS.
  • Please kindly contact us if you work on something
    using Sphinx or derived from Sphinx

21
Infrastructure of Training
  • A need for persistence and version control
  • Baselines were lost after several years.
  • Setups will now be available in CVS for
  • Communicator (11.5)
  • WSJ 5k NVP (7.3)
  • ICSI Phase 3 Training
  • Far from the state of the art
  • Need to re-engineer and do archeology
  • Will add more tasks to the archive
  • You are welcome to change the setup if you don't
    like it
  • But you need to check in what you have done

22
SphinxTrain
  • SphinxTrain has never been officially released
  • Still a work in progress.
  • For Sphinx 3.X (X>5), the corresponding timestamp of
    SphinxTrain will also be published.
  • Recent Progress
  • Better on-line help
  • Added support for adaptation
  • Better support in perl scripts for FCHMM
    (Evandro)
  • Silence deletion in Baum-Welch Training
    (experimental)

23
Hieroglyph: Using Sphinx for building speech
recognizers
  • Project Hieroglyph
  • An effort to build a complete set of
    documentation for using Sphinx, SphinxTrain and
    CMU LM Toolkit for building speech applications.
  • Largely based on Evandro's, Rita's, Ravi's and
    Roni's docs.
  • Editor: Arthur Chan (<- does a lot of editing)
  • Authors
  • Arthur, David, Evandro, Rita, Ravi, Roni, Yitao

24
Hieroglyph: An outline
  • Chapter 1: Licensing of Sphinx, SphinxTrain and
    LM Toolkit
  • Chapter 2: Introduction to Sphinx
  • Chapter 3: Introduction to Speech Recognition
  • Chapter 4: Recipe for Building Speech Applications
    using Sphinx
  • Chapter 5: Different Software Toolkits of Sphinx
  • Chapter 6: Acoustic Model Training
  • Chapter 7: Language Model Training
  • Chapter 8: Search Structure and Speed-up of the
    Speech Recognizer
  • Chapter 9: Speaker Adaptation
  • Chapter 10: Research using Sphinx
  • Chapter 11: Development using Sphinx
  • Appendix A: Command Line Information
  • Appendix B: FAQ

25
Hieroglyph: Status
  • Still in the drafting stage
  • Chapter I: License and use of Sphinx,
    SphinxTrain and CMU LM Toolkit (1st draft, 3rd
    Rev)
  • Chapter II: Introduction to Sphinx, SphinxTrain
    and CMU LM Toolkit (1st draft, 1st Rev)
  • Chapter VIII: Search Structure and Speed-up of
    Sphinx's recognizers (1st draft, 1st Rev)
  • Chapter IX: Speaker adaptation using Sphinx (1st
    draft, 2nd Rev)
  • Chapter XI: Development using Sphinx (1st draft,
    1st Rev)
  • Appendix A.2: Full SphinxTrain Command Line
    Information (1st draft, 2nd Rev)
  • Writing quality: low
  • The 1st draft will be completed in ½ year
    (hopefully)

26
Team and Organization
  • Sphinx Developers
  • A group of volunteers who maintain and enhance
    Sphinx and related resources
  • Current Members
  • Arthur Chan (Project Manager / Coordinator)
  • Evandro Gouvea (Maintainer / Developer)
  • David Huggins-Daines (Developer)
  • Yitao Sun (Developer)
  • Ravi Mosur (Speech Advisor)
  • Alex Rudnicky (Speech Advisor)
  • All of you
  • Application Developers
  • Modeling experts
  • Linguists
  • Users

27
Team and Organization
  • We need help!
  • Several positions are still available for
    volunteers
  • Project Manager: enable development of Sphinx
  • Translation: kick/fix miscellaneous people
    (lightly) every day.
  • Maintainer: ensure integrity of Sphinx code and
    resources
  • Translation: a good chance for you to understand
    life more
  • Tester: enable test-based development in Sphinx
  • Translation: a good way to increase blood
    pressure.
  • Developers: incorporate state-of-the-art technology
    into Sphinx
  • Translation: deal with legacy code and start to
    write legacy code yourself
  • For your projects, you can also send us temp
    people.
  • Regular meetings are scheduled biweekly.
  • Though if we are too busy, we just skip them.

28
Next 6 months: Sphinx 3.6
  • More refined speaker adaptation
  • More support on dynamic LM
  • More speed-up of the code
  • Better documentation (Complete 1st Draft of
    Hieroglyph?)
  • Confidence measure(?)

29
If we still survive and have a full team
  • Roadmap of Sphinx 3.X (X>6)
  • X=7
  • Decoder, Trainer code merge
  • FSG implementation
  • Confidence annotation
  • X=8
  • Trainer fixes
  • LM manipulation support
  • X=9
  • Better covariance modeling and speaker adaptation
  • Hieroglyph completed
  • X>=10: To move on, innovation is necessary.

30
Speech recognition and other Research
  • The goal of Sphinx
  • Support innovation and development of new speech
    applications
  • A conscious and correct decision in long term
    speech recognition research
  • In Speech Synthesis
  • aligner is important for unit selection
  • In Parsing/Dialog Modeling
  • Sphinx 3.X still has a lot of errors!
  • We still need Phoenix! (Robust Parser)
  • We still need Ravenclaw House! (Dialog Manager)
  • In Speech Applications
  • Good recognizer is the basis

31
Cost of Research in Speech Recognition
  • A 30% WER reduction is usually perceivable to users
  • i.e. it roughly translates to 1-2 good algorithmic
    improvements
  • In a well-educated research group
  • Known techniques usually require ½ year to
    implement and test.
  • Unknown techniques take more time (1 year
    per innovation).
  • Experienced developers
  • 1 month to implement known techniques
  • 3 months to innovate

32
Therefore
  • It still makes sense to continue supporting
  • speech recognizer development
  • acoustic modeling improvement.
  • To consolidate, what we were lacking:
  • 1. Code and project management
  • A multi-developer environment is strictly
    essential.
  • 2. Transfer of research to development
  • 3. Acoustic modeling research: discriminative
    training, speaker adaptation

33
Future of Sphinx 3.X
  • ICSLP 2004
  • "From Decoding Driven to Detection-Based
    Paradigms for Automatic Speech Recognition" by
    Prof. Chin-Hui Lee
  • Speech Recognition
  • Still an open problem in 2004
  • Role of Speech Recognition in Speech Applications
  • Still largely unknown
  • Requires open minds to understand

34
Conclusion
  • We've done something in 2004
  • Our effort is starting to make a difference
  • We still need to do more in 2005
  • Making Sphinx 3.X a backbone of speech
    application development
  • Consolidation of the current research and
    development in Sphinx
  • Seeking ways toward sustainable development

35
Thank you!