Title: Progress of Sphinx 3.X, From X=4 to X=5
1Progress of Sphinx 3.X, From X4 to X5
- By
- Arthur Chan
- Evandro Gouvea
- Yitao Sun
- David Huggins-Daines
- Jahanzeb Sherwani
2What is CMU Sphinx?
- Definition 1
- a large vocabulary speech recognizer with high
accuracy and speed performance. - Definition 2
- a collection of tools and resources that enables
developers/researchers to build successful speech
recognizers
3Brief History of Sphinx
- More detail version can be found at,
- www.cs.cmu.edu/archan/presentation/SphinxDev20040
624.ppt
1987 Sphinx I
1992 Sphinx II
1996 Sphinx III S3 slow
1999 Sphinx III S3 fast or S3.2
2001-2002 -Sphinx IV Development Initiated -S3.3
2004 Jul S3.4
2004 Oct S3.5 RCII
2000 Sphinx become open-source
4What is Sphinx 3.X?
- An extension of Sphinx 3s recognizers
- Sphinx 3.X (X5) means Sphinx 3.5
- ? It helps to confuse people more.
- Provide functionalities such as
- Real-time speech recognition
- Speaker adaptation
- Developers Application Interfaces (APIs)
- 3.X (Xgt3) is motivated by Project CALO
5Development History of Sphinx 3.X
S3 -Sphinx 3 flat-lexicon recognizer (s3 slow)
S3.2 -Sphinx 3 tree-lexicon recognizer (s3 fast)
S3.3 -w live-mode demo
S3.4 -fast GMM computation -support class-based
LM -some support for dynamic LM
S3.5 some support on speaker adaptation -live
mode APIs -Sphinx 3 and Sphinx 3.X code merge
6This talk
- A general summary of whats going on.
- Less technical than 3.4 talk
- Folks were so confused by jargons in speech
recognitions black magic. - More for code development, less for acoustic
modeling - Reason I have not much time to do both ?
- (Incorrect version) We need to adopt the latest
technology to clown 2 to 3 Arthur Chan(s) for the
CALO project. Prof. Alex Rudnicky, in one CALO
meeting in 2004 - (Kindly corrected by Prof. Alan Black) We
need to adopt the latest technology to clone 2 to
3 Arthur Chan(s) for the CALO project. Prof.
Alex Rudnicky, in one CALO meeting in 2004 - More on a project point of view
- Speech recognition software easily shows
phenomena described in Mythical Man-Month.
7This talk (outline)
- Sphinx 3.X, The recognizer (From X4 to X5) (10
pages) - Accuracy and Speed (5 pages)
- Speaker Adaptation (1 page)
- Application Interfaces (APIs) (2 pages)
- Architecture (2 pages)
- Sphinx as a collection of resources (10 pages)
- Code distribution and management (3 pages)
- Infrastructure of Training (1 page)
- SphinxTrain tools of training acoustic models.
(1 page) - Documentation (3 pages)
- Team and Organization (2 pages)
- Development plan for Sphinx 3.X (X gt 6) (2
pages) - Relationship between speech recognition and other
speech researches. (4 pages)
8Accuracy and Speed
- Why Sphinx 3.X ? Why not Sphinx 2?
- Due to the limitation of computation in 90s
- S2 only support restricted version of
semi-continuous HMM (SCHMM) - S3.X supports fully continuous HMM (FCHMM)
- Accuracy improvement is around relative 30
- You will see benchmarking results two slides
later - Speed
- S3.X is still slower than S2
- But in many tasks, it seems to becomes reasonable
to use it. - (YOU CAN FIND THE RESULTS FEWS SLIDES)
9Speed
- Fast Search techniques
- Lexical tree search (s3.2)
- Viterbi beam tuning and Histogram beam
Pruning(s3.2) - Ravis talk www.cs.cmu.edu/archan/presentation/s3
.2.ppt - Phoneme look-ahead (s3.4 by Jahanzeb)
- Fast GMM computation techniques (s3.4)
- Using the measurement in the literature, that
means - 75-90 of GMM computation reduction with fast
GMM computation pruning. - lt10 relative degradation can usually be achieved
in clean database. - Further Detail Four-Layer Categorization Scheme
of Fast GMM Computation Techniques A. Chan et
al.
10Accuracy Benchmarking (Communicator Task)
- Test platform, 2.2G Pentium IV
- CMU Communicator task
- Vocabulary size (3k) , perplexity 90
- All tunings were done without sacrificing 5
performance. - Batch mode decoder is used. (decode)
- Sphinx 2 (tuned w speed-up techniques)
- WER 17.8 (0.34xRT)
- Baseline results Sphinx 3.X 32 gaussian-FCGMM
- WER 14.053 (2.40xRT)
- Baseline results Sphinx 3.X, 64 gaussian-FCGMM
- WER 11.7 (3.67xRT)
- Tuned Sphinx 3.X 64 gaussian-FCGMM
- WER 12.851 (0.87 xRT), 12.152 (1.17xRT)
- Rong can make it better Boosting training
results 10.5
11Accuracy/Speed Benchmarking (WSJ Task)
- Test platform, 2.2G Pentium
- Vocabulary Size (5k)
- Standard NVP task.
- Trained by both WSJ0 and WSJ1
- Sphinx 2, 14.5 (?)
- Sphinx 3.X, 8 gaussian-FCGMM
- un-tuned 7.3 1.6xRT
- tuned 8.29 0.52xRT
12Accuracy/SpeedBenchmarking (Future Plan)
- Issue 1 Large variance in GMM computation.
- Average performance is good, worse case can be
disastrous. - Issue 2 Tuning requires a black magician
- Automatic tuning is necessary.
- Issue 3 Still need to work on larger databases
(e.g. WSJ 20k, BN) - training setup need to be dig up
- Issue 4 Speed up in noisy corpus is tricky.
- Results are not satisfactory (20-30 degradation
in accuracy)
13Speaker Adaptation
- Start to support MLLR-based speaker adaptation
- yAxb , estimate A, b in a maximum likelihood
fashion (Legetter 94) - Current functionality of sphinx 3.X SphinxTrain
- Allow estimation of transformation matrix
- Transforming means offline
- Transforming means online
- Decoder only support single regression class.
- Code gives exactly the same results as Sam Joos
code. - Not fully benchmarked yet, still experimental
14Live-mode APIs
- Thanks to Yitao
- Sets of C APIs that provide recognition
functionality - Close to Sphinx 2s style of APIs
- Speech recognition resource initialization/un-init
ialization - Functions for Utterance level begin/end/process
waveforms
15Live-mode APIs What are missing?
- What we lack
- Dynamic LM addition and deletion
- part of the plan of s3.6
- Finite state machine implementation
- part of plan of s3.X where X8 or 9
- End-pointer integration and APIs
- Ziad Al Bawabs model-based classifier
- Now as a customized version, s3ep
16Architecture
- Code duplication is the root of many evils
- Four tools of s3 are now incorporated into S3.5
- align an aligner
- allphone a phoneme recognizer
- astar lattice to N-best generation
- dag lattice best-path search
- Many thanks to Dr. Carl Quillen of MIT Lincoln
17Architecture Next Step
- decode_anytopo will be the next
- Things we may incorporate someday
- SphinxTrain
- CMU-Cambridge LM Toolkit
- lm3g2dmp and cepview
18Code Distribution and Management
- Distribution
- Internal Release -gt RC I -gt RC II .. -gt RC N
- If no one yell during calm-down period of RC N
- Then, put a tar ball on Sourceforge web page
- At every release,
- Distribution have to go through 10 platforms of
compilation - First announcement usually made at the RC period.
- Web page is maintained by
- Evandro (lt-extremely sane)
19Digression Other versions of Sphinx 3.X
- Code that are
- Not satisfying design goal of the software
- S3 slow w/ GMM Computation
- www.cs.cmu.edu/archan/software/s3fast.tgz
- S3.5 with end-pointer
- www.cs.cmu.edu/archan/software/s3ep.tgz
- CMU Researchers code and implementation
- E.g. According to legend, Rita has gt10 versions
of Sphinx and SphinxTrain.
20Code Management
- Concurrent Versions System (CVS) is used in
Sphinx - Also used in other projects e.g. CALO and
Festival - A very effective way to tie resource and
knowledge together - Problems Still has a lot of separate versions
of code in CMU not in Sphinxs CVS. - Please kindly contact us if you work on something
using Sphinx or derived from Sphinx
21Infrastructure of Training
- A need for persistence and version control
- Baseline were lost after several years.
- setup will be now available in CVS for
- Communicator (11.5)
- WSJ 5k NVP (7.3)
- ICSI Phase 3 Training
- Far from the state of the art
- Need to re-engineer and do archeology
- Will add more tasks to the archive
- You are welcomed to change the setup if you dont
like it - But you need to check in what you have done
22SphinxTrain
- SphinxTrain is never officially released
- Still under work.
- For sphinx3.X (Xgt5), corresponding timestamp of
SphinxTrain will also be published. - Recent Progress
- Better on-line help
- Added support for adaptation
- Better support in perl scripts for FCHMM
(Evandro) - Silence deletion in Baum-Welch Training
(experimental)
23Hieroglyph Using Sphinx for building speech
recognizers
- Project Hieroglyphs
- An effort to build a set of complete
documentation for using Sphinx, SphinxTrain and
CMU LM Toolkit fo building speech applications. - Largely based on Evandro, Rita, Ravi, Ronis
docs. - Editor Arthur Chan lt- do a lot of editing
- Authors
- Arthur, David, Evandro, Rita, Ravi, Roni, Yitao
24Hieroglyph An outline
- Chapter 1 Licensing of Sphinx, SphinxTrain and
LM Toolkit - Chapter 2 Introduction to Sphinx
- Chapter 3 Introduction to Speech Recognition
- Chapter 4 Recipe of Building Speech Application
using Sphinx - Chapter 5 Different Software Toolkits of Sphinx
- Chapter 6 Acoustic Model Training
- Chapter 7 Language Model Training
- Chapter 8 Search Structure and Speed-up of the
Speech recognizer - Chapter 9 Speaker Adaptation
- Chapter 10 Research using Sphinx
- Chapter 11 Development using Sphinx
- Appendix A Command Line Information
- Appendix B FAQ
25Hieroglyph Status
- Still in the drafting stage
- Chapter I License and use of Sphinx,
SphinxTrain and CMU LM Toolkit (1st draft, 3rd
Rev) - Chapter II Introduction to Sphinx, SphinxTrain
and CMU LM Toolkit (1st draft, 1st Rev) - Chapter VIII Search Structure and Speed-up of
Sphinx's recognizers (1st draft, 1st Rev) - Chapter IX Speaker adaptation using Sphinx (1st
draft, 2nd Rev) - Chapter XI Development using Sphinx (1st draft,
1st Rev) - Appendix A.2 Full SphinxTrain Command Line
Information (1st draft, 2nd Rev) - Writing Quality Low
- The 1st draft will be completed ½ year later
(hopefully)
26Team and Organization
- Sphinx Developers
- A group of volunteers who maintain and enhance
Sphinx and related resources - Current Members
- Arthur Chan (Project Manager / Coordinator)
- Evandro Gouvea (Maintainer / Developer)
- David Huggins-Daines (Developer)
- Yitao Sun (Developer)
- Ravi Mosur (Speech Advisor)
- Alex Rudnicky (Speech Advisor)
- All of you
- Application Developers
- Modeling experts
- Linguists
- Users
27Team and Organization
- We need help!
- Several positions are still available for
volunteers - Project Manager Enable Development of Sphinx
- Translation kick/fix miscellaneous people
(lightly) everyday. - Maintainer Ensure integrity of Sphinx code and
resource - Translation a good chance for you to understand
life more - Tester Enable test-based development in Sphinx
- Translation a good way to increase blood
pressure. - Developers Incorporate state-of-art technology
into Sphinx - Translation deal with legacy code and start to
write legacy code yourself - For your projects, you can also send us temp
people. - Regular meetings are scheduled biweekly.
- Though, if we are too busy, we just skip it.
28Next 6 months Sphinx 3.6
- More refined speaker adaptation
- More support on dynamic LM
- More speed-up of the code
- Better documentation (Complete 1st Draft of
Hieroglyph?) - Confidence measure(?)
29If we still survive and have a full team
- Roadmap of Sphinx 3.X (Xgt6)
- X7,
- Decoder, Trainer code merge
- FSG implementation
- Confidence annotation
- X8
- Trainer fixes
- LM manipulation support
- X9
- Better covariance modeling and speaker adaptation
- Hieroglyph completed
- Xgt 10 To move on, innovation is necessary.
30Speech recognition and other Research
- The goal of Sphinx
- Support innovation and development of new speech
applications - A conscious and correct decision in long term
speech recognition research - In Speech Synthesis
- aligner is important for unit selection
- In Parsing/Dialog Modeling
- Sphinx 3.X still has a lot of errors!
- We still need Phoenix! (Robust Parser)
- We still need Ravenclaw House! (Dialog Manager)
- In Speech Applications
- Good recognizer is the basis
31Cost of Research in Speech Recognition
- 30 WER reduction is usually perceivable to users
- i.e. roughly translate to 1-2 good algorithmic
improvements - Under a well-educated researchers group
- known techniques usually require ½ year to
implement and test. - Unknown techniques will take more time. (1 year
per innovation) - Experienced developers
- 1 month to implement known techniques
- 3 months to innovate
32Therefore
- It still makes sense to continuously support on,
- speech recognizer development
- acoustic modeling improvement.
- To consolidate, what we were lacking
- 1, code and project management
- Multi-developer environment is strictly
essential. - 2, transferal of research to development
- 3, acoustic modeling research discriminative
training, speaker adaptation
33Future of Sphinx 3.X
- ICSLP 2004
- From Decoding Driven to Detection-Based
Paradigms for Automatic Speech Recognition by
Prof. Chin-Hui Lee - Speech Recognition
- Still an open problem at 2004
- Role of Speech Recognition in Speech Application
- Still largely unknown
- Require open minds to understand
34Conclusion
- Weve done something in 2004
- Our effort starts to make a difference
- We still need to do more in 2005
- Making a Sphinx 3.X a backbone of speech
application development - Consolidation of the current research and
development in Sphinx - Seek for ways for sustainable development
35Thank you!