Sphinx-3 - PowerPoint PPT Presentation

1
Sphinx-3
  • An Overview

2
Installation Overview Sphinx 3
  • doc
  • model
  •   hmm/hub4_cd_continuous
  •   lm
  • scripts (Unix scripts)
  • src
  •   libs3audio
  •   libs3decoder
  •   libutil
  •   programs
  • win32 (batch files and workspaces)

3
Overview Sphinx 3
  • The s3.3 decoder is based on the conventional
    Viterbi search algorithm and beam search
    heuristics. It uses a lexical-tree search
    structure similar to that of the Sphinx-II
    decoder, with some improvements for greater
    accuracy. It takes its input from pre-recorded
    speech in raw PCM format and writes its
    recognition results to output files.

4
Installation Overview Sphinx 3
  • The decoder needs the following inputs:
  • Lexical model: The lexical or pronunciation
    model contains pronunciations for all the words
    of interest to the decoder. Sphinx-3 uses
    phonetic units to build word pronunciations.
    Currently, the pronunciation lexicon is almost
    entirely hand-crafted.
  • Acoustic model: Sphinx uses acoustic models
    based on statistical hidden Markov models
    (HMMs). The acoustic model is trained from
    acoustic training data using the Sphinx-3
    trainer. The trainer can build acoustic models
    with a wide range of structures: discrete,
    semi-continuous, or continuous. However, the
    s3.3 decoder can handle only continuous
    acoustic models.
  • Language model (LM): Sphinx-3 uses a
    conventional backoff bigram or trigram language
    model.
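The lexical model above maps each word to a phone sequence; in the CMU dictionary convention, alternate pronunciations of a word carry a parenthesized index, e.g. READ and READ(2). A minimal sketch of parsing such entries (the sample lines follow that convention; the parse_dict helper itself is illustrative, not part of Sphinx-3):

```python
# Sketch: parsing Sphinx-style pronunciation lexicon entries.
# parse_dict is an illustrative helper, not a Sphinx-3 API.

def parse_dict(lines):
    """Map each base word to its list of alternate pronunciations."""
    lexicon = {}
    for line in lines:
        if not line.strip():
            continue
        word, *phones = line.split()
        # Alternate pronunciations carry a parenthesized index: READ(2)
        base = word.split("(")[0]
        lexicon.setdefault(base, []).append(phones)
    return lexicon

sample = [
    "READ     R IY D",      # present tense
    "READ(2)  R EH D",      # past tense
    "SPEECH   S P IY CH",
]

lexicon = parse_dict(sample)
print(lexicon["READ"])      # both pronunciations of the one word
```

Note that both READ entries collapse onto a single base word; the language-model slides later explain why the LM cannot tell them apart.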

5
Installation Overview Sphinx 3
  • Speech input specification: This distribution
    contains four executable files, three of which
    perform recognition.
  • livedecode decodes live speech, that is, speech
    incoming from your audio card.
  • livepretend decodes in batch mode, using a
    control file that describes the input to be
    decoded into text.
  • decode also uses a control file for batch-mode
    processing. In this case, the entire input to
    be processed must be available beforehand,
    i.e., the raw audio samples must already have
    been preprocessed into cepstrum files.
  • Also note that the decoder cannot handle
    arbitrary lengths of speech input: each
    separate piece (or utterance) to be processed
    by the decoder must be no more than 300 sec
    long. Typically, one uses a segmenter to chop a
    cepstrum stream into manageable segments of up
    to 20 or 30 sec duration.
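The segmentation step above is simple arithmetic over frames. A sketch under the figures stated in this deck (100 frames/sec, a 300 sec per-utterance cap, 20-30 sec target segments); the segment helper is illustrative, not a Sphinx tool:

```python
# Sketch: chopping a long cepstrum stream into decoder-sized segments.
# The frame rate and limits come from the slides; segment() is illustrative.

FRAMES_PER_SEC = 100
MAX_UTT_SEC = 300          # hard per-utterance limit in the s3.3 decoder

def segment(n_frames, seg_sec=30):
    """Yield (start, end) frame ranges no longer than seg_sec seconds."""
    assert seg_sec <= MAX_UTT_SEC
    step = seg_sec * FRAMES_PER_SEC
    for start in range(0, n_frames, step):
        yield (start, min(start + step, n_frames))

# A 65-second stream becomes two full 30 sec segments plus a 5 sec tail:
print(list(segment(65 * FRAMES_PER_SEC)))
# [(0, 3000), (3000, 6000), (6000, 6500)]
```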

6
Overview Sphinx 3
  • Outputs
  • The decoder can produce two types of recognition
    output
  • Recognition hypothesis A single best recognition
    result (or hypothesis) for each utterance
    processed. It is a linear word sequence, with
    additional attributes such as their time
    segmentation and scores.
  • Word lattice A word-graph of all possible
    candidate words recognized during the decoding of
    an utterance, including other attributes such as
    their time segmentation and acoustic likelihood
    scores.
  • In addition, the decoder also produces a detailed
    log to stdout/stderr that can be useful in
    debugging, gathering statistics, etc.
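A recognition hypothesis, as described above, is just a word sequence whose entries carry time segmentation and scores. A sketch of working with one; the field layout below is assumed for illustration only and is not the exact Sphinx-3 log format:

```python
# Sketch: a single-best hypothesis as a word sequence with per-word
# time segmentation (in frames) and scores. The record layout and the
# numbers are invented for illustration; see the Sphinx-3 docs for the
# real output format.
from collections import namedtuple

Seg = namedtuple("Seg", "word start_frame end_frame score")

hyp = [
    Seg("<s>",   0,    20, -1500),    # silence/start marker
    Seg("HELLO", 21,   80, -23000),
    Seg("WORLD", 81,  140, -21000),
    Seg("</s>",  141, 150, -1200),
]

# Strip the non-lexical markers to get the readable transcript,
# and sum the per-word scores for the utterance total.
words = " ".join(s.word for s in hyp if not s.word.startswith("<"))
total = sum(s.score for s in hyp)
print(words, total)
```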

7
Sphinx 3 Signal Processing Front End
  • Input: speech waveform, 16-bit samples
    (sampling rate 16 kHz)
  • Front-end processing parameters:
  • Pre-emphasis module (pre-emphasis alpha = 0.97)
  • Framing (100 frames/sec)
  • Windowing (window length = 0.025625 sec)
  • Power spectrum (DFT size = 512)
  • Filtering (lower cutoff 133.334 Hz, upper
    cutoff 6855.4976 Hz)
  • Mel spectrum (power spectrum multiplied by the
    Mel weighting filters; number of Mel filters =
    40)
  • Mel cepstrum (number of cepstra = 13)
  • Output: Mel-frequency cepstral coefficients
    (39 32-bit floats per frame)
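The first stages of this pipeline can be sketched directly from the parameters above (alpha = 0.97, 100 frames/sec, 0.025625 sec windows at 16 kHz). This is a pure-Python illustration of pre-emphasis, framing, and windowing only; the real front end continues with the DFT, Mel filtering, and the cepstral transform, and the Hamming window here is an assumption:

```python
# Sketch: pre-emphasis + framing + windowing with the slide's parameters.
# Illustrative only; not the Sphinx-3 front-end code.
import math

SAMPLE_RATE = 16000
FRAME_SHIFT = SAMPLE_RATE // 100            # 160 samples (100 frames/sec)
FRAME_LEN = int(0.025625 * SAMPLE_RATE)     # 410 samples per window
ALPHA = 0.97

def pre_emphasis(x, alpha=ALPHA):
    """y[n] = x[n] - alpha * x[n-1]"""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def frames(x):
    """Slice the signal into overlapping frames, applying a Hamming
    window (assumed here) to each one."""
    win = [0.54 - 0.46 * math.cos(2 * math.pi * n / (FRAME_LEN - 1))
           for n in range(FRAME_LEN)]
    out = []
    for start in range(0, len(x) - FRAME_LEN + 1, FRAME_SHIFT):
        out.append([x[start + n] * win[n] for n in range(FRAME_LEN)])
    return out

signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE)]       # 1 sec of a 440 Hz tone
f = frames(pre_emphasis(signal))
print(len(f), len(f[0]))                     # frames x samples-per-frame
```

One second of audio yields 98 full frames here because the window (410 samples) is longer than the shift (160 samples), so the last partial windows are dropped.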

8
Sphinx 3 Acoustic Model
  • An acoustic model is represented by the
    following collection of files:
  • A model definition (or mdef) file. It defines
    the set of base phone and triphone HMMs, the
    mapping of each HMM state to a senone, and the
    mapping of each HMM to a state transition
    matrix.
  • Gaussian mean and variance (or mean and var)
    files.
  • These files contain all the Gaussian codebooks in
    the model.
  • The Gaussian means and corresponding variance
    vectors are separated into the two files.
  • A mixture weights (or mixw) file containing the
    Gaussian mixture weights for all the senones in
    the model.
  • A state transition matrix (or tmat) file
    containing all the HMM state transition
    topologies and their transition probabilities in
    the model.
  • An optional sub-vector quantized model (or subvq)
    file containing an approximation of the acoustic
    model, for efficient evaluation.
  • The mean, var, mixw, and tmat files are produced
    by the Sphinx-3 trainer, and their file formats
    should be documented there.

9
Sphinx 3 Acoustic Model
  • Continuous Broadcast News acoustic models
  • The models were trained on 140 hours of 1996
    and 1997 Hub-4 training data, available from
    the Linguistic Data Consortium. The phone set
    for which models are provided is that of the
    CMU dictionary, version 0.6d.
  • The dictionary was used without stress markers,
    resulting in 40 phones, including the silence
    phone, SIL. Adding stress markers degrades
    performance by about 5% relative.
  • Rita Singh, Sphinx Speech Group, School of
    Computer Science, Carnegie Mellon University,
    Pittsburgh, PA 15213

10
Sphinx 3 Acoustic Model
  • hmm/hub4_cd_continuous files:
  • 8gau.6000sen.quant
  • hub4opensrc.6000.mdef
  • means
  • mixture_weights
  • transition_matrices
  • variances

11
Sphinx 3 Language Model
  • The main language model (LM) used by the Sphinx
    decoder is a conventional bigram or trigram
    language model.
  • The CMU-Cambridge SLM toolkit can generate such
    a model from LM training data. Its output is an
    ASCII text file.
  • A large text LM file can be very slow to load
    into memory. To speed up this process, the LM
    must be compiled into a binary form.
  • The code to convert from the ASCII text format
    to the binary format is available at
    SourceForge in the CVS tree, in a module named
    share.
  • The vocabulary of the LM is the set of words
    covered by the unigrams.
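Since the vocabulary is exactly the set of words with unigram entries, it can be read straight out of the toolkit's ARPA-style text LM. A sketch; the tiny LM string is hand-made for illustration and the parsing covers only the happy path:

```python
# Sketch: extracting the vocabulary (the unigram words) from an
# ARPA-format text LM of the kind the CMU-Cambridge SLM toolkit emits.
# The embedded LM below is a hand-made toy example.

ARPA = """\
\\data\\
ngram 1=4

\\1-grams:
-1.0 <s> -0.5
-1.2 </s>
-0.9 HELLO -0.3
-0.9 WORLD -0.3

\\end\\
"""

def vocabulary(text):
    vocab, in_unigrams = set(), False
    for line in text.splitlines():
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if line.startswith("\\"):   # next n-gram section or \end\
                break
            if line:
                # fields: log10 prob, word, optional backoff weight
                vocab.add(line.split()[1])
    return vocab

print(sorted(vocabulary(ARPA)))
```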

12
Sphinx 3 Language Model
  • In Sphinx, the LM cannot distinguish between
    different pronunciations of the same word. For
    example:
  • even though the lexicon might contain two
    different pronunciation entries for the word
    READ (present and past tense forms), the
    language model cannot distinguish between the
    two. Both pronunciations inherit the same
    probability from the language model.
  • Second, the LM is case-insensitive. For
    example:
  • it cannot contain two different tokens READ and
    read.
  • The reasons for these restrictions are
    historical: precise pronunciation and case
    information has rarely been present in LM
    training data. It would certainly be desirable
    to do away with the restrictions at some time
    in the future. (Project)
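The two restrictions above amount to a normalization step before every LM lookup: upper-case the token and drop any alternate-pronunciation index, so READ, read, and READ(2) all share one LM probability. A sketch; the lm_unigram table and its probabilities are invented for illustration:

```python
# Sketch of the LM's case-insensitivity and pronunciation-blindness:
# every lexicon token is normalized to a single LM word before lookup.
# lm_unigram is a hypothetical stand-in for a real LM table.
import re

lm_unigram = {"READ": 0.001, "THE": 0.05}   # invented probabilities

def lm_word(token):
    """Normalize a lexicon token to its single LM word."""
    base = re.sub(r"\(\d+\)$", "", token)   # drop index like (2)
    return base.upper()                     # LM is case-insensitive

for tok in ("READ", "read", "READ(2)"):
    print(tok, "->", lm_word(tok), lm_unigram[lm_word(tok)])
```

All three tokens map to the same probability, which is exactly why the LM cannot favor one pronunciation or casing over another.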

13
Sphinx 3 Language Model
  • In the an4 directory you will see a list of
    files, of which the following are of special
    importance:
  • args.an4.test: an argument file, used by the
    batch file you have executed, specifying all
    command-line arguments for the example
    executable.
  • an4.dict: the dictionary file for this language
    model, defining all words in terms of phonemes.
  • an4.ug.lm: the n-gram (unigram/bigram/trigram)
    data, in readable text form, used to predict
    word sequences; this is the actual language
    'model'.
  • an4.ug.lm.DMP: the same n-gram data in the
    binary format used by the toolkit at runtime.

14
Sphinx 3 Batch Files
  • Win32 batch files and workspaces

15
Sphinx3-test.bat
  • set S3ROOT=..\..
  • cd %S3ROOT%
  • set S3BATCH=.\win32\msdev\programs\livepretend\
    Debug\livepretend.exe
  • set TASK=.\model\lm\an4
  • set CTLFILE=.\win32\batch\an4.ctl (an4.ctl
    contains the utterance pittsburgh.littleendian)
  • set ARGS=.\model\lm\an4\args.an4.test
  • echo sphinx3-test
  • echo Run CMU Sphinx-3 in batch mode to decode an
    example utterance.
  • echo This batch script assumes all files are
    relative to the main directory (S3ROOT).
  • echo When running this, look for a line that
    starts with "FWDVIT"
  • echo If the installation is correct, this line
    should read
  • echo FWDVIT: P I T T S B U R G H (null)
  • %S3BATCH% %CTLFILE% %TASK% %ARGS%

16
Sphinx3-simple.bat
  • rem This batch script assumes all files are
    relative to the dir (S3ROOT).
  • set S3ROOT=..\..
  • cd %S3ROOT%
  • set S3CONTINUOUS=.\win32\msdev\programs\
    livedecode\Debug\livedecode.exe
  • set ARGS=.\model\lm\an4\args.an4.test
  • echo "sphinx3-simple"
  • echo "Demo CMU Sphinx-3 decoder called with
    command line arguments."
  • echo "<executing %S3CONTINUOUS%, please wait>"
  • %S3CONTINUOUS% %ARGS%
  • (To decode the Dutch test set instead, change
    the ARGS line to: set
    ARGS=.\model\lm\dutch_test\args.5417.test)

17
Sphinx Source Code
  • agc.c: Automatic gain control (on signal
    energy)
  • ascr.c: Senone acoustic score
  • beam.c: Pruning beam widths
  • bio.c: Binary file I/O support
  • cmn.c: Cepstral mean normalization and variance
    normalization
  • corpus.c: Control file processing
  • cont_mgau.c: Mixture Gaussians (acoustic model)
  • decode.c: Main file for decode
  • dict.c: Pronunciation lexicon
  • dict2pid.c: Generation of triphones for the
    pronunciation dictionary
  • feat.c: Feature vector computation
  • fillpen.c: Filler word probabilities
  • gausubvq.c: Standalone acoustic model sub-vector
    quantizer
  • hmm.c: HMM evaluation
  • hyp.h: Recognition hypothesis data type
  • kb.h: All knowledge bases and search structures
    used by the decoder
  • kbcore.c: Collection of core knowledge bases
  • lextree.c: Lexical search tree
  • live.c: Live decoder functions
  • lm.c: Trigram language model
  • logs3.c: Support for log-likelihood operations
  • main_live_example.c: Main file for livedecode,
    showing use of live.c
  • main_live_pretend.c: Main file for livepretend,
    showing use of live.c
  • mdef.c: Acoustic model definition
  • s3types.h: Various data types, for ease of
    modification
  • subvq.c: Sub-vector quantized acoustic model
  • tmat.c: HMM transition matrices (topology
    definition)
  • vector.c: Vector operations, quantization, etc.
  • vithist.c: Backpointer table (Viterbi history)
  • wid.c: Mapping between LM and lexicon word IDs