LSA 352: Speech Recognition and Synthesis


LSA 352: Speech Recognition and Synthesis
  • Dan Jurafsky
  • Lecture 1
  • 1) Overview of Course
  • 2) Refresher Intro to Probability
  • 3) Language Modeling

IP notice: some slides for today are from Josh
Goodman, Dan Klein, Bonnie Dorr, Julia
Hirschberg, and Sandiway Fong
  • Overview of Course
  • Probability
  • Language Modeling
  • "Language Modeling" means "probabilistic grammar"

  • Speech Recognition
  • Speech-to-Text
  • Input: a wavefile
  • Output: a string of words
  • Speech Synthesis
  • Text-to-Speech
  • Input: a string of words
  • Output: a wavefile

Automatic Speech Recognition (ASR) and Automatic
Speech Understanding (ASU)
  • Applications
  • Dictation
  • Telephone-based Information (directions, air
    travel, banking, etc)
  • Hands-free (in car)
  • Second language ('L2') (accent reduction)
  • Audio archive searching
  • Linguistic research
  • Automatically computing word durations, etc

Applications of Speech Synthesis/Text-to-Speech
  • Games
  • Telephone-based Information (directions, air
    travel, banking, etc)
  • Eyes-free (in car)
  • Reading/speaking for disabled
  • Education: reading tutors
  • Education: L2 learning

Applications of Speaker/Language Recognition
  • Language recognition for call routing
  • Speaker Recognition
  • Speaker verification (binary decision)
  • Voice password, telephone assistant
  • Speaker identification (one of N)
  • Criminal investigation

History: foundational insights, 1900s-1950s
  • Automaton
  • Markov 1911
  • Turing 1936
  • McCulloch-Pitts neuron (1943)
  • Shannon (1948): link between automata and Markov
    models
  • Human speech processing
  • Fletcher at Bell Labs (1920s)
  • Probabilistic/Information-theoretic models
  • Shannon (1948)

Synthesis precursors
  • Von Kempelen mechanical (bellows, reeds) speech
    production simulacrum
  • 1929 Channel vocoder (Dudley)

History: Early Recognition
  • 1920s Radio Rex
  • Celluloid dog with iron base held within house by
    electromagnet against force of spring
  • Current to magnet flowed through bridge which was
    sensitive to energy at 500 Hz
  • 500 Hz energy caused bridge to vibrate,
    interrupting current, making dog spring forward
  • The vowel [e] (ARPAbet [eh]) in "Rex" has energy
    near 500 Hz

History: early ASR systems
  • 1950s: early speech recognizers
  • 1952: Bell Labs single-speaker digit recognizer
  • Measured energy from two bands (formants)
  • Built with analog electrical components
  • 2% error rate for single speaker, isolated digits
  • 1958: Dudley built classifier that used
    continuous spectrum rather than just formants
  • 1959: Denes ASR combining grammar and acoustic
    probability
  • 1960s
  • FFT - Fast Fourier transform (Cooley and Tukey,
    1965)
  • LPC - linear prediction (1968)
  • 1969: John Pierce letter "Whither Speech
    Recognition?"
  • Random tuning of parameters,
  • Lack of scientific rigor, no evaluation metrics
  • Need to rely on higher level knowledge

ASR: 1970s and 1980s
  • Hidden Markov Model, 1972
  • Independent application by Baker (CMU) and
    Jelinek/Bahl/Mercer lab (IBM), following work of
    Baum and colleagues at IDA
  • ARPA project, 1971-1976
  • 5-year speech understanding project: 1000-word
    vocabulary, continuous speech, multi-speaker
  • Only 1 CMU system achieved the goal
  • 1980s
  • Annual ARPA Bakeoffs
  • Large corpus collection
  • Resource Management
  • Wall Street Journal

State of the Art
  • ASR
  • speaker-independent, continuous, no noise,
    world's best research systems
  • Human-human speech: 13-20% Word Error Rate (WER)
  • Human-machine speech: 3-5% WER
  • TTS (demo next week)

LVCSR Overview
  • Large Vocabulary Continuous (Speaker-Independent)
    Speech Recognition
  • Build a statistical model of the speech-to-words
    process
  • Collect lots of speech and transcribe all the
    words
  • Train the model on the labeled speech
  • Paradigm: Supervised Machine Learning + Search

Unit Selection TTS Overview
  • Collect lots of speech (5-50 hours) from one
    speaker, transcribe very carefully, all the
    syllables and phones and whatnot
  • To synthesize a sentence, patch together
    syllables and phones from the training data.
  • Paradigm: search

Requirements and Grading
  • Readings
  • Required Text
  • Selected chapters on web from
  • Jurafsky & Martin, 2000. Speech and Language
    Processing.
  • Taylor, Paul. 2007. Text-to-Speech Synthesis.
  • Grading
  • Homework: 75% (3 homeworks, 25% each)
  • Participation: 25%
  • You may work in groups

Overview of the course
  • http//

Introduction to Probability
  • Experiment (trial)
  • Repeatable procedure with well-defined possible
    outcomes
  • Sample Space (S)
  • the set of all possible outcomes
  • finite or infinite
  • Example
  • coin toss experiment
  • possible outcomes: S = {heads, tails}
  • Example
  • die toss experiment
  • possible outcomes: S = {1,2,3,4,5,6}

Slides from Sandiway Fong
Introduction to Probability
  • Definition of sample space depends on what we are
    asking
  • Sample Space (S): the set of all possible
    outcomes
  • Example
  • die toss experiment for whether the number is
    even or odd
  • possible outcomes: {even, odd}
  • not {1,2,3,4,5,6}

More definitions
  • Events
  • an event is any subset of outcomes from the
    sample space
  • Example
  • die toss experiment
  • let A represent the event such that the outcome
    of the die toss experiment is divisible by 3
  • A = {3,6}
  • A is a subset of the sample space S
  • Example
  • Draw a card from a deck
  • suppose sample space S = {heart, spade, club, diamond}
    (four suits)
  • let A represent the event of drawing a heart
  • let B represent the event of drawing a red card
  • A = {heart}
  • B = {heart, diamond}

Introduction to Probability
  • Some definitions
  • Counting
  • suppose operation oi can be performed in ni ways,
  • a sequence of k operations o1 o2 ... ok
  • can be performed in n1 × n2 × ... × nk ways
  • Example
  • die toss experiment, 6 possible outcomes
  • two dice are thrown at the same time
  • number of sample points in sample space = 6 × 6 = 36

Definition of Probability
  • The probability law assigns to an event a
    nonnegative number
  • Called P(A)
  • Also called the probability of A
  • That encodes our knowledge or belief about the
    collective likelihood of all the elements of A
  • Probability law must satisfy certain properties

Probability Axioms
  • Nonnegativity
  • P(A) ≥ 0, for every event A
  • Additivity
  • If A and B are two disjoint events, then the
    probability of their union satisfies
  • P(A ∪ B) = P(A) + P(B)
  • Normalization
  • The probability of the entire sample space S is
    equal to 1, i.e. P(S) = 1.

An example
  • An experiment involving a single coin toss
  • There are two possible outcomes, H and T
  • Sample space S is {H, T}
  • If coin is fair, should assign equal
    probabilities to 2 outcomes
  • Since they have to sum to 1
  • P({H}) = 0.5
  • P({T}) = 0.5
  • P({H,T}) = P({H}) + P({T}) = 1.0

Another example
  • Experiment involving 3 coin tosses
  • Outcome is a 3-long string of H or T
  • Assume each outcome is equiprobable
  • Uniform distribution
  • What is the probability of the event that exactly 2
    heads occur?
  • A = {HHT, HTH, THH}
  • P(A) = 1/8 + 1/8 + 1/8 = 3/8
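The three-toss example can be checked by brute-force enumeration. This is a minimal sketch, assuming the uniform distribution stated on the slide; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Sample space: every 3-long string over {H, T}; each outcome is equiprobable.
outcomes = ["".join(t) for t in product("HT", repeat=3)]
p_each = Fraction(1, len(outcomes))          # 1/8 per outcome

# Event: exactly two heads occur.
event = [o for o in outcomes if o.count("H") == 2]
p_two_heads = p_each * len(event)

print(event)          # the outcomes that make up the event
print(p_two_heads)    # 3/8
```

Using Fraction keeps the arithmetic exact, so the result can be compared directly with the 3/8 on the slide.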

Probability definitions
  • In summary
  • Probability of drawing a spade from 52
    well-shuffled playing cards: 13/52 = 1/4 = .25

Probabilities of two events
  • If two events A and B are independent
  • Then
  • P(A and B) = P(A) x P(B)
  • If flip a fair coin twice
  • What is the probability that they are both heads?
  • If draw a card from a deck, then put it back,
    draw a card from the deck again
  • What is the probability that both drawn cards are
  • A coin is flipped twice
  • What is the probability that it comes up heads
    both times?

How about non-uniform probabilities? An example
  • A biased coin,
  • twice as likely to come up tails as heads,
  • is tossed twice
  • What is the probability that at least one head
    occurs?
  • Sample space: {hh, ht, th, tt} (h = heads, t =
    tails)
  • Sample points/probability for the event
  • ht: 1/3 x 2/3 = 2/9    hh: 1/3 x 1/3 = 1/9
  • th: 2/3 x 1/3 = 2/9    tt: 2/3 x 2/3 = 4/9
  • Answer: 5/9 ≈ 0.56 (sum of the weights for hh, ht,
    and th)
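The same enumeration works when the outcomes are not equiprobable: weight each sample point by the product of its per-toss probabilities and sum over the event. A small sketch, assuming the two tosses are independent as the slide's arithmetic implies:

```python
from fractions import Fraction
from itertools import product

# Tails is twice as likely as heads, so P(h) = 1/3 and P(t) = 2/3.
p = {"h": Fraction(1, 3), "t": Fraction(2, 3)}

# Weight of each two-toss sample point (tosses assumed independent).
weights = {a + b: p[a] * p[b] for a, b in product("ht", repeat=2)}

# Event: at least one head occurs.
p_at_least_one_head = sum(w for point, w in weights.items() if "h" in point)

print(weights)
print(p_at_least_one_head)   # 5/9
```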

Moving toward language
  • What's the probability of drawing a 2 from a deck
    of 52 cards with four 2s?
  • What's the probability of a random word (from a
    random dictionary page) being a verb?

Probability and part of speech tags
  • What's the probability of a random word (from a
    random dictionary page) being a verb?
  • How to compute each of these
  • All words: just count all the words in the
    dictionary
  • # of ways to get a verb: number of words which
    are verbs!
  • If a dictionary has 50,000 entries, and 10,000
    are verbs, P(V) is 10000/50000 = 1/5 = .20

Conditional Probability
  • A way to reason about the outcome of an
    experiment based on partial information
  • In a word guessing game the first letter for the
    word is a "t". What is the likelihood that the
    second letter is an "h"?
  • How likely is it that a person has a disease
    given that a medical test was negative?
  • A spot shows up on a radar screen. How likely is
    it that it corresponds to an aircraft?

More precisely
  • Given an experiment, a corresponding sample space
    S, and a probability law
  • Suppose we know that the outcome is within some
    given event B
  • We want to quantify the likelihood that the
    outcome also belongs to some other given event A.
  • We need a new probability law that gives us the
    conditional probability of A given B
  • P(A|B)

An intuition
  • A is "it's raining now".
  • P(A) in dry California is .01
  • B is "it was raining ten minutes ago"
  • P(A|B) means "what is the probability of it
    raining now if it was raining 10 minutes ago"
  • P(A|B) is probably way higher than P(A)
  • Perhaps P(A|B) is .10
  • Intuition: the knowledge about B should change
    our estimate of the probability of A.

Conditional probability
  • One of the following 30 items is chosen at random
  • What is P(X), the probability that it is an X?
  • What is P(X|red), the probability that it is an X
    given that it is red?

Conditional Probability
  • let A and B be events
  • p(B|A) = the probability of event B occurring
    given event A occurs
  • definition: p(B|A) = p(A ∩ B) / p(A)

Conditional probability
  • P(A|B) = P(A ∩ B)/P(B)
  • Or, writing the joint probability as P(A,B):
    P(A|B) = P(A,B)/P(B)

Note: P(A,B) = P(A|B) P(B). Also: P(A,B) = P(B,A)
  • What is P(A,B) if A and B are independent?
  • P(A,B) = P(A) P(B) iff A,B independent.
  • P(heads,tails) = P(heads) P(tails) = .5 × .5 = .25
  • Note: P(A|B) = P(A) iff A,B independent
  • Also: P(B|A) = P(B) iff A,B independent
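To make the definition and the independence test concrete, here is a sketch that computes conditional probabilities by counting over a standard 52-card deck, reusing the heart / red-card events from the earlier card example. The "ace" event and the helper names are illustrative additions, not from the slides.

```python
from fractions import Fraction

suits = ["heart", "spade", "club", "diamond"]
deck = [(suit, rank) for suit in suits for rank in range(1, 14)]  # 52 cards

def prob(event):
    """Probability of an event (a predicate on cards) under a uniform draw."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def cond_prob(a, b):
    """P(A|B) = P(A and B) / P(B)."""
    return prob(lambda card: a(card) and b(card)) / prob(b)

is_heart = lambda card: card[0] == "heart"
is_red = lambda card: card[0] in ("heart", "diamond")
is_ace = lambda card: card[1] == 1          # hypothetical extra event

print(prob(is_heart), cond_prob(is_heart, is_red))  # 1/4 vs 1/2: dependent
print(prob(is_ace), cond_prob(is_ace, is_red))      # 1/13 vs 1/13: independent
```

Knowing the card is red doubles the probability of a heart, but leaves the probability of an ace unchanged, which is exactly the "P(A|B) = P(A) iff independent" criterion.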

Bayes Theorem
  • Swap the conditioning
  • Sometimes easier to estimate one kind of
    dependence than the other

Deriving Bayes Rule
  • Probability
  • Conditional Probability
  • Independence
  • Bayes Rule
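The derivation outlined above can be written out in a few lines. This is the standard textbook derivation in the notation used earlier (P(A,B) for the joint probability), given here as a reconstruction rather than the slide's own rendering.

```latex
\begin{align*}
P(A \mid B) &= \frac{P(A, B)}{P(B)}
  && \text{definition of conditional probability} \\
P(B \mid A) &= \frac{P(A, B)}{P(A)}
  && \text{same definition, roles swapped} \\
P(A \mid B)\,P(B) &= P(A, B) = P(B \mid A)\,P(A)
  && \text{both equal the joint probability} \\
P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}
  && \text{Bayes' rule}
\end{align*}
```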

How many words?
  • I do uh main- mainly business data processing
  • Fragments
  • Filled pauses
  • Are "cat" and "cats" the same word?
  • Some terminology
  • Lemma: a set of lexical forms having the same
    stem, major part of speech, and rough word sense
  • "cat" and "cats": same lemma
  • Wordform: the full inflected surface form.
  • "cat" and "cats": different wordforms

How many words?
  • they picnicked by the pool then lay back on the
    grass and looked at the stars
  • 16 tokens
  • 14 types
  • SWBD
  • 20,000 wordform types,
  • 2.4 million wordform tokens
  • Brown et al (1992) large corpus
  • 583 million wordform tokens
  • 293,181 wordform types
  • Let N = number of tokens, V = vocabulary = number
    of types
  • General wisdom: V > O(sqrt(N))
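The token and type counts for the picnic sentence can be reproduced in a few lines. A minimal sketch, assuming whitespace tokenization and treating each distinct wordform as a type:

```python
sentence = ("they picnicked by the pool then lay back on the "
            "grass and looked at the stars")

tokens = sentence.split()   # every running word counts as a token
types = set(tokens)         # distinct wordforms

print(len(tokens))   # 16
print(len(types))    # 14 ("the" occurs three times)
```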

Language Modeling
  • We want to compute P(w1,w2,w3,w4,w5...wn), the
    probability of a sequence
  • Alternatively we want to compute
    P(w5|w1,w2,w3,w4), the probability of a word
    given some previous words
  • The model that computes P(W) or P(wn|w1,w2...wn-1)
    is called the language model.
  • A better term for this would be "The Grammar"
  • But "Language model" or LM is standard

Computing P(W)
  • How to compute this joint probability:
  • P(the,other,day,I,was,walking,along,and,saw,a,lizard)
  • Intuition: let's rely on the Chain Rule of
    Probability

The Chain Rule of Probability
  • Recall the definition of conditional
    probabilities: P(A|B) = P(A,B)/P(B)
  • Rewriting: P(A,B) = P(A|B) P(B)
  • More generally
  • P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
  • In general
  • P(x1,x2,x3,...,xn) = P(x1) P(x2|x1) P(x3|x1,x2) ...
    P(xn|x1,...,xn-1)

The Chain Rule applied to joint probability of
words in sentence
  • P("the big red dog was") =
  • P(the) P(big|the) P(red|the big) P(dog|the big
    red) P(was|the big red dog)

Very easy estimate
  • How to estimate
    P(the | its water is so transparent that)?
  • P(the | its water is so transparent that) =
    C(its water is so transparent that the)
    / C(its water is so transparent that)

  • There are a lot of possible sentences
  • We'll never be able to get enough data to compute
    the statistics for those long prefixes
  • P(lizard|the,other,day,I,was,walking,along,and,saw,a)
  • Or
  • P(the|its water is so transparent that)

Markov Assumption
  • Make the simplifying assumption
  • P(lizard|the,other,day,I,was,walking,along,and,saw,a)
    = P(lizard|a)
  • Or maybe
  • P(lizard|the,other,day,I,was,walking,along,and,saw,a)
    = P(lizard|saw,a)

Markov Assumption
  • So for each component in the product replace with
    the approximation (assuming a prefix of N)
  • Bigram version

Estimating bigram probabilities
  • The Maximum Likelihood Estimate
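In standard textbook notation, the approximation and the estimate named on these two slides are usually written as follows. This is a reconstruction under that assumption; w_i denotes the i-th word and c(.) a count in the training corpus.

```latex
\begin{align*}
% N-gram (Markov) approximation to each factor of the chain rule
P(w_n \mid w_1 \ldots w_{n-1}) &\approx P(w_n \mid w_{n-N+1} \ldots w_{n-1}) \\
% Bigram version (N = 2)
P(w_n \mid w_1 \ldots w_{n-1}) &\approx P(w_n \mid w_{n-1}) \\
% Maximum likelihood estimate of a bigram probability
P(w_i \mid w_{i-1}) &= \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
\end{align*}
```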

An example
  • <s> I am Sam </s>
  • <s> Sam I am </s>
  • <s> I do not like green eggs and ham </s>
  • This is the Maximum Likelihood Estimate, because
    it is the one which maximizes P(Training set | Model)
Maximum Likelihood Estimates
  • The maximum likelihood estimate of some parameter
    of a model M from a training set T
  • Is the estimate
  • that maximizes the likelihood of the training set
    T given the model M
  • Suppose the word "Chinese" occurs 400 times in a
    corpus of a million words (Brown corpus)
  • What is the probability that a random word from
    some other text will be "Chinese"?
  • MLE estimate is 400/1000000 = .0004
  • This may be a bad estimate for some other corpus
  • But it is the estimate that makes it most likely
    that "Chinese" will occur 400 times in a million
    word corpus.

More examples: Berkeley Restaurant Project
  • can you tell me about any good cantonese
    restaurants close by
  • mid priced thai food is what i'm looking for
  • tell me about chez panisse
  • can you give me a listing of the kinds of food
    that are available
  • i'm looking for a good place to eat breakfast
  • when is caffe venezia open during the day

Raw bigram counts
  • Out of 9222 sentences

Raw bigram probabilities
  • Normalize by unigrams
  • Result

Bigram estimates of sentence probabilities
  • P(<s> I want english food </s>) =
  • p(i|<s>) x p(want|I) x p(english|want)
    x p(food|english) x p(</s>|food)
  • = .25 x .33 x .0011 x 0.5 x 0.68
  • = .000031

What kinds of knowledge?
  • P(english|want) = .0011
  • P(chinese|want) = .0065
  • P(to|want) = .66
  • P(eat|to) = .28
  • P(food|to) = 0
  • P(want|spend) = 0
  • P(i|<s>) = .25

The Shannon Visualization Method
  • Generate random sentences
  • Choose a random bigram (<s>, w) according to its
    probability
  • Now choose a random bigram (w, x) according to
    its probability
  • And so on until we choose </s>
  • Then string the words together
  • <s> I
  • I want
  • want to
  • to eat
  • eat Chinese
  • Chinese food
  • food </s>
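The sampling procedure above can be sketched as follows. The Berkeley Restaurant counts are not part of this transcript, so the sketch assumes the small "I am Sam" corpus from the earlier example as a stand-in; the `generate` function and its `max_words` cap are illustrative choices, not from the slides.

```python
import random
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# successors[prev][word] = c(prev, word); sampling in proportion to these
# counts is the same as sampling from the MLE bigram distribution.
successors = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for prev, word in zip(words, words[1:]):
        successors[prev][word] += 1

def generate(rng, max_words=50):
    """Sample bigrams starting at <s> until </s> is drawn (or the cap is hit)."""
    word, out = "<s>", []
    for _ in range(max_words):
        nxt = successors[word]
        word = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate(random.Random(0)))
```

The cap on sentence length is only a safeguard: with this corpus the chain can loop ("I am Sam I am ...") before it reaches </s>.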

Shakespeare as corpus
  • N = 884,647 tokens, V = 29,066
  • Shakespeare produced 300,000 bigram types out of
    V^2 = 844 million possible bigrams: so, 99.96% of
    the possible bigrams were never seen (have zero
    entries in the table)
  • Quadrigrams are worse: what's coming out looks like
    Shakespeare because it is Shakespeare

The Wall Street Journal is not Shakespeare (no
offense)
  • We train parameters of our model on a training
    set.
  • How do we evaluate how well our model works?
  • We look at the model's performance on some new
    data
  • This is what happens in the real world; we want
    to know how our model performs on data we haven't
    seen
  • So we use a test set: a dataset which is different
    from our training set
  • Then we need an evaluation metric to tell us how
    well our model is doing on the test set.
  • One such metric is perplexity (to be introduced
    below)
Unknown words: Open versus closed vocabulary tasks
  • If we know all the words in advance
  • Vocabulary V is fixed
  • Closed vocabulary task
  • Often we don't know this
  • Out Of Vocabulary = OOV words
  • Open vocabulary task
  • Instead: create an unknown word token <UNK>
  • Training of <UNK> probabilities
  • Create a fixed lexicon L of size V
  • At text normalization phase, any training word
    not in L changed to <UNK>
  • Now we train its probabilities like a normal word
  • At decoding time
  • If text input: use <UNK> probabilities for any word
    not in training
Evaluating N-gram models
  • Best evaluation for an N-gram
  • Put model A in a speech recognizer
  • Run recognition, get word error rate (WER) for A
  • Put model B in a speech recognizer, get word error
    rate for B
  • Compare WER for A and B
  • In-vivo evaluation

Difficulty of in-vivo evaluation of N-gram models
  • In-vivo evaluation
  • This is really time-consuming
  • Can take days to run an experiment
  • So
  • As a temporary solution, in order to run
    experiments
  • To evaluate N-grams we often use an approximation
    called perplexity
  • But perplexity is a poor approximation unless the
    test data looks just like the training data
  • So it is generally only useful in pilot experiments
    (generally not sufficient to publish)
  • But it is helpful to think about.

Perplexity
  • Perplexity is the inverse probability of the test
    set (assigned by the language model), normalized by
    the number of words
  • Chain rule
  • For bigrams
  • Minimizing perplexity is the same as maximizing
    probability
  • The best language model is one that best predicts
    an unseen test set
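In standard notation, with W = w_1 w_2 ... w_N the test set of N words, the definition and its chain-rule and bigram expansions are usually written as below. This is the common textbook form, given as a reconstruction of the formulas the bullets refer to.

```latex
\begin{align*}
PP(W) &= P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
       = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} \\
% expanding the joint probability with the chain rule
PP(W) &= \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}} \\
% under the bigram approximation
PP(W) &= \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
\end{align*}
```

Because the probability appears in the denominator, a model that assigns higher probability to the test set has lower perplexity.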

A totally different perplexity intuition
  • How hard is the task of recognizing digits
    "0,1,2,3,4,5,6,7,8,9,oh"? Easy: perplexity 11 (or
    if we ignore "oh", perplexity 10)
  • How hard is recognizing (30,000) names at
    Microsoft? Hard: perplexity 30,000
  • If a system has to recognize
  • Operator (1 in 4)
  • Sales (1 in 4)
  • Technical Support (1 in 4)
  • 30,000 names (1 in 120,000 each)
  • Perplexity is about 53
  • Perplexity is weighted equivalent branching
    factor
Slide from Josh Goodman
Perplexity as branching factor
Lower perplexity = better model
  • Training 38 million words, test 1.5 million
    words, WSJ

Lesson 1: the perils of overfitting
  • N-grams only work well for word prediction if the
    test corpus looks like the training corpus
  • In real life, it often doesn't
  • We need to train robust models, adapt to test
    set, etc

Lesson 2: zeros or not?
  • Zipf's Law
  • A small number of events occur with high
    frequency
  • A large number of events occur with low frequency
  • You can quickly collect statistics on the high
    frequency events
  • You might have to wait an arbitrarily long time
    to get valid statistics on low frequency events
  • Result
  • Our estimates are sparse! No counts at all for
    the vast bulk of things we want to estimate!
  • Some of the zeros in the table are really zeros.
    But others are simply low frequency events you
    haven't seen yet. After all, ANYTHING CAN
    HAPPEN!
  • How to address?
  • Answer
  • Estimate the likelihood of unseen N-grams!

Slide adapted from Bonnie Dorr and Julia
Hirschberg
Smoothing is like Robin Hood: steal from the rich
and give to the poor (in probability mass)
Slide from Dan Klein
Laplace smoothing
  • Also called add-one smoothing
  • Just add one to all the counts!
  • Very simple
  • MLE estimate
  • Laplace estimate
  • Reconstructed counts

Laplace smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Note big change to counts
  • C(want to) went from 608 to 238!
  • P(to|want) from .66 to .26!
  • Discount d = c*/c
  • d for "chinese food" = .10!!! A 10x reduction
  • So in general, Laplace is a blunt instrument
  • Could use more fine-grained method (add-k)
  • But Laplace smoothing is not used for N-grams, as we
    have much better methods
  • Despite its flaws, Laplace (add-k) is however
    still used to smooth other probabilistic models
    in NLP, especially
  • For pilot studies
  • in domains where the number of zeros isn't so
    huge
Better discounting algorithms
  • Intuition used by many smoothing algorithms
  • Good-Turing
  • Kneser-Ney
  • Witten-Bell
  • Is to use the count of things we've seen once to
    help estimate the count of things we've never seen

Good-Turing: Josh Goodman intuition
  • Imagine you are fishing
  • There are 8 species: carp, perch, whitefish,
    trout, salmon, eel, catfish, bass
  • You have caught
  • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon,
    1 eel = 18 fish
  • How likely is it that the next species is new (i.e.
    catfish or bass)?
  • 3/18
  • Assuming so, how likely is it that the next species
    is trout?
  • Must be less than 1/18

Slide adapted from Josh Goodman
Good-Turing Intuition
  • Notation: Nx is the frequency-of-frequency-x
  • So N10 = 1, N1 = 3, etc
  • To estimate total number of unseen species
  • Use number of species (words) we've seen once
  • c0* = c1, p0 = N1/N = 3/18
  • All other estimates are adjusted (down) to give
    probabilities for unseen

Slide from Josh Goodman
c*(eel) = c*(1) = (1+1) × N2/N1 = 2 × 1/3 = 2/3, so P*(eel) = (2/3)/18 = 1/27
Slide from Josh Goodman
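The fishing numbers can be reproduced with the basic Good-Turing re-estimate c* = (c+1) N_{c+1} / N_c. A minimal sketch with exact fractions; the function and variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

catch = {"carp": 10, "perch": 3, "whitefish": 2,
         "trout": 1, "salmon": 1, "eel": 1}

N = sum(catch.values())                  # 18 fish in total
freq_of_freq = Counter(catch.values())   # N_c: number of species seen c times

# Probability mass reserved for unseen species (catfish, bass): N_1 / N.
p_unseen = Fraction(freq_of_freq[1], N)

def c_star(c):
    """Basic Good-Turing adjusted count: (c + 1) * N_{c+1} / N_c."""
    return Fraction((c + 1) * freq_of_freq[c + 1], freq_of_freq[c])

p_trout = c_star(1) / N                  # re-estimated probability of a singleton

print(p_unseen)     # 1/6, i.e. 3/18
print(c_star(1))    # 2/3
print(p_trout)      # 1/27, less than the MLE 1/18
print(c_star(3))    # 0, because N_4 = 0
```

The last line shows why the N_c values themselves need smoothing in practice: with no species seen four times, the unsmoothed formula drives the adjusted count for perch to zero.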
Bigram frequencies of frequencies and GT
  • In practice, assume large counts (c > k for some k)
    are reliable
  • That complicates the computation of c*
  • Also we assume singleton counts c = 1 are
    unreliable, so treat N-grams with count of 1 as
    if they were count = 0
  • Also, need the Nk to be non-zero, so we need to
    smooth (interpolate) the Nk counts before
    computing c* from them
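For reference, the basic Good-Turing count and the version corrected for trusting large counts (c > k) are usually written as below. The second line is the form given by Katz (1987) as presented in standard textbooks; it is included here as a reconstruction, with N_c the number of N-grams seen c times.

```latex
\begin{align*}
% basic Good-Turing re-estimate
c^{*} &= (c + 1)\,\frac{N_{c+1}}{N_c} \\
% corrected form when counts c > k are taken as reliable (c^* = c for c > k)
c^{*} &= \frac{(c + 1)\,\dfrac{N_{c+1}}{N_c} \;-\; c\,\dfrac{(k + 1)\,N_{k+1}}{N_1}}
              {1 \;-\; \dfrac{(k + 1)\,N_{k+1}}{N_1}},
  \qquad 1 \le c \le k
\end{align*}
```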

Backoff and Interpolation
  • Another really useful source of knowledge
  • If we are estimating
  • trigram p(z|x,y)
  • but c(xyz) is zero
  • Use info from
  • Bigram p(z|y)
  • Or even
  • Unigram p(z)
  • How to combine the trigram/bigram/unigram info?

Backoff versus interpolation
  • Backoff: use trigram if you have it, otherwise
    bigram, otherwise unigram
  • Interpolation: mix all three

  • Simple interpolation
  • Lambdas conditional on context
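In standard notation the two variants are usually written as follows: a fixed set of weights that sum to one, or weights that depend on the preceding context. This is the common textbook form, given as a reconstruction.

```latex
\begin{align*}
% simple linear interpolation
\hat{P}(w_n \mid w_{n-2} w_{n-1}) &= \lambda_1 P(w_n \mid w_{n-2} w_{n-1})
   + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n),
   \qquad \sum_i \lambda_i = 1 \\
% lambdas conditional on context
\hat{P}(w_n \mid w_{n-2} w_{n-1}) &= \lambda_1(w_{n-2} w_{n-1})\,P(w_n \mid w_{n-2} w_{n-1})
   + \lambda_2(w_{n-2} w_{n-1})\,P(w_n \mid w_{n-1})
   + \lambda_3(w_{n-2} w_{n-1})\,P(w_n)
\end{align*}
```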

How to set the lambdas?
  • Use a held-out corpus
  • Choose lambdas which maximize the probability of
    some held-out data
  • I.e. fix the N-gram probabilities
  • Then search for lambda values
  • That when plugged into previous equation
  • Give largest probability for held-out set
  • Can use EM to do this search
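The held-out procedure can be sketched on a toy scale. To keep it short, this sketch interpolates only bigram and unigram estimates with a single weight, and it uses a plain grid search in place of EM; the training corpus is the "I am Sam" example and the held-out sentences are hypothetical.

```python
import math
from collections import Counter

train = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
# Hypothetical held-out data, built only from words seen in training.
held_out = ["<s> I am ham </s>", "<s> Sam I do like eggs </s>"]

# Step 1: fix the N-gram probabilities using the training data only.
unigrams, bigrams = Counter(), Counter()
for line in train:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
total = sum(unigrams.values())

def p_interp(word, prev, lam):
    """lam * P_MLE(word | prev) + (1 - lam) * P_MLE(word)."""
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[word] / total
    return lam * p_bi + (1 - lam) * p_uni

def log_likelihood(lines, lam):
    """Log probability of the given sentences under the interpolated model."""
    ll = 0.0
    for line in lines:
        words = line.split()
        for prev, word in zip(words, words[1:]):
            ll += math.log(p_interp(word, prev, lam))
    return ll

# Step 2: search for the lambda that maximizes held-out probability.
grid = [i / 10 for i in range(1, 10)]    # 0.1, 0.2, ..., 0.9
best_lam = max(grid, key=lambda lam: log_likelihood(held_out, lam))
print(best_lam, log_likelihood(held_out, best_lam))
```

Tuning lambda on the training data itself would always favor the highest-order model, which is why a separate held-out set is used.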

Katz Backoff
Why discounts P* and alpha?
  • MLE probabilities sum to 1
  • So if we used MLE probabilities but backed off to
    lower order model when MLE prob is zero
  • We would be adding extra probability mass
  • And total probability would be greater than 1

GT smoothed bigram probs
Intuition of backoff + discounting
  • How much probability to assign to all the zero
    trigrams?
  • Use GT or other discounting algorithm to tell us
  • How to divide that probability mass among
    different contexts?
  • Use the N-1 gram estimates to tell us
  • What do we do for the unigram words not seen in
    training?
  • Out Of Vocabulary = OOV words

OOV words: <UNK> word
  • Out Of Vocabulary = OOV words
  • We don't use GT smoothing for these
  • Because GT assumes we know the number of unseen
    events
  • Instead: create an unknown word token <UNK>
  • Training of <UNK> probabilities
  • Create a fixed lexicon L of size V
  • At text normalization phase, any training word
    not in L changed to <UNK>
  • Now we train its probabilities like a normal word
  • At decoding time
  • If text input: use <UNK> probabilities for any word
    not in training

Practical Issues
  • We do everything in log space
  • Avoid underflow
  • (also adding is faster than multiplying)
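A short sketch of why log space matters, using the bigram probabilities quoted earlier for "I want english food" (with P(i|<s>) = .25 as listed under "What kinds of knowledge?"). The 100-factor sequence is a made-up illustration of underflow.

```python
import math

# Bigram probabilities for "<s> I want english food </s>".
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

product = math.prod(probs)                       # direct multiplication
log_prob = sum(math.log(p) for p in probs)       # add logs instead

print(product)               # about 3.1e-05
print(math.exp(log_prob))    # same value, recovered from log space

# Hypothetical long sequence: 100 factors of 1e-5 underflow to 0.0 ...
tiny = [1e-5] * 100
print(math.prod(tiny))                    # 0.0
# ... but the sum of logs is an ordinary float.
print(sum(math.log(p) for p in tiny))     # about -1151.3
```

Comparing two hypotheses only requires comparing their log probabilities, so the values never need to be converted back.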

ARPA format
Language Modeling Toolkits
  • CMU-Cambridge LM Toolkit

Google N-Gram Release
  • serve as the incoming 92
  • serve as the incubator 99
  • serve as the independent 794
  • serve as the index 223
  • serve as the indication 72
  • serve as the indicator 120
  • serve as the indicators 45
  • serve as the indispensable 111
  • serve as the indispensible 40
  • serve as the individual 234

Advanced LM stuff
  • Current best smoothing algorithm
  • Kneser-Ney smoothing
  • Other stuff
  • Variable-length n-grams
  • Class-based n-grams
  • Clustering
  • Hand-built classes
  • Cache LMs
  • Topic-based LMs
  • Sentence mixture models
  • Skipping LMs
  • Parser-based LMs

  • LM
  • N-grams
  • Discounting Good-Turing
  • Katz backoff with Good-Turing discounting
  • Interpolation
  • Unknown words
  • Evaluation
  • Entropy, Entropy Rate, Cross Entropy
  • Perplexity
  • Advanced LM algorithms