Title: Language modeling for speaker recognition
1. Language modeling for speaker recognition
Dan Gillick, January 20, 2004
2. Outline
- Author identification
- Trying to beat Doddington's idiolect modeling strategy (speaker recognition)
- My next project
3. Author ID (undergrad. thesis)
- Problem
  - train models for each of k authors
  - given some test text written by 1 of those authors, identify the correct author
- Variations
  - different kinds of models
  - different size test samples
  - different k
4. Character n-gram models
- What?
  - 27 tokens: a-z, <space>
  - some text generated from such a trigram model:
    - "you orthad gool of anythilly"
    - "uncand or prafecaustiont and to hing that put ably"
5. Character n-gram models
- Why?
  - very simple
  - data sparseness less troublesome than with word n-grams
  - supposed to be state-of-the-art, or at least close to it
  - (Khmelev, D. and Tweedie, F.J. "Using Markov Chains for the Identification of Writers." Literary and Linguistic Computing, 16(4): 299-307. 2001.)
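The character trigram approach above can be sketched as follows (a minimal illustration, not the thesis code; the function names and the add-alpha smoothing are my own choices):

```python
from collections import defaultdict
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # the 27 tokens: a-z plus space

def train_char_trigrams(text):
    """Count character trigrams over the 27-token alphabet."""
    text = "".join(c for c in text.lower() if c in ALPHABET)
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - 2):
        counts[text[i:i + 2]][text[i + 2]] += 1
    return counts

def log_prob(counts, text, alpha=1.0):
    """Add-alpha smoothed log-probability of a test string under one
    author's trigram model; pick the author with the highest score."""
    lp = 0.0
    for i in range(len(text) - 2):
        hist, nxt = text[i:i + 2], text[i + 2]
        total = sum(counts[hist].values())
        lp += math.log((counts[hist][nxt] + alpha) / (total + alpha * len(ALPHABET)))
    return lp
```

Classification is then just an argmax of `log_prob` over the k author models.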
6. Character n-grams: Setup
- task: pick correct author from 10 possible authors
- training data: 3 novels for each author
- test data: text from a held-out novel
- jack-knifing: 4 novels for each of 20 authors
7. Character n-grams: Results
- task: picking 1 author from 10 possible authors
- training data size: 3 novels
8. Character n-gram models
- Why does it work?
  - captures some word choice information
  - picks up word endings (-ing, -tion, -ly, etc.)
  - not hurt much by data sparseness issues
9. Key-list models
- Incentive
  - ought to be able to beat character n-grams
  - develop a new modeling method more focused on what differentiates between authors (characters and words are both useful for topic recognition, but that doesn't mean they are best for author recognition)
10. Key-list models
- Idea
  - convert the text stream into a stream of only authorship-relevant symbols (I called these lists of symbols key-lists)
  - each symbol is a regular expression to allow for broad definitions (/tion/ captures any nounification)
  - text not accounted for by the key-list is represented by <short>, <med>, or <long> markers
  - build n-gram models from these new streams
11. Key-list models
Sample key-list:

Regular Expression                        Description
(\w)(,)(\s)                               comma
(\w)(\.)(\s)                              period
(\b)(of|for|to|around|after)(\b)          common prepositions
(\b)((was|were) \w+ed)(\b)                passive voice
(\b)(is|was|will|are|were|am)(\b)         "is" conjugations
(\b)(\w+ing)(\b)                          ends in -ing
(\b)(\w+ly)(\b)                           adverb
(\b)(and|but|or|not|if|then|else)(\b)     logical
(\b)(as)(\b)                              as
(\b)(would|should|could)(\b)              modal verbs

- sample trigram: <comma> <short> <period>
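The text-to-symbol-stream conversion can be sketched like this (a cut-down, hypothetical key-list and my own gap thresholds, just to show the mechanics):

```python
import re

# A few entries in the spirit of the sample key-list; symbol names are mine.
KEY_LIST = [
    ("<comma>",   re.compile(r"\w,\s")),
    ("<period>",  re.compile(r"\w\.\s")),
    ("<prep>",    re.compile(r"\b(of|for|to|around|after)\b")),
    ("<ing>",     re.compile(r"\b\w+ing\b")),
    ("<adverb>",  re.compile(r"\b\w+ly\b")),
    ("<logical>", re.compile(r"\b(and|but|or|not|if|then|else)\b")),
]

def to_key_stream(text):
    """Replace each key-list match with its symbol; unmatched stretches become
    <short>/<med>/<long> gap markers. N-gram models are then built over the
    resulting symbol stream."""
    hits = sorted(
        (m.start(), m.end(), sym)
        for sym, rx in KEY_LIST
        for m in rx.finditer(text)
    )
    stream, pos = [], 0
    for start, end, sym in hits:
        if start < pos:  # overlapping match: keep the earlier one
            continue
        gap = start - pos
        if gap > 0:  # gap thresholds here are arbitrary illustrations
            stream.append("<short>" if gap < 10 else "<med>" if gap < 30 else "<long>")
        stream.append(sym)
        pos = end
    return stream
```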
12. Key-list models: Results
- task: picking 1 author from 10 possible authors
- training data size: 3 novels
13. Key-list models: Results
- Some other interesting results
  - key-lists with just punctuation (as well as <short>, <med>, <long>) performed almost as well as the best key-lists
  - all key-lists were outperformed by the best n-letter model when test data size < 10,000 chars., but all key-list models eventually surpassed the n-letter models
14. Key-list models
- Things I didn't do
  - vary amount of training data
  - spend a long time trying different key-lists
  - combine key-list results with each other or with the character results
  - a lot of other stuff
- The thesis is available on the web:
  http://www.dgillick.com/resource/thesis.pdf
15. Outline
- Author identification
- Trying to beat Doddington's idiolect modeling strategy (speaker recognition)
- My next project
16. G. Doddington's LM strategy
- create LMs with a limited vocabulary of the most commonly occurring 2000 bigrams
- to smooth out zeroes, boost each bigram prob. by 0.001
- score by calculating:
  - log P(test | target) - log P(test | bkg)
  - logprobs are joint probabilities:
    log P(A, B) = log P(A) + log P(B | A)
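The scoring rule above can be sketched as follows (my own guess at the mechanics, not Doddington's code; only the 0.001 boost and the target-minus-background log-likelihood ratio come from the slide):

```python
import math

BOOST = 0.001  # additive boost on each bigram prob., to smooth out zeroes

def score(test_bigrams, target_probs, bkg_probs):
    """log P(test | target) - log P(test | background), restricted to the
    limited bigram vocabulary. Positive scores favor the target speaker."""
    lp_target = lp_bkg = 0.0
    for bg in test_bigrams:
        lp_target += math.log(target_probs.get(bg, 0.0) + BOOST)
        lp_bkg += math.log(bkg_probs.get(bg, 0.0) + BOOST)
    return lp_target - lp_bkg
```

Out-of-vocabulary bigrams contribute log(BOOST) to both terms and so cancel.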
17. G. Doddington's LM: Setup
- Switchboard 1 data
  - collected in early '90s from all over the US
  - 2,400 (5 min.) conversations among 543 speakers
  - corpus divided into 6 splits and tested using jack-knifing through the splits
  - manual transcripts provided by Mississippi State
- Task
  - 8 conversation sides used as training data to build models for each target speaker
  - 1 conversation side used as test data
  - background model built from 3 splits of held-out data
  - jack-knifing allowed for almost 10,000 trials
18. G. Doddington's LM: Results
- Notes
  - these results are my own attempt to replicate the original experiments
  - SRI reported EER = 8.65 for this same experiment
19. Adapted bigram models
- Incentive
  - adapting target models from a much larger background model should yield better estimates of probabilities in the language models
- Specifically
  - use same 2000-bigram vocabulary
  - target probabilities are a mixture of training probabilities and background probabilities
  - mixture weight is 2:1, target data : bkg. data
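The adaptation step can be sketched as a simple linear interpolation (my own reading of the slide; the fixed lambda = 2/3 realizes the 2:1 target-to-background weighting):

```python
def adapt_probs(target_probs, bkg_probs, lam=2.0 / 3.0):
    """Adapted target model: mix the target's training estimates with the
    background model's estimates over the shared bigram vocabulary."""
    vocab = set(target_probs) | set(bkg_probs)
    return {
        bg: lam * target_probs.get(bg, 0.0) + (1 - lam) * bkg_probs.get(bg, 0.0)
        for bg in vocab
    }
```

Bigrams unseen in the target's 8 training sides still get a third of their background probability, rather than relying on the flat 0.001 boost alone.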
20. Adapted bigram models: Results
- Notes
  - nearly identical performance
  - combination of the 2 systems yields almost no improvement
  - why isn't the adapted version better?
21. Can anything improve on 8.68?
- Trigrams?
  - use same count threshold to make a list of the top 700 trigrams ("a lot of" and "I don't know" were among the most common)
- Character models?
  - worked well for authorship
  - included all character combinations (no limited vocabulary)
  - tried bigram and trigram models
22. Scores and combinations

System                                EER
adapt. word bigrams                   8.89
adapt. word trigrams                  11.88
adapt. char. bigrams                  13.73
adapt. char. trigrams                 17.92
adapted words                         8.46
adapted characters                    13.24
adapted words + adapted characters    7.89
GD bigrams                            8.68
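For reference, the equal error rate (EER) reported throughout can be computed from lists of target-trial and impostor-trial scores roughly like this (a simple threshold sweep, not the evaluation tool actually used):

```python
def eer(target_scores, impostor_scores):
    """Equal error rate: sweep a threshold over the observed scores and
    return the operating point where the false-reject rate (targets
    scoring below threshold) and the false-accept rate (impostors scoring
    at or above it) are closest to equal."""
    best, best_gap = None, float("inf")
    for thr in sorted(target_scores + impostor_scores):
        fr = sum(s < thr for s in target_scores) / len(target_scores)
        fa = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        if abs(fr - fa) < best_gap:
            best_gap, best = abs(fr - fa), (fr + fa) / 2
    return best
```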
23. Final Comparison
24. What about less training data?
- 1 conversation-side training
  - character models might provide more of an advantage with less data? not so:
  - GD: EER = 22.5
  - adapted character: EER = 30
  - adapted word: EER = 20
- maybe these character models pick up on the topic of that 1 conversation
- haven't tried any other size training data
25. Outline
- Author identification
- Trying to beat GD's result
- My next project
26. Key-lists for speaker recognition
- key-list n-grams picked up on phrasing (comma and period were valuable tokens)
- automatic transcripts don't have punctuation, but they do have pause and duration information
- use reg. exps. and duration info. to capture idiosyncratic speaker phrasing
- capture other speech information in key-lists? (energy, f0, etc.)
27. Acknowledgements
- Thanks to:
  - Anand and Luciana at SRI for trying to help me replicate their results
  - Barbara for providing advice
  - Barry and Kofi for helping with computers and stuff
  - George