Title: Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005
1. Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005
- Christina Bennett
- Language Technologies Institute
- Carnegie Mellon University
- Student Research Seminar
- September 23, 2005
2. What is corpus-based speech synthesis?
3. Need for Speech Synthesis Evaluation
Motivation
- Determine effectiveness of our improvements
- Closer comparison of various corpus-based techniques
- Learn about users' preferences
- Healthy competition promotes progress and brings attention to the field
4. Blizzard Challenge Goals
Motivation
- Compare methods across systems
- Remove effects of different data by requiring the same data to be used
- Establish a standard for repeatable evaluations in the field
- My goal: bring the need for improved speech synthesis evaluation to the forefront of the community (positioning CMU as a leader in this regard)
5. Blizzard Challenge Overview
Challenge
- Released first voices and solicited participation in 2004
- Additional voices and test sentences released Jan. 2005
- 1-2 weeks allowed to build voices and synthesize sentences
- 1000 samples from each system
- (50 sentences x 5 tests x 4 voices)
6. Evaluation Methods
Challenge
- Mean Opinion Score (MOS)
- Evaluate a sample on a numerical scale
- Modified Rhyme Test (MRT)
- Intelligibility test with the tested word inside a carrier phrase
- Semantically Unpredictable Sentences (SUS)
- Intelligibility test preventing listeners from using knowledge to predict words
7. Challenge setup: Tests
Challenge
- 5 tests from 5 genres
- 3 MOS tests (1 to 5 scale)
- News, prose, conversation
- 2 "type what you hear" tests
- MRT: "Now we will say ___ again"
- SUS: det-adj-noun-verb-det-adj-noun
- 50 sentences collected from each system, 20 selected for use in testing
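The SUS pattern above (det-adj-noun-verb-det-adj-noun) can be sketched as a tiny generator. The word lists here are hypothetical placeholders, not the actual Blizzard test materials:

```python
import random

# Hypothetical word lists; the real SUS materials used different,
# deliberately unusual vocabulary.
DETS = ["the", "a"]
ADJS = ["green", "tall", "quiet", "round"]
NOUNS = ["table", "river", "cloud", "spoon"]
VERBS = ["eats", "paints", "lifts", "hears"]

def make_sus(rng=random):
    """Build one semantically unpredictable sentence following the
    det-adj-noun-verb-det-adj-noun pattern used in the challenge."""
    return " ".join([rng.choice(DETS), rng.choice(ADJS), rng.choice(NOUNS),
                     rng.choice(VERBS), rng.choice(DETS), rng.choice(ADJS),
                     rng.choice(NOUNS)])

print(make_sus())  # e.g. "the round cloud lifts a tall spoon"
```

Because every slot is filled independently, listeners cannot use semantic context to guess a word they did not hear, which is exactly what makes SUS a pure intelligibility test.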
8. Challenge setup: Systems
Challenge
- 6 systems (random IDs A-F)
- CMU
- Delaware
- Edinburgh (UK)
- IBM
- MIT
- Nitech (Japan)
- Plus 1: Team Recording Booth (ID X)
- Natural examples from the 4 voice talents
9. Challenge setup: Voices
Challenge
- CMU ARCTIC databases
- American English: 2 male, 2 female
- 2 from initial release
- bdl (m)
- slt (f)
- 2 new DBs released for quick build
- rms (m)
- clb (f)
10. Challenge setup: Listeners
Challenge
- Three listener groups
- S: speech synthesis experts (50)
- 10 requested from each participating site
- V: volunteers (60 completed, 97 registered)
- Anyone online
- U: native US English speaking undergraduates (58 completed, 67 registered)
- Solicited and paid for participation
- (counts as of 4/14/05)
11. Challenge setup: Interface
Challenge
- Entirely online
- http://www.speech.cs.cmu.edu/blizzard/register-R.html
- http://www.speech.cs.cmu.edu/blizzard/login.html
- Register/login with email address
- Keeps track of progress through tests
- Can stop and return to tests later
- Feedback questionnaire at end of tests
12. Fortunately, Team X is the clear winner
Results
(MOS: 1-5 scale, higher is better; type-in: word error rate %, lower is better)

Listener type S          Listener type V          Listener type U
MOS          type-in     MOS          type-in     MOS          type-in
X - 4.76     X - 8.5     X - 4.41     X - 10.3    X - 4.58     X - 7.3
D - 3.19     D - 14.7    D - 3.02     D - 17.1    D - 3.06     D - 16.3
E - 3.11     B - 15.0    E - 2.83     A - 19.7    E - 2.83     A - 19.3
C - 2.91     A - 17.4    B - 2.66     B - 20.3    B - 2.67     B - 19.6
B - 2.88     E - 20.6    C - 2.48     E - 25.0    C - 2.42     E - 21.7
F - 2.15     C - 22.5    F - 2.07     C - 25.6    A - 2.00     C - 22.8
A - 2.07     F - 32.7    A - 1.98     F - 41.8    F - 1.98     F - 35.2
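The type-in scores in these tables are word error rates. A minimal sketch of the standard WER computation, a word-level edit-distance dynamic program (the challenge's actual scoring pipeline may have differed in its response cleanup):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) as a
    percentage of reference length, via a standard edit-distance DP."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One misheard word out of six in an MRT-style carrier phrase:
print(round(wer("now we will say coat again",
                "now we will say goat again"), 1))  # 16.7
```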
13. Team D consistently outperforms the others
Results
(same results table as slide 12)
14. Speech experts are optimistically biased
Results
(same results table as slide 12)
15. Speech experts are, in fact, experts
Results
(same results table as slide 12)
16. Voice results: Listener preference
Results
- slt is most liked, followed by rms
- Type S: slt - 43.48% of votes cast; rms - 36.96%
- Type V: slt - 50% of votes cast; rms - 28.26%
- Type U: slt - 47.27% of votes cast; rms - 34.55%
- But preference does not necessarily match test performance
17. Voice results: Test performance
Results
Female voices - slt

Listener type   all-sys MOS    natural MOS    all-sys type-in   natural type-in
S               rms - 3.233    bdl - 4.827    rms - 10.5        rms - 3.2
S               clb - 3.154    rms - 4.809    clb - 16.0        clb - 9.3
S               slt - 2.994    slt - 4.738    slt - 20.8        bdl - 9.4
S               bdl - 2.941    clb - 4.690    bdl - 22.7        slt - 11.3
V               clb - 2.946    rms - 4.568    rms - 14.0        rms - 3.8
V               rms - 2.894    clb - 4.404    clb - 17.1        bdl - 12.0
V               slt - 2.884    bdl - 4.382    slt - 25.2        slt - 12.0
V               bdl - 2.635    slt - 4.296    bdl - 29.3        clb - 13.1
U               clb - 2.987    slt - 4.611    clb - 11.9        slt - 5.9
U               slt - 2.930    clb - 4.587    slt - 17.5        clb - 5.9
U               rms - 2.873    rms - 4.584    rms - 17.6        rms - 8.8
U               bdl - 2.678    bdl - 4.551    bdl - 28.7        bdl - 9.1
18. Voice results: Test performance
Results
Female voices - clb
(same table as slide 17)
19. Voice results: Test performance
Results
Male voices - rms
(same table as slide 17)
20. Voice results: Test performance
Results
Male voices - bdl
(same table as slide 17)
21. Voice results: Natural examples
Results

Listener type S            Listener type V            Listener type U
MOS           type-in      MOS           type-in      MOS           type-in
bdl - 4.827   rms - 3.2    rms - 4.568   rms - 3.8    slt - 4.611   slt - 5.9
rms - 4.809   clb - 9.3    clb - 4.404   bdl - 12.0   clb - 4.587   clb - 5.9
slt - 4.738   bdl - 9.4    bdl - 4.382   slt - 12.0   rms - 4.584   rms - 8.8
clb - 4.690   slt - 11.3   slt - 4.296   clb - 13.1   bdl - 4.551   bdl - 9.1

What makes natural rms different?
22. Voice results: By system
Results
- Only system B was consistent across listener types (slt best MOS, rms best WER)
- Most others showed group trends (with the exceptions of B above and F), i.e.:
- S: rms always best WER, often best MOS
- V: slt usually best MOS, clb usually best WER
- U: clb usually best MOS and always best WER
- Again, people clearly don't prefer the voices they most easily understand
23. Lessons learned: Listeners
Lessons
- Reasons to exclude listener data: incomplete test, failure to follow directions, inability to respond (type-in), unusable responses
- Type-in tests very hard to process automatically: homophones, misspellings/typos, dialectal differences, smart listeners
- Group differences: V most variable, U most controlled, S least problematic but not representative
24. Lessons learned: Test design
Lessons
- Feedback on the tests:
- MOS: give examples to calibrate the scale (ordering schema); use multiple scales (lay-people?)
- Type-in: warn about SUS; SUS hard to remember; words too unusual/hard to spell
- Uncontrollable user test setup
- Pros and cons to having natural examples in the mix:
- Analyzing user response (+), differences in delivery style (-), availability of voice talent (?)
25. Goals Revisited
Lessons
- One methodology clearly outshone the rest
- All systems used the same data, allowing for actual comparison of systems
- A standard for repeatable evaluations in the field was established
- My goal: brought attention to the need for better speech synthesis evaluation (while positioning CMU as the experts)
26. For the Future
Future
- (Bi-)Annual Blizzard Challenge
- Introduced at Interspeech 2005 special session
- Improve design of tests for easier post-evaluation analysis
- Encourage more sites to submit their systems!
- More data resources (problematic for the commercial entities)
- Expand the types of systems accepted (and therefore test types), e.g. voice conversion
27. Lessons learned: Listeners (I)
Lessons
- Reasons for excluding listener data:
- Incomplete test
- Failure to follow directions
- Inability to respond to type-in tests (e.g. non-native speakers)
- Unusable responses (8)
- No effort ("don't know")
- Inappropriate response (wrong sentence)
- Extremely suspicious scoring (e.g. natural examples scored relatively very low)
28. Lessons learned: Listeners (II)
Lessons
- Type-in tests very hard to process automatically!
- Homophones (dug / Doug, cereal / serial)
- Misspellings / typos (comacozee, kamakazi, kamakasy, ...)
- Dialectal differences ("been": bean or Ben?)
- Smart listeners!
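One common way to tame these issues is to normalize responses before scoring. A minimal sketch, using a hypothetical hand-built canonicalization table seeded from the examples above (the actual Blizzard cleanup involved manual inspection):

```python
import re

# Hypothetical homophone/misspelling table; a real one would be much
# larger and built by hand while inspecting listener responses.
CANONICAL = {
    "doug": "dug",
    "serial": "cereal",
    "kamakazi": "kamikaze",
    "kamakasy": "kamikaze",
    "comacozee": "kamikaze",
}

def normalize(response):
    """Lowercase, strip punctuation, and map known homophones and
    misspellings to one canonical spelling before WER scoring."""
    words = re.findall(r"[a-z']+", response.lower())
    return [CANONICAL.get(w, w) for w in words]

print(normalize("Now we will say Doug again."))
# ['now', 'we', 'will', 'say', 'dug', 'again']
```

After normalization, a listener who typed "Doug" for the stimulus word "dug" is no longer penalized for an indistinguishable homophone, which is the fair outcome for an intelligibility test.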
29. Lessons learned: Listeners (III)
Lessons
- Group V most variable
- Least motivation for the task (fewest completions)
- Requires close analysis to determine seriousness
- Non-native speakers difficult to identify and exclude
- Group U most controlled
- Motivated by payment, not science
- Group S least problematic, but maybe not representative
30. Lessons learned: Test design (I)
Lessons
- Most feedback very good, but:
- MOS scale:
- Give examples to calibrate 1 to 5
- An ordering schema will counterbalance effects of a user-defined scale
- Use different scales for various qualities such as naturalness, intelligibility, etc.
- Lay-people don't understand these terms, and they are not easily explained
- Intelligibility is tested directly via the type-in tests
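Another standard way to counterbalance listener-specific use of a 1-to-5 scale, complementary to the calibration examples suggested above, is to z-score each listener's ratings. This is an illustrative sketch, not a method the challenge itself used:

```python
from statistics import mean, pstdev

def znorm_by_listener(scores):
    """scores: {listener: {system: raw MOS}}. Returns the same shape
    with each listener's ratings shifted and scaled to zero mean and
    unit variance, removing individual differences in how the
    numerical scale is used (harsh vs. generous raters)."""
    out = {}
    for listener, ratings in scores.items():
        vals = list(ratings.values())
        mu, sd = mean(vals), pstdev(vals)
        # A listener who gave every system the same score carries no
        # ranking information; map them to all zeros.
        out[listener] = {sys: (v - mu) / sd if sd else 0.0
                         for sys, v in ratings.items()}
    return out

# A harsh rater (p1) and a generous rater (p2) who agree on the ranking:
raw = {"p1": {"A": 4, "B": 2}, "p2": {"A": 5, "B": 4}}
print(znorm_by_listener(raw))  # both now give A = 1.0, B = -1.0
```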
31. Lessons learned: Test design (II)
Lessons
- Type-in tests:
- Warn listeners about SUS!
- Hard to remember nonsensical phrases (i.e. sentences too long)
- Words too unusual and/or hard to spell
- Together these add an unwanted memory/spelling dimension to the task
32. Lessons learned: Test design (III)
Lessons
- Uncontrollable user test setup
- One particular audio player/browser combination forced the browser to advance to a new page for every sound file
- Pros and cons to having natural examples in the mix:
- Valuable for analyzing user responses (+)
- Difficult to evaluate due to clear effects of different delivery styles (-)
- Availability of voice talent for future evaluations (?)
33. References
- A. Black and K. Tokuda, "The Blizzard Challenge 2005: Evaluating corpus-based speech synthesis on common datasets," to appear in Interspeech 2005, Lisbon, Portugal, 2005.
- C. Bennett, "Large Scale Evaluation of Corpus-based Synthesizers: Results and Lessons from the Blizzard Challenge 2005," to appear in Interspeech 2005, Lisbon, Portugal, 2005.
- C. Bennett et al., journal paper: more info to come