Title: Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005
1. Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005
- Christina Bennett
- Language Technologies Institute
- Carnegie Mellon University
- Student Research Seminar
- September 23, 2005
2. What is corpus-based speech synthesis?
3. Need for Speech Synthesis Evaluation
Motivation
- Determine effectiveness of our improvements
- Closer comparison of various corpus-based techniques
- Learn about users' preferences
- Healthy competition promotes progress and brings attention to the field
4. Blizzard Challenge Goals
Motivation
- Compare methods across systems
- Remove effects of different data by requiring the same data to be used
- Establish a standard for repeatable evaluations in the field
- My goal: bring the need for improved speech synthesis evaluation to the forefront of the community (positioning CMU as a leader in this regard)
5. Blizzard Challenge Overview
Challenge
- Released first voices and solicited participation in 2004
- Additional voices and test sentences released Jan. 2005
- 1-2 weeks allowed to build voices and synthesize sentences
- 1000 samples from each system
- (50 sentences x 5 tests x 4 voices)
6. Evaluation Methods
Challenge
- Mean Opinion Score (MOS)
- Evaluate a sample on a numerical scale
- Modified Rhyme Test (MRT)
- Intelligibility test with the tested word inside a carrier phrase
- Semantically Unpredictable Sentences (SUS)
- Intelligibility test preventing listeners from using knowledge to predict words
7. Challenge setup: Tests
Challenge
- 5 tests from 5 genres
- 3 MOS tests (1 to 5 scale)
- News, prose, conversation
- 2 "type what you hear" tests
- MRT: "Now we will say ___ again"
- SUS: det-adj-noun-verb-det-adj-noun
- 50 sentences collected from each system, 20 selected for use in testing
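The SUS pattern above (det-adj-noun-verb-det-adj-noun) can be sketched as a tiny generator. The word lists here are hypothetical placeholders, not the actual Blizzard test materials:

```python
import random

# Hypothetical word lists; the real SUS materials used different,
# deliberately unusual vocabulary.
DETS = ["the", "a"]
ADJS = ["green", "tall", "quiet", "round"]
NOUNS = ["table", "river", "cloud", "spoon"]
VERBS = ["eats", "paints", "lifts", "hears"]

def make_sus(rng=random):
    """Build one semantically unpredictable sentence following the
    det-adj-noun-verb-det-adj-noun pattern used in the challenge."""
    return " ".join([rng.choice(DETS), rng.choice(ADJS), rng.choice(NOUNS),
                     rng.choice(VERBS), rng.choice(DETS), rng.choice(ADJS),
                     rng.choice(NOUNS)])

print(make_sus())  # e.g. "the round cloud lifts a tall spoon"
```

Because every slot is filled independently, listeners cannot use semantic context to guess a word they did not hear, which is exactly what makes SUS a pure intelligibility test.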
8. Challenge setup: Systems
Challenge
- 6 systems (random IDs A-F)
- CMU
- Delaware
- Edinburgh (UK)
- IBM
- MIT
- Nitech (Japan)
- Plus 1: Team Recording Booth (ID X)
- Natural examples from the 4 voice talents
9. Challenge setup: Voices
Challenge
- CMU ARCTIC databases
- American English: 2 male, 2 female
- 2 from initial release
- bdl (m)
- slt (f)
- 2 new DBs released for quick build
- rms (m)
- clb (f)
10. Challenge setup: Listeners
Challenge
- Three listener groups
- S: speech synthesis experts (50)
- 10 requested from each participating site
- V: volunteers (60 completed, 97 registered)
- Anyone online
- U: native US English speaking undergraduates (58 completed, 67 registered)
- Solicited and paid for participation
- (counts as of 4/14/05)
11. Challenge setup: Interface
Challenge
- Entirely online
- http://www.speech.cs.cmu.edu/blizzard/register-R.html
- http://www.speech.cs.cmu.edu/blizzard/login.html
- Register/login with email address
- Keeps track of progress through tests
- Can stop and return to tests later
- Feedback questionnaire at end of tests
12. Fortunately, Team X is the clear winner
Results
(MOS: 1-5 scale, higher is better; type-in: word error rate %, lower is better)

Listener type S          Listener type V          Listener type U
MOS          type-in     MOS          type-in     MOS          type-in
X - 4.76     X - 8.5     X - 4.41     X - 10.3    X - 4.58     X - 7.3
D - 3.19     D - 14.7    D - 3.02     D - 17.1    D - 3.06     D - 16.3
E - 3.11     B - 15.0    E - 2.83     A - 19.7    E - 2.83     A - 19.3
C - 2.91     A - 17.4    B - 2.66     B - 20.3    B - 2.67     B - 19.6
B - 2.88     E - 20.6    C - 2.48     E - 25.0    C - 2.42     E - 21.7
F - 2.15     C - 22.5    F - 2.07     C - 25.6    A - 2.00     C - 22.8
A - 2.07     F - 32.7    A - 1.98     F - 41.8    F - 1.98     F - 35.2
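The type-in scores in these tables are word error rates. A minimal sketch of the standard WER computation, a word-level edit-distance dynamic program (the challenge's actual scoring pipeline may have differed in its response cleanup):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) as a
    percentage of reference length, via a standard edit-distance DP."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One misheard word out of six in an MRT-style carrier phrase:
print(round(wer("now we will say coat again",
                "now we will say goat again"), 1))  # 16.7
```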
13. Team D consistently outperforms the others
Results
(same results table as slide 12)
14. Speech experts are optimistically biased
Results
(same results table as slide 12)
15. Speech experts are, in fact, experts
Results
(same results table as slide 12)
16. Voice results: Listener preference
Results
- slt is most liked, followed by rms
- Type S: slt - 43.48% of votes cast; rms - 36.96%
- Type V: slt - 50% of votes cast; rms - 28.26%
- Type U: slt - 47.27% of votes cast; rms - 34.55%
- But preference does not necessarily match test performance
17. Voice results: Test performance
Results
Female voices - slt

Listener type   all-sys MOS    natural MOS    all-sys type-in   natural type-in
S               rms - 3.233    bdl - 4.827    rms - 10.5        rms - 3.2
S               clb - 3.154    rms - 4.809    clb - 16.0        clb - 9.3
S               slt - 2.994    slt - 4.738    slt - 20.8        bdl - 9.4
S               bdl - 2.941    clb - 4.690    bdl - 22.7        slt - 11.3
V               clb - 2.946    rms - 4.568    rms - 14.0        rms - 3.8
V               rms - 2.894    clb - 4.404    clb - 17.1        bdl - 12.0
V               slt - 2.884    bdl - 4.382    slt - 25.2        slt - 12.0
V               bdl - 2.635    slt - 4.296    bdl - 29.3        clb - 13.1
U               clb - 2.987    slt - 4.611    clb - 11.9        slt - 5.9
U               slt - 2.930    clb - 4.587    slt - 17.5        clb - 5.9
U               rms - 2.873    rms - 4.584    rms - 17.6        rms - 8.8
U               bdl - 2.678    bdl - 4.551    bdl - 28.7        bdl - 9.1
18. Voice results: Test performance
Results
Female voices - clb
(same table as slide 17)
19. Voice results: Test performance
Results
Male voices - rms
(same table as slide 17)
20. Voice results: Test performance
Results
Male voices - bdl
(same table as slide 17)
21. Voice results: Natural examples
Results

Listener type S            Listener type V            Listener type U
MOS           type-in      MOS           type-in      MOS           type-in
bdl - 4.827   rms - 3.2    rms - 4.568   rms - 3.8    slt - 4.611   slt - 5.9
rms - 4.809   clb - 9.3    clb - 4.404   bdl - 12.0   clb - 4.587   clb - 5.9
slt - 4.738   bdl - 9.4    bdl - 4.382   slt - 12.0   rms - 4.584   rms - 8.8
clb - 4.690   slt - 11.3   slt - 4.296   clb - 13.1   bdl - 4.551   bdl - 9.1

What makes natural rms different?
22. Voice results: By system
Results
- Only system B was consistent across listener types (slt best MOS, rms best WER)
- Most others showed group trends (with the exceptions of B above and F), i.e.:
- S: rms always best WER, often best MOS
- V: slt usually best MOS, clb usually best WER
- U: clb usually best MOS and always best WER
- Again, people clearly don't prefer the voices they most easily understand
23. Lessons learned: Listeners
Lessons
- Reasons to exclude listener data: incomplete test, failure to follow directions, inability to respond (type-in), unusable responses
- Type-in tests very hard to process automatically: homophones, misspellings/typos, dialectal differences, smart listeners
- Group differences: V most variable, U most controlled, S least problematic but not representative
24. Lessons learned: Test design
Lessons
- Feedback on the tests:
- MOS: give examples to calibrate the scale (ordering schema); use multiple scales (lay-people?)
- Type-in: warn about SUS; SUS hard to remember; words too unusual/hard to spell
- Uncontrollable user test setup
- Pros and cons to having natural examples in the mix:
- Analyzing user response (+), differences in delivery style (-), availability of voice talent (?)
25. Goals Revisited
Lessons
- One methodology clearly outshone the rest
- All systems used the same data, allowing for actual comparison of systems
- A standard for repeatable evaluations in the field was established
- My goal: brought attention to the need for better speech synthesis evaluation (while positioning CMU as the experts)
26. For the Future
Future
- (Bi-)Annual Blizzard Challenge
- Introduced at Interspeech 2005 special session
- Improve design of tests for easier post-evaluation analysis
- Encourage more sites to submit their systems!
- More data resources (problematic for the commercial entities)
- Expand the types of systems accepted (and therefore test types), e.g. voice conversion
27. Lessons learned: Listeners (I)
Lessons
- Reasons for excluding listener data:
- Incomplete test
- Failure to follow directions
- Inability to respond to type-in tests (e.g. non-native speakers)
- Unusable responses (8)
- No effort ("don't know")
- Inappropriate response (wrong sentence)
- Extremely suspicious scoring (e.g. natural examples scored relatively very low)
28. Lessons learned: Listeners (II)
Lessons
- Type-in tests very hard to process automatically!
- Homophones (dug / Doug, cereal / serial)
- Misspellings / typos (comacozee, kamakazi, kamakasy, ...)
- Dialectal differences ("been": bean or Ben?)
- Smart listeners!
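One common way to tame these issues is to normalize responses before scoring. A minimal sketch, using a hypothetical hand-built canonicalization table seeded from the examples above (the actual Blizzard cleanup involved manual inspection):

```python
import re

# Hypothetical homophone/misspelling table; a real one would be much
# larger and built by hand while inspecting listener responses.
CANONICAL = {
    "doug": "dug",
    "serial": "cereal",
    "kamakazi": "kamikaze",
    "kamakasy": "kamikaze",
    "comacozee": "kamikaze",
}

def normalize(response):
    """Lowercase, strip punctuation, and map known homophones and
    misspellings to one canonical spelling before WER scoring."""
    words = re.findall(r"[a-z']+", response.lower())
    return [CANONICAL.get(w, w) for w in words]

print(normalize("Now we will say Doug again."))
# ['now', 'we', 'will', 'say', 'dug', 'again']
```

After normalization, a listener who typed "Doug" for the stimulus word "dug" is no longer penalized for an indistinguishable homophone, which is the fair outcome for an intelligibility test.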
29. Lessons learned: Listeners (III)
Lessons
- Group V most variable
- Least motivation for the task (fewest completions)
- Requires close analysis to determine seriousness
- Non-native speakers difficult to identify and exclude
- Group U most controlled
- Motivated by payment, not science
- Group S least problematic, but maybe not representative
30. Lessons learned: Test design (I)
Lessons
- Most feedback very good, but:
- MOS scale:
- Give examples to calibrate 1 to 5
- An ordering schema will counterbalance effects of a user-defined scale
- Use different scales for various qualities such as naturalness, intelligibility, etc.
- Lay-people don't understand these terms, and they are not easily explained
- Intelligibility is tested directly via the type-in tests
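Another standard way to counterbalance listener-specific use of a 1-to-5 scale, complementary to the calibration examples suggested above, is to z-score each listener's ratings. This is an illustrative sketch, not a method the challenge itself used:

```python
from statistics import mean, pstdev

def znorm_by_listener(scores):
    """scores: {listener: {system: raw MOS}}. Returns the same shape
    with each listener's ratings shifted and scaled to zero mean and
    unit variance, removing individual differences in how the
    numerical scale is used (harsh vs. generous raters)."""
    out = {}
    for listener, ratings in scores.items():
        vals = list(ratings.values())
        mu, sd = mean(vals), pstdev(vals)
        # A listener who gave every system the same score carries no
        # ranking information; map them to all zeros.
        out[listener] = {sys: (v - mu) / sd if sd else 0.0
                         for sys, v in ratings.items()}
    return out

# A harsh rater (p1) and a generous rater (p2) who agree on the ranking:
raw = {"p1": {"A": 4, "B": 2}, "p2": {"A": 5, "B": 4}}
print(znorm_by_listener(raw))  # both now give A = 1.0, B = -1.0
```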
31. Lessons learned: Test design (II)
Lessons
- Type-in tests:
- Warn listeners about SUS!
- Hard to remember nonsensical phrases (i.e. sentences too long)
- Words too unusual and/or hard to spell
- Together these add an unwanted memory/spelling dimension to the task
32. Lessons learned: Test design (III)
Lessons
- Uncontrollable user test setup
- One particular audio player/browser combination forced the browser to advance to a new page for every sound file
- Pros and cons to having natural examples in the mix:
- Valuable for analyzing user responses (+)
- Difficult to evaluate due to clear effects of different delivery styles (-)
- Availability of voice talent for future evaluations (?)
33. References
- A. Black and K. Tokuda, "The Blizzard Challenge 2005: Evaluating corpus-based speech synthesis on common datasets," to appear in Interspeech 2005, Lisbon, Portugal, 2005.
- C. Bennett, "Large Scale Evaluation of Corpus-based Synthesizers: Results and Lessons from the Blizzard Challenge 2005," to appear in Interspeech 2005, Lisbon, Portugal, 2005.
- C. Bennett et al., journal paper: more info to come